The Kaldi speech recognition toolkit

Recent developments on ESPnet toolkit boosted by Conformer

P Guo, F Boyer, X Chang, T Hayashi… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org

In this study, we present recent developments on ESPnet: End-to-End Speech Processing
toolkit, which mainly involves a recently proposed architecture called Conformer …

Cited by 299 · Related articles · All 8 versions

Self-attentive speaker embeddings for text-independent speaker verification.

Y Zhu, T Ko, D Snyder, B Mak, D Povey - Interspeech, 2018 - pianshen.com

This paper introduces a new method to extract speaker embeddings from a deep
neural network (DNN) for text-independent speaker verification. Usually, speaker …

Cited by 309 · Related articles · All 15 versions

Split computing and early exiting for deep learning applications: Survey and research challenges

Y Matsubara, M Levorato, F Restuccia - ACM Computing Surveys, 2022 - dl.acm.org

Mobile devices such as smartphones and autonomous vehicles increasingly rely on deep
neural networks (DNNs) to execute complex inference tasks such as image classification …

Cited by 215 · Related articles · All 5 versions

Scaling speech technology to 1,000+ languages

V Pratap, A Tjandra, B Shi, P Tomasello, A Babu… - Journal of Machine …, 2024 - jmlr.org

Expanding the language coverage of speech technology has the potential to improve
access to information for many more people. However, current speech technology is …

Cited by 262 · Related articles · All 3 versions

Neural codec language models are zero-shot text to speech synthesizers

C Wang, S Chen, Y Wu, Z Zhang, L Zhou, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called Vall-E) using discrete codes derived from …

Cited by 570 · Related articles · All 3 versions

A high-performance speech neuroprosthesis

FR Willett, EM Kunz, C Fan, DT Avansino, GH Wilson… - Nature, 2023 - nature.com

Speech brain–computer interfaces (BCIs) have the potential to restore rapid communication
to people with paralysis by decoding neural activity evoked by attempted speech into text, or …

Cited by 270 · Related articles · All 16 versions

Voicebox: Text-guided multilingual universal speech generation at scale

M Le, A Vyas, B Shi, B Karrer, L Sari… - Advances in neural …, 2024 - proceedings.neurips.cc

Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …

Cited by 219 · Related articles · All 8 versions

Ego4d: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

Cited by 938 · Related articles · All 13 versions

Masked autoencoders that listen

PY Huang, H Xu, J Li, A Baevski… - Advances in …, 2022 - proceedings.neurips.cc

This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-
supervised representation learning from audio spectrograms. Following the Transformer …

Cited by 233 · Related articles · All 5 versions

SpeechBrain: A general-purpose speech toolkit

M Ravanelli, T Parcollet, P Plantinga, A Rouhe… - arXiv preprint arXiv …, 2021 - arxiv.org

SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the
research and development of neural speech processing technologies by being simple …

Cited by 713 · Related articles · All 5 versions