From senones to chenones: Tied context-dependent graphemes for hybrid speech recognition

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units

WN Hsu, B Bolte, YHH Tsai, K Lakhotia… - … ACM transactions on …, 2021 - ieeexplore.ieee.org

Self-supervised approaches for speech representation learning are challenged by three
unique problems: (1) there are multiple sound units in each input utterance, (2) there is no …

Cited by 2081 · Related articles · All 6 versions

[PDF] ieee.org

End-to-end speech recognition: A survey

R Prabhavalkar, T Hori, TN Sainath… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org

In the last decade of automatic speech recognition (ASR) research, the introduction of deep
learning has brought considerable reductions in word error rate of more than 50% relative …

Cited by 80 · Related articles · All 6 versions

HuBERT: How much can a bad teacher benefit ASR pre-training?

WN Hsu, YHH Tsai, B Bolte… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org

Compared to vision and language applications, self-supervised pre-training approaches for
ASR are challenged by three unique problems: (1) There are multiple sound units in each …

Cited by 147 · Related articles · All 2 versions

[PDF] arxiv.org

Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition

Y Shi, Y Wang, C Wu, CF Yeh, J Chan… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org

This paper proposes an efficient memory transformer Emformer for low latency streaming
speech recognition. In Emformer, the long-range history context is distilled into an …

Cited by 162 · Related articles · All 3 versions

[PDF] arxiv.org

Transformer-based acoustic modeling for hybrid speech recognition

Y Wang, A Mohamed, D Le, C Liu… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org

We propose and evaluate transformer-based acoustic models (AMs) for hybrid speech
recognition. Several modeling choices are discussed in this work, including various …

Cited by 253 · Related articles · All 4 versions

[PDF] arxiv.org

Contextualized streaming end-to-end speech recognition with trie-based deep biasing and shallow fusion

D Le, M Jain, G Keren, S Kim, Y Shi… - arXiv preprint arXiv …, 2021 - arxiv.org

How to leverage dynamic contextual information in end-to-end speech recognition has
remained an active research area. Previous solutions to this problem were either designed …

Cited by 73 · Related articles · All 5 versions

[PDF] arxiv.org

Deep shallow fusion for RNN-T personalization

D Le, G Keren, J Chan, J Mahadeokar… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org

End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in
particular, have gained significant traction in the automatic speech recognition community in …

Cited by 73 · Related articles · All 3 versions

[PDF] arxiv.org

Alignment restricted streaming recurrent neural network transducer

J Mahadeokar, Y Shangguan, D Le… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org

There is a growing interest in the speech community in developing Recurrent Neural
Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications …

Cited by 67 · Related articles · All 4 versions

[PDF] arxiv.org

Streaming transformer-based acoustic models using self-attention with augmented memory

C Wu, Y Wang, Y Shi, CF Yeh, F Zhang - arXiv preprint arXiv:2005.08042, 2020 - arxiv.org

Transformer-based acoustic modeling has achieved great success for both hybrid and
sequence-to-sequence speech recognition. However, it requires access to the full …

Cited by 67 · Related articles · All 6 versions

[PDF] arxiv.org

Improving RNN transducer based ASR with auxiliary tasks

C Liu, F Zhang, D Le, S Kim, Y Saraf… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org

End-to-end automatic speech recognition (ASR) models with a single neural network have
recently demonstrated state-of-the-art results compared to conventional hybrid speech …

Cited by 46 · Related articles · All 3 versions