HuBERT: Self-supervised speech representation learning by masked prediction of hidden units

WN Hsu, B Bolte, YHH Tsai, K Lakhotia… - … ACM transactions on …, 2021 - ieeexplore.ieee.org
Self-supervised approaches for speech representation learning are challenged by three
unique problems: (1) there are multiple sound units in each input utterance, (2) there is no …

End-to-end speech recognition: A survey

R Prabhavalkar, T Hori, TN Sainath… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
In the last decade of automatic speech recognition (ASR) research, the introduction of deep
learning has brought considerable reductions in word error rate of more than 50% relative …

HuBERT: How much can a bad teacher benefit ASR pre-training?

WN Hsu, YHH Tsai, B Bolte… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
Compared to vision and language applications, self-supervised pre-training approaches for
ASR are challenged by three unique problems: (1) There are multiple sound units in each …

Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition

Y Shi, Y Wang, C Wu, CF Yeh, J Chan… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
This paper proposes an efficient memory transformer Emformer for low latency streaming
speech recognition. In Emformer, the long-range history context is distilled into an …

Transformer-based acoustic modeling for hybrid speech recognition

Y Wang, A Mohamed, D Le, C Liu… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
We propose and evaluate transformer-based acoustic models (AMs) for hybrid speech
recognition. Several modeling choices are discussed in this work, including various …

Contextualized streaming end-to-end speech recognition with trie-based deep biasing and shallow fusion

D Le, M Jain, G Keren, S Kim, Y Shi… - arXiv preprint arXiv …, 2021 - arxiv.org
How to leverage dynamic contextual information in end-to-end speech recognition has
remained an active research area. Previous solutions to this problem were either designed …

Deep shallow fusion for RNN-T personalization

D Le, G Keren, J Chan, J Mahadeokar… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in
particular, have gained significant traction in the automatic speech recognition community in …

Alignment restricted streaming recurrent neural network transducer

J Mahadeokar, Y Shangguan, D Le… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
There is a growing interest in the speech community in developing Recurrent Neural
Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications …

Streaming transformer-based acoustic models using self-attention with augmented memory

C Wu, Y Wang, Y Shi, CF Yeh, F Zhang - arXiv preprint arXiv:2005.08042, 2020 - arxiv.org
Transformer-based acoustic modeling has achieved great success for both hybrid and
sequence-to-sequence speech recognition. However, it requires access to the full …

Improving RNN transducer based ASR with auxiliary tasks

C Liu, F Zhang, D Le, S Kim, Y Saraf… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
End-to-end automatic speech recognition (ASR) models with a single neural network have
recently demonstrated state-of-the-art results compared to conventional hybrid speech …