Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

Large-scale ASR domain adaptation using self- and semi-supervised learning

D Hwang, A Misra, Z Huo, N Siddhartha… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Self- and semi-supervised learning methods have been actively investigated to reduce
labeled training data or enhance model performance. However, these approaches mostly …

Pseudo label is better than human label

D Hwang, KC Sim, Z Huo, T Strohman - arXiv preprint arXiv:2203.12668, 2022 - arxiv.org
State-of-the-art automatic speech recognition (ASR) systems are trained with tens of
thousands of hours of labeled speech data. Human transcription is expensive and time …

Improving the latency and quality of cascaded encoders

TN Sainath, Y He, A Narayanan… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
In this paper, we explore reducing computational latency of the 2-pass cascaded encoder
model [1]. Specifically, we experiment with reducing the size of the causal 1st-pass and …

On addressing practical challenges for RNN-transducer

R Zhao, J Xue, J Li, W Wei, L He… - 2021 IEEE Automatic …, 2021 - ieeexplore.ieee.org
In this paper, several works are proposed to address practical challenges for deploying
RNN Transducer (RNN-T) based speech recognition systems. These challenges are …

ASR and emotional speech: A word-level investigation of the mutual impact of speech and emotion recognition

Y Li, Z Zhao, O Klejch, P Bell, C Lai - arXiv preprint arXiv:2305.16065, 2023 - arxiv.org
In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to
address their inherent variability. However, the reliance on human annotated text in most …

ASR rescoring and confidence estimation with ELECTRA

H Futami, H Inaguma, M Mimura… - 2021 IEEE Automatic …, 2021 - ieeexplore.ieee.org
In automatic speech recognition (ASR) rescoring, the hypothesis with the fewest errors
should be selected from the n-best list using a language model (LM). However, LMs are …

ETEH: Unified attention-based end-to-end ASR and KWS architecture

G Cheng, H Miao, R Yang, K Deng… - IEEE/ACM Transactions …, 2022 - ieeexplore.ieee.org
Even though attention-based end-to-end (E2E) automatic speech recognition (ASR) models
have been yielding state-of-the-art recognition accuracy, they still fall behind many of the …

Residual energy-based models for end-to-end speech recognition

Q Li, Y Zhang, B Li, L Cao, PC Woodland - arXiv preprint arXiv …, 2021 - arxiv.org
End-to-end models with auto-regressive decoders have shown impressive results for
automatic speech recognition (ASR). These models formulate the sequence-level probability …

Multi-task learning for end-to-end ASR word and utterance confidence with deletion prediction

D Qiu, Y He, Q Li, Y Zhang, L Cao, I McGraw - arXiv preprint arXiv …, 2021 - arxiv.org
Confidence scores are very useful for downstream applications of automatic speech
recognition (ASR) systems. Recent works have proposed using neural networks to learn …