Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

End-to-end speech recognition: A survey

R Prabhavalkar, T Hori, TN Sainath… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
In the last decade of automatic speech recognition (ASR) research, the introduction of deep
learning has brought considerable reductions in word error rate of more than 50% relative …

JOIST: A joint speech and text streaming model for ASR

TN Sainath, R Prabhavalkar, A Bapna… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
We present JOIST, an algorithm to train a streaming, cascaded, encoder end-to-end (E2E)
model with both speech-text paired inputs, and text-only unpaired inputs. Unlike previous …

Understanding automatic speech recognition

D O'Shaughnessy - Computer Speech & Language, 2023 - Elsevier
This paper discusses how automatic speech recognition systems are and could be
designed, in order to best exploit the discriminative information encoded in human speech …

Improving the latency and quality of cascaded encoders

TN Sainath, Y He, A Narayanan… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
In this paper, we explore reducing computational latency of the 2-pass cascaded encoder
model [1]. Specifically, we experiment with reducing the size of the causal 1st-pass and …

Injecting text in self-supervised speech pretraining

Z Chen, Y Zhang, A Rosenberg… - 2021 IEEE Automatic …, 2021 - ieeexplore.ieee.org
Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied
degrees of success. In this paper, we propose to jointly learn representations during …

4-bit Conformer with native quantization aware training for speech recognition

S Ding, P Meadowlark, Y He, L Lew, S Agrawal… - arXiv preprint arXiv …, 2022 - arxiv.org
Reducing the latency and model size has always been a significant research problem for
live Automatic Speech Recognition (ASR) application scenarios. Along this direction, model …

Turn-taking prediction for natural conversational speech

S Chang, B Li, TN Sainath, C Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org
While a streaming voice assistant system has been used in many applications, this system
typically focuses on unnatural, one-shot interactions assuming input from a single voice …

E2E segmenter: Joint segmenting and decoding for long-form ASR

WR Huang, S Chang, D Rybach… - arXiv preprint arXiv …, 2022 - arxiv.org
Improving the performance of end-to-end ASR models on long utterances ranging from
minutes to hours in length is an ongoing challenge in speech recognition. A common …

Large-scale language model rescoring on long-form data

T Chen, C Allauzen, Y Huang, D Park… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
In this work, we study the impact of Large-scale Language Models (LLM) on Automated
Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form …