Recent advances in end-to-end automatic speech recognition
J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …
End-to-end speech recognition: A survey
In the last decade of automatic speech recognition (ASR) research, the introduction of deep
learning has brought considerable reductions in word error rate of more than 50% relative …
JOIST: A joint speech and text streaming model for ASR
We present JOIST, an algorithm to train a streaming, cascaded, encoder end-to-end (E2E)
model with both speech-text paired inputs, and text-only unpaired inputs. Unlike previous …
Understanding automatic speech recognition
D O'Shaughnessy - Computer Speech & Language, 2023 - Elsevier
This paper discusses how automatic speech recognition systems are and could be
designed, in order to best exploit the discriminative information encoded in human speech …
Improving the latency and quality of cascaded encoders
In this paper, we explore reducing computational latency of the 2-pass cascaded encoder
model [1]. Specifically, we experiment with reducing the size of the causal 1st-pass and …
Injecting text in self-supervised speech pretraining
Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied
degrees of success. In this paper, we propose to jointly learn representations during …
4-bit conformer with native quantization aware training for speech recognition
Reducing the latency and model size has always been a significant research problem for
live Automatic Speech Recognition (ASR) application scenarios. Along this direction, model …
Turn-taking prediction for natural conversational speech
While a streaming voice assistant system has been used in many applications, this system
typically focuses on unnatural, one-shot interactions assuming input from a single voice …
E2E Segmenter: Joint segmenting and decoding for long-form ASR
Improving the performance of end-to-end ASR models on long utterances ranging from
minutes to hours in length is an ongoing challenge in speech recognition. A common …
Large-scale language model rescoring on long-form data
In this work, we study the impact of Large-scale Language Models (LLM) on Automated
Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form …