An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word...

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com

Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

Cited by 326 · Related articles · All 7 versions

End-to-end speech recognition: A survey

R Prabhavalkar, T Hori, TN Sainath… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org

In the last decade of automatic speech recognition (ASR) research, the introduction of deep
learning has brought considerable reductions in word error rate of more than 50% relative …

Cited by 80 · Related articles · All 6 versions

JOIST: A joint speech and text streaming model for ASR

TN Sainath, R Prabhavalkar, A Bapna… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org

We present JOIST, an algorithm to train a streaming, cascaded, encoder end-to-end (E2E)
model with both speech-text paired inputs, and text-only unpaired inputs. Unlike previous …

Cited by 29 · Related articles · All 3 versions

Understanding automatic speech recognition

D O'Shaughnessy - Computer Speech & Language, 2023 - Elsevier

This paper discusses how automatic speech recognition systems are and could be
designed, in order to best exploit the discriminative information encoded in human speech …

Cited by 2 · Related articles

Improving the latency and quality of cascaded encoders

TN Sainath, Y He, A Narayanan… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org

In this paper, we explore reducing computational latency of the 2-pass cascaded encoder
model [1]. Specifically, we experiment with reducing the size of the causal 1st-pass and …

Cited by 26 · Related articles

Injecting text in self-supervised speech pretraining

Z Chen, Y Zhang, A Rosenberg… - 2021 IEEE Automatic …, 2021 - ieeexplore.ieee.org

Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied
degrees of success. In this paper, we propose to jointly learn representations during …

Cited by 34 · Related articles · All 3 versions

4-bit conformer with native quantization aware training for speech recognition

S Ding, P Meadowlark, Y He, L Lew, S Agrawal… - arXiv preprint arXiv …, 2022 - arxiv.org

Reducing the latency and model size has always been a significant research problem for
live Automatic Speech Recognition (ASR) application scenarios. Along this direction, model …

Cited by 23 · Related articles · All 7 versions

Turn-taking prediction for natural conversational speech

S Chang, B Li, TN Sainath, C Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org

While a streaming voice assistant system has been used in many applications, this system
typically focuses on unnatural, one-shot interactions assuming input from a single voice …

Cited by 17 · Related articles · All 4 versions

E2E segmenter: Joint segmenting and decoding for long-form ASR

WR Huang, S Chang, D Rybach… - arXiv preprint arXiv …, 2022 - arxiv.org

Improving the performance of end-to-end ASR models on long utterances ranging from
minutes to hours in length is an ongoing challenge in speech recognition. A common …

Cited by 18 · Related articles · All 7 versions

Large-scale language model rescoring on long-form data

T Chen, C Allauzen, Y Huang, D Park… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

In this work, we study the impact of Large-scale Language Models (LLM) on Automated
Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form …

Cited by 11 · Related articles · All 3 versions