Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

How might we create better benchmarks for speech recognition?

A Aksënova, D van Esch, J Flynn… - Proceedings of the 1st …, 2021 - aclanthology.org
The applications of automatic speech recognition (ASR) systems are proliferating, in part
due to recent significant quality improvements. However, as recent work indicates, even …

ZeroPrompt: Streaming acoustic encoders are zero-shot masked LMs

X Song, D Wu, B Zhang, Z Peng, B Dang, F Pan… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we present ZeroPrompt (Figure 1-(a)) and the corresponding Prompt-and-Refine
strategy (Figure 3), two simple but effective training-free methods to decrease …

Extreme encoder output frame rate reduction: Improving computational latencies of large end-to-end models

R Prabhavalkar, Z Meng, W Wang… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models continues to
improve as they are scaled to larger sizes, with some now reaching billions of parameters …

The timing bottleneck: Why timing and overlap are mission-critical for conversational user interfaces, speech recognition and dialogue systems

A Liesenfeld, A Lopez, M Dingemanse - arXiv preprint arXiv:2307.15493, 2023 - arxiv.org
Speech recognition systems are a key intermediary in voice-driven human-computer
interaction. Although speech recognition works well for pristine monologic audio, real-life …

A CIF-based speech segmentation method for streaming E2E ASR

Y Shu, H Luo, S Zhang, L Wang… - IEEE Signal Processing …, 2023 - ieeexplore.ieee.org
Long utterances segmentation is crucial in end-to-end (E2E) streaming automatic speech
recognition (ASR). However, commonly used voice activity detection (VAD)-based and fixed …

Unified end-to-end speech recognition and endpointing for fast and efficient speech systems

S Bijwadia, S Chang, B Li, T Sainath… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP)
model to identify speech boundaries. In this work, we propose a method to jointly train the …

Fast-U2++: Fast and accurate end-to-end speech recognition in joint CTC/attention frames

C Liang, XL Zhang, BB Zhang, D Wu… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Recently, the unified streaming and non-streaming two-pass (U2/U2++) end-to-end model
for speech recognition has shown great performance in terms of streaming capability …

E2E segmentation in a two-pass cascaded encoder ASR model

WR Huang, SY Chang, TN Sainath… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single
model. A key challenge is allowing the segmenter (which runs in real-time, synchronously …

Alignment knowledge distillation for online streaming attention-based speech recognition

H Inaguma, T Kawahara - IEEE/ACM Transactions on Audio …, 2021 - ieeexplore.ieee.org
This article describes an efficient training method for online streaming attention-based
encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have …