Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

Time-synchronous one-pass beam search for parallel online and offline transducers with dynamic block training

Y Sudo, M Shakeel, Y Peng… - Proc. INTERSPEECH …, 2023 - researchgate.net
End-to-end automatic speech recognition (ASR) has become an increasingly popular area
of research, with two main models being online and offline ASR. Online models aim to …

A unified cascaded encoder ASR model for dynamic model sizes

S Ding, W Wang, D Zhao, TN Sainath, Y He… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition
(ASR) model, which unifies models for different deployment scenarios. Moreover, the model …

I3D: Transformer architectures with input-dependent dynamic depth for speech recognition

Y Peng, J Lee, S Watanabe - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Transformer-based end-to-end speech recognition has achieved great success. However,
the large footprint and computational overhead make it difficult to deploy these models in …

Streaming parallel transducer beam search with fast-slow cascaded encoders

J Mahadeokar, Y Shi, K Li, D Le, J Zhu… - arXiv preprint arXiv …, 2022 - arxiv.org
Streaming ASR with strict latency constraints is required in many speech recognition
applications. In order to achieve the required latency, streaming ASR models sacrifice …

Compute cost amortized transformer for streaming ASR

Y Xie, J Macoskey, M Radfar, FJ Chang, B King… - arXiv preprint arXiv …, 2022 - arxiv.org
We present a streaming, Transformer-based end-to-end automatic speech recognition
(ASR) architecture which achieves efficient neural inference through compute cost …

Gated contextual adapters for selective contextual biasing in neural transducers

A Alexandridis, KM Sathyendra… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Neural contextual biasing for end-to-end neural ASR transducers has shown significant
improvements in the recognition of named entities, such as contact names or device names …

Label-Synchronous Neural Transducer for Adaptable Online E2E Speech Recognition

K Deng, PC Woodland - IEEE/ACM Transactions on Audio …, 2024 - ieeexplore.ieee.org
Although end-to-end (E2E) automatic speech recognition (ASR) has shown state-of-the-art
recognition accuracy, it tends to be implicitly biased towards the training data distribution …

Caching networks: Capitalizing on common speech for ASR

A Alexandridis, GP Strimel, A Rastrow… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
We introduce Caching Networks (CachingNets), a speech recognition network architecture
capable of delivering faster, more accurate decoding by leveraging common speech …

Dynamic Encoder Size Based on Data-Driven Layer-wise Pruning for Speech Recognition

J Xu, W Zhou, Z Yang, E Beck, R Schlüter - arXiv preprint arXiv …, 2024 - arxiv.org
Varying-size models are often required to deploy ASR systems under different hardware
and/or application constraints such as memory and latency. To avoid redundant training and …