Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

End-to-end speech recognition: A survey

R Prabhavalkar, T Hori, TN Sainath… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
In the last decade of automatic speech recognition (ASR) research, the introduction of deep
learning has brought considerable reductions in word error rate of more than 50% relative …

Advancing RNN transducer technology for speech recognition

G Saon, Z Tüske, D Bolanos… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
We investigate a set of techniques for RNN Transducers (RNN-Ts) that were instrumental in
lowering the word error rate on three different tasks (Switchboard 300 hours, conversational …

Diagonal state space augmented transformers for speech recognition

G Saon, A Gupta, X Cui - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
We improve on the popular conformer architecture by replacing the depthwise temporal
convolutions with diagonal state space (DSS) models. DSS is a recently introduced variant …

The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

S Watanabe, F Boyer, X Chang, P Guo… - 2021 IEEE Data …, 2021 - ieeexplore.ieee.org
This paper describes the recent development of ESPnet (https://github.com/espnet/espnet),
an end-to-end speech processing toolkit. This project was initiated in December 2017 to …

Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-End Speech Recognition

G Kurata, G Saon - Interspeech, 2020 - interspeech2020.org
End-to-end training of recurrent neural network transducers (RNN-Ts) does not require
frame-level alignments between audio and output symbols. Because of that, the posterior …

A new training pipeline for an improved neural transducer

A Zeyer, A Merboldt, R Schlüter, H Ney - arXiv preprint arXiv:2005.09319, 2020 - arxiv.org
The RNN transducer is a promising end-to-end model candidate. We compare the original
training criterion with the full marginalization over all alignments, to the commonly used …

Streaming Transformer ASR with blockwise synchronous beam search

E Tsunoo, Y Kashiwagi… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
The Transformer self-attention network has shown promising performance as an alternative
to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) …

A study of transducer based end-to-end ASR with ESPnet: Architecture, auxiliary loss and decoding strategies

F Boyer, Y Shinohara, T Ishii… - 2021 IEEE Automatic …, 2021 - ieeexplore.ieee.org
In this study, we present recent developments of models trained with the RNN-T loss in
ESPnet. It involves the use of various architectures such as recently proposed Conformer …

ESPnet-ST-v2: Multipurpose spoken language translation toolkit

B Yan, J Shi, Y Tang, H Inaguma, Y Peng… - arXiv preprint arXiv …, 2023 - arxiv.org
ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the
broadening interests of the spoken language translation community. ESPnet-ST-v2 supports …