End-to-end speech recognition: A survey

R Prabhavalkar, T Hori, TN Sainath… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
In the last decade of automatic speech recognition (ASR) research, the introduction of deep
learning has brought considerable reductions in word error rate of more than 50% relative …

ESPnet-ST: All-in-one speech translation toolkit

H Inaguma, S Kiyono, K Duh, S Karita… - arXiv preprint arXiv …, 2020 - arxiv.org
We present ESPnet-ST, which is designed for the quick development of speech-to-speech
translation systems in a single framework. ESPnet-ST is a new project inside end-to-end …

Espresso: A fast end-to-end neural speech recognition toolkit

Y Wang, T Chen, H Xu, S Ding, H Lv… - 2019 IEEE Automatic …, 2019 - ieeexplore.ieee.org
We present Espresso, an open-source, modular, extensible end-to-end neural automatic
speech recognition (ASR) toolkit based on the deep learning library PyTorch and the …

CTC alignments improve autoregressive translation

B Yan, S Dalmia, Y Higuchi, G Neubig, F Metze… - arXiv preprint arXiv …, 2022 - arxiv.org
Connectionist Temporal Classification (CTC) is a widely used approach for automatic
speech recognition (ASR) that performs conditionally independent monotonic alignment …

The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

S Watanabe, F Boyer, X Chang, P Guo… - 2021 IEEE Data …, 2021 - ieeexplore.ieee.org
This paper describes the recent development of ESPnet (https://github.com/espnet/espnet),
an end-to-end speech processing toolkit. This project was initiated in December 2017 to …

Streaming Transformer ASR with blockwise synchronous beam search

E Tsunoo, Y Kashiwagi… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
The Transformer self-attention network has shown promising performance as an alternative
to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) …

Advanced long-context end-to-end speech recognition using context-expanded transformers

T Hori, N Moritz, C Hori, J Le Roux - arXiv preprint arXiv:2104.09426, 2021 - arxiv.org
This paper addresses end-to-end automatic speech recognition (ASR) for long audio
recordings such as lecture and conversational speeches. Most end-to-end ASR models are …

End-to-end automatic speech recognition integrated with CTC-based voice activity detection

T Yoshimura, T Hayashi, K Takeda… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
This paper integrates a voice activity detection (VAD) function with end-to-end automatic
speech recognition toward an online speech interface and transcribing very long audio …

Improving hybrid CTC/attention architecture for agglutinative language speech recognition

Z Ren, N Yolwas, W Slamu, R Cao, H Wang - Sensors, 2022 - mdpi.com
Unlike the traditional model, the end-to-end (E2E) ASR model does not require speech
information such as a pronunciation dictionary, and its system is built through a single neural …

Searchable hidden intermediates for end-to-end models of decomposable sequence tasks

S Dalmia, B Yan, V Raunak, F Metze… - arXiv preprint arXiv …, 2021 - arxiv.org
End-to-end approaches for sequence tasks are becoming increasingly popular. Yet for
complex sequence tasks, like speech translation, systems that cascade several models …