End-to-end speech recognition: A survey

R Prabhavalkar, T Hori, TN Sainath… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
In the last decade of automatic speech recognition (ASR) research, the introduction of deep
learning has brought considerable reductions in word error rate of more than 50% relative …

Distilling the knowledge of BERT for sequence-to-sequence ASR

H Futami, H Inaguma, S Ueno, M Mimura… - arXiv preprint arXiv …, 2020 - arxiv.org
Attention-based sequence-to-sequence (seq2seq) models have achieved promising results
in automatic speech recognition (ASR). However, as these models decode in a left-to-right …

Open source MagicData-RAMC: A rich annotated Mandarin conversational (RAMC) speech dataset

Z Yang, Y Chen, L Luo, R Yang, L Ye, G Cheng… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper introduces a high-quality rich annotated Mandarin conversational (RAMC)
speech dataset called MagicData-RAMC. The MagicData-RAMC corpus contains 180 hours …

Advanced long-context end-to-end speech recognition using context-expanded transformers

T Hori, N Moritz, C Hori, J Le Roux - arXiv preprint arXiv:2104.09426, 2021 - arxiv.org
This paper addresses end-to-end automatic speech recognition (ASR) for long audio
recordings such as lecture and conversational speeches. Most end-to-end ASR models are …

Hierarchical transformer-based large-context end-to-end ASR with large-context knowledge distillation

R Masumura, N Makishima, M Ihori… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
We present a novel large-context end-to-end automatic speech recognition (E2E-ASR)
model and its effective training method based on knowledge distillation. Common E2E-ASR …

End-to-end automatic speech recognition integrated with CTC-based voice activity detection

T Yoshimura, T Hayashi, K Takeda… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
This paper integrates a voice activity detection (VAD) function with end-to-end automatic
speech recognition toward an online speech interface and transcribing very long audio …

Advanced long-content speech recognition with factorized neural transducer

X Gong, Y Wu, J Li, S Liu, R Zhao… - … /ACM Transactions on …, 2024 - ieeexplore.ieee.org
Long-content automatic speech recognition (ASR) has obtained increasing interest in recent
years, as it captures the relationship among consecutive historical utterances while …

Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

K Wei, B Li, H Lv, Q Lu, N Jiang… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
Automatic Speech Recognition (ASR) in conversational settings presents unique
challenges, including extracting relevant contextual information from previous …

Context-aware end-to-end ASR using self-attentive embedding and tensor fusion

SY Chang, C Zhang, TN Sainath, B Li… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Typical automatic speech recognition (ASR) systems are built to recognize independent
utterances without using the cross-utterance context. However, the context over multiple …

Transformer-Based Long-Context End-to-End Speech Recognition.

T Hori, N Moritz, C Hori, J Le Roux - Interspeech, 2020 - isca-archive.org
This paper presents an approach to long-context end-to-end automatic speech recognition
(ASR) using Transformers, aiming at improving ASR accuracy for long audio recordings …