Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

End-to-end speech summarization using restricted self-attention

R Sharma, S Palaskar, AW Black… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Speech summarization is typically performed by using a cascade of speech recognition and
text summarization models. End-to-end modeling of speech summarization models is …

Advanced long-content speech recognition with factorized neural transducer

X Gong, Y Wu, J Li, S Liu, R Zhao… - … /ACM Transactions on …, 2024 - ieeexplore.ieee.org
Long-content automatic speech recognition (ASR) has obtained increasing interest in recent
years, as it captures the relationship among consecutive historical utterances while …

Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

K Wei, B Li, H Lv, Q Lu, N Jiang… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
Automatic Speech Recognition (ASR) in conversational settings presents unique
challenges, including extracting relevant contextual information from previous …

Context-aware end-to-end ASR using self-attentive embedding and tensor fusion

SY Chang, C Zhang, TN Sainath, B Li… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Typical automatic speech recognition (ASR) systems are built to recognize independent
utterances without using the cross-utterance context. However, the context over multiple …

Towards effective and compact contextual representation for conformer transducer speech recognition systems

M Cui, J Kang, J Deng, X Yin, Y Xie, X Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Current ASR systems are mainly trained and evaluated at the utterance level. Long range
cross utterance context can be incorporated. A key task is to derive a suitable compact …

LongFNT: Long-form speech recognition with factorized neural transducer

X Gong, Y Wu, J Li, S Liu, R Zhao… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Traditional automatic speech recognition (ASR) systems usually focus on individual
utterances, without considering long-form speech with useful historical information, which is …

Context-aware fine-tuning of self-supervised speech models

S Shon, F Wu, K Kim, P Sridhar… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Self-supervised pre-trained transformers have improved the state of the art on a variety of
speech tasks. Due to the quadratic time and space complexity of self-attention, they usually …

Leveraging acoustic contextual representation by audio-textual cross-modal learning for conversational ASR

K Wei, Y Zhang, S Sun, L Xie, L Ma - arXiv preprint arXiv:2207.01039, 2022 - arxiv.org
Leveraging context information is an intuitive idea to improve performance on
conversational automatic speech recognition (ASR). Previous works usually adopt …

BASS: Block-wise adaptation for speech summarization

R Sharma, K Zheng, S Arora, S Watanabe… - arXiv preprint arXiv …, 2023 - arxiv.org
End-to-end speech summarization has been shown to improve performance over cascade
baselines. However, such models are difficult to train on very large inputs (dozens of …