Recent advances in end-to-end automatic speech recognition
J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …
End-to-end speech summarization using restricted self-attention
Speech summarization is typically performed by using a cascade of speech recognition and
text summarization models. End-to-end modeling of speech summarization models is …
Advanced long-content speech recognition with factorized neural transducer
Long-content automatic speech recognition (ASR) has obtained increasing interest in recent
years, as it captures the relationship among consecutive historical utterances while …
Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation
Automatic Speech Recognition (ASR) in conversational settings presents unique
challenges, including extracting relevant contextual information from previous …
Context-aware end-to-end ASR using self-attentive embedding and tensor fusion
Typical automatic speech recognition (ASR) systems are built to recognize independent
utterances without using the cross-utterance context. However, the context over multiple …
Towards effective and compact contextual representation for conformer transducer speech recognition systems
Current ASR systems are mainly trained and evaluated at the utterance level. Long range
cross utterance context can be incorporated. A key task is to derive a suitable compact …
LongFNT: Long-form speech recognition with factorized neural transducer
Traditional automatic speech recognition (ASR) systems usually focus on individual
utterances, without considering long-form speech with useful historical information, which is …
Context-aware fine-tuning of self-supervised speech models
Self-supervised pre-trained transformers have improved the state of the art on a variety of
speech tasks. Due to the quadratic time and space complexity of self-attention, they usually …
Leveraging acoustic contextual representation by audio-textual cross-modal learning for conversational ASR
Leveraging context information is an intuitive idea to improve performance on
conversational automatic speech recognition (ASR). Previous works usually adopt …
BASS: Block-wise adaptation for speech summarization
End-to-end speech summarization has been shown to improve performance over cascade
baselines. However, such models are difficult to train on very large inputs (dozens of …