A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

A practical survey on faster and lighter transformers

Q Fournier, GM Caron, D Aloise - ACM Computing Surveys, 2023 - dl.acm.org
Recurrent neural networks are effective models to process sequences. However, they are
unable to learn long-term dependencies because of their inherent sequential nature. As a …

Visual speech recognition for multiple languages in the wild

P Ma, S Petridis, M Pantic - Nature Machine Intelligence, 2022 - nature.com
Visual speech recognition (VSR) aims to recognize the content of speech based on lip
movements, without relying on the audio stream. Advances in deep learning and the …

Squeezeformer: An efficient transformer for automatic speech recognition

S Kim, A Gholami, A Shaw, N Lee… - Advances in …, 2022 - proceedings.neurips.cc
The recently proposed Conformer model has become the de facto backbone model for
various downstream speech tasks based on its hybrid attention-convolution architecture that …
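To make the hybrid attention-convolution idea concrete, below is a minimal PyTorch sketch of a Conformer-style block: a half-step feed-forward residual, multi-head self-attention for global context, and a depthwise temporal convolution for local context. The module names, dimensions, kernel size, and exact ordering are illustrative assumptions and do not reproduce the Squeezeformer design.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Conformer-style block: feed-forward, self-attention, then local convolution."""
    def __init__(self, d_model=256, n_heads=4, conv_kernel=31, ff_mult=4):
        super().__init__()
        self.ff = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model),
            nn.SiLU(),
            nn.Linear(ff_mult * d_model, d_model),
        )
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        # depthwise convolution over time captures local acoustic context
        self.dw_conv = nn.Conv1d(d_model, d_model, conv_kernel,
                                 padding=conv_kernel // 2, groups=d_model)
        self.conv_act = nn.SiLU()
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, time, d_model)
        x = x + 0.5 * self.ff(x)                 # half-step feed-forward residual
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]        # global context
        c = self.conv_norm(x).transpose(1, 2)    # (batch, d_model, time)
        x = x + self.conv_act(self.dw_conv(c)).transpose(1, 2)   # local context
        return self.out_norm(x)

x = torch.randn(2, 100, 256)                     # (batch, frames, features)
print(HybridBlock()(x).shape)                    # torch.Size([2, 100, 256])
```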

Intermediate loss regularization for CTC-based speech recognition

J Lee, S Watanabe - ICASSP 2021-2021 IEEE International …, 2021 - ieeexplore.ieee.org
We present a simple and efficient auxiliary loss function for automatic speech recognition
(ASR) based on the connectionist temporal classification (CTC) objective. The proposed …
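The intermediate-loss idea can be sketched in a few lines, assuming an encoder that exposes logits from both an intermediate layer and the final layer; an auxiliary CTC loss on the intermediate layer is interpolated with the final-layer CTC loss. The helper name `intermediate_ctc_loss`, the choice of intermediate layer, and the weight `aux_weight` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def intermediate_ctc_loss(final_logits, inter_logits, targets,
                          input_lengths, target_lengths, aux_weight=0.3):
    """Interpolate the final-layer CTC loss with an auxiliary intermediate one.

    final_logits, inter_logits: (time, batch, vocab) raw encoder outputs.
    targets: (batch, max_target_len) label indices (index 0 reserved for blank).
    """
    loss_final = ctc(F.log_softmax(final_logits, dim=-1),
                     targets, input_lengths, target_lengths)
    loss_inter = ctc(F.log_softmax(inter_logits, dim=-1),
                     targets, input_lengths, target_lengths)
    return (1.0 - aux_weight) * loss_final + aux_weight * loss_inter

# Toy usage with random tensors standing in for encoder outputs.
T, B, V, U = 50, 2, 30, 12
final_logits = torch.randn(T, B, V)
inter_logits = torch.randn(T, B, V)
targets = torch.randint(1, V, (B, U))
loss = intermediate_ctc_loss(final_logits, inter_logits, targets,
                             torch.full((B,), T), torch.full((B,), U))
```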

Contextualized streaming end-to-end speech recognition with trie-based deep biasing and shallow fusion

D Le, M Jain, G Keren, S Kim, Y Shi… - arXiv preprint arXiv …, 2021 - arxiv.org
How to leverage dynamic contextual information in end-to-end speech recognition has
remained an active research area. Previous solutions to this problem were either designed …
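A rough sketch of the mechanism, under simplifying assumptions: biasing phrases are stored in a trie over token IDs, and during beam search a hypothesis whose latest token continues a trie path receives a bonus on top of the usual shallow-fusion combination of ASR and external language-model log-probabilities. The `BiasTrie` layout, `lm_weight`, and `bias_bonus` values are hypothetical, not the paper's exact scoring scheme.

```python
class BiasTrie:
    """Trie over token IDs holding contextual biasing phrases."""
    def __init__(self):
        self.children, self.is_end = {}, False

    def add(self, tokens):
        node = self
        for t in tokens:
            node = node.children.setdefault(t, BiasTrie())
        node.is_end = True

def advance(node, token, root):
    """Follow `token` in the trie; fall back to the root when the phrase breaks."""
    if node is not None and token in node.children:
        return node.children[token]
    return root.children.get(token)

def fused_score(asr_logprob, lm_logprob, on_bias_phrase,
                lm_weight=0.3, bias_bonus=2.0):
    """Shallow fusion of ASR and LM scores plus a bonus for biasing matches."""
    score = asr_logprob + lm_weight * lm_logprob
    if on_bias_phrase:
        score += bias_bonus
    return score

# Toy usage: bias decoding towards the sub-word IDs of a contact name.
root = BiasTrie()
root.add([17, 42, 8])
state = advance(None, 17, root)                  # hypothesis emits token 17
print(fused_score(-1.2, -2.5, state is not None))
```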

Deep shallow fusion for RNN-T personalization

D Le, G Keren, J Chan, J Mahadeokar… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in
particular, have gained significant traction in the automatic speech recognition community in …

Towards measuring fairness in speech recognition: Casual conversations dataset transcriptions

C Liu, M Picheny, L Sarı, P Chitkara… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
The problem of machine learning systems demonstrating bias towards specific groups of
individuals has been studied extensively, particularly in the Facial Recognition area, but …

A study of transducer based end-to-end ASR with ESPnet: Architecture, auxiliary loss and decoding strategies

F Boyer, Y Shinohara, T Ishii… - 2021 IEEE Automatic …, 2021 - ieeexplore.ieee.org
In this study, we present recent developments of models trained with the RNN-T loss in
ESPnet. It involves the use of various architectures such as recently proposed Conformer …
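As background for the transducer setup, here is a minimal sketch of an RNN-T joint network and loss using torchaudio's `rnnt_loss`; it is not the ESPnet implementation, and the `Joiner` module, dimensions, and vocabulary size are stand-in assumptions, with random tensors in place of real encoder and prediction-network outputs.

```python
import torch
import torch.nn as nn
import torchaudio.functional as AF

class Joiner(nn.Module):
    """Combine encoder frames and prediction-network states into joint logits."""
    def __init__(self, enc_dim=256, pred_dim=256, joint_dim=320, vocab=500):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab)

    def forward(self, enc, pred):
        # enc: (batch, T, enc_dim); pred: (batch, U + 1, pred_dim)
        joint = torch.tanh(self.enc_proj(enc).unsqueeze(2)
                           + self.pred_proj(pred).unsqueeze(1))
        return self.out(joint)                   # (batch, T, U + 1, vocab)

B, T, U, V = 2, 50, 10, 500
joiner = Joiner(vocab=V)
enc = torch.randn(B, T, 256)                     # stand-in encoder output
pred = torch.randn(B, U + 1, 256)                # stand-in prediction-network output
logits = joiner(enc, pred)
targets = torch.randint(1, V, (B, U), dtype=torch.int32)
loss = AF.rnnt_loss(logits, targets,
                    logit_lengths=torch.full((B,), T, dtype=torch.int32),
                    target_lengths=torch.full((B,), U, dtype=torch.int32),
                    blank=0)
```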

Multi-head state space model for speech recognition

Y Fathullah, C Wu, Y Shangguan, J Jia… - arXiv preprint arXiv …, 2023 - arxiv.org
State space models (SSMs) have recently shown promising results on small-scale sequence
and language modelling tasks, rivalling and outperforming many attention-based …
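To illustrate the flavor of a multi-head state-space layer, the sketch below runs an independent diagonal linear SSM recurrence per head over disjoint channel groups. The parameterization, initialization, and the sequential scan (rather than a parallel one) are illustrative simplifications, not the paper's formulation.

```python
import torch
import torch.nn as nn

class MultiHeadSSM(nn.Module):
    """Block-diagonal linear SSM: each head runs its own diagonal recurrence."""
    def __init__(self, d_model=256, n_heads=4, d_state=16):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.dh, self.n = n_heads, d_model // n_heads, d_state
        self.log_a = nn.Parameter(torch.zeros(n_heads, d_state))  # decay params
        self.B = nn.Parameter(torch.randn(n_heads, self.dh, d_state) * 0.02)
        self.C = nn.Parameter(torch.randn(n_heads, d_state, self.dh) * 0.02)

    def forward(self, u):                        # u: (batch, time, d_model)
        b, t, _ = u.shape
        u = u.reshape(b, t, self.h, self.dh)
        a = torch.exp(-torch.exp(self.log_a))    # per-state decay in (0, 1)
        x = u.new_zeros(b, self.h, self.n)       # hidden state per head
        ys = []
        for step in range(t):                    # sequential scan for clarity
            x = a * x + torch.einsum('bhd,hdn->bhn', u[:, step], self.B)
            ys.append(torch.einsum('bhn,hnd->bhd', x, self.C))
        return torch.stack(ys, dim=1).reshape(b, t, -1)

y = MultiHeadSSM()(torch.randn(2, 100, 256))
print(y.shape)                                   # torch.Size([2, 100, 256])
```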