Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

Acoustic modeling based on deep learning for low-resource speech recognition: An overview

C Yu, M Kang, Y Chen, J Wu, X Zhao - IEEE Access, 2020 - ieeexplore.ieee.org
The polarization of world languages is becoming more and more obvious. Many languages,
mainly endangered languages, are of low-resource attribute due to lack of information. Both …

Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling

J Cho, MK Baskar, R Li, M Wiesner… - 2018 IEEE Spoken …, 2018 - ieeexplore.ieee.org
Sequence-to-sequence (seq2seq) approach for low-resource ASR is a relatively new
direction in speech research. The approach benefits by performing model training without …

Deep lip reading: a comparison of models and an online application

T Afouras, JS Chung, A Zisserman - arXiv preprint arXiv:1806.06053, 2018 - arxiv.org
The goal of this paper is to develop state-of-the-art models for lip reading--visual speech
recognition. We develop three architectures and compare their accuracy and training …

MixSpeech: Cross-modality self-learning with audio-visual stream mixup for visual speech translation and recognition

X Cheng, T Jin, R Huang, L Li, W Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multi-media communications facilitate global interaction among people. However, despite
researchers exploring cross-lingual translation techniques such as machine translation and …

Attention-based end-to-end models for small-footprint keyword spotting

C Shan, J Zhang, Y Wang, L Xie - arXiv preprint arXiv:1803.10916, 2018 - arxiv.org
In this paper, we propose an attention-based end-to-end neural approach for small-footprint
keyword spotting (KWS), which aims to simplify the pipelines of building a production-quality …

Streaming small-footprint keyword spotting using sequence-to-sequence models

Y He, R Prabhavalkar, K Rao, W Li… - 2017 IEEE Automatic …, 2017 - ieeexplore.ieee.org
We develop streaming keyword spotting systems using a recurrent neural network
transducer (RNN-T) model: an all-neural, end-to-end trained, sequence-to-sequence model …

Seeing wake words: Audio-visual keyword spotting

L Momeni, T Afouras, T Stafylakis, S Albanie… - arXiv preprint arXiv …, 2020 - arxiv.org
The goal of this work is to automatically determine whether and when a word of interest is
spoken by a talking face, with or without the audio. We propose a zero-shot method suitable …

End-to-end speech recognition from federated acoustic models

Y Gao, T Parcollet, S Zaiem… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Training Automatic Speech Recognition (ASR) models under federated learning (FL)
settings has attracted a lot of attention recently. However, the FL scenarios often presented …

Language-agnostic multilingual modeling

A Datta, B Ramabhadran, J Emond… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
Multilingual Automated Speech Recognition (ASR) systems allow for the joint training of
data-rich and data-scarce languages in a single model. This enables data and parameter …