Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

MIMO-Speech: End-to-end multi-channel multi-speaker speech recognition

X Chang, W Zhang, Y Qian, J Le Roux… - 2019 IEEE Automatic …, 2019 - ieeexplore.ieee.org
Recently, the end-to-end approach has proven its efficacy in monaural multi-speaker speech
recognition. However, high word error rates (WERs) still prevent these systems from being …

Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning

L Mošner, M Wu, A Raju… - ICASSP 2019-2019 …, 2019 - ieeexplore.ieee.org
For real-world speech recognition applications, noise robustness is still a challenge. In this
work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy …

End-to-end multi-channel transformer for speech recognition

FJ Chang, M Radfar, A Mouchtaris… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
Transformers are powerful neural architectures that allow integrating different modalities
using attention mechanisms. In this paper, we leverage the neural transformer architectures …

Deep neural network-based generalized sidelobe canceller for dual-channel far-field speech recognition

G Li, S Liang, S Nie, W Liu, Z Yang - Neural Networks, 2021 - Elsevier
The traditional generalized sidelobe canceller (GSC) is a common speech enhancement
front end to improve the noise robustness of automatic speech recognition (ASR) systems in …

Human listening and live captioning: Multi-task training for speech enhancement

SE Eskimez, X Wang, M Tang, H Yang, Z Zhu… - arXiv preprint arXiv …, 2021 - arxiv.org
With the surge of online meetings, it has become more critical than ever to provide high-
quality speech audio and live captioning under various noise conditions. However, most …

GAN-Based Data Generation for Speech Emotion Recognition

SE Eskimez, D Dimitriadis, R Gmyr… - …, 2020 - interspeech2020.org
In this work, we propose a GAN-based method to generate synthetic data for speech
emotion recognition. Specifically, we investigate the usage of GANs for capturing the data …

An end-to-end architecture of online multi-channel speech separation

J Wu, Z Chen, J Li, T Yoshioka, Z Tan, E Lin… - arXiv preprint arXiv …, 2020 - arxiv.org
Multi-speaker speech recognition has been one of the key challenges in conversation
transcription as it breaks the single active speaker assumption employed by most state-of-the …

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Q Zhu, J Zhang, Y Gu, Y Hu, L Dai - … of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Self-supervised speech pre-training methods have developed rapidly in recent years, which
show to be very effective for many near-field single-channel speech tasks. However, far-field …

Self-attention channel combinator frontend for end-to-end multichannel far-field speech recognition

R Gong, C Quillen, D Sharma, A Goderre… - arXiv preprint arXiv …, 2021 - arxiv.org
When a sufficiently large far-field training data is presented, jointly optimizing a multichannel
frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows …