Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

A review of speaker diarization: Recent advances with deep learning

TJ Park, N Kanda, D Dimitriadis, KJ Han… - Computer Speech & …, 2022 - Elsevier
Speaker diarization is a task to label audio or video recordings with classes that correspond
to speaker identity, or in short, a task to identify “who spoke when”. In the early years …

Recent progresses in deep learning based acoustic models

D Yu, J Li - IEEE/CAA Journal of Automatica Sinica, 2017 - ieeexplore.ieee.org
In this paper, we summarize recent progresses made in deep learning based acoustic
models and the motivation and insights behind the surveyed techniques. We first discuss …

Serialized output training for end-to-end overlapped speech recognition

N Kanda, Y Gaur, X Wang, Z Meng… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper proposes serialized output training (SOT), a novel framework for multi-speaker
overlapped speech recognition based on an attention-based encoder-decoder approach …

Streaming multi-talker ASR with token-level serialized output training

N Kanda, J Wu, Y Wu, X Xiao, Z Meng, X Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper proposes a token-level serialized output training (t-SOT), a novel framework for
streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi …

End-to-end multi-speaker speech recognition with transformer

X Chang, W Zhang, Y Qian, J Le Roux… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
Recently, fully recurrent neural network (RNN) based end-to-end models have been proven
to be effective for multi-speaker speech recognition in both the single-channel and multi …

Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers

N Kanda, Y Gaur, X Wang, Z Meng, Z Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
We propose an end-to-end speaker-attributed automatic speech recognition model that
unifies speaker counting, speech recognition, and speaker identification on monaural …

Deep extractor network for target speaker recovery from single channel speech mixtures

J Wang, J Chen, D Su, L Chen, M Yu, Y Qian… - arXiv preprint arXiv …, 2018 - arxiv.org
Speaker-aware source separation methods are promising workarounds for major difficulties
such as arbitrary source permutation and unknown number of sources. However, it remains …

Automatic lyrics transcription of polyphonic music with lyrics-chord multi-task learning

X Gao, C Gupta, H Li - IEEE/ACM Transactions on Audio …, 2022 - ieeexplore.ieee.org
Lyrics are the words that make up a song, while chords are harmonic sets of multiple notes
in music. Lyrics and chords are generally essential information in music, i.e., unaccompanied …

Past review, current progress, and challenges ahead on the cocktail party problem

Y Qian, C Weng, X Chang, S Wang, D Yu - Frontiers of Information …, 2018 - Springer
The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker
when multiple speakers talk simultaneously, is one of the critical problems yet to be solved …