Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community has been seeing a significant trend of moving from deep
neural network-based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

Streaming multi-talker ASR with token-level serialized output training

N Kanda, J Wu, Y Wu, X Xiao, Z Meng, X Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper proposes token-level serialized output training (t-SOT), a novel framework for
streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi …
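
For orientation, a minimal sketch of the serialization idea behind t-SOT is given below, assuming its published description: tokens from at most two overlapping utterances are merged into a single output stream in emission-time order, with a special channel-change token (written <cc> here) marking each switch between virtual output channels. The data layout, token name, and helper function are illustrative assumptions, not the authors' implementation.

    # Illustrative t-SOT-style serialization (structures and names assumed, not
    # taken from the paper's code): merge tokens of overlapping utterances into
    # one stream in emission-time order, inserting "<cc>" on channel switches.
    from dataclasses import dataclass

    @dataclass
    class TimedToken:
        text: str      # subword or word token
        time: float    # emission time in seconds
        channel: int   # virtual output channel (0 or 1 for up to two overlapping utterances)

    def serialize_tsot(tokens):
        """Return one token sequence sorted by time, with <cc> at every channel switch."""
        serialized = []
        current_channel = None
        for tok in sorted(tokens, key=lambda t: t.time):
            if current_channel is not None and tok.channel != current_channel:
                serialized.append("<cc>")
            serialized.append(tok.text)
            current_channel = tok.channel
        return serialized

    # Example: two partially overlapping utterances.
    mixed = [
        TimedToken("hello", 0.2, 0), TimedToken("how", 0.5, 0),
        TimedToken("good", 0.6, 1), TimedToken("are", 0.8, 0),
        TimedToken("morning", 0.9, 1), TimedToken("you", 1.1, 0),
    ]
    print(serialize_tsot(mixed))
    # ['hello', 'how', '<cc>', 'good', '<cc>', 'are', '<cc>', 'morning', '<cc>', 'you']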

On word error rate definitions and their efficient computation for multi-speaker speech recognition systems

T von Neumann, C Boeddeker… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
We propose a general framework to compute the word error rate (WER) of ASR systems that
process recordings containing multiple speakers at their input and that produce multiple …
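
As context, one widely used definition in this space is the concatenated minimum-permutation WER (cpWER): each speaker's words are concatenated into one stream, and errors are counted under the assignment of hypothesis streams to reference streams that minimizes them. The sketch below illustrates only that definition with a brute-force search over assignments; it is not the paper's general framework or its efficient computation, and the dictionary-based input format is an assumption.

    # Illustrative cpWER computation (one common multi-speaker WER definition).
    # Brute-force over speaker assignments; fine for a handful of speakers only.
    from itertools import permutations

    def edit_distance(ref, hyp):
        """Word-level Levenshtein distance between two token lists (single-row DP)."""
        d = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, d[0] = d[0], i
            for j, h in enumerate(hyp, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
        return d[-1]

    def cpwer(ref_by_spk, hyp_by_spk):
        """cpWER: per-speaker word streams scored under the error-minimizing speaker mapping."""
        refs = list(ref_by_spk.values())
        hyps = list(hyp_by_spk.values())
        # Pad with empty streams so every reference stream gets a partner.
        n = max(len(refs), len(hyps))
        refs += [[] for _ in range(n - len(refs))]
        hyps += [[] for _ in range(n - len(hyps))]
        total_ref_words = sum(len(r) for r in refs)
        best_errors = min(
            sum(edit_distance(r, h) for r, h in zip(refs, perm))
            for perm in permutations(hyps)
        )
        return best_errors / max(total_ref_words, 1)

    # Example with two reference speakers and two hypothesis streams.
    ref = {"spk1": "hello how are you".split(), "spk2": "good morning".split()}
    hyp = {"sys_a": "good morning".split(), "sys_b": "hello how are you".split()}
    print(cpwer(ref, hyp))  # 0.0, since the best mapping pairs sys_b with spk1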

Endpoint detection for streaming end-to-end multi-talker ASR

L Lu, J Li, Y Gong - ICASSP 2022-2022 IEEE International …, 2022 - ieeexplore.ieee.org
Streaming end-to-end multi-talker speech recognition aims at transcribing the overlapped
speech from conversations or meetings with an all-neural model in a streaming fashion …

One Model to Rule Them All? Towards End-to-End Joint Speaker Diarization and Speech Recognition

S Cornell, J Jung, S Watanabe… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
This paper presents a novel framework for joint speaker diarization (SD) and automatic
speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented …

On Speaker Attribution with SURT

D Raj, M Wiesner, M Maciejewski… - arXiv preprint arXiv …, 2024 - arxiv.org
The Streaming Unmixing and Recognition Transducer (SURT) has recently become a
popular framework for continuous, streaming, multi-talker automatic speech recognition (ASR). With …

Alignment-Free Training for Transducer-based Multi-Talker ASR

T Moriya, S Horiguchi, M Delcroix, R Masumura… - arXiv preprint arXiv …, 2024 - arxiv.org
Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for
wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims …

Separator-transducer-segmenter: Streaming recognition and segmentation of multi-party speech

I Sklyar, A Piunova, C Osendorfer - arXiv preprint arXiv:2205.05199, 2022 - arxiv.org
Streaming recognition and segmentation of multi-party conversations with overlapping
speech is crucial for the next generation of voice assistant applications. In this work we …

EEND-DEMUX: End-to-End Neural Speaker Diarization via Demultiplexed Speaker Embeddings

SH Mun, MH Han, C Moon, NS Kim - arXiv preprint arXiv:2312.06065, 2023 - arxiv.org
In recent years, there have been studies to further improve end-to-end neural speaker
diarization (EEND) systems. This letter proposes the EEND-DEMUX model, a novel …

Directed speech separation for automatic speech recognition of long form conversational speech

R Paturi, S Srinivasan, K Kirchhoff… - arXiv preprint arXiv …, 2021 - arxiv.org
Many of the recent advances in speech separation are primarily aimed at synthetic mixtures
of short audio utterances with high degrees of overlap. Most of these approaches need an …