Recent advances in end-to-end automatic speech recognition
J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …
Streaming multi-talker ASR with token-level serialized output training
This paper proposes a token-level serialized output training (t-SOT), a novel framework for
streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi …
On word error rate definitions and their efficient computation for multi-speaker speech recognition systems
T von Neumann, C Boeddeker… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
We propose a general framework to compute the word error rate (WER) of ASR systems that
process recordings containing multiple speakers at their input and that produce multiple …
Endpoint detection for streaming end-to-end multi-talker ASR
Streaming end-to-end multi-talker speech recognition aims at transcribing the overlapped
speech from conversations or meetings with an all-neural model in a streaming fashion …
One Model to Rule Them All? Towards End-to-End Joint Speaker Diarization and Speech Recognition
This paper presents a novel framework for joint speaker diarization (SD) and automatic
speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented …
On Speaker Attribution with SURT
D Raj, M Wiesner, M Maciejewski… - arXiv preprint arXiv …, 2024 - arxiv.org
The Streaming Unmixing and Recognition Transducer (SURT) has recently become a
popular framework for continuous, streaming, multi-talker speech recognition (ASR). With …
Alignment-Free Training for Transducer-based Multi-Talker ASR
Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for
wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims …
Separator-transducer-segmenter: Streaming recognition and segmentation of multi-party speech
Streaming recognition and segmentation of multi-party conversations with overlapping
speech is crucial for the next generation of voice assistant applications. In this work we …
EEND-DEMUX: End-to-End Neural Speaker Diarization via Demultiplexed Speaker Embeddings
In recent years, there have been studies to further improve the end-to-end neural speaker
diarization (EEND) systems. This letter proposes the EEND-DEMUX model, a novel …
Directed speech separation for automatic speech recognition of long form conversational speech
Many of the recent advances in speech separation are primarily aimed at synthetic mixtures
of short audio utterances with high degrees of overlap. Most of these approaches need an …