An exploration of self-supervised pretrained representations for end-to-end speech recognition

X Chang, T Maekaku, P Guo, J Shi… - 2021 IEEE Automatic …, 2021 - ieeexplore.ieee.org
Self-supervised pretraining on speech data has made substantial progress. High-fidelity
representations of the speech signal are learned from large amounts of untranscribed data and show …

Cascaded encoders for unifying streaming and non-streaming ASR

A Narayanan, TN Sainath, R Pang, J Yu… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
End-to-end (E2E) automatic speech recognition (ASR) models, by now, have shown
competitive performance on several benchmarks. These models are structured to either …

Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition

AS Subramanian, C Weng, S Watanabe, M Yu… - Computer Speech & …, 2022 - Elsevier
Multi-source localization is an important and challenging technique for multi-talker
conversation analysis. This paper proposes a novel supervised learning method using deep …

Summary on the ICASSP 2022 multi-channel multi-party meeting transcription grand challenge

F Yu, S Zhang, P Guo, Y Fu, Z Du… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge
(M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech …

EEND-SS: Joint end-to-end neural speaker diarization and speech separation for flexible number of speakers

S Maiti, Y Ueda, S Watanabe, C Zhang… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
In this paper, we present a novel framework that jointly performs three tasks: speaker
diarization, speech separation, and speaker counting. Our proposed framework integrates …

L-SpEx: Localized target speaker extraction

M Ge, C Xu, L Wang, ES Chng… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Speaker extraction aims to extract the target speaker's voice from a multi-talker speech
mixture given an auxiliary reference utterance. Recent studies show that speaker extraction …

Single channel voice separation for unknown number of speakers under reverberant and noisy settings

SE Chazan, L Wolf, E Nachmani… - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
We present a unified network for voice separation of an unknown number of speakers. The
proposed approach is composed of several separation heads optimized together with a …

Dual-path RNN for long recording speech separation

C Li, Y Luo, C Han, J Li, T Yoshioka… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
Continuous speech separation (CSS) is an emerging task in speech separation that aims to
separate overlap-free targets from a long, partially overlapped recording. A straightforward …

End-to-end speaker diarization conditioned on speech activity and overlap detection

Y Takashima, Y Fujita, S Watanabe… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
In this paper, we present a conditional multitask learning method for end-to-end neural
speaker diarization (EEND). The EEND system has shown promising performance …

Train from scratch: Single-stage joint training of speech separation and recognition

J Shi, X Chang, S Watanabe, B Xu - Computer Speech & Language, 2022 - Elsevier
Multi-speaker speech separation and recognition has recently gained much attention in the speech
community. Previously, most studies trained the front-end separation module and back …