A review of speaker diarization: Recent advances with deep learning

TJ Park, N Kanda, D Dimitriadis, KJ Han… - Computer Speech & …, 2022 - Elsevier
Speaker diarization is a task to label audio or video recordings with classes that correspond
to speaker identity, or in short, a task to identify “who spoke when”. In the early years …

End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors

S Horiguchi, Y Fujita, S Watanabe, Y Xue… - arXiv preprint arXiv …, 2020 - arxiv.org
End-to-end speaker diarization for an unknown number of speakers is addressed in this
paper. Recently proposed end-to-end speaker diarization outperformed conventional …

VarArray meets t-SOT: Advancing the state of the art of streaming distant conversational speech recognition

N Kanda, J Wu, X Wang, Z Chen, J Li… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
This paper presents a novel streaming automatic speech recognition (ASR) framework for
multi-talker overlapping speech captured by a distant microphone array with an arbitrary …

Microsoft speaker diarization system for the VoxCeleb speaker recognition challenge 2020

X Xiao, N Kanda, Z Chen, T Zhou… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
This paper describes the Microsoft speaker diarization system for monaural multi-talker
recordings in the wild, evaluated at the diarization track of the VoxCeleb Speaker …

GPU-accelerated guided source separation for meeting transcription

D Raj, D Povey, S Khudanpur - arXiv preprint arXiv:2212.05271, 2022 - arxiv.org
Guided source separation (GSS) is a type of target-speaker extraction method that relies on
pre-computed speaker activities and blind source separation to perform front-end …

Streaming multi-talker ASR with token-level serialized output training

N Kanda, J Wu, Y Wu, X Xiao, Z Meng, X Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper proposes a token-level serialized output training (t-SOT), a novel framework for
streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi …

Advances in online audio-visual meeting transcription

T Yoshioka, I Abramovski, C Aksoylar… - 2019 IEEE Automatic …, 2019 - ieeexplore.ieee.org
This paper describes a system that generates speaker-annotated transcripts of meetings by
using a microphone array and a 360-degree camera. The hallmark of the system is its ability …

Encoder-decoder based attractors for end-to-end neural diarization

S Horiguchi, Y Fujita, S Watanabe… - … /ACM Transactions on …, 2022 - ieeexplore.ieee.org
This paper investigates an end-to-end neural diarization (EEND) method for an unknown
number of speakers. In contrast to the conventional cascaded approach to speaker …

Jointly optimal denoising, dereverberation, and source separation

T Nakatani, C Boeddeker, K Kinoshita… - … on Audio, Speech …, 2020 - ieeexplore.ieee.org
This article proposes methods that can optimize a Convolutional BeamFormer (CBF) for
jointly performing denoising, dereverberation, and source separation (DN+DR+SS) in a …

The STC system for the CHiME-6 challenge

I Medennikov, M Korenevsky, T Prisyach… - … 2020 Workshop on …, 2020 - isca-archive.org
This paper is a description of the Speech Technology Center (STC) systems for the CHiME-6
challenge aimed at multimicrophone multi-speaker speech recognition and diarization in a …