VarArray meets t-SOT: Advancing the state of the art of streaming distant conversational speech recognition

N Kanda, J Wu, X Wang, Z Chen, J Li… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
This paper presents a novel streaming automatic speech recognition (ASR) framework for
multi-talker overlapping speech captured by a distant microphone array with an arbitrary …

The CHiME-7 DASR challenge: Distant meeting transcription with multiple devices in diverse scenarios

S Cornell, M Wiesner, S Watanabe, D Raj… - arXiv preprint arXiv …, 2023 - arxiv.org
The CHiME challenges have played a significant role in the development and evaluation of
robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR …

An experimental review of speaker diarization methods with application to two-speaker conversational telephone speech recordings

L Serafini, S Cornell, G Morrone, E Zovato… - Computer Speech & …, 2023 - Elsevier
We performed an experimental review of current diarization systems for the conversational
telephone speech (CTS) domain. In detail, we considered a total of eight different algorithms …

Powerset multi-class cross entropy loss for neural speaker diarization

A Plaquet, H Bredin - arXiv preprint arXiv:2310.13025, 2023 - arxiv.org
Since its introduction in 2019, the whole end-to-end neural diarization (EEND) line of work
has been addressing speaker diarization as a frame-wise multi-label classification problem …

pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe

H Bredin - 24th INTERSPEECH Conference (INTERSPEECH …, 2023 - hal.science
pyannote.audio is an open-source toolkit written in Python for speaker diarization. Version
2.1 introduces a major overhaul of pyannote.audio default speaker diarization pipeline …

GPU-accelerated guided source separation for meeting transcription

D Raj, D Povey, S Khudanpur - arXiv preprint arXiv:2212.05271, 2022 - arxiv.org
Guided source separation (GSS) is a type of target-speaker extraction method that relies on
pre-computed speaker activities and blind source separation to perform front-end …

UNSSOR: unsupervised neural speech separation by leveraging over-determined training mixtures

ZQ Wang, S Watanabe - Advances in Neural Information …, 2024 - proceedings.neurips.cc
In reverberant conditions with multiple concurrent speakers, each microphone acquires a
mixture signal of multiple speakers at a different location. In over-determined conditions …

Cross-channel attention-based target speaker voice activity detection: Experimental results for the M2MeT challenge

W Wang, X Qin, M Li - ICASSP 2022-2022 IEEE International …, 2022 - ieeexplore.ieee.org
DukeECE. As the highly overlapped speech exists in the dataset, we employ an x-vector-
based target-speaker voice activity detection (TS-VAD) to find the overlap between …

Summary on the ICASSP 2022 multi-channel multi-party meeting transcription grand challenge

F Yu, S Zhang, P Guo, Y Fu, Z Du… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge
(M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech …

DiaPer: End-to-end neural diarization with perceiver-based attractors

F Landini, T Stafylakis, L Burget - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Until recently, the field of speaker diarization was dominated by cascaded systems. Due to
their limitations, mainly regarding overlapped speech and cumbersome pipelines, endto …