Audiobox: Unified audio generation with natural language prompts

A Vyas, B Shi, M Le, A Tjandra, YC Wu, B Guo… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio is an essential part of our life, but creating it often requires expertise and is
time-consuming. Research communities have made great progress over the past year advancing …

AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset

Z Cai, S Ghosh, AP Adatia, M Hayat, A Dhall… - arXiv preprint arXiv …, 2023 - arxiv.org
The detection and localization of highly realistic deepfake audio-visual content are
challenging even for the most advanced state-of-the-art methods. While most of the research …

DiaPer: End-to-end neural diarization with perceiver-based attractors

F Landini, T Stafylakis, L Burget - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Until recently, the field of speaker diarization was dominated by cascaded systems. Due to
their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to …

Careless Whisper: Speech-to-Text Hallucination Harms

A Koenecke, ASG Choi, KX Mei, H Schellmann… - The 2024 ACM …, 2024 - dl.acm.org
Speech-to-text services aim to transcribe input audio as accurately as possible. They
increasingly play a role in everyday life, for example in personal voice assistants or in …

1M-Deepfakes Detection Challenge

Z Cai, A Dhall, S Ghosh, M Hayat, D Kollias… - arXiv preprint arXiv …, 2024 - arxiv.org
The detection and localization of deepfake content, particularly when small fake segments
are seamlessly mixed with real videos, remains a significant challenge in the field of digital …

Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation

H He, Z Shang, C Wang, X Li, Y Gu, H Hua… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, speech generation models have made significant progress by using large-scale
training data. However, the research community struggles to produce highly spontaneous …

pyannote.audio speaker diarization pipeline at VoxSRC 2023

S Baroudi, H Bredin, A Plaquet, T Pellegrini - The VoxCeleb Speaker …, 2023 - mmai.io
This technical report describes the submission of team pyannote to the VoxSRC 2023
speaker diarization challenge. It relies on 3 stages: local end-to-end neural speaker …

Pheme: Efficient and Conversational Speech Generation

P Budzianowski, T Sereda, T Cichy, I Vulić - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, speech generation has seen remarkable progress, now achieving one-shot
generation capability that is often virtually indistinguishable from real human voice …

TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024

J Kalda, T Alumäe, M Lebourdais, H Bredin… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper describes the submissions of team TalTech-IRIT-LIS to the DISPLACE 2024
challenge. Our team participated in the speaker diarization and language diarization tracks …

PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings

J Kalda, R Marxer, T Alumäe, H Bredin - arXiv preprint arXiv:2403.02288, 2024 - arxiv.org
A major drawback of supervised speech separation (SSep) systems is their reliance on
synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) …