Audiobox: Unified audio generation with natural language prompts
Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …
AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset
The detection and localization of highly realistic deepfake audio-visual content are
challenging even for the most advanced state-of-the-art methods. While most of the research …
Diaper: End-to-end neural diarization with perceiver-based attractors
Until recently, the field of speaker diarization was dominated by cascaded systems. Due to
their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to …
Careless Whisper: Speech-to-Text Hallucination Harms
Speech-to-text services aim to transcribe input audio as accurately as possible. They
increasingly play a role in everyday life, for example in personal voice assistants or in …
1M-Deepfakes Detection Challenge
The detection and localization of deepfake content, particularly when small fake segments
are seamlessly mixed with real videos, remains a significant challenge in the field of digital …
Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation
Recently, speech generation models have made significant progress by using large-scale
training data. However, the research community struggles to produce highly spontaneous …
pyannote.audio speaker diarization pipeline at VoxSRC 2023
This technical report describes the submission of team pyannote to the VoxSRC 2023
speaker diarization challenge. It relies on 3 stages: local end-to-end neural speaker …
Pheme: Efficient and Conversational Speech Generation
In recent years, speech generation has seen remarkable progress, now achieving one-shot
generation capability that is often virtually indistinguishable from real human voice …
TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024
This paper describes the submissions of team TalTech-IRIT-LIS to the DISPLACE 2024
challenge. Our team participated in the speaker diarization and language diarization tracks …
PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings
A major drawback of supervised speech separation (SSep) systems is their reliance on
synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) …