Powerset multi-class cross entropy loss for neural speaker diarization

A Vyas, B Shi, M Le, A Tjandra, YC Wu, B Guo… - arXiv preprint arXiv …, 2023 - arxiv.org

Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …

Cited by 61 Related articles All 2 versions

AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset

Z Cai, S Ghosh, AP Adatia, M Hayat, A Dhall… - arXiv preprint arXiv …, 2023 - arxiv.org

The detection and localization of highly realistic deepfake audio-visual content are
challenging even for the most advanced state-of-the-art methods. While most of the research …

Cited by 21 Related articles All 2 versions

Diaper: End-to-end neural diarization with perceiver-based attractors

F Landini, T Stafylakis, L Burget - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org

Until recently, the field of speaker diarization was dominated by cascaded systems. Due to
their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to …

Cited by 6 Related articles All 2 versions

Careless Whisper: Speech-to-Text Hallucination Harms

A Koenecke, ASG Choi, KX Mei, H Schellmann… - The 2024 ACM …, 2024 - dl.acm.org

Speech-to-text services aim to transcribe input audio as accurately as possible. They
increasingly play a role in everyday life, for example in personal voice assistants or in …

Cited by 15 Related articles All 4 versions

1M-Deepfakes Detection Challenge

Z Cai, A Dhall, S Ghosh, M Hayat, D Kollias… - arXiv preprint arXiv …, 2024 - arxiv.org

The detection and localization of deepfake content, particularly when small fake segments
are seamlessly mixed with real videos, remains a significant challenge in the field of digital …

Cited by 4 Related articles All 3 versions

Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation

H He, Z Shang, C Wang, X Li, Y Gu, H Hua… - arXiv preprint arXiv …, 2024 - arxiv.org

Recently, speech generation models have made significant progress by using large-scale
training data. However, the research community struggle to produce highly spontaneous …

Cited by 4 Related articles All 3 versions

pyannote.audio speaker diarization pipeline at VoxSRC 2023

S Baroudi, H Bredin, A Plaquet, T Pellegrini - The VoxCeleb Speaker …, 2023 - mmai.io

This technical report describes the submission of team pyannote to the VoxSRC 2023
speaker diarization challenge. It relies on 3 stages: local end-to-end neural speaker …

Cited by 8 Related articles All 3 versions

Pheme: Efficient and Conversational Speech Generation

P Budzianowski, T Sereda, T Cichy, I Vulić - arXiv preprint arXiv …, 2024 - arxiv.org

In recent years, speech generation has seen remarkable progress, now achieving one-shot
generation capability that is often virtually indistinguishable from real human voice …

Cited by 6 Related articles All 2 versions

TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024

J Kalda, T Alumäe, M Lebourdais, H Bredin… - arXiv preprint arXiv …, 2024 - arxiv.org

This paper describes the submissions of team TalTech-IRIT-LIS to the DISPLACE 2024
challenge. Our team participated in the speaker diarization and language diarization tracks …

Cited by 2 Related articles All 10 versions

PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings

J Kalda, R Marxer, T Alumäe, H Bredin - arXiv preprint arXiv:2403.02288, 2024 - arxiv.org

A major drawback of supervised speech separation (SSep) systems is their reliance on
synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) …

Cited by 6 Related articles All 2 versions