Audiobox: Unified audio generation with natural language prompts

A Vyas, B Shi, M Le, A Tjandra, YC Wu, B Guo… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …

Gass: Generalizing audio source separation with large-scale data

J Pons, X Liu, S Pascual, J Serrà - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
Universal source separation aims at separating the audio sources of an arbitrary mix,
removing the constraint to operate on a specific domain like speech or music. Yet, the …
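
As a rough, paper-agnostic illustration of the task setup (not GASS itself), the sketch below builds an "arbitrary mix" by summing toy waveforms standing in for different domains; a universal separator would take such a mixture and estimate each constituent source. All signal names here are placeholders.

```python
import numpy as np

def make_arbitrary_mixture(sources):
    """Sum waveforms from any domains (speech, music, sound events, ...)
    into a single mixture; a universal separator is expected to recover
    every source from such a mix without knowing the domains in advance."""
    length = max(len(s) for s in sources)
    mix = np.zeros(length)
    for s in sources:
        mix[: len(s)] += s
    return mix

# Toy 16 kHz signals standing in for recordings from different domains.
sr = 16_000
t = np.arange(sr) / sr
speech_like = 0.1 * np.sin(2 * np.pi * 220 * t)
music_like = 0.1 * np.sin(2 * np.pi * 440 * t)
event_like = 0.05 * np.random.randn(sr // 2)      # short noise burst
mixture = make_arbitrary_mixture([speech_like, music_like, event_like])
print(mixture.shape)                               # (16000,)
```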

Cacophony: An improved contrastive audio-text model

G Zhu, Z Duan - arXiv preprint arXiv:2402.06986, 2024 - arxiv.org
Despite recent improvements in audio-text modeling, audio-text contrastive models still lag
behind their image-text counterparts in scale and performance. We propose a method to …
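
For background, audio-text contrastive models of this kind are generally trained with a symmetric InfoNCE (CLIP-style) objective over paired audio and caption embeddings. The sketch below shows that generic objective only; the batch size, embedding dimension, and temperature are placeholders, not Cacophony's actual configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_audio_text_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings.
    Matching pairs share a batch index; every other pair is a negative."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    loss_a2t = F.cross_entropy(logits, targets)           # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)       # text -> audio direction
    return (loss_a2t + loss_t2a) / 2

# Random tensors stand in for encoder outputs (batch of 8, 512-d).
audio = torch.randn(8, 512)
text = torch.randn(8, 512)
print(contrastive_audio_text_loss(audio, text).item())
```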

Audio-Language Datasets of Scenes and Events: A Survey

G Wijngaard, E Formisano, M Esposito… - arXiv preprint arXiv …, 2024 - arxiv.org
Audio-language models (ALMs) process sounds to provide a linguistic description of sound-
producing events and scenes. Recent advances in computing power and dataset creation …

Prompt-driven target speech diarization

Y Jiang, Z Chen, R Tao, L Deng… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
We introduce a novel task named 'target speech diarization', which seeks to determine
'when the target event occurred' within an audio signal. We devise a neural architecture called …
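
To make "when the target event occurred" concrete: whatever model produces them, frame-level activity probabilities for the prompted target can be thresholded into time segments. The post-processing sketch below is a generic illustration, not the architecture proposed in the paper; the hop size and threshold are assumed values.

```python
def frames_to_segments(probs, threshold=0.5, hop_seconds=0.02):
    """Turn frame-level activity probabilities for the prompted target
    into (start_sec, end_sec) segments, i.e. 'when the target occurred'."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                                   # segment opens
        elif p < threshold and start is not None:
            segments.append((start * hop_seconds, i * hop_seconds))
            start = None                                # segment closes
    if start is not None:                               # still active at the end
        segments.append((start * hop_seconds, len(probs) * hop_seconds))
    return segments

# Toy probabilities: two bursts of target activity.
print(frames_to_segments([0.1, 0.8, 0.9, 0.2, 0.7, 0.7]))
# roughly [(0.02, 0.06), (0.08, 0.12)] seconds
```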

Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-Wise Pseudo Labeling

J Zhou, D Guo, Y Zhong, M Wang - International Journal of Computer …, 2024 - Springer
The Audio-Visual Video Parsing task aims to identify and temporally localize the
events that occur in either or both the audio and visual streams of audible videos. It often …

Target conversation extraction: Source separation using turn-taking dynamics

T Chen, Q Wang, B Wu, M Itani, ES Eskimez… - arXiv preprint arXiv …, 2024 - arxiv.org
Extracting the speech of participants in a conversation amidst interfering speakers and noise
presents a challenging problem. In this paper, we introduce the novel task of target …
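
A loose intuition for "turn-taking dynamics" (a simplified illustration, not the paper's method): speakers in the same conversation take turns and therefore rarely talk at the same time, so pairwise overlap of their voice activity is a cheap cue for grouping speakers into conversations. The activity arrays below are invented toy data.

```python
import numpy as np

def overlap_ratio(act_a, act_b):
    """Fraction of active frames in which both speakers talk at once.
    Inputs are binary frame-level voice-activity arrays of equal length
    (how that activity is estimated is outside this sketch)."""
    both = np.logical_and(act_a, act_b).sum()
    either = np.logical_or(act_a, act_b).sum()
    return float(both) / max(int(either), 1)

# Alternating turns (same conversation) vs. an unrelated speaker.
spk_a = np.array([1, 1, 0, 0, 1, 0, 0, 1])
spk_b = np.array([0, 0, 1, 1, 0, 1, 0, 0])   # takes turns with spk_a
spk_c = np.array([1, 0, 1, 0, 1, 1, 0, 1])   # talks over spk_a freely
print(overlap_ratio(spk_a, spk_b))            # 0.0 -> likely same conversation
print(overlap_ratio(spk_a, spk_c))            # 0.5 -> likely a different one
```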

A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems

KN Watcharasupat, A Lerch - arXiv preprint arXiv:2406.18747, 2024 - arxiv.org
Despite significant recent progress across multiple subtasks of audio source separation, few
music source separation systems support separation beyond the four-stem vocals, drums …

A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining

F Xiao, J Guan, Q Zhu, X Liu, W Wang, S Qi… - arXiv preprint arXiv …, 2024 - arxiv.org
Language-queried audio source separation (LASS) aims to separate an audio source
guided by a text query, with the signal-to-distortion ratio (SDR)-based metrics being …
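
The general recipe behind a CLAP-based reference-free score (not necessarily the exact metric proposed here) is to embed the separated output and the text query with a pretrained contrastive language-audio model and measure their cosine similarity. Only the scoring step is sketched below; the embedding functions are assumed to exist elsewhere, and the vectors shown are random stand-ins.

```python
import numpy as np

def clap_similarity_score(separated_audio_emb, query_text_emb):
    """Reference-free score: cosine similarity between the separated
    output and the text query in a shared language-audio embedding space.
    Both inputs are 1-D embeddings from a pretrained CLAP-style model;
    the embedding step itself is assumed to happen elsewhere."""
    a = separated_audio_emb / np.linalg.norm(separated_audio_emb)
    t = query_text_emb / np.linalg.norm(query_text_emb)
    return float(np.dot(a, t))

# Random vectors stand in for real CLAP embeddings.
rng = np.random.default_rng(0)
print(clap_similarity_score(rng.standard_normal(512), rng.standard_normal(512)))
```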

Triple-0: Zero-shot denoising and dereverberation on an end-to-end frozen anechoic speech separation network

S Gul, MS Khan, A Ur-Rehman - PLOS ONE, 2024 - journals.plos.org
Speech enhancement is crucial both for human and machine listening applications. Over the
last decade, the use of deep learning for speech enhancement has resulted in tremendous …