Audiobox: Unified audio generation with natural language prompts
Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …
Gass: Generalizing audio source separation with large-scale data
Universal source separation aims to separate the audio sources of an arbitrary mix,
removing the constraint to operate on a specific domain like speech or music. Yet, the …
Cacophony: An improved contrastive audio-text model
Despite recent improvements in audio-text modeling, audio-text contrastive models still lag
behind their image-text counterparts in scale and performance. We propose a method to …
Audio-Language Datasets of Scenes and Events: A Survey
Audio-language models (ALMs) process sounds to provide a linguistic description of sound-
producing events and scenes. Recent advances in computing power and dataset creation …
Prompt-driven target speech diarization
We introduce a novel task named 'target speech diarization', which seeks to determine
'when target event occurred' within an audio signal. We devise a neural architecture called …
Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-Wise Pseudo Labeling
The Audio-Visual Video Parsing task aims to identify and temporally localize the
events that occur in either or both the audio and visual streams of audible videos. It often …
Target conversation extraction: Source separation using turn-taking dynamics
Extracting the speech of participants in a conversation amidst interfering speakers and noise
presents a challenging problem. In this paper, we introduce the novel task of target …
A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems
KN Watcharasupat, A Lerch - arXiv preprint arXiv:2406.18747, 2024 - arxiv.org
Despite significant recent progress across multiple subtasks of audio source separation, few
music source separation systems support separation beyond the four-stem vocals, drums …
A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining
Language-queried audio source separation (LASS) aims to separate an audio source
guided by a text query, with the signal-to-distortion ratio (SDR)-based metrics being …
Triple-0: Zero-shot denoising and dereverberation on an end-to-end frozen anechoic speech separation network
S Gul, MS Khan, A Ur-Rehman - Plos one, 2024 - journals.plos.org
Speech enhancement is crucial both for human and machine listening applications. Over the
last decade, the use of deep learning for speech enhancement has resulted in tremendous …