Audiobox: Unified audio generation with natural language prompts

A Vyas, B Shi, M Le, A Tjandra, YC Wu, B Guo… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …

Gass: Generalizing audio source separation with large-scale data

J Pons, X Liu, S Pascual, J Serrà - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
Universal source separation aims at separating the audio sources of an arbitrary mix,
removing the constraint to operate on a specific domain like speech or music. Yet, the …
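
As a rough, paper-agnostic illustration of the task setup (not GASS itself), the sketch below builds an "arbitrary mix" by summing toy waveforms standing in for different domains; a universal separator would take such a mixture and estimate each constituent source. All signal names here are placeholders.

```python
import numpy as np

def make_arbitrary_mixture(sources):
    """Sum waveforms from any domains (speech, music, sound events, ...)
    into a single mixture; a universal separator is expected to recover
    every source from such a mix without knowing the domains in advance."""
    length = max(len(s) for s in sources)
    mix = np.zeros(length)
    for s in sources:
        mix[: len(s)] += s
    return mix

# Toy 16 kHz signals standing in for recordings from different domains.
sr = 16_000
t = np.arange(sr) / sr
speech_like = 0.1 * np.sin(2 * np.pi * 220 * t)
music_like = 0.1 * np.sin(2 * np.pi * 440 * t)
event_like = 0.05 * np.random.randn(sr // 2)      # short noise burst
mixture = make_arbitrary_mixture([speech_like, music_like, event_like])
print(mixture.shape)                               # (16000,)
```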

Cacophony: An improved contrastive audio-text model

G Zhu, Z Duan - arXiv preprint arXiv:2402.06986, 2024 - arxiv.org
Despite recent improvements in audio-text modeling, audio-text contrastive models still lag
behind their image-text counterparts in scale and performance. We propose a method to …
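
For background, audio-text contrastive models of this kind are generally trained with a symmetric InfoNCE (CLIP-style) objective over paired audio and caption embeddings. The sketch below shows that generic objective only; the batch size, embedding dimension, and temperature are placeholders, not Cacophony's actual configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_audio_text_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings.
    Matching pairs share a batch index; every other pair is a negative."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    loss_a2t = F.cross_entropy(logits, targets)           # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)       # text -> audio direction
    return (loss_a2t + loss_t2a) / 2

# Random tensors stand in for encoder outputs (batch of 8, 512-d).
audio = torch.randn(8, 512)
text = torch.randn(8, 512)
print(contrastive_audio_text_loss(audio, text).item())
```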

Audio-Language Datasets of Scenes and Events: A Survey

G Wijngaard, E Formisano, M Esposito… - arXiv preprint arXiv …, 2024 - arxiv.org
Audio-language models (ALMs) process sounds to provide a linguistic description of sound-
producing events and scenes. Recent advances in computing power and dataset creation …

Prompt-driven target speech diarization

Y Jiang, Z Chen, R Tao, L Deng… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
We introduce a novel task named 'target speech diarization', which seeks to determine
'when the target event occurred' within an audio signal. We devise a neural architecture called …
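
To make "when the target event occurred" concrete: whatever model produces them, frame-level activity probabilities for the prompted target can be thresholded into time segments. The post-processing sketch below is a generic illustration, not the architecture proposed in the paper; the hop size and threshold are assumed values.

```python
def frames_to_segments(probs, threshold=0.5, hop_seconds=0.02):
    """Turn frame-level activity probabilities for the prompted target
    into (start_sec, end_sec) segments, i.e. 'when the target occurred'."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                                   # segment opens
        elif p < threshold and start is not None:
            segments.append((start * hop_seconds, i * hop_seconds))
            start = None                                # segment closes
    if start is not None:                               # still active at the end
        segments.append((start * hop_seconds, len(probs) * hop_seconds))
    return segments

# Toy probabilities: two bursts of target activity.
print(frames_to_segments([0.1, 0.8, 0.9, 0.2, 0.7, 0.7]))
# roughly [(0.02, 0.06), (0.08, 0.12)] seconds
```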

Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-Wise Pseudo Labeling

J Zhou, D Guo, Y Zhong, M Wang - International Journal of Computer …, 2024 - Springer
The Audio-Visual Video Parsing task aims to identify and temporally localize the
events that occur in either or both the audio and visual streams of audible videos. It often …

Target conversation extraction: Source separation using turn-taking dynamics

T Chen, Q Wang, B Wu, M Itani, ES Eskimez… - arXiv preprint arXiv …, 2024 - arxiv.org
Extracting the speech of participants in a conversation amidst interfering speakers and noise
presents a challenging problem. In this paper, we introduce the novel task of target …
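
A loose intuition for "turn-taking dynamics" (a simplified illustration, not the paper's method): speakers in the same conversation take turns and therefore rarely talk at the same time, so pairwise overlap of their voice activity is a cheap cue for grouping speakers into conversations. The activity arrays below are invented toy data.

```python
import numpy as np

def overlap_ratio(act_a, act_b):
    """Fraction of active frames in which both speakers talk at once.
    Inputs are binary frame-level voice-activity arrays of equal length
    (how that activity is estimated is outside this sketch)."""
    both = np.logical_and(act_a, act_b).sum()
    either = np.logical_or(act_a, act_b).sum()
    return float(both) / max(int(either), 1)

# Alternating turns (same conversation) vs. an unrelated speaker.
spk_a = np.array([1, 1, 0, 0, 1, 0, 0, 1])
spk_b = np.array([0, 0, 1, 1, 0, 1, 0, 0])   # takes turns with spk_a
spk_c = np.array([1, 0, 1, 0, 1, 1, 0, 1])   # talks over spk_a freely
print(overlap_ratio(spk_a, spk_b))            # 0.0 -> likely same conversation
print(overlap_ratio(spk_a, spk_c))            # 0.5 -> likely a different one
```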

A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems

KN Watcharasupat, A Lerch - arXiv preprint arXiv:2406.18747, 2024 - arxiv.org
Despite significant recent progress across multiple subtasks of audio source separation, few
music source separation systems support separation beyond the four-stem vocals, drums …

A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining

F Xiao, J Guan, Q Zhu, X Liu, W Wang, S Qi… - arXiv preprint arXiv …, 2024 - arxiv.org
Language-queried audio source separation (LASS) aims to separate an audio source
guided by a text query, with the signal-to-distortion ratio (SDR)-based metrics being …
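
The general recipe behind a CLAP-based reference-free score (not necessarily the exact metric proposed here) is to embed the separated output and the text query with a pretrained contrastive language-audio model and measure their cosine similarity. Only the scoring step is sketched below; the embedding functions are assumed to exist elsewhere, and the vectors shown are random stand-ins.

```python
import numpy as np

def clap_similarity_score(separated_audio_emb, query_text_emb):
    """Reference-free score: cosine similarity between the separated
    output and the text query in a shared language-audio embedding space.
    Both inputs are 1-D embeddings from a pretrained CLAP-style model;
    the embedding step itself is assumed to happen elsewhere."""
    a = separated_audio_emb / np.linalg.norm(separated_audio_emb)
    t = query_text_emb / np.linalg.norm(query_text_emb)
    return float(np.dot(a, t))

# Random vectors stand in for real CLAP embeddings.
rng = np.random.default_rng(0)
print(clap_similarity_score(rng.standard_normal(512), rng.standard_normal(512)))
```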

Triple-0: Zero-shot denoising and dereverberation on an end-to-end frozen anechoic speech separation network

S Gul, MS Khan, A Ur-Rehman - PLOS ONE, 2024 - journals.plos.org
Speech enhancement is crucial both for human and machine listening applications. Over the
last decade, the use of deep learning for speech enhancement has resulted in tremendous …