Masked Audio Modeling with CLAP and Multi-Objective Learning

Y Xin, X Peng, Y Lu - arXiv preprint arXiv:2401.15953, 2024 - arxiv.org
Most existing masked audio modeling (MAM) methods learn audio representations by
masking and reconstructing local spectrogram patches. However, the reconstruction loss …

Background-aware Modeling for Weakly Supervised Sound Event Detection

Y Xin, D Yang, Y Zou - Proc. INTERSPEECH, 2023 - isca-archive.org
Nowadays, a common framework for weakly supervised sound event detection (WSSED) is
multiple instance learning (MIL). However, MIL directly optimizes the clip-level classification …

Improving Speech Enhancement Using Audio Tagging Knowledge From Pre-Trained Representations and Multi-Task Learning

S Lin, C Zhang, Y Qian - 2023 IEEE Automatic Speech …, 2023 - ieeexplore.ieee.org
In deep-learning-based speech enhancement (SE), an audio-knowledge-ignorant approach
is often used, which estimates a denoising model to transform the noisy input speech into …

SLIT: Boosting Audio-Text Pre-Training via Multi-Stage Learning and Instruction Tuning

H Zhao, Y Xin, Z Yu, B Zhu, L Lu, Z Ma - arXiv preprint arXiv:2402.07485, 2024 - arxiv.org
Audio-text pre-training (ATP) has witnessed remarkable strides across a variety of
downstream tasks. Yet, most existing pretrained audio models only specialize in either …

Complete and separate: Conditional separation with missing target source attribute completion

D Bralios, E Tzinis, P Smaragdis - 2023 IEEE Workshop on …, 2023 - ieeexplore.ieee.org
Recent approaches in source separation leverage semantic information about their input
mixtures and constituent sources that when used in conditional separation models can …