Sound source localization is all about cross-modal alignment

A Senocak, H Ryu, J Kim, TH Oh… - Proceedings of the …, 2023 - openaccess.thecvf.com
Humans can easily perceive the direction of sound sources in a visual scene, termed sound
source localization. Recent studies on learning-based sound source localization have …

Learning audio-visual source localization via false negative aware contrastive learning

W Sun, J Zhang, J Wang, Z Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Self-supervised audio-visual source localization aims to locate sound-source objects in
video frames without extra annotations. Recent methods often approach this goal with the …

Open-vocabulary semantic segmentation via attribute decomposition-aggregation

C Ma, Y Yuhuan, C Ju, F Zhang… - Advances in Neural …, 2024 - proceedings.neurips.cc
Open-vocabulary semantic segmentation is a challenging task that requires segmenting
novel object categories at inference time. Recent works explore vision-language pre-training …

Annotation-free audio-visual segmentation

J Liu, Y Wang, C Ju, C Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com
The objective of Audio-Visual Segmentation (AVS) is to localise the sounding
objects within visual scenes by accurately predicting pixel-wise segmentation masks. To …

Audio-visual segmentation by exploring cross-modal mutual semantics

C Liu, PP Li, X Qi, H Zhang, L Li, D Wang… - Proceedings of the 31st …, 2023 - dl.acm.org
The audio-visual segmentation (AVS) task aims to segment sounding objects from a given
video. Existing works mainly focus on fusing audio and visual features of a given video to …

Distilling vision-language pre-training to collaborate with weakly-supervised temporal action localization

C Ju, K Zheng, J Liu, P Zhao, Y Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Weakly-supervised temporal action localization (WTAL) learns to detect and classify action
instances with only category labels. Most methods widely adopt the off-the-shelf …

Dual mean-teacher: An unbiased semi-supervised framework for audio-visual source localization

Y Guo, S Ma, H Su, Z Wang, Y Zhao… - Advances in …, 2024 - proceedings.neurips.cc
Audio-Visual Source Localization (AVSL) aims to locate sounding objects within
video frames given the paired audio clips. Existing methods predominantly rely on self …

Audio-Visual Segmentation via Unlabeled Frame Exploitation

J Liu, Y Liu, F Zhang, C Ju… - Proceedings of the …, 2024 - openaccess.thecvf.com
Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames.
Although great progress has been witnessed, we experimentally reveal that current methods …

Can CLIP Help Sound Source Localization?

S Park, A Senocak, JS Chung - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Large-scale pre-trained image-text models demonstrate remarkable versatility across
diverse tasks, benefiting from their robust representational capabilities and effective …

Audio-aware query-enhanced transformer for audio-visual segmentation

J Liu, C Ju, C Ma, Y Wang, Y Wang, Y Zhang - arXiv preprint arXiv …, 2023 - arxiv.org
The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in
the video frames using audio cues. However, current fusion-based methods have the …