Sound source localization is all about cross-modal alignment

A Senocak, H Ryu, J Kim, TH Oh… - Proceedings of the …, 2023 - openaccess.thecvf.com
Humans can easily perceive the direction of sound sources in a visual scene, termed sound
source localization. Recent studies on learning-based sound source localization have …

Can CLIP Help Sound Source Localization?

S Park, A Senocak, JS Chung - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Large-scale pre-trained image-text models demonstrate remarkable versatility across
diverse tasks, benefiting from their robust representational capabilities and effective …

Multimodal imbalance-aware gradient modulation for weakly-supervised audio-visual video parsing

J Fu, J Gao, BK Bao, C Xu - … on Circuits and Systems for Video …, 2023 - ieeexplore.ieee.org
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal
extents of audio, visual and audio-visual event instances as well as identify the …

Exploiting visual context semantics for sound source localization

X Zhou, D Zhou, D Hu, H Zhou… - Proceedings of the …, 2023 - openaccess.thecvf.com
Self-supervised sound source localization in unconstrained visual scenes is an important
task of audio-visual learning. In this paper, we propose a visual reasoning module to …

Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

T Liu, P Zhang, W Huang, Y Zha, T You… - Proceedings of the 31st …, 2023 - dl.acm.org
Self-supervised sound source localization is usually challenged by the modality
inconsistency. In recent studies, contrastive learning based strategies have shown …

Revisit weakly-supervised audio-visual video parsing from the language perspective

Y Fan, Y Wu, B Du, Y Lin - Advances in Neural Information …, 2024 - proceedings.neurips.cc
We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to
identify and locate all the events in audio/visual modalities. Previous works only concentrate …

Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey

A Shahabaz, S Sarkar - IEEE Access, 2024 - ieeexplore.ieee.org
The joint analysis of audio and video is a powerful tool that can be applied to various
contexts, including action, speech, and sound recognition, audio-visual video parsing …

How does Layer Normalization improve Batch Normalization in self-supervised sound source localization?

T Liu, P Zhang, W Huang, Y Zha, T You, Y Zhang - Neurocomputing, 2024 - Elsevier
Self-supervised sound source localization is usually challenged by the unexpected large
input and incorrect direction of normalization in current solutions. A promising way for this …

Audio-visual spatial integration and recursive attention for robust sound source localization

SJ Um, D Kim, JU Kim - Proceedings of the 31st ACM International …, 2023 - dl.acm.org
The objective of the sound source localization task is to enable machines to detect the
location of sound-making objects within a visual scene. While the audio modality provides …

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

D Kim, SJ Um, S Lee, JU Kim - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The goal of the multi-sound source localization task is to localize sound sources from the
mixture individually. While recent multi-sound source localization methods have shown …