Sound source localization is all about cross-modal alignment
Humans can easily perceive the direction of sound sources in a visual scene, termed sound
source localization. Recent studies on learning-based sound source localization have …
Can CLIP Help Sound Source Localization?
Large-scale pre-trained image-text models demonstrate remarkable versatility across
diverse tasks, benefiting from their robust representational capabilities and effective …
Multimodal imbalance-aware gradient modulation for weakly-supervised audio-visual video parsing
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal
extents of audio, visual and audio-visual event instances as well as identify the …
Exploiting visual context semantics for sound source localization
Self-supervised sound source localization in unconstrained visual scenes is an important
task of audio-visual learning. In this paper, we propose a visual reasoning module to …
Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization
T Liu, P Zhang, W Huang, Y Zha, T You… - Proceedings of the 31st …, 2023 - dl.acm.org
Self-supervised sound source localization is usually challenged by the modality
inconsistency. In recent studies, contrastive learning based strategies have shown …
Revisit weakly-supervised audio-visual video parsing from the language perspective
We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to
identify and locate all the events in audio/visual modalities. Previous works only concentrate …
Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey
A Shahabaz, S Sarkar - IEEE Access, 2024 - ieeexplore.ieee.org
The joint analysis of audio and video is a powerful tool that can be applied to various
contexts, including action, speech, and sound recognition, audio-visual video parsing …
How does Layer Normalization improve Batch Normalization in self-supervised sound source localization?
T Liu, P Zhang, W Huang, Y Zha, T You, Y Zhang - Neurocomputing, 2024 - Elsevier
Self-supervised sound source localization is usually challenged by the unexpected large
input and incorrect direction of normalization in current solutions. A promising way for this …
Audio-visual spatial integration and recursive attention for robust sound source localization
The objective of the sound source localization task is to enable machines to detect the
location of sound-making objects within a visual scene. While the audio modality provides …
Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge
The goal of the multi-sound source localization task is to localize sound sources from the
mixture individually. While recent multi-sound source localization methods have shown …