Sound source localization is all about cross-modal alignment

A Senocak, H Ryu, J Kim, TH Oh… - Proceedings of the …, 2023 - openaccess.thecvf.com
Humans can easily perceive the direction of sound sources in a visual scene, termed sound
source localization. Recent studies on learning-based sound source localization have …

Meerkat: Audio-visual large language model for grounding in space and time

S Chowdhury, S Nag, S Dasgupta, J Chen… - … on Computer Vision, 2025 - Springer
Leveraging Large Language Models' remarkable proficiency in text-based tasks,
recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and …

Audio-Visual Segmentation via Unlabeled Frame Exploitation

J Liu, Y Liu, F Zhang, C Ju… - Proceedings of the …, 2024 - openaccess.thecvf.com
Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames.
Although great progress has been witnessed, we experimentally reveal that current methods …

Can CLIP Help Sound Source Localization?

S Park, A Senocak, JS Chung - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Large-scale pre-trained image-text models demonstrate remarkable versatility across
diverse tasks, benefiting from their robust representational capabilities and effective …

Audio–Visual Segmentation based on robust principal component analysis

S Fang, Q Zhu, Q Wu, S Wu, S Xie - Expert Systems with Applications, 2024 - Elsevier
Audio–Visual Segmentation (AVS) aims to extract the sounding objects from a
video. The current learning-based AVS methods are often supervised, which rely on specific …

Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey

A Shahabaz, S Sarkar - IEEE Access, 2024 - ieeexplore.ieee.org
The joint analysis of audio and video is a powerful tool that can be applied to various
contexts, including action, speech, and sound recognition, audio-visual video parsing …

Audio-visual spatial integration and recursive attention for robust sound source localization

SJ Um, D Kim, JU Kim - Proceedings of the 31st ACM International …, 2023 - dl.acm.org
The objective of the sound source localization task is to enable machines to detect the
location of sound-making objects within a visual scene. While the audio modality provides …

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

D Kim, SJ Um, S Lee, JU Kim - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The goal of the multi-sound source localization task is to localize sound sources from the
mixture individually. While recent multi-sound source localization methods have shown …

Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment

A Senocak, H Ryu, J Kim, TH Oh, H Pfister… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies on learning-based sound source localization have mainly focused on the
localization performance perspective. However, prior work and existing benchmarks …

Robust Audio-Visual Contrastive Learning for Proposal-based Self-supervised Sound Source Localization in Videos

H Xuan, Z Wu, J Yang, B Jiang, L Luo… - IEEE transactions on …, 2024 - ieeexplore.ieee.org
By observing a scene and listening to corresponding audio cues, humans can easily
recognize where the sound is. To achieve such cross-modal perception on machines …