A proposal-based paradigm for self-supervised sound source localization in videos

A Senocak, H Ryu, J Kim, TH Oh… - Proceedings of the …, 2023 - openaccess.thecvf.com

Humans can easily perceive the direction of sound sources in a visual scene, termed sound
source localization. Recent studies on learning-based sound source localization have …

Cited by 15 - Related articles - All 8 versions

Can CLIP Help Sound Source Localization?

S Park, A Senocak, JS Chung - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

Large-scale pre-trained image-text models demonstrate remarkable versatility across
diverse tasks, benefiting from their robust representational capabilities and effective …

Cited by 7 - Related articles - All 5 versions

Multimodal imbalance-aware gradient modulation for weakly-supervised audio-visual video parsing

J Fu, J Gao, BK Bao, C Xu - … on Circuits and Systems for Video …, 2023 - ieeexplore.ieee.org

Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal
extents of audio, visual and audio-visual event instances as well as identify the …

Cited by 5 - Related articles - All 3 versions

Exploiting visual context semantics for sound source localization

X Zhou, D Zhou, D Hu, H Zhou… - Proceedings of the …, 2023 - openaccess.thecvf.com

Self-supervised sound source localization in unconstrained visual scenes is an important
task of audio-visual learning. In this paper, we propose a visual reasoning module to …

Cited by 10 - Related articles - All 3 versions

Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

T Liu, P Zhang, W Huang, Y Zha, T You… - Proceedings of the 31st …, 2023 - dl.acm.org

Self-supervised sound source localization is usually challenged by the modality
inconsistency. In recent studies, contrastive learning based strategies have shown …

Cited by 3 - Related articles - All 3 versions

Revisit weakly-supervised audio-visual video parsing from the language perspective

Y Fan, Y Wu, B Du, Y Lin - Advances in Neural Information …, 2024 - proceedings.neurips.cc

We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to
identify and locate all the events in audio/visual modalities. Previous works only concentrate …

Cited by 6 - Related articles - All 5 versions

Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey

A Shahabaz, S Sarkar - IEEE Access, 2024 - ieeexplore.ieee.org

The joint analysis of audio and video is a powerful tool that can be applied to various
contexts, including action, speech, and sound recognition, audio-visual video parsing …

Cited by 2 - Related articles - All 2 versions

How does Layer Normalization improve Batch Normalization in self-supervised sound source localization?

T Liu, P Zhang, W Huang, Y Zha, T You, Y Zhang - Neurocomputing, 2024 - Elsevier

Self-supervised sound source localization is usually challenged by the unexpected large
input and incorrect direction of normalization in current solutions. A promising way for this …

Cited by 3 - Related articles - All 2 versions

Audio-visual spatial integration and recursive attention for robust sound source localization

SJ Um, D Kim, JU Kim - Proceedings of the 31st ACM International …, 2023 - dl.acm.org

The objective of the sound source localization task is to enable machines to detect the
location of sound-making objects within a visual scene. While the audio modality provides …

Cited by 3 - Related articles - All 4 versions

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

D Kim, SJ Um, S Lee, JU Kim - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

The goal of the multi-sound source localization task is to localize sound sources from the
mixture individually. While recent multi-sound source localization methods have shown …

Cited by 2 - Related articles - All 3 versions