Exploiting transformation invariance and equivariance for self-supervised sound localisation

A Senocak, H Ryu, J Kim, TH Oh… - Proceedings of the …, 2023 - openaccess.thecvf.com

Humans can easily perceive the direction of sound sources in a visual scene, termed sound
source localization. Recent studies on learning-based sound source localization have …

Cited by 12 · Related articles · All 8 versions

Learning audio-visual source localization via false negative aware contrastive learning

W Sun, J Zhang, J Wang, Z Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Self-supervised audio-visual source localization aims to locate sound-source objects in
video frames without extra annotations. Recent methods often approach this goal with the …

Cited by 27 · Related articles · All 6 versions

Open-vocabulary semantic segmentation via attribute decomposition-aggregation

C Ma, Y Yuhuan, C Ju, F Zhang… - Advances in Neural …, 2024 - proceedings.neurips.cc

Open-vocabulary semantic segmentation is a challenging task that requires segmenting
novel object categories at inference time. Recent works explore vision-language pre-training …

Cited by 10 · Related articles · All 4 versions

Annotation-free audio-visual segmentation

J Liu, Y Wang, C Ju, C Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com

The objective of Audio-Visual Segmentation (AVS) is to localise the sounding
objects within visual scenes by accurately predicting pixel-wise segmentation masks. To …

Cited by 27 · Related articles · All 6 versions

Audio-visual segmentation by exploring cross-modal mutual semantics

C Liu, PP Li, X Qi, H Zhang, L Li, D Wang… - Proceedings of the 31st …, 2023 - dl.acm.org

The audio-visual segmentation (AVS) task aims to segment sounding objects from a given
video. Existing works mainly focus on fusing audio and visual features of a given video to …

Cited by 15 · Related articles · All 3 versions

Distilling vision-language pre-training to collaborate with weakly-supervised temporal action localization

C Ju, K Zheng, J Liu, P Zhao, Y Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com

Weakly-supervised temporal action localization (WTAL) learns to detect and classify action
instances with only category labels. Most methods widely adopt the off-the-shelf …

Cited by 19 · Related articles · All 6 versions

Dual mean-teacher: An unbiased semi-supervised framework for audio-visual source localization

Y Guo, S Ma, H Su, Z Wang, Y Zhao… - Advances in …, 2024 - proceedings.neurips.cc

Audio-Visual Source Localization (AVSL) aims to locate sounding objects within
video frames given the paired audio clips. Existing methods predominantly rely on self …

Cited by 3 · Related articles · All 7 versions

Audio-Visual Segmentation via Unlabeled Frame Exploitation

J Liu, Y Liu, F Zhang, C Ju… - Proceedings of the …, 2024 - openaccess.thecvf.com

Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames.
Although great progress has been witnessed, we experimentally reveal that current methods …

Cited by 2 · Related articles · All 4 versions

Can CLIP Help Sound Source Localization?

S Park, A Senocak, JS Chung - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

Large-scale pre-trained image-text models demonstrate remarkable versatility across
diverse tasks, benefiting from their robust representational capabilities and effective …

Cited by 3 · Related articles · All 5 versions

Audio-aware query-enhanced transformer for audio-visual segmentation

J Liu, C Ju, C Ma, Y Wang, Y Wang, Y Zhang - arXiv preprint arXiv …, 2023 - arxiv.org

The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in
the video frames using audio cues. However, current fusion-based methods have the …

Cited by 11 · Related articles · All 2 versions