Auto-ACD: A large-scale dataset for audio-language representation learning

L Sun, X Xu, M Wu, W Xie - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org
Recently, the AI community has made significant strides in developing powerful foundation
models, driven by large-scale multimodal datasets. However, for audio representation …

Learning modality-agnostic representation for semantic segmentation from any modalities

X Zheng, Y Lyu, L Wang - European Conference on Computer Vision, 2025 - Springer
Image modality is not perfect as it often fails in certain conditions, e.g., night and fast motion.
This significantly limits the robustness and versatility of existing multi-modal (i.e., Image+X) …

Bi-directional training for composed image retrieval via text prompt learning

Z Liu, W Sun, Y Hong, D Teney… - Proceedings of the …, 2024 - openaccess.thecvf.com
Composed image retrieval searches for a target image based on a multi-modal user query
comprised of a reference image and modification text describing the desired changes …

Meerkat: Audio-visual large language model for grounding in space and time

S Chowdhury, S Nag, S Dasgupta, J Chen… - … on Computer Vision, 2025 - Springer
Leveraging Large Language Models' remarkable proficiency in text-based tasks,
recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and …

T-VSL: Text-guided visual sound source localization in mixtures

T Mahmud, Y Tian… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Visual sound source localization poses a significant challenge in identifying the semantic
region of each sounding source within a video. Existing self-supervised and weakly …

Can CLIP Help Sound Source Localization?

S Park, A Senocak, JS Chung - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Large-scale pre-trained image-text models demonstrate remarkable versatility across
diverse tasks, benefiting from their robust representational capabilities and effective …

UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

Y Lyu, X Zheng, J Zhou, L Wang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
We present UniBind, a flexible and efficient approach that learns a unified representation
space for seven diverse modalities: images, text, audio, point cloud, thermal, video, and event …

Audio-aware query-enhanced transformer for audio-visual segmentation

J Liu, C Ju, C Ma, Y Wang, Y Wang, Y Zhang - arXiv preprint arXiv …, 2023 - arxiv.org
The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in
the video frames using audio cues. However, current fusion-based methods have the …

Decoupled contrastive multi-view clustering with high-order random walks

Y Lu, Y Lin, M Yang, D Peng, P Hu… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Recently, some robust contrastive multi-view clustering (MvC) methods have been
proposed, which construct data pairs from neighborhoods to alleviate the false negative …

Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

H Yun, R Gao, I Ananthabhotla, A Kumar… - … on Computer Vision, 2025 - Springer
Egocentric videos provide comprehensive contexts for user and scene understanding,
spanning multisensory perception to behavioral interaction. We propose Spherical World …