Auto-ACD: A large-scale dataset for audio-language representation learning

L Sun, X Xu, M Wu, W Xie - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org
Recently, the AI community has made significant strides in developing powerful foundation
models, driven by large-scale multimodal datasets. However, for audio representation …

Learning modality-agnostic representation for semantic segmentation from any modalities

X Zheng, Y Lyu, L Wang - European Conference on Computer Vision, 2025 - Springer
Image modality is not perfect as it often fails in certain conditions, e.g., night and fast motion.
This significantly limits the robustness and versatility of existing multi-modal (i.e., Image+X) …

Bi-directional training for composed image retrieval via text prompt learning

Z Liu, W Sun, Y Hong, D Teney… - Proceedings of the …, 2024 - openaccess.thecvf.com
Composed image retrieval searches for a target image based on a multi-modal user query
comprised of a reference image and modification text describing the desired changes …

Meerkat: Audio-visual large language model for grounding in space and time

S Chowdhury, S Nag, S Dasgupta, J Chen… - … on Computer Vision, 2025 - Springer
Leveraging Large Language Models' remarkable proficiency in text-based tasks,
recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and …

T-VSL: Text-guided visual sound source localization in mixtures

T Mahmud, Y Tian… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Visual sound source localization poses a significant challenge in identifying the semantic
region of each sounding source within a video. Existing self-supervised and weakly …

Can CLIP Help Sound Source Localization?

S Park, A Senocak, JS Chung - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Large-scale pre-trained image-text models demonstrate remarkable versatility across
diverse tasks, benefiting from their robust representational capabilities and effective …

UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

Y Lyu, X Zheng, J Zhou, L Wang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
We present UniBind, a flexible and efficient approach that learns a unified representation
space for seven diverse modalities: images, text, audio, point cloud, thermal, video, and event …

Audio-aware query-enhanced transformer for audio-visual segmentation

J Liu, C Ju, C Ma, Y Wang, Y Wang, Y Zhang - arXiv preprint arXiv …, 2023 - arxiv.org
The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in
the video frames using audio cues. However, current fusion-based methods have the …

Decoupled contrastive multi-view clustering with high-order random walks

Y Lu, Y Lin, M Yang, D Peng, P Hu… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Recently, some robust contrastive multi-view clustering (MvC) methods have been
proposed, which construct data pairs from neighborhoods to alleviate the false negative …

Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

H Yun, R Gao, I Ananthabhotla, A Kumar… - … on Computer Vision, 2025 - Springer
Egocentric videos provide comprehensive contexts for user and scene understanding,
spanning multisensory perception to behavioral interaction. We propose Spherical World …