Auto-ACD: A large-scale dataset for audio-language representation learning
Recently, the AI community has made significant strides in developing powerful foundation
models, driven by large-scale multimodal datasets. However, for audio representation …
Learning modality-agnostic representation for semantic segmentation from any modalities
The image modality is not perfect, as it often fails in certain conditions, e.g., night and fast motion.
This significantly limits the robustness and versatility of existing multi-modal (i.e., Image+X) …
Bi-directional training for composed image retrieval via text prompt learning
Composed image retrieval searches for a target image based on a multi-modal user query
comprised of a reference image and modification text describing the desired changes …
Meerkat: Audio-visual large language model for grounding in space and time
Leveraging Large Language Models' remarkable proficiency in text-based tasks,
recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and …
T-vsl: Text-guided visual sound source localization in mixtures
Visual sound source localization poses a significant challenge in identifying the semantic
region of each sounding source within a video. Existing self-supervised and weakly …
Can CLIP Help Sound Source Localization?
Large-scale pre-trained image-text models demonstrate remarkable versatility across
diverse tasks, benefiting from their robust representational capabilities and effective …
UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
We present UniBind, a flexible and efficient approach that learns a unified representation
space for seven diverse modalities: images, text, audio, point cloud, thermal, video, and event …
Audio-aware query-enhanced transformer for audio-visual segmentation
The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in
the video frames using audio cues. However, current fusion-based methods have the …
Decoupled contrastive multi-view clustering with high-order random walks
Recently, some robust contrastive multi-view clustering (MvC) methods have been
proposed, which construct data pairs from neighborhoods to alleviate the false negative …
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
Egocentric videos provide comprehensive contexts for user and scene understanding,
spanning multisensory perception to behavioral interaction. We propose Spherical World …