Sound source localization is all about cross-modal alignment
Humans can easily perceive the direction of sound sources in a visual scene, termed sound
source localization. Recent studies on learning-based sound source localization have …
Learning audio-visual source localization via false negative aware contrastive learning
Self-supervised audio-visual source localization aims to locate sound-source objects in
video frames without extra annotations. Recent methods often approach this goal with the …
Open-vocabulary semantic segmentation via attribute decomposition-aggregation
Open-vocabulary semantic segmentation is a challenging task that requires segmenting
novel object categories at inference time. Recent works explore vision-language pre-training …
Annotation-free audio-visual segmentation
The objective of Audio-Visual Segmentation (AVS) is to localise the sounding
objects within visual scenes by accurately predicting pixel-wise segmentation masks. To …
Audio-visual segmentation by exploring cross-modal mutual semantics
The audio-visual segmentation (AVS) task aims to segment sounding objects from a given
video. Existing works mainly focus on fusing audio and visual features of a given video to …
Distilling vision-language pre-training to collaborate with weakly-supervised temporal action localization
Weakly-supervised temporal action localization (WTAL) learns to detect and classify action
instances with only category labels. Most methods widely adopt the off-the-shelf …
Dual mean-teacher: An unbiased semi-supervised framework for audio-visual source localization
Audio-Visual Source Localization (AVSL) aims to locate sounding objects within
video frames given the paired audio clips. Existing methods predominantly rely on self …
Audio-Visual Segmentation via Unlabeled Frame Exploitation
Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames.
Although great progress has been witnessed, we experimentally reveal that current methods …
Can CLIP Help Sound Source Localization?
Large-scale pre-trained image-text models demonstrate remarkable versatility across
diverse tasks, benefiting from their robust representational capabilities and effective …
Audio-aware query-enhanced transformer for audio-visual segmentation
The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in
the video frames using audio cues. However, current fusion-based methods have the …