AudioScopeV2: Audio-visual attention architectures for calibrated open-domain on-screen sound...

S Mo, P Morgado - Advances in Neural Information …, 2022 - proceedings.neurips.cc

Audio-visual source localization is a challenging task that aims to predict the location of
visual sound sources in a video. Since collecting ground-truth annotations of sounding …

Cited by 53 · Related articles · All 6 versions

Annotation-free audio-visual segmentation

J Liu, Y Wang, C Ju, C Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com

The objective of Audio-Visual Segmentation (AVS) is to localise the sounding
objects within visual scenes by accurately predicting pixel-wise segmentation masks. To …

Cited by 27 · Related articles · All 6 versions

iQuery: Instruments as queries for audio-visual sound separation

J Chen, R Zhang, D Lian, J Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com

Current audio-visual separation methods share a standard architecture design where an
audio encoder-decoder network is fused with visual encoding features at the encoder …

Cited by 16 · Related articles · All 5 versions

Dual mean-teacher: An unbiased semi-supervised framework for audio-visual source localization

Y Guo, S Ma, H Su, Z Wang, Y Zhao… - Advances in …, 2024 - proceedings.neurips.cc

Audio-Visual Source Localization (AVSL) aims to locate sounding objects within
video frames given the paired audio clips. Existing methods predominantly rely on self …

Cited by 3 · Related articles · All 7 versions

Audio-Visual Segmentation via Unlabeled Frame Exploitation

J Liu, Y Liu, F Zhang, C Ju… - Proceedings of the …, 2024 - openaccess.thecvf.com

Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames.
Although great progress has been witnessed, we experimentally reveal that current methods …

Cited by 2 · Related articles · All 4 versions

UNSSOR: unsupervised neural speech separation by leveraging over-determined training mixtures

ZQ Wang, S Watanabe - Advances in Neural Information …, 2024 - proceedings.neurips.cc

In reverberant conditions with multiple concurrent speakers, each microphone acquires a
mixture signal of multiple speakers at a different location. In over-determined conditions …

Cited by 8 · Related articles · All 8 versions

Sound localization from motion: Jointly learning sound direction and camera rotation

Z Chen, S Qian, A Owens - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

The images and sounds that we perceive undergo subtle but geometrically consistent
changes as we rotate our heads. In this paper, we use these cues to solve a problem we call …

Cited by 6 · Related articles · All 7 versions

LAVSS: Location-guided audio-visual spatial audio separation

Y Ye, W Yang, Y Tian - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com

Existing machine learning research has achieved promising results in monaural audio-
visual separation (MAVS). However, most MAVS methods purely consider what the sound …

Cited by 5 · Related articles · All 5 versions

Multimodal imbalance-aware gradient modulation for weakly-supervised audio-visual video parsing

J Fu, J Gao, BK Bao, C Xu - … on Circuits and Systems for Video …, 2023 - ieeexplore.ieee.org

Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal
extents of audio, visual and audio-visual event instances as well as identify the …

Cited by 4 · Related articles · All 3 versions

CrossMAE: Cross-Modality Masked Autoencoders for Region-Aware Audio-Visual Pre-Training

Y Guo, S Sun, S Ma, K Zheng, X Bao… - Proceedings of the …, 2024 - openaccess.thecvf.com

Learning joint and coordinated features across modalities is essential for many audio-visual
tasks. Existing pre-training methods primarily focus on global information, neglecting fine …