Mix and localize: Localizing sound sources in mixtures

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org

Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Cited by 62 · Related articles · All 2 versions

[PDF] thecvf.com

Vision transformers are parameter-efficient audio-visual learners

YB Lin, YL Sung, J Lei, M Bansal… - Proceedings of the …, 2023 - openaccess.thecvf.com

Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …

Cited by 69 · Related articles · All 5 versions

[PDF] thecvf.com

Binding touch to everything: Learning unified multimodal tactile representations

F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com

The ability to associate touch with other modalities has huge implications for humans and
computational systems. However multimodal learning with touch remains challenging due to …

Cited by 29 · Related articles · All 4 versions

[PDF] thecvf.com

Audio-visual class-incremental learning

W Pian, S Mo, Y Guo, Y Tian - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

In this paper, we introduce audio-visual class-incremental learning, a class-incremental
learning scenario for audio-visual video recognition. We demonstrate that joint audio-visual …

Cited by 33 · Related articles · All 5 versions

[PDF] thecvf.com

Self-supervised video forensics by audio-visual anomaly detection

C Feng, Z Chen, A Owens - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com

Manipulated videos often contain subtle inconsistencies between their visual and audio
signals. We propose a video forensics method, based on anomaly detection, that can …

Cited by 57 · Related articles · All 6 versions

[PDF] thecvf.com

Audio-visual grouping network for sound localization from mixtures

S Mo, Y Tian - Proceedings of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com

Sound source localization is a typical and challenging task that predicts the location of
sound sources in a video. Previous single-source methods mainly used the audio-visual …

Cited by 41 · Related articles · All 5 versions

[PDF] thecvf.com

Class-incremental grouping network for continual audio-visual learning

S Mo, W Pian, Y Tian - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

Continual learning is a challenging problem in which models need to be trained on
non-stationary data across sequential tasks for class-incremental learning. While previous …

Cited by 22 · Related articles · All 5 versions

[PDF] arxiv.org

Auto-ACD: A large-scale dataset for audio-language representation learning

L Sun, X Xu, M Wu, W Xie - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org

Recently, the AI community has made significant strides in developing powerful foundation
models, driven by large-scale multimodal datasets. However, for audio representation …

Cited by 21 · Related articles · All 2 versions

[PDF] thecvf.com

Egocentric audio-visual object localization

C Huang, Y Tian, A Kumar… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person
view. Likewise, machines are advanced to approach human intelligence by learning with …

Cited by 33 · Related articles · All 5 versions

[PDF] ieee.org

A survey of audio enhancement algorithms for music, speech, bioacoustics, biomedical, industrial and environmental sounds by image U-Net

S Gul, MS Khan - IEEE Access, 2023 - ieeexplore.ieee.org

The recent surge in the use of Deep Neural Networks (DNNs) has also made its mark in the
field of Audio Enhancement (AE), providing much better quality than the classical methods …

Cited by 9 · Related articles · All 2 versions