iQuery: Instruments as queries for audio-visual sound separation

X Zhu, R Zhang, B He, A Zhou… - Proceedings of the …, 2023 - openaccess.thecvf.com

The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its
application to diverse downstream vision tasks. To improve its capacity on downstream …

Cited by 54 - Related articles - All 5 versions

[PDF] thecvf.com

Binding touch to everything: Learning unified multimodal tactile representations

F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com

The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …

Cited by 26 - Related articles - All 4 versions

[PDF] thecvf.com

WorDepth: Variational language prior for monocular depth estimation

Z Zeng, D Wang, F Yang, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com

Three-dimensional (3D) reconstruction from a single image is an ill-posed problem
with inherent ambiguities, i.e., scale. Predicting a 3D scene from text description(s) is similarly …

Cited by 15 - Related articles - All 3 versions

[PDF] thecvf.com

LAVSS: Location-guided audio-visual spatial audio separation

Y Ye, W Yang, Y Tian - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com

Existing machine learning research has achieved promising results in monaural audio-
visual separation (MAVS). However, most MAVS methods purely consider what the sound …

Cited by 6 - Related articles - All 5 versions

[PDF] arxiv.org

NeuroBind: Towards unified multimodal representations for neural signals

F Yang, C Feng, D Wang, T Wang, Z Zeng, Z Xu… - arXiv preprint arXiv …, 2024 - arxiv.org

Understanding neural activity and information representation is crucial for advancing
knowledge of brain function and cognition. Neural activity, measured through techniques …

Cited by 4 - Related articles - All 3 versions

[PDF] thecvf.com

Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling

X Liu, YW Tai, CK Tang, P Miraldo… - Proceedings of the …, 2024 - openaccess.thecvf.com

Extensions of Neural Radiance Fields (NeRFs) to model dynamic scenes have
enabled their near photo-realistic free-viewpoint rendering. Although these methods have …

Cited by 4 - Related articles - All 3 versions

[PDF] ieee.org

Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey

A Shahabaz, S Sarkar - IEEE Access, 2024 - ieeexplore.ieee.org

The joint analysis of audio and video is a powerful tool that can be applied to various
contexts, including action, speech, and sound recognition, audio-visual video parsing …

Cited by 1 - Related articles - All 2 versions

[PDF] arxiv.org

Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation

Y Su, A Vosoughi, S Deng, Y Tian, C Xu - arXiv preprint arXiv:2310.11713, 2023 - arxiv.org

The audio-visual sound separation field assumes visible sources in videos, but this excludes
invisible sounds beyond the camera's view. Current methods struggle with such sounds …

Cited by 2 - Related articles - All 3 versions

[PDF] arxiv.org

Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset

N Cheng, Y Li, J Gao, B Fang, J Xu, W Han - arXiv preprint arXiv …, 2024 - arxiv.org

Tactility provides crucial support and enhancement for the perception and interaction
capabilities of both humans and robots. Nevertheless, the multimodal research related to …

Cited by 1 - Related articles - All 2 versions

[PDF] aaai.org

Independency Adversarial Learning for Cross-Modal Sound Separation

Z Lin, Y Ji, Y Yang - Proceedings of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org

The sound mixture separation is still challenging due to heavy sound overlapping and
disturbance from noise. Unsupervised separation would significantly increase the difficulty …