Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement

X Zhu, R Zhang, B He, A Zhou… - Proceedings of the …, 2023 - openaccess.thecvf.com
The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its
application to diverse downstream vision tasks. To improve its capacity on downstream …

Binding touch to everything: Learning unified multimodal tactile representations

F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …

WorDepth: Variational language prior for monocular depth estimation

Z Zeng, D Wang, F Yang, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com
Three-dimensional (3D) reconstruction from a single image is an ill-posed problem
with inherent ambiguities, i.e., scale. Predicting a 3D scene from text description(s) is similarly …

LAVSS: Location-guided audio-visual spatial audio separation

Y Ye, W Yang, Y Tian - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com
Existing machine learning research has achieved promising results in monaural audio-
visual separation (MAVS). However, most MAVS methods purely consider what the sound …

NeuroBind: Towards unified multimodal representations for neural signals

F Yang, C Feng, D Wang, T Wang, Z Zeng, Z Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Understanding neural activity and information representation is crucial for advancing
knowledge of brain function and cognition. Neural activity, measured through techniques …

Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling

X Liu, YW Tai, CK Tang, P Miraldo… - Proceedings of the …, 2024 - openaccess.thecvf.com
Extensions of Neural Radiance Fields (NeRFs) to model dynamic scenes have
enabled their near photo-realistic free-viewpoint rendering. Although these methods have …

Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey

A Shahabaz, S Sarkar - IEEE Access, 2024 - ieeexplore.ieee.org
The joint analysis of audio and video is a powerful tool that can be applied to various
contexts, including action, speech, and sound recognition, audio-visual video parsing …

Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation

Y Su, A Vosoughi, S Deng, Y Tian, C Xu - arXiv preprint arXiv:2310.11713, 2023 - arxiv.org
The audio-visual sound separation field assumes visible sources in videos, but this excludes
invisible sounds beyond the camera's view. Current methods struggle with such sounds …

Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset

N Cheng, Y Li, J Gao, B Fang, J Xu, W Han - arXiv preprint arXiv …, 2024 - arxiv.org
Tactility provides crucial support and enhancement for the perception and interaction
capabilities of both humans and robots. Nevertheless, the multimodal research related to …

Independency Adversarial Learning for Cross-Modal Sound Separation

Z Lin, Y Ji, Y Yang - Proceedings of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
The sound mixture separation is still challenging due to heavy sound overlapping and
disturbance from noise. Unsupervised separation would significantly increase the difficulty …