Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement
Abstract The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its
application to diverse downstream vision tasks. To improve its capacity on downstream …
Binding touch to everything: Learning unified multimodal tactile representations
The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …
Wordepth: Variational language prior for monocular depth estimation
Abstract Three-dimensional (3D) reconstruction from a single image is an ill-posed problem
with inherent ambiguities, i.e., scale. Predicting a 3D scene from text description(s) is similarly …
Lavss: Location-guided audio-visual spatial audio separation
Existing machine learning research has achieved promising results in monaural audio-
visual separation (MAVS). However, most MAVS methods purely consider what the sound …
Neurobind: Towards unified multimodal representations for neural signals
Understanding neural activity and information representation is crucial for advancing
knowledge of brain function and cognition. Neural activity, measured through techniques …
Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling
Abstract Extensions of Neural Radiance Fields (NeRFs) to model dynamic scenes have
enabled their near photo-realistic free-viewpoint rendering. Although these methods have …
Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey
A Shahabaz, S Sarkar - IEEE Access, 2024 - ieeexplore.ieee.org
The joint analysis of audio and video is a powerful tool that can be applied to various
contexts, including action, speech, and sound recognition, audio-visual video parsing …
Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
The audio-visual sound separation field assumes visible sources in videos, but this excludes
invisible sounds beyond the camera's view. Current methods struggle with such sounds …
Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset
Tactility provides crucial support and enhancement for the perception and interaction
capabilities of both humans and robots. Nevertheless, the multimodal research related to …
Independency Adversarial Learning for Cross-Modal Sound Separation
Z Lin, Y Ji, Y Yang - Proceedings of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Sound mixture separation is still challenging due to heavy sound overlapping and
disturbance from noise. Unsupervised separation would significantly increase the difficulty …