Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
MaskCLIP: Masked self-distillation advances contrastive language-image pretraining
This paper presents a simple yet effective framework MaskCLIP, which incorporates a newly
proposed masked self-distillation into contrastive language-image pretraining. The core idea …
Compositional chain-of-thought prompting for large multimodal models
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …
Med-UniC: Unifying cross-lingual medical vision-language pre-training by diminishing bias
The scarcity of data presents a critical obstacle to the efficacy of medical vision-language pre-training (VLP). A potential solution lies in the combination of datasets from various language …
SuS-X: Training-free name-only transfer of vision-language models
V Udandarao, A Gupta… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet
effective way to train large-scale vision-language models. CLIP demonstrates impressive …
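Several of the entries above build on CLIP's contrastive language-image objective. The sketch below is a simplified NumPy rendering of that objective, not any one paper's implementation: matched image-text pairs sit on the diagonal of a cosine-similarity matrix, and a symmetric cross-entropy pulls them together while pushing mismatched pairs apart. The function name and temperature value are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss used in CLIP-style contrastive
    language-image pretraining (simplified NumPy sketch)."""
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by the temperature.
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Matched image-text pairs lie on the diagonal.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Symmetric: classify the text given the image, and vice versa.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.normal(size=(4, 8)),
                             rng.normal(size=(4, 8)))
```

With random embeddings the loss is positive; as image and text embeddings of matched pairs align, it approaches zero.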
DILF: Differentiable rendering-based multi-view Image–Language Fusion for zero-shot 3D shape understanding
Zero-shot 3D shape understanding aims to recognize “unseen” 3D categories that are not
present in training data. Recently, Contrastive Language–Image Pre-training (CLIP) has …
Teaching structured vision & language concepts to vision & language models
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …
CLIP goes 3D: Leveraging prompt tuning for language-grounded 3D recognition
D Hegde, JMJ Valanarasu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Vision-Language models like CLIP have been widely adopted for various tasks due to their
impressive zero-shot capabilities. However, CLIP is not suitable for extracting 3D geometric …
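The entry above adapts CLIP via prompt tuning. As a hedged illustration of the general idea (in the style of CoOp-like methods, not this paper's specific 3D pipeline), learnable context vectors are prepended to each class name's token embeddings before text encoding, and images are classified by cosine similarity. All names below, including the mean-pooling stand-in for a real text encoder, are hypothetical.

```python
import numpy as np

def prompt_tuned_logits(ctx, class_token_embs, image_feat,
                        encode_text, temperature=0.07):
    """Sketch of prompt tuning: learnable context vectors `ctx` are
    prepended to each class's token embeddings, encoded, and compared
    to the image feature by cosine similarity."""
    text_feats = []
    for tokens in class_token_embs:
        prompt = np.concatenate([ctx, tokens], axis=0)  # [n_ctx + n_tok, d]
        text_feats.append(encode_text(prompt))
    text_feats = np.stack(text_feats)
    # Cosine similarity between the image and each prompted class.
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    image_feat = image_feat / np.linalg.norm(image_feat)
    return text_feats @ image_feat / temperature

rng = np.random.default_rng(0)
ctx = rng.normal(size=(4, 8))                 # learnable context vectors
classes = [rng.normal(size=(3, 8)),           # token embeddings per class
           rng.normal(size=(2, 8))]
encode = lambda prompt: prompt.mean(axis=0)   # stand-in text encoder
logits = prompt_tuned_logits(ctx, classes, rng.normal(size=8), encode)
```

In actual prompt tuning only `ctx` is optimized, with the pre-trained encoders kept frozen.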
Going beyond nouns with vision & language models using synthetic data
P Cascante-Bonilla, K Shehada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale pre-trained Vision & Language (VL) models have shown remarkable
performance in many applications, enabling the replacement of a fixed set of supported classes with …
CLIPood: Generalizing CLIP to out-of-distributions
Out-of-distribution (OOD) generalization, where the model needs to handle
distribution shifts from training, is a major challenge of machine learning. Contrastive …