Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
MaskCLIP: Masked self-distillation advances contrastive language-image pretraining
This paper presents a simple yet effective framework MaskCLIP, which incorporates a newly
proposed masked self-distillation into contrastive language-image pretraining. The core idea …
Compositional chain-of-thought prompting for large multimodal models
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …
Med-UniC: Unifying cross-lingual medical vision-language pre-training by diminishing bias
The scarcity of data presents a critical obstacle to the efficacy of medical vision-language pre-training (VLP). A potential solution lies in the combination of datasets from various language …
SuS-X: Training-free name-only transfer of vision-language models
V Udandarao, A Gupta… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet
effective way to train large-scale vision-language models. CLIP demonstrates impressive …
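Several of the entries above build on CLIP's contrastive language-image objective. The sketch below is a simplified NumPy rendering of that objective, not any one paper's implementation: matched image-text pairs sit on the diagonal of a cosine-similarity matrix, and a symmetric cross-entropy pulls them together while pushing mismatched pairs apart. The function name and temperature value are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss used in CLIP-style contrastive
    language-image pretraining (simplified NumPy sketch)."""
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by the temperature.
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Matched image-text pairs lie on the diagonal.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Symmetric: classify the text given the image, and vice versa.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.normal(size=(4, 8)),
                             rng.normal(size=(4, 8)))
```

With random embeddings the loss is positive; as image and text embeddings of matched pairs align, it approaches zero.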
DILF: Differentiable rendering-based multi-view Image–Language Fusion for zero-shot 3D shape understanding
Zero-shot 3D shape understanding aims to recognize “unseen” 3D categories that are not
present in training data. Recently, Contrastive Language–Image Pre-training (CLIP) has …
Teaching structured vision & language concepts to vision & language models
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …
CLIP goes 3D: Leveraging prompt tuning for language-grounded 3D recognition
D Hegde, JMJ Valanarasu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Vision-Language models like CLIP have been widely adopted for various tasks due to their
impressive zero-shot capabilities. However, CLIP is not suitable for extracting 3D geometric …
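The entry above adapts CLIP via prompt tuning. As a hedged illustration of the general idea (in the style of CoOp-like methods, not this paper's specific 3D pipeline), learnable context vectors are prepended to each class name's token embeddings before text encoding, and images are classified by cosine similarity. All names below, including the mean-pooling stand-in for a real text encoder, are hypothetical.

```python
import numpy as np

def prompt_tuned_logits(ctx, class_token_embs, image_feat,
                        encode_text, temperature=0.07):
    """Sketch of prompt tuning: learnable context vectors `ctx` are
    prepended to each class's token embeddings, encoded, and compared
    to the image feature by cosine similarity."""
    text_feats = []
    for tokens in class_token_embs:
        prompt = np.concatenate([ctx, tokens], axis=0)  # [n_ctx + n_tok, d]
        text_feats.append(encode_text(prompt))
    text_feats = np.stack(text_feats)
    # Cosine similarity between the image and each prompted class.
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    image_feat = image_feat / np.linalg.norm(image_feat)
    return text_feats @ image_feat / temperature

rng = np.random.default_rng(0)
ctx = rng.normal(size=(4, 8))                 # learnable context vectors
classes = [rng.normal(size=(3, 8)),           # token embeddings per class
           rng.normal(size=(2, 8))]
encode = lambda prompt: prompt.mean(axis=0)   # stand-in text encoder
logits = prompt_tuned_logits(ctx, classes, rng.normal(size=8), encode)
```

In actual prompt tuning only `ctx` is optimized, with the pre-trained encoders kept frozen.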
Going beyond nouns with vision & language models using synthetic data
P Cascante-Bonilla, K Shehada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale pre-trained Vision & Language (VL) models have shown remarkable
performance in many applications, enabling the replacement of a fixed set of supported classes with …
CLIPood: Generalizing CLIP to out-of-distributions
Out-of-distribution (OOD) generalization, where the model needs to handle
distribution shifts from training, is a major challenge of machine learning. Contrastive …