Learning transferable visual models from natural language supervision

A Radford, JW Kim, C Hallacy, et al. - International Conference on Machine Learning, 2021 - proceedings.mlr.press
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined
object categories. This restricted form of supervision limits their generality and usability since …

K-LITE: Learning transferable visual models with external knowledge

S Shen, C Li, X Hu, Y Xie, J Yang, et al. - Advances in Neural Information Processing Systems, 2022 - proceedings.neurips.cc
The new generation of state-of-the-art computer vision systems is trained from natural
language supervision, ranging from simple object category names to descriptive captions …

ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models

C Li, H Liu, L Li, P Zhang, J Aneja, et al. - Advances in Neural Information Processing Systems, 2022 - proceedings.neurips.cc
Learning visual representations from natural language supervision has recently shown great
promise in a number of pioneering works. In general, these language-augmented visual …

SuS-X: Training-free name-only transfer of vision-language models

V Udandarao, A Gupta, S Albanie - Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet
effective way to train large-scale vision-language models. CLIP demonstrates impressive zero-shot …

Generative pretraining from pixels

M Chen, A Radford, R Child, J Wu, et al. - International Conference on Machine Learning, 2020 - proceedings.mlr.press
Inspired by progress in unsupervised representation learning for natural language, we
examine whether similar models can learn useful representations for images. We train a …

Learning vision from models rivals learning vision from data

Y Tian, L Fan, K Chen, D Katabi, et al. - Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024 - openaccess.thecvf.com
We introduce SynCLR, a novel approach for learning visual representations exclusively from
synthetic images without any real data. We synthesize a large dataset of image captions …

Learning to decompose visual features with latent textual prompts

F Wang, M Li, X Lin, H Lv, AG Schwing, H Ji - arXiv preprint, 2022 - arxiv.org
Recent advances in pre-training vision-language models like CLIP have shown great
potential in learning transferable visual representations. Nonetheless, for downstream …

StableRep: Synthetic images from text-to-image models make strong visual representation learners

Y Tian, L Fan, P Isola, H Chang, et al. - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
We investigate the potential of learning visual representations using synthetic images
generated by text-to-image models. This is a natural question in light of the excellent …

VILA: On pre-training for visual language models

J Lin, H Yin, W Ping, P Molchanov, et al. - Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024 - openaccess.thecvf.com
Visual language models (VLMs) have progressed rapidly with the recent success of large
language models. There have been growing efforts on visual instruction tuning to extend the …
