Sigmoid loss for language image pre-training
We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard
contrastive learning with softmax normalization, the sigmoid loss operates solely on image …
PaliGemma: A versatile 3B VLM for transfer
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …
Data filtering networks
Large training sets have become a cornerstone of machine learning and are the foundation
for recent advances in language modeling and multimodal learning. While data curation for …
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
Recent breakthroughs in vision-language models (VLMs) start a new page in the vision
community. The VLMs provide stronger and more generalizable feature embeddings …
Resolving discrepancies in compute-optimal scaling of language models
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …
ColPali: Efficient document retrieval with vision language models
Documents are visually rich structures that convey information through text, as well as
tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit …
Scaling laws for sparsely-connected foundation models
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained
on massive datasets (i.e., "foundation models"), in both vision and language domains. In this …
ConvNets match vision transformers at scale
Many researchers believe that ConvNets perform well on small or moderately sized
datasets, but are not competitive with Vision Transformers when given access to datasets on …
Scaling and renormalization in high-dimensional regression
This paper presents a succinct derivation of the training and generalization performance of a
variety of high-dimensional ridge regression models using the basic tools of random matrix …
A dynamical model of neural scaling laws
On a variety of tasks, the performance of neural networks predictably improves with training
time, dataset size and model size across many orders of magnitude. This phenomenon is …