Sigmoid loss for language image pre-training

X Zhai, B Mustafa, A Kolesnikov… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard
contrastive learning with softmax normalization, the sigmoid loss operates solely on image …
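
The snippet is cut off, but the loss itself is simple enough to sketch. Below is a minimal NumPy illustration of a pairwise sigmoid loss of this kind: every image-text pair in the batch is scored as an independent binary classification, positives on the diagonal, negatives off it, with no softmax normalization over the batch. The defaults t=10 and b=-10 follow the initialization reported in the paper as best I recall; the batch and embedding sizes are arbitrary.

```python
import numpy as np

def sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of image/text embeddings.

    t (temperature) and b (bias) are learnable scalars in the paper;
    they are fixed constants here purely for illustration.
    """
    n = img_emb.shape[0]
    # L2-normalize, then score all n*n image-text pairs at once.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = t * (img @ txt.T) + b       # (n, n) pairwise scores
    labels = 2.0 * np.eye(n) - 1.0       # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(z * logit) == log(1 + exp(-z * logit))
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 512))          # dummy image embeddings
txt = rng.normal(size=(8, 512))          # dummy paired text embeddings
print(sigmoid_loss(img, txt))
```

Because each pair is scored independently, the loss needs no normalization across the batch, which is what makes it attractive at very large batch sizes.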

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …
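
The snippet describes the architecture only at a high level. The following is a hedged NumPy sketch of the general composition pattern: vision-encoder output tokens are linearly projected into the language model's embedding space and prepended to the text-token embeddings. Every function here is an illustrative stand-in, not the released PaliGemma code, and the widths 1152 (SigLIP-So400m) and 2048 (Gemma-2B) are quoted from memory.

```python
import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(image, num_patches=256, vision_dim=1152):
    """Stand-in for SigLIP-So400m: image -> sequence of patch embeddings."""
    return rng.normal(size=(num_patches, vision_dim))

def project_to_lm(patches, lm_dim=2048):
    """Linear projection from the vision width to the LM width."""
    w = rng.normal(size=(patches.shape[1], lm_dim)) * 0.02
    return patches @ w

def embed_text(token_ids, lm_dim=2048):
    """Stand-in for the LM's token-embedding lookup."""
    return rng.normal(size=(len(token_ids), lm_dim)) * 0.02

image = np.zeros((224, 224, 3))                # dummy input image
prompt = [2, 1037, 4099]                       # dummy token ids

prefix = project_to_lm(vision_encoder(image))  # (256, 2048) image tokens
text = embed_text(prompt)                      # (3, 2048) text tokens
lm_input = np.concatenate([prefix, text])      # sequence fed to the LM
print(lm_input.shape)                          # (259, 2048)
```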

Data filtering networks

A Fang, AM Jose, A Jain, L Schmidt, A Toshev… - arXiv preprint arXiv …, 2023 - arxiv.org
Large training sets have become a cornerstone of machine learning and are the foundation
for recent advances in language modeling and multimodal learning. While data curation for …

ViTamin: Designing Scalable Vision Models in the Vision-Language Era

J Chen, Q Yu, X Shen, A Yuille… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent breakthroughs in vision-language models (VLMs) start a new page in the vision
community. The VLMs provide stronger and more generalizable feature embeddings …

Resolving discrepancies in compute-optimal scaling of language models

T Porian, M Wortsman, J Jitsev, L Schmidt… - arXiv preprint arXiv …, 2024 - arxiv.org
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …
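
For context, the two laws at issue take roughly the following forms; the exponents are the commonly quoted values from the original papers, not re-derived here.

```latex
% Hoffmann et al. ("Chinchilla") fit a parametric loss in model size N
% and training tokens D,
\[
  L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\]
% and minimizing it under a compute budget $C \approx 6ND$ yields
\[
  N^{*}(C) \;\propto\; C^{\beta/(\alpha+\beta)} \;\approx\; C^{0.5},
\]
% whereas Kaplan et al. report the substantially steeper
\[
  N^{*}(C) \;\propto\; C^{0.73}.
\]
```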

ColPali: Efficient document retrieval with vision language models

M Faysse, H Sibille, T Wu, B Omrani, G Viaud… - arXiv preprint arXiv …, 2024 - arxiv.org
Documents are visually rich structures that convey information through text, as well as
tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit …

Scaling laws for sparsely-connected foundation models

E Frantar, C Riquelme, N Houlsby, D Alistarh… - arXiv preprint arXiv …, 2023 - arxiv.org
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained
on massive datasets (i.e., "foundation models"), in both vision and language domains. In this …

ConvNets match Vision Transformers at scale

SL Smith, A Brock, L Berrada, S De - arXiv preprint arXiv:2310.16764, 2023 - arxiv.org
Many researchers believe that ConvNets perform well on small or moderately sized
datasets, but are not competitive with Vision Transformers when given access to datasets on …

Scaling and renormalization in high-dimensional regression

A Atanasov, JA Zavatone-Veth, C Pehlevan - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents a succinct derivation of the training and generalization performance of a
variety of high-dimensional ridge regression models using the basic tools of random matrix …
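
The basic object under analysis is standard ridge regression; the following block states the estimator and the quantity random matrix theory is used to control.

```latex
% Ridge regression on n samples of dimension d with strength \lambda:
\[
  \hat{w} \;=\; \arg\min_{w}\; \|y - Xw\|_{2}^{2} + \lambda \|w\|_{2}^{2}
          \;=\; \left(X^{\top}X + \lambda I\right)^{-1} X^{\top} y .
\]
% Training and generalization error reduce to trace functionals of the
% resolvent $(X^{\top}X/n + \lambda I)^{-1}$, exactly the kind of object
% random matrix theory characterizes in the proportional limit
% $n, d \to \infty$ with $d/n$ fixed.
```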

A dynamical model of neural scaling laws

B Bordelon, A Atanasov, C Pehlevan - arXiv preprint arXiv:2402.01092, 2024 - arxiv.org
On a variety of tasks, the performance of neural networks predictably improves with training
time, dataset size and model size across many orders of magnitude. This phenomenon is …
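
As a worked illustration of "predictably improves": scaling-law studies typically fit additive power laws of the schematic form below. This generic ansatz is given here for illustration only; it is not the paper's specific dynamical solution.

```latex
% Schematic joint power law in training time t, model size N, and
% dataset size D, with an irreducible floor L_{\infty}:
\[
  L(t, N, D) \;\approx\; L_{\infty}
    + c_{t}\, t^{-\alpha_{t}}
    + c_{N}\, N^{-\alpha_{N}}
    + c_{D}\, D^{-\alpha_{D}},
\]
% where the exponents \alpha_{\bullet} are task- and
% architecture-dependent; a dynamical model aims to predict them.
```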