Sigmoid loss for language image pre-training
We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard
contrastive learning with softmax normalization, the sigmoid loss operates solely on image …
PaliGemma: A versatile 3B VLM for transfer
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …
Data filtering networks
Large training sets have become a cornerstone of machine learning and are the foundation
for recent advances in language modeling and multimodal learning. While data curation for …
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
Recent breakthroughs in vision-language models (VLMs) start a new page in the vision
community. The VLMs provide stronger and more generalizable feature embeddings …
Resolving discrepancies in compute-optimal scaling of language models
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …
ColPali: Efficient document retrieval with vision language models
Documents are visually rich structures that convey information through text, as well as
tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit …
Scaling laws for sparsely-connected foundation models
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained
on massive datasets (i.e., "foundation models"), in both vision and language domains. In this …
ConvNets match vision transformers at scale
Many researchers believe that ConvNets perform well on small or moderately sized
datasets, but are not competitive with Vision Transformers when given access to datasets on …
Scaling and renormalization in high-dimensional regression
This paper presents a succinct derivation of the training and generalization performance of a
variety of high-dimensional ridge regression models using the basic tools of random matrix …
A dynamical model of neural scaling laws
On a variety of tasks, the performance of neural networks predictably improves with training
time, dataset size and model size across many orders of magnitude. This phenomenon is …