Challenges and applications of large language models
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …
A survey on data selection for language models
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …
Reproducible scaling laws for contrastive language-image learning
M Cherti, R Beaumont, R Wightman… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scaling up neural networks has led to remarkable performance across a wide range of
tasks. Moreover, performance often follows reliable scaling laws as a function of training set …
OBELICS: An open web-scale filtered dataset of interleaved image-text documents
H Laurençon, L Saulnier, L Tronchon… - Advances in …, 2024 - proceedings.neurips.cc
Large multimodal models trained on natural documents, which interleave images and text,
outperform models trained on image-text pairs on various multimodal benchmarks …
Scaling data-constrained language models
The current trend of scaling language models involves increasing both parameter count and
training dataset size. Extrapolating this trend suggests that training dataset size may soon be …
DataComp: In search of the next generation of multimodal datasets
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable
Diffusion and GPT-4, yet their design does not receive the same research attention as model …
D4: Improving LLM pretraining via document de-duplication and diversification
Over recent years, an increasing amount of compute and data has been poured into training
large language models (LLMs), usually by doing one-pass learning on as many tokens as …
Scaling laws of synthetic images for model training... for now
Recent significant advances in text-to-image models unlock the possibility of training vision
systems using synthetic images, potentially overcoming the difficulty of collecting curated …
Frontier AI regulation: Managing emerging risks to public safety
Advanced AI models hold the promise of tremendous benefits for humanity, but society
needs to proactively manage the accompanying risks. In this paper, we focus on what we …
Quality not quantity: On the interaction between dataset design and robustness of CLIP
Web-crawled datasets have enabled remarkable generalization capabilities in recent image-
text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little …