Challenges and applications of large language models

J Kaddour, J Harris, M Mozes, H Bradley… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …

A survey on data selection for language models

A Albalak, Y Elazar, SM Xie, S Longpre… - arXiv preprint arXiv …, 2024 - arxiv.org
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …

Reproducible scaling laws for contrastive language-image learning

M Cherti, R Beaumont, R Wightman… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scaling up neural networks has led to remarkable performance across a wide range of
tasks. Moreover, performance often follows reliable scaling laws as a function of training set …

OBELICS: An open web-scale filtered dataset of interleaved image-text documents

H Laurençon, L Saulnier, L Tronchon… - Advances in …, 2024 - proceedings.neurips.cc
Large multimodal models trained on natural documents, which interleave images and text,
outperform models trained on image-text pairs on various multimodal benchmarks …

Scaling data-constrained language models

N Muennighoff, A Rush, B Barak… - Advances in …, 2023 - proceedings.neurips.cc
The current trend of scaling language models involves increasing both parameter count and
training dataset size. Extrapolating this trend suggests that training dataset size may soon be …

DataComp: In search of the next generation of multimodal datasets

SY Gadre, G Ilharco, A Fang… - Advances in …, 2024 - proceedings.neurips.cc
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable
Diffusion and GPT-4, yet their design does not receive the same research attention as model …

D4: Improving LLM pretraining via document de-duplication and diversification

K Tirumala, D Simig, A Aghajanyan… - Advances in Neural …, 2023 - proceedings.neurips.cc
Over recent years, an increasing amount of compute and data has been poured into training
large language models (LLMs), usually by doing one-pass learning on as many tokens as …

Scaling laws of synthetic images for model training... for now

L Fan, K Chen, D Krishnan, D Katabi… - Proceedings of the …, 2024 - openaccess.thecvf.com
Significant recent advances in text-to-image models unlock the possibility of training vision
systems using synthetic images, potentially overcoming the difficulty of collecting curated …

Frontier AI regulation: Managing emerging risks to public safety

M Anderljung, J Barnhart, A Korinek, J Leung… - arXiv preprint arXiv …, 2023 - arxiv.org
Advanced AI models hold the promise of tremendous benefits for humanity, but society
needs to proactively manage the accompanying risks. In this paper, we focus on what we …

Quality not quantity: On the interaction between dataset design and robustness of CLIP

T Nguyen, G Ilharco, M Wortsman… - Advances in Neural …, 2022 - proceedings.neurips.cc
Web-crawled datasets have enabled remarkable generalization capabilities in recent image-
text models such as CLIP (Contrastive Language-Image Pre-training) or Flamingo, but little …