Large language models in medicine

AJ Thirunavukarasu, DSJ Ting, K Elangovan… - Nature Medicine, 2023 - nature.com
Large language models (LLMs) can respond to free-text queries without being specifically
trained in the task in question, causing excitement and concern about their use in healthcare …

Challenges and applications of large language models

J Kaddour, J Harris, M Mozes, H Bradley… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

Scaling data-constrained language models

N Muennighoff, A Rush, B Barak… - Advances in …, 2024 - proceedings.neurips.cc
The current trend of scaling language models involves increasing both parameter count and
training dataset size. Extrapolating this trend suggests that training dataset size may soon be …

The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data only

G Penedo, Q Malartic, D Hesslow… - Advances in …, 2023 - proceedings.neurips.cc
Large language models are commonly trained on a mixture of filtered web data and
curated "high-quality" corpora, such as social media conversations, books, or technical …

A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity

S Longpre, G Yauney, E Reif, K Lee, A Roberts… - arXiv preprint arXiv …, 2023 - arxiv.org
Pretraining is the preliminary and fundamental step in developing capable language models
(LMs). Despite this, pretraining data design is critically under-documented and often guided …

Red teaming ChatGPT via jailbreaking: Bias, robustness, reliability and toxicity

TY Zhuo, Y Huang, C Chen, Z Xing - arXiv preprint arXiv:2301.12867, 2023 - arxiv.org
Recent breakthroughs in natural language processing (NLP) have permitted the synthesis
and comprehension of coherent text in an open-ended way, thereby translating the …

Self-consuming generative models go mad

S Alemohammad, J Casco-Rodriguez, L Luzi… - arXiv preprint arXiv …, 2023 - arxiv.org
Seismic advances in generative AI algorithms for imagery, text, and other data types have led
to the temptation to use synthetic data to train next-generation models. Repeating this …

To repeat or not to repeat: Insights from scaling LLMs under token-crisis

F Xue, Y Fu, W Zhou, Z Zheng… - Advances in Neural …, 2024 - proceedings.neurips.cc
Recent research has highlighted the importance of dataset size in scaling language models.
However, large language models (LLMs) are notoriously token-hungry during pre-training …

When foundation model meets federated learning: Motivations, challenges, and future directions

W Zhuang, C Chen, L Lyu - arXiv preprint arXiv:2306.15546, 2023 - arxiv.org
The intersection of the Foundation Model (FM) and Federated Learning (FL) provides mutual
benefits, presents a unique opportunity to unlock new possibilities in AI research, and …