Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving

A Didolkar, A Goyal, NR Ke, S Guo, M Valko… - arXiv preprint arXiv …, 2024 - arxiv.org
Metacognitive knowledge refers to humans' intuitive knowledge of their own thinking and
reasoning processes. Today's best LLMs clearly possess some reasoning processes. The …

Data curation via joint example selection further accelerates multimodal learning

T Evans, N Parthasarathy, H Merzic… - arXiv preprint arXiv …, 2024 - arxiv.org
Data curation is an essential component of large-scale pretraining. In this work, we
demonstrate that jointly selecting batches of data is more effective for learning than selecting …
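
The snippet cuts off before the mechanism, so here is a minimal sketch of what "jointly selecting batches" could look like, assuming a trainable learner and a frozen reference model: each candidate gets a learnability score (learner loss minus reference loss), and the sub-batch is built greedily with a redundancy penalty so picks are scored relative to one another. The greedy penalty, `joint_batch_selection`, and the toy linear models are illustrative assumptions standing in for the paper's actual joint sampler, not the authors' code.

```python
import torch
import torch.nn.functional as F

def example_losses(model, inputs, targets):
    """Per-example cross-entropy loss under a model."""
    return F.cross_entropy(model(inputs), targets, reduction="none")

def joint_batch_selection(learner, reference, inputs, targets, batch_size):
    """Greedily build a sub-batch, scoring examples jointly.

    'Learnability' (learner loss minus reference loss) favours examples
    the learner has not yet mastered but the reference finds easy; the
    similarity penalty discourages near-duplicate picks, which is the
    'joint' aspect this sketch approximates.
    """
    with torch.no_grad():
        scores = (example_losses(learner, inputs, targets)
                  - example_losses(reference, inputs, targets))
        feats = F.normalize(inputs.flatten(1), dim=1)  # crude similarity features

    chosen = []
    for _ in range(batch_size):
        masked = scores.clone()
        if chosen:
            masked[chosen] = float("-inf")        # never re-pick an example
            sim = feats @ feats[chosen].T         # cosine similarity to picks so far
            masked = masked - sim.max(dim=1).values
        chosen.append(int(masked.argmax()))
    return chosen

# Toy usage with linear stand-ins for the learner and reference models:
learner, reference = torch.nn.Linear(16, 4), torch.nn.Linear(16, 4)
x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))
print(joint_batch_selection(learner, reference, x, y, batch_size=8))
```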

Large Language Model-guided Document Selection

X Kong, T Gunter, R Pang - arXiv preprint arXiv:2406.04638, 2024 - arxiv.org
Large Language Model (LLM) pre-training exhausts an ever-growing compute budget, yet
recent research has demonstrated that careful document selection enables comparable …
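
One plausible reading of "careful document selection" here is an LLM acting as a quality rater over a candidate pool. The sketch below is an assumption-laden illustration: `llm_quality_score`, its prompt, and the keep-threshold are all hypothetical, and the model call is stubbed out with a length heuristic so the script runs as-is.

```python
import textwrap

def llm_quality_score(document: str) -> float:
    """Hypothetical stand-in for a call to an instruction-tuned LLM.
    A real pipeline would send `prompt` to a model endpoint and parse
    the returned rating; here we fake it so the example is runnable."""
    prompt = textwrap.dedent(f"""\
        Rate the following document from 0 to 10 for how useful it would
        be as language-model pretraining data. Reply with a number only.

        {document}""")
    _ = prompt  # would be sent to the LLM in a real pipeline
    return min(10.0, len(document.split()) / 2)  # fake deterministic score

corpus = [
    "buy cheap watches click here click here",
    "The mitochondrion is the organelle responsible for oxidative "
    "phosphorylation, producing most of a cell's ATP supply.",
]

# Keep only documents the (stub) rater scores above a threshold.
kept = [doc for doc in corpus if llm_quality_score(doc) >= 4.0]
print(f"kept {len(kept)} of {len(corpus)} documents")
```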

Data-Centric AI in the Age of Large Language Models

X Xu, Z Wu, R Qiao, A Verma, Y Shu, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
This position paper proposes a data-centric viewpoint of AI research, focusing on large
language models (LLMs). We start by making the key observation that data is instrumental in …

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Z Yu, S Das, C Xiong - arXiv preprint arXiv:2406.06046, 2024 - arxiv.org
Pretraining data selection has the potential to improve language model pretraining efficiency
by utilizing higher-quality data from massive web data corpora. Current data selection …
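
The abstract cuts off before describing the data influence models, so the following is a rough sketch of the general pattern under one assumption: an expensive influence oracle is probed on a small sample and distilled into a cheap regressor that then scores the full pool. The features, closed-form ridge regressor, and `measure_influence` oracle are synthetic stand-ins; a model-aware pipeline would also refresh the probe-and-fit steps periodically as pretraining advances.

```python
import numpy as np

rng = np.random.default_rng(0)

def measure_influence(doc_feats):
    """Oracle stand-in: in an influence-based pipeline this would be the
    change in a held-out reference loss after briefly training on the
    document. Here it is a synthetic function of the features."""
    return doc_feats @ np.array([0.5, -0.2, 0.8, 0.1]) \
        + rng.normal(0, 0.05, len(doc_feats))

# 1) Probe a small sample of the pool with the expensive oracle.
pool = rng.normal(size=(10_000, 4))           # feature vectors for candidate docs
probe_idx = rng.choice(len(pool), 256, replace=False)
probe_X, probe_y = pool[probe_idx], measure_influence(pool[probe_idx])

# 2) Fit a cheap influence model (closed-form ridge regression).
lam = 1e-2
w = np.linalg.solve(probe_X.T @ probe_X + lam * np.eye(4), probe_X.T @ probe_y)

# 3) Score the whole pool cheaply and keep the top-k documents.
pred_influence = pool @ w
top_k = np.argsort(pred_influence)[-1024:]
print("selected", len(top_k), "docs; mean predicted influence:",
      pred_influence[top_k].mean().round(3))
```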

Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

T Bai, L Yang, ZH Wong, J Peng, X Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
Efficient data selection is crucial to accelerate the pretraining of large language models
(LLMs). While various methods have been proposed to enhance data efficiency, limited …

Harnessing Diversity for Important Data Selection in Pretraining Large Language Models

C Zhang, H Zhong, K Zhang, C Chai, R Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Data selection is of great significance in pre-training large language models, given the
wide variation in quality across available large-scale training corpora. To achieve this …
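
"Harnessing diversity" suggests spreading selection across the corpus rather than taking a global top-k by quality alone. A simplified cluster-then-select sketch follows; the embeddings, per-document quality scores, and cluster count are synthetic placeholders, and the paper's actual treatment of importance and diversity is likely more involved than this.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
docs = rng.normal(size=(5_000, 32))     # embeddings of candidate documents
quality = rng.random(5_000)             # per-doc quality scores (stand-in)

# Partition the pool so selection is spread across semantic regions.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(docs)

# Take the highest-scoring documents per cluster instead of a global
# top-k, trading a little raw quality for coverage of the corpus.
per_cluster = 50
selected = np.concatenate([
    np.flatnonzero(labels == c)[np.argsort(quality[labels == c])[-per_cluster:]]
    for c in range(10)
])
print(len(selected), "documents selected across", len(set(labels)), "clusters")
```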

Uncertainties of latent representations in computer vision

M Kirchhof - arXiv preprint arXiv:2408.14281, 2024 - arxiv.org
Uncertainty quantification is a key pillar of trustworthy machine learning. It enables safe
reactions under unsafe inputs, like predicting only when the machine learning model detects …

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

YC Chan, G Pu, A Shanker, P Suresh, P Jenks… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) are applied to more use cases, creating high-quality, task-
specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high …

Advanced Deep Learning Methods for Chemistry and Material Science

Z Shui - 2024 - search.proquest.com
In chemistry and material science, scientific discovery is usually achieved through a
combination of wet-lab experiments and first-principle computational methods. These …