Understanding LLMs: A comprehensive overview from training to inference

Y Liu, H He, T Han, X Zhang, M Liu, J Tian… - arXiv preprint arXiv …, 2024 - arxiv.org
The introduction of ChatGPT has led to a significant increase in the utilization of Large
Language Models (LLMs) for addressing downstream tasks. There's an increasing focus on …

Datasets for large language models: A comprehensive survey

Y Liu, J Cao, C Liu, K Ding, L Jin - arXiv preprint arXiv:2402.18041, 2024 - arxiv.org
This paper embarks on an exploration into the Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …

StarCoder 2 and The Stack v2: The next generation

A Lozhkov, R Li, LB Allal, F Cassano… - arXiv preprint arXiv …, 2024 - arxiv.org
The BigCode project, an open-scientific collaboration focused on the responsible
development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In …

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

Building and better understanding vision-language models: insights and future directions

H Laurençon, A Marafioti, V Sanh… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of vision-language models (VLMs), which take images and texts as inputs and
output texts, is rapidly evolving and has yet to reach consensus on several key aspects of …

Data management for large language models: A survey

Z Wang, W Zhong, Y Wang, Q Zhu, F Mi… - arXiv preprint arXiv …, 2023 - arxiv.org
Data plays a fundamental role in the training of Large Language Models (LLMs). Effective
data management, particularly in the formulation of a well-suited training dataset, holds …

How to train long-context language models (effectively)

T Gao, A Wettig, H Yen, D Chen - arXiv preprint arXiv:2410.02660, 2024 - arxiv.org
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to
make effective use of long-context information. We first establish a reliable evaluation …

Qurating: Selecting high-quality data for training language models

A Wettig, A Gupta, S Malik, D Chen - arXiv preprint arXiv:2402.09739, 2024 - arxiv.org
Selecting high-quality pre-training data is important for creating capable language models,
but existing methods rely on simple heuristics. We introduce QuRating, a method for …
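A minimal sketch of the quality-weighted selection idea, not the paper's actual pipeline: QuRating derives its ratings from pairwise LLM judgments, whereas the scores, the temperature parameter tau, and all names below are illustrative assumptions.

```python
import math
import random

# Hypothetical quality scores; in QuRating these would come from a
# learned rater model, not be hard-coded.
documents = ["doc_a", "doc_b", "doc_c", "doc_d"]
quality = {"doc_a": 1.8, "doc_b": 0.2, "doc_c": 1.1, "doc_d": -0.5}

def sample_by_quality(docs, scores, k, tau=1.0, seed=0):
    """Sample k documents with probability proportional to
    exp(score / tau). Small tau approaches hard top-k selection;
    large tau approaches uniform sampling. Note: random.choices
    samples with replacement; a real selection pass would also
    deduplicate or sample without replacement."""
    rng = random.Random(seed)
    weights = [math.exp(scores[d] / tau) for d in docs]
    return rng.choices(docs, weights=weights, k=k)

print(sample_by_quality(documents, quality, k=2, tau=0.5))
```

Sampling with a temperature rather than taking a hard top-k keeps some lower-rated documents in the mix, which guards against over-filtering the corpus toward a narrow style.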

Resolving discrepancies in compute-optimal scaling of language models

T Porian, M Wortsman, J Jitsev, L Schmidt… - arXiv preprint arXiv …, 2024 - arxiv.org
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …
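For context on the discrepancy, a sketch from memory of the two fits (worth checking against the paper): both works model the compute-optimal parameter count as a power law in the training budget $C$,

\[ N^{*}(C) \propto C^{a}, \]

with Kaplan et al. estimating $a \approx 0.73$ and Hoffmann et al. $a \approx 0.50$. Under the standard approximation $C \approx 6ND$ for transformer training, the former prescribes growing the model much faster than the data as compute increases, while the latter scales parameters $N$ and tokens $D$ in roughly equal proportion, so the two laws recommend very different model sizes at a fixed budget.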

From matching to generation: A survey on generative information retrieval

X Li, J Jin, Y Zhou, Y Zhang, P Zhang, Y Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Information Retrieval (IR) systems are crucial tools for users to access information, widely
applied in scenarios like search engines, question answering, and recommendation …