Datasets for large language models: A comprehensive survey

Y Liu, J Cao, C Liu, K Ding, L Jin - arXiv preprint arXiv:2402.18041, 2024 - arxiv.org
This paper embarks on an exploration of Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …

A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity

S Longpre, G Yauney, E Reif, K Lee, A Roberts… - arXiv preprint arXiv …, 2023 - arxiv.org
Pretraining is the preliminary and fundamental step in developing capable language models
(LM). Despite this, pretraining data design is critically under-documented and often guided …

An archival perspective on pretraining data

MA Desai, IV Pasquetto, AZ Jacobs, D Card - Patterns, 2024 - cell.com
Alongside an explosion in research and development related to large language models,
there has been a concomitant rise in the creation of pretraining datasets—massive …

OpenMoE: An early effort on open mixture-of-experts language models

F Xue, Z Zheng, Y Fu, J Ni, Z Zheng, W Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
To help the open-source community have a better understanding of Mixture-of-Experts
(MoE) based large language models (LLMs), we train and release OpenMoE, a series of …

Language models scale reliably with over-training and on downstream tasks

SY Gadre, G Smyrnis, V Shankar, S Gururangan… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling laws are useful guides for developing language models, but there are still gaps
between current scaling studies and how language models are ultimately trained and …

xLSTM: Extended Long Short-Term Memory

M Beck, K Pöppel, M Spanring, A Auer… - arXiv preprint arXiv …, 2024 - arxiv.org
In the 1990s, the constant error carousel and gating were introduced as the central ideas of
the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and …

Detection and measurement of syntactic templates in generated text

C Shaib, Y Elazar, JJ Li, BC Wallace - arXiv preprint arXiv:2407.00211, 2024 - arxiv.org
Recent work on evaluating the diversity of text generated by LLMs has focused on word-level
features. Here we offer an analysis of syntactic features to characterize general …

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

S Mehta, MH Sekhavat, Q Cao, M Horton, Y Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
The reproducibility and transparency of large language models are crucial for advancing
open research, ensuring the trustworthiness of results, and enabling investigations into data …

The responsible foundation model development cheatsheet: A review of tools & resources

S Longpre, S Biderman, A Albalak… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation model development attracts a rapidly expanding body of contributors, scientists,
and applications. To help shape responsible development practices, we introduce the …

Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization

H Bansal, A Suvarna, G Bhatt, N Peng… - arXiv preprint arXiv …, 2024 - arxiv.org
A common technique for aligning large language models (LLMs) relies on acquiring human
preferences by comparing multiple generations conditioned on a fixed context. This only …