Datasets for large language models: A comprehensive survey

Y Liu, J Cao, C Liu, K Ding, L Jin - arXiv preprint arXiv:2402.18041, 2024 - arxiv.org
This paper embarks on an exploration of Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …

A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity

S Longpre, G Yauney, E Reif, K Lee, A Roberts… - arXiv preprint arXiv …, 2023 - arxiv.org
Pretraining is the preliminary and fundamental step in developing capable language models
(LM). Despite this, pretraining data design is critically under-documented and often guided …

An archival perspective on pretraining data

MA Desai, IV Pasquetto, AZ Jacobs, D Card - Patterns, 2024 - cell.com
Alongside an explosion in research and development related to large language models,
there has been a concomitant rise in the creation of pretraining datasets—massive …

OpenMoE: An early effort on open mixture-of-experts language models

F Xue, Z Zheng, Y Fu, J Ni, Z Zheng, W Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
To help the open-source community have a better understanding of Mixture-of-Experts
(MoE) based large language models (LLMs), we train and release OpenMoE, a series of …

Language models scale reliably with over-training and on downstream tasks

SY Gadre, G Smyrnis, V Shankar, S Gururangan… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling laws are useful guides for developing language models, but there are still gaps
between current scaling studies and how language models are ultimately trained and …

xLSTM: Extended Long Short-Term Memory

M Beck, K Pöppel, M Spanring, A Auer… - arXiv preprint arXiv …, 2024 - arxiv.org
In the 1990s, the constant error carousel and gating were introduced as the central ideas of
the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and …

Detection and measurement of syntactic templates in generated text

C Shaib, Y Elazar, JJ Li, BC Wallace - arXiv preprint arXiv:2407.00211, 2024 - arxiv.org
Recent work on evaluating the diversity of text generated by LLMs has focused on word-level
features. Here we offer an analysis of syntactic features to characterize general …

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

S Mehta, MH Sekhavat, Q Cao, M Horton, Y Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
The reproducibility and transparency of large language models are crucial for advancing
open research, ensuring the trustworthiness of results, and enabling investigations into data …

The responsible foundation model development cheatsheet: A review of tools & resources

S Longpre, S Biderman, A Albalak… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation model development attracts a rapidly expanding body of contributors, scientists,
and applications. To help shape responsible development practices, we introduce the …

Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization

H Bansal, A Suvarna, G Bhatt, N Peng… - arXiv preprint arXiv …, 2024 - arxiv.org
A common technique for aligning large language models (LLMs) relies on acquiring human
preferences by comparing multiple generations conditioned on a fixed context. This only …