Datasets for large language models: A comprehensive survey
This paper embarks on an exploration of Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …
A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity
Pretraining is the preliminary and fundamental step in developing capable language models
(LM). Despite this, pretraining data design is critically under-documented and often guided …
An archival perspective on pretraining data
Alongside an explosion in research and development related to large language models,
there has been a concomitant rise in the creation of pretraining datasets—massive …
OpenMoE: An early effort on open mixture-of-experts language models
To help the open-source community have a better understanding of Mixture-of-Experts
(MoE) based large language models (LLMs), we train and release OpenMoE, a series of …
Language models scale reliably with over-training and on downstream tasks
Scaling laws are useful guides for developing language models, but there are still gaps
between current scaling studies and how language models are ultimately trained and …
xLSTM: Extended Long Short-Term Memory
In the 1990s, the constant error carousel and gating were introduced as the central ideas of
the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and …
Detection and measurement of syntactic templates in generated text
Recent work on evaluating the diversity of text generated by LLMs has focused on word-
level features. Here we offer an analysis of syntactic features to characterize general …
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
The reproducibility and transparency of large language models are crucial for advancing
open research, ensuring the trustworthiness of results, and enabling investigations into data …
The responsible foundation model development cheatsheet: A review of tools & resources
Foundation model development attracts a rapidly expanding body of contributors, scientists,
and applications. To help shape responsible development practices, we introduce the …
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization
A common technique for aligning large language models (LLMs) relies on acquiring human
preferences by comparing multiple generations conditioned on a fixed context. This only …