Resolving discrepancies in compute-optimal scaling of language models
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …
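[Note: the discrepancy is in the fitted exponents. As a rough comparison using the exponents reported in the two papers (not stated in this snippet): Kaplan et al. fit the compute-optimal model size as roughly $N^\ast \propto C^{0.73}$, whereas Hoffmann et al. fit $N^\ast \propto C^{0.50}$ with training tokens $D^\ast \propto C^{0.50}$, i.e. parameters and data scaled in equal proportion. For the same budget $C$, the two rules therefore prescribe very different model sizes, which is the gap this paper sets out to resolve.]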
Improving text-to-audio models with synthetic captions
It is an open challenge to obtain high quality training data, especially captions, for text-to-
audio models. Although prior methods have leveraged text-only language models to …
A Practitioner's Guide to Continual Multimodal Pretraining
Multimodal foundation models serve numerous applications at the intersection of vision and
language. Still, despite being pretrained on extensive data, they become outdated over time …
Scaling laws for precision
Low precision training and inference affect both the quality and cost of language models, but
current scaling laws do not account for this. In this work, we devise "precision-aware" scaling …
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for
retrieval and classification tasks with respect to larger decoder-only models. Despite being …
Optimization hyper-parameter laws for large language models
Large Language Models have driven significant AI advancements, yet their training is
resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws …
Power scheduler: A batch size and token number agnostic learning rate scheduler
Finding the optimal learning rate for language model pretraining is a challenging task. This
is not only because there is a complicated correlation between learning rate, batch size …
Scaling Laws for Pre-training Agents and World Models
The performance of embodied agents has been shown to improve by increasing model
parameters, dataset size, and compute. This has been demonstrated in domains from …
MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Pretraining data selection has the potential to improve language model pretraining efficiency
by utilizing higher-quality data from massive web data corpora. Current data selection …
Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models
This paper investigates the under-explored area of low-rank weight training for large-scale
Conformer-based speech recognition models from scratch. Our study demonstrates the …
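[Note: the snippet only names the technique, so the following is a minimal generic sketch of low-rank weight parameterization (a dense weight replaced by the product of two thin factors of rank r), not the paper's Conformer-specific recipe; the class name and shapes are made up for illustration.]

# Generic low-rank weight parameterization: a dense (out x in) weight is
# replaced by B @ A with a small inner rank r, cutting parameters from
# out*in to r*(out+in); both factors are trained from scratch.
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int, bias: bool = True):
        super().__init__()
        # Two thin factors instead of one full-rank matrix.
        self.A = nn.Parameter(torch.randn(rank, in_features) / in_features**0.5)
        self.B = nn.Parameter(torch.randn(out_features, rank) / rank**0.5)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ (B @ A)^T + bias, computed as two small matmuls.
        return nn.functional.linear(nn.functional.linear(x, self.A), self.B, self.bias)

# Example: a 512 -> 2048 projection at rank 64 uses 64*(512+2048) ≈ 164k
# parameters instead of 512*2048 ≈ 1.05M.
layer = LowRankLinear(512, 2048, rank=64)
y = layer(torch.randn(8, 512))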