Resolving discrepancies in compute-optimal scaling of language models

T Porian, M Wortsman, J Jitsev, L Schmidt… - arXiv preprint arXiv …, 2024 - arxiv.org
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model
size as a function of the compute budget, but these laws yield substantially different …
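As background for the discrepancy this entry refers to (well-known published exponents, not values re-derived from this paper): both works model the compute-optimal parameter count as a power law in the compute budget, roughly $N^{\star}(C) \propto C^{a}$, with $a \approx 0.73$ in Kaplan et al. and $a \approx 0.50$ in Hoffmann et al., so the two prescriptions diverge more and more as the budget grows.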

Improving text-to-audio models with synthetic captions

Z Kong, S Lee, D Ghosal, N Majumder… - arXiv preprint arXiv …, 2024 - arxiv.org
It is an open challenge to obtain high quality training data, especially captions, for text-to-
audio models. Although prior methods have leveraged text-only language models to …

A Practitioner's Guide to Continual Multimodal Pretraining

K Roth, V Udandarao, S Dziadzio, A Prabhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal foundation models serve numerous applications at the intersection of vision and
language. Still, despite being pretrained on extensive data, they become outdated over time …

Scaling laws for precision

T Kumar, Z Ankner, BF Spector, B Bordelon… - arXiv preprint arXiv …, 2024 - arxiv.org
Low precision training and inference affect both the quality and cost of language models, but
current scaling laws do not account for this. In this work, we devise "precision-aware" scaling …
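For context (standard background, not the parameterization proposed in this paper): conventional scaling laws of the Chinchilla form fit the loss as $L(N, D) \approx E + A N^{-\alpha} + B D^{-\beta}$, a function of parameter count $N$ and token count $D$ only; a "precision-aware" law additionally conditions this fit on the training and inference bit width.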

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

B Warner, A Chaffin, B Clavié, O Weller… - arXiv preprint arXiv …, 2024 - arxiv.org
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for
retrieval and classification tasks with respect to larger decoder-only models. Despite being …

Optimization hyper-parameter laws for large language models

X Xie, S Yan, KC Toh, T Wei - arXiv preprint arXiv:2409.04777, 2024 - arxiv.org
Large Language Models have driven significant AI advancements, yet their training is
resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws …

Power scheduler: A batch size and token number agnostic learning rate scheduler

Y Shen, M Stallone, M Mishra, G Zhang, S Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
Finding the optimal learning rate for language model pretraining is a challenging task. This
is not only because there is a complicated correlation between learning rate, batch size …
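For orientation, a minimal sketch of a warmup-plus-power-law-decay schedule from the general family such schedulers belong to (an illustrative Python assumption; the function name, constants, and exact decay rule are hypothetical, not the paper's Power scheduler):

    def power_lr(step, peak_lr=3e-4, warmup_steps=1000, decay_exp=0.5):
        # Linear warmup to peak_lr, then power-law decay in the step count.
        # All constants here are illustrative placeholders.
        if step < warmup_steps:
            return peak_lr * step / max(1, warmup_steps)
        return peak_lr * (warmup_steps / float(step)) ** decay_exp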

Scaling Laws for Pre-training Agents and World Models

T Pearce, T Rashid, D Bignell, R Georgescu… - arXiv preprint arXiv …, 2024 - arxiv.org
The performance of embodied agents has been shown to improve by increasing model
parameters, dataset size, and compute. This has been demonstrated in domains from …

MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Z Yu, S Das, C Xiong - arXiv preprint arXiv:2406.06046, 2024 - arxiv.org
Pretraining data selection has the potential to improve language model pretraining efficiency
by utilizing higher-quality data from massive web data corpora. Current data selection …

Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models

A Fernandez-Lopez, S Liu, L Yin, S Petridis… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper investigates the under-explored area of low-rank weight training for large-scale
Conformer-based speech recognition models from scratch. Our study demonstrates the …
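The underlying technique, training with factorized (low-rank) weight matrices, can be sketched as below; this is a generic PyTorch-style low-rank linear layer under the assumption that torch is available, not the paper's specific Conformer training recipe:

    import torch.nn as nn

    class LowRankLinear(nn.Module):
        # Replaces a full d_out x d_in weight with two rank-r factors,
        # reducing parameters from d_in*d_out to r*(d_in + d_out) (plus bias).
        def __init__(self, d_in, d_out, rank):
            super().__init__()
            self.down = nn.Linear(d_in, rank, bias=False)
            self.up = nn.Linear(rank, d_out, bias=True)

        def forward(self, x):
            return self.up(self.down(x))

For example, LowRankLinear(512, 512, rank=64) holds about 65k factor weights versus roughly 262k for the corresponding full layer.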