Language models scale reliably with over-training and on downstream tasks

SY Gadre, G Smyrnis, V Shankar, S Gururangan… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling laws are useful guides for developing language models, but there are still gaps
between current scaling studies and how language models are ultimately trained and …

Rho-1: Not all tokens are what you need

Z Lin, Z Gou, Y Gong, X Liu, Y Shen, R Xu, C Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Previous language model pre-training methods have uniformly applied a next-token
prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a …
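The idea behind the title is selective language modeling: score each training token against a reference model and backpropagate the next-token loss only through the tokens with the largest excess loss. A minimal PyTorch sketch of that selection step, where the keep ratio, tensor shapes, and the `selective_lm_loss` helper are illustrative assumptions rather than the paper's exact recipe:

```python
# Sketch of selective language modeling in the spirit of Rho-1:
# keep the next-token loss only for tokens where the training model's
# loss most exceeds a reference model's loss ("excess loss").
# The 60% keep ratio and input shapes are assumptions; padding handling omitted.
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, ref_logits, labels, keep_ratio=0.6):
    """logits, ref_logits: (batch, seq, vocab); labels: (batch, seq)."""
    # Per-token cross-entropy for both models, no reduction.
    ce = F.cross_entropy(logits.flatten(0, 1), labels.flatten(),
                         reduction="none")
    ref_ce = F.cross_entropy(ref_logits.flatten(0, 1), labels.flatten(),
                             reduction="none")
    # Selection score is detached: only which tokens we keep depends on it.
    excess = (ce - ref_ce).detach()
    k = max(1, int(keep_ratio * excess.numel()))
    kept = torch.topk(excess, k).indices
    # Gradients flow only through the selected tokens' losses.
    return ce[kept].mean()
```

Selecting by excess loss rather than raw loss focuses updates on tokens the model can still learn from, instead of tokens that are inherently noisy for any model.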

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

C Tao, Q Liu, L Dou, N Muennighoff, Z Wan… - arXiv preprint arXiv …, 2024 - arxiv.org
Research on scaling large language models (LLMs) has primarily focused on model
parameters and training data size, overlooking the role of vocabulary size. Intuitively …

STAR: Constraint LoRA with Dynamic Active Learning for Data-Efficient Fine-Tuning of Large Language Models

L Zhang, J Wu, D Zhou, G Xu - arXiv preprint arXiv:2403.01165, 2024 - arxiv.org
Though Large Language Models (LLMs) have demonstrated powerful few-shot learning
capabilities through prompting methods, supervised training is still necessary for complex …
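The snippet does not describe STAR's constraint or its active-learning loop, but the LoRA mechanism it builds on is standard: freeze the pretrained weight and learn a low-rank update BA on top of it. A minimal sketch, with the rank `r` and scaling `alpha` as commonly used defaults (assumptions, not STAR's settings):

```python
# Minimal sketch of a LoRA linear layer (the adapter mechanism STAR builds on;
# STAR's constraints and active-learning loop are not shown here).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # Base output plus the scaled low-rank update x @ (BA)^T.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

x = torch.randn(4, 512)
layer = LoRALinear(512, 512)
print(layer(x).shape)  # torch.Size([4, 512])
```

Only A and B are trained, so the number of trainable parameters scales with r rather than with the full weight matrix.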

Unraveling the mystery of scaling laws: Part I

H Su, Z Tian, X Shen, X Cai - arXiv preprint arXiv:2403.06563, 2024 - arxiv.org
Scaling law principles indicate a power-law correlation between loss and variables such as
model size, dataset size, and computational resources utilized during training. These …
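As a concrete, generic instance of such a power law, a Chinchilla-style parameterization L(N, D) = E + A/N^α + B/D^β can be fit to (model size, token count, loss) observations. The functional form, the fitting procedure, and the synthetic data below are illustrative assumptions, not this paper's results:

```python
# Sketch of fitting a Chinchilla-style scaling law
#   L(N, D) = E + A / N**alpha + B / D**beta
# to synthetic (parameters, tokens, loss) observations.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, E, A, alpha, B, beta):
    N, D = X  # model parameters, training tokens
    return E + A / N**alpha + B / D**beta

# Synthetic grid of observations, generated from assumed coefficients.
Ns = np.array([1e8, 4e8, 1e9, 4e9])
Ds = np.array([2e9, 8e9, 2e10, 8e10])
N, D = (a.ravel() for a in np.meshgrid(Ns, Ds))
loss = scaling_law((N, D), E=1.7, A=4e2, alpha=0.34, B=4e3, beta=0.28)

popt, _ = curve_fit(scaling_law, (N, D), loss,
                    p0=[2.0, 1e2, 0.3, 1e3, 0.3], maxfev=20000)
E, A, alpha, B, beta = popt
print(f"fitted: E={E:.2f}, alpha={alpha:.2f}, beta={beta:.2f}")
```

Once fitted, such a law extrapolates loss to larger N and D, which is what makes it useful for budgeting compute before training.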

Neural Scaling Laws for Embodied AI

S Sartor, N Thompson - arXiv preprint arXiv:2405.14005, 2024 - arxiv.org
Scaling laws have driven remarkable progress across machine learning domains like
language modeling and computer vision. However, the exploration of scaling laws in …

Scaling Laws for Linear Complexity Language Models

X Shen, D Li, R Leng, Z Qin, W Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
The interest in linear complexity models for large language models is on the rise, although
their scaling capacity remains uncertain. In this study, we present the scaling laws for linear …

Collaborative Performance Prediction for Large Language Models

Q Zhang, F Lyu, X Liu, C Ma - arXiv preprint arXiv:2407.01300, 2024 - arxiv.org
Comprehensively understanding and accurately predicting the performance of large
language models across diverse downstream tasks has emerged as a pivotal challenge in …

Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

W Zhang, K Saijo, J Jung, C Li, S Watanabe… - arXiv preprint arXiv …, 2024 - arxiv.org
Deep learning-based speech enhancement (SE) models have achieved impressive
performance in the past decade. Numerous advanced architectures have been designed to …

On Resource Efficient Transfer Learning via End Task Aware Training

LM Dery - 2024 - kilthub.cmu.edu
Transfer learning is a machine learning (ML) paradigm where performance on a desired end
task is improved by exploiting "knowledge" from other tasks. The technique has become a …