LoRA: Low-rank adaptation of large language models

EJ Hu, Y Shen, P Wallis, Z Allen-Zhu, Y Li… - arXiv preprint arXiv …, 2021 - arxiv.org
An important paradigm of natural language processing consists of large-scale pre-training
on general domain data and adaptation to particular tasks or domains. As we pre-train larger …
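The snippet above only names the pre-train-then-adapt paradigm, so as a reminder of what the title refers to, here is a minimal PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer. The class name, rank r, and scaling alpha are illustrative choices for this sketch, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (sketch of the LoRA idea)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                      # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # up-projection, zero init
        self.scale = alpha / r

    def forward(self, x):
        # W0 x + (alpha / r) * B A x : only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

A layer adapted this way trains only r * (in_features + out_features) extra parameters while the pre-trained weight matrix stays fixed.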

Birth of a transformer: A memory viewpoint

A Bietti, V Cabannes, D Bouchacourt… - Advances in …, 2024 - proceedings.neurips.cc
Large language models based on transformers have achieved great empirical successes.
However, as they are deployed more widely, there is a growing need to better understand …

A kernel-based view of language model fine-tuning

S Malladi, A Wettig, D Yu, D Chen… - … on Machine Learning, 2023 - proceedings.mlr.press
It has become standard to solve NLP tasks by fine-tuning pre-trained language models
(LMs), especially in low-data settings. There is minimal theoretical understanding of …
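The snippet cuts off before the technical content; the "kernel-based view" in the title refers to analyzing fine-tuning through the empirical neural tangent kernel of the pre-trained model. Below is a rough, hedged sketch of eNTK kernel regression around a pre-trained model, assuming a scalar-output network; the helper names and the ridge term are ours, not the paper's.

```python
import torch

# Rough sketch: approximate fine-tuning by kernel ridge regression with the empirical
# neural tangent kernel K(x, x') = <grad_theta f(x; theta0), grad_theta f(x'; theta0)>
# evaluated at the pre-trained parameters theta0. Assumes `model` has one scalar output.

def grad_features(model, x):
    """Flattened gradient of the model output w.r.t. all parameters at one input."""
    model.zero_grad()
    model(x.unsqueeze(0)).squeeze().backward()
    return torch.cat([p.grad.reshape(-1).clone() for p in model.parameters()])

def entk_predict(model, train_x, train_y, test_x, ridge=1e-3):
    """Kernel ridge regression with the empirical NTK of `model`."""
    phi_tr = torch.stack([grad_features(model, x) for x in train_x])   # (n, P)
    phi_te = torch.stack([grad_features(model, x) for x in test_x])    # (m, P)
    K = phi_tr @ phi_tr.T                                              # train-train eNTK
    alpha = torch.linalg.solve(K + ridge * torch.eye(K.shape[0]), train_y)
    return (phi_te @ phi_tr.T) @ alpha                                 # test predictions
```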

From lazy to rich to exclusive task representations in neural networks and neural codes

M Farrell, S Recanatesi, E Shea-Brown - Current Opinion in Neurobiology, 2023 - Elsevier
Neural circuits—both in the brain and in “artificial” neural network models—learn to solve a
remarkable variety of tasks, and there is a great current opportunity to use neural networks …

Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data

S Frei, NS Chatterji, P Bartlett - Conference on Learning …, 2022 - proceedings.mlr.press
Benign overfitting, the phenomenon where interpolating models generalize well in the
presence of noisy data, was first observed in neural network models trained with gradient …

Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs

E Boursier, L Pillaud-Vivien… - Advances in Neural …, 2022 - proceedings.neurips.cc
The training of neural networks by gradient descent methods is a cornerstone of the deep
learning revolution. Yet, despite some recent progress, a complete theory explaining its …

LESS: Selecting influential data for targeted instruction tuning

M Xia, S Malladi, S Gururangan, S Arora… - arXiv preprint arXiv …, 2024 - arxiv.org
Instruction tuning has unlocked powerful capabilities in large language models (LLMs),
effectively using combined datasets to develop general-purpose chatbots. However, real …
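The snippet stops at the motivation; broadly, the data-selection idea in the title is to score candidate training examples by how well their gradients align with gradients from the target task. The sketch below uses plain cosine similarity over full gradient vectors and omits the paper's specifics (such as gradient projections or optimizer-aware features); the function names are ours.

```python
import torch
import torch.nn.functional as F

def loss_grad(model, loss_fn, x, y):
    """Flattened loss gradient for one (x, y) example at the current parameters."""
    model.zero_grad()
    loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
    return torch.cat([p.grad.reshape(-1).clone() for p in model.parameters()])

def select_influential(model, loss_fn, candidates, targets, k):
    """Rank candidate (x, y) pairs by cosine similarity of their gradient to the
    mean gradient over target-task examples, and keep the top k."""
    g_target = torch.stack([loss_grad(model, loss_fn, x, y) for x, y in targets]).mean(0)
    scores = torch.stack([
        F.cosine_similarity(loss_grad(model, loss_fn, x, y), g_target, dim=0)
        for x, y in candidates
    ])
    return scores.topk(k).indices   # indices of the k most target-aligned examples
```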

Self-consistent dynamical field theory of kernel evolution in wide neural networks

B Bordelon, C Pehlevan - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We analyze feature learning in infinite-width neural networks trained with gradient flow
through a self-consistent dynamical field theory. We construct a collection of deterministic …

Dynamics of finite width kernel and prediction fluctuations in mean field neural networks

B Bordelon, C Pehlevan - Advances in Neural Information …, 2024 - proceedings.neurips.cc
We analyze the dynamics of finite width effects in wide but finite feature learning neural
networks. Starting from a dynamical mean field theory description of infinite width deep …

On the stepwise nature of self-supervised learning

JB Simon, M Knutins, L Ziyin, D Geisz… - International …, 2023 - proceedings.mlr.press
We present a simple picture of the training process of self-supervised learning methods with
dual deep networks. In our picture, these methods learn their high-dimensional embeddings …