LoRA: Low-rank adaptation of large language models

EJ Hu, Y Shen, P Wallis, Z Allen-Zhu, Y Li… - arXiv preprint arXiv …, 2021 - arxiv.org
An important paradigm of natural language processing consists of large-scale pre-training
on general domain data and adaptation to particular tasks or domains. As we pre-train larger …
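The snippet above only names the pre-train-then-adapt paradigm, so as a reminder of what the title refers to, here is a minimal PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer. The class name, rank r, and scaling alpha are illustrative choices for this sketch, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (sketch of the LoRA idea)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                      # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # up-projection, zero init
        self.scale = alpha / r

    def forward(self, x):
        # W0 x + (alpha / r) * B A x : only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

A layer adapted this way trains only r * (in_features + out_features) extra parameters while the pre-trained weight matrix stays fixed.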

Birth of a transformer: A memory viewpoint

A Bietti, V Cabannes, D Bouchacourt… - Advances in …, 2024 - proceedings.neurips.cc
Large language models based on transformers have achieved great empirical successes.
However, as they are deployed more widely, there is a growing need to better understand …

A kernel-based view of language model fine-tuning

S Malladi, A Wettig, D Yu, D Chen… - … on Machine Learning, 2023 - proceedings.mlr.press
It has become standard to solve NLP tasks by fine-tuning pre-trained language models
(LMs), especially in low-data settings. There is minimal theoretical understanding of …
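The snippet cuts off before the technical content; the "kernel-based view" in the title refers to analyzing fine-tuning through the empirical neural tangent kernel of the pre-trained model. Below is a rough, hedged sketch of eNTK kernel regression around a pre-trained model, assuming a scalar-output network; the helper names and the ridge term are ours, not the paper's.

```python
import torch

# Rough sketch: approximate fine-tuning by kernel ridge regression with the empirical
# neural tangent kernel K(x, x') = <grad_theta f(x; theta0), grad_theta f(x'; theta0)>
# evaluated at the pre-trained parameters theta0. Assumes `model` has one scalar output.

def grad_features(model, x):
    """Flattened gradient of the model output w.r.t. all parameters at one input."""
    model.zero_grad()
    model(x.unsqueeze(0)).squeeze().backward()
    return torch.cat([p.grad.reshape(-1).clone() for p in model.parameters()])

def entk_predict(model, train_x, train_y, test_x, ridge=1e-3):
    """Kernel ridge regression with the empirical NTK of `model`."""
    phi_tr = torch.stack([grad_features(model, x) for x in train_x])   # (n, P)
    phi_te = torch.stack([grad_features(model, x) for x in test_x])    # (m, P)
    K = phi_tr @ phi_tr.T                                              # train-train eNTK
    alpha = torch.linalg.solve(K + ridge * torch.eye(K.shape[0]), train_y)
    return (phi_te @ phi_tr.T) @ alpha                                 # test predictions
```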

From lazy to rich to exclusive task representations in neural networks and neural codes

M Farrell, S Recanatesi, E Shea-Brown - Current Opinion in Neurobiology, 2023 - Elsevier
Neural circuits—both in the brain and in “artificial” neural network models—learn to solve a
remarkable variety of tasks, and there is a great current opportunity to use neural networks …

Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data

S Frei, NS Chatterji, P Bartlett - Conference on Learning …, 2022 - proceedings.mlr.press
Benign overfitting, the phenomenon where interpolating models generalize well in the
presence of noisy data, was first observed in neural network models trained with gradient …

Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs

E Boursier, L Pillaud-Vivien… - Advances in Neural …, 2022 - proceedings.neurips.cc
The training of neural networks by gradient descent methods is a cornerstone of the deep
learning revolution. Yet, despite some recent progress, a complete theory explaining its …

LESS: Selecting influential data for targeted instruction tuning

M Xia, S Malladi, S Gururangan, S Arora… - arXiv preprint arXiv …, 2024 - arxiv.org
Instruction tuning has unlocked powerful capabilities in large language models (LLMs),
effectively using combined datasets to develop general-purpose chatbots. However, real …
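The snippet stops at the motivation; broadly, the data-selection idea in the title is to score candidate training examples by how well their gradients align with gradients from the target task. The sketch below uses plain cosine similarity over full gradient vectors and omits the paper's specifics (such as gradient projections or optimizer-aware features); the function names are ours.

```python
import torch
import torch.nn.functional as F

def loss_grad(model, loss_fn, x, y):
    """Flattened loss gradient for one (x, y) example at the current parameters."""
    model.zero_grad()
    loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
    return torch.cat([p.grad.reshape(-1).clone() for p in model.parameters()])

def select_influential(model, loss_fn, candidates, targets, k):
    """Rank candidate (x, y) pairs by cosine similarity of their gradient to the
    mean gradient over target-task examples, and keep the top k."""
    g_target = torch.stack([loss_grad(model, loss_fn, x, y) for x, y in targets]).mean(0)
    scores = torch.stack([
        F.cosine_similarity(loss_grad(model, loss_fn, x, y), g_target, dim=0)
        for x, y in candidates
    ])
    return scores.topk(k).indices   # indices of the k most target-aligned examples
```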

Self-consistent dynamical field theory of kernel evolution in wide neural networks

B Bordelon, C Pehlevan - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We analyze feature learning in infinite-width neural networks trained with gradient flow
through a self-consistent dynamical field theory. We construct a collection of deterministic …

Dynamics of finite width kernel and prediction fluctuations in mean field neural networks

B Bordelon, C Pehlevan - Advances in Neural Information …, 2024 - proceedings.neurips.cc
We analyze the dynamics of finite width effects in wide but finite feature learning neural
networks. Starting from a dynamical mean field theory description of infinite width deep …

On the stepwise nature of self-supervised learning

JB Simon, M Knutins, L Ziyin, D Geisz… - International …, 2023 - proceedings.mlr.press
We present a simple picture of the training process of self-supervised learning methods with
dual deep networks. In our picture, these methods learn their high-dimensional embeddings …