Gradient-based feature learning under structured data

A Mousavi-Hosseini, D Wu, T Suzuki… - Advances in Neural …, 2023 - proceedings.neurips.cc
Recent works have demonstrated that the sample complexity of gradient-based learning of
single-index models, i.e., functions that depend on a one-dimensional projection of the input …
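
For reference, the object in question is a target of the form (notation here is illustrative, not taken from the paper):

    f_\star(x) = \sigma_\star(\langle \theta_\star, x \rangle), \qquad x \in \mathbb{R}^d, \; \theta_\star \in \mathbb{S}^{d-1},

so the label depends on the d-dimensional input only through the one-dimensional projection \langle \theta_\star, x \rangle, with \sigma_\star an unknown scalar link function.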

Dynamics of finite width kernel and prediction fluctuations in mean field neural networks

B Bordelon, C Pehlevan - Advances in Neural Information …, 2024 - proceedings.neurips.cc
We analyze the dynamics of finite width effects in wide but finite feature learning neural
networks. Starting from a dynamical mean field theory description of infinite width deep …
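
For context, a common mean-field parameterization of a two-layer network, written with generic symbols rather than the paper's exact conventions, is

    f(x) = \frac{1}{N} \sum_{i=1}^{N} a_i \, \sigma(w_i^\top x),

whose training dynamics admit a dynamical mean-field description in the limit N \to \infty; the finite-width fluctuations of the kernels and predictions around that limit are the quantities studied here.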

How two-layer neural networks learn, one (giant) step at a time

Y Dandi, F Krzakala, B Loureiro, L Pesce… - arXiv preprint arXiv …, 2023 - arxiv.org
We investigate theoretically how the features of a two-layer neural network adapt to the
structure of the target function through a few large batch gradient descent steps, leading to …
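
Schematically, and with illustrative notation, a single such step updates the first-layer weights by one full-batch gradient move with a large step size,

    W^{1} = W^{0} - \eta \, \nabla_W \hat{L}(W^{0}, a^{0}), \qquad \hat{L}(W, a) = \frac{1}{n} \sum_{j=1}^{n} \big( f(x_j; W, a) - y_j \big)^2,

with \eta taken large (typically growing with the input dimension in these analyses) so that the updated weights acquire a non-trivial component along the target's relevant directions.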

Learning time-scales in two-layers neural networks

R Berthier, A Montanari, K Zhou - arXiv preprint arXiv:2303.00055, 2023 - arxiv.org
Gradient-based learning in multi-layer neural networks displays a number of striking
features. In particular, the decrease rate of empirical risk is non-monotone even after …

On learning Gaussian multi-index models with gradient flow

A Bietti, J Bruna, L Pillaud-Vivien - arXiv preprint arXiv:2310.19793, 2023 - arxiv.org
We study gradient flow on the multi-index regression problem for high-dimensional
Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear …
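
Concretely, and again with generic notation, a Gaussian multi-index target takes the form

    f_\star(x) = g\big( U^\top x \big), \qquad x \sim \mathcal{N}(0, I_d), \; U \in \mathbb{R}^{d \times r}, \; r \ll d,

i.e. an unknown link function g composed with an unknown low-rank linear map U, and the question is whether gradient flow recovers the r-dimensional subspace spanned by the columns of U.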

On the different regimes of stochastic gradient descent

A Sclocchi, M Wyart - … of the National Academy of Sciences, 2024 - National Acad Sciences
Modern deep networks are trained with stochastic gradient descent (SGD), whose key
hyperparameters are the number of data points considered at each step, or batch size B, and the …
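
For reference, the mini-batch SGD update being dissected is the standard one, written here with a generic step size \eta:

    \theta_{t+1} = \theta_t - \frac{\eta}{B} \sum_{i \in \mathcal{B}_t} \nabla_\theta \ell(\theta_t; x_i), \qquad |\mathcal{B}_t| = B,

and the different regimes are characterized by how these hyperparameters are scaled relative to one another.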

On the impact of overparameterization on the training of a shallow neural network in high dimensions

S Martin, F Bach, G Biroli - International Conference on …, 2024 - proceedings.mlr.press
We study the training dynamics of a shallow neural network with quadratic activation
functions and quadratic cost in a teacher-student setup. In line with previous works on the …
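
A generic instance of such a setup, with symbols chosen here for illustration rather than taken from the paper, is a width-m student with quadratic activations trained on labels produced by a narrower teacher of the same form,

    \hat{f}(x) = \sum_{i=1}^{m} a_i \, (w_i^\top x)^2, \qquad f_\star(x) = \sum_{j=1}^{k} a_j^\star \, (w_j^{\star\top} x)^2, \qquad k \le m,

under the quadratic cost \tfrac{1}{2} \, \mathbb{E}\big[ (\hat{f}(x) - f_\star(x))^2 \big]; overparameterization then means taking the student width m well above the teacher width k.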

Grokking as the transition from lazy to rich training dynamics

T Kumar, B Bordelon, SJ Gershman… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose that the grokking phenomenon, where the train loss of a neural network
decreases much earlier than its test loss, can arise due to a neural network transitioning …
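
The lazy/rich distinction invoked here can be made concrete (in generic notation) through the linearization of the network around its initialization,

    f_{\mathrm{lin}}(x; \theta) = f(x; \theta_0) + \langle \nabla_\theta f(x; \theta_0), \, \theta - \theta_0 \rangle:

training is called lazy when the trained network stays well approximated by f_{\mathrm{lin}} (the kernel regime), and rich when the parameters move far enough for the learned features themselves to change.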

On Single-Index Models beyond Gaussian Data

A Zweig, L Pillaud-Vivien… - Advances in Neural …, 2024 - proceedings.neurips.cc
Sparse high-dimensional functions have arisen as a rich framework to study the behavior of
gradient-descent methods using shallow neural networks, showcasing their ability to …

Asymptotics of feature learning in two-layer networks after one gradient-step

H Cui, L Pesce, Y Dandi, F Krzakala, YM Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this manuscript we investigate the problem of how two-layer neural networks learn
features from data, and improve over the kernel regime, after being trained with a single …
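
A minimal NumPy sketch of the protocol these one-step analyses consider, under illustrative assumptions (a ReLU single-index teacher, synthetic Gaussian data, and hand-picked dimensions and step size that are not the paper's): freeze the second layer, take one large gradient step on the first-layer weights, refit the second layer by ridge regression, and compare with the random-features baseline in which the first layer stays at initialization.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, n_test, width = 128, 4096, 2048, 512
    ridge = 1e-3
    eta = float(width)  # "giant" step: with the 1/width output scaling below,
                        # a step of order width moves the weights by O(1)

    # Illustrative single-index teacher: y = relu(<theta, x>)
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)
    relu = lambda z: np.maximum(z, 0.0)
    f_star = lambda X: relu(X @ theta)

    X = rng.standard_normal((n, d))
    X_test = rng.standard_normal((n_test, d))
    y, y_test = f_star(X), f_star(X_test)

    # Two-layer network f(x) = a . relu(W x) / width at initialization
    W0 = rng.standard_normal((width, d)) / np.sqrt(d)
    a0 = rng.choice([-1.0, 1.0], size=width)

    def first_layer_grad(W, a, X, y):
        """Gradient of (1/2n) * sum((f(x_i) - y_i)^2) w.r.t. W, second layer frozen at a."""
        Z = X @ W.T                          # (n, width) pre-activations
        residual = relu(Z) @ a / width - y   # (n,) prediction errors
        mask = (Z > 0).astype(float)         # ReLU derivative
        return (mask * residual[:, None] * a[None, :] / width).T @ X / len(y)

    def ridge_fit(features, y, lam):
        """Second-layer weights by ridge regression on the given features."""
        k = features.shape[1]
        return np.linalg.solve(features.T @ features + lam * np.eye(k), features.T @ y)

    def test_mse(W, a):
        return np.mean((relu(X_test @ W.T) @ a - y_test) ** 2)

    # Kernel-regime baseline: random features, first layer frozen at initialization
    a_rf = ridge_fit(relu(X @ W0.T), y, ridge)

    # One large gradient step on the first layer, then refit the second layer
    W1 = W0 - eta * first_layer_grad(W0, a0, X, y)
    a_fl = ridge_fit(relu(X @ W1.T), y, ridge)

    print("random features test MSE:", test_mse(W0, a_rf))
    print("one-step features test MSE:", test_mse(W1, a_fl))

With the 1/width output scaling used here, a single step only moves the first-layer weights appreciably if the step size grows with the width, which is why eta is set to the width above; the precise scaling used in the paper may differ.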