Gradient-based feature learning under structured data
Recent works have demonstrated that the sample complexity of gradient-based learning of
single-index models, i.e., functions that depend on a one-dimensional projection of the input …
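For concreteness, a single-index target has the form f(x) = g(⟨w*, x⟩) for a fixed direction w* and a scalar link g. Below is a minimal sketch of sampling from such a teacher; the direction `w_star` and the cubic Hermite link are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 4096

# Illustrative single-index teacher: labels depend on x only through
# the one-dimensional projection <w_star, x>.
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)       # direction on the unit sphere
link = lambda z: z**3 - 3 * z          # example link function (third Hermite polynomial)

X = rng.standard_normal((n, d))        # isotropic Gaussian inputs
y = link(X @ w_star)                   # y = g(<w_star, x>)
```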
Dynamics of finite width kernel and prediction fluctuations in mean field neural networks
B Bordelon, C Pehlevan - Advances in Neural Information …, 2024 - proceedings.neurips.cc
We analyze the dynamics of finite width effects in wide but finite feature learning neural
networks. Starting from a dynamical mean field theory description of infinite width deep …
How two-layer neural networks learn, one (giant) step at a time
We investigate theoretically how the features of a two-layer neural network adapt to the
structure of the target function through a few large batch gradient descent steps, leading to …
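A minimal sketch of the kind of update discussed here: one large-batch gradient step on the first-layer weights of a two-layer network, followed by a check of how much each neuron has aligned with an (assumed single-index) target direction. The step-size scaling, ReLU activation, and target below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, n = 64, 256, 8192                  # input dim, hidden width, batch size
eta = np.sqrt(p)                         # "giant" step size (illustrative scaling)

w_star = rng.standard_normal(d); w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((n, d))
y = np.maximum(X @ w_star, 0.0)          # illustrative single-index target

W = rng.standard_normal((p, d)) / np.sqrt(d)   # first-layer weights (trained)
a = rng.choice([-1.0, 1.0], size=p)            # second-layer weights (frozen)

pre = X @ W.T                            # pre-activations, shape (n, p)
resid = np.maximum(pre, 0.0) @ a / p - y # residuals of the squared loss
# Gradient of (1/2n) sum_i resid_i^2 with respect to W, for f(x) = a^T relu(Wx) / p
grad_W = ((resid[:, None] * (pre > 0) * a[None, :]).T @ X) / (n * p)

W1 = W - eta * grad_W                    # features after one large-batch step
overlap = np.abs(W1 @ w_star) / np.linalg.norm(W1, axis=1)
print("mean neuron-target alignment:", float(overlap.mean()))
```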
Learning time-scales in two-layers neural networks
Gradient-based learning in multi-layer neural networks displays a number of striking
features. In particular, the decrease rate of empirical risk is non-monotone even after …
On learning gaussian multi-index models with gradient flow
We study gradient flow on the multi-index regression problem for high-dimensional
Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear …
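Schematically, a multi-index target is f(x) = g(Ux) with a low-rank matrix U ∈ R^{k×d}, k ≪ d, and gradient flow is studied on how the student recovers the row span of U. A small sketch of generating data from such a target; the orthonormal U and the choice of g are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n = 200, 3, 1000                   # ambient dimension, index dimension (k << d)

# Illustrative multi-index teacher: y depends on x only through the
# k-dimensional projection U x, i.e. f(x) = g(U x).
U = np.linalg.qr(rng.standard_normal((d, k)))[0].T   # (k, d), orthonormal rows

def g(z):                                # example nonlinearity on the latent coordinates
    return z[:, 0] * z[:, 1] + np.tanh(z[:, 2])

X = rng.standard_normal((n, d))          # high-dimensional Gaussian inputs
y = g(X @ U.T)                           # labels depend on x only through U x
```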
On the different regimes of stochastic gradient descent
A Sclocchi, M Wyart - … of the National Academy of Sciences, 2024 - National Acad Sciences
Modern deep networks are trained with stochastic gradient descent (SGD) whose key
hyperparameters are the number of data considered at each step, or batch size B, and the …
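A plain SGD loop making the two hyperparameters mentioned here explicit, the batch size B and the learning rate lr; the function names and the least-squares example are illustrative, and the different regimes in this line of work are typically organized by how B and lr scale together.

```python
import numpy as np

def sgd(loss_grad, theta0, X, y, lr=0.1, B=32, steps=1000, seed=0):
    """Minimal SGD: at each step draw a mini-batch of size B and move along
    the averaged gradient with learning rate lr."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(steps):
        idx = rng.choice(len(y), size=B, replace=False)
        theta = theta - lr * loss_grad(theta, X[idx], y[idx])
    return theta

# Example usage: linear least squares, gradient of (1/2B) ||X theta - y||^2
lsq_grad = lambda theta, Xb, yb: Xb.T @ (Xb @ theta - yb) / len(yb)

rng = np.random.default_rng(1)
X = rng.standard_normal((5000, 20))
theta_star = rng.standard_normal(20)
y = X @ theta_star
theta_hat = sgd(lsq_grad, np.zeros(20), X, y, lr=0.05, B=64, steps=2000)
```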
On the impact of overparameterization on the training of a shallow neural network in high dimensions
We study the training dynamics of a shallow neural network with quadratic activation
functions and quadratic cost in a teacher-student setup. In line with previous works on the …
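The setup described here, sketched minimally: a teacher and a (possibly wider, i.e. overparameterized) student with quadratic activation σ(z) = z², compared under a quadratic cost. The widths and scalings below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, p, n = 50, 2, 8, 2000              # input dim, teacher width, student width (p >= k)

W_teacher = rng.standard_normal((k, d)) / np.sqrt(d)
W_student = rng.standard_normal((p, d)) / np.sqrt(d)   # overparameterized when p > k

X = rng.standard_normal((n, d))
y = ((X @ W_teacher.T) ** 2).sum(axis=1)               # teacher with sigma(z) = z^2

def quadratic_cost(W):
    """Quadratic (squared) cost of the student's predictions against the teacher."""
    pred = ((X @ W.T) ** 2).sum(axis=1)
    return 0.5 * np.mean((pred - y) ** 2)

print("initial cost:", quadratic_cost(W_student))
```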
Grokking as the transition from lazy to rich training dynamics
We propose that the grokking phenomenon, where the train loss of a neural network
decreases much earlier than its test loss, can arise due to a neural network transitioning …
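One standard way to interpolate between the two regimes mentioned here is to rescale the centered network output by a factor α: large α keeps gradient descent close to the linearization around the initialization (lazy/kernel regime), while small α lets the features move (rich regime). A sketch of that parameterization; the tanh network and names are illustrative, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(4)
d, p = 20, 100

W0 = rng.standard_normal((p, d)) / np.sqrt(d)   # weights at initialization (kept fixed)
a = rng.standard_normal(p) / np.sqrt(p)

def f(W, x):
    return a @ np.tanh(W @ x)

def f_alpha(W, x, alpha):
    """Centered, rescaled model: for large alpha, training stays near the
    linearization of f around W0 (lazy regime); for small alpha, the weights
    must move substantially to fit the data (rich, feature-learning regime)."""
    return alpha * (f(W, x) - f(W0, x))
```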
On Single-Index Models beyond Gaussian Data
A Zweig, L Pillaud-Vivien… - Advances in Neural …, 2024 - proceedings.neurips.cc
Sparse high-dimensional functions have arisen as a rich framework to study the behavior of
gradient-descent methods using shallow neural networks, showcasing their ability to …
Asymptotics of feature learning in two-layer networks after one gradient-step
In this manuscript we investigate the problem of how two-layer neural networks learn
features from data, and improve over the kernel regime, after being trained with a single …
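One operational way to see the "improvement over the kernel regime": fit ridge regression on top of the frozen features before and after a single large gradient step on the first layer, and compare test errors. The target, activation, step size, and ridge level below are illustrative assumptions, not the paper's asymptotic setting.

```python
import numpy as np

rng = np.random.default_rng(5)
d, p, n, n_test = 64, 512, 4096, 2048
eta, ridge = 10.0, 1e-3

w_star = rng.standard_normal(d); w_star /= np.linalg.norm(w_star)
target = lambda Z: np.tanh(Z @ w_star)
X, Xt = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, yt = target(X), target(Xt)

W0 = rng.standard_normal((p, d)) / np.sqrt(d)   # first layer at initialization
a = rng.choice([-1.0, 1.0], size=p)             # frozen second layer

def ridge_test_error(W):
    """Test error of ridge regression on the fixed feature map relu(W x)."""
    F, Ft = np.maximum(X @ W.T, 0), np.maximum(Xt @ W.T, 0)
    coef = np.linalg.solve(F.T @ F + ridge * np.eye(p), F.T @ y)
    return np.mean((Ft @ coef - yt) ** 2)

# One full-batch gradient step on the first layer under the squared loss
pre = X @ W0.T
resid = np.maximum(pre, 0) @ a / p - y
grad = ((resid[:, None] * (pre > 0) * a[None, :]).T @ X) / (n * p)
W1 = W0 - eta * grad

print("random features (kernel regime):", ridge_test_error(W0))
print("features after one step        :", ridge_test_error(W1))
```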