Gradient-based feature learning under structured data

A Mousavi-Hosseini, D Wu, T Suzuki… - Advances in Neural …, 2023 - proceedings.neurips.cc
Recent works have demonstrated that the sample complexity of gradient-based learning of
single-index models, i.e., functions that depend on a 1-dimensional projection of the input …

On learning Gaussian multi-index models with gradient flow

A Bietti, J Bruna, L Pillaud-Vivien - arXiv preprint arXiv:2310.19793, 2023 - arxiv.org
We study gradient flow on the multi-index regression problem for high-dimensional
Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear …

A theory of non-linear feature learning with one gradient step in two-layer neural networks

B Moniri, D Lee, H Hassani, E Dobriban - arXiv preprint arXiv:2310.07891, 2023 - arxiv.org
Feature learning is thought to be one of the fundamental reasons for the success of deep
neural networks. It is rigorously known that in two-layer fully-connected neural networks …

Should Under-parameterized Student Networks Copy or Average Teacher Weights?

B Simsek, A Bendjeddou… - Advances in Neural …, 2024 - proceedings.neurips.cc
Any continuous function $f^*$ can be approximated arbitrarily well by a neural
network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a …

Provable multi-task representation learning by two-layer ReLU neural networks

L Collins, H Hassani, M Soltanolkotabi… - arXiv preprint arXiv …, 2023 - arxiv.org
Feature learning, i.e., extracting meaningful representations of data, is quintessential to the
practical success of neural networks trained with gradient descent, yet it is notoriously …

Provably learning a multi-head attention layer

S Chen, Y Li - arXiv preprint arXiv:2402.04084, 2024 - arxiv.org
The multi-head attention layer is one of the key components of the transformer architecture
that sets it apart from traditional feed-forward models. Given a sequence length $k$ …

Spectral phase transitions in non-linear Wigner spiked models

A Guionnet, J Ko, F Krzakala, P Mergny… - arXiv preprint arXiv …, 2023 - arxiv.org
We study the asymptotic behavior of the spectrum of a random matrix where a non-linearity
is applied entry-wise to a Wigner matrix perturbed by a rank-one spike with independent and …

Asymptotics of feature learning in two-layer networks after one gradient-step

H Cui, L Pesce, Y Dandi, F Krzakala, YM Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this manuscript we investigate the problem of how two-layer neural networks learn
features from data, and improve over the kernel regime, after being trained with a single …

Learning hierarchical polynomials with three-layer neural networks

Z Wang, E Nichani, JD Lee - arXiv preprint arXiv:2311.13774, 2023 - arxiv.org
We study the problem of learning hierarchical polynomials over the standard Gaussian
distribution with three-layer neural networks. We specifically consider target functions of the …

SGD finds then tunes features in two-layer neural networks with near-optimal sample complexity: A case study in the XOR problem

M Glasgow - arXiv preprint arXiv:2309.15111, 2023 - arxiv.org
In this work, we consider the optimization process of minibatch stochastic gradient descent
(SGD) on a 2-layer neural network with data separated by a quadratic ground truth function …