Gradient-based feature learning under structured data
Recent works have demonstrated that the sample complexity of gradient-based learning of
single-index models, i.e., functions that depend on a one-dimensional projection of the input …
On learning Gaussian multi-index models with gradient flow
We study gradient flow on the multi-index regression problem for high-dimensional
Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear …
A theory of non-linear feature learning with one gradient step in two-layer neural networks
Feature learning is thought to be one of the fundamental reasons for the success of deep
neural networks. It is rigorously known that in two-layer fully-connected neural networks …
Should Under-parameterized Student Networks Copy or Average Teacher Weights?
B Simsek, A Bendjeddou… - Advances in Neural …, 2024 - proceedings.neurips.cc
Any continuous function $f^*$ can be approximated arbitrarily well by a neural
network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a …
Provable multi-task representation learning by two-layer ReLU neural networks
Feature learning, i.e., extracting meaningful representations of data, is quintessential to the
practical success of neural networks trained with gradient descent, yet it is notoriously …
Provably learning a multi-head attention layer
S Chen, Y Li - arXiv preprint arXiv:2402.04084, 2024 - arxiv.org
The multi-head attention layer is one of the key components of the transformer architecture
that sets it apart from traditional feed-forward models. Given a sequence length $k$ …
Spectral phase transitions in non-linear Wigner spiked models
A Guionnet, J Ko, F Krzakala, P Mergny… - arXiv preprint arXiv …, 2023 - arxiv.org
We study the asymptotic behavior of the spectrum of a random matrix where a non-linearity
is applied entry-wise to a Wigner matrix perturbed by a rank-one spike with independent and …
Asymptotics of feature learning in two-layer networks after one gradient-step
In this manuscript we investigate the problem of how two-layer neural networks learn
features from data, and improve over the kernel regime, after being trained with a single …
Learning hierarchical polynomials with three-layer neural networks
We study the problem of learning hierarchical polynomials over the standard Gaussian
distribution with three-layer neural networks. We specifically consider target functions of the …
SGD finds then tunes features in two-layer neural networks with near-optimal sample complexity: a case study in the XOR problem
M Glasgow - arXiv preprint arXiv:2309.15111, 2023 - arxiv.org
In this work, we consider the optimization process of minibatch stochastic gradient descent
(SGD) on a two-layer neural network with data separated by a quadratic ground-truth function …