Gradient-based feature learning under structured data

A Mousavi-Hosseini, D Wu, T Suzuki… - Advances in Neural …, 2023 - proceedings.neurips.cc
Recent works have demonstrated that the sample complexity of gradient-based learning of
single-index models, i.e., functions that depend on a one-dimensional projection of the input …
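
For context, a single-index model in the sense described above is a fixed link function applied to a one-dimensional projection of the input; the notation below (direction $\boldsymbol{\theta}$ on the sphere, link $\sigma_*$) is assumed for illustration, not quoted from the paper:

$$ f_*(\boldsymbol{x}) = \sigma_*\big(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle\big), \qquad \boldsymbol{x} \in \mathbb{R}^d, \ \boldsymbol{\theta} \in \mathbb{S}^{d-1}. $$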

Bayes-optimal learning of an extensive-width neural network from quadratically many samples

A Maillard, E Troiani, S Martin, F Krzakala… - arXiv preprint arXiv …, 2024 - arxiv.org
We consider the problem of learning a target function corresponding to a single-hidden-layer
neural network, with a quadratic activation function after the first layer, and random weights …
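
As a rough sketch of the target class named here, a single-hidden-layer network with quadratic activation can be written (with assumed, illustrative notation for the width $m$, second-layer weights $a_i$, and random first-layer weights $\boldsymbol{w}_i$) as

$$ f_*(\boldsymbol{x}) = \sum_{i=1}^{m} a_i \,\langle \boldsymbol{w}_i, \boldsymbol{x} \rangle^2 = \boldsymbol{x}^\top \Big( \sum_{i=1}^{m} a_i\, \boldsymbol{w}_i \boldsymbol{w}_i^\top \Big) \boldsymbol{x}, $$

so learning $f_*$ amounts to estimating a $d \times d$ matrix, which is consistent with the quadratic-in-dimension sample size in the title.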

Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs

L Arnaboldi, Y Dandi, F Krzakala, B Loureiro… - arXiv preprint arXiv …, 2024 - arxiv.org
We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer
neural networks with one-pass stochastic gradient descent (SGD) on multi-index target …
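
A minimal sketch of the protocol described in this snippet, one-pass (fresh batch per step) mini-batch SGD on a two-layer network with a synthetic multi-index teacher; every name, dimension, and activation below is an illustrative assumption, not the paper's setup:

import numpy as np

rng = np.random.default_rng(0)
d, k, m = 128, 2, 64            # input dim, index dim, student width
n_b, eta, T = 32, 0.05, 2000    # batch size, step size, number of SGD steps

# Multi-index teacher: y = g(W_* x) depends on x only through k directions.
W_star = rng.standard_normal((k, d)) / np.sqrt(d)
g = lambda z: np.tanh(z).sum(axis=1)

# Two-layer student f(x) = a^T tanh(W x).
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.standard_normal(m) / np.sqrt(m)

for t in range(T):
    X = rng.standard_normal((n_b, d))   # fresh batch each step -> one-pass / online SGD
    y = g(X @ W_star.T)
    h = np.tanh(X @ W.T)                # hidden activations, shape (n_b, m)
    err = h @ a - y                     # batch residuals
    a -= eta * (h.T @ err / n_b)                                # gradient of the averaged squared loss in a
    W -= eta * (((err[:, None] * a) * (1 - h**2)).T @ X / n_b)  # gradient in W

Each step consumes a fresh batch, so the total sample budget is $n_b \cdot T$; the trade-off studied in the snippet is how increasing $n_b$ changes the number of iterations $T$ required.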

Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit

JD Lee, K Oko, T Suzuki, D Wu - arXiv preprint arXiv:2406.01581, 2024 - arxiv.org
We study the problem of gradient descent learning of a single-index target function
$f_*(\boldsymbol{x}) = \sigma_*\left(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle\right)$ …

Learning multi-index models with neural networks via mean-field Langevin dynamics

A Mousavi-Hosseini, D Wu, MA Erdogdu - arXiv preprint arXiv:2408.07254, 2024 - arxiv.org
We study the problem of learning multi-index models in high dimensions using a two-layer
neural network trained with the mean-field Langevin algorithm. Under mild distributional …
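
As a rough illustration of the algorithm family referenced here, mean-field Langevin dynamics is noisy gradient descent on the neurons of a width-averaged two-layer network; the sketch below uses an assumed squared loss, weight decay lam, and inverse temperature beta, none of which are taken from the paper:

import numpy as np

rng = np.random.default_rng(1)
d, m, n = 64, 256, 4096
eta, lam, beta = 0.02, 1e-3, 1e4

X = rng.standard_normal((n, d))
y = np.tanh(X[:, 0] + X[:, 1])                 # illustrative low-dimensional (multi-index) target

W = rng.standard_normal((m, d)) / np.sqrt(d)   # the m neurons ("particles") being transported

for t in range(500):
    H = np.tanh(X @ W.T)                       # (n, m) activations
    err = H.mean(axis=1) - y                   # mean-field readout: average over neurons
    grad = ((1 - H**2) * err[:, None]).T @ X / (n * m)
    noise = rng.standard_normal(W.shape)
    # Langevin step: gradient plus weight decay, plus Gaussian noise at the matched scale.
    W -= eta * (grad + lam * W) - np.sqrt(2 * eta / beta) * noise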

SGD with memory: fundamental properties and stochastic acceleration

D Yarotsky, M Velikanov - arXiv preprint arXiv:2410.04228, 2024 - arxiv.org
An important open problem is the theoretically feasible acceleration of mini-batch SGD-type
algorithms on quadratic problems with power-law spectrum. In the non-stochastic setting, the …
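
For orientation, "memory" here means auxiliary state carried between iterations; the simplest example (heavy-ball momentum, given only as an assumed illustration) keeps one extra vector:

$$ \boldsymbol{v}_{t+1} = \mu\, \boldsymbol{v}_t - \eta\, \boldsymbol{g}_t, \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \boldsymbol{v}_{t+1}, $$

where $\boldsymbol{g}_t$ is a mini-batch gradient; the open problem quoted in the snippet is how much such memory can accelerate mini-batch SGD on quadratic problems with power-law spectrum.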

The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms

E Collins-Woodfin, I Seroussi… - arXiv preprint arXiv …, 2024 - arxiv.org
We develop a framework for analyzing the training and learning rate dynamics on a large
class of high-dimensional optimization problems, which we call the high line, trained using …
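
One concrete member of the class of stochastic adaptive learning rate algorithms (given only as an assumed example, not necessarily the one analyzed in the paper) is AdaGrad-Norm, which divides a base step size by an accumulated gradient norm:

$$ b_{t+1}^2 = b_t^2 + \lVert \boldsymbol{g}_t \rVert^2, \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{b_{t+1}}\, \boldsymbol{g}_t, $$

so the effective learning rate has its own dynamics, which is the kind of curve the title's exact risk and learning rate analysis tracks.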

High dimensional analysis reveals conservative sharpening and a stochastic edge of stability

A Agarwala, J Pennington - arXiv preprint arXiv:2404.19261, 2024 - arxiv.org
Recent empirical and theoretical work has shown that the dynamics of the large eigenvalues
of the training loss Hessian have some remarkably robust features across models and …
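
For reference, the deterministic edge of stability is usually phrased through the classical stability condition of gradient descent on a quadratic: with step size $\eta$ and largest Hessian eigenvalue $\lambda_{\max}$, the iteration is stable only when

$$ \eta\, \lambda_{\max} < 2, $$

so the sharpness tends to hover near $\lambda_{\max} \approx 2/\eta$; the "stochastic edge of stability" in the title concerns the analogue of this threshold under mini-batch noise.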

Gradient descent inference in empirical risk minimization

Q Han, X Xu - arXiv preprint arXiv:2412.09498, 2024 - arxiv.org
Gradient descent is one of the most widely used iterative algorithms in modern statistical
learning. However, its precise algorithmic dynamics in high-dimensional settings remain …
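
The object of study is the plain gradient descent iteration on an empirical risk; in standard (assumed) notation,

$$ \hat{R}_n(\boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^{n} \ell(\boldsymbol{\theta}; \boldsymbol{x}_i, y_i), \qquad \boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} - \eta\, \nabla \hat{R}_n(\boldsymbol{\theta}^{t}), $$

and the snippet's interest is in the precise high-dimensional dynamics of the iterates $\boldsymbol{\theta}^{t}$.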

A non-asymptotic theory of Kernel Ridge Regression: deterministic equivalents, test error, and GCV estimator

T Misiakiewicz, B Saeed - arXiv preprint arXiv:2403.08938, 2024 - arxiv.org
We consider learning an unknown target function $f_*$ using kernel ridge regression
(KRR) given i.i.d. data $(u_i, y_i)$, $i \leq n$, where $u_i \in U$ is a covariate vector and …
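
A minimal sketch of kernel ridge regression as set up in the snippet; the RBF kernel, its bandwidth, and the ridge parameter lam are illustrative assumptions:

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def krr_fit_predict(U, y, U_test, lam=1e-2, gamma=1.0):
    n = len(U)
    K = rbf_kernel(U, U, gamma)
    # KRR estimator: alpha = (K + n*lam*I)^{-1} y,  f_hat(u) = sum_i alpha_i k(u, u_i).
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return rbf_kernel(U_test, U, gamma) @ alpha

rng = np.random.default_rng(2)
U = rng.standard_normal((200, 5))                      # covariates u_i
y = np.sin(U[:, 0]) + 0.1 * rng.standard_normal(200)   # noisy responses y_i
y_hat = krr_fit_predict(U, y, rng.standard_normal((50, 5)))

The quantities named in the title (test error, deterministic equivalents, GCV) are all functionals of the same regularized kernel matrix $K + n\lambda I$ that appears in the solve above.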