Gradient-based feature learning under structured data
Recent works have demonstrated that the sample complexity of gradient-based learning of
single-index models, i.e., functions that depend on a one-dimensional projection of the input …
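As a concrete illustration of this setting (a minimal sketch of my own, not taken from the paper: the link function, step size, and horizon are assumptions), online SGD on a single neuron against a single-index teacher looks like this:

import numpy as np

rng = np.random.default_rng(0)
d = 256
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)                      # hidden direction of the single-index target
sigma_star = lambda z: z**3 - 3.0*z                 # illustrative link function (3rd Hermite polynomial)

w = rng.standard_normal(d) / np.sqrt(d)             # student neuron, near-zero initial overlap
lr = 0.05
for t in range(100_000):                            # one fresh sample per step (online SGD)
    x = rng.standard_normal(d)
    y = sigma_star(x @ theta)
    z = x @ w
    grad = 2.0 * (sigma_star(z) - y) * (3.0*z**2 - 3.0) * x   # gradient of the squared loss in w
    w -= lr / d * grad
print("overlap:", abs(w @ theta) / np.linalg.norm(w))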
Bayes-optimal learning of an extensive-width neural network from quadratically many samples
We consider the problem of learning a target function corresponding to a single-hidden-layer
neural network with a quadratic activation function after the first layer and random weights …
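A minimal sketch of such a target (the width, readout, and sample scaling below are my assumptions): a planted network with quadratic activation and Gaussian random weights, observed through quadratically many samples.

import numpy as np

rng = np.random.default_rng(1)
d, m = 100, 150                                    # "extensive width": m of the same order as d
W = rng.standard_normal((m, d)) / np.sqrt(d)       # random first-layer weights
f_star = lambda X: ((X @ W.T)**2).mean(axis=1)     # quadratic activation, uniform readout

n = 2 * d**2                                       # quadratically many samples in d
X = rng.standard_normal((n, d))
y = f_star(X)                                      # noiseless labels for illustration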
Online Learning and Information Exponents: On The Importance of Batch Size, and Time/Complexity Tradeoffs
We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer
neural networks with one-pass stochastic gradient descent (SGD) on multi-index target …
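The regime being studied can be summarized by a small skeleton (grad_fn is a hypothetical oracle; the names and bookkeeping are mine): every iteration consumes a fresh batch, so the total sample cost is $n_b \cdot T$ and the question is how $T$ shrinks as $n_b$ grows.

import numpy as np

def one_pass_sgd(grad_fn, w0, n_b, T, lr, d, rng):
    """One-pass mini-batch SGD: each of the T iterations draws n_b fresh samples."""
    w = w0.copy()
    for _ in range(T):
        X = rng.standard_normal((n_b, d))   # fresh Gaussian batch, never reused
        w -= lr * grad_fn(w, X)             # grad_fn: placeholder stochastic gradient oracle
    return w                                # total sample cost: n_b * T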
Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit
We study the problem of gradient descent learning of a single-index target function
$f_*(\boldsymbol{x}) = \sigma_*\left(\langle\boldsymbol{x}, \boldsymbol …
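The snippet is cut off mid-formula; in the standard single-index notation it would presumably continue along the lines of $f_*(\boldsymbol{x}) = \sigma_*\left(\langle\boldsymbol{x}, \boldsymbol{\theta}\rangle\right)$ with a planted direction $\boldsymbol{\theta} \in \mathbb{S}^{d-1}$ and, typically, $\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_d)$ (the symbol $\boldsymbol{\theta}$ and the Gaussian input are assumptions on my part).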
Learning multi-index models with neural networks via mean-field Langevin dynamics
We study the problem of learning multi-index models in high dimensions using a two-layer
neural network trained with the mean-field Langevin algorithm. Under mild distributional …
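A finite-particle sketch of this kind of algorithm (the toy target, tanh activation, and the temperature/regularization values are my assumptions): each neuron is a particle updated by a noisy, regularized gradient step.

import numpy as np

rng = np.random.default_rng(2)
d, width, n = 20, 256, 1000
X = rng.standard_normal((n, d))
y = np.tanh(X[:, 0] * X[:, 1])                     # toy multi-index target (two relevant directions)

W = rng.standard_normal((width, d)) / np.sqrt(d)   # neurons = particles
lr, lam, beta = 0.05, 1e-3, 1e4                    # step size, L2 penalty, inverse temperature
for _ in range(300):
    H = np.tanh(X @ W.T)                           # (n, width) hidden activations
    resid = H.mean(axis=1) - y                     # mean-field readout, squared-loss residual
    grad = ((1.0 - H**2).T * resid) @ X / (n * width) + lam * W
    W += -lr * grad + np.sqrt(2.0 * lr / beta) * rng.standard_normal(W.shape)   # Langevin noise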
SGD with memory: fundamental properties and stochastic acceleration
D Yarotsky, M Velikanov - arXiv preprint arXiv:2410.04228, 2024 - arxiv.org
An important open problem is the theoretically feasible acceleration of mini-batch SGD-type
algorithms on quadratic problems with power-law spectrum. In the non-stochastic setting, the …
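A minimal instance of the setting (the spectrum exponent, batch size, and the heavy-ball form of the memory term are my assumptions): mini-batch SGD with a single momentum/memory variable on a quadratic whose Hessian eigenvalues decay as a power law.

import numpy as np

rng = np.random.default_rng(3)
d = 500
eigs = np.arange(1, d + 1, dtype=float)**(-1.5)    # power-law spectrum: lambda_k ~ k^(-nu)
w_star = rng.standard_normal(d)

w, v = np.zeros(d), np.zeros(d)                     # v is the single "memory" (momentum) variable
lr, mom, n_b = 0.1, 0.9, 32
for _ in range(4000):
    Xb = rng.standard_normal((n_b, d)) * np.sqrt(eigs)   # mini-batch with covariance diag(eigs)
    g = Xb.T @ (Xb @ (w - w_star)) / n_b                 # stochastic gradient of the quadratic loss
    v = mom * v - lr * g
    w = w + v
print("quadratic loss:", 0.5 * np.sum(eigs * (w - w_star)**2))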
The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms
E Collins-Woodfin, I Seroussi… - arXiv preprint arXiv …, 2024 - arxiv.org
We develop a framework for analyzing the training and learning rate dynamics on a large
class of high-dimensional optimization problems, which we call the high line, trained using …
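One representative stochastic adaptive learning-rate rule, written as a small sketch (whether this exact AdaGrad-Norm schedule is among those analyzed is an assumption on my part; grad_fn is a hypothetical oracle):

import numpy as np

def adagrad_norm(grad_fn, w0, T, eta=1.0, b=1e-8, rng=None):
    """Scalar adaptive step: eta_t = eta / sqrt(b + sum_{s<=t} ||g_s||^2)."""
    w, acc = w0.copy(), b
    for _ in range(T):
        g = grad_fn(w, rng)                  # stochastic gradient oracle (placeholder)
        acc += float(g @ g)                  # accumulate squared gradient norms
        w -= eta / np.sqrt(acc) * g
    return w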
High dimensional analysis reveals conservative sharpening and a stochastic edge of stability
A Agarwala, J Pennington - arXiv preprint arXiv:2404.19261, 2024 - arxiv.org
Recent empirical and theoretical work has shown that the dynamics of the large eigenvalues
of the training loss Hessian have some remarkably robust features across models and …
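These large-eigenvalue dynamics are usually tracked by estimating the sharpness along training and comparing it to the classical 2/eta stability threshold; a small sketch (hvp_fn is a hypothetical Hessian-vector-product oracle):

import numpy as np

def sharpness(hvp_fn, d, iters=50, rng=None):
    """Top Hessian eigenvalue estimated by power iteration on Hessian-vector products."""
    rng = rng or np.random.default_rng()
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = hvp_fn(v)
        v /= np.linalg.norm(v)
    return float(v @ hvp_fn(v))              # Rayleigh quotient estimate

# Edge-of-stability check for (full-batch) gradient descent with step size eta:
# at_edge = sharpness(hvp_fn, d) >= 2.0 / eta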
Gradient descent inference in empirical risk minimization
Gradient descent is one of the most widely used iterative algorithms in modern statistical
learning. However, its precise algorithmic dynamics in high-dimensional settings remain …
A non-asymptotic theory of Kernel Ridge Regression: deterministic equivalents, test error, and GCV estimator
T Misiakiewicz, B Saeed - arXiv preprint arXiv:2403.08938, 2024 - arxiv.org
We consider learning an unknown target function $f_*$ using kernel ridge regression
(KRR) given i.i.d. data $(u_i, y_i)$, $i \leq n$, where $u_i \in U$ is a covariate vector and …
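For reference, the estimator under study in its textbook closed form (the RBF kernel and the lambda*n ridge convention below are illustrative choices, not necessarily the paper's):

import numpy as np

def krr_fit_predict(U, y, U_test, lam=1e-2, gamma=1.0):
    """Kernel ridge regression: f_hat(u) = k(u, U) (K + lam*n*I)^{-1} y."""
    def rbf(A, B):
        sq = ((A[:, None, :] - B[None, :, :])**2).sum(axis=-1)
        return np.exp(-gamma * sq)
    n = len(y)
    K = rbf(U, U)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return rbf(U_test, U) @ alpha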