Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

M Belkin - Acta Numerica, 2021 - cambridge.org
In the past decade, the mathematical theory of machine learning has lagged far behind the
triumphs of deep neural networks on practical challenges. However, the gap between theory …

The role of permutation invariance in linear mode connectivity of neural networks

R Entezari, H Sedghi, O Saukh… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we conjecture that if the permutation invariance of neural networks is taken into
account, SGD solutions will likely have no barrier in the linear interpolation between them …
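
The "barrier" in this conjecture has a simple operational form: evaluate the loss along the straight line between two trained weight vectors and measure how far it rises above the endpoints. A minimal numpy sketch; the double-well toy loss and the `loss_barrier` helper are illustrative assumptions, not the paper's code, and in practice the two networks would first be aligned by a hidden-unit permutation:

```python
import numpy as np

def loss_barrier(loss_fn, theta_a, theta_b, num_points=25):
    """Height of the loss barrier on the segment between two solutions.

    The barrier is max over alpha of loss((1-a)*theta_a + a*theta_b)
    minus the linear interpolation of the endpoint losses; "no barrier"
    in the abstract's sense means this quantity is (close to) zero.
    """
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    baseline = (1 - alphas) * losses[0] + alphas * losses[-1]
    return float(np.max(losses - baseline))

# Toy usage with a double-well loss: the two minima at +1 and -1 are
# separated by a bump, so the barrier is positive. Real use would pass
# two trained networks' flattened weights, permutation-aligned first.
loss = lambda theta: float(np.sum((theta ** 2 - 1) ** 2))
print(loss_barrier(loss, np.array([1.0]), np.array([-1.0])))  # 1.0
```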

Decentralized federated averaging

T Sun, D Li, B Wang - IEEE Transactions on Pattern Analysis …, 2022 - ieeexplore.ieee.org
Federated averaging (FedAvg) is a communication-efficient algorithm for distributed training
with an enormous number of clients. In FedAvg, clients keep their data locally for privacy …
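
The averaging step of FedAvg is easy to make concrete. A minimal sketch of one communication round, assuming a central server and simple quadratic client objectives; the paper's decentralized variant replaces this global average with averaging over neighbors in a communication graph:

```python
import numpy as np

def local_sgd(theta, grad_fn, steps=5, lr=0.1):
    """Client-side update: a few SGD steps on the client's private data."""
    theta = theta.copy()
    for _ in range(steps):
        theta -= lr * grad_fn(theta)
    return theta

def fedavg_round(theta, client_grads, client_sizes):
    """One FedAvg round: broadcast, local training, data-size-weighted average."""
    local_models = [local_sgd(theta, g) for g in client_grads]
    w = np.array(client_sizes, dtype=float)
    w /= w.sum()
    return sum(wi * ti for wi, ti in zip(w, local_models))

# Toy usage: two clients whose quadratic losses have different optima.
optima = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
grads = [lambda th, c=c: th - c for c in optima]
theta = np.zeros(2)
for _ in range(20):
    theta = fedavg_round(theta, grads, client_sizes=[100, 300])
print(theta)  # approaches the data-size-weighted mixture of the optima
```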

BOME! Bilevel optimization made easy: A simple first-order approach

B Liu, M Ye, S Wright, P Stone… - Advances in neural …, 2022 - proceedings.neurips.cc
Bilevel optimization (BO) is useful for solving a variety of important machine learning
problems including but not limited to hyperparameter optimization, meta-learning, continual …

Learning linear causal representations from interventions under general nonlinear mixing

S Buchholz, G Rajendran… - Advances in …, 2024 - proceedings.neurips.cc
We study the problem of learning causal representations from unknown, latent interventions
in a general setting, where the latent distribution is Gaussian but the mixing function is …
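
The data-generating setup named in the abstract, Gaussian latents subject to interventions and pushed through an unknown nonlinear mixing, can be simulated in a few lines. The particular SCM, mixing function, and hard intervention below are illustrative assumptions, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = np.tril(rng.normal(size=(d, d)), k=-1)  # acyclic linear latent SCM
M = rng.normal(size=(d, d))                 # weights of the unknown mixing

def sample_latents(n, do=None, value=2.0):
    """Ancestral sampling of Gaussian latents; `do` hard-intervenes on z_do."""
    z = np.zeros((n, d))
    for j in range(d):
        z[:, j] = value if do == j else z @ A[j] + rng.normal(size=n)
    return z

def mix(z):
    """Unknown nonlinear mixing from latents to observations."""
    return np.tanh(z @ M) + 0.1 * (z @ M) ** 3

x_obs = mix(sample_latents(1000))        # observational data
x_int = mix(sample_latents(1000, do=1))  # data after an intervention on z_1
```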

Full stack optimization of transformer inference: a survey

S Kim, C Hooper, T Wattanawong, M Kang… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advances in state-of-the-art DNN architecture design have been moving toward
Transformer models. These models achieve superior accuracy across a wide range of …

Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs

E Boursier, L Pillaud-Vivien… - Advances in Neural …, 2022 - proceedings.neurips.cc
The training of neural networks by gradient descent methods is a cornerstone of the deep
learning revolution. Yet, despite some recent progress, a complete theory explaining its …
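
The object of study, a two-layer ReLU network trained on the square loss with orthogonal inputs, is easy to set up: gradient descent with a small step size acts as an Euler discretization of the gradient flow. A minimal sketch; the width, initialization scale, and step size are arbitrary choices, not the paper's regime:

```python
import numpy as np

rng = np.random.default_rng(0)
n = d = 4
X = np.eye(n)            # orthogonal inputs: the standard basis of R^d
y = rng.normal(size=n)

m = 50                   # hidden width
W = rng.normal(size=(m, d)) * 0.5   # first layer
a = rng.normal(size=m) * 0.5        # output layer

lr = 0.05  # small steps: gradient descent as an Euler scheme for gradient flow
for _ in range(3000):
    h = np.maximum(X @ W.T, 0.0)    # (n, m) ReLU activations
    r = h @ a - y                   # residuals of the square loss 0.5/n*||r||^2
    grad_a = h.T @ r / n
    grad_W = (np.outer(r, a) * (h > 0)).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W

print(np.mean((np.maximum(X @ W.T, 0.0) @ a - y) ** 2))  # near-zero train loss
```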

On penalty-based bilevel gradient descent method

H Shen, T Chen - International Conference on Machine …, 2023 - proceedings.mlr.press
Bilevel optimization enjoys a wide range of applications in hyper-parameter optimization,
meta-learning and reinforcement learning. However, bilevel problems are difficult to solve …
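
The penalty idea is to trade the nested problem for a single-level one: jointly descend on the upper-level objective plus a growing multiple of the lower-level objective. A toy sketch of that generic recipe; the paper's precise penalty term, step-size schedule, and guarantees differ, so treat this as illustration only:

```python
import numpy as np

# Toy bilevel problem:
#   upper level: min f(x, y) = (y - 1)^2 + x^2
#   lower level: y*(x) = argmin_y g(x, y),  g(x, y) = (y - x)^2
# The bilevel solution is x = y = 1/2.

def grad_f(x, y):
    return np.array([2 * x, 2 * (y - 1)])

def grad_g(x, y):
    return np.array([-2 * (y - x), 2 * (y - x)])

x, y = 2.0, -1.0
lr = 0.05
for t in range(2000):
    gamma = 1.0 + 0.01 * t                     # slowly growing penalty weight
    gx, gy = grad_f(x, y) + gamma * grad_g(x, y)
    x -= lr * gx / gamma                       # shrink steps as gamma grows
    y -= lr * gy / gamma
print(x, y)  # both approach 1/2 as the penalty enforces y = y*(x)
```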

Neural collapse with normalized features: A geometric analysis over the Riemannian manifold

C Yaras, P Wang, Z Zhu… - Advances in neural …, 2022 - proceedings.neurips.cc
When training overparameterized deep networks for classification tasks, it has been widely
observed that the learned features exhibit a so-called "neural collapse" phenomenon. More …
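
One face of neural collapse ("NC1") is that within-class feature scatter vanishes relative to between-class scatter. A small sketch of that diagnostic, using a sum-of-squares scatter ratio as a simplified stand-in for the usual trace-based metric:

```python
import numpy as np

def within_class_scatter_ratio(features, labels):
    """Within-class over between-class feature scatter.

    Neural collapse (the NC1 property) predicts this ratio tends to zero
    late in training: each class's features fall onto the class mean.
    """
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        mu = fc.mean(axis=0)
        within += np.sum((fc - mu) ** 2)
        between += len(fc) * np.sum((mu - global_mean) ** 2)
    return within / between

# Toy usage: nearly collapsed features around three class means -> tiny ratio.
rng = np.random.default_rng(0)
means = np.array([[1.0, 0.0], [-0.5, 0.9], [-0.5, -0.9]])
labels = rng.integers(0, 3, size=600)
features = means[labels] + 0.01 * rng.normal(size=(600, 2))
print(within_class_scatter_ratio(features, labels))
```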

High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance

A Sadiev, M Danilova, E Gorbunov… - International …, 2023 - proceedings.mlr.press
In recent years, the interest of the optimization and machine learning communities in
high-probability convergence of stochastic optimization methods has been growing. One of …
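
Gradient clipping is the tool typically analyzed in this line of work, since it caps how far any single heavy-tailed gradient can move the iterate. A minimal clipped-SGD sketch; the toy objective, Student-t noise, and constants are illustrative assumptions, not the paper's exact methods or parameters:

```python
import numpy as np

def clipped_sgd(grad_oracle, x0, lr=0.05, clip=1.0, steps=1000, seed=0):
    """SGD with gradient norm clipping.

    Clipping tames heavy-tailed stochastic gradients: no single step can
    move the iterate by more than lr * clip, which is what makes
    high-probability bounds possible even with unbounded noise variance.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = grad_oracle(x, rng)
        norm = np.linalg.norm(g)
        if norm > clip:
            g *= clip / norm
        x -= lr * g
    return x

# Toy usage: quadratic objective whose stochastic gradients carry
# Student-t noise with df=2, i.e. infinite variance.
oracle = lambda x, rng: x + rng.standard_t(df=2, size=x.shape)
print(clipped_sgd(oracle, x0=[5.0, -3.0]))  # lands near the optimum at 0
```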