Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

M Belkin - Acta Numerica, 2021 - cambridge.org
In the past decade, the mathematical theory of machine learning has lagged far behind the
triumphs of deep neural networks on practical challenges. However, the gap between theory …

The role of permutation invariance in linear mode connectivity of neural networks

R Entezari, H Sedghi, O Saukh… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we conjecture that if the permutation invariance of neural networks is taken into
account, SGD solutions will likely have no barrier in the linear interpolation between them …
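
The "barrier" in this conjecture has a simple operational form: evaluate the loss along the straight line between two trained weight vectors and measure how far it rises above the endpoints. A minimal numpy sketch; the double-well toy loss and the `loss_barrier` helper are illustrative assumptions, not the paper's code, and in practice the two networks would first be aligned by a hidden-unit permutation:

```python
import numpy as np

def loss_barrier(loss_fn, theta_a, theta_b, num_points=25):
    """Height of the loss barrier on the segment between two solutions.

    The barrier is max over alpha of loss((1-a)*theta_a + a*theta_b)
    minus the linear interpolation of the endpoint losses; "no barrier"
    in the abstract's sense means this quantity is (close to) zero.
    """
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    baseline = (1 - alphas) * losses[0] + alphas * losses[-1]
    return float(np.max(losses - baseline))

# Toy usage with a double-well loss: the two minima at +1 and -1 are
# separated by a bump, so the barrier is positive. Real use would pass
# two trained networks' flattened weights, permutation-aligned first.
loss = lambda theta: float(np.sum((theta ** 2 - 1) ** 2))
print(loss_barrier(loss, np.array([1.0]), np.array([-1.0])))  # 1.0
```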

Decentralized federated averaging

T Sun, D Li, B Wang - IEEE Transactions on Pattern Analysis …, 2022 - ieeexplore.ieee.org
Federated averaging (FedAvg) is a communication-efficient algorithm for distributed training
with an enormous number of clients. In FedAvg, clients keep their data locally for privacy …
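
The averaging step of FedAvg is easy to make concrete. A minimal sketch of one communication round, assuming a central server and simple quadratic client objectives; the paper's decentralized variant replaces this global average with averaging over neighbors in a communication graph:

```python
import numpy as np

def local_sgd(theta, grad_fn, steps=5, lr=0.1):
    """Client-side update: a few SGD steps on the client's private data."""
    theta = theta.copy()
    for _ in range(steps):
        theta -= lr * grad_fn(theta)
    return theta

def fedavg_round(theta, client_grads, client_sizes):
    """One FedAvg round: broadcast, local training, data-size-weighted average."""
    local_models = [local_sgd(theta, g) for g in client_grads]
    w = np.array(client_sizes, dtype=float)
    w /= w.sum()
    return sum(wi * ti for wi, ti in zip(w, local_models))

# Toy usage: two clients whose quadratic losses have different optima.
optima = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
grads = [lambda th, c=c: th - c for c in optima]
theta = np.zeros(2)
for _ in range(20):
    theta = fedavg_round(theta, grads, client_sizes=[100, 300])
print(theta)  # approaches the data-size-weighted mixture of the optima
```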

BOME! Bilevel optimization made easy: A simple first-order approach

B Liu, M Ye, S Wright, P Stone… - Advances in neural …, 2022 - proceedings.neurips.cc
Bilevel optimization (BO) is useful for solving a variety of important machine learning
problems including but not limited to hyperparameter optimization, meta-learning, continual …

Learning linear causal representations from interventions under general nonlinear mixing

S Buchholz, G Rajendran… - Advances in …, 2024 - proceedings.neurips.cc
We study the problem of learning causal representations from unknown, latent interventions
in a general setting, where the latent distribution is Gaussian but the mixing function is …
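
The data-generating setup named in the abstract, Gaussian latents subject to interventions and pushed through an unknown nonlinear mixing, can be simulated in a few lines. The particular SCM, mixing function, and hard intervention below are illustrative assumptions, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = np.tril(rng.normal(size=(d, d)), k=-1)  # acyclic linear latent SCM
M = rng.normal(size=(d, d))                 # weights of the unknown mixing

def sample_latents(n, do=None, value=2.0):
    """Ancestral sampling of Gaussian latents; `do` hard-intervenes on z_do."""
    z = np.zeros((n, d))
    for j in range(d):
        z[:, j] = value if do == j else z @ A[j] + rng.normal(size=n)
    return z

def mix(z):
    """Unknown nonlinear mixing from latents to observations."""
    return np.tanh(z @ M) + 0.1 * (z @ M) ** 3

x_obs = mix(sample_latents(1000))        # observational data
x_int = mix(sample_latents(1000, do=1))  # data after an intervention on z_1
```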

Full stack optimization of transformer inference: a survey

S Kim, C Hooper, T Wattanawong, M Kang… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advances in state-of-the-art DNN architecture design have been moving toward
Transformer models. These models achieve superior accuracy across a wide range of …

Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs

E Boursier, L Pillaud-Vivien… - Advances in Neural …, 2022 - proceedings.neurips.cc
The training of neural networks by gradient descent methods is a cornerstone of the deep
learning revolution. Yet, despite some recent progress, a complete theory explaining its …
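
The object of study, a two-layer ReLU network trained on the square loss with orthogonal inputs, is easy to set up: gradient descent with a small step size acts as an Euler discretization of the gradient flow. A minimal sketch; the width, initialization scale, and step size are arbitrary choices, not the paper's regime:

```python
import numpy as np

rng = np.random.default_rng(0)
n = d = 4
X = np.eye(n)            # orthogonal inputs: the standard basis of R^d
y = rng.normal(size=n)

m = 50                   # hidden width
W = rng.normal(size=(m, d)) * 0.5   # first layer
a = rng.normal(size=m) * 0.5        # output layer

lr = 0.05  # small steps: gradient descent as an Euler scheme for gradient flow
for _ in range(3000):
    h = np.maximum(X @ W.T, 0.0)    # (n, m) ReLU activations
    r = h @ a - y                   # residuals of the square loss 0.5/n*||r||^2
    grad_a = h.T @ r / n
    grad_W = (np.outer(r, a) * (h > 0)).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W

print(np.mean((np.maximum(X @ W.T, 0.0) @ a - y) ** 2))  # near-zero train loss
```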

On penalty-based bilevel gradient descent method

H Shen, T Chen - International Conference on Machine …, 2023 - proceedings.mlr.press
Bilevel optimization enjoys a wide range of applications in hyper-parameter optimization,
meta-learning and reinforcement learning. However, bilevel problems are difficult to solve …
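
The penalty idea is to trade the nested problem for a single-level one: jointly descend on the upper-level objective plus a growing multiple of the lower-level objective. A toy sketch of that generic recipe; the paper's precise penalty term, step-size schedule, and guarantees differ, so treat this as illustration only:

```python
import numpy as np

# Toy bilevel problem:
#   upper level: min f(x, y) = (y - 1)^2 + x^2
#   lower level: y*(x) = argmin_y g(x, y),  g(x, y) = (y - x)^2
# The bilevel solution is x = y = 1/2.

def grad_f(x, y):
    return np.array([2 * x, 2 * (y - 1)])

def grad_g(x, y):
    return np.array([-2 * (y - x), 2 * (y - x)])

x, y = 2.0, -1.0
lr = 0.05
for t in range(2000):
    gamma = 1.0 + 0.01 * t                     # slowly growing penalty weight
    gx, gy = grad_f(x, y) + gamma * grad_g(x, y)
    x -= lr * gx / gamma                       # shrink steps as gamma grows
    y -= lr * gy / gamma
print(x, y)  # both approach 1/2 as the penalty enforces y = y*(x)
```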

Neural collapse with normalized features: A geometric analysis over the Riemannian manifold

C Yaras, P Wang, Z Zhu… - Advances in neural …, 2022 - proceedings.neurips.cc
When training overparameterized deep networks for classification tasks, it has been widely
observed that the learned features exhibit a so-called "neural collapse" phenomenon. More …
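
One face of neural collapse ("NC1") is that within-class feature scatter vanishes relative to between-class scatter. A small sketch of that diagnostic, using a sum-of-squares scatter ratio as a simplified stand-in for the usual trace-based metric:

```python
import numpy as np

def within_class_scatter_ratio(features, labels):
    """Within-class over between-class feature scatter.

    Neural collapse (the NC1 property) predicts this ratio tends to zero
    late in training: each class's features fall onto the class mean.
    """
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        mu = fc.mean(axis=0)
        within += np.sum((fc - mu) ** 2)
        between += len(fc) * np.sum((mu - global_mean) ** 2)
    return within / between

# Toy usage: nearly collapsed features around three class means -> tiny ratio.
rng = np.random.default_rng(0)
means = np.array([[1.0, 0.0], [-0.5, 0.9], [-0.5, -0.9]])
labels = rng.integers(0, 3, size=600)
features = means[labels] + 0.01 * rng.normal(size=(600, 2))
print(within_class_scatter_ratio(features, labels))
```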

High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance

A Sadiev, M Danilova, E Gorbunov… - International …, 2023 - proceedings.mlr.press
In recent years, the interest of the optimization and machine learning communities in
high-probability convergence of stochastic optimization methods has been growing. One of …
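
Gradient clipping is the tool typically analyzed in this line of work, since it caps how far any single heavy-tailed gradient can move the iterate. A minimal clipped-SGD sketch; the toy objective, Student-t noise, and constants are illustrative assumptions, not the paper's exact methods or parameters:

```python
import numpy as np

def clipped_sgd(grad_oracle, x0, lr=0.05, clip=1.0, steps=1000, seed=0):
    """SGD with gradient norm clipping.

    Clipping tames heavy-tailed stochastic gradients: no single step can
    move the iterate by more than lr * clip, which is what makes
    high-probability bounds possible even with unbounded noise variance.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = grad_oracle(x, rng)
        norm = np.linalg.norm(g)
        if norm > clip:
            g *= clip / norm
        x -= lr * g
    return x

# Toy usage: quadratic objective whose stochastic gradients carry
# Student-t noise with df=2, i.e. infinite variance.
oracle = lambda x, rng: x + rng.standard_t(df=2, size=x.shape)
print(clipped_sgd(oracle, x0=[5.0, -3.0]))  # lands near the optimum at 0
```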