Kronecker-factored approximate curvature for modern neural network architectures

R Eschenhagen, A Immer, R Turner… - Advances in …, 2024 - proceedings.neurips.cc
The core components of many modern neural network architectures, such as transformers,
convolutional, or graph neural networks, can be expressed as linear layers with weight …
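
For readers unfamiliar with the method in the title, the classic Kronecker-factored (K-FAC) approximation for a single linear layer can be sketched in a few lines of NumPy; the paper generalizes this setting to linear layers with weight sharing, which the sketch below does not attempt. Shapes, the damping value, and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal K-FAC sketch for one linear layer s = W @ a (W is m x n).
# The Fisher block for W is approximated by a Kronecker product A (x) G.
rng = np.random.default_rng(0)
batch, n, m = 64, 32, 16

a = rng.normal(size=(batch, n))      # layer inputs (activations)
g = rng.normal(size=(batch, m))      # backpropagated pre-activation gradients
dW = g.T @ a / batch                 # loss gradient w.r.t. W, shape m x n

A = a.T @ a / batch                  # input second-moment factor, n x n
G = g.T @ g / batch                  # gradient second-moment factor, m x m

lam = 1e-3                           # damping, illustrative value
A_inv = np.linalg.inv(A + lam * np.eye(n))
G_inv = np.linalg.inv(G + lam * np.eye(m))

# (A (x) G)^{-1} vec(dW) corresponds to G^{-1} @ dW @ A^{-1} in matrix form,
# so preconditioning costs two small inverses instead of one huge one.
update = G_inv @ dW @ A_inv
```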

Can we remove the square-root in adaptive gradient methods? A second-order perspective

W Lin, F Dangel, R Eschenhagen, J Bae… - arXiv preprint arXiv …, 2024 - arxiv.org
Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep
learning architectures, such as transformers. Their diagonal preconditioner is based on the …
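
As a rough illustration of the question in the title: Adam(W) rescales each coordinate by the square root of an exponential moving average of squared gradients, whereas a square-root-free, second-order-style variant divides by the average itself. The snippet below is a toy sketch under that reading; the hyperparameter values are illustrative, and dropping the root generally requires retuning the learning rate.

```python
import numpy as np

# Toy diagonal preconditioner: v is an exponential moving average of squared
# gradients, as in Adam(W). The only difference below is the square root.
rng = np.random.default_rng(0)
grad = rng.normal(size=5)
v = np.zeros_like(grad)
beta2, eps, lr = 0.999, 1e-8, 1e-3   # illustrative values

v = beta2 * v + (1 - beta2) * grad**2

adam_style_step = lr * grad / (np.sqrt(v) + eps)   # Adam(W): divide by sqrt(v)
root_free_step  = lr * grad / (v + eps)            # square-root-free variant
```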

Variational Stochastic Gradient Descent for Deep Neural Networks

H Chen, A Kuzina, B Esmaeili, JM Tomczak - arXiv preprint arXiv …, 2024 - arxiv.org
Optimizing deep neural networks is one of the main tasks in successful deep learning.
Current state-of-the-art optimizers are adaptive gradient-based optimization methods such …

A Geometric Modeling of Occam's Razor in Deep Learning

K Sun, F Nielsen - arXiv preprint arXiv:1905.11027, 2019 - arxiv.org
Why do deep neural networks (DNNs) benefit from very high dimensional parameter
spaces? Their huge parameter complexity vs. stunning performance in practice is all the …

[BOOK][B] Symplectic Numerical Integration at the Service of Accelerated Optimization and Structure-Preserving Dynamics Learning

V Duruisseaux - 2023 - search.proquest.com
Symplectic numerical integrators for Hamiltonian systems form the paramount class of
geometric numerical integrators, and have been very well investigated in the past forty …
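
As a concrete example of the class of methods discussed, the leapfrog (Störmer–Verlet) scheme is the textbook symplectic integrator for separable Hamiltonians H(q, p) = p²/2 + V(q); the sketch below is generic and not code from the thesis.

```python
# Leapfrog (Stormer-Verlet) step for H(q, p) = p**2/2 + V(q),
# the textbook example of a symplectic integrator.
def leapfrog(q, p, grad_V, dt):
    p = p - 0.5 * dt * grad_V(q)   # half kick
    q = q + dt * p                 # full drift
    p = p - 0.5 * dt * grad_V(q)   # half kick
    return q, p

# Harmonic oscillator: V(q) = q**2 / 2, so grad_V(q) = q.
q, p = 1.0, 0.0
for _ in range(1000):
    q, p = leapfrog(q, p, lambda x: x, dt=0.1)
energy = 0.5 * p**2 + 0.5 * q**2   # stays near the initial 0.5, no secular drift
```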

StEVE: Adaptive Optimization in a Kronecker-Factored Eigenbasis

JNM Gamboa - openreview.net
Adaptive optimization algorithms such as Adam see widespread use in Deep Learning.
However, these methods rely on diagonal approximations of the preconditioner, losing much …
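
A rough sketch of what optimizing "in a Kronecker-factored eigenbasis" can look like, following the EKFAC-style construction: rotate the layer gradient into the eigenbasis of the two Kronecker factors, apply an Adam-like diagonal rescaling there, and rotate back. This is an illustrative assumption about the setup, not StEVE's actual algorithm.

```python
import numpy as np

# Rough sketch: rotate a layer gradient into the eigenbasis of the two K-FAC
# factors, rescale with a diagonal second-moment estimate there, rotate back.
rng = np.random.default_rng(0)
batch, n, m = 64, 32, 16
a = rng.normal(size=(batch, n))      # layer inputs
g = rng.normal(size=(batch, m))      # pre-activation gradients
dW = g.T @ a / batch

A = a.T @ a / batch                  # Kronecker factor over inputs
G = g.T @ g / batch                  # Kronecker factor over output gradients
_, UA = np.linalg.eigh(A)            # eigenbasis of A
_, UG = np.linalg.eigh(G)            # eigenbasis of G

dW_eig = UG.T @ dW @ UA              # gradient in the Kronecker-factored eigenbasis
v = dW_eig**2                        # diagonal second moment (single step shown)
eps, lr = 1e-8, 1e-3
step = lr * (UG @ (dW_eig / (np.sqrt(v) + eps)) @ UA.T)   # back to parameter space
```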