Kronecker-factored approximate curvature for modern neural network architectures
The core components of many modern neural network architectures, such as transformers,
convolutional, or graph neural networks, can be expressed as linear layers with weight …
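(For context: a minimal NumPy sketch of the Kronecker-factored curvature idea this line of work builds on, for a single linear layer. The function name, damping value, and shapes are illustrative assumptions, not the paper's implementation.)

```python
import numpy as np

def kfac_precondition(grad_W, acts, grad_out, damping=1e-3):
    """Precondition one linear layer's gradient with a Kronecker-factored
    curvature approximation F ~ A (x) G, where A is the second moment of the
    layer inputs and G the second moment of the output gradients."""
    batch = acts.shape[0]
    A = acts.T @ acts / batch + damping * np.eye(acts.shape[1])
    G = grad_out.T @ grad_out / batch + damping * np.eye(grad_out.shape[1])
    # (A (x) G)^{-1} vec(grad_W) corresponds to G^{-1} grad_W A^{-1}
    return np.linalg.solve(G, grad_W) @ np.linalg.inv(A)

# Toy usage with random data.
rng = np.random.default_rng(0)
acts = rng.standard_normal((32, 5))       # layer inputs, shape (batch, d_in)
grad_out = rng.standard_normal((32, 3))   # output gradients, shape (batch, d_out)
grad_W = grad_out.T @ acts / 32           # weight gradient, shape (d_out, d_in)
update = kfac_precondition(grad_W, acts, grad_out)
```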
Can we remove the square-root in adaptive gradient methods? A second-order perspective
Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep
learning architectures, such as transformers. Their diagonal preconditioner is based on the …
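(For context: a toy sketch contrasting the standard Adam-style update, which divides by the square root of the second-moment estimate, with a square-root-free variant that divides by the second moment directly, as the title alludes to. Names and hyperparameters are illustrative; this is not the paper's method.)

```python
import numpy as np

def preconditioned_step(w, grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
                        use_sqrt=True):
    """One diagonally preconditioned update. use_sqrt=True gives the familiar
    Adam-style step grad / sqrt(v); use_sqrt=False divides by v directly,
    i.e. a square-root-free second-order-style scaling (illustrative only)."""
    m = b1 * m + (1 - b1) * grad        # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * grad ** 2   # second-moment estimate
    denom = (np.sqrt(v) if use_sqrt else v) + eps
    return w - lr * m / denom, m, v
```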
Variational Stochastic Gradient Descent for Deep Neural Networks
Optimizing deep neural networks is one of the main tasks in successful deep learning.
Current state-of-the-art optimizers are adaptive gradient-based optimization methods such …
A Geometric Modeling of Occam's Razor in Deep Learning
Why do deep neural networks (DNNs) benefit from very high dimensional parameter
spaces? Their huge parameter complexities vs. stunning performances in practice is all the …
[BOOK][B] Symplectic Numerical Integration at the Service of Accelerated Optimization and Structure-Preserving Dynamics Learning
V Duruisseaux - 2023 - search.proquest.com
Symplectic numerical integrators for Hamiltonian systems form the paramount class of
geometric numerical integrators, and have been very well investigated in the past forty …
StEVE: Adaptive Optimization in a Kronecker-Factored Eigenbasis
JNM Gamboa - openreview.net
Adaptive optimization algorithms such as Adam see widespread use in Deep Learning.
However, these methods rely on diagonal approximations of the preconditioner, losing much …
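(For context: a rough NumPy sketch of the general idea of adapting per-coordinate step sizes in a Kronecker-factored eigenbasis for one linear layer: rotate the gradient into the eigenbases of the two Kronecker factors, rescale by a second-moment estimate held in that basis, and rotate back. All names and arguments are illustrative assumptions, not the paper's algorithm.)

```python
import numpy as np

def kfe_scaled_gradient(grad_W, A, G, second_moment, eps=1e-8):
    """Rescale one linear layer's gradient in the Kronecker-factored eigenbasis.

    A: (d_in, d_in) activation covariance factor, G: (d_out, d_out) gradient
    covariance factor, second_moment: (d_out, d_in) running second-moment
    estimate maintained in the eigenbasis. All names are illustrative."""
    _, UA = np.linalg.eigh(A)              # eigenvectors of the input-side factor
    _, UG = np.linalg.eigh(G)              # eigenvectors of the output-side factor
    g_eig = UG.T @ grad_W @ UA             # rotate the gradient into the eigenbasis
    g_eig /= np.sqrt(second_moment) + eps  # per-coordinate adaptive scaling
    return UG @ g_eig @ UA.T               # rotate back to parameter coordinates
```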