Can we remove the square-root in adaptive gradient methods? A second-order perspective

W Lin, F Dangel, R Eschenhagen, J Bae… - arXiv preprint arXiv …, 2024 - arxiv.org
Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep
learning architectures, such as transformers. Their diagonal preconditioner is based on the …
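The snippet contrasts Adam(W)'s diagonal, square-root-based preconditioner with a square-root-free, second-order-style alternative. As a rough illustrative sketch only (the function name, hyperparameters, and omission of bias correction are assumptions here, not the paper's algorithm), the difference amounts to dividing by v rather than sqrt(v):

```python
import numpy as np

def sqrt_free_adaptive_step(param, grad, m, v, lr=1e-3,
                            beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of a hypothetical square-root-free variant of Adam's
    diagonal preconditioner (illustrative sketch, bias correction omitted)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # diagonal second-moment estimate
    # Adam divides by sqrt(v) + eps; dropping the square root makes v act
    # more like a diagonal curvature term, so lr and eps need retuning.
    param = param - lr * m / (v + eps)
    return param, m, v
```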

DEPT: Decoupled Embeddings for Pre-training Language Models

A Iacob, L Sani, M Kurmanji, WF Shen, X Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
Language model pre-training benefits from a broader data mixture, which enhances performance
across domains and languages. However, training on such heterogeneous text corpora is …

Training one DeePMD Model in Minutes: a Step towards Online Learning

S Hu, T Zhao, Q Sha, E Li, X Meng, L Liu… - Proceedings of the 29th …, 2024 - dl.acm.org
Neural Network Molecular Dynamics (NNMD) has become a major approach in materials
simulation, as it can speed up molecular dynamics (MD) simulations for thousands of …

Old Optimizer, New Norm: An Anthology

J Bernstein, L Newhouse - arXiv preprint arXiv:2409.20325, 2024 - arxiv.org
Deep learning optimizers are often motivated through a mix of convex and approximate
second-order theory. We select three such methods--Adam, Shampoo and Prodigy--and …

SOAP: Improving and Stabilizing Shampoo using Adam

N Vyas, D Morwani, R Zhao, I Shapira… - arXiv preprint arXiv …, 2024 - arxiv.org
There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning
method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks …
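The snippet describes Shampoo as a higher-order preconditioning method and SOAP as combining it with Adam. As one hedged reading of that combination (the factor decay, per-step eigendecompositions, and omitted bias correction below are assumptions for illustration, not the paper's exact recipe), Adam-style diagonal statistics can be maintained in the eigenbasis of Shampoo-style Kronecker factors for a matrix parameter:

```python
import numpy as np

def soap_like_step(W, G, L, R, m, v, lr=1e-3, betas=(0.9, 0.999),
                   shampoo_beta=0.95, eps=1e-8):
    """Illustrative sketch: Adam-style statistics in the eigenbasis of
    Shampoo-style factors for one matrix parameter W with gradient G."""
    # Shampoo-style factor accumulators (left and right Kronecker factors)
    L = shampoo_beta * L + (1 - shampoo_beta) * G @ G.T
    R = shampoo_beta * R + (1 - shampoo_beta) * G.T @ G
    QL = np.linalg.eigh(L)[1]                    # eigenbasis of left factor
    QR = np.linalg.eigh(R)[1]                    # eigenbasis of right factor
    Gr = QL.T @ G @ QR                           # rotate gradient into eigenbasis
    m = betas[0] * m + (1 - betas[0]) * Gr       # first moment (rotated space)
    v = betas[1] * v + (1 - betas[1]) * Gr ** 2  # second moment (rotated space)
    update = QL @ (m / (np.sqrt(v) + eps)) @ QR.T  # rotate update back
    W = W - lr * update
    return W, L, R, m, v
```

In practice the eigendecompositions would be refreshed only occasionally to keep the per-step cost close to Adam's; the sketch recomputes them every step purely for brevity.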