Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective
Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep
learning architectures, such as transformers. Their diagonal preconditioner is based on the …
DEPT: Decoupled Embeddings for Pre-training Language Models
Language Model pre-training benefits from a broader data mixture to enhance performance
across domains and languages. However, training on such heterogeneous text corpora is …
Training one DeePMD Model in Minutes: a Step towards Online Learning
S Hu, T Zhao, Q Sha, E Li, X Meng, L Liu… - Proceedings of the 29th …, 2024 - dl.acm.org
Neural Network Molecular Dynamics (NNMD) has become a major approach in material
simulations, which can speed up the molecular dynamics (MD) simulation for thousands of …
Old Optimizer, New Norm: An Anthology
J Bernstein, L Newhouse - arXiv preprint arXiv:2409.20325, 2024 - arxiv.org
Deep learning optimizers are often motivated through a mix of convex and approximate
second-order theory. We select three such methods--Adam, Shampoo and Prodigy--and …
SOAP: Improving and Stabilizing Shampoo using Adam
There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning
method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks …