Can we remove the square-root in adaptive gradient methods? A second-order perspective

W Lin, F Dangel, R Eschenhagen, J Bae… - arXiv preprint arXiv …, 2024 - arxiv.org
Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep
learning architectures, such as transformers. Their diagonal preconditioner is based on the …
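The snippet contrasts Adam(W)'s diagonal, square-root-based preconditioner with a square-root-free, second-order-style alternative. As a rough illustrative sketch only (the function name, hyperparameters, and omission of bias correction are assumptions here, not the paper's algorithm), the difference amounts to dividing by v rather than sqrt(v):

```python
import numpy as np

def sqrt_free_adaptive_step(param, grad, m, v, lr=1e-3,
                            beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of a hypothetical square-root-free variant of Adam's
    diagonal preconditioner (illustrative sketch, bias correction omitted)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # diagonal second-moment estimate
    # Adam divides by sqrt(v) + eps; dropping the square root makes v act
    # more like a diagonal curvature term, so lr and eps need retuning.
    param = param - lr * m / (v + eps)
    return param, m, v
```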

DEPT: Decoupled Embeddings for Pre-training Language Models

A Iacob, L Sani, M Kurmanji, WF Shen, X Qiu… - arXiv preprint arXiv …, 2024 - arxiv.org
Language model pre-training benefits from a broader data mixture, which enhances performance
across domains and languages. However, training on such heterogeneous text corpora is …

Training one DeePMD Model in Minutes: a Step towards Online Learning

S Hu, T Zhao, Q Sha, E Li, X Meng, L Liu… - Proceedings of the 29th …, 2024 - dl.acm.org
Neural Network Molecular Dynamics (NNMD) has become a major approach in materials
simulation, as it can speed up molecular dynamics (MD) simulations for thousands of …

Old Optimizer, New Norm: An Anthology

J Bernstein, L Newhouse - arXiv preprint arXiv:2409.20325, 2024 - arxiv.org
Deep learning optimizers are often motivated through a mix of convex and approximate
second-order theory. We select three such methods--Adam, Shampoo and Prodigy--and …

SOAP: Improving and Stabilizing Shampoo using Adam

N Vyas, D Morwani, R Zhao, I Shapira… - arXiv preprint arXiv …, 2024 - arxiv.org
There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning
method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks …
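The snippet describes Shampoo as a higher-order preconditioning method and SOAP as combining it with Adam. As one hedged reading of that combination (the factor decay, per-step eigendecompositions, and omitted bias correction below are assumptions for illustration, not the paper's exact recipe), Adam-style diagonal statistics can be maintained in the eigenbasis of Shampoo-style Kronecker factors for a matrix parameter:

```python
import numpy as np

def soap_like_step(W, G, L, R, m, v, lr=1e-3, betas=(0.9, 0.999),
                   shampoo_beta=0.95, eps=1e-8):
    """Illustrative sketch: Adam-style statistics in the eigenbasis of
    Shampoo-style factors for one matrix parameter W with gradient G."""
    # Shampoo-style factor accumulators (left and right Kronecker factors)
    L = shampoo_beta * L + (1 - shampoo_beta) * G @ G.T
    R = shampoo_beta * R + (1 - shampoo_beta) * G.T @ G
    QL = np.linalg.eigh(L)[1]                    # eigenbasis of left factor
    QR = np.linalg.eigh(R)[1]                    # eigenbasis of right factor
    Gr = QL.T @ G @ QR                           # rotate gradient into eigenbasis
    m = betas[0] * m + (1 - betas[0]) * Gr       # first moment (rotated space)
    v = betas[1] * v + (1 - betas[1]) * Gr ** 2  # second moment (rotated space)
    update = QL @ (m / (np.sqrt(v) + eps)) @ QR.T  # rotate update back
    W = W - lr * update
    return W, L, R, m, v
```

In practice the eigendecompositions would be refreshed only occasionally to keep the per-step cost close to Adam's; the sketch recomputes them every step purely for brevity.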