GPT-NeoX-20B: An open-source autoregressive language model

S Black, S Biderman, E Hallahan, Q Anthony… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model
trained on the Pile, whose weights will be made freely and openly available to the public …
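
Since the abstract emphasizes that the weights are openly released, a minimal sketch of loading them may be useful. This is not from the paper; it assumes the checkpoint is available on the Hugging Face Hub under the commonly used identifier "EleutherAI/gpt-neox-20b" and uses the standard transformers API.

    # Minimal sketch (not from the paper): loading the released weights.
    # The checkpoint name "EleutherAI/gpt-neox-20b" is an assumption here.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")

    inputs = tokenizer("GPT-NeoX-20B is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0]))

Note that materializing a 20B-parameter model requires substantial memory; the call pattern above is the standard one regardless of hardware.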

Rotational equilibrium: How weight decay balances learning across neural networks

A Kosson, B Messmer, M Jaggi - arXiv preprint arXiv:2305.17212, 2023 - arxiv.org
This study investigates how weight decay affects the update behavior of individual neurons
in deep neural networks through a combination of applied analysis and experimentation …
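
As an illustration of the quantity such an analysis looks at (my own sketch, not the paper's method), one can measure the per-step "rotation" of a single neuron's weight vector, i.e. the angle between the vector before and after an update that includes decoupled weight decay. The learning rate and decay values below are arbitrary assumptions.

    # Illustrative sketch: angular update of one neuron's weight vector
    # under a gradient step plus decoupled weight decay.
    import numpy as np

    def angular_update(w_old, w_new):
        """Angle (radians) between successive weight vectors of one neuron."""
        cos = np.dot(w_old, w_new) / (np.linalg.norm(w_old) * np.linalg.norm(w_new))
        return np.arccos(np.clip(cos, -1.0, 1.0))

    rng = np.random.default_rng(0)
    w = rng.normal(size=128)       # toy neuron weights
    grad = rng.normal(size=128)    # toy gradient
    lr, wd = 1e-2, 1e-2            # assumed hyperparameters
    w_new = w - lr * grad - lr * wd * w
    print(angular_update(w, w_new))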

Towards understanding convergence and generalization of AdamW

P Zhou, X Xie, Z Lin, S Yan - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
AdamW modifies Adam by applying a decoupled weight decay to the network weights at each
training iteration. For adaptive algorithms, this decoupled weight decay does not affect …
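
To make the decoupling concrete, here is a minimal sketch of one AdamW step in the standard Loshchilov-Hutter formulation: the decay term is applied directly to the weights and is not rescaled by the adaptive second-moment statistics. Hyperparameter defaults are the usual conventions, not values from this paper.

    # One AdamW step: decay is decoupled from the adaptive rescaling.
    import numpy as np

    def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
        m = b1 * m + (1 - b1) * g          # first-moment estimate
        v = b2 * v + (1 - b2) * g * g      # second-moment estimate
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        # Adaptive gradient step plus decoupled decay (note wd * w is
        # NOT divided by sqrt(v_hat), unlike L2 regularization in Adam):
        w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
        return w, m, v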

Edge NLP for Efficient Machine Translation in Low Connectivity Areas

T Watt, C Chrysoulas, D Gkatzia - … on Internet of Things (WF-IoT …, 2023 - ieeexplore.ieee.org
Machine translation (MT) usually requires connectivity and access to the cloud, which is often
limited in many parts of the world, including hard-to-reach rural areas. Natural language …

A stochastic proximal method for nonsmooth regularized finite sum optimization

D Lakhmiri, D Orban, A Lodi - arXiv preprint arXiv:2206.06531, 2022 - arxiv.org
We consider the problem of training a deep neural network with nonsmooth regularization to
retrieve a sparse and efficient sub-structure. Our regularizer is only assumed to be lower …
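
For orientation, the generic building block in this setting is a stochastic proximal-gradient step: a gradient step on the smooth loss followed by the proximal operator of the regularizer. The sketch below is not the paper's algorithm (its regularizer is far more general); it uses the L1 norm only because its prox is closed-form soft-thresholding and it induces the kind of sparsity the snippet mentions.

    # Generic stochastic proximal-gradient step with an L1 regularizer.
    import numpy as np

    def soft_threshold(x, tau):
        """prox of tau * ||.||_1: shrinks entries toward zero."""
        return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

    def prox_sgd_step(w, stochastic_grad, lr=1e-2, lam=1e-3):
        # Gradient step on the smooth loss, then prox step on the regularizer.
        return soft_threshold(w - lr * stochastic_grad, lr * lam)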