GPT-NeoX-20B: An open-source autoregressive language model

S Black, S Biderman, E Hallahan, Q Anthony… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model
trained on the Pile, whose weights will be made freely and openly available to the public …
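
Since the abstract emphasizes that the weights are openly released, a minimal sketch of loading them may be useful. This is not from the paper; it assumes the checkpoint is available on the Hugging Face Hub under the commonly used identifier "EleutherAI/gpt-neox-20b" and uses the standard transformers API.

    # Minimal sketch (not from the paper): loading the released weights.
    # The checkpoint name "EleutherAI/gpt-neox-20b" is an assumption here.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")

    inputs = tokenizer("GPT-NeoX-20B is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0]))

Note that materializing a 20B-parameter model requires substantial memory; the call pattern above is the standard one regardless of hardware.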

Rotational equilibrium: How weight decay balances learning across neural networks

A Kosson, B Messmer, M Jaggi - arXiv preprint arXiv:2305.17212, 2023 - arxiv.org
This study investigates how weight decay affects the update behavior of individual neurons
in deep neural networks through a combination of applied analysis and experimentation …
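
As an illustration of the quantity such an analysis looks at (my own sketch, not the paper's method), one can measure the per-step "rotation" of a single neuron's weight vector, i.e. the angle between the vector before and after an update that includes decoupled weight decay. The learning rate and decay values below are arbitrary assumptions.

    # Illustrative sketch: angular update of one neuron's weight vector
    # under a gradient step plus decoupled weight decay.
    import numpy as np

    def angular_update(w_old, w_new):
        """Angle (radians) between successive weight vectors of one neuron."""
        cos = np.dot(w_old, w_new) / (np.linalg.norm(w_old) * np.linalg.norm(w_new))
        return np.arccos(np.clip(cos, -1.0, 1.0))

    rng = np.random.default_rng(0)
    w = rng.normal(size=128)       # toy neuron weights
    grad = rng.normal(size=128)    # toy gradient
    lr, wd = 1e-2, 1e-2            # assumed hyperparameters
    w_new = w - lr * grad - lr * wd * w
    print(angular_update(w, w_new))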

Towards understanding convergence and generalization of AdamW

P Zhou, X Xie, Z Lin, S Yan - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
AdamW modifies Adam by applying a decoupled weight decay to the network weights at each
training iteration. For adaptive algorithms, this decoupled weight decay does not affect …
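
To make the decoupling concrete, here is a minimal sketch of one AdamW step in the standard Loshchilov-Hutter formulation: the decay term is applied directly to the weights and is not rescaled by the adaptive second-moment statistics. Hyperparameter defaults are the usual conventions, not values from this paper.

    # One AdamW step: decay is decoupled from the adaptive rescaling.
    import numpy as np

    def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
        m = b1 * m + (1 - b1) * g          # first-moment estimate
        v = b2 * v + (1 - b2) * g * g      # second-moment estimate
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        # Adaptive gradient step plus decoupled decay (note wd * w is
        # NOT divided by sqrt(v_hat), unlike L2 regularization in Adam):
        w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
        return w, m, v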

Edge NLP for Efficient Machine Translation in Low Connectivity Areas

T Watt, C Chrysoulas, D Gkatzia - … on Internet of Things (WF-IoT …, 2023 - ieeexplore.ieee.org
Machine translation (MT) usually requires connectivity and access to the cloud, which is often
limited in many parts of the world, including hard-to-reach rural areas. Natural language …

A stochastic proximal method for nonsmooth regularized finite sum optimization

D Lakhmiri, D Orban, A Lodi - arXiv preprint arXiv:2206.06531, 2022 - arxiv.org
We consider the problem of training a deep neural network with nonsmooth regularization to
retrieve a sparse and efficient sub-structure. Our regularizer is only assumed to be lower …
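
For orientation, the generic building block in this setting is a stochastic proximal-gradient step: a gradient step on the smooth loss followed by the proximal operator of the regularizer. The sketch below is not the paper's algorithm (its regularizer is far more general); it uses the L1 norm only because its prox is closed-form soft-thresholding and it induces the kind of sparsity the snippet mentions.

    # Generic stochastic proximal-gradient step with an L1 regularizer.
    import numpy as np

    def soft_threshold(x, tau):
        """prox of tau * ||.||_1: shrinks entries toward zero."""
        return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

    def prox_sgd_step(w, stochastic_grad, lr=1e-2, lam=1e-3):
        # Gradient step on the smooth loss, then prox step on the regularizer.
        return soft_threshold(w - lr * stochastic_grad, lr * lam)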