GPT-NeoX-20B: An open-source autoregressive language model
We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model
trained on the Pile, whose weights will be made freely and openly available to the public …
Rotational equilibrium: How weight decay balances learning across neural networks
This study investigates how weight decay affects the update behavior of individual neurons
in deep neural networks through a combination of applied analysis and experimentation …
Towards understanding convergence and generalization of AdamW
AdamW modifies Adam by adding a decoupled weight decay to decay network weights per
training iteration. For adaptive algorithms, this decoupled weight decay does not affect …
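To make the decoupled decay concrete, here is a minimal sketch of one AdamW-style update step. This is not the paper's code; the function name, list-of-scalars representation, and hyperparameter defaults are illustrative assumptions. The key point is that the decay term multiplies the weight directly rather than being folded into the gradient, so it bypasses the second-moment normalization.

```python
import math

def adamw_step(params, grads, m, v, t,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=1e-2):
    """One AdamW update over scalar parameters (illustrative sketch).

    The weight-decay term is applied directly to each weight, decoupled from
    the adaptive gradient step, so it is not rescaled by 1/sqrt(v_hat).
    """
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # first-moment (mean) estimate
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m[i] / (1 - beta1 ** t)              # bias-corrected moments
        v_hat = v[i] / (1 - beta2 ** t)
        adaptive = m_hat / (math.sqrt(v_hat) + eps)  # Adam's adaptive step
        params[i] = p - lr * (adaptive + weight_decay * p)  # decoupled decay
    return params, m, v
```

Note how, in plain Adam with L2 regularization, the decay term would enter through `g` and thus be divided by `sqrt(v_hat)`; here it acts on `p` at full strength regardless of the gradient history.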
Edge NLP for efficient machine translation in low connectivity areas
Machine translation (MT) usually requires connectivity and access to the cloud, which are often
limited in many parts of the world, including hard-to-reach rural areas. Natural language …
A stochastic proximal method for nonsmooth regularized finite sum optimization
We consider the problem of training a deep neural network with nonsmooth regularization to
retrieve a sparse and efficient sub-structure. Our regularizer is only assumed to be lower …
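The proximal machinery this abstract alludes to can be illustrated with the simplest instance: a stochastic proximal-gradient step where the nonsmooth regularizer is the L1 norm, whose proximal operator is soft-thresholding. This is only a sketch under that assumption; the paper's regularizer is more general, and the function names here are hypothetical.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1: shrinks each entry toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def stochastic_prox_step(w, stochastic_grad, lr=0.1, reg=1e-3):
    """One stochastic proximal-gradient step: a gradient step on the smooth
    loss followed by the prox of the nonsmooth regularizer. The thresholding
    zeroes small weights, which is how sparse sub-structures emerge."""
    return soft_threshold(w - lr * stochastic_grad, lr * reg)
```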