Optimization for deep learning: An overview

RY Sun - Journal of the Operations Research Society of China, 2020 - Springer
Optimization is a critical component in deep learning. We think optimization for neural
networks is an interesting topic for theoretical research for various reasons. First, its …

Piecewise linear neural networks and deep learning

Q Tao, L Li, X Huang, X Xi, S Wang… - Nature Reviews Methods …, 2022 - nature.com
As a powerful modelling method, piecewise linear neural networks (PWLNNs) have proven
successful in various fields, most recently in deep learning. To apply PWLNN methods, both …
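For context: a ReLU network is itself piecewise linear, so within each activation region it reduces to a fixed affine map x ↦ Ax + b. A minimal sketch of that fact, assuming PyTorch (the network and the probe point are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

# A ReLU network is piecewise linear: within each activation region it
# acts as a single affine map x -> A x + b. Recover (A, b) at a point x;
# the Jacobian is constant throughout x's linear region.
net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(2)
A = torch.autograd.functional.jacobian(net, x)  # shape (1, 2): local slope
b = net(x) - A @ x                              # local offset
assert torch.allclose(net(x), A @ x + b)
```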

On the opportunities and risks of foundation models

R Bommasani, DA Hudson, E Adeli, R Altman… - arXiv preprint arXiv …, 2021 - arxiv.org
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …

Att3d: Amortized text-to-3d object synthesis

J Lorraine, K Xie, X Zeng, CH Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Text-to-3D modelling has seen exciting progress by combining generative text-to-image
models with image-to-3D methods like Neural Radiance Fields. DreamFusion recently …

Knowledge distillation: A good teacher is patient and consistent

L Beyer, X Zhai, A Royer, L Markeeva… - Proceedings of the …, 2022 - openaccess.thecvf.com
There is a growing discrepancy in computer vision between large-scale models that achieve
state-of-the-art performance and models that are affordable in practical applications. In this …
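The loss being distilled here is the standard softened-softmax objective; the paper's contribution is the training recipe (consistent augmentations for teacher and student, patient long schedules), not a new loss. A minimal sketch of that standard loss, assuming PyTorch; the temperature and the mixing weight alpha are illustrative defaults, and recipes vary on whether to include the hard-label term at all:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard knowledge-distillation loss: KL divergence between
    temperature-softened teacher and student distributions, mixed with
    cross-entropy on the ground-truth labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps soft-target gradients on the usual scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```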

Sophia: A scalable stochastic second-order optimizer for language model pre-training

H Liu, Z Li, D Hall, P Liang, T Ma - arXiv preprint arXiv:2305.14342, 2023 - arxiv.org
Given the massive cost of language model pre-training, a non-trivial improvement of the
optimization algorithm would lead to a material reduction in the time and cost of training …
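The core of Sophia is a diagonally preconditioned, elementwise-clipped update: an EMA of gradients divided by an EMA of a cheap diagonal Hessian estimate (refreshed only every few steps), with clipping to bound the worst-case step. A hedged sketch of a single parameter update, assuming PyTorch; hyperparameter names follow the paper's notation, values are illustrative:

```python
import torch

@torch.no_grad()
def sophia_update(param, grad, m, h, lr=1e-4, beta1=0.96, rho=0.04, eps=1e-12):
    """One Sophia-style step (sketch). m: EMA of gradients. h: EMA of a
    diagonal Hessian estimate, maintained elsewhere and refreshed every k
    steps (e.g. with the paper's Gauss-Newton-Bartlett estimator)."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Precondition by the Hessian estimate, then clip elementwise to [-1, 1]
    step = torch.clamp(m / torch.clamp(rho * h, min=eps), -1.0, 1.0)
    param.add_(step, alpha=-lr)
```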

Cramming: Training a language model on a single GPU in one day

J Geiping, T Goldstein - International Conference on …, 2023 - proceedings.mlr.press
Recent trends in language modeling have focused on increasing performance through
scaling, and have resulted in an environment where training language models is out of …

PyHessian: Neural networks through the lens of the Hessian

Z Yao, A Gholami, K Keutzer… - 2020 IEEE international …, 2020 - ieeexplore.ieee.org
We present PyHessian, a new scalable framework that enables fast computation of
Hessian (i.e., second-order derivative) information for deep neural networks. PyHessian …
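The primitive such tools build on is the Hessian-vector product via double backpropagation, which never materialises the full Hessian. A minimal sketch of that primitive, assuming PyTorch (see PyHessian itself for its actual API, which wraps this in eigenvalue, trace, and spectral-density estimators):

```python
import torch

def hessian_vector_product(loss, params, vec):
    """Compute H @ vec by double backprop: differentiate <grad(loss), vec>.
    params and vec are matching lists of tensors; the Hessian is never
    formed explicitly, so this scales to large networks."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params)
```

Power iteration on this product yields the top Hessian eigenvalues, one of the quantities such frameworks report.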

No train no gain: Revisiting efficient training algorithms for transformer-based language models

J Kaddour, O Key, P Nawrot… - Advances in Neural …, 2024 - proceedings.neurips.cc
The computation necessary for training Transformer-based language models has
skyrocketed in recent years. This trend has motivated research on efficient training …

Large-scale differentially private BERT

R Anil, B Ghazi, V Gupta, R Kumar… - arXiv preprint arXiv …, 2021 - arxiv.org
In this work, we study the large-scale pretraining of BERT-Large with differentially private
SGD (DP-SGD). We show that, combined with a careful implementation, scaling up the batch …
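DP-SGD itself is mechanically simple: clip each per-example gradient to a fixed norm, average, and add Gaussian noise calibrated to that clip norm; the paper's subject is making this scale to BERT-Large pretraining with very large batches. A hedged sketch of one step, assuming PyTorch and that per-example gradients are already available (e.g. from a vectorised-gradient library); the function name and defaults are illustrative:

```python
import torch

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.0):
    """One DP-SGD step (sketch): per-example clipping + Gaussian noise.
    per_example_grads: list over examples, each a list of per-parameter grads."""
    n = len(per_example_grads)
    summed = [torch.zeros_like(p) for p in params]
    for grads in per_example_grads:
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        coef = (clip_norm / (norm + 1e-12)).clamp(max=1.0)  # clip to clip_norm
        for s, g in zip(summed, grads):
            s.add_(g * coef)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * (noise_mult * clip_norm)
            p.add_((s + noise) / n, alpha=-lr)
```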