On efficient training of large-scale deep learning models: A literature review

L Shen, Y Sun, Z Yu, L Ding, X Tian, D Tao - arXiv preprint arXiv …, 2023 - arxiv.org
The field of deep learning has witnessed significant progress, particularly in computer vision
(CV), natural language processing (NLP), and speech. The use of large-scale models …

Sparse upcycling: Training mixture-of-experts from dense checkpoints

A Komatsuzaki, J Puigcerver, J Lee-Thorp… - arXiv preprint arXiv …, 2022 - arxiv.org
Training large, deep neural networks to convergence can be prohibitively expensive. As a
result, often only a small selection of popular, dense models are reused across different …
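The upcycling idea named in the title — initializing a mixture-of-experts layer from an existing dense checkpoint instead of training it from scratch — can be illustrated with a minimal sketch. The class and parameter names below (DenseFFN, MoEFFN, num_experts) are illustrative assumptions, not the authors' code.

```python
import copy
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """A standard Transformer feed-forward block (the dense checkpoint)."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

class MoEFFN(nn.Module):
    """A mixture-of-experts block whose experts start as copies of a dense FFN."""
    def __init__(self, dense_ffn, num_experts=4, top_k=1):
        super().__init__()
        d_model = dense_ffn.fc1.in_features
        # Upcycling: every expert is initialized from the dense checkpoint's weights.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = torch.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

# Usage: load a pretrained dense block, then continue training the upcycled MoE.
dense = DenseFFN()
moe = MoEFFN(dense, num_experts=4)
tokens = torch.randn(10, 512)
print(moe(tokens).shape)  # torch.Size([10, 512])
```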

Lessons on parameter sharing across layers in transformers

S Takase, S Kiyono - arXiv preprint arXiv:2104.06022, 2021 - arxiv.org
We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The
proposed approach relaxes a widely used technique, which shares parameters for one layer …
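The "widely used technique" the snippet refers to is the scheme in which a single layer's parameters are reused at every depth (as in ALBERT and the Universal Transformer), which the paper then relaxes. A minimal sketch of that shared-layer baseline, with sizes chosen only for illustration:

```python
import torch
import torch.nn as nn

d_model, nhead, num_layers = 512, 8, 6

# Baseline sharing scheme: one encoder layer's parameters reused at every depth,
# giving the compute of num_layers layers but the parameter count of a single layer.
shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

def shared_encoder(x):
    for _ in range(num_layers):
        x = shared_layer(x)  # the same weights are applied repeatedly
    return x

x = torch.randn(2, 16, d_model)
print(shared_encoder(x).shape)  # torch.Size([2, 16, 512])
```

The relaxation studied in the paper assigns a smaller pool of distinct parameter sets to the layers instead of forcing all layers onto one set; the sketch above shows only the fully shared baseline.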

Automated progressive learning for efficient training of vision transformers

C Li, B Zhuang, G Wang, X Liang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recent advances in vision Transformers (ViTs) have come with a voracious appetite for
computing power, highlighting the urgent need to develop efficient training methods for …

On Efficient Training of Large-Scale Deep Learning Models

L Shen, Y Sun, Z Yu, L Ding, X Tian, D Tao - ACM Computing Surveys, 2024 - dl.acm.org
The field of deep learning has witnessed significant progress in recent times, particularly in
areas such as computer vision (CV), natural language processing (NLP), and speech. The …

Multilingual machine translation systems from Microsoft for WMT21 shared task

J Yang, S Ma, H Huang, D Zhang, L Dong… - arXiv preprint arXiv …, 2021 - arxiv.org
This report describes Microsoft's machine translation systems for the WMT21 shared task on
large-scale multilingual machine translation. We participated in all three evaluation tracks …

LightSeq2: Accelerated training for transformer-based models on GPUs

X Wang, Y Wei, Y Xiong, G Huang… - … Conference for High …, 2022 - ieeexplore.ieee.org
Transformer-based neural models are used in many AI applications. Training these models
is expensive, as it requires huge GPU resources and long running time. It is challenging because …

Learning high-dimensional parametric maps via reduced basis adaptive residual networks

T O'Leary-Roseberry, X Du, A Chaudhuri… - Computer Methods in …, 2022 - Elsevier
We propose a scalable framework for the learning of high-dimensional parametric maps via
adaptively constructed residual network (ResNet) maps between reduced bases of the …
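The snippet describes learning maps between reduced bases of high-dimensional input and output spaces with an adaptively constructed ResNet. A minimal, non-adaptive sketch of that structure follows; fixed random orthonormal bases stand in for data-derived reduced bases, and all names and dimensions are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: high-dimensional input/output spaces and reduced ranks.
d_in, d_out, r_in, r_out = 4096, 8192, 64, 64

class ResidualBlock(nn.Module):
    def __init__(self, dim, width=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, width), nn.Tanh(), nn.Linear(width, dim))

    def forward(self, z):
        return z + self.net(z)  # residual update in the reduced coordinates

class ReducedBasisResNet(nn.Module):
    """Map high-dimensional inputs to high-dimensional outputs through fixed reduced bases."""
    def __init__(self, V_in, V_out, depth=4, width=128):
        super().__init__()
        # V_in: (d_in, r_in), V_out: (d_out, r_out) -- e.g., POD/PCA bases.
        self.register_buffer("V_in", V_in)
        self.register_buffer("V_out", V_out)
        self.proj = nn.Linear(V_in.shape[1], V_out.shape[1])
        self.blocks = nn.Sequential(*[ResidualBlock(V_out.shape[1], width) for _ in range(depth)])

    def forward(self, x):                 # x: (batch, d_in)
        z = x @ self.V_in                 # restrict the input to its reduced basis
        z = self.blocks(self.proj(z))     # the ResNet acts only on reduced coordinates
        return z @ self.V_out.T           # lift back to the high-dimensional output space

# Usage with random orthonormal bases as stand-ins for data-derived reduced bases.
V_in, _ = torch.linalg.qr(torch.randn(d_in, r_in))
V_out, _ = torch.linalg.qr(torch.randn(d_out, r_out))
model = ReducedBasisResNet(V_in, V_out)
print(model(torch.randn(5, d_in)).shape)  # torch.Size([5, 8192])
```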

Learning light-weight translation models from deep transformer

B Li, Z Wang, H Liu, Q Du, T Xiao, C Zhang… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
Recently, deep models have shown tremendous improvements in neural machine
translation (NMT). However, systems of this kind are computationally expensive and memory …

An efficient transformer decoder with compressed sub-layers

Y Li, Y Lin, T Xiao, J Zhu - Proceedings of the AAAI Conference on …, 2021 - ojs.aaai.org
The large attention-based encoder-decoder network (Transformer) has recently become prevalent
due to its effectiveness. But the high computational complexity of its decoder raises …