On efficient training of large-scale deep learning models: A literature review
The field of deep learning has witnessed significant progress, particularly in computer vision
(CV), natural language processing (NLP), and speech. The use of large-scale models …
Sparse upcycling: Training mixture-of-experts from dense checkpoints
Training large, deep neural networks to convergence can be prohibitively expensive. As a
result, often only a small selection of popular, dense models are reused across different …
Lessons on parameter sharing across layers in transformers
We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The
proposed approach relaxes a widely used technique, which shares parameters for one layer …
Automated progressive learning for efficient training of vision transformers
Recent advances in vision Transformers (ViTs) have come with a voracious appetite for
computing power, highlighting the urgent need to develop efficient training methods for …
On Efficient Training of Large-Scale Deep Learning Models
The field of deep learning has witnessed significant progress in recent times, particularly in
areas such as computer vision (CV), natural language processing (NLP), and speech. The …
Multilingual machine translation systems from Microsoft for WMT21 shared task
This report describes Microsoft's machine translation systems for the WMT21 shared task on
large-scale multilingual machine translation. We participated in all three evaluation tracks …
Lightseq2: Accelerated training for transformer-based models on gpus
Transformer-based neural models are used in many AI applications. Training these models
is expensive, as it requires huge GPU resources and takes a long time. It is challenging because …
Learning high-dimensional parametric maps via reduced basis adaptive residual networks
We propose a scalable framework for the learning of high-dimensional parametric maps via
adaptively constructed residual network (ResNet) maps between reduced bases of the …
Learning light-weight translation models from deep transformer
Recently, deep models have shown tremendous improvements in neural machine
translation (NMT). However, systems of this kind are computationally expensive and memory …
An efficient transformer decoder with compressed sub-layers
The large attention-based encoder-decoder network (Transformer) has recently become prevalent
due to its effectiveness. But the high computational complexity of its decoder raises …