On efficient training of large-scale deep learning models: A literature review

L Shen, Y Sun, Z Yu, L Ding, X Tian, D Tao - arXiv preprint arXiv …, 2023 - arxiv.org
The field of deep learning has witnessed significant progress, particularly in computer vision
(CV), natural language processing (NLP), and speech. The use of large-scale models …

Sparse upcycling: Training mixture-of-experts from dense checkpoints

A Komatsuzaki, J Puigcerver, J Lee-Thorp… - arXiv preprint arXiv …, 2022 - arxiv.org
Training large, deep neural networks to convergence can be prohibitively expensive. As a
result, often only a small selection of popular, dense models are reused across different …
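The upcycling idea named in the title — initializing a mixture-of-experts layer from an existing dense checkpoint instead of training it from scratch — can be illustrated with a minimal sketch. The class and parameter names below (DenseFFN, MoEFFN, num_experts) are illustrative assumptions, not the authors' code.

```python
import copy
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """A standard Transformer feed-forward block (the dense checkpoint)."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

class MoEFFN(nn.Module):
    """A mixture-of-experts block whose experts start as copies of a dense FFN."""
    def __init__(self, dense_ffn, num_experts=4, top_k=1):
        super().__init__()
        d_model = dense_ffn.fc1.in_features
        # Upcycling: every expert is initialized from the dense checkpoint's weights.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = torch.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

# Usage: load a pretrained dense block, then continue training the upcycled MoE.
dense = DenseFFN()
moe = MoEFFN(dense, num_experts=4)
tokens = torch.randn(10, 512)
print(moe(tokens).shape)  # torch.Size([10, 512])
```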

Lessons on parameter sharing across layers in transformers

S Takase, S Kiyono - arXiv preprint arXiv:2104.06022, 2021 - arxiv.org
We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The
proposed approach relaxes a widely used technique, which shares parameters for one layer …
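The "widely used technique" the snippet refers to is the scheme in which a single layer's parameters are reused at every depth (as in ALBERT and the Universal Transformer), which the paper then relaxes. A minimal sketch of that shared-layer baseline, with sizes chosen only for illustration:

```python
import torch
import torch.nn as nn

d_model, nhead, num_layers = 512, 8, 6

# Baseline sharing scheme: one encoder layer's parameters reused at every depth,
# giving the compute of num_layers layers but the parameter count of a single layer.
shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

def shared_encoder(x):
    for _ in range(num_layers):
        x = shared_layer(x)  # the same weights are applied repeatedly
    return x

x = torch.randn(2, 16, d_model)
print(shared_encoder(x).shape)  # torch.Size([2, 16, 512])
```

The relaxation studied in the paper assigns a smaller pool of distinct parameter sets to the layers instead of forcing all layers onto one set; the sketch above shows only the fully shared baseline.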

Automated progressive learning for efficient training of vision transformers

C Li, B Zhuang, G Wang, X Liang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recent advances in vision Transformers (ViTs) have come with a voracious appetite for
computing power, highlighting the urgent need to develop efficient training methods for …

On Efficient Training of Large-Scale Deep Learning Models

L Shen, Y Sun, Z Yu, L Ding, X Tian, D Tao - ACM Computing Surveys, 2024 - dl.acm.org
The field of deep learning has witnessed significant progress in recent times, particularly in
areas such as computer vision (CV), natural language processing (NLP), and speech. The …

Multilingual machine translation systems from Microsoft for WMT21 shared task

J Yang, S Ma, H Huang, D Zhang, L Dong… - arXiv preprint arXiv …, 2021 - arxiv.org
This report describes Microsoft's machine translation systems for the WMT21 shared task on
large-scale multilingual machine translation. We participated in all three evaluation tracks …

LightSeq2: Accelerated training for transformer-based models on GPUs

X Wang, Y Wei, Y Xiong, G Huang… - … Conference for High …, 2022 - ieeexplore.ieee.org
Transformer-based neural models are used in many AI applications. Training these models
is expensive, as it requires huge GPU resources and long running time. It is challenging because …

Learning high-dimensional parametric maps via reduced basis adaptive residual networks

T O'Leary-Roseberry, X Du, A Chaudhuri… - Computer Methods in …, 2022 - Elsevier
We propose a scalable framework for the learning of high-dimensional parametric maps via
adaptively constructed residual network (ResNet) maps between reduced bases of the …
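The snippet describes learning maps between reduced bases of high-dimensional input and output spaces with an adaptively constructed ResNet. A minimal, non-adaptive sketch of that structure follows; fixed random orthonormal bases stand in for data-derived reduced bases, and all names and dimensions are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: high-dimensional input/output spaces and reduced ranks.
d_in, d_out, r_in, r_out = 4096, 8192, 64, 64

class ResidualBlock(nn.Module):
    def __init__(self, dim, width=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, width), nn.Tanh(), nn.Linear(width, dim))

    def forward(self, z):
        return z + self.net(z)  # residual update in the reduced coordinates

class ReducedBasisResNet(nn.Module):
    """Map high-dimensional inputs to high-dimensional outputs through fixed reduced bases."""
    def __init__(self, V_in, V_out, depth=4, width=128):
        super().__init__()
        # V_in: (d_in, r_in), V_out: (d_out, r_out) -- e.g., POD/PCA bases.
        self.register_buffer("V_in", V_in)
        self.register_buffer("V_out", V_out)
        self.proj = nn.Linear(V_in.shape[1], V_out.shape[1])
        self.blocks = nn.Sequential(*[ResidualBlock(V_out.shape[1], width) for _ in range(depth)])

    def forward(self, x):                 # x: (batch, d_in)
        z = x @ self.V_in                 # restrict the input to its reduced basis
        z = self.blocks(self.proj(z))     # the ResNet acts only on reduced coordinates
        return z @ self.V_out.T           # lift back to the high-dimensional output space

# Usage with random orthonormal bases as stand-ins for data-derived reduced bases.
V_in, _ = torch.linalg.qr(torch.randn(d_in, r_in))
V_out, _ = torch.linalg.qr(torch.randn(d_out, r_out))
model = ReducedBasisResNet(V_in, V_out)
print(model(torch.randn(5, d_in)).shape)  # torch.Size([5, 8192])
```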

Learning light-weight translation models from deep transformer

B Li, Z Wang, H Liu, Q Du, T Xiao, C Zhang… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
Recently, deep models have shown tremendous improvements in neural machine
translation (NMT). However, systems of this kind are computationally expensive and memory …

An efficient transformer decoder with compressed sub-layers

Y Li, Y Lin, T Xiao, J Zhu - Proceedings of the AAAI Conference on …, 2021 - ojs.aaai.org
The large attention-based encoder-decoder network (Transformer) has recently become prevalent
due to its effectiveness. But the high computational complexity of its decoder raises …