CoCa: Contrastive captioners are image-text foundation models

J Yu, Z Wang, V Vasudevan, L Yeung… - arXiv preprint arXiv …, 2022 - arxiv.org
Exploring large-scale pretrained foundation models is of significant interest in computer
vision because these models can be quickly transferred to many downstream tasks. This …

PyTorch FSDP: experiences on scaling fully sharded data parallel

Y Zhao, A Gu, R Varma, L Luo, CC Huang, M Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
It is widely acknowledged that large models have the potential to deliver superior
performance across a broad range of domains. Despite the remarkable progress made in …
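As a rough usage illustration (not drawn from the paper's text, whose snippet is cut off above), the sketch below wraps a toy model in PyTorch's FullyShardedDataParallel so that parameters, gradients, and optimizer state are sharded across data-parallel ranks. The toy model, its sizes, and the torchrun-style launch assumptions are placeholders for the example.

```python
# Minimal sketch, assuming one process per GPU launched via torchrun
# (which sets RANK/LOCAL_RANK/WORLD_SIZE). Not the paper's benchmark code.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Toy model; in practice this would be a large transformer.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # FSDP shards the wrapped module's parameters across ranks and
    # gathers them on the fly for forward and backward passes.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```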

Efficient large-scale language model training on GPU clusters using Megatron-LM

D Narayanan, M Shoeybi, J Casper… - Proceedings of the …, 2021 - dl.acm.org
Large language models have led to state-of-the-art accuracies across several tasks.
However, training these models efficiently is challenging because: a) GPU memory capacity …

BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition

Y Zhang, DS Park, W Han, J Qin… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
We summarize the results of a host of efforts using giant automatic speech recognition (ASR)
models pre-trained using large, diverse unlabeled datasets containing approximately a …

Efficient large scale language modeling with mixtures of experts

M Artetxe, S Bhosale, N Goyal, T Mihaylov… - arXiv preprint arXiv …, 2021 - arxiv.org
Mixture of Experts layers (MoEs) enable efficient scaling of language models through
conditional computation. This paper presents a detailed empirical study of how …
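To make the idea of conditional computation concrete, here is a minimal sketch of a top-1 gated Mixture-of-Experts layer: each token is routed to a single expert, so only a fraction of the layer's parameters is active per token. The class name, expert architecture, and sizes are illustrative choices, not the paper's implementation.

```python
# Illustrative top-1 MoE layer; expert count and widths are arbitrary.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)      # routing probabilities
        weight, expert_idx = scores.max(dim=-1)    # top-1 routing decision
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():                         # run only the chosen expert
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route 16 tokens of width 64 through 4 experts.
moe = Top1MoE(d_model=64, d_hidden=256, num_experts=4)
y = moe(torch.randn(16, 64))
```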

CommonCanvas: Open Diffusion Models Trained on Creative-Commons Images

A Gokaslan, AF Cooper, J Collins… - Proceedings of the …, 2024 - openaccess.thecvf.com
We train a set of open text-to-image (T2I) diffusion models on a dataset of curated Creative-
Commons-licensed (CC) images, which yields models that are competitive with Stable …

Galvatron: Efficient transformer training over multiple GPUs using automatic parallelism

X Miao, Y Wang, Y Jiang, C Shi, X Nie, H Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org
Transformer models have achieved state-of-the-art performance across various application
domains and have gradually become the foundation of advanced large deep learning …

Scalable second order optimization for deep learning

R Anil, V Gupta, T Koren, K Regan, Y Singer - arXiv preprint arXiv …, 2020 - arxiv.org
Optimization in machine learning, both theoretical and applied, is presently dominated by
first-order gradient methods such as stochastic gradient descent. Second-order optimization …
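As a toy illustration of why curvature information helps, and not the paper's distributed second-order algorithm, the sketch below compares a plain gradient step with a Newton step on an ill-conditioned quadratic; the matrix, learning rate, and iteration count are arbitrary.

```python
# Plain SGD versus a Hessian-preconditioned (Newton) update on
# f(x) = 0.5 * x^T A x, whose gradient is A x and Hessian is A.
import numpy as np

A = np.diag([100.0, 1.0])           # ill-conditioned quadratic
grad = lambda x: A @ x

x_sgd = np.array([1.0, 1.0])
x_newton = np.array([1.0, 1.0])
lr = 0.009                          # must stay below 2/100 for the steep axis

for _ in range(50):
    x_sgd -= lr * grad(x_sgd)                      # first-order step
    x_newton -= np.linalg.inv(A) @ grad(x_newton)  # second-order step

print("SGD iterate:   ", x_sgd)     # still far from 0 along the flat axis
print("Newton iterate:", x_newton)  # reaches the optimum (0, 0) in one step
```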

SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization

M Zhai, J He, Z Ma, Z Zong, R Zhang… - 2023 USENIX Annual …, 2023 - usenix.org
Deep neural networks are growing larger in pursuit of stronger model capability, consuming enormous
computational resources to train. Sparsely activated models have been increasingly …

Mobile edge intelligence for large language models: A contemporary survey

G Qu, Q Chen, W Wei, Z Lin, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
On-device large language models (LLMs), i.e., LLMs that run on edge devices, have
attracted considerable interest owing to their superior privacy, reduced latency, and bandwidth …