CoCa: Contrastive captioners are image-text foundation models

J Yu, Z Wang, V Vasudevan, L Yeung… - arXiv preprint arXiv …, 2022 - arxiv.org
Exploring large-scale pretrained foundation models is of significant interest in computer
vision because these models can be quickly transferred to many downstream tasks. This …

PyTorch FSDP: experiences on scaling fully sharded data parallel

Y Zhao, A Gu, R Varma, L Luo, CC Huang, M Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
It is widely acknowledged that large models have the potential to deliver superior
performance across a broad range of domains. Despite the remarkable progress made in …
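As a rough usage illustration (not drawn from the paper's text, whose snippet is cut off above), the sketch below wraps a toy model in PyTorch's FullyShardedDataParallel so that parameters, gradients, and optimizer state are sharded across data-parallel ranks. The toy model, its sizes, and the torchrun-style launch assumptions are placeholders for the example.

```python
# Minimal sketch, assuming one process per GPU launched via torchrun
# (which sets RANK/LOCAL_RANK/WORLD_SIZE). Not the paper's benchmark code.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Toy model; in practice this would be a large transformer.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # FSDP shards the wrapped module's parameters across ranks and
    # gathers them on the fly for forward and backward passes.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```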

Efficient large-scale language model training on GPU clusters using Megatron-LM

D Narayanan, M Shoeybi, J Casper… - Proceedings of the …, 2021 - dl.acm.org
Large language models have led to state-of-the-art accuracies across several tasks.
However, training these models efficiently is challenging because: a) GPU memory capacity …

BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition

Y Zhang, DS Park, W Han, J Qin… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
We summarize the results of a host of efforts using giant automatic speech recognition (ASR)
models pre-trained using large, diverse unlabeled datasets containing approximately a …

Efficient large scale language modeling with mixtures of experts

M Artetxe, S Bhosale, N Goyal, T Mihaylov… - arXiv preprint arXiv …, 2021 - arxiv.org
Mixture of Experts layers (MoEs) enable efficient scaling of language models through
conditional computation. This paper presents a detailed empirical study of how …
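To make the idea of conditional computation concrete, here is a minimal sketch of a top-1 gated Mixture-of-Experts layer: each token is routed to a single expert, so only a fraction of the layer's parameters is active per token. The class name, expert architecture, and sizes are illustrative choices, not the paper's implementation.

```python
# Illustrative top-1 MoE layer; expert count and widths are arbitrary.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)      # routing probabilities
        weight, expert_idx = scores.max(dim=-1)    # top-1 routing decision
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():                         # run only the chosen expert
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route 16 tokens of width 64 through 4 experts.
moe = Top1MoE(d_model=64, d_hidden=256, num_experts=4)
y = moe(torch.randn(16, 64))
```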

CommonCanvas: Open Diffusion Models Trained on Creative-Commons Images

A Gokaslan, AF Cooper, J Collins… - Proceedings of the …, 2024 - openaccess.thecvf.com
We train a set of open text-to-image (T2I) diffusion models on a dataset of curated Creative-
Commons-licensed (CC) images, which yields models that are competitive with Stable …

Galvatron: Efficient transformer training over multiple GPUs using automatic parallelism

X Miao, Y Wang, Y Jiang, C Shi, X Nie, H Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org
Transformer models have achieved state-of-the-art performance across various application
domains and have gradually become the foundation of advanced large deep learning …

Scalable second order optimization for deep learning

R Anil, V Gupta, T Koren, K Regan, Y Singer - arXiv preprint arXiv …, 2020 - arxiv.org
Optimization in machine learning, both theoretical and applied, is presently dominated by
first-order gradient methods such as stochastic gradient descent. Second-order optimization …
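As a toy illustration of why curvature information helps, and not the paper's distributed second-order algorithm, the sketch below compares a plain gradient step with a Newton step on an ill-conditioned quadratic; the matrix, learning rate, and iteration count are arbitrary.

```python
# Plain SGD versus a Hessian-preconditioned (Newton) update on
# f(x) = 0.5 * x^T A x, whose gradient is A x and Hessian is A.
import numpy as np

A = np.diag([100.0, 1.0])           # ill-conditioned quadratic
grad = lambda x: A @ x

x_sgd = np.array([1.0, 1.0])
x_newton = np.array([1.0, 1.0])
lr = 0.009                          # must stay below 2/100 for the steep axis

for _ in range(50):
    x_sgd -= lr * grad(x_sgd)                      # first-order step
    x_newton -= np.linalg.inv(A) @ grad(x_newton)  # second-order step

print("SGD iterate:   ", x_sgd)     # still far from 0 along the flat axis
print("Newton iterate:", x_newton)  # reaches the optimum (0, 0) in one step
```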

SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization

M Zhai, J He, Z Ma, Z Zong, R Zhang… - 2023 USENIX Annual …, 2023 - usenix.org
Deep neural networks are growing larger in pursuit of stronger model capability, consuming enormous
computational resources to train. Sparsely activated models have been increasingly …

Mobile edge intelligence for large language models: A contemporary survey

G Qu, Q Chen, W Wei, Z Lin, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
On-device large language models (LLMs), i.e., LLMs that run on edge devices, have
attracted considerable interest owing to their superior privacy, reduced latency, and bandwidth …