CoCa: Contrastive captioners are image-text foundation models
Exploring large-scale pretrained foundation models is of significant interest in computer
vision because these models can be quickly transferred to many downstream tasks. This …
PyTorch FSDP: experiences on scaling fully sharded data parallel
It is widely acknowledged that large models have the potential to deliver superior
performance across a broad range of domains. Despite the remarkable progress made in …
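As a rough illustration of the fully sharded data parallel technique named in this entry (a minimal sketch, not code from the paper; the model choice and hyperparameters are placeholders), PyTorch exposes FSDP as a wrapper that shards parameters, gradients, and optimizer state across ranks:

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    # assumes a multi-GPU job launched with torchrun, one process per GPU
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Transformer(d_model=512, nhead=8).cuda()
    model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)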
Efficient large-scale language model training on GPU clusters using Megatron-LM
Large language models have led to state-of-the-art accuracies across several tasks.
However, training these models efficiently is challenging because: a) GPU memory capacity …
BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition
We summarize the results of a host of efforts using giant automatic speech recognition (ASR)
models pre-trained using large, diverse unlabeled datasets containing approximately a …
Efficient large-scale language modeling with mixtures of experts
Mixture-of-Experts (MoE) layers enable efficient scaling of language models through
conditional computation. This paper presents a detailed empirical study of how …
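To make the "conditional computation" mentioned in this entry concrete, here is a toy top-1-routed MoE layer (an illustrative sketch with hypothetical names such as TinyMoE, not the cited paper's implementation): each token is routed to a single expert, so the compute per token stays roughly constant as more experts are added.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoE(nn.Module):
        def __init__(self, d_model=64, d_hidden=256, n_experts=4):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)  # gating network
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                             # x: (tokens, d_model)
            gate = F.softmax(self.router(x), dim=-1)      # routing probabilities
            top_p, top_i = gate.max(dim=-1)               # top-1 expert per token
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = top_i == e                         # tokens routed to expert e
                if mask.any():
                    out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
            return out

    moe = TinyMoE()
    tokens = torch.randn(10, 64)
    print(moe(tokens).shape)  # torch.Size([10, 64])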
CommonCanvas: Open Diffusion Models Trained on Creative-Commons Images
We train a set of open text-to-image (T2I) diffusion models on a dataset of curated Creative-
Commons-licensed (CC) images, which yields models that are competitive with Stable …
Galvatron: Efficient transformer training over multiple GPUs using automatic parallelism
Transformer models have achieved state-of-the-art performance across various application
domains and have gradually become the foundation of advanced large deep learning …
Scalable second-order optimization for deep learning
Optimization in machine learning, both theoretical and applied, is presently dominated by
first-order gradient methods such as stochastic gradient descent. Second-order optimization …
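For context on the contrast this abstract draws (my gloss, not text from the paper): a first-order step updates parameters as θ ← θ − η g, using only the gradient g, whereas a second-order step preconditions the gradient with curvature information, θ ← θ − η H⁻¹ g, where H is the Hessian or an approximation to it; the scalability challenge is that forming and inverting such preconditioners is expensive for models with very many parameters.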
SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization
Deep neural networks are growing ever larger in pursuit of stronger model capability, consuming
enormous computational resources to train. Sparsely activated models have been increasingly …
Mobile edge intelligence for large language models: A contemporary survey
On-device large language models (LLMs), i.e., LLMs running directly on edge devices, have
attracted considerable interest owing to their superior privacy, reduced latency, and bandwidth …