PyTorch FSDP: experiences on scaling fully sharded data parallel
It is widely acknowledged that large models have the potential to deliver superior
performance across a broad range of domains. Despite the remarkable progress made in …
Alpa: Automating inter- and intra-operator parallelism for distributed deep learning
Alpa automates model-parallel training of large deep learning (DL) models by generating
execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel …
Fast distributed inference serving for large language models
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low job …
Ekko: A large-scale deep learning recommender system with low-latency model update
Deep Learning Recommender Systems (DLRSs) need to update models at low latency, thus
promptly serving new users and content. Existing DLRSs, however, fail to do so. They …
DRIVE: One-bit distributed mean estimation
We consider the problem where $n$ clients transmit $d$-dimensional real-valued vectors
using $d(1+o(1))$ bits each, in a manner that allows the receiver to approximately …
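A minimal sketch of the rotate-then-sign-quantize idea behind such one-bit schemes, assuming a shared pseudo-random rotation and a least-squares scale; the helper names and the scale choice here are illustrative assumptions, not DRIVE's exact construction:

```python
import numpy as np

def random_rotation(d, seed):
    # Shared pseudo-random orthogonal matrix (QR of a Gaussian draw);
    # sender and receiver derive it from the same seed.
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def encode(x, R):
    # Rotate, then keep only signs plus one scalar scale: ~1 bit/coordinate.
    z = R @ x
    scale = np.linalg.norm(z, 1) / len(z)  # argmin_c ||z - c*sign(z)||_2
    return np.sign(z), scale

def decode(signs, scale, R):
    # Unrotate the scaled sign vector to estimate the original input.
    return R.T @ (scale * signs)

# The receiver averages the n per-client estimates to approximate the mean.
d, n = 256, 8
R = random_rotation(d, seed=0)
clients = [np.random.default_rng(i + 1).standard_normal(d) for i in range(n)]
est = np.mean([decode(*encode(x, R), R) for x in clients], axis=0)
true_mean = np.mean(clients, axis=0)
print(np.linalg.norm(est - true_mean) / np.linalg.norm(true_mean))
```

Averaging across clients cancels much of the per-client quantization error; the paper's guarantees concern its specific rotation and scaling, which this sketch only approximates.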
Graft: Efficient inference serving for hybrid deep learning with SLO guarantees via DNN re-alignment
Deep neural networks (DNNs) have been widely adopted for various mobile inference tasks,
yet their ever-increasing computational demands are hindering their deployment on …
DRAGONN: Distributed randomized approximate gradients of neural networks
Data-parallel distributed training (DDT) has become the de-facto standard for accelerating
the training of most deep learning tasks on massively parallel hardware. In the DDT …
PervasiveFL: Pervasive federated learning for heterogeneous IoT systems
Federated learning (FL) has been recognized as a promising collaborative on-device
machine learning method in the design of Internet of Things (IoT) systems. However, most …
Hi-speed DNN training with Espresso: Unleashing the full potential of gradient compression with near-optimal usage strategies
Gradient compression (GC) is a promising approach to addressing the communication
bottleneck in distributed deep learning (DDL). It saves communication time, but also …
High dimensional statistical estimation under uniformly dithered one-bit quantization
In this paper, we propose a uniformly dithered 1-bit quantization scheme for high-
dimensional statistical estimation. The scheme contains truncation, dithering, and …
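A small numerical sketch of the truncate-dither-quantize pipeline named above. It relies on the standard dithering identity E[λ·sign(x + u)] = x for u ~ Uniform(−λ, λ) and |x| ≤ λ; the function names and the choice λ = 1 are illustrative assumptions, not the paper's estimators:

```python
import numpy as np

def dithered_one_bit(x, lam, rng):
    # Truncate to [-lam, lam], add Uniform(-lam, lam) dither, keep the sign.
    xt = np.clip(x, -lam, lam)
    u = rng.uniform(-lam, lam, size=x.shape)
    return np.sign(xt + u)

def reconstruct(bits, lam):
    # E[lam * sign(x + u)] = x whenever |x| <= lam, so scaling the bits by
    # lam yields an unbiased (though noisy) estimate of the truncated input.
    return lam * bits

# Averaging many independent 1-bit measurements recovers the signal.
rng = np.random.default_rng(0)
x = np.array([0.3, -0.7, 0.1])
lam = 1.0
est = np.mean([reconstruct(dithered_one_bit(x, lam, rng), lam)
               for _ in range(20000)], axis=0)
print(est)  # close to [0.3, -0.7, 0.1]
```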