PyTorch FSDP: experiences on scaling fully sharded data parallel

Y Zhao, A Gu, R Varma, L Luo, CC Huang, M Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
It is widely acknowledged that large models have the potential to deliver superior
performance across a broad range of domains. Despite the remarkable progress made in …
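
As a minimal sketch of the API the paper describes, a model is wrapped in FullyShardedDataParallel so that parameters, gradients, and optimizer state are sharded across ranks (the toy model, process-group setup, and hyperparameters below are illustrative, not the paper's configuration):

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    # Assumes launch via torchrun, which sets RANK/WORLD_SIZE for us.
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Stand-in for a large model; real use cases wrap transformer blocks.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # FSDP shards parameters across ranks and gathers full weights only
    # around each submodule's forward/backward computation.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    loss = model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()  # gradients are reduce-scattered back into shards
    optim.step()     # each rank updates only its own parameter shard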

Alpa: Automating inter- and intra-operator parallelism for distributed deep learning

L Zheng, Z Li, H Zhang, Y Zhuang, Z Chen… - … USENIX Symposium on …, 2022 - usenix.org
Alpa automates model-parallel training of large deep learning (DL) models by generating
execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel …
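
Alpa's entry point is a decorator over a JAX step function; a minimal sketch follows (alpa.parallelize is used as in the project's published examples, while the model, data, and learning rate are illustrative):

    import alpa
    import jax
    import jax.numpy as jnp

    # Alpa compiles train_step into an execution plan that mixes data,
    # operator, and pipeline parallelism across the available devices.
    @alpa.parallelize
    def train_step(params, batch):
        def loss_fn(p):
            preds = batch["x"] @ p["w"] + p["b"]
            return jnp.mean((preds - batch["y"]) ** 2)
        grads = jax.grad(loss_fn)(params)
        # Plain SGD update; Alpa decides how tensors are sharded.
        return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)

    params = {"w": jnp.zeros((512, 10)), "b": jnp.zeros(10)}
    batch = {"x": jnp.ones((64, 512)), "y": jnp.ones((64, 10))}
    params = train_step(params, batch)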

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, G Huang, X Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low job …

Ekko: A large-scale deep learning recommender system with low-latency model update

C Sima, Y Fu, MK Sit, L Guo, X Gong, F Lin… - … USENIX Symposium on …, 2022 - usenix.org
Deep Learning Recommender Systems (DLRSs) need to update models at low latency, thus
promptly serving new users and content. Existing DLRSs, however, fail to do so. They …

DRIVE: One-bit distributed mean estimation

S Vargaftik, R Ben-Basat, A Portnoy… - Advances in …, 2021 - proceedings.neurips.cc
We consider the problem where $n$ clients transmit $d$-dimensional real-valued vectors
using $d(1+o(1))$ bits each, in a manner that allows the receiver to approximately …
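
A simplified sketch in the paper's spirit (not its exact algorithm): apply a shared random rotation, send one sign bit per coordinate plus a single scale, and let the receiver undo the rotation. The cost is d bits plus one float, i.e. $d(1+o(1))$ bits per client:

    import numpy as np

    def random_rotation(d, seed):
        # Shared pseudo-random orthogonal matrix (QR of a Gaussian matrix);
        # practical schemes use fast structured rotations instead.
        rng = np.random.default_rng(seed)
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        return q

    def encode(x, R):
        y = R @ x
        scale = np.abs(y).mean()   # one float: the L2-optimal scale for sign(y)
        return scale, np.sign(y)   # plus d one-bit entries

    def decode(scale, signs, R):
        return R.T @ (scale * signs)  # undo the rotation (R is orthogonal)

    d, n = 64, 8
    R = random_rotation(d, seed=0)
    clients = [np.random.default_rng(i + 1).standard_normal(d) for i in range(n)]
    mean_est = np.mean([decode(*encode(x, R), R) for x in clients], axis=0)
    true_mean = np.mean(clients, axis=0)
    print(np.linalg.norm(mean_est - true_mean) / np.linalg.norm(true_mean))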

Graft: Efficient inference serving for hybrid deep learning with SLO guarantees via DNN re-alignment

J Wu, L Wang, Q Jin, F Liu - IEEE Transactions on Parallel and …, 2023 - ieeexplore.ieee.org
Deep neural networks (DNNs) have been widely adopted for various mobile inference tasks,
yet their ever-increasing computational demands are hindering their deployment on …

DRAGONN: Distributed randomized approximate gradients of neural networks

Z Wang, Z Xu, X Wu, A Shrivastava… - … on Machine Learning, 2022 - proceedings.mlr.press
Data-parallel distributed training (DDT) has become the de facto standard for accelerating
the training of most deep learning tasks on massively parallel hardware. In the DDT …
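
For context, a minimal sketch of the gradient synchronization at the heart of DDT, with a generic randomized sparsifier standing in for the paper's hash-based compression (the keep-probability and gradient sizes are illustrative):

    import numpy as np

    def randomized_sparsify(grad, p, rng):
        # Keep each coordinate with probability p and rescale by 1/p,
        # so the compressed gradient is an unbiased estimate of grad.
        mask = rng.random(grad.shape) < p
        return np.where(mask, grad / p, 0.0)

    rng = np.random.default_rng(0)
    worker_grads = [rng.standard_normal(10_000) for _ in range(4)]

    # Each worker ships a sparse approximation; the mean stands in for
    # the all-reduce that synchronizes gradients in a real DDT system.
    approx = np.mean([randomized_sparsify(g, 0.1, rng) for g in worker_grads], axis=0)
    exact = np.mean(worker_grads, axis=0)
    # Unbiased but noisy: correlation rises with p and with worker count.
    print(np.corrcoef(approx, exact)[0, 1])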

PervasiveFL: Pervasive federated learning for heterogeneous IoT systems

J Xia, T Liu, Z Ling, T Wang, X Fu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Federated learning (FL) has been recognized as a promising collaborative on-device
machine learning method in the design of Internet of Things (IoT) systems. However, most …

Hi-speed DNN training with Espresso: Unleashing the full potential of gradient compression with near-optimal usage strategies

Z Wang, H Lin, Y Zhu, TSE Ng - Proceedings of the Eighteenth …, 2023 - dl.acm.org
Gradient compression (GC) is a promising approach to addressing the communication
bottleneck in distributed deep learning (DDL). It saves communication time, but also …
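
Top-k sparsification is a representative GC operator; an illustrative sketch of the compress/decompress pair whose placement and cost Espresso reasons about (the 1% density is arbitrary, and this snippet does not model Espresso's strategy selection itself):

    import torch

    def topk_compress(grad, k):
        # Keep only the k largest-magnitude entries; ship (values, indices).
        _, idx = torch.topk(grad.abs().flatten(), k)
        return grad.flatten()[idx], idx, grad.shape

    def topk_decompress(values, idx, shape):
        out = torch.zeros(shape).flatten()
        out[idx] = values
        return out.reshape(shape)

    g = torch.randn(1024, 1024)
    vals, idx, shape = topk_compress(g, k=g.numel() // 100)  # ~1% density
    g_hat = topk_decompress(vals, idx, shape)
    print((g - g_hat).norm() / g.norm())  # relative error the compressor leaves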

High dimensional statistical estimation under uniformly dithered one-bit quantization

J Chen, CL Wang, MK Ng… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
In this paper, we propose a uniformly dithered 1-bit quantization scheme for
high-dimensional statistical estimation. The scheme contains truncation, dithering, and …
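
The construction can be made concrete: truncate each entry to $[-\tau, \tau]$, add uniform dither $u \sim U[-\tau, \tau]$, and keep only the sign. Then $\tau \cdot \mathrm{sign}(x+u)$ is an unbiased estimate of the truncated value, since $P(x+u>0) = (\tau+x)/(2\tau)$. A sketch under those assumptions ($\tau$ and the data are illustrative):

    import numpy as np

    def dithered_one_bit(x, tau, rng):
        xt = np.clip(x, -tau, tau)                 # truncation
        u = rng.uniform(-tau, tau, size=x.shape)   # uniform dither
        bits = np.sign(xt + u)                     # the only transmitted bit
        return tau * bits                          # unbiased estimate of xt

    rng = np.random.default_rng(0)
    x = rng.standard_normal(100_000)
    tau = 3.0  # illustrative truncation level

    # Averaging many independent 1-bit estimates recovers the truncated signal.
    est = np.mean([dithered_one_bit(x, tau, rng) for _ in range(200)], axis=0)
    print(np.max(np.abs(est - np.clip(x, -tau, tau))))  # shrinks with more repeats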