A survey on distributed machine learning
J Verbraeken, M Wolting, J Katzy… - ACM Computing Surveys …, 2020 - dl.acm.org
The demand for artificial intelligence has grown significantly over the past decade, and this
growth has been fueled by advances in machine learning techniques and the ability to …
Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools
R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …
PipeDream: Generalized pipeline parallelism for DNN training
DNN training is extremely time-consuming, necessitating efficient multi-accelerator
parallelization. Current approaches to parallelizing training primarily use intra-batch …
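The snippet contrasts intra-batch parallelism with the pipeline parallelism that PipeDream generalizes. As a rough illustration of the pipeline idea only (not PipeDream's actual 1F1B scheduler), the sketch below splits a model into stages and streams microbatches through them so that different stages work on different microbatches in the same "clock tick"; the stage functions and tick loop are assumptions for this toy, single-process simulation.

```python
# Hypothetical sketch of pipeline-parallel scheduling: a model is split into
# stages, a minibatch is split into microbatches, and each stage processes
# microbatch i while the next stage works on microbatch i-1. The "stages"
# here are plain functions and the pipeline is simulated step by step.

def run_pipeline(stages, microbatches):
    """Simulate one forward pass of a stage-partitioned model.

    stages:       list of callables; stages[k] consumes stages[k-1]'s output
    microbatches: list of input chunks (a minibatch split into pieces)
    """
    num_stages = len(stages)
    num_micro = len(microbatches)
    in_flight = [None] * num_stages   # activation sitting at each stage's input
    outputs = []

    # Each tick, every stage that has an input processes one microbatch;
    # on real hardware these per-stage steps run concurrently on different GPUs.
    for tick in range(num_micro + num_stages - 1):
        if tick < num_micro:
            in_flight[0] = microbatches[tick]     # inject the next microbatch
        # Process from the last stage backwards so inputs are not overwritten.
        for k in reversed(range(num_stages)):
            if in_flight[k] is not None:
                result = stages[k](in_flight[k])
                in_flight[k] = None
                if k + 1 < num_stages:
                    in_flight[k + 1] = result
                else:
                    outputs.append(result)
    return outputs


if __name__ == "__main__":
    # Two toy "stages" and four microbatches; real systems place each stage on
    # its own accelerator and also pipeline the backward pass.
    stages = [lambda x: x * 2, lambda x: x + 1]
    print(run_pipeline(stages, [1, 2, 3, 4]))  # [3, 5, 7, 9]
```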
Dorylus: Affordable, scalable, and accurate GNN training with distributed CPU servers and serverless threads
A graph neural network (GNN) enables deep learning on structured graph data. There are
two major obstacles to GNN training: 1) it relies on high-end servers with many GPUs, which …
Gaia: Geo-distributed machine learning approaching LAN speeds
Machine learning (ML) is widely used to derive useful information from large-scale data
(such as user activities, pictures, and videos) generated at increasingly rapid rates, all over …
Cooperative SGD: A unified framework for the design and analysis of local-update SGD algorithms
When training machine learning models using stochastic gradient descent (SGD) with a
large number of nodes or massive edge devices, the communication cost of synchronizing …
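The local-update SGD pattern that Cooperative SGD unifies has each worker take several SGD steps on its own data before the workers average their model copies, reducing how often they must synchronize. The sketch below is a minimal, single-process illustration of that pattern on a toy least-squares problem; the shard construction and the values of tau, num_workers, and lr are assumptions, not settings from the paper.

```python
# Hypothetical sketch of local-update SGD: each worker runs tau local SGD
# steps on its own shard, then all workers average their models, cutting
# communication by roughly a factor of tau versus per-step synchronization.
import numpy as np

rng = np.random.default_rng(0)
num_workers, tau, rounds, lr, dim = 4, 8, 50, 0.05, 5

# Ground-truth weights and per-worker data shards (toy least-squares data).
w_true = rng.normal(size=dim)
shards = []
for _ in range(num_workers):
    X = rng.normal(size=(64, dim))
    y = X @ w_true + 0.01 * rng.normal(size=64)
    shards.append((X, y))

# Every worker starts from the same initial model.
models = [np.zeros(dim) for _ in range(num_workers)]

for _ in range(rounds):
    # Local phase: each worker takes tau SGD steps on minibatches of its shard.
    for k, (X, y) in enumerate(shards):
        w = models[k]
        for _ in range(tau):
            idx = rng.integers(0, len(y), size=8)              # minibatch
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w = w - lr * grad
        models[k] = w
    # Communication phase: average all local models (an all-reduce in practice).
    avg = np.mean(models, axis=0)
    models = [avg.copy() for _ in range(num_workers)]

print("error:", np.linalg.norm(models[0] - w_true))
```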
Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD
Large-scale machine learning training, in particular distributed stochastic gradient descent,
needs to be robust to inherent system variability such as node straggling and random …
Geeps: Scalable deep learning on distributed gpus with a gpu-specialized parameter server
Large-scale deep learning requires huge computational resources to train a multi-layer
neural network. Recent systems propose using 100s to 1000s of machines to train networks …
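GeePS builds on the parameter-server architecture, in which a server holds the canonical parameters and workers repeatedly pull them, compute gradients on local minibatches, and push the gradients back. The sketch below illustrates only that generic pull/push loop in one process; it does not model GeePS's GPU-specialized caching, data placement, or staleness control, and all names and hyperparameters are assumptions.

```python
# Hypothetical sketch of the parameter-server pattern: the server owns the
# parameters; workers pull them, compute gradients, and push updates back.
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()          # worker fetches current parameters

    def push(self, grad):
        self.w -= self.lr * grad      # server applies the gradient update

def worker_gradient(w, X, y):
    # Least-squares gradient on this worker's minibatch.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(1)
dim, num_workers = 5, 3
w_true = rng.normal(size=dim)
server = ParameterServer(dim)

for step in range(200):
    # Round-robin stand-in for workers that would run in parallel processes.
    for k in range(num_workers):
        X = rng.normal(size=(16, dim))
        y = X @ w_true
        grad = worker_gradient(server.pull(), X, y)
        server.push(grad)

print("error:", np.linalg.norm(server.w - w_true))
```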
PipeDream: Fast and efficient pipeline parallel DNN training
PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes
computation by pipelining execution across multiple machines. Its pipeline parallel …
SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping
It is known that deeper and wider neural networks can achieve better accuracy, but it is
difficult to continue this trend of increasing model size due to limited GPU memory. One …
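The underlying idea of tensor swapping is that activations or weights not needed right now can be moved to host memory and brought back before they are used again, letting the working set exceed GPU memory. The toy planner below only tracks sizes and prints a greedy FIFO swap schedule under a memory budget; it is an assumption-laden illustration, not SwapAdvisor's actual scheduling, memory-allocation, and operator-placement search.

```python
# Hypothetical sketch of swap planning: keep activations on the GPU until a
# memory budget is exceeded, then swap out the oldest resident activation to
# host memory. Sizes, budget, and the FIFO policy are illustrative choices.

def plan_swaps(activation_sizes_mb, budget_mb):
    """Return a (event, activation index) schedule under a GPU memory budget."""
    resident = []          # indices of activations currently on the GPU
    used = 0.0
    schedule = []
    for i, size in enumerate(activation_sizes_mb):
        # Evict the oldest resident activations until the new one fits
        # (if a single activation exceeds the budget, it is kept anyway).
        while resident and used + size > budget_mb:
            victim = resident.pop(0)
            used -= activation_sizes_mb[victim]
            schedule.append(("swap_out", victim))
        resident.append(i)
        used += size
        schedule.append(("keep", i))
    return schedule

if __name__ == "__main__":
    # Per-layer activation sizes (MB) for a toy network, with a tight budget.
    sizes = [400, 300, 500, 200, 600]
    for event, idx in plan_swaps(sizes, budget_mb=1000):
        print(f"{event:9s} activation {idx}")
```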