Deep learning workload scheduling in GPU datacenters: A survey
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …
A survey on scheduling techniques in computing and network convergence
S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …
Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs
Nowadays, it is prevalent to train deep learning (DL) models in cloud-native platforms that
actively leverage containerization and orchestration technologies for high elasticity, low and …
KungFu: Making training in distributed machine learning adaptive
When using distributed machine learning (ML) systems to train models on a cluster of worker
machines, users must configure a large number of parameters: hyper-parameters (e.g., the …
Gemini: Fast failure recovery in distributed training with in-memory checkpoints
Large deep learning models have recently garnered substantial attention from both
academia and industry. Nonetheless, frequent failures are observed during large model …
Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …
ElasticFlow: An elastic serverless training platform for distributed deep learning
This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep
learning. ElasticFlow provides a serverless interface with two distinct features: (i) users …
Rubberband: Cloud-based hyperparameter tuning
Hyperparameter tuning is essential to achieving state-of-the-art accuracy in machine
learning (ML), but requires substantial compute resources to perform. Existing systems …
EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs
Distributed synchronized GPU training is commonly used for deep learning. The resource
constraint of using a fixed number of GPUs makes large-scale training jobs suffer from long …
Unicron: Economizing self-healing LLM training at scale
Training large-scale language models is increasingly critical in various domains, but it is
hindered by frequent failures, leading to significant time and economic costs. Current failure …