Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

A survey on scheduling techniques in computing and network convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demands of massive applications have led to the ubiquitous deployment of
computing power. This trend creates an urgent need for higher-level computing resource …

Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs

R Gu, K Zhang, Z Xu, Y Che, B Fan… - 2022 IEEE 38th …, 2022 - ieeexplore.ieee.org
Nowadays, it is prevalent to train deep learning (DL) models in cloud-native platforms that
actively leverage containerization and orchestration technologies for high elasticity, low and …

KungFu: Making training in distributed machine learning adaptive

L Mai, G Li, M Wagenländer, K Fertakis… - … USENIX Symposium on …, 2020 - usenix.org
When using distributed machine learning (ML) systems to train models on a cluster of worker
machines, users must configure a large number of parameters: hyper-parameters (e.g., the …

Gemini: Fast failure recovery in distributed training with in-memory checkpoints

Z Wang, Z Jia, S Zheng, Z Zhang, X Fu… - Proceedings of the 29th …, 2023 - dl.acm.org
Large deep learning models have recently garnered substantial attention from both
academia and industry. Nonetheless, frequent failures are observed during large model …

Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) has flourished in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

ElasticFlow: An elastic serverless training platform for distributed deep learning

D Gu, Y Zhao, Y Zhong, Y Xiong, Z Han… - Proceedings of the 28th …, 2023 - dl.acm.org
This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep
learning. ElasticFlow provides a serverless interface with two distinct features: (i) users …

Rubberband: cloud-based hyperparameter tuning

U Misra, R Liaw, L Dunlap, R Bhardwaj… - Proceedings of the …, 2021 - dl.acm.org
Hyperparameter tuning is essential to achieving state-of-the-art accuracy in machine
learning (ML), but requires substantial compute resources to perform. Existing systems …

EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs

M Li, W Xiao, H Yang, B Sun, H Zhao, S Ren… - Proceedings of the …, 2023 - dl.acm.org
Distributed synchronized GPU training is commonly used for deep learning. The resource
constraint of using a fixed number of GPUs makes large-scale training jobs suffer from long …

Unicron: Economizing self-healing LLM training at scale

T He, X Li, Z Wang, K Qian, J Xu, W Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
Training large-scale language models is increasingly critical in various domains, but it is
hindered by frequent failures, leading to significant time and economic costs. Current failure …