Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

A survey on scheduling techniques in computing and network convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demands of massive applications have led to the ubiquitous deployment of
computing power. This trend creates an urgent need for higher-level computing resource …

Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs

R Gu, K Zhang, Z Xu, Y Che, B Fan… - 2022 IEEE 38th …, 2022 - ieeexplore.ieee.org
Nowadays, it is prevalent to train deep learning (DL) models in cloud-native platforms that
actively leverage containerization and orchestration technologies for high elasticity, low and …

KungFu: Making training in distributed machine learning adaptive

L Mai, G Li, M Wagenländer, K Fertakis… - … USENIX Symposium on …, 2020 - usenix.org
When using distributed machine learning (ML) systems to train models on a cluster of worker
machines, users must configure a large number of parameters: hyper-parameters (e.g., the …

Gemini: Fast failure recovery in distributed training with in-memory checkpoints

Z Wang, Z Jia, S Zheng, Z Zhang, X Fu… - Proceedings of the 29th …, 2023 - dl.acm.org
Large deep learning models have recently garnered substantial attention from both
academia and industry. Nonetheless, frequent failures are observed during large model …

Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) has flourished in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

ElasticFlow: An elastic serverless training platform for distributed deep learning

D Gu, Y Zhao, Y Zhong, Y Xiong, Z Han… - Proceedings of the 28th …, 2023 - dl.acm.org
This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep
learning. ElasticFlow provides a serverless interface with two distinct features: (i) users …

Rubberband: cloud-based hyperparameter tuning

U Misra, R Liaw, L Dunlap, R Bhardwaj… - Proceedings of the …, 2021 - dl.acm.org
Hyperparameter tuning is essential to achieving state-of-the-art accuracy in machine
learning (ML), but requires substantial compute resources to perform. Existing systems …

EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs

M Li, W Xiao, H Yang, B Sun, H Zhao, S Ren… - Proceedings of the …, 2023 - dl.acm.org
Distributed synchronized GPU training is commonly used for deep learning. The resource
constraint of using a fixed number of GPUs makes large-scale training jobs suffer from long …

Unicron: Economizing self-healing LLM training at scale

T He, X Li, Z Wang, K Qian, J Xu, W Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
Training large-scale language models is increasingly critical in various domains, but it is
hindered by frequent failures, leading to significant time and economic costs. Current failure …