A survey on scheduling techniques in computing and network convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demands of massive applications have led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …

Oobleck: Resilient distributed training of large models using pipeline templates

I Jang, Z Yang, Z Zhang, X Jin… - Proceedings of the 29th …, 2023 - dl.acm.org
Oobleck enables resilient distributed training of large DNN models with guaranteed fault
tolerance. It takes a planning-execution co-design approach, where it first generates a set of …

Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling

S Jayaram Subramanya, D Arfeen, S Lin… - Proceedings of the 29th …, 2023 - dl.acm.org
The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to
elastic resource-adaptive jobs. Although some recent schedulers address one aspect or …

SpotServe: Serving generative large language models on preemptible instances

X Miao, C Shi, J Duan, X Xi, D Lin, B Cui… - Proceedings of the 29th …, 2024 - dl.acm.org
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …

A survey of resource-efficient LLM and multimodal foundation models

M Xu, W Yin, D Cai, R Yi, D Xu, Q Wang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion models, and LLM-based multimodal models, are revolutionizing the entire machine …

Gemini: Fast failure recovery in distributed training with in-memory checkpoints

Z Wang, Z Jia, S Zheng, Z Zhang, X Fu… - Proceedings of the 29th …, 2023 - dl.acm.org
Large deep learning models have recently garnered substantial attention from both
academia and industry. Nonetheless, frequent failures are observed during large model …

SDPipe: A semi-decentralized framework for heterogeneity-aware pipeline-parallel training

X Miao, Y Shi, Z Yang, B Cui, Z Jia - Proceedings of the VLDB …, 2023 - dl.acm.org
The increasing size of both deep learning models and training data necessitates the ability
to scale out model training through pipeline-parallel training, which combines pipelined …

HexGen: Generative inference of foundation model over heterogeneous decentralized environment

Y Jiang, R Yan, X Yao, B Chen, B Yuan - arXiv preprint arXiv:2311.11514, 2023 - arxiv.org
Serving foundation model inference is a pivotal component of contemporary AI applications,
where this service is usually hosted in a centralized data center on a group of homogeneous …

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances

J Duan, Z Song, X Miao, X Xi, D Lin, H Xu… - … USENIX Symposium on …, 2024 - usenix.org
Deep neural networks (DNNs) are becoming progressively large and costly to train. This
paper aims to reduce DNN training costs by leveraging preemptible instances on modern …

Training and Serving System of Foundation Models: A Comprehensive Survey

J Zhou, Y Chen, Z Hong, W Chen, Y Yu… - IEEE Open Journal …, 2024 - ieeexplore.ieee.org
Foundation models (e.g., ChatGPT, DALL-E, PengCheng Mind, PanGu-Σ) have demonstrated
extraordinary performance in key technological areas, such as natural language processing …