A survey on scheduling techniques in computing and network convergence
S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …
Oobleck: Resilient distributed training of large models using pipeline templates
Oobleck enables resilient distributed training of large DNN models with guaranteed fault
tolerance. It takes a planning-execution co-design approach, where it first generates a set of …
Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
S Jayaram Subramanya, D Arfeen, S Lin… - Proceedings of the 29th …, 2023 - dl.acm.org
The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to
elastic resource-adaptive jobs. Although some recent schedulers address one aspect or …
SpotServe: Serving generative large language models on preemptible instances
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …
A survey of resource-efficient LLM and multimodal foundation models
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine …
Gemini: Fast failure recovery in distributed training with in-memory checkpoints
Large deep learning models have recently garnered substantial attention from both
academia and industry. Nonetheless, frequent failures are observed during large model …
SDPipe: A semi-decentralized framework for heterogeneity-aware pipeline-parallel training
The increasing size of both deep learning models and training data necessitates the ability
to scale out model training through pipeline-parallel training, which combines pipelined …
HexGen: Generative inference of foundation model over heterogeneous decentralized environment
Serving foundation model inference is a pivotal component of contemporary AI applications,
where this service is usually hosted in a centralized data center on a group of homogeneous …
Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
Deep neural networks (DNNs) are becoming progressively large and costly to train. This
paper aims to reduce DNN training costs by leveraging preemptible instances on modern …
Training and Serving System of Foundation Models: A Comprehensive Survey
Foundation models (e.g., ChatGPT, DALL-E, PengCheng Mind, PanGu-Σ) have demonstrated
extraordinary performance in key technological areas, such as natural language processing …