A survey on scheduling techniques in computing and network convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demands of massive applications have led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …

Oobleck: Resilient distributed training of large models using pipeline templates

I Jang, Z Yang, Z Zhang, X Jin… - Proceedings of the 29th …, 2023 - dl.acm.org
Oobleck enables resilient distributed training of large DNN models with guaranteed fault
tolerance. It takes a planning-execution co-design approach, where it first generates a set of …

Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling

S Jayaram Subramanya, D Arfeen, S Lin… - Proceedings of the 29th …, 2023 - dl.acm.org
The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to
elastic resource-adaptive jobs. Although some recent schedulers address one aspect or …

SpotServe: Serving generative large language models on preemptible instances

X Miao, C Shi, J Duan, X Xi, D Lin, B Cui… - Proceedings of the 29th …, 2024 - dl.acm.org
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …

A survey of resource-efficient LLM and multimodal foundation models

M Xu, W Yin, D Cai, R Yi, D Xu, Q Wang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion models, and LLM-based multimodal models, are revolutionizing the entire machine …

Gemini: Fast failure recovery in distributed training with in-memory checkpoints

Z Wang, Z Jia, S Zheng, Z Zhang, X Fu… - Proceedings of the 29th …, 2023 - dl.acm.org
Large deep learning models have recently garnered substantial attention from both
academia and industry. Nonetheless, frequent failures are observed during large model …

SDPipe: A semi-decentralized framework for heterogeneity-aware pipeline-parallel training

X Miao, Y Shi, Z Yang, B Cui, Z Jia - Proceedings of the VLDB …, 2023 - dl.acm.org
The increasing size of both deep learning models and training data necessitates the ability
to scale out model training through pipeline-parallel training, which combines pipelined …

HexGen: Generative inference of foundation model over heterogeneous decentralized environment

Y Jiang, R Yan, X Yao, B Chen, B Yuan - arXiv preprint arXiv:2311.11514, 2023 - arxiv.org
Serving foundation model inference is a pivotal component of contemporary AI applications,
where this service is usually hosted in a centralized data center on a group of homogeneous …

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances

J Duan, Z Song, X Miao, X Xi, D Lin, H Xu… - … USENIX Symposium on …, 2024 - usenix.org
Deep neural networks (DNNs) are becoming progressively large and costly to train. This
paper aims to reduce DNN training costs by leveraging preemptible instances on modern …

Training and Serving System of Foundation Models: A Comprehensive Survey

J Zhou, Y Chen, Z Hong, W Chen, Y Yu… - IEEE Open Journal …, 2024 - ieeexplore.ieee.org
Foundation models (e.g., ChatGPT, DALL-E, PengCheng Mind, PanGu-Σ) have demonstrated
extraordinary performance in key technological areas, such as natural language processing …