SpotServe: Serving Generative Large Language Models on Preemptible Instances

X Miao, C Shi, J Duan, X Xi, D Lin, B Cui… - Proceedings of the 29th …, 2024 - dl.acm.org
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …

Metis: Fast Automatic Distributed Training on Heterogeneous GPUs

T Um, B Oh, M Kang, WY Lee, G Kim, D Kim… - 2024 USENIX Annual …, 2024 - usenix.org
As deep learning model sizes expand and new GPUs are released every year, the need for
distributed training on heterogeneous GPUs rises to fully harness under-utilized low-end …

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey

J Duan, S Zhang, Z Wang, L Jiang, W Qu, Q Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …

Sylvie: 3D-Adaptive and Universal System for Large-Scale Graph Neural Network Training

M Zhang, Q Hu, C Wan, H Wang, P Sun… - 2024 IEEE 40th …, 2024 - ieeexplore.ieee.org
Distributed full-graph training of Graph Neural Networks (GNNs) has been widely adopted to
learn large-scale graphs. While recent system advancements can improve the training …

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

Y Jiang, F Fu, X Miao, X Nie, B Cui - arXiv preprint arXiv:2209.13258, 2022 - arxiv.org
Large-scale deep learning models contribute to significant performance improvements on
a variety of downstream tasks. Current data and model parallelism approaches utilize model …

Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management

Y Wang, S Zhu, F Fu, X Miao, J Zhang, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent foundation models are capable of handling multiple machine learning (ML) tasks
and multiple data modalities with a unified base model structure and several specialized …

FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment

R Yan, Y Jiang, W Tao, X Nie, B Cui, B Yuan - arXiv preprint arXiv …, 2024 - arxiv.org
Training a large language model (LLM) is a computationally intensive task, which is typically
conducted in data centers with homogeneous high-performance GPUs. This paper explores …

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

K Cheng, W Hu, Z Wang, H Peng, J Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) iteratively generate text token by token, with memory usage
increasing with the length of generated token sequences. The unpredictability of generation …

Demystifying Data Management for Large Language Models

X Miao, Z Jia, B Cui - Companion of the 2024 International Conference …, 2024 - dl.acm.org
Navigating the intricacies of data management in the era of Large Language Models (LLMs)
presents both challenges and opportunities for database and data management …

Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models

RB Guo, U Anand, A Chen, K Daudjee - arXiv preprint arXiv:2411.01075, 2024 - arxiv.org
Training transformer models requires substantial GPU compute and memory resources. In
homogeneous clusters, distributed strategies allocate resources evenly, but this approach is …