SANCUS: staleness-aware communication-avoiding full-graph decentralized training in large-scale graph neural networks

J Peng, Z Chen, Y Shao, Y Shen, L Chen… - Proceedings of the VLDB …, 2022 - dl.acm.org
Graph neural networks (GNNs) have emerged due to their success at modeling graph data.
Yet, it is challenging for GNNs to efficiently scale to large graphs. Thus, distributed GNNs …

BlindFL: Vertical federated machine learning without peeking into your data

F Fu, H Xue, Y Cheng, Y Tao, B Cui - Proceedings of the 2022 …, 2022 - dl.acm.org
Due to rising concerns about privacy protection, how to build machine learning (ML) models
over different data sources with security guarantees is attracting growing attention. Vertical …

SpotServe: Serving generative large language models on preemptible instances

X Miao, C Shi, J Duan, X Xi, D Lin, B Cui… - Proceedings of the 29th …, 2024 - dl.acm.org
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …

Galvatron: Efficient transformer training over multiple gpus using automatic parallelism

X Miao, Y Wang, Y Jiang, C Shi, X Nie, H Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org
Transformer models have achieved state-of-the-art performance across various application
domains and have gradually become the foundation of advanced large deep learning …

HET: scaling out huge embedding model training via cache-enabled distributed framework

X Miao, H Zhang, Y Shi, X Nie, Z Yang, Y Tao… - arXiv preprint arXiv …, 2021 - arxiv.org
Embedding models have been an effective learning paradigm for high-dimensional data.
However, one open issue of embedding models is that their representations (latent factors) …

Flash-LLM: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity

H Xia, Z Zheng, Y Li, D Zhuang, Z Zhou, X Qiu… - arXiv preprint arXiv …, 2023 - arxiv.org
With the rapid growth of parameter sizes, it becomes increasingly challenging to deploy large
generative models, as they typically require large amounts of GPU memory and massive …

DeAR: Accelerating distributed deep learning with fine-grained all-reduce pipelining

L Zhang, S Shi, X Chu, W Wang, B Li… - 2023 IEEE 43rd …, 2023 - ieeexplore.ieee.org
Communication scheduling has been shown to be effective in accelerating distributed
training, as it enables all-reduce communications to be overlapped with backpropagation …
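
The overlap mentioned in this snippet — launching each gradient's all-reduce as soon as backpropagation produces it — can be illustrated with a minimal PyTorch sketch. This is a generic illustration of the idea, not DeAR's fine-grained pipelining; the helper name attach_overlapped_allreduce is hypothetical, and the sketch assumes an already-initialized torch.distributed process group and PyTorch >= 2.1.

    # Minimal sketch: overlap gradient all-reduce with the backward pass.
    # Assumes dist.init_process_group(...) has already been called.
    import torch
    import torch.distributed as dist

    def attach_overlapped_allreduce(model):
        handles = []

        def hook(param):
            # Fires once this parameter's gradient is fully accumulated during
            # backward; launch its all-reduce without blocking the rest of backprop.
            handles.append(
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
            )

        for p in model.parameters():
            if p.requires_grad:
                p.register_post_accumulate_grad_hook(hook)  # PyTorch >= 2.1

        def finish():
            # Wait only for transfers still in flight, then average the sums.
            world_size = dist.get_world_size()
            for h in handles:
                h.wait()
            for p in model.parameters():
                if p.grad is not None:
                    p.grad.div_(world_size)
            handles.clear()

        return finish

    # Hypothetical usage per training step:
    #   finish = attach_overlapped_allreduce(model)  # once, after model creation
    #   loss.backward()                              # all-reduces start during backward
    #   finish()                                     # block only on remaining transfers
    #   optimizer.step()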

SDPipe: A semi-decentralized framework for heterogeneity-aware pipeline-parallel training

X Miao, Y Shi, Z Yang, B Cui, Z Jia - Proceedings of the VLDB …, 2023 - dl.acm.org
The increasing size of both deep learning models and training data necessitates the ability
to scale out model training through pipeline-parallel training, which combines pipelined …

HET-GMP: A graph-based system approach to scaling large embedding model training

X Miao, Y Shi, H Zhang, X Zhang, X Nie… - Proceedings of the …, 2022 - dl.acm.org
Embedding models have been recognized as an effective learning paradigm for high-
dimensional data. However, a major obstacle in embedding model training is that updating …

BladeDISC: Optimizing dynamic shape machine learning workloads via compiler approach

Z Zheng, Z Pan, D Wang, K Zhu, W Zhao… - Proceedings of the …, 2023 - dl.acm.org
Compiler optimization plays an increasingly important role in boosting the performance of
machine learning models for data processing and management. With increasingly complex …