Efficient memory management for large language model serving with PagedAttention

W Kwon, Z Li, S Zhuang, Y Sheng, L Zheng… - Proceedings of the 29th …, 2023 - dl.acm.org
High throughput serving of large language models (LLMs) requires batching sufficiently
many requests at a time. However, existing systems struggle because the key-value cache …

Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

Z Li, L Zheng, Y Zhong, V Liu, Y Sheng, X Jin… - … USENIX Symposium on …, 2023 - usenix.org
Model parallelism is conventionally viewed as a method to scale a single large deep
learning model beyond the memory limits of a single device. In this paper, we demonstrate …

InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

W Lee, J Lee, J Seo, J Sim - 18th USENIX Symposium on Operating …, 2024 - usenix.org
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, S Liu, F Liu, Y Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …

dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving

B Wu, R Zhu, Z Zhang, P Sun, X Liu, X Jin - 18th USENIX Symposium on …, 2024 - usenix.org
Low-rank adaptation (LoRA) is a popular approach to finetune pre-trained large language
models (LLMs) to specific domains. This paper introduces dLoRA, an inference serving …

Characterization of large language model development in the datacenter

Q Hu, Z Ye, Z Wang, G Wang, M Zhang… - … USENIX Symposium on …, 2024 - usenix.org
Large Language Models (LLMs) have presented impressive performance across several
transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster …

SpotServe: Serving generative large language models on preemptible instances

X Miao, C Shi, J Duan, X Xi, D Lin, B Cui… - Proceedings of the 29th …, 2024 - dl.acm.org
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …

Power-aware Deep Learning Model Serving with μ-Serve

H Qiu, W Mao, A Patke, S Cui, S Jha, C Wang… - 2024 USENIX Annual …, 2024 - usenix.org
With the increasing popularity of large deep learning model-serving workloads, there is a
pressing need to reduce the energy consumption of a model-serving cluster while …

ServerlessLLM: Low-latency serverless inference for large language models

Y Fu, L Xue, Y Huang, AO Brabete… - … Systems Design and …, 2024 - research.ed.ac.uk
This paper presents ServerlessLLM, a distributed system designed to support low-latency
serverless inference for Large Language Models (LLMs). By harnessing the substantial near …