Efficient memory management for large language model serving with PagedAttention

W Kwon, Z Li, S Zhuang, Y Sheng, L Zheng… - Proceedings of the 29th …, 2023 - dl.acm.org
High throughput serving of large language models (LLMs) requires batching sufficiently
many requests at a time. However, existing systems struggle because the key-value cache …

Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

Z Li, L Zheng, Y Zhong, V Liu, Y Sheng, X Jin… - … USENIX Symposium on …, 2023 - usenix.org
Model parallelism is conventionally viewed as a method to scale a single large deep
learning model beyond the memory limits of a single device. In this paper, we demonstrate …

InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

W Lee, J Lee, J Seo, J Sim - 18th USENIX Symposium on Operating …, 2024 - usenix.org
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, S Liu, F Liu, Y Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …

dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving

B Wu, R Zhu, Z Zhang, P Sun, X Liu, X Jin - 18th USENIX Symposium on …, 2024 - usenix.org
Low-rank adaptation (LoRA) is a popular approach to finetune pre-trained large language
models (LLMs) to specific domains. This paper introduces dLoRA, an inference serving …

Characterization of large language model development in the datacenter

Q Hu, Z Ye, Z Wang, G Wang, M Zhang… - … USENIX Symposium on …, 2024 - usenix.org
Large Language Models (LLMs) have presented impressive performance across several
transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster …

SpotServe: Serving generative large language models on preemptible instances

X Miao, C Shi, J Duan, X Xi, D Lin, B Cui… - Proceedings of the 29th …, 2024 - dl.acm.org
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …

Power-aware Deep Learning Model Serving with μ-Serve

H Qiu, W Mao, A Patke, S Cui, S Jha, C Wang… - 2024 USENIX Annual …, 2024 - usenix.org
With the increasing popularity of large deep learning model-serving workloads, there is a
pressing need to reduce the energy consumption of a model-serving cluster while …

ServerlessLLM: Low-latency serverless inference for large language models

Y Fu, L Xue, Y Huang, AO Brabete… - … Systems Design and …, 2024 - research.ed.ac.uk
This paper presents ServerlessLLM, a distributed system designed to support low-latency
serverless inference for Large Language Models (LLMs). By harnessing the substantial near …