Efficient memory management for large language model serving with PagedAttention
High throughput serving of large language models (LLMs) requires batching sufficiently
many requests at a time. However, existing systems struggle because the key-value cache …
Deep learning workload scheduling in GPU datacenters: A survey
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …
AlpaServe: Statistical multiplexing with model parallelism for deep learning serving
Model parallelism is conventionally viewed as a method to scale a single large deep
learning model beyond the memory limits of a single device. In this paper, we demonstrate …
InfiniGen: Efficient generative inference of large language models with dynamic KV cache management
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …
Fast distributed inference serving for large language models
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …
dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
Low-rank adaptation (LoRA) is a popular approach to finetune pre-trained large language
models (LLMs) to specific domains. This paper introduces dLoRA, an inference serving …
Characterization of large language model development in the datacenter
Large Language Models (LLMs) have presented impressive performance across several
transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster …
SpotServe: Serving generative large language models on preemptible instances
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …
Power-aware Deep Learning Model Serving with μ-Serve
With the increasing popularity of large deep learning model-serving workloads, there is a
pressing need to reduce the energy consumption of a model-serving cluster while …
ServerlessLLM: Low-latency serverless inference for large language models
This paper presents ServerlessLLM, a distributed system designed to support low-latency
serverless inference for Large Language Models (LLMs). By harnessing the substantial near …