Towards efficient generative large language model serving: A survey from algorithms to systems
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …
Inference without interference: Disaggregate LLM inference for mixed downstream workloads
Transformer-based large language model (LLM) inference serving is now the backbone of
many cloud services. LLM inference consists of a prefill phase and a decode phase …
Model compression and efficient inference for large language models: A survey
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …
BitDelta: Your fine-tune may only be worth one bit
Large Language Models (LLMs) are typically trained in two phases: pre-training on large
internet-scale datasets, and fine-tuning for downstream tasks. Given the higher …
MemServe: Context caching for disaggregated LLM serving with elastic memory pool
Large language model (LLM) serving has transformed from stateless to stateful systems,
utilizing techniques like context caching and disaggregated inference. These optimizations …
Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
Existing methods for adapting large language models (LLMs) to new tasks are not suited to
multi-task adaptation because they modify all the model weights, causing destructive …
P/D-Serve: Serving Disaggregated Large Language Model at Scale
Serving disaggregated large language models (LLMs) over tens of thousands of xPU
devices (GPUs or NPUs) with reliable performance faces multiple challenges. 1) Ignoring …
Demystifying Data Management for Large Language Models
Navigating the intricacies of data management in the era of Large Language Models (LLMs)
presents both challenges and opportunities for database and data management …