Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Inference without interference: Disaggregate LLM inference for mixed downstream workloads

C Hu, H Huang, L Xu, X Chen, J Xu, S Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language model (LLM) inference serving is now the backbone of
many cloud services. LLM inference consists of a prefill phase and a decode phase …

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …

BitDelta: Your fine-tune may only be worth one bit

J Liu, G Xiao, K Li, JD Lee, S Han, T Dao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are typically trained in two phases: pre-training on large
internet-scale datasets, and fine-tuning for downstream tasks. Given the higher …
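
The title's one-bit claim can be sketched in a few lines of numpy (a hedged illustration of the idea, not the authors' code; the mean-absolute scale is one simple choice, and any calibration the paper performs is omitted): keep only the sign of the fine-tune delta plus a single per-matrix scale.

```python
import numpy as np

rng = np.random.default_rng(0)
W_base = rng.normal(size=(4, 4)).astype(np.float32)                 # pretrained
W_ft = W_base + 0.01 * rng.normal(size=(4, 4)).astype(np.float32)   # fine-tuned

delta = W_ft - W_base
sign = np.sign(delta)            # ~1 bit per parameter to store
scale = np.abs(delta).mean()     # one float per matrix (a simple choice)

W_hat = W_base + scale * sign    # reconstructed fine-tuned weights
print("mean |error|:", float(np.abs(W_hat - W_ft).mean()))
```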

MemServe: Context caching for disaggregated LLM serving with elastic memory pool

C Hu, H Huang, J Hu, J Xu, X Chen, T Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model (LLM) serving has transformed from stateless to stateful systems,
utilizing techniques like context caching and disaggregated inference. These optimizations …
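
As a rough illustration of context caching (this sketch and its names, `PrefixKVCache` and `get_or_prefill`, are invented for exposition, not MemServe's interface), a serving layer can key KV caches by prompt prefix so that a repeated prefix skips its prefill work:

```python
import hashlib

class PrefixKVCache:
    def __init__(self):
        self.pool = {}  # prefix hash -> "KV cache" (here: just the token list)

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def get_or_prefill(self, tokens):
        key = self._key(tokens)
        if key in self.pool:
            return self.pool[key], True   # cache hit: prefill skipped
        kv = list(tokens)                 # stand-in for the real KV tensors
        self.pool[key] = kv
        return kv, False

cache = PrefixKVCache()
print(cache.get_or_prefill([1, 2, 3])[1])  # False: computed
print(cache.get_or_prefill([1, 2, 3])[1])  # True: reused
```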

Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs

A Panda, B Isik, X Qi, S Koyejo, T Weissman… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing methods for adapting large language models (LLMs) to new tasks are not suited to
multi-task adaptation because they modify all the model weights, causing destructive …
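
A hedged numpy sketch of that idea (the magnitude-based top-k mask below is one plausible way to pick a sparse "ticket"; it is not necessarily the paper's exact procedure): adapt only a small subset of weights per task, so different tasks' updates stop overwriting the same parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 6)).astype(np.float32)                   # shared base
full_delta = 0.05 * rng.normal(size=(6, 6)).astype(np.float32)   # dense update

# Keep only the k largest-magnitude entries of the update (the "ticket").
k = 5
thresh = np.sort(np.abs(full_delta), axis=None)[-k]
mask = (np.abs(full_delta) >= thresh).astype(np.float32)

W_task = W + mask * full_delta   # sparse, task-specific adaptation
print("weights changed:", int(mask.sum()), "of", W.size)
```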

P/D-Serve: Serving Disaggregated Large Language Model at Scale

Y Jin, T Wang, H Lin, M Song, P Li, Y Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Serving disaggregated large language models (LLMs) over tens of thousands of xPU
devices (GPUs or NPUs) with reliable performance faces multiple challenges. 1) Ignoring …

Demystifying Data Management for Large Language Models

X Miao, Z Jia, B Cui - Companion of the 2024 International Conference …, 2024 - dl.acm.org
Navigating the intricacies of data management in the era of Large Language Models (LLMs)
presents both challenges and opportunities for database and data management …