InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

W Lee, J Lee, J Seo, J Sim - 18th USENIX Symposium on Operating …, 2024 - usenix.org
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …
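
The snippet cuts off before the paper's technique, but the object it manages is standard: the per-layer key/value cache that grows with every generated token. A minimal sketch in plain NumPy of that baseline structure (all names illustrative; this is not InfiniGen's algorithm or API):

```python
# Minimal, illustrative per-layer KV cache during autoregressive decoding.
# Systems like InfiniGen manage this structure; the append-and-attend loop
# below is just the vanilla baseline they improve on.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Append-only cache of keys/values for one attention layer."""
    def __init__(self, d_head):
        self.keys = np.empty((0, d_head))    # (seq_len, d_head)
        self.values = np.empty((0, d_head))  # (seq_len, d_head)

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

    def attend(self, q):
        # One decode step: the new query attends over every cached key/value,
        # so memory and attention cost grow linearly with generated length.
        scores = softmax(q @ self.keys.T / np.sqrt(q.shape[-1]))
        return scores @ self.values

# Usage: each generated token appends one (k, v) row and reads the whole cache.
d_head, cache = 64, KVCache(64)
for step in range(8):
    q, k, v = [np.random.randn(1, d_head) for _ in range(3)]
    cache.append(k, v)
    out = cache.attend(q)  # (1, d_head)
```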

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation

L Wang, L Ma, S Cao, Q Zhang, J Xue, Y Shi… - … USENIX Symposium on …, 2024 - usenix.org
The increasing demand for improving deep learning model performance has led to a
paradigm shift in supporting low-precision computation to harness the robustness of deep …
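
Ladder's hardware-aware tensor transformations are not shown in the snippet; as background, here is the generic symmetric int8 quantize/dequantize that low-precision pipelines build on (a textbook sketch, not Ladder's scheme):

```python
# Illustrative symmetric per-tensor int8 quantization -- the kind of
# low-precision representation systems like Ladder target on hardware.
import numpy as np

def quantize_int8(w):
    # One scale for the whole tensor; floor the scale to avoid div-by-zero.
    scale = max(np.abs(w).max(), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()  # small reconstruction error
```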

NanoFlow: Towards optimal large language model serving throughput

K Zhu, Y Zhao, L Zhao, G Zuo, Y Gu, D Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand
for planet-scale serving systems, where tens of thousands of GPUs continuously serve …

LLM for mobile: An initial roadmap

D Chen, Y Liu, M Zhou, Y Zhao, H Wang… - ACM Transactions on …, 2024 - dl.acm.org
When mobile meets LLMs, mobile app users deserve to have more intelligent usage
experiences. For this to happen, we argue that there is a strong need to apply LLMs for the …

Efficient training of large language models on distributed infrastructures: A survey

J Duan, S Zhang, Z Wang, L Jiang, W Qu, Q Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …

Aspen: Breaking operator barriers for efficient parallelization of deep neural networks

J Park, K Bin, G Park, S Ha… - Advances in Neural …, 2024 - proceedings.neurips.cc
Modern Deep Neural Network (DNN) frameworks use tensor operators as the main
building blocks of DNNs. However, we observe that operator-based construction of DNNs …

Magis: Memory optimization via coordinated graph transformation and scheduling for DNN

R Chen, Z Ding, S Zheng, C Zhang, J Leng… - Proceedings of the 29th …, 2024 - dl.acm.org
The memory consumption of Deep Neural Networks (DNNs) has been increasing rapidly, mainly
due to the long lifetimes and large shapes of tensors. Graph scheduling has emerged as an …
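
To make the lifetime point concrete, here is a toy lifetime-based peak-memory calculation (a generic illustration, not Magis's graph transformation or scheduling algorithm):

```python
# Toy illustration of why tensor lifetimes drive memory pressure: peak memory
# of a schedule is the maximum total size of simultaneously-live tensors.

def peak_memory(tensors):
    """tensors: list of (size_bytes, first_use_step, last_use_step)."""
    events = []
    for size, start, end in tensors:
        events.append((start, size))     # allocate at first use
        events.append((end + 1, -size))  # free right after last use
    live, peak = 0, 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

# Same tensor sizes, different schedules: shortening a tensor's lifetime
# (e.g. by rescheduling its consumer earlier) lowers the peak.
long_lived  = [(4 << 20, 0, 9), (4 << 20, 1, 2), (4 << 20, 3, 4)]
short_lived = [(4 << 20, 0, 1), (4 << 20, 2, 3), (4 << 20, 4, 5)]
assert peak_memory(long_lived) > peak_memory(short_lived)
```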

Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs

A Chen, F Xu, L Han, Y Dong, L Chen… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
GPUs have become the de facto hardware devices for accelerating Deep Neural Network
(DNN) inference workloads. However, the conventional sequential execution mode of DNN …
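
The overlap Opara automates can be written by hand with CUDA streams in PyTorch; a sketch under the assumption that the two operators are independent (illustrative only, not Opara's scheduler):

```python
# Hand-rolled version of the idea Opara automates: run independent operators
# on separate CUDA streams instead of the default sequential stream. Real
# systems must also decide which operators are safe and profitable to overlap.
import torch

assert torch.cuda.is_available()  # requires a CUDA GPU
x = torch.randn(4096, 4096, device="cuda")
a = torch.randn(4096, 4096, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
cur = torch.cuda.current_stream()
s1.wait_stream(cur)  # make sure inputs are ready before overlapping work
s2.wait_stream(cur)

with torch.cuda.stream(s1):
    y1 = x @ a          # operator 1
with torch.cuda.stream(s2):
    y2 = torch.relu(x)  # operator 2, independent of operator 1

torch.cuda.synchronize()  # join both streams before consuming the results
out = y1 + y2
```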

Unifying KV cache compression for large language models with LeanKV

Y Zhang, Y Hu, R Zhao, J Lui, H Chen - arXiv preprint arXiv:2412.03131, 2024 - arxiv.org
Large language models (LLMs) demonstrate exceptional performance but incur high serving
costs due to substantial memory demands, with the key-value (KV) cache being a primary …
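
As background for what "KV cache compression" means here, a generic per-token int8 compression of cached keys/values (one building block such systems unify; not LeanKV's actual scheme):

```python
# Generic sketch of one KV-cache compression building block: per-token int8
# quantization of cached keys/values, cutting cache memory roughly 4x vs
# float32 at some accuracy cost.
import numpy as np

def compress_kv(kv):  # kv: (seq_len, d_head) float32
    scales = np.abs(kv).max(axis=1, keepdims=True)   # one scale per token
    scales = np.maximum(scales, 1e-8) / 127.0
    q = np.clip(np.round(kv / scales), -127, 127).astype(np.int8)
    return q, scales

def decompress_kv(q, scales):
    return q.astype(np.float32) * scales

keys = np.random.randn(1024, 64).astype(np.float32)
q, s = compress_kv(keys)
recon = decompress_kv(q, s)  # dequantized on the attention read path
```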