InfiniGen: Efficient generative inference of large language models with dynamic KV cache management
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …
Towards efficient generative large language model serving: A survey from algorithms to systems
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …
Ladder: Enabling efficient low-precision deep learning computing through hardware-aware tensor transformation
The increasing demand for improving deep learning model performance has led to a
paradigm shift in supporting low-precision computation to harness the robustness of deep …
NanoFlow: Towards optimal large language model serving throughput
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand
for planet-scale serving systems, where tens of thousands of GPUs continuously serve …
LLM for mobile: An initial roadmap
When mobile meets LLMs, mobile app users deserve to have more intelligent usage
experiences. For this to happen, we argue that there is a strong need to apply LLMs for the …
Efficient training of large language models on distributed infrastructures: A survey
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …
Aspen: Breaking operator barriers for efficient parallelization of deep neural networks
Modern Deep Neural Network (DNN) frameworks use tensor operators as the main
building blocks of DNNs. However, we observe that operator-based construction of DNNs …
Magis: Memory optimization via coordinated graph transformation and scheduling for DNN
Recently, the memory consumption of Deep Neural Networks (DNNs) has been increasing rapidly, mainly
due to the long lifetimes and large shapes of tensors. Graph scheduling has emerged as an …
Opara: Exploiting operator parallelism for expediting DNN inference on GPUs
A Chen, F Xu, L Han, Y Dong, L Chen… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
GPUs have become the de facto hardware devices for accelerating Deep Neural Network
(DNN) inference workloads. However, the conventional sequential execution mode of DNN …
Unifying KV cache compression for large language models with LeanKV
Y Zhang, Y Hu, R Zhao, J Lui, H Chen - arXiv preprint arXiv:2412.03131, 2024 - arxiv.org
Large language models (LLMs) demonstrate exceptional performance but incur high serving
costs due to substantial memory demands, with the key-value (KV) cache being a primary …
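
Several entries above (InfiniGen, LeanKV) target the key-value (KV) cache, whose footprint grows linearly with both batch size and sequence length. As a rough illustration of why it dominates serving memory, here is a minimal back-of-the-envelope sketch in Python; the model configuration (32 layers, 32 heads, head dimension 128, FP16) is an assumed 7B-class setup, not a figure taken from any of the cited papers.

    # Illustrative KV cache sizing; configuration values are assumptions,
    # loosely modeled on a 7B-class transformer, not taken from the cited papers.
    def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                       seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
        """Two tensors (K and V) per layer, each of shape
        [batch, heads, seq_len, head_dim], at bytes_per_elem bytes (2 for FP16)."""
        return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_elem

    # Assumed setup: 32 layers, 32 heads x 128 dims, 4096-token context, batch of 16.
    size = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128,
                          seq_len=4096, batch_size=16)
    print(f"{size / 2**30:.1f} GiB")  # 32.0 GiB, more than the ~13 GiB of FP16 weights of a 7B model

Under these assumptions the cache alone outgrows the model weights, which is what motivates the dynamic KV cache management (InfiniGen) and compression (LeanKV) approaches listed above.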