InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

W Lee, J Lee, J Seo, J Sim - 18th USENIX Symposium on Operating …, 2024 - usenix.org
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …
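
The snippet cuts off before the paper's technique, but the object it manages is standard: the per-layer key/value cache that grows with every generated token. A minimal sketch in plain NumPy of that baseline structure (all names illustrative; this is not InfiniGen's algorithm or API):

```python
# Minimal, illustrative per-layer KV cache during autoregressive decoding.
# Systems like InfiniGen manage this structure; the append-and-attend loop
# below is just the vanilla baseline they improve on.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Append-only cache of keys/values for one attention layer."""
    def __init__(self, d_head):
        self.keys = np.empty((0, d_head))    # (seq_len, d_head)
        self.values = np.empty((0, d_head))  # (seq_len, d_head)

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

    def attend(self, q):
        # One decode step: the new query attends over every cached key/value,
        # so memory and attention cost grow linearly with generated length.
        scores = softmax(q @ self.keys.T / np.sqrt(q.shape[-1]))
        return scores @ self.values

# Usage: each generated token appends one (k, v) row and reads the whole cache.
d_head, cache = 64, KVCache(64)
for step in range(8):
    q, k, v = [np.random.randn(1, d_head) for _ in range(3)]
    cache.append(k, v)
    out = cache.attend(q)  # (1, d_head)
```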

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation

L Wang, L Ma, S Cao, Q Zhang, J Xue, Y Shi… - … USENIX Symposium on …, 2024 - usenix.org
The increasing demand for improving deep learning model performance has led to a
paradigm shift in supporting low-precision computation to harness the robustness of deep …
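
Ladder's hardware-aware tensor transformations are not shown in the snippet; as background, here is the generic symmetric int8 quantize/dequantize that low-precision pipelines build on (a textbook sketch, not Ladder's scheme):

```python
# Illustrative symmetric per-tensor int8 quantization -- the kind of
# low-precision representation systems like Ladder target on hardware.
import numpy as np

def quantize_int8(w):
    # One scale for the whole tensor; floor the scale to avoid div-by-zero.
    scale = max(np.abs(w).max(), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()  # small reconstruction error
```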

NanoFlow: Towards optimal large language model serving throughput

K Zhu, Y Zhao, L Zhao, G Zuo, Y Gu, D Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand
for planet-scale serving systems, where tens of thousands of GPUs continuously serve …

LLM for mobile: An initial roadmap

D Chen, Y Liu, M Zhou, Y Zhao, H Wang… - ACM Transactions on …, 2024 - dl.acm.org
When mobile meets LLMs, mobile app users deserve to have more intelligent usage
experiences. For this to happen, we argue that there is a strong need to apply LLMs for the …

Efficient training of large language models on distributed infrastructures: A survey

J Duan, S Zhang, Z Wang, L Jiang, W Qu, Q Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …

Aspen: Breaking operator barriers for efficient parallelization of deep neural networks

J Park, K Bin, G Park, S Ha… - Advances in Neural …, 2024 - proceedings.neurips.cc
Modern Deep Neural Network (DNN) frameworks use tensor operators as the main
building blocks of DNNs. However, we observe that operator-based construction of DNNs …

Magis: Memory optimization via coordinated graph transformation and scheduling for DNN

R Chen, Z Ding, S Zheng, C Zhang, J Leng… - Proceedings of the 29th …, 2024 - dl.acm.org
The memory consumption of Deep Neural Networks (DNNs) has been increasing rapidly, mainly
due to the long lifetimes and large shapes of tensors. Graph scheduling has emerged as an …
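
To make the lifetime point concrete, here is a toy lifetime-based peak-memory calculation (a generic illustration, not Magis's graph transformation or scheduling algorithm):

```python
# Toy illustration of why tensor lifetimes drive memory pressure: peak memory
# of a schedule is the maximum total size of simultaneously-live tensors.

def peak_memory(tensors):
    """tensors: list of (size_bytes, first_use_step, last_use_step)."""
    events = []
    for size, start, end in tensors:
        events.append((start, size))     # allocate at first use
        events.append((end + 1, -size))  # free right after last use
    live, peak = 0, 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

# Same tensor sizes, different schedules: shortening a tensor's lifetime
# (e.g. by rescheduling its consumer earlier) lowers the peak.
long_lived  = [(4 << 20, 0, 9), (4 << 20, 1, 2), (4 << 20, 3, 4)]
short_lived = [(4 << 20, 0, 1), (4 << 20, 2, 3), (4 << 20, 4, 5)]
assert peak_memory(long_lived) > peak_memory(short_lived)
```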

Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs

A Chen, F Xu, L Han, Y Dong, L Chen… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
GPUs have become the de facto hardware devices for accelerating Deep Neural Network
(DNN) inference workloads. However, the conventional sequential execution mode of DNN …
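
The overlap Opara automates can be written by hand with CUDA streams in PyTorch; a sketch under the assumption that the two operators are independent (illustrative only, not Opara's scheduler):

```python
# Hand-rolled version of the idea Opara automates: run independent operators
# on separate CUDA streams instead of the default sequential stream. Real
# systems must also decide which operators are safe and profitable to overlap.
import torch

assert torch.cuda.is_available()  # requires a CUDA GPU
x = torch.randn(4096, 4096, device="cuda")
a = torch.randn(4096, 4096, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
cur = torch.cuda.current_stream()
s1.wait_stream(cur)  # make sure inputs are ready before overlapping work
s2.wait_stream(cur)

with torch.cuda.stream(s1):
    y1 = x @ a          # operator 1
with torch.cuda.stream(s2):
    y2 = torch.relu(x)  # operator 2, independent of operator 1

torch.cuda.synchronize()  # join both streams before consuming the results
out = y1 + y2
```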

Unifying KV cache compression for large language models with LeanKV

Y Zhang, Y Hu, R Zhao, J Lui, H Chen - arXiv preprint arXiv:2412.03131, 2024 - arxiv.org
Large language models (LLMs) demonstrate exceptional performance but incur high serving
costs due to substantial memory demands, with the key-value (KV) cache being a primary …
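
As background for what "KV cache compression" means here, a generic per-token int8 compression of cached keys/values (one building block such systems unify; not LeanKV's actual scheme):

```python
# Generic sketch of one KV-cache compression building block: per-token int8
# quantization of cached keys/values, cutting cache memory roughly 4x vs
# float32 at some accuracy cost.
import numpy as np

def compress_kv(kv):  # kv: (seq_len, d_head) float32
    scales = np.abs(kv).max(axis=1, keepdims=True)   # one scale per token
    scales = np.maximum(scales, 1e-8) / 127.0
    q = np.clip(np.round(kv / scales), -127, 127).astype(np.int8)
    return q, scales

def decompress_kv(q, scales):
    return q.astype(np.float32) * scales

keys = np.random.randn(1024, 64).astype(np.float32)
q, s = compress_kv(keys)
recon = decompress_kv(q, s)  # dequantized on the attention read path
```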