LLM inference unveiled: Survey and roofline model insights

Z Yuan, Y Shang, Y Zhou, Z Dong, Z Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …

Beyond efficiency: A systematic survey of resource-efficient large language models

G Bai, Z Chai, C Ling, S Wang, J Lu, N Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated
models like OpenAI's ChatGPT, represents a significant advancement in artificial …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations

L Donisch, S Schacht, C Lanquillon - arXiv preprint arXiv:2408.03130, 2024 - arxiv.org
Large language models are ubiquitous in natural language processing because they can
adapt to new tasks without retraining. However, their sheer scale and complexity present …

Survey of different large language model architectures: Trends, benchmarks, and challenges

M Shao, A Basit, R Karri, M Shafique - IEEE Access, 2024 - ieeexplore.ieee.org
Large Language Models (LLMs) represent a class of deep learning models adept at
understanding natural language and generating coherent text in response to prompts or …

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

X Shen, P Dong, L Lu, Z Kong, Z Li, M Lin… - Proceedings of the …, 2024 - ojs.aaai.org
Large Language Models (LLMs) stand out for their impressive performance in intricate
language modeling tasks. However, their demanding computational and memory needs …

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

X Shen, Z Han, L Lu, Z Kong, P Dong… - … on Computer-Aided …, 2024 - ieeexplore.ieee.org
Large Language Models (LLMs) have become popular and are widely used in creative ways
because of their powerful capabilities. However, the substantial model size and complexity …

What Makes Quantization for Large Language Model Hard? An Empirical Study from the Lens of Perturbation

Z Gong, J Liu, J Wang, X Cai, D Zhao… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Quantization has emerged as a promising technique for improving the memory and
computational efficiency of large language models (LLMs). Though the trade-off between …

Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache

Z Zhang, S Liu, R Chen, B Kailkhura… - Proceedings of …, 2024 - proceedings.mlsys.org
This paper focuses on addressing the substantial memory footprints and bandwidth costs
associated with the deployment of Large Language Models (LLMs). LLMs, characterized by …