LLM Inference Unveiled: Survey and Roofline Model Insights
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …
Beyond efficiency: A systematic survey of resource-efficient large language models
The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated
models like OpenAI's ChatGPT, represents a significant advancement in artificial …
A survey on efficient inference for large language models
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …
Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations
L Donisch, S Schacht, C Lanquillon - arXiv preprint arXiv:2408.03130, 2024 - arxiv.org
Large language models are ubiquitous in natural language processing because they can
adapt to new tasks without retraining. However, their sheer scale and complexity present …
Survey of different large language model architectures: Trends, benchmarks, and challenges
Large Language Models (LLMs) represent a class of deep learning models adept at
understanding natural language and generating coherent text in response to prompts or …
Model compression and efficient inference for large language models: A survey
Transformer based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …
Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
Large Language Models (LLMs) stand out for their impressive performance in intricate
language modeling tasks. However, their demanding computational and memory needs …
HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models
The Large Language Models (LLMs) have been popular and widely used in creative ways
because of their powerful capabilities. However, the substantial model size and complexity …
What Makes Quantization for Large Language Model Hard? An Empirical Study from the Lens of Perturbation
Quantization has emerged as a promising technique for improving the memory and
computational efficiency of large language models (LLMs). Though the trade-off between …
Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache
This paper focuses on addressing the substantial memory footprints and bandwidth costs
associated with the deployment of Large Language Models (LLMs). LLMs, characterized by …