A survey on model compression for large language models
Large Language Models (LLMs) have successfully transformed natural language processing
tasks. Yet, their large size and high computational needs pose challenges for …
LLM inference serving: Survey of recent advances and opportunities
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since 2023. We specifically …
LLM inference unveiled: Survey and roofline model insights
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …
QuaRot: Outlier-free 4-bit inference in rotated LLMs
We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to
quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits. QuaRot …
FlashAttention-3: Fast and accurate attention with asynchrony and low-precision
Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for
large language models and long-context applications. FlashAttention elaborated an …
Efficiently Programming Large Language Models using SGLang.
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …
LLM Maybe LongLM: Self-extend LLM context window without tuning
This work elicits LLMs' inherent ability to handle long contexts without fine-tuning. The
limited length of training sequences may limit the application of Large …
MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …
LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference
Long-context Multimodal Large Language Models (MLLMs) demand substantial
computational resources for inference as the growth of their multimodal Key-Value (KV) …
Mobile edge intelligence for large language models: A contemporary survey
On-device large language models (LLMs), referring to running LLMs on edge devices, have
raised considerable interest owing to their superior privacy, reduced latency, and bandwidth …