A survey on model compression for large language models

X Zhu, J Li, Y Liu, C Ma, W Wang - Transactions of the Association for …, 2024 - direct.mit.edu
Large Language Models (LLMs) have successfully transformed natural language processing
tasks. Yet, their large size and high computational needs pose challenges for …

LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since 2023. We specifically …

LLM inference unveiled: Survey and roofline model insights

Z Yuan, Y Shang, Y Zhou, Z Dong, Z Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …

QuaRot: Outlier-free 4-bit inference in rotated LLMs

S Ashkboos, A Mohtashami, ML Croci, B Li… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce QuaRot, a new Quantization scheme based on Rotations, which can
quantize LLMs end-to-end, including all weights, activations, and KV cache, in 4 bits. QuaRot …
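
The enabling fact is computational invariance: inserting an orthogonal rotation Q into a layer leaves its output unchanged, since (xQ)(WQ)^T = xW^T, while spreading outlier channels across all dimensions so low-bit rounding loses less. Below is a minimal NumPy sketch of that effect, using a random QR-based rotation as a stand-in for the Hadamard transforms the paper uses, and naive per-tensor rounding rather than the paper's actual quantizers; the size of the gain depends on how extreme and concentrated the outliers are.

```python
import numpy as np

def quantize_int4(t):
    """Naive symmetric per-tensor 4-bit quantization: round and dequantize."""
    scale = np.abs(t).max() / 7.0
    return np.clip(np.round(t / scale), -8, 7) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W[:, :4] *= 20.0                                  # inject a few outlier columns

Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))  # random orthogonal rotation
x = rng.normal(size=(8, 256))

y_ref = x @ W.T
y_naive = x @ quantize_int4(W).T                  # outliers inflate the scale,
                                                  # crushing the normal columns
y_rot = (x @ Q) @ quantize_int4(W @ Q).T          # (xQ)(WQ)^T == xW^T pre-quantization

print("mean abs error, plain  :", np.abs(y_ref - y_naive).mean())
print("mean abs error, rotated:", np.abs(y_ref - y_rot).mean())
```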

FlashAttention-3: Fast and accurate attention with asynchrony and low-precision

J Shah, G Bikshandi, Y Zhang, V Thakkar… - arXiv preprint arXiv …, 2024 - arxiv.org
Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for
large language models and long-context applications. FlashAttention elaborated an …
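
FlashAttention's underlying trick, which FlashAttention-3 pushes further with warp-specialized asynchrony and FP8 (hardware-specific and not reproducible here), is to stream over K/V in blocks with an online softmax so the full N×N score matrix is never materialized. A single-query NumPy sketch of that recurrence:

```python
import numpy as np

def attention_tiled(q, k, v, block=16):
    """One query against K/V processed block by block with an online softmax;
    only O(block) scores exist at any time, yet the result is exact."""
    d = q.shape[-1]
    m = -np.inf          # running max of scores (numerical stability)
    l = 0.0              # running softmax denominator
    acc = np.zeros(d)    # running weighted sum of values
    for s in range(0, k.shape[0], block):
        scores = q @ k[s:s+block].T / np.sqrt(d)
        m_new = max(m, scores.max())
        correction = np.exp(m - m_new)   # rescale earlier partial results
        p = np.exp(scores - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v[s:s+block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
k, v = rng.normal(size=(128, 32)), rng.normal(size=(128, 32))
q = rng.normal(size=32)

# Reference: full materialized softmax attention.
scores = q @ k.T / np.sqrt(32)
w = np.exp(scores - scores.max()); w /= w.sum()
assert np.allclose(attention_tiled(q, k, v), w @ v)
```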

Efficiently Programming Large Language Models using SGLang

L Zheng, L Yin, Z Xie, J Huang, C Sun, CH Yu, S Cao… - 2023 - par.nsf.gov
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …
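
For flavor, here is a small program in the style of SGLang's published frontend examples; the decorator, generation, and state-access names are quoted from memory and may not match the current release exactly:

```python
import sglang as sgl

@sgl.function
def qa_pipeline(s, question):
    s += sgl.user(question)
    # First generation call: draft an answer.
    s += sgl.assistant(sgl.gen("draft", max_tokens=128))
    # Ordinary Python control flow interleaves with generation calls;
    # the runtime reuses the shared prompt prefix across them (RadixAttention).
    if "not sure" in s["draft"]:
        s += sgl.user("Reconsider and give your best answer.")
        s += sgl.assistant(sgl.gen("final", max_tokens=128))

# Assumes a running backend, e.g. sgl.set_default_backend(sgl.RuntimeEndpoint(...)).
state = qa_pipeline.run(question="Why is KV-cache reuse important?")
```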

LLM maybe LongLM: Self-extend LLM context window without tuning

H Jin, X Han, J Yang, Z Jiang, Z Liu, CY Chang… - arXiv preprint arXiv …, 2024 - arxiv.org
This work elicits LLMs' inherent ability to handle long contexts without fine-tuning. The
limited sequence length used during training may restrict the application of Large …
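
The remap the paper proposes is simple: relative positions inside a neighbor window stay exact, while more distant positions are floor-divided by a group size (and shifted so the two regimes join up), keeping every position the model sees within its pretrained range. A sketch of that mapping; the window and group values below are illustrative:

```python
def self_extend_rel_pos(i, j, window=1024, group=8):
    """Relative position for query i attending to key j (j <= i):
    exact inside the neighbor window, floor-divided (grouped) beyond it,
    shifted so the two regimes are continuous, following Self-Extend."""
    d = i - j
    if d < window:
        return d
    return (i // group) - (j // group) + window - window // group

# With a 2048-token pretraining limit, a key 8191 tokens away still maps
# to relative position 1919, i.e. within the trained range.
print(self_extend_rel_pos(8191, 0))
```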

MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention

H Jiang, Y Li, C Zhang, Q Wu, X Luo, S Ahn… - arXiv preprint arXiv …, 2024 - arxiv.org
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …
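
In the paper, each attention head is classified into one of a few sparse patterns (A-shape, vertical-slash, block-sparse) whose concrete indices are estimated on the fly and computed with optimized kernels. The NumPy sketch below shows only the general idea of dynamic block selection for a single query; the mean-pooled-key proxy score is a simplification, not the paper's estimator:

```python
import numpy as np

def block_sparse_attention(q, k, v, block=64, keep=4):
    """Pick the most promising K/V blocks with a cheap pooled score, then
    compute exact attention over just those blocks."""
    d = q.shape[-1]
    n_blocks = k.shape[0] // block
    # Proxy importance: query against each block's mean key.
    block_keys = k[:n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    top = np.argsort(q @ block_keys.T)[-keep:]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block)
                          for b in sorted(top)])
    scores = q @ k[idx].T / np.sqrt(d)
    p = np.exp(scores - scores.max())
    return (p / p.sum()) @ v[idx]

rng = np.random.default_rng(0)
k, v = rng.normal(size=(4096, 64)), rng.normal(size=(4096, 64))
q = rng.normal(size=64)
out = block_sparse_attention(q, k, v)   # attends to 256 of 4096 keys
```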

LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference

Z Wan, Z Wu, C Liu, J Huang, Z Zhu, P Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context Multimodal Large Language Models (MLLMs) demand substantial
computational resources for inference as the growth of their multimodal Key-Value (KV) …
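
LOOK-M pairs importance-based eviction with KV merging, so evicted entries are folded into retained ones rather than dropped outright. A simplified sketch, assuming per-entry importance scores (e.g., accumulated attention mass) are already available, and using a plain running-average merge where the paper studies several merging strategies:

```python
import numpy as np

def compress_kv(keys, values, scores, budget):
    """Shrink a KV cache to `budget` entries: keep the highest-scoring ones,
    then merge each evicted entry into its most similar kept key."""
    keep = np.sort(np.argsort(scores)[-budget:])
    evict = np.setdiff1d(np.arange(len(keys)), keep)
    k_new, v_new = keys[keep].copy(), values[keep].copy()
    counts = np.ones(budget)
    for e in evict:
        t = np.argmax(k_new @ keys[e])      # nearest kept key (dot product)
        # Running-average merge of the evicted entry into the kept one.
        k_new[t] = (k_new[t] * counts[t] + keys[e]) / (counts[t] + 1)
        v_new[t] = (v_new[t] * counts[t] + values[e]) / (counts[t] + 1)
        counts[t] += 1
    return k_new, v_new

rng = np.random.default_rng(0)
K, V = rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
k_c, v_c = compress_kv(K, V, scores=rng.random(1000), budget=200)
```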

Mobile edge intelligence for large language models: A contemporary survey

G Qu, Q Chen, W Wei, Z Lin, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
On-device large language models (LLMs), i.e., LLMs run directly on edge devices, have
attracted considerable interest owing to their superior privacy, reduced latency, and bandwidth …