Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding

H Xia, Z Yang, Q Dong, P Wang, Y Li, T Ge… - arXiv preprint arXiv …, 2024 - arxiv.org
To mitigate the high inference latency stemming from autoregressive decoding in Large
Language Models (LLMs), Speculative Decoding has emerged as a novel decoding …
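
Every entry below builds on the same draft-then-verify primitive this survey covers, so a minimal sketch may be useful up front. It assumes per-position distributions from hypothetical target and draft models have already been computed; the acceptance rule is the standard lossless rejection-sampling scheme.

```python
import numpy as np

def verify_draft(target_probs, draft_probs, draft_tokens, rng):
    """Accept drafted token x with probability min(1, p(x)/q(x)); on the
    first rejection, resample from the normalized residual (p - q)+.
    This keeps the output distribution identical to the target model's."""
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)                       # token accepted
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(p), p=residual)))
            break                                 # stop at first rejection
    return out
```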

EAGLE-2: Faster inference of language models with dynamic draft trees

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2406.16858, 2024 - arxiv.org
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …
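
A toy illustration of the dynamic-tree idea: expand the partial draft with the highest cumulative draft probability first, so deep chains grow only where the drafter is confident. `draft_topk` is a hypothetical callable returning the draft model's top-k (token, probability) pairs; EAGLE-2's actual tree construction and reranking are more involved.

```python
import heapq, math

def grow_draft_tree(draft_topk, ctx, budget=8, branch=2):
    """Best-first tree expansion: always extend the partial draft with
    the highest cumulative draft probability. Returns candidate
    branches to verify together in one target forward pass."""
    heap = [(0.0, 0, [])]          # (-cumulative log-prob, tiebreak, tokens)
    counter, branches = 1, []
    while heap and len(branches) < budget:
        neg_lp, _, toks = heapq.heappop(heap)
        if toks:
            branches.append((toks, math.exp(-neg_lp)))
        for tok, p in draft_topk(ctx + toks, branch):
            heapq.heappush(heap, (neg_lp - math.log(p), counter, toks + [tok]))
            counter += 1
    return branches
```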

Break the sequential dependency of LLM inference using lookahead decoding

Y Fu, P Bailis, I Stoica, H Zhang - arXiv preprint arXiv:2402.02057, 2024 - arxiv.org
Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound,
resulting in high latency and significant waste of the parallel processing power of modern …
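
A greedy-decoding sketch of the verification half of lookahead decoding: a cached n-gram guess is checked against the target model in one parallel pass, so several tokens can be committed per forward pass. `target_argmax` and the `ngram_pool` dict are assumptions; harvesting n-grams from Jacobi-iteration trajectories is omitted.

```python
def lookahead_verify(target_argmax, ctx, ngram_pool):
    """Check a cached n-gram guess in one parallel pass. target_argmax
    (assumed) returns, for each prefix of its input, the greedy next
    token, all from a single batched forward pass."""
    guess = ngram_pool.get(ctx[-1], [])
    preds = list(target_argmax(ctx + guess))
    out = [preds[len(ctx) - 1]]            # true next token after ctx
    for i, g in enumerate(guess):
        if g != out[-1]:
            break                           # guess diverged from the model
        out.append(preds[len(ctx) + i])     # frontier advances one token
    return out                              # accepted tokens + 1 bonus token
```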

EAGLE: Speculative sampling requires rethinking feature uncertainty

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2401.15077, 2024 - arxiv.org
Auto-regressive decoding makes the inference of Large Language Models (LLMs) time-
consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater …
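
A sketch of the feature-level drafting EAGLE describes: a small head autoregresses over the target model's penultimate-layer features, and the frozen LM head turns the extrapolated feature into draft logits. The single-linear-layer body and the dimensions are assumptions, not the paper's architecture.

```python
import torch, torch.nn as nn

class FeatureDrafter(nn.Module):
    """EAGLE-style draft head (sketch): predict the next penultimate-
    layer feature from the current feature plus the sampled token's
    embedding, then reuse the frozen target LM head for draft logits."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)   # [feature; token emb] -> next feature

    def forward(self, feature, token, lm_head):
        x = torch.cat([feature, self.embed(token)], dim=-1)
        next_feature = self.fuse(x)            # extrapolated feature
        return next_feature, lm_head(next_feature)
```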

The Mamba in the Llama: Distilling and accelerating hybrid models

J Wang, D Paliotta, A May, AM Rush, T Dao - arXiv preprint arXiv …, 2024 - arxiv.org
Linear RNN architectures, like Mamba, can be competitive with Transformer models in
language modeling while having advantageous deployment characteristics. Given the focus …
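
For context, a toy diagonal linear-RNN step showing the deployment advantage the abstract alludes to: the recurrent state is fixed-size, so per-token memory traffic stays constant, unlike a Transformer KV cache that grows with sequence length. This is not Mamba's actual parameterization, which adds input-dependent gating.

```python
import numpy as np

def rnn_decode_step(A, B, C, h, x):
    """One diagonal linear-RNN step (toy): the state h is fixed-size,
    so per-token memory traffic is O(d) at any sequence length, unlike
    a KV cache that grows with every generated token."""
    h = A * h + B * x        # elementwise state update (A, B, h: shape (d,))
    return h, C @ h          # readout (C: shape (m, d)); x is a scalar input
```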

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …
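
A back-of-envelope illustration (with assumed, purely illustrative numbers) of the memory cost such surveys analyze: at batch size 1, every decoded token must stream all model weights from memory, which caps throughput at bandwidth divided by model size.

```python
# Illustrative numbers only: why batch-1 autoregressive decoding is
# memory-bandwidth bound rather than compute bound.
params = 7e9              # assumed 7B-parameter model
bytes_per_param = 2       # fp16 weights
bandwidth = 1.0e12        # assumed ~1 TB/s accelerator memory bandwidth

weight_bytes = params * bytes_per_param      # streamed once per token
print(f"ceiling: ~{bandwidth / weight_bytes:.0f} tokens/s")   # ~71 tokens/s
```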

Recurrent drafter for fast speculative decoding in large language models

Y Cheng, A Zhang, X Zhang, C Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Recurrent Drafter (ReDrafter), an advanced speculative decoding approach that
achieves state-of-the-art speedup for large language model (LLM) inference. The …
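
A minimal sketch of a recurrent draft head in ReDrafter's spirit: a small RNN drafts several tokens from the target model's last hidden state, carrying its own recurrent state between drafted positions. Layer sizes are assumptions, and greedy drafting stands in for the paper's beam search.

```python
import torch, torch.nn as nn

class RecurrentDraftHead(nn.Module):
    """Drafts `steps` tokens from the target model's hidden state,
    updating a GRU state between drafted positions (sketch)."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRUCell(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def draft(self, hidden, steps=4):
        h, toks = hidden, []
        tok = self.out(h).argmax(-1)           # first drafted token (greedy)
        for _ in range(steps):
            toks.append(tok)
            h = self.rnn(self.embed(tok), h)   # recurrent state update
            tok = self.out(h).argmax(-1)
        return toks
```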

Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding

W Zhao, Y Huang, X Han, W Xu, C Xiao… - Proceedings of the …, 2024 - aclanthology.org
Speculative decoding is a widely used method that accelerates the generation process of
large language models (LLMs) with no compromise in model performance. It achieves this …
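
A sketch of phrase-level draft extension: after the drafter proposes a prefix, whole cached phrases are spliced in so each verification pass can accept several tokens at once. The `phrase_table` layout, a hypothetical {last token: phrase} cache built from previously verified text, is an assumption.

```python
def extend_draft_with_phrases(draft, phrase_table, max_len=16):
    """Splice cached multi-token phrases onto the draft, keyed by the
    last drafted token; the target model verifies the result later."""
    while len(draft) < max_len:
        phrase = phrase_table.get(draft[-1])
        if not phrase:
            break                    # no cached continuation available
        draft = draft + phrase       # candidate phrase, O(1) drafting cost
    return draft[:max_len]
```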

A theoretical perspective for speculative decoding algorithm

M Yin, M Chen, K Huang, M Wang - arXiv preprint arXiv:2411.00841, 2024 - arxiv.org
Transformer-based autoregressive sampling has been the major bottleneck slowing down
large language model inference. One effective way to accelerate inference is …
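
The quantity such analyses typically start from is the classical speculative-decoding identity: with an i.i.d. per-token acceptance rate α and draft length γ (notation assumed here, not taken from this paper), the expected number of tokens produced per verification round is

```latex
% Classical speculative-decoding identity (notation assumed here):
% \alpha = i.i.d. per-token acceptance rate, \gamma = draft length.
\mathbb{E}[\text{tokens per round}] \;=\; \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
```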

Ouroboros: Speculative Decoding with Large Model Enhanced Drafting

W Zhao, Y Huang, X Han, C Xiao, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Drafting-then-verifying methods such as speculative decoding are widely adopted
training-free approaches to accelerating the inference of large language models (LLMs). Instead …
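
Tying the section together, a generic drafting-then-verifying loop; `draft_fn` and `verify_fn` are hypothetical stand-ins for any of the drafters above and for a lossless verifier like the one sketched at the top of this list.

```python
def generate(draft_fn, verify_fn, prompt, max_new=64, k=4):
    """Generic drafting-then-verifying loop (sketch). draft_fn proposes
    k cheap tokens; verify_fn checks them against the target model in
    one pass and returns the accepted prefix plus one corrected token,
    so every iteration commits at least one token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft = draft_fn(out, k)
        out += verify_fn(out, draft)
    return out[: len(prompt) + max_new]
```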