Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding

H Xia, Z Yang, Q Dong, P Wang, Y Li, T Ge… - arXiv preprint arXiv …, 2024 - arxiv.org
To mitigate the high inference latency stemming from autoregressive decoding in Large
Language Models (LLMs), Speculative Decoding has emerged as a novel decoding …
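
Every entry below builds on the same draft-then-verify primitive this survey covers, so a minimal sketch may be useful up front. It assumes per-position distributions from hypothetical target and draft models have already been computed; the acceptance rule is the standard lossless rejection-sampling scheme.

```python
import numpy as np

def verify_draft(target_probs, draft_probs, draft_tokens, rng):
    """Accept drafted token x with probability min(1, p(x)/q(x)); on the
    first rejection, resample from the normalized residual (p - q)+.
    This keeps the output distribution identical to the target model's."""
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)                       # token accepted
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(p), p=residual)))
            break                                 # stop at first rejection
    return out
```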

EAGLE-2: Faster inference of language models with dynamic draft trees

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2406.16858, 2024 - arxiv.org
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …
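
A toy illustration of the dynamic-tree idea: expand the partial draft with the highest cumulative draft probability first, so deep chains grow only where the drafter is confident. `draft_topk` is a hypothetical callable returning the draft model's top-k (token, probability) pairs; EAGLE-2's actual tree construction and reranking are more involved.

```python
import heapq, math

def grow_draft_tree(draft_topk, ctx, budget=8, branch=2):
    """Best-first tree expansion: always extend the partial draft with
    the highest cumulative draft probability. Returns candidate
    branches to verify together in one target forward pass."""
    heap = [(0.0, 0, [])]          # (-cumulative log-prob, tiebreak, tokens)
    counter, branches = 1, []
    while heap and len(branches) < budget:
        neg_lp, _, toks = heapq.heappop(heap)
        if toks:
            branches.append((toks, math.exp(-neg_lp)))
        for tok, p in draft_topk(ctx + toks, branch):
            heapq.heappush(heap, (neg_lp - math.log(p), counter, toks + [tok]))
            counter += 1
    return branches
```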

Break the sequential dependency of LLM inference using lookahead decoding

Y Fu, P Bailis, I Stoica, H Zhang - arXiv preprint arXiv:2402.02057, 2024 - arxiv.org
Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound,
resulting in high latency and significant waste of the parallel processing power of modern …
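
A greedy-decoding sketch of the verification half of lookahead decoding: a cached n-gram guess is checked against the target model in one parallel pass, so several tokens can be committed per forward pass. `target_argmax` and the `ngram_pool` dict are assumptions; harvesting n-grams from Jacobi-iteration trajectories is omitted.

```python
def lookahead_verify(target_argmax, ctx, ngram_pool):
    """Check a cached n-gram guess in one parallel pass. target_argmax
    (assumed) returns, for each prefix of its input, the greedy next
    token, all from a single batched forward pass."""
    guess = ngram_pool.get(ctx[-1], [])
    preds = list(target_argmax(ctx + guess))
    out = [preds[len(ctx) - 1]]            # true next token after ctx
    for i, g in enumerate(guess):
        if g != out[-1]:
            break                           # guess diverged from the model
        out.append(preds[len(ctx) + i])     # frontier advances one token
    return out                              # accepted tokens + 1 bonus token
```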

EAGLE: Speculative sampling requires rethinking feature uncertainty

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2401.15077, 2024 - arxiv.org
Auto-regressive decoding makes the inference of Large Language Models (LLMs) time-
consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater …
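
A sketch of the feature-level drafting EAGLE describes: a small head autoregresses over the target model's penultimate-layer features, and the frozen LM head turns the extrapolated feature into draft logits. The single-linear-layer body and the dimensions are assumptions, not the paper's architecture.

```python
import torch, torch.nn as nn

class FeatureDrafter(nn.Module):
    """EAGLE-style draft head (sketch): predict the next penultimate-
    layer feature from the current feature plus the sampled token's
    embedding, then reuse the frozen target LM head for draft logits."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)   # [feature; token emb] -> next feature

    def forward(self, feature, token, lm_head):
        x = torch.cat([feature, self.embed(token)], dim=-1)
        next_feature = self.fuse(x)            # extrapolated feature
        return next_feature, lm_head(next_feature)
```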

The Mamba in the Llama: Distilling and accelerating hybrid models

J Wang, D Paliotta, A May, AM Rush, T Dao - arXiv preprint arXiv …, 2024 - arxiv.org
Linear RNN architectures, like Mamba, can be competitive with Transformer models in
language modeling while having advantageous deployment characteristics. Given the focus …
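
For context, a toy diagonal linear-RNN step showing the deployment advantage the abstract alludes to: the recurrent state is fixed-size, so per-token memory traffic stays constant, unlike a Transformer KV cache that grows with sequence length. This is not Mamba's actual parameterization, which adds input-dependent gating.

```python
import numpy as np

def rnn_decode_step(A, B, C, h, x):
    """One diagonal linear-RNN step (toy): the state h is fixed-size,
    so per-token memory traffic is O(d) at any sequence length, unlike
    a KV cache that grows with every generated token."""
    h = A * h + B * x        # elementwise state update (A, B, h: shape (d,))
    return h, C @ h          # readout (C: shape (m, d)); x is a scalar input
```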

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …
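
A back-of-envelope illustration (with assumed, purely illustrative numbers) of the memory cost such surveys analyze: at batch size 1, every decoded token must stream all model weights from memory, which caps throughput at bandwidth divided by model size.

```python
# Illustrative numbers only: why batch-1 autoregressive decoding is
# memory-bandwidth bound rather than compute bound.
params = 7e9              # assumed 7B-parameter model
bytes_per_param = 2       # fp16 weights
bandwidth = 1.0e12        # assumed ~1 TB/s accelerator memory bandwidth

weight_bytes = params * bytes_per_param      # streamed once per token
print(f"ceiling: ~{bandwidth / weight_bytes:.0f} tokens/s")   # ~71 tokens/s
```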

Recurrent drafter for fast speculative decoding in large language models

Y Cheng, A Zhang, X Zhang, C Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Recurrent Drafter (ReDrafter), an advanced speculative decoding approach that
achieves state-of-the-art speedup for large language model (LLM) inference. The …
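
A minimal sketch of a recurrent draft head in ReDrafter's spirit: a small RNN drafts several tokens from the target model's last hidden state, carrying its own recurrent state between drafted positions. Layer sizes are assumptions, and greedy drafting stands in for the paper's beam search.

```python
import torch, torch.nn as nn

class RecurrentDraftHead(nn.Module):
    """Drafts `steps` tokens from the target model's hidden state,
    updating a GRU state between drafted positions (sketch)."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRUCell(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def draft(self, hidden, steps=4):
        h, toks = hidden, []
        tok = self.out(h).argmax(-1)           # first drafted token (greedy)
        for _ in range(steps):
            toks.append(tok)
            h = self.rnn(self.embed(tok), h)   # recurrent state update
            tok = self.out(h).argmax(-1)
        return toks
```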

Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding

W Zhao, Y Huang, X Han, W Xu, C Xiao… - Proceedings of the …, 2024 - aclanthology.org
Speculative decoding is a widely used method that accelerates the generation process of
large language models (LLMs) with no compromise in model performance. It achieves this …
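
A sketch of phrase-level draft extension: after the drafter proposes a prefix, whole cached phrases are spliced in so each verification pass can accept several tokens at once. The `phrase_table` layout, a hypothetical {last token: phrase} cache built from previously verified text, is an assumption.

```python
def extend_draft_with_phrases(draft, phrase_table, max_len=16):
    """Splice cached multi-token phrases onto the draft, keyed by the
    last drafted token; the target model verifies the result later."""
    while len(draft) < max_len:
        phrase = phrase_table.get(draft[-1])
        if not phrase:
            break                    # no cached continuation available
        draft = draft + phrase       # candidate phrase, O(1) drafting cost
    return draft[:max_len]
```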

A theoretical perspective for speculative decoding algorithm

M Yin, M Chen, K Huang, M Wang - arXiv preprint arXiv:2411.00841, 2024 - arxiv.org
Transformer-based autoregressive sampling has been the major bottleneck slowing down
large language model inference. One effective way to accelerate inference is …
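
The quantity such analyses typically start from is the classical speculative-decoding identity: with an i.i.d. per-token acceptance rate α and draft length γ (notation assumed here, not taken from this paper), the expected number of tokens produced per verification round is

```latex
% Classical speculative-decoding identity (notation assumed here):
% \alpha = i.i.d. per-token acceptance rate, \gamma = draft length.
\mathbb{E}[\text{tokens per round}] \;=\; \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
```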

Ouroboros: Speculative Decoding with Large Model Enhanced Drafting

W Zhao, Y Huang, X Han, C Xiao, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Drafting-then-verifying methods such as speculative decoding are widely adopted
training-free approaches to accelerating the inference of large language models (LLMs). Instead …
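
Tying the section together, a generic drafting-then-verifying loop; `draft_fn` and `verify_fn` are hypothetical stand-ins for any of the drafters above and for a lossless verifier like the one sketched at the top of this list.

```python
def generate(draft_fn, verify_fn, prompt, max_new=64, k=4):
    """Generic drafting-then-verifying loop (sketch). draft_fn proposes
    k cheap tokens; verify_fn checks them against the target model in
    one pass and returns the accepted prefix plus one corrected token,
    so every iteration commits at least one token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft = draft_fn(out, k)
        out += verify_fn(out, draft)
    return out[: len(prompt) + max_new]
```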