Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding
To mitigate the high inference latency stemming from autoregressive decoding in Large
Language Models (LLMs), Speculative Decoding has emerged as a novel decoding …
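Across the surveyed methods, the common skeleton is a draft-then-verify loop: a cheap drafter proposes several tokens, and the target model checks them all in one forward pass. A minimal greedy sketch of that loop, assuming hypothetical draft_next_tokens and target_greedy callables rather than any specific paper's API:

    def speculative_decode(prefix, draft_next_tokens, target_greedy, gamma=4, max_new=64):
        """Greedy draft-then-verify loop. draft_next_tokens(ctx, gamma) is a
        hypothetical cheap drafter proposing gamma tokens; target_greedy(ctx, draft)
        is a hypothetical call returning the target model's greedy token at each
        draft position plus one bonus slot, computed in a single forward pass."""
        out = list(prefix)
        while len(out) - len(prefix) < max_new:
            draft = draft_next_tokens(out, gamma)   # cheap drafter proposes gamma tokens
            checks = target_greedy(out, draft)      # len(draft) + 1 target tokens, one pass
            i = 0
            while i < len(draft) and draft[i] == checks[i]:
                i += 1                              # accept while the target agrees
            out += draft[:i]
            out.append(checks[i])                   # correction on mismatch, bonus on full accept
        return out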
EAGLE-2: Faster inference of language models with dynamic draft trees
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …
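EAGLE-2's headline idea is to grow the draft tree dynamically rather than using a fixed shape. A rough sketch of confidence-driven tree expansion in that spirit, assuming a hypothetical draft_topk(context, k) callable that returns the draft model's top-k (token, log-probability) pairs; the scoring here is simplified to cumulative path log-probability, not the paper's exact reranking:

    import heapq

    def build_draft_tree(root_ctx, draft_topk, depth=5, branch=3, beam=8):
        """Grow a draft tree by expanding the most confident paths first: each
        node's score is the cumulative log-probability of its path under the
        draft model (a simplification of EAGLE-2's scoring)."""
        frontier = [(0.0, [])]                      # (negative score, token path)
        nodes = []
        for _ in range(depth):
            nxt = []
            for neg, path in frontier:
                for tok, logp in draft_topk(root_ctx + path, branch):
                    nxt.append((neg - logp, path + [tok]))
            frontier = heapq.nsmallest(beam, nxt)   # keep the beam highest-probability paths
            nodes.extend(frontier)
        return [path for _, path in nodes]          # candidate branches to verify in one pass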
Break the sequential dependency of LLM inference using lookahead decoding
Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound,
resulting in high latency and significant waste of the parallel processing power of modern …
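The bandwidth claim is easy to make concrete: each autoregressive step must stream essentially all model weights through memory, so per-token latency has a hard floor of weight bytes divided by bandwidth, and verifying several drafted tokens in one pass amortizes that cost. A back-of-the-envelope sketch with illustrative, not measured, numbers:

    # Illustrative numbers, not measurements: a 7B-parameter model in fp16
    # and an accelerator with ~1 TB/s of memory bandwidth.
    weights_bytes = 7e9 * 2          # ~14 GB of weights streamed per decoding step
    bandwidth = 1e12                 # bytes/second
    latency_floor = weights_bytes / bandwidth
    print(f"per-token floor: {latency_floor*1e3:.1f} ms "
          f"-> at most {1/latency_floor:.0f} tokens/s per sequence")
    # Checking k drafted tokens in one forward pass streams the weights once
    # for k tokens, which is the win lookahead/speculative decoding targets.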
EAGLE: Speculative sampling requires rethinking feature uncertainty
Auto-regressive decoding makes the inference of Large Language Models (LLMs)
time-consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater …
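EAGLE drafts at the feature level: its head extrapolates the target model's next hidden feature from the current feature plus the embedding of the token just sampled, then reuses the target LM head to turn features into draft tokens. A toy PyTorch sketch of that idea; the single linear layer and all sizes are illustrative choices, not the paper's architecture:

    import torch
    import torch.nn as nn

    class ToyEagleHead(nn.Module):
        """Toy feature-level drafter in the spirit of EAGLE: predict the next
        hidden feature from (current feature, embedding of sampled token),
        then map features to draft tokens with a stand-in for the frozen
        target LM head. Sizes and layers are illustrative only."""
        def __init__(self, d_model=1024, vocab=32000):
            super().__init__()
            self.embed = nn.Embedding(vocab, d_model)
            self.fuse = nn.Linear(2 * d_model, d_model)   # combine feature + token embedding
            self.lm_head = nn.Linear(d_model, vocab)      # stands in for the target's head

        def draft(self, feature, token_id, steps=4):
            toks = []
            for _ in range(steps):
                x = torch.cat([feature, self.embed(token_id)], dim=-1)
                feature = self.fuse(x)                    # extrapolate the next feature
                token_id = self.lm_head(feature).argmax(-1)
                toks.append(token_id)
            return toks                                   # verified by the target afterwards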
The Mamba in the Llama: Distilling and accelerating hybrid models
Linear RNN architectures, like Mamba, can be competitive with Transformer models in
language modeling while having advantageous deployment characteristics. Given the focus …
A survey on efficient inference for large language models
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …
Recurrent drafter for fast speculative decoding in large language models
We present Recurrent Drafter (ReDrafter), an advanced speculative decoding approach that
achieves state-of-the-art speedup for large language model (LLM) inference. The …
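ReDrafter's drafter is recurrent: it seeds an RNN state from the target model's hidden state and unrolls it to propose a short run of draft tokens. A toy PyTorch sketch; the GRU cell and sizes are illustrative stand-ins, not the paper's exact design:

    import torch
    import torch.nn as nn

    class ToyRecurrentDrafter(nn.Module):
        """Toy recurrent drafter: seed a GRU cell with the target model's last
        hidden state, then unroll it to propose draft tokens. The GRU cell and
        dimensions are illustrative, not ReDrafter's exact architecture."""
        def __init__(self, d_model=1024, vocab=32000):
            super().__init__()
            self.embed = nn.Embedding(vocab, d_model)
            self.cell = nn.GRUCell(d_model, d_model)
            self.head = nn.Linear(d_model, vocab)

        def draft(self, target_hidden, last_token, steps=4):
            h, tok, toks = target_hidden, last_token, []
            for _ in range(steps):
                h = self.cell(self.embed(tok), h)   # advance the recurrent draft state
                tok = self.head(h).argmax(-1)       # greedy draft token
                toks.append(tok)
            return toks                             # verified by the target model afterwards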
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
Speculative decoding is a widely used method that accelerates the generation process of
large language models (LLMs) with no compromise in model performance. It achieves this …
A theoretical perspective for speculative decoding algorithm
Transformer-based autoregressive sampling has been the major bottleneck slowing
down large language model inference. One effective way to accelerate inference is speculative decoding …
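The object such analyses study is the standard speculative sampling verification rule: a draft token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized residual max(p - q, 0), which provably leaves the output distributed exactly as the target distribution p. A direct NumPy rendering of that rule for a single token:

    import numpy as np

    def verify_one(p, q, x, rng=None):
        """Standard speculative-sampling verification of one draft token.
        p, q: target and draft distributions over the vocabulary (1-D arrays);
        x: token index sampled from q. The returned token is distributed as p."""
        if rng is None:
            rng = np.random.default_rng()
        if rng.random() < min(1.0, p[x] / q[x]):
            return x                              # accept the draft token
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()                # renormalize max(p - q, 0)
        return rng.choice(len(p), p=residual)     # resample on rejection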
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting
Drafting-then-verifying decoding schemes such as speculative decoding are widely adopted
training-free methods to accelerate the inference of large language models (LLMs). Instead …
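Ouroboros lengthens drafts phrase by phrase: instead of proposing one token at a time, it splices in candidate multi-token phrases (e.g., harvested from earlier generation and verification steps), and the target model then verifies the whole draft at once. A toy sketch with a hypothetical phrase pool keyed by the last token; the pool construction and keying are illustrative, not the paper's exact mechanism:

    def phrase_draft(context, pool, max_len=16):
        """Toy phrase-level drafting: extend the draft a phrase at a time by
        looking up candidate continuations in a pool keyed by the last token.
        The pool and keying scheme are illustrative simplifications."""
        draft = []
        while len(draft) < max_len:
            key = (context + draft)[-1]
            phrase = pool.get(key)                # candidate multi-token continuation
            if not phrase:
                break
            draft += list(phrase)                 # append a whole phrase, not one token
        return draft[:max_len]                    # the target model verifies this draft

    # Example with a hypothetical pool of previously verified n-grams.
    pool = {7: (12, 45), 45: (9, 9, 3)}
    print(phrase_draft([1, 7], pool))             # -> [12, 45, 9, 9, 3]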