Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding
To mitigate the high inference latency stemming from autoregressive decoding in Large
Language Models (LLMs), Speculative Decoding has emerged as a novel decoding …
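For reference, the draft-then-verify loop at the core of speculative decoding fits in a few lines. The sketch below assumes greedy decoding and two hypothetical callables, `draft_next` and `target_next`, each mapping a token list to that model's argmax next token; real systems verify all k draft positions in a single batched forward pass of the target rather than one call per position.

```python
def speculative_decode(target_next, draft_next, tokens, k=4, max_new=32):
    """Greedy draft-then-verify; output matches plain greedy target decoding."""
    tokens, produced = list(tokens), 0
    while produced < max_new:
        # 1) Draft: the cheap model proposes k tokens autoregressively.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            ctx.append(draft_next(ctx))
            draft.append(ctx[-1])
        # 2) Verify: accept the longest prefix the target agrees with; on the
        #    first disagreement, substitute the target's own token and stop.
        new = []
        for t in draft:
            expected = target_next(tokens + new)
            new.append(expected)
            if expected != t:
                break
        tokens += new
        produced += len(new)
    return tokens


# Toy check: when the draft always agrees, every block of k tokens is accepted.
step = lambda ts: ts[-1] + 1
print(speculative_decode(step, step, [0], k=4, max_new=8))  # [0, 1, ..., 8]
```

Because every emitted token is the target's own greedy choice at its position, the loop is lossless with respect to ordinary greedy decoding; the speedup comes from how many draft tokens survive verification per target pass.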
Towards efficient generative large language model serving: A survey from algorithms to systems
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …
EAGLE-2: Faster inference of language models with dynamic draft trees
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …
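A rough illustration of why draft *trees* help: a single draft chain dies at the first mismatch, while a tree keeps several alternatives alive. The sketch below is a deliberately tiny, hypothetical version (top-2 branches, depth 2, greedy verification, with `draft_top2` and `target_next` as stand-in callables); EAGLE-2's contribution is making the tree dynamic, shaped by draft-model confidence, and scoring the whole tree in one masked forward pass rather than with repeated calls as here.

```python
def tree_step(target_next, draft_top2, tokens):
    """One decoding step with a depth-2, branching-factor-2 draft tree.
    draft_top2: hypothetical callable returning the draft model's top-2 tokens.
    target_next: the target model's argmax next token (called per position
    here; a real system scores the entire tree in a single forward pass)."""
    # Enumerate root-to-leaf candidate paths of the draft tree.
    paths = [[t] for t in draft_top2(tokens)]
    paths = [p + [t] for p in paths for t in draft_top2(tokens + p)]
    # Accept the longest path prefix the target agrees with.
    best = []
    for path in paths:
        new = []
        for t in path:
            if target_next(tokens + new) != t:
                break
            new.append(t)
        best = max(best, new, key=len)
    # The target's token after the accepted prefix comes for free.
    return tokens + best + [target_next(tokens + best)]
```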
Break the sequential dependency of LLM inference using lookahead decoding
Autoregressive decoding of large language models (LLMs) is memory-bandwidth-bound,
resulting in high latency and significant waste of the parallel processing power of modern …
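The snippet cuts off before the method, but the core idea is model-free drafting: harvest n-grams from the sequence's own history (in the paper, from parallel Jacobi iterations) and let the target verify them. A much-simplified greedy sketch, again with a hypothetical `target_next` callable:

```python
def lookahead_sketch(target_next, tokens, n=3, max_new=32):
    """Draft from an n-gram pool instead of a draft model (greedy, lossless)."""
    tokens, produced, pool = list(tokens), 0, {}

    def index(seq):  # remember: (n-1)-gram prefix -> observed next token
        for i in range(len(seq) - n + 1):
            pool[tuple(seq[i:i + n - 1])] = seq[i + n - 1]

    index(tokens)
    while produced < max_new:
        # Draft by chaining pool lookups from the current suffix.
        draft, ctx = [], list(tokens)
        while len(draft) < n and tuple(ctx[-(n - 1):]) in pool:
            t = pool[tuple(ctx[-(n - 1):])]
            draft.append(t)
            ctx.append(t)
        # Verify with the target; always make at least one token of progress.
        new = []
        for t in draft:
            expected = target_next(tokens + new)
            new.append(expected)
            if expected != t:
                break
        if not new:
            new = [target_next(tokens)]
        tokens += new
        produced += len(new)
        index(tokens)
    return tokens
```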
EAGLE: Speculative sampling requires rethinking feature uncertainty
Auto-regressive decoding makes the inference of Large Language Models (LLMs) time-
consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater …
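EAGLE's drafting operates at the feature level: a small head extrapolates the target model's last hidden state one step ahead, conditioned on the sampled token's embedding, and the target's own LM head turns features into logits. The module below is a structural sketch under assumed dimensions (the one-layer fusion, sizes, and names are illustrative, not the paper's architecture); it is untrained here, so the drafted tokens are arbitrary and only the dataflow matters.

```python
import torch
import torch.nn as nn

class FeatureDraftHead(nn.Module):
    """Predicts the next hidden feature from (current feature, current token)."""
    def __init__(self, d_model: int, vocab: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)  # (feature, token) -> next feature

    def forward(self, feature: torch.Tensor, token: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([feature, self.embed(token)], dim=-1))

d, v = 64, 1000
head = FeatureDraftHead(d, v)
lm_head = nn.Linear(d, v)   # stands in for the frozen target LM head

feat = torch.zeros(1, d)    # last hidden state from the target model
tok = torch.tensor([42])
drafts = []
for _ in range(4):          # draft 4 tokens purely at feature level
    feat = head(feat, tok)
    tok = lm_head(feat).argmax(dim=-1)
    drafts.append(tok.item())
print(drafts)               # candidate tokens handed to target verification
```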
Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models
As (causal) large language models (LLMs) reach ever larger scales, inference
efficiency has become a core concern alongside improved performance. In contrast to …
Kangaroo: Lossless self-speculative decoding via double early exiting
Speculative decoding has demonstrated its effectiveness in accelerating the inference of
large language models while maintaining a consistent sampling distribution. However, the …
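The "double early exiting" in the title can be read as two separate exits: (1) the draft comes from an early exit of the target network itself, a shallow prefix of its layers plus a light adapter, so no separate draft model is needed; and (2) drafting stops early whenever the shallow exit's confidence drops below a threshold. A minimal sketch of the drafting side, with `shallow_next` as a hypothetical callable for the target's early exit:

```python
def draft_with_early_exit(shallow_next, tokens, max_draft=8, threshold=0.6):
    """shallow_next: hypothetical callable returning (token, prob) from a
    shallow sub-network of the target (its early exit)."""
    draft, ctx = [], list(tokens)
    for _ in range(max_draft):
        tok, prob = shallow_next(ctx)
        if prob < threshold:   # second early exit: stop drafting when unsure
            break
        draft.append(tok)
        ctx.append(tok)
    return draft               # handed to the full network for verification
```

Cutting drafts short when the shallow exit is unsure trades draft length for acceptance rate, which is exactly the knob self-speculative methods tune.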
TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding
As large language models (LLMs) are increasingly deployed for long-content generation,
demand for efficient long-sequence inference support has grown rapidly …
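The hierarchy can be sketched by composing two draft-verify stages: a tiny model drafts for a mid-level verifier (in TriForce, the full model running on a retrieval-compressed KV cache), whose output is in turn verified by the full model. The sketch reuses the `speculative_decode` helper from the first entry above; all three callables are hypothetical greedy models of increasing cost.

```python
def hierarchical_step(full_next, mid_next, tiny_next, tokens, k=4):
    """One step of two-level (hierarchical) greedy speculation."""
    # Level 1: the tiny model drafts for the mid-level model.
    mid_out = speculative_decode(mid_next, tiny_next, tokens, k=k, max_new=k)
    draft = mid_out[len(tokens):]
    # Level 2: the full model verifies the mid-level draft.
    new = []
    for t in draft:
        expected = full_next(tokens + new)
        new.append(expected)
        if expected != t:
            break
    return tokens + new
```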
Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition
We propose a new model for multi-token prediction in transformers, aiming to enhance
sampling efficiency without compromising accuracy. Motivated by recent work that predicts …
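One way to read "multi-token prediction using tensor decomposition": model the joint distribution over the next few tokens as a low-rank (CP-style) mixture so it never has to be stored as a full vocab-by-vocab table. The toy head below is an illustrative assumption about what such a factorization can look like, not the paper's architecture; it materializes a tiny joint only to check normalization, whereas real use would sample factor by factor.

```python
import torch
import torch.nn as nn

class RankRJointHead(nn.Module):
    """p(t1, t2 | h) = sum_r w_r(h) * p_r(t1 | h) * q_r(t2 | h)."""
    def __init__(self, d_model: int, vocab: int, rank: int = 2):
        super().__init__()
        self.mix = nn.Linear(d_model, rank)            # mixture weights w_r
        self.head1 = nn.Linear(d_model, rank * vocab)  # factors p_r(t1)
        self.head2 = nn.Linear(d_model, rank * vocab)  # factors q_r(t2)
        self.rank, self.vocab = rank, vocab

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        B = h.shape[0]
        w = torch.softmax(self.mix(h), dim=-1)                           # (B, R)
        p1 = torch.softmax(self.head1(h).view(B, self.rank, self.vocab), -1)
        p2 = torch.softmax(self.head2(h).view(B, self.rank, self.vocab), -1)
        # Joint over (t1, t2): contract the rank dimension.
        return torch.einsum('br,brv,brw->bvw', w, p1, p2)                # (B, V, V)

h = torch.randn(3, 16)
joint = RankRJointHead(16, vocab=50)(h)
print(joint.shape, float(joint[0].sum()))  # torch.Size([3, 50, 50]), ~1.0
```

With rank 1 this collapses to independent per-position heads; increasing the rank lets the factorized joint capture dependence between the predicted tokens at a cost linear in rank rather than quadratic in vocabulary.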
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
Large language models (LLMs) commonly employ autoregressive generation during
inference, leading to high memory bandwidth demand and consequently extended latency …