Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding

H Xia, Z Yang, Q Dong, P Wang, Y Li, T Ge… - arXiv preprint arXiv …, 2024 - arxiv.org
To mitigate the high inference latency stemming from autoregressive decoding in Large
Language Models (LLMs), Speculative Decoding has emerged as a novel decoding …
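
The draft-then-verify loop these surveyed methods share can be sketched compactly. Below is a minimal greedy-verification sketch, not any particular paper's implementation; the toy `target_logits` and `draft_next` models, the vocabulary size, and the draft length k are all illustrative assumptions.

```python
import numpy as np

VOCAB = 16  # toy vocabulary size (illustrative assumption)
rng = np.random.default_rng(0)
W = rng.standard_normal((VOCAB, VOCAB))  # toy next-token weight table

def target_logits(tokens):
    """Stand-in for the large target model: next-token logits at every
    position, computed for the whole sequence in one parallel pass."""
    return np.stack([W[t] for t in tokens])

def draft_next(tokens):
    """Stand-in for the small draft model. Here it happens to agree with
    the target, so drafts are always accepted; a real draft model only
    approximates the target and is merely much cheaper to run."""
    return int(np.argmax(W[tokens[-1]]))

def speculative_step(tokens, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_next(draft))
    proposed = draft[len(tokens):]
    # 2) Verify prompt + draft with the target model in ONE forward pass:
    #    k + 1 next-token predictions, one per drafted position plus a bonus.
    verified = np.argmax(target_logits(draft)[len(tokens) - 1:], axis=-1)
    # 3) Accept the longest prefix where target and draft agree, then
    #    append one guaranteed-correct target token (so >= 1 token/step).
    n = 0
    while n < k and proposed[n] == verified[n]:
        n += 1
    return tokens + proposed[:n] + [int(verified[n])]

seq = [3]
for _ in range(4):
    seq = speculative_step(seq)
print(seq)  # matches plain greedy decoding, reached in fewer target passes
```

Each call to speculative_step advances the sequence by up to k + 1 tokens per single target-model pass; that amortization is the speed-up the surveyed methods refine.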

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …


EAGLE-2: Faster inference of language models with dynamic draft trees

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2406.16858, 2024 - arxiv.org
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …

Break the sequential dependency of LLM inference using lookahead decoding

Y Fu, P Bailis, I Stoica, H Zhang - arXiv preprint arXiv:2402.02057, 2024 - arxiv.org
Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound,
resulting in high latency and significant waste of the parallel processing power of modern …
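
The sequential dependency the title refers to can be recast as a fixed-point problem, which is the Jacobi-iteration idea underlying lookahead decoding. The sketch below shows only that core idea on a toy greedy model (the full method additionally collects and verifies n-grams); `next_token`, the vocabulary size, and the window n are illustrative assumptions.

```python
import numpy as np

VOCAB = 16
rng = np.random.default_rng(1)
W = rng.standard_normal((VOCAB, VOCAB))  # toy next-token table (assumption)

def next_token(prev):
    """Toy greedy model whose next token depends only on the previous one."""
    return int(np.argmax(W[prev]))

def jacobi_decode(prompt, n):
    """Treat the next n tokens as unknowns satisfying
    guess[i] == next_token(token before position i), and solve by Jacobi
    sweeps: all positions are updated at once from the current iterate, so
    each sweep is one parallel model call instead of n sequential ones."""
    guess = [0] * n  # arbitrary initial guess
    for _ in range(n):  # guaranteed to reach the fixed point within n sweeps
        prev = [prompt[-1]] + guess[:-1]
        new = [next_token(p) for p in prev]  # all n positions in parallel
        if new == guess:  # fixed point = exact greedy autoregressive output
            break
        guess = new
    return guess

print(jacobi_decode([3], n=5))
```

In the worst case this needs as many sweeps as sequential steps, but correct prefixes lock in and propagate forward, so several tokens typically resolve per memory-bound model pass.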

EAGLE: Speculative sampling requires rethinking feature uncertainty

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2401.15077, 2024 - arxiv.org
Auto-regressive decoding makes the inference of Large Language Models (LLMs) time-
consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater …

Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models

C Zhang, Z Liu, D Song - arXiv preprint arXiv:2404.14897, 2024 - arxiv.org
As (causal) large language models (LLMs) grow to ever larger scales, inference
efficiency has become one of the core concerns alongside improved performance. In contrast to …

Kangaroo: Lossless self-speculative decoding via double early exiting

F Liu, Y Tang, Z Liu, Y Ni, K Han, Y Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Speculative decoding has demonstrated its effectiveness in accelerating the inference of
large language models while maintaining a consistent sampling distribution. However, the …
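
The "double early exiting" in the title refers to (i) exiting the network after a shallow stack of layers to obtain a nearly free self-draft model and (ii) exiting the drafting loop itself once the draft's confidence drops. Here is a minimal sketch of the second exit, with a hypothetical `shallow_probs` standing in for the early-exit sub-network; the threshold and sizes are invented for illustration.

```python
import numpy as np

def shallow_probs(tokens):
    """Hypothetical early-exit sub-network: next-token probabilities from
    only the first few transformer layers (exit #1), cheap to evaluate."""
    rng = np.random.default_rng(sum(tokens) + len(tokens))
    logits = 4.0 * rng.standard_normal(16)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_tokens(tokens, max_k=8, threshold=0.6):
    """Draft until the self-draft model is no longer confident (exit #2),
    so cheap drafting effort is not wasted on likely-rejected tokens."""
    drafted = []
    for _ in range(max_k):
        probs = shallow_probs(tokens + drafted)
        t = int(probs.argmax())
        if probs[t] < threshold:  # confidence too low: stop drafting early
            break
        drafted.append(t)
    return drafted  # handed to the full model for parallel verification

print(draft_tokens([3, 7]))
```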

TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding

H Sun, Z Chen, X Yang, Y Tian, B Chen - arXiv preprint arXiv:2404.11912, 2024 - arxiv.org
With large language models (LLMs) now widely deployed for long content generation,
there is an increasing demand for efficient long-sequence inference support …

Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition

A Basharin, A Chertkov, I Oseledets - arXiv preprint arXiv:2410.17765, 2024 - arxiv.org
We propose a new model for multi-token prediction in transformers, aiming to enhance
sampling efficiency without compromising accuracy. Motivated by recent work that predicts …
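
The general idea, predicting a joint distribution over the next several tokens without materializing an exponentially large output table, can be illustrated with a rank-R (CP-style) factorization. This is a generic sketch under that reading of the abstract, not the paper's architecture; all sizes and weight names are invented.

```python
import numpy as np

V, H, R = 16, 8, 4  # vocab, hidden size, rank (all toy assumptions)
rng = np.random.default_rng(2)
Wm = rng.standard_normal((H, R))     # mixture-weight head
W1 = rng.standard_normal((R, H, V))  # per-rank head for token t+1
W2 = rng.standard_normal((R, H, V))  # per-rank head for token t+2

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_two_token(h):
    """Rank-R factorization of the joint next-two-token distribution:
    p(t1, t2 | h) ~= sum_r w_r(h) p_r(t1 | h) q_r(t2 | h).
    Stores R * 2V head outputs instead of a dense V * V table."""
    w = softmax(h @ Wm)                           # (R,) mixture over ranks
    p1 = softmax(np.einsum('h,rhv->rv', h, W1))   # (R, V)
    p2 = softmax(np.einsum('h,rhv->rv', h, W2))   # (R, V)
    return np.einsum('r,rv,rw->vw', w, p1, p2)    # (V, V), sums to 1

h = rng.standard_normal(H)
P = joint_two_token(h)
print(P.shape, float(P.sum()))                 # (16, 16) 1.0
t1, t2 = np.unravel_index(P.argmax(), P.shape) # greedy two-token proposal
```

Taking the argmax (or sampling) over the factored joint yields a multi-token proposal from a single forward pass, which is what makes such heads useful as drafters.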

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

F Lin, H Yi, H Li, Y Yang, X Yu, G Lu, R Xiao - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) commonly employ autoregressive generation during
inference, leading to high memory bandwidth demand and consequently extended latency …