Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding

H Xia, Z Yang, Q Dong, P Wang, Y Li, T Ge… - arXiv preprint arXiv …, 2024 - arxiv.org
To mitigate the high inference latency stemming from autoregressive decoding in Large
Language Models (LLMs), Speculative Decoding has emerged as a novel decoding …
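
The draft-then-verify loop these surveyed methods share can be sketched compactly. Below is a minimal greedy-verification sketch, not any particular paper's implementation; the toy `target_logits` and `draft_next` models, the vocabulary size, and the draft length k are all illustrative assumptions.

```python
import numpy as np

VOCAB = 16  # toy vocabulary size (illustrative assumption)
rng = np.random.default_rng(0)
W = rng.standard_normal((VOCAB, VOCAB))  # toy next-token weight table

def target_logits(tokens):
    """Stand-in for the large target model: next-token logits at every
    position, computed for the whole sequence in one parallel pass."""
    return np.stack([W[t] for t in tokens])

def draft_next(tokens):
    """Stand-in for the small draft model. Here it happens to agree with
    the target, so drafts are always accepted; a real draft model only
    approximates the target and is merely much cheaper to run."""
    return int(np.argmax(W[tokens[-1]]))

def speculative_step(tokens, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_next(draft))
    proposed = draft[len(tokens):]
    # 2) Verify prompt + draft with the target model in ONE forward pass:
    #    k + 1 next-token predictions, one per drafted position plus a bonus.
    verified = np.argmax(target_logits(draft)[len(tokens) - 1:], axis=-1)
    # 3) Accept the longest prefix where target and draft agree, then
    #    append one guaranteed-correct target token (so >= 1 token/step).
    n = 0
    while n < k and proposed[n] == verified[n]:
        n += 1
    return tokens + proposed[:n] + [int(verified[n])]

seq = [3]
for _ in range(4):
    seq = speculative_step(seq)
print(seq)  # matches plain greedy decoding, reached in fewer target passes
```

Each call to speculative_step advances the sequence by up to k + 1 tokens per single target-model pass; that amortization is the speed-up the surveyed methods refine.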

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …


EAGLE-2: Faster inference of language models with dynamic draft trees

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2406.16858, 2024 - arxiv.org
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …

Break the sequential dependency of LLM inference using lookahead decoding

Y Fu, P Bailis, I Stoica, H Zhang - arXiv preprint arXiv:2402.02057, 2024 - arxiv.org
Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound,
resulting in high latency and significant waste of the parallel processing power of modern …
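
The sequential dependency the title refers to can be recast as a fixed-point problem, which is the Jacobi-iteration idea underlying lookahead decoding. The sketch below shows only that core idea on a toy greedy model (the full method additionally collects and verifies n-grams); `next_token`, the vocabulary size, and the window n are illustrative assumptions.

```python
import numpy as np

VOCAB = 16
rng = np.random.default_rng(1)
W = rng.standard_normal((VOCAB, VOCAB))  # toy next-token table (assumption)

def next_token(prev):
    """Toy greedy model whose next token depends only on the previous one."""
    return int(np.argmax(W[prev]))

def jacobi_decode(prompt, n):
    """Treat the next n tokens as unknowns satisfying
    guess[i] == next_token(token before position i), and solve by Jacobi
    sweeps: all positions are updated at once from the current iterate, so
    each sweep is one parallel model call instead of n sequential ones."""
    guess = [0] * n  # arbitrary initial guess
    for _ in range(n):  # guaranteed to reach the fixed point within n sweeps
        prev = [prompt[-1]] + guess[:-1]
        new = [next_token(p) for p in prev]  # all n positions in parallel
        if new == guess:  # fixed point = exact greedy autoregressive output
            break
        guess = new
    return guess

print(jacobi_decode([3], n=5))
```

In the worst case this needs as many sweeps as sequential steps, but correct prefixes lock in and propagate forward, so several tokens typically resolve per memory-bound model pass.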

EAGLE: Speculative sampling requires rethinking feature uncertainty

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2401.15077, 2024 - arxiv.org
Auto-regressive decoding makes the inference of Large Language Models (LLMs) time-
consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater …

Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models

C Zhang, Z Liu, D Song - arXiv preprint arXiv:2404.14897, 2024 - arxiv.org
As (causal) large language models (LLMs) grow to ever larger scales, inference
efficiency has become one of the core concerns alongside improved performance. In contrast to …

Kangaroo: Lossless self-speculative decoding via double early exiting

F Liu, Y Tang, Z Liu, Y Ni, K Han, Y Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Speculative decoding has demonstrated its effectiveness in accelerating the inference of
large language models while maintaining a consistent sampling distribution. However, the …
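
The "double early exiting" in the title refers to (i) exiting the network after a shallow stack of layers to obtain a nearly free self-draft model and (ii) exiting the drafting loop itself once the draft's confidence drops. Here is a minimal sketch of the second exit, with a hypothetical `shallow_probs` standing in for the early-exit sub-network; the threshold and sizes are invented for illustration.

```python
import numpy as np

def shallow_probs(tokens):
    """Hypothetical early-exit sub-network: next-token probabilities from
    only the first few transformer layers (exit #1), cheap to evaluate."""
    rng = np.random.default_rng(sum(tokens) + len(tokens))
    logits = 4.0 * rng.standard_normal(16)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_tokens(tokens, max_k=8, threshold=0.6):
    """Draft until the self-draft model is no longer confident (exit #2),
    so cheap drafting effort is not wasted on likely-rejected tokens."""
    drafted = []
    for _ in range(max_k):
        probs = shallow_probs(tokens + drafted)
        t = int(probs.argmax())
        if probs[t] < threshold:  # confidence too low: stop drafting early
            break
        drafted.append(t)
    return drafted  # handed to the full model for parallel verification

print(draft_tokens([3, 7]))
```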

TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding

H Sun, Z Chen, X Yang, Y Tian, B Chen - arXiv preprint arXiv:2404.11912, 2024 - arxiv.org
With large language models (LLMs) now widely deployed for long content generation,
there is an increasing demand for efficient long-sequence inference support …

Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition

A Basharin, A Chertkov, I Oseledets - arXiv preprint arXiv:2410.17765, 2024 - arxiv.org
We propose a new model for multi-token prediction in transformers, aiming to enhance
sampling efficiency without compromising accuracy. Motivated by recent work that predicts …
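
The general idea, predicting a joint distribution over the next several tokens without materializing an exponentially large output table, can be illustrated with a rank-R (CP-style) factorization. This is a generic sketch under that reading of the abstract, not the paper's architecture; all sizes and weight names are invented.

```python
import numpy as np

V, H, R = 16, 8, 4  # vocab, hidden size, rank (all toy assumptions)
rng = np.random.default_rng(2)
Wm = rng.standard_normal((H, R))     # mixture-weight head
W1 = rng.standard_normal((R, H, V))  # per-rank head for token t+1
W2 = rng.standard_normal((R, H, V))  # per-rank head for token t+2

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_two_token(h):
    """Rank-R factorization of the joint next-two-token distribution:
    p(t1, t2 | h) ~= sum_r w_r(h) p_r(t1 | h) q_r(t2 | h).
    Stores R * 2V head outputs instead of a dense V * V table."""
    w = softmax(h @ Wm)                           # (R,) mixture over ranks
    p1 = softmax(np.einsum('h,rhv->rv', h, W1))   # (R, V)
    p2 = softmax(np.einsum('h,rhv->rv', h, W2))   # (R, V)
    return np.einsum('r,rv,rw->vw', w, p1, p2)    # (V, V), sums to 1

h = rng.standard_normal(H)
P = joint_two_token(h)
print(P.shape, float(P.sum()))                 # (16, 16) 1.0
t1, t2 = np.unravel_index(P.argmax(), P.shape) # greedy two-token proposal
```

Taking the argmax (or sampling) over the factored joint yields a multi-token proposal from a single forward pass, which is what makes such heads useful as drafters.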

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

F Lin, H Yi, H Li, Y Yang, X Yu, G Lu, R Xiao - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) commonly employ autoregressive generation during
inference, leading to high memory bandwidth demand and consequently extended latency …