Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding
To mitigate the high inference latency stemming from autoregressive decoding in Large
Language Models (LLMs), Speculative Decoding has emerged as a novel decoding …
LLM inference unveiled: Survey and roofline model insights
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …
Medusa: Simple LLM inference acceleration framework with multiple decoding heads
The inference process in Large Language Models (LLMs) is often limited by the absence
of parallelism in the auto-regressive decoding process, resulting in most operations being …
Towards efficient generative large language model serving: A survey from algorithms to systems
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …
Distillspec: Improving speculative decoding via knowledge distillation
Speculative decoding (SD) accelerates large language model inference by employing a
faster draft model to generate multiple tokens, which are then verified in parallel by the …
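The draft-and-verify loop that this snippet alludes to can be summarized in a few lines. The sketch below is illustrative only: draft_model, target_model, and the greedy token-matching acceptance rule are hypothetical stand-ins. The published algorithm accepts draft tokens via rejection sampling over the two models' token distributions and scores all draft positions in a single batched forward pass of the target model.

```python
# Minimal sketch of speculative decoding's draft-and-verify loop.
# Both "models" are toy stand-ins: any callable mapping a token
# sequence to a next-token prediction would do.

def draft_model(tokens):
    # Hypothetical cheap draft model.
    return (tokens[-1] + 1) % 50

def target_model(tokens):
    # Hypothetical expensive target model; it agrees with the
    # draft most of the time, so most drafts are accepted.
    return (tokens[-1] + 1) % 50 if tokens[-1] % 7 else 0

def speculative_decode(prompt, num_tokens, gamma=4):
    """Generate num_tokens tokens, drafting gamma candidates per step."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1) Draft gamma candidate tokens autoregressively with the cheap model.
        draft = []
        for _ in range(gamma):
            draft.append(draft_model(tokens + draft))
        # 2) Verify the candidates with the target model. A real system
        #    checks all gamma positions in one batched forward pass;
        #    here we loop for clarity. On the first mismatch we take the
        #    target model's token and discard the rest of the draft, so
        #    the output matches what the target model alone would produce.
        accepted = []
        for i in range(gamma):
            expected = target_model(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + num_tokens]

print(speculative_decode([1, 2, 3], 10))
```

The speedup comes from step 2: when the draft is mostly accepted, one (batched) target-model pass yields several tokens instead of one, which is exactly the acceptance-rate lever that distillation methods such as DistillSpec aim to improve.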
EAGLE-2: Faster inference of language models with dynamic draft trees
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …
Break the sequential dependency of LLM inference using lookahead decoding
Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound,
resulting in high latency and significant waste of the parallel processing power of modern …
Rephrasing the web: A recipe for compute and data-efficient language modeling
Large language models are trained on massive scrapes of the web, which are often
unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such …
EAGLE: Speculative sampling requires rethinking feature uncertainty
Auto-regressive decoding makes the inference of Large Language Models (LLMs)
time-consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater …
A survey on efficient inference for large language models
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …