LLM-based edge intelligence: A comprehensive survey on architectures, applications, security and trustworthiness
The integration of Large Language Models (LLMs) and Edge Intelligence (EI) introduces a
groundbreaking paradigm for intelligent edge devices. With their capacity for human-like …
MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …
Mobile edge intelligence for large language models: A contemporary survey
On-device large language models (LLMs), referring to running LLMs on edge devices, have
raised considerable interest owing to their superior privacy, reduced latency, and bandwidth …
KVQuant: Towards 10 million context length LLM inference with KV cache quantization
LLMs are seeing growing use for applications such as document analysis and
summarization which require large context windows, and with these large context windows …
InstInfer: In-storage attention offloading for cost-effective long-context LLM inference
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in
generative AI. Nevertheless, the increasing context length and batch size in offline LLM …
RetrievalAttention: Accelerating long-context LLM inference via vector retrieval
Transformer-based Large Language Models (LLMs) have become increasingly important in
various domains. However, the quadratic time complexity of the attention operation poses a …
Post-Training Sparse Attention with Double Sparsity
The inference process for large language models is slow and memory-intensive, with one of
the most critical bottlenecks being excessive Key-Value (KV) cache accesses. This paper …
UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation
We present UncertaintyRAG, a novel approach for long-context Retrieval-Augmented
Generation (RAG) that utilizes Signal-to-Noise Ratio (SNR)-based span uncertainty to …
LoCoCo: Dropping In Convolutions for Long Context Compression
This paper tackles the memory hurdle of processing long context sequences in Large
Language Models (LLMs) by presenting a novel approach, Dropping In Convolutions for …
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Large language models (LLMs) now support extremely long context windows, but the
quadratic complexity of vanilla attention results in significantly long Time-to-First-Token …