Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks
How to efficiently serve Large Language Models (LLMs) has become a pressing issue
because of the huge computational cost of their autoregressive generation process. To …
InMu-Net: Advancing Multi-modal Intent Detection via Information Bottleneck and Multi-sensory Processing
Multi-modal intent detection (MID) aims to comprehend users' intentions through diverse
modalities, which has received widespread attention in dialogue systems. Despite the …
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
W Wu, Z Pan, C Wang, L Chen, Y Bai, K Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the development of large language models (LLMs), the ability to handle longer contexts
has become a key capability for Web applications such as cross-document understanding …
UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference
Deploying large language models (LLMs) is challenging due to their high memory and
computational demands, especially during long-context inference. While key-value (KV) …
Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion
Mamba and Vision Mamba (Vim) models have shown their potential as an alternative to
methods based on the Transformer architecture. This work introduces Fast Mamba for Vision …
ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
Large Language Models (LLMs) have become a research hotspot. To accelerate the
inference of LLMs, storing computed caches in memory has become the standard technique …
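Several of the entries above target the same underlying mechanism: during autoregressive decoding, the keys and values of already-processed tokens are cached so they are not recomputed at every step, which is what the last snippet calls the standard technique. As a minimal, paper-agnostic sketch of why that cache exists and why it grows (the projection matrices, toy dimensions, and names below are illustrative assumptions, not taken from any of the listed works):

import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector q against all cached keys/values.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_step(x_t, cache, w_q, w_k, w_v):
    # Append the new token's key and value to the cache so earlier tokens
    # never have to be re-projected; memory grows linearly with sequence length.
    cache["K"].append(x_t @ w_k)
    cache["V"].append(x_t @ w_v)
    K = np.stack(cache["K"])
    V = np.stack(cache["V"])
    return attention(x_t @ w_q, K, V)

# Toy usage with a 4-dimensional hidden state and random projections (hypothetical values).
rng = np.random.default_rng(0)
d = 4
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
cache = {"K": [], "V": []}
for _ in range(5):                        # autoregressive decoding loop
    x_t = rng.standard_normal(d)          # stand-in for the current token's hidden state
    _ = decode_step(x_t, cache, w_q, w_k, w_v)
print(len(cache["K"]))                    # 5 cached entries: the structure the listed papers merge, select, or compress

The listed works differ only in how they shrink or reorganize this cache (adaptive merging, dynamic token-level selection, uncertainty-aware budgeting); the sketch does not reproduce any of those methods.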