Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks

Z Wang, B Jin, Z Yu, M Zhang - arXiv preprint arXiv:2407.08454, 2024 - arxiv.org
How to efficiently serve Large Language Models (LLMs) has become a pressing issue
because of the huge computational cost of their autoregressive generation process. To …

InMu-Net: Advancing Multi-modal Intent Detection via Information Bottleneck and Multi-sensory Processing

Z Zhu, X Cheng, Z Chen, Y Chen, Y Zhang… - Proceedings of the …, 2024 - dl.acm.org
Multi-modal intent detection (MID) aims to comprehend users' intentions through diverse
modalities, a task that has received widespread attention in dialogue systems. Despite the …

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

W Wu, Z Pan, C Wang, L Chen, Y Bai, K Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the development of large language models (LLMs), the ability to handle longer contexts
has become a key capability for Web applications such as cross-document understanding …

UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

J Xiong, J Shen, F Ye, C Tao, Z Wan, J Lu, X Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Deploying large language models (LLMs) is challenging due to their high memory and
computational demands, especially during long-context inference. While key-value (KV) …

Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion

H Shen, Z Wan, X Wang, M Zhang - arXiv preprint arXiv:2409.09808, 2024 - arxiv.org
Mamba and Vision Mamba (Vim) models have shown their potential as an alternative to
methods based on the Transformer architecture. This work introduces Fast Mamba for Vision …

ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty

M Zhong, X Liu, C Zhang, Y Lei, Y Gao, Y Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have become a research hotspot. To accelerate the
inference of LLMs, storing computed caches in memory has become the standard technique …