Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks

Z Wang, B Jin, Z Yu, M Zhang - arXiv preprint arXiv:2407.08454, 2024 - arxiv.org
How to efficiently serve Large Language Models (LLMs) has become a pressing issue
because of the huge computational cost of their autoregressive generation process. To …

InMu-Net: Advancing Multi-modal Intent Detection via Information Bottleneck and Multi-sensory Processing

Z Zhu, X Cheng, Z Chen, Y Chen, Y Zhang… - Proceedings of the …, 2024 - dl.acm.org
Multi-modal intent detection (MID) aims to comprehend users' intentions through diverse
modalities, a task that has received widespread attention in dialogue systems. Despite the …

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

W Wu, Z Pan, C Wang, L Chen, Y Bai, K Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the development of large language models (LLMs), the ability to handle longer contexts
has become a key capability for Web applications such as cross-document understanding …

UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

J Xiong, J Shen, F Ye, C Tao, Z Wan, J Lu, X Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Deploying large language models (LLMs) is challenging due to their high memory and
computational demands, especially during long-context inference. While key-value (KV) …

Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion

H Shen, Z Wan, X Wang, M Zhang - arXiv preprint arXiv:2409.09808, 2024 - arxiv.org
Mamba and Vision Mamba (Vim) models have shown their potential as an alternative to
methods based on the Transformer architecture. This work introduces Fast Mamba for Vision …

ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty

M Zhong, X Liu, C Zhang, Y Lei, Y Gao, Y Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have become a research hotspot. To accelerate the
inference of LLMs, storing computed caches in memory has become the standard technique …