Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Inference without interference: Disaggregate LLM inference for mixed downstream workloads

C Hu, H Huang, L Xu, X Chen, J Xu, S Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language model (LLM) inference serving is now the backbone of
many cloud services. LLM inference consists of a prefill phase and a decode phase …

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …

BitDelta: Your fine-tune may only be worth one bit

J Liu, G Xiao, K Li, JD Lee, S Han, T Dao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are typically trained in two phases: pre-training on large
internet-scale datasets, and fine-tuning for downstream tasks. Given the higher …
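
The title's one-bit claim can be sketched in a few lines of numpy (a hedged illustration of the idea, not the authors' code; the mean-absolute scale is one simple choice, and any calibration the paper performs is omitted): keep only the sign of the fine-tune delta plus a single per-matrix scale.

```python
import numpy as np

rng = np.random.default_rng(0)
W_base = rng.normal(size=(4, 4)).astype(np.float32)                 # pretrained
W_ft = W_base + 0.01 * rng.normal(size=(4, 4)).astype(np.float32)   # fine-tuned

delta = W_ft - W_base
sign = np.sign(delta)            # ~1 bit per parameter to store
scale = np.abs(delta).mean()     # one float per matrix (a simple choice)

W_hat = W_base + scale * sign    # reconstructed fine-tuned weights
print("mean |error|:", float(np.abs(W_hat - W_ft).mean()))
```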

MemServe: Context caching for disaggregated LLM serving with elastic memory pool

C Hu, H Huang, J Hu, J Xu, X Chen, T Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model (LLM) serving has transformed from stateless to stateful systems,
utilizing techniques like context caching and disaggregated inference. These optimizations …
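
As a rough illustration of context caching (this sketch and its names, `PrefixKVCache` and `get_or_prefill`, are invented for exposition, not MemServe's interface), a serving layer can key KV caches by prompt prefix so that a repeated prefix skips its prefill work:

```python
import hashlib

class PrefixKVCache:
    def __init__(self):
        self.pool = {}  # prefix hash -> "KV cache" (here: just the token list)

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def get_or_prefill(self, tokens):
        key = self._key(tokens)
        if key in self.pool:
            return self.pool[key], True   # cache hit: prefill skipped
        kv = list(tokens)                 # stand-in for the real KV tensors
        self.pool[key] = kv
        return kv, False

cache = PrefixKVCache()
print(cache.get_or_prefill([1, 2, 3])[1])  # False: computed
print(cache.get_or_prefill([1, 2, 3])[1])  # True: reused
```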

Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs

A Panda, B Isik, X Qi, S Koyejo, T Weissman… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing methods for adapting large language models (LLMs) to new tasks are not suited to
multi-task adaptation because they modify all the model weights, causing destructive …
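
A hedged numpy sketch of that idea (the magnitude-based top-k mask below is one plausible way to pick a sparse "ticket"; it is not necessarily the paper's exact procedure): adapt only a small subset of weights per task, so different tasks' updates stop overwriting the same parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 6)).astype(np.float32)                   # shared base
full_delta = 0.05 * rng.normal(size=(6, 6)).astype(np.float32)   # dense update

# Keep only the k largest-magnitude entries of the update (the "ticket").
k = 5
thresh = np.sort(np.abs(full_delta), axis=None)[-k]
mask = (np.abs(full_delta) >= thresh).astype(np.float32)

W_task = W + mask * full_delta   # sparse, task-specific adaptation
print("weights changed:", int(mask.sum()), "of", W.size)
```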

P/D-Serve: Serving Disaggregated Large Language Model at Scale

Y Jin, T Wang, H Lin, M Song, P Li, Y Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Serving disaggregated large language models (LLMs) over tens of thousands of xPU
devices (GPUs or NPUs) with reliable performance faces multiple challenges. 1) Ignoring …

Demystifying Data Management for Large Language Models

X Miao, Z Jia, B Cui - Companion of the 2024 International Conference …, 2024 - dl.acm.org
Navigating the intricacies of data management in the era of Large Language Models (LLMs)
presents both challenges and opportunities for database and data management …