Parameter-efficient fine-tuning for large models: A comprehensive survey

Z Han, C Gao, J Liu, J Zhang, SQ Zhang - arXiv preprint arXiv:2403.14608, 2024 - arxiv.org
Large models represent a groundbreaking advancement in multiple application fields,
enabling remarkable achievements across various tasks. However, their unprecedented …

Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

PowerInfer: Fast large language model serving with a consumer-grade GPU

Y Song, Z Mi, H Xie, H Chen - Proceedings of the ACM SIGOPS 30th …, 2024 - dl.acm.org
This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference
engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key …

Fairness in serving large language models

Y Sheng, S Cao, D Li, B Zhu, Z Li, D Zhuo… - … USENIX Symposium on …, 2024 - usenix.org
High-demand LLM inference services (e.g., ChatGPT and Bard) support a wide range of
requests from short chat conversations to long document reading. To ensure that all client …

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

LoRAMoE: Alleviating world knowledge forgetting in large language models via MoE-style plugin

S Dou, E Zhou, Y Liu, S Gao, W Shen… - Proceedings of the …, 2024 - aclanthology.org
Supervised fine-tuning (SFT) is a crucial step for large language models (LLMs), enabling
them to align with human instructions and enhance their capabilities in downstream tasks …

A survey of resource-efficient LLM and multimodal foundation models

M Xu, W Yin, D Cai, R Yi, D Xu, Q Wang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine …

SplitLoRA: A split parameter-efficient fine-tuning framework for large language models

Z Lin, X Hu, Y Zhang, Z Chen, Z Fang, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The scalability of large language models (LLMs) in handling high-complexity models and
large-scale datasets has led to tremendous successes in pivotal domains. While there is an …

ServerlessLLM: Low-latency serverless inference for large language models

Y Fu, L Xue, Y Huang, AO Brabete… - … Systems Design and …, 2024 - research.ed.ac.uk
This paper presents ServerlessLLM, a distributed system designed to support low-latency
serverless inference for Large Language Models (LLMs). By harnessing the substantial near …

GaLore: Memory-efficient LLM training by gradient low-rank projection

J Zhao, Z Zhang, B Chen, Z Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Training Large Language Models (LLMs) presents significant memory challenges,
predominantly due to the growing size of weights and optimizer states. Common memory …