Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

Data management in machine learning: Challenges, techniques, and systems

A Kumar, M Boehm, J Yang - Proceedings of the 2017 ACM International …, 2017 - dl.acm.org
Large-scale data analytics using statistical machine learning (ML), popularly called
advanced analytics, underpins many modern data-driven applications. The data …

Efficient memory management for large language model serving with PagedAttention

W Kwon, Z Li, S Zhuang, Y Sheng, L Zheng… - Proceedings of the 29th …, 2023 - dl.acm.org
High throughput serving of large language models (LLMs) requires batching sufficiently
many requests at a time. However, existing systems struggle because the key-value cache …
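The core idea behind PagedAttention is to manage the KV cache like virtual memory: split it into fixed-size blocks allocated on demand from a shared pool, so a sequence's cache need not be contiguous and freed blocks are reused across requests. A minimal sketch of that block-table bookkeeping, with all names (`PagedKVCache`, `block_size`, etc.) illustrative rather than the vLLM API:

```python
class PagedKVCache:
    """Toy paged KV-cache bookkeeping: fixed-size blocks from a shared pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size            # tokens stored per block
        self.free = list(range(num_blocks))     # shared pool of physical block ids
        self.tables = {}                        # seq_id -> list of block ids (its "page table")
        self.lengths = {}                       # seq_id -> tokens cached so far

    def append(self, seq_id):
        """Reserve a cache slot for the next token of `seq_id`.

        Returns (physical_block_id, slot_within_block); grabs a fresh
        block from the pool only when the last one is full.
        """
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:            # last block full (or none yet)
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are allocated per token batch rather than reserved up front for a maximum sequence length, interleaved sequences share the pool without internal fragmentation beyond one partially filled block each.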

Orca: A distributed serving system for Transformer-based generative models

GI Yu, JS Jeong, GW Kim, S Kim, BG Chun - 16th USENIX Symposium …, 2022 - usenix.org
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have
recently attracted huge interest, emphasizing the need for system support for serving models …

AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

Z Li, L Zheng, Y Zhong, V Liu, Y Sheng, X Jin… - … USENIX Symposium on …, 2023 - usenix.org
Model parallelism is conventionally viewed as a method to scale a single large deep
learning model beyond the memory limits of a single device. In this paper, we demonstrate …

Pond: CXL-based memory pooling systems for cloud platforms

H Li, DS Berger, L Hsu, D Ernst, P Zardoshti… - Proceedings of the 28th …, 2023 - dl.acm.org
Public cloud providers seek to meet stringent performance requirements and low hardware
cost. A key driver of performance and cost is main memory. Memory pooling promises to …

InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

W Lee, J Lee, J Seo, J Sim - 18th USENIX Symposium on Operating …, 2024 - usenix.org
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …

Software engineering for AI-based systems: a survey

S Martínez-Fernández, J Bogner, X Franch… - ACM Transactions on …, 2022 - dl.acm.org
AI-based systems are software systems with functionalities enabled by at least one AI
component (e.g., for image recognition, speech recognition, or autonomous driving). AI-based systems …

Ray: A distributed framework for emerging AI applications

P Moritz, R Nishihara, S Wang, A Tumanov… - … USENIX symposium on …, 2018 - usenix.org
The next generation of AI applications will continuously interact with the environment and
learn from these interactions. These applications impose new and demanding systems …

INFaaS: Automated model-less inference serving

F Romero, Q Li, NJ Yadwadkar… - 2021 USENIX Annual …, 2021 - usenix.org
Despite existing work in machine learning inference serving, ease-of-use and cost efficiency
remain challenges at large scales. Developers must manually search through thousands of …