Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

Data management in machine learning: Challenges, techniques, and systems

A Kumar, M Boehm, J Yang - Proceedings of the 2017 ACM International …, 2017 - dl.acm.org
Large-scale data analytics using statistical machine learning (ML), popularly called
advanced analytics, underpins many modern data-driven applications. The data …

Efficient memory management for large language model serving with PagedAttention

W Kwon, Z Li, S Zhuang, Y Sheng, L Zheng… - Proceedings of the 29th …, 2023 - dl.acm.org
High throughput serving of large language models (LLMs) requires batching sufficiently
many requests at a time. However, existing systems struggle because the key-value cache …
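The core idea behind PagedAttention is to manage the KV cache like virtual memory: split it into fixed-size blocks allocated on demand from a shared pool, so a sequence's cache need not be contiguous and freed blocks are reused across requests. A minimal sketch of that block-table bookkeeping, with all names (`PagedKVCache`, `block_size`, etc.) illustrative rather than the vLLM API:

```python
class PagedKVCache:
    """Toy paged KV-cache bookkeeping: fixed-size blocks from a shared pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size            # tokens stored per block
        self.free = list(range(num_blocks))     # shared pool of physical block ids
        self.tables = {}                        # seq_id -> list of block ids (its "page table")
        self.lengths = {}                       # seq_id -> tokens cached so far

    def append(self, seq_id):
        """Reserve a cache slot for the next token of `seq_id`.

        Returns (physical_block_id, slot_within_block); grabs a fresh
        block from the pool only when the last one is full.
        """
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:            # last block full (or none yet)
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are allocated per token batch rather than reserved up front for a maximum sequence length, interleaved sequences share the pool without internal fragmentation beyond one partially filled block each.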

Orca: A distributed serving system for Transformer-based generative models

GI Yu, JS Jeong, GW Kim, S Kim, BG Chun - 16th USENIX Symposium …, 2022 - usenix.org
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have
recently attracted huge interest, emphasizing the need for system support for serving models …

AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

Z Li, L Zheng, Y Zhong, V Liu, Y Sheng, X Jin… - … USENIX Symposium on …, 2023 - usenix.org
Model parallelism is conventionally viewed as a method to scale a single large deep
learning model beyond the memory limits of a single device. In this paper, we demonstrate …

Pond: CXL-based memory pooling systems for cloud platforms

H Li, DS Berger, L Hsu, D Ernst, P Zardoshti… - Proceedings of the 28th …, 2023 - dl.acm.org
Public cloud providers seek to meet stringent performance requirements and low hardware
cost. A key driver of performance and cost is main memory. Memory pooling promises to …

InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

W Lee, J Lee, J Seo, J Sim - 18th USENIX Symposium on Operating …, 2024 - usenix.org
Transformer-based large language models (LLMs) demonstrate impressive performance
across various natural language processing tasks. Serving LLM inference for generating …

Software engineering for AI-based systems: a survey

S Martínez-Fernández, J Bogner, X Franch… - ACM Transactions on …, 2022 - dl.acm.org
AI-based systems are software systems with functionalities enabled by at least one AI
component (e.g., for image recognition, speech recognition, or autonomous driving). AI-based systems …

Ray: A distributed framework for emerging AI applications

P Moritz, R Nishihara, S Wang, A Tumanov… - … USENIX symposium on …, 2018 - usenix.org
The next generation of AI applications will continuously interact with the environment and
learn from these interactions. These applications impose new and demanding systems …

INFaaS: Automated model-less inference serving

F Romero, Q Li, NJ Yadwadkar… - 2021 USENIX Annual …, 2021 - usenix.org
Despite existing work in machine learning inference serving, ease-of-use and cost efficiency
remain challenges at large scales. Developers must manually search through thousands of …