Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving

Y Dai, R Pan, A Iyer, K Li, R Netravali - arXiv preprint arXiv:2312.05385, 2023 - arxiv.org
Machine learning (ML) inference platforms are tasked with balancing two competing goals:
ensuring high throughput given many requests, and delivering low-latency responses to …

Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs

A Chen, F Xu, L Han, Y Dong, L Chen, Z Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
GPUs have become the de facto hardware devices to accelerate Deep Neural Network
(DNN) inference in deep learning (DL) frameworks. However, the conventional sequential …

Cascade: A Platform for Delay-Sensitive Edge Intelligence

W Song, T Garrett, Y Yang, M Liu, E Tremel… - arXiv preprint arXiv …, 2023 - cs.cornell.edu
Interest in intelligent edge computing is surging, driven by improving connectivity and
advances in hardware. This is creating a need: today's cloud platforms optimize for high …