Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
Machine learning (ML) inference platforms are tasked with balancing two competing goals:
ensuring high throughput given many requests, and delivering low-latency responses to …
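A minimal PyTorch sketch of the general early-exit pattern the title refers to: an intermediate classifier head (a "ramp") answers a request early when its softmax confidence clears a threshold, trading a little accuracy for lower latency. The layer sizes, module names, and 0.9 threshold here are illustrative assumptions, not Apparate's actual ramp design or exiting policy.

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Toy two-stage classifier with one early-exit ramp (illustrative only)."""

    def __init__(self, dim=128, num_classes=10, threshold=0.9):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.exit1 = nn.Linear(dim, num_classes)   # early-exit head ("ramp")
        self.stage2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.final = nn.Linear(dim, num_classes)   # full-depth head
        self.threshold = threshold                 # assumed confidence cutoff

    def forward(self, x):
        h = self.stage1(x)
        early_logits = self.exit1(h)
        conf = early_logits.softmax(dim=-1).max(dim=-1).values
        # Exit early when the ramp is confident enough (batch size 1 here),
        # skipping the remaining layers and their latency.
        if conf.item() >= self.threshold:
            return early_logits, "early"
        return self.final(self.stage2(h)), "full"

model = EarlyExitNet().eval()
with torch.no_grad():
    logits, path = model(torch.randn(1, 128))
print(path)  # "early" or "full", depending on the ramp's confidence
```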
Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs
GPUs have become the de facto hardware devices to accelerate Deep Neural Network
(DNN) inference in deep learning (DL) frameworks. However, the conventional sequential …
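A minimal PyTorch sketch of inter-operator parallelism, the alternative to the sequential execution the snippet contrasts with: two independent operators are launched on separate CUDA streams so the GPU can overlap them. The operators and manual stream assignment are illustrative assumptions, not Opara's actual scheduling policy.

```python
import torch

assert torch.cuda.is_available()  # sketch assumes a CUDA-capable GPU

# Two independent operators that a conventional sequential executor
# would run one after the other on the default stream.
branch_a = torch.nn.Linear(1024, 1024).cuda()
branch_b = torch.nn.Conv1d(16, 16, kernel_size=3, padding=1).cuda()

x = torch.randn(64, 1024, device="cuda")
y = torch.randn(64, 16, 1024, device="cuda")

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

torch.cuda.synchronize()  # ensure the inputs are materialized first
with torch.no_grad():
    # Launch each operator on its own stream; with no data dependency
    # between them, the GPU is free to execute the kernels concurrently.
    with torch.cuda.stream(s1):
        out_a = branch_a(x)
    with torch.cuda.stream(s2):
        out_b = branch_b(y)
torch.cuda.synchronize()  # join both streams before using the outputs
```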
Cascade: A Platform for Delay-Sensitive Edge Intelligence
Interest in intelligent edge computing is surging, driven by improving connectivity and
hardware advances. This is creating a need: today's cloud platforms optimize for high …