InArt: In-Network Aggregation with Route Selection for Accelerating Distributed Training

J Liu, Y Zhai, G Zhao, H Xu, J Fang, Z Zeng… - Proceedings of the ACM …, 2024 - dl.acm.org
Deep learning has brought about a revolutionary transformation in network applications,
particularly in domains like e-commerce and online advertising. Distributed training (DT), as …

Online Scheduling and Pricing for Multi-LoRA Fine-Tuning Tasks

Y Zheng, L Jiao, H Yang, L Chen, Y Liu… - Proceedings of the 53rd …, 2024 - dl.acm.org
Fine-tuning pre-trained models with task-specific data can produce customized models
effective for downstream tasks. However, operating large-scale such fine-tuning tasks in real …

Proactive, Accuracy-aware Straggler Mitigation in Machine Learning Clusters

S Tairin, H Shen, A Iyer - 2024 IEEE International Parallel and …, 2024 - ieeexplore.ieee.org
Slower workers, known as stragglers, can significantly prolong training time in Machine
Learning (ML) clusters. We present SMS, a proactive straggler mitigation system with four …