Title | Authors | Venue | Cited by | Year |
Chronus: A novel deadline-aware scheduler for deep learning training jobs | W Gao, Z Ye, P Sun, Y Wen, T Zhang | Proceedings of the ACM Symposium on Cloud Computing, 609-623, 2021 | 30 | 2021 |
Deep learning workload scheduling in GPU datacenters: A survey | Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo, T Zhang, Y Wen | ACM Computing Surveys 56 (6), 1-38, 2024 | 28* | 2024 |
Astraea: A fair deep learning scheduler for multi-tenant GPU clusters | Z Ye, P Sun, W Gao, T Zhang, X Wang, S Yan, Y Luo | IEEE Transactions on Parallel and Distributed Systems 33 (11), 2781-2793, 2021 | 12 | 2021 |
Characterization of large language model development in the datacenter | Q Hu, Z Ye, Z Wang, G Wang, M Zhang, Q Chen, P Sun, D Lin, X Wang, ... | 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI …, 2024 | 11 | 2024 |
Hydro: Surrogate-Based Hyperparameter Tuning Service in the Datacenter | Q Hu, Z Ye, M Zhang, Q Chen, P Sun, Y Wen, T Zhang | | 4 | 2023 |
Tear up the bubble boom: Lessons learned from a deep learning research and development cluster | Z Yang, Z Ye, T Fu, J Luo, X Wei, Y Luo, X Wang, Z Wang, T Zhang | 2022 IEEE 40th International Conference on Computer Design (ICCD), 672-680, 2022 | 3 | 2022 |
UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands | W Gao, Z Ye, P Sun, T Zhang, Y Wen | IEEE Transactions on Computers, 2024 | 1 | 2024 |
AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning | Q Chen, Q Hu, Z Ye, G Wang, P Sun, Y Wen, T Zhang | arXiv preprint arXiv:2311.00257, 2023 | | 2023 |