Software-hardware co-design for fast and scalable training of deep learning recommendation models D Mudigere, Y Hao, J Huang, Z Jia, A Tulloch, S Sridharan, X Liu, ... Proceedings of the 49th Annual International Symposium on Computer …, 2022 | 91 | 2022 |
The MVAPICH project: Transforming research into high-performance MPI library for HPC community DK Panda, H Subramoni, CH Chu, M Bayatpour Journal of Computational Science 52, 101208, 2021 | 74 | 2021 |
Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: Characterization, designs, and performance evaluation AA Awan, J Bédorf, CH Chu, H Subramoni, DK Panda 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid …, 2019 | 58 | 2019 |
Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL? AA Awan, CH Chu, H Subramoni, DK Panda Proceedings of the 25th European MPI Users' Group Meeting, 1-9, 2018 | 55 | 2018 |
NV-Group: Link-efficient reduction for distributed deep learning on modern dense GPU systems CH Chu, P Kousha, AA Awan, KS Khorassani, H Subramoni, DK Panda Proceedings of the 34th ACM International Conference on Supercomputing, 1-12, 2020 | 43 | 2020 |
M. Khorashadi, P D Mudigere, Y Hao, J Huang, Z Jia, A Tulloch, S Sridharan, X Liu, ... Bhattacharya, P. Lapukhov, M. Naumov, L. Qiao, M. Smelyanskiy, B. Jia, and V …, 2021 | 42 | 2021 |
OC-DNN: Exploiting advanced unified memory capabilities in CUDA 9 and Volta GPUs for out-of-core DNN training AA Awan, CH Chu, H Subramoni, X Lu, DK Panda 2018 IEEE 25th International Conference on High Performance Computing (HiPC …, 2018 | 39 | 2018 |
High-performance, distributed training of large-scale deep learning recommendation models D Mudigere, Y Hao, J Huang, A Tulloch, S Sridharan, X Liu, M Ozdal, ... arXiv preprint arXiv:2104.05158, 2021 | 33 | 2021 |
Exploiting GPUDirect RDMA in designing high performance OpenSHMEM for NVIDIA GPU clusters K Hamidouche, A Venkatesh, AA Awan, H Subramoni, CH Chu, ... 2015 IEEE International Conference on Cluster Computing, 78-87, 2015 | 28 | 2015 |
Improving SCTP performance by jitter-based congestion control over wired-wireless networks JM Chen, CH Chu, EHK Wu, MF Tsai, JR Wang EURASIP Journal on Wireless Communications and Networking 2011, 1-13, 2011 | 28 | 2011 |
Designing high-performance MPI libraries with on-the-fly compression for modern GPU clusters Q Zhou, C Chu, NS Kumar, P Kousha, SM Ghazimirsaeed, H Subramoni, ... 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS …, 2021 | 27 | 2021 |
CUDA kernel based collective reduction operations on large-scale GPU clusters CH Chu, K Hamidouche, A Venkatesh, AA Awan, DK Panda 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid …, 2016 | 27 | 2016 |
Performance evaluation of MPI libraries on GPU-enabled OpenPOWER architectures: Early experiences KS Khorassani, CH Chu, H Subramoni, DK Panda High Performance Computing: ISC High Performance 2019 International …, 2019 | 26 | 2019 |
Efficient and scalable multi-source streaming broadcast on GPU clusters for deep learning CH Chu, X Lu, AA Awan, H Subramoni, J Hashmi, B Elton, DK Panda 2017 46th International Conference on Parallel Processing (ICPP), 161-170, 2017 | 25 | 2017 |
Characterizing CUDA unified memory (UM)-aware MPI designs on modern GPU architectures KV Manian, AA Ammar, A Ruhela, CH Chu, H Subramoni, DK Panda Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, 43-52, 2019 | 23 | 2019 |
Designing a profiling and visualization tool for scalable and in-depth analysis of high-performance GPU clusters P Kousha, B Ramesh, KK Suresh, CH Chu, A Jain, N Sarkauskas, ... 2019 IEEE 26th International Conference on High Performance Computing, Data …, 2019 | 22 | 2019 |
Communication profiling and characterization of deep-learning workloads on clusters with high-performance interconnects AA Awan, A Jain, CH Chu, H Subramoni, DK Panda IEEE Micro 40 (1), 35-43, 2019 | 20 | 2019 |
Designing a ROCm-aware MPI library for AMD GPUs: early experiences K Shafie Khorassani, J Hashmi, CH Chu, CC Chen, H Subramoni, ... International Conference on High Performance Computing, 118-136, 2021 | 18 | 2021 |
Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2? AA Awan, KV Manian, CH Chu, H Subramoni, DK Panda Parallel Computing 85, 141-152, 2019 | 18 | 2019 |
IVC: Imperceptible video communication R Carvalho, CH Chu, LJ Chen Proc. of HotMobile (poster), 2014 | 18 | 2014 |