NV-Group: link-efficient reduction for distributed deep learning on modern dense GPU systems

CH Chu, P Kousha, AA Awan, KS Khorassani… - Proceedings of the 34th …, 2020 - dl.acm.org
The advanced fabrics like NVIDIA NVLink are enabling the deployment of dense Graphics
Processing Unit (GPU) systems such as DGX-2 and Summit. With the wide adoption of large …

Designing high-performance MPI libraries with on-the-fly compression for modern GPU clusters

Q Zhou, C Chu, NS Kumar, P Kousha… - 2021 IEEE …, 2021 - ieeexplore.ieee.org
While the memory bandwidth of accelerators such as GPU has significantly improved over
the last decade, the commodity networks such as Ethernet and InfiniBand are lagging in …

Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL?

AA Awan, CH Chu, H Subramoni… - Proceedings of the 25th …, 2018 - dl.acm.org
Traditionally, MPI runtimes have been designed for clusters with a large number of nodes.
However, with the advent of MPI+CUDA applications and dense multi-GPU systems, it has …

Evaluating scalability bottlenecks by workload extrapolation

R Shi, Y Gan, Y Wang - 2018 IEEE 26th international …, 2018 - ieeexplore.ieee.org
Testing a scalability bottleneck requires a large system to generate sufficient load, which is
usually not accessible to researchers. To address this problem, this paper extrapolates the …

Accelerating MPI all-to-all communication with online compression on modern GPU clusters

Q Zhou, P Kousha, Q Anthony… - … Conference on High …, 2022 - Springer
As more High-Performance Computing (HPC) and Deep Learning (DL) applications
are adapting to scale using GPUs, the communication of GPU-resident data is becoming …

Toward a new Linpack‐like benchmark for heterogeneous computing resources

L Carracciuolo, V Mele… - … and Computation: Practice …, 2024 - Wiley Online Library
This work describes some first efforts to design a new Linpack‐like benchmark useful to
evaluate the performance of Heterogeneous Computing Resources. The benchmark is …

Zedwulf: Power-performance tradeoffs of a 32-node Zynq SoC cluster

P Moorthy, N Kapre - 2015 IEEE 23rd Annual International …, 2015 - ieeexplore.ieee.org
Commodity SoCs with hybrid architectures that combine CPUs with programmable FPGA
fabric such as the Xilinx Zynq SoC have become a competitive energy-efficient platform for …

Adaptive and hierarchical large message all-to-all communication algorithms for large-scale dense GPU systems

KS Khorassani, CH Chu, QG Anthony… - 2021 IEEE/ACM 21st …, 2021 - ieeexplore.ieee.org
In recent years, GPU-enhanced clusters have become more prevalent in High-Performance
Computing (HPC), leading to a demand for more efficient multi-GPU communication. This …

Designing a ROCm-aware MPI library for AMD GPUs: early experiences

K Shafie Khorassani, J Hashmi, CH Chu… - … Conference on High …, 2021 - Springer
Due to the emergence of AMD GPUs and their adoption in upcoming exascale systems (e.g.
Frontier), it is pertinent to have scientific applications and communication middlewares …

Performance evaluation of MPI libraries on GPU-enabled OpenPOWER architectures: Early experiences

KS Khorassani, CH Chu, H Subramoni… - … Computing: ISC High …, 2019 - Springer
The advent of Graphics Processing Unit (GPU)-enabled OpenPOWER architectures
are empowering the advancement of various High-Performance Computing (HPC) …