Designing efficient small message transfer mechanism for inter-node MPI communication on...

CH Chu, P Kousha, AA Awan, KS Khorassani… - Proceedings of the 34th …, 2020 - dl.acm.org

The advanced fabrics like NVIDIA NVLink are enabling the deployment of dense Graphics
Processing Unit (GPU) systems such as DGX-2 and Summit. With the wide adoption of large …

Cited by 47 · Related articles · All 4 versions

[PDF] nsf.gov

Designing high-performance MPI libraries with on-the-fly compression for modern GPU clusters

Q Zhou, C Chu, NS Kumar, P Kousha… - 2021 IEEE …, 2021 - ieeexplore.ieee.org

While the memory bandwidth of accelerators such as GPU has significantly improved over
the last decade, the commodity networks such as Ethernet and InfiniBand are lagging in …

Cited by 33 · Related articles · All 4 versions

[PDF] arxiv.org

Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL?

AA Awan, CH Chu, H Subramoni… - Proceedings of the 25th …, 2018 - dl.acm.org

Traditionally, MPI runtimes have been designed for clusters with a large number of nodes.
However, with the advent of MPI+CUDA applications and dense multi-GPU systems, it has …

Cited by 60 · Related articles · All 5 versions

[PDF] nsf.gov

Evaluating scalability bottlenecks by workload extrapolation

R Shi, Y Gan, Y Wang - 2018 IEEE 26th international …, 2018 - ieeexplore.ieee.org

Testing a scalability bottleneck requires a large system to generate sufficient load, which is
usually not accessible to researchers. To address this problem, this paper extrapolates the …

Cited by 44 · Related articles · All 4 versions

[PDF] nsf.gov

Accelerating MPI all-to-all communication with online compression on modern GPU clusters

Q Zhou, P Kousha, Q Anthony… - … Conference on High …, 2022 - Springer

As more High-Performance Computing (HPC) and Deep Learning (DL) applications
are adapting to scale using GPUs, the communication of GPU-resident data is becoming …

Cited by 19 · Related articles · All 6 versions

[PDF] wiley.com

Toward a new Linpack‐like benchmark for heterogeneous computing resources

L Carracciuolo, V Mele… - … and Computation: Practice …, 2024 - Wiley Online Library

This work describes some first efforts to design a new Linpack‐like benchmark useful to
evaluate the performance of Heterogeneous Computing Resources. The benchmark is …

Cited by 3 · Related articles

[PDF] ntu.edu.sg

Zedwulf: Power-performance tradeoffs of a 32-node Zynq SoC cluster

P Moorthy, N Kapre - 2015 IEEE 23rd Annual International …, 2015 - ieeexplore.ieee.org

Commodity SoCs with hybrid architectures that combine CPUs with programmable FPGA
fabric such as the Xilinx Zynq SoC have become a competitive energy-efficient platform for …

Cited by 50 · Related articles · All 9 versions

[PDF] nsf.gov

Adaptive and hierarchical large message all-to-all communication algorithms for large-scale dense GPU systems

KS Khorassani, CH Chu, QG Anthony… - 2021 IEEE/ACM 21st …, 2021 - ieeexplore.ieee.org

In recent years, GPU-enhanced clusters have become more prevalent in High-Performance
Computing (HPC), leading to a demand for more efficient multi-GPU communication. This …

Cited by 16 · Related articles · All 3 versions

[PDF] nsf.gov

Designing a ROCm-aware MPI library for AMD GPUs: early experiences

K Shafie Khorassani, J Hashmi, CH Chu… - … Conference on High …, 2021 - Springer

Due to the emergence of AMD GPUs and their adoption in upcoming exascale systems (e.g.,
Frontier), it is pertinent to have scientific applications and communication middlewares …

Cited by 20 · Related articles · All 3 versions

[PDF] researchgate.net

Performance evaluation of MPI libraries on GPU-enabled OpenPOWER architectures: Early experiences

KS Khorassani, CH Chu, H Subramoni… - … Computing: ISC High …, 2019 - Springer

The advent of Graphics Processing Unit (GPU)-enabled OpenPOWER architectures
are empowering the advancement of various High-Performance Computing (HPC) …

Cited by 28 · Related articles · All 10 versions