NV-Group: link-efficient reduction for distributed deep learning on modern dense GPU systems
Advanced fabrics like NVIDIA NVLink are enabling the deployment of dense Graphics
Processing Unit (GPU) systems such as DGX-2 and Summit. With the wide adoption of large …
Designing high-performance MPI libraries with on-the-fly compression for modern GPU clusters
While the memory bandwidth of accelerators such as GPUs has significantly improved over
the last decade, commodity networks such as Ethernet and InfiniBand are lagging in …
Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL?
Traditionally, MPI runtimes have been designed for clusters with a large number of nodes.
However, with the advent of MPI+CUDA applications and dense multi-GPU systems, it has …
Evaluating scalability bottlenecks by workload extrapolation
Testing a scalability bottleneck requires a large system to generate sufficient load, which is
usually not accessible to researchers. To address this problem, this paper extrapolates the …
Accelerating MPI all-to-all communication with online compression on modern GPU clusters
As more High-Performance Computing (HPC) and Deep Learning (DL) applications
are adapting to scale using GPUs, the communication of GPU-resident data is becoming …
Toward a new Linpack‐like benchmark for heterogeneous computing resources
L Carracciuolo, V Mele… - … and Computation: Practice …, 2024 - Wiley Online Library
This work describes some first efforts to design a new Linpack‐like benchmark useful to
evaluate the performance of Heterogeneous Computing Resources. The benchmark is …
Zedwulf: Power-performance tradeoffs of a 32-node Zynq SoC cluster
Commodity SoCs with hybrid architectures that combine CPUs with programmable FPGA
fabric such as the Xilinx Zynq SoC have become a competitive energy-efficient platform for …
Adaptive and hierarchical large message all-to-all communication algorithms for large-scale dense GPU systems
In recent years, GPU-enhanced clusters have become more prevalent in High-Performance
Computing (HPC), leading to a demand for more efficient multi-GPU communication. This …
Designing a ROCm-aware MPI library for AMD GPUs: early experiences
Due to the emergence of AMD GPUs and their adoption in upcoming exascale systems (e.g.,
Frontier), it is pertinent to have scientific applications and communication middlewares …
Performance evaluation of MPI libraries on GPU-enabled OpenPOWER architectures: Early experiences
The advent of Graphics Processing Unit (GPU)-enabled OpenPOWER architectures
is empowering the advancement of various High-Performance Computing (HPC) …