Swing: Short-cutting Rings for Higher Bandwidth Allreduce

D De Sensi, T Bonato, D Saam, T Hoefler - 21st USENIX Symposium on …, 2024 - usenix.org
The allreduce collective operation accounts for a significant fraction of the runtime of
workloads running on distributed systems. One factor determining its performance is the …

Network-accelerated non-contiguous memory transfers

S Di Girolamo, K Taranov, A Kurth… - Proceedings of the …, 2019 - dl.acm.org
Applications often communicate data that is non-contiguous in the send- or the receive-buffer,
e.g., when exchanging a column of a matrix stored in row-major order. While non …

HAND: A hybrid approach to accelerate non-contiguous data movement using MPI datatypes on GPU clusters

R Shi, X Lu, S Potluri, K Hamidouche… - 2014 43rd …, 2014 - ieeexplore.ieee.org
An increasing number of MPI applications are being ported to take advantage of the
compute power offered by GPUs. Data movement continues to be the major bottleneck on …

MPI derived datatypes: Performance and portability issues

Q Xiong, PV Bangalore, A Skjellum… - Proceedings of the 25th …, 2018 - dl.acm.org
This paper addresses performance-portability and overall performance issues when derived
datatypes are used with four MPI implementations: Open MPI, MPICH, MVAPICH2, and Intel …

FALCON: Efficient designs for zero-copy MPI datatype processing on emerging architectures

JM Hashmi, S Chakraborty, M Bayatpour… - 2019 IEEE …, 2019 - ieeexplore.ieee.org
Derived datatypes are commonly used in MPI applications to exchange non-contiguous data
among processes. However, state-of-the-art MPI libraries do not offer efficient processing of …

FALCON-X: Zero-copy MPI derived datatype processing on modern CPU and GPU architectures

JM Hashmi, CH Chu, S Chakraborty… - Journal of Parallel and …, 2020 - Elsevier
This paper addresses the challenges of MPI derived datatype processing and proposes
FALCON-X—A Fast and Low-overhead Communication framework for optimized zero-copy …

On the expected and observed communication performance with MPI derived datatypes

A Carpen-Amarie, S Hunold, JL Träff - … of the 23rd European MPI Users' …, 2016 - dl.acm.org
We examine natural expectations on communication performance using MPI derived
datatypes in comparison to the baseline, "raw" performance of communicating simple …

High performance MPI datatype support with user-mode memory registration: Challenges, designs, and benefits

M Li, H Subramoni, K Hamidouche… - … on Cluster Computing, 2015 - ieeexplore.ieee.org
Noncontiguous data communication has been heavily adopted in scientific applications,
especially for those written with MPI. Common strategies to handle noncontiguous data, like …

FPsPIN: An FPGA-based Open-Hardware Research Platform for Processing in the Network

T Schneider, P Xu, T Hoefler - arXiv preprint arXiv:2405.16378, 2024 - arxiv.org
In the era of post-Moore computing, network offload emerges as a solution to two
challenges: the imperative for low-latency communication and the push towards hardware …

Evaluating Data Redistribution in PaRSEC

Q Cao, G Bosilca, N Losada, W Wu… - … on Parallel and …, 2021 - ieeexplore.ieee.org
Data redistribution aims to reshuffle data to optimize some objective for an algorithm. The
objective can be multi-dimensional, such as improving computational load balance or …