Swing: Short-cutting Rings for Higher Bandwidth Allreduce
The allreduce collective operation accounts for a significant fraction of the runtime of
workloads running on distributed systems. One factor determining its performance is the …
Network-accelerated non-contiguous memory transfers
Applications often communicate data that is non-contiguous in the send or the receive
buffer, e.g., when exchanging a column of a matrix stored in row-major order. While non …
Hand: A hybrid approach to accelerate non-contiguous data movement using MPI datatypes on GPU clusters
An increasing number of MPI applications are being ported to take advantage of the
compute power offered by GPUs. Data movement continues to be the major bottleneck on …
MPI derived datatypes: Performance and portability issues
This paper addresses performance-portability and overall performance issues when derived
datatypes are used with four MPI implementations: Open MPI, MPICH, MVAPICH2, and Intel …
Falcon: Efficient designs for zero-copy MPI datatype processing on emerging architectures
Derived datatypes are commonly used in MPI applications to exchange non-contiguous data
among processes. However, state-of-the-art MPI libraries do not offer efficient processing of …
FALCON-X: Zero-copy MPI derived datatype processing on modern CPU and GPU architectures
This paper addresses the challenges of MPI derived datatype processing and proposes
FALCON-X—A Fast and Low-overhead Communication framework for optimized zero-copy …
On the expected and observed communication performance with MPI derived datatypes
A Carpen-Amarie, S Hunold, JL Träff - … of the 23rd European MPI Users' …, 2016 - dl.acm.org
We examine natural expectations on communication performance using MPI derived
datatypes in comparison to the baseline, "raw" performance of communicating simple …
High performance MPI datatype support with user-mode memory registration: Challenges, designs, and benefits
Noncontiguous data communication has been heavily adopted in scientific applications,
especially for those written with MPI. Common strategies to handle noncontiguous data, like …
FPsPIN: An FPGA-based Open-Hardware Research Platform for Processing in the Network
T Schneider, P Xu, T Hoefler - arXiv preprint arXiv:2405.16378, 2024 - arxiv.org
In the era of post-Moore computing, network offload emerges as a solution to two
challenges: the imperative for low-latency communication and the push towards hardware …
Evaluating Data Redistribution in PaRSEC
Data redistribution aims to reshuffle data to optimize some objective for an algorithm. The
objective can be multi-dimensional, such as improving computational load balance or …