Scalable reduction collectives with data partitioning-based multi-leader design

M Bayatpour, S Chakraborty, H Subramoni… - Proceedings of the …, 2017 - dl.acm.org
Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in
modern multi-/many-core processors like Intel Xeon/Xeon Phis or the increases in …

MPI+ULT: Overlapping communication and computation with user-level threads

H Lu, S Seo, P Balaji - … on Cyberspace Safety and Security, and …, 2015 - ieeexplore.ieee.org
As the core density of future processors keeps increasing, MPI+Threads is becoming a
promising programming model for large scale SMP clusters. Generally speaking, hybrid …

Optimization of MPI collective operations on the IBM Blue Gene/Q supercomputer

S Kumar, A Mamidala, P Heidelberger… - … Journal of High …, 2014 - journals.sagepub.com
The Blue Gene/Q (BG/Q) machine is the latest in the line of IBM massively parallel
supercomputers, designed to scale to 262,144 nodes and 16 million threads. Each BG/Q …

Using performance models to understand scalable Krylov solver performance at scale for structured grid problems

PR Eller, T Hoefler, W Gropp - … of the ACM International Conference on …, 2019 - dl.acm.org
Krylov solvers are key kernels in many large-scale science and engineering applications for
solving sparse linear systems. Applications running at scale can experience significant …

Scalable MPI collectives using SHARP: Large scale performance evaluation on the TACC Frontera system

B Ramesh, KK Suresh, N Sarkauskas… - 2020 Workshop on …, 2020 - ieeexplore.ieee.org
The Message-Passing Interface (MPI) is the de-facto standard for designing and executing
applications on massively parallel hardware. MPI collectives provide a convenient …

Designing non-blocking personalized collectives with near perfect overlap for RDMA-enabled clusters

H Subramoni, AA Awan, K Hamidouche… - … Conference, ISC High …, 2015 - Springer
Several techniques have been proposed in the past for designing non-blocking collective
operations on high-performance clusters. While some of them required a dedicated …

Optimizing blocking and nonblocking reduction operations for multicore systems: Hierarchical design and implementation

MG Venkata, P Shamis, R Sampath… - 2013 IEEE …, 2013 - ieeexplore.ieee.org
Many scientific simulations, using the Message Passing Interface (MPI) programming model,
are sensitive to the performance and scalability of reduction collective operations such as …

Offloading collective operations to programmable logic on a Zynq cluster

O Arap, M Swany - 2016 IEEE 24th Annual Symposium on High …, 2016 - ieeexplore.ieee.org
This paper describes our architecture and implementation for offloading collective
operations to programmable logic in the communication substrate. Collective operations …

GAPS: a genetic programming system

MD Kramer, D Zhang - Proceedings 24th Annual International …, 2000 - ieeexplore.ieee.org
Genetic programming tackles the issue of how to automatically create a working computer
program for a given problem from some initial problem statement. The goal is accomplished …

Offloaded MPI persistent collectives using persistent generalized request interface

M Hatanaka, M Takagi, A Hori, Y Ishikawa - Proceedings of the 24th …, 2017 - dl.acm.org
This paper proposes a library with a persistent generalized request interface for the
implementation of persistent communication operations. This interface allows developers to …