Designing non-blocking allreduce with collective offload on InfiniBand clusters: A case study...

M Bayatpour, S Chakraborty, H Subramoni… - Proceedings of the …, 2017 - dl.acm.org

Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in
modern multi-/many-core processors like Intel Xeon/Xeon Phis or the increases in …

Cited by 47 · Related articles · All 4 versions

MPI+ULT: Overlapping communication and computation with user-level threads

H Lu, S Seo, P Balaji - … on Cyberspace Safety and Security, and …, 2015 - ieeexplore.ieee.org

As the core density of future processors keeps increasing, MPI+Threads is becoming a
promising programming model for large scale SMP clusters. Generally speaking, hybrid …

Cited by 40 · Related articles · All 8 versions

Optimization of MPI collective operations on the IBM Blue Gene/Q supercomputer

S Kumar, A Mamidala, P Heidelberger… - … Journal of High …, 2014 - journals.sagepub.com

The Blue Gene/Q (BG/Q) machine is the latest in the line of IBM massively parallel
supercomputers, designed to scale to 262,144 nodes and 16 million threads. Each BG/Q …

Cited by 37 · Related articles · All 4 versions

Using performance models to understand scalable Krylov solver performance at scale for structured grid problems

PR Eller, T Hoefler, W Gropp - … of the ACM International Conference on …, 2019 - dl.acm.org

Krylov solvers are key kernels in many large-scale science and engineering applications for
solving sparse linear systems. Applications running at scale can experience significant …

Cited by 14 · Related articles · All 21 versions

Scalable MPI collectives using SHARP: Large scale performance evaluation on the TACC Frontera system

B Ramesh, KK Suresh, N Sarkauskas… - 2020 Workshop on …, 2020 - ieeexplore.ieee.org

The Message-Passing Interface (MPI) is the de-facto standard for designing and executing
applications on massively parallel hardware. MPI collectives provide a convenient …

Cited by 11 · Related articles · All 3 versions

Designing non-blocking personalized collectives with near perfect overlap for RDMA-enabled clusters

H Subramoni, AA Awan, K Hamidouche… - … Conference, ISC High …, 2015 - Springer

Several techniques have been proposed in the past for designing non-blocking collective
operations on high-performance clusters. While some of them required a dedicated …

Cited by 19 · Related articles · All 2 versions

Optimizing blocking and nonblocking reduction operations for multicore systems: Hierarchical design and implementation

MG Venkata, P Shamis, R Sampath… - 2013 IEEE …, 2013 - ieeexplore.ieee.org

Many scientific simulations, using the Message Passing Interface (MPI) programming model,
are sensitive to the performance and scalability of reduction collective operations such as …

Cited by 19 · Related articles · All 6 versions

Offloading collective operations to programmable logic on a Zynq cluster

O Arap, M Swany - 2016 IEEE 24th Annual Symposium on High …, 2016 - ieeexplore.ieee.org

This paper describes our architecture and implementation for offloading collective
operations to programmable logic in the communication substrate. Collective operations …

Cited by 13 · Related articles · All 2 versions

GAPS: a genetic programming system

MD Kramer, D Zhang - Proceedings 24th Annual International …, 2000 - ieeexplore.ieee.org

Genetic programming tackles the issue of how to automatically create a working computer
program for a given problem from some initial problem statement. The goal is accomplished …

Cited by 33 · Related articles · All 6 versions

Offloaded MPI persistent collectives using persistent generalized request interface

M Hatanaka, M Takagi, A Hori, Y Ishikawa - Proceedings of the 24th …, 2017 - dl.acm.org

This paper proposes a library with a persistent generalized request interface for the
implementation of persistent communication operations. This interface allows developers to …

Cited by 11 · Related articles