Fault tolerance of MPI applications in exascale systems: The ULFM solution
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …
Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance
Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …
Failure detection and propagation in HPC systems
Building an infrastructure for Exascale applications requires, in addition to many other key
components, a stable and efficient failure detector. This paper describes the design and …
A failure detector for HPC platforms
Building an infrastructure for exascale applications requires, in addition to many other key
components, a stable and efficient failure detector. This article describes the design and …
Epidemic failure detection and consensus for extreme parallelism
Future extreme-scale high-performance computing systems will be required to work under
frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has …
Corrected gossip algorithms for fast reliable broadcast on unreliable systems
T Hoefler, A Barak, A Shiloh… - 2017 IEEE international …, 2017 - ieeexplore.ieee.org
Large-scale parallel programming environments and algorithms require efficient group-
communication on computing systems with failing nodes. Existing reliable broadcast …
Modeling and simulating multiple failure masking enabled by local recovery for stencil-based applications at extreme scales
Obtaining multi-process hard failure resilience at the application level is a key challenge that
must be overcome before the promise of exascale can be fully realized. Previous work has …
Running resilient MPI applications on a dynamic group of recommended processes
ET Camargo, EP Duarte - Journal of the Brazilian Computer Society, 2018 - Springer
High-performance computing systems run applications that can take several hours to
execute and have to deal with the occurrence of a potentially large number of faults. Most of …
Match: An MPI fault tolerance benchmark suite
MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate
distributed scientific applications running on tens of hundreds of processes and compute …
Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions
M Casas, WN Gansterer… - The International Journal …, 2019 - journals.sagepub.com
We investigate the usefulness of gossip-based reduction algorithms in a high-performance
computing (HPC) context. We compare them to state-of-the-art deterministic parallel …