Fault tolerance of MPI applications in exascale systems: The ULFM solution

N Losada, P González, MJ Martín, G Bosilca… - Future Generation …, 2020 - Elsevier
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …

Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer
Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

Failure detection and propagation in HPC systems

G Bosilca, A Bouteiller, A Guermouche… - SC'16: Proceedings …, 2016 - ieeexplore.ieee.org
Building an infrastructure for Exascale applications requires, in addition to many other key
components, a stable and efficient failure detector. This paper describes the design and …

A failure detector for HPC platforms

G Bosilca, A Bouteiller, A Guermouche… - … Journal of High …, 2018 - journals.sagepub.com
Building an infrastructure for exascale applications requires, in addition to many other key
components, a stable and efficient failure detector. This article describes the design and …

Epidemic failure detection and consensus for extreme parallelism

A Katti, G Di Fatta, T Naughton… - … International Journal of …, 2018 - journals.sagepub.com
Future extreme-scale high-performance computing systems will be required to work under
frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has …

Corrected gossip algorithms for fast reliable broadcast on unreliable systems

T Hoefler, A Barak, A Shiloh… - 2017 IEEE international …, 2017 - ieeexplore.ieee.org
Large-scale parallel programming environments and algorithms require efficient group-
communication on computing systems with failing nodes. Existing reliable broadcast …

Modeling and simulating multiple failure masking enabled by local recovery for stencil-based applications at extreme scales

M Gamell, K Teranishi, J Mayo, H Kolla… - … on Parallel and …, 2017 - ieeexplore.ieee.org
Obtaining multi-process hard failure resilience at the application level is a key challenge that
must be overcome before the promise of exascale can be fully realized. Previous work has …

Running resilient MPI applications on a dynamic group of recommended processes

ET Camargo, EP Duarte - Journal of the Brazilian Computer Society, 2018 - Springer
High-performance computing systems run applications that can take several hours to
execute and have to deal with the occurrence of a potentially large number of faults. Most of …

Match: An MPI fault tolerance benchmark suite

L Guo, G Georgakoudis, K Parasyris… - 2020 IEEE …, 2020 - ieeexplore.ieee.org
MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate
distributed scientific applications running on tens of hundreds of processes and compute …

Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions

M Casas, WN Gansterer… - The International Journal …, 2019 - journals.sagepub.com
We investigate the usefulness of gossip-based reduction algorithms in a high-performance
computing (HPC) context. We compare them to state-of-the-art deterministic parallel …