Practical scalable consensus for pseudo-synchronous distributed systems

N Losada, P González, MJ Martín, G Bosilca… - Future Generation …, 2020 - Elsevier

The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …

Cited by 59 · Related articles · All 7 versions

[HTML] nih.gov

Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer

Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

Cited by 33 · Related articles · All 6 versions

[PDF] hal.science

Failure detection and propagation in HPC systems

G Bosilca, A Bouteiller, A Guermouche… - SC'16: Proceedings …, 2016 - ieeexplore.ieee.org

Building an infrastructure for Exascale applications requires, in addition to many other key
components, a stable and efficient failure detector. This paper describes the design and …

Cited by 43 · Related articles · All 17 versions

[PDF] nsf.gov

A failure detector for HPC platforms

G Bosilca, A Bouteiller, A Guermouche… - … Journal of High …, 2018 - journals.sagepub.com

Building an infrastructure for exascale applications requires, in addition to many other key
components, a stable and efficient failure detector. This article describes the design and …

Cited by 22 · Related articles · All 13 versions

[PDF] osti.gov

Epidemic failure detection and consensus for extreme parallelism

A Katti, G Di Fatta, T Naughton… - … International Journal of …, 2018 - journals.sagepub.com

Future extreme-scale high-performance computing systems will be required to work under
frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has …

Cited by 17 · Related articles · All 9 versions

[PDF] ethz.ch

Corrected gossip algorithms for fast reliable broadcast on unreliable systems

T Hoefler, A Barak, A Shiloh… - 2017 IEEE international …, 2017 - ieeexplore.ieee.org

Large-scale parallel programming environments and algorithms require efficient group-
communication on computing systems with failing nodes. Existing reliable broadcast …

Cited by 17 · Related articles · All 30 versions

[PDF] ieee.org

Modeling and simulating multiple failure masking enabled by local recovery for stencil-based applications at extreme scales

M Gamell, K Teranishi, J Mayo, H Kolla… - … on Parallel and …, 2017 - ieeexplore.ieee.org

Obtaining multi-process hard failure resilience at the application level is a key challenge that
must be overcome before the promise of exascale can be fully realized. Previous work has …

Cited by 16 · Related articles · All 6 versions

[PDF] springer.com

Running resilient MPI applications on a dynamic group of recommended processes

ET Camargo, EP Duarte - Journal of the Brazilian Computer Society, 2018 - Springer

High-performance computing systems run applications that can take several hours to
execute and have to deal with the occurrence of a potentially large number of faults. Most of …

Cited by 11 · Related articles · All 8 versions

[PDF] arxiv.org

Match: An MPI fault tolerance benchmark suite

L Guo, G Georgakoudis, K Parasyris… - 2020 IEEE …, 2020 - ieeexplore.ieee.org

MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate
distributed scientific applications running on tens of hundreds of processes and compute …

Cited by 11 · Related articles · All 6 versions

[PDF] univie.ac.at

Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions

M Casas, WN Gansterer… - The International Journal …, 2019 - journals.sagepub.com

We investigate the usefulness of gossip-based reduction algorithms in a high-performance
computing (HPC) context. We compare them to state-of-the-art deterministic parallel …

Cited by 10 · Related articles · All 6 versions