Toward exascale resilience: 2014 update

F Cappello, G Al, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Evaluating the viability of process replication reliability for exascale systems

K Ferreira, J Stearley, JH Laros III, R Oldfield… - Proceedings of 2011 …, 2011 - dl.acm.org
As high-end computing machines continue to grow in size, issues such as fault tolerance
and reliability limit application scalability. Current techniques to ensure progress across …

Exploring automatic, online failure recovery for scientific applications at extreme scales

M Gamell, DS Katz, H Kolla, J Chen… - SC'14: Proceedings …, 2014 - ieeexplore.ieee.org
Application resilience is a key challenge that must be addressed in order to realize the
exascale vision. Process/node failures, an important class of failures, are typically handled …

Resilience design patterns: A structured approach to resilience at extreme scale

S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …

System-level scalable checkpoint-restart for petascale computing

J Cao, K Arya, R Garg, S Matott… - 2016 IEEE 22nd …, 2016 - ieeexplore.ieee.org
Fault tolerance for the upcoming exascale generation has long been an area of active
research. One of the components of a fault tolerance strategy is checkpointing. Petascale …

Shrink or substitute: handling process failures in HPC systems using in-situ recovery

RA Ashraf, S Hukerikar… - 2018 26th Euromicro …, 2018 - ieeexplore.ieee.org
Efficient utilization of today's high-performance computing (HPC) systems with complex
software and hardware components requires that the HPC applications are designed to …

Accelerating seismic redatuming using tile low-rank approximations on NEC SX-Aurora TSUBASA

Y Hong, H Ltaief, M Ravasi, L Gatineau, DE Keyes - 2021 - repository.kaust.edu.sa
With the aim of imaging subsurface discontinuities, seismic data recorded at the surface of
the Earth must be numerically re-positioned at locations in the subsurface where reflections …

Scalable and fault tolerant orthogonalization based on randomized distributed data aggregation

WN Gansterer, G Niederbrucker, H Straková… - Journal of …, 2013 - Elsevier
The construction of distributed algorithms for matrix computations built on top of distributed
data aggregation algorithms with randomized communication schedules is investigated. For …

Resilience design patterns - a structured approach to resilience at extreme scale (version 1.0)

S Hukerikar, C Engelmann - arXiv preprint arXiv:1611.02717, 2016 - arxiv.org
In this document, we develop a structured approach to the management of HPC resilience
based on the concept of resilience-based design patterns. A design pattern is a general …

Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing

D Montezanti, E Rucci, A De Giusti, M Naiouf… - Future Generation …, 2020 - Elsevier
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that
silent undetected errors will occur several times a day, increasing the occurrence of …