[HTML] Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

[Book] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

Detection and correction of silent data corruption for large-scale high-performance computing

D Fiala, F Mueller, C Engelmann… - SC'12: Proceedings …, 2012 - ieeexplore.ieee.org
Faults have become the norm rather than the exception for high-end computing clusters.
Exacerbating this situation, some of these faults remain undetected, manifesting themselves …

Understanding the propagation of transient errors in HPC applications

RA Ashraf, R Gioiosa, G Kestor, RF DeMara… - Proceedings of the …, 2015 - dl.acm.org
Resiliency of exascale systems has quickly become an important concern for the scientific
community. Despite its importance, still much remains to be determined regarding how faults …

Evaluating the impact of SDC on the GMRES iterative solver

J Elliott, M Hoemmen, F Mueller - 2014 IEEE 28th International …, 2014 - ieeexplore.ieee.org
Increasing parallelism and transistor density, along with increasingly tighter energy and
peak power constraints, may force exposure of occasionally incorrect computation or …

ERSA: Error resilient system architecture for probabilistic applications

H Cho, L Leem, S Mitra - IEEE Transactions on Computer-Aided …, 2012 - ieeexplore.ieee.org
There is a growing concern about the increasing vulnerability of future computing systems to
errors in the underlying hardware. Traditional redundancy techniques are expensive for …

Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool

D Li, JS Vetter, W Yu - SC'12: Proceedings of the International …, 2012 - ieeexplore.ieee.org
Extreme-scale scientific applications are at a significant risk of being hit by soft errors on
supercomputers as the scale of these systems and the component density continues to …

Fault tolerant preconditioned conjugate gradient for sparse linear system solution

M Shantharam, S Srinivasmurthy… - Proceedings of the 26th …, 2012 - dl.acm.org
In scientific applications that involve dense matrices, checksum encodings have yielded
"algorithm-based fault tolerance" (ABFT) in the event of data corruption from either hard or …

Self-stabilizing iterative solvers

P Sao, R Vuduc - Proceedings of the workshop on latest advances in …, 2013 - dl.acm.org
We show how to use the idea of self-stabilization, which originates in the context of
distributed control, to make fault-tolerant iterative solvers. Generally, a self-stabilizing system …