[HTML] Toward exascale resilience: 2014 update
F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …
Predictive reliability and fault management in exascale systems: State of the art and perspectives
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …
[Book] Fault tolerance techniques for high-performance computing
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …
Detection and correction of silent data corruption for large-scale high-performance computing
Faults have become the norm rather than the exception for high-end computing clusters.
Exacerbating this situation, some of these faults remain undetected, manifesting themselves …
Understanding the propagation of transient errors in HPC applications
Resiliency of exascale systems has quickly become an important concern for the scientific
community. Despite its importance, still much remains to be determined regarding how faults …
Evaluating the impact of SDC on the GMRES iterative solver
Increasing parallelism and transistor density, along with increasingly tighter energy and
peak power constraints, may force exposure of occasionally incorrect computation or …
ERSA: Error resilient system architecture for probabilistic applications
There is a growing concern about the increasing vulnerability of future computing systems to
errors in the underlying hardware. Traditional redundancy techniques are expensive for …
Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool
Extreme-scale scientific applications are at a significant risk of being hit by soft errors on
supercomputers as the scale of these systems and the component density continues to …
Fault tolerant preconditioned conjugate gradient for sparse linear system solution
M Shantharam, S Srinivasmurthy… - Proceedings of the 26th …, 2012 - dl.acm.org
In scientific applications that involve dense matrices, checksum encodings have yielded
"algorithm-based fault tolerance" (ABFT) in the event of data corruption from either hard or …
Self-stabilizing iterative solvers
We show how to use the idea of self-stabilization, which originates in the context of
distributed control, to make fault-tolerant iterative solvers. Generally, a self-stabilizing system …