Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems
This paper describes and evaluates a scalable and efficient resilience scheme based on the
concept of containment domains. Containment domains are a programming construct that …
Resilience design patterns: A structured approach to resilience at extreme scale
S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …
FIMSIM: A fault injection infrastructure for microarchitectural simulators
Fault injection is a widely used approach for experiment-based dependability evaluation.
Injecting faults to microarchitectural simulators is particularly appealing for researchers …
Resilience design patterns - a structured approach to resilience at extreme scale (version 1.0)
S Hukerikar, C Engelmann - arXiv preprint arXiv:1611.02717, 2016 - arxiv.org
In this document, we develop a structured approach to the management of HPC resilience
based on the concept of resilience-based design patterns. A design pattern is a general …
Fault tolerance for multi-threaded applications by leveraging hardware transactional memory
Providing fault tolerance especially to mission critical applications in order to detect transient
and permanent faults and to recover from them is one of the main necessity for processor …
FaulTM: error detection and recovery using hardware transactional memory
Reliability is an essential concern for processor designers due to increasing transient and
permanent fault rates. Executing instruction streams redundantly in chip multi processors …
Rolex: Resilience-oriented language extensions for extreme-scale systems
S Hukerikar, RF Lucas - The Journal of Supercomputing, 2016 - Springer
Future exascale high-performance computing (HPC) systems will be constructed from VLSI
devices that will be less reliable than those used today, and faults will become the norm, not …
Survey of error and fault detection mechanisms
I Lee, M Basoglu, M Sullivan, DH Yoon… - University of Texas …, 2011 - lph.ece.utexas.edu
This report describes diverse error detection mechanisms that can be utilized within a
resilient system to protect applications against various types of errors and faults, both hard …
Using redundant transactions to verify the correctness of program code execution
S Gurumurthi, V Sridharan - US Patent 9,448,933, 2016 - Google Patents
In the described embodiments, a processor core (e.g., a GPU core) receives a section of
program code to be executed in a transaction from another entity in a computing device. The …
Transactional memory for dependable embedded systems
Transactional Memory (TM) has been touted as one of the most promising approaches to
concurrent programming for multi-core processors. By combining ease of use with high …