Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems

J Chung, I Lee, M Sullivan, JH Ryoo… - Scientific …, 2013 - content.iospress.com
This paper describes and evaluates a scalable and efficient resilience scheme based on the
concept of containment domains. Containment domains are a programming construct that …

Resilience design patterns: A structured approach to resilience at extreme scale

S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …

FIMSIM: A fault injection infrastructure for microarchitectural simulators

G Yalcin, OS Unsal, A Cristal… - 2011 IEEE 29th …, 2011 - ieeexplore.ieee.org
Fault injection is a widely used approach for experiment-based dependability evaluation.
Injecting faults to microarchitectural simulators is particularly appealing for researchers …

Resilience design patterns - a structured approach to resilience at extreme scale (version 1.0)

S Hukerikar, C Engelmann - arXiv preprint arXiv:1611.02717, 2016 - arxiv.org
In this document, we develop a structured approach to the management of HPC resilience
based on the concept of resilience-based design patterns. A design pattern is a general …

Fault tolerance for multi-threaded applications by leveraging hardware transactional memory

G Yalcin, OS Unsal, A Cristal - … of the ACM International Conference on …, 2013 - dl.acm.org
Providing fault tolerance especially to mission critical applications in order to detect transient
and permanent faults and to recover from them is one of the main necessities for processor …

FaulTM: error detection and recovery using hardware transactional memory

G Yalcin, O Unsal, A Cristal - … & Test in Europe Conference & …, 2013 - ieeexplore.ieee.org
Reliability is an essential concern for processor designers due to increasing transient and
permanent fault rates. Executing instruction streams redundantly in chip multi processors …

Rolex: Resilience-oriented language extensions for extreme-scale systems

S Hukerikar, RF Lucas - The Journal of Supercomputing, 2016 - Springer
Future exascale high-performance computing (HPC) systems will be constructed from VLSI
devices that will be less reliable than those used today, and faults will become the norm, not …

Survey of error and fault detection mechanisms

I Lee, M Basoglu, M Sullivan, DH Yoon… - University of Texas …, 2011 - lph.ece.utexas.edu
This report describes diverse error detection mechanisms that can be utilized within a
resilient system to protect applications against various types of errors and faults, both hard …

Using redundant transactions to verify the correctness of program code execution

S Gurumurthi, V Sridharan - US Patent 9,448,933, 2016 - Google Patents
In the described embodiments, a processor core (e.g., a GPU core) receives a section of
program code to be executed in a transaction from another entity in a computing device. The …

Transactional memory for dependable embedded systems

C Fetzer, P Felber - … IEEE/IFIP 41st International Conference on …, 2011 - ieeexplore.ieee.org
Transactional Memory (TM) has been touted as one of the most promising approaches to
concurrent programming for multi-core processors. By combining ease of use with high …