Rolex: Resilience-oriented language extensions for extreme-scale systems

S Hukerikar, RF Lucas - The Journal of Supercomputing, 2016 - Springer
Future exascale high-performance computing (HPC) systems will be constructed from VLSI
devices that will be less reliable than those used today, and faults will become the norm, not …

A self-correcting connected components algorithm

P Sao, O Green, C Jain, R Vuduc - … of the ACM Workshop on Fault …, 2016 - dl.acm.org
We present a new fault-tolerant algorithm for the problem of computing the connected
components of a graph. Our algorithm derives from a highly parallel but non-resilient …

Introspective resilience for exascale high-performance computing systems

S Hukerikar - 2015 - search.proquest.com
Future exascale high-performance computing (HPC) systems will be constructed using VLSI
devices with smaller feature sizes that will be far less reliable than those used today …