Addressing failures in exascale computing
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …
Demystifying the system vulnerability stack: Transient fault effects across the layers
G Papadimitriou, D Gizopoulos - 2021 ACM/IEEE 48th Annual …, 2021 - ieeexplore.ieee.org
In this paper, we revisit the system vulnerability stack for transient faults. We reveal severe
pitfalls in widely used vulnerability measurement approaches, which separate the hardware …
Shoestring: Probabilistic soft error reliability on the cheap
Aggressive technology scaling provides designers with an ever increasing budget of
cheaper and faster transistors. Unfortunately, this trend is accompanied by a decline in …
Reliable on-chip systems in the nano-era: Lessons learnt and future trends
Reliability concerns due to technology scaling have been a major focus of researchers and
designers for several technology nodes. Therefore, many new techniques for enhancing and …
Understanding the propagation of hard errors to software and implications for resilient system design
With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to
in-the-field faults. To be broadly deployable, the hardware reliability solution must incur low …
Relax: An architectural framework for software recovery of hardware faults
M De Kruijf, S Nomura, K Sankaralingam - ACM SIGARCH Computer …, 2010 - dl.acm.org
As technology scales ever further, device unreliability is creating excessive complexity for
hardware to maintain the illusion of perfect operation. In this paper, we consider whether …
Low-cost program-level detectors for reducing silent data corruptions
With technology scaling, transient faults are becoming an increasing threat to hardware
reliability. Commodity systems must be made resilient to these in-field faults through very …
Design and optimization of low voltage high performance dual threshold CMOS circuits
Reduction in leakage power has become an important concern in low voltage, low power
and high performance applications. In this paper, we use dual threshold technique to reduce …
Optimizing software-directed instruction replication for GPU error detection
Application execution on safety-critical and high-performance computer systems must be
resilient to transient errors. As GPUs become more pervasive in such systems, they must …
nZDC: A compiler technique for near zero silent data corruption
M Didehban, A Shrivastava - Proceedings of the 53rd Annual Design …, 2016 - dl.acm.org
Exponentially growing rate of soft errors makes reliability a major concern in modern
processor design. Since software-oriented approaches offer flexible protection even in off …