Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Demystifying the system vulnerability stack: Transient fault effects across the layers

G Papadimitriou, D Gizopoulos - 2021 ACM/IEEE 48th Annual …, 2021 - ieeexplore.ieee.org
In this paper, we revisit the system vulnerability stack for transient faults. We reveal severe
pitfalls in widely used vulnerability measurement approaches, which separate the hardware …

Shoestring: Probabilistic soft error reliability on the cheap

S Feng, S Gupta, A Ansari, S Mahlke - ACM SIGARCH Computer …, 2010 - dl.acm.org
Aggressive technology scaling provides designers with an ever increasing budget of
cheaper and faster transistors. Unfortunately, this trend is accompanied by a decline in …

Reliable on-chip systems in the nano-era: Lessons learnt and future trends

J Henkel, L Bauer, N Dutt, P Gupta, S Nassif… - Proceedings of the 50th …, 2013 - dl.acm.org
Reliability concerns due to technology scaling have been a major focus of researchers and
designers for several technology nodes. Therefore, many new techniques for enhancing and …

Understanding the propagation of hard errors to software and implications for resilient system design

ML Li, P Ramachandran, SK Sahoo, SV Adve… - ACM SIGPLAN …, 2008 - dl.acm.org
With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to
in-the-field faults. To be broadly deployable, the hardware reliability solution must incur low …

Relax: An architectural framework for software recovery of hardware faults

M De Kruijf, S Nomura, K Sankaralingam - ACM SIGARCH Computer …, 2010 - dl.acm.org
As technology scales ever further, device unreliability is creating excessive complexity for
hardware to maintain the illusion of perfect operation. In this paper, we consider whether …

Low-cost program-level detectors for reducing silent data corruptions

SKS Hari, SV Adve, H Naeimi - IEEE/IFIP international …, 2012 - ieeexplore.ieee.org
With technology scaling, transient faults are becoming an increasing threat to hardware
reliability. Commodity systems must be made resilient to these in-field faults through very …

Design and optimization of low voltage high performance dual threshold CMOS circuits

L Wei, Z Chen, M Johnson, K Roy, V De - Proceedings of the 35th annual …, 1998 - dl.acm.org
Reduction in leakage power has become an important concern in low voltage, low power
and high performance applications. In this paper, we use dual threshold technique to reduce …

Optimizing software-directed instruction replication for GPU error detection

A Mahmoud, SKS Hari, MB Sullivan… - … Conference for High …, 2018 - ieeexplore.ieee.org
Application execution on safety-critical and high-performance computer systems must be
resilient to transient errors. As GPUs become more pervasive in such systems, they must …

nZDC: A compiler technique for near zero silent data corruption

M Didehban, A Shrivastava - Proceedings of the 53rd Annual Design …, 2016 - dl.acm.org
Exponentially growing rate of soft errors makes reliability a major concern in modern
processor design. Since software-oriented approaches offer flexible protection even in off …