A survey of techniques for modeling and improving reliability of computing systems

S Mittal, JS Vetter - IEEE Transactions on Parallel and …, 2015 - ieeexplore.ieee.org
Recent trends of aggressive technology scaling have greatly exacerbated the occurrences
and impact of faults in computing systems. This has made 'reliability' a first-order design …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation

SKS Hari, T Tsai, M Stephenson… - … Analysis of Systems …, 2017 - ieeexplore.ieee.org
As GPUs become more pervasive in both scalable high-performance computing systems
and safety-critical embedded systems, evaluating and analyzing their resilience to soft errors …

BinFI: an efficient fault injector for safety-critical machine learning systems

Z Chen, G Li, K Pattabiraman… - Proceedings of the …, 2019 - dl.acm.org
As machine learning (ML) becomes pervasive in high performance computing, ML has
found its way into safety-critical domains (e.g., autonomous vehicles). Thus the reliability of …

Demystifying the system vulnerability stack: Transient fault effects across the layers

G Papadimitriou, D Gizopoulos - 2021 ACM/IEEE 48th Annual …, 2021 - ieeexplore.ieee.org
In this paper, we revisit the system vulnerability stack for transient faults. We reveal severe
pitfalls in widely used vulnerability measurement approaches, which separate the hardware …

Quantifying the accuracy of high-level fault injection techniques for hardware faults

J Wei, A Thomas, G Li… - 2014 44th Annual IEEE …, 2014 - ieeexplore.ieee.org
Hardware errors are on the rise with reducing feature sizes, however tolerating them in
hardware is expensive. Researchers have explored software-based techniques for building …

Characterizing application memory error vulnerability to optimize datacenter cost via heterogeneous-reliability memory

Y Luo, S Govindan, B Sharma… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
Memory devices represent a key component of datacenter total cost of ownership (TCO),
and techniques used to reduce errors that occur on these devices increase this cost. Existing …

A low-cost fault corrector for deep neural networks through range restriction

Z Chen, G Li, K Pattabiraman - 2021 51st Annual IEEE/IFIP …, 2021 - ieeexplore.ieee.org
The adoption of deep neural networks (DNNs) in safety-critical domains has engendered
serious reliability concerns. A prominent example is hardware transient faults that are …

Understanding and mitigating hardware failures in deep learning training systems

Y He, M Hutton, S Chan, R De Gruijl… - Proceedings of the 50th …, 2023 - dl.acm.org
Deep neural network (DNN) training workloads are increasingly susceptible to hardware
failures in datacenters. For example, Google experienced "mysterious, difficult to identify …

Modeling soft-error propagation in programs

G Li, K Pattabiraman, SKS Hari… - 2018 48th Annual …, 2018 - ieeexplore.ieee.org
As technology scales to lower feature sizes, devices become more susceptible to soft errors.
Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability …