A survey of techniques for modeling and improving reliability of computing systems
Recent trends of aggressive technology scaling have greatly exacerbated the occurrences
and impact of faults in computing systems. This has madereliability'a first-order design …
and impact of faults in computing systems. This has madereliability'a first-order design …
Addressing failures in exascale computing
We present here a report produced by a workshop on 'Addressing failures in exascale
computing'held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …
computing'held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …
SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation
As GPUs become more pervasive in both scalable high-performance computing systems
and safety-critical embedded systems, evaluating and analyzing their resilience to soft errors …
and safety-critical embedded systems, evaluating and analyzing their resilience to soft errors …
BinFI an efficient fault injector for safety-critical machine learning systems
As machine learning (ML) becomes pervasive in high performance computing, ML has
found its way into safety-critical domains (eg, autonomous vehicles). Thus the reliability of …
found its way into safety-critical domains (eg, autonomous vehicles). Thus the reliability of …
Demystifying the system vulnerability stack: Transient fault effects across the layers
G Papadimitriou, D Gizopoulos - 2021 ACM/IEEE 48th Annual …, 2021 - ieeexplore.ieee.org
In this paper, we revisit the system vulnerability stack for transient faults. We reveal severe
pitfalls in widely used vulnerability measurement approaches, which separate the hardware …
pitfalls in widely used vulnerability measurement approaches, which separate the hardware …
Quantifying the accuracy of high-level fault injection techniques for hardware faults
Hardware errors are on the rise with reducing feature sizes, however tolerating them in
hardware is expensive. Researchers have explored software-based techniques for building …
hardware is expensive. Researchers have explored software-based techniques for building …
Characterizing application memory error vulnerability to optimize datacenter cost via heterogeneous-reliability memory
Memory devices represent a key component of datacenter total cost of ownership (TCO),
and techniques used to reduce errors that occur on these devices increase this cost. Existing …
and techniques used to reduce errors that occur on these devices increase this cost. Existing …
A low-cost fault corrector for deep neural networks through range restriction
The adoption of deep neural networks (DNNs) in safety-critical domains has engendered
serious reliability concerns. A prominent example is hardware transient faults that are …
serious reliability concerns. A prominent example is hardware transient faults that are …
Understanding and mitigating hardware failures in deep learning training systems
Y He, M Hutton, S Chan, R De Gruijl… - Proceedings of the 50th …, 2023 - dl.acm.org
Deep neural network (DNN) training workloads are increasingly susceptible to hardware
failures in datacenters. For example, Google experienced" mysterious, difficult to identify …
failures in datacenters. For example, Google experienced" mysterious, difficult to identify …
Modeling soft-error propagation in programs
As technology scales to lower feature sizes, devices become more susceptible to soft errors.
Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability …
Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability …