GPU devices for safety-critical systems: A survey

J Perez-Cerrolaza, J Abella, L Kosmidis… - ACM Computing …, 2022 - dl.acm.org
Graphics Processing Unit (GPU) devices and their associated software programming
languages and frameworks can deliver the computing performance required to facilitate the …

Memory errors in modern systems: The good, the bad, and the ugly

V Sridharan, N DeBardeleben, S Blanchard… - ACM SIGARCH …, 2015 - dl.acm.org
Several recent publications have shown that hardware faults in the memory subsystem are
commonplace. These faults are predicted to become more frequent in future systems that …

Analyzing and increasing the reliability of convolutional neural networks on GPUs

FF dos Santos, PF Pimenta, C Lunardi… - IEEE Transactions …, 2018 - ieeexplore.ieee.org
Graphics processing units (GPUs) are playing a critical role in convolutional neural networks
(CNNs) for image detection. As GPU-enabled CNNs move into safety-critical environments …

SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation

SKS Hari, T Tsai, M Stephenson… - … Analysis of Systems …, 2017 - ieeexplore.ieee.org
As GPUs become more pervasive in both scalable high-performance computing systems
and safety-critical embedded systems, evaluating and analyzing their resilience to soft errors …

A survey on multithreading alternatives for soft error fault tolerance

I Oz, S Arslan - ACM Computing Surveys (CSUR), 2019 - dl.acm.org
Smaller transistor sizes and reduction in voltage levels in modern microprocessors induce
higher soft error rates. This trend makes reliability a primary design constraint for computer …

Making convolutions resilient via algorithm-based error detection techniques

SKS Hari, MB Sullivan, T Tsai… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Convolutional Neural Networks (CNNs) are being increasingly used in safety-critical and
high-performance computing systems. As such systems require high levels of resilience to …

Achieving exascale capabilities through heterogeneous computing

MJ Schulte, M Ignatowski, GH Loh, BM Beckmann… - IEEE Micro, 2015 - ieeexplore.ieee.org
This article provides an overview of AMD's vision for exascale computing, and in particular,
how heterogeneity will play a central role in realizing this vision. Exascale computing …

Optimizing software-directed instruction replication for GPU error detection

A Mahmoud, SKS Hari, MB Sullivan… - … Conference for High …, 2018 - ieeexplore.ieee.org
Application execution on safety-critical and high-performance computer systems must be
resilient to transient errors. As GPUs become more pervasive in such systems, they must …

Design and Analysis of an APU for Exascale Computing

T Vijayaraghavan, Y Eckert, GH Loh… - … Symposium on High …, 2017 - ieeexplore.ieee.org
The challenges to push computing to exaflop levels are difficult given desired targets for
memory capacity, memory bandwidth, power efficiency, reliability, and cost. This paper …

Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units

JD Guerrero Balaguera, JE Rodriguez Condia… - Proceedings of the …, 2023 - dl.acm.org
Modern Graphics Processing Units (GPUs) demand life expectancy extended to many years,
exposing the hardware to aging (i.e., permanent faults arising after the end-of-manufacturing …