GPU devices for safety-critical systems: A survey
Graphics Processing Unit (GPU) devices and their associated software programming
languages and frameworks can deliver the computing performance required to facilitate the …
Memory errors in modern systems: The good, the bad, and the ugly
Several recent publications have shown that hardware faults in the memory subsystem are
commonplace. These faults are predicted to become more frequent in future systems that …
Analyzing and increasing the reliability of convolutional neural networks on GPUs
FF dos Santos, PF Pimenta, C Lunardi… - IEEE Transactions …, 2018 - ieeexplore.ieee.org
Graphics processing units (GPUs) are playing a critical role in convolutional neural networks
(CNNs) for image detection. As GPU-enabled CNNs move into safety-critical environments …
SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation
As GPUs become more pervasive in both scalable high-performance computing systems
and safety-critical embedded systems, evaluating and analyzing their resilience to soft errors …
A survey on multithreading alternatives for soft error fault tolerance
Smaller transistor sizes and reduction in voltage levels in modern microprocessors induce
higher soft error rates. This trend makes reliability a primary design constraint for computer …
Making convolutions resilient via algorithm-based error detection techniques
Convolutional Neural Networks (CNNs) are being increasingly used in safety-critical and
high-performance computing systems. As such systems require high levels of resilience to …
Achieving exascale capabilities through heterogeneous computing
MJ Schulte, M Ignatowski, GH Loh, BM Beckmann… - IEEE Micro, 2015 - ieeexplore.ieee.org
This article provides an overview of AMD's vision for exascale computing, and in particular,
how heterogeneity will play a central role in realizing this vision. Exascale computing …
Optimizing software-directed instruction replication for GPU error detection
Application execution on safety-critical and high-performance computer systems must be
resilient to transient errors. As GPUs become more pervasive in such systems, they must …
Design and Analysis of an APU for Exascale Computing
T Vijayaraghavan, Y Eckert, GH Loh… - … Symposium on High …, 2017 - ieeexplore.ieee.org
The challenges to push computing to exaflop levels are difficult given desired targets for
memory capacity, memory bandwidth, power efficiency, reliability, and cost. This paper …
Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units
JD Guerrero Balaguera, JE Rodriguez Condia… - Proceedings of the …, 2023 - dl.acm.org
Modern Graphics Processing Units (GPUs) demand life expectancy extended to many years,
exposing the hardware to aging (i.e., permanent faults arising after the end-of-manufacturing …