Enabling software resilience in GPGPU applications via partial thread protection
Graphics Processing Units (GPUs) are widely used by various applications in a broad
variety of fields to accelerate their computation but remain susceptible to transient hardware …
Mitigating silent data corruptions in HPC applications across multiple program inputs
With the ever-shrinking size of transistors, silent data corruptions (SDCs) are becoming a
common yet serious issue in HPC. Selective instruction duplication (SID) is a widely used …
Druto: Upper-bounding silent data corruption vulnerability in GPU applications
Due to the increasing scale of high-performance computing (HPC) systems, transient
hardware faults have become a major reliability concern. Consequently, Silent Data …
Peppa-x: finding program test inputs to bound silent data corruption vulnerability in HPC applications
Transient hardware faults have become prevalent due to the shrinking size of transistors,
leading to silent data corruptions (SDCs). Therefore, HPC applications need to be evaluated …
Regional soft error vulnerability and error propagation analysis for GPGPU applications
I Öz, ÖF Karadaş - The Journal of Supercomputing, 2022 - Springer
The wide use of GPUs for general-purpose computations as well as graphics programs
makes soft errors a critical concern. Evaluating the soft error vulnerability of GPGPU …
Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study
GPU memory corruption and in particular double-bit errors (DBEs) remain one of the least
understood aspects of HPC system reliability. Albeit rare, their occurrences always lead to …
Data-centric reliability management in GPUs
Graphics Processing Units (GPUs) have become the default choice of acceleration in a wide
range of application domains. To keep up with computational demands, the GPU memory …
Aspis: Lightweight Neural Network Protection Against Soft Errors
A Schmedding, L Yang, A Jog… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
Convolutional neural networks (CNN) are incorporated into many image-based tasks across
a variety of domains. Some of these are safety critical tasks such as object …
Investigating the impact of high-level software design on low-level hardware fault resilience
B Zhang, L Yang, G Li, H Xu - 2023 53rd Annual IEEE/IFIP …, 2023 - ieeexplore.ieee.org
Silent Data Corruptions (SDCs) have become an insurmountable issue that threatens the
system reliability. General strategies for protecting programs from SDCs, such as dual …
GPU Reliability Assessment: Insights Across the Abstraction Layers
L Yang, G Papadimitriou, D Sartzetakis… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) are widely deployed and utilized across various
computing domains including cloud and high-performance computing. Considering its …