Enabling software resilience in GPGPU applications via partial thread protection

L Yang, B Nie, A Jog, E Smirni - 2021 IEEE/ACM 43rd …, 2021 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) are widely used by various applications in a broad
variety of fields to accelerate their computation but remain susceptible to transient hardware …

Mitigating silent data corruptions in HPC applications across multiple program inputs

Y Huang, S Guo, S Di, G Li… - … Conference for High …, 2022 - ieeexplore.ieee.org
With the ever-shrinking size of transistors, silent data corruptions (SDCs) are becoming a
common yet serious issue in HPC. Selective instruction duplication (SID) is a widely used …

Druto: Upper-bounding silent data corruption vulnerability in GPU applications

MH Rahman, S Di, S Guo, X Lu, G Li… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
Due to the increasing scale of high-performance computing (HPC) systems, transient
hardware faults have become a major reliability concern. Consequently, Silent Data …

PEPPA-X: finding program test inputs to bound silent data corruption vulnerability in HPC applications

MH Rahman, A Shamji, S Guo, G Li - Proceedings of the International …, 2021 - dl.acm.org
Transient hardware faults have become prevalent due to the shrinking size of transistors,
leading to silent data corruptions (SDCs). Therefore, HPC applications need to be evaluated …

Regional soft error vulnerability and error propagation analysis for GPGPU applications

I Öz, ÖF Karadaş - The Journal of Supercomputing, 2022 - Springer
The wide use of GPUs for general-purpose computations as well as graphics programs
makes soft errors a critical concern. Evaluating the soft error vulnerability of GPGPU …

Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study

V Oles, A Schmedding, G Ostrouchov, W Shin… - Proceedings of the 38th …, 2024 - dl.acm.org
GPU memory corruption and in particular double-bit errors (DBEs) remain one of the least
understood aspects of HPC system reliability. Albeit rare, their occurrences always lead to …

Data-centric reliability management in GPUs

G Kadam, E Smirni, A Jog - 2021 51st Annual IEEE/IFIP …, 2021 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) have become the default choice of acceleration in a wide
range of application domains. To keep up with computational demands, the GPU memory …

Aspis: Lightweight Neural Network Protection Against Soft Errors

A Schmedding, L Yang, A Jog… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
Convolutional neural networks (CNN) are incorporated into many image-based tasks across
a variety of domains. Some of these are safety critical tasks such as object …

Investigating the impact of high-level software design on low-level hardware fault resilience

B Zhang, L Yang, G Li, H Xu - 2023 53rd Annual IEEE/IFIP …, 2023 - ieeexplore.ieee.org
Silent Data Corruptions (SDCs) have become an insurmountable issue that threatens the
system reliability. General strategies for protecting programs from SDCs, such as dual …

GPU Reliability Assessment: Insights Across the Abstraction Layers

L Yang, G Papadimitriou, D Sartzetakis… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) are widely deployed and utilized across various
computing domains including cloud and high-performance computing. Considering its …