GPU devices for safety-critical systems: A survey

J Perez-Cerrolaza, J Abella, L Kosmidis… - ACM Computing …, 2022 - dl.acm.org
Graphics Processing Unit (GPU) devices and their associated software programming
languages and frameworks can deliver the computing performance required to facilitate the …

Understanding and mitigating hardware failures in deep learning training systems

Y He, M Hutton, S Chan, R De Gruijl… - Proceedings of the 50th …, 2023 - dl.acm.org
Deep neural network (DNN) training workloads are increasingly susceptible to hardware
failures in datacenters. For example, Google experienced "mysterious, difficult to identify …

Data masking techniques for NoSQL database security: A systematic review

A Cuzzocrea, H Shahriar - … conference on big data (Big Data), 2017 - ieeexplore.ieee.org
This paper first presents an in-depth study of potential security vulnerabilities in MongoDB
and Cassandra, two popular NoSQL databases. We provide examples of attacks. We then …

Making convolutions resilient via algorithm-based error detection techniques

SKS Hari, MB Sullivan, T Tsai… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Convolutional Neural Networks (CNNs) are being increasingly used in safety-critical and
high-performance computing systems. As such systems require high levels of resilience to …

Optimizing Selective Protection for CNN Resilience

A Mahmoud, SKS Hari, CW Fletcher, SV Adve, C Sakr… - ISSRE, 2021 - ma3mool.github.io
As CNNs are being extensively employed in high performance and safety-critical
applications that demand high reliability, it is important to ensure that they are resilient to …

Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs

J Kosaian, KV Rashmi - Proceedings of the International Conference for …, 2021 - dl.acm.org
Neural networks (NNs) are increasingly employed in safety-critical domains and in
environments prone to unreliability (e.g., soft errors), such as on spacecraft. Therefore, it is …

Featherweight soft error resilience for GPUs

Y Zhang, C Jung - … 55th IEEE/ACM International Symposium on …, 2022 - ieeexplore.ieee.org
This paper presents Flame, a hardware/software co-designed resilience scheme for
protecting GPUs against soft errors. For low-cost yet high-performance resilience, Flame …

LLTFI: Framework agnostic fault injection for machine learning applications (tools and artifact track)

UK Agarwal, A Chan… - 2022 IEEE 33rd …, 2022 - ieeexplore.ieee.org
As machine learning (ML) has become more prevalent across many critical domains, so
has the need to understand ML applications' resilience. While prior work like TensorFI [1] …

Exploiting temporal data diversity for detecting safety-critical faults in AV compute systems

S Jha, S Cui, T Tsai, SKS Hari… - 2022 52nd Annual …, 2022 - ieeexplore.ieee.org
Silent data corruption caused by random hardware faults in autonomous vehicle (AV)
computational elements is a significant threat to vehicle safety. Previous research has …

Software-only based diverse redundancy for ASIL-D automotive applications on embedded HPC platforms

S Alcaide, L Kosmidis, C Hernandez… - 2020 IEEE International …, 2020 - ieeexplore.ieee.org
High-Performance Computing (HPC) platforms become a must in automotive systems to
enable autonomous driving. However, automotive platforms must avoid Common Cause …