GPU devices for safety-critical systems: A survey

J Perez-Cerrolaza, J Abella, L Kosmidis… - ACM Computing …, 2022 - dl.acm.org
Graphics Processing Unit (GPU) devices and their associated software programming
languages and frameworks can deliver the computing performance required to facilitate the …

A Survey on Failure Analysis and Fault Injection in AI Systems

G Yu, G Tan, H Huang, Z Zhang, P Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Artificial Intelligence (AI) has led to its integration into various
areas, especially with Large Language Models (LLMs) significantly enhancing capabilities …

Desh: deep learning for system health prediction of lead times to failure in HPC

A Das, F Mueller, C Siegel, A Vishnu - Proceedings of the 27th …, 2018 - dl.acm.org
Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …

Job characteristics on large-scale systems: long-term analysis, quantification, and implications

T Patel, Z Liu, R Kettimuthu, P Rich… - … conference for high …, 2020 - ieeexplore.ieee.org
HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …

Machine learning models for GPU error prediction in a large scale HPC system

B Nie, J Xue, S Gupta, T Patel… - 2018 48th Annual …, 2018 - ieeexplore.ieee.org
GPUs are widely deployed on large-scale HPC systems to provide powerful computational
capability for scientific applications from various domains. As those applications are …

Revealing power, energy and thermal dynamics of a 200PF pre-exascale supercomputer

W Shin, V Oles, AM Karimi, JA Ellis… - Proceedings of the …, 2021 - dl.acm.org
As we approach the exascale computing era, the focused understanding of power
consumption and its overall constraint on HPC architectures and applications are becoming …

Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice

D Jauk, D Yang, M Schulz - … of the International Conference for High …, 2019 - dl.acm.org
As we near exascale, resilience remains a major technical hurdle. Any technique with the
goal of achieving resilience suffers from having to be reactive, as failures can appear at any …

Lifespan and failures of SSDs and HDDs: similarities, differences, and prediction models

R Pinciroli, L Yang, J Alter… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Data center downtime typically centers around IT equipment failure. Storage devices are the
most frequently failing components in data centers. We present a comparative study of hard …

Resilience design patterns: A structured approach to resilience at extreme scale

S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …

Radiation-tolerant deep learning processor unit (DPU)-based platform using Xilinx 20-nm Kintex UltraScale FPGA

P Maillard, YP Chen, J Vidmar, N Fraser… - … on Nuclear Science, 2022 - ieeexplore.ieee.org
This article presents a platform and design approach for enabling radiation-tolerant deep
learning acceleration on static random access memory (SRAM)-based 20-nm Kintex …