GPU devices for safety-critical systems: A survey
Graphics Processing Unit (GPU) devices and their associated software programming
languages and frameworks can deliver the computing performance required to facilitate the …
A Survey on Failure Analysis and Fault Injection in AI Systems
The rapid advancement of Artificial Intelligence (AI) has led to its integration into various
areas, especially with Large Language Models (LLMs) significantly enhancing capabilities …
Desh: deep learning for system health prediction of lead times to failure in HPC
Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …
Job characteristics on large-scale systems: long-term analysis, quantification, and implications
HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …
Machine learning models for GPU error prediction in a large scale HPC system
GPUs are widely deployed on large-scale HPC systems to provide powerful computational
capability for scientific applications from various domains. As those applications are …
Revealing power, energy and thermal dynamics of a 200PF pre-exascale supercomputer
As we approach the exascale computing era, the focused understanding of power
consumption and its overall constraint on HPC architectures and applications are becoming …
Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice
As we near exascale, resilience remains a major technical hurdle. Any technique with the
goal of achieving resilience suffers from having to be reactive, as failures can appear at any …
Lifespan and failures of SSDs and HDDs: similarities, differences, and prediction models
R Pinciroli, L Yang, J Alter… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Data center downtime typically centers around IT equipment failure. Storage devices are the
most frequently failing components in data centers. We present a comparative study of hard …
Resilience design patterns: A structured approach to resilience at extreme scale
S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …
Radiation-tolerant deep learning processor unit (DPU)-based platform using Xilinx 20-nm Kintex UltraScale FPGA
This article presents a platform and design approach for enabling radiation-tolerant deep
learning acceleration on static random access memory (SRAM)-based 20-nm Kintex …