Characterizing temperature, power, and soft-error behaviors in data center systems: Insights,...

J Perez-Cerrolaza, J Abella, L Kosmidis… - ACM Computing …, 2022 - dl.acm.org

Graphics Processing Unit (GPU) devices and their associated software programming
languages and frameworks can deliver the computing performance required to facilitate the …

Cited by 28 · Related articles · All 7 versions

A Survey on Failure Analysis and Fault Injection in AI Systems

G Yu, G Tan, H Huang, Z Zhang, P Chen… - arXiv preprint arXiv …, 2024 - arxiv.org

The rapid advancement of Artificial Intelligence (AI) has led to its integration into various
areas, especially with Large Language Models (LLMs) significantly enhancing capabilities …

Cited by 2 · Related articles · All 3 versions

Desh: deep learning for system health prediction of lead times to failure in HPC

A Das, F Mueller, C Siegel, A Vishnu - Proceedings of the 27th …, 2018 - dl.acm.org

Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …

Cited by 117 · Related articles · All 4 versions

Job characteristics on large-scale systems: long-term analysis, quantification, and implications

T Patel, Z Liu, R Kettimuthu, P Rich… - … conference for high …, 2020 - ieeexplore.ieee.org

HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …

Cited by 68 · Related articles · All 4 versions

Machine learning models for GPU error prediction in a large scale HPC system

B Nie, J Xue, S Gupta, T Patel… - 2018 48th Annual …, 2018 - ieeexplore.ieee.org

GPUs are widely deployed on large-scale HPC systems to provide powerful computational
capability for scientific applications from various domains. As those applications are …

Cited by 86 · Related articles · All 12 versions

Revealing power, energy and thermal dynamics of a 200PF pre-exascale supercomputer

W Shin, V Oles, AM Karimi, JA Ellis… - Proceedings of the …, 2021 - dl.acm.org

As we approach the exascale computing era, the focused understanding of power
consumption and its overall constraint on HPC architectures and applications are becoming …

Cited by 36 · Related articles · All 7 versions

Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice

D Jauk, D Yang, M Schulz - … of the International Conference for High …, 2019 - dl.acm.org

As we near exascale, resilience remains a major technical hurdle. Any technique with the
goal of achieving resilience suffers from having to be reactive, as failures can appear at any …

Cited by 43 · Related articles

Lifespan and failures of SSDs and HDDs: similarities, differences, and prediction models

R Pinciroli, L Yang, J Alter… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org

Data center downtime typically centers around IT equipment failure. Storage devices are the
most frequently failing components in data centers. We present a comparative study of hard …

Cited by 19 · Related articles · All 6 versions

Resilience design patterns: A structured approach to resilience at extreme scale

S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org

Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …

Cited by 53 · Related articles · All 21 versions

Radiation-tolerant deep learning processor unit (DPU)-based platform using Xilinx 20-nm Kintex UltraScale FPGA

P Maillard, YP Chen, J Vidmar, N Fraser… - … on Nuclear Science, 2022 - ieeexplore.ieee.org

This article presents a platform and design approach for enabling radiation-tolerant deep
learning acceleration on static random access memory (SRAM)-based 20-nm Kintex …

Cited by 15 · Related articles · All 3 versions