Shiraz: Exploiting system reliability and application resilience characteristics to improve...

E Costa, T Patel, B Schwaller, JM Brandt… - Proceedings of the …, 2021 - dl.acm.org

Monitoring and analyzing I/O behaviors is critical to the efficient utilization of parallel storage
systems. Unfortunately, with increasing I/O requirements and resource contention, I/O …

Cited by 22 - Related articles - All 4 versions

A highly reliable metadata service for large-scale distributed file systems

J Zhou, Y Chen, W Wang, S He… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org

Many massive data processing applications nowadays often need long, continuous, and
uninterrupted data accesses. Distributed file systems are used as the back-end storage to …

Cited by 20 - Related articles - All 4 versions

An unsupervised machine-learning checkpoint-restart algorithm using Gaussian mixtures for particle-in-cell simulations

G Chen, L Chacón, TB Nguyen - Journal of Computational Physics, 2021 - Elsevier

We propose an unsupervised machine-learning checkpoint-restart (CR) algorithm for
particle-in-cell (PIC) algorithms using Gaussian mixtures (GM). The algorithm compresses …

Cited by 15 - Related articles - All 13 versions

Workload failure prediction for data centers

J Li, R Wang, G Ali, T Dang, A Sill… - 2023 IEEE 16th …, 2023 - ieeexplore.ieee.org

Failed workloads that consumed significant computational resources in time and space
affect the efficiency of HPC data centers significantly and thus limit the amount of scientific …

Cited by 3 - Related articles - All 5 versions

Systemic assessment of node failures in HPC production platforms

A Das, F Mueller, B Rountree - 2021 IEEE International Parallel …, 2021 - ieeexplore.ieee.org

Production HPC clusters endure failures reducing computational capability and resource
availability. Despite the presence of various failure prediction schemes for large-scale …

Cited by 10 - Related articles - All 5 versions

Operating liquid-cooled large-scale systems: Long-term monitoring, reliability analysis, and efficiency measures

RB Roy, T Patel, R Kettimuthu, W Allcock… - … Symposium on High …, 2021 - ieeexplore.ieee.org

The past decade has seen a rise in the use of liquid cooling due to its energy efficiency.
While many previous works have helped make progress toward improving data center …

Cited by 6 - Related articles - All 4 versions

Examining failures and repairs on supercomputers with multi-GPU compute nodes

A Taherin, T Patel, G Georgakoudis… - 2021 51st Annual …, 2021 - ieeexplore.ieee.org

Understanding the reliability characteristics of supercomputers has been a key focus of the
HPC and dependability communities. However, there is no current study that analyzes both …

Cited by 7 - Related articles - All 3 versions

Orchestrating fault prediction with live migration and checkpointing

S Behera, L Wan, F Mueller, M Wolf… - Proceedings of the 29th …, 2020 - dl.acm.org

Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance
Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure …

Cited by 8 - Related articles - All 4 versions

Generic and ML Workloads in an HPC Datacenter: Node Energy, Job Failures, and Node-Job Analysis

X Chu, D Hofstätter, S Ilager, S Talluri… - 2024 IEEE 30th …, 2024 - ieeexplore.ieee.org

HPC datacenters offer a backbone to the modern digital society. Increasingly, they run
Machine Learning (ML) jobs next to generic, compute-intensive workloads, supporting …

How Do ML Jobs Fail in Datacenters? Analysis of a Long-Term Dataset from an HPC Cluster

X Chu, S Talluri, L Versluis, A Iosup - Companion of the 2023 ACM …, 2023 - dl.acm.org

Reliable job execution is important in High Performance Computing clusters. Understanding
the failure distribution and failure pattern of jobs helps HPC cluster managers design better …

Cited by 3 - Related articles - All 5 versions