Systematically inferring I/O performance variability by examining repetitive job behavior

E Costa, T Patel, B Schwaller, JM Brandt… - Proceedings of the …, 2021 - dl.acm.org
Monitoring and analyzing I/O behaviors is critical to the efficient utilization of parallel storage
systems. Unfortunately, with increasing I/O requirements and resource contention, I/O …

A highly reliable metadata service for large-scale distributed file systems

J Zhou, Y Chen, W Wang, S He… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
Many massive data processing applications nowadays often need long, continuous, and
uninterrupted data accesses. Distributed file systems are used as the back-end storage to …

An unsupervised machine-learning checkpoint-restart algorithm using Gaussian mixtures for particle-in-cell simulations

G Chen, L Chacón, TB Nguyen - Journal of Computational Physics, 2021 - Elsevier
We propose an unsupervised machine-learning checkpoint-restart (CR) algorithm for
particle-in-cell (PIC) algorithms using Gaussian mixtures (GM). The algorithm compresses …

Workload failure prediction for data centers

J Li, R Wang, G Ali, T Dang, A Sill… - 2023 IEEE 16th …, 2023 - ieeexplore.ieee.org
Failed workloads that consumed significant computational resources in time and space
affect the efficiency of HPC data centers significantly and thus limit the amount of scientific …

Systemic assessment of node failures in HPC production platforms

A Das, F Mueller, B Rountree - 2021 IEEE International Parallel …, 2021 - ieeexplore.ieee.org
Production HPC clusters endure failures reducing computational capability and resource
availability. Despite the presence of various failure prediction schemes for large-scale …

Operating liquid-cooled large-scale systems: Long-term monitoring, reliability analysis, and efficiency measures

RB Roy, T Patel, R Kettimuthu, W Allcock… - … Symposium on High …, 2021 - ieeexplore.ieee.org
The past decade has seen a rise in the use of liquid cooling due to its energy efficiency.
While many previous works have helped make progress toward improving data center …

Examining failures and repairs on supercomputers with multi-GPU compute nodes

A Taherin, T Patel, G Georgakoudis… - 2021 51st Annual …, 2021 - ieeexplore.ieee.org
Understanding the reliability characteristics of supercomputers has been a key focus of the
HPC and dependability communities. However, there is no current study that analyzes both …

Orchestrating fault prediction with live migration and checkpointing

S Behera, L Wan, F Mueller, M Wolf… - Proceedings of the 29th …, 2020 - dl.acm.org
Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance
Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure …

Generic and ML workloads in an HPC datacenter: Node energy, job failures, and node-job analysis

X Chu, D Hofstätter, S Ilager, S Talluri… - 2024 IEEE 30th …, 2024 - ieeexplore.ieee.org
HPC datacenters offer a backbone to the modern digital society. Increasingly, they run
Machine Learning (ML) jobs next to generic, compute-intensive workloads, supporting …

How do ML jobs fail in datacenters? Analysis of a long-term dataset from an HPC cluster

X Chu, S Talluri, L Versluis, A Iosup - Companion of the 2023 ACM …, 2023 - dl.acm.org
Reliable job execution is important in High Performance Computing clusters. Understanding
the failure distribution and failure pattern of jobs helps HPC cluster managers design better …