Systematically inferring I/O performance variability by examining repetitive job behavior
Monitoring and analyzing I/O behaviors is critical to the efficient utilization of parallel storage
systems. Unfortunately, with increasing I/O requirements and resource contention, I/O …
A highly reliable metadata service for large-scale distributed file systems
Many massive data processing applications nowadays often need long, continuous, and
uninterrupted data accesses. Distributed file systems are used as the back-end storage to …
An unsupervised machine-learning checkpoint-restart algorithm using Gaussian mixtures for particle-in-cell simulations
We propose an unsupervised machine-learning checkpoint-restart (CR) algorithm for
particle-in-cell (PIC) algorithms using Gaussian mixtures (GM). The algorithm compresses …
Workload failure prediction for data centers
Failed workloads that consumed significant computational resources in time and space
affect the efficiency of HPC data centers significantly and thus limit the amount of scientific …
Systemic assessment of node failures in HPC production platforms
A Das, F Mueller, B Rountree - 2021 IEEE International Parallel …, 2021 - ieeexplore.ieee.org
Production HPC clusters endure failures reducing computational capability and resource
availability. Despite the presence of various failure prediction schemes for large-scale …
Operating liquid-cooled large-scale systems: Long-term monitoring, reliability analysis, and efficiency measures
The past decade has seen a rise in the use of liquid cooling due to its energy efficiency.
While many previous works have helped make progress toward improving data center …
Examining failures and repairs on supercomputers with multi-GPU compute nodes
Understanding the reliability characteristics of supercomputers has been a key focus of the
HPC and dependability communities. However, there is no current study that analyzes both …
Orchestrating fault prediction with live migration and checkpointing
Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance
Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure …
Generic and ML Workloads in an HPC Datacenter: Node Energy, Job Failures, and Node-Job Analysis
HPC datacenters offer a backbone to the modern digital society. Increasingly, they run
Machine Learning (ML) jobs next to generic, compute-intensive workloads, supporting …
How Do ML Jobs Fail in Datacenters? Analysis of a Long-Term Dataset from an HPC Cluster
Reliable job execution is important in High Performance Computing clusters. Understanding
the failure distribution and failure pattern of jobs helps HPC cluster managers design better …