Job characteristics on large-scale systems: long-term analysis, quantification, and implications
HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …
QoS-aware co-scheduling for distributed long-running applications on shared clusters
J Zhu, R Yang, X Sun, T Wo, C Hu… - … on Parallel and …, 2022 - ieeexplore.ieee.org
To achieve a high degree of resource utilization, production clusters need to co-schedule
diverse workloads–including both batch analytic jobs with short-lived tasks and long-running …
DCDB wintermute: Enabling online and holistic operational data analytics on HPC systems
A Netti, M Müller, C Guillen, M Ott, D Tafani… - Proceedings of the 29th …, 2020 - dl.acm.org
As we approach the exascale era, the size and complexity of HPC systems continues to
increase, raising concerns about their manageability and sustainability. For this reason …
Systematically inferring I/O performance variability by examining repetitive job behavior
Monitoring and analyzing I/O behaviors is critical to the efficient utilization of parallel storage
systems. Unfortunately, with increasing I/O requirements and resource contention, I/O …
A conceptual framework for HPC operational data analytics
This paper provides a broad framework for understanding trends in Operational Data
Analytics (ODA) for High-Performance Computing (HPC) facilities. The goal of ODA is to …
Understanding HPC application I/O behavior using system level statistics
AK Paul, O Faaland, A Moody… - 2020 IEEE 27th …, 2020 - ieeexplore.ieee.org
The processor performance of high performance computing (HPC) systems is increasing at
a much higher rate than storage performance. This imbalance leads to I/O performance …
Power Profile Monitoring and Tracking Evolution of System-Wide HPC Workloads
The power & energy demands of HPC machines have grown significantly. Modern exascale
HPC systems require tens of megawatts of combined power for computing resources and …
AI4IO: A suite of AI-based tools for IO-aware scheduling
Traditional workload managers do not have the capacity to consider how IO contention can
increase job runtime and even cause entire resource allocations to be wasted. Whether from …
Operational data analytics in practice: experiences from design to deployment in production HPC environments
As HPC systems continue to grow in scale and complexity, efficient and manageable
operation is increasingly critical. For this reason, many centers are starting to explore the …
The mystery of the failing jobs: Insights from operational data from two university-wide computing systems
Node downtime and failed jobs in a computing cluster translate into wasted resources and
user dissatisfaction. Therefore understanding why nodes and jobs fail in HPC clusters is …