Job characteristics on large-scale systems: long-term analysis, quantification, and implications

T Patel, Z Liu, R Kettimuthu, P Rich… - … conference for high …, 2020 - ieeexplore.ieee.org
HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …

QoS-aware co-scheduling for distributed long-running applications on shared clusters

J Zhu, R Yang, X Sun, T Wo, C Hu… - … on Parallel and …, 2022 - ieeexplore.ieee.org
To achieve a high degree of resource utilization, production clusters need to co-schedule
diverse workloads, including both batch analytic jobs with short-lived tasks and long-running …

DCDB wintermute: Enabling online and holistic operational data analytics on HPC systems

A Netti, M Müller, C Guillen, M Ott, D Tafani… - Proceedings of the 29th …, 2020 - dl.acm.org
As we approach the exascale era, the size and complexity of HPC systems continue to
increase, raising concerns about their manageability and sustainability. For this reason …

Systematically inferring I/O performance variability by examining repetitive job behavior

E Costa, T Patel, B Schwaller, JM Brandt… - Proceedings of the …, 2021 - dl.acm.org
Monitoring and analyzing I/O behaviors is critical to the efficient utilization of parallel storage
systems. Unfortunately, with increasing I/O requirements and resource contention, I/O …

A conceptual framework for HPC operational data analytics

A Netti, W Shin, M Ott, T Wilde… - 2021 IEEE International …, 2021 - ieeexplore.ieee.org
This paper provides a broad framework for understanding trends in Operational Data
Analytics (ODA) for High-Performance Computing (HPC) facilities. The goal of ODA is to …

Understanding HPC application I/O behavior using system level statistics

AK Paul, O Faaland, A Moody… - 2020 IEEE 27th …, 2020 - ieeexplore.ieee.org
The processor performance of high performance computing (HPC) systems is increasing at
a much higher rate than storage performance. This imbalance leads to I/O performance …

Power profile monitoring and tracking evolution of system-wide HPC workloads

AM Karimi, NS Sattar, W Shin… - 2024 IEEE 44th …, 2024 - ieeexplore.ieee.org
The power and energy demands of HPC machines have grown significantly. Modern exascale
HPC systems require tens of megawatts of combined power for computing resources and …

AI4IO: A suite of AI-based tools for IO-aware scheduling

MR Wyatt, S Herbein, T Gamblin… - … International Journal of …, 2022 - journals.sagepub.com
Traditional workload managers do not have the capacity to consider how IO contention can
increase job runtime and even cause entire resource allocations to be wasted. Whether from …

Operational data analytics in practice: experiences from design to deployment in production HPC environments

A Netti, M Ott, C Guillen, D Tafani, M Schulz - Parallel Computing, 2022 - Elsevier
As HPC systems continue to grow in scale and complexity, efficient and manageable
operation is increasingly critical. For this reason, many centers are starting to explore the …

The mystery of the failing jobs: Insights from operational data from two university-wide computing systems

R Kumar, S Jha, A Mahgoub… - 2020 50th Annual …, 2020 - ieeexplore.ieee.org
Node downtime and failed jobs in a computing cluster translate into wasted resources and
user dissatisfaction. Therefore understanding why nodes and jobs fail in HPC clusters is …