- Academic resource search

Job characteristics on large-scale systems: long-term analysis, quantification, and implications

T Patel, Z Liu, R Kettimuthu, P Rich… - … conference for high …, 2020 - ieeexplore.ieee.org

HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …

Cited by 68 · Related articles · All 4 versions

[PDF] ieee.org

QoS-aware co-scheduling for distributed long-running applications on shared clusters

J Zhu, R Yang, X Sun, T Wo, C Hu… - … on Parallel and …, 2022 - ieeexplore.ieee.org

To achieve a high degree of resource utilization, production clusters need to co-schedule
diverse workloads–including both batch analytic jobs with short-lived tasks and long-running …

Cited by 17 · Related articles · All 5 versions

[PDF] arxiv.org

DCDB wintermute: Enabling online and holistic operational data analytics on HPC systems

A Netti, M Müller, C Guillen, M Ott, D Tafani… - Proceedings of the 29th …, 2020 - dl.acm.org

As we approach the exascale era, the size and complexity of HPC systems continues to
increase, raising concerns about their manageability and sustainability. For this reason …

Cited by 42 · Related articles · All 3 versions

[PDF] google.com

Systematically inferring I/O performance variability by examining repetitive job behavior

E Costa, T Patel, B Schwaller, JM Brandt… - Proceedings of the …, 2021 - dl.acm.org

Monitoring and analyzing I/O behaviors is critical to the efficient utilization of parallel storage
systems. Unfortunately, with increasing I/O requirements and resource contention, I/O …

Cited by 22 · Related articles · All 4 versions

[PDF] osti.gov

A conceptual framework for HPC operational data analytics

A Netti, W Shin, M Ott, T Wilde… - 2021 IEEE International …, 2021 - ieeexplore.ieee.org

This paper provides a broad framework for understanding trends in Operational Data
Analytics (ODA) for High-Performance Computing (HPC) facilities. The goal of ODA is to …

Cited by 27 · Related articles · All 6 versions

[PDF] osti.gov

Understanding HPC application I/O behavior using system level statistics

AK Paul, O Faaland, A Moody… - 2020 IEEE 27th …, 2020 - ieeexplore.ieee.org

The processor performance of high performance computing (HPC) systems is increasing at
a much higher rate than storage performance. This imbalance leads to I/O performance …

Cited by 35 · Related articles · All 9 versions

[PDF] osti.gov

Power profile monitoring and tracking evolution of system-wide HPC workloads

AM Karimi, NS Sattar, W Shin… - 2024 IEEE 44th …, 2024 - ieeexplore.ieee.org

The power & energy demands of HPC machines have grown significantly. Modern exascale
HPC systems require tens of megawatts of combined power for computing resources and …

Cited by 2 · Related articles · All 5 versions

[PDF] sagepub.com

AI4IO: A suite of AI-based tools for IO-aware scheduling

MR Wyatt, S Herbein, T Gamblin… - … International Journal of …, 2022 - journals.sagepub.com

Traditional workload managers do not have the capacity to consider how IO contention can
increase job runtime and even cause entire resource allocations to be wasted. Whether from …

Cited by 10 · Related articles · All 5 versions

[PDF] arxiv.org

Operational data analytics in practice: experiences from design to deployment in production HPC environments

A Netti, M Ott, C Guillen, D Tafani, M Schulz - Parallel Computing, 2022 - Elsevier

As HPC systems continue to grow in scale and complexity, efficient and manageable
operation is increasingly critical. For this reason, many centers are starting to explore the …

Cited by 12 · Related articles · All 4 versions

[PDF] purdue.edu

The mystery of the failing jobs: Insights from operational data from two university-wide computing systems

R Kumar, S Jha, A Mahgoub… - 2020 50th Annual …, 2020 - ieeexplore.ieee.org

Node downtime and failed jobs in a computing cluster translate into wasted resources and
user dissatisfaction. Therefore understanding why nodes and jobs fail in HPC clusters is …

Cited by 13 · Related articles · All 12 versions