There goes the neighborhood: performance degradation due to nearby jobs

A Bhatele, K Mohror, SH Langer… - Proceedings of the …, 2013 - dl.acm.org
Predictable performance is important for understanding and alleviating application
performance issues; quantifying the effects of source code, compiler, or system software …

A Slurm simulator: Implementation and parametric analysis

NA Simakov, MD Innus, MD Jones, RL DeLeon… - … , and Simulation: 8th …, 2018 - Springer
Slurm is an open-source resource manager for HPC that provides high configurability for
inhomogeneous resources and job scheduling. Various Slurm parametric settings can …

Satori: efficient and fair resource partitioning by sacrificing short-term benefits for long-term gains

RB Roy, T Patel, D Tiwari - 2021 ACM/IEEE 48th Annual …, 2021 - ieeexplore.ieee.org
Multi-core architectures have enabled data centers to increasingly co-locate multiple jobs to
improve resource utilization and lower the operational cost. Unfortunately, naively co …

Software Resource Disaggregation for HPC with Serverless Computing

M Copik, M Chrapek, L Schmid, A Calotoiu… - arXiv preprint arXiv …, 2024 - arxiv.org
Aggregated HPC resources have rigid allocation systems and programming models which
struggle to adapt to diverse and changing workloads. Consequently, HPC systems fail to …

Slurm simulator: Improving Slurm scheduler performance on large HPC systems by utilization of multiple controllers and node sharing

NA Simakov, RL DeLeon, MD Innus… - Proceedings of the …, 2018 - dl.acm.org
A Slurm simulator was used to study the potential benefits of using multiple Slurm controllers
and node-sharing on the TACC Stampede 2 system. Splitting a large cluster into smaller sub …

Enabling fair pricing on HPC systems with node sharing

AD Breslow, A Tiwari, M Schulz, L Carrington… - Proceedings of the …, 2013 - dl.acm.org
Co-location, where multiple jobs share compute nodes in large-scale HPC systems, has
been shown to increase aggregate throughput and energy efficiency by 10 to 20 …

Hybrid resource management for HPC and data intensive workloads

A Souza, M Rezaei, E Laure… - 2019 19th IEEE/ACM …, 2019 - ieeexplore.ieee.org
High Performance Computing (HPC) and Data Intensive (DI) workloads have been executed
on separate clusters using different tools for resource and application management. With …

Analyzing HPC Monitoring Data With a View Towards Efficient Resource Utilization

S Maloney, E Suarez, N Eicker… - 2024 IEEE 36th …, 2024 - ieeexplore.ieee.org
Compute nodes in modern HPC systems are growing in size and their hardware has
become ever more diverse. Still, many HPC centers allocate the resources of full nodes …

Spread-n-share: improving application performance and cluster throughput with resource-aware job placement

X Tang, H Wang, X Ma, N El-Sayed, J Zhai… - Proceedings of the …, 2019 - dl.acm.org
Traditional batch job schedulers adopt the Compact-n-Exclusive (CE) strategy, packing
processes of a parallel job into as few compute nodes as possible. While CE minimizes inter …

Improving QoS and Utilisation in modern multi-core servers with Dynamic Cache Partitioning

I Papadakis, K Nikas, V Karakostas… - Proceedings of the …, 2017 - mediatum.ub.tum.de
Co-execution of multiple workloads in modern multi-core servers may create severe
performance degradation and unpredictable execution behavior, impacting significantly their …