There goes the neighborhood: performance degradation due to nearby jobs
Predictable performance is important for understanding and alleviating application
performance issues; quantifying the effects of source code, compiler, or system software …
A Slurm simulator: Implementation and parametric analysis
NA Simakov, MD Innus, MD Jones, RL DeLeon… - … , and Simulation: 8th …, 2018 - Springer
Slurm is an open-source resource manager for HPC that provides high configurability for
inhomogeneous resources and job scheduling. Various Slurm parametric settings can …
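To make the "parametric settings" concrete, a minimal slurm.conf excerpt is sketched below, showing the scheduler knobs this line of work typically varies. All parameter names are real Slurm options, but the values are illustrative assumptions, not settings taken from the paper.

    # Minimal slurm.conf sketch (illustrative values, not from the paper)
    SchedulerType=sched/backfill            # backfill scheduler, the common production choice
    SchedulerParameters=bf_window=1440,bf_max_job_test=500   # planning horizon (min) and jobs considered per backfill pass
    SelectType=select/cons_tres             # trackable-resource selection; enables node sharing
    SelectTypeParameters=CR_Core_Memory     # allocate by cores and memory rather than whole nodes
    PriorityType=priority/multifactor       # weighted multifactor job priority
    PriorityWeightAge=1000                  # weight for queue wait time
    PriorityWeightFairshare=10000           # weight for fair-share usage history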
Satori: efficient and fair resource partitioning by sacrificing short-term benefits for long-term gains
Multi-core architectures have enabled data centers to increasingly co-locate multiple jobs to
improve resource utilization and lower the operational cost. Unfortunately, naively co …
Software Resource Disaggregation for HPC with Serverless Computing
Aggregated HPC resources have rigid allocation systems and programming models which
struggle to adapt to diverse and changing workloads. Consequently, HPC systems fail to …
Slurm simulator: Improving Slurm scheduler performance on large HPC systems by utilization of multiple controllers and node sharing
NA Simakov, RL DeLeon, MD Innus… - Proceedings of the …, 2018 - dl.acm.org
A Slurm simulator was used to study the potential benefits of using multiple Slurm controllers
and node-sharing on the TACC Stampede 2 system. Splitting a large cluster into smaller sub …
Enabling fair pricing on HPC systems with node sharing
Co-location, where multiple jobs share compute nodes in large-scale HPC systems, has
been shown to increase aggregate throughput and energy efficiency by 10 to 20 …
Hybrid resource management for HPC and data intensive workloads
High Performance Computing (HPC) and Data Intensive (DI) workloads have been executed
on separate clusters using different tools for resource and application management. With …
Analyzing HPC Monitoring Data With a View Towards Efficient Resource Utilization
Compute nodes in modern HPC systems are growing in size and their hardware has
become ever more diverse. Still, many HPC centers allocate the resources of full nodes …
Spread-n-share: improving application performance and cluster throughput with resource-aware job placement
Traditional batch job schedulers adopt the Compact-n-Exclusive (CE) strategy, packing
processes of a parallel job into as few compute nodes as possible. While CE minimizes inter …
Improving QoS and Utilisation in modern multi-core servers with Dynamic Cache Partitioning
Co-execution of multiple workloads in modern multi-core servers may create severe
performance degradation and unpredictable execution behavior, significantly impacting their …