Exploring the design tradeoffs for extreme-scale high-performance computing system software

K Wang, A Kulkarni, M Lang, D Arnold… - IEEE Transactions on …, 2015 - ieeexplore.ieee.org
IEEE Transactions on Parallel and Distributed Systems, 2015 - ieeexplore.ieee.org
Owing to the extreme parallelism and high component failure rates of tomorrow's exascale systems, high-performance computing (HPC) system software will need to be scalable, failure-resistant, and adaptive to sustain system operation and full system utilization. Much existing HPC system software is still designed around a centralized server paradigm and is hence susceptible to scaling issues and single points of failure. In this article, we explore the design tradeoffs for scalable system software at extreme scales. We propose a general system software taxonomy by deconstructing common HPC system software into its basic components. The taxonomy helps us reason about system software in three ways: (1) it gives us a systematic way to architect scalable system software by decomposing it into its basic components; (2) it allows us to categorize system software based on the features of these components; and (3) it suggests the configuration space to consider for design evaluation via simulations or real implementations. Further, we evaluate different design choices for a representative piece of system software, namely a key-value store, through simulations of up to millions of nodes. Finally, we show evaluation results for two distributed system software packages, Slurm++ (a distributed HPC resource manager) and MATRIX (a distributed task execution framework), both developed based on insights from this work. We envision that the results in this article will help lay the foundations for developing next-generation HPC system software for extreme scales.