Rolex: Resilience-oriented language extensions for extreme-scale systems

S Hukerikar, RF Lucas - The Journal of Supercomputing, 2016 - Springer
Abstract
Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that will be less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for system designers and programmers, who for half a century have enjoyed an execution model that assumed correct behaviour by the underlying computing system. The mean time to failure of the system scales inversely with the number of components in the system; therefore, faults and the resulting system-level failures will increase as systems scale in the number of processor cores and memory modules used. However, not every detected error need cause a catastrophic failure. Many HPC applications are inherently fault resilient. Application programmers have this knowledge, yet they lack mechanisms to convey it to the system. In this paper, we present new Resilience Oriented Language Extensions (Rolex), which facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Our experiments show that an approach that leverages the programmer’s insight to reason about the context and significance of faults to the application outcome significantly improves the probability that an application runs to a successful conclusion.
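The inverse scaling of mean time to failure mentioned in the abstract can be made concrete with a small worked equation. This is only a sketch under assumptions the abstract does not state: the N components fail independently, each with the same exponentially distributed lifetime, and the failure of any one component fails the system. The component count and per-component MTTF below are hypothetical figures chosen for illustration, not values from the paper.

```latex
% Assumption: N independent, identical components with exponential lifetimes;
% the system fails as soon as any single component fails.
\[
  \mathrm{MTTF}_{\text{system}} = \frac{\mathrm{MTTF}_{\text{component}}}{N}
\]
% Hypothetical example: N = 100{,}000 components, each with an MTTF of 5 years
% (5 \times 8760 = 43{,}800 hours).
\[
  \mathrm{MTTF}_{\text{system}} = \frac{43{,}800\ \text{h}}{100{,}000}
  = 0.438\ \text{h} \approx 26\ \text{minutes}
\]
```

Under these assumptions, doubling the number of components halves the expected time between system-level failures, which is the trend the abstract describes.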