The landscape of exascale research: A data-driven literature analysis
S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …
A survey on multithreading alternatives for soft error fault tolerance
Smaller transistor sizes and reduction in voltage levels in modern microprocessors induce
higher soft error rates. This trend makes reliability a primary design constraint for computer …
Resilience design patterns: A structured approach to resilience at extreme scale
S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …
Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and …
S Venkatesha, R Parthasarathi - ACM Computing Surveys, 2024 - dl.acm.org
Rapid progress in the CMOS technology for the past 25 years has increased the
vulnerability of processors towards faults. Subsequently, focus of computer architects shifted …
Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreading
Redundant multithreading (RMT) is an effective reliability solution that provides thread-level
replication; however, it imposes additional overheads in terms of performance loss or energy …
Redthreads: An interface for application-level fault detection/correction through adaptive redundant multithreading
In the presence of accelerated fault rates, which are projected to be the norm on future
exascale systems, it will become increasingly difficult for high-performance computing (HPC) …
Resilience design patterns - a structured approach to resilience at extreme scale (version 1.0)
S Hukerikar, C Engelmann - arXiv preprint arXiv:1611.02717, 2016 - arxiv.org
In this document, we develop a structured approach to the management of HPC resilience
based on the concept of resilience-based design patterns. A design pattern is a general …
CARE: Compiler-assisted recovery from soft failures
As processors continue to boost the system performance with higher circuit density,
shrinking process technology and near-threshold voltage (NTV) operations, they are …
Near-zero downtime recovery from transient-error-induced crashes
Due to the system scaling, transient errors caused by external noise, e.g., heat fluxes and
particle strikes, have become a growing concern for the current and upcoming exa-scale …
Concepts for OpenMP target offload resilience
Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak
Ridge National Laboratory (ORNL), demonstrated the need for resilience in large-scale …