The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

A survey on multithreading alternatives for soft error fault tolerance

I Oz, S Arslan - ACM Computing Surveys (CSUR), 2019 - dl.acm.org
Smaller transistor sizes and reduction in voltage levels in modern microprocessors induce
higher soft error rates. This trend makes reliability a primary design constraint for computer …

Resilience design patterns: A structured approach to resilience at extreme scale

S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …

Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and …

S Venkatesha, R Parthasarathi - ACM Computing Surveys, 2024 - dl.acm.org
Rapid progress in the CMOS technology for the past 25 years has increased the
vulnerability of processors towards faults. Subsequently, focus of computer architects shifted …

Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreading

S Arslan, O Unsal - The Journal of Supercomputing, 2021 - Springer
Redundant multithreading (RMT) is an effective reliability solution that provides thread-level
replication; however, it imposes additional overheads in terms of performance loss or energy …

Redthreads: An interface for application-level fault detection/correction through adaptive redundant multithreading

S Hukerikar, K Teranishi, PC Diniz… - International Journal of …, 2018 - Springer
In the presence of accelerated fault rates, which are projected to be the norm on future
exascale systems, it will become increasingly difficult for high-performance computing (HPC) …

Resilience design patterns - a structured approach to resilience at extreme scale (version 1.0)

S Hukerikar, C Engelmann - arXiv preprint arXiv:1611.02717, 2016 - arxiv.org
In this document, we develop a structured approach to the management of HPC resilience
based on the concept of resilience-based design patterns. A design pattern is a general …

CARE: Compiler-assisted recovery from soft failures

C Chen, G Eisenhauer, S Pande, Q Guan - Proceedings of the …, 2019 - dl.acm.org
As processors continue to boost the system performance with higher circuit density,
shrinking process technology and near-threshold voltage (NTV) operations, they are …

Near-zero downtime recovery from transient-error-induced crashes

C Chen, G Eisenhauer, S Pande - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Due to the system scaling, transient errors caused by external noise, e.g., heat fluxes and
particle strikes, have become a growing concern for the current and upcoming exa-scale …

Concepts for OpenMP target offload resilience

C Engelmann, GR Vallée, S Pophale - OpenMP: Conquering the Full …, 2019 - Springer
Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak
Ridge National Laboratory (ORNL), demonstrated the need for resilience in large-scale …