Rolex: Resilience-oriented language extensions for extreme-scale systems

The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org

The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

Cited by 68 · Related articles · All 4 versions

A survey on multithreading alternatives for soft error fault tolerance

I Oz, S Arslan - ACM Computing Surveys (CSUR), 2019 - dl.acm.org

Smaller transistor sizes and reduction in voltage levels in modern microprocessors induce
higher soft error rates. This trend makes reliability a primary design constraint for computer …

Cited by 35 · Related articles · All 3 versions

[PDF] arxiv.org

Resilience design patterns: A structured approach to resilience at extreme scale

S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org

Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …

Cited by 53 · Related articles · All 21 versions

Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators-Trends in Quantum Computing, Heterogeneous Systems and …

S Venkatesha, R Parthasarathi - ACM Computing Surveys, 2024 - dl.acm.org

Rapid progress in the CMOS technology for the past 25 years has increased the
vulnerability of processors towards faults. Subsequently, focus of computer architects shifted …

Cited by 3 · Related articles

[PDF] upc.edu

Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreading

S Arslan, O Unsal - The Journal of Supercomputing, 2021 - Springer

Redundant multithreading (RMT) is an effective reliability solution that provides thread-level
replication; however, it imposes additional overheads in terms of performance loss or energy …

Cited by 8 · Related articles · All 7 versions

[PDF] arxiv.org

Redthreads: An interface for application-level fault detection/correction through adaptive redundant multithreading

S Hukerikar, K Teranishi, PC Diniz… - International Journal of …, 2018 - Springer

In the presence of accelerated fault rates, which are projected to be the norm on future
exascale systems, it will become increasingly difficult for high-performance computing (HPC) …

Cited by 21 · Related articles · All 10 versions

[PDF] arxiv.org

Resilience design patterns-a structured approach to resilience at extreme scale (version 1.0)

S Hukerikar, C Engelmann - arXiv preprint arXiv:1611.02717, 2016 - arxiv.org

In this document, we develop a structured approach to the management of HPC resilience
based on the concept of resilience-based design patterns. A design pattern is a general …

Cited by 19 · Related articles · All 10 versions

[PDF] researchgate.net

CARE: Compiler-assisted recovery from soft failures

C Chen, G Eisenhauer, S Pande, Q Guan - Proceedings of the …, 2019 - dl.acm.org

As processors continue to boost the system performance with higher circuit density,
shrinking process technology and near-threshold voltage (NTV) operations, they are …

Cited by 8 · Related articles · All 4 versions

[PDF] arxiv.org

Near-zero downtime recovery from transient-error-induced crashes

C Chen, G Eisenhauer, S Pande - IEEE Transactions on …, 2021 - ieeexplore.ieee.org

Due to the system scaling, transient errors caused by external noise, e.g., heat fluxes and
particle strikes, have become a growing concern for the current and upcoming exa-scale …

Cited by 2 · Related articles · All 4 versions

[PDF] osti.gov

Concepts for OpenMP target offload resilience

C Engelmann, GR Vallée, S Pophale - OpenMP: Conquering the Full …, 2019 - Springer

Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak
Ridge National Laboratory (ORNL), demonstrated the need for resilience in large-scale …

Cited by 3 · Related articles · All 8 versions