A survey on multithreading alternatives for soft error fault tolerance

I Oz, S Arslan - ACM Computing Surveys (CSUR), 2019 - dl.acm.org
Smaller transistor sizes and reduction in voltage levels in modern microprocessors induce
higher soft error rates. This trend makes reliability a primary design constraint for computer …

Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and …

S Venkatesha, R Parthasarathi - ACM Computing Surveys, 2024 - dl.acm.org
Rapid progress in the CMOS technology for the past 25 years has increased the
vulnerability of processors towards faults. Subsequently, focus of computer architects shifted …

Resilience design patterns: A structured approach to resilience at extreme scale

S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …

Expert: Effective and flexible error protection by redundant multithreading

H So, M Didehban, Y Ko… - … Design, Automation & …, 2018 - ieeexplore.ieee.org
Resiliency is a first-order design concern in modern microprocessor design. Compiler-level
Redundant MultiThreading (RMT) schemes are promising because of their capability to …

Hybrid lockstep technique for soft error mitigation

M Peña-Fernández, A Serrano-Cases… - … on Nuclear Science, 2022 - ieeexplore.ieee.org
This work presents the evaluation of a new dual-core lockstep hybrid approach aimed to
improve the fault tolerance in microprocessors. Our approach takes advantage of modern …

EXPERTISE: An effective software-level redundant multithreading scheme against hardware faults

H So, M Didehban, Y Ko, A Shrivastava… - ACM Transactions on …, 2022 - dl.acm.org
Error resilience is the primary design concern for safety-and mission-critical applications.
Redundant MultiThreading (RMT) is one of the most promising soft and hard error resilience …

Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreading

S Arslan, O Unsal - The Journal of Supercomputing, 2021 - Springer
Redundant multithreading (RMT) is an effective reliability solution that provides thread-level
replication; however, it imposes additional overheads in terms of performance loss or energy …

Resilience design patterns - a structured approach to resilience at extreme scale (version 1.0)

S Hukerikar, C Engelmann - arXiv preprint arXiv:1611.02717, 2016 - arxiv.org
In this document, we develop a structured approach to the management of HPC resilience
based on the concept of resilience-based design patterns. A design pattern is a general …

Regional soft error vulnerability and error propagation analysis for GPGPU applications

I Öz, ÖF Karadaş - The Journal of Supercomputing, 2022 - Springer
The wide use of GPUs for general-purpose computations as well as graphics programs
makes soft errors a critical concern. Evaluating the soft error vulnerability of GPGPU …

Efficient thread‐to‐core mapping alternatives for application‐level redundant multithreading

S Arslan, O Ünsal - Concurrency and Computation: Practice …, 2023 - Wiley Online Library
Redundant multithreading (RMT) is an effective thread‐level replication method to improve
the reliability requirements of applications. Although it significantly improves the robustness …