Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor

SS Mukherjee, C Weaver, J Emer… - … . 36th Annual IEEE …, 2003 - ieeexplore.ieee.org
Single-event upsets from particle strikes have become a key challenge in microprocessor
design. Techniques to deal with these transient faults exist, but come at a cost. Designers …

SWIFT: Software implemented fault tolerance

GA Reis, J Chang, N Vachharajani… - … symposium on Code …, 2005 - ieeexplore.ieee.org
To improve performance and reduce power, processor designers employ advances that
shrink feature sizes, lower voltage levels, reduce noise margins, and increase clock rates …

Transient fault detection via simultaneous multithreading

SK Reinhardt, SS Mukherjee - Proceedings of the 27th annual …, 2000 - dl.acm.org
Smaller feature sizes, reduced voltage levels, higher transistor counts, and reduced noise
margins make future generations of microprocessors increasingly prone to transient …

[BOOK][B] Architecture design for soft errors

S Mukherjee - 2011 - books.google.com
Architecture Design for Soft Errors provides a comprehensive description of the architectural
techniques to tackle the soft error problem. It covers the new methodologies for quantitative …

Detailed design and evaluation of redundant multithreading alternatives

SS Mukherjee, M Kontz, SK Reinhardt - ACM SIGARCH Computer …, 2002 - dl.acm.org
Exponential growth in the number of on-chip transistors, coupled with reductions in voltage
levels, makes each generation of microprocessors increasingly vulnerable to transient faults …

The soft error problem: An architectural perspective

SS Mukherjee, J Emer… - … Symposium on High …, 2005 - ieeexplore.ieee.org
Radiation-induced soft errors have emerged as a key challenge in computer system design.
If the industry is to continue to provide customers with the level of reliability they expect …

Transient-fault recovery for chip multiprocessors

M Gomaa, C Scarbrough, TN Vijaykumar… - ACM SIGARCH …, 2003 - dl.acm.org
To address the increasing susceptibility of commodity chip multiprocessors (CMPs) to
transient faults, we propose Chip-level Redundantly Threaded multiprocessor with Recovery …

Argus: Low-cost, comprehensive error detection in simple cores

A Meixner, ME Bauer, D Sorin - 40th Annual IEEE/ACM …, 2007 - ieeexplore.ieee.org
We have developed Argus, a novel approach for providing low-cost, comprehensive error
detection for simple cores. The key to Argus is that the operation of a von Neumann core …