Exploring automatic, online failure recovery for scientific applications at extreme scales

M Gamell, DS Katz, H Kolla, J Chen… - SC'14: Proceedings …, 2014 - ieeexplore.ieee.org
Application resilience is a key challenge that must be addressed in order to realize the
exascale vision. Process/node failures, an important class of failures, are typically handled …

Internet of Things (IoT): a survey

M Kavre, A Gadekar, Y Gadhade - 2019 IEEE Pune Section …, 2019 - ieeexplore.ieee.org
Internet of things (IoT) is considered as the next evolution of the Internet. IoT is considered
as a global network of things, having a distinct identity, and these are interconnected via a …

FlipIt: An LLVM based fault injector for HPC

J Calhoun, L Olson, M Snir - … Euro-Par 2014 International Workshops, Porto …, 2014 - Springer
High performance computing (HPC) is increasingly subjected to faulty computations. The
frequency of silent data corruptions (SDCs) in particular is expected to increase in emerging …

Quantifying the impact of single bit flips on floating point arithmetic

J Elliott, F Mueller, M Stoyanov, C Webster - 2013 - repository.lib.ncsu.edu
In high-end computing, the collective surface area, smaller fabrication sizes, and increasing
density of components have led to an increase in the number of observed bit flips. Such flips …

Fault tolerance for remote memory access programming models

M Besta, T Hoefler - Proceedings of the 23rd international symposium on …, 2014 - dl.acm.org
Remote Memory Access (RMA) is an emerging mechanism for programming high-
performance computers and datacenters. However, little work exists on resilience schemes …

SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing

T Ropars, TV Martsinkevich, A Guermouche… - Proceedings of the …, 2013 - dl.acm.org
The high failure rate expected for future supercomputers requires the design of new fault
tolerant solutions. Most checkpointing protocols are designed to work with any message …

A method to represent multiple-output switching functions by using multi-valued decision diagrams

T Sasao, JT Butler - … of 26th IEEE International Symposium on …, 1996 - ieeexplore.ieee.org
Multiple-output switching functions can be simulated by multiple-valued decision diagrams
(MDDs) at a significant reduction in computation time. We analyze the following approaches to …

Unified fault-tolerance framework for hybrid task-parallel message-passing applications

O Subasi, T Martsinkevich… - … Journal of High …, 2018 - journals.sagepub.com
We present a unified fault-tolerance framework for task-parallel message-passing
applications to mitigate transient errors. First, we propose a fault-tolerant message-logging …

System-wide trade-off modeling of performance, power, and resilience on petascale systems

L Yu, Z Zhou, Y Fan, ME Papka, Z Lan - The Journal of Supercomputing, 2018 - Springer
While performance remains a major objective in the field of high-performance computing
(HPC), future systems will have to deliver desired performance under both reliability and …

Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing

D Göddeke, M Altenbernd, D Ribbrock - Parallel Computing, 2015 - Elsevier
We analyse novel fault tolerance schemes for data loss in multigrid solvers, which
essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To …