A survey of rollback-recovery protocols in message-passing systems

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

Necessary and sufficient conditions for consistent global snapshots

RHB Netzer, J Xu - IEEE Transactions on Parallel and …, 1995 - ieeexplore.ieee.org
Consistent global snapshots are important in many distributed applications. We prove the
exact conditions for an arbitrary checkpoint, or a set of checkpoints, to belong to a consistent …

Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations

AM Agbaria, R Friedman - Proceedings. The Eighth …, 1999 - ieeexplore.ieee.org
This paper reports on the architecture and design of Starfish, an environment for executing
dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being …

Quasi-synchronous checkpointing: Models, characterization, and classification

D Manivannan, M Singhal - IEEE Transactions on Parallel and …, 1999 - ieeexplore.ieee.org
Checkpointing algorithms are classified as synchronous and asynchronous in the literature.
In synchronous checkpointing, processes synchronize their checkpointing activities so that a …

ickp: A consistent checkpointer for multicomputers

JS Plank, K Li - … Parallel & Distributed Technology: Systems & …, 1994 - ieeexplore.ieee.org
There has been much research on checkpointing algorithms for parallel and distributed
systems, but surprisingly few implementations for uniprocessors, multiprocessors, and …

Resilience of stateful IoT applications in a dynamic fog environment

U Ozeer, X Etchevers, L Letondeur… - Proceedings of the 15th …, 2018 - dl.acm.org
Fog computing provides computing, storage and communication resources at the edge of
the network, near the physical world. Subsequently, end devices nearing the physical world …

Communication-based prevention of useless checkpoints in distributed computations

JM Hélary, A Mostefaoui, RHB Netzer, M Raynal - Distributed Computing, 2000 - Springer
A useless checkpoint is a local checkpoint that cannot be part of a consistent global
checkpoint. This paper addresses the following problem. Given a set of processes that take …

Algorithm-based diskless checkpointing for fault tolerant matrix operations

JS Plank, Y Kim, JJ Dongarra - Twenty-Fifth International …, 1995 - ieeexplore.ieee.org
The paper is an exploration of diskless checkpointing for distributed scientific computations.
With the widespread use of the "network of workstations" (NOW) platform for distributed …

Finding consistent global checkpoints in a distributed computation

D Manivannan, RHB Netzer… - IEEE Transactions on …, 1997 - ieeexplore.ieee.org
Consistent global checkpoints have many uses in distributed computations. A central
question in applications that use consistent global checkpoints is to determine whether a …

Preventing useless checkpoints in distributed computations

JM Hélary, A Mostefaoui, RHB Netzer… - Proceedings of SRDS' …, 1997 - ieeexplore.ieee.org
A useless checkpoint is a local checkpoint that cannot be part of a consistent global
checkpoint. The paper addresses the following important problem. Given a set of processes …