A survey of rollback-recovery protocols in message-passing systems
EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …
Necessary and sufficient conditions for consistent global snapshots
RHB Netzer, J Xu - IEEE Transactions on Parallel and …, 1995 - ieeexplore.ieee.org
Consistent global snapshots are important in many distributed applications. We prove the
exact conditions for an arbitrary checkpoint, or a set of checkpoints, to belong to a consistent …
Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations
AM Agbaria, R Friedman - Proceedings. The Eighth …, 1999 - ieeexplore.ieee.org
This paper reports on the architecture and design of Starfish, an environment for executing
dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being …
Quasi-synchronous checkpointing: Models, characterization, and classification
D Manivannan, M Singhal - IEEE Transactions on Parallel and …, 1999 - ieeexplore.ieee.org
Checkpointing algorithms are classified as synchronous and asynchronous in the literature.
In synchronous checkpointing, processes synchronize their checkpointing activities so that a …
ickp: A consistent checkpointer for multicomputers
There has been much research on checkpointing algorithms for parallel and distributed
systems; but surprisingly few implementations for uniprocessors, multiprocessors, and …
Resilience of stateful IoT applications in a dynamic fog environment
U Ozeer, X Etchevers, L Letondeur… - Proceedings of the 15th …, 2018 - dl.acm.org
Fog computing provides computing, storage and communication resources at the edge of
the network, near the physical world. Subsequently, end devices nearing the physical world …
Communication-based prevention of useless checkpoints in distributed computations
A useless checkpoint is a local checkpoint that cannot be part of a consistent global
checkpoint. This paper addresses the following problem. Given a set of processes that take …
Algorithm-based diskless checkpointing for fault tolerant matrix operations
JS Plank, Y Kim, JJ Dongarra - Twenty-Fifth International …, 1995 - ieeexplore.ieee.org
The paper is an exploration of diskless checkpointing for distributed scientific computations.
With the widespread use of the "network of workstations" (NOW) platform for distributed …
Finding consistent global checkpoints in a distributed computation
D Manivannan, RHB Netzer… - IEEE Transactions on …, 1997 - ieeexplore.ieee.org
Consistent global checkpoints have many uses in distributed computations. A central
question in applications that use consistent global checkpoints is to determine whether a …
Preventing useless checkpoints in distributed computations
JM Helary, A Mostefaoui, RHB Netzer… - Proceedings of SRDS' …, 1997 - ieeexplore.ieee.org
A useless checkpoint is a local checkpoint that cannot be part of a consistent global
checkpoint. The paper addresses the following important problem. Given a set of processes …