A survey of rollback-recovery protocols in message-passing systems

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

[BOOK][B] Distributed algorithms for message-passing systems

M Raynal - 2013 - Springer
Michel Raynal. Distributed Algorithms for Message-Passing Systems …

Uncoordinated checkpointing without domino effect for send-deterministic MPI applications

A Guermouche, T Ropars, E Brunet… - … Parallel & Distributed …, 2011 - ieeexplore.ieee.org
As reported by many recent studies, the mean time between failures of future post-petascale
supercomputers is likely to reduce, compared to the current situation. The most popular fault …

Communication-induced determination of consistent snapshots

J Helary, A Mostefaoui, M Raynal - IEEE Transactions on …, 1999 - ieeexplore.ieee.org
A classical way to determine consistent snapshots consists in using Chandy-Lamport's
algorithm. This algorithm relies on specific control messages that allow processes to …

A non-intrusive minimum process synchronous checkpointing protocol for mobile distributed systems

P Kumar, L Kumar, RK Chauhan… - 2005 IEEE International …, 2005 - ieeexplore.ieee.org
Mobile computing raises many new issues, such as lack of stable storage, low bandwidth of
wireless channels, high mobility and limited battery life. These issues make traditional …

A low-cost hybrid coordinated checkpointing protocol for mobile distributed systems

P Kumar - Mobile Information Systems, 2008 - content.iospress.com
Mobile distributed systems raise new issues such as mobility, low bandwidth of wireless
channels, disconnections, limited battery power and lack of reliable stable storage on mobile …

An index-based checkpointing algorithm for autonomous distributed systems

R Baldoni, F Quaglia, P Fornara - IEEE Transactions on …, 1999 - ieeexplore.ieee.org
This paper presents an index-based checkpointing algorithm for distributed systems with the
aim of reducing the total number of checkpoints while ensuring that each checkpoint …

Flexible rollback recovery in dynamic heterogeneous grid computing

S Jafar, A Krings, T Gautier - IEEE Transactions on Dependable …, 2008 - ieeexplore.ieee.org
Large applications executing on Grid or cluster architectures consisting of hundreds or
thousands of computational nodes create problems with respect to reliability. The source of …

Theoretical analysis for communication-induced checkpointing protocols with rollback-dependency trackability

J Tsai, SY Kuo, YM Wang - IEEE Transactions on Parallel and …, 1998 - ieeexplore.ieee.org
Rollback-Dependency Trackability (RDT) is a property that states that all rollback
dependencies between local checkpoints are on-line trackable by using a transitive …

Consistency issues in distributed checkpoints

JM Hélary, RHB Netzer… - IEEE Transactions on …, 1999 - ieeexplore.ieee.org
A global checkpoint is a set of local checkpoints, one per process. The traditional
consistency criterion for global checkpoints states that a global checkpoint is consistent if it …