A survey of rollback-recovery protocols in message-passing systems
EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …
[BOOK][B] Distributed algorithms for message-passing systems
M Raynal - 2013 - Springer
Michel Raynal. Distributed Algorithms for Message-Passing Systems …
Uncoordinated checkpointing without domino effect for send-deterministic MPI applications
A Guermouche, T Ropars, E Brunet… - … Parallel & Distributed …, 2011 - ieeexplore.ieee.org
As reported by many recent studies, the mean time between failures of future post-petascale
supercomputers is likely to reduce, compared to the current situation. The most popular fault …
Communication-induced determination of consistent snapshots
J Helary, A Mostefaoui, M Raynal - IEEE Transactions on …, 1999 - ieeexplore.ieee.org
A classical way to determine consistent snapshots consists in using Chandy-Lamport's
algorithm. This algorithm relies on specific control messages that allow processes to …
A non-intrusive minimum process synchronous checkpointing protocol for mobile distributed systems
Mobile computing raises many new issues, such as lack of stable storage, low bandwidth of
wireless channels, high mobility and limited battery life. These issues make traditional …
A low-cost hybrid coordinated checkpointing protocol for mobile distributed systems
P Kumar - Mobile Information Systems, 2008 - content.iospress.com
Mobile distributed systems raise new issues such as mobility, low bandwidth of wireless
channels, disconnections, limited battery power and lack of reliable stable storage on mobile …
An index-based checkpointing algorithm for autonomous distributed systems
This paper presents an index-based checkpointing algorithm for distributed systems with the
aim of reducing the total number of checkpoints while ensuring that each checkpoint …
Flexible rollback recovery in dynamic heterogeneous grid computing
Large applications executing on Grid or cluster architectures consisting of hundreds or
thousands of computational nodes create problems with respect to reliability. The source of …
Theoretical analysis for communication-induced checkpointing protocols with rollback-dependency trackability
J Tsai, SY Kuo, YM Wang - IEEE Transactions on Parallel and …, 1998 - ieeexplore.ieee.org
Rollback-Dependency Trackability (RDT) is a property that states that all rollback
dependencies between local checkpoints are on-line trackable by using a transitive …
Consistency issues in distributed checkpoints
JM Hélary, RHB Netzer… - IEEE Transactions on …, 1999 - ieeexplore.ieee.org
A global checkpoint is a set of local checkpoints, one per process. The traditional
consistency criterion for global checkpoints states that a global checkpoint is consistent if it …