A survey of rollback-recovery protocols in message-passing systems

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance

JS Plank - 1997 - library.eecs.utk.edu
Checkpointing is the act of saving the state of a running program so that it may be
reconstructed later in time. It is an important basic functionality in computing systems that …

Software rejuvenation: Analysis, module and applications

Y Huang, C Kintala, N Kolettis… - Twenty-fifth international …, 1995 - ieeexplore.ieee.org
Software rejuvenation is the concept of gracefully terminating an application and
immediately restarting it at a clean internal state. In a client-server type of application where …

Rx: treating bugs as allergies---a safe method to survive software failures

F Qin, J Tucek, J Sundaresan, Y Zhou - Proceedings of the twentieth …, 2005 - dl.acm.org
Many applications demand availability. Unfortunately, software failures greatly reduce
system availability. Prior work on surviving software failures suffers from one or more of the …

Experiments on local positioning with Bluetooth

A Kotanen, M Hannikainen… - Proceedings ITCC …, 2003 - ieeexplore.ieee.org
This paper presents the design and implementation of the Bluetooth local positioning
application. Positioning is based on received power levels, which are converted to distance …

Analysis of preventive maintenance in transactions based software systems

S Garg, A Puliafito, M Telek… - IEEE transactions on …, 1998 - ieeexplore.ieee.org
Preventive maintenance of operational software systems, a novel technique for software
fault tolerance, is used specifically to counteract the phenomenon of software "aging" …

Software implemented fault tolerance: Technologies and experience

Y Huang, C Kintala - FTCS, 1993 - researchgate.net
By software implemented fault tolerance, we mean a set of software facilities to detect and
recover from faults that are not handled by the underlying hardware or operating system …

Consistent global checkpoints that contain a given set of local checkpoints

YM Wang - IEEE Transactions on Computers, 1997 - ieeexplore.ieee.org
In this paper, we consider the problem of constructing consistent global checkpoints that
contain a given set of checkpoints. We address three important issues related to this …

Software dependability in the Tandem GUARDIAN system

I Lee, RK Iyer - IEEE Transactions on Software Engineering, 1995 - ieeexplore.ieee.org
Based on extensive field failure data for Tandem's GUARDIAN operating system, the paper
discusses evaluation of the dependability of operational software. Software faults …

An empirical study of service mesh traffic management policies for microservices

MR Saleh Sedghpour, C Klein, J Tordsson - … of the 2022 ACM/SPEC on …, 2022 - dl.acm.org
A microservice architecture features hundreds or even thousands of small loosely coupled
services with multiple instances. Because microservice performance depends on many …