Fine-grained fault tolerance using device checkpoints

A Kadav, MJ Renzelmann, MM Swift - ACM SIGPLAN Notices, 2013 - dl.acm.org
Recovering from faults in drivers is harder than in other code because driver state is spread across both memory and a device. Existing driver fault-tolerance mechanisms either restart the driver and discard its state, which can break applications, or require an extensive logging mechanism to replay requests and recreate driver state. Even logging may be insufficient, though, if the semantics of requests are ambiguous. In addition, these systems either require large subsystems that must be kept up to date as the kernel changes, or require substantial rewriting of drivers.
We present a new driver fault-tolerance mechanism that provides fine-grained control over which code is protected. Fine-Grained Fault Tolerance (FGFT) isolates driver code at the granularity of a single entry point. It executes driver code as a transaction, allowing rollback if the driver fails. We develop a novel checkpointing mechanism to save and restore device state using existing power-management code. Unlike past systems, FGFT can be incrementally deployed in a single driver without the need for a large kernel subsystem, but at the cost of small modifications to the driver.
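The idea of running one entry point as a transaction over both memory state and device state can be illustrated with a toy model. The sketch below is not FGFT's implementation, which works on real kernel drivers; every name in it (`ToyDevice`, `ToyDriver`, `suspend_save`, `resume_restore`, `run_as_transaction`, `DriverFault`) is a hypothetical stand-in, and a Python exception stands in for a detected driver fault.

```python
import copy


class DriverFault(Exception):
    """Simulated failure inside a driver entry point (assumption for this sketch)."""


class ToyDevice:
    """Hypothetical device; the dict stands in for hardware registers."""

    def __init__(self):
        self.registers = {"mtu": 1500}


class ToyDriver:
    def __init__(self, device):
        self.device = device
        self.state = {"calls": 0}  # in-memory driver state

    # Power-management-style hooks reused as checkpoint/restore.
    # The names are assumptions, not a real kernel API.
    def suspend_save(self):
        return dict(self.device.registers)

    def resume_restore(self, saved):
        self.device.registers = dict(saved)

    # One entry point; it touches both memory and the device before failing.
    def set_mtu(self, mtu):
        self.device.registers["mtu"] = mtu
        self.state["calls"] += 1
        if mtu > 9000:
            raise DriverFault("unsupported MTU")
        return 0


def run_as_transaction(driver, entry_point, *args):
    """Run a single entry point; on a fault, roll back memory and device state."""
    mem_checkpoint = copy.deepcopy(driver.state)
    dev_checkpoint = driver.suspend_save()
    try:
        return entry_point(*args)
    except DriverFault:
        driver.state = mem_checkpoint          # undo in-memory changes
        driver.resume_restore(dev_checkpoint)  # undo device changes
        return -1                              # report the failure to the caller


device = ToyDevice()
driver = ToyDriver(device)
ok = run_as_transaction(driver, driver.set_mtu, 4000)
failed = run_as_transaction(driver, driver.set_mtu, 10000)
```

After the second call the device still reports the MTU set by the first call, and the caller sees an error code instead of a restarted driver. Only the wrapped entry point pays the checkpoint cost, which is the sense in which protection is fine-grained.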
In the evaluation, we show that FGFT can have almost zero runtime cost in many cases, and that checkpoint-based recovery can reduce the duration of a failure by 79% compared to restarting the driver. Finally, we show that applying FGFT to a driver requires little effort, and the majority of drivers in common classes already contain the power-management code needed for checkpoint/restore.