Lineage stash: fault tolerance off the critical path

S Wang, J Liagouris, R Nishihara, P Moritz… - Proceedings of the 27th …, 2019 - dl.acm.org
Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019 - dl.acm.org
As cluster computing frameworks such as Spark, Dryad, Flink, and Ray are being deployed in mission-critical applications and on larger and larger clusters, their ability to tolerate failures is growing in importance. These frameworks employ two broad approaches for fault tolerance: checkpointing and lineage. Checkpointing exhibits low overhead during normal operation but high overhead during recovery, while lineage-based solutions make the opposite tradeoff.
We propose the lineage stash, a decentralized causal logging technique that significantly reduces the runtime overhead of lineage-based approaches without impacting recovery efficiency. With the lineage stash, instead of recording the task's information before the task is executed, we record it asynchronously and forward the lineage along with the task. This makes it possible to support large-scale, low-latency (millisecond-level) data processing applications with low runtime and recovery overheads. Experimental results for applications in distributed training and stream processing show that the lineage stash provides task execution latencies similar to checkpointing alone, while incurring a recovery overhead as low as traditional lineage-based approaches.
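The core idea in the abstract (keep lineage in a local buffer, send it along with the task, and write it to durable storage asynchronously) can be illustrated with a toy, single-process sketch. Everything below is an assumption made for illustration: the `Node` and `Task` classes, the `stash` dictionary, and the plain `dict` standing in for a reliable log are hypothetical names, not the paper's or Ray's actual API, and real systems would run `flush` in the background rather than calling it by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    # Lineage records not yet durable, carried with the task (hypothetical field).
    forwarded_lineage: dict = field(default_factory=dict)


class Node:
    """Toy worker node; all names are illustrative, not taken from the paper."""

    def __init__(self, name, durable_log):
        self.name = name
        self.durable_log = durable_log  # stands in for a reliable global store
        self.stash = {}                 # in-memory lineage not yet written durably

    def submit(self, task_id, deps=()):
        # Record lineage locally instead of blocking on a durable write.
        self.stash[task_id] = {"deps": tuple(deps), "origin": self.name}
        # Forward a copy of every not-yet-durable record along with the task.
        return Task(task_id, forwarded_lineage=dict(self.stash))

    def receive(self, task):
        # Keep the forwarded lineage so it survives a failure of the sender.
        self.stash.update(task.forwarded_lineage)

    def flush(self):
        # Intended to run off the critical path; called manually in this sketch.
        self.durable_log.update(self.stash)
        self.stash.clear()


log = {}
a, b = Node("A", log), Node("B", log)
t1 = a.submit("t1")
t2 = a.submit("t2", deps=["t1"])
b.receive(t2)

# If A failed at this point, before flushing, the lineage of t1 and t2
# would still be recoverable from B's stash combined with the durable log.
recoverable = {**log, **b.stash}
```

In this sketch no durable write happens between `submit` and `receive`, which is the sense in which logging stays off the critical path; the cost is that each task message grows by the size of the sender's unflushed stash.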