Halfmoon: Log-optimal fault-tolerant stateful serverless computing

S Qi, X Liu, X Jin - Proceedings of the 29th Symposium on Operating …, 2023 - dl.acm.org
Serverless computing separates function execution from state management. Simple retry-
based fault tolerance might corrupt the shared state with duplicate updates. Existing …

Apus: Fast and scalable Paxos on RDMA

C Wang, J Jiang, X Chen, N Yi, H Cui - Proceedings of the 2017 …, 2017 - dl.acm.org
State machine replication (SMR) uses Paxos to enforce the same inputs for a program (e.g.,
Redis) replicated on a number of hosts, tolerating various types of failures. Unfortunately …

HovercRaft: Achieving scalability and fault-tolerance for microsecond-scale datacenter services

M Kogias, E Bugnion - … of the Fifteenth European Conference on …, 2020 - dl.acm.org
Cloud platform services must simultaneously be scalable, meet low tail latency service-level
objectives, and be resilient to a combination of software, hardware, and network failures …

State machine replication in containers managed by Kubernetes

HV Netto, LC Lung, M Correia, AF Luiz… - Journal of Systems …, 2017 - Elsevier
Computer virtualization brought fast resource provisioning to data centers and the
deployment of pay-per-use cost models. The system virtualization provided by containers …

Derecho: Fast state machine replication for cloud services

S Jha, J Behrens, T Gkountouvas, M Milano… - ACM Transactions on …, 2019 - dl.acm.org
Cloud computing services often replicate data and may require ways to coordinate
distributed actions. Here we present Derecho, a library for such tasks. The API provides …

Recent results on fault-tolerant consensus in message-passing networks

L Tseng - … Colloquium, SIROCCO 2016, Helsinki, Finland, July …, 2016 - Springer
Fault-tolerant consensus has been studied extensively in the literature, because it is one of
the important distributed primitives and has wide applications in practice. This paper surveys …

Optimal prediction of synchronization-preserving races

U Mathur, A Pavlogiannis, M Viswanathan - Proceedings of the ACM on …, 2021 - dl.acm.org
Concurrent programs are notoriously hard to write correctly, as scheduling nondeterminism
introduces subtle errors that are both hard to detect and to reproduce. The most common …

HAFT: Hardware-assisted fault tolerance

D Kuvaiskii, R Faqeh, P Bhatotia, P Felber… - Proceedings of the …, 2016 - dl.acm.org
Transient hardware faults during the execution of a program can cause data corruptions. We
present HAFT, a fault tolerance technique using hardware extensions of commodity CPUs to …

IONIA: High-Performance Replication for Modern Disk-based KV Stores

Y Xu, H Zhu, P Pandey, A Conway, R Johnson… - … USENIX Conference on …, 2024 - usenix.org
We introduce IONIA, a novel replication protocol tailored for modern SSD-based write-
optimized key-value (WO-KV) stores. Unlike existing replication approaches, IONIA carefully …

Disaggregating stateful network functions

D Bansal, G DeGrace, R Tewari, M Zygmunt… - … USENIX Symposium on …, 2023 - usenix.org
For security, isolation, metering and other purposes, public clouds today implement complex
network functions at every server. Today's implementations, in software or on FPGAs and …