Checkpointing in Parallel State-Machine Replication

Mendizabal, Odorico M.; Jalili Marandi, Parisa; Dotti, Fernando Luís; Pedone, Fernando

doi:10.1007/978-3-319-14472-6_9

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8878)

Included in the following conference series:

International Conference on Principles of Distributed Systems

Abstract

State-machine replication is a popular approach to building fault-tolerant systems, which relies on the sequential execution of commands to guarantee strong consistency. Sequential execution, however, threatens performance. Recently, several proposals have suggested parallelizing the execution model of the replicas to enhance state-machine replication’s performance. Despite their success in achieving high performance, the implications of these models for checkpointing and recovery are mostly left unaddressed. In this paper, we focus on the checkpointing problem in the context of Parallel State-Machine Replication. We propose two novel algorithms and assess them through simulation and a real implementation.

Author information

Authors and Affiliations

Pontifícia Universidade Católica do Rio Grande do Sul – PUCRS, Porto Alegre, Brazil
Odorico M. Mendizabal & Fernando Luís Dotti
Universidade Federal do Rio Grande – FURG, Rio Grande, Brazil
Odorico M. Mendizabal
University of Lugano – USI, Lugano, Switzerland
Parisa Jalili Marandi & Fernando Pedone

Editor information

Editors and Affiliations

Microsoft Research Silicon Valley, Mountain View, CA, USA
Marcos K. Aguilera
DIAG, Sapienza University of Rome, Via Ariosto 25, 00185, Roma, Italy
Leonardo Querzoni
UPMC Univ Paris 06, LIP6, Inria Paris-Rocquencourt and Sorbonne Universités, 4 place Jussieu, 75005, Paris, France
Marc Shapiro

About this paper

Cite this paper

Mendizabal, O.M., Jalili Marandi, P., Dotti, F.L., Pedone, F. (2014). Checkpointing in Parallel State-Machine Replication. In: Aguilera, M.K., Querzoni, L., Shapiro, M. (eds) Principles of Distributed Systems. OPODIS 2014. Lecture Notes in Computer Science, vol 8878. Springer, Cham. https://doi.org/10.1007/978-3-319-14472-6_9

DOI: https://doi.org/10.1007/978-3-319-14472-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14471-9
Online ISBN: 978-3-319-14472-6
eBook Packages: Computer Science, Computer Science (R0)
