Efficient and non-blocking agreement protocols

Abstract

Large-scale distributed databases are designed to support commercial and cloud-based applications. The minimal expectation of such systems is that they ensure consistency and reliability when nodes fail. A distributed database guarantees reliability through atomic commitment protocols, which ensure that either all the changes of a transaction are applied or none of them are. To keep the commitment process efficient, the database community has mainly used the two-phase commit (2PC) protocol. However, the 2PC protocol is blocking under multiple failures, which necessitated the development of the non-blocking three-phase commit (3PC) protocol. The database community is still reluctant to use the 3PC protocol, because its additional phase acts as a scalability bottleneck in the design of efficient transaction processing systems. In this work, we present the EasyCommit (EC) protocol, which combines the best of both worlds: it is non-blocking (like 3PC) and requires only two phases (like 2PC). EasyCommit achieves these goals by building on two key observations: (i) first transmit and then commit, and (ii) message redundancy. We present the design of the EasyCommit protocol and prove that it guarantees both safety and liveness. We also present a detailed evaluation of the EC protocol and show that it is nearly as efficient as the 2PC protocol. To cater to the needs of geographically large-scale distributed systems, we also design a topology-aware agreement protocol (Geo-scale EasyCommit) that is non-blocking, safe, and live, and that outperforms the 3PC protocol.
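
The two observations named in the abstract can be illustrated with a small in-memory sketch. It is only an illustration under stated assumptions, not the authors' implementation: the `Node` class, the `crashed` flag, and the helper functions are hypothetical names, message delivery is modeled as a direct method call, and timeouts, logging to stable storage, and recovery are left out. Each node forwards the global decision to every other node before applying it locally (first transmit, then commit), so a node can learn the decision from any peer (message redundancy).

```python
from enum import Enum


class Decision(Enum):
    COMMIT = "commit"
    ABORT = "abort"


class Node:
    """Hypothetical in-memory participant; not the authors' implementation."""

    def __init__(self, name):
        self.name = name
        self.peers = []        # every other node taking part in the transaction
        self.crashed = False   # a crashed node silently drops all messages
        self.decision = None   # global decision, once this node has learned it
        self.applied = False   # True once the decision has been applied locally
        self.trace = []        # ordered record of what this node did

    def on_global_decision(self, decision):
        # Redundant copies are expected; only the first one triggers any work.
        if self.crashed or self.decision is not None:
            return
        self.decision = decision
        # (i) First transmit: forward the decision to every other node ...
        self.trace.append("transmit")
        for peer in self.peers:
            peer.on_global_decision(decision)
        # ... and only then apply it locally.
        self.applied = True
        self.trace.append("apply")


def make_cluster(names):
    """Build fully connected nodes; the first name is treated as coordinator."""
    nodes = [Node(n) for n in names]
    for node in nodes:
        node.peers = [p for p in nodes if p is not node]
    return nodes


def run_decision(names, decision):
    """Failure-free run: the coordinator announces the global decision."""
    nodes = make_cluster(names)
    nodes[0].on_global_decision(decision)
    return nodes


if __name__ == "__main__":
    for node in run_decision(["coordinator", "p1", "p2"], Decision.COMMIT):
        print(node.name, node.decision.value, node.trace)
```

In this model, if the coordinator reaches only one participant and then crashes, that participant's forwarding step still delivers the decision to the remaining nodes, which is the intuition behind the redundancy. How the real protocol detects failures and terminates is beyond what this sketch covers.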

Notes

  1. The coordinating node is the one that initiates the commit protocol; in this work, it is also the node that receives the client request to execute a transaction.

  2. Partitioned database is the term used by the database community for shared-nothing distributed databases, and should not be confused with network partitioning.

  3. The term cohort refers to a participating node in the transaction commit process. We use these terms interchangeably.

  4. The INITIAL, READY and WAIT states are considered non-committable states.

  5. Without node failures, any transaction that reaches the prepare phase is assumed to commit successfully.

Acknowledgements

We would like to thank Thamir Qadah for the valuable discussions that helped us design the ExpoDB system. We also thank the anonymous reviewers for their useful input and comments.

Author information

Corresponding author

Correspondence to Suyash Gupta.

About this article

Cite this article

Gupta, S., Sadoghi, M. Efficient and non-blocking agreement protocols. Distrib Parallel Databases 38, 287–333 (2020). https://doi.org/10.1007/s10619-019-07267-w

Keywords

  • Agreement
  • Node failures
  • Geo-scale