Factory: Master Node High-Availability for Big Data Applications and Beyond

  • Ivan Gankevich
  • Yuri Tipikin
  • Vladimir Korkhov
  • Vladimir Gaiduchok
  • Alexander Degtyarev
  • Alexander Bogdanov
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9787)


Master node fault-tolerance is a topic that is often overlooked in discussions of big data processing technologies. Although failure of a master node can take down the whole data processing pipeline, such a failure is usually considered either improbable or too difficult to handle. The aim of the study reported here is to propose a rather simple technique for dealing with master-node failures. The technique is based on temporarily delegating the master role to one of the slave nodes and transferring the updated state back to the master when one step of the computation is complete. That way the state is duplicated, and the computation can proceed to the next step regardless of a failure of the delegate or the master (but not both). We run benchmarks to show that a failure of the master is almost “invisible” to other nodes, and that a failure of the delegate results in recomputation of only one step of the data processing pipeline. We believe that the technique can be used not only in Big Data processing but also in other types of applications.
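
The delegation scheme described in the abstract can be illustrated with a toy, single-process simulation. The sketch below is not the authors' implementation: the function run_pipeline, the helper add, the failures map and the dictionary-based state are all hypothetical names chosen for illustration, and node failures are simulated by a lookup table rather than detected over a network. It assumes that at most one of the two roles fails during any given step, matching the "but not both" condition.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]
Step = Callable[[State], State]


def run_pipeline(steps: List[Step], initial_state: State,
                 failures: Optional[Dict[int, str]] = None
                 ) -> Tuple[State, int, List[str]]:
    """Simulate a step-wise pipeline with master/delegate state duplication.

    failures maps a step index to the role ("master" or "delegate") that is
    assumed to fail once while that step is running (hypothetical model).
    Returns the final state, the number of recomputed steps, and an event log.
    """
    pending = dict(failures or {})      # copy, so the caller's map is untouched
    master_state = dict(initial_state)  # copy of the state held by the master
    recomputed = 0
    log: List[str] = []
    i = 0
    while i < len(steps):
        # The master delegates its role: the delegate gets its own state copy.
        delegate_state = dict(master_state)
        failed = pending.pop(i, None)
        if failed == "delegate":
            # The delegate's copy is lost, but the master still holds the
            # state from before this step, so only this step is restarted.
            recomputed += 1
            log.append(f"step {i}: delegate failed, step restarted")
            continue
        # The delegate carries out the step on its copy.
        delegate_state = steps[i](delegate_state)
        if failed == "master":
            # The master's copy is lost; the delegate already holds the
            # updated state, so nothing has to be recomputed.
            log.append(f"step {i}: master failed, delegate takes over")
        # The updated state is transferred back to the (possibly new) master.
        master_state = delegate_state
        i += 1
    return master_state, recomputed, log


def add(n: int) -> Step:
    """Build a toy pipeline step that adds n to the running total."""
    def step(state: State) -> State:
        return {"total": state["total"] + n}
    return step


if __name__ == "__main__":
    pipeline = [add(1), add(2), add(3)]
    result = run_pipeline(pipeline, {"total": 0}, {1: "delegate", 2: "master"})
    print(result)
```

In this model a delegate failure costs exactly one repeated step, while a master failure costs none, because the surviving node always holds a usable copy of the state.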


Parallel computing, Big data processing, Distributed computing, Backup node, State transfer, Delegation, Cluster computing, Fault-tolerance



The research was carried out using computational resources of the Resource Centre “Computational Centre of Saint Petersburg State University” (T-EDGE96 HPC-0011828-001) within the framework of grants from the Russian Foundation for Basic Research (projects no. 16-07-01111, 16-07-00886, 16-07-01113) and Saint Petersburg State University (project no.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Ivan Gankevich (1)
  • Yuri Tipikin (1)
  • Vladimir Korkhov (1)
  • Vladimir Gaiduchok (1)
  • Alexander Degtyarev (1)
  • Alexander Bogdanov (1)

  1. Department of Computer Modelling and Multiprocessor Systems, Saint Petersburg State University, Saint Petersburg, Russia
