
Analysis and implementation of reactive fault tolerance techniques in Hadoop: a comparative study

The Journal of Supercomputing

A Publisher Correction to this article was published on 11 February 2021

This article has been updated


Hadoop is the industry's de facto tool for Big Data computation. Its native fault tolerance procedure is slow and degrades performance. Moreover, it fails to fully account for computational overhead and storage cost. The dynamic nature and complexity of MapReduce are also important factors that affect a job's response time. To address these issues, a robust failure-handling technique is essential. In this paper, we analyze notable fault tolerance techniques to assess their impact on different performance metrics under variable datasets and variable fault injections. The results show that, in terms of response time, the byzantine technique outperforms the retrying and checkpointing techniques under a kill-one-node failure. In terms of throughput, the task-level byzantine fault tolerance technique again outperforms checkpointing and retrying under a network-disconnect failure. Overall, this comparative study highlights the strengths and weaknesses of different fault tolerance techniques and is essential in determining the best technique for a given environment.




Author information



Corresponding author

Correspondence to Babar Nazir.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Asghar, H., Nazir, B. Analysis and implementation of reactive fault tolerance techniques in Hadoop: a comparative study. J Supercomput 77, 7184–7210 (2021).
