Fault Tolerance in MapReduce: A Survey

  • Bunjamin Memishi
  • Shadi Ibrahim
  • María S. Pérez
  • Gabriel Antoniu
Chapter
Part of the Computer Communications and Networks book series (CCN)

Abstract

MapReduce-based systems have emerged as prominent frameworks for large-scale data analysis, with fault tolerance as one of their key features. MapReduce introduced simple yet efficient mechanisms to handle different kinds of failures, including crashes, omissions, and arbitrary failures. This contribution discusses in detail the types of failures in MapReduce systems and surveys the mechanisms the framework uses to detect, handle, and recover from these failures. It also surveys state-of-the-art optimization mechanisms for improving fault tolerance in MapReduce, and in particular in its open-source implementation, Hadoop. Finally, it identifies the remaining challenges and open issues in building efficient fault tolerance mechanisms for MapReduce.

Keywords

MapReduce, Hadoop, fault tolerance, failures, detection, handling, recovery

Notes

Acknowledgments

The research leading to these results has received funding from the H2020 project reference number 642963 in the call H2020-MSCA-ITN-2014.

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Bunjamin Memishi (1, corresponding author)
  • Shadi Ibrahim (2)
  • María S. Pérez (1)
  • Gabriel Antoniu (2)

  1. OEG, E.T.S. Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, Spain
  2. Inria, Campus Universitaire de Beaulieu, Brittany, France
