Skip to main content

An Eye on the Elephant in the Wild: A Performance Evaluation of Hadoop’s Schedulers Under Failures

  • Conference paper
  • First Online:
Adaptive Resource Management and Scheduling for Cloud Computing (ARMS-CC 2015)


Large-scale data analysis has increasingly come to rely on MapReduce and its open-source implementation Hadoop. Recently, Hadoop has not only been used for running single batch jobs but it has also been optimized to simultaneously support the execution of multiple jobs belonging to multiple concurrent users. Several schedulers (i.e., Fifo, Fair, and Capacity schedulers) have been proposed to optimize locality executions of tasks but do not consider failures, although, evidence in the literature shows that faults do occur and can probably result in performance problems.

In this paper, we have designed a set of experiments to evaluate the performance of Hadoop under failure when applying several schedulers (i.e., explore the conflict between job scheduling, exposing locality executions, and failures). Our results reveal several drawbacks of current Hadoop’s mechanism in prioritizing failed tasks. By trying to launch failed tasks as soon as possible regardless of locality, it significantly increases the execution time of jobs with failed tasks, due to two reasons: (1) available resources might not be freed up as quickly as expected and (2) failed tasks might be re-executed on machines with no data on it, introducing extra cost for data transferring through network, which is normally the most scarce resource in today’s data-centers. Our preliminary study with Hadoop not only helps us to understand the interplay between fault-tolerance and job scheduling, but also offers useful insights into optimizing the current schedulers to be more efficient in case of failures.

This work was done while Tran Anh Phuong was an intern at Inria Rennes.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or Ebook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
USD 34.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 44.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others


  1. 1.


  1. Amazon Elastic MapReduce. Accessed May 2015

  2. Apache Hadoop Welcome page. Accessed May 2015

  3. Grid’5000 Home page. Accessed May 2015

  4. Size matters: yahoo claims 2-petabyte database is world’s biggest, busiest. Accessed May 2015

  5. Ahmad, F., Lee, S., Thottethodi, M., Vijaykumar, T.N.: Puma: purdue mapreduce benchmarks suite. ECE Technical reports. Paper 437 (2012)

    Google Scholar 

  6. Bicer, T., Jiang, W., Agrawal, G.: Supporting fault tolerance in a data-intensive computing middleware. In: 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2010), pp. 1–12. IEEE (2010)

    Google Scholar 

  7. Borthakur, D.: Facebook has the world’s largest Hadoop cluster! Accessed May 2015

  8. Dean, J.: Large-scale distributed systems at google: current systems and future directions. In: Keynote speech at the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS) (2009)

    Google Scholar 

  9. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. In: Proceedings of the 6th USENIX Conference on Symposium on Opearting Systems Design & Implementation (OSDI 2004), San Francisco, CA, USA, pp. 137–150 (2004)

    Google Scholar 

  10. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

    Article  Google Scholar 

  11. Dinu, F., Eugene Ng, T.S.: Understanding the effects and implications of compute node related failures in hadoop. In: Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2012, pp. 187–198. ACM, New York (2012)

    Google Scholar 

  12. Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I.: Above the clouds: a berkeley view of cloud computing. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Rep. UCB/EECS, 28:13 (2009)

    Google Scholar 

  13. Gottfrid, D.: Self-service, prorated supercomputing fun! Accessed May 2015

  14. Huang, D., Shi, X., Ibrahim, S., Lu, L., Liu, H., Wu, S., Jin, H.: Mr-scope: a real-time tracing tool for mapreduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, Illinois, pp. 849–855 (2010)

    Google Scholar 

  15. Ibrahim, S., He, B., Jin, H.: Towards pay-as-you-consume cloud computing. In: Proceedings of the 2011 IEEE International Conference on Services Computing (SCC 2011), Washington, DC, USA, pp. 370–377 (2011)

    Google Scholar 

  16. Ibrahim, S., Jin, H., Lu, L., He, B., Antoniu, G., Song, W.: Maestro: replica-aware map scheduling for mapreduce. In: Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), Ottawa, Canada, pp. 59–72 (2012)

    Google Scholar 

  17. Ibrahim, S., Jin, H., Lu, L., Qi, L., Wu, S., Shi, X.: Evaluating mapreduce on virtual machines: the hadoop case. In: Jaatun, M.G., Zhao, G., Rong, C. (eds.) CloudCom 2009. LNCS, vol. 5931, pp. 519–528. Springer, Heidelberg (2009)

    Chapter  Google Scholar 

  18. Jin, H., Ibrahim, S., Qi, L., Cao, H., Wu, S., Shi, X.: The mapreduce programming model and implementations. In: Buyya, R., Broberg, J., Goscinski, A.M. (eds.) Cloud Computing: Principles and Paradigms, pp. 373–390. John Wiley & Sons, USA (2011)

    Chapter  Google Scholar 

  19. Jindal, A., Quiané-Ruiz, J.-A., Dittrich, J.: Trojan data layouts: right shoes for a running elephant. In: The 2nd ACM Symposium on Cloud Computing, SOCC 2011, pp. 21:1–21:14. ACM, New York (2011)

    Google Scholar 

  20. Ko, S.Y., Hoque, I., Cho, B., Gupta, I.: Making cloud intermediate data fault-tolerant. In: The 1st ACM Symposium on Cloud computing (SOCC 2010), pp. 181–192. ACM (2010)

    Google Scholar 

  21. Lai, E.: Companies are spending a lot on Big Data. Accessed May 2015

  22. Logothetis, D., Olston, C., Reed, B., Webb, K.C., Yocum, K.: Stateful bulk processing for incremental analytics. In: The 1st ACM Symposium on Cloud Computing (SOCC 2010), pp. 51–62. ACM (2010)

    Google Scholar 

  23. Schad, J., Dittrich, J., Quiané-Ruiz, J.-A.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. PVLDB 3(1), 460–471 (2010)

    Google Scholar 

  24. Thirumala Rao, B., Sridevi, N.V., Krishna Reddy, V., Reddy, L.S.S.: Performance issues of heterogeneous hadoop clusters in cloud computing. ArXiv e-prints, July 2012

    Google Scholar 

  25. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive-a petabyte scale data warehouse using hadoop. In: IEEE 26th International Conference on Data Engineering (ICDE 2010), pp. 996–1005. IEEE (2010)

    Google Scholar 

  26. Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., Stoica, I.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: Proceedings of the 5th European Conference on Computer Systems (EuroSys 2010), pp. 265–278. ACM (2010)

    Google Scholar 

  27. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving mapreduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI 2008), San Diego, California, pp. 29–42 (2008)

    Google Scholar 

  28. Zhu H., Chen, H.: Adaptive failure detection via heartbeat under hadoop. In: 2011 IEEE Asia-Pacific Services Computing Conference (APSCC), pp. 231–238. IEEE (2011)

    Google Scholar 

Download references


Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see

Author information

Authors and Affiliations


Corresponding author

Correspondence to Shadi Ibrahim .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Ibrahim, S., Phuong, T.A., Antoniu, G. (2015). An Eye on the Elephant in the Wild: A Performance Evaluation of Hadoop’s Schedulers Under Failures. In: Pop, F., Potop-Butucaru, M. (eds) Adaptive Resource Management and Scheduling for Cloud Computing. ARMS-CC 2015. Lecture Notes in Computer Science(), vol 9438. Springer, Cham.

Download citation

  • DOI:

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28447-7

  • Online ISBN: 978-3-319-28448-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics