Modeling Performance of Hadoop Applications: A Journey from Queueing Networks to Stochastic Well Formed Nets

  • Danilo Ardagna
  • Simona Bernardi
  • Eugenio Gianniti
  • Soroush Karimian Aliabadi
  • Diego Perez-Palacin
  • José Ignacio Requeno
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10048)


Nowadays, many enterprises extract actionable knowledge from huge datasets as part of their core business activities. Applications span very different domains, such as fraud detection and one-to-one marketing, and encompass business analytics and decision support in both the private and public sectors. In these scenarios, the MapReduce framework holds a central place, in particular its open source implementation, Apache Hadoop. Such environments raise new challenges in job performance prediction, driven by the need to provide Service Level Agreement guarantees to end users and to avoid wasting computational resources. In this paper we provide performance analysis models to estimate MapReduce job execution times in Hadoop clusters governed by the YARN Capacity Scheduler. We propose models of increasing complexity and accuracy, ranging from queueing networks to stochastic well formed nets, that estimate job performance under a number of scenarios of interest, including unreliable resources. We evaluate the accuracy of our models on the TPC-DS industry benchmark, running experiments on Amazon EC2 and at the CINECA Italian supercomputing center. The results show that the average accuracy we can achieve is in the range 9–14%.
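To give a feel for the simplest end of this modeling spectrum, the following is a minimal sketch of exact Mean Value Analysis (MVA) for a closed, single-class queueing network. It is not the model proposed in the paper: it assumes load-independent single-server stations, and the function name `mva_closed_network`, the two-station "map" and "reduce" layout, and all numeric values are hypothetical choices made only for illustration.

```python
from typing import List, Tuple


def mva_closed_network(
    demands: List[float], think_time: float, jobs: int
) -> Tuple[float, float]:
    """Exact MVA for a closed single-class queueing network.

    demands    -- service demand (seconds) at each queueing station
    think_time -- delay (seconds) a job spends outside the stations
    jobs       -- number of jobs circulating in the network
    Returns (response_time, throughput) at the given population.
    """
    if jobs < 1:
        raise ValueError("jobs must be at least 1")

    queue_lengths = [0.0] * len(demands)  # empty network at population 0
    response_time = 0.0
    throughput = 0.0

    for n in range(1, jobs + 1):
        # Arrival theorem: an arriving job sees the queue lengths
        # of the same network with one job fewer.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue_lengths)]
        response_time = sum(residence)
        # Little's law applied to the whole network.
        throughput = n / (think_time + response_time)
        # Little's law applied to each station.
        queue_lengths = [throughput * r for r in residence]

    return response_time, throughput


# Hypothetical example: "map" and "reduce" stations with assumed demands
# of 40 s and 25 s, 10 s of think time, and 4 concurrent jobs.
r, x = mva_closed_network([40.0, 25.0], think_time=10.0, jobs=4)
print(f"estimated response time: {r:.1f} s, throughput: {x:.4f} jobs/s")
```

A model of this kind ignores the synchronization between map and reduce tasks, scheduler policies, and resource failures, which is one reason more expressive formalisms become attractive for such scenarios.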


Keywords: MapReduce, Performance models



This work has received funding from the European Union Horizon 2020 research and innovation program under grant agreement No. 644869 (DICE). Experimental data are available as open data at



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Danilo Ardagna (1)
  • Simona Bernardi (2)
  • Eugenio Gianniti (1)
  • Soroush Karimian Aliabadi (3)
  • Diego Perez-Palacin (1)
  • José Ignacio Requeno (4)

  1. Dip. di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
  2. Centro Universitario de la Defensa, Academia General Militar, Zaragoza, Spain
  3. Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
  4. Dpto. de Informática e Ingeniería de Sistemas, Universidad de Zaragoza, Zaragoza, Spain
