Don’t Tune Twice: Reusing Tuning Setups for SQL-on-Hadoop Queries

  • Edson Ramiro Lucas Filho
  • Eduardo Cunha de Almeida
  • Stefanie Scherzinger
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11788)


SQL-on-Hadoop processing engines have become state-of-the-art in data lake analysis. However, the skills required to tune such systems are rare. This has inspired automated tuning advisors that profile the query workload and produce tuning setups for the low-level MapReduce jobs. Yet with highly dynamic query workloads, repeated re-tuning costs time and money in IaaS environments. In this paper, we focus on reducing the costs for up-front tuning. At the heart of our approach is the observation that a SQL query is compiled into a query plan of MapReduce jobs. While the plans differ from query to query, single jobs tend to be similar between queries. We introduce the notion of the code signature of a MapReduce job and, based on this, our concept of job similarity. We show that we can effectively recycle tuning setups from similar MapReduce jobs already profiled. In doing so, we can leverage any third-party tuning advisor for MapReduce engines. We show that recycling tuning setups reduces the time spent on profiling by 50% in the TPC-H benchmark.
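The idea of recycling tuning setups can be illustrated with a minimal sketch. It is not the method from the paper, whose definition of a code signature is not given in this abstract. The sketch assumes that a signature is simply the set of operator names in a job, that similarity is the Jaccard index, and that a setup is reused when similarity reaches a hypothetical threshold. The class TuningCache, the threshold value, and the operator names are all illustrative assumptions.

```python
from dataclasses import dataclass, field


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


@dataclass
class TuningCache:
    """Hypothetical store mapping signatures of profiled jobs to tuning setups."""
    threshold: float = 0.8  # assumed cut-off, not a value from the paper
    entries: dict = field(default_factory=dict)

    def store(self, signature, setup):
        """Remember the tuning setup produced for an already-profiled job."""
        self.entries[frozenset(signature)] = setup

    def lookup(self, signature):
        """Return the setup of the most similar known job, or None if no
        known job is similar enough (the job must then be profiled)."""
        sig = frozenset(signature)
        best_setup, best_score = None, 0.0
        for known, setup in self.entries.items():
            score = jaccard(sig, known)
            if score > best_score:
                best_setup, best_score = setup, score
        return best_setup if best_score >= self.threshold else None


cache = TuningCache(threshold=0.75)
# Assumed signature: operator names appearing in one MapReduce job.
cache.store({"TableScan", "Filter", "GroupBy", "ReduceSink"},
            {"mapreduce.job.reduces": 8})

# Four of five operators are shared, so the stored setup is recycled.
reused = cache.lookup({"TableScan", "Filter", "GroupBy", "ReduceSink", "Select"})
# Only one of five operators is shared, so this job needs fresh profiling.
missed = cache.lookup({"TableScan", "Join"})
```

In this sketch, a lookup that returns None would be the point at which a third-party tuning advisor is invoked to profile the job, after which the resulting setup is stored for later reuse.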



We thank Herodotos Herodotou for all the support with Starfish. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Edson Ramiro Lucas Filho (1)
  • Eduardo Cunha de Almeida (1)
  • Stefanie Scherzinger (2)
  1. Universidade Federal do Paraná, Curitiba, Brazil
  2. OTH Regensburg, Germany
