Don’t Tune Twice: Reusing Tuning Setups for SQL-on-Hadoop Queries

  • Conference paper
Conceptual Modeling (ER 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11788)

Abstract

SQL-on-Hadoop processing engines have become the state of the art in data lake analysis. However, the skills required to tune such systems are rare. This has inspired automated tuning advisors, which profile the query workload and produce tuning setups for the low-level MapReduce jobs. Yet with highly dynamic query workloads, repeated re-tuning costs time and money in IaaS environments. In this paper, we focus on reducing the cost of up-front tuning. At the heart of our approach is the observation that a SQL query is compiled into a query plan of MapReduce jobs. While the plans differ from query to query, single jobs tend to be similar between queries. We introduce the notion of the code signature of a MapReduce job and, based on this, our concept of job similarity. We show that we can effectively recycle tuning setups from similar MapReduce jobs already profiled. In doing so, we can leverage any third-party tuning advisor for MapReduce engines. We are able to show that by recycling tuning setups, we can reduce the time spent on profiling by 50% in the TPC-H benchmark.
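
The recycling idea can be illustrated with a small sketch. The abstract does not define the code signature or the similarity measure, so everything below is an assumption made for illustration only: a signature is modeled as the set of operator names in a job, similarity as the Jaccard index, and the names `SetupCache`, `fake_advisor`, the operator strings, and the 0.8 threshold are all hypothetical. The third-party tuning advisor is represented by a plain callable.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

# Assumption: a "code signature" is the set of operator names in one job.
Signature = FrozenSet[str]
# Assumption: a tuning setup maps configuration parameters to values.
TuningSetup = Dict[str, str]


def jaccard(a: Signature, b: Signature) -> float:
    """Jaccard index of two signatures (a stand-in similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class SetupCache:
    """Keeps setups of profiled jobs and reuses them for similar jobs."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # hypothetical cut-off for "similar"
        self.entries: List[Tuple[Signature, TuningSetup]] = []
        self.profiling_runs = 0     # how often the advisor was invoked

    def lookup(self, sig: Signature) -> Optional[TuningSetup]:
        """Return the setup of the most similar known job, if similar enough."""
        best: Optional[TuningSetup] = None
        best_score = 0.0
        for known_sig, setup in self.entries:
            score = jaccard(sig, known_sig)
            if score > best_score:
                best, best_score = setup, score
        return best if best_score >= self.threshold else None

    def setup_for(
        self, sig: Signature, advisor: Callable[[Signature], TuningSetup]
    ) -> TuningSetup:
        """Recycle a cached setup, or profile the job and remember the result."""
        cached = self.lookup(sig)
        if cached is not None:
            return cached
        self.profiling_runs += 1
        setup = advisor(sig)
        self.entries.append((sig, setup))
        return setup


def fake_advisor(sig: Signature) -> TuningSetup:
    """Hypothetical advisor; the returned value is a placeholder, not advice."""
    return {"mapreduce.job.reduces": str(len(sig))}


cache = SetupCache(threshold=0.8)
job_a = frozenset({"TableScan", "Filter", "GroupBy", "ReduceSink"})
job_b = frozenset({"TableScan", "Filter", "GroupBy", "ReduceSink"})  # same shape
job_c = frozenset({"TableScan", "Join", "FileSink"})                 # different
for job in (job_a, job_b, job_c):
    cache.setup_for(job, fake_advisor)
print(cache.profiling_runs)  # 2: job_b reuses the setup profiled for job_a
```

In this toy run, three jobs trigger only two profiling runs because the second job matches the first. The actual signature and similarity definitions are given in the full paper and may differ substantially from this stand-in.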

Notes

  1. The Starfish binary is available at https://www.cs.duke.edu/starfish/release.html.

  2. See https://issues.apache.org/jira/browse/HIVE-600 for the verbatim SQL queries.

  3. http://collectl.sourceforge.net.

Acknowledgments

We thank Herodotos Herodotou for all the support with Starfish. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

Author information

Corresponding author

Correspondence to Edson Ramiro Lucas Filho.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Filho, E.R.L., de Almeida, E.C., Scherzinger, S. (2019). Don’t Tune Twice: Reusing Tuning Setups for SQL-on-Hadoop Queries. In: Laender, A., Pernici, B., Lim, E.-P., de Oliveira, J. (eds) Conceptual Modeling. ER 2019. Lecture Notes in Computer Science, vol 11788. Springer, Cham. https://doi.org/10.1007/978-3-030-33223-5_9

  • DOI: https://doi.org/10.1007/978-3-030-33223-5_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-33222-8

  • Online ISBN: 978-3-030-33223-5

  • eBook Packages: Computer Science, Computer Science (R0)
