Abstract
SQL-on-Hadoop processing engines have become the state of the art in data lake analysis. However, the skills required to tune such systems are rare. This has inspired automated tuning advisors, which profile the query workload and produce tuning setups for the low-level MapReduce jobs. Yet with highly dynamic query workloads, repeated re-tuning costs time and money in IaaS environments. In this paper, we focus on reducing the cost of up-front tuning. At the heart of our approach is the observation that a SQL query is compiled into a query plan of MapReduce jobs. While the plans differ from query to query, individual jobs tend to be similar across queries. We introduce the notion of the code signature of a MapReduce job and, based on this, our concept of job similarity. We show that we can effectively recycle tuning setups from similar MapReduce jobs that have already been profiled. In doing so, we can leverage any third-party tuning advisor for MapReduce engines. By recycling tuning setups, we reduce the time spent on profiling by 50% in the TPC-H benchmark.
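The recycling idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the definition used in the paper: the signature is taken to be a hash over a job's operator names, similarity is simplified to an exact signature match, and `SetupCache`, `code_signature`, and the operator names are hypothetical. The tuning advisor is passed in as a plain callable, standing in for any third-party advisor.

```python
import hashlib
from typing import Callable, Dict, Tuple

Setup = Dict[str, str]          # tuning setup: configuration parameter -> value
Operators = Tuple[str, ...]     # hypothetical description of one MapReduce job


def code_signature(operators: Operators) -> str:
    """Assumed signature: a hash over the job's operator names.

    Table names and constants are deliberately not part of the input, so
    jobs from different queries can map to the same signature.
    """
    return hashlib.sha256("|".join(operators).encode("utf-8")).hexdigest()


class SetupCache:
    """Recycles tuning setups across jobs with the same signature."""

    def __init__(self, advisor: Callable[[Operators], Setup]) -> None:
        self._advisor = advisor            # stand-in for a third-party advisor
        self._setups: Dict[str, Setup] = {}
        self.profiling_runs = 0            # how often the advisor was invoked

    def setup_for(self, operators: Operators) -> Setup:
        sig = code_signature(operators)
        if sig not in self._setups:
            # Unknown job: profile it once and remember the resulting setup.
            self.profiling_runs += 1
            self._setups[sig] = self._advisor(operators)
        # Known job: reuse the stored setup without profiling again.
        return self._setups[sig]
```

With this sketch, two queries whose plans both contain a job with operators `("TableScan", "Filter", "GroupBy")` trigger a single advisor call; the second lookup returns the stored setup. A real similarity measure would likely be more tolerant than exact hash equality.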
Notes
1. The Starfish binary is available at https://www.cs.duke.edu/starfish/release.html.
2. See https://issues.apache.org/jira/browse/HIVE-600 for the verbatim SQL queries.
3.
Acknowledgments
We thank Herodotos Herodotou for all the support with Starfish. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Filho, E.R.L., de Almeida, E.C., Scherzinger, S. (2019). Don’t Tune Twice: Reusing Tuning Setups for SQL-on-Hadoop Queries. In: Laender, A., Pernici, B., Lim, EP., de Oliveira, J. (eds) Conceptual Modeling. ER 2019. Lecture Notes in Computer Science, vol 11788. Springer, Cham. https://doi.org/10.1007/978-3-030-33223-5_9
DOI: https://doi.org/10.1007/978-3-030-33223-5_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33222-8
Online ISBN: 978-3-030-33223-5
eBook Packages: Computer Science, Computer Science (R0)