Don’t Tune Twice: Reusing Tuning Setups for SQL-on-Hadoop Queries

Filho, Edson Ramiro Lucas; de Almeida, Eduardo Cunha; Scherzinger, Stefanie

doi:10.1007/978-3-030-33223-5_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11788)

Included in the following conference series:

International Conference on Conceptual Modeling

Abstract

SQL-on-Hadoop processing engines have become the state of the art in data lake analysis. However, the skills required to tune such systems are rare. This has inspired automated tuning advisors which profile the query workload and produce tuning setups for the low-level MapReduce jobs. Yet with highly dynamic query workloads, repeated re-tuning costs time and money in IaaS environments. In this paper, we focus on reducing the costs for up-front tuning. At the heart of our approach is the observation that a SQL query is compiled into a query plan of MapReduce jobs. While the plans differ from query to query, single jobs tend to be similar between queries. We introduce the notion of the code signature of a MapReduce job and, based on this, our concept of job similarity. We show that we can effectively recycle tuning setups from similar MapReduce jobs already profiled. In doing so, we can leverage any third-party tuning advisor for MapReduce engines. We are able to show that by recycling tuning setups, we can reduce the time spent on profiling by 50% in the TPC-H benchmark.
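
The recycling idea described in the abstract can be illustrated with a small sketch. The abstract does not define the code signature or the similarity measure, so everything below is an assumption made for illustration: a job is modeled as a tuple of operator names, its signature is taken to be the sorted set of those names, and two jobs count as similar only when their signatures are equal. The `Job` and `SetupCache` classes, the operator names, and the `fake_advisor` stand-in (with made-up parameter values) are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Signature = Tuple[str, ...]
TuningSetup = Dict[str, int]


@dataclass(frozen=True)
class Job:
    """One MapReduce job of a compiled query plan (hypothetical model)."""
    query: str
    operators: Tuple[str, ...]  # assumed: names of the operators the job runs


def code_signature(job: Job) -> Signature:
    # Assumption: the signature is the sorted set of operator names,
    # so operator order and the originating query do not matter.
    return tuple(sorted(set(job.operators)))


class SetupCache:
    """Reuses a tuning setup whenever a job with an equal signature was profiled."""

    def __init__(self, advisor: Callable[[Job], TuningSetup]) -> None:
        self._advisor = advisor
        self._setups: Dict[Signature, TuningSetup] = {}
        self.profiled = 0  # jobs sent to the (expensive) advisor
        self.reused = 0    # jobs served from the cache

    def setup_for(self, job: Job) -> TuningSetup:
        sig = code_signature(job)
        if sig in self._setups:
            self.reused += 1
        else:
            self._setups[sig] = self._advisor(job)
            self.profiled += 1
        return self._setups[sig]


def fake_advisor(job: Job) -> TuningSetup:
    # Stand-in for a third-party tuning advisor; the values are made up.
    return {
        "mapreduce.job.reduces": 2 * len(set(job.operators)),
        "mapreduce.task.io.sort.mb": 256,
    }


# Two hypothetical query plans that share one similar job.
plan_q1 = [Job("q1", ("TableScan", "Filter", "GroupBy")), Job("q1", ("Join", "Select"))]
plan_q2 = [Job("q2", ("Filter", "TableScan", "GroupBy")), Job("q2", ("OrderBy",))]

cache = SetupCache(fake_advisor)
setups = [cache.setup_for(job) for job in plan_q1 + plan_q2]
print(f"profiled={cache.profiled} reused={cache.reused}")  # profiled=3 reused=1
```

In this toy workload, the first job of `q2` has the same signature as the first job of `q1`, so only three of the four jobs reach the advisor. A real similarity measure could be looser than strict equality; exact matching is used here only to keep the sketch short.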

Notes

1. The Starfish binary is available at https://www.cs.duke.edu/starfish/release.html.
2. See https://issues.apache.org/jira/browse/HIVE-600 for the verbatim SQL queries.
3. http://collectl.sourceforge.net.

Acknowledgments

We thank Herodotos Herodotou for all the support with Starfish. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

Author information

Authors and Affiliations

Universidade Federal do Paraná, Curitiba, Brazil
Edson Ramiro Lucas Filho & Eduardo Cunha de Almeida
OTH Regensburg, Regensburg, Germany
Stefanie Scherzinger

Corresponding author

Correspondence to Edson Ramiro Lucas Filho.

Editor information

Editors and Affiliations

Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Alberto H. F. Laender
Politecnico di Milano, Milan, Italy
Barbara Pernici
Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
José Palazzo M. de Oliveira

Cite this paper

Filho, E.R.L., de Almeida, E.C., Scherzinger, S. (2019). Don’t Tune Twice: Reusing Tuning Setups for SQL-on-Hadoop Queries. In: Laender, A., Pernici, B., Lim, EP., de Oliveira, J. (eds) Conceptual Modeling. ER 2019. Lecture Notes in Computer Science, vol 11788. Springer, Cham. https://doi.org/10.1007/978-3-030-33223-5_9

DOI: https://doi.org/10.1007/978-3-030-33223-5_9
Published: 15 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33222-8
Online ISBN: 978-3-030-33223-5
eBook Packages: Computer Science, Computer Science (R0)
