Automatically Configuring Parallelism for Hybrid Layouts

Munir, Rana Faisal; Abelló, Alberto; Romero, Oscar; Thiele, Maik; Lehner, Wolfgang

doi:10.1007/978-3-030-30278-8_15

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1064)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

Distributed processing frameworks process data in parallel by dividing it into multiple partitions, each of which is processed in a separate task. The number of tasks is always determined by the total file size. However, for hybrid layouts this can launch more tasks than needed, because such layouts allow less data to be read for certain operations (i.e., projection and selection). Over-provisioning tasks may increase the job execution time and waste significant computing resources, because each task introduces extra overhead (e.g., initialization and garbage collection).
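The gap between the two ways of counting tasks can be illustrated with a small calculation. The sketch below is only an illustration under assumed numbers: the split size, the fraction of columns projected, and the fraction of rows selected are hypothetical values, and the function names are not taken from the paper or from any framework.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2


def tasks_from_file_size(total_bytes: int, split_bytes: int) -> int:
    """Task count when every split of the file gets its own task."""
    return max(1, math.ceil(total_bytes / split_bytes))


def tasks_from_bytes_read(total_bytes: int, split_bytes: int,
                          column_fraction: float, row_fraction: float) -> int:
    """Task count when only the data actually read is considered.

    column_fraction and row_fraction are assumed (hypothetical) estimates of
    how much data survives projection and selection in a hybrid layout.
    """
    bytes_read = total_bytes * column_fraction * row_fraction
    return max(1, math.ceil(bytes_read / split_bytes))


# Assumed example: a 64 GiB file, 128 MiB splits, a query that projects
# a quarter of the columns and selects half of the rows.
by_size = tasks_from_file_size(64 * GIB, 128 * MIB)
by_read = tasks_from_bytes_read(64 * GIB, 128 * MIB, 0.25, 0.5)
print(by_size, by_read)  # 512 64
```

Under these assumed values, sizing by file size launches eight times as many tasks as the data being read would justify, and each extra task pays its own startup overhead.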

To use resources more efficiently and reduce the job execution time, we propose a cost-based approach that decides the number of tasks based on the data being read. The proposed cost model can be used in a multi-objective approach to decide both the number of tasks and the number of machines for execution.
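The abstract does not spell out the cost model, so the following is a toy stand-in rather than the authors' formulation. It assumes that tasks run in waves over a fixed number of slots, that every task pays a constant overhead, and that read throughput per task is constant; all names and parameter values are hypothetical.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2


def estimated_time(n_tasks: int, bytes_read: float, slots: int,
                   throughput: float, overhead_s: float) -> float:
    """Toy estimate: waves of tasks, each paying overhead plus read time."""
    waves = math.ceil(n_tasks / slots)
    per_task = overhead_s + (bytes_read / n_tasks) / throughput
    return waves * per_task


def best_tasks(bytes_read: float, slots: int, throughput: float,
               overhead_s: float, max_tasks: int) -> tuple[int, float]:
    """Return the task count with the lowest estimated time, and that time."""
    candidates = range(1, max_tasks + 1)
    n = min(candidates, key=lambda k: estimated_time(
        k, bytes_read, slots, throughput, overhead_s))
    return n, estimated_time(n, bytes_read, slots, throughput, overhead_s)


def options_per_machine_count(bytes_read: float, cores_per_machine: int,
                              max_machines: int, throughput: float,
                              overhead_s: float, max_tasks: int):
    """List (machines, tasks, time) options for a two-objective trade-off."""
    options = []
    for machines in range(1, max_machines + 1):
        slots = machines * cores_per_machine
        n, t = best_tasks(bytes_read, slots, throughput, overhead_s, max_tasks)
        options.append((machines, n, t))
    return options


# Assumed example: 8 GiB read, 100 MiB/s per task, 2 s overhead per task.
n, t = best_tasks(8 * GIB, slots=16, throughput=100 * MIB,
                  overhead_s=2.0, max_tasks=512)
options = options_per_machine_count(8 * GIB, 4, 4, 100 * MIB, 2.0, 512)
```

In this toy model the cheapest task count for 16 slots is 16, that is, a single wave. The list returned by `options_per_machine_count` pairs each machine count with its best task count and estimated time, which is the kind of trade-off a multi-objective search would explore.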

Acknowledgement

This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate “Information Technologies for Business Intelligence - Doctoral College” (IT4BI-DC).

Author information

Authors and Affiliations

Universitat Politècnica de Catalunya, Barcelona, Spain
Rana Faisal Munir, Alberto Abelló & Oscar Romero
Technische Universität Dresden, Dresden, Germany
Rana Faisal Munir, Maik Thiele & Wolfgang Lehner

Corresponding author

Correspondence to Rana Faisal Munir.

Editor information

Editors and Affiliations

University of Maribor, Maribor, Slovenia
Tatjana Welzer
Alpen Adria University Klagenfurt, Klagenfurt am Wörthersee, Austria
Johann Eder
University of Maribor, Maribor, Slovenia
Vili Podgorelec
Poznan University of Technology, Poznan, Poland
Robert Wrembel
University of Novi Sad, Novi Sad, Serbia
Mirjana Ivanović
Free University of Bozen-Bolzano, Bolzano, Italy
Johann Gamper
Poznań University of Technology, Poznan, Poland
Mikołaj Morzy
University of Thessaly, Lamia, Greece
Theodoros Tzouramanis
Université Lumière Lyon 2, Lyon, France
Jérôme Darmont
University of Maribor, Maribor, Slovenia
Aida Kamišalić Latifić

About this paper

Cite this paper

Munir, R.F., Abelló, A., Romero, O., Thiele, M., Lehner, W. (2019). Automatically Configuring Parallelism for Hybrid Layouts. In: Welzer, T., et al. New Trends in Databases and Information Systems. ADBIS 2019. Communications in Computer and Information Science, vol 1064. Springer, Cham. https://doi.org/10.1007/978-3-030-30278-8_15

DOI: https://doi.org/10.1007/978-3-030-30278-8_15
Published: 01 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30277-1
Online ISBN: 978-3-030-30278-8
eBook Packages: Computer Science, Computer Science (R0)
