Abstract
Distributed processing frameworks process data in parallel by dividing it into multiple partitions, each of which is processed in a separate task. The number of tasks is always derived from the total file size. However, with hybrid layouts this can launch more tasks than needed, because such layouts allow certain operations (i.e., projection and selection) to read less data. Over-provisioning tasks may increase the job execution time and waste significant computing resources, because each task introduces extra overhead (e.g., initialization and garbage collection).
To use resources more efficiently and reduce the job execution time, we propose a cost-based approach that decides the number of tasks based on the data being read. The proposed cost model can also be used in a multi-objective approach to decide both the number of tasks and the number of machines for execution.
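The difference between sizing tasks by total file size and sizing them by the data being read can be illustrated with a minimal sketch. This is not the paper's cost model: the function names, the 128 MiB split size, and the `read_fraction` parameter (the share of the file that a projection or selection touches) are assumptions made only for illustration.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def tasks_from_file_size(total_bytes: int, split_bytes: int = 128 * MIB) -> int:
    """Baseline: one task per split of the whole file (hypothetical helper)."""
    return max(1, math.ceil(total_bytes / split_bytes))


def tasks_from_data_read(total_bytes: int, read_fraction: float,
                         split_bytes: int = 128 * MIB) -> int:
    """Alternative: size the task count by the bytes expected to be read.

    read_fraction is an assumed, externally estimated share of the file
    that the query reads after projection and selection.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    bytes_read = total_bytes * read_fraction
    return max(1, math.ceil(bytes_read / split_bytes))


if __name__ == "__main__":
    size = 10 * GIB
    print(tasks_from_file_size(size))        # tasks when sizing by the full file
    print(tasks_from_data_read(size, 0.25))  # tasks when only a quarter is read
```

In this toy setting, a query that reads a quarter of a 10 GiB file needs a quarter of the tasks that the file-size rule would launch. A real estimate of the data read would depend on the layout and the workload.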
Acknowledgement
This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate “Information Technologies for Business Intelligence - Doctoral College” (IT4BI-DC).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Munir, R.F., Abelló, A., Romero, O., Thiele, M., Lehner, W. (2019). Automatically Configuring Parallelism for Hybrid Layouts. In: Welzer, T., et al. New Trends in Databases and Information Systems. ADBIS 2019. Communications in Computer and Information Science, vol 1064. Springer, Cham. https://doi.org/10.1007/978-3-030-30278-8_15
DOI: https://doi.org/10.1007/978-3-030-30278-8_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30277-1
Online ISBN: 978-3-030-30278-8
eBook Packages: Computer Science, Computer Science (R0)