Automatically Configuring Parallelism for Hybrid Layouts

  • Conference paper
New Trends in Databases and Information Systems (ADBIS 2019)

Abstract

Distributed processing frameworks process data in parallel by dividing it into multiple partitions, each of which is processed in a separate task. The number of tasks is always determined by the total file size. However, this can lead to launching more tasks than needed in the case of hybrid layouts, because they allow less data to be read for certain operations (i.e., projection and selection). Over-provisioning tasks may increase the job execution time and waste significant computing resources, because each task introduces extra overhead (e.g., initialization and garbage collection).
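To make the over-provisioning concrete, the following is a minimal sketch, not taken from the paper, that compares a task count derived from the total file size with one derived from the bytes a projection actually needs. The function names, the 128 MiB split size, and the assumption that all columns have equal width are illustrative only.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return (a + b - 1) // b


def tasks_by_file_size(file_size_bytes: int, split_size_bytes: int) -> int:
    """Size-based policy: one task per split of the whole file."""
    return max(1, ceil_div(file_size_bytes, split_size_bytes))


def tasks_by_data_read(file_size_bytes: int, split_size_bytes: int,
                       columns_read: int, total_columns: int) -> int:
    """Read-based policy: one task per split of the bytes actually read.

    Assumes (hypothetically) that every column occupies the same share
    of the file, so a projection reads columns_read / total_columns of it.
    """
    bytes_read = file_size_bytes * columns_read // total_columns
    return max(1, ceil_div(bytes_read, split_size_bytes))


file_size = 10 * 1024**3   # 10 GiB file (illustrative)
split_size = 128 * 1024**2  # 128 MiB split (illustrative)

size_based = tasks_by_file_size(file_size, split_size)
read_based = tasks_by_data_read(file_size, split_size,
                                columns_read=2, total_columns=20)
print(size_based, read_based)
```

Under these assumptions, a query that projects 2 of 20 columns needs one tenth of the tasks that the size-based policy would launch.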

To allow a more efficient use of resources and reduce the job execution time, we propose a cost-based approach that decides the number of tasks based on the data being read. The proposed cost model can be utilized in a multi-objective approach to decide both the number of tasks and the number of machines for execution.
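The idea of a cost-based choice can be illustrated with a deliberately simple model. This sketch is an assumption for illustration and is not the cost model proposed in the paper: it treats tasks as running in waves over a fixed number of slots, charges each task a fixed overhead, and assumes a constant read throughput. All names and parameter values are hypothetical.

```python
import math


def estimated_job_time(num_tasks: int, bytes_read: int, slots: int,
                       task_overhead_s: float,
                       throughput_bytes_per_s: float) -> float:
    """Toy estimate: tasks run in waves; each task pays a fixed overhead
    plus the time to read its equal share of the data."""
    waves = math.ceil(num_tasks / slots)
    per_task_s = task_overhead_s + (bytes_read / num_tasks) / throughput_bytes_per_s
    return waves * per_task_s


def choose_num_tasks(bytes_read: int, slots: int, task_overhead_s: float,
                     throughput_bytes_per_s: float, max_tasks: int) -> int:
    """Pick the task count in [1, max_tasks] with the lowest estimate."""
    return min(
        range(1, max_tasks + 1),
        key=lambda n: estimated_job_time(
            n, bytes_read, slots, task_overhead_s, throughput_bytes_per_s),
    )


# Illustrative values only: 1 GiB actually read, 8 task slots,
# 2 s of per-task overhead, 100 MB/s read throughput per task.
params = dict(bytes_read=1024**3, slots=8, task_overhead_s=2.0,
              throughput_bytes_per_s=100e6)

best = choose_num_tasks(max_tasks=80, **params)
time_best = estimated_job_time(best, **params)
time_size_based = estimated_job_time(80, **params)
print(best, round(time_best, 2), round(time_size_based, 2))
```

In this toy setting the estimate is lowest when the task count matches the number of slots, and launching 80 tasks for the same data is several times slower because the per-task overhead is paid in every wave. Extending the search to also vary the number of slots would be one way to sketch the multi-objective setting mentioned above.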

Acknowledgement

This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate “Information Technologies for Business Intelligence - Doctoral College” (IT4BI-DC).

Author information

Corresponding author

Correspondence to Rana Faisal Munir.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Munir, R.F., Abelló, A., Romero, O., Thiele, M., Lehner, W. (2019). Automatically Configuring Parallelism for Hybrid Layouts. In: Welzer, T., et al. New Trends in Databases and Information Systems. ADBIS 2019. Communications in Computer and Information Science, vol 1064. Springer, Cham. https://doi.org/10.1007/978-3-030-30278-8_15

  • DOI: https://doi.org/10.1007/978-3-030-30278-8_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30277-1

  • Online ISBN: 978-3-030-30278-8

  • eBook Packages: Computer Science, Computer Science (R0)
