Framework to Optimize Data Processing Pipelines Using Performance Metrics

Conference paper in: Big Data Analytics and Knowledge Discovery (DaWaK 2020)

Abstract

Optimizing data processing pipelines (DPPs) is challenging in both data warehouse and data science architectures, and few approaches to this problem have been proposed so far. The hardest part is building a cost model of the whole DPP, especially when user-defined functions (UDFs) are used. In this paper we address the optimization of UDFs in data-intensive workflows and present our approach to constructing a cost model that determines the degree of parallelism for parallelizable UDFs.
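The abstract does not spell out the cost model itself. As a purely illustrative sketch, and not the authors' formulation, one way to pose "pick a degree of parallelism per UDF" is as a small selection problem: each UDF offers a few candidate degrees of parallelism, each with an estimated run time and a monetary cost, and exactly one candidate is chosen per UDF so that total run time is minimal within a budget. The UDF names, the numbers, the `choose_parallelism` helper, and the assumption that UDFs run one after another (so run times add up) are all hypothetical.

```python
from itertools import product

# Hypothetical estimates: degree of parallelism -> (run time in seconds, monetary cost).
OPTIONS = {
    "udf_a": {1: (100.0, 1.0), 2: (55.0, 2.0), 4: (30.0, 4.0)},
    "udf_b": {1: (60.0, 1.0), 2: (30.0, 2.0), 4: (25.0, 4.0)},
}


def choose_parallelism(options, budget):
    """Pick one degree of parallelism per UDF, minimizing total run time.

    Exhaustively tries every combination (fine for a handful of UDFs) and
    returns (total_time, total_cost, {udf: degree}), or None if no
    combination fits the budget. Ties on time are broken by lower cost.
    """
    names = list(options)
    best = None
    for combo in product(*(options[name].items() for name in names)):
        total_time = sum(time for _, (time, _) in combo)
        total_cost = sum(cost for _, (_, cost) in combo)
        if total_cost > budget:
            continue  # this assignment is too expensive
        if best is None or (total_time, total_cost) < best[:2]:
            plan = {name: degree for name, (degree, _) in zip(names, combo)}
            best = (total_time, total_cost, plan)
    return best


if __name__ == "__main__":
    print(choose_parallelism(OPTIONS, budget=5.0))
```

Exhaustive search is used here only to keep the sketch dependency-free; for more than a few UDFs the same one-choice-per-group structure would normally be handed to an integer programming solver.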

Notes

  1. https://developers.google.com/optimization/mip/mip.

  2. https://github.com/fawadali/MCKPCostModel/blob/master/ML-CostModel/.

  3. https://calculator.s3.amazonaws.com/index.html.

Corresponding author

Correspondence to Syed Muhammad Fawad Ali.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Ali, S.M.F., Wrembel, R. (2020). Framework to Optimize Data Processing Pipelines Using Performance Metrics. In: Song, M., Song, IY., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2020. Lecture Notes in Computer Science, vol 12393. Springer, Cham. https://doi.org/10.1007/978-3-030-59065-9_11

  • DOI: https://doi.org/10.1007/978-3-030-59065-9_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59064-2

  • Online ISBN: 978-3-030-59065-9

  • eBook Packages: Computer Science, Computer Science (R0)
