Framework to Optimize Data Processing Pipelines Using Performance Metrics

Ali, Syed Muhammad Fawad; Wrembel, Robert

doi:10.1007/978-3-030-59065-9_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12393)

Included in the following conference series:

International Conference on Big Data Analytics and Knowledge Discovery


Abstract

Optimizing Data Processing Pipelines (DPPs) is challenging in the context of both data warehouse architectures and data science architectures. Few approaches to this problem have been proposed so far. The most challenging issue is to build a cost model of the whole DPP, especially if user-defined functions (UDFs) are used. In this paper, we address the problem of optimizing UDFs in data-intensive workflows and present our approach to constructing a cost model that determines the degree of parallelism for parallelizable UDFs.



Author information

Authors and Affiliations

Poznan University of Technology, Poznań, Poland
Syed Muhammad Fawad Ali & Robert Wrembel


Corresponding author

Correspondence to Syed Muhammad Fawad Ali.

Editor information

Editors and Affiliations

Department of Library and Information, Yonsei University, Seoul, Korea (Republic of)
Min Song
Drexel University, Philadelphia, PA, USA
Il-Yeol Song
Johannes Kepler University of Linz, Linz, Austria
Gabriele Kotsis
Software Competence Center Hagenberg (Au), Vienna, Wien, Austria
A Min Tjoa
Johannes Kepler University of Linz, Linz, Oberösterreich, Austria
Ismail Khalil


About this paper

Cite this paper

Ali, S.M.F., Wrembel, R. (2020). Framework to Optimize Data Processing Pipelines Using Performance Metrics. In: Song, M., Song, IY., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2020. Lecture Notes in Computer Science, vol 12393. Springer, Cham. https://doi.org/10.1007/978-3-030-59065-9_11


DOI: https://doi.org/10.1007/978-3-030-59065-9_11
Published: 11 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59064-2
Online ISBN: 978-3-030-59065-9
eBook Packages: Computer Science, Computer Science (R0)
