Towards a Cost Model to Optimize User-Defined Functions in an ETL Workflow Based on User-Defined Performance Metrics

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11695)


Today’s ETL tools provide capabilities for developing custom code as user-defined functions (UDFs) to extend the expressiveness of standard ETL operators. However, the custom code of a UDF may execute inefficiently because of a poor implementation (e.g., one that does not use parallel processing or adequate data structures). In this paper, we address the problem of optimizing UDFs in data-intensive workflows and present our approach to constructing a cost model that determines the degree of parallelism for parallelizable UDFs.


Keywords: ETL workflow · ETL execution optimization · User-defined functions · Cost model · Parallelization



The work of Fawad Ali is partially supported by the European Commission through the Erasmus Mundus Joint Doctorate project Information Technologies for Business Intelligence-Doctoral College (IT4BI-DC).

The work of Robert Wrembel is partially supported by: (1) the grant No. 2015/19/B/ST6/02637 of the National Science Center and (2) the grant of the Polish National Agency for Academic Exchange, within the Bekker programme.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Poznan University of Technology, Poznan, Poland
