Abstract
A de facto technological standard in data science is the notebook (e.g., Jupyter), which provides an integrated environment for executing data workflows in different languages. However, from a data engineering point of view, this approach is typically inefficient and unsafe, because most data science languages process data locally, i.e., on workstations with limited memory, and store data in files. The approach thus neglects the benefits of over 40 years of R&D in data engineering, i.e., advanced database technologies and data management techniques. In this paper, we advocate a standardized data engineering approach for data science and present a layered architecture for a data processing pipeline (DPP). This architecture provides a comprehensive conceptual view of DPPs, which in turn enables the semi-automation of their logical and physical design.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Romero, O., Wrembel, R. (2020). Data Engineering for Data Science: Two Sides of the Same Coin. In: Song, M., Song, IY., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2020. Lecture Notes in Computer Science, vol 12393. Springer, Cham. https://doi.org/10.1007/978-3-030-59065-9_13
DOI: https://doi.org/10.1007/978-3-030-59065-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59064-2
Online ISBN: 978-3-030-59065-9
eBook Packages: Computer Science (R0)