Data Engineering for Data Science: Two Sides of the Same Coin

Romero, Oscar; Wrembel, Robert

doi:10.1007/978-3-030-59065-9_13

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12393)

Included in the following conference series:

International Conference on Big Data Analytics and Knowledge Discovery


Abstract

The de facto technological standard for data science is the notebook (e.g., Jupyter), which provides an integrated environment for executing data workflows in different languages. However, from a data engineering point of view, this approach is typically inefficient and unsafe, because most data science languages process data locally, i.e., on workstations with limited memory, and store data in files. It therefore neglects the benefits of over 40 years of R&D in data engineering, namely advanced database technologies and data management techniques. In this paper, we advocate a standardized data engineering approach for data science and present a layered architecture for a data processing pipeline (DPP). This architecture provides a comprehensive conceptual view of DPPs, which in turn enables the semi-automation of their logical and physical design.
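The contrast drawn in the abstract, between processing file-based data in local memory and delegating the work to a database engine, can be illustrated with a minimal sketch. It is not taken from the paper: the sample data, the table name `readings`, and both function names are hypothetical, and an in-memory SQLite database stands in for a real database server only to keep the example self-contained.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a file a notebook would load.
CSV_TEXT = "sensor,reading\na,1.0\na,3.0\nb,10.0\n"


def mean_in_memory(csv_text):
    """Notebook style: parse the whole file into local memory, then aggregate."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))  # everything held in RAM
    sums, counts = {}, {}
    for row in rows:
        key = row["sensor"]
        sums[key] = sums.get(key, 0.0) + float(row["reading"])
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def mean_in_database(csv_text):
    """Database style: load into a typed table and let the engine aggregate."""
    # Assumption: SQLite in memory is only a stand-in for a database server.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE readings (sensor TEXT NOT NULL, reading REAL NOT NULL)"
        )
        reader = csv.DictReader(io.StringIO(csv_text))
        conn.executemany(
            "INSERT INTO readings VALUES (?, ?)",
            ((row["sensor"], float(row["reading"])) for row in reader),
        )
        cursor = conn.execute(
            "SELECT sensor, AVG(reading) FROM readings GROUP BY sensor"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(mean_in_memory(CSV_TEXT))
    print(mean_in_database(CSV_TEXT))
```

Both functions return the same per-sensor means. In the second variant the schema is declared explicitly and the aggregation is expressed declaratively, so a full database system could optimize the query and enforce constraints, which is the kind of benefit the abstract attributes to established data engineering techniques.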



Author information

Authors and Affiliations

Universitat Politècnica de Catalunya, Catalunya, Spain
Oscar Romero
Poznan University of Technology, Poznań, Poland
Robert Wrembel

Corresponding author

Correspondence to Robert Wrembel.

Editor information

Editors and Affiliations

Department of Library and Information, Yonsei University, Seoul, Korea (Republic of)
Min Song
Drexel University, Philadelphia, PA, USA
Il-Yeol Song
Johannes Kepler University of Linz, Linz, Austria
Gabriele Kotsis
Software Competence Center Hagenberg (Au), Vienna, Wien, Austria
A Min Tjoa
Johannes Kepler University of Linz, Linz, Oberösterreich, Austria
Ismail Khalil

Cite this paper

Romero, O., Wrembel, R. (2020). Data Engineering for Data Science: Two Sides of the Same Coin. In: Song, M., Song, IY., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2020. Lecture Notes in Computer Science, vol 12393. Springer, Cham. https://doi.org/10.1007/978-3-030-59065-9_13

DOI: https://doi.org/10.1007/978-3-030-59065-9_13
Published: 11 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59064-2
Online ISBN: 978-3-030-59065-9
eBook Packages: Computer Science, Computer Science (R0)
