Data Engineering for Data Science: Two Sides of the Same Coin

  • Conference paper
Big Data Analytics and Knowledge Discovery (DaWaK 2020)

Abstract

The de facto technological standard in data science is the notebook (e.g., Jupyter), which provides an integrated environment for executing data workflows in different languages. From a data engineering point of view, however, this approach is typically inefficient and unsafe, because most data science languages process data locally, i.e., on workstations with limited memory, and store data in files. The approach thus neglects the benefits of over 40 years of R&D in data engineering, namely advanced database technologies and data management techniques. In this paper, we advocate a standardized data engineering approach for data science and present a layered architecture for a data processing pipeline (DPP). This architecture provides a comprehensive conceptual view of DPPs, which in turn enables the semi-automation of their logical and physical designs.

Corresponding author

Correspondence to Robert Wrembel.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Romero, O., Wrembel, R. (2020). Data Engineering for Data Science: Two Sides of the Same Coin. In: Song, M., Song, IY., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2020. Lecture Notes in Computer Science, vol 12393. Springer, Cham. https://doi.org/10.1007/978-3-030-59065-9_13

  • DOI: https://doi.org/10.1007/978-3-030-59065-9_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59064-2

  • Online ISBN: 978-3-030-59065-9

  • eBook Packages: Computer Science, Computer Science (R0)
