Abstract
Over the years, data integration (DI) architectures have evolved from those supporting virtual integration, through those supporting physical integration, to those supporting both. Regardless of type, every DI architecture developed so far includes an integration layer. This layer is implemented by sophisticated software that runs so-called DI processes. The integration layer is responsible for ingesting data from various sources (typically heterogeneous and distributed) and for homogenizing the data into formats suitable for future processing and analysis. Nowadays, large volumes of highly heterogeneous data are produced in all business domains, e.g., medical systems, smart cities, and smart agriculture, which calls for further advances in data integration technologies. In this keynote paper, I present my personal opinion on DI techniques that are still to be developed, i.e., potential research directions, namely: (1) more flexible DI, (2) quality assurance in complex multi-modal systems, and (3) execution optimization of DI processes.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wrembel, R. (2023). Data Integration Revitalized: From Data Warehouse Through Data Lake to Data Mesh. In: Strauss, C., Amagasa, T., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2023. Lecture Notes in Computer Science, vol 14146. Springer, Cham. https://doi.org/10.1007/978-3-031-39847-6_1
DOI: https://doi.org/10.1007/978-3-031-39847-6_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39846-9
Online ISBN: 978-3-031-39847-6
eBook Packages: Computer Science, Computer Science (R0)