Abstract
In today’s data-driven world, real-time analysis of enterprise data plays an important role in helping organizations make strategic decisions and improve business operations. Making data available in real time and analyzing it instantly is becoming a challenge for most organizations. Outdated data do not add any value to an organization. A company needs reliable, up-to-the-minute information to improve operational efficiency and make better, proactive business decisions. Typically, running a data warehouse in an enterprise requires coordinating many operations across multiple teams, and much of that work involves error-prone manual intervention. Executing all related steps in the correct sequence and under the correct conditions can be a challenge. Automated data integration, specifically the ETL (Extract-Transform-Load) process, can address these problems, and improving the data flows of the ETL process can provide a better return on business investment. Because data from multiple systems are integrated into the data warehouse (DWH), the integrated data can have quality issues that lead to inaccurate analytics. Hence, data need to be pre-processed and optimized for the business intelligence process. In this way, automated ETL can address the data availability and data quality issues of a traditional data warehouse. This chapter explains a solution approach for an automated ETL process that supports continuous integration. It also describes how machine learning can be leveraged in the ETL process so that the quality and availability of data are never compromised.
Copyright information
© 2023 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Mondal, K.C., Saha, S. (2023). Data Integration Process Automation Using Machine Learning: Issues and Solution. In: Rokach, L., Maimon, O., Shmueli, E. (eds) Machine Learning for Data Science Handbook. Springer, Cham. https://doi.org/10.1007/978-3-031-24628-9_3
DOI: https://doi.org/10.1007/978-3-031-24628-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24627-2
Online ISBN: 978-3-031-24628-9
eBook Packages: Mathematics and Statistics (R0)