Data Integration Process Automation Using Machine Learning: Issues and Solution

Machine Learning for Data Science Handbook

Abstract

In today’s data-driven world, real-time analysis of enterprise data plays an important role in helping an organization make strategic decisions and improve business operations. The availability of data in real time and the instant analysis of those data are becoming a challenge for most organizations. Outdated data do not add any value to an organization. A company needs reliable, minute-to-minute information to improve operational efficiency and make better proactive business decisions. Typically, running a data warehouse in an enterprise requires coordinating many operations across multiple teams. It also requires a lot of manual intervention, which is error-prone. Executing all related steps in the correct sequence under the correct conditions can be a challenge. Automated data integration, specifically the ETL (Extract-Transform-Load) process, can address all these problems. Improving the data flows of the ETL process can provide a better return on business investment. Because data from multiple systems are integrated into the data warehouse (DWH), the integrated data can have quality issues that produce inaccurate analytics. Hence, data need to be pre-processed and optimized for the business intelligence process. Automated data integration, specifically the ETL process, can address the issues of the traditional data warehouse related to the availability and quality of data. This chapter explains a solution approach for an automated ETL process that supports continuous integration. It also describes how machine learning can be leveraged in the ETL process so that the quality and availability of data are never compromised.
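As a rough illustration of what "executing all related steps in the correct sequence under the correct conditions" could mean in code, the following minimal sketch chains extract, transform, and load steps behind a simple quality check. It is not the chapter's method: the function names, the sample rows, and the `min_rows` threshold are all hypothetical, and the in-memory list stands in for a real warehouse.

```python
from typing import Dict, List

Row = Dict[str, object]


def extract() -> List[Row]:
    # Hypothetical source rows; a real pipeline would read from source systems.
    return [
        {"id": 1, "amount": "10.5"},
        {"id": 2, "amount": None},  # missing value, to be filtered out
        {"id": 3, "amount": "7.25"},
    ]


def transform(rows: List[Row]) -> List[Row]:
    # Drop rows with a missing amount and cast the remaining amounts to float.
    return [
        {"id": r["id"], "amount": float(r["amount"])}
        for r in rows
        if r["amount"] is not None
    ]


def quality_gate(rows: List[Row], min_rows: int = 1) -> List[Row]:
    # Assumed condition: refuse to load if too few rows survive the transform.
    if len(rows) < min_rows:
        raise ValueError("quality gate failed: too few rows to load")
    return rows


def load(rows: List[Row], warehouse: List[Row]) -> int:
    # Append the cleaned rows to the (in-memory, illustrative) warehouse.
    warehouse.extend(rows)
    return len(rows)


def run_pipeline(warehouse: List[Row]) -> int:
    # Steps run in a fixed order; the load happens only if the gate passes.
    rows = extract()
    rows = transform(rows)
    rows = quality_gate(rows)
    return load(rows, warehouse)


if __name__ == "__main__":
    dwh: List[Row] = []
    loaded = run_pipeline(dwh)
    print(f"loaded {loaded} rows")
```

Encoding the order and the gate in one function removes the manual hand-offs between steps; a scheduler or CI job could then call `run_pipeline` on each run, and a learned data-quality model could, in principle, replace the fixed threshold in `quality_gate`.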



Author information

Corresponding author

Correspondence to Kartick Chandra Mondal.

Copyright information

© 2023 Springer Nature Switzerland AG

Cite this chapter

Mondal, K.C., Saha, S. (2023). Data Integration Process Automation Using Machine Learning: Issues and Solution. In: Rokach, L., Maimon, O., Shmueli, E. (eds) Machine Learning for Data Science Handbook. Springer, Cham. https://doi.org/10.1007/978-3-031-24628-9_3
