AScale: Big/Small Data ETL and Real-Time Data Freshness
In this paper we investigate the problem of providing timely results for the Extraction, Transformation and Load (ETL) process and of bringing automatic scalability to the entire pipeline, including the data warehouse. In general, data loading, transformation and integration are heavy tasks that are performed only periodically, during specific offline time windows. Parallel architectures and mechanisms can optimize the ETL process by speeding up each part of the pipeline as more performance is needed. However, none of them allows the user to specify a target ETL time and have the framework scale automatically to meet it.
We propose an approach that enables automatic scalability and freshness of any data warehouse and ETL process within the required time, suitable for both smallData and bigData scenarios. A general framework for testing and implementing the system was developed to provide solutions for each part of this time-bounded automatic scaling of ETL. The results show that the proposed system is capable of scaling to provide the desired processing speed for near-real-time ETL processing.
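To make the idea of scaling to a user-specified ETL time concrete, the following is a minimal sketch and not the paper's algorithm. It assumes throughput grows linearly with the number of nodes, and the function name `nodes_needed` and its parameters are hypothetical, introduced only for illustration.

```python
import math


def nodes_needed(rows: int, rows_per_sec_per_node: float, target_seconds: float) -> int:
    """Return the smallest node count that processes `rows` within `target_seconds`.

    Assumption (not from the paper): every node sustains the same measured
    throughput and adding nodes scales throughput linearly.
    """
    if rows_per_sec_per_node <= 0 or target_seconds <= 0:
        raise ValueError("throughput and target time must be positive")
    # Rows one node can process inside the target window.
    rows_per_node = rows_per_sec_per_node * target_seconds
    # Always keep at least one node, even when there is nothing to load.
    return max(1, math.ceil(rows / rows_per_node))


if __name__ == "__main__":
    # Hypothetical workload: 10 million rows, 5,000 rows/s per node, 10-minute target.
    print(nodes_needed(10_000_000, 5_000, 600))
```

In practice such an estimate would be recomputed as the measured throughput and the incoming data rate change, so that nodes are added or released to keep the target time.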
Keywords: Scalability, ETL, Freshness, High-rate, Performance, Parallel processing, Distributed systems, Database, bigData, smallData, Business management