Abstract
In this paper we investigate the problem of providing timely results for the Extraction, Transformation and Load (ETL) process and automatic scalability for the entire pipeline, including the data warehouse. In general, data loading, transformation and integration are heavy tasks that are performed only periodically, during specific offline time windows. Parallel architectures and mechanisms can optimize the ETL process by speeding up each part of the pipeline as more performance is needed. However, none of them lets the user specify a target ETL time and have the framework scale automatically to meet it.
We propose an approach that enables automatic scalability and data freshness for any data warehouse and ETL process within a specified time, suitable for both small-data and big-data scenarios. A general framework for testing and implementing the system was developed to provide solutions for each part of the automatic, time-bounded scalability of ETL. The results show that the proposed system is capable of scaling to provide the desired processing speed for near-real-time ETL processing.
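The idea of letting the user state a target ETL time and scaling to meet it can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the per-node throughput figure, and the assumption that throughput grows linearly with the number of nodes are all hypothetical and chosen only for illustration.

```python
import math


def nodes_needed(rows_pending: int,
                 rows_per_sec_per_node: float,
                 target_seconds: float) -> int:
    """Smallest node count that finishes the pending rows within the target.

    Assumption (not from the paper): combined throughput scales linearly
    with the number of nodes.
    """
    if rows_pending < 0:
        raise ValueError("rows_pending must be non-negative")
    if rows_per_sec_per_node <= 0 or target_seconds <= 0:
        raise ValueError("throughput and target time must be positive")
    required_rate = rows_pending / target_seconds
    # Always keep at least one node, even when nothing is pending.
    return max(1, math.ceil(required_rate / rows_per_sec_per_node))


def scaling_decision(current_nodes: int,
                     rows_pending: int,
                     rows_per_sec_per_node: float,
                     target_seconds: float) -> int:
    """Positive result: nodes to add. Negative result: nodes that could be released."""
    wanted = nodes_needed(rows_pending, rows_per_sec_per_node, target_seconds)
    return wanted - current_nodes


if __name__ == "__main__":
    # 1,000,000 pending rows, 500 rows/s per node, 10-minute target.
    print(nodes_needed(1_000_000, 500.0, 600.0))
    print(scaling_decision(2, 1_000_000, 500.0, 600.0))
```

In practice the per-node rate would be measured continuously and would differ across the extraction, transformation and loading stages, so a real system would likely apply a decision of this kind per stage rather than once for the whole pipeline.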
Acknowledgement
This project is part of a larger software prototype, partially financed by the CISUC research group of the University of Coimbra, Portugal, and by the Foundation for Science and Technology.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Martins, P., Abbasi, M., Furtado, P. (2016). AScale: Big/Small Data ETL and Real-Time Data Freshness. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery. BDAS 2015, BDAS 2016. Communications in Computer and Information Science, vol 613. Springer, Cham. https://doi.org/10.1007/978-3-319-34099-9_25
DOI: https://doi.org/10.1007/978-3-319-34099-9_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-34098-2
Online ISBN: 978-3-319-34099-9
eBook Packages: Computer Science, Computer Science (R0)