AScale: Auto-Scale in and out ETL+Q Framework

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 613)

Abstract

This study investigates how to provide automatic scalability and data freshness in data warehouses while handling high-rate data efficiently. In general, data freshness is not guaranteed in these contexts, since data loading, transformation and integration are heavy tasks that are performed only periodically.

Ideally, users developing data warehouses should concentrate solely on the conceptual and logical design, such as business-driven requirements, logical warehouse schemas, workload and ETL processes, while physical details, including mechanisms for scalability, freshness and integration of high-rate data, should be left to automated tools.

To this end, we propose a universal data warehouse parallelization system, that is, an approach that enables automatic scalability and freshness of warehouses and ETL processes. We developed a general framework for implementing and testing the proposed system. The results show that the system can scale to provide the desired processing speed and data freshness.

Keywords

Algorithms, Architecture, Performance, Distributed, Elastic, Parallel processing, Distributed systems, Database, Scalability, Load-balance

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Computer Sciences, University of Coimbra, Coimbra, Portugal
