Abstract
Extract-Transform-Load (ETL) processes are used for extracting data, transforming it, and loading it into data warehouses (DWs). The dominant ETL tools use graphical user interfaces (GUIs) in which the developer “draws” the ETL flow by connecting steps/transformations with lines. This gives an easy overview, but can also be rather tedious and require much trivial work for simple things. We therefore challenge this approach and propose to do ETL programming by writing code. To make the programming easy, we present the Python-based framework pygrametl, which offers commonly used functionality for ETL development. By using the framework, the developer can efficiently create effective ETL solutions in which the full power of programming can be exploited. In this chapter, we present our work on pygrametl and related activities. Further, we consider some of the lessons learned during the development of pygrametl as an open source framework.
Notes
- 1. We assume that a dimension has a non-composite key.
- 2. It is also possible to do the lookups and insertions from the leaves towards the root, but when going towards the leaves, the search can stop early if a part of the member is already present.
- 3. We did not test PDI’s support for distributed execution.
- 4. http://flexdanmark.dk. One of the authors (Ove Andersen) is employed by FlexDanmark.
References
Beyer, M.A., Thoo, E., Selvage, M.Y., Zaidi, E.: Gartner Magic Quadrant for Data Integration Tools (2017)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of OSDI, pp. 137–150 (2004). https://doi.org/10.1145/1327452.1327492
Django. https://www.djangoproject.com/. Accessed 13 Oct 2017
Grönniger, H., Krahn, H., Rumpe, B., Schindler, M., Völkel, S.: Text-based modeling. In: Proceedings of ATEM (2007)
IBM InfoSphere DataStage. https://www.ibm.com/ms-en/marketplace/datastage. Accessed 13 Oct 2017
Informatica. https://www.informatica.com. Accessed 13 Oct 2017
Jensen, C.S., Pedersen, T.B., Thomsen, C.: Multidimensional Databases and Data Warehousing. Morgan and Claypool, San Rafael (2010). https://doi.org/10.2200/S00299ED1V01Y201009DTM009
Kimball, R., Ross, M.: The Data Warehouse Toolkit, 2nd edn. Wiley, New York (2002)
Liu, X., Thomsen, C., Pedersen, T.B.: ETLMR: a highly scalable dimensional ETL framework based on MapReduce. In: Cuzzocrea, A., Dayal, U. (eds.) DaWaK 2011. LNCS, vol. 6862, pp. 96–111. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23544-3_8
Microsoft SQL Server Integration Services. https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services. Accessed 13 Oct 2017
Oracle Data Integrator. http://www.oracle.com/technetwork/middleware/data-integrator/overview/index.html. Accessed 13 Oct 2017
Pentaho Data Integration - Kettle. http://kettle.pentaho.org. Accessed 13 Oct 2017
Petre, M.: Why looking isn’t always seeing: readership skills and graphical programming. Commun. ACM 38(6), 33–44 (1995). https://doi.org/10.1145/203241.203251
PostgreSQL. https://www.postgresql.org. Accessed 13 Oct 2017
Psycopg. http://initd.org/psycopg/. Accessed 13 Oct 2017
Python. https://www.python.org. Accessed 13 Oct 2017
Ruby on Rails. https://rubyonrails.org/. Accessed 13 Oct 2017
SAP Data Services. https://www.sap.com/products/data-services.html. Accessed 13 Oct 2017
Scriptella. http://scriptella.org. Accessed 13 Oct 2017
Simitsis, A., Vassiliadis, P., Terrovitis, M., Skiadopoulos, S.: Graph-based modeling of ETL activities with multi-level transformations and updates. In: Tjoa, A.M., Trujillo, J. (eds.) DaWaK 2005. LNCS, vol. 3589, pp. 43–52. Springer, Heidelberg (2005). https://doi.org/10.1007/11546849_5
Thomsen, C., Pedersen, T.B.: Building a web warehouse for accessibility data. In: Proceedings of DOLAP (2006). https://doi.org/10.1145/1183512.1183522
Thomsen, C., Pedersen, T.B.: A survey of open source tools for business intelligence. IJDWM 5(3), 56–75 (2009). https://doi.org/10.4018/jdwm.2009070103
Thomsen, C., Pedersen, T.B.: pygrametl: a powerful programming framework for extract-transform-load programmers. In: Proceedings of DOLAP, pp. 49–56 (2009). https://doi.org/10.1145/2064676.2064684
Thomsen, C., Pedersen, T.B.: pygrametl: a powerful programming framework for extract-transform-load programmers. DBTR-25, Aalborg University (2009). www.cs.aau.dk/DBTR
Thomsen, C., Pedersen, T.B.: Easy and effective parallel programmable ETL. In: Proceedings of DOLAP, pp. 37–44 (2011)
Trujillo, J., Luján-Mora, S.: A UML based approach for modeling ETL processes in data warehouses. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P. (eds.) ER 2003. LNCS, vol. 2813, pp. 307–320. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39648-2_25
Vaisman, A., Zimányi, E.: Data Warehouse Systems: Design and Implementation. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54655-6
Vassiliadis, P.: A survey of extract-transform-load technology. IJDWM 5(3), 1–27 (2009). https://doi.org/10.4018/jdwm.2009070101
Appendix A Data Warehouse Concepts
This appendix offers a very short introduction to concepts and terms used in the chapter. More details and explanations can be found in the literature [7, 8, 27]. In a data warehouse (DW), data from an organization’s different operational systems is stored in a way that supports analysis (rather than the daily operations which are supported by the operational systems). An Extract-Transform-Load (ETL) process extracts data from the source systems, transforms the data (to make it fit into the DW and to cleanse it), and loads it into the DW. Data is divided into facts and dimensions. Facts represent the subjects of the desired analyses (e.g., sales) and have numerical measures (e.g., sales amount). Dimensions provide context and describe facts (Product, Store, and Time are, for example, relevant dimensions for sales). Dimensions are thus used for selection of data and grouping of data in analyses. Dimensions have hierarchies with levels (a Time dimension can, for example, have the hierarchy Day → Month → Quarter → Year). Each of the levels can also have a number of attributes.
When using a relational database to represent a DW, one can choose between two approaches for the schema design. In a snowflake schema, each level in a dimension is represented by a table and the tables have foreign keys to the following levels. The dimension tables are thus normalized. In a star schema there is only one table for each dimension. This table thus represents all levels and is denormalized. In both star schemas and snowflake schemas, the facts are represented by a fact table which has a foreign key for each dimension and a column for each measure. In a star schema, the foreign keys reference the dimension tables while they reference the tables for the lowest levels of the dimensions in a snowflake schema. The keys used in a dimension table should be integers not carrying any special meaning. Such keys are called surrogate keys.
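The schema design described above can be sketched in SQL. The following is a minimal, illustrative star schema using Python's built-in SQLite module; the table and column names (product, store, sale, etc.) are examples chosen here and are not taken from the chapter.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# One denormalized table per dimension (star schema). The integer
# primary key serves as a surrogate key carrying no special meaning.
cur.execute("""CREATE TABLE product (
    productid INTEGER PRIMARY KEY,   -- surrogate key
    name TEXT, category TEXT)""")
cur.execute("""CREATE TABLE store (
    storeid INTEGER PRIMARY KEY,     -- surrogate key
    city TEXT, region TEXT)""")

# The fact table has a foreign key for each dimension and a column
# for each measure.
cur.execute("""CREATE TABLE sale (
    productid INTEGER REFERENCES product(productid),
    storeid INTEGER REFERENCES store(storeid),
    amount REAL)""")                 # measure

cur.execute("INSERT INTO product VALUES (1, 'Milk', 'Dairy')")
cur.execute("INSERT INTO store VALUES (1, 'Aalborg', 'North')")
cur.executemany("INSERT INTO sale VALUES (?, ?, ?)",
                [(1, 1, 10.0), (1, 1, 5.0)])

# Dimensions are used for selection and grouping in analyses.
total = cur.execute("""SELECT p.category, SUM(s.amount)
                       FROM sale s JOIN product p USING (productid)
                       GROUP BY p.category""").fetchone()
print(total)  # ('Dairy', 15.0)
```

In a snowflake schema, the product table would instead be split into normalized tables for each level (e.g., product and category) linked by foreign keys.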
Changes may happen in the represented world. It is thus necessary to be able to represent changes in dimensions. A dimension where changes are represented is called a slowly changing dimension (SCD). There are a number of different techniques for SCDs [8]. Here we will consider two of the most commonly used. For type 1 SCDs, changes are simply represented by overwriting old values in the dimension tables. If, for example, the size of a shop changes, the size attribute is updated. This can be problematic as old facts (e.g., facts about sales from the shop when it had the previous size) now refer to the updated dimension member such that history is not correctly represented. This problem is avoided with a type 2 SCD where a new version of the dimension member is created when there is a change. In other words, for type 2 SCDs, changes result in new rows in the dimension tables. In a type 2 SCD, there are often attributes called something like ValidFrom, ValidTo, MostCurrentVersion, and VersionNumber to provide information about the represented versions.
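The type 2 technique can be sketched as follows, again with SQLite. The shop dimension and its attribute names are illustrative assumptions, not the chapter's own schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE shop (
    shopid INTEGER PRIMARY KEY,  -- surrogate key; one per *version*
    name TEXT, size INTEGER,
    validfrom TEXT, validto TEXT, version INTEGER)""")

# Initial version of the dimension member.
cur.execute(
    "INSERT INTO shop VALUES (1, 'Shop A', 100, '2017-01-01', NULL, 1)")

# The shop's size changes. A type 1 SCD would simply overwrite:
#   UPDATE shop SET size = 150 WHERE name = 'Shop A'
# A type 2 SCD instead closes the current version and inserts a new
# row, so old facts still reference the historically correct version.
cur.execute("""UPDATE shop SET validto = '2018-01-01'
               WHERE name = 'Shop A' AND validto IS NULL""")
cur.execute(
    "INSERT INTO shop VALUES (2, 'Shop A', 150, '2018-01-01', NULL, 2)")

versions = cur.execute("""SELECT shopid, size, validto, version
                          FROM shop ORDER BY version""").fetchall()
print(versions)  # [(1, 100, '2018-01-01', 1), (2, 150, None, 2)]
```

A NULL ValidTo marks the most current version; facts loaded at a given time reference the surrogate key of the version valid at that time.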
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Thomsen, C., Andersen, O., Jensen, S.K., Pedersen, T.B. (2018). Programmatic ETL. In: Zimányi, E. (eds) Business Intelligence and Big Data. eBISS 2017. Lecture Notes in Business Information Processing, vol 324. Springer, Cham. https://doi.org/10.1007/978-3-319-96655-7_2
Print ISBN: 978-3-319-96654-0
Online ISBN: 978-3-319-96655-7