An Efficient Heuristic for Logical Optimization of ETL Workflows
An ETL process extracts data from various sources, transforms it, and loads it into a Data Warehouse. In this paper, we analyse an ETL flow and observe that only some of its dependencies are essential, while the others merely represent the flow of data. For linear flows, we exploit the underlying dependency graph and develop a greedy heuristic that determines a reordering which significantly improves the quality of the flow. Rather than adopting a state-space search approach, we use the cost functions and selectivities of the activities to determine the best option at each position, proceeding from right to left. To deal with complex flows, we identify activities that can be transferred between the linear segments of the flow and position those activities appropriately. We then use the reorderings of the linear segments to obtain a cost-optimal, semantically equivalent flow for a given complex flow. Experimental evaluation shows that the proposed techniques optimize ETL flows better, and with much less effort, than existing methods.
Keywords: Data integration, Data Warehousing, ETL Optimization
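The right-to-left greedy idea for a linear flow can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the `Activity` class, the linear per-row cost model, the `must_precede` encoding of the essential dependencies, and the ranking rule `(selectivity - 1) / cost_per_row` (borrowed from classic predicate ordering) are all assumptions made for illustration, as are the activity names and numbers in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    cost_per_row: float  # assumption: cost grows linearly with input rows, > 0
    selectivity: float   # output rows / input rows


def flow_cost(order, input_rows):
    """Total cost of running the activities in the given order."""
    total, rows = 0.0, float(input_rows)
    for act in order:
        total += act.cost_per_row * rows
        rows *= act.selectivity
    return total


def greedy_reorder(activities, must_precede):
    """Fill positions from last to first.

    must_precede holds (before, after) name pairs: the essential
    dependencies that any reordering has to respect.
    """
    remaining = {a.name: a for a in activities}
    reversed_order = []
    while remaining:
        # An activity may take the current last free position only if
        # no still-unplaced activity is required to come after it.
        candidates = [
            a for a in remaining.values()
            if not any(before == a.name and after in remaining
                       for before, after in must_precede)
        ]
        if not candidates:
            raise ValueError("dependencies contain a cycle")
        # Assumed rule: push costly, non-reducing activities to the right.
        # The name is only a deterministic tie-breaker.
        pick = max(candidates,
                   key=lambda a: ((a.selectivity - 1.0) / a.cost_per_row, a.name))
        reversed_order.append(pick)
        del remaining[pick.name]
    return reversed_order[::-1]


# Hypothetical four-activity linear flow.
original = [
    Activity("surrogate_key", cost_per_row=5.0, selectivity=1.0),
    Activity("currency_convert", cost_per_row=2.0, selectivity=1.0),
    Activity("not_null_filter", cost_per_row=1.0, selectivity=0.5),
    Activity("range_filter", cost_per_row=1.0, selectivity=0.8),
]
# Only one essential dependency: the range check needs converted amounts.
essential = {("currency_convert", "range_filter")}

best = greedy_reorder(original, essential)
print([a.name for a in best])
print(flow_cost(original, 1000), flow_cost(best, 1000))  # 8500.0 4500.0
```

In this example the selective `not_null_filter` moves to the front and the expensive `surrogate_key` step moves to the end, while `currency_convert` still runs before `range_filter`. Each position is decided once, so no state-space search over orderings is needed.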