An Efficient Heuristic for Logical Optimization of ETL Workflows

  • Nitin Kumar
  • P. Sreenivasa Kumar
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 84)


An ETL process is used to extract data from various sources, transform it and load it into a Data Warehouse. In this paper, we analyse an ETL flow and observe that only some of the dependencies in an ETL flow are essential while others are basically represents the flow of data. For the linear flows, we exploit the underlying dependency graph and develop a greedy heuristic technique to determine a reordering that significantly improves the quality of the flow. Rather than adopting a state-space search approach, we use the cost functions and selectivities to determine the best option at each position in a right-to-left manner. To deal with complex flows, we identify activities that can be transferred between linear segments in it and position those activities appropriately. We then use the re-orderings of the linear segments to obtain a cost-optimal semantically equivalent flow for a given complex flow. Experimental evaluation has shown that by using the proposed techniques, ETL flows can be better optimized and with much less effort compared to existing methods.


Data integration Data Warehousing ETL Optimization 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Inmon, W.: Building the Data Warehouse, 3rd edn. Wiley & Sons, New York (2002)Google Scholar
  2. 2.
    Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Conceptual Modeling for ETL Processes. In: Proceedings of the 5th ACM International Workshop on Data Warehousing and OLAP (DOLAP 2002), pp. 14–21. ACM, New York (2002)CrossRefGoogle Scholar
  3. 3.
    Trujillo, J., Luján-Mora, S.: A UML Based Approach for Modeling ETL Processes in Data Warehouses. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P. (eds.) ER 2003. LNCS, vol. 2813, pp. 307–320. Springer, Heidelberg (2003)CrossRefGoogle Scholar
  4. 4.
    Eckerson, W., White, C.: (2003)
  5. 5.
    IBM: IBM data warehouse manager,
  6. 6.
    Oracle: Oracle warehouse builder 11g,
  7. 7.
  8. 8.
    Simitsis, A., Vassiliadis, P., Sellis, T.: State-Space Optimization of ETL Workflows. IEEE Trans. on Knowl. and Data Eng. 17(10), 1404–1419 (2005)CrossRefGoogle Scholar
  9. 9.
    Vassiliadis, P., Simitsis, A., Spiros, S.: Modeling ETL Activities as Graphs. In: 4th International Workshop on the Design and Management of Data Warehouses (DMDW 2002), pp. 52–61. IEEE Computer Society, Toronto (2002)Google Scholar
  10. 10.
    Vassiliadis, P., Simitsis, A., Baikousi, E.: A taxonomy of etl activities. In: DOLAP 2009: Proceeding of the ACM Twelfth International Workshop on Data Warehousing and OLAP, pp. 25–32. ACM, New York (2009)CrossRefGoogle Scholar
  11. 11.
    Tziovara, V., Vassiliadis, P., Simitsis, A.: Deciding the Physical Implementation of ETL Workflows. In: Proceedings of the ACM Tenth International Workshop on Data Warehousing and OLAP (DOLAP 2007), pp. 49–56. ACM, New York (2007)CrossRefGoogle Scholar
  12. 12.
    Vassiliadis, P., Karagiannis, A., Tziovara, V., Simitsis, A.: Towards a Benchmark for ETL Workflows. In: Proceedings of the 5th International Workshop on Quality in Databases (QDB 2007), in Conjunction with the 33rd International Conference on Very Large Data Bases (VLDB 2007), pp. 117–137 (2007)Google Scholar
  13. 13.
    Elmasri, R., Navathe, S.: Fundamentals of Database Systems. Addison-Wesley Pubs., Reading (2000)MATHGoogle Scholar
  14. 14.
    Simitsis, A., Vassiliadis, P., Sellis, T.: Optimizing ETL Processes in Data Warehouses. In: Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), pp. 564–575. IEEE Computer Society, Washington, DC (2005)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Nitin Kumar
    • 1
  • P. Sreenivasa Kumar
    • 1
  1. 1.Indian Institute of Technology MadrasChennaiIndia

Personalised recommendations