Encyclopedia of Database Systems

Living Edition
Editors: Ling Liu, M. Tamer Özsu

Extraction, Transformation, and Loading

  • Alkis Simitsis
  • Panos Vassiliadis
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7993-3_158-3



Extraction, transformation, and loading (ETL) processes are responsible for the operations that take place in the back stage of a data warehouse architecture. At a high level, an ETL process first extracts data from the source data stores, which can be online transaction processing (OLTP) or legacy systems, files in any format, web pages, various kinds of documents (e.g., spreadsheets and text documents), or even data arriving in a streaming fashion. Typically, only the data that have changed since the previous execution of the ETL process (newly inserted, updated, and deleted information) should be extracted from the sources. After this phase, the extracted data are propagated to a special-purpose area of the warehouse, called the data staging area (DSA), where their transformation, homogenization, and cleansing take place. The most frequently used...
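
The extraction of changed data described above can be illustrated with a small sketch. It assumes each source snapshot fits in memory as a dictionary keyed by primary key; the names `snapshot_diff`, `stage`, and `Delta`, as well as the sample records and the cleansing rules, are hypothetical and are not taken from the entry. It compares the previous and current snapshots to isolate inserted, updated, and deleted records, then applies a toy homogenization step of the kind that would run in the DSA.

```python
from typing import Dict, Hashable, NamedTuple, Tuple

Row = Tuple                     # a source record, without its key
Snapshot = Dict[Hashable, Row]  # primary key -> record


class Delta(NamedTuple):
    """Changes between two executions of the extraction step."""
    inserted: Snapshot
    updated: Snapshot
    deleted: Snapshot


def snapshot_diff(previous: Snapshot, current: Snapshot) -> Delta:
    """Compare two snapshots of a source and return only what changed."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deleted = {k: v for k, v in previous.items() if k not in current}
    return Delta(inserted, updated, deleted)


def stage(delta: Delta) -> Snapshot:
    """Hypothetical staging step: trim and normalize (name, country) rows."""
    staged = {}
    for key, (name, country) in {**delta.inserted, **delta.updated}.items():
        staged[key] = (name.strip().title(), country.strip().upper())
    return staged


# Illustrative data only: (name, country) records keyed by customer id.
previous = {1: ("alice", "gr"), 2: ("bob", "us"), 3: ("carol", "dk")}
current = {1: ("alice", "gr"), 2: ("bob", "it"), 4: (" dave ", "us")}

delta = snapshot_diff(previous, current)
staged = stage(delta)
```

Here record 1 is unchanged and is not extracted at all, which is the point of working with deltas. A production system would more likely rely on logs, triggers, or timestamps at the source than on full in-memory snapshots.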


Data Warehouse; Execution Engine; Cleaning Task; Data Mart; Data Warehouse Model
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Recommended Reading

  1. Akkaoui ZE, Zimányi E, Mazón J, Trujillo J. A BPMN-based design and maintenance framework for ETL processes. IJDWM. 2013;9(3):46–72.
  2. Dayal U, Castellanos M, Simitsis A, Wilkinson K. Data integration flows for business intelligence. In: Proceedings of EDBT 2009, 12th International Conference on Extending Database Technology. Saint Petersburg; 24–26 Mar 2009. p. 1–11.
  3. Fagin R, Kolaitis PG, Popa L. Data exchange: getting to the core. ACM Trans Database Syst. 2005;30(1):174–210.
  4. Grund M, Krüger J, Plattner H, Zeier A, Cudré-Mauroux P, Madden S. HYRISE – a main memory hybrid storage engine. PVLDB. 2010;4(2):105–16.
  5. Haas LM, Hernández MA, Ho H, Popa L, Roth M. Clio grows up: from research prototype to industrial tool. In: SIGMOD Conference. Baltimore; 2005. p. 805–10.
  6. Halasipuram R, Deshpande PM, Padmanabhan S. Determining essential statistics for cost based optimization of an ETL workflow. In: Proceedings of 17th International Conference on Extending Database Technology (EDBT). Athens; 24–28 Mar 2014. p. 307–18.
  7. Inmon W. Building the data warehouse. 2nd ed. New York: John Wiley & Sons; 1996.
  8. Kemper A, Neumann T. HyPer: a hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In: Proceedings of the 27th International Conference on Data Engineering ICDE. Hannover; 11–16 Apr 2011. p. 195–206.
  9. Kimball R, Reeves L, Ross M, Thornthwaite W. The data warehouse lifecycle toolkit: expert methods for designing, developing, and deploying data warehouses. New York: John Wiley & Sons; 1998.
  10. Labio W, Garcia-Molina H. Efficient snapshot differential algorithms for data warehousing. In: VLDB. Bombay; 1996. p. 63–74.
  11. Labio W, Wiener JL, Garcia-Molina H, Gorelik V. Efficient resumption of interrupted warehouse loads. In: Proceedings of ACM SIGMOD. New York: ACM; 2000. p. 46–57.
  12. Lenzerini M. Data integration: a theoretical perspective. In: PODS. Madison; 2002. p. 233–46.
  13. Liu X, Thomsen C, Pedersen TB. ETLMR: a highly scalable dimensional ETL framework based on MapReduce. Trans Large-Scale Data- Knowl-Cent Syst. 2013;8:1–31.
  14. Luján-Mora S, Vassiliadis P, Trujillo J. Data mapping diagrams for data warehouse design with UML. In: 23rd International Conference on Conceptual Modeling (ER 2004). Shanghai; 2004. p. 191–204.
  15. Oracle. Oracle9i SQL Reference. Release 9.2; 2002.
  16. Rahm E, Bernstein PA. A survey of approaches to automatic schema matching. VLDB J. 2001;10(4):334–50.
  17. Rizzi S, Abelló A, Lechtenbörger J, Trujillo J. Research in data warehouse modeling and design: dead or alive? In: DOLAP’06: Proceedings of the 9th ACM International Workshop on Data Warehousing and OLAP. New York; 2006. p. 3–10.
  18. Romero O, Simitsis A, Abelló A. GEM: requirement-driven generation of ETL and multidimensional conceptual designs. In: Proceedings Data Warehousing and Knowledge Discovery – 13th International Conference (DaWaK 2011). Toulouse; 29 Aug–2 Sept 2011. p. 80–95.
  19. Roth MT, Schwarz PM. Don’t scrap it, wrap it! a wrapper architecture for legacy data sources. In: VLDB. Athens; 1997. p. 266–75.
  20. Shu NC, Housel BC, Taylor RW, Ghosh SP, Lum VY. Express: a data extraction, processing, and restructuring system. ACM Trans Database Syst. 1977;2(2):134–74.
  21. Simitsis A, Vassiliadis P, Sellis TK. Optimizing ETL processes in data warehouses. In: Proceedings of the 21st International Conference on Data Engineering (ICDE’05). Tokyo; 2005. p. 564–75.
  22. Simitsis A, Vassiliadis P, Sellis TK. State-space optimization of ETL workflows. IEEE Trans Knowl Data Eng. 2005;17(10):1404–19.
  23. Simitsis A, Wilkinson K, Castellanos M, Dayal U. QoX-driven ETL design: reducing the cost of ETL consulting engagements. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD. Providence; 29 June–2 July 2009. p. 953–60.
  24. Simitsis A, Wilkinson K, Castellanos M, Dayal U. Optimizing analytic data flows for multiple execution engines. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). Scottsdale; 20–24 May 2012. p. 829–40.
  25. Skoutas D, Simitsis A. Designing ETL processes using semantic web technologies. In: Proceedings of the ACM 9th International Workshop on Data Warehousing and OLAP (DOLAP’06). New York; 2006. p. 67–74.
  26. Thomsen C, Pedersen TB. Easy and effective parallel programmable ETL. In: Proceedings of DOLAP 2011, ACM 14th International Workshop on Data Warehousing and OLAP. Glasgow; 28 Oct 2011. p. 37–44.
  27. TPC. TPC-DS (Decision Support) specification, draft version 52; Feb 2007.
  28. Trujillo J, Luján-Mora S. A UML based approach for modeling ETL processes in data warehouses. In: 22nd International Conference on Conceptual Modeling (ER 2003). Chicago; 2003. p. 307–20.
  29. Vassiliadis P, Karagiannis A, Tziovara V, Simitsis A. Towards a benchmark for ETL workflows. In: 5th International Workshop on Quality in Databases (QDB) at VLDB. Vienna; 2007.
  30. Vassiliadis P, Simitsis A, Skiadopoulos S. Conceptual modeling for ETL processes. In: Proceedings of the ACM 5th International Workshop on Data Warehousing and OLAP (DOLAP’02). McLean; 2002. p. 14–21.

Copyright information

© Springer Science+Business Media LLC 2017

Authors and Affiliations

  1. HP Labs, Palo Alto, USA
  2. University of Ioannina, Ioannina, Greece

Section editors and affiliations

  • Torben Bach Pedersen (1)
  • Stefano Rizzi (2)
  1. Department of Computer Science, Aalborg University, Aalborg, Denmark
  2. DISI, University of Bologna, Bologna, Italy