Representing Interoperable Provenance Descriptions for ETL Workflows

  • André Freitas
  • Benedikt Kämpgen
  • João Gabriel Oliveira
  • Seán O’Riain
  • Edward Curry
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7540)


The increasing availability of data on the Web, driven by the emergence of Web 2.0 applications and, more recently, by Linked Data, has added complexity to data management tasks: the number of available data sources and their associated heterogeneity have increased drastically. In this scenario, where data is reused and repurposed on a new scale, the Extract-Transform-Load (ETL) pattern emerges as a fundamental and recurrent process for both producers and consumers of data on the Web. Alongside ETL, provenance, the representation of the source artifacts, processes and agents behind data, becomes another cornerstone of Web data management, playing a fundamental role in data quality assessment and data semantics and in facilitating the reproducibility of data transformation processes. This paper proposes the convergence of these two Web data management concerns, introducing a principled provenance model for ETL processes in the form of a vocabulary based on the Open Provenance Model (OPM) standard, with a focus on interoperability across ETL environments. The proposed ETL provenance model is instantiated in a real-world sustainability reporting scenario.


ETL, Data transformation, Provenance, Linked Data, Web



The work presented in this paper has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2), by the German Ministry of Education and Research (BMBF) within the SMART project (Ref. 02WM0800), and by the European Community’s Seventh Framework Programme FP7/2007–2013 (PlanetData, Grant 257641).



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • André Freitas (1)
  • Benedikt Kämpgen (2)
  • João Gabriel Oliveira (1)
  • Seán O’Riain (1)
  • Edward Curry (1)

  1. Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, Ireland
  2. Institute AIFB, Karlsruhe Institute of Technology, Karlsruhe, Germany
