Computing Data Lineage and Business Semantics for Data Warehouse

  • Kalle Tomingas
  • Priit Järv
  • Tanel Tammet
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 914)


We present and validate a method, together with the underlying technologies, data structures and algorithms, to calculate, categorize and visualize component dependencies, data lineage and business semantics from database structures and queries, independently of the actual data in the data warehouse. The chosen approach is based on semantic techniques, probabilistic weight calculation and estimation of the impact of data in queries; an implemented rule system supports the calculation of the dependency graph from these estimates. We demonstrate a method for business semantics integration and ontology learning from data structures and schemas, combined with the query semantics captured by the dependency graph. Annotating technical assets with a business ontology provides meaning and a governance view for human and machine agents addressing various planning, automation and decision support problems. Data processing performance and business ontology integration are evaluated and analyzed over several real-life datasets.
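
The idea of a weighted dependency graph can be illustrated with a minimal sketch. This is not the authors' implementation: the column names, the edge weights and the rule of multiplying weights along a path are all assumptions chosen only to show how lineage and impact could be read from query-derived dependencies without touching the stored data.

```python
from collections import defaultdict

# Hypothetical column-level dependencies, as they might be extracted from
# queries: (source, target, weight). The weight in (0, 1] is an assumed
# estimate of how strongly the target depends on the source.
EDGES = [
    ("stg.orders.amount", "dw.fact_sales.amount", 1.0),
    ("stg.orders.currency", "dw.fact_sales.amount", 0.5),
    ("dw.fact_sales.amount", "rpt.revenue.total", 0.8),
]


def build_graph(edges):
    """Build an adjacency map {source: {target: weight}}."""
    graph = defaultdict(dict)
    for source, target, weight in edges:
        # Keep the strongest weight if the same edge is reported twice.
        graph[source][target] = max(weight, graph[source].get(target, 0.0))
    return graph


def downstream_impact(graph, start):
    """Return {node: score} for every node reachable from start.

    The score is the largest product of edge weights over any path from
    start (an assumed scoring rule). Because weights are at most 1, a
    cycle can never improve a score, so the loop terminates.
    """
    best = {start: 1.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for target, weight in graph.get(node, {}).items():
            score = best[node] * weight
            if score > best.get(target, 0.0):
                best[target] = score
                stack.append(target)
    del best[start]
    return best


impact = downstream_impact(build_graph(EDGES), "stg.orders.currency")
for node, score in sorted(impact.items()):
    print(f"{node}: {score:.2f}")
```

Running the traversal in the opposite direction (with the edges reversed) would give the upstream lineage of a report column under the same assumptions.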


Data warehouse, Data lineage, Dependency analysis, Data flow visualization, Business semantics, Business ontology



The research has been supported by the EU through the European Regional Development Fund.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Tallinn University of Technology, Tallinn, Estonia
