Ontology-Driven Provenance Management in eScience: An Application in Parasite Research

  • Satya S. Sahoo
  • D. Brent Weatherly
  • Raghava Mutharaju
  • Pramod Anantharam
  • Amit Sheth
  • Rick L. Tarleton
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5871)


Provenance, from the French word “provenir”, describes the lineage or history of a data entity. In scientific applications, provenance is critical for verifying experimental processes, validating data quality, and associating trust values with scientific results. Current industrial-scale eScience projects require an end-to-end provenance management infrastructure. This infrastructure needs to be underpinned by formal semantics to enable analysis of large-scale provenance information by software applications. Further, effective analysis of provenance information requires well-defined query mechanisms to support complex queries over large datasets. This paper introduces an ontology-driven provenance management infrastructure for biology experiment data, as part of the Semantic Problem Solving Environment (SPSE) for Trypanosoma cruzi (T.cruzi). This provenance infrastructure, called the T.cruzi Provenance Management System (PMS), is underpinned by (a) a domain-specific provenance ontology called the Parasite Experiment ontology, (b) specialized query operators for provenance analysis, and (c) a provenance query engine. The query engine uses a novel optimization technique based on materialized views, called materialized provenance views (MPV), to scale with increasing data size and query complexity. This comprehensive ontology-driven provenance infrastructure not only allows effective tracking and management of ongoing experiments in the Tarleton Research Group at the Center for Tropical and Emerging Global Diseases (CTEGD), but also enables researchers to retrieve the complete provenance information of scientific results for publication in the literature.


Provenance management framework, provenir ontology, Parasite Experiment ontology, provenance query operators, provenance query engine, eScience, Bioinformatics, T.cruzi parasite research





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Satya S. Sahoo
    • 1
  • D. Brent Weatherly
    • 2
  • Raghava Mutharaju
    • 1
  • Pramod Anantharam
    • 1
  • Amit Sheth
    • 1
  • Rick L. Tarleton
    • 2
  1. Kno.e.sis Center, Computer Science and Engineering Department, Wright State University, Dayton, USA
  2. Tarleton Research Group, CTEGD, University of Georgia, Athens, USA
