The Requirements of Using Provenance in e-Science Experiments

Journal of Grid Computing

Abstract

In e-Science experiments, it is vital to record the experimental process for later use, for example to interpret results, verify that the correct process took place, or trace where data came from. The process that led to some data is called the provenance of that data, and a provenance architecture is the software architecture for a system that provides the functionality needed to record, store and use process documentation to determine the provenance of data items. However, there has been little principled analysis of what is actually required of a provenance architecture, so it is impossible to determine what functionality such architectures should ideally support. In this paper, we present use cases for a provenance architecture from current experiments in biology, chemistry, physics and computer science, and analyse them to determine the technical requirements of a generic, technology- and application-independent architecture. We propose an architecture that meets these requirements, compare its features with those of other approaches, and evaluate a preliminary implementation by attempting to realise two of the use cases.

Abbreviations

CGE: Candidate Gene Experiment
ICE: Intron Compressibility Experiment
PASOA: Provenance-Aware Service-Oriented Architecture
PDE: Particle Detection Experiment
PIE: Protein Identification Experiment
SHGE: Second Harmonic Generation Experiment
SOA: Service-Oriented Architecture
SRE: Service Reliability Experiment
STE: Security Testing Experiment
VDS: Virtual Data System


Author information

Corresponding author

Correspondence to Simon Miles.

About this article

Cite this article

Miles, S., Groth, P., Branco, M. et al. The Requirements of Using Provenance in e-Science Experiments. J Grid Computing 5, 1–25 (2007). https://doi.org/10.1007/s10723-006-9055-3

  • DOI: https://doi.org/10.1007/s10723-006-9055-3
