Towards a Model of Provenance and User Views in Scientific Workflows

  • Shirley Cohen
  • Sarah Cohen-Boulakia
  • Susan Davidson
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4075)


Scientific experiments are becoming increasingly large and complex, with a commensurate increase in the amount and complexity of data generated. Data, both intermediate and final results, are derived by chaining and nesting together multiple database searches and analytical tools. In many cases, the means by which the data are produced is not known, making the data difficult to interpret and the experiment impossible to reproduce. Provenance in scientific workflows is thus of paramount importance.

In this paper, we provide a formal model of provenance for scientific workflows which is general (i.e. can be used with existing workflow systems, such as Kepler, myGrid and Chimera) and sufficiently expressive to answer the provenance queries we encountered in a number of case studies. Interestingly, our model not only takes into account the chained and nested structure of scientific workflows, but also allows users to ask for provenance at different levels of abstraction (user views).
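The abstract does not spell out the formal model, so the following is only an illustrative sketch of the two ideas it names: provenance as what transitively contributed to a data object, and user views as a coarser grouping of steps. The run log `RUN`, the grouping `VIEW`, the step names `align` and `build_tree`, and the functions `provenance` and `collapse` are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical run log: step id -> (input data objects, output data objects).
RUN = {
    "s1": ({"d1"}, {"d2"}),
    "s2": ({"d2"}, {"d3"}),
    "s3": ({"d3", "d4"}, {"d5"}),
}

# Hypothetical user view: each step is assigned to the composite step the user sees.
VIEW = {"s1": "align", "s2": "align", "s3": "build_tree"}


def provenance(run, target):
    """Return (steps, data) that transitively contributed to `target`."""
    producer = {out: step for step, (_, outs) in run.items() for out in outs}
    steps, data = set(), set()
    frontier = [target]
    while frontier:
        step = producer.get(frontier.pop())
        if step is None or step in steps:
            continue  # user-supplied input, or step already visited
        steps.add(step)
        for d in run[step][0]:
            data.add(d)
            frontier.append(d)
    return steps, data


def collapse(run, view):
    """Merge steps into composites, hiding data consumed only inside one composite."""
    consumers = defaultdict(set)  # data object -> composites that consume it
    for step, (ins, _) in run.items():
        for d in ins:
            consumers[d].add(view[step])
    merged = {}
    for step, (ins, outs) in run.items():
        m_in, m_out = merged.setdefault(view[step], (set(), set()))
        m_in |= ins
        m_out |= outs
    result = {}
    for c, (m_in, m_out) in merged.items():
        internal = {d for d in m_out if consumers[d] == {c}}
        result[c] = (m_in - m_out, m_out - internal)
    return result


full_steps, full_data = provenance(RUN, "d5")
view_steps, view_data = provenance(collapse(RUN, VIEW), "d5")
```

In this toy run, the full provenance of `d5` contains all three steps and the intermediate object `d2`, whereas the view-level provenance contains only the two composite steps and omits `d2`, which is passed only inside `align`. Nested workflows and the paper's actual query semantics are not covered by this sketch.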


Keywords: Data Object; Transitive Closure; Service Invocation; Data Provenance; Provenance Information
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Shirley Cohen
    • 1
  • Sarah Cohen-Boulakia
    • 1
  • Susan Davidson
    • 1
  1. Department of Computer and Information Science, University of Pennsylvania, USA
