Applying the Virtual Data Provenance Model

  • Yong Zhao
  • Michael Wilde
  • Ian Foster
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4145)


In many domains of science, engineering, and commerce, data analysis systems are employed to derive new data (and ultimately, one hopes, knowledge) from datasets describing experimental results or simulated phenomena. To support such analyses, we have developed a “virtual data system” that allows users first to define, then to invoke, and finally to explore the provenance of procedures (and workflows comprising multiple procedure calls) that perform such data derivations. The underlying execution model is “functional” in the sense that procedures read (but do not modify) their input and produce output via deterministic computations. This property makes it straightforward for the virtual data system to record not only the recipe for producing any given data object but also enough information about the environment in which the recipe was executed, all with sufficient fidelity that the steps used to create a data object can be re-executed to reproduce the data object at a later time or a different location. The virtual data system maintains this information in an integrated schema alongside semantic annotations, and thus enables a powerful query capability in which the rich semantic information implied by knowledge of the structure of data derivation procedures can be exploited to provide an information environment that fuses recipe, history, and application-specific semantics. We provide here an overview of this integration, the queries and transformations that it enables, and examples of how these capabilities can serve scientific processes.
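The define, invoke, and explore cycle described above can be illustrated with a minimal Python sketch. It is not the authors' system or schema: the `Catalog` and `Derivation` classes, their method names, and the use of `platform.platform()` as a stand-in for the execution environment are all assumptions made for illustration. The sketch relies only on the functional property stated in the abstract, namely that procedures read their inputs without modifying them and compute their outputs deterministically.

```python
import platform
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Derivation:
    """One recorded procedure call: the recipe plus where it ran (hypothetical)."""
    procedure: str            # name of the procedure that was invoked
    inputs: Tuple[str, ...]   # names of the datasets that were read
    output: str               # name of the dataset that was produced
    environment: str          # coarse description of the execution environment


@dataclass
class Catalog:
    """Toy catalog holding procedure definitions, data, and provenance records."""
    procedures: Dict[str, Callable[..., bytes]] = field(default_factory=dict)
    data: Dict[str, bytes] = field(default_factory=dict)
    derivations: Dict[str, Derivation] = field(default_factory=dict)

    def define(self, name: str, fn: Callable[..., bytes]) -> None:
        """Register a procedure; it is assumed to be deterministic."""
        self.procedures[name] = fn

    def invoke(self, name: str, inputs: List[str], output: str) -> bytes:
        """Run a procedure and record how the output was derived."""
        args = [self.data[i] for i in inputs]  # bytes are immutable: read-only
        self.data[output] = self.procedures[name](*args)
        self.derivations[output] = Derivation(
            name, tuple(inputs), output, platform.platform()
        )
        return self.data[output]

    def lineage(self, dataset: str) -> List[Derivation]:
        """Return the derivations that led to a dataset, upstream steps first."""
        step = self.derivations.get(dataset)
        if step is None:          # raw data has no recorded derivation
            return []
        steps: List[Derivation] = []
        for name in step.inputs:
            for upstream in self.lineage(name):
                if upstream not in steps:
                    steps.append(upstream)
        steps.append(step)
        return steps

    def reproduce(self, dataset: str) -> bytes:
        """Re-execute the recorded recipe starting from the raw inputs."""
        step = self.derivations.get(dataset)
        if step is None:
            return self.data[dataset]
        args = [self.reproduce(name) for name in step.inputs]
        return self.procedures[step.procedure](*args)


catalog = Catalog()
catalog.data["raw"] = b"3 1 2"
catalog.define("sort", lambda b: b" ".join(sorted(b.split())))
catalog.define("first", lambda b: b.split()[0])
catalog.invoke("sort", ["raw"], "sorted")
catalog.invoke("first", ["sorted"], "minimum")

# Explore provenance, then check that re-execution yields the same object.
print([d.procedure for d in catalog.lineage("minimum")])
print(catalog.reproduce("minimum") == catalog.data["minimum"])
```

Because each procedure is a pure function of its inputs, the provenance record only needs the procedure name and the input names to make the derivation repeatable; a real system would additionally need versioned procedure definitions and a far richer environment description than this sketch records.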


Keywords (machine-generated, not supplied by the authors): Procedure Call; Procedure Definition; Provenance Information; Virtual Data; Argument Model



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Yong Zhao: University of Chicago
  • Michael Wilde: University of Chicago and Argonne National Laboratory
  • Ian Foster: University of Chicago and Argonne National Laboratory
