A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows

  • Shawn Bowers
  • Timothy McPhillips
  • Bertram Ludäscher
  • Shirley Cohen
  • Susan B. Davidson
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4145)

Abstract

Integrated provenance support promises to be a chief advantage of scientific workflow systems over script-based alternatives. While it is often recognized that information gathered during scientific workflow execution can be used automatically to increase fault tolerance (via checkpointing) and to optimize performance (by reusing intermediate data products in future runs), it is perhaps more significant that provenance information may also be used by scientists to reproduce results from earlier runs, to explain unexpected results, and to prepare results for publication. Current workflow systems offer little or no direct support for these “scientist-oriented” queries of provenance information. Indeed, the use of advanced execution models in scientific workflows (e.g., process networks, which exhibit pipeline parallelism over streaming data) and the failure to record certain fundamental events, such as state resets of processes, can render existing provenance schemas useless for scientific applications of provenance. We develop a simple provenance model that is capable of supporting a wide range of scientific use cases, even for complex models of computation such as process networks. Our approach reduces these use cases to database queries over event logs and is capable of reconstructing complete data and invocation dependency graphs for a workflow run.

Keywords

Output Port; Dependency Graph; Process Network; Data Provenance; Provenance Information
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Shawn Bowers (1)
  • Timothy McPhillips (1)
  • Bertram Ludäscher (1, 2)
  • Shirley Cohen (3)
  • Susan B. Davidson (3)
  1. UC Davis Genome Center, University of California, Davis
  2. Department of Computer Science, University of California, Davis
  3. Computer and Information Science, University of Pennsylvania
