Abstract
Integrated provenance support promises to be a chief advantage of scientific workflow systems over script-based alternatives. While it is often recognized that information gathered during scientific workflow execution can be used automatically to increase fault tolerance (via checkpointing) and to optimize performance (by reusing intermediate data products in future runs), it is perhaps more significant that provenance information may also be used by scientists to reproduce results from earlier runs, to explain unexpected results, and to prepare results for publication. Current workflow systems offer little or no direct support for these “scientist-oriented” queries of provenance information. Indeed, the use of advanced execution models in scientific workflows (e.g., process networks, which exhibit pipeline parallelism over streaming data) and the failure to record certain fundamental events, such as state resets of processes, can render existing provenance schemas useless for scientific applications of provenance. We develop a simple provenance model that is capable of supporting a wide range of scientific use cases even for complex models of computation such as process networks. Our approach reduces these use cases to database queries over event logs, and is capable of reconstructing complete data and invocation dependency graphs for a workflow run.
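To make the idea of reducing provenance questions to database queries over an event log concrete, the following is a minimal sketch and not the model developed in the paper. It assumes a hypothetical log schema (an `event` table recording which invocation read or wrote which data token) and the simplifying assumption that every token written by an invocation depends on every token read by that same invocation. The table name, column names, and sample invocation and token identifiers are all illustrative.

```python
import sqlite3

# Hypothetical event log: (invocation, kind, token). Names are illustrative only.
EVENTS = [
    ("align:1", "read", "s1"),
    ("align:1", "read", "s2"),
    ("align:1", "write", "a1"),
    ("tree:1", "read", "a1"),
    ("tree:1", "write", "t1"),
    ("align:2", "read", "s3"),
    ("align:2", "write", "a2"),
]

# Recursive query: start from the target token, find the invocation that wrote
# it, then add every token that invocation read, and repeat.
# Assumption: an output depends on all inputs of the same invocation.
LINEAGE_SQL = """
WITH RECURSIVE dep(token) AS (
    SELECT :target
    UNION
    SELECT r.token
    FROM dep
    JOIN event AS w ON w.kind = 'write' AND w.token = dep.token
    JOIN event AS r ON r.kind = 'read' AND r.invocation = w.invocation
)
SELECT token FROM dep WHERE token <> :target ORDER BY token
"""


def build_log(events):
    """Load the events into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE event ("
        " seq INTEGER PRIMARY KEY,"
        " invocation TEXT NOT NULL,"
        " kind TEXT NOT NULL CHECK (kind IN ('read', 'write')),"
        " token TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO event (invocation, kind, token) VALUES (?, ?, ?)", events
    )
    return conn


def lineage(conn, token):
    """Return all tokens that the given token transitively depends on."""
    return [row[0] for row in conn.execute(LINEAGE_SQL, {"target": token})]


if __name__ == "__main__":
    conn = build_log(EVENTS)
    print(lineage(conn, "t1"))
```

This sketch treats each invocation as stateless. In a pipelined process network an actor may carry state across firings, so a log of this shape would also need to record events such as state resets before queries like this one could return correct dependencies.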
Work supported in part by SciDAC/SDM (DE-FC02-01ER25486), NSF/SEEK (DBI-0533368), and NSF/GEON (EAR-0225673).
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bowers, S., McPhillips, T., Ludäscher, B., Cohen, S., Davidson, S.B. (2006). A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. In: Moreau, L., Foster, I. (eds) Provenance and Annotation of Data. IPAW 2006. Lecture Notes in Computer Science, vol 4145. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11890850_15
DOI: https://doi.org/10.1007/11890850_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46302-3
Online ISBN: 978-3-540-46303-0
eBook Packages: Computer Science (R0)
