Provenance Collection Support in the Kepler Scientific Workflow System

  • Ilkay Altintas
  • Oscar Barney
  • Efrat Jaeger-Frank
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4145)


In many data-driven applications, analysis needs to be performed on scientific information obtained from several sources and generated by computations on distributed resources. Systematic analysis of this scientific information creates a growing need for automated data-driven applications that can also keep track of the provenance of the data and processes with little user interaction and overhead. Such data analysis can be facilitated by recent advances in scientific workflow systems. A major benefit of using scientific workflow systems is the ability to make provenance collection a part of the workflow. Specifically, provenance should include not only the standard data lineage information but also information about the context in which the workflow was used, the execution that processed the data, and the evolution of the workflow design. In this paper we describe a complete framework for data and process provenance in the Kepler Scientific Workflow System. We outline the requirements and issues related to data and workflow provenance in a multi-disciplinary workflow system and introduce how generic provenance capture can be facilitated in Kepler’s actor-oriented workflow environment. We also describe how the stored provenance information can be used to rerun scientific workflows efficiently.
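
The idea of using stored provenance for efficient rerun can be illustrated with a minimal sketch. This is not Kepler's implementation or API: the `ProvenanceRecorder` class, its `fire` method, and the two toy actors `scale` and `total` are hypothetical names chosen for illustration. The sketch assumes that an actor firing is deterministic and can be identified by its name, parameters, and inputs, so that a repeated firing can reuse recorded outputs instead of recomputing them.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FiringRecord:
    """One provenance entry: which actor ran, with what, producing what."""
    actor: str
    params: Dict[str, Any]
    inputs: List[Any]
    outputs: List[Any]


@dataclass
class ProvenanceRecorder:
    """Hypothetical recorder (an assumption, not Kepler's API)."""
    records: Dict[str, FiringRecord] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    @staticmethod
    def key(actor: str, params: Dict[str, Any], inputs: List[Any]) -> str:
        # Identify a firing by a hash of its actor name, parameters and inputs.
        payload = json.dumps([actor, params, inputs], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def fire(self, actor: str, func: Callable[..., List[Any]],
             params: Dict[str, Any], inputs: List[Any]) -> List[Any]:
        k = self.key(actor, params, inputs)
        if k in self.records:
            # Same actor, parameters and inputs were seen before: reuse outputs.
            self.log.append(f"reused {actor}")
            return self.records[k].outputs
        outputs = func(inputs, **params)
        self.records[k] = FiringRecord(actor, params, inputs, outputs)
        self.log.append(f"executed {actor}")
        return outputs


# Two toy actors standing in for real workflow components.
def scale(inputs: List[float], factor: float) -> List[float]:
    return [x * factor for x in inputs]


def total(inputs: List[float]) -> List[float]:
    return [sum(inputs)]


def run_workflow(recorder: ProvenanceRecorder,
                 data: List[float], factor: float) -> List[float]:
    scaled = recorder.fire("Scale", scale, {"factor": factor}, data)
    return recorder.fire("Sum", total, {}, scaled)
```

Running `run_workflow` twice with the same data and factor executes both actors only once; changing the factor invalidates the recorded firings downstream of the change, so those actors run again. A real system would also have to account for non-deterministic actors and external data sources, which this sketch ignores.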


Geographic Information System, Provenance Data, Provenance Information, Provenance Recorder, Kepler System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Ilkay Altintas (1)
  • Oscar Barney (2)
  • Efrat Jaeger-Frank (1)
  1. San Diego Supercomputer Center, University of California, San Diego, USA
  2. Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, USA
