Bridging Workflow and Data Provenance Using Strong Links

  • David Koop
  • Emanuele Santos
  • Bela Bauer
  • Matthias Troyer
  • Juliana Freire
  • Cláudio T. Silva
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6187)


As scientists continue to migrate their work to computational methods, it is important to track not only the steps involved in a computation but also the data consumed and produced. Existing approaches can capture this provenance information, but they often record only weak references between data and provenance. When data files or provenance are moved or modified, it can be difficult to find the data associated with the provenance or the provenance associated with the data. We propose a persistent storage mechanism that manages input, intermediate, and output data files, strengthening the links between provenance and data. This mechanism provides better support for reproducibility because it ensures that the data referenced in provenance information can be readily located. Another important benefit of such management is that it allows intermediate data to be cached and then shared with other users. We present an implemented infrastructure for managing data in a provenance-aware manner and demonstrate its application in scientific projects.
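
The idea of a strong link can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the managed store identifies files or keys its cache, so the code assumes that files are stored under a hash of their content and that cached outputs are indexed by a signature of the step that produced them. The names `ManagedStore`, `content_hash`, `upstream_signature`, and `demo` are hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def content_hash(path):
    """Return the SHA-1 hex digest of a file's bytes (assumed identifier)."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upstream_signature(step, params, input_hashes):
    """Hash a step name, its parameters, and the hashes of its inputs."""
    payload = json.dumps(
        {"step": step, "params": params, "inputs": list(input_hashes)},
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


class ManagedStore:
    """Hypothetical central store that keeps each file under its content hash."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.by_signature = {}  # upstream signature -> content hash

    def add(self, path, signature=None):
        """Copy a file into the store and return the hash provenance should record."""
        file_hash = content_hash(path)
        target = self.root / file_hash
        if not target.exists():
            shutil.copyfile(path, target)
        if signature is not None:
            self.by_signature[signature] = file_hash
        return file_hash

    def locate(self, file_hash):
        """Resolve a recorded hash to the managed copy of the file."""
        target = self.root / file_hash
        if not target.exists():
            raise KeyError(file_hash)
        return target

    def cached(self, signature):
        """Return the stored output for a signature, or None if nothing is cached."""
        file_hash = self.by_signature.get(signature)
        return None if file_hash is None else self.locate(file_hash)


def demo():
    """Store an input and an output, delete the originals, then resolve both."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        store = ManagedStore(tmp / "store")
        raw = tmp / "input.dat"
        raw.write_text("1 2 3\n")
        input_hash = store.add(raw)
        signature = upstream_signature("sum", {"scale": 1}, [input_hash])
        out = tmp / "sum.dat"
        out.write_text("6\n")
        store.add(out, signature=signature)
        # The working copies disappear, but the recorded hashes still resolve.
        raw.unlink()
        out.unlink()
        survived = store.locate(input_hash).read_text() == "1 2 3\n"
        hit = store.cached(signature).read_text()
        other = upstream_signature("sum", {"scale": 2}, [input_hash])
        return survived, hit, store.cached(other)
```

In this sketch, a provenance record would hold the returned hash rather than a file path, so moving or deleting the working copy does not break the reference. A step whose name, parameters, and input hashes match an earlier run can reuse the stored output instead of recomputing it.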


Strong Link, Upstream Signature, Central Store, Data Provenance, Provenance Information
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • David Koop (1)
  • Emanuele Santos (1)
  • Bela Bauer (2)
  • Matthias Troyer (2)
  • Juliana Freire (1)
  • Cláudio T. Silva (1)
  1. SCI Institute, University of Utah, USA
  2. Theoretische Physik, ETH Zurich, Switzerland
