Understanding Collaborative Studies through Interoperable Workflow Provenance

  • Ilkay Altintas
  • Manish Kumar Anand
  • Daniel Crawl
  • Shawn Bowers
  • Adam Belloum
  • Paolo Missier
  • Bertram Ludäscher
  • Carole A. Goble
  • Peter M. A. Sloot
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6378)

Abstract

The provenance of a data product contains information about how the product was derived, and is crucial for enabling scientists to easily understand, reproduce, and verify scientific results. Currently, most provenance models are designed to capture the provenance of a single run, usually executed by a single user. However, a scientific discovery is often the result of the methodical execution of many scientific workflows over many datasets, produced at different times by one or more users. Further, to promote and facilitate the exchange of information between multiple workflow systems that support provenance, the scientific workflow community has proposed the Open Provenance Model (OPM). In this paper, we describe a new query model that captures implicit user collaborations. We show how this model maps to OPM and helps answer collaborative queries, e.g., identifying combined workflows and the contributions of users collaborating on a project, based on the records of previous workflow executions. We also adopt and extend the high-level Query Language for Provenance (QLP) with additional constructs, and show how these extensions allow non-expert users to express collaborative provenance queries against this model easily and concisely. Furthermore, we adopt the Provenance Challenge 3 (PC3) workflows as a collaborative and interoperable use case scenario, in which different stages of the workflow are executed in three different workflow environments: Kepler, Taverna, and WS-VLAM. Through this use case, we demonstrate how we can establish and understand collaborative studies through interoperable workflow provenance.
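
To make the idea of an implicit collaboration concrete, the following is a minimal Python sketch and not the query model or the QLP extensions described in the paper. It assumes provenance is recorded with three OPM edge types (used, wasGeneratedBy, wasControlledBy). The run names, artifact names, user names, and the assignment of runs to systems are hypothetical, as are the helper functions `upstream_processes`, `contributors`, and `collaborations`.

```python
# OPM-style dependency edges, stored as plain dictionaries.
# Every identifier below is hypothetical and chosen only for illustration.
used = {                      # process -> artifacts it consumed
    "kepler_run": {"raw_csv"},
    "taverna_run": {"loaded_table"},
    "wsvlam_run": {"detections"},
}
was_generated_by = {          # artifact -> process that produced it
    "loaded_table": "kepler_run",
    "detections": "taverna_run",
    "report": "wsvlam_run",
}
was_controlled_by = {         # process -> user (agent) who ran it
    "kepler_run": "alice",
    "taverna_run": "bob",
    "wsvlam_run": "carol",
}


def upstream_processes(artifact):
    """Return every process in the lineage of `artifact`."""
    seen, stack = set(), [artifact]
    while stack:
        current = stack.pop()
        process = was_generated_by.get(current)
        if process is None or process in seen:
            continue  # an input with no recorded producer, or already visited
        seen.add(process)
        stack.extend(used.get(process, ()))
    return seen


def contributors(artifact):
    """Users whose runs contributed, directly or indirectly, to `artifact`."""
    return {
        was_controlled_by[p]
        for p in upstream_processes(artifact)
        if p in was_controlled_by
    }


def collaborations(artifact):
    """Pairs (producer, consumer) where the consumer's run used the producer's output."""
    pairs = set()
    for process in upstream_processes(artifact):
        for consumed in used.get(process, ()):
            source = was_generated_by.get(consumed)
            if source is None:
                continue
            producer = was_controlled_by.get(source)
            consumer = was_controlled_by.get(process)
            if producer and consumer and producer != consumer:
                pairs.add((producer, consumer))
    return pairs


print(sorted(contributors("report")))
print(sorted(collaborations("report")))
```

In this sketch no user declares a collaboration. It is inferred only because one user's run consumed an artifact that another user's run produced, which is the sense in which the collaboration is implicit.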

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Ilkay Altintas (1, 2)
  • Manish Kumar Anand (1)
  • Daniel Crawl (1)
  • Shawn Bowers (3)
  • Adam Belloum (2)
  • Paolo Missier (4)
  • Bertram Ludäscher (5)
  • Carole A. Goble (4)
  • Peter M. A. Sloot (2)

  1. San Diego Supercomputer Center, University of California, San Diego, USA
  2. Computational Science, University of Amsterdam, The Netherlands
  3. Department of Computer Science, Gonzaga University, USA
  4. School of Computer Science, University of Manchester, Manchester, UK
  5. UC Davis Genome Center, University of California, Davis, USA
