Distributed and Parallel Databases, Volume 36, Issue 1, pp 219–264

P-PIF: a ProvONE provenance interoperability framework for analyzing heterogeneous workflow specifications and provenance traces

  • Ajinkya Prabhune
  • Aaron Zweig
  • Rainer Stotzka
  • Jürgen Hesser
  • Michael Gertz
Part of the following topical collections:
  1. Special Issue on Large-Scale Data Curation and Metadata Management


Enabling provenance interoperability by analyzing heterogeneous provenance information from different scientific workflow management systems (WfMSs) is a novel research topic. With the advent of the ProvONE model, it is now possible to model both prospective and retrospective provenance in a single provenance model. Scientific workflows are composed using a declarative definition language, such as BPEL, SCUFL/t2flow, or MoML. Associated with the execution of a workflow is its corresponding provenance, which is modeled and stored in the data model specified by the workflow system. However, sharing provenance generated by heterogeneous workflows is a challenging task, and this difficulty prevents the aggregate analysis and comparison of workflows and their associated provenance. To address these challenges, this paper introduces a ProvONE-based Provenance Interoperability Framework (P-PIF) that completely automates the modeling of provenance from heterogeneous WfMSs by: (a) automatically translating the scientific workflows to their equivalent representation in a ProvONE prospective graph using the Prov2ONE algorithm, (b) enriching the ProvONE prospective graph with the retrospective provenance exported by the WfMSs, and (c) natively storing the ProvONE provenance graphs in a Resource Description Framework (RDF) triplestore that supports the SPARQL query language for querying and retrieving ProvONE graphs. The Prov2ONE algorithm is based on a set of vocabulary translation rules between workflow specifications and the ProvONE model. A proof of the algorithm's correctness and completeness is given, and its complexity is analyzed. Moreover, to demonstrate the practical applicability of the complete framework, ProvONE graphs for workflows defined in BPEL, SCUFL, and MoML are generated. Finally, the provenance challenge queries are extended with six additional queries for retrieving the provenance modeled in ProvONE.
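
The three steps (a) to (c) can be illustrated with a small sketch. This is not the Prov2ONE algorithm or the P-PIF implementation: the input dictionary layout, the `ex:` identifiers, and the function names are hypothetical, plain Python tuples stand in for an RDF triplestore, and the `match` helper only mimics a single SPARQL triple pattern. The class and property names (`provone:Workflow`, `provone:Program`, `provone:hasSubProgram`, `provone:Execution`, `prov:qualifiedAssociation`, `prov:hadPlan`) are taken from the ProvONE and PROV-O vocabularies.

```python
# Toy illustration only; names and data layout are assumptions, not the paper's code.
RDF_TYPE = "rdf:type"

def to_prospective_triples(workflow):
    """(a) Translate a toy workflow description into prospective triples."""
    wf = "ex:" + workflow["name"]
    triples = {(wf, RDF_TYPE, "provone:Workflow")}
    for step in workflow["steps"]:
        prog = "ex:" + step
        triples.add((prog, RDF_TYPE, "provone:Program"))
        triples.add((wf, "provone:hasSubProgram", prog))
    return triples

def add_execution(triples, run_id, step):
    """(b) Enrich the graph with one retrospective execution record."""
    run, assoc = "ex:" + run_id, "ex:" + run_id + "_assoc"
    enriched = set(triples)
    enriched.add((run, RDF_TYPE, "provone:Execution"))
    enriched.add((run, "prov:qualifiedAssociation", assoc))
    enriched.add((assoc, RDF_TYPE, "prov:Association"))
    enriched.add((assoc, "prov:hadPlan", "ex:" + step))
    return enriched

def match(triples, s=None, p=None, o=None):
    """(c) Stand-in for one SPARQL triple pattern; None acts as a variable."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Hypothetical two-step workflow, executed once for the "align" step.
graph = to_prospective_triples({"name": "wf1", "steps": ["align", "reslice"]})
graph = add_execution(graph, "run1", "align")
sub_programs = match(graph, p="provone:hasSubProgram")
```

Because prospective and retrospective statements share one graph, a single pattern such as `match(graph, p="prov:hadPlan")` links an execution back to the program it ran, which is the kind of combined query that a common model makes possible.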


Provenance interoperability, ProvONE provenance model, Prospective provenance, Retrospective provenance, SPARQL, RDF, Workflow Management System



This research is supported by the Portfolio Extension of the Helmholtz Association "Large Scale Data Management and Analysis" and the DFG (German Research Foundation) MASi project (STO 397/4-1). We are thankful to Kay-Michael Wuerzner and the OCR-D team for contributing their use case and volunteering as pilot adopters of P-PIF.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  • Ajinkya Prabhune (1)
  • Aaron Zweig (2)
  • Rainer Stotzka (1)
  • Jürgen Hesser (3)
  • Michael Gertz (4)

  1. Karlsruhe Institute of Technology, Eggenstein-Leopoldshafen, Germany
  2. Department of Mathematics, Stanford University, Stanford, USA
  3. Department of Radiation Oncology, Heidelberg University, Heidelberg, Germany
  4. Institute of Computer Science, Heidelberg University, Heidelberg, Germany
