Prov2ONE: An Algorithm for Automatically Constructing ProvONE Provenance Graphs

  • Ajinkya Prabhune
  • Aaron Zweig
  • Rainer Stotzka
  • Michael Gertz
  • Juergen Hesser
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9672)

Abstract

Provenance traces history within workflows and enables researchers to validate and compare their results. Currently, modelling provenance in ProvONE is an arduous task that lacks an automated approach. This paper introduces a novel algorithm, called Prov2ONE, that automatically generates the ProvONE prospective provenance for scientific workflows defined in BPEL4WS. The same prospective ProvONE graph is updated with the relevant retrospective provenance. This prevents provenance from being captured in various non-standard provenance models and thus enables research communities to share, compare and analyze workflows and their associated provenance. Finally, using the Prov2ONE algorithm, a ProvONE provenance graph for the nanoscopy workflow is generated.

1 Introduction

In the last decade, research communities have adopted workflow management systems (WfMS) for orchestrating their complex scientific workflows. Nanoscopy is a novel imaging technique in biological and medical research that aims to reduce the resolution gap between conventional light microscopy and electron microscopy [1]. In a nanoscopy workflow, the raw image datasets acquired by high-resolution microscopes are processed in multiple stages to produce the final results. The Nanoscopy Open Reference Data Repository (NORDR) [2] allows researchers to store, process and access their data. To execute the nanoscopy workflows, a WfMS is integrated with NORDR. A critical aspect of NORDR is the management of provenance information.

The paper addresses three main requirements of managing provenance in NORDR: (i) enable automated modelling of both prospective and retrospective provenance in a single provenance model; (ii) design an extensible provenance management component for NORDR; (iii) provision a dedicated provenance storage system with efficient query processing. To fulfill the first requirement, the paper presents the Prov2ONE algorithm, which generates a ProvONE [3] provenance graph for BPEL4WS workflows. The algorithm is based on ProvONE because the Open Provenance Model (OPM) [5] and PROV [6] are limited to modelling retrospective provenance only [4]. The second requirement is met by the provenance management architecture for NORDR. Finally, for efficient storage and retrieval of provenance information, the ProvONE graphs are stored in a graph database (ArangoDB).
Fig. 1. Provenance management in NORDR

Fig. 2. Nanoscopy workflow defined in BPEL4WS
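
As a rough illustration of the storage requirement, the sketch below turns a labeled graph into the vertex and edge documents that a document-oriented graph database such as ArangoDB expects, where each vertex carries a `_key` and each edge carries `_from` and `_to` handles of the form `collection/key`. The collection name `provVertices`, the `label` attribute, the helper `to_arango_documents` and the sample keys are assumptions made for this example; the paper does not describe its storage schema.

```python
import json


def to_arango_documents(vertex_label, edge_label, vertex_collection="provVertices"):
    """Turn a labeled graph into vertex and edge documents.

    vertex_label maps a vertex key to its label; edge_label maps a
    (source, target) pair to its label. Names are hypothetical.
    """
    vertex_docs = [
        {"_key": key, "label": label} for key, label in sorted(vertex_label.items())
    ]
    edge_docs = [
        {
            "_from": f"{vertex_collection}/{source}",
            "_to": f"{vertex_collection}/{target}",
            "label": label,
        }
        for (source, target), label in sorted(edge_label.items())
    ]
    return vertex_docs, edge_docs


# Two processes joined by one sequence control link (illustrative data only).
vertex_label = {"p1": "Process", "cl1": "SeqCtrlLink", "p2": "Process"}
edge_label = {("p1", "cl1"): "sourcePToCL", ("cl1", "p2"): "CLtoDestP"}
vertex_docs, edge_docs = to_arango_documents(vertex_label, edge_label)
print(json.dumps(edge_docs, indent=2))
```

The documents are plain dictionaries, so they could be passed to whichever database driver is in use; no driver call is shown here because the paper does not name one.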

2 Provenance Management Architecture

Figure 1 briefly describes the components of the NORDR system that are essential for modelling, collecting or storing the provenance information of a scientific workflow.

Workflow (WF) Engine: The WF engine is responsible for interpreting the workflow definition and invoking the necessary data processing services.

NORDR: NORDR has a multi-layered architecture whose modules primarily offer data processing and data storage services.

Provenance Manager: The Provenance Manager is responsible for handling all the provenance information generated before, during and after the execution of each scientific workflow. It comprises four modules: (i) the Prov2ONE module holds the implementation of the Prov2ONE algorithm; (ii) the NORDR Provenance Collector module collects the retrospective provenance information from NORDR; (iii) the WF Engine Provenance Collector module collects the retrospective provenance information from the WF engine; (iv) the OPM/PROV Provenance Exporter module enables interoperability between ProvONE and the OPM/PROV standards.
Fig. 3. ProvONE graph of nanoscopy workflow (Color figure online)

3 Prov2ONE Algorithm

The Prov2ONE algorithm comprises two components. In the first component, BPEL4WS activities are classified as either structure activities or operation activities. Each structure activity is pushed onto a stack, and its head and tail sets are determined from the preceding structure activities. The algorithm then recurses on the children of that structure activity and pops it from the stack upon completion. In the second component, nodes labeled from the set \(\varSigma\) = {Workflow, Process, InputPort, OutputPort, DataLink, SeqCtrlLink} are created, and the relevant associations, labeled from the set \(\varOmega\) = {sourcePToCL, CLtoDestP, hasInPort, hasOutPort, DLToInPort, outPortToDL}, are drawn. This step is carried out in the GenerateProvOne method of Algorithm 2. A ProvONE graph is defined as \(G = (V, E, \lambda, \psi)\), with a set of vertices \(V = \{v_1, v_2, v_3, \ldots, v_n\}\), a set of edges \(E \subseteq V \times V\), a vertex labeling function \(\lambda: V \rightarrow \varSigma\), and an edge labeling function \(\psi: E \rightarrow \varOmega\). The Prov2ONE algorithm is tested on the nanoscopy workflow shown in Fig. 2.
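
The two components can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 2, which is not reproduced in this text, and it rests on several assumptions: workflows are given as nested `(kind, body)` tuples rather than parsed BPEL4WS, only `sequence`, `flow` and `invoke` activities are handled, Python's call stack stands in for the explicit stack of structure activities, and the names `ProvGraph`, `walk`, `add_control_link` and `generate` are hypothetical. Ports and data links are omitted for brevity; they would be added with the remaining labels in the same way.

```python
SIGMA = {"Workflow", "Process", "InputPort", "OutputPort", "DataLink", "SeqCtrlLink"}
OMEGA = {"sourcePToCL", "CLtoDestP", "hasInPort", "hasOutPort", "DLToInPort", "outPortToDL"}


class ProvGraph:
    """G = (V, E, lambda, psi) with labels restricted to SIGMA and OMEGA."""

    def __init__(self):
        self.vertex_label = {}  # lambda: V -> SIGMA
        self.edge_label = {}    # psi: E -> OMEGA, keyed by (source, target)

    def add_vertex(self, vertex, label):
        if label not in SIGMA:
            raise ValueError(f"unknown vertex label: {label}")
        self.vertex_label[vertex] = label

    def add_edge(self, source, target, label):
        if label not in OMEGA:
            raise ValueError(f"unknown edge label: {label}")
        if source not in self.vertex_label or target not in self.vertex_label:
            raise KeyError("both endpoints must already be vertices")
        self.edge_label[(source, target)] = label


def add_control_link(graph, source, target):
    """Connect two Process vertices through a SeqCtrlLink vertex."""
    link = f"cl:{source}->{target}"
    graph.add_vertex(link, "SeqCtrlLink")
    graph.add_edge(source, link, "sourcePToCL")
    graph.add_edge(link, target, "CLtoDestP")


def walk(activity, graph):
    """Add an activity to the graph and return its (head, tail) sets."""
    kind, body = activity
    if kind == "invoke":  # operation activity: one Process vertex
        graph.add_vertex(body, "Process")
        return {body}, {body}
    if kind not in ("sequence", "flow"):
        raise ValueError(f"unsupported activity: {kind}")
    # Structure activity: recurse on the children first.
    parts = [walk(child, graph) for child in body]
    if not parts:
        return set(), set()
    if kind == "sequence":
        # Link every tail of one child to every head of the next child.
        for (_, tails), (heads, _) in zip(parts, parts[1:]):
            for source in sorted(tails):
                for target in sorted(heads):
                    add_control_link(graph, source, target)
        return parts[0][0], parts[-1][1]
    # flow: parallel branches, so no links between siblings.
    heads = set().union(*(h for h, _ in parts))
    tails = set().union(*(t for _, t in parts))
    return heads, tails


def generate(name, root):
    """Build the prospective graph for a workflow given as nested tuples."""
    graph = ProvGraph()
    graph.add_vertex(name, "Workflow")
    head, tail = walk(root, graph)
    return graph, head, tail


# Illustrative workflow: A, then B and C in parallel, then D.
workflow = ("sequence", [("invoke", "A"),
                         ("flow", [("invoke", "B"), ("invoke", "C")]),
                         ("invoke", "D")])
graph, head, tail = generate("wf", workflow)
links = sorted(v for v, label in graph.vertex_label.items() if label == "SeqCtrlLink")
print(links)
```

Returning the head and tail sets from each recursive call is what lets a `sequence` connect to a nested `flow` without knowing its internal shape: only the first and last processes of each child are visible to the parent.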

4 Conclusion and Future Work

This paper presented a novel algorithm, called Prov2ONE, that generates the ProvONE prospective provenance graph for an arbitrary BPEL4WS workflow. Figure 3 shows the ProvONE graph generated by the Prov2ONE algorithm for the nanoscopy workflow. During the execution of the workflow, the retrospective ProvONE elements, i.e. ProcessExec, Data and User, are linked to the prospective ProvONE graph with the associations wasAssociatedWith, wasGeneratedBy, used, wasDerivedFrom and dataOnLink. The services for collecting and appending the retrospective provenance to the prospective ProvONE graph are implemented in the Provenance Manager component. By modelling both the prospective and retrospective provenance of a scientific workflow in ProvONE, the redundant task of collecting, storing and maintaining provenance in various systems is avoided. The architecture of the NORDR system is shown in Fig. 1; to enable efficient storage and querying of the provenance information, a graph database is used. Currently, we are implementing the OPM/PROV exporter module based on a formal semantic mapping between ProvONE and OPM/PROV.
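
How retrospective provenance might be appended to an existing prospective graph can be sketched as follows. This is an illustration under stated assumptions, not the Provenance Manager's implementation: the graph is a plain dictionary, the function `append_execution` and all identifiers in the example are hypothetical, edge directions follow the usual PROV reading (an execution `used` data, data `wasGeneratedBy` an execution), `dataOnLink` is assumed to point from a Data item to a prospective DataLink, and every output is assumed to be derived from every input of the same execution.

```python
RETRO_EDGES = {"wasAssociatedWith", "wasGeneratedBy", "used", "wasDerivedFrom", "dataOnLink"}


def append_execution(graph, exec_id, user, used, generated):
    """Append one process execution to a graph holding the prospective part.

    `used` and `generated` map each data item id to the id of the prospective
    DataLink vertex it travelled on.
    """
    vertices, edges = graph["vertices"], graph["edges"]

    def link(source, target, label):
        if label not in RETRO_EDGES:
            raise ValueError(f"unknown association: {label}")
        edges.add((source, target, label))

    def add_data(data_id, link_id):
        # The data item must attach to a DataLink that already exists.
        if vertices.get(link_id) != "DataLink":
            raise KeyError(f"{link_id} is not a DataLink in the prospective graph")
        vertices[data_id] = "Data"
        link(data_id, link_id, "dataOnLink")

    vertices[exec_id] = "ProcessExec"
    vertices[user] = "User"
    link(exec_id, user, "wasAssociatedWith")
    for data_id, link_id in used.items():
        add_data(data_id, link_id)
        link(exec_id, data_id, "used")
    for data_id, link_id in generated.items():
        add_data(data_id, link_id)
        link(data_id, exec_id, "wasGeneratedBy")
        for source_id in used:  # assumption: each output derives from each input
            link(data_id, source_id, "wasDerivedFrom")


# Prospective fragment with two data links around process B (illustrative only).
graph = {
    "vertices": {"B": "Process", "dl:A->B": "DataLink", "dl:B->C": "DataLink"},
    "edges": set(),
}
append_execution(graph, "exec-1", "alice",
                 used={"raw.tif": "dl:A->B"},
                 generated={"out.tif": "dl:B->C"})
print(sorted(graph["edges"]))
```

Because the retrospective vertices are written into the same structure as the prospective ones, a single traversal can move from a data item to the data link it travelled on.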

References

  1. Cremer, C.: Optics far beyond the diffraction limit. In: Träger, F. (ed.) Handbook of Lasers and Optics, pp. 1359–1397. Springer, Heidelberg (2012)
  2. Prabhune, A., et al.: An optimized generic client service API for managing large datasets within a data repository. In: IEEE BigDataService, pp. 44–51. IEEE (2015)
  3. Cuevas-Vicenttín, V., et al.: ProvONE: A PROV Extension Data Model for Scientific Workflow Provenance (2015). http://purl.org/provone
  4. Freire, J., Koop, D., Santos, E., Silva, C.T.: Provenance for computational tasks: a survey. Comput. Sci. Eng. 10(3), 11–21 (2008)
  5. Moreau, L.: The open provenance model core specification (v1.1). Future Gener. Comput. Syst. 27(6), 743–756 (2011)
  6. Moreau, L., Missier, P., et al. (eds.): PROV-DM: The PROV Data Model. W3C Recommendation (2013). http://www.w3.org/TR/prov-dm/

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Ajinkya Prabhune (1)
  • Aaron Zweig (1)
  • Rainer Stotzka (1)
  • Michael Gertz (2)
  • Juergen Hesser (3)

  1. Institute for Data Processing and Electronics, Karlsruhe Institute of Technology, Karlsruhe, Germany
  2. Institute of Computer Science, Heidelberg University, Heidelberg, Germany
  3. Department of Radiation Oncology, Heidelberg University, Heidelberg, Germany
