
1 Introduction

Depending on the size and complexity of a scientific experiment, it can be divided (modelled) into two or more workflows, i.e. fragments [1]. This division can ease the management of the experiment, reduce the total execution time, and enable cooperative work in which each research team works on parts of the experiment in an “independent” way [2]. However, data provenance management becomes a challenge when a workflow (and its fragments) needs to be executed in more than one Scientific Workflow Management System (SWfMS), and each SWfMS has its own associated provenance model.

For scientists to analyze, share, and combine provenance data generated by different systems, it is necessary to ensure interoperability between these SWfMSs. Some authors propose an additional layer at a higher level of abstraction [3] to mediate between the provenance data items collected in the various SWfMSs. In our view, the provenance data should instead be “imported” into one of the SWfMSs used (preferably the one the scientist is most familiar with) so that the analysis is performed in a single system, taking advantage of that system’s existing analysis infrastructure. Thus, in this paper we propose Géfyra, an approach for provenance data interoperability between existing SWfMSs. Géfyra is based on a recently proposed provenance model called PROV-Wf [4].

2 Géfyra: Making Provenance Interoperable

Our main goal in this paper is to provide a bridge between different SWfMSs that allows scientists to analyze, in one SWfMS, provenance data generated by another. We therefore named our approach Géfyra, which means “bridge” in Greek. We designed a representation schema in XML Schema (which we call the PROV-Wf Schema) to create and/or validate provenance data from heterogeneous data sources. While designing it, we reused some elements of PROV-XML [5] and included all entities and relationships of the PROV-Wf conceptual model. The resulting schema is available at www.ic.uff.br/~vanessa/papers/PROV-Wf.xsd.

The Géfyra architecture is shown in Fig. 1. To convert provenance data from SWfMS A to SWfMS B, the Géfyra Broker triggers the cartridge of SWfMS A, which converts the data stored in SWfMS A’s provenance repository to an XML file that follows the PROV-Wf Schema. This XML file is then sent to the Géfyra Broker, which stores it in the PROV-Wf Repository and forwards it to the cartridge of SWfMS B. That cartridge converts the XML file to SWfMS B’s provenance repository format and stores the provenance information in SWfMS B’s repository. Note that each cartridge knows how to convert from one specific SWfMS format to the PROV-Wf XML format, and vice versa.
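
The conversion flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Géfyra implementation: the class names (Cartridge, InMemoryCartridge, Broker), the XML element names, and the sample activity row are all hypothetical, and the XML produced here does not follow the actual PROV-Wf Schema.

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET


class Cartridge(ABC):
    """Hypothetical interface: converts one SWfMS's provenance to XML and back."""

    @abstractmethod
    def export_xml(self) -> str:
        """Read the SWfMS provenance repository and return it as XML text."""

    @abstractmethod
    def import_xml(self, xml_text: str) -> None:
        """Parse XML text and store its content in the SWfMS repository."""


class InMemoryCartridge(Cartridge):
    """Toy cartridge whose 'repository' is a list of (activity, status) rows."""

    def __init__(self, rows=None):
        self.rows = list(rows or [])

    def export_xml(self) -> str:
        # Element and attribute names are placeholders, not PROV-Wf names.
        root = ET.Element("provenance")
        for name, status in self.rows:
            ET.SubElement(root, "activity", name=name, status=status)
        return ET.tostring(root, encoding="unicode")

    def import_xml(self, xml_text: str) -> None:
        root = ET.fromstring(xml_text)
        for node in root.iter("activity"):
            self.rows.append((node.get("name"), node.get("status")))


class Broker:
    """Triggers the source cartridge, archives the XML, feeds the target."""

    def __init__(self):
        self.cartridges = {}
        self.repository = []  # stands in for the PROV-Wf Repository

    def register(self, name: str, cartridge: Cartridge) -> None:
        self.cartridges[name] = cartridge

    def convert(self, source: str, target: str) -> str:
        xml_text = self.cartridges[source].export_xml()
        self.repository.append(xml_text)
        self.cartridges[target].import_xml(xml_text)
        return xml_text


# Usage: move one provenance row from system "A" to system "B".
broker = Broker()
a = InMemoryCartridge([("alignment", "FINISHED")])
b = InMemoryCartridge()
broker.register("A", a)
broker.register("B", b)
broker.convert("A", "B")
```

In this sketch the broker never inspects the XML; all knowledge of a system's format lives in its cartridge. A real cartridge would also validate the XML against the PROV-Wf Schema before importing it.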

Fig. 1. The Géfyra conceptual architecture.

3 Experimental Evaluation and Final Remarks

To evaluate Géfyra, we used the SciPhy [6] workflow, executed in two SWfMSs that collect and store provenance data in a relational database: SciCumulus and VisTrails. We developed two cartridges (PROV-Wf_Sci and PROV-Wf_Vis) to map the provenance data from SciCumulus and VisTrails to XML (according to the PROV-Wf Schema) and in the opposite direction, from XML to the SWfMS itself. To assess the quality of our mapping, we developed a series of queries that check the number of tuples and fields, the types and values of attributes, and the compatibility between the databases of the two SWfMSs. We were also inspired by the queries of the First and Second Provenance Challenges [7]. Our main goal was to evaluate the information loss that might occur in the import process (since some attributes do not exist in both models) and to catch mistakes in our mapping.

To evaluate our results we used the concepts of precision and recall. We executed the queries on the two provenance databases (SciCumulus and VisTrails) to assess the number of records and to check whether the fields were aligned with the attributes of the respective elements in the PROV-Wf Schema. Tables 1 and 2 show the results.

Table 1. Results for SciCumulus
Table 2. Results for VisTrails
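
As an illustration of how precision and recall can be applied to an import, the sketch below compares the rows expected from the source database with the rows found after import. It uses the standard set-based definitions; the function name, the row layout, and the sample data are hypothetical and are not the queries or results of this evaluation.

```python
def precision_recall(expected, imported):
    """Set-based precision and recall of an import.

    precision = correctly imported rows / all imported rows
    recall    = correctly imported rows / all expected rows
    An empty denominator is reported as 1.0 (a convention chosen here).
    """
    expected, imported = set(expected), set(imported)
    matched = expected & imported
    precision = len(matched) / len(imported) if imported else 1.0
    recall = len(matched) / len(expected) if expected else 1.0
    return precision, recall


# Made-up rows: one source row was lost and one spurious row appeared.
source_rows = {("act1", "FINISHED"), ("act2", "FINISHED"), ("act3", "FAILED")}
imported_rows = {("act1", "FINISHED"), ("act2", "FINISHED"), ("act4", "FINISHED")}
p, r = precision_recall(source_rows, imported_rows)
```

Under these definitions, a recall below 1 indicates information lost during import, while a precision below 1 indicates rows that the mapping produced incorrectly.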

The Géfyra architecture is flexible and extensible: new cartridges for different SWfMSs can be connected to it at any time (as shown in Fig. 1). Géfyra maps heterogeneous provenance data sources, allowing the data from one SWfMS to be converted to XML and then from XML into another SWfMS. This way, it is not necessary to convert provenance data from each SWfMS to every other provenance system one wants to use, as Géfyra converts the data to a single XML format that can be shared by all SWfMSs. As a limitation, since a data model may contain data that cannot be mapped to PROV-Wf, some data may be lost in the conversion process. This is the trade-off of being able to use a single system for analysis.
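
The saving from a single shared format can be made concrete with a simple count. Assuming, for illustration only, that each cartridge is counted as two conversion routines (one export to XML and one import from XML), connecting n SWfMSs compares as follows:

```latex
\underbrace{n(n-1)}_{\text{direct converters, one per ordered pair}}
\quad \text{vs.} \quad
\underbrace{2n}_{\text{export and import per cartridge}},
\qquad \text{e.g. } n = 5:\ 20 \text{ vs. } 10 .
```

The two counts are equal at n = 3, and the shared format needs fewer routines for any n > 3.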

As future work, we intend to implement cartridges for other SWfMSs and to further explore the semantic dimension of the provenance data and the implications of that dimension for the mapping of different provenance data sources.