PROV-O-Viz - Understanding the Role of Activities in Provenance
This paper presents PROV-O-Viz, a Web-based visualization tool for PROV-based provenance traces coming from various sources, that leverages Sankey Diagrams to reflect the flow of information through activities. We briefly discuss the advantages of this approach compared to other provenance visualization tools. PROV-O-Viz has already been used to visualize provenance traces generated by very different applications.
Keywords: Provenance, Visualization, Sankey, Information flow, Linked data, Reusability
Understanding data provenance (the origin or source of data) is a critical facilitator for data quality, trust, reproducibility, compliance and debugging of complex computational systems [1]. In 2013, the World Wide Web Consortium released the W3C PROV standards, which enable the interchange of provenance between systems [2]. These standards are increasingly being implemented [3].
Given the wealth of provenance information available, techniques are needed to help users navigate and investigate this information space. Several works have focused on the visualization of provenance using a number of presentation paradigms, including networks, data flow graphs, and radial layouts [4, 5] (see also https://provenance.ecs.soton.ac.uk/vis/).
Here, we focus on a visualization approach to identify important activities within a provenance graph and link those activities together. Additionally, our aim is to show how this approach can be useful in an uncontrolled setting, i.e. for PROV coming from multiple environments, generated through the execution of diverse and potentially undefined tasks or workflows. To do so, we demonstrate a Sankey-diagram-based visualization of PROV and apply it to provenance traces originating from multiple environments: machine learning experiments, version control systems (GitHub), and scientific workflows executed on different workflow systems. The demonstration is available at http://provoviz.org.
2 Sankey Diagrams
- determine important activities based on data flow; and
- understand how data flows through a selected activity.
- Import of PROV data from both plain text and published data (i.e. available at a URL).
- Focus on particular activities within a provenance diagram by selecting them from a dropdown box.
- Highlight data flows in and out of activities within the diagram; the width of the box indicates the amount of information flowing through the activity.
- Leverage reasoning to fill in missing information within a provenance graph.
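The reasoning step in the last item can be illustrated with a minimal sketch. It assumes the simplest possible form of inference: using the domain and range that the PROV-O schema declares for a property to type the nodes at either end of a statement. The function name `infer_types`, the tuple representation of triples and the `ex:` identifiers are hypothetical and chosen for illustration; this is not PROV-O-Viz's actual implementation.

```python
# Domain and range of a few PROV-O properties, as declared in the PROV-O schema.
PROV_DOMAIN_RANGE = {
    "prov:used": ("prov:Activity", "prov:Entity"),
    "prov:wasGeneratedBy": ("prov:Entity", "prov:Activity"),
    "prov:wasInformedBy": ("prov:Activity", "prov:Activity"),
    "prov:wasDerivedFrom": ("prov:Entity", "prov:Entity"),
}


def infer_types(triples):
    """Return {node: set of inferred PROV classes} for (subject, property, object) triples."""
    types = {}
    for subject, prop, obj in triples:
        if prop not in PROV_DOMAIN_RANGE:
            continue  # no schema knowledge about this property
        domain, range_ = PROV_DOMAIN_RANGE[prop]
        types.setdefault(subject, set()).add(domain)
        types.setdefault(obj, set()).add(range_)
    return types


# Hypothetical untyped trace: it contains no rdf:type statements at all.
trace = [
    ("ex:align", "prov:used", "ex:reads.fastq"),
    ("ex:aligned.bam", "prov:wasGeneratedBy", "ex:align"),
]
inferred = infer_types(trace)
```

Here `inferred` types `ex:align` as an activity and both files as entities, even though the trace never states this explicitly. A full reasoner would also handle subproperties and subclasses, which this sketch leaves out.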
Additionally, we allow provenance graphs to be embedded directly within web pages, so that provenance visualizations can be included in other web applications. Furthermore, the visualization is self-contained: once the provenance is rendered, there is no need to call the server. For example, in LinkItUp ([7], http://linkitup.data2semantics.org), an application that enriches the content of data with metadata, PROV-O-Viz is used to display the provenance of how the application enriches data with this extra data. Users can thus understand how the application arrives at its suggestions. (We will also demonstrate this capability.)
We evaluated the visualization capabilities of PROV-O-Viz by using it to inspect PROV data coming from four different sources. The first source consists of the provenance traces of scientific workflows executed through the Taverna and WINGS workflow systems, which are made available as part of the Wf4Ever ProvBench benchmark. The Taverna PROV traces do not explicitly provide the type of events and activities that many visualizations rely on. PROV-O-Viz automatically infers these types by applying reasoning over the PROV-O schema definitions. Even though some of these datasets are relatively large, focusing on the ego graph of information dependencies flowing through the selected activity keeps the visualization manageable. At the moment, however, PROV-O-Viz generates a visualization for the ego graph centered on every activity. This means that for provenance traces that contain many connected activities, generating the Sankey diagrams may take a long time. Once the diagrams have been built, the visualization is very responsive. Embedded PROV-O-Viz diagrams are generated in advance and therefore do not suffer from this potential performance hit. The next version will feature a more responsive user interface that keeps users up to date on the progress made in generating the visualizations.
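To make the idea of an ego graph concrete, the following sketch extracts the part of a dependency graph that lies within a fixed number of hops of a selected activity. It is only an illustration under stated assumptions: the text does not say how PROV-O-Viz bounds the ego graph, so the `radius` parameter, the function name `ego_graph` and the example edge list are hypothetical.

```python
from collections import deque


def ego_graph(edges, center, radius=2):
    """Return the edges whose endpoints both lie within `radius` hops of `center`."""
    # Build an undirected neighbour index so that both inputs and outputs are reached.
    neighbours = {}
    for source, target in edges:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)

    # Breadth-first search that stops expanding once the radius is reached.
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue
        for other in neighbours.get(node, ()):
            if other not in distance:
                distance[other] = distance[node] + 1
                queue.append(other)

    return [(s, t) for s, t in edges if s in distance and t in distance]


# Hypothetical information-flow edges (source -> target) of a small pipeline.
edges = [
    ("raw.csv", "clean"),
    ("clean", "clean.csv"),
    ("clean.csv", "train"),
    ("train", "model.pkl"),
    ("model.pkl", "evaluate"),
]
around_train = ego_graph(edges, "train", radius=1)
```

With `radius=1`, only the direct input and output of `train` remain. Running such an extraction once per activity illustrates why generation time can grow with the number of connected activities.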
The Ducktape platform is another scientific workflow system, focused on Machine Learning tasks. The visualization in Fig. 2 is based on the provenance of one of the steps in a Machine Learning pipeline. Ducktape can generate interactive reports of workflow execution that embed a visualization of the provenance trace [6]. See Fig. 3 for a screenshot of such a report.
The LinkItUp system, which enriches metadata for datasets stored in the Figshare.com scientific data publishing platform, stores all enrichment activities performed by users as part of a provenance trace. This provenance trace can be inspected from within the application through a call to the PROV-O-Viz API.
Git2PROV is a web service that converts Git version histories to a provenance trace expressed in various PROV-compliant syntaxes. Every commit is represented as a PROV activity. Visualizing these graphs can be even more challenging than visualizing those of the workflow systems, because commit histories are tree-shaped and highly connected: they all originate from the same initial commit. Workflow systems can produce large graphs, but these are often multiple separate graphs for runs against multiple files.
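The shape of such a trace can be sketched as follows. The only detail taken from the text is that every commit becomes a PROV activity; linking a commit to its parents with `prov:wasInformedBy`, the `commit:` prefix, the function name `commits_to_prov` and the example history are assumptions made for illustration, not a description of Git2PROV's actual output.

```python
def commits_to_prov(commits):
    """Map {commit_id: [parent_ids]} to PROV-style triples, one activity per commit."""
    triples = []
    for commit_id, parents in commits.items():
        activity = f"commit:{commit_id}"
        triples.append((activity, "rdf:type", "prov:Activity"))
        for parent in parents:
            # Assumption: a commit is informed by each of its parent commits.
            triples.append((activity, "prov:wasInformedBy", f"commit:{parent}"))
    return triples


# Hypothetical history: two branches fork from a1 and are merged again in d4.
history = {"a1": [], "b2": ["a1"], "c3": ["a1"], "d4": ["b2", "c3"]}
prov = commits_to_prov(history)

# Commits without parents; a single root means the whole history is one connected graph.
roots = [commit for commit, parents in history.items() if not parents]
```

Because every commit is reachable from the single root, the resulting graph cannot be split into independent subgraphs the way separate workflow runs can.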
In this demonstration, we show how generic visualization tools can be used to interrogate provenance coming from multiple different applications. This provides evidence that provenance can provide added value without domain-specific extensions. In future work we will focus on the ability to generate entity-centric diagrams and on a browsing feature that allows users to click through the various parts of the provenance graph. We are furthermore considering the implementation of a more efficient method for calculating the information flow, e.g. using centrality measures based on current flow in an electrical network.
This work was funded under the Dutch national programme COMMIT.
- 1. Freire, J., Bonnet, P., Shasha, D.: Computational reproducibility: state-of-the-art, challenges, and database research opportunities. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. SIGMOD ’12, pp. 593–596. ACM, New York (2012)
- 2. Groth, P., Moreau, L.: PROV overview: An overview of the PROV family of documents. Technical report, W3C (2013)
- 3. Huynh, T.D., Groth, P., Zednik, S.: PROV implementation report. Technical report, W3C (2013)
- 5. Meyer, B., Prohaska, S., Hege, H.C.: Provenance visualization and usage. Technical report (2009)
- 6. Wibisono, A., Bloem, P., de Vries, G.K., Groth, P., Belloum, A., Bubak, M.: Generating scientific documentation for computational experiments using provenance. In: Proceedings of IPAW 2014 (2014)
- 7. Hoekstra, R., Groth, P.: Linkitup: link discovery for research data. In: Discovery Informatics: AI Takes a Science-Centered View on Big Data, AAAI Fall Symposium Series (2013)