Provenance Annotation and Analysis to Support Process Re-computation

  • Jacek Cała
  • Paolo Missier
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11017)


Many resource-intensive analytics processes evolve over time, following new versions of the reference datasets and software dependencies they use. We focus on scenarios in which any version change has the potential to affect many outcomes, as is the case, for instance, in high-throughput genomics, where the same process is used to analyse large cohorts of patient genomes, or cases. As any single version change is unlikely to affect the entire population, an efficient strategy for restoring the currency of the outcomes first requires identifying the scope of the change, i.e., the subset of affected data products. In this paper we describe a generic and reusable provenance-based approach to this scope discovery problem. It applies to scenarios where the process consists of complex hierarchical components, where different input cases are processed using different version configurations of each component, and where separate provenance traces are collected for the executions of each component. We show how a new data structure, called a restart tree, is computed and exploited to manage the change scope discovery problem.
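The scope discovery problem stated above can be illustrated with a minimal Python sketch. It is not the restart tree described in the paper, whose construction is not given here; it only shows the underlying idea of querying per-execution provenance records to find the cases that used a particular version of a dependency. All names (`Execution`, `change_scope`, the case, component, and dependency labels) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Execution:
    """One recorded execution of a process component (hypothetical record shape)."""
    case_id: str        # input case, e.g. one patient genome
    component: str      # component of the process that was executed
    used: frozenset     # (dependency, version) pairs used by this execution


def change_scope(executions, dependency, old_version):
    """Map each affected case to the components that used the outdated version.

    A case is treated as affected if any of its recorded executions used
    `dependency` at `old_version`. This is an assumption of the sketch,
    not the paper's definition of scope.
    """
    affected = {}
    for ex in executions:
        if (dependency, old_version) in ex.used:
            affected.setdefault(ex.case_id, []).append(ex.component)
    return affected


# Illustrative provenance records: cases were processed with different
# version configurations of the same components.
trace = [
    Execution("case-1", "align", frozenset({("ref-genome", "v37")})),
    Execution("case-1", "annotate", frozenset({("variant-db", "2017-01")})),
    Execution("case-2", "align", frozenset({("ref-genome", "v38")})),
    Execution("case-2", "annotate", frozenset({("variant-db", "2017-01")})),
    Execution("case-3", "annotate", frozenset({("variant-db", "2016-06")})),
]

# Suppose "variant-db" moves on from version "2017-01": which cases are in scope?
scope = change_scope(trace, "variant-db", "2017-01")
print(scope)
```

In this sketch only the cases whose executions actually used the superseded version are returned, together with the component from which re-computation might restart; cases processed with other versions are left alone.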


Provenance annotations, Process re-computation



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. School of Computing, Newcastle University, Newcastle upon Tyne, UK
