Automatic Discovery of High-Level Provenance Using Semantic Similarity

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNISA,volume 7525)


As interest in provenance grows within the Semantic Web community, provenance is increasingly recognized as a useful tool across many domains. However, existing automatic provenance collection techniques are not universally applicable: most either rely on (low-level) observed provenance or require the user to disclose formal workflows. In this paper, we propose a new approach for the automatic discovery of provenance at multiple levels of granularity. To accomplish this, we detect entity derivations using clustering algorithms, linked data and semantic similarity. The resulting derivations are structured in compliance with the Provenance Data Model (PROV-DM). While the proposed approach is purposely kept general, allowing adaptation to many use cases, we provide an implementation for one of these use cases, namely discovering the sources of news articles. With this implementation, we were able to detect 73% of the original sources of 410 news stories, at 68% precision. Lastly, we discuss possible improvements and future work.
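The core idea of the abstract, detecting derivations between entities from their similarity and expressing them in PROV terms, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Document` type, the use of Jaccard overlap on named-entity sets as a stand-in for semantic similarity, the 0.5 threshold, the rule that the older document is the source, and the `ex:` identifiers are all assumptions made for illustration. Clustering and linked-data enrichment are omitted.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Document:
    """Hypothetical document: an id, a publication time, and its named entities."""
    doc_id: str
    published: int        # any sortable timestamp (assumption)
    entities: frozenset   # entity labels or linked-data URIs (assumption)


def jaccard(a, b):
    """Set overlap, used here as a simple stand-in for semantic similarity."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def discover_derivations(docs, threshold=0.5):
    """Return (derived_id, source_id, score) for each sufficiently similar pair.

    Assumption: of two similar documents, the older one is the source.
    """
    derivations = []
    for d1, d2 in combinations(docs, 2):
        score = jaccard(d1.entities, d2.entities)
        if score >= threshold and d1.published != d2.published:
            older, newer = sorted((d1, d2), key=lambda d: d.published)
            derivations.append((newer.doc_id, older.doc_id, score))
    return derivations


def to_prov_n(derivations):
    """Serialize the pairs as PROV-N wasDerivedFrom(derived, source) statements."""
    return [f"wasDerivedFrom({derived}, {source})"
            for derived, source, _ in derivations]


# Toy data, invented for this example only.
docs = [
    Document("ex:wire", 1, frozenset({"Obama", "Chicago", "election"})),
    Document("ex:article", 2, frozenset({"Obama", "Chicago", "election", "poll"})),
    Document("ex:other", 3, frozenset({"football", "Madrid"})),
]
statements = to_prov_n(discover_derivations(docs))
print(statements)  # ['wasDerivedFrom(ex:article, ex:wire)']
```

In this sketch the threshold controls the trade-off the abstract reports as recall versus precision: lowering it proposes more candidate sources, at the cost of more false derivations.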


  • Provenance
  • Data Model
  • Semantic Web
  • Linked Data
  • Similarity
  • News


© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

De Nies, T., Coppens, S., Van Deursen, D., Mannens, E., Van de Walle, R. (2012). Automatic Discovery of High-Level Provenance Using Semantic Similarity. In: Groth, P., Frew, J. (eds) Provenance and Annotation of Data and Processes. IPAW 2012. Lecture Notes in Computer Science, vol 7525. Springer, Berlin, Heidelberg.

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34221-9

  • Online ISBN: 978-3-642-34222-6

  • eBook Packages: Computer Science, Computer Science (R0)