Reconstructing Human-Generated Provenance Through Similarity-Based Clustering

De Nies, Tom; Mannens, Erik; Van de Walle, Rik

doi:10.1007/978-3-319-40593-3_19

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9672)

Included in the following conference series:

International Provenance and Annotation Workshop

Abstract

In this paper, we revisit our method for reconstructing the primary sources of documents, which make up an important part of their provenance. The method rests on the assumption that if two documents are semantically similar, there is a high chance that they also share a common source. We previously evaluated this assumption on an excerpt from a news archive, achieving 68.2% precision and 73% recall when reconstructing the primary sources of all articles. However, because we could not release this dataset to the public, our results were hard to compare with those of other approaches. In this work, we extend the flexibility of our method by adding a new parameter, and re-evaluate it on the human-generated dataset created for the 2014 Provenance Reconstruction Challenge. The extended method achieves up to 86% precision and 59% recall, and is now directly comparable to any approach that uses the same dataset.
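
The similarity assumption and the precision/recall evaluation described above can be illustrated with a small sketch. This is not the authors' implementation: the bag-of-words cosine similarity, the `threshold` parameter, the function names, and the toy documents and ground truth are all assumptions chosen for illustration, and the sketch does not attempt to reproduce the paper's clustering step or its new parameter.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors (a stand-in for semantic similarity)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def predict_related_pairs(documents, threshold):
    """Return unordered pairs of document ids whose similarity meets the threshold."""
    predicted = set()
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(documents.items()), 2):
        if cosine_similarity(text_a, text_b) >= threshold:
            predicted.add(frozenset((id_a, id_b)))
    return predicted


def precision_recall(predicted, actual):
    """Precision and recall of predicted pairs against ground-truth pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical toy data: d1/d2 and d3/d4 are assumed to share a source.
documents = {
    "d1": "flood hits city centre after heavy rain",
    "d2": "heavy rain causes flood in city centre",
    "d3": "council approves new cycling budget",
    "d4": "cycling budget approved by city council",
}
ground_truth = {frozenset(("d1", "d2")), frozenset(("d3", "d4"))}

for threshold in (0.1, 0.5, 0.6):
    pairs = predict_related_pairs(documents, threshold)
    precision, recall = precision_recall(pairs, ground_truth)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, a low threshold links unrelated documents and lowers precision, while a high threshold misses a true pair and lowers recall, which is the kind of trade-off the reported precision and recall figures reflect.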

Notes

1. http://www.data2semantics.org/prov-reconstruction-challenge/

Author information

Authors and Affiliations

Ghent University – iMinds – Data Science Lab, Ghent, Belgium
Tom De Nies, Erik Mannens & Rik Van de Walle

Corresponding author

Correspondence to Tom De Nies.

Editor information

Editors and Affiliations

COPPE/UFRJ, Rio de Janeiro, Brazil
Marta Mattoso
Illinois Institute of Technology, Chicago, Illinois, USA
Boris Glavic

About this paper

Cite this paper

De Nies, T., Mannens, E., Van de Walle, R. (2016). Reconstructing Human-Generated Provenance Through Similarity-Based Clustering. In: Mattoso, M., Glavic, B. (eds) Provenance and Annotation of Data and Processes. IPAW 2016. Lecture Notes in Computer Science, vol 9672. Springer, Cham. https://doi.org/10.1007/978-3-319-40593-3_19

DOI: https://doi.org/10.1007/978-3-319-40593-3_19
Published: 04 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40592-6
Online ISBN: 978-3-319-40593-3
eBook Packages: Computer Science, Computer Science (R0)
