Provenance Information in a Collaborative Knowledge Graph: An Evaluation of Wikidata External References

  • Alessandro Piscopo
  • Lucie-Aimée Kaffee
  • Chris Phethean
  • Elena Simperl
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10587)


Wikidata is a collaboratively edited knowledge graph. It expresses knowledge as subject-property-value triples, which can be annotated with references that add provenance information. Understanding the quality of Wikidata is key to its widespread adoption as a knowledge resource. We analyse one aspect of Wikidata quality, provenance, in terms of the relevance and authoritativeness of its external references. We follow a two-stage approach. First, we perform a crowdsourced evaluation of references. Second, we use the judgements collected in the first stage to train machine learning models that predict reference quality at large scale. The features chosen for the models relate to reference editing and to the semantics of the triples the references refer to. Of the references evaluated, \(61\%\) were relevant and authoritative. Bad references were often links that had changed and either stopped working or pointed to other pages. The machine learning models outperformed the baseline and accurately predicted non-relevant and non-authoritative references. Further work should focus on implementing our approach in Wikidata to help editors find bad references.
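As a rough illustration of how crowd judgements from the first stage could become training labels, and of what a baseline for the second stage might look like, the sketch below aggregates votes per reference and computes the accuracy of always predicting the most common label. The abstract does not say how judgements were aggregated or which baseline was used, so strict majority voting, the majority-class baseline, the function names, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical crowd judgements: (reference_id, worker_says_good).
# The ids and votes are invented for illustration only.
JUDGEMENTS = [
    ("ref-1", True), ("ref-1", True), ("ref-1", False),
    ("ref-2", False), ("ref-2", False), ("ref-2", True),
    ("ref-3", True), ("ref-3", True), ("ref-3", True),
]


def aggregate_judgements(judgements):
    """Collapse several worker votes per reference into one label."""
    votes = {}
    for ref_id, is_good in judgements:
        votes.setdefault(ref_id, []).append(is_good)
    # Strict majority; a tie counts as "not good" (an arbitrary assumption).
    return {ref_id: 2 * sum(v) > len(v) for ref_id, v in votes.items()}


def majority_class_accuracy(labels):
    """Accuracy of a baseline that always predicts the most common label."""
    counts = Counter(labels.values())
    return counts.most_common(1)[0][1] / len(labels)


labels = aggregate_judgements(JUDGEMENTS)
baseline = majority_class_accuracy(labels)
print(labels)
print(round(baseline, 2))
```

A trained classifier would then be judged against `baseline`: it is only useful if it identifies bad references more reliably than this trivial predictor does.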


Keywords: Wikidata, Provenance, Collaborative knowledge graph



This project is supported by funding received from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 642795 (WDAqua ITN).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Alessandro Piscopo (1), corresponding author
  • Lucie-Aimée Kaffee (1)
  • Chris Phethean (1)
  • Elena Simperl (1)

  1. University of Southampton, Southampton, UK
