Non-parametric Class Completeness Estimators for Collaborative Knowledge Graphs—The Case of Wikidata

  • Michael Luggen
  • Djellel Difallah
  • Cristina Sarasua
  • Gianluca Demartini
  • Philippe Cudré-Mauroux
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11778)

Abstract

Collaborative Knowledge Graph platforms allow humans and automated scripts to collaborate in creating, updating and interlinking entities and facts. To ensure both the completeness of the data and a uniform coverage of the different topics, it is crucial to identify underrepresented classes in the Knowledge Graph. In this paper, we tackle this problem by developing statistical techniques for class cardinality estimation in collaborative Knowledge Graph platforms. Our method estimates the completeness of a class (as defined by a schema or ontology) and can hence be used to answer questions such as “Does the knowledge base have a complete list of all {Beer Brands|Volcanos|Video Game Consoles}?” As a use case, we focus on Wikidata, which poses unique challenges in terms of the size of its ontology, the number of users actively populating its graph, and its extremely dynamic nature. Our techniques are derived from species estimation and data-management methodologies, and are applied to the case of graphs and collaborative editing. In our empirical evaluation, we observe that (i) the number and frequency of unique class instances drastically influence the performance of an estimator, (ii) bursts of inserts cause some estimators to overestimate the true size of the class if they are not properly handled, and (iii) one can effectively measure the convergence of a class towards its true size by considering the stability of an estimator against the number of available instances.
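
To make the idea of species-style class cardinality estimation concrete, the following is a minimal sketch of two classic non-parametric estimators from the species estimation literature: the first-order jackknife and a simple estimate based on Good-Turing sample coverage. It is an illustration only and not necessarily the exact formulation evaluated in the paper. The function names, the toy identifiers, and the assumption that each observation is one mention of a class instance (for example, in one edit) are hypothetical and not taken from the abstract.

```python
from collections import Counter


def frequency_counts(observations):
    """Return (n, d, f1): total observations, distinct instances, singletons."""
    per_instance = Counter(observations)
    if not per_instance:
        raise ValueError("at least one observation is required")
    n = sum(per_instance.values())
    d = len(per_instance)
    f1 = sum(1 for count in per_instance.values() if count == 1)
    return n, d, f1


def jackknife1(observations):
    """First-order jackknife: d + f1 * (n - 1) / n."""
    n, d, f1 = frequency_counts(observations)
    return d + f1 * (n - 1) / n


def coverage_estimate(observations):
    """Distinct count divided by the Good-Turing sample coverage 1 - f1 / n."""
    n, d, f1 = frequency_counts(observations)
    if f1 == n:
        # Every instance was seen exactly once: coverage is zero,
        # so the sample gives no upper limit on the class size.
        return float("inf")
    coverage = 1 - f1 / n
    return d / coverage


def observed_ratio(observations, estimator=jackknife1):
    """Share of the estimated class size that has already been observed."""
    _, d, _ = frequency_counts(observations)
    return d / estimator(observations)


if __name__ == "__main__":
    # Hypothetical observation log: one entry per mention of an instance.
    log = ["Q1", "Q1", "Q2", "Q3", "Q3", "Q3", "Q4"]
    print(round(jackknife1(log), 2))         # 5.71
    print(round(coverage_estimate(log), 2))  # 5.6
    print(round(observed_ratio(log), 2))     # 0.7
```

In this toy log, four distinct instances are seen in seven observations and two of them are singletons, so both estimators suggest that one or two instances of the class are still missing. A large share of singletons pushes the estimates up, which is consistent with the observation that bursts of inserts can cause overestimation.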

Keywords

Knowledge Graph, Class completeness, Class cardinality, Estimators, Edit history

Notes

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement 683253/GraphInt). It is also supported by the Australian Research Council (ARC) Discovery Project (Grant No. DP190102141).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Michael Luggen (1), Email author
  • Djellel Difallah (2)
  • Cristina Sarasua (3)
  • Gianluca Demartini (4)
  • Philippe Cudré-Mauroux (1)

  1. University of Fribourg, Fribourg, Switzerland
  2. New York University, New York, USA
  3. University of Zurich, Zurich, Switzerland
  4. University of Queensland, Brisbane, Australia
