Detecting Incorrect Numerical Data in DBpedia

  • Dominik Wienand
  • Heiko Paulheim
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8465)


DBpedia is a central hub of Linked Open Data (LOD). Being based on crowd-sourced content and heuristic extraction methods, it is not free of errors. In this paper, we study the application of unsupervised numerical outlier detection methods to DBpedia, using Interquantile Range (IQR), Kernel Density Estimation (KDE), and various dispersion estimators, combined with different semantic grouping methods. Our approach reaches 87% precision and has led to the identification of 11 systematic errors in the DBpedia extraction framework.
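As an illustration of the general idea only, and not the authors' implementation, the following minimal sketch applies an interquartile-range rule separately within each semantic group. The function names `iqr_outliers` and `grouped_outliers`, the cutoff factor `k=1.5`, the use of the resource type as the grouping key, and the sample heights are all assumptions made for this example; none of them come from the paper.

```python
from collections import defaultdict
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR].

    Needs at least two values; k=1.5 is the conventional cutoff (assumed here).
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def grouped_outliers(records, k=1.5):
    """Run the IQR rule per group; records is an iterable of (group, value)."""
    groups = defaultdict(list)
    for group, value in records:
        groups[group].append(value)
    return {group: iqr_outliers(vals, k) for group, vals in groups.items()}


# Hypothetical height values in metres, grouped by a hypothetical type label.
# The 178.0 among persons mimics a unit mix-up (centimetres stored as metres).
records = (
    [("Person", h) for h in
     [1.65, 1.70, 1.72, 1.75, 1.78, 1.80, 1.83, 1.85, 178.0]]
    + [("Building", h) for h in
       [95.0, 120.0, 150.0, 178.0, 200.0, 240.0, 310.0]]
)

flagged = grouped_outliers(records)
print(flagged)
```

With this sample data, the value 178.0 is flagged within the persons group but not within the buildings group, which suggests why grouping values semantically before testing for outliers can matter.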


Keywords: #eswc2014Wienand, Linked Open Data, DBpedia, Data Quality, Error Detection, Outlier Detection, Clustering



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Dominik Wienand (1)
  • Heiko Paulheim (2)
  1. Knowledge Engineering Group, Technische Universität Darmstadt, Germany
  2. Research Group Data and Web Science, University of Mannheim, Germany
