Quality Assessment of Linked Datasets Using Probabilistic Approximation

  • Jeremy Debattista
  • Santiago Londoño
  • Christoph Lange
  • Sören Auer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9088)


With the increasing application of Linked Open Data, assessing the quality of datasets by computing quality metrics becomes an issue of crucial importance. For large and evolving datasets, an exact, deterministic computation of the quality metrics is too time-consuming or expensive. We employ probabilistic techniques such as Reservoir Sampling, Bloom Filters and Clustering Coefficient estimation to implement a broad set of data quality metrics in an approximate but sufficiently accurate way. Our implementation is integrated into the comprehensive data quality assessment framework Luzzu. We evaluated its performance and accuracy on Linked Open Datasets of broad relevance.
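
To make the idea of approximating a metric from a bounded sample concrete, the following is a minimal Python sketch of reservoir sampling (the classic one-pass "Algorithm R"), followed by a ratio-style estimate computed on the sample. It is an illustration only and not the authors' Luzzu implementation: the function names `reservoir_sample` and `estimate_ratio`, the `predicate` parameter, and the example stream are all assumptions introduced here.

```python
import random
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform random sample of at most k items from a stream.

    The stream is read once and only k items are held in memory,
    so its total length does not need to be known in advance.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_ratio(stream: Iterable[T], k: int,
                   predicate: Callable[[T], bool],
                   rng: Optional[random.Random] = None) -> float:
    """Estimate the fraction of stream items satisfying a (hypothetical) check."""
    sample = reservoir_sample(stream, k, rng)
    if not sample:
        return 0.0
    return sum(1 for item in sample if predicate(item)) / len(sample)


if __name__ == "__main__":
    # Hypothetical stand-in for a stream of resources: integers, where
    # "passes the quality check" is simulated by being even.
    estimate = estimate_ratio(range(1_000_000), 1_000, lambda x: x % 2 == 0)
    print(f"estimated ratio: {estimate:.3f}")
```

A larger reservoir size `k` would be expected to give a more stable estimate at the cost of more memory; how large it needs to be for a given metric is an empirical question that this sketch does not address.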


Data quality, Linked data, Probabilistic approximation



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Jeremy Debattista (1)
  • Santiago Londoño (1)
  • Christoph Lange (1)
  • Sören Auer (1)
  1. University of Bonn and Fraunhofer IAIS, Bonn, Germany
