Lessons Learned — The Case of CROCUS: Cluster-Based Ontology Data Cleansing

  • Didier Cherix
  • Ricardo Usbeck
  • Andreas Both
  • Jens Lehmann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8798)

Abstract

Over the past years, a vast number of datasets have been published based on Semantic Web standards, which provides an opportunity for creating novel industrial applications. However, industrial requirements on data quality are high, while the time to market and the cost of data preparation have to be kept low. Unfortunately, many Linked Data sources are error-prone, which prevents their direct use in production systems. Hence, (semi-)automatic quality assurance processes are needed, as manual ontology repair by domain experts is expensive and time-consuming. In this article, we present CROCUS, a pipeline for cluster-based ontology data cleansing. Our system provides a semi-automatic approach to instance-level error detection in ontologies that is agnostic of the underlying Linked Data knowledge base and works at very low cost. CROCUS has been evaluated on two datasets. The experiments show that it detects errors with high recall. Furthermore, we provide an exhaustive review of related work as well as a number of lessons learned.

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Didier Cherix (2)
  • Ricardo Usbeck (1, 2)
  • Andreas Both (2)
  • Jens Lehmann (1)

  1. University of Leipzig, Leipzig, Germany
  2. R & D, Unister GmbH, Leipzig, Germany
