NLP Data Cleansing Based on Linguistic Ontology Constraints

  • Dimitris Kontokostas
  • Martin Brümmer
  • Sebastian Hellmann
  • Jens Lehmann
  • Lazaros Ioannidis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8465)


Linked Data comprises an unprecedented volume of structured data on the Web and is being adopted by an increasing number of domains. However, the varying quality of published data is a barrier to further adoption, especially for Linked Data consumers. In this paper, we extend a previously developed methodology for Linked Data quality assessment, which is inspired by test-driven software development. Specifically, we enrich it with ontological support and different levels of result reporting, and describe how the method is applied in the Natural Language Processing (NLP) area. Compared to other domains, such as biology, NLP is a late Linked Data adopter. However, it has seen a steep rise in activity in the creation of data and ontologies, so quality assessment has become an important need for NLP datasets. In our study, we analysed 11 datasets that use the lemon and NIF vocabularies with 277 test cases and point out common quality issues.
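To make the idea of a test case over NLP Linked Data concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not detail how the test cases are expressed or executed, so the function `check_exactly_one`, the `Violation` record, the in-memory triple list, and the `ex:` resources are all hypothetical. The constraint itself (every `nif:Context` carries exactly one `nif:isString`) and the `ERROR` reporting level are chosen only for illustration, and the NIF namespace IRI is assumed.

```python
from collections import defaultdict
from typing import NamedTuple

# Assumed namespace IRIs, used here only as plain string prefixes.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Violation(NamedTuple):
    """Hypothetical result record: a reporting level plus the offending resource."""
    level: str
    resource: str
    message: str


def check_exactly_one(triples, cls, prop, level="ERROR"):
    """Report instances of `cls` that do not have exactly one value for `prop`."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return [
        Violation(level, s, f"expected exactly 1 value for {prop}, found {len(values[s])}")
        for s in sorted(instances)
        if len(values[s]) != 1
    ]


# Toy dataset as (subject, predicate, object) tuples; purely illustrative.
triples = [
    ("ex:doc1", RDF_TYPE, NIF + "Context"),
    ("ex:doc1", NIF + "isString", "Linked Data for NLP."),
    ("ex:doc2", RDF_TYPE, NIF + "Context"),  # no nif:isString value
]

violations = check_exactly_one(triples, NIF + "Context", NIF + "isString")
for v in violations:
    print(v.level, v.resource, v.message)
```

A real assessment would run such checks as queries against the published dataset rather than over a Python list; the sketch only shows the shape of one constraint and of a levelled result.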


#eswc2014Kontokostas, Linked Data, NLP, data quality





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Dimitris Kontokostas (1)
  • Martin Brümmer (1)
  • Sebastian Hellmann (1)
  • Jens Lehmann (1)
  • Lazaros Ioannidis (2)

  1. Institut für Informatik, AKSW, Universität Leipzig, Germany
  2. Medical Physics Laboratory, Aristotle University of Thessaloniki, Greece
