NLP Data Cleansing Based on Linguistic Ontology Constraints

Kontokostas, Dimitris; Brümmer, Martin; Hellmann, Sebastian; Lehmann, Jens; Ioannidis, Lazaros

doi:10.1007/978-3-319-07443-6_16


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8465)

Abstract

Linked Data comprises an unprecedented volume of structured data on the Web and is being adopted by an increasing number of domains. However, the varying quality of published data forms a barrier to further adoption, especially for Linked Data consumers. In this paper, we extend a previously developed methodology for Linked Data quality assessment, which is inspired by test-driven software development. Specifically, we enrich it with ontological support and different levels of result reporting, and describe how the method is applied in the Natural Language Processing (NLP) area. Compared to other domains, such as biology, NLP is a late Linked Data adopter. However, it has seen a steep rise in activity in the creation of data and ontologies, and quality assessment has become an important need for NLP datasets. In our study, we analysed 11 datasets using the lemon and NIF vocabularies in 277 test cases and point out common quality issues.
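To make the test-driven idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy in-memory triple list and a single hypothetical constraint (an annotation's begin index must not exceed its end index). The `ex:` property names, the `check_begin_not_after_end` helper, and the two reporting levels in `report` are illustrative assumptions that do not come from the paper or from the lemon or NIF vocabularies.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: object


# Hypothetical annotation data; the ex: names are placeholders, not real vocabulary terms.
DATA = [
    Triple("ex:s1", "ex:beginIndex", 0),
    Triple("ex:s1", "ex:endIndex", 5),
    Triple("ex:s2", "ex:beginIndex", 9),
    Triple("ex:s2", "ex:endIndex", 4),  # violates begin <= end
]


def value_of(data, subject, predicate):
    """Return the first object found for (subject, predicate), or None."""
    for t in data:
        if t.subject == subject and t.predicate == predicate:
            return t.obj
    return None


def check_begin_not_after_end(data):
    """Test case: return the subjects whose begin index exceeds their end index."""
    violations = []
    for s in sorted({t.subject for t in data}):
        begin = value_of(data, s, "ex:beginIndex")
        end = value_of(data, s, "ex:endIndex")
        if begin is not None and end is not None and begin > end:
            violations.append(s)
    return violations


def report(data, detailed=False):
    """Summarise the test result; optionally list the violating resources."""
    violations = check_begin_not_after_end(data)
    summary = {
        "status": "fail" if violations else "pass",
        "violations": len(violations),
    }
    if detailed:
        summary["resources"] = violations
    return summary
```

Calling `report(DATA)` yields only a pass/fail status and a violation count, while `report(DATA, detailed=True)` also names the offending resources. This loosely mirrors the idea of reporting results at different levels of detail.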



Author information

Authors and Affiliations

Institut für Informatik, AKSW, Universität Leipzig, Germany
Dimitris Kontokostas, Martin Brümmer, Sebastian Hellmann & Jens Lehmann
Medical Physics Laboratory, Aristotle University of Thessaloniki, Greece
Lazaros Ioannidis

Editor information

Editors and Affiliations

Institute of Cognitive Sciences and Technologies, Semantic Technology Laboratory, ISTC-CNR, Via Nomentana 56, 00161, Rome, Italy
Valentina Presutti
Department of Computer Science, University of Bari, Via Orabona, 4, 70125, Bari, Italy
Claudia d’Amato
Wimmics Research Team at Inria, University of Nice - Sophia Antipolis, Route des Lucioles, BP 93, 06902, Sophia Antipolis, France
Fabien Gandon
Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
Mathieu d’Aquin
Institute for Web Science and Technologies, University of Koblenz, Universitätsstraße 1, 56016, Koblenz, Germany
Steffen Staab
Elsevier B.V., Radarweg 29, 1043 NX, Amsterdam, The Netherlands
Anna Tordai


About this paper

Cite this paper

Kontokostas, D., Brümmer, M., Hellmann, S., Lehmann, J., Ioannidis, L. (2014). NLP Data Cleansing Based on Linguistic Ontology Constraints. In: Presutti, V., d’Amato, C., Gandon, F., d’Aquin, M., Staab, S., Tordai, A. (eds) The Semantic Web: Trends and Challenges. ESWC 2014. Lecture Notes in Computer Science, vol 8465. Springer, Cham. https://doi.org/10.1007/978-3-319-07443-6_16


DOI: https://doi.org/10.1007/978-3-319-07443-6_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07442-9
Online ISBN: 978-3-319-07443-6
eBook Packages: Computer Science, Computer Science (R0)
