ESWC 2017: The Semantic Web pp 305-320

All that Glitters Is Not Gold – Rule-Based Curation of Reference Datasets for Named Entity Recognition and Entity Linking

  • Kunal Jha
  • Michael Röder
  • Axel-Cyrille Ngonga Ngomo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10249)

Abstract

The evaluation of Named Entity Recognition and Entity Linking systems is mostly based on manually created gold standards. However, current gold standards have three main drawbacks. First, they do not share a common set of rules pertaining to what is to be marked and linked as an entity. Second, most gold standards have not been checked by other researchers after publication and therefore commonly contain mistakes. Finally, many gold standards are outdated: the reference knowledge bases used to link entities are refined over time, while the gold standards are typically not updated to the newest version of the reference knowledge base. In this work, we analyze existing gold standards and derive a set of rules for annotating documents for named entity recognition and entity linking. We present Eaglet, a tool that supports the semi-automatic checking of a gold standard based on these rules. A manual evaluation of Eaglet’s results shows that it achieves an accuracy of up to 88% when detecting errors. We apply Eaglet to 13 English gold standards and detect 38,453 errors. An evaluation of 10 tools on a subset of these datasets shows a performance difference of up to 10% micro F-measure on average.

Keywords

Entity recognition, Entity linking, Benchmarks

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Kunal Jha (1)
  • Michael Röder (1)
  • Axel-Cyrille Ngonga Ngomo (1, 2)
  1. AKSW Research Group, University of Leipzig, Leipzig, Germany
  2. Data Science Group, University of Paderborn, Paderborn, Germany
