1 Introduction

The classification of entities is an important task in information extraction (IE) from textual sources that requires the support of a comprehensive knowledge base. In a standard workflow, a Named Entity Recognition (NER) tool is used to discover the surface forms of entity mentions in some input text. These surface forms then have to be disambiguated and linked to a specific entity in a knowledge base (entity linking) to be useful in subsequent IE tasks. For the latter step of entity linking, suitable entity candidates have to be selected from the underlying knowledge base. In the general case of linking arbitrary entities, information about the classes of entity candidates is advantageous for the disambiguation process and for pruning candidates. In more specialized cases, only a subset of entity mentions may be of interest, such as toponyms or person mentions, which requires the classification of entity mentions. As a result, the classification of entities in the underlying knowledge base serves to support the linking procedure and directly translates into a classification of the entities that are mentioned in the text, which is a necessary precondition for many subsequent tasks such as event detection (Kumaran and Allan 2004) or document geolocation (Ding et al. 2000).

A number of knowledge bases provide such a background repository for entity classification, predominantly DBpedia, YAGO, and Wikidata (Färber et al. 2017). While these knowledge bases provide semantically rich and fine-granular classes and relationship types, the task of entity classification often requires associating coarse-grained classes with discovered surface forms of entities. This problem is best illustrated by an IE task that has recently gained significant interest, in particular in the context of processing streams of news articles and postings in social media, namely event detection and tracking, e.g., (Aggarwal and Subbian 2012; Sakaki et al. 2010; Brants and Chen 2003). Considering an event as something that happens at a given place and time between a group of actors (Allan 2012), the entity classes person, organization, location, and time are of particular interest. While surface forms of temporal expressions are typically normalized by using a temporal tagger (Strötgen and Gertz 2016), dealing with the classification of the other types of entities is often much more subtle. This is especially true if one recalls that almost all available NER tools tag named entities only at a very coarse-grained level, e.g., Stanford NER (Finkel et al. 2005), which predominantly uses the classes location, person, and organization.

The objective of this paper is to provide the community with a dataset and API for entity classification in Wikidata, which is tailored towards entities of the classes location, person, and organization. Like knowledge bases with inherently more coarse or hybrid class hierarchies such as YAGO and DBpedia, this version of Wikidata then supports entity linking tasks at state-of-the-art level (Geiß and Gertz 2016; Spitz et al. 2016b), but links entities to the continuously evolving Wikidata instead of traditional KBs. As we outline in the following, extracting respective sets of entities from a KB for each such class is by no means trivial (Spitz et al. 2016a), especially given the complexity of simultaneously dealing with multi-level class and instance structures inherent to existing KBs, an aspect that is also pointed out by Brasileiro et al. (2016). However, there are several reasons to choose Wikidata over other KBs. First, especially when dealing with news articles and social media data streams, it is crucial to have an up-to-date repository of persons and organizations. To the best of our knowledge, at the time of writing this paper, the most recent version of DBpedia was published in April 2016, and the latest evaluated version of YAGO in September 2015, whereas Wikidata provides a weekly data dump. Even though all three KBs (Wikidata, DBpedia, and YAGO3) are based on Wikipedia, Wikidata also contains information about entities and relationships that has not simply been extracted from Wikipedia (YAGO and DBpedia extract data predominantly from infoboxes) but has been collaboratively added by users (Müller-Birn et al. 2015). Although the latter feature might raise concerns regarding the quality of the data in Wikidata, for example due to vandalism (Heindorf et al. 2015), we find that the currentness of information far outweighs these concerns when using Wikidata as the basis for a named entity classification framework and as a knowledge base in particular.
While Wikidata provides a SPARQL interface for direct query access in addition to the weekly dumps, this method of accessing the data has several downsides. First, the interface is not designed for speed and is thus ill-suited for entity extraction or linking tasks in large corpora, where many lookups are necessary. Second, and more importantly, the continually evolving content of Wikidata prevents reproducibility of scientific results if the online SPARQL access is used, as the versioning is unclear and it is impossible to recreate experimental conditions. Third, we find that the hierarchy and structure in Wikidata is (necessarily) complicated and does not lend itself easily to creating coarse class hierarchies on the fly without substantial prior investigation into the existing hierarchies. Here, NECKAr provides a stable, easy-to-use view of classified Wikidata entities that is based on a selected Wikidata dump and allows reproducible results of subsequent IE tasks.

In summary, we make the following contributions: We provide an easy to use tool for assigning Wikidata items to commonly used NE classes by exclusively utilizing Wikidata. Furthermore, we make one such resulting Wikidata NE dataset available as a resource, including basic statistics and a thorough comparison to YAGO3.

The remainder of the paper is structured as follows. After a brief discussion of related work in the following section, we describe our named entity classifier in detail in Sect. 3 and present the resulting Wikidata NE dataset in Sect. 4. Section 5 gives a comparison of our results to YAGO3.

2 Related Work

The DBpedia project extracts structured information from Wikipedia (Auer et al. 2007). The 2016-04 version includes 28.6M entities, of which 28M are classified in the DBpedia Ontology. This DBpedia 2016-04 ontology is a directed acyclic graph that consists of 754 classes. It was manually created and is based on the most frequently used infoboxes in Wikipedia. For each Wikipedia language version, there are mappings available between the infoboxes and the DBpedia ontology. In the current version, there are 3.2M persons, 3.1M places and 515,480 organizations. To be available in DBpedia, an entity needs to have a Wikipedia page (in at least one language version that is included in the extraction) that contains an infobox for which a valid mapping is available.

YAGO3 (Mahdisoltani et al. 2015), the multilingual extension of YAGO, combines information from 10 different Wikipedia language versions and fuses it with the English WordNet. YAGO concentrates on extracting facts about entities, using Wikipedia categories, infoboxes, and Wikidata. The YAGO taxonomy is constructed by making use of the Wikipedia categories. However, instead of using the category hierarchy that is “barely useful for ontological purposes” (Suchanek et al. 2007), the Wikipedia categories are extracted, filtered and parsed for noun phrases in order to map them to WordNet classes. To include the multilingual categories, Wikidata is used to find corresponding English category names. As a result, the entities are assigned to more than 350K classes. YAGO3, which is extracted from Wikipedia dumps of 2013–2014, includes about 4.6M entities.

Both KBs depend solely on Wikipedia. Since it takes some time to update or create the KBs, they do not include up-to-date information. In contrast, the current version of Wikidata can be directly queried and a fresh Wikidata dump is available every week. Another advantage is that Wikidata does not rely on the existence of an infobox or Wikipedia page. Entities and information about the entities can be extracted from Wikipedia or manually entered by any user, meaning that less significant entities that do not warrant their own Wikipedia page are also represented. Since Wikipedia infoboxes are partially populated through templates from Wikidata entries, extracting data from infoboxes instead of Wikidata itself introduces an additional source of errors. Furthermore, unless all language versions of Wikipedia are used as a source, such an approach would even limit the amount of retrieved information due to Wikidata’s inherent multilingual design as the knowledge base behind all Wikipedias (Vrandečić and Krötzsch 2014).

For completeness, Freebase should be mentioned as a fourth available knowledge base that has historically been used as a popular alternative to YAGO and DBpedia. However, efforts have recently been taken to merge it entirely into Wikidata (Pellissier Tanon et al. 2016). Given the need for current, up-to-date entity information in many event-related applications, the fact that Freebase is no longer actively maintained and updated means that it is increasingly ill-suited for such tasks.

3 The NECKAr Tool

The Named Entity Classifier for Wikidata (NECKAr) assigns Wikidata entities to the NE classes person, location, and organization. The tool, which is available as open source code (see the URL in the Abstract), is easy to use and only requires a minimal setup of Python3 packages as well as a MongoDB instance.

Fig. 1. Wikidata data model

Fig. 2. Class hierarchy for river, generated with the Wikidata Graph Builder

3.1 Wikidata Data Model

Wikidata is a free and open knowledge base that is intended to serve as central storage for all structured data of Wikimedia projects. The data model of Wikidata consists primarily of two major components: items and properties. Items represent all things in human knowledge. Each item corresponds to a clearly identifiable concept or object, or to an instance of a concept or object. For example, there is one item for the concept river and one item Neckar, which is an instance of a river. In this context, a concept is the same as a class. All items are denoted by numerical identifiers prefixed with a Q, while properties have numerical identifiers prefixed with P. Properties (P) connect items (Q) to values (V). A pair (P, V) is called a statement. A property classifies the value of a statement. Figure 1 shows a simplified entry for the item Neckar. Here P2043 describes that the value 367 km has to be interpreted as the length of the river. Both items and properties have a label, a description, and (multilingual) aliases. Property entries in Wikidata do not have statements, but data types that restrict what can be given as a property's value. These data types include items, external identifiers (e.g., ISBN codes), URLs, geographic coordinates, strings, or quantities, to name a few.
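As a rough illustration of this model, the following sketch represents the item Neckar as a mapping from properties to values. The dictionary layout and the helper `values` are simplifying assumptions made for this example; they do not reproduce the JSON layout of the Wikidata dump.

```python
# Simplified, hypothetical in-memory view of the item Neckar.
# This is NOT the JSON layout used in the Wikidata dumps.
neckar = {
    "id": "Q1673",
    "label": "Neckar",
    "statements": {
        "P31": ["Q4022"],        # instance of: river
        "P2043": [(367, "km")],  # length: 367 km
    },
}


def values(item, prop):
    """Return all values the item states for a property (empty list if none)."""
    return item["statements"].get(prop, [])


print(values(neckar, "P31"))
print(values(neckar, "P2043"))
```

Each (property, value) pair in `statements` corresponds to one statement in the sense described above.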

Table 1. Location types with corresponding Wikidata root classes for location types and number of subclasses

When we are interested in the classification of items, we need to know which item is an instance of which class. Class membership of an item is predominantly modelled by the property instance of (P31). For example, consider the statement Q1673:P31:Q4022 (Neckar is an instance of river), in which Q4022 can be seen as a class. Classes can be subclasses of other classes, e.g., river is a subclass of watercourse, which is a subclass of land water. Figure 2 shows the subclass graph for river.

The property subclass of (P279) is transitive, meaning that since Neckar is an instance of river, which is a subclass of watercourse, Neckar is implicitly also an instance of watercourse. Due to this transitivity rule, in Wikidata there is no need to specify more than the most specific statement. In other words, there is no statement that directly specifies Neckar to be a geographic location. Thus, we cannot simply extract items that are instances of the general classes. There are, for example, only 1,733 items that are direct instances of geographic location. Instead, we need to extract the transitive hull, that is, all items that are an instance of any subclass of the general class (henceforward root classes). There are several tools available to show and query the class structure of Wikidata. For Fig. 2 we used the Wikidata Graph Builder (WGB) to visualize the class tree. For NECKAr, we make use of the SPARQL-based Wikidata Query Service to extract all subclasses of a root class, e.g., geographic location. Once the subclasses of a root class are identified, we can extract all items that are instances of these subclasses.
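The transitive hull can be sketched as a graph traversal over subclass of edges. The code below is a minimal illustration on a toy edge list and is not NECKAr's implementation; the function names are hypothetical, and the query string only indicates how such a subclass query might be phrased for the Wikidata Query Service (it is an assumption, not the query used by the tool, and it is not executed here).

```python
from collections import defaultdict, deque

# Assumed shape of a subclass query; wdt:P279* follows "subclass of"
# over zero or more steps, so the root class itself is included.
SUBCLASS_QUERY = "SELECT ?subclass WHERE { ?subclass wdt:P279* wd:Q2221906 . }"


def subclasses_of(root, subclass_edges):
    """Return the root plus all direct and indirect subclasses."""
    children = defaultdict(set)
    for child, parent in subclass_edges:
        children[parent].add(child)
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def instances_of(root, subclass_edges, instance_edges):
    """Return all items that are an instance of the root or any subclass."""
    classes = subclasses_of(root, subclass_edges)
    return {item for item, cls in instance_edges if cls in classes}


# Toy data taken from the example in the text (labels instead of Q-ids).
SUBCLASS_EDGES = [("river", "watercourse"), ("watercourse", "land water")]
INSTANCE_EDGES = [("Neckar", "river")]

print(instances_of("land water", SUBCLASS_EDGES, INSTANCE_EDGES))
```

Although Neckar is only stated to be an instance of river, the traversal also finds it as an instance of land water.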

The task is then to find root classes that, together with their subclasses, best represent the predominantly used NE classes location, organization, and person. In the following we describe how items of these classes are extracted and what kind of information we store for each item. For all items, we store the Wikidata ID, the label (the most common name for an item), the links to the English and German Wikipedia, and the description.

3.2 Location Extraction

To extract all locations from Wikidata, we use the root class geographic location (Q2221906). This class is very large and includes 23,383 subclasses. For each location item, we extract the following statements: coordinate location (P625), population (P1082), country (P17), and continent (P30). Additionally, we assign a location type if an item is an instance of a subclass of the root class for that location type (see Table 1).

Table 2. Example of classified entities

In this large set of subclasses of geographic location, we encounter several problems. For example, food is a subclass of geographic location, to which it is connected by a path of length 3 (food \(\rightarrow \) energy storage \(\rightarrow \) storage \(\rightarrow \) geographic location). We cannot simply limit the allowed path length since there are other subclasses with a greater path length that we consider a valid location. For example, the shortest path for village of Japan has a length of 4 (village of Japan \(\rightarrow \) municipality of Japan \(\rightarrow \) municipality \(\rightarrow \) human settlement \(\rightarrow \) geographic location). In this case we decided to exclude the subtree for food, which reduces the number of subclasses considerably to 13,445. However, there might be other subclasses that are not considered a proper location (e.g., arcade video game with the path: arcade video game \(\rightarrow \) arcade game machine \(\rightarrow \) computing platform \(\rightarrow \) computing infrastructure \(\rightarrow \) infrastructure \(\rightarrow \) construction \(\rightarrow \) geographical object \(\rightarrow \) geographic location). For the time being we only exclude the food subclasses. The identified location items can be filtered for a certain application by using the location type or by only using items for which a coordinate location is given.
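Excluding a subtree rather than limiting the path length can be sketched as a set difference between the subclasses of the root and the subclasses of the excluded class. The snippet below is an illustrative assumption about how this could be done, using the two paths from the text as toy data; the function names are hypothetical.

```python
from collections import defaultdict


def descendants(root, subclass_edges):
    """Return the root plus all direct and indirect subclasses."""
    children = defaultdict(set)
    for child, parent in subclass_edges:
        children[parent].add(child)
    found, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def valid_subclasses(root, excluded_roots, subclass_edges):
    """Subclasses of root, minus every excluded class and its subtree."""
    valid = descendants(root, subclass_edges)
    for excluded in excluded_roots:
        valid -= descendants(excluded, subclass_edges)
    return valid


# (subclass, superclass) pairs for the two paths given in the text.
EDGES = [
    ("food", "energy storage"),
    ("energy storage", "storage"),
    ("storage", "geographic location"),
    ("village of Japan", "municipality of Japan"),
    ("municipality of Japan", "municipality"),
    ("municipality", "human settlement"),
    ("human settlement", "geographic location"),
]

locations = valid_subclasses("geographic location", ["food"], EDGES)
print(sorted(locations))
```

Note that this simple difference also drops any class below food that might be reachable through a second, valid path; whether that is desirable depends on the application.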

3.3 Organization Extraction

The root class organization (Q43229) includes 4,811 subclasses, such as nonprofit organization, political organization, team, musical ensemble, newspaper, or state. For each item in this category, we extract additional information such as country (sovereign state of this item, P17), founder (P112), CEO (P169), inception (P571), headquarters location (P159), instance of (P31), official website (P856), and official language (P37).

3.4 Person Extraction

To extract all real-world persons from Wikidata, we only use the class human (Q5) instead of a list of subclasses. In Wikidata, a more specific classification of a person is usually given by the occupation property or by having several instance of statements. All items with the statement is instance of human are classified as person. Fictional characters, such as Homer Simpson or Harry Potter, and deities that are not also classified as human are not extracted. For each person, we gather some basic information: date of birth (dob) (P569), date of death (dod) (P570), gender (P21), occupation (P106), and alternative names.
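The person test therefore reduces to checking an item's instance of statements for Q5. The following sketch assumes that items follow the nested `claims` / `mainsnak` / `datavalue` layout of the Wikidata JSON dumps, with item values carrying a `numeric-id` field; the function names are hypothetical and the two example items are reduced to the fields needed here.

```python
def instance_of_ids(item):
    """Collect the Q-ids of all 'instance of' (P31) statements of an item."""
    ids = set()
    for statement in item.get("claims", {}).get("P31", []):
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        if datavalue is None:  # statements without a concrete value
            continue
        ids.add("Q%d" % datavalue["value"]["numeric-id"])
    return ids


def is_person(item):
    """An item counts as a person if it is a direct instance of human (Q5)."""
    return "Q5" in instance_of_ids(item)


def p31_statement(numeric_id):
    """Build a minimal P31 statement in the assumed dump layout."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": "P31",
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "numeric-id": numeric_id},
            },
        }
    }


# Reduced example items: Douglas Adams (Q42) and the river Neckar (Q1673).
adams = {"id": "Q42", "claims": {"P31": [p31_statement(5)]}}
neckar = {"id": "Q1673", "claims": {"P31": [p31_statement(4022)]}}

print(is_person(adams), is_person(neckar))
```

Because no subclass list is involved, an item whose only class is a fictional or mythological type is never matched by this check.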

3.5 Extracting Links to Other Knowledge Bases

In addition to the above information, we also record identifiers for the items in other publicly available databases (Wikipedia, DBpedia, Integrated Authority File of the German National Library, Internet Movie Database, MusicBrainz, GeoNames, OpenStreetMap). This information is represented in Wikidata as statements and can be extracted analogously to the examples above.

3.6 Extraction Algorithm

The named entity classes to be extracted can be specified in a configuration file. For each chosen class of named entities, the process then works as follows. First, the subclasses of the root class are extracted using the Wikidata SPARQL API. The output of this step is a list of subclasses, from which the invalid subclasses are excluded. For locations, we also generate lists of subclasses for the specific location types. The tool then searches the Wikidata dump (stored in a local MongoDB) for all items that are instances of one of the subclasses in the list and extracts the common features (id, label, description, Wikipedia links). Depending on the named entity class, additional information (see above) is extracted, and for locations, the list of location type subclasses is used to assign a location type. This data is then stored in a new, intermediary MongoDB collection. In a subsequent step, we extract for each item the identifiers that link it to the other databases as described in Sect. 3.5 and store them in a separate collection. In the last step, the data is exported to CSV and JSON files for ease of use.
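The core of these steps can be sketched in a few lines. The example below is a simplified illustration under several assumptions, not the tool's code: items are plain dictionaries held in memory rather than documents in MongoDB, the subclass lists are given as ready-made sets, and all names (`classify`, `CLASS_PROPERTIES`, the field names) are hypothetical. Only the JSON export is shown.

```python
import json

COMMON_FIELDS = ("id", "label", "description")

# Class-specific properties as listed in Sects. 3.2 and 3.4.
CLASS_PROPERTIES = {
    "location": ("P625", "P1082", "P17", "P30"),
    "person": ("P569", "P570", "P21", "P106"),
}


def classify(items, class_subclasses):
    """Emit one record per (item, NE class) match.

    items: iterable of simplified item dicts (assumed layout).
    class_subclasses: NE class name -> set of valid subclass ids.
    """
    records = []
    for item in items:
        for ne_class, subclasses in class_subclasses.items():
            if subclasses & set(item["instance_of"]):
                record = {field: item.get(field) for field in COMMON_FIELDS}
                record["ne_class"] = ne_class
                for prop in CLASS_PROPERTIES.get(ne_class, ()):
                    record[prop] = item.get("statements", {}).get(prop)
                records.append(record)
    return records


# Illustrative input; the description text and P17 value are example data.
ITEMS = [
    {
        "id": "Q1673",
        "label": "Neckar",
        "description": "river in Germany",
        "instance_of": ["Q4022"],
        "statements": {"P17": "Q183"},
    }
]
CLASS_SUBCLASSES = {"location": {"Q4022"}, "person": {"Q5"}}

records = classify(ITEMS, CLASS_SUBCLASSES)
exported = json.dumps(records, indent=2)
print(exported)
```

An item that matches the subclass lists of several classes yields one record per class, which mirrors the multiple assignments discussed in Sect. 4.4.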

4 Wikidata NE Dataset

The Wikidata NE dataset was extracted using the NECKAr tool. For the version that we discuss in this paper, we extracted entities from the Wikidata dump from December 5, 2016, which includes 24,580,112 items and 2,910 distinct properties.

In total, we extracted and classified 8,842,103 items, of which 51.8% are locations, 37.6% persons, and 10.6% organizations. Table 2 shows examples for each named entity class, including the class specific additional information.

4.1 Location Entities

Of the 4,582,947 identified locations, 51% have geographic coordinates. Location types are extracted for 93% of the location items (see Table 3).

Table 3. Number of entities for location types

Most of the classified locations are settlements and territorial entities. We find over 2,400 countries: although there currently are only 206 countries, Wikidata also includes former countries like the Roman Empire, Ancient Greece, or Prussia.

4.2 Person Entities

We extracted 3,322,217 persons, of which 78% are male, 15% female, while for 7% of the persons another gender or no gender is specified. Occupations are given for 66% of the person items, where the largest group are politicians, followed by football players (see Table 4).

Table 4. The five most frequent occupations

Wikidata covers mostly persons from recent history, so 70% of the persons for whom a birth date is given (over 2.5M persons) were born in the 20th century, while around 20% were born in the 19th century.

4.3 Organization Entities

936,939 items were classified as organizations, of which 11% are business enterprises. Table 5 shows the top 5 organization types. Where possible, we also extracted the country in which the organization is based. Figure 3 shows a heatmap of the number of organizations per country. Most organizations are based in the U.S.A., followed by France and Germany. This is partially due to the fact that commune of France and municipality of Germany are subclasses of organization.

Table 5. The five most frequent organization types
Fig. 3. Heatmap of organization frequency by country

4.4 Assignment to More Than One Class

400,856 Wikidata items are assigned to more than one NE class by NECKAr. The vast majority of this subset (over 99%) are members of the two classes location and organization. This is mainly caused by a subclass overlap between the root classes geographic location and organization. In total, they share 1,310 subclasses, e.g., hospital, state or library and their respective subclasses. We do not favour one class over the other, because both interpretations are possible, depending on the context. There are also items that have several instance of statements, which in six cases leads to an assignment to all three classes, e.g., Jean Leon is described as instance of human and instance of winery, which is a subclass of both organization and geographic location. There are 116 items that are classified as person and location or person and organization, which is again caused by multiple instance of statements. In contrast to the subclass overlap between root classes, these cases are caused by incorrect user input into Wikidata.

5 Comparison to YAGO3

In order to get an estimate of the quality of the NECKAr tool, we compare the resulting Wikidata NE dataset to the currently available version of YAGO (Version 3.0.2). When using the YAGO3 hierarchy to classify YAGO3 entities, we find 1,745,219 distinct persons (member of YAGO3 class wordnet_person_100007846), 1,267,402 distinct locations (member of YAGO3 class yagoGeoEntity) and 481,001 distinct organizations (member of YAGO3 class wordnet_social_group_107950920) for a total of 3,493,622 entities in comparison to the 8.8M entities in the Wikidata NE dataset (see Table 6).

YAGO3 entities can be linked to Wikidata entries via their subject id, which corresponds to Wikipedia page names. If a YAGO3 entity is derived from a non-English Wikipedia, the subject id is prefixed with the language code. For 3,430,065 YAGO3 entities we find a corresponding entry in Wikidata (1,715,305 persons, 1,250,409 locations and 464,351 organizations). This subset is the basis for our comparison in the following.

To assess the quality of NECKAr, the well-known IR measures \(F_{1}\)-score, precision (P) and recall (R) are used. Precision is a measure for exactness, that is, how many of the classified entities are classified correctly. Recall measures completeness and gives the fraction of correctly classified entities of all given entities. \(F_{1}\) is the harmonic mean of P and R. The measures are defined as:

$$\begin{aligned} F_{1}= 2\cdot \frac{P\cdot R}{P+R} \qquad P=\frac{TP}{TP+FP} \qquad R=\frac{TP}{TP+FN} \end{aligned}$$

Here, TP (true positives) is the number of YAGO3 entities that NECKAr assigns to the same class, while FP (false positives) is the number of YAGO3 entities that are falsely assigned to that class. \(TP+FP\) represents the number of entities assigned to that class by NECKAr. FN (false negatives) is the number of YAGO3 entities in a given class that NECKAr does not assign to that class (these entities might be assigned to a different class or to no class). Thus, \(TP+FN\) is the number of YAGO3 entities in a given class. Using these standard metrics, we obtain an overall \(F_{1}\)-score of 0.88 with \(P=0.90\) and \(R=0.86\) (see Table 7). The lower recall is due to the fact that NECKAr does not classify all entries that are a person, location, or organization entity in YAGO3. Only about 88% of the YAGO3 entities that correspond to Wikidata entries are classified. For example, NECKAr does not find Pearson, a town in Victoria, Australia, because the Wikidata entry does not include any is instance of relation. This is true for 290,905 of the 387,259 entities (75.12%) that are not classified by NECKAr. The remaining entities are missed by NECKAr for a couple of reasons. In some cases, the correct is instance of relation is not given in Wikidata. In others, a relevant subclass or property may not have been included. Finally, since YAGO3 was automatically extracted and not every fact was checked for correctness, it contains some erroneous claims or classifications. For example, some overview articles in Wikipedia are classified as entities in YAGO, such as Listed_buildings_in or Index of. The original evaluation of YAGO3 lists the fraction of incorrect facts that it contains as 2% (Mahdisoltani et al. 2015).
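As a quick consistency check of these definitions, the sketch below computes the three measures from counts and recomputes \(F_{1}\) from the rounded precision and recall values reported in this section (overall and per class in Sects. 5.1 to 5.3). The function names are hypothetical, and the counts in the last call are made-up toy numbers rather than values from the evaluation.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute P, R and F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded (P, R) pairs as reported in the text.
REPORTED = {
    "overall": (0.90, 0.86),
    "location": (0.93, 0.84),
    "person": (0.99, 0.95),
    "organization": (0.54, 0.60),
}

for name, (p, r) in REPORTED.items():
    print(name, round(f1_from(p, r), 2))

# Toy counts, for illustration only.
print(precision_recall_f1(tp=9, fp=1, fn=1))
```

Recomputed from the rounded values, the scores agree with the reported \(F_{1}\) values of 0.88, 0.88, 0.97 and 0.57 to two decimal places.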

Table 6. Number of entities per class in the Wikidata NE dataset created by NECKAr, YAGO3 and the intersection of YAGO3 and Wikidata
Table 7. Evaluation results (F\(_{1}\) score, Precision (P) and Recall (R)) for the Wikidata NE dataset created by NECKAr in comparison to YAGO3

5.1 Location Comparison

For Location, NECKAr achieves an \(F_{1}\)-score of 0.88 (P = 0.93 and R = 0.84). 170,869 YAGO3 locations were not classified, of which 81% have no entry in Wikidata for the instance of property.

Of the entities that are assigned to a different class, NECKAr classified 97.6% as Organization instead of Location. Most of these entities (85%) are radio or television stations for which a classification into either class is a matter of debate. These items are described in Wikidata as instance of radio station or television station, which are subclasses of organization. The majority of the FPs for locations are assigned by NECKAr to two classes (organization and location), whereas in YAGO3 these are only organizations.

5.2 Person Comparison

For entities of class Person, we obtain the highest \(F_{1}\)-score of all classes with 0.97 (P = 0.99 and R = 0.95). Most of the entities that NECKAr assigned to a different class (90% to organization, 10% to location) are bands or musical ensembles, which are classified as organization.

5.3 Organization Comparison

The class Organization shows the lowest \(F_{1}\)-score of 0.57 (P = 0.54 and R = 0.60). The low precision is caused by the high number of false positives. As discussed in the previous sections, many entities that are classified as Persons or Locations by YAGO3 are classified as organizations by NECKAr. The low recall is due to the fact that 156,926 YAGO3 organizations were not identified by NECKAr. Again, the majority of these items (82%) have no is instance of relation in Wikidata, so NECKAr was not able to classify them. The reason for the missing 18% warrants future investigation in more detail, as it is possible that an important subclass or property was excluded. 29,625 YAGO3 organizations were assigned to another class, 96% to Location and 4% to Person. Many of these items are constituencies or administrative units, which could be seen as organizations and/or locations.

In summary, we find that the application of NECKAr to Wikidata produces a set of classified entities that is comparable in quality to a well known and widely used knowledge base. However, in contrast to existing knowledge bases, which are not updated regularly, NECKAr can be used to extract substantially more entities and up-to-date lists of persons, locations and organizations. Since NECKAr can be applied to weekly dumps of Wikidata, it can be used to extract a reproducible resource for subsequent IE tasks.

6 Conclusion and Future Work

In this paper, we introduced the NECKAr tool for assigning NE classes to Wikidata items. We discussed the data model of Wikidata and its class hierarchy. The resulting NE dataset offers the simple classification of entities into locations, organizations and persons that is often used in IE tasks. The dataset includes basic, class-specific information on each item and links the items to other linked open data sets. The clear and lightweight structure makes the dataset a valuable and easy-to-use resource. Much of the original, more fine-grained classification is preserved and can be used to create application-specific subsets. A comparison to YAGO3 showed that NECKAr is able to create state-of-the-art lists of entities with the added advantage of providing larger and more recent data.

Based on these results, we are further investigating the Wikidata class hierarchy in order to reduce the number of incorrect or multiple assignments. We are also working on an automated process to provide the Wikidata NE dataset on a monthly basis. In future releases of NECKAr, we plan to include the option of choosing between a Wikidata dump and the SPARQL API as source for obtaining the entity data.