
1 Introduction

Knowledge graphs like DBpedia [13] and YAGO [14] contain huge amounts of high-quality data on various topical domains. Unfortunately, as their application to real-world tasks shows, they are far from complete: IBM’s DeepQA system uses both of them to answer Jeopardy! questions [11]. While the component that uses this structured information gives correct answers 87% of the time (compared to 70% correctness of the complete system), it is able to provide answers for only 2.3% of the questions posed to it. Given that another analysis finds that around 96% of the answers to a sample of 3,500 Jeopardy! questions correspond to Wikipedia titles [3], it is safe to say that there is a lot of information in Wikipedia yet to be extracted.

Fig. 1. Excerpt of the Wikipedia page List of Japanese speculative fiction writers displaying the subjects in an enumeration layout.

While Wikipedia’s infoboxes and categories have already been the subject of many information extraction efforts for knowledge graphs, list pages have received very little attention despite their obvious wealth of information. For the entities on the page List of Japanese speculative fiction writers (shown in Fig. 1), we can derive several bits of information: (type, Writer), (nationality, Japan), and (genre, Speculative Fiction).

In contrast to finding entities of a category, finding such entities among all the entities mentioned in a list page is a non-trivial problem. We will refer to these entities, which are instances of the concept expressed by the list page, as its subject entities. Unlike categories, list pages are an informal construct in Wikipedia. Hence, the identification of their subject entities brings up several challenges: While list pages are usually formatted as enumerations or tables, there is no convention for how the information in them is structured. For example, subject entities can be listed somewhere in the middle of a table (instead of in the first column), and enumerations can have multiple levels. Furthermore, context information may not be available (it is difficult to find Japanese speculative fiction writers in a list if one does not know to look for writers).

In this paper, we introduce an approach for identifying subject entities in Wikipedia list pages and provide the following contributions in particular:

  • An approach for the construction of a combined taxonomy of Wikipedia categories, lists and DBpedia types.

  • A distantly supervised machine learning approach for the extraction of subject entities from Wikipedia list pages.

  • 700k new entities, 7.5M additional type statements, and 3.8M additional facts for DBpedia that are published as RDF triples and as a standalone knowledge graph called CaLiGraph.

The rest of this paper is structured as follows. Section 2 situates the approach described in this paper within related work. Section 3 introduces the idea of entity extraction from list pages, followed by a description of our approach in Sect. 4. In Sect. 5, we discuss results and present an empirical evaluation of our approach. We close with a summary and an outlook on future developments.

2 Related Work

The extraction of knowledge from structured elements in Wikipedia is mostly focused on two fields: firstly, taxonomy induction, where most approaches use the category graph of Wikipedia to derive a taxonomy, and, secondly, the application of information extraction methods to derive facts from various (semi-)structured sources like infoboxes, tables, lists, or abstracts of Wikipedia pages.

The approach of Ponzetto and Navigli [18] was one of the first to derive a large taxonomy from Wikipedia categories by focusing on the lexical head of a category. They exploit the fact that almost exclusively categories with plural lexical heads are useful elements of a taxonomy. Hence, they are able to clean the category graph of non-taxonomic categories and relationships. Several other approaches create a combined taxonomy of the category graph and additional resources like WordNet (YAGO [14]) or Wikipedia pages (WiBi [4]).

The distant supervision paradigm [15] is used extensively for information extraction in Wikipedia as it provides an easy way to automatically gather large amounts of training data with a low error rate. Usually, some form of knowledge base is used as background knowledge to generate training data from a target corpus. In the original work, Mintz et al. use Freebase as background knowledge to extract information from Wikipedia. [1] extend this approach by using DBpedia as background knowledge.

Regarding list pages, Paulheim and Ponzetto [17] outline their general potential as a source of knowledge in Wikipedia. They propose to use a combination of statistical and NLP methods to extract knowledge and show that, by applying them to a single list page, they are able to extract a thousand new statements. [12] infer types for entities on list pages and are thus most closely related to our approach. To identify subject entities of the list pages, they rely on information from DBpedia (e.g. how many relations exist between entities on the list page) and are consequently only able to infer new types for existing DBpedia entities. They use a score inspired by TF-IDF to find the type of a list page and are able to extract 303,934 types from 2,000 list pages with an estimated precision of 86.19%.

Apart from list pages, entity and relation extraction in Wikipedia is applied to structured sources like infoboxes [21], abstracts [8, 19], and tables [2, 16].

By exploiting the structured nature of list pages to extract previously unknown entities as well as factual information about them, we see our approach as a useful addition to the existing literature, which focuses primarily on enriching the ontology or adding information about existing entities.

3 Categories and List Pages in Wikipedia

The Wikipedia Category Graph (WCG) has been used extensively for taxonomy induction (e.g. in [4, 14]) and has proven to yield highly accurate results. The WCG has a subgraph consisting of list categories, which organizes many of the list pages in Wikipedia. The list page List of Japanese speculative fiction writers (Fig. 1), for example, is a member of the list category Lists of Japanese writers, which in turn has the parent list category Lists of writers by nationality, and so on.

As this subgraph is part of the WCG, we can use the list categories as a natural extension of a taxonomy induced by the WCG (e.g., by linking Lists of Japanese writers to the respective category Japanese writers). This comes with the benefit of including list pages into the taxonomy (i.e., we can infer that List of Japanese speculative fiction writers is a sub-concept of the category Japanese writers). Despite their obvious potential, neither list categories nor list pages have yet explicitly been used for taxonomy induction.

In each list page, some of the links point to entities in the category the list page reflects, while others do not. In the list page List of Japanese speculative fiction writers, for example, some links point to pages about such writers (i.e. to its subject entities), while others point to specific works by those writers. To distinguish these two cases, the unifying taxonomy is of immense value. Through the hierarchical relationships between categories and list pages, we can infer that if an entity is mentioned in both a list page and a related category, it is very likely a subject entity of the list page. Consequently, if an entity is mentioned in the list page List of Japanese speculative fiction writers and is contained in the category Japanese writers, it is almost certainly a Japanese speculative fiction writer.

In the remainder of this section we provide necessary background information of the resources used in our approach.

The Wikipedia Category Graph. In the version of October 2016, the WCG consists of 1,475,015 categories that are arranged in a directed, but not acyclic, graph. This graph contains not only categories used for the categorisation of content pages, but also ones used for administrative purposes (e.g., the category Wikipedia articles in need of updating). Similar to [9], we use only transitive subcategories of the category Main topic classifications while also discarding categories that have one of the following keywords in their name: wikipedia, lists, template, stub.
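
A minimal sketch of this filtering step is given below, assuming the WCG is available as a networkx DiGraph whose edges point from parent category to subcategory; the root name and keyword list follow the description above, while all other details are illustrative.

```python
import networkx as nx

ADMIN_KEYWORDS = ("wikipedia", "lists", "template", "stub")

def filter_categories(wcg: nx.DiGraph, root: str = "Main topic classifications") -> set:
    """Keep only transitive subcategories of the root, dropping administrative ones."""
    reachable = nx.descendants(wcg, root)  # all transitive subcategories of the root
    return {cat for cat in reachable
            if not any(kw in cat.lower() for kw in ADMIN_KEYWORDS)}
```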

The resulting filtered set of categories \(\mathcal {C}^F\) contains 1,091,405 categories that are connected by 2,635,718 subcategory edges. We denote the set of entities in a category c with \(\mathcal {E}_c\), the set of all types in DBpedia with \(\mathcal {T}\) and the set of types of an entity e with \(\mathcal {T}_e\).

The Wikipedia List Graph. The set of list categories \(\mathcal {C}^L\) contains 7,297 list categories (e.g., Lists of People), connected by 10,245 subcategory edges (e.g., Lists of Celebrities being a subcategory of Lists of People). The set of list pages \(\mathcal {L}\) contains 94,562 list pages. Out of those, 75,690 are contained in at least one category in \(\mathcal {C}^F\) (e.g., List of Internet Pioneers is contained in the category History of the Internet), 70,099 are contained in at least one category in \(\mathcal {C}^L\) (e.g., List of Internet Pioneers is contained in the category Lists of Computer Scientists), and 90,430 are contained in at least one of the two.

The Anatomy of List Pages. List pages can be categorised into one of three possible layout types [12]: 44,288 pages list entities in a bullet point-like enumeration. The list page List of Japanese speculative fiction writers in Fig. 1 lists the subject entities in an enumeration layout. In this case, the subject entities are most often mentioned at the beginning of an enumeration entry. As some exceptions on the page show, however, this is not always the case.

Fig. 2. Excerpt of the Wikipedia page List of Cuban-American writers displaying the subjects in a table layout.

46,160 pages list entities in a table layout. An example of this layout is given in Fig. 2, where an excerpt of the page List of Cuban-American writers is shown. The respective subjects of the rows are listed in the first column, but this can also vary between list pages.

The remaining 4,114 pages do not have a consistent layout and are thus categorised as undefined. As our approach relies heavily on the structured nature of a list page, we exclude list pages with an undefined layout from our extraction pipeline.

For a list page l, we define the task of identifying its subject entities \(\mathcal {E}_l\) among all the mentioned entities \(\mathcal {\widehat{E}}_l\) in l as a binary classification problem. A mentioned entity is either classified as being a subject entity of l or not. If not, it is usually mentioned in the context of an entity in \(\mathcal {E}_l\) or for organisational purposes (e.g. in a See also section). Looking at Figs. 1 and 2, mentioned entities are marked in blue (indicating that they have their own Wikipedia page and are thus contained in DBpedia) or in red (indicating that they do not have a Wikipedia page and are consequently not contained in DBpedia). Additionally, we could include possible entities that are not tagged as such in Wikipedia (e.g. Jesús J. Barquet in the first column of Fig. 2), but we leave this for future work as it introduces additional complexity to the task. Of the three types of possible entities, the latter two are the most interesting as they would add the most information to DBpedia. However, it is also beneficial to identify entities that are already contained in DBpedia, because we might be able to derive additional information about them through the list page they are mentioned in.

Note that for both layout types, enumeration and table, we find at most one subject entity per enumeration entry or table row. We inspected a subset of \(\mathcal {L}\) and found this pattern to occur in every one of them.

Learning Category Axioms with Cat2Ax. The approach presented in this paper uses axioms over categories to derive a taxonomy from the category graph. Cat2Ax [9] is an approach that derives two kinds of axioms from Wikipedia categories: type axioms (e.g., for the category Japanese writers it learns that all entities in this category are of the type Writer), and relation axioms (e.g., for the same category it learns that all entities have the relation (nationality, Japan)). The authors use statistical and linguistic signals to derive the axioms and report a correctness of 96% for the derived axioms.

4 Distantly Supervised Entity Extraction from List Pages

The processing pipeline for the retrieval of subject entities from list pages in \(\mathcal {L}\) is summarized in Fig. 3. The pipeline consists of two main components: In the Training Data Generation we create a unified taxonomy of categories, lists, and DBpedia types. With distant supervision we induce positive and negative labels from the taxonomy for a part of the mentioned entities of list pages.

Fig. 3. Overview of the pipeline for the retrieval of subject entities from list pages. Small cylindrical shapes next to a step indicate the use of external data, and large cylindrical shapes contain data that is passed between pipeline steps.

The resulting training data is passed to the Entity Classification component. There we enrich it with features extracted from the list pages and learn classification models to finally identify the subject entities.

4.1 Training Data Generation

Fig. 4. Possible invalid nodes and edges (marked in red) in the category graph (circles) and list graph (rectangles).

Step 1: Cleaning the Graphs

The initial category graph (\(\mathcal {C}^F\) as nodes, subcategory relations as edges) and the initial list graph (\(\mathcal {C}^L\) and \(\mathcal {L}\) as nodes, subcategory relations and category membership as edges) both contain nodes and edges that have to be removed in order to convert them into valid taxonomies. Potential problems are shown in abstract form in Fig. 4 and by example in Fig. 5. In particular, we have to remove nodes that do not represent proper taxonomic types (e.g. London in Fig. 5). Additionally, we have to remove edges that either do not express a valid subtype relation (e.g. the edge from Songs to Song awards in Fig. 5) or create cycles (e.g. the self-references in Fig. 4).

For the removal of non-taxonomic nodes we rely on the observation made by [18] that a Wikipedia category is a valid type in a taxonomy if its head noun is in the plural. Consequently, we identify the head nouns of the nodes in the graph and remove all nodes with singular head nouns.
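
The following sketch shows how such a plurality check could look using spaCy; the choice of parser and model is an assumption, not necessarily what our implementation uses.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English model

def has_plural_head(name: str) -> bool:
    """Return True if the lexical head of a category or list title is a plural noun."""
    doc = nlp(name)
    head = doc[:].root  # syntactic root of the phrase, i.e. its lexical head
    return head.tag_ in ("NNS", "NNPS")  # plural common / proper noun

# has_plural_head("Japanese writers") -> True  (kept as a taxonomic node)
# has_plural_head("London")           -> False (removed)
```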

For the removal of invalid edges we first apply a domain-specific heuristic to get rid of non-taxonomic edges and subsequently apply a graph-based heuristic that removes cycles in the graphs. The domain-specific heuristic is based on [18]: An edge is removed if the head noun of the parent is not a synonym or a hypernym of the child’s head noun. In Fig. 5 the head nouns of nodes are underlined; we remove, for example, the edge from Songs to Song awards as the word songs is neither a synonym nor a hypernym of awards.

Fig. 5. Examples of non-taxonomic nodes and edges (marked in red) that must be removed from the respective category graph or list graph.

We decide on synonym and hypernym relationships by a majority vote over three sources: (1) We parse the corpus of Wikipedia for Hearst patterns [6]. (2) We extract them from WebIsALOD [10], a large database of hypernyms crawled from the Web. (3) We extract them directly from categories in Wikipedia. To that end, we apply the Cat2Ax approach [9], which computes robust type and relation axioms for Wikipedia categories from linguistic and statistical signals. For every edge in the category graph, we extract a hypernym relationship between the head noun of the parent and the head noun of the child if we find matching axioms for both parent and child. For example, if we find the axiom that every entity in the category People from London has the DBpedia type Person and we find the same axiom for Criminals from London, then we extract a hypernym relation between People and Criminals.
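
A simplified sketch of this majority vote is shown below; representing each source as a set of accepted (hypernym-or-synonym, hyponym) head-noun pairs is an assumption made for illustration.

```python
def edge_is_taxonomic(parent_head: str, child_head: str, sources: list) -> bool:
    """Majority vote over evidence sources, each given as a set of accepted
    (hypernym-or-synonym, hyponym) head-noun pairs."""
    votes = sum((parent_head, child_head) in source for source in sources)
    return votes >= 2  # at least two of the three sources must agree

# sources = [hearst_pairs, webisalod_pairs, cat2ax_pairs]   (hypothetical variable names)
# the edge Songs -> Song awards is kept only if edge_is_taxonomic("songs", "awards", sources)
```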

As a graph-based heuristic to resolve cycles, we detect edges that are part of a cycle and remove those pointing from a deeper node to a higher node in the graph. If cycles cannot be resolved because edges point between nodes on the same depth level, those edges are removed as well.
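
A sketch of this cycle-resolution heuristic with networkx follows; it assumes a single root node from which node depths can be computed, which is a simplification of the actual graphs.

```python
import networkx as nx

def break_cycles(graph: nx.DiGraph, root: str) -> None:
    """Remove cycle edges that point from a deeper node to a higher (or equally deep) node."""
    depth = nx.shortest_path_length(graph, source=root)  # node -> distance from the root
    while not nx.is_directed_acyclic_graph(graph):
        cycle = nx.find_cycle(graph)  # one remaining cycle as a list of (u, v) edges
        for u, v in cycle:
            if depth.get(u, 0) >= depth.get(v, 0):  # edge points upwards or sideways
                graph.remove_edge(u, v)
                break
```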

Through the cleaning procedure we reduce the size of the category graph from 1,091,405 nodes and 2,635,718 edges to 738,011 nodes and 1,324,894 edges, and we reduce the size of the list graph from 77,396 nodes and 105,761 edges to 77,396 nodes and 95,985 edges.

Step 2: Combining Categories and Lists

For a combined taxonomy of categories and lists, we find links between lists and categories based on linguistic similarity and existing connections in Wikipedia. As Fig. 6 shows, we find two types of links: equivalence links and hypernym links. We identify the former by looking for category-list pairs that are either named similarly (e.g. Japanese writers and Lists of Japanese writers) or are synonyms (e.g. Media in Kuwait and Lists of Kuwaiti media). With this method we find 24,383 links.
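
The sketch below illustrates the "named similarly" case of equivalence detection; the synonym case (e.g. Media in Kuwait vs. Lists of Kuwaiti media) additionally requires the synonym resources from Step 1 and is omitted here.

```python
import re

def normalize(title: str) -> str:
    """Strip the 'List(s) of' prefix and lowercase the remainder for comparison."""
    return re.sub(r"^lists? of\s+", "", title.strip().lower())

def is_equivalence_link(category: str, list_title: str) -> bool:
    # e.g. "Japanese writers" vs. "Lists of Japanese writers"
    return normalize(category) == normalize(list_title)
```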

Fig. 6. Possible connections between the category graph and the list graph.

We extract a hypernym link (similar to the method that we applied in Step 1) if the head noun of a category is a synonym or hypernym of a list’s head noun. However, we limit the candidate links to existing edges in Wikipedia (i.e. the subcategory relation between a list category and a category, or the membership relation between a list page and a category) in order to avoid false positives. With this method we find 19,015 hypernym links. By integrating the extracted links into the two graphs, we create a category-list graph with 815,543 nodes (738,011 categories, 7,416 list categories, 70,116 list pages) and 1,463,423 edges.

Fig. 7. Extension of the category-list taxonomy with DBpedia as a backbone.

Step 3: Deriving a Taxonomy with DBpedia as Backbone

As a final step, we connect the category-list graph with the DBpedia taxonomy (as depicted in Fig. 7). To achieve that, we again apply the Cat2Ax approach to our current graph to produce type axioms for the graph nodes. For example, we discover the axiom that every entity in the category Japanese writers has the DBpedia type Writer; thus, we use that type as a parent of Japanese writers. Taking the transitivity of the taxonomy into account, we find a DBpedia supertype for 88% of the graph’s nodes.

Step 4: Labeling the Entity Mentions

We parse all entity mentions in list pages directly from the Wikitext using the dumps provided by DBpedia and using WikiTextParser as a markup parser.
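
A hedged sketch of the mention extraction with WikiTextParser is shown below; it returns, per enumeration entry or table row, the titles of the linked entities and glosses over details such as nested enumerations and column positions.

```python
import wikitextparser as wtp

def extract_mentions(wikitext: str) -> list:
    """Return one list of linked entity titles per enumeration entry / table row."""
    parsed = wtp.parse(wikitext)
    entries = []
    for wikilist in parsed.get_lists():          # enumeration layout
        for item in wikilist.items:
            entries.append([link.title for link in wtp.parse(item).wikilinks])
    for table in parsed.tables:                  # table layout
        for row in table.data():
            links = []
            for cell in row:
                if cell:
                    links.extend(link.title for link in wtp.parse(cell).wikilinks)
            entries.append(links)
    return entries
```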

We compute the training data for mentioned entities \(\mathcal {\widehat{E}}_l\) of a list page l directly from the taxonomy. To that end, we define two mapping functions:

$$\begin{aligned} related: \mathcal {L} \rightarrow P(\mathcal {C}^F) \end{aligned}$$
(1)
$$\begin{aligned} types: \mathcal {L} \rightarrow P(\mathcal {T}) \end{aligned}$$
(2)

The function related(l) from Definition 1 returns the subset of \(\mathcal {C}^F\) that contains the taxonomically equivalent or most closely related categories for l. For example, related(List of Japanese speculative fiction writers) returns the category Japanese writers and all its transitive subcategories (e.g. Japanese women writers). To find related(l) of a list page l, we traverse the taxonomy upwards starting from l until we find a category c that is contained in \(\mathcal {C}^F\). Then we return c and all of its children.
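
The traversal can be sketched as follows, assuming the taxonomy is a networkx DiGraph with edges pointing from parent to child and `categories` holding \(\mathcal {C}^F\).

```python
import networkx as nx

def related(taxonomy: nx.DiGraph, list_page: str, categories: set) -> set:
    """Walk upwards from the list page until categories from C^F are reached,
    then return those categories together with their transitive subcategories."""
    result, frontier, seen = set(), [list_page], set()
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        for parent in taxonomy.predecessors(node):  # edges point from parent to child
            if parent in categories:
                result.add(parent)
                result |= nx.descendants(taxonomy, parent) & categories
            else:
                frontier.append(parent)
    return result
```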

With this mapping, we assign positive labels to entity mentions in l, if they are contained in a category in related(l):

$$\begin{aligned} \mathcal {\widehat{E}}^+_l = \left\{ e | e \in \mathcal {\widehat{E}}_l \wedge \exists c \in related(l) : e \in \mathcal {E}_c \right\} \end{aligned}$$
(3)

In the case of List of Japanese speculative fiction writers, \(\mathcal {\widehat{E}}^+_l\) contains all entities that are mentioned on the list page and are members of the category Japanese writers or one of its subcategories.
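
Expressed as a sketch, with `members` mapping each category to its entity set \(\mathcal {E}_c\) (an assumed data structure):

```python
def positive_mentions(mentions: set, related_cats: set, members: dict) -> set:
    """Eq. (3): a mention is labeled positive if it belongs to at least one related category."""
    return {e for e in mentions
            if any(e in members.get(c, ()) for c in related_cats)}
```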

The function types(l) from Definition 2 returns the subset of the DBpedia types \(\mathcal {T}\) that best describes entities in l. For example, types(List of Japanese speculative fiction writers) returns the DBpedia types Agent, Person, and Writer. To find types(l), we retrieve all ancestors of l in the taxonomy and return those contained in \(\mathcal {T}\).

With this mapping, we assign a negative label to an entity e mentioned in l, if there are types in \(\mathcal {T}_e\) that are disjoint with types in types(l):

$$\begin{aligned} \mathcal {\widehat{E}}^-_l = \left\{ e | e \in \mathcal {\widehat{E}}_l \wedge \exists t_e \in \mathcal {T}_e, \exists t_l \in types(l) : disjoint(t_e,t_l) \right\} \end{aligned}$$
(4)

To identify disjointnesses in Eq. 4, we use the disjointness axioms provided by DBpedia as well as additional ones that are computed by the methods described in [20]. DBpedia declares, for example, the types Person and Building as disjoint, and the type Person is contained in types(List of Japanese speculative fiction writers). Consequently, we label any mentions of buildings in the list page as negative examples.
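
A corresponding sketch of the negative labeling, with the disjointness axioms stored as unordered type pairs (an assumed representation):

```python
def negative_mentions(mentions: set, lp_types: set, entity_types: dict,
                      disjoint_pairs: set) -> set:
    """Eq. (4): a mention is labeled negative if one of its DBpedia types is
    disjoint with one of the list page's types; `disjoint_pairs` holds
    frozensets of disjoint type pairs."""
    return {e for e in mentions
            if any(frozenset((t_e, t_l)) in disjoint_pairs
                   for t_e in entity_types.get(e, ())
                   for t_l in lp_types)}
```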

In addition to the negative entity mentions that we retrieve via Eq. 4, we label entities as negative using the observation we have made in Sect. 3: As soon as we find a positive entity mention in an enumeration entry or table row, we label all the remaining entity mentions in that entry or row as negative.
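
This heuristic can be sketched as follows, with `rows` holding the ordered entity mentions of each enumeration entry or table row:

```python
def negatives_from_rows(rows: list, positives: set) -> set:
    """Once a positive mention is found in an entry/row, label the remaining
    mentions of that entry/row as negative."""
    extra_negatives = set()
    for row in rows:  # ordered mentions of one enumeration entry / table row
        for i, mention in enumerate(row):
            if mention in positives:
                extra_negatives.update(m for m in row[i + 1:] if m not in positives)
                break
    return extra_negatives
```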

For enumeration list pages, we find a total of 9.6M entity mentions. Of those we label 1.4M as positive and 1.4M as negative. For table list pages, we find a total of 11M entity mentions. Of those we label 850k as positive and 3M as negative.

4.2 Entity Classification

Step 5: Generating the Features

For a single data point (i.e. the mention of an entity in a specific list page), we generate a set of features that is shown in Table 1. Shared features are created for entity mentions of both enumeration and table list pages.

Table 1. Features of the machine learning model grouped by list page type and by feature type. Page features are computed for the complete list page (LP) and do not vary between entity mentions. For page features, we include standard deviations as features in addition to averages.

Features of the type Page encode characteristics of the list page and are hence similar for all entity mentions of the particular page. Features of the types Positional, Linguistic, and Custom describe the characteristics of a single entity mention and its immediate context.
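
The sketch below illustrates how such features could be assembled for a single mention; the actual feature set is the one listed in Table 1, and all feature names here are hypothetical.

```python
from statistics import mean, pstdev

def page_features(entries: list) -> dict:
    """Page-level statistics, identical for all mentions of one list page."""
    counts = [len(entry) for entry in entries]        # mentions per entry/row
    return {"avg_mentions_per_entry": mean(counts),
            "std_mentions_per_entry": pstdev(counts)}

def mention_features(mention: str, entry: list, page: dict) -> dict:
    """Feature vector of a single entity mention (illustrative names only)."""
    return {
        "position_in_entry": entry.index(mention),    # positional
        "is_first_mention": entry.index(mention) == 0,
        "mention_token_count": len(mention.split()),  # linguistic
        "starts_uppercase": mention[:1].isupper(),
        **page,                                       # page-level features
    }
```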

Step 6: Learning the Classification Model

To find a suitable classification model, we conduct an initial experiment on six classifiers (shown in Table 2) and compare them with the obvious baseline of always picking the first entity mention in an enumeration entry or table row. We compute the performance using 10-fold cross validation while taking care that all entity mentions of a list page are in the same fold. In each fold, we use 80% of the data for training and 20% for validation. For all the mentioned classifiers, we report their performances after tuning their most important parameters with a coarse-grained grid search.
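
The evaluation setup can be approximated as sketched below, using scikit-learn's GroupShuffleSplit to obtain ten 80/20 splits in which all mentions of a list page stay on the same side; the classifier parameters are illustrative, not the tuned ones.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import precision_recall_fscore_support

def evaluate(X: np.ndarray, y: np.ndarray, page_ids: np.ndarray) -> list:
    """Ten folds with an 80/20 split each; groups keep list pages intact."""
    splitter = GroupShuffleSplit(n_splits=10, train_size=0.8, random_state=42)
    scores = []
    for train_idx, test_idx in splitter.split(X, y, groups=page_ids):
        model = xgb.XGBClassifier(n_estimators=200, max_depth=6)  # illustrative parameters
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        p, r, f1, _ = precision_recall_fscore_support(y[test_idx], pred, average="binary")
        scores.append((p, r, f1))
    return scores
```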

Table 2. Precision (P), Recall (R), and F-measure (F1) in percent for the positive class (i.e. true subject entities in a list page) of various classification models.

Table 2 shows that all applied approaches outperform the baseline in terms of precision. The XG-Boost algorithm scores highest in terms of precision while maintaining a rather high level of recall. Since we want to identify entities in list pages with the highest possible precision, we use the XG-Boost model. After a fine-grained parameter tuning, we train models with a precision of 91% and 90%, and a recall of 82% and 55%, for enumeration and table list pages, respectively. Here, we split the dataset into 60% training, 20% validation, and 20% test data.

5 Results and Discussion

Entities. In total, we extract 1,549,893 subject entities that exist in DBpedia already. On average, an entity is extracted from 1.86 different list pages. Furthermore, we extract 754,115 subject entities that are new to DBpedia (from 1.07 list pages on average). Based on the list pages they have been extracted from, we assign them DBpedia types (i.e., the supertypes of the list page in the derived taxonomy). Figure 8 shows the distribution of new entities over various high-level types.

Fig. 8. Distribution of entities that are added to DBpedia based on high-level types.

Fig. 9. The 15 most important features used by XG-Boost grouped by respective feature type.

Entity Types. Overall, we generate 7.5M new type statements for DBpedia: 4.9M for entities in DBpedia (we assign a type to 2M previously untyped entities), and 2.6M for new entities (we find an average of 3.5 types per new entity). This is an increase of 51.15% in DBpedia’s total number of type statements. We especially generate statements for types that are rather specific, i.e., deep in the ontology. Adding all the generated type statements to DBpedia, the average type depth increases from 2.9 to 2.93. For new entities, we have an average type depth of 3.06. Figure 10 shows the increase in type statements for the subtypes of the DBpedia type Building. For almost all of them, we increase the number of type statements by several orders of magnitude.

Entity Facts. Besides type statements, we also infer relational statements using the relation axioms that we generated via Cat2Ax. In total, we generate 3.8M relational statements: 3.3M for existing entities in DBpedia and 0.5M for new entities. For some previously unknown entities we discover quite a large number of facts. For the moth species Rioja, for example, we discover the type Insect and information about its class, family, order, and phylum. For Dan Stulbach we discover the type Person and information about his birth place, occupation, and alma mater.

Evaluation. We evaluate the correctness of both the constructed taxonomy and the inferred statements.

To validate the taxonomy we conducted an evaluation on the crowd-sourcing platform Amazon MTurk. We randomly sampled 2,000 edges of the taxonomy graph and asked three annotators per edge whether the edge is taxonomically correct. The edges were evaluated as correct in 96.25% (±0.86%) of the cases using majority vote (with an inter-annotator agreement of 0.66 according to Fleiss’ kappa [5]).

Fig. 10. Comparison of the number of type statements that are currently in DBpedia with additional statements found by our approach for all subtypes of the DBpedia type Building.

The correctness of the inferred type and relation statements is strongly dependent on the Cat2Ax approach, as we use its axioms to generate the statements. The authors of Cat2Ax report a correctness of 96.8% for type axioms and 95.6% for relation axioms. For the resulting type and relation statements (after applying the axioms to the entities of the categories) they report a correctness of 90.8% and 87.2%, respectively. However, the original Cat2Ax approach does not rely on a complete taxonomy of categories but computes axioms for individual categories without considering hierarchical relationships between them. In contrast, we include information about the subcategories of a given category while generating the axioms. An inspection of 1,000 statements by the authors yields a correctness of 99% (±1.2%) for existing and 98% (±1.7%) for new type statements, and 95% (±2.7%) for existing and 97% (±2.1%) for new relation statements.

Classification Models. With values of 91% and 90%, the precision of the classification models is significantly lower than the correctness of the extracted type and relation statements. At first glance this is a contradiction because, although the models extract entities and not statements, a statement is obviously incorrect if it has been created for the wrong entity. But we have to consider that the training data, which was used to train and evaluate the models, has been created using distant supervision. Hence, it is likely to contain errors (e.g. due to wrong inheritance relations in the taxonomy). The fact that the final output of the processing pipeline has a higher correctness than the evaluation results of the models imply indicates that the models are in fact able to learn meaningful patterns from the training data.

Figure 9 shows the feature types of the 15 features that have the highest influence on the decision of the final XG-Boost models. Almost all of them are features of the type Page, i.e. features that describe the general shape of the list page the entities are extracted from. Features of the other types, which describe the immediate context of an entity, are used only sparsely. This might be an indicator that, to bridge the gap in recall between the classification models and the baseline, we have to develop models that can make better use of the structure of a list page. Accordingly, we see the biggest potential in an adapted machine learning approach that, instead of classifying every entity mention in isolation, takes a holistic perspective and identifies the set of mentions that fits the list page best, given its structure.

6 Conclusion

In this paper we have presented an approach for the extraction of entities from Wikipedia list pages in order to enrich DBpedia with additional entities, type statements, and facts. We have shown that by creating a combined taxonomy from the WCG, its subgraph formed of lists, and DBpedia, we are able to train highly precise entity extraction models using distant supervision.

To extend our approach, we are investigating two directions. Firstly, we want to further improve the entity extraction by considering entities that are not explicitly tagged as such in list pages. In line with that, we are developing a method to extract the entities of a list page based on a joint likelihood instead of evaluating each entity mention in isolation. To that end, we are experimenting with additional features that take the visual layout of a list page and the alignment of entities into account.

As soon as we include untagged entities in the extraction, we will have to develop an entity disambiguation mechanism in order to create separate entities for homonyms. For this, we expect the distance between entities in the taxonomy to be a helpful indicator.

Secondly, we are investigating the application of our extraction approach to any kind of structured markup in Wikipedia (e.g. enumerations and tables that occur anywhere in Wikipedia) and, ultimately, to markup of arbitrary pages on the web. To achieve that, we want to combine the information about entity alignment on the web page with the available semantic information, as outlined in [7].

Code and results of this paper are published on http://caligraph.org.