1 Introduction

Motivation. General-purpose knowledge bases (KBs) like Wikidata, DBpedia or YAGO [1, 31, 35] find increasing use in applications such as question answering, entity search or document enrichment, and their automated construction from Internet sources has been greatly advanced. So far, information extraction (IE) to this end has focused on fully qualified subject-predicate-object (SPO) facts such as \(\langle \)Monterey, locatedIn, California\(\rangle \). However, texts often contain only counting information: the number of objects that stand in a specific relation with a certain entity, without mentioning the objects themselves. Examples are: “California is divided into 58 counties”, “Clint Eastwood directed more than twenty movies” or “Trump has three sons and two daughters”.

This kind of knowledge can be codified into an extension of existentially quantified formulas known in AI and logic as counting quantifiers (CQs): they assert the existence of a specific number of SPO triples without fully knowing the triples themselves. Counting information can substantially extend the scope and value of knowledge bases. First, it allows accurate answers for queries that involve counts (e.g., number of counties per US state) or existential quantifiers (e.g., directors who made at least 5 movies). Second, an important use case is KB curation [8, 34]. KBs are notoriously incomplete, contain erroneous triples, and are limited in keeping up with the pace of real-world changes. Counting information helps to identify gaps and inaccuracies. For example, knowing the exact number of counties in California or a lower bound for the number of films directed by Eastwood provides important cues for completing and enriching a KB.

State-of-the-Art and Challenges. The predominant approach to extracting facts for KB population is distant supervision, using seeds for the SPO triples of interest (e.g., [21, 32]). The seeds are usually taken from an initial KB or are manually compiled. Spotting the seeds in a text corpus (e.g., Clint Eastwood, directed and Gran Torino) then allows learning patterns for relations (e.g., “director of” or “\(\langle \)someone\(\rangle \)s masterpiece”), which in turn lead to observing new fact candidates. This methodology is known as the pattern-relation duality principle [2].

Distant supervision is a natural approach for extracting counting information as well: the cardinality of distinct O arguments for a given SP pair, \(n:=|\{O\,|\,SPO \in KB \}|\), serves as a seed for the counting assertion, \(\langle S, P, \exists n\rangle \). However, it is more challenging than traditional SPO-fact extraction and needs to cope with several issues:

  1. Non-maximal seeds: Unlike for SPO-fact extraction, the incompleteness of KBs not only leads to a reduction in the number of seeds, but to seeds that systematically underestimate the count of facts that are valid in reality. For example, a KB that knows only a subset of Trump’s children, say three out of five, leads to a non-maximal seed that may reward spurious patterns like “owns three golf resorts” at the cost of patterns like “his five children”. Even worse, KBs often have complete blanks on certain relations, e.g., not knowing any of Eastwood’s movies despite labeling his occupation as film director and film producer (https://www.wikidata.org/wiki/Q43203).

  2. Sparse and skewed observations: For many relations, counting information is expressed in text in a sparse and highly skewed way. For example, the non-existence of children is rarely mentioned. For musicians, the first Grammy someone has won often has more mentions than later ones, hence giving undue weight to the pattern “his/her first award”. The number of members in a music band is often around four, which makes it hard to learn patterns for very large or very small bands.

  3. Linguistic diversity: Counting information can be expressed in a variety of linguistic forms like
     (i) explicit numerals as cardinal numbers (e.g., “has five children”),
     (ii) lower bounds via ordinal numbers (e.g., “her third husband”),
     (iii) number-related noun phrases such as ‘twins’ or ‘quartet’,
     (iv) existence-proving articles as in “has a child”, and
     (v) non-existence adverbs such as ‘never’ and ‘without’.

Open IE methods [18] cannot cope with these challenges. For example, the sentence “Trump has five children” would typically result in the triple \(\langle \)Trump, has, five children\(\rangle \), failing to recognize that ‘five’ is a numeric modifier of ‘children’. On the other hand, IE methods with pre-specified relations for KB population (e.g., NELL [23]) capture relevant O values only for a few relations explicitly specified to have numeric literals as their range, such as numberofkilledinbombing or earthquakecasualtiesnumber (http://rtw.ml.cmu.edu/rtw/kbbrowser/).

Approach and Contributions. In this paper, we develop the first full-fledged system for Counting Information Extraction, called CINEX. Our method is based on machine learning for sequence labeling, judiciously designed to cope with the outlined challenges. We leverage distant supervision from fact counts in a given KB, but devise special techniques to handle non-maximal seeds, sparseness and skew in observing count information in text, and linguistic diversity of patterns. We counter non-maximal seeds (Challenge 1) by relaxing matching conditions for numbers higher than KB counts, and by restricting training to popular, more complete entities. Sparseness and skew (Challenge 2) are addressed by discounting uninformative numbers using entropy measures. Linguistic diversity (Challenge 3) is handled by careful consolidation of detected mentions. We devise both a traditional feature-based conditional random field (CRF) and a bi-directional LSTM-CRF model using TensorFlow, finding that the two perform roughly comparably, although the traditional approach is more robust when dealing with noisy training data.

The salient original contributions of this paper are:

  • The methodology of our extraction system, CINEX.

  • An empirical evaluation with five manually annotated relations, showing 60% precision on average.

  • An application and large-scale experimental study of CINEX on 2,474 frequent relations of Wikidata, showing that counting information can extend the SPO facts in Wikidata for 110 distinct relations by 28%.

  • Code and data made available to the research community on GitHub.

The remainder of this paper is structured as follows. In Sect. 2 we specify the scope of counting quantifiers and discuss the incompleteness of KBs, using Wikidata as a reference point. Section 3 presents our methodology for extracting counting information at large scale, which we then detail in Sects. 4 and 5. Section 6 gives experimental results on the quality of our extraction method, with a particular focus on how CINEX can enrich the Wikidata KB in Sect. 6.4. Section 7 discusses related work.

2 Counting Information in Knowledge Bases

Counting quantifiers for a KB with SPO triples are statements about a subset of the SPO arguments. We focus on the dominant case of quantification of O arguments for a given SP pair. We write counting statements as \(\langle S, P, \exists n \rangle \), where S is the subject, P is the predicate and n is a natural number (including zero). For instance, the statement that President Garfield has 7 children would be written as \(\langle Garfield, hasChild , \exists 7 \rangle \). In OWL description logic syntax, this statement is written as:

ClassAssertion(ObjectExactCardinality(7 :hasChild) :Garfield)

Wikidata. To illustrate how today’s KBs deal with counting information, we briefly discuss the case of Wikidata, presumably the world’s largest and best curated publicly available KB. Wikidata already contains counting relations for a few topics such as numberOfChildren, numberOfSeasons (of a TV series), or numberOfHouseholds (of an administrative entity). This information can coexist with fully qualified SPO facts. Regarding children, for example, Wikidata knows 4 out of the 7 children of President Garfield by name, and knows that he had 7 in total (see Fig. 1). However, the numberOfChildren predicate is asserted for only 0.2% of persons in Wikidata so far. Even the child property is asserted for only 2.2% of persons, creating uncertainty about whether the others have no children or whether Wikidata does not know about them.

Fig. 1. SPO facts and counting information in Wikidata.

Counting information is beneficial for search and question answering, for example to answer “Which US presidents were married twice?” We analyzed the number of questions in the TREC 2003, 2004 and 2007 QA test datasets [4], and found that 5% to 10% of the questions (typically starting with “How many”) fall into this category.

Potential for KB Enrichment. To quantitatively assess the gap in Wikidata that counting information could help fill, we had one expert read the Wikipedia articles of 200 randomly selected people, with the task of comparing the text-borne counting information on the hasChild relation with the explicitly stated children names. The expert was instructed to look at two kinds of cues: (i) explicit numerals expressing counting information, and (ii) names of children mentioned in the article, which were then counted. We compare these numbers against (iii) the Wikidata SPO triples for the person’s hasChild predicate. Note that approach (ii) corresponds to what standard IE aims to achieve (i.e., extracting full triples and then counting).

We found that counting information via numerals allows the discovery of children counts for 12% of all test entities, while names of children are only mentioned for 7%, and Wikidata contains facts about children for only 2.5%. As for the total number of children, counting information asserts the existence of twice as many children, i.e., 0.35 children per person, as spotting and counting children names (0.18), and even eleven times more than Wikidata currently knows of (0.03).

3 System Overview

The CINEX system aims to solve the following problem:

Problem 1

(Counting Quantifier Extraction). Given a text about a subject S, and a predicate P, the task of counting quantifier (CQ) extraction is to determine the number of objects with which S stands in relation regarding P.

For instance, given the sentence “Trump has three sons and two daughters”, the output for the predicate numberOfChildren should be 5.

Fig. 2. Overview of the CINEX system.

Figure 2 gives a pictorial overview of the system architecture of CINEX. We split the overall task into two main components: the recognition of counting information and the consolidation of intermediate results into the final output of counting quantifiers. These components are presented in Sects. 4 and 5, respectively.

CINEX utilizes seeds from Wikidata in a judicious way in order to train a model for CQ recognition, using one of two options: a conditional random field (CRF) or a bidirectional LSTM neural network. When applied to new text, the output of the recognition model is a set of CQ candidates, which are often fairly noisy, though. Subsequently, the second stage of CINEX – CQ consolidation – cleans and aggregates the counting information and produces the final output of CINEX. The resulting CQ triples could potentially be added to a knowledge base such as Wikidata.

4 Counting Quantifier Recognition

The first stage of CINEX aims to recognize counting information in text, this way collecting a pool of CQ candidates for further cleaning and consolidation. We cast the CQ recognition into a sequence labeling task, operating on a per-sentence basis and learned separately for each predicate P. We are interested in counting information for a given subject-predicate (SP) pair and assume that the subject is already identified by the sentence context (e.g., the main entity featured in a document, like a Wikipedia article about S or S’s homepage on the Web). Furthermore, we assume that the input sentence is pre-processed by detecting terms that indicate counting information: cardinals, ordinals and number-related terms (numterms).

Task 1

(Counting Quantifier Recognition). Given a sentence about subject S and predicate P containing at least one cardinal, ordinal or number-related term (numterm), the task of CQ recognition is to label each token of the sentence with one of the following tags: (i) count, denoting a CQ mention, (ii) comp, denoting a compositional cue, and (iii) o, for all other tokens.

The following shows an example:

[Figure: example sentence with tokens labeled count, comp and o.]
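Since the example figure is not reproduced here, the following is a minimal illustration of our own (not the authors’ original figure) of what such a token-level labeling could look like, applied to the sentence from Problem 1 after cardinals have been lifted to placeholders:

# Hypothetical labeling for "Trump has three sons and two daughters"
# (numberOfChildren = 5); tags follow Task 1: count, comp, o.
labeled_sentence = [
    ("Trump",     "o"),
    ("has",       "o"),
    ("CARDINAL",  "count"),   # "three"
    ("sons",      "o"),
    ("and",       "comp"),    # compositional cue: 3 + 2 = 5
    ("CARDINAL",  "count"),   # "two"
    ("daughters", "o"),
]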

Sequence Labeling Models. Our problem resembles the Named Entity Recognition (NER) task, with Conditional Random Fields (CRFs) being a typical choice of sequence labeling models. In order to generalize patterns beyond specific numeric values/tokens, we pre-process sentences to lift these specific tokens into placeholders cardinal, ordinal and numeric term (numterm). For instance, the sentence “Donald Trump has three children from his first wife.” becomes “Donald Trump has cardinal children from his ordinal wife.”
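To make the placeholder lifting concrete, here is a minimal sketch in Python; the word lists are tiny illustrative subsets and the function name is ours, not part of the CINEX code (number-related terms are handled separately, as described later):

import re

CARDINAL_WORDS = {"one", "two", "three", "four", "five", "six", "seven",
                  "eight", "nine", "ten", "twelve", "twenty"}
ORDINAL_WORDS = {"first", "second", "third", "fourth", "fifth"}

def lift_placeholders(tokens):
    # Replace cardinals and ordinals (words or digits) with generic placeholders
    # so that learned patterns generalize beyond specific numeric values.
    lifted = []
    for tok in tokens:
        low = tok.lower()
        if low in CARDINAL_WORDS or re.fullmatch(r"\d+", low):
            lifted.append("CARDINAL")
        elif low in ORDINAL_WORDS or re.fullmatch(r"\d+(st|nd|rd|th)", low):
            lifted.append("ORDINAL")
        else:
            lifted.append(tok)
    return lifted

print(lift_placeholders("Donald Trump has three children from his first wife .".split()))
# ['Donald', 'Trump', 'has', 'CARDINAL', 'children', 'from', 'his', 'ORDINAL', 'wife', '.']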

CINEX learns one sequence labeling model for each predicate of interest (e.g., with separate models for children and spouses). We have devised solutions based on two sequence labeling methods:

  1. Feature-based model. We constructed a CRF-based sequence classifier using CRF++ [14] with n-gram features (up to pentagrams), taking into account lemmas and placeholders (e.g., {Trump, have, cardinal, child, from}) instead of the original tokens (a minimal illustration follows after this list).

  2. Neural model. We adopt the bidirectional LSTM-CRF architecture proposed in [15] using TensorFlow, presently the state-of-the-art method for sequence-to-sequence learning, to build our sequence labeling model. The neural architecture takes into account words, placeholders and character embeddings to represent the input sequence. The neural model should be able to exploit, for example, that word embeddings for ‘children’, ‘daughters’ and ‘sons’ are close to each other in the embedding space. Furthermore, word embeddings for out-of-vocabulary words such as ‘ennealogy’ can be generated via character embeddings, recovering similarity to e.g. ‘pentalogy’.
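As an illustration of the feature-based variant, the following minimal sketch uses sklearn-crfsuite as a stand-in for CRF++ and only unigram/bigram features; the real system uses n-grams up to pentagrams over lemmas and placeholders:

import sklearn_crfsuite

def token_features(sent, i):
    # Simple n-gram features over (lemmatized, placeholder-lifted) tokens.
    feats = {"bias": 1.0, "tok": sent[i]}
    if i > 0:
        feats["prev_tok"] = sent[i - 1]
        feats["bigram"] = sent[i - 1] + "_" + sent[i]
    if i < len(sent) - 1:
        feats["next_tok"] = sent[i + 1]
    return feats

def sent2features(sent):
    return [token_features(sent, i) for i in range(len(sent))]

# X: sentences after lemmatization and placeholder lifting;
# y: token-level tags produced by distant supervision (toy example).
X = [sent2features(["Trump", "have", "CARDINAL", "child", "from", "his", "ORDINAL", "wife"])]
y = [["o", "o", "count", "o", "o", "o", "o", "o"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, y)
print(crf.predict(X))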

Incompleteness-Aware Distant Supervision. We employ distant supervision to generate training data, as is common in relation extraction [3, 21, 32]. Given a KB relation P, for each entity S in the KB that appears as the subject of P, we retrieve (i) the triple count \(|\langle S, P, * \rangle |\) from the KB and (ii) sentences about S containing candidate mentions, e.g., cardinal numerals. Candidate mentions that are equal to or represent the triple count are labelled with the tag count denoting counting quantifier mentions, i.e., they serve as positive examples. All other candidate mentions are labeled with the o tag, i.e., as negative examples, like any other non-candidate tokens (e.g., non-numerals). We build separate training data for each relation P of interest.
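For illustration, a minimal sketch of how such seed counts can be derived from a set of KB triples (the triples and object names below are hypothetical):

from collections import defaultdict

def seed_counts(triples, predicate):
    # n := |{O | (S, P, O) in KB}| for every subject S of the given predicate.
    objects = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            objects[s].add(o)
    return {s: len(objs) for s, objs in objects.items()}

kb_triples = [
    ("Garfield", "hasChild", "child_1"),
    ("Garfield", "hasChild", "child_2"),
    ("Garfield", "hasChild", "child_3"),
    ("Garfield", "hasChild", "child_4"),
]
print(seed_counts(kb_triples, "hasChild"))   # {'Garfield': 4}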

Incomplete information from the KB used as the ground truth may negatively affect the quality of training data resulting from the distant supervision approach. To mitigate the effect that KB incompleteness has on training data quality, we investigated filtering the ground truth based on subject popularity, according to the number of stored KB triples for that subject, which is also highly correlated with other popularity measures like PageRank or Wikipedia article length. For example, for 10 random entities from the 99th, 90th and 80th percentile w.r.t. popularity, the mean difference between Wikidata children counts and a manually established ground truth from Wikipedia is 0.8, 1.5 and 2.4, respectively. Assuming that popularity and completeness are correlated in general, we can thus trade training data quantity for quality by disregarding less popular entities during training.

Candidate counts that are higher than the KB count are normally considered as not expressing the object count for the relation of interest, i.e., as negative training examples. But this can also happen to mentions that actually express the correct count, namely when the KB is incomplete and only knows counts lower than the correct one. Our remedy is to treat mentions higher than KB counts neither as positive nor as negative examples, but to simply exclude them from the training set. However, we need to retain enough negative examples; otherwise, the classifier would become overly optimistic. For this purpose we utilize upper bound information on triple counts specific to each relation, i.e., the triple count at the 99th percentile (e.g., 3 for the number of spouses), as found in the KB. A higher count mention is then still treated as a negative example if its value exceeds this upper bound, i.e., if it is deemed impossible for it to represent count information for the relation in question.
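The resulting labeling rule can be sketched as follows (one possible reading of the procedure above; names and the example values are illustrative):

def label_candidate(value, kb_count, relation_upper_bound):
    # Positive example: the mention equals the KB count.
    if value == kb_count:
        return "count"
    # Excluded: higher than the (possibly incomplete) KB count, but still
    # plausible for the relation (at most its 99th-percentile count).
    if kb_count < value <= relation_upper_bound:
        return None
    # Negative example otherwise.
    return "o"

# e.g., for hasSpouse with a 99th-percentile count of 3 and a KB count of 2:
print(label_candidate(2, kb_count=2, relation_upper_bound=3))  # 'count'
print(label_candidate(3, kb_count=2, relation_upper_bound=3))  # None (excluded)
print(label_candidate(7, kb_count=2, relation_upper_bound=3))  # 'o'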

Furthermore, the more frequently a certain number occurs in a text, the more likely it is to occur in varied contexts. To give the classifier less noisy training examples, we ignore sentences that contain count mentions of numbers that have a low entropy in the given text, even when they represent the actual object count. This way we ensure that the models only learn from correct number mentions in the right context.

Linguistic Diversity. As mentioned in the introduction, there are several ways to express count information in natural language text, cardinals and ordinals being only the most obvious ones.

Number-Related Terms. We exploited the relatedTo relation in ConceptNet [29] to collect around 1,200 terms related to numbers. The terms are split into two groups: those having Latin/Greek prefixes and those not having them. For the first group, we generated a list of Latin/Greek prefixes (e.g., quadr-) and a list of possible suffixes (e.g., -plets). When generating training data, a term with Latin/Greek affixes is labeled with the positive count tag if its prefix matches the triple count. For feature-based models we also replace such terms in the input with numterm placeholders appended with their Latin/Greek suffixes, while we keep the original tokens for neural models.

From the second group we manually selected 15 terms that were especially strongly associated with specific counts (e.g., twins, dozen). During preprocessing, these terms are then either replaced with corresponding terms/phrases containing cardinal numbers, e.g., thrice \(\rightarrow \) three times and a dozen \(\rightarrow \) twelve, or replaced with corresponding Latin/Greek suffix placeholders (e.g. numterm-plets for twins).
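A minimal sketch of this normalization step (the prefix table and term list shown are tiny illustrative subsets of the roughly 1,200 ConceptNet-derived terms):

LATIN_GREEK_PREFIXES = {"mono": 1, "bi": 2, "tri": 3, "quadr": 4, "quint": 5}
SPECIAL_TERMS = {"twins": "NUMTERM-plets", "thrice": "three times", "a dozen": "twelve"}

def normalize_numterm(term):
    # Manually selected terms are rewritten directly.
    if term in SPECIAL_TERMS:
        return SPECIAL_TERMS[term]
    # Terms with a Latin/Greek prefix become a placeholder keeping the suffix;
    # the prefix values are used elsewhere to match KB triple counts.
    for prefix in LATIN_GREEK_PREFIXES:
        if term.startswith(prefix):
            return "NUMTERM-" + term[len(prefix):]
    return term

print(normalize_numterm("quadruplets"))  # NUMTERM-uplets
print(normalize_numterm("twins"))        # NUMTERM-plets
print(normalize_numterm("a dozen"))      # twelve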

Indefinite Articles. Indefinite articles (i.e., ‘a’, ‘an’) are similar to the ordinal first insofar as they can express the existence of at least one object. We initially planned to treat them this way, yet due to their overwhelming frequency our classifiers could not cope with them. Thus we now disregard them in the training stage and only consider them as candidate mentions when applying the learned models, by replacing them with the cardinal placeholder, and treating them as the mention one.

Compositionality. To account for compositional mentions occurring in one sentence, we introduce an extra label, compositionality tag (comp), for the sequence labeling models. During training data generation, we identify consecutive candidate tokens with label count such that (i) the sum of their values is equal to the triple count and (ii) there exist compositional cues (commas and ‘and’) in between, which are then tagged with the comp label.
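A sketch of this rule (token positions and values are illustrative; the real system operates on the labeled token sequences described above):

def tag_compositional(candidates, cue_positions, kb_count):
    # candidates: list of (token_position, numeric_value) for count candidates;
    # cue_positions: positions of commas/'and' between them.
    if len(candidates) >= 2 and sum(v for _, v in candidates) == kb_count:
        tags = {pos: "count" for pos, _ in candidates}
        tags.update({pos: "comp" for pos in cue_positions})
        return tags
    return {}

# "Trump has three sons and two daughters" with KB count 5:
# cardinals at positions 2 and 5, the cue 'and' at position 4.
print(tag_compositional([(2, 3), (5, 2)], cue_positions=[4], kb_count=5))
# {2: 'count', 5: 'count', 4: 'comp'}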

5 Counting Quantifier Consolidation

Once tokens expressing counting or compositionality information have been identified, these need to be consolidated into a single prediction for the number of objects.

Task 2

(Counting Quantifier Consolidation). For a given subject S and predicate P, the input to this second stage is a set of token lists, where each token list consists of words/numbers and their corresponding input and output labels (i.e., cardinal, ordinal, numterm, count or comp) and at least one token is tagged cardinal, ordinal or numterm. The desired output is a single number for the counting quantifier for S and P, that is, the correct number of objects for S and P.

For example, for the pair \(\langle \)AngelinaJolie, hasChild\(\rangle \), the following token lists may have been detected (annotated as counting information and \([\)compositional cues\(]\), with confidences as subscripts):

\(l_1\): Angelina has a grand total of six \(_{0.4}\) children together: three \(_{0.3}\) biological \([\) and \(]_{0.6}\) three \(_{0.5}\) adopted.

\(l_2\): The arrival of the first \(_{0.5}\) biological child of Jolie and Pitt caused an excited flurry with fans.

\(l_3\): On July 12, 2008, she gave birth to twins\(_{0.8}\): a\(_{0.1}\) son, Knox Léon, \([\)and\(]_{0.5}\) a\(_{0.2}\) daughter, Vivienne Marcheline.

We use the following algorithm to consolidate the counting quantifier (CQ) candidates from these labeled token lists.

Algorithm 1

(Mention Consolidation)

  1. Sum up compositional mentions. Mentions having compositional cues in between are summed up, and their confidence score is set to the highest confidence score of the mentions.

  2. Select prediction per type. For multiple mentions of type cardinal and number-related term, only the mention with the highest confidence is retained if it is above a certain threshold, with compositional mentions treated like cardinals. For ordinals, we always select the highest ordinal available in the candidate pool, regardless of the confidence scores.

  3. Rank mention types. In the last step, the final prediction is chosen based on the preference \(n_{cardinal} \gg n_{numterm} \gg n_{ordinal} \gg n_{article}\), i.e., whenever a cardinal mention exists, it is returned as the final answer, otherwise a number-related term, ordinal or article.

In the example above, in the first step, the two mentions of three in \(l_1\) are summed up to one mention 6\(_{0.5}\), and the two indefinite articles in \(l_3\) are combined into 2\(_{0.2}\). In the second step, 6\(_{0.5}\) is chosen as the highest-confidence cardinal, twins\(_{0.8}\) as the highest-ranking numterm (with numerical value 2), and first\(_{0.5}\) as the highest-ranking ordinal. In the last step, the cardinal 6\(_{0.5}\) or the term twins\(_{0.8}\) is chosen as the final prediction, depending on whether the confidence threshold is below 0.5 or not.
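A compact sketch of Algorithm 1 that reproduces this worked example (the data structures and the handling of the threshold reflect our reading of the steps above):

def consolidate(cardinals, numterms, ordinals, articles, compositions, threshold):
    # Each list holds (value, confidence) pairs; compositions are groups of
    # cardinal mentions joined by compositional cues.
    # Step 1: sum up compositional mentions, confidence = max of the parts.
    for group in compositions:
        cardinals.append((sum(v for v, _ in group), max(c for _, c in group)))
    # Step 2: best mention per type (ordinals: highest value, ignoring confidence).
    best_cardinal = max(cardinals, default=None, key=lambda m: m[1])
    best_numterm = max(numterms, default=None, key=lambda m: m[1])
    best_ordinal = max(ordinals, default=None, key=lambda m: m[0])
    best_article = max(articles, default=None, key=lambda m: m[1])
    # Step 3: preference cardinal >> numterm >> ordinal >> article.
    for best in (best_cardinal, best_numterm, best_ordinal, best_article):
        if best is not None and best[1] >= threshold:
            return best[0]
    return None

# six_0.4; composition three_0.3 + three_0.5; twins_0.8 (value 2); first_0.5;
# the two indefinite articles are already combined into 2_0.2.
print(consolidate(cardinals=[(6, 0.4)], numterms=[(2, 0.8)], ordinals=[(1, 0.5)],
                  articles=[(2, 0.2)], compositions=[[(3, 0.3), (3, 0.5)]],
                  threshold=0.1))   # -> 6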

Confidence Scores. We interpret marginal probabilities given by CRFs, i.e., the probability of a token labeled with a certain tag resulting from forward-backward inference, as the confidence scores of identified mentions. When a CRF layer is not applied on top of the neural models, the probabilities are simply given by the softmax output layer.

Count Zero. So far we have only considered counting information for counts greater than zero. Reliably recognizing subjects without objects is difficult for two reasons: (i) reliable training data is even harder to come by, and (ii) the count zero is expressed neither via cardinals nor via ordinals or indefinite articles. We thus consider count zero only in passing, focusing on two especially frequent ways to express it: (i) the determiners ‘no’ and ‘any’ (used in negation) and (ii) the non-existence-proving adverbs ‘without’ and ‘never’. We approach their labeling in a manner similar to the identification of count information via indefinite articles, i.e., we do not use these cues for training but consider them when applying the models.

We performed text preprocessing beforehand to ensure that the non-existence cues can be discovered by the learned models. This preprocessing step includes transforming sentences containing ‘not-any’, ‘never’ and ‘without’ into sentences containing ‘no’ and ‘0’, for example:

[Figure: example sentences rewritten to contain ‘no’/‘0’.]

Finally, textual occurrences of ‘no’ and ‘0’ are replaced with cardinal and treated as count zero.
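A minimal sketch of this rewriting step (the substitution rules shown are simplified and only cover the example patterns):

import re

def rewrite_negation(sentence):
    # Surface the zero count as 'no', which is subsequently lifted to the
    # cardinal placeholder and treated as count zero.
    sentence = re.sub(r"\bdoes not have any\b", "has no", sentence)
    sentence = re.sub(r"\bnever had\b", "had no", sentence)
    sentence = re.sub(r"\bwithout\b", "with no", sentence)
    return sentence

print(rewrite_negation("She died without children."))      # She died with no children.
print(rewrite_negation("He does not have any siblings."))  # He has no siblings.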

6 Experiments

6.1 Experimental Setup

Dataset. We chose Wikidata as our source KB and Wikipedia pages about the given subject entities as our source text for the distant supervision approach. While some Wikidata properties are self-explanatory, like child or spouse, others are overloaded, i.e., used in highly diverse domains with different semantics depending on the type of the subject entities, e.g., has part. Thus, we define relations in our experiments as pairs of a Wikidata subject type/class and a Wikidata property. We focus on five diverse relations (listed in Table 1 under the Relation column) using the four Wikidata properties already used in [22], but with two specific Wikidata classes for the overloaded has part property, i.e., series of creative works and musical ensemble. We use four sets of entities for training and evaluation:

Table 1. Number of Wikidata instances as subjects (#Subject) of each relation in the training set.
  1. Training set: For each relation, all subject entities with an English Wikipedia page that have at least one object in Wikidata, except those used for development and testing (counts are shown in Table 1).

  2. Manual test set: 200 entities per relation randomly chosen from the training set (i.e., having at least one object).

  3. Automated test set: 200 of the 10% most popular entities per relation, removed from the training set (i.e., having at least one object).

  4. Zero-count test set: 64 and 168 entities for the hasChild and hasSpouse relations, respectively; these are entities in Wikidata whose child (P40) and spouse (P26) properties are set to the special value no-value.

For the manual test set we manually annotated mentions in text that correspond to counting quantifiers, and established the correct object count from Wikipedia. The automated test set is used for parameter tuning of the neural models, and as silver standard for evaluating our system beyond the 5 gold-annotated relations. For evaluating zero-count quantifier detection, we use two relations for which manually created data from Wikidata is available.

Hyperparameters. We set 0.1 as the confidence score threshold in the mention consolidation task (Sect. 5), after experimenting with varying values. For training the neural models, we employed Adam [12] with a learning rate of 0.001; using stochastic gradient descent (SGD) with gradient clipping of 5.0 as reported in [15] resulted in worse performance. The LSTM network uses a single layer with 300 dimensions, and the hidden dimension of the forward and backward character LSTMs is 100. We set the dropout rate to 0.5. We also use GloVe pre-trained embeddings [26] to initialize our lookup table.
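For concreteness, a simplified word-level version of this configuration in Keras is sketched below; the character-level LSTMs, the CRF output layer and the GloVe initialization are omitted for brevity, and the vocabulary/tag sizes are placeholders:

import numpy as np
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, NUM_TAGS = 20000, 300, 3   # tags: count, comp, o

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(300, return_sequences=True)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(NUM_TAGS, activation="softmax")),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy")

dummy_batch = np.random.randint(1, VOCAB_SIZE, size=(2, 12))   # 2 sentences, 12 tokens each
print(model(dummy_batch).shape)   # (2, 12, 3)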

6.2 Evaluation

Evaluation Scheme. We evaluate our system, CINEX (Counting Information Extraction), on quantifier recognition, quantifier consolidation, and on the end-to-end task with the following metrics:

We use precision, recall and F1-score to evaluate how well the system can identify counting information in a given text. For entities for which the system recognized at least one counting quantifier (CQ) candidate, we then measure precision in choosing the correct final CQ. Finally, we evaluate the system for the end-to-end task in terms of coverage, i.e., for how many subject entities the system can extract correct object counts from text, and Mean Absolute Error (MAE), to understand how much system predictions deviate from the truth.
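The two end-to-end metrics can be sketched as follows (illustrative data; the exact evaluation script may define edge cases differently):

def coverage_and_mae(predictions, gold):
    # predictions: {subject: predicted_count or None}; gold: {subject: true_count}.
    answered = {s: p for s, p in predictions.items() if p is not None}
    correct = sum(1 for s, p in answered.items() if p == gold[s])
    coverage = correct / len(gold)    # correct predictions over all subjects
    mae = sum(abs(p - gold[s]) for s, p in answered.items()) / len(answered)
    return coverage, mae

preds = {"e1": 5, "e2": 2, "e3": None, "e4": 3}
truth = {"e1": 5, "e2": 3, "e3": 1, "e4": 3}
print(coverage_and_mae(preds, truth))   # (0.5, 0.333...)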

Quantifier Recognition. We report in Table 2 the performance results of the different architectures w.r.t. precision, recall and F1-score. We also compare our system with the best performing method for extracting cardinals reported in [22] as a baseline. As one can see, feature-based CRF models are the most robust sequence labeling approach across relations for this task, although the neural models achieve a higher F1-score (by 3.3 percentage points) for containsAdmin. Adding a CRF layer on top of bidirectional LSTM models improves performance across relations, although this architecture still fails to beat the feature-based CRF models in most cases. We conjecture that this is due to neural models being much more prone to overfitting to the noisy distantly supervised training data. Still, both feature-based and neural models consistently outperform the baseline by a large margin, in particular w.r.t. precision.

Table 2. Performance of CINEX on recognizing counting quantifier mentions, with different architectures and in comparison with the baseline. Highest F1-score per relation in boldface.
Table 3. Performance of CINEX-CRF on recognizing counting quantifier mentions, per mention type. Numt. stands for number-related terms, Art. for indefinite articles. Baseline comparison is only for cardinals (highest F1-score per relation in boldface).

In Table 3 we split this analysis further by mention type. This provides a fairer comparison with the baseline, which only considers cardinal numbers. Still, CINEX-CRF achieves higher precision on all relations, and a higher F1-score on 4 out of 5. We also see considerable variation across mention types and relations; ordinals, for instance, are picked up well for hasSpouse but poorly for hasChild.

Quantifier Consolidation. Table 4 shows the performance of CINEX-CRF, our best performing system for recognizing counting information, on the consolidation and end-to-end tasks. We report the results broken down per mention type, as well as overall.

Table 4. Performance of CINEX-CRF in consolidating counting quantifier mentions w.r.t. precision (P), coverage (Cov) and MAE. Numt. stands for number-related terms, Art. for articles. Results per type show contribution (Contr) to overall output and precision of individual types.
Table 5. Examples of correct and incorrect predictions by CINEX-CRF.

In predicting counting quantifiers through recognizing cardinals in text, CINEX-CRF achieves 55–85% precision. This is a considerable improvement (up to 48.9 percentage points) over the baseline [22]. Although the baseline yields comparable coverage, its low precision suggests that it has difficulties picking up the correct context and produces some matches only by chance.

Number-related terms and articles are beneficial for improving coverage, particularly for containsWork and hasMember, yet produce low-precision results for hasChild, possibly due to spurious indefinite articles frequently being identified as counting quantifiers. Overall, taking compositionality as well as mention types other than cardinals into account improves both accuracy and coverage of the system, with an MAE of not more than 2.6 across relations. The performance of CINEX-CRF on predicting the non-existence of objects is reported in the last two rows of Table 4. We obtain a high accuracy of 92.3% for hasChild and 71.9% for hasSpouse.

Qualitative Analysis. Table 5 lists notable examples of correct and incorrect predictions. Errors for hasMember and hasSpouse are sometimes caused by wrongly labelled mentions that actually belong to other relations, e.g., members of a musical ensemble or siblings. For some relations, understanding the fine-grained types of subject entities may help in choosing the correct context of counting quantifiers. For instance, a TV series consists of seasons while a specific season of the series contains episodes.

Notable is also the low precision of ordinals shown in Table 4. A main reason is that ordinals only reliably express lower bounds (see, e.g., the fourth incorrect example). If one considers ordinals as correct whenever they are not higher than the true count, the reported precision scores increase from 14.3–63.2% to 85.7–89.5%.

Table 6. KB enrichment potential for 40 relations, showing only relations with accuracy (Acc) \({>}50\%\) and coverage (Cov) \({>}5\%\).

6.3 KB Enrichment Potential

In this section we return to our original goal of enlarging the number of facts known to exist. We investigate the potential of CINEX on 40 relations, focusing on the 4 previously used Wikidata properties, but looking at the up to 10 most frequent subject classes of entities using each property. For each relation, we then perform the automated evaluation of CINEX as described in Sect. 6.1. In Table 6, we report relations for which CINEX-CRF gave precision \({>}0.5\) and coverage \({>}0.05\). For each relation we report the number of existing facts in Wikidata, and how many additional facts we can infer to exist from the counting quantifiers. For instance, we can derive the existence of 160.4% more children relationships than currently stored. In sum, CINEX is able to identify the existence of 173K more facts than Wikidata currently knows of, thus increasing the existential knowledge of Wikidata for these 40 relations by 77.3%.

We also applied CINEX to all human entities to find out how many subjects are found to have no objects w.r.t. the hasChild and hasSpouse relations, finding 1,648 instances for children and 557 for spouses. These assertions increase the existing known zero cases in Wikidata for both relations by a factor of 25.8 and 3.3, respectively.

Table 7. Classes along with relations for which count information could be retrieved best.

6.4 Count Information Across KB Relations

So far we only evaluated CINEX on four manually chosen Wikidata properties. In this section we investigate to which extent counting quantifiers are present for arbitrary relations, and to which extent they can be extracted by CINEX.

To this end, we collected all Wikidata properties that were interesting, i.e., were not asserted to be single-value, had a functionality degree \((\#\textit{subjects}/\#\textit{triples})\) of less than 0.98 [10], and were used by at least 500 subjects, obtaining 267 properties in total. For each of these properties, we identified the 10 most frequent entity classes used as subjects, resulting in a total of 2,474 relations. For each relation, we then performed the automated evaluation of CINEX as described in Sect. 6.1, finding 110 relations for which CINEX gave precision \({>}50\%\) and coverage \({>}5\%\).
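The property selection can be sketched as a simple filter (per-property statistics are assumed to be precomputed; the field names are ours):

def is_interesting(stats, max_functionality=0.98, min_subjects=500):
    # stats: {'single_value': bool, 'n_subjects': int, 'n_triples': int}
    if stats["single_value"]:
        return False
    functionality = stats["n_subjects"] / stats["n_triples"]
    return functionality < max_functionality and stats["n_subjects"] >= min_subjects

print(is_interesting({"single_value": False, "n_subjects": 1200, "n_triples": 4000}))  # True
print(is_interesting({"single_value": False, "n_subjects": 990, "n_triples": 1000}))   # False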

Among the frequent classes (grouped by theme) of subjects for which we can mine counting quantifiers from the corresponding Wikipedia pages are: human (including twin, fictional human, biblical figure and mythological Greek character), creative works (e.g., film, television series), administrative territorial entity (e.g., country, municipality), musical ensemble (e.g., band, duo), organization (e.g., business enterprise, nonprofit organization) and transportation facility (e.g., metro station, train station). We show in Table 7 the top 5 Wikidata properties for each mentioned subject type. Other notable relations include: <battle, participant>, <human spaceflight, crew member> and <star, child astronomical body>.

In terms of KB enrichment, CINEX was able to extract a total of 851K counting quantifier facts, which in turn state the existence of 2.5M facts not yet asserted for these 110 <Wikidata class, Wikidata property> pairs. These existential facts, provided on GitHub, increase the number of facts known to exist for these relations by 28.3%.

7 Related Work

Knowledge bases have seen a rise of attention in recent years. Aside from a few manual efforts like Wikidata, the construction of these knowledge bases is usually done via automated information extraction, focusing either on structured data (DBpedia [1], YAGO [31]) or on unstructured content from the web. For the latter, directions include extracting arbitrary facts without a predefined schema, called Open IE [6, 19, 23], and extracting triples based on well-defined knowledge base relations [13, 25, 33], for which the distant supervision approach is widely used [3, 21, 32]. The idea of distant supervision is to use facts from an existing KB in order to label sentences as positive/negative training samples, depending on whether the entities from the existing facts occur in them or not. A major challenge for distant supervision is knowledge base incompleteness: if the KB used for labeling the training data misses facts, candidates may wrongly be classified as negative samples, reducing the quality of the learning process. Approaches to mitigate this effect include heavily under-sampling the negative evidence [27, 33], learning only from positive samples [20], or using heuristics to select negative samples [9, 10], yet these do not help with potentially wrong seed counts.

Most works on information extraction focus on relations that link entities, like \(\langle \textit{Trump,}\) \(\textit{presidentOf, USA} \rangle \), or that store string or measurement values. Counting quantifiers have received comparably little attention. Numbers, a major construct for expressing counts, were investigated mostly in the context of temporal information, e.g., to enrich facts with timestamps/durations [16, 30], or in the context of quantities and measures like \(\langle \)MtEverest, height, 8848mt\(\rangle \) [11, 17, 24, 28]. In contrast, terms that express counting quantifiers are either extracted incorrectly by state-of-the-art Open IE systems, or not at all. While NELL, for instance, knows 13 relations about the number of casualties and injuries in disasters, they all contain only seed facts and no learned facts. In [22], which we use as the baseline for our experiments, we proposed a single-stage process for identifying numbers that express relation counts. Yet, that work only considers explicit cardinals and tackles neither training data incompleteness nor compositionality, thus achieving only moderate precision and coverage.

While a few counting qualifier predicates such as number of children, number of seasons (of a TV series) or number of households (of a territory) already exist in Wikidata, it should be noted that a proper interpretation of counting quantifiers requires going beyond the standard open-world assumption of the Semantic Web, as they allow one to infer negative information. Appropriate models need to combine open-world and closed-world reasoning, as does for instance the local closed-world assumption [5, 7].

8 Conclusions

We have proposed to enrich KBs with counting quantifiers, and discussed the challenges that set counting quantifier extraction apart from standard information extraction. In particular, we showed that it is imperative to consider the compositionality of counts and their expression in non-numeric form. We have shown that our system, CINEX, can extract counting quantifiers with 60% average precision on five relations, and that, when applied to a large set of relations, it can extend the number of facts known to exist for 110 of them by 28%. We believe that the extraction of counting quantifiers opens interesting avenues for tasks such as question answering, information extraction or KB curation. Our data and code are available at https://github.com/paramitamirza/CINEX.