Our approach builds on previous work on extracting relation assertions from Wikipedia abstracts [4, 5]. That approach exploits links in Wikipedia abstracts, learns characteristic patterns for relations (e.g., the first place linked in the Wikipedia abstract about a person is that person’s birthplace), and then applies the learned models to extract new statements (e.g., new facts for the relation birthplace). For each relation, a separate model is trained and validated on the existing instances, so that only models achieving a desired precision are applied.
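The per-relation precision gate described above can be sketched as follows. This is only an illustration: the function names, the relation names in the toy data, and the threshold of 0.95 are assumptions and are not taken from the original approach.

```python
def precision(predicted, actual):
    """Precision of boolean predictions against boolean ground truth."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    pred_pos = sum(1 for p in predicted if p)
    # A model that predicts no positives is treated as having precision 0.
    return true_pos / pred_pos if pred_pos else 0.0


def select_relations(validation_results, min_precision):
    """Return the relations whose model reaches the desired precision."""
    return sorted(
        relation
        for relation, (predicted, actual) in validation_results.items()
        if precision(predicted, actual) >= min_precision
    )


# Hypothetical validation output: relation -> (predictions, ground truth).
validation_results = {
    "birthPlace": ([True, True, False, True], [True, True, False, True]),
    "deathPlace": ([True, True, True, False], [True, False, False, False]),
}

# The threshold is an assumed value chosen for illustration only.
selected = select_relations(validation_results, min_precision=0.95)
```

Only the models of the relations in `selected` would then be applied to extract new statements.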
3.1 Training Data Creation
To create the training data, we use regular expressions to detect and parse numbers in the abstract in various formats (i.e., with different thousands and decimal separators), and SpaCy (Footnote 4) and dateparser (Footnote 5) to detect and parse dates.
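A minimal sketch of regex-based number detection is given here. It assumes English-style formatting only (comma as thousands separator, dot as decimal separator). The pattern and the function name `extract_numbers` are illustrative and are not the expressions used in the experiments.

```python
import re

# Either a comma-grouped integer part (e.g., 200,507) or a plain digit run,
# optionally followed by a decimal part. The leading \b avoids matching
# digits embedded in a word.
NUMBER_RE = re.compile(r"\b(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")


def extract_numbers(text):
    """Return (surface form, parsed value) pairs for all numbers in text."""
    results = []
    for match in NUMBER_RE.finditer(text):
        token = match.group(0)
        results.append((token, float(token.replace(",", ""))))
    return results


abstract = (
    "It has a population of 200,507 and an area of 104.25 square "
    "kilometres; it was founded in 1850."
)
numbers = extract_numbers(abstract)
values = [value for _, value in numbers]
```

Note that a year such as 1850 is picked up as a plain number here; separating such tokens from dates is left to the date parsing step.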
With these approaches, we extract the sets of numerical literals N and date literals D from the Wikipedia abstract describing a DBpedia entity e. Since numbers may be rounded, we accept a number \(n \in N\) as a positive training example for a relation r if there is a statement r(e, v) in DBpedia with \(n \in [v \cdot (1-p), v \cdot (1+p)]\) for a deviation factor p. We manually examined candidates drawn at deviation factors of 1%, 1.5%, and 2%, and observed that the precision was 65% at both 1% and 1.5%, and dropped to 60% at 2%. Hence, we use 1.5%, the largest examined factor without a loss in precision, in our further experiments.
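The interval test can be written down directly. The sketch below is one possible reading of the condition; the function name `is_positive` and the example values are made up for illustration. Sorting the bounds is an added precaution so that negative values of v are also handled.

```python
def is_positive(n, v, p=0.015):
    """Check whether n lies in [v * (1 - p), v * (1 + p)].

    n is a number found in the abstract, v the value stored in the
    knowledge graph, and p the deviation factor (1.5% by default).
    """
    # For negative v the two bounds swap, hence the sorting.
    low, high = sorted((v * (1 - p), v * (1 + p)))
    return low <= n <= high


# Made-up values: the abstract states 1012, the knowledge graph stores 1000,
# i.e., a deviation of 1.2%.
accepted_at_1_5 = is_positive(1012, 1000, p=0.015)
accepted_at_1_0 = is_positive(1012, 1000, p=0.01)
```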
Figure 1 illustrates this generation of examples. Since DBpedia is constructed from infoboxes in Wikipedia, the values in the infobox on the right-hand side correspond to the values in DBpedia. Given the Wikipedia abstract, 200,507 would be extracted as a training example for the relation population (correct), while 1928 would be extracted as a training example for the relation density (incorrect). The deviations are 0.11% and 1.47%, respectively.
Since dates are not rounded, training data for date-valued literals are based on exact matches with DBpedia only (Footnote 6).
As negative training examples, we use all numbers or dates, respectively, that have been tagged in the abstract but are not identified as positive examples for the relation at hand. In the example depicted in Fig. 1, we would use all numbers except for 200,507 as negative training examples for the relation population.
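The split into positive and negative examples for one relation can be sketched as below. The helper `label_candidates` and the example dates are hypothetical. The matching rule is passed in as a predicate, so the same function serves exact matching for dates and tolerance-based matching for numbers.

```python
from datetime import date


def label_candidates(candidates, matches):
    """Split tagged literals into positive and negative examples.

    candidates: literals tagged in one abstract
    matches:    predicate deciding whether a literal matches the value
                stored for the relation at hand
    """
    positives = [c for c in candidates if matches(c)]
    negatives = [c for c in candidates if not matches(c)]
    return positives, negatives


# Hypothetical dates tagged in an abstract and a hypothetical stored value.
tagged_dates = [date(1879, 3, 14), date(1921, 1, 1), date(1955, 4, 18)]
stored_birth_date = date(1879, 3, 14)

# Dates are matched exactly, since they are not rounded.
positives, negatives = label_candidates(
    tagged_dates, lambda d: d == stored_birth_date
)
```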
3.2 Unit Conversion
An initial look at the training data revealed that this approach misses quite a few numerical training examples, since the units of measurement in which the facts are stored often differ from the ones used in the abstracts. For example, areas (of countries, cities, ...) are stored in DBpedia in square meters, while they are typically written in square kilometers or non-metric units. Therefore, for those relations, the training data sets created are often very small (e.g., for area, which is one of the most frequent relations in DBpedia, we initially collected fewer than 100 training examples).
Therefore, we decided to enhance the training example generation with unit conversion. We follow the assumptions that (1) the unit of measurement is typically the token after a number (Footnote 7), and (2) the function for converting a unit to its standard unit in DBpedia is usually a simple multiplication by a factor. Thus, we group the numeric literals for each relation by the token following the number (e.g., ha) and try to learn a regression model for that token. From those regression models, we derive unit conversion rules, which are applied to the literals extracted as above before mapping them to relations in DBpedia. Following an initial inspection of the data, we accept unit conversions that are learned on at least 100 examples and have a coefficient of determination of at least 0.85. Table 1 shows a few example unit conversion factors, including useful rules learned, but also some misleading rules (e.g., converting the “unit” pupils to $).
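One way to realize such a rule is a least-squares fit through the origin, sketched below. The source does not specify the form of the regression model or how abstract values are paired with DBpedia values, so both are assumptions here: the sketch takes ready-made pairs for a single relation and token, and all function names are illustrative. Only the two acceptance thresholds (100 examples, coefficient of determination of 0.85) come from the text.

```python
def fit_factor(xs, ys):
    """Least-squares factor for y ~ factor * x (no intercept)."""
    denominator = sum(x * x for x in xs)
    if denominator == 0:
        return None
    return sum(x * y for x, y in zip(xs, ys)) / denominator


def r_squared(xs, ys, factor):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - factor * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot if ss_tot else 0.0


def learn_conversion(pairs, min_examples=100, min_r2=0.85):
    """Return a conversion factor, or None if the rule is rejected.

    pairs: (value in abstract, value in knowledge graph) tuples collected
           for one relation and one token following the number.
    """
    if len(pairs) < min_examples:
        return None
    xs, ys = zip(*pairs)
    factor = fit_factor(xs, ys)
    if factor is None or r_squared(xs, ys, factor) < min_r2:
        return None
    return factor


# Synthetic data for the token "ha": hectares in the abstract,
# square meters in the knowledge graph (1 ha = 10,000 square meters).
ha_pairs = [(float(i), 10000.0 * i) for i in range(1, 121)]
ha_factor = learn_conversion(ha_pairs)
too_few = learn_conversion(ha_pairs[:50])
```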
3.3 Feature Extraction
For each positive and negative training example extracted, we create a set of features to feed into a classifier. We use a set of features similar to that of the previous works our approach builds on, e.g., the position in the sentence and the position of the sentence in the abstract, plus a bag-of-words representation of the sentence in which the literal is located. For numerical literals, we additionally use the deviation from the mean divided by the standard deviation of all values of the respective relation, in order to discard outliers.
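The outlier feature for numerical literals is a standard score. A minimal sketch follows; whether the population or the sample standard deviation is used is not stated in the text, so the use of `pstdev` is an assumption, as are the function name and the toy values.

```python
from statistics import mean, pstdev


def z_score(value, relation_values):
    """Deviation of value from the relation's mean, in standard deviations."""
    mu = mean(relation_values)
    sigma = pstdev(relation_values)  # population standard deviation (assumed)
    # If all known values are identical, no meaningful score can be computed.
    return (value - mu) / sigma if sigma else 0.0


# Toy values standing in for all known values of one relation.
known_values = [10, 20, 30, 40, 50]
typical = z_score(30, known_values)
outlier = z_score(1000, known_values)
```

A candidate with a large absolute score, such as `outlier` here, is far from the values usually observed for the relation, which the classifier can use as evidence against it.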
3.4 Model Building
To learn models given the examples and feature vectors, we experimented with different classifiers from the scikit-learn library (Footnote 8), i.e., SGD, Naive Bayes, SVM, Decision Trees, Random Forest, Extra Trees, Bagging Decision Trees, and XGBoost. Out of those, the latter five delivered the best results in an initial experiment (using split validation on a sample of the relations with the most existing instances), without much variance in quality. Random Forests were chosen because they offer a good trade-off between runtime and accuracy.