In this section, we first describe the benchmark datasets in “Benchmark Datasets” section and the evaluation metrics in “Evaluation Metrics” section. Next, we present the implementation details and experiment settings in “Implementation and Settings” section. Finally, we present the results and a detailed analysis of semantic labeling for numerical values in “Experimental Results” section.
Benchmark Datasets
To evaluate EmbNum+, we used four datasets, i.e., City Data, Open Data, DBpedia NKB, and Wikidata NKB. City Data is the standard dataset used in previous studies [19, 20], while Open Data, DBpedia NKB, and Wikidata NKB are newly built datasets extracted from Open Data portals, DBpedia, and Wikidata, respectively. The datasets are available at https://github.com/phucty/embnum.
Table 3 Statistical description of the number of numerical values per semantic label in the four datasets: City Data, Open Data, Wikidata NKB, and DBpedia NKB

The detailed statistics of each dataset are shown in Table 3. m denotes the number of semantic labels in a dataset, and n denotes the number of columns. In each dataset, every semantic label has 10 columns. The columns of City Data, DBpedia NKB, and Wikidata NKB were randomly generated by splitting each attribute into 10 partitions, while the columns of Open Data are real table columns from the Open Data portals. The number of semantic labels in the new datasets is larger than in City Data, enabling more rigorous comparisons between EmbNum+ and the baseline approaches.
Table 4 reports the overall quantile ranges of the four datasets. DBpedia NKB is the most complex dataset in terms of both the number of semantic labels (206) and the range of numerical values (\([{-10}\mathrm {e}{10},{10}\mathrm {e}{16}]\)). Moreover, the overlapping rate of numerical attributes in DBpedia NKB is higher than in the other datasets. The detailed distributions of quantile ranges of each numerical attribute in City Data, Open Data, DBpedia NKB, and Wikidata NKB are depicted in Figs. 12, 13, 14, and 15 of “Appendix A,” respectively.
DBpedia NKB and City Data share the same data source, DBpedia; therefore, their attributes overlap substantially. The other two datasets, Wikidata NKB and Open Data, differ from DBpedia NKB and City Data. Wikidata NKB is constructed from Wikidata, an independent project manually annotated by its community. Because Wikidata is a different source from Wikipedia, from which DBpedia is extracted, Wikidata NKB and DBpedia NKB differ. Open Data is extracted from five Open Data portals whose data domains differ from those of the other datasets.
Table 4 Overall quantile ranges of City Data, Open Data, DBpedia NKB, and Wikidata NKB

City Data
City Data [20] has 30 numerical properties extracted from the city class in DBpedia. The dataset consists of 10 sources; each source has 30 numerical attributes associated with 30 data properties.
Open Data
Open Data has 50 numerical properties extracted from tables in five Open Data portals. We built this dataset to test semantic labeling for numerical values in an open environment.
To build the dataset, we extracted table data from five Open Data portals, i.e., Ireland (data.gov.ie), the UK (data.gov.uk), the EU (data.europa.eu), Canada (open.canada.ca), and Australia (data.gov.au). First, we crawled CSV files from the five Open Data portals and selected files smaller than 50 MB. Then, we analyzed the tables in the CSV files and kept only the numerical attributes. After that, we created attribute categories by clustering the numerical attributes with respect to the textual similarity of their column headers. A category contains numerical columns with the same semantic label. We obtained 7,496 categories in total.
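As an illustration only (not the exact pipeline used to build Open Data), such header-based clustering could look like the following sketch, which assumes a recent scikit-learn, uses TF-IDF over character n-grams, and sets an arbitrary distance threshold:

```python
# Hedged sketch: group column headers into candidate categories by textual similarity.
# The headers, threshold, and clustering configuration below are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

headers = ["population_total", "Total population", "area_km2",
           "Area (km2)", "median_income", "Median income"]

# Character n-grams tolerate header variants such as underscores, casing, or units.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(headers)

# Headers closer than the (illustrative) cosine-distance threshold share a category.
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5, metric="cosine", linkage="average"
).fit(vectors.toarray())

for header, category in zip(headers, clustering.labels_):
    print(category, header)
```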
We manually evaluated these categories using two criteria. (1) The first criterion was to keep categories with a suitable frequency: by examining the data, we found that the semantics of high-frequency and low-frequency categories are often unclear. Following the setting of City Data, we selected categories with ten attributes. (2) The second criterion was to remove categories whose column headers have overly general meanings, such as “ID,” “name,” or “value.”
Finally, we chose 50 categories as semantic labels; each semantic label had ten numerical attributes. Following the guideline of City Data, we also built 10 data sources, each combining one numerical attribute from each category.
Wikidata NKB
The Wikidata NKB was built from the most frequently used numerical properties of Wikidata. At the time of processing, there were 477 numerical properties (Footnote 2), but we selected only the 169 numerical properties that are used more than 50 times in Wikidata.
DBpedia NKB
To build the DBpedia NKB, we collected the numerical values of 634 DBpedia properties directly from the official SPARQL query service (Footnote 3). Finally, we obtained the 206 most frequently used numerical properties of DBpedia, where each attribute has at least 50 values.
Evaluation Metrics
We used the mean reciprocal rank (MRR) score to measure the effectiveness of semantic labeling. The MRR score was used in previous studies [19, 20] to measure the correctness of a ranked result list. To measure the efficiency of EmbNum+ against the baseline methods, we evaluated the run-time in seconds of the semantic labeling process.
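As a minimal sketch (function and variable names are ours, not from the EmbNum+ implementation), the MRR over a set of queries can be computed as follows:

```python
# Minimal sketch of the mean reciprocal rank (MRR): for each query attribute we take
# 1 / rank of the first correct label in its ranked result list (0 if it never appears)
# and average over all queries.
def mean_reciprocal_rank(ranked_lists, true_labels):
    total = 0.0
    for ranking, truth in zip(ranked_lists, true_labels):
        rr = 0.0
        for rank, label in enumerate(ranking, start=1):
            if label == truth:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(true_labels)

# Toy example: the correct label "area" is ranked 1st, 2nd, and 3rd for three queries.
print(mean_reciprocal_rank(
    [["area", "population"], ["density", "area"], ["elevation", "density", "area"]],
    ["area", "area", "area"]))  # (1 + 1/2 + 1/3) / 3 ≈ 0.611
```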
Implementation and Settings
Each dataset serves a different purpose in evaluating EmbNum+. DBpedia NKB is the most complex and complete dataset, with the largest number of semantic labels as well as a wide range of values; it provides the discriminative power needed to train the representation model and the relevance model of EmbNum+. Therefore, we use DBpedia NKB for these two learning modules. The details of the learning settings are described in “Representation and Relevance Learning” section. We use City Data as the standard dataset to compare fairly with existing approaches. Wikidata NKB is challenging in terms of its large scale and the transfer setting, in which the embedding model is learned from DBpedia NKB. Finally, Open Data is used to evaluate the real-world setting, where numerical attributes are extracted from the five Open Data portals.
Representation and Relevance Learning
To train EmbNum+, we used the numerical attributes of DBpedia NKB as training data. We randomly divided DBpedia NKB into two equal parts: 50% for the two learning tasks and 50% for evaluating the semantic labeling task. The first part was used for the representation learning of EmbNum+. Note that we applied the attribute augmentation technique to generate training samples; therefore, the actual training data differ from the original data. We also used this part to train the relevance model, using the pair-wise distances between the original training samples. The same data were used to learn the similarity metric for DSL, following the guideline of training the similarity metric with logistic regression on pairs of numerical attributes [19].
We used PyTorch (http://pytorch.org) to implement representation learning. The network uses the rectified linear unit (ReLU) as the nonlinear activation function. To normalize the distribution of the input features in each layer, we also applied batch normalization [8] after each convolution and before each ReLU activation. We trained the network using stochastic gradient descent (SGD) with back-propagation, a momentum of 0.9, and a weight decay of \({1}\mathrm {e}{-5}\). We started with a learning rate of 0.01 and reduced it with a step size of 10 to finalize the model. We set the dimensions of the attribute input vector h and the attribute output vector k to 100.
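A minimal PyTorch sketch of this optimization setup is shown below; the tiny network is only a placeholder (the actual EmbNum+ architecture is described earlier in the paper), and the decay factor gamma is an assumption since only the step size is stated:

```python
# Hedged sketch of the training configuration: SGD with momentum 0.9, weight decay 1e-5,
# initial learning rate 0.01, and a step learning-rate schedule with step size 10.
import torch

model = torch.nn.Sequential(      # placeholder network, not the actual EmbNum+ model
    torch.nn.Conv1d(1, 8, kernel_size=3),
    torch.nn.BatchNorm1d(8),      # batch normalization after the convolution ...
    torch.nn.ReLU(),              # ... and before the ReLU activation
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-5)
# gamma (the decay factor) is assumed; the paper states only the step size of 10.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```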
We trained EmbNum+ for 20 iterations. In each iteration, we used the attribute augmentation technique to generate an \(aug\_size\) of 100 samples for each semantic label. The numerical values of each augmented sample are randomly selected from the list of numerical values of the original attribute. The size of each augmented sample ranges from a \(min\_size\) of 4 to the size of its original attribute. Then, the representation model was trained for 100 epochs. After each epoch, we evaluated the semantic labeling task with the MRR score on the original training data and saved the learned model with the highest MRR score. All experiments ran on a Deep Learning Box with an Intel i7-7900X CPU, 64 GB of RAM, and three NVIDIA GeForce GTX 1080 Ti GPUs.
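The augmentation step could be sketched as follows (an illustration under our own assumptions, e.g., sampling without replacement; the function name is hypothetical):

```python
# Hedged sketch of attribute augmentation: each augmented sample draws a random number
# of values, between min_size and the attribute's original size, from the attribute.
import random

def augment_attribute(values, aug_size=100, min_size=4):
    samples = []
    for _ in range(aug_size):
        k = random.randint(min_size, len(values))  # size of this augmented sample
        samples.append(random.sample(values, k))   # without replacement (an assumption)
    return samples

augmented = augment_attribute(list(range(1000)))   # 100 augmented samples for one attribute
```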
The training time of EmbNum+ is 29,732 s, while the training time of DSL is 2,965 s. EmbNum+ uses a deep learning approach and therefore needs more time to train the similarity metric than DSL, which uses logistic regression. However, the similarity metric only needs to be trained once and can be applied to other domains without retraining. The detailed experimental results on the robustness of EmbNum+ are reported in “Semantic Labeling: Effectiveness” section.
Semantic Labeling
In this section, we describe the experimental setting used to evaluate the semantic labeling task. We follow the evaluation setting of SemanticTyper [20] and DSL [19]. This setting is based on cross-validation, but it was modified to observe how the number of numerical values in the knowledge base affects the performance of the labeling process. The details of the experimental setting are as follows.
Suppose a dataset \(S = \{s_1, s_2, s_3, \ldots , s_d\}\) has d data sources. One data source was retained as the unknown data, and the remaining \(d-1\) data sources were used as the labeled data. We repeated this process d times, with each data source used exactly once as the unknown data.

Additionally, we increased the number of sources in the labeled data from one to \(d-1\) to analyze how the amount of labeled data affects the performance of semantic labeling. We obtained the MRR scores and labeling times of \(d \times (d-1)\) experiments and then averaged them to produce \(d-1\) estimates, one for each number of sources in the labeled data.
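The sketch below illustrates this evaluation loop; `label_and_score` is a hypothetical function standing in for one semantic labeling run that returns an MRR score:

```python
# Hedged sketch of the evaluation setting: each source in turn is the unknown data,
# and the labeled data grows from 1 to d-1 sources, giving d * (d-1) experiments.
def run_experiments(sources, label_and_score):
    d = len(sources)
    scores = {n: [] for n in range(1, d)}            # n = number of labeled sources
    for i, unknown in enumerate(sources):
        labeled_pool = sources[:i] + sources[i + 1:]
        for n in range(1, d):
            scores[n].append(label_and_score(unknown, labeled_pool[:n]))
    # Average over the d runs to get one MRR estimate per number of labeled sources.
    return {n: sum(v) / len(v) for n, v in scores.items()}
```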
Table 5 depicts the semantic labeling setting with 10 data sources. In the first through ninth experiments, \(s_1\) is assigned as the unknown source whose attributes are the queries, and the remaining sources are used as the labeled sources in the knowledge base. The remaining experiments follow the same pattern. Overall, we performed 90 experiments on the 10 sources of a dataset.
Table 5 Semantic labeling setting with 10 data sources

Unseen Semantic Labeling
In this section, we describe the setting of unseen semantic labeling. We split the data into d partitions and used d-fold cross-validation for evaluation. To analyze how the performance of EmbNum+ changes with the number of unseen semantic labels, we linearly increased the percentage of unseen semantic labels from 0 to 90% of all labels in the knowledge bases. Table 6 lists the number of unseen semantic labels in City Data, Open Data, DBpedia NKB, and Wikidata NKB.
Table 6 Number of unseen semantic labels of DBpedia NKB, City Data, Wikidata NKB, and Open Data

The performance is evaluated with the MRR score on the four datasets. When a query is an unseen attribute, its reciprocal rank (RR) is 1 if the ranking result is empty and 0 otherwise.
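A minimal sketch of this scoring rule (names are ours, not from the EmbNum+ code) is:

```python
# Hedged sketch of the reciprocal-rank rule above: a seen query is scored 1/rank of the
# correct label; an unseen query is scored 1 only when the system returns an empty ranking.
def reciprocal_rank(ranking, true_label=None):
    if true_label is None:                 # unseen query: no correct label exists
        return 1.0 if not ranking else 0.0
    for rank, label in enumerate(ranking, start=1):
        if label == true_label:
            return 1.0 / rank
    return 0.0
```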
Ablation Study
We also conducted ablation studies to evaluate the impact of the representation learning and the attribute augmentation on the task of semantic labeling. For the setting of EmbNum+ without representation learning, we created three methods that ignore the representation learning: \(Num\_l1\), \(Num\_l2\), and \(Num\_l\infty \). The similarities between numerical attributes are calculated directly on the output of tran(.) without using the embedding model. \(Num\_l1\) uses the Manhattan distance, \(Num\_l2\) the Euclidean distance, and \(Num\_l\infty \) the Chebyshev distance. For the setting of EmbNum+ without attribute augmentation, we call the method EmbNum+ NonAu.
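The three non-learned distances can be sketched as follows (assuming the transformed attributes are NumPy vectors of equal length):

```python
# Hedged sketch of the three baseline similarity metrics computed directly on the
# transformed feature vectors, i.e., the output of tran(.), without the embedding model.
import numpy as np

def num_l1(a, b):    # Manhattan distance
    return np.sum(np.abs(a - b))

def num_l2(a, b):    # Euclidean distance
    return np.sqrt(np.sum((a - b) ** 2))

def num_linf(a, b):  # Chebyshev distance
    return np.max(np.abs(a - b))
```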
We conducted this ablation study with d-fold cross-validation. Given a dataset with d data sources, one data source was retained as the query set, and the remaining \(d-1\) data sources were used as the knowledge base. We performed semantic labeling for the query set against the knowledge base and repeated this process d times, with each data source used exactly once as the query set.
Experimental Results
In this section, we report the experimental results of semantic labeling in terms of effectiveness and robustness (“Semantic Labeling: Effectiveness” section) and efficiency (“Semantic Labeling: Efficiency” section). “Unseen Semantic Labeling” section reports the experimental results of the unseen semantic labeling setting. Finally, we report the results of the ablation study in “Ablation Study” section.
Semantic Labeling: Effectiveness
We tested SemanticTyper [20], DSL [19], and EmbNum+ on the semantic labeling task using the MRR score to evaluate the effectiveness. The results are shown in Table 7 and Fig. 9.
Table 7 Semantic labeling in the MRR score of SemanticTyper [20], DSL [19], and EmbNum+ on City Data, Open Data, DBpedia NKB, and Wikidata NKB
The MRR scores obtained by the three methods steadily increase with the number of labeled sources, suggesting that the more labeled sources in the database, the more accurately semantic labels are assigned. DSL outperformed SemanticTyper on City Data and Open Data but was comparable with SemanticTyper on DBpedia NKB and Wikidata NKB. DBpedia NKB and Wikidata NKB contain more semantic labels as well as a high level of range overlap between numerical attributes; therefore, the features proposed by DSL (the KS test, numerical Jaccard, and the MW test) become less effective.
EmbNum+ learns directly from the empirical distribution of numerical values without making any assumption about data type or distribution and hence outperformed SemanticTyper and DSL on every dataset. Similarity metrics based on specific hypothesis tests, as used in SemanticTyper and DSL, are not optimized for semantic labels with the various data types and distributions found in DBpedia NKB and Wikidata NKB.
The performance of the semantic labeling systems differs across datasets. In particular, semantic labeling on City Data, DBpedia NKB, and Wikidata NKB yields higher performance than on Open Data. These differences stem from data quality. City Data, DBpedia NKB, and Wikidata NKB are synthetic data, in which the numerical values of each attribute are normalized in terms of data scaling. Open Data is real-world data, where the meaning of attributes is often unknown; therefore, it is difficult to perform such normalization. We do not tackle the issue of data scaling in this paper; it is left as future work.
Although EmbNum+ was trained on only 50% of DBpedia NKB, it consistently yields the best performance on all four datasets, especially on the two most different datasets, Wikidata NKB and Open Data. This indicates that EmbNum+ is promising for semantic labeling across a wide range of data domains.
To determine whether EmbNum+ significantly outperforms SemanticTyper and DSL, we performed paired sample t tests on the MRR scores of EmbNum+ versus SemanticTyper and versus DSL. Table 8 shows the results of the paired t tests on City Data, Open Data, DBpedia NKB, and Wikidata NKB. We set the cutoff value for statistical significance to 0.01. The results reveal that EmbNum+ significantly outperforms SemanticTyper and DSL on all four datasets (all \(p\; \mathrm{{values}} < 0.01\)).
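Such a test can be sketched with SciPy as follows (the score lists are placeholders, not the reported results):

```python
# Hedged sketch of the paired sample t test between two methods evaluated on the
# same experiments; the MRR values below are placeholders, not results from the paper.
from scipy import stats

mrr_embnum = [0.95, 0.96, 0.94, 0.97]
mrr_baseline = [0.85, 0.88, 0.84, 0.86]

t_stat, p_value = stats.ttest_rel(mrr_embnum, mrr_baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# Significance is declared when p < 0.01, the cutoff used in the paper.
```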
Table 8 Paired sample t test between EmbNum+ and SemanticTyper, DSL on DBpedia NKB, City Data, Wikidata NKB, and Open Data

Semantic Labeling: Efficiency
In this experiment, we used the same setting as in the previous experiment, but efficiency is evaluated by the run-time of semantic labeling. Table 9 and Fig. 10 depict the run-time of semantic labeling with SemanticTyper, DSL, and EmbNum+ on the four datasets: DBpedia NKB, Wikidata NKB, City Data, and Open Data.
The run-time of semantic labeling increases linearly with the number of labeled sources. The run-time of DSL was extremely high when the number of labeled data sources increased because three similarity metrics had to be calculated. The run-time of SemanticTyper was lower than that of DSL because it used only the KS test as a similarity metric. Semantic labeling with EmbNum+ is significantly faster than with SemanticTyper (about 17 times) and DSL (about 46 times). EmbNum+ outperforms the baseline approaches in run-time because its similarity metric is calculated directly on the extracted feature vectors instead of all the original values.
Table 9 Run-time in seconds of semantic labeling of SemanticTyper [20], DSL [19], and EmbNum+ on City Data, Open Data, DBpedia NKB, and Wikidata NKB
Unseen Semantic Labeling
In this section, we report the experimental results of unseen semantic labeling. Table 10 and Fig. 11 report the MRR scores of EmbNum+ with (EmbNum+) and without (EmbNum+ NonRe) the relevance model. Without the relevance model, the performance of semantic labeling decreases as the number of unseen semantic labels increases. With the relevance model, the behavior of EmbNum+ changes considerably.
Interestingly, the trend of the MRR score changes from decreasing to increasing at 80% unseen semantic labels in the knowledge bases. This result is promising in practice since we usually do not have much labeled data. Detecting unseen semantic labels assists domain experts by simplifying and reducing the time of the manual labeling process.
Table 10 Unseen semantic labeling in the MRR score of EmbNum+ on DBpedia NKB, City Data, Wikidata NKB, and Open Data
Ablation Study
Table 11 reports the ablation study of EmbNum+ on City Data, Open Data, DBpedia NKB, and Wikidata NKB.
Table 11 Ablation study result of EmbNum+ on City Data, Open Data, DBpedia NKB, and Wikidata NKB

The methods \(Num\_l1\), \(Num\_l2\), and \(Num\_l\infty \) are EmbNum+ without representation learning. Among them, \(Num\_l1\) outperforms \(Num\_l2\) and \(Num\_l\infty \), indicating that the Manhattan distance has advantages over the Euclidean and Chebyshev distances. Removing representation learning significantly reduces the MRR score on all four datasets, which validates our assumption that representation learning is a necessary module in the semantic labeling procedure.
EmbNum+ NonAu is EmbNum+ without the attribute augmentation module. The performance of EmbNum+ is higher than that of EmbNum+ NonAu, which verifies that the attribute augmentation module is necessary for our proposed approach.