To evaluate the complete prerequisite identification approach (Fig. 2), we selected 80 target concepts in the domains of Computer Science (CS), Mathematics (MATH), Physics (PHY), and Biology (BIO): 25 CS, 25 MATH, 15 PHY, and 15 BIO concepts. For each target concept, the entire process was carried out and a list of prerequisites was obtained.
Table 3 shows, as an example, the size of the initial candidate set before pruning, for each search strategy, for two concepts. Notice the steep increase in the number of concepts when lmax=2. Considering that the diameter of the DBpedia graph is 6.27, values of lmax greater than 2 could lead to an explosion in the number of retrieved concepts. It is also important to note the lack of homogeneity in the sizes of the candidate set when CM is used as the search strategy. While the "Machine Learning" concept is related to three different categories to which 456 different concepts belong, "Deep Learning" is related to only one category with only 10 members. This can be attributed to the fact that there is no guarantee that concepts have a homogeneous number of associated categories, and that relatively recent concepts are not well described in the DBpedia version used (2016-10).
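To make the two search strategies concrete, the following sketch shows how the initial candidate sets could be retrieved from the public DBpedia SPARQL endpoint. This is a minimal illustration, not the actual implementation: the target concept, the decision to follow predicates in both directions for LC, and the use of dct:subject for CM are assumptions on our part.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

# LC with lmax=1: concepts linked to the target by a single predicate,
# in either direction (directionality is assumed, not stated in the text).
QUERY_LC1 = """
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT DISTINCT ?c WHERE {
  { dbr:Machine_learning ?p ?c . } UNION { ?c ?p dbr:Machine_learning . }
  FILTER(STRSTARTS(STR(?c), "http://dbpedia.org/resource/"))
}
"""

# CM: concepts that are members of the target concept's categories.
QUERY_CM = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?c WHERE {
  dbr:Machine_learning dct:subject ?cat .
  ?c dct:subject ?cat .
  FILTER(?c != dbr:Machine_learning)
}
"""

def fetch(query):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return {r["c"]["value"] for r in rows}

candidates_lc1 = fetch(QUERY_LC1)
candidates_cm = fetch(QUERY_CM)
```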
Table 3 Resulting initial candidate set size per search strategy for the "Machine Learning" and "Deep Learning" target concepts

On the initial candidate set, the pruning method based on the SemRefD function is applied. Table 4 shows the final candidate set size (i.e., |M|) after applying the different proposed versions of SemRefD for the "Machine Learning" concept. For a concept in the initial candidate set to be included in M, its SemRefD value must exceed a threshold θ. We selected three different values, θ ∈ {0.1, 0.2, 0.3}; as a result, for each target concept, 27 different M sets are built, corresponding to the different combinations of search strategies, pruning functions, and threshold values. Values of θ greater than 0.3 are not practical since, in some cases, they leave the candidate set empty. Considering that the original paper used a maximum value of 0.05 for θ (Liang et al. 2015), we are being much stricter. As shown in Table 4, the number of concepts drops drastically after applying SemRefD, particularly for θ=0.3.
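The pruning step can be summarized by the sketch below, assuming the three SemRefD variants are available as scoring functions; the calling convention, and the placeholder name "CN" for the third variant (only HW and NHW are named in the text), are hypothetical.

```python
from itertools import product

THETAS = (0.1, 0.2, 0.3)
SEARCHES = ("LC1", "LC2", "CM")   # LC(lmax=1), LC(lmax=2), CM
VARIANTS = ("HW", "NHW", "CN")    # "CN" is a placeholder name for the third variant

def build_m_sets(target, candidates, semrefd):
    """candidates: search strategy -> initial candidate set.
    semrefd: variant name -> scoring function f(candidate, target) -> float.
    Returns the 3 x 3 x 3 = 27 pruned sets M for one target concept."""
    return {
        (s, v, theta): {c for c in candidates[s] if semrefd[v](c, target) > theta}
        for s, v, theta in product(SEARCHES, VARIANTS, THETAS)
    }
```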
Table 4 Final candidate set size per search/pruning strategy for the "Machine Learning" target concept

It is also clear that, despite the initial candidate set built with LC(lmax=2) being at least 50 times the size of the one built with LC(lmax=1) (see Table 3), after pruning its size is at most only 2.2 times that of LC(lmax=1). Since the concepts in the set constructed with LC(lmax=1) are necessarily also found with LC(lmax=2), we can conclude that a large part of the concepts in M are reachable through a direct relationship with the target concept (i.e., a path of length one).
In the final step, all the concepts in M are evaluated by a supervised model to assess their prerequisite relation with the target concept. Considering the complete set of target concepts and all possible M sets, a total of 2812 different concept pairs were assessed by the supervised model. The candidate concepts identified as prerequisites constitute the output of the complete process.
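This final step could look like the following sketch. The feature set, the classifier family, and the helper methods on `kg` are illustrative assumptions; the text only states that the model relies on the concepts' connections in the KG.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pair_features(candidate, target, kg):
    """KG-derived features for a (candidate, target) pair. The helper methods
    on `kg` are hypothetical stand-ins for connection-based features."""
    return np.array([
        kg.semrefd_nhw(candidate, target),
        kg.common_neighbors(candidate, target),
        kg.out_degree(candidate),
        kg.in_degree(target),
    ])

def identify_prerequisites(m_set, target, kg, clf):
    """Keep only the candidates the trained classifier labels as prerequisites."""
    cands = list(m_set)
    X = np.vstack([pair_features(c, target, kg) for c in cands])
    return [c for c, y in zip(cands, clf.predict(X)) if y == 1]

# Training sketch: X_train / y_train come from human-labelled concept pairs.
# clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
```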
It was necessary to build a ground truth to evaluate how precise our proposed process is. To accurately label a given concept pair, we relied on human expert knowledge. We recruited 5 adjunct professors, 1 PhD student, and 4 master students with backgrounds in the different domains involved. Three master students and one adjunct professor had CS and MATH backgrounds, two adjunct professors and the PhD student had a PHY background, whereas two professors and one master student had a BIO background. For each candidate pair, at least three annotators decided whether the candidate concept is a prerequisite of the target concept or not. A majority vote rule was used to assign the final label. Given the low and fixed number of participants per domain, Fleiss' Kappa was used as the inter-rater reliability measure. We obtained κ=0.42 for CS, κ=0.56 for MATH, κ=0.31 for PHY, and κ=0.22 for BIO. According to Landis and Koch (1977), this indicates moderate agreement for CS and MATH, and fair agreement for PHY and BIO.
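For reference, Fleiss' kappa can be computed directly from the annotators' vote counts. Below is a minimal self-contained sketch; the example vote matrix is invented for illustration, not taken from the study's data.

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for an (items x categories) matrix whose entry [i, j]
    counts how many of the n raters assigned item i to category j."""
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.sum(axis=1)[0]                  # raters per item (assumed fixed)
    p_j = ratings.sum(axis=0) / ratings.sum()   # marginal category proportions
    P_i = (ratings * (ratings - 1)).sum(axis=1) / (n * (n - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

# Example: 3 raters per pair, categories = (prerequisite, not prerequisite)
votes = [[3, 0], [2, 1], [1, 2], [3, 0]]
print(round(fleiss_kappa(votes), 2))  # 0.11 for this invented matrix
```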
Tables 5 and 6 present the global results of the evaluation in terms of precision (P), true positives (TP), and false positives (FP), calculated using the previously constructed ground truth. Table 5 presents the results obtained for CS and MATH, while Table 6 presents the results for PHY and BIO. False positives are those concepts in M that were incorrectly identified as prerequisites, true positives are those concepts that were correctly identified as prerequisites, and precision is defined as \( P = \frac{TP}{TP + FP} \). In both tables, the highest values per pruning function and search method are shown in bold.
Table 5 Results of the complete process for CS and MATH target concepts using as evaluation metrics P (precision), TP (true positives), and FP (false positives)
Table 6 Results of the complete process for PHY and BIO target concepts using as evaluation metrics P (precision), TP (true positives), and FP (false positives)

As can be observed, using SemRefDHW as the pruning function leads in most cases to a high number of FP and, consequently, to the lowest precision values. In contrast, SemRefDNHW led to the lowest FP values and the highest TP and, consequently, the highest precision. This can be attributed to the fact that common neighbors are a better indicator of a strong link between concepts than shared (or nearby) categories. The taxonomy of KG categories may not be descriptive enough to identify a strong link between concepts. For example, the concepts "Artificial Intelligence" and "Neuroscience", despite being highly related and sharing many common concepts in DBpedia, do not share any category, and the distance between their categories in the taxonomy is not small enough to indicate a strong relationship. The above applies to all analyzed domains.
Regarding θ, it is clear that increasing it improves precision at the expense of a reduced number of identified prerequisites. Increasing θ from 0.1 to 0.3 implies an average reduction of 87% in TP across all domains. An appropriate trade-off between precision and the number of correctly identified prerequisite concepts is achieved at θ=0.2.
We further explored the performance of the search strategies. We initially expected the final set M obtained with LC(lmax=2) to be much larger than with the other strategies, because the initial candidate set built with this strategy was on average 27 times larger than the one built with LC(lmax=1) and 35 times larger than the one built with CM. However, M was on average only 1.5 times larger. This means that between 27 and 35 times more SemRefD computations were performed to grow the final set by this small factor. This is clearly not practical, and new ways of exploring linked concepts with lmax=2 should be proposed. One possible solution is to reduce the number of predicates considered in the traversal step, as sketched below.
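One way to realize that idea is a predicate whitelist in the lmax=2 traversal. The sketch below is ours, and the specific dbo: properties in the whitelist are assumptions, since the text does not name which predicates should be kept.

```python
# Restricted lmax=2 traversal over a hypothetical predicate whitelist.
QUERY_LC2_RESTRICTED = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?c2 WHERE {
  VALUES ?p1 { dbo:field dbo:academicDiscipline dbo:knownFor }
  VALUES ?p2 { dbo:field dbo:academicDiscipline dbo:knownFor }
  dbr:Machine_learning ?p1 ?c1 .
  ?c1 ?p2 ?c2 .
  FILTER(?c2 != dbr:Machine_learning)
}
"""
# candidates_lc2 = fetch(QUERY_LC2_RESTRICTED)  # reusing fetch() from the earlier sketch
```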
Regarding CM, we discovered that it is not an appropriate strategy when: (a) the concept's categories are far from the concept's main domain, or (b) the categories have very few members. Consider the target concept "Dijkstra's algorithm" and two of its associated categories: "Dutch inventions" and "1959 in computer science". The category "Dutch inventions" is an example of case (a), since most of its member concepts belong to unrelated domains. The category "1959 in computer science", in turn, contains only one member, which is precisely "Dijkstra's algorithm" itself; therefore, this category does not bring any new concept to the candidate set (case (b)).
Given the above evidence, the best search strategy is LC(lmax=1), since concepts linked directly by a single predicate represent, in most cases, a significant semantic relationship (Manrique and Mariño 2017). Additionally, the number of SemRefD computations remains small while the number of TP remains significant.
Table 7 compares the results across domains using LC(lmax=1) as the search strategy, SemRefDNHW, and θ=0.2. We found that the candidate set M is smaller for the BIO and PHY domains: on average, a concept in these domains has a candidate set half the size of a concept in MATH or CS. This eventually results in a much smaller set of prerequisites. This is a clear indication that BIO/PHY concepts have fewer interconnections in the KG or, equivalently, that they are not as well described. Since our entire proposal is based on analyzing the connections between concepts in the KG (i.e., SemRefD and the features of the supervised learning model), lower performance was expected in these domains compared with the others considered.
Table 7 Results comparison per domain using LC(lmax=1), SemRefDNHW, and θ=0.2

It should also be noted that our supervised model was trained using concepts from the MATH and CS domains, so its predictions may be less accurate in other domains. Building broader training sets for the task of identifying prerequisites remains a subject of ongoing research (Liang et al. 2018).