1 Introduction: Motivation and Background

A taxonomy of concepts in a knowledge domain, or hierarchical ontology, is a popular computational instrument for representing, maintaining, and using domain knowledge [13]. A taxonomy is a rooted tree formalizing a hierarchy of subjects in an applied domain. Such a tree corresponds to a generalizing relation between the subjects, usually of the form "A is a B" or "A is part of B". Automation of taxonomy building is important for further progress in many areas of data analysis and knowledge engineering, including computational text processing and improving information retrieval [1, 4, 5]. In the authors' work, domain taxonomies are used to meaningfully map research results onto them, either to explore research profiles [6], to annotate research papers [7], or to measure the level of research results [8].

A definitive taxonomy of the domain of computer science is maintained by the Association for Computing Machinery; the latest version of the ACM Computing Classification System can be found at [9]. This classification is well balanced in that: (a) its nodes have approximately equal numbers of children, and (b) its branches have approximately equal numbers of layers. However, there are not many domains for which sound taxonomies are available. For example, when we decided to shift our efforts from the computer science domain to mathematics, for the analysis of synopses of courses in mathematics and related subjects at a Russian university, we discovered a rather disappointing picture.

In Russian, the only publicly available taxonomy of mathematics and related domains is the classification for the government-sponsored Abstracting Journal of Mathematics [10], developed back in 1999. It is somewhat outdated and unbalanced. For example, it lacks such topics as "Discrete mathematics", "Formal concept analysis" and "Mathematical economics". It has 157 concepts rooted at the topic "Differential equations" and only four topics rooted at "Game theory". Therefore we thought that we could develop a reasonable taxonomy of mathematics if we used the instructional materials of the Russian Higher Attestation Commission (HAC). The HAC is a governmental body that supervises the national system of PhD and ScD theses [11]. Its classifications are regularly updated and made publicly available as "passports of specialties"; the list of specialties is revised once a decade or two. For the case of Mathematics, the HAC classification is illustrated in Table 1. As one can see, it covers just two layers of the mathematics domain and cannot be used in the analysis of a university curriculum, because more layers are needed to reach an adequate degree of granularity of mathematical concepts.

This defines the problem we are going to address as a problem of taxonomy refinement. We start with a manually built upper part of the taxonomy, a taxonomy frame including the root subject, and then automatically refine the leaves of the taxonomy one by one. Therefore, given a leaf subject, we need a method that would find appropriately refined concepts and use them to grow the taxonomy. The problem of refinement of taxonomy subjects has received some attention in the literature. A big question arising before any refinement starts is that of the sources for generating refined topics. A naive approach is to take a search engine such as Google and run a specially designed query involving the leaf concept under consideration "A", such as "A consists of..." or "A is a ..." [12]. Such a query leads to a set of concepts that can be considered as potential subtopics of topic A. This works well if the ontology is represented by means of a formal language, such as OWL, by introducing new logical relations [13]. Yet in a less formal context the approach leads to somewhat dubious and messy results.

Table 1 The set of main mathematics divisions according to [11]. One can easily see differences from the divisions in the classification of Mathematics subjects developed by the American Mathematical Society [26]. For example, the field of computer science is represented here by Numerical mathematics, and Combinatorics by Discrete mathematics and mathematical cybernetics

The next idea is to use a manually designed universal taxonomy such as Wikipedia, so that the choice of topics comes from a well-defined hierarchical structure openly available on the Internet. Indeed, the idea of using Wikipedia as a major source of topics for taxonomy building is becoming increasingly popular [12, 14–16]. Wikipedia covers many specific knowledge domains and offers many data types, such as unstructured texts, images, category trees, revision histories, redirect pages and links, etc. There are several features making Wikipedia a unique and highly convenient tool for taxonomy building [17]:

  • Wikipedia fills the knowledge gap by encoding large amounts of knowledge in an explicit way.

  • Wikipedia is a web of interconnected concepts and named entities, thus showing a high degree of ontologization.

  • In Wikipedia, the high quality of the ontologized information is ensured by means of collaborative editing, which enables scalable and open knowledge management.

  • Thanks to its massive collaborative approach, Wikipedia is able to cover in depth not only domains pertaining to popular culture but also very specialized domains such as Systems Biology or Artificial Intelligence.

  • Wikipedia provides continuously updated content, which is (i) revised to ensure high quality, and (ii) kept up-to-date to reflect changes due to recent events.

  • Wikipedia is one of the largest multilingual repositories of knowledge ever created.

Papers [12, 14–16] present different approaches to constructing [14, 15] or refining [12, 16] ontologies and taxonomies by using Wikipedia article data. In [15, 17] the Wikipedia articles are used as a source of topics; in [16], the Wikipedia category tree; in [14], both the articles and the category labels; and in [12], the Wikipedia infoboxes. This line of research has recently been extended to the issue of enhancing the Wikipedia taxonomies by using additional text collections [18] and to building a taxonomy for a text collection by using Wikipedia [19, 20].

Yet none of this addresses the problem of our concern, refining leaves of a taxonomy with Wikipedia. Nevertheless, from this perspective, using Wikipedia to refine a topic seems a rather straightforward business. First of all, one should find the category in the Wikipedia tree of categories that is nearest to the topic under consideration, if none coincides with the topic. This can be done by using a topic-to-text relevance measure applied to the texts under each category of the Wikipedia tree. Then the children of the nearest category are considered as the children of the topic, which completes a step of the refinement process. The Wikipedia-based refinement strategy thus outlined will be referred to further on as the WR strategy.

Unfortunately, the actual situation with Wikipedia as a crowdsourcing project is a bit messier. One of the issues is that Wikipedia writers are sometimes more enthusiastic than professional. Therefore, one may expect that the hierarchy itself, the set of its categories (subjects), some of the articles, or all of these may be flawed.

Indeed, the category tree according to the Wikipedia writers is not necessarily a tree. For example, three categories of Wikipedia in Russian, "Optimization", "Machine Learning" and "Search engine", are arranged in such a way that "Machine Learning" is the parent of "Optimization", which is the parent of "Search engine", which is the parent of "Machine Learning". This makes a cycle that must be broken, and not necessarily at one edge only.

Next, the category tree is not perfect in the sense that some categories have no semantic relation to their claimed parental categories, the more so with regard to the grandparental categories. An explanation of this phenomenon is given in [21]: Wikipedia writers tend to assign to an article or subcategory as many categories as possible. For example, the category "Killed accidentally" lies under the category "Randomness" (see Fig. 1), which is not that bad linguistically speaking. Yet this makes no sense if one wants to use it for developing a mathematically oriented taxonomy. Similar examples in the Russian version of Wikipedia: category "Theory of algorithms" with its subcategory "Feedback loop"; category "Mathematical statistics" with its subcategory "Decision trees"; and category "Algorithm" with its subcategory "Syntactic analysis".

Fig. 1
figure 1

An example of a wrong subcategory: “Killed accidentally” is a subtopic of “Randomness”

One more source of issues is the assignment of articles to categories. Say, in the Russian version of Wikipedia, a stub of the article "Percolation theory" is assigned to the "Probability theory" category, although it does not properly belong there; the "Artificial life" computational model is assigned to "Evolutionary algorithms"; and the article "Linear code", on coding theory, is assigned to the "Machine learning" category, as is the "Netflix Prize" article.

To meaningfully apply the WR strategy, therefore, one needs a tool or a set of tools that could be used to evaluate: the similarity between topics, the relevance of a category as a subcategory of a topic, and the relevance of an article to a topic. Using these evaluations, one can choose relevant Wikipedia categories and then set thresholds to decide on the relevance of Wikipedia categories or articles to topics, depending on the levels of their relevance. To this end, we propose using a naturally defined topic-to-text relevance measure based on building a suffix tree annotated by frequencies to represent the text under consideration as a set of strings consisting of individual letters and symbols. This measure is defined as the conditional probability of characters averaged over fragments of the topic and text being matched (CPAMF) [22–24]. The CPAMF-based technique involves no natural language features, which makes it more or less universal across languages. Moreover, it requires no data preprocessing. On the other hand, the technique also has limitations, because it cannot capture the structure of synonyms on its own. In experiments, techniques using suffix tree based relevance measures appear superior to the competition [22, 25]. For example, [22] reports a series of experiments in using topics from the ACM Computing Classification System [9] for annotation of research papers according to the relevance of the topics to paper abstracts. The CPAMF-based relevance measures led to much better results than those based on either of two popular relevance measures: the cosine measure according to the vector space model, and the BM25 relevance measure according to a probabilistic model of text [22].

In the remainder of the paper, Sect. 2 presents our approach to using the CPAMF-based technique with Wikipedia for refining taxonomy leaves, taking into account the noisy structure of Wikipedia. Section 3 describes the version of the suffix tree techniques and the CPAMF keyword-to-text relevance measure used throughout. Two Russian-language examples are given in Sect. 4. Section 5 concludes.

This study (research grant No 15-05-0041) was supported by The National Research University – Higher School of Economics’ Academic Fund Program in 2015. The financial support from the Government of the Russian Federation within the framework of the implementation of the 5-100 Programme Roadmap of the National Research University – Higher School of Economics is acknowledged.

2 Our WR Strategy

First we specify the taxonomy domain and manually form the frame of the taxonomy by extracting basic topics from the publicly available instruction materials of the Higher Attestation Commission (HAC) of Russia [27]. The data for refining the taxonomy frame is extracted from Wikipedia. We provide two examples of refined taxonomies for concepts from: (1) probability theory and mathematical statistics (PTMS) and (2) numerical mathematics (NM). The frames of both taxonomies are three-layer rooted trees of the main topics in the domain (see Tables 2 and 3, respectively).

Table 2 Probability theory and mathematical statistics taxonomy frame
Table 3 Numerical mathematics taxonomy frame

The next step is to define the corresponding Wikipedia categories. For each domain we choose only the category of the same name, so there is no need to address any other categories. Among the variety of Wikipedia contents, we use only two data types:

  • The hierarchical structure of Wikipedia category tree

  • The collection of unstructured Wikipedia articles. See Table 4 for the total number of categories and articles.

Hereafter we use the Wikipedia category tree for extending our taxonomy tree. We try to assign some Wikipedia categories to every taxonomy topic of the first and second layers. First, we find those Wikipedia categories that correspond to a taxonomy topic under consideration: they should be subdivisions of the topic. Next we check whether the assigned category should be further subdivided according to the structure of the category tree. If not, the underlying categories are again assigned to taxonomy topics. Since almost every Wikipedia category contains several articles, the titles of these articles become leaves of our refined taxonomy. Finally, we extract keywords representing the content of each leaf-defining Wikipedia article. These keywords are then used as the leaf descriptors. Since the related Wikipedia categories usually have just one- or two-layer subtrees, such a method is highly convenient for the task (see Fig. 2 for the refining scheme).

Fig. 2
figure 2

The refining scheme. Initial taxonomy topics are in rectangles, the Wikipedia categories and subcategories are in rounded rectangles, the Wikipedia articles are in the ovals, and the leaf descriptors are in the clouds

Table 4 The total number of subcategories and articles and the number of irrelevant subcategories and articles in PTMS and NM categories in the Russian Wikipedia (accessed in August, 2013)

We extract topics from both the Wikipedia category tree and the individual articles. This allows us to follow the above-mentioned ACM-CCS gold standard of taxonomy. By restricting the domain of the taxonomy to smaller topics, such as probability theory and mathematical statistics, we avoid the issue of big Wikipedia data and also gain the possibility to manually examine the results. The method is illustrated by its application to two mathematics domains from Table 1, "Probability theory and mathematical statistics" and "Numerical mathematics" (both in Russian), which shows both advantages and drawbacks of the current stage in developing our method.

On the whole, the refined taxonomy should be balanced, so that every branch of the taxonomy is of approximately the same depth and width. To achieve that, each topic is refined by one or more layers of Wikipedia categories and articles, the latter placed as leaves at the last layer.

Here are the main steps of our WR approach to taxonomy refining (a minimal data-structure sketch follows the list):

  1. Specify the domain of the taxonomy to be refined and set the frame of the taxonomy manually.

  2. Download from Wikipedia the category tree and articles of the domain under consideration.

  3. Clean the category subtree of irrelevant articles.

  4. Clean the category subtree of irrelevant subcategories.

  5. Assign the remaining Wikipedia categories to the taxonomy topics.

  6. Form the intermediate layers of the taxonomy by using Wikipedia subcategories.

  7. Use the Wikipedia articles in each of the added category nodes as its leaves.

  8. Extract relevant keywords from the Wikipedia articles and use them as leaf descriptors.
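The paper does not fix a concrete data structure for the growing taxonomy. As a point of reference for the step sketches below, here is one minimal way to represent taxonomy nodes in Python; the layout is our illustrative assumption, not a specification from the method itself.

```python
# A minimal node type for the refined taxonomy (illustrative assumption).
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    children: list = field(default_factory=list)     # subtopics, categories, or article leaves
    descriptors: list = field(default_factory=list)  # leaf keywords (Sect. 2.8)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child
```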

Let us describe these steps in more detail, using the domains of probability theory and mathematical statistics (PTMS) and numerical mathematics (NM) for illustration.

2.1 Specify the Domain of Taxonomy

As we have said already, these are PTMS and NM. See Tables 2 and 3 for the frames of the corresponding taxonomies.

2.2 Download the Category Tree and Articles from the Wikipedia

Download from Wikipedia the category subtrees rooted at "Probability theory and mathematical statistics" and "Numerical mathematics", together with all the underlying articles.
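The paper does not specify the download mechanism; below is a minimal sketch of how such a subtree can be collected through the standard MediaWiki API (the endpoint and query parameters are real; the traversal and the nested-dict layout are our assumptions). Note the `seen` set, which guards against the category cycles discussed in Sect. 1.

```python
# A sketch of downloading a Wikipedia category subtree via the MediaWiki API.
import requests

API = "https://ru.wikipedia.org/w/api.php"

def category_members(title):
    """Yield ('subcat' | 'page', title) for all members of a category."""
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": title, "cmtype": "subcat|page",
              "cmlimit": "500", "format": "json"}
    while True:
        data = requests.get(API, params=params).json()
        for m in data["query"]["categorymembers"]:
            yield ("subcat" if m["ns"] == 14 else "page"), m["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow the API's continuation token

def download_subtree(root, seen=None):
    """Recursively collect a category subtree as nested dicts."""
    seen = set() if seen is None else seen  # guard against category cycles
    seen.add(root)
    node = {"title": root, "articles": [], "subcats": []}
    for kind, title in category_members(root):
        if kind == "page":
            node["articles"].append(title)
        elif title not in seen:
            node["subcats"].append(download_subtree(title, seen))
    return node

# e.g. download_subtree("Категория:Теория вероятностей и математическая статистика")
```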

2.3 Clean the Category Subtree of Irrelevant Articles

We consider an article irrelevant to the domain under consideration if:

  (a) the relevance score between the article title and the text of the article is low;

  (b) the relevance score between the parental category title and the text of the article is low.

The first condition allows us to filter out stubs (short unfinished articles or article templates). According to the second condition, we remove those articles that are unlikely to have anything to do with their parental categories. The relevance between the title of the parent category and the article is scored by using our string-to-text relevance measure, which follows from the annotated suffix tree (AST) method described later. It expresses the conditional probability of string characters, averaged over the fragments matched in a suffix tree representing the text, and ranges from 0 to 1. The smaller its value, the smaller the chance that the string (the title of the parent category) is relevant to the text (the article). We set the relevance threshold at the value of 0.2, based on our experience in using the measure.
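In code, the two conditions amount to a simple filter. This sketch takes the relevance scorer as a parameter (a runnable CPAMF implementation is sketched in Sect. 3); the 0.2 threshold is the value used in the paper, while the `{title: text}` layout is our assumption.

```python
# Conditions (a) and (b) of Sect. 2.3 as a filter over {title: text}.
def clean_articles(articles, parent_category, relevance, threshold=0.2):
    return {
        title: text for title, text in articles.items()
        if relevance(title, text) >= threshold             # (a) drops stubs
        and relevance(parent_category, text) >= threshold  # (b) drops misplaced articles
    }
```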

At first glance, all the judgements of irrelevance in Table 5 seem wrong; yet they are all correct. Indeed, "Collectively exhaustive events" is not an article but just a stub. "Topic modelling" does involve probabilities, but it is part of "Text mining" or "Information retrieval" rather than of "Probability theory". Similarly, "Verlet integration" belongs in "Integration of differential equations" rather than in "Numerical integration". Similar doubts can be raised regarding Table 6, which presents examples of articles irrelevant to their Wikipedia-assigned categories according to condition (b) above. Yet "BSSN formalism", as part of the general relativity theory, indeed has nothing to do with "Numerical integration"; the more so since, in fact, it is just a stub, not an article. "ROC curve" is a "Machine learning" concept developed specifically for classifiers, not regression. "Judea Pearl" is not a concept but the name of a renowned scientist who made his name in AI rather than in statistics. Although "Projection pursuit" does belong in "Mathematical statistics", this topic can hardly be considered an immediate offspring of "Mathematical statistics", because it clearly belongs in "Multivariate statistics".

Table 5 Examples of irrelevant articles in the Russian Wikipedia according to condition (a)
Table 6 Examples of irrelevant articles in the Russian Wikipedia according to condition (b)

2.4 Clean the Category Subtree of Irrelevant Subcategories

We consider a subcategory irrelevant if the CPAMF similarity between its parent category title and the text obtained by merging all the articles in the subcategory is low. The relevance threshold here is again set at the value of 0.2, which probably has something to do with properties of the Russian language.

A few examples of this type are given in Table 7. In one of them, "Optimization theory", which should be a sibling of "Machine learning", is assigned as its immediate offspring. The last line relates to a situation in which a rather special branch of computational methods, oriented at a specific domain, comes as an immediate offspring of "Numerical methods" in general, instead of being classed as belonging to the theory of that specific domain. The other NM example is similar. The two lines in between relate to the meaning of statistics as a social sciences tool and, therefore, do not belong in Mathematics at all.

Table 7 Examples of irrelevant categories in Russian Wikipedia

This approach may fail if the subcategory contains no articles but is further divided into subcategories, so that there is nothing to merge.
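A sketch of this check follows; it returns None in the empty-subcategory case just mentioned, leaving the decision to the caller (that handling is our assumption, since the paper only notes the failure mode).

```python
def subcategory_relevance(parent_title, subcat_articles, relevance):
    """Score a subcategory against its parent category title (Sect. 2.4)."""
    if not subcat_articles:
        return None  # no articles to merge: the relevance score is undefined here
    merged = " ".join(subcat_articles.values())
    return relevance(parent_title, merged)
```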

2.5 Assign the Wikipedia Categories to the Taxonomy Topics

After cleaning the Wikipedia category subtree of irrelevant categories and articles, the method allocates each of the remaining Wikipedia categories to a corresponding topic in the current fragment of the taxonomy, using the CPAMF relevance scores between the taxonomy topics and the categories. A topic-to-category score is computed between the topic and the text obtained by merging together all the articles in the category, as defined above.
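A minimal sketch of this allocation step, with the same parameterized relevance scorer as above: the winning topic is simply the argmax over the topics of the taxonomy frame (cf. Tables 8 and 9).

```python
def assign_category(category_articles, topic_titles, relevance):
    """Return the best taxonomy topic for a category, plus all the scores."""
    merged = " ".join(category_articles.values())
    scores = {t: relevance(t, merged) for t in topic_titles}
    return max(scores, key=scores.get), scores
```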

Table 8 CPAMF relevance scores between the category “Bayesian statistics” and all the topics in the PTMS fragment of the taxonomy

Tables 8 and 9 present two such cases: the CPAMF relevance scores between a specific Wikipedia category and all the topics rooted at PTMS (Table 8) and at NM (Table 9). The topics are presented in ascending order of the CPAMF score, so that the category is assigned to the last one.

Table 9 CPAMF relevance scores between the category “Algorithms for solving SLE” and all the topics in the NM fragment of the taxonomy

2.6 Decision on Wikipedia Subcategories

The categories that are more relevant to their parental categories than to the taxonomy topic under consideration remain as intermediate layers in the new taxonomy: their offspring are the titles of the relevant articles.

Table 10 Examples of categories that form intermediate layers

According to the data in Table 10, the first three subcategories of the category "Random processes", namely Markov processes, Martingale theory, and Monte Carlo methods, are more relevant to their parent in Wikipedia than to the topic in our tree, whereas the other three are closer to the topic, so they go immediately under the topic. Therefore we obtain a subtree in our taxonomy rooted at "Random processes and fields". The root has four children: Random processes, Stochastic models, Queueing theory, and Noise. Of these, the first one, Random processes, has three children of its own: Markov processes, Martingale theory, and Monte Carlo methods.
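The decision rule thus reduces to comparing two scores per subcategory; here is a minimal sketch of it, under the same assumptions as the previous sketches.

```python
def place_subcategory(subcat_articles, parent_title, topic_title, relevance):
    """Sect. 2.6 rule: keep a subcategory under its Wikipedia parent as an
    intermediate layer, or re-attach it directly under the taxonomy topic."""
    merged = " ".join(subcat_articles.values())
    if relevance(parent_title, merged) > relevance(topic_title, merged):
        return "under_parent"  # stays as an intermediate layer
    return "under_topic"
```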

2.7 Use Wikipedia Articles in each Added Category Node as its Children

If a Wikipedia category is assigned to a taxonomy topic, all the articles left in it after cleaning are placed as new children descending from the topic. For example, the category "Monte Carlo methods" has 10 articles, listed in Table 11. As the table shows, four of the articles are deemed irrelevant. The relevant ones form the set of children of the category.

Table 11 Relevant and irrelevant articles in the "Monte Carlo methods" category

The corresponding subtaxonomies are presented in Figs. 3 and 4.

Fig. 3
figure 3

Random processes and fields subtaxonomy

Fig. 4
figure 4

Monte Carlo methods subtaxonomy

2.8 Extract Keywords from Wikipedia Articles and Use them as Leaf Descriptors

A leaf taxonomy topic can be assigned a set of phrases falling under it, as is the case with ACM-CCS. To extract keywords and key phrases, we employ no sophisticated techniques, just taking the most frequent nouns and collocations, respectively. Of course, a key phrase is looked for as a grammar pattern, such as adjective \(+\) noun or noun \(+\) noun.

More specifically, we use a publicly available part-of-speech parser, such as [28] for texts in Russian, to label all words in a text with part-of-speech tags. After this we select phrases consisting of neighboring words tagged according to a prespecified pattern, like noun \(+\) noun or adjective \(+\) noun, count the number of their occurrences, and select those of the highest frequency. For example, for the leaf "Gibbs sampling" above, we obtained the most frequent terms and adjective \(+\) noun pairs shown in Table 12.

What is nice about them is that these are exactly the terms used in the lecture synopses in Mathematics.
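A minimal sketch of this extractor in Python, assuming the pymorphy2 morphological analyzer as the tagger (the paper cites [28] for its parser; pymorphy2 is our substitute here, and the regex tokenization is also our assumption).

```python
import re
from collections import Counter
import pymorphy2  # Russian morphological analyzer, used here as the POS tagger

morph = pymorphy2.MorphAnalyzer()

def leaf_descriptors(text, top=10):
    """Most frequent nouns and adjective+noun / noun+noun pairs (Sect. 2.8)."""
    words = re.findall(r"\w+", text.lower())
    tagged = [(w, morph.parse(w)[0].tag.POS) for w in words]
    nouns = Counter(w for w, pos in tagged if pos == "NOUN")
    pairs = Counter(
        f"{w1} {w2}"
        for (w1, p1), (w2, p2) in zip(tagged, tagged[1:])
        if p2 == "NOUN" and p1 in ("ADJF", "NOUN")  # adjective+noun or noun+noun
    )
    return nouns.most_common(top), pairs.most_common(top)
```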

3 CPAMF String-to-Text Relevance Score

The suffix tree is a data structure used for storing and searching strings of characters and their fragments [29]. In a sense, the suffix tree model is an alternative to the vector space model (VSM), arguably the most popular model for text representation [30]. When the suffix tree representation is used, the text is considered as a set of strings, where a string may be any semantically significant part of the text, like a word, a phrase, or even a whole sentence. An annotated suffix tree (AST) is a suffix tree whose nodes (not edges!) are annotated with the frequencies of the string fragments. An algorithm for the construction and usage of ASTs for spam filtering is described in [25]. Some other applications are described in [23, 24].

Table 12 Frequencies of keywords for leaf “Gibbs sampling”

In our applications we consider a Wikipedia article as the set of its three-word strings. The titles of the Wikipedia categories and articles are also considered as strings in the set. To estimate the relevance of a standalone string to a collection of strings, we build an AST for the set of strings and then find all the matches between the AST and fragments of the given string. For every match we compute a score as the average, over its characters, of the frequency of a character related to the frequency of its prefix. Then the total score is calculated as the average score of all the matches. Obviously, the final value has the flavor of a conditional probability and lies between 0 and 1. In contrast to the similarity measures used in [23–25], this one has a natural interpretation and, moreover, does not depend on the text length either explicitly or implicitly, as our experiments show. Let us describe the AST method in more detail.

According to the annotated suffix tree model [23–25], a text document is not a set of words or terms, but a set of so-called strings, sequences of characters arranged in the same order as they occur in the text. Each string is characterized by a real number: the greater the number, the more important the string is for the text. An annotated suffix tree (see Fig. 5) is a data structure used for computing and storing all fragments of the text and their frequencies. It is a rooted tree in which:

  • Every node corresponds to one character.

  • Every node is labeled by the frequency of the text fragment encoded by the path from the root to the node.

To build an AST, we split the text into relatively short strings of three words each and apply them consecutively, to ensure that the resulting AST has a relatively modest size. Our algorithm for constructing an AST [25] is a modification of the well-known algorithms for constructing suffix trees [24, 29]. The AST is built iteratively. For each string, its suffixes are added to the AST one by one, starting from an empty set representing the root. To add a suffix to the AST, first check whether there is already a match, that is, a path in the AST that encodes/reads the whole suffix or its prefix. If such a match exists, we add 1 to all the frequencies in the match and, if the match does not cover the whole suffix, append new nodes with frequency 1 to the last node of the match. If there is no match, we create a new chain of nodes from the root with frequencies 1.
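This construction is easy to state as running code. Below is a minimal Python sketch in which each node is a [frequency, children] pair keyed by character; the nested-dict layout is our assumption, while the update rule follows the description above.

```python
def add_suffix(root, suffix):
    """Add one suffix to the AST, bumping frequencies along the match."""
    node = root
    for ch in suffix:
        if ch in node:
            node[ch][0] += 1    # the path so far matches: add 1 to the frequency
        else:
            node[ch] = [1, {}]  # no match: start a new chain with frequency 1
        node = node[ch][1]

def build_ast(strings):
    """Build an AST for a set of strings, e.g. the 3-word strings of an article."""
    root = {}
    for s in strings:
        for i in range(len(s)):
            add_suffix(root, s[i:])  # add every suffix of every string
    return root

# Example: build_ast(["mining"]) yields the tree of Fig. 5.
```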

Fig. 5
figure 5

An AST for string “mining”

To use an AST for scoring string-to-text relevance, we first build an AST for the text under consideration. Next, we match the string against the AST to estimate the CPAMF relevance.

3.1 A Procedure for Computing String-to-Text CPAMF Relevance Score

Input: a string and the AST for a given text. Output: the CPAMF relevance score.

  1. The string is represented by the set of its suffixes, itself included;

  2. Every suffix is matched against the AST starting from the root. To estimate the match we use the average conditional probability of the next symbol:

    $$\begin{aligned} score(match(suffix,ast)) = \frac{1}{|suffix|}\sum \nolimits _{node \in match} \phi \left( \frac{f(node)}{f(parent(node))}\right) , \end{aligned}$$

    where \(f(node)\) is the frequency of the matching node, \(f(parent(node))\) is the frequency of its parent, and \(|suffix|\) is the length of the suffix;

  3. The relevance of the string is evaluated by averaging the scores over all suffixes:

    $$\begin{aligned} relevance(string,text) = \textit{SCORE}(string,ast) = \frac{\sum _{suffix} score(match(suffix,ast))}{|string|}, \end{aligned}$$

    where \(|string|\) is the length of the string.

Note that the score is found by applying a scaling function \(\phi \) to convert a match score into a relevance evaluation. There are three useful scaling functions, according to the experiments in [24] on using a similar method to categorize e-mails into the "spam" and "ham" categories:

  • Identity function: \(\phi (x) = x \)

  • Logit function:

    $$\begin{aligned} \phi (x) = \log \frac{x}{1-x} =\log x - \log (1-x) \end{aligned}$$
  • Root function \(\phi (x) = \sqrt{x}\)

We use the identity scaling function because it has an obvious meaning: it yields the conditional probability of characters averaged over matching fragments (CPAMF).
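For reference, the three scaling functions in Python; note that the logit diverges as its argument approaches 0 or 1, so a practical implementation would clip the argument (that caveat is ours, not the paper's).

```python
import math

def identity(x): return x                      # used in this paper (CPAMF)
def logit(x):    return math.log(x / (1 - x))  # undefined at x = 0 and x = 1
def root(x):     return math.sqrt(x)
```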

Consider an example illustrating the described method. Let us construct an AST for the string "mining". This string has six suffixes: "mining", "ining", "ning", "ing", "ng", and "g". We start with the first suffix and add it to the empty AST as a chain of nodes with frequencies equal to one. To add the next suffix, we need to check whether there is any match, i.e., whether there is a path in the AST starting at its root that encodes/reads a prefix of the suffix. Since there is no match between the existing nodes and the second suffix, we add it to the root as a chain of nodes with frequencies equal to one. We repeat this step until a match is found: a prefix of the fourth suffix, "ing", matches the second suffix "ining" in its first two letters, "in". Hence we add 1 to the frequency of each of these nodes and add a new child node "g" to the leaf node "n" (see Fig. 5). The next suffix, "ng", matches the third suffix, and we repeat the same actions: increase the frequencies of the matched nodes and add a new child node for the unmatched character. The last suffix does not match any path in the AST, so again we add it to the AST's root as a single node with frequency equal to one. Now let us calculate the relevance score of the string "dining" using the AST in Fig. 5. There are six suffixes of the string "dining": "dining", "ining", "ning", "ing", "ng", and "g". Each of them is aligned with an AST path starting from the root. The scorings of the suffixes are presented in Table 13.

Table 13 Computing the string “dining” score

We have used the identity scaling function to score all six suffixes of the string “dining”. Now, to get the final CPAMF relevance value we sum and average them:

$$\begin{aligned} relevance(dining, mining) = \frac{0+0.76+0.71+0.61+0.41+0.16}{6} = \frac{2.65}{6} \approx 0.44 \end{aligned}$$

Although "dining" differs from "mining" by just one character, the total score, 0.44, is substantially less than unity. This is not only because the trivial suffix "dining" contributes 0 to the sum, but also because the conditional probabilities get smaller for the shorter suffixes. When the similarity is even less pronounced, the score gets even smaller, because at step 2 of the CPAMF procedure we divide by the length of the suffix, not the length of the match. This makes the values of the CPAMF score comparable across strings and texts of various sizes.
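Here is a Python sketch of the scoring procedure, reusing build_ast and identity from the earlier sketches; it reproduces the worked example above. Treating the root "frequency" as the sum of its children's frequencies is our reading of the example, consistent with the values in Table 13.

```python
def suffix_score(root, suffix, phi=identity):
    """Step 2: average conditional probability of the next character."""
    node, parent_freq = root, sum(freq for freq, _ in root.values())
    total = 0.0
    for ch in suffix:
        if ch not in node:
            break                       # the match ends here
        freq, children = node[ch]
        total += phi(freq / parent_freq)
        node, parent_freq = children, freq
    return total / len(suffix)          # divide by suffix length, not match length

def relevance(string, root):
    """Step 3: average the suffix scores over all suffixes of the string."""
    return sum(suffix_score(root, string[i:])
               for i in range(len(string))) / len(string)

ast = build_ast(["mining"])
print(round(relevance("dining", ast), 2))  # prints 0.44, as computed above
```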

4 Results

For the PTMS taxonomy (see Fig. 6), the resulting tree has 6 layers, with the depth varying from 4 to 6. At the cleaning stage, 20 categories and 108 articles were removed from the Wikipedia category tree. The resulting NM taxonomy (see Fig. 7) is of a similar shape: it has 8 layers, with the depth varying from 4 to 8. Again, at the cleaning stage, 11 categories and 30 articles were removed.

Fig. 6
figure 6

A fragment of the refined PTMS taxonomy. Lower layers are shown

Fig. 7
figure 7

A fragment of the refined NM taxonomy. Higher layers are shown

Now we provide two illustrative examples of how the lower layers of the PTMS taxonomy and the higher layers of the NM taxonomy were refined. Specifically, according to Table 14, the category "Factor analysis" should be allocated to the taxonomy topic "Mathematical statistics", since the latter yields the highest score.

Table 14 The top three candidate taxonomy topics for “Factor analysis” allocation
Table 15 The top three taxonomy topics of the highest score for “Factor analysis” allocation

There are five articles left in the "Factor analysis" category after the cleaning procedures (see Table 15). The keywords and phrases extracted from these articles and used as leaf descriptors are presented in Fig. 6, in clouds.

There are three categories allocated to the taxonomy topic "Numerical algorithms" (see Fig. 7). Two of them ("Optimal control" and "Numerical linear algebra") contain three articles each, whereas the third one, "Numerical integration", contains four articles. The following numbers lead to this structure of the NM taxonomy: see Table 16 for the relevance values of the category-to-topic allocation and Table 17 for the articles satisfying the cleaning criteria.

There are several issues with each of the obtained taxonomy trees. First, the position of the topic "Decision trees" is misleading. According to our method, this topic should be placed under "Mathematical statistics" and thus be a sibling of the "Machine learning" topic. The reason is the low relevance of the string "Machine learning" to any of the four articles in the "Decision tree" category. Second, the category "Transformers/Transducers" ("Preobrazovateli" in Russian), which is counted as relevant to the parent category "Algorithm efficiency", is further subdivided into "Piezoelectrics", "Power sources", and "Sound emitters and detectors". These concepts have nothing to do with algorithms; they appear only because of the double meaning of the category title in Russian. Third, both taxonomies are stuffed with articles describing personalities, such as "Probability theorists" or "MIPT Lecturers". Hence more effective cleaning procedures, including filtering of articles according to their types, should be developed. Two fragments of the refined taxonomies are presented in Figs. 6 and 7. Also, let us recall that the subtree rooted at "Random processes and fields" turned out a bit unbalanced, since Wikipedia had no articles on random fields.

Table 16 Three categories allocated to the “Numerical algorithms” topic
Table 17 Articles relevant to the “Numerical algorithms” branch of NM taxonomy

To refine a taxonomy at a given topic, the AST method is applied five times in the process:

  • Twice to clean the Wikipedia category subtree of irrelevant articles;

  • To clean the category subtree of irrelevant categories;

  • To relate taxonomy topics to Wikipedia categories;

  • To distinguish between categories to be assigned to taxonomy topics and categories to remain children of their Wikipedia parents.

In the first three cases, an "irrelevance" threshold for the article or category title against the text should be specified. Our experiments show that the threshold of 0.2, which amounts to 1/3 of the maximum value, works well.

5 Conclusion

We have presented an approach to refining a taxonomy by using Wikipedia and its structure. Our contributions are: (a) the CPAMF string-to-text relevance measure; (b) using CPAMF for cleaning Wikipedia of irrelevant categories and articles; (c) using both Wikipedia articles and categories to add two layers at once to the topic under consideration; (d) supplying the leaves with descriptors. We think that the last item is important, as it can be seen as a further taxonomy refinement step, so that synopses of university courses can be meaningfully mapped to the taxonomy.

The presented implementation of the approach, using the CPAMF relevance scores, has both positive and negative sides. The positive relates to its relative independence of the language and its grammar; the negative, to the lack of tools for capturing synonymy and near-synonymy. Other issues may relate to the fact that Wikipedia can give a somewhat biased picture of the domain. Extending the method to cover synonymous words with little character-level coincidence should be one of the main subjects of further work. Another direction for further development is in devising more precise Wikipedia preprocessing and analysis procedures.