1 Introduction: Motivation and Background

A taxonomy of concepts in a knowledge domain, or hierarchical ontology, is a popular computational instrument for representing, maintaining, and using domain knowledge [13]. A taxonomy is a rooted tree formalizing a hierarchy of subjects in an applied domain. Such a tree corresponds to a generalizing relation between the subjects, usually of the form "A is a B" or "A is part of B". Automation of taxonomy building is important for further progress in many areas of data analysis and knowledge engineering, including computational text processing and improving information retrieval [1, 4, 5]. In the authors' work, domain taxonomies are used to meaningfully map research results onto them, either to explore research profiles [6], to annotate research papers [7], or to measure the level of research results [8].

A definitive taxonomy of the domain of computer science is maintained by the Association for Computing Machinery; the latest version of the ACM Computing Classification System can be found at [9]. This classification is well balanced in that: (a) its nodes have approximately equal numbers of children, and (b) its branches have approximately equal numbers of layers. However, there are not many domains for which sound taxonomies are available. For example, when we decided to shift our efforts from the computer science domain to mathematics, for the analysis of synopses of courses in mathematics and related subjects at a Russian university, we discovered a rather disappointing picture.

In Russian, the only publicly available taxonomy of mathematics and related domains is the classification for the government-sponsored Abstracting Journal of Mathematics [10], developed back in 1999. It is somewhat outdated and unbalanced. For example, it lacks such topics as "Discrete mathematics", "Formal concept analysis" and "Mathematical economics". It has 157 concepts rooted at the topic "Differential equations" and only four topics rooted at "Game theory". Therefore we thought that we could develop a reasonable taxonomy of mathematics if we used the instructional materials of the Russian Higher Attestation Commission (HAC). The HAC is a governmental body that supervises the national system of PhD and ScD theses [11]. Its classifications are regularly updated and made publicly available as "passports of specialties"; the list of specialties is revised once a decade or two. For the case of Mathematics, the HAC classification is illustrated in Table 1. As one can see, it covers just two layers of the mathematics domain and cannot be used in the analysis of a university curriculum, because more layers are needed to reach an adequate degree of granularity of mathematical concepts.

This defines the problem we are going to address as a problem of taxonomy refinement. We start with a manually built upper part of the taxonomy, a taxonomy frame including the root subject, and then automatically refine the leaves of the taxonomy one by one. Therefore, given a leaf subject, we need a method that would find appropriately refined concepts and use them to grow the taxonomy. The problem of refinement of taxonomy subjects has received some attention in the literature. A big question arising before any refinement starts is that of the sources for generating refined topics. A naive approach is to take a search engine such as Google and run a specially designed query involving the leaf concept under consideration "A", such as "A consists of..." or "A is a ..." [12]. Such a query leads to a set of concepts that can be considered as potential subtopics of topic A. This works well if the ontology is represented by means of a formal language, such as OWL, by introducing new logical relations [13]. Yet in a less formal context the approach leads to somewhat dubious and messy results.

Table 1 The set of main mathematics divisions according to [11]. One can easily see differences from the divisions in the classification of Mathematics subjects developed by the American Mathematical Society [26]. For example, the field of computer science is represented here by Numerical mathematics, and Combinatorics by Discrete mathematics and mathematical cybernetics

The next idea is to use a manually designed universal taxonomy such as Wikipedia, so that the choice of topics comes from a well-defined hierarchical structure openly available on the Internet. Indeed, the idea of using Wikipedia as a major source of topics for taxonomy building is becoming increasingly popular [12, 14–16]. Wikipedia covers many specific knowledge domains and offers many data types, such as unstructured texts, images, category trees, revision histories, redirect pages and links, etc. There are several features making Wikipedia a unique and highly convenient tool for taxonomy building [17]:

  • Wikipedia fills the knowledge gap by encoding large amounts of knowledge in an explicit way.

  • Wikipedia is a web of interconnected concepts and named entities, thus showing a high degree of ontologization.

  • In Wikipedia, the high quality of the ontologized information is ensured by means of collaborative editing, which enables scalable and open knowledge management.

  • Thanks to its massive collaborative approach, Wikipedia is able to cover in depth not only domains pertaining to popular culture but also very specialized domains such as Systems Biology or Artificial Intelligence.

  • Wikipedia provides continuously updated content, which is (i) revised to ensure high quality, and (ii) kept up-to-date to reflect changes due to recent events.

  • Wikipedia is one of the largest multilingual repositories of knowledge ever created.

Papers [12, 14–16] present different approaches to constructing [14, 15] or refining [12, 16] ontologies and taxonomies by using Wikipedia article data. In [15, 17] the Wikipedia articles are used as a source of topics; in [16], the Wikipedia category tree; in [14], both the articles and the category labels; and in [12], the Wikipedia infoboxes. This line of research has recently been extended to the issue of enhancing the Wikipedia taxonomies by using additional text collections [18] and to building a taxonomy for a text collection by using Wikipedia [19, 20].

Yet none of this addresses the problem of our concern, refining leaves of a taxonomy with Wikipedia. Nevertheless, from this perspective, using Wikipedia to refine a topic seems a rather straightforward business. First of all, one should find the category in the Wikipedia tree of categories that is nearest to the topic under consideration, if none coincides with the topic. This can be done by using a topic-to-text relevance measure applied to the texts under each category of the Wikipedia tree. Then the children of the nearest category are considered as the children of the topic, which completes a step of the refinement process. The Wikipedia-based refinement strategy thus outlined will be referred to further on as the WR strategy.

Unfortunately, the actual situation with Wikipedia as a crowdsourcing project is a bit messier. One of the issues is that Wikipedia writers are sometimes more enthusiastic than professional. Therefore, one may expect that the hierarchy itself, the set of its categories (subjects), some of the articles, or all of these may be flawed.

Indeed, the category tree according to the Wikipedia writers is not necessarily a tree. For example, three categories of Wikipedia in Russian, "Optimization", "Machine Learning" and "Search engine", are arranged in such a way that "Machine Learning" is the parent of "Optimization", which is the parent of "Search engine", which is the parent of "Machine Learning". This makes a cycle that must be broken, and not necessarily at one edge only.

Next, the category tree is not perfect in the sense that some categories have no semantic relation to their claimed parental categories, the more so with regard to the grandparental categories. An explanation of this phenomenon is given in [21]: Wikipedia writers tend to assign to an article or subcategory as many categories as possible. For example, the category "Killed accidentally" lies under the category "Randomness" (see Fig. 1), which is not that bad linguistically speaking. Yet this makes no sense if one wants to use it for developing a mathematically oriented taxonomy. Similar examples in the Russian version of Wikipedia: category "Theory of algorithms" with its subcategory "Feedback loop"; category "Mathematical statistics" with its subcategory "Decision trees"; and category "Algorithm" with its subcategory "Syntactic analysis".

Fig. 1
figure 1

An example of a wrong subcategory: “Killed accidentally” is a subtopic of “Randomness”

One more source of issues is the assignment of articles to categories. Say, in the Russian version of Wikipedia, a stub of the article "Percolation theory" is assigned to the "Probability theory" category, although it does not properly belong there; the "Artificial life" computational model is assigned to "Evolutionary algorithms"; and the article "Linear code", on coding theory, is assigned to the "Machine learning" category, as is the "Netflix Prize" article.

To meaningfully apply the WR strategy, therefore, one needs a tool or a set of tools that could be used to evaluate: the similarity between topics, the relevance of a category as a subcategory of a topic, and the relevance of an article to a topic. Using these evaluations, one can choose relevant Wikipedia categories and then set thresholds to decide on the relevance of Wikipedia categories or articles to topics, depending on the levels of their relevance. To this end, we propose using a naturally defined topic-to-text relevance measure based on building a suffix tree annotated by frequencies to represent the text under consideration as a set of strings consisting of individual letters and symbols. This measure is defined as the conditional probability of characters averaged over fragments of the topic and text being matched (CPAMF) [22–24]. The CPAMF-based technique involves no natural language features, which makes it more or less universal across languages. Moreover, it requires no data preprocessing. On the other hand, the technique also has limitations, because it cannot capture the structure of synonyms on its own. In experiments, techniques using suffix tree based relevance measures appear superior to the competition [22, 25]. For example, [22] reports a series of experiments in using topics from the ACM Computing Classification System [9] for annotation of research papers according to the relevance of the topics to paper abstracts. The CPAMF-based relevance measures led to much better results than those based on either of two popular relevance measures: the cosine measure according to the vector space model, and the BM25 relevance measure according to a probabilistic model of text [22].

In the remainder of the paper, Sect. 2 presents our approach to using the CPAMF-based technique with Wikipedia for refining taxonomy leaves, taking into account the noisy structure of Wikipedia. Section 3 describes the version of the suffix tree techniques and the CPAMF keyword-to-text relevance measure used throughout. Two Russian-language examples are given in Sect. 4. Section 5 concludes.

This study (research grant No 15-05-0041) was supported by The National Research University – Higher School of Economics’ Academic Fund Program in 2015. The financial support from the Government of the Russian Federation within the framework of the implementation of the 5-100 Programme Roadmap of the National Research University – Higher School of Economics is acknowledged.

2 Our WR Strategy

First we specify the taxonomy domain and manually form the frame of the taxonomy by extracting basic topics from the publicly available instruction materials of the Higher Attestation Commission (HAC) of Russia [27]. The data for refining the taxonomy frame is extracted from Wikipedia. We provide two examples of refined taxonomies for concepts from: (1) probability theory and mathematical statistics (PTMS) and (2) numerical mathematics (NM). The frames of both taxonomies are three-layer rooted trees of the main topics in the domain (see Tables 2 and 3, respectively).

Table 2 Probability theory and mathematical statistics taxonomy frame
Table 3 Numerical mathematics taxonomy frame

The next step is to define the corresponding Wikipedia categories. For each domain we choose only the category of the same name, so there is no need to address any other categories. Among the variety of Wikipedia contents, we use only two data types:

  • The hierarchical structure of Wikipedia category tree

  • The collection of unstructured Wikipedia articles. See Table 4 for the total number of categories and articles.

Hereafter we use the Wikipedia category tree for extending our taxonomy tree. We try to assign some Wikipedia categories to every taxonomy topic of the first and second layers. First, we find those Wikipedia categories that correspond to a taxonomy topic under consideration: they should be subdivisions of the topic. Next we check whether the assigned category should be further subdivided according to the structure of the category tree. If not, the underlying categories are again assigned to taxonomy topics. Since almost every Wikipedia category contains several articles, the titles of these articles become leaves of our refined taxonomy. Finally, we extract keywords representing the content of each leaf-defining Wikipedia article. These keywords are then used as the leaf descriptors. Since the related Wikipedia categories usually have just one- or two-layer subtrees, such a method is highly convenient for the task (see Fig. 2 for the refining scheme).

Fig. 2
figure 2

The refining scheme. Initial taxonomy topics are in rectangles, the Wikipedia categories and subcategories are in rounded rectangles, the Wikipedia articles are in the ovals, and the leaf descriptors are in the clouds

Table 4 The total number of subcategories and articles and the number of irrelevant subcategories and articles in PTMS and NM categories in the Russian Wikipedia (accessed in August, 2013)

We extract topics from both the Wikipedia category tree and the individual articles. This allows us to follow the above-mentioned ACM-CCS gold standard of taxonomy. By restricting the domain of the taxonomy to smaller topics, such as probability theory and mathematical statistics, we avoid the issue of big Wikipedia data and also gain the possibility to manually examine the results. The method is illustrated by its application to two mathematics domains from Table 1, "Probability theory and mathematical statistics" and "Numerical mathematics" (both in Russian), which shows both advantages and drawbacks of the current stage in developing our method.

On the whole, the refined taxonomy should be balanced, so that every branch of the taxonomy is of approximately the same depth and width. To achieve that, each topic is refined by one or more layers of Wikipedia categories and articles, the latter placed as leaves at the last layer.

Here are the main steps of our WR approach to taxonomy refining (a minimal data-structure sketch follows the list):

  1. Specify the domain of the taxonomy to be refined and set the frame of the taxonomy manually.

  2. Download from Wikipedia the category tree and articles of the domain under consideration.

  3. Clean the category subtree of irrelevant articles.

  4. Clean the category subtree of irrelevant subcategories.

  5. Assign the remaining Wikipedia categories to the taxonomy topics.

  6. Form the intermediate layers of the taxonomy by using Wikipedia subcategories.

  7. Use the Wikipedia articles in each of the added category nodes as its leaves.

  8. Extract relevant keywords from the Wikipedia articles and use them as leaf descriptors.
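The paper does not fix a concrete data structure for the growing taxonomy. As a point of reference for the step sketches below, here is one minimal way to represent taxonomy nodes in Python; the layout is our illustrative assumption, not a specification from the method itself.

```python
# A minimal node type for the refined taxonomy (illustrative assumption).
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    children: list = field(default_factory=list)     # subtopics, categories, or article leaves
    descriptors: list = field(default_factory=list)  # leaf keywords (Sect. 2.8)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child
```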

Let us describe these steps in more detail, using the domains of probability theory and mathematical statistics (PTMS) and numerical mathematics (NM) for illustration.

2.1 Specify the Domain of Taxonomy

As we have said already, these are PTMS and NM. See Tables 2 and 3 for the frames of the corresponding taxonomies.

2.2 Download the Category Tree and Articles from the Wikipedia

Download from Wikipedia the category subtrees rooted at "Probability theory and mathematical statistics" and "Numerical mathematics", together with all the underlying articles.
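The paper does not specify the download mechanism; below is a minimal sketch of how such a subtree can be collected through the standard MediaWiki API (the endpoint and query parameters are real; the traversal and the nested-dict layout are our assumptions). Note the `seen` set, which guards against the category cycles discussed in Sect. 1.

```python
# A sketch of downloading a Wikipedia category subtree via the MediaWiki API.
import requests

API = "https://ru.wikipedia.org/w/api.php"

def category_members(title):
    """Yield ('subcat' | 'page', title) for all members of a category."""
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": title, "cmtype": "subcat|page",
              "cmlimit": "500", "format": "json"}
    while True:
        data = requests.get(API, params=params).json()
        for m in data["query"]["categorymembers"]:
            yield ("subcat" if m["ns"] == 14 else "page"), m["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow the API's continuation token

def download_subtree(root, seen=None):
    """Recursively collect a category subtree as nested dicts."""
    seen = set() if seen is None else seen  # guard against category cycles
    seen.add(root)
    node = {"title": root, "articles": [], "subcats": []}
    for kind, title in category_members(root):
        if kind == "page":
            node["articles"].append(title)
        elif title not in seen:
            node["subcats"].append(download_subtree(title, seen))
    return node

# e.g. download_subtree("Категория:Теория вероятностей и математическая статистика")
```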

2.3 Clean the Category Subtree of Irrelevant Articles

We consider an article irrelevant to the domain under consideration if:

  (a) the relevance score between the article title and the text of the article is low;

  (b) the relevance score between the parental category title and the text of the article is low.

The first condition allows us to filter out stubs (short unfinished articles or article templates). According to the second condition, we remove those articles that are unlikely to have anything to do with their parental categories. The relevance between the title of the parent category and the article is scored by using our string-to-text relevance measure, which follows from the annotated suffix tree (AST) method described later. It expresses the conditional probability of string characters, averaged over the fragments matched in a suffix tree representing the text, and ranges from 0 to 1. The smaller its value, the smaller the chance that the string (the title of the parent category) is relevant to the text (the article). We set the relevance threshold at the value of 0.2, based on our experience in using the measure.
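In code, the two conditions amount to a simple filter. This sketch takes the relevance scorer as a parameter (a runnable CPAMF implementation is sketched in Sect. 3); the 0.2 threshold is the value used in the paper, while the `{title: text}` layout is our assumption.

```python
# Conditions (a) and (b) of Sect. 2.3 as a filter over {title: text}.
def clean_articles(articles, parent_category, relevance, threshold=0.2):
    return {
        title: text for title, text in articles.items()
        if relevance(title, text) >= threshold             # (a) drops stubs
        and relevance(parent_category, text) >= threshold  # (b) drops misplaced articles
    }
```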

At first glance, all the judgements of irrelevance in Table 5 seem wrong; yet they are all correct. Indeed, "Collectively exhaustive events" is not an article but just a stub. "Topic modelling" does involve probabilities, but it is part of "Text mining" or "Information retrieval" rather than of "Probability theory". Similarly, "Verlet integration" belongs in "Integration of differential equations" rather than in "Numerical integration". Similar doubts can be raised regarding Table 6, which presents examples of articles irrelevant to their Wikipedia-assigned categories according to condition (b) above. Yet "BSSN formalism", as part of the general relativity theory, indeed has nothing to do with "Numerical integration"; the more so since, in fact, it is just a stub, not an article. "ROC curve" is a "Machine learning" concept developed specifically for classifiers, not regression. "Judea Pearl" is not a concept but the name of a renowned scientist who made his name in AI rather than in statistics. Although "Projection pursuit" does belong in "Mathematical statistics", this topic can hardly be considered an immediate offspring of "Mathematical statistics", because it clearly belongs in "Multivariate statistics".

Table 5 Examples of irrelevant articles in the Russian Wikipedia according to condition (a)
Table 6 Examples of irrelevant articles in the Russian Wikipedia according to condition (b)

2.4 Clean the Category Subtree of Irrelevant Subcategories

We consider a subcategory irrelevant if the CPAMF similarity between its parent category title and the text obtained by merging all the articles in the subcategory is low. The relevance threshold here is again set at the value of 0.2, which probably has something to do with properties of the Russian language.

A few examples of this type are given in Table 7. In one of them, "Optimization theory", which should be a sibling of "Machine learning", is assigned as its immediate offspring. The last line relates to a situation in which a rather special branch of computational methods, oriented at a specific domain, comes as an immediate offspring of "Numerical methods" in general, instead of being classed as belonging to the theory of that specific domain. The other NM example is similar. The two lines in between relate to the meaning of statistics as a social sciences tool and, therefore, do not belong in Mathematics at all.

Table 7 Examples of irrelevant categories in Russian Wikipedia

This approach may fail if the subcategory contains no articles but is further divided into subcategories, so that there is nothing to merge.
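A sketch of this check follows; it returns None in the empty-subcategory case just mentioned, leaving the decision to the caller (that handling is our assumption, since the paper only notes the failure mode).

```python
def subcategory_relevance(parent_title, subcat_articles, relevance):
    """Score a subcategory against its parent category title (Sect. 2.4)."""
    if not subcat_articles:
        return None  # no articles to merge: the relevance score is undefined here
    merged = " ".join(subcat_articles.values())
    return relevance(parent_title, merged)
```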

2.5 Assign the Wikipedia Categories to the Taxonomy Topics

After cleaning the Wikipedia category subtree of irrelevant categories and articles, the method allocates each of the remaining Wikipedia categories to a corresponding topic in the current fragment of the taxonomy, using the CPAMF relevance scores between the taxonomy topics and the categories. A topic-to-category score is computed between the topic and the text obtained by merging together all the articles in the category, as defined above.
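A minimal sketch of this allocation step, with the same parameterized relevance scorer as above: the winning topic is simply the argmax over the topics of the taxonomy frame (cf. Tables 8 and 9).

```python
def assign_category(category_articles, topic_titles, relevance):
    """Return the best taxonomy topic for a category, plus all the scores."""
    merged = " ".join(category_articles.values())
    scores = {t: relevance(t, merged) for t in topic_titles}
    return max(scores, key=scores.get), scores
```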

Table 8 CPAMF relevance scores between the category “Bayesian statistics” and all the topics in the PTMS fragment of the taxonomy

Tables 8 and 9 present two such cases: the CPAMF relevance scores between a specific Wikipedia category and all the topics rooted at PTMS (Table 8) and at NM (Table 9). The topics are presented in ascending order of the CPAMF score, so that the category is assigned to the last one.

Table 9 CPAMF relevance scores between the category “Algorithms for solving SLE” and all the topics in the NM fragment of the taxonomy

2.6 Decision on Wikipedia Subcategories

The categories that are more relevant to their parental categories than to the taxonomy topic under consideration remain as intermediate layers in the new taxonomy: their offspring are the titles of the relevant articles.

Table 10 Examples of categories that form intermediate layers

According to the data in Table 10, the first three subcategories of the category "Random processes", namely Markov processes, Martingale theory, and Monte Carlo methods, are more relevant to their parent in Wikipedia than to the topic in our tree, whereas the other three are closer to the topic, so they go immediately under the topic. Therefore we obtain a subtree in our taxonomy rooted at "Random processes and fields". The root has four children: Random processes, Stochastic models, Queueing theory, and Noise. Of these, the first one, Random processes, has three children of its own: Markov processes, Martingale theory, and Monte Carlo methods.
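The decision rule thus reduces to comparing two scores per subcategory; here is a minimal sketch of it, under the same assumptions as the previous sketches.

```python
def place_subcategory(subcat_articles, parent_title, topic_title, relevance):
    """Sect. 2.6 rule: keep a subcategory under its Wikipedia parent as an
    intermediate layer, or re-attach it directly under the taxonomy topic."""
    merged = " ".join(subcat_articles.values())
    if relevance(parent_title, merged) > relevance(topic_title, merged):
        return "under_parent"  # stays as an intermediate layer
    return "under_topic"
```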

2.7 Use Wikipedia Articles in each Added Category Node as its Children

If a Wikipedia category is assigned to a taxonomy topic, all the articles left in it after cleaning are placed as new children descending from the topic. For example, the category "Monte Carlo methods" has 10 articles, listed in Table 11. As the table shows, four of the articles are deemed irrelevant. The relevant ones form the set of children of the category.

Table 11 Relevant and irrelevant articles in the "Monte Carlo methods" category

The corresponding subtaxonomies are presented in Figs. 3 and 4.

Fig. 3
figure 3

Random processes and fields subtaxonomy

Fig. 4
figure 4

Monte Carlo methods subtaxonomy

2.8 Extract Keywords from Wikipedia Articles and Use them as Leaf Descriptors

A leaf taxonomy topic can be assigned a set of phrases falling under it, as is the case with ACM-CCS. To extract keywords and key phrases, we employ no sophisticated techniques, just taking the most frequent nouns and collocations, respectively. Of course, a key phrase is looked for as a grammar pattern, such as adjective \(+\) noun or noun \(+\) noun.

More specifically, we use a publicly available part-of-speech parser, such as [28] for texts in Russian, to label all words in a text with part-of-speech tags. After this we select phrases consisting of neighboring words tagged according to a prespecified pattern, like noun \(+\) noun or adjective \(+\) noun, count the number of their occurrences, and select those of the highest frequency. For example, for the leaf "Gibbs sampling" above, we obtained the most frequent terms and adjective \(+\) noun pairs shown in Table 12.

What is nice about them is that these are exactly the terms used in the lecture synopses in Mathematics.
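A minimal sketch of this extractor in Python, assuming the pymorphy2 morphological analyzer as the tagger (the paper cites [28] for its parser; pymorphy2 is our substitute here, and the regex tokenization is also our assumption).

```python
import re
from collections import Counter
import pymorphy2  # Russian morphological analyzer, used here as the POS tagger

morph = pymorphy2.MorphAnalyzer()

def leaf_descriptors(text, top=10):
    """Most frequent nouns and adjective+noun / noun+noun pairs (Sect. 2.8)."""
    words = re.findall(r"\w+", text.lower())
    tagged = [(w, morph.parse(w)[0].tag.POS) for w in words]
    nouns = Counter(w for w, pos in tagged if pos == "NOUN")
    pairs = Counter(
        f"{w1} {w2}"
        for (w1, p1), (w2, p2) in zip(tagged, tagged[1:])
        if p2 == "NOUN" and p1 in ("ADJF", "NOUN")  # adjective+noun or noun+noun
    )
    return nouns.most_common(top), pairs.most_common(top)
```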

3 CPAMF String-to-Text Relevance Score

The suffix tree is a data structure used for storing and searching strings of characters and their fragments [29]. In a sense, the suffix tree model is an alternative to the vector space model (VSM), arguably the most popular model for text representation [30]. When the suffix tree representation is used, the text is considered as a set of strings, where a string may be any semantically significant part of the text, like a word, a phrase, or even a whole sentence. An annotated suffix tree (AST) is a suffix tree whose nodes (not edges!) are annotated with the frequencies of the string fragments. An algorithm for the construction and usage of ASTs for spam filtering is described in [25]. Some other applications are described in [23, 24].

Table 12 Frequencies of keywords for leaf “Gibbs sampling”

In our applications we consider a Wikipedia article as the set of its three-word strings. The titles of the Wikipedia categories and articles are also considered as strings in the set. To estimate the relevance of a standalone string to a collection of strings, we build an AST for the set of strings and then find all the matches between the AST and fragments of the given string. For every match we compute a score as the average, over its characters, of the frequency of a character related to the frequency of its prefix. Then the total score is calculated as the average score of all the matches. Obviously, the final value has the flavor of a conditional probability and lies between 0 and 1. In contrast to the similarity measures used in [23–25], this one has a natural interpretation and, moreover, does not depend on the text length either explicitly or implicitly, as our experiments show. Let us describe the AST method in more detail.

According to the annotated suffix tree model [23–25], a text document is not a set of words or terms, but a set of so-called strings, sequences of characters arranged in the same order as they occur in the text. Each string is characterized by a real number: the greater the number, the more important the string is for the text. An annotated suffix tree (see Fig. 5) is a data structure used for computing and storing all fragments of the text and their frequencies. It is a rooted tree in which:

  • Every node corresponds to one character.

  • Every node is labeled by the frequency of the text fragment encoded by the path from the root to the node.

To build an AST, we split the text into relatively short strings of three words each and apply them consecutively, to ensure that the resulting AST has a relatively modest size. Our algorithm for constructing an AST [25] is a modification of the well-known algorithms for constructing suffix trees [24, 29]. The AST is built iteratively. For each string, its suffixes are added to the AST one by one, starting from an empty set representing the root. To add a suffix to the AST, first check whether there is already a match, that is, a path in the AST that encodes/reads the whole suffix or its prefix. If such a match exists, we add 1 to all the frequencies in the match and, if the match does not cover the whole suffix, append new nodes with frequency 1 to the last node of the match. If there is no match, we create a new chain of nodes from the root with frequencies 1.
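This construction is easy to state as running code. Below is a minimal Python sketch in which each node is a [frequency, children] pair keyed by character; the nested-dict layout is our assumption, while the update rule follows the description above.

```python
def add_suffix(root, suffix):
    """Add one suffix to the AST, bumping frequencies along the match."""
    node = root
    for ch in suffix:
        if ch in node:
            node[ch][0] += 1    # the path so far matches: add 1 to the frequency
        else:
            node[ch] = [1, {}]  # no match: start a new chain with frequency 1
        node = node[ch][1]

def build_ast(strings):
    """Build an AST for a set of strings, e.g. the 3-word strings of an article."""
    root = {}
    for s in strings:
        for i in range(len(s)):
            add_suffix(root, s[i:])  # add every suffix of every string
    return root

# Example: build_ast(["mining"]) yields the tree of Fig. 5.
```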

Fig. 5
figure 5

An AST for string “mining”

To use an AST for scoring string-to-text relevance, we first build an AST for the text under consideration. Next, we match the string against the AST to estimate the CPAMF relevance.

3.1 A Procedure for Computing String-to-Text CPAMF Relevance Score

Input: a string and the AST for a given text. Output: the CPAMF relevance score.

  1. The string is represented by the set of its suffixes, itself included;

  2. Every suffix is matched against the AST starting from the root. To estimate the match we use the average conditional probability of the next symbol:

    $$\begin{aligned} score(match(suffix,ast)) = \frac{1}{|suffix|}\sum \nolimits _{node \in match} \phi \left( \frac{f(node)}{f(parent(node))}\right) , \end{aligned}$$

    where \(f(node)\) is the frequency of the matching node, \(f(parent(node))\) is the frequency of its parent, and \(|suffix|\) is the length of the suffix;

  3. The relevance of the string is evaluated by averaging the scores over all suffixes:

    $$\begin{aligned} relevance(string,text) = \textit{SCORE}(string,ast) = \frac{\sum _{suffix} score(match(suffix,ast))}{|string|}, \end{aligned}$$

    where \(|string|\) is the length of the string.

Note that the score is found by applying a scaling function \(\phi \) to convert a match score into a relevance evaluation. There are three useful scaling functions, according to the experiments in [24] on using a similar method to categorize e-mails into the "spam" and "ham" categories:

  • Identity function: \(\phi (x) = x \)

  • Logit function:

    $$\begin{aligned} \phi (x) = \log \frac{x}{1-x} =\log x - \log (1-x) \end{aligned}$$
  • Root function \(\phi (x) = \sqrt{x}\)

We use the identity scaling function because it has an obvious meaning: it yields the conditional probability of characters averaged over matching fragments (CPAMF).
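For reference, the three scaling functions in Python; note that the logit diverges as its argument approaches 0 or 1, so a practical implementation would clip the argument (that caveat is ours, not the paper's).

```python
import math

def identity(x): return x                      # used in this paper (CPAMF)
def logit(x):    return math.log(x / (1 - x))  # undefined at x = 0 and x = 1
def root(x):     return math.sqrt(x)
```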

Consider an example illustrating the described method. Let us construct an AST for the string "mining". This string has six suffixes: "mining", "ining", "ning", "ing", "ng", and "g". We start with the first suffix and add it to the empty AST as a chain of nodes with frequencies equal to one. To add the next suffix, we need to check whether there is any match, i.e., whether there is a path in the AST starting at its root that encodes/reads a prefix of the suffix. Since there is no match between the existing nodes and the second suffix, we add it to the root as a chain of nodes with frequencies equal to one. We repeat this step until a match is found: a prefix of the fourth suffix, "ing", matches the second suffix "ining" in its first two letters, "in". Hence we add 1 to the frequency of each of these nodes and add a new child node "g" to the leaf node "n" (see Fig. 5). The next suffix, "ng", matches the third suffix, and we repeat the same actions: increase the frequencies of the matched nodes and add a new child node for the unmatched character. The last suffix does not match any path in the AST, so again we add it to the AST's root as a single node with frequency equal to one. Now let us calculate the relevance score of the string "dining" using the AST in Fig. 5. There are six suffixes of the string "dining": "dining", "ining", "ning", "ing", "ng", and "g". Each of them is aligned with an AST path starting from the root. The scorings of the suffixes are presented in Table 13.

Table 13 Computing the string “dining” score

We have used the identity scaling function to score all six suffixes of the string “dining”. Now, to get the final CPAMF relevance value we sum and average them:

$$\begin{aligned} relevance(dining, mining) = \frac{0+0.76+0.71+0.61+0.41+0.16}{6} = \frac{2.65}{6} \approx 0.44 \end{aligned}$$

Although "dining" differs from "mining" by just one character, the total score, 0.44, is substantially less than unity. This is not only because the trivial suffix "dining" contributes 0 to the sum, but also because the conditional probabilities get smaller for the shorter suffixes. When the similarity is even less pronounced, the score gets even smaller, because at step 2 of the CPAMF procedure we divide by the length of the suffix, not the length of the match. This makes the values of the CPAMF score comparable across strings and texts of various sizes.
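Here is a Python sketch of the scoring procedure, reusing build_ast and identity from the earlier sketches; it reproduces the worked example above. Treating the root "frequency" as the sum of its children's frequencies is our reading of the example, consistent with the values in Table 13.

```python
def suffix_score(root, suffix, phi=identity):
    """Step 2: average conditional probability of the next character."""
    node, parent_freq = root, sum(freq for freq, _ in root.values())
    total = 0.0
    for ch in suffix:
        if ch not in node:
            break                       # the match ends here
        freq, children = node[ch]
        total += phi(freq / parent_freq)
        node, parent_freq = children, freq
    return total / len(suffix)          # divide by suffix length, not match length

def relevance(string, root):
    """Step 3: average the suffix scores over all suffixes of the string."""
    return sum(suffix_score(root, string[i:])
               for i in range(len(string))) / len(string)

ast = build_ast(["mining"])
print(round(relevance("dining", ast), 2))  # prints 0.44, as computed above
```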

4 Results

For the PTMS taxonomy (see Fig. 6), the resulting tree has 6 layers, with the depth varying from 4 to 6. At the cleaning stage, 20 categories and 108 articles were removed from the Wikipedia category tree. The resulting NM taxonomy (see Fig. 7) is of a similar shape: it has 8 layers, with the depth varying from 4 to 8. Again, at the cleaning stage, 11 categories and 30 articles were removed.

Fig. 6
figure 6

A fragment of the refined PTMS taxonomy. Lower layers are shown

Fig. 7
figure 7

A fragment of the refined NM taxonomy. Higher layers are shown

Now we provide two illustrative examples of how the lower layers of the PTMS taxonomy and the higher layers of the NM taxonomy were refined. Specifically, according to Table 14, the category "Factor analysis" should be allocated to the taxonomy topic "Mathematical statistics", since the latter yields the highest score.

Table 14 The top three candidate taxonomy topics for “Factor analysis” allocation
Table 15 The top three taxonomy topics of the highest score for “Factor analysis” allocation

There are five articles left in the "Factor analysis" category after the cleaning procedures (see Table 15). The keywords and phrases extracted from these articles and used as leaf descriptors are presented in Fig. 6, in clouds.

There are three categories allocated to the taxonomy topic "Numerical algorithms" (see Fig. 7). Two of them ("Optimal control" and "Numerical linear algebra") contain three articles each, whereas the third one, "Numerical integration", contains four articles. The following numbers lead to this structure of the NM taxonomy: see Table 16 for the relevance values of the category-to-topic allocation and Table 17 for the articles satisfying the cleaning criteria.

There are several issues with each of the obtained taxonomy trees. First, the position of the topic "Decision trees" is misleading. According to our method, this topic should be placed under "Mathematical statistics" and thus be a sibling of the "Machine learning" topic. The reason is the low relevance of the string "Machine learning" to any of the four articles in the "Decision tree" category. Second, the category "Transformers/Transducers" ("Preobrazovateli" in Russian), which is counted as relevant to the parent category "Algorithm efficiency", is further subdivided into "Piezoelectrics", "Power sources", and "Sound emitters and detectors". These concepts have nothing to do with algorithms; they appear only because of the double meaning of the category title in Russian. Third, both taxonomies are stuffed with articles describing personalities, such as "Probability theorists" or "MIPT Lecturers". Hence more effective cleaning procedures, including filtering of articles according to their types, should be developed. Two fragments of the refined taxonomies are presented in Figs. 6 and 7. Also, let us recall that the subtree rooted at "Random processes and fields" turned out a bit unbalanced, since Wikipedia had no articles on random fields.

Table 16 Three categories allocated to the “Numerical algorithms” topic
Table 17 Articles relevant to the “Numerical algorithms” branch of NM taxonomy

To refine a taxonomy at a given topic, the AST method is applied five times in the process:

  • Twice to clean the Wikipedia category subtree of irrelevant articles;

  • To clean the category subtree of irrelevant categories;

  • To relate taxonomy topics to Wikipedia categories;

  • To distinguish between categories to be assigned to taxonomy topics and categories to remain children of their Wikipedia parents.

In the first three cases, an "irrelevance" threshold for the article or category title against the text should be specified. Our experiments show that the threshold of 0.2, which amounts to 1/3 of the maximum value, works well.

5 Conclusion

We have presented an approach to refining a taxonomy by using Wikipedia and its structure. Our contributions are: (a) the CPAMF string-to-text relevance measure; (b) using CPAMF for cleaning Wikipedia of irrelevant categories and articles; (c) using both Wikipedia articles and categories to add two layers at once to the topic under consideration; (d) supplying the leaves with descriptors. We think that the last item is important, as it can be seen as a further taxonomy refinement step, so that synopses of university courses can be meaningfully mapped to the taxonomy.

The presented implementation of the approach, using the CPAMF relevance scores, has both positive and negative sides. The positive relates to its relative independence of the language and its grammar; the negative, to the lack of tools for capturing synonymy and near-synonymy. Other issues may relate to the fact that Wikipedia can give a somewhat biased picture of the domain. Extending the method to cover synonymous words with little character-level coincidence should be one of the main subjects of further work. Another direction for further development is in devising more precise Wikipedia preprocessing and analysis procedures.