Comparing and Combining Two Approaches to Automated Subject Classification of Text

  • Koraljka Golub
  • Anders Ardö
  • Dunja Mladenić
  • Marko Grobelnik
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4172)

Abstract

A machine-learning approach and a string-matching approach to automated subject classification of text were compared with respect to performance, advantages and downsides. The former was based on an SVM algorithm, while the latter matched strings between a controlled vocabulary and the words in the text to be classified. The data collection consisted of a subset of Compendex, classified into six different classes. SVM on average outperformed the string-matching approach: our hypothesis that SVM yields better recall and string-matching better precision was confirmed for only one of the classes. Because the two approaches are complementary, we investigated different ways of combining them, based on combining their vocabularies. The results show that the original approaches, i.e. the machine-learning approach without background knowledge from the controlled vocabulary and the string-matching approach based on the controlled vocabulary, outperform approaches that use combinations of automatically and manually obtained terms. The reasons for these results need further investigation, including a larger data collection and combining the two approaches using predictions.
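
To make the string-matching idea and the precision/recall comparison concrete, here is a minimal sketch in Python. It is not the authors' implementation: the vocabulary, class labels, scoring rule (a plain count of term occurrences) and the function names `classify` and `precision_recall` are all assumptions made for illustration, and the terms are not taken from the Ei thesaurus.

```python
import re
from collections import Counter

# Hypothetical toy vocabulary mapping class labels to terms.
# These labels and terms are illustrative only, not from the Ei thesaurus.
VOCABULARY = {
    "civil_engineering": ["bridge", "concrete", "structural design"],
    "electrical_engineering": ["circuit", "transformer", "power supply"],
}


def classify(text, vocabulary=VOCABULARY, min_score=1):
    """Return the classes whose terms occur in the text, best match first.

    The score of a class is the total number of whole-word occurrences
    of its terms; classes scoring below min_score are dropped.
    """
    # Lowercase and collapse punctuation so multi-word terms can match.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    scores = Counter()
    for label, terms in vocabulary.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            scores[label] += len(re.findall(pattern, normalized))
    return [label for label, score in scores.most_common() if score >= min_score]


def precision_recall(predicted, actual):
    """Compute precision and recall of predicted classes against actual ones."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

For example, `classify("The concrete bridge uses a new structural design.")` matches three civil-engineering terms and no electrical-engineering terms, so only the first class is returned. A matcher of this kind can only assign a class when a vocabulary term literally appears in the text, which is one plausible reason to expect high precision but limited recall from such an approach.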

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Koraljka Golub (1)
  • Anders Ardö (1)
  • Dunja Mladenić (2)
  • Marko Grobelnik (2)
  1. KnowLib Research Group, Dept. of Information Technology, Lund University, Sweden
  2. J. Stefan Institute, Ljubljana, Slovenia
