A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization

  • Kang Hyuk Lee
  • Judy Kay
  • Byeong Ho Kang
  • Uwe Rosebrock
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2417)

Abstract

Two main research areas in statistical text categorization are similarity-based learning algorithms and associated thresholding strategies. The combination of these techniques significantly influences the overall performance of text categorization. After investigating two similarity-based classifiers (k-NN and Rocchio) and three common thresholding techniques (RCut, PCut, and SCut), we describe a new learning algorithm known as the keyword association network (KAN) and a new thresholding strategy (RinSCut) to improve performance over existing techniques. Extensive experiments have been conducted on the Reuters-21578 and 20-Newsgroups data sets. The experimental results show that our new approaches give better results for both micro-averaged F1 and macro-averaged F1 scores.
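To make the terms in the abstract concrete, the following is a minimal sketch of rank-based thresholding (RCut, as it is commonly defined: each document is assigned its t top-ranking categories) followed by micro-averaged and macro-averaged F1. It is not the authors' KAN or RinSCut, which the abstract does not specify. The function names, the toy similarity scores, and the category labels "a", "b", "c" are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def rcut(scores: Dict[str, float], t: int = 1) -> Set[str]:
    """Rank-based thresholding: keep the t highest-scoring categories of one document."""
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return set(ranked[:t])


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: List[Set[str]], pred: List[Set[str]], categories: List[str]
) -> Tuple[float, float]:
    """Return (micro-averaged F1, macro-averaged F1) over the given categories."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in categories}
    for g, p in zip(gold, pred):
        for c in categories:
            if c in p and c in g:
                counts[c]["tp"] += 1
            elif c in p:
                counts[c]["fp"] += 1
            elif c in g:
                counts[c]["fn"] += 1
    # Macro: average the per-category F1 scores (every category weighs the same).
    macro = sum(f1(**counts[c]) for c in categories) / len(categories)
    # Micro: pool the counts over all categories, then compute a single F1.
    micro = f1(
        sum(v["tp"] for v in counts.values()),
        sum(v["fp"] for v in counts.values()),
        sum(v["fn"] for v in counts.values()),
    )
    return micro, macro


# Toy data (hypothetical): per-document classifier scores and true labels.
categories = ["a", "b", "c"]
doc_scores = [
    {"a": 0.9, "b": 0.2, "c": 0.1},
    {"a": 0.6, "b": 0.5, "c": 0.1},
    {"a": 0.7, "b": 0.1, "c": 0.3},
    {"a": 0.2, "b": 0.1, "c": 0.8},
]
gold = [{"a"}, {"b"}, {"a"}, {"c"}]

pred = [rcut(s, t=1) for s in doc_scores]
micro, macro = micro_macro_f1(gold, pred, categories)
print(f"micro-F1={micro:.2f} macro-F1={macro:.2f}")  # micro-F1=0.75 macro-F1=0.60
```

In this toy run the rare category "b" is never predicted, which lowers the macro average (0.60) more than the micro average (0.75). This is why the two scores are usually reported side by side: the macro average is sensitive to performance on small categories, while the micro average is dominated by frequent ones.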

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Kang Hyuk Lee (1)
  • Judy Kay (1)
  • Byeong Ho Kang (2)
  • Uwe Rosebrock (2)
  1. School of Information Technologies, University of Sydney, Australia
  2. School of Computing, University of Tasmania, Hobart, Australia
