A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization

  • Kang Hyuk Lee
  • Judy Kay
  • Byeong Ho Kang
  • Uwe Rosebrock
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2417)

Abstract

Two main research areas in statistical text categorization are similarity-based learning algorithms and associated thresholding strategies. The combination of these techniques significantly influences the overall performance of text categorization. After investigating two similarity-based classifiers (k-NN and Rocchio) and three common thresholding techniques (RCut, PCut, and SCut), we describe a new learning algorithm known as the keyword association network (KAN) and a new thresholding strategy (RinSCut) to improve performance over existing techniques. Extensive experiments have been conducted on the Reuters-21578 and 20-Newsgroups data sets. The experimental results show that our new approaches give better results for both micro-averaged F1 and macro-averaged F1 scores.
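To make the evaluation terms in the abstract concrete, the following is a minimal sketch of rank-based thresholding (RCut, in its commonly described form: assign each document its t top-scoring categories) followed by micro-averaged and macro-averaged F1. It does not reproduce KAN or RinSCut, which the abstract does not specify. The function names, the category labels, and all scores are illustrative assumptions, not data from the paper.

```python
from typing import Dict, List, Set, Tuple


def rcut(scores: Dict[str, float], t: int = 1) -> Set[str]:
    """Rank-based thresholding: keep the t highest-scoring categories."""
    ranked = sorted(scores, key=lambda c: scores[c], reverse=True)
    return set(ranked[:t])


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: List[Set[str]], pred: List[Set[str]], categories: List[str]
) -> Tuple[float, float]:
    """Return (micro-averaged F1, macro-averaged F1) over the categories."""
    # Per-category contingency counts: [TP, FP, FN].
    counts = {c: [0, 0, 0] for c in categories}
    for g, p in zip(gold, pred):
        for c in categories:
            if c in p and c in g:
                counts[c][0] += 1
            elif c in p:
                counts[c][1] += 1
            elif c in g:
                counts[c][2] += 1
    # Macro: average the per-category F1 values (each category weighs equally).
    macro = sum(f1(*counts[c]) for c in categories) / len(categories)
    # Micro: pool the counts first, then compute one F1 (each decision weighs equally).
    tp = sum(v[0] for v in counts.values())
    fp = sum(v[1] for v in counts.values())
    fn = sum(v[2] for v in counts.values())
    return f1(tp, fp, fn), macro


# Toy classifier scores (made up for illustration) and gold labels.
categories = ["earn", "acq", "grain"]
doc_scores = [
    {"earn": 0.9, "acq": 0.2, "grain": 0.1},
    {"earn": 0.6, "acq": 0.5, "grain": 0.1},
    {"earn": 0.1, "acq": 0.8, "grain": 0.3},
    {"earn": 0.7, "acq": 0.1, "grain": 0.4},
]
gold = [{"earn"}, {"acq"}, {"acq"}, {"grain"}]

pred = [rcut(s, t=1) for s in doc_scores]
micro, macro = micro_macro_f1(gold, pred, categories)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
# prints: micro-F1 = 0.500, macro-F1 = 0.389
```

In this toy run the rare category ("grain") is never predicted, which pulls the macro average well below the micro average. That gap is why reporting both scores is informative when category frequencies are skewed.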

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Kang Hyuk Lee (1)
  • Judy Kay (1)
  • Byeong Ho Kang (2)
  • Uwe Rosebrock (2)
  1. School of Information Technologies, University of Sydney, Australia
  2. School of Computing, University of Tasmania, Hobart, Australia
