Information Retrieval, Volume 1, Issue 1–2, pp 69–90

An Evaluation of Statistical Approaches to Text Categorization

  • Yiming Yang

Abstract

This paper focuses on a comparative evaluation of a wide range of text categorization methods, including previously published results on the Reuters corpus and new results of additional experiments. A controlled study using three classifiers, kNN, LLSF and WORD, was conducted to examine the impact of configuration variations in five versions of Reuters on the observed performance of classifiers. Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature. Using the results evaluated on the other versions of Reuters, which exclude the unlabelled documents, the performance of twelve methods is compared directly or indirectly. For indirect comparisons, kNN, LLSF and WORD were used as baselines, since they were evaluated on all versions of Reuters that exclude the unlabelled documents. As a global observation, kNN, LLSF and a neural network method had the best performance; except for a Naive Bayes approach, the other learning algorithms also performed relatively well.
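
As a toy illustration of how unlabelled documents can distort evaluation results, the sketch below assumes (this is an assumption, not something the abstract states) that an unlabelled test document is scored as having no correct category, so any category a classifier assigns to it counts as a false positive. Micro-averaged precision and recall are used purely for illustration; the document IDs, category names and predictions are hypothetical and are not taken from Reuters.

```python
from typing import Dict, Set, Tuple


def micro_precision_recall(
    gold: Dict[str, Set[str]], predicted: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over (document, category) decisions."""
    tp = fp = fn = 0
    for doc_id, true_cats in gold.items():
        pred_cats = predicted.get(doc_id, set())
        tp += len(pred_cats & true_cats)  # assigned and correct
        fp += len(pred_cats - true_cats)  # assigned but not in the gold labels
        fn += len(true_cats - pred_cats)  # in the gold labels but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: three labelled test documents ...
labelled_gold = {"d1": {"grain"}, "d2": {"trade"}, "d3": {"grain", "trade"}}
# ... and two unlabelled ones, scored here as having no correct category (assumption).
unlabelled_gold = {"d4": set(), "d5": set()}

# Hypothetical classifier output for all five documents.
predictions = {
    "d1": {"grain"},
    "d2": {"trade"},
    "d3": {"grain"},
    "d4": {"trade"},
    "d5": {"grain"},
}

p_labelled, r_labelled = micro_precision_recall(labelled_gold, predictions)
p_all, r_all = micro_precision_recall({**labelled_gold, **unlabelled_gold}, predictions)

print(f"labelled only:       precision={p_labelled:.2f} recall={r_labelled:.2f}")
print(f"with unlabelled docs: precision={p_all:.2f} recall={r_all:.2f}")
```

In this toy setting, the same predictions score lower precision once the unlabelled documents are included, while recall is unchanged. Under the stated assumption, this is one way such documents could make results from different corpus versions hard to compare.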

Keywords: text categorization, statistical learning algorithms, comparative study, evaluation

Copyright information

© Kluwer Academic Publishers 1999

