Classification of Skewed and Homogenous Document Corpora with Class-Based and Corpus-Based Keywords
In this paper, we examine the performance of two keyword selection policies over standard document corpora of varying properties. In the corpus-based policy, a single set of keywords is selected globally for all classes, whereas in the class-based policy a distinct set of keywords is selected locally for each class. We use support vector machines (SVM) as the learning method and perform experiments with boolean and tf-idf weighting. Contrary to common belief, we show that using keywords instead of all words generally yields better performance, and that tf-idf weighting does not always outperform boolean weighting. Our results reveal that the corpus-based approach performs better with a large number of keywords, while the class-based approach performs better with a small number of keywords. In skewed datasets, class-based keyword selection consistently outperforms the corpus-based approach in terms of macro-averaged F-measure. In homogenous datasets, the performances of the class-based and corpus-based approaches are similar except when the number of keywords is small.
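The difference between the two selection policies can be illustrated with a minimal Python sketch. This is not the paper's implementation: the scoring function (plain document frequency), the function names, and the toy corpus are all assumptions made for illustration, and the abstract does not state which keyword scoring the authors used.

```python
from collections import Counter

def doc_freq(docs):
    """Count, for each term, the number of documents that contain it."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))
    return counts

def top_k(counts, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

def corpus_based_keywords(docs, k):
    """Corpus-based policy: one global keyword set shared by all classes."""
    return top_k(doc_freq(docs), k)

def class_based_keywords(docs, labels, k):
    """Class-based policy: a separate keyword set per class,
    scored only on the documents of that class."""
    by_class = {}
    for tokens, label in zip(docs, labels):
        by_class.setdefault(label, []).append(tokens)
    return {label: top_k(doc_freq(members), k)
            for label, members in by_class.items()}

# Hypothetical skewed corpus: three "grain" documents, one "gold" document.
docs = [
    ["wheat", "price", "export"],
    ["wheat", "harvest", "price"],
    ["wheat", "export", "tonnes"],
    ["gold", "ounce", "price"],
]
labels = ["grain", "grain", "grain", "gold"]

global_keywords = corpus_based_keywords(docs, 2)
local_keywords = class_based_keywords(docs, labels, 2)

print(global_keywords)  # ['price', 'wheat']
print(local_keywords)   # {'grain': ['wheat', 'export'], 'gold': ['gold', 'ounce']}
```

In this toy case the global set of two keywords contains no term specific to the rare "gold" class, while the class-based policy reserves keywords for it. This is one plausible intuition for why class-based selection might help macro-averaged F-measure on skewed data with few keywords, though the sketch itself does not demonstrate that result.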