Exploiting Category Information and Document Information to Improve Term Weighting for Text Categorization
Traditional tfidf-like term weighting schemes rely on a rough statistic, idf, as the term weighting factor. idf exploits neither the category information (category labels on documents) nor the intra-document information (the relative importance of a given term to a given document that contains it) available in the training data for a text categorization task. We present a more elaborate nonparametric probabilistic model that makes use of this information in the term weighting phase. We prove theoretically that idf is a rough approximation of the new term weighting factor. This work is preliminary and mainly aims to inspire further study on exploiting this information, but it already provides a moderate performance boost on three popular document collections.
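For orientation, the conventional baseline that the abstract calls rough can be sketched as follows. This is a minimal illustration of the textbook formulation idf(t) = log(N / df(t)) with raw term frequency, not the model proposed in the paper. The toy corpus and the function names `idf_weights` and `tfidf_vector` are assumptions made for the example.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(doc, idf):
    """Weight each term of one document by raw term frequency times idf."""
    tf = Counter(doc)
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


# Hypothetical toy corpus: two finance documents and one sports document.
docs = [
    ["stock", "market", "rises"],
    ["stock", "prices", "fall"],
    ["team", "wins", "match"],
]

idf = idf_weights(docs)
vec = tfidf_vector(docs[0], idf)
```

Note that `idf_weights` only sees document counts. It never looks at category labels, and it assigns a term the same factor in every document that contains it, which are the two kinds of information the abstract says idf leaves unused.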