Exploiting Category Information and Document Information to Improve Term Weighting for Text Categorization

  • Jingyang Li
  • Maosong Sun
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4394)


Traditional tfidf-like term weighting schemes rely on a rough statistic, idf, as the term weighting factor. idf does not exploit the category information (category labels on documents) or the intra-document information (the relative importance of a given term to a given document that contains it) available in the training data for a text categorization task. We present a more elaborate nonparametric probabilistic model that makes use of this information in the term weighting phase. We prove theoretically that idf is a rough approximation of the new term weighting factor. This work is preliminary and aims mainly to inspire further study of how to exploit this information, but it already provides a moderate performance boost on three popular document collections.
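For orientation, the traditional baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration of one common tfidf variant (raw term frequency times log(N / df)), not the model proposed in the paper; the function name `tfidf_weights` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses the common variant tf * log(N / df); other tfidf-like
    schemes differ in smoothing and normalization.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights

# Hypothetical toy corpus of three tokenized documents.
docs = [
    ["wheat", "price", "wheat"],
    ["oil", "price"],
    ["oil", "export"],
]
weights = tfidf_weights(docs)
```

Note that the idf factor here depends only on how many documents contain a term. It is the same whichever categories those documents belong to, which is the limitation the abstract points at.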





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Jingyang Li (1)
  • Maosong Sun (1)
  1. National Laboratory of Intelligent Technology and Systems, Dept. of Computer Sci. & Tech., Tsinghua University, Beijing 100084, China
