Distributional Features for Text Categorization

  • Xiao-Bing Xue
  • Zhi-Hua Zhou
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4212)


In previous research on text categorization, a word is usually described by features that express whether the word appears in a document or how frequently it appears. Although these features are useful, they do not fully capture the information contained in the document. In this paper, distributional features, which express how a word is distributed within a document, are used to describe a word. Specifically, the compactness of the word's appearances and the position of its first appearance are characterized as features. These features are exploited through a TFIDF-style equation. Experiments show that the distributional features are useful for text categorization. Compared with using the traditional term frequency features alone, including the distributional features requires only a little additional cost, while categorization performance can be significantly improved.
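To make the two kinds of distributional features concrete, the following minimal sketch computes them for every word in a single document. The abstract does not define the measures precisely, so everything here is an assumption made for illustration: the document is taken to be pre-split into parts (for example sentences), the first appearance is taken to be the index of the first part containing the word, and compactness is approximated by two hypothetical quantities, the distance between the first and last parts containing the word (`span`) and the number of parts in which it appears (`parts_hit`). The function name and dictionary keys are likewise hypothetical, and the TFIDF-style weighting mentioned above is not reproduced.

```python
from collections import defaultdict


def distributional_features(parts):
    """Compute illustrative distributional features for one document.

    parts: list of token lists, one per document part (e.g. one per sentence).
    Returns a dict mapping each word to its term frequency, the index of the
    part where it first appears, and two assumed compactness measures.
    """
    # per_part_counts[word][i] = occurrences of `word` in part i
    per_part_counts = defaultdict(lambda: [0] * len(parts))
    for i, tokens in enumerate(parts):
        for tok in tokens:
            per_part_counts[tok][i] += 1

    features = {}
    for word, counts in per_part_counts.items():
        # Indices of the parts in which the word occurs at least once.
        hit = [i for i, c in enumerate(counts) if c > 0]
        features[word] = {
            "tf": sum(counts),           # traditional term frequency
            "first_pos": hit[0],         # position of the first appearance
            "span": hit[-1] - hit[0],    # assumed compactness: first-to-last distance
            "parts_hit": len(hit),       # assumed compactness: number of parts touched
        }
    return features


# A toy document of three parts.
doc = [["svm", "text"], ["text", "kernel"], ["svm"]]
feats = distributional_features(doc)
```

In this toy document, "svm" and "text" have the same term frequency but different spans, which is the kind of information that frequency-only features cannot express.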


Keywords: Distributional Feature, Text Categorization, Term Frequency, Word Sense Disambiguation, Usenet Newsgroup



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Xiao-Bing Xue (1)
  • Zhi-Hua Zhou (1)
  1. National Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
