Abstract
Previous research on text categorization usually describes a word by features that express whether the word appears in a document or how frequently it appears. Although these features are useful, they do not fully express the information contained in the document. In this paper, distributional features, which express how a word is distributed within a document, are used to describe the word. Specifically, the compactness of the word's appearances and the position of its first appearance are characterized as features. These features are exploited by a TFIDF-style equation. Experiments show that the distributional features are useful for text categorization. Compared with using the traditional term frequency features alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved.
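The abstract does not give the exact definitions of the features or the TFIDF-style equation, so the following is only a minimal sketch of the idea under stated assumptions. The function names `distributional_features` and `tfidf_style_weight`, the normalization by document length, and the use of the first-to-last span as a compactness measure are all hypothetical choices made for illustration, not the paper's method.

```python
import math
from collections import defaultdict


def distributional_features(tokens):
    """Return {word: (count, first_pos, spread)} for one tokenized document.

    Assumed definitions (not taken from the paper):
      first_pos: index of the first occurrence divided by document length.
      spread:    (last index - first index) divided by document length;
                 a smaller value means the occurrences are more compact.
    """
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)

    n = len(tokens)
    features = {}
    for word, pos in positions.items():
        features[word] = (len(pos), pos[0] / n, (pos[-1] - pos[0]) / n)
    return features


def tfidf_style_weight(value, n_docs, doc_freq):
    """Scale a per-document feature value by an IDF factor (assumed form)."""
    return value * math.log(n_docs / doc_freq)


if __name__ == "__main__":
    doc = "a b a c".split()
    for word, feats in sorted(distributional_features(doc).items()):
        print(word, feats)
```

In this sketch the term count plays the role of the traditional frequency feature, while `first_pos` and `spread` stand in for the two distributional features named in the abstract. Any of the three values could be passed to `tfidf_style_weight` in place of the raw term frequency.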
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xue, XB., Zhou, ZH. (2006). Distributional Features for Text Categorization. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds) Machine Learning: ECML 2006. Lecture Notes in Computer Science, vol 4212. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11871842_47
DOI: https://doi.org/10.1007/11871842_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45375-8
Online ISBN: 978-3-540-46056-5
eBook Packages: Computer Science (R0)