Three New Feature Weighting Methods for Text Categorization
Feature weighting, which computes a weight for each feature of a document, is an important phase of text categorization. This paper proposes three new feature weighting methods for text categorization. In the first and second methods, the traditional feature weighting method tf×idf is combined with "one-side" feature selection metrics (odds ratio and correlation coefficient) in a moderate manner, and positive and negative features are weighted separately. The resulting schemes, tf×idf+OR and tf×idf+CC, are used to calculate the feature weights. In the third method, which is effective and concise, tf is combined with feature entropy. The feature entropy measures the diversity of a feature's document frequency across categories. Experimental results on the Reuters-21578 corpus show that the proposed methods outperform several state-of-the-art feature weighting methods, such as tf×idf, tf×CHI, and tf×OR.
Keywords: feature weight, feature selection, text categorization
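The abstract describes the third method only in words, so the following is a minimal sketch of one way tf could be combined with a feature entropy. It is not the paper's formula. The entropy here is the Shannon entropy of a feature's document frequency distribution over categories. The combination `tf * (1 - H / log2(|C|))` and the function names `feature_entropy` and `tf_entropy_weight` are assumptions made for illustration.

```python
import math


def feature_entropy(df_per_category):
    """Shannon entropy (in bits) of a feature's document frequency over categories.

    df_per_category[i] is the number of documents in category i that contain
    the feature. A feature concentrated in one category has entropy 0; a
    feature spread evenly over all categories has the maximum entropy.
    """
    total = sum(df_per_category)
    if total == 0:
        return 0.0
    probs = [df / total for df in df_per_category if df > 0]
    return -sum(p * math.log2(p) for p in probs)


def tf_entropy_weight(tf, df_per_category):
    """Hypothetical tf-times-entropy weight (an assumption, not the paper's formula).

    The term frequency is scaled by one minus the normalized entropy, so
    features that occur evenly across categories receive a weight near zero.
    """
    n_categories = len(df_per_category)
    if n_categories < 2:
        return float(tf)
    h = feature_entropy(df_per_category)
    return tf * (1.0 - h / math.log2(n_categories))


if __name__ == "__main__":
    # A feature found in only one of four categories keeps its full tf.
    print(tf_entropy_weight(3, [10, 0, 0, 0]))
    # A feature spread evenly over four categories is weighted down to zero.
    print(tf_entropy_weight(3, [5, 5, 5, 5]))
```

Under this assumed scheme, a low entropy marks a feature as category-specific and therefore more useful for discrimination, which matches the abstract's description of entropy as a measure of diversity across categories.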