Reducing Effects of Class Imbalance Distribution in Multi-class Text Categorization
In multi-class text classification, when the number of entities in each class is highly imbalanced, feature ranking methods usually perform poorly because the larger classes dominate the classifier and the smaller ones tend to be ignored. This research addresses the problem by separating the larger classes into several smaller subclasses according to their proximities using k-means clustering; all subclasses, rather than the main classes, are then used when computing the feature scoring measure. This cluster-based feature scoring method is proposed to reduce the influence of skewed class distributions. Compared with feature sets selected from the main classes and from ground-truth subclasses, the experimental results show that a feature set selected by the proposed method achieves a significant improvement in classifying an imbalanced corpus, the RCV1v2 dataset.
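The idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the number of clusters is chosen or which scoring measure is used, so the size threshold `max_size`, the farthest-first initialization of k-means, and the maximum chi-square score over classes are all assumptions made here for illustration, as are the function names.

```python
from collections import Counter


def sqdist(p, q):
    """Squared Euclidean distance between two dense vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster index per point.

    Assumption: deterministic farthest-first initialization, chosen only
    to keep the sketch reproducible.
    """
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sqdist(p, c) for c in centers))
        centers.append(list(far))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        assign = [min(range(k), key=lambda c: sqdist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def split_large_classes(X, y, max_size):
    """Relabel documents so classes larger than max_size become subclasses.

    Assumption: a class of n documents is split into ceil(n / max_size)
    clusters; small classes keep their original label.
    """
    sub = list(y)
    for label, n in Counter(y).items():
        if n <= max_size:
            continue
        idx = [i for i, lab in enumerate(y) if lab == label]
        k = -(-n // max_size)  # ceiling division
        for i, a in zip(idx, kmeans([X[i] for i in idx], k)):
            sub[i] = f"{label}/{a}"
    return sub


def chi2_max(X, labels):
    """Per-feature chi-square score, maximized over the given labels.

    Assumption: features are treated as binary (present if value > 0).
    """
    n = len(X)
    scores = []
    for j in range(len(X[0])):
        present = [x[j] > 0 for x in X]
        best = 0.0
        for c in set(labels):
            a = sum(1 for p, l in zip(present, labels) if p and l == c)
            b = sum(1 for p, l in zip(present, labels) if p and l != c)
            m = sum(1 for p, l in zip(present, labels) if not p and l == c)
            d = n - a - b - m
            denom = (a + m) * (b + d) * (a + b) * (m + d)
            if denom:
                best = max(best, n * (a * d - m * b) ** 2 / denom)
        scores.append(best)
    return scores


# Toy corpus: a large class "A" with two natural groups, a small class "B".
X = [[1, 0, 0, 1]] * 3 + [[0, 1, 0, 1]] * 3 + [[0, 0, 1, 1]] * 2
y = ["A"] * 6 + ["B"] * 2
sub_labels = split_large_classes(X, y, max_size=3)
print(chi2_max(X, y))           # scores computed against the main classes
print(chi2_max(X, sub_labels))  # scores computed against the subclasses
```

In this toy corpus, a feature that characterizes only one half of the large class is diluted when scored against the main class and scores higher once that half is treated as its own subclass, which is the effect the abstract attributes to the cluster-based scoring.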
Keywords: feature selection, ranking method, text categorization, class imbalance distribution