A New Approach of Feature Selection for Chinese Web Page Categorization
Feature selection is a key step in web page categorization: it directly affects both the accuracy and the efficiency of categorization. This paper proposes a new feature selection approach based on the Mutual Information algorithm. The approach takes into account features whose Mutual Information is negative and emphasizes the occurrence probabilities of features in different categories. It also improves web page preprocessing so that some useful features are preserved. Experiments show that the new feature selection method effectively improves categorization accuracy.
Keywords: Web page categorization, Preprocessing, Feature selection, Mutual Information
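The abstract does not give the paper's exact scoring formula, so the following is only a minimal sketch of the general idea. It assumes the standard document-frequency estimate of Mutual Information, MI(t, c) = log(P(t, c) / (P(t) P(c))), and ranks each term by its largest absolute MI over all categories, so that a strongly negative value also counts as evidence. The function names, the `smoothing` constant, and the absolute-value ranking rule are illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter, defaultdict


def mutual_information_scores(docs, labels, smoothing=1e-9):
    """Return {term: {category: MI(term, category)}} from document frequencies.

    docs is a list of token lists and labels the matching category names.
    The smoothing constant (an assumption) avoids log(0) when a term never
    occurs in a category, which is exactly the negative-MI case.
    """
    n_docs = len(docs)
    cat_count = Counter(labels)
    term_count = Counter()
    joint_count = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            term_count[term] += 1
            joint_count[term][label] += 1

    scores = {}
    for term, df in term_count.items():
        p_term = df / n_docs
        scores[term] = {}
        for cat, n_cat in cat_count.items():
            p_cat = n_cat / n_docs
            p_joint = joint_count[term][cat] / n_docs
            scores[term][cat] = math.log((p_joint + smoothing) / (p_term * p_cat))
    return scores


def select_features(scores, k):
    """Keep the k terms with the largest |MI| in any category (assumed rule)."""
    ranked = sorted(
        scores,
        key=lambda term: max(abs(v) for v in scores[term].values()),
        reverse=True,
    )
    return ranked[:k]
```

With this ranking, a term that is spread evenly across categories scores near zero and is dropped, while a term that is conspicuously absent from one category is kept even though its MI for that category is negative.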