A New Approach of Feature Selection for Chinese Web Page Categorization

  • Cunhe Li
  • Lina Zhu
  • Kangwei Liu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5370)


Feature selection is a key step in web page categorization and directly affects both the accuracy and the efficiency of categorization. This paper proposes a new feature selection approach based on the Mutual Information algorithm. It retains features whose Mutual Information is negative and emphasizes the occurrence probabilities of features in different categories. It also improves web page preprocessing so that useful features are preserved. Experiments show that the new feature selection method effectively improves categorization accuracy.
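
The abstract does not state the scoring formula, so the following is only a minimal sketch of Mutual Information feature selection in general, not the authors' method. It computes the standard pointwise Mutual Information between a term and a category from document counts, MI(t, c) = log(P(t | c) / P(t)), and ranks terms by the largest absolute score across categories. Using the absolute value is an assumption: it is one simple way to keep features whose Mutual Information is negative. The add-one smoothing and the function names `mutual_information` and `select_features` are likewise hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def mutual_information(docs, labels):
    """Return {(term, category): MI} using document-level counts.

    docs   -- list of token lists (already segmented)
    labels -- list of category names, one per document
    Add-one smoothing (an assumption) keeps the logarithm finite
    when a term never occurs in a category.
    """
    n_docs = len(docs)
    docs_per_cat = Counter(labels)      # number of documents in each category
    doc_freq = Counter()                # number of documents containing each term
    joint = defaultdict(int)            # documents of a category containing a term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            joint[(term, label)] += 1

    scores = {}
    for term, df in doc_freq.items():
        p_t = (df + 1) / (n_docs + 2)                     # smoothed P(t)
        for cat, n_cat in docs_per_cat.items():
            a = joint.get((term, cat), 0)
            p_t_given_c = (a + 1) / (n_cat + 2)           # smoothed P(t | c)
            scores[(term, cat)] = math.log(p_t_given_c / p_t)
    return scores


def select_features(docs, labels, k):
    """Return the k terms with the largest |MI| over all categories.

    Taking the absolute value (an assumption) lets strongly negative
    scores count as evidence too, instead of discarding them.
    """
    best = defaultdict(float)
    for (term, _cat), score in mutual_information(docs, labels).items():
        best[term] = max(best[term], abs(score))
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


if __name__ == "__main__":
    docs = [["football", "match"], ["football", "goal"],
            ["stock", "market"], ["stock", "price"]]
    labels = ["sports", "sports", "finance", "finance"]
    print(select_features(docs, labels, 2))
```

In this toy corpus a term such as "football" scores positively for the category it appears in and negatively for the one it never appears in; both signs carry information about category membership, which is the intuition behind not discarding negative values.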


Keywords: Web page categorization; Preprocessing; Feature selection; Mutual Information





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Cunhe Li (1)
  • Lina Zhu (1)
  • Kangwei Liu (1)
  1. School of Computer & Communication Engineering, China University of Petroleum, Dongying, China. Email: jelly_3@163.com
