Feature Selection Using Improved Mutual Information for Text Classification

  • Jana Novovičová
  • Antonín Malík
  • Pavel Pudil
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3138)

Abstract

A major characteristic of the text document classification problem is the extremely high dimensionality of text data. In this paper we present two algorithms for feature (word) selection for the purpose of text classification. We use sequential forward selection methods based on improved mutual information, introduced by Battiti [1] and by Kwak and Choi [6] for non-textual data. These feature evaluation functions take into account how features work together. We discuss the performance of these evaluation functions compared with information gain, which evaluates features individually. We present experimental results using a naive Bayes classifier based on the multinomial model on the Reuters data set. Finally, we analyze the experimental results from various perspectives, including F1-measure, precision and recall. Preliminary experimental results indicate the effectiveness of the proposed feature selection algorithms in a text classification problem.
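To make the idea of sequential forward selection with a mutual-information criterion concrete, the following is a minimal sketch in the spirit of Battiti's selector [1], not the algorithms of this paper. It assumes binary word-presence features and the commonly cited criterion "relevance to the class minus beta times the summed mutual information with already selected words". The function names, the `beta` parameter value, and the toy data are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def mifs_select(features, labels, k, beta=1.0):
    """Greedy forward selection with a MIFS-style score (illustrative only).

    features: dict mapping a word to its 0/1 presence list, one entry per document
    labels:   class label per document
    """
    relevance = {w: mutual_information(col, labels) for w, col in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        def score(w):
            # Penalize words that share information with already selected words.
            redundancy = sum(
                mutual_information(features[w], features[s]) for s in selected
            )
            return relevance[w] - beta * redundancy

        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy corpus: 8 documents, 3 candidate words.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "oil":   [1, 1, 1, 0, 0, 0, 0, 0],
    "crude": [1, 1, 0, 0, 0, 0, 0, 0],  # largely redundant with "oil"
    "bank":  [0, 0, 0, 1, 0, 0, 0, 0],  # weaker alone, but complementary
}

with_redundancy_penalty = mifs_select(features, labels, k=2, beta=1.0)
individual_ranking_only = mifs_select(features, labels, k=2, beta=0.0)
print(with_redundancy_penalty, individual_ranking_only)
```

With `beta=0.0` the redundancy term vanishes and the procedure reduces to ranking words by their individual mutual information with the class, which is the kind of per-feature evaluation that information gain performs. A positive `beta` lets a word that overlaps heavily with an already chosen word lose out to a less relevant but complementary one.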

Keywords

Feature Selection, Mutual Information, Information Gain, Vocabulary Size, Feature Subset Selection
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Battiti, R.: Using Mutual Information for Selecting Features in Supervised Neural Net Learning. IEEE Trans. Neural Networks 5, 537–550 (1994)
  2. Cover, T.M.: The Best Two Independent Measurements are not The Two Best. IEEE Trans. Systems, Man, and Cybernetics 4, 116–117 (1974)
  3. Forman, G.: An Experimental Study of Feature Selection Metrics for Text Categorization. Journal of Machine Learning Research 3, 1289–1305 (2003)
  4. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical Pattern Recognition: A Review. IEEE Trans. on Pattern Analysis and Machine Intelligence 22, 4–37 (2000)
  5. Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
  6. Kwak, N., Choi, C.: Improved Mutual Information Feature Selector for Neural Networks in Supervised Learning. In: Int. Joint Conf. on Neural Networks (IJCNN 1999), pp. 1313–1318 (1999)
  7. McCallum, A., Nigam, K.: A Comparison of Event Models for Naive Bayes Text Classification. In: Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, pp. 41–48 (1998)
  8. Mladenic, D., Grobelnik, M.: Feature Selection for Unbalanced Class Distribution and Naive Bayes. In: Proceedings of the Sixteenth International Conference on Machine Learning, pp. 258–267 (1999)
  9. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text Classification from Labelled and Unlabelled Documents Using EM. Machine Learning 39, 103–134 (2000)
  10. Yang, Y., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization. In: Proceedings of the 14th ICML 1997, pp. 412–420 (1997)
  11. Yang, Y.: An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval 1, 67–68 (1999)
  12. Yang, Y., Zhang, J., Kisiel, B.: A Scalability Analysis of Classifier in Text Categorization. In: Proceedings of the 26th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 96–103 (2003)

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Jana Novovičová (1, 2)
  • Antonín Malík (1, 3)
  • Pavel Pudil (1, 2)

  1. Institute of Information Theory and Automation, Department of Pattern Recognition, Academy of Sciences of the Czech Republic, Prague, Czech Republic
  2. Faculty of Management, The University of Economics, Prague, Czech Republic
  3. Faculty of Electrical Engineering, Czech Technical University, Prague, Czech Republic
