Oscillating Feature Subset Search Algorithm for Text Categorization

  • Jana Novovičová
  • Petr Somol
  • Pavel Pudil
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4225)

Abstract

A major characteristic of text document categorization problems is the extremely high dimensionality of text data. In this paper we explore the usability of the Oscillating Search algorithm for feature/word selection in text categorization. We propose to use the multiclass Bhattacharyya distance for the multinomial model as the global feature subset selection criterion for reducing the dimensionality of the bag-of-words vector representation of documents. Because it evaluates whole subsets, this criterion takes inter-feature relationships into consideration. We experimentally compare three subset selection procedures: the commonly used best individual feature selection based on information gain, best individual feature selection based on the individual Bhattacharyya distance, and the Oscillating Search maximizing the Bhattacharyya distance on groups of features. The resulting feature subsets are then tested on the standard Reuters data with two classifiers: the multinomial Bayes classifier and the linear SVM. The experimental results illustrate that using a non-trivial feature selection algorithm is not only computationally feasible, but also brings a substantial improvement in classification accuracy over traditional methods based on individual feature evaluation.
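
The subset-based selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy class-conditional word probabilities, the choice to renormalize each class distribution over the selected words, and the simplified swing schedule are all assumptions made for illustration. The criterion is a prior-weighted sum of pairwise Bhattacharyya distances between class word distributions, and the search alternates "remove then add" and "add then remove" swings around a fixed subset size.

```python
import itertools
import math


def bhattacharyya(p, q, subset):
    """Bhattacharyya distance between two word distributions restricted to
    `subset` and renormalized over it (the renormalization is an assumption)."""
    idx = sorted(subset)  # fixed order keeps the floating-point sum deterministic
    sp = sum(p[i] for i in idx)
    sq = sum(q[i] for i in idx)
    if sp == 0.0 or sq == 0.0:
        return 0.0
    coeff = sum(math.sqrt((p[i] / sp) * (q[i] / sq)) for i in idx)
    # Clamp so the logarithm stays finite and the distance stays non-negative.
    return -math.log(min(1.0, max(coeff, 1e-12)))


def criterion(subset, word_probs, priors):
    """Prior-weighted sum of pairwise class distances for a word subset."""
    total = 0.0
    for a, b in itertools.combinations(range(len(priors)), 2):
        total += priors[a] * priors[b] * bhattacharyya(
            word_probs[a], word_probs[b], subset)
    return total


def oscillating_search(n_features, d, score, max_depth=2):
    """Simplified oscillating search for a subset of size d (illustrative)."""

    def add(s):
        # Add the single feature whose inclusion gives the highest score.
        candidates = [f for f in range(n_features) if f not in s]
        return s | {max(candidates, key=lambda f: score(s | {f}))}

    def remove(s):
        # Remove the single feature whose exclusion gives the highest score.
        return s - {max(sorted(s), key=lambda f: score(s - {f}))}

    # Initial subset: d greedy forward steps.
    current = frozenset()
    for _ in range(d):
        current = add(current)

    depth = 1
    while depth <= min(max_depth, d, n_features - d):
        improved = False
        # Down-swing (remove, then add back) followed by up-swing (add, then remove).
        for first, second in ((remove, add), (add, remove)):
            trial = current
            for _ in range(depth):
                trial = first(trial)
            for _ in range(depth):
                trial = second(trial)
            if score(trial) > score(current):
                current, improved = trial, True
        # Reset the swing depth after an improvement, otherwise search deeper.
        depth = 1 if improved else depth + 1
    return current


# Toy data (hypothetical): 3 classes, 6 words, each row sums to 1.
word_probs = [
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.40, 0.30, 0.10, 0.10],
    [0.10, 0.10, 0.05, 0.05, 0.40, 0.30],
]
priors = [0.5, 0.3, 0.2]


def score(subset):
    return criterion(subset, word_probs, priors)


selected = oscillating_search(n_features=6, d=3, score=score)
print(sorted(selected), score(selected))
```

Because a swing is accepted only when it strictly improves the criterion, the returned subset never scores lower than the greedy initial subset. On a real vocabulary the per-step evaluation over all candidate words dominates the cost, so the swing depth limit (`max_depth` here) is the main knob trading search thoroughness against time.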

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Jana Novovičová (1, 2)
  • Petr Somol (1, 2)
  • Pavel Pudil (1, 2)
  1. Dept. of Pattern Recognition, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic
  2. Faculty of Management, Prague University of Economics, Czech Republic
