Abstract
A major difficulty in text categorization is the high dimensionality of the feature space. Feature selection is therefore often performed to improve both the efficiency and the effectiveness of classification. In this paper, we propose a feature selection method based on Testor Theory; the resulting selection criterion takes inter-feature relationships into account. We experimentally compared our method with the widely used information gain criterion using two well-known classification algorithms: k-nearest neighbour and Support Vector Machine. Two benchmark text collections were chosen as testbeds: Reuters-21578 and Reuters Corpus Volume 1 (RCV1-v2). We found that our method consistently outperformed information gain for both classifiers and both data collections, especially when aggressive feature selection was carried out.
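For readers unfamiliar with the baseline, the following is a minimal sketch of information gain used as a term-ranking criterion. It is not the authors' implementation: the set-of-terms document representation, the function names (`entropy`, `information_gain`, `select_top_k`) and the four toy documents are all assumptions made for illustration. It shows that information gain scores each term on its own, which is the limitation a criterion that accounts for inter-feature relationships aims to address.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Reduction in class entropy obtained by knowing whether `term` occurs.

    `docs` is a list of sets of terms (an assumed, simplified representation).
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_top_k(docs, labels, k):
    """Keep the k terms with the highest information gain (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy corpus: two "grain" documents and two "crude" documents.
docs = [{"wheat", "price"}, {"wheat", "export"}, {"oil", "price"}, {"oil", "export"}]
labels = ["grain", "grain", "crude", "crude"]

top_terms = select_top_k(docs, labels, 2)
print(top_terms)
```

In this toy corpus, "wheat" and "oil" each separate the two classes perfectly, while "price" and "export" carry no class information, so aggressive selection with k = 2 keeps only the first two. Each score is computed without reference to any other term.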
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pons-Porrata, A., Gil-García, R., Berlanga-Llavori, R. (2007). Using Typical Testors for Feature Selection in Text Categorization. In: Rueda, L., Mery, D., Kittler, J. (eds) Progress in Pattern Recognition, Image Analysis and Applications. CIARP 2007. Lecture Notes in Computer Science, vol 4756. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76725-1_67
DOI: https://doi.org/10.1007/978-3-540-76725-1_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76724-4
Online ISBN: 978-3-540-76725-1
eBook Packages: Computer Science, Computer Science (R0)