Using Typical Testors for Feature Selection in Text Categorization

Pons-Porrata, Aurora; Gil-García, Reynaldo; Berlanga-Llavori, Rafael

doi:10.1007/978-3-540-76725-1_67

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4756)

Included in the following conference series:

Iberoamerican Congress on Pattern Recognition

Abstract

A major difficulty in text categorization is the high dimensionality of the feature space. Feature selection is therefore often performed to increase both the efficiency and the effectiveness of classification. In this paper, we propose a feature selection method based on Testor Theory, a criterion that takes inter-feature relationships into account. We experimentally compared our method with the widely used information gain criterion using two well-known classification algorithms: k-nearest neighbour and Support Vector Machine. Two benchmark text collections were chosen as testbeds: Reuters-21578 and Reuters Corpus Volume 1 (RCV1-v2). We found that our method consistently outperformed information gain for both classifiers and both collections, especially when aggressive feature selection was carried out.
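The abstract does not define a typical testor, so the following is only a minimal sketch under the classical Boolean formulation from Testor Theory: a testor is a feature subset on which no two objects from different classes look identical, and a typical testor is a testor with no proper subset that is also a testor. The toy documents, the function names `is_testor` and `typical_testors`, and the brute-force search are assumptions made for illustration. They are not the authors' algorithm, data, or comparison criteria.

```python
from itertools import combinations

def is_testor(rows, labels, features):
    """Return True if no two rows from different classes agree on every
    feature in `features` (classical Boolean testor definition)."""
    for (a, label_a), (b, label_b) in combinations(list(zip(rows, labels)), 2):
        if label_a != label_b and all(a[f] == b[f] for f in features):
            return False
    return True

def typical_testors(rows, labels):
    """Brute-force search for all typical (irreducible) testors.

    Subsets are visited in order of increasing size, so any testor that
    contains an already-found typical testor is reducible and is skipped.
    Exponential in the number of features: suitable for toy data only.
    """
    n_features = len(rows[0])
    found = []
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            if any(set(t) <= set(subset) for t in found):
                continue  # contains a smaller testor, hence not typical
            if is_testor(rows, labels, subset):
                found.append(subset)
    return found

# Hypothetical toy collection: 4 documents, 3 binary term-presence features.
docs = [
    (1, 0, 0),  # class A
    (0, 1, 0),  # class A
    (1, 1, 1),  # class B
    (0, 0, 1),  # class B
]
labels = ["A", "A", "B", "B"]

result = typical_testors(docs, labels)
print(result)
```

In this toy collection, terms 0 and 1 each occur in exactly one document of each class, so a univariate score such as information gain would rate either of them as uninformative on its own. Taken together they still separate the two classes, which is the kind of inter-feature relationship the testor criterion captures.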

Author information

Authors and Affiliations

Center of Pattern Recognition and Data Mining, Universidad de Oriente, Santiago de Cuba, Cuba
Aurora Pons-Porrata & Reynaldo Gil-García
Computer Science, Universitat Jaume I, Castellón, Spain
Rafael Berlanga-Llavori

Editor information

Luis Rueda, Domingo Mery, Josef Kittler

About this paper

Cite this paper

Pons-Porrata, A., Gil-García, R., Berlanga-Llavori, R. (2007). Using Typical Testors for Feature Selection in Text Categorization. In: Rueda, L., Mery, D., Kittler, J. (eds) Progress in Pattern Recognition, Image Analysis and Applications. CIARP 2007. Lecture Notes in Computer Science, vol 4756. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76725-1_67

DOI: https://doi.org/10.1007/978-3-540-76725-1_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76724-4
Online ISBN: 978-3-540-76725-1
eBook Packages: Computer Science, Computer Science (R0)
