Abstract
This paper presents a supervised learning method for building categorization models of text documents. The method uses a suitable numerical measure of similarity between examples to find centroids that describe the different categories of examples. The centroids are not abstract or statistical models; they are built from fragments of the examples themselves. Centroid learning is based on a Genetic Algorithm for Texts (GAT). The categorization system infers one model per category by applying the genetic algorithm to the set of preclassified documents belonging to that category. The resulting models are the category centroids, which are then used to predict the category of a test document. The experimental results support the utility of this approach for classifying incoming documents.
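The overall scheme described above (evolve one centroid per category from its preclassified documents, then assign a test document to the category with the most similar centroid) can be illustrated with a minimal sketch. This is not the authors' GAT: the abstract does not specify the representation, the similarity measure, or the genetic operators, so everything below is an assumption made for illustration. Documents are treated as sets of terms, similarity is Jaccard overlap, and the names `jaccard`, `fitness`, `evolve_centroid`, and `classify`, along with all parameter values, are hypothetical.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two sets of terms (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fitness(candidate, docs):
    """Mean similarity of a candidate centroid to a category's documents."""
    return sum(jaccard(candidate, d) for d in docs) / len(docs)


def evolve_centroid(docs, rng, pop_size=20, generations=30, size=3):
    """Toy genetic algorithm: individuals are term sets drawn from the examples."""
    vocab = sorted(set().union(*docs))
    size = min(size, len(vocab))
    population = [frozenset(rng.sample(vocab, size)) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda c: fitness(c, docs), reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            pool = sorted(p1 | p2)
            # Crossover: the child takes terms from the union of both parents.
            child = set(rng.sample(pool, min(size, len(pool))))
            if rng.random() < 0.2:
                # Mutation: drop one term and add a random vocabulary term.
                child.discard(rng.choice(sorted(child)))
                child.add(rng.choice(vocab))
            children.append(frozenset(child))
        population = parents + children
    return max(population, key=lambda c: fitness(c, docs))


def classify(doc, centroids):
    """Predict the category whose centroid is most similar to the document."""
    return max(centroids, key=lambda cat: jaccard(doc, centroids[cat]))


# Made-up toy data, used only to exercise the sketch.
train = {
    "sports": [{"goal", "match", "team"}, {"team", "coach", "match"}],
    "finance": [{"stock", "market", "bank"}, {"bank", "loan", "market"}],
}
centroids = {cat: evolve_centroid(docs, random.Random(0)) for cat, docs in train.items()}
predicted = classify({"bank", "market", "rate"}, centroids)
```

Because each centroid is assembled only from terms that occur in its category's training documents, it stays interpretable as "bits of examples" rather than a statistical summary, which is the property the abstract emphasizes.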
Cite this article
Serrano, J.I., Castillo, M.D.d. Evolutionary learning of document categories. Inf Retrieval 10, 69–83 (2007). https://doi.org/10.1007/s10791-006-9012-6
DOI: https://doi.org/10.1007/s10791-006-9012-6