Sampling and Feature Selection in a Genetic Algorithm for Document Clustering

Casillas, Arantza; González de Lena, Mayte T.; Martínez, Raquel

doi:10.1007/978-3-540-24630-5_74

Arantza Casillas, Mayte T. González de Lena & Raquel Martínez

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2945)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

In this paper we describe a Genetic Algorithm (GA) for document clustering that includes a sampling technique to reduce computation time. The algorithm calculates an approximation of the optimum k value and finds the best grouping of the documents into these k clusters. We evaluate the algorithm on sets of documents returned by a search engine in response to a query. Two types of experiment are carried out to determine (1) how the genetic algorithm performs on a sample of documents and (2) which document features lead to the best clustering according to an external evaluation. On the one hand, our GA with sampling performs the clustering in a time that makes interaction with a search engine viable. On the other hand, representing the documents by means of entities leads to better results with our GA than representing them by lemmas only.
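The abstract does not give the chromosome encoding, the fitness function, or the genetic operators, so the following is only a minimal sketch of the general idea under stated assumptions: documents are already numeric feature vectors, a chromosome is a list of cluster labels, fitness is the Calinski-Harabasz index (a common criterion for choosing k, assumed here rather than taken from the abstract), and the approximate k is the number of distinct labels in the best chromosome. All function names and parameters (`sample_documents`, `evolve`, `max_k`, and so on) are hypothetical.

```python
import random


def sample_documents(vectors, fraction, seed=0):
    """Draw a random subset of the document vectors, without replacement."""
    rng = random.Random(seed)
    size = max(2, round(len(vectors) * fraction))
    return rng.sample(vectors, min(size, len(vectors)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(vectors, labels):
    """Between/within dispersion ratio; higher means a better partition.

    Assumed fitness function: the abstract does not name the one used.
    """
    clusters = {}
    for vec, lab in zip(vectors, labels):
        clusters.setdefault(lab, []).append(vec)
    n, k = len(vectors), len(clusters)
    if k < 2 or k >= n:
        return 0.0  # index is undefined for one cluster or all singletons
    overall = centroid(vectors)
    between = sum(len(m) * sq_dist(centroid(m), overall)
                  for m in clusters.values())
    within = 0.0
    for members in clusters.values():
        c = centroid(members)
        within += sum(sq_dist(v, c) for v in members)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def evolve(vectors, max_k=4, pop_size=20, generations=40, seed=0):
    """Toy GA: label-list chromosomes, truncation selection,
    one-point crossover, and a single point mutation per child."""
    rng = random.Random(seed)
    n = len(vectors)
    pop = [[rng.randrange(max_k) for _ in range(n)] for _ in range(pop_size)]

    def fitness(chrom):
        return calinski_harabasz(vectors, chrom)

    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # the best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            child[rng.randrange(n)] = rng.randrange(max_k)
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return best, len(set(best))  # labels and the approximate k they imply
```

A caller would first reduce the collection with `sample_documents(vectors, 0.5)` and then run `evolve` on the sample only, which is where the time saving would come from. How the remaining documents are assigned to the resulting clusters is not covered by this sketch.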

Author information

Authors and Affiliations

Dpt. Electricidad y Electrónica, Universidad del País Vasco
Arantza Casillas
Dpt. Informática, Estadística y Telemática, Universidad Rey Juan Carlos
Mayte T. González de Lena & Raquel Martínez

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

About this paper

Cite this paper

Casillas, A., González de Lena, M.T., Martínez, R. (2004). Sampling and Feature Selection in a Genetic Algorithm for Document Clustering. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_74

DOI: https://doi.org/10.1007/978-3-540-24630-5_74
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21006-1
Online ISBN: 978-3-540-24630-5
eBook Packages: Springer Book Archive
