Sampling and Feature Selection in a Genetic Algorithm for Document Clustering

  • Arantza Casillas
  • Mayte T. González de Lena
  • Raquel Martínez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2945)


In this paper we describe a Genetic Algorithm (GA) for document clustering that includes a sampling technique to reduce computation time. The algorithm approximates the optimum number of clusters k and searches for the best grouping of the documents into these k clusters. We evaluate it on sets of documents returned by a search engine in response to a query. Two types of experiment are carried out to determine (1) how the genetic algorithm performs when it works with a sample of the documents, and (2) which document features lead to the best clustering according to an external evaluation. First, our GA with sampling performs the clustering in a time that makes interaction with a search engine viable. Second, representing the documents by entities leads to better results than representing them by lemmas only.
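The abstract does not specify the chromosome encoding, the fitness function, or the genetic operators, so the following is only a minimal sketch of the general idea: draw a sample of the documents, let a GA search over both the number of clusters k and the cluster centres on that sample, then assign every document to its nearest centre. Everything in it is an assumption made for illustration: the function names, the parameters, the medoid-style chromosome (a list of sample indices), the use of the Calinski-Harabasz index as fitness, and the toy numeric vectors standing in for document features.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def assign(docs, centres):
    """Label each document with the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda c: sq_dist(d, centres[c]))
            for d in docs]

def fitness(sample, chromosome):
    """Calinski-Harabasz index of the partition induced by the chromosome.
    Assumed fitness; the abstract does not say which criterion is used."""
    k, n = len(chromosome), len(sample)
    labels = assign(sample, [sample[i] for i in chromosome])
    overall = centroid(sample)
    between = within = 0.0
    for c in range(k):
        members = [d for d, lab in zip(sample, labels) if lab == c]
        if not members:          # empty cluster: worst possible score
            return 0.0
        cen = centroid(members)
        between += len(members) * sq_dist(cen, overall)
        within += sum(sq_dist(m, cen) for m in members)
    return (between / (k - 1)) / (max(within, 1e-12) / (n - k))

def crossover(a, b, k_max, rng):
    """Child picks a random number of centres from the parents' union."""
    pool = sorted(set(a) | set(b))
    return rng.sample(pool, rng.randint(2, min(k_max, len(pool))))

def mutate(chromosome, n, k_max, rng):
    """Add, drop, or replace one centre, so k itself can change."""
    genes = list(chromosome)
    free = [i for i in range(n) if i not in genes]
    move = rng.choice(("add", "drop", "swap"))
    if move == "add" and len(genes) < k_max and free:
        genes.append(rng.choice(free))
    elif move == "drop" and len(genes) > 2:
        genes.remove(rng.choice(genes))
    elif free:
        genes[rng.randrange(len(genes))] = rng.choice(free)
    return genes

def cluster(docs, sample_size=30, k_max=6, pop_size=20, generations=30, seed=0):
    """Run the GA on a random sample, then label all documents.
    Requires at least three documents. Returns (k, labels)."""
    rng = random.Random(seed)
    sample = rng.sample(docs, min(sample_size, len(docs)))
    n = len(sample)
    k_max = min(k_max, n - 1)
    population = [rng.sample(range(n), rng.randint(2, k_max))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda ch: fitness(sample, ch), reverse=True)
        elite = population[: pop_size // 2]   # the better half survives unchanged
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite),
                                     k_max, rng), n, k_max, rng)
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    best = max(population, key=lambda ch: fitness(sample, ch))
    return len(best), assign(docs, [sample[i] for i in best])
```

Calling `cluster(vectors)` on a list of equal-length numeric vectors returns the estimated k and one label per document. Because fitness is evaluated only on the sample, the cost of each generation depends on `sample_size` rather than on the size of the full result set, which is the kind of saving the sampling step is meant to provide.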





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Arantza Casillas (1)
  • Mayte T. González de Lena (2)
  • Raquel Martínez (2)
  1. Dpt. Electricidad y Electrónica, Universidad del País Vasco
  2. Dpt. Informática, Estadística y Telemática, Universidad Rey Juan Carlos
