Document Clustering into an Unknown Number of Clusters Using a Genetic Algorithm

  • A. Casillas
  • M. T. González de Lena
  • R. Martínez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2807)


We present a genetic algorithm for document clustering. The algorithm approximates the optimal number of clusters k and searches for the best grouping of the documents into these k clusters. We evaluated it on sets of documents returned by search engine queries. The experiments show that, most of the time, our genetic algorithm obtains better fitness function values than the well-known Calinski and Harabasz stopping rule, and takes less time.
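For readers unfamiliar with the Calinski and Harabasz rule mentioned above, the following is a minimal sketch of the standard variance ratio criterion, CH(k) = [B / (k - 1)] / [W / (n - k)], where B is the between-cluster and W the within-cluster sum of squared distances. The abstract does not spell out the paper's fitness function or document representation, so the function name `calinski_harabasz`, the plain coordinate vectors, and the toy data are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence


def calinski_harabasz(points: Sequence[Sequence[float]],
                      labels: Sequence[int]) -> float:
    """Standard variance ratio criterion; higher means a better partition.

    Illustrative helper (not taken from the paper): `points` are numeric
    vectors and `labels[i]` is the cluster id assigned to `points[i]`.
    """
    n = len(points)
    clusters: Dict[int, List[Sequence[float]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the index is only defined for 2 <= k < n")

    dim = len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dim)]

    between = 0.0  # B: size-weighted squared distance of centroids to the overall mean
    within = 0.0   # W: squared distance of each point to its own centroid
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sum((c - o) ** 2 for c, o in zip(centroid, overall))
        within += sum(sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members)

    if within == 0.0:
        return float("inf")  # every point coincides with its centroid
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two tight groups far apart on the first coordinate.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = calinski_harabasz(toy_points, [0, 0, 1, 1])  # groups match the geometry
bad = calinski_harabasz(toy_points, [0, 1, 0, 1])   # groups cut across it
print(good, bad)
```

Used as a stopping rule, the index is computed for each candidate k and the k with the highest value is kept; on the toy data the partition that matches the geometry scores far higher than the one that cuts across it.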




  1. Calinski, T., Harabasz, J.: A Dendrite Method for Cluster Analysis. Communications in Statistics 3(1), 1–27 (1974)
  2. Chu, S.C., Roddick, J.F., Pan, J.S.: An Incremental Multi-Centroid, Multi-Run Sampling Scheme for k-medoids-based Algorithms-Extended Report. In: Proceedings of the Third International Conference on Data Mining Methods and Databases, Data Mining III, pp. 553–562 (2002)
  3. Estivill-Castro, V., Murray, A.T.: Spatial Clustering for Data Mining with Genetic Algorithms. In: Proceedings of the International ICSC Symposium on Engineering of Intelligent Systems, EIS 1998 (1998)
  4. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison Wesley Longman, Inc., Amsterdam (2002)
  5. Gordon, A.D.: Classification. Chapman & Hall/CRC (1999)
  6. Holland, J.H.: Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor (1975)
  7. Lucasius, C.B., Dane, A.D., Kateman, G.: On k-medoid clustering of large data sets with the aid of a Genetic Algorithm: background, feasibility and comparison. Analytica Chimica Acta 283(3), 647–669. Elsevier Science Publishers B.V., Amsterdam (1993)
  8. Makagonov, P., Alexandrov, M., Gelbukh, A.: Selection of typical documents in a document flow. In: Advances in Communications and Software Technologies, pp. 197–202. WSEAS Press (2002)
  9. Merz, P., Zell, A.: Clustering Gene Expression Profiles with Memetic Algorithms. In: Guervós, J.J.M., Adamidis, P.A., Beyer, H.-G., Fernández-Villacañas, J.-L., Schwefel, H.-P. (eds.) PPSN 2002. LNCS, vol. 2439, pp. 811–820. Springer, Heidelberg (2002)
  10. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer Comp., Heidelberg (1996)
  11. Milligan, G.W., Cooper, M.C.: An Examination of Procedures for Determining the Number of Clusters in a Data Set. Psychometrika 50(2), 159–179 (1985)
  12. Murthy, C.A., Chowdhury, N.: In Search of Optimal Clusters Using Genetic Algorithms. Pattern Recognition Letters 17(8), 825–832 (1996)
  13. Sarkar, M., Yegnanarayana, B., Khemani, D.: A clustering algorithm using an evolutionary programming-based approach. Pattern Recognition Letters 18, 975–986 (1997)

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • A. Casillas (1)
  • M. T. González de Lena (2)
  • R. Martínez (2)

  1. Dpt. Electricidad y Electrónica, Universidad del País Vasco
  2. Dpt. Informática, Estadística y Telemática, Universidad Rey Juan Carlos
