Cluster Analysis

  • Roberto Baragona
  • Francesco Battaglia
  • Irene Poli
Part of the Statistics and Computing book series (SCO)


Metaheuristic methods have often been applied to partitioning problems. On the one hand, this follows from the fact that heuristic methods have been applied to such problems since their earliest formulations. On the other hand, metaheuristics found a promising field of application because cluster analysis has two characteristic features that make it especially suitable for designing algorithms in this framework. First, the solution space is large and grows fast with the problem dimension. Second, the solutions form a discrete set that cannot be explored by gradient-based methods or by any other method that relies on the properties of analytic functions. Many algorithms based on genetic evolutionary computation have been proposed and have proved to be excellent solvers of partitioning problems. In this chapter we shall recall the usual classification of clustering algorithms and explain which class may be successfully handled by genetic evolutionary computation techniques. While most of the chapter is devoted to the crisp partition problem, the fuzzy partition problem will be discussed as well. Then, the theoretical framework offered by mixture distributions will be examined in relation to evolutionary computing estimation techniques. We will also give an account of the genetic algorithm-based approach to the CART technique for classification. Applications of genetic algorithms to clustering time series will be described. Finally, multiobjective clustering and its implementation in the genetic algorithms framework will be outlined. Some examples and comparisons will illustrate the evolutionary computing methods for cluster analysis.
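
To give a sense of how fast the solution space of a crisp partition problem grows, recall that the number of ways to split n labelled objects into exactly k non-empty clusters is the Stirling number of the second kind, S(n, k), which satisfies S(n, k) = k S(n-1, k) + S(n-1, k-1). The sketch below is only an illustration of that standard recurrence, not code from the chapter; the function names `stirling2` and `count_all_partitions` are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of partitions of n labelled objects into k non-empty clusters.

    Hypothetical helper for illustration; assumes n >= 0 and k >= 0.
    """
    if n == k:
        return 1          # includes S(0, 0) = 1: every object is alone
    if k == 0 or k > n:
        return 0          # no clusters for n > 0 objects, or too many clusters
    # Object n either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def count_all_partitions(n: int) -> int:
    """Total number of partitions of n objects (the Bell number)."""
    return sum(stirling2(n, k) for k in range(n + 1))


if __name__ == "__main__":
    # Show the growth of the search space for a fixed number of clusters.
    for n in (10, 20, 30, 40):
        print(n, stirling2(n, 3))
```

Even for a few dozen objects and a handful of clusters the count is far too large for exhaustive enumeration, which is consistent with the motivation given above for stochastic search over the discrete solution set.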


Fitness function · Optimal partition · Cluster validity · Fuzzy partition · Elitist strategy



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Roberto Baragona (corresponding author), Department of Communication and Social Research, Sapienza University of Rome, Rome, Italy
  • Francesco Battaglia, Department of Statistical Sciences, Sapienza University of Rome, Rome, Italy
  • Irene Poli, Department of Statistics, Ca’ Foscari University of Venice, Venice, Italy
