Estimating the number of clusters from distributional results of partitioning a given data set

  • U. Möller


When estimating the optimal number of clusters, C, of a given data set, one typically uses, for each candidate value of C, a single (final) result of the clustering algorithm. If distributional data of size T are used, these data come from T data sets obtained, e.g., by a bootstrapping technique. Here a new approach is introduced that uses distributional data generated by clustering the original data T times, in the framework of cost function optimization and cluster validity indices. Results of this method are reported for model data (100 realizations) and gene expression data. The probability of correctly estimating the number of clusters was often higher than in recently published results for several classical methods and a new statistical approach (Clest).
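The idea of clustering the same data T times for each candidate C can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, since the abstract names neither the clustering algorithm nor the validity index: k-means with k-means++-style seeding stands in for the cost function optimizer, the Davies-Bouldin index (lower is better) stands in for the validity index, and the median over the T trials is a hypothetical way of summarizing each distribution. The function names `kmeans_trial`, `davies_bouldin`, `index_distributions`, and `estimate_c` are likewise invented for this example.

```python
import numpy as np

def kmeans_trial(X, c, rng, n_iter=100):
    """One clustering trial: k-means++-style seeding, then Lloyd iterations.
    (Stand-in optimizer; not the algorithm used in the chapter.)"""
    centers = [X[rng.integers(len(X))]]
    for _ in range(c - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min([((X - m) ** 2).sum(axis=1) for m in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Keep the old center if a cluster happens to become empty.
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(c)])
        if np.allclose(new, centers):
            break
        centers = new
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    cost = d2.min(axis=1).sum()  # sum-of-squares cost of this trial
    return labels, centers, cost

def davies_bouldin(X, labels, centers):
    """Davies-Bouldin validity index over the non-empty clusters (assumed index)."""
    ks = np.unique(labels)
    if len(ks) < 2:
        return np.inf
    # Mean distance of each cluster's points to its center.
    s = np.array([np.linalg.norm(X[labels == k] - centers[k], axis=1).mean()
                  for k in ks])
    m = np.linalg.norm(centers[ks][:, None, :] - centers[ks][None, :, :], axis=2)
    np.fill_diagonal(m, np.inf)  # exclude self-comparisons
    return ((s[:, None] + s[None, :]) / m).max(axis=1).mean()

def index_distributions(X, candidates, T=20, seed=0):
    """Cluster the ORIGINAL data T times per candidate C (no resampling)
    and collect the T validity-index values for each C."""
    rng = np.random.default_rng(seed)
    dist = {}
    for c in candidates:
        values = []
        for _ in range(T):
            labels, centers, _cost = kmeans_trial(X, c, rng)
            values.append(davies_bouldin(X, labels, centers))
        dist[c] = np.array(values)
    return dist

def estimate_c(dist):
    """Hypothetical decision rule: the C whose index distribution has the lowest median."""
    return min(dist, key=lambda c: np.median(dist[c]))
```

Calling `index_distributions(X, candidates=[2, 3, 4, 5])` on an `(n, d)` array `X` returns one array of T index values per candidate, and `estimate_c` turns those distributions into a single estimate. The contrast with bootstrapping is that `X` itself is never resampled; the variation across trials comes only from the random starts of the optimizer.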


Keywords: Cluster Algorithm; Validity Index; Cluster Validity Index; Cluster Trial; Cost Function Optimization





Copyright information

© Springer-Verlag/Wien 2005

Authors and Affiliations

  • U. Möller (1)
  1. Bioinformatics - Pattern Recognition Group, Hans Knoll Institute for Natural Products Research, Jena, Germany
