Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads

Journal of Classification

Abstract

The issue of determining “the right number of clusters” in K-Means has attracted considerable interest, especially in recent years. Cluster intermix appears to be the factor that most affects clustering results. This paper proposes an experimental setting for comparing different approaches on data generated from Gaussian clusters with controlled parameters of between- and within-cluster spread to model cluster intermix. The setting allows centroid recovery to be evaluated alongside the conventional evaluation of cluster recovery. The subjects of our interest are two versions of the “intelligent” K-Means method, iK-Means, which find the “right” number of clusters by extracting “anomalous patterns” from the data one by one. We compare them with seven other methods, including Hartigan’s rule, the averaged Silhouette width and the Gap statistic, under different between- and within-cluster spread-shape conditions. Several consistent patterns emerge from our experiments; for example, Hartigan’s rule reproduces the right K best, yet it does not recover the clusters or their centroids well. This leads us to propose an adjusted version of iK-Means, which performs well in the current experimental setting.
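
To make the experimental idea concrete, the sketch below generates spherical Gaussian clusters with a controlled within-cluster spread and then estimates the number of clusters with Hartigan’s rule, one of the methods compared in the study: the rule picks the smallest K for which H(K) = (W_K / W_{K+1} - 1)(n - K - 1) falls to 10 or below, where W_K is the K-Means within-cluster sum of squares with K clusters. This is only an illustrative Python sketch assuming numpy and scikit-learn; the function names, parameter values and the simple spread model are ours, not the paper’s exact protocol, which generates Gaussian mixture data with NETLAB.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def gaussian_clusters(n_clusters=5, dim=8, points_per_cluster=200, within_spread=0.5):
    # Centroids drawn uniformly on [-1, 1]^dim; spherical Gaussian points around each,
    # with `within_spread` acting as the within-cluster standard deviation.
    centroids = rng.uniform(-1.0, 1.0, size=(n_clusters, dim))
    X = np.vstack([c + within_spread * rng.standard_normal((points_per_cluster, dim))
                   for c in centroids])
    labels = np.repeat(np.arange(n_clusters), points_per_cluster)
    return X, labels, centroids

def within_ss(X, k):
    # Total within-cluster sum of squares (the K-Means criterion) for k clusters.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

def hartigan_k(X, k_max=15, threshold=10.0):
    # Hartigan's rule: smallest K with (W_K / W_{K+1} - 1) * (n - K - 1) <= threshold.
    n = X.shape[0]
    w = {k: within_ss(X, k) for k in range(1, k_max + 2)}
    for k in range(1, k_max + 1):
        if (w[k] / w[k + 1] - 1.0) * (n - k - 1) <= threshold:
            return k
    return k_max

X, true_labels, _ = gaussian_clusters(within_spread=0.3)  # relatively compact clusters
print("Hartigan's-rule estimate of K:", hartigan_k(X))

As the within-cluster spread grows relative to the distances between centroids, the clusters intermix and estimates of this kind become less reliable; that intermix regime is exactly what the controlled spread parameters in the paper’s experiments are designed to probe.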

References

  • BANFIELD, J.D., and RAFTERY, A.E. (1993), “Model-based Gaussian and Non-Gaussian Clustering”, Biometrics, 49, 803–821.

  • BEL MUFTI, G., BERTRAND, P., and EL MOUBARKI, L. (2005), “Determining the Number of Groups from Measures of Cluster Stability”, in Proceedings of the International Symposium on Applied Stochastic Models and Data Analysis, pp. 404–412.

  • BENZECRI, J.P. (1992), Correspondence Analysis Handbook, New York: Marcel Dekker.

  • BOCK, H.-H. (2007), “Clustering Methods: A History of k-Means Algorithms”, in Selected Contributions in Data Analysis and Classification, eds. P. Brito, P. Bertrand, G. Cucumel, and F. De Carvalho, Heidelberg: Springer Verlag, pp. 161–172.

  • BRECKENRIDGE, J. (1989), “Replicating Cluster Analysis: Method, Consistency and Validity”, Multivariate Behavioral Research, 24, 147–61.

  • CALINSKI, T., and HARABASZ, J. (1974), “A Dendrite Method for Cluster Analysis”, Communications in Statistics, 3(1), 1–27.

  • CASILLAS, A., GONZALES DE LENA, M.T., and MARTINEZ, H. (2003), “Document Clustering into an Unknown Number of Clusters Using a Genetic Algorithm”, Text, Speech and Dialogue: 6th International Conference, Czech Republic, pp. 43–49.

  • CHAE, S.S., DUBIEN, J.L., and WARDE, W.D. (2006), “A Method of Predicting the Number of Clusters Using Rand’s Statistic”, Computational Statistics and Data Analysis, 50 (12), 3531–3546.

  • DIMITRIADOU, E., DOLNICAR, S., and WEINGESSEL, A. (2002), “An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets”, Psychometrika, 67(1), 137–160.

  • DUDA, R.O., and HART, P.E. (1973), Pattern Classification and Scene Analysis, New York: Wiley.

  • DUDOIT, S., and FRIDLYAND, J. (2002), “A Prediction-Based Resampling Method for Estimating the Number of Clusters in a Dataset”, Genome Biology, 3(7), research 0036.1–0036.21.

  • EFRON B., and TIBSHIRANI R. J. (1993), An Introduction to the Bootstrap, New York: Chapman and Hall.

  • FAYYAD, U.M., PIATETSKY-SHAPIRO, G., SMYTH, P., and UTHURUSAMY, R. (eds.) (1996), Advances in Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI Press/The MIT Press.

  • FENG, Y., and HAMERLY, G. (2006), “PG-Means: Learning the Number of Clusters in Data”, Advances in Neural Information Processing Systems, 19 (NIPS Proceeding), Cambridge MA: MIT Press, pp. 393–400.

  • FRALEY, C., and RAFTERY, A.E. (2002), “Model-based Clustering, Discriminant Analysis, and Density Estimation”, Journal of the American Statistical Association, 97 (458), 611–631.

  • GENERATION OF GAUSSIAN MIXTURE DISTRIBUTED DATA (2006), NETLAB neural network software, http://www.ncrg.aston.ac.uk/netlab.

  • HAND, D.J., and KRZANOWSKI, W.J. (2005), “Optimising k-means Clustering Results with Standard Software Packages”, Computational Statistics and Data Analysis, 49, 969–973.

  • HANSEN, P., and MLADENOVIC, N. (2001), “J-MEANS: A New Local Search Heuristic for Minimum Sum of Squares Clustering”, Pattern Recognition, 34, 405–413.

  • HARDY, A. (1996), “On the Number of Clusters”, Computational Statistics & Data Analysis, 23, 83–96.

  • HARTIGAN, J. A. (1975), Clustering Algorithms, New York: J. Wiley & Sons.

  • HUBERT, L.J., and ARABIE, P. (1985), “Comparing Partitions”, Journal of Classification, 2, 193–218.

  • HUBERT, L.J., and LEVIN, J.R. (1976), “A General Statistical Framework for Assessing Categorical Clustering in Free Recall”, Psychological Bulletin, 83, 1072–1080.

  • ISHIOKA, T. (2005), “An Expansion of X-Means for Automatically Determining the Optimal Number of Clusters”, in Proceedings of the International Conference on Computational Intelligence, Calgary, AB, Canada, pp. 91–96.

  • JAIN, A.K., and DUBES, R.C. (1988), Algorithms for Clustering Data, Englewood Cliffs NJ: Prentice Hall.

  • KAUFMAN, L., and ROUSSEEUW, P. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, New York: J. Wiley & Sons.

  • KRZANOWSKI, W., and LAI, Y. (1985), “A Criterion for Determining the Number of Groups in a Dataset Using Sum of Squares Clustering”, Biometrics, 44, 23–34.

  • KUNCHEVA, L.I., and VETROV, D. P. (2005), “Evaluation of Stability of K-Means Cluster Ensembles with Respect to Random Initialization”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1798–1808.

  • LEISCH, F. (2006), “A Toolbox for K-Centroids Cluster Analysis”, Computational Statistics and Data Analysis, 51, 526–544.

  • MAULIK, U., and BANDYOPADHYAY, S. (2000), “Genetic Algorithm-based Clustering Technique”, Pattern Recognition, 33, 1455–1465.

  • MCLACHLAN, G.J., and KHAN, N. (2004), “On a Resampling Approach for Tests on the Number of Clusters with Mixture Model-Based Clustering of Tissue Samples”, Journal of Multivariate Analysis, 90, 990–1005.

  • MCLACHLAN, G.J., and PEEL, D. (2000), Finite Mixture Models, New York: Wiley.

  • MACQUEEN, J. (1967), “Some Methods for Classification and Analysis of Multivariate Observations”, in Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. II, pp. 281–297.

  • MILLIGAN, G.W. (1981), “A Monte-Carlo Study of Thirty Internal Criterion Measures for Cluster Analysis”, Psychometrika, 46, 187–199.

  • MILLIGAN, G.W., and COOPER, M.C. (1985), “An Examination of Procedures for Determining the Number of Clusters in a Data Set”, Psychometrika, 50, 159–179.

  • MILLIGAN, G. W., and COOPER, M. C. (1988), “A Study of Standardization of Variables in Cluster Analysis”, Journal of Classification, 5, 181–204.

  • MINAEI-BIDGOLI, B., TOPCHY, A., and PUNCH, W.F. (2004), “A Comparison of Resampling Methods for Clustering Ensembles”, International Conference on Machine Learning; Models, Technologies and Application (MLMTA04), Las Vegas, Nevada, pp. 939–945.

  • MIRKIN, B. (1990), “Sequential Fitting Procedures for Linear Data Aggregation Model”, Journal of Classification, 7, 167–195.

  • MIRKIN, B. (1996), Mathematical Classification and Clustering, New York: Kluwer.

  • MIRKIN, B. (2005), Clustering for Data Mining: A Data Recovery Approach, Boca Raton FL: Chapman and Hall/CRC.

  • MONTI, S., TAMAYO, P., MESIROV, J., and GOLUB, T. (2003), “Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data”, Machine Learning, 52, 91–118.

  • MOJENA, R. (1977), “Hierarchical Grouping Methods and Stopping Rules: An Evaluation”, The Computer Journal, 20, 359–363.

  • MURTAGH, F., and RAFTERY, A.E. (1984), “Fitting Straight Lines to Point Patterns”, Pattern Recognition, 17, 479–483.

  • PELLEG, D., and MOORE, A. (2000), “X-means: Extending K-Means with Efficient Estimation of the Number of Clusters”, Proceedings of the 17th International Conference on Machine Learning, San Francisco: Morgan Kaufmann, pp. 727–734.

  • PENA, J.M., LOZANO, J.A., and LARRANAGA, P. (1999), “An Empirical Comparison of Four Initialization Methods for K-Means Algorithm”, Pattern Recognition Letters, 20(10), 1027–1040.

  • POLLARD, K.S., and VAN DER LAAN, M.J. (2002), “A Method to Identify Significant Clusters in Gene Expression Data”, U.C. Berkeley Division of Biostatistics Working Paper Series, p. 107.

  • SHEN, J., CHANG, S.I., LEE, E.S., DENG, Y., and BROWN, S.J. (2005), “Determination of Cluster Number in Clustering Microarray Data”, Applied Mathematics and Computation, 169, 1172–1185.

  • SPAETH, H. (1985), Cluster Dissection and Analysis, Chichester: Ellis Horwood.

  • STEINLEY, D. (2004), “Standardizing Variables in K-Means Clustering”, in Classification, Clustering, and Data Mining Applications, eds. D. Banks, L. House, F.R. McMorris, P. Arabie and W. Gaul, New York: Springer, pp. 53–60.

  • STEINLEY, D. (2006), “K-Means Clustering: A Half-Century Synthesis”, British Journal of Mathematical and Statistical Psychology, 59, 1–34.

  • STEINLEY, D., and BRUSCO M. (2007), “Initializing K-Means Batch Clustering: A Critical Evaluation of Several Techniques”, Journal of Classification, 24, 99–121.

  • STEINLEY, D., and HENSON, R. (2005), “OCLUS: An Analytic Method for Generating Clusters with Known Overlap”, Journal of Classification, 22, 221–250.

  • SUGAR, C.A., and JAMES, G.M. (2003), “Finding the Number of Clusters in a Data Set: An Information-Theoretic Approach”, Journal of the American Statistical Association, 98(463), 750–778.

  • TIBSHIRANI, R., WALTHER, G., and HASTIE, T. (2001), “Estimating the Number of Clusters in a Dataset via the Gap Statistics”, Journal of the Royal Statistical Society B, 63, 411–423.

  • TIPPING, M.E., and BISHOP, C.M. (1999), “Probabilistic Principal Component Analysis”, Journal of the Royal Statistical Society, Series B, 61, 611–622.

  • VAPNIK, V. (2006), Estimation of Dependences Based on Empirical Data (2nd ed.), Berlin: Springer Science+Business Media Inc.

  • WASITO, I., and MIRKIN, B. (2006), “Nearest Neighbours in Least-Squares Data Imputation Algorithms with Different Missing Patterns”, Computational Statistics & Data Analysis, 50, 926–949.

  • YEUNG, K. Y., and RUZZO, W. L. (2001), “Details of the Adjusted Rand Index and Clustering Algorithms”, Bioinformatics, 17, 763–774.

Author information

Corresponding author

Correspondence to Mark Ming-Tso Chiang.

Additional information

The authors express their gratitude to the anonymous referees whose multiple comments have been taken into account in our revisions of the paper.

About this article

Cite this article

Chiang, M.M.-T., Mirkin, B. Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads. J Classif 27, 3–40 (2010). https://doi.org/10.1007/s00357-010-9049-5
