Abstract
The issue of determining “the right number of clusters” in K-Means has attracted considerable interest, especially in recent years. Cluster intermix appears to be a major factor affecting the clustering results. This paper proposes an experimental setting for comparing different approaches on data generated from Gaussian clusters, with controlled parameters of between- and within-cluster spread to model cluster intermix. The setting allows centroid recovery to be evaluated on a par with the conventional evaluation of cluster recovery. The subjects of our interest are two versions of the “intelligent” K-Means method, iK-Means, that find the “right” number of clusters by extracting “anomalous patterns” from the data one by one. We compare them with seven other methods, including Hartigan’s rule, averaged Silhouette width and the Gap statistic, under different between- and within-cluster spread-shape conditions. The results of our experiments show several consistent patterns; for example, Hartigan’s rule reproduces the right K best, but not the clusters or their centroids. This leads us to propose an adjusted version of iK-Means, which performs well in the current experimental setting.
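The kind of setting described above can be illustrated with a minimal sketch; it is not the authors' generator or code. It assumes spherical Gaussian clusters whose centroids are drawn uniformly from a box (the hypothetical parameters `between` and `within` stand in for the between- and within-cluster spread), a plain K-Means with random restarts, and Hartigan's rule in its usual form: take the smallest K for which H(K) = (W_K / W_{K+1} - 1)(N - K - 1) does not exceed the conventional threshold of 10, where W_K is the within-cluster sum of squares at K clusters.

```python
import random


def generate_clusters(k, n_per_cluster, dim, between, within, seed=0):
    """Sample k spherical Gaussian clusters (a simplifying assumption).

    Centroids are uniform in [-between, between]^dim; each point adds
    N(0, within^2) noise to its centroid in every coordinate.
    """
    rng = random.Random(seed)
    centroids = [[rng.uniform(-between, between) for _ in range(dim)]
                 for _ in range(k)]
    points = [[c + rng.gauss(0.0, within) for c in centroids[j]]
              for j in range(k) for _ in range(n_per_cluster)]
    return points, centroids


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_wss(points, k, rng, iters=50):
    """Run K-Means once from random seeds and return W_k."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def wss_profile(points, k_max, restarts=5, seed=0):
    """Best W_k over several restarts, for k = 1..k_max."""
    rng = random.Random(seed)
    return [min(kmeans_wss(points, k, rng) for _ in range(restarts))
            for k in range(1, k_max + 1)]


def hartigan_choose_k(wss, n, threshold=10.0):
    """wss[i] is W_k for k = i + 1; return the smallest k with H(k) <= threshold.

    Returns None if no k in the range tried meets the threshold.
    """
    for k in range(1, len(wss)):
        h = (wss[k - 1] / wss[k] - 1.0) * (n - k - 1)
        if h <= threshold:
            return k
    return None


if __name__ == "__main__":
    pts, _ = generate_clusters(k=3, n_per_cluster=20, dim=2,
                               between=10.0, within=0.5, seed=1)
    profile = wss_profile(pts, k_max=6)
    print("W_k:", [round(w, 1) for w in profile])
    print("Hartigan's choice of K:", hartigan_choose_k(profile, len(pts)))
```

Raising `within` relative to `between` increases cluster intermix, so the same script can be rerun to see how the chosen K drifts from the generated one. The chosen K also depends on the sample size and dimension, so this toy run should not be read as reproducing the paper's results.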
Additional information
The authors express their gratitude to the anonymous referees whose multiple comments have been taken into account in our revisions of the paper.
About this article
Cite this article
Chiang, M.MT., Mirkin, B. Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads. J Classif 27, 3–40 (2010). https://doi.org/10.1007/s00357-010-9049-5
DOI: https://doi.org/10.1007/s00357-010-9049-5
Keywords
- K-Means clustering
- Number of clusters
- Anomalous pattern
- Hartigan’s rule
- Gap statistic