Journal of Classification, Volume 27, Issue 1, pp. 3–40

Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads

Abstract

The issue of determining “the right number of clusters” in K-Means has attracted considerable interest, especially in recent years. Cluster intermix appears to be the factor that most affects clustering results. This paper proposes an experimental setting for comparing different approaches on data generated from Gaussian clusters with controlled parameters of between- and within-cluster spread to model cluster intermix. The setting allows centroid recovery to be evaluated alongside the conventional evaluation of cluster recovery. The subjects of our interest are two versions of the “intelligent” K-Means method, iK-Means, which find the “right” number of clusters by extracting “anomalous patterns” from the data one by one. We compare them with seven other methods, including Hartigan’s rule, the averaged Silhouette width and the Gap statistic, under different between- and within-cluster spread-shape conditions. Several consistent patterns emerge from our experiments; for example, the right K is reproduced best by Hartigan’s rule, but the clusters and their centroids are not. This leads us to propose an adjusted version of iK-Means, which performs well in the current experimental setting.
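
Since the abstract only names the anomalous-pattern mechanism, the following sketch may help make it concrete. It is a simplified, illustrative reconstruction based on the general description of iK-Means in Mirkin (2005), not the authors' experimental code: the function name anomalous_patterns, the absence of data standardization, and the min_size discard threshold are assumptions of this sketch.

# Illustrative sketch only: parameter names, normalization and the discard
# threshold are assumptions, not the authors' exact specification.
import numpy as np

def anomalous_patterns(X, min_size=2, max_iter=100):
    """Extract "anomalous pattern" clusters one by one; their count estimates K."""
    reference = X.mean(axis=0)              # grand mean, kept fixed throughout
    remaining = X.copy()
    centroids = []
    while len(remaining) > 0:
        # Seed the pattern at the entity farthest from the reference point.
        d_ref = ((remaining - reference) ** 2).sum(axis=1)
        seed = np.argmax(d_ref)
        centroid = remaining[seed].copy()
        in_pattern = np.zeros(len(remaining), dtype=bool)
        in_pattern[seed] = True
        for _ in range(max_iter):
            # An entity joins the pattern if it is closer to the pattern
            # centroid than to the fixed reference point.
            closer = ((remaining - centroid) ** 2).sum(axis=1) < d_ref
            if not closer.any():
                break
            new_centroid = remaining[closer].mean(axis=0)
            if np.allclose(new_centroid, centroid):
                in_pattern = closer
                break
            centroid, in_pattern = new_centroid, closer
        if in_pattern.sum() >= min_size:
            centroids.append(centroid)      # very small patterns are discarded
        remaining = remaining[~in_pattern]
    return np.array(centroids)              # len(result) is the estimated K

In iK-Means, the surviving centroids, with K equal to their count, are then used to initialize a conventional K-Means run.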

Keywords

K-Means clustering; Number of clusters; Anomalous pattern; Hartigan’s rule; Gap statistic

References

  1. BANFIELD, J.D., and RAFTERY, A.E. (1993), “Model-based Gaussian and Non-Gaussian Clustering”, Biometrics, 49, 803–821.
  2. BEL MUFTI, G., BERTRAND, P., and EL MOUBARKI, L. (2005), “Determining the Number of Groups from Measures of Cluster Stability”, in Proceedings of the International Symposium on Applied Stochastic Models and Data Analysis, pp. 404–412.
  3. BENZECRI, J.P. (1992), Correspondence Analysis Handbook, New York: Marcel Dekker.
  4. BOCK, H.-H. (2007), “Clustering Methods: A History of k-Means Algorithms”, in Selected Contributions in Data Analysis and Classification, eds. P. Brito, P. Bertrand, G. Cucumel, and F. De Carvalho, Heidelberg: Springer Verlag, pp. 161–172.
  5. BRECKENRIDGE, J. (1989), “Replicating Cluster Analysis: Method, Consistency and Validity”, Multivariate Behavioral Research, 24, 147–161.
  6. CALINSKI, T., and HARABASZ, J. (1974), “A Dendrite Method for Cluster Analysis”, Communications in Statistics, 3(1), 1–27.
  7. CASILLAS, A., GONZALES DE LENA, M.T., and MARTINEZ, H. (2003), “Document Clustering into an Unknown Number of Clusters Using a Genetic Algorithm”, in Text, Speech and Dialogue: 6th International Conference, Czech Republic, pp. 43–49.
  8. CHAE, S.S., DUBIEN, J.L., and WARDE, W.D. (2006), “A Method of Predicting the Number of Clusters Using Rand’s Statistic”, Computational Statistics and Data Analysis, 50(12), 3531–3546.
  9. DIMITRIADOU, E., DOLNICAR, S., and WEINGASSEL, A. (2002), “An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets”, Psychometrika, 67(1), 137–160.
  10. DUDA, R.O., and HART, P.E. (1973), Pattern Classification and Scene Analysis, New York: Wiley.
  11. DUDOIT, S., and FRIDLYAND, J. (2002), “A Prediction-Based Resampling Method for Estimating the Number of Clusters in a Dataset”, Genome Biology, 3(7), research0036.1–0036.21.
  12. EFRON, B., and TIBSHIRANI, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman and Hall.
  13. FAYYAD, U.M., PIATETSKY-SHAPIRO, G., SMYTH, P., and UTHURUSAMY, R. (eds.) (1996), Advances in Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI Press/The MIT Press.
  14. FENG, Y., and HAMERLY, G. (2006), “PG-Means: Learning the Number of Clusters in Data”, in Advances in Neural Information Processing Systems 19 (NIPS Proceedings), Cambridge, MA: MIT Press, pp. 393–400.
  15. FRALEY, C., and RAFTERY, A.E. (2002), “Model-based Clustering, Discriminant Analysis, and Density Estimation”, Journal of the American Statistical Association, 97(458), 611–631.
  16. GENERATION OF GAUSSIAN MIXTURE DISTRIBUTED DATA (2006), NETLAB neural network software, http://www.ncrg.aston.ac.uk/netlab.
  17. HAND, D.J., and KRZANOWSKI, W.J. (2005), “Optimising k-means Clustering Results with Standard Software Packages”, Computational Statistics and Data Analysis, 49, 969–973.
  18. HANSEN, P., and MLADENOVIC, N. (2001), “J-MEANS: A New Local Search Heuristic for Minimum Sum of Squares Clustering”, Pattern Recognition, 34, 405–413.
  19. HARDY, A. (1996), “On the Number of Clusters”, Computational Statistics & Data Analysis, 23, 83–96.
  20. HARTIGAN, J.A. (1975), Clustering Algorithms, New York: J. Wiley & Sons.
  21. HUBERT, L.J., and ARABIE, P. (1985), “Comparing Partitions”, Journal of Classification, 2, 193–218.
  22. HUBERT, L.J., and LEVIN, J.R. (1976), “A General Statistical Framework for Assessing Categorical Clustering in Free Recall”, Psychological Bulletin, 83, 1072–1080.
  23. ISHIOKA, T. (2005), “An Expansion of X-Means for Automatically Determining the Optimal Number of Clusters”, in Proceedings of the International Conference on Computational Intelligence, Calgary, AB, Canada, pp. 91–96.
  24. JAIN, A.K., and DUBES, R.C. (1988), Algorithms for Clustering Data, Englewood Cliffs, NJ: Prentice Hall.
  25. KAUFMAN, L., and ROUSSEEUW, P. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, New York: J. Wiley & Sons.
  26. KRZANOWSKI, W., and LAI, Y. (1985), “A Criterion for Determining the Number of Groups in a Dataset Using Sum of Squares Clustering”, Biometrics, 44, 23–34.
  27. KUNCHEVA, L.I., and VETROV, D.P. (2005), “Evaluation of Stability of K-Means Cluster Ensembles with Respect to Random Initialization”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1798–1808.
  28. LEISCH, F. (2006), “A Toolbox for K-Centroids Cluster Analysis”, Computational Statistics and Data Analysis, 51, 526–544.
  29. MAULIK, U., and BANDYOPADHYAY, S. (2000), “Genetic Algorithm-based Clustering Technique”, Pattern Recognition, 33, 1455–1465.
  30. MCLACHLAN, G.J., and KHAN, N. (2004), “On a Resampling Approach for Tests on the Number of Clusters with Mixture Model-Based Clustering of Tissue Samples”, Journal of Multivariate Analysis, 90, 990–1005.
  31. MCLACHLAN, G.J., and PEEL, D. (2000), Finite Mixture Models, New York: Wiley.
  32. MACQUEEN, J. (1967), “Some Methods for Classification and Analysis of Multivariate Observations”, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. II, pp. 281–297.
  33. MILLIGAN, G.W. (1981), “A Monte Carlo Study of Thirty Internal Criterion Measures for Cluster Analysis”, Psychometrika, 46, 187–199.
  34. MILLIGAN, G.W., and COOPER, M.C. (1985), “An Examination of Procedures for Determining the Number of Clusters in a Data Set”, Psychometrika, 50, 159–179.
  35. MILLIGAN, G.W., and COOPER, M.C. (1988), “A Study of Standardization of Variables in Cluster Analysis”, Journal of Classification, 5, 181–204.
  36. MINAEI-BIDGOLI, B., TOPCHY, A., and PUNCH, W.F. (2004), “A Comparison of Resampling Methods for Clustering Ensembles”, in International Conference on Machine Learning: Models, Technologies and Applications (MLMTA04), Las Vegas, Nevada, pp. 939–945.
  37. MIRKIN, B. (1990), “Sequential Fitting Procedures for Linear Data Aggregation Model”, Journal of Classification, 7, 167–195.
  38. MIRKIN, B. (1996), Mathematical Classification and Clustering, New York: Kluwer.
  39. MIRKIN, B. (2005), Clustering for Data Mining: A Data Recovery Approach, Boca Raton, FL: Chapman and Hall/CRC.
  40. MONTI, S., TAMAYO, P., MESIROV, J., and GOLUB, T. (2003), “Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data”, Machine Learning, 52, 91–118.
  41. MOJENA, R. (1977), “Hierarchical Grouping Methods and Stopping Rules: An Evaluation”, The Computer Journal, 20, 359–363.
  42. MURTAGH, F., and RAFTERY, A.E. (1984), “Fitting Straight Lines to Point Patterns”, Pattern Recognition, 17, 479–483.
  43. PELLEG, D., and MOORE, A. (2000), “X-means: Extending K-Means with Efficient Estimation of the Number of Clusters”, in Proceedings of the 17th International Conference on Machine Learning, San Francisco: Morgan Kaufmann, pp. 727–734.
  44. PENA, J.M., LOZANO, J.A., and LARRANAGA, P. (1999), “An Empirical Comparison of Four Initialization Methods for K-Means Algorithm”, Pattern Recognition Letters, 20(10), 1027–1040.
  45. POLLARD, K.S., and VAN DER LAAN, M.J. (2002), “A Method to Identify Significant Clusters in Gene Expression Data”, U.C. Berkeley Division of Biostatistics Working Paper Series, p. 107.
  46. SHEN, J., CHANG, S.I., LEE, E.S., DENG, Y., and BROWN, S.J. (2005), “Determination of Cluster Number in Clustering Microarray Data”, Applied Mathematics and Computation, 169, 1172–1185.
  47. SPAETH, H. (1985), Cluster Dissection and Analysis, Chichester: Ellis Horwood.
  48. STEINLEY, D. (2004), “Standardizing Variables in K-Means Clustering”, in Classification, Clustering, and Data Mining Applications, eds. D. Banks, L. House, F.R. McMorris, P. Arabie, and W. Gaul, New York: Springer, pp. 53–60.
  49. STEINLEY, D. (2006), “K-Means Clustering: A Half-Century Synthesis”, British Journal of Mathematical and Statistical Psychology, 59, 1–34.
  50. STEINLEY, D., and BRUSCO, M. (2007), “Initializing K-Means Batch Clustering: A Critical Evaluation of Several Techniques”, Journal of Classification, 24, 99–121.
  51. STEINLEY, D., and HENSON, R. (2005), “OCLUS: An Analytic Method for Generating Clusters with Known Overlap”, Journal of Classification, 22, 221–250.
  52. SUGAR, C.A., and JAMES, G.M. (2003), “Finding the Number of Clusters in a Data Set: An Information-Theoretic Approach”, Journal of the American Statistical Association, 98(463), 750–778.
  53. TIBSHIRANI, R., WALTHER, G., and HASTIE, T. (2001), “Estimating the Number of Clusters in a Dataset via the Gap Statistic”, Journal of the Royal Statistical Society B, 63, 411–423.
  54. TIPPING, M.E., and BISHOP, C.M. (1999), “Probabilistic Principal Component Analysis”, Journal of the Royal Statistical Society, Series B, 61, 611–622.
  55. VAPNIK, V. (2006), Estimation of Dependences Based on Empirical Data (2nd ed.), Berlin: Springer Science+Business Media.
  56. WASITO, I., and MIRKIN, B. (2006), “Nearest Neighbours in Least-Squares Data Imputation Algorithms with Different Missing Patterns”, Computational Statistics & Data Analysis, 50, 926–949.
  57. YEUNG, K.Y., and RUZZO, W.L. (2001), “Details of the Adjusted Rand Index and Clustering Algorithms”, Bioinformatics, 17, 763–774.

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Department of Computer Science & Information Systems, Birkbeck University of London, London, UK
  2. State University - Higher School of Economics, Moscow, Russia
