Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads
The issue of determining “the right number of clusters” in K-Means has attracted considerable interest, especially in recent years. Cluster intermix appears to be the factor that most affects the clustering results. This paper proposes an experimental setting for comparing different approaches on data generated from Gaussian clusters, with controlled between- and within-cluster spread parameters to model cluster intermix. The setting allows centroid recovery to be evaluated alongside the conventional evaluation of cluster recovery. The subjects of our interest are two versions of the “intelligent” K-Means method, iK-Means, that find the “right” number of clusters by extracting “anomalous patterns” from the data one by one. We compare them with seven other methods, including Hartigan’s rule, averaged Silhouette width and Gap statistic, under different between- and within-cluster spread-shape conditions. There are several consistent patterns in the results of our experiments; for example, Hartigan’s rule reproduces the right K best, but it does not best reproduce the clusters or their centroids. This leads us to propose an adjusted version of iK-Means, which performs well in the current experimental setting.
Keywords: K-Means clustering; Number of clusters; Anomalous pattern; Hartigan’s rule; Gap statistic
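As an illustration of the kind of experiment the abstract describes, the following is a minimal sketch that generates spherical Gaussian clusters with adjustable between- and within-cluster spread, runs plain K-Means for a range of K, and applies Hartigan's rule. Everything here is an assumption for illustration rather than the paper's own setup: the generator `make_gaussian_clusters`, the helper names, and the form of the rule used, which is the commonly cited one, H(K) = (W_K / W_{K+1} - 1)(N - K - 1), stopping at the smallest K with H(K) <= 10, where W_K is the within-cluster sum of squares.

```python
import numpy as np


def make_gaussian_clusters(rng, k, n_per_cluster, dim, between_spread, within_spread):
    """Hypothetical generator: k spherical Gaussian clusters.

    Centres are uniform in [-between_spread, between_spread]^dim and each
    cluster has standard deviation within_spread (an illustrative
    simplification, not the paper's generator).
    """
    centers = rng.uniform(-between_spread, between_spread, size=(k, dim))
    X = np.vstack([c + within_spread * rng.standard_normal((n_per_cluster, dim))
                   for c in centers])
    return X, centers


def kmeans(X, k, rng, n_init=10, n_iter=100):
    """Plain Lloyd's K-Means with random restarts.

    Returns (W_k, centroids, labels) for the restart with the smallest
    within-cluster sum of squares W_k.
    """
    best = None
    for _ in range(n_init):
        centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(n_iter):
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # Keep the old centroid if a cluster becomes empty.
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        w = d[np.arange(len(X)), labels].sum()
        if best is None or w < best[0]:
            best = (w, centroids, labels)
    return best


def hartigan_choose_k(w, n, threshold=10.0):
    """w[i] is W_K for K = i + 1; n is the number of points.

    Returns the smallest K with H(K) <= threshold, or None if no K in the
    examined range qualifies. The threshold 10 is the conventional value,
    assumed here.
    """
    for i in range(len(w) - 1):
        k = i + 1
        h = (w[i] / w[i + 1] - 1.0) * (n - k - 1)
        if h <= threshold:
            return k
    return None


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, true_centers = make_gaussian_clusters(rng, k=3, n_per_cluster=100, dim=2,
                                             between_spread=10.0, within_spread=0.5)
    w = [kmeans(X, k, rng)[0] for k in range(1, 9)]
    print("K chosen by Hartigan's rule:", hartigan_choose_k(w, len(X)))
```

Raising `within_spread` relative to `between_spread` increases cluster intermix, which is the condition under which, according to the abstract, the methods begin to differ. Comparing the returned centroids with `true_centers` would give a rough analogue of the centroid-recovery evaluation.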