Abstract
Although K-means is widely used for general clustering, its performance depends strongly on the initial cluster centers, and the algorithm typically converges to one of many local minima. In this paper, a novel initialization scheme for selecting initial cluster centers for K-means clustering is proposed. The algorithm is based on reverse nearest neighbor (RNN) search, which retrieves all points in a given data set whose nearest neighbor is a given query point. The initial cluster centers computed with this method are found to lie very close to the desired cluster centers of iterative clustering algorithms. The procedure applies to clustering algorithms for continuous data, and its application to the K-means algorithm is demonstrated. Experiments on several popular data sets show the advantages of the proposed method.
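The RNN-based seeding idea from the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact algorithm: it assumes centers are chosen as the k points with the largest reverse-nearest-neighbor sets, and the function names `rnn_counts` and `rnn_init_centers` are hypothetical.

```python
import numpy as np

def rnn_counts(X):
    """For each point, count how many other points have it as their
    (single) nearest neighbor, i.e. the size of its RNN set."""
    # pairwise squared Euclidean distances
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)      # a point is not its own neighbor
    nn = d.argmin(axis=1)            # index of each point's nearest neighbor
    return np.bincount(nn, minlength=len(X))

def rnn_init_centers(X, k):
    """Pick the k points with the largest RNN sets as initial centers;
    such points tend to sit in locally dense regions."""
    counts = rnn_counts(X)
    idx = np.argsort(counts)[::-1][:k]
    return X[idx]
```

The returned points can then be passed to any standard K-means implementation as its initial centers in place of random seeding.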
Additional information
Foundation item: Supported by the National Natural Science Foundation of China (60503020, 60503033, 60703086), the Natural Science Foundation of Jiangsu Province (BK2006094), the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University (KJS0714) and the Research Foundation of Nanjing University of Posts and Telecommunications (NY207052, NY207082)
Cite this article
Xu, J., Xu, B., Zhang, W. et al. Stable initialization scheme for K-means clustering. Wuhan Univ. J. Nat. Sci. 14, 24–28 (2009). https://doi.org/10.1007/s11859-009-0106-z