On Strategies to Fix Degenerate k-means Solutions
k-means is a benchmark algorithm used in cluster analysis. It belongs to the large category of heuristics based on location-allocation steps, which alternately locate cluster centers and allocate data points to them until no further improvement is possible. Such heuristics are known to suffer from a phenomenon called degeneracy, in which some of the clusters are empty. In this paper, we propose and compare a series of strategies to circumvent degenerate solutions during a k-means execution. Our computational experiments show that these strategies are effective, leading to better clustering solutions in the vast majority of the cases in which degeneracy appears in k-means. Moreover, we compare the use of our fixing strategies within k-means against the use of two initialization methods found in the literature. These results demonstrate how useful the proposed strategies can be, especially inside memory-based clustering algorithms.
Keywords: k-means; Minimum sum-of-squares; Degeneracy; Clustering; Heuristics
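The location-allocation loop and an empty-cluster repair can be sketched as follows. This is a minimal illustration only: the repair rule shown (relocating an empty cluster's center to the point currently farthest from its assigned center) is one standard strategy from the literature, not necessarily one of the strategies evaluated in the paper, and the function name `kmeans_with_repair` is our own.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_with_repair(points, k, iters=20, seed=1):
    """Lloyd-style k-means with a simple fix for degenerate (empty)
    clusters: an empty cluster's center is relocated to the point
    farthest from its currently assigned center."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Allocation step: assign each point to its nearest center.
        labels = [min(range(k), key=lambda j: dist2(p, centers[j]))
                  for p in points]
        # Repair step: fix any degenerate (empty) cluster before
        # relocating centers, by stealing the worst-fitting point.
        for j in range(k):
            if j not in labels:
                far = max(range(len(points)),
                          key=lambda i: dist2(points[i], centers[labels[i]]))
                labels[far] = j
        # Location step: move each center to its cluster's mean
        # (keep the old center if the cluster is still empty).
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centers[j])
        centers = new_centers
    return centers, labels
```

Without the repair step, an empty cluster's center never moves again and the final solution effectively uses fewer than k clusters; the repair keeps all k clusters active at each iteration.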