Advertisement

Journal of Classification

, Volume 34, Issue 2, pp 165–190 | Cite as

On Strategies to Fix Degenerate k-means Solutions

  • Daniel Aloise
  • Nielsen Castelo Damasceno
  • Nenad Mladenović
  • Daniel Nobre Pinheiro
Article
  • 127 Downloads

Abstract

k-means is a benchmark algorithm used in cluster analysis. It belongs to the large category of heuristics based on location-allocation steps that alternately locate cluster centers and allocate data points to them until no further improvement is possible. Such heuristics are known to suffer from a phenomenon called degeneracy in which some of the clusters are empty. In this paper, we compare and propose a series of strategies to circumvent degenerate solutions during a k-means execution. Our computational experiments show that these strategies are effective, leading to better clustering solutions in the vast majority of the cases in which degeneracy appears in k-means. Moreover, we compare the use of our fixing strategies within k-means against the use of two initialization methods found in the literature. These results demonstrate how useful the proposed strategies can be, specially inside memorybased clustering algorithms.

Keywords

k-means Minimum sum-of-squares Degeneracy Clustering Heuristics 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. ALOISE, D., DESHPANDE, A., HANSEN, P., and POPAT, P. (2009), “NP-Hardness of Euclidean Sum-of-Squares Clustering”, Machine Learning, 75, 245–249.CrossRefGoogle Scholar
  2. ALOISE, D., HANSEN, P., and LIBERTI, L. (2012), “An Improved Column Generation Algorithm for Minimum Sum-of-Squares Clustering”, Mathematical Programming, 131(1-2), 195–220.MathSciNetCrossRefMATHGoogle Scholar
  3. ARTHUR, D., and VASSILVITSKII, S. (2007). “K-means++: The Advantages of Careful Seeding”, In 2007 ACM-SIAM Symposium on Discrete Algorithms (SODA’07), pp. 1027–1035.Google Scholar
  4. BLANCHARD, S.J., ALOISE, D., and DESARBO, W.S. (2012), “The Heterogeneous PMedian Problem for Categorization Based Clustering”, Psychometrika, 77(4), 741–762.MathSciNetCrossRefMATHGoogle Scholar
  5. BRADLEY, P.S., and FAYYAD, U.M. (1998), “Refining Initial Points for k-Means Clustering”, in International Conference on Machine Learning (ICML), Vol. 98, pp. 91–99.Google Scholar
  6. BRIMBERG, J., and MLADENOVIĆ, N. (1999), “Degeneracy in the Multi-Source Weber Problem”, Mathematical Programming, 85(1), 213–220.MathSciNetCrossRefMATHGoogle Scholar
  7. BRUSCO, M.J., and STEINLEY, D. (2007), “A Comparison of Heuristic Procedures for MinimumWithin-Cluster Sums of Squares Partitioning”, Psychometrika, 72(4), 583–600.MathSciNetCrossRefMATHGoogle Scholar
  8. CARRIZOSA, E., ALGUWAIZANI, A., HANSEN, P., and MLADENOVIĆ, N. (2015), “New Heuristic for Harmonic Means Clustering”, Journal of Global Optimization, 63, 427–443.MathSciNetCrossRefMATHGoogle Scholar
  9. CHOROMANSKA, A., and MONTELEONI, C. (2012), “Online Clustering with Experts”, in International Conference on Artificial Intelligence and Statistics, pp. 227–235.Google Scholar
  10. COOPER, L. (1964), “Heuristic Methods for Location-Allocation Problems”, Siam Review, 6(1), 37–53.MathSciNetCrossRefMATHGoogle Scholar
  11. DING, Y., ZHAO, Y., SHEN, X., MUSUVATHI, M., and MYTKOWICZ, T. (2015), “Yinyang k-means: A Drop-in Replacement of the Classic k-Means with Consistent Speedup”, in 32nd International Conference on Machine Learning (ICML-15), pp. 579–587.Google Scholar
  12. EILON, S., WATSON-GANDY, C., and CHRISTOFIDES, N. (1971), Distributed Management, New York: Hafner.Google Scholar
  13. FORGY, E. (1965), “Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classifications”, Biometrics, 21, 768.Google Scholar
  14. HANSEN, P., and Mladenović, N. (2001), “J-Means: A New Local Search Heuristic for Minimum Sum of Squares Clustering”, Pattern Recognition, 34, 405–413.CrossRefMATHGoogle Scholar
  15. HANSEN, P., NGAI, E., CHEUNG, B.K., and MLADENOVIC, N. (2005), “Analysis of Global k-Means, An Incremental Heuristic for Minimum Sum-of-Squares Clustering”, Journal of Classification, 22(2), 287–310.MathSciNetCrossRefMATHGoogle Scholar
  16. HAVERLY, C.A. (1978), “Studies of the Behavior of Recursion for the Pooling Problem”, ACM SIGMAP Bulletin, (25), 19–28.CrossRefGoogle Scholar
  17. HELSEN, K., and GREEN, P.E. (1991), “A Computational Study of Replicated Clustering with an Application toMarket Segmentation”, Decision Sciences, 22(5), 1124–1141.CrossRefGoogle Scholar
  18. HOFMANS, J., CEULEMANS, E., STEINLEY, D., and VAN MECHELEN, I. (2015), “On the Added Value of Bootstrap Analysis for k-Means Clustering”, Journal of Classification, 32(2), 268–284.MathSciNetCrossRefMATHGoogle Scholar
  19. INABA, M., KATOH, N., and IMAI, H. (1994), “Applications ofWeighted Voronoi Diagrams and Randomization to Variance-Based k-Clustering”, in Proceedings of the 10th ACM Symposium on Computational Geometry, pp. 332–339.Google Scholar
  20. JAIN, R. (2008), The Art of Computer Systems Performance Analysis, New York: John Wiley and Sons.Google Scholar
  21. LICHMAN, M. (2013), UCI Machine Learning Repository, Irvine, CA: University of California, School of Information and Computer Science, http://archive.ics.uci.edu/ml.
  22. MACQUEEN, J. (1967), “Some Methods for Classification and Analysis of Multivariate Observations”, in Proceedings of 5 th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 2, Berkely, CA, pp. 281–297.Google Scholar
  23. MAHAJAN, M., NIMBHORKAR, P., and VARADARAJAN, K. (2009), “The Planar k-Means Problem is NP-Hard”, Lecture Notes in Computer Science, 5431, 274–285.MathSciNetCrossRefMATHGoogle Scholar
  24. MAIRAL, J., BACH, F., and PONCE, J. (2012), “Sparse Modeling for Image and Vision Processing”, Foundations and Trends in Computer Graphics and Vision, 8(2-3), 85–283.CrossRefMATHGoogle Scholar
  25. MAK, J.N., and WOLPAW, J.R. (2009), “Clinical Applications of Brain-Computer Interfaces: Current State and Future Prospects”, IEEE Reviews in Biomedical Engineering, 2, 187–199.CrossRefGoogle Scholar
  26. NUGENT, R., DEAN, N., and AYERS, E. (2010), “Skill Set Profile Clustering: The Empty k-Means Algorithm with Automatic Specification of Starting Cluster Centers”, in Educational Data Mining 2010, pp. 151–160.Google Scholar
  27. ORDIN, B., and BAGIROV, A.M. (2015), “A Heuristic Algorithm for Solving the Minimum Sum-of-Squares Clustering Problems”, Journal of Global Optimization, 61(2), 341–361.MathSciNetCrossRefMATHGoogle Scholar
  28. PACHECO, J., and VALENCIA, O. (2003), “Design of Hybrids for the Minimum Sum-of-Squares Clustering Problem”, Computational Statistics and Data Analysis, 43(2), 235–248.MathSciNetCrossRefMATHGoogle Scholar
  29. RUSPINI, E. (1970), “Numerical Method for Fuzzy Clustering”, Information Sciences, 2, 319–350.Google Scholar
  30. STEINLEY, D. (2006), “K-Means Clustering: A Half-Century Synthesis”, British Journal of Mathematical and Statistical Psychology, 59(1), 1–34.MathSciNetCrossRefGoogle Scholar
  31. STEINLEY, D., and BRUSCO, M.J. (2007), “Initializing k-Means Batch Clustering: A Critical Evaluation of Several Techniques”, Journal of Classification, 24(1), 99–121.MathSciNetCrossRefMATHGoogle Scholar
  32. TAO, P.D. et al. (2014), “New and Efficient Dca Based Algorithms for Minimum Sum-of-Squares Clustering”, Pattern Recognition, 47(1), 388–401.CrossRefMATHGoogle Scholar
  33. TEBOULLE, M. (2007), “A Unified Continous Optimization Framework for Center-Based Clustering Methods”, Journal of Machine Learning Research, (8), 65–102.Google Scholar
  34. WARD JR., J.H. (1963), “Hierarchical Grouping to Optimize an Objective Function”, Journal of the American Statistical Association, 58(301), 236–244.Google Scholar
  35. WU, X., and KUMAR, V. (2009), The Top Ten Algorithms in Data Mining, CRC Press.Google Scholar

Copyright information

© Classification Society of North America 2017

Authors and Affiliations

  • Daniel Aloise
    • 1
    • 4
  • Nielsen Castelo Damasceno
    • 2
  • Nenad Mladenović
    • 3
  • Daniel Nobre Pinheiro
    • 2
  1. 1.Polytechnique MontréalMontréalCanada
  2. 2.Federal University of Rio Grande do NorteRio GrandeBrazil
  3. 3.LAMIH, Université de Valenciennes et du Hainaut CambrésisValenciennesFrance
  4. 4.Department of Computer EngineeringPolytechnique MontréalMontréalCanada

Personalised recommendations