Springer Nature is making SARS-CoV-2 and COVID-19 research free. View research | View latest news | Sign up for updates

Order-Constrained Solutions in K-Means Clustering: Even Better Than Being Globally Optimal

  • 275 Accesses

  • 19 Citations

Abstract

This paper proposes an order-constrained K-means cluster analysis strategy, and implements that strategy through an auxiliary quadratic assignment optimization heuristic that identifies an initial object order. A subsequent dynamic programming recursion is applied to optimally subdivide the object set subject to the order constraint. We show that although the usual K-means sum-of-squared-error criterion is not guaranteed to be minimal, a true underlying cluster structure may be more accurately recovered. Also, substantive interpretability seems generally improved when constrained solutions are considered. We illustrate the procedure with several data sets from the literature.

This is a preview of subscription content, log in to check access.

References

  1. Al-Sultan, K. (1995). A tabu search approach to the clustering problem. Pattern Recognition, 28, 1443–1451.

  2. Babu, G.P., & Murty, M.N. (1994). Simulated annealing for selecting optimal initial seeds in the K-means algorithm. Indian Journal of Pure and Applied Mathematics, 25, 85–94.

  3. Baker, F.B., & Hubert, L.J. (1977). Applications of combinatorial programming to data analysis: Seriation using asymmetric proximity measures. British Journal of Mathematical and Statistical Psychology, 30, 154–164.

  4. Brusco, M.J., & Cradit, J.D. (2004). Graph coloring, mimimum-diameter partitioning, and the analysis of confusion matrices. Journal of Mathematical Psychology, 48, 301–309.

  5. Brusco, M.J., & Cradit, J.D. (2005). Bicriterion methods for partitioning dissimilarity matrices. British Journal of Mathematical and Statistical Psychology, 58, 319–332.

  6. Brusco, M.J., & Stahl, S. (2001a). An interactive multiobjective programming approach to combinatorial data analysis. Psychometrika, 66, 5–24.

  7. Brusco, M.J., & Stahl, S. (2001b). Compact integer-programming models for extracting subsets of stimuli from confusion matrices. Psychometrika, 66, 405–420.

  8. Brusco, M.J., & Stahl, S. (2005). Branch-and-bound applications in combinatorial data analysis. New York: Springer.

  9. Chang, W.C. (1983). On using principal components before separating a mixture of two multivariate normal distributions. Applied Statistics, 32, 267–275.

  10. Delattre, M., & Hansen, P. (1980). Bicriterion cluster analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, 277–291.

  11. Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.

  12. Fisher, W.D. (1958). On grouping for maximum heterogeneity. Journal of the American Statistical Association, 53, 789–798.

  13. Forgy, E.W. (1965). Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics, 21, 768–769.

  14. Hansen, P., & Mladenovic, N. (2001). J-Means: A new local search heuristic for minimum sum of squares clustering. Pattern Recognition, 34, 405–413.

  15. Hartigan, J. (1975). Clustering algorithms. New York: Wiley.

  16. Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.

  17. Hubert, L., Arabie, P., & Meulman, J. (2001). Combinatorial data analysis: Optimization by dynamic programming. Philadelphia: SIAM.

  18. Hubert, L., Arabie, P., & Meulman, J. (2006). The structural representation of proximity matrices with MATLAB. Philadelphia: SIAM.

  19. Kiplinger’s personal finance. In Kiplinger’s personal finance (Vol. 57, pp. 104–123).

  20. MacQueen, J. (1967). Some methods of classification and analysis of multivariate observations. In L.M. Le Cam & J. Neyman (Eds.), Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, pp. 281–297). University of California Press: Berkeley.

  21. Milligan, G.W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.

  22. Milligan, G.W., & Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.

  23. Milligan, G.W., & Cooper, M.C. (1988). A study of standardization of variables in cluster analysis. Journal of Classification, 5, 181–204.

  24. Mirkin, B. (2005). Clustering for data mining. New York: Chapman & Hall.

  25. Pacheco, J., & Valencia, O. (2003). Design of hybrids for the minimum sum-of-squares clustering problem. Computational Statistics and Data Analysis, 43, 235–248.

  26. Späth, H. (1980). Cluster analysis algorithms. Chichester: Ellis Horwood.

  27. Steinley, D. (2003). K-means clustering: What you don’t know may hurt you. Psychological Methods, 8, 294–304.

  28. Steinley, D. (2004a). Standardizing variables in K-means clustering. In D. Banks, L. House, F.R. McMorris, P. Arabie, & W. Gaul (Eds.), Classification, clustering, and data mining applications (pp. 53–60). New York: Springer.

  29. Steinley, D. (2004b). Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9, 386–396.

  30. Steinley, D. (2006a). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59, 1–34.

  31. Steinley, D. (2006b). Profiling local optima in K-means clustering: Developing a diagnostic technique. Psychological Methods, 11, 178–192.

  32. Steinley, D., & Brusco, M.J. (2007). Initializing K-means batch clustering: A critical evaluation of several techniques. Journal of Classification, 24, 99–121.

  33. Steinley, D., & Brusco, M.J. (2008, in press). A new variable weighting and selection procedure for K-means cluster analysis. Multivariate Behavioral Research.

  34. Steinley, D., & Henson, R. (2005). OCLUS: An analytic method for generating clusters with known overlap. Journal of Classification, 22, 221–250.

  35. Theise, E.S. (1989). Finding a subset of stimulus-response pairs with minimum total confusion: A binary integer programming approach. Human Factors, 31, 291–305.

  36. Thorndike, R.L. (1953). Who belongs in the family? Psychometrika, 18, 267–276.

  37. Van Ness, J.W. (1973). Admissible clustering procedures II. Biometrika, 60, 422–424.

Download references

Author information

Correspondence to Douglas Steinley.

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Steinley, D., Hubert, L. Order-Constrained Solutions in K-Means Clustering: Even Better Than Being Globally Optimal. Psychometrika 73, 647 (2008). https://doi.org/10.1007/s11336-008-9058-z

Download citation

Keywords

  • K-means cluster analysis
  • dynamic programming
  • quadratic assignment
  • constrained optimization
  • multicriterion optimization