Given that a minor condition holds (e.g., the number of variables is greater than the number of clusters), a nontrivial lower bound for the sum-of-squares error criterion in K-means clustering is derived. By calculating the lower bound for several different situations, a method is developed to determine the adequacy of a cluster solution by comparing the observed sum-of-squares error with the minimum sum-of-squares error.
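The abstract does not state the bound itself, so the following is only a minimal sketch of one well-known bound of this kind, not necessarily the one derived in the article. It assumes the spectral-relaxation bound of Zha, Ding, Gu, He, and Simon (2001): for an n × p data matrix and K clusters, the sum-of-squares error of any partition is at least the sum of all but the K largest eigenvalues of X'X. The function names `sse` and `spectral_lower_bound` are hypothetical, chosen for illustration.

```python
import numpy as np


def sse(X, labels):
    """Within-cluster sum-of-squares error of a given partition."""
    total = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total


def spectral_lower_bound(X, n_clusters):
    """Assumed spectral-relaxation lower bound on the K-means SSE.

    Centering is allowed because the SSE does not change when every
    observation is shifted by the same vector.
    """
    Xc = X - X.mean(axis=0)
    # Squared singular values of Xc are the eigenvalues of Xc'Xc,
    # returned in descending order.
    eigenvalues = np.linalg.svd(Xc, compute_uv=False) ** 2
    # trace(Xc'Xc) minus the K largest eigenvalues.
    return eigenvalues[n_clusters:].sum()


# Illustrative data only: 60 observations, 5 variables, 3 clusters,
# so the number of variables exceeds the number of clusters.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
labels = np.arange(60) % 3  # an arbitrary partition, not a fitted one

observed = sse(X, labels)
bound = spectral_lower_bound(X, n_clusters=3)
ratio = observed / bound  # values near 1 suggest little room to improve
```

Under this assumed bound, the result is nontrivial (strictly positive) only when the rank of the data matrix exceeds K, which is consistent with the condition that the number of variables be greater than the number of clusters. The ratio at the end is one simple way to compare an observed error with the bound; the article's own adequacy criterion is not given in the abstract.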
Additional information
The author was partially supported by the Office of Naval Research Grant #N00014-06-0106.
Cite this article
Steinley, D. Validating Clusters with the Lower Bound for Sum-of-Squares Error. Psychometrika 72, 93–106 (2007). https://doi.org/10.1007/s11336-003-1272-1
DOI: https://doi.org/10.1007/s11336-003-1272-1