
Validating Clusters with the Lower Bound for Sum-of-Squares Error


Given that a minor condition holds (e.g., the number of variables is greater than the number of clusters), a nontrivial lower bound for the sum-of-squares error criterion in K-means clustering is derived. By calculating the lower bound for several different situations, a method is developed to assess the adequacy of a cluster solution by comparing the observed sum-of-squares error with the minimum sum-of-squares error.
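The article derives its own bound; as a rough illustration of the general idea, the sketch below compares an observed K-means sum-of-squares error with one well-known spectral lower bound of this type (the sum of the p − K + 1 smallest eigenvalues of the centered scatter matrix, which is nontrivial only when the number of variables is at least the number of clusters). The function name, the simulated data, and the choice of this particular bound are illustrative assumptions and need not match the exact bound derived in the article.

```python
# Minimal sketch (not the article's exact derivation): compare the observed
# K-means sum-of-squares error with a spectral lower bound computed from the
# eigenvalues of the centered scatter matrix.  Requires numpy and scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

def spectral_sse_lower_bound(X, k):
    """Lower bound on the K-means SSE: the sum of the (p - k + 1) smallest
    eigenvalues of the centered scatter matrix X_c' X_c.  The bound is
    nontrivial only when the number of variables p is at least k."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc)   # eigenvalues in ascending order
    return eigvals[: X.shape[1] - k + 1].sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                 # illustrative data only
k = 3

km = KMeans(n_clusters=k, n_init=20, random_state=0).fit(X)
observed_sse = km.inertia_                    # observed sum-of-squares error
lower_bound = spectral_sse_lower_bound(X, k)

# A ratio near 1 suggests the partition's SSE is close to the best attainable.
print(f"observed SSE = {observed_sse:.2f}, lower bound = {lower_bound:.2f}, "
      f"ratio = {lower_bound / observed_sse:.3f}")
```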



Author information

Corresponding author

Correspondence to Douglas Steinley.

Additional information

The author was partially supported by the Office of Naval Research Grant #N00014-06-0106.


About this article

Cite this article

Steinley, D. Validating Clusters with the Lower Bound for Sum-of-Squares Error. Psychometrika 72, 93–106 (2007). https://doi.org/10.1007/s11336-003-1272-1

