Validating Clusters with the Lower Bound for Sum-of-Squares Error

Steinley, Douglas

doi:10.1007/s11336-003-1272-1

Published: 13 June 2007

Psychometrika, Volume 72, pages 93–106 (2007)

Given that a minor condition holds (e.g., the number of variables is greater than the number of clusters), a nontrivial lower bound for the sum-of-squares error criterion in K-means clustering is derived. By calculating the lower bound for several different situations, a method is developed to determine the adequacy of a cluster solution based on the observed sum-of-squares error as compared to the minimum sum-of-squares error.
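The abstract does not state the bound itself, so the following is only an illustrative sketch and not the article's derivation. It assumes a bound of the spectral-relaxation kind (Zha, Ding, Gu, He, & Simon, 2001, resting on Fan's 1949 eigenvalue theorem): for a centered data matrix X, the sum-of-squares error of any K-cluster partition is at least the trace of X'X minus the sum of its K largest eigenvalues. The function names `sse` and `spectral_lower_bound` are hypothetical. Under this assumed bound, the condition in the abstract has a simple reading: when the number of variables does not exceed K, every eigenvalue is subtracted and the bound collapses to zero.

```python
import numpy as np


def sse(X, labels):
    """Sum-of-squares error: squared distances of points to their cluster means."""
    total = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total


def spectral_lower_bound(X, n_clusters):
    """Assumed spectral-relaxation bound: trace(X'X) minus the K largest eigenvalues.

    This is a stand-in for illustration, not necessarily the bound derived
    in the article.
    """
    Xc = X - X.mean(axis=0)                  # SSE is unchanged by centering
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc)  # eigenvalues in ascending order
    top = eigvals[::-1][:n_clusters]         # the K largest (or all, if fewer exist)
    # The trace equals the sum of all eigenvalues; clip rounding noise at zero.
    return max(eigvals.sum() - top.sum(), 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))             # 5 variables, more than K = 3
    labels = rng.integers(0, 3, size=60)     # an arbitrary 3-cluster partition
    observed = sse(X, labels)
    bound = spectral_lower_bound(X, 3)
    print(f"observed SSE = {observed:.2f}, lower bound = {bound:.2f}")
```

A ratio of the bound to the observed error that is close to 1 would suggest the partition is near the best attainable value of the criterion. How the article turns such a comparison into a validation method is not recoverable from the abstract.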

Author information

Authors and Affiliations

Department of Psychological Sciences, University of Missouri-Columbia, 210 McAlester Hall, Columbia, MO, 65211, USA
Douglas Steinley

Corresponding author

Correspondence to Douglas Steinley.

Additional information

The author was partially supported by the Office of Naval Research Grant #N00014-06-0106.

About this article

Cite this article

Steinley, D. Validating Clusters with the Lower Bound for Sum-of-Squares Error. Psychometrika 72, 93–106 (2007). https://doi.org/10.1007/s11336-003-1272-1

Received: 23 November 2004
Accepted: 20 February 2006
Published: 13 June 2007
Issue Date: March 2007
DOI: https://doi.org/10.1007/s11336-003-1272-1
