Abstract
One of the main problems in cluster analysis is determining the number of groups in the data. In general, the approach taken depends on the clustering method used. For K-means, some of the most widely employed criteria are formulated in terms of the decomposition of the total point scatter for a two-mode data set of N points in p dimensions, optimally arranged into K classes. This paper addresses the formulation of criteria to determine the number of clusters in the general situation in which the available information for clustering is a one-mode \(N\times N\) dissimilarity matrix describing the objects. In this framework, p and the coordinates of the points are usually unknown, and the application of criteria originally formulated for two-mode data sets depends on their possible reformulation for the one-mode situation. A decomposition of the variability of the clustered objects is proposed in terms of the corresponding block-shaped partition of the dissimilarity matrix. Within-block and between-block dispersion values for the partitioned dissimilarity matrix are derived, and variance-based criteria are then formulated to determine the number of groups in the data. A Monte Carlo experiment was carried out to study the performance of the proposed criteria. For simulated clustered points in p dimensions, greater efficiency in recovering the number of clusters is obtained when the criteria are calculated from the related Euclidean distances instead of the known two-mode data set, in general for unequal-sized clusters and in low-dimensionality situations. For simulated dissimilarity data sets, the proposed criteria always outperform the results obtained when these criteria are calculated from their original formulation, using dissimilarities instead of distances.
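To make the block-shaped decomposition concrete, the following is a minimal illustrative sketch (not the paper's exact criterion or scaling): given a symmetric \(N\times N\) dissimilarity matrix and a cluster labeling, the total squared dispersion over object pairs splits into a within-block part (pairs sharing a cluster) and a between-block part (pairs in different clusters). The function name `block_dispersions` and the toy data are hypothetical.

```python
import numpy as np

def block_dispersions(D, labels):
    """Split the total dispersion of a symmetric dissimilarity matrix D
    into within-block and between-block parts, given cluster labels.

    Illustrative decomposition only: each off-diagonal pair contributes
    its squared dissimilarity once; the paper's exact normalization
    (e.g., by cluster sizes) may differ.
    """
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]  # True where pair shares a cluster
    sq = D ** 2
    total = sq.sum() / 2.0          # matrix is symmetric: count each pair once
    within = sq[same].sum() / 2.0   # diagonal is zero, so it contributes nothing
    between = total - within
    return within, between, total

# Toy example: two well-separated groups of points on a line
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(pts[:, None] - pts[None, :])   # pairwise Euclidean distances in 1-D
labels = np.array([0, 0, 0, 1, 1, 1])
w, b, t = block_dispersions(D, labels)
print(w < b)  # a good partition puts most dispersion between blocks
```

A variance-based criterion in the spirit of the paper would then compare such within- and between-block terms across candidate values of K, choosing the K at which their trade-off is most favorable.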
Acknowledgments
The authors would like to thank the Editor, the AE, and two anonymous referees for valuable criticisms and suggestions that greatly improved the paper. This work has been partially supported by Grant ECO2013-48413-R (J. Fernando Vera) of the Ministerio de Economía y Competitividad of Spain (co-financed by FEDER), and by Grant CB-2015-02-252996 (Rodrigo Macías) of CONACYT.
Vera, J.F., Macías, R. Variance-Based Cluster Selection Criteria in a K-Means Framework for One-Mode Dissimilarity Data. Psychometrika 82, 275–294 (2017). https://doi.org/10.1007/s11336-017-9561-1