Variance-Based Cluster Selection Criteria in a K-Means Framework for One-Mode Dissimilarity Data

Published in Psychometrika.

Abstract

One of the main problems in cluster analysis is that of determining the number of groups in the data. In general, the approach taken depends on the cluster method used. For K-means, some of the most widely employed criteria are formulated in terms of the decomposition of the total point scatter for a two-mode data set of N points in p dimensions, optimally partitioned into K classes. This paper addresses the formulation of criteria to determine the number of clusters in the general situation in which the available information for clustering is a one-mode \(N\times N\) dissimilarity matrix describing the objects. In this framework, p and the coordinates of the points are usually unknown, and applying criteria originally formulated for two-mode data sets depends on whether they can be reformulated for the one-mode situation. A decomposition of the variability of the clustered objects is proposed in terms of the corresponding block-shaped partition of the dissimilarity matrix. Within-block and between-block dispersion values for the partitioned dissimilarity matrix are derived, and variance-based criteria are subsequently formulated to determine the number of groups in the data. A Monte Carlo experiment was carried out to study the performance of the proposed criteria. For simulated clustered points in p dimensions, the criteria recover the number of clusters more efficiently when calculated from the related Euclidean distances rather than from the known two-mode data set, in general, for unequal-sized clusters and in low-dimensional situations. For simulated dissimilarity data sets, the proposed criteria always outperform their original formulations calculated by substituting dissimilarities for distances.
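The within-block/between-block decomposition described in the abstract can be sketched directly from a dissimilarity matrix. The following Python sketch is illustrative only (the function names and the Calinski-Harabasz-style index are not the paper's exact criteria): it relies on the standard identity that, for Euclidean distances, the within-cluster sum of squares of cluster k equals the sum of squared within-block distances divided by twice the cluster size, so that total scatter splits into within plus between parts without knowing p or the point coordinates.

```python
import numpy as np

def within_between_dispersion(D, labels):
    """Split the dispersion of a one-mode N x N dissimilarity matrix D
    into within-block and between-block parts.

    When D holds Euclidean distances, the identity
        W_k = (1 / (2 * n_k)) * sum_{i, j in cluster k} D[i, j]**2
    makes this agree with the classical within-cluster scatter, so
    total scatter = within + between, as in the two-mode decomposition.
    """
    N = D.shape[0]
    D2 = D ** 2
    total = D2.sum() / (2 * N)  # total point scatter, from D alone
    within = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        # sum of squared dissimilarities inside block k
        within += D2[np.ix_(idx, idx)].sum() / (2 * len(idx))
    return within, total - within

def ch_style_index(D, labels):
    """A Calinski-Harabasz-style ratio computed from D alone;
    larger values favour the corresponding number of clusters K."""
    N = D.shape[0]
    K = len(np.unique(labels))
    within, between = within_between_dispersion(D, labels)
    return (between / (K - 1)) / (within / (N - K))
```

In use, one would compute such an index for partitions obtained at several candidate values of K and pick the K at which the index is maximised; for general (non-Euclidean) dissimilarities the decomposition is a heuristic analogue rather than an exact scatter identity.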



Acknowledgments

The authors would like to thank the Editor, the AE, and two anonymous referees for valuable criticisms and suggestions that greatly improved the paper. This work has been partially supported by Grant ECO2013-48413-R (J. Fernando Vera) of the Ministerio de Economía y Competitividad of Spain (co-financed by FEDER), and by Grant CB-2015-02-252996 (Rodrigo Macías) of CONACYT.

Author information

Correspondence to J. Fernando Vera.


About this article


Cite this article

Vera, J.F., Macías, R. Variance-Based Cluster Selection Criteria in a K-Means Framework for One-Mode Dissimilarity Data. Psychometrika 82, 275–294 (2017). https://doi.org/10.1007/s11336-017-9561-1
