
Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures

Theory and Methods · Psychometrika

Abstract

Eight variable selection techniques for model-based and non-model-based clustering are evaluated across a wide range of cluster structures. Several of the methods are shown to have difficulties when non-informative variables (i.e., random noise) are included in the model, and the distribution of the random noise strongly affects the performance of nearly all of the procedures. Overall, a variable selection technique based on a variance-to-range weighting procedure coupled with the largest decreases in within-cluster sum-of-squares error performed best, whereas variable selection methods used in conjunction with finite mixture models performed worst.
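
As a concrete illustration of the winning procedure, the sketch below follows one plausible reading of the abstract: each variable is weighted by a variance-to-range index, and variables are then ranked by how much a K-means partition decreases their sum-of-squares error. The function names, the exact weighting index, and the use of scikit-learn's KMeans are assumptions made for illustration; they are not the authors' implementation.

```python
# A minimal sketch, under the assumptions stated above: weight each variable by
# a variance-to-range index, then rank variables by how much K-means clustering
# decreases their sum-of-squares error (SSE). Illustrative only.
import numpy as np
from sklearn.cluster import KMeans


def variance_to_range_index(X):
    """Variance of each column divided by its squared range."""
    rng = X.max(axis=0) - X.min(axis=0)
    return X.var(axis=0) / rng ** 2


def sse_decrease(x, k, seed=0):
    """Decrease in SSE when a single variable is partitioned into k clusters."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    total_ss = float(((x - x.mean()) ** 2).sum())
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(x)
    return total_ss - km.inertia_  # inertia_ is the within-cluster SSE


def select_variables(X, k, n_keep):
    """Keep the n_keep weighted variables with the largest SSE decrease."""
    weights = variance_to_range_index(X)
    Xw = X * weights  # variance-to-range weighting
    scores = np.array([sse_decrease(Xw[:, j], k) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:n_keep]


if __name__ == "__main__":
    # Two informative variables (a two-cluster structure) plus three noise variables.
    gen = np.random.default_rng(0)
    informative = np.vstack([gen.normal(0, 1, (50, 2)), gen.normal(5, 1, (50, 2))])
    noise = gen.normal(0, 1, (100, 3))
    X = np.hstack([informative, noise])
    print(select_variables(X, k=2, n_keep=2))  # typically columns 0 and 1
```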



Author information

Correspondence to Douglas Steinley.


About this article

Cite this article

Steinley, D., Brusco, M.J. Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures. Psychometrika 73, 125–144 (2008). https://doi.org/10.1007/s11336-007-9019-y
