Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures

Steinley, Douglas; Brusco, Michael J.

doi:10.1007/s11336-007-9019-y

Theory and Methods
Published: 04 August 2007

Volume 73, pages 125–144, (2008)
Psychometrika


Abstract

Eight different variable selection techniques for model-based and non-model-based clustering are evaluated across a wide range of cluster structures. It is shown that several methods have difficulties when non-informative variables (i.e., random noise) are included in the model. Furthermore, the distribution of the random noise greatly impacts the performance of nearly all of the variable selection procedures. Overall, a variable selection technique based on a variance-to-range weighting procedure coupled with the largest decreases in within-cluster sums of squares error performed the best. On the other hand, variable selection methods used in conjunction with finite mixture models performed the worst.
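
The abstract names two quantities without defining them: a variance-to-range weighting and the within-cluster sum-of-squares error. The following is a minimal sketch of how such quantities might be computed for a single variable. It assumes the ratio is the variance divided by the squared range; that definition, the function names, and the toy data are illustrative assumptions and are not taken from the article.

```python
from statistics import pvariance


def variance_to_range(column):
    """Assumed definition: population variance divided by squared range."""
    spread = max(column) - min(column)
    return pvariance(column) / spread ** 2 if spread > 0 else 0.0


def within_cluster_ss(column, labels):
    """Sum of squared deviations from each cluster's own mean."""
    total = 0.0
    for k in set(labels):
        members = [x for x, lab in zip(column, labels) if lab == k]
        mean = sum(members) / len(members)
        total += sum((x - mean) ** 2 for x in members)
    return total


def total_ss(column):
    """Sum of squared deviations from the overall mean."""
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column)


# Hypothetical toy data: one variable with two tight groups, one without.
labels = [0, 0, 0, 1, 1, 1]
informative = [0.0, 1.0, 2.0, 18.0, 19.0, 20.0]
noise = [0.0, 12.0, 20.0, 4.0, 16.0, 8.0]

for name, column in (("informative", informative), ("noise", noise)):
    # Drop in error when the partition is used instead of one overall mean.
    decrease = total_ss(column) - within_cluster_ss(column, labels)
    print(name, round(variance_to_range(column), 4), round(decrease, 2))
```

On this toy data the clustered variable has both the larger ratio and the larger decrease in error, which is the kind of signal a selection rule of this type would look for. How the article actually combines the two quantities is not stated in the abstract.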

Author information

Authors and Affiliations

Department of Psychological Sciences, University of Missouri-Columbia, 210 McAlester Hall, Columbia, MO, 65211, USA
Douglas Steinley
Florida State University, Tallahassee, FL, USA
Michael J. Brusco

Corresponding author

Correspondence to Douglas Steinley.

About this article

Cite this article

Steinley, D., Brusco, M.J. Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures. Psychometrika 73, 125–144 (2008). https://doi.org/10.1007/s11336-007-9019-y

Received: 06 July 2005
Revised: 03 July 2006
Published: 04 August 2007
Issue Date: March 2008
DOI: https://doi.org/10.1007/s11336-007-9019-y
