Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures
Theory and Methods
Abstract
Eight variable selection techniques for model-based and non-model-based clustering are evaluated across a wide range of cluster structures. Several methods are shown to have difficulty when non-informative variables (i.e., random noise) are included in the model. Furthermore, the distribution of the random noise greatly affects the performance of nearly all of the variable selection procedures. Overall, a variable selection technique based on variance-to-range weighting, coupled with selecting the variables that yield the largest decreases in the within-cluster sum-of-squares error, performed best. By contrast, variable selection methods used in conjunction with finite mixture models performed worst.
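The best-performing approach combines two ingredients named in the abstract: variance-to-range weighting and within-cluster sum-of-squares (WCSS) reductions. The sketch below illustrates those two ingredients under a straightforward reading of the abstract; the specific weight formula (variance divided by squared range) and the function names are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def variance_to_range_weights(X):
    """Illustrative variance-to-range weight per variable (column).

    Assumed form: sample variance divided by the squared range.
    Informative (clustered) variables concentrate mass away from the
    middle of their range, so their ratio exceeds that of, e.g.,
    uniform noise spread over the same range. Constant columns
    (range 0) are not handled in this sketch.
    """
    var = X.var(axis=0)
    rng = X.max(axis=0) - X.min(axis=0)
    return var / rng**2

def wcss(X, labels):
    """Within-cluster sum-of-squares error for a given partition."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        total += ((Xk - Xk.mean(axis=0)) ** 2).sum()
    return total

# Column 0: two tight clusters at 0 and 10; column 1: evenly spread values.
X = np.array([[0, 0], [0, 2], [0, 4], [10, 6], [10, 8], [10, 10]], dtype=float)
w = variance_to_range_weights(X)
print(w)  # the clustered variable receives the larger weight
```

In a selection procedure of this general kind, candidate variables would be weighted before clustering, and a variable would be retained when dropping it fails to produce a comparably low WCSS; this snippet only shows the two building blocks, not a full search.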
Keywords
cluster analysis, variable selection
Copyright information
© The Psychometric Society 2007