Psychometrika, 73:125

Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures

Theory and Methods

Abstract

Eight different variable selection techniques for model-based and non-model-based clustering are evaluated across a wide range of cluster structures. It is shown that several methods have difficulties when non-informative variables (i.e., random noise) are included in the model. Furthermore, the distribution of the random noise greatly impacts the performance of nearly all of the variable selection procedures. Overall, a variable selection technique based on a variance-to-range weighting procedure coupled with the largest decreases in within-cluster sums of squares error performed the best. On the other hand, variable selection methods used in conjunction with finite mixture models performed the worst.

Keywords

cluster analysis, variable selection

Copyright information

© The Psychometric Society 2007

Authors and Affiliations

  1. Department of Psychological Sciences, University of Missouri-Columbia, Columbia, USA
  2. Florida State University, Tallahassee, USA
