Latent class analysis variable selection

  • Nema Dean
  • Adrian E. Raftery


We propose a method for selecting variables in latent class analysis, which is the most common model-based clustering method for discrete data. The method assesses a variable’s usefulness for clustering by comparing two models, given the clustering variables already selected. In one model the variable contributes information about cluster allocation beyond that contained in the already selected variables, and in the other model it does not. A headlong search algorithm is used to explore the model space and select clustering variables. In simulated datasets we found that the method selected the correct clustering variables, and also led to improvements in classification performance and in accuracy of the choice of the number of classes. In two real datasets, our method discovered the same group structure with fewer variables. In a dataset from the International HapMap Project consisting of 639 single nucleotide polymorphisms (SNPs) from 210 members of different groups, our method discovered the same group structure with a much smaller number of SNPs.
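The search idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the threshold `upper`, and the toy evidence function are all hypothetical. It assumes a caller-supplied `bic_diff(v, selected)` that returns the evidence (for example, a BIC difference) that variable `v` carries clustering information beyond the variables in `selected`. "Headlong" is taken here to mean accepting the first candidate that clears the threshold rather than searching for the best one.

```python
from typing import Callable, Hashable, List, Sequence


def headlong_search(
    variables: Sequence[Hashable],
    bic_diff: Callable[[Hashable, List[Hashable]], float],
    upper: float = 0.0,   # assumed evidence threshold, not taken from the paper
    max_iter: int = 100,  # safety bound on the number of add/remove sweeps
) -> List[Hashable]:
    """Greedy first-improvement search over clustering variables (sketch)."""
    selected: List[Hashable] = []
    for _ in range(max_iter):
        changed = False
        # Inclusion step: add the first candidate whose evidence, given the
        # currently selected variables, exceeds the threshold.
        for v in variables:
            if v not in selected and bic_diff(v, selected) > upper:
                selected.append(v)
                changed = True
                break
        # Removal step: drop the first selected variable that no longer adds
        # information beyond the other selected variables.
        for v in list(selected):
            others = [u for u in selected if u != v]
            if bic_diff(v, others) <= upper:
                selected.remove(v)
                changed = True
                break
        if not changed:
            break
    return selected


def toy_bic_diff(v: Hashable, selected: List[Hashable]) -> float:
    """Hypothetical evidence function: 'a' and 'b' are informative,
    'c' duplicates 'a', and 'd' is noise."""
    if v == "c":
        return 10.0 if "a" not in selected else -5.0
    return 10.0 if v in {"a", "b"} else -5.0


print(headlong_search(["c", "a", "b", "d"], toy_bic_diff))  # ['a', 'b']
```

In the toy run, `c` is accepted first, then dropped once `a` enters, because it adds nothing beyond `a`. This mirrors the conditional comparison described in the abstract. In practice `bic_diff` would be computed from fitted latent class models, which this sketch leaves out.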


Keywords: Bayes factor; BIC; Categorical data; Feature selection; Model-based clustering; Single nucleotide polymorphism (SNP)



Copyright information

© The Institute of Statistical Mathematics, Tokyo 2009

Authors and Affiliations

  1. Department of Statistics, University of Glasgow, Glasgow, Scotland, UK
  2. Department of Statistics, University of Washington, Seattle, USA
