Latent class analysis variable selection

Abstract

We propose a method for selecting variables in latent class analysis, which is the most common model-based clustering method for discrete data. The method assesses a variable’s usefulness for clustering by comparing two models, given the clustering variables already selected. In one model the variable contributes information about cluster allocation beyond that contained in the already selected variables, and in the other model it does not. A headlong search algorithm is used to explore the model space and select clustering variables. In simulated datasets we found that the method selected the correct clustering variables, and also led to improvements in classification performance and in accuracy of the choice of the number of classes. In two real datasets, our method discovered the same group structure with fewer variables. In a dataset from the International HapMap Project consisting of 639 single nucleotide polymorphisms (SNPs) from 210 members of different groups, our method discovered the same group structure with a much smaller number of SNPs.
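The search strategy described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation. It assumes that "headlong" means accepting the first proposal that passes a threshold instead of searching for the best one. The `evidence` callback is a hypothetical stand-in for the comparison of the two models (for example a BIC difference), and `toy_evidence`, the variable names, and the threshold of 0 are invented for the example.

```python
from typing import Callable, FrozenSet, Iterable, List

# Hypothetical signature: evidence(variable, already_selected) -> score.
# A positive score is taken to mean the variable adds clustering information
# beyond the already selected variables (e.g. a BIC difference).
Evidence = Callable[[str, FrozenSet[str]], float]


def headlong_search(variables: Iterable[str], evidence: Evidence,
                    threshold: float = 0.0, max_iter: int = 100) -> List[str]:
    """Alternate inclusion and removal steps, taking the first move that passes."""
    candidates = list(variables)
    selected: List[str] = []
    for _ in range(max_iter):
        changed = False
        # Inclusion step: add the first unselected variable whose evidence
        # exceeds the threshold, without scoring the remaining candidates.
        for v in candidates:
            if v in selected:
                continue
            if evidence(v, frozenset(selected)) > threshold:
                selected.append(v)
                changed = True
                break
        # Removal step: drop the first selected variable whose evidence,
        # given the other selected variables, does not exceed the threshold.
        for v in list(selected):
            others = frozenset(selected) - {v}
            if evidence(v, others) <= threshold:
                selected.remove(v)
                changed = True
                break
        if not changed:
            break  # neither step changed the selection, so the search stops
    return selected


def toy_evidence(variable: str, selected: FrozenSet[str]) -> float:
    """Invented scores: two informative variables, everything else is noise."""
    informative = {"snp1": 12.0, "snp2": 7.5}
    return informative.get(variable, -3.0)


selected = headlong_search(["snp1", "noise1", "snp2", "noise2"], toy_evidence)
print(selected)
```

In a real analysis the `evidence` function would fit and compare the two latent class models for each proposal, which is where nearly all of the computation lies. Accepting the first passing variable keeps the number of such fits small when there are many candidates.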

Author information

Corresponding author

Correspondence to Adrian E. Raftery.

About this article

Cite this article

Dean, N., Raftery, A.E. Latent class analysis variable selection. Ann Inst Stat Math 62, 11 (2010). https://doi.org/10.1007/s10463-009-0258-9

Keywords

  • Bayes factor
  • BIC
  • Categorical data
  • Feature selection
  • Model-based clustering
  • Single nucleotide polymorphism (SNP)