Variable selection in model-based clustering using multilocus genotype data

Abstract

We propose a variable selection procedure for model-based clustering of multilocus genotype data. Some loci may be irrelevant for clustering individuals into statistically different populations. We treat the inference of the number K of clusters and of the subset S of loci relevant for clustering as a model selection problem, and compare the competing models using penalized maximum likelihood criteria. Under weak assumptions on the penalty function, we prove the consistency of the resulting estimator \({(\widehat{K}_n, \widehat{S}_n)}\). An associated algorithm, named Mixture Model for Genotype Data (MixMoGenD), has been implemented in C++ and is available at http://www.math.u-psud.fr/~toussile. To avoid an exhaustive search for the optimal model, we propose a modified Backward-Stepwise algorithm, which enables a better search for the optimal model among all possible cardinalities of S. We present numerical experiments on simulated and real datasets that highlight the benefit of our loci selection procedure.
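
To make the idea of searching over subsets S with a penalized criterion more concrete, here is a minimal Python sketch. It is not MixMoGenD and not the authors' modified Backward-Stepwise algorithm: it shows plain backward elimination that remembers the best subset seen at any cardinality. The function names (`penalized_criterion`, `backward_stepwise`) and the BIC-type and AIC-type penalties are illustrative assumptions; the abstract only requires weak conditions on the penalty and does not fix a particular one.

```python
import math
from typing import Callable, FrozenSet, Iterable, Tuple


def penalized_criterion(loglik: float, dim: int, n: int, penalty: str = "bic") -> float:
    """Return -loglik + pen(dim); smaller values are better.

    The two penalties are illustrative choices, not the paper's:
    'bic' uses 0.5 * dim * log(n), 'aic' uses dim.
    """
    if penalty == "bic":
        return -loglik + 0.5 * dim * math.log(n)
    if penalty == "aic":
        return -loglik + dim
    raise ValueError(f"unknown penalty: {penalty!r}")


def backward_stepwise(
    variables: Iterable[str],
    crit: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Plain backward elimination over subsets of `variables`.

    `crit` is a hypothetical callback that returns the penalized
    criterion of the model built on a given subset (lower is better).
    At each step the variable whose removal yields the smallest
    criterion is dropped; the best subset seen at any cardinality
    is returned together with its criterion value.
    """
    current = frozenset(variables)
    best_subset, best_value = current, crit(current)
    while len(current) > 1:
        # Evaluate every subset obtained by removing one variable.
        value, current = min(
            ((crit(current - {v}), current - {v}) for v in current),
            key=lambda pair: pair[0],
        )
        if value < best_value:
            best_subset, best_value = current, value
    return best_subset, best_value
```

In a real application, `crit` would fit the mixture model restricted to the candidate subset (for example by EM) and pass the maximized log-likelihood and the model dimension to `penalized_criterion`. A greedy search of this kind evaluates on the order of L² subsets for L loci instead of all 2^L, at the cost of possibly missing the global optimum.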

Author information

Corresponding author

Correspondence to Wilson Toussile.

About this article

Cite this article

Toussile, W., Gassiat, E. Variable selection in model-based clustering using multilocus genotype data. Adv Data Anal Classif 3, 109–134 (2009). https://doi.org/10.1007/s11634-009-0043-x

Keywords

  • Model-based clustering
  • Penalized maximum likelihood criteria
  • Population genetics
  • Variable selection

JEL Classification

  • C89

Mathematics Subject Classification (2000)

  • 62H30