Advances in Data Analysis and Classification, Volume 3, Issue 2, pp 109–134

Variable selection in model-based clustering using multilocus genotype data

  • Wilson Toussile
  • Elisabeth Gassiat
Regular Article


We propose a variable selection procedure for model-based clustering of multilocus genotype data. Some loci may be irrelevant for clustering individuals into statistically different populations. Inferring the number K of clusters and the subset S of loci relevant for clustering is treated as a model selection problem. The competing models are compared using penalized maximum likelihood criteria. Under weak assumptions on the penalty function, we prove the consistency of the resulting estimator \({(\widehat{K}_n, \widehat{S}_n)}\). An associated algorithm named Mixture Model for Genotype Data (MixMoGenD) has been implemented in C++ and is available online. To avoid an exhaustive search for the optimum model, we propose a modified Backward-Stepwise algorithm, which enables a better search for the optimum model among all possible cardinalities of S. We present numerical experiments on simulated and real datasets that highlight the benefit of our loci selection procedure.
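The idea of searching over subsets of loci without enumerating all of them can be illustrated with a minimal sketch of plain backward elimination. This is not the modified Backward-Stepwise algorithm of the paper, nor the MixMoGenD implementation: the function names `backward_stepwise` and `toy_criterion`, the locus labels, and the numeric costs are all hypothetical. In a real setting the criterion would be a penalized negative log-likelihood of a fitted mixture model; here a toy stand-in is used so that the example runs on its own.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def backward_stepwise(
    variables: Iterable[str],
    criterion: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Plain backward elimination; lower criterion values are better."""
    current = frozenset(variables)
    best = criterion(current)
    while len(current) > 1:
        # Score every subset obtained by dropping exactly one variable.
        scored = [(criterion(current - {v}), v) for v in sorted(current)]
        score, drop = min(scored)
        if score >= best:
            break  # no single removal improves the criterion
        current, best = current - {drop}, score
    return current, best


# Hypothetical toy criterion (an assumption, not from the paper):
# a "fit" cost of 10 for each informative locus left out of S,
# plus a "penalty" of 1 for each locus kept in S.
INFORMATIVE = frozenset({"L1", "L2"})


def toy_criterion(subset: FrozenSet[str]) -> float:
    return 10.0 * len(INFORMATIVE - subset) + 1.0 * len(subset)


selected, score = backward_stepwise(["L1", "L2", "L3", "L4"], toy_criterion)
print(sorted(selected), score)  # prints: ['L1', 'L2'] 2.0
```

The search evaluates at most one candidate per remaining locus at each step, so the number of criterion evaluations grows quadratically with the number of loci rather than exponentially as in an exhaustive search. The price is that a greedy search can miss the global optimum, which is presumably what a modified procedure aims to mitigate.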


Model-based clustering · Penalized maximum likelihood criteria · Population genetics · Variable selection




Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. UR016, Institut de Recherche pour le Développement (IRD), Laboratoire de Mathématique d’Orsay (LMO), Ecole Nationale Supérieure Polytechnique de Yaoundé, Orsay Cedex, France
  2. Laboratoire de Mathématique d’Orsay, Orsay Cedex, France
