Abstract
We propose a variable selection procedure for model-based clustering of multilocus genotype data. Some loci may be irrelevant for clustering individuals into statistically different populations. We treat inference of the number K of clusters and of the subset S of loci relevant for clustering as a model selection problem, and compare the competing models using penalized maximum likelihood criteria. Under weak assumptions on the penalty function, we prove the consistency of the resulting estimator \({(\widehat{K}_n, \widehat{S}_n)}\). An associated algorithm, Mixture Model for Genotype Data (MixMoGenD), has been implemented in C++ and is available at http://www.math.u-psud.fr/~toussile. To avoid an exhaustive search for the optimal model, we propose a modified backward-stepwise algorithm, which enables a better search for the optimal model among all possible cardinalities of S. We present numerical experiments on simulated and real datasets that highlight the benefit of our loci selection procedure.
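To illustrate how a stepwise search avoids enumerating every subset of loci, here is a minimal sketch of plain backward elimination driven by a penalized criterion. It is not the modified algorithm implemented in MixMoGenD: the names `backward_stepwise`, `criterion`, `toy_criterion`, and `gains` are hypothetical, the number of clusters K is assumed fixed, and the toy criterion stands in for a penalized maximum log-likelihood that would in practice require fitting a mixture model for each candidate subset.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def backward_stepwise(
    loci: Iterable[str],
    criterion: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Plain backward elimination; returns the best-scoring subset visited.

    `criterion` is assumed to return a penalized log-likelihood
    (higher is better) for a given subset of loci, with K held fixed.
    """
    current = frozenset(loci)
    best_subset, best_score = current, criterion(current)
    # Keep removing loci until one remains, so each cardinality is visited once.
    while len(current) > 1:
        # Score every subset obtained by dropping exactly one locus.
        candidates = [
            (criterion(current - {locus}), current - {locus})
            for locus in sorted(current)
        ]
        score, current = max(candidates, key=lambda pair: pair[0])
        if score > best_score:
            best_subset, best_score = current, score
    return best_subset, best_score


# Hypothetical per-locus contributions to the log-likelihood (illustration only).
gains = {"L1": 5.0, "L2": 4.0, "L3": 0.1, "L4": 0.0}


def toy_criterion(subset: FrozenSet[str]) -> float:
    # Fit term minus a penalty of 1.0 per retained locus.
    return sum(gains[locus] for locus in subset) - 1.0 * len(subset)


best, score = backward_stepwise(gains, toy_criterion)
print(sorted(best), score)
```

With d loci, this search evaluates on the order of d² subsets instead of the 2^d required by an exhaustive search, at the cost of possibly missing the global optimum.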
Cite this article
Toussile, W., Gassiat, E. Variable selection in model-based clustering using multilocus genotype data. Adv Data Anal Classif 3, 109–134 (2009). https://doi.org/10.1007/s11634-009-0043-x
DOI: https://doi.org/10.1007/s11634-009-0043-x