Abstract
A Bayesian nonparametric form of regression based on Dirichlet process priors is adapted to the analysis of quantitative traits possibly affected by cryptic forms of gene action, and to the context of SNP-assisted genomic selection, where the main objective is to predict a genomic signal on phenotype. The procedure clusters unknown genotypes into groups with distinct genetic values, but in a setting in which the number of clusters is unknown a priori, so that standard methods for finite mixture analysis do not work. The central assumption is that genetic effects follow an unknown distribution with some “baseline” family, which is a normal process in the cases considered here. A Bayesian analysis based on the Gibbs sampler produces estimates of the number of clusters, posterior means of genetic effects, a measure of credibility in the baseline distribution, as well as estimates of parameters of the latter. The procedure is illustrated with a simulation representing two populations. In the first one, there are 3 unknown QTL, with additive, dominance and epistatic effects; in the second, there are 10 QTL with additive, dominance and additive × additive epistatic effects. In the two populations, baseline parameters are inferred correctly. The Dirichlet process model infers the number of unique genetic values correctly in the first population, but it produces an understatement in the second one; here, the true number of clusters is over 900, and the model gives a posterior mean estimate of about 140, probably because more replication of genotypes is needed for correct inference. The impact on inferences of the prior distribution of a key parameter (M), and of the extent of replication, was examined via an analysis of mean body weight in 192 paternal half-sib families of broiler chickens, where each sire was genotyped for nearly 7,000 SNPs. In this small sample, it was found that inference about the number of clusters was affected by the prior distribution of M. 
For a set of combinations of parameters of a given prior distribution, the effects of the prior dissipated when the number of replicate samples per genotype was increased. Thus, the Dirichlet process model seems to be useful for gauging the number of QTLs affecting the trait: if the number of clusters inferred is small, probably just a few QTLs code for the trait. If the number of clusters inferred is large, this may imply that standard parametric models based on the baseline distribution may suffice. However, priors may be influential, especially if sample size is not large and if only a few genotypic configurations have replicate phenotypes in the sample.
References
Antoniak CE (1974) Mixtures of Dirichlet processes with applications to non-parametric problems. Ann Stat 2:1152–1174
Bush CA, MacEachern SN (1996) A semiparametric Bayesian model for randomized block designs. Biometrika 83:275–285
Cockerham CC (1954) An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics 39:859–882
Crow JF, Kimura M (1970) An introduction to population genetics theory. Harper and Row, New York
Dahl DB (2006) Model-based clustering for expression data via a Dirichlet process mixture model. In: Do KA, Müller P, Vannucci M (eds) Bayesian inference for gene expression and proteomics. Cambridge University Press, Cambridge
De Los Campos G, Gianola D, Rosa GJM (2009a) Reproducing kernel Hilbert spaces regression: a general framework for genetic evaluation. J Anim Sci 87:1883–1887
De Los Campos G, Naya H, Gianola D, Crossa J, Legarra A, Manfredi E, Weigel K, Cotes JM (2009b) Predicting quantitative traits with regression models for dense molecular markers and pedigrees. Genetics 182:375–385
Dempster ER, Lerner IM (1950) Heritability of threshold characters. Genetics 35:212–236
Escobar MD (1994) Estimating normal means with a Dirichlet process prior. J Am Stat Assoc 89:268–275
Escobar MD, West M (1998) Computing non-parametric hierarchical models. In: Dey D, Müller P, Sinha D (eds) Practical nonparametric and semiparametric Bayesian statistics. Springer, New York, pp 1–22
Falconer DS (1965) The inheritance of liability to certain diseases, estimated from the incidence among relatives. Ann Hum Genet 29:51–76
Ferguson TS (1973) A Bayesian analysis of some nonparametric problems. Ann Stat 1:209–230
Foster SD, Verbyla AP, Pitchford WS (2007) Incorporating LASSO effects into a mixed model for QTL detection. J Agric Biol Environ Stat 12:300–314
Gianola D, De Los Campos G (2008) Inferring genetic values for quantitative traits non-parametrically. Genet Res 90:525–540
Gianola D, Foulley JL (1983) Sire evaluation for ordered categorical data with a threshold model. Genet Sel Evol 15:201–223
Gianola D, Simianer H (2006) A Thurstonian model for quantitative genetic analysis of ranks: a Bayesian approach. Genetics 174:1613–1624
Gianola D, van Kaam JBCHM (2008) Reproducing kernel Hilbert spaces methods for genomic assisted prediction of quantitative traits. Genetics 178:2289–2303
Gianola D, Perez-Enciso M, Toro MA (2003) On marker-assisted prediction of genetic value: beyond the ridge. Genetics 163:347–365
Gianola D, Fernando RL, Stella A (2006a) Genomic assisted prediction of genetic value with semi-parametric procedures. Genetics 173:1761–1776
Gianola D, Heringstad B, Ødegård J (2006b) On the quantitative genetics of mixture characters. Genetics 173:2247–2255
Gianola D, De Los Campos G, Hill WG, Manfredi E, Fernando RL (2009) Additive genetic variability and the Bayesian alphabet. Genetics (submitted)
González-Recio O, Gianola D, Long N, Weigel KA, Rosa GJM, Avendaño S (2008) Nonparametric methods for incorporating genomic information into genetic evaluations: an application to mortality in broilers. Genetics 178:2305–2313
González-Recio O, Gianola D, Rosa GJM, Weigel KA, Avendaño S (2009) Genome-assisted prediction of a quantitative trait in parents and progeny: application to food conversion rate in chickens. Genet Sel Evol (in press)
Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME (2009) Genomic selection in dairy cattle: progress and challenges. J Dairy Sci 92:433–443
Hirschhorn JN, Daly MJ (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6:95–108
Ibrahim JG, Kleinman KP (1998) Semiparametric Bayesian methods for random effects models. In: Dey D, Müller P, Sinha D (eds) Practical nonparametric and semiparametric Bayesian statistics. Springer, New York
Jannink JL, Wu XL (2004) Estimating allelic number and identity in state of QTLs in interconnected families. Genet Res 81:133–144
Kleinman KP, Ibrahim JG (1998) A semiparametric Bayesian approach to the random effects model. Biometrics 54:921–938
Lee HKH (2004) Bayesian nonparametrics via neural networks. ASA-SIAM, Philadelphia
Long N, Gianola D, Rosa GJM, Weigel KA, Avendaño S (2007) Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers. J Anim Breed Genet 124:377–389
MacEachern SN (1994) Estimation of normal means with a conjugate style Dirichlet process prior. Comm Statist Sim 23:727–741
Meuwissen TH, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829
Motsinger-Reif AA, Dudek SM, Hahn LW, Ritchie MD (2008) Comparison of approaches for machine learning optimization of neural networks for detecting gene-gene interactions in genetic epidemiology. Genet Epidemiol 32:325–340
Park T, Casella G (2008) The Bayesian Lasso. J Am Stat Assoc 103:681–686
Searle SR (1971) Linear models. Wiley, New York
Sorensen D, Gianola D (2002) Likelihood, Bayesian, and MCMC methods in quantitative genetics. Springer, New York
Templeton AR (2000) Epistasis and complex traits. In: Wolf JB et al. (eds) Epistasis and the evolutionary process. Oxford University Press, New York, pp 41–57
Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J Roy Stat Soc B 58:267–288
van der Merwe AJ, Pretorius AL (2003) Bayesian estimation in animal breeding using the Dirichlet process prior for correlated random effects. Genet Sel Evol 35:137–158
VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91:4414–4423
Wang CS, Rutledge JJ, Gianola D (1993) Marginal inferences about variance components in a mixed linear model using Gibbs sampling. Genet Sel Evol 25:41–62
Wang CS, Rutledge JJ, Gianola D (1994) Bayesian analysis of mixed linear models via Gibbs sampling with an application to litter size in Iberian pigs. Genet Sel Evol 26:91–115
West M (1992) Hyperparameter estimation in Dirichlet process mixture models. Technical Report 92-A03, 6 pp, ISDS, Duke University
Xu S (2003) Estimating polygenic effects using markers of the entire genome. Genetics 163:789–801
Yi N, Xu S (2008) Bayesian LASSO for quantitative trait loci mapping. Genetics 179:1045–1055
Acknowledgments
Part of this work was carried out while the senior author was a Visiting Professor at Georg-August-Universität, Göttingen (Alexander von Humboldt Foundation Senior Researcher Award), and Visiting Scientist at the Station d’Amélioration Génétique des Animaux, Centre de Recherche de Toulouse (Chaire D’Excellence Pierre de Fermat, Agence Innovation, Midi-Pyrénées). Support by the Wisconsin Agriculture Experiment Station, and by grant NSF DMS-044371 to the first and second authors, is acknowledged. Aviagen Ltd. (Newbridge, Scotland) is thanked for providing the chicken data. A FORTRAN program implementing the specific model described in the paper is available upon request from nick.wu@ansci.wisc.edu.
Appendices
Appendix A: Technical details on drawing genomic effects
Computing \(K_{i}\) is critical for implementing the DP methodology. When the baseline distribution is \(N( g_{i}|0,\sigma_{g}^{2}) ,\) the integral in (10) represented as \(K_{i}\) is expressible in closed form, because the integrand is the product of two normal densities. Here, \( \varvec{\beta }\) and all \(g\)'s other than \(g_{i}\) enter as fixed parameters, while \(g_{i}\) follows the \(N\left( g_{i}|0,\sigma _{g}^{2}\right) \) distribution. Standard integration yields
where \({\bf V}_{i}\) is the \(n{\times}n\) matrix
It is shown in “Appendix B” that the form of \({\bf V}_{i}\) (after rearrangement of observations, such that the first \(n_{i}\) records are those from individuals with configuration \(i\)) is
where \({\bf J}_{n_{i}}\) is a matrix of ones of order \(n_{i}\times n_{i}\), and \(n-n_{i}\) is the number of individuals with records that have genotypes other than \(i\). Using results in Searle (1971)
so that
Likewise,
Now, with the records arranged such that those of the \(n_{i}\) individuals with configuration \(i\) precede data points of the \(n-n_{i}\) individuals having a different genotype, let such rearrangement (indexed by \(i\)) lead to
where \({\bf z}_{y\in i}\left( {\bf y},\varvec{\beta },{\bf g} \right) \) denotes elements of \(\left( {\bf y-X\beta }-\sum\limits_{j\neq i}^{C}{\bf z}_{g_{j}}g_{j}\right) \) involving \(g_{i}\), and \({\bf z} _{y\notin i}\left( {\bf y},\varvec{\beta },{\bf g}\right) \) indicates records in the complement. Using (27) and (28) in (26)
where \(z_{y\in i,j}\left( {\bf y},\varvec{\beta },{\bf g}\right) \) is element \(j\) of \({\bf z}_{y\in i}\left( {\bf y},\varvec{\beta },{\bf g}\right) .\) The form of \(K_{i}\) in (29) does not involve matrix inverses or matrix computations.
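The claim that \(K_{i}\) can be evaluated without matrix inversion can be checked numerically. The Python sketch below is an illustration only, not the authors' FORTRAN program. It assumes that \({\bf V}_{i}={\bf z}_{g_{i}}{\bf z}_{g_{i}}^{\prime }\sigma _{g}^{2}+{\bf I}\sigma _{e}^{2}\), with \(\sigma _{e}^{2}\) taken to denote the residual variance (a notation not restated in this appendix), and it derives the scalar form from the block structure of \({\bf V}_{i}\) described above rather than copying expression (29). The function names `log_k_closed_form` and `log_k_dense` and their arguments are hypothetical.

```python
import numpy as np


def log_k_closed_form(resid, in_i, sigma2_g, sigma2_e):
    """Log of K_i using only sums (no matrix inverse or determinant).

    resid : vector y - X*beta - sum_{j != i} z_j g_j (assumed precomputed)
    in_i  : boolean mask, True for records with genotypic configuration i
    """
    resid = np.asarray(resid, dtype=float)
    in_i = np.asarray(in_i, dtype=bool)
    n = resid.size
    n_i = int(in_i.sum())
    denom = sigma2_e + n_i * sigma2_g
    # log|V_i| = (n - 1) log(sigma2_e) + log(sigma2_e + n_i sigma2_g)
    log_det = (n - 1) * np.log(sigma2_e) + np.log(denom)
    # r' V_i^{-1} r = [sum_j r_j^2 - sigma2_g / denom * (sum_{j in i} r_j)^2] / sigma2_e
    sum_in_i = resid[in_i].sum()
    quad = (resid @ resid - sigma2_g / denom * sum_in_i**2) / sigma2_e
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + quad)


def log_k_dense(resid, in_i, sigma2_g, sigma2_e):
    """Same quantity via explicit n x n matrix operations, for checking."""
    resid = np.asarray(resid, dtype=float)
    z = np.asarray(in_i, dtype=float)
    n = resid.size
    v = sigma2_g * np.outer(z, z) + sigma2_e * np.eye(n)
    _, log_det = np.linalg.slogdet(v)
    quad = resid @ np.linalg.solve(v, resid)
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + quad)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    r = rng.normal(size=8)
    mask = np.array([True, True, True, False, False, False, False, False])
    print(log_k_closed_form(r, mask, 0.5, 1.0))
    print(log_k_dense(r, mask, 0.5, 1.0))
```

The two functions should agree up to rounding error. The closed form costs a number of operations linear in \(n\) per configuration, whereas the dense version needs a factorization of an \(n{\times}n\) matrix.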
Appendix B: Matrix manipulations
Form of matrix \({\bf V}_{i}\). To see the pattern, suppose that \(n=4\), \(C=5\), and that \(\varvec{\beta }={\bf 1}_{4}\mu \) (\({\bf 1}_{4}\) is a \(4{\times}1\) vector of ones), so that model (2) is
Consider effect \(g_{1}\), so that
For \(g_{2}\)
Likewise for \(g_{3}\left( g_{4}\right) \)
Finally,
The general form of \({\bf V}_{i}\) (after rearrangement of observations, such that the first \(n_{i}\) records are those from individuals with configuration \(i\)) is
where \({\bf J}_{n_{i}}\) is an \(n_{i}{\times}n_{i}\) matrix of ones, and \( {\bf I}_{n_{i}}\) and \({\bf I}_{n-n_{i}}\) are identity matrices of order \(n_{i}\) and \(n-n_{i}\), respectively.
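The block structure just described can be written out explicitly. The display below is a reconstruction from the definitions in the text, not a copy of the original equations; it assumes that \(\sigma _{e}^{2}\) denotes the residual variance and that \({\bf V}_{i}={\bf z}_{g_{i}}{\bf z}_{g_{i}}^{\prime }\sigma _{g}^{2}+{\bf I}\sigma _{e}^{2}\). The inverse and determinant follow from standard results for matrices of the form \(a{\bf I}+b{\bf J}\).

```latex
% Reconstruction under the stated assumptions; \sigma_e^2 is assumed notation.
\[
{\bf V}_{i}=
\begin{bmatrix}
{\bf J}_{n_{i}}\sigma_{g}^{2}+{\bf I}_{n_{i}}\sigma_{e}^{2} & {\bf 0}\\
{\bf 0} & {\bf I}_{n-n_{i}}\sigma_{e}^{2}
\end{bmatrix}
\]
% Inverse of the leading block and determinant of the whole matrix:
\[
\left({\bf J}_{n_{i}}\sigma_{g}^{2}+{\bf I}_{n_{i}}\sigma_{e}^{2}\right)^{-1}
=\frac{1}{\sigma_{e}^{2}}
\left({\bf I}_{n_{i}}-\frac{\sigma_{g}^{2}}{\sigma_{e}^{2}+n_{i}\sigma_{g}^{2}}\,{\bf J}_{n_{i}}\right),
\qquad
\left|{\bf V}_{i}\right|
=\left(\sigma_{e}^{2}\right)^{n-1}\left(\sigma_{e}^{2}+n_{i}\sigma_{g}^{2}\right)
\]
```

Because the off-diagonal blocks are null, the inverse of \({\bf V}_{i}\) is block diagonal as well, with \({\bf I}_{n-n_{i}}/\sigma _{e}^{2}\) as its trailing block.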
Appendix C: Conditional posterior distribution of M
The normalized density (22) is
The integrals in the denominator yield
Letting \(a^{{\ast}}=(t+a)\) and \(b^{\ast }=\left( b-\log x\right) \) the conditional posterior distribution becomes
This is a mixture of the two Gamma distributions indicated, with mixing probabilities \({\pi}_{x}\) and \(1-{\pi}_{x}.\) Note that
Since \(\Upgamma \left( a^{\ast }\right) =\left( a^{\ast }-1\right) \Upgamma \left( a^{\ast }-1\right) ,\)
At the end of MCMC sampling there will be S, say, samples of the number of clusters t and of the auxiliary variables x. The density of the marginal posterior distribution of M can be estimated (West 1992) using the Rao-Blackwell estimator
where \(a^{\ast \left( s\right) }=t^{\left( s\right) }+a,\) and \(b^{\ast \left( s\right) }=b-\log x^{\left( s\right) }\). If \(a=0\) and \(b=0\), the Gamma prior degenerates to \(p(M) \varpropto M^{-1}\) or, equivalently, to \(p\left[ \log \left( M\right) \right] \varpropto \) constant. If such an improper prior is adopted, the Rao-Blackwell estimator reduces to
with \(a^{\ast }=t\) and \(b^{\ast }=-\log x\) used in the calculation of \(\pi _{x^{\left( s\right) }}.\) The conditional posterior distribution of \(M\) is well defined for \(a=b=0\), but this does not guarantee that its marginal posterior distribution will always be proper. Hence, the uniform prior on \( {\log}(M) \) should be used cautiously.
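The Rao-Blackwell estimator described above averages the two-component Gamma mixture over the saved samples of \(t\) and \(x\). The Python sketch below illustrates this under assumptions that the text of this appendix does not spell out. The Gamma components are taken to have shapes \(a^{\ast }\) and \(a^{\ast }-1\) with common rate \(b^{\ast }\), and the mixing probability is taken as \(\pi _{x}=(a^{\ast }-1)/(a^{\ast }-1+n\,b^{\ast })\), the form used in the standard auxiliary-variable construction for Dirichlet process mixtures. Here `n_units` is a hypothetical name for the number of units on which the Dirichlet process prior is placed, and all function names are illustrative.

```python
import math

import numpy as np


def mixture_terms(t, x, a, b, n_units):
    """Return (a_star, b_star, pi_x) for one sample (t, x); assumed forms."""
    a_star = t + a
    b_star = b - math.log(x)
    if a_star - 1.0 <= 0.0 or b_star <= 0.0:
        # Happens, e.g., with a = b = 0 and t = 1: a component is not a proper Gamma.
        raise ValueError("Gamma components are not proper for these values")
    pi_x = (a_star - 1.0) / (a_star - 1.0 + n_units * b_star)
    return a_star, b_star, pi_x


def gamma_pdf(m, shape, rate):
    """Gamma density with the given shape and rate, evaluated at m > 0."""
    m = np.asarray(m, dtype=float)
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1.0) * np.log(m) - rate * m)
    return np.exp(log_pdf)


def rao_blackwell_density(m_grid, t_samples, x_samples, a, b, n_units):
    """Average the conditional mixture density of M over the MCMC samples."""
    m_grid = np.asarray(m_grid, dtype=float)
    density = np.zeros_like(m_grid)
    for t, x in zip(t_samples, x_samples):
        a_star, b_star, pi_x = mixture_terms(t, x, a, b, n_units)
        density += (pi_x * gamma_pdf(m_grid, a_star, b_star)
                    + (1.0 - pi_x) * gamma_pdf(m_grid, a_star - 1.0, b_star))
    return density / len(t_samples)


def draw_m(rng, t, x, a, b, n_units):
    """One draw of M from the two-component Gamma mixture, given t and x."""
    a_star, b_star, pi_x = mixture_terms(t, x, a, b, n_units)
    shape = a_star if rng.random() < pi_x else a_star - 1.0
    return rng.gamma(shape, 1.0 / b_star)  # NumPy parameterizes by scale = 1 / rate


if __name__ == "__main__":
    grid = np.linspace(0.01, 30.0, 500)
    dens = rao_blackwell_density(grid, [3, 5, 4], [0.2, 0.5, 0.8],
                                 a=1.0, b=1.0, n_units=50)
    print(grid[np.argmax(dens)])
```

The check in `mixture_terms` makes the caution about the improper prior concrete: with \(a=b=0\) the sketch refuses any sample for which \(t=1\), since one Gamma component would then have zero shape.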
Cite this article
Gianola, D., Wu, XL., Manfredi, E. et al. A non-parametric mixture model for genome-enabled prediction of genetic value for a quantitative trait. Genetica 138, 959–977 (2010). https://doi.org/10.1007/s10709-010-9478-4
DOI: https://doi.org/10.1007/s10709-010-9478-4