A robust Bayesian genome-based median regression model

Abstract

Key message

Current genome-enabled prediction models assume normally distributed errors, which makes them sensitive to outliers. We propose a model whose errors are assumed to follow a Laplace distribution, which deals better with outliers.

Abstract

Current genome-enabled prediction models use regressions that fit the expected value (mean) of a response variable with errors assumed to be normally distributed, which often makes them sensitive to outliers, whether genetic or environmental. For this reason, we propose a robust Bayesian genome median regression (BGMR) model that fits regressions to the median of the response distribution, with errors assumed to follow a Laplace distribution, to deal better with outliers. The BGMR model was fitted under a Bayesian framework with Markov chain Monte Carlo sampling using a location–scale mixture representation of the Laplace distribution. The BGMR model was applied to two simulated and two real genomic data sets, and we compared its prediction performance with that of a conventional genomic best linear unbiased prediction (GBLUP) model and the Laplace maximum a posteriori (LMAP) method. The prediction accuracies of BGMR were higher than those of the GBLUP and LMAP methods when there were outliers. The BGMR model could be useful to breeders who need to predict and select genotypes based on data with unknown outliers.

References

  1. Crossa J, de los Campos G, Pérez P, Gianola D, Burgueño J, Araus JL, Makumbi D, Singh RP, Dreisigacker S, Yan J, Arief V, Banziger M, Braun H-J (2010) Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 186(2):713–724. https://doi.org/10.1534/genetics.110.118521

  2. de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL (2013) Whole genome regression and prediction methods applied to plant and animal breeding. Genetics 193:327–345

  3. Edgeworth FY (1887) On observations relating to several quantities. Hermathena 6:279–285

  4. Feng C, Wang H, Lu N, Chen T, He H, Lu Y, Tu XM (2014) Log-transformation and its implications for data analysis. Shanghai Arch Psychiatry 26(2):105–109. https://doi.org/10.3969/j.issn.1002-0829.2014.02.009

  5. Feng C, Wang H, Lu N, Tu XM (2012) Log-transformation: applications and interpretation in biomedical research. Stat Med 32:230–239. https://doi.org/10.1002/sim.5486

  6. Gianola D, de los Campos G, Hill WG, Manfredi E, Fernando R (2009) Additive genetic variability and the Bayesian alphabet. Genetics 183(1):347–363

  7. Gianola D, Cecchinato A, Naya H, Schön C-C (2018) Prediction of complex traits: robust alternatives to best linear unbiased prediction. Front Genet 9:195. https://doi.org/10.3389/fgene.2018.00195

  8. Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (2005) Robust statistics: the approach based on influence functions. Wiley, London

  9. Huber P (1973) Robust regression: asymptotics, conjectures, and Monte Carlo. Ann Stat 1(5):799–821

  10. Koenker R, Bassett G (1978) Regression quantiles. Econometrica 46(1):33–50

  11. Kozumi H, Kobayashi G (2011) Gibbs sampling methods for Bayesian quantile regression. J Stat Comput Simul 81(11):1565–1578

  12. Lange KL, Little RJA, Taylor JMG (1989) Robust statistical modeling using the T-distribution. J Am Stat Assoc 84:881–896

  13. Lehermeier C, Wimmer V, Albrecht T, Auinger HJ, Gianola D, Schmid VJ, Schön CC (2013) Sensitivity to prior specification in Bayesian genome-based prediction models. Stat Appl Genet Mol Biol 12(3):375–391. https://doi.org/10.1515/sagmb-2012-0042

  14. Li Z, Möttönen J, Sillanpää MJ (2015) A robust multiple-locus method for quantitative trait locus analysis of non-normally distributed multiple traits. Heredity 115(6):556–564

  15. Lourenço VM, Pires AM (2014) M-regression, false discovery rates and outlier detection with application to genetic association studies. Comput Stat Data Anal 78:33–42

  16. Lourenço VM, Pires AM, Kirst M (2011) Robust linear regression methods in association studies. Bioinformatics 27(6):815–821

  17. Lourenço VM, Rodrigues PC, Pires AM, Piepho H-P (2017) A robust DF-REML framework for variance components estimation in genetic studies. Bioinformatics 33(22):3584–3594

  18. Montesinos-López OA, Montesinos-López A, Crossa J, Toledo F, Pérez-Hernández O, Eskridge KM, Rutkoski J (2016) A genomic Bayesian multi-trait and multi-environment model. G3: Genes, Genomes, Genetics 6(9):2725–2744

  19. Nascimento M, de Resende MD, Cruz CD, Nascimento AC, Viana JM, Azevedo CF, Barroso LM (2017) Regularized quantile regression applied to genome-enabled prediction of quantitative traits. Genet Mol Res. https://doi.org/10.4238/gmr16019538

  20. Ould-Estaghvirou SB, Ogutu JO, Piepho HP (2014) Influence of outliers on accuracy estimation in genomic prediction in plant breeding. G3: Genes, Genomes, Genetics 4(12):2317–2328

  21. Park T, Casella G (2008) The Bayesian lasso. J Am Stat Assoc 103(482):681–686

  22. Pérez P, de los Campos G, Crossa J, Gianola D (2010) Genomic-enabled prediction based on molecular markers and pedigree using the BLR package in R. Plant Genome 3:106–116

  23. Pérez-Rodríguez P, de los Campos G (2014) Genome-wide regression and prediction with the BGLR statistical package. Genetics 198:483–495

  24. Pourhoseingholi A, Pourhoseingholi MA, Vahedi M, Moghimi-Dehkordi B, Maserat AS, Zali MR (2009) Relation between demographic factors and hospitalization in patients with gastrointestinal disorders, using quantile regression analysis. East Afr J Public Health 6(1):45–47

  25. Rodrigues PC, Monteiro A, Lourenço VM (2016) A robust AMMI model for the analysis of genotype-by-environment data. Bioinformatics 32(1):58–66

  26. Rousseeuw PJ (1984) Least median of squares regression. J Am Stat Assoc 79(388):871–880

  27. Seber GAF, Lee AJ (2003) Linear regression analysis, 2nd edn. Wiley, Hoboken

  28. Strandén I, Gianola D (1998) Attenuating effects of preferential treatment with Student-t mixed linear models: a simulation study. Genet Sel Evol 30:565–583

  29. Strandén I, Gianola D (1999) Mixed effects linear models with t-distributions for quantitative genetic analysis: a Bayesian approach. Genet Sel Evol 31:25–42. https://doi.org/10.1186/1297-9686-31-1-25

  30. VanRaden PM (2007) Genomic measures of relationship and inbreeding. Interbull Bull 37:33–36

  31. Yohai VJ (1987) High breakdown-point and high efficiency robust estimates for regression. Ann Stat 15(2):642–656

  32. Yu K, Moyeed A (2001) Bayesian quantile regression. Stat Probab Lett 54:437–447

Acknowledgments

We thank all scientists, field workers, and lab assistants from National Programs and CIMMYT who collected the data used in this study. We acknowledge the financial support provided by the Foundation for Research Levy on Agricultural Products (FFL) and the Agricultural Agreement Research Fund (JA) in Norway through NFR Grant 267806. We are also thankful for the financial support provided by CIMMYT CRP (maize and wheat), the Bill & Melinda Gates Foundation, as well as the USAID projects (Cornell University and Kansas State University) that financed the collection of the CIMMYT maize and wheat data analyzed in this study.

Author information

Corresponding authors

Correspondence to Osval A. Montesinos-López or Daniel Gianola.

Ethics declarations

Conflict of interest

The authors declare they do not have any conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by Mikko J. Sillanpää.

Appendices

Appendix 1: Deriving full conditional distributions for the Bayesian Laplace regression model

From Eq. (2), given the random effects and \(\varvec{u} = (u_{1} , \ldots ,u_{n} )^{\text{T}}\), the joint conditional density of the vector of responses is given by

$$\begin{aligned} f(\varvec{y}|\varvec{b}, \varvec{u},\mu , \sigma_{1}^{2} ,\sigma^{2} ) & \propto \prod\limits_{j = 1}^{n} {\frac{1}{{\sqrt {\sigma^{2} u_{j} } }}\exp \left[ { - \frac{{\left( {y_{j} - \mu - b_{j} } \right)^{2} }}{{2\left( 8 \right)\sigma^{2} u_{j} }}} \right]} \\ & \propto \exp \left[ { - \frac{1}{{2\sigma^{2} }}\left( {\varvec{Y} - 1\mu - \varvec{b}} \right)^{\text{T}} \varvec{D}_{u}^{ - 1} \left( {\varvec{Y} - 1\mu - \varvec{b}} \right)} \right]. \\ \end{aligned}$$
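The normal kernel above comes from writing each Laplace error as a scale mixture of normals. The following is a minimal simulation sketch, not the authors' code. It assumes \(e_{j} |u_{j} \sim N(0, 8\sigma^{2} u_{j})\) with \(u_{j} \sim {\text{Exp}}(1)\), which is what the factor \(8\sigma^{2} u_{j}\) in the density and the \(\exp(-u_{j})\) factor in Eq. (8) suggest. Under that assumption the marginal error is Laplace with scale \(2\sigma\), so its median should be near 0, its variance near \(8\sigma^{2}\), and its mean absolute value near \(2\sigma\). The function name and the value \(\sigma = 1.5\) are illustrative.

```python
import math
import random
import statistics


def draw_mixture_errors(n, sigma, rng):
    """Draw n errors e_j = sqrt(8 * sigma^2 * u_j) * z_j.

    Assumed mixture form: u_j ~ Exp(1) and z_j ~ N(0, 1), independent.
    Marginally, e_j is then Laplace with location 0 and scale 2 * sigma.
    """
    errors = []
    for _ in range(n):
        u = rng.expovariate(1.0)   # mixing variable u_j
        z = rng.gauss(0.0, 1.0)    # standard normal draw
        errors.append(math.sqrt(8.0 * sigma ** 2 * u) * z)
    return errors


rng = random.Random(1)
sigma = 1.5
e = draw_mixture_errors(100_000, sigma, rng)
print("median   :", statistics.median(e), "(target 0)")
print("variance :", statistics.pvariance(e), "(target", 8 * sigma ** 2, ")")
print("mean |e| :", statistics.fmean(abs(x) for x in e), "(target", 2 * sigma, ")")
```

The simulated summaries only agree with the targets up to Monte Carlo error, so no exact output is shown.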

Fully conditional for \(\mu\)

$$\begin{aligned} f\left( {\mu | {\text{ELSE}}} \right) & \propto f\left( {\varvec{y}|\varvec{b}, \varvec{u},\mu , \sigma_{1}^{2} ,\sigma^{2} } \right)f\left( {\mu |\sigma_{0}^{2} } \right) \\ & \propto \exp \left[ { - \frac{1}{{2\sigma^{2} }}(\varvec{Y} - 1\mu - \varvec{b})^{\text{T}} \varvec{D}_{u}^{ - 1} (\varvec{Y} - 1\mu - \varvec{b}) - \frac{1}{{2\sigma_{0}^{2} }}(\mu - \mu_{0} )^{2} } \right] \\ & \propto \exp \left[ { - \frac{1}{{2\tilde{\sigma }_{0}^{2} }}(\mu - \tilde{\mu }_{0} )^{2} } \right] \\ \end{aligned}$$
(4)

where \(\tilde{\sigma }_{0}^{2} = \frac{1}{{\sigma_{0}^{ - 2} + 8^{ - 1} \sum\nolimits_{j = 1}^{n} {u_{j}^{ - 1} \sigma^{ - 2} } }}\) and \(\tilde{\mu }_{0} = \tilde{\sigma }_{0}^{2} [\mu_{0} \sigma_{0}^{ - 2} + \sigma^{ - 2} 1^{\text{T}} \varvec{D}_{u}^{ - 1} (\varvec{Y} - \varvec{b})]\). Then \(\mu |{\text{ELSE}}\,\sim\,N\left( {\tilde{\mu }_{0} ,\tilde{\sigma }_{0}^{2} } \right)\).
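As an illustration of how this update might be coded, the sketch below computes \(\tilde{\mu }_{0}\) and \(\tilde{\sigma }_{0}^{2}\) and draws \(\mu\) from the resulting normal distribution. It assumes \(\varvec{D}_{u} = {\text{diag}}(8u_{1} , \ldots ,8u_{n} )\), which is the choice that makes the two expressions above agree with each other; the function names and the toy data are hypothetical.

```python
import math
import random


def mu_full_conditional(y, b, u, sigma2, mu0, sigma2_0):
    """Return (mean, variance) of mu | ELSE.

    Assumes D_u = diag(8 * u_1, ..., 8 * u_n), so observation j has
    precision 1 / (8 * sigma2 * u_j).
    """
    precisions = [1.0 / (8.0 * sigma2 * uj) for uj in u]
    post_var = 1.0 / (1.0 / sigma2_0 + sum(precisions))
    weighted_resid = sum(p * (yj - bj) for p, yj, bj in zip(precisions, y, b))
    post_mean = post_var * (mu0 / sigma2_0 + weighted_resid)
    return post_mean, post_var


def sample_mu(y, b, u, sigma2, mu0, sigma2_0, rng):
    """Draw mu from its normal full conditional."""
    mean, var = mu_full_conditional(y, b, u, sigma2, mu0, sigma2_0)
    return rng.gauss(mean, math.sqrt(var))


# Toy usage with made-up values.
rng = random.Random(0)
y = [1.0, 2.0, 3.0]
b = [0.0, 0.0, 0.0]
u = [1.0, 1.0, 1.0]
print(mu_full_conditional(y, b, u, sigma2=1.0, mu0=0.0, sigma2_0=1e6))
print(sample_mu(y, b, u, sigma2=1.0, mu0=0.0, sigma2_0=1e6, rng=rng))
```

With a very diffuse prior (large \(\sigma_{0}^{2}\)) and equal \(u_{j}\), the conditional mean approaches the simple average of \(y_{j} - b_{j}\), which is a quick sanity check on the formula.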

Fully conditional for \(\varvec{b}\)

Similarly, we have that

$$\begin{aligned} f(\varvec{b} | {\text{ELSE}}) & \propto f(\varvec{y}|\varvec{b}, \varvec{u},\mu , \sigma_{1}^{2} ,\sigma^{2} )f(\varvec{b} |\sigma_{1}^{2} ) \\ & \propto \exp \left[ { - \frac{1}{{2\sigma^{2} }}(\varvec{Y} - 1\mu - \varvec{b})^{\text{T}} \varvec{D}_{u}^{ - 1} (\varvec{Y} - 1\mu - \varvec{b}) - \frac{1}{{2\sigma_{1}^{2} }}\varvec{b}^{\text{T}} \varvec{G}^{ - 1} \varvec{b}} \right] \\ & \propto \exp \left[ { - \frac{1}{2}(\varvec{b} - \tilde{\varvec{b}})^{\text{T}} \tilde{\varvec{\varSigma }}_{1}^{ - 1} \left( { \varvec{b} - \tilde{\varvec{b}}} \right)} \right] \\ \end{aligned}$$
(5)

where \(\tilde{\varvec{\varSigma }}_{1} = (\varvec{G}^{ - 1} \sigma_{1}^{ - 2} + \sigma^{ - 2} \varvec{D}_{u}^{ - 1} )^{ - 1}\) and \(\tilde{\varvec{b}} = \sigma^{ - 2} \tilde{\varvec{\varSigma }}_{1} \varvec{D}_{u}^{ - 1} (\varvec{Y} - 1\mu )\). So \(\varvec{b}|{\text{ELSE}}\,\sim\,N(\tilde{\varvec{b}},\tilde{\varvec{\varSigma }}_{1} )\).
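The last step of Eq. (5) is a completing-the-square argument. As a sketch, write \(\varvec{r} = \varvec{Y} - 1\mu\) (a symbol introduced here only for brevity) and assume the prior \(\varvec{b}|\sigma_{1}^{2} \sim N(\varvec{0},\sigma_{1}^{2} \varvec{G})\), which is what the form of \(\tilde{\varvec{\varSigma }}_{1}\) implies. Collecting the terms that involve \(\varvec{b}\) gives

$$\begin{aligned} \frac{1}{{\sigma^{2} }}(\varvec{r} - \varvec{b})^{\text{T}} \varvec{D}_{u}^{ - 1} (\varvec{r} - \varvec{b}) + \frac{1}{{\sigma_{1}^{2} }}\varvec{b}^{\text{T}} \varvec{G}^{ - 1} \varvec{b} & = \varvec{b}^{\text{T}} \left( {\sigma^{ - 2} \varvec{D}_{u}^{ - 1} + \sigma_{1}^{ - 2} \varvec{G}^{ - 1} } \right)\varvec{b} - 2\sigma^{ - 2} \varvec{b}^{\text{T}} \varvec{D}_{u}^{ - 1} \varvec{r} + {\text{const}} \\ & = \varvec{b}^{\text{T}} \tilde{\varvec{\varSigma }}_{1}^{ - 1} \varvec{b} - 2\varvec{b}^{\text{T}} \tilde{\varvec{\varSigma }}_{1}^{ - 1} \tilde{\varvec{b}} + {\text{const}} \\ & = (\varvec{b} - \tilde{\varvec{b}})^{\text{T}} \tilde{\varvec{\varSigma }}_{1}^{ - 1} (\varvec{b} - \tilde{\varvec{b}}) + {\text{const}}, \\ \end{aligned}$$

where the second line uses \(\tilde{\varvec{\varSigma }}_{1}^{ - 1} \tilde{\varvec{b}} = \sigma^{ - 2} \varvec{D}_{u}^{ - 1} \varvec{r}\) and "const" collects terms that do not depend on \(\varvec{b}\).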

Fully conditional for \(\sigma_{1}^{2}\)

$$\begin{aligned} f\left( {\sigma_{1}^{2} | {\text{ELSE}}} \right) & \propto f(\varvec{b} |\sigma_{1}^{2} )f(\sigma_{1}^{2} ) \\ & \propto (\sigma_{1}^{2} )^{{ - \frac{{\nu_{1} + J}}{2} - 1}} \exp \left\{ { - \left( {\frac{{\varvec{b}^{\text{T}} \varvec{G}^{ - 1} \varvec{b} + S_{1} }}{{2\sigma_{1}^{2} }}} \right)} \right\} \\ & \propto \chi^{ - 2} \left( {\nu_{1} + J,\varvec{b}^{\text{T}} \varvec{G}^{ - 1} \varvec{b} + S_{1} } \right) \\ \end{aligned}$$
(6)

Fully conditional for \(\sigma^{2}\)

$$\begin{aligned} f(\sigma^{2} | {\text{ELSE}}) & \propto f\left( {\varvec{y}|\varvec{b}, \varvec{u},\mu , \sigma_{1}^{2} ,\sigma^{2} } \right)f(\sigma^{2} ) \\ & \propto (\sigma^{2} )^{{ - \frac{df + n}{2} - 1}} \exp \left[ { - \frac{{\left( {\varvec{y} - 1\mu - \varvec{b}} \right)^{\text{T}} \varvec{D}_{u}^{ - 1} \left( {\varvec{y} - 1\mu - \varvec{b}} \right) + S}}{{2\sigma^{2} }}} \right] \\ & \propto \chi^{ - 2} \left( {df + n,\left( {\varvec{y} - 1\mu - \varvec{b}} \right)^{\text{T}} \varvec{D}_{u}^{ - 1} \left( {\varvec{y} - 1\mu - \varvec{b}} \right) + S} \right) \\ \end{aligned}$$
(7)
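Both variance components therefore have scaled inverse Chi-square full conditionals. A minimal sketch of how such draws could be generated is given below, using the fact that \(S/X\) with \(X \sim \chi_{df}^{2}\) has density proportional to \(x^{ - 1 - df/2} \exp ( - S/(2x))\). The function names are hypothetical, and `sample_sigma2` assumes \(\varvec{D}_{u} = {\text{diag}}(8u_{1} , \ldots ,8u_{n} )\).

```python
import random


def sample_scaled_inv_chi2(df, scale, rng):
    """Draw from chi^{-2}(df, scale) as scale / X with X ~ chi-square(df)."""
    # A chi-square(df) variable is a Gamma with shape df / 2 and scale 2.
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return scale / chi2


def sample_sigma2(residuals, u, df, S, rng):
    """One draw of sigma^2 | ELSE following Eq. (7).

    residuals[j] = y_j - mu - b_j; assumes D_u = diag(8 * u_j).
    """
    quad = sum(r * r / (8.0 * uj) for r, uj in zip(residuals, u))
    return sample_scaled_inv_chi2(df + len(residuals), quad + S, rng)


# Toy usage with made-up residuals and mixing variables.
rng = random.Random(0)
print(sample_sigma2([1.0, -2.0, 0.5], [1.0, 0.5, 2.0], df=5.0, S=0.875, rng=rng))
```

The same helper would serve Eq. (6) by calling `sample_scaled_inv_chi2` with \(\nu_{1} + J\) degrees of freedom and scale \(\varvec{b}^{\text{T}} \varvec{G}^{ - 1} \varvec{b} + S_{1}\), where the quadratic form has to be computed separately.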

Fully conditional for \(\varvec{u}\)

$$\begin{aligned} P(\varvec{u}\left| {\text{ELSE}} \right.) & \propto f\left( {\varvec{y}|\varvec{b}, \varvec{u},\mu , \sigma_{1}^{2} ,\sigma^{2} } \right)\prod\limits_{j = 1}^{n} {f(u_{j} )} \\ & \propto \mathop \prod \limits_{j = 1}^{n} \frac{1}{{\sqrt {u_{j} } }}{ \exp }\left( { - \frac{{\left( {y_{j} - \mu - b_{j} } \right)^{2} }}{{2\left( 8 \right)\sigma^{2} u_{j} }}} \right){\text{exp(}} - u_{j} )\\ & \propto \mathop \prod \limits_{j = 1}^{n} u_{j}^{{\frac{1}{2} - 1}} { \exp }\left( { - \frac{1}{2}\left[ {\frac{{\left( {y_{j} - \mu - b_{j} } \right)^{2} }}{{8\sigma^{2} }}u_{j}^{ - 1} + 2u_{j} } \right]} \right) \\ & \propto \mathop \prod \limits_{j = 1}^{n} {\text{GIG}}\left( {\frac{1}{2},\frac{{\left( {y_{j} - \mu - b_{j} } \right)^{2} }}{{8\sigma^{2} }},2} \right) \\ \end{aligned}$$
(8)

where \({\text{GIG(}}v,a,b )\) denotes the generalized inverse Gaussian distribution with parameters \(v\), \(a\) and \(b\) (Kozumi and Kobayashi 2011).
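For \(v = 1/2\) the GIG draw can be reduced to an inverse Gaussian draw. The sketch below is one possible approach and is not taken from the paper. It assumes that \({\text{GIG}}(v,a,b)\) has density proportional to \(x^{v - 1} \exp \{ - \frac{1}{2}(a/x + bx)\}\), the form of the third line of Eq. (8). Under that assumption \(1/u_{j}\) is inverse Gaussian with mean \(\sqrt {b/a}\) and shape \(b\), which is sampled here with the Michael, Schucany and Haas transformation method. All function names are hypothetical.

```python
import math
import random


def sample_inverse_gaussian(mean, shape, rng):
    """Michael-Schucany-Haas draw from an inverse Gaussian(mean, shape)."""
    nu = rng.gauss(0.0, 1.0)
    y = nu * nu
    x = (
        mean
        + (mean * mean * y) / (2.0 * shape)
        - (mean / (2.0 * shape)) * math.sqrt(4.0 * mean * shape * y + mean * mean * y * y)
    )
    # Accept the smaller root with probability mean / (mean + x).
    if rng.random() <= mean / (mean + x):
        return x
    return mean * mean / x


def sample_u(residual, sigma2, rng):
    """Draw u_j ~ GIG(1/2, a, 2) with a = residual^2 / (8 * sigma2)."""
    a = residual * residual / (8.0 * sigma2)
    if a == 0.0:
        # Kernel reduces to u^(-1/2) * exp(-u): Gamma(shape=1/2, scale=1).
        return rng.gammavariate(0.5, 1.0)
    # 1 / u_j is inverse Gaussian with mean sqrt(2 / a) and shape 2.
    return 1.0 / sample_inverse_gaussian(math.sqrt(2.0 / a), 2.0, rng)


# Toy usage: one draw for a residual of 2 with sigma^2 = 1.
rng = random.Random(0)
print(sample_u(2.0, 1.0, rng))
```

For residuals extremely close to zero the inverse Gaussian mean becomes very large, and a production sampler would need additional numerical safeguards that this sketch omits.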

Fully conditional for missing values

$$\begin{aligned} f\left( {\varvec{y}_{\text{miss}} | {\text{ELSE}}} \right) & \propto f\left( {\varvec{y}_{\text{miss}} |\varvec{b}, \varvec{u},\mu , \sigma_{1}^{2} ,\sigma^{2} } \right) \\ & \propto N\left( {\varvec{\eta}^{*} ,\sigma^{2} \varvec{D}_{u}^{*} } \right) \\ \end{aligned}$$
(9)

where \(\varvec{\eta}^{*} = 1^{\varvec{*}} \mu + \varvec{b}^{\varvec{*}}\) is the linear predictor corresponding to the missing values in the model in Eq. (2), and \(\varvec{D}_{u}^{*}\) is the diagonal matrix that retains the elements of \(\varvec{D}_{u}\) corresponding to the missing values.
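Because \(\varvec{D}_{u}^{*}\) is diagonal, the missing responses can be drawn one at a time. A minimal sketch follows, assuming the diagonal entries of \(\varvec{D}_{u}^{*}\) are \(8u_{j}\) for the missing records; the function name and inputs are hypothetical.

```python
import math
import random


def sample_missing(mu, b_miss, u_miss, sigma2, rng):
    """Draw each missing y_j from N(mu + b_j, 8 * sigma2 * u_j)."""
    return [
        rng.gauss(mu + bj, math.sqrt(8.0 * sigma2 * uj))
        for bj, uj in zip(b_miss, u_miss)
    ]


# Toy usage: two missing records with made-up effects and mixing variables.
rng = random.Random(0)
print(sample_missing(mu=1.0, b_miss=[0.5, -0.2], u_miss=[1.0, 2.0], sigma2=0.125, rng=rng))
```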

Appendix 2: Setting hyperparameters for the prior distributions of the BGMR model

The prior mean (\(\mu_{0}\)) of the general mean (\(\mu\)) was set to the sample mean of the response in the training data, while the remaining hyperparameters of the BGMR model were set similarly to those used in the BGLR software (Pérez-Rodríguez and de los Campos 2014). These rules provide proper but weakly informative prior distributions. We partitioned the total variance–covariance of the phenotypes into two components: (1) the error and (2) the linear predictor. First, the variance of the phenotype \(y_{j}\) under the model is given by

$${\text{Var}}(y_{j} ) = {\text{Var}}(b_{j} ) + 8\sigma^{2}$$

Therefore, the average of the variance of the individuals, called total variance, is equal to

$$\frac{1}{n}\mathop \sum \limits_{j = 1}^{n} {\text{Var}}(y_{j} ) = \frac{1}{n}\mathop \sum \limits_{j = 1}^{n} {\text{Var(}}b_{j} )+ 8\sigma^{2} = \frac{1}{n}{\text{tr}}(\varvec{G})\sigma_{1}^{2} + 8\sigma^{2} = V_{1} + V_{\epsilon } .$$

Then, we set \(R_{1}^{2}\) as the proportion of the total variance (\({\mathbf{V}}_{y}\)) that is explained by lines a priori, so that \(V_{1} = R_{1}^{2} {\mathbf{V}}_{y}\), and replace \(\sigma_{1}^{2}\) in \(V_{1}\) by its prior mode, \(\frac{{S_{1} }}{{df_{1} + 2}}\). Once a value for \(df_{1}\) has been set, the scale parameter is given by

$$S_{1} = \frac{{R_{1}^{2} {\mathbf{V}}_{y} }}{{\frac{1}{n}{\text{tr(}}\varvec{G} )}}\left( {df_{1} + 2} \right).$$

By default, we set the shape parameter \(df_{1} = 5\) and \(R_{1}^{2} = 0.5\).

Similarly, once there is a value for the shape parameter of the prior distribution of \(\sigma^{2}\), \(df\), the value of the scale parameter is given by

$$S = \frac{{\left( {1 - R_{1}^{2} } \right){\mathbf{V}}_{y} }}{8}\left( {df + 2} \right)$$

where \(1 - R_{1}^{2}\) is the proportion of the total variance (\({\mathbf{V}}_{y}\)) that is explained by the error a priori. By default, we set \(df = 5.\)
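The two scale rules can be checked with a small numerical example. The sketch below is illustrative only: it takes \({\mathbf{V}}_{y}\) and \(\frac{1}{n}{\text{tr}}(\varvec{G})\) as given inputs (the values 2.0 and 1.0 are made up), uses the stated defaults \(df_{1} = df = 5\) and \(R_{1}^{2} = 0.5\), and the function name is hypothetical.

```python
def prior_scales(v_y, mean_diag_g, r2=0.5, df1=5.0, df=5.0):
    """Return (S1, S) from the hyperparameter rules above.

    v_y         : total phenotypic variance V_y
    mean_diag_g : tr(G) / n
    """
    s1 = r2 * v_y / mean_diag_g * (df1 + 2.0)
    s = (1.0 - r2) * v_y / 8.0 * (df + 2.0)
    return s1, s


s1, s = prior_scales(v_y=2.0, mean_diag_g=1.0)
print(s1, s)  # 7.0 0.875

# Plugging the prior modes back in recovers the intended split of V_y.
v1 = 1.0 * s1 / (5.0 + 2.0)     # (tr(G) / n) * mode of sigma_1^2
v_eps = 8.0 * s / (5.0 + 2.0)   # 8 * mode of sigma^2
print(v1 + v_eps)  # 2.0
```

With \(R_{1}^{2} = 0.5\), half of \({\mathbf{V}}_{y}\) is assigned a priori to the lines and half to the error, which is what the final line confirms.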

The pdf of the scaled inverse Chi-square distribution with \(df\) degrees of freedom and scale parameter \(S\), \(\chi^{ - 2} \left( {df,S} \right)\), is given by

$$f(x;\,df,S) = \frac{{\left( {\frac{S}{2}} \right)^{df/2} }}{{\Gamma (df/2)}}x^{ - 1 - df/2} \exp \left( { - \frac{S}{2x}} \right), x > 0$$

and the mean, mode, and variance of this distribution are given by \(\frac{S}{df - 2}\), \(\frac{S}{df + 2}\), and \(\frac{{2S^{2} }}{{\left( {df - 2} \right)^{2} \left( {df - 4} \right)}}\), respectively. Specifically, the prior mean, mode, and variance for the variance components are:

$$\begin{aligned} & E\left( {\sigma_{1}^{2} } \right) = \frac{{S_{1} }}{{df_{1} - 2}}, \quad {\text{Mode}}\left( {\sigma_{1}^{2} } \right) = \frac{{S_{1} }}{{df_{1} + 2}}\;{\text{and}}\;{\text{Var}}\left( {\sigma_{1}^{2} } \right) = \frac{{2S_{1}^{2} }}{{\left( {df_{1} - 2} \right)^{2} \left( {df_{1} - 4} \right)}} \\ & E\left( {\sigma^{2} } \right) = \frac{S}{df - 2},\quad {\text{Mode}}\left( {\sigma^{2} } \right) = \frac{S}{df + 2} \;{\text{and}}\;{\text{Var}}\left( {\sigma^{2} } \right) = \frac{{2S^{2} }}{{\left( {df - 2} \right)^{2} \left( {df - 4} \right)}}. \\ \end{aligned}$$
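As a worked instance, substituting the default \(df_{1} = df = 5\) into these expressions gives

$$\begin{aligned} & E\left( {\sigma_{1}^{2} } \right) = \frac{{S_{1} }}{3},\quad {\text{Mode}}\left( {\sigma_{1}^{2} } \right) = \frac{{S_{1} }}{7}\;{\text{and}}\;{\text{Var}}\left( {\sigma_{1}^{2} } \right) = \frac{{2S_{1}^{2} }}{9} \\ & E\left( {\sigma^{2} } \right) = \frac{S}{3},\quad {\text{Mode}}\left( {\sigma^{2} } \right) = \frac{S}{7}\;{\text{and}}\;{\text{Var}}\left( {\sigma^{2} } \right) = \frac{{2S^{2} }}{9}. \\ \end{aligned}$$

The mean is finite only for \(df > 2\) and the variance only for \(df > 4\), so the default of 5 keeps both finite.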

Appendix 3

(Figures a to g)

Cite this article

Montesinos-López, A., Montesinos-López, O.A., Villa-Diharce, E.R. et al. A robust Bayesian genome-based median regression model. Theor Appl Genet 132, 1587–1606 (2019). https://doi.org/10.1007/s00122-019-03303-6
