Abstract
Key message
Genomic prediction models for quantitative traits assume continuous and normally distributed phenotypes. In this research, we propose a novel Bayesian discrete lognormal regression model for count traits.
Abstract
Genomic selection is a powerful tool in modern breeding programs that uses genomic information to predict the performance of individuals and select those with desirable traits. It has revolutionized animal and plant breeding, as it allows breeders to identify the best candidates without labor-intensive and time-consuming phenotypic evaluations. While several statistical models have been developed, most of them are for continuous quantitative traits and only a few are for count responses. In this paper, we propose a discrete lognormal regression model in the Bayesian context, with a Gibbs sampler to explore the corresponding posterior distribution and make predictions. The model is applied to two wheat disease-resistance datasets and evaluated against the traditional Gaussian model and a lognormal model. The results indicate that the proposed model is a competitive and natural choice for genomic prediction of count traits.
Data availability
The genomic and phenotypic data used in this study can be downloaded from http://hdl.handle.net/11529/10575.
Acknowledgements
We are thankful for the financial support provided by the Bill & Melinda Gates Foundation [INV-003439, BMGF/FCDO, Accelerating Genetic Gains in Maize and Wheat for Improved Livelihoods (AG2MW)], the USAID projects [USAID Amend. No. 9 MTO 069033, USAID-CIMMYT Wheat/AGGMW, AGG-Maize Supplementary Project, AGG (Stress Tolerant Maize for Africa)], and the CIMMYT CRP (maize and wheat). We acknowledge the financial support provided by the Foundation for Research Levy on Agricultural Products (FFL) and the Agricultural Agreement Research Fund (JA) through the Research Council of Norway for grants 301835 (Sustainable Management of Rust Diseases in Wheat) and 320090 (Phenotyping for Healthier and more Productive Wheat Crops).
Funding
We are thankful for the financial support provided by the Bill & Melinda Gates Foundation [INV-003439, BMGF/FCDO, Accelerating Genetic Gains in Maize and Wheat for Improved Livelihoods (AG2MW)].
Author information
Contributions
AML, OAML, and SRP developed the idea, implemented the model, and wrote the manuscript. HGP, JCML, and JC assisted in writing and critically reviewing the article.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable (no human or animal data were used).
Consent to participate
The authors have declared that they have consented to participate.
Consent for publication
The authors have declared that they have consented to participate.
Additional information
Communicated by Mikko J. Sillanpää.
Appendices
Appendix 1
The observed response variable in model (1) is the result of applying the floor function to a continuous Lognormal regression model, that is, \(Y_{i} = \left\lfloor {L_{i}^{*} } \right\rfloor\), where given \({{\varvec{x}}}_{i}\) the latent variable \({L}_{i}^{*}\) follows a Lognormal distribution with parameters \({\mu }_{i}={\beta }_{0}+\sum_{j=1}^{p}{x}_{ij}{\beta }_{j}\) and \({\sigma }_{i}^{2}={\sigma }^{2}\), \(i=1,\dots ,n.\) Then, by expressing \({L}_{i}^{*}={\text{exp}}({L}_{i})\) where \({L}_{i}|{{\varvec{x}}}_{i}\sim N\left({\mu }_{i},{\sigma }^{2}\right)\), \(i=1,\dots ,n\), and by augmenting the posterior distribution of the parameters of model (1), \({\beta }_{0},{\varvec{\beta}},{\sigma }_{\beta }^{2},{\sigma }^{2}\), with latent variables \({L}_{i}\) (\({\varvec{L}}={\left({L}_{1},..,{L}_{n}\right)}^{T}\)), the joint posterior of \({\beta }_{0},{\varvec{\beta}},{\sigma }_{\beta }^{2},{\sigma }^{2}\) and \({\varvec{L}}\) is given by
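A form of this joint posterior consistent with the full conditionals stated below can be written as follows. It is a reconstruction under assumptions, not a quotation: a flat prior on \({\beta }_{0}\), independent \(N\left(0,{\sigma }_{\beta }^{2}\right)\) priors on the \({\beta }_{j}\), and prior densities \(f\left({\sigma }_{\beta }^{2}\right)\propto {\left({\sigma }_{\beta }^{2}\right)}^{-\left(1+{v}_{\beta }/2\right)}{\text{exp}}\left(-\frac{{s}_{\beta }}{2{\sigma }_{\beta }^{2}}\right)\) and \(f\left({\sigma }^{2}\right)\propto {\left({\sigma }^{2}\right)}^{-\left(1+v/2\right)}{\text{exp}}\left(-\frac{s}{2{\sigma }^{2}}\right)\). The indicator \({1}_{\{\cdot \}}\) encodes the floor function, since \(\left\lfloor {{\text{exp}}\left({l}_{i}\right)} \right\rfloor ={y}_{i}\) exactly when \(\log \left({y}_{i}\right)\le {l}_{i}<\log \left({y}_{i}+1\right)\):

\[
f\left({\beta }_{0},{\varvec{\beta}},{\sigma }_{\beta }^{2},{\sigma }^{2},{\varvec{l}}\mid {\varvec{y}}\right)\propto \left[\prod_{i=1}^{n}{1}_{\left\{\log \left({y}_{i}\right)\le {l}_{i}<\log \left({y}_{i}+1\right)\right\}}\frac{1}{\sqrt{{\sigma }^{2}}}{\text{exp}}\left(-\frac{{\left({l}_{i}-{\beta }_{0}-\sum_{j=1}^{p}{x}_{ij}{\beta }_{j}\right)}^{2}}{2{\sigma }^{2}}\right)\right]\left[\prod_{j=1}^{p}\frac{1}{\sqrt{{\sigma }_{\beta }^{2}}}{\text{exp}}\left(-\frac{{\beta }_{j}^{2}}{2{\sigma }_{\beta }^{2}}\right)\right]f\left({\sigma }_{\beta }^{2}\right)f\left({\sigma }^{2}\right) \tag{A1}
\]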
where \({\varvec{l}}={\left({l}_{1},\dots ,{l}_{n}\right)}^{T}\). From this, after simple algebraic manipulation, the full conditional posterior for \({\beta }_{0}\) is a normal distribution with mean \(\frac{1}{n}\sum_{i=1}^{n}{l}_{i}^{\left(0\right)}\) and variance \(\frac{{\sigma }^{2}}{n}\), where \({l}_{i}^{(0)}={l}_{i}-\sum_{j=1}^{p}{x}_{ij}{\beta }_{j}\).
Similarly, the full conditional posterior for each \({\beta }_{k},\) \(k=1,..,p,\) is a normal distribution with variance \(\frac{1}{{\sigma }_{\beta }^{-2}+{\sigma }^{-2}\sum_{i=1}^{n}{x}_{ik}^{2}}\) and mean \(\frac{{\sigma }^{-2}}{{\sigma }_{\beta }^{-2}+{\sigma }^{-2}\sum_{i=1}^{n}{x}_{ik}^{2}}\sum_{i=1}^{n}{l}_{i}^{(k)}{x}_{ik}\) where \({l}_{i}^{(k)}={l}_{i}-{\beta }_{0}-\sum_{\begin{array}{c}j=1\\ j\ne k\end{array}}^{p}{x}_{ij}{\beta }_{j}\).
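The two normal full conditionals above translate directly into code. The following is a minimal sketch in plain Python, not the authors' implementation. The function names (`linear_part`, `beta0_conditional`, `beta_k_conditional`, `draw_normal`), the list-of-lists layout of the marker matrix `X`, and the toy numbers are assumptions made for illustration.

```python
import math
import random


def linear_part(x_row, beta, skip=None):
    """Return sum_j x_ij * beta_j, optionally leaving out index `skip`."""
    return sum(x * b for j, (x, b) in enumerate(zip(x_row, beta)) if j != skip)


def beta0_conditional(l, X, beta, sigma2):
    """Mean and variance of the normal full conditional of beta_0."""
    n = len(l)
    # l_i^(0) = l_i - sum_j x_ij * beta_j
    partial = [l[i] - linear_part(X[i], beta) for i in range(n)]
    return sum(partial) / n, sigma2 / n


def beta_k_conditional(k, l, X, beta0, beta, sigma2, sigma2_beta):
    """Mean and variance of the normal full conditional of beta_k."""
    n = len(l)
    # Posterior precision: 1/sigma2_beta + sum_i x_ik^2 / sigma2
    precision = 1.0 / sigma2_beta + sum(X[i][k] ** 2 for i in range(n)) / sigma2
    # l_i^(k) = l_i - beta_0 - sum_{j != k} x_ij * beta_j
    cross = sum(
        (l[i] - beta0 - linear_part(X[i], beta, skip=k)) * X[i][k]
        for i in range(n)
    )
    return (cross / sigma2) / precision, 1.0 / precision


def draw_normal(mean, variance, rng):
    """Draw from N(mean, variance) using a random.Random instance."""
    return rng.gauss(mean, math.sqrt(variance))


# Toy example (hypothetical values): three lines, one marker.
rng = random.Random(1)
l = [1.0, 2.0, 3.0]           # current latent values l_i
X = [[1.0], [0.0], [-1.0]]    # marker codes x_ij
m0, v0 = beta0_conditional(l, X, beta=[0.5], sigma2=2.0)
beta0_draw = draw_normal(m0, v0, rng)
```

Returning the mean and variance separately from the random draw keeps the algebra easy to check against the expressions in the text. A production sampler would normally use vectorized residual updates rather than recomputing each sum.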
The full conditional posterior for \({\sigma }_{\beta }^{2}\) is
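Assuming independent \(N\left(0,{\sigma }_{\beta }^{2}\right)\) priors on the \({\beta }_{j}\) and a prior density for \({\sigma }_{\beta }^{2}\) proportional to \({\left({\sigma }_{\beta }^{2}\right)}^{-\left(1+{v}_{\beta }/2\right)}{\text{exp}}\left(-\frac{{s}_{\beta }}{2{\sigma }_{\beta }^{2}}\right)\) (a parametrization inferred from the update rules for \({\widetilde{v}}_{\beta }\) and \({\widetilde{s}}_{\beta }\), not stated explicitly here), a consistent form is

\[
f\left({\sigma }_{\beta }^{2}\mid {\varvec{y}},-\right)\propto {\left({\sigma }_{\beta }^{2}\right)}^{-\frac{p}{2}}{\text{exp}}\left(-\frac{\sum_{j=1}^{p}{\beta }_{j}^{2}}{2{\sigma }_{\beta }^{2}}\right){\left({\sigma }_{\beta }^{2}\right)}^{-\left(1+\frac{{v}_{\beta }}{2}\right)}{\text{exp}}\left(-\frac{{s}_{\beta }}{2{\sigma }_{\beta }^{2}}\right)={\left({\sigma }_{\beta }^{2}\right)}^{-\left(1+\frac{{v}_{\beta }+p}{2}\right)}{\text{exp}}\left(-\frac{{s}_{\beta }+\sum_{j=1}^{p}{\beta }_{j}^{2}}{2{\sigma }_{\beta }^{2}}\right),
\]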
which corresponds to the density of a scaled inverse chi-squared distribution (\({\chi }^{-2}\)), so \({\sigma }_{\beta }^{2}|{\varvec{y}},-\sim {\chi }^{-2}\left({\widetilde{v}}_{\beta },{\widetilde{s}}_{\beta }\right)\), with \({\widetilde{v}}_{\beta }={v}_{\beta }+p\) and \({\widetilde{s}}_{\beta }={s}_{\beta }+\sum_{j=1}^{p}{\beta }_{j}^{2}\). Likewise, the full conditional posterior for \({\sigma }^{2}\) is \({\sigma }^{2}|{\varvec{y}},-\sim {\chi }^{-2}\left(\widetilde{v},\widetilde{s}\right)\), with \(\widetilde{v}=v+n\) and \(\widetilde{s}=s+\sum_{i=1}^{n}{\left({l}_{i}-{\beta }_{0}-\sum_{j=1}^{p}{x}_{ij}{\beta }_{j}\right)}^{2}\). Here, \(-\) denotes all parameters other than the one whose conditional distribution is being specified.
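The two variance updates can be sketched as follows, using only the Python standard library. The sketch assumes that \({\chi }^{-2}\left(v,s\right)\) has density proportional to \({\left({\sigma }^{2}\right)}^{-\left(1+v/2\right)}{\text{exp}}\left(-\frac{s}{2{\sigma }^{2}}\right)\), which matches the additive update of the scale parameter; under that assumption a draw is \(s\) divided by a chi-squared variate with \(v\) degrees of freedom. The function names `draw_scaled_inv_chi2` and `draw_variances` are hypothetical.

```python
import random


def draw_scaled_inv_chi2(df, scale, rng):
    """Draw s2 with density proportional to s2^-(1 + df/2) * exp(-scale / (2 * s2))."""
    # A chi-squared variate with `df` degrees of freedom is Gamma(shape=df/2, scale=2).
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return scale / chi2


def draw_variances(l, X, beta0, beta, v_beta, s_beta, v, s, rng):
    """Draw (sigma2_beta, sigma2) from their full conditionals."""
    n, p = len(l), len(beta)
    # Sum of squared marker effects, added to the prior scale s_beta.
    ss_beta = sum(b * b for b in beta)
    # Sum of squared residuals on the latent (log) scale, added to the prior scale s.
    ss_resid = sum(
        (l[i] - beta0 - sum(X[i][j] * beta[j] for j in range(p))) ** 2
        for i in range(n)
    )
    sigma2_beta = draw_scaled_inv_chi2(v_beta + p, s_beta + ss_beta, rng)
    sigma2 = draw_scaled_inv_chi2(v + n, s + ss_resid, rng)
    return sigma2_beta, sigma2
```

If a different parametrization of the scaled inverse chi-squared distribution is intended (for example, one where the scale enters as \(v{s}^{2}\)), the `scale` argument would need to be adjusted accordingly.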
Now, from equation (A1) the full conditional posterior for \({\varvec{L}}\) is given by
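Keeping only the factors of the joint posterior that involve \({\varvec{l}}\), namely the indicator induced by the floor function and the normal density of each \({L}_{i}\), a consistent form is

\[
f\left({\varvec{l}}\mid {\varvec{y}},-\right)\propto \prod_{i=1}^{n}{1}_{\left\{\log \left({y}_{i}\right)\le {l}_{i}<\log \left({y}_{i}+1\right)\right\}}{\text{exp}}\left(-\frac{{\left({l}_{i}-{\beta }_{0}-\sum_{j=1}^{p}{x}_{ij}{\beta }_{j}\right)}^{2}}{2{\sigma }^{2}}\right),
\]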
and from here, conditional on \({\varvec{Y}}\) and the parameters of model (1), \({L}_{1},\dots ,{L}_{n}\) are independent random variables, each following a normal distribution with parameters \({\beta }_{0}+\sum_{j=1}^{p}{x}_{ij}{\beta }_{j}\) and \({\sigma }^{2}\) truncated to the interval \(\left(\log \left({y}_{i}\right),\log \left({y}_{i}+1\right)\right)\), \(i=1,\dots ,n\), where the lower limit is \(-\infty\) when \({y}_{i}=0\).
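The latent-variable update can be sketched with inverse-CDF sampling from a truncated normal. This is one common way to draw from a truncated normal, chosen here for illustration, and it is not necessarily the method used by the authors. The names `latent_bounds` and `draw_latent` are hypothetical, and only the Python standard library is used.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def latent_bounds(y):
    """Interval for l_i implied by floor(exp(l_i)) = y."""
    lower = math.log(y) if y > 0 else -math.inf  # log(0) is taken as -infinity
    return lower, math.log(y + 1)


def draw_latent(y, mu, sigma2, rng):
    """Draw l_i from N(mu, sigma2) truncated to (log(y), log(y + 1))."""
    sd = math.sqrt(sigma2)
    lower, upper = latent_bounds(y)
    # Probability mass below each bound under the untruncated normal.
    p_lo = STD_NORMAL.cdf((lower - mu) / sd) if y > 0 else 0.0
    p_hi = STD_NORMAL.cdf((upper - mu) / sd)
    # A uniform draw between the two masses, mapped back through the inverse cdf.
    u = p_lo + (p_hi - p_lo) * rng.random()
    # Keep u strictly inside (0, 1) so inv_cdf is defined in extreme tails.
    u = min(max(u, 1e-15), 1.0 - 1e-15)
    value = mu + sd * STD_NORMAL.inv_cdf(u)
    # Guard against rounding pushing the draw outside the interval.
    return min(max(value, lower), upper)


# Toy example (hypothetical values): counts 0, 3 and 12 with a common mean.
rng = random.Random(7)
latent = [draw_latent(y, mu=1.0, sigma2=0.25, rng=rng) for y in (0, 3, 12)]
```

Inverse-CDF sampling loses accuracy when the interval lies far in the tail of the normal distribution. A rejection-based truncated normal sampler would be a more robust choice in that situation.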
Appendix 2
About this article
Cite this article
Montesinos-López, A., Gutiérrez-Pulido, H., Ramos-Pulido, S. et al. Bayesian discrete lognormal regression model for genomic prediction. Theor Appl Genet 137, 21 (2024). https://doi.org/10.1007/s00122-023-04526-4
DOI: https://doi.org/10.1007/s00122-023-04526-4