Theoretical and Applied Genetics, Volume 132, Issue 10, pp 2781–2792

Training set determination for genomic selection

  • Jen-Hsiang Ou
  • Chen-Tuo Liao (corresponding author)
Original Article


Key message

A new optimality criterion, derived from Pearson’s correlation between the genomic estimated breeding values (GEBVs) and the phenotypic values of a test set, is proposed for determining a training set for genomic selection. R functions are provided to generate the optimal training set.


For a specified test set, we develop a highly efficient algorithm to determine an optimal subset from a large candidate set in which the individuals have been genotyped but not yet phenotyped. The chosen subset serves as a training set to be phenotyped, and a genomic selection (GS) model is then built from its phenotype and genotype data. In this study, we consider an additive-effects whole-genome regression model and adopt ridge regression to estimate the marker effects in the GS model. The resulting GS model is then used to predict genomic estimated breeding values (GEBVs) for the individuals of the test set, which have only been genotyped. We propose a new optimality criterion to determine the required training set, derived directly from Pearson’s correlation between the GEBVs and the phenotypic values of the test set. Pearson’s correlation is the standard measure of the prediction accuracy of a GS model. Our proposed methods can be applied to data with varying degrees of population structure. All the R functions for implementing our training set determination algorithms are available in the R package TSDFGS. The algorithms are illustrated with two datasets that have strong (rice genome dataset) and mild (wheat genome dataset) population structures. Our methods are shown to be advantageous over existing ones, mainly because they make full use of the genomic relationship between the test set and the training set by taking into account both the variance and the bias in predicting GEBVs.
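The workflow described above (fit a ridge regression on a phenotyped training set, predict GEBVs for a genotyped-only test set, then score the prediction with Pearson’s correlation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are simulated, the marker coding (-1/0/1), the penalty `lam = 1.0`, and the helper names `ridge_fit`, `predict_gebv`, and `pearson` are hypothetical, and the training set is drawn at random as a placeholder. It does not implement the proposed optimality criterion and does not use the TSDFGS API.

```python
import random
from statistics import mean


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_fit(X, y, lam):
    """Ridge estimate of marker effects on column-centered markers.

    Returns (column means of X, marker effects), where the effects solve
    (Xc'Xc + lam * I) beta = Xc'(y - mean(y)).
    """
    p = len(X[0])
    col_means = [mean(row[j] for row in X) for j in range(p)]
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    y_mean = mean(y)
    yc = [v - y_mean for v in y]
    XtX = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * v for r, v in zip(Xc, yc)) for i in range(p)]
    return col_means, solve(XtX, Xty)


def predict_gebv(X_new, col_means, beta):
    """GEBVs (up to an additive constant) for genotyped-only individuals."""
    return [sum((x - m) * b for x, m, b in zip(row, col_means, beta))
            for row in X_new]


def pearson(a, b):
    """Pearson's correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5


# Simulated data (assumption): 60 individuals, 15 markers coded -1/0/1.
random.seed(1)
n, p = 60, 15
X = [[random.choice((-1, 0, 1)) for _ in range(p)] for _ in range(n)]
true_beta = [random.gauss(0, 1) for _ in range(p)]
y = [sum(x * b for x, b in zip(row, true_beta)) + random.gauss(0, 1)
     for row in X]

test_idx = list(range(10))                # specified test set
candidates = list(range(10, n))           # genotyped candidate set
train_idx = random.sample(candidates, 30) # placeholder: random, NOT optimized

col_means, beta = ridge_fit([X[i] for i in train_idx],
                            [y[i] for i in train_idx], lam=1.0)
gebv = predict_gebv([X[i] for i in test_idx], col_means, beta)
r = pearson(gebv, [y[i] for i in test_idx])
print("prediction accuracy (Pearson's r):", round(r, 3))
```

In this sketch the test-set phenotypes are available only because the data are simulated. In the setting of the article the test set is genotyped but not phenotyped, which is why a criterion computable from genotypes alone is needed to choose `train_idx` before any phenotyping takes place.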



This research was supported by the Ministry of Science and Technology Taiwan (Grant No.: MOST 106-2118-M-002-002-MY2).

Compliance with ethical standards

Conflict of interest

We have no conflicts of interest to disclose.

Supplementary material

Supplementary material 1 (DOCX): 122_2019_3387_MOESM1_ESM.docx



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Agronomy, National Taiwan University, Taipei, Taiwan
