Training set determination for genomic selection

  • Original Article
Theoretical and Applied Genetics


Key message

A new optimality criterion is proposed to determine a training set for genomic selection, which is derived from Pearson’s correlation between GEBVs and phenotypic values of a test set. R functions are provided to generate the optimal training set.


For a specified test set, we develop a highly efficient algorithm to determine an optimal subset from a large candidate set in which the individuals have been genotyped but not yet phenotyped. The chosen subset serves as a training set to be phenotyped, and a genomic selection (GS) model is then built from its phenotype and genotype data. In this study, we consider the additive-effects whole-genome regression model and adopt ridge regression estimation for marker effects in the GS model. The resulting GS model is then employed to predict genomic estimated breeding values (GEBVs) for the individuals of the test set, which have been genotyped only. We propose a new optimality criterion to determine the required training set, which is derived directly from Pearson’s correlation between GEBVs and phenotypic values of the test set. Pearson’s correlation is the standard measure for prediction accuracy of a GS model. Our proposed methods can be applied to data with varying degrees of population structure. All the R functions for implementing our training set determination algorithms are available from the R package TSDFGS. The algorithms are illustrated with two datasets, which have strong (rice genome dataset) and mild (wheat genome dataset) population structures. Our methods are shown to be advantageous over existing ones, mainly because they fully use the genomic relationship between the test set and the training set by taking into account both the variance and bias for predicting GEBVs.

References


  • Akdemir D (2014) STPGA: selection of training populations by genetic algorithm. R package version 1.0.

  • Akdemir D, Isidro-Sanchez J (2019) Design of training populations for selective genotyping in genomic prediction. Sci Rep 9:1446

  • Akdemir D, Sanchez JI, Jannink JL (2015) Optimization of genomic selection training populations with a genetic algorithm. Genet Sel Evol 47:38

  • Falconer DS, Mackay TF (1996) Introduction to quantitative genetics. Prentice Hall, London

  • Gianola D (2013) Priors in whole-genome regression: the Bayesian alphabet returns. Genetics 194:573–596

  • Habier D, Fernando RL, Dekkers JCM (2007) The impact of genetic relationship information on genomic-assisted breeding values. Genetics 177:2389–2397

  • Heffner EL, Sorrells ME, Jannink JL (2009) Genomic selection for crop improvement. Crop Sci 49:1–12

  • Henderson CR (1977) Best linear unbiased prediction of breeding values not in the model for records. J Dairy Sci 60:783–787

  • Isidro J, Jannink JL, Akdemir D et al (2015) Training set optimization under population structure in genomic selection. Theor Appl Genet 128:145–158

  • Laloë D (1993) Precision and information in linear models of genetic evaluation. Genet Sel Evol 25:557

  • Lorenz AJ, Smith KP, Jannink JL (2012) Potential and optimization of genomic selection for Fusarium head blight resistance in six-row barley. Crop Sci 52:1609–1621

  • Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829

  • Ou JH, Liao CT (2019) TSDFGS: training set determination for genomic selection. R package version 1.0.0

  • Piepho HP (2009) Ridge regression and extensions for genomewide selection in maize. Crop Sci 49:1165–1176

  • Rincent R, Laloë D, Nicolas S et al (2012) Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192:715–728

  • Rincent R, Charcosset A, Moreau L (2017) Predicting genomic selection efficiency to optimize calibration set and to assess prediction accuracy in highly structured populations. Theor Appl Genet 130:2231–2247

  • Searle SR (1982) Matrix algebra useful for statistics. Wiley, New York

  • Tanaka R, Iwata H (2018) Bayesian optimization for genomic selection: a method for discovering the best genotype among a large number of candidates. Theor Appl Genet 131:93–105

  • VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91:4414–4423

  • Whitley D (1994) A genetic algorithm tutorial. Stat Comput 4:65–85

  • Wimmer V, Lehermeier C, Albrecht T et al (2013) Genome-wide prediction with different genetic architecture through efficient variable selection. Genetics 195:573–587

  • Xavier A, Muir WM, Craig B, Rainey KM (2016) Walking through the statistical black boxes of plant breeding. Theor Appl Genet 129:1933–1949

  • Zhao K, Tung CW, Eizenga GC et al (2011) Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nat Commun 2:467



This research was supported by the Ministry of Science and Technology Taiwan (Grant No.: MOST 106-2118-M-002-002-MY2).

Author information

Corresponding author

Correspondence to Chen-Tuo Liao.

Ethics declarations

Conflict of interest

We have no conflicts of interest to disclose.

Additional information

Communicated by Hiroyoshi Iwata.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 396 kb)

Appendix: The derivation of r-score

Let \( \varvec{w}_{1} = \varvec{y}_{0} - \bar{y}_{0} {\mathbf{1}}_{{n_{0} }} \) and \( \varvec{w}_{2} = \varvec{g}_{0} - \bar{g}_{0} {\mathbf{1}}_{{n_{0} }} \). Then Pearson’s correlation between \( \varvec{y}_{0} \) and \( \varvec{g}_{0} \) can be expressed in matrix form as

$$ r = \frac{{\varvec{w}_{1}^{{ \top }} \varvec{w}_{2} }}{{\sqrt {\left( {\varvec{w}_{1}^{{ \top }} \varvec{w}_{1} } \right)\left( {\varvec{w}_{2}^{{ \top }} \varvec{w}_{2} } \right)} }}. $$
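The matrix form above can be checked numerically. The following is a minimal Python sketch, not part of the authors' R package; the function name `pearson_matrix_form` and the simulated vectors are illustrative assumptions.

```python
import numpy as np

def pearson_matrix_form(y0, g0):
    """Pearson's correlation computed from the centered vectors w1 and w2."""
    y0 = np.asarray(y0, dtype=float)
    g0 = np.asarray(g0, dtype=float)
    w1 = y0 - y0.mean()   # w1 = y0 - mean(y0) * 1
    w2 = g0 - g0.mean()   # w2 = g0 - mean(g0) * 1
    return (w1 @ w2) / np.sqrt((w1 @ w1) * (w2 @ w2))

# Simulated stand-ins for test-set phenotypes and GEBVs (hypothetical data)
rng = np.random.default_rng(0)
y0 = rng.normal(size=20)
g0 = 0.5 * y0 + rng.normal(size=20)

r_matrix = pearson_matrix_form(y0, g0)
r_numpy = np.corrcoef(y0, g0)[0, 1]   # reference value from NumPy
```

Under these assumptions, `r_matrix` and `r_numpy` should agree up to floating-point error.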

We now consider a natural surrogate for the expectation of \( r \), obtained by replacing \( \varvec{w}_{1}^{{ \top }} \varvec{w}_{2} \), \( \varvec{w}_{1}^{{ \top }} \varvec{w}_{1} \) and \( \varvec{w}_{2}^{{ \top }} \varvec{w}_{2} \) with their respective expectations, yielding

$$ E\left( {\tilde{r}} \right) = \frac{{E\left( {\varvec{w}_{1}^{{ \top }} \varvec{w}_{2} } \right)}}{{\sqrt {E\left( {\varvec{w}_{1}^{{ \top }} \varvec{w}_{1} } \right)E\left( {\varvec{w}_{2}^{{ \top }} \varvec{w}_{2} } \right)} }}. $$

According to the identity (Searle 1982, p. 355),

$$ E\left( {\varvec{v}_{1}^{{ \top }} \varvec{v}_{2} } \right) = {\text{Tr}}\left[ {{\text{Cov}}\left( {\varvec{v}_{1} ,\varvec{v}_{2} } \right)} \right] + E\left( {\varvec{v}_{1}^{{ \top }} } \right)E\left( {\varvec{v}_{2} } \right) $$

for two random vectors \( \varvec{v}_{1} \) and \( \varvec{v}_{2} \) of the same order, where \( {\text{Cov}}\left( {\varvec{v}_{1} , \varvec{v}_{2} } \right) \) denotes the covariance matrix between \( \varvec{v}_{1} \) and \( \varvec{v}_{2} \), we have that

$$ E\left( {\varvec{w}_{1}^{{ \top }} \varvec{w}_{1} } \right) = {\text{Tr}}\left[ {\left( {\varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} } \right)} \right]\sigma_{e}^{2} +\varvec{\beta}^{{ \top }} \varvec{X}_{0}^{{ \top }} \left( {\varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} } \right)\varvec{X}_{0}\varvec{\beta}; \tag{8} $$
$$ E\left( {\varvec{w}_{2}^{{ \top }} \varvec{w}_{2} } \right) = {\text{Tr}}\left[ {\varvec{A}^{{ \top }} \varvec{X}_{0}^{{ \top }} \left( {\varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} } \right)\varvec{X}_{0} \varvec{A}} \right]\sigma_{e}^{2} +\varvec{\beta}^{{ \top }} \varvec{X}^{{ \top }} \varvec{A}^{{ \top }} \varvec{X}_{0}^{{ \top }} \left( {\varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} } \right)\varvec{X}_{0} \varvec{AX\beta }; \tag{9} $$
$$ E\left( {\varvec{w}_{1}^{{ \top }} \varvec{w}_{2} } \right) = \varvec{\beta}^{{ \top }} \varvec{X}_{0}^{{ \top }} \left( {\varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} } \right)\varvec{X}_{0} \varvec{AX\beta }. \tag{10} $$
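The identity from Searle (1982) that underlies these three expectations holds exactly for any distribution with finite second moments, so it can be verified on a finite population. The Python sketch below treats a simulated sample as an equally weighted population; the sizes `m` and `k` and the way `V1` and `V2` are generated are arbitrary assumptions made for illustration. It also hints at why the cross-product expectation has no \( \sigma_{e}^{2} \) term: if the test-set and training-set residuals are independent (an assumption stated here, not in this appendix), the cross-covariance of \( \varvec{w}_{1} \) and \( \varvec{w}_{2} \) vanishes and only the product of means remains.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 500, 4   # m equally likely outcomes, random vectors of order k (hypothetical)

# Rows are outcomes of v1 and v2; v2 is deliberately correlated with v1
V1 = rng.normal(loc=1.0, size=(m, k))
V2 = 0.3 * V1 + rng.normal(loc=-2.0, size=(m, k))

# Left-hand side: E(v1' v2) under the empirical distribution
lhs = np.mean(np.sum(V1 * V2, axis=1))

# Right-hand side: Tr[Cov(v1, v2)] + E(v1)' E(v2)
mu1, mu2 = V1.mean(axis=0), V2.mean(axis=0)
cross_cov = (V1 - mu1).T @ (V2 - mu2) / m   # k x k population cross-covariance
rhs = np.trace(cross_cov) + mu1 @ mu2
```

Because the cross-covariance divides by `m` rather than `m - 1`, `lhs` and `rhs` should match up to rounding error rather than only approximately.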

If a matrix \( \varvec{H} \) can be factorized as \( \varvec{H} = \varvec{RR}^{{ \top }} \) for some matrix \( \varvec{R} \), then \( \varvec{H} \) is nonnegative definite (Searle 1982, p. 206). Furthermore, \( h_{ii} \ge 0 \) for all \( i \), where \( h_{ii} \) denotes the \( i \)th diagonal element of \( \varvec{H} \). Thus, the quadratic form \( \varvec{\beta}^{{ \top }} \varvec{H\beta } \) is proportional to \( \sum h_{ii} \beta_{i}^{2} \), hence proportional to \( {\text{Tr}}\left( \varvec{H} \right) = \sum h_{ii} \) for any value of \( \beta_{i} \). Based on the above result, we have that \( E\left( {\varvec{w}_{1}^{{ \top }} \varvec{w}_{1} } \right) \) of Eq. (8) is proportional to

$$ q_{1} = \left( {n_{0} - 1} \right) + {\text{Tr}}\left[ {\varvec{X}_{0}^{{ \top }} \left( {\varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} } \right)\varvec{X}_{0} } \right]. $$

Also, \( E\left( {\varvec{w}_{2}^{{ \top }} \varvec{w}_{2} } \right) \) of Eq. (9) is an increasing function of

$$ q_{2} = {\text{Tr}}\left[ {\varvec{A}^{{ \top }} \varvec{X}_{0}^{{ \top }} \left( {\varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} } \right)\varvec{X}_{0} \varvec{A}} \right] + {\text{Tr}}\left[ {\varvec{X}^{{ \top }} \varvec{A}^{{ \top }} \varvec{X}_{0}^{{ \top }} \left( {\varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} } \right)\varvec{X}_{0} \varvec{AX}} \right]. $$

From Eq. (10), we consider

$$ \begin{aligned} & {\text{Tr}}\left[ {\varvec{X}_{0}^{{ \top }} \left( {\varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} } \right)\varvec{X}_{0} \varvec{AX}} \right] \\ & = {\text{Tr}}\left[ {\varvec{X}_{0}^{{ \top }} \left( {\varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} } \right)\varvec{X}_{0} \varvec{X}^{{ \top }} \left( {\varvec{XX}^{{ \top }} + \lambda \varvec{I}_{n} } \right)^{ - 1} \varvec{X}} \right] \\ & = {\text{Tr}}\left[ {\varvec{RR}^{{ \top }} } \right] \\ \end{aligned} $$

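where \( \varvec{R} = \left( {\varvec{XX}^{{ \top }} + \lambda \varvec{I}_{n} } \right)^{{ - \frac{1}{2}}} \varvec{XX}_{0}^{{ \top }} \left( {\varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} } \right) \) and \( \varvec{B}^{{ - \frac{1}{2}}} \) denotes the symmetric matrix satisfying \( \varvec{B}^{{ - \frac{1}{2}}} \varvec{B}^{{ - \frac{1}{2}}} = \varvec{B}^{ - 1} \). The last equality follows from the cyclic property of the trace and the idempotency of \( \varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} \). So \( E\left( {\varvec{w}_{1}^{{ \top }} \varvec{w}_{2} } \right) \) is proportional to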
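The trace factorization can be confirmed numerically. This is a hedged Python sketch under assumed inputs: the dimensions, the ridge parameter `lam`, and the random marker matrices are hypothetical, and \( \varvec{A} = \varvec{X}^{{ \top }} \left( {\varvec{XX}^{{ \top }} + \lambda \varvec{I}_{n} } \right)^{ - 1} \) is read off from the expansion above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n0, p, lam = 8, 5, 12, 1.5        # hypothetical sizes and ridge parameter
X = rng.normal(size=(n, p))          # training-set marker matrix
X0 = rng.normal(size=(n0, p))        # test-set marker matrix

C = np.eye(n0) - np.ones((n0, n0)) / n0   # centering matrix I - J_bar
B = X @ X.T + lam * np.eye(n)             # symmetric positive definite
A = X.T @ np.linalg.inv(B)                # assumed ridge operator, p x n

# Symmetric inverse square root of B from its eigendecomposition
evals, evecs = np.linalg.eigh(B)
B_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T

R = B_inv_half @ X @ X0.T @ C             # n x n0
lhs = np.trace(X0.T @ C @ X0 @ A @ X)     # Tr[X0'(I - J_bar) X0 A X]
rhs = np.trace(R @ R.T)                   # Tr[R R']
```

The two traces should coincide up to rounding error, and `rhs` is a sum of squares, so it cannot be negative.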

$$ q_{12} = {\text{Tr}}\left[ {\varvec{X}_{0}^{{ \top }} \left( {\varvec{I}_{{n_{0} }} - \bar{\varvec{J}}_{{n_{0} }} } \right)\varvec{X}_{0} \varvec{AX}} \right]. $$

Consequently, it is reasonable to define the r-score as \( \frac{{q_{12} }}{{\sqrt {q_{1} q_{2} } }} \).
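To make the definition concrete, the three trace expressions can be evaluated directly for a candidate training set. The sketch below is written in Python for illustration and is not the interface of the TSDFGS package; the function name `r_score`, the default `lam`, and the simulated marker codes are all assumptions.

```python
import numpy as np

def r_score(X, X0, lam=1.0):
    """Evaluate q12 / sqrt(q1 * q2) from the trace expressions above.

    X   : (n, p) marker matrix of a candidate training set
    X0  : (n0, p) marker matrix of the test set
    lam : ridge parameter lambda (the default is an arbitrary placeholder)
    """
    X = np.asarray(X, dtype=float)
    X0 = np.asarray(X0, dtype=float)
    n, n0 = X.shape[0], X0.shape[0]
    C = np.eye(n0) - np.ones((n0, n0)) / n0   # I - J_bar
    B = X @ X.T + lam * np.eye(n)             # symmetric, so solve(B, X).T = X' B^-1
    A = np.linalg.solve(B, X).T               # assumed A = X'(XX' + lam I)^-1, p x n
    M = X0.T @ C @ X0                         # p x p, shared by all three terms
    AX = A @ X                                # p x p
    q1 = (n0 - 1) + np.trace(M)
    q2 = np.trace(A.T @ M @ A) + np.trace(AX.T @ M @ AX)
    q12 = np.trace(M @ AX)
    return q12 / np.sqrt(q1 * q2)

# Compare two candidate training sets for one test set (hypothetical data)
rng = np.random.default_rng(3)
pool = rng.choice([-1.0, 0.0, 1.0], size=(30, 50))   # simulated marker codes
X0 = pool[:10]
score_a = r_score(pool[10:20], X0)
score_b = r_score(pool[20:30], X0)
```

Under this criterion the candidate set with the larger score would be preferred. How candidate subsets are searched is not covered in this appendix, so the sketch only scores given subsets.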

Cite this article

Ou, JH., Liao, CT. Training set determination for genomic selection. Theor Appl Genet 132, 2781–2792 (2019).
