## 5.1 The Linear Genomic Selection Index

### 5.1.1 Basic Conditions for Constructing the LGSI

Conditions described in Chap. 4 (Sect. 4.1.1) for constructing a valid linear molecular selection index (LMSI), are also necessary for the linear genomic selection index (LGSI); however, in addition to those conditions, the LGSI also requires:

1. 1.

All marker effects to be estimated simultaneously in the training population.

2. 2.

The estimated marker effects to be used in subsequent selection cycles to obtain GEBVs that are predictors of the individual breeding values in the testing population (candidates for selection) for which there is only marker information.

3. 3.

The GEBV values to be composed entirely of the additive genetic effects.

4. 4.

Phenotypes to be used to estimate all marker effects in the training population, not to make selections in the testing population (Heffner et al. 2009; Lorenz et al. 2011).

### 5.1.2 Genomic Breeding Values and Marker Effects

The breeding value (gi) is the average additive effects of the genes an individual receives from both parents; thus, it is a function of the genes transmitted from parents to progeny and is the only component that can be selected and, therefore, the main component of interest in breeding programs (Mrode 2005). The ith phenotypic value (yi) can be denoted as yi = gi + ei, where gi is the breeding value and ei the residual. Basic assumptions for gi and ei are: both gi and ei have normal distribution with expectation equal to zero and variance $${\sigma}_{g_i}^2$$ and $${\sigma}_{e_i}^2$$ respectively. This means that yi = μi + gi + ei is a linear mixed model (Mrode 2005; Searle et al. 2006), where μi is the mean of yi.

Let $${\mathbf{y}}_i^{\prime }=\left[{y}_{i1}\kern0.5em {y}_{i2}\kern0.5em \cdots \kern0.5em {y}_{in}\right]$$ be a vector 1 × n of observations in the ith trait and let $${\mathbf{g}}_i^{\prime }=\left[{g}_{i1}\kern0.5em {g}_{i2}\kern0.5em \cdots \kern0.5em {g}_{in}\right]$$ be a vector 1 × n of unobservable breeding values associated with yi; then yi can be written as

$${\mathbf{y}}_i=\mathbf{1}{\mu}_i+{\mathbf{Zg}}_i+{\mathbf{e}}_i,$$
(5.1)

where μi is the mean of the ith trait, 1 is a vector n × 1 of 1s, Z is a design matrix of 0s and 1s, gi ~ MVN (0, $$\mathbf{A}{\sigma}_{g_i}^2$$) is a vector of breeding values, and ei ~ MVN (0, $${\mathbf{I}}_n{\sigma}_{e_i}^2$$) is a vector of residuals; 0 is the mean and $$\mathbf{A}{\sigma}_{g_i}^2$$ and $${\mathbf{I}}_n{\sigma}_{e_i}^2$$ the covariance matrix of gi and ei respectively; A is the numerical relationship matrix (Mrode 2005) and In an identity matrix n × n; $${\sigma}_{g_i}^2$$ and $${\sigma}_{e_i}^2$$ are the additive and residual variances associated with gi and ei; and MVN stands for multivariate normal distribution.

Suppose that A, Z, μi, $${\sigma}_{g_i}^2$$, and $${\sigma}_{e_i}^2$$ are known; then, according to Mrode (2005), the best linear unbiased predictor (BLUP) of gi can be written as

$${\widehat{\mathbf{g}}}_i={\sigma}_{g_i}^2\mathbf{A}{\mathbf{Z}}^{\prime }{\mathbf{V}}^{-1}\left({\mathbf{y}}_i-\mathbf{1}{\mu}_i\right),$$
(5.2)

where V−1 is the inverse matrix of the variance of yi, i.e., $$Var\left({\mathbf{y}}_i\right)={\sigma}_{g_i}^2\mathbf{ZA}{\mathbf{Z}}^{\prime }+{\mathbf{I}}_n{\sigma}_{e_i}^2=\mathbf{V}$$. In the context of animal breeding, Eq. (5.2) is considered a univariate linear phenotypic selection index (LPSI) (Mrode 2005) and is used to rank and select individuals as parents of the next generation in the context of one trait. Equation (5.2) can be extended to the multi-trait phenotypic selection index case, but to predict the net genetic merit (H = wg, see Chap. 2 for details) it would be necessary to construct linear combinations of the predicted values of gi associated with the traits of interest as was described in the Foreword of this book.

The vector of the individual genomic breeding values (γi) associated with the ith characteristic (i = 1, 2,…,t; t = number of traits) of the candidates for selection can be written as

$${\boldsymbol{\upgamma}}_i={\mathbf{Xu}}_i,$$
(5.3)

where X is an n × m matrix (n = number of observations and m = number of markers in the population) of coded marker values (2 − 2p, 1 − 2p, and −2p for genotypes AA, Aa, and aa respectively) associated with the additive effects of the quantitative trait loci (QTL) and ui is an m × 1 vector of the additive effects of the QTL associated with markers that affect the ith trait. It is assumed that γi has MVN with mean 0 and variance $$\mathbf{G}{\sigma}_{\gamma}^2$$, i.e., γi ~ MVN (0, $$\mathbf{G}{\sigma}_{\upgamma_{\mathrm{i}}}^2$$), where $${\sigma}_{\upgamma_{\mathrm{i}}}^2$$ is the additive genomic variance of γi and G = XX′/c is the n × n additive genomic relationship matrix between genotypes; $$c=\sum \limits_{j=1}^m2{p}_j\left(1-{p}_j\right)$$ in an F2 population, and $$c=\sum \limits_{j=1}^m4{p}_j\left(1-{p}_j\right)$$ in a double haploid population; p is the frequency of allele A and 1 − p is the frequency of allele a in the jth marker (j  =  1,  2, …, m).

The additive genomic relationship matrix G = XX′/c has special properties. For example, in the asymptotic context, the expectation of matrix G is equal to the numerical relationship matrix A, i.e., E(G) = A (Habier et al. 2007; Van Raden 2008); this means that G is a particular realization of A and when the number of markers and genotypes increases in the training population, the value of G tends to concentrate around A. Thus, it can be assumed that at the limit, when the number of markers and genotypes is very high, G = A (Cerón-Rojas and Sahagún-Castellanos 2016).

The vector of genomic breeding values (Eq. 5.3) has a similar function in genomic selection as gi in the phenotypic selection context. In addition, gi can be written as gi = γi + ηi, where ηi = gi − γi (Gianola et al. 2003). Also, note that

$$Cov\left({\mathbf{g}}_i,{\boldsymbol{\upgamma}}_i\right)={\sigma}_{\upgamma_{\mathrm{i}}}^2,$$
(5.4)

i.e., the covariance between γi and gi is equal to the variance of γi (Dekkers 2007).

Let $${\mathbf{y}}_i^{\prime }=\left[{y}_{i1}\kern0.5em {y}_{i2}\kern0.5em \cdots \kern0.5em {y}_{in}\right]$$ be a vector 1 × n of observation of the ith trait in the training population and let $${\boldsymbol{\upgamma}}_i^{\prime }=\left[{\gamma}_{i1}\kern0.5em {\gamma}_{i2}\kern0.5em \cdots \kern0.5em {\gamma}_{in}\right]$$ be a vector 1 × n of unobservable genomic breeding values associated with yi; then, yi can also be written as

$${\mathbf{y}}_i=\mathbf{1}{\mu}_i+{\mathbf{Z}\boldsymbol{\upgamma}}_i+{\boldsymbol{\upvarepsilon}}_i,$$
(5.5)

where μi is the mean of the ith trait, 1 is a vector n × 1 of 1s, Z is a design matrix, γi ~ MVN (0, $$\mathbf{G}{\sigma}_{\gamma}^2$$) and εi ~ MVN (0, $${\mathbf{I}}_n{\sigma}_{\varepsilon_i}^2$$) are vectors of genomic breeding values and of residuals respectively, and $${\sigma}_{\varepsilon_i}^2$$ is the residual variance. In, G, and $${\sigma}_{\upgamma}^2$$ were defined in Eqs. (5.2) and (5.3).

According to Eqs. (5.2) and (5.3), when μi, $${\sigma}_{\upgamma}^2$$ and $${\sigma}_{\varepsilon_i}^2$$ are known, the vector of GEBVs for the individuals with the ith trait can be obtained as

$${\widehat{\boldsymbol{\upgamma}}}_i={\sigma}_{\gamma_i}^2\mathbf{G}{\mathbf{Z}}^{\prime }{\mathbf{V}}^{-1}\left({\mathbf{y}}_i-\mathbf{1}{\mu}_i\right),$$
(5.6)

where the variance of yi should now be written as $$\mathbf{V}={\sigma}_{\gamma_i}^2\mathbf{ZG}{\mathbf{Z}}^{\prime }+{\mathbf{I}}_n{\sigma}_{\varepsilon_i}^2$$. In the context of genomic selection, Eq. (5.6) is considered a univariate LGSI and is used to rank and select individuals as parents of the next generation (Van Raden 2008; Togashi et al. 2011). Equation (5.6) is the BLUP of γi and can be extended to a multi-trait genomic selection index, but to predict the net genetic merit (H = wg), it is necessary to construct an LGSI, which is a linear combination of γi.

Although Eq. (5.6) is theoretically very important in LGSI, in practice we need to estimate the marker effects associated with all the traits of interest and to use these estimates in the testing population to obtain the GEBV of the candidates for selection. Let $${\mathbf{u}}^{\prime }=\left[{\mathbf{u}}_1^{\prime}\kern0.5em {\mathbf{u}}_2^{\prime}\kern0.5em \cdots \kern0.5em {\mathbf{u}}_t^{\prime}\right]$$ be a vector 1 × nt associated with t traits. In the univariate context, Van Raden (2008) showed that the ith vector ui of marker effects in the training population can be estimated as

$${\widehat{\mathbf{u}}}_i={c}^{-1}{\mathbf{X}}^{\prime }{\left[\mathbf{G}+\upsilon {\mathbf{I}}_n\right]}^{-1}\left({\mathbf{y}}_i-\mathbf{1}{\mu}_i\right),$$
(5.7)

where $$\upsilon =\frac{\sigma_{e_i}^2}{\sigma_{g_i}^2}$$; $${\sigma}_{g_i}^2$$, $${\sigma}_{e_i}^2$$ and the other parameters were defined earlier. According to Ceron-Rojas et al. (2015), to estimate the vector $${\mathbf{u}}^{\prime }=\left[{\mathbf{u}}_1^{\prime}\kern0.5em {\mathbf{u}}_2^{\prime}\kern0.5em \cdots \kern0.5em {\mathbf{u}}_t^{\prime}\right]$$ in the multi-trait context, Eq. (5.7) can be written as

$$\widehat{\mathbf{u}}={c}^{-1}{\mathbf{W}}_t^{\prime }{\left[\left({\mathbf{I}}_t\otimes \mathbf{G}\right)+\left(\mathbf{N}\otimes {\mathbf{I}}_n\right)\right]}^{-1}\left(\mathbf{y}-\boldsymbol{\upmu} \otimes \mathbf{1}\right),$$
(5.8)

where Wt = It ⊗ X, “⊗” denotes the Kronecker product (Schott 2005), c and X were defined in Eq. (5.3); N = RC−1, whereR and C are the residual and breeding value covariance matrices for t traits respectively; $${\mathbf{y}}^{\prime }=\left[{\mathbf{y}}_1^{\prime}\kern0.5em {\mathbf{y}}_2^{\prime}\kern0.5em \cdots \kern0.5em {\mathbf{y}}_t^{\prime}\right]$$ ~ MVN(μ, V) is a vector of size 1 × tn, with covariance matrix V = C ⊗ G + R ⊗ In; It is an identity matrix of size t × t and In was defined earlier; $${\boldsymbol{\upmu}}^{\prime }=\left[{\mu}_1\kern0.5em {\mu}_2\kern0.5em \cdots \kern0.5em {\mu}_t\right]$$ is a vector 1 × t of means associated with vector y, and 1 is a vector n × 1 of 1s. In this case, the estimator of the vector of sub-vectors of genomic breeding values $${\boldsymbol{\upgamma}}^{\prime }=\left[{\boldsymbol{\upgamma}}_1\kern0.5em {\boldsymbol{\upgamma}}_2\kern0.5em \dots \kern0.5em {\boldsymbol{\upgamma}}_t\right]$$ in the testing population can be obtained as

$$\widehat{\boldsymbol{\upgamma}}={\mathbf{W}}_t\widehat{\mathbf{u}}.$$
(5.9)

Equation (5.9) is the vector of GEBVs for the multi-trait case. Thus, in the testing population, in Eq. (5.9), only the coded values in matrix X change, whereas $$\widehat{\mathbf{u}}$$ is the same in each selection cycle. Note that to obtain Eqs. (5.7) and (5.8), we assumed that μ, C, and R are known.

We indicated that the genomic breeding values have normal distribution (Eq. 5.5). Using the simulated data described in Chap. 2, Sect. 2.8.1, in Fig. 5.1 we present the distribution of the GEBVs (Eq. 5.9) associated with traits T1 in the first (Fig. 5.1a) and the fifth (Fig. 5.1b) selection cycles in the testing population. In effect, the frequency distribution of the GEBVs approaches normal distribution in both selection cycles.

### 5.1.3 The LGSI and Its Parameters

Similar to the LPSI (Chap. 2), the objective of the LGSI is to predict the net genetic merit H = wg, where $${\mathbf{g}}^{\prime }=\left[{g}_1\kern0.5em {g}_2\kern0.5em \dots \kern0.5em {g}_t\right]$$ (t = number of traits) is a vector of unobservable true breeding values and $${\mathbf{w}}^{\prime }=\left[{w}_1\kern0.5em {w}_2\kern0.5em \dots \kern0.5em {w}_t\right]$$ is a vector of economic weights. Suppose that the genomic breeding values γi = Xui are known; then, the LGSI can be written as

$${I}_{\mathrm{G}}={\boldsymbol{\upbeta}}^{\prime}\boldsymbol{\upgamma},$$
(5.10)

where β is an unknown vector of weights.

The main advantage of the LGSI over the LPSI lies in the possibility of reducing the intervals between selection cycles (LG) by more than two thirds (Lorenz et al. 2011); thus, this parameter should be incorporated into the LGSI selection response and the expected genetic gain per trait to reflect the main advantage of the LGSI over the LPSI and the other indices. Assuming that LG = 1, in the LPSI context we wrote the selection response as RI = kIσHρHI; however, if LG ≠ 1, the LGSI selection response can be written as

$${R}_{I_G}=\frac{k_I}{L_G}\frac{\sigma_{HI_G}}{\sigma_{I_G}^2}=\frac{k_I}{L_G}{\sigma}_H{\rho}_{HI_G},$$
(5.11)

where kI is the standardized selection differential (or selection intensity) associated with the LGSI, $${\sigma}_{HI_G}$$ is the covariance between H = wg and the LGSI, $${\sigma}_{I_G}^2$$ is the variance of the LGSI, σH is the standard deviation of H, $${\rho}_{HI_G}$$ is the correlation between H and the LGSI, and LG denotes the intervals between selection cycles.

Let C and Γ be matrices of covariance of the breeding values (g) and of the genomic breeding values (γ) respectively; then, the correlation between H = wg and IG = βγ can be written as

$${\rho}_{HI_G}=\frac{{\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \boldsymbol{\upbeta}}{\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}}\sqrt{{\boldsymbol{\upbeta}}^{\prime}\boldsymbol{\Gamma} \boldsymbol{\upbeta}}},$$
(5.12)

where $${\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \boldsymbol{\upbeta} ={\sigma}_{HI_G}$$ is the covariance between H = wg and IG = βγ, $${\sigma}_H=\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}}$$ is the standard deviation of the variance of H = wg, and $${\sigma}_{I_G}=\sqrt{{\boldsymbol{\upbeta}}^{\prime}\boldsymbol{\Gamma} \boldsymbol{\upbeta}}$$ is the standard deviation of the variance of IG = βγ.

### 5.1.4 Maximizing LGSI Parameters

To maximize the genomic selection response (Eq. 5.11), suppose that kI, σH and LG are fixed and take the derivative of the natural logarithm (ln) of the correlation between H and IG (Eq. 5.12) with respect to vector β, equate the result of the derivative to the null vector, and isolate β, i.e.,

$$\frac{\partial }{\partial \boldsymbol{\upbeta}}\ln {\rho}_{HI_g}=\frac{\partial }{\partial \boldsymbol{\upbeta}}\ln \left(\frac{{\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \boldsymbol{\upbeta}}{\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}}\sqrt{{\boldsymbol{\upbeta}}^{\prime}\boldsymbol{\Gamma} \boldsymbol{\upbeta}}}\right)=\mathbf{0}.$$
(5.13)

The result is β = sw, where s = βΓβ/wΓβ is a proportional constant that does not affect the maximum value of $${\rho}_{HI_G}$$, because this is invariant to the scale change; then, assuming that β = w, the maximized LGSI selection response can be written as

$${R}_{I_G}=\frac{k_I}{L_G}\sqrt{{\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \mathbf{w}}.$$
(5.14)

Hereafter, we refer to the LGSI genomic selection response as that of Eq. (5.14). Also, because β = w, Eq. (5.12) can be written as

$${\rho}_{HI_G}=\frac{\sqrt{{\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \mathbf{w}}}{\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}}}=\frac{\sigma_{I_G}}{\sigma_H},$$
(5.15)

which is the maximized correlation between H = wg and IG = βγ, or LGSI accuracy; $${\sigma}_H=\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}}$$ is the standard deviation of the variance of H, and $${\sigma}_{I_G}=\sqrt{{\boldsymbol{\upbeta}}^{\prime}\boldsymbol{\Gamma} \boldsymbol{\upbeta}}$$ is the standard deviation of the variance of IG.

The LGSI expected genetic gain per trait ($${\mathbf{E}}_{I_G}$$) can be written as

$${\mathbf{E}}_{I_G}=\frac{k_I}{L_G}\frac{\boldsymbol{\Gamma} \mathbf{w}}{\sqrt{{\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \mathbf{w}}}.$$
(5.16)

All the terms in Eq. (5.16) were previously defined.

Let $${\lambda}_G=\frac{\rho_{HI_G}}{\rho_{HI}}$$ be LGSI efficiency versus LPSI efficiency to predict the net genetic merit, where $${\rho}_{HI_G}$$ is the LGSI accuracy and ρHI the LPSI accuracy; in percentage terms, LGSI efficiency versus LPSI efficiency for each selection cycle can be written as

$${p}_G=100\left({\lambda}_G-1\right).$$
(5.17)

According to Eq. (5.17), if pG > 0, LGSI efficiency is greater than LPSI efficiency; if pG = 0, the efficiency of both selection indices is equal, and if pG < 0, the LPSI is more efficient than the LGSI at predicting H = wg.

Equation (5.17) is useful for measuring LGSI efficiency in terms of accuracy when predicting the net genetic merit (H = wg), whereas the Technow et al. (2013) inequality measures LGSI efficiency in terms of the time needed to complete one selection cycle. In the context of the LGSI and the LPSI, the Technow inequality can be written as

$${L}_G<\frac{\rho_{HI_G}}{h_I}{L}_P,$$
(5.18)

where LG and LP denote the time required to complete one selection cycle for the LGSI and the LPSI respectively, $${\rho}_{HI_G}$$ is the LGSI accuracy, and hI is the square root of the heritability (Lin and Allaire 1977; Nordskog 1978) of the LPSI, which can be denoted as $${h}_I=\sqrt{\frac{{\mathbf{b}}^{\prime}\mathbf{Cb}}{{\mathbf{b}}^{\prime}\mathbf{Pb}}}$$ (see Chap. 2 for details). Then, assuming that the selection intensity is the same for both selection indices, if Eq. (5.18) is true, the LGSI is more efficient than the LPSI per unit of time.

### 5.1.5 Relationship Between the LGSI and LPSI Selection Responses

To obtain the relationship between $${R}_{I_G}$$ and RI in the asymptotic context, we omitted the intervals between selection cycles (LG and LI respectively) to simplify the algebra. Consider a population where the number of genotypes and markers tends to infinity; in this case, markers explain most of the true additive genetic variances and covariances. Thus, we can assume that matrices Γ and C are very similar, and at the limit, Γ = C. Now suppose that in this population the phenotypic variance–covariance matrix (P) is known and comprises matrix Γ and the variance–covariance residual matrix (R). In this case, the inverse of P can be written as P−1 = (Γ + R)−1 = Γ−1 − Γ−1(Γ−1 + R−1)−1Γ−1, where Γ−1 and R−1 are the inverses of matrices Γ and R respectively. Thus, the LPSI selection response is given by

$${R}_I={k}_I\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}}={k}_I\sqrt{{\mathbf{w}}^{\prime }{\boldsymbol{\Gamma} \mathbf{P}}^{-1}\boldsymbol{\Gamma} \mathbf{w}}={k}_I\sqrt{{\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \mathbf{w}-{\mathbf{w}}^{\prime }{\left({\boldsymbol{\Gamma}}^{-1}+{\mathbf{R}}^{-1}\right)}^{-1}\mathbf{w}},$$
(5.19)

where b = P−1Γw is the vector of coefficients of the LPSI in the asymptotic context. Note that bPb ≥ 0 and wΓw ≥ 0, i.e., bPb and wΓw are positive semi-definite, meaning that wΓw ≥ w′(Γ−1 + R−1)−1w ≥ 0; then, in the asymptotic context, $${R}_{I_G}\ge {R}_I$$. This result is not common when the number of genotypes and markers is small; however, it gives an idea of the theoretical behavior of $${R}_{I_G}$$ with respect to RI when the number of markers and genotypes is very large.

Because gq can be written as gq = γq + ηq, where ηq = gq − γq (q  =  1,  2, ⋯, t), for low numbers of markers and genotypes, the covariance genotypic matrix C can be written as C = Γ + E, where E = C − Γ; then, the inverse of matrix P can be written as P−1 = [(Γ + E) + R]−1 = (Γ + E)−1 − (Γ + E)−1[(Γ + E)−1 + R−1]−1(Γ + E)−1. In the latter case, the LPSI selection response RI can be written as

$${R}_I={k}_I\sqrt{{\mathbf{w}}^{\prime}\left(\boldsymbol{\Gamma} +\mathbf{E}\right){\mathbf{P}}^{-1}\left(\boldsymbol{\Gamma} +\mathbf{E}\right)\mathbf{w}}={k}_I\sqrt{{\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \mathbf{w}+{\mathbf{w}}^{\prime}\mathbf{Ew}-{\mathbf{w}}^{\prime }{\left[{\left(\boldsymbol{\Gamma} +\mathbf{E}\right)}^{-1}+{\mathbf{R}}^{-1}\right]}^{-1}\mathbf{w}}.$$
(5.20)

Equation (5.20) indicates that in the non-asymptotic context (low numbers of markers and genotypes), $${R}_{I_G}$$ and RI are related in three possible ways:

1. 1.

$${R}_I>{R}_{I_G}$$ if wEw > w′[(Γ + E)−1 + R−1]−1w

2. 2.

$${R}_I={R}_{I_G}$$ if wEw = w′[(Γ + E)−1 + R−1]−1w

3. 3.

$${R}_{I_G}>{R}_I$$ if wEw < w′[(Γ + E)−1 + R−1]−1w

The second and third points indicate that $${R}_{I_G}$$ may be equal to or larger than RI, even under a small number of markers, depending on the size of wEw and w′[(Γ + E)−1 + R−1]−1w. These three points explain the theoretical relationship between RI and $${R}_{I_G}$$ for a low number of markers and genotypes. When Γ = C, E = 0, and $${R}_I={k}_I\sqrt{{\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \mathbf{w}-{\mathbf{w}}^{\prime }{\left({\boldsymbol{\Gamma}}^{-1}+{\mathbf{R}}^{-1}\right)}^{-1}\mathbf{w}}$$, then $${R}_{I_G}\ge {R}_I$$.

### 5.1.6 Statistical LGSI Properties

Assuming that H and IG have joint bivariate normal distribution and that Γ, C, and w are known, the LGSI has the following properties:

1. 1.

The variance of IG ($${\sigma}_{I_G}^2$$) and the covariance between H and IG ($${\sigma}_{HI_G}$$) are equal, i.e., $${\sigma}_{I_G}^2={\sigma}_{HI_G}$$.

2. 2.

The maximized correlation between H and IG (or LGSI accuracy) is equal to $${\rho}_{HI_G}=\frac{\sigma_{I_G}}{\sigma_H}$$, where $${\sigma}_{I_G}$$ is the standard deviation of $${\sigma}_{I_G}^2$$ and σH is the standard deviation of the variance of H ($${\sigma}_H^2$$).

3. 3.

The variance of the predicted error, $$Var\left(H-{I}_G\right)=\left(1-{\rho}_{HI_G}^2\right){\sigma}_H^2$$, is minimal. Note that $$Var\left(H-{I}_G\right)={\sigma}_{I_G}^2+{\sigma}_H^2-2{\sigma}_{HI_G}$$, and when β = w, $${\sigma}_{I_G}^2={\sigma}_{HI_G}$$, whence $$Var\left(H-{I}_G\right)={\sigma}_H^2-{\sigma}_{I_G}^2=\left(1-{\rho}_{HI_G}^2\right){\sigma}_H^2$$ is minimal.

4. 4.

The total variance of H explained by IG is $${\sigma}_{I_G}^2={\rho}_{HI_G}^2{\sigma}_H^2$$. It is evident that if $${\rho}_{HI_G}=1$$, $${\sigma}_{I_G}^2={\sigma}_H^2$$, and if $${\rho}_{HI_G}=0$$, $${\sigma}_{I_G}^2=0$$. That is, the variance of H explained by IG is proportional to $${\rho}_{HI_G}$$, and when $${\rho}_{HI_G}$$ is close to 1, $${\sigma}_{I_G}^2$$ is close to $${\sigma}_H^2$$; if $${\rho}_{HI_G}$$ is close to 0, $${\sigma}_{I_G}^2$$ is close to 0.

The LGSI properties described in points 1–4 of this subsection are the same as the LPSI properties described in Chap. 2. This corroborates the LGSI as an application of the LPSI theory to the genomic selection context.

### 5.1.7 Genomic Covariance Matrix in the Training and Testing Population

To derive the LGSI theory, we assumed that the true genomic additive variance–covariance matrix Γ was known. However, in practice, we need to estimate it. In the training population, matrix Γ can be estimated by restricted maximum likelihood (REML) using phenotypic and genomic information, as described by Vattikuti et al. (2012) and Su et al. (2012). In Eqs. (2.22) to (2.24) of Chap. 2, we presented the formulas for estimating the genotypic and residual variance and covariance based on the formulas described by Lynch and Walsh (1998). Here, we present a brief description of how we can estimate the qth component (σγqq) of Γ in the training population using the REML method.

We estimated $${\sigma}_{\upgamma qq}={\sigma}_{\upgamma q}^2$$ (q, q′  =  t = number of traits) in the absence of dominance and epistatic effects, using the model yq = 1μq + q + εq, where the vector yq ~ NMV(1μq, Vq) g × 1 (g = number of genotypes in the population) had a multivariate normal distribution; 1 was a g × 1 vector of 1s, μq was the mean of the qth trait, Z was an identity matrix g × g; γq ~ NMV(0, $$\mathbf{G}{\sigma}_{\upgamma q}^2$$) was a vector of genomic breeding values, and εq ~ NMV(0, $$\mathbf{I}{\sigma}_{\varepsilon_q}^2$$) was a g × 1 vector of residuals. Matrix G = XX′/c was the genomic relationship matrix, and in an F2 population, $$c=\sum \limits_{j=1}^N2{p}_j{q}_j$$; X was a g × m matrix (m = number of markers) of the coded marker values (2 − 2p for AA, 1 − 2p for Aa, and −2p for aa) for the additive effects of the markers; p and q denote the frequency of allele A and the frequency of allele a in the jth marker (j  =  1,  2, …, m), and $${\mathbf{V}}_q=\mathbf{G}{\sigma}_{\upgamma q}^2+\mathbf{I}{\sigma}_{\varepsilon_q}^2$$.

The expectation–maximization algorithm allowed the REML for the variance components $${\sigma}_{\upgamma q}^2$$ and $${\sigma}_{\varepsilon_q}^2$$ to be computed by iterating the following equations:

$${\sigma}_{\upgamma q}^{2\left(n+1\right)}={\sigma}_{\upgamma q}^{2(n)}+\frac{{\left({\sigma}_{\upgamma q}^{2(n)}\right)}^2}{g}\left[{\mathbf{y}}_q^{\prime}\left({\mathbf{T}}^{(n)}{\mathbf{GT}}^{(n)}\right){\mathbf{y}}_q- tr\left({\mathbf{T}}^{(n)}\mathbf{G}\right)\right]$$
(5.21)

and

$${\sigma}_{\varepsilon_q}^{2\left(n+1\right)}={\sigma}_{\varepsilon_q}^{2(n)}+\frac{{\left({\sigma}_{\varepsilon_q}^{2(n)}\right)}^2}{g}\left[{\mathbf{y}}_q^{\prime}\left({\mathbf{T}}^{(n)}{\mathbf{T}}^{(n)}\right){\mathbf{y}}_q- tr\left({\mathbf{T}}^{(n)}\right)\right],$$
(5.22)

where g is the number of genotypes. After n iterations, when $${\sigma}_{\upgamma q}^{2\left(n+1\right)}$$ was very similar to $${\sigma}_{\upgamma q}^{2(n)}$$ and $${\sigma}_{\varepsilon_q}^{2\left(n+1\right)}$$ was very similar to $${\sigma}_{\varepsilon_q}^{2(n)}$$, $${\sigma}_{\upgamma q}^{2\left(n+1\right)}$$ and $${\sigma}_{\varepsilon_q}^{2\left(n+1\right)}$$ were the estimated variance components of $${\sigma}_{\upgamma q}^2$$ and $${\sigma}_{\varepsilon_q}^2$$ respectively. In Eqs. (5.21) and (5.22) tr(.) denoted the trace of the matrices within brackets; $$\mathbf{T}={\mathbf{V}}_q^{-1}-{\mathbf{V}}_q^{-1}\mathbf{1}{\left({\mathbf{1}}^{\prime }{\mathbf{V}}_q^{-1}\mathbf{1}\right)}^{-1}{\mathbf{1}}^{\prime }{\mathbf{V}}_q^{-1}$$, and $${\mathbf{V}}_q^{-1}$$ was the inverse of $${\mathbf{V}}_q=\mathbf{G}{\sigma}_{\upgamma q}^2+\mathbf{I}{\sigma}_{\varepsilon_q}^2$$. In matrix T(n), $${\mathbf{V}}_q^{-1(n)}$$ was the inverse of matrix $${\mathbf{V}}_q^{(n)}=\mathbf{G}{\sigma}_{\upgamma q}^{2(n)}+\mathbf{I}{\sigma}_{\varepsilon_q}^{2(n)}$$.

The genomic additive genetic covariance between the observations of the qth and ith traits, yq and yi ($${\sigma}_{\gamma_{\mathrm{qi}}}$$, q, i = 1, 2,…,t), can be estimated by REML. Here, we adapted Eqs. (5.21) and (5.22) using the variance of the sum of yq and yi, i.e., Var(yi + yq) = Vi + Vq + 2Ciq, where $${\mathbf{V}}_i=\mathbf{G}{\sigma}_{{\boldsymbol{\upgamma}}_{\mathrm{i}}}^2+\mathbf{I}{\sigma}_{\varepsilon_i}^2= Var\left({\mathbf{y}}_i\right)$$ is the variance of yi and $${\mathbf{V}}_q=\mathbf{G}{\sigma}_{{\boldsymbol{\upgamma}}_{\mathrm{q}}}^2+\mathbf{I}{\sigma}_{\varepsilon_q}^2= Var\left({\mathbf{y}}_q\right)$$ is the variance of yq; $$2{\mathbf{C}}_{iq}=2\mathbf{G}{\sigma}_{{\boldsymbol{\upgamma}}_{\mathrm{iq}}}+2\mathbf{I}{\sigma}_{\varepsilon_{iq}}=2 Cov\left({\mathbf{y}}_i,{\mathbf{y}}_q\right)$$ is the covariance of yq and yi, and $${\sigma}_{{\boldsymbol{\upgamma}}_{\mathrm{iq}}}$$ and $${\sigma}_{\varepsilon_{iq}}$$ are the genomic and residual covariance respectively, associated with yi and yq. Thus, one way of estimating $${\sigma}_{{\boldsymbol{\upgamma}}_{\mathrm{iq}}}$$ and $${\sigma}_{e_{iq}}$$ is by using the following equation:

$$0.5 Var\left({\mathbf{y}}_i+{\mathbf{y}}_q\right)-0.5 Var\left({\mathbf{y}}_i\right)-0.5 Var\left({\mathbf{y}}_q\right),$$
(5.23)

for which Eqs. (5.21) and (5.22) can be adapted.

If there is only marker information on the testing population, then it is not possible to estimate Γ using Eqs. (5.21) to (5.23). Another way of estimating Γ is to use the method proposed by Ceron-Rojas et al. (2015), which requires the estimated values of γq ($${\widehat{\boldsymbol{\upgamma}}}_q$$) in the cycle of interest. Let $$\widehat{\mathbf{u}}$$ be the estimator of the vector of marker effects $${\mathbf{u}}^{\prime }=\left[{\mathbf{u}}_1^{\prime}\kern0.5em {\mathbf{u}}_2^{\prime}\kern0.5em \cdots \kern0.5em {\mathbf{u}}_t^{\prime}\right]$$ for t traits obtained in the training population. We obtained the qth GEBVs (q  =  1,  2, …, t) in the lth selection cycle (l  =  1, 2, …, number of cycles) as

$${\widehat{\boldsymbol{\upgamma}}}_{ql}={\mathbf{X}}_l{\widehat{\mathbf{u}}}_q$$
(5.24)

where $${\widehat{\mathbf{u}}}_q$$ is the vector of size m × 1 of the estimated marker effects of the qth trait in the training population and Xl is a matrix of size n × m of the coded values of marker genotypes in the lth selection cycle of the testing population.

Now suppose that γq and $${\boldsymbol{\upgamma}}_{q^{\prime }}$$ have multivariate normal distribution jointly, with mean $$\mathbf{1}{\mu}_{\gamma_q}$$ and $$\mathbf{1}{\mu}_{\gamma_{q^{\prime }}}$$ respectively, and covariance matrix $$\mathbf{G}{\sigma}_{\gamma_{q{q}^{\prime }}}$$, where 1 is an n × 1 vector of 1s and G = XX′/c is the additive genomic relationship matrix. Then, $$\boldsymbol{\Gamma} =\left\{{\sigma}_{\gamma_{q{q}^{\prime }}}\right\}$$ can be estimated as

$${\widehat{\boldsymbol{\Gamma}}}_l=\left\{{\widehat{\sigma}}_{\gamma_{q{q}^{\prime }}}\right\},$$
(5.25)

where $${\widehat{\sigma}}_{\gamma_{q{q}^{\prime }}}=\frac{1}{g}{\left({\widehat{\boldsymbol{\upgamma}}}_{ql}-\mathbf{1}{\widehat{\mu}}_{\gamma_{ql}}\right)}^{\prime }{\mathbf{G}}_l^{-1}\left({\widehat{\boldsymbol{\upgamma}}}_{q^{\prime }l}-\mathbf{1}{\widehat{\mu}}_{\gamma_{q^{\prime }l}}\right)$$ is the estimated covariance between γq and $${\boldsymbol{\upgamma}}_{q^{\prime }}$$ in the lth selection cycle of the testing population; g is the number of genotypes; $${\widehat{\boldsymbol{\upgamma}}}_{ql}$$ was defined in Eq. (5.24); $${\widehat{\mu}}_{\gamma_{ql}}$$ and $${\widehat{\mu}}_{\gamma_{q^{\prime }l}}$$ are the estimated arithmetic means of the values of $${\widehat{\boldsymbol{\upgamma}}}_{ql}$$ and $${\widehat{\boldsymbol{\upgamma}}}_{q^{\prime }l}$$; 1 is a g × 1 vector of 1s and $${\mathbf{G}}_l={c}^{-1}{\mathbf{X}}_l{\mathbf{X}}_l^{\prime }$$ is the additive genomic relationship matrix in the lth selection cycle (l = 1, 2, …, number of cycles) in the testing population.

From Eq. (5.25) we can estimate the LGSI response and expected genetic gain per trait in the testing population as

$${\widehat{R}}_{I_G}=\frac{k_I}{L_G}\sqrt{{\mathbf{w}}^{\prime}\widehat{\boldsymbol{\Gamma}}\mathbf{w}}\kern1em \mathrm{and}\kern1em {\widehat{\mathbf{E}}}_{I_G}=\frac{k_I}{L_G}\frac{\widehat{\boldsymbol{\Gamma}}\mathbf{w}}{\sqrt{{\mathbf{w}}^{\prime}\widehat{\boldsymbol{\Gamma}}\mathbf{w}}},$$
(5.26)

respectively. The estimated LGSI ($${\widehat{I}}_{\mathrm{G}}$$) values in the lth selection cycle can be obtained as

$${\widehat{I}}_{\mathrm{G}}=\sum \limits_{q=1}^{\mathrm{t}}{w}_q{\widehat{\boldsymbol{\upgamma}}}_{ql},$$
(5.27)

where wq is the qth economic weight and $${\widehat{\boldsymbol{\upgamma}}}_{ql}$$ was defined in Eq. (5.24). Equation (5.27) is a vector of size g × 1 (g= number of genotypes). In practice, $${\widehat{I}}_{\mathrm{G}}$$ values are ranked to select individual genotypes with optimal GEBVs.

### 5.1.8 Numerical Examples

To estimate matrices C and R and the marker effects in the training population, we used a real maize (Zea mays) F2 population with 248 genotypes (each with two repetitions), 233 molecular markers, and three traits—grain yield (GY, ton ha−1), ear height (EHT, cm), and plant height (PHT, cm)—evaluated in one environment. The estimated matrices were $$\widehat{\mathbf{C}}=\left[\begin{array}{ccc}0.07& 0.61& 1.06\\ {}0.61& 17.93& 22.75\\ {}1.06& 22.75& 44.53\end{array}\right]$$ and $$\widehat{\mathbf{R}}=\left[\begin{array}{ccc}0.38& 0.72& 1.27\\ {}0.72& 47.14& 60.96\\ {}1.27& 60.96& 121.46\end{array}\right]$$, which were estimated by Eqs. (5.21) to (5.23) using the numerical relationship A instead of the genomic relationship matrix (G = XX′/c).

Table 5.1 presents the first 20 BLUPs of the estimated marker effects (Eq. 5.8) in the training population and the first 20 marker coded values and GEBVs (Eq. 5.9) obtained in the testing population associated with trait GY. In the testing population, there were 380 genotypes and 233 molecular markers. In this population, the estimated genomic covariance matrix $$\boldsymbol{\Gamma} =\left\{{\sigma}_{\gamma_{q{q}^{\prime }}}\right\}$$ was $$\widehat{\boldsymbol{\Gamma}}=\left[\begin{array}{ccc}0.21& 2.95& 5.00\\ {}2.95& 42.41& 71.11\\ {}5.00& 71.11& 121.53\end{array}\right]$$. The first GEBV (0.195) related to GY in Table 5.1 was obtained as 0.195 =  − 0.0003(1) − 0.0038(1) − 0.0085(0) + ⋯ − 0.03(−1). The other GEBVs can be obtained in a similar manner.

Suppose a selection intensity of 10% (kI = 1.755) and a vector of economic weights of $${\mathbf{w}}^{\prime }=\left[5\kern0.5em -0.1\kern0.5em -0.1\right]$$; then, the estimated LGSI selection response and the expected genetic gain per trait without including the interval between selection cycle is $${\widehat{R}}_{I_G}=(1.755)\sqrt{{\mathbf{w}}^{\prime}\widehat{\boldsymbol{\Gamma}}\mathbf{w}}=0.92$$ and $${\widehat{\mathbf{E}}}_{I_G}^{\prime }=(1.755)\frac{{\mathbf{w}}^{\prime}\widehat{\boldsymbol{\Gamma}}}{\sqrt{{\mathbf{w}}^{\prime}\widehat{\boldsymbol{\Gamma}}\mathbf{w}}}=\left[0.80\kern0.5em 11.41\kern0.5em 19.28\right]$$ respectively, whereas the estimated LGSI accuracy was $${\widehat{\rho}}_{HI_G}=0.48$$.

Chapter 11 presents RIndSel, a graphical unit interface that uses selection index theory to select individual candidates as parents for the next selection cycle, which can be used to obtain the results of the real numerical example described in this subsection.

To compare LGSI efficiency versus LPSI efficiency we used the simulated data described in Chap. 2, Sect. 2.8.1. According to Beyene et al. (2015), at least 4 years are required to complete one phenotypic selection cycle in maize, whereas genomic selection requires only 1.5 years. Thus, to compare LGSI efficiency versus LPSI efficiency in terms of time, we can use the Technow et al. (2013) inequality described in Eq. (5.18).

Table 5.2 presents the estimated value of Eq. (5.18) for five simulated selection cycles. The LGSI efficiency was higher than LPSI efficiency in terms of time, because the Technow et al. (2013) inequality was true in the five selection cycles. An additional result obtained by Ceron-Rojas et al. (2015) is presented in Fig. 5.2, which shows the correlation among the LGSI, the LPSI, and the true net genetic merit values in seven selection cycles. According to Fig. 5.2, the correlation between the LGSI and the true net genetic merit values was higher than the correlation between the LPSI and the true net genetic merit values for the first three selection cycles; after this cycle, the correlation between LGSI and the true net genetic merit values tended to decrease.

## 5.2 The Combined Linear Genomic Selection Index

The combined LGSI (CLGSI) developed by Dekkers (2007) is a slightly modified version of the LMSI (see Chap. 4 for details), which, instead of using the marker scores, uses the GEBVs and the phenotypic information jointly to predict the net genetic merit. The main difference between the CLGSI and the LGSI is that the CLGSI can only be used in training populations, whereas the LGSI is used in testing populations. The basic conditions for constructing a valid CLGSI include conditions for constructing the LPSI, the LMSI, and the LGSI, because the CLGSI uses GEBVs and phenotypic information jointly to predict the net genetic merit.

### 5.2.1 The CLGSI Parameters

The net genetic merit can be written in a similar manner to that in the LMSI context, that is, as

$$H={\mathbf{w}}^{\prime}\mathbf{g}+{\mathbf{w}}_2^{\prime}\boldsymbol{\upgamma} =\left[{\mathbf{w}}^{\prime}\kern0.5em {\mathbf{w}}_2^{\prime}\right]\left[\begin{array}{c}\mathbf{g}\\ {}\boldsymbol{\upgamma} \end{array}\right]={\mathbf{a}}_G^{\prime }{\mathbf{z}}_G,$$
(5.28)

where $${\mathbf{g}}^{\prime }=\left[{g}_1\kern0.5em \dots \kern0.5em {g}_t\right]$$ is the vector of breeding values, $${\mathbf{w}}^{\prime }=\left[{w}_1\kern0.5em \cdots \kern0.5em {w}_t\right]$$ is the vector of economic weights associated with breeding values, $${\mathbf{w}}_2^{\prime }=\left[{0}_1\kern0.5em \cdots \kern0.5em {0}_t\right]$$ is a null vector associated with the vector of genomic breeding values $${\boldsymbol{\upgamma}}^{\prime }=\left[{\gamma}_1\kern0.5em {\gamma}_2\kern0.5em \dots \kern0.5em {\gamma}_t\right]$$, $${\mathbf{a}}_G^{\prime }=\left[{\mathbf{w}}^{\prime}\kern0.5em {\mathbf{w}}_2^{\prime}\right]$$ and $${\mathbf{z}}_G=\left[{\mathbf{g}}^{\prime}\kern0.5em {\boldsymbol{\upgamma}}^{\prime}\right]$$.

The CLGSI can be written as

$${I}_C={\boldsymbol{\upbeta}}_{\mathbf{y}}^{\prime}\mathbf{y}+{\boldsymbol{\upbeta}}_G^{\prime}\boldsymbol{\upgamma} =\left[{\boldsymbol{\upbeta}}_{\mathbf{y}}^{\prime}\kern0.5em {\boldsymbol{\upbeta}}_G^{\prime}\right]\left[\begin{array}{c}\mathbf{y}\\ {}\boldsymbol{\upgamma} \end{array}\right]={\boldsymbol{\upbeta}}_C^{\prime }{\mathbf{t}}_C,$$
(5.29)

where $${\mathbf{y}}^{\prime }=\left[{y}_1\kern0.5em \cdots \kern0.5em {y}_t\right]$$ (t = number of traits) is the vector of phenotypic values; γ was defined earlier; $${\boldsymbol{\upbeta}}_{\mathbf{y}}^{\prime }$$ and βG are vectors of coefficients of phenotypic and genomic weight values respectively; $${\boldsymbol{\upbeta}}_C^{\prime }=\left[{\boldsymbol{\upbeta}}_{\mathbf{y}}^{\prime}\kern0.5em {\boldsymbol{\upbeta}}_G^{\prime}\right]$$ and $${\mathbf{t}}_G^{\prime }=\left[{\mathbf{y}}^{\prime}\kern0.5em {\boldsymbol{\upgamma}}^{\prime}\right]$$.

The CLGSI selection response can be written as

$${R}_C={k}_I{\sigma}_H{\rho}_{HI_C}={k}_I{\sigma}_H\frac{{\mathbf{a}}_C^{\prime }{\boldsymbol{\Psi}}_C{\boldsymbol{\upbeta}}_C}{\sqrt{{\mathbf{a}}_C^{\prime }{\boldsymbol{\Psi}}_C{\mathbf{a}}_C}{\sqrt{{\boldsymbol{\upbeta}}_C^{\prime }{\mathbf{T}}_C\boldsymbol{\upbeta}}}_C},$$
(5.30)

where kI is the standardized selection differential of the CLGSI, $${\sigma}_H^2={\mathbf{a}}_C^{\prime }{\boldsymbol{\Psi}}_C{\mathbf{a}}_C$$ and $$Var\left({I}_C\right)={\boldsymbol{\upbeta}}_C^{\prime }{\mathbf{T}}_C{\boldsymbol{\upbeta}}_C$$ are the variances of H and IC, whereas $${\mathbf{a}}_C^{\prime }{\boldsymbol{\Psi}}_C{\boldsymbol{\upbeta}}_C$$ and $${\rho}_{HI_C}$$ are the covariance and the correlation between H and IC respectively; $${\mathbf{T}}_C= Var\left[\begin{array}{c}\mathbf{y}\\ {}\boldsymbol{\upgamma} \end{array}\right]=\left[\begin{array}{cc}\mathbf{P}& \boldsymbol{\Gamma} \\ {}\boldsymbol{\Gamma} & \boldsymbol{\Gamma} \end{array}\right]$$ and $${\boldsymbol{\Psi}}_C= Var\left[\begin{array}{c}\mathbf{g}\\ {}\boldsymbol{\upgamma} \end{array}\right]=\left[\begin{array}{cc}\mathbf{C}& \boldsymbol{\Gamma} \\ {}\boldsymbol{\Gamma} & \boldsymbol{\Gamma} \end{array}\right]$$ are block matrices of the phenotypic covariance matrix, P = Var(y), the genomic covariance matrix, Γ = Var(γ), and the genetic breeding values covariance matrix, C = Var(g).

Suppose that matrices ΨC and TC are known; then the CLGSI vector of coefficients that simultaneously maximizes $${\rho}_{HI_C}$$ and RC can be written as

$${\boldsymbol{\upbeta}}_C={\mathbf{T}}_C^{-1}{\boldsymbol{\Psi}}_C{\mathbf{a}}_C,$$
(5.31)

whence the optimized CLGSI is

$${I}_C={\boldsymbol{\upbeta}}_C^{\prime }{\mathbf{t}}_C,$$
(5.32)

Equations (5.31) and (5.32) indicate that the CLGSI is an application of the LPSI to the genomic selection context.

From Eq. (5.31), the maximized CLGSI selection response, expected genetic gain per trait and accuracy can be written as

$${R}_C={k}_I\sqrt{{\boldsymbol{\upbeta}}_C^{\prime }{\mathbf{T}}_C{\boldsymbol{\upbeta}}_C},$$
(5.33)
$${\mathbf{E}}_C={k}_I\frac{{\boldsymbol{\Psi}}_C{\boldsymbol{\upbeta}}_C}{\sqrt{{\boldsymbol{\upbeta}}_C^{\prime }{\mathbf{T}}_C{\boldsymbol{\upbeta}}_C}}$$
(5.34)

and

$${\rho}_{HI_C}=\frac{\sqrt{{\boldsymbol{\upbeta}}_C^{\prime }{\mathbf{T}}_C{\boldsymbol{\upbeta}}_C}}{\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}}},$$
(5.35)

respectively. Note that the maximized LPSI accuracy is $${\rho}_{HI}=\frac{\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}}}{\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}}}$$ (see Chap. 2). The denominator of the accuracy of the CLGSI and $${\rho}_{HI}=\frac{\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}}}{\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}}}$$ is the same; however, the numerator of the two indices accuracy is different. We would expect that $$\sqrt{{\boldsymbol{\upbeta}}_C^{\prime }{\mathbf{T}}_C{\boldsymbol{\upbeta}}_C}\ge \sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}}$$, and then $${\rho}_{HI_C}\ge {\rho}_{HI}$$. Similar results can be observed when we compared the maximized LPSI selection response and expected genetic gain per trait with the maximized CLGSI selection response and expected genetic gain per trait.

### 5.2.2 Relationship Between the CLGSI and the LGSI

As we have indicated, the CLGSI is mathematically equivalent to the LMSI; thus, it has similar statistical properties to those of the LMSI, some of which are described in this section. The rest can be seen in Chap. 4. Let $${\mathbf{Q}}_C={\mathbf{T}}_C^{-1}{\boldsymbol{\Psi}}_C$$, then matrix QC can be written as

$${\mathbf{Q}}_C=\left[\begin{array}{cc}{\left(\mathbf{P}-\boldsymbol{\Gamma} \right)}^{-1}\left(\mathbf{C}-\boldsymbol{\Gamma} \right)& \mathbf{0}\\ {}\mathbf{I}-{\left(\mathbf{P}-\boldsymbol{\Gamma} \right)}^{-1}\left(\mathbf{C}-\boldsymbol{\Gamma} \right)& \mathbf{I}\end{array}\right],$$
(5.36)

whence as $${\mathbf{w}}_2^{\prime }=\left[{0}_1\kern0.5em \cdots \kern0.5em {0}_t\right]$$, the two sub-vectors that conform vector βC = QCaC or $${\boldsymbol{\upbeta}}_C^{\prime }=\left[{\boldsymbol{\upbeta}}_{\mathbf{y}}^{\prime}\kern0.5em {\boldsymbol{\upbeta}}_G^{\prime}\right]$$ can be written as

$${\boldsymbol{\upbeta}}_{\mathbf{y}}={\left(\mathbf{P}-\boldsymbol{\Gamma} \right)}^{-1}\left(\mathbf{C}-\boldsymbol{\Gamma} \right)\mathbf{w},$$
(5.37)

and

$${\boldsymbol{\upbeta}}_G=\left[\mathbf{I}-{\left(\mathbf{P}-\boldsymbol{\Gamma} \right)}^{-1}\left(\mathbf{C}-\boldsymbol{\Gamma} \right)\right]\mathbf{w}=\mathbf{w}-{\boldsymbol{\upbeta}}_{\mathbf{y}}.$$
(5.38)

When Γ is equal to the null matrix (no genomic information), Eq. (5.37) is equal to βy = P−1Cw = b and $${R}_C={k}_I\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}}={R}_I$$, which are the LPSI vector of coefficients and the selection response.

By Eqs. (5.37) and (5.38), the maximized CLGSI selection response and the optimized CLGSI can be written as

$${R}_C={k}_I\sqrt{{\mathbf{w}}^{\prime}\mathbf{C}{\left(\mathbf{P}-\boldsymbol{\Gamma} \right)}^{-1}\left(\mathbf{C}-\boldsymbol{\Gamma} \right)\mathbf{w}+{\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \left[\mathbf{I}-{\left(\mathbf{P}-\boldsymbol{\Gamma} \right)}^{-1}\left(\mathbf{C}-\boldsymbol{\Gamma} \right)\right]\mathbf{w}}$$
(5.39)

and

$${I}_C={\boldsymbol{\upbeta}}_{\mathbf{y}}\mathbf{y}+{\boldsymbol{\upbeta}}_G\boldsymbol{\upgamma} ={\mathbf{w}}^{\prime}\boldsymbol{\upgamma} +{\boldsymbol{\upbeta}}_{\mathbf{y}}\left(\mathbf{y}-\boldsymbol{\upgamma} \right),$$
(5.40)

respectively.

Assume that when the number of markers and genotypes increases, matrix Γ tends to matrix C and that, at the limit, Γ = C; then, Eq. (5.39) can be written as $${R}_C={k}_I\sqrt{{\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \mathbf{w}}={R}_G$$ (except by LG); in addition, βy = 0 and βG = w, the weights of the LGSI, and, in this latter case, the CLGSI is equal to the LGSI, as we would expect. Thus, in the asymptotic context, the LGSI and the CLGSI are the same.

An additional interesting result of the relationship between the CLGSI and the LGSI is as follows. The maximized correlation between H and IC (or CLGSI accuracy) can be written as

$${\rho}_{HI_C}=\frac{{\mathbf{a}}_C^{\prime }{\boldsymbol{\Psi}}_C{\boldsymbol{\upbeta}}_C}{\sqrt{{\mathbf{a}}_C^{\prime }{\boldsymbol{\Psi}}_C{\mathbf{a}}_C}\sqrt{{\boldsymbol{\upbeta}}_C^{\prime }{\mathbf{T}}_C{\boldsymbol{\upbeta}}_C}};$$
(5.41)

However, when Γ = C, $${\boldsymbol{\Psi}}_C=\left[\begin{array}{cc}\boldsymbol{\Gamma} & \boldsymbol{\Gamma} \\ {}\boldsymbol{\Gamma} & \boldsymbol{\Gamma} \end{array}\right]$$, βy = 0, βG = w, and $${\boldsymbol{\upbeta}}_C^{\prime }=\left[{\boldsymbol{\upbeta}}_{\mathbf{y}}^{\prime}\kern0.5em {\boldsymbol{\upbeta}}_G^{\prime}\right]=\left[\begin{array}{cc}\mathbf{0}& {\mathbf{w}}^{\prime}\end{array}\right]$$, whence $${\mathbf{a}}_C^{\prime }{\boldsymbol{\Psi}}_C{\boldsymbol{\upbeta}}_C={\mathbf{a}}_C^{\prime }{\boldsymbol{\Psi}}_C{\mathbf{a}}_C={\boldsymbol{\upbeta}}_C^{\prime }{\mathbf{T}}_C{\boldsymbol{\upbeta}}_C={\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \mathbf{w}$$, and Eq. (5.41) is equal to 1. That is, the maximum correlation between H and IC in the asymptotic context is equal to the maximum correlation between H and the LGSI, and that value will be equal to 1.

The asymptotic relationship between the CLGSI expected genetic gain per trait, EC (Eq. 5.34), and the LGSI expected genetic gain per trait, $${\mathbf{E}}_{I_G}$$ (Eq. 5.16), is as follows. When Γ = C, $${\boldsymbol{\Psi}}_C=\left[\begin{array}{cc}\boldsymbol{\Gamma} & \boldsymbol{\Gamma} \\ {}\boldsymbol{\Gamma} & \boldsymbol{\Gamma} \end{array}\right]$$ and $${\boldsymbol{\upbeta}}_C^{\prime }=\left[\mathbf{0}\kern0.5em {\mathbf{w}}^{\prime}\right]$$, whence

$${\mathbf{E}}_C={k}_I\frac{{\boldsymbol{\Psi}}_C{\boldsymbol{\upbeta}}_C}{\sqrt{{\boldsymbol{\upbeta}}_C^{\prime }{\mathbf{T}}_C{\boldsymbol{\upbeta}}_C}}={k}_I\frac{2\boldsymbol{\Gamma} \mathbf{w}}{\sqrt{{\mathbf{w}}^{\prime}\boldsymbol{\Gamma} \mathbf{w}}}=2{\mathbf{E}}_{I_G}.$$
(5.42)

This means that in the asymptotic context, the CLGSI expected genetic gain per trait is twice the LGSI expected genetic gain per trait. Of course, 2 is only a proportionality constant; thus, in reality, $${\mathbf{E}}_C={\mathbf{E}}_{I_G}$$.

### 5.2.3 Statistical Properties of the CLGSI

Assume that H and IC have bivariate joint normal distribution; P, C, Γ, and w are known, and $${\boldsymbol{\upbeta}}_C={\mathbf{T}}_C^{-1}{\boldsymbol{\Psi}}_C{\mathbf{a}}_C$$; then, the CLGSI properties are as follow:

1. 1.

$${\sigma}_{I_C}^2={\sigma}_{HI_C}$$, i.e., the variance of IC ($${\sigma}_{I_C}^2$$) and the covariance between H and IC ($${\sigma}_{HI_C}$$) are the same.

2. 2.

The maximized correlation between H and IC is $${\rho}_{HI_C}=\frac{\sigma_{I_C}}{\sigma_H}$$, where $${\sigma}_{I_C}$$ is the standard deviation of the variance of IC ($${\sigma}_{I_C}^2$$) and σH is the standard deviation of the variance of H($${\sigma}_H^2$$).

3. 3.

The variance of the predicted error, $$Var\left(H-{I}_C\right)=\left(1-{\rho}_{HI_C}^2\right){\sigma}_H^2$$, is minimal.

4. 4.

The total variance of H explained by IC is $${\sigma}_{I_C}^2={\rho}_{HI_C}^2{\sigma}_H^2$$.

Note that CLGSI properties 1 to 4 are the same as LMSI properties 1 to 4 and that both indices jointly incorporate phenotypic and marker information to predict the net genetic merit; however, the LMSI incorporates the marker information by the marker score values, whereas the CLGSI uses the GEBVs.

### 5.2.4 Estimating the CLGSI Parameters

Using the real maize (Zea mays) F2 population with 248 genotypes (each with two repetitions), 233 molecular markers and three traits—GY (ton ha−1), EHT (cm), and PHT (cm)—described in Sect. 5.1.8 of this chapter, we estimated matrices P and C using Eqs. (2.22) to (2.24) described in Chap. 2 of this book. The estimated matrices were $$\widehat{\mathbf{P}}=\left[\begin{array}{ccc}0.45& 1.33& 2.33\\ {}1.33& 65.07& 83.71\\ {}2.33& 83.71& 165.99\end{array}\right]$$ and $$\widehat{\mathbf{C}}=\left[\begin{array}{ccc}0.07& 0.61& 1.06\\ {}0.61& 17.93& 22.75\\ {}1.06& 22.75& 44.53\end{array}\right]$$.

In a similar manner, we estimated matrix Γ using Eqs. (5.21) to (5.23). The estimated matrix was $$\widehat{\boldsymbol{\Gamma}}=\left[\begin{array}{ccc}0.07& 0.65& 1.05\\ {}0.65& 10.62& 14.25\\ {}1.05& 14.25& 26.37\end{array}\right]$$. Note that matrices $$\widehat{\mathbf{C}}$$ and $$\widehat{\boldsymbol{\Gamma}}$$ have similar values. This means that, in the asymptotic context, we can assume that matrix Γ tends to matrix C.

To estimate the CLMSI and its associated parameters (selection response, expected genetic gain per trait, etc.), we need to estimate the vector of coefficients $${\boldsymbol{\upbeta}}_C={\mathbf{T}}_C^{-1}{\boldsymbol{\Psi}}_C{\mathbf{a}}_C$$ as $${\widehat{\boldsymbol{\upbeta}}}_C={\widehat{\mathbf{T}}}_C^{-1}{\widehat{\boldsymbol{\Psi}}}_C{\mathbf{a}}_C$$, where $${\widehat{\mathbf{T}}}_C=\left[\begin{array}{cc}\widehat{\mathbf{P}}& \widehat{\boldsymbol{\Gamma}}\\ {}\widehat{\boldsymbol{\Gamma}}& \widehat{\boldsymbol{\Gamma}}\end{array}\right]$$ and $${\widehat{\boldsymbol{\Psi}}}_C=\left[\begin{array}{cc}\widehat{\mathbf{C}}& \widehat{\boldsymbol{\Gamma}}\\ {}\widehat{\boldsymbol{\Gamma}}& \widehat{\boldsymbol{\Gamma}}\end{array}\right]$$ are estimates of matrices $${\mathbf{T}}_C=\left[\begin{array}{cc}\mathbf{P}& \boldsymbol{\Gamma} \\ {}\boldsymbol{\Gamma} & \boldsymbol{\Gamma} \end{array}\right]$$ and $${\boldsymbol{\Psi}}_C=\left[\begin{array}{cc}\mathbf{C}& \boldsymbol{\Gamma} \\ {}\boldsymbol{\Gamma} & \boldsymbol{\Gamma} \end{array}\right]$$ respectively. The estimated CLGSI vector of coefficients $${\widehat{\boldsymbol{\upbeta}}}_C={\widehat{\mathbf{T}}}_C^{-1}{\widehat{\boldsymbol{\Psi}}}_C{\mathbf{a}}_C$$ is conformed by the vector of phenotypic weights, $${\widehat{\boldsymbol{\upbeta}}}_{\mathbf{y}}={\left(\widehat{\mathbf{P}}-\widehat{\boldsymbol{\Gamma}}\right)}^{-1}\left(\widehat{\mathbf{C}}-\widehat{\boldsymbol{\Gamma}}\right)\mathbf{w}$$, and by the vector of genomic weights, $${\widehat{\boldsymbol{\upbeta}}}_G=\left[\mathbf{I}-{\left(\widehat{\mathbf{P}}-\widehat{\boldsymbol{\Gamma}}\right)}^{-1}\left(\widehat{\mathbf{C}}-\widehat{\boldsymbol{\Gamma}}\right)\right]\mathbf{w}$$.

Let $${\mathbf{w}}^{\prime }=\left[5\kern0.5em -0.1\kern0.5em -0.1\right]$$ be the vector of economic weights; then, according to the estimated matrices $$\widehat{\mathbf{P}}$$, $$\widehat{\mathbf{C}}$$, and $$\widehat{\boldsymbol{\Gamma}}$$, $${\widehat{\boldsymbol{\upbeta}}}_{\mathbf{y}}^{\prime }=\left[0.08\kern0.5em -0.02\kern0.5em -0.01\right]$$ and $${\widehat{\boldsymbol{\upbeta}}}_G^{\prime }=\left[4.92\kern0.5em -0.08\kern0.5em -0.09\right]$$, whence the estimated CLGSI in the training population can be written as

$${\widehat{I}}_C={\widehat{\boldsymbol{\upbeta}}}_{\mathbf{y}}\mathbf{y}+{\widehat{\boldsymbol{\upbeta}}}_G\widehat{\boldsymbol{\upgamma}}.$$
(5.43)

Suppose a selection intensity of 10% (kI = 1.755); then, the estimated CLGSI selection response and expected genetic gain per trait were $${\widehat{R}}_C={k}_I\sqrt{{\widehat{{\boldsymbol{\upbeta}}^{\prime}}}_C{\widehat{\mathbf{T}}}_C{\widehat{\boldsymbol{\upbeta}}}_C}=1.54$$ and $${\widehat{\mathbf{E}}}_C^{\prime }={k}_I\frac{{\widehat{\boldsymbol{\upbeta}}}_C^{\prime }{\widehat{\boldsymbol{\Psi}}}_C}{\sqrt{{\widehat{{\boldsymbol{\upbeta}}^{\prime}}}_C{\widehat{\mathbf{T}}}_C{\widehat{\boldsymbol{\upbeta}}}_C}}=\left[0.36\kern0.5em 1.04\kern0.5em 1.70\kern0.5em 0.36\kern0.5em 1.53\kern0.5em 2.38\right]$$ respectively, whereas the estimated CLGSI accuracy was $${\widehat{\rho}}_{HI_C}=\frac{{\widehat{\sigma}}_{I_C}}{{\widehat{\sigma}}_H}=0.814$$.

The estimated LPSI selection response, expected genetic gain per trait, and accuracy were 0.601, $$\left[0.09\kern0.5em -0.81\kern0.5em -0.89\right]$$, and 0.32 respectively; thus, the CLGSI was more efficient to predict the net genetic merit than the LPSI because the CLGSI accuracy and selection response were 0.814 and 1.54 respectively.

### 5.2.5 LGSI and CLGSI Efficiency Vs LMSI, GW-LMSI and LPSI Efficiency

In this subsection, we compare the accuracy, selection response, and efficiency of the LGSI and CLGSI with the LMSI, the GW-LMSI, and the LPSI using the simulated data for a maize (Zea mays) population described in Chap. 2, Sect. 2.8.1.

Figure 5.3 presents the estimated accuracy values of the LMSI, the LGSI, the CLGSI, the LPSI, and the GW-LMSI for five simulated selection cycles. According to these results, for the first three selection cycles, the estimated accuracies of the indices, in decreasing order, were LMSI > LGSI > CLGSI > LPSI > GW-LMSI. That is, the highest estimated accuracy was obtained with the LMSI, whereas the lowest was obtained with the GW-LMSI. For the fourth and fifth selection cycles, the estimated accuracies, in decreasing order, were LMSI > LPSI > CLGSI > LGSI > GW-LMSI. This means that in all five selection cycles, the LMSI had the highest accuracy and the GW-LMSI had the lowest accuracy, whereas the estimated LGSI accuracy was reduced to fourth place. Thus, the accuracy of the LGSI tended to decrease after the first three selection cycles whereas LPSI accuracy was a constant.