Abstract
There are two main linear marker selection indices employed in marker-assisted selection (MAS) to predict the net genetic merit and to select individual candidates as parents for the next generation: the linear marker selection index (LMSI) and the genome-wide LMSI (GWLMSI). Both indices maximize the selection response, the expected genetic gain per trait, and the correlation with the net genetic merit. However, applying the LMSI in plant or animal breeding requires several steps: genotyping the candidates for selection; performing a linear regression of phenotypic values on the coded values of the markers so that the selected markers are statistically linked to quantitative trait loci that explain most of the variability in the regression model; constructing the marker score; and combining the marker score with phenotypic information to predict and rank the net genetic merit of the candidates for selection. The GWLMSI, on the other hand, is a single-stage procedure that treats information at each individual marker as a separate trait. Thus, all marker information can be entered together with phenotypic information into the GWLMSI, which is then used to predict the net genetic merit and select candidates. We describe the LMSI and GWLMSI theory and show that both indices are direct applications of the linear phenotypic selection index theory to MAS. Using real and simulated data, we validated the theory of both indices.
Keywords
 Marker Scores
 Phenotypic Values
 Response Selection
 Code Value
 Marker Information
4.1 The Linear Marker Selection Index
4.1.1 Basic Conditions for Constructing the LMSI
In Chap. 2, Sect. 2.1, we indicated ten basic conditions for constructing a valid linear phenotypic selection index (LPSI). These ten conditions are also necessary for the linear marker selection index (LMSI); however, in addition to those conditions, the LMSI also requires the following conditions:

1. The markers and the quantitative trait loci (QTL) should be in linkage disequilibrium in the population under selection.
2. The QTL effects should be combined additively both within and between loci.
3. The QTL should be in coupling mode, that is, one of the initial lines should carry all the alleles that have a positive effect, and the other line should carry all the alleles that have a negative effect.
4. The traits of interest should be affected by a few QTL with large effects (and possibly a number of very small QTL effects) rather than many small QTL effects.
5. The heritability of the traits should be low.
6. Markers correlated with the traits of interest should be identified.

Under these conditions, the LMSI should be more efficient than the LPSI, at least in the first selection cycles (Whittaker 2003; Moreau et al. 2007).
4.1.2 The LMSI Parameters
Let y_{i} = g_{i} + e_{i} be the ith trait (i = 1, 2, …, t; t = number of traits), where e_{i}~N(0, \( {\sigma}_{e_i}^2 \)) is the residual with expectation equal to zero and variance \( {\sigma}_{e_i}^2 \), and N stands for the normal distribution. Assuming that the QTL effects combine additively both within and between loci, the ith unobservable genetic value g_{i} can be written as
\( {g}_i=\sum \limits_{k=1}^{N_Q}{\alpha}_k{q}_k, \)  (4.1)
where α_{k} is the effect of the kth QTL, q_{k} is the number of favorable alleles at the kth QTL (2, 1 or 0), and N_{Q} is the number of QTL affecting the ith trait of interest.
If the QTL effect values are not observable, the g_{i} values in Eq. (4.1) are also not observable; however, we can use a linear combination of the markers linked to the QTL (s_{i}) that affect the ith trait to predict the g_{i} value as
\( {s}_i=\sum \limits_{j=1}^M{\theta}_j{x}_j, \)  (4.2)
where s_{i} is a predictor of g_{i}, θ_{j} is the regression coefficient of the linear regression model, x_{j} is the coded value of the jth marker (e.g., 1, 0, and −1 for marker genotypes AA, Aa, and aa respectively), and M is the number of selected markers linked to the QTL that affect the ith trait. Equation (4.2) is called the marker score (Lande and Thompson 1990; Whittaker 2003), and it is the main reason why the LMSI is not equal to the LPSI described in Chap. 2. The selected markers are only a subset of the potential markers linked to QTL in the population under selection; thus, the s_{i} values should be lower than or equal to the g_{i} values. One way of estimating the s_{i} values is to perform a linear regression of phenotypic values on the coded values of the markers, select the markers that are statistically linked to quantitative trait loci that explain most of the variability in the regression model, and then obtain the estimated value of s_{i} (\( {\widehat{s}}_i \)) as the sum of the products of the estimated effects of the selected markers and the coded values of those markers for the ith trait. Some authors (e.g., Moreau et al. 2007) call \( {\widehat{s}}_i \) the molecular score; in this book, we call s_{i} the marker score and \( {\widehat{s}}_i \) the estimated marker score.
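The marker score of Eq. (4.2) is simply a weighted sum of coded marker values. The following minimal sketch illustrates the computation for several individuals at once; the effects in `theta` and the genotypes in `X` are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical regression coefficients (theta_j) for M = 3 selected markers.
theta = np.array([2.0, -1.5, 0.5])

# Hypothetical coded marker genotypes (1, 0, -1 for AA, Aa, aa);
# one row per individual, one column per selected marker.
X = np.array([
    [1, 0, -1],
    [0, 1, 1],
    [-1, -1, 0],
    [1, 1, 1],
])

# Marker score of each individual: s = sum_j theta_j * x_j.
scores = X @ theta
print(scores)
```

Each entry of `scores` is the marker score of one individual for a single trait; with several traits, one vector of coefficients per trait would be needed.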
The objective of the LMSI is to predict the net genetic merit of each individual and select the individuals with the highest net genetic merit for further breeding. In the LMSI context, the net genetic merit can be written as
\( H={\mathbf{w}}^{\prime}\mathbf{g}+{\mathbf{w}}_2^{\prime}\mathbf{s}=\left[\begin{array}{cc}{\mathbf{w}}^{\prime }& {\mathbf{w}}_2^{\prime}\end{array}\right]\left[\begin{array}{c}\mathbf{g}\\ {}\mathbf{s}\end{array}\right]={\mathbf{a}}^{\prime}\mathbf{z}, \)  (4.3)
where \( {\mathbf{g}}^{\prime }=\left[{g}_1\kern0.5em \cdots \kern0.5em {g}_t\right] \) is the vector of breeding values; \( {\mathbf{w}}^{\prime }=\left[{w}_1\kern0.5em \cdots \kern0.5em {w}_t\right] \) is the vector of economic weights associated with g; \( {\mathbf{w}}_2^{\prime }=\left[{0}_1\kern0.5em \cdots \kern0.5em {0}_t\right] \) is a null vector associated with the vector of marker scores \( {\mathbf{s}}^{\prime }=\left[{s}_1\kern0.5em \cdots \kern0.5em {s}_t\right] \); s_{i} is the ith marker score; \( {\mathbf{a}}^{\prime }=\left[{\mathbf{w}}^{\prime}\kern0.5em {\mathbf{w}}_2^{\prime}\right] \) and \( {\mathbf{z}}^{\prime }=\left[{\mathbf{g}}^{\prime}\kern0.5em {\mathbf{s}}^{\prime}\right] \).
The information provided by the marker score can be used in breeding programs to increase the accuracy of predicting the net genetic merit of the individuals under selection. The LMSI combines the phenotypic values and the marker scores to predict H in each selection cycle and can be written as
\( {I}_M={\boldsymbol{\upbeta}}_y^{\prime}\mathbf{y}+{\boldsymbol{\upbeta}}_s^{\prime}\mathbf{s}=\left[\begin{array}{cc}{\boldsymbol{\upbeta}}_y^{\prime }& {\boldsymbol{\upbeta}}_s^{\prime}\end{array}\right]\left[\begin{array}{c}\mathbf{y}\\ {}\mathbf{s}\end{array}\right]={\boldsymbol{\upbeta}}^{\prime}\mathbf{t}, \)  (4.4)
where β_{y} and β_{s} are vectors of phenotypic and marker score weights respectively; \( {\mathbf{y}}^{\prime }=\left[{y}_1\kern0.5em \cdots \kern0.5em {y}_t\right] \) is the vector of trait phenotypic values and s was defined in Eq. (4.3); \( {\boldsymbol{\upbeta}}^{\prime }=\left[{\boldsymbol{\upbeta}}_y^{\prime}\kern0.5em {\boldsymbol{\upbeta}}_s^{\prime}\right] \) and \( {\mathbf{t}}^{\prime }=\left[{\mathbf{y}}^{\prime}\kern0.5em {\mathbf{s}}^{\prime}\right] \).
The LMSI selection response can be written as
\( {R}_M={k}_I{\sigma}_H{\rho}_{I_MH}={k}_I\frac{{\mathbf{a}}^{\prime }{\mathbf{Z}}_M\boldsymbol{\upbeta}}{\sqrt{{\boldsymbol{\upbeta}}^{\prime }{\mathbf{T}}_M\boldsymbol{\upbeta}}}, \)  (4.5)
where k_{I} is the standardized selection differential of the LMSI, \( {\sigma}_H=\sqrt{{\mathbf{a}}^{\prime }{\mathbf{Z}}_M\mathbf{a}} \) and \( \sqrt{{\boldsymbol{\upbeta}}^{\prime }{\mathbf{T}}_M\boldsymbol{\upbeta}} \) are the standard deviations of H and I_{M} respectively, whereas \( {\rho}_{I_MH} \) and a′Z_{M}β are the correlation and the covariance between H and I_{M} respectively; \( {\mathbf{T}}_M= Var\left[\begin{array}{c}\mathbf{y}\\ {}\mathbf{s}\end{array}\right]=\left[\begin{array}{cc}\mathbf{P}& \mathbf{S}\\ {}\mathbf{S}& \mathbf{S}\end{array}\right] \) and \( {\mathbf{Z}}_M= Var\left[\begin{array}{c}\mathbf{g}\\ {}\mathbf{s}\end{array}\right]=\left[\begin{array}{cc}\mathbf{C}& \mathbf{S}\\ {}\mathbf{S}& \mathbf{S}\end{array}\right] \) are block covariance matrices, where P = Var(y), S = Var(s), and C = Var(g) are the covariance matrices of the phenotypic values (y), the marker scores (s), and the genetic values (g) respectively in the population. Vectors a and β were defined in Eqs. (4.3) and (4.4) respectively.
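The LMSI expected genetic gain per trait can be written as
\( {\mathbf{E}}_M={k}_I\frac{{\mathbf{Z}}_M\boldsymbol{\upbeta}}{\sqrt{{\boldsymbol{\upbeta}}^{\prime }{\mathbf{T}}_M\boldsymbol{\upbeta}}}. \)  (4.6)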
All the parameters in Eq. (4.6) were previously defined.
4.1.3 The Maximized LMSI Parameters
Suppose that P, S, and C are known matrices; then, matrices T_{M} and Z_{M} are known and, according to the LPSI theory (see Chap. 2 for details), the LMSI vector of coefficients (β_{M}) that maximizes \( {\rho}_{I_MH} \), R_{M}, and E_{M} can be written as
\( {\boldsymbol{\upbeta}}_M={\mathbf{T}}_M^{-1}{\mathbf{Z}}_M\mathbf{a}, \)  (4.7)
whence the maximized selection response and the maximized correlation (or LMSI accuracy) between H and I_{M} can be written as
\( {R}_M={k}_I\sqrt{{\boldsymbol{\upbeta}}_M^{\prime }{\mathbf{T}}_M{\boldsymbol{\upbeta}}_M} \)  (4.8a)
and
\( {\rho}_{I_MH}=\frac{\sigma_{I_M}}{\sigma_H} \)  (4.8b)
respectively, where \( {\sigma}_{I_M}=\sqrt{{\boldsymbol{\upbeta}}_M^{\prime }{\mathbf{T}}_M{\boldsymbol{\upbeta}}_M} \) is the standard deviation of I_{M} and \( {\sigma}_H=\sqrt{{\mathbf{a}}^{\prime }{\mathbf{Z}}_M\mathbf{a}} \) is the standard deviation of H. Equations (4.8a) and (4.8b) show that the LMSI is a direct application of the LPSI theory in the marker-assisted selection (MAS) context.
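As an illustration of how the maximized LMSI parameters could be computed when P, C, S, and w are taken as known, consider the following minimal sketch. The numerical matrices are hypothetical (S is arbitrarily set to half of C), and k_{I} is set to 1.

```python
import numpy as np

# Hypothetical covariance matrices for t = 2 traits.
P = np.array([[10.0, 2.0], [2.0, 8.0]])   # phenotypic covariance, Var(y)
C = np.array([[4.0, 1.0], [1.0, 3.0]])    # genetic covariance, Var(g)
S = 0.5 * C                               # marker score covariance (assumed)
w = np.array([1.0, 0.5])                  # economic weights
k_I = 1.0                                 # standardized selection differential

t = len(w)
T_M = np.block([[P, S], [S, S]])          # Var([y; s])
Z_M = np.block([[C, S], [S, S]])          # Var([g; s])
a = np.concatenate([w, np.zeros(t)])      # a' = [w' 0']

# LMSI coefficients: beta = T_M^{-1} Z_M a.
beta = np.linalg.solve(T_M, Z_M @ a)

sigma_I = np.sqrt(beta @ T_M @ beta)      # standard deviation of I_M
sigma_H = np.sqrt(a @ Z_M @ a)            # standard deviation of H
R_M = k_I * sigma_I                       # maximized selection response
rho = sigma_I / sigma_H                   # LMSI accuracy

print(beta, R_M, rho)
```

In this sketch the last t elements of `beta` equal `w` minus the first t elements, so only the phenotypic weights need to be computed in practice.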
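Let \( \mathbf{Q}={\mathbf{T}}_M^{-1}{\mathbf{Z}}_M \); then, matrix Q can be written as
\( \mathbf{Q}={\left[\begin{array}{cc}\mathbf{P}& \mathbf{S}\\ {}\mathbf{S}& \mathbf{S}\end{array}\right]}^{-1}\left[\begin{array}{cc}\mathbf{C}& \mathbf{S}\\ {}\mathbf{S}& \mathbf{S}\end{array}\right]=\left[\begin{array}{cc}{\left(\mathbf{P}-\mathbf{S}\right)}^{-1}\left(\mathbf{C}-\mathbf{S}\right)& \mathbf{0}\\ {}\mathbf{I}-{\left(\mathbf{P}-\mathbf{S}\right)}^{-1}\left(\mathbf{C}-\mathbf{S}\right)& \mathbf{I}\end{array}\right], \)  (4.9)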
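whence β = Qa, and as \( {\mathbf{w}}_2^{\prime }=\left[{0}_1\kern0.5em \cdots \kern0.5em {0}_t\right] \), we can write the two vectors of \( {\boldsymbol{\upbeta}}^{\prime }=\left[{\boldsymbol{\upbeta}}_y^{\prime}\kern0.5em {\boldsymbol{\upbeta}}_s^{\prime}\right] \) as
\( {\boldsymbol{\upbeta}}_y={\left(\mathbf{P}-\mathbf{S}\right)}^{-1}\left(\mathbf{C}-\mathbf{S}\right)\mathbf{w}\kern0.5em \mathrm{and}\kern0.5em {\boldsymbol{\upbeta}}_s=\left[\mathbf{I}-{\left(\mathbf{P}-\mathbf{S}\right)}^{-1}\left(\mathbf{C}-\mathbf{S}\right)\right]\mathbf{w}. \)  (4.10a)
Another way of writing the marker score vector weights is
\( {\boldsymbol{\upbeta}}_s=\mathbf{w}-{\boldsymbol{\upbeta}}_y, \)  (4.10b)
where β_{y} = (P − S)^{−1}(C − S)w. By Eq. (4.10b), the optimal LMSI can be written as
\( {I}_M={\boldsymbol{\upbeta}}_y^{\prime}\mathbf{y}+{\left(\mathbf{w}-{\boldsymbol{\upbeta}}_y\right)}^{\prime}\mathbf{s}={\mathbf{w}}^{\prime}\mathbf{s}+{\boldsymbol{\upbeta}}_y^{\prime}\left(\mathbf{y}-\mathbf{s}\right). \)  (4.11)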
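Equation (4.11) indicates that, in practice, to estimate the optimal LMSI, we only need to estimate the vector of coefficients β_{y}. By Eq. (4.10a), Eq. (4.8a) can be written as
\( {R}_M={k}_I\sqrt{{\boldsymbol{\upbeta}}_y^{\prime}\left(\mathbf{P}-\mathbf{S}\right){\boldsymbol{\upbeta}}_y+{\mathbf{w}}^{\prime}\mathbf{Sw}}. \)  (4.12)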
Thus, by Eqs. (4.10a) and (4.12), when S is a null matrix, vector β_{y} is equal to β_{y} = P^{−1}Cw = b and \( {R}_M={k}_I\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}}={R}_I \), which are the LPSI vector of coefficients and its selection response respectively.
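The partitioned form of the LMSI weights can be checked numerically. The sketch below, with hypothetical matrices, compares β_{y} = (P − S)^{−1}(C − S)w and the marker score weights w − β_{y} (which follow from the block inverse of T_{M}) with the direct solution β = T_{M}^{−1}Z_{M}a; it is only an illustration under the assumption that P − S and S are positive definite.

```python
import numpy as np

# Hypothetical inputs for t = 2 traits.
P = np.array([[10.0, 2.0], [2.0, 8.0]])
C = np.array([[4.0, 1.0], [1.0, 3.0]])
S = 0.5 * C
w = np.array([1.0, 0.5])

# Partitioned form: only the t x t matrix P - S has to be inverted.
beta_y = np.linalg.solve(P - S, (C - S) @ w)
beta_s = w - beta_y
beta_block = np.concatenate([beta_y, beta_s])

# Direct form: invert the full 2t x 2t matrix T_M.
T_M = np.block([[P, S], [S, S]])
Z_M = np.block([[C, S], [S, S]])
a = np.concatenate([w, np.zeros(len(w))])
beta_direct = np.linalg.solve(T_M, Z_M @ a)

# Index variance computed from both forms.
var_block = beta_y @ (P - S) @ beta_y + w @ S @ w
var_direct = beta_direct @ T_M @ beta_direct

print(np.allclose(beta_block, beta_direct), np.isclose(var_block, var_direct))
```

Working with P − S rather than T_{M} halves the dimension of the system to be solved, which is the practical point of Eq. (4.11).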
Assume that when the numbers of markers and genotypes tend to infinity, S tends to C; then, at the limit, we can suppose that S = C, and by this latter result, R_{M} is equal to
\( {R}_M={k}_I\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}}. \)  (4.13)
That is, Eq. (4.13) is the maximum value of the LMSI selection response when the numbers of markers and genotypes tend to infinity. Thus, the possible LMSI selection response values of Eq. (4.12) should be between \( {k}_I\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}} \) and \( {k}_I\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}} \), i.e.,
\( {k}_I\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}}\le {R}_M\le {k}_I\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}}, \)  (4.14)
or, in terms of the ratio of R_{M} to the LPSI selection response R_{I}, between 1 and \( \frac{\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}}}{\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}}}=\frac{\sigma_H}{\sigma_I} \), that is,
\( 1\le \frac{R_M}{R_I}\le \frac{\sigma_H}{\sigma_I}. \)  (4.15)
Note that \( \frac{\sigma_H}{\sigma_I}=\frac{1}{\rho_{HI}} \), where ρ_{HI} is the maximized correlation between the net genetic merit (H) and the LPSI (I) described in Chap. 2. Equation (4.15) indicates that the upper bound of LMSI efficiency tends to infinity when the ρ_{HI} value tends to zero; this is an additional way of denoting the paradox of LMSI efficiency described by Knapp (1998).
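The bounds on the LMSI selection response can be illustrated numerically. In the minimal sketch below, the marker score covariance is assumed to be a fraction c of the genetic covariance (S = cC, a simplification made only for this example), and the response is computed from the phenotypic weights β_{y} = (P − S)^{−1}(C − S)w with k_{I} = 1.

```python
import numpy as np

# Hypothetical inputs for t = 2 traits.
P = np.array([[10.0, 2.0], [2.0, 8.0]])
C = np.array([[4.0, 1.0], [1.0, 3.0]])
w = np.array([1.0, 0.5])

def lmsi_response(S, k_I=1.0):
    """LMSI selection response for a given marker score covariance S."""
    beta_y = np.linalg.solve(P - S, (C - S) @ w)
    return k_I * np.sqrt(beta_y @ (P - S) @ beta_y + w @ S @ w)

# Lower bound: LPSI response (no marker information, S = 0).
b = np.linalg.solve(P, C @ w)
R_I = np.sqrt(b @ P @ b)

# Upper bound: markers explain all the genetic variance (S = C).
R_max = np.sqrt(w @ C @ w)

fractions = np.linspace(0.0, 1.0, 6)
responses = np.array([lmsi_response(c * C) for c in fractions])

for c, r in zip(fractions, responses):
    print(f"c = {c:.1f}  R_M = {r:.4f}  R_M / R_I = {r / R_I:.4f}")
```

The printed ratio R_{M}/R_{I} starts at 1 and ends at σ_{H}/σ_{I}, the reciprocal of the LPSI accuracy for these inputs.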
4.1.4 The LMSI for One Trait
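For the one-trait case, matrices T_{M}, Z_{M}, and Q can be written as
\( {\mathbf{T}}_M=\left[\begin{array}{cc}{\sigma}_y^2& {\sigma}_s^2\\ {}{\sigma}_s^2& {\sigma}_s^2\end{array}\right],\kern0.5em {\mathbf{Z}}_M=\left[\begin{array}{cc}{\sigma}_g^2& {\sigma}_s^2\\ {}{\sigma}_s^2& {\sigma}_s^2\end{array}\right],\kern0.5em \mathrm{and}\kern0.5em \mathbf{Q}=\left[\begin{array}{cc}\frac{\sigma_g^2-{\sigma}_s^2}{\sigma_y^2-{\sigma}_s^2}& 0\\ {}\frac{\sigma_y^2-{\sigma}_g^2}{\sigma_y^2-{\sigma}_s^2}& 1\end{array}\right], \)  (4.16)
where \( {\sigma}_y^2 \), \( {\sigma}_g^2 \), and \( {\sigma}_s^2 \) are the phenotypic, genetic, and marker score variances respectively. By Eqs. (4.10a) and (4.10b), when \( {\mathbf{a}}^{\prime }=\left[1\kern0.5em 0\right] \), the elements of vector β = Qa are
\( {\beta}_y=\frac{\sigma_g^2-{\sigma}_s^2}{\sigma_y^2-{\sigma}_s^2}\kern0.5em \mathrm{and}\kern0.5em {\beta}_s=1-{\beta}_y=\frac{\sigma_y^2-{\sigma}_g^2}{\sigma_y^2-{\sigma}_s^2}, \)  (4.17a)
whence the optimal LMSI can be written as
\( {I}_M={\beta}_yy+{\beta}_ss=s+{\beta}_y\left(y-s\right), \)  (4.17b)
whereas by Eq. (4.12), the maximized LMSI selection response can be written as
\( {R}_M=k\sqrt{{\sigma}_s^2+\frac{{\left({\sigma}_g^2-{\sigma}_s^2\right)}^2}{\sigma_y^2-{\sigma}_s^2}}. \)  (4.18)
When \( {\sigma}_s^2=0 \), \( {\beta}_y=\frac{\sigma_g^2}{\sigma_y^2}={h}^2 \), I_{M} = h^{2}y, and \( {R}_M=k\frac{\sigma_g^2}{\sigma_y}=k{\sigma}_y{h}^2=R \), the selection response for the one-trait case without markers.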
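For one trait, the LMSI weights and response reduce to scalar expressions. The following minimal sketch assumes hypothetical variances and k = 1; it uses the weight β_{y} = (σ_{g}^2 − σ_{s}^2)/(σ_{y}^2 − σ_{s}^2) and checks that the results reduce to phenotypic selection when the marker score variance is zero.

```python
import math

def lmsi_one_trait(var_y, var_g, var_s, k=1.0):
    """One-trait LMSI weights and selection response (scalar case)."""
    beta_y = (var_g - var_s) / (var_y - var_s)   # weight on the phenotype
    beta_s = 1.0 - beta_y                        # weight on the marker score
    response = k * math.sqrt(var_s + (var_g - var_s) ** 2 / (var_y - var_s))
    return beta_y, beta_s, response

# Hypothetical variances: heritability h2 = 4 / 10 = 0.4.
var_y, var_g = 10.0, 4.0

# Without marker information the index reduces to h2 * y.
beta_y0, beta_s0, R0 = lmsi_one_trait(var_y, var_g, var_s=0.0)

# With markers explaining half of the genetic variance.
beta_y1, beta_s1, R1 = lmsi_one_trait(var_y, var_g, var_s=2.0)

print(beta_y0, R0)
print(beta_y1, beta_s1, R1)
```

With these inputs the response with markers (`R1`) is larger than the response without markers (`R0`), and the weight on the phenotype decreases as the marker score explains more of the genetic variance.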
4.1.5 Efficiency of LMSI Versus LPSI Efficiency for One Trait
Suppose that the intensity of selection is the same in both indices; then, to compare LMSI versus LPSI efficiency for predicting the net genetic merit, we can use the ratio \( {\lambda}_M=\frac{\rho_{I_MH}}{\rho_{HI}}=\frac{R_M}{R_I} \) (Bulmer 1980; Moreau et al. 1998), where R_{I} is the maximized LPSI selection response. In percentage terms, the LMSI versus LPSI efficiency can be written as
\( {p}_M=100\left({\lambda}_M-1\right). \)  (4.19)
When p_{M} = 0, the efficiency of both indices is the same; when p_{M} > 0, the efficiency of the LMSI is higher than that of the LPSI, and when p_{M} < 0, LPSI efficiency is higher than LMSI efficiency for predicting the net genetic merit.
In the case of one trait, Lande and Thompson (1990) showed that LMSI efficiency (not in percentage terms) with respect to phenotypic efficiency can be written as
\( {\lambda}_M=\frac{R_M}{R}=\sqrt{\frac{q}{h^2}+\frac{{\left(1-q\right)}^2}{1-q{h}^2}}, \)  (4.20)
where R_{M} was defined in Eq. (4.18), R = kσ_{y}h^{2}, h^{2} is the trait heritability, and \( q=\frac{\sigma_s^2}{\sigma_g^2} \) is the proportion of additive genetic variance explained by the markers. According to Eq. (4.20), the advantage of the LMSI over phenotypic selection increases as the population size increases and heritability decreases, because in such cases, \( q=\frac{\sigma_s^2}{\sigma_g^2} \) tends to 1 and Eq. (4.20) approaches \( \frac{1}{h} \). Therefore, the LMSI is most efficient for traits with low heritability and when the marker score explains a large proportion of the genetic variance. Note, however, that when h^{2} tends to zero, \( \frac{1}{h} \) tends to infinity; this means that in the asymptotic context, LMSI efficiency with respect to phenotypic efficiency for one trait (Eq. 4.20) tends to infinity, and this is the LMSI paradox pointed out by Knapp (1998). There are other problems associated with the LMSI: it increases the selection response only in the short term and can result in lower cumulative responses in the longer term than phenotypic selection, as the LMSI fixes the QTL at a faster rate than phenotypic selection. In addition, it requires the weights (Eq. 4.17a) to be updated, because the frequency of the QTL changes in each generation (Dekkers and Settar 2004).
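The behavior of the one-trait efficiency described above can be explored with a short sketch. It assumes the Lande and Thompson (1990) expression λ = sqrt(q/h^{2} + (1 − q)^{2}/(1 − qh^{2})) and hypothetical values of h^{2} and q; the function names are ours.

```python
import math

def lmsi_efficiency(h2, q):
    """Efficiency of the one-trait LMSI relative to phenotypic selection."""
    return math.sqrt(q / h2 + (1.0 - q) ** 2 / (1.0 - q * h2))

def efficiency_percent(h2, q):
    """Efficiency gain in percentage terms: 100 * (lambda - 1)."""
    return 100.0 * (lmsi_efficiency(h2, q) - 1.0)

# Efficiency grows as heritability falls and as q approaches 1.
for h2 in (0.5, 0.25, 0.05):
    for q in (0.0, 0.5, 1.0):
        print(f"h2 = {h2:.2f}  q = {q:.1f}  lambda = {lmsi_efficiency(h2, q):.3f}")
```

When q = 0 the efficiency equals 1 (no advantage over phenotypic selection), and when q = 1 it equals 1/h, which grows without bound as h^{2} tends to zero.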
4.1.6 Statistical LMSI Properties
Assume that H and I_{M} have a bivariate joint normal distribution, \( \boldsymbol{\upbeta} ={\mathbf{T}}_M^{-1}{\mathbf{Z}}_M\mathbf{a} \), and that P, C, S, and w are known; then, the statistical LMSI properties are the same as the LPSI properties described in Chap. 2. That is,

1. \( {\sigma}_{I_M}^2={\sigma}_{HI_M} \): the variance of I_{M} (\( {\sigma}_{I_M}^2 \)) and the covariance between H and I_{M} (\( {\sigma}_{HI_M} \)) are the same.
2. The maximized correlation between H and I_{M} (or I_{M} accuracy) is \( {\rho}_{HI_M}=\frac{\sigma_{I_M}}{\sigma_H} \).
3. The variance of the prediction error, \( Var\left(H-{I}_M\right)=\left(1-{\rho}_{HI_M}^2\right){\sigma}_H^2 \), is minimal.
4. The total variance of H explained by I_{M} is \( {\sigma}_{I_M}^2={\rho}_{HI_M}^2{\sigma}_H^2 \).
5. The heritability of I_{M} is \( {\mathrm{h}}_{\mathrm{M}}^2=\frac{{\boldsymbol{\upbeta}}_M^{\prime }{\mathbf{Z}}_M{\boldsymbol{\upbeta}}_M}{{\boldsymbol{\upbeta}}_M^{\prime }{\mathbf{T}}_M{\boldsymbol{\upbeta}}_M} \).

Properties 1 to 4 are the same as LPSI properties 1 to 4, but, because the LMSI jointly incorporates the phenotypic and marker information to predict the net genetic merit, LMSI accuracy should be higher than LPSI accuracy. The same is true of the LMSI selection response and expected genetic gain per trait when compared with the LPSI selection response and expected genetic gain per trait.
4.2 The Genome-Wide Linear Selection Index
The genome-wide linear marker selection index (GWLMSI) is a single-stage procedure that treats information at each individual marker as a separate trait. Thus, all marker information can be entered together with phenotypic information into the GWLMSI, which is then used to predict the net genetic merit. In a similar manner to the LMSI, the GWLMSI exploits the linkage disequilibrium between markers and the QTL produced when inbred lines are crossed.
4.2.1 The GWLMSI Parameters
In a similar manner to the LPSI, the main objective of the GWLMSI is to predict the net genetic merit values of each individual and select the best individuals for further breeding. In the GWLMSI context, the net genetic merit can be written as
\( H={\mathbf{w}}^{\prime}\mathbf{g}+{\mathbf{w}}_2^{\prime}\mathbf{m}=\left[\begin{array}{cc}{\mathbf{w}}^{\prime }& {\mathbf{w}}_2^{\prime}\end{array}\right]\left[\begin{array}{c}\mathbf{g}\\ {}\mathbf{m}\end{array}\right]={\mathbf{a}}_W^{\prime }{\mathbf{z}}_W, \)  (4.21)
where \( {\mathbf{g}}^{\prime }=\left[{g}_1\kern0.5em \cdots \kern0.5em {g}_t\right] \) (t = number of traits) is the vector of breeding values, \( {\mathbf{w}}^{\prime }=\left[{w}_1\kern0.5em \cdots \kern0.5em {w}_t\right] \) is the vector of economic weights associated with the breeding values, and \( {\mathbf{w}}_2^{\prime }=\left[{0}_1\kern0.5em \cdots \kern0.5em {0}_m\right] \) is a null vector associated with the coded values of the markers \( {\mathbf{m}}^{\prime }=\left[{m}_1\kern0.5em \cdots \kern0.5em {m}_m\right] \), where m_{j} (j = 1, 2, …, m; m = number of markers) is the jth marker in the training population; \( {\mathbf{a}}_W^{\prime }=\left[{\mathbf{w}}^{\prime}\kern0.5em {\mathbf{w}}_2^{\prime}\right] \) and \( {\mathbf{z}}_W^{\prime }=\left[{\mathbf{g}}^{\prime}\kern0.5em {\mathbf{m}}^{\prime}\right] \).
The GWLMSI (I_{W}) combines the phenotypic values and the molecular information linked to the individual traits to predict H values in each selection cycle. It can be written as
\( {I}_W={\boldsymbol{\upbeta}}_y^{\prime}\mathbf{y}+{\boldsymbol{\upbeta}}_m^{\prime}\mathbf{m}=\left[\begin{array}{cc}{\boldsymbol{\upbeta}}_y^{\prime }& {\boldsymbol{\upbeta}}_m^{\prime}\end{array}\right]\left[\begin{array}{c}\mathbf{y}\\ {}\mathbf{m}\end{array}\right]={\boldsymbol{\upbeta}}_W^{\prime }{\mathbf{t}}_W, \)  (4.22)
where β_{y} and β_{m} are vectors of phenotypic and marker weights respectively; \( {\mathbf{y}}^{\prime }=\left[{y}_1\kern0.5em \cdots \kern0.5em {y}_t\right] \) is the vector of phenotypic values and m was defined in Eq. (4.21); \( {\boldsymbol{\upbeta}}_W^{\prime }=\left[{\boldsymbol{\upbeta}}_y^{\prime}\kern0.5em {\boldsymbol{\upbeta}}_m^{\prime}\right] \) and \( {\mathbf{t}}_W^{\prime }=\left[{\mathbf{y}}^{\prime}\kern0.5em {\mathbf{m}}^{\prime}\right] \).
The GWLMSI selection response can be written as
\( {R}_W={k}_I{\sigma}_H{\rho}_{I_WH}={k}_I\frac{{\mathbf{a}}_W^{\prime }\boldsymbol{\Psi} {\boldsymbol{\upbeta}}_W}{\sqrt{{\boldsymbol{\upbeta}}_W^{\prime }\boldsymbol{\Phi} {\boldsymbol{\upbeta}}_W}}, \)  (4.23)
where k_{I} is the standardized selection differential of the GWLMSI, \( {\sigma}_H^2={\mathbf{a}}_W^{\prime }\boldsymbol{\Psi} {\mathbf{a}}_W \) and \( Var\left({I}_W\right)={\boldsymbol{\upbeta}}_W^{\prime }\boldsymbol{\Phi} {\boldsymbol{\upbeta}}_W \) are the variances of H and I_{W}, whereas \( {\rho}_{I_WH}=\frac{{\mathbf{a}}_W^{\prime }\boldsymbol{\Psi} {\boldsymbol{\upbeta}}_W}{\sqrt{{\mathbf{a}}_W^{\prime }\boldsymbol{\Psi} {\mathbf{a}}_W}\sqrt{{\boldsymbol{\upbeta}}_W^{\prime }\boldsymbol{\Phi} {\boldsymbol{\upbeta}}_W}} \) and \( {\mathbf{a}}_W^{\prime }\boldsymbol{\Psi} {\boldsymbol{\upbeta}}_W \) are the correlation and the covariance between H and I_{W} respectively; \( \boldsymbol{\Phi} = Var\left[\begin{array}{c}\mathbf{y}\\ {}\mathbf{m}\end{array}\right]=\left[\begin{array}{cc}\mathbf{P}& {\mathbf{W}}^{\prime}\\ {}\mathbf{W}& \mathbf{M}\end{array}\right] \) and \( \boldsymbol{\Psi} = Var\left[\begin{array}{c}\mathbf{g}\\ {}\mathbf{m}\end{array}\right]=\left[\begin{array}{cc}\mathbf{C}& {\mathbf{W}}^{\prime}\\ {}\mathbf{W}& \mathbf{M}\end{array}\right] \) are block covariance matrices, where P = Var(y), M = Var(m), and C = Var(g) are the covariance matrices of the phenotypic values (y), the molecular marker coded values (m), and the genetic values (g) respectively, whereas W = Cov(y, m) = Cov(g, m) is the covariance matrix between y and m, and between g and m. The size of matrices P and C is t × t, whereas the sizes of matrices M and W are m × m and m × t respectively.
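From a theoretical point of view, Crossa and Cerón-Rojas (2011) showed that matrix M can be written as
\( \mathbf{M}=\left[\begin{array}{cccc}1& \left(1-2{\delta}_{12}\right)& \cdots & \left(1-2{\delta}_{1m}\right)\\ {}\left(1-2{\delta}_{21}\right)& 1& \cdots & \left(1-2{\delta}_{2m}\right)\\ {}\vdots & \vdots & \ddots & \vdots \\ {}\left(1-2{\delta}_{m1}\right)& \left(1-2{\delta}_{m2}\right)& \cdots & 1\end{array}\right], \)
where (1 − 2δ_{ij}) is the covariance (or correlation) and δ_{ij} the recombination frequency between the ith and jth marker (i, j = 1, 2, …, m; m = number of markers). According to Crossa and Cerón-Rojas (2011), matrix W can be written as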
where (1 − 2r_{ik})α_{qk} (i = 1, 2, …, m, k = 1, 2, …, N_{Q} = number of QTL, q = 1, 2, …, t) is the covariance between the qth trait and the ith marker; r_{ik} is the recombination frequency between the ith marker and the kth QTL; and α_{qk} is the effect of the kth QTL over the qth trait.
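The GWLMSI expected genetic gain per trait can be written as
\( {\mathbf{E}}_W={k}_I\frac{\boldsymbol{\Psi} {\boldsymbol{\upbeta}}_W}{\sqrt{{\boldsymbol{\upbeta}}_W^{\prime }\boldsymbol{\Phi} {\boldsymbol{\upbeta}}_W}}. \)  (4.24)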
All parameters in Eq. (4.24) were previously defined.
Matrix Φ could be singular, i.e., its inverse (Φ^{−1}) may not exist because matrix W is singular. Suppose that matrices Φ and Ψ are known; then, according to the LPSI theory, the GWLMSI vector of coefficients (β_{W}) that maximizes \( {\rho}_{I_WH} \) can be written as
\( {\boldsymbol{\upbeta}}_W={\boldsymbol{\Phi}}^{-}\boldsymbol{\Psi} {\mathbf{a}}_W, \)  (4.25a)
where matrix Φ^{−} denotes a generalized inverse of Φ. By Eq. (4.25a), the maximized GWLMSI selection response is
\( {R}_W={k}_I\sqrt{{\boldsymbol{\upbeta}}_W^{\prime }\boldsymbol{\Phi} {\boldsymbol{\upbeta}}_W}. \)  (4.25b)
Equations (4.25a) and (4.25b) show that the GWLMSI is a direct application of the LPSI to MAS. By Eq. (4.25a), the maximized correlation between H and I_{W} is
\( {\rho}_{HI_W}=\frac{\sigma_{I_W}}{\sigma_H}, \)  (4.25c)
where \( {\sigma}_{I_W}=\sqrt{{\boldsymbol{\upbeta}}_W^{\prime }\boldsymbol{\Phi} {\boldsymbol{\upbeta}}_W} \) is the standard deviation of I_{W} and \( {\sigma}_H=\sqrt{{\mathbf{a}}_W^{\prime }\boldsymbol{\Psi} {\mathbf{a}}_W} \) is the standard deviation of H.
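The GWLMSI coefficients can be obtained numerically with a generalized inverse. The sketch below uses the Moore-Penrose pseudoinverse (`numpy.linalg.pinv`) as one possible choice of Φ^{−}. All inputs are hypothetical: the marker covariance `M`, the marker effects `A`, and the residual genetic and environmental covariances `U` and `E_cov` are assumptions used only to build mutually consistent matrices.

```python
import numpy as np

# Hypothetical marker covariance (m = 3 markers) and marker effects on t = 2 traits.
M = np.array([[1.0, 0.6, 0.2], [0.6, 1.0, 0.6], [0.2, 0.6, 1.0]])
A = np.array([[1.0, 0.5], [0.5, -0.5], [0.0, 1.0]])   # m x t effects (assumed)
U = np.diag([0.5, 0.5])      # genetic covariance not explained by markers (assumed)
E_cov = np.diag([4.0, 3.0])  # environmental covariance (assumed)

W = M @ A                    # Cov(m, g) = Cov(m, y), size m x t
C = A.T @ M @ A + U          # Var(g)
P = C + E_cov                # Var(y)
w = np.array([1.0, 0.5])     # economic weights
k_I = 1.0

Phi = np.block([[P, W.T], [W, M]])   # Var([y; m])
Psi = np.block([[C, W.T], [W, M]])   # Var([g; m])
a_W = np.concatenate([w, np.zeros(M.shape[0])])

# GWLMSI coefficients using a generalized inverse of Phi.
beta_W = np.linalg.pinv(Phi) @ Psi @ a_W

sigma_I = np.sqrt(beta_W @ Phi @ beta_W)
sigma_H = np.sqrt(a_W @ Psi @ a_W)
R_W = k_I * sigma_I          # maximized selection response
rho_W = sigma_I / sigma_H    # GWLMSI accuracy

# LPSI accuracy for comparison (phenotypes only).
b = np.linalg.solve(P, C @ w)
rho_LPSI = np.sqrt(b @ P @ b) / np.sqrt(w @ C @ w)

print(R_W, rho_W, rho_LPSI)
```

With known parameters, the GWLMSI accuracy in this sketch is at least as high as the LPSI accuracy, because the index uses the phenotypes plus the marker information.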
4.2.2 Relationship Between the GWLMSI and the LPSI
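Matrix Φ^{−} can be written as
\( {\boldsymbol{\Phi}}^{-}=\left[\begin{array}{cc}{\mathbf{L}}^{-}& -{\mathbf{L}}^{-}{\mathbf{W}}^{\prime }{\mathbf{M}}^{-}\\ {}-{\mathbf{M}}^{-}\mathbf{W}{\mathbf{L}}^{-}& {\mathbf{M}}^{-}+{\mathbf{M}}^{-}\mathbf{W}{\mathbf{L}}^{-}{\mathbf{W}}^{\prime }{\mathbf{M}}^{-}\end{array}\right], \)  (4.26)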
where L^{−} is a generalized inverse of matrix L = P − W^{′}M^{−}W, and M^{−} is a generalized inverse of matrix M. In matrix Φ^{−}, the inverse of matrix W is not required and the standard inverse of matrix M (M^{−1}) may exist. In the latter case, the standard inverse of matrix L (L^{−1}) exists and can be written as L^{−1} = (P − W^{′}M^{−1}W)^{−1} = P^{−1} + P^{−1}W^{′}[M − WP^{−1}W^{′}]^{−1}WP^{−1} (Searle et al. 2006).
By Eq. (4.26) and because \( {\mathbf{w}}_2^{\prime }=\left[{0}_1\kern0.5em \cdots \kern0.5em {0}_m\right] \), the vector components of \( {\boldsymbol{\upbeta}}_W^{\prime }=\left[{\boldsymbol{\upbeta}}_y^{\prime}\kern0.5em {\boldsymbol{\upbeta}}_m^{\prime}\right] \), or β_{W} = Φ^{−}Ψa_{W}, can be written as
\( {\boldsymbol{\upbeta}}_y={\mathbf{L}}^{-}\left(\mathbf{C}-{\mathbf{W}}^{\prime }{\mathbf{M}}^{-}\mathbf{W}\right)\mathbf{w} \)  (4.27)
and
\( {\boldsymbol{\upbeta}}_m={\mathbf{M}}^{-}\mathbf{W}\left(\mathbf{w}-{\boldsymbol{\upbeta}}_y\right), \)  (4.28)
where w is the vector of economic weights. Suppose that there is no marker information; then, matrices M and W are null and Eq. (4.27) is equal to β_{y} = P^{−1}Cw = b (the LPSI vector of coefficients), whereas β_{m} = 0 and \( {R}_W={k}_I\sqrt{{\boldsymbol{\upbeta}}_W^{\prime }\boldsymbol{\Phi} {\boldsymbol{\upbeta}}_W}={k}_I\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}}={R}_I \), the LPSI selection response. Now suppose that the markers explain all the genetic variability; in this case, β_{y} = 0 and β_{m} = (X^{′}X)^{−}X^{′}Y, the matrix of linear regression coefficients in the multivariate context, where (X^{′}X)^{−} is a generalized inverse matrix of X^{′}X and Y is a matrix of phenotypic observations.
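The partitioned expressions for the GWLMSI weights can be checked against the direct solution β_{W} = Φ^{−}Ψa_{W}. The sketch below is a hedged illustration for a nonsingular case: the inputs (`M`, `A`, `U`, `E_cov`) are hypothetical, the pseudoinverse stands in for the generalized inverses, and the partitioned weights are taken as β_{y} = L^{−}(C − W′M^{−}W)w and β_{m} = M^{−}W(w − β_{y}), with L = P − W′M^{−}W.

```python
import numpy as np

# Hypothetical, mutually consistent covariance matrices (m = 3 markers, t = 2 traits).
M = np.array([[1.0, 0.6, 0.2], [0.6, 1.0, 0.6], [0.2, 0.6, 1.0]])
A = np.array([[1.0, 0.5], [0.5, -0.5], [0.0, 1.0]])   # marker effects (assumed)
U = np.diag([0.5, 0.5])      # genetic covariance not explained by markers (assumed)
E_cov = np.diag([4.0, 3.0])  # environmental covariance (assumed)
W = M @ A
C = A.T @ M @ A + U
P = C + E_cov
w = np.array([1.0, 0.5])

# Partitioned form based on L = P - W' M^- W.
M_g = np.linalg.pinv(M)
explained = W.T @ M_g @ W            # genetic covariance explained by the markers
L = P - explained
beta_y = np.linalg.pinv(L) @ (C - explained) @ w
beta_m = M_g @ W @ (w - beta_y)

# Direct form using the full (t + m) x (t + m) matrix Phi.
Phi = np.block([[P, W.T], [W, M]])
Psi = np.block([[C, W.T], [W, M]])
a_W = np.concatenate([w, np.zeros(M.shape[0])])
beta_direct = np.linalg.pinv(Phi) @ Psi @ a_W

print(np.allclose(np.concatenate([beta_y, beta_m]), beta_direct))
```

In this example the phenotypic weights are small because the markers explain most of the genetic covariance; if `U` were a null matrix, `beta_y` would be zero and all the weight would fall on the markers.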
4.2.3 Statistical Properties of GWLMSI
Assume that H and I_{W} have a bivariate joint normal distribution, β_{W} = Φ^{−}Ψa_{W}, and P, C, M, W, and w are known; then, the statistical GWLMSI properties are the same as the LMSI properties. That is,

1. \( {\sigma}_{I_W}^2={\sigma}_{HI_W} \), i.e., the variance of I_{W} (\( {\sigma}_{I_W}^2 \)) and the covariance between H and I_{W} (\( {\sigma}_{HI_W} \)) are the same.
2. The maximized correlation between H and I_{W}, or I_{W} accuracy, is \( {\rho}_{HI_W}=\frac{\sigma_{I_W}}{\sigma_H} \).
3. The variance of the prediction error, \( Var\left(H-{I}_W\right)=\left(1-{\rho}_{HI_W}^2\right){\sigma}_H^2 \), is minimal.
4. The total variance of H explained by I_{W} is \( {\sigma}_{I_W}^2={\rho}_{HI_W}^2{\sigma}_H^2 \).

According to Lange and Whittaker (2001), GWLMSI efficiency should be greater than LMSI efficiency. However, this would be true only if matrices P, C, M, and W are known and trait heritability is very low.
4.3 Estimating the LMSI Parameters
When covariance matrices P, C, and S, and the vector of economic weights (w) are known, there is no error in the estimation of the LMSI parameters (selection response, expected genetic gain, etc.); the same is true for the GWLMSI when, in addition to P, C, and w, the covariance matrices M and W are known. In such cases, the relative efficiency of the LMSI (GWLMSI) depends only on the heritability of the traits and on the portion of phenotypic variation associated with markers. Using simulated data, Lange and Whittaker (2001) found that GWLMSI efficiency was higher than LMSI efficiency when trait heritability was 0.2 and matrices P, C, M, and W were known. When P, C, S, M, and W are unknown, it is necessary to estimate them; then, the LMSI and GWLMSI vector of coefficients and the effects associated with markers are estimated with some error. This error leads to lower LMSI and GWLMSI efficiency than expected under the assumption that the parameters are known; however, in the latter case, Lange and Whittaker (2001) also found that GWLMSI efficiency was greater than that of the LMSI when trait heritability was 0.05. Moreover, in the LMSI there is additional bias in the estimation of the parameters because only markers with significant effects are included in the index (Moreau et al. 1998).
In Chap. 2, we described the restricted maximum likelihood (REML) method for estimating matrices P and C. Some authors (Lande and Thompson 1990; Charcosset and Gallais 1996; Hospital et al. 1997; Moreau et al. 1998, 2007) have described methods for estimating marker scores, the variance of the marker scores, the LMSI vector of coefficients, etc., in the context of one trait; however, up to now there have been no reports on the estimation of matrix S in the multitrait case. Lange and Whittaker (2001) only indicated that matrix S can be estimated as \( \widehat{\mathbf{S}}= Var\left(\widehat{\mathbf{s}}\right) \), where \( \widehat{\mathbf{s}} \) is a vector of estimated marker scores associated with several individual traits.
The main problems associated with the estimated LMSI parameters are:

1. The estimated values of the covariance matrix S (\( \widehat{\mathbf{S}} \)) tend to overestimate the genetic covariance matrix (C).
2. The estimated variances of the marker scores can be negative.

When the first point is true, the estimated LMSI selection response and efficiency could be negative because the estimated matrix \( {\widehat{\mathbf{T}}}_M=\left[\begin{array}{cc}\widehat{\mathbf{P}}& \widehat{\mathbf{S}}\\ {}\widehat{\mathbf{S}}& \widehat{\mathbf{S}}\end{array}\right] \) is not positive definite (all eigenvalues positive) and the estimated matrix \( {\widehat{\mathbf{Z}}}_M=\left[\begin{array}{cc}\widehat{\mathbf{C}}& \widehat{\mathbf{S}}\\ {}\widehat{\mathbf{S}}& \widehat{\mathbf{S}}\end{array}\right] \) is not positive semidefinite (no negative eigenvalues). In addition, the results can lead to all weights being placed on the molecular score, and the weights on the phenotypic values can be negative (Moreau et al. 2007). When the second point is true, the variance of the marker scores is not useful. The two problems indicated above could be caused by using the same data set to select markers and to estimate marker effects, and there is no simple way of solving them. Lande and Thompson (1990) proposed that the markers used to obtain \( \widehat{\mathbf{S}} \) be selected a priori as those with the most highly significant partial regression coefficients from among all the markers in the linkage group analyzed in the previous generation. Zhang and Smith (1992, 1993) proposed using two independent sets of markers: one to estimate marker effects and the other to select markers. Additional solutions to these problems were described by Moreau et al. (2007).
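A simple diagnostic for the first problem is to inspect the eigenvalues of the estimated block matrices before computing the index. The following minimal sketch uses hypothetical estimates; the helper `min_eigenvalue` is ours, and the inflated matrix `S_bad` (three times the genetic covariance) is an arbitrary example of overestimation.

```python
import numpy as np

def min_eigenvalue(A):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(A).min())

def lmsi_blocks(P_hat, C_hat, S_hat):
    """Build the estimated block matrices T_M and Z_M."""
    T_hat = np.block([[P_hat, S_hat], [S_hat, S_hat]])
    Z_hat = np.block([[C_hat, S_hat], [S_hat, S_hat]])
    return T_hat, Z_hat

# Hypothetical estimates for t = 2 traits.
P_hat = np.array([[10.0, 2.0], [2.0, 8.0]])
C_hat = np.array([[4.0, 1.0], [1.0, 3.0]])

S_ok = 0.5 * C_hat    # marker scores explain part of the genetic covariance
S_bad = 3.0 * C_hat   # marker score covariance overestimates C (assumed case)

T_ok, Z_ok = lmsi_blocks(P_hat, C_hat, S_ok)
T_bad, Z_bad = lmsi_blocks(P_hat, C_hat, S_bad)

print("admissible case:", min_eigenvalue(T_ok), min_eigenvalue(Z_ok))
print("overestimated S:", min_eigenvalue(T_bad), min_eigenvalue(Z_bad))
```

A negative smallest eigenvalue signals that the estimated matrices are not valid covariance matrices, so the resulting index weights and selection response should not be trusted.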
In this subsection, we describe methods (in the univariate and multivariate context) for estimating molecular marker effects, marker scores, and their variance and covariance, and for estimating the LMSI and GWLMSI vector of coefficients, selection response, expected genetic gain, and accuracy. This subsection is only for illustration; we use the same data set to select markers, and to estimate marker effects and the variance of marker scores.
4.3.1 Estimating the Marker Score
According to Eqs. (4.11) and (4.17b), when the vector of economic weights is equal to \( {\mathbf{a}}^{\prime }=\left[1\kern0.5em 0\right] \), the LMSI for the ith trait y_{i} (i = 1, 2, ⋯, t; t = number of traits) can be written as \( {I}_{M_{li}}={s}_i+{\beta}_{y_i}\left({y}_i-{s}_i\right) \) (l = 1, 2, ⋯, n; n = number of individuals or genotypes), where \( {\beta}_{y_i}=\frac{\sigma_{g_i}^2-{\sigma}_{s_i}^2}{\sigma_{y_i}^2-{\sigma}_{s_i}^2}=\frac{h_i^2\left(1-{q}_i\right)}{1-{q}_i{h}_i^2} \) is the LMSI coefficient, \( {h}_i^2=\frac{\sigma_{g_i}^2}{\sigma_{y_i}^2} \) is the heritability of the ith trait, and \( {q}_i=\frac{\sigma_{s_i}^2}{\sigma_{g_i}^2} \) is the proportion of genetic variance explained by the QTL or markers associated with the ith trait; \( {s}_i=\sum \limits_{j=1}^M{\theta}_j{x}_j \) (j = 1, 2, ⋯, M; M = number of selected markers) is the marker score of the ith trait; and \( {\sigma}_{y_i}^2 \), \( {\sigma}_{g_i}^2 \), and \( {\sigma}_{s_i}^2 \) are the variances of the phenotypic, genetic, and marker score values of the ith trait respectively.
The simplest way of estimating the ith marker score s_{i} is to perform a multiple linear regression of phenotypic values (y_{i}) on the coded values of the markers (x_{j}) and then select the markers statistically linked to the ith QTL that explain most of the variability in the regression model and use them to construct \( {s}_i=\sum \limits_{j\in M}{\theta}_j{x}_j \).
We can fit the model \( {y}_i^{\ast }=\sum \limits_{j\in M}{\theta}_j{x}_j+e \), where \( {y}_i^{\ast }={y}_i-{\overline{y}}_i \) and \( {\overline{y}}_i \) is the average value of the ith trait, by maximum likelihood or least squares. When estimating θ_{j}, the main problem is to choose the set of markers M based on criteria for declaring markers as significant and then use the estimated values of θ_{j} (\( {\widehat{\theta}}_j \)) to estimate the ith marker score s_{i} as \( {\widehat{s}}_i=\sum \limits_{j\in M}{\widehat{\theta}}_j{x}_j \). The values of \( {\widehat{s}}_i \) may increase or decrease according to the number of markers (x_{j}) included in the model, and \( {\widehat{s}}_i \) affects LMSI selection response and efficiency by means of the estimated variance of \( {\widehat{s}}_i \) (\( {\widehat{\sigma}}_{{\widehat{s}}_i}^2 \)) (Figs. 4.1 and 4.2).
According to the least squares method of estimation, \( \widehat{\boldsymbol{\uptheta}}={\left({\mathbf{X}}^{\prime}\mathbf{X}\right)}^{-1}{\mathbf{X}}^{\prime }{\mathbf{y}}^{\ast } \) is an estimator of the vector of regression coefficients \( {\boldsymbol{\uptheta}}^{\prime }=\left[{\theta}_1\kern0.5em {\theta}_2\kern0.5em \cdots \kern0.5em {\theta}_m\right] \), where m (m < n) is the number of markers, X is an n × m matrix of coded marker values (e.g., 1, 0, and −1 for marker genotypes AA, Aa, and aa respectively), and y^{∗} is an n × 1 vector of phenotypic values centered on their average value. Only a subset M (M < m) of the m markers is statistically linked to the QTL; therefore, only the corresponding subset of M values of the estimated vector \( \widehat{\boldsymbol{\uptheta}} \) is selected to estimate s_{i} as \( {\widehat{s}}_i=\sum \limits_{j=1}^M{\widehat{\theta}}_j{x}_j \).
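The least squares procedure described above can be sketched on simulated data. Everything in this example is hypothetical: the marker genotypes are drawn with F_{2}-like frequencies, two markers are given non-zero effects, and markers are declared significant with an arbitrary |t| > 4 rule, which is only one of many possible selection criteria.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 10   # individuals and markers (m < n)

# Simulated coded marker values (-1, 0, 1) with F2-like frequencies.
X = rng.choice([-1.0, 0.0, 1.0], size=(n, m), p=[0.25, 0.5, 0.25])

# Hypothetical true effects: only markers 0 and 3 are linked to QTL.
theta_true = np.zeros(m)
theta_true[[0, 3]] = [5.0, -4.0]
y = 100.0 + X @ theta_true + rng.normal(0.0, 1.0, size=n)

# Center the phenotypes and estimate all marker effects by least squares.
y_star = y - y.mean()
theta_hat, *_ = np.linalg.lstsq(X, y_star, rcond=None)

# t statistics of the estimated effects.
residuals = y_star - X @ theta_hat
sigma2_e = residuals @ residuals / (n - m - 1)
std_err = np.sqrt(sigma2_e * np.diag(np.linalg.inv(X.T @ X)))
t_stat = theta_hat / std_err

# Keep the markers declared significant (assumed threshold) and build the score.
selected = np.flatnonzero(np.abs(t_stat) > 4.0)
s_hat = X[:, selected] @ theta_hat[selected]

print("selected markers:", selected)
print("first five estimated marker scores:", s_hat[:5])
```

Because the same data are used here to select markers and to estimate their effects, the sketch shares the limitation discussed in this section: the estimated scores tend to be optimistic.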
To illustrate how to obtain \( {\widehat{s}}_i=\sum \limits_{j\in M}{\widehat{\theta}}_j{x}_j \), we use a real maize (Zea mays) F_{2} population with 247 genotypes (each one with two repetitions), 195 molecular markers, and four traits – grain yield (GY, ton ha^{−1}); plant height (PHT, cm), ear height (EHT, cm), and anthesis day (AD, days) – evaluated in one environment. In an F_{2} population, the marker homozygous loci for the allele from the first parental line can be coded by 1, whereas the marker homozygous loci for the allele from the second parental line can be coded by −1, and the marker heterozygous loci by 0.
For this example, we used trait PHT. Only seven markers were statistically linked to PHT. The estimated vector of regression coefficients for these seven markers was \( {\widehat{\boldsymbol{\uptheta}}}^{\prime }=\left[5.46\kern0.5em 4.54\kern0.5em 0.98\kern0.5em 7.39\kern0.5em 7.75\kern0.5em -1.91\kern0.5em 3.53\right] \). Table 4.1 presents the first 20 genotypes, the coded values of the seven selected markers, and the first 20 estimated \( {\widehat{s}}_{PHT} \) values of the 247 genotypes in the maize (Zea mays) F_{2} population. According to \( {\widehat{\boldsymbol{\uptheta}}}^{\prime } \) and the coded values of the seven markers, the first estimated \( {\widehat{s}}_{PHT} \) value was obtained as \( {\widehat{s}}_{PHT1}=-1.91(1)+3.53\left(1\right)=1.62 \); the second estimated \( {\widehat{s}}_{PHT} \) value was obtained as \( {\widehat{s}}_{PHT2}=5.46\left(-1\right)+4.54\left(1\right)-1.91\left(-1\right)=0.99 \), etc. The 20th estimated \( {\widehat{s}}_{PHT} \) value was obtained as \( {\widehat{s}}_{PHT20}=3.53\left(1\right)=3.53 \). This estimation procedure is valid for any number of genotypes and markers.
Figure 4.3 shows the distribution of the 247 estimated marker scores associated with traits PHT and EHT of the maize F_{2} population. Note that the estimated marker score values are approximately normally distributed.
4.3.2 Estimating the Variance of the Marker Score
There are many methods of estimating the variance of the marker score associated with the ith trait (\( {\sigma}_{s_i}^2 \)); the first one was proposed by Lande and Thompson (1990). According to these authors, \( {\sigma}_{s_i}^2 \) can be estimated as
where \( {\widehat{\boldsymbol{\uptheta}}}_i \) is the estimated vector of regression coefficients of the selected markers; \( {\mathbf{M}}_i=\frac{2}{n}{\mathbf{X}}_i^{\prime }{\mathbf{X}}_i \) is the M × M covariance matrix of the selected markers that are statistically linked to the ith trait marker loci; \( {\widehat{\sigma}}_{e_i}^2=\frac{{\mathbf{y}}^{\prime}\left(\mathbf{I}-\mathbf{H}\right)\mathbf{y}}{n-M-1} \) is the unbiased estimated variance of the residuals; \( \mathbf{H}={\mathbf{X}}_i{\left({\mathbf{X}}_i^{\prime }{\mathbf{X}}_i\right)}^{-1}{\mathbf{X}}_i^{\prime } \); I is an n × n identity matrix; M is the number of selected markers statistically linked to the QTL; and X_{i} is an n × M matrix with the coded values of the selected markers. According to Lande and Thompson (1990), Eq. (4.29) is an unbiased estimator of \( {\sigma}_{s_i}^2 \) and its variance can be written as
which tends to zero when n, the number of genotypes or individuals, is very high.
From Eq. (4.29), it is possible to obtain an estimator of the covariance between the ith and jth marker scores when the number of selected markers statistically linked to the QTL is the same in the ith and jth traits. Thus, by Eq. (4.29), the covariance between the ith and jth marker scores can be estimated as
where \( {\widehat{\boldsymbol{\uptheta}}}_i \) and \( {\widehat{\boldsymbol{\uptheta}}}_j \) are the estimated vectors of regression coefficients of the selected markers associated with the ith and jth trait loci respectively; \( {\mathbf{M}}_{ij}=\frac{2}{n}{\mathbf{X}}_i^{\prime }{\mathbf{X}}_j \) is the M × M covariance matrix of the markers statistically linked to the ith and jth trait marker loci; X_{i} and X_{j} are n × M matrices with the coded values of the selected markers associated with the ith and jth trait loci respectively; \( {\widehat{\sigma}}_{e_{ij}}=\frac{{\mathbf{y}}_i^{\prime}\left(\mathbf{I}-{\mathbf{H}}_{ij}\right){\mathbf{y}}_j}{n-M-1} \) is the estimated covariance of the residuals between the ith (y_{i}) and jth (y_{j}) trait values; \( {\mathbf{H}}_{ij}={\mathbf{X}}_i{\left({\mathbf{X}}_i^{\prime }{\mathbf{X}}_j\right)}^{-1}{\mathbf{X}}_j^{\prime } \); I is an n × n identity matrix; and M is the number of selected markers statistically linked to the QTL.
According to the PHT values described in Sect. 4.3.1 of this chapter, M = 7, n = 247, \( {\widehat{\sigma}}_{e_i}^2=180.80 \) and \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=48.23 \) (Eq. 4.29). Note that \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2\le {\widehat{\sigma}}_{g_{PHT}}^2 \), where \( {\widehat{\sigma}}_{g_{PHT}}^2=83.0 \) is an estimate of the genetic variance of PHT. The estimated portion of the genetic variance attributable to \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=48.23 \) was \( {\widehat{q}}_{PHT}=\frac{48.23}{83}=0.5811 \); that is, the seven markers explain 58.11% of the genetic variance associated with PHT.
Charcosset and Gallais (1996) considered two possible methods of estimating \( {\sigma}_{s_i}^2 \) based on the coefficient of multiple determination or squared multiple correlation R^{2} (note that in this case R^{2} is not the square of the selection response). The coefficient R^{2} gives the portion of the total variation in the phenotypic values that is “explained” by, or attributable to, the markers and can be written as
\[ {R}^2=\frac{{\widehat{\boldsymbol{\uptheta}}}^{\prime }{\mathbf{X}}^{\prime}\mathbf{y}-n{\overline{y}}^2}{{\mathbf{y}}^{\prime}\mathbf{y}-n{\overline{y}}^2}, \tag{4.32a} \]
where \( {\widehat{\boldsymbol{\uptheta}}}^{\prime }{\mathbf{X}}^{\prime}\mathbf{y}-n{\overline{y}}^2 \) is the overall regression sum of squares adjusted for the intercept and \( {\mathbf{y}}^{\prime}\mathbf{y}-n{\overline{y}}^2 \) is the total sum of squares adjusted for the mean. The coefficient R^{2} is equal to 1 if the fitted equation \( {y}_i={\theta}_0+\sum \limits_{j\in M}{\theta}_j{x}_j+{e}_i \) passes through all the data points, so that all residuals are null; in that case, the markers explain all the phenotypic variance. At the other extreme, R^{2} is zero if \( {\overline{y}}_i={\widehat{\theta}}_0 \) and the estimated regression coefficients are null, i.e., \( {\widehat{\theta}}_1={\widehat{\theta}}_2=\cdots ={\widehat{\theta}}_M=0 \). In the latter case, the markers do not affect the phenotypic observations and the variance of the marker score values is zero. Thus, the R^{2} values lie between 0 and 1, i.e., 0 ≤ R^{2} ≤ 1.0. Equation (4.32a) is useful for estimating \( {\sigma}_{s_i}^2 \) as \( {\widehat{\sigma}}_{y_i}^2\sum \limits_{j=1}^M{R}_j^2={\widehat{\sigma}}_s^2 \), where \( {R}_j^2 \) is the estimated R^{2} value of the jth marker and \( {\widehat{\sigma}}_{y_i}^2 \) is the estimated phenotypic variance of the ith trait; however, this is a biased estimator of \( {\sigma}_{s_i}^2 \) (Hospital et al. 1997).
Charcosset and Gallais (1996) and Hospital et al. (1997) proposed an unbiased estimator of \( {\sigma}_{s_i}^2 \) based on all the selected markers using the adjusted coefficient of multiple determination, i.e.,
\[ {R}_{Adj}^2=1-\frac{n-1}{n-M-1}\left(1-{R}^2\right), \tag{4.32b} \]
whence we can obtain an unbiased estimator of \( {\sigma}_{s_i}^2 \) as \( {\widehat{\sigma}}_y^2{R}_{Adj}^2={\widehat{\sigma}}_{{\widehat{s}}_i}^2 \) by jointly using all the markers that affect the phenotypic values. The problem with Eq. (4.32b) is that the \( {R}_{Adj}^2 \) values could be negative; in that case, the estimated value of \( {\sigma}_{s_i}^2 \) would also be negative. One additional problem with Eq. (4.32b) is that the \( {R}_{Adj}^2 \) values can produce \( {\widehat{\sigma}}_s^2 \) values that are higher than those of the estimated variance of the breeding values \( {\widehat{\sigma}}_g^2 \).
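A short sketch can make the negative-value problem concrete. It assumes that the adjusted coefficient takes the usual textbook form, 1 − (n − 1)(1 − R²)/(n − M − 1); the function names and the R² value of 0.01 are hypothetical, whereas n = 247, M = 7, the phenotypic variance 191.81, and the joint adjusted value 0.06 are the PHT figures used in this chapter.

```python
def adjusted_r2(r2, n, n_markers):
    """Adjusted coefficient of multiple determination (standard textbook form, assumed here)."""
    return 1.0 - (n - 1) / (n - n_markers - 1) * (1.0 - r2)


def marker_score_variance(phenotypic_variance, r2_adj):
    """Estimated marker score variance: phenotypic variance times adjusted R^2."""
    return phenotypic_variance * r2_adj


n, M = 247, 7
var_pht = 191.81

# With the adjusted R^2 of 0.06 reported for the seven PHT markers:
var_s = marker_score_variance(var_pht, 0.06)

# Hypothetical weak-marker case: a small raw R^2 gives a negative adjusted value,
# and therefore a negative (inadmissible) variance estimate.
weak_adj = adjusted_r2(0.01, n, M)
weak_var_s = marker_score_variance(var_pht, weak_adj)

print(round(var_s, 2), round(weak_adj, 4), round(weak_var_s, 2))
```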
Using Eqs. (4.32a) and (4.32b), we can estimate \( {\sigma}_{s_i}^2 \), but from them it is not clear how we can estimate the covariance between two different estimated marker score values.
Consider the case of the PHT values described in Sect. 4.3.1 of this chapter, where M = 7, n = 247, and the estimated variance of PHT was \( {\widehat{\sigma}}_{PHT}^2=191.81 \). The estimated values of R^{2} for each of the seven markers were 0.0038, 0.0005, 0.006, 0.0013, 0.0036, 0.0114, and 0.0298, whence, by multiplying each estimated R^{2} value by \( {\widehat{\sigma}}_{PHT}^2=191.81 \) and summing the results, we found that the estimated value of \( {\sigma}_{s_{PHT}}^2 \) was \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=9.78 \). In this case, the estimated portion of the genetic variance attributable to \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=9.78 \) was \( {\widehat{q}}_{PHT}=\frac{9.78}{83}=0.1178 \); thus, when we estimated \( {\sigma}_{s_{PHT}}^2 \) according to Eq. (4.32a), the seven markers explained only 11.78% of the genetic variance associated with PHT.
The estimated value of \( {R}_{Adj}^2 \) for the seven markers jointly was 0.06, whence \( {\widehat{\sigma}}_{s_{PHT}}^2=(191.81)(0.06)=11.50 \) is an estimate of \( {\sigma}_{s_{PHT}}^2 \). In the latter case, the estimated portion of the genetic variance attributable to \( {\widehat{\sigma}}_{s_{PHT}}^2=11.50 \) was \( {\widehat{q}}_{PHT}=\frac{11.5}{83}=0.1385 \); that is, according to Eq. (4.32b), the seven markers explain 13.85% of the genetic variance associated with PHT.
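One additional way of estimating the variance of the marker score \( {\sigma}_{s_i}^2 \) was proposed by Lange and Whittaker (2001) as
\[ {\widehat{\sigma}}_{s_i}^2=\frac{1}{n-1}\sum \limits_{k=1}^n{\left({\widehat{s}}_{ik}-{\widehat{\mu}}_{s_i}\right)}^2, \tag{4.33} \]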
where \( {\widehat{s}}_i=\sum \limits_{j=1}^M{\widehat{\theta}}_j{x}_j \) and \( {\widehat{\mu}}_{s_i} \) is the mean of \( {\widehat{s}}_i \) values. The covariance between the ith and jth marker scores can be estimated as the cross products of the marker score values divided by n − 1. Note that in this case, the number of markers associated with the ith and jth traits may be different.
For the PHT values described in Sect. 4.3.1 of this chapter, where n = 247, the estimated value of \( {\sigma}_{s_i}^2 \) was \( {\widehat{\sigma}}_{s_{PHT}}^2=15.75 \) and the estimated portion of the genetic variance attributable to \( {\widehat{\sigma}}_{s_{PHT}}^2=15.75 \) was \( {\widehat{q}}_{PHT}=\frac{15.75}{83}=0.1897 \). That is, the seven markers jointly explain 18.97% of the genetic variance associated with PHT according to Eq. (4.33).
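The empirical approach can be sketched as follows: the variance is the usual sample variance of the marker scores, and the covariance is the sum of centered cross products divided by n − 1. The two score vectors are hypothetical, and `score_covariance` is an illustrative helper name; only the final ratio uses the PHT values reported above.

```python
from statistics import mean, variance


def score_covariance(s_i, s_j):
    """Sample covariance of two marker score vectors (cross products over n - 1)."""
    mean_i, mean_j = mean(s_i), mean(s_j)
    cross = sum((a - mean_i) * (b - mean_j) for a, b in zip(s_i, s_j))
    return cross / (len(s_i) - 1)


# Hypothetical marker scores of six genotypes for two traits.
s_trait1 = [2.0, -1.0, -3.0, 3.0, 1.0, -2.0]
s_trait2 = [1.0, 0.0, -2.0, 2.0, 1.0, -2.0]

var_1 = variance(s_trait1)            # sample variance with divisor n - 1
var_2 = variance(s_trait2)
cov_12 = score_covariance(s_trait1, s_trait2)

# Portion of the genetic variance explained by the markers for PHT (values from the text).
q_pht = 15.75 / 83.0

print(var_1, var_2, cov_12, round(q_pht, 4))
```

Because each score vector is built from its own set of selected markers, this estimator does not require the two traits to share the same number of markers.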
4.3.3 Estimating LMSI Selection Response and Efficiency
With the estimated phenotypic variances (\( {\widehat{\sigma}}_{PHT}^2=191.81 \)), the estimated genetic variance (\( {\widehat{\sigma}}_{g_{PHT}}^2=83.0 \)) and the estimated marker score variances: \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=48.23 \) (Eq. 4.29), \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=9.78 \) (Eq. 4.32a), \( {\widehat{\sigma}}_{s_{PHT}}^2=11.50 \) (Eq. 4.32b), and \( {\widehat{\sigma}}_{s_{PHT}}^2=15.75 \) (Eq. 4.33), we can estimate the LMSI coefficient, selection response, and efficiency.
Using the estimated value \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=48.23 \) obtained with Eq. (4.29), it is possible to estimate the LMSI weight as \( {\widehat{\beta}}_{PHT}=\frac{{\widehat{\sigma}}_{g_{PHT}}^2-{\widehat{\sigma}}_{s_{PHT}}^2}{{\widehat{\sigma}}_{PHT}^2-{\widehat{\sigma}}_{s_{PHT}}^2}=\frac{83.0-48.23}{191.81-48.23}=0.242 \), whereas for \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=9.78 \), \( {\widehat{\sigma}}_{s_{PHT}}^2=11.50 \), and \( {\widehat{\sigma}}_{s_{PHT}}^2=15.75 \), the estimated values of β_{PHT} were 0.402, 0.40, and 0.382 respectively. These results indicate that the estimated values of β_{PHT} associated with the phenotypic values tend to decrease when the estimated values of the variance of the marker score increase. This means that at the limit, when all the genetic variance is explained by the markers, the estimated values of β_{PHT} are zero and the estimated LMSI is equal to \( {\widehat{I}}_M=\widehat{s} \). Thus, for trait PHT, when the estimated values of β_{PHT} are not zero, the estimated LMSI can be written as \( {\widehat{I}}_{M_{PHT}}={\widehat{s}}_{PHT}+{\widehat{\beta}}_{PHT}\left({PHT}_i-{\widehat{s}}_{PHT}\right) \). The \( {\widehat{I}}_{M_{PHT}} \) values are used to predict the net genetic merit of each candidate for selection and to rank and select the candidates.
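The weight calculation above can be reproduced with a few lines of Python. The sketch below uses the variance estimates reported in this section; the function names `lmsi_weight` and `lmsi_value` and the example phenotype and marker score passed to `lmsi_value` are hypothetical.

```python
def lmsi_weight(var_g, var_p, var_s):
    """Single-trait LMSI weight on the phenotype: (var_g - var_s) / (var_p - var_s)."""
    return (var_g - var_s) / (var_p - var_s)


def lmsi_value(phenotype, score, beta):
    """Single-trait LMSI value: s + beta * (phenotype - s)."""
    return score + beta * (phenotype - score)


VAR_P, VAR_G = 191.81, 83.0  # phenotypic and genetic variance of PHT
marker_score_variances = {
    "Eq. 4.29": 48.23,
    "Eq. 4.32a": 9.78,
    "Eq. 4.32b": 11.50,
    "Eq. 4.33": 15.75,
}

weights = {label: lmsi_weight(VAR_G, VAR_P, var_s)
           for label, var_s in marker_score_variances.items()}
for label, beta in weights.items():
    print(label, round(beta, 3))

# Limit case: markers explain all the genetic variance, so the weight is zero
# and the index reduces to the marker score.
beta_limit = lmsi_weight(VAR_G, VAR_P, VAR_G)

# Hypothetical candidate: phenotype 10.0 and marker score 1.62.
index_example = lmsi_value(10.0, 1.62, weights["Eq. 4.29"])
```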
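Based on the result \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=48.23 \) obtained with Eq. (4.29) and using a selection intensity of 10% (k_{I}= 1.755), the estimated LMSI selection response can be obtained as
\[ {\widehat{R}}_M=1.755\sqrt{\frac{83\left(83-48.23\right)+48.23\left(191.81-83\right)}{191.81-48.23}}=1.755\sqrt{56.65}=13.21. \]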
In a similar manner, using the result \( {\widehat{\sigma}}_{s_{PHT}}^2=15.75 \), the estimated selection response was \( {\widehat{R}}_M=1.755\sqrt{\frac{83\left(83-15.75\right)+15.75\left(191.81-83\right)}{191.81-15.75}}=1.755\sqrt{41.44}=11.30. \) With \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=9.78 \) and \( {\widehat{\sigma}}_{s_{PHT}}^2=11.50 \), the estimated values of the LMSI selection responses were 10.99 and 11.10 respectively. These results indicate that the estimated values of the LMSI selection responses tend to increase when the estimated values of the variance of the marker score increase.
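The response formula used in this calculation can be checked numerically. The sketch below applies it to the four marker score variance estimates of this section; `lmsi_response` is an illustrative function name, and values are rounded only for printing.

```python
from math import sqrt


def lmsi_response(k, var_g, var_p, var_s):
    """Single-trait LMSI selection response for selection intensity k."""
    numerator = var_g * (var_g - var_s) + var_s * (var_p - var_g)
    return k * sqrt(numerator / (var_p - var_s))


K_I = 1.755                  # selection intensity for 10% selected
VAR_P, VAR_G = 191.81, 83.0  # phenotypic and genetic variance of PHT

responses = {var_s: lmsi_response(K_I, VAR_G, VAR_P, var_s)
             for var_s in (9.78, 11.50, 15.75, 48.23)}
for var_s, response in responses.items():
    print(var_s, round(response, 2))

# With no marker information (var_s = 0) the formula reduces to the
# phenotypic selection response k * var_g / sqrt(var_p).
phenotypic_response = lmsi_response(K_I, VAR_G, VAR_P, 0.0)
```

The responses increase monotonically with the marker score variance, which is the trend described in the text.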
We can estimate LMSI versus phenotypic efficiency for one trait as \( {\widehat{\lambda}}_M=\sqrt{\frac{\widehat{q}}{{\widehat{h}}^2}+\frac{{\left(1-\widehat{q}\right)}^2}{1-\widehat{q}{\widehat{h}}^2}} \), where \( {\widehat{h}}^2 \) is the estimated trait heritability and \( \widehat{q}=\frac{{\widehat{\sigma}}_s^2}{{\widehat{\sigma}}_g^2} \) is the estimated portion of additive genetic variance explained by the markers. When \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=48.23 \), \( {\widehat{q}}_{PHT}=\frac{48.23}{83}=0.5811 \), and \( {\widehat{h}}^2=0.433 \), the estimated LMSI efficiency was \( {\widehat{\lambda}}_M=\sqrt{1.58}=1.25 \). For \( {\widehat{\sigma}}_{s_{PHT}}^2=15.75 \), \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=9.78 \), and \( {\widehat{\sigma}}_{s_{PHT}}^2=11.50 \), the estimated portions of the additive genetic variance explained by the markers were \( {\widehat{q}}_{PHT}=\frac{15.75}{83}=0.1897 \), \( {\widehat{q}}_{PHT}=\frac{9.78}{83}=0.1178 \), and \( {\widehat{q}}_{PHT}=\frac{11.5}{83}=0.1385 \) respectively, whence the estimated LMSI efficiencies were 1.1, 1.04, and 1.05 respectively. These results indicate that the estimated values of LMSI efficiency tend to increase when the estimated values of the variance of the marker score increase (Fig. 4.1).
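As a quick numerical check of the efficiency formula, the following sketch evaluates it for the four values of the explained portion of genetic variance reported for PHT, with heritability 0.433. The function name `lmsi_efficiency` is illustrative.

```python
from math import sqrt


def lmsi_efficiency(q, h2):
    """Efficiency of the single-trait LMSI relative to phenotypic selection."""
    return sqrt(q / h2 + (1.0 - q) ** 2 / (1.0 - q * h2))


H2 = 0.433  # estimated heritability of PHT
portions = {"Eq. 4.29": 0.5811, "Eq. 4.32a": 0.1178,
            "Eq. 4.32b": 0.1385, "Eq. 4.33": 0.1897}

efficiencies = {label: lmsi_efficiency(q, H2) for label, q in portions.items()}
for label, value in efficiencies.items():
    print(label, round(value, 3))

# When the markers explain nothing (q = 0), the index is no better than
# phenotypic selection and the efficiency equals 1.
no_marker_information = lmsi_efficiency(0.0, H2)
```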
Figure 4.1 presents the change in LMSI efficiency with respect to phenotypic selection for different values of the variance of the marker score when the phenotypic (191.81) and genetic (83) variances are fixed. In a similar manner, Fig. 4.2 presents the change in the LMSI selection response for different values of the variance of the marker score when the phenotypic (191.81) and genetic (83) variances are fixed. In effect, LMSI efficiency and the selection response depend on the genetic variance explained by the markers.
4.3.4 Estimating the Variance of the Marker Score in the Multi-Trait Case
Equation (4.33) can be used in the multitrait context when the numbers of markers associated with the ith and jth traits are different. Also, it is possible to adapt Eqs. (4.32a) and (4.32b) to the multitrait case. However, in the latter case, in addition to the markers linked to the QTL that affect one specific trait, we need to find markers that affect more than one trait, which may be very difficult. For this reason, in the multitrait context, Eqs. (4.32a) and (4.32b) could be used to estimate the variance of the marker score (S) without preselecting the markers that affect the phenotypic traits, only when the number of genotypes is higher than the number of markers.
Let y_{1}, y_{2}, …, y_{t} be t independent multivariate normal vectors of observations, each with n observations, such that \( \mathbf{Y}=\left[\begin{array}{cccc}{y}_{11}& {y}_{12}& \cdots & {y}_{1t}\\ {}{y}_{21}& {y}_{22}& \cdots & {y}_{2t}\\ {}\vdots & \vdots & \cdots & \vdots \\ {}{y}_{n1}& {y}_{n2}& \cdots & {y}_{nt}\end{array}\right] \) is an n × t matrix of observations for t traits; then, the multivariate linear regression model can be written as Y = XB + U, where X is an n × m matrix (m = number of markers and m < n) of known coded marker values, B is an m × t matrix of regression coefficients, and U is an n × t matrix of unobserved random disturbances whose rows for given X are uncorrelated, each with mean 0 and common covariance matrix E (Mardia et al. 1982; Rencher 2002). According to the least squares method of estimation, \( \widehat{\mathbf{B}}={\left({\mathbf{X}}^{\prime}\mathbf{X}\right)}^{-1}{\mathbf{X}}^{\prime}\mathbf{Y} \) is an estimator of B and \( \widehat{\mathbf{E}}=\frac{{\left(\mathbf{Y}-\mathbf{X}\widehat{\mathbf{B}}\right)}^{\prime}\left(\mathbf{Y}-\mathbf{X}\widehat{\mathbf{B}}\right)}{n-m-1} \) is an estimator of the residual covariance matrix E assuming that n > m (Johnson and Wichern 2007).
Note that \( 1-{R}^2=\frac{{\widehat{\mathbf{e}}}^{\prime}\widehat{\mathbf{e}}}{{\mathbf{y}}^{\prime}\mathbf{y}} \), where \( \widehat{\mathbf{e}} \) is a vector of estimated residual values of the model \( {y}_i={\theta}_0+\sum \limits_{j\in M}{\theta}_j{x}_j+{e}_i \) and R^{2} is the coefficient of multiple determination (Eq. 4.32a). In addition, because in the multitrait context the estimated matrix of residuals is \( \widehat{\mathbf{U}}=\mathbf{Y}-\mathbf{X}\widehat{\mathbf{B}} \), 1 − R^{2} can be written as \( \mathbf{D}={\left({\mathbf{Y}}^{\prime}\mathbf{Y}\right)}^{-1}{\widehat{\mathbf{U}}}^{\prime}\widehat{\mathbf{U}} \) (Mardia et al. 1982), whence R^{2} in the multivariate context can be written as
whereas \( {R}_{Adj}^2 \) (Eq. 4.32b) can be written as
where I is a t × t identity matrix, \( {\widehat{\mathbf{P}}}^{-1} \) is the inverse of the estimated covariance matrix of phenotypic values (\( \widehat{\mathbf{P}} \)), and \( \widehat{\mathbf{S}} \) is the estimated covariance matrix of marker score values. From Eq. (4.34b),
is an unbiased estimator of matrix S, whereas \( \widehat{\mathbf{P}}{\mathbf{R}}^2=\widehat{\mathbf{S}} \) (Eq. 4.34a) is a biased estimator of matrix S. The main problem with Eq. (4.34c) is that the diagonal elements of \( \widehat{\mathbf{S}} \) could be negative.
From the maize F_{2} population including 247 genotypes (each one with two repetitions) and 195 molecular markers described in Sect. 4.3.1, we used two traits—PHT (cm) and EHT (cm)—to illustrate the multivariate method of estimating the LMSI parameters. The estimated phenotypic and genetic covariance matrices were \( \widehat{\mathbf{P}}=\left[\begin{array}{cc}191.81& 106.89\\ {}106.89& 167.93\end{array}\right] \) and \( \widehat{\mathbf{C}}=\left[\begin{array}{cc}83.00& 57.44\\ {}57.44& 59.80\end{array}\right] \), whereas the estimated covariance matrix of marker scores, using Eq. (4.33), was \( \widehat{\mathbf{S}}=\left[\begin{array}{cc}15.750& 0.983\\ {}0.983& 28.083\end{array}\right] \). When we used Eq. (4.34a) and Eq. (4.34c), we obtained estimated values of the variance and covariance of the marker scores that were higher than the genetic values (data not presented). Equations (4.29) and (4.31) are used later to compare LMSI efficiency versus GWLMSI efficiency using the simulated data described in Chap. 2, Sect. 2.8.1.
With matrices \( \widehat{\mathbf{P}} \), \( \widehat{\mathbf{C}} \), and \( \widehat{\mathbf{S}} \), and the vector of economic weights \( {\mathbf{a}}^{\prime }=\left[{\mathbf{w}}^{\prime}\kern0.5em {\mathbf{0}}^{\prime}\right] \), where \( {\mathbf{w}}^{\prime }=\left[1\kern0.5em 1\right] \) and \( {\mathbf{0}}^{\prime }=\left[0\kern0.5em 0\right] \), we obtained the estimated matrices \( {\widehat{\mathbf{T}}}_M=\left[\begin{array}{cc}\widehat{\mathbf{P}}& \widehat{\mathbf{S}}\\ {}\widehat{\mathbf{S}}& \widehat{\mathbf{S}}\end{array}\right] \) and \( {\widehat{\mathbf{Z}}}_M=\left[\begin{array}{cc}\widehat{\mathbf{C}}& \widehat{\mathbf{S}}\\ {}\widehat{\mathbf{S}}& \widehat{\mathbf{S}}\end{array}\right] \), whence the estimated LMSI vector of coefficients was \( {\widehat{\boldsymbol{\upbeta}}}^{\prime }={\mathbf{a}}^{\prime }{\widehat{\mathbf{Z}}}_M{\widehat{\mathbf{T}}}_M^{-1}=\left[0.59\kern0.5em 0.18\kern0.5em 0.41\kern0.5em 0.82\right] \). Using a selection intensity of 10% (k_{I} = 1.755), the estimated LMSI selection response and the expected genetic gains per trait were \( {\widehat{R}}_M={k}_I\sqrt{\widehat{{\boldsymbol{\upbeta}}^{\prime }}{\widehat{\mathbf{T}}}_M\widehat{\boldsymbol{\upbeta}}}=20.41 \) and \( {\widehat{\mathbf{E}}}_M^{\prime }={k}_I\frac{\widehat{{\boldsymbol{\upbeta}}^{\prime }}{\widehat{\mathbf{Z}}}_M}{\sqrt{\widehat{{\boldsymbol{\upbeta}}^{\prime }}{\widehat{\mathbf{T}}}_M\widehat{\boldsymbol{\upbeta}}}}=\left[10.09\kern0.5em 10.31\kern0.5em 2.53\kern0.5em 4.39\right] \) respectively, whereas the estimated LMSI accuracy was \( {\widehat{\rho}}_{H{\widehat{I}}_M}=\frac{{\widehat{\sigma}}_{I_M}}{{\widehat{\sigma}}_H}=0.72 \).
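The two-trait LMSI computation above can be reproduced from the reported matrices with standard-library Python. This is a sketch under stated assumptions, not the authors' code: the helper names (`solve`, `matvec`, `dot`, `block`) are illustrative, and the coefficient vector is obtained by solving the linear system T β = Z a rather than by forming an explicit inverse, which gives the same result because both matrices are symmetric.

```python
from math import sqrt


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def block(a, b, c, d):
    """Assemble the partitioned matrix [[a, b], [c, d]] from 2 x 2 blocks."""
    return [ra + rb for ra, rb in zip(a, b)] + [rc + rd for rc, rd in zip(c, d)]


P = [[191.81, 106.89], [106.89, 167.93]]  # phenotypic covariance (PHT, EHT)
C = [[83.00, 57.44], [57.44, 59.80]]      # genetic covariance
S = [[15.750, 0.983], [0.983, 28.083]]    # marker score covariance (Eq. 4.33)
w = [1.0, 1.0]                            # economic weights
a = w + [0.0, 0.0]
K_I = 1.755                               # selection intensity for 10% selected

T = block(P, S, S, S)
Z = block(C, S, S, S)

beta = solve(T, matvec(Z, a))                         # beta = T^{-1} Z a
sigma_I = sqrt(dot(beta, matvec(T, beta)))            # standard deviation of the index
response = K_I * sigma_I
gains = [K_I * g / sigma_I for g in matvec(Z, beta)]  # Z symmetric: (beta'Z)' = Z beta
accuracy = sigma_I / sqrt(dot(w, matvec(C, w)))       # sigma_I / sigma_H

print([round(b, 2) for b in beta], round(response, 2),
      [round(g, 2) for g in gains], round(accuracy, 2))
```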
The estimated LPSI parameters (see Chap. 2 for details) using the phenotypic information from the maize F_{2} population for traits PHT and EHT are as follows. The estimated LPSI vector of coefficients was \( \widehat{{\mathbf{b}}^{\prime }}={\mathbf{w}}^{\prime}\widehat{\mathbf{C}}{\widehat{\mathbf{P}}}^{-1}=\left[0.53\kern0.5em 0.36\right] \), and, with a selection intensity of 10% (k_{I} = 1.755), the estimated LPSI selection response and the expected genetic gains per trait were \( {\widehat{R}}_I={k}_I\sqrt{\widehat{{\mathbf{b}}^{\prime }}\widehat{\mathbf{P}}\widehat{\mathbf{b}}}=18.97 \) and \( \widehat{{\mathbf{E}}^{\prime }}={k}_I\frac{{\widehat{\mathbf{b}}}^{\prime}\widehat{\mathbf{C}}}{{\widehat{\sigma}}_I}=\left[10.52\kern0.5em 8.45\right] \) respectively, whereas the estimated LPSI accuracy was \( {\widehat{\rho}}_{H\widehat{I}}=\frac{{\widehat{\sigma}}_I}{{\widehat{\sigma}}_H}=0.67 \).
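For comparison, the LPSI figures can be recomputed from the same phenotypic and genetic covariance matrices. The sketch below uses the closed-form inverse of a 2 × 2 matrix; the helper name `matvec2` and the variable names are illustrative.

```python
from math import sqrt

P = [[191.81, 106.89], [106.89, 167.93]]  # phenotypic covariance (PHT, EHT)
C = [[83.00, 57.44], [57.44, 59.80]]      # genetic covariance
w = [1.0, 1.0]                            # economic weights
K_I = 1.755                               # selection intensity for 10% selected


def matvec2(m, v):
    """Product of a 2 x 2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


# Closed-form inverse of the 2 x 2 phenotypic covariance matrix.
det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
P_inv = [[P[1][1] / det, -P[0][1] / det], [-P[1][0] / det, P[0][0] / det]]

b = matvec2(P_inv, matvec2(C, w))          # b = P^{-1} C w (P and C symmetric)
Pb = matvec2(P, b)
sigma_I = sqrt(b[0] * Pb[0] + b[1] * Pb[1])
response = K_I * sigma_I
gains = [K_I * g / sigma_I for g in matvec2(C, b)]
Cw = matvec2(C, w)
sigma_H = sqrt(w[0] * Cw[0] + w[1] * Cw[1])
accuracy = sigma_I / sigma_H

print([round(x, 2) for x in b], round(response, 2),
      [round(g, 2) for g in gains], round(accuracy, 2))
```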
We can determine LMSI efficiency versus LPSI efficiency to predict the net genetic merit using the ratio of estimated accuracy values \( {\widehat{\rho}}_{H{\widehat{I}}_M}=0.72 \) and \( {\widehat{\rho}}_{H\widehat{I}}=0.67 \) of the LMSI and LPSI respectively, i.e., \( {\widehat{\lambda}}_M=\frac{0.72}{0.67}=1.075 \), whence, according to Eq. (4.19), the estimated LMSI efficiency versus the LPSI efficiency, in percentage terms, was \( {\widehat{p}}_M=100\left(1.075-1\right)=7.5 \). That is, for these data, the estimated LMSI efficiency was only 7.5% greater than LPSI efficiency at predicting the net genetic merit.
4.4 Estimating the GWLMSI Parameters in the Asymptotic Context
Lange and Whittaker (2001) proposed the GWLMSI. However, these authors did not provide detailed procedures for estimating matrices P, C, W, and M. They indicated that matrix C can be estimated using the estimated matrix of covariance of marker scores (\( \widehat{\mathbf{S}} \)) and that matrices P, W, and M can be estimated directly by their empirical variances and covariances, but this assertion does not indicate a clear method for estimating those covariance matrices. In Chap. 2, we described the REML method of estimating C and P. Crossa and Cerón-Rojas (2011) described matrices W and M in a doubled haploid population. In this study, we describe and estimate matrices W and M for an F_{2} population in the asymptotic context according to the Wright and Mowers (1994) approach, which is based on regressing phenotypic values on coded marker values. We used this approach to estimate W and M because it is a clearer estimation method than that of Lange and Whittaker (2001); however, the Wright and Mowers (1994) approach is an asymptotic method and should be treated with caution.
Matrix M is the covariance matrix of the molecular marker coded values. All marker information used to construct matrix M is presented in Table 4.2. Based on this information, we found that the expectations (E(X_{1}) and E(X_{2})) and the variances (V(X_{1}) and V(X_{2})) of the marker coded values X_{1} and X_{2} are E(X_{1}) = E(X_{2}) = 0 and V(X_{1}) = V(X_{2}) = 1, whereas the covariance (Cov(X_{1}, X_{2})) and correlation (Corr(X_{1}, X_{2})) between X_{1} and X_{2} were
\[ Cov\left({X}_1,{X}_2\right)= Corr\left({X}_1,{X}_2\right)=1-2\delta . \tag{4.35} \]
Thus, as the variances of X_{1} and X_{2} are equal to 1, the correlation between X_{1} and X_{2} is \( Corr\left({X}_1,{X}_2\right)=\frac{Cov\left({X}_1,{X}_2\right)}{\sqrt{V\left({X}_1\right)V\left({X}_2\right)}}=1-2\delta \), i.e., the covariance and correlation between X_{1} and X_{2} are the same. Equation (4.35) results indicate that if we perform the same operation with many markers, we will obtain similar results; they also indicate that this is the way to construct matrix M.
Let X be an n × m matrix of coded markers, where n ≥ m and m = number of markers; then, according to Wright and Mowers (1994), because all marker information is contained in matrix X^{′}X, when the number of observations (n) tends to infinity, the product \( {\mathbf{x}}_i^{\prime }{\mathbf{x}}_j/n \) tends to the covariance between the ith and jth markers, whence matrix n^{−1}X^{′}X should tend to the covariance matrix between the markers that make up matrix X, with the ijth element equal to (0.5 − δ_{ij}). Thus, matrix 2n^{−1}X^{′}X should tend to a covariance matrix where the ijth entry is equal to (1 − 2δ_{ij}). Based on the latter result, an estimator of matrix M in the asymptotic context is
\[ \widehat{\mathbf{M}}=\frac{2}{n}{\mathbf{X}}^{\prime}\mathbf{X}. \tag{4.36} \]
Equation (4.36) is an asymptotic result and should be taken with caution. To date, there has been no clear method for estimating M in the nonasymptotic context; for this reason, Eq. (4.36) is used to estimate the GWLMSI parameters.
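A small sketch shows how the asymptotic estimator 2n^{−1}X′X behaves on coded marker data. The eight-genotype, two-marker data set is hypothetical and was built with the ideal F_{2} genotype frequencies (1/4, 1/2, 1/4) at each marker; `marker_covariance` is an illustrative function name.

```python
def marker_covariance(X):
    """Asymptotic estimator of the marker covariance matrix: (2 / n) X'X."""
    n, m = len(X), len(X[0])
    return [[2.0 / n * sum(row[i] * row[j] for row in X) for j in range(m)]
            for i in range(m)]


# Hypothetical coded markers (1, 0, -1) for 8 genotypes and 2 markers.
X = [[1, 1], [1, 0], [0, 0], [0, 0], [0, 1], [0, -1], [-1, 0], [-1, -1]]

M_hat = marker_covariance(X)

# The off-diagonal entry estimates 1 - 2*delta, so delta can be read back from it.
delta_hat = (1.0 - M_hat[0][1]) / 2.0

print(M_hat, delta_hat)
```

With real data and a finite number of genotypes, the diagonal entries will generally differ from 1, which is one reason the asymptotic result has to be used with caution.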
Assume that a QTL lies between the two markers in Table 4.2; then, δ can be written as δ = r_{1} + r_{2} − 2r_{1}r_{2}, where r_{1} and r_{2} denote the recombination frequencies between the QTL and marker 1 and between the QTL and marker 2 respectively. When the number of genotypes or individuals tends to infinity, the covariance between the phenotypic trait values (y) and the marker 1 coded values (X_{1}) in an F_{2} population can be written as
where α_{1}(1 − 2r_{1}) is the portion of the additive effect (α_{1}) of the QTL linked to marker 1 (Edwards et al. 1987), and r_{1} is the recombination frequency between the QTL and marker 1. We can assume that for many markers, the covariance of the phenotypic values is similar to Eq. (4.37), whence matrix W can be obtained.
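The relation between δ and the two recombination frequencies can be explored numerically. The sketch below uses hypothetical values (r_{1} = 0.1, r_{2} = 0.2, and an additive effect of 5.0) and illustrative function names; it also checks the algebraic identity 1 − 2δ = (1 − 2r_{1})(1 − 2r_{2}), which follows directly from expanding the product.

```python
def flanking_delta(r1, r2):
    """Recombination frequency between two markers flanking a QTL: r1 + r2 - 2*r1*r2."""
    return r1 + r2 - 2.0 * r1 * r2


def linked_effect(alpha, r):
    """Portion of the QTL additive effect captured by a marker: alpha * (1 - 2r)."""
    return alpha * (1.0 - 2.0 * r)


r1, r2 = 0.1, 0.2   # hypothetical QTL-marker recombination frequencies
alpha_1 = 5.0       # hypothetical additive effect of the QTL

delta = flanking_delta(r1, r2)
marker_covariance = 1.0 - 2.0 * delta                 # covariance between the two markers
product_form = (1.0 - 2.0 * r1) * (1.0 - 2.0 * r2)    # same quantity, factorized
effect_on_marker_1 = linked_effect(alpha_1, r1)

print(delta, marker_covariance, product_form, effect_on_marker_1)
```

A marker sitting exactly on the QTL (r = 0) captures the whole additive effect, whereas an unlinked marker (r = 0.5) captures none of it.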
Let y be an n × 1 vector of recorded phenotypic values, where n denotes the number of observations or records, and let X be an n × m matrix of coded markers. When n tends to infinity, 2n^{−1}X^{′}y tends to be a vector with elements equal to α_{i}(1 − 2r_{i}), where α_{i} is the additive effect of the ith QTL linked to the ith marker, and r_{i} is the recombination frequency between the ith QTL and the ith marker. Now let \( \mathbf{Y}=\left[\begin{array}{cccc}{y}_{11}& {y}_{12}& \cdots & {y}_{1t}\\ {}{y}_{21}& {y}_{22}& \cdots & {y}_{2t}\\ {}\vdots & \vdots & \cdots & \vdots \\ {}{y}_{n1}& {y}_{n2}& \cdots & {y}_{nt}\end{array}\right] \) be a matrix of observations for t traits; then, an estimator of matrix W in the asymptotic context is
\[ \widehat{\mathbf{W}}=\frac{2}{n}{\mathbf{X}}^{\prime}\mathbf{Y}. \tag{4.38} \]
Once again, Eq. (4.38) is an asymptotic result and should be accepted with caution. But to date, there has been no clear method for estimating W in the nonasymptotic context; for this reason, Eq. (4.38) is used to estimate the GWLMSI parameters.
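The following sketch reads the asymptotic estimator of W as 2n^{−1}X′Y, the multi-trait analogue of the 2n^{−1}X′y limit described above; that reading is an assumption of the example. The data are hypothetical: trait 1 is driven only by marker 1 with effect 3, trait 2 only by marker 2 with effect 1, and both traits are already centered.

```python
def marker_trait_covariance(X, Y):
    """Asymptotic estimator of the marker-by-trait covariance matrix: (2 / n) X'Y."""
    n, m, t = len(X), len(X[0]), len(Y[0])
    return [[2.0 / n * sum(X[k][i] * Y[k][j] for k in range(n)) for j in range(t)]
            for i in range(m)]


# Hypothetical coded markers for 8 genotypes and 2 markers.
X = [[1, 1], [1, 0], [0, 0], [0, 0], [0, 1], [0, -1], [-1, 0], [-1, -1]]

# Hypothetical centered phenotypes: trait 1 = 3 * marker 1, trait 2 = 1 * marker 2.
Y = [[3.0 * x1, 1.0 * x2] for x1, x2 in X]

W_hat = marker_trait_covariance(X, Y)
for row in W_hat:
    print(row)
```

The rows of the result index markers and the columns index traits. Marker 2 shows a non-zero covariance with trait 1 only because it is correlated with marker 1 in this data set.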
4.5 Comparing LMSI Versus LPSI and GWLMSI Efficiency
To compare LMSI efficiency versus GWLMSI efficiency for predicting the net genetic merit, we use the simulated data set described in Chap. 2, Sect. 2.8.1.
Figure 4.4 presents the estimated accuracy values of the LPSI (\( {\widehat{\rho}}_{H\widehat{I}}=\frac{{\widehat{\sigma}}_{\widehat{I}}}{{\widehat{\sigma}}_H} \)), the LMSI (\( {\widehat{\rho}}_{H{\widehat{I}}_M}=\frac{{\widehat{\sigma}}_{{\widehat{I}}_M}}{{\widehat{\sigma}}_H} \)), and the GWLMSI (\( {\widehat{\rho}}_{H{\widehat{I}}_W}=\frac{{\widehat{\sigma}}_{{\widehat{I}}_W}}{{\widehat{\sigma}}_H} \)) for five simulated selection cycles. In addition, Table 4.3 presents the estimated LPSI, LMSI, and GWLMSI selection responses; the estimated LPSI, LMSI, and GWLMSI variances of the predicted error (\( \left(1-{\widehat{\rho}}_{H\widehat{I}}^2\right){\widehat{\sigma}}_H^2 \), \( \left(1-{\widehat{\rho}}_{H{\widehat{I}}_M}^2\right){\widehat{\sigma}}_H^2 \), and \( \left(1-{\widehat{\rho}}_{H{\widehat{I}}_W}^2\right){\widehat{\sigma}}_H^2 \) respectively); and the ratios of the estimated LMSI accuracy to the estimated LPSI accuracy and of the estimated LMSI accuracy to the estimated GWLMSI accuracy, expressed as percentages (Eq. 4.19), for five simulated selection cycles.
According to Fig. 4.4, for this data set the estimated LMSI accuracy (\( {\widehat{\rho}}_{H{\widehat{I}}_M} \)) was higher than the estimated LPSI and GWLMSI accuracy (\( {\widehat{\rho}}_{H\widehat{I}} \) and \( {\widehat{\rho}}_{H{\widehat{I}}_W} \) respectively), for the five simulated selection cycles, that is, \( {\widehat{\rho}}_{H{\widehat{I}}_M}>{\widehat{\rho}}_{H\widehat{I}}>{\widehat{\rho}}_{H{\widehat{I}}_W} \). In a similar manner, Table 4.3 results indicate that the estimated LMSI selection response (\( {\widehat{R}}_M \)) was higher than the estimated LPSI and GWLMSI selection responses (\( {\widehat{R}}_I \) and \( {\widehat{R}}_W \) respectively): \( {\widehat{R}}_M>{\widehat{R}}_I>{\widehat{R}}_W \).
Note that the estimated LPSI, LMSI, and GWLMSI variances of the predicted error, and the estimated LMSI efficiency versus LPSI efficiency and versus GWLMSI efficiency (expressed in percentages) are related to the estimated LMSI, LPSI, and GWLMSI accuracies, and that in all five selection cycles, \( {\widehat{\rho}}_{H{\widehat{I}}_M}>{\widehat{\rho}}_{H\widehat{I}}>{\widehat{\rho}}_{H{\widehat{I}}_W} \). This implies that the estimated LMSI variance of the predicted error was lower than the estimated LPSI and GWLMSI variance of the predicted error. In a similar manner, because \( {\widehat{\rho}}_{H{\widehat{I}}_M}>{\widehat{\rho}}_{H\widehat{I}}>{\widehat{\rho}}_{H{\widehat{I}}_W} \), the estimated LMSI efficiency was higher than the estimated LPSI efficiency and the estimated GWLMSI efficiency.
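The link between accuracy and the variance of the predicted error can be illustrated with a short sketch. Because the simulated values of Table 4.3 are not reproduced here, the example borrows the maize accuracies from Sect. 4.3.4 (0.72 for the LMSI and 0.67 for the LPSI) and takes the variance of the net genetic merit as w′Ĉw = 257.68, computed from the genetic covariance matrix of that section with unit economic weights; the function names are illustrative.

```python
def prediction_error_variance(accuracy, var_h):
    """Variance of the predicted error: (1 - accuracy^2) * var_H."""
    return (1.0 - accuracy ** 2) * var_h


def efficiency_percent(accuracy_a, accuracy_b):
    """Efficiency of index A relative to index B in percentage terms: 100 * (ratio - 1)."""
    return 100.0 * (accuracy_a / accuracy_b - 1.0)


VAR_H = 83.00 + 2 * 57.44 + 59.80   # w'Cw for w' = [1 1], maize PHT and EHT

rho_lmsi, rho_lpsi = 0.72, 0.67     # estimated accuracies from Sect. 4.3.4

error_lmsi = prediction_error_variance(rho_lmsi, VAR_H)
error_lpsi = prediction_error_variance(rho_lpsi, VAR_H)
gain_percent = efficiency_percent(rho_lmsi, rho_lpsi)

print(round(error_lmsi, 2), round(error_lpsi, 2), round(gain_percent, 1))
```

A higher accuracy always gives a lower variance of the predicted error for a fixed variance of the net genetic merit, which is the ordering argument used in the text.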
Based on the results in Fig. 4.4 and Table 4.3, we conclude that, for this simulated data set, the LMSI was a better predictor of the net genetic merit than the LPSI, and the LPSI was a better predictor of the net genetic merit than the GWLMSI.
References
Bulmer MG (1980) The mathematical theory of quantitative genetics. Lectures in biomathematics. University of Oxford, Clarendon Press, Oxford
Charcosset A, Gallais A (1996) Estimation of the contribution of quantitative trait loci (QTL) to the variance of a quantitative trait by means of genetic markers. Theor Appl Genet 93:1193–1201
Crossa J, Cerón-Rojas JJ (2011) Multi-trait multi-environment genome-wide molecular marker selection indices. J Indian Soc Agric Stat 62(2):125–142
Dekkers JCM, Settar P (2004) Long-term selection with known quantitative trait loci. Plant Breed Rev 24:311–335
Edwards MD, Stuber CW, Wendel JF (1987) Molecular-marker-facilitated investigations of quantitative-trait loci in maize. I. Numbers, genomic distribution and types of gene action. Genetics 116:113–125
Hospital F, Moreau L, Lacoudre F, Charcosset A, Gallais A (1997) More on the efficiency of marker-assisted selection. Theor Appl Genet 95:1181–1189
Johnson RA, Wichern DW (2007) Applied multivariate statistical analysis, 6th edn. Pearson Prentice Hall, Upper Saddle River, NJ
Knapp SJ (1998) Marker-assisted selection as a strategy for increasing the probability of selecting superior genotypes. Crop Sci 38:1164–1174
Lande R, Thompson R (1990) Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124:743–756
Lange C, Whittaker JC (2001) On prediction of genetic values in marker-assisted selection. Genetics 159:1375–1381
Mardia KV, Kent JT, Bibby JM (1982) Multivariate analysis. Academic Press, New York
Moreau L, Charcosset A, Hospital F, Gallais A (1998) Marker-assisted selection efficiency in populations of finite size. Genetics 148:1353–1365
Moreau L, Hospital F, Whittaker J (2007) Marker-assisted selection and introgression. In: Balding DJ, Bishop M, Cannings C (eds) Handbook of statistical genetics, vol 1, 3rd edn. Wiley, New York, pp 718–751
Rencher AC (2002) Methods of multivariate analysis. Wiley, New York
Searle S, Casella G, McCulloch CE (2006) Variance components. Wiley, Hoboken, NJ
Whittaker JC (2003) Marker-assisted selection and introgression. In: Balding DJ, Bishop M, Cannings C (eds) Handbook of statistical genetics, vol 1, 2nd edn. Wiley, New York, pp 554–574
Wright AJ, Mowers RP (1994) Multiple regression for molecular marker, quantitative trait data from large F_{2} population. Theor Appl Genet 89:305–312
Zhang W, Smith C (1992) Computer simulation of marker-assisted selection utilizing linkage disequilibrium. Theor Appl Genet 83:813–820
Zhang W, Smith C (1993) Simulation of marker-assisted selection utilizing linkage disequilibrium: the effects of several additional factors. Theor Appl Genet 86:492–496
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2018 The Author(s)
Cite this chapter
Cerón-Rojas, J.J., Crossa, J. (2018). Linear Marker and Genome-Wide Selection Indices. In: Linear Selection Indices in Modern Plant Breeding. Springer, Cham. https://doi.org/10.1007/978-3-319-91223-3_4
DOI: https://doi.org/10.1007/978-3-319-91223-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91222-6
Online ISBN: 978-3-319-91223-3
eBook Packages: Biomedical and Life Sciences (R0)