# Linear Marker and Genome-Wide Selection Indices

## Abstract

There are two main linear marker selection indices employed in marker-assisted selection (MAS) to predict the net genetic merit and to select individual candidates as parents for the next generation: the linear marker selection index (LMSI) and the genome-wide LMSI (GW-LMSI). Both indices maximize the selection response, the expected genetic gain per trait, and the correlation with the net genetic merit; however, applying the LMSI in plant or animal breeding requires genotyping the candidates for selection; performing a linear regression of phenotypic values on the coded values of the markers such that the selected markers are statistically linked to quantitative trait loci that explain most of the variability in the regression model; constructing the marker score, and combining the marker score with phenotypic information to predict and rank the net genetic merit of the candidates for selection. On the other hand, the GW-LMSI is a single-stage procedure that treats information at each individual marker as a separate trait. Thus, all marker information can be entered together with phenotypic information into the GW-LMSI, which is then used to predict the net genetic merit and select candidates. We describe the LMSI and GW-LMSI theory and show that both indices are direct applications of the linear phenotypic selection index theory to MAS. Using real and simulated data we validated the theory of both indices.

## Keywords

Marker Scores, Phenotypic Values, Response Selection, Code Value, Marker Information

## 4.1 The Linear Marker Selection Index

### 4.1.1 Basic Conditions for Constructing the LMSI

1. The markers and the quantitative trait loci (QTL) should be in linkage disequilibrium in the population under selection.
2. The QTL effects should be combined additively both within and between loci.
3. The QTL should be in coupling mode, that is, one of the initial lines should carry all the alleles that have a positive effect, and the other line should carry all the alleles that have a negative effect.
4. The traits of interest should be affected by a few QTL with large effects (and possibly a number of very small QTL effects) rather than many small QTL effects.
5. The heritability of the traits should be low.
6. Markers correlated with the traits of interest should be identified.

Under these conditions, the LMSI should be more efficient than the LPSI, at least in the first selection cycles (Whittaker 2003; Moreau et al. 2007).

### 4.1.2 The LMSI Parameters

Let *y*_{i} = *g*_{i} + *e*_{i} be the *i*th trait (*i* = 1, 2, …, *t*; *t* = number of traits), where *e*_{i} ~ *N*(0, \( {\sigma}_{e_i}^2 \)) is the residual with expectation equal to zero and variance \( {\sigma}_{e_i}^2 \), and *N* stands for normal distribution. Assuming that the QTL effects combine additively both within and between loci, the *i*th unobservable genetic value *g*_{i} can be written as

\[ {g}_i=\sum \limits_{k=1}^{N_Q}{\alpha}_k{q}_k, \tag{4.1} \]

where *α*_{k} is the effect of the *k*th QTL, *q*_{k} is the number of favorable alleles at the *k*th QTL (2, 1 or 0), and *N*_{Q} is the number of QTL affecting the *i*th trait of interest. The *g*_{i} values in Eq. (4.1) are also not observable; however, we can use a linear combination of the markers linked to the QTL (*s*_{i}) that affect the *i*th trait to predict the *g*_{i} value as

\[ {s}_i=\sum \limits_{j=1}^M{\theta}_j{x}_j, \tag{4.2} \]

where *s*_{i} is a predictor of *g*_{i}, *θ*_{j} is the regression coefficient of the linear regression model, *x*_{j} is the coded value of the *j*th marker (e.g., 1, 0, and −1 for marker genotypes *AA*, *Aa* and *aa* respectively), and *M* is the number of selected markers linked to the QTL that affect the *i*th trait. Equation (4.2) is called the *marker score* (Lande and Thompson 1990; Whittaker 2003), and its inclusion is the main reason why the LMSI is not equal to the LPSI described in Chap. 2. The number of selected markers is only a subset of potential markers linked to QTL in the population under selection; thus, the *s*_{i} values should be lower than or equal to the *g*_{i} values. One way of estimating the *s*_{i} values is to perform a linear regression of phenotypic values on the coded values of the markers, select markers that are statistically linked to quantitative trait loci that explain most of the variability in the regression model, and then obtain the estimated value of *s*_{i} (\( {\widehat{s}}_i \)) as the sum of the products of the estimated effects of the selected markers and their coded values associated with the *i*th trait. Some authors (e.g., Moreau et al. 2007) call \( {\widehat{s}}_i \) the molecular score; in this book, we call *s*_{i} the marker score and \( {\widehat{s}}_i \) the estimated marker score.

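The objective of the LMSI is to predict the net genetic merit, which in the LMSI context can be written as

\[ H={\mathbf{w}}^{\prime}\mathbf{g}+{\mathbf{w}}_2^{\prime}\mathbf{s}={\mathbf{a}}^{\prime}\mathbf{z}, \tag{4.3} \]

where \( {\mathbf{g}}^{\prime }=\left[{g}_1\kern0.5em \cdots \kern0.5em {g}_t\right] \) is the vector of breeding values; \( {\mathbf{w}}^{\prime }=\left[{w}_1\kern0.5em \cdots \kern0.5em {w}_t\right] \) is the vector of economic weights associated with **g**; \( {\mathbf{w}}_2^{\prime }=\left[{0}_1\kern0.5em \cdots \kern0.5em {0}_t\right] \) is a null vector associated with the vector of marker scores \( {\mathbf{s}}^{\prime }=\left[{s}_1\kern0.5em \cdots \kern0.5em {s}_t\right] \); *s*_{i} is the *i*th marker score; \( {\mathbf{a}}^{\prime }=\left[{\mathbf{w}}^{\prime}\kern0.5em {\mathbf{w}}_2^{\prime}\right] \) and \( {\mathbf{z}}^{\prime }=\left[{\mathbf{g}}^{\prime}\kern0.5em {\mathbf{s}}^{\prime}\right] \).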

The LMSI combines the phenotypic values and the marker scores to predict *H* in each selection cycle and can be written as

\[ {I}_M={\boldsymbol{\upbeta}}_y^{\prime}\mathbf{y}+{\boldsymbol{\upbeta}}_s^{\prime}\mathbf{s}={\boldsymbol{\upbeta}}^{\prime}\mathbf{t}, \tag{4.4} \]

where **β**_{y} and **β**_{s} are vectors of phenotypic and marker score weights respectively; \( {\mathbf{y}}^{\prime }=\left[{y}_1\kern0.5em \cdots \kern0.5em {y}_t\right] \) is the vector of trait phenotypic values and **s** was defined in Eq. (4.3); \( {\boldsymbol{\upbeta}}^{\prime }=\left[{\boldsymbol{\upbeta}}_y^{\prime}\kern0.5em {\boldsymbol{\upbeta}}_s^{\prime}\right] \) and \( {\mathbf{t}}^{\prime }=\left[{\mathbf{y}}^{\prime}\kern0.5em {\mathbf{s}}^{\prime}\right] \).

The LMSI selection response can be written as

\[ {R}_M={k}_I{\sigma}_H{\rho}_{I_MH}={k}_I\frac{{\mathbf{a}}^{\prime }{\mathbf{Z}}_M\boldsymbol{\upbeta}}{\sqrt{{\boldsymbol{\upbeta}}^{\prime }{\mathbf{T}}_M\boldsymbol{\upbeta}}}, \tag{4.5} \]

where *k*_{I} is the standardized selection differential of the LMSI, \( {\sigma}_H=\sqrt{{\mathbf{a}}^{\prime }{\mathbf{Z}}_M\mathbf{a}} \) and \( \sqrt{{\boldsymbol{\upbeta}}^{\prime }{\mathbf{T}}_M\boldsymbol{\upbeta}} \) are the standard deviations of *H* and *I*_{M}, whereas \( {\rho}_{I_MH} \) and **a**′**Z**_{M}**β** are the correlation and the covariance between *H* and *I*_{M} respectively; \( {\mathbf{T}}_M= Var\left[\begin{array}{c}\mathbf{y}\\ {}\mathbf{s}\end{array}\right]=\left[\begin{array}{cc}\mathbf{P}& \mathbf{S}\\ {}\mathbf{S}& \mathbf{S}\end{array}\right] \) and \( {\mathbf{Z}}_M= Var\left[\begin{array}{c}\mathbf{g}\\ {}\mathbf{s}\end{array}\right]=\left[\begin{array}{cc}\mathbf{C}& \mathbf{S}\\ {}\mathbf{S}& \mathbf{S}\end{array}\right] \) are block covariance matrices where **P** = *Var*(**y**), **S** = *Var*(**s**), and **C** = *Var*(**g**) are the covariance matrices of phenotypic values (**y**), the marker score (**s**), and the genetic value (**g**) respectively in the population. Vectors **a** and **β** were defined in Eqs. (4.3) and (4.4) respectively.

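The LMSI expected genetic gain per trait can be written as

\[ {\mathbf{E}}_M={k}_I\frac{{\mathbf{Z}}_M\boldsymbol{\upbeta}}{\sqrt{{\boldsymbol{\upbeta}}^{\prime }{\mathbf{T}}_M\boldsymbol{\upbeta}}}. \tag{4.6} \]

All the parameters in Eq. (4.6) were previously defined.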

### 4.1.3 The Maximized LMSI Parameters

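Suppose that **P**, **S** and **C** are known matrices; then, matrices **T**_{M} and **Z**_{M} are known and, according to the LPSI theory (see Chap. 2 for details), the LMSI vector of coefficients (**β**_{M}) that maximizes \( {\rho}_{I_MH} \), *R*_{M}, and **E**_{M} can be written as

\[ {\boldsymbol{\upbeta}}_M={\mathbf{T}}_M^{-1}{\mathbf{Z}}_M\mathbf{a}, \tag{4.7} \]

whence the maximized selection response can be written as

\[ {R}_M={k}_I\sqrt{{\boldsymbol{\upbeta}}_M^{\prime }{\mathbf{T}}_M{\boldsymbol{\upbeta}}_M}, \tag{4.8a} \]

and the maximized correlation between *H* and *I*_{M} can be written as \( {\rho}_{I_MH}=\frac{\sigma_{I_M}}{\sigma_H} \), where \( {\sigma}_{I_M}=\sqrt{{\boldsymbol{\upbeta}}_M^{\prime }{\mathbf{T}}_M{\boldsymbol{\upbeta}}_M} \) is the standard deviation of *I*_{M} and \( {\sigma}_H=\sqrt{{\mathbf{a}}^{\prime }{\mathbf{Z}}_M\mathbf{a}} \) is the standard deviation of *H*. Equations (4.8a) and (4.8b) show that the LMSI is a direct application of the LPSI theory in the marker-assisted selection (MAS) context.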
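As a rough numerical illustration of this subsection, the sketch below builds the block matrices **T**_{M} and **Z**_{M} for two traits and obtains the coefficients as the solution of **T**_{M}**β** = **Z**_{M}**a**. All numbers (the matrices `P`, `C`, `S`, the weights `w`, and the selection intensity `k_I`) are hypothetical values chosen only so that the matrices are positive definite, and the helper names are not from the source.

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def matvec(A, v):
    """Matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def block(A, B, C, D):
    """Assemble the block matrix [[A, B], [C, D]] from equally sized blocks."""
    return [ra + rb for ra, rb in zip(A, B)] + [rc + rd for rc, rd in zip(C, D)]

# Hypothetical covariance matrices for t = 2 traits (assumed values).
P = [[10.0, 2.0], [2.0, 8.0]]   # phenotypic covariance
C = [[4.0, 1.0], [1.0, 3.0]]    # genetic covariance
S = [[2.0, 0.5], [0.5, 1.5]]    # marker-score covariance
w = [1.0, 0.5]                  # economic weights
k_I = 1.755                     # assumed intensity (about 10% selected)

T_M = block(P, S, S, S)
Z_M = block(C, S, S, S)
a = w + [0.0, 0.0]              # a' = [w' 0']

beta = solve(T_M, matvec(Z_M, a))                      # T_M beta = Z_M a
var_I = sum(b * x for b, x in zip(beta, matvec(T_M, beta)))
var_H = sum(x * y for x, y in zip(a, matvec(Z_M, a)))
R_M = k_I * math.sqrt(var_I)                           # selection response
rho = math.sqrt(var_I / var_H)                         # accuracy
```

In this sketch the last two entries of `beta` (the marker score weights) equal `w` minus the first two entries (the phenotypic weights), which is the pattern exploited in the rest of the subsection.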

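Let **Q** = \( {\mathbf{T}}_M^{-1}{\mathbf{Z}}_M \). Matrix **Q** can be written as

\[ \mathbf{Q}={\mathbf{T}}_M^{-1}{\mathbf{Z}}_M=\left[\begin{array}{cc}{\left(\mathbf{P}-\mathbf{S}\right)}^{-1}\left(\mathbf{C}-\mathbf{S}\right)& \mathbf{0}\\ {}\mathbf{I}-{\left(\mathbf{P}-\mathbf{S}\right)}^{-1}\left(\mathbf{C}-\mathbf{S}\right)& \mathbf{I}\end{array}\right]; \]

then **β** = **Qa**, and as \( {\mathbf{w}}_2^{\prime }=\left[{0}_1\kern0.5em \cdots \kern0.5em {0}_t\right] \), we can write the two vectors of \( {\boldsymbol{\upbeta}}^{\prime }=\left[{\boldsymbol{\upbeta}}_y^{\prime}\kern0.5em {\boldsymbol{\upbeta}}_s^{\prime}\right] \) as

\[ {\boldsymbol{\upbeta}}_y={\left(\mathbf{P}-\mathbf{S}\right)}^{-1}\left(\mathbf{C}-\mathbf{S}\right)\mathbf{w} \tag{4.10a} \]

and

\[ {\boldsymbol{\upbeta}}_s=\left[\mathbf{I}-{\left(\mathbf{P}-\mathbf{S}\right)}^{-1}\left(\mathbf{C}-\mathbf{S}\right)\right]\mathbf{w}=\mathbf{w}-{\boldsymbol{\upbeta}}_y. \tag{4.10b} \]

By Eq. (4.10b), the optimal LMSI can be written as

\[ {I}_M={\boldsymbol{\upbeta}}_y^{\prime}\mathbf{y}+{\left(\mathbf{w}-{\boldsymbol{\upbeta}}_y\right)}^{\prime}\mathbf{s}={\mathbf{w}}^{\prime}\mathbf{s}+{\boldsymbol{\upbeta}}_y^{\prime}\left(\mathbf{y}-\mathbf{s}\right), \tag{4.11} \]

so that the only vector of coefficients required is **β**_{y}. By Eq. (4.10a), Eq. (4.8a) can be written as

\[ {R}_M={k}_I\sqrt{{\boldsymbol{\upbeta}}_y^{\prime}\left(\mathbf{P}-\mathbf{S}\right){\boldsymbol{\upbeta}}_y+{\mathbf{w}}^{\prime}\mathbf{Sw}}. \tag{4.12} \]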

Thus, by Eqs. (4.10a) and (4.12), when **S** is a null matrix, vector **β**_{y} is equal to **β**_{y} = **P**^{−1}**Cw** = **b** and \( {R}_M={k}_I\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}}={R}_I \), which are the LPSI vector of coefficients and its selection response respectively.

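Now suppose that the markers explain all the genetic variability, so that **S** tends to **C**; then, at the limit, we can suppose that **S** = **C**, and by this latter result, **β**_{y} = **0** and *R*_{M} is equal to

\[ {R}_M={k}_I\sqrt{{\mathbf{w}}^{\prime}\mathbf{Cw}}={k}_I{\sigma}_H. \]

Because the LPSI selection response is \( {R}_I={k}_I\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}}={k}_I{\sigma}_I \), the efficiency of the LMSI relative to the LPSI at this limit is

\[ \frac{R_M}{R_I}=\frac{\sigma_H}{\sigma_I}. \tag{4.15} \]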

Note that \( \frac{\sigma_H}{\sigma_I}=\frac{1}{\rho_{HI}} \), where *ρ*_{HI} is the maximized correlation between the net genetic merit (*H*) and the LPSI (*I*) described in Chap. 2. Equation (4.15) therefore indicates that LMSI efficiency tends to infinity when the *ρ*_{HI} value tends to zero, which is an additional way of denoting the paradox of LMSI efficiency described by Knapp (1998).

### 4.1.4 The LMSI for One Trait

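In the one-trait case (*t* = 1), matrices **T**_{M}, **Z**_{M}, and **Q** can be written as

\[ {\mathbf{T}}_M=\left[\begin{array}{cc}{\sigma}_y^2& {\sigma}_s^2\\ {}{\sigma}_s^2& {\sigma}_s^2\end{array}\right],\kern1em {\mathbf{Z}}_M=\left[\begin{array}{cc}{\sigma}_g^2& {\sigma}_s^2\\ {}{\sigma}_s^2& {\sigma}_s^2\end{array}\right],\kern1em \mathbf{Q}=\left[\begin{array}{cc}\frac{\sigma_g^2-{\sigma}_s^2}{\sigma_y^2-{\sigma}_s^2}& 0\\ {}\frac{\sigma_y^2-{\sigma}_g^2}{\sigma_y^2-{\sigma}_s^2}& 1\end{array}\right], \]

where \( {\sigma}_y^2 \), \( {\sigma}_g^2 \), and \( {\sigma}_s^2 \) are the phenotypic, genetic, and marker score variances respectively. With \( {\mathbf{a}}^{\prime }=\left[1\kern0.5em 0\right] \), the two entries of **β** = **Qa** (Eqs. 4.17a and 4.17b) are

\[ {\beta}_y=\frac{\sigma_g^2-{\sigma}_s^2}{\sigma_y^2-{\sigma}_s^2}\kern1em \mathrm{and}\kern1em {\beta}_s=\frac{\sigma_y^2-{\sigma}_g^2}{\sigma_y^2-{\sigma}_s^2}=1-{\beta}_y, \]

and the maximized LMSI selection response is

\[ {R}_M=k\sqrt{{\sigma}_s^2+\frac{{\left({\sigma}_g^2-{\sigma}_s^2\right)}^2}{\sigma_y^2-{\sigma}_s^2}}. \tag{4.18} \]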

When \( {\sigma}_s^2=0 \), \( {\beta}_y=\frac{\sigma_g^2}{\sigma_y^2}={h}^2 \), *I*_{M} = *h*^{2}*y*, and \( {R}_M=k\frac{\sigma_g^2}{\sigma_y}=k{\sigma}_y{h}^2=R \), the selection response for the one-trait case without markers.
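The one-trait formulas can be checked numerically. The following is a minimal sketch: the function name and the phenotypic variance used in the example call are assumptions, while the genetic and marker score variances (83.0 and 48.23) are the PHT estimates reported later in this chapter.

```python
import math

def lmsi_one_trait(var_y, var_g, var_s, k=1.0):
    """One-trait LMSI weights and selection response.

    var_y, var_g, var_s: phenotypic, genetic and marker-score variances.
    k: standardized selection differential.
    """
    beta_y = (var_g - var_s) / (var_y - var_s)   # weight on the phenotype
    beta_s = 1.0 - beta_y                        # weight on the marker score
    response = k * math.sqrt(var_s + (var_g - var_s) ** 2 / (var_y - var_s))
    return beta_y, beta_s, response

# var_y = 200.0 is a hypothetical phenotypic variance, used for illustration.
beta_y, beta_s, R_M = lmsi_one_trait(var_y=200.0, var_g=83.0, var_s=48.23)

# Without marker information the index collapses to phenotypic selection.
b0, s0, R0 = lmsi_one_trait(var_y=200.0, var_g=83.0, var_s=0.0)
```

With `var_s = 0.0` the phenotypic weight equals the heritability and the response equals *kσ*_{y}*h*^{2}, as stated above.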

### 4.1.5 Efficiency of LMSI Versus LPSI Efficiency for One Trait

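Assuming that the selection intensity is the same for both indices, the efficiency of the LMSI relative to the LPSI can be measured by the ratio *λ*_{M} = *R*_{M}/*R*_{I}, where *R*_{M} is the maximized LMSI selection response and *R*_{I} is the maximized LPSI selection response. In percentage terms, the LMSI versus LPSI efficiency can be written as

\[ {p}_M=100\left({\lambda}_M-1\right). \]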

When *p*_{M} = 0, the efficiency of both indices is the same; when *p*_{M} > 0, the efficiency of the LMSI is higher than that of the LPSI, and when *p*_{M} < 0, LPSI efficiency is higher than LMSI efficiency for predicting the net genetic merit.

For the one-trait case, the efficiency of the LMSI relative to phenotypic selection can be written as

\[ \frac{R_M}{R}=\sqrt{\frac{q}{h^2}+\frac{{\left(1-q\right)}^2}{1-q{h}^2}}, \tag{4.20} \]

where *R*_{M} was defined in Eq. (4.18), *R* = *kσ*_{y}*h*^{2} is the selection response for phenotypic selection, *h*^{2} is the trait heritability, and \( q=\frac{\sigma_s^2}{\sigma_g^2} \) is the proportion of additive genetic variance explained by the markers. According to Eq. (4.20), the advantage of the LMSI over phenotypic selection increases as the population size increases and heritability decreases, because in such cases, *q* tends to 1 and Eq. (4.20) approaches \( \frac{1}{h} \). Therefore, the LMSI is most efficient for traits with low heritability and when the marker score explains a large proportion of the genetic variance. Note that when *h*^{2} tends to zero, \( \frac{1}{h} \) tends to infinity; this means that in the asymptotic context, LMSI efficiency with respect to phenotypic selection for one trait (Eq. 4.20) tends to infinity, and this is the LMSI paradox pointed out by Knapp (1998). There are other problems associated with the LMSI: it increases the selection response only in the short term and can result in lower cumulative responses in the longer term than phenotypic selection, as the LMSI fixes the QTL at a faster rate than phenotypic selection. In addition, it requires the weights (Eq. 4.17a) to be updated, because in each generation the frequency of the QTL changes (Dekkers and Settar 2004).
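To make the behavior of this efficiency concrete, the sketch below evaluates the ratio of the one-trait LMSI response to the phenotypic selection response as a function of *h*^{2} and *q*. The closed form used here follows from dividing the one-trait LMSI response by *kσ*_{y}*h*^{2}; the function name and the heritability values are illustrative assumptions.

```python
import math

def lmsi_vs_phenotypic(h2, q):
    """Ratio R_M / R for one trait.

    h2: trait heritability (0 < h2 <= 1).
    q:  proportion of genetic variance explained by the markers (0 <= q <= 1).
    The case h2 = q = 1 is excluded because the expression becomes 0/0.
    """
    return math.sqrt(q / h2 + (1.0 - q) ** 2 / (1.0 - q * h2))

# q = 0.5811 is the PHT estimate given later in the chapter; the
# heritabilities are hypothetical.
for h2 in (0.8, 0.5, 0.2, 0.05):
    print(h2, round(lmsi_vs_phenotypic(h2, 0.5811), 3))
```

The ratio is 1 when *q* = 0, equals 1/*h* when *q* = 1, and grows without bound as *h*^{2} approaches zero, which is the paradox mentioned above.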

### 4.1.6 Statistical LMSI Properties

Assume that *H* and *I*_{M} have bivariate joint normal distribution, that \( \boldsymbol{\upbeta} ={\mathbf{T}}_M^{-1}{\mathbf{Z}}_M\mathbf{a} \), and that **P**, **C**, **S**, and **w** are known; then, the statistical LMSI properties are the same as the LPSI properties described in Chap. 2. That is,

1. \( {\sigma}_{I_M}^2={\sigma}_{HI_M} \): the variance of *I*_{M} (\( {\sigma}_{I_M}^2 \)) and the covariance between *H* and *I*_{M} (\( {\sigma}_{HI_M} \)) are the same.
2. The maximized correlation between *H* and *I*_{M} (or *I*_{M} accuracy) is \( {\rho}_{HI_M}=\frac{\sigma_{I_M}}{\sigma_H} \).
3. The variance of the predicted error, \( Var\left(H-{I}_M\right)=\left(1-{\rho}_{HI_M}^2\right){\sigma}_H^2 \), is minimal.
4. The total variance of *H* explained by *I*_{M} is \( {\sigma}_{I_M}^2={\rho}_{HI_M}^2{\sigma}_H^2 \).
5. The heritability of *I*_{M} is \( {\mathrm{h}}_{\mathrm{M}}^2=\frac{{\boldsymbol{\upbeta}}_M^{\prime }{\mathbf{Z}}_M{\boldsymbol{\upbeta}}_M}{{\boldsymbol{\upbeta}}_M^{\prime }{\mathbf{T}}_M{\boldsymbol{\upbeta}}_M} \).

Properties 1 to 4 are the same as LPSI properties 1 to 4, but, because the LMSI jointly incorporates the phenotypic and marker information to predict the net genetic merit, LMSI accuracy should be higher than LPSI accuracy. The same is true of the LMSI selection response and expected genetic gain per trait when compared with the LPSI selection response and expected genetic gain per trait.
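Properties 1 and 2 follow in two lines from \( \boldsymbol{\upbeta} ={\mathbf{T}}_M^{-1}{\mathbf{Z}}_M\mathbf{a} \) and the symmetry of **T**_{M} and **Z**_{M}. The derivation below is a sketch added for illustration, using only quantities already defined in Sect. 4.1.2:

```latex
\begin{aligned}
\sigma_{I_M}^2
  &= \boldsymbol{\beta}^{\prime}\mathbf{T}_M\boldsymbol{\beta}
   = \mathbf{a}^{\prime}\mathbf{Z}_M\mathbf{T}_M^{-1}\mathbf{T}_M\mathbf{T}_M^{-1}\mathbf{Z}_M\mathbf{a}
   = \mathbf{a}^{\prime}\mathbf{Z}_M\mathbf{T}_M^{-1}\mathbf{Z}_M\mathbf{a}
   = \mathbf{a}^{\prime}\mathbf{Z}_M\boldsymbol{\beta}
   = \sigma_{HI_M}, \\[4pt]
\rho_{HI_M}
  &= \frac{\sigma_{HI_M}}{\sigma_H\,\sigma_{I_M}}
   = \frac{\sigma_{I_M}^2}{\sigma_H\,\sigma_{I_M}}
   = \frac{\sigma_{I_M}}{\sigma_H}.
\end{aligned}
```

Property 4 is then a restatement of property 2 after squaring both sides.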

## 4.2 The Genome-Wide Linear Selection Index

The genome-wide linear marker selection index (GW-LMSI) is a single-stage procedure that treats information at each individual marker as a separate trait. Thus, all marker information can be entered together with phenotypic information into the GW-LMSI, which is then used to predict the net genetic merit. In a similar manner to the LMSI, the GW-LMSI exploits the linkage disequilibrium between markers and the QTL produced when inbred lines are crossed.

### 4.2.1 The GW-LMSI Parameters

The net genetic merit in the GW-LMSI context can be written as

\[ H={\mathbf{w}}^{\prime}\mathbf{g}+{\mathbf{w}}_2^{\prime}\mathbf{m}={\mathbf{a}}_W^{\prime }{\mathbf{z}}_W, \tag{4.21} \]

where \( {\mathbf{g}}^{\prime }=\left[{g}_1\kern0.5em \cdots \kern0.5em {g}_t\right] \) (*j* = 1, 2, …, *t*; *t* = number of traits) is the vector of breeding values, \( {\mathbf{w}}^{\prime }=\left[{w}_1\kern0.5em \cdots \kern0.5em {w}_t\right] \) is the vector of economic weights associated with the breeding values, and \( {\mathbf{w}}_2^{\prime }=\left[{0}_1\kern0.5em \cdots \kern0.5em {0}_m\right] \) is a null vector associated with the coded values of the markers \( {\mathbf{m}}^{\prime }=\left[{m}_1\kern0.5em \cdots \kern0.5em {m}_m\right] \), where *m*_{j} (*j* = 1, 2, …, *m*; *m* = number of markers) is the coded value of the *j*th marker in the training population; \( {\mathbf{a}}_W^{\prime }=\left[{\mathbf{w}}^{\prime}\kern0.5em {\mathbf{w}}_2^{\prime}\right] \) and \( {\mathbf{z}}_W^{\prime }=\left[{\mathbf{g}}^{\prime}\kern0.5em {\mathbf{m}}^{\prime}\right] \).

The GW-LMSI (*I*_{W}) combines the phenotypic value and the molecular information linked to the individual traits to predict *H* values in each selection cycle. It can be written as

\[ {I}_W={\boldsymbol{\upbeta}}_y^{\prime}\mathbf{y}+{\boldsymbol{\upbeta}}_m^{\prime}\mathbf{m}={\boldsymbol{\upbeta}}_W^{\prime }{\mathbf{t}}_W, \tag{4.22} \]

where **β**_{y} and **β**_{m} are vectors of phenotypic and marker weights respectively; \( {\mathbf{y}}^{\prime }=\left[{y}_1\kern0.5em \cdots \kern0.5em {y}_t\right] \) is the vector of phenotypic values and **m** was defined in Eq. (4.21); \( {\boldsymbol{\upbeta}}_W^{\prime }=\left[{\boldsymbol{\upbeta}}_y^{\prime}\kern0.5em {\boldsymbol{\upbeta}}_m^{\prime}\right] \) and \( {\mathbf{t}}_W^{\prime }=\left[{\mathbf{y}}^{\prime}\kern0.5em {\mathbf{m}}^{\prime}\right] \).

The GW-LMSI selection response can be written as

\[ {R}_W={k}_I{\sigma}_H{\rho}_{I_WH}={k}_I\frac{{\mathbf{a}}_W^{\prime }{\boldsymbol{\Psi} \boldsymbol{\upbeta}}_W}{\sqrt{{\boldsymbol{\upbeta}}_W^{\prime }{\boldsymbol{\Phi} \boldsymbol{\upbeta}}_W}}, \tag{4.23} \]

where *k*_{I} is the standardized selection differential of the GW-LMSI, \( {\sigma}_H^2={\mathbf{a}}_W^{\prime }{\boldsymbol{\Psi} \mathbf{a}}_W \) and \( Var\left({I}_W\right)={\boldsymbol{\upbeta}}_W^{\prime }{\boldsymbol{\Phi} \boldsymbol{\upbeta}}_W \) are the variances of *H* and *I*_{W}, whereas \( {\rho}_{I_WH}=\frac{{\mathbf{a}}_W^{\prime }{\boldsymbol{\Psi} \boldsymbol{\upbeta}}_W}{\sqrt{{\mathbf{a}}_W^{\prime }{\boldsymbol{\Psi} \mathbf{a}}_W}\sqrt{{\boldsymbol{\upbeta}}_W^{\prime }{\boldsymbol{\Phi} \boldsymbol{\upbeta}}_W}} \) and \( {\mathbf{a}}_W^{\prime }{\boldsymbol{\Psi} \boldsymbol{\upbeta}}_W \) are the correlation and the covariance between *H* and *I*_{W} respectively; \( \boldsymbol{\Phi} = Var\left[\begin{array}{c}\mathbf{y}\\ {}\mathbf{m}\end{array}\right]=\left[\begin{array}{cc}\mathbf{P}& {\mathbf{W}}^{\prime}\\ {}\mathbf{W}& \mathbf{M}\end{array}\right] \) and \( \boldsymbol{\Psi} = Var\left[\begin{array}{c}\mathbf{g}\\ {}\mathbf{m}\end{array}\right]=\left[\begin{array}{cc}\mathbf{C}& {\mathbf{W}}^{\prime}\\ {}\mathbf{W}& \mathbf{M}\end{array}\right] \) are block covariance matrices where **P** = *Var*(**y**), **M** = *Var*(**m**), and **C** = *Var*(**g**) are the covariance matrices of phenotypic values (**y**), the molecular marker coded values (**m**), and the genetic values (**g**) respectively, whereas **W** = *Cov*(**y**, **m**) = *Cov*(**g**, **m**) is the covariance matrix between **y** and **m**, and between **g** and **m**. The size of matrices **P** and **C** is *t* × *t*, but the sizes of matrices **M** and **W** are *m* × *m* and *m* × *t* respectively.

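Matrix **M** can be written as

\[ \mathbf{M}=\left[\begin{array}{cccc}1& \left(1-2{\delta}_{12}\right)& \cdots & \left(1-2{\delta}_{1m}\right)\\ {}\left(1-2{\delta}_{21}\right)& 1& \cdots & \left(1-2{\delta}_{2m}\right)\\ {}\vdots & \vdots & \ddots & \vdots \\ {}\left(1-2{\delta}_{m1}\right)& \left(1-2{\delta}_{m2}\right)& \cdots & 1\end{array}\right], \]

where (1 − 2*δ*_{ij}) is the covariance (or correlation) and *δ*_{ij} the recombination frequency between the *i*th and *j*th marker (*i*, *j* = 1, 2, …, *m*; *m* = number of markers). According to Crossa and Cerón-Rojas (2011), matrix **W** can be written with elements of the form (1 − 2*r*_{ik})*α*_{qk}, where (1 − 2*r*_{ik})*α*_{qk} (*i* = 1, 2, …, *m*; *k* = 1, 2, …, *N*_{Q}, *N*_{Q} = number of QTL; *q* = 1, 2, …, *t*) is the covariance between the *q*th trait and the *i*th marker; *r*_{ik} is the recombination frequency between the *i*th marker and the *k*th QTL; and *α*_{qk} is the effect of the *k*th QTL over the *q*th trait.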

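The GW-LMSI expected genetic gain per trait can be written as

\[ {\mathbf{E}}_W={k}_I\frac{{\boldsymbol{\Psi} \boldsymbol{\upbeta}}_W}{\sqrt{{\boldsymbol{\upbeta}}_W^{\prime }{\boldsymbol{\Phi} \boldsymbol{\upbeta}}_W}}. \tag{4.24} \]

All parameters in Eq. (4.24) were previously defined.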

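Matrix **Φ** could be singular, i.e., its inverse (**Φ**^{−1}) may not exist because matrix **W** is singular. Suppose that matrices **Φ** and **Ψ** are known; then, according to the LPSI theory, the GW-LMSI vector of coefficients (**β**_{W}) that maximizes \( {\rho}_{I_WH} \) can be written as

\[ {\boldsymbol{\upbeta}}_W={\boldsymbol{\Phi}}^{-}{\boldsymbol{\Psi} \mathbf{a}}_W, \tag{4.25a} \]

where **Φ**^{−} denotes a generalized inverse of **Φ**. By Eq. (4.25a), the maximized GW-LMSI selection response is

\[ {R}_W={k}_I\sqrt{{\boldsymbol{\upbeta}}_W^{\prime }{\boldsymbol{\Phi} \boldsymbol{\upbeta}}_W}, \]

and the maximized correlation between *H* and *I*_{W} is

\[ {\rho}_{I_WH}=\frac{\sigma_{I_W}}{\sigma_H}, \]

where \( {\sigma}_{I_W}=\sqrt{{\boldsymbol{\upbeta}}_W^{\prime }{\boldsymbol{\Phi} \boldsymbol{\upbeta}}_W} \) is the standard deviation of *I*_{W} and \( {\sigma}_H=\sqrt{{\mathbf{a}}_W^{\prime }{\boldsymbol{\Psi} \mathbf{a}}_W} \) is the standard deviation of *H*.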

### 4.2.2 Relationship Between the GW-LMSI and the LPSI

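Matrix **Φ**^{−} can be written as

\[ {\boldsymbol{\Phi}}^{-}=\left[\begin{array}{cc}{\mathbf{L}}^{-}& -{\mathbf{L}}^{-}{\mathbf{W}}^{\prime }{\mathbf{M}}^{-}\\ {}-{\mathbf{M}}^{-}\mathbf{W}{\mathbf{L}}^{-}& {\mathbf{M}}^{-}+{\mathbf{M}}^{-}\mathbf{W}{\mathbf{L}}^{-}{\mathbf{W}}^{\prime }{\mathbf{M}}^{-}\end{array}\right], \tag{4.26} \]

where **L**^{−} is a generalized inverse of matrix **L** = **P** − **W**′**M**^{−}**W**, and **M**^{−} is a generalized inverse of matrix **M**. In matrix **Φ**^{−}, the inverse of matrix **W** is not required and the standard inverse of matrix **M** (**M**^{−1}) may exist. In the latter case, the standard inverse of matrix **L** (**L**^{−1}) exists and can be written as **L**^{−1} = (**P** − **W**′**M**^{−1}**W**)^{−1} = **P**^{−1} + **P**^{−1}**W**′[**M** − **WP**^{−1}**W**′]^{−1}**WP**^{−1} (Searle et al. 2006).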
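The identity for **L**^{−1} can be verified numerically on a small case. The sketch below assumes one trait (so **P** is a scalar) and two markers; all numerical values and helper names are hypothetical and serve only to check the algebra.

```python
def inv2(A):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(v, A):
    """Quadratic form v' A v for a length-2 vector v."""
    return sum(v[i] * A[i][j] * v[j] for i in range(2) for j in range(2))

P = 10.0                          # phenotypic variance (t = 1), assumed
M = [[1.0, 0.6], [0.6, 1.0]]      # marker covariance matrix (m = 2), assumed
W = [1.5, 0.8]                    # marker-trait covariances (m x 1), assumed

# Left-hand side: L^{-1} with L = P - W' M^{-1} W.
L = P - quad(W, inv2(M))
lhs = 1.0 / L

# Right-hand side: P^{-1} + P^{-1} W' [M - W P^{-1} W']^{-1} W P^{-1}.
inner = [[M[i][j] - W[i] * W[j] / P for j in range(2)] for i in range(2)]
rhs = 1.0 / P + quad(W, inv2(inner)) / P ** 2
```

Both sides agree to rounding error, which illustrates why only **P**^{−1} and the inverse of an *m* × *m* matrix are needed when **M** is nonsingular.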

Using the partitioned form of **Φ**^{−}, the GW-LMSI vector of coefficients, **β**_{W} = **Φ**^{−}**Ψa**_{W}, can be written as

\[ {\boldsymbol{\upbeta}}_W=\left[\begin{array}{c}{\boldsymbol{\upbeta}}_y\\ {}{\boldsymbol{\upbeta}}_m\end{array}\right]=\left[\begin{array}{c}{\mathbf{L}}^{-}\left(\mathbf{C}-{\mathbf{W}}^{\prime }{\mathbf{M}}^{-}\mathbf{W}\right)\mathbf{w}\\ {}{\mathbf{M}}^{-}\mathbf{W}\left(\mathbf{w}-{\boldsymbol{\upbeta}}_y\right)\end{array}\right], \tag{4.27} \]

where **w** is the vector of economic weights. Suppose that there is no marker information; then, matrices **M** and **W** are null and Eq. (4.27) is equal to **β**_{y} = **P**^{−1}**Cw** = **b** (the LPSI vector of coefficients), whereas **β**_{m} = **0** and \( {R}_W={k}_I\sqrt{{\boldsymbol{\upbeta}}_W^{\prime }{\boldsymbol{\Phi} \boldsymbol{\upbeta}}_W}={k}_I\sqrt{{\mathbf{b}}^{\prime}\mathbf{Pb}}={R}_I \), the LPSI selection response. Now suppose that the markers explain all the genetic variability; in this case, **β**_{y} = **0** and **β**_{m} = (**X**′**X**)^{−}**X**′**Y**, the matrix of linear regression coefficients in the multivariate context, where (**X**′**X**)^{−} is a generalized inverse matrix of **X**′**X** and **Y** is a matrix of phenotypic observations.
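The partition of **β**_{W} into **β**_{y} and **β**_{m} can be obtained by solving the two block rows of **Φβ**_{W} = **Ψa**_{W} directly. The sketch below assumes, for simplicity, that **M** and **L** are nonsingular, so that ordinary inverses can be used:

```latex
\begin{aligned}
\mathbf{P}\boldsymbol{\beta}_y + \mathbf{W}^{\prime}\boldsymbol{\beta}_m &= \mathbf{C}\mathbf{w}, \\
\mathbf{W}\boldsymbol{\beta}_y + \mathbf{M}\boldsymbol{\beta}_m &= \mathbf{W}\mathbf{w}
  \;\;\Longrightarrow\;\;
  \boldsymbol{\beta}_m = \mathbf{M}^{-1}\mathbf{W}\left(\mathbf{w}-\boldsymbol{\beta}_y\right), \\[4pt]
\left(\mathbf{P}-\mathbf{W}^{\prime}\mathbf{M}^{-1}\mathbf{W}\right)\boldsymbol{\beta}_y
  &= \left(\mathbf{C}-\mathbf{W}^{\prime}\mathbf{M}^{-1}\mathbf{W}\right)\mathbf{w}
  \;\;\Longrightarrow\;\;
  \boldsymbol{\beta}_y = \mathbf{L}^{-1}\left(\mathbf{C}-\mathbf{W}^{\prime}\mathbf{M}^{-1}\mathbf{W}\right)\mathbf{w}.
\end{aligned}
```

The third line results from substituting **β**_{m} into the first block row. When the markers carry no information about the traits, the first block row reduces to **Pβ**_{y} = **Cw**; when **C** = **W**′**M**^{−1}**W**, the right-hand side of the third line vanishes and **β**_{y} = **0**.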

### 4.2.3 Statistical Properties of GW-LMSI

Assume that *H* and *I*_{W} have bivariate joint normal distribution, that **β**_{W} = **Φ**^{−}**Ψa**_{W}, and that **P**, **C**, **M**, **W**, and **w** are known; then, the statistical GW-LMSI properties are the same as the LMSI properties. That is,

1. \( {\sigma}_{I_W}^2={\sigma}_{HI_W} \), i.e., the variance of *I*_{W} (\( {\sigma}_{I_W}^2 \)) and the covariance between *H* and *I*_{W} (\( {\sigma}_{HI_W} \)) are the same.
2. The maximized correlation between *H* and *I*_{W}, or *I*_{W} accuracy, is \( {\rho}_{HI_W}=\frac{\sigma_{I_W}}{\sigma_H} \).
3. The variance of the predicted error, \( Var\left(H-{I}_W\right)=\left(1-{\rho}_{HI_W}^2\right){\sigma}_H^2 \), is minimal.
4. The total variance of *H* explained by *I*_{W} is \( {\sigma}_{I_W}^2={\rho}_{HI_W}^2{\sigma}_H^2 \).

According to Lange and Whittaker (2001), GW-LMSI efficiency should be greater than LMSI efficiency. However, this would be true only if matrices **P**, **C**, **M**, and **W** are known and trait heritability is very low.

## 4.3 Estimating the LMSI Parameters

When covariance matrices **P**, **C**, and **S**, and the vector of economic weights (**w**) are known, there is no error in the estimation of the LMSI parameters (selection response, expected genetic gain, etc.); the same is true for the GW-LMSI when, in addition to **P**, **C**, and **w**, the covariance matrices **M** and **W** are known. In such cases, the relative efficiency of the LMSI (GW-LMSI) depends only on the heritability of the traits and on the portion of phenotypic variation associated with markers. Using simulated data, Lange and Whittaker (2001) found that GW-LMSI efficiency was higher than LMSI efficiency when trait heritability was 0.2 and matrices **P**, **C**, **M**, and **W** were known. When **P**, **C**, **S**, **M**, and **W** are unknown, it is necessary to estimate them; then, the LMSI and GW-LMSI vector of coefficients and the effects associated with markers are estimated with some error. This error leads to lower LMSI and GW-LMSI efficiency than expected under the assumption that the parameters are known; however, in the latter case, Lange and Whittaker (2001) also found that GW-LMSI efficiency was greater than that of the LMSI when trait heritability was 0.05. Moreover, in the LMSI there is additional bias in the estimation of the parameters because only markers with significant effects are included in the index (Moreau et al. 1998).

In Chap. 2, we described the restricted maximum likelihood (REML) method for estimating matrices **P** and **C**. Some authors (Lande and Thompson 1990; Charcosset and Gallais 1996; Hospital et al. 1997; Moreau et al. 1998, 2007) have described methods for estimating marker scores, the variance of the marker scores, the LMSI vector of coefficients, etc., in the context of one trait; however, up to now there have been no reports on the estimation of matrix **S** in the multi-trait case. Lange and Whittaker (2001) only indicated that matrix **S** can be estimated as \( \widehat{\mathbf{S}}= Var\left(\widehat{\mathbf{s}}\right) \), where \( \widehat{\mathbf{s}} \) is a vector of estimated marker scores associated with several individual traits.

Two problems arise when matrix **S** is estimated:

1. The estimated values of the covariance matrix **S** (\( \widehat{\mathbf{S}} \)) tend to overestimate the genetic covariance matrix (**C**).
2. The estimated variances of the marker scores can be negative.

When the first point is true, the estimated LMSI selection response and efficiency could be negative because the estimated matrix \( {\widehat{\mathbf{T}}}_M=\left[\begin{array}{cc}\widehat{\mathbf{P}}& \widehat{\mathbf{S}}\\ {}\widehat{\mathbf{S}}& \widehat{\mathbf{S}}\end{array}\right] \) is not positive definite (i.e., not all of its eigenvalues are positive) and the estimated matrix \( {\widehat{\mathbf{Z}}}_M=\left[\begin{array}{cc}\widehat{\mathbf{C}}& \widehat{\mathbf{S}}\\ {}\widehat{\mathbf{S}}& \widehat{\mathbf{S}}\end{array}\right] \) is not positive semi-definite (i.e., it has negative eigenvalues). In addition, the results can lead to all weights being placed on the molecular score and the weights on the phenotype values can be negative (Moreau et al. 2007). When the second point is true, the variance of the marker scores is not useful. The two problems indicated above could be caused by using the same data set to select markers and to estimate marker effects, and there is no simple way of solving them. Lande and Thompson (1990) proposed that the markers used to obtain \( \widehat{\mathbf{S}} \) be selected a priori as those with the most highly significant partial regression coefficients from among all the markers in the linkage group analyzed in the previous generation. Zhang and Smith (1992, 1993) proposed using two independent sets of markers: one to estimate marker effects and the other to select markers. Additional solutions to these problems were described by Moreau et al. (2007).
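For one trait, both problems can be detected before the index is built. The sketch below is an illustrative check, not a published procedure: it relies on the fact that a 2 × 2 matrix of the form [[*a*, *s*], [*s*, *s*]] has determinant *s*(*a* − *s*), and the function name and the phenotypic variance in the examples are assumptions.

```python
def check_one_trait_estimates(var_y, var_g, var_s):
    """Flag the two estimation problems for a single trait.

    T_M = [[var_y, var_s], [var_s, var_s]] is positive definite
    if and only if var_s > 0 and var_y > var_s.
    Z_M = [[var_g, var_s], [var_s, var_s]] is positive semi-definite
    if and only if var_s >= 0 and var_g >= var_s.
    """
    return {
        "negative_score_variance": var_s < 0,
        "T_M_positive_definite": var_s > 0 and var_y - var_s > 0,
        "Z_M_positive_semidefinite": var_s >= 0 and var_g - var_s >= 0,
    }

# Estimates for PHT from this chapter, with a hypothetical phenotypic variance.
ok = check_one_trait_estimates(var_y=200.0, var_g=83.0, var_s=48.23)

# A marker-score variance that overestimates the genetic variance (point 1).
over = check_one_trait_estimates(var_y=200.0, var_g=83.0, var_s=90.0)
```

In the multi-trait case the same idea applies to the eigenvalues of the estimated block matrices, which requires a numerical eigenvalue routine.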

In this subsection, we describe methods (in the univariate and multivariate context) for estimating molecular marker effects, marker scores, and their variance and covariance, and for estimating the LMSI and GW-LMSI vector of coefficients, selection response, expected genetic gain, and accuracy. This subsection is only for illustration; we use the same data set to select markers, and to estimate marker effects and the variance of marker scores.

### 4.3.1 Estimating the Marker Score

According to Eqs. (4.11) and (4.17b), when the vector of economic weights is equal to \( {\mathbf{a}}^{\prime }=\left[1\kern0.5em 0\right] \), the LMSI for the *i*th trait *y*_{i} (*i* = 1, 2, ⋯, *t*; *t* = number of traits) value can be written as \( {I}_{M_{li}}\kern0.5em =\kern0.5em {s}_i+{\beta}_{y_i}\left({y}_i-{s}_i\right) \) (*l* = 1, 2, ⋯, *n*; *n* = number of individuals or genotypes), where \( {\beta}_{yi}=\frac{\sigma_{g_i}^2-{\sigma}_{s_i}^2}{\sigma_{y_i}^2-{\sigma}_{s_i}^2}=\frac{h_i^2\left(1-{q}_i\right)}{1-{q}_i{h}_i^2} \) is the LMSI coefficient, \( {h}_i^2=\frac{\sigma_{g_i}^2}{\sigma_{y_i}^2} \) is the heritability of the *i*th trait, and \( {q}_i=\frac{\sigma_{s_i}^2}{\sigma_{g_i}^2} \) is the proportion of genetic variance explained by the QTL or markers associated with the *i*th trait; \( {s}_i=\sum \limits_{j=1}^M{\theta}_j{x}_j \) (*j* = 1, 2, ⋯, *M*; *M* = number of selected markers) is the *i*th individual trait marker score; and \( {\sigma}_{y_i}^2 \), \( {\sigma}_{g_i}^2 \), and \( {\sigma}_{s_i}^2 \) are the *i*th variances of the phenotypic, genetic, and marker score values respectively.

The simplest way of estimating the *i*th marker score *s*_{i} is to perform a multiple linear regression of phenotypic values (*y*_{i}) on the coded values of the markers (*x*_{j}) and then select the markers statistically linked to the *i*th QTL that explain most of the variability in the regression model and use them to construct \( {s}_i=\sum \limits_{j\in M}{\theta}_j{x}_j \).

The regression coefficients *θ*_{j} of the model \( {y}_i={\theta}_0+\sum \limits_{j\in M}{\theta}_j{x}_j+{e}_i \) can be estimated, for the *i*th trait, by maximum likelihood or least squares. When estimating *θ*_{j}, the main problem is to choose the set of markers *M* based on criteria for declaring markers as significant and then use the estimated values of *θ*_{j} (\( {\widehat{\theta}}_j \)) to estimate the *i*th marker score *s*_{i} as \( {\widehat{s}}_i=\sum \limits_{j\in M}{\widehat{\theta}}_j{x}_j \). The values of \( {\widehat{s}}_i \) may increase or decrease according to the number of markers (*x*_{j}) included in the model, and \( {\widehat{s}}_i \) affects LMSI selection response and efficiency by means of its estimated variance (\( {\widehat{\sigma}}_{{\widehat{s}}_i}^2 \)) (Figs. 4.1 and 4.2).

According to the least squares method of estimation, \( \widehat{\boldsymbol{\uptheta}}={\left({\mathbf{X}}^{\prime}\mathbf{X}\right)}^{-1}{\mathbf{X}}^{\prime }{\mathbf{y}}^{\ast } \) is an estimator of the vector of regression coefficients \( {\boldsymbol{\uptheta}}^{\prime }=\left[{\theta}_1\kern0.5em {\theta}_2\kern0.5em \cdots \kern0.5em {\theta}_m\right] \), where *m* (*m* < *n*) is the number of markers, **X** is a matrix *n* × *m* of coded marker values (e.g., 1, 0 and −1 for marker genotypes *AA*, *Aa*, and *aa* respectively) and **y**^{∗} is a vector *n* × 1 of phenotypic values centered based on its average values. Only a subset *M*(*M* < *m*) of the *m* markers is statistically linked to the QTL and then only a subset *M* of the estimated vector \( \widehat{\boldsymbol{\uptheta}} \) values is selected to estimate *s*_{i} as \( {\widehat{s}}_i=\sum \limits_{j=1}^M{\widehat{\theta}}_j{x}_j \).
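The least squares estimator can be illustrated on a toy data set. The sketch below solves the normal equations (**X**′**X**)**θ** = **X**′**y**^{∗} for two markers; the six genotypes, the marker effects used to generate the phenotypes (2 and −1), and the function name are hypothetical.

```python
def least_squares_two_markers(X, y):
    """Solve (X'X) theta = X'y* for two markers, with y* the centered y."""
    n = len(y)
    mean_y = sum(y) / n
    y_star = [v - mean_y for v in y]
    # Entries of the symmetric 2 x 2 matrix X'X and of the vector X'y*.
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    g0 = sum(r[0] * v for r, v in zip(X, y_star))
    g1 = sum(r[1] * v for r, v in zip(X, y_star))
    det = a * d - b * b
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]

# Coded marker values (1, 0, -1) for six hypothetical genotypes.
X = [[1, 0], [0, 1], [-1, 0], [0, -1], [1, 1], [-1, -1]]
# Noise-free phenotypes generated with effects 2 and -1 around a mean of 10.
y = [10 + 2 * x1 - 1 * x2 for x1, x2 in X]

theta_hat = least_squares_two_markers(X, y)
# Estimated marker scores: sum of estimated effect times coded value.
scores = [sum(t * x for t, x in zip(theta_hat, row)) for row in X]
```

Because the toy phenotypes contain no residual, the estimates recover the generating effects exactly; with real data, only the markers declared significant would be kept when forming the scores.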

To illustrate how to obtain \( {\widehat{s}}_i=\sum \limits_{j\in M}{\widehat{\theta}}_j{x}_j \), we use a real maize (*Zea mays*) F_{2} population with 247 genotypes (each one with two repetitions), 195 molecular markers, and four traits – grain yield (GY, ton ha^{−1}); plant height (PHT, cm), ear height (EHT, cm), and anthesis day (AD, days) – evaluated in one environment. In an F_{2} population, the marker homozygous loci for the allele from the first parental line can be coded by 1, whereas the marker homozygous loci for the allele from the second parental line can be coded by −1, and the marker heterozygous loci by 0.

The table below presents the coded values of seven selected markers and the estimated marker scores for 20 genotypes of the maize (*Zea mays*) F_{2} population. According to \( \widehat{{\boldsymbol{\uptheta}}^{\prime }} \) and the coded values of the seven markers, the first estimated \( {\widehat{s}}_{PHT} \) value was obtained as \( {\widehat{s}}_{PHT1}=-1.91(1)-3.53\left(-1\right)=1.62 \); the second estimated \( {\widehat{s}}_{PHT} \) value was obtained as \( {\widehat{s}}_{PHT2}=5.46\left(-1\right)-4.54\left(-1\right)-1.91\left(-1\right)=0.99 \), etc. The 20th estimated \( {\widehat{s}}_{PHT} \) value was obtained as \( {\widehat{s}}_{PHT20}=-3.53\left(-1\right)=3.53 \). This estimation procedure is valid for any number of genotypes and markers.

Genotype number, coded values of seven selected markers (M1 to M7), and estimated marker score values obtained from a maize (*Zea mays*) F_{2} population with 247 genotypes and 195 molecular markers

Genotype | M1 | M2 | M3 | M4 | M5 | M6 | M7 | Marker score |
---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 | 1 | −1 | 1.62 |
2 | −1 | −1 | 0 | 0 | 0 | −1 | 0 | 0.99 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | −3.53 |
4 | 1 | 1 | 0 | 0 | 0 | −1 | −1 | 6.37 |
5 | 1 | 1 | 0 | −1 | −1 | −1 | −1 | 6.72 |
6 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0.98 |
7 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0.57 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | −0.93 |
10 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 4.84 |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | −1 | −1 | 0 | 0 | 0 | 0 | 0 | −0.92 |
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14 | 1 | 1 | 0 | −1 | −1 | 0 | −1 | 4.81 |
15 | 0 | 0 | 1 | −1 | −1 | 0 | 0 | 1.34 |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | −1 | −1 | 0 | 0 | 0 | 0 | 1 | −4.46 |
18 | −1 | −1 | 0 | 0 | 0 | 0 | 1 | −4.46 |
19 | −1 | −1 | 1 | 0 | 0 | −1 | 1 | −1.56 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | −1 | 3.53 |

Figure: distribution of the estimated marker score values in the maize (*Zea mays*) F_{2} population. Note that the estimated marker score values approach normal distribution.
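The marker scores listed for the 20 genotypes can be reproduced with a few lines of code. In the sketch below, four of the seven marker effects (5.46, −4.54, −1.91, and −3.53) are the ones that appear in the worked sums above; the other three (0.98, 7.39, and −7.74) are back-calculated here from the tabulated scores and are therefore approximate assumptions, not values quoted from the source.

```python
# Estimated marker effects for PHT, markers M1..M7. The effects for M3, M4
# and M5 are back-calculated from the table (assumption, approximate).
theta_hat = [5.46, -4.54, 0.98, 7.39, -7.74, -1.91, -3.53]

# Coded marker values (1, 0, -1) for the 20 genotypes listed in the table.
coded = [
    [0, 0, 0, 0, 0, 1, -1],
    [-1, -1, 0, 0, 0, -1, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, -1, -1],
    [1, 1, 0, -1, -1, -1, -1],
    [0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [-1, -1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, -1, -1, 0, -1],
    [0, 0, 1, -1, -1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [-1, -1, 0, 0, 0, 0, 1],
    [-1, -1, 0, 0, 0, 0, 1],
    [-1, -1, 1, 0, 0, -1, 1],
    [0, 0, 0, 0, 0, 0, -1],
]

def marker_score(effects, codes):
    """Estimated marker score: sum of marker effect times coded value."""
    return sum(e * x for e, x in zip(effects, codes))

scores = [round(marker_score(theta_hat, row), 2) for row in coded]
```

With these effects the computed scores agree with the tabulated ones to within about 0.01, the size of the rounding in the back-calculated effects. Note that markers M1 and M2 have identical coded values in all 20 rows shown, so their effects cannot be separated from these rows alone.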

### 4.3.2 Estimating the Variance of the Marker Score

Several methods have been proposed for estimating the variance of the marker score associated with the *i*th trait (\( {\sigma}_{s_i}^2 \)); the first one was proposed by Lande and Thompson (1990). According to these authors, \( {\sigma}_{s_i}^2 \) can be estimated as in Eq. (4.29), which involves the covariance matrix *M* × *M* of the selected markers that are statistically linked to the *i*th trait marker loci and \( {\widehat{\sigma}}_{e_i}^2=\frac{{\mathbf{y}}^{\prime}\left(\mathbf{I}-\mathbf{H}\right)\mathbf{y}}{n-M-1} \), the unbiased estimated variance of the residuals, where \( \mathbf{H}={\mathbf{X}}_i{\left({\mathbf{X}}_i^{\prime }{\mathbf{X}}_i\right)}^{-1}{\mathbf{X}}_i^{\prime } \), **I** is an identity matrix *n* × *n*, *M* is the number of selected markers statistically linked to the QTL, and **X**_{i} is a matrix *n* × *M* with the coded values of the selected markers. According to Lande and Thompson (1990), Eq. (4.29) is an unbiased estimator of \( {\sigma}_{s_i}^2 \), and its variance can also be written explicitly; the case of interest is that in which *n*, the number of genotypes or individuals, is very high.

Equation (4.29) can be adapted to estimate the covariance between the *i*th and *j*th marker scores when the number of selected markers statistically linked to the QTL is the same in the *i*th and *j*th traits. Thus, by Eq. (4.29), the covariance between the *i*th and *j*th marker scores can be estimated from the following quantities: the estimated vectors of regression coefficients of the markers associated with the *i*th and *j*th trait loci respectively; \( {\mathbf{M}}_{ij}=\frac{2}{n}{\mathbf{X}}_i^{\prime }{\mathbf{X}}_j \), the covariance matrix *M* × *M* of the markers statistically linked to the *i*th and *j*th trait marker loci, where **X**_{i} and **X**_{j} are *n* × *M* matrices with the coded values of the selected markers associated with the *i*th and *j*th trait loci respectively; and \( {\widehat{\sigma}}_{e_{ij}}=\frac{{\mathbf{y}}_i^{\prime}\left(\mathbf{I}-{\mathbf{H}}_{ij}\right){\mathbf{y}}_j}{n-M-1} \), the estimated covariance of the residuals between the *i*th (**y**_{i}) and *j*th (**y**_{j}) trait values, where \( {\mathbf{H}}_{ij}={\mathbf{X}}_i{\left({\mathbf{X}}_i^{\prime }{\mathbf{X}}_j\right)}^{-1}{\mathbf{X}}_j^{\prime } \), **I** is an identity matrix *n* × *n*, and *M* is the number of selected markers statistically linked to the QTL.

According to the PHT values described in Sect. 4.3.1 of this chapter, *M* = 7, *n* = 247, \( {\widehat{\sigma}}_{e_i}^2=180.80 \) and \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=48.23 \) (Eq. 4.29). Note that \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2\le {\widehat{\sigma}}_{g_{PHT}}^2 \), where \( {\widehat{\sigma}}_{g_{PHT}}^2=83.0 \) is an estimate of the genetic variance of PHT. The estimated portion of the genetic variance attributable to \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=48.23 \) was \( {\widehat{q}}_{PHT}=\frac{48.23}{83}=0.5811 \); that is, the seven markers explain 58.11% of the genetic variance associated with PHT.
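As a rough illustration of the quantities used above, the following sketch computes the residual-variance estimate on synthetic data and the proportion \( {\widehat{q}}_{PHT} \) from the values quoted in the text. The marker matrix, the marker effects, and the noise level are invented for the example, and **H** is taken here to be the usual projection matrix onto the centered marker columns, so that \( {\mathbf{y}}^{\prime}\left(\mathbf{I}-\mathbf{H}\right)\mathbf{y} \) is the residual sum of squares; that convention is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 247, 7  # sample size and number of selected markers quoted in the text

# Synthetic coded markers (1, 0, -1) and phenotypes; purely illustrative data.
X = rng.choice([-1.0, 0.0, 1.0], size=(n, M), p=[0.25, 0.5, 0.25])
y = X @ rng.normal(size=M) + rng.normal(scale=3.0, size=n)

# Centering absorbs the intercept, which accounts for the "- 1" in n - M - 1.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Projection matrix onto the marker columns (assumed convention for H).
H = Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T
resid_var = yc @ (np.eye(n) - H) @ yc / (n - M - 1)

# Proportion of the genetic variance attributed to the marker score (text values).
q_hat = 48.23 / 83.0

print(round(float(resid_var), 2), round(q_hat, 4))
```

The same residual variance is obtained from an ordinary least squares fit with an explicit intercept, which is a convenient cross-check.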

A second method is based on the coefficient of multiple determination, *R*^{2} (note that in this case *R*^{2} is not the square of the selection response). The coefficient *R*^{2} gives the portion of the total variation in the phenotypic values that is “explained” by, or attributable to, the markers and is given by Eq. (4.32a). *R*^{2} is equal to 1 if the fitted equation \( {y}_i={\theta}_0+\sum \limits_{j\in M}{\theta}_j{x}_j+{e}_i \) passes through all the data points, so that all residuals are null; then, the markers explain all the phenotypic variance. At the other extreme, *R*^{2} is zero if \( {\overline{y}}_i={\widehat{\theta}}_0 \) and the estimated regression coefficients are null, i.e., \( {\widehat{\theta}}_1={\widehat{\theta}}_2=\cdots ={\widehat{\theta}}_M=0 \). In the latter case, markers do not affect the phenotypic observations and the variance of the marker score values is zero. Thus, the *R*^{2} values are between 0 and 1, i.e., 0 ≤ *R*^{2} ≤ 1.0. Equation (4.32a) is useful for estimating \( {\sigma}_{s_i}^2 \) as \( {\widehat{\sigma}}_{y_i}^2\sum \limits_{j=1}^M{R}_j^2={\widehat{\sigma}}_{s_i}^2 \), where \( {R}_j^2 \) is the estimated *R*^{2} value of the *j*th marker and \( {\widehat{\sigma}}_{y_i}^2 \) is the phenotypic variance of the *i*th trait; however, this is a biased estimator of \( {\sigma}_{s_i}^2 \) (Hospital et al. 1997). The adjusted coefficient \( {R}_{Adj}^2 \) of Eq. (4.32b), computed for the selected markers jointly, provides an alternative estimate.

Using Eqs. (4.32a) and (4.32b), we can estimate \( {\sigma}_{s_i}^2 \), but from them it is not clear how we can estimate the covariance between two different estimated marker score values.

Consider the case of the PHT values described in Sect. 4.3.1 of this chapter, where *M* = 7, *n* = 247, and the estimated variance of PHT was \( {\widehat{\sigma}}_{PHT}^2=191.81 \). The estimated values of *R*^{2} for each of the seven markers were 0.0038, 0.0005, 0.0006, 0.0013, 0.0036, 0.0114, and 0.0298, whence, by multiplying each estimated *R*^{2} value by \( {\widehat{\sigma}}_{PHT}^2=191.81 \) and summing the results, we found that the estimated value of \( {\sigma}_{s_{PHT}}^2 \) was \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=9.78 \). In this case, the estimated portion of the genetic variance attributable to \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=9.78 \) was \( {\widehat{q}}_{PHT}=\frac{9.78}{83}=0.1178 \); thus, when we estimated \( {\sigma}_{s_{PHT}}^2 \) according to Eq. (4.32a), the seven markers explained only 11.78% of the genetic variance associated with PHT.

The estimated value of \( {R}_{Adj}^2 \) for the seven markers jointly was 0.06, whence \( {\widehat{\sigma}}_{s_{PHT}}^2=(191.81)(0.06)=11.50 \) is an estimate of \( {\sigma}_{s_{PHT}}^2 \). In the latter case, the estimated portion of the genetic variance attributable to \( {\widehat{\sigma}}_{s_{PHT}}^2=11.50 \) was \( {\widehat{q}}_{PHT}=\frac{11.5}{83}=0.1385 \); that is, according to Eq. (4.32b), the seven markers explain 13.85% of the genetic variance associated with PHT.
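The *R*^{2}-based estimates can be sketched in code as follows. The block assumes the textbook definitions of the coefficient of determination and of its adjusted version for a regression with an intercept; whether these coincide exactly with Eqs. (4.32a) and (4.32b) is an assumption, and the data and the helper name `r2_and_adjusted` are invented for the illustration.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) for a least squares fit of y on X with intercept."""
    n, M = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = float(np.sum((y - A @ coef) ** 2))   # residual sum of squares
    ss_tot = float(np.sum((y - y.mean()) ** 2))   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - M - 1)
    return r2, r2_adj

rng = np.random.default_rng(1)
X = rng.choice([-1.0, 0.0, 1.0], size=(247, 7), p=[0.25, 0.5, 0.25])  # synthetic markers
y = 0.8 * X[:, 0] + rng.normal(scale=3.0, size=247)                   # synthetic phenotypes

r2, r2_adj = r2_and_adjusted(X, y)
pheno_var = float(y.var(ddof=1))

# Marker score variance estimated as (adjusted R^2) x (phenotypic variance).
var_s_hat = r2_adj * pheno_var
print(r2, r2_adj, var_s_hat)
```

Because the adjustment penalizes the number of fitted markers, the adjusted value never exceeds the unadjusted one.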

According to Eq. (4.33), the covariance between the *i*th and *j*th marker scores can be estimated as the cross products of the marker score values divided by *n* − 1. Note that in this case, the number of markers associated with the *i*th and *j*th traits may be different.

For the PHT values described in Sect. 4.3.1 of this chapter, where *n* = 247, the estimated value of \( {\sigma}_{s_i}^2 \) was \( {\widehat{\sigma}}_{s_{PHT}}^2=15.75 \) and the estimated portion of the genetic variance attributable to \( {\widehat{\sigma}}_{s_{PHT}}^2=15.75 \) was \( {\widehat{q}}_{PHT}=\frac{15.75}{83}=0.1897 \). That is, the seven markers jointly explain 18.97% of the genetic variance associated with PHT according to Eq. (4.33).
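A minimal sketch of this cross-product estimator is given below, assuming that the marker score of each trait is the fitted value from regressing the centered phenotype on the centered coded markers selected for that trait. The data, the marker subsets, and the helper name `marker_scores` are hypothetical.

```python
import numpy as np

def marker_scores(X, y):
    """Fitted marker score values from regressing centered y on centered markers."""
    Xc = X - X.mean(axis=0)
    theta, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
    return Xc @ theta

rng = np.random.default_rng(2)
n = 247
X = rng.choice([-1.0, 0.0, 1.0], size=(n, 10), p=[0.25, 0.5, 0.25])  # synthetic markers
y1 = X[:, :7] @ rng.normal(size=7) + rng.normal(scale=3.0, size=n)   # trait 1
y2 = X[:, 5:] @ rng.normal(size=5) + rng.normal(scale=3.0, size=n)   # trait 2

# The two traits use marker sets of different size (7 and 5 markers).
s1 = marker_scores(X[:, :7], y1)
s2 = marker_scores(X[:, 5:], y2)

# Cross products of the centered marker scores divided by n - 1.
scores = np.column_stack([s1, s2])
scores = scores - scores.mean(axis=0)
S_hat = scores.T @ scores / (n - 1)
print(S_hat)
```

The result is the ordinary sample covariance matrix of the marker scores, so it agrees with `np.cov(s1, s2)`.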

### 4.3.3 Estimating LMSI Selection Response and Efficiency

With the estimated phenotypic variances (\( {\widehat{\sigma}}_{PHT}^2=191.81 \)), the estimated genetic variance (\( {\widehat{\sigma}}_{g_{PHT}}^2=83.0 \)) and the estimated marker score variances: \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=48.23 \) (Eq. 4.29), \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=9.78 \) (Eq. 4.32a), \( {\widehat{\sigma}}_{s_{PHT}}^2=11.50 \) (Eq. 4.32b), and \( {\widehat{\sigma}}_{s_{PHT}}^2=15.75 \) (Eq. 4.33), we can estimate the LMSI coefficient, selection response, and efficiency.

Using the estimated value \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=48.23 \) obtained with Eq. (4.29), it is possible to estimate the LMSI weight as \( {\widehat{\beta}}_{PHT}=\frac{{\widehat{\sigma}}_{g_{PHT}}^2-{\widehat{\sigma}}_{s_{PHT}}^2}{{\widehat{\sigma}}_{PHT}^2-{\widehat{\sigma}}_{s_{PHT}}^2}=\frac{83.0-48.23}{191.81-48.23}=0.242 \), whereas for \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=9.78 \), \( {\widehat{\sigma}}_{s_{PHT}}^2=11.50 \), and \( {\widehat{\sigma}}_{s_{PHT}}^2=15.75 \), the estimated values of *β*_{PHT} were 0.402, 0.40, and 0.382 respectively. The latter results indicate that the estimated values of *β*_{PHT} associated with the phenotypic values tend to decrease when the estimated values of the variance of the marker score increase. This means that at the limit, when all the genetic variance is explained by the markers, the estimated values of *β*_{PHT} are zero and the estimated LMSI is equal to \( {\widehat{I}}_M=\widehat{s} \). Thus, for trait PHT, when the estimated values of *β*_{PHT} are not zero, the estimated LMSI can be written as \( {\widehat{I}}_{M_{PHT}}={\widehat{s}}_{PHT}+{\widehat{\beta}}_{PHT}\left({PHT}_i-{\widehat{s}}_{PHT}\right) \). The \( {\widehat{I}}_{M_{PHT}} \) values are used to predict, rank, and select the net genetic merit value of each individual candidate for selection.
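The weights above can be recomputed with a few lines of code. The function name `lmsi_weight` is introduced only for this sketch; the variances are the ones quoted in the text.

```python
def lmsi_weight(var_g, var_p, var_s):
    """Single-trait LMSI weight on the phenotype: (var_g - var_s) / (var_p - var_s)."""
    return (var_g - var_s) / (var_p - var_s)

VAR_P, VAR_G = 191.81, 83.0  # phenotypic and genetic variances of PHT

# Marker score variances obtained with Eqs. (4.29), (4.32a), (4.32b), and (4.33).
weights = {var_s: lmsi_weight(VAR_G, VAR_P, var_s)
           for var_s in (48.23, 9.78, 11.50, 15.75)}

for var_s, beta in weights.items():
    print(f"marker score variance {var_s:6.2f} -> weight {beta:.3f}")
```

As the marker score variance approaches the genetic variance, the numerator goes to zero and the index reduces to the marker score alone.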

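Using a selection intensity of 10% (*k*_{I} = 1.755) and \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=48.23 \), the estimated LMSI selection response can be obtained as \( {\widehat{R}}_M=1.755\sqrt{\frac{83\left(83-48.23\right)+48.23\left(191.81-83\right)}{191.81-48.23}}=1.755\sqrt{56.65}=13.21. \)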

In a similar manner, using the result \( {\widehat{\sigma}}_{s_{PHT}}^2=15.75 \), the estimated selection response was \( {\widehat{R}}_M=1.755\sqrt{\frac{83\left(83-15.75\right)+15.75\left(191.81-83\right)}{191.81-15.75}}=1.755\sqrt{41.44}=11.30. \) With \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=9.78 \) and \( {\widehat{\sigma}}_{s_{PHT}}^2=11.50 \), the estimated values of the LMSI selection responses were 10.99 and 11.10 respectively. The latter results indicate that the estimated values of the LMSI selection responses tend to increase when the estimated values of the variance of the marker score increase.
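The single-trait response calculation can be sketched as a small function. The expression under the square root is the one shown in the worked example above; the function name `lmsi_response` is an assumption of this sketch.

```python
from math import sqrt

def lmsi_response(var_g, var_p, var_s, k=1.755):
    """Single-trait LMSI selection response for selection intensity k."""
    numerator = var_g * (var_g - var_s) + var_s * (var_p - var_g)
    return k * sqrt(numerator / (var_p - var_s))

VAR_P, VAR_G = 191.81, 83.0  # phenotypic and genetic variances of PHT

# Responses for the four marker score variances discussed in the text.
responses = {var_s: lmsi_response(VAR_G, VAR_P, var_s)
             for var_s in (9.78, 11.50, 15.75, 48.23)}

for var_s, resp in responses.items():
    print(f"marker score variance {var_s:6.2f} -> response {resp:.2f}")
```

When the marker score variance is zero the expression reduces to the response to phenotypic selection, which gives a simple sanity check.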

We can estimate LMSI versus phenotypic efficiency for one trait as \( {\widehat{\lambda}}_M=\sqrt{\frac{\widehat{q}}{{\widehat{h}}^2}+\frac{{\left(1-\widehat{q}\right)}^2}{1-{\widehat{q}\widehat{h}}^2}} \), where \( {\widehat{h}}^2 \) is the estimated trait heritability and \( \widehat{q}=\frac{{\widehat{\sigma}}_s^2}{{\widehat{\sigma}}_g^2} \) is the estimated portion of additive genetic variance explained by the markers. When \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=48.23 \), \( {\widehat{q}}_{PHT}=\frac{48.23}{83}=0.5811 \), and \( {\widehat{h}}^2=0.433 \), the estimated LMSI efficiency was \( {\widehat{\lambda}}_M=\sqrt{1.58}=1.25 \). For \( {\widehat{\sigma}}_{s_{PHT}}^2=15.75 \), \( {\widehat{\sigma}}_{{\widehat{s}}_{PHT}}^2=9.78 \), and \( {\widehat{\sigma}}_{s_{PHT}}^2=11.50 \), the estimated portions of the additive genetic variance explained by the markers were \( {\widehat{q}}_{PHT}=\frac{15.75}{83}=0.1897 \), \( {\widehat{q}}_{PHT}=\frac{9.78}{83}=0.1178 \), and \( {\widehat{q}}_{PHT}=\frac{11.5}{83}=0.1385 \) respectively, whence the estimated LMSI efficiencies were 1.1, 1.04, and 1.05 respectively. The latter results indicate that the estimated values of LMSI efficiency tend to increase when the estimated values of the variance of the marker score increase (Fig. 4.1).
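The efficiency formula is easy to evaluate numerically. The sketch below uses the heritability and the proportions \( \widehat{q} \) quoted in the text; the function name `lmsi_efficiency` is introduced only for illustration.

```python
from math import sqrt

def lmsi_efficiency(q, h2):
    """Efficiency of the single-trait LMSI relative to phenotypic selection."""
    return sqrt(q / h2 + (1.0 - q) ** 2 / (1.0 - q * h2))

H2 = 0.433  # estimated heritability of PHT

# Proportions of genetic variance explained by the markers under each estimator.
efficiencies = {q: lmsi_efficiency(q, H2) for q in (0.1178, 0.1385, 0.1897, 0.5811)}

for q, eff in efficiencies.items():
    print(f"q = {q:.4f} -> efficiency {eff:.3f}")
```

With *q* = 0 the efficiency equals 1, that is, the LMSI offers no advantage over phenotypic selection when the markers explain none of the genetic variance.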

Figure 4.1 presents the change in LMSI efficiency with respect to phenotypic selection for different values of the variance of the marker score when the phenotypic (191.81) and genetic (83) variances are fixed. In a similar manner, Fig. 4.2 presents the change in the LMSI selection response for different values of the variance of the marker score when the phenotypic (191.81) and genetic (83) variances are fixed. In effect, LMSI efficiency and the selection response depend on the genetic variance explained by the markers.

### 4.3.4 Estimating the Variance of the Marker Score in the Multi-Trait Case

Equation (4.33) can be used in the multi-trait context when the numbers of markers associated with the *i*th and *j*th traits are different. Also, it is possible to adapt Eqs. (4.32a) and (4.32b) to the multi-trait case. However, in the latter case, in addition to the markers linked to the QTL that affect one specific trait, we need to find markers that affect more than one trait, which may be very difficult. For this reason, in the multi-trait context, Eqs. (4.32a) and (4.32b) could be used to estimate the variance of the marker score (**S**) without preselecting the markers that affect the phenotypic traits, only when the number of genotypes is higher than the number of markers.

Let **y**_{1}, **y**_{2}, …, **y**_{t} be *t* independent multivariate normal vectors of observations, each with *n* observations, such that \( \mathbf{Y}=\left[\begin{array}{cccc}{y}_{11}& {y}_{12}& \cdots & {y}_{1t}\\ {}{y}_{21}& {y}_{22}& \cdots & {y}_{2t}\\ {}\vdots & \vdots & \cdots & \vdots \\ {}{y}_{n1}& {y}_{n2}& \cdots & {y}_{nt}\end{array}\right] \) is a matrix *n* × *t* of observations for *t* traits; then, the multivariate linear regression model can be written as **Y** = **XB** + **U**, where **X** is a matrix *n* × *m* (*m* = number of markers and *m* < *n*) of known coded marker values, **B** is a matrix *m* × *t* of regression coefficients, and **U** is a matrix *n* × *t* of unobserved random disturbances whose rows for given **X** are uncorrelated, each with mean **0** and common covariance matrix **E** (Mardia et al. 1982; Rencher 2002). According to the least squares method of estimation, \( \widehat{\mathbf{B}}={\left({\mathbf{X}}^{\prime}\mathbf{X}\right)}^{-1}{\mathbf{X}}^{\prime}\mathbf{Y} \) is an estimator of **B** and \( \widehat{\mathbf{E}}=\frac{{\left(\mathbf{Y}-\mathbf{X}\widehat{\mathbf{B}}\right)}^{\prime}\left(\mathbf{Y}-\mathbf{X}\widehat{\mathbf{B}}\right)}{n-m-1} \) is an estimator of the residual covariance matrix **E** assuming that *n* > *m* (Johnson and Wichern 2007).
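The least squares estimators of the multivariate regression can be sketched with synthetic data as follows. The dimensions, marker codes, and effects are invented, and both **X** and **Y** are centered so that the intercept is absorbed, which is an assumption made here to match the divisor *n* − *m* − 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, t = 200, 6, 2  # genotypes, markers, traits (illustrative sizes)

X = rng.choice([-1.0, 0.0, 1.0], size=(n, m), p=[0.25, 0.5, 0.25])
B_true = rng.normal(size=(m, t))
Y = X @ B_true + rng.normal(size=(n, t))

# Center columns so that the fit corresponds to a model with intercept.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

B_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)  # (X'X)^{-1} X'Y, size m x t
U_hat = Yc - Xc @ B_hat                        # residual matrix, size n x t
E_hat = U_hat.T @ U_hat / (n - m - 1)          # residual covariance, size t x t

print(B_hat.shape, E_hat)
```

The residuals are orthogonal to the marker columns, which is the defining property of the least squares solution.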

Recall that *R*^{2} is the coefficient of multiple determination (Eq. 4.32a). In addition, as in the multi-trait context the estimated matrix of residuals is \( \widehat{\mathbf{U}}=\mathbf{Y}-\mathbf{X}\widehat{\mathbf{B}} \), 1 − *R*^{2} can be written as \( \mathbf{D}={\left({\mathbf{Y}}^{\prime}\mathbf{Y}\right)}^{-1}{\widehat{\mathbf{U}}}^{\prime}\widehat{\mathbf{U}} \) (Mardia et al. 1982), whence *R*^{2} in the multivariate context can be written in terms of **I** − **D** (Eqs. 4.34a and 4.34b), where **I** is an identity matrix *t* × *t*, \( {\widehat{\mathbf{P}}}^{-1} \) is the inverse of the estimated covariance matrix of phenotypic values (\( \widehat{\mathbf{P}} \)), and \( \widehat{\mathbf{S}} \) is the estimated covariance matrix of marker score values. From Eq. (4.34b), \( \widehat{\mathbf{S}} \) is then obtained with Eq. (4.34c).

From the maize F_{2} population including 247 genotypes (each one with two repetitions) and 195 molecular markers described in Sect. 4.3.1, we used two traits—PHT (cm) and EHT (cm)—to illustrate the multivariate method of estimating the LMSI parameters. The estimated phenotypic and genetic covariance matrices were \( \widehat{\mathbf{P}}=\left[\begin{array}{cc}191.81& 106.89\\ {}106.89& 167.93\end{array}\right] \) and \( \widehat{\mathbf{C}}=\left[\begin{array}{cc}83.00& 57.44\\ {}57.44& 59.80\end{array}\right] \), whereas the estimated covariance matrix of marker scores, using Eq. (4.33), was \( \widehat{\mathbf{S}}=\left[\begin{array}{cc}15.750& 0.983\\ {}0.983& 28.083\end{array}\right] \). When we used Eq. (4.34a) and Eq. (4.34c), we obtained estimated values of the variance and covariance of the marker scores that were higher than the genetic values (data not presented). Equations (4.29) and (4.31) are used later to compare LMSI efficiency versus GW-LMSI efficiency using the simulated data described in Chap. 2, Sect. 2.8.1.

With matrices \( \widehat{\mathbf{P}} \), \( \widehat{\mathbf{C}} \), and \( \widehat{\mathbf{S}} \), and the vector of economic weights \( {\mathbf{a}}^{\prime }=\left[{\mathbf{w}}^{\prime}\kern0.5em {\mathbf{0}}^{\prime}\right] \), where \( {\mathbf{w}}^{\prime }=\left[-1\kern0.5em -1\right] \) and \( {\mathbf{0}}^{\prime }=\left[0\kern0.5em 0\right] \), we obtained the estimated matrices \( {\widehat{\mathbf{T}}}_M=\left[\begin{array}{cc}\widehat{\mathbf{P}}& \widehat{\mathbf{S}}\\ {}\widehat{\mathbf{S}}& \widehat{\mathbf{S}}\end{array}\right] \) and \( {\widehat{\mathbf{Z}}}_M=\left[\begin{array}{cc}\widehat{\mathbf{C}}& \widehat{\mathbf{S}}\\ {}\widehat{\mathbf{S}}& \widehat{\mathbf{S}}\end{array}\right] \), whence the estimated LMSI vector of coefficients was \( {\widehat{\boldsymbol{\upbeta}}}^{\prime }={\mathbf{a}}^{\prime }{\widehat{\mathbf{Z}}}_M{\widehat{\mathbf{T}}}_M^{-1}=\left[-0.59\kern0.5em -0.18\kern0.5em -0.41\kern0.5em -0.82\right] \). Using a selection intensity of 10% (*k*_{I} = 1.755), the estimated LMSI selection response and the expected genetic gains per trait were \( {\widehat{R}}_M={k}_I\sqrt{\widehat{{\boldsymbol{\upbeta}}^{\prime }}{\widehat{\mathbf{T}}}_M\widehat{\boldsymbol{\upbeta}}}=20.41 \) and \( {\widehat{\mathbf{E}}}_M^{\prime }={k}_I\frac{\widehat{{\boldsymbol{\upbeta}}^{\prime }}{\widehat{\mathbf{Z}}}_M}{\sqrt{\widehat{{\boldsymbol{\upbeta}}^{\prime }}{\widehat{\mathbf{T}}}_M\widehat{\boldsymbol{\upbeta}}}}=\left[-10.09\kern0.5em -10.31\kern0.5em -2.53\kern0.5em -4.39\right] \) respectively, whereas the estimated LMSI accuracy was \( {\widehat{\rho}}_{H{\widehat{I}}_M}=\frac{{\widehat{\sigma}}_{I_M}}{{\widehat{\sigma}}_H}=0.72 \).
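The multi-trait LMSI calculation can be reproduced from the three estimated matrices given above. The variable names are chosen for this sketch only; the net genetic merit variance is taken as \( {\mathbf{w}}^{\prime}\widehat{\mathbf{C}}\mathbf{w} \), which is the usual definition and is assumed here.

```python
import numpy as np

# Estimated matrices for PHT and EHT quoted in the text.
P = np.array([[191.81, 106.89], [106.89, 167.93]])   # phenotypic covariance
C = np.array([[83.00, 57.44], [57.44, 59.80]])       # genetic covariance
S = np.array([[15.750, 0.983], [0.983, 28.083]])     # marker score covariance
w = np.array([-1.0, -1.0])                           # economic weights
k_I = 1.755                                          # 10% selection intensity

T = np.block([[P, S], [S, S]])
Z = np.block([[C, S], [S, S]])
a = np.concatenate([w, np.zeros(2)])

# beta' = a' Z T^{-1}; T and Z are symmetric, so beta = T^{-1} Z a.
beta = np.linalg.solve(T, Z @ a)

sigma_I = float(np.sqrt(beta @ T @ beta))   # standard deviation of the index
R_M = k_I * sigma_I                         # selection response
E_M = k_I * (beta @ Z) / sigma_I            # expected genetic gains per trait
accuracy = sigma_I / float(np.sqrt(w @ C @ w))

print(np.round(beta, 2), round(R_M, 2), np.round(E_M, 2), round(accuracy, 2))
```

Note that the phenotypic and marker score coefficients of each trait add up to the corresponding economic weight, a property of the LMSI that serves as a quick check of the computation.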

The estimated LPSI parameters (see Chap. 2 for details) using the phenotypic information from the maize F_{2} population for traits PHT and EHT are as follows. The estimated LPSI vector of coefficients was \( \widehat{{\mathbf{b}}^{\prime }}={\mathbf{w}}^{\prime}\widehat{\mathbf{C}}{\widehat{\mathbf{P}}}^{-1}=\left[-0.53\kern0.5em -0.36\right] \), and, with a selection intensity of 10% (*k*_{I} = 1.755), the estimated LPSI selection response and the expected genetic gains per trait were \( {\widehat{R}}_I={k}_I\sqrt{\widehat{{\mathbf{b}}^{\prime }}\widehat{\mathbf{P}}\widehat{\mathbf{b}}}=18.97 \) and \( \widehat{{\mathbf{E}}^{\prime }}={k}_I\frac{{\widehat{\mathbf{b}}}^{\prime}\widehat{\mathbf{C}}}{{\widehat{\sigma}}_I}=\left[-10.52\kern0.5em -8.45\right] \) respectively, whereas the estimated LPSI accuracy was \( {\widehat{\rho}}_{H\widehat{I}}=\frac{{\widehat{\sigma}}_I}{{\widehat{\sigma}}_H}=0.67 \).
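For comparison, the LPSI quantities can be recomputed from \( \widehat{\mathbf{P}} \) and \( \widehat{\mathbf{C}} \) alone. The sketch below uses names introduced only for illustration and takes the variance of the net genetic merit as \( {\mathbf{w}}^{\prime}\widehat{\mathbf{C}}\mathbf{w} \), which is assumed here.

```python
import numpy as np

P = np.array([[191.81, 106.89], [106.89, 167.93]])   # phenotypic covariance
C = np.array([[83.00, 57.44], [57.44, 59.80]])       # genetic covariance
w = np.array([-1.0, -1.0])                           # economic weights
k_I = 1.755                                          # 10% selection intensity

# b' = w' C P^{-1}; P and C are symmetric, so b = P^{-1} C w.
b = np.linalg.solve(P, C @ w)

sigma_I = float(np.sqrt(b @ P @ b))   # standard deviation of the index
R_I = k_I * sigma_I                   # selection response
E_I = k_I * (b @ C) / sigma_I         # expected genetic gains per trait
accuracy = sigma_I / float(np.sqrt(w @ C @ w))

print(np.round(b, 2), round(R_I, 2), np.round(E_I, 2), round(accuracy, 2))
```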

We can determine LMSI efficiency versus LPSI efficiency to predict the net genetic merit using the ratio of estimated accuracy values \( {\widehat{\rho}}_{H{\widehat{I}}_M}=0.72 \) and \( {\widehat{\rho}}_{H\widehat{I}}=0.67 \) of the LMSI and LPSI respectively, i.e., \( {\widehat{\lambda}}_M=\frac{0.72}{0.67}=1.075 \), whence, according to Eq. (4.19), the estimated LMSI efficiency versus the LPSI efficiency, in percentage terms, was \( {\widehat{p}}_M=100\left(1.075-1\right)=7.5 \). That is, for these data, the estimated LMSI efficiency was only 7.5% greater than LPSI efficiency at predicting the net genetic merit.

## 4.4 Estimating the GW-LMSI Parameters in the Asymptotic Context

Lange and Whittaker (2001) proposed the GW-LMSI. However, these authors did not provide detailed procedures for estimating matrices **P**, **C**, **W**, and **M**. They indicated that matrix **C** can be estimated using the estimated matrix of covariance of marker scores (\( \widehat{\mathbf{S}} \)) and that matrices **P**, **W**, and **M** can be estimated *directly by their empirical variances and covariances*, but this assertion does not indicate a clear method for estimating those covariance matrices. In Chap. 2, we described the REML method of estimating **C** and **P**. Crossa and Cerón-Rojas (2011) described matrices **W** and **M** in a doubled haploid population. In this study, we describe and estimate matrices **W** and **M** for an F_{2} population in the asymptotic context according to the Wright and Mowers (1994) approach, which is based on regressing phenotype values on marker coded values. We used this latter approach to estimate **W** and **M** because it is a clearer estimation method than that of Lange and Whittaker (2001); however, the Wright and Mowers (1994) approach is an asymptotic method and its results should be treated with caution.

Matrix **M** is the covariance matrix of the molecular marker code values. All marker information used to construct matrix **M** is presented in Table 4.2. Based on this information, we found that the expectations (*E*(*X*_{1}) and *E*(*X*_{2})) and the variances (*V*(*X*_{1}) and *V*(*X*_{2})) of the marker coded values *X*_{1} and *X*_{2} are *E*(*X*_{1}) = *E*(*X*_{2}) = 0 and *V*(*X*_{1}) = *V*(*X*_{2}) = 1, whereas the covariance (*Cov*(*X*_{1}, *X*_{2})) and correlation (*Corr*(*X*_{1}, *X*_{2})) between *X*_{1} and *X*_{2} follow from Eq. (4.35), which gives \( Cov\left({X}_1,{X}_2\right)=1-2\delta \), where *δ* is the recombination frequency between the two markers.

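Table 4.2 Marker genotypes, expected frequency, and coded values (*X*_{1} and *X*_{2}) of the marker genotypes in an F_{2} population

Marker genotype | Expected frequency | *X*_{1} | *X*_{2} |
---|---|---|---|
A_{1}A_{1}B_{1}B_{1} | (1−δ)^{2}/4 | 1 | 1 |
A_{1}A_{1}B_{1}B_{2} | 2(δ−δ^{2})/4 | 1 | 0 |
A_{1}A_{1}B_{2}B_{2} | δ^{2}/4 | 1 | −1 |
A_{1}A_{2}B_{1}B_{1} | 2(δ−δ^{2})/4 | 0 | 1 |
A_{1}A_{2}B_{1}B_{2} | 2(1−2δ + 2δ^{2})/4 | 0 | 0 |
A_{1}A_{2}B_{2}B_{2} | 2(δ−δ^{2})/4 | 0 | −1 |
A_{2}A_{2}B_{1}B_{1} | δ^{2}/4 | −1 | 1 |
A_{2}A_{2}B_{1}B_{2} | 2(δ−δ^{2})/4 | −1 | 0 |
A_{2}A_{2}B_{2}B_{2} | (1−δ)^{2}/4 | −1 | −1 |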

Thus, as the variances of *X*_{1} and *X*_{2} are equal to 1, the correlation between *X*_{1} and *X*_{2} is \( Corr\left({X}_1,{X}_2\right)=\frac{Cov\left({X}_1,{X}_2\right)}{\sqrt{V\left({X}_1\right)V\left({X}_2\right)}}=1-2\delta \), i.e., the covariance and correlation between *X*_{1} and *X*_{2} are the same. Equation (4.35) results indicate that if we perform the same operation with many markers, we will obtain similar results; they also indicate that this is the way to construct matrix **M**.
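The moments of the coded values can be checked by direct enumeration of the nine genotype classes. The sketch below assumes the standard expected frequencies for two linked loci in an F_{2} population derived from a coupling-phase F_{1} with recombination frequency *δ* (for example, (1 − *δ*)^{2}/4 for the two parental double homozygotes); the function name is hypothetical. With the raw 1, 0, −1 coding, the enumeration gives variances of 0.5 and a covariance of 0.5 − *δ*, so the correlation is 1 − 2*δ*, and multiplying the covariance by 2 puts it on the 1 − 2*δ* scale used in the text.

```python
def f2_two_marker_moments(delta):
    """Frequency total, variances, covariance, and correlation of (X1, X2)."""
    d = delta
    hom = (1 - d) ** 2 / 4                 # parental double homozygotes
    het = 2 * (d - d * d) / 4              # one locus homozygous, one heterozygous
    rec = d * d / 4                        # recombinant double homozygotes
    dbl = 2 * (1 - 2 * d + 2 * d * d) / 4  # double heterozygotes
    classes = [  # (X1, X2, expected frequency)
        (1, 1, hom), (1, 0, het), (1, -1, rec),
        (0, 1, het), (0, 0, dbl), (0, -1, het),
        (-1, 1, rec), (-1, 0, het), (-1, -1, hom),
    ]
    total = sum(f for _, _, f in classes)
    e1 = sum(x1 * f for x1, _, f in classes)
    e2 = sum(x2 * f for _, x2, f in classes)
    v1 = sum(x1 * x1 * f for x1, _, f in classes) - e1 ** 2
    v2 = sum(x2 * x2 * f for _, x2, f in classes) - e2 ** 2
    cov = sum(x1 * x2 * f for x1, x2, f in classes) - e1 * e2
    corr = cov / (v1 * v2) ** 0.5
    return total, v1, v2, cov, corr

total, v1, v2, cov, corr = f2_two_marker_moments(0.1)
print(total, v1, v2, cov, corr)
```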

Let **X** be a matrix of coded markers of size *n* × *m*, where *n* ≥ *m* and *m* = number of markers; then according to Wright and Mowers (1994), because all marker information is contained in matrix \( {\mathbf{X}}^{\prime}\mathbf{X} \), when the number of observations (*n*) tends to infinity, the product \( {\mathbf{x}}_i^{\prime }{\mathbf{x}}_j/n \) tends to the covariance between the *i*th and *j*th markers, whence matrix \( {n}^{-1}{\mathbf{X}}^{\prime}\mathbf{X} \) should tend to the covariance matrix between the markers that make up matrix **X**, with the *ij*th element equal to (0.5 − *δ*_{ij}). Thus, matrix \( 2{n}^{-1}{\mathbf{X}}^{\prime}\mathbf{X} \) should tend to a covariance matrix where the *ij*th entry is equal to (1 − 2*δ*_{ij}). Based on the latter result, an estimator of matrix **M** in the asymptotic context is \( \widehat{\mathbf{M}}=\frac{2}{n}{\mathbf{X}}^{\prime}\mathbf{X} \) (Eq. 4.36).

Equation (4.36) is an asymptotic result and should be taken with caution. To date, there has been no clear method for estimating **M** in the non-asymptotic context; for this reason, Eq. (4.36) is used to estimate the GW-LMSI parameters.
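The asymptotic behavior of this estimator can be explored with a small simulation. The sketch below generates two linked markers in an F_{2} population by drawing two gametes per individual from a coupling-phase F_{1}; the simulation scheme, the sample size, and the function name are assumptions made for illustration and are not the procedure used for the data in this chapter.

```python
import numpy as np

def simulate_f2_markers(n, delta, rng):
    """Coded values (1, 0, -1) of two linked markers for n simulated F2 individuals."""
    # Parental origin (1 or 0) of the marker-1 allele carried by each of two gametes.
    g1 = rng.integers(0, 2, size=(n, 2))
    # A crossover between the markers switches the origin at marker 2.
    recomb = rng.random(size=(n, 2)) < delta
    g2 = np.where(recomb, 1 - g1, g1)
    # Two, one, or zero alleles from the first parent are coded 1, 0, -1.
    x1 = g1.sum(axis=1) - 1
    x2 = g2.sum(axis=1) - 1
    return np.column_stack([x1, x2]).astype(float)

rng = np.random.default_rng(4)
n, delta = 20000, 0.1
X = simulate_f2_markers(n, delta, rng)

M_hat = (2.0 / n) * (X.T @ X)   # estimator of the marker covariance matrix
print(M_hat)
```

For a large simulated population the diagonal entries are close to 1 and the off-diagonal entry is close to 1 − 2*δ*, in line with the limit described above.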

The recombination frequency *δ* can be written as *δ* = *r*_{1} + *r*_{2} − 2*r*_{1}*r*_{2}, where *r*_{1} and *r*_{2} denote the recombination frequencies between the QTL and marker 1 and marker 2 respectively, with the QTL located between the two markers. When the number of genotypes or individuals tends to infinity, the covariance between the phenotypic trait values (*y*) and the marker 1 coded values (*X*_{1}) in an F_{2} population can be written as \( Cov\left(y,{X}_1\right)={\alpha}_1\left(1-2{r}_1\right) \) (Eq. 4.37), where *α*_{1}(1 − 2*r*_{1}) is the portion of the additive effect (*α*_{1}) of the QTL linked to marker 1 (Edwards et al. 1987), and *r*_{1} is the recombination frequency between the QTL and marker 1. We can assume that for many markers, the covariance between the phenotypic values and the marker coded values is similar to Eq. (4.37), whence matrix **W** can be obtained.

Let **y** be a vector *n* × 1 of recorded phenotypic values, where *n* denotes the number of observations or records, and let **X** be a matrix of coded markers of size *n* × *m*. When *n* tends to infinity, \( 2{n}^{-1}{\mathbf{X}}^{\prime}\mathbf{y} \) tends to a vector with elements equal to *α*_{i}(1 − 2*r*_{i}), where *α*_{i} is the additive effect of the *i*th QTL linked to the *i*th marker, and *r*_{i} is the recombination frequency between the *i*th QTL and the *i*th marker. Now let \( \mathbf{Y}=\left[\begin{array}{cccc}{y}_{11}& {y}_{12}& \cdots & {y}_{1t}\\ {}{y}_{21}& {y}_{22}& \cdots & {y}_{2t}\\ {}\vdots & \vdots & \cdots & \vdots \\ {}{y}_{n1}& {y}_{n2}& \cdots & {y}_{nt}\end{array}\right] \) be a matrix of observations for *t* traits; then, an estimator of matrix **W** in the asymptotic context is \( \widehat{\mathbf{W}}=\frac{2}{n}{\mathbf{X}}^{\prime}\mathbf{Y} \) (Eq. 4.38).

Once again, Eq. (4.38) is an asymptotic result and should be accepted with caution. But to date, there has been no clear method for estimating **W** in the non-asymptotic context; for this reason, Eq. (4.38) is used to estimate the GW-LMSI parameters.
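A minimal numerical sketch of this estimator follows. It assumes unlinked synthetic markers that coincide with the QTL (so that *r*_{i} = 0 and the limit of each entry is simply the additive effect), and it centers the trait columns before forming the cross product; the effects and sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, t = 5000, 4, 2  # genotypes, markers, traits (illustrative sizes)

# Independent coded markers with F2 genotype frequencies 1/4, 1/2, 1/4.
X = rng.choice([-1.0, 0.0, 1.0], size=(n, m), p=[0.25, 0.5, 0.25])

# Hypothetical additive effects of the QTL tagged by each marker on each trait.
effects = np.array([[1.5, 0.0], [0.0, -2.0], [0.5, 0.5], [0.0, 0.0]])
Y = X @ effects + rng.normal(size=(n, t))

Yc = Y - Y.mean(axis=0)            # centered trait values
W_hat = (2.0 / n) * (X.T @ Yc)     # estimator of the marker-trait covariance matrix
print(np.round(W_hat, 2))
```

Under these assumptions each entry of the estimate is close to the corresponding additive effect; with linkage between marker and QTL the limit would be shrunk by the factor 1 − 2*r*_{i}.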

## 4.5 Comparing LMSI Versus LPSI and GW-LMSI Efficiency

To compare LMSI efficiency versus GW-LMSI efficiency for predicting the net genetic merit, we use the simulated data set described in Chap. 2, Sect. 2.8.1.

Table 4.3 Estimated linear phenotypic, molecular, and genome-wide selection indices (LPSI, LMSI, and GW-LMSI respectively), selection responses and variance of the predicted error, and estimated ratio of LMSI accuracy to LPSI and GW-LMSI accuracy expressed in percentages for 4 traits, 2500 markers and 500 genotypes (each with four repetitions) in one environment for five simulated selection cycles

Cycle | Selection response: LPSI | Selection response: LMSI | Selection response: GW-LMSI | Variance of the predicted error: LPSI | Variance of the predicted error: LMSI | Variance of the predicted error: GW-LMSI | Efficiency of LMSI versus LPSI (%) | Efficiency of LMSI versus GW-LMSI (%) |
---|---|---|---|---|---|---|---|---|
1 | 17.84 | 19.60 | 16.24 | 22.53 | 0.07 | 39.84 | 10.07 | 20.67 |
2 | 15.66 | 24.36 | 13.88 | 22.66 | 0.07 | 40.06 | 12.14 | 26.81 |
3 | 14.44 | 14.70 | 12.13 | 21.95 | 1.86 | 39.86 | 3.43 | 21.27 |
4 | 14.29 | 15.29 | 12.48 | 22.84 | 1.46 | 39.09 | 6.57 | 22.50 |
5 | 13.86 | 15.15 | 11.49 | 22.13 | 0.88 | 39.65 | 11.11 | 31.88 |
Average | 15.22 | 17.82 | 13.24 | 22.42 | 0.87 | 39.70 | 8.66 | 24.63 |

According to Fig. 4.4, for this data set the estimated LMSI accuracy (\( {\widehat{\rho}}_{H{\widehat{I}}_M} \)) was higher than the estimated LPSI and GW-LMSI accuracy (\( {\widehat{\rho}}_{H\widehat{I}} \) and \( {\widehat{\rho}}_{H{\widehat{I}}_W} \) respectively), for the five simulated selection cycles, that is, \( {\widehat{\rho}}_{H{\widehat{I}}_M}>{\widehat{\rho}}_{H\widehat{I}}>{\widehat{\rho}}_{H{\widehat{I}}_W} \). In a similar manner, Table 4.3 results indicate that the estimated LMSI selection response (\( {\widehat{R}}_M \)) was higher than the estimated LPSI and GW-LMSI selection responses (\( {\widehat{R}}_I \) and \( {\widehat{R}}_W \) respectively): \( {\widehat{R}}_M>{\widehat{R}}_I>{\widehat{R}}_W \).

Note that the estimated LPSI, LMSI, and GW-LMSI variances of the predicted error, and the estimated LMSI efficiency versus LPSI efficiency and versus GW-LMSI efficiency (expressed in percentages) are related to the estimated LMSI, LPSI, and GW-LMSI accuracies, and that in all five selection cycles, \( {\widehat{\rho}}_{H{\widehat{I}}_M}>{\widehat{\rho}}_{H\widehat{I}}>{\widehat{\rho}}_{H{\widehat{I}}_W} \). This implies that the estimated LMSI variance of the predicted error was lower than the estimated LPSI and GW-LMSI variance of the predicted error. In a similar manner, because \( {\widehat{\rho}}_{H{\widehat{I}}_M}>{\widehat{\rho}}_{H\widehat{I}}>{\widehat{\rho}}_{H{\widehat{I}}_W} \), the estimated LMSI efficiency was higher than the estimated LPSI efficiency and the estimated GW-LMSI efficiency.

Based on Fig. 4.4 and Table 4.3 results, we conclude that the LMSI was a better predictor of the net genetic merit than the LPSI, and that the LPSI is a better predictor of the net genetic merit than the GW-LMSI for this simulated data set.

## References

- Bulmer MG (1980) The mathematical theory of quantitative genetics. Lectures in biomathematics. University of Oxford, Clarendon Press, Oxford
- Charcosset A, Gallais A (1996) Estimation of the contribution of quantitative trait loci (QTL) to the variance of a quantitative trait by means of genetic markers. Theor Appl Genet 93:1193–1201
- Crossa J, Cerón-Rojas JJ (2011) Multi-trait multi-environment genome-wide molecular marker selection indices. J Indian Soc Agric Stat 62(2):125–142
- Dekkers JCM, Settar P (2004) Long-term selection with known quantitative trait loci. Plant Breed Rev 24:311–335
- Edwards MD, Stuber CW, Wendel JF (1987) Molecular-marker-facilitated investigations of quantitative-trait loci in maize. I. Numbers, genomic distribution and types of gene action. Genetics 116:113–125
- Hospital F, Moreau L, Lacoudre F, Charcosset A, Gallais A (1997) More on the efficiency of marker-assisted selection. Theor Appl Genet 95:1181–1189
- Johnson RA, Wichern DW (2007) Applied multivariate statistical analysis, 6th edn. Pearson Prentice Hall, Upper Saddle River, NJ
- Knapp SJ (1998) Marker-assisted selection as a strategy for increasing the probability of selecting superior genotypes. Crop Sci 38:1164–1174
- Lande R, Thompson R (1990) Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124:743–756
- Lange C, Whittaker JC (2001) On prediction of genetic values in marker-assisted selection. Genetics 159:1375–1381
- Mardia KV, Kent JT, Bibby JM (1982) Multivariate analysis. Academic Press, New York
- Moreau L, Charcosset A, Hospital F, Gallais A (1998) Marker-assisted selection efficiency in populations of finite size. Genetics 148:1353–1365
- Moreau L, Hospital F, Whittaker J (2007) Marker-assisted selection and introgression. In: Balding DJ, Bishop M, Cannings C (eds) Handbook of statistical genetics, vol 1, 3rd edn. Wiley, New York, pp 718–751
- Rencher AC (2002) Methods of multivariate analysis. Wiley, New York
- Searle S, Casella G, McCulloch CE (2006) Variance components. Wiley, Hoboken, NJ
- Whittaker JC (2003) Marker-assisted selection and introgression. In: Balding DJ, Bishop M, Cannings C (eds) Handbook of statistical genetics, vol 1, 2nd edn. Wiley, New York, pp 554–574
- Wright AJ, Mowers RP (1994) Multiple regression for molecular marker, quantitative trait data from large F_{2} population. Theor Appl Genet 89:305–312
- Zhang W, Smith C (1992) Computer simulation of marker-assisted selection utilizing linkage disequilibrium. Theor Appl Genet 83:813–820
- Zhang W, Smith C (1993) Simulation of marker-assisted selection utilizing linkage disequilibrium: the effects of several additional factors. Theor Appl Genet 86:492–496

## Copyright information

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.