Introduction

During the first decade of the twenty-first century, the focus of genomic selection research was the development of theory and methods (e.g., Meuwissen et al. 2001; Habier et al. 2007; Daetwyler et al. 2008; Bernardo and Yu 2007; VanRaden 2008), and most researchers worked in animal rather than plant breeding. This changed in the following decade with the development of specialized software for genomic prediction, including rrBLUP (Endelman 2011), GAPIT (Lipka et al. 2012), synbreed (Wimmer et al. 2012), BGLR (Pérez and de los Campos 2014), and sommer (Covarrubias-Pazaran 2016). Over the last several years, new software development has emphasized multi-trait prediction models (Montesinos-López et al. 2019; Runcie et al. 2021; Pérez-Rodríguez and de los Campos 2022). Collectively, these software publications have been cited several thousand times, which reflects their enabling role for the adoption of genomic selection, particularly in plant breeding.

However, these packages are limited in their ability to handle the full complexity of plant breeding data, with different experimental designs, heritabilities, and spatial models for non-genetic variation. The challenge of properly analyzing multi-environment datasets existed before genomic selection, which led to the concept of a two-stage analysis (Frensham et al. 1997). In Stage 1, genotype means are estimated as fixed effects for each environment, which become the response variable in Stage 2. The errors of the Stage 1 estimates are typically different, and failure to account for this in Stage 2 leads to sub-optimal results (Möhring and Piepho 2009). A “fully efficient” two-stage analysis uses the full variance–covariance matrix of the Stage 1 genotype means in Stage 2, rather than a diagonal approximation (Piepho et al. 2012; Damesa et al. 2017). Previous examples of a properly weighted, two-stage analysis have used one of three well-established, REML-based programs for mixed models: SAS PROC MIXED (SAS Institute Inc, Cary, NC), ASReml (Gilmour et al. 2015), or ASReml-R (Butler et al. 2018). All three programs allow the variance–covariance matrix of the random effect for Stage 1 errors to be specified while estimating the other, unknown variance components of the Stage 2 model. Despite this precedent, many studies continue to ignore Stage 1 errors, and I believe a major reason is the additional programming skill required.

The goal of the current research was to develop a new R package (R Core Team 2022) for genomic selection that makes fully efficient, two-stage analysis more accessible to plant breeders. The software, called StageWise, returns empirical BLUPs using variance components estimated with ASReml-R. It also works for polyploids and incorporates advanced features such as directional dominance and multi-trait selection indices.

Methods

Single trait with homogeneous GxE

The response variable for Stage 2 is the Stage 1 BLUEs for the effect of genotype in environment. The mixed model with homogeneous GxE can be written as

$$BLUE\left[ {g_{{{\text{ij}}}} } \right] = y_{{{\text{ij}}}} = E_{{\text{j}}} + g_{{\text{i}}} + gE_{{{\text{ij}}}} + s_{{{\text{ij}}}}$$
(1)

where \(g_{{{\text{ij}}}}\) is the genotypic value for individual (or clone) i in environment j, \({E}_{\mathrm{j}}\) is the fixed effect for environment j, \({g}_{i}\) is the random effect for individual i across environments, and the GxE effect, \({gE}_{\mathrm{ij}},\) is actually the model residual (Damesa et al. 2017). The \({s}_{\mathrm{ij}}\) effect, which represents the Stage 1 estimation error, is multivariate normal with no free variance parameters: the variance–covariance matrix is the direct sum of the variance–covariance matrices of the Stage 1 BLUEs (Damesa et al. 2017). The \({gE}_{ij}\) are independent and identically distributed (i.i.d.), which implies a single genetic correlation between all environments. Without marker data, the software assumes the \({g}_{\mathrm{i}}\) effects are i.i.d.

When marker data are provided, the software decomposes \({g}_{i}\) into additive and non-additive values. The vector of additive values is multivariate normal with covariance proportional to a genomic additive matrix G (VanRaden 2008, Method 1, extended to arbitrary ploidy). If W represents the centered matrix of allele dosages (n individuals x m bi-allelic markers with frequencies p = 1 − q), then for ploidy \(\phi\),

$${\mathbf{G}} = \frac{{{\mathbf{WW}}^{{\text{T}}} }}{{\phi \mathop \sum \nolimits_{k} p_{{\text{k}}} q_{{\text{k}}} }}$$
(2)
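
As an illustration of Eq. (2), the following minimal Python sketch builds G from a small dosage matrix. It is not the StageWise implementation: the function name, the toy dosages, and the use of sample allele frequencies for centering (W = X − φp) are assumptions made only for this example.

```python
def genomic_additive_matrix(dosage, ploidy):
    """Eq. (2): G = W W' / (ploidy * sum_k p_k q_k).

    dosage: n rows (individuals), each with m allele dosages in [0, ploidy].
    Allele frequencies are estimated from the sample (an assumption here).
    """
    n, m = len(dosage), len(dosage[0])
    # Sample frequency of the counted allele at each marker
    p = [sum(row[k] for row in dosage) / (n * ploidy) for k in range(m)]
    # Center each column: W_ik = X_ik - ploidy * p_k
    W = [[row[k] - ploidy * p[k] for k in range(m)] for row in dosage]
    denom = ploidy * sum(pk * (1 - pk) for pk in p)
    G = [[sum(W[i][k] * W[j][k] for k in range(m)) / denom for j in range(n)]
         for i in range(n)]
    return G, p


# Toy tetraploid example: 3 clones x 4 markers (made-up dosages)
X = [[0, 2, 4, 1],
     [2, 2, 0, 3],
     [4, 2, 2, 2]]
G, p = genomic_additive_matrix(X, ploidy=4)
for row in G:
    print([round(v, 3) for v in row])
```

Because the columns of W are centered with sample frequencies, every row of G sums to zero in this sketch, so G computed this way is singular.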

If a three-column pedigree is provided, G can be blended with the pedigree relationship matrix A (calculated using R package AGHmatrix (Amadeu et al. 2016)) to produce \(\mathbf{H}=\left(1-\upomega \right)\mathbf{G}+\upomega \mathbf{A}\), for \(0\le\upomega \le 1\) (Legarra et al. 2009; Christensen and Lund 2010). In addition to the additive polygenic effect, the user can indicate some markers should be included as additive (fixed effect) covariates in Eq. (1), to capture large effect QTL.

Directional dominance

Two models for the non-additive genetic values are available. In the genetic residual model, the non-additive values are i.i.d. The other option is a directional (digenic) dominance model, which follows the classical framework of Fisher (1941) and Kempthorne (1957) and is a refinement of recent research (Vitezica et al. 2013; Xiang et al. 2016; Endelman et al. 2018; Batista et al. 2022). For a locus with two alleles designated 0/1, there are three digenic dominance effects \({\beta }_{00},{\beta }_{01}\), \({\beta }_{11}\), which equal the dominance deviation in diploids, but more generally for any ploidy are the coefficients for regressing the dominance deviation on diplotype dosage. (Higher order dominance effects for polyploids are not considered.) These dominance effects can be expressed in terms of a parameter that has no established name but may be called a digenic substitution effect, \(\beta\), by analogy with the allele substitution effect α for additive effects. The \(\beta\) parameter represents the average change in dominance deviation per unit increase in dosage of the heterozygous diplotype:

$$\beta = \beta_{01} - \frac{1}{2}\left( {\beta_{00} + \beta_{11} } \right)$$
(3)

(This differs from the scaling in Endelman et al. (2018) by a factor of −2 so that \(\beta\) in Eq. (3) equals d in the classical diploid model of Vitezica et al. (2013).) Designating the frequency of allele 1 as \(p = 1 - q\), the dominance effects can be expressed in terms of the substitution effect:

$$\begin{gathered} \beta_{00} = - 2p^{2} \beta \hfill \\ \beta_{01} = 2pq\beta \hfill \\ \beta_{11} = - 2q^{2} \beta \hfill \\ \end{gathered}$$
(4)

The dominance value of an individual is the sum of its dominance effects and can be written as \(Q\beta\), where the dominance coefficient Q for ploidy \(\phi\) and allele dosage X (of allele 1) is

$$Q = - 2\left( {\begin{array}{*{20}c} \phi \\ 2 \\ \end{array} } \right)p^{2} + 2p\left( {\phi - 1} \right)X - X\left( {X - 1} \right)$$
(5)

In Eq. (5), \(\left( {\begin{array}{*{20}c} \phi \\ 2 \\ \end{array} } \right)\) is the binomial coefficient. The dominance genetic variance, VD, is \(\left(\begin{array}{c}\phi \\ 2\end{array}\right)\) times the variance of the dominance effects, \(4{p}^{2}{q}^{2}{\beta }^{2}.\) Extending this framework to m loci, the dominance value is \(\mathop \sum \nolimits_{k = 1}^{m} Q_{k} \beta_{k}\), and the dominance variance is
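
The dominance coefficient of Eq. (5) can be checked numerically. The sketch below (function names are hypothetical, not part of StageWise) confirms that Q reduces to the coefficients of Eq. (4) for diploids, and that the variance of Q equals the binomial coefficient times \(4p^{2}q^{2}\) for a tetraploid locus. The second check assumes allele dosage follows a Binomial(\(\phi\), p) distribution, which is an assumption of this example.

```python
from math import comb


def dominance_coefficient(dosage, p, ploidy):
    """Eq. (5): Q for allele dosage X (of allele 1) with frequency p."""
    return (-2 * comb(ploidy, 2) * p**2
            + 2 * p * (ploidy - 1) * dosage
            - dosage * (dosage - 1))


def variance_of_Q(p, ploidy):
    """Variance of Q when dosage ~ Binomial(ploidy, p) (assumed here)."""
    q = 1 - p
    probs = [comb(ploidy, x) * p**x * q**(ploidy - x) for x in range(ploidy + 1)]
    Qs = [dominance_coefficient(x, p, ploidy) for x in range(ploidy + 1)]
    mean = sum(w * Q for w, Q in zip(probs, Qs))
    return sum(w * (Q - mean) ** 2 for w, Q in zip(probs, Qs))


p = 0.3
q = 1 - p

# Diploid: Q for dosage 0, 1, 2 should equal -2p^2, 2pq, -2q^2 (Eq. 4 with beta = 1)
diploid = [dominance_coefficient(x, p, 2) for x in (0, 1, 2)]
print(diploid)

# Tetraploid: var(Q) should equal C(4,2) * 4 p^2 q^2
v4 = variance_of_Q(p, 4)
print(v4, comb(4, 2) * 4 * p**2 * q**2)
```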

$$V_{D} = \left( {\begin{array}{*{20}c} \phi \\ 2 \\ \end{array} } \right)\sum\limits_{k = 1}^{m} {4p_{k}^{2} } q_{k}^{2} \beta_{k}^{2} + \sum\limits_{k} {\sum\limits_{{k^{\prime } \ne k}} {\beta_{k} \beta_{{k^{\prime } }} } } {\text{cov}} \left[ {Q_{k} ,Q_{{k^{\prime } }} } \right]$$
(6)

The first term in Eq. (6) is the dominance genic variance, which depends on allele frequencies but not LD between loci. The second term is the disequilibrium covariance, which can be positive or negative.

In classical quantitative genetics, the substitution effects are fixed parameters, but to compute dominance values by BLUP, we switch to viewing them as random normal effects (de los Campos et al. 2015), with mean \({\mu }_{\beta }\) and variance \({\sigma }_{\beta }^{2}\). For a trait with no average heterosis in the population, \({\mu }_{\beta }=0\) (Varona et al. 2018). Let Q denote the n × m matrix of dominance coefficients for n individuals at m loci. The vector of dominance values \(\mathbf{Q}{\varvec{\upbeta}}\) is multivariate normal, with mean \(\mathbf{Q}\mathbf{1}{\mu }_{\beta }\) and variance–covariance matrix \(\mathbf{Q}{\mathbf{Q}}^{\mathrm{T}}{\sigma }_{\beta }^{2}\). Equivalently, the dominance values can be written as

$${\mathbf{Q}\varvec{\beta}} = - b{\mathbf{F}} + {\mathbf{d}}_{0}$$
(7)

where \({\mathbf{F}}\) is a vector of genomic inbreeding coefficients, with regression coefficient \(b\) (positive value implies heterosis), and \({\mathbf{d}}_{0}\sim \mathrm{MVN}\left(0,\mathbf{D}{\sigma }_{D}^{2}\right)\) represents dominance with no average heterosis. The genomic dominance matrix D is defined by interpreting its variance component \({\sigma }_{D}^{2}\) as the expected value of the classical dominance variance with respect to the substitution effects, assuming no overall heterosis. From Eq. (6) the result is

$$\sigma_{D}^{2} = E\left[ {V_{D} } \right] = \sigma_{\beta }^{2} \left( {\begin{array}{*{20}c} \phi \\ 2 \\ \end{array} } \right)\mathop \sum \limits_{k} 4p_{k}^{2} q_{k}^{2}$$
(8)

which leads to

$${\mathbf{D}} = \frac{{{\mathbf{QQ}}^{{\text{T}}} }}{{\left( {\begin{array}{*{20}c} \phi \\ 2 \\ \end{array} } \right)\mathop \sum \nolimits_{k} 4p_{k}^{2} q_{k}^{2} }}$$
(9)

From Eq. (7), the vector of genomic inbreeding coefficients F is proportional to the row sum of Q. The correct scaling is derived by considering the expected value of Q (Eq. 5) in the classical sense (where genotypes are random and parameters are fixed), for a completely inbred population in which homozygotes of allele 1 occur with frequency p. Under these conditions, \(E\left[X\right]=\phi p\) and \(E\left[{X}^{2}\right]={\phi }^{2}p\), which leads to \(E\left[Q\right]=-2pq\left(\begin{array}{c}\phi \\ 2\end{array}\right)\). Extending this to multiple loci and equating the result to F = 1 sets the proportionality constant and leads to the following definition:

$${\mathbf{F}} = \frac{{ - {\mathbf{Q}}\mathbf{1}}}{{\left( {\begin{array}{*{20}c} \phi \\ 2 \\ \end{array} } \right)\mathop \sum \nolimits_{k} 2p_{k} q_{k} }}$$
(10)

The vector of genomic inbreeding coefficients is included as a fixed effect covariate in the Stage 2 model. Inbreeding coefficients can also be computed from the diagonal elements of the additive relationship matrix (either A or G) according to \((G-1)/(\phi -1)\) (Henderson 1976; Gallais 2003; Endelman and Jannink 2012).
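
To make the scaling in Eq. (10) concrete, the following sketch computes F from a dosage matrix and checks two limiting cases: a completely inbred tetraploid sample (dosages of 0 or 4 only), for which the mean F is 1, and a diploid locus in Hardy–Weinberg proportions, for which the mean F is 0. The function name and toy data are assumptions for illustration, and allele frequencies are taken from the sample.

```python
from math import comb


def genomic_inbreeding(dosage, ploidy):
    """Eq. (10): F = -Q1 / (C(ploidy, 2) * sum_k 2 p_k q_k)."""
    n, m = len(dosage), len(dosage[0])
    c2 = comb(ploidy, 2)
    # Sample allele frequencies (an assumption of this sketch)
    p = [sum(row[k] for row in dosage) / (n * ploidy) for k in range(m)]
    denom = c2 * sum(2 * pk * (1 - pk) for pk in p)
    F = []
    for row in dosage:
        # Row sum of the dominance coefficients Q (Eq. 5)
        row_sum_Q = sum(
            -2 * c2 * p[k] ** 2 + 2 * p[k] * (ploidy - 1) * x - x * (x - 1)
            for k, x in enumerate(row)
        )
        F.append(-row_sum_Q / denom)
    return F


# Completely inbred tetraploids: every dosage is 0 or 4
inbred = [[4, 0, 4], [0, 0, 4], [0, 4, 0], [0, 4, 0]]
F_inbred = genomic_inbreeding(inbred, ploidy=4)

# One diploid locus in Hardy-Weinberg proportions (1/4, 1/2, 1/4)
hwe = [[0], [1], [1], [2]]
F_hwe = genomic_inbreeding(hwe, ploidy=2)

print(sum(F_inbred) / len(F_inbred), sum(F_hwe) / len(F_hwe))
```

Individual coefficients vary around these means; only the population average is fixed by the scaling.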

Extension to multiple locations or traits

StageWise has the option of including a random effect \(g(L)\) in Stage 2 for genotype within location (or L can represent some other factor, such as management). Using the subscript k to designate location, the linear model (Eq. 1) becomes

$${\text{BLUE}}\left[ {g_{ijk} } \right] = y_{ijk} = E_{j} + g\left( L \right)_{ik} + gE_{ijk} + s_{ijk}$$
(11)

The \(g\left( L \right)_{{{\text{ik}}}}\) effect is modeled using a separable covariance structure, \({\mathbf{I}} \otimes {{\varvec{\Gamma}}}\) in the absence of marker data, where the genetic covariance between locations \({\varvec{\Gamma}}\) follows a second-order factor-analytic (FA2) model. The FA2 model provides a good balance between statistical parsimony and complexity for many plant breeding applications, and Stage2 returns the rotated and scaled factor loadings (Cullis et al. 2010). A heterogeneous variance model is used for \({gE}_{\mathrm{ijk}}\) (which is the model residual as before), with different variance parameters for each location.

When marker data are provided, genotypic value is partitioned into additive and non-additive values, and the FA2 model is still used for the additive covariance between locations. Attempts to use an FA2 model for non-additive values were unsuccessful in several datasets, and even with a compound symmetry model, the correlation parameter was always on the boundary (equal to 1). The non-additive correlation parameter was therefore fixed at 1 and accepted as a model limitation. When markers are included as fixed effect covariates, different regression coefficients are estimated for each location. Similarly, different regression coefficients for genomic inbreeding are estimated per location.

A similar framework is used for multi-trait analysis, with trait replacing location in Eq. (11), except that all trait covariance matrices are unstructured. In Stage 1, a separable covariance model is used for the residuals, and in Stage 2, the fixed effects for environment are trait-specific. When markers are used to partition additive and non-additive genetic value, separate unstructured covariance matrices are estimated for each. Multi-trait models are limited to the homogeneous GxE structure described for single trait analysis (i.e., the genetic correlation between all environments is the same, regardless of location).

Proportion of variance explained

The aim is to quantify the proportion of variance (PVE) explained by each effect in the Stage 2 model, excluding the main effect Ej (which mirrors how heritability is calculated). The core idea is to compute variances based on the method of Legarra (2016), and the PVE is the variance of each effect divided by the sum. This is not a true partitioning of variance because the Stage 2 effects are not necessarily orthogonal.

First consider effects such as \(g{E}_{\mathrm{ij}}\) and sij (Eq. 1), which are indexed by both genotype i and environment j. Representing these effects by vector y of length t, the variance is

$$V_{y} = \frac{1}{t}\mathop \sum \limits_{ij} y_{ij}^{2} - \left( {\frac{1}{t}\mathop \sum \limits_{ij} y_{ij} } \right)^{2} = \frac{1}{t}{\mathbf{y^{\prime}y}} - \frac{1}{{t^{2} }}\left( {\mathbf{1}_{t}^{^{\prime}} {\mathbf{y}}} \right)^{2}$$
(12)

The symbol \(\mathbf{1}_{t}\) in Eq. (12) is a t × 1 vector of 1’s. For multivariate normal (MVN) y with mean \({\varvec{\upmu}}\) and variance–covariance matrix \(\mathbf{K}\), the expectation of Vy can be computed using the following general formula for quadratic forms (Searle et al. 1992):

$$E\left[ {{\mathbf{y^{\prime}Ay}}} \right] = {\text{tr}}\left( {{\mathbf{AK}}} \right) + {\mathbf{\upmu^{\prime}A\upmu }}$$
(13)

The “tr” in Eq. (13) stands for trace, which equals the sum of the diagonal elements. It follows that

$$E\left[ {V_{{\text{y}}} } \right] = \left[ {\overline{{{\text{diag}}\left( {\mathbf{K}} \right)}} - \overline{{{\text{K}}_{ \cdot \cdot } }} } \right] + \left[ {\overline{{\mu_{ \cdot }^{2} }} - \left( {\overline{\mu .} } \right)^{2} } \right]$$
(14)

where \(\overline{\mathrm{diag }(\mathbf{K})}\) is the mean of the diagonal elements of K. Equation (14) follows the convention of using an overbar to indicate averaging with respect to dotted subscripts.

For effects indexed only by genotype, such as \({g}_{\mathrm{i}}\), Eq. (14) needs to be modified to accommodate unbalanced experiments. If \(\mathbf{x}\sim \mathrm{MVN}\left({\varvec{\upmu}},\mathbf{K}\right)\), and Z is the incidence matrix relating x to the gE basis of the Stage 2 model, then \(\mathbf{y}=\mathbf{Z}\mathbf{x}\) is the random vector for which we need to compute the expected variance. The result is identical to Eq. (14) provided the averages are interpreted as weighted averages:

$$\begin{gathered} \overline{{{\text{diag}}\left( {\mathbf{K}} \right)}} = \frac{1}{t}\mathop \sum \limits_{i} w_{{\text{i}}} {\text{K}}_{{{\text{ii}}}} \hfill \\ \overline{{{\text{K}}_{ \cdot \cdot } }} = \frac{1}{{t^{2} }}\mathop \sum \limits_{i,j} w_{i} {\text{K}}_{{{\text{ij}}}} w_{{\text{j}}} \hfill \\ \overline{{\mu_{.}^{b} }} = \frac{1}{t}\mathop \sum \limits_{i} w_{i} \mu_{i}^{b} \,\,{\text{for exponent }}b = 1,2, \ldots \hfill \\ \end{gathered}$$
(15)

The weights \(w_{i}\) in Eq. (15) come from \({\mathbf{w}} = \mathbf{1}_{t}^{\varvec{^{\prime}}} {\mathbf{Z}}\) and represent the number of environments for genotype \(i\).
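
The weighted averages in Eq. (15) can be verified against a direct application of Eq. (14) to y = Zx. The sketch below uses made-up values for K, the mean vector, and Z; the function names are hypothetical.

```python
def expected_variance(K, mu, w):
    """Eqs. (14)-(15): E[V_y] for y = Zx with x ~ MVN(mu, K).

    w[i] is the number of environments in which genotype i appears.
    """
    n, t = len(w), sum(w)
    mean_diag = sum(w[i] * K[i][i] for i in range(n)) / t
    mean_all = sum(w[i] * K[i][j] * w[j] for i in range(n) for j in range(n)) / t**2
    mean_mu = sum(w[i] * mu[i] for i in range(n)) / t
    mean_mu2 = sum(w[i] * mu[i] ** 2 for i in range(n)) / t
    return (mean_diag - mean_all) + (mean_mu2 - mean_mu**2)


def expected_variance_direct(K, mu, Z):
    """Eq. (14) applied to y = Zx, using var(y) = Z K Z' and E[y] = Z mu."""
    t, n = len(Z), len(Z[0])
    ZK = [[sum(Z[r][i] * K[i][j] for i in range(n)) for j in range(n)] for r in range(t)]
    V = [[sum(ZK[r][j] * Z[s][j] for j in range(n)) for s in range(t)] for r in range(t)]
    m = [sum(Z[r][i] * mu[i] for i in range(n)) for r in range(t)]
    mean_diag = sum(V[r][r] for r in range(t)) / t
    mean_all = sum(sum(row) for row in V) / t**2
    return (mean_diag - mean_all) + (sum(v * v for v in m) / t - (sum(m) / t) ** 2)


K = [[1.0, 0.5, 0.0], [0.5, 1.2, 0.3], [0.0, 0.3, 0.9]]
mu = [1.0, 2.0, 0.5]
# Genotypes 1 and 2 appear in two environments, genotype 3 in one
Z = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
w = [sum(col) for col in zip(*Z)]  # w = 1'Z

print(expected_variance(K, mu, w), expected_variance_direct(K, mu, Z))
```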

For the multi-location model, the genotype within location variance is computed using \({\mathbf{K}} = {\mathbf{G}} \otimes {{\varvec{\Gamma}}}\) and weights equal to the number of times each gL combination is present. For a balanced experiment with n individuals and s locations, the result is

$$\begin{aligned} E\left[ {V_{g\left( L \right)} } \right] & = \frac{{{\text{tr}}\left( {{\mathbf{G}} \otimes {{\varvec{\Gamma}}}} \right)}}{ns} - \frac{{\left( {\mathbf{1}_{n}^{^{\prime}} \otimes \mathbf{1}_{s}^{^{\prime}} } \right)\left( {{\mathbf{G}} \otimes {{\varvec{\Gamma}}}} \right)\left( {\mathbf{1}_{n} \otimes \mathbf{1}_{s} } \right)}}{{n^{2} s^{2} }} \\ & = \left[ {\overline{{{\text{diag}}\left( {\mathbf{G}} \right)}} } \right]\left[ {\overline{{{\text{diag}}\left( {{\varvec{\Gamma}}} \right)}} } \right] - \left( {\overline{{{\text{G}}_{ \cdot \cdot } }} } \right)\left( {\overline{{{\Gamma }_{ \cdot \cdot } }} } \right) \end{aligned}$$
(16)

Following Rogers et al. (2021), Eq. (16) is partitioned into a main effect Vg plus genotype x location interaction VgL. The main effect is based on the average of the \(\frac{{s\left( {s - 1} \right)}}{2}\) distinct off-diagonal elements of \({{\varvec{\Gamma}}}\):

$$E\left[ {V_{g} } \right] = \left[ {\overline{{{\text{diag}}\left( {\mathbf{G}} \right)}} - \overline{{{\text{G}}_{ \cdot \cdot } }} } \right]\left[ {\frac{2}{{s\left( {s - 1} \right)}}\mathop \sum \limits_{i} \mathop \sum \limits_{j > i} {\Gamma }_{{{\text{ij}}}} } \right]$$
(17)

Equation (17) is extended to the unbalanced case by using weighted averages for G.
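
A small numerical illustration of Eqs. (16) and (17) for the balanced case is sketched below. The matrices are made up, the function names are hypothetical, and treating the interaction variance as the difference between the two expressions is an assumption of this example (based on the description of Eq. (16) as being partitioned into the two components).

```python
def mean_diag(M):
    """Mean of the diagonal elements of a square matrix."""
    return sum(M[i][i] for i in range(len(M))) / len(M)


def mean_all(M):
    """Mean of all elements of a square matrix."""
    return sum(sum(row) for row in M) / len(M) ** 2


def var_genotype_within_location(G, Gamma):
    """Eq. (16), balanced case."""
    return mean_diag(G) * mean_diag(Gamma) - mean_all(G) * mean_all(Gamma)


def var_main_effect(G, Gamma):
    """Eq. (17): uses the mean of the distinct off-diagonal elements of Gamma."""
    s = len(Gamma)
    off = sum(Gamma[i][j] for i in range(s) for j in range(i + 1, s))
    return (mean_diag(G) - mean_all(G)) * (2 * off / (s * (s - 1)))


G = [[1.0, 0.0], [0.0, 1.0]]          # two unrelated individuals (made up)
Gamma = [[2.0, 1.0], [1.0, 2.0]]      # two locations (made up)

V_gL_total = var_genotype_within_location(G, Gamma)
V_g = var_main_effect(G, Gamma)
V_gxL = V_gL_total - V_g               # remainder, assumed to be the interaction
print(V_gL_total, V_g, V_gxL)
```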

BLUP

Empirical BLUPs are calculated conditional on the variance components estimated in Stage 2. All Stage 2 models described above can be written in the following standard form:

$${\mathbf{y}} = {\mathbf{X\updelta }} + {\mathbf{Zu}} + {{\varvec{\upvarepsilon}}}$$
(18)

where \({{\varvec{\updelta}}}\) is a vector of fixed effects (for environments, markers, and inbreeding), \(\mathbf{u}\) is a vector of multivariate normal genetic effects, and \({\varvec{\upvarepsilon}}\) is the “residual” vector (for the g x env and Stage 1 error effects). Let \(\widehat{\mathbf{u}}\) denote BLUP[u], which is calculated in one of two ways for numerical efficiency. If the length of y exceeds the length of u, then \(\widehat{\mathbf{u}}\) is calculated by inverting the coefficient matrix of the mixed model equations (MME; Henderson 1975). Otherwise, \(\widehat{\mathbf{u}}\) is calculated by inverting \({\mathbf{V}} = var\left( {\mathbf{y}} \right)\) and using the following result (Searle et al. 1992):

$$\begin{gathered} {\hat{\mathbf{u}}} = cov\left( {{\mathbf{u}},{\mathbf{y}}} \right){\mathbf{Py}} = var\left( {\mathbf{u}} \right){\mathbf{Z}}^{\prime}{\mathbf{Py}} \hfill \\ {\text{where }}{\mathbf{P}} = {\mathbf{V}}^{ - 1} - {\mathbf{V}}^{ - 1} {\mathbf{X}}\left( {{\mathbf{X^{\prime}V}}^{ - 1} {\mathbf{X}}} \right)^{ - 1} {\mathbf{X^{\prime}V}}^{ - 1} \hfill \\ \end{gathered}$$
(19)
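
The equivalence of the two computational routes can be illustrated on a toy model. The sketch below is not the StageWise code: it uses plain Python lists, a small Gauss-Jordan inverse, and made-up values for y, X, Z, var(u), and the residual covariance R (all assumptions for illustration). It computes \(\widehat{\mathbf{u}}\) once via Eq. (19) and once from the mixed model equations.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def add(A, B, sign=1.0):
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(A)
    M = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


# Toy data: 5 records, 1 fixed effect (mean), 3 genotypes
y = [[10.0], [12.0], [11.0], [9.5], [13.0]]
X = [[1.0] for _ in range(5)]
Z = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
Gu = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.25], [0.0, 0.25, 1.0]]   # var(u)
r_diag = [0.5, 0.4, 0.6, 0.5, 0.3]                            # var(residual)
R = [[r if i == j else 0.0 for j in range(5)] for i, r in enumerate(r_diag)]
Xt, Zt = transpose(X), transpose(Z)

# Route 1: invert V = var(y) and apply Eq. (19)
V = add(matmul(matmul(Z, Gu), Zt), R)
Vi = inverse(V)
ViX = matmul(Vi, X)
P = add(Vi, matmul(matmul(ViX, inverse(matmul(Xt, ViX))), transpose(ViX)), sign=-1.0)
u_hat_V = matmul(matmul(matmul(Gu, Zt), P), y)

# Route 2: invert the coefficient matrix of the mixed model equations
Ri = inverse(R)
XtRi, ZtRi = matmul(Xt, Ri), matmul(Zt, Ri)
top = [a + b for a, b in zip(matmul(XtRi, X), matmul(XtRi, Z))]
bottom = [a + b for a, b in zip(matmul(ZtRi, X), add(matmul(ZtRi, Z), inverse(Gu)))]
C = top + bottom
rhs = matmul(XtRi, y) + matmul(ZtRi, y)
sol = matmul(inverse(C), rhs)
u_hat_MME = sol[len(X[0]):]

print([round(v[0], 6) for v in u_hat_V])
print([round(v[0], 6) for v in u_hat_MME])
```

In this toy case the MME coefficient matrix is 4 × 4 while V is 5 × 5, so the MME route inverts the smaller matrix.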

Genetic merit is a linear combination of random and fixed effects. For random effects, the structure of u is trait nested within individual, nested within additive vs. non-additive values. For fixed effects (ignoring the environment effects), \({\varvec{\updelta}}\) contains trait nested within marker effects, followed by trait nested within the regression coefficient for heterosis. If W represents the centered matrix of allele dosages for the fixed effect markers (n individuals x m markers), F is the vector of genomic inbreeding coefficients, and c is the vector of economic weights for multiple traits or locations, then the genetic merit vector for the population is

$${{\varvec{\uptheta}}} = \left( {\left[ {\begin{array}{*{20}c} {{\mathbf{I}}_{n} } & {\gamma {\mathbf{I}}_{n} } \\ \end{array} } \right] \otimes {\mathbf{c}}\varvec{^{\prime}}} \right){\mathbf{u}} + \left( {\left[ {\begin{array}{*{20}c} {\mathbf{W}} & {\gamma {\mathbf{F}}} \\ \end{array} } \right] \otimes {\mathbf{c}}\varvec{^{\prime}}} \right){{\varvec{\updelta}}}$$
(20)

The value of \(\gamma\) depends on which genetic value is predicted: 0 for additive value, 1 for total value, and \(\left(\frac{\phi }{2}-1\right)/\left(\phi -1\right)\) for breeding value and ploidy \(\phi\) (Gallais 2003). Because BLUP is a linear operator, \(\widehat{{\varvec{\uptheta}}}=\) BLUP[\({\varvec{\uptheta}}]\) (i.e., the selection index) is given by Eq. (20) with u and \({\varvec{\updelta}}\) replaced by their predicted values.
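
The choice of \(\gamma\) can be written as a small helper. This is a sketch with hypothetical names, showing only the random-effect part of Eq. (20) for a single trait, where merit is the additive value plus \(\gamma\) times the non-additive value.

```python
def gamma_for(value_type, ploidy):
    """Weight on the non-additive value in Eq. (20)."""
    if value_type == "additive":
        return 0.0
    if value_type == "total":
        return 1.0
    if value_type == "breeding":
        return (ploidy / 2 - 1) / (ploidy - 1)
    raise ValueError("value_type must be 'additive', 'breeding', or 'total'")


def merit_random_part(additive, non_additive, value_type, ploidy):
    """Single-trait random part of Eq. (20): a_i + gamma * d_i."""
    return additive + gamma_for(value_type, ploidy) * non_additive


# Made-up predicted values for one tetraploid clone
a_hat, d_hat = 2.0, 0.9
for kind in ("additive", "breeding", "total"):
    print(kind, merit_random_part(a_hat, d_hat, kind, ploidy=4))
```

For diploids the breeding value weight is 0, so breeding value and additive value coincide; for tetraploids the weight is 1/3.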

Index coefficients entered by the user are interpreted as relative weights for standardized traits (or locations). To generate the vector c, the software divides the user-supplied weights by the standard deviations of the breeding values (estimated in Stage 2); it also applies an overall scaling such that \(\Vert \mathbf{c}\Vert =1\), which ensures predictions are commensurate with the original trait scale in multi-location models.
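
The standardization described above can be sketched as follows. The function name and the numerical values are hypothetical; the sketch only shows the two steps stated in the text (division by the standard deviation of the breeding values, then scaling to unit norm).

```python
from math import sqrt


def index_coefficients(weights, sd_breeding_values):
    """Convert relative weights for standardized traits into the vector c."""
    raw = [w / s for w, s in zip(weights, sd_breeding_values)]
    norm = sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


# Made-up example: trait 1 weighted +1, trait 2 weighted -0.5,
# with breeding value standard deviations of 4.0 and 0.8
c = index_coefficients([1.0, -0.5], [4.0, 0.8])
print(c)
```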

The reliability \({r}_{i}^{2}\) of the predicted merit \(\hat{\theta }_{i}\) for individual i is the squared correlation with its true value \(\theta_{i}\), which depends only on the random effects. If \({\mathbf{u}}_{i}\) represents the vector of random genetic effects for individual i, and \({{\varvec{\uplambda}}} \, {{\text{denotes} }}\left[ {\begin{array}{*{20}c} 1 & \gamma \\ \end{array} } \right]^{\prime} \otimes {\mathbf{c}}\), then the random effects component of \(\theta_{i}\) is \({{\varvec{\uplambda}}}^{\prime}{\mathbf{u}}_{i}\), and the reliability is

$$r_{i}^{2} = \frac{{cov^{2} \left( {\theta_{i} ,\hat{\theta }_{i} } \right)}}{{var\left( {\theta_{i} } \right)var\left( {\hat{\theta }_{i} } \right)}} = \frac{{\left[ {{{\varvec{\uplambda}}}^{\prime}cov\left( {{\mathbf{u}}_{i} ,{\hat{\mathbf{u}}}_{i} } \right){{\varvec{\uplambda}}}} \right]^{2} }}{{\left[ {{{\varvec{\uplambda}}}^{\prime}var\left( {{\mathbf{u}}_{i} } \right){{\varvec{\uplambda}}}} \right]\left[ {{{\varvec{\uplambda}}}^{\prime}var\left( {{\hat{\mathbf{u}}}_{i} } \right){{\varvec{\uplambda}}}} \right]}} = \frac{{{{\varvec{\uplambda}}}^{\prime}var\left( {{\hat{\mathbf{u}}}_{i} } \right){{\varvec{\uplambda}}}}}{{{{\varvec{\uplambda}}}^{\prime}var\left( {{\mathbf{u}}_{i} } \right){{\varvec{\uplambda}}}}}$$
(21)

The final equality in Eq. (21) relies on the following property of BLUP: \(cov\left(\mathbf{u},\widehat{\mathbf{u}}\right)=var\left(\widehat{\mathbf{u}}\right)\). For the MME solution method, the \(var\left(\widehat{\mathbf{u}}\right)\) matrix is computed as \(var\left(\mathbf{u}\right)-{\mathbf{C}}_{22}\), where \({\mathbf{C}}_{22}\) is from the partitioned inverse coefficient matrix (Henderson 1975). For the V inversion method, \(var\left(\widehat{\mathbf{u}}\right)=var\left(\mathbf{u}\right)\left({\mathbf{Z}}^{\mathbf{^{\prime}}}\mathbf{P}\mathbf{Z}\right)var(\mathbf{u})\) (Searle et al. 1992).
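
For a single trait, \({{\varvec{\uplambda}}}\) reduces to \(\left[ 1 \; \gamma \right]^{\prime}\), and Eq. (21) becomes a ratio of two quadratic forms. The sketch below uses made-up 2 × 2 matrices for one individual (additive and non-additive values) and the MME relation \(var\left(\widehat{\mathbf{u}}\right)=var\left(\mathbf{u}\right)-{\mathbf{C}}_{22}\); the names and numbers are assumptions for illustration.

```python
def quad(lam, M):
    """Quadratic form lam' M lam."""
    n = len(lam)
    return sum(lam[j] * M[j][k] * lam[k] for j in range(n) for k in range(n))


def reliability(var_u, c22, gamma):
    """Eq. (21) for a single trait, with var(u_hat) = var(u) - C22."""
    lam = [1.0, gamma]
    var_u_hat = [[var_u[j][k] - c22[j][k] for k in range(2)] for j in range(2)]
    return quad(lam, var_u_hat) / quad(lam, var_u)


var_u = [[1.0, 0.0], [0.0, 0.4]]     # made-up var(u_i): additive, non-additive
c22 = [[0.30, 0.02], [0.02, 0.25]]   # made-up block of the inverse coefficient matrix

r2_additive = reliability(var_u, c22, 0.0)
r2_breeding = reliability(var_u, c22, 1.0 / 3.0)   # tetraploid breeding value
r2_total = reliability(var_u, c22, 1.0)
print(r2_additive, r2_breeding, r2_total)
```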

Selection response

The breeder’s equation provides the expected response to truncation selection on predicted merit \(\widehat{\theta }\). If \(\mathbf{b}\) denotes the multi-trait vector of breeding values for an individual, then its predicted merit is \(\hat{\theta } = {\mathbf{c}}^{\prime}{\hat{\mathbf{b}}}\) (see Eq. 20), and the multi-trait response x under selection intensity i is

$${\mathbf{x}} = \left[ {i\sigma_{{\hat{\theta }}} } \right]\left[ {\frac{{cov_{n} \left( {{\mathbf{b}}, \hat{\theta }} \right)}}{{\sigma_{{\hat{\theta }}}^{2} }}} \right] = i \sigma_{{\hat{\theta }}}^{ - 1} cov_{n} \left( {{\mathbf{b}},{\hat{\mathbf{b}}}} \right){\mathbf{c}}$$
(22)

(To connect Eq. (22) with a familiar form of the breeder’s equation, the first bracketed term is the selection differential, and the second bracketed term represents heritability.) The subscript n on \(co{v}_{n}\) indicates it is the covariance with respect to the n individuals in the population, which differs slightly from the covariance of a vector with respect to its MVN distribution (see “Appendix”). As mentioned earlier, under BLUP, the latter covariance satisfies \(cov\left(\mathbf{u},\widehat{\mathbf{u}}\right)=var\left(\widehat{\mathbf{u}}\right)\). Combining this result with “Appendix” Eq. (35), it follows that \(co{v}_{n}\left(\mathbf{b},\widehat{\mathbf{b}}\right)=va{r}_{n}(\widehat{\mathbf{b}})\), which is denoted B. The formula for traits j and k is

$$\begin{gathered} B_{jk} = \overline{{{{\text{diag}}}\left( {\mathbf{L}} \right)}} - \overline{{L_{ \cdot \cdot } }} + \overline{{\mu_{j \cdot } \mu_{k \cdot } }} - \left( {\overline{{\mu_{j \cdot } }} } \right)\left( {\overline{{\mu_{k \cdot } }} } \right) \hfill \\ {\mathbf{L}} = \left[ {\begin{array}{*{20}c} {{\mathbf{I}}_{n} } & {\gamma {\mathbf{I}}_{n} } \\ \end{array} } \right]cov\left( {{\hat{\mathbf{u}}}_{j} ,{\hat{\mathbf{u}}}_{k} } \right)\left[ {\begin{array}{*{20}c} {{\mathbf{I}}_{n} } & {\gamma {\mathbf{I}}_{n} } \\ \end{array} } \right]^{\prime} \hfill \\ {{\varvec{\upmu}}}_{j} = \left[ {\begin{array}{*{20}c} {\mathbf{W}} & {\gamma {\mathbf{F}}} \\ \end{array} } \right]{{\varvec{\updelta}}}_{j} \hfill \\ \end{gathered}$$
(23)

The vector \({\widehat{\mathbf{u}}}_{j}\) is a 2n × 1 stacked vector of the predicted additive and non-additive values for a population of size n. The calculation of \(cov({\widehat{\mathbf{u}}}_{j}, {\widehat{\mathbf{u}}}_{k})\) follows the same procedure described above (see Eq. 19), and the contribution from \({\varvec{\updelta}}\) is calculated using the fixed effect estimates. Since the overall scaling of the index coefficients is arbitrary, we can impose \({\sigma }_{\widehat{\theta }}^{2}=1\). Inverting Eq. (22) under this constraint leads to an expression for the index coefficients:

$${\mathbf{c}} = i^{ - 1} {\mathbf{B}}^{ - 1} {\mathbf{x}}$$
(24)

Substituting this result into \({1=\sigma }_{\widehat{\theta }}^{2}={\mathbf{c}}^{\mathbf{^{\prime}}}\mathbf{B}\mathbf{c}\) leads to an implicit equation for the response:

$${\mathbf{x^{\prime}B}}^{ - 1} {\mathbf{x}} - i^{2} = 0$$
(25)
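
As a numerical check of Eqs. (22) and (25), the sketch below computes the response for a two-trait index and confirms that it lies on the ellipse. The matrix B, the coefficients, and the selection intensity are made-up values, and the function names are hypothetical. The sketch uses \(\sigma_{\hat{\theta }}^{2} = {\mathbf{c}}^{\prime}{\mathbf{Bc}}\).

```python
from math import sqrt


def selection_response(B, c, intensity):
    """Eq. (22): x = i * B c / sigma, with sigma^2 = c'Bc."""
    Bc = [sum(b * ck for b, ck in zip(row, c)) for row in B]
    sigma = sqrt(sum(cj * v for cj, v in zip(c, Bc)))
    return [intensity * v / sigma for v in Bc]


def quad_form_inverse_2x2(B, x):
    """x' B^{-1} x for a 2 x 2 matrix B."""
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    B_inv = [[B[1][1] / det, -B[0][1] / det],
             [-B[1][0] / det, B[0][0] / det]]
    return sum(x[j] * B_inv[j][k] * x[k] for j in range(2) for k in range(2))


B = [[1.0, -0.3], [-0.3, 0.5]]   # made-up covariance of predicted breeding values
c = [0.6, 0.8]                   # made-up index coefficients
i = 1.4                          # made-up selection intensity

x = selection_response(B, c, i)
print(x, quad_form_inverse_2x2(B, x), i**2)

# Rescaling c does not change the response, as stated in the text
x_scaled = selection_response(B, [10 * v for v in c], i)
```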

Equation (25) is the matrix representation of an ellipsoid in t dimensions, which is used by StageWise to provide a geometric visualization of selection tradeoffs. (The software DESIRE (Kinghorn 2013) is an earlier example of plotting the elliptical multi-trait response.) If the response is expressed in units of genetic standard deviation, a diagonal matrix \({\varvec{\Delta}}\) with elements \({\sigma }_{b}=\sqrt{{\sigma }_{\mathrm{A}}^{2}+{\gamma }^{2}{\sigma }_{\mathrm{D}}^{2}}\) is used to rescale the matrix of the quadratic form as \({\varvec{\Delta}}{\mathbf{B}}^{-1}{\varvec{\Delta}}\). The principal axes of the ellipse are given by the eigenvectors of this matrix, and the lengths of the semi-axes equal the inverse square-root of the eigenvalues.

This geometric model provides a convenient method for implementing a restricted selection index, in which the response for some traits is constrained to be zero (Kempthorne and Nordskog 1959). From above, the change in genetic merit associated with response x is \(\mathbf{c}\mathbf{^{\prime}}\mathbf{x}\), which is the projection of x onto c times the magnitude of c. For the unrestricted index, the response that maximizes genetic gain is therefore the solution of the following convex optimization problem:

$$\begin{gathered} \mathop {\max }\limits_{{\mathbf{x}}} {\mathbf{c^{\prime}x}} \hfill \\ {\mathbf{x^{\prime}B}}^{ - 1} {\mathbf{x}} \le 1 \hfill \\ \end{gathered}$$
(26)

The quadratic inequality constraint in Eq. (26), which is convex, replaces the quadratic equality constraint of Eq. (25), which is not convex. This substitution is valid because the linear objective ensures the optimum is on the boundary (Boyd and Vandenberghe 2004). For the restricted index, the restricted traits are not included in the objective \({\mathbf{c}}^{\mathrm{^{\prime}}}\mathbf{x}\), and equality or inequality constraints on the genetic gain \({x}_{i}\) for restricted trait i are added to Eq. (26). Convex optimization is performed using CVXR (Fu et al. 2020), and the index coefficients are computed from the optimal x via Eq. (24) with intensity i = 1.
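
For two traits, the optimization in Eq. (26) can be examined without a solver. The sketch below is not the CVXR-based implementation: it swaps in the closed-form maximizer of a linear objective over an ellipsoid, \({\mathbf{x}} = {\mathbf{Bc}}/\sqrt {{\mathbf{c}}^{\prime}{\mathbf{Bc}}}\), and checks it against a brute-force walk around the boundary. It then treats a simple restricted index in which the response of trait 2 is constrained to zero. All numerical values are made up.

```python
from math import cos, sin, sqrt, pi

B = [[1.0, -0.3], [-0.3, 0.5]]   # made-up covariance of predicted breeding values
c = [0.6, 0.8]                   # made-up index coefficients


def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Unrestricted index: closed-form maximizer of c'x subject to x'B^{-1}x <= 1
Bc = mat_vec(B, c)
x_opt = [v / sqrt(dot(c, Bc)) for v in Bc]
gain_opt = dot(c, x_opt)

# Brute-force check: boundary points x = L [cos t, sin t]', where B = L L'
l11 = sqrt(B[0][0])
l21 = B[1][0] / l11
l22 = sqrt(B[1][1] - l21**2)
steps = 3600
gain_scan = max(
    dot(c, [l11 * cos(t), l21 * cos(t) + l22 * sin(t)])
    for t in (2 * pi * k / steps for k in range(steps))
)

# Restricted index: zero response for trait 2, maximum response for trait 1
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
B_inv = [[B[1][1] / det, -B[0][1] / det], [-B[1][0] / det, B[0][0] / det]]
x_res = [sqrt(1.0 / B_inv[0][0]), 0.0]
c_res = mat_vec(B_inv, x_res)     # Eq. (24) with i = 1

print(gain_opt, gain_scan)
print(x_res, c_res, mat_vec(B, c_res))
```

The last line shows that the index built from the restricted coefficients produces no response in trait 2, at the cost of a smaller response in trait 1 than the ellipse would otherwise allow.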

Marker effects and GWAS

Marker effects and GWAS scores are also calculated by BLUP. Let \({\varvec{\upalpha}}\) represent the mt × 1 vector of additive (substitution) effects for t traits/locations nested within m markers, with variance–covariance matrix \({\mathbf{I}}_{m} \otimes {{\varvec{\Gamma}}}\left( {\phi \mathop \sum \nolimits_{k} p_{k} q_{k} } \right)^{ - 1}\) for ploidy \(\phi\) (Endelman et al. 2018). From the linearity of BLUP, the predicted multi-trait index of marker effects is \(\left( {{\mathbf{I}}_{m} \otimes {\mathbf{c}}^{\prime}} \right){\hat{\mathbf{\alpha }}}\), and from Eq. (19), \({\hat{\mathbf{\alpha }}}\) can be written in terms of the predicted additive values \({\hat{\mathbf{a}}}\) as

$$\begin{gathered} {\hat{\mathbf{\alpha }}} = cov\left( {{{\varvec{\upalpha}}},{\mathbf{y}}} \right){\mathbf{Py}} = var\left( {{\varvec{\upalpha}}} \right)\left[ {{\mathbf{W^{\prime}}} \otimes {\mathbf{I}}_{t} } \right]\left[ {{\mathbf{G}}^{ - 1} \otimes {{\varvec{\Gamma}}}^{ - 1} } \right]{\hat{\mathbf{a}}} = \frac{{\left( {{\mathbf{W^{\prime}G}}^{ - 1} \otimes {\mathbf{I}}_{t} } \right){\hat{\mathbf{a}}}}}{{\phi \mathop \sum \nolimits_{k} p_{k} q_{k} }} \hfill \\ \Rightarrow \left( {{\mathbf{I}}_{m} \otimes {\mathbf{c}^{\prime}}} \right){\hat{\mathbf{\upalpha }}} = \frac{{\left( {{\mathbf{W^{\prime}G}}^{ - 1} \otimes {\mathbf{c}}^{\prime}} \right){\hat{\mathbf{a}}}}}{{\phi \mathop \sum \nolimits_{k} p_{k} q_{k} }} \hfill \\ \end{gathered}$$
(27)

The W matrix in Eq. (27) is the centered matrix of allele dosages (individuals x markers). A similar result holds for relating the multi-trait index of digenic substitution effects \({\varvec{\upbeta}}\) to the predicted dominance values \({\hat{\mathbf{d}}}\) (Eq. 7):

$$\left( {{\mathbf{I}}_{{\text{m}}} \otimes {\mathbf{c}^{\prime}}} \right){\hat{\mathbf{\upbeta }}} = \frac{{\left( {{\mathbf{Q^{\prime}D}}^{ - 1} \otimes {\mathbf{c}}^{\prime}} \right){\hat{\mathbf{d}}}}}{{\left( {\begin{array}{*{20}c} \phi \\ 2 \\ \end{array} } \right)\mathop \sum \nolimits_{k} 4p_{k}^{2} q_{k}^{2} }}$$
(28)

The fixed effect for inbreeding is included in \(\widehat{\mathbf{d}}\) and therefore represented in the predicted marker effects.

GWAS p-values are computed from the standardized BLUPs of the marker effects, which are asymptotically standard normal (Gualdrón Duarte et al. 2014). If \({\mathbf{w}}_{k}\) denotes the kth column of the W matrix, then the standard error of the predicted additive effect for marker k is

$$\frac{{\left[ {\left( {{\mathbf{w}}_{k}^{{\prime}} {\mathbf{G}}^{ - 1} \otimes {\mathbf{c}}^{\prime}} \right)var\left( {{\hat{\mathbf{a}}}} \right)\left( {{\mathbf{G}}^{ - 1} {\mathbf{w}}_{k} \otimes {\mathbf{c}}} \right)} \right]^{1/2} }}{{\phi \mathop \sum \nolimits_{k} p_{k} q_{k} }}$$
(29)

The formula for dominance effects is analogous, based on Eq. (28). StageWise provides the option to parallelize this computation across multiple cores. To control for multiple testing, the desired significance level specified by the user is divided by the effective number of markers (Moskvina and Schmidt 2008) to set the p value discovery threshold.
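
The conversion from a standardized marker effect to a GWAS score and discovery threshold can be sketched as follows. This is an illustration rather than the StageWise code: it assumes a two-sided test against the standard normal distribution, takes the effective number of markers as a given input instead of computing it by the method of Moskvina and Schmidt (2008), and uses hypothetical function names.

```python
from math import erfc, log10, sqrt


def gwas_score(effect, std_error):
    """-log10(p) for a two-sided test of a standardized marker effect."""
    z = effect / std_error
    p_value = erfc(abs(z) / sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return -log10(p_value)


def discovery_threshold(alpha, effective_markers):
    """-log10 of the significance level divided by the effective number of markers."""
    return -log10(alpha / effective_markers)


# Made-up values: predicted effect, its standard error, and 1000 effective markers
score = gwas_score(effect=0.9, std_error=0.2)
threshold = discovery_threshold(alpha=0.05, effective_markers=1000)
print(score, threshold, score > threshold)
```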

Potato data analysis

The potato dataset is an updated version of the data from Endelman et al. (2018), which spanned 2012–2017 at one location (Hancock, WI) and contained 571 clones from both preliminary and advanced yield trials. The current version spans 2015–2020 and contains 943 clones. Fixed effects for block or trial, as well as stand count, were used in Stage 1. Three traits were analyzed: total yield (Mg ha⁻¹), vine maturity (1 [early] to 9 [late] visual scale at 100 days after planting), and potato chip fry color (Hunter L) after 6 months of storage. The G matrix was used for multi-trait analysis, instead of H, due to convergence problems with the latter.

Marker data files contain the estimated allele dosage (0–4) from genotyping with potato SNP array v2 or v3 (which contains most of v2) (Felcher et al. 2012; Vos et al. 2015). Genotype calls were made with R package fitPoly (Zych et al. 2019). Data from the two array versions were combined with the command merge_impute from R package polyBreedR (https://github.com/jendelman/polyBreedR). This command performs one iteration of the EM algorithm described in Poland et al. (2012) (only one iteration is needed for complete datasets at low and high density), followed by shift and scaling (if necessary) to ensure all data are in the interval [0, ploidy].

Results

The workflow to analyze data with StageWise is illustrated in Fig. 1. Any software can be used to compute genotype BLUEs and their variance–covariance matrix in Stage 1. For convenience, the package has a command named Stage1, which can accommodate any number of fixed or i.i.d. random covariates, as well as spatial analysis using SpATS (Rodríguez-Álvarez et al. 2018). To partition genetic value into additive and non-additive components, genome-wide marker data is processed with the command read_geno, and the output is then included in the call to Stage2. After estimating the variance components with Stage2, the blup_prep command inverts either the coefficient matrix of the mixed model equations or the variance–covariance matrix of the Stage 2 response variable, whichever is smaller. This allows for rapid, iterative use of the blup command to obtain different types of predictions and standard errors, which are used in the calculation of reliability (i.e., squared accuracy) for individuals and GWAS scores for markers. Three vignettes, or tutorials, come with the software to give detailed examples of using the commands. The following results represent a condensed version of this information.

Fig. 1 Overview of the commands and workflow in R/StageWise

The primary dataset comes from six years of potato yield trials at a single location and includes 943 genotyped clones. The genotypic values of heterozygous clones have both additive and non-additive components. Non-additive values can be modeled in StageWise either as genetic residuals (no covariance) or as dominance values. In the context of genomic prediction, directional dominance models use inbreeding coefficients to estimate heterosis. Figure 2 compares three types of inbreeding coefficients for this population: (1) FD, from the directional dominance model, (2) FG, from the diagonal elements of the additive genomic relationship matrix, and (3) FA, from the diagonal elements of the pedigree relationship matrix. The FG and FD coefficients from the genomic models were highly correlated (r = 0.98) and have the same population mean, −0.08, which indicates a slight excess of heterozygosity compared to panmictic (Hardy–Weinberg) equilibrium. Although there was some concordance between the genomic and pedigree coefficients for the most inbred individuals, there was little agreement at small values of FA (Fig. 2).

Fig. 2 Comparison of inbreeding coefficients (F) for a population of 943 potato breeding lines. The vertical axis is computed from the dominance coefficients, and the horizontal axis is computed from the additive relationship matrix

Single trait analysis

Initially, the three traits in the potato dataset (total yield, chip fry color, and vine maturity) were analyzed independently. In Stage 1, broad-sense heritability on a plot basis was highest for yield (0.70–0.83), while fry color (0.25–0.74) and maturity (0.38–0.74) had similar ranges (Figure S1, ESM1). The benefit of including Stage 1 errors in the Stage 2 model was assessed based on the change in AIC, which ranged from −29 for maturity to −104 for fry color (Table 1). Applying the summary command to the output from Stage2 generates a table with the proportion of variation explained (PVE). The PVE for additive effects, which can be called genomic heritability, ranged from 0.34 (yield) to 0.43 (maturity) (Table 2). The PVE for dominance effects has two parts: one due to the variance of the dominance effects (“Dominance” in Table 2), and the other from variation in the genomic inbreeding coefficient (“Heterosis” in Table 2). Of the three traits, yield had the largest influence of dominance, with a combined PVE of 0.15.

Table 1 Akaike Information Criterion (AIC) for the Stage 2 model with vs. without inclusion of the Stage 1 errors
Table 2 Proportion of variation explained for the multi-year potato dataset

StageWise supports genomic prediction with the H matrix, which is a weighted average of G and A that was originally developed to use ungenotyped individuals in the training population (Legarra et al. 2009; Christensen and Lund 2010). Even when all individuals are genotyped, H may still outperform G due to the sparsity of A (Fig. 3). For the potato dataset, the change in AIC with H ranged from −6 (fry color) to −13 (yield). The optimum weight for A was 0.3 for vine maturity and fry color and 0.5 for yield. As the weight for A increased, the estimate for genomic heritability (solid line in Fig. 3) also increased, at the expense of dominance (dashed line).

Fig. 3 Minimizing the Akaike Information Criterion (AIC) to select the optimal weighting of pedigree (A) and marker (G) additive relationship matrices: H = wA + (1–w)G. The optimal weight varied by trait in a potato dataset of 943 clones. The proportion of variation explained (PVE) by the additive effects (solid line) increased with w, while the PVE for the dominance effects (dashed line) decreased
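
The weighting of A and G can be sketched in a few lines of Python. The example forms H = wA + (1 − w)G for toy matrices and picks the weight with the smallest AIC from a grid. The matrices, the AIC values, and the function names are hypothetical and serve only to show the logic; in practice each AIC value would come from refitting the Stage 2 model with the corresponding H.

```python
def blend_relationship(A, G, w):
    """Return H = w*A + (1 - w)*G for square matrices given as nested lists."""
    n = len(A)
    return [[w * A[i][j] + (1.0 - w) * G[i][j] for j in range(n)]
            for i in range(n)]

def best_weight(aic_by_weight):
    """Pick the weight with the smallest AIC from a {weight: AIC} mapping."""
    return min(aic_by_weight, key=aic_by_weight.get)

# Toy 2 x 2 relationship matrices (hypothetical values)
A = [[1.0, 0.5], [0.5, 1.0]]
G = [[0.9, 0.3], [0.3, 1.1]]
H = blend_relationship(A, G, w=0.3)

# Hypothetical AIC values for a grid of weights, not results from the paper
aic = {0.0: 1012.0, 0.1: 1008.5, 0.3: 1004.2, 0.5: 1005.9, 0.7: 1010.3}
w_opt = best_weight(aic)
```

A grid search is used here because the AIC is only available at the weights for which the model has been fitted.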

The blup_prep command has an option to mask Stage 1 BLUEs, which can be used to estimate the accuracy of predicting new individuals or new environments. Figure 4 compares the reliability of genome-wide marker-assisted selection (MAS) vs. marker-based selection (MBS) for the last breeding cohort in the potato dataset. The distinction between MAS and MBS is that the selection candidates are part of the training set with MAS but not with MBS (Bernardo 2010). The reliability of MAS \((r_{A}^{2})\) was 0.14–0.21 higher than MBS \((r_{B}^{2})\) across traits. From index theory (Lande and Thompson 1990; Riedelsheimer and Melchinger 2013), the two quantities are related by

$${r}_{A}^{2}={r}_{B}^{2}+\frac{{h}^{2}{\left(1-{r}_{B}^{2}\right)}^{2}}{\left(1-{h}^{2}{r}_{B}^{2}\right)}$$
(30)

Fig. 4 Comparing the reliability (\(r^{2}\)) of marker-assisted (MAS) vs. marker-based (MBS) genomic selection in the potato dataset. Each point represents a clone from one breeding cohort, and the blue line is a linear trendline. The increased accuracy from having phenotypes for the selection candidates (MAS) was closely predicted by selection index theory (dashed line)

When used with the genomic heritability estimates from Stage2, this formula closely matched the data for all three traits (Fig. 4).
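
Equation (30) is simple to evaluate numerically. The Python sketch below implements it directly; the function name and the input values are hypothetical and are not taken from the potato dataset.

```python
def reliability_mas(r2_mbs, h2):
    """Eq. 30: MAS reliability from the MBS reliability (r2_mbs) and heritability (h2)."""
    return r2_mbs + h2 * (1.0 - r2_mbs) ** 2 / (1.0 - h2 * r2_mbs)

# Hypothetical inputs, for illustration only
r2_b = 0.30
h2 = 0.40
r2_a = reliability_mas(r2_b, h2)
gain = r2_a - r2_b   # increase in reliability from phenotyping the candidates
```

Two limiting cases provide a quick check of the formula: with \(h^{2}=0\) the phenotype adds no information and \(r_{A}^{2}=r_{B}^{2}\), while with \(h^{2}=1\) the phenotype equals the breeding value and \(r_{A}^{2}=1\).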

Although GWAS is not the emphasis of StageWise, the software can perform a fully efficient, two-stage GWAS. For the potato dataset, there was a major QTL for vine maturity on chr05 (Figure S2, ESM1), in the vicinity of the well-known regulator of potato maturity StCDF1 (Kloosterman et al. 2013). Stage2 has an optional argument to include markers as fixed effects for major QTL. In this case, the PVE for the marker was 0.10, which represents 21% of the total additive variance.

Multi-trait analysis

Multi-trait analysis follows the same general workflow as a single trait. In addition to the PVE, the summary command returns the additive correlation matrix for the traits. For the potato dataset, late maturity was correlated with higher yield (r = 0.57) and slightly with lighter fry color (r = 0.23). There was no genetic correlation (r = 0.00) between yield and fry color.

The “index.coeff” argument for blup is used to specify the selection index coefficients, which determine the relative weights of the traits (after standardization to unit variance) for genetic merit. (Because StageWise uses a multi-trait BLUP, the optimal index coefficients equal the coefficients of genetic merit.) For the potato chip market, it is reasonable to give equal weight to yield and fry color. However, naïve selection on these traits alone will generate offspring with later maturity, which is undesirable. One way to avoid this is by using vine maturity as a covariate in the analysis.

Alternatively, the gain command in StageWise can be used to compute the coefficients of a restricted selection index, in which the response for some traits is constrained to be zero (Kempthorne and Nordskog 1959). For a given selection intensity and t traits, the set of all possible responses is a t-dimensional ellipsoid, and gain shows 2D slices of it. Figure 5 shows the breeding value response for yield and maturity, as well as two line segments. The dashed red line is the projection of the index vector, and the solid blue line is the projection of the optimal response. The restricted index requires a negative weight for maturity to produce zero response, which reduces the yield response compared to the unrestricted index by \(0.23i\sigma\) (\(i\) is selection intensity and \(\sigma\) is the genetic standard deviation of the breeding values; Table 3).

Fig. 5 Selection response tradeoffs in the potato dataset for three traits: yield, maturity, and fry color. The response surface is three-dimensional, but only the yield-maturity plane is shown to highlight the tradeoff between these two traits. The dashed red line segment is the projection of the index vector, and the solid blue line segment is the projection of the optimal response (color figure online)

Table 3 Multi-trait response for potato under truncation selection, assuming yield and fry color contribute equally to genetic merit
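
The logic of a restricted index with a single equality constraint can be sketched as follows. The example assumes that the response per unit of selection intensity is \(\mathbf{Bc}/\sqrt{\mathbf{c}^{\prime}\mathbf{Bc}}\) for index coefficients \(\mathbf{c}\) and a variance–covariance matrix \(\mathbf{B}\) of the standardized predicted breeding values. As a further simplification, \(\mathbf{B}\) is set equal to the additive correlation matrix reported above, which amounts to assuming perfectly reliable predictions. The function names are hypothetical, this is not the algorithm used by the gain command, and the output is not expected to reproduce Table 3.

```python
import math

def mat_vec(B, c):
    """Matrix-vector product for nested lists."""
    return [sum(B[i][j] * c[j] for j in range(len(c))) for i in range(len(B))]

def response(B, c):
    """Response per unit of selection intensity: B c / sqrt(c' B c)."""
    Bc = mat_vec(B, c)
    scale = math.sqrt(sum(ci * x for ci, x in zip(c, Bc)))
    return [x / scale for x in Bc]

def restricted_index(B, a, j):
    """Coefficients maximizing merit a'(response) subject to zero response for trait j.

    With one equality constraint the solution is c = a - lambda * e_j,
    where lambda = (B a)_j / B_jj.
    """
    Ba = mat_vec(B, a)
    c = list(a)
    c[j] -= Ba[j] / B[j][j]
    return c

# Trait order: yield, maturity, fry color (additive correlations from the text)
B = [[1.00, 0.57, 0.00],
     [0.57, 1.00, 0.23],
     [0.00, 0.23, 1.00]]
a = [1.0, 0.0, 1.0]          # equal merit weights for yield and fry color

r_unres = response(B, a)     # unrestricted index: c = a
c_res = restricted_index(B, a, j=1)
r_res = response(B, c_res)   # maturity response is zero by construction
```

Under these assumptions the restricted index has a negative coefficient for maturity and a smaller yield response than the unrestricted index, which is the qualitative pattern described for Fig. 5.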

Discussion

StageWise was designed to enhance the use of genomic prediction in plant breeding, but there are some limitations. At present, each phenotype is associated with a single genotype identifier, which is inadequate for hybrid prediction. The options for modeling GxE are somewhat limited, particularly for multi-trait models, which assume a uniform genetic correlation between environments. For single trait analysis, a more complex GxE model is possible to allow for heterogeneous genetic correlation between locations. The genetic covariance between locations is based on a second-order factor-analytic (FA2) model (Smith et al. 2001), which offers enough statistical complexity for many applications. To assess model adequacy, the factor loadings returned by Stage2 can be visualized with the command uniplot, which generates a circular plot in which the squared radius for each location equals the proportion of genetic variance explained by the latent factors (Cullis et al. 2010). This functionality is illustrated in Vignette 2 using national trial data for potato (Schmitz Carley et al. 2019). At present, StageWise does not have functionality for genomic prediction with environmental covariates.

This is the first study to formulate and apply a model for directional dominance in polyploids. Although heterosis explained less than 5% of the variance (PVE) for yield, we should expect small PVE when there is limited variation for inbreeding. The standard deviation of FD was only 0.03 for the population of 943 potato clones (Fig. 2).

From the theory of directional dominance, the average dominance coefficient is the covariate for estimating heterosis. Xiang et al. (2016) used average heterozygosity for the covariate because under a genotypic parameterization of dominance in diploids, this is equivalent to the average dominance coefficient. However, studies employing orthogonal parameterizations of dominance have also used this covariate (Aliloo et al. 2017; Yadav et al. 2021), even though heterozygosity is no longer equivalent to the dominance coefficient because the relative contribution of the genotypes to inbreeding depends on allele frequency (see Eq. 5). For example, the minor allele homozygote contributes more to inbreeding than the major allele homozygote, and the difference is \(\phi \left(\phi -1\right)\left(q-p\right)\) for ploidy \(\phi\) and minor allele frequency \(p=1-q\) at panmictic equilibrium. To give another example, simplex dosage of the minor allele in a tetraploid contributes more to inbreeding than duplex dosage only for p > 1/3; for p < 1/3, duplex dosage contributes more.
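
The two examples in the preceding paragraph can be checked numerically. The Python sketch below assumes that the dominance coefficient for dosage \(x\) of an allele with frequency \(p\) has the digenic form \(-\phi(\phi-1)p^{2}+2(\phi-1)px-x(x-1)\), and that a genotype's contribution to inbreeding is the negative of this coefficient, with scaling constants omitted. This form is an assumption chosen to be consistent with the statements above; Eq. (5) remains the authoritative definition. The function names and the allele frequencies are hypothetical.

```python
def dominance_coefficient(x, p, ploidy):
    """Assumed digenic dominance coefficient for dosage x of an allele with frequency p."""
    return (-ploidy * (ploidy - 1) * p ** 2
            + 2 * (ploidy - 1) * p * x
            - x * (x - 1))

def inbreeding_contribution(x, p, ploidy):
    """Contribution to inbreeding, taken as the negative dominance coefficient (unscaled)."""
    return -dominance_coefficient(x, p, ploidy)

def simplex_minus_duplex(p):
    """Difference in inbreeding contribution, simplex minus duplex, in a tetraploid."""
    return inbreeding_contribution(1, p, 4) - inbreeding_contribution(2, p, 4)

ploidy = 4
p = 0.2                      # hypothetical minor allele frequency
q = 1.0 - p

# Minor allele homozygote (dosage = ploidy) minus major allele homozygote (dosage = 0)
homozygote_gap = (inbreeding_contribution(ploidy, p, ploidy)
                  - inbreeding_contribution(0, p, ploidy))
expected_gap = ploidy * (ploidy - 1) * (q - p)

diff_above = simplex_minus_duplex(0.40)   # p > 1/3
diff_below = simplex_minus_duplex(0.20)   # p < 1/3
```

Under the assumed form, the simplex minus duplex difference reduces to \(6p-2\) for a tetraploid, which changes sign at \(p=1/3\).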

A more general approach to restricted selection indices was developed in StageWise by investigating the geometry of the problem (Eq. 26). Until now, only equality constraints have been included (i.e., specifying a certain value for genetic gain), which are amenable to solution by the method of Lagrange multipliers. StageWise uses convex optimization software to allow for both equality and inequality constraints. In many situations, inequality constraints are more appropriate than equality constraints. For example, when selecting for yield, we might accept earlier but not later maturity, which is represented by response \(\le 0\). With only one constrained trait, the optimal solution corresponds to zero response, so the inequality offers no advantage. But with two or more constraints, higher genetic gains are possible with inequalities (ESM2).

The “mask” argument for blup_prep makes it easy to investigate the potential benefit of using a correlated, secondary trait to improve genomic selection. Many plant breeding programs are exploring the use of spectral measurements from high-throughput phenotyping platforms to improve selection for yield. For example, Rutkoski et al. (2016) demonstrated that aerial measurements of canopy temperature during grain fill could be used to predict wheat grain yield. Vignette 3 shows how to recreate this result with StageWise.

Typically, the number of traits a breeder must consider for selection is too large to analyze jointly in StageWise based on the current implementation with ASReml-R. New algorithms may alleviate this limitation in the future (Runcie et al. 2021), but in the meantime, a practical approach is to split the traits into groups for multivariate analysis based on phenotypic correlations. In the final step, multiple outputs from blup_prep can be combined in one call to blup, using an index that covers all traits (example in Vignette 3).

We should acknowledge that truncation selection on breeding value is not optimal for long-term genetic gain. The design of selection methods that conserve and exploit genetic diversity more efficiently is an exciting area of research (e.g., Toro and Varona 2010; Akdemir and Sánchez 2016; Goiffon et al. 2017). Although such methods are not currently available in StageWise, the additive and dominance marker effects returned by the software can be used to implement them.