As discussed in the “Introduction”, the analysis of NVT MET data is accomplished via a two-stage approach. The approach for individual trial analysis (the first stage), including the calculation of weights, is documented in Smith et al. (2001a). We assume that the (second stage) data relate to \(t\) trials and a total of \(m\) varieties and let \(\mathbf {y}\) denote the \(n \times 1\) combined vector of variety means from the analyses of individual trials. Typically the data are unbalanced, since not all varieties are grown in all trials, so that \(n \ll mt\). The second stage mixed model can be written as
$$\begin{aligned} \mathbf {y} = \mathbf {X}\varvec{\tau } + \mathbf {Z}\mathbf {u} + \mathbf {Z_p}\mathbf {u_p} + \varvec{\eta } \end{aligned}$$
(1)
where \(\varvec{\tau }\) is a vector of fixed effects with associated design matrix \(\mathbf {X}\); \(\mathbf {u}\) is the \(mt \times 1\) vector of random variety effects for each environment (ordered as varieties within environments) and has associated design matrix \(\mathbf {Z}\); \(\mathbf {u_p}\) is a vector of random non-genetic (peripheral) effects with associated design matrix \(\mathbf {Z_p}\) and \(\varvec{\eta }\) is a vector of effects that accounts for the fact that the data comprise estimates and are therefore subject to uncertainty. Note that typically \(\varvec{\tau }\) is simply the \(t \times 1\) vector of trial means and \(\mathbf {u_p}\) is omitted.
We assume that \(\mathbf {u}\), \(\mathbf {u_p}\) and \(\varvec{\eta }\) are mutually independent, and distributed as multivariate Gaussian, with zero means. The variance matrix for \(\mathbf {u_p}\) is given by \(\mathbf {G_p} = \oplus _{k=1}^b \sigma ^2_{p_k} \mathbf {I}_{q_k}\) where \(b\) is the number of components in \(\mathbf {u_p}\) and \(q_k\) is the number of effects in (length of) \(\mathbf {u_p}_k\). The variance matrix for \(\varvec{\eta }\) is assumed known from the first stage and is given by \(\varvec{\Sigma } = \oplus _{j=1}^t \varvec{\Pi }^{-1}_j\) where \(\varvec{\Pi }_j\) is a diagonal matrix with elements given by the weights for trial \(j\). We assume that the variance matrix of the variety effects is given by
$$\begin{aligned} \mathrm{var}\left( \mathbf {u}\right) = \mathbf {G}_e \otimes \mathbf {I}_m \end{aligned}$$
(2)
where \(\mathbf {G_e}\) is a \(t \times t\) symmetric positive (semi)-definite matrix that is often referred to as the between environment genetic variance matrix.
It is of interest to consider several forms for \(\mathbf {G_e}\). The first is a diagonal form, namely \(\mathbf {G_e} = \oplus _{j=1}^t \sigma ^2_{g_j}\) where \(\sigma ^2_{g_j}\) is the genetic variance for environment \(j\). In this variance structure the variety effects are assumed independent between environments so there is an analogy with the separate analyses of individual trials. The simplest model that accommodates correlations between variety effects in different environments is the compound symmetric form which arises by assuming a model for \(\mathbf {u}\), namely
$$\begin{aligned} \mathbf {u} = \mathbf {Z_g} \mathbf {u_g} + \mathbf {u_{ge}} \end{aligned}$$
(3)
where \(\mathbf {u_g}\) is the \(m \times 1\) vector of variety main effects with variance matrix \(\sigma ^2_g \mathbf {I}_m\) and \(\mathbf {u_{ge}}\) is the \(mt \times 1\) vector of variety by environment interaction effects with variance matrix \(\sigma ^2_{ge} \mathbf {I}_{mt}\). The design matrix for the main effects is given by \(\mathbf {Z_g} = \left( \mathbf {1}_t \otimes \mathbf {I}_m\right)\) so that \(\mathbf {G_e} = \sigma ^2_g\mathbf {J}_t + \sigma ^2_{ge}\mathbf {I}_t\), where \(\mathbf {J}_t\) is a \(t \times t\) matrix in which all elements are unity.
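The step from Eq. (3) to the compound symmetric form of \(\mathbf {G_e}\) can be checked numerically. The following is a minimal sketch in Python with NumPy (not the software used in this paper); the dimensions and the values of the variance components are hypothetical and chosen purely for illustration.

```python
import numpy as np

t, m = 3, 4                      # hypothetical numbers of environments and varieties
sigma2_g, sigma2_ge = 0.8, 0.3   # hypothetical variance components

# Design matrix for the variety main effects: Z_g = 1_t (x) I_m
Z_g = np.kron(np.ones((t, 1)), np.eye(m))

# var(u) implied by u = Z_g u_g + u_ge with independent terms
var_u = sigma2_g * Z_g @ Z_g.T + sigma2_ge * np.eye(m * t)

# Compound symmetric between environment matrix: sigma2_g J_t + sigma2_ge I_t
G_e = sigma2_g * np.ones((t, t)) + sigma2_ge * np.eye(t)

# The two routes agree: var(u) = G_e (x) I_m, as in Eq. (2)
same = np.allclose(var_u, np.kron(G_e, np.eye(m)))
print(same)
```

The diagonal elements of `G_e` are all \(\sigma ^2_g + \sigma ^2_{ge}\) and the off-diagonal elements are all \(\sigma ^2_g\), which makes the restrictiveness of the model explicit.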
The model in Eq. (3) is a variance component model, since all random terms have variance matrices that are scaled identity matrices. It is a very restrictive model since it leads to the assumption that the genetic variance is the same for all environments, and is given by \(\sigma ^2_g + \sigma ^2_{ge}\), and the genetic covariance for all pairs of environments is \(\sigma ^2_g\). Often, more general variance component models are used in which the variety by environment interaction effects are partitioned further, for example into variety by year, variety by region, variety by year by region and residual variety by environment effects. Such a model was used by Smith et al. (2001a). Even with this partitioning the resultant form for \(\mathbf {G_e}\) is over-simplified and rarely provides a good fit to the data.
The most general model for \(\mathbf {G_e}\) is the unstructured form that contains \(p = t(t+1)/2\) parameters to be estimated, namely a genetic variance for each environment and a covariance for each pair of environments. Clearly, as the number of trials increases, the number of parameters becomes prohibitively large, which compromises both the ability to fit the model and the reliability of the variance parameter estimates. The unstructured model is therefore rarely used for the analysis of MET data.
In the context of one-stage analyses of plant breeding MET data, we have found that the FA variance model (Smith et al. 2001b) provides a good approximation to the unstructured form (Kelly et al. 2007) and is both parsimonious and illuminating. The aim of the FA model as applied to the variety effects in different environments is to account for the genetic covariances between environments in terms of a small number of hypothetical factors. The number of factors is called the order of the model and we let FAk denote an FA model of order \(k\). The FAk model for the effect of variety \(i\) in environment \(j\) can be written as
$$\begin{aligned} u_{ij} = \lambda _{1j}f_{1i} + \lambda _{2j}f_{2i} + \cdots + \lambda _{kj}f_{ki} + \delta _{ij} \end{aligned}$$
(4)
where \(f_{ri}\) is the value (also called a score) of the \(r\)th hypothetical factor (\(r = 1, \ldots , k\)) for variety \(i\) and \(\lambda _{rj}\) is the coefficient (also called a loading) for environment \(j\). The factors are usually assumed to be independent with unit variance so that \(\mathrm{var}\left( f_{ri}\right) = 1\). The model can also be viewed as a multiple regression of the variety effects for an environment on a set of environmental covariates (loadings) with a separate slope (score) for each variety (also see Burgueno et al. 2008). The feature which distinguishes the FA model from an ordinary regression is that not only are the slopes estimated from the data, but also the covariates. The final term \(\delta _{ij}\) represents the lack of fit of the regression so will be termed a genetic regression residual. The model in Eq. (4) can be written in vector notation as
$$\begin{aligned} \mathbf {u} = \left( \varvec{\Lambda } \otimes \mathbf {I}_m \right) \mathbf {f} + \varvec{\delta } \end{aligned}$$
(5)
where \(\varvec{\Lambda }\) is the \(t \times k\) matrix of loadings, \(\mathbf {f}\) is the \(mk \times 1\) vector of scores and \(\varvec{\delta }\) is the \(mt \times 1\) vector of genetic regression residuals. The vectors of random effects \(\mathbf {f}\) and \(\varvec{\delta }\) are assumed to be mutually independent and distributed as multivariate Gaussian with zero means. The variance matrices are assumed to be \(\mathrm{var}\left( \mathbf {f}\right) = \mathbf {I}_{mk}\) and \(\mathrm{var}\left( \varvec{\delta }\right) = \varvec{\psi } \otimes \mathbf {I}_m\) where \(\varvec{\psi }\) is a \(t \times t\) diagonal matrix with a variance (called a specific variance) for each environment. Finally, these assumptions lead to a variance for \(\mathbf {u}\) given by
$$\begin{aligned} \mathrm{var}\left( \mathbf {u}\right) = \left( \varvec{\Lambda }\varvec{\Lambda }^{\!\scriptscriptstyle \top }+ \varvec{\psi }\right) \otimes \mathbf {I}_m \end{aligned}$$
(6)
so that the between environment genetic variance matrix is \(\mathbf {G_e} = \left( \varvec{\Lambda }\varvec{\Lambda }^{\!\scriptscriptstyle \top }+ \varvec{\psi }\right)\).
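The derivation of Eq. (6) from Eq. (5) can be verified with a small numerical example. The sketch below uses Python with NumPy; the loadings, specific variances and dimensions are hypothetical values, not estimates from any data set.

```python
import numpy as np

t, m, k = 4, 3, 2                       # hypothetical dimensions
Lam = np.array([[0.9,  0.0],
                [0.7,  0.3],
                [0.5, -0.4],
                [0.8,  0.2]])           # t x k loadings (hypothetical)
psi = np.diag([0.10, 0.20, 0.15, 0.05]) # t x t diagonal specific variances (hypothetical)
I_m = np.eye(m)

# Eq. (5): u = (Lam (x) I_m) f + delta, with var(f) = I_mk and var(delta) = psi (x) I_m
A = np.kron(Lam, I_m)                   # mt x mk
var_u = A @ np.eye(m * k) @ A.T + np.kron(psi, I_m)

# Eq. (6): var(u) = (Lam Lam' + psi) (x) I_m
G_e = Lam @ Lam.T + psi
agrees = np.allclose(var_u, np.kron(G_e, I_m))
print(agrees)
```

Each diagonal element of `G_e` is the sum of squared loadings for that environment plus its specific variance, while each off-diagonal element depends on the loadings alone.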
In this paper we propose the use of FA models for variety by environment effects in two-stage analyses of crop variety evaluation data.
FA model fitting and tools for interpretation
All models in this paper were fitted using the ASReml-R package (Butler et al. 2009) within R (R Core Team 2013). The variance parameters in the mixed model of Eq. (1) are estimated using residual maximum likelihood (REML). In terms of the FA model, the variance parameters are the loadings and specific variances and the REML estimates of these will be denoted by \(\hat{\lambda }_{rj}\) and \(\hat{\psi }_j\) (\(r = 1, \ldots , k; j = 1, \ldots , t\)). Note that when \(k>1\), the loading matrix \(\varvec{\Lambda }\) is not unique so that estimation necessitates the imposition of constraints. The algorithm in ASReml-R (Butler et al. 2009) fixes all \(k(k-1)/2\) elements in the upper triangle of \(\varvec{\Lambda }\) to zero. Once an estimate of \(\varvec{\Lambda }\) has been obtained, the matrix may be rotated as desired for interpretative purposes (see below).
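The parsimony of the FA model relative to the unstructured form can be illustrated by counting variance parameters. The sketch below (plain Python) assumes that an FAk model has \(tk\) loadings and \(t\) specific variances, less the \(k(k-1)/2\) loadings fixed to zero; the function names and the choice of \(t = 50\) are hypothetical.

```python
def n_params_unstructured(t):
    """Variances plus covariances in an unstructured t x t matrix."""
    return t * (t + 1) // 2


def n_params_fa(t, k):
    """Loadings plus specific variances, less the constrained loadings."""
    return t * k + t - k * (k - 1) // 2


t = 50  # hypothetical number of environments
counts = {k: n_params_fa(t, k) for k in (1, 2, 3, 4)}
print(n_params_unstructured(t), counts)
```

Under these assumptions the FA count grows approximately linearly in \(t\) for fixed \(k\), whereas the unstructured count grows quadratically.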
Given estimates of all the variance parameters, we obtain empirical best linear unbiased estimates of the fixed effects and empirical best linear unbiased predictions (EBLUPs) of the random effects. In terms of the FA model we denote the EBLUPs of the factor scores and genetic regression residuals by \(\tilde{f}_{ri}\) and \(\tilde{\delta }_{ij}\) (\(r = 1, \ldots , k; i = 1, \ldots , m; j = 1, \ldots , t\)).
The model fitting process commences with the fitting of an FA1 model, then proceeds to higher order models as necessary. An appropriate order may be determined using likelihood based measures that compare sequences of FA models. Since such models are nested, residual maximum likelihood ratio tests (REMLRT) can be used, but so too can information criteria such as the Akaike and Bayesian information criteria (AIC and BIC, respectively). In our experience the application of REMLRT and AIC tends to lead to the selection of very high order models that are unnecessarily complicated. In contrast, the application of BIC, which emphasises parsimony, leads to the choice of models that may underfit. A superior approach for the selection of an appropriate order may involve the comparison between an FAk model and the unstructured model, but since the latter typically cannot be fitted, an alternative type of test statistic would be required. This is the subject of current research. In the absence of such a test we choose to use a pragmatic approach based on a goodness-of-fit measure similar to that used for a standard multiple regression. We therefore compute the percentage of genetic variance accounted for by the \(k\) factors, both for individual environments (denoted \(v_j\)) and overall (denoted \(\bar{v}\)):
$$\begin{aligned} v_{j}&=\left. 100 \sum _{r=1}^k \hat{\lambda }^2_{rj} \right/\left( \sum _{r=1}^k \hat{\lambda }^2_{rj} + \hat{\psi }_j \right) \\ \bar{v}&=\left. 100\mathrm{tr}\left( \hat{\varvec{\Lambda }}\hat{\varvec{\Lambda }}^{\!\scriptscriptstyle \top }\right) \right/\mathrm{tr}\left( \hat{\varvec{\Lambda }}\hat{\varvec{\Lambda }}^{\!\scriptscriptstyle \top }+ \hat{\varvec{\psi }}\right) \end{aligned}$$
where the operator “\(\mathrm{tr}\left( \right)\)” computes the trace of the matrix argument. The order of FA model may then be chosen on the basis of both the overall percentage accounted for and the distribution of individual environment values, since it is desirable for the chosen model to have few environments with low values and many environments with high values.
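The two percentages defined above are simple functions of the estimated loadings and specific variances. The following is a minimal sketch in Python with NumPy; the function name and the numerical values are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np


def pct_variance_explained(Lam_hat, psi_hat):
    """Return (v_j, v_bar) for a t x k loading matrix and a length-t
    vector of specific variances."""
    common = np.sum(Lam_hat ** 2, axis=1)   # diagonal of Lam_hat Lam_hat'
    v_j = 100 * common / (common + psi_hat)
    # tr(Lam Lam') = sum of all squared loadings; tr(psi) = sum of specific variances
    v_bar = 100 * common.sum() / (common.sum() + psi_hat.sum())
    return v_j, v_bar


Lam_hat = np.array([[0.9,  0.0],
                    [0.7,  0.3],
                    [0.5, -0.4],
                    [0.8,  0.2]])            # hypothetical estimates
psi_hat = np.array([0.10, 0.20, 0.15, 0.05]) # hypothetical estimates
v_j, v_bar = pct_variance_explained(Lam_hat, psi_hat)
print(np.round(v_j, 1), round(v_bar, 1))
```

Inspecting the whole vector `v_j`, rather than `v_bar` alone, shows whether a satisfactory overall value conceals individual environments that are poorly explained by the factors.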
The fitting of an FA model provides the REML estimate of the between environment genetic variance matrix as \(\hat{\mathbf {G}}_{\mathbf {e}} = \left( \hat{\varvec{\Lambda }}\hat{\varvec{\Lambda }}^{\!\scriptscriptstyle \top }+ \hat{\varvec{\psi }}\right)\). This can be converted to a correlation matrix, \(\hat{\mathbf {C}}_{\mathbf {e}} = \hat{\mathbf {D}}_{\mathbf {e}}\hat{\mathbf {G}}_{\mathbf {e}}\hat{\mathbf {D}}_{\mathbf {e}}\), where \(\hat{\mathbf {D}}_{\mathbf {e}}\) is a diagonal matrix with elements given by the inverse of the square roots of the diagonal elements of \(\hat{\mathbf {G}}_{\mathbf {e}}\). Investigation of this matrix will reveal variety by environment interaction in the sense of pairs of environments that have low, or possibly even negative, estimated genetic correlations. In such cases the rankings of the varieties will differ substantially between the environments and this is likely to be important information for growers. The matrix \(\hat{\mathbf {C}}_{\mathbf {e}}\) has dimension \(t \times t\), so for large values of \(t\) we choose to display \(\hat{\mathbf {C}}_{\mathbf {e}}\) graphically, using a heatmap in R (R Core Team 2013), re-ordering the rows and columns to aid with visualisation. In this paper we have chosen to order on the basis of the dendrogram obtained using the agnes function (an agglomerative hierarchical clustering algorithm) from the cluster package in R (R Core Team 2013) with \(\mathbf {J}_t - \hat{\mathbf {C}}_{\mathbf {e}}\) as the dissimilarity matrix, so that the dissimilarity between a pair of environments is one minus their estimated genetic correlation. In this way, environments that are highly correlated (so exhibit little cross-over interaction) are located close together on the heatmap, whereas less well correlated environments will be further apart.
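The conversion of the estimated variance matrix to a correlation matrix can be sketched as follows in Python with NumPy. The function name and the numerical values are hypothetical, and the dissimilarity is formed as one minus the correlation for each pair of environments, which is our assumption about how the elementwise dissimilarity is constructed.

```python
import numpy as np


def cov_to_cor(G):
    """Scale a variance matrix to a correlation matrix, C = D G D,
    where D is diagonal with elements 1 / sqrt(diag(G))."""
    d = 1.0 / np.sqrt(np.diag(G))
    return G * np.outer(d, d)


Lam_hat = np.array([[0.9,  0.0],
                    [0.7,  0.3],
                    [0.5, -0.4],
                    [0.8,  0.2]])             # hypothetical estimates
psi_hat = np.diag([0.10, 0.20, 0.15, 0.05])   # hypothetical estimates

G_hat = Lam_hat @ Lam_hat.T + psi_hat
C_hat = cov_to_cor(G_hat)

# Dissimilarity: zero on the diagonal, small for highly correlated pairs
dissim = 1.0 - C_hat
print(np.round(C_hat, 2))
```

The matrix `dissim` could then be supplied to any agglomerative hierarchical clustering routine, with the resulting leaf order used to permute the rows and columns of `C_hat` before plotting.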
In terms of variety predictions from the FA model, we can compute the EBLUP of the effect of variety \(i\) in environment \(j\) as
$$\begin{aligned} \tilde{u}_{ij}&= \hat{\lambda }_{1j}\tilde{f}_{1i} + \hat{\lambda }_{2j}\tilde{f}_{2i} + \cdots + \hat{\lambda }_{kj}\tilde{f}_{ki} + \tilde{\delta }_{ij} \nonumber \\&= \tilde{\beta }_{ij} + \tilde{\delta }_{ij} \end{aligned}$$
(7)
where \(\tilde{\beta }_{ij}\) is the predicted regression component. The regression component is based purely on the underlying factors so represents the variety by environment variation that has repeatability in terms of the data under study and with reference to the FA model fitted. In contrast, the genetic regression residuals represent non-repeatable variety effects, that is, effects which are specific to individual environments, given the model and set of environments. In terms of variety information for growers we therefore choose to use the predicted regression component \(\tilde{\beta }_{ij}\) rather than the full predicted effect \(\tilde{u}_{ij}\) (see Cullis et al. 2010 for a full discussion). This has two important consequences. The first is that we obtain compatible predictions of variety effects for every environment, irrespective of whether the variety was grown in the environment. The second is that we must be wary of variety predictions for those environments where the percentage of variance accounted for by the regression is low.
The regression form of the variety predictions from an FA model allows investigation of variety stability in terms of responses to changes in environment, for those environments observed in the data. Each factor score \(\tilde{f}_{ri}\), for \(r = 1, \ldots , k\), in Eq. (7) reflects the response of variety \(i\) to the corresponding environmental covariate (loading). If these are to be interpreted individually as stabilities, and if \(k>1\), it is usually most meaningful to rotate the estimated loadings (which have been constrained for estimation) to a principal component solution (Cullis et al. 2010). In this case the first rotated factor accounts for the maximum amount of genetic covariance in the data, the second accounts for the next largest amount and is orthogonal to the first, and so on. We denote the rotated estimated loadings and scores by \(\hat{\lambda }^*_{rj}\) and \(\tilde{f}^*_{ri}\) so that \(\tilde{\beta }_{ij}\) from Eq. (7) can now be written as \(\sum _{r=1}^k \hat{\lambda }^*_{rj}\tilde{f}^*_{ri}\). The multiple regression in terms of the rotated factors can then be displayed graphically, for an individual variety, using so-called latent regression plots, which are similar to added variable plots with the advantage that there is a natural ordering of the variables. We may therefore construct \(k\) plots for variety \(i\), with the \(y\)- and \(x\)-axes for the first plot corresponding to \(\tilde{\beta }_{ij}\) and \(\hat{\lambda }^*_{1j}\) respectively. The points on this plot are located about a line that has slope given by \(\tilde{f}^*_{1i}\) so we add this line to the plot. Subsequent plots adjust the \(y\)- and \(x\)-axes for preceding factors. Thus the \(y\)-axis for plot \(s\) (\(s =2, \ldots , k\)) corresponds to \(\tilde{\beta }_{ij} - \sum _{r=1}^{s-1} \hat{\lambda }^*_{rj}\tilde{f}^*_{ri}\) and the \(x\)-axis to \(\hat{\lambda }^*_{sj}\). The line drawn on plot \(s\) (\(s =2, \ldots , k\)) for variety \(i\) has slope given by \(\tilde{f}^*_{si}\).
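The rotation and the construction of the latent regression plots can be sketched in Python with NumPy. We assume here that the principal component solution is obtained from the singular value decomposition of the estimated loading matrix, with the same orthogonal matrix applied to the scores. The simulated loadings and scores merely stand in for REML estimates and EBLUPs, and the function name `latent_regression_coords` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
t, m, k = 5, 4, 2                       # hypothetical dimensions
Lam_hat = rng.normal(size=(t, k))       # stand-in for estimated loadings (t x k)
Lam_hat[0, 1] = 0.0                     # mimic the upper-triangle constraint
F_tilde = rng.normal(size=(m, k))       # stand-in for score EBLUPs (varieties x factors)

# Assumed rotation: Lam_hat = U diag(d) V', so Lam_star = Lam_hat V = U diag(d)
U, d, Vt = np.linalg.svd(Lam_hat, full_matrices=False)
V = Vt.T
Lam_star = Lam_hat @ V                  # columns orthogonal, ordered by decreasing d
F_star = F_tilde @ V                    # scores rotated by the same orthogonal matrix

# The regression component beta[i, j] is unchanged by the rotation
beta = F_tilde @ Lam_hat.T
beta_star = F_star @ Lam_star.T


def latent_regression_coords(i, s):
    """x, y and slope for plot s (1-based) of variety i (0-based index)."""
    adjustment = Lam_star[:, :s - 1] @ F_star[i, :s - 1]  # preceding factors
    y = beta[i] - adjustment
    x = Lam_star[:, s - 1]
    return x, y, F_star[i, s - 1]


x1, y1, slope1 = latent_regression_coords(0, 1)
xk, yk, slopek = latent_regression_coords(0, k)
```

In the final plot (\(s = k\)) the points lie exactly on the fitted line, since all other factors have been removed from the \(y\)-axis; in earlier plots the scatter about the line reflects the contribution of the remaining factors.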
Finally we propose that the variety predictions be accompanied by a measure of accuracy. Any such measure will be based on the prediction error variance (PEV) matrix, which, for the complete vector of predictions, \(\tilde{\varvec{\beta }}\), is given by
$$\begin{aligned} \mathbf {V}_{\beta } = \mathrm{var}\left( \tilde{\varvec{\beta }} - \varvec{\beta }\right) = \left( \hat{\varvec{\Lambda }}^{*} \otimes \mathbf {I}_m \right) \mathbf {V}_{f^{*}} \left( \hat{\varvec{\Lambda }}^{*{\!\scriptscriptstyle \top }} \otimes \mathbf {I}_m \right) \end{aligned}$$
(8)
where \(\mathbf {V}_{f^*} = \mathrm{var}\left( \tilde{\mathbf {f}}^{*} - \mathbf {f}^{*}\right)\) is the PEV matrix for the rotated scores. Note that we could equally have used the PEV matrix for the unrotated scores, which could be obtained directly from the fit of the mixed model, but the accuracy of the rotated scores themselves is of interest given their interpretation as indicators of varietal stability. The computation of the PEV matrix for the rotated scores requires an additional iteration of model fitting in ASReml-R in which the rotated REML estimates of the loadings are incorporated. The EBLUPs of the variety scores from this fit of the model are then on the rotated scale. Details are available from the authors on request. Finally, we note that the formulation of the PEV matrix in Eq. (8) ignores any uncertainty in the estimation of the variance parameters, \(\hat{\varvec{\Lambda }}^*\).
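The structure of Eq. (8) can be illustrated with a small numerical sketch in Python with NumPy. The PEV matrix for the unrotated scores is simulated here as an arbitrary positive definite matrix, and the rotated PEV matrix is obtained by a linear transformation that treats the estimated loadings as fixed. This is an assumption made for illustration and is not the refitting procedure in ASReml-R described above.

```python
import numpy as np

rng = np.random.default_rng(7)
t, m, k = 4, 3, 2                          # hypothetical dimensions
I_m = np.eye(m)
Lam_hat = rng.normal(size=(t, k))          # stand-in for estimated loadings

B = rng.normal(size=(m * k, m * k))
V_f = B @ B.T + np.eye(m * k)              # hypothetical PEV of unrotated scores

# Assumed rotation from the SVD of the loadings: f* = (V' (x) I_m) f
_, _, Vt = np.linalg.svd(Lam_hat, full_matrices=False)
V = Vt.T
Lam_star = Lam_hat @ V
R = np.kron(V.T, I_m)
V_f_star = R @ V_f @ R.T                   # PEV of rotated scores under this assumption


def pev_beta(Lam, V_scores):
    """Eq. (8): (Lam (x) I_m) V_scores (Lam' (x) I_m)."""
    A = np.kron(Lam, I_m)
    return A @ V_scores @ A.T


V_beta_rot = pev_beta(Lam_star, V_f_star)
V_beta_unrot = pev_beta(Lam_hat, V_f)

# Prediction standard errors, arranged as environments (rows) by varieties (columns)
se_beta = np.sqrt(np.diag(V_beta_rot)).reshape(t, m)
```

Under these assumptions `V_beta_rot` and `V_beta_unrot` coincide, which is consistent with the remark that the PEV matrix for the unrotated scores could equally have been used for the predictions \(\tilde{\varvec{\beta }}\).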