Reparametrization-based estimation of genetic parameters in multi-trait animal model using Integrated Nested Laplace Approximation

Key message A novel reparametrization-based INLA approach as a fast alternative to MCMC for the Bayesian estimation of genetic parameters in multivariate animal model is presented. Abstract Multi-trait genetic parameter estimation is a relevant topic in animal and plant breeding programs because multi-trait analysis can take into account the genetic correlation between different traits and that significantly improves the accuracy of the genetic parameter estimates. Generally, multi-trait analysis is computationally demanding and requires initial estimates of genetic and residual correlations among the traits, while those are difficult to obtain. In this study, we illustrate how to reparametrize covariance matrices of a multivariate animal model/animal models using modified Cholesky decompositions. This reparametrization-based approach is used in the Integrated Nested Laplace Approximation (INLA) methodology to estimate genetic parameters of multivariate animal model. Immediate benefits are: (1) to avoid difficulties of finding good starting values for analysis which can be a problem, for example in Restricted Maximum Likelihood (REML); (2) Bayesian estimation of (co)variance components using INLA is faster to execute than using Markov Chain Monte Carlo (MCMC) especially when realized relationship matrices are dense. The slight drawback is that priors for covariance matrices are assigned for elements of the Cholesky factor but not directly to the covariance matrix elements as in MCMC. Additionally, we illustrate the concordance of the INLA results with the traditional methods like MCMC and REML approaches. We also present results obtained from simulated data sets with replicates and field data in rice.


Introduction
Estimation of variance components and associated breeding values is an important topic in classic (e.g., Piepho et al. 2008;Oakey et al. 2006;Bauer et al. 2006) and in Bayesian (e.g., Wang et al. 1993;Blasco 2001;Sorensen and Gianola 2002;Mathew et al. 2012) single-trait mixed model context. Similarly, multi-trait models have been proposed in both settings (e.g., Bauer and Léon 2008;Thompson and Meyer 1986;Korsgaard et al. 2003;Van Tassell and Van Vleck 1996;Hadfield 2010). Multi-trait analyses can take into account the correlation structure among all traits and that increases the accuracy of evaluation. However, this gain in accuracy is dependent on the absolute difference between the genetic and residual correlation between the traits (Mrode and Thompson 2005). This evaluation accuracy will increase as the differences between these correlations become high (Schaeffer 1984; Thompson and Meyer 1986). Persson and Andersson (2004) compared singletrait and multi-trait analyses of breeding values and they 1 3 showed that multi-trait predictors resulted in a lower average bias than the single-trait analysis. Estimation of genetic and residual covariance matrices are the main challenging problem in multi-trait analysis in mixed model framework. However, in Bayesian analysis of multi-trait animal models, inverse-Wishart distribution is the common choice as the prior distribution for those unknown covariance matrices. The use of inverse-Wishart prior distribution for covariance matrix guarantees that the resulting covariance matrix will be positive definite (that is, invertible). However, the use of inverse-Wishart prior distribution is quite restrictive, because then one gives same degrees of freedom for all components in the covariance matrix (Barnard et al. 2000). Moreover, it is often difficult to suggest prior distributions that can be used for common situations. Matrix decomposition approach presented in this paper assigns independent priors for elements in the Cholesky factor.
Markov Chain Monte Carlo (MCMC) methods are a popular choice for Bayesian inference of animal models (Sorensen and Gianola 2002). Often, inference using MCMC methods is challenging for a non-specialist. Although there are various packages available for Bayesian inference which are based on MCMC methods (e.g., MCMCglmm, Hadfield 2010;BUGS, Lunn et al. 2000;Stan, Stan Development Team 2014), most of these packages are not easy to use and computationally expensive. Among these packages, MCMCglmm seems to be easy to implement and computationally inexpensive. As an alternative to MCMC methods one can use the recently implemented non-sampling-based Bayesian inference method, Integrated Nested Laplace Approximation (INLA, Rue et al. 2009). INLA methodology is comparatively easy to implement, but less flexible than MCMC methods (Holand et al. 2013).
Canonical transformation is a common matrix decomposition technique in multi-trait animal models to simultaneously diagonalize the genetic covariance matrix and make residual covariance matrix to identity matrix (see e.g., Ducrocq and Chapuis 1997). After transformation, best linear unbiased prediction (BLUP) values can be calculated independently for each trait using univariate animal model and then back transformed to obtain benefits of multi-trait analysis. However, common requirement in canonical transformation is that covariance matrices need to be known before the transformation. Therefore, it cannot be applied for variance component estimation-with unknown genetic and residual covariance matrices. Here, as an improvement, we introduce another kind of decomposition approach, where elements of the transformation matrix are estimated simultaneously together with the other mixed model parameters allowing us to apply this transformation also for the case of variance component estimation. This kind of modified Cholesky decomposition approach is required to perform multi-trait analysis in INLA (see Bøhn 2014). The closely related decomposition approach has been presented in Pourahmadi (1999Pourahmadi ( , 2000Pourahmadi ( , 2011 and Gao et al. (2015). Also our approach is somewhat related to factor analytic (FA) models (e.g. Meyer 2009;Cullis et al. 2014) and which was first introduced in a breeding context by Piepho (1997Piepho ( , 1998. In this paper, we illustrate this approach to estimate genetic and residual covariance matrices with INLA and compare the obtained estimates with those from REML (Patterson and Thompson 1971) and MCMC approaches using simulated and real data sets. With the recent development of new low-cost high-throughput DNA sequencing technologies, it is now possible to obtain thousands of single nucleotide polymorphism (SNP) markers covering the whole genome, at the same time, in many animal and plant breeding programs often the detailed pedigree information is available. So we present results obtained using the marker data (real dataset) along with estimates obtained using pedigree information (simulated data) in this study. We also outline a more simple approach to simulate correlated traits based on the additive relationship matrix.

Model
We consider the multi-trait mixed model by Henderson and Quaas (1976). Let vector y 1 represent the n 1 observations for trait 1, y 2 represent the n 2 observations for trait 2 and y n represent the n n observations for trait n. Then the multi-trait mixed linear model for n traits can be written as: β i is a vector of fixed effects associated with trait i, u i is a vector of random additive genetic effects associated with trait i, ǫ i is a vector of error terms, which are independently normally distributed with mean zero and variance σ 2 e . Moreover, X i and Z i are known incidence matrices for the fixed effects and the random effects for the trait i, respectively. Then, the multi-trait mixed model for n traits can be represented as follows: n ] ′ and y contains traits y 1 . . . y n . In our study we considered three correlated traits so i = 1, 2, 3. Then mixed model equation (MME) for the model (1) is: Here, R and G are covariance matrices associated with the vector ǫ of residuals and vector u of random effects. If R 0 (of order 3 × 3) is the residual covariance for the three traits then R can be calculated as R = R 0 ⊗ I (here '⊗' is the Kronecker product of two matrices and I is the identity matrix). Similarly, the genetic covariance matrix G can be calculated as G = G 0 ⊗ A. Here, A is the additive genetic relationship matrix (p. 763 in Lynch and Walsh 1998) and G 0 is a 3 × 3 additive genetic (co)variance matrix. For the Bayesian inference with MCMCglmm package using model (1) one need to specify the conditional distribution for the data (y) and prior distribution for the unknown parameters. So the conditional distribution of data y, given the parameters assumed to follow a multivariate normal distribution: The additive genetic effects (u i 's) were assigned multivariate normal distributions with a mean vector of zeros, 0, as: and the residuals (ǫ i 's) were assumed to follow, where I is an identity matrix. In Bayesian analysis fixed effects also have a prior and here β was assigned a vague, large-variance Gaussian prior distribution. Steinsland and Jensen (2010) showed that animal models are latent Gaussian Markov random field (GMRF) models with a sparse precision matrix (inverse of the additive relationship matrix, A −1 ), and can be analyzed in INLA framework. Mathew et al. (2015), Larsen et al. (2014) and Holand et al. (2013) used INLA for Bayesian inference of univariate animal models, while in a recent study, Bøhn (2014) showed how to analyze a bivariate animal model using INLA. Unlike MCMCglmm and ASReml-R (Butler et al. 2007), analysis of multivariate animal model is not straightforward in R-INLA. For multivariate inference in INLA, we first assumed a trivariate distribution as a set of univariate Gaussian distributions, then we used the multiple likelihood feature in INLA and the recently implemented 'copy' feature (Martins et al. 2013) to fit our trivariate animal model with separate likelihoods (but which share few common parameters). The 'copy' feature in INLA allows us to estimate dependency parameters between traits. For the INLA analysis of the trivariate animal model we can reparametrize our model for the observation vector y as follows:

Reparametrization of trivariate animal model in INLA
Here, y 1 , y 2 , y 3 are the traits and κ i,j defines the dependency between additive effects (a i 's), moreover, α i,j defines the dependency between the error terms (e i 's). For the Bayesian inference one needs to assign prior distribution for the unknown parameters. The additive genetic effects (a i 's) for each trait were assigned multivariate normal distributions with a mean vector of zeros, 0, as: whereas the residuals (e i 's) were assumed to follow a multivariate normal distribution as follows: where I is an identity matrix. The hyperparameters (σ 2 a i , σ 2 e i ) were assigned inverse-Gamma prior (0.5, 0.5) distributions and the dependency parameters (κ i,j , α i,j ) were assumed to follow Gaussian distributions with mean 0 and variance 10. Thus we define the observation vector y for the trivariate animal model as: Here u = W a a, is the additive genetic term and ǫ = W e e, is the residual term. Moreover, Cholesky factor and Here, a 1 , a 2 and a 3 are the additive effects for each traits in the reparametrized scale. Moreover, A is additive relationship matrix calculated from the pedigree information and σ 2 a i , i = 1, 2, 3 are the additive genetic variances for each trait. Hence, Here, (7) y 1 = a 1 + e 1 y 2 = κ 1,2 a 1 + a 2 + α 1,2 e 1 + e 2 y 3 = κ 1,3 a 1 + κ 2,3 a 2 + a 3 + α 1,3 e 1 + α 2,3 e 2 + e 3 and is the additive genetic covariance matrix for the traits in the transformed scale. Thus, the additive genetic effects (u = W a a) follow a multivariate normal distribution (Eq. 5) with a mean vector of zeros, 0, as: Here, G 0 is the additive genetic (co)variance matrix. Similarly i = 1, 2, 3 and σ u ij , σ ǫ ij , where i, j = 1, 2, 3 be the (genetic and residual) variance and (genetic and residual) covariance components, respectively, in the original scale. First, calculate the approximated posterior marginal distribution for the hyperparameters (σ 2 a i 's , σ 2 e i 's ) and the dependency parameters (κ i,j 's, α i,j 's) by sampling from their joint distribution using the 'inla.hyperpar.sample' (Martins et al. 2013) function. Then, following Eq. (9) the genetic variance components can be calculated using the posterior distributions as, σ 2 . Similarly the genetic covariance components can be obtained as, . The same procedure can be used to calculate the residual (co) variance components using Eq. (11). R scripts for the back transformation can be found in the supplementary material.
Here we have independent error terms (e i 's) for each trait, so the covariance matrix I is an identity matrix. Hence, the residuals (ǫ) follow a multivariate normal distribution (Eq. 6) with a mean vector of zeros, 0, as follows: Here, R 0 is the residual genetic (co)variance matrix.
To extend this method for more than three traits (say, n traits) can be done by modifying the terms of Eq. (9), so that the additive genetic Cholesky factor W a is a Kronecker product of n × n lower triangular matrix with I and X a is the additive genetic block matrix containing n blocks. For example, number of dependency parameters required for a 4 × 4 Cholesky factor is already n(n − 1)/2 = 4 × 3/2 = 6 .
As an additional supplementary material we provide the R scripts we used for the INLA analysis.

Back transformation in INLA
INLA analysis returns the marginal posterior distributions of the hyperparameters (σ 2 a i 's , σ 2 e i 's ) and the dependency parameters (κ i,j 's, α i,j 's) for the reparametrized model (Eq. 7). So one need to perform the back transformation after the INLA analysis in order to obtain (co)variance components in the original scale. Let σ 2

Example analyses Simulated dataset with high heritability
To validate our new algorithm we developed a simulated pedigree data. In this, we considered a base population of 50 unrelated lines, wherein each of the 25 seed parents were mated with 25 pollen donors resulting in total 625 individuals (in total 675 individuals, including the base population). Additive genetic relationship matrix (A) was calculated from the pedigree information. In our current study, we simulated three quantitative traits by summing up the additive genetic effects a and the noise e. Thus, the vector of phenotypic observations (three traits) was calculated as: Here the vectors a, e were drawn from MVN (0, G 0 ⊗ A) and MVN (0, R 0 ⊗ I), respectively. In order to simulate correlated traits with relatively high heritability (h 2 0.5 ), we used y = a + e. as the genetic covariance matrix and the residual covariance matrix between the three traits. The three simulated traits had heritabilities ≈ 0.50, 0.60 and 0.70, respectively. Let G = G 0 ⊗ A and R = R 0 ⊗ I, then we used the Cholesky decomposition of the covariance matrices G and R to draw samples from the multivariate normal distribution. Hence, the random additive effect a was calculated as a = Pz a , where z a ∼ MVN (0, I) and P is the Cholesky factor PP ′ = G; whereas, the residuals e was calculated as e = Tz e , where z e ∼ MVN (0, I) and T is the Cholesky factor TT ′ = R.

Simulated dataset with low heritability
We also analyzed another simulated correlated dataset with low heritability (h 2 ≈ 0.2) and negative covariances between the traits, in order to show how these methods perform when the heritability is relatively low. To simulate the dataset we considered the same pedigree information from the high heritability dataset but, with different covariance matrices. For the simulation we considered as the genetic and residual covariance matrices, respectively. The correlated phenotypes were simulated as explained before and the three traits had heritabilities ≈ 0.20, 0.20 and 0.22, respectively.

Field data
In our study we analyzed the recently published rice (Oryza sativa) dataset (Spindel et al. 2015) and we selected three traits, grain yield (YLD), flowering time (FL) and plant height (PH) from 2012 dry season for the analysis. The population was genotyped with 73,147 markers using genotyping-by-sequencing method and we selected 323 lines where both the phenotypic and genotypic informations were available (see Spindel et al. 2015 for more details). So we used the available marker information for the estimation of genetic (co)variance components and the realized genomic relationship matrix (M) was obtained from the marker information using R-package 'rrBLUP' (Endelman 2011). For the real data analysis, we considered the marker data instead of the pedigree information, so in model (1) the vector of random effects (u) were assumed to follow a normal distribution according to Eq. (5) as Here, M is realized genomic relationship matrix calculated from the marker information and G 0 is the genetic (co)variance matrix.

Simulated data with replicates
In multi-trait analysis using iterative algorithms, it is often difficult to find suitable starting values for the parameters of interest. However, by performing test-runs using singletrait data one could find suitable starting values for the variance components. The (co)variance components were estimated using MCMCglmm, R-INLA and ASReml-R packages. For MCMC analysis using MCMCglmm package, we considered a total chain length of 50,000 iterations with a burning period of 10,000 iterations. The MCM-Cglmm package assign inverse-Wishart prior distribution for the random and residual covariance matrices. In our MCMC analysis, we used identity matrix as the scaling matrix of the prior distribution (ones for the variances and zeros for the covariances) assigned for the genetic covariance matrix (G 0 ) and for the residual covariance matrix (R 0 ) between the three traits. Moreover, we specified the degree of belief parameter (d) as 1 for the inverse-Wishart prior distribution. By default MCMCglmm uses the scaling matrix values as the starting values. For the REML analysis we used ones as the variances and zeros as the covariances for the genetic covariance matrix (G 0 ) as the starting values; whereas, for the residual covariance matrix (R 0 ) we used half of the phenotypic variance matrix of the data as initial values (ASReml-R default). The total computation time for the simulated dataset using MCMCglmm package was around 10 min and INLA took around 4 min, whereas the time for ASReml-R package was around 1 min. The INLA approach we used in the current study was not able to analyze bigger datasets (around 1000 lines), mainly due to the lack of memory on our system. We used a Linux system with 8GB RAM for our calculations. However, it is possible to analyze such large datasets using computers with more memory size or arguably one can use the option 'inla.remote()' to run R-INLA on a remote server with more memory size.
We used 50 simulation replicates for each simulated dataset to calculate the variance and covariance components using different estimation methods. In order to compare the accuracy of different estimation methods, we calculated the estimation error (difference between the true and estimated values) using 50 simulation replicates for the (co)variance components and then we plotted the box plots for the estimation errors to visualize the estimation accuracy of different methods. We show those box plots for the estimation errors for the variance (Fig. 1) and covariance (Fig. 2) components for the simulated dataset with high heritability. The Y-axis scale in those plots corresponds to the differences between the true simulated values and the estimated values, whereas the X-axis corresponds to different estimation methods. In order to calculate the estimation errors, for MCMC we used posterior mode, whereas for INLA we used the posterior mean estimates. From Figs. 1 and 2, it can be concluded that, different methods were able to provide similar estimates. We also plotted the box plots for the estimation errors for the variance (Fig. 3) and covariance (Fig. 4) for the dataset with low heritability. However, for the low heritability dataset the MCMC and INLA approaches provided variance estimates closer to true values than the REML method. The narrow-sense heritability estimates for the simulated datasets using 50 simulation replicates are shown in Table 4. Here we did not account the covariances between the traits in order to calculate the heritability. The narrow-sense heritability (h 2 ) was calculated for each trait separately as h 2 = V a /(V a + V e ), where V a and V e are the additive genetic variance and error variance of the particular trait, respectively.
Additionally, instead of covariances we report the estimated genetic and residual correlation coefficients as well as the 95 % empirical confidence intervals (in brackets) for each trait in Table 1. From Table 1 it is clear that the Bayesian methods were able to provide better estimates (closer to the true simulated values) for the additive genetic correlation coefficients than the REML approach with the low heritability dataset. One probable reason could be that the prior influence is higher with the low heritability dataset. We also performed univariate analyses using INLA with the simulated low heritability dataset (see Table 2). Both univariate and multivariate analyses gave very similar results, however in multivariate analysis one can account for and estimate the covariances between the traits.
Heritability and breeding values are of great interest to breeders in order to plan an efficient breeding program. In our study, we also calculated the correlation coefficients between the estimated and true breeding values using different estimation methods. We used average over 50 simulation replicates for both datasets to calculate the correlation coefficients. For the high heritability dataset, the correlation coefficients were 0.85, 0.84 and 0.83 for REML, MCMC and INLA methods, respectively. However, Variance σ 2 a1 σ 2 a2 σ 2 a3 σ 2 e1 σ 2 e2 σ 2 e3 for the low heritability dataset the correlation coefficients were relatively low being 0.64, 0.65 and 0.58 for REML, MCMC and INLA, respectively.

Field data
We chose the same starting values for the simulated data and real data in our REML analysis. For MCMC analysis we used empirical phenotypic variance of each trait as the variances and zeros as the covariance as the scale matrix of the prior distribution, whereas, starting values for other parameters were set randomly. For the REML analysis we chose the same starting values that we used for the simulated dataset. Both REML and Bayesian methods gave similar results in our analysis using real dataset (Table 3). Due to numerical problems caused by the large differences among the traits' phenotypic variances, before the INLA analysis we standardized each phenotypic vector to zero mean and unit variance, and after the analysis we rescaled the (co)variance components into the original scale. However, for MCMC and REML analysis we used the original scale. Our results showed that there is a negative genetic covariance between the traits plant height (PH) and yield (YLD). Additionally, as expected, the traits days to flowering (FL) and yield (YLD) showed a negative genetic covariance in our analysis. We also calculated the narrow-sense heritability for both datasets and Table 4 summarizes those results. Our narrow-sense heritability estimates for the real dataset are in concordance with the heritability estimates reported by Spindel et al. (2015) for the univariate animal model using REML. The total computation time for real dataset using INLA was around three minutes, whereas the MCMCglmm took around five hours. The main reasons for the expensive computation time with MCMCglmm are, firstly, the realized genomic relationship matrix were calculated outside the package, whereas, for the pedigree information MCMCglmm has built in functions in order to handle the covariance matrix more efficiently. Secondly, the realized genomic relationship matrix from the marker information is more dense than the pedigree-based additive relationship matrix.

Discussion
Multi-trait analysis of mixed models tend to be powerful and provide more accurate estimates than the single-trait analysis because the former method can take into account the underlying correlation structure found in a multi-trait dataset. However, Bayesian and non-Bayesian inference of multi-trait mixed model analysis are complex and computationally demanding. In this study, we explained how to do Bayesian inference of a multivariate animal model using recently developed INLA and the counter part MCMC, while comparing the results with the commonly used REML estimates. Our results show that reparametrizationbased INLA approach can be used as a fast alternative to MCMC methods for the Bayesian inference of multivariate animal model. The reparametrization approach, that was here applied for INLA analysis, can be used also more generally together with other tools to speed up the multi-trait animal model computations.
Drawback of the reparametrization-based approach is that priors are assigned for elements in the Cholesky factor instead of the original covariance matrix. Thus, here it is not possible to make a direct comparison between the MCMC and INLA results due to the differences in the prior distributions, however, it is possible to compare both approaches if we choose the same prior distributions. For the MCMC analysis we used inverse-Wishart distributions for the covariance matrices; whereas, for INLA we used Gaussian prior distribution for the elements in the Cholesky factor (i.e., dependence parameters) (κ ij 's, α ij 's) and inverse-Gamma distribution for the decomposition variance components (σ 2 a i 's , σ 2 e i 's ). Our results show that the REML estimates are in concordance with MCMCglmm and INLA. We want to emphasize that in our examples, the analyzed data sets were large and we did not encounter any problems. In general, identifiability is a problem in mixed model analyses with small data (Mathew et al. 2012). However, Bayesian methods are in better positions because they can at least find such problems (that posterior distribution has multiple modes) more easily than REML (which provides a single point-estimate). Variance σ 2 a1 σ 2 a2 σ 2 a3 σ 2 e1 σ 2 e2 σ 2 e3 Nowadays, molecular markers are widely used in animal and plant breeding programs as a valuable tool for genetic improvement. Therefore, we also showed how to estimate genetic parameters in a multivariate animal model using molecular marker information with the reparametrizationbased INLA approach and frequentist framework. Finally, our results imply that the reparametrization-based INLA approach can be used as a fast alternative to MCMC methods in order to estimate genetic parameters with a multivariate animal model using pedigree information as well as with molecular marker information.
Author contribution statement BM, AH, PK, JL and MS were involved in the conception and design of the study. BM performed the simulations and preprocessing of the data. BM and AH implemented the method, and performed the statistical analyses. BM drafted the manuscript. BM, AH, JL and MS participated in the interpretation of results. All the authors critically revised the manuscript. the editor and two anonymous reviewers as well as Karin Woitol for their suggestions and comments which helped us to improve our manuscript.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.  Table 3 The additive genetic variance (σ 2 a ) and the error variance (σ 2 e ) for the field data obtained from the REML analysis and the posterior mode estimates obtained from the MCMCglmm package along with R-INLA posterior mean estimates are presented The additive genetic covariance (σ T i ,T j (a) ) and the error covariance (σ T i ,T j (e) ) between each pair of three quantitative traits (PH, FL, YLD) are also shown (