Abstract
Large-scale phenotype data can enhance the power of genomic prediction in plant and animal breeding, as well as human genetics. However, the statistical foundation of multi-trait genomic prediction is based on the multivariate linear mixed-effect model, a tool notorious for its fragility when applied to more than a handful of traits. We present MegaLMM, a statistical framework and associated software package for mixed-model analyses of a virtually unlimited number of traits. Using three examples with real plant data, we show that MegaLMM can leverage thousands of traits at once to significantly improve genetic value prediction accuracy.
Background
New high-throughput phenotyping technologies hold promise for a revolution in data-driven decisions in plant and animal breeding programs [1, 2]. For example, drone-based hyperspectral cameras can image fields at high resolution across hundreds of spectral bands [3], wearable sensors can continuously monitor animals' health and physiology [4], and RNA sequencing and metabolite profiling can simultaneously assay the concentrations of tens of thousands of targets [5]. These high-dimensional traits could allow breeders to rapidly assess many aspects of performance more accurately or earlier in development than was possible using traditional tools. This can increase the rate of gain in target traits by increasing selection accuracy, increasing selection intensity, and reducing breeding cycle durations.
However, efficiently incorporating high-dimensional phenotype data into breeding decisions is challenging. Whenever two traits are genetically correlated, joint analyses can improve the precision of variety evaluation [6]. However, two key problems emerge. First, the number of traits in high-dimensional datasets is often much larger than the number of breeding lines, which means that naive correlation estimates are not robust. Second, phenotypic correlations among traits are often poor approximations to genetic correlations, so not all correlated traits are useful for breeding decisions [7]. For example, plants grown in more productive areas of a field will tend to produce higher yields and be greener (measured by hyperspectral reflectance). Yet, selecting indirectly on greener plants instead of directly on higher yields may be counterproductive because “greenness” may indicate an over-investment in vegetative tissues at the expense of seed. This contrasts with the problem of predicting genetic values from genotype data (e.g., genomic prediction [8]), where all correlations between candidate features and performance are useful for selection.
The multivariate linear mixed model (MvLMM) is a widely used statistical tool for decomposing phenotypic correlations into genetic and non-genetic components. The MvLMM is a multi-outcome generalization of the univariate linear mixed model (LMM) that forms the backbone of the majority of methods in quantitative genetics. The MvLMM was introduced over 40 years ago [9] and has repeatedly been shown to increase selection efficiency [10–12]. Yet, MvLMMs are still rarely used in actual breeding programs because naive implementations of the framework are sensitive to noise, prone to overfitting, and exhibit convergence problems [13]. Furthermore, existing algorithms are extremely computationally demanding. The fragility of naive MvLMMs is due to the number of variance-covariance parameters that must be estimated, which increases quadratically with the number of traits. The computational demands increase even more dramatically: from cubically to quintically with the number of traits [14], because most algorithms require repeated inversion of large covariance matrices. These matrix operations dominate the time required to fit an MvLMM, leading to models that take days, weeks, or even years to converge.
Here, we describe MegaLMM (linear mixed models for millions of observations), a novel statistical method and computational algorithm for fitting massive-scale MvLMMs to large-scale phenotypic datasets. Although we focus on plant breeding applications for concreteness, our method can be broadly applied wherever multi-trait linear mixed models are used (e.g., human genetics, industrial experiments, psychology, linguistics, etc.). MegaLMM dramatically improves upon existing methods that fit low-rank MvLMMs, allowing multiple random effects and unbalanced study designs with large amounts of missing data. We achieve both scalability and statistical robustness by combining strong, but biologically motivated, Bayesian priors for statistical regularization (analogous to the p >> n approach of genomic prediction methods) with algorithmic innovations recently developed for LMMs. In the three examples below, we demonstrate that our algorithm maintains high predictive accuracy for tens of thousands of traits and dramatically improves the prediction of genetic values over existing methods when applied to data from real breeding programs.
Results
Methods overview
MegaLMM fits a full multi-trait linear mixed model (MvLMM) to a matrix of phenotypic observations for n genotypes and t traits (level 1 of Fig. 1A). We decompose this matrix into fixed, random, and residual components, while modeling the sources of variation and covariation among all pairs of traits. The main statistical and computational challenge of fitting large MvLMMs centers around the need to robustly estimate t×t covariance matrices for the residuals and each random effect. Each covariance matrix has t(t−1)/2+t free parameters, and any direct estimation approach is computationally demanding because it requires repeatedly inverting these matrices (an \(\mathcal {O}(t^{3})\) operation).
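To make this scaling concrete, the free-parameter count per covariance matrix can be tabulated for increasing t (an illustrative sketch of the formula above, not part of the MegaLMM software):

```python
# Free parameters in one symmetric t x t covariance matrix:
# t*(t-1)/2 off-diagonal covariances plus t variances.
def covariance_params(t: int) -> int:
    return t * (t - 1) // 2 + t

for t in (5, 50, 500, 20000):
    print(f"{t:>6} traits -> {covariance_params(t):>12,} parameters")
```

With 20,000 traits, a single covariance matrix already implies roughly 2×10^8 free parameters per random effect, which is why direct estimation becomes intractable.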
We solve both of these problems by introducing K unobserved (latent) traits called factors (f_{k}) to represent the causes of covariance among the t observed traits. We treat each latent trait just as we would any directly measured trait and decompose its variation into the same fixed, random, and residual components using a set of parallel univariate linear mixed models (level 2 of Fig. 1A). We then model the pairwise correlations between each latent trait and each observed trait through K loading vectors λ_{k·}.
Together, the set of parallel univariate LMMs and the set of factor loading vectors result in a novel and very general reparameterization of the MvLMM framework as a mixed-effect factor model. This parameterization leads to dramatic computational performance gains by avoiding all large matrix inversions. It also serves as a scaffold for eliciting Bayesian priors that are intuitive and provide powerful regularization, which is necessary for robust performance with limited data. Our default prior distributions encourage: (i) shrinkage on the factor-trait correlations (λ_{jk}) to avoid overfitting covariances, and (ii) shrinkage on the factor sizes to avoid including too many latent traits. This two-dimensional regularization helps the model focus only on the strongest, most relevant signals in the data.
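The two-level structure can be sketched numerically: each latent factor gets its own genetic and residual component (level 2), and a loadings matrix maps the factors onto the observed traits (level 1). All names, dimensions, and distributions below are hypothetical illustrations of the model's structure, not MegaLMM's internal code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, K = 200, 1000, 5  # genotypes, observed traits, latent factors

# Level 2: each latent factor is treated like a trait in a univariate LMM,
# with its variance split into genetic (U) and residual (E) parts.
h2 = rng.uniform(0.2, 0.8, K)                   # per-factor heritabilities
U = rng.normal(0.0, np.sqrt(h2), (n, K))        # genetic values of factors
E = rng.normal(0.0, np.sqrt(1.0 - h2), (n, K))  # residuals of factors
F = U + E                                       # latent factor scores f_k

# Loadings map each factor onto the t observed traits.
Lam = rng.normal(0.0, 1.0, (K, t))

# Level 1: observed traits = factor contributions + trait-specific noise.
Y = F @ Lam + rng.normal(0.0, 0.1, (n, t))

# The implied t x t trait covariance is low-rank (K) plus diagonal,
# so it never needs to be formed or inverted explicitly.
print(Y.shape)
```

The key point is that the t×t covariance is represented through the K×t loadings and K univariate models, not estimated directly.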
While others have used latent factor approaches to reduce the dimensionality of MvLMMs (e.g., [15–18]), these methods only use factors for a single random effect (usually the matrix of random genetic values), with the exception of BSFG, which uses factors for the combined effect of a single random effect and the residuals [17]. In MegaLMM, we expand this framework and use factors to model the joint effects of all model terms, fixed, random, and residual, across all traits.
We combine this efficient factor model structure with algorithmic innovations that greatly enhance computational efficiency, drawing upon recent work in LMMs [19–22]. While Gibbs samplers for MvLMMs are notoriously slow, we discovered extensive opportunities for collapsing sampling steps, marginalizing over missing data, and discretizing variance components so that intermediate results can be cached (Additional file 1: Supplemental Methods).
Genomic prediction using MegaLMM works by fitting the model to a partially observed trait matrix, with the traits to be predicted imputed as missing data. MegaLMM then estimates genetic values for all traits (both observed and missing) in a single step (Fig. 1B).
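The statistical core of this prediction step can be illustrated with a toy conditional-expectation calculation: given a joint covariance between a masked focal trait and observed secondary traits, the missing values are predicted from the observed ones. This is a simplified sketch with a known covariance matrix, not MegaLMM's actual Gibbs sampler:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Toy joint covariance: a focal trait (index 0) plus 3 secondary traits.
Sigma = np.array([[1.0, 0.6, 0.5, 0.4],
                  [0.6, 1.0, 0.3, 0.2],
                  [0.5, 0.3, 1.0, 0.1],
                  [0.4, 0.2, 0.1, 1.0]])
Y = rng.multivariate_normal(np.zeros(4), Sigma, size=n)

# Predict the "missing" focal trait from the secondary traits:
# E[y_focal | y_sec] = Sigma_fs @ Sigma_ss^{-1} @ y_sec
w = np.linalg.solve(Sigma[1:, 1:], Sigma[0, 1:])
y_hat = Y[:, 1:] @ w

# Correlation between predictions and the held-out focal values.
acc = np.corrcoef(y_hat, Y[:, 0])[0, 1]
print(round(acc, 2))
```

In MegaLMM, the analogous covariances are estimated from the data and decomposed into genetic and non-genetic parts, so the prediction targets genetic values rather than raw phenotypes.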
MegaLMM is efficient and effective for large datasets
We used a gene expression matrix with 20,843 genes measured in each of 665 Arabidopsis thaliana accessions (a total of nearly 14 million observations) to evaluate the accuracy and time requirements for trait-assisted genomic prediction, a classic example of an applied use of MvLMMs, across a panel of existing software packages. We created datasets with 4 to 20,842 “secondary” traits with complete data, and used these data to predict the genetic values of a single randomly selected “focal” gene with 50% missing data.
Despite the limited number of independent lines in this dataset, adding up to ≈200 secondary traits improved the genomic prediction accuracy of MegaLMM and two other Bayesian methods: MCMCglmm and phenix (Fig. 2A). The maximum likelihood method MTG2 [23], on the other hand, did only marginally better than single-trait prediction, and its genomic prediction accuracy declined when 32 traits were included, likely due to overfitting. We note that the results here are averages over 20 randomly selected focal genes. The prediction accuracy and the benefits of multi-trait prediction varied considerably among genes (Additional file 1: Figure S1 and Figure S2), but the relative performance of the methods was largely consistent. Using simulated datasets where we knew the true genetic and residual covariances among traits, we also found that MegaLMM was at least as accurate in estimating covariance parameters as the competing methods (Additional file 1: Figure S3).
Beyond 32 secondary traits, computational times for MCMCglmm and MTG2 became prohibitive (Fig. 2B). Using extrapolation, we estimated that fitting these methods for 512 traits would take 20 days and 217 days, respectively, without considering issues of model convergence. In contrast, phenix and MegaLMM were both able to converge on good model fits for 512 traits in approximately one hour.
Beyond 512 traits, MegaLMM was the only viable method as phenix cannot be applied to datasets with t>n phenotypes. Although the genomic prediction accuracy of MegaLMM did not increase further after ≈256 traits, performance did not suffer even with the full dataset of >20,000 traits and the analysis was completed in less than a day. This shows that MegaLMM is feasible to apply to very highdimensional studies and, in most cases, does not require prefiltering of traits–something that requires great care in genomic prediction applications to avoid misleading results [24].
An important feature of MegaLMM is that the choice of the number of latent factors K is less critical than in most factor models. Since factors are ordered from most-to-least important by the prior (see Methods), as long as enough factors are specified to capture the majority of the covariance among traits, adding additional latent factors does not lead to overfitting (Additional file 1: Figure S4A). Additional factors do increase the runtime of the algorithm, though (Additional file 1: Figure S4B), so some optimization of K during the burn-in period can reduce computational demands during posterior sampling.
Applications to real breeding programs
To demonstrate the utility of MegaLMM, we developed two classes of genomic prediction models for high-dimensional phenotype data in real plant breeding programs.
Genomic prediction using hyperspectral reflectance data
When the final performance of a variety is difficult or costly to obtain, breeding programs can supplement direct measures of performance with data from surrogate traits that can be measured cheaply, earlier in the breeding cycle, and on more varieties. For example, in the bread wheat breeding program at CIMMYT, hyperspectral reflectance data can be collected rapidly and repeatedly by aerial drones on thousands of plots [25]. We developed a multi-trait genomic prediction model to incorporate 62-band hyperspectral reflectance data from 10 different drone flights over the course of one growing season, and compared the accuracy of these genetic value predictions against more traditional approaches.
We first compared three standard univariate methods: GBLUP [26], Bayesian LASSO (BL) [27], and reproducing kernel Hilbert space (RKHS) regression [28]. GBLUP achieved a prediction accuracy of ρ_{g}=0.43 for yield (Fig. 3A). Both the BL and RKHS methods showed modest improvements in these data, with ρ_{g}=0.47 and ρ_{g}=0.49, respectively. The RKHS model often outperforms GBLUP in plant breeding datasets, but improvements are generally slight and inconsistent, depending on the genetic architecture of the targeted trait.
In the original analysis of this dataset, [25] achieved increased performance by replacing the genomic kernel (K in our notation) with a kernel based on the cross-product of hyperspectral reflectances across all wavelengths and time points (termed the H matrix). We replicated these results, achieving a prediction accuracy of ρ_{g}=0.58 (HBLUP method). These authors also proposed a multi-kernel model combining the K and H kernel matrices, although they only applied this to cross-treatment genotype-by-environment predictions. We found that applying this multi-kernel method to the within-environment data resulted in additional accuracy gains (ρ_{g}=0.64) (GBLUP+H method; Fig. 3A).
While more effective than univariate methods, predictions based on the H kernel matrix are biased by non-genetic correlations between surrogate traits and yield because they do not directly model the genetic component of these correlations. MegaLMM implements a full multi-trait mixed model and thus can separate these sources of correlation. We fit three different multi-trait prediction models with MegaLMM. The first was a standard multi-trait mixed model with a single random effect based on the genomic relationship matrix K. This method achieved a dramatically higher prediction accuracy than any of the previous approaches (ρ_{g}=0.73). Second, because the RKHS model had the highest performance among univariate predictions, we implemented an approximate RKHS method in MegaLMM based on averaging over three kernel matrices [28]. We are not aware of any other high-dimensional MvLMM implementations that allow models with multiple random effects. This model achieved the highest predictive accuracy (ρ_{g}=0.77). Finally, we repeated the MegaLMM-GBLUP analysis, but this time masking all phenotype data (both grain yield and hyperspectral data) from the testing set. We called this approach MegaLMM-GBLUP-CV1, following the nomenclature from [29]. Genetic prediction accuracy in the CV1 setting was similar to the univariate methods (ρ_{g}=0.49), showing that nearly all of the benefit of MegaLMM in this dataset came through the optimal use of secondary trait phenotypes on the lines of interest.
In summary, by directly modeling the genetic covariance between the surrogate traits (hyperspectral reflectance measures) and grain yield, we achieved performance increases of 56–79% over the univariate methods, and up to 36% over the HBLUP method. To show that these conclusions were robust in other datasets, we repeated the same analyses in the other 19 trials reported by [25]; results were highly similar in all trials (Additional file 1: Figure S5).
To explore why directly modeling the genetic correlation is important, we compared the estimated genetic correlations between each hyperspectral band and grain yield to the corresponding phenotypic correlations (Fig. 3B). Most genetic correlation estimates closely paralleled the phenotypic correlations, with the largest values for low-to-intermediate wavelengths occurring on dates towards the end of the growing season, while plants were in the grain-filling stage [25]. However, there were notable differences. For example, genomic correlations were moderate (ρ_{g}≈0.2) for most wavelengths during early February sampling dates, while phenotypic correlations were near zero; yet, during early March time points, phenotypic correlations between yield and bands around 800 nm were moderate (ρ_{y}≈−0.2) but genomic correlations were near zero. MegaLMM is able to model the discrepancy between genomic and phenotypic correlations, but methods based on the H matrix (e.g., HBLUP) are not.
Genomic prediction of agronomic traits across multi-environment trials
Multi-trait mixed models are also used to analyze data from multi-environment trials to account for genotype-environment interactions and to select the best genotypes in each environment. The Genomes2Fields initiative (https://www.genomes2fields.org/) is an ongoing multi-environment field experiment of maize hybrid genotypes across 20 American states and Canadian provinces. Data from the years 2014–2017 included 119 trials with a total of 2102 hybrids. As in many large-scale multi-environment trials, only a small proportion of the available genotypes were grown in each trial. Therefore, the majority of trial-genotype combinations were unobserved.
We selected four representative agronomically important traits and compared the ability of four modeling approaches to impute the missing measurements. Including across-trial information was beneficial for each of the four traits, suggesting generally positive genetic correlations across trials. However, applying MegaLMM to each of the four trait datasets improved predictions dramatically, with average benefits across trials ranging from ρ_{y}=0.10 to ρ_{y}=0.17 (Fig. 4). The performance of phenix was inconsistent across traits and trials, likely because its model for the non-additive genetic covariance (i.e., the residual) is less flexible than MegaLMM's.
To explore why jointly modeling all genetic and non-genetic covariances for each pair of trials improved prediction accuracy for each trait, we assessed the per-trial differences in performance between MegaLMM and the corresponding within-trial genomic prediction model. Trials varied considerably in how much MegaLMM improved genomic prediction accuracy, with several trials seeing improvements of ρ>0.4. The magnitude of the MegaLMM effect on genomic prediction accuracy was largely explained by the maximum genetic covariance between that trial and any other trial in the dataset (Additional file 1: Figure S6). This is expected because the benefit of an MvLMM is largely dependent on the magnitude of genetic covariances between traits.
A common approach in multi-environment trials is to combine similar trials (based on geographic location or similar environments) into clusters and make genetic value predictions separately for each cluster [30]. However, this will not be successful if clusters cannot be selected a priori, because using the trial data itself to identify clusters can lead to overfitting if not performed carefully [24]. In these data, the distribution of genomic correlations between trials differed among traits, so it is not straightforward to identify which pairs or subsets of trials could be combined. The most obvious predictor of trial similarity is geographic distance, but we did not see consistent spatial patterns in the among-trial covariances across the four traits. The trials with the greatest benefit from our MvLMM showed geographic clustering in the central Midwest for the anthesis-silking interval (ASI) but not for the other three traits (Fig. 5A). Genetic correlations tended to decrease over long distances for ASI and over short distances for plant height, but not for the other two traits (Fig. 5B), resulting in obvious geographic clustering of genetic correlations for ASI but not the other traits (Fig. 5C). This suggests that including all trials together in one model is necessary to maximize the benefit of the MvLMM approach to multi-environment plant breeding.
Discussion
Novel statistical methods can help optimize plant and animal breeding programs to meet future food security needs. In the above examples, we highlighted two areas where large-scale phenotype data can improve the accuracy of genomic prediction in realistic plant breeding scenarios: by incorporating high-throughput phenotyping data from remote sensors, and by synthesizing data on gene-environment interactions across large-scale multi-environment trials. In both examples, we apply high-dimensional multivariate linear mixed models to efficiently integrate all available genotype and phenotype data into genetic value predictions. MegaLMM is a scalable tool that extends the feasible range of input data for multivariate linear mixed models by at least two orders of magnitude over existing methods, while providing the flexibility to plug directly into existing breeding programs.
Computational and statistical efficiency
Computational issues in single-trait LMMs have been studied extensively, allowing implementations for large datasets [14, 21, 22, 31]. Most of these algorithms diagonalize the genomic relationship matrices to improve computational efficiency. This technique dramatically improves the performance of simple, low-dimensional MvLMMs as well (e.g., [14, 23]). However, diagonalization does not address the computational challenge imposed by large trait-covariance matrices, and it can only be applied to models with a single random effect and no missing data. Therefore, these tools cannot be applied to the datasets studied here or, more generally, to most large-scale studies of gene-environment interactions, which frequently have large proportions of missing data [10] (Fig. 1), or to studies with experimental designs that have multiple sources of covariance (e.g., spatial environmental variation or non-additive genetics).
Our work builds on the factor-analytic approach to regularizing MvLMMs [15–18] and is most similar to BSFG [17] and phenix [18], which improve upon traditional quantitative genetic factor models by specifying sparse or low-rank factor matrices to improve robustness in high dimensions. Importantly, however, these models are limited to a single random effect and are not tractable for datasets with large numbers of traits, because of computational inefficiencies (BSFG) or a lack of strong regularization on the residual covariance matrix (phenix). MegaLMM generalizes both methods and dramatically improves on their weaknesses, allowing analyses with >20,000 traits to be completed in less than one day. Since MegaLMM scales approximately linearly with the number of traits (Fig. 2), applying it to datasets with many more traits may be feasible. While we have designed many of our routines to take advantage of multi-core CPUs, graphics processing units may offer additional performance gains.
Two key advantages of MegaLMM are its flexibility and generality. We have designed the MegaLMM R package to be as general as possible so that it can be applied to a wide array of problems in quantitative genetics. MegaLMM tolerates unbalanced designs with incomplete observations (something that makes MCMCglmm and MTG2 very slow), arbitrarily complex fixed effect specifications to model experimental blocks, covariates, or other sources of variation among samples (unlike phenix), and, most importantly, multiple random effects (unlike phenix, GEMMA, or MTG2). Multiple random effect terms can be used to account for spatially correlated variation across fields, non-additive genetic variation that is not useful for breeding, or to more flexibly model non-linear genetic architectures, as we demonstrated with the approximate RKHS regression approach in the wheat application (Fig. 3). To make multiple-random-effect models computationally efficient, we take our earlier work with LMMs [22] and extend the same discrete estimation procedure to MvLMMs, where the impact on computational efficiency is exponentially greater. Other commonly used tools for fitting MvLMMs, such as ASReml [32], allow more flexibility in the specification of multiple variance-component models with correlated random effects that are not currently possible in MegaLMM. However, these tools do not scale well beyond ≈10 traits, so they are not feasible to apply directly to large-scale datasets in plant breeding.
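The discrete estimation idea can be illustrated with a small sketch: variance component proportions for multiple random effects are restricted to a grid, and one matrix factorization per grid point is pre-computed and cached, then reused across all traits and factors. This is a toy illustration of the general strategy (hypothetical matrices and grid spacing), not the GridLMM or MegaLMM implementation:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n = 100
# Two toy positive semi-definite "kinship-like" covariance matrices.
A1, A2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
K1, K2 = A1 @ A1.T / n, A2 @ A2.T / n

# Discretize the two variance proportions on a coarse grid,
# keeping a positive residual proportion (1 - a - b > 0).
steps = np.arange(0.0, 1.01, 0.25)
grid = [(a, b) for a, b in product(steps, steps) if a + b < 1.0]

# One Cholesky factor per grid point, computed once and cached;
# every subsequent trait/factor update reuses these factors.
cache = {(a, b): np.linalg.cholesky(a * K1 + b * K2 + (1 - a - b) * np.eye(n))
         for a, b in grid}
print(len(cache))
```

Because the expensive factorizations depend only on the grid point and not on the trait, their cost is amortized over thousands of traits, which is where the savings compound as the number of random effects grows.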
Applicability to high-throughput phenotypic data
Large-scale phenotype data collection is rapidly emerging as a standard tool in plant breeding and other fields that use quantitative genetics [1, 33, 34]. These deep phenotyping datasets can be used as high-dimensional features to predict genetic values for agronomically important traits and can serve as substitutes for direct assays where those are more time-consuming or expensive to collect.
Breeding objectives differ from the goals of polygenic risk score predictions for human diseases because the prediction target is not the phenotype of an individual, but its genetic value [24]. Genetic values quantify the expected phenotype of a plant’s offspring, and so exclude impacts of the plant’s own microenvironment on its phenotype [7]. Therefore, accurate genetic value prediction requires models that can distinguish between genetic and nongenetic sources of covariation among traits.
The MvLMM is considered the gold-standard method for isolating genetic correlations from non-genetic correlations in genetic value prediction [10]. However, it has rarely been applied in breeding programs because of the computational challenges associated with estimating multiple large covariance matrices. With high-throughput phenotype (HTP) data, MvLMMs have only been applied directly to sets of ≈2−5 traits. Instead, several authors have used a prior round of feature selection, or have calculated summary statistics of the HTP data to generate model inputs, rather than using the raw high-dimensional data itself (e.g., [3, 12, 35–37]). Other authors have replaced the MvLMM with a direct regression on the HTP data, using techniques such as factorial regression [38], functional regression [39], kernel regression [25], and deep learning [40]. While straightforward to implement, this conditioning on the HTP traits creates a form of collider bias, which can induce genotype-phenotype associations that do not actually exist and impede genetic value predictions [24]. Alternative methods, including IBCF [41] and regularized selection indexes [42], avoid the computational complexities of full MvLMMs, but do not make full use of the trait correlations in the data.
MegaLMM, on the other hand, fits a full MvLMM to an arbitrary number of HTP traits and should be more efficient at leveraging high-dimensional genetic correlations while accounting for non-genetic sources of covariance, particularly for datasets in which HTP traits and focal performance traits are measured on the same plants. Non-genetic correlations will be less important in datasets where these sets of traits are measured on different plots. At least in the wheat breeding trial datasets we examined, the benefit of multi-trait modeling was much greater when traits were partially observed on each individual than when secondary traits were only observed in the training partition. This is expected theoretically and has been demonstrated previously in simulations [24], but the magnitude of the benefit was particularly dramatic here. This suggests that breeding programs should focus on developing HTP technologies that can measure secondary traits on the target individuals; HTP measurements on training individuals may be less useful for prediction applications. Unlike with other methods, including too many traits, or including redundant traits that are highly correlated, is unlikely to significantly impact prediction accuracy, reducing the need to carefully choose a priori which traits to include and which to exclude; MegaLMM allows users to simply include all traits they have at once.
Applicability to multi-environment trial data
The analysis of multi-environment trials presents a separate set of computational and statistical challenges for plant breeders. Multi-environment trials (METs) are necessary because gene-environment interactions (GEIs) often prevent the same variety from performing best in all locations where a crop is grown [10]. However, METs are expensive and logistically difficult. Genomic predictions in METs could reduce the need to test every variety in every environment, allowing smaller individual trials [43].
GEIs can be modeled in two ways: (i) as changes in variety effects on the same trait across environments (i.e., variety-by-environment interactions), or (ii) as a set of genetically correlated traits, with each trait-environment combination considered a different phenotype [10]. When formulated with linear mixed models and random genetic effects, these two approaches are mathematically equivalent. Traditionally, the most common model for analyzing METs has been the AMMI model, in which the genetic effects of each variety in each environment are modeled using a set of products between genetic and environmental vectors [44]. AMMI models are used to rank genotypes in different environments and to identify environmental clusters with similar rankings of varieties. However, AMMI models cannot easily incorporate marker data. When genetic values are treated as random effects, AMMI models become factor models (generally called factor analytic models in this literature) (e.g., [45, 46]), and can incorporate genetic marker data (e.g., [47]). MegaLMM extends this factor-analytic method for analyzing METs, making the approach robust for METs with hundreds or more individual trials.
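The factor-analytic structure referenced here can be written compactly. In the sketch below (a standard formulation with symbols defined for this illustration, not taken verbatim from the cited references), the genetic value of variety i in environment j is built from K environment loadings and variety-specific factor scores, which constrains the genetic covariance among environments to a low-rank-plus-diagonal form:

```latex
g_{ij} \;=\; \sum_{k=1}^{K} \lambda_{jk}\, f_{ik} \;+\; \delta_{ij},
\qquad
\mathbf{G} \;=\; \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi}
```

Here \(\boldsymbol{\Lambda}\) is the matrix of environment loadings \(\lambda_{jk}\), \(f_{ik}\) are the factor scores of variety i, and \(\boldsymbol{\Psi}\) is a diagonal matrix of environment-specific genetic variances from the \(\delta_{ij}\) terms. With K much smaller than the number of environments, the number of genetic covariance parameters grows linearly rather than quadratically with the number of trials.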
A limitation of the AMMI/factor-analytic approach to analyzing METs is that there is no mechanism for extending predictions to new environments outside of those already tested. Even large-scale commercial trials cannot test every field a farmer might use. Several authors have proposed using environmental covariates (ECs) to model environmental similarity in METs and predict GEIs for novel environments (e.g., [47–49]). These approaches all involve regressions of genetic variation on the ECs, and so, if relevant ECs are missing or the relationship between variety plasticity and ECs is non-linear, these models will underfit the GEIs. Nevertheless, these approaches are promising and have been successfully applied to large METs (e.g., [47]). MegaLMM cannot currently incorporate ECs to predict novel environments. However, a possible extension could involve replacing the i.i.d. prior on the elements of the factor loadings matrix with a regression on the ECs. This hybrid of ECs and a full MvLMM could leverage the strengths of both approaches.
Model limitations
While MegaLMM works well across a wide range of applications in breeding programs, our approach does have some limitations.
First, since MegaLMM is built on the GridLMM framework for efficient likelihood calculations [22], it does not scale well to large numbers of observations (in contrast to large numbers of traits) or large numbers of random effects. As the number of observational units increases, MegaLMM's memory requirements increase quadratically because of the requirement to store sets of pre-calculated inverse-variance matrices. Similarly, for each additional random effect term included in the model, memory requirements increase exponentially. Therefore, we generally limit models to fewer than 10,000 observations and only 1 to 4 random effect terms per trait. There may be opportunities to reduce this memory burden if some of the random effects are low-rank; these random effects could then be updated on the fly using efficient routines for low-rank Cholesky updates. We also do not currently suggest including regressions directly on markers, and we have used marker-based kinship matrices here instead for computational efficiency. Therefore, as a standalone prediction method, MegaLMM requires calculations involving the Schur complement of the joint kinship matrix of the testing and training individuals, which can be computationally costly.
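The Schur complement step mentioned above corresponds to the standard conditional-Gaussian extension of genetic values from training to testing individuals. A toy sketch (hypothetical kinship matrix, not MegaLMM's code) shows why it involves solving against the training block of the joint kinship matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n_train, n_test = 80, 20
n = n_train + n_test

# Toy joint kinship matrix over training + testing individuals.
A = rng.normal(size=(n, n))
K = A @ A.T / n + 0.05 * np.eye(n)

# Simulate genetic values for all individuals; only training ones "observed".
u = rng.multivariate_normal(np.zeros(n), K)

# Conditional mean extends genetic values to the testing set:
# u_test_hat = K_te,tr @ K_tr,tr^{-1} @ u_train
u_hat = K[n_train:, :n_train] @ np.linalg.solve(K[:n_train, :n_train],
                                                u[:n_train])

# The prediction-error covariance is the Schur complement of the
# training block, which is the costly part for large sample sizes.
S = K[n_train:, n_train:] - K[n_train:, :n_train] @ np.linalg.solve(
    K[:n_train, :n_train], K[:n_train, n_train:])
print(u_hat.shape, S.shape)
```

The solve against the n_train × n_train training block scales cubically in the number of training individuals, which is the cost referred to in the text.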
Second, MegaLMM is inherently a linear model and cannot effectively model trait relationships that are non-linear. Some non-linear relationships between predictor variables (like genotypes) and traits can be modeled through non-linear kernel matrices, as we demonstrated with the RKHS application to the bread wheat data. However, allowing non-linear relationships among traits is currently beyond the capacity of our software and modeling approach. Extending our mixed-effect model on the low-dimensional latent factor space to a non-linear modeling structure like a neural network may be an exciting area for future research. Also, some sets of traits may not have low-rank correlation structures that are well approximated by a factor model. For example, certain autoregressive dependence structures are full-rank and cannot efficiently be decomposed into a discrete set of factors.
Nevertheless, we believe that in its current form, MegaLMM will be useful to a wide range of researchers in quantitative genetics and plant breeding.
Potential extensions
Beyond the examples we show in this work, the scalability and statistical power of MegaLMM can open up new avenues for innovation in genomic prediction applications across the fields of quantitative genetics, both in breeding programs as we have described here and, potentially, in human genetics. Genomic prediction is also used for the calculation of polygenic risk scores for complex human traits and diseases [50]. MegaLMM may help leverage past case histories, survey responses, molecular tests, and the genetic architecture of other correlated traits to provide a more comprehensive multi-trait polygenic risk score (e.g., [51]).
We have focused here on simple scalar phenotypes: the expression of a single gene, the total grain yield, and individual measures of agronomic performance. However, many important traits in plants, animals, and humans cannot easily be reduced to a scalar value. Examples include time-series traits such as growth curves [52], metabolic traits such as the relative concentrations of different families of metabolites [53], and morphological traits such as shape or color [54]. Each of these traits can be decomposed into vectors of interrelated components, but treating these components as independent prediction targets using existing univariate LMM or low-dimensional MvLMM genomic prediction tools is inefficient because of their statistical dependence. MegaLMM can be adapted to make joint predictions on vectors of hundreds or thousands of correlated trait components, which could be fed into high-dimensional selection indices for efficient selection of these important plant characteristics. In human genetics, MegaLMM may provide a way to derive multi-ethnic polygenic risk scores [55] by treating outcomes within each ethnic, geographic, or other stratified population group as correlated traits, similar to the analysis of the multi-environment trials above.
MegaLMM should be straightforward to extend to more flexible genetic models including the Bayesian Alphabet family of mixture priors on marker effect sizes. These effects can be incorporated into the parameters B_{2R} and B_{2F} by adapting the prior structure. This will be further explored in future manuscripts.
Lastly, we have only focused on Gaussian MvLMMs, in which observations are assumed to marginally follow a Gaussian distribution. However, many other types of data require more flexible models. It should be possible to extend MegaLMM to the broader family of generalized LMMs. These approaches model the relationships among predictor variables in a latent space, which is then related to the observed data through a link function and an exponential family error distribution. More generally, link functions could be any nonlinear function of multiple parameters, such as a polynomial or spline basis, or a mechanistic model. In this case, we would model the correlations among model parameters on this link scale and then use the link function to relate the latent-scale variables to the observed data. Extending MegaLMM to accommodate such generalized LMM structures would require new sampling steps in our MCMC algorithm (see Methods), but we do not see any conceptual challenges with this approach.
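The latent-scale idea behind this proposed extension can be illustrated with a minimal simulation sketch. This is a hypothetical toy example, not functionality of the MegaLMM package: correlations among traits are induced on the link scale by a shared latent factor, then related to count observations through a log link and Poisson error distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 500, 3

# Link-scale values: correlated across traits via one shared latent factor,
# analogous to how MegaLMM models correlations on the latent scale.
f = rng.normal(size=(n, 1))                    # latent factor scores
lam = np.array([[0.8, 0.5, -0.3]])             # factor loadings (1 x t)
eta = f @ lam + 0.2 * rng.normal(size=(n, t))  # link-scale linear predictor

# Observed data: link function (exp) + exponential-family error (Poisson).
y = rng.poisson(np.exp(eta))

# Correlations are specified on the link scale and propagate to the counts.
print(np.corrcoef(eta, rowvar=False).round(2))
```

The sign and strength of the among-trait correlations are controlled entirely by the loadings `lam`, which is the structure the proposed generalized extension would place priors on.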
Conclusions
MegaLMM is a flexible and powerful framework for the analysis of very high-dimensional datasets in genetics. Multivariate linear mixed models are widely used for analyzing correlated traits, but have been limited to a maximum of a dozen or so traits at a time by the curse of dimensionality. We developed a novel reparameterization of the MvLMM that allows powerful statistical regularization and efficient computation with thousands of traits. When applied to real plant breeding objectives, MegaLMM efficiently leverages information across traits to improve genetic value predictions. Our open-source software package will enable users to apply and extend this method in many directions, opening up new areas of research and development in breeding programs.
Methods
Multivariate linear mixed models
Multivariate linear mixed models (MvLMMs) are widely used to model multiple sources of covariance among related observations. Let the n×t matrix Y represent observations on t traits for n observational units (i.e., individual plants, plots, or replicates). A general MvLMM takes the following form:

\(\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{Z}\mathbf{U} + \mathbf{E} \qquad (1)\)

where X is an n×b matrix of “fixed” effect covariates with effect sizes matrix B, U is an r×t matrix of random effects for each of the t traits, with corresponding random effect design matrix Z, and E is an n×t matrix of residuals for each of the t traits.
MegaLMM uses this formulation to accommodate a large number of designs through different specifications of X and Z, and different priors on B, U, and E. The distinction between “fixed” and “random” effects in Bayesian mixed models is not well-defined because every parameter requires a prior. However, we use the following distinction here: “fixed” effects are covariates assigned flat (i.e., infinite variance) priors or priors with independent variances on each coefficient; “random” effects, in contrast, are grouped in sets that can be thought of as (possibly correlated) samples from a common population distribution. Generally, “fixed” effects are used to model experimental design terms such as blocks, time, sex, etc., genetic principal components, or specific genetic markers, while “random” effects are used to model genetic values, spatial variation, or related effects.
An important feature of MegaLMM is that multiple random effect terms can be included in the model. We specify this as:

\(\mathbf{Z}\mathbf{U} = \sum_{m=1}^{M} \mathbf{Z}_{m}\mathbf{U}_{m}\)

where each Z_{m} is an n×r_{m} design matrix for a set of related parameters with corresponding coefficient matrix U_{m}. For example, U_{1} may model additive genetic values for each individual, while U_{2} may model spatial environmental effects for each individual. The distribution of each random effect coefficient matrix is \(\mathbf {U}_{m} \sim \mathcal {N}(\boldsymbol {0},\mathbf {K}_{m},\mathbf {G}_{m})\), where \(\mathcal {N}(\mathbf {M},\boldsymbol {\Sigma },\boldsymbol {\Psi })\) is the matrix normal distribution with mean matrix M, among-row covariance K_{m}, and among-column (i.e., among-trait) covariance G_{m}. We assume that both Z_{m} and K_{m} are known, while G_{m} is unknown and must be learned from the data. Note that K_{m} must be positive semi-definite, while G_{m} is positive-definite. The covariance among different coefficient matrices is assumed to be zero.
To complete the specification of the MvLMM, we assign the residual matrix the distribution \(\mathbf {E} \sim \mathcal {N}(\boldsymbol {0},\mathbf {I}_{n},\mathbf {R})\), where I_{n} is the n×n identity matrix and R is an unknown t×t positive-definite covariance matrix.
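The matrix normal draws that make up this model can be simulated directly with Cholesky factors of the row and column covariances. The following is a minimal sketch with simulated K, G, and R (a toy illustration of the model's generative structure, not MegaLMM's sampler; K is made strictly positive-definite here for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
n, t = 400, 5

def rmatnorm(rng, row_cov, col_cov):
    """Draw one sample from MN(0, row_cov, col_cov) via Cholesky factors."""
    Lr = np.linalg.cholesky(row_cov)
    Lc = np.linalg.cholesky(col_cov)
    Z = rng.normal(size=(row_cov.shape[0], col_cov.shape[0]))
    return Lr @ Z @ Lc.T

# Toy kinship-like matrix K and among-trait covariances G and R.
A = rng.normal(size=(n, n))
K = A @ A.T / n + np.eye(n)
G = 0.5 * np.eye(t) + 0.5    # genetic covariance among traits (0.5 off-diagonal)
R = np.eye(t)                # residual covariance among traits

U = rmatnorm(rng, K, G)            # random effects, U ~ MN(0, K, G)
E = rmatnorm(rng, np.eye(n), R)    # residuals,      E ~ MN(0, I_n, R)
Y = U + E                          # Eq. (1) with X B = 0 and Z = I_n
```

With Z equal to the identity and no fixed effects, Y is one draw from the MvLMM likelihood; the computational difficulty discussed next arises when G and R must instead be estimated from Y.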
Computational challenges with large multitrait mixed models
Fitting Eq. (1) is challenging because the columns of U and E are correlated. This means that data from individual traits (columns of Y) cannot be treated independently. Maximum-likelihood approaches for fitting MvLMMs (e.g., MTG2) compute the full (or restricted) likelihood of Y, which involves calculating the inverse of an nt×nt matrix many times during model optimization. This is computationally prohibitive when n and/or t are large (Fig. 2A). Gibbs samplers (e.g., MCMCglmm) avoid forming and inverting this extremely large matrix, but still require inverting each of the G_{m} and R matrices repeatedly, which remains prohibitive when t is large. Furthermore, the number of parameters in each G_{m} and R grows with the square of t and quickly exceeds the total number of observations (nt) when t is large. This means that G_{m} and R are not identifiable in many datasets, and estimates require strong regularization.
Mixed effect factor model
If both G_{m} and R were diagonal matrices, the t traits would be uncorrelated. Fitting Eq. (1) could then be done in parallel across traits, greatly reducing the computational burden. While we cannot directly decorrelate traits, if we can identify the sources of variation that cause trait correlations, the residuals of each trait on these causal factors will be uncorrelated. We circumvent this issue by reparameterizing Eq. (1) as a factor model, where we introduce a set of unobserved (or latent) factors that account for all sources of correlation among the traits. Conditional on the values of these factors, the model reduces to a set of independent linear mixed models. Our reparameterized multi-trait mixed effect factor model is:

\(\mathbf{Y} = \mathbf{F}\boldsymbol{\Lambda} + \mathbf{X}_{1}\mathbf{B}_{1} + \mathbf{X}_{2}\mathbf{B}_{2R} + \mathbf{Z}\mathbf{U}_{R} + \mathbf{E}_{R}, \qquad \mathbf{F} = \mathbf{X}_{2}\mathbf{B}_{2F} + \mathbf{Z}\mathbf{U}_{F} + \mathbf{E}_{F} \qquad (2)\)

where F is an n×K matrix of latent factors, Λ is a K×t factor loadings matrix, X=[X_{1},X_{2}] is a partition of the n×b fixed effect covariate matrix between the b_{1} covariates with improper priors and the b_{2}=b−b_{1} covariates with proper priors, and the U_{R} and U_{F} coefficient matrices are specified as:

\(\mathbf{Z}\mathbf{U}_{R} = \sum_{m=1}^{M} \mathbf{Z}_{m}\mathbf{U}_{Rm}, \qquad \mathbf{Z}\mathbf{U}_{F} = \sum_{m=1}^{M} \mathbf{Z}_{m}\mathbf{U}_{Fm}.\)

The distributions of the random effects are specified as:

\(\mathbf{U}_{Rm} \sim \mathcal{N}(\mathbf{0},\mathbf{K}_{m},\boldsymbol{\Psi}_{Rm}), \quad \mathbf{U}_{Fm} \sim \mathcal{N}(\mathbf{0},\mathbf{K}_{m},\boldsymbol{\Psi}_{Fm}), \quad \mathbf{E}_{R} \sim \mathcal{N}(\mathbf{0},\mathbf{I}_{n},\boldsymbol{\Psi}_{RE}), \quad \mathbf{E}_{F} \sim \mathcal{N}(\mathbf{0},\mathbf{I}_{n},\boldsymbol{\Psi}_{FE}),\)

where Ψ_{Rm}, Ψ_{Fm}, Ψ_{RE}, and Ψ_{FE} are all diagonal matrices. Diagonal elements of Ψ_{Fm} and Ψ_{FE} are non-negative, while diagonal elements of Ψ_{Rm} and Ψ_{RE} are strictly positive.
Conditional on F and Λ, the t columns of Y are uncorrelated, and each can be fit to the remaining terms independently. Similarly, the K columns of F are also uncorrelated and can be modeled independently as well. Therefore, we can fit Eq. (2) without calculating the inverse of any t×t matrix, and many calculations can be done in parallel across different CPU cores.
As long as K is sufficiently large, Eq. (2) is simply a reparameterization of Eq. (1). To see how Eq. (2) can represent the terms of Eq. (1), let:

\(\mathbf{B}_{2} = \mathbf{B}_{2R} + \mathbf{B}_{2F}\boldsymbol{\Lambda}, \qquad \mathbf{U}_{m} = \mathbf{U}_{Rm} + \mathbf{U}_{Fm}\boldsymbol{\Lambda}, \qquad \mathbf{E} = \mathbf{E}_{R} + \mathbf{E}_{F}\boldsymbol{\Lambda}.\)

Based on the properties of matrix normal random variables, we can integrate over U_{R}, U_{F}, E_{R}, and E_{F} to calculate the distributions of each U_{m} and E as:

\(\mathbf{U}_{m} \sim \mathcal{N}(\mathbf{0},\mathbf{K}_{m},\boldsymbol{\Psi}_{Rm} + \boldsymbol{\Lambda}^{\intercal}\boldsymbol{\Psi}_{Fm}\boldsymbol{\Lambda}), \qquad \mathbf{E} \sim \mathcal{N}(\mathbf{0},\mathbf{I}_{n},\boldsymbol{\Psi}_{RE} + \boldsymbol{\Lambda}^{\intercal}\boldsymbol{\Psi}_{FE}\boldsymbol{\Lambda}).\)

Therefore, each G_{m} is reparameterized as \(\boldsymbol {\Psi }_{Rm} + \boldsymbol {\Lambda }^{\intercal } \boldsymbol {\Psi }_{Fm} \boldsymbol {\Lambda }\) and R is reparameterized as \(\boldsymbol {\Psi }_{RE} + \boldsymbol {\Lambda }^{\intercal } \boldsymbol {\Psi }_{FE} \boldsymbol {\Lambda }\), such that all off-diagonal elements of each matrix are controlled by Λ.
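This reparameterization of the among-trait covariance can be checked numerically. The following is a sketch with small arbitrary matrices: rows of a simulated coefficient matrix are built as the sum of a trait-specific part and a factor part, and their empirical covariance is compared to the implied form Ψ_{R} + ΛᵀΨ_{F}Λ (Monte Carlo, so agreement is approximate):

```python
import numpy as np

rng = np.random.default_rng(2)
K_factors, t = 4, 3

Lam = rng.normal(size=(K_factors, t))                   # loadings Lambda (K x t)
Psi_R = np.diag(rng.uniform(0.5, 1.0, size=t))          # trait-specific variances
Psi_F = np.diag(rng.uniform(0.5, 1.0, size=K_factors))  # factor variances

# Each simulated row is u_R + u_F Lambda, with independent diagonal-covariance parts.
S = 200_000
u_R = rng.normal(size=(S, t)) * np.sqrt(np.diag(Psi_R))
u_F = rng.normal(size=(S, K_factors)) * np.sqrt(np.diag(Psi_F))
u = u_R + u_F @ Lam

G_implied = Psi_R + Lam.T @ Psi_F @ Lam       # analytic among-trait covariance
G_empirical = np.cov(u, rowvar=False)         # Monte Carlo estimate
print(np.abs(G_implied - G_empirical).max())  # small; shrinks as S grows
```

Note how all off-diagonal entries of `G_implied` come from the `Lam.T @ Psi_F @ Lam` term alone, which is the sense in which Λ controls all trait correlations.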
Although these equations appear to imply that our mixed effect factor model constrains B, U, and E (and thus each G_{m} and R) to be correlated due to the shared dependence on Λ, this is not necessarily the case. When any diagonal element of any Ψ_{Fx} matrix is set to zero, the corresponding row of Λ does not contribute to that term. If at least t linearly independent rows of Λ contribute to each matrix, then any set of positive-definite matrices can be represented as above. Therefore, we can represent any set of positive-definite matrices G_{m} and R with our model as long as K ≥ t(M+1).
Of course, the reason that we parameterize our model in this way is that we do expect some correlation among the genetic and residual covariance matrices. From a statistical perspective, when it is reasonable (given the data) to use the same row of Λ for multiple covariance matrices, we can save parameters in the model. From a biological perspective, shared factors provide a biologically realistic explanation for correlations among traits. If we consider the columns of F to be K traits that simply have not been observed, then it is reasonable to propose that each of these traits is regulated by the same sources of genetic and environmental variation as any of the observed traits.
In Eq. (2), the K latent traits (F) are the key drivers of all phenotypic covariation among the t observed traits (Y). These latent traits may not account for all variation in the observed traits. But, by definition, this residual variation (e.g., measurement errors in each trait) is unique to each trait and uncorrelated with the residual variation in other traits.
Prior parameterization
The intuitive structure of the mixed effect factor model (Eq. (2) and Fig. 1) makes prior specification and elicitation easier than for Eq. (1) because we do not need to define prior distributions for very large covariance matrices directly. Instead, priors on the random effect variance components and fixed effect regression coefficients are separable and can be described independently, while priors on trait correlations are specified indirectly as priors on the factor loading matrix Λ.
In MegaLMM, we have chosen functional forms for each prior parameter that balance interpretability (for accurate prior elicitation) with compatibility with efficient computational approaches. For the variance components, we use the non-parametric discrete prior on variance proportions that we previously introduced in GridLMM [22], which approximates nearly any joint distribution for multiple random effects. For the factor loadings matrix and matrices of regression coefficients, we use a two-dimensional global-local prior based on the horseshoe prior [56], parameterized in terms of the effective number of non-zero coefficients. For the factor loadings matrix specifically, our prior achieves both regularization and interpretability of the factor traits without having to carefully specify K itself. Full details of each prior distribution are provided in Additional file 1; Table S1 lists the default hyperparameters for each prior used in the analyses reported here and provided as defaults in the MegaLMM R package.
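The global-local shrinkage behavior underlying such priors can be illustrated with a plain horseshoe draw. This is a generic sketch of the horseshoe of [56], not MegaLMM's two-dimensional variant or its effective-number-of-coefficients parameterization:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 10_000

# Horseshoe: beta_j = tau * lambda_j * z_j, with half-Cauchy local scales.
tau = 0.1                                   # global scale (small => mostly shrunk)
lam = np.abs(rng.standard_cauchy(size=p))   # local scales, half-Cauchy(0, 1)
beta = tau * lam * rng.normal(size=p)

# Most effects sit near zero, while a few heavy-tailed ones escape shrinkage.
near_zero = np.mean(np.abs(beta) < 0.05)
large = np.mean(np.abs(beta) > 1.0)
print(near_zero, large)
```

The heavy Cauchy tail of the local scales is what lets a handful of loadings stay large while the global scale pulls the bulk toward zero, which is the regularization property exploited for the factor loadings matrix.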
Computational details and posterior inference
We use a carefully constructed MCMC algorithm to draw samples from the posterior distribution of each model parameter. We gain efficiency in both per-iteration computational time and in effective samples per iteration through careful use of diagonalization, sparse matrix algebra, parallelization, and integration (or partial collapsing). In particular, our algorithm synthesizes and extends several recent innovations in computational approaches to linear mixed models [17, 20, 22, 57]. Full details of the computational algorithm are provided in Additional file 1: Supplemental Methods.
Data analyses
We demonstrate MegaLMM using three example datasets.
Scaling performance with gene expression data
To compare the scalability of MegaLMM to other multi-trait mixed model programs, we used a large gene expression dataset of 24,175 genes across 728 Arabidopsis thaliana accessions. We downloaded the data from NCBI GEO [58, 59] and removed genes with average counts <10. We then normalized and variance-stabilized the counts using the varianceStabilizingTransformation function from DESeq2 [60]. We downloaded a corresponding genomic relationship matrix K from the 1001 genomes project [61] and subsetted to the 665 individuals present in both datasets.
We generated datasets of varying sizes from t=2 to t=24,175 genes by random sampling. We selected one gene as the “focal” trait in each dataset, masked 50% of its values, fit the model in Eq. (1) to the remaining data using four representative MvLMM programs, and used the results to predict the genetic values of each masked individual for this “focal” gene. Prediction accuracies were estimated as \(\rho_{g} = \text{cor}_{g}(\hat{\mathbf{u}},\mathbf{y})\sqrt{h^{2}(\hat{\mathbf{u}})}\), where cor_{g} is the estimated genetic correlation evaluated in the testing lines only, and \(h^{2}(\hat{\mathbf{u}})\) is the heritability of the predictor \(\hat{\mathbf{u}}\) estimated using a univariate LMM [6, 42]. The simpler Pearson’s correlation estimate of prediction accuracy is not valid in these data because all genes were measured together in the same sample, and therefore some correlation among genes is caused by non-genetic factors [24]. The four MvLMM prediction methods were:

1.
MTG2 [23]: a restricted maximum-likelihood method written in Fortran. We pre-calculated the eigenvalue decomposition of K, so this additional time is not included in the results. MTG2 does not work well with a high percentage of missing data, so genetic value predictions were made with the two-step approach from [24], which involves estimating model parameters only from the individuals with complete observations and then incorporating secondary trait values of the new individuals in the second step.

2.
MCMCglmm [62]: a Bayesian MCMC algorithm largely written in C++. We used “default” priors for R and G with diagonal means and ν=p, and ran a single MCMC chain for 7000 iterations, discarding the first 5000 samples as burn-in. To speed up calculations (and make the timing results more comparable with the MegaLMM algorithm), we rotated the input data by pre-multiplying by the eigenvectors of K so that the input relationship matrix was diagonal. Since this matrix rotation is only possible with complete data, we again used the two-step multi-trait prediction approach [24].

3.
phenix [18]: a variational Bayes algorithm written in R that uses a low-rank representation of G but a full-rank prior for R. We set the maximum number of factors to p/4 and used the eigendecomposition of K as the input. Again, we excluded this calculation from the time estimates.

4.
MegaLMM: we ran MegaLMM using “default” priors with K = min(n/4, p/2) and collected 6000 MCMC samples, discarding the first 5000 as burn-in. We excluded the preparatory calculations, including only the MCMC iterations in the time calculations. For small datasets, these preparatory calculations were significant, but they were a minuscule part of the analyses of larger datasets.
Each method was run 20 times on different randomly sampled datasets. For the two MCMC methods, we estimated the effective sample size of each element of U using the ess_bulk function of the rstan package [63], and used this to estimate the time necessary for the effective sample size to be at least 1000 for 90% of the u_{ij}. We ran MTG2 and MCMCglmm for datasets up to t=64 because computational times were prohibitively long for larger datasets. We linearly extrapolated the (log) computational times for these methods out to t=512 for comparisons. phenix fails when t≥n, so this method is limited to t<665 in this dataset.
To assess the accuracy of each method for estimating genetic and nongenetic covariances, we generated new datasets with 128 genes by calculating empirical correlation matrices for G and R from two separate samples of 128 genes from the full expression dataset, and then generating genetic and residual values for 128 traits from multivariate normal distributions based on these correlation matrices. For each trait, we converted the correlation matrices into covariance matrices by sampling an independent heritability value for each trait between 0.1 and 0.8. We then estimated the genetic and residual covariance matrices for subsets of these simulated datasets using each of the four above methods. In this example, we found that setting K larger (2p) gave better results, probably because the G and R matrices were largely uncorrelated and so independent factors were needed to model the two sets of covariances. Accuracy was reported as the Pearson correlation between the estimated covariance parameters and the true covariance parameters (excluding the variance parameters on the diagonal).
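The masking design and the accuracy estimate used in this comparison can be sketched as follows. This is a schematic with a random stand-in matrix and a hypothetical `accuracy` helper; the `cor_g` and `h2_uhat` inputs would come from the bivariate and univariate LMM fits described above:

```python
import numpy as np

rng = np.random.default_rng(4)
n, t = 665, 256
Y = rng.normal(size=(n, t))          # stand-in for the expression matrix

focal = rng.integers(t)              # one randomly chosen "focal" gene
test_idx = rng.choice(n, size=n // 2, replace=False)  # 50% of accessions

Y_masked = Y.copy()
Y_masked[test_idx, focal] = np.nan   # mask focal-gene values in the test set

def accuracy(cor_g, h2_uhat):
    """rho_g = cor_g(u_hat, y) * sqrt(h2(u_hat)), as defined in the text."""
    return cor_g * np.sqrt(h2_uhat)
```

Each method is then fit to `Y_masked`, and `accuracy` is evaluated on the held-out lines only.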
Wheat yield prediction using hyperspectral data
We used data from a bread wheat breeding trial to demonstrate how MegaLMM can leverage “secondary” traits from high-throughput phenotyping technologies to better predict genetic values of a single target trait. We downloaded grain yield and hyperspectral reflectance data from the bread wheat trials at the Campo Experimental Norman E. Borlaug in Ciudad Obregón, México reported in [25] [64]. We selected the 2014-2015 Optimal Flat site-year for our main analysis because it had among the greatest number of hyperspectral reflectance data points, and [25] reported relatively low predictive accuracy for grain yield in this site-year. Best linear unbiased estimates (BLUEs) and best linear unbiased predictors (BLUPs) of the line means for grain yield (GY) and 62 hyperspectral bands collected at each of 10 timepoints during the growing season, and genotype data from 8519 markers, were provided for 1,092 lines in this trial. Results for all other trials are presented in Additional file 1: Figure S5.
We compared eight methods for predicting the GY trait based on the genetic marker and hyperspectral data. The first five were “standard” methods using state-of-the-art models for genomic prediction. The final three were new models implemented within the MegaLMM framework.

1.
GBLUP: implemented using the R package rrBLUP [65], with the genomic relationship matrix K calculated with the A.mat function of rrBLUP as in [66].

2.
Bayesian Lasso (BL): implemented using the R package BGLR [67]. We first removed markers with >50% missing data and imputed the remaining missing genotypes with the population mean allele frequency. We used the default prior parameters for the Bayesian Lasso in BGLR, and collected 9,000 posterior samples with a thinning rate of 5 after a 5,000-iteration burn-in.

3.
RKHS: implemented using rrBLUP. We used the same thinned and imputed genotype dataset as for the BL method to calculate a genomic distance matrix (D). We also used the default theta.seq parameter to automatically choose the scale parameter of the Gaussian kernel.

4.
HBLUP: implemented using the R package lme4qtl. This replicates the analysis reported by [25], which uses the GBLUP method but replaces the genomic relationship matrix described above with H, a hyperspectral reflectance relationship matrix calculated as \(\mathbf {H} = \mathbf {S}\mathbf {S}^{\intercal }/620\), where S is a matrix of centered and standardized BLUEs of hyperspectral bands from each timepoint.

5.
GBLUP+H: implemented in the R package lme4qtl [68]. This is a two-kernel method, where we use two relationship matrices: K and H. This method is analogous to the methods proposed by [25] for leveraging the hyperspectral data in prediction; however, those authors only used two-kernel methods for G×E prediction across site-years. Since lme4qtl does not predict random effects for unmeasured observations, we formed predictions as: \(\mathbf {K}_{no}\mathbf {K}^{-1}_{oo}\hat {\mathbf {u}}_{ko} + \mathbf {H}_{no}\mathbf {H}^{-1}_{oo}\hat {\mathbf {u}}_{ho}\), where K_{no} is the n_{n}×n_{o} quadrant of K specifying the genomic relationships between the n_{n} “new” unobserved lines and the n_{o} “old” observed lines, K_{oo} is the n_{o}×n_{o} quadrant of K specifying the genomic relationships among the observed lines, \(\hat {\mathbf {u}}_{ko}\) is the vector of BLUPs for the genomic random effect, and H_{no}, H_{oo}, and \(\hat {\mathbf {u}}_{ho}\) are the analogous quantities for the hyperspectral random effect.

6.
MegaLMM-GBLUP: we modeled the combined trait data Y=[y,S] with the model specified in Eq. (2), using a single random effect with relationship matrix K as above and no fixed effects besides an intercept (X was a column of ones and X_{2} had zero columns). We ran MegaLMM with K=100 factors, “default” priors (see Additional file 1: Table S1), and two partitions of the trait data (the first containing grain yield with the masked training set as described below, and the second containing all 620 hyperspectral bands with complete data). We collected 500 posterior samples of the quantity u_{1}=u_{R1}+(U_{F}λ_{1}) at a thinning rate of 2, discarding the first 1,000 iterations as burn-in.

7.
MegaLMM-RKHS: we implemented a multi-trait RKHS regression model using the “kernel averaging” method proposed by [28]. We standardized D based on its mean (squared) value and placed a uniform prior on the set of scaling factors h={1/5,1,5}, which we implemented by calculating three corresponding relationship matrices K_{1},…,K_{3} and specifying three random effects in Eq. (2). We again used “default” priors and K=100 factors, and treated only the global per-trait intercept as a fixed effect. We collected 500 posterior samples of the quantity Zu_{1}=Zu_{R1}+Z(U_{F}λ_{1}) at a thinning rate of 2, discarding the first 1000 iterations as burn-in.

8.
MegaLMM-GBLUP-CV1: we repeated the MegaLMM-GBLUP method above, but this time without partitioning the trait data. Instead, we masked both the grain yield and the 620 hyperspectral band data from the testing set so that all lines in the training data had complete data. Predictions of the genetic values were calculated identically to above.
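The conditional prediction step used in the GBLUP+H method above can be sketched for a single kernel. This is a toy illustration with a simulated positive-definite K and stand-in BLUPs (the H term is handled identically); a linear solve replaces the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(5)
n_o, n_n = 80, 20
n = n_o + n_n

# Toy positive-definite relationship matrix over old (observed) + new lines.
A = rng.normal(size=(n, n))
K = A @ A.T / n + np.eye(n)

K_oo = K[:n_o, :n_o]             # relationships among observed lines
K_no = K[n_o:, :n_o]             # new lines x observed lines

u_hat_o = rng.normal(size=n_o)   # stand-in for BLUPs of the observed lines

# u_hat_new = K_no K_oo^{-1} u_hat_o  (one term of the two-kernel predictor)
u_hat_new = K_no @ np.linalg.solve(K_oo, u_hat_o)
```

Summing this quantity with the analogous H-based term gives the two-kernel prediction for the unobserved lines.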
We used cross-validation to evaluate the prediction accuracy of each method. We randomly selected 50% of the lines for model training and 50% for testing, and masked the GY observations for the testing lines. We fit each model to the partially masked dataset and collected the predictions of GY for the testing lines. We estimated prediction accuracy as \(\rho _{g} = \text {cor}_{g}(\hat {\mathbf {u}},\mathbf {y})\sqrt {h^{2}(\hat {\mathbf {u}})}\) because the hyperspectral reflectance data were collected on the same plots as the GY data, and therefore non-genetic (i.e., micro-environmental) factors that affect both reflectance and yield may induce non-genetic correlations among traits [24]. BLUPs were used as the predictand, except in the 2016-17 year, when the BLUPs were poorly correlated with the BLUEs, suggesting data quality issues. We used a 50:50 training-testing split of the data to ensure that cor_{g} could be estimated accurately in the testing partition. This cross-validation algorithm was repeated 20 times with different random partitions.
Maize trait imputation in multi-environment trials
We used data on maize hybrids from the Genomes To Fields Initiative experiments to demonstrate how MegaLMM can leverage genetic correlations across locations in multi-environment trials. We downloaded the agronomic data from the 2014-2017 field seasons from the CyVerse data repository [69], along with corresponding genomic data. We used TASSEL5 [70] to build a kinship matrix for each hybrid genotype using the Centered IBS routine.
A total of 2012 non-check hybrids with phenotype and genotype data from 108 trials (i.e., site-years) were available. We selected four representative agronomic traits: plant height (cm), grain yield (bushels/acre), days-to-silking (days), and the anthesis-silking interval (ASI, days). For each trait in each site-year, we calculated BLUPs for all observed genotypes using the R package lme4 [71], with Rep and Block:Rep as fixed effects to account for the experimental design in each field, and formed them into 2012×108 BLUP matrices for each trait. We then dropped site-years where the BLUP variance was zero or which had fewer than 50 tested lines. On average, ≈12% of hybrid-site-year combinations were observed across each of the four BLUP matrices. We then used four methods to predict the BLUPs of hybrids that were not grown in each trial:

1.
GBLUP (univariate): missing values were imputed separately for each site:year using the mixed.solve function of the rrBLUP package.

2.
GBLUP (env BLUPs): genetic values for each hybrid were assumed to be constant across all siteyears. We estimated these in two steps. In the first step, we estimated hybrid main effects treating lines as independent random effects using lme4, with site:year included as a fixed effect. In the second step, we estimated genetic values using the mixed.solve function of the rrBLUP package.

3.
phenix: we used phenix to impute missing observations in Y using K as a relationship matrix.

4.
MegaLMM: we fit the model specified in Eq. (2) to the full matrix Y, with K=50 factors and “default” priors. Here, we partitioned Y into 4 sets based on year to minimize the number of missing observations to condition on during the MCMC. We collected 1000 posterior samples of imputed values \(\widetilde {\mathbf {Y}} = \mathbf {X}_{1}\mathbf {B}_{1} + \mathbf {F}\boldsymbol {\Lambda } + \mathbf {Z}\mathbf {U}_{R}\) with a thinning rate of 2, after discarding the first 5000 iterations as burn-in.
We estimated prediction accuracy of each method using cross-validation. For each of 20 replicate cross-validation runs per model, we randomly masked 20% of the non-missing BLUPs, and then calculated the Pearson’s correlation between these “observed” values and the imputed values of each method. Pearson’s correlation is appropriate as an estimate of genomic prediction accuracy in this case because different plants were used in each trial, so there is no non-genetic source of correlation among site-years that may bias accuracy estimates.
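The masking and scoring scheme for this cross-validation can be sketched as follows. This is a schematic on a random stand-in BLUP matrix with a hypothetical `pearson_accuracy` helper; the imputed values would come from one of the four methods above:

```python
import numpy as np

rng = np.random.default_rng(6)
Y = rng.normal(size=(200, 30))             # stand-in hybrid x site-year BLUP matrix
Y[rng.random(Y.shape) < 0.5] = np.nan      # ~50% of hybrid-site-year cells unobserved

# Mask 20% of the non-missing cells, keeping the held-out truth for scoring.
obs = np.argwhere(~np.isnan(Y))
held = obs[rng.choice(len(obs), size=len(obs) // 5, replace=False)]
truth = Y[held[:, 0], held[:, 1]].copy()
Y_train = Y.copy()
Y_train[held[:, 0], held[:, 1]] = np.nan   # Y_train is passed to the imputation model

def pearson_accuracy(imputed, truth):
    """Pearson correlation between imputed and held-out observed values."""
    return np.corrcoef(imputed, truth)[0, 1]
```

Repeating this masking with fresh random draws gives the 20 replicate accuracy estimates per method.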
Availability of data and materials
All data used in these analyses were downloaded from the publicly accessible repositories described above. Arabidopsis gene expression data were downloaded from NCBI GEO accession GSE80744, available at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80744 [59]. The Arabidopsis kinship matrix was downloaded from https://1001genomes.org/data/GMI-MPI/releases/v3.1/SNP_matrix_imputed_hdf5/1001_SNP_MATRIX.tar.gz. The wheat dataset was downloaded from the CIMMYT Research Data & Software Repository Network, available at http://hdl.handle.net/11529/10548109 [77]. The maize phenotype data were downloaded from the CyVerse data repository based on the links described in [69]. Genomic data were downloaded from http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Carolyn_Lawrence_Dill_G2F_Nov_2016_V.3/b._2014_gbs_data [78]. Scripts for running analyses are available in the GitHub repository https://github.com/deruncie/MegaLMM_analyses/tree/v0.9.2. The R package for MegaLMM is available at https://github.com/deruncie/MegaLMM/tree/v0.9.3 and is licensed under the MIT license. The specific versions of the scripts and package code are archived at Zenodo [79, 80].
References
Araus JL, Kefauver SC, Zaman-Allah M, Olsen MS, Cairns JE. Translating High-Throughput Phenotyping into Genetic Gain. Trends Plant Sci. 2018; 23(5):451–66.
Koltes JE, Cole JB, Clemmens R, Dilger RN, Kramer LM, Lunney JK, McCue ME, McKay SD, Mateescu RG, Murdoch BM, Reuter R, Rexroad CE, Rosa GJM, Serão NVL, White SN, Woodward-Greene MJ, Worku M, Zhang H, Reecy JM. A vision for development and utilization of high-throughput phenotyping and big data analytics in livestock. Front Genet. 2019; 10:1197. https://doi.org/10.3389/fgene.2019.01197.
Rutkoski J, Poland J, Mondal S, Autrique E, Pérez LG, Crossa J, Reynolds M, Singh R. Canopy Temperature and Vegetation Indices from High-Throughput Phenotyping Improve Accuracy of Pedigree and Genomic Selection for Grain Yield in Wheat. G3 Genes Genomes Genetics. 2016; 6(9):2799–808.
Neethirajan S. Recent advances in wearable sensors for animal health management. Sens Bio-Sens Res. 2017; 12:15–29.
Schrag TA, Westhues M, Schipprack W, Seifert F, Thiemann A, Scholten S, Melchinger AE. Beyond genomic prediction: combining different types of omics data can improve prediction of hybrid performance in maize. Genetics. 2018; 208(4):1373–85.
Thompson R, Meyer K. A review of theoretical aspects in the estimation of breeding values for multi-trait selection. Livest Prod Sci. 1986; 15(4):299–313.
Bernardo R. Breeding for Quantitative Traits in Plants, vol 1. 2nd ed. Woodbury: Stemma Press; 2010.
Meuwissen TH, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001; 157(4):1819–29.
Henderson CR, Quaas RL. Multiple Trait Evaluation Using Relatives’ Records. J Anim Sci. 1976; 43(6):1188–97.
Piepho HP, Möhring J, Melchinger AE, Büchse A. BLUP for phenotypic selection in plant breeding and variety testing. Euphytica. 2007; 161(1-2):209–28.
Calus MP, Veerkamp RF. Accuracy of multi-trait genomic selection using different methods. Genet Sel Evol. 2011; 43(1):26.
Jia Y, Jannink JL. Multiple-Trait Genomic Selection Methods Increase Genetic Value Prediction Accuracy. Genetics. 2012; 192(4):1513–22.
Johnstone IM, Titterington DM. Statistical challenges of high-dimensional data. Phil Trans Ser A Math Phys Eng Sci. 2009; 367(1906):4237–53.
Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat Methods. 2014; 11(4):407–9.
de Los Campos G, Gianola D. Factor analysis models for structuring covariance matrices of additive genetic effects: a Bayesian implementation. Genet Sel Evol. 2007; 39(5):481–94.
Meyer K. Multivariate analyses of carcass traits for Angus cattle fitting reduced rank and factor analytic models. J Anim Breed Genet. 2007; 124(2):50–64.
Runcie D, Mukherjee S. Dissecting HighDimensional Phenotypes with Bayesian Sparse Factor Analysis of Genetic Covariance Matrices. Genetics. 2013; 194(3):753–67.
Dahl A, Iotchkova V, Baud A, Johansson Å, Gyllensten U, Soranzo N, Mott R, Kranis A, Marchini J. A multiplephenotype imputation method for genetic studies. Nat Genet. 2016; 48(4):466–72.
Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E. Efficient control of population structure in model organism association mapping. Genetics. 2008; 178(3):1709–23.
Zhou X, Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 2012; 44(7):821–4.
Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D. FaST linear mixed models for genome-wide association studies. Nat Methods. 2011; 8(10):833–5.
Runcie D, Crawford L. Fast and flexible linear mixed models for genome-wide genetics. PLoS Genet. 2019; 15(2):e1007978.
Lee SH, van der Werf JHJ. MTG2: an efficient algorithm for multivariate linear mixed model analysis based on genomic information. Bioinformatics. 2016; 32(9):1420–2.
Runcie D, Cheng H. Pitfalls and remedies for cross validation with multi-trait genomic prediction methods. G3 Genes Genomes Genet. 2019; 9(11):3727–41. https://doi.org/10.1534/g3.119.400598.
Krause MR, González-Pérez L, Crossa J, Pérez-Rodríguez P, Montesinos-López O, Singh RP, Dreisigacker S, Poland J, Rutkoski J, Sorrells M, Gore MA, Mondal S. Hyperspectral Reflectance-Derived Relationship Matrices for Genomic Prediction of Grain Yield in Wheat. G3 Genes Genomes Genet. 2019; 9(4):1231–47.
Hayes BJ, Bowman PJ, Chamberlain AC, Verbyla K, Goddard ME. Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet Sel Evol. 2009; 41(1):1–9.
Park T, Casella G. The Bayesian Lasso. J Am Stat Assoc. 2008; 103(482):681–6.
de los Campos G, Gianola D, Rosa GJM, Weigel KA, Crossa J. Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet Res. 2010; 92(4):295–308.
Burgueño J, de los Campos G, Weigel K, Crossa J. Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Sci. 2012; 52(2):707–19. https://doi.org/10.2135/cropsci2011.06.0299.
Piepho HP, Möhring J. Best Linear Unbiased Prediction of Cultivar Effects for Subdivided Target Regions. Crop Sci. 2005; 45(3):1151–9.
Loh PR, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, Finucane HK, Salem RM, Chasman DI, Ridker PM, Neale BM, Berger B, Patterson N, Price AL. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat Genet. 2015; 47(3):284–90.
Gilmour AR. Mixed model regression mapping for QTL detection in experimental crosses. Comput Stat Data Anal. 2007; 51(8):3749–64.
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017; 550(7675):204–13.
Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, Motyer A, Vukcevic D, Delaneau O, O’Connell J, Cortes A, Welsh S, Young A, Effingham M, McVean G, Leslie S, Allen N, Donnelly P, Marchini J. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018; 562(7726):203–9.
Guo G, Zhao F, Wang Y, Zhang Y, Du L, Su G. Comparison of single-trait and multiple-trait genomic prediction models. BMC Genet. 2014; 15(1):30.
Sun J, Rutkoski JE, Poland JA, Crossa J, Jannink JL, Sorrells ME. Multitrait, Random Regression, or Simple Repeatability Model in High-Throughput Phenotyping Data Improve Genomic Prediction for Wheat Grain Yield. Plant Genome. 2017; 10(2).
Crain J, Mondal S, Rutkoski J, Singh RP, Poland J. Combining High-Throughput Phenotyping and Genomic Information to Increase Prediction and Selection Accuracy in Wheat Breeding. Plant Genome. 2018; 11(1):1–14.
van Eeuwijk FA, Bustos-Korts D, Millet EJ, Boer MP, Kruijer W, Thompson A, Malosetti M, Iwata H, Quiroz R, Kuppe C, Muller O, Blazakis KN, Yu K, Tardieu F, Chapman SC. Modelling strategies for assessing and increasing the effectiveness of new phenotyping techniques in plant breeding. Plant Sci. 2019; 282:23–39.
Montesinos-López A, Montesinos-López OA, Cuevas J, Mata-López WA, Burgueño J, Mondal S, Huerta J, Singh R, Autrique E, González-Pérez L, Crossa J. Genomic Bayesian functional regression models with interactions for predicting wheat grain yield using hyperspectral image data. Plant Methods. 2017; 13(1):1.
Cuevas J, Montesinos-López O, Juliana P, Guzman C, Pérez-Rodríguez P, González-Bucio J, Burgueño J, Montesinos-López A, Crossa J. Deep Kernel for Genomic and Near Infrared Predictions in Multi-environment Breeding Trials. G3 Genes Genomes Genet. 2019; 9(9):2913–24.
Juliana P, Montesinos-López OA, Crossa J, Mondal S, González-Pérez L, Poland J, Huerta-Espino J, Crespo-Herrera L, Govindan V, Dreisigacker S, Shrestha S, Pérez-Rodríguez P, Pinto Espinosa F, Singh RP. Integrating genomic-enabled prediction and high-throughput phenotyping in breeding for climate-resilient bread wheat. Theor Appl Genet. 2019; 132(1):177–94.
Lopez-Cruz M, Olson E, Rovere G, Crossa J, Dreisigacker S, Mondal S, Singh R, de los Campos G. Regularized selection indices for breeding value prediction using hyperspectral image data. bioRxiv. 2020:625251.
Heffner EL, Sorrells ME, Jannink JL. Genomic Selection for Crop Improvement. Crop Sci. 2009; 49(1):1–12.
Gauch HG. Model Selection and Validation for Yield Trials with Interaction. Biometrics. 1988; 44(3):705–15.
Piepho HP. Empirical best linear unbiased prediction in cultivar trials using factor-analytic variance-covariance structures. Theor Appl Genet. 1998; 97(1):195–201.
Smith A, Cullis B, Thompson R. Analyzing Variety by Environment Data Using Multiplicative Mixed Models and Adjustments for Spatial Field Trend. Biometrics. 2001; 57(4):1138–47.
Jarquín D, Crossa J, Lacaze X, Du Cheyron P, Daucourt J, Lorgeou J, Piraux F, Guerreiro L, Pérez P, Calus M, Burgueño J, de los Campos G. A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theor Appl Genet. 2014; 127(3):595–607.
Malosetti M, BustosKorts D, Boer MP, van Eeuwijk FA. Predicting Responses in Multiple Environments: Issues in Relation to Genotype × Environment Interactions. Crop Sci. 2016; 56(5):2210–22.
Rincent R, Malosetti M, Ababaei B, Touzy G, Mini A, Bogard M, Martre P, Le Gouis J, van Eeuwijk FA. Using crop growth model stress covariates and AMMI decomposition to better predict genotype-by-environment interactions. Theor Appl Genet. 2019; 132(12):3399–411.
The International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009; 460(7256):748–52.
Turley P, Walters RK, Maghzian O, Okbay A, Lee JJ, Fontana MA, Nguyen-Viet TA, Wedow R, Zacher M, Furlotte NA, et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat Genet. 2018; 50(2):229–37.
Campbell M, Walia H, Morota G. Utilizing random regression models for genomic prediction of a longitudinal trait derived from high-throughput phenotyping. Plant Direct. 2018; 2(9):e00080.
Chan EKF, Rowe HC, Corwin JA, Joseph B, Kliebenstein DJ. Combining genome-wide association mapping and transcriptional networks to identify novel genes controlling glucosinolates in Arabidopsis thaliana. PLoS Biol. 2011; 9(8):e1001125.
Demmings EM, Williams BR, Lee CR, Barba P, Yang S, Hwang CF, Reisch BI, Chitwood DH, Londo JP. Quantitative Trait Locus Analysis of Leaf Morphology Indicates Conserved Shape Loci in Grapevine. Front Plant Sci. 2019; 10:36.
Márquez-Luna C, Loh PR, South Asian Type 2 Diabetes (SAT2D) Consortium, SIGMA Type 2 Diabetes Consortium, Price AL. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet Epidemiol. 2017; 41(8):811–23.
Carvalho CM, Polson NG, Scott JG. The horseshoe estimator for sparse signals. Biometrika. 2010; 97(2):465–80.
Makalic E, Schmidt DF. A Simple Sampler for the Horseshoe Estimator. IEEE Signal Process Lett. 2016; 23(1):179–82.
Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, et al. NCBI GEO: archive for functional genomics data sets — update. Nucleic Acids Res. 2012; 41(D1):991–5.
Huang S, Kawakatsu T, Jupe F, Schmitz R, Urich M, Castanon R, Nery J, Chen H, Ecker J. Epigenomic and genome structural diversity in a worldwide collection of Arabidopsis thaliana. NCBI Gene Expr Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80744. Accessed 05 Sept 2018.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15(12):550. https://doi.org/10.1186/s13059-014-0550-8.
Alonso-Blanco C, Andrade J, Becker C, Bemm F, Bergelson J, Borgwardt KM, Cao J, Chae E, Dezwaan TM, Ding W, et al. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell. 2016; 166(2):481–91.
Hadfield JD. MCMC methods for multi-response generalized linear mixed models: the MCMCglmm R package. J Stat Softw. 2010; 33(1):1–22.
Stan Development Team. RStan: the R interface to Stan. 2019. R package version 2.19.2. http://mc-stan.org/.
Mondal S, Krause M, Juliana P, Poland J, Dreisigacker S, Singh R. Use of hyperspectral reflectance-derived relationship matrices for genomic prediction of grain yield in wheat – data for publication. CIMMYT Res Data Softw Repository Netw. 2018. https://hdl.handle.net/11529/10548109.
Endelman JB. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome. 2011; 4:250–55.
Endelman JB, Jannink JL. Shrinkage Estimation of the Realized Relationship Matrix. G3 Genes Genomes Genet. 2012; 2(11):1405–13.
Pérez P, de los Campos G. Genome-wide regression and prediction with the BGLR statistical package. Genetics. 2014; 198(2):483–95.
Ziyatdinov A, Vazquez-Santiago M, Brunel H, Martinez-Perez A, Aschard H, Soria JM. lme4qtl: linear mixed models with flexible covariance structure for genetic studies of related individuals. BMC Bioinformatics. 2018; 19(1):68. https://doi.org/10.1186/s12859-018-2057-x.
McFarland BA, Al-Khalifah N, Bohn M, Bubert J, Buckler ES, Ciampitti I, Edwards J, Ertl D, Gage JL, Falcon CM, Flint-Garcia S, Gore MA, Graham C, Hirsch CN, Holland JB, Hood E, Hooker D, Jarquín D, Kaeppler SM, Knoll J, Kruger G, Lauter N, Lee EC, Lima DC, Lorenz A, Lynch JP, McKay J, Miller ND, Moose SP, Murray SC, Nelson R, Poudyal C, Rocheford T, Rodriguez O, Romay MC, Schnable JC, Schnable PS, Scully B, Sekhon R, Silverstein K, Singh M, Smith M, Spalding EP, Springer N, Thelen K, Thomison P, Tuinstra M, Wallace J, Walls R, Wills D, Wisser RJ, Xu W, Yeh CT, de Leon N. Maize genomes to fields (G2F): 2014–2017 field seasons: genotype, phenotype, climatic, soil, and inbred ear image datasets. BMC Res Notes. 2020; 13(1):1–6.
Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics. 2007; 23(19):2633–5.
Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixedeffects models using lme4. J Stat Softw. 2015; 67(1):1–48. https://doi.org/10.18637/jss.v067.i01.
Bates D, Eddelbuettel D. Fast and elegant numerical linear algebra using the RcppEigen package. J Stat Softw. 2013; 52(5):1–24. http://www.jstatsoft.org/v52/i05/.
Bhattacharya A, Chakraborty A, Mallick BK. Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika. 2016; 103(4):985–91. https://doi.org/10.1093/biomet/asw042.
Bhattacharya A, Dunson DB. Sparse Bayesian infinite factor models. Biometrika. 2011; 98(2):291–306.
Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 2006; 1(3):515–33.
Piironen J, Vehtari A. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electron J Stat. 2017; 11(2):5018–51.
Mondal S, Krause M, Juliana P, Poland J, Dreisigacker S, Singh R. Use of hyperspectral reflectance-derived relationship matrices for genomic prediction of grain yield in wheat – data for publication. 2018. https://hdl.handle.net/11529/10548109.
Lawrence-Dill C. Genomes To Fields 2014 v.3: CyVerse Data Commons; 2017. https://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Carolyn_Lawrence_Dill_G2F_Nov_2016_V.3.
Runcie D. deruncie/MegaLMM: Version for accepted manuscript. GitHub. 2021. https://doi.org/10.5281/zenodo.4961220.
Runcie D. deruncie/MegaLMM_analyses: Version for accepted manuscript. GitHub. 2021. https://doi.org/10.5281/zenodo.4961269.
Acknowledgments
Funding
This work is supported by Agriculture and Food Research Initiative grants no. 2020-67013-30904 and 2018-67015-27957 from the USDA National Institute of Food and Agriculture to DER and HC. DER is also supported by United States Department of Agriculture (USDA) National Institute of Food and Agriculture (NIFA), Hatch project 1010469. LC is supported by grants P20GM109035 (COBRE Center for Computational Biology of Human Disease; PI Rand) and P20GM103645 (COBRE Center for Central Nervous System Function; PI Sanes) from the NIH NIGMS, 2U10CA180794-06 from the NIH NCI and the Dana-Farber Cancer Institute (PIs Gray and Gatsonis), as well as by an Alfred P. Sloan Research Fellowship and a David & Lucile Packard Fellowship for Science and Engineering.
Author information
Authors and Affiliations
Contributions
DER developed the method, wrote the R package, developed and ran the analyses, and wrote the paper. JQ edited the manuscript. HC helped develop the method, design the analysis, and edited the paper. LC helped develop the method, design the analysis, and wrote the paper. The authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
LC was employed by Microsoft Research while performing this work.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
Supplementary Methods. Expanded description of the prior distributions and derivation of the MCMC algorithm used in MegaLMM [72–76]. Figure S1. Prediction accuracies for each target gene expression trait relative to MegaLMM. Figure S2. Prediction accuracies for each target gene expression trait as a function of dataset size. Figure S3. Accuracy of covariance estimates in simulated data. Figure S4. Sensitivity of MegaLMM to changes in K. Figure S5. Performance of single-trait and multi-trait genomic prediction for wheat yield across 20 experiments. Figure S6. Relationship between genetic correlation and benefit of MvLMM across Genomes2Fields site-years. Table S1. Default hyperparameters for user-customizable prior distributions.
Additional file 2
Review history.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Runcie, D.E., Qu, J., Cheng, H. et al. MegaLMM: Mega-scale linear mixed models for genomic predictions with thousands of traits. Genome Biol 22, 213 (2021). https://doi.org/10.1186/s13059-021-02416-w
DOI: https://doi.org/10.1186/s13059-021-02416-w