Genetic prediction of complex traits: integrating infinitesimal and marked genetic effects

Genetic prediction for complex traits is usually based on models including individual (infinitesimal) or marker effects. Here, we concentrate on models including both the individual and the marker effects. In particular, we develop a “Mendelian segregation” model combining infinitesimal effects for base individuals and realized Mendelian sampling in descendants described by the available DNA data. The model is illustrated with an example and the analyses of a public simulated data file. Further, the potential contribution of such models is assessed by simulation. Accuracy, measured as the correlation between true (simulated) and predicted genetic values, was similar for all models compared under different genetic backgrounds. As expected, the segregation model is worthwhile when markers capture a low fraction of total genetic variance.


Introduction
In recent years, new knowledge of molecular genetics and the rapid evolution of sequencing and genotyping technology have renewed interest in the genetic prediction of complex traits. It should be recalled, however, that genetic prediction of complex traits has been a traditional field in animal and plant breeding since the 1940s, in the framework of Selection Index (SI) theory (e.g., Hazel 1943), later extended to "best linear unbiased prediction" (BLUP; Henderson 1975). These genetic prediction methods, which do not use DNA data, were based on the "individual" model, where covariances amongst phenotypes of related individuals are translated into unobserved covariances amongst genetic values via theoretical relatedness coefficients amongst individuals. Anticipating the availability of low-cost whole-genome DNA data, Meuwissen et al. (2001) proposed "marker" models where the genotypes of many markers represent genetic effects, while the individuals are not explicitly specified in the model. We concentrate here on a third group of models including both "marker" and "individual" effects. We first recall the families of models proposed for genetic prediction and then develop a novel model, which is illustrated with an example. Then, we assess the performance of the novel model relative to the marker model for different genetic scenarios, and we report results of the analyses of a public simulated sample. Finally, the originality, limits and possible extensions of the model are discussed.
The underlying genetic model is "phenotype = mean + additive genetic value + residual". This model has been called "polygenic" or "infinitesimal" since the additive genetic value is the sum of the effects, assumed to be small and homogeneous, of numerous genes on the phenotype. In the statistical model, built from the genetic model, "individual effects" are used to represent additive genetic effects, and they are assumed random because genotype configurations of individuals arise through random processes:

$$y = \mu + Zu + e \qquad (1)$$

where:
$y$ is a vector of phenotypes.
$\mu$ is a constant vector (assumed known in SI and estimated in BLUP).
$Z$ is an incidence matrix of order $N_y$ (phenotypes) $\times$ $N_i$ (individuals), relating each of the $N_y$ phenotypes to each of the measured individuals. For simplicity, we assume only one measure per individual. In standard BLUP technology $Z = [\,0 \;\; I \;\; 0\,]$, i.e., null columns for base individuals without phenotypes, the identity matrix for individuals with phenotypes (when there is a single measure for each individual), and null columns for descendants without phenotypes, the usual target of prediction. In this context of genetic prediction, base individuals are defined for a given genealogy as the most distant known ancestors of individuals with recorded phenotypes, i.e., they do not have phenotypes and their parents are unknown.
$u$ is a vector of additive genetic effects, with $\mathrm{Var}(u) = A\sigma_u^2$, where $A$ is the relationship matrix amongst individuals.
$e$ is a vector of residuals, with $\mathrm{Var}(e) = I\sigma_e^2$, where $I$ is an identity matrix.
A further usual assumption is $\mathrm{Cov}(e, u) = 0$. The only information available to distinguish genetic effects from residuals is the structure of the (co)variance matrices of $u$ and $e$. In other words, the model describes a network of phenotypic covariances (observed) which are translated into genetic covariances (unobserved) via the theoretical genetic model, in particular the relatedness coefficients in the relationship matrix $A$.
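The relationship matrix A that enters the variance of the individual effects can be computed from genealogy data alone. As a minimal sketch, assuming the classical tabular method (which is not described in this paper) and a hypothetical five-individual pedigree, the function name `relationship_matrix` and the pedigree encoding below are illustrative choices only:

```python
def relationship_matrix(pedigree):
    """Additive relationship matrix A by the tabular method.

    pedigree: list of (sire, dam) index pairs, one per individual, ordered so
    that parents precede their progeny; None marks an unknown parent.
    """
    n = len(pedigree)
    A = [[0.0] * n for _ in range(n)]
    for i, (s, d) in enumerate(pedigree):
        # Off-diagonal: average of the relationships of j with the parents of i.
        for j in range(i):
            a = 0.0
            if s is not None:
                a += 0.5 * A[j][s]
            if d is not None:
                a += 0.5 * A[j][d]
            A[i][j] = A[j][i] = a
        # Diagonal: 1 + inbreeding coefficient (half the parents' relationship).
        both_known = s is not None and d is not None
        A[i][i] = 1.0 + (0.5 * A[s][d] if both_known else 0.0)
    return A


# Hypothetical pedigree: 0 and 1 are base individuals, 2 and 3 are full sibs,
# and 4 is the progeny of the full sibs 2 and 3 (hence inbred).
pedigree = [(None, None), (None, None), (0, 1), (0, 1), (2, 3)]
A = relationship_matrix(pedigree)
for row in A:
    print(row)
```

In this sketch the diagonal element of the inbred individual exceeds 1, which illustrates how the theoretical relatedness coefficients encode consanguinity.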

Marker and individual models
With molecular data available, prediction models evolved to include this new information (e.g., Fernando and Grossman 1989; Meuwissen et al. 2001). Fernando and Grossman (1989) proposed a prediction model which included several genetic effects: an infinitesimal effect $u$ plus haplotype effects of maternal and paternal origin at marked quantitative trait loci (QTL) positions. Their model was reasonably conservative, given the genomic tools available at that time (say, 500 microsatellites to cover the entire genome in farm animals). In this context, they assumed that a marker allele may mark different QTL alleles in different families. Later, with many more markers (10,000 multi-allelic markers), Meuwissen et al. (2001) switched from the previous conservative model to "marker" models exploiting linkage disequilibrium at the population level:

$$y = \mu + ZWm + e \qquad (2)$$

where:
$m$ is a vector of marked genetic effects (usually termed "marker effects", although the usual hypothesis is that markers do not have a true effect per se on the phenotype).
$W$ is a matrix of marker genotypes of order $N_i$ (individuals) $\times$ $N_m$ (markers). With biallelic markers such as SNP, the usual elements of $W$ are 0, 1 or 2, the number of copies of, say, allele "1" in the marker genotype.
Usually assumed (co)variances are $\mathrm{Var}(m) = I\sigma_m^2$, $\mathrm{Var}(e) = I\sigma_e^2$ and $\mathrm{Cov}(m, e) = 0$. If we further assume that $u = Wm$ and $\mathrm{Var}(u) = WW'\sigma_m^2$, it is possible to compute predictions for $u$ with the individual model (1), amended such that the relationship matrix $A$ is replaced by the realized "genomic relationship" matrix $G = WW'$ (VanRaden 2008; Goddard 2009). Application of BLUP to this model has been termed "genomic BLUP". Improvements have been proposed to make the assumptions more realistic (departures from the homogeneous variances for marked effects in model (2)) and to allow practical implementations when only part of the individuals are genotyped, which makes it necessary to combine the $A$ and $G$ matrices for the joint analysis of individuals with and without genotypes (e.g., Aguilar et al. 2011).
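The realized genomic relationship matrix G = WW' can be illustrated with a few lines of code. The sketch below is only an illustration under stated assumptions: the genotype matrix is hypothetical, the helper names are not from the paper, and the optional column centering is included because the Appendix mentions genotypes centered by column.

```python
def center_columns(W):
    """Subtract the column mean from every marker genotype (optional step)."""
    n = len(W)
    means = [sum(col) / n for col in zip(*W)]
    return [[w - m for w, m in zip(row, means)] for row in W]


def genomic_relationship(W):
    """G = W W' for a genotype matrix W of order individuals x markers."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in W]
            for row_i in W]


# Hypothetical genotypes: 3 individuals x 3 biallelic markers, coded 0, 1 or 2.
W = [[0, 1, 2],
     [1, 1, 0],
     [2, 0, 1]]

G = genomic_relationship(W)                           # uncentered, as in the text
G_centered = genomic_relationship(center_columns(W))  # centered variant
```

With centered genotypes, every row of `G_centered` sums to zero (up to rounding), so relationships are expressed relative to the average genotype of the sample.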

Marker plus individual model
Alternative assumptions in an outbred population are $u \neq Wm$ and $\mathrm{Var}(u) \neq WW'\sigma_m^2$. There are theoretical reasons and experimental results to support this point of view. Theoretically, in a Bayesian context, Gianola et al. (2009) claimed that the functional relationship between $\sigma_u^2$ and $\sigma_m^2$ is elusive. They did propose simple approximations under Hardy-Weinberg and linkage equilibria (LE) to relate the marked genetic variance and the additive genetic variance as $\sigma_u^2 = 2\sum_{i=1}^{N_m} p_i q_i \sigma_m^2$, where $p_i$ and $q_i$ are the allelic frequencies for marker $i$. However, assuming LE is not compatible with the essential assumption of linkage disequilibrium in the context of genome-wide analysis. Furthermore, in most experimental studies, the sum of variances due to marker associations does not add up to the additive genetic variance due to individual infinitesimal effects, raising the problem of the "hidden heritability" (e.g., Yang et al. 2011).
The unknown vector $m$ represents the effects of unobserved genes that should be marked by observed markers. This model should fit all genome-wide additive effects simultaneously. However, it is not guaranteed that all the actual additive genetic effects in the studied genome will be effectively traced by the available markers (Yang et al. 2011). Potential problems are poor marker coverage (low density but also insufficient representation of independent DNA segments), rare alleles, small (infinitesimal) gene effects, multi-allelic genes having additive effects that are poorly traced by bi-allelic markers, or other molecular genetic mechanisms. The main assumption is that each marker allele or haplotype is associated with each unobserved QTL allele in an identical way for each individual in the studied population. This may be true in some cases but it is not true in general. While an association between a marker and the QTL may be stable within parents and progeny, open populations over several generations are built up from subpopulations, each one with its own QTL allele-marker allele association. Reintroduction of infinitesimal effects in the prediction model is one of the recommended ways to partially control the lack of perfect association between marker alleles and causative alleles (Goddard and Hayes 2009). The model becomes:

$$y = \mu + ZWm + Zu + e \qquad (3)$$

with additional assumptions $\mathrm{Var}(u) = R\sigma_u^2$ and $\mathrm{Cov}(e, u) = \mathrm{Cov}(u, m) = 0$, where $R\sigma_u^2$ is the symmetric (co)variance matrix of individual effects, of order $N_i$. Usually, as in model (1), $R = A$, the additive relationship matrix computed theoretically from genealogy data. Note that the terms $ZWm$ and $Zu$ in model (3) are redundant if it is assumed that $u = Wm$. The idea in model (3) is to include residual genetic values not taken into account by the marked effects $m$. In applications, this model gave better predictions than the marker model (2) (e.g., De los Campos et al. 2009; Duchemin et al. 2012).

Mendelian segregation model
Here, we develop a model where the genetic value of an individual is a function of infinitesimal effects of ancestors (individuals in the base, with unknown parents) and Mendelian sampling which can be traced by DNA data. In the following it is assumed that all individuals have complete genotype data and all descendants have known parents. We then discuss the departures from this complete data situation.
The model starts as in (3): $y = \mu + ZWm + Zu + e$. It is convenient to separate individuals into two groups: the base ancestors with unknown parents (indexed by $b$) and the descendants (indexed by $d$). We can now decompose the vector of infinitesimal values as $u = (u_b', u_d')'$. Let $P$ be an $N_i \times N_i$ matrix with two 1's in each row, indicating the parents of each individual (rows of $P$ for base individuals are null).
We define the matrix $M$ as $M = I - \tfrac{1}{2}P$. The matrix $M$ has a biological interpretation (each row of $M$ represents the individual minus half the sum of its parents) and a mathematical one, since $M$ has the form of a Laplacian matrix representing the pedigree graph, with $P$ being the adjacency matrix with elements equal to 1 at the intersection of adjacent nodes (parent and progeny nodes) and 0 otherwise.
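To make the structure of P and M concrete, the following sketch builds both matrices for a small pedigree. It is an illustration only: the pedigree is hypothetical (it is not the genealogy of Fig. 1), the function names are illustrative, and M is taken to be the identity minus half of P, in line with the row interpretation given above (the individual minus half the sum of its parents).

```python
def parent_matrix(pedigree):
    """P: element (i, j) counts how often j appears as a parent of i."""
    n = len(pedigree)
    P = [[0] * n for _ in range(n)]
    for i, parents in enumerate(pedigree):
        for p in parents:
            if p is not None:      # rows of base individuals stay null
                P[i][p] += 1
    return P


def segregation_operator(P):
    """M = I - P/2: each row is the individual minus half the sum of its parents."""
    n = len(P)
    return [[(1.0 if i == j else 0.0) - 0.5 * P[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical pedigree: (sire, dam) per individual; 0 and 1 are base individuals.
pedigree = [(None, None), (None, None), (0, 1), (0, 1), (2, 3)]
P = parent_matrix(pedigree)
M = segregation_operator(P)
for row in M:
    print(row)
```

Because parents are listed before their progeny, M is lower triangular with a unit diagonal, which is what makes its inverse straightforward to obtain.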
Let $\phi$ be a vector of infinitesimal Mendelian sampling effects, which are deviations of individual genetic values from their respective parental averages. Then, the matrix operator $M^{-1}$ can be used to construct additive genetic values $u$ as linear combinations of ancestor genetic values $u_b$ and Mendelian sampling $\phi$ of their descendants, as illustrated in part (a) of Fig. 1, so we can write:

$$u = M^{-1}\begin{pmatrix} u_b \\ \phi \end{pmatrix}$$

where $u$ can be found by partitioning the $M$ matrix into $M_{bb}$, $M_{dd}$, $M_{db}$ and $M_{bd}$ blocks, as:

$$\begin{pmatrix} u_b \\ u_d \end{pmatrix} = \begin{pmatrix} M_{bb} & M_{bd} \\ M_{db} & M_{dd} \end{pmatrix}^{-1}\begin{pmatrix} u_b \\ \phi \end{pmatrix}$$

Using known results about the inverse of a lower triangular matrix (here $M_{bb} = I$ and $M_{bd} = 0$, because the rows of $P$ for base individuals are null), we obtain:

$$u_d = -M_{dd}^{-1}M_{db}\,u_b + M_{dd}^{-1}\phi \qquad (4)$$

Equation (4) uses standard results under infinitesimal models developed when it was impossible to observe DNA, and a theoretical distribution was assigned to the unknown $\phi$ (see Quaas 1976). Availability of genotypes for progeny and parents gives a realized "molecular" Mendelian sampling $s$, a predictor of $\phi$ which can be approached as a function of marked gene effects $m$:

$$s = (M_{db}W_b + M_{dd}W_d)\,m \qquad (5)$$

where matrices $W_b$ and $W_d$ contain the marker genotypes of base and descendant individuals, respectively. Figure 1b illustrates how expression (5) represents individual deviations from parental means, in terms of marked genetic effects, for a hypothetical genealogy of 5 individuals and 3 markers. Then, replacing $\phi$ by $s$ in (4), and using (5) in (4), with $D = -M_{dd}^{-1}M_{db}$, we get:

$$u_d = D\,u_b + (W_d - DW_b)\,m \qquad (6)$$

And the model for phenotypes is then:

$$y = \mu + Z_d D\,u_b + Z_d(2W_d - DW_b)\,m + e \qquad (7)$$

In the term $Z_d D u_b$, $Z_d$ (of order $N_y \times N_d$) relates records to individuals (descendants $d$) and $D$ relates individual genetic values to ancestors' genetic values $u_b$ via simple coefficients of genome sharing (including consanguinity, i.e., multiple contributions of an ancestor to an individual). So this term in (7) concentrates all phenotype information of descendants to estimate the ancestors' infinitesimal values.
The term $Z_d(2W_d - DW_b)m$ in (7) groups two parts: $Z_d(W_d - DW_b)m$, the "molecular" Mendelian sampling effects, where individual marked effects deviate from ancestors' marked effects, and $Z_d W_d m$, which represents the direct relations between markers and phenotypes.
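The quantities entering model (7) can be traced numerically on a small case. The sketch below is illustrative only: the pedigree, genotypes, base values and marked effects are all hypothetical (they are not those of Fig. 1), and D is obtained by recursion through the pedigree instead of by explicit inversion, which gives the same result as D = -M_dd^{-1} M_db when every descendant has two known parents listed before it.

```python
def transmission_matrix(pedigree, base):
    """Rows of D: expected contribution of each base individual to each descendant."""
    col = {b: k for k, b in enumerate(base)}
    T = {}
    for i, (s, d) in enumerate(pedigree):
        if i in col:                       # base individual: unit vector
            T[i] = [1.0 if k == col[i] else 0.0 for k in range(len(base))]
        else:                              # descendant: average of parental rows
            T[i] = [0.5 * (x + y) for x, y in zip(T[s], T[d])]
    return [T[i] for i in range(len(pedigree)) if i not in col]


def mendelian_sampling_coefficients(pedigree, W):
    """Rows w_i - (w_sire + w_dam)/2 for each descendant, as in expression (5)."""
    return [[w - 0.5 * (ws + wd) for w, ws, wd in zip(W[i], W[s], W[d])]
            for i, (s, d) in enumerate(pedigree) if s is not None]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Hypothetical data: 2 base individuals, 3 descendants, 3 biallelic markers.
pedigree = [(None, None), (None, None), (0, 1), (0, 1), (2, 3)]
base = [0, 1]
W_b = [[0, 1, 2], [2, 1, 0]]
W_d = [[1, 1, 1], [1, 2, 1], [2, 2, 0]]

D = transmission_matrix(pedigree, base)
S = mendelian_sampling_coefficients(pedigree, W_b + W_d)   # coefficients of s
DW_b = [[sum(dk * wb[j] for dk, wb in zip(row, W_b)) for j in range(3)]
        for row in D]
K = [[wd - dw for wd, dw in zip(r1, r2)] for r1, r2 in zip(W_d, DW_b)]  # W_d - D W_b

# Expression (6) with hypothetical base values u_b and marked effects m.
u_b = [0.2, -0.4]
m = [0.5, -0.1, 0.3]
u_d = [a + b for a, b in zip(matvec(D, u_b), matvec(K, m))]
```

Note that `K` accumulates the molecular Mendelian sampling of an individual and of its genotyped ancestors back to the base, whereas `S` holds only the deviation of each individual from its own parental mean.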

Assumptions of the model
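A set of possible assumptions is: independent base individuals, $\mathrm{Var}(u_b) = I\sigma_u^2$; homogeneous variances for marked effects, $\mathrm{Var}(m) = I\sigma_m^2$; independent residuals, $\mathrm{Var}(e) = I\sigma_e^2$; and null covariances among $u_b$, $m$ and $e$. The assumption of independent base individuals is usual in quantitative genetics. With DNA information and complete data it would be possible to make more general assumptions, such as $\mathrm{Var}(u_b) = H\sigma_u^2$, where $H$ represents a genomic matrix, thus recognizing that individuals in the base population may share genes. Again, the model is redundant if it is assumed that $u_b = W_b m$ and $m$ [...]. Alternatively, model (7) can also accommodate fixed genetic values for individuals in the base population.
Fig. 1 (caption, partially recovered): in general, for the $i$th individual, the molecular Mendelian sampling is computed from the $j$-th marker genotypes of the individual, the father, and the mother.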
The distribution of marked effects $m$ is assumed normal, but other distributions such as the Gamma may be chosen to take into account experimental results indicating few loci with large effects and many more loci with small effects (Goddard and Hayes 2009).

Analyses of data
Firstly, repeated simulations were conducted to assess the predictive ability of the Mendelian segregation model MS (Eq. 7) relative to the marker model M (Eq. 2). Then, we analyzed a public sample simulated for the 12th European QTLMAS workshop by Lund et al. (2009), using several models including individual and marked genetic effects.
We preferred to use simulated data at this exploratory stage to understand the behavior of the compared models. Also, to simplify interpretation at this stage, estimation and prediction were limited to the unknowns in the models ($\mu$, the vector of marked effects $m$ and the vector of individual genetic values $u$) by applying the known variances used to simulate the data.
We applied the same statistical method, BLUP, to all models compared, which have either one (Eqs. 1 and 2) or two (Eqs. 3 and 7) random effects in addition to random residuals. BLUP of random effects were computed as detailed in the "Appendix".

Relative predictive performance of the Mendelian segregation (MS) model
Data were simulated using the QMSim software (Sargolzaei and Schenkel 2009). The simulated population had 1 base generation (25 individuals), 3 training generations (120 individuals) and a last generation (40 individuals) taken as the prediction target. Mating was at random and the family size was 1. The simulated genome had 2 chromosomes of 1 Morgan each, and 10 biallelic QTL per chromosome were responsible for the QTL fraction of genetic variance. The number of SNP markers used was either 2,000 or 200 per chromosome. Phenotypes in the base and target generations were simulated but not used to predict genetic values of the target generation. The phenotypes had variance 1 and overall heritability (infinitesimal + QTL effects) was 0.4. Three genetic scenarios were replicated 200 times: high (90 %), intermediate (50 %), or low (10 %) proportion of genetic variance explained by QTL.
Mean accuracies over 200 replicates when using 2,000 SNP markers are presented in Fig. 2 for 10, 50 and 90 % of total genetic variance explained by QTL. Accuracies were highest (0.76 for model M and 0.74 for model MS) in the training data when the genetic variance explained by QTL was high (90 %). The lowest correlations occurred for the test data under scenario 10 % (0.36 for M vs. 0.40 for MS). The MS model gave the best predictions when the infinitesimal effects were important (scenario 10 %) and model M gave the best predictions when QTL effects represented 90 % of genetic variance. Differences between the mean accuracies of the two models were small and non-significant at the 0.05 level.
When fewer markers were used (200 SNP per chromosome), all accuracies were lower but the methods ranked as they did with more markers (2,000 SNP per chromosome) (Table 1). The accuracy of the MS model was 12 % higher than that of the M model for the scenario with 10 % of genetic variance explained by QTL, and 5 % lower when the QTL explained 90 % of total variance.
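Accuracy is defined in this study as the correlation between true (simulated) and predicted genetic values, averaged over replicates. A minimal sketch of that computation follows; the function names and the toy values are hypothetical and do not reproduce any figure reported above.

```python
from math import sqrt


def accuracy(true_values, predicted_values):
    """Pearson correlation between true and predicted genetic values."""
    n = len(true_values)
    mt = sum(true_values) / n
    mp = sum(predicted_values) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(true_values, predicted_values))
    vt = sum((t - mt) ** 2 for t in true_values)
    vp = sum((p - mp) ** 2 for p in predicted_values)
    return cov / sqrt(vt * vp)


def mean_accuracy(replicates):
    """Average accuracy over replicates, each a (true, predicted) pair of lists."""
    values = [accuracy(t, p) for t, p in replicates]
    return sum(values) / len(values)


# Toy replicates (hypothetical numbers, for illustration only).
replicates = [
    ([0.1, -0.3, 0.8, 0.0], [0.2, -0.1, 0.5, 0.1]),
    ([1.2, 0.4, -0.6, -0.2], [0.7, 0.5, -0.1, -0.4]),
]
print(mean_accuracy(replicates))
```

The correlation is invariant to the scale of the predictions, so it measures ranking ability and not the bias or dispersion of the predicted values.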

Analyses of a public simulated sample
In the data simulated for the 12th European QTLMAS workshop (Lund et al. 2009), the simulated phenotypes were influenced by 50 loci, including 15 major effect loci and 35 minor effect loci, with a total heritability of 0.3. Marker information was available for 6,000 SNP (only 5,925 were polymorphic and used in our analyses) on 6 chromosomes. The population was simulated under random mating and in the absence of selection. Each male was mated to 10 females and each mating pair produced 10 offspring. A data set of 4,665 individuals was split into a training set (3,165 individuals) and a test set (1,500 individuals). Four models were compared using the known variances used for the simulation: the marker model (M) as in (2), the marker plus individual model (MI) as in (3), the marker plus Mendelian effects model (MS) given in (7), and the individual model where the (co)variance matrix of individual effects was the additive relationship matrix $A$ (individual infinitesimal model; II). The method to estimate the unknowns of all the models was BLUP. The known variances were given by Lund et al. (2009): $\sigma_e^2 = 3.15$ and $\sigma_u^2 = 1.35$. The variance of marker effects was computed as $\sigma_m^2 = \sigma_u^2 / \left(2\sum_j p_j(1 - p_j)\right)$. Correlations between predicted values and simulated genetic values and phenotypes for the training and test populations are given in Table 2. The goodness of fit of model (7) for the training data was moderate ($r(\hat{u}, y) = 0.53$), but it yielded the best predictions for genetic values ($r(\hat{u}, u) = 0.94$) and phenotypes ($r(\hat{u}, y) = 0.55$) in the test sample. Model (7) was also the best to estimate the marked effects $m$: the correlations between estimates of $m$ and the simulated allele substitution effects, in absolute values, were 0.69 for model (7) and 0.56 for both the marker model and the "marker + individual" model.
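The conversion from the additive genetic variance to the variance of a single marker effect used above is simple arithmetic. In the sketch below, the value 1.35 is the additive variance quoted from Lund et al. (2009), whereas the allele frequencies and the function name are hypothetical and serve only to show the calculation.

```python
def marker_variance(sigma2_u, allele_freqs):
    """Variance of one marker effect: sigma2_u / (2 * sum_j p_j * (1 - p_j))."""
    k = 2.0 * sum(p * (1.0 - p) for p in allele_freqs)
    return sigma2_u / k


# Hypothetical allele frequencies for four markers (the real analysis used 5,925).
allele_freqs = [0.5, 0.5, 0.2, 0.1]
sigma2_m = marker_variance(1.35, allele_freqs)
print(sigma2_m)
```

Markers with intermediate frequencies contribute most to the denominator, so the same genetic variance is spread more thinly when many common markers are fitted.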

Discussion
As reviewed in the Introduction, there are plausible arguments to combine marked effects models with other individual effects when analyzing complex traits. To do so, the strategy used in the MS model [7] is to decompose the individual genetic value into two terms, instead of attempting to fit the additive genetic value of an individual twice as in model [3]: a contribution from base individuals, weighted by the transmission matrix D, and a contribution from the Mendelian sampling occurring at the successive meioses between base individuals and their descendants. In traditional infinitesimal models, Mendelian sampling is an unknown theoretical random term, so predictions of future phenotypes (of future progeny) are based on ancestor phenotypes and random terms. At present, with the availability of numerous markers, Mendelian sampling is realized for each individual and it can be used to improve predictions.
Model [7] builds on very well-known results in quantitative genetics. Early work described how genetic transmission operates in the additive relationship matrix A (e.g., Quaas 1976 and Henderson 1976, who presented detailed factorizations of the A matrix). Subsequent models included genetic transmission at unobserved segregating QTL (e.g., Fernando and Grossman 1989; Meuwissen and Goddard 2000; Legarra and Fernando 2009) and combined within-family and between-family marker effects in the context of methodology for QTL search (e.g., Abecasis et al. 2000). In animal breeding, efforts have focused on combining genotype data with genealogy data in individual genomic models, as reviewed by Meuwissen et al. (2011). The model [7] developed here builds on previous work by the simultaneous inclusion of infinitesimal and marked genetic effects. In this way the model might capitalize on two advantages of molecular information: the improvement of the infinitesimal prediction through the estimation of realized Mendelian sampling in descendant individuals, and the capture of marked gene effects without bias due to family structure, i.e., marked effects and infinitesimal effects are predicted simultaneously and without redundancy. Here, marked effects are estimated at the level of the population (marked effects m in model MS [7] are not defined within family) but the family structure is taken into account in the estimation model. Results of simulations indicate that the predictive ability of the MS model is comparable to that of the marker model. On the one hand, the accuracies obtained in different genetic scenarios suggest that the MS model might be useful when markers are not adequate to fully explain the genetic background (low QTL variances with high infinitesimal variance, or low marker density).
On the other hand, the marker model M yielded slightly higher predictive ability than MS when QTL were important and marker density was high. This result might reflect sub-optimality of the MS model in exploiting favorable situations where markers do effectively capture much of the total genetic variance. This might be explained by the simple distributional assumptions that we made at this exploratory stage for the base individuals and the marked effects of model MS in [7]. In particular, the marker model [2], and, more explicitly, its equivalent model "Genomic BLUP", capitalizes on the complete data setting studied here by estimating covariances among base individuals, and covariances between base individuals and descendants. So, for the MS model to be fully competitive, its distributional assumptions should be extended to take into account those relationships.
Results for the QTLMAS example are encouraging but unique and different from those of replicated simulations. At least two reasons may be advanced to explain these different results: the more complicated genetic background and the large family size, a full-sib design, simulated in the QTLMAS data set. But the impact of such factors on predictive ability needs further investigation.
Further investigation is also needed on variance component estimation of models including marker and individual effects. Duchemin et al. (2012) were able to estimate both components of variance from real data using model [3], i.e., the variance of individual effects and the variance of marker effects. We are currently studying variance components estimation for model [7], with infinitesimal effects defined only for the base individuals and variance structure designed to avoid identifiability problems.
Also, at this stage of model development, we are assuming complete data, in particular genotypes of base individuals. In some situations, it may be possible to impute missing data. Also, if the genealogy is unknown and all individuals are in the genotyped sample, parent-progeny pairs can be easily identified using DNA data (Rohlfs et al. 2012). However, to cover the many variable situations of real life, it will be necessary to expand model [7] to include heterogeneous variances, where Mendelian sampling is observed for some individuals but remains a random value for individuals without genotyped parents.
Another potential improvement of the MS model in [7] concerns the representation of genetic transmission (as in expression [5]) and of marked genetic effects (as in [2] and [7]), which can certainly be improved. Haplotypes can be used instead of single non-phased SNP. The model is also compatible with approaches where some QTL are known, markers are preselected or markers are weighted by their effects during prediction (e.g., Zhang et al. 2011).

Conclusions
According to the literature on prediction of complex traits, it is justified to keep both individual (infinitesimal) and marked gene effects in the statistical predictive model. We gave a formal derivation of a Mendelian sampling (MS) model where individual effects are a function of infinitesimal effects of base individuals and Mendelian sampling in descendants, traced using available DNA data. At this stage of research, we are assuming complete data, simple distributional assumptions for individual and marked genetic effects, and known variances. First simulation results suggest that these simplifying assumptions should be extended to render the MS model fully competitive.
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Appendix: Computation of individual and marked genetic effects using BLUP
Let $\sigma_i^2$, $\sigma_M^2$, and $\sigma_e^2$ be the variance of infinitesimal effects, the genetic variance due to all QTL, and the residual variance, respectively. Also, the variance of individual marker effects is $\sigma_m^2 = \sigma_M^2 / k$, with $k = 2\sum_j p_j(1 - p_j)$. In the resulting BLUP equations, $\mathbf{1}$ is a vector of ones and $Z$ is the incidence matrix; $\hat{\mu}$ is the BLUE (best linear unbiased estimator) of the general mean, and $\hat{u}$ is the solution for individual effects.
In the equations involving marked effects, $X = ZW$, i.e., the incidence matrix times the matrix of genotypes, centered by column; $\hat{m}$ is the solution for marked effects.
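As an illustration of how such solutions can be computed for the marker model, the sketch below solves Henderson's mixed model equations for y = 1*mu + X m + e with known variances. This is a standard textbook formulation written under our own assumptions, not a transcription of the equations of this Appendix; the data, the function names and the small Gaussian elimination solver are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def blup_marker_model(y, X, sigma2_e, sigma2_m):
    """Return (mu_hat, m_hat) from the mixed model equations of y = 1*mu + X m + e."""
    n, p = len(y), len(X[0])
    lam = sigma2_e / sigma2_m                        # variance ratio (shrinkage)
    cols = [[1.0] * n] + [list(c) for c in zip(*X)]  # columns of [1 | X]
    lhs = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    for j in range(1, p + 1):
        lhs[j][j] += lam                             # only marker effects are random
    rhs = [sum(a * b for a, b in zip(c, y)) for c in cols]
    sol = solve(lhs, rhs)
    return sol[0], sol[1:]


# Hypothetical data: 4 records, 2 markers with genotypes centered by column.
y = [1.0, 2.0, 3.0, 4.0]
X = [[-1.0, 1.0], [0.0, -1.0], [0.0, 1.0], [1.0, -1.0]]
mu_hat, m_hat = blup_marker_model(y, X, sigma2_e=1.0, sigma2_m=1.0)
```

The number of equations grows with the number of markers, so for dense marker panels the equivalent individual (genomic BLUP) formulation, whose size depends on the number of individuals, may be preferable.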
Predictions from model [2] can also be obtained with the individual model: