Statistical method for mapping QTLs for complex traits based on two backcross populations.

Most important agronomic and quality traits of crops are quantitative in nature. The genetic variations in such traits are usually controlled by sets of genes called quantitative trait loci (QTLs), and the interactions between QTLs and the environment. It is crucial to understand the genetic architecture of complex traits to design efficient strategies for plant breeding. In the present study, a new experimental design and the corresponding statistical method are presented for QTL mapping. The proposed mapping population is composed of double backcross populations derived from backcrossing both homozygous parents to DH (double haploid) or RI (recombinant inbreeding) lines separately. Such an immortal mapping population allows for across-environment replications, and can be used to estimate dominance effects, epistatic effects, and QTL-environment interactions, remedying the drawbacks of a single backcross population. In this method, the mixed linear model approach is used to estimate the positions of QTLs and their various effects including the QTL additive, dominance, and epistatic effects, and QTL-environment interaction effects (QE). Monte Carlo simulations were conducted to investigate the performance of the proposed method and to assess the accuracy and efficiency of its estimations. The results showed that the proposed method could estimate the positions and the genetic effects of QTLs with high efficiency.

loci (QTLs), appropriate data are required, including molecular markers and phenotypic traits from a well-designed experiment. The commonly used mapping populations in plants include F2, backcross (BC), double haploid (DH), and recombinant inbred (RI) populations. Of these, the F2 population provides the most abundant genetic information [1]. However, the F2 population is temporary, the individuals differ from each other in their genetic constitution, and they cannot be duplicated unless somatic cell cloning is applied. Thus, it is not possible in a QTL study to conduct multiple environmental experiments with an F2 population to analyze QTL × environment (QE) interactions. Furthermore, all individuals in different environments need to be genotyped, which is expensive in terms of time, labor, and money. The BC population shares the same features as the F2 population mentioned above; as a result of less segregation information for genotypes, additive effects cannot be distinguished from dominance effects, and some types of epistatic effects are also confounded [2]. To investigate QTL effects across different environments, immortal DH and RI populations have been developed and widely applied in QTL mapping for many species [3][4][5][6][7]. Since DH and RI populations are homozygous, their marker data can be used repeatedly with phenotypes of quantitative traits observed in different locations and years under various experimental designs. However, these populations cannot be used to analyze dominance effects and some types of dominance-related epistatic effects, which play important roles in hybrid heterosis. In the present study, we tested an experimental design including a permanent double backcross population derived from crossing DH or RI lines to both the homozygous parents. The advantage of such populations is that identical mapping populations can be replicated as necessary.
There remains a great challenge in methodology for analyzing epistasis and QE interactions, two principal genetic components of QTL effects. The importance of epistasis affecting complex quantitative traits has been well documented in numerous classical quantitative genetic studies [8,9]. Several QTL studies have also indicated that epistasis is an important genetic basis for complex traits such as grain yield and its components, as well as heterosis and inbreeding depression [10][11][12][13]. The QE interaction is another important component for quantitative traits [14][15][16][17][18][19][20], but none of the previous methods has integrated these two parts into one unified mapping framework. Wang et al. [2] proposed a method based on a mixed linear model approach to tackle this problem for a DH or RI population; this method was later extended to a permanent F2 population [21]. However, those methods have some disadvantages in cofactor selection for controlling background genetic effects, false positive rates, and computational tractability. Therefore, a full-QTL model was proposed for systematically mapping QTLs underlying complex traits. This model was sufficiently flexible to include the effects of multiple QTLs, epistasis, and QE interactions [22].
Here, we propose a method under the framework of a mixed linear model based composite interval mapping (MCIM) for immortal double backcross populations. The proposed method uses the approach of the full QTL mapping model described by Yang et al. [22]. Monte Carlo simulations were carried out to assess the performance of the method.

Genetic model
The proposed genetic model is based on the mixed linear model. The mapping population consists of double back-cross populations derived from crossing DH or RI lines with both of the homozygous parents. The mapping population is used to map s segregating QTLs simultaneously with additive, dominance, and additive × additive epistatic effects, as well as their interaction effects with the environment, in which t pairs of QTLs are involved in epistatic interactions. The mixed linear model for the phenotypic value of the kth individual in the hth environment (y hk ) can be expressed as follows (h = 1, 2, …, p; k = 1, 2, …, n h ) (1) where μ is the population mean; a i and d i are the fixed additive and dominance effects of Q i (ith QTL), respectively; aa ij is the additive × additive epistatic effect (fixed) between Q i and Q j . x A ki , x D ki , and x AA kij are the coefficients of the above QTL effects that depend on the observed genotypes at the marker loci (let M i− and M i+ be two markers flanking Q i and M j− , M j+ be markers flanking Q j , respectively) and the recombination frequencies (denoted by r M i− Q i , r Q i M i+ and r M j− Q j , r Q j M j+ ); e h is the random effect of the hth environment, ; ae hi is the random additive × environment interaction effect with coefficient ; de hi is the random dominance × environment interaction effect with coefficient x D ki , ; aae hij is the random additive-additive epistasis × environment interaction effect with coefficient ; ε hk is the random residual effect, .
Model (1) can be expressed in the following matrix form for all the n individuals (n=n 1 +n 2 + …+n h ): (2) where y is an n×1 vector of the observed phenotypic values, with n as the total number of individuals; 1 is an n×1 vector with all the elements=1 and b AA =[aa 1 aa 2 ···aa t ] T are the parameter vectors for the fixed QTL effects with incidence matrixes X A , X D and X AA , respectively; are random effect vectors with incidence matrixes U E , U A k E , U D k E and U AA h E , respectively; is the random vector of residual effects; and R u is an identity matrix (u=1, 2, …, r+1) when there is no kinship between the original parents.
For any putative QTL, all the coefficients in model (2) are unknown; however, they can be substituted with their expectation inferred from the conditional probability of the QTL genotype given the two flanking markers. Let p ijk be the conditional probability for the QTL with i denoting the parental line (P i , i=1, 2), j denoting the genotype of flanking markers (j=1, 2, 3, 4), and k denoting the genotype of QTL (k=1, 2, 3). Conditional probabilities are calculated for the mixed population from two backcrosses between DH lines (or RI lines) and homozygous parents P 1 and P 2 (Table 1). Additionally, the coefficients of additive and dominance effects are 1 and −0.5 for genotype Q i Q i , 0 and 0.5 for Q i q i , −1 and −0.5 for q i q i , respectively. Thus the coefficients of the putative QTL effects in model (1) can be estimated by x A k =(p ij1 −p ij3 ) and x D k =(p ij2 −p ij1 −p ij3 )/2 for each progeny from the backcross between DH or RI lines and parent P i, and having the jth flanking marker genotype.

Strategy for detecting QTLs in the whole genome
Prior to fitting model (1), the numbers and locations of QTLs with genetic main effects or QE interaction effects should first be identified based on a marker linkage map and phenotypic values. After obtaining these effects, a systematic mapping strategy [22] for complex traits was used in the present study to estimate various parameters of QTLs.
i. Mapping QTLs through 1D genome scan. According to the composite interval mapping (CIM) approach [23], mapping multiple QTLs can be simplified to 1D searching along a chromosome. It can be achieved by testing a QTL in a certain genomic region with some selected marker intervals as cofactors to control the background genetic effects from other QTLs outside of the region. ii. Mapping epistasis through 2D genome scan. After the 1D genome scan is completed with t main effect QTLs identified, a 2D genome scan is conducted to search for all possible epistasis including the effects of t QTLs and the interaction effects from f pre-selected paired marker intervals as the genetic background control in the model. The model to test the significance of epistatic interactions between loci i and j can be written as follows: (4) where is the interaction effect between the flanking markers of interval I A and I B , (I A− , I B− ) and (I A+ , I B+ ). All other variables and parameters have the same definitions as those in equation (3).
iii. Hypothesis testing for significance of QTL effects.
Both the genetic models of (3) and (4) can be reformatted into the following general multivariate linear mixed model in matrix form: (5) where, b Q is a tested vector of QTL parameters (consisting of additive, dominance, and epistatic effects) with coefficient matrix W Q , b B is the vector consisting of the population mean and the background effects due to the significant QTLs and marker intervals with the corresponding coefficient matrix W B . The F-statistic based on the Henderson III method [24] is used to test the significance of QTL effects, which can be expressed as follows where r w is the rank of the matrix W = (W Q ⋮W B ) , and r W B is the rank of W B ; SSR(b Q |b B ) is the extra regression sum of squares attributed to the b Q given the b B in the genetic model; SSE is the residual sum of squares of the model (5). Under the null hypothesis H 0 :b Q =0, the F-statistic follows the F distribution with (r W →r W B ) and (n−r w )as the numerator and denominator degrees of freedom, respectively. For detailed information, refer to Yang et al. [22,25] Prior to scanning the genome for candidate QTLs by models (3) and (4), the candidate marker interval or paired marker intervals with significant effects (additive, dominance, or epistatic effects) as cofactors in models (3) and (4) must first be selected via the method of marker pair selection (MPS) [26]. The same searching procedure proposed by Yang et al. [22] was used in this study.

Genome wide threshold value and the estimation of QTL genetic effects
We used the F-statistic based on the Henderson III method for a mixed linear model to determine the candidate marker intervals, as well as the QTLs with significant effects. The genome-wide threshold for the F-statistic was specified by the permutation procedure [27]. When the number of candidate QTLs and their positions in chromosome were obtained, the full model including effects of all candidate QTLs was constructed. The model selection procedure was adopted to further reduce the false positive rate of QTLs. Based on the result of model selection, the genetic effects of QTLs including QE interaction effects were estimated by the Bayesian method via Gibbs sampling [22].

Simulation setting
A total of 200 Monte Carlo simulations for double back-cross populations with DH × P 1 and DH × P 2 progenies were conducted under three different environments to investigate the statistical properties of the proposed method in detecting QTL positions, estimating QTL effects, and predicting QE interactions. The size of mapping population was kept constant at 300 with three different ratios (1:1, 1:2, 1:5) of the number of DH × P 1 population to the number of DH × P 2 population, or equal ratios were used for the three populations with sizes of 180, 240, and 300. In all simulations, a genome of five chromosomes was constructed with 55 evenly distributed markers at a space of 10 cM. Suppose that 7 QTLs (denoted as Q1, Q2, Q3, Q4, Q5 Q6 and Q7), were involved in the genetic variation of a complex trait, of which 5 QTLs (Q1-Q5) had main genetic effects and 2 QTLs (Q6 and Q7) were pure epistatic QTLs. Five QTLs, Q1, Q3, Q5, Q6, and Q7, were involved in three pairs of epistatic effects denoted as EQ1 (Q1-Q6), EQ2 (Q3-Q5), and EQ3 (Q6-Q7), Only epistatic effects were set for EQ3. Q1 -Q5 were located on chromosomes 1-5, and Q6, Q7 on chromosomes 4 and 5, respectively. The heritability of a simulated trait was assumed to be 60%. Detailed configurations of parameters for each QTL are shown in Tables 2 and 3.

The potential bias in QTL parameter estimation arising from ignoring epistasis
A simulation comparison between two QTL models (with and without inclusion of epistatic effects) was performed to investigate the impact of epistases on estimating QTL main/ epistatic effects, QE interactions, and QTL positions. Model I included both main and epistatic effects, while Model II did not include epistasis. Two populations with or without epistasis were simulated. Both populations had a sample size of 300 (150:150). The abovementioned factors led to a total of four different cases: epistases existed and were included in the model (Case I), epistases existed but were ignored in the model (Case II), epistases did not exist and were included in the model (Case III), and epistases did not exist and were not included in the model (Case IV).
The results of the simulation clearly show that the mixed linear model approach provided largely unbiased estimates of QTL positions, QTL main/epistatic effects, and QE interactions (Tables 2 and 3), with small standard deviations. The results also revealed that the power of detecting an individual QTL ranged from 93% to 100%, showing the high efficiency of the proposed method.
The results under four different cases revealed that the positions of QTLs with main effects could be precisely estimated whether the genetic model included the epistases or not. In the presence of epistasis, cases I and II both provided unbiased parameter estimates with minor differences for additive effects; similarly, when there were no QTL epistases, almost the same results were observed for the cases III and IV regardless of whether epistases were included in the model or not (Table 2). However, in the presence of epistasis, Case I tended to give much better estimates of dominance and interaction effects with the environment, compared with Case II (Table 2).
For estimating epistases, it should be noted that epistases can only be identified when they exist and when an epistatic model is used (Case I). In the other cases (Case II, Case III and Case IV), either using a model that omits epistasis (Case II and Case IV) or the absence of epistases (Case III and Case IV), will both result in undetectable epistatic effects. However, the results in Table 3 revealed that epistatic parameters were well estimated by the epistatic model (Case I), although the statistical power was relatively lower than that of individual QTLs. Considering that the estimation accuracy of QTL parameters is affected by epistases, and that epistases are generally involved in genetic variation of quantitative trait, a model that includes epistasis is preferable for use in QTL studies.

Impact of heritability and population constitution on estimation of QTL parameters
To investigate the impacts of heritability and population size on estimating QTL parameters, we conducted simulations with the several scenarios described in Tables 4 and 5. Model I with epistases (see Section 2.2) was used for three sample populations with different sizes and a constitution ratio of 1:1, that is; 180 (90:90), 240 (120:120), and 300 (150:150). The simulation results revealed that including epistases in the model was essential to reliably detect and quantify QTLs with the mixed linear model approach. The large heritability and population size could increase the statistical power and estimation precision of the QTL parameters. As shown in Table 4, the statistical power of every QTL with main genetic effects was nearly 100%, except for Q5 with relatively small heritability; this finding indicated that heritability is an important factor in QTL detection. In addition, as the population size increased to 240, most QTLs with main effects could be distinguished efficiently. For the epistatic interaction (Table 5), the statistical power and accuracy of epistatic parameters were improved with higher heritability or larger population sizes. Under all of the scenarios, had fewer type II errors than other two paired epistatic QTLs. Meanwhile, for each pair of epistatic QTLs, increased population size resulted in greater predictive power and more accurate estimates of parameters.
In addition to heritability and sample size, the population constitution may be another important factor affecting QTL mapping. Thus, using the same model as above (Model I), we performed another series of Monte Carlo simulations for mapping populations with the same size but three different constitutions, 300 (150:150), 300 (100:200), and 300 (50:250).
As shown in Tables 6 and 7, the population with even proportions improved the accuracy of estimating the genetic effects of QTLs. In most cases, the estimates of parameters were slightly more accurate in the evenly distributed sample, as indicated by standard deviation, although the differences were not so clear (Tables 6 and 7). The false discovery rate also tended to increase with a more unevenly distributed population (0.0416, 0.0537 to the 0.0543 for three respective population constitutions; Table 7). In all cases, the power for the main QTL was higher than that for epistasis. Thus, a double backcross population with a large sample size and evenly distributed population is preferred and will enhance the accuracy of QTL analysis.

Discussion
Backcrossing is a very commonly used breeding method. It is widely used in plant breeding to precisely improve the target trait by transferring the associated gene(s) from the donor parent to the recurrent parent, to control hybrid populations, and to overcome the barriers of distant crosses, and so on. Backcrossing is also a very useful design for dissecting the genetic mechanism of quantitative traits, and has been used in mapping QTLs for complex traits in crops and animals [28,29]. Advanced backcrossing has been used for fine mapping of QTLs [30,31]. However, when a traditional temporary BC population derived from crossing F 1 with one of its parents is used to map QTLs for complex traits, it is impossible to obtain different phenotypes of each genotype in different environments. In the present study, we proposed mapping of QTLs based on an immortal double backcross population (that is, a population constructed by mating DH or RI lines with homozygous parents P1 and P2 separately). Because of the benefits of the identical genetic constitution of the mapping population and repeated multiple environment experiments, QTL and QE interactions can be analyzed accurately. Meanwhile, time and labor costs can be saved because the molecular genotype of each individual can be inferred from the molecular information of the parents and the DH or RI lines. Another advantage is that the double backcross population may improve the statistical power and precision of estimation, such as estimates of QTL positions, additive effects, and dominance effects, because of the increased population size and more abundant molecular genotypes at one locus or multiple loci. In QTL mapping, an F 2 population is ideal for analyzing additive, dominance, and additive-additive epistasis of QTLs because it has richer genetic diversity than DH, BC, and RI populations. On the other hand, to overcome the shortcomings resulting from heterozygosity of individuals and the need to conduct repeated experiments, the permanent or immortal F 2 (PF2 or IF2) populations derived from mating of DH or RI lines is proposed to mimic the F 2 population. The double backcross population is also an excellent mimic of the F 2 population, although it has fewer genotype types of multiple loci than the F 2 population, and of the four types of epistases, only additive-additive epistasis can be analyzed. However, there are still many merits of the double backcross mapping population for QTL studies in crops and especially in animals.
Recent studies of QTL mapping in rice and maize have shown that dominance and epistasis, especially the additive × additive interaction, play a key role in contributing to heterosis [10,32,33]. In the conventional BC population, the additive and dominant effects can only be identified in a mixture, and the additive × additive interaction cannot be detected separately. The double backcross population proposed in the present study can be used to tackle this problem, and accurately and efficiently estimates additive, dominance, and additive × additive interaction effects. However, population structure should be taken into account when mapping QTLs in the double backcross population. A population with equal ratios of the two backcross populations has a more similar genetic structure to the F 2 population, and should show greater statistical power and estimation accuracy compared with those of other populations. However, our simulation did not show large differences among three populations consisting of 150 and 150 lines, 100 and 200 lines, and 50 and 250 lines derived from crosses of DH × P 1 and DH × P 2 . When there are extreme changes in proportions of the population, the double-backcross populations will become more similar to a traditional one-backcross population, and the coefficient matrix of the model will become singular; thus, some genetic effects of QTLs, such as dominance, cannot be analyzed. Thus, an evenly distributed population is preferable and may improve the precision of estimates and statistical power.
It is believed that epistatic interactions are important genetic buffers that provide functional redundancy for species to survive under adverse changes in their environment. As well, epistatic interactions could generate more phenotypic polymorphisms in response to natural and artificial selection [34]. So far, there is much research evidence on the role of epistatic interactions. However, to detect a pair of true positive epistatic QTLs usually requires a relatively large population. Actually, the size of the simulation population, 300, was large enough to detect epistasis under the simulation scenarios considered here. As shown in Table 7, irrespective of various population proportions, the false positive rates of epistasis detection are quite low, 0.04, 0.05, and 0.05 in the three surveyed cases. On the other hand, the simulation study suggests that if epistasis of the QTLs does contribute to a trait, it will affect the accuracy of estimated QTL parameters. When there is significant epistasis in agronomic traits, mapping QTLs with case II will lead to inferior estimates of the effects and positions. Therefore, the best strategy is to include epistatic effects in a QTL model, especially when some epistases are actually involved in genetic variation of complex traits. Table 1 Conditional probabilities of QTL for two backcrosses populations from mating between DH (or RI) lines and parental lines (P  Comparisons of the estimation of QTL parameter and power by models with and without epistases a)  Estimate  Estimation of QTL effects and their interaction effects with environments under three different sample sizes     Estimation of epistatic effects and their interaction effects with environments under three different constitutions of mapping population b) estimate of parameter and standard deviation in parenthesis; AA and AAE denote estimates of additive by additive effects and their interaction effects with environments, respectively; I,II and III have same definition as that in Table 6, with false discovery rates of 0.0416, 0.0537, 0.0543, respectively.