QTL linkage analysis of connected populations using ancestral marker and pedigree information
Abstract
The common assumption in quantitative trait locus (QTL) linkage mapping studies that parents of multiple connected populations are unrelated is unrealistic for many plant breeding programs. We remove this assumption and propose a Bayesian approach that clusters the alleles of the parents of the current mapping populations from locus-specific identity by descent (IBD) matrices that capture ancestral marker and pedigree information. Moreover, we demonstrate how the parental IBD data can be incorporated into a QTL linkage analysis framework by using two approaches: a Threshold IBD model (TIBD) and a Latent Ancestral Allele Model (LAAM). The TIBD and LAAM models are empirically tested via numerical simulation based on the structure of a commercial maize breeding program. The simulations included a pilot dataset with closely linked QTL on a single linkage group and 100 replicated datasets with five linkage groups harboring four unlinked QTL. The simulation results show that including parental IBD data (similarly for TIBD and LAAM) significantly improves the power and particularly accuracy of QTL mapping, e.g., position, effect size and individuals’ genotype probability without significantly increasing computational demand.
Introduction
The quantitative dissection of complex traits into underlying genetic components has been the stated goal of many generations of quantitative geneticists (Falconer 1989). Recently, increased availability of molecular markers combined with enhanced statistical analysis techniques has given quantitative geneticists new tools. One simple approach to achieve the goal is to use quantitative trait locus (QTL) detection methods that exploit phenotypic and molecular marker data collected in designed bi-parental mapping populations of large size (Boer et al. 2007). However, such an approach has a serious limitation in that it explores only a small fraction of the genetic variance available in the reference population from which the two parents of the bi-parental mapping population are sampled. Additionally, analyses of mapping populations from different parents for the same trait can give inconsistent estimates of QTL positions and effect sizes (Beavis et al. 1991). QTL analysis of connected populations has been advocated as an alternative to increase the amount of genetic variability accounted for in the statistical model (Bink et al. 2002; Blanc et al. 2006). This approach is also expected to yield more consistent QTL mapping results. However, a common assumption in this approach is that the parents of the connected populations are unrelated and thus can be treated as independent (Blanc et al. 2006; Crepieux et al. 2005; Fang et al. 2011; Hayashi and Iwata 2009). While this assumption is convenient from the standpoint of the statistical analysis, it does not reflect the reality of most breeding programs and leads to loss of power in QTL estimation when the parents are in fact related.
The mapping resolution in QTL linkage studies depends, among other factors, on the number of meiotic events accounted for in the statistical model. Therefore, accounting for meiotic events that occurred in the ancestors of the parents of the current mapping population should be beneficial in the detection of QTL and the precise placement of these QTL on the genetic map. Meuwissen and Goddard (2000) proposed such methodology for the precise mapping of loci affecting quantitative traits. This methodology combines linkage and linkage-disequilibrium information where the latter is a function of the historical/ancestral recombinations. While Meuwissen and Goddard (2000) made use of population genetics theory to model linkage disequilibrium, an alternative scenario is possible when highly accurate pedigree and marker information exists for the recent ancestral generations of the parents of connected populations. This is especially true when the ancestral pedigree has been genotyped for many more genetic markers than the connected populations to be used for QTL mapping. However, explicit inclusion of the marker and pedigree information collected on ancestors into the dataset to be analyzed can create a substantial missing-marker-data problem that requires considerable imputation effort when used for mapping experiments of a size typical for breeding programs working with commercial elite germplasm. Instead of including the high-density genotyped ancestors themselves in the statistical analysis, we propose an approach that collapses the marker and pedigree information from the ancestors into parental identity by descent (PIBD) information.
This article presents a novel Bayesian approach to combine PIBD information into a QTL linkage analysis framework. The PIBD information pertains to the parents of the connected populations that form the analysis dataset and this information may be obtained in various ways. However, it is assumed that this information is in the form of an IBD matrix specific to a particular genomic position. Here we extend the Bayesian hierarchical framework of Bink et al. (2008) to allow for latent ancestral alleles that are derived from the locus-specific PIBD matrices. The approach is empirically tested using simulated phenotypic and marker data conditional on a pedigree specific to a maize mapping population. Extensions and implications to other QTL mapping experiments are discussed.
Materials and methods
In conventional QTL linkage mapping it is common to assume independence among the parental alleles of the mapping population(s) (Blanc et al. 2006; Crepieux et al. 2005; Fang et al. 2011; Hayashi and Iwata 2009). Here, this assumption is replaced by allowing for putative dependencies between the parents based on ancestral pedigree and marker information.
In the description of the methodology and implementation, we will concentrate on mapping populations containing inbred lines, i.e., individuals are homozygous at all loci. Consequently, we may use the terms allele, haplotype, and individual synonymously. The theoretical concepts presented in this article can be readily adapted for outbred populations.
Three types of data are available, i.e., phenotypic trait data (Y _{ T }), low-density marker data on mapping populations (Y _{ M }), and parental IBD data (Y _{ D }). The parental IBD data are IBD probabilities among the parents of the mapping populations, available as symmetric matrices Q for a set of n _{ Q } positions along the genome. At each genomic position an element Q_{ ij } of Q is the probability that parents I _{ i } and I _{ j } are IBD.
Modeling QTL genotypes
Let n _{ i } denote the number of parents, n _{ j } the number of mapping populations (crosses), n _{ o }[j] the number of offspring in the jth mapping population, and n _{a} the number of ancestral alleles. Then, we denote G as the (n _{ i } × N _{QTL}) matrix of parental alleles with \( i = 1, \ldots, n_{i} \) and \( qtl = 1, \ldots, N_{\text{QTL}} \). Similarly, let S denote the (n _{ O } × N _{QTL}) matrix of segregation (or meiosis) indicators (Donnelly 1983; Lander and Green 1987), where n _{ O } is the total number of offspring across all mapping populations, i.e., \( n_{O} = \sum_{j = 1}^{n_{j}} n_{o}[j] \); let C denote the (n _{ i } × N _{QTL}) matrix of ancestral class indicators; and let A denote the (n _{a} × N _{QTL}) matrix of ancestral alleles. Finally, for ease of readability we will suppress the subscripts pertaining to qtl and describe the concept as if only one QTL is assumed.
Original framework used to model QTL genotypes
Framework used to model dependence of parental QTL genotypes—ancestral alleles
Models for parental identity by descent (PIBD) data
Numerical example of IBD probability matrix Q among six inbred individuals (A) and the corresponding probability matrix P for the latent ancestor model (LAAM) (B). The IBD status matrix Q _{TIBD} (C) and the corresponding ancestor assignments (P _{TIBD}) for the threshold model (D), where the IBD status of pair I_{4}–I_{6} has been adjusted for reason of transitivity. The assignments correspond to the example given in Fig. 1
Threshold IBD model (TIBD)
Values below the threshold are replaced by 0 unless a transitivity problem arises (ter Braak et al. 2010). In case of a transitivity problem, the threshold is locally lowered to create consistency in IBD patterns (Table 1C). This threshold-based approach results in a crisp 0/1 matrix; for example the TIBD model for the example in Table 1A yields three ancestral allele classes (Table 1D). The inbreds I _{ 1 } and I _{2} are copies of ancestral allele A _{1}, inbreds I _{3}, I _{4}, and I _{6} are copies of ancestral allele A _{3}, and inbred I _{5} is a copy of ancestral allele A _{2}. Note that the IBD probability between inbreds I _{4} and I _{6} is below the threshold but set to 1 because of the transitivity rule.
This sampling approach seems more robust as it does not suffer the problem that occurs in the weighted-average implementation (Eq. 6). In addition, a computational advantage of the sampling-based implementation is that the PIBD matrices can be processed once prior to the MCMC simulation to cluster individuals, given the threshold value, and thus have crisp 0/1 matrix results available for all PIBD positions instead of the original IBD probability matrices.
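The thresholding step with its transitivity rule can be sketched as a clustering of parents: any pair whose IBD probability reaches the threshold is linked, and linked parents are merged transitively into one ancestral allele class. The following is a minimal sketch under that reading, not the implementation of ter Braak et al. (2010); the function name `tibd_classes` and the matrix values are hypothetical and only mimic the pattern of Table 1 (the pair I4–I6 lies below the threshold but ends up in the same class through I3).

```python
def tibd_classes(q, threshold=0.9):
    """Assign each parent to an ancestral allele class (hypothetical helper).

    Pairs with q[i][j] >= threshold are linked; classes are the connected
    components of that graph, which enforces transitivity.
    """
    n = len(q)
    parent = list(range(n))  # union-find forest over parents

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if q[i][j] >= threshold:
                parent[find(i)] = find(j)

    # Relabel component roots as consecutive class ids (order of appearance).
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Hypothetical 6 x 6 IBD matrix (illustrative values, not Table 1A).
q = [
    [1.00, 0.95, 0.10, 0.05, 0.00, 0.10],
    [0.95, 1.00, 0.10, 0.05, 0.00, 0.10],
    [0.10, 0.10, 1.00, 0.92, 0.20, 0.93],
    [0.05, 0.05, 0.92, 1.00, 0.20, 0.85],
    [0.00, 0.00, 0.20, 0.20, 1.00, 0.15],
    [0.10, 0.10, 0.93, 0.85, 0.15, 1.00],
]
classes = tibd_classes(q, threshold=0.9)  # -> [0, 0, 1, 1, 2, 1]

# Crisp 0/1 IBD status matrix implied by the class assignment.
q_tibd = [[int(a == b) for b in classes] for a in classes]
```

Because the class labels depend only on Q and the threshold, such a clustering could be run once per PIBD position before the MCMC simulation starts.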
Latent Ancestor Allele Model (LAAM)
We have recently proposed several algorithms to find a matrix P such that \( q_{ij}^{*} \) is close to the observed Q_{ ij } for all i ≠ j (ter Braak et al. 2009, 2010). For the Q matrix in Table 1A, the P matrix with five latent ancestor allele classes that corresponds with PIBD matrix Q (zero RMSE between Q and Q ^{ * }) is also given (Table 1B).
In the Bayesian algorithm we need 0/1 matrices such as the one in Table 1D, expressing a unique assignment of each parent to a single ancestral allele. We can sample such matrices from this prior model by sampling for each parent i independently its ancestral allele (class membership) according to its (row-wise) probabilities \( \left\{ {p_{ik} ,k = 1,..,K} \right\} \) in Table 1B.
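The row-wise sampling described above can be illustrated with a short sketch. Under independent sampling, two parents i and j receive the same ancestral allele with probability Σ_k p_ik p_jk, which is presumably the quantity \( q_{ij}^{*} \) that is fitted to Q_{ ij }. The matrix `p` below is a hypothetical three-parent, two-class example, not Table 1B, and the function names are ours.

```python
import random


def sample_classes(p, rng):
    """Draw one ancestral class per parent, independently, from its row of P."""
    k = len(p[0])
    return [rng.choices(range(k), weights=row)[0] for row in p]


def implied_ibd(p, i, j):
    """Probability that independently sampled parents i and j share a class."""
    return sum(a * b for a, b in zip(p[i], p[j]))


# Hypothetical membership probabilities: 3 parents, 2 latent classes.
p = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
]

rng = random.Random(1)
n_draws = 20000
shared = 0
for _ in range(n_draws):
    c = sample_classes(p, rng)
    shared += c[0] == c[1]

# Monte Carlo frequency of parents 0 and 1 sharing a class, to be compared
# with the analytical value implied_ibd(p, 0, 1) = 0.5.
freq_01 = shared / n_draws
```

Each draw of `sample_classes` corresponds to one crisp 0/1 assignment matrix of the kind shown in Table 1D.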
Sampling from P_{l} and P_{r} with these probabilities yields precisely the average IBD probabilities Q_{λ} of Eq. 6 if P_{l} and P_{r} perfectly fit Q_{l} and Q_{r}, respectively (see “Appendix A”). In the case of a perfect fit, the sampling approach of Eq. 10 is therefore equivalent to that of calculating the P matrix corresponding to Q _{λ} (Eq. 6). In the case of a non-perfect fit, the two approaches are almost equivalent. Note that the latter sampling approach is not the same as sampling from an average of P _{l} and P _{r}.
Effective number of latent ancestors
The prior model introduces correlation among the alleles of the parents because parents with similar rows in P are likely to be assigned as offspring from the same latent ancestor and will thus receive more often the same allele than under the independence model. Thus, ter Braak et al. (2010) also propose to use the effective number of latent classes as a measure for genetic diversity (see “Appendix B”).
Bayesian Markov chain Monte Carlo QTL analysis
The utilization of PIBD data adds a new layer in the Bayesian hierarchical framework described by Bink et al. (2008) as presented in Fig. 2. We now present the linear model for the phenotypes and the joint posterior distribution of all random variables.
Data likelihood
Joint posterior distribution
Note that the incidence matrix W is fully determined by variables A, C, and S (“Appendix C”). For MCMC simulation, the sampling distributions of the random variables are derived from this joint posterior distribution by conditioning on all other variables. These conditional distributions are as described by Bink et al. (2008), except for the QTL ancestor class indicator variables (C) that will be presented in the following.
Posterior conditional distributions of ancestral class indicators
In the TIBD model with the weighted Q (Eq. 6) this sampling distribution is actually deterministic because of the conditioning on λ. Consequently, the ancestral class indicators are updated jointly with the position of the QTL. In the TIBD model with the sampled Q (Eq. 7) we follow the same approach as presented below for the LAAM model.
The method for updating the alleles of the ancestral classes is identical to that for updating the alleles of parents in the model where parents are unrelated (UNR), which is Gibbs sampling. Note that the alleles of the parents are correlated in the TIBD and LAAM models because of the P matrix. Models TIBD and LAAM thus shift the independence assumption upward in the pedigree structure, namely from the parents to the latent ancestors.
Markov chain Monte Carlo simulation and posterior inference
The calculation of the above joint posterior distribution is analytically intractable, and we apply computer-intensive MCMC simulation (Gilks et al. 1996) to obtain draws from the joint posterior distribution. Different MCMC sampling algorithms are used, i.e., the Gibbs sampler (Gelman et al. 1995) when the conditional sampling distribution has a recognizable kernel and can directly be sampled from, and the Metropolis–Hastings algorithm (Gelman et al. 1995) when the conditional distribution cannot be sampled from directly. The sampling of ancestral class indicators under our new models TIBD and LAAM has been detailed above. To allow changes in model dimension, i.e., to increase or decrease the number of QTL in the model, we use the reversible jump MCMC method (Green 1995), similar to previous QTL model selection studies (Bink et al. 2002; Heath 1997; Sillanpaa and Arjas 1998). For each model we performed a Markov chain simulation of 500,000 (200,000) cycles for the pilot dataset (replicated datasets) and stored every 200th sample for posterior inference.
For all three models (UNR, TIBD, and LAAM), three values (1, 3, and 5) are evaluated for the mean of the Poisson distribution being the prior on the N _{QTL} in the analyses of our simulated data. The stored draws from the joint posterior distribution were used for posterior inference on the variables of interest, most importantly the characteristics of QTL (number, position, size, genotypes). A linkage group was divided into 1-centiMorgan (cM) bins and the number of QTL per bin per cycle was used to calculate the posterior QTL intensity (Sillanpaa and Arjas 1998).
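As an illustration of the binning described above, the sketch below computes a posterior QTL intensity profile from stored MCMC draws. It assumes that each stored draw is available as a list of QTL positions (in cM) on one linkage group; this data layout and the function name `qtl_intensity` are our assumptions, not the FlexQTL output format.

```python
def qtl_intensity(samples, length_cm, bin_cm=1.0):
    """Average number of QTL per bin over the stored MCMC draws."""
    n_bins = int(length_cm / bin_cm)
    counts = [0] * n_bins
    for positions in samples:
        for pos in positions:
            # A QTL exactly at the right end falls into the last bin.
            b = min(int(pos / bin_cm), n_bins - 1)
            counts[b] += 1
    return [c / len(samples) for c in counts]


# Four hypothetical stored draws on a 150-cM linkage group.
samples = [[30.2, 61.0], [30.7], [], [29.9, 60.4, 140.1]]
intensity = qtl_intensity(samples, length_cm=150)
# intensity[30] is 0.5: two of the four draws place a QTL in the 30-31 cM bin.
```

Summing the intensity over all bins returns the posterior mean number of QTL on the linkage group (here 6 / 4 = 1.5).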
For model selection in the pilot dataset we used Bayes factors (Kass 1993; Kass and Raftery 1995) as a measure of evidence coming from the data for different QTL models. More precisely, we used the statistic \( 2 \times \ln \left( {BF} \right) \) that scales similar to a LOD score test statistic (Kass and Raftery 1995).
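A minimal sketch of this statistic follows, under the assumption that the Bayes factor for a model with k QTL versus k − 1 QTL is obtained as the ratio of posterior odds to prior odds, with the posterior model probabilities estimated from the fraction of stored draws containing k and k − 1 QTL. For a Poisson prior with mean m, the prior odds P(k)/P(k − 1) equal m/k. The function name and the numerical inputs are hypothetical.

```python
import math


def two_ln_bf(post_k, post_km1, k, prior_mean):
    """2 ln(BF) for k versus k-1 QTL under a Poisson(prior_mean) prior.

    Returns None when either model was never visited, because the
    posterior odds cannot be estimated in that case.
    """
    if post_k <= 0.0 or post_km1 <= 0.0:
        return None
    prior_odds = prior_mean / k          # Poisson: P(k) / P(k-1) = mean / k
    posterior_odds = post_k / post_km1
    return 2.0 * math.log(posterior_odds / prior_odds)


# Hypothetical posterior model probabilities for 3 and 2 QTL, prior mean 3.
stat_3_vs_2 = two_ln_bf(post_k=0.6, post_km1=0.2, k=3, prior_mean=3)
```

With prior odds of 1 and posterior odds of 3, the example yields 2 ln 3, i.e., about 2.2.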
In the replicated datasets we adopt the approach of Hayashi and Iwata (2009) to assess the power and accuracy of the three models. The posterior QTL intensity for 1-cM bins along the linkage groups was calculated. Subsequently, the Summed QTL Intensity (SQI) was calculated by summing the QTL intensity over a single linkage group (Hayashi and Iwata 2009). Thresholds of SQI were determined from empirical null distributions of the maximum SQI over all linkage groups obtained from 100 null datasets (no QTL were modeled on any linkage group). When the SQI exceeded the threshold for any linkage group, detection of a QTL was declared. For declared QTL the position and effect were calculated as the weighted average over the linkage group where the weights were equal to the QTL intensity. We also examined an alternative method: to declare a QTL, an SQI threshold value of 0.50 must be exceeded (regardless of model), and the posterior mode estimate of QTL location (and the estimated QTL effect pertaining to that location mode) is used for every declared QTL. Taking SQI threshold values other than 0.50 (we explored a range between 0.2 and 0.8) yielded similar patterns in relative performance among the three models (results not presented). Furthermore, the SQI approach works well for linkage groups with only one (or no) QTL but cannot be applied to a linkage group with multiple QTL as in the pilot dataset.
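The SQI-based protocol can be sketched as follows, taking a per-bin intensity profile as input. The percentile convention in `empirical_threshold`, the use of bin midpoints, and all numerical values are our assumptions for illustration; they are not taken from Hayashi and Iwata (2009).

```python
def summed_qtl_intensity(intensity):
    """SQI: QTL intensity summed over all bins of one linkage group."""
    return sum(intensity)


def weighted_location(intensity, bin_cm=1.0):
    """Intensity-weighted average of bin midpoints (cM)."""
    total = sum(intensity)
    mids = [(b + 0.5) * bin_cm for b in range(len(intensity))]
    return sum(w * m for w, m in zip(intensity, mids)) / total


def empirical_threshold(null_max_sqi, percent=95):
    """Upper percentile of the maximum SQI observed in the null datasets."""
    ordered = sorted(null_max_sqi)
    idx = (percent * len(ordered) + 99) // 100 - 1  # ceiling division
    return ordered[idx]


# Hypothetical 100-cM profile: a peak in the 88-89 cM bin plus a low,
# uniform background in every other bin.
intensity = [0.002] * 100
intensity[88] = 0.6

sqi = summed_qtl_intensity(intensity)      # 0.6 + 99 * 0.002
location = weighted_location(intensity)    # pulled toward the group centre

# Hypothetical null distribution of maximum SQI values (100 null datasets).
null_max_sqi = [i / 100 for i in range(1, 101)]
threshold = empirical_threshold(null_max_sqi)
declared = sqi > threshold
```

In this example the weighted location is close to 79 cM although the peak sits at 88.5 cM, because every bin of the linkage group contributes to the average.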
Simulated data
To empirically test our models we use one pilot dataset and 100 replicated datasets with the same pedigree data and marker densities but with different trait architectures.
Pedigree of connected mapping populations
Simulated QTL genotypes and estimated posterior genotype probabilities in the pilot dataset for map intervals with positive QTL evidence of the 16 mapping parents
| Parent | No. of crosses | No. of progeny | Simulation | | | Model^{a} | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | | UNR | | | TIBD | | | LAAM | | |
| | | | Position (cM) | | | Intervals (cM) | | | Intervals (cM) | | | Intervals (cM) | | |
| | | | 30 | 60 | 140 | 28–37 | 58–64 | 134–143 | 25–32 | 57–61 | 136–142 | 26–33 | 58–62 | 138–143 |
| 761 | 4 | 83 | 1 | 0 | 1 | 0.3 | 0.6 | 0.5 | 0.9 | 0.1 | 1.0 | 0.7 | 0.1 | 1.0 |
| 766 | 2 | 59 | 0 | 0 | 0 | 0.3 | 0.3 | 0.1 | 0.1 | 0.0 | 0.2 | 0.1 | 0.0 | 0.2 |
| 773 | 8 | 248 | 1 | 0 | 1 | 0.3 | 0.5 | 0.8 | 0.9 | 0.1 | 1.0 | 0.7 | 0.1 | 1.0 |
| 775 | 3 | 145 | 0 | 1 | 1 | 0.3 | 0.8 | 0.7 | 0.1 | 1.0 | 1.0 | 0.1 | 0.9 | 1.0 |
| 822 | 2 | 97 | 1 | 0 | 1 | 0.3 | 0.7 | 0.7 | 0.9 | 0.0 | 1.0 | 0.9 | 0.0 | 1.0 |
| 847 | 4 | 197 | 0 | 0 | 0 | 0.1 | 0.4 | 0.1 | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 |
| 851 | 6 | 230 | 1 | 0 | 0 | 0.8 | 0.3 | 0.1 | 1.0 | 0.0 | 0.0 | 0.9 | 0.0 | 0.0 |
| 853 | 5 | 190 | 1 | 0 | 0 | 0.4 | 0.1 | 0.1 | 0.9 | 0.0 | 0.0 | 0.9 | 0.0 | 0.0 |
| 855 | 4 | 119 | 0 | 0 | 1 | 0.4 | 0.4 | 0.3 | 0.1 | 0.0 | 1.0 | 0.1 | 0.0 | 1.0 |
| 857 | 7 | 283 | 0 | 1 | 1 | 0.2 | 0.8 | 0.9 | 0.1 | 1.0 | 1.0 | 0.1 | 0.9 | 1.0 |
| 859 | 3 | 100 | 1 | 0 | 0 | 0.3 | 0.5 | 0.1 | 0.9 | 0.1 | 0.0 | 0.9 | 0.1 | 0.0 |
| 861 | 3 | 69 | 1 | 0 | 1 | 0.5 | 0.8 | 0.8 | 0.9 | 0.8 | 0.9 | 0.9 | 0.2 | 0.9 |
| 863 | 1 | 40 | 1 | 1 | 0 | 0.7 | 0.8 | 0.1 | 1.0 | 1.0 | 0.0 | 0.9 | 0.9 | 0.0 |
| 865 | 2 | 23 | 0 | 1 | 0 | 0.7 | 0.7 | 0.1 | 0.1 | 1.0 | 0.0 | 0.1 | 0.9 | 0.0 |
| 867 | 2 | 82 | 1 | 0 | 0 | 0.4 | 0.6 | 0.1 | 0.9 | 0.0 | 0.0 | 0.8 | 0.1 | 0.0 |
| 869 | 4 | 179 | 1 | 1 | 0 | 0.8 | 0.8 | 0.1 | 0.9 | 1.0 | 0.0 | 0.9 | 1.0 | 0.0 |
| Average \( \left| \text{P}_{\text{true}} - \text{P}_{\text{est}} \right| \) | | | | | | 0.44 | 0.39 | 0.18 | 0.08 | 0.08 | 0.04 | 0.11 | 0.06 | 0.05 |
Pedigree of ancestral generations
The known ancestral pedigree of the 16 parents of the connected mapping populations contains 162 inbred lines. Of these, 32 are founders, i.e., their parentages (pedigree) are assumed unknown.
Marker data
We simulated genetic data for 32 independent founders and their descendants according to the known (ancestral and mapping) pedigree structure. The pedigree contains multiple loops and the longest lineage is nine generations for any of the resulting 16 inbred parents. A gene-dropping method (MacCluer et al. 1986) was used to simulate Mendelian inheritance of marker and QTL alleles from parents to offspring, while Haldane’s mapping function (Haldane 1919) was used to transform linkage distances into recombination fractions.
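For reference, Haldane’s mapping function converts a map distance d (in Morgans) into a recombination fraction r = ½(1 − e^{−2d}). The sketch below applies it to a single meiosis in a gene-dropping step, with crossovers between adjacent loci drawn independently, as Haldane’s model assumes no interference. It is a simplified illustration with hypothetical function names and does not reproduce the full pedigree simulation (for instance, the inbreeding generations are omitted).

```python
import math
import random


def haldane_recombination(distance_cm):
    """Recombination fraction for a map distance given in centiMorgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def simulate_gamete(hap_a, hap_b, positions_cm, rng):
    """Drop alleles through one meiosis of a parent with haplotypes a and b."""
    use_a = rng.random() < 0.5  # strand transmitted at the first locus
    gamete = []
    for idx, pos in enumerate(positions_cm):
        if idx > 0:
            r = haldane_recombination(pos - positions_cm[idx - 1])
            if rng.random() < r:
                use_a = not use_a  # crossover between adjacent loci
        gamete.append(hap_a[idx] if use_a else hap_b[idx])
    return gamete


r_dense = haldane_recombination(1.0)    # adjacent markers, dense map
r_sparse = haldane_recombination(10.0)  # adjacent markers, sparse map

positions = [0.0, 10.0, 20.0, 30.0]
gamete = simulate_gamete([0, 0, 0, 0], [1, 1, 1, 1], positions,
                         random.Random(7))
```

For 1 cM the recombination fraction is about 0.0099, and for 10 cM about 0.0906, so the mapping function matters mainly on the sparse map.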
Pilot dataset
One linkage group of 150 cM was simulated that was covered by 16 equidistantly spaced bi-allelic SNP markers for the mapping populations (=sparse map). In contrast, the 16 parents and their ancestors were genotyped for 151 markers covering the same genome length (1 cM distance, = dense map).
Replicated datasets
Five linkage groups of 100 cM were simulated. Each group was covered by 11 equidistantly spaced bi-allelic SNP markers for the mapping populations and by 101 such markers for the parents and ancestral pedigree members. The total numbers of SNP markers were therefore 55 and 505 for the two subsets of the pedigree.
Phenotypic trait data
Pilot dataset
The phenotypes of all mapping individuals were simulated by assuming three QTL residing at positions 30, 60, and 140 cM on the linkage group. The distance between the first and second QTL was relatively small to assess differences in detection power and accuracy of estimates of closely linked QTL. The size of the additive effect for all of the three simulated QTL was set to 1.0.
Replicated datasets
Linkage groups 1–4 each contained a single QTL, at positions 22, 44, 66, and 88 cM, respectively. Linkage group 5 did not harbor a QTL and was included to evaluate the false-positive rate. The sizes of the additive effects for the QTL on linkage groups 1–4 were 1.3, 1.1, 0.9, and 0.7, respectively.
Parent IBD data
For every marker position on the 1 cM map, IBD probabilities among the 16 inbred parents were calculated by using the FlexQTL software (Bink et al. 2008). Note that these 16 × 16 PIBD matrices were calculated (and stored) only once for a given dataset regardless of the number of traits to be analyzed or model specification in the analysis. The resulting 151 (505) PIBD matrices of the pilot dataset (replicated datasets) were used as the new additional data source in the Bayesian analysis (data Y _{ D } in Fig. 2). In the LAAM approach we used the least-squares approximation to obtain Latent Class probabilities allowing a maximum of 16 classes (ter Braak et al. 2010).
Results
Pilot dataset
The effective number of ancestral classes (see “Appendix B”) was computed for the TIBD and LAAM models and substantial variation in this number was observed. The means (standard deviations) for the TIBD and LAAM models were 3.4 (0.96) and 3.1 (0.76), respectively. The lowest number was 1.6, implying a substantial probability (0.625) that two randomly chosen individuals belong to the same ancestral allele class. The highest number was more than 6. The effective number in the LAAM model was always smaller than or equal to the effective number in the TIBD model, indicating that parents in the LAAM model may have a higher probability of sharing the same ancestral allele.
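The relation between the effective number and the sharing probability quoted above (1/1.6 = 0.625) can be made explicit with a small sketch for crisp class assignments. We assume here that the effective number is the reciprocal of the probability that two parents drawn at random (with replacement) fall in the same class; the exact definition of “Appendix B” is not reproduced, and the class configuration below is hypothetical.

```python
from collections import Counter


def effective_number(classes):
    """Reciprocal of the probability that two random parents share a class."""
    n = len(classes)
    p_same = sum((count / n) ** 2 for count in Counter(classes).values())
    return 1.0 / p_same


# One hypothetical configuration of 16 parents: classes of size 12 and 4.
# Sharing probability: (12/16)^2 + (4/16)^2 = 0.625.
n_eff = effective_number([0] * 12 + [1] * 4)
```

With 16 parents the measure ranges from 1 (all parents share one ancestral allele) to 16 (all parents carry distinct alleles, as in the UNR model).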
Posterior mean estimates for overall mean (μ), residual variance (σ _{e} ^{2} ), the number of QTL (N _{QTL}), the QTL variance (σ _{QTL} ^{2} ), and heritability (h ^{2}) for the pilot dataset
| Variable | μ | σ _{e} ^{2} | N _{QTL} | σ _{QTL} ^{2} | h ^{2} | 2ln (Bayes factor)^{a} | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | | | | | 1/0 | 2/1 | 3/2 | 4/3 |
| Simulation | 0 | 16 | 3 | 3 | 0.16 | | | | |
| Model^{b} | | | | | | | | | |
| Prior(N _{QTL}) = 1 | | | | | | | | | |
| UNR | 1.1 | 16.6 | 3.3 | 3.9 | 0.19 | na | 26 | 3.7 | 1.7 |
| TIBD | −0.1 | 16.2 | 3.6 | 3.5 | 0.18 | na | na | 29 | 1.7 |
| LAAM | 0.0 | 16.2 | 3.7 | 3.6 | 0.18 | na | na | 28 | 2.3 |
| Prior(N _{QTL}) = 3 | | | | | | | | | |
| UNR | 0.9 | 16.5 | 4.8 | 4.2 | 0.20 | na | 9.5 | 3.1 | 1.5 |
| TIBD | 0.0 | 16.1 | 4.6 | 4.0 | 0.20 | na | na | 24 | 1.6 |
| LAAM | 0.1 | 16.1 | 4.9 | 4.4 | 0.21 | na | na | 24 | 2.2 |
| Prior(N _{QTL}) = 5 | | | | | | | | | |
| UNR | 0.8 | 16.4 | 6.1 | 4.3 | 0.21 | na | na | 4.1 | 1.6 |
| TIBD | 0.1 | 16.1 | 5.7 | 4.5 | 0.22 | na | na | 21 | 2.0 |
| LAAM | 0.0 | 16.1 | 5.7 | 4.7 | 0.23 | na | na | 21 | 1.6 |
For these most probable QTL regions we computed the posterior probabilities for QTL genotypes for the 16 parents of the mapping population (Table 2). The posterior probability estimates were often inconclusive (0.3 ≤ P ≤ 0.7) for the UNR model, especially for the QTL at 30 and 60 cM, while for the QTL at 140 cM ten out of 16 parents had conclusive probability estimates (P ≤ 0.1 or P ≥ 0.9). On the other hand, the genotype probability estimates for the TIBD and LAAM models were almost always conclusive for all three QTL positions (Table 2). These differences between models were also summarized by calculating the average absolute difference between the true and estimated QTL genotype probabilities. These summary values show that including PIBD information leads to roughly four times more accurate genotype probability estimates.
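The summary statistic can be recomputed from the entries of Table 2, as sketched below for the QTL at 30 cM under the UNR and TIBD models. Because the tabulated probabilities are rounded to one decimal, the recomputed averages agree with the reported values (0.44 and 0.08) only approximately. The variable and function names are ours.

```python
def mean_abs_diff(true_genotypes, estimated_probs):
    """Average |P_true - P_est| over the mapping parents."""
    pairs = zip(true_genotypes, estimated_probs)
    return sum(abs(t - e) for t, e in pairs) / len(true_genotypes)


# Column values for the QTL at 30 cM, parents 761 ... 869 (Table 2).
true_30 = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1]
unr_30 = [0.3, 0.3, 0.3, 0.3, 0.3, 0.1, 0.8, 0.4,
          0.4, 0.2, 0.3, 0.5, 0.7, 0.7, 0.4, 0.8]
tibd_30 = [0.9, 0.1, 0.9, 0.1, 0.9, 0.1, 1.0, 0.9,
           0.1, 0.1, 0.9, 0.9, 1.0, 0.1, 0.9, 0.9]

err_unr = mean_abs_diff(true_30, unr_30)    # about 0.45
err_tibd = mean_abs_diff(true_30, tibd_30)  # about 0.09
```

The ratio of the two recomputed values is about five for this QTL, in line with the several-fold gain in accuracy reported above.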
Replicated datasets
Posterior inferences results on replicated datasets (100 replicates), using the SQI thresholds from 100 null datasets
| | LG 1 | LG 2 | LG 3 | LG 4 | LG 5 |
|---|---|---|---|---|---|
| Simulation | | | | | |
| Position | 22.0 | 44.0 | 66.0 | 88.0 | |
| Effect | 1.30 | 1.10 | 0.90 | 0.70 | |
| Segregation^{a} | 82 | 92 | 88 | 87 | 0 |
| UNR | | | | | |
| Power^{b} | 81 | 88 | 69 | 67 | 37^{e} |
| Location^{c} | 29.5 (9.7) | 45.8 (9.4) | 55.2 (11.5) | 60.0 (17.8) | |
| Effect^{d} | 0.86 (0.49) | 0.55 (0.43) | 0.35 (0.46) | 0.24 (0.39) | |
| TIBD | | | | | |
| Power | 77 | 83 | 69 | 54 | 7^{e} |
| Location | 25.5 (5.5) | 43.5 (6.6) | 61.8 (9.9) | 75.4 (13.0) | |
| Effect | 1.10 (0.25) | 0.90 (0.24) | 0.73 (0.34) | 0.58 (0.32) | |
| LAAM | | | | | |
| Power | 77 | 83 | 71 | 55 | 10^{e} |
| Location | 25.4 (6.2) | 43.2 (6.9) | 61.8 (9.3) | 75.0 (14.3) | |
| Effect | 1.10 (0.25) | 0.91 (0.25) | 0.72 (0.34) | 0.59 (0.32) | |
Summed QTL Intensity threshold based on empirical null distribution
For the prior E(N _{QTL}) = 1, the 5% significance thresholds of these empirical distributions were 0.127, 0.206, and 0.196 for the models UNR, TIBD, and LAAM, respectively. A plausible cause of the higher threshold for the TIBD and LAAM models is the following. Some (segments of) linkage groups are not segregating in the mapping parents (similar to fixed QTL in Fig. 5). These monomorphic regions are excluded in the TIBD and LAAM models, which increases the prior probability for QTL on other linkage groups. This creates a higher variability in SQI among linkage groups and thus higher values for the maximum SQI of the linkage groups. The power of QTL detection was higher for the UNR model than for the other two models when using the SQI threshold based on 100 null datasets, except for the QTL on linkage group 3 (Table 4). The UNR model also yielded a much higher false-positive rate, i.e., 37 QTL were declared for linkage group 5. The posterior estimates for location were most biased for the UNR model and for the QTL at the extremes of the linkage groups. The bias points toward the middle of the linkage group and seems to be caused by the estimation protocol of Hayashi and Iwata (2009). Since all intervals along the linkage group are included in the estimation procedure, QTL at the extremes will have their estimated location biased toward the center of the linkage group. Especially for the QTL with smaller effect size the location cannot be precisely determined and positions further away are plausible as well. This bias in estimates of QTL location may be strongly reduced by considering the mode of the posterior mode estimates. For example, for the QTL position at the 4th linkage group the estimated mode was equal to 89 cM in all scenarios and for both thresholds (results not shown). The accuracy, as represented by the standard deviation of QTL location, was always lower for the UNR model than for the TIBD and LAAM models.
The accuracy of location estimates decreased (standard deviation estimates increased) for smaller QTLs (Table 4). The TIBD and LAAM models yielded very similar results for power and accuracy. The effect sizes were always underestimated for all three models, more severely for the UNR model, and this was also likely due to the estimation procedure of averaging along the whole linkage group.
SQI threshold equal to 0.5
Posterior inferences results on replicated datasets (100 replicates), using the same SQI threshold of 0.50 for all scenarios of analysis
| | LG 1 | LG 2 | LG 3 | LG 4 | LG 5 |
|---|---|---|---|---|---|
| Simulation | | | | | |
| Position | 22.0 | 44.0 | 66.0 | 88.0 | |
| Effect | 1.30 | 1.10 | 0.90 | 0.70 | |
| Segregation^{a} | 82 | 92 | 88 | 87 | n.r. |
| UNR | | | | | |
| Power^{b} | 66 | 63 | 29 | 21 | 2^{e} |
| Location^{c} | 22.4 (7.7) | 44.9 (17.0) | 64.3 (14.3) | 78.5 (30.2) | |
| Effect^{d} | 1.40 (0.23) | 1.16 (0.29) | 1.27 (0.31) | 1.22 (0.46) | |
| TIBD | | | | | |
| Power | 73 | 77 | 56 | 43 | 1^{e} |
| Location | 22.0 (4.8) | 42.3 (7.4) | 64.4 (11.2) | 80.6 (19.1) | |
| Effect | 1.31 (0.21) | 1.11 (0.21) | 1.01 (0.29) | 0.89 (0.31) | |
| LAAM | | | | | |
| Power | 73 | 77 | 56 | 43 | 1^{e} |
| Location | 21.8 (4.1) | 42.1 (7.8) | 63.6 (11.2) | 80.4 (22.2) | |
| Effect | 1.30 (0.20) | 1.12 (0.18) | 1.02 (0.29) | 0.89 (0.28) | |
Discussion
We present a novel approach to efficiently include genome-wide ancestral IBD information on parent alleles into the QTL analyses of multiple connected populations. Analysis of simulated data indicates improvement of mapping accuracy and power when genetic relationships between parents are modeled as opposed to treating the parents as independent. Two algorithms were implemented and tested. The threshold-algorithm benefits from ease of implementation and interpretation but may yield a crude classification of founder alleles, especially when PIBD probabilities are more intermediate between 0 and 1. Furthermore, consistency of classification needs to be checked via transitivity rules. The latent class algorithm is conceptually more appealing as it provides a more precise representation of the original IBD information along the genome. In our simulated datasets these two algorithms yielded the same posterior conclusions. Our current results indicate that the threshold values of 0.90 and 0.80 in the TIBD model yield very similar posterior results and mixing behavior; however, results and performance may become different with further lowering of this threshold. These implementation issues of the TIBD model are subject to further research.
The comparison of our proposed PIBD approach to a full pedigree analysis was impractical as the high marker density in the ancestral pedigree (Fig. 1) creates a major missing data problem in the mapping pedigree. The progeny in the mapping populations in the pilot dataset would have missing marker scores for 135 out of 151 loci. A comparison of our novel approach with a full pedigree analysis with all individuals genotyped for the sparse density map showed that the full pedigree analysis was almost as powerful, but that computation time was dramatically increased because of the added number of individuals and the additional number of generations in the pedigree, which make the sampling algorithms more time consuming. Even in a much smaller simulated example, i.e., considering a single biparental mapping population, the full pedigree analysis required over 50 times more computation time (results not shown).
The novel approach to include genome-wide ancestral IBD information in QTL mapping can be further extended to include a polygenic component which may account for QTL that cannot be picked up in the linkage detection, cf. Bink et al. (2008). When modeling the polygenic component, a marker-based genome-wide average coancestry among founder individuals could be obtained from the PIBD matrices calculated for each chromosomal segment, cf. Nejati-Javaremi et al. (1997), an approach recently applied to genomic selection, e.g., by Habier et al. (2007). An alternative to this approach would be to use known pedigree relationships to construct the coancestry relationship matrix needed to account for the polygenic term.
In this study, we assume two alleles at a QTL which allows a straightforward extension to include non-additive effects, e.g., dominance and epistatic interactions. When primary interest is in additive gene actions, QTL models with many alleles may be advantageous to allow greater flexibility for panmictic populations (Hoeschele et al. 1997). However, two important implementation issues must be addressed. First, a multiple allele model may contain effects for alleles with little supporting phenotypic data and is thereby prone to less accurate results for QTL allelic effects (Hayashi and Iwata 2009). To draw accurate inference on the number of allelic effects, the allelic effects must differ substantially from each other (Jannink and Wu 2003). Second, the extension to dominance and higher-order interactions is not straightforward as many interaction effects will not be realized in the phenotypic data. The extension of our new approach to outbred populations will be straightforward in case dense marker data are available to unambiguously assign haplotypes to all parents of the mapping population. Then the dimensionality of the PIBD matrices simply doubles and the number of rows in matrix P also doubles. In our study we had access to accurate pedigree and marker data on ancestors of the mapping populations. When ancestral pedigree is unknown or DNA is not available on the members, the LD-based estimation method of Meuwissen and Goddard (2000) utilizing very dense marker data can be applied to obtain the location-specific parental IBD matrices in outbred and inbred populations (Bink and Meuwissen 2004).
The UNR model was the point of departure in our Bayesian approach, and this model already accounts for the sharing of one or more common parents by multiple populations through a pedigree linkage approach. Other recent Bayesian approaches assume unique QTL allelic effects for each of the mapping parents (Hayashi and Iwata 2009), with simulated datasets containing a common reference parent in a star design. Fang et al. (2011) treated the QTL alleles of all mapping individuals as samples from a normal distribution with a covariance matrix proportional to the IBD matrix calculated from the marker information on the mapping offspring and their parents. Models that treat the QTL as a random effect in a mixed model can be fitted more efficiently using restricted maximum likelihood approaches. Mixed models for QTL mapping in real connected plant populations have been successful in wheat (Arbelbide and Bernardo 2006; Crepieux et al. 2005; Rosyara et al. 2009) and maize (van Eeuwijk et al. 2010), but the treatment of multiple QTL models is less straightforward for (closely) linked QTL, as was the case in our pilot dataset.
The simulated datasets used in this study reflect a typical connected population structure, as they contained 30 mapping populations derived from 16 connected parents with known ancestry up to 32 original founder individuals. These characteristics can easily be varied without changing the applicability of the method. For example, the idea of ancestral allele classes can also be applied to a single mapping population derived from two inbred parents. In that case, the ancestral PIBD data will indicate which genomic regions are shared by the two inbred parents, and these regions can be excluded a priori as candidates for harboring QTL. This type of information may substantially increase mapping precision but is fully ignored in other linkage methods. Furthermore, plant breeders may consider a large number of small mapping populations derived from a large number of parents, where these parents inherited a limited number of (unknown) ancestral alleles. The additional layer of ancestral allele classes can then substantially increase the power to associate phenotypic trait variation with genomic polymorphisms. The increase in numbers may require more efficient algorithms to include ancestral IBD information. The practicality of our new approach was well illustrated by the successful mapping of QTL in hybrid selection programs (van Eeuwijk et al. 2010).
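The single-population case mentioned above can be sketched as follows: given the IBD probability between the two inbred parents at a series of evaluated positions, contiguous runs above a cutoff are flagged as shared regions that need not be searched for segregating QTL. The function name `shared_regions`, the cutoff of 0.95, and the positions and probabilities are hypothetical and serve only to illustrate the idea.

```python
def shared_regions(positions, ibd_probs, threshold=0.95):
    """Return (start, end) intervals of consecutive evaluated positions
    at which the two parents are IBD with probability >= threshold.

    The default threshold is an illustrative assumption, not a value
    prescribed by the method.
    """
    if len(positions) != len(ibd_probs):
        raise ValueError("positions and ibd_probs must have equal length")

    regions = []
    start = None  # first position of the current shared run
    last = None   # most recent position inside the current run
    for pos, prob in zip(positions, ibd_probs):
        if prob >= threshold:
            if start is None:
                start = pos
            last = pos
        elif start is not None:
            # The run ended at the previous evaluated position.
            regions.append((start, last))
            start = None
    if start is not None:
        regions.append((start, last))
    return regions


# Made-up example: positions in cM along one linkage group.
positions = [0, 10, 20, 30, 40, 50]
ibd_probs = [0.99, 0.98, 0.10, 0.05, 0.97, 0.99]
excluded = shared_regions(positions, ibd_probs)
```

Here the intervals 0 to 10 cM and 40 to 50 cM are flagged as shared. The intervals are bounded by evaluated positions only; how far a shared region extends between two evaluated positions is left open in this sketch.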
The increasing availability of cheap and abundant markers opens new ways to advance genetic progress in plant and animal breeding programs, such as whole genome selection approaches (Bernardo and Yu 2007; Meuwissen et al. 2001). However, applying high-density (SNP) genotyping to all mapping populations grown within commercial breeding programs might still not be feasible for economic reasons. Therefore, a substantial discrepancy in marker density between elite (selected) breeding lines and regular breeding populations can occur. Our approach tackles this potential discrepancy and exploits the available sources of information efficiently to map important genomic regions affecting complex traits.
Acknowledgments
We acknowledge useful suggestions and comments of our colleagues at Biometris and Pioneer on this study and manuscript. The manuscript has greatly benefitted from the reviewers’ comments.
Conflict of interest
The authors declare no conflict of interest.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Appendix A
Theorem
Proof
Note that this theorem cannot be stated in terms of an average of P matrices.
Appendix B
Effective number of ancestral allele classes
Appendix C
Construction of incidence matrix of QTL effects to phenotypes
Note that alleles A_{4}, A_{5}, and A_{6} in matrix T_{C} are included for completeness; they were not transmitted in the example of Fig. 2.