The distribution of genetic parameter estimates and confidence intervals from small disconnected diallels
DOI: 10.1007/s00122-005-1957-0
- Cite this article as:
- Isik, F., Boos, D.D. & Li, B. Theor Appl Genet (2005) 110: 1236. doi:10.1007/s00122-005-1957-0
Abstract
The distributions of genetic variance components and their ratios (heritability and type-B genetic correlation) from 105 pairs of six-parent disconnected half-diallels of a breeding population of loblolly pine (Pinus taeda L.) were examined. A series of simulations based on these estimates was carried out to study the coverage accuracy of confidence intervals based on the usual t-method and several alternative methods. Genetic variance estimates fluctuated greatly from one experiment to another. Both general combining ability variance (σ^{2}_{g}) and specific combining ability variance (σ^{2}_{s}) had a large positive skewness. For σ^{2}_{g} and σ^{2}_{s}, a skewness-adjusted t-method proposed by Boos and Hughes-Oliver (Am Stat 54:121–128, 2000) provided better upper endpoint confidence intervals than t-intervals, whereas the two methods were similar for the lower endpoint. Bootstrap BCa-intervals (Efron and Tibshirani, An introduction to the bootstrap. Chapman & Hall, London 436 p, 1993) and Hall’s transformation methods (Zhou and Gao, Am Stat 54:100–104, 2000) had poor coverage. Coverage accuracy of Fieller’s interval endpoint (J R Stat Soc Ser B 16:175–185, 1954) and the t-interval endpoint was similar for both h^{2} and r_{B} for sample sizes n≤10, but for n=30 Fieller’s method was much better.
Introduction
Confidence intervals for genetic variance components and their ratios have been determined using a variety of techniques. Öfversten (1993) developed methods for obtaining exact F-tests of variance components for three unbalanced mixed linear models. The methods were extended and generalized by Christensen (1996) and Fayyad et al. (1996) to test the null hypothesis that a variance component is equal to zero. The standard errors of variance components and their ratios were also approximated with chi-square distributions using the expected mean squares (Lu and Graybill 1987). Assuming standard normal distribution properties, Knapp et al. (1987) defined parametric interval estimators for heritability, a ratio of additive genetic variance to phenotypic variance. They used F-distributions to establish exact confidence intervals for the point estimates. Knapp et al. (1989) used Monte Carlo simulation to estimate Jackknife intervals of family-mean heritability and expected selection response. Their methods were based on simulated data for a balanced one-factor random linear model and expected mean squares.
The literature cited above mainly focused on the precision of variance components and their functions for specific experiments. In practice, parents of a breeding population are grouped into smaller units, such as diallels, for mating and progeny testing. Diallel mating designs have been widely used in plant breeding programs to estimate genetic variances (Baker 1978). Usually, a small number of parents (4–12) are allocated to a diallel because of logistic difficulties and the cost-associated limitations of breeding and testing (Wilson et al. 1978; Cisar et al. 1982; Boyle 1987; Xiang et al. 2003). Each diallel group is then field-tested and analyzed separately to obtain variances. Thus, variance components may vary dramatically from one experiment to another because of the “bottleneck” from randomly sampling a small number of parents (Wright 1985). The question we address is: when there are numerous empirical variance components estimated by likelihood methods (ML or REML) and heritability estimates from different experiments (diallels), how can statistical inferences be made about the population?
We utilized a unique empirical data set from the North Carolina State University-Industry Cooperative Tree Improvement Program to make a statistical inference about a breeding population. For the second-cycle breeding of loblolly pine (Pinus taeda L.), 3,800 parents were grouped into six-parent disconnected half diallels. The parents were mated, and 15 crosses were produced for each diallel group (Li et al. 1999). Progeny of two diallels were tested at four locations. Two pairs of diallels were analyzed together to estimate variance components (Li et al. 1996). In this study, we utilized 105 variance components and their ratios from 105 pairs of diallels of the Coastal breeding population of loblolly pine. The data provided an excellent opportunity to examine the distribution properties of variance components from different experiments and to make statistical inference about them.
The objectives of the study reported here were to answer the following questions. (1) How are variance components and their ratios, estimated from individual diallels, distributed? (2) Depending on the distribution properties, what is the true coverage of t-confidence intervals? (3) How can we improve the true coverage of confidence intervals if there are significant departures from the normal distribution?
Materials and methods
Data
The model was analyzed with the SAS MIXED procedure by creating dummy variables for GCA effects (Xiang and Li 2001). The variance components of random effects were estimated using the default REML model-fitting option. In the data analysis, it was assumed that six parents were randomly assigned to each diallel and that the parent trees in a diallel represent a random sample of the whole breeding population. Diallel effects were negligible and were dropped from the mixed model. Individual-tree narrow-sense heritability (h^{2}) for each diallel was estimated as the ratio of additive genetic to phenotypic variance: h^{2}=4σ^{2}_{g}/(2σ^{2}_{g} + σ^{2}_{s} + 2σ^{2}_{gt} + σ^{2}_{st} + σ^{2}_{p} + σ^{2}_{e}). Type-B genetic correlations (r_{B}) were calculated as r_{B}=σ^{2}_{g}/(σ^{2}_{g} + σ^{2}_{gt}), which is the ratio of the family variance component to the sum of the family and genotype-by-environment interaction variances (Yamada 1962). This ratio has been used commonly to quantify the level of genotype-by-environment interaction in progeny trials (Burdon 1977). In this study, we used 105 variance components, heritabilities and type-B genetic correlations estimated from 105 pairs of diallels for 6-year height growth to make inferences about the Coastal breeding population of loblolly pine in the southern USA.
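As a minimal sketch of the two ratios defined above, the following Python snippet evaluates h^{2} and r_{B} from a set of variance components. The function names are hypothetical, and the input values are the population means reported in Table 1, used here only for illustration; the result is therefore a ratio of means, not an average of per-diallel ratios.

```python
def narrow_sense_h2(g, s, gt, st, p, e):
    """h^2 = 4*s2_g / (2*s2_g + s2_s + 2*s2_gt + s2_st + s2_p + s2_e)."""
    phenotypic = 2 * g + s + 2 * gt + st + p + e
    return 4 * g / phenotypic


def type_b_correlation(g, gt):
    """r_B = s2_g / (s2_g + s2_gt)."""
    return g / (g + gt)


# Mean variance components for 6-year height (Table 1), for illustration only.
means = dict(g=0.23, s=0.09, gt=0.05, st=0.02, p=0.31, e=3.69)

h2 = narrow_sense_h2(**means)
rb = type_b_correlation(means["g"], means["gt"])
print(round(h2, 3), round(rb, 3))
```

With these inputs the ratios come out near 0.20 and 0.82, slightly above the averaged per-diallel values of 0.19 and 0.79 in Table 1, which illustrates that a ratio of means and a mean of ratios need not agree.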
Statistical analysis
Table 1 Descriptive statistics of variance components, heritability and type-B genetic correlation for 6-year height for the Coastal breeding population of loblolly pine (N=105)
Variable^{a} | Mean | Standard deviation | Range | Coefficient of variation (%) | Skewness | Kurtosis |
---|---|---|---|---|---|---|
σ_{g}^{2} | 0.23 | 0.17 | 0.0–1.11 | 74 | 2.09 | 6.75 |
σ_{s}^{2} | 0.09 | 0.17 | 0.0–1.41 | 188 | 5.88 | 40.21 |
σ_{gt}^{2} | 0.05 | 0.03 | 0.0–1.15 | 65 | 1.13 | 1.70 |
σ_{st}^{2} | 0.02 | 0.03 | 0.0–1.14 | 155 | 1.89 | 3.03 |
σ_{p}^{2} | 0.31 | 0.17 | 0.0–0.69 | 53 | −0.40 | −0.33 |
σ_{e}^{2} | 3.69 | 1.04 | 1.96–6.49 | 28 | 1.10 | 0.72 |
h_{b}^{2} | 0.19 | 0.12 | 0.0–0.62 | 60 | 1.34 | 2.36 |
r_{B_b} | 0.79 | 0.17 | 0.0–1.00 | 21 | −1.93 | 5.56 |
Since the estimates h^{2}_{−unb} and r_{B−unb} are ratios of sample means, we used Fieller’s (1954) method to estimate their confidence intervals and compared these to nominal t-intervals. Fieller’s method is a clever way to get a confidence interval for the ratio of two means for random pairs. The idea is to define a new variable Z=Y-(μ_{y}/μ_{x})X, using the mean μ_{y} of the numerator variable Y and the mean μ_{x} of the denominator variable X. The Z variable has mean 0 and a variance that is estimated from the sample Z values. Fieller’s confidence interval then consists of all values of μ_{y}/μ_{x} such that t^{2}(Z)<t^{2}_{(0.975,n-1)}, where t(Z) is the t-statistic on the Z-variable and t^{2}_{(0.975,n-1)} is the square of the usual t-interval cut-off.
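The inequality t^{2}(Z)<t^{2}_{(0.975,n-1)} is quadratic in the candidate ratio, so the interval endpoints can be obtained as the two roots of that quadratic. The sketch below assumes paired observations and takes the t cut-off as an argument (for example 2.776 for n=5); the function name and the data are hypothetical, and this is not necessarily how the authors computed their intervals.

```python
import math
from statistics import mean


def fieller_interval(y, x, t_crit):
    """Fieller-type interval for mean(y)/mean(x) from paired samples.

    Solves n*(ybar - r*xbar)^2 < t_crit^2 * var(y - r*x) for r.
    Returns None when the solution set is not a bounded interval,
    which happens when the denominator mean is too close to zero.
    """
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    c = t_crit ** 2 / n
    # Quadratic a*r^2 + b*r + d < 0 in the ratio r.
    a = xbar ** 2 - c * sxx
    b = -2 * (xbar * ybar - c * sxy)
    d = ybar ** 2 - c * syy
    disc = b * b - 4 * a * d
    if a <= 0 or disc <= 0:
        return None
    root = math.sqrt(disc)
    return (-b - root) / (2 * a), (-b + root) / (2 * a)


# Hypothetical paired values (numerator y, denominator x), n = 5.
x = [4.0, 5.0, 6.0, 5.5, 4.5]
y = [1.0, 1.2, 1.1, 1.4, 0.9]
print(fieller_interval(y, x, t_crit=2.776))
```

Unlike a t-interval built on averaged ratios, this interval need not be symmetric about the point estimate mean(y)/mean(x).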
For planning purposes, how large should n be for estimating σ^{2}_{g}, σ^{2}_{s}, h^{2} and r_{B}? The sample size for a desired length of interval can be obtained from the standard formula, n=[t_{α/2,n-1} s/d]^{2}, where α/2 is the error rate (typically 0.025), s is the sample standard deviation (which must be guessed), and d is the margin of error. Recall that the t-interval has a length of 2t_{α/2,n-1}s/√n, or a margin of error of t_{α/2,n-1}s/√n. Later we give a numerical example for specific standard errors of heritability and type-B genetic correlation.
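The sample-size formula can be sketched as follows. Because the t cut-off itself depends on n, this illustration substitutes the standard normal quantile for t_{α/2,n-1}, which is an assumption that slightly understates n for small samples. The function names are hypothetical; s=0.12 is the standard deviation of heritability from Table 1, and the margin of error d=0.05 is an arbitrary illustrative choice.

```python
import math
from statistics import NormalDist


def margin_of_error(s, n, crit):
    """Half-length of the interval: crit * s / sqrt(n)."""
    return crit * s / math.sqrt(n)


def required_n(s, d, alpha=0.05):
    """Smallest n with (z * s / d)^2 <= n, using the normal quantile z
    as a stand-in for the t cut-off (an approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * s / d) ** 2)


# s = 0.12 is the Table 1 standard deviation of h^2; d = 0.05 is illustrative.
n = required_n(s=0.12, d=0.05)
print(n, round(margin_of_error(0.12, n, crit=1.96), 3))
```

In practice one could refine the result by replacing the normal quantile with the t cut-off for the n just obtained and recomputing until n stops changing.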
Results
Descriptive statistics
Variance components and their ratios fluctuated considerably among experiments within the breeding population (Table 1). General combining ability variance (σ^{2}_{g}) ranged from 0.0 to 1.11, whereas narrow-sense heritability (h^{2}) ranged from 0.0 to 0.62. Type-B genetic correlation (r_{B}) fluctuated between the lower (0.0) and upper (1.0) theoretical limits. High variation in variance components among diallels is reflected in high coefficients of variation (CV): r_{B} had the smallest CV (21%), whereas specific combining ability variance (σ^{2}_{s}) had the highest (188%) CV. σ^{2}_{g} and σ^{2}_{s} had a positive skewness of 2.09 and 5.88, respectively, indicating a significant departure from normality. The distribution of σ^{2}_{g} and σ^{2}_{s} estimates also had large kurtosis values. Plot-to-plot (σ^{2}_{p}) and within-plot (σ^{2}_{e}) error variances had the lowest skewness and kurtosis among the variance components.
Interval coverage of genetic variances (simulation results)
Table 2 Simulation results for the genetic variances σ_{g}^{2} and σ_{s}^{2}: miss-right (MR) and miss-left (ML) proportions, total coverage and interval length for nominal 95% t-intervals and BH-intervals for sample sizes n=5, 10, 30, and 100. Entries are based on 1,000 Monte Carlo replications. The standard errors of the miss rates, coverage estimates, and interval lengths are all approximately 0.01
Variable | n | MR (BH) | MR (t) | ML (BH) | ML (t) | Total coverage (BH) | Total coverage (t) | Interval length (BH) | Interval length (t) |
---|---|---|---|---|---|---|---|---|---|
σ_{g}^{2} | 5 | 0.01 | 0.01 | 0.11 | 0.11 | 0.89 | 0.88 | 0.43 | 0.37 |
σ_{g}^{2} | 10 | 0.01 | 0.01 | 0.07 | 0.09 | 0.92 | 0.91 | 0.24 | 0.22 |
σ_{g}^{2} | 30 | 0.01 | 0.01 | 0.06 | 0.08 | 0.93 | 0.91 | 0.13 | 0.12 |
σ_{g}^{2} | 100 | 0.02 | 0.01 | 0.03 | 0.05 | 0.96 | 0.95 | 0.07 | 0.07 |
σ_{s}^{2} | 5 | 0.00 | 0.00 | 0.19 | 0.22 | 0.81 | 0.78 | 0.31 | 0.25 |
σ_{s}^{2} | 10 | 0.00 | 0.00 | 0.17 | 0.23 | 0.83 | 0.78 | 0.19 | 0.17 |
σ_{s}^{2} | 30 | 0.00 | 0.00 | 0.19 | 0.25 | 0.80 | 0.75 | 0.11 | 0.11 |
σ_{s}^{2} | 100 | 0.01 | 0.00 | 0.11 | 0.15 | 0.88 | 0.85 | 0.07 | 0.06 |
Because of the high ML error rates, the true coverage of nominal 1-α t-intervals was much lower than desired. For example, at n=30, the true coverage of 95% t-intervals for σ^{2}_{s} was only 75%. In other words, σ^{2}_{s} would fall outside the nominal 95% t-interval about 25% of the time. The total coverage of t-intervals was better for σ^{2}_{g} than for σ^{2}_{s}. For the same sample size, the true coverage of the nominal 95% interval for σ^{2}_{g} was about 91%, only 4 percentage points below the nominal level. As the sample size increased, the accuracy of t-intervals for both genetic variances improved. At n=100, the true coverage of the nominal 95% t-interval reached 95% for σ^{2}_{g} and 85% for σ^{2}_{s}. The interval lengths for both genetic variances (0.06–0.07) were also considerably smaller than at n=30.
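The way positive skewness inflates the miss-left rate can be illustrated with a small Monte Carlo sketch. This is not the authors' simulation: it draws from an exponential distribution (skewness 2, close to that reported for σ^{2}_{g}) as a stand-in population, and it assumes that "miss left" means the whole interval lies below the true mean. The function name, seed and number of replications are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def t_interval_coverage(n, t_crit, reps=2000, seed=1):
    """Estimate miss-left, miss-right and total coverage of the t-interval
    for the mean of an Exp(1) population (true mean 1, skewness 2)."""
    rng = random.Random(seed)
    true_mean = 1.0
    miss_left = miss_right = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        centre = mean(sample)
        half = t_crit * stdev(sample) / math.sqrt(n)
        if centre + half < true_mean:
            miss_left += 1   # interval lies entirely below the true mean
        elif centre - half > true_mean:
            miss_right += 1  # interval lies entirely above the true mean
    ml, mr = miss_left / reps, miss_right / reps
    return ml, mr, 1 - ml - mr


# t cut-off for a nominal 95% interval with n = 10 (9 degrees of freedom).
print(t_interval_coverage(n=10, t_crit=2.262))
```

In skewed samples a low sample mean tends to come with a low sample standard deviation, so the intervals that fall short of the true mean are also the short ones; this is consistent with the asymmetry between the ML and MR columns.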
The BH-intervals improved the coverage probability for σ^{2}_{s}, whereas the improvement over the t-intervals for σ^{2}_{g} was modest (Table 2). Bootstrap BCa and Hall’s transformed intervals had poorer coverage than BH-intervals and are therefore not reported. The true coverage of nominal 95% BH-intervals for sample size n=30 was around 80% for σ^{2}_{s} and 93% for σ^{2}_{g}, an improvement over the t-intervals of 5 percentage points for σ^{2}_{s} and 2 percentage points for σ^{2}_{g}. The total coverage of BH-intervals for σ^{2}_{g} was thus only 2 percentage points below the nominal 95%. For smaller sample sizes, the BH-intervals were longer than the t-intervals. However, as the sample size increased, the difference in interval length between t and BH diminished, although the improvement in total coverage remained about the same.
Interval coverage of h^{2} and r_{B}
Table 3 Simulation results for nominal 95% Fieller and t-confidence intervals for heritability (h^{2}) and type-B genetic correlation (r_{B}) based on sample sizes n=5, 10, 30, and 100. Entries are based on 1,000 Monte Carlo replications. The standard errors of the miss rates, coverage estimates, and interval lengths are all about 0.01
Variable | n | MR (Fieller) | MR (t) | ML (Fieller) | ML (t) | Total coverage (Fieller) | Total coverage (t) | Interval length (Fieller) | Interval length (t) |
---|---|---|---|---|---|---|---|---|---|
h^{2} | 5 | 0.00 | 0.00 | 0.09 | 0.10 | 0.91 | 0.90 | 0.27 | 0.25 |
h^{2} | 10 | 0.00 | 0.00 | 0.09 | 0.10 | 0.91 | 0.90 | 0.17 | 0.16 |
h^{2} | 30 | 0.01 | 0.00 | 0.07 | 0.11 | 0.92 | 0.89 | 0.09 | 0.09 |
h^{2} | 100 | 0.01 | 0.00 | 0.04 | 0.12 | 0.95 | 0.88 | 0.05 | 0.05 |
r_{B} | 5 | 0.02 | 0.03 | 0.02 | 0.02 | 0.96 | 0.96 | 0.49 | 0.35 |
r_{B} | 10 | 0.02 | 0.02 | 0.03 | 0.04 | 0.96 | 0.94 | 0.20 | 0.22 |
r_{B} | 30 | 0.02 | 0.00 | 0.04 | 0.18 | 0.95 | 0.82 | 0.10 | 0.12 |
r_{B} | 100 | 0.02 | 0.00 | 0.03 | 0.69 | 0.94 | 0.31 | 0.05 | 0.07 |
Discussion
Variation
Partitioning of parents into small groups of diallels for breeding is commonly employed for monoecious woody perennials such as conifers (Yanchuk 1996; Yeh and Heaman 1987; Johnson and King 1998). Despite the several advantages of using diallels, genetic sampling of parents from a large population for a given diallel is subject to random genetic drift. Genetic differences among diallel groups may cause significant variation in estimates of genetic variances (Hill 1985). In our study, over 100 variance component estimates from six-parent disconnected half-diallels were used to make inferences about a breeding population. Thus, the observed large variation of variance components among individual diallels was not surprising. Similar large differences in genetic variance estimates between diallels were also observed in Douglas-fir (Pseudotsuga menziesii var. menziesii) (Yeh and Heaman 1987) and in another population of loblolly pine (Balocchi et al. 1993).
In this study, among the variance components studied, σ^{2}_{s} had a wider range and higher coefficient of variation than any other estimates. This is partly due to the nature of σ^{2}_{s} estimation from the diallels. The estimation of σ^{2}_{s} is based on specific crosses of one parent and is subject to larger standard errors (Yanchuk 1996). On the other hand, σ^{2}_{g} is estimated from many more progenies in a diallel with a smaller variance, as shown by the smaller coefficient of variation. SCA by environment interaction variance (σ^{2}_{st}) followed σ^{2}_{s} in having a large variance. The error variance components (σ^{2}_{p}, σ^{2}_{e}) can be approximated with a normal distribution as they had a smaller skewness and kurtosis than the genetic variances.
In general, the distribution of h^{2} followed the distribution pattern of σ^{2}_{g}. This is expected because h^{2} is largely determined by σ^{2}_{g}. However, h^{2} had a smaller kurtosis value than σ^{2}_{g}. Although r_{B} is also a function of σ^{2}_{g}, we observed a negative skewness for r_{B}. This is mainly due to the high frequency of large σ^{2}_{gt} values and sporadic zero σ^{2}_{gt} values. The kurtosis of r_{B} was similar to that of σ^{2}_{g}.
Error rates and the coverage of the intervals
The dependence of coverage probabilities on skewness has been covered extensively in the literature (Chaffin and Rhiel 1993; Chen 1995; Boos and Hughes-Oliver 2000). In the present study, the very high ML error rates of intervals observed for σ^{2}_{g} and σ^{2}_{s} were mainly the effect of a large skewness in the distributions of these two genetic variances. Among the variance components studied, σ^{2}_{s} had a larger skewness and greater standard deviation than any other variance component. Thus, the high error rates of two-sided intervals for σ^{2}_{s} are not surprising. If breeders simply average σ^{2}_{s} variances from different experiments to obtain a mean with a nominal 95% confidence interval, then the ‘true’ σ^{2}_{s} mean would likely be outside that interval at least 25% of the time. We suggest using the BH skewness-adjusted t-intervals for variance components such as σ^{2}_{s} and σ^{2}_{g}. Although the BH intervals were longer than the t-intervals, the improvement can be substantial for smaller samples. The length of BH-intervals approaches the length of nominal t-intervals as the sample size increases. The drawback of the BH intervals is that the right-side coverages are not perfect. In fact, the t and BH right-side error rates are similar, and they converge slowly to 0.025, but even n=100 is not large enough to see the convergence.
Knapp et al. (1987, 1989) found similarities between nominal interval estimates of point heritability and variance components. In their study, confidence interval widths of heritability decreased slightly as sample size increased. Their heritability estimates and precision statistics were based on an analysis of variance and expected mean squares rather than on maximum likelihood estimation procedures. When the error rates were taken into consideration, Fieller’s intervals for h^{2} and r_{B} were not superior to t-intervals, particularly for small sample size n. However, one needs to consider the length of the interval along with the coverage. Fieller’s intervals were superior to t-intervals as shown by narrower interval lengths for both h^{2} and r_{B}. The decrease in length of both Fieller’s intervals and t-intervals in our study was clearly pronounced as the sample size increased. We observed a sharp drop in the coverage of the t-interval for r_{B} at n=30 and n=100. This is due to the bias of averaging r_{B} values.
Accuracy of the population statistics and the sample size n
Conclusions
The genetic variance component estimates σ^{2}_{g} and σ^{2}_{s} from disconnected diallels for 6-year height growth had a considerably large positive skewness. The results are based on empirical estimates of variance components from a breeding population of loblolly pine. Because of this skewness, the true coverage of nominal 95% t-intervals for genetic variances would likely be less than the nominal coverage. The BH-intervals adjusted for skewness provided better coverage than regular t-intervals. However, BH-intervals are not optimal: their miss-right error rates were as poor as those of t-intervals. In contrast to genetic variances, h^{2} and r_{B} had smaller skewness, but because they are ratios of variance components, it is wise to use Fieller’s intervals rather than t-intervals for h^{2} and r_{B}, although regular t-intervals are not much worse in small samples. When statistics (mean, standard deviation, and intervals) are estimated from a sample of genetic variances and their functions, the distribution properties should be taken into account for more reliable inferences.