Background

In genetics, heterogeneity is a major feature of human traits. Genetic heterogeneity occurs when the same or clinically indistinguishable phenotypes are caused by different genetic factors. This can be due to multiple variants located in the same locus (allelic heterogeneity) or to mutations located in different loci (locus heterogeneity).

The focus of this work is locus heterogeneity, specifically heterogeneity caused by having an unknown subset of pedigrees in a sample being unlinked to a disease locus while the rest are linked [1, 2].

There are many reported examples of locus heterogeneity, including breast cancer [36], maturity-onset diabetes of the young (MODY) [7], epilepsy [8], early-onset Alzheimer's Disease [9], rheumatoid arthritis [10], non-polyposis colorectal cancer [11], non-syndromic hearing loss [1214] and retinitis pigmentosa [1517].

Locus heterogeneity can substantially affect the power of linkage and association analyses [1827]. In linkage analysis, there are many examples of methods that address this issue. For example, we have: the M test [28] (also known as K-test [29, 30]), a likelihood ratio test (LRT) that estimates the value of the (assumed fixed) recombination fraction (θ) for each pedigree in a sample; the B-test [29], which is a more powerful version of the M-test that assumes an underlying beta null distribution for each estimated θ; the admixture test (A-test), which is based on the difference between the log-likelihood of the admixture model (data are composed of linked and unlinked families) and the homogeneity model (families are all linked with a common θ) [2, 3136]; the D-test [30], a combination of the A and B tests and finally, the C-test [19], which is based on the M-test and for which the underlying null probability distribution is determined by simulation. The M and B tests were originally developed to identify different values of θ for different pedigrees. For the A-test, families are grouped into two types: a proportion a that are linked to the disease locus (θ < 1/2) and a proportion 1- α that are unlinked (θ = 1/2) [1, 2]. As contrasted with M and the B tests, which place pedigrees into classes a priori, the A test accounts for heterogeneity by maximizing the standard log-odds (LOD) score [37] over α and θ. That is, each pedigree has some probability of being in the linked or unlinked group. This statistic is known as the heterogeneity LOD score (HLOD) [38].

The A-test has been implemented in a suite of programs to test for heterogeneity vs. homogeneity (HOMOG) [38]. More complex heterogeneity scenarios are also available in this package: HOMOG1 allows for gender specific differences in θ. HOMOG2, HOMOG3, HOMOG4, distinguish two, three and four types of families respectively, each linked to different disease loci on the same chromosome. HOMOG3R is a special case of HOMOG3 where there are three family classes: the first class is linked to a given marker; the second is linked to another marker on a different chromosome and the third is linked to neither marker. Lastly, HOMOGM [39], an extension of HOMOG3R, allows for any number of disease loci.

It is important to mention linkage analysis methods for quantitative trait loci (QTL) that account for locus heterogeneity in the analysis. Yang et al. [40] proposed a QTL mapping model for sib pair data. Knight et al. [41] and Ekstrøm et al. [42] independently developed LRT-based models in which the underlying null probability distributions are determined by simulation while Wang and Peng [43] proposed three test statistics with known null asymptotic distributions. It appears that relatively fewer publications considering locus heterogeneity for association have been published as compared with heterogeneity for linkage. When using the search terms "(locus heterogeneity) AND (linkage)" in ISI Web of Knowledge, we retrieve a total of 2,418 titles. By contrast, using the using the search terms "(locus heterogeneity) AND (association)", we retrieve a total of 884 titles, an almost 67% reduction. Having documented that, we do note that methods to address locus heterogeneity for association-based methods have been developed.

Latent class models [44] have been used to estimate membership-class probabilities for individuals with similar genetic backgrounds [4548].Ordered Subset Analysis (OSA)-based models have been extended to association, including the sequential addition (SA) procedure [49] and the OSA case-control (OSACC) method [50]. For family-based data, the OSA-TDT [51] applies OSA to the transmission disequilibrium test (TDT) [52], and the APL-OSA [53] similarly applies OSA to the "association in the presence of linkage" test (APL) [54].

Yang et al. [55] extended the Posterior Probability of Linkage (PPL) method to one that incorporates linkage disequilibrium information between marker and disease alleles. Huang et al. [56] extended the PPL method to case-control data. These methods maintain all the features of the original PPL method for linkage, namely, they do not require correction for multiple testing and they can sequentially update information across multiple data sets.

Wang and Huang [22] developed two LRT extensions of the HLOD: the LD-Het for general pedigrees and the LD-multinomial for affected sib pair data. Here, LD stands for linkage disequilibrium. Schmidt et al. [57] proposed using a two-stage linkage/association approach for affected sib pair data. Finally, Zhou and Pan [58] used a mixture model to allow for locus heterogeneity in a case-control design.

The purpose of this work is the development of a new test statistic that we call TDT-HET, that allows for locus heterogeneity when applying the TDT statistic. This work is largely motivated by the recent work of Zhou and Pan [58]. As in their paper, our statistic is based on an underlying mixture model. We apply an expectation-maximization (EM) algorithm to compute log-likelihoods of the data under null and alternative hypotheses. The EM algorithm also produces maximum likelihood estimates of parameters such as the probability that a heterozygous parent transmits the disease allele to an affected child, the probability that a trio (mother, father, affected child) is linked to the locus in question, and the probability that certain trio types (determined by the constellation of genotypes) are linked to the locus being studied. In addition, we extend our TDT-HET method to statistic that can evaluate multiple loci jointly. This extension is motivated by and similar to the work of Hoh, Ott, and colleagues. They called their method SumStat [5962].

For both single-locus and multi-locus simulations, we evaluate the type I error rate and the power of the TDT-HET method to detect association. In addition, we apply the TDT-HET method to candidate loci from a study of idiopathic scoliosis trios to determine if there is any suggestion of locus heterogeneity at the loci considered, and whether the results suggest evidence for association in the presence of heterogeneity.

Methods

Notation

Much of the notation we use comes from the work of Zhou and Pan [58], who developed a test statistic for case-control data that allows for locus heterogeneity. Also, much of the TDT notation comes from the work of Schaid and Sommer [63]. Here we present notation used in the main body of this work. A fuller notation list may be found in the additional file 1, Appendix (Notation section).

M = The disease allele at the putative disease SNP locus.

N = The non-disease allele at the putative disease SNP locus.

x abc = The trio where parent 1, parent 2, and affected child have a, b, and c copies of the M allele at the putative disease locus (range for all copies: 0 - 2). For example, x222 is the trio with mating type MM × MM and affected child genotype MM. Throughout this work, we will use the notation abc interchangeably with x abc .

n abc = The number of trios x abc in the sample.

n = The total number of trios in the study.

D = Event that the child in a trio is affected.

A = Event that individual in a population is affected.

ϕ = Pr(A) = Disease prevalence.

f i = Pr(A|i copies of M allele in individual's genotype) = Disease penetrances, i = 0,1,2.

R i = f i f 0

, i = 1,2 = genotype relative risks (GRR) [63]. R1 corresponds to the heterozygote GRR and R2 to the GRR for disease allele homozygote. We consider three kinds of disease modes of inheritance (MOI) in this work: R1 = R2 (dominant); R1 = 1 (recessive); R 1 2 = R 2 (multiplicative).

p = Pr(M) = Disease allele frequency (DAF).

q = Pr(N) = 1 - p = Non-disease allele frequency.

t = Pr(heterozygous parent transmits M allele to affected offspring). In this work, the null hypothesis, H0, is t = 0.5. The alternative hypothesis, H1, is t ≠ 0.5.

μk,i= Pr(Mating type = i|D, pop = k) = probability that the mating type is i given that the child is affected and the trio comes from the kth population, 1 ≤ k ≤ 2. Throughout this work, we shall use the notation k = 1 to indicate that the trio is in the linked population (t ≠ 0.5) and k = 2 to indicate that the trio is in the unlinked population (t = 0.5). Similar to Schaid and Sommer [63], we consider 6 mating types in this work. We recognize that other models, such as those considered by Weinberg and colleagues [64, 65], require more than six mating type frequencies. We conjecture that our work extends to such situations.

π1 = Pr(trio is linked to trait locus) = Pr(t ≠ 0.5). In this work we specify that t is the same for all linked trios. This specification is also made for the recombination fraction in some tests of linkage allowing for heterogeneity (see, e.g., work by C. A. B. Smith [1, 2] and Ott [38], specifically the method implemented in programs such as HOMOG [38], GENEHUNTER [66], SIMWALK2 [67], VITESSE [68], MERLIN [69], and other programs).

π 2 = 1 - π 1 .
x ^

= Maximum likelihood estimate (MLE) of the parameter x. This MLE is determined by means of the EM algorithm.

z k,j = The indicator variable for population k and trio x j , where the subscript j indicates the jth trio in the sample.

τ k , j ( r ) = r th

iteration step estimate that the jth trio is in the kth population, k = 1,2. Without loss of clarity, we will use sometimes write τ k , a b c ( r ) , where abc refers to the trio x abc (see above).

TDT-HET Test Statistic

The TDT-HET statistic is a likelihood ratio statistic. Log-likelihoods under the null hypothesis, H0: t = 0.5 or π1 = 0, and under the alternative hypothesis,H1 : t ≠ 0.5 and π1 ≠ 0, are computed by maximizing these parameters for the observed data. We compute the maximum likelihood estimates under H0 and H1 using the Expectation-Maximization method [70]. P-values are computed using permutation methods. Full details are provided in the additional file 1, Appendix (TDT-HET Statistic section).

All trios drawn from a population with one set of parental mating types

Simulations

Single locus

To evaluate the type I error rate and power of the test statistic under different scenarios, we perform simulations. In this section, we describe simulations where we consider type I error rate and power for a single disease locus that has been genotyped. The parameter settings that we consider are presented in Table 1.

Table 1 Simulation parameter settings for the single-locus simulations

We comment that, in item 3 in Table 1, we specify that the disease locus is in Hardy Weinberg Equilibrium. In our simulations, we use the value p to determine the mating type frequencies. Specifically, we specify random mating in the single-locus simulations, so that the mating-type frequencies μ i are the products of the parental genotype frequencies, which themselves are determined by the allele frequency p according to Hardy-Weinberg Equilibrium. For example, the frequency of the mating-type MN × NN is 2 × (2pq) × q2 = 4pq3, where q = 1 - p is the frequency of the N allele. Schaid and Sommer provide similar results in their Table 1 [63]. While we do not simulate non-Hardy-Weinberg situations in our single-locus simulations, we do so in our multi-locus simulations (see below).

Multi-locus

To evaluate the TDT-HET statistic for multiple loci, we apply a slight variant of the "SumStat " procedure developed by Hoh, Ott, and colleagues [5962]. While these researchers consider sums of ever-increasing number of SNPs, in this work, we consider just the full sum. Specifically, for each of the k loci, 1 ≤ kL, where L is the number of loci in the simulation, we compute TDT-HET(k), the value of the statistic at the kth locus. We then compute:

S u m S t a t = k = 1 L T D T - H E T ( k ) .
(1)

Empirical significance levels are determined through permutation. Since each locus k has 500 permuted TDT-HET(k) statistics associated with it, we can compute the permuted SumStat statistics for each permutation number (1 to 500) (This value changes to 100,000 for the Idiopathic scoliosis Candidate Loci - see below). The empirical significance level is defined as the proportion of SumStat values that exceed the SumStat value for the observed data. Table 2 lists the parameter settings for our multi-locus simulations (those parameters not listed there are the same as listed in Table 1).

Table 2 Simulation parameter settings for the multi-locus simulations

Here, we consider 4 correlated SNPs in each simulation. Mating types for the first locus (labeled MT[1] [i]) are determined using the disease allele frequency p (setting 2b in Table 2). As above, each mating type i = 1, ..., 6 will have its frequency determined using HWE proportions. See above (Simulations - Single Locus) for formulas determining mating type frequencies for the first locus. For each consecutive locus l, 2 ≤ l ≤ 4, the mating type frequencies for the ith mating type is determined in the following fashion.

  1. 1.

    Define [l][i] = ρ × MT[l - 1][i] + (1 - ρ) × X, where X~U(0,1).

  2. 2.

    Compute SumY [ l ] = i = 1 6 Y [ l ] [ i ] .

  3. 3.
    MT [ l ] [ i ] = Y [ l ] [ i ] S u m Y [ l ]

    .

Note that, if ρ = 1 (perfect correlation), then the mating type frequencies for each locus are identical. If ρ = 0 (no correlation), then each locus has mating type frequencies that are essentially random numbers that sum to 1. In the Results section, power is computed at the 1% significance level (see below).

Idiopathic Scoliosis Candidate Loci

We applied our method to a dataset that included selected loci from our published genome-wide association study (GWAS) of adolescent idiopathic scoliosis (AIS) [71]. Briefly, AIS is a common spinal deformity with a prevalence of ~3% in school age children worldwide. The underlying genetics of AIS are generally complex and heterogeneity is apparent [71, 72]. In the work presented here we selected genotypes for five loci derived in a total of 447 trios (1849 samples) from 447 families that were included in our previous publication [71]. Of the five loci, four (rs1400180, rs10510181, rs1040315, and rs2222973) were selected due to their significance by TDT analysis, their evidence of clustering, and their proximity to genes of potential biological relevance. We also selected an additional locus, rs11770843, because of its proximity to haplotypes previously linked and associated with AIS [73].

While we keep a number of the settings fixed (Table 1, settings 8-9), we alter the number of permutations per statistic to 100,000. Note that this number is much larger than the number performed in our simulation studies. The reason for this is that we are analyzing far fewer markers here than in our simulations, so time/CPU constraints are not really an issue. Also, the SumStat P-value is based on 100,000 permutations, since we have 100,000 permutation TDT-HET statistics for each locus.

As a comparison, we compute the TDT statistic [52] as implemented in the PLINK software [74]. We also compute point-wise and family-wise permutation p-values (labeled Emp1 and Max(T), respectively by Purcell et al. [74]). The Max(T) permutation statistic is based on the maximum observed test per permutation and so accurately reflects the family-wise error rate in the presence of LD.

While this description is for a genome-wide study, we consider only the situation max(T) applied to 5 candidate SNPs. We compare the max(T) statistic to our Bonferroni-corrected maximum TDT-HET SumStat statistic (corrected over 2 chromosomes, since one chromosome has one locus).

Results

Simulations

Null hypothesis (Type I error rate)

R 2 = 1. 0

For the situation where R2 = 1.0, we present empirical type I error rates at the 5% and 1% levels in Figure 1. The minimum observed type I error rate at the 5% level for TDT-HET is 0.04 (-log(0.04) = 1.46), which occurs for the settings: ϕ = 0.05, π1 = 0.50, p = 0.10, and the maximum observed type I error rate is 0.07 (-log(0.07) = 1.17), which occurs for the settings: ϕ = 0.15, π1 = 0.50, p = 0.75. The median type I error rate is 0.05.

Figure 1
figure 1

Empirical type I error rates for TDT-HET and TDT statistics. Here we present the empirical type I error rates for the TDT-HET and TDT statistics for different settings of the prevalence (0.05 or 0.15), DAF (0.10, 0.25, 0.50, 0.75, 0.90), π1 (0.25, 0.50, 0.75, 1.00) at two different significance levels (5%, 1%). The various plotted shapes in the figure represent empirical type I error rates (-log-transformed) for a fixed setting of the parameters. Solid square = TDT-HET, 5% Empirical Type I Error Rate. Hollow diamond = TDT, 5% Empirical Type I Error Rate. Solid circle = TDT-HET, 1% Empirical Type I Error Rate. Hollow triangle = TDT, 1% Empirical Type I Error Rate.

At the 1% level, the minimum observed type I error rate for TDT-HET is 0.006 (-log(0.006) = 2.22), which occurs for the settings: ϕ = 0.05, π1 = 0.25, p = 0.25, and the maximum observed type I error rate is 0.02 (-log(0.02) = 1.74), which occurs for the settings: ϕ = 0.15, π1 = 0.50, p = 0.75. The median type I error rate is 0.01.

Given that the type I error rate is computed over 250 replicates for each simulation vector setting in Table 1, we can use the method implemented in the BINOM program [38] to compute exact 95% confidence intervals for each empirical type I error rate. For the minimum and maximum empirical rates presented above from Figure 1, BINOM indicates that 0.05 and 0.01 are contained in in each respective 95% confidence interval. In addition, in Figure 1 we include linear trend lines using the method implemented in the MS Office 2007 Excel Spreadsheet software. Note that the 5% and 1% trend lines are very close to the constant lines y = 1.30 and y = 2.00, which are the -log-transformed values of 0.05 and 0.01, respectively. This result suggests that the TDT-HET maintains the correct type I error rate under the null hypotheses.

As a confirmation of our simulation code, we comment that the minimum observed type I error rate at the 5% level for TDT is 0.03 (-log(0.03) = 1.55), the maximum observed type I error rate is 0.06 (-log(0.06) = 1.19), and the median type I error rate is 0.05. At the 1% level, the minimum observed type I error rate for TDT is 0.004 (-log(0.004) = 2.40), the maximum observed type I error rate is 0.02 (-log(0.02) = 1.72), and the median type I error rate is 0.01. These results suggest that our simulation code is correctly simulating null data.

Alternative hypotheses (Power)

Single Locus

In Figures 2, 3 and 4 we present contour plots of empirical powers of the TDT-HET method for a single locus at the 5% significance level for dominant, multiplicative and recessive MOIs, respectively. The contour plots provide empirical power ranges as a function of the proportion of linked trios π1 and DAF (p), where settings of the input values are stated in Table 1, items 4 and 5). Other parameter settings that are used in these simulations are also provided in Table 1. In each of these figures, the prevalence is 0.05.

Figure 2
figure 2

Contour plot of TDT-HET empirical power for Dominant MOI at 5% significance level. Empirical powers at the 5% significance level for prevalence (ϕ) equal to 0.05; DAF (p) equal to 0.10, 0.25, 0.50, 0.75, 0.90 and π1 equal to 0.25, 0.50, 0.75, 1.00. Each contour in the figure represents a range of empirical power values. There are five contours, corresponding to power ranges (x, x + 0.20), where x = 0.00, 0.20, 0.40, 0.60, 0.80. For example, the black contour represents the power range (0.00, 0.20). The light gray contour contiguous to the black contour represents the power range (0.20, 0.40) and so forth. The lightest contour represents the power range (0.80, 1.00).

Figure 3
figure 3

Contour plot of TDT-HET empirical power for Multiplicative MOI at 5% significance level. Empirical powers at the 5% significance level for prevalence (ϕ) equal to 0.05; DAF (p) equal to 0.10, 0.25, 0.50, 0.75, 0.90 and π1 equal to 0.25, 0.50, 0.75, 1.00. Each contour in the figure represents a range of empirical power values. There are five contours, corresponding to power ranges (x, x + 0.20), where x = 0.00, 0.20, 0.40, 0.60, 0.80. For example, the black contour represents the power range (0.00, 0.20). The light gray contour contiguous to the black contour represents the power range (0.20, 0.40) and so forth. The lightest contour represents the power range (0.80, 1.00).

Figure 4
figure 4

Contour plot of TDT-HET empirical power for Recessive MOI at 5% significance level. Empirical powers at the 5% significance level for prevalence (ϕ) equal to 0.05; DAF (p) equal to 0.10, 0.25, 0.50, 0.75, 0.90 and π1 equal to 0.25, 0.50, 0.75, 1.00. Each contour in the figure represents a range of empirical power values. There are five contours, corresponding to power ranges (x, x + 0.20), where x = 0.00, 0.20, 0.40, 0.60, 0.80. For example, the black contour represents the power range (0.00, 0.20). The light gray contour contiguous to the black contour represents the power range (0.20, 0.40) and so forth. The lightest contour represents the power range (0.80, 1.00).

Each contour in each figure represents a range of empirical power values. In each figure, there are five contours, corresponding to power ranges (x, x + 0.20), where x = 0.00, 0.20, 0.40, 0.60, 0.80. For example, the black contour represents the power range (0.00, 0.20). The light gray contour contiguous to the black contour represents the power range (0.20, 0.40) and so forth. The lightest contour represents the power range (0.80, 1.00).

Studying these figures, we can draw a number of conclusions. First, we see that, independent of the disease MOI, as the proportion of linked trios π1 increases, the empirical power increases as well. This result is not surprising. It is interesting to note that power for a fixed DAF is very much dependent upon disease MOI. For example, we see in Figure 2 that empirical power for a dominant MOI tends to be larger when p ≤ 0.50. For a multiplicative MOI (Figure 3), empirical power tends to be larger for 0.25 ≤ p ≤ 0.75. Finally, for a recessive MOI (Figure 4), MOI tends to be larger when p ≥ 0.50.

How does the TDT-HET statistic's power compare with that of the TDT in the presence of heterogeneity? Our previous work determining the non-centrality parameter of the TDT in the presence of heterogeneity [27] allows us to answer this question directly. However, we use TDT empirical power instead to compare "apples to apples" for power. In Figure 5, we present a Box and Whiskers plot [75] that reports a summary of the distribution of differences (TDT-HET empirical power - TDT empirical power) for the various MOIs (Dominant, Multiplicative, Recessive) and significance levels (5%, 1%). TDT empirical power is computed using the method we previously published [27].

Figure 5
figure 5

Box and whiskers plots of differences ( TDT-HET empirical power - TDT empirical power) for different disease MOIs and significance levels. Here we present Box and Whisker plots for the differences (TDT-HET Empirical Power - TDT Empirical Power) for the three MOIs and the two significance levels. The code referring to each plot is: (MOI, Significance Level). For example, Dom05 refers to the differences for the dominant MOI at the 5% significance level. Similarly, Dom01 refers to the differences for the dominant MOI at the 1% significance level. For each gray box, the upper side indicates the 3rd quartile value, the lower side indicates the 1st quartile value, and the black horizontal line in the middle of the box indicates the median value. Mean values are indicated by white diamonds, and outliers are indicated by stars.

While the median power difference is approximately 0 for all six categories, we see that there is a pattern associated with disease MOI. That is, for the dominant MOIs, TDT tends to have larger power than TDT-HET (gray quartile boxes below 0 in Figure 5), while for multiplicative and recessive MOIs, TDT-HET tends to have higher power than TDT (gray quartile boxes above 0 in Figure 5). The minimum value for power difference of -0.05 occurs for the parameter settings: ϕ = 0.05; MOI = Dominant; R2 = 2.25; DAF = 0.50; π1 = 0.50; Significance Level = 1%. For these settings, TDT-HET empirical power is 0.45, while TDT empirical power is 0.50. The maximum value for power difference of 0.17 occurs for the parameter settings: ϕ = 0.05; MOI = Recessive; R2 = 2.25; DAF = 0.25; π1 = 0.75; Significance Level = 1%. For these settings, TDT-HET empirical power is 0.74, while TDT empirical power is 0.56.

Multi-locus

In Figures 6, 7 and 8, we present TDT-HET and TDT SumStat empirical power values (type I error rate values in Figure 6) for parameter settings listed in the Methods. The results presented in Figure 6 (R2 = 1.0) indicate that our SumStat statistic appears to maintain the correct type I error rate for all simulation parameters considered, with the exception of the settings: TDT SumStat Statistic, π1 = 0.25, DAF = 0.90, which gives an empirical type I error rate of 0.03 at the 1% level. According to BINOM, the exact 95% confidence interval for this value does not contain 0.01 (the lower bound of the interval is 0.02). However, all other simulations do contain 0.01 in their 95% confidence intervals.

Figure 6
figure 6

Empirical type I error rates for TDT-HET and TDT SumStat statistics. In this figure, we provide TDT-HET and TDT SumStat empirical type I error values (GRR R1 = R2 = 1.0) for parameter settings listed in Tables 1 and 2. In this figure, solid symbols represent TDT-HET SumStat empirical powers; hollow shapes represent TDT SumStat empirical powers. More specifically: Solid diamonds = TDT-HET SumStat empirical type I error rate at 1% significance level when π1 = 0.25. Solid squares = TDT-HET SumStat empirical type I error rate at 1% significance level when π1 = 0.75. Hollow diamonds = TDT SumStat empirical type I error rate at 1% significance level when π1 = 0.25. Multiplication signs = TDT SumStat empirical type I error rate at 1% significance level when π1 = 0.75.

Figure 7
figure 7

Empirical powers for TDT-HET and TDT SumStat statistics when GRR R 1 = 1.5. Here, we present TDT-HET and TDT SumStat empirical powers for parameter settings listed in Tables 1 and 2. Solid diamonds = TDT-HET SumStat empirical power at 1% significance level when π1 = 0.25. Solid squares = TDT-HET SumStat empirical power at 1% significance level when π1 = 0.75. Hollow diamonds = TDT SumStat empirical power at 1% significance level when π1 = 0.25. Multiplication signs = TDT SumStat empirical power at 1% significance level when π1 = 0.75.

Figure 8
figure 8

Empirical powers for TDT-HET and TDT SumStat statistics when GRR R 1 = 3.0. Here, we present TDT-HET and TDT SumStat empirical powers for parameter settings listed in Tables 1 and 2. Solid diamonds = TDT-HET SumStat empirical power at 1% significance level when π1 = 0.25. Solid squares = TDT-HET SumStat empirical power at 1% significance level when π1 = 0.75. Hollow diamonds = TDT SumStat empirical power at 1% significance level when π1 = 0.25. Multiplication signs = TDT SumStat empirical power at 1% significance level when π1 = 0.75.

Regarding empirical power, when R1 = 1.5 (Figure 7), TDT-HET and TDT produce nearly identical powers. This can be seen from the fact that the hollow symbols of the TDT empirical powers do not seem to appear in Figure 7. The reason is that they are covered by the TDT-HET empirical power symbols. When R1 = 3.0 (Figure 8), TDT-HET and TDT also produce nearly identical powers. Note that the values on the vertical axis for Figure 8 are much higher than those for Figures 6 and 7. We comment that, when R1 = 3.0, we have very high power at the 1% significance level even with the proportion of linked trios is low (π1 = 0.25; diamonds and triangles; Figure 8). This result suggests that genotype relative risk can "trump" locus heterogeneity. We have observed this phenomenon in previous studies, where genotype relative risk is the most significant factor in determining power [76], even in the presence of "missing data" (e.g., misclassification errors).

Idiopathic Scoliosis Candidate Loci

In Table 3, we present the results of our TDT-HET analysis for the five candidate loci mentioned in the Methods section. They are: RS1400180, RS10510181 (both on Chromosome 3), RS11770843 (on Chromosome 7); and RS1040315, RS2222973 (both on Chromosome 21).

Table 3 Results of TDT-HET analysis on idiopathic scoliosis candidate loci

The first thing to notice about these results is that the statistic values are similar. For example, on Chromosome 3, locus RS1400180 has a TDT-HET statistic value of 14.78 versus a TDT value of 14.35. Similarly, on Chromosome 21, locus RS2222973 has a TDT-HET statistic value of 22.53 versus a TDT value of 22.25. However, as noted above, the TDT-HET statistic does not follow a central chi-squared distribution with 1 degree of freedom under the null hypothesis. For that, we must compare permutation p-values. If we compare the point-wise permutation p-values (P-value (Perm) column for TDT-HET and Perm01 column for PLINK TDT), we see that, for most loci the permutation p-values are quite similar (same order of magnitude). In fact, according to BINOM, for most of the loci, the exact 95% confidence intervals overlap (full results not shown). The one exception is for locus RS11770843 on Chromosome 7. For this locus, the upper bound of the exact 95% confidence interval of the TDT-HET permutation p-value as computed by BINOM is 5.6 × 10-5, while the lower bound of the exact 95% confidence interval of the PLINK TDT permutation p-value (Perm01) is 9.1 × 10-5. This result suggests that, for this marker locus, the TDT-HET has slightly more power.

As for the multi-locus results, the situation is quite similar. The Bonferroni corrected minimum p-value of the TDT-HET SumStat statistic is 0.00, on Chromosome 21. The upper bound of the exact 95% confidence interval is 3.0 × 10-5. The lower bound of the exact 95% confidence interval for the minimum max(T) p-value is 2.8 × 10-5, indicating that the p-values overlap. Thus power for each method is equivalent for this data set. While additional studies need to be performed, this result suggests that the SumStat method for TDT-HET may not be as advantageous when loci are in HWE and/or are in linkage disequilibrium.

If there is no gain in power for the TDT-HET method over the standard TDT method, what is its utility? We suggest that the value comes from the estimates of the transmission probability, the proportion of linked trios, and most especially, the estimates of the probabilities that each of the trios is linked to a particular locus. Similar information is available for the HLOD statistic in that we may obtain probability estimates that each family is linked to a particular locus [38].

We demonstrate this utility for locus RS2222973, on Chromosome 21. In Table 4, we present the posterior probability estimates τ ^ 1 , a b c ( r ) that each of the 10 coded trios abc (see Table 5) is linked to the locus. We note that in Table 3, the estimate of the overall proportion of linked trios, π 1 ^ , is 0.87. As Ott points out [32] and [38] (page 224), this value can be used as a cutoff for determining coded trios that are linked to the locus as compared with coded trios that are not. Specifically, if τ ^ 1 , a b c ( r ) > π 1 ^ , then we conclude that the coded trio x abc is linked to the locus. Similarly, if τ ^ 1 , a b c ( r ) < π 1 ^ , then we conclude that the coded trio x abc is unlinked. The results in Table 4 suggest that the coded trios x000, x100, x110, x201, x211, x222 are linked to locus RS2222973, while the remaining 4 coded trios x101, x111, x112, x212 are not. Of the coded trios for which at least one parent is heterozygous (i.e., either a or b equals 1), the linked coded trios are defined by the fact that the affected child always receives the "1" allele from a heterozygous parent. This result suggests that, for this locus, the "1" allele is the risk allele.

Table 4 Posterior probability estimates that each coded trio is in linked group for Chromosome 21 Locus RS2222973 in the idiopathic scoliosis data set
Table 5 Conditional probabilities of mating type and child genotype

Discussion

In this work, we present a mixture model of linked and unlinked trios and develop a statistical method to estimate the probability t that a heterozygous parent transmits the disease allele at a di-allelic locus, as well as the probability π1 that any trio is in the linked group. The null hypothesis is that t = 0.5. The purpose here is the development of a test, the TDT-HET, which extends the classic transmission disequilibrium test (TDT) to one that accounts for locus heterogeneity. Our results suggest that use of permutation p-values enable us to correctly maintain correct type I error rates at the 5% and 1% significance levels. Power simulations using disease MOIs suggest that power can be disease model dependent, with the TDT being slightly more powerful for dominant MOIs, and the TDT-HET being more power for recessive MOIs. Also, we find that our statistic can have high power, even in the presence of locus heterogeneity, when the GRR is larger.

It is interesting to note that the value of the TDT-HET statistic and the corresponding permutation p-value appears to be about the same as that of ordinary TDT for the Idiopathic scoliosis Candidate Loci data set even though results of the TDT-HET analysis suggest that there is locus heterogeneity for several loci. Based on our simulations, we might conjecture that the single-locus MOI for each SNP is multiplicative.

We computed parameters for the situation where linked and unlinked trio types come from populations with different sets of parental mating type frequencies, but apart from determining the rth iteration step estimates, we did not investigate this form of the TDT-HET statistic further. Given the extensive amount of work already present, we consider this work to be beyond scope of the present manuscript. We plan to follow up this research and report our findings in another manuscript.

As noted in Results, Idiopathic Scoliosis Candidate Loci section, Ott documents that a decision rule for determining whether a particular trio type x abc is linked to a locus is seeing whether the inequality:

τ ^ 1 , a b c ( r ) π 1 ^ ,
(2)

is satisfied. Having said that, Terwilliger and Ott [77] report that, for linkage, the conditional probabilities "...should be taken with a grain of salt, and they cannot ever be validly used to separate families for the remainder of a linkage study. It should be required that any further marker typings be done on all families combined..." Their rationale for this statement is that selectively typing only linked families would introduce bias and increase the type I error rate of the linkage statistic. However, this book was published in 1994, even before the advent of SNPs. We are now producing next generation sequence data, so that the causative variant may well be typed in the first set. It remains an open question whether one can use the parameter estimates τ ^ 1 , a b c ( r ) to find trios that contain the causative variant(s). We recognize that there are situations where parameter estimation may be quite difficult. Vieland and Logue [78] documented that when the genetic models at linked and unlinked loci differ, maximizing the HLOD yields incorrect parameter estimates. These authors found that the admixture parameter α does not even measure the proportion of linked families within the sample, as is commonly supposed.

We conjecture that having additional information on the posterior probabilities τ ^ 1 , a b c ( r ) may increase the probability of correctly identifying linked trios. One of the advantages of the TDT-HET statistic is that it provides estimates that each of the 10 types of trios (Table 5) is linked/unlinked. We can use this information to create a decision rule about whether a particular trio type is linked (i.e., harbors the disease allele). One possible decision rule is the inequality documented by Ott [38] and listed above (2). Ott reports that, for linkage analysis allowing for locus heterogeneity, a decision rule for determining whether a particular family is linked to a locus is checking whether the posterior probability that the family is linked is larger than or equal to the overall estimate of the proportion of linked families. We can extend this rule to our work by making the decision rule be that a trio type x abc is linked to a locus if and only if the inequality is satisfied. Here r is the iteration step such that the log-likelihoods are less than the stopping criterion.

This decision rule potentially reduces the number of trios that we need consider when looking for linked trios. We can further reduce the number of trios considered by adding the condition that we only consider trios in which at least one parent is heterozygous. Thus, the two decision rules we consider here for selecting linked trios using the TDT-HET statistic are: (i) all trios that satisfy inequality (2); and (ii) all trios for which at least one parent is heterozygous and that also satisfy inequality (2).

For the TDT statistic, our analogous decision rules are: (i) all trios; and (ii) all trios for which at least parent is heterozygous.

We plan to perform an extensive analysis to evaluate the empirical probabilities that each statistic can correctly identify linked trios. We can simulate linked and unlinked trios using the method implemented in the FASTSLINK software [79, 80]. We can use different genetic model parameter settings, specifically, settings in which the genetic effect is small/large. Since FASTLINK produces pedigree files that indicate which pedigrees are linked or unlinked, we can directly test our decision rules. This is work in progress.

Given that next generation sequencing data applied to families is bound to identify large amounts of locus heterogeneity, any methods that increase the probability of identifying true disease variants should be welcome. We realize that, though, the probabilities of correctly identifying linked trios may be dependent upon the true proportion of linked trios. One way we can reduce heterogeneity is to look at larger family sizes. We plan to apply our statistic to such families and investigate its performance.

Conclusions

Motivated by the recent work of Zhou and Pan [58], we have developed a TDT statistic, TDT-HET, that allows for locus heterogeneity among coded trios. This method is an extension of TDT, in that our simulation results suggest it has approximately the same power as the original TDT. Results of our simulations suggest that our method maintains correct type I error for the null hypothesis (R1 = 1.0). Benefits of our method include: estimates of parameters in the presence of heterogeneity (e.g., the proportion of linked coded trios, the posterior probabilities that a particular trio type is linked to a locus), and reasonable power even when the proportion of linked trios is lower. Also, we have extended Hoh, Ott, and colleagues' SumStat method to TDT-HET. The parameter estimation above, particular, estimation of the probability that a trio is linked will be useful as we enter the age of next-generation sequencing, where one can expect extensive levels of locus heterogeneity given the rare disease frequencies.