Genome-wide association filtering using a highly locus-specific transmission/disequilibrium test
DOI: 10.1007/s00439-010-0854-z
 Cite this article as:
 Abad-Grau, M.M., Medina-Medina, N., Montes-Soldado, R. et al. Hum Genet (2010) 128: 325. doi:10.1007/s00439-010-0854-z
Abstract
Multimarker transmission/disequilibrium tests (TDTs) are powerful association and linkage tests used to perform genome-wide filtering in the search for disease susceptibility loci. In contrast to case/control studies, they have a low rate of false positives due to population stratification and admixture. However, the region found to be associated with a disease is usually very long because of linkage disequilibrium (LD). Here, we define a multimarker proportional TDT (mTDT_{P}) designed to improve locus specificity in complex diseases while retaining good power compared with the most powerful multimarker TDTs. The test is a simple generalization of a multimarker TDT in which haplotype frequencies are used to weight the effect that each haplotype has on the whole measure. Two concepts underlie the features of the metric: the ‘common disease, common variant’ hypothesis and the decrease in LD with chromosomal distance. Because of this decrease, the frequency of haplotypes in strong LD with common disease variants decreases with increasing distance from the disease susceptibility locus. Thus, our haplotype proportional test has higher locus specificity than common multimarker TDTs that assume a uniform distribution of haplotype probabilities. Because of the common variant hypothesis, risk haplotypes at a given locus are relatively frequent and a metric that weights partial results for each haplotype by its frequency will be as powerful as the most powerful multimarker TDTs. Simulations and real data sets demonstrate that the test has good power compared with the best tests but has remarkably higher locus specificity, so that the association rate decreases more rapidly with distance from a disease susceptibility or disease protective locus.
Introduction
Genome-wide genotyping of single-nucleotide polymorphisms (SNPs) can yield a few hundred thousand binary markers in a single chip array, providing a relatively unbiased examination of the entire genome for common risk variants. Many loci have been determined to be associated with multifactorial diseases using this new technology. However, in most cases, the information provided is not enough to localize the causal variant of the association. Nonetheless, genome-wide association studies yield useful information for better identification of an associated region, which facilitates fine mapping of the region with a reduced number of markers.
There are two main types of genome-wide association analyses: case–control studies and family-based studies. Although case–control association studies are the most common, they have high type I error rates because of population stratification (Spielman et al. 1993; Zhang et al. 2003). In family-based studies, transmission/disequilibrium tests (TDTs) are powerful tests requiring only family trios with both parents and one affected offspring. In contrast to case–control studies, TDTs are known to be robust to population structure. Therefore, they are an interesting alternative to case–control studies when family trios can be genotyped. The classic single-marker biallelic TDT can detect association due to linkage. Multimarker generalizations of the classic TDT enhance it by detecting marker interactions, such as when a trait does not depend on a single marker but there is association when considering more than one marker together, which may point to linkage disequilibrium (LD) or gene–gene interaction (epistasis). This may be the case for genome-wide genotyping in which a disease susceptibility locus cannot be genotyped but some markers in LD with the locus can be. Thus, the power of a multimarker TDT can be significantly higher than that reached by a single-marker TDT.
Different approaches have been used to define multimarker TDTs, each of them computing statistical significance in a different way. The most widely used are: (1) TDTs that are straightforward extensions of the classic single-marker biallelic TDT; (2) TDTs that group haplotypes to reduce the degrees of freedom (df); and (3) TDTs based on haplotype similarities to reduce df and improve the test power.
The idea behind the first of the approaches is simple. In nuclear families with one affected child, there must be a difference between the counts for nontransmitted and transmitted haplotypes if they are directly associated with the disease or in linkage with a susceptibility locus. The most commonly used test in this approach is the classic multimarker TDT (mTDT) (Spielman and Ewens 1996; Lazzeroni and Lange 1998), a straightforward extension of the biallelic monomarker TDT that can be used by considering each haplotype as a particular allele (Sham 1997; Bourgain et al. 2001). Using this approach, we can also consider introducing some nonlinear transformation to the transmitted/nontransmitted haplotype counts, such as TDT_{E} (Zhao et al. 2007), which is based on the concept of entropy. More specific tests have also been defined to improve power for uncertain transmission cases (Clayton 1999; Zhao et al. 2000) or genotyping errors (Gordon et al. 2001). The main problem with tests using this approach is that the df of the approximate χ^{2} distribution increase with the number of haplotypes and thus permutation tests to determine the null distribution may be required for sparse data.
The second approach tries to reduce the df by grouping haplotypes using different criteria such as haplotype distance (Li et al. 2001) or a haplotype evolutionary relationship (Seltman et al. 2001). These tests are very time-consuming when used in genome-wide searches, as they first have to infer a model to group the haplotypes. As an example, a cladogram must be estimated under the assumptions that there are no recurrent disease mutations and no recombination or gene conversion. Violation of these strong assumptions may decrease the general accuracy of the test.
The third approach also tries to reduce the df using haplotype similarities. However, instead of counts for the haplotype groups, similarity metrics are used, such as the length measure used in the length contrast test (TDT_{LC}) (Yu et al. 2005) and the signed rank test (TDT_{SR}) (Yu et al. 2005) and other metrics such as those used in the maximum identity length contrast (MILC) test (Bourgain et al. 2001) and the haplotype-sharing TDT (HSTDT) (Zhang et al. 2003). For the TDT_{LC} and TDT_{SR} tests it is assumed that there must be less variation among haplotypes transmitted to affected offspring than among nontransmitted haplotypes, as they distinguish the sign of the difference in the measure between transmitted and nontransmitted data sets. However, TDTs based on this assumption are more restricted in scope than other multimarker TDTs because they do not detect statistically significant differences in haplotype similarities when these are greater among nontransmitted haplotypes. This may occur when a haplotype is not in linkage with a disease susceptibility gene but with a protective gene, so that it will be more frequent in healthy individuals. There is a more important issue in similarity-based TDTs: similarity measures are computed by pairwise comparisons between individuals. Thus, their computational complexity is a quadratic function of the number of founders, in contrast to most TDT measures, which use sample counts and whose cost increases linearly with the number of individuals. For current genotype samples with up to a few thousand individuals, similarity-based TDTs are thus a real burden.
Our goal was to define a computationally feasible multimarker measure, named a proportional mTDT (mTDT_{P}), with high power and high robustness for population admixture and stratification with high locus specificity as an association test. Therefore, association rates are expected to quickly decrease with distance from a disease susceptibility or protective locus. The measure belongs to the first of the approaches and is a generalization of mTDT that weights partial results for each haplotype by its probability frequency. The success of the measure in improving locus specificity is based on two assumptions: (1) according to the decrease in LD with chromosomal distance, the frequency of haplotypes in linkage with a disease haplotype is higher at shorter distances from the disease locus; and (2) according to the ‘common disease, common variant’ (CDCV) hypothesis, disease susceptibility variants are quite common in complex diseases and a combination of several genes, rather than a single gene, together with environmental factors, causes the disease. A consequence of these assumptions is that haplotypes in very strong LD with a disease or protective variant are common and their frequency will notably decrease with chromosomal distance.
Therefore, at both extremes of the spectrum of chromosomal distances (the null hypothesis of no linkage and no distance to the disease locus), there must be little difference between mTDT_{P} and mTDT; as we depart from these extremes, differences between the two tests arise: association detected by mTDT_{P} will decrease more rapidly as we move away from the disease locus.
In “Methods”, after analysis of mTDT and the reasons why it cannot be considered a highly locus-specific test, we propose mTDT_{P}, a modification of mTDT that considers differences in haplotype frequencies to improve both specificity and sensitivity. “Simulation studies” compares different multimarker TDTs for different genetic models, relative risks, haplotype lengths and total disease susceptibility loci. As mentioned above, our goal was not only to study test power and robustness under different configurations, but also to observe the rate at which statistical significance decreases with chromosomal distance. Simulations to study association rates at different chromosomal distances from a disease susceptibility locus have previously been performed for single-marker TDTs (Zhao et al. 2007); here, “Simulation studies” compares sensitivity, specificity and robustness for some state-of-the-art multimarker TDTs defined under different approaches. In “Real data sets”, we compare the power and locus specificity of our test (mTDT_{P}) with other TDTs using real trio samples for Crohn’s disease and multiple sclerosis (MS), and robustness using control trio samples of individuals from the International HapMap Project (IHMP) (HapMapConsortium 2003). The paper closes with the “Discussion”.
Methods
Assume that the data represent M nuclear families in which one child is affected and that L SNPs are genotyped for all the family members. As an example, for L = 2 and assuming biallelic SNPs, there will be only k = 4 different haplotypes: AB, Ab, aB and ab. Consider a sample composed of all transmitted and nontransmitted haplotypes from heterozygous parents. Let n be the sample size. Thus, subsamples S_{T} and S_{U} of transmitted and nontransmitted haplotypes, respectively, both contain n/2 haplotypes.
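As a minimal sketch of how the subsamples S_{T} and S_{U} might be tallied, assuming phased parental haplotypes are already available, the hypothetical helper `count_transmissions` below takes one (transmitted, nontransmitted) pair per parent and skips homozygous parents. The function name and the input layout are illustrative assumptions, not part of the method description.

```python
from collections import Counter


def count_transmissions(parent_pairs):
    """Tally transmitted (n_iT) and nontransmitted (n_iU) haplotype counts.

    parent_pairs: iterable of (transmitted, nontransmitted) haplotype strings,
    one pair per parent. Homozygous parents carry no information and are skipped.
    """
    n_t, n_u = Counter(), Counter()
    for transmitted, untransmitted in parent_pairs:
        if transmitted == untransmitted:  # homozygous parent: uninformative
            continue
        n_t[transmitted] += 1
        n_u[untransmitted] += 1
    return n_t, n_u


# Four parents with L = 2 biallelic SNPs; the last parent is homozygous.
pairs = [("AB", "ab"), ("AB", "Ab"), ("ab", "AB"), ("aB", "aB")]
n_t, n_u = count_transmissions(pairs)
```

Because every informative parent contributes exactly one haplotype to each subsample, both tallies have the same total, matching the n/2 haplotypes per subsample stated above.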
Analysis of mTDT
Both mTDT and mTDT_{s} give all haplotypes the same weight, regardless of their frequencies, as each summand is the square of a standard normal distribution under the null hypothesis. Even under the null hypothesis, the variability in haplotype frequency is usually very high, with some haplotypes very frequent and others very rare. Therefore, the assumption that differences in transmission of multimarker haplotypes follow a χ^{2} distribution under the null hypothesis of no linkage leads to a test that is too simplistic and unrealistic. The longer the haplotypes, the greater is the departure of the true null distribution from a \(\chi_{k-1}^2\) distribution, as there are more differences among haplotype frequencies.
We explore the consequences of this simplification once we introduce a generalization of mTDT that considers differences in haplotype frequencies.
Definition of mTDT_{P}
Factors n_{i}/n, ∀i ∈ {1,…, k}, weight haplotypes according to their frequencies, which means that differences in transmission for the most frequent haplotypes have a greater effect on the measure.
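To illustrate the effect of the weights, the sketch below compares an unweighted and a frequency-weighted sum over haplotypes. It rests on two assumptions that should be checked against the formal definitions: the classic mTDT is taken as ((k − 1)/k) Σ (n_{iT} − n_{iU})^2/(n_{iT} + n_{iU}), and the weighted measure is taken as Σ (n_{i}/n)(n_{iT} − n_{iU})^2/(n_{iT} + n_{iU}) with n_{i} = n_{iT} + n_{iU}. The function names are hypothetical.

```python
def _informative(n_t, n_u):
    """Return (haplotype, n_iT, n_iU) for haplotypes observed at least once."""
    haps = sorted(set(n_t) | set(n_u))
    rows = [(h, n_t.get(h, 0), n_u.get(h, 0)) for h in haps]
    return [(h, t, u) for h, t, u in rows if t + u > 0]


def mtdt_unweighted(n_t, n_u):
    """Assumed classic form: every haplotype contributes with equal weight."""
    rows = _informative(n_t, n_u)
    k = len(rows)
    if k < 2:
        return 0.0
    return (k - 1) / k * sum((t - u) ** 2 / (t + u) for _, t, u in rows)


def mtdt_weighted(n_t, n_u):
    """Assumed weighted form: each summand is scaled by its frequency n_i / n."""
    rows = _informative(n_t, n_u)
    n = sum(t + u for _, t, u in rows)
    if n == 0:
        return 0.0
    return sum(((t + u) / n) * (t - u) ** 2 / (t + u) for _, t, u in rows)


n_t = {"AB": 30, "Ab": 10, "aB": 5, "ab": 5}
n_u = {"AB": 10, "Ab": 20, "aB": 10, "ab": 10}
unweighted = mtdt_unweighted(n_t, n_u)
weighted = mtdt_weighted(n_t, n_u)
```

In this toy sample the frequent haplotype AB accounts for 400/550 of the weighted sum but only 10/16.67 of the unweighted one, which is the behaviour the weighting is meant to produce.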
Taking into account that haplotype counts are correlated, the asymptotic variance of mTDT_{P} under the null hypothesis is derived in Appendix 1.
Under different genotype frequencies, the variance (Appendix 1) is larger than \({\frac{2}{k-1}}\), so that, as occurs with mTDT (Sham 1997), mTDT_{P} will tend to be anticonservative. A feature of this measure is that it reduces the impact of random effects due to rare haplotypes without the need to impose a lower bound on haplotype counts for haplotypes to be used, as is usually done with mTDT (Sham and Curtis 1995).
The main feature of mTDT_{P}, however, is that, in contrast to most multimarker TDTs, which lack either power or locus specificity, it has both: high power and high locus specificity to detect disease susceptibility or disease protective loci in complex diseases. The measure is comparable in power to the powerful mTDT because, assuming the CDCV hypothesis, the impact that nonrecombinant haplotypes have on the measure is high when the chromosomal distance to a disease locus is very short, as their frequencies are high and so are their weights. As we move away from the disease locus, the recombination fraction increases, nonrecombinant haplotypes become less frequent among haplotypes transmitted to affected children, and their impact on the whole measure decreases faster than when weighting is not used, as in mTDT.
We want to characterize the distribution of mTDT_{P} under the null hypothesis of no linkage so that permutation tests are not needed to assess statistical significance. To do so, we first consider the simpler but unrealistic situation in which haplotype counts are obtained from independent samples (“Independent random variables: characterization and approximation of a weighted χ^{2} distribution”) and use it as a starting point to consider dependencies among them (“Dependent random variables: approximation of mTDT_{P} under the null hypothesis”).
Independent random variables: characterization and approximation of a weighted χ^{2} distribution
Computing the distribution function of W_{w}, with weights w = (w_{1},…, w_{k}), is very complicated because it requires numerical integration (Solomon and Stephens 1977; Gabler and Wolff 1987). As we are interested in a TDT for genome-wide association filtering, permutation tests should be avoided and an easily computable approximation of the asymptotic test distribution under the null hypothesis is required.
It is straightforward to show that in the case of equal weights (\(w_i={\frac{1}{k}},\forall i \in \{1,\ldots, k\}\)), \(\delta={ \frac{1} {k}}\) and the approximation turns out to be a true weighted χ^{2} distribution, as the three distribution functions are exactly the same.
Dependent random variables: approximation of mTDT_{P} under the null hypothesis
As each individual carries a pair of haplotypes, haplotype counts are not obtained from independent samples. Therefore, \(Y_{i}^2,\ i\in \{1,\ldots,k\}\) are not independent \(\chi_1^2\) variables and thus mTDT_{P} under the null is not \(W_{k,{\bf w}=({\frac{n_1} {n}},\ldots,{\frac{n_k}{n}})}\). Therefore, the exact distribution needs to be assessed.
As stated above, under the null hypothesis of no linkage and when the frequencies of all parental heterozygous genotypes are equal, mTDT asymptotically follows a \(\chi_{k-1}^2\) distribution and, therefore, mTDT_{P} = (k − 1)mTDT is a scaled \(\chi_{k-1}^2\). For k = 2, the asymptotic variance is 2. Moreover, for k = 2, mTDT_{P} also reduces to the simple (i.e., monomarker, monoallelic) TDT.
To use the approximation of the weighted sum of χ^{2} distributions W considered above (Gabler and Wolff 1987) to obtain the distribution of mTDT_{P} under the null hypothesis, and considering that the χ^{2} distributions are not independent, we have modified the limiting distributions G and U so that it can easily be shown that they are exactly a scaled \(\chi_{k-1}^2\) with scale factor k − 1 when the frequencies of parental heterozygous genotypes are equal.
Inputs:
  k: the number of different haplotypes in the sample
  weights: a list of k weights
  HP: the value of the statistic mTDT_{P} for the current sample
Output:
  result: p value
Description:
  result = 0
  DS = 1
  R1 = 0
  df = k − 1
  Foreach haplotype i = 1,…,k
    dZero = 0.5/weights(i)
    R1 = R1 + weights(i)*gammai(dZero, HP*dZero)
    DS = DS*weights(i)
  R2 = pValTestChiSquare(HP/DS^{1/k}, k)
  result = max(R1, 1 − R2)
In order to check whether mTDT_{P} follows a weighted χ^{2} distribution in the more general case of different parental heterozygous genotype frequencies, we performed permutations in “Simulation studies” (Zhang et al. 2003; Yu et al. 2005) and we did not find significant differences (data not shown).
Simulation studies
We compared the performance of our solution mTDT_{P} with several state-of-the-art multimarker TDTs, such as the classic mTDT and other TDTs based on different approaches: the similarity-based tests mTDT_{LC} and mTDT_{SR}, the entropy-based mTDT_{E} and the group-based mTDT_{1T}. mTDT_{1T} (Ott 1999) is a \(\chi_1^2\) test under the null hypothesis of no linkage that checks differences between the haplotype with the most significant difference n_{iT} − n_{iU} and the rest of the haplotypes in a sample.
We also modified mTDT using some well-known corrections of χ^{2} tests, and some modifications of these (Appendix 2), to improve the specificity by reducing random errors due to low frequencies: the Yates (1934) correction mTDT_{Y}, its modification mTDT_{YP} and the Laplace corrections mTDT_{L1} and mTDT_{L2}.
Besides robustness to population stratification and power, we are interested in measuring locus specificity. Thus, the decrease in the rate of associations detected with increasing linkage distance or recombination fraction (θ) was assessed between two extreme points: θ = 0, for which all associations detected are true positive associations (power), and θ = 0.0002, for which most associations detected are type I errors.
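A minimal sketch of the mTDT_{1T} idea follows, assuming that "most significant difference" means the largest single-haplotype biallelic TDT statistic (n_{iT} − n_{iU})^2/(n_{iT} + n_{iU}) obtained when haplotype i is contrasted with all the others pooled. The function name `mtdt_1t` and the return layout are hypothetical.

```python
def mtdt_1t(n_t, n_u):
    """Return (best_haplotype, statistic) for the haplotype-versus-rest contrast.

    n_t, n_u: dicts mapping haplotype -> transmitted / nontransmitted counts
    from heterozygous parents. Assumes the per-haplotype statistic is the
    biallelic TDT (t - u)^2 / (t + u).
    """
    best_hap, best_stat = None, 0.0
    for hap in sorted(set(n_t) | set(n_u)):
        t, u = n_t.get(hap, 0), n_u.get(hap, 0)
        if t + u == 0:
            continue
        stat = (t - u) ** 2 / (t + u)
        if best_hap is None or stat > best_stat:
            best_hap, best_stat = hap, stat
    return best_hap, best_stat


n_t = {"AB": 30, "Ab": 10, "aB": 5, "ab": 5}
n_u = {"AB": 10, "Ab": 20, "aB": 10, "ab": 10}
best_hap, best_stat = mtdt_1t(n_t, n_u)
```

Because the haplotype is selected after looking at the data, a nominal \(\chi_1^2\) p value for the returned statistic does not account for that selection.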
Statistical significance levels were obtained using a permutation procedure for mTDT_{LC}, mTDT_{SR} and mTDT_{E} (Zhang et al. 2003; Yu et al. 2005). For mTDT_{P}, the approximation of a weighted χ^{2} with weights being the haplotype frequencies was used (Independent random variables: characterization and approximation of a weighted χ^{2} distribution). For the remaining tests, the exact χ^{2} distribution was used.
Simulation setup
We tried to reproduce the same simulations used in several studies to check TDT accuracy (Zhang et al. 2003; Yu et al. 2005) and explained in the following subsections.
As our main goal is to have a useful test to perform genome-wide association filtering, computational complexity is a main issue and a linear relationship between computational complexity and the number of SNPs is highly desirable. Therefore, we applied the tests in a very feasible way in which only consecutive or overlapping clusters of SNPs (known as sliding windows) were tested together. For simulations of a cluster as suggested by Crawford et al. (2004), we assumed that recombination rates among all the markers tested are very low, which is equivalent to assuming that they belong to the same low-recombination block (Daly et al. 2001). The recombination fraction within blocks (θ_{B}) for a common population with exponential growth, such as an African population, has been estimated as 0.000088 (Hinds et al. 2005) and we used this value in the simulations.
We also modified the method for introducing a disease mutation compared to other studies (Sham 1997; Zhang et al. 2003; Yu et al. 2005). Instead of considering only one ancestral chromosome with the disease-causing mutation, or the improvement of using two ancestral chromosomes (Zhang et al. 2003), a more realistic simulation of inheritance of complex diseases was used, in which the number of ancestral disease chromosomes can change according to the coalescent model, as for any other gene.
Populations were drawn using msHOT (Hellenthal and Stephens 2007), a program for generating samples based on the coalescent model that incorporates recombination. The samples for all the populations were obtained using trioSampling, a computer program available on the supplementary website. In the following subsections, we describe the simulations in detail and highlight any departures from the setup commonly used (Sham 1997; Zhang et al. 2003; Yu et al. 2005). A more detailed explanation of the simulations performed can be accessed on the supplementary website.
Robustness
To check the robustness to population stratification, simulations were performed as described by Zhang et al. (2003) and Yu et al. (2005). Therefore, we considered stratified populations. However, instead of using samples of 200 nuclear families (Zhang et al. 2003; Yu et al. 2005), we produced samples with 500 nuclear families. Moreover, we used a recombination fraction from the markers to the disease locus of θ = 0.5 to represent a true null. Association rates were estimated based on 1,000 replications. Families were randomly sampled by choosing haplotypes with the disease mutation and randomly choosing the haplotypes transmitted to children considering recombinations. For the first subpopulation, the minor allele frequency (MAF) for the markers was 0.5 and the probability of the disease mutation in parents p_{D} was 0.2. For the second subpopulation, different MAFs q for the markers were used: q ∈ {0.1, 0.3, 0.5} and p_{D} was 0.3. Different proportions of individuals from the first sample were used, \(pp\in\{1/2, 1/4, 1/6\}\). Therefore, by varying pp and q, nine different scenarios were considered to test the robustness.
Locus specificity and sensitivity
Simulations for power (sensitivity), i.e., assuming no recombination between the disease susceptibility locus and the markers tested, were similar to those used in several studies assuming one founder disease haplotype (Lam et al. 2000; Zhang et al. 2003; Yu et al. 2005), except that SNPs used were assumed to be in high LD, i.e., they belong to the same low-recombination block (Daly et al. 2001). Therefore, we performed simulation analyses using haplotype data sets for 200 nuclear families (family trios with both parents and an affected child). Association rates were estimated based on 100 replications of the simulations described below (Sham 1997; Zhang et al. 2003; Yu et al. 2005).
Values used to configure sample parameters used in specificity/sensitivity simulations
Relative risk  2, 4, 6, 8, 10 
Genetic model  Additive, recessive, dominant 
θ to disease loci  0, 5e−05, 1e−04, 1.5e−04, 2e−04 
Haplotype length  1, 2, 4, 6, 8, 10 
The fourth parameter checks the decrease in association rate due to chromosomal distance. We considered five different recombination fractions (θ) from the markers to the disease susceptibility locus, ranging from perfect LD (no recombination) to θ = 0.0002. Use of the recombination fraction to choose markers for the samples forced us to modify the pattern of population growth to simulate the LD decrease with distance in a more realistic way in a human population (Kruglyak 1999; Crawford et al. 2004). For greater consistency with real populations and complex diseases in which different numbers of founders can carry the disease loci, we used the coalescent model (Nordborg 2001) to draw populations with a variable number of founder haplotypes and population growth as explained above. Any position can be a disease susceptibility locus. Disease founder haplotypes were chosen by selecting one SNP with a mutant allele with frequency in the interval [0.2, 0.4] to mimic a common disease (Yu et al. 2005).
We later produced a second set of simulations with more realistic relative risks (1.2, 1.6, 2.0, 2.4 and 2.6) and samples of 500 nuclear families, and focused only on the most powerful statistics that were also highly efficient (computational complexity linear in the number of families).
To determine how the frequency of the disease mutation affects mTDT_{P} and the other measures, we generated a third set of simulations with the same parameters as the second one but with the frequency of the disease mutation in the interval [0.1, 0.2].
Simulation results
The sensitivity and specificity of the tests were analyzed by counting rates of association for different chromosomal distances from markers to disease loci.
Type I error rates in the presence of population stratification and admixture, with recombination fraction to the disease locus θ = 0.5, based on 1,000 simulations
α  MAF  pp  Type I error rate
0.01  0.1  0.5  0.009 
0.01  0.3  0.5  0.012 
0.01  0.5  0.5  0.013 
0.01  0.1  0.75  0.012 
0.01  0.3  0.75  0.016 
0.01  0.5  0.75  0.015 
0.01  0.1  0.833  0.011 
0.01  0.3  0.833  0.013 
0.01  0.5  0.833  0.013 
0.05  0.1  0.5  0.054 
0.05  0.3  0.5  0.063 
0.05  0.5  0.5  0.071 
0.05  0.1  0.75  0.060 
0.05  0.3  0.75  0.061 
0.05  0.5  0.75  0.052 
0.05  0.1  0.833  0.055 
0.05  0.3  0.833  0.056 
0.05  0.5  0.833  0.058 
Results for sensitivity (θ = 0) show that mTDT, mTDT_{1T} and mTDT_{P} achieve the best results under all scenarios tested, with little differences among the three of them, whereas locus specificity results (θ ∈ {0.00005, 0.0001, 0.00015, 0.0002}) show that mTDT_{P} has better performance than all the other methods. Therefore, association rates decrease faster with mTDT_{P} than with the other methods whenever recombination fraction θ to the disease locus increases. These differences are more appreciable when we increase RR and haplotype length.
Results for α = 0.05 and haplotype lengths of 1, 2, 4, 6, 8 and 10 for one locus are available on the supplementary web site (Figs. S1–S6). Results for two loci and disease models Additive, DomOrDom and RecOrRec (Figs. S7–S12) and for two loci and disease models DomAndDom, Threshold and Modified (Figs. S13–S18) are available on the supplementary web site. We also used the corrections to the small data problem mentioned in Appendix 2 (Figs. S19–S36). As expected, the same pattern was always observed: all the corrections improved the specificity at the cost of a reduction in sensitivity. The stronger the correction, the stronger was this pattern. It should be noted that for haplotypes of length 1, i.e., only one marker, mTDT, mTDT_{1T} and mTDT_{P} are equivalent and therefore yield the same results. Differences among them increase with haplotype length.
As mTDT and mTDT_{P} showed a constant pattern of higher power than the other statistics for all the scenarios considered, we focused on them together with mTDT_{Y}, the measure that performs the lightest correction to the small data problem. Disregarding mTDT_{LC} and mTDT_{SR} made it feasible to perform a second and a third set of simulations using a larger number of nuclear families: 500. We did not use mTDT_{1T} because it chooses the haplotype with the highest power and therefore requires multiple-testing correction. When we used the Bonferroni correction (data not shown), the measure was no longer competitive, in agreement with the previously reported tendency of this correction to overcorrect association results (Tang et al. 2009).
Results for α = 0.05 and haplotype lengths of 1, 2, 4, 6, 8 and 10 for one locus are available on the supplementary web site (Figs. S37–S42). Results for two loci and disease models Additive, DomOrDom and RecOrRec (Figs. S43–S48) and for two loci and disease models DomAndDom, Threshold and Modified (Figs. S49–S54) are available on the supplementary web site.
Real data sets
As in the simulation study, besides mTDT and tests designed to cope with the problem of small data (mTDT_{Y}, mTDT_{YP}, mTDT_{L1} and mTDT_{L2}), we used the same state-of-the-art tests for comparison with mTDT_{P}: mTDT_{1T}, mTDT_{E}, mTDT_{LC} and mTDT_{SR}. We added a further test for the real data sets: mTDT_{1U}, which is the same as mTDT_{1T} but uses the most frequent nontransmitted instead of the most frequent transmitted haplotype. Our purpose was to cover the case in which a disease is more common in the absence of a protective locus in affected individuals, a situation for which mTDT_{1T} would be powerless.
A multimarker TDT for genome-wide association searches requires a very efficient exploration approach for the method to be feasible. A possible approach would consist of dividing the SNP sequence into blocks of low recombination using an algorithm based on confidence intervals (Gabriel et al. 2002). However, we chose to split regions in a block-free way because the boundaries of a low-recombination block differ considerably depending on the definition used by the algorithm to split a region into blocks (Halldórsson et al. 2004). Thus, we used sliding windows (Daly et al. 2001) to apply the test to very small subsets of consecutive markers, such as 6, 8 or 10 markers. Each subset is a window and windows can share markers.
We used sliding windows of 1, 2, 4, 6, 8 and 10 SNPs per window and an offset of 1 to compute p values. For tests whose null distribution is unknown, significance levels were computed for each sliding window using standard permutation tests (1,000 permutations). For all tests for which the null distribution or its approximation is known, we used that distribution to compute p values.
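The block-free exploration described above can be sketched as follows. The helper `sliding_windows` is a hypothetical name; it simply enumerates overlapping runs of consecutive markers, so the number of tests grows linearly with the number of SNPs.

```python
def sliding_windows(markers, size, offset=1):
    """Return overlapping windows of `size` consecutive markers.

    Consecutive windows start `offset` markers apart, so with offset < size
    they share markers. A region shorter than `size` yields no window.
    """
    last_start = len(markers) - size
    return [markers[i:i + size] for i in range(0, last_start + 1, offset)]


snps = ["snp1", "snp2", "snp3", "snp4", "snp5"]
windows = sliding_windows(snps, size=3, offset=1)
# windows: [snp1..snp3], [snp2..snp4], [snp3..snp5]
```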
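The permutation procedure is not spelled out here, so the following is only a sketch under stated assumptions: the transmitted and nontransmitted labels are swapped at random within each parent (a common scheme for TDT-type statistics), and the p value uses the add-one estimator. The names `sum_sq_stat` and `permutation_p_value` are hypothetical, and the statistic is a simple stand-in rather than any of the tests compared in this study.

```python
import random


def sum_sq_stat(pairs):
    """Stand-in statistic: sum over haplotypes of (n_iT - n_iU)^2 / (n_iT + n_iU)."""
    diff, total = {}, {}
    for t, u in pairs:
        if t == u:  # homozygous parent: uninformative
            continue
        diff[t] = diff.get(t, 0) + 1
        diff[u] = diff.get(u, 0) - 1
        total[t] = total.get(t, 0) + 1
        total[u] = total.get(u, 0) + 1
    return sum(d * d / total[h] for h, d in diff.items())


def permutation_p_value(pairs, stat=sum_sq_stat, n_perm=1000, seed=0):
    """Estimate a p value by randomly swapping T/U labels within each parent."""
    rng = random.Random(seed)
    observed = stat(pairs)
    hits = 0
    for _ in range(n_perm):
        permuted = [(u, t) if rng.random() < 0.5 else (t, u) for t, u in pairs]
        if stat(permuted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one estimator (assumption)


skewed = [("A", "a")] * 40                      # A always transmitted
balanced = [("A", "a")] * 20 + [("a", "A")] * 20
p_skewed = permutation_p_value(skewed)
p_balanced = permutation_p_value(balanced)
```

Each window needs n_perm evaluations of the statistic, which is why tests with a known or approximated null distribution are much cheaper in a genome-wide scan.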
Phase reconstruction
We inferred haplotype frequencies using all the information from the family (Yu et al. 2005; Rinaldo et al. 2005). Those haplotypes that could not be resolved using family information were inferred using the EM algorithm under the restriction of family information (Abecasis et al. 2001; Yu et al. 2005).
To avoid inaccurate haplotype reconstruction, the EM algorithm is usually applied within a low-recombination block (Niu et al. 2002). However, although we first performed a preliminary division of the chromosome into blocks of low recombination using some of the several algorithms proposed for that purpose (Gabriel et al. 2002), we finally decided to use sliding windows for the following two reasons.
On the one hand, results from different block-building algorithms are very distinct (Halldórsson et al. 2004) and they may bias results from TDT measures. Moreover, the chances that a haplotype of a few SNPs covers more than one block decrease as the number of genotyped SNPs increases. As an example, with a current genome-wide SNP array of about 500,000 SNP markers, and considering the estimate of 20,700 bp as the average block size in Caucasian populations (Hinds et al. 2005), there are about 20 SNPs per block. For windows of length 10, there is little chance that the haplotype spans more than one block.
On the other hand, in trio samples the EM algorithm is used under the restriction of family information (Zhang et al. 2003; Yu et al. 2005) and, therefore, it is more accurate than the simple EM algorithm for inferring the phase, even beyond block boundaries, as the only positions whose transmitted/nontransmitted alleles cannot be resolved using family information are those for which the three family members are heterozygous (Sebastiani et al. 2004). We compared (data not shown) results of two main ways to proceed within each family: (1) choosing the most likely phase according to the EM algorithm under the restriction of family information or (2) using weighted phases, with the frequencies reported by the algorithm as weights (Zhang et al. 2003; Yu et al. 2005). In agreement with these works, we found no significant differences between the two methods. Therefore, we opted for the first choice, as it has lower computational complexity.
Data sets used
We used nine data sets of trio genotypes, one with individuals with Crohn’s disease (affectedCrohn) and the others with individuals with MS. The Crohn data set is a publicly available set originally used by Rioux et al. (2001).
Real data sets
Data set   Chr.  First SNP   Last SNP    SNPs
EVI5       1     92388330    93651891    93
IL2R       10    6103680     7715013     353
IL7R       5     35847586    35991293    35
HLA        6     30736061    33163225    468
KIAA0350   16    11050221    11226546    26
CD226      18    65550188    65997985    38
CD58       1     116677600   116983610   19
IRF5       7     128055671   128309250   15
For all the sets used, we prepared data sets for unaffected individuals from data publicly available at the website of the IHMP (HapMapConsortium 2003), consisting of genotype data for 30 family trios (HapMap Phase II) typed in the CEPH population, who are Utah residents with ancestry from Northern and Western Europe. The tests for unaffected trios are used as a control, since an association found in unaffected individuals may point to a disease protective locus, genotyping errors or departures from Hardy–Weinberg equilibrium.
Crohn affected and unaffected data sets from the IHMP are all available on the supplementary website.
Results for real data sets
To show these results we used comparative TDT (CTDT) maps, which are drawn by averaging the p values for each sliding window covering the same marker. A computer program to construct these maps was built using BioCASE (Montes and AbadGrau 2009). Each row in a CTDT map represents sample results obtained from a different TDT. The height of the colored bar for each marker represents the range of the p value. If the p value is greater than 0.05, there is no color for that marker position, meaning that association is not significant. If the p value is less than 0.01, the colored bar has maximum height.
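The construction of one row of a CTDT map can be sketched as follows. This is a minimal illustration and not the BioCASE-based program: the function names are hypothetical, windows are assumed to start at every consecutive marker, and the linear scaling of bar height for p values between 0.01 and 0.05 is an assumption, since the text only fixes the two extremes.

```python
def average_p_per_marker(window_p_values, window_length):
    """Average, for each marker, the p values of all windows covering it.

    window_p_values[i] is the p value of the sliding window that starts at
    marker i and covers markers i .. i + window_length - 1.
    """
    n_markers = len(window_p_values) + window_length - 1
    sums = [0.0] * n_markers
    counts = [0] * n_markers
    for start, p_value in enumerate(window_p_values):
        for marker in range(start, start + window_length):
            sums[marker] += p_value
            counts[marker] += 1
    return [total / count for total, count in zip(sums, counts)]


def bar_height(p_value, max_height_below=0.01, no_color_above=0.05):
    """Map an averaged p value to a relative bar height in [0, 1].

    No bar above 0.05 and full height below 0.01, as described in the text.
    The linear interpolation in between is an assumption of this sketch.
    """
    if p_value > no_color_above:
        return 0.0
    if p_value < max_height_below:
        return 1.0
    return (no_color_above - p_value) / (no_color_above - max_height_below)


# Three windows of length 2 over four markers (made-up p values).
averaged = average_p_per_marker([0.002, 0.02, 0.2], window_length=2)
heights = [bar_height(p) for p in averaged]
```

Markers at the ends of a region are covered by fewer windows, so their averages rest on fewer p values than those of interior markers.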
The association of the KIAA0350/CLEC16A locus with MS was reported by the IMSGC genomewide association study (International Multiple Sclerosis Genetics Consortium et al. 2007), although it did not reach genomewide significance. It was later replicated in several populations and is now considered a risk factor for MS (Martínez et al. 2010; M et al. 2009). Our results for the KIAA0350 locus (Fig. 8a) reveal that mTDT_{P} detected a strong association (maximum height bar) from locus rs28087 to locus rs248836. Compared with mTDT and the alternative corrections for coping with the small data problem, mTDT_{P} is more specific, as the range of markers with maximum association is smaller. The other tests did not detect association at p values below 0.01.
Interferon regulatory factor 5 (IRF5) has been found to be associated with MS in a candidate gene study in several populations (Kristjansdottir et al. 2008). Results for IRF5 (Fig. 8b) show an interesting pattern for mTDT_{P} and mTDT_{1T}: there is a locus with maximum association (rs3807306), which may mean that the actual disease susceptibility locus lies somewhere between this marker and its left and right neighbors, and association decreases continuously with distance from this marker, both to the left and to the right along the chromosome. For the other mTDT measures, this pattern appears only on the right side of the locus with maximum association. Thus, mTDT_{P} again yields the most information: the power is maximum over a shorter region and decreases significantly with distance from this region.
However, results obtained by mTDT_{P} do not always show a narrower region of association. Sometimes the region is as wide as, or even wider than, that detected by mTDT. This is the case for the human leukocyte antigen (HLA) locus (see the fifth CTDT map in Figs. S38–S43 on the supplementary website). This would mean that it is not a single gene at that locus that is associated with the disease, with the other associations detected merely as a result of linkage; rather, many genes along the HLA locus can influence disease onset. This is consistent with other studies suggesting that the HLA class II genes (HLA-DRB1) are the major determinants of MS risk in the major histocompatibility complex (MHC) region. Despite the recognized effect of HLA class II genes on risk, it is not clear what contributions other genes in this region may make. The MHC region has extensive LD spanning several megabases (Mb) and high levels of variability, with the HLA genes having hundreds of alleles. The MS data set analyzed here has not been genotyped at a sufficient marker density across the entire MHC region to model the class II effects appropriately, so we cannot be confident that the associations are not attributable to either the class II loci themselves or other (untyped) loci within the region.
Discussion
With current SNP genotype samples of a few hundred or a few thousand family trios, the locus specificity of a test has become as important as its power, as it is very common to find associations due to linkage at loci at a considerable distance from the disease susceptibility locus. These associations usually cannot be replicated in other samples from closely related populations, because they are at some distance from the disease susceptibility locus and, owing to recombination, their haplotypes may have diverged from those of the common ancestors represented in the first sample. A test that lacks locus specificity may detect association at a considerable chromosomal distance from the disease susceptibility locus. Such associations can be considered spurious, as they do not point to a susceptibility locus or to positions very close to it, and they are unlikely to be replicated in a slightly different sample. Thus, more than two markers may be used so that power will increase with a lower risk of low specificity. Therefore, it is very important to consider the locus specificity of TDTs to increase the chances of finding truly risky haplotypes, i.e., those actually at the disease susceptibility locus or at a very short distance from it, and thus the chances of replicating the results in other samples. With this goal, we proposed mTDT_{P}, which is based on mTDT, one of the first multimarker TDTs. In our simulations, mTDT, together with mTDT_{1T} and mTDT_{P}, has the highest power under a wide range of scenarios. Because mTDT_{P} is based on mTDT, the new assumption used to define mTDT_{P} is crucial to improving locus specificity without sacrificing the high power of mTDT. Therefore, the new assumption, and thus the modification introduced by the test, had to be as simple as possible, so that the test would remain as generic as mTDT and would focus on making association rates decrease faster with chromosomal distance from the disease susceptibility locus.
To achieve this, the new assumption was very specific: association decreases with chromosomal distance from a specific locus because of recombination. As a consequence, haplotypes in phase with a disease variant at the time at which the variant appeared would recombine more often with other haplotypes with increasing distance from the disease locus. Thus, in a sample of trios with affected offspring, the frequency of these nonrecombinant haplotypes will be lower than if the haplotype were closer to the disease locus. Therefore, by weighting each summand in mTDT by the haplotype frequency, we reduce the effect that haplotypes at some chromosomal distance from a disease locus can have on the measure because of linkage. Moreover, in positions close to the disease locus, and assuming the CDCV hypothesis, there would be very few, but common, haplotypes with strong association with the risk variant, so that the weighting procedure will not reduce the power.
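The effect of frequency weighting can be illustrated with a toy calculation. This sketch is not the mTDT_{P} statistic itself: it assumes a generic TDT-style term (n_T - n_U)^2 / (n_T + n_U) for each haplotype, estimates each haplotype's frequency as its share of all informative transmissions, and omits any normalization or reference distribution. All names and counts are hypothetical.

```python
def per_haplotype_terms(transmitted, untransmitted):
    """TDT-style term (n_T - n_U)^2 / (n_T + n_U) for each haplotype.

    Both arguments map a haplotype label to a count of transmissions or
    nontransmissions from heterozygous parents.
    """
    terms = {}
    for hap in set(transmitted) | set(untransmitted):
        n_t = transmitted.get(hap, 0)
        n_u = untransmitted.get(hap, 0)
        if n_t + n_u > 0:
            terms[hap] = (n_t - n_u) ** 2 / (n_t + n_u)
    return terms


def unweighted_sum(transmitted, untransmitted):
    """Every haplotype contributes equally, whatever its frequency."""
    return sum(per_haplotype_terms(transmitted, untransmitted).values())


def frequency_weighted_sum(transmitted, untransmitted):
    """Each term is scaled by the haplotype's share of all observations.

    The weight used here is an assumption of this sketch.
    """
    total = sum(transmitted.values()) + sum(untransmitted.values())
    if total == 0:
        return 0.0
    terms = per_haplotype_terms(transmitted, untransmitted)
    weighted = 0.0
    for hap, term in terms.items():
        share = (transmitted.get(hap, 0) + untransmitted.get(hap, 0)) / total
        weighted += share * term
    return weighted


# Made-up counts: one common over-transmitted haplotype, two rarer ones.
t_counts = {"H1": 60, "H2": 20, "H3": 20}
u_counts = {"H1": 30, "H2": 35, "H3": 35}
plain = unweighted_sum(t_counts, u_counts)
weighted = frequency_weighted_sum(t_counts, u_counts)
```

In this example the common haplotype H1 accounts for two thirds of the weighted total but only about half of the unweighted one, which is the behavior the weighting is meant to produce: rare haplotypes, as expected far from the disease locus, count for less.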
We performed simulations under a wide range of population and disease variables, such as the number of disease loci, the disease model, the relative risk of a genotype, haplotype length, etc. Simulations confirmed the correctness of the assumptions and the improvement in locus specificity achieved by mTDT_{P} without reducing the power. We also used several real trio data sets with affected offspring.
As these TDTs are to be applied to genomewide data sets, a multiple testing correction should be performed. Multiple testing correction for GWAS is currently a very active research topic (Betensky and Rabinowitz 2000; Wei et al. 2009; Gorlov et al. 2009), as most current approaches do not consider LD between different markers; they usually overcorrect association results, and true-effect associations may therefore be missed. As the objective of the simulations was to compare the power and locus specificity of different tests, we did not perform multiple testing corrections in any of the tests, and p values were compared directly. Moreover, mTDT_{1T} and mTDT_{1U}, which choose the haplotype with the lowest p value, were not competitive when the Bonferroni correction was applied. Current real genomewide data usually have hundreds of thousands of markers. We considered using sliding windows and comparative TDT maps as visual tools for genomewide screening, together with the use of IHMP samples as controls. In these two visual tools, instead of a single p value for each window with multiple testing correction, the average of the p values of all the windows a marker belongs to is drawn, in order to reduce the chances of spurious associations. We therefore chose a simple approach to detect the decay of association with distance, in order to select a region for a further fine-mapping study, which would include a denser screening of the selected region and sample replication, and for which multiple testing correction may be required.
The results obtained using mTDT_{P} analysis for the MS data set showed a more precise definition of MS-implicated variants among the loci analyzed. KIAA0350/CLEC16A has been associated with several autoimmune diseases in genomewide association and replication studies (International Multiple Sclerosis Genetics Consortium et al. 2007; Todd et al. 2007; Márquez et al. 2009). Fine mapping of the region for type 1 diabetes (T1D) by resequencing of exons and flanking regions and SNP genotyping for the surrounding genes revealed that the most probable causal variant would be localized at the 3′ end of the KIAA0350/CLEC16A gene. Results for the mTDT_{P} CTDT map of the KIAA0350/CLEC16A locus using MS data reveal that the region with greatest association is the last 60 kbp at the 3′ end of the gene, whereas the other TDTs extend the association to the intergenic 3′ region. These mTDT_{P} results point to the 3′ end of the KIAA0350 gene as the causative association region in MS, as described for T1D. We also observed for some other loci that the mTDT_{P} map extends to a larger region than the other TDT maps. This is the case for the IRF5 locus. The most probable causal variant for association of the IRF5 locus with MS is a functional 5-bp biallelic insertion–deletion polymorphism that differentially binds the SP1 transcription factor to the IRF5 promoter (Kristjansdottir et al. 2008). The mTDT_{P} map revealed maximal association at IRF5 and extended it to the 5′ region, including the IRF5 promoter, whereas the other maps did not reveal any association with the IRF5 promoter. In designing a fine mapping of the IRF5 locus based on mTDT, mTDT_{Y}, mTDT_{YP}, mTDT_{L1} or mTDT_{L2} results, we would be erroneously focusing on the middle of the gene instead of on the promoter, where the most probable causative variants are located.
An interesting question is whether mTDT_{P} would still be useful when disease-susceptibility variants have very low frequencies, i.e., under the ‘common disease, many rare variants’ (CDMRV) hypothesis. In general, GWAS are not suitable for capturing rare variants, and other techniques, such as DNA resequencing of candidate genes, are often used (Bodner and Bonilla 2008). However, it has recently been claimed that many of the associations found by GWAS are due to ‘synthetic associations’ between very rare variants and less rare alleles, such as SNP markers (Dickson et al. 2010), on the basis that what is usually tested are not the causative genes but SNP markers around them. Under this hypothesis, we believe mTDT_{P} may have less power than mTDT, judging from the results of our simulations (Fig. 7 and supplementary Figures S55–S59): using mutation frequencies usual in common diseases (interval [0.2, 0.4]), mTDT_{P} outperforms mTDT in power; if we reduce mutation frequencies to the interval [0.1, 0.2], which is still too high for a variant to be considered rare, the power of the two tests converges and mTDT even outperforms mTDT_{P} under several scenarios.
Our ultimate goal is to have a multimarker test that: (1) requires little computational time, like mTDT or mTDT_{HE}; (2) provides high power under very different circumstances, like mTDT or mTDT_{1T}; (3) performs stronger filtering than state-of-the-art TDTs, so that it can detect association in narrower regions when used as a first genomewide step in searching for disease susceptibility or protective genes. mTDT_{P} achieves these three goals better than all the other tests we used. Moreover, by producing highly informative comparative TDT (CTDT) maps using different low-complexity TDT measures with very different specificity and sensitivity behaviors, and using IHMP samples as both control and test validators, we provide a robust tool for visual exploration that may assist molecular biologists in deciding which regions to choose for fine mapping.
In conclusion, we believe mTDT_{P} can benefit genomewide association studies, as its higher locus specificity may be crucial to improving the chances of detecting only associations close to a disease susceptibility or protective locus, and therefore the chances of these associations being replicated in different samples.
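For reference, the Bonferroni correction mentioned above amounts to the following arithmetic. This is a generic sketch with hypothetical function names, using the figure of about 500,000 markers quoted earlier for a genomewide array; it is not part of the analysis pipeline used in this work.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def bonferroni_adjust(p_values):
    """Adjusted p values: each raw p value times the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# With about 500,000 tests, a nominal level of 0.05 becomes a per-test
# threshold of the order of 1e-7.
threshold = bonferroni_threshold(0.05, 500_000)
```

Because the correction treats all tests as independent, it is conservative when neighboring markers or overlapping windows are in LD, which is the overcorrection referred to above.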
Web source
A supplementary website has been created for this study at http://bios.ugr.es/TDTP, where Figures S1–S43, Table S1, a detailed explanation of the simulations performed and the C++ source code of the software developed for this work are available.
Acknowledgments
The authors were supported by the Spanish Research Program under project TIN200767418C0303, the Andalusian Research Program under project P08TIC03717 and the European Regional Development Fund (ERDF). We acknowledge the International Multiple Sclerosis Genetics Consortium (IMSGC) for giving us access to their data repository.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.