An extension of the weighted dissimilarity test to association study in families
DOI: 10.1007/s00439-007-0376-5
- Cite this article as:
- Yuan, A., Yue, Q., Apprey, V. et al. Hum Genet (2007) 122: 83. doi:10.1007/s00439-007-0376-5
Abstract
Association studies for complex diseases based on pedigree haplotype or genotype data have received increasing attention in the last few years. Similarity tests are appealing for these studies because they take the DNA structure into account, but they have blind areas in which significant association cannot be detected. Recently, we developed a dissimilarity method for this problem based on independent haplotype data, which eliminates the blind areas of the existing methods. DNA collected on families is common in practice, and the data come in either genotype or haplotype form. Here we extend our method for association studies to data on families. It can be used to evaluate different designs in terms of power. Simulation studies confirmed that the extended method improves the type I error rate and power. Applying this method to the Genetic Analysis Workshop 14 alcoholism data, we find that markers rs716581, rs1017418, rs1332184 and rs1943418 on chromosomes 1, 2, 9 and 18 yield strong signals (with P value 0.001 or lower) for association with alcoholism. Our work can serve as a guide in the design of association studies in families.
Introduction
Gene sequence analysis for complex diseases has been receiving increasing attention, largely due to new technologies for genome scans. With the information from the International HapMap Project, haplotype-based association studies will continue to be one of the key methods for finding genes underlying complex diseases (Daly et al. 2001; Gabriel et al. 2002). Haplotype data are usually not directly available and have to be determined from genotype data, so analysis based on genotype data is more convenient and provides complementary insight. As some diseases are caused by gene mutations, the causal genes are often in tight linkage with the flanking markers surrounding them (Jorde 2000); thus the composition of the haplotype blocks in the affected population (cases) should differ from that in the unaffected population (controls). A systematic case-control analysis of common haplotype or genotype variants in the human genome can reveal the major genetic origins of diseases.
To identify disease susceptibility genes, many statistical procedures for association studies are available. For independent data, the commonly used nonparametric methods are variations of Pearson’s χ^{2} statistic. Methods that take into account the DNA structure in the haplotype data are especially appealing, such as the similarity test (Tzeng et al. 2003; Schaid et al. 2005), which has the form \(\hat{p}'S\hat{p}-\hat{q}'S\hat{q},\) where S is a matrix of similarity measures among all the different haplotypes under study, and \(\hat{p}\,\hbox{and}\,\hat{q}\) are the observed case and control frequencies of these haplotypes. This type of test statistic indicates association of the haplotypes with the disease if the above difference of similarities deviates significantly from zero. Tests of this type have blind areas in which significant differences cannot be detected. Such areas are quantified by the set of case and control frequencies {(p,q): p′Sp−q′Sq = 0, p ≠ q}. Recently we proposed a nonparametric weighted dissimilarity test for this problem (Yuan et al. 2006), which eliminates such blind areas and enhances the power to detect differences. The original method is for independent haplotypes. Because the independent case-control design may suffer from population stratification (Hutchison et al. 2004) and population admixture confounding (Lander and Schork 1994), in practice many DNA sequence data are collected on families, and the data are usually in either genotype or haplotype form. A basic question is whether, and to what extent, the dependence among family members affects the results of an association study that uses methods for individual data. To answer this question, we extend our method to families. There are some corresponding methods that account for dependence in families (for example, Van der Meulen and te Meerman 1997; Bourgain et al. 2000; Allen and Satten 2007), but they do not correct for blind areas.
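The blind area described above can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical three-haplotype block with an identity similarity matrix (similarity 1 for identical haplotypes, 0 otherwise) and made-up frequencies; the function name `similarity_difference` is ours and is not taken from the cited methods.

```python
import numpy as np

def similarity_difference(p, q, S):
    """Return p'Sp - q'Sq, the population analogue of the similarity statistic."""
    p, q, S = (np.asarray(a, dtype=float) for a in (p, q, S))
    return float(p @ S @ p - q @ S @ q)

# Hypothetical example: identity similarity and two different frequency vectors.
S = np.eye(3)
p = [0.5, 0.3, 0.2]   # case frequencies (illustrative values)
q = [0.3, 0.5, 0.2]   # control frequencies: p with its first two entries swapped

diff = similarity_difference(p, q, S)
print("p differs from q:", not np.allclose(p, q))
print("similarity difference is zero:", abs(diff) < 1e-12)
```

Here p ≠ q, yet p′Sp = q′Sq, so a test built on this difference alone has no signal to detect at this pair of frequencies.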
Other family based association methods, such as the transmission disequilibrium test and its variations (Field et al. 1986; Terwilliger and Ott 1992; Spielman et al. 1993; Whittemore et al. 2005), usually require more complicated data or a different data type, and do not take into consideration the genetic distances among the haplotype structures. Here we extend our method to encompass both genotype and haplotype data collected on families. In addition, by comparing the powers of different designs with the same sample size, the method can serve as a guide for selecting a design that achieves greater power and for avoiding designs with low efficiency.
Methods
The method applies to both haplotype and genotype data. Let x_{1}, ..., x_{m} and y_{1}, ..., y_{n} be the observed haplotypes (genotypes) for the affected and unaffected individuals at a haplotype block (gene locus) with k different haplotypes (genotypes) H = (h_{1}, ..., h_{k}). Let p = (p_{1}, ..., p_{k})′ and q = (q_{1}, ..., q_{k})′ be the case and control population frequencies. The goal can be formulated simply as the test of the hypothesis H_{0}:p = q vs. H_{1}:p ≠ q. For this problem, the traditional similarity methods may have blind areas in which p ≠ q but the difference is difficult for them to detect. Recently we (Yuan et al. 2006) proposed a dissimilarity test that eliminates any possible blind areas and is powerful.
By the large sample theory of U-statistics, under H_{0}, \(U_{m,n} \buildrel{a.s.}\over {\rightarrow} \mu_{p,p},\) and under \(H_1, U_{m,n} \buildrel{a.s.} \over {\rightarrow} \mu_{p,q} > \mu_{p,p},\) where a.s. stands for “almost surely”. So under H_{0}, Z tends to take moderate values, and under H_{1}, Z tends to be larger. Thus, for a given 0 < α < 1, the rejection rule for the level α test of H_{0} has the form Z > Z(α), where Z(α) is the upper αth quantile of the asymptotic null distribution of (5); equivalently, H_{0} is rejected if Z has a P value less than α. In the following, we give the asymptotic distribution of our test statistic and provide procedures for using the test. Unlike the independent data case, here we suppose that there are r different types of relative pairs in the data. For example, in a three generation pedigree we have three types of relative pairs: (grand-parent, grand-child), (parent, child) and (sib, sib). Let m_{j} and n_{j} be the numbers of concordant pairs of type j among the cases and controls, and let o_{j} be the number of type j discordant pairs (j = 1, ..., r). Note that if there are b sibs in a given family, the number of different sib pairs is C_{b}^{2} = b(b−1)/2, so some of the m_{j}’s (or n_{j}’s) may be larger than m (or n).
To describe the dependence among relative pairs, we need the kinship coefficients. For this we need to distinguish the case for haplotype data and that for genotype data.
Haplotype Kinship coefficient for some relative pairs
Relationship | Δ′_{8} | Δ′_{9} |
---|---|---|
Grand parent–offspring | 1/8 | 7/8 |
Parent–offspring | 1/4 | 3/4 |
Half siblings | 1/8 | 7/8 |
Full siblings | 1/4 | 3/4 |
First cousins | 1/32 | 31/32 |
Double first cousins^{a} | 1/16 | 15/16 |
Second cousins | 1/256 | 255/256 |
Uncle–nephew | 1/16 | 15/16 |
Let μ_{p,q} = E_{p,q}(D(x,y)), and let \(\buildrel{D} \over {\rightarrow}\) denote convergence in distribution. Modifying the proof of our original proposition for independent observations, we obtain the following result (see Appendix).
Proposition 1
- (i) Under H_{0},
$$ Z = \sqrt{N}(\hat{U}_{m,n}-\mu_{\check{p},\check{p}}) \buildrel{D} \over {\rightarrow} p'D|V|, $$
where D = (D(h_{i},h_{j}))_{k × k}, V = (v_{1}, ..., v_{k})′ ∼ N(0,R), |V| = (|v_{1}|, ..., |v_{k}|)′ and R = (r_{ij})_{k × k}, with
$$ r_{ij} = \left\{\begin{array}{ll}p_i(1-p_i)[\gamma_1+\gamma_2+2 \sum_{l=1}^r(\gamma_1\lambda_{11,l}-\sqrt{\gamma_1\gamma_2}\lambda_{12, l} +\gamma_2\lambda_{22,l})\Delta'_{8,l}], & \hbox{if } i=j,\\ -p_ip_j [\gamma_1+\gamma_2+2\sum_{l=1}^r(\gamma_1\lambda_{11,l}-\sqrt{\gamma_1 \gamma_2}\lambda_{12,l}+\gamma_2\lambda_{22,l})\Delta'_{8,l}], & \hbox{else.} \end{array}\right. $$
- (ii) Under H_{1},
$$ \sqrt{N}(\hat{U}_{m,n}-\mu_{\check{p},\check{p}}-\delta)/\sigma \buildrel{D} \over {\rightarrow} N(0,1), $$
where \(\delta = p'Dq-\check{p}'D\check{p} > 0,\) \(\sigma^2 = \gamma_1 (a'+q'D_{p,q})S(a+D_{p,q}q)+\gamma_2(b'+p'D_{p,q})Q(b+D_{p,q}p)-2 \sqrt{\gamma_1\gamma_2}(a'+q'D_{p,q})T(b+D_{p,q}p),\) and \(S = (s_{ij})_{k\times k},\) with
$$ s_{ij} = \left\{\begin{array}{ll} p_i(1-p_i)(1+2\sum_{l=1}^r \lambda_{11,l}\Delta'_{8,l}), & \hbox{if } i=j,\\ -p_ip_j(1+2 \sum_{l=1}^r\lambda_{11,l}\Delta'_{8,l}), & \hbox{else.} \end{array}\right. $$
Q is S with p replaced by q and the λ_{11,l}’s by the λ_{22,l}’s, and T = (t_{ij})_{k × k}, with
$$ t_{ij} = \left\{\begin{array}{ll}\sum_{l=1}^r\lambda_{12,l} (p_iq_i(1-\bar{p}_i)+(1-p_i)(1-q_i)\bar{p}_i)\Delta'_{8,l}, & \hbox{if } i=j,\\ -\sum_{l=1}^r\lambda_{12,l}(p_i(1-q_j)\bar{p}_j+p_j(1-q_i) \bar{p}_i)\Delta'_{8,l}, & \hbox{else,}\end{array}\right. $$
and a = (a_{1}, ..., a_{k})′, b = (b_{1}, ..., b_{k})′, with
$$ \begin{aligned} a_i &= p_i \frac{{\rm e}^{p_i-q_i}-{\rm e}^{q_i-p_i}}{2}\sum_{j=1}^k \frac{{\rm e}^{p_j-q_j}+{\rm e}^{q_j-p_j}}{2}q_jD({\bf h}_i,{\bf h}_j),\quad(i=1, \ldots, k)\\ b_i &= q_i \frac{{\rm e}^{q_i-p_i}-{\rm e}^{p_i-q_i}}{2}\sum_{j=1}^k \frac{{\rm e}^{q_j-p_j}+{\rm e}^{p_j-q_j}}{2}p_jD({\bf h}_i,{\bf h}_j), \quad(i=1, \ldots, k). \end{aligned} $$
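The null covariance matrix R in part (i) has a simple structure: every entry shares the same bracketed factor, so R is that factor times diag(p) − pp′. The sketch below builds R from this observation. The function name `null_covariance` and all numerical inputs are hypothetical; the quantities γ_1, γ_2 and λ_{11,l}, λ_{12,l}, λ_{22,l} are treated as given inputs, since their definitions are not reproduced in this section.

```python
import numpy as np

def null_covariance(p, gamma1, gamma2, lam11, lam12, lam22, delta8):
    """Covariance R of V in Proposition 1(i), written as c * (diag(p) - p p').

    p                   : haplotype frequencies under H0 (length k, summing to 1)
    gamma1, gamma2      : the constants gamma_1 and gamma_2 of the proposition
    lam11, lam12, lam22 : lambda_{11,l}, lambda_{12,l}, lambda_{22,l} for l = 1..r
    delta8              : haplotype kinship coefficients Delta'_{8,l} for l = 1..r
    """
    p = np.asarray(p, dtype=float)
    lam11, lam12, lam22, delta8 = (
        np.asarray(a, dtype=float) for a in (lam11, lam12, lam22, delta8)
    )
    # Common bracketed factor shared by the diagonal and off-diagonal entries.
    c = gamma1 + gamma2 + 2.0 * np.sum(
        (gamma1 * lam11 - np.sqrt(gamma1 * gamma2) * lam12 + gamma2 * lam22) * delta8
    )
    return c * (np.diag(p) - np.outer(p, p))

# Illustrative inputs only: one relative-pair type (full sibs, Delta'_8 = 1/4).
R = null_covariance(
    p=[0.5, 0.3, 0.2], gamma1=0.5, gamma2=0.5,
    lam11=[0.2], lam12=[0.1], lam22=[0.2], delta8=[0.25],
)
print(R)
```

Because the frequencies sum to one, each row of R sums to zero, so R is singular. This is why a decomposition that tolerates zero eigenvalues is needed when sampling from N(0,R).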
Genotype Kinship coefficient for some relative pairs
Relationship | Δ_{7} | Δ_{8} | Δ_{9} | Φ |
---|---|---|---|---|
Grand parent–offspring | 0 | 1/2 | 1/2 | 1/8 |
Parent–offspring | 0 | 1 | 0 | 1/4 |
Half siblings | 0 | 1/2 | 1/2 | 1/8 |
Full siblings | 1/4 | 1/2 | 1/4 | 1/4 |
First cousins | 0 | 1/4 | 3/4 | 1/16 |
Double first cousins | 1/16 | 6/16 | 9/16 | 1/8 |
Second cousins | 0 | 1/16 | 15/16 | 1/64 |
Uncle–nephew | 0 | 1/2 | 1/2 | 1/8 |
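As a consistency check on the genotype table, the sketch below verifies two relations that are standard for non-inbred relatives but are not stated in the text: the three sharing probabilities sum to one, and Φ = Δ_{7}/2 + Δ_{8}/4. It assumes that Δ_{7}, Δ_{8} and Δ_{9} are the probabilities of sharing two, one and zero alleles identical by descent; the dictionary simply copies the table entries.

```python
from fractions import Fraction as F

# (Delta_7, Delta_8, Delta_9, Phi) copied from the genotype kinship table.
TABLE = {
    "Grand parent-offspring": (F(0), F(1, 2), F(1, 2), F(1, 8)),
    "Parent-offspring": (F(0), F(1), F(0), F(1, 4)),
    "Half siblings": (F(0), F(1, 2), F(1, 2), F(1, 8)),
    "Full siblings": (F(1, 4), F(1, 2), F(1, 4), F(1, 4)),
    "First cousins": (F(0), F(1, 4), F(3, 4), F(1, 16)),
    "Double first cousins": (F(1, 16), F(6, 16), F(9, 16), F(1, 8)),
    "Second cousins": (F(0), F(1, 16), F(15, 16), F(1, 64)),
    "Uncle-nephew": (F(0), F(1, 2), F(1, 2), F(1, 8)),
}

def kinship_from_ibd(d7, d8):
    """Kinship coefficient implied by the IBD sharing probabilities (assumed identity)."""
    return d7 / 2 + d8 / 4

checks = {
    name: (d7 + d8 + d9 == 1) and (kinship_from_ibd(d7, d8) == phi)
    for name, (d7, d8, d9, phi) in TABLE.items()
}
for name, ok in checks.items():
    print(f"{name}: {'consistent' if ok else 'inconsistent'}")
```

Exact rational arithmetic is used so that the comparison does not depend on floating-point rounding.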
Let Δ_{7,j} and Δ_{8,j} be the corresponding kinship coefficients for type j relative pairs. Assume there are h different alleles a_{j} (j = 1, ..., h), so k = h(h + 1)/2. We prove the following proposition in the Appendix.
Proposition 2
Note: If we adopt the dictionary representation of genotypes (11, 12, 13, ..., 1h, 22, 23, ..., 2h, ..., hh) for an h-allele locus, then a genotype i can be uniquely written as i = (st) if (s−1)h−(s−2)(s−1)/2 < i ≤ sh−(s−1)s/2, with t = i−[(s−1)h−(s−2)(s−1)/2] + (s−1), since the block of genotypes with first allele s starts at (ss) (i = 1, ..., h(h + 1)/2).
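The index convention in the note can be made concrete with a short sketch. It assumes 1-based indices and that, within the block of genotypes whose first allele is s, the second allele runs from s to h, so that the block starts at (s, s). The function name `genotype_from_index` is ours.

```python
def genotype_from_index(i, h):
    """Map a 1-based dictionary index i to the genotype (s, t), s <= t, for h alleles."""
    if not 1 <= i <= h * (h + 1) // 2:
        raise ValueError("index out of range")
    for s in range(1, h + 1):
        lower = (s - 1) * h - (s - 2) * (s - 1) // 2  # number of genotypes before block s
        upper = s * h - (s - 1) * s // 2              # index of the last genotype in block s
        if lower < i <= upper:
            # Position within the block, shifted so the block starts at t = s.
            return s, (i - lower) + (s - 1)

# Enumerate all genotypes for a hypothetical 3-allele locus.
print([genotype_from_index(i, 3) for i in range(1, 7)])
```

For h = 3 the six indices enumerate the genotypes in the order 11, 12, 13, 22, 23, 33, matching the dictionary ordering given in the note.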
From the propositions we see that the asymptotic distribution of our statistic Z given in (5) is a linear combination of the absolute values of normal random variables under the null, and it shifts strictly to the right under the alternative. Based on the first part of the proposition, we can determine our test P value for a given genetic distance and two sets of haplotype frequencies. From the second part of the proposition, we can calculate the power and sample sizes. Now we specify some details in calculating the P values, power and sample sizes.
The following corollaries hold for both propositions, and we use the same general notation for the two cases to state them. With the notation of the propositions, we can write V as V = R^{1/2}U, where U ∼ N(0,I_{k}). Since R is positive semidefinite, there is an orthogonal matrix Q such that R = QΛQ′, and we can take R^{1/2} = QΛ^{1/2}, where Λ = diag(λ_{1}, ..., λ_{k}) and the λ_{i}’s (≥ 0) are the eigenvalues of R. Given a pre-specified integer M (typically M ≥ 5,000), for each 1 ≤ i ≤ M, sample W_{i} ∼ N(0,I_{k}). Then V_{i} = QΛ^{1/2}W_{i} (i = 1, ..., M) is a sample of size M from N(0,R). Let Z_{i} = p′D|V_{i}| (i = 1, ..., M) and let χ(·) be the indicator function. We are now ready to state the following two corollaries:
Corollary 1
With the above notation, under H_{0}, the P value for the statistic Z from (5) can be approximated by \(\sum_{i=1}^M \chi(Z_i\ge Z)/M.\)
For a given level α, the critical value Z(α) for (5) can be approximately determined by the (1−α)th sample quantile of Z_{1}, ..., Z_{M}, which is their [(1−α)M]th order statistic.
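The Monte Carlo procedure behind Corollary 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it writes the eigen-decomposition as R = QΛQ′ (the form returned by NumPy), uses a hypothetical dissimilarity matrix D and made-up frequencies, and reads the [(1−α)M]th order statistic as rounding up, which is an assumption.

```python
import numpy as np

def simulate_null(p, D, R, M=10_000, seed=0):
    """Draw M values Z_i = p' D |V_i| with V_i ~ N(0, R)."""
    rng = np.random.default_rng(seed)
    p, D, R = (np.asarray(a, dtype=float) for a in (p, D, R))
    eigvals, Q = np.linalg.eigh(R)          # R = Q diag(eigvals) Q'
    eigvals = np.clip(eigvals, 0.0, None)   # remove tiny negative round-off
    A = Q * np.sqrt(eigvals)                # A = Q Lambda^{1/2}, so A A' = R
    W = rng.standard_normal((M, len(p)))    # rows are W_i ~ N(0, I_k)
    V = W @ A.T                             # rows are V_i = A W_i
    return np.abs(V) @ (D.T @ p)            # Z_i = p' D |V_i|

def mc_p_value(z_obs, z_null):
    """Fraction of simulated null values that are at least as large as z_obs."""
    return float(np.mean(z_null >= z_obs))

def mc_critical_value(z_null, alpha):
    """Order statistic of rank ceil((1 - alpha) M); the rounding rule is an assumption."""
    z_sorted = np.sort(z_null)
    rank = int(np.ceil((1 - alpha) * len(z_sorted)))
    return float(z_sorted[min(max(rank, 1), len(z_sorted)) - 1])

# Illustrative inputs only: three haplotypes and a 0/1 mismatch dissimilarity.
p = np.array([0.5, 0.3, 0.2])
D = 1.0 - np.eye(3)
R = np.diag(p) - np.outer(p, p)

z_null = simulate_null(p, D, R)
crit = mc_critical_value(z_null, alpha=0.05)
print("critical value:", crit)
print("P value at the critical value:", mc_p_value(crit, z_null))
```

Clipping the eigenvalues guards against slightly negative values produced by round-off, which can occur because R is singular.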
Corollary 2
Simulation study
We simulated both genotype and haplotype data. We describe the simulation for genotype data in detail; the simulation for haplotype data is similar.
We generated data on a disease marker and a normal marker to study the power and type I error rate. We now describe the sampling briefly. We first generate the genotypes of the parents independently, based on a given set of frequencies, and then generate those of the sibs under Mendelian inheritance. The disease status is determined by the penetrance and the genotype at the disease locus. The data are selected from the simulated population according to a given design. We draw 10,000 samples to simulate the asymptotic null distribution at each marker, and make 500 replications to compute the P values, type I errors and powers of the test statistics at several significance levels α.
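The sampling steps described above can be sketched for a single biallelic marker. This is only an illustration under stated assumptions: genotypes are coded as the number of copies (0, 1 or 2) of one allele, the genotype frequencies are taken in that order, and the three-entry penetrance vector is hypothetical rather than one of the settings in the design table. All function names are ours.

```python
import numpy as np

def sample_parent_genotypes(freqs, size, rng):
    """Draw independent parental genotypes (0, 1 or 2 allele copies) with the given frequencies."""
    return rng.choice(3, size=size, p=freqs)

def transmit(parent_genotype, rng):
    """Allele (0 or 1) passed to a child by a parent with genotype 0, 1 or 2."""
    if parent_genotype == 1:
        return int(rng.integers(0, 2))   # heterozygote: either allele with probability 1/2
    return int(parent_genotype) // 2     # homozygote: always transmits the same allele

def sample_sib(father, mother, rng):
    """Genotype of one sib under Mendelian inheritance."""
    return transmit(father, rng) + transmit(mother, rng)

def sample_status(genotype, penetrance, rng):
    """Affection status, where penetrance[g] = P(affected | genotype g)."""
    return bool(rng.random() < penetrance[genotype])

rng = np.random.default_rng(2007)
freqs = [0.04, 0.32, 0.64]        # one of the four frequency sets used in the simulation
penetrance = [0.004, 0.02, 0.04]  # hypothetical values, increasing with allele count

father, mother = sample_parent_genotypes(freqs, size=2, rng=rng)
sibs = [sample_sib(father, mother, rng) for _ in range(2)]
status = [sample_status(g, penetrance, rng) for g in sibs]
print(father, mother, sibs, status)
```

Repeating this over many families and then selecting singletons, concordant pairs or discordant pairs by affection status would give samples for the different designs.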
To evaluate the performance of the extended method and the behavior of different designs, we simulated six data sets with different combinations of designs and penetrances, all with the same total sample size. To compare with related methods, for each significance level α = 0.05 and 0.01, we compare the results of (I) Tzeng et al. (2003), (II) the traditional chi-squared test, (III) our method (Yuan et al. 2006) for independent observations, (IV) our extended method for familial data, and (V) a family based method, with the software FBAT (Horvath et al. 2007), which is a multi-marker extension of the family based method of Laird et al. (2000). FBAT needs a certain number of informative families, so for our simulated data it can be computed only for designs (a)–(c). Although the chi-squared method is neither family based nor uses the DNA structure, we include it because of its popularity.
Designs used in the simulation
Code | Design name | Penetrance | m_{11} | n_{11} | o_{11} | m_{1} | n_{1} |
---|---|---|---|---|---|---|---|
(a) | Control and discordant pairs | (0.04, 0.004) | 0 | 100 | 100 | 0 | 0 |
(b) | Control and discordant pairs | (0.10, 0.004) | 0 | 100 | 100 | 0 | 0 |
(c) | Discordant pairs | (0.04, 0.004) | 0 | 0 | 200 | 0 | 0 |
(d) | Singleton | (0.04, 0.004) | 0 | 0 | 0 | 200 | 200 |
(e) | Case pairs and control singleton | (0.04, 0.004) | 100 | 0 | 0 | 0 | 200 |
The genotype frequencies at each marker are selected from the four sets (0.04, 0.32, 0.64), (0.09, 0.42, 0.49), (0.16, 0.48, 0.36) and (0.25, 0.5, 0.25). They are not necessarily in Hardy-Weinberg equilibrium. Here we adopted the additive disease model and chose relatively low penetrances, since high penetrances give powers very close to 1, which makes the methods hard to distinguish.
Type I error rates and powers for five methods on simulated genotype data
Design | Measure | I, α = 0.05 | II, α = 0.05 | III, α = 0.05 | IV, α = 0.05 | V, α = 0.05 | I, α = 0.01 | II, α = 0.01 | III, α = 0.01 | IV, α = 0.01 | V, α = 0.01 |
---|---|---|---|---|---|---|---|---|---|---|---|
(a) | Type I error | 0.028 | 0.021 | 0.034 | 0.046 | 0.034 | 0.009 | 0.007 | 0.012 | 0.016 | 0.004 |
(a) | Power | 0.100 | 0.218 | 0.274 | 0.308 | 0.138 | 0.048 | 0.112 | 0.134 | 0.190 | 0.032 |
(b) | Type I error | 0.033 | 0.018 | 0.029 | 0.038 | 0.067 | 0.007 | 0.005 | 0.007 | 0.014 | 0.015 |
(b) | Power | 0.142 | 0.750 | 0.814 | 0.840 | 0.456 | 0.066 | 0.614 | 0.696 | 0.762 | 0.242 |
(c) | Type I error | 0.002 | 0.001 | 0.003 | 0.007 | 0.059 | 0.001 | 0.000 | 0.001 | 0.004 | 0.013 |
(c) | Power | 0.004 | 0.156 | 0.242 | 0.446 | 0.686 | 0.000 | 0.058 | 0.088 | 0.288 | 0.464 |
(d) | Type I error | 0.055 | 0.044 | 0.060 | 0.060 | – | 0.017 | 0.020 | 0.025 | 0.025 | – |
(d) | Power | 0.562 | 0.966 | 0.980 | 0.980 | – | 0.428 | 0.926 | 0.956 | 0.956 | – |
(e) | Type I error | 0.129 | 0.145 | 0.150 | 0.077 | – | 0.069 | 0.076 | 0.082 | 0.032 | – |
(e) | Power | 0.752 | 0.966 | 0.970 | 0.952 | – | 0.660 | 0.948 | 0.958 | 0.908 | – |
(f) | Type I error | 0.093 | 0.101 | 0.102 | 0.062 | – | 0.049 | 0.044 | 0.049 | 0.026 | – |
(f) | Power | 0.420 | 0.476 | 0.512 | 0.420 | – | 0.304 | 0.318 | 0.370 | 0.274 | – |
Type I error rates and powers for four methods on simulated haplotype data
Design | Measure | I, α = 0.05 | II, α = 0.05 | III, α = 0.05 | IV, α = 0.05 | I, α = 0.02 | II, α = 0.02 | III, α = 0.02 | IV, α = 0.02 | I, α = 0.01 | II, α = 0.01 | III, α = 0.01 | IV, α = 0.01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(a) | Type I error | 0.034 | 0.034 | 0.024 | 0.010 | 0.018 | 0.004 | 0.004 | 0.000 | 0.016 | 0.000 | 0.000 | 0.000 |
(a) | Power | 0.330 | 0.570 | 0.610 | 0.522 | 0.178 | 0.422 | 0.448 | 0.346 | 0.100 | 0.322 | 0.330 | 0.274 |
(b) | Type I error | 0.034 | 0.044 | 0.032 | 0.020 | 0.016 | 0.016 | 0.010 | 0.002 | 0.002 | 0.010 | 0.002 | 0.000 |
(b) | Power | 0.404 | 0.908 | 0.912 | 0.884 | 0.202 | 0.822 | 0.846 | 0.804 | 0.110 | 0.770 | 0.788 | 0.726 |
(c) | Type I error | 0.026 | 0.010 | 0.014 | 0.020 | 0.006 | 0.004 | 0.002 | 0.006 | 0.002 | 0.002 | 0.000 | 0.002 |
(c) | Power | 0.194 | 0.174 | 0.262 | 0.346 | 0.102 | 0.086 | 0.118 | 0.200 | 0.058 | 0.048 | 0.078 | 0.120 |
(d) | Type I error | 0.048 | 0.036 | 0.038 | 0.038 | 0.024 | 0.020 | 0.012 | 0.012 | 0.012 | 0.006 | 0.006 | 0.006 |
(d) | Power | 0.814 | 0.908 | 0.926 | 0.926 | 0.708 | 0.830 | 0.876 | 0.876 | 0.612 | 0.756 | 0.800 | 0.800 |
(e) | Type I error | 0.098 | 0.116 | 0.092 | 0.052 | 0.052 | 0.068 | 0.048 | 0.022 | 0.032 | 0.044 | 0.024 | 0.008 |
(e) | Power | 0.920 | 0.996 | 0.998 | 0.992 | 0.858 | 0.990 | 0.992 | 0.990 | 0.792 | 0.988 | 0.990 | 0.986 |
(f) | Type I error | 0.052 | 0.076 | 0.046 | 0.026 | 0.012 | 0.016 | 0.020 | 0.010 | 0.006 | 0.012 | 0.010 | 0.008 |
(f) | Power | 0.934 | 1.000 | 1.000 | 1.000 | 0.882 | 0.998 | 1.000 | 0.998 | 0.822 | 0.992 | 0.998 | 0.994 |
In this case, the curves for Methods II, III, IV are very close, indicating no clear advantage among the three methods, at least for these simulated data.
Application to the genetics of alcoholism
From this figure we see that our two weighted methods yield similar results. Our previous method without family adjustment yields slightly smaller P values, while the similarity test of Tzeng et al. gives considerably larger P values at most of the markers. Since a value of -log_{10}(P value) greater than 2 corresponds to a P value less than 0.01, we see that the weighted methods find quite a few significant markers at the 1% level. Markers rs716581 on chromosome 1 (location 121.1), rs1017418 on chromosome 2 (80.5), rs1332184 on chromosome 9 (41.3) and rs1943418 on chromosome 18 (80.9), corresponding to SNP loci 168, 487, 2498 and 4162, respectively, yield the strongest signals (with significance level 0.001 or lower), which suggests that these loci are highly likely to be associated with alcoholism. We also see from the figure that our weighted methods detected many loci with -log_{10}(P value) ≥ 1.3 (significant at the 5% level) that are missed by the test of Tzeng et al., owing to the blind areas of the similarity test.
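The thresholds quoted above follow directly from the −log_{10} transformation. As a small arithmetic check (the helper name `neg_log10` is ours):

```python
import math

def neg_log10(p_value):
    """Transform a P value to the -log10 scale used in the figure."""
    return -math.log10(p_value)

# Significance levels mentioned in the text and their values on the -log10 scale.
for alpha in (0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: -log10(alpha) = {neg_log10(alpha):.2f}")
```

Smaller P values map to larger values on this scale, so the 5%, 1% and 0.1% levels correspond to thresholds of about 1.3, 2 and 3.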
Tian et al. (2005) analyzed a single SNP on chromosome 4, rs1037475, in the same data, using trend tests adjusted for family correlation, and compared the result with that of the same test without such adjustment. They found that the two tests give the same significant result, but that the standard error from the adjusted method is slightly larger, leading to a slightly larger P value. They concluded that the method without family adjustment would have a larger false positive rate and hence be anti-conservative. We found a similar pattern in our results.
Using a family-based GEE approach, Yang et al. (2005) analyzed 4 SNPs (rs1036475, rs1491233, rs749407, rs980972) on chromosome 4. They found that without familial adjustment, SNPs rs1036475 and rs980972 are significant with P values 0.0032 and 0.028, respectively. When familial adjustment was applied, they found a more conservative test: the P values for these two SNPs are 0.025 and 0.182, respectively. This is also in keeping with our general finding.
Concluding remarks
We extended our nonparametric dissimilarity test for case-control association study to family data, for both haplotype and genotype data. This test has no blind areas, low type I error rate and high power. Simulation studies show this method improves the inference over the existing methods.
In the five designs investigated in our simulation, the case-pair and control-pair design and the case-and-control singleton design have higher power. The case-control discordant pair design has lower power. There seems to be a trade-off between accuracy and power. To eliminate population stratification and/or admixture confounding, one should generally use the discordant pair design, although it has lower power. In our simulated example, the ROC curve shows that our family based method has an advantage over the other methods. If only a homogeneous population is involved in the study, the case-pair and control-pair design and the case-and-control singleton design are preferred for their higher power.
Based on our simulation studies, the family based method has a limited advantage over the individual based one.
We note for further research that under linkage equilibrium, the kinship coefficients are given in Tables 1 and 2 for haplotypes and genotypes. When this condition is not assumed, Δ′_{8} for haplotype sharing and Δ_{7} for genotype sharing will tend to be larger than the values given in these two tables. Their actual values can be estimated from the data, although the values from the two tables can still be used as approximations.
For case-control association analysis using genotype data alone, our simulation studies indicate that designs using either singleton cases vs singleton controls or concordant pairs are preferred in terms of power. Two designs are shown to be inferior: control pairs with discordant pairs, and the discordant-pairs-only design. Although the singleton cases vs singleton controls design appears to have greater power, it is known to be sensitive to population stratification.
In conclusion, our results here confirm that adjustment for the dependence improves the inference in association studies on families, as expected.
Acknowledgments
The research has been supported in part by the National Center for Research Resources at NIH grant 2G12RR003048.