Human Genetics, Volume 122, Issue 1, pp 83–94

An extension of the weighted dissimilarity test to association study in families

Authors

  • A. Yuan
    • National Human Genome Center, Department of Community Health and Family Medicine, Howard University
  • Qingqi Yue
    • National Human Genome Center, Department of Community Health and Family Medicine, Howard University
  • Victor Apprey
    • National Human Genome Center, Department of Community Health and Family Medicine, Howard University
  • George Bonney
    • National Human Genome Center, Department of Community Health and Family Medicine, Howard University
Original Investigation

DOI: 10.1007/s00439-007-0376-5

Cite this article as:
Yuan, A., Yue, Q., Apprey, V. et al. Hum Genet (2007) 122: 83. doi:10.1007/s00439-007-0376-5

Abstract

Association studies for complex diseases based on pedigree haplotype or genotype data have received increasing attention in the last few years. Similarity tests are appealing for these studies because they take the DNA structure into account, but they have blind areas in which significant association cannot be detected. Recently, we developed a dissimilarity method for this problem based on independent haplotype data, which eliminates the blind areas of the existing methods. DNA collected on families is common in practice, and the data are in either genotype or haplotype form. Here we extend our method for association studies to data on families. It can be used to evaluate different designs in terms of power. Simulation studies confirmed that the extended method improves the type I error rate and power. Applying this method to the Genetic Analysis Workshop 14 alcoholism data, we find that markers rs716581, rs1017418, rs1332184 and rs1943418 on chromosomes 1, 2, 9 and 18 yield strong signals (with P value 0.001 or lower) for association with alcoholism. Our work can serve as a guide in the design of association studies in families.

Introduction

Gene sequence analysis for complex diseases has been getting increasing attention, largely due to the new technologies for genome scans. With the information from the International HapMap Project, haplotype-based association studies will continue to be one of the key methods for finding genes underlying complex diseases (Daly et al. 2001; Gabriel et al. 2002). Haplotype data are usually not directly available and have to be determined from genotype data, so analysis based on genotype data is more convenient and provides complementary insight. As some diseases are caused by gene mutations, the causal genes are often in tight linkage with flanking markers surrounding them (Jorde 2000); thus the composition of the haplotype blocks in the affected population (cases) should be different from that in the unaffected population (controls). A systematic case-control analysis of common haplotype or genotype variants in the human genome can reveal the major genetic origins of diseases.

To identify disease susceptibility genes, there are many statistical procedures for association studies. For independent data, the commonly used nonparametric methods are variations of Pearson’s χ2 statistic. Methods which take the DNA structure in the haplotype data into account are especially appealing, such as the similarity test (Tzeng et al. 2003; Schaid et al. 2005), which has the form \(\hat{p}'S\hat{p}-\hat{q}'S\hat{q},\) where S is a matrix of similarity measures among all the different haplotypes under study, and \(\hat{p}\,\hbox{and}\,\hat{q}\) are the observed case and control frequencies of these haplotypes. This type of test statistic indicates association of the haplotypes with the disease if the above difference of similarities deviates significantly from zero. Such tests have blind areas in which significant differences cannot be detected. These areas are the set of case and control frequencies {(p,q): p′Sp − q′Sq = 0, p ≠ q}. Recently we proposed a nonparametric weighted dissimilarity test for this problem (Yuan et al. 2006), which eliminates such blind areas and enhances the power to detect differences. The original method is for independent haplotypes. As the independent case-control design may suffer from population stratification (Hutchison et al. 2004) and population admixture confounding (Lander and Schork 1994), in practice many DNA sequence data are collected on families, and the data usually are in either genotype or haplotype form. A basic question is whether, and to what extent, the dependence among family members affects the results of an association study that uses methods for individual data. To answer this question, we extend our method to families. There are some corresponding methods which account for dependence in families (for example, Van der Meulen and te Meerman 1997; Bourgain et al. 2000; Allen and Satten 2007), but they do not have the correction for blind areas. The other family based association methods, such as the transmission disequilibrium test and its variations (Field et al. 1986; Terwilliger and Ott 1992; Spielman et al. 1993; Whittemore et al. 2005), usually require more complicated data information or a different data type, and do not take into consideration genetic distances among the haplotype structures. Here we extend our method to encompass both the genotype and haplotype data collected on families. In addition, by analyzing the powers of different designs with the same sample size, this method can be used as a guide in selecting a design to achieve greater power and to avoid designs with low efficiency.

Methods

The method is for both haplotype and genotype data. Let x1, ..., xm and y1, ..., yn be the observed haplotypes (genotypes) for the affected and unaffected individuals at a haplotype block (gene locus) with k different haplotypes (genotypes) H = (h1, ..., hk). Let p = (p1, ..., pk)′ and q = (q1, ..., qk)′ be the case and control population frequencies. The goal can be simply formulated as the test of the hypothesis H0: p = q vs. H1: p ≠ q. For this problem, the traditional similarity methods may have some blind areas, in which p ≠ q but the difference is difficult for them to detect. Recently we (Yuan et al. 2006) proposed a dissimilarity test that eliminates any possible blind areas and is powerful.

For H0: p = q versus H1: p ≠ q, the unweighted test statistic has the form
$$ \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n\{\hbox{Dissimilarity between observations in case }(i)\hbox{ and control }(j)\}. $$
As with the similarity test, however, this type of statistic has blind areas, in which significant differences between the case and control groups cannot be detected. To overcome this problem, we use the weighted version of the above statistic as in Yuan et al. (2006), which has the form
$$ \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \{\hbox{Weight}(i,j)\} \{\hbox{Dissimilarity between observations in case }(i)\hbox{ and control }(j)\}. $$
(1)
We define the dissimilarity as a function D(·,·) for any two haplotypes (genotypes), which satisfies
$$ D({\bf x},{\bf x}) = 0, \quad D({\bf x},{\bf y}) = D({\bf y},{\bf x}) > 0 \,({\bf x} \neq {\bf y}). $$
(2)
It measures the difference between two observed haplotypes (genotypes). For haplotype data, dissimilarity measures such as those in Yuan et al. (2006) can be used. For genotype data we can define the dissimilarity as D(hi,hj) = 2 − IBS(hi,hj), where IBS(hi,hj) is the number of alleles identical by state between hi and hj. Since 0 ≤ IBS(hi,hj) ≤ 2 measures the similarity between hi and hj, 2 − IBS(hi,hj) is a dissimilarity between them. Some alternative definitions of dissimilarity are D(hi,hj) = 2 − IBD(hi,hj), where IBD denotes identity by descent, and \(D({\bf h}_i,{\bf h}_j) = 1 - \hbox{corr}^2({\bf h}_i,{\bf h}_j),\) where −1 ≤ corr(hi,hj) ≤ 1 is the correlation between hi and hj.
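As a small illustration of the IBS-based genotype dissimilarity D = 2 − IBS, the following sketch counts shared alleles between two unordered genotypes. The function names and the tuple encoding of genotypes are assumptions made for this example only.

```python
from collections import Counter

def ibs(g1, g2):
    """Number of alleles (0, 1 or 2) shared identical by state.

    g1, g2 are unordered genotypes given as pairs of allele labels,
    e.g. ("A", "a"). This encoding is an assumption of the sketch.
    """
    # Multiset intersection keeps the minimum count of each allele.
    shared = Counter(g1) & Counter(g2)
    return sum(shared.values())

def genotype_dissimilarity(g1, g2):
    """D(g1, g2) = 2 - IBS(g1, g2)."""
    return 2 - ibs(g1, g2)
```

With this definition D(x, x) = 0 and D(x, y) = D(y, x) > 0 for distinct genotypes at a locus, as required by (2).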
The weights should be chosen such that the weighted dissimilarity vanishes only if the two observations are identical, and such that they are easy to use. We find that the hyperbolic weights wij given in Yuan et al. (2006) satisfy these conditions, and simulation studies indicate that they outperform other choices. They are
$$ \hbox{Weight}(i,j)= \frac{{\rm e}^{p_i-q_i}+{\rm e}^{q_i-p_i}}{2} \frac{{\rm e}^{p_j-q_j}+{\rm e}^{q_j-p_j}}{2}, \quad (1\leq i,j \leq k). $$
(3)
Note that the weight in (3) is well defined for any (pi,pj; qi,qj) and bounded above. Also Weight(i,j) ≥  1 with “ = ” if and only if (pi,pj) = (qi,qj).
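As a brief check of these two properties (a worked remark, not part of the original derivation): each factor in (3) is a hyperbolic cosine, and since all frequencies lie in [0, 1] we have |pi − qi| ≤ 1, so
$$ \hbox{Weight}(i,j) = \cosh(p_i-q_i)\cosh(p_j-q_j), \quad 1 \leq \hbox{Weight}(i,j) \leq \cosh^2(1) \approx 2.38. $$
The lower bound is attained exactly when pi = qi and pj = qj, because cosh(x) = 1 only at x = 0.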
Now using the weights (3), we define the weighted dissimilarity index as
$$ D_{p,q}({\bf x},{\bf y}) = \frac{{\rm e}^{p_i-q_i}+{\rm e}^{q_i-p_i}}{2} \frac{{\rm e}^{p_j-q_j}+{\rm e}^{q_j-p_j}}{2}D({\bf h}_i,{\bf h_j}), \quad\hbox{if}\, ({\bf x},{\bf y}) = ({\bf h}_i,{\bf h}_j) $$
(4)
One can see that Dp,q(·,·) satisfies the conditions in (2). With this weighted dissimilarity, we have
$$ \begin{aligned} E_{p,q}(D_{p,q}({\bf x},{\bf y})) & = \sum_{i=1}^k \sum_{j=1}^k\frac{{\rm e}^{p_i-q_i}+{\rm e}^{q_i-p_i}} {2}\frac{{\rm e}^{p_j-q_j}+{\rm e}^{q_j-p_j}}{2} D({\bf h}_i,{\bf h_j})p_iq_j\\ &\geq \sum_{i=1}^k\sum_{j=1}^k D({\bf h}_i,{\bf h_j})p_iq_j\\&= E_{p,q}(D({\bf x},{\bf y})) =: \mu_{p,q}, \quad \hbox{with}\,``=\hbox{"} \, \hbox{if\, and\, only\, if}\,\, p=q \end{aligned} $$
and hence it leaves no blind areas. This dissimilarity formula is characterized by the genetic distance between pairs of haplotype (genotype) sequences, and the two sets of haplotype (genotype) frequencies in the case and control groups.
Now using the construction (1) and the weighted dissimilarity index (4), we have the weighted average dissimilarity index
$$ U_{m,n} = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n D_{p,q} ({\bf x}_i,{\bf y}_j). $$
Note that there are only k different haplotypes (genotypes) h1, ..., hk, so we have only \(k^2\) different weights in (3). We have m observed case haplotypes (genotypes) xi, each of which is a realization of one of the hj’s; similarly, we have n observed control haplotypes (genotypes) yi, each of which is a realization of one of the hj’s. So we have a total of mn observed dissimilarity indices above.
Since Dp,q(·,·) involves the unknown p and q, we plug in their empirical estimates \(\hat{p}\, \hbox{and}\, \hat{q},\) and define
$$ \hat{U}_{m,n} = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n D_{\hat{p}, \hat{q}}({\bf x}_i,{\bf y}_j) = \hat{p}'D_{\hat{p},\hat{q}}\hat{q}, $$
where \(D_{\hat{p},\hat{q}} = (D_{\hat{p},\hat{q}}({\bf h}_i, {\bf h}_j))_{1\leq i,j \leq k}\) is the matrix for the weighted dissimilarity measure. Let \(\check{p}_i = \min\{\hat{p}_i,\hat{q}_i\}\quad(i=1, \ldots,k), \check{p}=(\check{p}_1, \ldots, \check{p}_k)'\) and N = m + n. We define our weighted dissimilarity test statistic to be
$$ Z = \sqrt{N}(\hat{U}_{m,n}-\mu_{\check{p},\check{p}}). $$
(5)
Here \(\mu_{\check{p},\check{p}}\) is μp,q with (p,q) replaced by \((\check{p},\check{p}),\) where μp,q is the expected unweighted dissimilarity defined in the display following (4), and we should distinguish the notations D(·,·) and Dp,q(·,·). Note that the construction of the test statistic Z is basically the same as for the independent data case. The dependence relationships among the relative pairs are incorporated in the asymptotic variance of Z, as given in Propositions 1 and 2 below for haplotype data and genotype data, respectively.
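The computation of the plug-in statistic and of Z in (5) can be sketched as follows. This is a minimal illustration, assuming the estimated frequencies and the k × k matrix of unweighted dissimilarities are already available; the function name and the example matrix `D_geno` (genotypes AA, Aa, aa with D = 2 − IBS) are assumptions of the sketch.

```python
import numpy as np

def weighted_dissimilarity_stat(p_hat, q_hat, D, m, n):
    """Return (U_hat, mu_check, Z) following the definitions around (5).

    p_hat, q_hat : estimated case / control frequencies (length k)
    D            : k x k symmetric matrix of unweighted D(h_i, h_j)
    m, n         : numbers of case and control observations
    """
    p_hat = np.asarray(p_hat, dtype=float)
    q_hat = np.asarray(q_hat, dtype=float)
    D = np.asarray(D, dtype=float)
    c = np.cosh(p_hat - q_hat)            # (e^{p-q} + e^{q-p}) / 2 per type
    D_weighted = np.outer(c, c) * D       # entries of the weighted matrix
    U_hat = p_hat @ D_weighted @ q_hat    # p_hat' D_{p_hat,q_hat} q_hat
    p_check = np.minimum(p_hat, q_hat)    # componentwise minimum
    mu_check = p_check @ D @ p_check      # unweighted mean at (p_check, p_check)
    Z = np.sqrt(m + n) * (U_hat - mu_check)
    return U_hat, mu_check, Z

# Illustrative dissimilarity matrix for genotypes (AA, Aa, aa), D = 2 - IBS.
D_geno = np.array([[0, 1, 2],
                   [1, 0, 1],
                   [2, 1, 0]])
```

When the two estimated frequency vectors coincide, all weights equal 1 and Z is zero; any difference between them makes Z positive.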

By the large sample theory of U-statistics, under H0, \(U_{m,n} \buildrel{a.s.}\over {\rightarrow} \mu_{p,p},\) and under \(H_1, U_{m,n} \buildrel{a.s.} \over {\rightarrow} \mu_{p,q} > \mu_{p,p},\) where a.s. stands for “almost surely”. So under H0, Z tends to take moderate values, and under H1, Z tends to be larger. Thus for a given 0 < α < 1, the rejection rule for the level α test of H0 has the form Z ≥ Z(α), where Z(α) is the upper αth quantile of the asymptotic null distribution of (5); equivalently, H0 is rejected if Z has a P value less than α. In the following, we give the asymptotic distribution of our test statistic and provide procedures for using the test. Unlike the independent data case, here we suppose there are r different types of relative pairs in the data; for example, in a three-generation pedigree we have three types of relative pairs: (grandparent, grandchild), (parent, child) and (sib, sib). Let mj and nj be the numbers of concordant pairs of type j among the cases and controls, and oj the number of type j discordant pairs (j = 1, ..., r). Note that if there are b sibs in a given family, the number of different sib pairs is \(C_b^2 = b(b-1)/2,\) so some of the mj’s (or nj’s) may be larger than m (or n).

To describe the dependence among relative pairs, we need the kinship coefficients. For this we need to distinguish the case for haplotype data and that for genotype data.

Haplotype data. To deal with the dependencies among relative pairs, we use the notion of kinship coefficients, which are formally defined for genotype data in the next subsection. Here we define them for the haplotypes among relative pairs. Suppose there are r different types of relative pairs in the data. Let Δ′8,l be the probability of sharing one allele identical-by-descent (IBD) at a given locus for two type l relative haplotypes, and Δ′9,l the probability of sharing zero (l = 1, ..., r). We assume phase unknown, as is usual in practice. Table 1 gives their values for some common types of relative pairs, under Mendelian inheritance, phase unknown and linkage equilibrium among blocks.
Table 1

Haplotype kinship coefficients for some relative pairs

| Relationship | Δ′8 | Δ′9 |
|---|---|---|
| Grand parent–offspring | 1/8 | 7/8 |
| Parent–offspring | 1/4 | 3/4 |
| Half siblings | 1/8 | 7/8 |
| Full siblings | 1/4 | 3/4 |
| First cousins | 1/32 | 31/32 |
| Double first cousins (a) | 1/16 | 15/16 |
| Second cousins | 1/256 | 255/256 |
| Uncle–nephew | 1/16 | 15/16 |

(a) If a pair of siblings from one family each form a couple with one of a pair of siblings from another family, then the children of these two couples will be double first cousins to one another

Let μp,q = Ep,q(D(x,y)), and let \(\buildrel{D} \over {\rightarrow}\) stand for convergence in distribution. Modifying the proof of our original proposition for independent observations, we have the following result (see the Appendix).

Proposition 1

Assume \(\lim_N N/m = \gamma_1 < \infty,\) \(\lim_N N/n = \gamma_2 < \infty,\) \(\lim_m m_j/m = \lambda_{11,j},\) \(\lim_n n_j/n = \lambda_{22,j}\) and \(\lim_N o_j/\sqrt{mn}=\lambda_{12,j}\) (j = 1, ..., r). Then as N → ∞,

(i) Under H0,
$$ Z = \sqrt{N}(\hat{U}_{m,n}-\mu_{\check{p},\check{p}}) \buildrel{D} \over {\rightarrow} p'D|V|, $$
where D = (D(hi,hj))k × k, V = (v1, ..., vk)′ ∼ N(0,R), |V| = (|v1|, ..., |vk|)′, R = (rij)k × k, with
$$ r_{ij} = \left\{\begin{array}{ll}p_i(1-p_i)[\gamma_1+\gamma_2+2 \sum_{l=1}^r(\gamma_1\lambda_{11,l}-\sqrt{\gamma_1\gamma_2}\lambda_{12,l} +\gamma_2\lambda_{22,l})\Delta'_{8,l}], & \hbox{if}\ i=j,\\ -p_ip_j [\gamma_1+\gamma_2+2\sum_{l=1}^r(\gamma_1\lambda_{11,l}-\sqrt{\gamma_1 \gamma_2}\lambda_{12,l}+\gamma_2\lambda_{22,l})\Delta'_{8,l}], & \hbox{else}. \end{array}\right. $$

(ii) Under H1,
$$ \sqrt{N}(\hat{U}_{m,n}-\mu_{\check{p},\check{p}}-\delta)/\sigma \buildrel{D} \over {\rightarrow} N(0,1), $$
where \(\delta = p'Dq-\check{p}'D\check{p} > 0,\) \(\sigma^2 = \gamma_1 (a'+q'D_{p,q})S(a+D_{p,q}q)+\gamma_2(b'+p'D_{p,q})Q(b+D_{p,q}p)-2 \sqrt{\gamma_1\gamma_2}(a'+q'D_{p,q})T(b+D_{p,q}p),\) \(S = (s_{ij})_{k\times k},\) with
$$ s_{ij} = \left\{\begin{array}{ll} p_i(1-p_i)(1+2\sum_{l=1}^r \lambda_{11,l}\Delta'_{8,l}), & \hbox{if}\ i=j,\\ -p_ip_j(1+2 \sum_{l=1}^r\lambda_{11,l}\Delta'_{8,l}), & \hbox{else}. \end{array}\right. $$
Q is S with p replaced by q and the λ11,l’s by the λ22,l’s, T = (tij)k × k, with
$$ t_{ij} = \left\{\begin{array}{ll}\sum_{l=1}^r\lambda_{12,l} (p_iq_i(1-\bar{p}_i)+(1-p_i)(1-q_i)\bar{p}_i)\Delta'_{8,l}, & \hbox{if}\ i=j,\\ -\sum_{l=1}^r\lambda_{12,l}(p_i(1-q_j)\bar{p}_j+p_j(1-q_i) \bar{p}_i)\Delta'_{8,l}, & \hbox{else},\end{array}\right. $$
a = (a1, ..., ak)′, b = (b1, ..., bk)′,
$$ \begin{aligned} a_i &= p_i \frac{{\rm e}^{p_i-q_i}-{\rm e}^{q_i-p_i}}{2}\sum_{j=1}^k \frac{{\rm e}^{p_j-q_j}+{\rm e}^{q_j-p_j}}{2}q_jD({\bf h}_i,{\bf h}_j),\quad(i=1, \ldots, k)\\ b_i &= q_i \frac{{\rm e}^{q_i-p_i}-{\rm e}^{p_i-q_i}}{2}\sum_{j=1}^k \frac{{\rm e}^{q_j-p_j}+{\rm e}^{p_j-q_j}}{2}p_jD({\bf h}_i,{\bf h}_j), \quad(i=1, \ldots, k). \end{aligned} $$
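The null covariance matrix R in part (i) of Proposition 1 is a scalar multiple of diag(p) − pp′, which makes it simple to assemble. The sketch below assumes the limiting quantities γ1, γ2, λ11,l, λ12,l, λ22,l and the haplotype coefficients Δ′8,l are supplied by the user; the function and argument names are hypothetical.

```python
import numpy as np

def null_covariance_haplotype(p, gamma1, gamma2, lam11, lam12, lam22, delta8):
    """R = (r_ij) under H0 for haplotype data, Proposition 1 (i).

    p                    : common haplotype frequencies (length k)
    gamma1, gamma2       : limits of N/m and N/n
    lam11, lam12, lam22  : length-r arrays of pair-type proportions
    delta8               : length-r array of one-allele IBD probabilities
    """
    p = np.asarray(p, dtype=float)
    lam11, lam12, lam22, delta8 = (
        np.asarray(a, dtype=float) for a in (lam11, lam12, lam22, delta8)
    )
    # Common bracketed factor shared by diagonal and off-diagonal entries.
    scale = gamma1 + gamma2 + 2.0 * np.sum(
        (gamma1 * lam11 - np.sqrt(gamma1 * gamma2) * lam12 + gamma2 * lam22)
        * delta8
    )
    # diag(p) - p p' has p_i (1 - p_i) on the diagonal and -p_i p_j elsewhere.
    return scale * (np.diag(p) - np.outer(p, p))
```

With no relative pairs (all λ equal to zero) the factor reduces to γ1 + γ2, the independent-data case.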
     
Genotype data. We now consider the genotype data case. The hi’s denote the genotypes, and the pi’s and qi’s stand for genotype frequencies. For these data, let IBS(hi,hj) be the number of alleles identical by state between genotypes hi and hj. We define the dissimilarity D(hi,hj) as
$$ D({\bf h}_i,{\bf h}_j) = 2-IBS({\bf h}_i,{\bf h}_j). $$
Although we are using the sample IBS’s, the calculation of the asymptotic variance of (5) requires the theoretical IBD’s, as in the case of haplotype data. Let Δ7ij, Δ8ij, Δ9ij be the condensed kinship coefficients between individuals i and j. The Δkij’s (k = 1, ..., 9) are the probabilities of the nine possible condensed IBD states as classified by Jacquard (1974), of which Δ7ij, Δ8ij and Δ9ij are commonly used in practice. They are the population probabilities that individuals (i,j) share 2, 1 and 0 alleles IBD; they depend only on the kinship relationship of (i,j) under Mendelian inheritance, not on their particular genotypes. Also, 2Φij is the expected proportion of genes IBD for individuals (i,j) at this locus.
Here we list some of their values under linkage equilibrium, as in Lange (1997) for example (Table 2).
Table 2

Genotype kinship coefficients for some relative pairs

| Relationship | Δ7 | Δ8 | Δ9 | Φ |
|---|---|---|---|---|
| Grand parent–offspring | 0 | 1/2 | 1/2 | 1/8 |
| Parent–offspring | 0 | 1 | 0 | 1/4 |
| Half siblings | 0 | 1/2 | 1/2 | 1/8 |
| Full siblings | 1/4 | 1/2 | 1/4 | 1/4 |
| First cousins | 0 | 1/4 | 3/4 | 1/16 |
| Double first cousins | 1/16 | 6/16 | 9/16 | 1/8 |
| Second cousins | 0 | 1/16 | 15/16 | 1/64 |
| Uncle–nephew | 0 | 1/2 | 1/2 | 1/8 |

Let Δ7,j and Δ8,j be the corresponding kinship coefficients for type j relative pairs. Assume there are h different alleles aj (j = 1, ..., h), so k = h(h + 1)/2. We prove the following Proposition in the Appendix.

Proposition 2

Assume \(\lim_N N/m = \gamma_1 < \infty,\) \(\lim_N N/n = \gamma_2 < \infty,\) \(\lim_m m_j/m = \lambda_{11,j},\) \(\lim_n n_j/n = \lambda_{22,j}\) and \(\lim_N o_j/\sqrt{nm} = \lambda_{12,j}\) (j = 1, ..., r). Then as N → ∞,

(i) Under H0,
$$ Z = \sqrt{N}(\hat{U}_{m,n}-\mu_{\check{p},\check{p}}) \buildrel{D} \over {\rightarrow} p'D|V|, $$
where D = (D(hi,hj))k × k, V = (v1, ..., vk)′ ∼ N(0,R), |V| = (|v1|, ..., |vk|)′, R = (rij)k × k, with
$$ r_{ij} = \left\{\begin{array}{ll}p_i(1-p_i)(\gamma_1+\gamma_2)+ 2\sum_{l=1}^r(\gamma_1\lambda_{11,l}-\sqrt{\gamma_1\gamma_2}\lambda_{12,l}+\gamma_2\lambda_{22,l})(p_i(1-p_i)\Delta_{7,l}+\theta_{ii}\Delta_{8,l}), & \hbox{if}\ i=j;\\ -p_ip_j(\gamma_1+\gamma_2)+2\sum_{l=1}^r(\gamma_1 \lambda_{11,l}-\sqrt{\gamma_1\gamma_2}\lambda_{12,l}+\gamma_2 \lambda_{22,l})(-p_ip_j\Delta_{7,l} +\theta_{ij}\Delta_{8,l}), & \hbox{else}, \end{array}\right. $$
where, if genotype \(i = a_sa_t\) (s ≤ t) and genotype \(j = a_ua_v\) (u ≤ v), then θij = −pipj + α (i,j = 1, ..., k), with
$$ \alpha=\left\{\begin{array}{ll}r_tr_vr_s, & \hbox{if}\ s=u, t\neq v; \\ r_tr_ur_s, & \hbox{if}\ s=v, t\neq u;\\ r_sr_ur_t, & \hbox{if}\ t=v, s\neq u; \\ r_sr_vr_t, & \hbox{if}\ t=u, s\neq v;\\ (r_sr_t^2+r_s^2r_t)/2, & \hbox{if}\ s=u, t=v;\\ 0, & \hbox{else},\end{array}\right. $$
and \(r_l=P(a_la_l)+\sum_{j\neq l}^hP(a_la_j)/2\) is the frequency of allele \(a_l\) in the total population.

(ii) Under H1,
$$ \sqrt{N}(\hat{U}_{m,n}-\mu_{\check{p},\check{p}}-\delta)/\sigma \buildrel{D} \over {\rightarrow} N(0,1), $$
where \(\delta = p'Dq-\check{p}'D\check{p} > 0,\) \(\sigma^2 = \gamma_1 (a'+q'D_{p,q})S(a+D_{p,q}q)+\gamma_2(b'+p'D_{p,q})Q(b+D_{p,q}p)- \sqrt{\gamma_1\gamma_2}(a'+q'D_{p,q})T(b+D_{p,q}p),\) \(S = (s_{ij})_{k\times k},\) with
$$ s_{ij} = \left\{\begin{array}{ll}p_i(1-p_i)+2\sum_{l=1}^r\lambda_{11,l} (p_i(1-p_i)\Delta_{7,l}+\theta_{ii}\Delta_{8,l}), & \hbox{if}\ i=j,\\ -p_ip_j+2\sum_{l=1}^r\lambda_{11,l}(-p_ip_j\Delta_{7,l}+\theta_{ij} \Delta_{8,l}), & \hbox{else},\end{array}\right. $$
Q is S with p replaced by q and the λ11,l’s by the λ22,l’s, T = (tij)k × k, with
$$ t_{ij} = \left\{\begin{array}{ll}\sum_{l=1}^r\lambda_{12,l} ([p_iq_i(1-\bar{p}_i)+(1-p_i)(1-q_i)\bar{p}_i]\Delta_{7,l}+\theta'_{ii} \Delta_{8,l}), & \hbox{if}\ i=j,\\ \sum_{l=1}^r\lambda_{12,l}(-[p_i(1-q_j) \bar{p}_j+p_j(1-q_i)\bar{p}_i]\Delta_{7,l}+\theta'_{ij} \Delta_{8,l}), & \hbox{else},\end{array}\right. $$
where the \(\bar{p}_i\)’s are the genotype frequencies in the total population, \(\theta'_{ij} = p_iq_j+\sum_{l=1}^h(p_{i|a_l}q_{j|a_l}-q_jp_{i|a_l}- p_iq_{j|a_l})r_l,\) the \(p_{i|a_l}\)’s and \(q_{j|a_l}\)’s are the conditional frequencies of genotypes given allele \(a_l\) in the case and control populations, and a and b are the same as in Proposition 1.

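Note: If we adopt the dictionary representation of genotypes (11, 12, 13, ..., 1h, 22, 23, ..., 2h, ..., hh) for a locus with h alleles, then genotype i (i = 1, ..., h(h + 1)/2) can be uniquely written as i = (st), where s is determined by (s−1)h − (s−2)(s−1)/2 < i ≤ sh − (s−1)s/2 and t = (s−1) + i − [(s−1)h − (s−2)(s−1)/2]. For example, with h = 3 and i = 4 we have 3 < 4 ≤ 5, so s = 2 and t = 1 + 4 − 3 = 2, which is genotype 22.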
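The dictionary indexing in the Note can be checked with a short sketch. Here the quantity i − [(s−1)h − (s−2)(s−1)/2] is the position of genotype i within the block of genotypes whose first allele is s, and the second allele is that position plus s − 1; the function name is hypothetical.

```python
def genotype_from_index(i, h):
    """Map a 1-based index i in the order 11, 12, ..., 1h, 22, ..., hh
    to the allele pair (s, t) with s <= t, for a locus with h alleles."""
    if not 1 <= i <= h * (h + 1) // 2:
        raise ValueError("index out of range for h alleles")
    for s in range(1, h + 1):
        lower = (s - 1) * h - (s - 2) * (s - 1) // 2   # last index before block s
        upper = s * h - (s - 1) * s // 2               # last index of block s
        if lower < i <= upper:
            offset = i - lower                         # position within block s
            return s, offset + (s - 1)
    raise AssertionError("unreachable for a valid index")
```

For h = 3 the indices 1 to 6 map to 11, 12, 13, 22, 23, 33 in that order.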

From the propositions we see that the asymptotic distribution of our statistic Z given in (5) is a linear combination of the absolute values of normal random variables under the null, and it shifts strictly to the right under the alternative. Based on the first part of the proposition, we can determine our test P value for a given genetic distance and two sets of haplotype frequencies. From the second part of the proposition, we can calculate the power and sample sizes. Now we specify some details in calculating the P values, power and sample sizes.

The following corollaries hold for both propositions, and we use the same general notation for the two cases. With the same notation as in the propositions, we can write V as \(V = R^{1/2}U,\) where U ∼ N(0, Ik). Since R is positive semidefinite, there is an orthogonal matrix Q such that R = QΛQ′ and \(R^{1/2} = Q\Lambda^{1/2},\) where Λ = diag(λ1, ..., λk) and the λi’s (≥ 0) are the eigenvalues of R. Given a pre-specified integer M (typically M ≥ 5,000), for each 1 ≤ i ≤ M, sample Wi ∼ N(0, Ik). Then \(V_i = Q\Lambda^{1/2}W_i\) (i = 1, ..., M) is a sample of size M from N(0,R). Let Zi = p′D|Vi| (i = 1, ..., M) and let χ(·) be the indicator function. We are now ready to state the following two corollaries:

Corollary 1

With the above notation, under H0, the P value for the statistic Z from (5) can be approximated by \(\sum_{i=1}^M \chi(Z_i\ge Z)/M.\)

For a given level α, the critical value Z(α) for (5) can be approximately determined by the (1−α)th sample quantile of Z1, ..., ZM, which is their [(1−α)M]th order statistic.
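The sampling scheme behind Corollary 1 can be sketched with NumPy as follows. This is an illustration under stated assumptions: p stands for the common frequency vector under H0 (in practice an estimate would be plugged in), R is the null covariance from the relevant proposition, and `np.quantile` is used as a convenient stand-in for the [(1−α)M]th order statistic. All function names are hypothetical.

```python
import numpy as np

def simulate_null_statistics(p, D, R, M=5000, seed=0):
    """Draw Z_i = p' D |V_i| with V_i ~ N(0, R), i = 1, ..., M."""
    p = np.asarray(p, dtype=float)
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    lam, Q = np.linalg.eigh(np.asarray(R, dtype=float))  # R = Q diag(lam) Q'
    lam = np.clip(lam, 0.0, None)      # remove tiny negative rounding errors
    root = Q * np.sqrt(lam)            # Q Lambda^{1/2} (scales the columns of Q)
    W = rng.standard_normal((M, p.size))
    V = W @ root.T                     # row i is V_i = Q Lambda^{1/2} W_i
    return np.abs(V) @ (D.T @ p)       # row i gives p' D |V_i|

def mc_p_value(z_obs, z_null):
    """Fraction of simulated null values that are >= the observed Z."""
    return float(np.mean(np.asarray(z_null) >= z_obs))

def mc_critical_value(z_null, alpha):
    """Approximate Z(alpha) by the (1 - alpha) sample quantile."""
    return float(np.quantile(z_null, 1.0 - alpha))
```

A typical use is to build R once for the design, draw the null sample, and compare the observed Z against it with `mc_p_value`.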

Corollary 2

With the above notation, under H1, for any given level α and \(\delta \in (0,\bar{\delta}),\) where \(\bar{\delta}=\max\{p'Dq-\check{p}'D\check{p}\}\) with the maximum taken over all frequency vectors p and q, the power β and sample sizes m, n are related by
$$ \beta = \beta(m,n,\alpha,\delta) = 1-\Phi\bigg(\frac{Z(\alpha)} {\hat{\sigma}}-\sqrt{N}\frac{\delta}{\hat{\sigma}}\bigg), $$
where Φ(·) is the standard normal distribution function.
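The power relation in Corollary 2 is a one-line computation once Z(α), δ and an estimate of σ are available. The sketch below treats these three quantities as given inputs (obtaining them requires the propositions and Corollary 1); the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def asymptotic_power(m, n, z_alpha, delta, sigma):
    """beta = 1 - Phi(Z(alpha)/sigma - sqrt(N) * delta / sigma), N = m + n."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    N = m + n
    return 1.0 - NormalDist().cdf((z_alpha - sqrt(N) * delta) / sigma)
```

For fixed Z(α), δ > 0 and σ, the power increases with N; note that σ itself depends on the design through γ1, γ2 and the λ’s, so it should be recomputed when comparing designs.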

Simulation study

We simulated both genotype and haplotype data. We describe the simulation for genotype data in detail; that for haplotype data is similar.

We generated data on a disease marker and a normal marker to study the power and type I error rate. We briefly describe the sampling here. We first generate the genotypes of the parents independently based on a given set of frequencies, and those of the sibs are generated under Mendelian inheritance. The disease status is determined by the penetrance and the genotype at the disease locus. The data are selected from the simulated population based on a given design. For each marker, 10,000 samples are drawn to simulate the asymptotic null distribution, and 500 replications are made to compute the P values, type I errors and powers of the test statistics at several significance levels α.
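The genotype-generation step described above can be sketched as follows for a biallelic marker. This is only an illustration of Mendelian transmission, not the simulation code used for the results: the coding of genotypes as the number of copies (0, 1, 2) of a reference allele, the ordering of the frequency vector, and the function name are all assumptions, and the disease-status step is omitted.

```python
import numpy as np

def simulate_sib_genotypes(geno_freq, n_families, n_sibs=2, seed=0):
    """Simulate sib genotypes at one biallelic marker.

    geno_freq : frequencies of parental genotypes coded as 0, 1 or 2
                copies of the reference allele (assumed ordering)
    Returns an (n_families, n_sibs) integer array of sib genotypes.
    """
    rng = np.random.default_rng(seed)
    # Two parents per family, drawn independently from geno_freq.
    parents = rng.choice(3, size=(n_families, 2), p=geno_freq)
    # A parent with g copies transmits the reference allele with prob. g / 2,
    # independently for each child (Mendelian inheritance).
    prob = parents[:, :, None] / 2.0                  # (family, parent, 1)
    transmitted = rng.random((n_families, 2, n_sibs)) < prob
    # Child genotype = allele from parent 1 + allele from parent 2.
    return transmitted.sum(axis=1)
```

For example, `simulate_sib_genotypes([0.25, 0.5, 0.25], 1000)` returns 1,000 sib pairs whose genotypes are correlated through their shared parents.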

To evaluate the performance of the extended method and the behavior of different designs, we simulated six data sets with different combinations of designs and penetrances, all with the same total sample size. For each significance level α = 0.05 and 0.01, we compare the results of (I) the method of Tzeng et al. (2003), (II) the traditional chi-squared test, (III) our method (Yuan et al. 2006) for independent observations, (IV) our extended method for familial data, and (V) a family based method implemented in the software FBAT (Horvath et al. 2007), which is a multi-marker extension of the family based method of Laird et al. (2000). FBAT needs a certain number of informative families, so for our simulated data it can be computed only for designs (a)–(c). Although the chi-squared method is neither family based nor uses the DNA structure, we include it because of its popularity.

To be clear, we list the six designs in Table 3.
Table 3

Designs used in the simulation

| Code | Design name | Penetrance | m11 | n11 | o11 | m1 | n1 |
|---|---|---|---|---|---|---|---|
| (a) | Control and discordant pairs | (0.04, 0.004) | 0 | 100 | 100 | 0 | 0 |
| (b) | Control and discordant pairs | (0.10, 0.004) | 0 | 100 | 100 | 0 | 0 |
| (c) | Discordant pairs | (0.04, 0.004) | 0 | 0 | 200 | 0 | 0 |
| (d) | Singleton | (0.04, 0.004) | 0 | 0 | 0 | 200 | 200 |
| (e) | Case pairs and control singleton | (0.04, 0.004) | 100 | 0 | 0 | 0 | 200 |

m11: No. of case pairs; n11: No. of control pairs; o11: No. of discordant pairs; m1: No. of case singletons; n1: No. of control singletons

The genotypes frequencies at each marker are selected from the four sets (0.04,0.32,0.64), (0.09,0.42,0.49), (0.16,0.48,0.36) and (0.25, 0.5,0.25). They are not necessarily in Hardy-Weinberg equilibrium. Here we adopted the additive disease model, and we choose relatively low penetrances, as high penetrances will give powers very close to 1 and comparisons will not be well distinguishable.

The results for simulated genotype data for the five methods are displayed in Table 4, in which the type I error rates followed by power are shown for each design.
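Table 4

Type I error rates and powers for five methods on simulated genotype data

| Design | Quantity | α = 0.05: I | II | III | IV | V | α = 0.01: I | II | III | IV | V |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | Type I error | 0.028 | 0.021 | 0.034 | 0.046 | 0.034 | 0.009 | 0.007 | 0.012 | 0.016 | 0.004 |
| (a) | Power | 0.100 | 0.218 | 0.274 | 0.308 | 0.138 | 0.048 | 0.112 | 0.134 | 0.190 | 0.032 |
| (b) | Type I error | 0.033 | 0.018 | 0.029 | 0.038 | 0.067 | 0.007 | 0.005 | 0.007 | 0.014 | 0.015 |
| (b) | Power | 0.142 | 0.750 | 0.814 | 0.840 | 0.456 | 0.066 | 0.614 | 0.696 | 0.762 | 0.242 |
| (c) | Type I error | 0.002 | 0.001 | 0.003 | 0.007 | 0.059 | 0.001 | 0.000 | 0.001 | 0.004 | 0.013 |
| (c) | Power | 0.004 | 0.156 | 0.242 | 0.446 | 0.686 | 0.000 | 0.058 | 0.088 | 0.288 | 0.464 |
| (d) | Type I error | 0.055 | 0.044 | 0.060 | 0.060 | n/a | 0.017 | 0.020 | 0.025 | 0.025 | n/a |
| (d) | Power | 0.562 | 0.966 | 0.980 | 0.980 | n/a | 0.428 | 0.926 | 0.956 | 0.956 | n/a |
| (e) | Type I error | 0.129 | 0.145 | 0.150 | 0.077 | n/a | 0.069 | 0.076 | 0.082 | 0.032 | n/a |
| (e) | Power | 0.752 | 0.966 | 0.970 | 0.952 | n/a | 0.660 | 0.948 | 0.958 | 0.908 | n/a |
| (f) | Type I error | 0.093 | 0.101 | 0.102 | 0.062 | n/a | 0.049 | 0.044 | 0.049 | 0.026 | n/a |
| (f) | Power | 0.420 | 0.476 | 0.512 | 0.420 | n/a | 0.304 | 0.318 | 0.370 | 0.274 | n/a |
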

We see that overall the method of Tzeng et al. has the lowest power at each of the levels, while our current method with the familial dependence structure implemented has the highest power and a comparable type I error. The relationship between type I error and power among the four methods is not uniform across all the designs, and there is usually a trade-off between power and type I error. To better understand the performance of these methods when power and type I error are considered jointly, we plotted the ROC curves of methods I–IV, with the area under the curve shown for each method, as in Fig. 1. FBAT is not used in the plot as it is not applicable to all the designs. From this figure we see that in designs (a)–(d), the curves of our new method have the largest areas, while in designs (e) and (f), methods II, III and IV perform similarly, with III slightly better. In terms of population stratification, (c) is an important design, and our family based method performs significantly better. These results suggest that using our new method on family data has some overall advantage. Method I has the smallest areas, and method III performs better than method II.

Aside from its narrower scope of application, FBAT seems to perform no better than our methods and the chi-squared method in designs (a) and (b), but better in design (c). Its results also deviate substantially from those of the chi-squared method, while the results of our methods match the latter and can be viewed as reasonable adjustments. Thus, our method with blind area correction has a clear advantage over the existing nonparametric similarity test, and implementation of the familial dependence structure improves the performance of our test further. Based on our simulations, we find that the design with heavy case-discordant pairs has relatively low power, since in this case the marker is more likely not in linkage with the disease locus, and association due to causes other than linkage is not easy to detect. This design is the most robust in the case of population stratification.
https://static-content.springer.com/image/art%3A10.1007%2Fs00439-007-0376-5/MediaObjects/439_2007_376_Fig1_HTML.gif
Fig. 1

ROC curves for four methods with simulated genotype data, where numerical values are areas under the curves

Next we examine our family based method in the haplotype data case and compare it with the same methods and under the same designs as above, except that here the penetrances, in the sense of genotypes, for disease and normal haplotypes are (0.1, 0.02) for all the designs except for (b), which has penetrances (0.1,0.01). Six haplotypes are used in the simulation, as:
$$ \begin{array}{ll} {\bf h}_1=(ATTAGCTAGCTT), &\quad {\bf h}_2=(ACTATCTAGCTT)\\ {\bf h}_3=(GATATCTAGCGT), &\quad {\bf h}_4=(TATATCTAGCGT)\\ {\bf h}_5=(TATAACTAGCGT), &\quad {\bf h}_6=(TATAGCTAGCAT) \end{array} $$
with frequencies (0.15, 0.04, 0.1, 0.2, 0.25, 0.26) for all the designs. Methods I, III and IV take the DNA structure into account in the test statistic, which most other family based association methods lack, so we did not directly compare our methods with other family based methods. FBAT does not take haplotype data. The results are shown in Table 5.
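Table 5

Type I error rates and powers for four methods on simulated haplotype data

| Design | Quantity | α = 0.05: I | II | III | IV | α = 0.02: I | II | III | IV | α = 0.01: I | II | III | IV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | Type I error | 0.034 | 0.034 | 0.024 | 0.010 | 0.018 | 0.004 | 0.004 | 0.000 | 0.016 | 0.000 | 0.000 | 0.000 |
| (a) | Power | 0.330 | 0.570 | 0.610 | 0.522 | 0.178 | 0.422 | 0.448 | 0.346 | 0.100 | 0.322 | 0.330 | 0.274 |
| (b) | Type I error | 0.034 | 0.044 | 0.032 | 0.020 | 0.016 | 0.016 | 0.010 | 0.002 | 0.002 | 0.010 | 0.002 | 0.000 |
| (b) | Power | 0.404 | 0.908 | 0.912 | 0.884 | 0.202 | 0.822 | 0.846 | 0.804 | 0.110 | 0.770 | 0.788 | 0.726 |
| (c) | Type I error | 0.026 | 0.010 | 0.014 | 0.020 | 0.006 | 0.004 | 0.002 | 0.006 | 0.002 | 0.002 | 0.000 | 0.002 |
| (c) | Power | 0.194 | 0.174 | 0.262 | 0.346 | 0.102 | 0.086 | 0.118 | 0.200 | 0.058 | 0.048 | 0.078 | 0.120 |
| (d) | Type I error | 0.048 | 0.036 | 0.038 | 0.038 | 0.024 | 0.020 | 0.012 | 0.012 | 0.012 | 0.006 | 0.006 | 0.006 |
| (d) | Power | 0.814 | 0.908 | 0.926 | 0.926 | 0.708 | 0.830 | 0.876 | 0.876 | 0.612 | 0.756 | 0.800 | 0.800 |
| (e) | Type I error | 0.098 | 0.116 | 0.092 | 0.052 | 0.052 | 0.068 | 0.048 | 0.022 | 0.032 | 0.044 | 0.024 | 0.008 |
| (e) | Power | 0.920 | 0.996 | 0.998 | 0.992 | 0.858 | 0.990 | 0.992 | 0.990 | 0.792 | 0.988 | 0.990 | 0.986 |
| (f) | Type I error | 0.052 | 0.076 | 0.046 | 0.026 | 0.012 | 0.016 | 0.020 | 0.010 | 0.006 | 0.012 | 0.010 | 0.008 |
| (f) | Power | 0.934 | 1.000 | 1.000 | 1.000 | 0.882 | 0.998 | 1.000 | 0.998 | 0.822 | 0.992 | 0.998 | 0.994 |
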

We see that in the haplotype case, the powers and type I error rates for the four methods under the six designs show patterns similar to those in the genotype data case, but here our existing method, method III, performs best. The ROC curves for the four methods are shown in Fig. 2. The interpretation of the results is the same as for the genotype data.
https://static-content.springer.com/image/art%3A10.1007%2Fs00439-007-0376-5/MediaObjects/439_2007_376_Fig2_HTML.gif
Fig. 2

ROC curves for four methods with simulated haplotype data, where numerical values are areas under the curves

In this case, the curves for Methods II, III, IV are very close, indicating no clear advantage among the three methods, at least for these simulated data.

Application to the genetics of alcoholism

As an application to localizing genes that contribute to complex diseases and their risk factors, we apply our method to the analysis of the alcoholism data supplied by the Collaborative Study on the Genetics of Alcoholism (COGA) for the Genetic Analysis Workshop 14 (GAW14). The data were collected on 1,614 individuals, 788 females and 826 males, from 143 pedigrees. We focused on the Illumina markers, consisting of over 4,600 SNPs distributed evenly across the human genome. For our association analysis we used ALDX1 as the phenotype: all affected individuals were regarded as cases and all unaffected individuals as controls. The numbers of observed genes vary from marker to marker because of missing information, but the variation is relatively small. For example, the numbers of observed case pairs, control pairs and discordant pairs at each locus are around 545, 382 and 815, respectively. More details of the data can be found in Bailey-Wilson et al. (2005) and references therein. We used the methods compared above in the analysis: (a) the method of Tzeng et al. (2003), (b) the method of Yuan et al. (2006) without familial structure adjustment, and (c) the new method proposed here, which takes familial structure into account. To show the results effectively, we display in Fig. 3 only the 86 markers with -log10(P value) of 2 or larger for at least one of the methods.
https://static-content.springer.com/image/art%3A10.1007%2Fs00439-007-0376-5/MediaObjects/439_2007_376_Fig3_HTML.gif
Fig. 3

-(log-P values) for three methods for selected markers

From this figure we see that our two weighted methods yield similar results. Our previous method without family adjustment yields slightly smaller P values, while the similarity test of Tzeng et al. gives considerably larger P values at most of the markers. Since a -log10(P value) greater than 2 corresponds to a P value less than 0.01, we see that the weighted methods find quite a few significant markers at the 1% level. Markers rs716581 on chromosome 1 (location 121.1), rs1017418 on chromosome 2 (80.5), rs1332184 on chromosome 9 (41.3) and rs1943418 on chromosome 18 (80.9), corresponding to SNP loci 168, 487, 2498 and 4162, respectively, yield the strongest signals (with significance level 0.001 or lower), which suggests that these loci are highly likely to be associated with alcoholism. We also see from the figure that our weighted methods detected many loci with -log10(P value) ≥ 1.3 (significant at the 5% level) that are missed by the test of Tzeng et al., owing to the blindness property of the similarity test.
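The correspondence between the -log10 scale and the usual significance levels, and the rule used to select markers for display, can be sketched as follows. The marker labels, method labels and P values below are hypothetical placeholders, not values from the COGA analysis, and `select_markers` is an illustrative helper rather than part of any published software.

```python
import math


def neg_log10(p):
    """Return -log10(P) for a P value in (0, 1]."""
    return -math.log10(p)


def select_markers(pvalues_by_marker, threshold=2.0):
    """Keep markers whose -log10(P) reaches the threshold
    under at least one of the methods."""
    return [
        marker
        for marker, by_method in pvalues_by_marker.items()
        if any(neg_log10(p) >= threshold for p in by_method.values())
    ]


# Hypothetical P values for three methods (a), (b), (c).
pvalues = {
    "marker_A": {"a": 0.20, "b": 0.0008, "c": 0.0011},
    "marker_B": {"a": 0.35, "b": 0.040, "c": 0.045},
    "marker_C": {"a": 0.60, "b": 0.300, "c": 0.320},
}

print(round(neg_log10(0.05), 2))   # 5% level on the -log10 scale
print(select_markers(pvalues))     # markers reaching -log10(P) >= 2
```

With these placeholder values only `marker_A` passes the display threshold of 2, while `marker_B` exceeds 1.3 under two methods and would count as significant at the 5% level only.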

Tian et al. (2005) analyzed a single SNP on chromosome 4, rs1037475, in the same data, using trend tests adjusted for family correlation, and compared the result with that of the same test without such adjustment. They found that the two tests give the same significant result, but that the standard error from the adjusted method is slightly larger, leading to a slightly larger P value. They concluded that the method without family adjustment would have a larger false positive rate, and hence be anti-conservative. We found a similar pattern in our results.

Using a family-based GEE approach, Yang et al. (2005) analyzed four SNPs (rs1036475, rs1491233, rs749407, rs980972) on chromosome 4. They found that without familial adjustment, SNPs rs1036475 and rs980972 are significant with P values 0.0032 and 0.028, respectively. When familial adjustment was applied, the test became more conservative: the P values for these two SNPs are 0.025 and 0.182, respectively. This is also in keeping with our general finding.
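The mechanism noted by Tian et al. (2005), in which a larger standard error for the same estimate leads to a larger P value, can be illustrated with a two-sided normal test. This is a generic sketch under the assumption of an approximately normal test statistic. The estimate and standard errors below are hypothetical and are not taken from either cited study.

```python
import math


def two_sided_p(estimate, std_error):
    """Two-sided P value for z = estimate / std_error under a
    standard normal reference: P = erfc(|z| / sqrt(2))."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical effect estimate with and without family adjustment.
estimate = 0.30
se_unadjusted = 0.10   # ignores within-family correlation
se_adjusted = 0.12     # accounts for within-family correlation

p_unadjusted = two_sided_p(estimate, se_unadjusted)
p_adjusted = two_sided_p(estimate, se_adjusted)
print(p_unadjusted, p_adjusted)
```

Because the adjusted standard error is larger, the adjusted P value is larger, so the adjusted test rejects less often at a fixed level.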

Concluding remarks

We extended our nonparametric dissimilarity test for case-control association studies to family data, for both haplotype and genotype data. This test has no blind areas, a low type I error rate and high power. Simulation studies show that this method improves inference over the existing methods.

In the five designs investigated in our simulation, the case-pair and control-pair design and the case-and-control singleton design have higher power. The case-control discordant pair design has lower power. There seems to be a trade-off between accuracy and power. To eliminate confounding from population stratification and/or admixture, one should generally use the discordant pair design, although it has lower power. In our simulated example, the ROC curves show that our family-based method has an advantage over the other methods. When only a homogeneous population is involved in the study, the case-pair and control-pair design and the case-and-control singleton design are preferred for their higher power.

Based on our simulation studies, the family-based method has some limited advantage over the individual-based one.

We note for further research that under linkage equilibrium, the kinship coefficients are given in Tables 1 and 2 for haplotypes and genotypes. When this condition is not assumed, Δ′8 for haplotype sharing and Δ7 for genotype sharing will tend to be larger than the values given in these two tables. Their actual values can be estimated from the data, although the values from the two tables can still be used as approximations.

For case-control association analysis using genotype data alone, our simulation studies indicate that designs using either singleton cases vs singleton controls or concordant pairs are preferred in terms of power. Two designs are shown to be inferior: control pairs with discordant pairs, and the discordant-pairs-only design. Although the singleton cases vs singleton controls design appears to have greater power, it is known to be sensitive to population stratification.

In conclusion, our results here confirm that adjustment for the dependence improves the inference in association studies on families, as expected.

Acknowledgments

The research has been supported in part by the National Center for Research Resources at NIH grant 2G12RR003048.

Copyright information

© Springer-Verlag 2007