Detecting disease gene in DNA haplotype sequences by nonparametric dissimilarity test

  • Original Investigation
Human Genetics

Abstract

Association studies for complex diseases based on haplotype data have received increasing attention in the last few years. A commonly used nonparametric method, which takes haplotype structure into consideration, is to use the U-statistic to compare the similarities between genetic compositions in the case and control populations. Although the method and its variants are convenient to use in practice, there are some areas where the tests cannot detect even large differences between cases and controls. To overcome this problem and enhance the power, we propose a new form of the weighted U-statistic, which directly compares the dissimilarity between the haplotype structures in the case and control populations. We show that this test statistic is asymptotically a linear combination of the absolute values of normal random variables under the null hypothesis, and shifts strictly toward the right under the alternative, and therefore has no blind areas of detection. Simulation studies indicate that our test statistic overcomes the weakness of the existing ones and is robust and powerful as well.

References

  • Bourgain C, Génin E, Holopainen P, Mustalahti K, Mäki M, Partanen J (2000) Use of closely related affected individuals for the genetic study of complex diseases in founder populations. Am J Hum Genet 68:154–159

  • Cheung VG, Nelson SF (1998) Genomic mismatch scanning identifies human genomic DNA shared identical by descent. Genomics 47:1–7

  • Devlin B, Roeder K, Wasserman L (2000) Genomic control for association studies: a semiparametric test to detect excess-haplotype sharing. Biostatistics 1:369–387

  • Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman and Hall, New York

  • Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376

  • Grant GR, Manduchi E, Cheung VG, Ewens WJ (1999) Significant test for direct identity-by-descent mapping. Ann Hum Genet 63:441–454

  • Jorde LB (2000) Linkage disequilibrium and the search for complex disease genes. Genome Res 10:1435–1444

  • Kimura M (1980) A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111–120

  • Lee M-LT, Dehling HG (2005) Generalized two-sample U-statistics for clustered data. Stat Neerl 59:313–323

  • McGuire G, Prentice M, Wright F (1999) Improved error bounds for genetic distances from DNA sequences. Biometrics 55:1064–1070

  • Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM (2005) Nonparametric tests of association of mutation genes with human disease. Am J Hum Genet 76:780–793

  • Tajima F, Nei M (1982) Biases of the estimates of DNA divergence obtained by the restriction enzyme technique. J Mol Evol 18:115–120

  • Tzeng JY, Devlin B, Wasserman L, Roeder K (2003a) On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am J Hum Genet 72:891–902

  • Tzeng JY, Byerley W, Devlin B, Roeder K, Wasserman L (2003b) Outlier detection and false discovery rates for whole-genome DNA matching. J Am Stat Assoc 98:236–246

  • Van der Meulen MA, Te Meerman GJ (1997) Haplotype sharing analysis in affected individuals from nuclear families with at least one affected offspring. Genet Epidemiol 14:915–920

  • Vardi Y, Ying Z, Zhang CH (2001) Two-sample tests for growth curves under dependent right censoring. Biometrika 88:949–960

  • Weeks DE, Lange K (1988) The affected-pedigree-member method of linkage analysis. Am J Med Genet 42:315–326

  • Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics 1:80–83

Acknowledgments

The research has been supported in part by the National Center for Research Resources at NIH grant 2G12RR003048. The authors are grateful to the two reviewers and the editor for their helpful suggestions. We thank Mrs. Ashelyn Mosby for her careful reading of the manuscript.

Author information

Corresponding author

Correspondence to Ao Yuan.

Appendix

Proof of the proposition

(i) Note that under \(H_0\), \(U_{m,n} = \mu_{\hat{p}, \hat{q}} = \hat{p}^{\prime} D\hat{q}\) and \(D = D_{p,p}\), so we have

$$ \hat{U}_{m,n} - \mu_{\check{p},\check{p}} = \left( \hat{U}_{m,n} - U_{m,n} \right) + (U_{m,n} - \mu_{p,p}) + \left( \mu_{p,p} - \mu_{\check{p},\check{p}} \right). $$
(6)

Let \(D^{(1,0)}_{p,q}(\cdot,\cdot) = \partial D_{p,q}(\cdot,\cdot)/\partial p\) and \(D^{(0,1)}_{p,q}(\cdot,\cdot) = \partial D_{p,q}(\cdot,\cdot)/\partial q\) be the column vectors of first partial derivatives, and let \(D^{(1,0)}_{p,q}\) and \(D^{(0,1)}_{p,q}\) be the corresponding matrices of column arrays. The first term in (6) is

$$\begin{aligned} \hat{U}_{m,n} - U_{m,n} &= \hat{p}^{\prime} \left(D_{\hat{p},\hat{q}} -D_{p,q} \right) \hat{q} = \hat{p}^{\prime} \left[D^{(1,0)}_{p,q} \left(\hat{p} - p \right) + D^{(0,1)}_{p,q} \left( \hat{q} - p \right) + O \left( \left\|\hat{p} - p \right\|^2 + \left\| \hat{q} - q \right\|^2 \right) \right] \hat{q}\\ &= \hat{p}^{\prime} \left[D^{(1,0)}_{p,q} \left( \hat{p} - p \right) + D^{(0,1)}_{p,q} \left(\hat{q} - p \right) \right] \hat{q} + O_P(1/N). \end{aligned}$$

Note that the \((i, j)\)th component of \(D^{(1,0)}_{p,q}\) is the vector with \(l\)th entry

$$ {\frac{\partial}{\partial p_l}} D_{p,q} \left( {\bf h}_i, {\bf h}_j \right) = \left\{\begin{array}{*{20}l} {\frac{{\rm e}^{p_i-q_i} - {\rm e}^{q_i-p_i}}{2}}\;{\frac{{\rm e}^{p_j-q_j} + {\rm e}^{q_j-p_j}}{2}} D \left({\bf h}_i, {\bf h}_j \right) &\hbox{if}\;l=i\\ 0 & \hbox{else}\end{array}\right. \quad(l=1, \ldots, k). $$

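Since \(p = q\) under \(H_0\), the factor \({\rm e}^{p_i-q_i} - {\rm e}^{q_i-p_i}\) vanishes, so \(D^{(1,0)}_{p,q} = 0\) under \(H_0\); similarly \(D^{(0,1)}_{p,q} = 0\) under \(H_0\), and so under \(H_0\),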

$$ \hat{U}_{m,n} - U_{m,n} = O_P(1/N). $$

Note that \(\partial(p^{\prime}Dq)/\partial p = q^{\prime}D\) and \(\partial(p^{\prime}Dq)/\partial q = p^{\prime}D\), so under \(H_0\) the second term in (6) is

$$ \hat{p}^{\prime} D\hat{q} - p^{\prime} Dp = p^{\prime} D \left( \hat{p} - p \right) + p^{\prime} D\left( \hat{q} - p \right) + O \left( \left\| \hat{p} - p \right\|^2 + \left\| \hat{q} - p \right\|^2 \right). $$

Also, \(\partial\mu_{p,p}/\partial p = \partial(p^{\prime}Dp)/\partial p = 2p^{\prime}D\), so the third term in (6) under \(H_0\) is

$$ -2p^{\prime} D\left( \check{p} - p \right) + O \left( \left\| \check{p} - p \right\|^2 \right) = -2p^{\prime} D\left( \check{p} - p \right) + O_P(1/N). $$

Collecting the above relationships, we have

$$ \sqrt{N} \left( \hat{U}_{m,n} - \mu_{\check{p},\check{p}} \right) = \sqrt{N}p^{\prime} D\left[ \left(\hat{p} - p \right) + \left( \hat{q} - p \right) - 2 \left(\check{p} - p \right) \right] + O_P \left(1/\sqrt{N} \right). $$

Note that under \(H_0\),

$$ \sqrt{N} \left[ \left( \hat{p} - p \right) - \left( \hat{q} - p \right) \right] \mathop{\rightarrow}\limits^{D} N \left(0, \left(\gamma_1+\gamma_2 \right)R \right), $$

and for each i, the i-th entry of \(\left( \hat{p} - p \right) + \left( \hat{q} - p \right) - 2\left(\check{p}-p \right)\) is

$$\left\{\begin{array}{*{20}l} \left( \hat{p}_i - p_i \right) - \left(\hat{q}_i - p_i \right) & \hbox{if}\;\hat{p}_i - p_i \geq \hat{q}_i - p_i \\ \left(\hat{q}_i - p_i \right) - \left(\hat{p}_i - p_i \right) & \hbox{else}. \end{array} \right.$$
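The case display above amounts to an entrywise absolute difference. As a minimal numerical sketch, assuming (as the display implies, though the definition is not restated in this appendix) that \(\check{p}\) is the entrywise minimum of \(\hat{p}\) and \(\hat{q}\), the identity can be checked as follows. The function name `excess_over_min` is hypothetical.

```python
import numpy as np

def excess_over_min(p_hat, q_hat):
    """Return p_hat + q_hat - 2 * p_check, entry by entry.

    The true p cancels in (p_hat - p) + (q_hat - p) - 2 (p_check - p),
    so it does not appear here.
    """
    p_hat = np.asarray(p_hat, dtype=float)
    q_hat = np.asarray(q_hat, dtype=float)
    # ASSUMPTION: p_check is the entrywise minimum of the two estimates.
    p_check = np.minimum(p_hat, q_hat)
    return p_hat + q_hat - 2.0 * p_check

# Two arbitrary frequency vectors over k = 5 haplotypes.
rng = np.random.default_rng(0)
p_hat = rng.dirichlet(np.ones(5))
q_hat = rng.dirichlet(np.ones(5))

lhs = excess_over_min(p_hat, q_hat)
rhs = np.abs(p_hat - q_hat)
identity_holds = bool(np.allclose(lhs, rhs))
```

Under that assumption, each entry equals \(|\hat{p}_i - \hat{q}_i|\), which is what turns the linear term into a function of absolute values.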

Thus, writing \(|a| = (|a_1|, \ldots, |a_k|)^{\prime}\) for a vector \(a = (a_1, \ldots, a_k)^{\prime}\), we have

$$ \sqrt{N} \left(\hat{U}_{m,n} - \mu_{\check{p},\check{p}} \right) = p^{\prime} D \left| \sqrt{N} \left[ \left( \hat{p} - p \right) - \left( \hat{q} - p \right) \right] \right| + O_P \left(1/\sqrt{N} \right) \mathop{\rightarrow}\limits^{D} \left(\gamma_1 + \gamma_2 \right)^{1/2} p^{\prime} D|W|, $$

where \(W \sim N(0, R)\).
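Because the limit in (i) is a linear combination of absolute values of normal variables, its quantiles are not those of a normal distribution, and one practical option is to approximate them by simulation. The sketch below is illustrative only. It assumes that \(R\) is the multinomial covariance \(\mathrm{diag}(p) - pp^{\prime}\), which this appendix does not restate, and it uses a toy 0/1 dissimilarity matrix; the function name and all numerical inputs are hypothetical.

```python
import numpy as np

def simulate_null_limit(p, D, gamma1, gamma2, n_draws=20_000, seed=0):
    """Draw from (gamma1 + gamma2)^{1/2} * p' D |W|, with W ~ N(0, R)."""
    p = np.asarray(p, dtype=float)
    D = np.asarray(D, dtype=float)
    # ASSUMPTION: R is the multinomial covariance diag(p) - p p'.
    R = np.diag(p) - np.outer(p, p)
    rng = np.random.default_rng(seed)
    # R is singular (rows sum to zero); the default SVD-based sampler copes.
    W = rng.multivariate_normal(np.zeros(len(p)), R, size=n_draws)
    weights = p @ D  # the row vector p' D
    return np.sqrt(gamma1 + gamma2) * (np.abs(W) @ weights)

# Toy inputs: three haplotypes, dissimilarity 0 on a match and 1 otherwise.
p = np.array([0.5, 0.3, 0.2])
D = 1.0 - np.eye(3)
draws = simulate_null_limit(p, D, gamma1=2.0, gamma2=2.0)

# Simulated upper 5% point of the limiting null distribution.
critical_value = np.quantile(draws, 0.95)
```

In practice \(p\) would be replaced by an estimate from the pooled data, so the simulated critical value is itself an approximation.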

(ii) In this case

$$\begin{aligned} \hat{U}_{m,n} - \check{p}^{\prime} D\check{p} &= \left( \hat{p}^{\prime} D_{\hat{p},\hat{q}} \hat{q} - \hat{p}^{\prime} D_{p,q} \hat{q} \right) + \left( \hat{p}^{\prime} D_{p,q} \hat{q} - p^{\prime} D_{p,q}q \right) + \left( p^{\prime} D_{p,q}q - \check{p}^{\prime} D\check{p} \right)\\ &= \hat{p}^{\prime} \left[ D_{p,q}^{(1,0)} \left( \hat{p} - p \right) + D_{p,q}^{(0,1)} \left( \hat{q} - q \right) \right] \hat{q} + q^{\prime} D_{p,q} \left(\hat{p} - p \right)\\ &\quad + p^{\prime} D_{p,q} \left(\hat{q} - q \right) + \left( p^{\prime} D_{p,q}q - \tilde{p}^{\prime} D\tilde{p} \right) + O_P(1/N). \end{aligned}$$

Since \(\hat{p} \rightarrow p\) (a.s.) and \(\hat{q} \rightarrow q\) (a.s.), by Slutsky’s theorem, \(\sqrt{N} \hat{p}^{\prime} \left[ D_{p,q}^{(1,0)} \left( \hat{p} - p \right)+ D_{p,q}^{(0,1)} \left( \hat{q} - q \right) \right] \hat{q}\) and \(\sqrt{N} p^{\prime} \left[ D_{p,q}^{(1,0)} \left( \hat{p} - p \right) + D_{p,q}^{(0,1)} \left( \hat{q} - q \right) \right] q\) have the same asymptotic distribution. With the components of \(D^{(1,0)}_{p,q}\) given in (i), it is easy to check that

$$ p^{\prime} D_{p,q}^{(1,0)} \left( \hat{p} - p \right) q = a^{\prime} \left( \hat{p} - p \right), \quad p^{\prime} D_{p,q}^{(0,1)} \left( \hat{q} - q \right)q = b^{\prime} \left( \hat{q} - q \right), $$

where \(a = (a_1, \ldots, a_k)^{\prime}\) and \(b = (b_1, \ldots, b_k)^{\prime}\) with

$$\begin{aligned} a_i &= p_i {\frac{{\rm e}^{p_i-q_i} - {\rm e}^{q_i-p_i}}{2}} \sum\limits_{j=1}^k {\frac{{\rm e}^{p_j-q_j} + {\rm e}^{q_j-p_j}}{2}} q_jD \left( {\bf h}_i, {\bf h}_j \right)\quad (i=1, \ldots, k)\\ b_i &= q_i {\frac{{\rm e}^{q_i-p_i} - {\rm e}^{p_i-q_i}}{2}} \sum\limits_{j=1}^k {\frac{{\rm e}^{q_j-p_j} + {\rm e}^{p_j-q_j}}{2}} p_jD \left({\bf h}_i, {\bf h}_j \right) \quad (i=1,\ldots,k). \end{aligned}$$
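The coefficient vectors \(a\) and \(b\) can be evaluated directly from the display above, since the exponential ratios are hyperbolic sines and cosines of \(p_i - q_i\). The following is a small sketch under that reading; the function name `linear_term_coefficients` and the toy inputs are hypothetical, and `D` stands for the matrix with entries \(D({\bf h}_i, {\bf h}_j)\).

```python
import numpy as np

def linear_term_coefficients(p, q, D):
    """Evaluate the vectors a and b from their componentwise formulas."""
    p, q, D = (np.asarray(x, dtype=float) for x in (p, q, D))
    diff = p - q
    sinh_pq = np.sinh(diff)  # (e^{p_i - q_i} - e^{q_i - p_i}) / 2
    cosh_pq = np.cosh(diff)  # (e^{p_j - q_j} + e^{q_j - p_j}) / 2, even in p - q
    # a_i = p_i * sinh(p_i - q_i) * sum_j cosh(p_j - q_j) q_j D(h_i, h_j)
    a = p * sinh_pq * (D @ (cosh_pq * q))
    # b_i = q_i * sinh(q_i - p_i) * sum_j cosh(q_j - p_j) p_j D(h_i, h_j)
    b = q * (-sinh_pq) * (D @ (cosh_pq * p))
    return a, b

D_toy = np.array([[0.0, 1.0], [1.0, 0.0]])

# Under H0 (p = q) both vectors vanish, matching part (i).
a_null, b_null = linear_term_coefficients([0.5, 0.5], [0.5, 0.5], D_toy)

# Under an alternative they are generally nonzero.
a_alt, b_alt = linear_term_coefficients([0.6, 0.4], [0.4, 0.6], D_toy)
```

The vanishing of both vectors at \(p = q\) is consistent with the derivative matrices being zero under \(H_0\) in part (i).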

This gives

$$\begin{aligned} \, &\sqrt{N} \left( \hat{U}_{m,n} - \check{p}^{\prime} D\check{p} \right) - \sqrt{N} \left( p^{\prime} D_{p,q}q - \tilde{p}^{\prime} D\tilde{p} \right)\\ \, &\quad = \sqrt{N} \left[ \left( a^{\prime} + q^{\prime} D_{p,q} \right) \left( \hat{p} - p \right) + \left( b^{\prime} + p^{\prime} D_{p,q} \right) \left( \hat{q} - q \right) \right] + O_P \left(1/\sqrt{N} \right) \mathop{\rightarrow}\limits^{D} N(0, \sigma^2) \end{aligned}$$

with

$$ \sigma^2 = \gamma_1 \left( a^{\prime} + q^{\prime} D_{p,q} \right) R \left(a + D_{p,q}q \right) + \gamma_2 \left( b^{\prime} + p^{\prime} D_{p,q} \right) Q \left( b + D_{p,q}p \right). $$

Similarly, we obtain asymptotic versions of Tzeng et al.’s results. Let \(\sigma_T^2\) and \(\tilde{\sigma}^2_T\) be their asymptotic variances under \(H_0\) and \(H_1\), respectively. Assume \(0 < \lim N/m = \gamma_1 < \infty\), \(0 < \lim N/n = \gamma_2 < \infty\), and non-degeneracy of the kernel. Then, under \(H_0\),

$$ \sqrt{N} \left( \hat{p}^{\prime} A \hat{p} - \hat{q}^{\prime} A\hat{q} \right)/\sigma_T \mathop{\rightarrow}\limits^{D} N(0,1), $$

where \(\sigma_T^2 = 4(\gamma_1 + \gamma_2) p^{\prime}ARAp\). The variance \(\sigma_T^2\) is consistently estimated by \(\hat{\sigma}^2_T\), which is \(\sigma_T^2\) with \(p\) replaced by \(\hat{p},\) the estimate of \(p\) from the pooled data. Under \(H_1\),

$$ \sqrt{N} \tilde{\sigma}^{-1}_T \left( \hat{p}^{\prime} A\hat{p} - \hat{q}^{\prime} A\hat{q} - p^{\prime} Ap + q^{\prime} Aq \right) \mathop{\rightarrow}\limits^{D} N(0,1), $$

where \(\tilde{\sigma}^2_T =4 \left( \gamma_1 p^{\prime}ARAp + \gamma_2 q^{\prime}AQAq \right),\;\hbox{and}\;\tilde{\sigma}^{2}_T\) is consistently estimated by its empirical version.
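The standardized similarity contrast under \(H_0\) can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes \(R = \mathrm{diag}(p) - pp^{\prime}\) (not restated in this appendix), and it uses the fact that with \(\gamma_1 = N/m\) and \(\gamma_2 = N/n\) the factor \(N\) cancels, leaving a variance of \(4(1/m + 1/n)\,p^{\prime}ARAp\). The function name and the counts are hypothetical. With the identity matrix as a toy similarity kernel \(A\), the first example shows one way a blind area can arise: the case and control frequencies differ greatly, yet \(p^{\prime}Ap = q^{\prime}Aq\).

```python
import numpy as np

def similarity_contrast_z(case_counts, control_counts, A):
    """Standardized contrast (p_hat' A p_hat - q_hat' A q_hat) under H0."""
    x = np.asarray(case_counts, dtype=float)
    y = np.asarray(control_counts, dtype=float)
    A = np.asarray(A, dtype=float)
    m, n = x.sum(), y.sum()
    p_hat, q_hat = x / m, y / n
    pooled = (x + y) / (m + n)  # estimate of p from the pooled data
    # ASSUMPTION: R is the multinomial covariance diag(p) - p p'.
    R = np.diag(pooled) - np.outer(pooled, pooled)
    contrast = p_hat @ A @ p_hat - q_hat @ A @ q_hat
    # sqrt(N) * contrast / sigma_T with sigma_T^2 = 4 (N/m + N/n) p'ARAp;
    # N cancels, so only 1/m + 1/n remains.
    variance = 4.0 * (1.0 / m + 1.0 / n) * (pooled @ A @ R @ A @ pooled)
    return contrast / np.sqrt(variance)

A_toy = np.eye(3)  # toy kernel: similarity 1 on a match, 0 otherwise

# Mirror-image frequencies: very different populations, zero contrast.
z_blind = similarity_contrast_z([60, 30, 10], [10, 30, 60], A_toy)

# A configuration where the similarity contrast does move.
z_detect = similarity_contrast_z([70, 20, 10], [40, 30, 30], A_toy)
```

The sketch requires a non-degenerate kernel at the pooled estimate; if \(p^{\prime}ARAp = 0\) (for example, uniform pooled frequencies with this toy kernel), the normal approximation does not apply.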

Cite this article

Yuan, A., Yue, Q., Apprey, V. et al. Detecting disease gene in DNA haplotype sequences by nonparametric dissimilarity test. Hum Genet 120, 253–261 (2006). https://doi.org/10.1007/s00439-006-0216-z
