Abstract
Association studies for complex diseases based on haplotype data have received increasing attention in the last few years. A commonly used nonparametric method, which takes haplotype structure into consideration, is to use the U-statistic to compare the similarities between genetic compositions in the case and control populations. Although the method and its variants are convenient to use in practice, there are some areas where the tests cannot detect even large differences between cases and controls. To overcome this problem and enhance the power, we propose a new form of the weighted U-statistic, which directly compares the dissimilarity between the haplotype structures in the case and control populations. We show that this test statistic is asymptotically a linear combination of the absolute values of normal random variables under the null hypothesis, and shifts strictly toward the right under the alternative, and therefore has no blind areas of detection. Simulation studies indicate that our test statistic overcomes the weakness of the existing ones and is robust and powerful as well.
References
Bourgain C, Génin E, Holopainen P, Mustalahti K, Mäki M, Partanen J (2000) Use of closely related affected individuals for the genetic study of complex diseases in founder populations. Am J Hum Genet 68:154–159
Cheung VG, Nelson SF (1998) Genomic mismatch scanning identifies human genomic DNA shared identical by descent. Genomics 47:1–7
Devlin B, Roeder K, Wasserman L (2000) Genomic control for association studies: a semiparametric test to detect excess-haplotype sharing. Biostatistics 1:369–387
Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman and Hall, New York
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376
Grant GR, Manduchi E, Cheung VG, Ewens WJ (1999) Significant test for direct identity-by-descent mapping. Ann Hum Genet 63:441–454
Jorde LB (2000) Linkage disequilibrium and the search for complex disease genes. Genome Res 10:1435–1444
Kimura M (1980) A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111–120
Lee M-LT, Dehling HG (2005) Generalized two-sample U-statistics for clustered data. Stat Neerl 59:313–323
McGuire G, Prentice M, Wright F (1999) Improved error bounds for genetic distances from DNA sequences. Biometrics 55:1064–1070
Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM (2005) Nonparametric tests of association of mutation genes with human disease. Am J Hum Genet 76:780–793
Tajima F, Nei M (1982) Biases of the estimates of DNA divergence obtained by the restriction enzyme technique. J Mol Evol 18:115–120
Tzeng JY, Devlin B, Wasserman L, Roeder K (2003a) On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am J Hum Genet 72:891–902
Tzeng JY, Byerley W, Devlin B, Roeder K, Wasserman L (2003b) Outlier detection and false discovery rates for whole-genome DNA matching. J Am Stat Assoc 98:236–246
Van der Meulen MA, Te Meerman GJ (1997) Haplotype sharing analysis in affected individuals from nuclear families with at least one affected offspring. Genet Epidemiol 14:915–920
Vardi Y, Ying Z, Zhang CH (2001) Two-sample tests for growth curves under dependent right censoring. Biometrika 88:949–960
Weeks DE, Lange K (1988) The affected-pedigree-member method of linkage analysis. Am J Med Genet 42:315–326
Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics 1:80–83
Acknowledgments
The research has been supported in part by the National Center for Research Resources at NIH grant 2G12RR003048. The authors are grateful to the two reviewers and the editor for their helpful suggestions. We thank Mrs. Ashelyn Mosby for her careful reading of the manuscript.
Appendix
Proof of the proposition
(i) Note that under \(H_0\), \(U_{m,n} = \mu_{\hat{p}, \hat{q}} = \hat{p}^{\prime} D\hat{q}\) and \(D = D_{p,p}\), so we have
Let \(D_{p,q}^{(1,0)}(\cdot,\cdot) = \partial D_{p,q}(\cdot,\cdot)/\partial p\) and \(D_{p,q}^{(0,1)}(\cdot,\cdot) = \partial D_{p,q}(\cdot,\cdot)/\partial q\) be the column vectors of first partial derivatives, and let \(D_{p,q}^{(1,0)}\) and \(D_{p,q}^{(0,1)}\) be the corresponding matrices of column arrays. The first term in (6) is
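Note that the (i, j)th component of \(D_{p,q}^{(1,0)}\) is the vector with lth entry
Thus \(D_{p,q}^{(1,0)} = 0\) under \(H_0\); similarly \(D_{p,q}^{(0,1)} = 0\) under \(H_0\), and so under \(H_0\),
Note that \(\partial(p^{\prime}Dq)/\partial p = q^{\prime}D\) and \(\partial(p^{\prime}Dq)/\partial q = p^{\prime}D\), so under \(H_0\) the second term in (6) is
Also, \(\partial\mu_{p,p}/\partial p = \partial(p^{\prime}Dp)/\partial p = 2p^{\prime}D\), so the third term in (6) under \(H_0\) is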
Collecting the above relationships, we have
Note that under \(H_0\),
and for each i, the i-th entry of \(\left( \hat{p} - p \right) + \left( \hat{q} - p \right) - 2\left(\check{p}-p \right)\) is
Thus, writing \(|a| = (|a_1|, \ldots, |a_k|)^{\prime}\) for a vector \(a = (a_1, \ldots, a_k)^{\prime}\), we have
where \(W \sim N(0, R)\).
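The limiting form in (i), a linear combination of absolute values of jointly normal variables, has no convenient closed-form quantiles, so in practice one might approximate a critical value by simulation. The following is a minimal sketch under stated assumptions, not the authors' procedure: the weight vector `c`, the example covariance matrix `R`, and the function names `cholesky`, `simulate_null` and `upper_quantile` are all hypothetical, and `R` is assumed to be positive definite so that a Cholesky factor exists.

```python
import math
import random


def cholesky(R):
    """Lower-triangular L with L L' = R, for a symmetric positive definite R."""
    k = len(R)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(L[i][t] * L[j][t] for t in range(j))
            if i == j:
                L[i][j] = math.sqrt(R[i][i] - s)
            else:
                L[i][j] = (R[i][j] - s) / L[j][j]
    return L


def simulate_null(c, R, n_draws=20000, seed=0):
    """Sorted Monte Carlo draws of c'|W|, where W ~ N(0, R)."""
    rng = random.Random(seed)
    L = cholesky(R)
    k = len(c)
    draws = []
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]
        # W = L z has covariance L L' = R.
        w = [sum(L[i][j] * z[j] for j in range(i + 1)) for i in range(k)]
        draws.append(sum(ci * abs(wi) for ci, wi in zip(c, w)))
    draws.sort()
    return draws


def upper_quantile(sorted_draws, alpha=0.05):
    """Empirical (1 - alpha) quantile of draws that are already sorted."""
    n = len(sorted_draws)
    idx = int(math.ceil((1.0 - alpha) * n)) - 1
    return sorted_draws[max(0, min(n - 1, idx))]


# Hypothetical weights and covariance, for illustration only.
c = [1.0, 0.5]
R = [[1.0, 0.3], [0.3, 1.0]]
critical_value = upper_quantile(simulate_null(c, R), alpha=0.05)
```

An observed statistic larger than `critical_value` would then be declared significant at the 5% level, which matches the one-sided rejection region implied by the rightward shift under the alternative.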
(ii) In this case
Since \(\hat{p} \rightarrow p\) (a.s.) and \(\hat{q} \rightarrow q\) (a.s.), by Slutsky’s theorem, \(\sqrt{N} \hat{p}^{\prime} \left[ D_{p,q}^{(1,0)} \left( \hat{p} - p \right)+ D_{p,q}^{(0,1)} \left( \hat{q} - q \right) \right] \hat{q}\;\hbox{and}\;\sqrt{N} p^{\prime} \left[ D_{p,q}^{(1,0)} \left( \hat{p} - p \right) + D_{p,q}^{(0,1)} \left( \hat{q} - q \right) \right] q\) have the same asymptotic distribution. With the components of \(D_{p,q}^{(1,0)}\) given in (i), it is easy to check that
where \(a = (a_1, \ldots, a_k)^{\prime}\) and \(b = (b_1, \ldots, b_k)^{\prime}\) with
This gives
with
Similarly, we obtain asymptotic versions of Tzeng et al.’s results. Let \(\sigma_T^2\) and \(\tilde{\sigma}^2_T\) be the asymptotic variances of their statistic under \(H_0\) and \(H_1\), respectively. Assume \(0 < \lim N/m = \gamma_1 < \infty\), \(0 < \lim N/n = \gamma_2 < \infty\), and non-degeneracy of the kernel; then
where \(\sigma_T^2 = 4(\gamma_1 + \gamma_2)\, p^{\prime}ARAp\). \(\sigma_T^2\) is consistently estimated by \(\hat{\sigma}^2_T\), which is \(\sigma_T^2\) with p replaced by \(\hat{p},\) the estimate of p from the pooled data. Under \(H_1\),
where \(\tilde{\sigma}^2_T = 4 \left( \gamma_1 p^{\prime}ARAp + \gamma_2 q^{\prime}AQAq \right),\) and \(\tilde{\sigma}^{2}_T\) is consistently estimated by its empirical version.
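The two variance expressions above are plain quadratic forms, so they can be evaluated directly once the matrices are specified. The sketch below is illustrative only. It assumes, since this appendix does not restate the definitions, that R and Q are the multinomial-type covariance matrices diag(p) − pp′ and diag(q) − qq′ and that A is the similarity matrix of Tzeng et al.; the helper names (`multinomial_cov`, `sigma2_null`, `sigma2_alt`) and the numerical inputs are hypothetical.

```python
def mat_vec(M, v):
    """Matrix-vector product M v."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


def vec_mat(v, M):
    """Row-vector-matrix product v' M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def multinomial_cov(p):
    """ASSUMPTION: covariance taken as diag(p) - p p' (not defined in this appendix)."""
    k = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(k)]
            for i in range(k)]


def quad_ARA(p, A, R):
    """The quadratic form p' A R A p."""
    left = vec_mat(p, A)                 # p' A
    right = mat_vec(R, mat_vec(A, p))    # R A p
    return sum(l * r for l, r in zip(left, right))


def sigma2_null(p, A, gamma1, gamma2):
    """4 (gamma1 + gamma2) p' A R A p, the variance under H0."""
    return 4.0 * (gamma1 + gamma2) * quad_ARA(p, A, multinomial_cov(p))


def sigma2_alt(p, q, A, gamma1, gamma2):
    """4 (gamma1 p' A R A p + gamma2 q' A Q A q), the variance under H1."""
    return 4.0 * (gamma1 * quad_ARA(p, A, multinomial_cov(p))
                  + gamma2 * quad_ARA(q, A, multinomial_cov(q)))


# Hypothetical inputs: two haplotypes, identity similarity matrix, and
# equal group sizes, so that gamma1 = gamma2 = 2 if N = m + n.
p_hat = [0.2, 0.8]
A = [[1.0, 0.0], [0.0, 1.0]]
s2 = sigma2_null(p_hat, A, gamma1=2.0, gamma2=2.0)
```

Plugging the pooled estimate in for p, as here, gives the plug-in estimate of the null variance described above. When q = p the two functions agree, which serves as a quick consistency check on the formulas.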
Cite this article
Yuan, A., Yue, Q., Apprey, V. et al. Detecting disease gene in DNA haplotype sequences by nonparametric dissimilarity test. Hum Genet 120, 253–261 (2006). https://doi.org/10.1007/s00439-006-0216-z