An Adaptive Genetic Association Test Using Double Kernel Machines

Statistics in Biosciences

Abstract

Recently, gene set-based approaches have become very popular in gene expression profiling studies for assessing how genetic variants are related to disease outcomes. Since most genes are not differentially expressed, existing pathway tests that consider all genes within a pathway suffer from considerable noise and power loss. Moreover, for a differentially expressed pathway, it is of interest to select the important genes that drive the effect of the pathway. In this article, we propose an adaptive association test using double kernel machines (DKM), which can both select important genes within the pathway and test for the overall genetic pathway effect. This DKM procedure first uses the garrote kernel machines test for subset selection and then the least squares kernel machine test for testing the effect of the selected subset of genes. An appealing feature of the kernel machine framework is that it provides a flexible and unified method for multi-dimensional modeling of the genetic pathway effect, allowing for both parametric and nonparametric components. The DKM approach is illustrated with application to simulated data as well as to data from a neuroimaging genetics study.

References

  1. Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68:337–404

  2. Buhmann MD (2003) Radial basis functions. Cambridge University Press, Cambridge

  3. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines. Cambridge University Press, Cambridge

  4. Cai T, Lin X, Carroll RJ (2012) Identifying genetic marker sets associated with phenotypes via an efficient adaptive score test. Biostatistics 13:776–790

  5. Cai T, Tonini G, Lin X (2011) Kernel machine approach to testing the significance of multiple genetic markers for risk prediction. Biometrics 67:975–986

  6. Fan J (1996) Test of significance based on wavelet thresholding and Neyman’s truncation. J Am Stat Assoc 91:674–688

  7. Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B Stat Methodol 70:849–911

  8. Harville DA (1977) Maximum likelihood approaches to variance component estimation and to related problems. J Am Stat Assoc 72:320–338

  9. Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat 36:1171–1220

  10. Kim MH, Akritas MG (2010) Order thresholding. Ann Stat 38:2314–2350

  11. Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP (2008) A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet 82:386–397

  12. Lin D (2005) An efficient Monte Carlo approach to assessing statistical significance in genomic studies. Bioinformatics 21:781–787

  13. Liu D, Lin X, Ghosh D (2007) Semiparametric regression of multi-dimensional genetic pathway data: least squares kernel machine and linear mixed models. Biometrics 63:1079–1088

  14. Liu D, Ghosh D, Lin X (2008) Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinform 9:292

  15. Maity A, Lin X (2011) Powerful tests for detecting a gene effect in the presence of possible gene-gene interactions using garrote kernel machines. Biometrics 67:1271–1284

  16. Neyman J (1937) Smooth test for goodness of fit. Scand Actuar J 3–4:149–199

  17. Nyholt D (2004) A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet 74:765–769

  18. Pan W, Shen X (2011) Adaptive tests for association analysis of rare variants. Genet Epidemiol 35:381–388

  19. Stein JL, Hua X, Morra JH et al (2010) Genome-wide analysis reveals novel genes influencing temporal lobe structure with relevance to neurodegeneration in Alzheimer’s disease. NeuroImage 51:542–554

  20. Wessel J, Schork NJ (2006) Generalized genomic distance-based regression methodology for multilocus association analysis. Am J Hum Genet 79:792–806

  21. Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X (2010) Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet 86:929–942

  22. Wu MC, Zhang L, Wang Z, Christiani DC, Lin X (2009) Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics 25:1145–1151

Acknowledgments

This research was supported by NIH grant CA129102. The authors thank the reviewers for helpful comments.

Conflict of interest

The authors declare that they have no conflict of interest.

Author information

Corresponding author

Correspondence to Xiang Zhan.

Appendix

1.1 Least Squares Kernel Machine Score Test Based on Linear Kernels

In this section, we prove that different linear kernels \(k(x,y; \rho )=x^T y+ \rho \) lead to the same LSKM score test. First, recall the LSKM score test proposed in [13]. Its test statistic is \(Q(\hat{\beta }, \hat{\sigma }^2, \rho )\), where

$$\begin{aligned} Q(\beta , \sigma ^2, \rho ) = \frac{1}{2\sigma ^2} (y-x\beta )^TK(\rho )(y-x\beta ), \end{aligned}$$

\(\hat{\beta }\) and \(\hat{\sigma }^2\) are the MLEs of \(\beta \) and \(\sigma ^2\) under the null model \(y=x\beta +\epsilon \). Liu et al. [13] used a scaled chi-squared distribution \(a\chi _b^2\) to approximate the distribution of \(Q(\hat{\beta }, \hat{\sigma }^2, \rho )\), where \(a\) and \(b\) are determined by matching the first two moments of \(Q\) and the scaled chi-squared distribution. It is easy to see that \(a=\mathrm{{Var}}(Q)/\{2E(Q)\}\) and \(b=2E^2(Q)/\mathrm{{Var}}(Q)\). Let \(X\) be the design matrix for the clinical covariates and \(K_{\rho }\) be the kernel matrix, which depends on the kernel parameter \(\rho \). Denote \(P_0=I- X(X^TX)^{-1}X^T\). Then according to [13]:

$$\begin{aligned} E(Q)=\frac{tr(P_0K_{\rho })}{2}\,,\, \mathrm{{Var}}(Q)=\frac{tr(P_0K_{\rho }P_0K_{\rho })}{2}-\frac{[tr(P_0K_{\rho }P_0)]^2}{2tr(P_0^2)^2} \end{aligned}$$
(9)
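
For readers who want the moment-matching step written out, the expressions for \(a\) and \(b\) quoted above can be recovered from the standard facts \(E(\chi _b^2)=b\) and \(\mathrm{{Var}}(\chi _b^2)=2b\). This is only a short worked check, not part of the original argument:

$$\begin{aligned} E(a\chi _b^2)&=ab=E(Q), \qquad \mathrm{{Var}}(a\chi _b^2)=2a^2b=\mathrm{{Var}}(Q),\\ \frac{\mathrm{{Var}}(Q)}{E(Q)}&=2a \;\Rightarrow \; a=\frac{\mathrm{{Var}}(Q)}{2E(Q)}, \qquad b=\frac{E(Q)}{a}=\frac{2E^2(Q)}{\mathrm{{Var}}(Q)}. \end{aligned}$$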

Now consider two arbitrary linear kernels \(k(x,y;\rho _1)\) and \(k(x,y;\rho _2)\). Let \(Q_i, K_i, a_i\) and \(b_i\) be the quantities corresponding to kernel \(i\), \(i=1,2\). Let \(A \equiv (1,\ldots , 1)^T\). Then, it is easy to see that \(K_2=K_1+(\rho _2-\rho _1)AA^T\). Moreover, \(P_0=\mathcal {P}_{X^{\perp }}\), where \(\mathcal {P}_{X^{\perp }}\) denotes the projection matrix onto the orthogonal complement of the space spanned by the columns of \(X\). Note that \(A\) is the first column of \(X\) (we assume that the intercept is included among the clinical covariates). Hence, \(P_0A=\mathcal {P}_{X^{\perp }} A=0\). Therefore,

$$\begin{aligned}&\displaystyle tr(P_0K_2)= tr(P_0K_1)+ (\rho _2-\rho _1)tr(P_0AA^T)=tr(P_0K_1),\\&\displaystyle tr(P_0K_2P_0K_2)= tr(P_0K_2P_0K_1) = tr(P_0K_1P_0K_1),\\&\displaystyle tr(P_0K_2P_0)= tr(P_0K_1P_0). \end{aligned}$$

Plugging these results into Eq. (9), one can show that \(E(Q_2)=E(Q_1)\) and \(\mathrm{{Var}}(Q_2)=\mathrm{{Var}}(Q_1)\). Hence, \(a_2=a_1\) and \(b_2=b_1\). Because the residuals of the null model \(y=x\beta +\epsilon \) sum to 0, that is, \(A^T(y-x\hat{\beta })=0\), we also have \(Q_2-Q_1=(\rho _2-\rho _1)\{A^T(y-x\hat{\beta })\}^2/(2\hat{\sigma }^2)=0\), so \(Q_2 = Q_1\). That is, both the LSKM score test statistic and the null distribution of the test statistic are identical for two arbitrary linear kernels \(k(x,y;\rho _1)\) and \(k(x,y;\rho _2)\). Therefore, all linear kernels lead to the same LSKM score test.
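
The invariance argued above can also be checked numerically. The following is a minimal sketch, not the authors' code: the simulated data, the helper name `lskm_pieces`, and the matrix name `Z` (standing in for the gene-level covariates) are all assumptions made for illustration, and \(\hat{\sigma }^2\) is taken to be the residual sum of squares divided by \(n\). It evaluates \(Q\) and the three traces appearing in the proof for two values of \(\rho \), with the intercept as the first column of \(X\).

```python
import numpy as np

def lskm_pieces(y, X, Z, rho):
    """Return Q and the traces tr(P0 K), tr(P0 K P0 K), tr(P0 K P0)
    for the linear kernel K(rho) = Z Z^T + rho * 1 1^T."""
    n = len(y)
    K = Z @ Z.T + rho * np.ones((n, n))               # linear kernel matrix
    # P0 = I - X (X^T X)^{-1} X^T, projection onto the complement of col(X)
    P0 = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    resid = P0 @ y                                    # null-model residuals
    sigma2 = resid @ resid / n                        # assumed MLE of sigma^2
    Q = resid @ K @ resid / (2.0 * sigma2)            # score statistic
    traces = (np.trace(P0 @ K),
              np.trace(P0 @ K @ P0 @ K),
              np.trace(P0 @ K @ P0))
    return Q, traces

rng = np.random.default_rng(0)
n, p = 40, 5
# Hypothetical clinical design matrix; the intercept is its first column.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Z = rng.normal(size=(n, p))                           # hypothetical gene data
y = rng.normal(size=n)                                # hypothetical outcome

Q1, t1 = lskm_pieces(y, X, Z, rho=0.0)
Q2, t2 = lskm_pieces(y, X, Z, rho=3.5)
same_Q = bool(np.isclose(Q1, Q2))
same_traces = bool(np.allclose(t1, t2))
print(same_Q, same_traces)
```

Under these assumptions both comparisons should agree up to floating-point error. Dropping the column of ones from `X` would generally break the equality, which is why the proof requires the intercept to be part of the clinical covariates.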

1.2 Description of ADNI Data

Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies, and non-profit organizations, as a $60 million, 5-year public-private partnership. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials.

The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California San Francisco. ADNI is the result of efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 subjects, but ADNI has been followed by ADNI-GO and ADNI-2. To date these three protocols have recruited over 1,500 adults, ages 55–90, to participate in the research, consisting of cognitively normal older individuals, people with early or late MCI, and people with early AD. The follow-up duration of each group is specified in the protocols for ADNI-1, ADNI-2, and ADNI-GO. Subjects originally recruited for ADNI-1 and ADNI-GO had the option to be followed in ADNI-2. For up-to-date information, see http://www.adni-info.org.

1.3 Acknowledgments to ADNI

Data collection and sharing for this project were funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; BioClinica, Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129 and K01 AG030514.

Cite this article

Zhan, X., Epstein, M.P. & Ghosh, D. An Adaptive Genetic Association Test Using Double Kernel Machines. Stat Biosci 7, 262–281 (2015). https://doi.org/10.1007/s12561-014-9116-2
