An Adaptive Genetic Association Test Using Double Kernel Machines

Zhan, Xiang; Epstein, Michael P.; Ghosh, Debashis

doi:10.1007/s12561-014-9116-2


Published: 24 June 2014

Statistics in Biosciences, volume 7, pages 262–281 (2015)

Xiang Zhan¹, Michael P. Epstein³ & Debashis Ghosh¹,²


Abstract

Recently, gene set-based approaches have become very popular in gene expression profiling studies for assessing how genetic variants are related to disease outcomes. Since most genes are not differentially expressed, existing pathway tests that consider all genes within a pathway suffer from considerable noise and power loss. Moreover, for a differentially expressed pathway, it is of interest to select the important genes that drive the effect of the pathway. In this article, we propose an adaptive association test using double kernel machines (DKM), which can both select important genes within the pathway and test for the overall genetic pathway effect. This DKM procedure first uses the garrote kernel machines test for subset selection and then the least squares kernel machine test for testing the effect of the selected subset of genes. An appealing feature of the kernel machine framework is that it provides a flexible and unified method for multi-dimensional modeling of the genetic pathway effect, allowing for both parametric and nonparametric components. The DKM approach is illustrated with applications to simulated data as well as to data from a neuroimaging genetics study.


References

1. Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68:337–404
2. Bühmann MD (2003) Radial basis functions. Cambridge University Press, Cambridge
3. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines. Cambridge University Press, Cambridge
4. Cai T, Lin X, Carroll RJ (2012) Identifying genetic marker sets associated with phenotypes via an efficient adaptive score test. Biostatistics 13:776–790
5. Cai T, Tonini G, Lin X (2011) Kernel machine approach to testing the significance of multiple genetic markers for risk prediction. Biometrics 67:975–986
6. Fan J (1996) Test of significance based on wavelet thresholding and Neyman’s truncation. J Am Stat Assoc 91:674–688
7. Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B Stat Methodol 70:849–911
8. Harville DA (1977) Maximum likelihood approaches to variance component estimation and to related problems. J Am Stat Assoc 72:320–338
9. Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat 36:1171–1220
10. Kim MH, Akritas MG (2010) Order thresholding. Ann Stat 38:2314–2350
11. Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP (2008) A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet 82:386–397
12. Lin D (2005) An efficient Monte Carlo approach to assessing statistical significance in genomic studies. Bioinformatics 21:781–787
13. Liu D, Lin X, Ghosh D (2007) Semiparametric regression of multi-dimensional genetic pathway data: least squares kernel machine and linear mixed models. Biometrics 63:1079–1088
14. Liu D, Ghosh D, Lin X (2008) Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinform 9:292
15. Maity A, Lin X (2011) Powerful tests for detecting a gene effect in the presence of possible gene-gene interactions using garrote kernel machines. Biometrics 67:1271–1284
16. Neyman J (1937) Smooth test for goodness of fit. Scand Actuar J 3–4:149–199
17. Nyholt D (2004) A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet 74:765–769
18. Pan W, Shen X (2011) Adaptive tests for association analysis of rare variants. Genet Epidemiol 35:381–388
19. Stein JL, Hua X, Morra JH et al (2010) Genome-wide analysis reveals novel genes influencing temporal lobe structure with relevance to neurodegeneration in Alzheimer’s disease. NeuroImage 51:542–554
20. Wessel J, Schork NJ (2006) Generalized genomic distance-based regression methodology for multilocus association analysis. Am J Hum Genet 79:792–806
21. Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X (2010) Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet 86:929–942
22. Wu MC, Zhang L, Wang Z, Christiani DC, Lin X (2009) Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics 25:1145–1151


Acknowledgments

This research was supported by NIH grant CA129102. The authors thank the reviewers for helpful comments.

Conflict of interest

The authors declare that they have no conflict of interest.

Author information

Authors and Affiliations

Department of Statistics, Pennsylvania State University, University Park, PA 16802, USA
Xiang Zhan & Debashis Ghosh
Department of Public Health Sciences, Pennsylvania State University, University Park, PA 16802, USA
Debashis Ghosh
Department of Human Genetics, Emory University, Atlanta, GA 30322, USA
Michael P. Epstein


Corresponding author

Correspondence to Xiang Zhan.

Appendix

1.1 Least Squares Kernel Machine Score Test Based on Linear Kernels

In this section, we prove that different linear kernels $k(x,y; \rho )=x^T y+ \rho $ lead to the same LSKM score test. First, let us recall the LSKM score test proposed in [13]. The test statistic of an LSKM score test is $Q(\hat{\beta }, \hat{\sigma }^2, \rho )$, where

$$\begin{aligned} Q(\beta , \sigma ^2, \rho ) = \frac{1}{2\sigma ^2} (y-x\beta )^TK(\rho )(y-x\beta ), \end{aligned}$$

and $\hat{\beta }$ and $\hat{\sigma }^2$ are the MLEs of $\beta $ and $\sigma ^2$ under the null model $y=x\beta +\epsilon $. In [13], a scaled chi-squared distribution $a\chi _b^2$ is used to approximate the distribution of $Q(\hat{\beta }, \hat{\sigma }^2, \rho )$, where $a$ and $b$ are determined by matching the first two moments of $Q$ and the scaled chi-squared distribution. Since $a\chi _b^2$ has mean $ab$ and variance $2a^2b$, it follows that $a=\mathrm{{Var}}(Q)/\{2E(Q)\}$ and $b=2\{E(Q)\}^2/\mathrm{{Var}}(Q)$. Let $X$ be the design matrix for the clinical covariates and $K_{\rho }$ be the kernel matrix $K(\rho )$, which depends on the kernel parameter $\rho $. Denote $P_0=I- X(X^TX)^{-1}X^T$. Then according to [13]:

$$\begin{aligned} E(Q)=\frac{tr(P_0K_{\rho })}{2}\,,\quad \mathrm{{Var}}(Q)=\frac{tr(P_0K_{\rho }P_0K_{\rho })}{2}-\frac{[tr(P_0K_{\rho }P_0)]^2}{2tr(P_0^2)^2} \end{aligned} \tag{9}$$
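The moment-matching step can be illustrated with a minimal sketch. The function name `match_scaled_chi2` and the numerical inputs are hypothetical and chosen only for illustration; the sketch assumes that $E(Q)$ and $\mathrm{Var}(Q)$ have already been computed from Eq. (9), and it uses only the facts that $a\chi_b^2$ has mean $ab$ and variance $2a^2b$.

```python
def match_scaled_chi2(mean_q, var_q):
    """Return (a, b) such that a * chi2_b has mean mean_q and variance var_q.

    Hypothetical helper: solves a*b = mean_q and 2*a^2*b = var_q.
    """
    if mean_q <= 0 or var_q <= 0:
        raise ValueError("mean and variance must be positive")
    a = var_q / (2.0 * mean_q)       # scale parameter
    b = 2.0 * mean_q ** 2 / var_q    # degrees of freedom (need not be an integer)
    return a, b


# Illustrative inputs only; these are not values from the article.
a, b = match_scaled_chi2(mean_q=3.0, var_q=12.0)

# The matched distribution reproduces the two input moments.
matched_mean = a * b
matched_var = 2.0 * a ** 2 * b
```

With these inputs the sketch gives $a=2$ and $b=1.5$, so the approximating distribution is $2\chi^2_{1.5}$; a p-value would then be obtained by comparing the observed $Q/a$ with a $\chi_b^2$ distribution.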

Now consider two arbitrary linear kernels $k(x,y;\rho _1)$ and $k(x,y;\rho _2)$. Let $Q_i$, $K_i$, $a_i$ and $b_i$ denote the quantities corresponding to kernel $i$, $i=1,2$. Let $A \equiv (1,\ldots , 1)^T$. Because the two kernels differ by the constant $\rho _2-\rho _1$ in every entry, $K_2=K_1+(\rho _2-\rho _1)AA^T$. Moreover, $P_0=\mathcal {P}_{X^{\perp }}$, where $\mathcal {P}_{X^{\perp }}$ denotes the projection matrix onto the orthogonal complement of the space spanned by the columns of $X$. Note that $A$ is the first column of $X$ (we assume that the intercept is included among the clinical covariates). Hence, $P_0A=\mathcal {P}_{X^{\perp }} A=0$. Therefore,

$$\begin{aligned} tr(P_0K_2)&= tr(P_0K_1)+ (\rho _2-\rho _1)tr(P_0AA^T)=tr(P_0K_1),\\ tr(P_0K_2P_0K_2)&= tr(P_0K_2P_0K_1) = tr(P_0K_1P_0K_1),\\ tr(P_0K_2P_0)&= tr(P_0K_1P_0). \end{aligned}$$

Plugging these results into Eq. (9), one can show that $E(Q_2)=E(Q_1)$ and $\mathrm{{Var}}(Q_2)=\mathrm{{Var}}(Q_1)$. Hence, $a_2=a_1$ and $b_2=b_1$. Moreover, $Q_2-Q_1=(\rho _2-\rho _1)\{A^T(y-x\hat{\beta })\}^2/(2\hat{\sigma }^2)$. Because the residuals of the null model $y=x\beta +\epsilon $ sum to 0, that is, $A^T(y-x\hat{\beta })=0$, it follows that $Q_2 = Q_1$. That is, both the LSKM score test statistic and the null distribution of the test statistic are identical for two arbitrary linear kernels $k(x,y;\rho _1)$ and $k(x,y;\rho _2)$. Therefore, all linear kernels lead to the same LSKM score test.
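The invariance argument above can be checked numerically. The following is a minimal sketch on simulated data, not the authors' implementation: the sample sizes, the random covariates, the name `Z` for the genetic covariate matrix, and the helper `score_quantities` are all assumptions made for illustration. It computes the score statistic and the three traces entering Eq. (9) for two values of $\rho$ and compares them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 2, 5  # hypothetical sample size and covariate counts

# Clinical design matrix X; its first column is the intercept A = (1,...,1)^T.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Z = rng.normal(size=(n, q))  # simulated genetic covariates (assumption)
y = rng.normal(size=n)       # simulated outcome (assumption)

# P0 = I - X (X^T X)^{-1} X^T projects onto the orthogonal complement of col(X).
P0 = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

resid = P0 @ y                   # residuals y - X beta_hat under the null model
sigma2_hat = resid @ resid / n   # MLE of sigma^2 under the null model


def score_quantities(rho):
    """Return Q and the traces tr(P0 K), tr(P0 K P0 K), tr(P0 K P0)."""
    K = Z @ Z.T + rho            # linear kernel matrix: z_i^T z_j + rho
    Q = resid @ K @ resid / (2.0 * sigma2_hat)
    return np.array([
        Q,
        np.trace(P0 @ K),
        np.trace(P0 @ K @ P0 @ K),
        np.trace(P0 @ K @ P0),
    ])


q1 = score_quantities(0.0)
q2 = score_quantities(3.7)

# Up to floating-point error, all four quantities agree for both values of rho.
same = np.allclose(q1, q2)
```

The check relies on the intercept being a column of `X`; dropping that column would break $P_0A=0$, and the quantities would then generally differ across values of $\rho$.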

1.2 Description of ADNI Data

Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies, and non-profit organizations, as a $60 million, 5-year public-private partnership. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials.

The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California San Francisco. ADNI is the result of efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 subjects, but ADNI has been followed by ADNI-GO and ADNI-2. To date these three protocols have recruited over 1,500 adults, ages 55–90, to participate in the research, consisting of cognitively normal older individuals, people with early or late MCI, and people with early AD. The follow-up duration of each group is specified in the protocols for ADNI-1, ADNI-2, and ADNI-GO. Subjects originally recruited for ADNI-1 and ADNI-GO had the option to be followed in ADNI-2. For up-to-date information, see http://www.adni-info.org.

1.3 Acknowledgments to ADNI

Data collection and sharing for this project were funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; BioClinica, Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129 and K01 AG030514.


About this article

Cite this article

Zhan, X., Epstein, M.P. & Ghosh, D. An Adaptive Genetic Association Test Using Double Kernel Machines. Stat Biosci 7, 262–281 (2015). https://doi.org/10.1007/s12561-014-9116-2


Received: 09 October 2013
Accepted: 01 June 2014
Published: 24 June 2014
Issue Date: October 2015
DOI: https://doi.org/10.1007/s12561-014-9116-2
