Abstract
Variable selection in genome-wide association studies can be a daunting task and is statistically challenging because there are more variables than subjects. We propose an approach that uses principal-components analysis (PCA) and the least absolute shrinkage and selection operator (LASSO) to identify gene-gene interactions in genome-wide association studies. PCA was first used to reduce the dimension of the single-nucleotide polymorphisms (SNPs) within each gene. The interactions of the gene PCA scores were then entered into the LASSO to determine whether any gene-gene signals exist. We extended the PCA-LASSO approach by using the bootstrap to estimate the standard errors and confidence intervals of the LASSO coefficient estimates. This method was compared with entering the raw SNP values into the LASSO and with a logistic model with individual gene-gene interactions. We demonstrated these methods with the Genetic Analysis Workshop 16 rheumatoid arthritis genome-wide association study data, and our results identified a few gene-gene signals. Based on our results, the PCA-LASSO method shows promise in identifying gene-gene interactions, and, at this time, we suggest using it together with other conventional approaches, such as generalized linear models, to narrow down genetic signals.
Background
The goal of this paper is to develop and evaluate prediction methods and tools for genome-wide association studies, particularly for variable selection and dimension reduction. There is a demand for statistical techniques capable of handling large volumes of data in genetic studies. Technical advances have enabled the collection of massive high-dimensional datasets in such studies. This has called for a broadening of research in dimension-reduction techniques to provide methods for prediction and variable selection. For example, during the last decade, Li [1], Tibshirani [2], and Efron et al. [3] have paved new directions for dimension-reduction techniques and broadened the area to other applications of prediction, including genetics.
For this paper, we explore extensions of existing dimension-reduction and variable-selection methods related to genome-wide association study (GWAS) single-nucleotide polymorphism (SNP) selection and gene-gene interactions, for application to the disease-classification problem based on genetic data. Recently, the focus has shifted to GWAS, where the emphasis can be placed on assessing whether multiple markers function together rather than depending on univariate tests and generalized linear models (GLM). Dimension-reduction techniques are a powerful tool because they provide a summary measure of massive amounts of data. We can apply such techniques to determine whether multiple-marker pathways and gene-gene interactions are associated with the disease of interest. The highly dense genetic marker data from the rheumatoid arthritis study, together with the published reports about the study, provide an ideal empirical dataset for developing and testing extensions of dimension-reduction methods.
There is a demand for statistical techniques to handle large volumes of data, particularly in the area of genetics. Genetic data are used to find genetic variants that are associated with rheumatoid arthritis risk (or other diseases) through statistical modeling. The tendency when analyzing genotype data is to use GLM and univariate tests; however, these models perform poorly on high-dimensional data [4, 5]. The research objective of this study is to develop prediction tools, primarily methods for variable selection and dimension reduction, in a GWAS.
In an effort to improve variable selection, Tibshirani [2] developed the least absolute shrinkage and selection operator (LASSO), a penalized likelihood approach, for linear regression. Two important components of variable selection are prediction accuracy and interpretation. Ordinary least squares (OLS) is known to estimate coefficients with small bias but inflated variance. In the case of a large number of predictors, OLS has difficulty selecting the subset of predictors that appears to be the most important or to have the strongest effects. LASSO is a combination of ridge regression and subset selection, developed to improve on OLS by shrinking the coefficient values and setting some equal to zero. LASSO [2, 6] is similar to OLS with constraints and produces a stable and interpretable model. Nonlinear extensions of the LASSO exist, such as modeling a binary outcome [6]. Principal-components analysis (PCA) is a nonparametric dimension-reduction approach. PCA is a linear transformation of the original data that incorporates second-order statistics to determine the optimal components that describe the functional relationship between the outcome and covariates [7]. The premise of PCA is to identify the orthogonal linear combinations with the largest variance. The benefit of using PCA and LASSO is that both methods can accommodate correlation, such as linkage disequilibrium (LD), between SNPs. This advantage prompted us to select PCA and LASSO to model SNPs and genes; models such as GLM fail in the presence of LD [4].
We investigate PCA [7] and LASSO [2, 6, 8] methods to reduce the dimension of the genetic marker data and detect gene-gene interaction signals on chromosome 6. We explore the two methods, PCA and LASSO, combining variable-selection and dimension-reduction techniques. The combined approach will further reduce the dimension of the data to detect signals from variants and gene-gene interactions in addition to the gene(s) discovered in previously published work on rheumatoid arthritis [9, 10]. The bootstrap will be used to estimate the standard errors and confidence intervals of the LASSO coefficient estimates. We will compare the LASSO-PCA approach to the LASSO method including the entire set of SNP values, the logistic regression with individual PCA-PCA interactions, and the logistic regression with individual SNP-SNP interactions.
Methods
We denote Y_i ∈ {0, 1} to be the outcome and Z_{i,k} ∈ {0, 1, 2}, k = 1,...,K, to be the SNP variables of a K-dimensional covariate vector Z_i = (Z_{i,1},...,Z_{i,K})^T with n subjects, where i = 1,...,n indexes the i^th subject. Logistic regression is the model of choice for a binary outcome, and it is a member of the GLM family. We specify Y to have a binomial distribution, Y_i ~ bin(n, μ_i), where the mean is μ_i = E(Y_i), the linear predictor is η_i = β_0 + Σ_{k=1}^{K} β_k Z_{i,k}, and the link function here is the logit function of the form g(μ_i) = ln[μ_i/(1 − μ_i)] = η_i. The link function describes the relationship between the mean of the distribution function and the linear predictor. The log-likelihood is of the form ℓ(β) = Σ_{i=1}^{n} [Y_i η_i − ln(1 + exp(η_i))].
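As a concrete illustration, the logit link and the log-likelihood above can be written out numerically. This is a minimal sketch of the standard formulas, not the software used in the paper:

```python
import numpy as np

def logit(mu):
    """Link function g(mu) = ln[mu / (1 - mu)]."""
    return np.log(mu / (1 - mu))

def inv_logit(eta):
    """Inverse link: recovers the mean mu from the linear predictor eta."""
    return 1.0 / (1.0 + np.exp(-eta))

def log_likelihood(y, eta):
    """Log-likelihood sum_i [y_i * eta_i - ln(1 + exp(eta_i))]."""
    y, eta = np.asarray(y, float), np.asarray(eta, float)
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))
```

For example, logit(0.5) = 0 and inv_logit(0) = 0.5; fitting the model amounts to maximizing the log-likelihood over the coefficients that define η.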
LASSO was originally intended for linear regression, and it has been extended to the GLM by Lokhorst [6]. The LASSO and GLM algorithms are combined into a generalized LASSO algorithm [6] to estimate the LASSO coefficients. The idea is to use an iteratively reweighted least-squares approach to compute estimates of the regression coefficients in a LASSO model while placing a constraint on the regression coefficients. The generalized LASSO algorithm begins with initial estimates of the mean, μ_i^{(0)}, and the linear predictor, η_i^{(0)} = g(μ_i^{(0)}), where h_i is a specified weight. Initial values of β are not needed. Another option is to start with coefficient values of 0; however, this can take too long to converge. The covariates that are not constrained can be swept out; we denote these covariates as V_i and their regression coefficient parameters as β. The covariates that are constrained are denoted as X_i, and their regression coefficient parameters are denoted γ. The next step is to estimate the adjusted response variable, Y^{a,(j)}, of the form Y_i^{a,(j)} = η_i^{(j)} + (Y_i − μ_i^{(j)}) ∂η_i/∂μ_i, where j denotes the iteration number, a denotes adjusted, and the weight matrix is W^{(j)} = diag{h_i (∂μ_i/∂η_i)^2 / var(Y_i)}. The next step involves projecting the weighted independent variables and the weighted adjusted dependent variable onto the space orthogonal to the column space of the unconstrained covariates. The updated covariates and response variable are given by X^{*,(j)} = W^{(j)1/2}[X − V(V^T W^{(j)} V)^{−1} V^T W^{(j)} X] and Y^{*,(j)} = W^{(j)1/2}[Y^{a,(j)} − V(V^T W^{(j)} V)^{−1} V^T W^{(j)} Y^{a,(j)}]. The regression coefficients for V are estimated as β = (V^T W^{(j)} V)^{−1} V^T W^{(j)} Y^{a,(j)}. The last step is to solve min (Y^{*,(j)} − X^{*,(j)}γ)^T (Y^{*,(j)} − X^{*,(j)}γ) subject to ||γ||_1 ≤ t. The tuning parameter, t > 0, specifies the amount of shrinkage applied to the coefficient estimates. The tuning parameter is chosen by selecting a normalized parameter, s, that is the ratio of the tuning parameter to the total effect size of the unconstrained regression estimate, expressed as s = t/||(X^{*T}X^{*})^{−1} X^{*T}Y^{*}||_1, 0 ≤ s ≤ 1. It should be noted that when s = 1 there is no shrinkage.
The estimates are updated and the iterative process is continued until convergence.
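The paper's fits were obtained with the lasso2 R package. Purely as an illustrative stand-in (an assumption of this sketch, not the authors' implementation), an L1-penalized logistic fit with the same shrink-to-exactly-zero behavior can be obtained from scikit-learn, where the inverse penalty strength C plays a role analogous to the tuning parameter t above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, K = 300, 10
Z = rng.integers(0, 3, size=(n, K)).astype(float)   # SNP codes 0/1/2
eta = 1.5 * Z[:, 0] - 1.0 * Z[:, 1]                 # only two true signals
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

# L1-penalized logistic regression: small C means heavy shrinkage,
# so coefficients of noise SNPs are driven exactly to zero.
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Z, y)
selected = np.flatnonzero(fit.coef_[0])             # indices of nonzero coefficients
```

In this simulated setting the L1 constraint zeroes out most of the noise SNPs while retaining the true signal, which is exactly the variable-selection behavior the generalized LASSO algorithm provides for the GLM.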
For each gene, the score derived from the PCA is a linear combination of the SNPs. This PCA score represents a summary measure of the SNPs from the g^th gene in a condensed fashion, where the score is S_{l,g} = Z^{(g)} e_{l,g}, e_{l,g} is the eigenvector defining the l^th PCA component for the g^th gene, and Z^{(g)} is the raw SNP data from the g^th gene. The components that account for at least 10% of the variance are chosen, where D_{l,g} = d_{l,g}/Σ_l d_{l,g} is the summary measure used to determine the percentage of variance for the g^th gene, and d_{l,g} denotes the eigenvalues obtained from the PCA for the g^th gene.
The R package we used for the analysis is lasso2. lasso2 has limited capabilities when analyzing categorical data, such as the inability to estimate standard errors. As recommended by Meier et al. [11], we used the bootstrap [12] to estimate the standard errors and confidence intervals. A nonzero LASSO coefficient value indicates that the variable should be considered for variable selection and that further investigation is necessary. The bootstrap confidence interval can indicate the statistical importance of a covariate from the LASSO. We selected C = 1000 bootstrap samples from the data (Y, Z, S) with replacement. For each of these bootstrap samples, we estimated the LASSO coefficient θ*_c for the c^th bootstrap sample, where c = 1,...,C and the star (*) indicates the estimate is from the bootstrap. The average of the bootstrapped estimates is θ̄* = (1/C) Σ_{c=1}^{C} θ*_c [12]. The variance of the bootstrapped estimates is Var*(θ̂) = [1/(C − 1)] Σ_{c=1}^{C} (θ*_c − θ̄*)^2 [12]. An estimate of the bias of θ̂ is θ̄* − θ̂ [12]. The normal-theory interval is used to estimate the 95% bootstrap confidence interval. We assume θ̂ has a normal distribution, θ̂ ~ N(θ, Var*(θ̂)), and the confidence interval is of the form θ̂ ± 1.96 √Var*(θ̂) [12].
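The bootstrap procedure can be sketched as below. The scikit-learn L1-penalized logistic fit stands in for the lasso2 fit used in the paper (an assumption of this sketch), but the resampling, standard-error, bias, and normal-theory interval computations follow the formulas above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_lasso(X, y, penalty_C=0.5):
    """Stand-in L1-penalized logistic fit (not the lasso2 estimator)."""
    m = LogisticRegression(penalty="l1", solver="liblinear", C=penalty_C)
    return m.fit(X, y).coef_[0]

def bootstrap_lasso(X, y, C=1000, seed=0):
    """Bootstrap SE, bias, and normal-theory 95% CI for each coefficient.
    The paper used C = 1000 resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    theta_hat = fit_lasso(X, y)                  # full-sample estimate
    boots = np.empty((C, X.shape[1]))
    for c in range(C):
        idx = rng.integers(0, n, n)              # resample with replacement
        boots[c] = fit_lasso(X[idx], y[idx])
    theta_bar = boots.mean(axis=0)               # average of bootstrap estimates
    se = boots.std(axis=0, ddof=1)               # bootstrap standard error
    bias = theta_bar - theta_hat                 # bootstrap bias estimate
    ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)  # normal-theory interval
    return theta_hat, se, bias, ci
```

A coefficient whose interval excludes zero would then be flagged as statistically important, matching how the bootstrap intervals are used in the Results.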
Results
The HLA-DRB1 gene on chromosome 6 has been linked to rheumatoid arthritis [9]. Based on this finding, we decided to evaluate markers from chromosome 6. We focused on markers from a subset of the genes that were explored in studies conducted from 1992 to 2003 [9]. A total of 135 SNPs from 28 genes were considered for analysis: AP (n = 1), HLA class (n = 16), MICA-MICF (n = 6), TAP (n = 2), and TNF (n = 3). PLINK was used for quality control. Of the 35,574 markers on chromosome 6, 33,585 SNPs remained after removing those that failed the Hardy-Weinberg equilibrium test (p ≤ 0.001), the missingness test (GENO > 0.1), or the frequency test (minor allele frequency < 0.01).
The intercept was the only variable swept out in the LASSO model. The number of components selected with PCA ranged from one to two per gene. All PCA scores and the corresponding PCA_gene_a × PCA_gene_b interactions were entered into the LASSO model to determine whether there was gene-gene interaction. Here, PCA_gene_a is a PCA score from the a^th gene and PCA_gene_b is a PCA score from the b^th gene, where a ≠ b. Table 1 presents the results, indicating 16 potential interactions with their bootstrap standard errors and bootstrap confidence intervals. Based on the bootstrap estimates, only two gene-gene interactions, HLA-DRA × HLA-DRB9 and HLA-DRA × MICA, were significant. For these 16 potential gene-gene interactions, we entered the raw SNP values and the corresponding SNP_gene_a × SNP_gene_b interactions into the LASSO model to determine whether the same genetic relationships exist. Here, SNP_gene_a is a SNP from the a^th gene and SNP_gene_b is a SNP from the b^th gene, where a ≠ b. Eleven gene-gene interactions were suggested by the LASSO-SNP method, three of which were also suggested by the LASSO-PCA analysis. However, the final results from the LASSO-PCA and LASSO-SNP methods were the same: two significant gene-gene interactions, HLA-DRA × HLA-DRB9 and HLA-DRA × MICA. We did explore selecting the components using a scree plot, but it often selected too many noise components. In addition, we set the value of the normalized parameter to 0.5 and explored various normalized parameter values to determine the optimal value for variable selection. Our analysis was inconclusive on the best measure for selecting an optimal value, and we will explore this further in the future.
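The cross-gene interaction terms entered into the LASSO are products of PCA scores from different genes. A hypothetical helper (the name and layout are illustrative, not from the paper) might build that interaction design as:

```python
import numpy as np
from itertools import combinations

def cross_gene_interactions(score_by_gene):
    """Given {gene name: (subjects x components) PCA-score matrix},
    form all PCA_gene_a x PCA_gene_b product columns with a != b."""
    cols, names = [], []
    for (ga, Sa), (gb, Sb) in combinations(sorted(score_by_gene.items()), 2):
        for i in range(Sa.shape[1]):
            for j in range(Sb.shape[1]):
                cols.append(Sa[:, i] * Sb[:, j])          # score product
                names.append(f"{ga}.PC{i + 1} x {gb}.PC{j + 1}")
    return np.column_stack(cols), names
```

Because each gene contributes only one or two score columns, the interaction design stays far smaller than the corresponding all-pairs SNP × SNP design.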
Additionally, we ran logistic regression models with the individual SNP_gene_a × SNP_gene_b interactions and the individual PCA_gene_a × PCA_gene_b interactions to compare methods. A multiple-comparison procedure was applied using the Benjamini and Hochberg [13] method, which controls the false-discovery rate. With the individual SNP-SNP interactions from the logit model, we found 337 significant interactions that reduced to 78 unique gene-gene interactions; of these, 11 overlapped with the LASSO findings. For the individual PCA_gene_a × PCA_gene_b interactions in the logit model, we found 37 gene-gene interactions, and only 5 overlapped with the LASSO findings. The two gene-gene interactions consistently found to be significant across all four approaches were HLA-DRA × HLA-DRB9 and HLA-DRA × MICA. This suggests that the individual SNP-SNP interactions may function jointly instead of independently. Further investigation of the LASSO and PCA approach will be pursued. A third approach involved placing all 135 SNPs in a LASSO model to determine whether there were any variant-variant signals. This approach was limited by the large amount of categorical data and the large number of interactions; we did not pursue it further after recognizing that the analysis had to be split into three LASSO models.
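The Benjamini-Hochberg step-up rule used for the logit-model comparisons can be sketched as follows (a generic implementation of the published procedure, not the paper's code):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up FDR control: reject all p-values up to the largest rank i
    (1-based, after sorting) satisfying p_(i) <= i * alpha / m."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()   # last sorted index meeting the bound
        reject[order[:k + 1]] = True     # reject everything up to and including it
    return reject
```

Unlike a Bonferroni correction, the step-up rule compares each ordered p-value to an increasing threshold, which is what allows hundreds of interaction tests to retain power while controlling the false-discovery rate.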
Conclusion
In GWAS there is an overwhelming amount of data, and it can be difficult to distinguish between true signals and spurious results based only on single-marker analysis. Our approach focuses on assessing whether multiple markers act together in producing the phenotype. We demonstrated a combined approach of a dimension-reduction method, PCA, and a variable-selection method, LASSO, to detect gene-gene interaction signals. We extended the LASSO method to estimate standard errors and confidence intervals with the bootstrap.
Interestingly, whether the principal-component scores or the raw SNP values were placed into the LASSO, the final results were the same. The results from the individual-interaction PCA logit models and individual-interaction SNP logit models overlapped and revealed the same interactions found with the LASSO method. This suggests the PCA-LASSO method shows promise; at this time we suggest using it with other conventional approaches to narrow down genetic signals. The advantage of our method is that highly collinear data and a large number of variables can be reduced to a manageable dimension, because LD is accommodated by both LASSO and PCA. Also, a large number of SNPs can be represented as a function of a gene.
A limitation of the current work is that we cannot conclude whether our PCA-LASSO method is an improvement over other gene-gene variable-selection methods. We will further investigate the threshold on the number of covariates in the LASSO model. We propose future simulation studies that will compare the PCA-LASSO approach with other variable-selection methods [4, 11]; such studies are necessary to determine the properties of the PCA-LASSO approach. We will also pursue study of the normalized parameter in our future work.
Abbreviations
GLM: Generalized linear models
GWAS: Genome-wide association studies
LASSO: Least absolute shrinkage and selection operator
LD: Linkage disequilibrium
OLS: Ordinary least squares
PCA: Principal-components analysis
SNP: Single-nucleotide polymorphism
References
Li KC: Sliced inverse regression for dimension reduction. J Am Stat Assoc. 1991, 86: 316-327. 10.2307/2290563.
Tibshirani R: Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996, 58: 267-288.
Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann Stat. 2004, 32: 407-499. 10.1214/009053604000000067.
Malo N, Libiger O, Schork NJ: Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am J Hum Genet. 2008, 82: 375-85. 10.1016/j.ajhg.2007.10.012.
Steyerberg EW, Eijkemans MJC, Habbema JDF: Application of shrinkage techniques in logistic regression analysis: a case study. Stat Neerl. 2001, 55: 76-88. 10.1111/1467-9574.00157.
Lokhorst J: The LASSO and Generalised Linear Models. Honors Project. 1999, Adelaide, The University of Adelaide, Statistics Department.
Jolliffe IT: Principal Component Analysis. 1986, New York, Springer-Verlag.
Shi W, Lee KE, Wahba G: Detecting disease-causing genes by LASSO-patternsearch algorithm. BMC Proc. 2007, 1 (suppl 1): S60. 10.1186/1753-6561-1-s1-s60.
Newton JL, Harney SMJ, Wordsworth BP, Brown MA: A review of the MHC genetics of rheumatoid arthritis. Genes Immun. 2004, 5: 151-157. 10.1038/sj.gene.6364045.
Carlton VEH, Hu X, Chokkalingam AP, Schrodi SJ, Brandon R, Alexander HC, Chang M, Catanese JJ, Leong DU, Ardlie KG, Kastner DL, Seldin MF, Criswell LA, Gregersen PK, Beasley E, Thomson G, Amos CI, Begovich AB: PTPN22 genetic variation: evidence for multiple variants associated with rheumatoid arthritis. Am J Hum Genet. 2005, 77: 567-581. 10.1086/468189.
Meier L, van de Geer S, Bühlmann P: The group lasso for logistic regression. J R Stat Soc Series B Stat Methodol. 2008, 70: 53-71.
Davison AC, Hinkley DV: Bootstrap Methods and Their Application. 1997, Cambridge, Cambridge University Press.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol. 1995, 57: 289-300.
Acknowledgements
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. This publication was also made possible by NIH grant UL1 RR024992. We thank Yu Tao for data management assistance.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
GMD conceived the study, extended the methodology, performed the statistical analysis, interpreted the results, and drafted the manuscript. DCR participated in revision of the manuscript and interpretation of the results. CCG participated in the study design, interpretation of the results, and revision of the manuscript. All authors read and approved the final manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
D'Angelo, G.M., Rao, D. & Gu, C.C. Combining least absolute shrinkage and selection operator (LASSO) and principal-components analysis for detection of gene-gene interactions in genome-wide association studies. BMC Proc 3 (Suppl 7), S62 (2009). https://doi.org/10.1186/1753-6561-3-S7-S62
DOI: https://doi.org/10.1186/1753-6561-3-S7-S62
Keywords
 Ordinary Least Squares
 Variable Selection
 Bootstrap Confidence Interval
 Genetic Marker Data
 Regression Coefficient Parameter