The goal of this paper is to develop and evaluate prediction methods and tools for genome-wide association studies, particularly for variable selection and dimension reduction. There is a demand for statistical techniques capable of handling large volumes of data in genetic studies. Technical advances have enabled the collection of massive high-dimensional datasets in such studies. This has called for broadening of the area of research in dimension-reduction techniques to provide methods for prediction and variable selection. For example, during the last decade, Li [1], Tibshirani [2], and Efron et al. [3] have paved new directions for dimension-reduction techniques and broadened the area to other applications of prediction, including genetics.

For this paper, we explore extensions of currently existing dimension-reduction methods and variable-selection methods related to genome-wide association studies (GWAS) single-nucleotide polymorphism (SNP) selection and gene-gene interactions for application to the disease classification problem based on genetic data. Recently, the focus has shifted to GWAS, where the emphasis can be placed on assessing whether multiple markers function together rather than depending on univariate tests and generalized linear models (GLM). Dimension-reduction techniques are a powerful tool because they provide a summary measure of massive amounts of data. We can apply such techniques to determine whether multiple marker pathways and gene-gene interactions are associated with the disease of interest. The highly dense genetic marker data from the rheumatoid arthritis study and the published reports about the study provide an ideal empirical dataset for developing and testing extensions of dimension-reduction methods.

There is a demand for statistical techniques to handle large volumes of data, particularly in the area of genetics. Genetic data is used to find genetic variants that are associated with rheumatoid arthritis risk (or other diseases) through the use of statistical modeling. The tendency for analyzing genotype data is to use GLM and univariate tests; however, these models perform poorly when analyzing high-dimensional data [4, 5]. The research objective of this study is to develop prediction tools primarily methods for variable selection and dimension reduction in a GWAS.

In an effort to improve variable selection, Tibshirani [2] developed the least absolute shrinkage and selection operator (LASSO), a penalized likelihood approach, for linear regression. Two important components of variable selection are prediction accuracy and interpretation. Ordinary least squares (OLS) is known to estimate coefficients with small bias but inflated variance. In the case of a large number of predictors, OLS has difficulty selecting the subset of predictors that appears to be the most important or to have the strongest effects. LASSO is a combination of ridge regression and subset selection developed to improve OLS by shrinking the coefficient values and setting some equal to zero. LASSO [2, 6] is similar to OLS with constraints and produces a stable and interpretable model. Nonlinear extensions of the LASSO exist such as modeling a binary outcome [6]. Principal-components analysis (PCA) is a nonparametric dimension-reduction approach. PCA is a linear transformation of the original data that incorporates second-order statistics to determine the optimal components that describe the functional relationship between the outcome and covariate [7]. The premise of PCA is to identify the orthogonal linear combinations with the largest covariance. The benefit to using PCA and LASSO is that both methods can accommodate correlation, such as linkage disequilibrium (LD), between SNPs. This advantage prompted us to select PCA and LASSO to model SNPs and genes; models such as GLM fail in the presence of LD [4].

We investigate PCA [7] and LASSO [2, 6, 8] methods to reduce the dimension of the genetic marker data and detect gene-gene interaction signals on chromosome 6. We explore the two methods, PCA and LASSO, combining variable selection and dimension-reduction techniques. The combined approach will further reduce the dimension of the data to detect signals from variants and gene-gene interactions in addition to the gene(s) discovered in the previously published work on rheumatoid arthritis [9, 10]. The bootstrap will be used to estimate the standard errors and confidence intervals of the LASSO coefficient estimates. We will compare the LASSO-PCA approach to the LASSO method including the entire set of SNP values, the logistic regression with individual PCA-PCA interaction, and the logistic regression with individual SNP-SNP interaction.


We denote Y i ∈ {0, 1} to be the outcome and Zi, k= {0, 1, 2}, k = 1,...,K, to be the SNP variables of a K-dimensional covariate vector Z i = (Zi,1,...,Zi, k)Twith n subjects, where i = 1,..,n is the subscript for the ith subject. Logistic regression is the model of choice for a binary outcome and it is a member of GLM. We specify Y to have a binomial distribution, Y i ~bin(n, μ i ), where the mean is , the linear predictor is given by , and the link function here is the logit function of the form g(μ i ) = ln [μ i /(1-μ i )] = η i . The link function describes the relationship between the mean of the distribution function and the linear predictor. The log-likelihood is of the form

LASSO was originally intended for linear regression and it has been extended to the GLM by Lockhorst [6]. The LASSO and GLM algorithms are combined to provide a generalized LASSO algorithm [6] to estimate the LASSO coefficients. The idea is to use an iteratively reweighted least-squares approach to compute estimates of the regression coefficients in a LASSO model while placing a constraint on the regression coefficients. The generalized LASSO algorithm begins with initial estimates of and , where h i is a specified weight. Initial values of β are not needed. Another option it to start with coefficient values of 0; however, this can take too long to converge. The covariates that are not constrained can be swept out. We denote these covariates as V i , and their regression coefficient parameters as denoted β. The covariates that are constrained are denoted as X i , and the regression coefficient parameters are denoted γ. The next step is to estimate the adjusted response variable, , that is of the form , where j denotes the iteration number, a denotes adjusted, and . The next step involves projecting the weighted independent variables and weighted adjusted dependent variable onto the column space of , where W is a weight and of the form and . The updated covariates and response variable are given by and The regression coefficients for V are estimated as β = (VTW(j)V)-1VTW(j)Ya,(j). The last step is to solve min(Y*,(j) - X*,(j)γ)T(Y*,(j) - X*,(j)γ) subject to ||γ||1t. The tuning parameter, t > 0, specifies the amount of shrinkage that will be applied to the coefficient estimates. The tuning parameter is estimated by selecting a normalized parameter, s, that is the ratio of the tuning parameter to the total effect size of the regression unbounded estimate, which is expressed as s = t/||(X*TX*)-1 X*TX*||1, 0 ≤ s ≤ 1. It should be noted when s = 1 there is no shrinkage. The estimates are updated and the iterative process is continued until convergence.

For each gene, the score derived from the PCA is a linear combination of the SNPs. This PCA score represents a summary measure of the SNPs from the gth gene in a condensed fashion, where the score is is the lth PCA component for the gth gene, and Z(g) is the raw SNP data from the gth gene. The components that account for at least 10% of the variance are chosen, where Dl, gis the summary measure to determine the percentage of variance for the gth gene, , and dl, gdenotes the eigenvalues obtained from the PCA for the gth gene.

The R package we used for analysis is LASSO2. LASSO2 has limited capabilities when analyzing categorical data, such as the inability to estimate the standard errors. As recommended by Meier et al. [11], we used the bootstrap [12] to estimate the standard errors and confidence intervals. A non-zero LASSO coefficient value indicates the variable should be considered for variable selection and further investigation is necessary. The bootstrap confidence interval can indicate the statistical importance of a covariate from the LASSO. We selected C = 1000 bootstrap samples from the data (Y, Z, S) with replacement. For each of these bootstrap samples, we estimated the LASSO coefficient for the cth bootstrap sample where c = 1,...,C and the star (*) indicates the estimate is from the bootstrap. The average of the bootstrapped estimate is [12]. The variance of the bootstrapped estimate is [12]. An estimate of the bias of θ is [12]. The normal-theory interval is used to estimate the 95% bootstrap confidence interval. We assume θ has a normal distribution, , and the confidence interval is of the form [12].


The HLA-DRB1 gene on chromosome 6 has been linked to rheumatoid arthritis [9]. Based on this finding, we decided to evaluate markers from chromosome 6. We focused on markers from a subset of the genes that were explored in studies conducted from 1992 to 2003 [9]. A total of 135 SNPs were considered for analysis from 28 genes: AP (n = 1), HLA class (n = 16), MICA-MICF (n = 6), TAP (n = 2), and TNF (n = 3). PLINK has been used for quality control. From chromosome 6, there are 35,574 markers, with 33,585 SNPs left after removing those that failed the Hardy-Weinberg equilibrium test (p ≤ 0.001), the missingness test (GENO > 0.1), and the frequency test (minor allele frequency < 0.01). We have removed the SNPs that did not meet the quality control criteria.

The intercept was the only variable swept out in the LASSO model. The number of components selected with PCA ranged from one to two. All PCA scores and the corresponding PCAgene_a-PCAgene_binteractions were entered into the LASSO model to determine whether there was gene-gene interaction. Here, the PCAgene_ais a PCA score from the ath gene and the PCAgene_bis a PCA score from the bth gene, where ab. Table 1 has the results indicating 16 potential interactions with their bootstrap standard error and bootstrap confidence interval. Based on the bootstrap estimates, only two gene-gene interactions of HLA-DRA × HLA-DRB9 and HLA-DRA × MICA were significant. Of these 16 potential gene-gene interactions, we entered the raw SNP values and the corresponding SNPgene_a-SNPgene_binteractions into the LASSO model to determine whether the same genetic relationships exist. Here, the SNPgene_ais a SNP from the ath gene and the SNPgene_bis a SNP from the bth gene, where ab. Eleven gene-gene interactions were suggested from the LASSO-SNP method, while three of these gene-gene interactions were suggested from the LASSO-PCA analysis. However, the final results are the same from both the LASSO-PCA and LASSO-SNP method, where there were two significant gene-gene interactions of HLA-DRA × HLA-DRB9 and HLA-DRA × MICA. We did explore selecting the components using a scree plot; it often selected too many components with noise. In addition, we set the value of the normalized parameter to 0.5 and explored various normalized parameter values to determine the optimal value for variable selection. Our analysis was inconclusive on the best measure to select an optimal value and we will explore this further in the future.

Table 1 LASSO results

Additionally, we ran logistic regression models with the individual SNPgene_a-SNP_gene_binteractions and the individual PCAgene_a-PCAgene_binteractions to compare methods. A multiple-comparison procedure was applied using the Benjamini and Hochberg [13] method, which controls the false-discovery rate. With the individual SNP-SNP interaction from the logit model, we found 337 significant interactions that reduced to 78 unique gene-gene interactions; out of these, 11 overlapped with the LASSO findings. For the individual PCAgene_a-PCAgene_binteractions in the logit model, we found 37 gene-gene interactions and only 5 overlapped from the LASSO findings. The two gene-gene interactions consistently found to be significant across all four approaches were HLA-DRA × HLA-DRB9 and HLA-DRA × MICA. This suggests that the individual SNP-SNP interactions may function jointly instead of independently. Further investigation of the LASSO and PCA approach will be pursued. A third approach was explored that involved placing all 135 SNPs in a LASSO model to determine whether there were any variant-variant signals. There were limitations to this approach due to the large amount of categorical data and large number of interactions. We did not pursue this method much further after recognizing the analysis had to be split into three LASSO models.


In GWAS there is an overwhelming amount of data and it can be difficult to distinguish between true signals and spurious results based only on single-marker analysis. Our approach is focused on assessing whether multiple markers act together in producing the phenotype. We demonstrate a combined approach of a dimension-reduction method, PCA, and a variable-selection method, LASSO, to detect gene-gene interaction signals. We have extended the LASSO method to estimate standard errors and confidence intervals with the bootstrap.

Interestingly enough, whether the principal-component score or the raw SNP values were placed into the LASSO, the final results were the same. The results from the individual interaction PCA logit models and individual interaction SNP logit models had overlapping results and revealed the same interactions found in the LASSO method. This suggests the PCA-LASSO method shows promise. At this time we suggest using it with other conventional approaches to narrow down genetic signals. The advantage to our method is that highly collinear data and a large number of variables can be reduced to a manageable dimension, where LD is accommodated by LASSO and PCA. Also, a large number of SNPs can be represented as a function of a gene.

A limitation of this current work is that we cannot conclude whether our PCA-LASSO method is an improvement over other gene-gene variable-selection methods. We will further investigate the threshold of the number of covariates in the LASSO model. We propose to do simulation studies in the future that will compare the PCA-LASSO approach to other variable selection methods [4, 11]. Simulation studies are necessary to determine the properties of the PCA-LASSO approach. We will also pursue study of the normalized parameter in our future work.