Background

In a typical genome-wide association study (GWAS), several thousands to several millions of single nucleotide polymorphism (SNP) markers are genotyped in a sample size of several hundred to several thousand individuals, thus leading to many more predictor variables than response variables. In this case, multiple linear regression cannot be used because the unique ordinary least squares estimators of the regression coefficients are not defined. Methods that allow for more predictors than observations [1] may cause model overfitting. Overfitted models are likely to demonstrate poor predictive ability when applied to new data. To overcome these problems, penalized regression methods have been proposed [2,3,4,5,6], aiming at shrinking the regression coefficients toward zero. Depending on the form of the penalty function, some methods (eg, ridge regression [3]) only shrink the coefficients without setting them to zero, whereas other methods (eg, the least absolute shrinkage and selection operator [LASSO] regression [2]) allow shrinkage of the coefficients down to exactly zero, thus performing variable selection.

The strength of the penalty is controlled by a regularization parameter that determines the amount of shrinkage imposed. One challenge of penalized approaches is choosing an optimal value of the regularization parameter. This is often done by k-fold cross-validation to find the parameter value in the training folds that minimizes the average mean squared error in the test folds. Assessing the predictive performance can also be done using k-fold cross-validation. In this case, the two cross-validation experiments are combined into one so-called nested cross-validation. In nested cross-validation, an outer cross-validation loop is used to assess the predictive performance, while, within each outer fold, an inner cross-validation loop is used to find the regularization parameter [7]. The most commonly used number of inner and outer folds is 10 because it provides a reasonable classification accuracy [8].

Here, we focus on LASSO linear regression, which has the property of variable selection. We apply 10-fold cross-validation to assess the out-of-sample predictive performance, using 10-fold nested cross-validation to specify the penalty parameter. We explore the effect of the sample size on the predictive ability of LASSO regression.

Methods

The GAW20 data are based on the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study data set [9] that investigated the epigenetic determinants of triglyceride (TG) response. TGs are major blood lipids [10] that constitute an important biomarker of cardiovascular disease risk [11]. Previous GWAS studies found a number of loci associated with TG levels [12]. We focus on predicting the TG response by analyzing the GWAS (SNP) data and four measures of TG. The first two measures (at visits 1 and 2) were taken before the lipid-lowering drug treatment; the second two measures (at visits 3 and 4) were taken after the treatment.

We performed quality control (QC) on the GAW20 GWAS (SNP) data using standard procedures outlined in Turner et al. [13]. SNP-level QC removes 63,907 SNPs with low minor allele frequency (< 0.01), and 2694 SNPs for failing a test of Hardy-Weinberg equilibrium (p ≤ 0.00001). Of 822 individuals for whom we have SNP data, the simulated phenotype is available for 680 individuals and the real phenotype is available for 778 individuals. Working on a log scale, we take the mean of the TG measures for visits 1 and 2 as a baseline measure, and the mean of visits 3 and 4 as a follow-up measure. If the measure for either visit 1 or 2 is not available, we take the only available measure as a baseline measure. Similarly, if the measure for either visit 3 or 4 is not available, we take the only available measure as a follow-up measure. We adjust the follow-up measure for the baseline measure, age, center, smoking status, and first 20 principal components (PCs) of SNP effects using a linear regression. The number of PCs was defined by examining the quantile–quantile (Q-Q) plot of the p values from the ordinary linear regression. When 20 PCs were incorporated into the linear regression, the Q-Q plot showed no inflated p values, which suggests that the relatedness and population stratification are accounted for. We take the standardized residuals as our final phenotype.

Lasso

Consider a standard multiple linear regression, y = β0 +  + ϵ, where y is a vector of response variables; X is a n × p matrix of predictor variables; β0 is an intercept; β = (β1, …, βp) is a vector of regression coefficients; and ϵ is a vector of the error terms, ϵ~N(0, σ2). For n > p the estimated values of the coefficients are found by minimizing the residual sum of squares:

$$ {\widehat{\beta}}_o,\widehat{\boldsymbol{\beta}}=\mathrm{argmin}\ \left[{\sum}_{i=1}^n{\left({y}_i-{\beta}_o-{\sum}_{j=1}^p{\beta}_j{X}_{ij}\right)}^2\right] $$

However, in a typical GWAS, n < p. In this case, penalized regression is often used, where the estimators of β are found by minimizing the sum of the residual sum of squares and a penalty function:

$$ {\widehat{\beta}}_o,\widehat{\boldsymbol{\beta}}=\mathrm{argmin}\ \left[{\sum}_{i=1}^n{\left({y}_i-{\beta}_o-{\sum}_{j=1}^p{\beta}_j{X}_{ij}\right)}^2+\lambda P\left(\lambda, \boldsymbol{\beta} \right)\right] $$

Here P(λ, β) is the penalty function with a regularization parameter λ which controls the amount of shrinkage. The LASSO penalty [2] utilizes an 1-norm penalty, that is, \( P\left(\lambda, \boldsymbol{\beta} \right)=\left\Vert \boldsymbol{\beta} \right\Vert {}_{\ell_1} \); consequently, the estimators of the coefficients take the form:

$$ {\widehat{\beta}}_o,\widehat{\boldsymbol{\beta}}=\mathrm{argmin}\ \left[{\sum}_{i=1}^n{\left({y}_i-{\beta}_o-{\sum}_{j=1}^p{\beta}_j{X}_{ij}\right)}^2+\lambda {\sum}_{j=1}^p|{\beta}_j|\right] $$

One important property of the LASSO penalty is that it allows the coefficients to be set to exactly zero, thus performing variable selection.

Results

Simulated data

We analyzed replicate 84 of the simulated data, consulting the “answers” before performing the analysis. The Manhattan plot from standard linear regression shows no genome-wide significant associations, however, one of the known causal SNPs is nearly genome-wide significant (Fig. 1a). We applied LASSO regression on these data and assessed the predictive performance through 10-fold nested cross-validation, after reducing the number of SNPs to approximately 56,000 using a linkage disequilibrium (LD)–based clumping procedure implemented in the PLINK software [14]. This procedure groups correlated SNPs together and chooses one representative SNP per group, thus reducing dimensionality and eliminating the redundancy of information in the data. We assessed the prediction accuracy through the Pearson correlation coefficient (the square root of heritability), the mean squared error (lower values indicate better fit), and the slope of the best-fit line (a slope of 1 suggests perfect prediction). Figure 1b shows that the predicted phenotypes are heavily shrunk toward zero, resulting in a very low correlation between observed and predicted values and a slope that is far from 1. Poor prediction can be explained by the fact that the effect (regression coefficient) of the most significant true causal SNP is heavily shrunk (a mean estimate of − 0.051 and a SD of 0.062 over the 10 folds, compared to a mean estimate of − 0.443 and a SD of 0.02 over the 10 folds in standard linear regression).

Fig. 1
figure 1

Analysis of the GAW20 simulated data, replicate 84. a, Manhattan plot of p values from tests of association between SNP and phenotype. The black dots represent causal SNPs. The dashed line represents genome-wide significance. b, Prediction results. The black dashed line is the equality line; the red dashed line is the best-fit line

To investigate whether increasing the sample size can improve the prediction, we simulated phenotypes for 7753 individuals for whom we had SNP genotype data available from previous studies. The phenotypes were simulated using only 1 causal SNP with an effect size of − 0.44 (similar to the effect of the nearly genome-wide significant causal SNP from the GAW20 simulated data) and heritability of 0.05, which induces a correlation of approximately 0.23 between the observed and the predicted phenotype. First, we analyzed a subset of these data comprising 700 individuals (similar to the sample size of the GAW20 simulated data). The results resemble those seen in the GAW20 simulated data (Fig. 2). Even though the causal SNP is consistently picked up by LASSO, its effect is poorly estimated (the mean estimate is − 0.096 and the SD is 0.019 over the 10 folds) and has the same order of magnitude as the SNPs (false positive) chosen by LASSO, resulting in poor prediction. We then repeat the analysis using the full sample (7753 individuals). The Manhattan plot shows that the significance of the causal SNP has greatly increased (Fig. 3a). The prediction plots show three distinct clusters representing the estimated effects within the three genotype categories of the causal SNP (Fig. 3b). The effect of the causal SNP is much better estimated (the mean estimate is − 0.247 and the SD is 0.009 over the 10 folds) and the effect size is 10 to 1000 times greater than that of the non-causal SNPs falsely chosen by LASSO. The range of the regularization parameter across folds for the full data set (0.046 to 0.056) is smaller than that for the subset (0.081 to 0.183). This suggests that the regression coefficients for the full data set are much less penalized, greatly improving the prediction.

Fig. 2
figure 2

Analysis of the subset of the illustrative simulated data set (700 individuals). a, Manhattan plot of p values from tests of association between SNP and phenotype. The black dot represents the causal SNP. The dashed line represents genome-wide significance. b, Prediction results. The black dashed line is the equality line; the red dashed line is the best-fit line

Fig. 3
figure 3

Analysis of the illustrative simulated data set (7753 individuals). a, Manhattan plot of p values from tests of association between SNP and phenotype. The black dot represents the causal SNP. The dashed line represents genome-wide significance. b, Prediction results. The black dashed line is the equality line; the red dashed line is the best-fit line

Real data

We also applied LASSO regression to the real GAW20 data and assessed the predictive performance through 10-fold nested cross-validation, after again reducing the number of SNPs to approximately 56,000 using an LD-based clumping procedure. The Manhattan plot shows no genome-wide significant associations (Fig. 4a) and it resembles the Manhattan plot of the GAW20 simulated data. The prediction results are very similar to the results seen in the GAW20 simulated data (Fig. 4b). In a sense, these results are not surprising given the similar sample sizes of the simulated and the real data sets.

Fig. 4
figure 4

Analysis of the GAW20 real data set. a, Manhattan plot of p values from tests of association between SNP and phenotype. The dashed line represents genome-wide significance. b, Prediction results. The black dashed line is the equality line; the red dashed line is the best-fit line

Discussion

The prediction ability of LASSO was assessed on the simulated and the real GAW20 data sets through 10-fold nested cross-validation. Both data sets demonstrated poor predictive performance and exhibited a noticeable shrinkage of the predicted phenotype toward zero. By examining the effect of the most significant causal SNP in the simulated data, we found that it was heavily penalized. To investigate this issue, we analyzed a much larger data set (approximately 7000 individuals). We found that the effect of the causal SNP was much less penalized, thus enabling the best prediction possible with that SNP.

Conclusions

The prediction ability of LASSO was assessed on the GAW20 data sets and on the much larger data set available from previous studies. Poor predictive performance is achieved for data sets of a few hundred individuals with a weak signal. This can be explained by the fact that the LASSO regression coefficients are substantially shrunk. Other regularized methods that do not result in such a heavy shrinkage of the regression coefficients might be of use. For example, with hyper-LASSO [6] the extent of the shrinkage depends on the size of the coefficients, and adaptive LASSO [5] uses different adaptive weights for penalizing different coefficients. Both of these can potentially lead to a moderate shrinkage. However, with LASSO, increasing the sample size from a few hundred to a few thousand individuals increased the strength of the signal and reduced the amount of shrinkage of the regression coefficients, thus improving the prediction. We conclude that LASSO regression requires large sample sizes (several thousands of individuals) to achieve good prediction.