On the impact of model selection on predictor identification and parameter inference

We assessed the ability of several penalized regression methods for linear and logistic models to identify outcome-associated predictors and the impact of predictor selection on parameter inference for practical sample sizes. We studied effect estimates obtained directly from penalized methods (Algorithm 1), or by refitting selected predictors with standard regression (Algorithm 2). For linear models, penalized linear regression, elastic net, smoothly clipped absolute deviation (SCAD), least angle regression and LASSO had a low false negative (FN) predictor selection rates but false positive (FP) rates above 20 % for all sample and effect sizes. Partial least squares regression had few FPs but many FNs. Only relaxo had low FP and FN rates. For logistic models, LASSO and penalized logistic regression had many FPs and few FNs for all sample and effect sizes. SCAD and adaptive logistic regression had low or moderate FP rates but many FNs. 95 % confidence interval coverage of predictors with null effects was approximately 100 % for Algorithm 1 for all methods, and 95 % for Algorithm 2 for large sample and effect sizes. Coverage was low only for penalized partial least squares (linear regression). For outcome-associated predictors, coverage was close to 95 % for Algorithm 2 for large sample and effect sizes for all methods except penalized partial least squares and penalized logistic regression. Coverage was sub-nominal for Algorithm 1. In conclusion, many methods performed comparably, and while Algorithm 2 is preferred to Algorithm 1 for estimation, it yields valid inference only for large effect and sample sizes.


Introduction
Many regression procedures have been proposed in the recent literature that use penalties on regression coefficients in order to achieve sparseness or shrink them toward zero. These methods are popular for the analysis of datasets with large numbers of predictors, as they allow efficient selection of regression variables. While in many applications the primary interest is in identifying outcome associated covariates, it is nonetheless sometimes also desirable to gain scientific insights into the data generating process, and draw statistical inference on the parameters associated with the selected variables. Biases in estimates as well as in standard errors and confidence intervals become important if investigators focus on the magnitude of the observed effects.
Selection of predictor variables is a special case of model selection, which can be stated as follows. Let M denote the space of all candidate models that could be used to describe the data D. For our purposes M is characterized in terms of distribution functions that depend on parameters β and may or may not contain the true model that gave rise to the data. The model selection problem is to choose a modelM (D) in M such thatM is a "good" model in terms of parameter estimation or prediction. If the focus is on inference regarding the parameters, then the quantity of interest isβ(M). Model selection is a source of variability that is often ignored in standard statistical approaches. However, several authors, e.g. Sen (1979), Pötscher (1991) and Leeb (2005), have shown that the asymptotic distribution of the post-model selection estimates n 1/2 (β − β), where n denotes the sample size, is typically non-normal, and depends on the unknown β in complex fashions.
Some analytical results are available for penalized maximum likelihood estimators obtained from LASSO (Tibshirani 1996), SCAD (smoothly clipped absolute deviation; Fan and Li 2001) and hard thresholding for linear regression models, see e.g. Knight and Fu (2000), Leeb and Pötscher (2009) and Pötscher and Schneider (2009). These estimators have highly non-normal finite sample distributions and under conservative model selection their large sample distribution can be far from normal. Even under consistent model selection (pointwise) asymptotic analysis gives highly misleading results. In addition, the large sample properties depend on the choice of tuning parameters. Therefore the naively estimated standard error for those estimates will be biased and confidence intervals based on standard asymptotic theory for these methods may not have proper coverage, not even asymptotically.
No comprehensive comparisons of penalized approaches with respect to their finite sample properties have been performed to date, and little work has been done for non-linear models. We thus studied the properties of estimates obtained from popular penalized likelihood approaches applied to linear and logistic regression models using simulated data, focusing on realistic effect and sample sizes to make conclusions applicable to practical settings (Sect. 2.2). We first assess the methods' ability to identify truly outcome-associated predictors. We then study properties of effect estimates obtained directly from penalized methods (Algorithm 1), or by refitting selected predictors with standard regression (Algorithm 2) described in Sect. 2.3. The results presented in Sect. 3 can help to avoid overly optimistic interpretation of parameters in future research.

Methods
The goal of this analysis is to assess the impact of model selection on parameter estimates in linear and logistic models. We evaluated the influence of sample size, n, and magnitude of the regression coefficients for associations β on each method's ability to identify outcome associated predictors. We also studied properties of effect estimates obtained directly from penalized methods (Algorithm 1), or by refitting selected predictors with standard regression (Algorithm 2). A range of sample sizes, algorithms and correlation structures among predictors are utilized.

Estimation methods and algorithms
We calculated both the LASSO (Tibshirani 1996), and least angle regression (LARS, Efron et al. 2004) estimates with the function lars in the lars library of the statistical package R (Ihaka and Gentlemen 1996). The elastic net was fit using the function enet in the elasticnet library (Zou and Hastie 2005). We used the function relaxo in the library relaxo in R to fit relaxed LASSO, a generalization of the LASSO shrinkage technique (Meinshausen 2007). Generalized linear model (GLM) estimates with L1 (LASSO) and/or L2 (ridge) penalties, or a combination are obtained using the library and function penalized (Goeman 2010).
To fit L2 penalized logistic regression models with a stepwise variable selection, we used the function plr in the package stepPlr (Park and Hastie 2008). We also used an R implementation of SCAD, available at http://www.stat.umn.edu/~hzou/ ftpdir/code/one-step-SCAD-funs.R (accessed 05/09).
For linear and binary outcome data regression coefficients for penalized partial least squares were obtained using the function penalized.pls in the library ppls (Krämer et al. 2008).
We used fivefold cross validation to select tuning parameters for all the methods that allowed that option. Table 1 summarizes the algorithms and software packages.

Continuous outcome data
Each observation in a data set of size n contains the predictors X = (X 1 , . . . , X p ) , and the continuous outcome variable, Y . We assumed that only a small number of predictors p * < p are associated with Y and denote those by X * = (X * 1 , . . . , X * p * ) , a 1 × p * subvector of X. For ease of exposition we let the predictors in X be ordered so that the first p * values of X correspond to X * . Given X * and β * = (β * 1 , . . . , β * p * ) , a 1 × p * vector with β * i = 0, the response Y was generated from the linear model For each simulation, we then fit a linear model using all available predictors, X, i.e. assuming using the methods given in Sect. 2.1 and obtained the 1 × p vector of parameter estimatesβ of β. In all settings we studied, p = 50 and p * = 10 and the variance of the error term in Eq. (1) was σ 2 * = 1. We also assessed the robustness of the methods by generating from a t-distribution with two degrees of freedom.
We generated X from a multivariate normal distribution, with mean 0 and correlation matrix X . To assess the impact of various correlation structures among the predictors on the performance of the methods, we studied several choices of X = (σ i j ), i, j = 1, . . . , p. They include the independence correlation structure, X = I, where I denotes the p × p identity matrix, a block diagonal structure for X , where each block submatrix has dimension 5 × 5 and constant entries σ i j = 0.5, i = j for |i − j| ≤ 5, and σ i j = 0 otherwise, and an autoregressive (AR) correlation structure for X with σ i j = 0.5 |i− j| for |i − j| ≤ 10 and σ i j = 0 otherwise.

Binary outcome data
Binary data, labeled "controls" (Y = 0) and "cases" (Y = 1), were simulated similarly to the continuous outcomes. The probability P(Y = 1) was a function of the predictors X * , the p * dimensional subvector of X: where logit(x) = exp(x)/{1 + exp(x)}. For each simulation we created a population of subjects by drawing Y from a Bernoulli distribution with success probability given by model (3) with β * 0 = −1, given X, and then sampled a fixed number of cases and controls to obtain a case-control study, a design popular in biological applications. Again, p = 50 and p * = 10.
For each simulation, we obtained estimatesβ by fitting a logistic model using all available predictors, X, i.e. assuming using the methods given in Sect. 2.1. We study multivariate normally distributed predictors X that have mean zero and X = I, and also binary predictors X, that is X i = 0 or X i = 1, with P(X i = 1) = 0.5.

Parameter choices and sample sizes
For simplicity, we assume that all outcome associated β * = (β * 1 , . . . , β * p * ) coefficients in models (1) and (3) have the same magnitude, but half of them are positively and half of them are negatively associated with Y , i.e. the β * i s differ by their sign. We chose β * i = 0.25, 0.5 and 1.0 for both linear and logistic models. For continuous outcomes the sample sizes were n = 100, n = 200, and n = 500. For the binary outcome setting we used a case-control design with equal numbers of controls (n 0 ) and cases (n 1 ) with n 0 = n 1 = 100, n 0 = n 1 = 200, and n 0 = n 1 = 500. All simulations and analyses were implemented in R.

Analysis
We assessed the performance of two strategies to obtain parameter estimates and their standard errors for both linear and logistic regression models.

Algorithm 1 (Adaptive approach)
This is a one-stage approach that uses the estimatesβ obtained from the respective procedure. We denote the vector of coefficients ofβ that are either the intercept or are non-zero byβ adapt , and byX Si the vector of predictors for the i th subject corresponding to the intercept and the non-zero parameter estimates. The corresponding p adapt × n design matrix isX S . We letσ 2 adapt be the mean squared error of the fit forβ, but with the degrees of freedom n − p adapt , where p adapt is the dimension ofβ adapt . The covariance matrix ofβ adapt is estimated as Algorithm 2 (Oracle approach) This is a two stage approach. First we obtainX S as in Algorithm 1. In stage two we regress Y onX S to get a p adapt × 1 vectorβ oracle of new parameter estimates, which include an intercept. We letσ 2 oracle be the mean squared error of the fit whenβ oracle is used, with n − p adapt degrees of freedom. The estimated covariance matrix ofβ oracle is then cov β oracle =σ 2

Logistic regression
Like for linear regression,X Si is the vector of predictors for the i th subject corresponding to the intercept and the non-zero components ofβ andX S is the corresponding design matrix.

Algorithm 1 (Adaptive approach)
Again,β adapt denotes the vector of coefficients ofβ obtained from the respective procedure that are either the intercept or are non-zero. Letting logit (p a i ) =X Siβadapt , p adapt = (p a 1 (1 −p a 1 ), . . . ,p a n (1 −p a n )) andV adapt =p adapt I, where I denotes the identity matrix, we compute the covariance matrix of the estimates as cov β adapt = X SVadaptX S −1 . (7) Algorithm 2 (Oracle approach) First we obtainX S , and then computeβ oracle by re-fitting the standard logistic regression model withX S instead of X to the outcome data. Letting logit ( ) andV oracle =p oracle I, where I denotes the identity matrix, we compute the covariance matrix of the estimates as cov β oracle = X SVoracleX S −1 . (8)

Performance criteria
We evaluated the influence of sample size, n, and magnitude of the associations β * on each method's ability to identify the true outcome associated predictors X * and on the two algorithms described above to estimate the corresponding regression parameters β * .

Performance criteria for variable selection
False positives (FPs) Let β = (β 0 , β 1 , . . . , β p ) = (β * , 0) , where 0 is a 1 × ( p − p * ) vector of zeros. A FP occurs for β j when β j = 0 but its regularized estimateβ j = 0. The FP rate for β j is the percentage of times an FP occurs for β j , and the overall FP rate is the average of the FP rates across all zero coefficients of β.
False Negatives (FNs) A FN occurs for β j when β j = 0 but its regularized estimatê β j = 0. The FN rate for β j is the percentage of times a FN occurs for β j , and the overall FN rate is the average of the FN rates across all non-zero coefficients of β.

Impact of model selection on parameter estimates, coverage computations
The following coverages of the 95 % confidence intervals (CIs) for linear and logistic models were computed. The coverage of zeros is the number of times that either the regularized estimate of a β j = 0 coefficient is zero, i.e.β j = 0, or the 95 % CI of β j = 0 includes zero divided by the number of p − p * of zero coefficients. The 95 % CIs are computed using the asymptotic approximation, the normal distribution, and the standard errors from either Algorithm 1 or Algorithm 2. We compute the coverage of zeros separately for the zero and non-zero coefficients of β and report the average over all β j = 0 and β j = β * j = 0 respectively. The coverage of the true β * coefficients is the number of times that the 95 % CI aroundβ adapt orβ oracle includes the true value of β * j (= β j = 0) divided by the number p * of non-zero coefficients β * . Again, we report the average over all coefficients β * j , j = 1, . . . , p * .

LARS
The FP rate ranged from 11.1 to 34.8 %, and was slightly lower for the block correlation structure than the AR or independent correlations ( Fig. 1 and Supplemental Table 1). The FN rate was below 0.05 for sample sizes n = 200 and n = 500 and around 20 % for n = 100 for all effect sizes and correlations. The coverage of zero for β j = 0 was close to 100 % for Algorithm 1 and around 95 % for Algorithm 2. The coverage of zero for β * = 0 was 0 % for both algorithms for n = 500 for all effect sizes (Fig. 2). The 95 % CI coverage of β * for theβ = 0 coefficients was around 95 % for Algorithm 2 with n = 500. It also was around 95 % for both algorithms for n = 200 and n = 500 for the block correlation structure, but for all other correlations Algorithm 1 had lower coverage than Algorithm 2, generally below 90 % (Fig. 3).

LASSO
Similar to LARS, the FP rate was slightly lower for the block correlation structure than the AR or independent correlations ( Fig. 1 and Supplemental Table 2), and it ranged from 10.3 to 44.6 %. The FN rate was below 5 % for n = 200 and n = 500, and for n = 100 with β * = 0.5 and β * = 1. The coverage of zero for β j = 0 was close to 100 % for Algorithm 1 and around 95 % for Algorithm 2 for all sample and effect sizes. The coverage of zero for β * was 0 % for both algorithms for n = 500 for all effect sizes (Fig. 2). The coverage of β * forβ = 0 estimates was slightly lower for Algorithm 1 than 2. Algorithm 2 had 95 % coverage for n = 500 and slightly below 95 % for n = 200 (Fig. 3). Algorithm 1 had somewhat higher,   albeit still less than 95 % coverage for the block correlation structure than the other correlations.

Elastic net
The FP rate ranged from 30.6 to 47.4 % for independent predictors, while it was below 4 % for the block correlation structure for all values of n and β * (Table 4). For the AR correlation structure the FP rate was less than 5 % for n = 500 for all values of β * . The FN rate was low for all correlation structures, and less than 5 % for n = 200 and 500, regardless of the effect sizes ( Fig. 1 and Supplemental Table 3). The coverage of the zero coefficients was close to 100 % for both algorithms. The coverage of zero for coefficients corresponding to β * was 0 % for both algorithms for n = 500 for all effect sizes (Fig. 2). Overall, the coverage of β * forβ = 0 coefficients was noticeably higher for Algorithm 2 than for Algorithm 1. Algorithm 2 had close to 95 % coverage with the exception of small sample sizes. For n = 100 with β * = 0.25 the coverage fell below 90 %, likely due to variables not being selected (Fig. 3).

Relaxo
The FP and FN rates were slightly lower for the block correlation structure than the independent or AR correlations, but were less than 5 % for n = 200 and n = 500 for all effect sizes and correlations ( Fig. 1 and Supplemental Table 4). Both FP and FN rates also dropped quickly as n increased. For example, for the independent correlation structure and β * = 0.25 the FP and FN rates were 15.6 and 33.0 % for n = 100 and 3.1 and 0.4 % for n = 500, respectively. For all correlation structures the coverage of zero for β = 0 was close to 100 % for both algorithms. The coverage of zero for β * was close to 0 % for both algorithms for n = 500 for all effect sizes (Fig. 2). The coverage of β * forβ = 0 coefficients was similar for Algorithms 1 and 2, and close to 95 % for n = 500 for all effect sizes, and for n = 200 with β * = 0.5 and β * = 1.0 for all correlation structures (Fig. 3).

SCAD
The FP rates were very similar for all correlation structures and less than 5 % for n = 500 with β * = 1. The FN rate was generally low and dropped quickly as n increased ( Fig. 1 and Supplemental Table 5). For n = 200 or n = 500 it was less than 5 % for all values of β * . It was greater than 10 % only for n = 100 with β * = 0.25. The coverage of zero for β = 0 was close to 100 % for Algorithm 1 and 95 % for Algorithm 2. The coverage of zero for β * was 0 % for both algorithms for n = 500 for all effect sizes (Fig. 2). For all correlation structures the coverage of the 95 % CIs of β * forβ = 0 coefficients was close to 95 % for n = 500 for Algorithm 2 for all effect sizes, and for Algorithm 1 for n = 500 with β * = 1. Again, the coverage was noticeably higher for Algorithm 2 than for Algorithm 1 and it increased for both algorithms with sample and effect size (Fig. 3).

Penalized penalized linear regression
The FP rate ranged from 17.9 % for n = 100 with β * = 0.5 and the block correlation structure to 48.5 % for n = 100 with β * = 0.25 and independent predictors ( Fig. 1 and Supplemental Table 6). The FN rate was 16.2 % for n = 100, β * = 0.25 for the independent correlation structure, but for all other n and effect sizes it was less than 1 %. For all correlation structures the coverage of zero for the β = 0 coefficients for Algorithm 1 was higher than 99 % for all sample sizes and effect sizes, while for Algorithm 2 the coverage was around 95 %. The coverage of zero for β * was 0 % for both algorithms for n = 500 for all values of β * . For independent X, the coverage of β * of the 95 % CIs forβ = 0 coefficients ranged from 78.7 to 86.6 % for Algorithm 1. It was around 95 % for Algorithm 2 only for n = 500, but lower for n = 100 and n = 200. Similar patterns were seen for the AR correlation structure. For the block correlation structure both algorithms had close to 96 % coverage for n = 200 and n = 500 for all effect sizes (Fig. 3).

Partial least squares
The FP rate was less than 5 % for most correlation structures and effect sizes, the only outlier was the FP value of 15.3 % for n = 500 and β * = 1 with the independent correlation structure ( Fig. 1 and Supplemental Table 7). However, the FN rate ranged from 15.3 % for n = 500, β * = 1 and the AR correlation structure to 87.6 % for n = 100 with β * = 0.25 for independent predictors, and was above 60 % for all correlation structures and most sample sizes. The FN rate was lower for larger effect and sample sizes. The coverage of zero for β j = 0 was around 100 % for both algorithms for settings. The coverage of zero for β * for n = 500 ranged from 15.3 to 81.2 %, and was not different for the two algorithms (Fig. 2). For both algorithms the coverage of β * of the 95 % CIs forβ = 0 was very low for all correlation structures, ranging from 0 to 65.7 % (for independent X, with n = 500 and β * = 1) (Fig. 3).

Non-normal error distribution
When we generated outcome data from a linear model (1) where the error term followed a t-distribution with 2 degrees of freedom for independent X (Table 2), the FP rate was lower for LARS, LASSO, elastic net and relaxo while for these methods the FN rate was higher compared to normally distributed errors. In contrast, for SCAD, the FP rate was much higher and the FN rate much lower than for normal errors. Simulation runs based on penalized partial least squares regression failed to give reasonable results in so many instances that we do not present any results for this method.
For all methods the coverage of zero for the β = 0 coefficients for Algorithms 1 and 2 was very similar to the normal case. For all methods the coverage of zero for β * was much higher than for normally distributed errors. The coverage of β * of the 95 % CIs forβ = 0 coefficients however was much lower than in the normal case for Algorithm 1 and Algorithm 2 and much below the nominal 95 %. When the errors were generated from a t-distribution with 15 degrees of freedom however (Supplemental Table 8), the coverage was much improved and similar to that seen for normally distributed error terms.

Table 2
Performance of the various methods when the error distribution in the linear model (1)

Results for p > n
We also attempted to assess the performance of the methods when p > n by generating data with p = 500 and p * = 10 for n = 100, 200 and n = 500 for independent predictors X. SCAD, elastic net and penalized linear regression resulted in so many error messages that we do not present any findings for these algorithms. Results for LARS, LASSO and relaxo are given in Table 3. For both LARS and LASSO, the FP rate was lower than for the p < n setting for n = 100 and 200, but was above 74 % for n = 500 for all values of β * The FN rate was low except for β * = 0.25 with n = 100. The coverage of zero for the β = 0 coefficients for Algorithm 1 was higher than 99 % for all sample sizes and effect sizes, while for Algorithm 2 the coverage was below 70 %. The coverage of zero for β * ranged from 0 to 93 % for Algorithm 1 and from 0 to 26 % for Algorithm 2. The coverage of β * of the 95 % CIs forβ = 0 coefficients was less than 9 % for both algorithms.
For relaxo, the FP and FN rates were similar to those seen for LARS and LASSO for n = 100 and n = 200, with the exception of the FP rates for n = 500, which were less than 2 %. The coverage of zero for the β = 0 coefficients for Algorithm 1 ranged from 2 to 83 %, for all sample sizes and effect sizes, while for Algorithm 2 the coverage was below 62 %. The coverage of zero for β * was below 5 % for β * = 0.5 and β * = 1 for both algorithms. The coverage of β * of the 95 % CIs forβ = 0 coefficients was less than 10 % for both Algorithm 1 and 2.

Summary of results for linear regression
The estimation methods that had a high false positive (FP) rate were LARS, LASSO, elastic net, SCAD and penalized linear regression. Not surprisingly, the FN rate of these methods was low. Partial least squares regression had a low FP rate at the cost of having many false negatives. Only relaxo had both a low FP and FN rate. The coverage of zero for the β = 0 coefficients for Algorithm 1 was close to 100 % for all methods, while for Algorithm 2 it was closer to 95 %. The coverage of zero of the β * coefficients was close to zero for all methods with the exception of penalized least squares (Fig. 3). The coverage of the true β * coefficients of the 95 % CIs aroundβ = 0 was typically higher for Algorithm 2 than for Algorithm 1. For Algorithm 2 it was close to 95 % for large sample sizes and effect sizes for all methods with the exception of penalized partial least squares, for which coverage even for n = 500 with β * = 1 was around 65 %. When p > n, the coverage of both algorithms was much lower than 95 %, however.

LASSO
The FP rate was somewhat higher for binary predictors than for independent normally distributed X, but for both it was appreciable, with values up to 44 % even for large effect and sample sizes ( Table 4). The FN rate was above 50 % for β * = 0.25, but was less than 6 % for binary predictors with β * = 1.0 for all sample sizes, for β * = 0.5 Table 3 Results for p = 500 and p * = 10 with the independent covariance matrix for linear regression models Corresponds to Algorithm 1, and b to Algorithm 2 in the text for n = 200 and 500, and for normally distributed X with β * = 1.0 for n = 200 and 500. The coverage of zero for β = 0 was nearly 100 % for Algorithm 1 and closer to 95 % for Algorithm 2. The coverage of zero for β * was higher for Algorithm 1 than Algorithm 2. For Algorithm 1 the coverage of zero for β * ranged from 0 % for n = 500 with β * = 1 and binary predictors to 99.7 % for n = 100 with β * = 0.25. The coverage of β * of the 95 % CIs aroundβ = 0 was very low for both algorithms, with the exception of n = 500 and β * = 1.0 for normally distributed and β * = 0.5 and β * = 1.0 for binary X, where coverage was close to 95 %. SCAD For both, independent normally distributed and binary X the FP rate was very low; the largest value was 5 % for n = 500 with β * = 0.5, while the FN rate was extremely high, with values above 80 % for many other settings (Table 5). Only for n = 500 with β * = 1.0 and for binary predictors also for β * = 0.5 was the FN rate below 15 %. The coverage of zero forβ j = 0 was nearly 100 % for both algorithms. The coverage of zero for β * = 0 was similar for both algorithms and ranged from 0.04 to 99.6 %. It dropped as sample size and effect size increased. The coverage of β * of the 95 % CIs aroundβ = 0 was very low for both algorithms, with the exception of n = 500 and β * = 1.0 for both normally distributed and binary predictors, for which the coverage was approximately 93 %.
Penalized logistic regression For the independent normally distributed and binary predictors X the FP rate was similar, and ranged from 44.2 to 72.0 %. We observed an FP of 45.3 % for independent normal X even for β * = 1 and n = 500 (Table 6). The FN rate depended more strongly on the effect size, was somewhat higher for normally distributed X but in all cases decreased noticeably as n increased. For example, for normally distributed predictors with β * = 0.5, the FN rate was 28.3 % for n = 100, 13.4 % for n = 200 and 3 % for n = 500. The coverage of zero for β = 0 was nearly 100 % for Algorithm 1 and between 91.5 and 94.8 % for Algorithm 2. The coverage of zero for β * ranged from from 0.0 to 99.9 % for Algorithm 1 and was slightly lower for Algorithm 2. It dropped as sample size and effect size increased for both algorithms. The coverage of β * of the 95 % CIs aroundβ = 0 was slightly lower for Algorithm 1 than 2. For β * = 0.5 and β * = 1.0 with n = 500 Algorithm 2 had a coverage of nearly 95 %.
Adaptive logistic regression For all X the FP rate was less than 10 % for n = 200 and n = 100 for all effect sizes, while the FN rates for those n ranged from 67 to 85 % (Table 7). For n = 500, the FP rate was 5 % for β * = 0.25 and 20 and 22 % for β * = 0.5 and β * = 1 respectively, with corresponding FN rates of 77, 22 and 0. %. For binary predictors the FP rate was higher, and ranged from 6 to 23 %, and FN rates ranged from 0 to 81 %. The coverage of zero for β = 0 was nearly 100 % for Algorithm 1 for sample sizes n = 100 and n = 200. For n = 500 with effect sizes β * = 0.5 and β * = 1.0 the coverage for Algorithm 1 was 95 %. The coverage of zero for β * ranged from from 0.0 to 99 % for Algorithm 1 and was slightly lower for Algorithm 2. It dropped as sample size and effect size increased for both algorithms. The coverage of β * of the 95 % CIs around β = 0 was slightly lower for Algorithm 1 than 2. However, Algorithm 2 had 94 % coverage only for n = 500 and β * = 1.0 and β * = 0.5. For all other sample and

Summary of results for logistic regression
LASSO and penalized logistic regression had a high FP rate and a low FN rate. SCAD had a low FP rate at the cost of having many FNs. Adaptive logistic regression had a moderate FP rate and a high FN rate. The coverage of zero for the β = 0 coefficients was close to 100 % for Algorithm 1, while for Algorithm 2 it was closer to 95 % for all methods. The coverage of zero of the β * coefficients was close to zero for all methods with the exception of penalized logistic regression. The coverage of the true β * coefficients of the 95 % CIs aroundβ = 0 was close to 95 % for Algorithm 2 for large sample sizes and effect sizes for all methods with the exception of penalized logistic regression for which coverage even for n = 500 with β * = 1 was around 80 %. It was lower for Algorithm 1.

Discussion
Penalized estimation methods deliberately introduce a bias to reduce variability of the estimates to identify outcome-associated variables, and have been typically applied to prediction. Nonetheless, penalized regression techniques are also used sometimes when the aim is inference. For example, they have been applied to molecular genetic data for both prediction, and identification of disease susceptibility genes. We therefore assessed the performance of several readily available penalized estimation methods for linear and logistic regression. We performed only a small simulation study for the setting of p > n for which asymptotic results on consistent variable selection are very limited. Our main focus was on situations often encountered in practical settings, where the sample size n ranges from twofold larger to tenfold larger than the number of parameters, p. First we quantified the methods' ability to identify truly outcome associated predictors, i.e. to estimate the sparsity patterns of a vector β of regression coefficients. For linear models, penalized linear regression, elastic net, smoothly clipped absolute deviation (SCAD), least angle regression (LARS) and LASSO had a low false negative (FN) predictor selection rates but false positive (FP) rates above 20 % for all sample and effect sizes. Partial least squares regression had few FPs but many FNs. Only relaxo had low FP and FN rates. For logistic models, LASSO and penalized logistic regression had many FPs and few FNs for all sample and effect sizes. SCAD and adaptive logistic regression had low or moderate FP rates but many FNs.
We also evaluated inference properties for the various procedures. We studied effect estimates obtained directly from penalized methods (Algorithm 1), or by refitting selected predictors with standard regression (Algorithm 2). 95 % confidence interval coverage of predictors with null effects was approximately 100 % for Algorithm 1 for all methods, and 95 % for Algorithm 2 for large sample and effect sizes. Coverage was low only for penalized partial least squares (linear regression). For outcome-associated predictors, coverage was close to 95 % for Algorithm 2 for large sample and effect sizes for all methods except penalized partial least squares and penalized logistic regression. Coverage was sub-nominal for Algorithm 1. In conclusion, while Algorithm 2 is preferred to Algorithm 1, estimates from Algorithm 2 are still prone to some bias arising from the selection of predictors, which affects those associated with moderate effect sizes more strongly than predictors with large effect sizes.
All procedures were somewhat sensitive to violations of the assumption of normality for the error distribution for the linear model. When we generated outcome data from a linear model where the error term followed a t-distribution with 2 degrees of freedom the FN rate was higher compared to normally distributed errors for LARS, LASSO, elastic net and relaxo, while for SCAD the FP rate was much higher, and penalized partial least squares regression generally failed to give results. For outcome-associated predictors, the coverage of the 95 % CIs was much below the nominal 95 % for all procedures.
We addressed the problem of coverage much more extensively than previous publications (e.g. Wang and Leng 2007;Kabaila and Leeb 2006), including many popular penalized methods in our simulations, and also focused on false positive and false negative findings. We simulated practically relevant settings that reflect the number of predictors seen in many datasets, and showed that even for large sample sizes estimates are subject to undue bias and variance from the model selection procedure. Refitting attenuates the bias, but does not eliminate it in all but the cases where there is large sample size combined with estimating large effects. In these settings the residual bias not compensated for in refitting was small enough to be negligible. In all other settings where data is limited or effect sizes are small, the bias and variance are large enough to invalidate inference after model selection on those parameters, even for Algorithm 2.
When simulations were based on p > n, SCAD, elastic net, and penalized linear regression (the implementations we used) resulted in so many error messages that it was not meaningful to present any findings for them. For LARS and LASSO the FN rate was low and the FP rate was lower than for the p < n setting for moderate sample sizes but was above 74 % for n = 500 for all values of β * . For relaxo, the FP and FN rates were similar to those seen for LARS and LASSO but low also for large n. The coverage of β * of the 95 % CIs forβ = 0 coefficients was much below the nominal level for both Algorithm 1 and 2.
There is a growing literature on valid inference after model selection. E.g., Efron (2014), Wasserman and Roeder (2009) and Meinshausen et al. (2009) proposed approaches based on resampling or data splitting. Lockhart et al. (2014) derived the exact asymptotic null distribution of a test statistic for significance of variables that enter the LASSO model for general design matrices X and extends results to elastic net estimates. Berk et al. (2013) proposed an approach for post-selection inference ("PoSI") that is valid over all possible selected models and does not assume the linear model is correct. A better understanding of the small sample properties of some of these techniques is still needed. Nonetheless translation of the above mentioned approaches and others into statistical practice is also important to avoid misleading inference and irreproducible scientific findings.