High-dimensional variable screening and bias in subsequent inference, with an empirical comparison

Original Paper · Computational Statistics

Abstract

We review variable selection and variable screening in high-dimensional linear models. A major focus is an empirical comparison of various estimation methods with respect to true and false positive selection rates, based on 128 sparse scenarios built from semi-real data (real covariables, but synthetic regression coefficients and noise). Furthermore, we present theoretical bounds on the bias of subsequent least squares estimation that uses the variables selected in the first stage; these bounds have direct implications for the construction of p-values for regression coefficients.
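The two-stage procedure the abstract refers to — screening with a sparse estimator, then ordinary least squares restricted to the retained variables — can be sketched as follows. This is a minimal illustration, not the paper's exact setup: the Lasso as the screening method, the problem dimensions, and the coefficient values are all assumptions made here for demonstration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, s = 100, 200, 5  # high-dimensional design (p > n), sparse truth

# Synthetic data standing in for the semi-real scenarios in the paper
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 2.0                      # first s coefficients are active
y = X @ beta + rng.standard_normal(n)

# Stage 1: variable screening with the Lasso (cross-validated penalty)
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0)

# Stage 2: least squares refit on the screened variables only.
# The paper's bias bounds concern exactly this refit: the selection
# event in stage 1 biases the stage-2 estimates and their p-values.
beta_ols = np.zeros(p)
beta_ols[selected], *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
```

In this strong-signal example the screening stage typically retains the true support (possibly along with some noise variables, since the Lasso tends to over-select at the CV-optimal penalty); the refit then removes the Lasso's shrinkage bias on the retained coefficients.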



Author information

Correspondence to Peter Bühlmann.


Cite this article

Bühlmann, P., Mandozzi, J. High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput Stat 29, 407–430 (2014). https://doi.org/10.1007/s00180-013-0436-3
