Variable selection techniques after multiple imputation in high-dimensional data

  • Original Paper
  • Journal: Statistical Methods & Applications

Abstract

High-dimensional data arise in diverse fields of scientific research, and missing values are often encountered in such data. Variable selection plays a key role in high-dimensional data analysis but, like many other statistical techniques, it requires complete data without any missing values. A variety of variable selection techniques is available for complete data, whereas comparable techniques for data with missing values are scarce in the literature. Multiple imputation is a popular approach for handling missing values and obtaining completed data. If a variable selection technique is applied independently to each of the multiply imputed datasets, a different model may result for each dataset, and it is still unclear in the literature how variable selection should be implemented on multiply imputed data. In this paper, we propose to base the selection of each candidate predictor on the magnitude of its parameter estimates across all the imputed datasets: a constraint is imposed on the sum of the absolute values of these estimates to select the predictor or remove it from the model. The proposed method for identifying informative predictors is compared with other approaches in an extensive simulation study. Performance is assessed by hit rates (the proportion of informative predictors correctly identified) and false alarm rates (the proportion of non-informative predictors wrongly identified as informative) for different numbers of imputed datasets. The proposed technique is simple and easy to implement, performs as well in high-dimensional as in low-dimensional settings, and is observed to be a good competitor to the existing approaches across the simulation settings. The performance of the different variable selection techniques is also examined on a real dataset with missing values.
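The idea of pooling the magnitude of the estimates across imputed datasets can be illustrated with a minimal sketch. It assumes that the coefficient estimates from each imputed dataset are already available as a matrix, and it uses a simple hypothetical threshold `tau` on the sum of absolute values as a stand-in for the constraint; the function name `select_by_abs_sum`, the threshold rule and the toy numbers are illustrative assumptions, not the exact procedure of the paper.

```python
import numpy as np


def select_by_abs_sum(coefs, tau):
    """Select predictors from estimates pooled over imputed datasets.

    coefs : array of shape (M, p), row m holds the estimates for the
            p candidate predictors obtained from imputed dataset m.
    tau   : hypothetical threshold on the sum of absolute estimates
            (an assumption made for this sketch).
    Returns a boolean array of length p (True = predictor kept).
    """
    coefs = np.asarray(coefs, dtype=float)
    abs_sum = np.abs(coefs).sum(axis=0)  # one pooled magnitude per predictor
    return abs_sum > tau


# Toy estimates: M = 3 imputed datasets, p = 4 candidate predictors.
coefs = np.array([
    [1.2, 0.0, -0.8, 0.05],
    [1.0, 0.1, -0.9, 0.00],
    [1.1, 0.0, -0.7, 0.00],
])

# Selecting separately within each dataset gives three different models.
per_dataset_support = coefs != 0

# Pooling the magnitudes gives a single model for all imputed datasets.
selected = select_by_abs_sum(coefs, tau=0.5)
print(selected.tolist())  # -> [True, False, True, False]
```

In this toy example the second and fourth predictors enter some of the per-dataset models but not others, whereas the pooled rule returns one common set of predictors.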

Author information

Correspondence to Faisal Maqbool Zahid.

Appendix

See Tables 2, 3, 4 and 5.

Table 2 Simulation study with normal predictors: values of the Matthews correlation coefficient (MCC)
Table 3 Simulation study with binary predictors: values of the Matthews correlation coefficient (MCC)
Table 4 Simulation study with normal predictors: average hit rates and false alarm rates (in percentages) over \(S=500\) simulated datasets for different variable selection methods
Table 5 Simulation study with binary predictors: average hit rates and false alarm rates (in percentages) over \(S=500\) simulated datasets for different variable selection methods
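
For readers who wish to reproduce the performance measures named in these captions, the following minimal sketch computes the hit rate, the false alarm rate and the MCC for a single simulated dataset. The function name `selection_metrics` and the toy vectors are illustrative assumptions; returning an MCC of 0 when the denominator vanishes is a common convention assumed here, not something stated in the paper.

```python
import math


def selection_metrics(truth, selected):
    """Compare a selected model with the true set of informative predictors.

    truth, selected : equal-length sequences of booleans, one entry per
    candidate predictor (True = informative / selected).
    Assumes at least one informative and one non-informative predictor.
    """
    tp = sum(t and s for t, s in zip(truth, selected))          # correctly selected
    fn = sum(t and not s for t, s in zip(truth, selected))      # informative, missed
    fp = sum(s and not t for t, s in zip(truth, selected))      # noise, selected
    tn = sum(not t and not s for t, s in zip(truth, selected))  # noise, left out

    hit_rate = tp / (tp + fn)     # share of informative predictors identified
    false_alarm = fp / (fp + tn)  # share of non-informative predictors selected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0  # assumed convention
    return hit_rate, false_alarm, mcc


# Toy example: 3 informative and 5 non-informative predictors.
truth = [True, True, True, False, False, False, False, False]
selected = [True, True, False, True, False, False, False, False]
hit_rate, false_alarm, mcc = selection_metrics(truth, selected)
print(round(hit_rate, 3), round(false_alarm, 3), round(mcc, 3))  # -> 0.667 0.2 0.467
```

Averaging these quantities over the simulated datasets, and multiplying the two rates by 100, would give figures on the scale used in the captions.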

Cite this article

Zahid, F.M., Faisal, S. & Heumann, C. Variable selection techniques after multiple imputation in high-dimensional data. Stat Methods Appl 29, 553–580 (2020). https://doi.org/10.1007/s10260-019-00493-7
