Abstract
High-dimensional data arise in diverse fields of scientific research, and missing values are often encountered in such data. Variable selection plays a key role in high-dimensional data analysis but, like many other statistical techniques, it requires complete cases without any missing values. A variety of variable selection techniques is available for complete data, but comparable techniques for data with missing values are scarce in the literature. Multiple imputation is a popular approach for handling missing values and obtaining completed data. If a particular variable selection technique is applied independently to each of the multiply imputed datasets, a different model may result for each dataset. It remains unclear in the literature how variable selection techniques should be implemented on multiply imputed data. In this paper, we propose to select each candidate predictor on the basis of the magnitude of its parameter estimates across all the imputed datasets. A constraint is imposed on the sum of absolute values of these estimates to select or remove the predictor from the model. The proposed method for identifying the informative predictors is compared with other approaches in an extensive simulation study. Performance is compared on the basis of hit rates (the proportion of informative predictors correctly identified) and false alarm rates (the proportion of non-informative predictors incorrectly identified as informative) for different numbers of imputed datasets. The proposed technique is simple and easy to implement, and performs as well in high-dimensional settings as in low-dimensional ones. It is observed to be a good competitor to the existing approaches across the simulation settings considered. The performance of the different variable selection techniques is also examined on a real dataset with missing values.
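The selection idea and the two performance measures described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact form of the constraint, so the rule below (keep a predictor when the sum of the absolute values of its estimates across the imputed datasets exceeds a fixed cut-off) is only an assumption, as are the function names `select_by_abs_sum` and `hit_and_false_alarm_rates`, the `threshold` parameter, and the toy numbers.

```python
import numpy as np


def select_by_abs_sum(estimates, threshold):
    """Return a boolean mask of selected predictors.

    estimates: array of shape (M, p); row m holds the coefficient estimates
    obtained from the m-th imputed dataset.
    threshold: hypothetical cut-off on the sum of absolute estimates
    (an assumed form of the constraint, not taken from the paper).
    """
    estimates = np.asarray(estimates, dtype=float)
    abs_sum = np.abs(estimates).sum(axis=0)  # one value per predictor
    return abs_sum > threshold


def hit_and_false_alarm_rates(selected, informative):
    """Hit rate: share of informative predictors that were selected.
    False alarm rate: share of non-informative predictors that were selected.
    Assumes at least one predictor of each kind exists.
    """
    selected = np.asarray(selected, dtype=bool)
    informative = np.asarray(informative, dtype=bool)
    hit_rate = (selected & informative).sum() / informative.sum()
    false_alarm_rate = (selected & ~informative).sum() / (~informative).sum()
    return float(hit_rate), float(false_alarm_rate)


# Toy data: M = 3 imputed datasets, p = 4 candidate predictors.
estimates = np.array([
    [1.2, 0.3, -0.8, 0.05],
    [1.0, 0.3, -0.9, 0.00],
    [1.1, 0.0, -0.7, 0.00],
])
informative = np.array([True, False, True, False])  # known only in a simulation

selected = select_by_abs_sum(estimates, threshold=0.5)
hit_rate, false_alarm_rate = hit_and_false_alarm_rates(selected, informative)
print(selected, hit_rate, false_alarm_rate)
```

In this toy case the second predictor is non-informative but has estimates large enough in two of the three imputed datasets to pass the assumed cut-off, so it counts as a false alarm.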
Cite this article
Zahid, F.M., Faisal, S. & Heumann, C. Variable selection techniques after multiple imputation in high-dimensional data. Stat Methods Appl 29, 553–580 (2020). https://doi.org/10.1007/s10260-019-00493-7
DOI: https://doi.org/10.1007/s10260-019-00493-7