Variable selection techniques after multiple imputation in high-dimensional data

Zahid, Faisal Maqbool; Faisal, Shahla; Heumann, Christian

doi:10.1007/s10260-019-00493-7

Original Paper
Published: 03 November 2019

Volume 29, pages 553–580 (2020)
Statistical Methods & Applications


Abstract

High-dimensional data arise in diverse fields of scientific research, and missing values are often encountered in such data. Variable selection plays a key role in high-dimensional data analysis but, like many other statistical techniques, it requires complete data without missing values. A variety of variable selection techniques is available for complete data, whereas comparable techniques for data with missing values are scarce in the literature. Multiple imputation is a popular approach for handling missing values and obtaining completed datasets. If a variable selection technique is applied independently to each of the multiply imputed datasets, a different model may result for each dataset, and it remains unclear in the literature how variable selection should be implemented on multiply imputed data. In this paper, we propose to select each candidate predictor on the basis of the magnitude of its parameter estimates across all imputed datasets: a constraint is imposed on the sum of the absolute values of these estimates to decide whether the predictor is selected or removed from the model. The proposed method for identifying informative predictors is compared with other approaches in an extensive simulation study. Performance is compared on the basis of hit rates (the proportion of informative predictors correctly identified) and false alarm rates (the proportion of non-informative predictors flagged as informative) for different numbers of imputed datasets. The proposed technique is simple and easy to implement, performs as well in the high-dimensional case as in low-dimensional settings, and is observed to be a good competitor to existing approaches across the simulation settings. The performance of the different variable selection techniques is also examined on a real dataset with missing values.
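To make the selection idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a coefficient vector has already been estimated on each of the M imputed datasets (for example by a penalized regression, which the abstract does not specify) and that a predictor is kept when the sum of the absolute values of its M estimates exceeds a cutoff. The function name `select_by_abs_sum` and the parameter `threshold` are hypothetical; the abstract does not state how the constraint is calibrated.

```python
import numpy as np


def select_by_abs_sum(coefs, threshold):
    """Return a boolean mask of selected predictors.

    coefs     : array of shape (M, p), one row of estimates per imputed dataset
    threshold : hypothetical cutoff on the sum of absolute estimates
                (an assumption; the abstract does not give its value)
    """
    coefs = np.asarray(coefs, dtype=float)
    if coefs.ndim != 2:
        raise ValueError("coefs must have shape (M, p)")
    # Sum |beta_mj| over the M imputed datasets for each predictor j.
    abs_sum = np.abs(coefs).sum(axis=0)
    # Keep predictor j only if its pooled magnitude exceeds the cutoff.
    return abs_sum > threshold


# Toy illustration with M = 3 imputed datasets and p = 3 predictors.
coefs = np.array([
    [1.2, 0.0, -0.4],
    [0.9, 0.1, 0.0],
    [1.1, 0.0, 0.0],
])
selected = select_by_abs_sum(coefs, threshold=0.5)
print(selected.tolist())
```

In this toy input only the first predictor has a pooled magnitude above the cutoff, so a single model is obtained for all imputed datasets rather than a different model per dataset.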



Author information

Authors and Affiliations

Department of Statistics, Government College University Faisalabad, Faisalabad, Pakistan
Faisal Maqbool Zahid & Shahla Faisal
Department of Statistics, Ludwig-Maximilians-University Munich, Munich, Germany
Christian Heumann

Corresponding author

Correspondence to Faisal Maqbool Zahid.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 2, 3, 4 and 5.

Table 2 Simulation study with normal predictors: values of the Matthews correlation coefficient (MCC)

Table 3 Simulation study with binary predictors: values of the Matthews correlation coefficient (MCC)

Table 4 Simulation study with normal predictors: average hit rates and false alarm rates (in percentages) over \(S=500\) simulated datasets for different variable selection methods

Table 5 Simulation study with binary predictors: average hit rates and false alarm rates (in percentages) over \(S=500\) simulated datasets for different variable selection methods
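As an illustration of the performance measures named in these captions, the sketch below computes the hit rate, the false alarm rate and the MCC for a single simulated dataset from a selection mask and the known set of informative predictors. The hit rate and false alarm rate follow the definitions given in the abstract; the MCC uses its standard confusion-matrix definition. The function name `selection_metrics` is hypothetical, and returning 0 for the MCC when its denominator is zero is a common convention assumed here, not something stated in the paper.

```python
import math

import numpy as np


def selection_metrics(selected, informative):
    """Return (hit_rate, false_alarm_rate, mcc) for one selection result."""
    selected = np.asarray(selected, dtype=bool)
    informative = np.asarray(informative, dtype=bool)
    if selected.shape != informative.shape:
        raise ValueError("masks must have the same shape")

    tp = int(np.sum(selected & informative))    # informative and selected
    fp = int(np.sum(selected & ~informative))   # non-informative but selected
    fn = int(np.sum(~selected & informative))   # informative but missed
    tn = int(np.sum(~selected & ~informative))  # non-informative and dropped

    # Proportion of informative predictors that were correctly identified.
    hit_rate = tp / (tp + fn) if (tp + fn) else float("nan")
    # Proportion of non-informative predictors flagged as informative.
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else float("nan")

    # Standard Matthews correlation coefficient; 0 if undefined (assumption).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return hit_rate, false_alarm_rate, mcc


# Toy illustration: 2 informative predictors out of 5, 2 predictors selected.
informative = [True, True, False, False, False]
selected = [True, False, True, False, False]
hit_rate, false_alarm_rate, mcc = selection_metrics(selected, informative)
print(hit_rate, false_alarm_rate, mcc)
```

Averaging these per-dataset values over the simulated datasets, and multiplying the two rates by 100, would give quantities of the kind reported in the tables.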

About this article

Cite this article

Zahid, F.M., Faisal, S. & Heumann, C. Variable selection techniques after multiple imputation in high-dimensional data. Stat Methods Appl 29, 553–580 (2020). https://doi.org/10.1007/s10260-019-00493-7

Accepted: 12 October 2019
Published: 03 November 2019
Issue Date: September 2020
DOI: https://doi.org/10.1007/s10260-019-00493-7
