missForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data

Jin, Heejin; Jung, Surin; Won, Sungho

doi:10.1007/s13258-022-01247-8

Research Article
Published: 06 April 2022

Genes & Genomics, volume 44, pages 651–658 (2022)

Abstract

Background

Missing data are a common problem in large-scale datasets, and handling them appropriately is crucial for data analysis. Missingness can be categorized as (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR). Different missingness mechanisms require different imputation strategies. Multiple imputation, an approach that averages outcomes across multiple imputed datasets, is more suitable than single imputation for dealing with the various missingness mechanisms. missForest, a nonparametric missing-value imputation strategy based on random forests, is one of the most prevalent multiple imputation methods for missing data because it can be applied to mixed-type data and does not require distributional assumptions. However, a recent study found that missForest can produce biased results for non-normal data. In addition, missForest is computationally expensive.
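The three missingness mechanisms named above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: the function name `make_missing`, its parameters, and the logistic weighting used for MAR and MNAR are all assumptions chosen for illustration.

```python
import numpy as np

def make_missing(X, mechanism="MCAR", rate=0.2, target=0, driver=1, seed=0):
    """Return (X_with_nans, mask) with NaNs injected into column `target`.

    MCAR: every entry is masked with the same probability.
    MAR:  masking probability depends on another, fully observed column (`driver`).
    MNAR: masking probability depends on the value of `target` itself.
    All names and the weighting scheme are hypothetical, for illustration only.
    """
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float, copy=True)
    n = X.shape[0]
    if mechanism == "MCAR":
        prob = np.full(n, rate)
    elif mechanism in ("MAR", "MNAR"):
        source = X[:, driver] if mechanism == "MAR" else X[:, target]
        # Logistic weighting: larger values are more likely to be masked.
        z = (source - source.mean()) / source.std()
        weight = 1.0 / (1.0 + np.exp(-z))
        # Rescale so the average masking probability is roughly `rate`.
        prob = np.clip(rate * weight / weight.mean(), 0.0, 1.0)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    mask = rng.random(n) < prob
    X[mask, target] = np.nan
    return X, mask

# Usage: mask about 20% of column 0 under each mechanism.
data = np.random.default_rng(1).normal(size=(500, 3))
for mech in ("MCAR", "MAR", "MNAR"):
    _, m = make_missing(data, mech)
    print(mech, "fraction masked:", m.mean())
```

Under MCAR the masked rows are a random subsample; under MAR they can be predicted from the observed `driver` column; under MNAR they depend on the very values that were removed, which is why that case is the hardest for any imputation method to handle.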

Objective

Therefore, we aimed to further develop the missForest algorithm by combining it with a binary particle swarm optimization (BPSO)-based feature-selection strategy.

Methods

BPSO is an evolutionary algorithm that is well known for its global optimization capability and computational efficiency. Applying a BPSO-based feature-selection step before imputing missing values with missForest prunes redundant variables, which can increase the imputation accuracy for continuous variables.
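To make the feature-selection step concrete, here is a minimal sketch of binary PSO in its standard form (sigmoid-transformed velocities decide whether each bit is set). The abstract does not give the fitness function or the swarm parameters, so everything below is an assumption: the names `bpso_select` and `holdout_error`, the values of `w`, `c1`, `c2`, and in particular the fitness, which is a hold-out error of an ordinary least-squares fit used as a cheap stand-in for a random-forest-based criterion.

```python
import numpy as np

def holdout_error(X, y, mask, train, test):
    """Hold-out MSE of a least-squares fit on the selected columns (stand-in fitness)."""
    if not mask.any():
        return np.inf  # an empty feature set cannot predict anything
    A_train = np.column_stack([np.ones(train.sum()), X[train][:, mask]])
    A_test = np.column_stack([np.ones(test.sum()), X[test][:, mask]])
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    return float(np.mean((A_test @ coef - y[test]) ** 2))

def bpso_select(X, y, n_particles=20, n_iter=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Return (best_mask, best_error); bit i of the mask set means feature i is kept."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    test = rng.random(n) < 0.3          # fixed random hold-out split
    train = ~test

    def fitness(m):
        return holdout_error(X, y, m, train, test)

    pos = rng.random((n_particles, p)) < 0.5            # binary positions
    vel = rng.uniform(-1.0, 1.0, size=(n_particles, p))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(m) for m in pos])
    g = int(np.argmin(pbest_fit))
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]

    for _ in range(n_iter):
        r1 = rng.random((n_particles, p))
        r2 = rng.random((n_particles, p))
        # Velocity update: inertia + pull toward personal and global bests.
        vel = (w * vel
               + c1 * r1 * (pbest.astype(float) - pos)
               + c2 * r2 * (gbest.astype(float) - pos))
        vel = np.clip(vel, -6.0, 6.0)
        # Binary position update: sigmoid(velocity) is the probability of a 1.
        pos = rng.random((n_particles, p)) < 1.0 / (1.0 + np.exp(-vel))
        fit = np.array([fitness(m) for m in pos])
        improved = fit < pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        g = int(np.argmin(pbest_fit))
        if pbest_fit[g] < gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    return gbest, gbest_fit

# Usage: only columns 0 and 1 carry signal; the remaining columns are noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 8))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)
mask, err = bpso_select(X_demo, y_demo)
print("selected columns:", np.flatnonzero(mask), "hold-out MSE:", err)
```

In a pipeline of the kind the abstract describes, the columns kept by the returned mask would then serve as the predictors passed to missForest for the variable being imputed. How the authors wire the two steps together is not stated here, so that coupling is also an assumption.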

Results

In this study, missForest with BPSO (BPSOmf) showed better imputation accuracy for continuous variables than missForest alone, owing to the feature selection performed before the imputation step.
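The abstract does not say how imputation accuracy was measured. One common choice for continuous variables, used in the original missForest paper, is the normalized root mean squared error (NRMSE) computed over the entries that were missing. The sketch below assumes that metric; the function name `nrmse` and its arguments are hypothetical.

```python
import numpy as np

def nrmse(X_true, X_imputed, missing_mask):
    """NRMSE over the masked entries: sqrt(mean((true - imputed)^2) / var(true)).

    Lower is better; 0 means the imputed values match the true values exactly.
    """
    X_true = np.asarray(X_true, dtype=float)
    X_imputed = np.asarray(X_imputed, dtype=float)
    missing_mask = np.asarray(missing_mask, dtype=bool)
    true_vals = X_true[missing_mask]
    diff = true_vals - X_imputed[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(true_vals)))

# Usage: compare a crude column-mean imputation against the known truth.
rng = np.random.default_rng(0)
truth = rng.normal(size=(200, 4))
mask = rng.random(truth.shape) < 0.2
observed = np.where(mask, np.nan, truth)
col_means = np.nanmean(observed, axis=0)
imputed = np.where(mask, col_means, observed)
print("NRMSE of mean imputation:", nrmse(truth, imputed, mask))
```

Because the error is divided by the variance of the true values, an imputation that does no better than a constant guess scores close to 1, which gives a convenient baseline when comparing two imputation methods on the same masked entries.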

Conclusions

BPSOmf is an appropriate and robust method when the imputation target data consist mainly of continuous variables.

Funding

This study was supported by the National Research Foundation of Korea (2017M3A9F3046543) and the Industrial Core Technology Development Program (20000134) funded by the Ministry of Trade, Industry, and Energy (MOTIE, Korea).

Author information

Heejin Jin and Surin Jung contributed equally to this work.

Authors and Affiliations

Institute of Health and Environment, Seoul National University, Seoul, South Korea
Heejin Jin & Sungho Won
Department of Public Health Sciences, Seoul National University, 1 Kwanak-ro Kwanak-gu, Seoul, 151-742, South Korea
Surin Jung & Sungho Won
RexSoft Corp, Seoul, South Korea
Sungho Won

Contributions

HJ and SJ analyzed and interpreted the results and wrote the manuscript. These authors contributed equally to this work. SW designed the study and contributed to the discussion. All authors critically revised the manuscript for important intellectual content. SW agrees to take responsibility for the integrity and veracity of this paper and for the work and research it represents. SW accepts full responsibility for the work and/or conduct of the study, has access to the data, and controls the decision to publish.

Corresponding author

Correspondence to Sungho Won.

Ethics declarations

Conflict of interest

HJ, SJ, and SW declare that they have no conflict of interest.

Ethical approval and consent to participate

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jin, H., Jung, S. & Won, S. missForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data. Genes Genom 44, 651–658 (2022). https://doi.org/10.1007/s13258-022-01247-8

Received: 05 January 2022
Accepted: 12 March 2022
Published: 06 April 2022
Issue Date: June 2022
DOI: https://doi.org/10.1007/s13258-022-01247-8
