Abstract
Background
Missing data are a common problem in large-scale datasets, and their appropriate handling is crucial for data analysis. Missingness can be categorized as (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR). Different missingness mechanisms require different imputation strategies. Multiple imputation, which averages outcomes across multiple imputed datasets, is more suitable than single imputation for handling the various missingness mechanisms. missForest, a nonparametric missing-value imputation strategy based on random forests, is one of the most prevalent multiple imputation methods for missing data because it can be applied to mixed-type data and does not require distributional assumptions. However, a recent study found that missForest can produce biased results for non-normal data. In addition, missForest is computationally expensive.
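The three missingness mechanisms can be illustrated with a small simulation. The following is a minimal sketch, not part of the study: the function name `make_missing_masks`, the median threshold, and the missingness rates are all hypothetical choices made only to show how the mechanisms differ.

```python
import numpy as np

def make_missing_masks(x, y, rate=0.3, seed=0):
    """Return boolean masks (True = y is missing) for MCAR, MAR and MNAR.

    Illustrative assumption: under MAR and MNAR, values are dropped only
    when the driving variable lies above its median.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    # MCAR: missingness is independent of every variable.
    mcar = rng.random(n) < rate
    # MAR: missingness in y depends only on the fully observed x.
    mar = rng.random(n) < np.where(x > np.median(x), 2 * rate, 0.0)
    # MNAR: missingness in y depends on the (unobserved) value of y itself.
    mnar = rng.random(n) < np.where(y > np.median(y), 2 * rate, 0.0)
    return mcar, mar, mnar

# Toy data: y is correlated with x.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=2000)
y = 0.5 * x + data_rng.normal(size=2000)
mcar, mar, mnar = make_missing_masks(x, y)

# Under MNAR the observed values of y are systematically too small,
# which is why the mechanism matters for the choice of imputation strategy.
print(y.mean(), y[~mcar].mean(), y[~mnar].mean())
```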
Objective
Therefore, we aimed to further develop the missForest algorithm by combining it with a binary particle swarm optimization (BPSO)-based feature-selection strategy.
Methods
BPSO is an evolutionary algorithm that is well known for global optimization and computational efficiency. Applying a BPSO-based feature-selection step before imputing missing values with missForest prunes redundant variables, which can increase the imputation accuracy for continuous variables.
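A BPSO-based feature-selection step of the kind described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the fitness function (hold-out error of a least-squares fit plus a size penalty), the swarm parameters `w`, `c1`, `c2`, and the names `fitness` and `bpso_select` are all hypothetical. The abstract does not specify the fitness used in BPSOmf. The selected columns would then be passed on to the imputation step.

```python
import numpy as np

def fitness(mask, X, y):
    """Hypothetical fitness (lower is better): hold-out MSE of a
    least-squares fit on the selected columns plus a small size penalty."""
    if not mask.any():
        return np.inf
    half = len(y) // 2
    A_train = np.column_stack([np.ones(half), X[:half][:, mask]])
    A_test = np.column_stack([np.ones(len(y) - half), X[half:][:, mask]])
    coef, *_ = np.linalg.lstsq(A_train, y[:half], rcond=None)
    mse = np.mean((A_test @ coef - y[half:]) ** 2)
    return mse + 0.01 * mask.sum()

def bpso_select(X, y, n_particles=20, n_iter=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO: each particle is a 0/1 vector marking selected features."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pos = (rng.random((n_particles, d)) < 0.5).astype(int)
    vel = rng.uniform(-1.0, 1.0, (n_particles, d))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p.astype(bool), X, y) for p in pos])
    g = int(np.argmin(pbest_fit))
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    for _ in range(n_iter):
        r1 = rng.random((n_particles, d))
        r2 = rng.random((n_particles, d))
        # Standard velocity update toward personal and global bests.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        vel = np.clip(vel, -6.0, 6.0)
        # Sigmoid transfer: velocity becomes the probability of a bit being 1.
        prob = 1.0 / (1.0 + np.exp(-vel))
        pos = (rng.random((n_particles, d)) < prob).astype(int)
        fit = np.array([fitness(p.astype(bool), X, y) for p in pos])
        improved = fit < pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        g = int(np.argmin(pbest_fit))
        if pbest_fit[g] < gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    return gbest.astype(bool)

# Toy data: only columns 0 and 1 carry signal; the other six are redundant.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 8))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * data_rng.normal(size=200)
selected = bpso_select(X, y)
print(selected)
```

In this toy setting, any subset that omits an informative column has a far worse fitness than one that keeps both, so the swarm retains them. Whether every redundant column is pruned depends on the penalty weight and the number of iterations.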
Results
In this study, missForest with BPSO (BPSOmf) achieved better imputation accuracy for continuous variables than missForest alone, owing to the feature-selection step performed before imputation.
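The abstract does not state which accuracy metric was used. One common choice for continuous variables in the missForest literature is the normalized root mean squared error (NRMSE), computed over the artificially removed entries only. The sketch below assumes that metric; the function name `nrmse` and the toy values are hypothetical.

```python
import numpy as np

def nrmse(x_true, x_imputed, missing_mask):
    """NRMSE over the entries that were missing (lower is better).

    Assumed definition: sqrt(mean((true - imputed)^2) / var(true)),
    with mean and variance taken over the missing entries only.
    """
    truth = x_true[missing_mask]
    diff = truth - x_imputed[missing_mask]
    return np.sqrt(np.mean(diff ** 2) / np.var(truth))

# Toy example: mean imputation of three removed values.
x_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
mask = np.array([False, True, False, True, False, True])
x_mean_imputed = x_true.copy()
x_mean_imputed[mask] = x_true[~mask].mean()  # observed mean is 3.0

score = nrmse(x_true, x_mean_imputed, mask)
print(score)
```

Comparing two imputation methods then amounts to computing this score for each on the same masked entries.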
Conclusions
BPSOmf is an appropriate and robust method when the imputation target data consist mainly of continuous variables.
Funding
This study was supported by the National Research Foundation of Korea (2017M3A9F3046543) and the Industrial Core Technology Development Program (20000134) funded by the Ministry of Trade, Industry, and Energy (MOTIE, Korea).
Author information
Contributions
HJ and SJ analyzed and interpreted the results and wrote the manuscript. These authors contributed equally to this work. SW designed the study and contributed to discussion. All the authors critically revised the manuscript for important intellectual content. SW agrees to take responsibility for the integrity and veracity of this paper and for the work and research it represents. SW accepts full responsibility for the work and/or conduct of the study, has access to the data, and controls the decision to publish.
Ethics declarations
Conflict of interest
HJ, SJ, and SW declare that they have no conflict of interest.
Ethical approval and consent to participate
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Jin, H., Jung, S. & Won, S. missForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data. Genes Genom 44, 651–658 (2022). https://doi.org/10.1007/s13258-022-01247-8
DOI: https://doi.org/10.1007/s13258-022-01247-8