missForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data

  • Research Article
  • Genes & Genomics

Abstract

Background

Missing data are a common problem in large-scale datasets, and their appropriate handling is crucial for data analysis. Missingness can be categorized as (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR). Different missingness mechanisms require different imputation strategies. Multiple imputation, an approach that averages outcomes across multiple imputed datasets, is more suitable than single imputation for dealing with various missingness mechanisms. missForest, a nonparametric missing-value imputation strategy based on random forests, is one of the most prevalent multiple imputation methods for missing data because it can be applied to mixed-type data and does not require distributional assumptions. However, a recent study found that missForest can produce biased results for non-normal data. In addition, missForest is computationally expensive.
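The three missingness mechanisms can be illustrated with a small simulation. The sketch below is not taken from the article: the function name make_missing, the two-variable layout, and the logistic missingness probabilities are assumptions chosen only to show how the mechanisms differ in what the probability of missingness depends on.

```python
import math
import random


def make_missing(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows ([x, y] pairs) with some y values set to None.

    x is always observed; only y can go missing. `rate` is used by MCAR
    only; MAR and MNAR use a logistic function (an illustrative choice).
    """
    rng = random.Random(seed)
    out = []
    for x, y in rows:
        if mechanism == "MCAR":
            # Constant probability, unrelated to x or y.
            p = rate
        elif mechanism == "MAR":
            # Depends only on the fully observed variable x.
            p = 1.0 / (1.0 + math.exp(-x))
        elif mechanism == "MNAR":
            # Depends on the value of y that becomes unobserved.
            p = 1.0 / (1.0 + math.exp(-y))
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append([x, None if rng.random() < p else y])
    return out


# Hypothetical data: two independent standard-normal variables.
data_rng = random.Random(1)
data = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(1000)]

for mech in ("MCAR", "MAR", "MNAR"):
    masked = make_missing(data, mech)
    n_missing = sum(1 for _, y in masked if y is None)
    print(mech, "missing y values:", n_missing)
```

Under MNAR in this sketch, large values of y are lost more often than small ones, so the observed values alone give a distorted picture of y. That is why the mechanism matters when choosing an imputation strategy.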

Objective

Therefore, we aimed to further develop the missForest algorithm by combining it with a binary particle swarm optimization (BPSO)-based feature-selection strategy.

Methods

BPSO is an evolutionary algorithm that is well known for global optimization and computational efficiency. Applying a BPSO-based feature-selection step before imputing missing values with missForest prunes redundant variables, which can increase the imputation accuracy for continuous variables.
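A minimal sketch of generic binary PSO for feature selection follows. It uses the standard formulation with a sigmoid transfer function and is not the authors' implementation: the function name bpso, all parameter values, and the toy fitness function are assumptions. In the setting described here, the fitness would presumably be an imputation error obtained with missForest on the selected variables; the abstract does not specify it, so a simple stand-in is used.

```python
import math
import random


def bpso(fitness, n_features, n_particles=20, n_iter=50,
         w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Minimise fitness(mask) over 0/1 masks of length n_features."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull toward personal and global bests.
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                v = max(-v_max, min(v_max, v))  # clamp the velocity
                vel[i][d] = v
                # Sigmoid of the velocity gives the probability of bit = 1.
                prob_one = 1.0 / (1.0 + math.exp(-v))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            f = fitness(pos[i])
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


# Toy stand-in fitness (hypothetical): features 0, 3 and 5 are "useful";
# each missing useful feature costs 1.0, each selected feature costs 0.1.
USEFUL = {0, 3, 5}


def toy_fitness(mask):
    missed = sum(1 for j in USEFUL if mask[j] == 0)
    return missed + 0.1 * sum(mask)


best_mask, best_fit = bpso(toy_fitness, n_features=8)
print("selected features:", [j for j, bit in enumerate(best_mask) if bit])
print("fitness:", best_fit)
```

The size penalty in the toy fitness plays the role of pruning redundant variables: a mask that keeps uninformative features scores worse than one that drops them.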

Results

In this study, missForest with BPSO (BPSOmf), which performs feature selection prior to the imputation step, showed better imputation accuracy for continuous variables than missForest alone.
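The abstract does not state how imputation accuracy was measured. One common choice for continuous variables is the normalized root mean squared error (NRMSE) used in the original missForest paper; the sketch below assumes that metric purely for illustration, and the function name nrmse is hypothetical.

```python
import math


def nrmse(true_vals, imputed_vals):
    """Normalized RMSE over the entries that were missing.

    Computed as sqrt(mean((true - imputed)^2) / var(true)); lower is better.
    """
    n = len(true_vals)
    mean_true = sum(true_vals) / n
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    return math.sqrt(mse / var_true)


# Hypothetical true values for four missing entries and two imputations.
truth = [1.0, 2.0, 3.0, 4.0]
print(nrmse(truth, [1.1, 1.9, 3.2, 3.8]))  # close imputations, small error
print(nrmse(truth, [2.5, 2.5, 2.5, 2.5]))  # imputing the mean of the true values
```

With this definition, a perfect imputation scores 0, and filling every entry with the mean of the true values scores 1, which gives a natural baseline for comparing methods.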

Conclusions

BPSOmf is an appropriate and robust method when the imputation target data consist mainly of continuous variables.

Funding

This study was supported by the National Research Foundation of Korea (2017M3A9F3046543) and the Industrial Core Technology Development Program (20000134) funded by the Ministry of Trade, Industry, and Energy (MOTIE, Korea).

Author information

Contributions

HJ and SJ analyzed and interpreted the results and wrote the manuscript. These authors contributed equally to this work. SW designed the study and contributed to discussion. All the authors critically revised the manuscript for important intellectual content. SW agrees to take responsibility for the integrity and veracity of this paper and for the work and research it represents. SW accepts full responsibility for the work and/or conduct of the study, has access to the data, and controls the decision to publish.

Corresponding author

Correspondence to Sungho Won.

Ethics declarations

Conflict of interest

HJ, SJ, and SW declare that they have no conflict of interest.

Ethical approval and consent to participate

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jin, H., Jung, S. & Won, S. missForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data. Genes Genom 44, 651–658 (2022). https://doi.org/10.1007/s13258-022-01247-8

  • DOI: https://doi.org/10.1007/s13258-022-01247-8