A novel clustering-based purity and distance imputation for handling medical data with missing values

  • Application of soft computing
  • Soft Computing

Abstract

People pay increasing attention to health, and the integrity of medical records has come into focus. Medical data imputation has recently become a very active field because medical data usually contain missing values. Many imputation methods have been proposed, but many model-based methods, such as expectation–maximization and regression-based imputation, assume that the variables follow a multivariate normal distribution, an assumption that can lead to biased results. Such methods can also become a bottleneck because they are computationally more complex than model-free methods. Furthermore, directly removing instances with missing values has several problems: it can discard important data, produce ineffective research samples, and cause research deviations. Therefore, this study proposes a novel clustering-based purity and distance imputation method to improve the handling of missing values. In the experiments, we collected eight different medical datasets to compare the proposed imputation method with the listed imputation methods under different situations. As imputation measures, the area under the curve (AUC) is used to evaluate performance on the imbalanced-class datasets in the missing at random (MAR) and missing completely at random (MCAR) experiments, and accuracy is used to measure performance on the balanced-class data in the missing not at random (MNAR) experiment. The root-mean-square error (RMSE) is also used to compare the proposed and the listed imputation methods. In addition, this study used the elbow method and the average silhouette method to find the optimal number of clusters for all datasets. Results showed that the proposed imputation method could improve imputation performance in terms of accuracy, AUC, and RMSE across different missing degrees and missing types.
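
As a rough illustration of the RMSE evaluation mentioned in the abstract, the sketch below masks cells of a complete table completely at random (MCAR), fills them in, and scores the filled values against the known truth. It is only a minimal sketch under stated assumptions: column-mean imputation is a stand-in baseline and is not the proposed purity and distance method, the function names (`mask_mcar`, `impute_column_mean`, `rmse_on_missing`) are hypothetical, and the toy values are not taken from the paper's datasets.

```python
import math
import random


def mask_mcar(rows, rate, seed=0):
    """Return a copy of rows where each cell becomes None with probability `rate`."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_column_mean(rows):
    """Fill each None with the mean of the observed values in its column.

    Assumption: this baseline stands in for an imputation method; it is
    not the clustering-based method proposed in the paper.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def rmse_on_missing(truth, masked, imputed):
    """RMSE between true and imputed values, computed only over masked cells."""
    sq_errors = [
        (t - p) ** 2
        for t_row, m_row, p_row in zip(truth, masked, imputed)
        for t, m, p in zip(t_row, m_row, p_row)
        if m is None
    ]
    if not sq_errors:
        return 0.0
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Toy complete data (hypothetical values, for illustration only).
complete = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
masked = mask_mcar(complete, rate=0.3, seed=1)
imputed = impute_column_mean(masked)
print(rmse_on_missing(complete, masked, imputed))
```

Scoring only the masked cells keeps the error from being diluted by values that were never missing; repeating the run at several values of `rate` would give one simple way to compare methods across missing degrees.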

Funding

The authors received no financial support for the research.

Author information

Corresponding author

Correspondence to Ching-Hsue Cheng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 13 and 14.

Table 13 Internal comparison for the other UCI datasets in MAR
Table 14 Internal comparison for the other UCI datasets in MCAR

About this article

Cite this article

Cheng, CH., Huang, SF. A novel clustering-based purity and distance imputation for handling medical data with missing values. Soft Comput 25, 11781–11801 (2021). https://doi.org/10.1007/s00500-021-05947-3

  • DOI: https://doi.org/10.1007/s00500-021-05947-3
