Abstract
As attention to health grows, the integrity of medical records has come into focus. Medical data imputation has recently become a very active field because medical data usually contain missing values. Many imputation methods have been proposed, but many model-based methods, such as expectation–maximization and regression-based imputation, assume that the variables follow a multivariate normal distribution, an assumption that can lead to biased results when it does not hold. These methods can also become a bottleneck because they are computationally more complex than model-free methods. Furthermore, directly removing instances with missing values is problematic: it can discard important data, produce ineffective research samples, and cause research deviations. Therefore, this study proposes a novel clustering-based purity and distance imputation method to improve the handling of missing values. In the experiments, we collected eight different medical datasets and compared the proposed imputation method with the listed imputation methods under different missing-data situations. As imputation measures, the area under the curve (AUC) is used to evaluate performance on the imbalanced class datasets in the missing at random (MAR) and missing completely at random (MCAR) experiments, and accuracy is used to measure performance on the balanced class dataset in the missing not at random (MNAR) experiment. The root-mean-square error (RMSE) is also used to compare the proposed and the listed imputation methods. In addition, this study used the elbow method and the average silhouette method to find the optimal number of clusters for all datasets. The results showed that the proposed imputation method could improve imputation performance in terms of accuracy, AUC, and RMSE across different missing degrees and missing types.
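The RMSE-based comparison described above can be illustrated with a minimal sketch. It assumes a fully observed toy table, removes cells under MCAR, fills them in, and scores only the cells that were removed. Column-mean imputation is used purely as a stand-in baseline and is not the clustering-based purity and distance method proposed in the study. The function names (`mask_mcar`, `impute_column_mean`, `rmse_on_missing`) and the toy values are hypothetical.

```python
import math
import random


def mask_mcar(rows, rate, seed=0):
    """Return a copy of rows where each cell is set to None with probability `rate`.

    Every cell has the same chance of removal regardless of any value,
    which is the MCAR mechanism.
    """
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_column_mean(rows):
    """Stand-in baseline: fill each None with the mean of its column's observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def rmse_on_missing(truth, masked, imputed):
    """RMSE between true and imputed values, computed only over masked cells."""
    squared_errors = [
        (t - p) ** 2
        for t_row, m_row, p_row in zip(truth, masked, imputed)
        for t, m, p in zip(t_row, m_row, p_row)
        if m is None
    ]
    if not squared_errors:
        return 0.0
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy, fully observed data (hypothetical values, two numeric attributes).
truth = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]

masked = mask_mcar(truth, rate=0.3, seed=1)
imputed = impute_column_mean(masked)
print("RMSE on masked cells:", rmse_on_missing(truth, masked, imputed))
```

Any other imputation routine with the same input and output shape could replace `impute_column_mean`, so that several methods are scored on the same masked cells and missing rate.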
Funding
The authors received no financial support for the research.
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Cheng, CH., Huang, SF. A novel clustering-based purity and distance imputation for handling medical data with missing values. Soft Comput 25, 11781–11801 (2021). https://doi.org/10.1007/s00500-021-05947-3
DOI: https://doi.org/10.1007/s00500-021-05947-3