A novel clustering-based purity and distance imputation for handling medical data with missing values

Cheng, Ching-Hsue; Huang, Shu-Fen

doi:10.1007/s00500-021-05947-3

A novel clustering-based purity and distance imputation for handling medical data with missing values

Application of soft computing
Published: 16 June 2021

Volume 25, pages 11781–11801, (2021)
Cite this article

Soft Computing Aims and scope Submit manuscript

Ching-Hsue Cheng¹ &
Shu-Fen Huang²

Abstract

Nowadays, people pay increasing attention to health, and the integrity of medical records has been put into focus. Recently, medical data imputation has become a very active field because medical data usually have missing values. Many imputation methods have been proposed, but many model-based imputation methods such as expectation–maximization and regression-based imputation based on the variables data have a multivariate normal distribution, which assumption can lead to biased results. Sometimes, this becomes a bottleneck, such as computationally more complex than model-free methods. Furthermore, directly removing instances with missing values has several problems, and it is possible to lose the important data, produce ineffective research samples, and cause research deviations. Therefore, this study proposes a novel clustering-based purity and distance imputation method to improve the handling of missing values. In the experiment, we collected eight different medical datasets to compare the proposed imputation methods with the listed imputation methods with regard to the results of different situations. In imputation measures, the area under the curve (AUC) is used to evaluate the performance of the imbalanced class datasets in MAR and MCAR experiments, and accuracy is applied to measure its performance of the balanced class in MNAR experiment. Finally, the root-mean-square error (RMSE) is also used to compare the proposed and the listing imputation methods. In addition, this study utilized the elbow method and the average silhouette method to find the optimal number of clusters for all datasets. Results showed that the proposed imputation method could improve imputation performance in the accuracy, AUC, and RMSE of different missing degrees and missing types.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range

Article Open access 19 December 2014

A survey on missing data in machine learning

Article Open access 27 October 2021

Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

Article 09 November 2022

References

Al SA, Lotfi A, Coleman S (2013) Intelligent synthetic composite indicators with application. Soft Comput 17:2349–2364. https://doi.org/10.1007/s00500-013-1098-3
Article Google Scholar
Amiri M, Jensen R (2016) Missing data imputation using fuzzy-rough methods. Neurocomputing 205:152–164
Google Scholar
Andridge RR, Little RJA (2010) A review of hot deck imputation for survey non-response. Int Stat Review 78:40–64
Google Scholar
Awan SE, Bennamoun M, Sohel F, Sanfilippo FM, Dwivedi G (2021) Imputation of missing data with class imbalance using conditional generative adversarial networks. Neurocomputing. https://doi.org/10.1016/j.neucom.2021.04.010
Article Google Scholar
Batista GE, Monard MC (2003) An analysis of four missing data treatment methods for supervised learning. Appl Artif Intell 17:519–533
Google Scholar
Chang C-C, Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27
Google Scholar
Cheng CH, Chan CP, Sheu YJ (2019) A novel purity-based k nearest neighbors imputation method and its application in financial distress prediction. Eng Appl Artif Intell 81:283–299
Google Scholar
Cheng CH, Chang JR, Huang HH (2020) A novel weighted distance threshold method for handling medical missing values. Comput Biol Med 122:103824
Google Scholar
Dinh D-T, Huynh V-N, Sriboonchitta S (2021) Clustering mixed numerical and categorical data with missing values. Inf Sci. https://doi.org/10.1016/j.ins.2021.04.076
Article MathSciNet Google Scholar
Donders AR, van der Heijden GJ, Stijnen T, Moons KG (2006) Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 59:1087–1091
Google Scholar
Dubey A, Rasool A (2020) Clustering-based hybrid approach for multivariate missing data imputation. Int J Adv Comput Sci Appl. https://doi.org/10.14569/IJACSA.2020.0111186
Article Google Scholar
Enders CK (2017) Multiple imputation as a flexible tool for missing data handling in clinical research. Behav Res Ther 98:4–18
Google Scholar
Forgy EW (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21(3):768–769
Google Scholar
Galan CO, Lasheras FS, de Juez FJ, Sanchez AB (2017) Missing data imputation of questionnaires by means of genetic algorithms with different fitness functions. J Comput Appl Math 311:704–717
MathSciNet MATH Google Scholar
García-Laencina PJ, Abreu PH, Abreu MH, Afonoso N (2015) Missing data imputation on the 5-year survival prediction of breast cancer patients with unknown discrete values. Comput Biol Med 59:125–133
Google Scholar
Hathaway RJ, Bezdek JC (2001) Fuzzy c-means clustering of incomplete data. IEEE Trans Syst Man Cybern Part B 31(5):735–744. https://doi.org/10.1109/3477.956035
Article Google Scholar
Jerez JM, Molina I, Subirats JL, Franco L (2006) missing data imputation in breast cancer prognosis. In: Proceedings of the 24th IASTED international conference on Biomedical engineering. p.323–328, February 15–17, 2006, Innsbruck, Austria
Jerez JM, Molina I, Garcia-Laencina PJ, Alba E, Ribelles N, Martin M, Franco L (2010a) Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif Intell Med 50(2):105–115
Google Scholar
Jerez JM, Molina I, Garcia-Laencina PJ, Alba E, Ribelles N, Martin M, Franco L (2010b) Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif Intell Med 50:105–115
Google Scholar
John GH, langley P (1995) Estimating continuous distributions in Bayesian classifiers. In: proceedings of the eleventh conference on uncertainty in artificial intelligence, pp. 338–345, San Mateo, CA: Morgan Kaufmann
Keerin P, Kurutach W, Boongoen T (2016) A cluster-directed framework for neighbour based imputation of missing value in microarray data. Int J Data Min Bioinform 15(2):165–193
Google Scholar
Ketchen DJ, Shook CL (1996) The application of cluster analysis in strategic management research: an analysis and critique. Strateg Manag J 17(6):441–458
Google Scholar
Kharrazi H, Wang C, Scharfstein D (2014) Prospective EHR-based clinical trials: the challenge of missing data. J Gen Intern Med 29(7):976–978
Google Scholar
Lee JY, Styczynski MP (2018) NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data. Metabolomics 14:1–12
Google Scholar
Li D, Gu H, Zhang L (2010) A fuzzy c-means clustering algorithm based on nearest-neighbor intervals for incomplete data. Exp Syst Appl 37(10):6942–6947
Google Scholar
Lin W-C, Tsai C-F (2020) Missing value imputation: a review and analysis of the literature (2006–2017). Artif Intell Rev 53:1487–1509
Google Scholar
M€uhlenbruch K, Kuxhaus O, Giuseppe R, Boeing H, Weikert C, Schulze MB (2017) Multiple imputation was a valid approach to estimate absolute risk from a prediction model based on case–cohort data. J Clin Epidemiol 84:130–141
Google Scholar
Mitra S, Pal SK (1995) Fuzzy multi-layer perceptron, inferencing and rule generation. IEEE Trans Neural Networks 6:51–63
Google Scholar
Moayedikia A, Ong KL, Boo YL, Yeoh WG, Jensen R (2017) Feature selection for high dimensional imbalanced class data using harmony search. Eng Appl Artif Intell 57:38–49
Google Scholar
Ondeck NT, Fu MC, Skrip LA, McLynn RP, Su EP, Grauer JN (2018a) Treatments of missing values in large national data affect conclusions: the impact of multiple imputation on arthroplasty research. J Arthroplasty 33(3):661–667
Google Scholar
Ondeck NT, Fu MC, Skrip LA, McLynn RP, Su EP, Grauer JN (2018b) Treatments of missing values in large national data affect conclusions: the impact of multiple imputation on arthroplasty research. J Arthroplasty 33:661–667
Google Scholar
Pearl J, Russell S (2000) Bayesian networks TR R-277. University of California
Google Scholar
Polit DF, Beck CT (2012) Nursing research: generating and assessing evidence for nursing practice, 9th edn. Wolters Kluwer Health, Lippincott Williams & Wilkins, Philadelphia
Google Scholar
Pombo N, Rebelo P, Araújo P, Viana J (2015) Combining data imputation and statistics to design a clinical decision support system for post-operative pain monitoring. Procedia Comput Sci 64:1018–1025
Google Scholar
Pombo N, Rebelo P, Araújo P, Viana J (2016) Design and evaluation of a decision support system for pain management based on data imputation and statistical models. Measurement 93:480–489
Google Scholar
Quinlan JR (1992) C45 programs for machine learning. Morgan Kaufmann, San Mateo
Google Scholar
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
MATH Google Scholar
Rubin DB (1976) Inference and missing data. Biometrika 63:581–590
MathSciNet MATH Google Scholar
Sammut C, Webb GI (2010) Encyclopedia of machine learning. Springer, Boston
MATH Google Scholar
Sandercock PA, Niewada M, Członkowska A (2011) The international stroke trial database. Trials 12:101
Google Scholar
Schafer JL (1997) Analysis of incomplete multivariate data, New York. Chapman & Hall
MATH Google Scholar
Shao J (2000) Cold deck and ratio imputation. Surv Pract 26:79–85
Google Scholar
Sim J, Lee JS, Kwon O (2015) Missing values and optimal selection of an imputation method and classification algorithm to improve the accuracy of ubiquitous computing applications. Math Prob Eng 12:1–14
Google Scholar
Sterne J, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR (2009) Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 338:157–160
Google Scholar
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17:520–525
Google Scholar
Wagstaff K (2004) Clustering with missing values: no imputation required. In: Banks D, McMorris FR, Arabie P, Gaul W (eds) Classification, clustering, and data mining applications: studies in classification, data analysis, and knowledge organisation. Springer, Berlin
Google Scholar
Zhang Z (2016) Multiple imputation with multivariate imputation by chained Equation (MICE) package. Ann Transl Med 4(2):30
Google Scholar
Zhang Z, Yang X, Li H, Li W, Yan H, Shi F (2017) Application of a novel hybrid method for spatiotemporal data imputation: a case study of the Minqin County groundwater level. J Hydrol 553:384–397
Google Scholar

Download references

Funding

The authors received no financial support for the research.

Author information

Authors and Affiliations

Department of Information Management, National Yunlin University of Science and Technology, 123 University Road, section 3, Douliu, 64002, Yunlin, Taiwan
Ching-Hsue Cheng
Department of Multimedia Design, Chihlee University of Technology, Banqiao Dist., New Taipei City, 22050, Taiwan
Shu-Fen Huang

Authors

Ching-Hsue Cheng
View author publications
You can also search for this author in PubMed Google Scholar
Shu-Fen Huang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ching-Hsue Cheng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 13 and 14.

Table 13 Internal comparison for the other UCI datasets in MAR

Full size table

Table 14 Internal comparison for the other UCI datasets in MCAR

Full size table

Rights and permissions

Reprints and permissions

About this article

Cite this article

Cheng, CH., Huang, SF. A novel clustering-based purity and distance imputation for handling medical data with missing values. Soft Comput 25, 11781–11801 (2021). https://doi.org/10.1007/s00500-021-05947-3

Download citation

Accepted: 04 June 2021
Published: 16 June 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s00500-021-05947-3

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A novel clustering-based purity and distance imputation for handling medical data with missing values

Abstract

Access this article

Similar content being viewed by others

Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range

A survey on missing data in machine learning

Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Ethical approval

Additional information

Publisher's Note

Appendix

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A novel clustering-based purity and distance imputation for handling medical data with missing values

Abstract

Access this article

Similar content being viewed by others

Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range

A survey on missing data in machine learning

Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Ethical approval

Additional information

Publisher's Note

Appendix

Appendix

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation