Abstract
When machine learning is used to design a prediction model in medical science, high accuracy is essential. High accuracy becomes difficult to achieve when values are unavailable in certain fields of the data set, so the issue of missing values must be dealt with effectively. This research work focuses on an efficient way to handle missing values. The authors propose a systematic methodology for the identification of missing values and test it on the Cleveland heart disease dataset from the UCI (University of California, Irvine) repository. Missing values are introduced using three different approaches, namely random, MISSHASH, and MISSFIB. Four imputation methods, k-nearest neighbor (KNN), multivariate imputation by chained equations (MICE), mean, and mode imputation, were analyzed with the help of four classifiers: Naive Bayes (NB), support vector machine (SVM), logistic regression (LR), and random forest (RF). The root mean square error (RMSE) of the classifiers was compared to find the best imputation method. It was found that the MICE imputation method performed better than the other imputation methods. Moreover, its accuracy is independent of the classifier and of the missing value distribution.
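The evaluation loop described in the abstract (introduce missing values, impute them, compare errors) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the data are synthetic rather than the Cleveland dataset, only the random missingness pattern is shown (MISSHASH and MISSFIB are not specified in the abstract), MICE is omitted, the KNN variant is a simplified one, and RMSE is measured directly on the imputed cells rather than on classifier output. All function names are hypothetical.

```python
import math
import random
from statistics import mean, mode


def inject_random_missing(rows, rate, rng):
    """Return a copy of rows with roughly `rate` of the cells set to None."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_column_stat(rows, stat):
    """Fill each missing cell with a per-column statistic (e.g. mean or mode)."""
    fills = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fallback of 0.0 for a fully missing column is an arbitrary assumption.
        fills.append(stat(observed) if observed else 0.0)
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def impute_knn(rows, k=3):
    """Fill each missing cell with the column mean over the k nearest donor rows.

    Distance is the root mean squared difference over columns observed in both rows.
    """
    def dist(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            # Donors are other rows that have an observed value in column j.
            donors = sorted(
                (dist(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )[:k]
            out[i][j] = mean(val for _, val in donors) if donors else 0.0
    return out


def rmse_on_missing(truth, masked, imputed):
    """RMSE between true and imputed values, over the masked cells only."""
    errs = [(t - p) ** 2
            for tr, mr, pr in zip(truth, masked, imputed)
            for t, m, p in zip(tr, mr, pr)
            if m is None]
    return math.sqrt(sum(errs) / len(errs)) if errs else 0.0


# Synthetic integer data: column 1 depends on column 0, column 2 is categorical.
rng = random.Random(0)
truth = []
for _ in range(60):
    x = rng.randint(30, 70)
    truth.append([x, 2 * x + rng.randint(-3, 3), rng.randint(0, 3)])

masked = inject_random_missing(truth, 0.1, rng)

results = {
    "mean": rmse_on_missing(truth, masked, impute_column_stat(masked, mean)),
    "mode": rmse_on_missing(truth, masked, impute_column_stat(masked, mode)),
    "knn": rmse_on_missing(truth, masked, impute_knn(masked, k=3)),
}
for name, score in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {score:.3f}")
```

A MICE-style imputer would fit a regression model for each column on the others and iterate; scikit-learn's experimental IterativeImputer is one related option, though whether it matches the procedure used in the paper is not stated in the abstract.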
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Rani, P., Kumar, R., Jain, A. (2021). Multistage Model for Accurate Prediction of Missing Values Using Imputation Methods in Heart Disease Dataset. In: Raj, J.S., Iliyasu, A.M., Bestak, R., Baig, Z.A. (eds) Innovative Data Communication Technologies and Application. Lecture Notes on Data Engineering and Communications Technologies, vol 59. Springer, Singapore. https://doi.org/10.1007/978-981-15-9651-3_53
DOI: https://doi.org/10.1007/978-981-15-9651-3_53
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9650-6
Online ISBN: 978-981-15-9651-3
eBook Packages: Intelligent Technologies and Robotics (R0)