Multistage Model for Accurate Prediction of Missing Values Using Imputation Methods in Heart Disease Dataset

Rani, Pooja; Kumar, Rajneesh; Jain, Anurag

doi:10.1007/978-981-15-9651-3_53

Pooja Rani, Rajneesh Kumar & Anurag Jain

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 59)


Abstract

When machine learning is used to design a prediction model in medical science, high accuracy is essential. Achieving it is difficult when values are unavailable in certain fields of the dataset, so missing values must be handled effectively. This research work focuses on an efficient way to handle missing values. The authors propose a systematic methodology for the identification of missing values and test it on the Cleveland heart disease dataset from the UCI (University of California, Irvine) repository. Missing values are introduced using three different approaches, namely random, MISSHASH, and MISSFIB. Four imputation methods, k-nearest neighbor (KNN), multivariate imputation by chained equations (MICE), mean, and mode imputation, were analyzed with the help of four classifiers: Naive Bayes (NB), support vector machine (SVM), logistic regression (LR), and random forest (RF). The root mean square error (RMSE) of the classifiers was compared to find the best imputation method. MICE was found to perform better than the other imputation methods. Moreover, its accuracy is independent of the classifier and of the missing value distribution.
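
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy `age` and `chol` values are hypothetical rather than taken from the Cleveland dataset, only the random masking approach is shown (MISSHASH and MISSFIB are not specified in the abstract), MICE is omitted for brevity, and RMSE is measured directly on the imputed entries instead of on classifier output. All function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter
from statistics import mean


def mask_at_random(values, fraction, seed=0):
    """Replace a fraction of entries with None; return the masked copy and the masked indices."""
    rng = random.Random(seed)
    n_missing = round(len(values) * fraction)
    idx = sorted(rng.sample(range(len(values)), n_missing))
    masked = [None if i in idx else v for i, v in enumerate(values)]
    return masked, idx


def fill(values, constant):
    """Mean or mode imputation: every missing entry receives the same constant."""
    return [constant if v is None else v for v in values]


def knn_fill(predictor, target, k=3):
    """Fill each missing target with the mean target of the k rows closest in the predictor."""
    known = [(p, t) for p, t in zip(predictor, target) if t is not None]
    out = []
    for p, t in zip(predictor, target):
        if t is None:
            nearest = sorted(known, key=lambda pt: abs(pt[0] - p))[:k]
            t = mean(v for _, v in nearest)
        out.append(t)
    return out


def rmse(truth, imputed, idx):
    """Root mean square error over the entries that were masked."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in idx) / len(idx))


# Hypothetical toy columns (illustrative values only, not the real dataset).
age = [63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48]
chol = [233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275]

masked, idx = mask_at_random(chol, fraction=0.25)
observed = [v for v in masked if v is not None]

candidates = {
    "mean": fill(masked, mean(observed)),
    "mode": fill(masked, Counter(observed).most_common(1)[0][0]),
    "knn": knn_fill(age, masked, k=3),
}
scores = {name: rmse(chol, filled, idx) for name, filled in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {score:.2f}")
```

Masking known values and then scoring the imputed entries against the held-back truth is what makes the methods comparable; on a column this small the ranking carries no statistical weight.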

Author information

Authors and Affiliations

Department of Computer Engineering, MMEC, Maharishi Markandeshwar (Deemed to be University), Mullana, Ambala, Haryana, India
Pooja Rani & Rajneesh Kumar
Virtualization Department, School of Computer Science, University of Petroleum and Energy Studies, Dehradun, India
Anurag Jain

Corresponding author

Correspondence to Pooja Rani.

Editor information

Editors and Affiliations

Department of Electronics and Communication Engineering, Gnanamani College of Technology, Namakkal, Tamil Nadu, India
Jennifer S. Raj
College of Engineering, Prince Sattam Bin Abdulaziz University, Al-Kharj, Saudi Arabia
Abdullah M. Iliyasu
Telecommunications, Czech Technical University in Prague, Prague, Czech Republic
Robert Bestak
School of Information Technology, Deakin University, Geelong, Australia
Zubair A. Baig

About this paper

Cite this paper

Rani, P., Kumar, R., Jain, A. (2021). Multistage Model for Accurate Prediction of Missing Values Using Imputation Methods in Heart Disease Dataset. In: Raj, J.S., Iliyasu, A.M., Bestak, R., Baig, Z.A. (eds) Innovative Data Communication Technologies and Application. Lecture Notes on Data Engineering and Communications Technologies, vol 59. Springer, Singapore. https://doi.org/10.1007/978-981-15-9651-3_53

DOI: https://doi.org/10.1007/978-981-15-9651-3_53
Published: 03 February 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9650-6
Online ISBN: 978-981-15-9651-3
eBook Packages: Intelligent Technologies and Robotics (R0)
