Multistage Model for Accurate Prediction of Missing Values Using Imputation Methods in Heart Disease Dataset

  • Conference paper
Innovative Data Communication Technologies and Application

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 59)

Abstract

When machine learning is used to design a prediction model in medical science, high accuracy is essential. High accuracy is difficult to achieve when values are unavailable in certain fields of the dataset, so missing values must be handled effectively. This research work focuses on an efficient way to handle missing values. The authors propose a systematic methodology for the identification of missing values and test it on the Cleveland heart disease dataset from the UCI (University of California, Irvine) repository. Missing values are introduced using three different approaches, namely random, MISSHASH, and MISSFIB. Four imputation methods, k-nearest neighbor (KNN), multivariate imputation by chained equations (MICE), mean, and mode imputation, were analyzed with the help of four classifiers: Naive Bayes (NB), support vector machine (SVM), logistic regression (LR), and random forest (RF). The root mean square error (RMSE) of the classifiers was compared to identify the best-performing imputation method. MICE imputation was found to perform better than the other imputation methods. Moreover, its accuracy is independent of the classifier and of the missing value distribution.
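The evaluation loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It covers only random removal of values and the mean and mode imputers. MISSHASH, MISSFIB, KNN, and MICE are left out because the abstract does not define them in enough detail to reproduce. As a simplification, RMSE is computed here between the imputed and the original values, whereas the paper compares the RMSE of classifiers. The function names and the sample column are hypothetical.

```python
import math
import random
import statistics


def inject_missing(values, fraction, seed=0):
    """Return a copy of values with about `fraction` of entries set to None at random."""
    rng = random.Random(seed)
    n_missing = int(round(fraction * len(values)))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def impute(values, strategy):
    """Fill None entries with the mean or mode of the observed entries."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def rmse_on_missing(truth, masked, filled):
    """RMSE between true and imputed values, restricted to the masked positions."""
    idx = [i for i, v in enumerate(masked) if v is None]
    if not idx:
        return 0.0
    return math.sqrt(sum((truth[i] - filled[i]) ** 2 for i in idx) / len(idx))


if __name__ == "__main__":
    # Hypothetical numeric column, not taken from the Cleveland dataset.
    truth = [120, 130, 130, 140, 150, 130, 125, 135]
    masked = inject_missing(truth, fraction=0.25, seed=1)
    for strategy in ("mean", "mode"):
        filled = impute(masked, strategy)
        print(strategy, rmse_on_missing(truth, masked, filled))
```

Restricting the error to the masked positions keeps the score from being diluted by entries that were never removed. A fuller comparison would repeat the loop over several seeds and missing-value fractions.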

Author information

Corresponding author

Correspondence to Pooja Rani.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Rani, P., Kumar, R., Jain, A. (2021). Multistage Model for Accurate Prediction of Missing Values Using Imputation Methods in Heart Disease Dataset. In: Raj, J.S., Iliyasu, A.M., Bestak, R., Baig, Z.A. (eds) Innovative Data Communication Technologies and Application. Lecture Notes on Data Engineering and Communications Technologies, vol 59. Springer, Singapore. https://doi.org/10.1007/978-981-15-9651-3_53
