A Data-Driven Heart Disease Prediction Model Through K-Means Clustering-Based Anomaly Detection

  • Original Research
  • SN Computer Science

Abstract

Heart disease, also known as cardiovascular disease, has been the leading cause of death worldwide over the past few decades. For early diagnosis, a data-driven prediction model that considers the associated risk factors can play a significant role in the healthcare domain. However, building an effective model with machine learning techniques depends on the quality of the data, in particular on data free of “anomalies” or outliers. This research investigates anomaly detection in the healthcare domain to predict heart disease effectively using the unsupervised K-means clustering algorithm. Our proposed model first determines an optimal value of K using the Silhouette method and forms the clusters used to find the anomalies. We then eliminate the identified anomalies from the data and employ five popular machine learning classification techniques, namely K-nearest neighbor, random forest, support vector machine, naive Bayes, and logistic regression, to build the resultant prediction model. The efficacy of the proposed methodology is evaluated on a standard heart disease dataset. We also plot the data to check the accuracy of the anomaly detection in our experimental analysis.
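The anomaly-detection stage described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The abstract does not state how a point is judged anomalous once the clusters are formed, so the rule used here (distance to the assigned centroid greater than the mean distance plus `z` standard deviations) is an assumption, as are the farthest-first initialisation (chosen only to make the run deterministic) and the hypothetical names `kmeans`, `silhouette`, and `detect_anomalies`. Points are expected as tuples of numbers.

```python
import math
from statistics import mean, pstdev


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iters=100):
    """Lloyd's K-means with farthest-first initialisation (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Keep the previous centroid if a cluster ends up empty.
            updated.append(tuple(mean(col) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def silhouette(points, labels):
    """Mean silhouette coefficient (Rousseeuw, 1987) over all points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; treat as worst case
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(mean(dist(p, q) for q in other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


def detect_anomalies(points, k_values=(2, 3, 4, 5), z=2.0):
    """Pick K by silhouette, then flag points far from their centroid."""
    best_k, best_score, best_fit = None, -2.0, None
    for k in k_values:
        centroids, labels = kmeans(points, k)
        score = silhouette(points, labels)
        if score > best_score:
            best_k, best_score, best_fit = k, score, (centroids, labels)
    centroids, labels = best_fit
    d = [dist(p, centroids[l]) for p, l in zip(points, labels)]
    threshold = mean(d) + z * pstdev(d)  # assumed cutoff, not from the paper
    return best_k, [i for i, di in enumerate(d) if di > threshold]


# Toy data: two tight groups plus one stray point (index 8).
toy = [(0, 0), (1, 0), (0, 1), (1, 1),
       (10, 0), (11, 0), (10, 1), (11, 1), (16, 0.5)]
best_k, anomalies = detect_anomalies(toy)
print(best_k, anomalies)
```

In a pipeline like the one the abstract outlines, the rows at the returned indices would be dropped before the five classifiers are trained. In practice the features would normally be scaled first, and a library implementation of K-means and the silhouette score would replace these hand-written helpers.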

Author information

Corresponding author

Correspondence to Iqbal H. Sarker.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ripan, R.C., Sarker, I.H., Hossain, S.M.M. et al. A Data-Driven Heart Disease Prediction Model Through K-Means Clustering-Based Anomaly Detection. SN COMPUT. SCI. 2, 112 (2021). https://doi.org/10.1007/s42979-021-00518-7

  • DOI: https://doi.org/10.1007/s42979-021-00518-7
