Machine Learning-Based Early Diabetes Prediction

  • Conference paper
Intelligent Sustainable Systems

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 213)

Abstract

Diabetes mellitus is one of the most critical diseases the world currently faces. Current diagnostic practice involves various tests at a lab or hospital, with treatment based on the outcome of the diagnosis. This study proposes a machine learning model that classifies a patient as diabetic or not, using the popular PIMA Indian Dataset. The dataset contains features such as Pregnancy, Blood Pressure, Skin Thickness, Age and Diabetes Pedigree Function, along with regular factors such as Glucose, BMI and Insulin. The objective of the study is to apply several pre-processing techniques that improve accuracy over simple models. The study compares several classification models, namely GaussianNB, Logistic Regression, KNN, Decision Tree Classifier, Random Forest Classifier and Gradient Boosting Classifier. First, missing values in the significant features are replaced by the median of each input variable, computed separately for diabetic and non-diabetic patients. Next, feature engineering adds new features obtained by categorizing existing features according to their ranges. Finally, hyperparameter tuning is carried out to optimize the model. Accuracy and area under the ROC curve (AUC) are used to validate the effectiveness of the proposed framework. Results indicate that the XGBoost classifier is the best-performing model, with 88% accuracy and an AUC of 0.948. Model performance is evaluated using the confusion matrix and ROC curve.
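
The two pre-processing steps named in the abstract (median imputation conditioned on the outcome, and range-based categorization of a feature) can be illustrated with a minimal sketch. This is not the authors' code: the toy records, the use of None to mark a missing value, the helper names, and the BMI cut-offs are all assumptions made for illustration.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value (an assumption).
records = [
    {"Glucose": 148,  "BMI": 33.6, "Outcome": 1},
    {"Glucose": None, "BMI": 43.1, "Outcome": 1},
    {"Glucose": 183,  "BMI": None, "Outcome": 1},
    {"Glucose": 85,   "BMI": 26.6, "Outcome": 0},
    {"Glucose": None, "BMI": 28.1, "Outcome": 0},
    {"Glucose": 89,   "BMI": None, "Outcome": 0},
]


def class_medians(rows, features, label="Outcome"):
    """Median of each feature, computed separately per outcome class."""
    groups = {}
    for row in rows:
        for feature in features:
            if row[feature] is not None:
                groups.setdefault((row[label], feature), []).append(row[feature])
    return {key: median(values) for key, values in groups.items()}


def impute_by_outcome(rows, features, label="Outcome"):
    """Return new rows where each None is replaced by its class median."""
    medians = class_medians(rows, features, label)
    return [
        {**row, **{f: medians[(row[label], f)] for f in features if row[f] is None}}
        for row in rows
    ]


def bmi_category(bmi):
    """Bin BMI into a categorical feature; cut-offs are illustrative only."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


filled = impute_by_outcome(records, ["Glucose", "BMI"])
for row in filled:
    # New engineered feature derived from the range of an existing one.
    row["BMI_cat"] = bmi_category(row["BMI"])
```

In this sketch the missing Glucose value of a diabetic record is filled from the diabetic group only, so the two classes keep distinct typical values. The original records are left unchanged because the function builds new dictionaries.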

Author information

Corresponding author

Correspondence to Deepa Elizabeth James.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this paper

James, D.E., Vimina, E.R. (2022). Machine Learning-Based Early Diabetes Prediction. In: Raj, J.S., Palanisamy, R., Perikos, I., Shi, Y. (eds) Intelligent Sustainable Systems. Lecture Notes in Networks and Systems, vol 213. Springer, Singapore. https://doi.org/10.1007/978-981-16-2422-3_52
