Abstract
Diabetes mellitus is a metabolic disorder characterized by hyperglycemia, which results from the body's inability to secrete and respond to insulin. If not diagnosed in time or properly managed, diabetes can damage vital organs such as the eyes, kidneys, nerves, heart, and blood vessels, and can be life-threatening. Years of research on the computational diagnosis of diabetes have shown machine learning to be a viable approach to diabetes prediction. However, the accuracy reported to date suggests that there is still considerable room for improvement. In this paper, we propose a machine learning framework to improve the performance of diabetes prediction on the PIMA Indian dataset. Through analysis, we observe that the main challenges of the dataset, which impair learning, are feature selection and missing values. For each of these challenges, we propose a working solution that incorporates Spearman correlation and polynomial regression from a new perspective. Further, we optimize the random forest classifier by tuning its hyperparameters using grid search and repeated stratified k-fold cross-validation to build a robust random forest model suited to the prediction problem. Finally, through exhaustive experiments, we demonstrate that our proposed data preparation approaches lead to a robust machine learning framework for the diagnosis of diabetes mellitus, with training and test accuracy values that range from 98.96% to 100% and from 97.92% to 100%, respectively, which outperform all the state-of-the-art results. The source code for the proposed machine learning framework is made publicly available.
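The two data preparation ideas named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' released code, and the abstract does not specify how Spearman correlation or polynomial regression are applied, so the choices below are assumptions: features are ranked by the absolute Spearman correlation with the outcome, and missing values in one column are filled from a degree-2 polynomial fitted on a single predictor column. All function names and toy values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def fit_polynomial(x, y, degree=2):
    """Least-squares coefficients (lowest power first) via normal equations."""
    n = degree + 1
    # Augmented matrix [X^T X | X^T y] for the Vandermonde design matrix X.
    a = [[sum(xi ** (r + c) for xi in x) for c in range(n)]
         + [sum(yi * xi ** r for xi, yi in zip(x, y))] for r in range(n)]
    for col in range(n):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[r][n] / a[r][r] for r in range(n)]


def impute_with_polynomial(predictor, target, degree=2):
    """Fill None entries of target using a polynomial fitted on known rows."""
    known = [(p, t) for p, t in zip(predictor, target) if t is not None]
    coeffs = fit_polynomial([p for p, _ in known], [t for _, t in known], degree)

    def predict(p):
        return sum(c * p ** k for k, c in enumerate(coeffs))

    return [t if t is not None else predict(p) for p, t in zip(predictor, target)]


if __name__ == "__main__":
    # Made-up toy columns, not taken from the PIMA Indian dataset.
    outcome = [0, 0, 1, 0, 1, 1]
    features = {
        "feature_a": [1.0, 2.0, 5.0, 3.0, 6.0, 4.0],
        "feature_b": [3.0, 1.0, 2.0, 6.0, 4.0, 5.0],
    }
    ranking = sorted(features, key=lambda f: abs(spearman(features[f], outcome)),
                     reverse=True)
    print("features by |Spearman| with outcome:", ranking)
    print(impute_with_polynomial([1, 2, 3, 4, 5, 6], [2, 5, None, 17, 26, None]))
```

In practice, `scipy.stats.spearmanr` and `numpy.polyfit` would replace the hand-written helpers, and the tuning step described in the abstract could be assembled from scikit-learn's `GridSearchCV` and `RepeatedStratifiedKFold` around a `RandomForestClassifier`. The hyperparameter grid and fold settings are not stated in the abstract and are left unspecified here.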
Notes
1. The global epidemics of diabetes in the 21st century: Current situation and perspectives. https://pubmed.ncbi.nlm.nih.gov/31766915/
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Olisah, C., Adeleye, O., Smith, L., Smith, M. (2022). A Robust Machine Learning Framework for Diabetes Prediction. In: Arai, K. (eds) Proceedings of the Future Technologies Conference (FTC) 2021, Volume 2. FTC 2021. Lecture Notes in Networks and Systems, vol 359. Springer, Cham. https://doi.org/10.1007/978-3-030-89880-9_58
DOI: https://doi.org/10.1007/978-3-030-89880-9_58
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89879-3
Online ISBN: 978-3-030-89880-9
eBook Packages: Intelligent Technologies and Robotics (R0)