Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms

Chang, Victor; Bailey, Jozeene; Xu, Qianwen Ariel; Sun, Zhili

doi:10.1007/s00521-022-07049-z

S.I.: AI-based e-diagnosis
Published: 24 March 2022

Volume 35, pages 16157–16173, (2023)

Neural Computing and Applications


Victor Chang ORCID: orcid.org/0000-0002-8012-5852¹,
Jozeene Bailey²,
Qianwen Ariel Xu² &
…
Zhili Sun³


Abstract

This paper proposes an e-diagnosis system based on machine learning (ML) algorithms to be implemented in the Internet of Medical Things (IoMT) environment, particularly for diagnosing diabetes mellitus (type 2 diabetes). However, ML applications tend to be mistrusted because of their inability to show the internal decision-making process, resulting in slow uptake by end-users within certain healthcare sectors. This research describes the use of three interpretable supervised ML models, namely the Naïve Bayes classifier, the random forest classifier, and the J48 decision tree, which are trained and tested on the Pima Indians diabetes dataset in the R programming language. The performance of each algorithm is analyzed to determine the one with the best accuracy, precision, sensitivity, and specificity. An assessment of the decision process is also made to improve the model. It can be concluded that a Naïve Bayes model works well with a more fine-tuned selection of features for binary classification, while random forest works better with more features.


1 Introduction

Diabetes mellitus, or simply diabetes, is a leading non-communicable disease (NCD) globally, with the number of cases almost doubling since 1980 [1]. It is a chronic illness that develops either when the pancreas is not able to generate sufficient insulin or when the body does not effectively utilize the insulin produced [1]. There is no cure for this disease. Diabetes is thought to result from a combination of genetic and environmental factors. Risk factors attributed to diabetes include ethnicity, family history of diabetes, age, excess weight, unhealthy diet, physical inactivity, and smoking. In addition, the absence of early detection of diabetes is known to contribute to the development of other chronic diseases such as kidney disease. Furthermore, pre-existing non-communicable diseases put patients at high risk, as they make patients more susceptible to infectious diseases such as COVID-19 [2].

Predicting an individual's risk of and susceptibility to a chronic illness like diabetes is an important task. Diagnosing chronic illness at an early stage saves on medical costs and reduces the risk of more complicated health problems. Even in emergencies where a patient may be unconscious or unable to communicate, it is important that accurate deductions can be made from immediately measurable medical indicators to help clinicians make better decisions for patient treatment in high-risk situations.

The majority of existing NCD cases remain undiagnosed, with patients suffering few symptoms during the initial phases of the disease, which poses a huge challenge for early detection and diagnosis. One advantage of treating patients in the early stages of a non-communicable disease is that they can avoid expensive treatments later in life as the disease gets worse. The problem is compounded by a lack of medical practitioners in underserved regions such as rural and remote villages. In such cases, the combination of the Internet of Medical Things (IoMT) and machine learning models can assist healthcare professionals in the early detection and diagnosis of NCDs by providing predictive tools for more efficient and timely decision-making.

However, it should be noted that machine learning solutions tend to be mistrusted by some people because of what may be referred to as a 'black-box' effect: an inability to show their internal decision-making process. This lack of explainability in machine learning models causes skepticism among consumers and results in slow uptake by end-users within the healthcare sector. The ability to explain both the reasoning behind a machine learning prediction and the process by which it is reached is crucial to building trust, particularly in the healthcare field, where mistakes could be fatal.

This paper seeks to develop an e-diagnosis system for detecting and classifying diabetes as an IoMT application. Through the use of machine learning algorithms [Naïve Bayes, random forest, and decision tree (J48)], the system will be able to predict whether a person is at risk for diabetes based on several risk factors, provide doctors with a preliminary diagnosis, and feed the doctor's guidance on diet, exercise, and blood glucose testing back to patients.

These classification models were evaluated using several metrics, including accuracy, precision, sensitivity, F-measure, and the area under the receiver operating characteristic (AUROC) curve, to identify the best-performing classifier. Several significant features that can be used to predict the severity of diabetes were extracted from the top classification model.
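To make the evaluation metrics named above concrete, the following is a minimal Python sketch of how they can be computed from confusion-matrix counts and predicted scores. It is illustrative only: the paper's experiments are carried out in R, and the function names and the counts used here are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero; no guarding is done in this sketch.
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores, chosen only to exercise the functions.
metrics = classification_metrics(tp=60, fp=20, tn=130, fn=20)
area = auroc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.6, 0.1])
```

The pairwise form of AUROC is quadratic in the number of samples, which is acceptable for a dataset of this size; a rank-based computation would be preferable for larger data.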

The Pima Indian Diabetes dataset is employed for this experiment. Pima Indians are a Native American group that lives in Mexico and Arizona, USA [3]. This group was deemed to have a high incidence rate of diabetes mellitus. Thus, research around them was thought to be significant to and representative of global health [4]. The Pima Indian Diabetes dataset consisting of Pima Indian females 21 years and older is a popular benchmark dataset [5]. This group is also significant to members of underrepresented minority or indigenous groups.

The features of the dataset comprise measures that do not require extensive testing. This property is essential in emergency situations and in patient self-care, which has become more popular.

The methodology is as follows: prepare the dataset, followed by data pre-processing such as dealing with missing values and categorical values, imputation, and standardization. Feature selection will be performed by using a variety of tools. Lastly, the classifiers' performance before and after feature selection will be evaluated further.
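As an illustration of the imputation and standardization steps mentioned above, the following Python sketch fills missing values with the median and then rescales a feature to zero mean and unit variance. The choice of median imputation, the use of None to mark a missing value, and the sample readings are all assumptions made for this sketch; the text does not specify which imputation method is used, and the actual analysis is performed in R.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical readings for one feature; None marks a missing value.
glucose = [148, 85, None, 89, 137]
glucose_filled = impute_median(glucose)
glucose_scaled = standardize(glucose_filled)
```

In a real pipeline, the median, mean, and standard deviation would be estimated on the training split only and then reused on the test split, so that no information leaks from the test data.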

The organization of this paper is outlined as follows: Sect. 2 presents the literature review, Sect. 3 provides details on data cleaning, exploration, and feature selection, and Sect. 4 presents the methodology for analysis and evaluation of the dataset. Finally, Sect. 5 concludes the paper with a discussion of future research.

2 Literature review

2.1 Internet of Medical Things (IoMT) and artificial intelligence algorithms

The Internet of Medical Things (IoMT) is the application of the Internet of Things (IoT) in the medical field. Utilizing networking technologies, the IoMT aims to connect medical equipment and its applications with healthcare IT systems [6]. This development has changed the medical field through its novel remote healthcare systems, bringing social benefits, better perception, and reliable detection of illness. Benefiting from the constant computing of the IoT, it becomes easier to manage clinical resources such as patient data, medical orders, medical instruments, and remedies [7]. The development of the IoMT has brought about tremendous changes in promoting disease management, enhancing disease diagnostic and treatment techniques, and lowering healthcare costs and mistakes. This transformation has had a significant influence on healthcare quality for both frontline healthcare professionals and patients. The IoMT is a thriving force for the researcher, the medical professional, the patient, and the insurer, enabling numerous use cases, for example, telemedical support, data insights, drug management, operation enhancement, and patient tracking [8]. In particular, the IoMT offers various services to medical professionals, including delivering feedback to medical staff and providing equipment data and settings based on the needs of the patient and the specialist. The IoMT also gives rapid and easy access to various reports that help surgeons in operating rooms during surgeries [9].

The value of the IoMT is growing as a result of the symbiotic rise of artificial intelligence (AI). However, one of the most significant challenges arising from this development, which a number of academics have confronted, is the volume of data produced [10]. Because the amount of data acquired is massive, machine learning technology is needed, as it is good at processing and analyzing the data, extracting valuable information from it, and then visualizing the results [11].

For chronic diseases like diabetes, AI, including machine learning and deep learning, plays an extremely important and effective role in supporting doctors’ decision-making and monitoring and managing patients [12, 13]. Specifically, the combination of IoMT and AI can bring two benefits to the diagnosis and treatment of chronic diseases. On the one hand, an e-diagnosis system based on AI can efficiently analyze and classify the data obtained in IoMT to make a preliminary diagnosis of patients and provide support for the doctor to make the final diagnosis and specify the treatment plan. On the other hand, this e-diagnosis system makes it possible to realize remote supervision and management of patients with chronic illness. For example, the root of the diabetes management problem lies in the self-management of patients. The key to solving this problem is to tell patients how to monitor blood sugar, arrange diet, exercise, and rationally use drugs. The diabetes management system based on IoT technology provides the possibility to solve this problem. In remote areas lacking medical experts and professional medical equipment, mobile devices can provide data to the e-diagnosis system to use the services provided by IoMT to detect and classify diseases [9].

In this paper, an e-diagnosis system for detecting and classifying diabetes as an IoMT application is proposed, as shown in Fig. 1. Employing ML algorithms, this system aims to predict the diagnosis of diabetes based on patient data, provide doctors with a preliminary diagnosis, and return the doctor's guidance on diet, exercise, and blood glucose testing to patients. In addition, as shown on the left side of the figure, the IoMT enables medical systems, applications, and devices to connect with each other. Therefore, a patient's profile can be assessed by a doctor remotely through the Internet and shared by doctors from different medical institutions, whether community hospitals or large hospitals. In this way, the amount of paper medical records can be reduced to a large extent, and the patient does not need to return to the same hospital, or even attend a hospital in person, for follow-up visits.

2.2 Intelligent methods of diabetes prediction

By clarifying common problems, the emerging techniques in data science can bring benefits to other fields of science, including medicine. Numerous studies have employed various machine learning or AI methods for diabetes prediction, such as artificial neural network (ANN), support vector machine, gradient boosting decision tree, and Naive Bayes.

Komi et al. [14] use five data mining techniques [ANN, extreme learning machine (ELM), Gaussian mixture model (GMM), support vector machine (SVM), and logistic regression] to explore the early prediction of diabetes. Their results show that ANN performs best among the five techniques. Similar to Komi et al., Ramanujam et al. [15] and Kumar et al. [16] also contribute to the early prediction of diabetes, but with different approaches. Early diagnosis of diabetes and proper treatment affect costs and mortality in the later stages, so the cost of early diagnosis and testing is a crucial factor: people in rural areas often cannot afford early diagnosis and miss timely treatment, resulting in higher mortality [17]. In order to help rural Indian people, Ramanujam et al. [15] develop a multilingual decision support system that integrates predictive models with a clinical decision support system. A design feature of the system is that users can evaluate diabetes not only with the help of nursing assistants but also by themselves. Kumar et al. [16] compare the performance of the CatBoost technique with other ML techniques, including K-nearest neighbor, logistic regression, stochastic gradient descent, Gaussian Naive Bayes, and multilayer perceptron, in the early prediction of diabetes. In their research, CatBoost has the highest accuracy.

In addition, AI algorithms are also employed to analyze and classify iris images to diagnose diabetes. Samant and Agarwal [18, 19] study the diagnosis of diabetes through changes in pigmentation in certain areas of the iris by using several ML algorithms. They use image pre-processing methods to isolate the iris and crop out certain areas. Then, they use textural, statistical, and wavelet features to observe the variances in the tissue pigmentation. Finally, five classifiers are employed to classify whether the patient has diabetes. Their results show that random forest outperforms the other classifiers.

Although AI and machine learning pervade the fields of healthcare and non-communicable chronic diseases, their actual rate of medical application is very low because these complex algorithms and models lack explanation. Based on the existing literature, this paper chooses three classifier models, Naïve Bayes, random forest classifier, and J48 decision tree, to classify the Pima Indians Diabetes dataset in the R programming language. However, unlike previous work, the purpose of this study is to employ interpretable ML models so that end-users can clearly understand how we judge which features are important and how the choice of features affects the model's prediction results.

2.3 The selected machine learning algorithms

2.3.1 J48 decision tree

A decision tree (DT) is a supervised ML algorithm widely used for classification and regression problems. A leaf node in a decision tree represents a classification outcome, and an internal node represents a test on an attribute. Quinlan [20] calls the algorithm employed to build the decision tree ID3, which uses a top-down learning method. The process of building a DT is as follows: first, the most appropriate attribute is selected for the root node; second, the instances are divided into a number of subsets such that the instances in each subset share the same value of that attribute; finally, the procedure is repeated recursively on every subset until all instances in a subset belong to the same class [21]. Figure 2 shows part of a diagnosis decision tree, which can be interpreted easily. For instance, according to the tree, if a patient does not have inter-systolic noise but has pre-cordial pain, then he or she has a prolapse.
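The first step above, selecting the most appropriate attribute, can be sketched in Python using information gain, the criterion usually associated with ID3. This is a hedged illustration rather than the J48 procedure used in this study: J48 may rely on a different splitting criterion, and the attribute names and toy records below are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on one attribute."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute with the highest information gain (candidate for the root)."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy records with two binary attributes.
rows = [
    {"glucose_high": True, "age_over_40": True},
    {"glucose_high": True, "age_over_40": False},
    {"glucose_high": False, "age_over_40": True},
    {"glucose_high": False, "age_over_40": False},
]
labels = [1, 1, 0, 0]
root = best_attribute(rows, labels, ["glucose_high", "age_over_40"])
```

In this toy example the hypothetical attribute glucose_high separates the two classes perfectly, so it is chosen as the root; a full tree builder would then apply the same selection recursively to each subset.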

The decision tree algorithm has been employed in many scientific fields, including medicine. For example, Rochmawati et al. [22] use the DT algorithm to classify COVID-19 symptoms. They conclude that, compared with the Hoeffding tree, DT performs better but is more complicated. Other diseases can also be intelligently diagnosed by DT, for instance, Lupus disease [23] and coronary artery disease [24].

2.3.2 Random forest

Random forest (RF) is an extension of the decision tree and is composed of numerous single decision trees, each of which produces a class prediction. The class with the most votes in the forest becomes the random forest classifier's final prediction. For example, as shown in Fig. 3, among nine single decision trees in the forest, the predictions of six trees are 1, and those of the remaining three trees are 0. Therefore, the prediction of the RF is 1. The key to the good performance of this classifier is that the trees in the forest are relatively uncorrelated with each other, ensuring that the decision they make as a whole is better than the decisions made by each of them individually [25].

Random forest uses a simple and powerful basic concept, called the wisdom of the crowd. The low correlation between trees is crucial to the success of the model. Under this premise, even if the predictions of several trees are incorrect, as long as the predictions of most other trees are correct, the trees as a group can still reach the correct prediction. In other words, the random forest model performs well because a large number of relatively uncorrelated models operating as a whole perform better than any single constituent model.
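The voting example and the wisdom-of-the-crowd argument can be illustrated with a short Python sketch. The first function reproduces the majority vote over nine trees described above. The second computes the probability that a majority of trees is correct under the idealized assumption that the trees err completely independently, which real forests only approximate; the per-tree accuracy of 0.7 is a hypothetical value.

```python
from collections import Counter
from math import comb


def majority_vote(predictions):
    """Class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


def majority_correct_probability(n_trees, p_correct):
    """Probability that more than half of n_trees are correct.

    Idealized assumption: trees are independent and each is correct with
    probability p_correct; n_trees is odd so that no tie can occur.
    """
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )


# Six trees vote 1 and three trees vote 0, as in the example above.
votes = [1] * 6 + [0] * 3
forest_prediction = majority_vote(votes)

# Hypothetical per-tree accuracy of 0.7 for a forest of nine trees.
ensemble_accuracy = majority_correct_probability(9, 0.7)
```

Under these assumptions the ensemble is more likely to be correct than any single tree, and the advantage shrinks as the trees become more correlated.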

Surface-enhanced Raman scattering (SERS) technology is very useful for analyzing biological samples. Nevertheless, it is difficult to obtain the required information from the collected data in the absence of labeled molecules. Therefore, Seifert [26] combines the random forest method with SERS data to solve this problem. The outcomes indicate that this approach is able to enhance the performance of SERS technology. Apart from biology, RF can also be used in the areas of agriculture [27] and medical science [18, 19].

2.3.3 Naïve Bayes

The Bayesian classifier is a statistical classifier that operates according to Bayes' theorem, classifying data into predetermined categories using conditional probability. Conditional probability can be understood as the probability that an event will take place given that other events have already taken place. Bayes' rule is used to estimate the probability of each category given the attribute values of an input instance. The term "naive" in the algorithm's name refers to its assumption that the attribute values are independent of one another given the category.
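A minimal Python sketch of this idea for categorical attributes is given below: the score of each class is its prior probability multiplied by the conditional probability of every attribute value, and the scores are then normalized. The add-one smoothing, the restriction to binary attributes, and the toy records are assumptions made for this sketch and are not taken from the paper, whose models are built in R.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count classes and, per class, the values taken by each attribute."""
    class_counts = Counter(labels)
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for attribute, value in row.items():
            value_counts[label][attribute][value] += 1
    return class_counts, value_counts


def posterior(model, row):
    """Normalized P(class | row) under the naive independence assumption."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for attribute, value in row.items():
            # Add-one smoothing, assuming each attribute has two values.
            seen = value_counts[label][attribute][value]
            score *= (seen + 1) / (count + 2)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


# Hypothetical toy records with two binary attributes.
rows = [
    {"glucose_high": True, "bmi_high": True},
    {"glucose_high": True, "bmi_high": False},
    {"glucose_high": False, "bmi_high": True},
    {"glucose_high": False, "bmi_high": False},
    {"glucose_high": False, "bmi_high": False},
    {"glucose_high": True, "bmi_high": False},
]
labels = [1, 1, 1, 0, 0, 0]
model = train_naive_bayes(rows, labels)
probabilities = posterior(model, {"glucose_high": True, "bmi_high": True})
```

Because the stored counts can be inspected directly, the contribution of each attribute to a prediction remains visible, which is one reason this classifier is regarded as interpretable.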

Naive Bayes (NB) is regarded as both a descriptive and a predictive algorithm. The probabilities are descriptive and are then employed to predict the categories of previously unseen data. This method has several merits. First, it is easy to use. Second, the amount of training data NB needs for classification is not necessarily large. In addition, although the NB classifier is naively designed and its assumption seems too simple, it performs well in a number of complicated real-world situations [28].

Pandiangan et al. [29] consider, in their applied AI research, that a student's study time and duration is an essential index for evaluating the quality of a university. They then employ the NB and DT classification algorithms to predict students' study periods, evaluate academic performance, and identify correlations for improving the quality of the university. Also in the field of education, Daniati [30] develops a decision support system that helps students select suitable programs using DBSCAN and Naive Bayes. In a different domain, Akbar et al. [31] integrate the Internet of Things with the NB algorithm to develop an intelligent laundry mobile application.

3 Data and methodology

3.1 Dataset exploration and pre-processing

Although there are now larger, more complex diabetes datasets, the Pima Indian Diabetes dataset has remained a benchmark for diabetes classification research. Given the presence of a binary outcome variable, the dataset naturally lends itself to supervised learning and, in particular, logistic regression. However, researchers have not limited themselves to a single type of model and have employed various ML algorithms to produce classification models based on this dataset.

In this research, our focus is to analyze the Pima Indian Diabetes dataset with advanced algorithms so that it can work effectively with IoMT. The dataset was downloaded from Kaggle (https://www.kaggle.com/uciml/pima-indians-diabetes-database) and is available under a CC0: Public Domain License; it is properly anonymized and does not contain any identifiable features of the patient subjects. It records eight predictor characteristics and the corresponding classification, giving 9 columns and 768 rows (500 non-diabetics and 268 diabetics). The binary outcome variable takes the value 0 or 1, where 0 indicates a negative test for diabetes and 1 indicates a positive test. Table 1 shows the dataset features (columns) and their descriptions.
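The class counts given above imply a moderately imbalanced dataset. The following short Python sketch, which uses only the counts stated in the text, computes the proportion of positive cases and the accuracy that a trivial classifier would obtain by always predicting the majority class; the variable names are chosen for illustration.

```python
# Class counts as stated in the dataset description.
n_negative = 500  # non-diabetics (outcome 0)
n_positive = 268  # diabetics (outcome 1)
n_total = n_negative + n_positive

# Share of positive cases in the dataset.
positive_rate = n_positive / n_total

# Accuracy of always predicting the majority class.
baseline_accuracy = max(n_negative, n_positive) / n_total
```

A classifier's accuracy is only informative if it clearly exceeds this baseline of roughly 65%, which is one reason to report sensitivity and specificity alongside accuracy.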

Table 1 Overview of Pima Indian diabetes dataset
