1 Introduction

Diabetes Mellitus, commonly referred to as diabetes, is a chronic disease that affects how the body turns food into energy [1]. It is one of the top 10 causes of death worldwide, accounting for 4.2 million deaths in 2019 [2]. There are three main types of diabetes: type 1, type 2, and gestational diabetes [1]. Type 1 diabetes is thought to be caused by an autoimmune reaction in which the body’s immune system attacks the insulin-producing beta cells. Type 2 diabetes is caused by inadequate production of insulin and the inability of body cells to respond to insulin properly. Gestational diabetes, which affects pregnant women mostly between the sixth and ninth months of pregnancy, is caused by hormones produced by the placenta that lead to insulin resistance. The proportion of people with type 2 diabetes is increasing compared to those with type 1 and gestational diabetes [2]. In 2019, more than 1.1 million children and adolescents were living with type 1 diabetes, while 374 million people were at increased risk of developing type 2 diabetes. Consequently, in this paper, we focus on type 2 diabetes.

The prevalence of type 2 diabetes in an individual is associated with the interaction of several risk factors related to lifestyle, psychosocial conditions, medical conditions, demographics, and genetics (heredity) [3,4,5]. If type 2 diabetes is not diagnosed at an early stage, it can lead to several serious, life-threatening complications [6]. Machine learning-based decision support systems for the prediction of chronic diseases [7,8,9,10,11,12] have thus gained a lot of attention as a means of providing better prognosis/diagnosis support to health professionals and public health [13]. Several research efforts in the literature have used machine learning classification algorithms to predict the prevalence of type 2 diabetes based on different risk factors [14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45]. These algorithms are either tree-based [46, 47], constructing a classification tree with the dataset features as the nodes and the class labels as the leaves; probability-based [48], using a probability distribution function over the class labels for a given observation; lazy approach-based [49, 50], storing the dataset in memory without developing a model and classifying a new observation using a distance function; function-based [51,52,53], using a regularization function that aims to minimize the model prediction error; rule-based [54,55,56,57], using IF–THEN statements extracted from decision trees; classifier ensemble-based [58,59,60], using a set of classifiers whose individual decisions are combined through a voting mechanism; clustering and classifier ensemble-based [61], first performing clustering on the dataset to remove outliers and then applying a classification algorithm; or meta-heuristic-based [62,63,64], combining the outputs of different classification algorithms, each modelled using a random sample of the dataset. The work in the literature evaluates and compares different classification algorithms using heterogeneous datasets and evaluation metrics. However, we are unaware of any objective comparison of these algorithms using unified datasets and evaluation metrics. This paper aims to address this void.

The key research contributions are as follows.

  • We classify type 2 diabetes risk factors based on their common characteristics to analyze which categories are more significant than others in predicting type 2 diabetes.

  • We evaluate the performance of 35 different algorithms in terms of accuracy, F-measure, and execution time in a unified setup using 3 real-life diabetes datasets with and without feature selection.

2 Taxonomy of Type 2 Diabetes Risk Factors

In this section, we present a taxonomy of type 2 diabetes risk factors (Fig. 1). We classify them into five categories: (1) lifestyle, (2) medical condition, (3) hereditary, (4) psychosocial, and (5) demographic. The purpose of this classification is to analyze which categories of risk factors contribute significantly to the prediction of type 2 diabetes. Lifestyle factors are those that are highly influenced by the lifestyle and environment of an individual. The medical condition-based factors are related to the presence of certain conditions in an individual, such as elevated serum uric acid, obesity, hypertension, and cardiovascular disease. A serum uric acid level of more than 370 μmol/l is considered high and is associated with the prevalence of type 2 diabetes [65]. Obesity is defined as excessive fat accumulation in an individual and is characterized by a Body Mass Index (BMI) higher than 30 [66]. High blood pressure (hypertension) is the medical condition in which the blood pressure in the arteries remains persistently elevated and is characterized by a systolic pressure of 140–159 mmHg or a diastolic pressure of 90–99 mmHg [66]. Cardiovascular disease refers to conditions affecting the heart or blood vessels, such as abnormal heart rhythms, heart attack, heart failure, and stroke [67]. The risk factors that are passed on from one generation to another fall under the hereditary category. The psychosocial factors include illnesses related to the mental health of an individual, and the demographic risk factors refer to characteristics of an individual such as age and gender.

Fig. 1 Taxonomy of risk factors for type 2 diabetes

3 Machine Learning Classification Algorithms

In this section, we describe the implementation of the machine learning algorithms under study for the prediction of type 2 diabetes.

3.1 Decision Tree (DT)

It constructs a tree structure that defines a sequence of decisions and outcomes [46] and uses it for prediction. At each node of the tree, the algorithm selects the risk factor with the maximum information gain (Eq. 1) to split on.

$$InfoGain_{R} = H_{diabetes} - H_{diabetes|R}$$
(1)

where R is the risk factor, Hdiabetes represents the base entropy (Eq. 2), and Hdiabetes|R represents the conditional entropy (Eq. 3).

$$H_{diabetes} = - \sum\limits_{c\, \in\, \{ diabetes,\; non\text{-}diabetes\} } P\left( c \right)\log_{2} P\left( c \right)$$
(2)

where P(c) is the proportion of observations in class c (diabetes or non-diabetes) relative to the total number of observations.

$$H_{diabetes|R} = \sum\limits_{r \in R} P\left( r \right)H\left( diabetes \mid R = r \right) = - \sum\limits_{r \in R} P\left( r \right)\sum\limits_{c\, \in\, \{ diabetes,\; non\text{-}diabetes\} } P\left( c \mid r \right)\log_{2} P\left( c \mid r \right)$$
(3)

where R is a risk factor and r ranges over the values that R takes across the observations in the dataset.

A one-level decision tree, i.e., a tree whose root is directly connected to the leaf nodes, is known as a Decision Stump (DS) [47]; it makes the prediction based only on the risk factor with the highest information gain.
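To make Eqs. (1)–(3) concrete, the following is a minimal NumPy sketch that computes the information gain of a single hypothetical binary risk factor (obesity) with respect to a toy diabetes label vector; the data and variable names are illustrative assumptions and are not taken from the datasets used in this paper, nor from the Weka implementation used in our experiments.

```python
import numpy as np

def entropy(labels):
    """Base entropy H_diabetes over the class labels (Eq. 2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(risk_factor, labels):
    """InfoGain_R = H_diabetes - H_diabetes|R (Eqs. 1 and 3)."""
    h_conditional = 0.0
    for r in np.unique(risk_factor):
        subset = labels[risk_factor == r]
        h_conditional += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - h_conditional

# Hypothetical toy data: 1 = risk factor present / diabetes, 0 = absent / non-diabetes
obesity = np.array([1, 1, 1, 0, 0, 0, 1, 0])
diabetes = np.array([1, 1, 0, 0, 0, 0, 1, 0])
print(info_gain(obesity, diabetes))
```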

3.2 Bayesian Network (BN)

It is a graphical representation of probabilistic relationships among a set of risk factors and diabetes prevalence [48]. A BN consists of a set of edges (E) and risk factors (R = {R1, R2, …, Rn}) that form a directed acyclic graph (DAG) G = (R, E) encoding a joint probability distribution over R. Each node is associated with a conditional probability distribution (CPD) as stated in Eq. (4).

$$CPD\left( R_{i} \right) = P\left( R_{i} \mid Pa\left( R_{i} \right) \right)$$
(4)

where Pa(Ri) indicates the parents of Ri in G.

In a Bayesian network, a joint distribution of risk factors is obtained by multiplying the CPD for each risk factor as stated in Eq. (5).

$$P\left( {R_{1} ,R_{2} , \ldots , R_{n} } \right) = \mathop \prod \limits_{i = 1}^{n} CPD\left( {R_{i} } \right)$$
(5)

3.3 Naïve Bayes (NB)

It is primarily based on Bayes’ theorem, which gives the relationship between the probabilities of two risk factors and their conditional probabilities [46]. Given that a risk factor A already exists in an individual, the conditional probability of a risk factor C occurring is given by Eq. (6).

$$P\left( {C{|}A} \right) = \frac{{P\left( {A \cap C} \right)}}{P\left( A \right)} = \frac{{P\left( {A{|}C} \right) \cdot P\left( C \right)}}{P\left( A \right)}$$
(6)

For a record with a set of risk factors (R1, R2, …, Rn), the goal is to predict diabetes/non-diabetes class Ci from the set of classes, C = {Cdiabetes, Cnon-diabetes} that maximizes the conditional probability P(Ci|R1, R2, …, Rn). The general form of Bayes’ theorem for assigning diabetes or non-diabetes class to observation with multiple risk factors is given by Eq. (7).

$$P\left( {C_{i} {|}R_{1} , R_{2} , \ldots , R_{n} } \right) = \frac{{P\left( {R_{1} , R_{2} , \ldots , R_{n} {|}C_{i} } \right) \cdot P\left( {C_{i} } \right)}}{{P\left( {R_{1} , R_{2} , \ldots , R_{n} } \right)}}$$
(7)

The NB algorithm simplifies the Bayes’ theorem of Eq. (7) in two steps: (1) each risk factor is assumed to be conditionally independent of every other factor given a class Ci, as shown in Eq. (8), and (2) the term P(R1, R2, …, Rn) in Eq. (7) is ignored because it is the same for all classes. Consequently, the probability of Ci given all the risk factors, P(Ci|R1, R2, …, Rn), can be calculated (up to this constant factor) using Eq. (9).

$$P\left( R_{1} , R_{2} , \ldots , R_{n} \mid C_{i} \right) = P\left( R_{1} \mid C_{i} \right) \cdot P\left( R_{2} \mid C_{i} \right) \cdots P\left( R_{n} \mid C_{i} \right) = \mathop \prod \limits_{j = 1}^{n} P\left( R_{j} \mid C_{i} \right)$$
(8)
$$P\left( {C_{i} {|}R_{1} , R_{2} , \ldots , R_{n} } \right) = P\left( {C_{i} } \right) \cdot \mathop \prod \limits_{j = 1}^{n} P\left( {R_{j} {|}C_{i} } \right)$$
(9)
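As an illustration of Eqs. (8) and (9), the following is a small, self-contained Python sketch of a categorical Naïve Bayes classifier on hypothetical binary risk factors (obesity, hypertension). It is not the Weka implementation used in our experiments; the toy data and the small floor for unseen values are assumptions made only for the example.

```python
import numpy as np
from collections import defaultdict

def train_nb(X, y):
    """Estimate P(C_i) and P(R_j | C_i) from categorical risk factors (Eq. 8)."""
    priors = {c: np.mean(y == c) for c in np.unique(y)}
    cond = defaultdict(dict)                 # cond[c][(j, value)] = P(R_j = value | C_i = c)
    for c in priors:
        Xc = X[y == c]
        for j in range(X.shape[1]):
            values, counts = np.unique(Xc[:, j], return_counts=True)
            for v, n in zip(values, counts):
                cond[c][(j, v)] = n / len(Xc)
    return priors, cond

def predict_nb(x, priors, cond):
    """Pick the class maximising P(C_i) * prod_j P(R_j | C_i) (Eq. 9)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(x):
            score *= cond[c].get((j, v), 1e-6)   # small floor for unseen risk factor values
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: columns = [obesity, hypertension], label 1 = diabetes
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
priors, cond = train_nb(X, y)
print(predict_nb((1, 1), priors, cond))
```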

3.4 K Nearest Neighbors (K-NN)

It stores the dataset and classifies a new observation based on how likely it is to be a member of the diabetes or non-diabetes class [49]. The algorithm calculates the distance of the new observation from all the existing ones in the dataset. It then assigns the new observation to the class that appears the maximum number of times among its k (a positive integer parameter) nearest neighbors.
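A minimal scikit-learn sketch of this procedure is shown below; the toy risk factor values (BMI, age) and the choice k = 3 are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: columns = [BMI, age], label 1 = diabetes
X = np.array([[33.6, 50], [26.6, 31], [28.1, 21], [43.1, 33], [25.6, 30], [35.3, 29]])
y = np.array([1, 0, 0, 1, 0, 1])

# k = 3 neighbors, Euclidean distance by default; the majority class among them is predicted
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[30.0, 45]]))
```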

3.5 K Star

It uses an entropic measure based on the probability of transforming one observation into another by randomly choosing between all possible transformations [50]. A new observation is assigned to the same diabetes or non-diabetes class as that of one in the dataset having the shortest distance from the new observation. The distance between observations in K star is indicated by the complexity of transforming one observation into another. Let I be a set of observations and T a finite set of transformations on I. Let P be the set of all prefix codes from T* which are terminated by \(\sigma\). \(\sigma\) is a member of T which maps observations to themselves, i.e., (\(\sigma\)(a) = a). Members of T* define a transformation on I as shown in Eq. (10).

$$\overline{t}\left( a \right) = t_{n} \left( {t_{n - 1} \left( { \ldots t_{1} \left( a \right) \ldots } \right)} \right), where\,\overline{t} = t_{1} , \ldots , t_{n}$$
(10)

A probability function p is defined on T* that satisfies the conditions stated in Eq. (11).

$$0 \le \frac{{p\left( {\overline{t}u} \right)}}{{p\left( {\overline{t}} \right)}} \le 1, \mathop \sum \limits_{u} p\left( {\overline{t}u} \right) = p\left( {\overline{t}} \right), \,and\,p\left( \Lambda \right) = 1$$
(11)

The function P* is defined as the probability of all paths from observation a to observation b (Eq. 12).

$$P^{*} \left( {b{|}a} \right) = \mathop \sum \limits_{{\overline{t}\epsilon P:\overline{t}\left( a \right) = b}} p\left( {\overline{t}} \right)$$
(12)

The K star function is defined using Eq. (13).

$$K^{*} \left( {b{|}a} \right) = - \log_{2} P^{*} (b|a)$$
(13)

3.6 Logistic Regression (LR)

It predicts the probability that a given observation belongs to diabetes or non-diabetes class using a sigmoid function [51] as stated in Eq. (14).

$$p\left( {diabetes} \right) = \frac{{e^{{\beta_{0} + \mathop \sum \nolimits_{i = 1}^{n} \beta_{i} R_{i} }} }}{{1 + e^{{\beta_{0} + \mathop \sum \nolimits_{i = 1}^{n} \beta_{i} R_{i} }} }}$$
(14)

where p(diabetes) represents the probability of having diabetes, Ri are the risk factors, and β0 and βi are the regression coefficients representing the intercept and the slopes respectively. The values of the regression coefficients are calculated using maximum likelihood estimation such that the likelihood in Eq. (15) is maximized.

$$\ell \left( \beta_{0} , \ldots ,\beta_{n} \right) = \mathop \prod \limits_{i:\,y_{i} = 1} p\left( diabetes_{i} \right) \mathop \prod \limits_{i^{\prime}:\,y_{i^{\prime}} = 0} \left( 1 - p\left( diabetes_{i^{\prime}} \right) \right)$$
(15)
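The following scikit-learn sketch illustrates Eqs. (14) and (15): the coefficients are estimated by maximum likelihood and the fitted sigmoid is then evaluated for a new observation. The toy features (glucose, BMI) are assumptions for illustration; our experiments use Weka’s implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: columns = [glucose, BMI], label 1 = diabetes
X = np.array([[148, 33.6], [85, 26.6], [183, 23.3], [89, 28.1], [137, 43.1], [78, 25.6]])
y = np.array([1, 0, 1, 0, 1, 0])

lr = LogisticRegression(max_iter=1000).fit(X, y)   # coefficients estimated by maximum likelihood (Eq. 15)
beta0, beta = lr.intercept_[0], lr.coef_[0]

x_new = np.array([120, 30.0])
z = beta0 + beta @ x_new
p_diabetes = 1 / (1 + np.exp(-z))                  # sigmoid of Eq. (14), written in its equivalent form
print(p_diabetes, lr.predict_proba([x_new])[0, 1]) # both values agree
```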

3.7 Support Vector Machine (SVM)

It aims to create a decision boundary, known as a hyperplane, that separates the n-dimensional instance space into diabetes and non-diabetes classes [52]. The hyperplane is created using the extreme points (support vectors) of the dataset. The generation of the hyperplane is an iterative process that finds the maximum possible margin between the support vectors of the opposite classes. Let ri and yi represent the risk factors and class labels of the observations in the dataset, and suppose there exists a hyperplane that separates the diabetes and non-diabetes classes as stated in Eq. (16).

$$\begin{aligned} & w^{T} r + b = 0 \\ & {\text{s}}.{\text{t}}.,\,w^{T} r^{\left( i \right)} + b > 0, if\,y^{\left( i \right)} = + 1\,and\,w^{T} r^{\left( i \right)} + b < 0,\,if\,y^{\left( i \right)} = - 1 \\ \end{aligned}$$
(16)

where w is the normal of the hyperplane and b is the bias.

The minimization problem to obtain the optimal hyperplane that maximizes the margin can be formulated using Eq. (17).

$$Minimize \,\, \Phi \left( w \right) = \frac{1}{2}\left\| w \right\|^{2} ,\quad such\,that\,y_{i} \left( w \cdot r_{i} + b \right) \ge 1$$
(17)

In the case of a non-linear dataset, SVM uses a kernel trick to transform the input space to a higher dimensional space to generate a hyperplane. SVM uses different kernel functions such as linear, polynomial, Radial Basis Function (RBF), and sigmoid.
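A brief scikit-learn sketch of SVM with these kernels is given below; the toy data are illustrative assumptions, and the linear kernel is included only to show the hyperplane of Eq. (16), while our experiments use the polynomial and RBF kernels.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: columns = [glucose, BMI], labels +1 = diabetes, -1 = non-diabetes
X = np.array([[148, 33.6], [85, 26.6], [183, 23.3], [89, 28.1], [137, 43.1], [78, 25.6]])
y = np.array([1, -1, 1, -1, 1, -1])

# A linear kernel fits the hyperplane of Eq. (16); poly and RBF kernels apply the kernel trick
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.predict([[120, 30.0]]))
```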

3.8 Artificial Neural Networks (ANN)

It consists of networks of neurons modelled on the neural structure of the human brain [53]. In this paper, we use the Multilayer Perceptron (MLP), a feedforward neural network that is trained iteratively using backpropagation. An MLP consists of three layers: risk factors (input), hidden, and diabetes/non-diabetes classes (output). Except for the risk factor nodes, each node in the network is a neuron that uses a non-linear activation function. In this paper, we use the sigmoid activation function to model a non-linear relationship between the risk factors and the diabetes/non-diabetes classes. After each training iteration, the algorithm compares the result of the output layer with the actual diabetes or non-diabetes class labels and calculates an error term for each node. These error terms are then used to adjust the weights in the hidden layers so that the prediction accuracy increases in the next iteration. The output of each hidden layer is calculated using Eq. (18).

$$a = \phi \left( Wi + b \right)$$
(18)

where W is the weight matrix of the layer, i is the input vector consisting of the risk factors (or the previous layer’s output), b is the bias vector, φ(.) is the sigmoid activation function, and a is the layer’s output vector, from which the diabetes and non-diabetes class labels are derived at the output layer.
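The following scikit-learn sketch mirrors this setup with one hidden layer and a sigmoid ("logistic") activation, so each layer computes Eq. (18); the layer size, iteration limit, and toy data are assumptions for illustration and do not reproduce our Weka configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data: columns = [glucose, BMI], label 1 = diabetes
X = np.array([[148, 33.6], [85, 26.6], [183, 23.3], [89, 28.1], [137, 43.1], [78, 25.6]])
y = np.array([1, 0, 1, 0, 1, 0])

# One hidden layer with a sigmoid ("logistic") activation trained by backpropagation;
# every layer computes a = phi(W i + b) as in Eq. (18)
mlp = MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                    max_iter=2000, random_state=0).fit(X, y)
print(mlp.predict([[120, 30.0]]))
```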

3.9 Zero Rule (ZeroR)

It is a frequency-based algorithm [54]. It labels all observations with the majority class (diabetes or non-diabetes) and does not require building a model for diabetes prediction.

3.10 One Rule (OneR)

It is a frequency-based algorithm that generates one rule for each risk factor in the dataset to predict diabetes and then selects the rule with the smallest error [55]. The algorithm first constructs a frequency table for each risk factor against diabetes and non-diabetes classes to create the rules for the risk factors. The error for each rule is then calculated by dividing the number of the observations in the minority class (diabetes or non-diabetes) by the total number of observations as stated in Eq. (19).

$$Error = \frac{Number\,of\,observations\,in\,the\,minority\,class}{{\left( {Total\,number\,of\,observations} \right)}}$$
(19)
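A compact Python sketch of OneR as described above is shown below: it builds one rule per risk factor, computes the error of Eq. (19), and keeps the rule with the smallest error. The toy binary risk factors are assumptions for illustration only.

```python
import numpy as np

def one_rule(X, y):
    """Build one rule per risk factor and keep the one with the smallest error (Eq. 19)."""
    best = None
    for j in range(X.shape[1]):
        rule, errors = {}, 0
        for v in np.unique(X[:, j]):
            classes, counts = np.unique(y[X[:, j] == v], return_counts=True)
            rule[v] = classes[np.argmax(counts)]    # majority class for this risk factor value
            errors += counts.sum() - counts.max()   # observations falling in the minority class
        error_rate = errors / len(y)
        if best is None or error_rate < best[2]:
            best = (j, rule, error_rate)
    return best

# Hypothetical toy data: columns = [obesity, hypertension], label 1 = diabetes
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
print(one_rule(X, y))   # (selected risk factor index, its rule, its error rate)
```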

3.11 JRip

It is based on association rules with reduced error pruning by implementing Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [56]. An initial set of rules is generated for the risk factors that cover all the observations in the diabetes class. This overlarge set of rules is then iteratively simplified by performing pruning operations. At each stage of simplification, a rule is removed such that there is the highest error reduction. The iteration stops when further pruning operations would increase the error.

3.12 Decision Table (DTable)

It is a set of If-Then rules that are formulated in terms of a candidate decision table [57]. A set of rules for the association between risk factors and the prevalence of diabetes is stored in a candidate decision table. The table is then pruned by removing redundant rules and the rules having a confidence value of less than 1. The confidence for a risk factor rule is calculated as stated in Eq. (20).

$$Confidence = \frac{Total\,number\,of\,observations\,having\,same\,risk\,factors\,for\,the\,diabetes\,class}{{Total\,number\,of\,observations\, having\,same\,risk\,factors}}$$
(20)

3.13 Random Tree (RT)

It is similar to a decision tree where the algorithm generates a decision tree by selecting the risk factors with high information gain. However, instead of considering all the risk factors for selection, RT uses a set of randomly selected factors [58].

3.14 Random Forest (RF)

It is a set of decision trees constructed using randomly selected samples of the dataset [59]. It performs voting on the output of each decision tree and classifies an observation into diabetes or non-diabetes depending on the majority of the decision trees’ output.
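For illustration, a scikit-learn sketch of a random forest is shown below; the number of trees and the toy data are assumptions, and scikit-learn’s default bootstrap does not itself balance the classes, so this sketch illustrates the voting mechanism rather than the class-balancing behaviour discussed in Sect. 4.4.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: columns = [glucose, BMI, age], label 1 = diabetes
X = np.array([[148, 33.6, 50], [85, 26.6, 31], [183, 23.3, 32],
              [89, 28.1, 21], [137, 43.1, 33], [78, 25.6, 30]])
y = np.array([1, 0, 1, 0, 1, 0])

# 100 trees, each grown on a bootstrap sample with a random subset of risk factors per split;
# the forest predicts by majority vote over the trees
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict([[120, 30.0, 45]]))
```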

3.15 Reduced Error Pruning Tree (REPTree)

It builds several decision trees iteratively using different risk factors and selects the tree with the least prediction error. Like DT, it builds a tree with the risk factor having the highest information gain as the root. The algorithm then prunes the tree using reduced error pruning: the subtree rooted at a risk factor is removed and that risk factor is made a leaf node assigned the majority diabetes or non-diabetes class. If the error rate of the new tree is less than or equal to that of the original tree, the pruning is kept [60]. The algorithm iterates until further pruning would reduce the classification performance.

3.16 K-means

It is a clustering technique rather than a classification algorithm [61]. It is an iterative algorithm that divides the given dataset into diabetes and non-diabetes clusters, placing observations with similar risk factor values in the same cluster. K-means is used together with a classification algorithm to increase the classification performance: the algorithm runs K-means and removes the incorrectly clustered observations, i.e., individuals with diabetes clustered as non-diabetes and vice versa. The dataset containing only the correctly clustered observations is then used by a classifier to build a classification model.
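This two-stage procedure can be sketched as follows with scikit-learn: cluster the observations into two groups, map each cluster to its majority class, drop the observations whose cluster disagrees with their class label, and train logistic regression (K-means + LR) on the remainder. The toy data and the cluster-to-class mapping heuristic are assumptions for illustration, not the Weka pipeline used in our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: columns = [glucose, BMI], label 1 = diabetes
X = np.array([[148, 33.6], [85, 26.6], [183, 23.3], [89, 28.1],
              [137, 43.1], [78, 25.6], [150, 34.0], [90, 27.0]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Step 1: cluster into two groups and map each cluster to its majority class label
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
cluster_label = {c: np.bincount(y[clusters == c]).argmax() for c in (0, 1)}

# Step 2: keep only correctly clustered observations (cluster label agrees with class label)
keep = np.array([cluster_label[c] == label for c, label in zip(clusters, y)])

# Step 3: train the classifier on the filtered dataset (K-means + LR)
lr = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])
print(keep.sum(), "of", len(y), "observations kept;", lr.predict([[120, 30.0]]))
```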

3.17 Bagging

It is an ensemble meta-estimator that uses another classification algorithm as a parameter [62]. It first randomly generates multiple samples of the observations from the dataset with replacement. The base classifier is then applied to each random subset of the dataset for the diabetes prediction. The algorithm then aggregates the classifier outputs for each subset and selects the final output based on voting.
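A scikit-learn sketch of bagging with a logistic regression base classifier (the Bagging-LR configuration discussed in Sect. 4) is shown below; the synthetic placeholder data stand in for a diabetes dataset, and the estimator= keyword assumes a recent scikit-learn release (older versions use base_estimator=).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic placeholder data standing in for a diabetes dataset (1 = diabetes, 0 = non-diabetes)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Bootstrap samples of the observations are drawn, the base classifier (here logistic
# regression, i.e. Bagging-LR) is fitted on each, and the outputs are combined by voting
bag_lr = BaggingClassifier(estimator=LogisticRegression(max_iter=1000),
                           n_estimators=10, random_state=0).fit(X, y)
print(bag_lr.score(X, y))   # training accuracy of the ensemble
```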

3.18 Boosting

It is an ensemble meta-estimator that aims to boost the classification performance for diabetes prediction by creating a strong classifier from several weak classifiers [63]. It builds a model from the dataset to predict diabetes and then creates a second model that attempts to correct the errors of the first. Models are added until the training set is predicted correctly or the maximum number of models is reached. Each model is weighted based on its weakness when combined with the other models, where the weakness is identified by the classifier’s error rate as stated in Eq. (21). After each model is added, the observation weights are readjusted, a step known as re-weighting.

$$Error = \frac{\mathop \sum \nolimits_{j = 1}^{n} W_{j}\, I\left( C\left( R_{j} \right) \ne Y_{j} \right)}{\mathop \sum \nolimits_{j = 1}^{n} W_{j} },\quad I\left( x \right) = \left\{ {\begin{array}{*{20}l} {1,\quad if\;x\;is\;true} \hfill \\ {0,\quad if\;x\;is\;false} \hfill \\ \end{array} } \right.$$
(21)

where C is the classifier, R is the matrix containing the risk factors, Y is a vector containing diabetes and non-diabetes class labels and W is the assigned weight.
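As an illustration, the following scikit-learn sketch uses AdaBoost, whose default weak learner is a decision stump and which weights each learner by its error rate and re-weights the observations after every round, in the spirit of Eq. (21); the synthetic data are a placeholder, not one of our datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic placeholder data standing in for a diabetes dataset (1 = diabetes, 0 = non-diabetes)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each weak learner (a decision stump by default) is weighted by its error rate, and the
# observation weights are readjusted after every round (re-weighting)
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(boost.score(X, y))
```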

3.19 Stacking

It is an ensemble meta-estimator in which a classification model is developed using the observations to predict diabetes or non-diabetes classes. The output of this classifier is then used as an input to develop another classification model for diabetes prediction [64].
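A minimal scikit-learn sketch of stacking is shown below, where decision tree and Naïve Bayes base classifiers feed a logistic regression meta-classifier; the particular choice of base and meta classifiers and the synthetic placeholder data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data standing in for a diabetes dataset (1 = diabetes, 0 = non-diabetes)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# The base classifiers' predictions become the inputs of a second-level (meta) classifier
stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
).fit(X, y)
print(stack.score(X, y))
```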

Table 1 shows the advantages and disadvantages of the classification algorithms used in the literature for the prediction of type 2 diabetes. It shows which algorithm is suitable for which dataset type.

Table 1 Advantages and disadvantages of classification algorithms

4 Performance Analysis

In this section, we analyze and compare the performance of the studied algorithms in terms of accuracy, F-measure, and execution time.

4.1 Datasets

We use Weka 3.8 [54] for the implementation and evaluation of the studied algorithms. We evaluate the performance of the algorithms with and without feature selection using three datasets, i.e., PIMA Indian [68], UCI [69], and MIMIC III [70]. The PIMA Indian dataset is from the National Institute of Diabetes and Digestive and Kidney Diseases and is used to predict whether or not a patient has diabetes based on diagnostic measurements. The UCI dataset includes patient and hospital outcomes from 130 US hospitals between 1999 and 2008. MIMIC III contains information on over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.

4.2 Feature Selection Algorithms

Table 2 shows the feature selection algorithms used.

Table 2 List of feature selection algorithms used in the experiments

4.3 Experiments

The UCI dataset includes race, gender, age, diagnosis 1, diagnosis 2, diagnosis 3, and diabetes medication columns. Diagnosis 1, 2, and 3 represent the results of the primary, secondary, and additional secondary diagnoses respectively. We create a class label for diabetes based on the diabetes medication column. From the diagnosis 1, 2, and 3 columns, we extract the diseases that are risk factors for type 2 diabetes, such as obesity, hypertension, and cardiovascular disease, and use them as binary features in the dataset. For the MIMIC III dataset, we consider the available diabetes risk factor features, namely ethnicity, gender, age, and diabetes. Table 3 shows the total number of features, observations, and positive and negative diabetes classes for each dataset. It shows that the number of observations in the positive class (negative class) for the UCI (MIMIC III) dataset is very high compared to that in the negative (positive) class, leading to imbalanced datasets [74].

Table 3 Specification of pre-processed datasets used in the experiments

We evaluate the algorithms under study with and without feature selection using the tenfold cross-validation method [75], where the dataset is divided into k partitions. In each iteration, one partition is used for testing and the remaining k-1 partitions for training. This is repeated until each partition has been used for testing. The results are then obtained by averaging over the iterations. For K-NN, we run the algorithm for values of ‘k’ from 1 to the square root of the number of observations and select the value of ‘k’ that gives the highest accuracy. For SVM, we use the polynomial and RBF kernels. Each algorithm is executed 3 times on each dataset (we call it a run) and the average accuracy, F-measure, and execution time are calculated over the runs. For feature selection, we execute the selection algorithms (Table 2) on each dataset, count how often each feature is selected, and keep the features selected by more than 50% of the algorithms. The accuracy and the F-measure are calculated using Eq. (22) and Eq. (23) respectively. The execution time is calculated by adding the model building and validation times.

$$Accuracy = \frac{TP + TN}{{TP + FP + TN + FN}}$$
(22)

where TP is True Positive, TN is True Negative, FP is False Positive, and FN is False Negative. TP (TN) represents the number of observations in the positive (negative) class that are classified as positive (negative), and FP (FN) represents the number of observations in the negative (positive) class that are classified as positive (negative).

$$F{ - }measure = \frac{{2\left( {Recall*Precision} \right)}}{{\left( {Recall + Precision} \right)}}$$
(23)

where recall and precision for the positive (negative) class are calculated using Eqs. (24) and (25) respectively.

$$Recall = \frac{{TP \left( {TN} \right)}}{{TP \left( {TN} \right) + FN\left( {FP} \right)}}$$
(24)
$$Precision = \frac{{TP \left( {TN} \right)}}{{TP \left( {TN} \right) + FP\left( {FN} \right)}}$$
(25)
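To make the evaluation pipeline concrete, the following scikit-learn sketch runs tenfold cross-validation on synthetic placeholder data and computes the accuracy and F-measure of Eqs. (22)–(25) from the confusion matrix; it is an analogue of, not a replica of, the Weka 3.8 setup used in our experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, mildly imbalanced placeholder data standing in for a pre-processed diabetes dataset
X, y = make_classification(n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=0)

# Tenfold cross-validation: each observation is predicted by a model trained on the other nine folds
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=cv)

tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
accuracy = (tp + tn) / (tp + fp + tn + fn)                  # Eq. (22)
recall, precision = tp / (tp + fn), tp / (tp + fp)          # Eqs. (24) and (25), positive class
f_measure = 2 * recall * precision / (recall + precision)   # Eq. (23)
print(accuracy, f_measure)
```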

4.4 Experimental Results Analysis

In this section, we analyze our experimental results and give insights and conclusions. In particular, we reveal the reasons behind the performance of these algorithms.

4.4.1 Classification Algorithms without Feature Selection

Figure 2 shows the accuracy and F-measure of the algorithms for the PIMA Indian dataset without feature selection. The accuracy of the K-means + Logistic Regression (LR) algorithm is the highest among all the studied ones. This is because the algorithm removes the incorrectly clustered observations (outliers) from the dataset using K-means clustering (Fig. 3) before developing a classification model. The observations with the positive (negative) class label placed in the negative (positive) cluster are treated as outliers. LR is then applied to the dataset without outliers, resulting in higher accuracy. However, the removal of outliers before applying LR is not justified for a diabetes dataset. For instance, if a diabetic patient having all the risk factors is clustered in the non-diabetes cluster, then that patient will be removed by K-means as an outlier. Consequently, the algorithm will be trained on a dataset with such patients removed and will classify individuals at risk of developing diabetes as non-diabetic. On the other hand, the LR model, when used with the Bagging heuristic (Bagging-LR), performs better without removing the outliers. This is because the logistic function used by the meta-estimator can linearly separate the diabetes and non-diabetes binary classes for interdependent features such as cholesterol and obesity. Figure 2 also shows that there is no F-measure value for SVM with the RBF kernel, ZeroR, and stacking with DT, NB, RF, and K-NN. This is because these algorithms always predict the negative (non-diabetes) class, which is the majority in the dataset, so the positive (diabetes) class is never predicted.

Fig. 2 Accuracy and F-measure of the algorithms for the PIMA Indian Dataset without feature selection

Fig. 3 Removal of incorrectly clustered diabetes observations by K-means clustering on the PIMA Indian Dataset

Figures 4 and 5 show the accuracy and F-measure of the algorithms without feature selection for the UCI and MIMIC III datasets respectively. RF has the highest accuracy, whereas most of the studied algorithms are unable to detect the minority class because the datasets are imbalanced. RF performs better with an imbalanced dataset because, while constructing each individual decision tree, the algorithm bootstraps a sample from the minority class and a sample of the same size, with replacement, from the majority class. Consequently, each decision tree in RF is applied to a balanced subset of the dataset, leading to a highly accurate classifier in which both diabetes classes are detected. Most of the algorithms in Figs. 4 and 5 have no F-measure value because the UCI and MIMIC III datasets are highly imbalanced and the algorithms are not able to predict the minority class.

Fig. 4 Accuracy and F-measure of the algorithms for the UCI Dataset without feature selection

Fig. 5 Accuracy and F-measure of the algorithms for the MIMIC III Dataset without feature selection

Table 4 shows the execution times for model building and validation of the algorithms under study for each dataset. The execution times of the algorithms for the PIMA Indian dataset are negligible, while those for the UCI dataset are the highest. This is because the execution time is a function of the number of features and observations.

Table 4 Execution time for model building and validation without feature selection

4.4.2 Classification Algorithms with Feature Selection

Figure 6 shows the results of the feature selection algorithms. The results reveal that the following risk factors have a significant impact on the prediction of type 2 diabetes: age (demographic category), ethnicity and family history of diabetes (hereditary category), hypertension, obesity, and cardiovascular disease (medical conditions category), and cholesterol (lifestyle category). Figures 7, 8 and 9 show the performance of the studied algorithms for each dataset with feature selection. The relative performance of the algorithms remains the same as without feature selection. The most accurate algorithm for the PIMA Indian dataset is Bagging-LR, while that for the UCI and MIMIC III datasets is RF. Table 5 shows the execution time for model building and validation of each algorithm under study. The relative performance is similar to that without feature selection (Table 4). However, the execution times with feature selection are lower than those without feature selection for all the datasets under experiment because of the reduced number of features. The execution time is reduced by up to 2.5 times for PIMA Indian and by 1.23 times for UCI. There is no significant reduction in execution time for the MIMIC III dataset, as only one feature is removed.

Fig. 6 Selected features for each dataset using feature selection algorithms

Fig. 7 Accuracy and F-measure of the classification algorithms for the PIMA Indian Dataset with feature selection

Fig. 8 Accuracy and F-measure of the classification algorithms for the UCI Dataset with feature selection

Fig. 9 Accuracy and F-measure of the classification algorithms for the MIMIC III Dataset with feature selection

Table 5 Execution time for model building and validation with feature selection

5 Related Work

In the last decade, there have been many efforts in academia and industry to predict type 2 diabetes using machine learning algorithms [14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45]. However, the algorithms in these works are compared using different datasets and evaluation metrics (Table 6), making an objective comparison difficult. In this paper, we evaluate and compare 35 algorithms in a unified setup.

Table 6 Evaluated works on classification algorithms

6 Conclusions

Urbanization, unbalanced diets, and changing lifestyles have led to a global increase in type 2 diabetes. Many works in the literature have compared several algorithms to accurately predict type 2 diabetes. However, those algorithms were evaluated using different datasets and evaluation metrics, making it difficult to compare their relative performance.

In this paper, we evaluate these algorithms using three diabetes datasets in a unified setup and compare their performance in terms of accuracy, F-measure, and execution time. Our experimental results show that the Bagging-LR algorithm is the most accurate for a balanced dataset with and without feature selection while, for an imbalanced dataset, RF is the most accurate. In addition, we classify type 2 diabetes risk factors to analyze the most significant categories for diabetes prediction.

When using a classification algorithm for the prediction of type 2 diabetes, the following requirements should be considered.

  1. Accuracy vs F-measure: Most of the algorithms give a high classification accuracy. However, evaluating classifier performance using only accuracy can be misleading. In the case of an imbalanced dataset, which is very frequent in the health domain, an algorithm might have high accuracy but fail to classify the minority class labels, as revealed by the F-measure. In such a situation, the prediction results can lead to a life-threatening situation, as a diabetic patient can be classified as non-diabetic. Consequently, we recommend that data scientists include the F-measure as one of the evaluation metrics.

  2. Data-driven: Our experimental results reveal that the performance of the classification algorithms is data-driven. The random forest algorithm is best suited for prediction in the case of an imbalanced dataset, because it takes a sample of the minority class and a sample of the same size, with replacement, from the majority class to form the training dataset for each decision tree it constructs.

  3. Feature selection: We recommend that data scientists and clinicians execute a feature selection algorithm on the dataset before training the classification model. This reduces the execution time, avoids overfitting, and gives insights into the most significant features for future consideration, without degrading accuracy.

  4. Significant features: Based on our experimental results, we recommend that clinicians use age (demographic category), ethnicity and family history of diabetes (hereditary category), hypertension, obesity, and cardiovascular disease (medical conditions category), and cholesterol (lifestyle category) for the prediction of type 2 diabetes. This is also confirmed by statistical and clinical studies in the literature [79, 80].