1 Introduction

Diabetes is a metabolic disorder characterized by high blood sugar levels that can cause severe damage to the nerves, heart, eyes, kidneys, and blood vessels. Approximately 422 million people worldwide, mostly in low- and middle-income countries, have diabetes, and it is the direct cause of more than 1.6 million deaths each year [53].

The prevalence of diabetes is expected to increase significantly over the coming decades [35]. Many studies have shown a higher susceptibility to certain infectious diseases in people with diabetes, possibly due to a dysregulated immune system [23]. Diabetes and plasma glucose levels in SARS patients have been reported to be independent predictors of morbidity and mortality [58]. As with influenza-related mortality [14], diabetes is a significant risk factor for adverse outcomes of COVID-19. COVID-19 has recently become one of the most serious acute diseases, spreading all over the world and severely affecting health systems and the global economy [25, 45]. Mortality rates from pneumonia in people with diabetes aged 75 years and over in Hong Kong currently exceed death rates from cardiovascular disease and cancer in this age group [3]. A study conducted in Wuhan, China, revealed that 32% of 41 people infected with COVID-19 had underlying diseases and 20% had diabetes [52]. Guo et al. (2020) argued that diabetes should be regarded as a risk factor for the rapid progression and poor prognosis of COVID-19 [17]. Diabetic patients may therefore be at high risk from COVID-19, and providing care to people with diabetes is an essential part of the response [3].

The diagnostic criteria currently used for diabetes have been in place globally for almost a decade [54]. Plasma glucose measurement is one of the clinical diagnostic tests and continues to be the basis of the diagnostic criteria. For this test, blood samples taken from the subjects are transported to laboratories. Since blood cells can continue to metabolize glucose, the time between drawing blood and determining blood glucose is important [8]. Factors such as blood specimen collection, evaluation time, and instrument errors can affect laboratory analysis results.

In the healthcare system, artificial intelligence-based systems have enabled great progress in the automatic detection of diseases, including cancer cells [38], liver disorders [20], tumor detection [46], heart disease [2], breast cancer [16, 24, 26], COVID-19 [27, 28, 40, 44], and diabetes [1, 18, 21]. Automatic early detection makes it possible to start treatment early and slow down the disease process, and the serious damage, morbidity, and mortality caused by diabetes can be reduced by early diagnosis and treatment. Therefore, my main focus is to develop an intelligent machine learning system that can detect and classify this disease. In recent years, researchers have provided intelligent diagnostic systems for diabetes detection to assist physicians. Kandhasamy and Balamurali (2015) compared four machine learning algorithms, k-NN, support vector machines, random forest, and the J48 decision tree, for predicting diabetes mellitus [33]; k-NN (k = 1) and random forest, each with 100% accuracy, were the best performing algorithms. A comparative study of Naive Bayes, ID3, random forest, and AdaBoost reported that random forest outperformed the other algorithms at classifying diabetic patients, with 85% accuracy [56]. Paul and Choubey (2017) proposed a hybrid algorithm combining a radial basis function network (RBFN) and a genetic algorithm (GA) [43] and found that the hybrid method outperformed the RBFN alone. Zou et al. (2018) used principal component analysis (PCA) and minimum redundancy maximum relevance (mRMR) to reduce dimensionality [59] and showed that prediction reached its highest accuracy with random forest (ACC = 0.8084). Wu et al. (2018) proposed a novel model based on data mining techniques (an improved K-means algorithm and logistic regression) for predicting type 2 diabetes mellitus [55]; the model's accuracy reached 95.42%. Ayon and Islam (2019) presented a deep learning network to automatically detect diabetes [1]. Their dataset included a total of 768 samples, of which 268 were diabetic, and the proposed system obtained 98.35% accuracy, an F1 score of 98, and an MCC of 97 under five-fold cross-validation. Islam et al. (2020) analyzed the dataset used in this study with the Naive Bayes, Logistic Regression, and Random Forest algorithms [29] and reported that random forest with 16 input features reached the best result, with an accuracy of 99%. Tigga and Garg (2020) predicted the risk of type 2 diabetes mellitus on a dataset collected with 18 questions about diabetes [49]; among the classification methods they applied, random forest was the most successful, with 94.10% accuracy. Naz and Ahuja (2020) tested the PIMA Indian diabetes dataset for the diagnosis of diabetes using Naive Bayes (NB), Decision Tree (DT), Deep Learning (DL), and Artificial Neural Network (ANN) classifiers, which gave comparable performance [41]. The accuracy achieved by these classifiers was in the range of 90–98%, with DL providing the best result at 98.07%. Gupta et al. (2021) proposed a logistic regression approach tested on a diabetic clinical dataset (DCA), which showed better generalization performance than 14 other classification techniques [18]. This dataset consists of 174 instances and 10 attributes, comprising 61 nondiabetic and 113 diabetic patients; an accuracy of 94.59% was obtained for the diagnosis of diabetes. Gourisaria et al. (2022) used various deep learning, machine learning, and data dimensionality reduction techniques on two datasets to detect diabetes [15]. In their test on the second (University of California, Irvine) dataset, they achieved 99.2% accuracy with the random forest classifier using 16 features. When the dimensionality of the same dataset was reduced using linear discriminant analysis and principal component analysis, the accuracy dropped to 96.3% and 95.6%, respectively.

In recent years, various methods have been proposed for diabetes prediction, including k-NN [33], LR [18], DL [1, 41], and RF [15, 29, 49, 56, 59]. However, these methods suffer from one or more of the following shortcomings. First, most of them do not test for statistical significance between the independent and dependent variables, which may raise doubts about the variables used as classifier inputs. Second, some of them use more input data to increase classification accuracy, which increases the response time of the classifier model. Third, most studies require various blood values to predict diabetes, and obtaining blood data is neither easy nor quick. In this study, I propose a model that tests the reliability of independent variables derived without blood data using a statistical significance test and that requires less computation time by using fewer features selected with a feature selection method. To the best of my knowledge, no study in the literature has used MLR-RF for feature selection and XGBoost as the classifier in diabetes prediction.

This paper proposes a novel, data-driven method for predicting early-stage diabetes. With the proposed method, the need for blood tests can be reduced by using direct questionnaires for diagnosing early-stage diabetes. The major contribution of this study is a hybrid prediction model based on a machine learning approach with high prediction accuracy, low memory usage, and low computation time.

The proposed predictive model strategy based on MLR-RF-XG is shown in Fig. 1. The remainder of this paper is organized as follows: the materials and estimation methodology are described in Section 2; experimental results, comparisons, and discussions are presented in Section 3; finally, the conclusions are summarized in Section 4.

Fig. 1 The overall system architecture of the proposed diabetes detection system

2 Materials and methodology

2.1 Datasets

The dataset used in this study was collected via questionnaires from patients of Sylhet Diabetes Hospital in Bangladesh. It was obtained from the publicly available web link https://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset [51]. The dataset is in CSV format and comprises 520 samples, of which 200 are labeled "non-diabetes" and 320 "diabetes". Each sample consists of gender, age, polydipsia, polyuria, weakness, visual blurring, sudden weight loss, genital thrush, itching, polyphagia, obesity, irritability, partial paresis, delayed healing, alopecia, and muscle stiffness.
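As a minimal loading sketch (the file name diabetes_data_upload.csv and the "Positive"/"Negative" label encoding follow the public UCI copy; treat them as assumptions if your copy differs), the dataset can be read and binarized with pandas:

```python
# Hedged sketch: load the UCI early-stage diabetes questionnaire data.
# The file name and column values are assumptions based on the public CSV.
import pandas as pd

df = pd.read_csv("diabetes_data_upload.csv")
print(df["class"].value_counts())   # expected: 320 Positive, 200 Negative

# Binary-encode the categorical answers for the models used below.
X = pd.get_dummies(df.drop(columns=["class"]), drop_first=True).astype(int)
y = (df["class"] == "Positive").astype(int)
```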

2.2 Methodology

The proposed approach for detecting the risk of diabetes consists of five main stages: (1) descriptive statistics, (2) testing the validity of variables, (3) feature selection and reduction, (4) classification using all variables or using variables selected by RF, and (5) performance assessment and comparison. These stages are separated by a dotted line in Fig. 1.

In the first part of the study, general descriptive statistics for the independent variables were detailed. Then, MLR was used to test the validity of the variables. Third, the most significant risk factors for diabetes were identified with RF; this step removes features that carry redundant information and reduces the 16-dimensional feature vector to 9. Fourth, eight different tests were conducted using two different input sets and four different classifier algorithms. Finally, the classifiers' performances were evaluated and compared.

This study adopts a comparative approach between two different input-set analysis frameworks, i.e., Models I and II. Model I uses all independent variables as classifier inputs, while Model II adopts the variables selected by MLR-RF. The rest of the framework is the same for both models, and four machine learning techniques, namely random forest (RF), k-nearest neighbor (k-NN), gradient boosting (GB), and XGBoost (XG), are compared.

2.2.1 Test of validity

Multiple linear regression (MLR)

MLR is a widely used type of regression analysis [12]. The goal of multiple linear regression (MLR) is to describe the data and model the linear relationship between the independent variables and one dependent variable. Analyzing this relationship focuses on understanding the effect of the inputs on the output parameter.

The MLR method can be used to eliminate insignificant variables from the models using p value statistics. The p value measures statistical significance in the context of null hypothesis testing: the smaller the p value, the stronger the evidence against the null hypothesis. For the 95% confidence level, the corresponding threshold is 0.05; a p value below 0.05 indicates that the observed effect is unlikely to be due to random variation [11]. A p value less than 0.05 at the 95% confidence level was therefore considered statistically significant in this study.
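As an illustrative sketch of this screening step (using statsmodels; fitting one simple regression per predictor mirrors the per-variable R2, F-statistic, and p value screening reported later in Table 2, but the exact procedure and settings are assumptions):

```python
# Hedged sketch: fit one simple regression per predictor and keep its
# F-test p value, mirroring the per-variable MLR screening described above.
import statsmodels.api as sm

results = {}
for col in X.columns:                     # X, y from the loading sketch above
    model = sm.OLS(y, sm.add_constant(X[[col]].astype(float))).fit()
    results[col] = (model.rsquared, model.fvalue, model.f_pvalue)

significant = [c for c, (_, _, p) in results.items() if p < 0.05]
print("Variables significant at the 0.05 level:", significant)
```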

2.2.2 Classifiers

This research aims to develop an optimal classification model that can accurately detect diabetes and to determine the best model among the candidates. To this end, the performances of four different models (k-nearest neighbors, gradient boosting, random forest, and XGBoost) were compared.

K-nearest neighbors (k-NN)

The k-NN algorithm is an essential and straightforward data mining technique that is especially useful in pattern recognition [10]. The main idea is that if a point to be estimated falls in the same category as its nearest neighbors in the training set, it can be concluded that this point shares their characteristics and attributes. The nearest neighbors in the training set thus determine the quality of the prediction: the estimate for a new point is based on the values of its k nearest neighbors [7]. The k-NN algorithm has been widely used in many fields, such as outlier detection, regression, and pattern recognition [42].

Random forest (RF)

Random forest (RF) can be defined as a collection of tree-type classifiers. Most datasets contain multidimensional features, many of which are irrelevant; for classifier models, these irrelevant variables degrade performance, and feature selection increases the classifier's success rate. The random forest algorithm uses simple probability to select the strongest features for its inputs. Breiman (2001) formulated the RF algorithm by drawing subsets of the sample data and constructing multiple decision trees over randomly sampled feature subspaces [4, 30].

RF is an algorithm that scales well to large datasets and is resistant to irrelevant features. It also improves performance by reducing variance [37].

Gradient boosting (GB)

Gradient boosting is a powerful technique that produces successful results in many practical applications [6, 22, 32, 48, 50]. GB discards the weak predictors and retains the stronger ones. It is an improved version of the decision tree in which each successor tree is analyzed comparatively to build the most satisfactory tree structure, using gain calculations, structure scores, and increasingly refined approximations [47]. GB is resistant to high dimensionality and linearity [9].

eXtreme gradient boosting (XGBoost, XG)

XGBoost is a high-performance boosting technique that minimizes a loss function and is optimized through various refinements. It is a gradient boosting method that iteratively adds models to an ensemble. The basic principle behind boosting is to focus on instances that the model cannot yet predict correctly or that are difficult: these instances are given more weight by skewing the distribution of observations so that they are more likely to appear in a sample, and the next weak learner therefore concentrates on predicting the hard instances correctly. By combining all the simple prediction rules into one overarching model, a powerful predictor, XGBoost, is obtained [37].
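A hedged training sketch for the XGBoost classifier follows (the 75%/25% split is the one described in Sect. 2.2.4; the random seed and hyperparameters are assumptions, since the exact settings are not reported here):

```python
# Hedged sketch: train XGBoost on the questionnaire features (Model I).
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)   # 390 train / 130 test samples

xg = XGBClassifier(eval_metric="logloss")    # default boosting settings (assumed)
xg.fit(X_train, y_train)
y_pred = xg.predict(X_test)
```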

2.2.3 Feature subset selection

This method selects the most relevant of all the features during model building and reduces the complexity of the prediction model [19]. In this study, I applied feature selection with random forest in the Python programming language to construct a model that contains only the most significant features. The classification input parameters were obtained using this algorithm.

Two selection strategies were used, MLR followed by random forest feature subset selection, to: (i) reduce the number of features fed to classifiers, (ii) find the most relevant and uncorrelated components, and (iii) reduce computational complexity and prediction time.
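A sketch of this two-stage selection might look as follows, under the assumption that the MLR-significant variables from the earlier sketch are ranked by impurity-based RF importance and the top nine are kept (the exact ranking threshold is not specified here):

```python
# Hedged sketch: rank the MLR-significant variables with RF importances
# and keep the top nine, matching the 16 -> 9 reduction described above.
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X[significant], y)

importances = pd.Series(rf.feature_importances_, index=significant)
selected = importances.sort_values(ascending=False).head(9).index.tolist()
print("Selected features:", selected)
```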

2.2.4 Performance evaluation

To evaluate the performance of the classifiers used in this study, the dataset was randomly divided into training and test sets of 75% and 25%, respectively.

The confusion matrix was used to compare the performance of the classifiers. Table 1 shows the four parts of the matrix, called True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN).

Table 1 Confusion matrix for binary classification

Samples labeled non-diabetes and diabetes were considered the positive class and the negative class, respectively. Here, TP is the number of non-diabetes samples correctly classified as "non-diabetes", FP is the number of "diabetes" samples classified as "non-diabetes", TN is the number of diabetes samples classified as "diabetes", and FN is the number of "non-diabetes" samples predicted as "diabetes". The metrics of precision, recall, F1 score, and accuracy obtained from the confusion matrix were used for performance evaluation, as given in eqs. (1), (2), (3), and (4), respectively.

$$ \mathrm{Precision}\ (\mathrm{PPV}) = \frac{TP}{TP + FP} $$
(1)
$$ \mathrm{Recall}\ (\mathrm{TPR}) = \frac{TP}{TP + FN} $$
(2)
$$ \mathrm{F1\ score} = 2 \times \frac{PPV \times TPR}{PPV + TPR} $$
(3)
$$ \mathrm{Accuracy}\ (\mathrm{ACC}) = \frac{TP + TN}{TP + FP + TN + FN} $$
(4)
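These four metrics can be computed directly from the confusion matrix entries, for example (a scikit-learn sketch; note that the label orientation must be checked against Table 1, where "non-diabetes" is the positive class):

```python
# Computing eqs. (1)-(4) from the confusion matrix of the test predictions.
# Here label 1 = diabetes; swap tp/tn and fp/fn if "non-diabetes" is taken
# as the positive class, as in Table 1.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)                            # eq. (1)
recall = tp / (tp + fn)                               # eq. (2)
f1 = 2 * precision * recall / (precision + recall)    # eq. (3)
accuracy = (tp + tn) / (tp + tn + fp + fn)            # eq. (4)
print(f"ACC={accuracy:.4f} P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```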

3 Experimental results and discussions

In this study, classification experiments were performed with hybrid models consisting of validity testing of the variables, dimension reduction, and machine learning methods for disease diagnosis on the early-stage diabetes risk prediction dataset. To test the validity of the variables, the multiple linear regression produced a numeric output vector including the R2 value, the F-statistic and its p value, and an estimate of the error variance. The random forest method was used for dimension reduction; k-NN, gradient boosting, random forest, and XGBoost models were used for classification. Models were tested on the Python and Matlab R2021a platforms on a computer with an i5-8250U CPU and 12 GB of RAM.

The research data comprised 520 instances, 320 labeled "diabetes" and 200 "non-diabetes". Figure 2 shows the descriptive statistics and the boxplot of the independent variables.

Fig. 2 The descriptive statistics and the boxplot of independent variables

The age range of the participants was 16–90 (mean = 48.03, standard deviation = 12.15). Male participants (328) outnumbered female participants (192). The least common symptoms among participants were obesity (88 subjects), genital thrush (116), and irritability (126).

Validity describes the extent to which research correctly measures what the researcher intends to measure [39]; reliability, in turn, indicates how consistently a measure or tool yields accurate results [5, 31]. The MLR analysis was applied to evaluate the validity of the variables. The correlation analysis results of all independent variables with the dependent variable are given in Table 2.

Table 2 Description of the R2 statistic, the F-statistic, and its p value of each variable using MLR

Table 2 shows that the F-statistic values of polydipsia and polyuria were 376.4226 and 412.7384, respectively, meaning that there are significant associations between diabetes and these two risk factors. After polydipsia and polyuria, gender is the strongest risk factor for diabetes, with an F-statistic of 130.968. Polyuria and polydipsia also had the highest R2 (coefficient of determination) among all independent variables.

A p value in linear regression indicates whether the relationship between variables is statistically significant. The MLR results showed that the p value for 13 variables was less than 0.05, and these variables were more effective in predicting diabetes. The three remaining predictor variables, itching, delayed healing, and obesity, were not statistically significant because their p values (0.1, 0.28, 0.76) were higher than the usual significance level of 0.05; they were therefore not used as inputs to the prediction models.

Dimension reduction methods help remove irrelevant independent variables from the dataset. With these methods, the required data storage space shrinks and the computational complexity is reduced, so the time taken to reach the same goal is shortened. The downside is that in rare cases they can cause significant information loss [34, 36]. Classification models should use the most relevant variables, rather than unnecessary ones, as inputs to increase training efficiency. Here, feature selection is performed using the random forest algorithm, which computes weights for the attributes. Figure 3 shows all the features of this study's dataset.

Fig. 3 Feature importance and feature selection using RF

The random forest algorithm identified the most significant attributes required for the early-stage diabetes risk prediction model, and ranking the features reduced the size of the feature set. Seven variables (delayed healing, visual blurring, weakness, itching, obesity, genital thrush, and muscle stiffness) were not used as inputs to the classification models because their p values were equal to or greater than 0.05 or their feature-ranking scores were low. According to the MLR and RF analysis results, the most significant risk factors for diabetes were polydipsia, polyuria, and gender. The nine features obtained from the 16 independent variables were finally used as inputs to the classification models in Model II.

Among the 520 samples, 390 were randomly selected for training and the remaining 130 were used for testing. All four classifiers (k-NN, random forest, gradient boosting, XGBoost) were trained with all independent variables in the first experiments (Model I). After training, the confusion matrices showing the classifiers' performance were obtained for the test data. Figure 4 shows the confusion matrices obtained from all the models for the testing phase.

Fig. 4 Values of TP, TN, FP, and FN for (a) k-NN, (b) GB, (c) RF, and (d) XG algorithms with 16 independent variable inputs for the test data
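An illustrative reproduction of these Model I experiments might look as follows (default hyperparameters and the random seed are assumptions, not the exact settings used in the experiments):

```python
# Hedged sketch: train all four classifiers on the 16 variables (Model I)
# and print the confusion matrix of each on the 130 test samples.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix

models = {
    "k-NN": KNeighborsClassifier(),
    "GB": GradientBoostingClassifier(),
    "RF": RandomForestClassifier(),
    "XG": XGBClassifier(eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, confusion_matrix(y_test, model.predict(X_test)), sep="\n")
# For Model II, repeat the split and training with X[selected] in place of X.
```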

The RF and XGBoost models outperform the competing models, with better and more consistent true positive (51) and true negative (78) values than the other two models. They also have fewer false positives (1) and false negatives (0). The k-NN model showed the lowest performance, with 11 false positives and 7 false negatives.

The results were visualized in confusion matrices, and the same matrices were obtained for the RF and XG algorithms. Evaluation metrics were calculated from the confusion matrices and are presented in Table 3 below.

Table 3 Evaluation of performance comparison for classifiers with 16 variable inputs

The macro and weighted averages of precision, recall, and F1 score were calculated from the confusion matrix of each classifier. The RF and XG algorithms showed the best performance with all variables: they outperform the others in all three categories, achieving macro-average precision, recall, and F-measure of 0.99 each. The same success was obtained for the weighted average values.

As shown in Table 3, the RF and XG algorithms give the best results, with an accuracy of 99.23%, higher than GB and k-NN. k-NN showed the lowest performance on all four metrics (accuracy, precision, recall, and F1 score) compared to the other three algorithms.

All classifiers were trained with the nine independent variables for Model II. After training the models, the confusion matrices needed to measure model performance on the test data were obtained. Figure 5 shows the confusion matrices from all models with nine variables for the test data.

Fig. 5 Values of TP, TN, FP, and FN for (a) k-NN, (b) GB, (c) RF, and (d) XG algorithms with 9 independent variable inputs for the test data

Again, the RF and XGBoost models outperform the competing models, with better and more consistent true positive (51) and true negative (78) values than the other two models, and with fewer false positives (1) and false negatives (0). The k-NN model showed the lowest performance, with 11 false positives and 6 false negatives.

Although the number of variables was reduced, the RF and XG matrices did not change compared with the first experiment, while the confusion matrices of the k-NN and GB algorithms improved. Evaluation metrics are shown in Table 4 below.

Table 4 Evaluation of performance comparison for classifiers with 9 variable inputs

In Model II, unlike the first experiment, only nine variables were applied to the inputs of the classifiers. Classifier performance, including recall, precision, F1 score, and accuracy, increased for k-NN and GB, while no change was observed for RF and XG. XG and RF still had the best results, with an accuracy of 99.23%. When the proposed feature selection was applied, the XG algorithm achieved the best results for all three measures, with macro-average precision, recall, and F-measure of 0.99 each.

The memory usage and computation time decreased for all four classification models. Computation times were 0.0245, 0.07563, 0.14635, and 0.04825 s for the k-NN, GB, RF, and XG models, respectively. Feature selection reduced the computation time for k-NN from 0.00798 s to 0.00245 s (by 69%); the reduction was more limited for the other three classifiers. Even though their success rates were equal, the computation time of the RF method was three times that of the XG method.
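One plausible way to obtain such per-model timings is to time the prediction pass over the test set (a sketch; what exactly the reported times cover is an assumption):

```python
# Hedged sketch: time one prediction pass over the test set per model.
import time

for name, model in models.items():
    start = time.perf_counter()
    model.predict(X_test)
    print(f"{name}: {time.perf_counter() - start:.5f} s")
```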

The receiver operating characteristic (ROC) curve is a two-dimensional plot that illustrates how well a classifier system works. The x- and y-axes are the false positive rate and true positive rate, respectively, of the predictive test. The diagonal line denotes the ROC curve of a random classifier, and each point in the ROC space is a TP/FP pair for one discrimination cut-off value of the predictive test [57]. As an alternative to ROC curves, precision-recall curves are often used in machine learning for assessing classification models [13]. The precision-recall curve shows the relationship between precision and recall [26]: precision measures the fraction of positive class predictions that actually belong to the positive class, while recall measures the fraction of all positive samples that are predicted as positive. The precision-recall and ROC curves (and AUC) for diabetes prediction using three machine learning algorithms are illustrated in Fig. 6.

Fig. 6 Precision-Recall and ROC curves for the prediction of diabetes for three classifiers: (a) k-NN classifier (green), (b) gradient boosting classifier (purple), and (c) XGBoost classifier (red)

The precision-recall curve with an AP score of 0.99 obtained with the proposed XGBoost algorithm likewise indicates an ideal system. The other graph is constructed with "sensitivity" on the vertical axis and "1 − specificity" on the horizontal axis (Fig. 6b). In comparing the "non-diabetes" group and the "diabetes" group, the area under the curve (AUC) was 0.87, 0.98, and 0.99 for the k-NN, GB, and XG models, respectively.
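Such curves can be produced from the classifiers' predicted class probabilities, for example (a matplotlib/scikit-learn sketch for the XG model; the figure styling is illustrative, not the styling of Fig. 6):

```python
# Hedged sketch: Precision-Recall and ROC curves for the XGBoost model.
import matplotlib.pyplot as plt
from sklearn.metrics import (roc_curve, precision_recall_curve, auc,
                             average_precision_score)

scores = models["XG"].predict_proba(X_test)[:, 1]
prec, rec, _ = precision_recall_curve(y_test, scores)
fpr, tpr, _ = roc_curve(y_test, scores)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(rec, prec, label=f"AP = {average_precision_score(y_test, scores):.2f}")
ax1.set(xlabel="Recall", ylabel="Precision")
ax2.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
ax2.plot([0, 1], [0, 1], "k--")   # random-classifier diagonal
ax2.set(xlabel="False positive rate (1 - specificity)",
        ylabel="True positive rate (sensitivity)")
ax1.legend(); ax2.legend()
plt.show()
```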

Figure 7 reports the classification results for the classifiers with and without feature selection on the diabetes data. The results show that the feature selection technique can improve the prediction success of the model and shorten the computation time. Considering the computation time (0.04825 s) and accuracy (99.23%) of the four classifiers together, the XGBoost model was superior to the others.

Fig. 7 Comparison between the results without feature selection and with the features selected by RF

By analyzing the results, it has been shown that a combination of MLR, RF, and XG has significant effects on diabetes detection based on the direct questionnaire method. The proposed methodology can distinguish diabetic patients from normal individuals with high accuracy. To investigate the effectiveness of the proposed MLR-RF-XG technique, comparisons have been made with some of the approaches reported in the current literature. Table 5 provides the comparative analysis of the mentioned existing methods and the proposed method.

Table 5 Comparison of the proposed system with existing systems in terms of accuracy and computational time

The researchers measured the performance of J48, k-NN, SVM, and RF for the prediction of diabetes mellitus [33]; the accuracies obtained were 86.46%, 100%, 77.73%, and 100% for J48, k-NN, SVM, and RF, respectively. The accuracy achieved by RF in [56] was 85.0%. A genetic algorithm (GA) with a radial basis function neural network (RBFNN) achieved a prediction accuracy of 77.4% [43]. The system developed in [59] acquired 80.84%, 78.53%, and 78.41% accuracy for RF, J48, and a neural network, respectively; with mRMR, they found the five features best suited to this classification task and obtained 75.08%, 76.13%, and 75.70% accuracy for RF, J48, and the neural network, respectively. The system proposed in a previous study [55] achieved 95.42% accuracy using two-stage K-means and logistic regression algorithms. The accuracy obtained by deep learning (DL) was 98.35% in another study [1]. Tigga and Garg (2020) proposed a system using RF and reported an accuracy of 94.10% [49]. Naz and Ahuja (2020) used DL in their system and reported an accuracy of 98.07% [41]. Logistic regression obtained an accuracy of 94.59% [18]. The results in a recent study [15] showed that the PCA-RF method performed very well, with an accuracy of 99.2%. Islam et al. (2020) and Gourisaria et al. (2022) evaluated random forest classifiers on the dataset used in this study [15, 29] and reported that random forest with 16 input features reached the best result, with an accuracy of about 99%. The prediction success of my methodology is better than that achieved by the RF method on the same dataset, since a 99.23% classification accuracy was reached with only nine features. In addition, when XGBoost is used as the classifier instead of RF, the computation time is reduced. From the information in Table 5, it is clear that the MLR-RF-XG approach can be regarded as an optimal data classification method compared with the most recent studies, especially the studies using the same dataset.

The main advantages of the proposed method are: it reduces the need for blood testing by using direct questionnaires to diagnose early-stage diabetes; it requires little memory and computation time while achieving high estimation accuracy; it determines which features are most significant for diagnosis; and it uses fewer features than some other studies in the literature.

The disadvantage of this study is that the dataset contains very little data, especially for the young population. If diabetes cannot be controlled in young people, unwanted complications may occur early in life, so the proposed method should also be tested in this age group. A disadvantage of the feature selection method used is the additional computational cost incurred in the preprocessing stage of building the predictive model.

4 Hypothesis and limitations

This study is promising and demonstrates that the proposed MLR-RF-XG technique is a powerful method that can assist in the diagnosis of diabetic patients. However, the study's limitations are as follows. The research sample is limited to one country and one hospital; increasing the number of instances obtained from different countries and hospitals would further strengthen confidence in this technique. To further increase the model's reliability, a plan to expand the existing dataset over the next two years is under consideration. The second drawback is that the performance of the proposed approach has not been discussed with internists; a comparison of this system with internists will therefore be part of a future study.

5 Conclusion

This research proposes a new method for early-stage diabetes prediction using an XGBoost classifier with a random forest feature selection technique. The dataset consists of questionnaire responses from patients of Sylhet Diabetes Hospital in Bangladesh. Analyses were carried out using different methods to compare classification performance, and the same test/training ratio (25%/75%) was used for each classifier to evaluate the results objectively. The results were compared with studies in the literature and found superior in terms of classification accuracy, sensitivity, precision, AUC, and F1 score. The proposed method using XGBoost was compared with three other classifiers under the same procedures: k-nearest neighbor, random forest, and gradient boosting. Using the XGBoost classifier with the random forest feature selection technique provided a classification accuracy of 99.23%, precision of 99%, sensitivity of 99%, F1 score of 99%, and AUC of 0.993 in identifying disease risk. The excellent performance of the XGBoost model shows that it can accurately predict early-stage diabetes, so the need for blood tests for diagnosis can be reduced. This approach has a practical pre-clinical application in assessing patients at risk of diabetes, and the proposed methodology can significantly assist in improving the accuracy of diabetes diagnosis and, therefore, further treatment. Future work will test the proposed method on different datasets.