1 Introduction

Diabetes is a chronic condition characterized by insufficient production of insulin by the pancreas or by the body's inability to use the insulin it produces effectively. When insulin cannot perform this regulatory function, hyperglycaemia develops as the blood glucose level rises above the normal range [1]. Unless treated, diabetes, which presents with symptoms such as intense thirst, intense appetite, and frequent urination, causes many complications in the patient [2]. The number of diabetic patients is expected to rise significantly as the global population grows, and according to information shared by the International Diabetes Federation (IDF) and the World Health Organisation (WHO), the progression of the disease has accelerated over the past decade. According to data released by the IDF in 2021, there are presently 537 million diagnosed individuals with diabetes, a number projected to increase to 643 million by 2030 and 784 million by 2045 [3].

With the advancement of technology, numerous diseases can now be diagnosed using artificial intelligence and machine learning techniques. As a result, disease diagnosis and the reporting of related examinations are completed more quickly, so patients spend less time in the medical facility [2]. Many nations now invest heavily in smart hospital initiatives; by automating the system, these applications relieve hospital overcrowding and reduce the labour required. The analysis of biomedical data and images using artificial intelligence techniques is increasing rapidly in the literature [4,5,6,7,8]. Many researchers rely primarily on machine learning algorithms to conduct experiments and develop methods for diagnosing various diseases. Machine learning algorithms are preferred for disease diagnosis because they provide more accurate, quicker, and less expensive results. Because data mining and machine learning algorithms can combine data from multiple sources and manage large quantities of data, they increase predictive power. Progress in machine learning and artificial intelligence has enabled more effective early-stage disease detection and diagnosis than manual methods, particularly for diabetes recognition [9,10,11,12]. Early diagnosis of this illness relies heavily on computer-assisted expert systems based on machine learning. The objective of this investigation is to detect diabetes at an early stage using machine learning algorithms. Vijayan and Anjali used a dataset obtained from the UCI machine learning repository for training [13]. They introduced a decision support system employing the AdaBoost algorithm with a Decision Stump as the base classifier.
Evaluating the performance of the individual algorithms, they found that SVM achieved the highest accuracy rate of 79.68%. When combined with AdaBoost, however, the decision stump model exhibited the highest accuracy, at 80.72%.

Perveen et al. [14] estimated diabetes mellitus with the CPCSSN diabetes dataset using the AdaBoost (J48 decision tree), Bagging (J48 decision tree), and J48 decision tree algorithms. The dataset was divided into a training set (60%) and a test set (40%), and the Chi-square method was used for feature selection. The experiments were conducted with the Weka software. In terms of ROC curve performance, the Bagging (J48 decision tree) model achieved the highest result, with a value of 98%. In the study by Ram and Vishwakarma, diabetes prediction was performed on the Pima Indian Diabetes Dataset (PIDD) with the KNN, SVM, LR, Gaussian Naive Bayes (GNB), and RF machine learning algorithms [15], using 10-fold cross-validation. At 85%, the logistic regression algorithm achieved the highest accuracy in that study. Sisodia and Sisodia classified the PIDD dataset for diabetes prediction using the GNB, SVM, and DT machine learning algorithms [16], evaluating the accuracy, precision, and recall of the three algorithms. In their investigation, the Naive Bayes algorithm achieved the best prediction performance, with a value of 76.30%. Using a neural network and SVM ensemble, Zolfagri et al. [17] proposed a method for diagnosing diabetes in Pima Indian female populations; its prediction accuracy of 88.04% is the highest among comparable classification systems in the literature and is extremely promising for this problem. Alam et al. [18] presented a research paper on diabetes prediction employing artificial neural networks (ANN), random forest (RF), and K-means clustering. They applied the ANN technique to the PIDD dataset and achieved a maximum accuracy of 75.7%.
On the other hand, Ma utilized six traditional machine learning models, namely logistic regression, support vector machine, decision tree, random forest, boosting, and neural network, to develop a prediction model for diabetes diagnosis [19]. The Sylhet Diabetes Dataset from the UCI Machine Learning Repository supplied the data. Each model was parameterized to strike a balance between accuracy and complexity. With an accuracy of 96% on the test dataset, the neural network was the most accurate diabetes prediction model. Emon et al. demonstrated the relationship between various diabetes-causing symptoms and the disease [20]. Eleven machine learning classification algorithms were applied to the Sylhet Diabetes Dataset, among which the random forest classifier demonstrated the highest accuracy (98%).

Khaleel et al. propose a model capable of predicting whether a patient has diabetes [21]. Their model is based on the prediction accuracy of several powerful machine learning (ML) algorithms evaluated using metrics such as precision, recall, and F1-score. The Pima Indian Diabetes (PIDD) dataset, based on diagnostic measurements for predicting diabetes onset, was used. The results indicate that the Logistic Regression (LR), Naive Bayes (NB), and K-Nearest Neighbour (KNN) algorithms achieved prediction accuracies of 94%, 79%, and 69%, respectively, demonstrating that LR is more efficient at predicting diabetes than the other algorithms. Usama et al. introduced a novel model utilizing a fused machine learning approach for predicting diabetes [22]. The proposed framework incorporates two types of models, Support Vector Machine (SVM) and Artificial Neural Network (ANN), which analyse a dataset to determine the likelihood of a positive or negative diabetes diagnosis. The outputs of these models serve as input membership functions for a fuzzy logic model, which ultimately determines the diagnosis; the fused models are stored in a cloud storage system for future use. The proposed fused machine learning model achieves a prediction accuracy of 94.87%, surpassing previously published methods. Kumari et al. enhanced the accuracy of diabetes mellitus prediction through an ensemble of machine learning algorithms [23]. Their research focuses on the Pima Indians Diabetes dataset, which contains information about patients both with and without diabetes. The proposed ensemble soft voting classifier performs binary classification by combining three machine learning algorithms: random forest, logistic regression, and Naive Bayes. The empirical evaluation compares the proposed methodology with state-of-the-art techniques and base classifiers.
Evaluation criteria include accuracy, precision, recall, and F1-score. The ensemble approach achieves the highest accuracy, precision, recall, and F1-score values, reaching 79.04%, 73.48%, 71.45%, and 80.6%, respectively, on the PIMA diabetes dataset. The study proposed by Nipa et al. aims to develop a machine learning-based predictive model for the early diagnosis of diabetes [24]. A dataset of 1078 records was used, combining patient records obtained through a survey conducted in Bangladesh with the dataset from Sylhet Diabetes Hospital. Thirty-five different classification methods were evaluated using performance metrics such as accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (ROC-AUC). The results were interpreted using the Shapley additive explanations method to identify the features most influential in diabetes occurrence. The findings reveal that the Extra Trees (ET) classifier demonstrated the best performance, with a 97.11% accuracy rate on the Sylhet Diabetes Hospital dataset, while the Multi-Layer Perceptron (MLP) classifier yielded the best result, with a 96.42% accuracy rate, on the combined dataset.

In optimizing the hyperparameters of CNN algorithms, various studies have explored the effectiveness of different optimization techniques. Bochinski et al. [25] utilized GWO to optimize key hyperparameters such as layer count, filter count, neuron count, and initial layer values. Similarly, Baldominos et al. [26] applied GWO to fine-tune a comprehensive set of hyperparameters, including learning rate, optimization method, activation method, batch size, number of filters, filter size, number of neurons, and optimizer values within the CNN algorithm. In a related vein, Silva et al. [27] chose PSO to optimize hyperparameters such as filter size and number, batch size, and dropout rate, reporting performance gains using the PSO algorithm for determining CNN hyperparameters. Wang et al. [28] contributed to the exploration of optimization techniques by employing PSO to refine hyperparameters, including filter number and size, neuron count, and stride values in the CNN algorithm. Meanwhile, Mohakud and Dash [29] focused on GWO as their chosen method for optimizing CNN algorithm hyperparameters. Expanding beyond CNN algorithms, Kılıçarslan employed a diverse set of optimization algorithms, including PSO, Cat Swarm Optimization (CSO), and a hybrid approach combining PSO and GWO [30]. This hybrid approach was specifically applied for optimizing hyperparameters in the 1D-VGG-16 model.

Many researchers have stated that hybrid or ensemble learning models give better results than a single model [31], and within the scope of this study, classification is carried out using various ensemble learning methods. Another approach in the study is to use ensemble learning algorithms to reduce both the variance of the model and the dependence on any single model. Prediction accuracy can be increased by combining models that are trained multiple times and independently of each other. Ensemble approaches using multiple learning algorithms are an effective way to improve classification accuracy [32]. An overview of the study is given in Fig. 1. In this study,

  • Combining the PSO and GWO algorithms, the study introduces a hybrid optimization method for hyperparameter optimization. This strategy aims to improve the effectiveness and efficiency of hyperparameter optimization in machine learning models.

  • Using the collected dataset, multiple machine learning algorithms, including Random Forest, Gradient Boosting Machine, Extra Trees Classifier, Gaussian Naive Bayes, Logistic Regression, and Decision Tree, are employed to classify diabetes. These models are used as benchmarks to assess the performance of the proposed hybrid optimization method.

  • The research investigates various ensemble learning techniques, such as Boosting, Bagging, Super Learner, Stacking, Max Voting, and Average Voting, to enhance the classification accuracy and robustness of the models. These methods combine the predictions of multiple base classifiers to produce more accurate and trustworthy forecasts.

  • A comprehensive comparative analysis is conducted to identify the models and ensemble learning methods with the highest accuracy, precision, and F1 score performance. Included in the comparison are pertinent studies from the scholarly literature, allowing for a thorough evaluation of the proposed method versus existing cutting-edge techniques.

Fig. 1

An overview of the study methodology

The following is an overview of the article's structure. Section 2 describes the data collection, machine learning models, ensemble models, and optimization methods, together with the parameters of the developed ensemble learning applications and the evaluation metrics. Section 3 presents the experimental work and results of the machine learning classifiers, evaluated with metrics such as accuracy, recall, precision, and F1-score. Section 4 presents the discussion and suggests areas for future investigation. Finally, the conclusion is given in Sect. 5.

2 Material and methods

2.1 Dataset

The dataset used in the study was collected through a direct survey of patients at Sylhet Diabetes Hospital in Sylhet, Bangladesh, and is publicly available on the Kaggle platform [33]. Details of the dataset are given in Table 1. The diabetes symptom dataset comprises 16 attributes and a total of 520 instances; each attribute represents a distinct feature or variable related to diabetes symptoms. The attribute values in Table 1 are expressed as percentages.

Table 1 Details of dataset

2.2 Data preprocessing

Data preprocessing is crucial for preparing data for machine learning tasks. It involves transforming and cleansing the original data to guarantee its quality and compatibility with the selected algorithms. The following preprocessing steps were performed in this study:

  • Categorical attributes containing values such as "Yes/No" or "Positive/Negative" are converted into numerical values: "Yes" or "Positive" is assigned 1, and "No" or "Negative" is assigned 0.

  • The gender attribute is converted into a binary numeric format, with 1 representing male and 0 representing female.

  • The age attribute is provided as a range; to standardize it into a numerical format, the midpoint of the range is taken.

  • The class attribute, which indicates a patient's diabetes test result, is encoded so that 1 represents a positive result and 0 a negative result.

By applying these steps, the data are prepared for use with machine learning algorithms, ensuring compatibility and effective model training and evaluation.
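The encoding steps above can be sketched in plain Python. The field names ("Gender", "Polyuria", "Age", "class") and value strings used here are illustrative assumptions about the raw survey file, not a verified copy of the Kaggle schema:

```python
# Hypothetical encoder for one raw survey record; field names and value
# strings are assumptions for illustration.
BINARY = {"Yes": 1, "No": 0, "Positive": 1, "Negative": 0,
          "Male": 1, "Female": 0}

def encode_record(record):
    """Convert one raw survey record (a dict) into numeric features."""
    encoded = {}
    for key, value in record.items():
        if value in BINARY:
            encoded[key] = BINARY[value]            # Yes/No, Positive/Negative, Gender
        elif key == "Age" and isinstance(value, str) and "-" in value:
            lo, hi = value.split("-")
            encoded[key] = (int(lo) + int(hi)) / 2  # midpoint of the age range
        else:
            encoded[key] = value                    # already numeric, keep as-is
    return encoded
```

For example, `encode_record({"Gender": "Male", "Polyuria": "Yes", "Age": "40-49", "class": "Positive"})` yields `{"Gender": 1, "Polyuria": 1, "Age": 44.5, "class": 1}`.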

2.3 Machine learning

The discipline of machine learning, a subfield of artificial intelligence, is inspired by human learning processes. It employs mathematical and statistical techniques to draw conclusions and make predictions from data [4, 34]. By learning from existing datasets, machine learning algorithms can create models that address complex problems that may be difficult to solve using conventional programming techniques [35]. These models are capable of automatically enhancing their performance based on their acquired knowledge and data-driven insights.

2.3.1 Ensemble models

Ensemble learning is a machine learning approach where multiple classifiers are trained to solve the same problem. Unlike traditional machine learning methods that aim to learn a single hypothesis from training data, ensemble learning methods generate a collection of hypotheses and combine them for utilization [36]. The objective of any machine learning problem is to find the best model that can accurately predict the desired outcome. Instead of relying on a single model and hoping it is the most accurate estimator possible, ensemble methods consider multiple models and average their predictions to produce a final model. The goal of ensemble learning methods is to enhance generalizability and robustness by combining the predictions of several baseline estimators constructed using a specific learning algorithm. Common ensemble learning methods include bagging, boosting, voting, and stacking, as they offer specific advantages and have well-defined construction processes. The structure of ensemble models is illustrated in Fig. 2.

Fig. 2

Ensemble models

2.3.1.1 Bagging

The bagging algorithm, created by Breiman, is among the oldest, simplest, and most effective ensemble-based algorithms [37]. In the bagging algorithm, each model is trained independently and the results are then averaged. The primary objective of the bagging method is to achieve less variance than any individual model. The method, whose name is an abbreviation of Bootstrap Aggregation, draws different subsets of training data at random, with replacement, from the entire training dataset. The bagging ensemble learning method typically uses homogeneous weak learners, trains them independently in parallel, and combines them using a process such as averaging. It can also be effective when the dataset is noisy or when there is a high risk of overfitting [38, 39].
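As an illustration of the procedure (not the implementation used in the study), the bagging idea can be sketched in plain Python with a hypothetical one-dimensional decision stump as the weak learner:

```python
import random
from collections import Counter

def train_stump(X, y):
    """Weak learner: a 1-D decision stump chosen by exhaustive search."""
    best = None
    for t in set(X):
        for pol in (0, 1):  # polarity: which side of the threshold is class `pol`
            errors = sum((pol if x <= t else 1 - pol) != yi for x, yi in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, t, pol)
    _, t, pol = best
    return lambda x: pol if x <= t else 1 - pol

def bagging(X, y, n_models=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    # Aggregate by majority vote, which reduces the variance of a single stump
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]
```

On a separable toy set such as `bagging([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])`, the aggregated predictor recovers the class boundary even though each stump sees only a resampled subset.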

2.3.1.2 Boosting

Freund and Schapire created the boosting ensemble learning method as an iterative approach that combines weak learners to create a more accurate classifier [40]. Boosting typically considers homogeneous weak learners, adapts to them sequentially, and combines them according to a strategy. Boosting methods follow the same logic as bagging methods in that they construct a family of models whose combination produces a stronger learner with improved performance. However, whereas the primary objective of bagging is to reduce variance, boosting fits a large number of weak learners in a highly adaptive, sequential manner.
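The adaptive reweighting at the heart of AdaBoost can be sketched as follows; this is an illustration of the scheme with a hypothetical weighted 1-D stump, not the study's exact configuration:

```python
import math

def train_weighted_stump(X, y, w):
    """Weak learner: 1-D stump minimizing the *weighted* training error."""
    best = None
    for t in set(X):
        for pol in (0, 1):
            err = sum(wi for x, yi, wi in zip(X, y, w)
                      if (pol if x <= t else 1 - pol) != yi)
            if best is None or err < best[2]:
                best = (t, pol, err)
    return best

def adaboost(X, y, rounds=5):
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform example weights
    models = []
    for _ in range(rounds):
        t, pol, err = train_weighted_stump(X, y, w)
        err = max(err, 1e-10)              # avoid log(0) for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        models.append((t, pol, alpha))
        # Increase the weight of misclassified examples so the next stump
        # concentrates on them
        preds = [pol if x <= t else 1 - pol for x in X]
        w = [wi * math.exp(-alpha if p == yi else alpha)
             for wi, p, yi in zip(w, preds, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return models

def boosted_predict(models, x):
    """Weighted vote of all stumps, each vote scaled by its alpha."""
    score = sum(alpha * (1 if (pol if x <= t else 1 - pol) == 1 else -1)
                for t, pol, alpha in models)
    return 1 if score > 0 else 0
```

Each round's coefficient `alpha` grows as the weighted error shrinks, so accurate stumps dominate the final vote.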

2.3.1.3 Voting

In classification and regression problems, the voting algorithm is a well-known technique for effective ensemble learning. It requires creating multiple submodels, each of which makes predictions. The predictions of these submodels are then averaged (for regression) or taken as the mode (for classification), so that each submodel casts a vote [41].

Utilising majority voting, the voting ensemble learning method combines the predictions. There are three distinct varieties of majority voting. In the initial version, the class with which all classifiers concur is selected. In the second version, the majority is the class supported by at least one more than half of the classifiers. In the third version, the class with the most votes is chosen, regardless of whether the votes exceed 50% (majority or plurality voting) [42].

In the merging procedure, the voting ensemble learning method employs multiple techniques [43]. In the majority vote technique, each classifier makes a prediction, and a decision is accepted as the ensemble decision only if more than half of the classifiers agree on it; if no decision obtains more than half of the votes, no ensemble estimate is produced. In such circumstances, plurality voting, another combination technique, is more appropriate: the ensemble prediction is the most frequently predicted decision, without requiring it to receive more than 50% of the votes. However, if the individual classifiers have varying performances, it is more advantageous to use the weighted voting technique, which combines the classifiers according to performance-based weights, allowing for a more nuanced and accurate ensemble prediction.
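The three voting variants described above can be sketched in a few lines of Python (an illustrative sketch, not the study's implementation):

```python
from collections import Counter

def majority_vote(preds):
    """Return the winning label only if it has more than half the votes."""
    label, count = Counter(preds).most_common(1)[0]
    return label if count > len(preds) / 2 else None  # None: no ensemble decision

def plurality_vote(preds):
    """Return the most frequent label, even with less than 50% support."""
    return Counter(preds).most_common(1)[0][0]

def weighted_vote(preds, weights):
    """Scale each classifier's vote by its performance weight."""
    scores = {}
    for p, w in zip(preds, weights):
        scores[p] = scores.get(p, 0.0) + w
    return max(scores, key=scores.get)
```

Note that `majority_vote([0, 1, 2])` returns `None` (no label exceeds half the votes), whereas plurality voting always produces a decision, and weighted voting lets a single strong classifier outvote several weak ones.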

2.3.1.4 Stacking

Wolpert's stacking ensemble learning method, introduced in 1992, is a general approach that entails training a learner to combine individual learning models [44]. Stacking is a method for combining classification or regression models and is typically composed of two layers. The first layer consists of the models used to make predictions on the test datasets. The second layer consists of a meta-classifier or meta-regressor that generates new predictions using the predictions of the underlying models as input. The individual learning models are referred to as first-level learners, while the integrator responsible for combining them is referred to as a second-level learner or meta-learner.
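The two-layer data flow can be sketched as follows; the base models and meta-learner here are arbitrary callables used purely for illustration:

```python
def build_meta_features(base_models, X):
    """Level-0: each row of the meta dataset is the vector of base-model
    predictions for one example."""
    return [[model(x) for model in base_models] for x in X]

def stacked_predict(base_models, meta_model, x):
    """Level-1: the meta-learner combines the base predictions."""
    level0 = [model(x) for model in base_models]
    return meta_model(level0)
```

In practice the meta-features should be produced with out-of-fold predictions, so the meta-learner is never trained on outputs the base models produced for their own training examples.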

2.3.1.5 Super learner

The Super Learner, an ensemble learning model, was first introduced in the article "Super Learner" by Mark Van der Laan, Alan Hubbard, and Eric Polley of the University of California, Berkeley, in 2007 [45]. The Super Learner approach follows a two-part logic in its basic framework.

In the first phase, different learners are trained on the dataset. These learners may use different algorithms or learning approaches, for example linear regression, decision trees, support vector machines, and neural networks. In the second phase, the estimates of each learner are combined with a weighting chosen to provide the best estimate. The weights can be determined according to the performance of each learner or by more sophisticated methods; for example, majority voting can be used in classification problems and averaging of estimates in regression problems.
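A compact sketch of this two-phase logic for classification is given below, with each candidate learner weighted by its held-out accuracy. This is a simplification of the full cross-validated Super Learner, intended only to illustrate the fit-then-weight structure:

```python
def super_learner(fit_fns, X_tr, y_tr, X_val, y_val):
    """Phase 1: fit every candidate learner on the training split.
    Phase 2: weight each learner by its validation accuracy and combine
    them by weighted vote."""
    models = [fit(X_tr, y_tr) for fit in fit_fns]
    weights = [sum(m(x) == yi for x, yi in zip(X_val, y_val)) / len(y_val)
               for m in models]
    def predict(x):
        scores = {}
        for m, w in zip(models, weights):
            p = m(x)
            scores[p] = scores.get(p, 0.0) + w
        return max(scores, key=scores.get)
    return predict
```

A learner that generalizes poorly receives a low validation accuracy and therefore contributes little to the combined prediction.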

The study employs ensemble learning methods along with individual machine learning techniques for diabetes classification. Various classifiers, including K-Nearest Neighbour, Naive Bayes, Decision Tree, Deep Neural Networks, Random Forest, and Gradient Boosted Trees, are utilized in the model estimation process. The evaluation of the model considers not only the individual performances of these classifiers but also the performances of ensemble learning methods such as bagging, boosting, voting, and stacking.

2.4 Optimization algorithms

Optimization algorithms play a crucial role in fine-tuning machine learning models by searching for the optimal values of hyperparameters. Mirjalili et al. proposed a method based on the hybridization of two optimization algorithms: Particle Swarm Optimization (PSO) and Grey Wolf Optimization (GWO) [46]. By combining the strengths of these algorithms, the researchers aimed to improve the efficacy and efficiency of the hyperparameter optimization procedure.

PSO is a population-based optimization algorithm inspired by the flocking or schooling behaviour of birds and fish. Utilising a swarm of particles that iteratively explore the search space, the optimal solution is found. GWO is a nature-inspired algorithm that is based on the social hierarchy and hunting behaviour of grey wolves. It imitates the leadership dynamics between the alpha, beta, and delta wolves in order to direct the search for the global optimum. By combining PSO and GWO, the proposed algorithm capitalizes on the exploration and exploitation capabilities of both algorithms to conduct a more exhaustive search of the hyperparameter space. This hybrid strategy seeks to strike a balance between exploration to identify promising regions and exploitation to fine-tune hyperparameters for enhanced model performance.

The integration of PSO and GWO in the proposed algorithm offers a unique synergy that has the potential to improve the optimization process, resulting in better convergence, faster computation, and enhanced precision of the machine learning models. This hybridization technique has the potential to achieve better hyperparameter optimization results than either PSO or GWO alone.

2.4.1 PSO

PSO is a population-based optimization algorithm (Kennedy and Eberhart [47]) inspired by the collective behaviour of bird flocking or fish schooling. The algorithm is comprised of a swarm of particles that move through the search space, iteratively updating their positions based on their best-known position and the best-known position of the swarm as a whole. This social interaction and exchange of information between particles enables PSO to effectively explore and exploit the search space. Due to its simplicity and effectiveness, PSO has been widely utilized in numerous fields, including machine learning, optimization, and data mining. It has shown success in resolving complex optimization problems, such as feature selection, parameter optimization, and neural network training [48]. It is a popular choice for hyperparameter optimization in machine learning tasks due to the algorithm's ability to strike a balance between exploration and exploitation.

The formula for updating the position and velocity of each particle is as follows [49]:

$${v}_{i+1}= w\times {v}_{i}+{c}_{1}\times \text{rand}\times \left(p{\text{best}}_{i}-{x}_{i}\right)+{c}_{2}\times \text{rand}\times \left(g{\text{best}}_{i}-{x}_{i}\right)$$
(1)

\(p{\text{best}}_{i}\): the individual optimal value of the ith particle

\(g{\text{best}}_{i}\): the global optimal value

\({c}_{\text{1,2}}\): optimization coefficients

rand: random number between (0,1)

w: inertia weight parameter

$${x}_{i+1}={x}_{i}+{v}_{i+1}$$
(2)
$$w={w}_{max}-\frac{{w}_{max}-{w}_{min}}{T}\times t$$
(3)

T: the maximum number of iterations

t : the current number of iterations.
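Equations (1)–(3) translate directly into code. The sketch below minimizes a simple one-dimensional test function; the settings (c1 = c2 = 2, w decayed from 0.9 to 0.4) are common illustrative defaults, not the configuration used in the study:

```python
import random

def pso(f, lo, hi, n_particles=20, T=50, c1=2.0, c2=2.0,
        w_max=0.9, w_min=0.4, seed=0):
    rng = random.Random(seed)
    x = [rng.uniform(lo, hi) for _ in range(n_particles)]  # positions
    v = [0.0] * n_particles                                # velocities
    pbest = x[:]                                           # personal bests
    pbest_val = [f(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g], pbest_val[g]
    for t in range(T):
        w = w_max - (w_max - w_min) / T * t                # Eq. (3): inertia decay
        for i in range(n_particles):
            v[i] = (w * v[i]
                    + c1 * rng.random() * (pbest[i] - x[i])  # Eq. (1)
                    + c2 * rng.random() * (gbest - x[i]))
            x[i] += v[i]                                     # Eq. (2)
            val = f(x[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = x[i], val
                if val < gbest_val:
                    gbest, gbest_val = x[i], val
    return gbest, gbest_val
```

For a convex objective such as `pso(lambda x: x * x, -5, 5)`, the swarm contracts toward the minimum at x = 0; the global best can only improve, since it is replaced only when a strictly better position is found.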

2.4.2 GWO

GWO is a nature-inspired optimization algorithm that simulates the social hierarchy and hunting behaviour of grey wolves [50]. To guide the search for the global optimum, the algorithm imitates the leadership dynamics among the alpha, beta, and delta wolves. Each wolf in the search space represents a potential solution, and their positions are updated iteratively based on the alpha, beta, and delta positions. The algorithm's equilibrium between exploration and exploitation enables it to navigate the search space efficiently. GWO has demonstrated promising results in the resolution of multiple optimization problems, including feature selection, parameter tuning, and function optimization. It is a valuable tool for optimization tasks in machine learning and other domains due to its efficacy, simplicity, and rapid convergence [50].

For the purpose of simulating a leadership structure, the study uses a hierarchy of grey wolves composed of four types: alpha, beta, delta, and omega. The alpha wolf represents the best solution, while the beta and delta wolves represent the second- and third-best solutions, respectively; the omega wolves represent the remaining candidate solutions. The mathematical model incorporates the hunting behaviour of grey wolves, which entails searching for prey, encircling it, and attacking it. The leadership hierarchy and the hunting mechanism are combined to create the mathematical model [49].

The behaviour of wolves when they surround their prey can be expressed by Eqs. (4) and (5) as follows [49].

$$D=\left|C\times {X}_{p}\left(t\right)-X(t)\right|$$
(4)
$$X\left(t+1\right)={X}_{p}\left(t\right)-A\times D$$
(5)

t: the current iteration,

\({X}_{p}\): the position of the prey,

X: the position of a grey wolf.

The vector coefficients A and C can be computed using Eqs. (6) and (7).

$$A=a\times (2\times {r}_{1}-1)$$
(6)
$$C=2\times {r}_{2}$$
(7)

\(a\): linearly reduced from 2 to 0,

\({r}_{\text{1,2}}\): random variables in the interval 0 to 1

When a grey wolf locks onto its prey during the hunt, the positions of the wolves are updated with respect to the positions of the three best wolves. The update formulas are as follows:

$${D}_{\alpha }=\left|{C}_{1}\times {X}_{\alpha }-X(t)\right|,\quad {D}_{\beta }= \left|{C}_{2}\times {X}_{\beta }-X(t)\right|,\quad {D}_{\delta }= \left|{C}_{3}\times {X}_{\delta }-X(t)\right|$$
(8)
$${X}_{1}={X}_{\alpha }-{A}_{1}\times {D}_{\alpha },\quad {X}_{2}={X}_{\beta }-{A}_{2}\times {D}_{\beta },\quad {X}_{3}={X}_{\delta }-{A}_{3}\times {D}_{\delta }$$
(9)
$$X\left(t+1\right)= \frac{{X}_{1}+{X}_{2}{+X}_{3}}{3}$$
(10)
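Equations (4)–(10) can be sketched as follows for a one-dimensional objective; A and C follow Eqs. (6)–(7), and each wolf's new position averages the pulls toward the alpha, beta, and delta wolves. This is an illustrative sketch, not the study's implementation:

```python
import random

def gwo(f, lo, hi, n_wolves=20, T=50, seed=0):
    rng = random.Random(seed)
    wolves = [rng.uniform(lo, hi) for _ in range(n_wolves)]
    best_x, best_val = None, float("inf")
    for t in range(T):
        # Rank the pack: alpha, beta, delta are the three best wolves
        ranked = sorted(wolves, key=f)
        alpha, beta, delta = ranked[0], ranked[1], ranked[2]
        if f(alpha) < best_val:
            best_x, best_val = alpha, f(alpha)
        a = 2 - 2 * t / T                       # linearly decreased from 2 to 0
        for i in range(n_wolves):
            pulls = []
            for leader in (alpha, beta, delta):
                A = a * (2 * rng.random() - 1)  # Eq. (6)
                C = 2 * rng.random()            # Eq. (7)
                D = abs(C * leader - wolves[i])  # Eqs. (4), (8)
                pulls.append(leader - A * D)     # Eq. (9)
            wolves[i] = sum(pulls) / 3           # Eq. (10)
    return best_x, best_val
```

As `a` shrinks toward 0, |A| shrinks with it, so the pack shifts from wide exploration to tight exploitation around the leaders.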

2.5 Evaluation metrics

In this study, a variety of machine learning methods and ensemble learning models are employed for the analysis and classification of the data. The performance of these models is evaluated using various metrics, and the evaluation results, which include the confusion matrix, are presented in Fig. 3. To determine the efficacy of the classifiers and ensemble methods, performance metrics such as accuracy, weighted-average sensitivity, and weighted-average precision were used. These metrics provide insight into the overall accuracy of the models, as well as their ability to correctly identify positive instances (sensitivity) and to predict positive instances across different classes with high precision. Evaluating the classifiers and ensemble methods with these metrics allows for a better understanding of their respective strengths and their ability to handle the given classification task. The metric equations are given in Eqs. (11)–(14):

  • True Positive (TP) True Positive indicates instances where the model correctly predicts a positive condition (e.g. the presence of diabetes).

  • True Negative (TN) True Negative denotes instances where the model correctly predicts a negative condition (e.g. the absence of diabetes).

  • False Positive (FP) False Positive occurs when the model incorrectly predicts a negative condition as positive. In this case, it wrongly classifies a person without the disease as having the disease.

  • False Negative (FN) False Negative happens when the model incorrectly predicts a positive condition as negative. In this case, it wrongly classifies a person with the disease as not having the disease.

    $$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
    (11)
    $$\text{Recall}=\frac{TP}{TP+FN}$$
    (12)
    $$\text{Precision}=\frac{TP}{TP+FP}$$
    (13)
    $$F1\_\text{Score}=2\times \frac{\text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}$$
    (14)
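Equations (11)–(14) follow directly from the confusion-matrix counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (11)
    recall = tp / (tp + fn)      # Eq. (12): found positives / all real positives
    precision = tp / (tp + fp)   # Eq. (13): predicted positives that are correct
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (14)
    return accuracy, recall, precision, f1
```

For instance, with TP = 4, TN = 3, FP = 1, and FN = 2, the accuracy is 0.7, the recall 2/3, the precision 0.8, and the F1-score 8/11 ≈ 0.727.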
Fig. 3

Evaluation metrics

3 Experimental results

In the knowledge discovery process for predicting diabetes, the individual classification performances of the classifiers are first evaluated independently of any ensemble learning method. Many classifiers are applied in the model (Decision Tree, Gaussian Naive Bayes, Support Vector Classifier, Logistic Regression, Gradient Boosting Machine, Light Gradient Boosting Machine, Extra Trees Classifier, Random Forest, eXtreme Gradient Boosting, Multi-Layer Perceptron). Classifiers that provide suitably high performance are included in the model, and the results of all ensemble models used are given in this section. As the hardware and software environment for the study, a computer with an Intel Xeon processor, 128 GB RAM, and the Windows 10 operating system was used. For the classification process, the Anaconda distribution with Python 3.8.13 was used, with the Jupyter Notebook editor.

The dataset used in our research comprises 16 attributes, each representing different aspects related to diabetes symptoms. As defined in Sect. 2.1, these attributes were obtained from a direct survey conducted with patients at Sylhet Diabetes Hospital in Bangladesh. The features extracted from the dataset primarily include binary indicators related to diabetes symptoms (e.g. Polyuria, Polydipsia) and demographic information (e.g. Age, Gender). To increase the reliability of our research, we divided the dataset into training-validation (80%) and test (20%) sets. Additionally, the 5-fold cross-validation method is used in the study.
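The 80/20 split and 5-fold partitioning can be sketched as index bookkeeping (the study itself presumably relied on standard library utilities for this):

```python
import random

def train_test_split_indices(n, test_frac=0.2, seed=0):
    """Shuffle indices, then cut off the last `test_frac` share as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * (1 - test_frac)))
    return idx[:cut], idx[cut:]

def kfold_indices(n, k=5):
    """Split indices into k contiguous, near-equal validation folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds
```

For the 520-instance dataset this yields 416 training-validation and 104 test indices, and five folds of 104 indices each; during cross-validation, each fold serves once as the validation set while the remaining four are used for training.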

In our research, we initially obtained individual results for the Particle Swarm Optimization (PSO) and Grey Wolf Optimization (GWO) algorithms, both designed for hyperparameter optimization. These results are presented in Tables 2 and 3, showcasing the performance of each algorithm in isolation. The purpose of this presentation is to establish a baseline understanding of the efficacy of each algorithm on its own. By examining the isolated outcomes of PSO and GWO, we aim to gain insights into their individual strengths and weaknesses when applied to classification algorithms. This comparative analysis sets the stage for the subsequent step, in which we introduce a hybrid approach combining both PSO and GWO into a unified PSO+GWO algorithm. The intention is to assess how this hybridization influences the overall performance and optimization capabilities, providing a comprehensive understanding of its impact on classification algorithms. The presented tables serve as a reference point for evaluating the relative improvements or synergies achieved by the hybrid PSO+GWO algorithm in comparison to the standalone PSO and GWO approaches. 5-fold cross-validation was used to obtain the results.

Table 2 Accuracy results of classifiers with PSO algorithm
Table 3 Accuracy results of classifiers with GWO algorithm

Machine learning relies significantly on hyperparameter optimization, which involves determining the ideal set of hyperparameters for a given model in order to attain optimal performance. The benefits of the PSO-GWO hybrid optimization method over conventional optimization methods are described below.

In general, the PSO algorithm is effective at solving real-world problems, but it can become trapped in local optima, which limits its performance [51]. To address this issue, the proposed method combines PSO with the GWO algorithm, which serves as an additional mechanism for escaping local optima. In standard PSO, particles are occasionally sent to random positions with low probability so that diverse regions of the search space are explored [51]; however, this exploration strategy risks moving away from the global optimum. We exploit the search capability of the GWO algorithm to mitigate this risk: instead of sending particles to random locations, the GWO algorithm directs them to positions that have already been partially optimized by its own update rules. A flowchart of the hybrid PSO-GWO is shown in Fig. 4. By combining GWO with PSO, the overall balance between exploration and exploitation is improved, increasing the likelihood of locating the global optimum. It should be noted, however, that adding the GWO step increases the computational time required for optimization; whether this cost is acceptable depends on the complexity of the problem and the performance gains achieved. The hybrid PSO-GWO algorithm is given in Table 4.
The hyperparameters of the classifiers are listed in Table 5.

Fig. 4
figure 4

A flowchart of hybrid PSO-GWO

Table 4 Hybrid PSO-GWO algorithm
Table 5 Hyperparameters of classifiers
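The escape mechanism described above can be sketched generically. The snippet below is a minimal illustration of the hybrid idea: standard PSO velocity updates, with the low-probability random jumps redirected toward the three best ("alpha/beta/delta") positions in GWO style. The coefficients, the jump probability, and the toy objective (standing in for, e.g., one minus cross-validated accuracy) are all illustrative assumptions, not the exact procedure of Table 4.

```python
# Generic sketch of hybrid PSO-GWO: PSO updates, with GWO-style leader
# moves replacing random restarts. All constants are illustrative.
import numpy as np

def sphere(x):
    # Toy objective standing in for (1 - cross-validated accuracy).
    return float(np.sum(x ** 2))

def hybrid_pso_gwo(obj, dim=2, n_particles=20, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.array([obj(p) for p in pos])
    for t in range(iters):
        order = np.argsort(pbest_val)
        alpha, beta, delta = pbest[order[:3]]     # GWO leader positions
        gbest = pbest[order[0]].copy()
        a = 2 * (1 - t / iters)                   # GWO coefficient decays
        for i in range(n_particles):
            r1, r2 = rng.random(dim), rng.random(dim)
            vel[i] = (0.7 * vel[i]
                      + 1.5 * r1 * (pbest[i] - pos[i])
                      + 1.5 * r2 * (gbest - pos[i]))
            if rng.random() < 0.1:
                # GWO-style move toward the leaders instead of a random jump.
                pos[i] = (alpha + beta + delta) / 3 + a * rng.standard_normal(dim)
            else:
                pos[i] = pos[i] + vel[i]
            val = obj(pos[i])
            if val < pbest_val[i]:
                pbest_val[i], pbest[i] = val, pos[i].copy()
    best = np.argmin(pbest_val)
    return pbest[best], pbest_val[best]

best_x, best_val = hybrid_pso_gwo(sphere)
```

In a hyperparameter-tuning setting, each particle position would encode a classifier's hyperparameters and the objective would retrain and score the model.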

Ensemble learning methods, initially proposed to reduce the high variance of machine learning methods and increase accuracy, quickly proved successful in addressing common problems encountered in machine learning, such as feature selection, handling missing features, error correction, estimating confidence intervals, and dealing with imbalanced data and classes [42]. In general, ensemble learning refers to methods that combine predictions from multiple machine learning models to achieve higher accuracy and better performance than any individual model alone. By training multiple machine learning models on the data, predictions with higher performance than those of a single model can be achieved [52]. In these models, rather than combining the classifiers themselves, the focus is on combining the predictions obtained by the classifiers to generate a shared prediction [52].

To ensure an accurate assessment of our model's performance and mitigate the risk of overfitting, we adopted the 5-fold cross-validation method for data division. In this approach, the dataset is divided into five equal parts, referred to as folds. During cross-validation, each fold is used in turn as the test set, while the remaining four folds are combined to form the training set. This process is repeated five times, with each fold serving as the test set exactly once; as a result, every instance in the dataset is used for testing, ensuring a comprehensive evaluation. The scheme of the cross-validation method is given in Fig. 5.
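The 5-fold scheme above maps directly onto scikit-learn's cross-validation utilities; the classifier and synthetic data below are placeholders for any of the models in this study.

```python
# 5-fold cross-validation as described in the text, on placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Each of the five folds serves as the test set exactly once; the mean of
# the five fold scores is the reported value.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
mean_accuracy = scores.mean()
```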

Fig. 5
figure 5

Scheme of cross-validation method

3.1 Machine learning algorithms

In this study, the dataset is first classified using machine learning methods, and the performance evaluation results are shown in Table 6. Each value in the table is the result of the 5-fold cross-validation process, obtained by averaging the values from the individual folds. The models with the highest average performance on the test data in Table 6 are the Random Forest, Gradient Boosting Machine, and Extra Tree Classifier algorithms. The confusion matrices and classification reports of these algorithms are given in Fig. 6, and the ROC graph is given in Fig. 7.

Table 6 Evaluation metrics of machine learning algorithms
Fig. 6
figure 6

Confusion matrix and classification report of (a) Random Forest, (b) Gradient Boosting Machine, and (c) Extra Tree Classifier

Fig. 7
figure 7

ROC curve of model

3.2 Boosting ensemble models

As the second method in the study, models are created with the boosting ensemble method, which is applied to the dataset for the classification of diabetes. The classifiers given in Table 7 are included in the boosting ensemble learning method and their classification performances are evaluated. Each value is obtained by averaging the values from the individual folds of 5-fold cross-validation. According to the findings in Table 7, the Random Forest Classifier gives the highest performance with the boosting ensemble learning method: its accuracy is 0.979, its precision is 0.987, and its F1-score is 0.979. The Random Forest classifier also provides the highest performance in the specificity metric. When the data in Table 7 are examined, the Extra Tree Classifier model gives the shortest test time, and the Extra Tree Classifier and Random Forest Classifier models give the shortest training time, both with a value of 0.0784. The confusion matrices and classification reports of the three models with the highest performance are shown in Fig. 8. The ROC graph of all models used for boosting ensemble learning is given in Fig. 9.
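One common realisation of boosting is AdaBoost, which re-weights training samples so that each new weak learner focuses on the examples its predecessors misclassified. The snippet below uses scikit-learn's AdaBoostClassifier with its default decision-stump base learner on synthetic data; this is an illustrative assumption, not the paper's exact boosting configuration.

```python
# Boosting ensemble sketch: AdaBoost with its default decision-stump base
# learner, evaluated with 5-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 16 features, as in the diabetes dataset.
X, y = make_classification(n_samples=300, n_features=16, random_state=0)

boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
acc = cross_val_score(boosted, X, y, cv=5).mean()
```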

Table 7 Evaluation metrics of boosting ensemble models
Fig. 8
figure 8

Confusion matrix and classification report of (a) Random Forest, (b) Decision Tree, and (c) Extra Tree Classifier

Fig. 9
figure 9

ROC curve of models

3.3 Bagging ensemble models

The bagging ensemble learning method is another method applied to the dataset for diabetes classification. The performances of the classifiers used in the bagging ensemble learning method are shown in Table 8. Each performance value is obtained by 5-fold cross-validation: the value from each fold is accumulated and the average is taken after all folds are completed. When Table 8 is examined, the highest performance belongs to the Gradient Boosting Machine Classifier model. The model that completes training in the shortest time is the Gaussian Naive Bayes Classifier, and the model that completes the test phase in the shortest time is the Logistic Regression classifier. The Gradient Boosting Machine classifier, which provides the highest performance, has an accuracy of 0.979, a precision of 0.990, and an F1-score of 0.940.
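Bagging trains each base learner on a bootstrap resample of the data and aggregates their votes at prediction time. A minimal sketch with scikit-learn's BaggingClassifier (using its default decision-tree base learner, an assumption here) on synthetic data:

```python
# Bagging ensemble sketch: 25 base learners, each fit on a bootstrap
# resample, evaluated with 5-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=16, random_state=0)

bagged = BaggingClassifier(n_estimators=25, random_state=0)
acc = cross_val_score(bagged, X, y, cv=5).mean()
```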

Table 8 Evaluation metrics of bagging ensemble models

The training time of the Gaussian Naive Bayes classifier, which completed training in the shortest time, is 0.1843, and the test time of the Logistic Regression classifier, which completed the test phase in the shortest time, is 0.0090. The confusion matrices and classification reports of the three best-performing classifiers in the bagging ensemble learning method are given in Fig. 10. The ROC graph of all models is given in Fig. 11.

Fig. 10
figure 10

Confusion matrix and classification report of (a) Gradient Boosting Machine, (b) Decision Tree, and (c) Random Forest

Fig. 11
figure 11

ROC curve of models

3.4 Super learner ensemble models

Another method applied to the dataset for the classification of diabetes is the Super Learner ensemble learning method. In this method, each classifier model is trained on the dataset and its performance is obtained. Performance values are calculated by averaging the values obtained at each fold of 5-fold cross-validation. The performance values of all classifiers under this method are given in Table 9. When the findings in Table 9 are examined, the highest performance and the fastest training and test stages belong to the Random Forest Classifier, which has an accuracy of 0.981, a precision of 0.990, a sensitivity of 0.976, an F1-score of 0.981, a training time of 0.711, and a test time of 0.5279. The confusion matrices and classification reports of the three models with the highest accuracy in the Super Learner ensemble learning method are given in Fig. 12, and the ROC graph of all classifiers used with the method is given in Fig. 13.
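The core mechanic of a super learner is that out-of-fold predictions from the base models become the input features of a meta-learner, so the meta-learner never sees a base model's predictions on its own training folds. A compact sketch, with placeholder models and data:

```python
# Super learner sketch: out-of-fold base-model probabilities feed a
# logistic-regression meta-learner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=16, random_state=1)
base_models = [RandomForestClassifier(n_estimators=50, random_state=1),
               GaussianNB()]

# cross_val_predict yields each base model's out-of-fold probabilities.
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
meta_model = LogisticRegression().fit(meta_X, y)
```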

Table 9 Evaluation metrics of super learner ensemble models
Fig. 12
figure 12

Confusion matrix and classification report of (a) Gradient Boosting Machine. (b) Extra Tree Classifier and (c) Random Forest

Fig. 13
figure 13

ROC curve of models

3.5 Stacking ensemble models

The stacking ensemble learning method is also applied to the dataset used to classify diabetes. The classifiers used with this method and their performance evaluation results are given in Table 10. Each performance value in the table is obtained by averaging the values from the individual folds of 5-fold cross-validation. When the results in Table 10 are examined, the highest performance in the stacking ensemble learning method belongs to the Extra Tree Classifier, the shortest training time belongs to the Gaussian Naive Bayes classifier, and the shortest test-phase time belongs to the Logistic Regression Classifier. The Extra Tree classifier, which provides the highest performance, has an accuracy of 0.979, a precision of 0.981, a sensitivity of 0.983, and an F1-score of 0.979.
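Stacking can be expressed compactly with scikit-learn's StackingClassifier, which handles the cross-validated base predictions internally; the base models, meta-estimator, and data below are illustrative choices, not the paper's exact setup.

```python
# Stacking ensemble sketch: two base classifiers feed their
# cross-validated predictions to a logistic-regression meta-estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=16, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)

stack = StackingClassifier(
    estimators=[("et", ExtraTreesClassifier(n_estimators=50, random_state=2)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),
    cv=5,
)
acc = stack.fit(X_tr, y_tr).score(X_te, y_te)
```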

Table 10 Evaluation metrics of stacking ensemble models

The training time of the Gaussian Naive Bayes Classifier, which has the shortest training time, is 0.1021, and the test-phase time of the Logistic Regression Classifier, which has the shortest test phase, is 0.0020. The confusion matrices and classification reports of the three models with the highest performance in the stacking ensemble learning method are given in Fig. 14. The ROC graph of all models is given in Fig. 15.

Fig. 14
figure 14

Confusion matrix and classification report of (a) Gradient Boosting Machine, (b) Extra Tree Classifier and (c) Random Forest

Fig. 15
figure 15

ROC curve of models

3.6 Max voting ensemble models

The max voting ensemble learning method is another method applied to the dataset to classify diabetes. With this method, each classifier is trained and its performance is examined; the classifiers used with the method are given in Table 11 together with their performance values. Each performance value is calculated with the 5-fold cross-validation method, with the performances at each fold obtained and averaged. When the performances in Table 11 are examined, the Random Forest Classifier gives the highest performance, with an accuracy of 0.981, a precision of 0.990, and an F1-score of 0.981. The Decision Tree Classifier has the shortest training time, 0.0082, and the Light Gradient Boosting Machine classifier has the shortest test time, 0.0032. The confusion matrices and classification reports of the three models with the highest performance using the max voting ensemble learning method are given in Fig. 16, and the ROC graph of all classifiers used with the method is given in Fig. 17.
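Max voting corresponds to hard majority voting: each classifier casts one vote for a class label and the majority label wins. A minimal sketch with scikit-learn's VotingClassifier on placeholder data (the member models are illustrative):

```python
# Max (hard) voting sketch: the majority class label across three
# classifiers is returned, evaluated with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=3)

vote = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=3)),
                ("dt", DecisionTreeClassifier(random_state=3)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard",
)
acc = cross_val_score(vote, X, y, cv=5).mean()
```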

Table 11 Evaluation metrics of max voting ensemble models
Fig. 16
figure 16

Confusion matrix and classification report of (a) Gradient Boosting Machine, (b) Extra Tree Classifier and (c) Random Forest

Fig. 17
figure 17

ROC curve of models

3.7 Average voting ensemble models

The average voting ensemble learning method is also used for the classification of diabetes on the dataset. With this method, the classifiers are trained and their performances are obtained; the results are given in Table 12. Each performance value in Table 12 is obtained by 5-fold cross-validation of the average voting ensemble model with the corresponding classifier, averaging the performances obtained at each fold. As Table 12 shows, the highest performance belongs to the Random Forest Classifier, which has an accuracy of 0.981, a precision of 0.990, a sensitivity of 0.976, and an F1-score of 0.981. With a training time of 0.0116, the Decision Tree has the shortest training time, and the Light Gradient Boosting Machine classifier has the shortest test time, 0.0026. The confusion matrices and classification reports of the three models with the highest performance under the average voting ensemble learning method are shown in Fig. 18, and the ROC graph of all classifiers used with the method is given in Fig. 19.
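Unlike hard voting, average (soft) voting averages the members' predicted class probabilities and picks the class with the highest average. This can be shown directly with numpy; the probability values below are hypothetical, not model outputs from this study.

```python
# Average (soft) voting sketch on hypothetical class-probability outputs
# of three models for three samples (columns: class 0, class 1).
import numpy as np

p_rf  = np.array([[0.2, 0.8], [0.6, 0.4], [0.9, 0.1]])
p_gbm = np.array([[0.3, 0.7], [0.4, 0.6], [0.8, 0.2]])
p_lr  = np.array([[0.1, 0.9], [0.7, 0.3], [0.7, 0.3]])

# Average the probabilities, then pick the class with the highest average.
avg = (p_rf + p_gbm + p_lr) / 3
pred = avg.argmax(axis=1)   # → array([1, 0, 0])
```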

Table 12 Evaluation metrics of average voting ensemble models
Fig. 18
figure 18

Confusion matrix and classification report of (a) Gradient Boosting Machine, (b) Extra Tree Classifier and (c) Random Forest

Fig. 19
figure 19

ROC curve of models

The results given in Table 13 demonstrate the classification performance of each model under the various ensemble learning techniques, with the highest values shown in bold. The Soft Model depicts the performance of each individual model without the use of ensemble techniques. The ensemble learning methods include Boosting, Bagging, Super L. (Super Learner), Stacking, Max Voting, and Average Voting. Random Forest has the highest accuracy rate among the models, 0.980, followed by Multi-Layer Perceptron with 0.978. These models performed consistently well across multiple ensemble learning methodologies. Compared with the other models, the accuracy of the Gaussian Naive Bayes model is lower, indicating its limitations in reliably classifying diabetes. The results demonstrate that ensemble learning methods improve classification performance; the Super Learner method, which combines multiple classifiers, provided consistently superior performance across multiple models. These results contribute to the field of diabetes diagnosis and demonstrate the predictive power of ensemble learning techniques. Exploring additional ensemble learning techniques and optimising hyperparameters could improve classification performance even further.

Table 13 Comparison of model accuracy with hybrid PSO+GWO

In comparing the performance of the three optimization methods (PSO, Table 2; GWO, Table 3; and the hybrid PSO+GWO, Table 13), it is evident that each method exhibits distinct strengths across the classification models. In the PSO results, certain models, such as Random Forest and Gradient Boosting, show high accuracy, while others, like Logistic Regression and Light Gradient Boosting, demonstrate relatively lower performance. Similarly, GWO yields competitive results, excelling in some models such as Decision Tree and Gaussian Naive Bayes, but underperforming in others. The hybrid PSO+GWO, on the other hand, consistently delivers robust accuracy across a diverse set of models, outperforming both PSO and GWO individually. This hybrid approach capitalizes on the complementary strengths of PSO and GWO, showcasing a balanced and effective optimization strategy. The superior performance of the hybrid method suggests its potential as a reliable optimization technique for diverse classification tasks, highlighting the significance of integrating multiple optimization algorithms for enhanced model accuracy and generalization.

The Friedman test, outlined in [53], is a statistical technique designed for comparing multiple group measures without making assumptions about their distribution. Its aim is to evaluate whether these group measures demonstrate significant differences in variability, essentially examining the null hypothesis that they share the same variance up to a certain level of significance. In [54], the assessment of machine learning algorithms across diverse datasets is conducted, utilizing statistical tests including the Wilcoxon signed ranks test and Friedman test [55]. In this particular study, the models employed undergo the Friedman test, and the findings are presented in Table 14.

Table 14 The Friedman test results for average voting model

The Friedman test results for the average voting model, under which the models generally achieve the highest accuracy rates, are presented in Table 14. A p value of 1.831e-06 is found for the Friedman test, indicating that the differences between the models are statistically significant. The Friedman test is employed to assess differences between the measures used to compare model performance; the results include the mean and standard deviation of the model performances, as well as their minimum and maximum values. These findings indicate that the Random Forest model achieved the highest average accuracy (0.980637). The performance of the other models is also compared and presented in tabular form. The observed results provide insight into which model performs best on this dataset and may serve as a resource for researchers working on model selection and performance evaluation.
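The Friedman test compares matched measurements (here, per-fold performance of several models) by ranking the models within each fold. A minimal sketch with SciPy; the per-fold accuracy values below are illustrative, not the paper's measurements.

```python
# Friedman test sketch on hypothetical per-fold accuracies of three
# models over five cross-validation folds.
from scipy.stats import friedmanchisquare

rf  = [0.98, 0.97, 0.99, 0.98, 0.98]
gbm = [0.96, 0.95, 0.97, 0.96, 0.97]
gnb = [0.90, 0.89, 0.91, 0.90, 0.88]

# A small p value rejects the hypothesis that all models perform
# equivalently across the folds.
stat, p = friedmanchisquare(rf, gbm, gnb)
```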

In comparison to the earlier studies listed in Table 15, our research attained the highest accuracy, 98.10%. This demonstrates that the ensemble learning model proposed in this study, which utilizes multiple classifiers and combination techniques, outperforms the models examined in the literature. The results demonstrate the efficiency of ensemble learning in improving the accuracy of diabetes diagnosis. In addition, our research offers valuable insights into the potential of ensemble learning methods to enhance the performance of diabetes classification models. It is important to note that the dataset size, feature set, and population characteristics used in each study may vary, which can affect comparisons of accuracy rates. Nevertheless, our study demonstrates a substantial advancement in the field by obtaining an exceptionally high accuracy in diagnosing diabetes using the provided dataset.

Table 15 Comparison of studies in literature

4 Discussion

In this study, we investigated the application of ensemble learning methods for diabetes classification using a dataset collected from patients at Sylhet Diabetes Hospital in Sylhet, Bangladesh. We utilized popular machine learning algorithms such as Random Forest, Gradient Boosting Machine, Extra Tree Classifier, Gaussian Naive Bayes, and Logistic Regression within various ensemble learning frameworks. One significant contribution of this research is the utilization of a hybrid method, Particle Swarm Optimization-Grey Wolf Optimization (PSO-GWO), for hyperparameter optimization. When comparing the three optimization methods—Particle Swarm Optimization (PSO), Grey Wolf Optimization (GWO), and the hybrid PSO+GWO—we observed distinct strengths and performance patterns across various classification models. PSO demonstrated high accuracy in specific models like Random Forest (PSO: 0.980) and Gradient Boosting (PSO: 0.975), while GWO excelled in Decision Tree (GWO: 0.969) and Gaussian Naive Bayes (GWO: 0.887). However, the hybrid PSO+GWO consistently delivered robust accuracy across a diverse set of models, outperforming both PSO and GWO individually. For instance, in the hybrid approach, Random Forest achieved an accuracy rate of 0.981, surpassing the individual rates of PSO (0.980) and GWO (0.9769). This hybrid approach capitalizes on the complementary strengths of PSO and GWO, showcasing a balanced and effective optimization strategy. The superior performance of the hybrid method suggests its potential as a reliable optimization technique for diverse classification tasks, emphasizing the importance of integrating multiple optimization algorithms for enhanced model accuracy and generalization. Hyperparameters play a crucial role in the performance of machine learning models, and optimizing them effectively is essential for achieving accurate and robust results. 
The PSO-GWO hybrid method combines the strengths of both the PSO and GWO algorithms, enabling an efficient and effective search of the hyperparameter space. By applying this hybrid optimization technique, we were able to enhance the performance of the classifiers and obtain improved results for diabetes classification. The results of the performance evaluation indicate that the Random Forest classifier consistently exhibited high accuracy, precision, and F1-scores across multiple ensemble learning methods, highlighting the effectiveness of ensemble learning in combining multiple classifiers to improve overall classification performance. The Gradient Boosting Machine and Extra Tree Classifier also displayed strong performance, underlining their potential as valuable tools in diabetes classification tasks. In the Boosting ensemble learning method, the Random Forest classifier exhibited excellent performance, achieving an accuracy of 0.979, a precision of 0.987, and an F1-score of 0.979, while the Extra Tree Classifier and Random Forest Classifier models displayed the shortest training time, both recording a value of 0.0784. The Bagging ensemble learning method showed that the Gradient Boosting Machine Classifier achieved the highest performance, with an accuracy of 0.979, a precision of 0.990, and an F1-score of 0.940; the Gaussian Naive Bayes Classifier and Logistic Regression classifier had the shortest training and test times, respectively. In the Super Learner ensemble learning method, the Random Forest Classifier outperformed the other classifiers in terms of accuracy (0.981), precision (0.990), F1-score (0.981), training time (0.711), and test time (0.5279). Similarly, the Extra Tree Classifier excelled in the Stacking ensemble learning method, exhibiting high accuracy (0.979), precision (0.981), and F1-score (0.979).
The Gaussian Naive Bayes Classifier had the shortest training time (0.1021), while the Logistic Regression Classifier recorded the shortest test time (0.0020). The Max Voting ensemble learning method indicated that the Random Forest Classifier achieved the highest performance with an accuracy value of 0.981, precision value of 0.990, and F1-score of 0.981. Moreover, the Decision Tree Classifier exhibited the shortest training time (0.0082), while the Light Gradient Boosting Machine classifier showcased the shortest test time (0.0032). Similarly, in the Average Voting ensemble learning method, the Random Forest Classifier delivered the highest performance in terms of accuracy (0.981), precision (0.990), and F1-score (0.981). The Decision Tree classifier recorded the shortest training time (0.0116), while the Light Gradient Boosting Machine classifier demonstrated the shortest test time (0.0026).

This study demonstrates the effectiveness of ensemble learning methods in diabetes classification. Future works could contribute to the generalization of these findings by testing them on larger datasets and expanding them to different disease classifications. Additionally, investigating more advanced hyperparameter optimization techniques and exploring new ensemble learning methods could be important for obtaining more precise results in diabetes diagnosis and treatment.

5 Conclusions

In conclusion, this study demonstrated the effectiveness of ensemble learning methods for diabetes classification using a dataset from Sylhet Diabetes Hospital, Bangladesh. Our comprehensive evaluation of Particle Swarm Optimization (PSO), Grey Wolf Optimization (GWO), and the hybrid PSO+GWO algorithm in the context of classification tasks has provided valuable insights into their respective performances. The individual analyses revealed distinctive strengths and weaknesses for PSO and GWO across various models; however, the hybrid PSO+GWO consistently demonstrated superior accuracy, outperforming both PSO and GWO in most cases. The application of the PSO-GWO hybrid method for hyperparameter optimization contributed to the improved performance of the classifiers, with the Random Forest, Gradient Boosting Machine, and Extra Tree Classifier algorithms consistently achieving high accuracy, precision, and F1-scores. The findings of this research have implications for both academic and practical applications in the field of diabetes classification. Ensemble learning methods, especially when coupled with advanced hyperparameter optimization techniques, offer robust and accurate models for assisting healthcare professionals in diabetes diagnosis and treatment decision-making. Future research should focus on further refining and expanding hybrid optimization methods to enhance the robustness and efficiency of ensemble learning models for diabetes classification.
Exploring additional features, such as genetic information and lifestyle factors, could contribute to the development of more personalized and accurate diabetes classification models. Moreover, the study opens avenues for investigating the interpretability of ensemble models, as understanding the decision-making processes of these models is crucial for their acceptance and application in real-world medical scenarios. Furthermore, the generalizability of the proposed approach to diverse datasets and populations should be explored to ensure its applicability across different healthcare settings. Additionally, attention should be directed towards addressing challenges related to data privacy and security, especially when implementing ensemble learning models in clinical practice. Continued collaboration between machine learning researchers and healthcare professionals is essential to bridge the gap between technological advancements and practical healthcare needs, ultimately advancing the field of diabetes diagnosis and treatment decision-making.