1 Introduction

Diabetes is a chronic condition characterized by insufficient production of insulin by the pancreas or by the body's inability to use the insulin it produces effectively. When insulin cannot perform this regulatory function, hyperglycaemia develops as the blood glucose level rises above the normal range [1]. Unless treated, diabetes, which presents with symptoms such as intense thirst, intense appetite, and frequent urination, causes many complications in the patient [2]. The number of diabetic patients is expected to rise significantly as the global population grows, and according to information shared by the International Diabetes Federation (IDF) and the World Health Organisation (WHO), the progression of the disease has accelerated over the past decade. According to data released by the IDF in 2021, there are presently 537 million diagnosed individuals with diabetes, a number projected to increase to 643 million by 2030 and 784 million by 2045 [3].

With the advancement of technology, numerous diseases can now be diagnosed using artificial intelligence and machine learning techniques. As a result, disease diagnosis and the reporting of related examinations are completed more quickly, so patients spend less time in the medical facility [2]. Many nations now invest heavily in smart hospital initiatives; by automating the system, these applications relieve hospital overcrowding and reduce the labour required. The analysis of biomedical data and images using artificial intelligence techniques is increasing rapidly in the literature [4,5,6,7,8]. Many researchers rely primarily on machine learning algorithms to conduct experiments and develop methods for diagnosing various diseases. Machine learning algorithms are preferred for disease diagnosis because they provide more accurate, quicker, and less expensive results. Because data mining and machine learning algorithms can combine data from multiple sources and manage large quantities of data, they increase predictive power. Progress in machine learning and artificial intelligence has enabled more effective early-stage disease detection and diagnosis than manual methods, particularly for diabetes recognition [9,10,11,12]. Early diagnosis of this illness relies heavily on computer-assisted expert systems based on machine learning. The objective of this investigation is to detect diabetes at an early stage using machine learning algorithms. Vijayan and Anjali used a dataset obtained from the UCI machine learning repository for training [13]. They introduced a decision support system employing the AdaBoost algorithm with a Decision Stump as the base classifier.
Evaluating the performance of the individual algorithms, they found that SVM achieved the highest accuracy rate of 79.68%. When combined with AdaBoost, however, the decision stump model exhibited the highest accuracy, at 80.72%.

Perveen et al. [14] estimated diabetes mellitus with the CPCSSN diabetes dataset using the AdaBoost (J48 decision tree), Bagging (J48 decision tree), and J48 decision tree algorithms. The dataset was divided into a training set (60%) and a test set (40%), and the Chi-square method was used for feature selection. The experiments were conducted with the Weka software. In terms of ROC curve performance, the Bagging (J48 decision tree) model achieved the highest result, with a value of 98%. In the study by Ram and Vishwakarma, diabetes prediction was performed on the Pima Indian Diabetes Dataset (PIDD) with the KNN, SVM, LR, Gaussian Naive Bayes (GNB), and RF machine learning algorithms [15], using 10-fold cross-validation. At 85%, the logistic regression algorithm achieved the highest accuracy in that study. Sisodia and Sisodia classified the PIDD dataset for diabetes prediction using the GNB, SVM, and DT machine learning algorithms [16], evaluating the accuracy, precision, and recall of the three algorithms. In their investigation, the Naive Bayes algorithm achieved the best prediction performance, with a value of 76.30%. Using a neural network and SVM ensemble, Zolfagri et al. [17] proposed a method for diagnosing diabetes in Pima Indian female populations; its prediction accuracy of 88.04% is the highest among comparable classification systems in the literature and is extremely promising for this problem. Alam et al. [18] presented a research paper on diabetes prediction employing artificial neural networks (ANN), random forest (RF), and K-means clustering. They applied the ANN technique to the PIDD dataset and achieved a maximum accuracy of 75.7%.
On the other hand, Ma utilized six traditional machine learning models, namely logistic regression, support vector machine, decision tree, random forest, boosting, and neural network, to develop a prediction model for diabetes diagnosis [19]. The Sylhet Diabetes Dataset from the UCI Machine Learning Repository supplied the data. Each model was parameterized to strike a balance between accuracy and complexity. With an accuracy of 96% on the test dataset, the neural network was the most accurate diabetes prediction model. Emon et al. demonstrated the relationship between various diabetes-causing symptoms and the disease [20]. Eleven machine learning classification algorithms were applied to the Sylhet Diabetes Dataset, among which the random forest classifier demonstrated the highest accuracy (98%).

Khaleel et al. propose a model capable of predicting whether a patient has diabetes [21]. Their model is based on the prediction accuracy of several powerful machine learning (ML) algorithms evaluated using metrics such as precision, recall, and F1-score. The Pima Indian Diabetes (PIDD) dataset, based on diagnostic measurements for predicting diabetes onset, was used. The results indicate that the Logistic Regression (LR), Naive Bayes (NB), and K-Nearest Neighbour (KNN) algorithms achieved prediction accuracies of 94%, 79%, and 69%, respectively, demonstrating that LR is more efficient at predicting diabetes than the other algorithms. Usama et al. introduced a novel model utilizing a fused machine learning approach for predicting diabetes [22]. The proposed framework incorporates two types of models, Support Vector Machine (SVM) and Artificial Neural Network (ANN), which analyse a dataset to determine the likelihood of a positive or negative diabetes diagnosis. The outputs of these models serve as input membership functions for a fuzzy logic model, which ultimately determines the diagnosis; the fused models are stored in a cloud storage system for future use. The proposed fused machine learning model achieves a prediction accuracy of 94.87%, surpassing previously published methods. Kumari et al. enhanced the accuracy of diabetes mellitus prediction through an ensemble of machine learning algorithms [23]. Their research focuses on the Pima Indians Diabetes dataset, which contains information about patients both with and without diabetes. The proposed ensemble soft voting classifier performs binary classification by combining three machine learning algorithms: random forest, logistic regression, and Naive Bayes. The empirical evaluation compares the proposed methodology with state-of-the-art techniques and base classifiers.
Evaluation criteria include accuracy, precision, recall, and F1-score. The ensemble approach achieves the highest accuracy, precision, recall, and F1-score values, reaching 79.04%, 73.48%, 71.45%, and 80.6%, respectively, on the PIMA diabetes dataset. The study proposed by Nipa et al. aims to develop a machine learning-based predictive model for the early diagnosis of diabetes [24]. A dataset of 1078 records was used, combining patient records obtained through a survey conducted in Bangladesh with the dataset from Sylhet Diabetes Hospital. Thirty-five different classification methods were evaluated using performance metrics such as accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (ROC-AUC). The results were interpreted using the Shapley additive explanations method to identify the features most influential in diabetes occurrence. The findings reveal that the Extra Trees (ET) classifier demonstrated the best performance, with a 97.11% accuracy rate on the Sylhet Diabetes Hospital dataset, while the Multi-Layer Perceptron (MLP) classifier yielded the best result, with a 96.42% accuracy rate, on the combined dataset.

In optimizing the hyperparameters of CNN algorithms, various studies have explored the effectiveness of different optimization techniques. Bochinski et al. [25] utilized GWO to optimize key hyperparameters such as layer count, filter count, neuron count, and initial layer values. Similarly, Baldominos et al. [26] applied GWO to fine-tune a comprehensive set of hyperparameters, including learning rate, optimization method, activation method, batch size, number of filters, filter size, number of neurons, and optimizer values within the CNN algorithm. In a related vein, Silva et al. [27] chose PSO to optimize hyperparameters such as filter size and number, batch size, and dropout rate, reporting performance gains using the PSO algorithm for determining CNN hyperparameters. Wang et al. [28] contributed to the exploration of optimization techniques by employing PSO to refine hyperparameters, including filter number and size, neuron count, and stride values in the CNN algorithm. Meanwhile, Mohakud and Dash [29] focused on GWO as their chosen method for optimizing CNN algorithm hyperparameters. Expanding beyond CNN algorithms, Kılıçarslan employed a diverse set of optimization algorithms, including PSO, Cat Swarm Optimization (CSO), and a hybrid approach combining PSO and GWO [30]. This hybrid approach was specifically applied for optimizing hyperparameters in the 1D-VGG-16 model.

Many researchers have stated that hybrid or ensemble learning models give better results than a single model [31], and within the scope of this study, classification is carried out using various ensemble learning methods. Another approach in the study is to use ensemble learning algorithms to reduce both the variance of the model and the dependence on any single model. Prediction accuracy can be increased by combining models that are trained multiple times and independently of each other. Ensemble approaches using multiple learning algorithms are an effective way to improve classification accuracy [32]. An overview of the study is given in Fig. 1. In this study,

  • Combining the PSO and GWO algorithms, the study introduces a hybrid optimization method for hyperparameter optimization. This strategy aims to improve the effectiveness and efficiency of hyperparameter optimization in machine learning models.

  • Using the collected dataset, multiple machine learning algorithms, including Random Forest, Gradient Boosting Machine, Extra Trees Classifier, Gaussian Naive Bayes, Logistic Regression, and Decision Tree, are employed to classify diabetes. These models are used as benchmarks to assess the performance of the proposed hybrid optimization method.

  • The research investigates various ensemble learning techniques, such as Boosting, Bagging, Super Learner, Stacking, Max Voting, and Average Voting, to enhance the classification accuracy and robustness of the models. These methods combine the predictions of multiple base classifiers to produce more accurate and trustworthy forecasts.

  • A comprehensive comparative analysis is conducted to identify the models and ensemble learning methods with the highest accuracy, precision, and F1 score performance. Included in the comparison are pertinent studies from the scholarly literature, allowing for a thorough evaluation of the proposed method versus existing cutting-edge techniques.

Fig. 1

An overview of the study methodology

The following is an overview of the article's structure. Section 2 describes the data collection, machine learning models, ensemble models, and optimization methods, together with the parameters of the developed ensemble learning applications and the evaluation metrics. Section 3 presents the experimental work and results of the machine learning classifiers, evaluated with metrics such as accuracy, recall, precision, and F1-score. Section 4 presents the discussion and suggests areas for future investigation. Finally, the conclusion is given in Sect. 5.

2 Material and methods

2.1 Dataset

The dataset used in the study was collected through a direct survey of patients at Sylhet Diabetes Hospital in Sylhet, Bangladesh, and is publicly available on the Kaggle platform [33]. Details of the dataset are given in Table 1. The diabetes symptom dataset comprises 16 attributes and a total of 520 instances; each attribute represents a distinct feature or variable related to diabetes symptoms. The attribute values in Table 1 are expressed as percentages.

Table 1 Details of dataset

2.2 Data preprocessing

Data preprocessing is crucial for preparing data for machine learning tasks. It involves transforming and cleansing the original data to guarantee its quality and compatibility with the selected algorithms. The following preprocessing steps were performed in this study:

  • Categorical attributes containing values such as "Yes/No" or "Positive/Negative" are converted into numerical values: "Yes" or "Positive" is assigned 1, and "No" or "Negative" is assigned 0.

  • The gender attribute is converted into a binary numeric format, with 1 representing male and 0 representing female.

  • The age attribute is provided as a range; to standardize it into a numerical format, the midpoint of the range is taken.

  • The class attribute, which indicates a patient's diabetes test result, is encoded so that 1 represents a positive result and 0 a negative result.

By applying these steps, the data are prepared for use with machine learning algorithms, ensuring compatibility and effective model training and evaluation.
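The encoding steps above can be sketched in plain Python. The field names ("Gender", "Polyuria", "Age", "class") and value strings used here are illustrative assumptions about the raw survey file, not a verified copy of the Kaggle schema:

```python
# Hypothetical encoder for one raw survey record; field names and value
# strings are assumptions for illustration.
BINARY = {"Yes": 1, "No": 0, "Positive": 1, "Negative": 0,
          "Male": 1, "Female": 0}

def encode_record(record):
    """Convert one raw survey record (a dict) into numeric features."""
    encoded = {}
    for key, value in record.items():
        if value in BINARY:
            encoded[key] = BINARY[value]            # Yes/No, Positive/Negative, Gender
        elif key == "Age" and isinstance(value, str) and "-" in value:
            lo, hi = value.split("-")
            encoded[key] = (int(lo) + int(hi)) / 2  # midpoint of the age range
        else:
            encoded[key] = value                    # already numeric, keep as-is
    return encoded
```

For example, `encode_record({"Gender": "Male", "Polyuria": "Yes", "Age": "40-49", "class": "Positive"})` yields `{"Gender": 1, "Polyuria": 1, "Age": 44.5, "class": 1}`.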

2.3 Machine learning

The discipline of machine learning, a subfield of artificial intelligence, is inspired by human learning processes. It employs mathematical and statistical techniques to draw conclusions and make predictions from data [4, 34]. By learning from existing datasets, machine learning algorithms can create models that address complex problems that may be difficult to solve using conventional programming techniques [35]. These models are capable of automatically enhancing their performance based on their acquired knowledge and data-driven insights.

2.3.1 Ensemble models

Ensemble learning is a machine learning approach where multiple classifiers are trained to solve the same problem. Unlike traditional machine learning methods that aim to learn a single hypothesis from training data, ensemble learning methods generate a collection of hypotheses and combine them for utilization [36]. The objective of any machine learning problem is to find the best model that can accurately predict the desired outcome. Instead of relying on a single model and hoping it is the most accurate estimator possible, ensemble methods consider multiple models and average their predictions to produce a final model. The goal of ensemble learning methods is to enhance generalizability and robustness by combining the predictions of several baseline estimators constructed using a specific learning algorithm. Common ensemble learning methods include bagging, boosting, voting, and stacking, as they offer specific advantages and have well-defined construction processes. The structure of ensemble models is illustrated in Fig. 2.

Fig. 2

Ensemble models

2.3.1.1 Bagging

The bagging algorithm, created by Breiman, is among the oldest, simplest, and most effective ensemble-based algorithms [37]. In the bagging algorithm, each model is trained independently and the results are then averaged. The primary objective of the bagging method is to achieve less variance than any individual model. The method, whose name is an abbreviation of Bootstrap Aggregation, draws different subsets of training data at random, with replacement, from the entire training dataset. The bagging ensemble learning method typically uses homogeneous weak learners, trains them independently in parallel, and combines them using a process such as averaging. It can also be effective when the dataset is noisy or when there is a high risk of overfitting [38, 39].
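As an illustration of the procedure (not the implementation used in the study), the bagging idea can be sketched in plain Python with a hypothetical one-dimensional decision stump as the weak learner:

```python
import random
from collections import Counter

def train_stump(X, y):
    """Weak learner: a 1-D decision stump chosen by exhaustive search."""
    best = None
    for t in set(X):
        for pol in (0, 1):  # polarity: which side of the threshold is class `pol`
            errors = sum((pol if x <= t else 1 - pol) != yi for x, yi in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, t, pol)
    _, t, pol = best
    return lambda x: pol if x <= t else 1 - pol

def bagging(X, y, n_models=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    # Aggregate by majority vote, which reduces the variance of a single stump
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]
```

On a separable toy set such as `bagging([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])`, the aggregated predictor recovers the class boundary even though each stump sees only a resampled subset.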

2.3.1.2 Boosting

Freund and Schapire created the boosting ensemble learning method as an iterative approach that combines weak learners to create a more accurate classifier [40]. Boosting typically considers homogeneous weak learners, adapts to them sequentially, and combines them according to a strategy. Boosting methods follow the same logic as bagging methods in that they construct a family of models whose combination produces a stronger learner with improved performance. However, whereas the primary objective of bagging is to reduce variance, boosting fits a large number of weak learners in a highly adaptive, sequential manner.
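The adaptive reweighting at the heart of AdaBoost can be sketched as follows; this is an illustration of the scheme with a hypothetical weighted 1-D stump, not the study's exact configuration:

```python
import math

def train_weighted_stump(X, y, w):
    """Weak learner: 1-D stump minimizing the *weighted* training error."""
    best = None
    for t in set(X):
        for pol in (0, 1):
            err = sum(wi for x, yi, wi in zip(X, y, w)
                      if (pol if x <= t else 1 - pol) != yi)
            if best is None or err < best[2]:
                best = (t, pol, err)
    return best

def adaboost(X, y, rounds=5):
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform example weights
    models = []
    for _ in range(rounds):
        t, pol, err = train_weighted_stump(X, y, w)
        err = max(err, 1e-10)              # avoid log(0) for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        models.append((t, pol, alpha))
        # Increase the weight of misclassified examples so the next stump
        # concentrates on them
        preds = [pol if x <= t else 1 - pol for x in X]
        w = [wi * math.exp(-alpha if p == yi else alpha)
             for wi, p, yi in zip(w, preds, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return models

def boosted_predict(models, x):
    """Weighted vote of all stumps, each vote scaled by its alpha."""
    score = sum(alpha * (1 if (pol if x <= t else 1 - pol) == 1 else -1)
                for t, pol, alpha in models)
    return 1 if score > 0 else 0
```

Each round's coefficient `alpha` grows as the weighted error shrinks, so accurate stumps dominate the final vote.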

2.3.1.3 Voting

In classification and regression problems, the voting algorithm is a well-known technique for effective ensemble learning. It requires creating multiple submodels, each of which makes predictions. The predictions of these submodels are then averaged (for regression) or taken as the mode (for classification), so that each submodel casts a vote [41].

Utilising majority voting, the voting ensemble learning method combines the predictions. There are three distinct varieties of majority voting. In the initial version, the class with which all classifiers concur is selected. In the second version, the majority is the class supported by at least one more than half of the classifiers. In the third version, the class with the most votes is chosen, regardless of whether the votes exceed 50% (majority or plurality voting) [42].

In the merging procedure, the voting ensemble learning method employs multiple techniques [43]. In the majority vote technique, each classifier makes a prediction, and a decision is accepted as the ensemble decision only if more than half of the classifiers agree on it; if no decision obtains more than half of the votes, no ensemble estimate is produced. In such circumstances, plurality voting, another combination technique, is more appropriate: the ensemble prediction is the most frequently predicted decision, without requiring it to receive more than 50% of the votes. However, if the individual classifiers have varying performances, it is more advantageous to use the weighted voting technique, which combines the classifiers according to performance-based weights, allowing for a more nuanced and accurate ensemble prediction.
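The three voting variants described above can be sketched in a few lines of Python (an illustrative sketch, not the study's implementation):

```python
from collections import Counter

def majority_vote(preds):
    """Return the winning label only if it has more than half the votes."""
    label, count = Counter(preds).most_common(1)[0]
    return label if count > len(preds) / 2 else None  # None: no ensemble decision

def plurality_vote(preds):
    """Return the most frequent label, even with less than 50% support."""
    return Counter(preds).most_common(1)[0][0]

def weighted_vote(preds, weights):
    """Scale each classifier's vote by its performance weight."""
    scores = {}
    for p, w in zip(preds, weights):
        scores[p] = scores.get(p, 0.0) + w
    return max(scores, key=scores.get)
```

Note that `majority_vote([0, 1, 2])` returns `None` (no label exceeds half the votes), whereas plurality voting always produces a decision, and weighted voting lets a single strong classifier outvote several weak ones.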

2.3.1.4 Stacking

Wolpert's stacking ensemble learning method, introduced in 1992, is a general approach that entails training a learner to combine individual learning models [44]. Stacking is a method for combining classification or regression models and is typically composed of two layers. The first layer consists of the models used to make predictions on the test datasets. The second layer consists of a meta-classifier or meta-regressor that generates new predictions using the predictions of the underlying models as input. The individual learning models are referred to as first-level learners, while the integrator responsible for combining them is referred to as a second-level learner or meta-learner.
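The two-layer data flow can be sketched as follows; the base models and meta-learner here are arbitrary callables used purely for illustration:

```python
def build_meta_features(base_models, X):
    """Level-0: each row of the meta dataset is the vector of base-model
    predictions for one example."""
    return [[model(x) for model in base_models] for x in X]

def stacked_predict(base_models, meta_model, x):
    """Level-1: the meta-learner combines the base predictions."""
    level0 = [model(x) for model in base_models]
    return meta_model(level0)
```

In practice the meta-features should be produced with out-of-fold predictions, so the meta-learner is never trained on outputs the base models produced for their own training examples.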

2.3.1.5 Super learner

The Super Learner, an ensemble learning model, was first introduced in the article "Super Learner" by Mark Van der Laan, Alan Hubbard, and Eric Polley of the University of California, Berkeley, in 2007 [45]. The Super Learner approach follows a two-part logic in its basic framework.

In the first phase, different learners are trained on the dataset. These learners may use different algorithms or learning approaches, for example linear regression, decision trees, support vector machines, and neural networks. In the second phase, the estimates of each learner are combined with a weighting chosen to provide the best estimate. The weights can be determined according to the performance of each learner or by more sophisticated methods; for example, majority voting can be used in classification problems and averaging of estimates in regression problems.
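A compact sketch of this two-phase logic for classification is given below, with each candidate learner weighted by its held-out accuracy. This is a simplification of the full cross-validated Super Learner, intended only to illustrate the fit-then-weight structure:

```python
def super_learner(fit_fns, X_tr, y_tr, X_val, y_val):
    """Phase 1: fit every candidate learner on the training split.
    Phase 2: weight each learner by its validation accuracy and combine
    them by weighted vote."""
    models = [fit(X_tr, y_tr) for fit in fit_fns]
    weights = [sum(m(x) == yi for x, yi in zip(X_val, y_val)) / len(y_val)
               for m in models]
    def predict(x):
        scores = {}
        for m, w in zip(models, weights):
            p = m(x)
            scores[p] = scores.get(p, 0.0) + w
        return max(scores, key=scores.get)
    return predict
```

A learner that generalizes poorly receives a low validation accuracy and therefore contributes little to the combined prediction.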

The study employs ensemble learning methods along with individual machine learning techniques for diabetes classification. Various classifiers, including K-Nearest Neighbour, Naive Bayes, Decision Tree, Deep Neural Networks, Random Forest, and Gradient Boosted Trees, are utilized in the model estimation process. The evaluation of the model considers not only the individual performances of these classifiers but also the performances of ensemble learning methods such as bagging, boosting, voting, and stacking.

2.4 Optimization algorithms

Optimization algorithms play a crucial role in fine-tuning machine learning models by searching for the optimal values of hyperparameters. Mirjalili et al. proposed a method based on the hybridization of two optimization algorithms: Particle Swarm Optimization (PSO) and Grey Wolf Optimization (GWO) [46]. By combining the strengths of these algorithms, the researchers aimed to improve the efficacy and efficiency of the hyperparameter optimization procedure.

PSO is a population-based optimization algorithm inspired by the flocking or schooling behaviour of birds and fish. Utilising a swarm of particles that iteratively explore the search space, the optimal solution is found. GWO is a nature-inspired algorithm that is based on the social hierarchy and hunting behaviour of grey wolves. It imitates the leadership dynamics between the alpha, beta, and delta wolves in order to direct the search for the global optimum. By combining PSO and GWO, the proposed algorithm capitalizes on the exploration and exploitation capabilities of both algorithms to conduct a more exhaustive search of the hyperparameter space. This hybrid strategy seeks to strike a balance between exploration to identify promising regions and exploitation to fine-tune hyperparameters for enhanced model performance.

The integration of PSO and GWO in the proposed algorithm offers a unique synergy that has the potential to improve the optimization process, resulting in better convergence, faster computation, and enhanced precision of the machine learning models. This hybridization technique has the potential to achieve better hyperparameter optimization results than either PSO or GWO alone.

2.4.1 PSO

PSO is a population-based optimization algorithm (Kennedy and Eberhart [47]) inspired by the collective behaviour of bird flocking or fish schooling. The algorithm is comprised of a swarm of particles that move through the search space, iteratively updating their positions based on their best-known position and the best-known position of the swarm as a whole. This social interaction and exchange of information between particles enables PSO to effectively explore and exploit the search space. Due to its simplicity and effectiveness, PSO has been widely utilized in numerous fields, including machine learning, optimization, and data mining. It has shown success in resolving complex optimization problems, such as feature selection, parameter optimization, and neural network training [48]. It is a popular choice for hyperparameter optimization in machine learning tasks due to the algorithm's ability to strike a balance between exploration and exploitation.

The formula for updating the position and velocity of each particle is as follows [49]:

$${v}_{i+1}= w\times {v}_{i}+{c}_{1}\times \text{rand}\times \left(p{\text{best}}_{i}-{x}_{i}\right)+{c}_{2}\times \text{rand}\times \left(g{\text{best}}_{i}-{x}_{i}\right)$$
(1)

\(p{\text{best}}_{i}\): the individual optimal value of the ith particle

\(g{\text{best}}_{i}\): the global optimal value

\({c}_{\text{1,2}}\): optimization coefficients

rand: random number between (0,1)

w: inertia weight parameter

$${x}_{i+1}={x}_{i}+{v}_{i+1}$$
(2)
$$w={w}_{max}-\frac{{w}_{max}-{w}_{min}}{T}\times t$$
(3)

T: the maximum number of iterations

t : the current number of iterations.
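Equations (1)–(3) translate directly into code. The sketch below minimizes a simple one-dimensional test function; the settings (c1 = c2 = 2, w decayed from 0.9 to 0.4) are common illustrative defaults, not the configuration used in the study:

```python
import random

def pso(f, lo, hi, n_particles=20, T=50, c1=2.0, c2=2.0,
        w_max=0.9, w_min=0.4, seed=0):
    rng = random.Random(seed)
    x = [rng.uniform(lo, hi) for _ in range(n_particles)]  # positions
    v = [0.0] * n_particles                                # velocities
    pbest = x[:]                                           # personal bests
    pbest_val = [f(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g], pbest_val[g]
    for t in range(T):
        w = w_max - (w_max - w_min) / T * t                # Eq. (3): inertia decay
        for i in range(n_particles):
            v[i] = (w * v[i]
                    + c1 * rng.random() * (pbest[i] - x[i])  # Eq. (1)
                    + c2 * rng.random() * (gbest - x[i]))
            x[i] += v[i]                                     # Eq. (2)
            val = f(x[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = x[i], val
                if val < gbest_val:
                    gbest, gbest_val = x[i], val
    return gbest, gbest_val
```

For a convex objective such as `pso(lambda x: x * x, -5, 5)`, the swarm contracts toward the minimum at x = 0; the global best can only improve, since it is replaced only when a strictly better position is found.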

2.4.2 GWO

GWO is a nature-inspired optimization algorithm that simulates the social hierarchy and hunting behaviour of grey wolves [50]. To guide the search for the global optimum, the algorithm imitates the leadership dynamics among the alpha, beta, and delta wolves. Each wolf in the search space represents a potential solution, and their positions are updated iteratively based on the alpha, beta, and delta positions. The algorithm's equilibrium between exploration and exploitation enables it to navigate the search space efficiently. GWO has demonstrated promising results in the resolution of multiple optimization problems, including feature selection, parameter tuning, and function optimization. It is a valuable tool for optimization tasks in machine learning and other domains due to its efficacy, simplicity, and rapid convergence [50].

For the purpose of simulating a leadership structure, the study uses a hierarchy of grey wolves composed of four types: alpha, beta, delta, and omega. The alpha wolf represents the best solution, while the beta and delta wolves represent the second- and third-best solutions, respectively; the omega wolves represent the remaining candidate solutions. The mathematical model incorporates the hunting behaviour of grey wolves, which entails searching for prey, encircling it, and attacking it. The leadership hierarchy and the hunting mechanism are combined to create the mathematical model [49].

The behaviour of wolves when they surround their prey can be expressed by Eqs. (4) and (5) as follows [49].

$$D=\left|C\times {X}_{p}\left(t\right)-X(t)\right|$$
(4)
$$X\left(t+1\right)={X}_{p}\left(t\right)-A\times D$$
(5)

t: the current iteration,

\({X}_{p}\): the position of the prey,

X: the position of a grey wolf.

The vector coefficients A and C can be computed using Eqs. (6) and (7).

$$A=a\times (2\times {r}_{1}-1)$$
(6)
$$C=2\times {r}_{2}$$
(7)

\(a\): linearly reduced from 2 to 0,

\({r}_{\text{1,2}}\): random variables in the interval 0 to 1

When a grey wolf locks onto its prey during the hunt, the positions of the wolves are updated with respect to the positions of the three best wolves. The update formulas are as follows:

$${D}_{\alpha }=\left|{C}_{1}\times {X}_{\alpha }-X(t)\right|,\quad {D}_{\beta }= \left|{C}_{2}\times {X}_{\beta }-X(t)\right|,\quad {D}_{\delta }= \left|{C}_{3}\times {X}_{\delta }-X(t)\right|$$
(8)
$${X}_{1}={X}_{\alpha }-{A}_{1}\times {D}_{\alpha },\quad {X}_{2}={X}_{\beta }-{A}_{2}\times {D}_{\beta },\quad {X}_{3}={X}_{\delta }-{A}_{3}\times {D}_{\delta }$$
(9)
$$X\left(t+1\right)= \frac{{X}_{1}+{X}_{2}{+X}_{3}}{3}$$
(10)
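Equations (4)–(10) can be sketched as follows for a one-dimensional objective; A and C follow Eqs. (6)–(7), and each wolf's new position averages the pulls toward the alpha, beta, and delta wolves. This is an illustrative sketch, not the study's implementation:

```python
import random

def gwo(f, lo, hi, n_wolves=20, T=50, seed=0):
    rng = random.Random(seed)
    wolves = [rng.uniform(lo, hi) for _ in range(n_wolves)]
    best_x, best_val = None, float("inf")
    for t in range(T):
        # Rank the pack: alpha, beta, delta are the three best wolves
        ranked = sorted(wolves, key=f)
        alpha, beta, delta = ranked[0], ranked[1], ranked[2]
        if f(alpha) < best_val:
            best_x, best_val = alpha, f(alpha)
        a = 2 - 2 * t / T                       # linearly decreased from 2 to 0
        for i in range(n_wolves):
            pulls = []
            for leader in (alpha, beta, delta):
                A = a * (2 * rng.random() - 1)  # Eq. (6)
                C = 2 * rng.random()            # Eq. (7)
                D = abs(C * leader - wolves[i])  # Eqs. (4), (8)
                pulls.append(leader - A * D)     # Eq. (9)
            wolves[i] = sum(pulls) / 3           # Eq. (10)
    return best_x, best_val
```

As `a` shrinks toward 0, |A| shrinks with it, so the pack shifts from wide exploration to tight exploitation around the leaders.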

2.5 Evaluation metrics

In this study, a variety of machine learning methods and ensemble learning models are employed for the analysis and classification of the data. The performance of these models is evaluated using various metrics, and the evaluation results, which include the confusion matrix, are presented in Fig. 3. To determine the efficacy of the classifiers and ensemble methods, performance metrics such as accuracy, weighted-average sensitivity, and weighted-average precision were used. These metrics provide insight into the overall accuracy of the models, as well as their ability to correctly identify positive instances (sensitivity) and to predict positive instances across different classes with high precision. Evaluating the classifiers and ensemble methods with these metrics allows for a better understanding of their respective strengths and their ability to handle the given classification task. The metric equations are given in Eqs. (11)–(14):

  • True Positive (TP) True Positive indicates instances where the model correctly predicts a positive condition (e.g. the presence of diabetes).

  • True Negative (TN) True Negative denotes instances where the model correctly predicts a negative condition (e.g. the absence of diabetes).

  • False Positive (FP) False Positive occurs when the model incorrectly predicts a negative condition as positive. In this case, it wrongly classifies a person without the disease as having the disease.

  • False Negative (FN) False Negative happens when the model incorrectly predicts a positive condition as negative. In this case, it wrongly classifies a person with the disease as not having the disease.

    $$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
    (11)
    $$\text{Recall}=\frac{TP}{TP+FN}$$
    (12)
    $$\text{Precision}=\frac{TP}{TP+FP}$$
    (13)
    $$F1\_\text{Score}=2\times \frac{\text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}$$
    (14)
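Equations (11)–(14) follow directly from the confusion-matrix counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (11)
    recall = tp / (tp + fn)      # Eq. (12): found positives / all real positives
    precision = tp / (tp + fp)   # Eq. (13): predicted positives that are correct
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (14)
    return accuracy, recall, precision, f1
```

For instance, with TP = 4, TN = 3, FP = 1, and FN = 2, the accuracy is 0.7, the recall 2/3, the precision 0.8, and the F1-score 8/11 ≈ 0.727.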
Fig. 3

Evaluation metrics

3 Experimental results

In the knowledge discovery process for predicting diabetes, the individual classification performances of the classifiers are first evaluated independently of any ensemble learning method. Many classifiers are applied in the model (Decision Tree, Gaussian Naive Bayes, Support Vector Classifier, Logistic Regression, Gradient Boosting Machine, Light Gradient Boosting Machine, Extra Trees Classifier, Random Forest, eXtreme Gradient Boosting, Multi-Layer Perceptron). Classifiers that provide suitably high performance are included in the model, and the results of all ensemble models used are given in this section. As the hardware and software environment for the study, a computer with an Intel Xeon processor, 128 GB RAM, and the Windows 10 operating system was used. For the classification process, the Anaconda distribution with Python 3.8.13 was used, with the Jupyter Notebook editor.

The dataset used in our research comprises 16 attributes, each representing different aspects related to diabetes symptoms. As defined in Sect. 2.1, these attributes were obtained from a direct survey conducted with patients at Sylhet Diabetes Hospital in Bangladesh. The features extracted from the dataset primarily include binary indicators related to diabetes symptoms (e.g. Polyuria, Polydipsia) and demographic information (e.g. Age, Gender). To increase the reliability of our research, we divided the dataset into training-validation (80%) and test (20%) sets. Additionally, the 5-fold cross-validation method is used in the study.
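The 80/20 split and 5-fold partitioning can be sketched as index bookkeeping (the study itself presumably relied on standard library utilities for this):

```python
import random

def train_test_split_indices(n, test_frac=0.2, seed=0):
    """Shuffle indices, then cut off the last `test_frac` share as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * (1 - test_frac)))
    return idx[:cut], idx[cut:]

def kfold_indices(n, k=5):
    """Split indices into k contiguous, near-equal validation folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds
```

For the 520-instance dataset this yields 416 training-validation and 104 test indices, and five folds of 104 indices each; during cross-validation, each fold serves once as the validation set while the remaining four are used for training.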

In our research, we initially obtained individual results for the Particle Swarm Optimization (PSO) and Grey Wolf Optimization (GWO) algorithms, both designed for hyperparameter optimization. These results are presented in Tables 2 and 3, showcasing the performance of each algorithm in isolation. The purpose of this presentation is to establish a baseline understanding of the efficacy of each algorithm on its own. By examining the isolated outcomes of PSO and GWO, we aim to gain insights into their individual strengths and weaknesses when applied to classification algorithms. This comparative analysis sets the stage for the subsequent step, in which we introduce a hybrid approach combining both PSO and GWO into a unified PSO+GWO algorithm. The intention is to assess how this hybridization influences the overall performance and optimization capabilities, providing a comprehensive understanding of its impact on classification algorithms. The presented tables serve as a reference point for evaluating the relative improvements or synergies achieved by the hybrid PSO+GWO algorithm in comparison to the standalone PSO and GWO approaches. 5-fold cross-validation was used to obtain the results.

Table 2 Accuracy results of classifiers with PSO algorithm
Table 3 Accuracy results of classifiers with GWO algorithm

Machine learning relies significantly on hyperparameter optimization, which involves determining the ideal set of hyperparameters for a given model in order to attain optimal performance. The benefits of the PSO-GWO hybrid optimization method over conventional optimization methods are described below.

In general, the PSO algorithm is effective at solving real-world problems, but it can become trapped in local optima, which limits its performance [51]. To address this issue, the proposed method combines PSO with the GWO algorithm, which serves as an additional mechanism for escaping local optima. In standard PSO, particles are occasionally sent to random positions with low probability so that diverse regions of the search space are explored [51]; however, this exploration strategy risks moving away from the global optimum. We exploit the search capability of the GWO algorithm to mitigate this risk: instead of sending particles to random locations, the GWO algorithm directs them to positions that have already been partially optimized by its own update rules. A flowchart of the hybrid PSO-GWO is shown in Fig. 4. By combining GWO with PSO, the overall balance between exploration and exploitation is improved, increasing the likelihood of locating the global optimum. It should be noted, however, that adding the GWO step increases the computational time required for optimization; whether this cost is acceptable depends on the complexity of the problem and the performance gains achieved. The hybrid PSO-GWO algorithm is given in Table 4.
The hyperparameters of the classifiers are listed in Table 5.

Fig. 4
figure 4

A flowchart of hybrid PSO-GWO

Table 4 Hybrid PSO-GWO algorithm
Table 5 Hyperparameters of classifiers
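The escape mechanism described above can be sketched generically. The snippet below is a minimal illustration of the hybrid idea: standard PSO velocity updates, with the low-probability random jumps redirected toward the three best ("alpha/beta/delta") positions in GWO style. The coefficients, the jump probability, and the toy objective (standing in for, e.g., one minus cross-validated accuracy) are all illustrative assumptions, not the exact procedure of Table 4.

```python
# Generic sketch of hybrid PSO-GWO: PSO updates, with GWO-style leader
# moves replacing random restarts. All constants are illustrative.
import numpy as np

def sphere(x):
    # Toy objective standing in for (1 - cross-validated accuracy).
    return float(np.sum(x ** 2))

def hybrid_pso_gwo(obj, dim=2, n_particles=20, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.array([obj(p) for p in pos])
    for t in range(iters):
        order = np.argsort(pbest_val)
        alpha, beta, delta = pbest[order[:3]]     # GWO leader positions
        gbest = pbest[order[0]].copy()
        a = 2 * (1 - t / iters)                   # GWO coefficient decays
        for i in range(n_particles):
            r1, r2 = rng.random(dim), rng.random(dim)
            vel[i] = (0.7 * vel[i]
                      + 1.5 * r1 * (pbest[i] - pos[i])
                      + 1.5 * r2 * (gbest - pos[i]))
            if rng.random() < 0.1:
                # GWO-style move toward the leaders instead of a random jump.
                pos[i] = (alpha + beta + delta) / 3 + a * rng.standard_normal(dim)
            else:
                pos[i] = pos[i] + vel[i]
            val = obj(pos[i])
            if val < pbest_val[i]:
                pbest_val[i], pbest[i] = val, pos[i].copy()
    best = np.argmin(pbest_val)
    return pbest[best], pbest_val[best]

best_x, best_val = hybrid_pso_gwo(sphere)
```

In a hyperparameter-tuning setting, each particle position would encode a classifier's hyperparameters and the objective would retrain and score the model.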

Ensemble learning methods, initially proposed to reduce the high variance of machine learning methods and increase accuracy, quickly proved successful in addressing common problems encountered in machine learning, such as feature selection, handling missing features, error correction, estimating confidence intervals, and dealing with imbalanced data and classes [42]. In general, ensemble learning refers to methods that combine predictions from multiple machine learning models to achieve higher accuracy and better performance than any individual model alone. By training multiple machine learning models on the data, predictions with higher performance than those of a single model can be achieved [52]. In these models, rather than combining the classifiers themselves, the focus is on combining the predictions obtained by the classifiers to generate a shared prediction [52].

To ensure an accurate assessment of our model's performance and mitigate the risk of overfitting, we adopted the 5-fold cross-validation method for data division. In this approach, the dataset is divided into five equal parts, referred to as folds. During cross-validation, each fold is used in turn as the test set, while the remaining four folds are combined to form the training set. This process is repeated five times, with each fold serving as the test set exactly once; as a result, every instance in the dataset is used for testing, ensuring a comprehensive evaluation. The scheme of the cross-validation method is given in Fig. 5.
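The 5-fold scheme above maps directly onto scikit-learn's cross-validation utilities; the classifier and synthetic data below are placeholders for any of the models in this study.

```python
# 5-fold cross-validation as described in the text, on placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Each of the five folds serves as the test set exactly once; the mean of
# the five fold scores is the reported value.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
mean_accuracy = scores.mean()
```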

Fig. 5
figure 5

Scheme of cross-validation method

3.1 Machine learning algorithms

In this study, the dataset is first classified using machine learning methods, and the performance evaluation results are shown in Table 6. Each value in the table is the result of the 5-fold cross-validation process, obtained by averaging the values from the individual folds. The models with the highest average performance on the test data in Table 6 are the Random Forest, Gradient Boosting Machine, and Extra Tree Classifier algorithms. The confusion matrices and classification reports of these algorithms are given in Fig. 6, and the ROC graph is given in Fig. 7.

Table 6 Evaluation metrics of machine learning algorithms
Fig. 6
figure 6

Confusion matrix and classification report of (a) Random Forest, (b) Gradient Boosting Machine, and (c) Extra Tree Classifier

Fig. 7
figure 7

ROC curve of model

3.2 Boosting ensemble models

As the second method in the study, models are created with the boosting ensemble method, which is applied to the dataset for the classification of diabetes. The classifiers given in Table 7 are included in the boosting ensemble learning method and their classification performances are evaluated. Each value is obtained by averaging the values from the individual folds of 5-fold cross-validation. According to the findings in Table 7, the Random Forest Classifier gives the highest performance with the boosting ensemble learning method: its accuracy is 0.979, its precision is 0.987, and its F1-score is 0.979. The Random Forest classifier also provides the highest performance in the specificity metric. When the data in Table 7 are examined, the Extra Tree Classifier model gives the shortest test time, and the Extra Tree Classifier and Random Forest Classifier models give the shortest training time, both with a value of 0.0784. The confusion matrices and classification reports of the three models with the highest performance are shown in Fig. 8. The ROC graph of all models used for boosting ensemble learning is given in Fig. 9.
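One common realisation of boosting is AdaBoost, which re-weights training samples so that each new weak learner focuses on the examples its predecessors misclassified. The snippet below uses scikit-learn's AdaBoostClassifier with its default decision-stump base learner on synthetic data; this is an illustrative assumption, not the paper's exact boosting configuration.

```python
# Boosting ensemble sketch: AdaBoost with its default decision-stump base
# learner, evaluated with 5-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 16 features, as in the diabetes dataset.
X, y = make_classification(n_samples=300, n_features=16, random_state=0)

boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
acc = cross_val_score(boosted, X, y, cv=5).mean()
```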

Table 7 Evaluation metrics of boosting ensemble models
Fig. 8
figure 8

Confusion matrix and classification report of (a) Random Forest, (b) Decision Tree, and (c) Extra Tree Classifier

Fig. 9
figure 9

ROC curve of models

3.3 Bagging ensemble models

The bagging ensemble learning method is another method applied to the dataset for diabetes classification. The performances of the classifiers used in the bagging ensemble learning method are shown in Table 8. Each performance value is obtained by 5-fold cross-validation: the value from each fold is accumulated and the average is taken after all folds are completed. When Table 8 is examined, the highest performance belongs to the Gradient Boosting Machine Classifier model. The model that completes training in the shortest time is the Gaussian Naive Bayes Classifier, and the model that completes the test phase in the shortest time is the Logistic Regression classifier. The Gradient Boosting Machine classifier, which provides the highest performance, has an accuracy of 0.979, a precision of 0.990, and an F1-score of 0.940.
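Bagging trains each base learner on a bootstrap resample of the data and aggregates their votes at prediction time. A minimal sketch with scikit-learn's BaggingClassifier (using its default decision-tree base learner, an assumption here) on synthetic data:

```python
# Bagging ensemble sketch: 25 base learners, each fit on a bootstrap
# resample, evaluated with 5-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=16, random_state=0)

bagged = BaggingClassifier(n_estimators=25, random_state=0)
acc = cross_val_score(bagged, X, y, cv=5).mean()
```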

Table 8 Evaluation metrics of bagging ensemble models

The training time of the Gaussian Naive Bayes classifier, which completed training in the shortest time, is 0.1843, and the test time of the Logistic Regression classifier, which completed the test phase in the shortest time, is 0.0090. The confusion matrices and classification reports of the three best-performing classifiers in the bagging ensemble learning method are given in Fig. 10. The ROC graph of all models is given in Fig. 11.

Fig. 10
figure 10

Confusion matrix and classification report of (a) Gradient Boosting Machine, (b) Decision Tree, and (c) Random Forest

Fig. 11
figure 11

ROC curve of models

3.4 Super learner ensemble models

Another method applied to the dataset for the classification of diabetes is the Super Learner ensemble learning method. In this method, each classifier model is trained on the dataset and its performance is obtained. Performance values are calculated by averaging the values obtained at each fold of 5-fold cross-validation. The performance values of all classifiers under this method are given in Table 9. When the findings in Table 9 are examined, the highest performance and the fastest training and test stages belong to the Random Forest Classifier, which has an accuracy of 0.981, a precision of 0.990, a sensitivity of 0.976, an F1-score of 0.981, a training time of 0.711, and a test time of 0.5279. The confusion matrices and classification reports of the three models with the highest accuracy in the Super Learner ensemble learning method are given in Fig. 12, and the ROC graph of all classifiers used with the method is given in Fig. 13.
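The core mechanic of a super learner is that out-of-fold predictions from the base models become the input features of a meta-learner, so the meta-learner never sees a base model's predictions on its own training folds. A compact sketch, with placeholder models and data:

```python
# Super learner sketch: out-of-fold base-model probabilities feed a
# logistic-regression meta-learner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=16, random_state=1)
base_models = [RandomForestClassifier(n_estimators=50, random_state=1),
               GaussianNB()]

# cross_val_predict yields each base model's out-of-fold probabilities.
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
meta_model = LogisticRegression().fit(meta_X, y)
```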

Table 9 Evaluation metrics of super learner ensemble models
Fig. 12
figure 12

Confusion matrix and classification report of (a) Gradient Boosting Machine. (b) Extra Tree Classifier and (c) Random Forest

Fig. 13
figure 13

ROC curve of models

3.5 Stacking ensemble models

The stacking ensemble learning method is also applied to the dataset used to classify diabetes. The classifiers used with this method and their performance evaluation results are given in Table 10. Each performance value in the table is obtained by averaging the values from the individual folds of 5-fold cross-validation. When the results in Table 10 are examined, the highest performance in the stacking ensemble learning method belongs to the Extra Tree Classifier, the shortest training time belongs to the Gaussian Naive Bayes classifier, and the shortest test-phase time belongs to the Logistic Regression Classifier. The Extra Tree classifier, which provides the highest performance, has an accuracy of 0.979, a precision of 0.981, a sensitivity of 0.983, and an F1-score of 0.979.
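Stacking can be expressed compactly with scikit-learn's StackingClassifier, which handles the cross-validated base predictions internally; the base models, meta-estimator, and data below are illustrative choices, not the paper's exact setup.

```python
# Stacking ensemble sketch: two base classifiers feed their
# cross-validated predictions to a logistic-regression meta-estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=16, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)

stack = StackingClassifier(
    estimators=[("et", ExtraTreesClassifier(n_estimators=50, random_state=2)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),
    cv=5,
)
acc = stack.fit(X_tr, y_tr).score(X_te, y_te)
```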

Table 10 Evaluation metrics of stacking ensemble models

The training time of the Gaussian Naive Bayes Classifier, which has the shortest training time, is 0.1021, and the test-phase time of the Logistic Regression Classifier, which has the shortest test phase, is 0.0020. The confusion matrices and classification reports of the three models with the highest performance in the stacking ensemble learning method are given in Fig. 14. The ROC graph of all models is given in Fig. 15.

Fig. 14
figure 14

Confusion matrix and classification report of (a) Gradient Boosting Machine, (b) Extra Tree Classifier and (c) Random Forest

Fig. 15
figure 15

ROC curve of models

3.6 Max voting ensemble models

The max voting ensemble learning method is another method applied to the dataset to classify diabetes. With this method, each classifier is trained and its performance is examined; the classifiers used with the method are given in Table 11 together with their performance values. Each performance value is calculated with the 5-fold cross-validation method, with the performances at each fold obtained and averaged. When the performances in Table 11 are examined, the Random Forest Classifier gives the highest performance, with an accuracy of 0.981, a precision of 0.990, and an F1-score of 0.981. The Decision Tree Classifier has the shortest training time, 0.0082, and the Light Gradient Boosting Machine classifier has the shortest test time, 0.0032. The confusion matrices and classification reports of the three models with the highest performance using the max voting ensemble learning method are given in Fig. 16, and the ROC graph of all classifiers used with the method is given in Fig. 17.
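Max voting corresponds to hard majority voting: each classifier casts one vote for a class label and the majority label wins. A minimal sketch with scikit-learn's VotingClassifier on placeholder data (the member models are illustrative):

```python
# Max (hard) voting sketch: the majority class label across three
# classifiers is returned, evaluated with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=3)

vote = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=3)),
                ("dt", DecisionTreeClassifier(random_state=3)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard",
)
acc = cross_val_score(vote, X, y, cv=5).mean()
```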

Table 11 Evaluation metrics of max voting ensemble models
Fig. 16
figure 16

Confusion matrix and classification report of (a) Gradient Boosting Machine, (b) Extra Tree Classifier and (c) Random Forest

Fig. 17
figure 17

ROC curve of models

3.7 Average voting ensemble models

The average voting ensemble learning method is also used for the classification of diabetes on the dataset. With this method, the classifiers are trained and their performances are obtained; the results are given in Table 12. Each performance value in Table 12 is obtained by 5-fold cross-validation of the average voting ensemble model with the corresponding classifier, averaging the performances obtained at each fold. As Table 12 shows, the highest performance belongs to the Random Forest Classifier, which has an accuracy of 0.981, a precision of 0.990, a sensitivity of 0.976, and an F1-score of 0.981. With a training time of 0.0116, the Decision Tree has the shortest training time, and the Light Gradient Boosting Machine classifier has the shortest test time, 0.0026. The confusion matrices and classification reports of the three models with the highest performance under the average voting ensemble learning method are shown in Fig. 18, and the ROC graph of all classifiers used with the method is given in Fig. 19.
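Unlike hard voting, average (soft) voting averages the members' predicted class probabilities and picks the class with the highest average. This can be shown directly with numpy; the probability values below are hypothetical, not model outputs from this study.

```python
# Average (soft) voting sketch on hypothetical class-probability outputs
# of three models for three samples (columns: class 0, class 1).
import numpy as np

p_rf  = np.array([[0.2, 0.8], [0.6, 0.4], [0.9, 0.1]])
p_gbm = np.array([[0.3, 0.7], [0.4, 0.6], [0.8, 0.2]])
p_lr  = np.array([[0.1, 0.9], [0.7, 0.3], [0.7, 0.3]])

# Average the probabilities, then pick the class with the highest average.
avg = (p_rf + p_gbm + p_lr) / 3
pred = avg.argmax(axis=1)   # → array([1, 0, 0])
```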

Table 12 Evaluation metrics of average voting ensemble models
Fig. 18
figure 18

Confusion matrix and classification report of (a) Gradient Boosting Machine, (b) Extra Tree Classifier and (c) Random Forest

Fig. 19
figure 19

ROC curve of models

The results given in Table 13 demonstrate the classification performance of each model under the various ensemble learning techniques, with the highest values shown in bold. The Soft Model depicts the performance of each individual model without the use of ensemble techniques. The ensemble learning methods include Boosting, Bagging, Super L. (Super Learner), Stacking, Max Voting, and Average Voting. Random Forest has the highest accuracy rate among the models, 0.980, followed by Multi-Layer Perceptron with 0.978. These models performed consistently well across multiple ensemble learning methodologies. Compared with the other models, the accuracy of the Gaussian Naive Bayes model is lower, indicating its limitations in reliably classifying diabetes. The results demonstrate that ensemble learning methods improve classification performance; the Super Learner method, which combines multiple classifiers, provided consistently superior performance across multiple models. These results contribute to the field of diabetes diagnosis and demonstrate the predictive power of ensemble learning techniques. Exploring additional ensemble learning techniques and optimising hyperparameters could improve classification performance even further.

Table 13 Comparison of model accuracy with hybrid PSO+GWO

In comparing the performance of the three optimization methods (PSO, Table 2; GWO, Table 3; and the hybrid PSO+GWO, Table 13), it is evident that each method exhibits distinct strengths across the classification models. In the PSO results, certain models, such as Random Forest and Gradient Boosting, show high accuracy, while others, like Logistic Regression and Light Gradient Boosting, demonstrate relatively lower performance. Similarly, GWO yields competitive results, excelling in some models such as Decision Tree and Gaussian Naive Bayes, but underperforming in others. The hybrid PSO+GWO, on the other hand, consistently delivers robust accuracy across a diverse set of models, outperforming both PSO and GWO individually. This hybrid approach capitalizes on the complementary strengths of PSO and GWO, showcasing a balanced and effective optimization strategy. The superior performance of the hybrid method suggests its potential as a reliable optimization technique for diverse classification tasks, highlighting the significance of integrating multiple optimization algorithms for enhanced model accuracy and generalization.

The Friedman test, outlined in [53], is a statistical technique designed for comparing multiple group measures without making assumptions about their distribution. Its aim is to evaluate whether these group measures demonstrate significant differences in variability, essentially examining the null hypothesis that they share the same variance up to a certain level of significance. In [54], the assessment of machine learning algorithms across diverse datasets is conducted, utilizing statistical tests including the Wilcoxon signed ranks test and Friedman test [55]. In this particular study, the models employed undergo the Friedman test, and the findings are presented in Table 14.

Table 14 The Friedman test results for average voting model

The Friedman test results for the average voting model, under which the models generally achieve the highest accuracy rates, are presented in Table 14. A p value of 1.831e-06 is found for the Friedman test, indicating that the differences between the models are statistically significant. The Friedman test is employed to assess differences between the measures used to compare model performance; the results include the mean and standard deviation of the model performances, as well as their minimum and maximum values. These findings indicate that the Random Forest model achieved the highest average accuracy (0.980637). The performance of the other models is also compared and presented in tabular form. The observed results provide insight into which model performs best on this dataset and may serve as a resource for researchers working on model selection and performance evaluation.
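The Friedman test compares matched measurements (here, per-fold performance of several models) by ranking the models within each fold. A minimal sketch with SciPy; the per-fold accuracy values below are illustrative, not the paper's measurements.

```python
# Friedman test sketch on hypothetical per-fold accuracies of three
# models over five cross-validation folds.
from scipy.stats import friedmanchisquare

rf  = [0.98, 0.97, 0.99, 0.98, 0.98]
gbm = [0.96, 0.95, 0.97, 0.96, 0.97]
gnb = [0.90, 0.89, 0.91, 0.90, 0.88]

# A small p value rejects the hypothesis that all models perform
# equivalently across the folds.
stat, p = friedmanchisquare(rf, gbm, gnb)
```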

In comparison to the earlier studies listed in Table 15, our research attained the highest accuracy, 98.10%. This demonstrates that the ensemble learning model proposed in this study, which utilizes multiple classifiers and combination techniques, outperforms the models examined in the literature. The results demonstrate the efficiency of ensemble learning in improving the accuracy of diabetes diagnosis. In addition, our research offers valuable insights into the potential of ensemble learning methods to enhance the performance of diabetes classification models. It is important to note that the dataset size, feature set, and population characteristics used in each study may vary, which can affect comparisons of accuracy rates. Nevertheless, our study demonstrates a substantial advancement in the field by obtaining an exceptionally high accuracy in diagnosing diabetes using the provided dataset.

Table 15 Comparison of studies in literature

4 Discussion

In this study, we investigated the application of ensemble learning methods for diabetes classification using a dataset collected from patients at Sylhet Diabetes Hospital in Sylhet, Bangladesh. We utilized popular machine learning algorithms such as Random Forest, Gradient Boosting Machine, Extra Tree Classifier, Gaussian Naive Bayes, and Logistic Regression within various ensemble learning frameworks. One significant contribution of this research is the utilization of a hybrid method, Particle Swarm Optimization-Grey Wolf Optimization (PSO-GWO), for hyperparameter optimization. When comparing the three optimization methods—Particle Swarm Optimization (PSO), Grey Wolf Optimization (GWO), and the hybrid PSO+GWO—we observed distinct strengths and performance patterns across various classification models. PSO demonstrated high accuracy in specific models like Random Forest (PSO: 0.980) and Gradient Boosting (PSO: 0.975), while GWO excelled in Decision Tree (GWO: 0.969) and Gaussian Naive Bayes (GWO: 0.887). However, the hybrid PSO+GWO consistently delivered robust accuracy across a diverse set of models, outperforming both PSO and GWO individually. For instance, in the hybrid approach, Random Forest achieved an accuracy rate of 0.981, surpassing the individual rates of PSO (0.980) and GWO (0.9769). This hybrid approach capitalizes on the complementary strengths of PSO and GWO, showcasing a balanced and effective optimization strategy. The superior performance of the hybrid method suggests its potential as a reliable optimization technique for diverse classification tasks, emphasizing the importance of integrating multiple optimization algorithms for enhanced model accuracy and generalization. Hyperparameters play a crucial role in the performance of machine learning models, and optimizing them effectively is essential for achieving accurate and robust results. 
The PSO-GWO hybrid method combines the strengths of both the PSO and GWO algorithms, enabling an efficient and effective search of the hyperparameter space. By applying this hybrid optimization technique, we were able to enhance the performance of the classifiers and obtain improved results for diabetes classification. The results of the performance evaluation indicate that the Random Forest classifier consistently exhibited high accuracy, precision, and F1-scores across multiple ensemble learning methods, highlighting the effectiveness of ensemble learning in combining multiple classifiers to improve overall classification performance. The Gradient Boosting Machine and Extra Tree Classifier also displayed strong performance, underlining their potential as valuable tools in diabetes classification tasks. In the Boosting ensemble learning method, the Random Forest classifier exhibited excellent performance, achieving an accuracy of 0.979, a precision of 0.987, and an F1-score of 0.979, while the Extra Tree Classifier and Random Forest Classifier models displayed the shortest training time, both recording a value of 0.0784. The Bagging ensemble learning method showed that the Gradient Boosting Machine Classifier achieved the highest performance, with an accuracy of 0.979, a precision of 0.990, and an F1-score of 0.940; the Gaussian Naive Bayes Classifier and Logistic Regression classifier had the shortest training and test times, respectively. In the Super Learner ensemble learning method, the Random Forest Classifier outperformed the other classifiers in terms of accuracy (0.981), precision (0.990), F1-score (0.981), training time (0.711), and test time (0.5279). Similarly, the Extra Tree Classifier excelled in the Stacking ensemble learning method, exhibiting high accuracy (0.979), precision (0.981), and F1-score (0.979).
The Gaussian Naive Bayes Classifier had the shortest training time (0.1021), while the Logistic Regression Classifier recorded the shortest test time (0.0020). The Max Voting ensemble learning method indicated that the Random Forest Classifier achieved the highest performance with an accuracy value of 0.981, precision value of 0.990, and F1-score of 0.981. Moreover, the Decision Tree Classifier exhibited the shortest training time (0.0082), while the Light Gradient Boosting Machine classifier showcased the shortest test time (0.0032). Similarly, in the Average Voting ensemble learning method, the Random Forest Classifier delivered the highest performance in terms of accuracy (0.981), precision (0.990), and F1-score (0.981). The Decision Tree classifier recorded the shortest training time (0.0116), while the Light Gradient Boosting Machine classifier demonstrated the shortest test time (0.0026).

This study demonstrates the effectiveness of ensemble learning methods in diabetes classification. Future works could contribute to the generalization of these findings by testing them on larger datasets and expanding them to different disease classifications. Additionally, investigating more advanced hyperparameter optimization techniques and exploring new ensemble learning methods could be important for obtaining more precise results in diabetes diagnosis and treatment.

5 Conclusions

In conclusion, this study demonstrated the effectiveness of ensemble learning methods for diabetes classification using a dataset from Sylhet Diabetes Hospital, Bangladesh. Our comprehensive evaluation of Particle Swarm Optimization (PSO), Grey Wolf Optimization (GWO), and the hybrid PSO+GWO algorithm in the context of classification tasks has provided valuable insights into their respective performances. The individual analyses revealed distinctive strengths and weaknesses for PSO and GWO across various models; however, the hybrid PSO+GWO consistently demonstrated superior accuracy, outperforming both PSO and GWO in most cases. The application of the PSO-GWO hybrid method for hyperparameter optimization contributed to the improved performance of the classifiers, with the Random Forest, Gradient Boosting Machine, and Extra Tree Classifier algorithms consistently achieving high accuracy, precision, and F1-scores. The findings of this research have implications for both academic and practical applications in the field of diabetes classification. Ensemble learning methods, especially when coupled with advanced hyperparameter optimization techniques, offer robust and accurate models for assisting healthcare professionals in diabetes diagnosis and treatment decision-making. Future research should focus on further refining and expanding hybrid optimization methods to enhance the robustness and efficiency of ensemble learning models for diabetes classification.
Exploring additional features, such as genetic information and lifestyle factors, could contribute to the development of more personalized and accurate diabetes classification models. Moreover, the study opens avenues for investigating the interpretability of ensemble models, as understanding the decision-making processes of these models is crucial for their acceptance and application in real-world medical scenarios. Furthermore, the generalizability of the proposed approach to diverse datasets and populations should be explored to ensure its applicability across different healthcare settings. Additionally, attention should be directed towards addressing challenges related to data privacy and security, especially when implementing ensemble learning models in clinical practice. Continued collaboration between machine learning researchers and healthcare professionals is essential to bridge the gap between technological advancements and practical healthcare needs, ultimately advancing the field of diabetes diagnosis and treatment decision-making.