Introduction

Chronic illness is a disease or condition that is ongoing or the effects of which are permanent [1, 2]. However, the consequences of this type of disease can have various negative effects on the quality of life, and a large part of the national budget is spent on chronic diseases [3, 4]. Diabetes is one of those chronic diseases that pose a major health risk and the number of medical causes of death is increasing every year, making it one of the biggest problems in emerging and developed countries [5, 6] High blood sugar levels have been linked to diabetes. The hormone insulin, which causes glucose from the food that enters the body to enter the bloodstream, is produced by one of the types of beta cells in the pancreas. Diabetes is caused by a lack of this hormone [7]. This disease can increase thirst, hunger, heart disease, kidney disease, etc., and even lead to the death of the patient [8, 9]. Diabetes can be divided into two types: type 1 and type 2. Patients with type 1 diabetes are often young and most of them are under 30 years old. Increasing thirst and blood sugar levels as well as frequent urination are common clinical signs [10]. In type 2, middle-aged and elderly people are more prone to the disease and are usually associated with obesity, hypertension, dyslipidemia, atherosclerosis, and other health problems. Drugs alone are not enough to treat this type of diabetes, and insulin injections are also essential [11, 12]. However, no long-term cure has been discovered for this disease, but it can be controlled with early diagnosis and prognosis in the early stages of the disease, and in the later stages of the disease, treatment can be much easier. Therefore, the prediction of diabetes has become a controversial topic for study and research [10,11,12,13].

In recent years, significant advancements have been made in developing and publishing various methods for predicting diseases, including but not limited to diabetes, Covid-19, and other illnesses [14, 15]. Simultaneously, the rapid progress of machine learning models has led to their widespread utilization in numerous applications, particularly in the medical field, for the accurate diagnosis of diverse diseases [13,14,15,16]. In general, machine learning models aim to describe and predict data and can help people make early judgments about disease based on their physical condition and diagnose the disease in its early stages until treatment is complete [14]. Amit Kishor and Chinmay Chakraborty have proposed a cutting-edge healthcare model based on machine learning techniques, aiming to enhance the accuracy and timeliness of diabetes diagnosis. In this model, they employed a set of five machine learning classifiers, including logistic regression, K-nearest neighbor, Naive Bayes, random forest, and support vector machine. Furthermore, to improve the model's performance, they utilized the fast correlation-based filter feature selection method to eliminate irrelevant features and applied the artificial minority oversampling technique to address imbalanced datasets [17]. Zou et al. [18] employed J48 decision tree, RF, and ANN models to predict diabetes from a hospital examination data set in Luzhou, China. To ensure the wide applicability of their methods, the authors selected the top-performing techniques for independent empirical tests. Chen and Pan [19] conducted a comprehensive study to predict diabetes using various machine learning methods and identify the most efficient and accurate model. The study utilized a dataset with 520 instances and 17 features, including polyuria, gender, age, sudden weight loss, polydipsia, polyphagia, weakness, irritability, genital thrush, itching, vision blurring, muscle stiffness, alopecia, and obesity. The authors compared the performance of eight classification algorithms, including Support Vector Classifier (SVC), Gaussian Naive Bayes (GNB), Random Forest (RF), Decision Tree Classifier (DTC), Logistic Regression (LR), Extra Tree Classifier (ETC), K-Nearest Neighbors (KNN), and XGBoost (XGB), and found that Extra Tree Classifier (ETC) achieved the highest accuracy of 98.55%. These results demonstrate that ETC is the most efficient and accurate machine-learning classification technique for diagnosing diabetes based on the mentioned parameters. Zhu et al. [20] proposed an innovative approach for diabetes prediction by combining PCA and K-Means techniques, resulting in a highly effective and well-clustered dataset. The model consists of three components: principal component analysis, K-Means clustering, and logistic regression, along with data standardization. The experimental outcomes demonstrated that PCA significantly enhanced the accuracy of the K-Means clustering method and the logistic regression classifier, with K-Means accurately classifying 25 data points and boosting the logistic regression accuracy by 1.98%, thereby surpassing earlier findings. Lukmanto et al. [21] utilized fuzzy support vector machines and F-exponential feature selection for the identification and classification of diabetes. Feature selection was employed to extract useful characteristics from the dataset, and the output was classified using the fuzzy inference method. The dataset was trained using SVM, resulting in an impressive accuracy of 89.02% when applied to the PIMA Indian Diabetes dataset. Furthermore, the approach employed an optimal number of fuzzy rules, maintaining an excellent level of accuracy. Raja et al. [22] proposed a novel data mining technique for type 2 diabetes prediction, combining particle swarm optimization (PSO) and fuzzy clustering (FCM) to create a highly efficient predictive model. The approach was evaluated through experiments on the PIMA Indian diabetes dataset, utilizing metrics for accuracy, sensitivity, and specificity. The results demonstrated that the proposed model outperformed other methodologies, exhibiting an 8.26% increase in accuracy compared to the other methods. Khanam et al. [23] employed seven machine learning and neural network methods to predict diabetes on the PIMA diabetes dataset. The authors created various neural network models with different numbers of hidden layers for different periods. The experimental results revealed that the models of Logistic Regression (LR) and Support Vector Machine (SVM) were highly effective in predicting diabetes and that neural networks with two hidden layers achieved an impressive accuracy of 88.6%. Rajendra et al. [24] conducted experiments on the PIMA diabetes dataset, comparing logistic regression algorithms and ensemble learning techniques for diabetes prediction. The study demonstrated that logistic regression is one of the most efficient techniques for creating prediction models and that employing feature selection, data pre-processing, and integration strategies can significantly enhance the accuracy of the model. Rawat et al. [25] conducted comparative studies on the PIMA diabetes dataset, utilizing machine learning methods such as Naive Bayesian (NB), Support Vector Machine (SVM), and Neural Network. The experimental results revealed that the neural network achieved the highest accuracy amongst all classifiers, with an impressive accuracy of 98%. The neural network method was deemed the most effective in the early detection of diabetes. Zhou et al. [26] proposed a diabetes prediction model based on Boruta feature selection and ensemble learning, which utilized unsupervised clustering of data using the K-Means +  + algorithm and stacking an ensemble learning method for classification. The model was validated on the PIMA Indian diabetes dataset, achieving an incredibly high accuracy rate of 98%, surpassing other diabetes prediction models, and highlighting its superior performance in diabetes prediction. The incorporation of Boruta feature selection and ensemble learning in this model provides a promising approach for accurate diabetes diagnosis and treatment. Shilpi et al. [27] utilized two common boosting algorithms, Adaboost.M1 and LogitBoost, to establish machine learning models for diabetes diagnosis based on clinical test data from a total of 35,669 individuals. The experimental results demonstrated that the LogitBoost classification model outperformed the Adaboost.M1 classification model, achieving an impressive overall accuracy of 95.30% with tenfold cross-validation. The authors concluded that these boosting algorithms exhibit excellent performance for diabetes classification models based on clinical medical data. The significant discriminating factors between diabetic and general populations obtained from the process of selecting preferred test items can be used as reference risk factors for diabetes mellitus. The model is also robust and has a degree of pre-diagnosis function, given that the coefficient matrix of the original data is a sparse matrix due to missing test results, some of which are directly related to disease diagnosis.

The fast-paced advancements in machine learning have shown promising results in the early detection of diseases. However, developing an accurate prediction model for diagnosing diabetes remains challenging. This is due to several factors such as the limited availability of labeled data, frequent occurrence of incomplete or missing values in the dataset, and the imbalanced nature of the dataset. These issues make it difficult to achieve optimal performance and necessitate the development of novel techniques to address them. In this article, we present an innovative pipeline-based framework for predicting diabetes in three classes of peoples (diabetic, non-diabetic, and pre-diabetic) using the Iraqi Patient Dataset for Diabetes patients (IPDD), that's an imbalanced data set. Preprocessing is an important part of the proposed framework to achieve a high-quality result. Which includes several steps, such as removing duplicate samples, filling in missing values, normalizing and standardizing data, selecting relevant features, and performing k-fold cross-validation to ensure high-quality data. Consulted with a specialist to complete missing attribute values using the k-NN Imputation method. We implemented various machine learning models, including MLMs, k-NN, SVM, DT, RF, AdaBoost, and GNB, and used Bayesian optimization and grid search techniques to find the optimal hyper-parameters for MLMs. As our dataset is imbalanced, accuracy evaluation alone is not being a good measure of model evaluation, and we therefore used the Area Under the ROC Curve (AUC) as an additional measure to evaluate the models. To evaluate the models. Under the same experimental conditions and dataset, multiple experiments are performed with different preprocessing combinations and machine learning models to maximize the area under the curve of predicting diabetes. Then optimal machine learning model is then used as the baseline model for the proposed framework to optimally predict diabetes. We then proposed an ensemble machine learning model (EMLM) with a combination of MLMs to improve the prediction accuracy and AUC of diabetes diseases. Combining different models can help address the weaknesses of using single models when the dataset is imbalanced. To combine machine learning models, we used the AUC with the One-Vs-One (OVO) multiclass classification approach as the weight for this EMLM. The AUC in the proposed EMLM is unbiased to the class distribution, and therefore, we chose it as the model's weight in the voting ensemble rather than accuracy. We performed many experiments with different combinations of MLMs to obtain the optimal set of EMLMs by applying the optimal preprocessing from previous experiments.

Feature selection and dimensionality reduction are of utmost importance in disease diagnosis research, as they enable the construction of a model with a reduced number of features. Such a model is simpler, less time-consuming for training and testing, and particularly powerful in disease prediction. In this study, we evaluated the efficacy of the MRMR feature selection method, as well as the PCA and ICA dimensionality reduction methods, in identifying the most important features affecting the target variable, i.e., the factors that have the most significant impact on the determination of class.

Our proposed framework demonstrates high accuracy and AUC in predicting diabetes, and we believe that it can be applied to other patient populations as well. By utilizing these methods, we have developed a model with fewer features, which has resulted in increased accuracy in diabetes prediction. Our study provides valuable insights into the development of a more effective diabetes prediction model, which can significantly contribute to the recovery of diabetic patients.

In conclusion, our research highlights the importance of feature selection and dimensionality reduction in disease diagnosis research, and underscores the potential of these techniques in developing accurate and efficient prediction models. Our findings can serve as a valuable resource for researchers and practitioners in the field of diabetes diagnosis and treatment, and can pave the way for future studies aimed at improving the accuracy and efficacy of diabetes prediction models.

This paper is organized as follows. "Methods" section deals with the methods used in this research, particularly the flowchart of the proposed model in section. "Machine learning models" section deals with the machine learning models, particularly the proposed EMLM in section. "Evaluation metrics" section describes the evaluation measures (metrics) of the proposed models in this article, and "Results" section describes the results of the various experiments conducted. Finally, discussions are offered in "Discussion and future work" section.

Methods

Dataset

The Iraqi Patient Dataset for Diabetes (IPDD) [28] was obtained from 1000 samples, including 565 males and 435 females aged 20–79 years old, during in-hospital physical examinations at the Specialized Center for Endocrinology and Diabetes-Al-Kindy Teaching Hospital in Iraq. This dataset is divided into three regions: Diabetic (Y) with 837 samples, Non-Diabetic (N) with 103 samples, and Predicted Diabetic (P) with 53 samples. These include 11 physical examination indicators. Table 1 lists the attribute descriptions, and the distribution of each attribute in the dataset is shown in Fig. 1, where green, blue, and yellow color distributions denote the diabetic, non-diabetic, and predicted diabetic classes, respectively.

Table 1 Overview of the Iraqi Patient Dataset on Diabetes (IPDD)
Fig. 1
figure 1

The Iraqi Patient dataset for Diabetes (IPDD) dataset population distribution of all attributes, with green, blue, and yellow color distributions denoting diabetic (Y) individuals, non-diabetic (N), and predicted diabetic (P) classes, respectively

Proposed framework

The framework proposed in this study is shown in Fig. 2, where the pre-processing of raw data is a crucial stage in the pipeline, as data quality can directly influence the training of the classifiers.

Fig. 2
figure 2

shows a potential model for reliable and automatic diabetes prediction

Data preprocessing

According to the proposed framework for predicting diabetes disease in this study (Fig. 2), data preprocessing is the first and most important step because it can improve data quality and the resulting data quality can have a direct impact on learning classification models. Preprocessing steps in the proposed framework include Removing Duplicate Samples, Converting Attributes, filling in values are missing or null (I), Normalization (N), Standardization, (Z), and attribute feature selection, and which briefly described as follows:

  • Remove Duplicate Samples In this study, after examining all 1000 data samples, we concluded that seven of these samples were completely identical and thus were removed, leaving 993 samples.

  • Attribute Conversion Because, in this study's data set, the values of the Gender attributes and the Class label are qualitative (non-numerical) values, for use in models, we use Eqs. (1) and (2), convert the values of these attributes to the numerical values.

    $$Gender \left( a \right) = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {if\;a = female} \hfill \\ {1,} \hfill & {if\;a = male} \hfill \\ \end{array} } \right.$$
    (1)
    $$Class\;Lable \left( C \right) = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {if\;C = Y\;\left( {Diabetic} \right)} \hfill \\ {1,} \hfill & {if\;C = N\;\left( {Non{ - }diabetic} \right)} \hfill \\ {2,} \hfill & {if\;C = P\;\left( {Diabetic\;predicted} \right)} \hfill \\ \end{array} } \right.$$
    (2)
  • Handling Missing or Null Values Missing or null values are values that may lead to incorrect predictions or inference for each class in the classification [29, 30], so here, a little number of attribute values were completing (missing values) with the consultation of a doctor specializing in endocrinology and metabolism using the k-NN Imputation method. The result of this process is shown in Fig. 3.

  • Normalization Data normalization is a critical factor in improving the performance of machine learning algorithms [31]. Normalization helps reduce bias from features with high numerical contributions, ensuring fair consideration of each variable during the learning process. It also enhances numerical stability, reduces training time, and facilitates meaningful feature comparisons. Because some continuous attributes in the data have a wide range of values, this can have a considerable impact on the performance of the classifier. To convert the range of continuous features to a [0,1] interval, we utilize the min–max normalization [32] as shown in Eq. (3).

    $$N\left( {x_{ij} } \right) = \frac{{x_{ij} - x_{jmin} }}{{x_{jmax} - x_{jmin} }}$$
    (3)

    where the original attribute and the converted attribute are equal to \({x}_{ij}\) and \(N\left({x}_{ij}\right)\).

  • Standardization The Z-score [33] is a method of standardization used to convert continuous attribute numerical values into standard scores, with a mean of zero and a standard deviation of one. The Z-score standardization formula is shown in Eq. (4).

    $$Z\left( {x_{ij} } \right) = \frac{{x_{ij} - \overline{x}_{j} }}{{\sigma_{j} }}$$
    (4)

The mean and standard deviation of the \(j\)th attribute is equal to \({\overline{x} }_{j}\) and \({\sigma }_{j}\) respectively.

Fig. 3
figure 3

Data set after Filling in missing or null values

k-fold cross-validation

K-fold Cross-Validation (kCV) is a statistical approach used to evaluate and compare the effectiveness of classifiers in machine learning algorithms. It splits data into two parts: one for training a model and the other for validation or testing [34]. In kCV, the data is separated into k equal (or nearly equal) segments or folds. Thereafter, k iterations of training and validation are performed, each iteration using a different fold of the data for validation and the remaining k-onefold for training [35, 36]. In the inner loop, where the hyper-parameter optimization algorithms (Bayesian optimization and grid search) [37,38,39] were applied, the hyper-parameters were trained and fine-tuned using the four folds. The test data is used to evaluate the model in the outer loop using the optimal hyper-parameters found in the training step, which are repeated five times (Fig. 4).

Fig. 4
figure 4

The partitioning of the IPDD dataset for kCV for both the hyper-parameters tuning and for the evaluation

Feature selection

Feature selection strategies can help reduce the number of attributes and avoid using redundant features. Various methods for feature selection and dimensionality reduction are available. To reduce dimensionality and feature selection in this study, we used PCA and ICA and Maximum Relevance Minimum Redundancy (MRMR). In this section, we describe the PCA and ICA as well as MRMR methods. In addition, the codes of these methods are included in Additional file 1: Appendix 1A.

PCA (principal component analysis)

Principal component analysis (PCA) [40] is a mathematical approach for reducing the dimensionality of data by finding and preserving the most variation in the data set by defining directions called Principal Components. Instead of a large number of variables, each sample may be represented by a few components [18].

ICA (independent component analysis)

The observed multivariate data, often presented as a large database of samples, is transformed into a generative model by ICA [41, 42]. The data variables in the model are viewed as linear mixtures of unknown latent variables, as is the mixture system. The independent components of the observed data are expected to be non-Gaussian and mutually independent latent variables. ICA can locate these separate components, often referred to as sources or factors. PCA and factor analysis have a superficial relationship with ICA. When these traditional methods fail, ICA is a far more effective method that can uncover the underlying causes or sources. The FastICA algorithm was used in this study [43, 44].

MRMR (minimum redundancy maximum relevance)

Features should have the maximum Euclidean distances or have pairwise correlations as low as possible. The Minimum Redundancy Maximum Relevance (MRMR) standards, such as maximal mutual information and target phenotypes, are frequently used to supplement minimum redundancy norms. The advantages of this can be obtained in two ways. Firstly, the MRMR feature set might have a more representative target phenotype for greater generalization with the same number of features. Secondly, we can effectively cover the same space with a smaller MRMR feature set as we can with a larger regular feature set by using a smaller MRMR feature set [45,46,47,48].

Machine learning models

Single models

In this research, we used different classification Machine Learning Models (MLMs) such as k-NN [49, 50], multi-support vector machine Multi-Class SVM [51], DT [52], RF [53], Multi-Class AdaBoost [54,55,56], and GNB [57] to train and test the proposed framework. Because class boundaries may overlap, multi-classification may perform worse than binary classification. There are two methods for extending binary classification algorithms to the multi-class mode studied: The One-Vs-One (OVO) approach and the One-Vs-All (OVA) technique. However, empirical studies show that the OVO approach performs better results than the OVA approach [58]. Because our problem in this study is a multiclass classification, we used the multiclass classification OVO approach [59], all of which are pseudo-codes in Additional file 1: Appendix 1B.

After selecting the MLM, we optimized the model hyper-parameters (see Table 2) using Bayesian optimization and grid search hyper-parameter optimization methods for our desired problem.

Table 2 shows various MLMs with hyper-parameters that can be tuned in the internal loop using optimization approaches

The proposed ensemble of machine learning models

Within the communities of computational intelligence and machine learning, multiple classifier systems, often known as ensemble systems, have received much press. In a wide range of problem areas and real-world applications, ensemble systems have proven to be incredibly successful and versatile; thus this focus is well-deserved [60,61,62]. The Ensemble of the Machine Learning Model (EMLM) is a well-known approach for improving performance by grouping a set of classifiers (especially here where the data is imbalanced) [63, 64]. The aggregation of the output from various models can improve the precision of the prediction [39]. For an ensemble of MLMs, the output of each MLM is a function \({Y=f}_{j}: X\to {R}^{C}\) that assigns \(C\) confidence values \({P}_{i}\in {\mathbb{R}}\) to an unseen test sample \(X\), where \({P}_{i}^{e}\in \left[\mathrm{0,1}\right]\) for \(i=\mathrm{1,2},\dots , C\), and \({\sum }_{i=1}^{C}{P}_{i}=1\), and \(j=\mathrm{1,2},3,\dots ,m\) is the number of MLMs. (In our case, the values \({P}_{1}, {P}_{2}, \mathrm{and} {P}_{3}\) indicate confidence values for each MLM). In this study, we utilized a weighted aggregation model in which the sum of individual confidences is calculated as:

$$P_{i}^{EMLM} = \frac{{\mathop \sum \nolimits_{j = 1}^{5} \left( {A_{j} \times P_{ij} } \right)}}{{\mathop \sum \nolimits_{i = 1}^{3} \mathop \sum \nolimits_{j = 1}^{5} \left( {A_{j} \times P_{ij} } \right)}}, \;\;i = 1,2,3$$
(5)

where \({P}_{ij}\) denotes the \({MLM}_{j}\) confidence that \(X\) (an unseen test sample) belongs to class \({\mathrm{c}}_{\mathrm{i}}\) and \({\mathrm{A}}_{\mathrm{j}}\) is the weight corresponding to the OVO approach AUC of that jth \(\mathrm{MLM}\). The normalization term is used to convert values \({P}_{i}^{EMLM}\) to a \(\left[\mathrm{0,1}\right]\) interval for \(i=\mathrm{1,2},3\) such that \({\sum }_{i=1}^{3}{P}_{i}^{EMLM}=1\). The ensemble model's output, \(Y\in {\mathbb{R}}^{C}\), has \({P}_{i}^{EMLM}\) confidence values. Finally, the unseen test sample \(X\) belongs to the class with the maximal probability; That is, \({C}_{i}\), if \({P}_{i}^{EMLM}=max(Y=f(X))\).

Evaluation metrics

In this study, we used various metrics for evaluating the Multiclass Classification Model Evaluation to measure the performance of the MLMs [65, 66], as shown in Table 3.

Table 3 The summary of all performance evaluation measure metrics of Multiclass classification models

\(TN\) stands for True Negative, and it refers to the number of cases that have been correctly diagnosed as negative. \(TP\) stands for True Positive, and it refers to the number of positive examples that have been correctly detected. The letters \(FP\) stand for False Positive, which refers to the number of real negative cases classified as positive; \(FN\) stands for False Negative, which refers to the number of genuine positive examples classified as negative; and k stands for the total number of classes. Rather than giving absolute results, we used Receiver Operating Characteristics (ROC) and the Area Under the ROC Curve (AUC) to determine how well predictions are assessed.

Results

Table 4 illustrates the quantitative results for choosing the most effective preprocessing and machine learning model, with the average accuracy (AAC) and standard deviation presented for comparison. Each model's ability to achieve the optimal AAC using the proposed pipeline is summarized in Table 4, along with the effective preprocessing and feature selection technique, as well as the number of attributes chosen. The optimally tuned hyper-parameters using hyper-parameter optimization methods are also shown in Table 4.

Table 4 illustrates the optimal-performing MLM and preprocessing, as well as tuned hyper-parameters with the highest AAC

Tables 5, 6, 7 and 8 show that using appropriate preprocessing can improve the outcomes of various models.

Table 5 Results of the first experiment in terms of AAC for MLMs, using all features in different modes of pre-processing combination
Table 6 Results of the second experiment in terms of AAC for MLMs, using the MRMR feature selection algorithm in different modes of preprocessing combination
Table 7 The results of the second experiment in terms of AAC for MLMs, using the PCA dimensionality reduction algorithm in different cases of pre-processing combination
Table 8 The results of the second experiment in terms of AAC for MLMs, using the ICA algorithm in different cases of pre-processing combination

As one can see, we used four different and integrated experiments to compare classification MLMs (Tables 5, 6, 7 and 8). When the eleven features are used, all classifiers show optimal performance for missing values filling (I) and standardization (Z) (Table 5). According to Table 5, the first experiment shows that the RF classifier outperforms other classifiers when using all features with different combinations of data pre-processing. With I + N pre-processing, the k-NN classifier exhibits the highest AAC compared with other pre-processing methods for this classifier. Similarly, the SVM, DT, GNB, and AB classifiers achieve the highest AAC using I + Z pre-processing, surpassing other pre-processing techniques.

In the second experiment (Table 6) and using MRMR (with 10 features) to select the features of the k-NN classifier in the I + N pre-processing case, it had the highest AAC compared to other pre-processing's for this classifier. In this experiment, the SVM classifier also had the highest AAC compared to other pre-processing's and feature selection by MRMR for this algorithm by selecting 10 features by MRMR and using I + N. The DT, GNB, and AB classifiers also respectively had the highest AAC value with I + Z, I + Z, and I pre-processing and the number of features equal to 10, 10, and 8 in this experiment compared to other pre-processing's and the number of features selected by MRMR. However, for the RF classifier, in this experiment, the use of I + N pre-processing 10 features obtained the highest AAC value compared to all the different states of this experiment and was chosen as the optimal classifier in this experiment.

According to Table 7, as you can see, using PCA for dimensionality reduction in three cases with explained variance equal to 90%, 95%, and 98%, and with all different cases of pre-processing did not improve the AAC value compared to other experiments. Thus, in this data, it is not appropriate to use PCA to reduce the dimension (in the third experiment).

Finally, in the fourth experiment, using ICA with five components and I + Z pre-processing, different classifiers were optimized in terms of performance, and AAC values were more optimized than in other experiments. Among all the classifiers, GNB could perform optimally in terms of AAC performance (Table 8). Because GNB is sensitive to the distribution shape and assumes that data is derived from a normal distribution, we observe in all experiments that, after outlier standardization, the performance of AAC of this classification also increases (Tables 5, 6, 7 and 8).

Figure 5 visually compares the optimal results of each of the four experiments. Figure 5a shows that, in the first experiment with 11 features, the RF classifier with I + Z pre-processing performed more successfully than in other cases. Part b of Fig. 5 also shows that, in the second experiment, the RF classifier and the selection of 6, 8, and 10 features using MRMR and I + N pre-processing performed better in terms of AAC than in other cases. Part c also shows the results of the third experiment, which used PCA to reduce the dimension, that the GNB classifier with PCA, a variance value of 98%, and I + Z pre-processing provided better performance than other cases. Finally, part d also shows the results of using ICA, and in this case, the GNB classifier with ICA, a number of components of 5, and pre-processing I + Z had the highest AAC value compared to other cases.

Fig. 5
figure 5

Comparison of different MLMs in terms of AAC in 4 experiments. a Comparison of different MLMs in terms of AAC using all features. b Comparison of different MLMs in terms of AAC using MRMR feature selection. c Comparison of different MLMs in terms of AAC and PCA. d Comparison of different MLMs in terms of AAC and ICA

Optimal pre-processing from Table 4 is used in this experiment. The combination of the six MLMs provides several EMLMs. Table 9 shows the optimal performing EMLM with two, three, four, five, and six baseline models, as well as their outcomes. As shown in Table 9, the combination of K-NN, AB, DT, RF, and I pre-processing yields the highest results for diabetes prediction across all performance evaluation measures. The AUC bar chart of the optimal EMLMs is shown in Fig. 6a. The curve of Fig. 6b shows the optimal EMLM model in terms of AUC.

Table 9 Comparative performance of ensemble machine learning models on various micro-averaging measures
Fig. 6
figure 6

The Comparative performance of different proposed a MLMs, and b the ROC curve of our optimal proposed EMLM. a The Comparative performance of different proposed EMLMs. b The ROC curve of our optimal proposed EMLM

Finally, as shown in Table 10, the combination of K-NN, AB, DT, and RF gives the highest results for predicting diabetes in all performance evaluation criteria compared to the only models presented on the PIDD dataset (the Hybrid model presented by Soukaena Hassan et al. [67]). Additionally, the bar charts shown in Fig. 7 graphically compare these two models with each other in terms of accuracy evaluation criteria.

Table 10 Comparison of diabetes prediction models in terms of ACC performance criteria
Fig. 7
figure 7

Comparison of diabetes prediction models in terms of ACC performance criteria

Discussion and future work

Diabetes is a chronic condition that significantly impacts individuals' quality of life, underscoring the critical need for accurate prediction methods in its management and prevention. In our study, we delve into the analysis and interpretation of results obtained from our ensemble machine learning models, which were designed to predict diabetes using the IPDD dataset. We also explore the implications of our findings, discuss the limitations of our study, and provide recommendations for future research.

The primary contribution of this research lies in introducing a introduces, pipeline-based framework of multi-class machine learning models for diabetes prediction. The framework utilizes the IPDD dataset, which encompasses three distinct groups: diabetic subjects (Y), non-diabetic subjects (N), and predicted diabetic subjects (P). The innovative nature of this framework lies in its ability to effectively classify individuals into these categories, thereby enhancing our understanding of diabetes prediction. This approach addresses the multi-class classification problem and ensures a comprehensive evaluation of performance by employing various evaluation metrics to assess the effectiveness of our proposed models. Data pre-processing plays a vital role in enhancing the accuracy and efficiency of predictive models. In our proposed model, we utilized several pre-processing techniques, such as filling missing values, standardization, normalization, feature selection, and dimensionality reduction. These techniques were implemented to meticulously prepare the data, improve model performance, and mitigate the impact of incomplete or inconsistent data. The results of our study emphasize the significance of data pre-processing in achieving accurate predictions for diabetes. By leveraging the collective intelligence of multiple individual classifiers, our ensemble approach demonstrates its effectiveness through improved overall performance and accuracy in diabetes prediction. This approach addresses biases and errors inherent in individual classifiers, which is particularly important given the challenges posed by imbalanced data and missing attribute values in diabetes prediction. Our experiments consistently showed that the random forest model, in combination with the MRMR and I + N stages of data pre-processing, outperformed other models. This highlights the importance of feature selection and dimensionality reduction techniques in enhancing diabetes prediction accuracy. The utilization of MRMR feature selection and PCA/ICA dimensionality reduction methods enables the identification of key features that significantly impact class determination. Furthermore, combining K-NN, AB, DT, and RF models with 11 features and I pre-processing exhibited superior performance in predicting diabetes in the IPDD dataset. This underscores the significance of employing a diverse set of machine learning models in an ensemble approach to enhance prediction accuracy. By harnessing the strengths of these models, we achieve more robust and reliable predictions. It is important to note that the evaluation of our models was not solely based on accuracy due to the imbalanced nature of the dataset. Instead, we employed multiple evaluation measures, including the Area Under the ROC Curve (AUC), to provide a comprehensive assessment of model performance. AUC is particularly suitable for imbalanced datasets as it considers the trade-off between true positive rate and false positive rate, offering a more accurate representation of the model's predictive power. Despite yielding promising results, our study has certain limitations. Firstly, the IPDD dataset used in our research may possess inherent biases and limitations that could affect the generalizability of our findings to other populations. Future studies should consider incorporating datasets from diverse patient populations to validate the effectiveness of our proposed models. Secondly, while we employed various data pre-processing techniques, there may be alternative approaches that could further optimize the performance of our models. Exploring alternative pre-processing techniques and comparing their efficacy could be a valuable avenue for future research. Ensemble models have their limitations, including increased model complexity, longer training and testing times, and the requirement of comprehensive data for model construction and configuration. Additionally, the interpretation of results from these models can be challenging due to their complexity across different datasets, potentially leading to inconclusive outcomes. Therefore, prior to utilizing these models, a meticulous examination and in-depth analysis of their features, data size, and other aspects are imperative.

In conclusion, our research demonstrates the potential of ensemble machine learning models, along with comprehensive data pre-processing techniques, in accurately predicting diabetes using the IPDD dataset. The results highlight the importance of feature selection and dimensionality reduction in improving prediction accuracy. Our proposed models offer a promising approach to diabetes prediction by addressing challenges posed by imbalanced data and missing attribute values. The findings of this study contribute to the field of diabetes diagnosis and treatment, providing valuable insights for researchers and practitioners. Our future research will focus on validating our models with larger and more diverse datasets, investigating additional preprocessing techniques to enhance the performance of diabetes prediction models, as well as exploring novel methods for early detection of COVID-19 disease and applications in mobile computing and manufacturing for comprehensive early disease diagnosis.