Background

Advances in research and technology have revolutionised medicine, resulting in improved health outcomes for complex diseases and enhanced longevity. However, there is still much to be achieved in preventing and controlling diabetes mellitus (DM), and its burden has been far reaching [1]. From developing to developed countries, the disease affects 1 in 11 people, with over 400 million people living with DM worldwide [1]. It is estimated that DM prevalence will rise from 415 million in 2015 to 640 million in 2040, while 232 million people are not even aware of their status [1, 2].

A significant proportion of people with diabetes (> 90%) have type II diabetes mellitus (T2DM), a condition characterised by the inability to control plasma glucose due to insulin insufficiency or resistance [1]. T2DM renders individuals unproductive [3, 4], disables them [5] and leaves patients and their families financially impoverished through life-long spending on medical and hospital bills [6]. Sadly, its effects are not experienced by the affected alone; they also weigh heavily on the global economy [2]. The American Diabetes Association has even stated that if the current trends of diabetes persist, the economic cost of diabetes will reach $2.1 trillion by 2030 [2].

Despite its widespread implications, there is still no cure for the disease [1, 7]. Current treatments only provide relief by modifying disease-associated symptoms. Meanwhile, the long latency period of the disease allows it to be targeted and tackled before the condition becomes irreversible [8,9,10]. In this latency period, the timing of detection and the accuracy of diagnosis are crucial for ensuring predictive, preventive and personalized medicine (PPPM) [11].

Defined as the medical practice that systematically predicts the onset of chronic disease long before its clinical manifestation, PPPM has the potential to shape treatment in time and inform optimal therapies [8, 12]. Further, PPPM is beneficial in multiple ways, including 1) delaying the onset of a chronic disease, 2) guiding the design of targeted drugs, 3) establishing the efficacy, potency and adverse effects of drugs on patients, 4) stratifying patients and 5) preventing disease-associated complications [8, 12,13,14,15]. However, before the concept of PPPM can be operationalised, there is a need to recognise the risk factors that are associated with human lifestyles and how these factors influence cardiometabolic health.

The majority of large-scale studies have shown that the antecedents of T2DM are age [16], obesity [17], physical inactivity [17, 18], unhealthy diet [19,20,21], high blood pressure [22,23,24], high plasma glucose and high cholesterol levels [25]. With this knowledge, risk estimation scores have been developed, including the Framingham Cardiovascular Disease (CVD) risk score [26] and the Systematic Coronary Risk Evaluation (SCORE) established by the European Society of Cardiology [27]. While these scores can signal whether an individual will develop a disease, they are built from simple models with few variables and fail to account for complex interactions [28]. Other studies have explored the use of the Suboptimal Health Status Score as a predictor of cardiometabolic diseases [10, 29, 30]. These studies have largely relied on traditional logistic regression and multivariate regression models to make predictions. Although beneficial, the reliance on conventional regression is short-sighted, given that such models provide only modest information about the interactions between predictors. In addition, logistic regression may be less computationally demanding, but it does not provide optimal predictions when there are nonlinear interactions between factors [31] or when there is an imbalance in the number of cases and controls [32]. To overcome these limitations, there is a need for a more advanced predictive tool such as machine learning (ML). ML does not make statistical assumptions, such as normality, collinearity, linearity or nonlinearity, when building a predictive model. It has proven robust for building predictive models and is used in domains as diverse as education, health and business.

ML relies on algorithms that learn from observations or features and create models [33, 34]. Based on these observations, ML scans for patterns, highlights the complex interactions between the predictors and, ultimately, optimizes the performance of predictors. Moreover, ML displays better discriminatory power [35], operates with less concern for the data distribution [36], handles multidimensional data and creates models from big data that can be utilized for real-time association analysis [37, 38]. Given its ability to transform data into meaningful information, its application is now seen in medicine. When applied to clinical data, ML learns the patterns of patients' health trajectories, can review patient charts and detects subclinical abnormalities in several chronic diseases, including coronary artery disease, cardiovascular disease, rheumatoid arthritis and T2DM [39].

Due to the rapid generation of big clinical data and the quest for accurate predictions, interest in ML has increased dramatically [32, 40,41,42,43,44,45]. For example, Lai et al. (2019) [33] used a Gradient Boosting Machine (GBM) and logistic regression to predict the onset of diabetes in a Canadian population. The study revealed an area under the receiver operating characteristic curve (AROC) of 84.7% with 71.6% sensitivity for GBM, and 84.0% with 73.4% sensitivity for logistic regression. While this study offered novel insights, its outcome cannot be generalized. Other ML techniques have also been used elsewhere. Zou et al. [44] used random forest (RF), a neural network and a decision tree (DT) to predict diabetes in a population in Luzhou. However, the study could only identify which algorithm was superior to the others and was not able to adequately predict diabetes due to limited indices and imbalanced data [44].

Utilising four ML methods, namely k-nearest neighbours (KNN), multifactor logistic regression, multifactor dimensionality reduction and support vector machines (SVM), Farran et al. (2013) reported classification accuracies of 85% for diabetes and 90% for hypertension. Sneha and Gangil (2019) explored the performance of several ML algorithms, including RF, Naïve Bayes (NB), KNN, DT and SVM, developing a predictive model on a diabetes dataset for each algorithm [46]. Of the fifteen (15) attributes in the dataset, ten (10) were found, through a feature selection technique, to produce an optimal predictive model, and the researchers generalized the selection of optimal features from the dataset to improve classification accuracy. Their results found the DT and RF algorithms to have the highest specificities, at 98.20% and 98.00%, respectively.

T2DM arises from the interplay between genetic and environmentally acquired factors, including diet and race or ethnicity. Hence, the present study uses four ML algorithms, 1) NB, 2) SVM, 3) KNN and 4) DT, to identify predictors of T2DM in an ethnically distinct population in Ghana. Moreover, this study ranks the order of importance of the various attributes in the diabetes dataset.

Methodology

Methods and study design

Recruitment of patients was based on a purposive sampling approach, whereby T2DM patients who visited Komfo Anokye Teaching Hospital (KATH) for their medications were asked to participate. After this, we used a convenience sampling approach to recruit healthy individuals from three popular suburbs within the Kumasi metropolis.

Ethics approval

The study was approved by the Committee on Human Research, Publication and Ethics (CHRPE) of the Kwame Nkrumah University of Science and Technology (KNUST), Ghana, and the Human Research Ethics Committee (HREC) of Edith Cowan University (ECU). Each participant signed an informed consent form prior to participating in the study.

Anthropometric examination

Aided by a standard sphygmomanometer (Omron HEM711DLX, UK), blood pressure measurements (systolic blood pressure (SBP) and diastolic blood pressure (DBP)) were taken and recorded. Body fat was estimated by the Body Mass Index (BMI), calculated as BMI = weight (kg)/height (m)². The waist-to-height ratio (WHtR) was calculated as waist (cm)/height (cm).
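As a minimal illustration (with hypothetical values), these two indices reduce to one-line computations in Python:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight (kg) divided by the square of height (m)."""
    return weight_kg / height_m ** 2

def whtr(waist_cm: float, height_cm: float) -> float:
    """Waist-to-height ratio: waist (cm) divided by height (cm)."""
    return waist_cm / height_cm

# Hypothetical participant: 70 kg, 1.65 m tall, 88 cm waist
print(round(bmi(70, 1.65), 1))   # 25.7
print(round(whtr(88, 165), 2))   # 0.53
```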

Clinical data

Fasting blood samples were taken from the antecubital vein of each participant into gel separator, EDTA and fluoride oxalate coated tubes. Serum lipids, comprising total cholesterol (TC), high-density lipoprotein cholesterol (HDL-c), low-density lipoprotein cholesterol (LDL-c) and triglycerides (TG), were measured using an automated chemistry analyser (Roche Diagnostics, COBAS INTEGRA 400 Plus, USA). On the same instrument, glycated haemoglobin (HbA1c) in the EDTA tubes and fasting blood sugar (FBS) in the fluoride tubes were also measured.

Inclusion and exclusion criteria

Cases

T2DM patients who had been clinically assessed by a medical doctor were invited to participate. Those identified as having type I diabetes mellitus or on any form of insulin treatment were excluded. Of the 253 T2DM patients, 34 were excluded because of missing information; thus, 219 participants were included in the final analysis.

Controls

Participants diagnosed with diabetes and/or hypertension were excluded, as were those with digestive, respiratory or genitourinary disorders. In the end, 219 healthy individuals were included.

The mean age was 56.54 ± 9.89 years for the cases and 55.10 ± 9.27 years for the controls. Females outnumbered males (61.4% females in controls and 57.3% in cases), but the difference was not statistically significant (p = 0.80). Most participants were educated and employed. T2DM patients were primarily sedentary compared with controls, but there was no statistical difference in BMI between the groups. Generally, T2DM patients had higher FBS, HbA1c and HDL-c than controls. The controls had higher SBP and DBP, while WHtR, TC, TG and LDL-c were not statistically different between the groups (Table 1).

Table 1 Demographic information of T2DM patients and healthy controls

Experiment

Data Pre-processing and Feature selection

Figure 1 shows the process model for the classification in this work. As indicated in the figure, the data for each attribute are numeric, with differing forms or scaling. The steps in the process model comprise cleaning, scaling, feature selection, train/test validation, classifier model building and evaluation. With this dataset, there were no issues with class imbalance, as the number of T2DM patients in the dataset (N = 219) was the same as the number of persons without T2DM (controls) (N = 219). The dataset contained 438 instances (participants) with eleven (11) different features (attributes). While Age, BMI, SBP, DBP, HbA1c, FBS, TC, TG, HDL-c and LDL-c are the predictor variables (attributes), the T2DM class is the target variable. This division is essential, especially because the approach is to build a predictive model with ML.

Fig. 1
figure 1

Model for Classification where 0 is non-T2DM and 1 is T2DM

Among other factors, the performance of a classification algorithm depends largely on the quality of the data. Data fraught with errors, such as outliers, influence the performance of a machine learning algorithm [47]. Hence, the diabetes dataset used in this study was explored to eliminate outliers and errors. Visualizing the data through boxplots revealed no outliers. Of all the datapoints in the dataset, only seventeen (17) were found to be missing. Hence, the Expectation–Maximization (EM) algorithm was employed to impute the missing data. The EM algorithm incorporates statistical considerations to compute the “most likely, or maximum-likelihood, source distribution that would have created the observed projection data, including the effects of counting statistics” [48].
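The paper does not report which EM implementation was used. As a hedged sketch, scikit-learn's experimental IterativeImputer performs a comparable iterative, model-based (maximum-likelihood-style) imputation; the values below are hypothetical:

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical rows with a few missing datapoints (NaN)
df = pd.DataFrame({
    "HbA1c": [8.1, np.nan, 5.3, 7.9],
    "FBS":   [9.2, 5.1, np.nan, 8.8],
    "BMI":   [27.4, 24.1, 22.8, np.nan],
})
imputed = pd.DataFrame(
    IterativeImputer(max_iter=20, random_state=0).fit_transform(df),
    columns=df.columns,
)
print(imputed)
```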

To improve the performance of the algorithms and eliminate any possible bias, the predictor variables were scaled to the range (0, 1). Data scaling is a method used in ML to normalize the range of the predictor variables (features). Furthermore, the importance of each attribute (predictor variable) used in this study was explored and ranked according to its coefficient. The ranking demonstrates which attributes are the most and least important for detecting diabetes. We leveraged Recursive Feature Elimination (RFE) from Scikit-learn in Python to compute and rank the importance of each attribute. RFE works by recursively removing attributes and building a model on those that remain in order to rank them [49]. It uses the model accuracy (coefficients) to identify which attributes have the most predictive influence.
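A minimal sketch of this scaling-and-ranking step is shown below. The random matrix stands in for the study data, and logistic regression is assumed as the RFE base estimator (the paper does not name the estimator used):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

feature_names = ["Age", "BMI", "SBP", "DBP", "HbA1c",
                 "FBS", "TC", "TG", "HDL-c", "LDL-c"]
rng = np.random.default_rng(0)
X = rng.normal(size=(438, len(feature_names)))  # stand-in for the real data
y = rng.integers(0, 2, size=438)                # 0 = non-T2DM, 1 = T2DM

# Scale each predictor to the (0, 1) range, as described above
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Recursively eliminate features; a ranking_ of 1 marks the most important
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=1)
rfe.fit(X_scaled, y)
for rank, name in sorted(zip(rfe.ranking_, feature_names)):
    print(rank, name)
```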

Classification

The predictive models in this study were built with four different ML algorithms: KNN, SVM, NB and DT. While this study sought to predict T2DM and rank the predictive importance of the feature attributes, the goal was also to compare the performance of each of the model classifiers in predicting unseen data. These classifiers were selected based on their efficacy and the fact that they have been used widely for text classification (Kolog et al., 2019). Altogether, the total number of instances used in this study was 438 (219 each for cases and controls). As shown in Fig. 1, we used the train-test technique [42] to build the predictive models. With this technique, the data were split in two: 80% was used to train the algorithms and the remaining 20% was used to test them. Of the 438 instances (both controls and cases), 350 (80%) were used for training while the remaining 88 (20%) were used for testing. The division of the data into training and testing sets was random. To avoid imbalanced classification, the 350 instances of training data comprised 175 each from the case data (diabetic patients) and the control data (non-diabetic patients).
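Reusing the illustrative X_scaled and y from the sketch above, a stratified split reproduces this balanced 80/20 partition:

```python
from sklearn.model_selection import train_test_split

# stratify=y preserves the case/control ratio in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 350 and 88 for 438 instances
```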

NB is a probabilistic classifier that applies Bayes' theorem with naïve independence assumptions between the features or attributes [50]. There are three main types of NB algorithm: Multinomial Naive Bayes, Gaussian Naive Bayes and Bernoulli Naive Bayes, distinguished by their classification techniques. Gaussian Naive Bayes was employed in this study because of its versatility in handling both continuous and discrete data: when a predictor takes continuous values, the Gaussian variant assumes these values are sampled from a Gaussian distribution. In our study, we sought to predict whether a patient has T2DM (case) or not (control), Ck (where C1 = diabetes and C0 = non-diabetes), given the predictor variables x1, x2,…,xp, which can be expressed as P(Ck|x1,…,xp). The Bayesian formula for calculating this probability is Eq. 1, where P(Ck) is the prior probability of the outcome, P(x) is the probability of the predictor variables, P(x|Ck) is the conditional probability or likelihood and P(Ck|x) is the posterior probability. This is expressed in words in Eq. 2.

$$P(C_k \mid x)=\frac{P(C_k)\,P(x \mid C_k)}{P(x)}$$
(1)
$$Posterior=\frac{Prior \times Likelihood}{Evidence}$$
(2)
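Continuing the illustrative variables from the sketches above, the Gaussian NB step can be sketched in scikit-learn as follows:

```python
from sklearn.naive_bayes import GaussianNB

# Fits the class priors P(C_k) and per-feature Gaussian likelihoods
nb = GaussianNB().fit(X_train, y_train)
print(nb.class_prior_)           # learned priors for C_0 and C_1
nb_pred = nb.predict(X_test)     # class with the highest posterior P(C_k | x)
```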

DT is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs and utility [41]. It consists of nodes and leaves arranged in hierarchical layers, as indicated in Fig. 2. Each node is a divergence point where a particular characteristic of the data is tested and the data are split accordingly [51]. Like the other ML algorithms, DT is not built on any statistical assumption about the data, such as normality, collinearity or correlation between explanatory variables. The capability of DT classifiers has prompted their application in diverse domains, including decision analysis in the management sciences and operations research, and they remain popular ML tools for classification problems. A short sketch follows the figure below.

Fig. 2
figure 2

Basic structure of decision tree showing three different hierarchical layers
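A corresponding sketch of the DT classifier, with a depth cap standing in for the hierarchical layers of Fig. 2 (the paper does not report the tree's hyperparameters):

```python
from sklearn.tree import DecisionTreeClassifier

# max_depth bounds the number of hierarchical layers in the fitted tree
dt = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
dt_pred = dt.predict(X_test)
```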

SVM was originally developed for binary classification but was later extended to multi-class problems. It is one of the most popular algorithms for both classification and regression due to its efficacy. The algorithm constructs a hyperplane (or set of hyperplanes) between datapoints expressed in a high-dimensional vector space [52]. As indicated in Fig. 3, the larger the margin, the lower the generalization error of the classifier. Therefore, a hyperplane that is farther from the nearest training datapoints of each class (the functional margin) separates the classes well. SVM algorithms use a set of mathematical functions defined as the kernel: a kernel function takes data as input and transforms it into the required form. Examples of SVM kernel functions are linear, nonlinear, polynomial, radial basis function (RBF) and sigmoid. In this work, we tried all the kernel functions on our data and settled on RBF because it performed best (see the sketch after Fig. 3).

Fig. 3
figure 3

Pictorial representation of a) K-Nearest neighbor classification on datapoints b) SVM classification on datapoints in high dimensional vector space
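The RBF-kernel choice described above translates directly into scikit-learn; probability=True is an assumption added here so that ROC analysis is possible later:

```python
from sklearn.svm import SVC

# RBF kernel, as selected above; probability=True enables predict_proba for ROC curves
svm = SVC(kernel="rbf", probability=True, random_state=42).fit(X_train, y_train)
svm_pred = svm.predict(X_test)
```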

The KNN algorithm is a classification algorithm that uses a distance metric to find the k most similar instances in the training data for each test instance [53]. The majority outcome of the neighbours is taken as the prediction. Like k-means in clustering, the KNN algorithm commonly uses the Euclidean distance. Mathematically, let xi represent an input sample with p features (xi1, xi2,…,xip), let n be the total number of input samples (i = 1,2,…,n) and p the total number of features (j = 1,2,…,p) [69]. The Euclidean distance between datapoints is given by Eq. 3. In this study, we implemented KNN from the scikit-learn machine learning library.

$$d(x_i,x_j)=\sqrt{(x_{i1}-x_{j1})^2+(x_{i2}-x_{j2})^2+\dots+(x_{ip}-x_{jp})^2}$$
(3)
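A hedged sketch of the KNN step; k = 5 is an assumed neighbourhood size, as the paper does not report the value used:

```python
from sklearn.neighbors import KNeighborsClassifier

# metric="euclidean" matches the distance in Eq. 3
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
```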

Classifiers evaluation

The ML algorithms used in this study were evaluated according to their predictive strengths. Thus, we computed the Precision, Recall, F1-score and Accuracy of the algorithms. Recall is the proportion of the instances in the test data that were correctly identified by the classifier model based on the training data, while Precision is the proportion of the identified instances that were accurately predicted by the algorithms. The harmonic mean of Precision and Recall constitutes the F1-score or F-measure. Given the number of real positive (p) cases and the number of real negative (n) cases in the data, Precision, Recall and F1-score are given in Eqs. 4–6, where tp is true positive, fp is false positive and fn is false negative.

$$Precision=\frac{tp}{tp+fp}$$
(4)
$$Recall=\frac{tp}{tp+fn}$$
(5)
$$F1\text{-}score=2\times\frac{Precision \times Recall}{Precision+Recall}$$
(6)
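In scikit-learn, all three metrics of Eqs. 4–6, along with accuracy, can be obtained in one call (shown here for the illustrative NB predictions from the earlier sketches):

```python
from sklearn.metrics import classification_report

# Per-class Precision, Recall and F1-score, plus overall accuracy
print(classification_report(y_test, nb_pred, target_names=["non-T2DM", "T2DM"]))
```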

Additionally, we computed the Area under the Receiver Operating Characteristic curve (AROC) and the confusion matrix of the algorithms. The figures shown in the second column of Table 3 are ROC curves, which depict the discriminatory abilities of the various classifiers as the decision threshold is varied. ROC curves are typically used in binary classification to study the output of a predictive model. As indicated in Table 3, a ROC curve plots the true positive rate (sensitivity) on the y-axis against the false positive rate (1 − specificity) on the x-axis.

The confusion matrix, also called the error matrix, is a tabular representation of the performance of an algorithm. Computing the confusion matrix yields the number of instances that were correctly and falsely predicted. As indicated in Table 3, the first column contains the contingency table (confusion matrix) for each of the algorithms. In these tables, the predicted class is on the rows while the actual class is on the columns.
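Both quantities can be computed with scikit-learn as sketched below; note that scikit-learn's confusion_matrix places the actual class on the rows and the predicted class on the columns, the transpose of the layout described for Table 3:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

print(confusion_matrix(y_test, nb_pred))                      # rows: actual, columns: predicted
print(roc_auc_score(y_test, nb.predict_proba(X_test)[:, 1]))  # AROC from predicted probabilities
```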

Results and analysis

Descriptive

The means of most attributes for patients with and without T2DM varied, but not significantly. Notable significant differences are seen in the means of HbA1c and FBS between those with and without T2DM. The mean HbA1c of T2DM patients (mean = 8.1) is higher than that of the non-T2DM participants (mean = 5.3). Generally, the mean scores of the parameters were higher in patients with T2DM than in those without, except for SBP and DBP (Fig. 4).

Fig. 4
figure 4

Distribution of study features across T2DM and non-T2DM subjects. Note that HbA1c is glycated haemoglobin; TC is total cholesterol; BMI is body mass index; FBS is fasting blood sugar; DBP is diastolic blood pressure; TG is triglycerides; SBP is systolic blood pressure; HDL-c is high-density lipoprotein cholesterol; LDL-c is low-density lipoprotein cholesterol

Relationship structure among features

There is reasonable overlap in the classification of T2DM and non-T2DM cases based on the two orthogonal linear combinations of the features that explain most of the variability in the data (Fig. 5). In terms of relationships among predictors, there exist strong positive relationships between SBP and DBP, LDL-c and TC, Age and TG, and FBS and HbA1c, based on the angles between the feature vectors (< 30°). SBP, DBP, BMI, Age and TG appear uncorrelated with HbA1c, FBS and HDL-c, as the angles between these vectors are approximately 90°. SBP and DBP contribute most to classifying the control group, whilst HbA1c, FBS and HDL-c are most influential in classifying subjects with T2DM.

Fig. 5
figure 5

Principal component analysis (PCA) biplot illustrating the relationships among study features and the clustering patterns in subjects with and without T2DM, based on orthogonal linear combinations of the features. Glycated Haemoglobin (HbA1c); Total Cholesterol (TC); Body Mass Index (BMI); Fasting Blood Sugar (FBS); Diastolic Blood Pressure (DBP); Triglycerides (TG); Systolic Blood Pressure (SBP); High-Density Lipoprotein cholesterol (HDL-c); Low-Density Lipoprotein cholesterol (LDL-c)
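A minimal sketch of the decomposition behind this biplot, reusing the scaled illustrative matrix from the Experiment section:

```python
from sklearn.decomposition import PCA

# Two orthogonal linear combinations explaining most of the variance
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)   # subject coordinates in the biplot
loadings = pca.components_.T           # feature vectors (the biplot arrows)
print(pca.explained_variance_ratio_)
```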

Classification

Table 2 shows the performance of the various ML algorithms in terms of Precision, Recall, F1-score, weighted average and Accuracy. As indicated in Table 2, all the classifiers performed beyond the acceptable threshold of 70% on Precision, Recall, F1-score and Accuracy, although the performance of the individual classifiers varied slightly across these parameters. With the NB predictive model, 82% of the diabetic instances in the test data were detected (Recall), and 93% of the instances flagged as diabetic were accurately predicted (Precision); the F1-score for diabetic patients was 87%. For the non-T2DM data, 93% of the test instances were detected, with 82% of the flagged instances accurately predicted. The overall accuracy of the NB algorithm is 87%, the highest of all the algorithms.

Table 2 Performance of the classifiers for cases and control

The SVM algorithm was the second-best performer, yielding an overall accuracy of 84% for predicting both the cases (T2DM data) and controls (non-T2DM data) in the test data. With SVM, only 73% of the T2DM test instances were detected (Recall), but 97% of the flagged instances were accurately predicted (Precision). In a similar vein, the non-T2DM test data yielded 98% Recall with 75% Precision.

KNN and DT yielded overall accuracies of 83% and 81%, respectively. Although NB and SVM were better, the performance of DT and KNN signifies good predictive strength, with KNN performing better than DT; both classifiers exceeded the accepted threshold of 70%. For both algorithms, more than 70% of the test instances were detected for both the diabetic and non-diabetic data, and more than 70% of the flagged instances were accurately predicted. NB outperformed the other algorithms in terms of ROC, sensitivity, specificity, accuracy and Kappa (Fig. 6).

Fig. 6
figure 6

Comparative analysis of the four machine learning algorithms across sensitivity analysis statistics for the training set. ROC = Receiver Operating Curve, Sens = Sensitivity score and Spec = Specificity score

Table 3 contains both the confusion matrices and ROC curves of the various ML algorithms for the test set. As described earlier, the ROC curve provides an overall assessment of a predictive model. The figures on the right of Table 3 show the ROC curves of the four classifiers (NB, KNN, SVM, DT). The top left corner of each plot is the “ideal” point: a false positive rate of zero (0) and a true positive rate of one (1). It is highly unrealistic to obtain an extreme Area Under the Curve (AUC) score of exactly 0 or 1; as a guide, an AUC of 0.90–1.0 is excellent, 0.80–0.90 good, 0.70–0.80 fair, 0.60–0.70 poor and 0.50–0.60 a fail (Kleinbaum & Klein, 2010). The AUC measures how well the models discriminate between the cases and controls; the larger the area bounded by the curve and the reference line, the better the predictive model. From Table 3, the AUC of all the classifiers was above 0.80 (80%) but below 0.90 (90%), indicating “good” predictive models. Even so, some classifiers performed better than others: NB has the best performance (AUC = 0.87), followed by KNN (AUC = 0.85), SVM (AUC = 0.84) and DT (AUC = 0.81). This performance shows how well the algorithms discriminate on the dataset.

Table 3 Confusion Matrix and ROC curve for each of the classifiers

Feature importance

The predictor attributes of the T2DM dataset used in this study were ranked according to their predictive influence on the target variable (T2DM status). Our feature selection indicated that all the features were relevant for predicting T2DM; hence, all the attributes were used for building and testing the predictive models with the various ML algorithms. Although all the attributes were relevant for building the predictive models, the order of importance of each feature attribute was computed and ranked according to its strength of influence on predicting T2DM. As indicated in Fig. 7, the three best feature attributes, in order of importance, are HbA1c, TC and BMI (rankings were uniform across the four ML algorithms). The highly ranked feature attributes are very important when detecting T2DM and should be prioritized accordingly. While we recognise that all the feature attributes are useful for detecting T2DM, this study found Age, HDL-c and LDL-c to be the least important feature attributes for predicting T2DM.

Fig. 7
figure 7

Feature importance across four machine learning algorithms. Glycated Haemoglobin (HbA1c); Total Cholesterol (Chol); Body Mass Index (BMI); Fasting Blood Sugar (FBS); Diastolic Blood Pressure (DBP); Triglycerides (TG); Systolic Blood Pressure (SBP); High-Density Lipoprotein cholesterol (HDL-c); Low-Density Lipoprotein cholesterol (LDL-c)

The ML model was developed with Age as one of the attributes. Age is known in the literature to affect the other attributes of T2DM. In this study, the ML algorithms inherently aggregated and weighted the Age attribute so that the classifiers could learn to build a predictive model. Therefore, variations in the age of the patients, in relation to the other attributes, influence the predictive strength of the model on the test data. For instance, if the ages of patients are heavily skewed above 50 years, the values of the other attributes are affected and, in effect, so is the predictive model. Likewise, when the ages are skewed below 40 years, the other attributes will vary and the predictive model will be affected.

Discussion

The emergence of ML techniques has fuelled interest in the predictive modelling of cardiometabolic diseases [43, 44]. In this Ghanaian cohort study, we have demonstrated that ML algorithms can accurately predict T2DM from laboratory results and anthropometric data. In doing so, four ML classification algorithms, NB, KNN, SVM and DT, were compared. The predictive performance was generally good for all algorithms. In the analysis, NB was the best-performing classifier, with an AROC of 87.20%, and also led in terms of sensitivity, specificity, accuracy and kappa (Fig. 6). This finding agrees with that of Sisodia and Sisodia (2018), who reported NB (AROC of 76.30%) as the best predictor of diabetes in pregnant women [43]. Sneha and Gangil (2019) also showed that NB had the best accuracy when compared with DT [46].

The study identified SVM as the second-best performer, with good discriminatory power, which agrees with previous research. For example, reporting an AUC of 83.47%, Yu et al. [35] highlighted that SVM is efficient, can predict diabetes and outperforms logistic regression in population health surveys. The good discriminatory power of SVM has been attributed to the large margin between hyperplanes, which allows for the separation of classes in a high-dimensional vector space [43]. However, one limitation of SVM is its dependence on the size of the data: SVM has been found in the literature to perform extremely well when the dataset is large. Although its performance in this study was above the accepted threshold of 70%, the size of the dataset could have affected the performance.

With regard to the KNN classifier, the accuracy was 81.0% with an AROC of 83.9%. While this result is acceptable (> 70%), it could have been affected by the bias-variance trade-off [54]. KNN is sensitive not only to the quality of the data but also to its scale and to irrelevant features; hence, features that exhibited weak predictive strength could have influenced its performance. The DT, which performed above the threshold but was the weakest of the classifiers used in this study, works well when the attributes are highly discriminative. Its relatively poor performance may have resulted from its sensitivity to small perturbations in the data.

With the ML methods exhibiting high Precision, Recall, F1-score, weighted average and Accuracy (Table 2), the study has demonstrated the ability of ML techniques to correctly predict T2DM. For example, NB could identify the presence of T2DM in 82 out of 100 T2DM patients, SVM could identify 73 out of 100, KNN 78 out of 100 and DT 82 out of 100. This is especially important, as lower recall rates can lead to misdiagnosis of T2DM.

The phenotypic expression of T2DM is due to a continuum of risk factors. The present study identified 9 variables, including blood pressure, FBS, TC, TG, BMI, HDL-c and LDL-c, as predictors of T2DM. In particular, with regard to the order of importance, we identified HbA1c, TC and BMI as the top three predictors of T2DM. These findings are not unexpected but validate those of previous studies [34, 42, 45]. Hitherto, the measurement of FBS was considered the surest way to determine prediabetes and diabetes. However, due to daily fluctuations in glucose levels, there was a need for alternative biomarkers [55]. In the course of research, it became known that sugars bind to residues of globin chains and form 1-deoxy-1-N-valyl-fructose after an Amadori rearrangement [55]; this product later became known as glycated haemoglobin (HbA1c). While the level of FBS is still the basis for the diagnosis of prediabetes and diabetes in most laboratories, research has indicated that HbA1c is sensitive and more reliable than FBS for diagnosing diabetes [7, 56]. Further, HbA1c has leverage over FBS in being stable and reflecting plasma glucose levels over the previous 3 months. From Fig. 5, our results confirm those of previous studies indicating that HbA1c is superior to FBS in T2DM diagnosis. Although some researchers prefer other obesity measures to BMI [57,58,59,60], BMI is widely used as an indicator of excess body fat and a risk factor for cardiometabolic disease [61, 62]. The use of BMI in the present study instead of other fat indicators, such as waist circumference, abdominal obesity and visceral body fat, is justifiable in the light of a previous study [63]. Based on 1288 subjects, Bouchard (2007) revealed strong relationships between BMI and other fat measures: specifically, BMI correlated strongly with excess fat mass (r = 0.94), waist or abdominal obesity (r = 0.93), and abdominal visceral fat (r = 0.72) [63]. These results are comparable to the findings of several previous studies [31, 35, 64]. For example, using a neural network model, Akella and Kaushik (2020) identified resting blood pressure, serum cholesterol and blood glucose among the top 10 variables of importance in cardiovascular disease prediction [65].

LDL-c and HDL-c are important molecules that are dysregulated or modified in T2DM. In T2DM, HDL-c declines due to the formation of TG-rich HDL-c, which is a substrate for hepatic lipases that catalyse the breakdown of HDL-c [66]. Conversely, the catabolism of LDL-c is reduced in T2DM, leading to increased levels of LDL-c. This reduced catabolism has been attributed to a decline in the expression of apolipoprotein B and apolipoprotein E receptors, as well as a decreased receptor affinity of LDL-c [67].

Dinh et al. [39] have shown that age is a key risk factor for cardiovascular events and diabetes [68], because ageing is linked to physical inactivity and, ultimately, T2DM. In the present study, our feature selection technique likewise revealed age to be a risk factor for T2DM, albeit among the least important predictors. This implies that ageing is a determining attribute for T2DM detection, but that attributes such as HbA1c, TC and BMI ought to be considered before Age when diagnosing T2DM. It is worth noting that aged people may physically exhibit symptoms of T2DM without necessarily being diabetic; this may be why HbA1c, TC and BMI are the most important attributes for diagnosing T2DM. This result also agrees with the literature. Comparing multiple variables, including invasive laboratory data and non-invasive (non-laboratory) data, Dinh et al. [39] documented that age was the fifth most important predictor of diabetes, behind LDL-c, TG, blood urea nitrogen, sodium and blood osmolality. However, in the absence of laboratory variables, their results showed that age was the second most important feature for predicting diabetes.

It should be clear by now that ML can adequately predict T2DM in a Ghanaian population. However, some limitations need to be mentioned. Firstly, the sample size was small, and the predictions may be over- or underestimated. However, this does not invalidate our results, since Kuhn and Max [36] have stated that large sample sizes, though beneficial, increase the computational burden and can impact the results. Secondly, other potential risk factors, including family history and physical activity, exist but were not considered in this study; their inclusion is expected to further enhance the predictive model. Thirdly, the impact of antidiabetic medications should not be overlooked. Medications used to control T2DM in this population include glucose-lowering agents (e.g. biguanides, thiazolidinediones, sulfonylureas), lipid-lowering agents (statins) and antihypertensives (e.g. angiotensin II receptor blockers, calcium channel blockers) [29]. Thus, the results should be interpreted in light of medication use. Going forward, we will explore the potential of ML methods for discovering other biomarkers of T2DM.

The study seeks to influence policy and practice in health facilities across Ghana. It recommends that clinicians test HbA1c, TC and BMI for T2DM before any other parameter is considered, underscoring the fact that some T2DM risk factors are more important, in terms of their predictive strength, than others. In this way, our study attempts to reduce the cost of diagnosing T2DM. Indeed, HbA1c and FBS were strong biomarkers for predicting diabetes in this Ghanaian population. The data for the current study comprised both undiagnosed (including at-risk) and diagnosed diabetic individuals. Identifying individuals with undiagnosed diabetes has been a challenge, but our results reinforce the relevance of HbA1c and FBS for the early detection of diabetes or prediabetes. Once detected, such individuals can be targeted for tailored treatments that delay the development of the disease. Based on the analysed data, these attributes are sufficient to identify patients with diabetes.

Conclusion

Using multiple variables as substrates, the study has shown that ML can generate accurate predictions of T2DM and provide potentially meaningful information. We identified NB as the best algorithm for predicting T2DM when compared with KNN, DT and SVM. When employed, these algorithms can allow the early detection of T2DM, anticipate future events and, in turn, stimulate timely intervention. It is hoped that the findings of this study will guide the selection of an appropriate ML algorithm for the prediction of T2DM and help health professionals in Ghana to make well-informed and better decisions.