Introduction

Diabetes mellitus (DM) is a metabolic disease characterized by hyperglycemia, which can cause chronic tissue damage over time [1]. Diabetes has become a major health problem worldwide, with a significant increase in the number of DM patients. According to the International Diabetes Federation (IDF), approximately 537 million adults worldwide had diabetes in 2021 (a prevalence of 10.5%), and by 2045 approximately 783 million people worldwide are expected to have diabetes (a prevalence of approximately 12.2%) [2, 3]. In 2021, China had the largest number of adults with diabetes in the world (approximately 140.9 million patients, a prevalence of approximately 13.0%) [3, 4]. According to a survey, because individuals with type 2 diabetes mellitus (T2DM) often lack the relevant knowledge or are asymptomatic, many cases are not detected in time (approximately 50% of individuals with T2DM are undiagnosed) [3, 5]. It is therefore necessary to identify individuals with diabetes in the population efficiently and accurately, so that early preventive measures and treatment can be taken to avoid further progression of T2DM.

Currently, the scientific community has shifted its focus to the use of powerful computational methods for early and accurate prediction of diabetes [6,7,8,9,10,11]. Machine learning (ML) can iteratively learn nonlinear interactions from large amounts of data [12,13,14]. Based on electronic medical records and hospitalization data, ML methods have been used with some success in the diagnosis and prediction of diabetes, prediabetes, complications and disease progression [7, 8, 15,16,17], as well as in real-time blood glucose monitoring [18, 19]. However, most of these models were created for the care of T2DM patients, and their training samples are too small to reliably capture asymptomatic cases of early abnormal blood glucose, making them unsuitable for mass screening of the population or public health planning [20, 21]. One study [22] reported that most models for diabetes prediction and risk assessment were rarely used because they relied on specific data. As physical examination data grow and ML develops rapidly, the use of physical examination data for disease risk assessment can provide better clinical guidance and facilitate large-scale screening at an earlier stage [23]. However, few studies to date have conducted diabetes screening based on health examination data [8, 24], and ML methods have not been applied to T2DM screening and risk assessment in western China based on large-scale physical examination data.

We aimed to develop an ML model suitable for large-scale screening of T2DM among adults in western China. In this study, we established models based on logistic regression (LR) and ML algorithms, including classification and regression tree (CART), light gradient boosting machine (LightGBM), random forest (RF), extreme gradient boosting (XGBoost), multilayer perceptron (MLP), and TabNet, and applied them to large-scale health examination data from western China, which are characterized by wide coverage, large volume and strong representativeness. In addition, to improve the convenience and flexibility of the model, a diabetes risk score card was established based on logistic regression to assess the risk of the population. This study presents the first T2DM screening model that systematically compares various algorithms on a multiethnic, large-sample basis.

Materials and methods

The dataset

We used the health examination data obtained from the national physical examination (NPE) project in 2020, which was previously described in detail [25]. The NPE health examination consisted of three parts: questionnaire, routine physical examination and laboratory tests.

A total of 9,333,091 people were enrolled in this study after signing an informed consent form. Participants were excluded if (i) they were younger than 18 years old or (ii) more than 20% of their baseline and laboratory test data were missing. We then removed variables unrelated to the study, such as participants’ names, contact phone numbers, and home addresses. After that, missing values (random forest imputation) and extreme values (deletion) were processed for the remaining variables. The detailed analysis process is shown in Fig. 1. Finally, 4,075,431 samples remained, including 3,774,084 healthy individuals and 301,347 T2DM patients.

Fig. 1 Flow chart. CART classification and regression tree, LightGBM light gradient boosting machine, RF random forest, XGBoost extreme gradient boosting, LR logistic regression

Feature fusion

In our computational model, we combined three types of physical examination data: questionnaire data (9 features), routine tests (2 features), and laboratory values (9 features). A total of 20 features were sufficient to identify diabetes risk. Through the questionnaire, we collected demographic characteristics, diet, smoking, hypertension, coronary heart disease, and parental history of T2DM in the population. Body mass index (BMI) and waist circumference (WC) were collected through routine tests. Through laboratory testing, nine laboratory values were collected.

T2DM was defined as meeting any of the following criteria: 2 h postprandial blood glucose (2hPG) ≥ 11.1 mmol/L, fasting blood glucose (FBG) ≥ 7.0 mmol/L, or self-reported diabetes together with the use of antidiabetic drugs.
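The case definition above can be written as a small labeling rule. The following is a minimal sketch, not the study's actual preprocessing code: the function name, argument names, and the handling of missing glucose values (treated as "criterion not met") are assumptions made for illustration.

```python
from typing import Optional

FBG_THRESHOLD = 7.0    # mmol/L, fasting blood glucose (from the text)
PG2H_THRESHOLD = 11.1  # mmol/L, 2 h postprandial blood glucose (from the text)


def has_t2dm(
    fbg: Optional[float],
    pg2h: Optional[float],
    self_reported_diabetes: bool,
    on_antidiabetic_drugs: bool,
) -> bool:
    """Return True if any of the three criteria is met.

    Assumption: a missing measurement (None) cannot satisfy its criterion.
    """
    if pg2h is not None and pg2h >= PG2H_THRESHOLD:
        return True
    if fbg is not None and fbg >= FBG_THRESHOLD:
        return True
    # Third criterion: self-report AND current antidiabetic medication.
    return self_reported_diabetes and on_antidiabetic_drugs


# Hypothetical participants
print(has_t2dm(7.2, None, False, False))  # elevated FBG alone is sufficient
print(has_t2dm(5.4, None, True, True))    # treated, self-reported diabetes
```

Keeping the thresholds as named constants makes it easy to see that the rule follows the stated cut-offs and to audit the labels later.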

Feature selection

To adjust the parameters and measure the model’s performance, the data were split using the 70–30 holdout method. The training set contained 2,852,801 samples (healthy population: 2,641,683; T2DM patients: 211,118). The possible risk factors for DM were preliminarily screened by reviewing the relevant literature. Univariate and multivariate logistic regression analyses were performed to analyze these characteristics, and correlation analysis was used to determine the correlation between each pair of characteristics.
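A 70–30 holdout split of this kind can be sketched with scikit-learn's `train_test_split`. The data below are synthetic, and the use of stratification and of a fixed random seed are assumptions; the text does not state whether the split was stratified by outcome.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the examination data (hypothetical values):
# 10,000 rows, 18 features, roughly 7.4% positive labels.
rng = np.random.default_rng(0)
n_samples = 10_000
X = rng.normal(size=(n_samples, 18))
y = (rng.random(n_samples) < 0.074).astype(int)

# 70% development set, 30% validation set.
# stratify=y keeps the T2DM prevalence nearly equal in both parts
# (an assumption; the paper only says "70-30 holdout").
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(X_train.shape, X_valid.shape)
print(round(y_train.mean(), 4), round(y_valid.mean(), 4))
```

With a strongly imbalanced outcome, stratifying avoids a validation set whose prevalence differs from the development set by chance.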

Classification algorithms

In this study, tree-based and ensemble learning models (CART, LightGBM, RF, XGBoost), deep learning models (TabNet and MLP) and LR were used to construct a diabetes risk assessment model.

The CART algorithm is a decision tree algorithm. CART has the advantages of fast operation, high accuracy, the ability to handle high-dimensional data and the absence of parametric assumptions. However, it suffers from high variance and overfitting, which limit its applicability as an independent prediction model.

The RF algorithm is a combination of bagging ensemble learning theory and the random subspace method [26, 27]. The core idea of the RF algorithm is to construct multiple independent classifiers, and then apply the average or majority voting principle to their predictions to determine the results of ensemble classifiers.

The XGBoost technique is a nonlinear, tree-based machine learning technique [8]. It repeatedly combines weak estimators, each focusing on samples that are hard to predict [28], so as to constitute a strong estimator. XGBoost can evaluate the importance of each input feature more easily than other black-box techniques such as support vector machines (SVM) and artificial neural networks (ANN).

The LightGBM algorithm is a decision tree-based ensemble algorithm that provides an effective implementation of gradient boosting [29]. Compared to traditional training algorithms, LightGBM has a faster training speed, a lower memory requirement, and a higher accuracy, which can lead to more efficient models.

MLP is a feed-forward, supervised artificial neural network that can contain multiple hidden layers, enabling classification modeling of nonlinear data.

TabNet is a neural network for tabular data that uses a sequential attention mechanism to select the features to reason about at each decision step, thereby learning the most salient features for interpretability and more efficient learning.

To facilitate clinical and real-life applications, we designed a diabetes risk scorecard based on LR. In establishing the score card, we used chi-square binning for continuous variables, while discrete variables were divided directly into categories. We determined the final number of bins according to the information value (IV) curve. Then, the IV of each feature was calculated, and variables with an IV greater than 0.1 were selected for the scorecard model. Finally, the weight of evidence (WOE) of each bin was calculated and mapped back to the original dataset, and LR was used to establish the model. The detailed process can be found in a previous study [8].
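The WOE and IV calculations used for feature selection can be illustrated as follows. This is a minimal sketch under stated assumptions: it uses one common convention, WOE = ln(share of healthy individuals in the bin / share of T2DM patients in the bin), which may differ in sign from the convention used in the study; the bin boundaries and counts are hypothetical; and every bin is assumed to contain at least one individual of each class.

```python
import math
from typing import Dict, List, Tuple


def woe_iv(bins: List[Tuple[str, int, int]]) -> Tuple[Dict[str, float], float]:
    """Compute WOE per bin and the feature's total IV.

    bins: (label, n_healthy, n_t2dm) for each bin of one feature.
    Assumes every count is > 0 (no smoothing is applied).
    """
    total_healthy = sum(h for _, h, _ in bins)
    total_t2dm = sum(d for _, _, d in bins)
    woe: Dict[str, float] = {}
    iv = 0.0
    for label, n_healthy, n_t2dm in bins:
        healthy_share = n_healthy / total_healthy
        t2dm_share = n_t2dm / total_t2dm
        w = math.log(healthy_share / t2dm_share)
        woe[label] = w
        iv += (healthy_share - t2dm_share) * w
    return woe, iv


# Hypothetical age bins: (label, healthy count, T2DM count)
age_bins = [("<40", 500, 10), ("40-59", 300, 40), (">=60", 200, 50)]
age_woe, age_iv = woe_iv(age_bins)
print(age_woe)
print(round(age_iv, 4), "selected" if age_iv > 0.1 else "dropped")
```

Under this convention a positive WOE marks a bin where healthy individuals are over-represented, and the 0.1 cut-off from the text is applied to the summed IV of the feature.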

Model evaluation

To obtain the optimal parameters, we used grid search to perform hyperparameter tuning on four models. Based on the confusion matrix, we calculated the accuracy, recall, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of each model, and plotted its receiver operating characteristic (ROC) curve. Furthermore, we used the Kolmogorov–Smirnov (KS) value to appraise the performance of the scorecard model. The KS value varies from 0 to 1, and a higher value indicates a better model; when KS exceeds 0.3, the prediction performance of the model is deemed satisfactory.
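The confusion-matrix metrics and the KS value can be computed as in the sketch below. The function names, the 0.5 decision threshold and the toy labels and scores are illustrative assumptions; the code also assumes that both classes and both predicted classes are present, so that no denominator is zero.

```python
from typing import Dict, Sequence


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Metrics derived from the 2x2 confusion matrix (1 = T2DM, 0 = healthy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # identical to precision
        "npv": tn / (tn + fn),
    }


def ks_statistic(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Largest gap between the cumulative score distributions of the two classes."""
    pairs = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    cum_pos = cum_neg = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Tied scores are added together before the gap is measured.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_pos += 1
            else:
                cum_neg += 1
            j += 1
        ks = max(ks, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return ks


# Toy example (hypothetical labels and predicted probabilities)
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = confusion_metrics(y_true, y_pred)
ks = ks_statistic(y_true, scores)
print(metrics)
print(ks)
```

Because sensitivity equals recall and PPV equals precision, each pair is returned once under a single key.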

Statistical analysis

The baseline characteristics of the study population are represented as the mean ± standard deviation when they are continuous variables, and as frequency (percentage) when they are categorical variables.

The differences in variables between patients with diabetes and healthy people were analyzed. The t test or Mann–Whitney test was used for continuous variables. The chi-square test or Fisher’s exact test was used for categorical variables. Statistical significance was inferred at a two-sided P-value < 0.05.

This study utilized Python version 3.8.3. The libraries “Pandas”, “NumPy” and “Matplotlib” were used for identifying nulls and outliers as well as for interpolation. The “Sklearn” library was used for construction and validation of the ML models, and “PyTorch” was used to build the deep learning framework.

Results

Basic characteristics

A total of 9,333,091 participants were included in this study. After data preprocessing, 4,075,431 participants remained, including 1,919,248 (47.09%) males and 2,156,183 (52.91%) females. A total of 3,774,084 healthy people and 301,347 T2DM patients were included, giving a T2DM prevalence of 7.39% in the study population. The general characteristics of the study population are presented in Table 1. Patients with diabetes were older and had higher BMI, WC, HGB, WBC, FBG, TC, TG and LDLC, and lower PLT and HDLC, than healthy people. The proportion of individuals with hypertension and CAD was higher among patients with diabetes than in the healthy population. The prevalence of T2DM differed significantly among people with different dietary habits and smoking statuses.

Table 1 Characteristics of participants in this study

We compared the prevalence of diabetes in different age groups (Fig. 2). It was found that the age of diabetes patients was concentrated in the range of 50–80 years old, accounting for approximately 60% of diabetes patients. Diabetes patients younger than the age of 40 accounted for 2.5% of the total number of diabetes patients.

Fig. 2 Distribution of diabetes patients and healthy people by age. Healthy people (yellow) and T2DM patients (blue)

Feature selection

The possible risk factors for T2DM were preliminarily screened by reviewing relevant literature (Table 2). Pearson’s correlation coefficient was used to reveal the interrelationships between the features, and the correlations were depicted using heat maps (Additional file 1: Figure A1). In Additional file 1: Figure A1, BMI was positively correlated with WC, and HTN was positively correlated with CAD.
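For reference, Pearson's correlation coefficient between two features is the covariance divided by the product of the standard deviations. The sketch below implements that formula directly; the BMI and WC values are hypothetical and only illustrate a positive correlation of the kind reported.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes equal lengths and non-constant inputs."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements for five participants
bmi = [21.0, 24.5, 27.3, 30.1, 33.8]    # kg/m^2
wc = [74.0, 83.0, 91.0, 99.0, 110.0]    # cm
r_bmi_wc = pearson_r(bmi, wc)
print(round(r_bmi_wc, 3))
```

Applying the same function to every pair of features yields the matrix that a heat map displays.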

Table 2 Multivariate logistic regression analysis in the development group

Univariate logistic regression analysis (Additional file 1: Table A1) and multivariate logistic regression analysis (Table 2) were performed for these features. We found that age, unbalanced diet, smoking, hypertension, CAD, PDM, WC, BMI, WBC, FBG, TC, and TG were positively associated with the risk of T2DM, whereas HDLC was negatively associated with T2DM. Multivariate logistic regression showed that HGB, PLT and LDLC were negatively correlated with the risk of T2DM, which may be related to the data itself and affected by missing values. Considering that some previous studies found a relationship between TC and T2DM, and combining the correlation analysis with the logistic regression analysis, we finally chose sex, age, ethnicity, EH, SS, HTN, CAD, PDM, WC, BMI, WBC, PLT, FBG, ECG, TC, TG, LDLC, and HDLC to construct the diabetes risk prediction model.

Tuning of the parameters

To obtain the optimal parameters, we used grid search and cross-validation to tune the hyperparameters of the seven models, as shown in Additional file 1: Table A2.
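Grid search with cross-validation can be sketched with scikit-learn's `GridSearchCV`. The example tunes a single decision tree (a CART-style model) on synthetic data; the parameter grid, the five folds and the AUC scoring are assumptions for illustration and are not the settings listed in Table A2.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the development set.
X, y = make_classification(
    n_samples=2000, n_features=18, weights=[0.9, 0.1], random_state=0
)

# Hypothetical search space for a CART-style tree.
param_grid = {"max_depth": [3, 5, 8], "min_samples_leaf": [10, 50]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # select the combination with the best mean AUC
    cv=5,               # 5-fold cross-validation on the development data
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
print(best_params, round(best_auc, 4))
```

The same pattern applies to the other estimators by swapping the model object and its grid.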

Comparison of model performance

In this study, we constructed tree-based machine learning models (CART, LightGBM, RF and XGBoost), deep learning models (MLP and TabNet) and the LR model. Table 3 and Additional file 1: Figure A2 show the performance of each prediction model on the validation group. XGBoost performed well, with an AUC of 0.9122, and also showed superiority in accuracy (0.8314), PPV (0.2800) and NPV (0.9829).

Table 3 Performance metrics of the machine learning models

Figure 3 shows the ROC curves and AUC of the different prediction models in the development group and validation group. XGBoost performed better than the other prediction models, with an AUC of 0.9209 in the development group and 0.9122 in the validation group. These results indicate that the XGBoost algorithm had a clear advantage in predicting the risk of diabetes in this study.
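The AUC reported above equals the probability that a randomly chosen patient receives a higher predicted score than a randomly chosen healthy individual. The sketch below computes it by direct pairwise comparison; the function name and the toy data are illustrative, and the quadratic loop is only practical for small samples.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical labels and predicted probabilities)
labels = [1, 1, 0, 1, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
auc = roc_auc(labels, probs)
print(round(auc, 4))
```

Comparing the value on the development data with the value on the validation data, as done in the text, shows how much performance is lost on unseen records.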

Fig. 3 ROC curves of different machine learning algorithms on the training and validation sets. A ROC curve in the development group. B ROC curve in the validation group

We used SHapley Additive exPlanations (SHAP) to explain the feature contributions of the XGBoost model. Figure 4 shows the feature importance of the XGBoost model. HTN, FBG, age, PDM, CAD, ethnicity, TG, WC, BMI and TC were identified as the ten most important factors.

Fig. 4 Feature importance of the XGBoost model. HTN hypertension, FBG fasting blood glucose, PDM parental diabetes mellitus, CAD coronary heart disease, WC waist circumference, BMI body mass index, WBC white blood cell, HGB hemoglobin, PLT platelet, TC total cholesterol, TG triglyceride, LDLC low density lipoprotein cholesterol, HDLC high density lipoprotein cholesterol

Diabetes risk score card

A diabetes risk score card with a scale of 100 was designed for this study. The score card evaluates an individual's risk of diabetes by calculating a risk score. Based on the IV value, the risk score card model was established using age, FBG, HTN, WC, BMI, TG, CAD and ethnicity as variables. The ROC and KS curves of the validation group are displayed in Fig. 5.

Fig. 5 ROC and KS curves of the diabetes risk score card. A ROC curve; B KS curve

We used the score card scaling algorithm to convert the model into score cards (Table 4). The score card comprises the baseline score as well as the associated score for each box within each feature. When using a scorecard, the total score is the sum of the base score and the feature score, which represents the diabetes risk value. In this study, the base score was 46.3.
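Using a score card of this form amounts to a lookup and a sum. In the sketch below only the base score of 46.3 comes from the text; the two features shown, their bin boundaries and their points are hypothetical placeholders and do not reproduce Table 4.

```python
from typing import Dict, List, Tuple

BASE_SCORE = 46.3  # base score reported in the text

# Hypothetical bins: (exclusive upper bound, points). NOT the values of Table 4.
SCORECARD: Dict[str, List[Tuple[float, float]]] = {
    "age": [(40.0, 6.0), (60.0, -1.5), (float("inf"), -5.0)],
    "fbg": [(5.6, 8.0), (6.1, -2.0), (float("inf"), -12.0)],
}


def total_score(person: Dict[str, float]) -> float:
    """Base score plus the points of the bin each feature value falls into."""
    score = BASE_SCORE
    for feature, bins in SCORECARD.items():
        value = person[feature]
        for upper_bound, points in bins:
            if value < upper_bound:
                score += points
                break
    return round(score, 1)


# Two hypothetical individuals
low_risk = total_score({"age": 35, "fbg": 5.0})
high_risk = total_score({"age": 65, "fbg": 6.5})
print(low_risk, high_risk)
```

In this illustration protective bins carry positive points and risk-increasing bins carry negative points, so a lower total indicates a higher risk.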

Table 4 Diabetes risk score card

The Kolmogorov–Smirnov curve (Fig. 5) was used to illustrate the distribution of the total score and to determine the risk intervals. The higher the KS value is, the greater the discriminating ability of the model at the corresponding threshold. As illustrated in Fig. 5, the KS curve reaches its peak when the score is equal to 45. Therefore, to simplify calculation of the risk intervals, we set 50 as the intermediate threshold. The higher the score, the lower the risk of diabetes; conversely, the lower the score, the greater the likelihood of developing diabetes. To give users a more direct evaluation, four risk categories were established in accordance with the KS chart (Table 5).

Table 5 Risk interval division and threshold of diabetes risk score card

Comparison with existing models

To further validate the efficacy of our model, a comparison of the proposed model against other leading methods was conducted, the results of which are presented in Table 6.

Table 6 Comparison with existing models

Discussion

The increasing burden of diabetes has become a global challenge [3, 32]. Through mass screening, early identification and intervention of patients with diabetes can be achieved to delay or prevent the development of the disease [33, 34]. The most efficacious method for widespread screening of diabetes has yet to be identified. In this study, T2DM risk prediction models were developed and validated on data from more than 4 million people. The data were obtained from the cross-sectional NPE data, which include more than 9 million people in 14 prefectures of Xinjiang, China, and can be considered representative of the overall population of Xinjiang. After evaluating model performance, we determined that the XGBoost model was the optimal model for predicting the risk of T2DM, with an AUC of 0.9122.

In this study, we used questionnaires to obtain indicators of hypertension, cardiovascular disease, genetic history, smoking and diet in the population. The data thus captured not only the medical history of each patient but also demographic factors and laboratory test indicators. Univariate and multivariate logistic regression analyses showed that sex, age, ethnicity, EH, SS, HTN, CAD, PDM, WC, BMI, WBC, PLT, FBG, ECG, TC, TG, LDLC, and HDLC were important factors for diabetes. HTN, FBG, age, PDM, CAD, ethnicity, TG, WC, BMI, and TC were the most important predictors of diabetes. Apart from FBG, which is a recognized risk factor and predictor of T2DM, hypertension and CAD were the most important features of the T2DM risk models and showed high predictive ability. Some studies have confirmed that hypertension, cardiovascular disease and diabetes mutually promote and influence each other [35, 36]. Many pathophysiological mechanisms underlie the association between diabetes and cardiovascular disease, and several have been identified as potential contributors [36]. These include insulin resistance in the nitric-oxide pathway; the stimulatory effect of hyperinsulinemia on sympathetic drive, smooth muscle growth, and sodium-fluid retention; and the excitatory effect of hyperglycemia on the renin–angiotensin–aldosterone system. On the other hand, the functional changes occurring in the context of T2DM and hypertension significantly alter the hemodynamic stress on the heart and other organs. Understanding these underlying mechanisms is crucial for developing targeted interventions to prevent and manage cardiovascular complications in individuals with diabetes. Some studies have also demonstrated the important role of ECG in the prediction of diabetes [37], and our study confirmed the association between abnormal ECG results and T2DM.

Our study showed that age was also an important feature of diabetes prediction models. The FDRSM is a classic and widely used diabetes risk scoring model [38]. The objective of the FDRSM is to use six risk factors (including BMI, FBG, PDM, HDLC, blood pressure and TG) to evaluate the risk of T2DM among middle-aged individuals. T2DM is generally observed in adults and appears to be more prevalent among elderly individuals. As people age, the glucose sensitivity of pancreatic cells declines and insulin secretion is impaired, leading to hyperglycemia and T2DM [39]. Several studies have reported differences in the incidence of diabetes between ethnic groups [40,41,42] and confirmed that ethnicity could be a predictor of diabetes [40, 43,44,45,46]. In our study, with the Uyghur group as the reference, the Han and Hui ethnic groups exhibited a heightened susceptibility to diabetes, whereas the Kazakh, Mongolian and Tajik ethnic groups had a lower risk. Genetic and environmental differences (i.e., economic level, diet, lifestyle, climate) were taken into account. Family history of diabetes was also identified as an important risk factor for T2DM in our model, which is consistent with previous studies [47]. There is a significant genetic predisposition to T2DM, with a 2- to 30-fold increased risk of T2DM in those with a family history compared with those without [48].

Many studies have demonstrated a connection between obesity and diabetes. Our study likewise found that higher BMI and WC were correlated with a higher probability of having diabetes. The development of obesity can result in insulin resistance and diminished β-cell function in humans. According to the World Health Organization, the global increase in the prevalence of diabetes is believed to be related to chronic stress, being overweight [49], lack of physical activity [50, 51], excessive consumption of alcohol [52, 53] and an unhealthy diet [54]. Our model also demonstrated that EH and SS were predictors of T2DM. In addition, we found that people who smoked and those who had quit smoking had a higher risk of T2DM than those who had never smoked, and that those who ate a vegetarian or meat-based diet had a higher probability of T2DM than those who ate a balanced meat-vegetarian diet.

We incorporated laboratory variables, including TC, TG and HDLC, into the diabetes prediction model. Our findings indicated that TG was an independent risk factor for T2DM, whereas TC was not, consistent with other studies [55]. The feature importance ranking showed that TC, TG, LDLC and HDLC were all important features of the T2DM risk prediction model. Multiple studies have revealed that dyslipidemia and T2DM often coexist in individuals and share common pathological mechanisms, such as insulin resistance, metabolic disturbances, inflammation, and alterations in the gut microbiota [55, 56].

Currently, ML algorithms are increasingly used to predict diabetes and related diseases [11, 12, 18, 19, 30, 57,58,59]. In this study, a diabetes screening model based on CART, LightGBM, RF, XGBoost, TabNet and MLP models was constructed. The AUC (0.9122), PPV (0.2800), NPV (0.9829) and accuracy (0.8314) of the XGBoost prediction model showed good performance in the validation group. Our model appears to outperform the majority of existing models, which may be because it is built on multiple features and big data. Other studies have also found that XGBoost was effective in predicting the risk of diabetes [8, 16].

The development of the diabetes risk assessment score card assists clinicians and individuals alike in conducting self-examinations, with the aim of increasing the rate of diabetes cascade screening and enhancing individual lifestyle management. Hence, utilizing large-scale physical examination information to achieve prompt risk notification and identification of diabetes is the most practicable course of action.

This study has several advantages. First, because it is based on the NPE project, it not only has wide coverage and a large amount of data, but also includes a number of major ethnic groups in China, which enables better assessment of the characteristics of the population in Xinjiang, China; in addition, the risk prediction model has good generalization ability in Xinjiang, China. Second, the risk factors affecting diabetes were fully considered in this study. Laboratory examination, questionnaire survey and routine examination data were all taken into account to obtain indicators such as hypertension and cardiovascular diseases, genetic history, and exercise and diet in the population, and the influencing factors of diabetes were comprehensively analyzed. Third, our models showed satisfactory predictive performance (XGBoost: AUC = 0.9122). This study also has several limitations. First, it is not possible to establish causality using cross-sectional data derived from national health examinations; therefore, these results should be further investigated in subsequent research. Second, the health examination data used in our study were highly heterogeneous and had a high rate of missing data, which affected the power of the model.

Conclusion

T2DM imposes an inexorable and significant burden on society, including intangible costs of lost productivity, premature death, and poor quality of life. Our model is based on large-scale health examination data in Xinjiang, China, which were used to construct a large-scale early diabetes risk screening model. Our model can be applied directly to the physical examination database, providing a highly efficient means of identifying high-risk diabetes records at a large scale. This allows potential diabetes risk ratios to be understood at the public health level and more effective diabetes prevention and control strategies to be implemented. Identifying early risk warning signs and performing screening based on large-scale physical examination data is of great significance for the early control of diabetes.