Introduction

The incidence of diabetes has been increasing worldwide in recent years. About 6·7 million people died from diabetes or its complications in 2021 [1]. It is estimated that by 2045, China will have 174 million people with diabetes, ranking first in the world. The burden of diabetes-related complications is expected to increase with the increase in diabetes prevalence. Diabetic peripheral neuropathy (DPN) is the most common microvascular complication of diabetes [2], potentially leading to foot deformity, ulceration, and even amputation. It can also damage the central nervous system and increase the risk of all-cause and cardiovascular mortality [3,4,5,6]. Recent studies have found that DPN starts progressing in prediabetes [7]. However, a nerve conduction study could not assess its subtle damage to nerve fibers, resulting in delayed DPN diagnosis, intervention, and treatment administration [8, 9]. Identifying risk factors for DPN is crucial for clinical management.

Studies have identified age, diabetes duration, and glycosylated hemoglobin as risk factors for DPN progression [10, 11], and various other DPN-related factors are still being explored. However, due to differences in evaluation criteria among studies, many contrasting conclusions have been reported [12, 13], resulting in poor clinical applicability of DPN predictors. With the development of science and technology, the establishment of accurate predictive models through machine learning and risk factor acquisition are now applied clinically [14].

Machine learning does not require model structure pre-specification; rather, machine learning searches for the optimal fit within certain constraints. This approach can result in an accurate final prediction model that analyzes the complex interactions between many features [15]. Only a few DPN prediction models have been developed based on machine learning. Considering that DPN prevalence varies among countries [14], the existing DPN prediction models for China are limited by their small sample and single-center nature [16]. Our study aimed to discover the link between laboratory indicators and DPN through machine learning, and help clinicians quickly and accurately predict the risk of developing DPN.

Methods

Study population

We retrieved the data of 1278 patients with T2DM treated at Jiangsu Provincial Hospital of Traditional Chinese Medicine (n = 1093) and Jiangsu Provincial Governmental Hospital (n = 185) between February 2020 and July 2022. The data included 192 indicators, including the patient’s basic characteristics, complications, routine blood values, blood biochemistry, immune testing, thyroid function, coagulation function, routine urine values, urine biochemistry, routine stool values, insulin measurement, tumor screening, and sex hormone testing (see Supplementary Table S1).The data were divided into those with and those without DPN according to the nerve conduction test results. See Fig. 1 for the entire research process, including feature selection, data balancing, model construction, model comparison, and optimal model selection and interpretation. This study followed the principles of the Declaration of Helsinki and was approved by the Ethics Committee of Jiangsu Provincial Government Hospitals (2022 Hospital Ethics Review No. 030). All the above data have passed ethical verification.

Fig. 1
figure 1

Study design for building a machine learning model to predict diabetic peripheral neuropathy. Abbreviations and definitions: XGBoost, Extreme Gradient Boosting; NB, Naive Bayes; LR, Logistic Regression; KNN, K-Nearest-Neighbor; RF, Random Forest; DT, Decision Tree; K-Fold, K-Fold cross validation; SHAP, SHapley Additive exPlanations

Sample size

In order to achieve better performances of ML models in predicting the risk factors of diabetic peripheral neuropathy in this study, we included all records who fulfilled our inclusion criteria to fully train the models.

Data extraction

Screening of patient records according to the following inclusion and exclusion criteria.

Inclusion criteria

T2DM Patients diagnosed with DPN and had the diagnostic basis of EMG.

Exclusion criteria

Patients with neuropathy caused by other factors or patients with intervertebral disc disease, spinal nerve root disease and other neuropathy diseases.

Data exploration

We invited medical experts to manually select the features with a significant impact on DPN diagnosis, reducing the 192 features in the dataset to 53. The features were selected based on their clinical relevance and previous research and included the patient’s age, weight, glycosylated hemoglobin, blood lipids, and other essential characteristics. Some of these commonly used medical indicators are calculated by other indicators, as shown in the following formulae. Data mining treated the variables as complete (no without values) or incomplete (with missing values). Important incomplete variables were dealt with by deletion, imputation, or no processing. We eliminated ten features whose missing data rate exceeded 30%. After eliminating features, the remaining categorical variables with missing data were completed with the variables’ modes, and the remaining continuous variables were filled with the variables’ mean values. This study included 748 samples with DPN and 530 samples without DPN.

$$\mathrm C2/\mathrm C0=\frac{2\;\mathrm{hours}\;\mathrm{postprandial}\;\mathrm C-{\mathrm{peptide}\left(\mathrm{ng}/\mathrm{mL}\right)}}{\mathrm{fasting}\;\mathrm C-\mathrm{peptide}\left(\mathrm{ng}/\mathrm{mL}\right)}$$
(1)
$$\mathrm{HOMA}-\mathrm{IR}=\frac{\mathrm{fasting}\;\mathrm{plasmaglucose}(\mathrm{mmol}/\mathrm L)\times\mathrm{fasting}\;\mathrm{seruminsulin}(\mathrm{mU}/\mathrm L)}{22\cdot5}$$
(2)
$$\mathrm{NLR}=\frac{\mathrm{number}\;\mathrm{of}\;\mathrm{neutrophils}(10^9/\mathrm L)}{\mathrm{number}\;\mathrm{of}\;\mathrm{lymphocytes}(10^9/\mathrm L)}$$
(3)

Feature selection and data balancing

Feature selection, critical in feature engineering, can filter out highly correlated features to improve model performance and reduce training time. Feature selection can be divided into the filtering, wrapping, and embedding methods. The embedding feature selection method employed in this study uses a machine learning model to automatically select features and integrate them into the training process.

We balanced the data as most machine learning algorithms do not work well with unbalanced datasets. For unbalanced data, over-sampling, under-sampling or both can be used to achieve positive and negative sample balance. The SMOTETomek method [17, 18] was used for data balancing. It is a composite method that performs an over-sampling operation on a large proportion of samples and an under-sampling operation on a small proportion of samples.

Research technology

Statistical analysis was performed using IBM SPSS Statistics for Windows, Version 27.0 (IBM Corp., Armonk, NY, USA). We used the Kolmogorov-Smirnov test to assess the continuous variables and the Chi-Squared test to assess the categorical variables (see Supplementary Table S2 and S3). As the continuous variables were not normally distributed, they were compared by the Mann-Whitney U test. Feature selection and data preprocessing, balancing, modeling, and evaluation were performed using Python Software Foundation. Python Language Reference, version 3.9. Available at http://www.python.org.

We divided the data into a 7:3 ratio of training set and test set, using the training set to train the prediction model and the test set for the model evaluation. Six machine learning algorithms were used to build the prediction models, including logistic regression, k-nearest neighbor, decision tree, naive Bayes, random forest, and extreme gradient boosting (XGBoost). We used the open-source package sklearn 0.24.2 for model realization and evaluation [19]. Model performances were assessed by the indicators’ accuracy, precision, recall, F1-score, confusion matrix, and the area under the receiver operating characteristic curve (AUC) under 10-fold cross-validation.

Model interpretation

The interpretation of the model is a very important step that helps one to understand the process of model classification. The SHapley Additive exPlanations(SHAP) originated from cooperative game theory and have a solid theoretical foundation [20]. The method is a model-independent solution to model interpretability. We synthetically selected the best-performing model, and used SHAP to calculate the marginal contribution of features to explain the model output, identify significant features in the various classifications, and indicate whether they were positively or negatively correlated.

Results

Participants

We used the collected data of 1278 diabetic patients for modeling of machine learning, all patients have completed EMG results. Nerve conduction abnormalities involving one or more nerves was defined as nerve injury, grouped according to the patient’s nerve conduction outcome. 748 (58.53%) of which did and 530 (41.47%) did not develop DPN. The description of the general situation of the study subjects is presented in Supplementary Table S3.

Feature selection

The 43 variables included in this study comprised 34 continuous variables and nine categorical variables. We used the embedded method and random forest as primary learners to further filter the features and select 16. Supplementary Table S4 presents the weights of these 16 features.

As shown in Table 1, age, alanine aminotransferase, albumin, total bilirubin, urea, total cholesterol, glycosylated hemoglobin, activated partial thromboplastin time (aPTT), 24-h urine protein quantification, urine protein concentration, diabetes duration, neutrophil-to-lymphocyte ratio, and the homeostatic model assessment of insulin resistance (HOMA-IR) index were statistically significant (P<0·05).

Table 1 Single factor analysis of DPN selection variables

Data balancing

The data distribution before and after balancing is shown in Table 2.

Table 2 Comparison of positive and negative samples before and after data balance

Modeling and evaluation

The evaluation results of the six machine learning algorithms are shown in Table 3. The results showed that XGBoost had the best accuracy (0·753 ± 0·032), recall (0·721 ± 0·050), and F1-value (0·744 ± 0·036). K-nearest neighbor showed the highest precision (0·858 ± 0·070) but performed poorly in the other indicators.

Table 3 Comparison of classification results of different models (mean ± std)

The confusion matrix serves as a formalized method for evaluating machine learning models, reflecting the results presented in Table 3. As can be easily deduced from Fig. 2, the overall performance of the Random Forest and XGBoost models is significantly superior to that of the other models, with the XGBoost model having a slight edge over the Random Forest model. Specifically, the XGBoost model slightly outperforms the Random Forest model in terms of accuracy, precision, and recall metrics.

Fig. 2
figure 2

Model classification confusion matrix. XGBoost, Extreme Gradient Boosting; NB, Naive Bayes; LR, Logistic Regression; KNN, K-Nearest-Neighbor; RF, Random Forest; DT, Decision Tree

The performance of the six machine learning algorithms in predicting DPN is shown in Fig. 3. The AUC of XGBoost (0·818) was the largest, followed by random forest (0·804). The decision tree model had the smallest AUC (0·636).

Fig. 3
figure 3

ROC curves for different classification models. XGBoost, Extreme Gradient Boosting; NB, Naive Bayes; LR, Logistic Regression; KNN, K-Nearest-Neighbor; RF, Random Forest; DT, Decision Tree; ROC curve,receiver operating characteristic curve

Descriptive statistics of AUC values for different models were shown in Table 4. Due to the small sample size and non-normal distribution of AUC, we conducted a Wilcoxon signed-rank test on the XGBoost model’s and other models’ AUC values. The AUC values of XGBoost were statistically significantly different from those of LR, KNN, NB, and DT, but not significantly different from RF, as shown in Table 5. Cross-validation results were often used for model selection, and based on the average values of various metrics in Table 3, we have chosen XGBoost.

Table 4 Descriptive Statistics of AUC for Different Models
Table 5 Significance Testing between XGBoost and Other Models

A comprehensive performance analysis of the six models indicated that XGBoost was the optimal model for predicting DPN. We used SHAP to elucidate the relationship between the features and the output of the XGBoost model. From Fig. 4, we can get the top 10 indicators that have the greatest impact on classification, all of which P < 0̵·05. Figure 4 shows that age, disease duration, C2/C0, and total cholesterol were essential features for DPN prediction by XGBoost and significantly impacted the classification results. Age, disease duration, glycated hemoglobin, insulin resistance (IR) index, 24-h urine protein quantification, and urine protein concentration were risk factors, while C2/C0, total cholesterol, aPTT, and creatinine were protective factors.

Fig. 4
figure 4

Comparison of XGBoost Model Interpretations using SHAP across Different Dataset Splits

Discussion

DPN is the most common complication of diabetes in China, in addition to cardiovascular and cerebrovascular diseases. DPN is often neglected in its early stages because nerve conduction studies cannot detect small fiber lesions. The nerve damage caused by DPN is irreversible by the time the disease is fully established. Therefore, it is very important to identify the clinical indicators, predictors, and risk factors of DPN. This study built a DPN prediction model based on XGBoost through machine learning. The high recall rate of the model shows that it has good reliability.

The developed model has several advantages. First, data from two medical centers in Nanjing, China, were used for analysis. DPN diagnosis in all patients followed the nerve conduction study classification, and their pathological data were tested by standardized laboratory techniques, to ensuring the accuracy of the variables in the prediction model. Second, unbalanced datasets make models highly dependent on some specific indicators, leading to over- or underfitting. We used the SMOTETomek method to solve this problem. Third, we constructed multiple classical machine learning DPN prediction models. We determined that the XGBoost model was the best after comparing the models by 10-fold cross-validations. Fourth, the SHAP method was used to explain the relationship between the input features and the output variables of the XGBoost model, ranking the features according to their contribution to the model output. We found several important indicators highly correlated with the progression to DPN. These are highly relevant since the key indicators of disease diagnosis were objectively extracted from real clinical data.

Previous studies have attempted to build predictive models of DPN. Wu et al. [16] established four predictive nomographs and selected the optimal model, but the Toronto Clinical Neuropathy Scoring System score suggested the diagnosis groups included false positives. Kazemi et al. [21] built a DPN prediction model based on multicategory support vector machine; however, by directly selecting the research model, they could not compare it to others, possibly reducing the model’s accuracy. Baskozos et al. [22] used the DOLORisk project dataset, applied machine learning to classify painful and painless DPN, built models, and identified predictive factors. Although their study used one of the largest and most comprehensively phenotyped cohort of people with DPN, its disadvantage was in its uncertain data quality. Metsker et al. [23] used EMRS data to build various machine-learning models for DPN. They found the most effective method for each research task to ensure the high accuracy of the research results. Therefore, it would be useful to build a better DPN prediction model using well-structured, large, and balanced datasets. The previously determined risk factors of DPN include age, diabetes duration, glycosylated hemoglobin, 24-h urine protein quantification, and urine protein concentration [24], consistent with our results. Our research has additional new findings.

The C2/C0 index represents a protective factor of DPN. C2/C0 reflecting the secretory function of pancreatic β cells. The higher the C2/C0 value represents the higher increase in C-peptide after meals. C-peptide is a polypeptide released by pancreatic β cells into the blood at a concentration equal to that of insulin. Laboratory studies have shown that C-peptide can increase endothelial nitric oxide synthase to improve endothelial function, blood flow, and neural function [25]. It can also improve neural function and structural abnormalities by increasing the activity of nerve Na+/K + ATPase and reducing Na + retention [26]. Furthermore, it can improve the gene expression of nerve growth factor, insulin-like growth factor-1, and neurotrophin-3 receptors correcting the mRNA and protein expression of neurofilament and tubulin and normalizing the abnormal phosphorylation of neurofilament [27, 28]. Clinical studies of T1DM showed that C-peptide has a good neuroprotective effect [29]. As T2DM is related to IR, changes in insulin and C-peptide concentrations will occur with the development of the disease. Therefore, there is no clear clinical evidence of the role of C-peptide in T2DM. Most patients in our cohort had T2DM. C2 and C0 were correlated with DPN when we started analyzing the data, but they were not significant enough. Their importance became apparent once they were combined into an index. Our finding confirms the neuroprotective effect C-peptide has in patients with diabetes and further explains the relationship between the secretion function and health of pancreatic β cells and DPN.

Total cholesterol protects against DPN in patients with diabetes. Total cholesterol is the sum of HDL-C, LDL-C, and the small amount of free cholesterol. Cholesterol is closely related to nerves and is an indispensable resource for myelin sheath development [30]. The myelin sheath can ensure the rapid transmission of nerve impulses and maintain normal nerve function in the peripheral nervous system (PNS) [31]. The myelin sheath in the PNS is mainly formed by a repeated wrapping of the Schwann cell membrane around the axon. Most of the cholesterol required to form the myelin sheath is synthesized in the endoplasmic reticulum of the neuronal cell body. However, when a very long axon is damaged away from the cell body, the Schwann cells need to take cholesterol from the circulation to form the myelin sheath [32, 33]. It was suggested that lower serum cholesterol levels might hamper peripheral nerve regeneration [34]. It was proposed that nerve swelling due to changes in Schwann cell lipid components for lack of cholesterol affects axon regeneration after nerve injury [33, 35]. Furthermore, the effect of daily medication on the lipid profile in patients with diabetes should not be ignored. Most patients with diabetes are treated with insulin, which can increase the high-density lipoprotein cholesterol (HDL-C) level, but not the low-density lipoprotein cholesterol (LDL-C) level [36]. Metformin, a commonly used clinical hypoglycemic drug, can reduce LDL-C [37]. It was shown that LDL-C might cause nerve damage [32, 38, 39] while HDL-C protects against it [40]. HDL-C or LDL-C alone did not show a strong correlation with DPN in this study due to the diverse medications used by the patients. However, it can be assumed that changes in the lipid profile after diabetes medication use might affect the nerves. In addition to hypoglycemic drugs, the intake of statins and the related reduction in serum cholesterol level are also associated with accelerated deterioration of neurological symptoms, microvascular injury, and peripheral nerve fiber injury [34, 41, 42]. But in some studies, hyperlipidemia is a risk factor for DPN in patients with T2DM [43]. In clinical practice, most patients need medications to control blood lipids status. Consequently, a lingering question is whether we can predict DPN development based on the proportion of cholesterol in blood lipids, considering the use of drugs to control blood lipid within a certain range.

HOMA-IR is a predictor of DPN in patients with diabetes. It is an index calculated using fasting blood glucose and insulin [44]. HOMA-IR is a good indicator of the degree of IR in the body and has been used in large-scale clinical and epidemiological studies. IR is the main internal environment state in patients with T2DM. Neuronal IR can lead to low insulin signaling and induce DPN progression. IR reduces the Akt signaling transduction by destroying insulin signaling in Schwann cells of the PNS [45]. Alterations in the Akt signaling pathway affect the neuronal mitochondrial function and lead to the subsequent increase in oxidative stress [46]. Glucose can mediate oxidative stress and promote the progression of DPN by inducing mitochondrial biogenesis and fission [47]. Therefore, the disruption to insulin signaling induced by IR makes the PNS neurons more susceptible to metabolic damage. Moreover, Akt regulates the myelination of the PNS nerve fibers by activating Rac1 to enhance membrane encapsulation and synthesizing myelin protein through mammalian target of rapamycin complex 1 [48, 49]. The reduction in Akt signaling transduction caused by IR impairs myelination and enhances DPN progression. IR was associated with DPN in laboratory studies [50], but related clinical data analysis is rare. This study was the first to use the HOMA-IR index as an indicator and explored the correlation between IR and DPN in clinical data.

Our model indicated that aPTT and creatinine were protective factors of DPN. aPTT represents the coagulation ability in the body; the lower the value, the more likely there is a hypercoagulable state. Therefore, it is possible that the nerves have a better blood supply when the body is not in a hypercoagulable state, and the better blood supply delays the progress to DPN. However, considering the obstacles to coagulation in patients with diabetes and the few related studies, this conclusion remains to be explored. Creatinine was mostly associated with diabetic nephropathy in studies on diabetic complications. Its association with DPN should be further explored as there are too few studies on the topic.

The results obtained by the model helped better understand the importance of each feature to the model’s prediction. Among the indicators detected by the model, the ten most closely related to DPN were age, diabetes duration, C2/C0, total cholesterol, glycosylated hemoglobin, HOMA-IR index, aPTT, 24-h urine protein quantification, creatinine, and urine protein concentration. The high correlation between age, diabetes duration, and DPN further highlighted the importance of early intervention to prevent complications in patients with diabetes.

Our study had several limitations that must be considered. First, all participants had diabetes, so we only explored diabetes-related indicators. The prediction results could have been different if we had included a control group of healthy individuals. Second, we lack the collection and analysis of patients’ subjective description. These will need to be addressed in future research. Third, all the indicators in this study were continuous variables, and the data analysis may be segmental. For example, the protective effects of aPTT and creatinine on DPN should be restricted to a certain threshold range. Future studies should focus on the impact of indicators within different thresholds. Fourth, the prediction model data came mostly from Nanjing, Jiangsu Province. The applicability of our results to other regions remains to be verified.

In conclusion, we established a DPN risk prediction model, which showed good performance. Through the model, we identified the factors most closely related to DPN. Our team will address the existing problems and strive to establish a better DPN prediction model through future research to help doctors quickly and accurately judge the corresponding prognosis for improved, reliable, and convenient personalized treatment and management of patients with diabetes.