Introduction

Type 2 diabetes mellitus (T2DM) is a major challenge to public health worldwide, and the assessment and management of this chronic disease impose a heavy economic burden [1, 2]. T2DM is associated with many complications and problematic symptoms, including micro- and macrovascular complications [3, 4]. Among these complications, diabetic kidney disease (DKD) is a leading cause of chronic kidney disease (CKD) and is associated with a future risk of progression to end-stage renal disease (ESRD) [5, 6]. However, a diagnosis of DKD is often delayed, particularly in the early stages of the disease, because most patients remain asymptomatic with respect to kidney dysfunction [7]. Therefore, identifying DKD patients with a rapid decline in the estimated glomerular filtration rate (eGFR) might be helpful for allowing early nephroprotective treatment to be administered to delay or prevent the progression of kidney failure.

Previous large-scale population-based cohort studies have identified multiple factors potentially contributing to rapid eGFR decline, such as hypertension [8, 9], proteinuria [10], demographic factors, and underlying comorbidities [7]. A meta-analysis of demographic and clinical laboratory data from twenty cohorts representing 41,271 T2DM patients was conducted to develop a categorization point system for DKD prediction [11]. The prediction model achieved an average area under the receiver operating characteristic curve (AUC) of 0.765. Because electronic health record usage provides hundreds of clinical features and a large volume of data, prediction models using a categorization point system may be insufficient to effectively make use of unaligned and correlated data structures. Recently, artificial intelligence (AI) has changed modern procedures, and the progress of machine learning with big data analysis has improved the capacity of predictive model development [12].

In a cohort study consisting of diabetic patients, an AI model using logistic regression was developed to predict the progression of DKD according to 3073 features [13], and it achieved an AUC of 0.743 and an average accuracy of 71%. However, only logistic regression was applied in this study, and the predictive ability of other machine learning models with respect to renal function progression in diabetic patients remains unknown. In addition, in the abovementioned study, the AI model predicted DKD progression for 6 months after the enrollment period; therefore, its predictive ability for a longer follow-up period is unknown.

In our study, we used a large-scale cohort of patients with newly diagnosed DM to develop machine learning models that use clinical features, including demographic characteristics, comorbidities, laboratory data and concomitant medications from outpatient department and emergency room visits as well as hospital admissions, to predict the risk of developing ESRD over a long follow-up period. We also used SHapley Additive exPlanations (SHAP) values to quantify the attribution of each important feature within the machine learning models.

Methods

Data sources and study population

We constructed a 10-year retrospective longitudinal T2DM cohort covering January 2008 to December 2018, based on the records of patients with newly diagnosed T2DM from the Big Data Center, which include detailed patient demographics, underlying comorbidities, medication prescriptions, and laboratory data from all inpatient, outpatient and emergency services [14]. Patients without at least two eGFR values were excluded from our analyses. In addition, we excluded T2DM patients who had undergone renal replacement therapy, such as hemodialysis, peritoneal dialysis, or kidney transplantation, before enrollment. This study was approved by the Institutional Review Board (Taipei Veterans General Hospital, Approval no. 2022–03-006 AC), and the need for informed consent was waived because the data were deidentified.

Feature selection

We extracted 78 features used for machine learning, including demographic characteristics, underlying comorbidities, laboratory data and concomitant drugs. The demographic characteristics included age, gender, smoking and alcohol consumption. Underlying comorbidities included histories of hypertension, transient ischemic attack, ischemic stroke, hemorrhagic stroke, myocardial infarction, coronary artery disease, congestive heart failure, chronic liver disease, cirrhosis, peptic ulcer disease, autoimmune disease, chronic obstructive pulmonary disease, asthma, peripheral arterial occlusive disease, cancer, gout, atrial fibrillation, valvular heart disease and diabetic retinopathy. The laboratory data included baseline serum creatinine, mean serum creatinine assessed within 1 year before the diagnosis of T2DM, cholesterol, triglycerides, low-density lipoprotein cholesterol, high-density lipoprotein cholesterol, uric acid, calcium, phosphate, white blood cells, hemoglobin, albumin, alanine aminotransferase, aspartate aminotransferase, total bilirubin, direct bilirubin, alkaline phosphatase, gamma-glutamyl transferase, glycated hemoglobin, glucose, the international normalized ratio, activated partial thromboplastin time, high-sensitivity C-reactive protein, iron, thyroid-stimulating hormone, free thyroxine, and spot urine protein-to-creatinine ratio (UPCR). Concomitant medications included renin-angiotensin-aldosterone system (RAAS) inhibitors, alpha blockers, beta blockers, calcium channel blockers, warfarins, direct oral anticoagulants, aspirins, clopidogrels, nitrates, statins, diuretics, spironolactones, metformins, sulfonylureas, meglitinides, sodium–glucose cotransporter 2 inhibitors, glucagon-like peptide-1 receptor agonists, dipeptidyl peptidase-4 inhibitors, thiazolidinediones, alpha-glucosidase inhibitors, insulins, nonsteroidal anti-inflammatory drugs, cyclooxygenase-2 inhibitors, proton pump inhibitors, steroids, allopurinols, febuxostats and benzbromarones.

Class definition

In our study, the class was annotated as 1 if ESRD occurred during the follow-up period (defined as eGFR < 15 mL/min/1.73 m² or the receipt of maintenance dialysis or a kidney transplant) and as 0 otherwise. We calculated eGFR using the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equations [15].
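The class definition can be illustrated with a minimal sketch. The text cites the CKD-EPI equations without stating the version, so the sketch assumes the 2009 creatinine-based equation and omits its race coefficient; the function names `ckd_epi_2009` and `esrd_label` are hypothetical and are not taken from the study code.

```python
def ckd_epi_2009(scr_mg_dl: float, age_years: float, female: bool) -> float:
    """eGFR (mL/min/1.73 m^2) from serum creatinine.

    Assumption: CKD-EPI 2009 creatinine equation, race coefficient omitted.
    """
    kappa = 0.7 if female else 0.9          # sex-specific creatinine threshold
    alpha = -0.329 if female else -0.411    # sex-specific exponent below the threshold
    ratio = scr_mg_dl / kappa
    egfr = 141.0 * min(ratio, 1.0) ** alpha * max(ratio, 1.0) ** -1.209 * 0.993 ** age_years
    if female:
        egfr *= 1.018
    return egfr


def esrd_label(egfr_values, on_dialysis: bool = False, transplanted: bool = False) -> int:
    """Return 1 if any follow-up eGFR is below 15 or renal replacement occurred, else 0."""
    return int(on_dialysis or transplanted or any(v < 15 for v in egfr_values))


# Hypothetical follow-up creatinine series for a 70-year-old man.
follow_up = [ckd_epi_2009(scr, 70, female=False) for scr in (1.1, 2.4, 4.8)]
label = esrd_label(follow_up)
```

In this hypothetical series the last creatinine value corresponds to an eGFR below 15 mL/min/1.73 m², so the patient would be annotated as class 1.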

Data cleaning and machine learning model development

Categorical variables are presented as numbers (proportions), and continuous variables are presented as medians (interquartile ranges [IQRs]). To impute missing values of the clinical features, the K-nearest neighbor (KNN) algorithm was applied before model training [16, 17]. For model development, the study cohort was randomly divided into a training set and a test set at a 70%:30% ratio. Because the number of ESRD cases was much smaller than the number of non-ESRD cases, we applied the synthetic minority over-sampling technique (SMOTE)-Tomek algorithm to balance the classes [18, 19]. Six machine learning models were developed: logistic regression, extra trees [20], random forest [21], gradient boosting decision tree (GBDT) [22], extreme gradient boosting (XGBoost) [23], and light gradient boosting machine (LGBM) [24]. We used forward feature selection for dimensionality reduction, which selects the most useful subset of features from all available features [25, 26]. Five-fold cross-validation was performed on the training set to estimate the performance and validate the stability of the machine learning models [27, 28].
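The preprocessing steps described above can be sketched as follows. This is an illustration on synthetic data, not the study pipeline. The stratified split and fitting the imputer on the training split only are assumptions. The `smote` and `remove_tomek_links` helpers are simplified, hand-written stand-ins for a SMOTE-Tomek implementation, written only to show the idea.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def smote(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating toward minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours because each row is its own nearest neighbour.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)
    neighbour = idx[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[neighbour] - X_min[base])


def remove_tomek_links(X, y):
    """Drop both members of each Tomek link (mutual nearest neighbours with different labels)."""
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    nearest = idx[:, 1]
    mutual = nearest[nearest] == np.arange(len(X))
    is_link = mutual & (y[nearest] != y)
    return X[~is_link], y[~is_link]


# Synthetic stand-in for the cohort: roughly 8% positives and 10% missing values.
rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 6))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 1.6).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# 70%:30% split; stratification is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# KNN imputation, fitted on the training split only (assumption).
imputer = KNNImputer(n_neighbors=5).fit(X_train)
X_train, X_test = imputer.transform(X_train), imputer.transform(X_test)

# Oversample the minority class to parity, then clean Tomek links.
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
X_bal = np.vstack([X_train, smote(X_train[y_train == 1], n_new)])
y_bal = np.concatenate([y_train, np.ones(n_new, dtype=int)])
X_bal, y_bal = remove_tomek_links(X_bal, y_bal)
```

Only the training split is resampled in this sketch, so the test split keeps the original class imbalance and the reported metrics reflect the real prevalence.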

Hyperparameter optimization

A grid search in combination with five-fold cross-validation was conducted to optimize the hyperparameters of logistic regression, extra trees, random forest, GBDT, XGBoost, and LGBM to achieve the best F1 score [29, 30, 31]. The details of hyperparameter optimization for each model are listed in Table 1. A grid search evaluates every combination of the candidate values supplied for each hyperparameter and retains the best-scoring combination.
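A grid search scored by F1 under five-fold cross-validation might look like the sketch below. The candidate values in `param_grid` are hypothetical and do not reproduce Table 1, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the training set.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, weights=[0.8, 0.2], random_state=0
)

# Hypothetical candidate values; every combination is evaluated.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",   # select the combination with the best mean F1 across folds
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the other classifiers by swapping the estimator and its grid.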

Table 1 Hyperparameters of machine learning models

Model evaluation

The discriminative abilities of the different machine learning models were compared based on their AUCs. In addition, the F1 score, accuracy, precision, recall, average precision and log loss of each model on the testing dataset are presented. SHapley Additive exPlanations (SHAP) was used to explain the predicted risk of developing ESRD in T2DM patients by assigning attribution values to the clinical features within a unified framework for interpreting model predictions.
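The evaluation indices listed above can be computed on a held-out test set as in this minimal sketch. It uses synthetic data and a single logistic regression model, and the 0.5 decision threshold is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=12, n_informative=5, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]   # predicted probability of the positive class
pred = (proba >= 0.5).astype(int)         # assumed decision threshold

metrics = {
    # Threshold-free indices use the predicted probabilities.
    "auc": roc_auc_score(y_te, proba),
    "average_precision": average_precision_score(y_te, proba),
    "log_loss": log_loss(y_te, proba),
    # Threshold-dependent indices use the hard predictions.
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred, zero_division=0),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a rare positive class, accuracy alone can look high even when recall is poor, which is one reason to report the full set of indices.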

Software and packages used for modeling

We used Python (Python Software Foundation, version 3.7.6, available at http://www.python.org) and the open-source Scikit-learn library to establish the machine learning models, and SAS version 9.4 (SAS Institute, Cary, NC) for statistical analysis [32]. The Python packages used were sklearn.impute.KNNImputer for missing value imputation, sklearn.model_selection.train_test_split for randomly dividing the data into training and test sets, sklearn.model_selection.GridSearchCV for hyperparameter optimization, sklearn.linear_model.LogisticRegression for the logistic regression model, sklearn.ensemble.ExtraTreesClassifier for the extra trees model, sklearn.ensemble.RandomForestClassifier for the random forest model, sklearn.ensemble.GradientBoostingClassifier for the GBDT model, the XGBoost Python package for the XGBoost model, lightgbm.LGBMClassifier for the LGBM model, and sklearn.model_selection.StratifiedKFold for cross-validation. A P value < 0.05 was considered statistically significant.
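To show how the named Scikit-learn classes fit together, the sketch below compares four of the classifiers by mean AUC under stratified five-fold cross-validation. The data are synthetic and the hyperparameters are arbitrary. The XGBoost and LGBM models are left out only to keep the example dependent on Scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=600, n_features=12, n_informative=5, weights=[0.85, 0.15], random_state=1
)

# Arbitrary settings chosen for speed, not the tuned values from the study.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "extra_trees": ExtraTreesClassifier(n_estimators=50, random_state=1),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "gbdt": GradientBoostingClassifier(n_estimators=50, random_state=1),
}

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}
best = max(cv_auc, key=cv_auc.get)
print(best, round(cv_auc[best], 3))
```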

Results

Characteristics and distribution of patients

A total of 105,234 T2DM patients aged > 20 years were identified during the 10-year study period. We excluded 34,059 patients with no eGFR measurements, 16,351 patients without at least two eGFR values, and 1347 patients who had received renal replacement therapy, which resulted in a final cohort of 53,477 T2DM patients. The detailed patient demographic data are provided in Table 2. The median patient age was 67.05 years (IQR 57.37 to 77.74 years), and 41.4% of the patients were female. In addition, 58.2% of patients had hypertension, 19.8% had coronary artery disease, and 23.4% had cancer. Regarding renal function, the median baseline serum creatinine level was 0.94 mg/dL (IQR 0.75 to 1.27 mg/dL), and the median of the mean serum creatinine within 1 year before the diagnosis of T2DM was 0.95 mg/dL (IQR 0.76 to 1.26 mg/dL). The dataset was randomly divided into a training set (70%) and a testing set (30%). Of all the T2DM patients, 4769 (8.9%) developed ESRD: 3334 (8.9%) in the training set and 1435 (8.9%) in the testing set.
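The cohort flow and event proportions reported above can be checked with simple arithmetic. The exact sizes of the training and testing sets are not reported, so the sketch assumes a rounded 70%:30% split.

```python
identified = 105_234
excluded = {
    "no eGFR measurement": 34_059,
    "fewer than two eGFR values": 16_351,
    "renal replacement therapy": 1_347,
}
final_cohort = identified - sum(excluded.values())

esrd_train, esrd_test = 3_334, 1_435
esrd_total = esrd_train + esrd_test
esrd_pct = round(100 * esrd_total / final_cohort, 1)

# Assumed split sizes (not reported in the text): rounded 70% for training.
n_train = round(0.70 * final_cohort)
n_test = final_cohort - n_train
train_pct = round(100 * esrd_train / n_train, 1)
test_pct = round(100 * esrd_test / n_test, 1)

print(final_cohort, esrd_total, esrd_pct, train_pct, test_pct)
```

Under this assumption the three proportions agree with the reported 8.9%.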

Table 2 Demographics and clinical features between T2DM patients

Model prediction ability

Six machine learning models, i.e., logistic regression, extra trees classifier, random forest, GBDT, XGBoost, and LGBM, were developed. The AUCs and other performance indices, such as accuracy, F1 score, precision, recall and average precision, achieved by the machine learning models after data augmentation are presented in Supplementary Table 1. Five-fold cross-validation of the XGBoost model yielded a mean AUC of 0.984 (Supplementary Fig. 1). On the testing dataset, the XGBoost model had the highest predictive ability, with an AUC of 0.953, followed by the extra trees model with an AUC of 0.952 (Fig. 1).

Fig. 1

A Receiver operating characteristic curves and B precision–recall curves of machine learning models on the testing dataset. C XGBoost yielded the highest area under the ROC curve for prediction of end-stage renal disease followed by extra trees classifier and GBDT on the testing dataset. Abbreviations: ROC, receiver operating characteristic; PR, precision–recall; AUC, area under curve of receiver operating characteristic curve; A.precision, average precision; AUC PRC, area under curve of precision-recall curve; GBDT, gradient boosting decision tree; XGBoost, extreme gradient boosting; LGBM, light gradient boosting machine

Ranks of feature importance and SHAP values in the XGBoost model

We plotted the feature importance of the XGBoost model based on SHAP values and listed the most important features in descending order of impact (Fig. 2A). The top five important features were baseline serum creatinine, mean serum creatinine within 1 year before the diagnosis of T2DM, high-sensitivity C-reactive protein, UPCR and female gender. The impact of each feature on the model output is also illustrated in the SHAP summary plot (Fig. 2B). A higher SHAP value for a feature indicates a larger contribution toward a prediction of ESRD. SHAP values in red dots indicate an increase in prediction, while those in blue dots indicate a decrease in prediction. Baseline serum creatinine, mean serum creatinine within 1 year before the diagnosis of T2DM, high-sensitivity C-reactive protein, and UPCR showed positive impacts on the predicted risk of developing ESRD, while female gender showed a negative impact.
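The idea behind ranking features by SHAP values can be shown with a brute-force Shapley computation on a toy model. This is not the tree-based algorithm applied to the XGBoost model; it assumes a hypothetical linear model and replaces absent features with values from a background sample. The helper `shapley_values` is written for illustration only.

```python
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(predict, x, background):
    """Exact Shapley attributions for one instance x (feasible only for few features)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the instance's values; the rest keep background values.
        cols = np.array(subset, dtype=int)
        data = background.copy()
        data[:, cols] = x[cols]
        return predict(data).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


rng = np.random.default_rng(0)
background = rng.normal(size=(100, 3))
w, b = np.array([3.0, -1.0, 0.25]), 0.3      # hypothetical linear model


def predict(data):
    return data @ w + b


x = np.array([1.5, 0.2, -0.7])
phi = shapley_values(predict, x, background)

# Global importance: mean absolute attribution over many instances, largest first.
all_phi = np.array([shapley_values(predict, row, background) for row in background])
importance = np.abs(all_phi).mean(axis=0)
ranking = np.argsort(importance)[::-1]
```

For a linear model each attribution reduces to the coefficient times the feature's deviation from its background mean, and the attributions sum to the difference between the instance's prediction and the average prediction. The sign of an attribution gives the direction of a feature's effect, and the mean absolute attribution gives its rank.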

Fig. 2

A The feature importance plot and B SHAP summary plot showed the top clinical important features for predicting risks of developing end-stage renal disease in the XGBoost model. Abbreviations: XGBoost, extreme gradient boosting; HSCRP, high-sensitivity C-reactive protein; UPCR, spot urine protein-to-creatinine ratio; ALT, alanine transaminase; DPP4i, dipeptidyl peptidase 4 inhibitors; HGB, hemoglobin; HbA1c, glycated hemoglobin; ALB, albumin; NSAID, nonsteroidal anti-inflammatory drug; HTN, hypertension; INR, international normalized ratio; PI, phosphate

The dependence plots of interactions between serum creatinine, high-sensitivity C-reactive protein, UPCR and female gender

As shown in Fig. 3, the dependence plots illustrate the SHAP values and the interactions between serum creatinine, high-sensitivity C-reactive protein, UPCR and female gender in the XGBoost model. The risk of developing ESRD increased as baseline or mean serum creatinine increased and then reached a plateau when creatinine exceeded 5 mg/dL (Fig. 3A–B). Figure 3C–F illustrates the interactions between the SHAP values of baseline serum creatinine and those of mean serum creatinine, high-sensitivity C-reactive protein, UPCR and female gender. The values on the y-axis indicate the interaction SHAP values between baseline serum creatinine and the other important features, and the values on the x-axis are the levels of baseline serum creatinine. Mean serum creatinine, high-sensitivity C-reactive protein, UPCR and female gender were positively correlated with the predictive value of baseline serum creatinine.

Fig. 3

The plots of SHAP value of (A) baseline serum creatinine and (B) mean serum creatinine within 1 year before diagnosis showed increased creatinine levels were associated with increased SHAP values. SHAP interaction plots showed the interaction impacts between baseline serum creatinine and (C) mean serum creatinine (D) HSCRP, (E) UPCR, and (F) female gender on the prediction model’s output. Abbreviations: SHAP, SHapley Additive exPlanations; HSCRP, high-sensitivity C-reactive protein; UPCR, spot urine protein-creatinine ratio

Discussion

In the current study, we developed machine learning models to predict the development of ESRD among T2DM patients based on electronic medical records. We used the machine learning system to conduct feature selection and compare the AUCs among the different machine learning models. Among the algorithms compared, the XGBoost model had the highest predictive performance, with an AUC of 0.953 on the testing dataset. The top five important features were baseline serum creatinine, mean serum creatinine within 1 year before the diagnosis of T2DM, high-sensitivity C-reactive protein, UPCR and female gender.

Previous studies in nondiabetic populations have attempted to find useful markers to predict ESRD. A Norwegian large-scale general health study including 65,589 adults aged > 20 years from 1995 through 1997 established a clinical predictive model (incorporating age, gender, physical activity, diabetes, systolic blood pressure, antihypertensive medication, and high-density lipoprotein) for the future risk of ESRD, and the AUC reached 0.864 [33]. After adding albuminuria and eGFR, the AUC of the model increased to 0.936. Ishani et al. [34] studied 12,866 men who were at high risk for heart disease and found that dipstick proteinuria, eGFR < 60 mL/min/1.73 m², and hematocrit were related to the development of ESRD. Because the study populations were limited to nondiabetic populations, the findings of these studies may not be generalizable to T2DM groups. For diabetic patients, proteinuria [35, 36], diabetic retinopathy [37, 38], increased glycated hemoglobin levels [39], hypertension [40], and cardiovascular diseases [41, 42] may precede kidney function decline and have been demonstrated to be associated with renal function progression.

A customized software program for CKD risk identification in Australia (the Electronic Diagnosis and Management Assistance to Primary Care in Chronic Kidney Disease (eMAP:CKD) program) was developed to integrate primary care electronic health records from more than 150,000 patients [43]. After the initiation of the program, there was a significant improvement in CKD documentation from 0.48 to 1.55%. In addition, the proportions of at-risk patients diagnosed with CKD at 15 months were found to be significantly increased from 7.8 to 24.40%. Furthermore, recent studies have applied AI to predict the risks of CKD. Kanda et al. [44] conducted a study including 7465 subjects and found that AI models with support vector machine (SVM) models can help predict CKD progression in both high-risk and low-risk subjects. After the 3-year follow-up, the accuracy of the SVM models was increased. Chen et al. [45] used three different models, i.e., K-nearest neighbor (KNN), SVM, and soft independent modeling of class analogy (SIMCA), to analyze data from 386 patients with or without CKD for clinical risk assessment and achieved accuracies over 93%. In their study, KNN and SVM achieved better performance than SIMCA. Almansour et al. [46] studied data from 400 patients with the goal of diagnosing CKD at an early stage and found that artificial neural networks (accuracy: 99.75%) performed better than SVMs (accuracy: 97.75%).

Although several studies have developed machine-learning models to detect diabetes and diabetic complications, to date, only one machine learning model has been developed to detect renal function progression in diabetic patients. Makino et al. [13] conducted a longitudinal data analysis with big data representing diabetes patients with stage 1 to 2 diabetic nephropathy and found that logistic regression models can predict DKD aggravation with 71% accuracy. A higher risk of hemodialysis was associated with DKD aggravation than with nonaggravation. However, the study was limited to the early stage of DKD and a single machine learning model with logistic regression. In our study, we found that the machine learning XGBoost model predicted the risk of developing ESRD, achieving an AUC value of 0.953 on the testing dataset.

With a positive SHAP value, the machine learning models revealed that baseline serum creatinine showed the greatest impact on predicting the risk of developing ESRD. A previous study found that better baseline renal function was protective against renal function decline [47]. Our models also found that mean serum creatinine within 1 year before diagnosis of T2DM was an important predictor of developing ESRD. The possible explanation may be that mean serum creatinine is reflective of the usual renal status. According to the SHAP dependence plots, the interaction with high-sensitivity C-reactive protein increases the prediction of risks of developing ESRD. Elevated high-sensitivity C-reactive protein was found to be independently associated with an increased risk of renal function decline in patients with diabetes and the general non-diabetic population [48, 49]. Higher UPCR levels at the time of diagnosis of T2DM were also associated with higher risks of developing ESRD, which was similar to previous research that found a positive correlation between UPCR and ESRD [50]. In contrast, female gender was associated with lower SHAP values and decreased risks of developing ESRD. A previous study also found that renal function decline in women was slower compared to men among middle-aged and elderly individuals [51].

Our study has several strengths. We established a predictive model by inputting big EMR data into the machine learning algorithm. The novelty of this study is the use of a 10-year longitudinal cohort to predict the risk of developing ESRD in newly diagnosed T2DM patients with baseline median creatinine of 0.94 mg/dL. The machine learning algorithm compared discriminative ability among different machine learning models and selected the best models. This approach offers not only improvement in AUCs but also selection of the best predicting model in cases where it is unclear what machine learning models are most suitable. In addition, the SHAP algorithm was used to interpret the model predictions, and the impacts of important features on developing ESRD were explored. Using SHAP summary plots, we demonstrated the strength and direction of each feature (positive or negative effects).

Our study also has real and perceived limitations. First, as patient information, including demographic data, underlying comorbidities and concomitant medications, was obtained from electronic health record systems and coding procedures, we could not identify mild diseases without coding in T2DM patients. Second, the inclusion of data on the duration and frequency of laboratory visits was not uniform but varied among patients. Finally, the training data and testing data were from the same dataset. Further validation in other cohorts is necessary.

Conclusion

Our machine learning models employing longitudinal data from electronic health records were effective in predicting the risk of developing ESRD in T2DM patients in real-world clinical scenarios over a 10-year observation period. In addition, we used the SHAP method to provide explanations for the selected features to interpret model predictions. The developed model has the potential to identify T2DM patients at increased risk of developing ESRD and thus to prompt the initiation of prevention or treatment plans for these patients. In the future, external validation studies are necessary before convenient machine learning models can be developed for widespread use in clinical practice.