Introduction

Cardiovascular disease (CVD) and type 2 diabetes mellitus (T2DM) are intertwined health challenges that have profound implications for global morbidity and mortality1. T2DM is recognized as a risk factor for cardiovascular complications, contributing to an increased disease burden and adverse outcomes1. According to the 2022 Diabetes Fact Sheet in Korea, patients with prediabetes and diabetes have 1.05 and 1.59 times higher risks of myocardial infarction and 1.05 and 1.51 times higher risks of heart failure, respectively, than normoglycemic adults2. Preventive interventions for T2DM can prevent long-term complications and are reportedly cost-effective3,4. Cardiovascular risk factors known to date include duration of diabetes, obesity/overweight status, hypertension, dyslipidemia, smoking, a family history of early coronary artery disease, chronic kidney disease, and albuminuria3,4. All patients with diabetes should undergo annual assessments and management to identify modifiable abnormal risk factors5.

With the recent development of artificial intelligence, new technologies, such as machine learning (ML), have been applied in various fields. ML has been applied to texture filtering in computer graphics and images6 and rolling element-bearing fault identification in mechanical engineering. It is attracting attention as a new approach that overcomes the limitations of traditional methods and shows superior performance6,7. ML is gaining traction in the medical field, and efforts are being made to apply ML technologies to the existing disease models8,9. A study that used an ML algorithm to predict recurrent spontaneous abortion constructed a practical framework using clinical information from real clinical data, such as vitamin D and thyroid function tests, in the analysis and achieved a sensitivity of 93.3% and specificity of 93.1%10. The study involving a novel transfer learning-based ensemble classifier for detecting COVID-19 infection in chest computed tomography scans provides an important diagnostic tool for the field of medical imaging and is expected to have a significant impact on the diagnosis and management of various other lung diseases in the future11. ToxMVA, a deep learning-based method used for predicting protein toxicity in the early stages of protein-based drug discovery, has shown the potential to shorten the process and reduce the cost of drug screening12. Thus, ML is a powerful tool for addressing existing limitations by leveraging clinical data to uncover hidden patterns and identifying critical variables associated with disease development. By integrating ML algorithms with clinical data, clinicians can efficiently identify early warning signs and risk factors for complications13.

Various methods have been developed to assess and predict CVD risk, including atherosclerotic CVD risk calculators and biomarkers14,15. However, traditional risk assessment tools remain insufficient, and the complex interactions between clinical variables are poorly understood8,9. Various predictive models using ML have recently been developed for CVD prediction in T2DM; however, their predictive power is limited because multiple risk factors have not been included in these models16,17,18. Therefore, this study aimed to identify the relationship between clinical factors and cardiovascular complications and to develop a predictive model for CVD occurrence in patients with T2DM in South Korea through rigorous model training and validation, utilizing the strengths of ML technology. By identifying significant predictors within the model, we sought to understand the nuanced relationship between these factors and cardiovascular risk in patients with T2DM. We present our methodology, including a thorough validation process in two independent Korean cohorts, which strengthens the reliability and applicability of our findings. We also discuss the potential of the predictive model developed in this study, emphasizing its role as an efficient and accurate tool for CVD risk stratification and its contribution to the field of personalized medicine to improve CVD outcomes.

Materials and methods

Study population and data collection

The data used in this retrospective study were obtained from two independent longitudinal cohorts previously enrolled in an observational study. Hospital-based data consisted of electronic medical records from outpatient, inpatient, and emergency department visits collected between January 1, 2008, and December 31, 2022. Eligible participants were selected from among patients with T2DM, excluding those with type 1 diabetes and previous onset of CVD. Finally, 12,809 patients were selected from a tertiary hospital at the Kyung Hee University Medical Center for the discovery cohort. Data for extra-validation were collected from a retrospective dataset from the secondary hospitals Kyung Hee University Medical Center at Gangdong and Gachon University Gil Hospital (validation cohort), and 2019 eligible patients were selected (Fig. 1).

Figure 1. Study workflow.

Input variables

A comprehensive set of 68 variables was included in the model. Patients’ baseline demographic characteristics included age and sex. Medical histories included the presence of hypertension, dyslipidemia, macrovascular complications (cerebrovascular disease, dementia, Parkinson’s disease, and lower limb amputation), microvascular complications (diabetic retinopathy, proliferative diabetic retinopathy, diabetic neuropathy, and chronic kidney disease), and cancer. Medication history included types of antidiabetic drugs (metformin, sulfonylurea, dipeptidyl peptidase-4 inhibitor, meglitinide, thiazolidinedione, α-glucosidase inhibitor, insulin, glucagon-like peptide-1 receptor agonist, and sodium–glucose co-transporter 2 inhibitor), antihypertensive drugs (angiotensin II receptor blocker, angiotensin-converting enzyme inhibitor, calcium channel blocker [CCB], diuretics, and beta blocker), dyslipidemia drugs (statin, fibrate, ezetimibe, omega-3, and other dyslipidemia drugs), and antiplatelet agents (aspirin, clopidogrel, cilostazol, glycoprotein IIb/IIIa antagonist, and other antiplatelet agents). The mean and range of clinical parameters included body mass index (BMI)19, systolic blood pressure, diastolic blood pressure, and pulse rate. The mean and range of blood tests included glycated hemoglobin (HbA1c), serum glucose, total cholesterol, triglycerides, high-density lipoprotein cholesterol, low-density lipoprotein (LDL) cholesterol, serum creatinine, aspartate aminotransferase (AST), alanine aminotransferase (ALT), gamma-glutamyl transferase, and alkaline phosphatase (ALP) levels.

Identification of new CVD cases

New-onset CVD among patients with T2DM was identified using the International Classification of Diseases, 10th Revision (ICD-10) codes for ischemic heart disease and myocardial infarction (I20.X–I25.X), heart failure (I50.X), and atrial fibrillation (I48.X). The primary endpoint was a new CVD diagnosis within 3 years.
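
The code ranges above can be sketched as a simple pattern match. This is a minimal illustration, not the study's extraction code: the regular expression, the function names, and the assumption that codes are stored with a dot before the subcategory (for example, "I21.0") are all hypothetical.

```python
import re

# Assumed pattern: I20-I25 (ischemic heart disease, including myocardial
# infarction), I48 (atrial fibrillation), and I50 (heart failure), each
# optionally followed by a dotted subcategory such as ".9".
CVD_CODE_PATTERN = re.compile(r"^I(2[0-5]|48|50)(\.[0-9A-Z]{1,2})?$")


def is_cvd_code(code: str) -> bool:
    """Return True if an ICD-10 code falls in the CVD ranges listed above."""
    normalized = code.strip().upper()
    return CVD_CODE_PATTERN.match(normalized) is not None


def has_cvd_code(codes: list[str]) -> bool:
    """Return True if any code recorded for a patient is a CVD code."""
    return any(is_cvd_code(code) for code in codes)
```

Restricting matches to the 3-year window would additionally require comparing diagnosis dates against each patient's index date, which is not shown here.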

Data preprocessing

Missing data were excluded from analysis. Covariates were divided into three categories: (1) demographic data, (2) physical examinations and blood tests, and (3) medication and comorbidity information. For physical examinations and blood tests, we used only measurements dated before CVD onset. Each patient's measurements over this period were summarized as a mean, and the range was calculated by subtracting the minimum from the maximum recorded value. For medications and comorbidities, information from the first visit was used as the covariate.
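
As a minimal sketch of the per-patient summary described above, the following computes a mean and a range (maximum minus minimum) from repeated measurements. The function name and the example values are hypothetical, and dropping individual missing values is only one possible reading of how missing data were excluded.

```python
from statistics import mean


def summarize_measurements(values):
    """Return (mean, range) for one patient's repeated measurements.

    Missing values (None) are dropped first; this is an assumption about
    how missing data were handled, made for illustration only.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return None, None
    return mean(observed), max(observed) - min(observed)


# Hypothetical HbA1c values (%) recorded before CVD onset for one patient.
hba1c = [7.0, 8.0, None, 9.0]
hba1c_mean, hba1c_range = summarize_measurements(hba1c)
```

In this example the mean is 8.0 and the range is 2.0, so a patient with stable values and a patient with fluctuating values can share the same mean but differ in range.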

Model training and validation

A common ML approach for prediction involves splitting data into training and test sets. In this study, however, the number of target events (incident CVD within 3 years) was too small to allow such a split. Consequently, the model was trained on the entire dataset, rather than being split for internal validation. Instead, a separate external dataset was used to assess the extent to which the model could be generalized. External validation of this kind is essential to verify whether the model performs well on previously unseen data.

Model development

We selected decision-tree-based ensemble models, such as the XGBoost (XGB), random forest (RF), LightGBM (LGM), and AdaBoost (ADB), and linear classification models, such as logistic regression (LR) and support vector machine (SVM). Among these, the XGB, RF, and LGM models are the most common and practical models for handling a mixture of categorical and continuous variables. For the SVM model, we chose a linear kernel because of its simplicity and efficiency, particularly for high-dimensional data, where the number of features is much larger than the number of samples. Linear kernels are beneficial when data are linearly or nearly linearly separable.

To optimize the performance of each model, we performed hyperparameter tuning using GridSearchCV and maximized the area under the receiver operating characteristic curve (AUROC) to determine the best combination of hyperparameters.
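
A minimal sketch of AUROC-driven tuning with scikit-learn's `GridSearchCV` is given below. The data are synthetic, and the parameter grid, fold count, and random seeds are hypothetical choices for illustration rather than the settings used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (about 10% positives); not study data.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Hypothetical grid, kept small so the example runs quickly.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, 5]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",  # pick the combination with the highest mean AUROC
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_  # best hyperparameter combination
best_auroc = search.best_score_    # its mean cross-validated AUROC
```

The same pattern applies to any of the other estimators by swapping the `estimator` and `param_grid` arguments.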

ML analysis

Various tree-based and linear classification models were used to predict the occurrence of CVD. For each model, we used GridSearchCV with the AUROC as the scoring metric to optimize the hyperparameters. Once the optimal hyperparameters were determined, the model was trained for subsequent predictions. To address the class imbalance in our data, we used the synthetic minority over-sampling technique (SMOTE) to generate synthetic samples of the minority class.
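
The core idea of the synthetic minority over-sampling technique can be sketched in a few lines of NumPy: each synthetic sample is an interpolation between a minority-class sample and one of its nearest minority-class neighbours. This is an illustrative re-implementation under stated assumptions, not the implementation used in this study; the function name, the default of five neighbours, and the example rows are hypothetical. In practice a maintained library implementation (for example, `SMOTE` in imbalanced-learn) would normally be preferred.

```python
import numpy as np


def smote_sketch(X_minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority-class neighbours.
    Requires at least two minority samples."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_synthetic)
    picked = neighbours[base, rng.integers(0, k, size=n_synthetic)]
    gap = rng.random((n_synthetic, 1))  # interpolation weight in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Hypothetical minority-class rows (two features each); not study data.
minority = np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 2.2]])
synthetic = smote_sketch(minority, n_synthetic=6, k=2)
```

Because every synthetic row lies on a line segment between two real minority rows, oversampling should be applied only to training data, never to the data used for evaluation.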

We used various metrics, such as AUROC, accuracy, sensitivity, specificity, and balanced accuracy, to assess the model's performance. These metrics were calculated based on the probability predictions generated by the model. Subsequently, we calculated the mean and 95% confidence intervals (CIs) for each performance metric, measuring both the average and variability of the model's performance. Due to characteristics such as the small size of the external dataset, we used a bootstrapping method instead of traditional cross-validation. By resampling the dataset multiple times and performing bootstrapping with up to 10,000 iterations, we could evaluate the sample distribution and calculate 95% CIs for the model's performance metrics. We plotted a receiver operating characteristic (ROC) curve to visually represent the model’s performance. This was complemented by the mean ROC curve and the standard deviation within that range, which elucidated the distribution of the model's performance.
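
The bootstrap procedure described above can be sketched as follows. The percentile method for the 95% CI, the function name, and the synthetic labels and scores are assumptions made for illustration; the example also uses 1,000 iterations rather than 10,000 to keep it fast.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auroc(y_true, y_score, n_iterations=1000, seed=0):
    """Return the mean AUROC and a percentile 95% CI from bootstrap resamples."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    scores = []
    for _ in range(n_iterations):
        # Resample patients with replacement, keeping the dataset size.
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUROC is undefined when a resample has one class
        scores.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(scores, [2.5, 97.5])
    return float(np.mean(scores)), float(lower), float(upper)


# Hypothetical labels and predicted probabilities; not study data.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)
y_score = 0.3 * y_true + 0.7 * rng.random(200)
mean_auc, ci_low, ci_high = bootstrap_auroc(y_true, y_score)
```

The same loop can compute accuracy, sensitivity, specificity, and balanced accuracy on each resample by thresholding `y_score` before scoring.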

To identify the most important features for predicting CVD, we used the intrinsic feature-importance mechanisms of tree-based models such as RF, XGB, and LGM. These mechanisms primarily score each feature by the mean decrease in impurity (the reduction in Gini impurity attributable to the feature) or by gain. We selected the top 15 features that had the greatest impact on the model and plotted them on a bar graph to illustrate their influence on the model predictions.
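
A minimal sketch of extracting impurity-based importances from a random forest and keeping the top 15 is shown below. The data and feature names are synthetic placeholders, and the model settings are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names; not study data.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ holds the impurity-based (mean decrease in impurity)
# importances, normalized so that they sum to 1.
importances = model.feature_importances_
order = np.argsort(importances)[::-1][:15]  # indices of the 15 largest
top_features = [(feature_names[i], float(importances[i])) for i in order]
```

The resulting `top_features` list can be passed to a plotting call such as Matplotlib's `barh` to produce a ranked bar graph.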

Performance metrics

To understand the performance of our model comprehensively, we selected the following five performance metrics: AUROC, accuracy, sensitivity, specificity, and balanced accuracy. AUROC is a robust measure of a model's ability to discriminate between classes across all possible thresholds. Its robustness stems from the fact that it considers both sensitivity and specificity, making it the preferred metric, particularly in situations where classes are unbalanced. Accuracy is a simple and intuitive performance metric that gives the proportion of true results (both true positives and true negatives) among the total number of cases examined. However, accuracy alone can be misleading, particularly for unbalanced datasets; therefore, additional performance metrics are required. Sensitivity and specificity were selected to assess how well the model identified the positive and negative cases, respectively. Sensitivity measures the proportion of actual positives correctly identified by the model and provides insight into the model's ability to detect positive cases, whereas specificity measures the proportion of actual negatives that are correctly identified, providing a sense of the model's ability to avoid false alarms. Finally, we included balanced accuracy to provide a more balanced view of our model's performance, particularly in the face of class imbalance. As the average of sensitivity and specificity, balanced accuracy assigns equal weights to both metrics, making it a good alternative to accuracy when addressing unbalanced datasets. The combination of these metrics enabled us to evaluate the performance of our model from different perspectives, thereby ensuring a more robust evaluation20,21.
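
The threshold-based metrics defined above follow directly from confusion-matrix counts, as the short sketch below shows. The function name and the example counts are hypothetical and chosen only to illustrate how accuracy can mislead on an imbalanced dataset.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # proportion of actual positives detected
    specificity = tn / (tn + fp)  # proportion of actual negatives detected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts: 10 positives (half of them missed) and 90 negatives.
metrics = classification_metrics(tp=5, fp=0, tn=90, fn=5)
```

With these counts, accuracy is 0.95 even though half of the positive cases are missed, whereas balanced accuracy is 0.75, which reflects the sensitivity of 0.5.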

Software and libraries

All data preprocessing, model development, and analyses were performed using Python 3.9.16. Key libraries used in our study included Scikit-learn 1.2.2, NumPy 1.23.5, and Pandas 1.5.3 for ML algorithms and data manipulation. Matplotlib 3.7.1 and Seaborn 0.12.2 were used for data visualization.

Ethical approval

This study was approved by the Institutional Review Board of the Kyung Hee University Hospital (No. KHSIRB-22–473(EA)). The requirement for informed consent was waived by the institutional review board because deidentified data were used for the analyses. All research was performed in accordance with the relevant guidelines, regulations, and the Declaration of Helsinki. This study followed the guidelines outlined in the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement22.

Results

Cohort characteristics

In total, 12,809 patients were selected from the discovery cohort, of whom 1238 (10.2%) had CVD (Fig. 1). Among the participants, 6530 (51.0%) were male patients, and the mean age was 62.5 ± 12.1 years. For extra-validation, 2019 patients were included from the validation cohort, of whom 32 (1.6%) had CVD. The validation cohort had 1094 (54.2%) male patients with a mean age of 56.3 ± 11.9 years (Table 1). For each clinical indicator and blood test, the range was defined as the difference between the maximum and minimum values in each patient's hospital records before the occurrence of CVD; the median ranges are listed separately in Supplementary Table S1.

Table 1 Baseline characteristics of the study and extra-validation datasets.

Comparisons of prediction model performance

The RF model displayed impressive performance on the validation set, exhibiting an AUROC of 0.830 (95% CI, 0.816–0.842). Details of the hyperparameters of the models and the tuning process of the RF model are listed in Supplementary Tables S2 and S3. The RF model also demonstrated consistent metrics across the board (accuracy: 74.7% [73.2–76.2]; sensitivity: 74.6% [73.0–76.2]; specificity: 74.7% [73.2–76.2]; and balanced accuracy: 74.6% [73.1–76.2]). In comparison, the XGB and LGM models demonstrated slightly higher AUROC values and maintained an overall robust performance. The ADB and LR models presented consistent performance despite a lower AUROC than that of the RF model. The SVM model delivered a substantially lower AUROC value and demonstrated noticeable variability in its metrics (Table 2; Fig. 2).

Table 2 Performance metrics of six different machine learning algorithms on the original and external validation datasets.
Figure 2. ROC curves of the random forest model. Mean ROC curve from tenfold cross-validation on the original dataset. ROC receiver operating characteristic, AUC area under the ROC curve.

When these models were applied to the external validation set, the RF model achieved the highest AUROC of 0.722 and exhibited the best performance on the other metrics. The XGB, LGM, ADB, and LR models exhibited lower performance than the RF model but still demonstrated good results for the other performance metrics, whereas the SVM model was somewhat less effective (Table 2; Fig. 2).

Consequently, considering its superior and consistent results, the RF model emerged as the most effective predictor of CVD onset within a 3-year timeframe in patients with diabetes.

Contributing factors for the prediction model performance

The significance of the contributing factors analyzed using the feature-importance method is shown in Fig. 3. Among the 68 variables considered in this study, the most critical factor contributing to the performance of the CVD prediction model was creatinine level, followed by HbA1c, AST, ALP, and ALT levels. BMI, medication history of CCB and diuretics, and early cerebrovascular complications were also among the top 15 features. To provide insight into the performance of the RF model, we have visually presented the distribution of each of the top 15 most important features in Supplementary Fig. S1.

Figure 3. Top 15 feature-importance of the random forest model. Cr creatinine, HbA1c glycated hemoglobin, AST aspartate transaminase, ALP alkaline phosphatase, ALT alanine transaminase, HDL high-density lipoprotein, TG triglyceride, TC total cholesterol, LDL low-density lipoprotein, CCB calcium channel blocker, DU diuretics, BMI body mass index, CeVD cerebrovascular disease.

Discussion

This study highlights the importance of developing a highly accurate ML-based CVD prediction model that can be universally applied to adults with T2DM in South Korea, facilitating easy and accurate assessment of future annual CVD risk in the diabetic population. To summarize the results of this study, the XGB, RF, LGM, and ADB models, which are ensemble models, demonstrated excellent performance with AUROC values of 0.81–0.84 on the internal validation dataset and 0.71–0.72 on the external validation dataset. Creatinine and HbA1c levels ranked highest among the top 15 feature-importance factors (Fig. 3). The findings of this study can potentially improve patient outcomes by facilitating timely interventions, enhancing the understanding of contributing variables, and reducing the burden of cardiovascular complications in patients with diabetes.

The pooled cohort equation and Framingham risk score, which are standard prognostic models based on classical risk factors, were developed and performed well in predicting CVD in the general population23, but they failed to provide reliable prediction results in patients with diabetes24,25. Specifically, the Framingham prediction model was validated three times in a population with diabetes; however, the area under the curve (AUC) varied widely between 0.56 and 0.80, and the model was poorly calibrated (P < 0.001)26. Many ML models have been developed to predict incident CVD in the general population; however, their usefulness remains unclear owing to methodological flaws, a lack of external validation, and a lack of model impact studies27. ML models for predicting cardiovascular risk in patients with T2DM have not been extensively studied. According to a systematic review, the neural network-based model performed the best with an AUC of 0.91; however, the precision of the model was only 76.6% and no external validation was performed28. In contrast, our CVD prediction model demonstrated sufficiently good performance with a mean AUROC of 0.83, using only questionnaires, body measurements, and blood tests commonly conducted in clinical practice for patients with diabetes. In addition, our prediction model maintained an excellent performance even after external validation.

This study differs from previous studies in that it was based on a large cohort of Koreans, utilizing data from three university hospitals and showed a superior predictive performance compared to the existing risk prediction models. The discovery cohort from one university hospital used for model training and the cohorts from the two university hospitals used for validation comprised different populations with different baseline characteristics. The performance of the model developed in this study trained on one cohort was similar to that of independent cohorts with different characteristics. We utilized the RF model to analyze CVD risk factors, including medical history, clinical parameters, and blood tests, due to its ability to handle multicollinearity among variables such as blood pressure, glycemic status, and dyslipidemia. This model choice enhances the stability and reliability of our results, allowing us to effectively use interrelated predictors without the risks associated with multicollinearity in linear models29,30,31. Despite the RF model's robustness in handling multicollinearity, interpreting feature-importances requires careful consideration of their potential redundancy and the nuanced relationship between mathematical and clinical significance. In addition, this study used the range of each clinical variable and blood test results as input variables. Among the top 15 features, 11 involved range values rather than the median values of each variable. This suggests that the fluctuation of these variables is more critical in risk prediction than the baseline values of the variables, which are considered classic risk factors in the existing risk prediction models.

Chronic kidney disease is also a known risk factor for CVD32, and predicting the occurrence of CVD using albuminuria, estimated glomerular filtration rate (eGFR), or cystatin C is feasible33. Among ML-based CVD prediction models, one includes creatinine levels34. A previous study also indicated that eGFR variability predicts cardiovascular event-induced hospitalization and death better than the baseline eGFR35. Consistent with these findings, our study demonstrated that the change in creatinine levels was more critical in predicting CVD in patients with diabetes than the median value of baseline creatinine.

Persistently elevated blood glucose levels cause a hyperglycemic burden and adversely affect the occurrence of complications. Recent studies have reported that HbA1c variability plays an important role in microvascular disease outcomes in patients with relatively optimal basal glycemic control and that high HbA1c variability can predict almost all cardiovascular complications of T2DM36,37. This plausibly explains why the range of change in HbA1c holds importance in the CVD prediction model used in this study. Even if the median HbA1c level is low, high glycemic variability increases the risk of CVD complications.

Moreover, the finding in this study that variability in liver enzyme levels and lipid profiles can be a predictor of CVD is supported by previous studies37,38,39. However, conflicting results have been reported regarding the effects of BMI and LDL cholesterol variability on CVD40,41. Furthermore, among the complications other than cardiovascular complications in diabetes, the presence or absence of cerebrovascular disease is an important variable for predicting CVD risk. Previous studies reporting that the total area of carotid plaques predicts the risk of myocardial infarction support this result42. However, it is difficult to explain why cerebrovascular complications are more important predictors than microvascular or other macrovascular complications. CCB and diuretics are the most commonly used medications for treating hypertension. Diuretics are among the most widely used drugs for triple therapy in patients with uncontrolled hypertension. This supports the idea that uncontrolled hypertension may increase the risk of CVD, although neither the median nor the range of systolic or diastolic blood pressure was included among the top 15 features.

This study has several limitations. First, owing to the retrospective nature of the study, detailed precision could not be expected of the information obtained from a dataset based on hospital medical records. Information bias, arising from inaccuracies in data collection methods and the recording of medications and clinical parameters, can lead to misclassification of both exposure and outcome, affecting the accuracy of an ML model's predictions for CVD in patients with T2DM. Identification of new-onset CVD cases by ICD-10 codes is limited by potential misclassification and failure to capture patient nuances, and comprehensive record checks are required to validate “negative” CVD cases as defined by ICD-10 codes. Nevertheless, the ICD-10 codes provide a standardized methodology that is important for large-population studies. Second, the model was trained on data obtained from tertiary care centers, which may introduce a selection bias because these patients may have socioeconomic backgrounds different from those of primary care patients and receive more comprehensive healthcare, affecting CVD risk and diabetes management. Third, this prediction model has not been compared directly with other existing prediction models, so its superiority warrants further research. Finally, this study alone could not prove a causal relationship between the predictors used in the model and the occurrence of CVD. Confounding factors in predicting CVD in patients with T2DM include possible biases due to medications prescribed for CVD risk factors (such as hypertension or dyslipidemia), the influence of unexplained variables such as the severity and duration of diabetes, and missing data on important lifestyle factors (smoking status, physical activity, diet, and alcohol consumption), all of which can distort the true relationship between risk factors and CVD outcomes.

Nevertheless, this study developed an ML-based CVD prediction model for patients with T2DM using comprehensive covariates and two independent cohorts from a multicenter registry in South Korea. In the future, it may be possible to leverage common data models such as the Observational Medical Outcomes Partnership Common Data Model, which integrates data from multiple institutions worldwide, to create more accurate and precise global ML-based CVD prediction models.

Conclusion

We successfully constructed an ML-based predictive model using a representative national cohort, enabling easy and accurate prediction of CVD risk in all members of the Korean population with T2DM. Our model outperformed traditional risk assessment tools, such as the Framingham risk score, which has shown limitations in its applicability to patients with diabetes. By highlighting the importance of variability in creatinine and HbA1c levels, the study illustrates how well-developed ML models can predict CVD risk across diverse populations, using routine clinical data to make risk assessment more accessible. Although much work remains to be done to generalize our model to a more diverse diabetic population, as it was trained on a specific cohort, we aimed to show that the ML model we developed and validated has broad potential for predicting CVD events in a diverse population of patients with diabetes. As part of a national project with the Korea Disease Control and Prevention Agency and the Korean Diabetes Association, this pioneering implementation prompts further expansion and validation of these models in diverse settings, reinforcing ML's role in advancing healthcare outcomes. Despite the conservative stance of several national clinical guidelines for diabetes and CVD on the use of predictive models, our findings support a reassessment of the clinical integration of these models to improve personalized medical approaches and reduce the CVD burden. Future research should focus on improving these models by including a broader range of patient data and conducting comparative studies using existing prediction methods. Collaboration between healthcare professionals, data scientists, and policymakers is essential to update clinical guidelines and ensure ethical use, paving the way for improved patient outcomes and the advancement of personalized medicine in public health.