Abstract
This study aimed to develop and validate a machine learning (ML) model tailored to the Korean population with type 2 diabetes mellitus (T2DM) to provide a superior method for predicting the development of cardiovascular disease (CVD), a major chronic complication in these patients. We used data from two cohorts, namely the discovery (one hospital; n = 12,809) and validation (two hospitals; n = 2019) cohorts, recruited between 2008 and 2022. The outcome of interest was the presence or absence of CVD at 3 years. We selected various ML-based models with hyperparameter tuning in the discovery cohort and performed area under the receiver operating characteristic curve (AUROC) analysis in the validation cohort. CVD was observed in 1238 (10.2%) patients in the discovery cohort. The random forest (RF) model exhibited the best overall performance among the models, with an AUROC of 0.830 (95% confidence interval [CI] 0.818–0.842) in the discovery dataset and 0.722 (95% CI 0.660–0.783) in the validation dataset. Creatinine and glycated hemoglobin levels were the most influential factors in the RF model. This study introduces a pioneering ML-based model for predicting CVD in Korean patients with T2DM, outperforming existing prediction tools and providing a groundbreaking approach for early personalized preventive medicine.
Similar content being viewed by others
Introduction
Cardiovascular disease (CVD) and type 2 diabetes mellitus (T2DM) are intertwined health challenges that have profound implications for global morbidity and mortality1. T2DM is recognized as a risk factor for cardiovascular complications, contributing to an increased disease burden and adverse outcomes1. According to the 2022 Diabetes Fact Sheet in Korea, patients with prediabetes and diabetes have 1.05 and 1.59 times higher risks of myocardial infarction and 1.05 and 1.51 times higher risks of heart failure, respectively, compared to those in normoglycemic adults2. Preventive interventions for T2DM can prevent long-term complications and are reportedly cost-effective3,4. Cardiovascular risk factors known to date include duration of diabetes, obesity/overweight status, hypertension, dyslipidemia, smoking, a family history of early coronary artery disease, chronic kidney disease, and albuminuria3,4. All patients with diabetes should undergo annual assessments and management to identify modifiable abnormal risk factors5.
With the recent development of artificial intelligence, new technologies, such as machine learning (ML), have been applied in various fields. ML has been applied to texture filtering in computer graphics and images6 and rolling element-bearing fault identification in mechanical engineering. It is attracting attention as a new approach that overcomes the limitations of traditional methods and shows superior performance6,7. ML is gaining traction in the medical field, and efforts are being made to apply ML technologies to the existing disease models8,9. A study that used an ML algorithm to predict recurrent spontaneous abortion constructed a practical framework using clinical information from real clinical data, such as vitamin D and thyroid function tests, in the analysis and achieved a sensitivity of 93.3% and specificity of 93.1%10. The study involving a novel transfer learning-based ensemble classifier for detecting COVID-19 infection in chest computed tomography scans provides an important diagnostic tool for the field of medical imaging and is expected to have a significant impact on the diagnosis and management of various other lung diseases in the future11. ToxMVA, a deep learning-based method used for predicting protein toxicity in the early stages of protein-based drug discovery, has shown the potential to shorten the process and reduce the cost of drug screening12. Thus, ML is a powerful tool for addressing existing limitations by leveraging clinical data to uncover hidden patterns and identifying critical variables associated with disease development. By integrating ML algorithms with clinical data, clinicians can efficiently identify early warning signs and risk factors for complications13.
Various methods have been developed to assess and predict CVD risk, including atherosclerotic CVD risk calculators and biomarkers14,15. However, this is problematic because traditional risk assessment tools are insufficient and the complex interactions between clinical variables are unclear8,9. Various predictive models using ML have recently been developed for CVD prediction in T2DM; however, their predictive power is limited because multiple risk factors have not been included in these models16,17,18. Therefore, this study aimed to identify the relationship between clinical factors and cardiovascular complications and to develop a predictive model for CVD occurrence through rigorous model training and validation, utilizing the strengths of ML technology in patients with T2DM in South Korea. By identifying significant predictors within the model, we sought to understand the nuanced relationship between these factors and cardiovascular risk in patients with T2DM. We have presented our methodology, including a thorough validation process for two independent Korean cohorts, strengthening the reliability and applicability of our findings. We have also discussed the potential of the predictive model developed in this study, emphasizing its role as an efficient and accurate tool for CVD risk stratification and its contribution to the field of personalized medicine to improve CVD outcomes.
Materials and methods
Study population and data collection
The data used in this retrospective study were obtained from two independent longitudinal cohorts previously enrolled in an observational study. Hospital-based data consisted of electronic medical records from outpatient, inpatient, and emergency department visits collected between January 1, 2008, and December 31, 2022. Eligible participants were selected from among patients with T2DM, excluding those with type 1 diabetes and previous onset of CVD. Finally, 12,809 patients were selected from a tertiary hospital at the Kyung Hee University Medical Center for the discovery cohort. Data for extra-validation were collected from a retrospective dataset from the secondary hospitals Kyung Hee University Medical Center at Gangdong and Gachon University Gil Hospital (validation cohort), and 2019 eligible patients were selected (Fig. 1).
Input variables
A comprehensive set of 68 variables was included in the model. Patients’ baseline demographic characteristics included age and sex. Medical histories included the presence of hypertension, dyslipidemia, macrovascular complications (cerebrovascular disease, dementia, Parkinson’s disease, and lower limb amputation), microvascular complications (diabetic retinopathy, proliferative diabetic retinopathy, diabetic neuropathy, and chronic kidney disease), and cancer. Medication history included types of antidiabetic drugs (metformin, sulfonylurea, dipeptidyl peptidase-4 inhibitor, meglitinide, thiazolidinedione, α-glucosidase inhibitor, insulin, glucagon-like peptide-1 receptor agonist, and sodium–glucose co-transporter 2 inhibitor), antihypertensive drugs (angiotensin II receptor blocker, angiotensin-converting enzyme inhibitor, calcium channel blocker [CCB], diuretics, and beta blocker), dyslipidemia drugs (statin, fibrate, ezetimibe, omega-3, and other dyslipidemia drugs), and antiplatelet agents (aspirin, clopidogrel, cilostazol, glycoprotein IIb/IIIa antagonist, and other antiplatelet agents). The mean and range of clinical parameters included body mass index (BMI)19, systolic blood pressure, diastolic blood pressure, and pulse rate. The mean and range of blood tests included glycated hemoglobin (HbA1c), serum glucose, total cholesterol, triglycerides, high-density lipoprotein cholesterol, low-density lipoprotein (LDL) cholesterol, serum creatinine, aspartate aminotransferase (AST), alanine aminotransferase (ALT), gamma-glutamyl transferase, and alkaline phosphatase (ALP) levels.
Identification of new CVD cases
New-onset CVD among patients with T2DM was identified using the International Classification of Diseases, 10th Revision (ICD-10) codes for ischemic heart disease and myocardial infarction (I20.X–I25.X), heart failure (I50.X), and atrial fibrillation (I48.X). The primary endpoint was a new CVD diagnosis within 3 years.
Data preprocessing
Missing data were excluded from analysis. Covariates were divided into three sections: (1) demographic data, (2) physical examinations and blood tests, and (3) medication and comorbidity information. Using the examination date, we utilized physical examination and blood test data before the CVD onset. The dataset for the entire period was calculated and converted into a mean. During this period, the range was calculated by subtracting the maximum and minimum recorded values. First-visit information on medications and comorbidities was selected as a covariate.
Model training and validation
A common ML approach for prediction involves splitting data into training and test sets. In this study, the target value of the given data on the incidence of CVD over 3 years was insufficient. Consequently, the model was trained on the entire dataset, rather than being split for internal validation. Instead, a separate external dataset was used to assess the extent to which the model could be generalized. This approach is essential to verify whether the model performs well on previously unseen data.
Model development
We selected decision-tree-based ensemble models, such as the XGBoost (XGB), random forest (RF), LightGBM (LGM), and AdaBoost (ADB), and linear classification models, such as logistic regression (LR) and support vector machine (SVM). Among these, the XGB, RF, and LGM models are the most common and practical models for handling a mixture of categorical and continuous variables. For the SVM model, we chose a linear kernel because of its simplicity and efficiency, particularly for high-dimensional data, where the number of features is much larger than the number of samples. Linear kernels are beneficial when data are linearly or nearly linearly separable.
To optimize the performance of each model, we performed hyperparameter tuning using GridSearchCV and maximized the area under the receiver operating characteristic curve (AUROC) to determine the best combination of hyperparameters.
ML analysis
To determine the AUROC score, various tree-based and linear classification models were used to predict the potential occurrence of CVD. Using the AUROC score as a scoring metric, we used GridSearchCV to optimize the hyperparameters of this model. Once the optimal hyperparameters were determined, the model was trained for subsequent predictions. Considering the class imbalance in our data, we used the synthetic minority over-sampling technique to generate synthetic samples.
We used various metrics, such as AUROC, accuracy, sensitivity, specificity, and balanced accuracy, to assess the model's performance. These metrics were calculated based on the probability predictions generated by the model. Subsequently, we calculated the mean and 95% confidence intervals (CIs) for each performance metric, measuring both the average and variability of the model's performance. Due to characteristics such as the small size of the external dataset, we used a bootstrapping method instead of traditional cross-validation. By resampling the dataset multiple times and performing bootstrapping with up to 10,000 iterations, we could evaluate the sample distribution and calculate 95% CIs for the model's performance metrics. We plotted a receiver operating characteristic (ROC) curve to visually represent the model’s performance. This was complemented by the mean ROC curve and the standard deviation within that range, which elucidated the distribution of the model's performance.
To identify the most important features for predicting CVD, we utilized the intrinsic feature-importance mechanisms provided by tree-based models such as the RF, XGB, and LGM, which primarily evaluate the feature-importance based on metrics such as mean decrease impurity method, which calculates Gini impurity reduction and gain. We selected the top 15 features that had the greatest impact on the model and plotted them on a bar graph to illustrate their influence on the model predictions.
Performance metrics
To understand the performance of our model comprehensively, we selected the following five performance metrics: AUROC, accuracy, sensitivity, specificity, and balanced accuracy. AUROC is a robust performance measure to a model's ability to discriminate between classes across all possible thresholds. Its robustness stems from the fact that it considers both sensitivity and specificity, making it the preferred metric, particularly in situations where classes are unbalanced. The accuracy is a simple and intuitive performance metric that provides the proportion of true results (both true positives and true negatives) from the total number of cases examined. However, accuracy alone can be misleading, particularly for unbalanced datasets; therefore, additional performance metrics are required. Sensitivity and specificity were selected to assess how well the model identified the positive and negative cases, respectively. Sensitivity measures the proportion of true positives correctly identified by the model and provides insight into the model's ability to detect positive cases, whereas specificity measures the proportion of actual negatives that are correctly identified, providing a sense of the model's ability to avoid false alarms. Finally, we included balanced accuracy to provide a more balanced view of our model's performance, particularly in the face of class imbalance. As an average of the sensitivity and specificity, balanced accuracy assigns equal weights to both metrics, making it a good alternative to accuracy when addressing unbalanced datasets. The combination of these metrics enabled us to evaluate the performance of our model from different perspectives, thereby ensuring a more robust evaluation20,21.
Software and libraries
All data preprocessing, model development, and analyses were performed using Python 3.9.16. Key libraries used in our study included Scikit-learn 1.2.2, NumPy 1.23.5, and Pandas 1.5.3 for ML algorithms and data manipulation. Matplotlib 3.7.1 and Seaborn 0.12.2 were used for data visualization.
Ethical approval
This study was approved by the Institutional Review Board of the Kyung Hee University Hospital (No. KHSIRB-22–473(EA)). The requirement for informed consent was waived by the institutional review board because deidentified data were used for the analyses. All research was performed in accordance with the relevant guidelines, regulations, and the Declaration of Helsinki. This study followed the guidelines outlined in the Transparent Reporting of a Multivariate Prediction Model for Individual Prognosis or diagnosis (TRIPOD) statement22.
Results
Cohort characteristics
In total, 12,809 patients were selected from the discovery cohort, of whom 1238 (10.2%) had CVD (Fig. 1). Among the participants, 6530 (51.0%) were male patients, and the mean age was 62.5 ± 12.1 years. For extra-validation, 2019 patients were included, comprising 32 (1.6%) patients with CVD from the validation cohort. The validation cohort had 1094 (54.2%) male patients with a mean age of 56.3 ± 11.9 years (Table 1). The median range of each clinical indicator and blood test was the difference between the maximum and minimum values before the occurrence of CVD in the hospital records of each patient, listed separately in Supplementary Table S1.
Comparisons of prediction model performance
The RF model displayed impressive performance on the validation set, exhibiting an AUROC of 0.830 (95% CI, 0.816–0.842). Details of the hyperparameter of the models and the tuning process of the RF model are listed in Supplementary Tables S2 and S3. The RF model also demonstrated consistent metrics across the board (accuracy: 74.7% [73.2–76.2]; sensitivity: 74.6% [73.0–76.2]; specificity: 74.7% [73.2–76.2]; and balanced accuracy: 74.6% [73.1–76.2]). In comparison, the XGB and LGM models demonstrated slightly higher values of AUROC and maintained an overall robust performance. The ADB and LR models presented consistent performance despite a lower AUROC than that of the RF model. The SVM model delivered a substantially lower AUROC value and demonstrated noticeable variability in its metrics (Table 2; Fig. 2).
When these models were applied to the external validation set, the RF model achieved the highest AUROC of 0.722 and exhibited the best performance on the other metrics. The XGB, LGM, ADB, and LR models exhibited lower performance than did the RF model and demonstrated good results for other performance metrics, whereas the SVM model exhibited somewhat less efficacy (Table 2; Fig. 2).
Consequently, considering its superior and consistent results, the RF model emerged as the most effective predictor of CVD onset within a 3-year timeframe in patients with diabetes.
Contributing factors for the prediction model performance
The significance of the contributing factors analyzed using the feature-importance method is shown in Fig. 3. Among the 68 variables considered in this study, the most critical factor contributing to the performance of the CVD prediction model was creatinine level, followed by HbA1c, AST, ALP, and ALT levels. BMI and medication history of CCB and diuretics were also included among the top 15 features. Early cerebrovascular complications were among the top 15 features. To provide an insight into the performance of the RF model, we have visually presented the distribution of each of the top 15 most important features in Supplementary Fig. S1.
Top 15 feature-importance of the random forest model. Cr creatinine, HbA1c glycated hemoglobin, AST aspartate transaminase, ALP alkaline phosphatase, ALT alanine transaminase, HDL high-density lipoprotein, TG triglyceride, TC total cholesterol, LDL low-density lipoprotein, CCB calcium channel blocker, DU diuretics, BMI body mass index, CeVD cerebrovascular disease.
Discussion
This study highlights the importance of developing a highly accurate ML-based CVD prediction model that can be universally applied to adults with T2DM in South Korea, facilitating easy and accurate assessment of future annual CVD risk in the diabetic population. To summarize the results of this study, the XGB, RF, LGM, and ADB models, which are ensemble models, demonstrated excellent performance with AUROC values of 0.81–0.84 on the internal validation dataset and 0.71–0.72 on the external validation dataset. Creatinine and HbA1c levels ranked highest among the top 15 feature-importance factors (Fig. 3). The findings of this study can potentially improve patient outcomes by facilitating timely interventions, enhancing the understanding of contributing variables, and reducing the burden of cardiovascular complications in patients with diabetes.
The pooled cohort equation and Framingham risk score, which are standard prognostic models based on classical risk factors, were developed and performed well in predicting CVD in the general population23, but they failed to provide reliable prediction results in patients with diabetes24,25. Specifically, the Framingham prediction model was validated three times in a population with diabetes; however, the area under the curve (AUC) varied widely between 0.56 and 0.80 and was poorly calibrated (P < 0.001)26. Many ML models have been developed to predict incident CVD in the general population; however, their usefulness remains unclear owing to methodological flaws, a lack of external validation, and model impact studies27. Various ML models for predicting cardiovascular risk in patients with T2DM have not been extensively studied. According to a systematic review, the neural network-based model performed the best with an AUC of 0.91; however, the precision of the model was only 76.6% and no external validation was not performed28. In contrast, our CVD prediction model demonstrated sufficiently good performance with a mean AUROC of 0.83, using only questionnaires, body measurements, and blood tests commonly conducted in clinical practice for patients with diabetes. In addition, our prediction model maintained an excellent performance even after external validation.
This study differs from previous studies in that it was based on a large cohort of Koreans, utilizing data from three university hospitals and showed a superior predictive performance compared to the existing risk prediction models. The discovery cohort from one university hospital used for model training and the cohorts from the two university hospitals used for validation comprised different populations with different baseline characteristics. The performance of the model developed in this study trained on one cohort was similar to that of independent cohorts with different characteristics. We utilized the RF model to analyze CVD risk factors, including medical history, clinical parameters, and blood tests, due to its ability to handle multicollinearity among variables such as blood pressure, glycemic status, and dyslipidemia. This model choice enhances the stability and reliability of our results, allowing us to effectively use interrelated predictors without the risks associated with multicollinearity in linear models29,30,31. Despite the RF model's robustness in handling multicollinearity, interpreting feature-importances requires careful consideration of their potential redundancy and the nuanced relationship between mathematical and clinical significance. In addition, this study used the range of each clinical variable and blood test results as input variables. Among the top 15 features, 11 involved range values rather than the median values of each variable. This suggests that the fluctuation of these variables is more critical in risk prediction than the baseline values of the variables, which are considered classic risk factors in the existing risk prediction models.
Chronic kidney disease is also a known risk factor for CVD32, and predicting the occurrence of CVD using albuminuria, estimated glomerular filtration rate (eGFR), or cystatin C is feasible33. Among ML-based CVD prediction models, one includes creatinine levels34. A previous study also indicated that eGFR variability predicts cardiovascular event-induced hospitalization and death better than the baseline eGFR35. Consistent with these findings, our study demonstrated that the change in creatinine levels was more critical in predicting CVD in patients with diabetes than the median value of baseline creatinine.
Persistently elevated high blood glucose levels cause a hyperglycemic burden and adversely affect the occurrence of complications. Recent studies have reported that HbA1c variability plays an important role in microvascular disease outcomes in patients with relatively optimal basal glycemic control and that high HbA1c variability can predict almost all cardiovascular complications of T2DM36,37. Thus, it is reasonable to explain why the range of change in HbA1c holds importance in the CVD prediction model used in this study. Even if the median HbA1c level is low, high glycemic variability increases the risk of CVD complications.
Moreover, the finding that variability in liver levels and lipid profiles can be a predictor of CVD in this study is supported by previous studies37,38,39. However, conflicting results have been reported regarding the effects of BMI and LDL cholesterol variability on CVD40,41. Furthermore, among the complications other than cardiovascular complications in diabetes, the presence or absence of cerebrovascular disease is an important variable for predicting CVD risk. Previous studies reporting that the total area of carotid plaques predicts the risk of myocardial infarction support this result42. However, it is difficult to explain why cerebrovascular complications are more important predictors than microvascular or macrovascular complications. CCB and diuretics are the most commonly used medications for treating hypertension. Diuretics are among the most widely used drugs for triple therapy in patients with uncontrolled hypertension. This disproves the idea that uncontrolled hypertension may increase the risk of CVD, although the median or range of systolic or diastolic blood pressure was not included among the top 15 features.
This study has several limitations. First, owing to the retrospective nature of the study, it was challenging to expect detailed precision in the information obtained from the dataset based on hospital medical records. Information bias, arising from inaccuracies in data collection methods and the recording of medications and clinical parameters, can lead to misclassification of both exposure and outcome, affecting the accuracy of an ML model's predictions for CVD in patients with T2DM. Identification of new-onset CVD cases by ICD-10 codes is limited by potential misclassification and failure to capture patient nuances, and comprehensive record checks are required to validate “negative” CVD cases as defined by ICD-10 codes. Nevertheless, the ICD-10 codes provide a standardized methodology that is important for large-population studies. Second, the model was trained on data obtained from tertiary care centers, which may introduce a selection bias because these patients may have socioeconomic backgrounds different from those of primary care patients and receive more comprehensive healthcare, affecting CVD risk and diabetes management. Third, the superiority of this prediction model has not been compared with other existing prediction models and warrants further research. Finally, this study alone could not prove a causal relationship between the predictors used in the model and the occurrence of CVD. Confounding factors in predicting CVD in patients with T2DM include possible biases due to medications prescribed for conditions such as CVD risk factors (hypertension or dyslipidemia), the influence of unexplained variables such as the severity and duration of diabetes, and missing data on important lifestyle factors (smoking status, physical activity, diet, and alcohol consumption), all of which can distort the true relationship between risk factors and CVD outcomes.
Nevertheless, this study used an ML-based CVD prediction model for patients with T2DM using comprehensive covariates and two independent cohorts from a multicenter registry in South Korea. In the future, it may be possible to leverage common data models such as the Observational Medical Outcome Partners-Common Data Model, which integrates data from multiple institutions worldwide to create more accurate and precise global ML-based CVD prediction models.
Conclusion
We successfully constructed an ML-based predictive model using a representative national cohort, enabling easy and accurate prediction of CVD risk in all members of the Korean population with T2DM. Our model outperformed traditional risk assessment tools, such as the Framingham risk score, which has shown limitations in their applicability to patients with diabetes. Highlighting the importance of creatinine and HbA1c level variabilities, the study illustrates how well-developed ML models can predict CVD risk across diverse populations, using routine clinical data to enhance risk assessment accessibility. Although much work remains to be done to generalize our model to a more diverse diabetic population, as it was trained on a specific cohort, we aimed to show that the ML model we developed and validated has broad potential for predicting CVD events in a diverse population of patients with diabetes. As part of a national project with the Korea Disease Control and Prevention Agency and the Korean Diabetes Association, this pioneering implementation prompts further expansion and validation of these models to diverse settings, reinforcing ML's role in advancing healthcare outcomes. Despite the conservative stance of several national clinical guidelines for diabetes and CVD on the use of predictive models, our findings support a reassessment of the clinical integration of these models to improve personalized medical approaches and reduce the CVD burden. Future research should focus on improving these models by including a broader range of patient data and conducting comparative studies using existing prediction methods. Collaboration between healthcare professionals, data scientists, and policymakers is essential to update clinical guidelines and ensure ethical use, paving the way for improved patient outcomes and the advancement of personalized medicine in public health.
Data availability
The datasets generated and/or analyzed during the current study are not publicly available owing to patient confidentiality or because they are used under license but are available from the corresponding author upon reasonable request.
References
Dinh, A., Miertschin, S., Young, A. & Mohanty, S. D. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Med. Inform. Decis. Mak. 19, 211. https://doi.org/10.1186/s12911-019-0918-5 (2019).
Korean Diabetes Association. Diabetes Fact Sheet in Korea 2022 52–53 (Kyu Chang Won, 2022).
the Diabetes Prevention Program Outcomes Study. Diabetes Prevention Program Research, G. Long-term effects of lifestyle intervention or metformin on diabetes development and microvascular complications over 15-year follow-up. Lancet Diabetes Endocrinol 3, 866–875. https://doi.org/10.1016/S2213-8587(15)00291-0 (2015).
Lindstrom, J. et al. Improved lifestyle and decreased diabetes risk over 13 years: Long-term follow-up of the randomised Finnish Diabetes Prevention Study (DPS). Diabetologia 56, 284–293. https://doi.org/10.1007/s00125-012-2752-5 (2013).
American Diabetes Association Professional Practice Committee. 10. Cardiovascular Disease and Risk Management: Standards of Care in Diabetes—2024. Diabetes Care 47, S179–S218. https://doi.org/10.2337/dc24-S010 (2024).
Zhao, H., Jiang, L., Jin, X., Du, H. & Li, X. Constant time texture filtering. Vis. Comput. 34, 83–92. https://doi.org/10.1007/s00371-016-1315-z (2016).
Wang, S., Xiang, J., Zhong, Y. & Zhou, Y. Convolutional neural network-based hidden Markov models for rolling element bearing fault identification. Knowl. Based Syst. 144, 65–76. https://doi.org/10.1016/j.knosys.2017.12.027 (2018).
Rhee, S. Y. et al. Development and validation of a deep learning based diabetes prediction system using a nationwide population-based cohort. Diabet. Metab. J. 45, 515–525. https://doi.org/10.4093/dmj.2020.0081 (2021).
Jing, L. et al. A machine learning approach to management of heart failure populations. JACC Heart Fail 8, 578–587. https://doi.org/10.1016/j.jchf.2020.01.012 (2020).
Shi, B. et al. Prediction of recurrent spontaneous abortion using evolutionary machine learning with joint self-adaptive sime mould algorithm. Comput. Biol. Med. 148, 105885. https://doi.org/10.1016/j.compbiomed.2022.105885 (2022).
Shaik, N. S. & Cherukuri, T. K. Transfer learning based novel ensemble classifier for COVID-19 detection from chest CT-scans. Comput. Biol. Med. 141, 105127. https://doi.org/10.1016/j.compbiomed.2021.105127 (2022).
Shi, H. et al. ToxMVA: An end-to-end multi-view deep autoencoder method for protein toxicity prediction. Comput. Biol. Med. 151, 106322. https://doi.org/10.1016/j.compbiomed.2022.106322 (2022).
Kavakiotis, I. et al. Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104–116. https://doi.org/10.1016/j.csbj.2016.12.005 (2017).
DeFilippis, A. P. et al. Risk score overestimation: The impact of individual cardiovascular risk factors and preventive therapies on the performance of the American Heart Association-American College of Cardiology-Atherosclerotic Cardiovascular Disease risk score in a modern multi-ethnic cohort. Eur. Heart. J. 38, 598–608. https://doi.org/10.1093/eurheartj/ehw301 (2017).
Bohula, E. A. et al. Atherothrombotic risk stratification and ezetimibe for secondary prevention. J. Am. Coll. Cardiol. 69, 911–921. https://doi.org/10.1016/j.jacc.2016.11.070 (2017).
Weng, S. F., Reps, J., Kai, J., Garibaldi, J. M. & Qureshi, N. Can machine-learning improve cardiovascular risk prediction using routine clinical data?. PLoS ONE 12, e0174944. https://doi.org/10.1371/journal.pone.0174944 (2017).
Lagani, V. et al. Development and validation of risk assessment models for diabetes-related complications based on the DCCT/EDIC data. J. Diabet. Complic. 29, 479–487. https://doi.org/10.1016/j.jdiacomp.2015.03.001 (2015).
Jonnagaddala, J. et al. Identification and progression of heart disease risk factors in diabetic patients from longitudinal electronic health records. Biomed. Res. Int. 2015, 636371. https://doi.org/10.1155/2015/636371 (2015).
Eum, S. & Rhee, S. Y. Age, ethnic, and sex disparity in body mass index and waist circumference: a bi-national large-scale study in South Korea and the United States. Life Cycle 3, e4. https://doi.org/10.54724/lc.2023.e4 (2023).
Lee, S. W. Regression analysis for continuous independent variables in medical research: Statistical standard and guideline of Life Cycle Committee. Life Cycle 2, e3. https://doi.org/10.54724/lc.2022.e3 (2022).
Kim, J., Kim, S. C., Kang, D., Yon, D. K. & Kim, J. G. Classification of Alzheimer’s disease stage using machine learning for left and right oxygenation difference signals in the prefrontal cortex: A patient-level, single-group, diagnostic interventional trial. Eur. Rev. Med. Pharmacol. Sci. 26, 7734–7741 (2022).
Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. Transparent reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): The TRIPOD statement. Ann. Intern. Med. 162, 55–63. https://doi.org/10.7326/m14-0697 (2015).
Goff, D. C. Jr. et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: A report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation 129, S49-73. https://doi.org/10.1161/01.cir.0000437741.48606.98 (2014).
Basu, S., Sussman, J. B., Berkowitz, S. A., Hayward, R. A. & Yudkin, J. S. Development and validation of Risk Equations for Complications Of type 2 Diabetes (RECODe) using individual participant data from randomised trials. Lancet Diabet. Endocrinol. 5, 788–798. https://doi.org/10.1016/S2213-8587(17)30221-8 (2017).
Chowdhury, M. Z. I., Yeasmin, F., Rabi, D. M., Ronksley, P. E. & Turin, T. C. Prognostic tools for cardiovascular disease in patients with type 2 diabetes: A systematic review and meta-analysis of C-statistics. J. Diabet. Complic. 33, 98–111. https://doi.org/10.1016/j.jdiacomp.2018.10.010 (2019).
van Dieren, S. et al. Prediction models for the risk of cardiovascular disease in patients with type 2 diabetes: A systematic review. Heart 98, 360–369. https://doi.org/10.1136/heartjnl-2011-300734 (2012).
Damen, J. A. et al. Prediction models for cardiovascular disease risk in the general population: Systematic review. BMJ 353, i2416. https://doi.org/10.1136/bmj.i2416 (2016).
Kee, O. T. et al. Cardiovascular complications in a diabetes prediction model using machine learning: A systematic review. Cardiovasc. Diabetol. 22, 13. https://doi.org/10.1186/s12933-023-01741-7 (2023).
Lindner, T., Puck, J. & Verbeke, A. Beyond addressing multicollinearity: Robust quantitative analysis and machine learning in international business research. J. Int. Bus. Stud. 53, 1307–1314. https://doi.org/10.1057/s41267-022-00549-z (2022).
Drobnič, F., Kos, A. & Pustišek, M. On the interpretability of machine learning models and experimental feature selection in case of multicollinear data. Electronics 9, 761 (2020).
Chowdhury, S., Lin, Y., Liaw, B. & Kerby, L. in 2022 International Conference on Intelligent Data Science Technologies and Applications (IDSTA). 17–25.
Go, A. S., Chertow, G. M., Fan, D., McCulloch, C. E. & Hsu, C. Y. Chronic kidney disease and the risks of death, cardiovascular events, and hospitalization. N. Engl. J. Med. 351, 1296–1305. https://doi.org/10.1056/NEJMoa041031 (2004).
Joshi, S. & Viljoen, A. Renal biomarkers for the prediction of cardiovascular disease. Curr. Opin. Cardiol. 30, 454–460. https://doi.org/10.1097/HCO.0000000000000177 (2015).
Chicco, D. & Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med. Inform. Decis. Mak. 20, 16. https://doi.org/10.1186/s12911-020-1023-5 (2020).
Suzuki, A. et al. Visit-to-visit variability in estimated glomerular filtration rate predicts hospitalization and death due to cardiovascular events. Clin. Exp. Nephrol. 23, 661–668. https://doi.org/10.1007/s10157-019-01695-9 (2019).
Ceriello, A. et al. HbA1c variability predicts cardiovascular complications in type 2 diabetes regardless of being at glycemic target. Cardiovasc. Diabetol. 21, 13. https://doi.org/10.1186/s12933-022-01445-4 (2022).
Shen, Y. et al. Association between visit-to-visit HbA1c variability and the risk of cardiovascular disease in patients with type 2 diabetes. Diabet. Obes. Metab. 23, 125–135. https://doi.org/10.1111/dom.14201 (2021).
Cho, E. J., Han, K., Lee, S. P., Shin, D. W. & Yu, S. J. Liver enzyme variability and risk of heart disease and mortality: A nationwide population-based study. Liver Int. 40, 1292–1302. https://doi.org/10.1111/liv.14432 (2020).
Wan, E. Y. F. et al. Greater variability in lipid measurements associated with cardiovascular disease and mortality: A 10-year diabetes cohort study. Diabet. Obes. Metab. 22, 1777–1788. https://doi.org/10.1111/dom.14093 (2020).
Lee, J. S. et al. Effects of ten year body weight variability on cardiovascular risk factors in Japanese middle-aged men and women. Int. J. Obes. Relat. Metab. Disord. 25, 1063–1067. https://doi.org/10.1038/sj.ijo.0801633 (2001).
Youk, T. M., Kang, M. J., Song, S. O. & Park, E. C. Effects of BMI and LDL-cholesterol change pattern on cardiovascular disease in normal adults and diabetics. BMJ Open Diabet. Res. Care 8, e001340. https://doi.org/10.1136/bmjdrc-2020-001340 (2020).
Johnsen, S. H. & Mathiesen, E. B. Carotid plaque compared with intima-media thickness as a predictor of coronary and cerebrovascular disease. Curr. Cardiol. Rep. 11, 21–27. https://doi.org/10.1007/s11886-009-0004-1 (2009).
Acknowledgements
This is the result of a multicenter study conducted in collaboration with the Korean Diabetes Association and the Korea Disease Control and Prevention Agency, as well as data collected by the National Information Agency. This study was supported by the National Institutes of Health Research Project (No. 2022-ER1102-01) of South Korea. The funders had no role in the study design, data collection, data analysis, data interpretation, or manuscript writing.
Author information
Authors and Affiliations
Contributions
H.S., D.K.Y., and S.Y.R. conceptualized the study. H.L., M.L., and J.P. acquired, analyzed, and interpreted the data. H. S. and H. L. wrote the first draft of the manuscript. S.K., H.G.W., M.R., A.K., L.S., S.L., Y.-C.H., T.S.P., H.L., and S.Y.R. revised the manuscript. D. K. Y. and S. Y. R. approved the final version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sang, H., Lee, H., Lee, M. et al. Prediction model for cardiovascular disease in patients with diabetes using machine learning derived and validated in two independent Korean cohorts. Sci Rep 14, 14966 (2024). https://doi.org/10.1038/s41598-024-63798-y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-024-63798-y
- Springer Nature Limited