Abstract
Type 2 diabetes mellitus (T2DM) is one common chronic disease caused by insulin secretion disorder that often leads to severe outcomes and even death due to complications, among which coronary heart disease (CHD) represents the most common and severe one. Given a huge number of T2DM patients, it is thus increasingly important to identify the ones with high risks of CHD complication but the quantitative method is still not available. Here, we first curated a dataset of 1,273 T2DM patients including 304 and 969 ones with or without CHD, respectively. We then trained an artificial intelligence (AI) model using randomly selected 4/5 of the dataset and use the rest data to validate the performance of the model. The result showed that the model achieved an AUC of 0.77 (fivefold cross-validation) on the training dataset and 0.80 on the testing dataset. To further confirm the performance of the presented model, we recruited 1,253 new T2DM patients as totally independent testing dataset including 200 and 1,053 ones with or without CHD. And the model achieved an AUC of 0.71. In addition, we implemented a model to quantitatively evaluate the risk contribution of each feature, which is thus able to present personalized guidance for specific individuals. Finally, an online web server for the model was built. This study presented an AI model to determine the risk of T2DM patients to develop to CHD, which has potential value in providing early warning personalized guidance of CHD risk for both T2DM patients and clinicians.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Introduction
Diabetes mellitus (DM) is a serious and chronic disease resulted from the pancreatic beta-cells’ insulin secretion disorder1,2,3. In 1980, 108 million persons were diagnosed as diabetes while the number is increased to 463 million (4.2 million death) in 2019 all over the world, which was growing rapidly in the past decade according to the World Health Organization (WHO) and International Diabetes Federation (IDF)1,4. Currently, it has become one of the top 10 causes of death and IDF predicted that the number of DM patients will climb to over 700 million adults by 20454. Moreover, DM can be briefly classified to type 1 (T1DM) and type 2 (T2DM), and the two types are totally different in clinical therapy5. Asia especially China could be considered one dominant area of T2DM due to a large amount of population base6,7. T2DM can result in a number of complications, such as macrovascular diseases, for example, cardiovascular disease (CVD), and microvascular diseases, for example, kidney, the retina and the nervous system diseases7. Even worse, T2DM may cause dementia and cognitive impairment, thereby reducing sensitivity to diabetes complications for T2DM patients8. It is known that the incidence of heart disease such as heart failure (HF), cardiac dysfunction in individuals with T2DM is much higher than those without T2DM9. Specifically, coronary heart disease (CHD) represents one of the most common and severe diabetes complications10.
CHD is a disease of the less blood supplying to heart muscle vessels11 manifested as hyperlipidaemia, myocardial infarction, and angina pectoris12,13,14 and ~ 17.7 million people perished from CHD in 201511,15. Only in the United States, 18.2 million adults over 20 have CHD which take parts 6.7% of total population16 and CHD caused 363,452 death in 201717. It is known that individuals’ basic information like gender and age, and blood test indexes such as blood pressure, total cholesterol (TC), low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C) as well as smoking behaviour, diabetes status can be considered as risk factors of causing CHD18,19,20. Therefore, the early diagnosis of CHD is important while it is not easy21. Given the high prevalence and mortality rate of CHD, it is thus important to predict CHD risk for individuals. For doing so, a number of models for predicting CHD have been proposed using mathematical models like cox regression19,22,23 and machine learning models like neural network15,21,24. These models were designed for the general population, however, a model specifically built for predicting CHD risk in T2DM patients is still not available. Moreover, 68% of the 65-year-old-or-older diabetes patients dead from some form of heart diseases like CHD25 and diabetes patients have 2 to 4 times higher risk of developing CHD than others26. Given the huge number of T2DM patients, it is thus quite important to evaluate the risk of developing CHD for T2DM patients.
In this study, we proposed an AI (random forest) based model to predict the risk of developing CHD for individuals with T2DM. As a result, the predictive model achieved an AUC of 0.77 (fivefold cross-validation) in the training dataset and an AUC of 0.80 in the testing dataset, respectively. Moreover, the model achieved an AUC of 0.71 on a totally independent dataset including 1,253 newly recruited T2DM patients. In addition, a risk contribution model was built to quantify the importance of each feature for a given T2DM individual. Finally, we implemented a web server for the predictive model.
Methods
Study subjects
In this study, all procedures complied with the Helsinki Declaration for investigation of human subjects. The study received ethical approval from the competent Institutional Review Boards of Lu He hospital. All subjects supplied written informed consent.
Datasets
From January 2017 till June 2019, 1,357 subjects with T2DM were recruited in the study. Patients with T2DM were recruited from the Inpatient Department of Endocrinology in Lu He hospital. Exclusion criteria included any history or active treatment of cancer, pregnancy, cognitive inability as judged by the interviewer, any serious medical condition which would prevent long-term participation, the language barrier. Patients with other specific types of diabetes and patients with gestational diabetes mellitus were excluded. Finally, a total of 1,273 patients were enrolled in our study and all of them successfully underwent the medical history-taking included the history of smoking, alcohol, medical treatment, and history of CHD, hypertension and diabetes. All the features included are listed in Table 1. Among the 1,273 samples, 969 are diabetes patients without CHD (negative samples) and 304 are diabetes patients with CHD (positive samples). Next, we randomly selected 4/5 of positive samples and 4/5 of the negative samples as the training dataset. The rest samples are taken as the independent testing dataset. Finally, to confirm the accuracy of the presented predictive model, we recruited 1,253 totally new T2DM patients (200 positive samples and 1,053 negative samples) from the Outpatient Department of Endocrinology in Lu He hospital and determine the related information required by the model. Data is loaded and preprocessed by Pandas 0.2527 which is a package in Python 3.7.
Diagnostic criteria
Diagnostic criteria of T2DM are in line with guidelines for the prevention and treatment of type 2 diabetes in China (2017 edition). Diagnostic criteria of CHD meet guidelines for the diagnosis and treatment of stable coronary heart disease (SCAD) (2018 edition). Hypertension was a blood pressure of at least 140 mmHg systolic or 90 mmHg diastolic or use of antihypertensive drugs.
Biochemical measurement
All participants suffered overnight fasting before venous blood samples were drawn. We aim to measure the total and differential white blood cell count, red blood cells, platelets, hemoglobin A1c (HbA1c), serum creatinine (SCr), uric acid (UA), serum triglyceride (TG), TC, LDL-C, HDL-C, fasting blood glucose (FBG), D-dimer, C-reactive protein (CRP), gamma-glutamyl transpeptidase (GGT). We also collected insulin and C-peptide levels of 0, 1, 2, 3 h when patients went through Oral Glucose Tolerance Test (OGTT). All indexes were measured in the central laboratory in Lu He hospital.
Information entropy function-based feature selection method
Information entropy is a conventional concept in information theory which is proposed by C. E. Shannon in 1948 and it can quantitatively describe the information contained in a series of data. Here, we use the information entropy function and Gini impurity function to check the information hidden in each feature. One feature will get a higher score if this series of data contain more information for classification and vice versa. The information entropy function-based feature selection method is implemented by using random forest model with 500 decision trees in sci-kit-learn 0.2228 in Python 3.7. Because the decision tree classifier makes decisions based on the entropy function, the average importance of each feature in 500 decision trees is calculated.
Random forest based predictive model
Random forest (RF)29 is a conventional ensemble model for machine learning. It uses the information entropy function or Gini impurity function for discrimination. Here, we proposed a RF-based predictive model (DCHD, Diabetic Coronary Heart Disease) with Gini impurity as an entropy function, which is also known as Classification And Regression Tree (CART)30. The Gini impurity function of a decision tree node with dataset \(D\) is defined as
where \({p}_{i}\) is the probability of belonging to class \(i\) in dataset \(D\) and \(i=\mathrm{1,2},...,C\). The dataset \(D\) will be divided into 2 subsets on this tree node based on the criterion \(A=a\) which is the minimal Gini gain point defined as
where \({D}_{i}\) is the subsets after applying division \(A=a\) (\({D}_{1}=\left\{d\in D|d\le a\right\},{D}_{2}=\{d\in D|d>a\}\)). The number of trees is set to 500 and no tree depth limitation is applied to get a more precise and robust model. This model is implemented by using sci-kit-learn 0.2228 in Python 3.7.
Risk contribution model
We also performed an analysis of how much contribution a feature has by using the proportion method to calculate the contribution of each feature for individuals which is described in the following equations.
where \({f}_{i}^{k}\) is the value of the \({i}\)th feature for the \({k}\)th sample and \(m\) is the total number of selected features. So, \({F}_{i}^{k}\) represents the feature vector where the \({i}\)th feature value is zero and \({F}^{k}\) is the original feature vector; \(RF\) represents the probability of developing CHD in the predictive model. The zero value in the latent represents the mean value of that feature because data is standardized.
Note that one contribution can not only be positive but also negative because some of the features can be normal and make the risk probability drop a little bit.
Web server for the predictive model
In order to make the risk prediction model available for all T2DM patients, we built a web server (https://www.cuilab.cn/dchd). The server-side was performed by Django 2.2.5 package in Python 3.7 and the user interface was built using Bootstrap 4 and HTML 5.
Results
Risk features selection and validation
Fewer features normally make it more convenient for T2DM patients and clinicians to use the model and make the model more robust while it could decrease the performance of the model. To find a balance between the prediction robustness, convenience, and accuracy, here information entropy function-based feature selection method is applied to the whole dataset with the total 52 features. All these features are treated as input features of a random forest model and train the model with 500 decision trees. From this model, the entropy function represents the importance of each feature. And we found that the information entropy function and Gini impurity function make little difference in feature selection. Here we choose Gini impurity as the criteria function. The higher scores the feature got, the more information the feature contains. Next, the top 8 features (Age, LDL-C, Course of diabetes, TC, Heart rate, Diastolic pressure, Blood platelet, Course of hypertension) are selected and they contribute 30% among all the features and each of the rest features can do just little contribution (less than 2.3%) on distinguishing CHD from non-CHD in T2DM patients. The information contributions of the selected features are sorted and shown in Fig. 1. A more general and robust model is built using these selected features with almost the same performance compared with the original model. (Fig. 2a,b).
Performance of the predictive model
A number of metrics are often used to evaluate the prediction performance of machine learning models, such as true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Here, TP and TN are the correctly classified CHD and non-CHD, respectively; FN represents CHD that are misclassified as non-CHD; non-CHD incorrectly classified as CHD is defined as FP. And then several standard performance metrics are applied to describe the model performance based the metrics including accuracy (ACC), true positive rate (TPR) also known as recall rate, false positive rate (FPR), precision rate and F1 score.
As a result, the presented model achieved an AUC of 0.77 for fivefold cross-validation and an AUC of 0.80 in the independent testing dataset (Fig. 2b). And the performance scores introduced before are listed in Table 2.
Performance of the predictive model on a newly recruited independent testing dataset
To further confirm the robustness and performance of our model, we newly recruited 1767 patients with T2DM from the Outpatient Department of Endocrinology. Among these patients, 1,253 subjects were finally enrolled including 200 with CHD and 1,053 without CHD. As a result, our model achieved an AUC of 0.71 in the newly recruited independent testing dataset (Fig. 3). In addition, the performance scores for this dataset are listed in Table 3.
Risk contributions of features
Risk contribution of a feature represents an indication of how much this feature impacts the risk of developing CHD. As a case study, here is a T2DM patient whose data for each feature is as follows, Age: 68, Low-density lipoprotein: 1.92, Course of diabetes: 20, Total cholesterol: 3.32, Heart rate: 65, Diastolic pressure: 67, Platelet count: 340, Course of hypertension: 3. This individual was predicted to be at a high risk to develop to CHD (0.925) using our model (Fig. 4a). And by risk contribution model, the calculated scores of risk factors are as follows, Age: 0.105, Low-density lipoprotein: 0.26, Course of diabetes: 0.125, Total cholesterol: 0.425, Heart rate: 0.3, Diastolic pressure: 0.16, Platelet count: 0.035, Course of hypertension: 0.025 (Fig. 4b). All the result was shown on the web server. In the first bar plot, the length of the red bar represents the probability of developing to CHD and the length of the green bar represents the probability of non-CHD. Moreover, the risk factors’ contributions are sorted and plotted in the bottom figure. The contributions thus can provide advice on daily diet and clinical treatment for individuals.
Web server
The home page of the webserver is shown as Figure S1. For single instance prediction (Figure S1a), users can input the value of the features needed by the model and click the “Run” button. Then, the model will analyze the input data and output the result in a new page (the result figures like Fig. 4 will be shown). Clicking the “Example” button, the data for a case study will be entered automatically. For multiple instances prediction (Figure S1b), users can paste a CSV format text and click “Run” to get a batch of prediction results.
Discussion
T2DM is a common disease and often resulted in death due to severe complications, among which, CHD is one most common and severe one. Given the huge number of T2DM patients, it thus becomes increasingly critical to quantitatively evaluate the risk for a T2DM patient to develop to CHD in the near future.
In this paper, we presented an AI-based model to predict the risk of developing CHD for T2DM patients. We first proposed a feature selection model to confirm risk factors. Then, a predictive model was built to predict CHD among T2DM patients. As a result, the model achieved an AUC of 0.77 on fivefold cross-validation and 0.80 on an independent testing dataset. Moreover, the presented model achieved an AUC of 0.71 on a newly recruited dataset. Finally, a risk proportion model was built for individuals to analyze the contribution of each feature to CHD risk.
The predictive model is available online for users to do a self-checking and it can be treated as the early-warning of developing CHD if the probability is high enough. Moreover, clinicians can use the online web server as an auxiliary tool to determine potential CHD risk for a T2DM patient. The risk contribution results can also be used by doctors to design personalized treatment strategies for different patients.
Furthermore, the blood test can be considered as a very basic and cheap test in physical checkup and the features used in the predictive model are mostly from the blood test. That is, the risk prediction model is convenient to use for self-checkup and more detailed physical checkup needs to be done if a high risk is reported by the model.
The current model can be improved in the following aspects. Firstly, the used features in this study are limited. Therefore, the model could be improved if more valuable features from other aspects (such as medication history and heart imaging data) are included in the future. Secondly, there may be non-linear compositions of known features (such as age * blood pressure). Although the random forest model is not a linear model, it cannot detect and explain all kinds of non-linear compositions. So, the error accumulates when the number of non-linear compositions increases in real situations. Therefore, the non-linear analysis would be of help in improving this model in the future. Besides, one more important limitation is that both the training dataset and independent validation dataset are from the same hospital and there is no tracking data of the patients with the non-CHD clinical diagnosis but high risks from the prediction model, which may be a source of bias. Hence, the model would be more robust and convincible if the training data are from multi-source and cohort tracking data is included.
In summary, we presented a reliable AI model to predict CHD risk for T2DM patients, which could be of help for precision DM care. Finally, we will continuously update the predictive model to achieve better performance and to provide greater help for the precision medicine of T2DM patients.
References
WHO Global Report. Global Report on Diabetes. Isbn 978, 6–86 (2016).
Sebastiani, G. et al. Circulating microRNAs and diabetes mellitus: A novel tool for disease prediction, diagnosis, and staging?. J. Endocrinol. Investig. 40, 591–610 (2017).
WHO. Diabetes. https://www.who.int/news-room/fact-sheets/detail/diabetes. Accessed 21 Dec 2019.
Internation Diabetes Federation. IDF Diabetes Atlas Ninth (IDF, Dunia, 2019).
American Diabetes Association. 2. Classification and Diagnosis of Diabetes. Diabetes Care 39, S13–S22 (2016).
Ma, R. C. W. Epidemiology of diabetes and diabetic complications in China. Diabetologia 61, 1249–1260 (2018).
Zheng, Y., Ley, S. H. & Hu, F. B. Global aetiology and epidemiology of type 2 diabetes mellitus and its complications. Nat. Rev. Endocrinol. 14, 88–98 (2018).
Zilliox, L. A., Chadrasekaran, K., Kwan, J. Y. & Russell, J. W. Diabetes and cognitive impairment. Curr. Diab. Rep. 16, 87 (2016).
Tan, Y. et al. Mechanisms of diabetic cardiomyopathy and potential therapeutic strategies: Preclinical and clinical evidence. Nat. Rev. Cardiol. https://doi.org/10.1038/s41569-020-0339-2 (2020).
Association American Diabetes. 10. Cardiovascular disease and risk management: Standards of medical care in diabetes—2020. Diabetes Care 43, S111–S134 (2020).
WHO. Cardiovascular diseases (CVDs). WHO https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds). Accessed 21 Dec 2019.
Maneerat, Y., Prasongsukarn, K., Benjathummarak, S., Dechkhajorn, W. & Chaisri, U. Intersected genes in hyperlipidemia and coronary bypass patients: Feasible biomarkers for coronary heart disease. Atherosclerosis 252, e183–e184 (2016).
Nakashima, T. et al. Prognostic impact of spontaneous coronary artery dissection in young female patients with acute myocardial infarction: A report from the Angina Pectoris-Myocardial Infarction Multicenter Investigators in Japan. Int. J. Cardiol. 207, 341–348 (2016).
Zebrack, J. S. et al. Usefulness of high-sensitivity C-Reactive protein in predicting long-term risk of death or acute myocardial infarction in patients with unstable or stable angina pectoris or acute myocardial infarction. Am. J. Cardiol. 89, 145–149 (2002).
Kim, J. K. & Kang, S. Neural network-based coronary heart disease risk prediction using feature correlation analysis. J. Healthc. Eng. 2017, 1–13 (2017).
Fryar, C. D., Chen, T.-C. & Li, X. Prevalence of Uncontrolled Risk Factors for Cardiovascular Disease: United States, 1999–2010. (2012).
Benjamin, E. J. et al. Heart disease and stroke statistics—2019 update: A report from the American Heart Association. Circulation 139, e56–e528 (2019).
Gordon, T. & Kannel, W. B. Multiple risk functions for predicting coronary heart disease: The concept, accuracy, and application. Am. Heart J. 103, 1031–1039 (1982).
Wilson, P. W. F. et al. Prediction of coronary heart disease using risk factor categories. Circulation 97, 1837–1847 (1998).
Gordon, T. Diabetes, blood lipids, and the role of obesity in coronary heart disease risk for women. Ann. Intern. Med. 87, 393 (1977).
Narain, R., Saxena, S. & Goyal, A. Cardiovascular risk prediction: A comparative study of Framingham and quantum neural network based approach. Patient Prefer. Adherence 10, 1259–1270 (2016).
Nishimura, K. et al. Predicting coronary heart disease using risk factor categories for a Japanese urban population, and comparison with the Framingham risk score: The suita study. J. Atheroscler. Thromb. 21, 784–798 (2014).
Onat, A. Algorithm for predicting CHD death risk in Turkish adults: Conventional factors contribute only moderately in women. Anatol. J. Cardiol. 17, 436–444 (2017).
Arabasadi, Z., Alizadehsani, R., Roshanzamir, M., Moosaei, H. & Yarifard, A. A. Computer aided decision making for heart disease detection using hybrid neural network-Genetic algorithm. Comput. Methods Programs Biomed. 141, 19–26 (2017).
American Heart Association. Cardiovascular Disease and Diabetes. https://www.heart.org/en/health-topics/diabetes/why-diabetes-matters/cardiovascular-disease--diabetes. Accessed 25 Mar 2020.
Kannel, W. B. Diabetes and cardiovascular disease. The Framingham study. JAMA J. Am. Med. Assoc. 241, 2035–2038 (1979).
McKinney, W. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference 51–56 (2010).
Varoquaux, G. et al. Scikit-learn. GetMobile Mob Comput. Commun. 19, 29–33 (2015).
Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20, 832–844 (1998).
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. Classification and Regression Trees. (Routledge, 2017). https://doi.org/10.1201/9781315139470.
Funding
This work was supported by the grants from the National Key R&D Program (2019YFC2004704), the Natural Science Foundation of China (81670462, 81970440, 81921001, 81800768, 81800723), the Peking University Basic Research Program (BMU2020JC001), the Peking University Clinical Scientist Program (BMU2019LCKXJ001), and the Fundamental Research Funds for the Central Universities.
Author information
Authors and Affiliations
Contributions
Q.C. and D.Z. conceived the project. R.F. performed the analysis and built the predictive model. N.Z., L.Y., and J.K. curated the dataset. RF wrote the draft manuscript. R.F., Q.C., N.Z. and D.Z. edited the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fan, R., Zhang, N., Yang, L. et al. AI-based prediction for the risk of coronary heart disease among patients with type 2 diabetes mellitus. Sci Rep 10, 14457 (2020). https://doi.org/10.1038/s41598-020-71321-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-020-71321-2
- Springer Nature Limited
This article is cited by
-
Machine learning in precision diabetes care and cardiovascular risk prediction
Cardiovascular Diabetology (2023)
-
Utility of an Artificial Intelligence Enabled Electrocardiogram for Risk Assessment in Liver Transplant Candidates
Digestive Diseases and Sciences (2023)
-
MIDO GDM: an innovative artificial intelligence-based prediction model for the development of gestational diabetes in Mexican women
Scientific Reports (2023)