Introduction

Diabetes mellitus (DM) is a serious and chronic disease resulted from the pancreatic beta-cells’ insulin secretion disorder1,2,3. In 1980, 108 million persons were diagnosed as diabetes while the number is increased to 463 million (4.2 million death) in 2019 all over the world, which was growing rapidly in the past decade according to the World Health Organization (WHO) and International Diabetes Federation (IDF)1,4. Currently, it has become one of the top 10 causes of death and IDF predicted that the number of DM patients will climb to over 700 million adults by 20454. Moreover, DM can be briefly classified to type 1 (T1DM) and type 2 (T2DM), and the two types are totally different in clinical therapy5. Asia especially China could be considered one dominant area of T2DM due to a large amount of population base6,7. T2DM can result in a number of complications, such as macrovascular diseases, for example, cardiovascular disease (CVD), and microvascular diseases, for example, kidney, the retina and the nervous system diseases7. Even worse, T2DM may cause dementia and cognitive impairment, thereby reducing sensitivity to diabetes complications for T2DM patients8. It is known that the incidence of heart disease such as heart failure (HF), cardiac dysfunction in individuals with T2DM is much higher than those without T2DM9. Specifically, coronary heart disease (CHD) represents one of the most common and severe diabetes complications10.

CHD is a disease of the less blood supplying to heart muscle vessels11 manifested as hyperlipidaemia, myocardial infarction, and angina pectoris12,13,14 and ~ 17.7 million people perished from CHD in 201511,15. Only in the United States, 18.2 million adults over 20 have CHD which take parts 6.7% of total population16 and CHD caused 363,452 death in 201717. It is known that individuals’ basic information like gender and age, and blood test indexes such as blood pressure, total cholesterol (TC), low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C) as well as smoking behaviour, diabetes status can be considered as risk factors of causing CHD18,19,20. Therefore, the early diagnosis of CHD is important while it is not easy21. Given the high prevalence and mortality rate of CHD, it is thus important to predict CHD risk for individuals. For doing so, a number of models for predicting CHD have been proposed using mathematical models like cox regression19,22,23 and machine learning models like neural network15,21,24. These models were designed for the general population, however, a model specifically built for predicting CHD risk in T2DM patients is still not available. Moreover, 68% of the 65-year-old-or-older diabetes patients dead from some form of heart diseases like CHD25 and diabetes patients have 2 to 4 times higher risk of developing CHD than others26. Given the huge number of T2DM patients, it is thus quite important to evaluate the risk of developing CHD for T2DM patients.

In this study, we proposed an AI (random forest) based model to predict the risk of developing CHD for individuals with T2DM. As a result, the predictive model achieved an AUC of 0.77 (fivefold cross-validation) in the training dataset and an AUC of 0.80 in the testing dataset, respectively. Moreover, the model achieved an AUC of 0.71 on a totally independent dataset including 1,253 newly recruited T2DM patients. In addition, a risk contribution model was built to quantify the importance of each feature for a given T2DM individual. Finally, we implemented a web server for the predictive model.

Methods

Study subjects

In this study, all procedures complied with the Helsinki Declaration for investigation of human subjects. The study received ethical approval from the competent Institutional Review Boards of Lu He hospital. All subjects supplied written informed consent.

Datasets

From January 2017 till June 2019, 1,357 subjects with T2DM were recruited in the study. Patients with T2DM were recruited from the Inpatient Department of Endocrinology in Lu He hospital. Exclusion criteria included any history or active treatment of cancer, pregnancy, cognitive inability as judged by the interviewer, any serious medical condition which would prevent long-term participation, the language barrier. Patients with other specific types of diabetes and patients with gestational diabetes mellitus were excluded. Finally, a total of 1,273 patients were enrolled in our study and all of them successfully underwent the medical history-taking included the history of smoking, alcohol, medical treatment, and history of CHD, hypertension and diabetes. All the features included are listed in Table 1. Among the 1,273 samples, 969 are diabetes patients without CHD (negative samples) and 304 are diabetes patients with CHD (positive samples). Next, we randomly selected 4/5 of positive samples and 4/5 of the negative samples as the training dataset. The rest samples are taken as the independent testing dataset. Finally, to confirm the accuracy of the presented predictive model, we recruited 1,253 totally new T2DM patients (200 positive samples and 1,053 negative samples) from the Outpatient Department of Endocrinology in Lu He hospital and determine the related information required by the model. Data is loaded and preprocessed by Pandas 0.2527 which is a package in Python 3.7.

Table 1 Brief description of each feature in the dataset with 1,273 subjects.

Diagnostic criteria

Diagnostic criteria of T2DM are in line with guidelines for the prevention and treatment of type 2 diabetes in China (2017 edition). Diagnostic criteria of CHD meet guidelines for the diagnosis and treatment of stable coronary heart disease (SCAD) (2018 edition). Hypertension was a blood pressure of at least 140 mmHg systolic or 90 mmHg diastolic or use of antihypertensive drugs.

Biochemical measurement

All participants suffered overnight fasting before venous blood samples were drawn. We aim to measure the total and differential white blood cell count, red blood cells, platelets, hemoglobin A1c (HbA1c), serum creatinine (SCr), uric acid (UA), serum triglyceride (TG), TC, LDL-C, HDL-C, fasting blood glucose (FBG), D-dimer, C-reactive protein (CRP), gamma-glutamyl transpeptidase (GGT). We also collected insulin and C-peptide levels of 0, 1, 2, 3 h when patients went through Oral Glucose Tolerance Test (OGTT). All indexes were measured in the central laboratory in Lu He hospital.

Information entropy function-based feature selection method

Information entropy is a conventional concept in information theory which is proposed by C. E. Shannon in 1948 and it can quantitatively describe the information contained in a series of data. Here, we use the information entropy function and Gini impurity function to check the information hidden in each feature. One feature will get a higher score if this series of data contain more information for classification and vice versa. The information entropy function-based feature selection method is implemented by using random forest model with 500 decision trees in sci-kit-learn 0.2228 in Python 3.7. Because the decision tree classifier makes decisions based on the entropy function, the average importance of each feature in 500 decision trees is calculated.

Random forest based predictive model

Random forest (RF)29 is a conventional ensemble model for machine learning. It uses the information entropy function or Gini impurity function for discrimination. Here, we proposed a RF-based predictive model (DCHD, Diabetic Coronary Heart Disease) with Gini impurity as an entropy function, which is also known as Classification And Regression Tree (CART)30. The Gini impurity function of a decision tree node with dataset \(D\) is defined as

$$Gini\left( D \right) = 1 - \sum\limits_{i = 1}^C {p_i^2}$$

where \({p}_{i}\) is the probability of belonging to class \(i\) in dataset \(D\) and \(i=\mathrm{1,2},...,C\). The dataset \(D\) will be divided into 2 subsets on this tree node based on the criterion \(A=a\) which is the minimal Gini gain point defined as

$$Gain(D,a)=\frac{|{D}_{1}|}{|D|}Gini({D}_{1})+\frac{|{D}_{2}|}{|D|}Gini({D}_{2})$$

where \({D}_{i}\) is the subsets after applying division \(A=a\) (\({D}_{1}=\left\{d\in D|d\le a\right\},{D}_{2}=\{d\in D|d>a\}\)). The number of trees is set to 500 and no tree depth limitation is applied to get a more precise and robust model. This model is implemented by using sci-kit-learn 0.2228 in Python 3.7.

Risk contribution model

We also performed an analysis of how much contribution a feature has by using the proportion method to calculate the contribution of each feature for individuals which is described in the following equations.

$${F}_{i}^{k}=\left[{f}_{0}^{k}, {f}_{1}^{k}, \dots ,{f}_{i-1}^{k}, 0, {f}_{i+1}^{k}, \dots , {f}_{m-1}^{k}\right]$$
$${S}_{i}=RF\left({F}^{k}\right)-RF\left({F}_{i}^{k}\right)$$

where \({f}_{i}^{k}\) is the value of the \({i}\)th feature for the \({k}\)th sample and \(m\) is the total number of selected features. So, \({F}_{i}^{k}\) represents the feature vector where the \({i}\)th feature value is zero and \({F}^{k}\) is the original feature vector; \(RF\) represents the probability of developing CHD in the predictive model. The zero value in the latent represents the mean value of that feature because data is standardized.

Note that one contribution can not only be positive but also negative because some of the features can be normal and make the risk probability drop a little bit.

Web server for the predictive model

In order to make the risk prediction model available for all T2DM patients, we built a web server (https://www.cuilab.cn/dchd). The server-side was performed by Django 2.2.5 package in Python 3.7 and the user interface was built using Bootstrap 4 and HTML 5.

Results

Risk features selection and validation

Fewer features normally make it more convenient for T2DM patients and clinicians to use the model and make the model more robust while it could decrease the performance of the model. To find a balance between the prediction robustness, convenience, and accuracy, here information entropy function-based feature selection method is applied to the whole dataset with the total 52 features. All these features are treated as input features of a random forest model and train the model with 500 decision trees. From this model, the entropy function represents the importance of each feature. And we found that the information entropy function and Gini impurity function make little difference in feature selection. Here we choose Gini impurity as the criteria function. The higher scores the feature got, the more information the feature contains. Next, the top 8 features (Age, LDL-C, Course of diabetes, TC, Heart rate, Diastolic pressure, Blood platelet, Course of hypertension) are selected and they contribute 30% among all the features and each of the rest features can do just little contribution (less than 2.3%) on distinguishing CHD from non-CHD in T2DM patients. The information contributions of the selected features are sorted and shown in Fig. 1. A more general and robust model is built using these selected features with almost the same performance compared with the original model. (Fig. 2a,b).

Figure 1
figure 1

The importance scores of the top 8 features calculated by information entropy function-based feature selection model.

Figure 2
figure 2

The ROC and the AUC of the predictive model on both the fivefold cross-validation (blue) and the independent testing dataset (green) using the dataset with all features (a) and using the dataset with top 8 selected features (b).

Performance of the predictive model

A number of metrics are often used to evaluate the prediction performance of machine learning models, such as true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Here, TP and TN are the correctly classified CHD and non-CHD, respectively; FN represents CHD that are misclassified as non-CHD; non-CHD incorrectly classified as CHD is defined as FP. And then several standard performance metrics are applied to describe the model performance based the metrics including accuracy (ACC), true positive rate (TPR) also known as recall rate, false positive rate (FPR), precision rate and F1 score.

$$ACC=\frac{\left(TP+TN\right)}{\left(TP+TN+FP+FN\right)}$$
$$TPR=\frac{TP}{\left(TP+FN\right)}=Recall$$
$$FPR=\frac{FP}{\left(TN+FP\right)}$$
$$Precision=\frac{TP}{TP+FP}$$
$$F1=\frac{2*Precision*Recall}{Precision+Recall}$$

As a result, the presented model achieved an AUC of 0.77 for fivefold cross-validation and an AUC of 0.80 in the independent testing dataset (Fig. 2b). And the performance scores introduced before are listed in Table 2.

Table 2 Performance scores of the predictive model.

Performance of the predictive model on a newly recruited independent testing dataset

To further confirm the robustness and performance of our model, we newly recruited 1767 patients with T2DM from the Outpatient Department of Endocrinology. Among these patients, 1,253 subjects were finally enrolled including 200 with CHD and 1,053 without CHD. As a result, our model achieved an AUC of 0.71 in the newly recruited independent testing dataset (Fig. 3). In addition, the performance scores for this dataset are listed in Table 3.

Figure 3
figure 3

The ROC and AUC of the predictive model on the newly recruited dataset.

Table 3 Performance scores of the predictive model on the newly recruited dataset.

Risk contributions of features

Risk contribution of a feature represents an indication of how much this feature impacts the risk of developing CHD. As a case study, here is a T2DM patient whose data for each feature is as follows, Age: 68, Low-density lipoprotein: 1.92, Course of diabetes: 20, Total cholesterol: 3.32, Heart rate: 65, Diastolic pressure: 67, Platelet count: 340, Course of hypertension: 3. This individual was predicted to be at a high risk to develop to CHD (0.925) using our model (Fig. 4a). And by risk contribution model, the calculated scores of risk factors are as follows, Age: 0.105, Low-density lipoprotein: 0.26, Course of diabetes: 0.125, Total cholesterol: 0.425, Heart rate: 0.3, Diastolic pressure: 0.16, Platelet count: 0.035, Course of hypertension: 0.025 (Fig. 4b). All the result was shown on the web server. In the first bar plot, the length of the red bar represents the probability of developing to CHD and the length of the green bar represents the probability of non-CHD. Moreover, the risk factors’ contributions are sorted and plotted in the bottom figure. The contributions thus can provide advice on daily diet and clinical treatment for individuals.

Figure 4
figure 4

The predicted result of a case study. (a) The predicted CHD risk (red bar) or non-CHD risk (green bar). (b) The predicted feature contributions for the input individual.

Web server

The home page of the webserver is shown as Figure S1. For single instance prediction (Figure S1a), users can input the value of the features needed by the model and click the “Run” button. Then, the model will analyze the input data and output the result in a new page (the result figures like Fig. 4 will be shown). Clicking the “Example” button, the data for a case study will be entered automatically. For multiple instances prediction (Figure S1b), users can paste a CSV format text and click “Run” to get a batch of prediction results.

Discussion

T2DM is a common disease and often resulted in death due to severe complications, among which, CHD is one most common and severe one. Given the huge number of T2DM patients, it thus becomes increasingly critical to quantitatively evaluate the risk for a T2DM patient to develop to CHD in the near future.

In this paper, we presented an AI-based model to predict the risk of developing CHD for T2DM patients. We first proposed a feature selection model to confirm risk factors. Then, a predictive model was built to predict CHD among T2DM patients. As a result, the model achieved an AUC of 0.77 on fivefold cross-validation and 0.80 on an independent testing dataset. Moreover, the presented model achieved an AUC of 0.71 on a newly recruited dataset. Finally, a risk proportion model was built for individuals to analyze the contribution of each feature to CHD risk.

The predictive model is available online for users to do a self-checking and it can be treated as the early-warning of developing CHD if the probability is high enough. Moreover, clinicians can use the online web server as an auxiliary tool to determine potential CHD risk for a T2DM patient. The risk contribution results can also be used by doctors to design personalized treatment strategies for different patients.

Furthermore, the blood test can be considered as a very basic and cheap test in physical checkup and the features used in the predictive model are mostly from the blood test. That is, the risk prediction model is convenient to use for self-checkup and more detailed physical checkup needs to be done if a high risk is reported by the model.

The current model can be improved in the following aspects. Firstly, the used features in this study are limited. Therefore, the model could be improved if more valuable features from other aspects (such as medication history and heart imaging data) are included in the future. Secondly, there may be non-linear compositions of known features (such as age * blood pressure). Although the random forest model is not a linear model, it cannot detect and explain all kinds of non-linear compositions. So, the error accumulates when the number of non-linear compositions increases in real situations. Therefore, the non-linear analysis would be of help in improving this model in the future. Besides, one more important limitation is that both the training dataset and independent validation dataset are from the same hospital and there is no tracking data of the patients with the non-CHD clinical diagnosis but high risks from the prediction model, which may be a source of bias. Hence, the model would be more robust and convincible if the training data are from multi-source and cohort tracking data is included.

In summary, we presented a reliable AI model to predict CHD risk for T2DM patients, which could be of help for precision DM care. Finally, we will continuously update the predictive model to achieve better performance and to provide greater help for the precision medicine of T2DM patients.