Introduction

Coronary artery disease (CAD) is the leading cause of mortality and morbidity worldwide, causing over 8 million deaths each year and placing a huge burden on social care systems [1]. Asians exhibit greater susceptibility to CAD and diabetes mellitus (DM) than Westerners, owing to higher levels of central obesity and total adiposity or a higher degree of insulin resistance and endothelial dysfunction [2,3,4]. Notably, China has the world’s largest number of people with DM, accounting for a quarter of the world’s diabetic patients [5]. As a poor prognostic factor, DM is associated with a two- to four-fold increased risk of developing CAD [6, 7]. More severe and diffuse coronary atherosclerosis has been found in diabetic subjects than in non-diabetic subjects [8]. Similarly, patients with DM are more likely to experience acute myocardial infarction than patients without DM [9]. In other words, DM promotes subsequent cardiovascular disease and worsens outcomes among patients with CAD. Survival prediction for patients with CAD helps clinicians choose appropriate therapeutic strategies, which can significantly improve the prognosis of those patients. If the risk of CAD progression is identified promptly, interventions can be performed in advance and prognosis can be improved more effectively [10]. However, there are few studies on the survival risk factors of older patients with CAD combined with impaired glucose tolerance (IGT) or DM.

Machine learning provides researchers with powerful statistical methods to capture associations between patients’ features and outcomes, allowing for objective integration of data to predict clinical prognosis. It has been shown to be highly effective for prognostic prediction and decision making in many analyses of diverse and massive electronic health record data [11, 12]. Prognostic prediction models have been developed with different machine learning algorithms, and the prediction performance of these models was superior to that of traditional regression-based heart disease prediction [13,14,15]. A recent systematic review identified the neural network as the best predictive model for cardiovascular complications in diabetic patients, followed by the gradient boosting machine model [16]. However, there is still a lack of studies on machine learning for prognostic prediction specifically among older patients with CAD and IGT or DM. Therefore, the purpose of the study was to develop a model to predict one-year mortality among older patients with CAD combined with IGT or DM using machine learning techniques. The logistic regression model and three machine learning algorithms (gradient boosting machine, random forest, and decision tree) were used in the study. In addition, a web-based calculator was developed to present the optimal model and promote clinical application.

Methods

Design and patients

This prospective study analyzed 974 older patients with CAD admitted to the Department of Geriatric Cardiology, Chinese People’s Liberation Army (PLA) General Hospital. Patients were included from October 2007 to July 2011 if they (1) were aged above 60 years, (2) were diagnosed with CAD, and (3) had IGT or DM. Figure 1 shows the flowchart outlining patient enrollment and study design. Chinese PLA General Hospital was the patients’ designated hospital and held their comprehensive medical treatment and final death records, which made it possible to follow them over the long term and to judge the end point accurately. A total of 451 patients with CAD combined with IGT or DM were finally enrolled, and these patients were randomly split 70:30 into a training cohort (n = 308) and a validation cohort (n = 143). Randomization was performed by computer. The training cohort was used to train and optimize models using the logistic regression model and three machine learning algorithms, and the validation cohort was used to test the prediction performance of these models. The study protocol was approved by the Ethics Committee of Chinese People’s Liberation Army General Hospital (Beijing, China) and was conducted in accordance with the Helsinki Declaration of 1975 (as revised in 1983).
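The random split into training and validation cohorts can be illustrated with a minimal Python sketch. The text states only that randomization was done by computer, so the function name `split_cohort`, the fixed seed, and the use of integer patient identifiers are assumptions made for illustration; the cohort sizes (451 enrolled, 143 in validation) are taken from the paragraph above.

```python
import random

def split_cohort(patient_ids, n_validation, seed=0):
    """Randomly partition patient IDs into training and validation lists."""
    rng = random.Random(seed)      # fixed seed is an assumption, for reproducibility
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    validation = shuffled[:n_validation]
    training = shuffled[n_validation:]
    return training, validation

# 451 enrolled patients, 143 assigned to validation (counts from the text);
# the integer IDs 0..450 are placeholders, not study identifiers.
training_ids, validation_ids = split_cohort(range(451), n_validation=143)
print(len(training_ids), len(validation_ids))  # prints: 308 143
```

Fixing the cohort sizes explicitly, rather than passing a 0.3 fraction, keeps the sketch consistent with the reported counts, which are close to but not exactly a 70:30 ratio.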

Figure 1

Flowchart outlining patient enrollment and study design.

Definition and outcome

According to the guidelines of the American College of Cardiology (ACC)/American Heart Association (AHA)/European Society of Cardiology (ESC), CAD was diagnosed by chief physicians based on clinical history, angina symptoms, cardiac markers, and specific examinations, including electrocardiography (rest and exercise), echocardiography, radionuclide imaging, computed tomography, and coronary angiography [17, 18]. IGT was defined as a 2-h glucose ≥ 7.8 mmol/L after a 75-g oral glucose tolerance test (OGTT) [19]. DM was defined as fasting glucose ≥ 7.0 mmol/L, OGTT 2-h glucose ≥ 11.1 mmol/L, or the use of oral hypoglycemic drugs or insulin [20]. The median follow-up was 5.06 years. Given multiple organ failure in the older population and the priority given to all-cause mortality in outcome studies, the primary outcome of the present study was all-cause mortality. One-year mortality was defined as death within one year after discharge. Mortality was determined by telephone interviews and medical records, including legal documents recording the time and place of death.

Data and variables

The present study collected the following patient data: (1) demographics, including age, gender, current smoking status, body mass index (BMI, kg/m²), and heart rate; (2) laboratory examinations, including glucose (mmol/L), hemoglobin (g/dL), high density lipoprotein cholesterol (HDL-C, mmol/L), low density lipoprotein cholesterol (LDL-C, mmol/L), albumin (g/dL), serum creatinine (µmol/L), uric acid (µmol/L), and N-terminal pro-B-type natriuretic peptide (NT-proBNP, pg/mL); (3) comorbidities, including hypertension, chronic heart failure (CHF), and chronic kidney disease (CKD); and (4) drugs used during hospitalization, including aspirin, clopidogrel, beta receptor blockers, calcium channel blockers (CCB), nitrates, angiotensin converting enzyme inhibitors (ACEI) or angiotensin receptor blockers (ARB), and statins. All information was obtained and preserved by trained researchers. To verify the accuracy of the data, other independent researchers performed logical checks and data re-evaluation. Blood tests were conducted at the central laboratory in the Department of Biochemistry, Chinese PLA General Hospital.

Modelling and validation

To predict one-year mortality among these patients, the logistic regression model and three machine learning models (gradient boosting machine, random forest, and decision tree) were developed in the training group. The least absolute shrinkage and selection operator (LASSO) method combined with ten-fold cross-validation was used to screen characteristics associated with one-year mortality, and the selected variables were used as inputs to train and optimize the models. A random hyperparameter search with 5-fold cross-validation and 100 iterations of bootstrapping was performed to determine the model parameters that achieved the best area under the curve (AUC). Patients in the validation group were used to assess the predictive performance of the models using the Brier score, AUC, calibration curve, and decision curve. The following formula was used to calculate the Brier score.

$$\text{Brier score}=\frac{1}{N}\sum_{i=1}^{N}{({p}_{i}-{o}_{i})}^{2}$$

In the formula, \(N\) is the sample size, \({p}_{i}\) is the predicted probability of death within one year for patient \(i\), and \({o}_{i}\) is the observed outcome for that patient (“No” vs. “Yes”, coded as 0 and 1).
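As an illustration of the formula, the following minimal Python sketch computes the Brier score from predicted probabilities and binary outcomes. The function name `brier_score`, the 0/1 coding of outcomes, and the four example values are assumptions for illustration and are not study data.

```python
def brier_score(predicted, observed):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sum(squared_errors) / len(predicted)

# Hypothetical predictions for four patients (1 = died within one year)
predicted = [0.9, 0.2, 0.6, 0.1]
observed = [1, 0, 1, 0]
score = brier_score(predicted, observed)
print(round(score, 3))  # prints: 0.055
```

A score of 0 corresponds to perfect predictions, and a model that always predicts 0.5 scores 0.25, so lower values indicate better overall performance.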

SHapley Additive exPlanations (SHAP) were used to determine feature contributions using the following formula.

$$g\left({z}^{\prime}\right)={\varphi}_{0}+\sum_{j=1}^{M}{\varphi}_{j}{z}_{j}^{\prime}$$

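In the formula, \(g\) is the interpretation model, \({z}^{\prime}\) is the simplified binary input vector whose \(j\)-th entry indicates whether the \(j\)-th feature is present, \(M\) is the number of input features, \({\varphi }_{0}\) is a constant, and \({\varphi }_{j}\) is the attribution value (Shapley value) of the \(j\)-th feature.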
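The additive form of the formula can be checked on a small example. The Python sketch below computes exact Shapley values by enumerating every feature coalition and confirms that the constant plus the sum of attributions reproduces the model output when all features are present. The toy model, its coefficients, the baseline values, and the helper names are hypothetical; this brute-force enumeration is shown only to illustrate the definition and is not the tree-based algorithm used by SHAP software for gradient boosting models.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of the other features."""
    features = range(n_features)
    phi = [0.0] * n_features
    for j in features:
        others = [k for k in features if k != j]
        for size in range(len(others) + 1):
            # Classical Shapley weight |S|! (M - |S| - 1)! / M!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                gain = value_fn(set(subset) | {j}) - value_fn(set(subset))
                phi[j] += weight * gain
    return phi

# Hypothetical toy risk model with one interaction term (not the study model)
BASELINE = [0.0, 0.0, 0.0]   # values used when a feature is "absent"
PATIENT = [1.0, 2.0, 1.0]    # values used when a feature is "present"

def toy_model(x):
    return 0.10 + 0.20 * x[0] - 0.05 * x[1] + 0.10 * x[0] * x[2]

def value_fn(present):
    """Model output with absent features replaced by their baseline value."""
    x = [PATIENT[k] if k in present else BASELINE[k] for k in range(3)]
    return toy_model(x)

phi_0 = value_fn(set())               # constant term: output with no features present
phi = shapley_values(value_fn, 3)     # one attribution per feature
reconstructed = phi_0 + sum(phi)      # g(z') with every entry of z' equal to 1
```

In this example the interaction between the first and third features is shared equally between them, and `reconstructed` equals `toy_model(PATIENT)`, which is the additivity property the formula expresses.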

Development of the web calculator

A web-based calculator for predicting one-year mortality among these patients was developed with the “Streamlit” application (https://share.streamlit.io/) based on the optimal model. To make the web calculator easy to use, three panels were deployed: a panel for choosing and entering model parameters, a panel for displaying the predicted one-year mortality, and a panel introducing the model.

Statistics and implementation

Categorical variables were presented as proportions, and continuous variables that were not normally distributed were presented as median and interquartile range (IQR). Characteristics were compared between patients with and without one-year mortality using Wilcoxon rank-sum tests for non-normally distributed continuous variables and Chi-square or continuity-adjusted Chi-square tests for categorical variables. Patients in the validation group were reclassified into two risk groups (high-risk vs. low-risk) according to the best cut-off value of the optimal model. Traditional statistical analyses were performed using the R programming language (version 4.1, http://www.R-project.org). A two-tailed P value of less than 0.05 was considered significant. Machine learning modelling and interpretation were performed in Jupyter Notebook, an open-source web application, using the Python language (version 3.9).
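For readers unfamiliar with the continuity-adjusted Chi-square test for a 2 × 2 table, the following Python sketch shows the calculation with the Yates correction. The study performed these tests in R, so this is only an illustrative re-expression; the function name `yates_chi_square` and the example counts are hypothetical and not taken from Table 1.

```python
from math import erfc, sqrt

def yates_chi_square(a, b, c, d):
    """Continuity-adjusted Chi-square statistic and P value for a 2x2 table.

    Table layout: [[a, b], [c, d]], with rows as groups and columns as yes/no.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    # Yates correction: shrink |ad - bc| by n/2, but never below zero
    adjusted = max(abs(a * d - b * c) - n / 2, 0.0)
    statistic = n * adjusted ** 2 / (row1 * row2 * col1 * col2)
    # Upper-tail probability of the Chi-square distribution with 1 degree of freedom
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: drug use (yes/no) in patients who died vs. survived
statistic, p_value = yates_chi_square(10, 20, 30, 40)
print(round(statistic, 4))  # prints: 0.4464
```

The P value uses the identity that, for one degree of freedom, the upper tail of the Chi-square distribution equals `erfc(sqrt(x / 2))`, which avoids any dependency outside the standard library.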

Results

Patient’s basic characteristics

The median age of the entire cohort was 86.00 [82.00, 89.00] years (Table 1). The majority of patients were male (89.6%). The comorbidity burden was relatively heavy: 87.1% of patients had hypertension, 85.1% had DM, 71.0% had chronic CAD, 31.0% had CHF, and 36.6% had CKD. Drug use during hospitalization was common among these patients. Aspirin, clopidogrel, beta receptor blockers, CCB, nitrates, ACEI/ARB, and statins were administered to 46.3%, 62.7%, 73.2%, 71.4%, 85.4%, 57.6%, and 67.2% of patients, respectively. The one-year mortality was 26.83% among all patients. Clinical characteristics were then compared between patients who died within one year and those who did not. Patients who died within one year tended to be older and to have a lower BMI and a higher heart rate (P < 0.05 for all). They also had higher glucose, serum creatinine, uric acid, and NT-proBNP levels and lower hemoglobin, HDL-C, LDL-C, and albumin levels (P < 0.05 for all). In addition, they had higher proportions of comorbidities such as DM, CHF, and CKD (P < 0.05 for all). Regarding drugs administered during hospitalization, patients who died within one year received aspirin, clopidogrel, ACEI/ARB, and statins significantly less often (P < 0.05 for all).

Table 1 Patient’s demographics, tests, comorbidities, and drugs

Development of models

According to the least absolute shrinkage and selection operator (LASSO) method with ten-fold cross-validation (Fig. 2A, B), seven clinical characteristics were identified as significantly associated with one-year mortality: hemoglobin, HDL-C, albumin, serum creatinine, NT-proBNP, CHF, and statins. Based on the LASSO method, serum creatinine, NT-proBNP, and CHF were risk factors, whereas hemoglobin, HDL-C, albumin, and statins were protective factors. Correlation coefficients between these clinical characteristics were calculated and presented as a correlation matrix (Fig. 2C). No serious collinearity existed because all the correlation coefficients were below 0.80. Thus, the seven clinical characteristics served as model predictors to establish models using the logistic regression and three machine learning models. The optimal hyperparameters for each model were obtained after training and optimizing the models (Table 2).

Figure 2

A: The least absolute shrinkage and selection operator (LASSO) coefficient profiles of all variables; B: Selection of appropriate parameters; C: Correlation coefficients between clinical characteristics; D: Overall performance of each model; E: Calibration curve of each model; F: Clinical decision curve analysis of each model; G: The heatmap of SHAP value; H: Analysis of feature importance based on SHAP summary plot; I: The screenshot of online calculator.

Table 2 Full hyperparameters of each technique

Validation of models

Regarding the overall performance of each model, the Brier score was 0.116 for the logistic regression model, 0.114 for the gradient boosting machine model, 0.143 for the decision tree model, and 0.126 for the random forest model. The corresponding AUC was 0.827 [95% confidence interval (CI): 0.731–0.924], 0.836 (95% CI: 0.743–0.929), 0.760 (95% CI: 0.652–0.869), and 0.829 (95% CI: 0.728–0.930), respectively (Fig. 2D). These findings indicated that the gradient boosting machine model was the optimal model in terms of the Brier score and AUC. The calibration curves showed that all models, in particular the gradient boosting machine model, had favorable calibration (Fig. 2E). Clinical decision curve analysis showed that the gradient boosting machine model and the logistic regression model both had favorable clinical usefulness (Fig. 2F). Based on the gradient boosting machine model, the study further analyzed the feature importance of the significant clinical characteristics in the whole population. The individual SHAP values and the mean SHAP values both showed that the top three features associated with one-year mortality were NT-proBNP, albumin, and statins (Fig. 2G, H).
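The AUC values reported above measure discrimination: the probability that a randomly chosen patient who died is assigned a higher predicted risk than a randomly chosen survivor. The following Python sketch computes the AUC directly from that pairwise definition. The function name `auc` and the six example predictions are hypothetical and not study data; the confidence intervals reported in the text are not reproduced here.

```python
def auc(predicted, observed):
    """AUC as the share of (death, survivor) pairs ranked correctly; ties count half."""
    positives = [p for p, o in zip(predicted, observed) if o == 1]
    negatives = [p for p, o in zip(predicted, observed) if o == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p_pos in positives:
        for p_neg in negatives:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted risks for six patients (1 = died within one year)
predicted = [0.9, 0.2, 0.6, 0.1, 0.4, 0.3]
observed = [1, 0, 1, 0, 0, 1]
value = auc(predicted, observed)
print(round(value, 3))  # prints: 0.889
```

In this example, eight of the nine death-survivor pairs are ordered correctly, giving an AUC of 8/9. The pairwise loop is quadratic in the number of patients, which is acceptable for a validation cohort of this size.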

Optimal web-based calculator

Because the gradient boosting machine model was the optimal model in the study, we developed a web-based calculator for predicting one-year mortality with this model in the “Streamlit” application (https://share.streamlit.io/). The calculator contains a panel for choosing model parameters, a panel for obtaining the predicted one-year mortality, and a panel introducing the model (Fig. 2I). The calculator is available at https://starxueshu-online-application1-year-mortality-main-49cye8.streamlitapp.com/.

If the web-based calculator has gone to sleep (shut down) and cannot be opened, users can wake it by clicking ‘Yes, get this app back up!’. The calculator becomes accessible after about 30 s.

Reclassification of patients

Patients were recategorized according to the best cut-off value (26.4%) of the gradient boosting machine model. Patients with a predicted risk of less than 26.4% were assigned to the low-risk group, whereas patients with a predicted risk of 26.4% or above were assigned to the high-risk group. The predicted risk was 7.72% in the low-risk group and 62.31% in the high-risk group; the actual risk was 7.22% (7/97) in the low-risk group and 50.00% (23/46) in the high-risk group. Thus, patients in the high-risk group were nearly seven times more likely to die within one year after discharge than patients in the low-risk group.
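The arithmetic behind the reclassification can be reproduced with a short Python sketch. The cut-off and the event counts are taken from the paragraph above; the function names `risk_group` and `observed_risk` are hypothetical.

```python
CUTOFF = 0.264  # best cut-off value reported for the gradient boosting machine model

def risk_group(predicted_risk):
    """Assign a patient to the low- or high-risk group by predicted risk."""
    return "high" if predicted_risk >= CUTOFF else "low"

def observed_risk(deaths, patients):
    """Observed one-year mortality as a proportion."""
    return deaths / patients

low_risk = observed_risk(7, 97)      # low-risk group in the validation cohort
high_risk = observed_risk(23, 46)    # high-risk group in the validation cohort
risk_ratio = high_risk / low_risk

print(f"{low_risk:.2%} {high_risk:.2%} {risk_ratio:.2f}")  # prints: 7.22% 50.00% 6.93
```

The ratio of 6.93 is the basis for the statement that the high-risk group was nearly seven times more likely to die within one year.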

Discussion

The present study developed an accurate model to predict one-year mortality among older patients with CAD and IGT or DM. Three machine learning techniques were used for analysis, and the gradient boosting machine model showed the best prediction performance among the models compared. The AUC of the model was 0.836, indicating good discriminative ability. In addition, to encourage clinical use of the model, the study proposed a user-friendly web-based application. The LASSO method with ten-fold cross-validation identified seven clinical characteristics that were significantly associated with one-year mortality, with serum creatinine, NT-proBNP, and CHF being risk factors and hemoglobin, HDL-C, albumin, and statins used during hospitalization being protective factors. The SHAP analysis found that the top three features associated with one-year mortality were NT-proBNP, albumin, and statins.

In the past two decades, the incidence of DM in China has risen significantly. Half of the adults in China (50.1%) are thought to have prediabetes, that is, impaired fasting glucose and/or IGT [21, 22]. In addition, the prognosis of CAD is worse in patients with DM than in nondiabetic patients, and IGT is also a risk factor for cardiovascular diseases [23, 24]. Most current studies on the prognosis of CAD focus on patients’ overall survival and risk factors. Studies on the risk of short-term mortality in older patients with CAD combined with IGT or DM are still rare. The present study developed a new model to help clinicians identify patients at risk of short-term mortality early enough for warning and treatment, with the aim of improving the survival of older patients with CAD.

In the present study, the LASSO method with ten-fold cross-validation identified seven clinical characteristics that were significantly associated with one-year mortality, with serum creatinine, NT-proBNP, and CHF being risk factors and hemoglobin, HDL-C, albumin, and statins used during hospitalization being protective factors. The SHAP analysis found that the top three features associated with one-year mortality were NT-proBNP, albumin, and statins. NT-proBNP is a well-known biomarker for CAD, and it may facilitate early diagnosis of heart failure and stratification of cardiac risk. NT-proBNP has been reported to be a significant predictor of mortality for cardiovascular diseases [25]. Elevated NT-proBNP levels are associated with left ventricular insufficiency and poorer clinical prognosis in patients with heart failure and other related diseases. These observations have since been extended to patients with CAD, in whom elevated NT-proBNP levels may also result from myocardial ischemia [26]. As demonstrated previously, increasing quartiles of NT-proBNP were strongly related to an increase in the odds of one-year mortality [27]. In a study of patients with CAD, NT-proBNP levels were strongly associated with one-year mortality for the fourth versus the first quartile [28].

Serum albumin is an acute phase protein whose serum levels are affected by nutritional status, inflammatory responses, and fluid status, and it exerts a protective effect in atherosclerosis via its anti-inflammatory, anti-oxidant, and anti-thrombotic roles [29]. Several studies have revealed that low serum albumin levels predict adverse outcomes in the general population, as well as in patients with CAD [30]. A previous meta-analysis demonstrated that low serum albumin levels are associated with an increased risk of death, not only in subjects free from cardiovascular diseases, but also in patients who have already experienced cardiovascular diseases [31]. These findings suggest that serum albumin could be a potential prognostic biomarker for cardiovascular diseases. Older patients are often at risk of malnutrition, infection, fluid overload, and liver and kidney insufficiency, all of which lead to the development of hypoproteinemia and an increased risk of short-term mortality. Including serum albumin in the model helps clinicians identify potential risks in older patients with CAD in a timely manner, thus assisting clinical decision-making and therapy.

It is well known that statins can improve clinical prognosis in the primary and secondary prevention of CAD. Abnormal blood glucose metabolism, including IGT and DM, has become increasingly common, and dyslipidemia often accompanies it. As a key risk factor for CAD, DM can aggravate the progression of atherosclerosis and lead to adverse clinical outcomes [32, 33]. Abnormal inflammatory responses and abnormal lipid metabolism are common to CAD and DM and are risk factors for both, and glucose metabolism is closely related to lipid metabolism [34]. Therefore, lipid metabolism and its management play an important role in patients with CAD combined with IGT or DM.

HDL-C is one of the important prognostic markers for patients with CAD, and epidemiologic studies have demonstrated an inverse relationship between HDL-C levels and CAD occurrence. HDL-C mediates reverse cholesterol transport to regulate the atherosclerotic process and exhibits anti-inflammatory, anti-oxidant, and anti-thrombotic effects on the vascular system. Studies have reported that elevated HDL-C levels after treatment with statins can reduce the incidence of death [35]. The seven parameters included in our model are all common clinical indicators that are accessible and convenient to obtain, which supports the clinical application and popularization of the model. For high-risk patients, we should intervene in advance to control adverse prognostic factors and reduce the risk of death among patients with CAD combined with IGT or DM.

Previous studies have developed models to predict survival outcome among patients with CAD using machine learning techniques. For instance, Sherazi et al. [36] developed a machine learning-based one-year mortality prediction model using 22 continuous variables, 43 categorical variables, and 4 discrete variables. Although the AUC of the best machine learning model reached 0.898, the large number of variables may restrict the clinical application of the model. In addition, no web-based calculator was available in that study. Ke et al. [10] showed that the gradient boosting decision tree had the best performance, with an AUC of 0.918, but the model was used to predict in-hospital mortality among patients with CAD. In the present study, the AUC of our model was 0.836, indicating good discriminative ability. To encourage clinical use of the model, the study also proposed a user-friendly web-based application. In clinical settings, users can enter the value of each feature according to the patient’s condition in the parameter panel, and the probability of one-year mortality is then presented in the results panel after all parameters are submitted. Based on the reclassification of patients, patients in the high-risk group were nearly seven times more likely to die within one year. Therefore, for patients in the high-risk group, effective measures need to be taken early based on the findings of the study, including regulating the levels of hemoglobin, albumin, serum creatinine, and NT-proBNP, treating CHF, and administering statins to appropriate patients.

The present study had some limitations. First, although this study analyzed up to 26 variables, respiratory infections, which may have a significant impact on survival outcome, were not included in the analysis. Second, because the model was not externally validated in the present study, its generalizability to other cohorts needs future investigation.

Conclusions

The present study proposes an accurate model that is able to identify patients at high risk of one-year mortality. The gradient boosting machine model demonstrates promising prediction performance. Interventions targeting NT-proBNP and albumin levels, together with statin use during hospitalization, may be beneficial for improving survival outcome among patients with CAD combined with IGT or DM.