Introduction

Whereas complications after cardiac operations are associated with increased risk of in-hospital mortality, only few predict long-term mortality. The best documented is post-operative acute kidney injury (AKI), a highly prevalent complication occurring in 15–30% of patients1,2 which is associated with both increased short- and long-term mortality1,2,3,4. The relation between postoperative AKI and mortality varies greatly per type of cardiac operation. Mortality risks related to AKI are well characterized for coronary artery bypass grafting (CABG), but less well studied in valve operations, despite these accounting for 24% of all cardiac operations and having higher mortality rates5,6. Recently, Bouma et al.5, showed post-operative AKI to be strongly associated with an increase in long-term mortality in patients with solitary valve and combined valve and CABG operations. Remarkably, even a mild impairment in renal function well below the threshold for AKI-1 (i.e., a mere 10% post-operative increase in serum creatinine) significantly increased long-term mortality risk in solitary valve operations5. Therefore, to date postoperative AKI represents the best studied organ injury related early marker of long-term mortality risk after cardiac operations.

Previously, we have demonstrated that machine learning (ML) predictive models proved superior to classical multivariable analysis in identifying patients at increased risk of long-term mortality after CABG operations7. Moreover, a unique property of ML is its ability to identify parameters predicting mortality and rank their importance by variable importance analysis. However, while ML analyses gain popularity in peri-operative care8, studies using ML techniques for long-term mortality analysis after cardiac valve operations are lacking. Several studies in different fields of healthcare have shown ensemble ML algorithms to be more accurate than individual algorithms in modelling complex outcomes such as mortality in critically ill patients9 and mortality following cardiac arrest10. In anesthesiology, recent studies showed that different machine learning algorithms could accurately predict acute hypotensive episodes 10 min in advance using patient characteristics and physiological variables11,12,13.

In this study, we combined multiple ML algorithms into an ensemble using the Super Learner (SL) algorithm14. This ensemble ML algorithm was trained to predict 5-year mortality in a large prospective cohort of patients undergoing cardiac valve, CABG, or combined operations using routinely collected peri-operative data in a single tertiary care hospital. We compared the accuracy of two SL training methodologies, using a targeted approach with patients split per operation type compared to the entire, unselected population. Furthermore, variable importance analysis was conducted to identify the strongest predictors of mortality.

Results

Patient characteristics and mortality per operation type

Patient characteristics, descriptives of all variables used in this study and mortality data per operation type are summarized in Table 1 (and Table 1 of the “Supplementary material”). Five years mortality rate of the full patient cohort was 16.5%. Operations involving valve procedures showed higher mortality amounting 16.9% for aortic valve alone, 19.7% for mitral valve alone, 21.0% for combined aortic valve/CABG and 28.9% for combined mitral valve/CABG (Table 1). Accordingly, mortality rate for CABG-only (13.8%) was lower than for the entire cohort.

Table 1 Descriptives table per operation type.

Machine learning analysis

As a first step in the ML based prediction of long-term mortality, the ensemble was trained on the full cohort (SL1; Fig. 5, left part). ROC curves and their respective AUROCs were established for the full cohort and the different cardiac operation types (Fig. 1). SL1 achieved an AUROC of 0.810 [0.798–0.823]. When analyzed per operation type, the accuracy of SL1 was highest for solitary mitral valve (0.846) and solitary aortic valve operations (0.838), and lowest for solitary CABG (0.784) and mitral valve/CABG (0.796). In addition, the comparison between SL1 and the trained GLM showed that the SL1 significantly outperformed GLM (AUROC 0.756 [0.725–0.787]) for the full cohort (P = 0.0016; Fig. 1) as well as for solitary aortic valve and combined aortic valve and CABG (P < 0.01; Table 2 in the “Supplementary material”). Thus, SL1 produced sound long-term mortality prediction based on peri-operative routinely collected patient and operation data.

Figure 1
figure 1

Plot of the receiver operating characteristic (ROC) curves and the respective areas under curve (AUCs) for the weighted Super Learner 1 for each of the 5 types of operation and for the whole cohort. Plot of the receiver operating characteristic (ROC) curves and the respective areas under curve (AUCs) for the weighted Super Learner and the generalized linear model (GLM) for the whole cohort. SL super learner, CABG coronary artery bypass grafting.

Next, we performed a similar analysis based on SL training per operation type, by making five training sets using 80% of the relevant patients to train five weighted ensembles (SL2–SL6). Comparison of AUROCs between SL1 versus SL2–6, showed identical ranking for specific operation subgroups. Predictive performance between the models generated by SL1 compared those from SL2 to SL6 did not differ (Fig. 1; Table 2 in the “Supplementary material”). SL3 and SL4 also outperformed GLM (P < 0.01; Table 4 in the “Supplementary material”). Lastly, because of its potential ability to identify patients at high risk prior to surgery, we examined the predictive performance when only pre-operative data are included. As expected, the model trained only on pre-operative data showed inferior performance to the full peri-operative model (AUROC 0.718 [0.687–0.749], P < 0.01, Fig. 12 in the “Supplementary material”).

Calibration, sensitivity analysis and adjusted risk thresholds based on predicted probability of mortality

Calibration of SL1 and SL2–6 was good for most models (Table 5 and Figs. 111 of the “Supplementary material”). Using the adjusted thresholds based on the Youden index and on a 50% increased risk of mortality lead to improved model sensitivity and specificity (Fig. 2). For all operations, the thresholds based on the Youden index approximated the baseline absolute mortality risk. Compared to the default threshold of 50% mortality risk, both the thresholds based on the Youden index and the thresholds defined by a 50% increased risk of mortality increased sensitivity substantially for all types of operation (Tables 615 of the “Supplementary material”). For the Youden index thresholds, this was paired with a steeper decrease in specificity than for the thresholds at 50% increased risk of mortality. As Table 2 shows, the threshold representing 50% increase in risk improved the number of patients correctly classified as “non-survivor” for all types of operation. The largest increase in correctly classified “non-survivors” was observed for aortic valve, CABG, combined aortic valve and CABG, and for all operations combined (3-, 4.7-, 2.2-, and 3-fold increase).

Figure 2
figure 2

Specificity (blue) and sensitivity (red) values across all possible thresholds for all operations combined. The default 0.50 threshold is marked in grey, the threshold based on the maximized Youden index in black, and the threshold representing a 50% increase in mortality risk in green.

Table 2 Percentage of correctly classified cases in survivors and non-survivors per operation type for SL1 predictions using the default and 50% increase in risk thresholds.

Variable importance analysis

Unexpectedly, variable importance analysis of all operations combined (n = 8142) revealed serum urea at day 4 after operation as the top predictor variable for 5-year mortality (Fig. 3). Serum urea was also found the top predictor in all operation types, except for the smallest group (n = 367), combined mitral valve and CABG operations. Other important predictive variables included patient age, serum urea at other time points, indicators of kidney function, and serum markers for organ damage and inflammation. To better illustrate the impact of the changes in these variable and possible interactions, we constructed probability plots of the two highest ranking variables in all patients (Fig. 4). Mortality risk steeply increased from day 4 urea levels of 10 mmol/L, reaching a plateau at 30 mmol/L denoting a 50% increase in absolute risk compared to baseline. Likewise, mortality risk gradually increased between 60 and 80 years of age. Figure 4 illustrates the combined effect of serum urea day 4 and age on mortality risk.

Figure 3
figure 3

Top ten predictor variables for all types of operations combined. Variable coefficients indicate how much each parameter contributes to the outcome. eCCR estimated creatinine clearance, LDH lactate dehydrogenase, ESR erythrocyte sedimentation rate, ICU intensive care unit, ASAT aspartate transaminase, BMI body mass index.

Figure 4
figure 4

Partial dependence plots of urea at postoperative day 4 and age. Partial dependence plots of urea at postoperative day 4 against age. The vertical bar represents predicted risk (blue to red, low to high).

Discussion

This study shows that ensemble ML analysis achieves a high accuracy in predicting 5-year mortality in a cohort of 8241 patients with CABG and/or valve operations. Moreover, variable importance analysis revealed early postoperative urea as a novel and strong predictor of mortality in all types of cardiac operations. Furthermore, methodologically, a more targeted approach of training the algorithms on sub-groups instead of the full cohort did not significantly improve mortality prediction.

We demonstrated that using an ensemble algorithm with a combination of pre-operative, intra-operative, and first week post-operative data, achieves high accuracy in predicting 5-year mortality after different types of cardiac operations. These findings extend a previous study where we demonstrated the superiority of individual ML models compared to classical multivariable analysis in identifying patients at increased risk of long-term mortality after CABG7. Here, we reaffirm these findings using ensemble ML and data from different types of cardiac operations. Using peri-operative data, we achieved similar accuracy to a recently developed ML-based risk algorithm for prediction of 1- to 24-month mortality following major surgery15. Compared to other models that predict mortality specifically after cardiac surgery, the ensemble achieved superior performance8.

The application of algorithms such as the one we developed to pre-operative data would possibly predict patients at the highest risk of long-term complications prior to surgery. Expectedly, analysis of pre-operative data in the XGBoost model decreased performance significantly, which could be partly due to the limited set of pre-operative data available in our cohort, or to the lower frequency of the outcome (long-term mortality as opposed to short-term post-operative complications). Yet, it should be noted that the model’s performance using our restricted set of pre-operative data has comparable predictive power as currently used clinical scores8.

Methodologically, our study contributed to the discussion on the need of conducting predictive studies on operation-specific cohorts. Results from previous studies suggest that algorithms trained on pooled data from patients undergoing different types of surgeries were accurate in predicting outcomes for all these types of operations. In keeping, our findings show that both the model trained with the full cohort, and the models trained with the individual cardiac operation subgroups showed a good performance in predicting long-term mortality after aortic and mitral valve operations. This finding further questions the need to conduct ML analyses on operation-specific cohorts. Specifically, including full cohorts may lead to better model performance analyses due to the greater amount of data.

Additionally, by providing risk predictions at individual level, ML algorithms allow for the adjustment of the sensitivity and specificity of each model for different clinical settings15. Balancing sensitivity and specificity in the context of mortality risk predictions can be challenging. Lowering the prediction threshold may lead to excessive over-diagnosing and increase in healthcare costs. However, especially in populations with relatively low mortality rates such as cardiac surgery patients, a too high threshold would miss too many “non-survivors”. Here, we demonstrated that using a 50% increase in absolute risk of mortality as cut-off provides a favorable trade-off between false positives and true negatives, as previously shown in similar large studies predicting postoperative mortality and mortality in intensive care patients15,16. Validation of this approach merits further investigation, and may facilitate the translation of an algorithm’s good predictive performance into a clinically useful patient risk stratification tool17.

Variable importance analysis identified postoperative urea as the strongest predictor of 5-year mortality. This is consistent with our previous findings in a CABG-only population7. Yet, literature on the possible role of urea as a mortality predictor in cardiac operations is scarce7. Preoperative urea values above 10 mmol/L have been found to be associated with increased 30-day mortality risk after CABG and with increased risk of stroke in the 10 days after cardiac operations18,19. It should also be noted that, in heart failure patients, increased urea levels have been associated with derangements in cardiac output and renal perfusion20,21. These are, in turn, strongly related to patients’ overall performance status and prognosis, with both urea and the urea/creatinine ratio being known prognostic predictors22. In the context of this study, increased urea may originate from excess production and/or impaired excretion, yet mechanistic insight remains elusive. Possibly, urea production may be increased by mitochondrial dysfunction, caused by ischemia/reperfusion and increased systemic inflammatory response after cardiopulmonary bypass and surgical trauma23. Mitochondrial dysfunction may be amplified through excess reactive oxygen species (ROS) following accumulation of succinate during ischemia24,25. Additionally, recent evidence indicates that high urea levels generate ROS26. Furthermore, renal excretion of urea may decrease in response to kidney injury. Thus, urea likely reflects the compound pathological state of different organ systems, rather than just kidney function.

Lastly, this study also has some limitations to consider. Being a single center study, our findings need confirmation by external validation. Further, our analysis is limited to the variables in the CAROLA database. Detailed co-morbidity information, for instance, could help further improve model performance, especially for the CABG sub-group. Additionally, variable importance analysis as such does not provide directionality and assumptions about effect size between the variables and the outcome cannot be made directly. Finally, the current ensemble ML is not suited to use high-frequency, high-volume data, such as continuous intraoperative measurements of blood pressure, heart rate, oxygen saturation or temperature. Therefore, a study including algorithms suitable for such analysis, such as recurrent neural networks, is a logical follow-up.

In conclusion, ML analysis of 88 routinely collected peri-operative data achieved a high accuracy in predicting 5-year mortality after different cardiac operations in this large study of 8241 patients. A targeted approach of training the algorithms on sub-groups instead of the full cohort did not improve model performance. Moreover, variable importance analysis showed early postoperative urea as a novel and strong predictor of mortality in all types of cardiac operations. Similar studies enabling the identification of modifiable risk factors and providing individual patient predictions may form a first step towards facilitating personalized clinical interventions to improve patient care.

Methods

The electronic Cardiothoracic Anesthesiology Registry (CAROLA) comprises extensive prospective data of all adult patients who underwent first-time valve operation, CABG, or a combination of both between 1997 and 2017 in the University Medical Centre Groningen (UMCG), the Netherlands. The total number of patients is 11,286. This database study was approved by the Medical Ethical Committee of the UMCG, and the requirement to obtain informed consent was waived (waiver: METC#2010/118). All analyses were performed in accordance with relevant guidelines and regulations.

Patient population and outcome

Only patients who underwent valve operation, either solitary or combined with coronary artery bypass grafting (CABG), or solitary CABG, with cardiopulmonary bypass (CPB) were included (n = 8241). There were 1663 patients in the combined aortic and coronary group, 367 in the combined mitral and coronary group, 884 in the solitary mitral group, 813 in the solitary aortic group, and 4514 in the CABG-only group. Mortality data were obtained in November 2017 from the Dutch Municipal Personal Records Database comprising actual and reliable data of all citizens within the Netherlands.

Data selection and pre-processing

The dataset includes patient characteristics, peri-operative hemodynamic, CPB, respiratory and organ function data and blood values collected at different time points indicated in Fig. 5. Because for some patients referred from other hospitals the stay in our center was limited to the immediate peri-operative phase, a variable pattern of missing data was observed. Multivariate imputation by chained equations was performed on the set of variables with at least 50% non-missing data27. The final dataset without missing data consisted of 88 predictor variables and 5-year mortality as the outcome variable (Table 1). Baseline serum creatinine measurements was defined as the closest to the start of operation. Patients were classified for post-operative AKI 0–3 within the 7 days after operation according to the AKIN classification3.

Figure 5
figure 5

Timeline of clinical measurements before, during, and after cardiac operation, in the intensive care unit (day 1 after operation), day 1 in the ward (day 2 after operation), and day 3 in the ward (day 4 after operation). Patient characteristics are not included here, but described in detail in Table 1. Dur CA duration of cardiac arrest, Dur clamp duration of aortic cross-clamp, Hb hemoglobin, ASAT aspartate aminotransferase, ALAT alanine aminotransferase, Thromb thrombocytes, ESR erythrocyte sedimentation rate, LDH lactate dehydrogenase, CVP central venous pressure, PaCO2 arterial carbon dioxide partial pressure, SaO2 oxygen saturation, PaO2 arterial oxygen partial pressure, SBP systolic blood pressure, DBP diastolic blood pressure.

Statistical analysis

The Super Learner, selected candidate algorithms, and hyper-parameter tuning

The Super Learner algorithm is a generalization of the stacking algorithms developed by Breiman28, which combines a set of candidate algorithms to make k-fold-cross-validated predictions9,29. In this process, the dataset is divided into k mutually exclusive and exhaustive subsets, with one set serving as a validation set, while the others are used for training each candidate algorithm14. This means that each patient is used only once in the validation set, and included in the training set for all other rounds. For each candidate learner, k risks are calculated and averaged into a “cross-validated risk”. Subsequently, the learners with the minimal risk are selected, applied to the entire dataset and included in the new weighted estimator (the SL), that attributes a relative coefficient to each of the learners. Those which reduce the calculated risk the most, will contribute to the final weighted prediction. Moreover, the SL presents individual patient predicted probabilities for 5-year mortality per ensemble. Five candidate algorithms were included in the SL: support Bayesian additive regression trees (BART), extremely randomized trees, elastic net, support vector machine, and extreme gradient boosted machine (XGBoost). Details of these five algorithms can be found in the “Supplementary material”. Since the performance of an algorithm varies greatly depending on its hyper-parameters and can be substantially improved by tuning, multiple hyper-parameter combinations were generated for each candidate algorithm. Details of each of these algorithms including the hyper-parameters, the tuning process, and final values are described in the “Supplementary material”. A 10-fold cross-validated generalized linear regression model (GLM) was trained on data from the full cohort for use as baseline comparison of the SL’s performance. Lastly, to test the performance of a model using only pre-operative data in predicting post-operative outcomes, a 10-fold cross-validated XGBoost model was trained on data from the full cohort.

Model training

Two distinct training procedures for the SL were carried out (Fig. 6). First, one of the ensembles (SL1) was trained using the full cohort of 8241 patients. Secondly, the cohort was split into five different groups according to operation type, with one ensemble trained on data from each group (SL2–SL6). All six ensembles included the same candidate algorithms, and the same hyper-parameter configurations. Performance of two different approaches were assessed by comparison of the 10-fold cross-validated area under the receiver operated characteristic curve (AUROC), with a 95% confidence interval, for each of the weighted SL’s. Differences in the performance between SL’s and between SL1 and the GLM were assessed with DeLong’s nonparametric test for the difference in areas under the curve30.

Figure 6
figure 6

Diagram of the steps involved in data analysis: data split, algorithm training, and outcome prediction using different Super Learner ensembles. On the left, the process of training the single Super Learner on data of the whole cohort (n = 8241), obtaining the pooled predicted probabilities, and retrieving the group-specific probabilities to calculate the performance measures for each type of operation. On the right, the process of splitting the data into five groups, one per operation type, and training a different super learner on data from one type of operation only. SL super learner, AV aortic valve, MV mitral valve, CABG coronary artery bypass grafting.

Calibration, sensitivity analysis and adjusted risk thresholds based on predicted probability of mortality

Calibration plots and calibration indices (ECI)31 for all models are provided in the “Supplementary material”. Model performance metrics described above were obtained in a 2-step procedure: first using a default threshold to maximize the AUROC, and then using adjusted thresholds to optimize sensitivity and specificity. This process of tuning the operating points of the ROC using different risk thresholds depending on the requirements of a specific clinical setting has been previously shown to optimize model sensitivity and specificity for mortality prediction15. In the first step, a default threshold of 0.50 was used, where patients are classified as “non-survivors” if the predicted probability of mortality is greater than 50%. This is the standard threshold used to maximize algorithm performance during training. After this, a second and third risk thresholds were defined. The second one was calculated based on the maximized Youden index, which provides a balance between sensitivity and specificity15. The third one was based on the actual long-term mortality rate of each of the surgical sub-groups, and corresponds to a 50% increase in the absolute risk of mortality. We opted for this value as it represents a clinically relevant increase that could justify intervention. The confusion matrix, sensitivity, and specificity for each of the thresholds are reported in the “Supplementary material”.

Variable importance analysis

Variable importance measures aim at estimating the contribution of predictor variables to changes in the outcome32. The greater the association between each feature and the outcome, the greater the decrease in accuracy upon its removal, and the higher its reported importance32. We determined the variable importance of all routinely measured peri-operative clinical parameters in our cohort by training the best performing individual algorithm included in the ensemble—the XGBoost model—using the same hyper-parameter configurations as in the SL. The coefficients for the top ten features for each operation type, as well as for all operations combined, are presented.

All analyses were performed using R version 3.6.2 (The R Foundation for Statistical Computing; Vienna, Austria) for Ubuntu 16.04 LTS. Data are expressed as mean (95% confidence interval), and categorical as percentages. A P value < 0.05 was accepted as a statistically significant difference.