Introduction

Postoperative acute kidney injury (PO-AKI) is a common complication after cardiac and vascular surgery that is strongly associated with increased risk of in-hospital mortality, prolonged intensive care unit (ICU) admission and hospital stay, and greater healthcare costs [1]. Its prevalence has been reported to range from 20 to 70% and vary by the type of surgery and the criteria of the AKI definition [2, 3]. Even minor increases in serum creatinine after cardiac surgery matter and add to the risk of death [4]. Despite the fact that many episodes of postoperative AKI are reversible within days or weeks of onset, results from a number of epidemiological studies conducted over the last decade reveal a substantial association of AKI with subsequent chronic kidney disease and long-term adverse consequences [5, 6], which leads to a considerable disease burden. The pathophysiological mechanisms of PO-AKI are multifactorial and still largely unknown. Currently, the health care of PO-AKI patients is mainly based on supportive therapy. The best approach to AKI management is prevention. The success of many preventative treatment strategies rests on the ability to correctly identify patients at high risk of developing AKI prior to major changes in serum creatinine or urine output. Therefore, perioperative accurate assessment of AKI risk and specific preventive strategies are urgent to be established. To early identify patients at higher risk for PO-AKI, previous studies have developed a number of potentially useful clinical risk prediction models. But the generalization of the models is uncertain. Most models suffer from differing AKI definitions, insufficient perioperative variables, small cohorts, and a lack of external validation [7]. Moreover, these models were designed mainly for western populations and did not represent good performance among Chinese patients.

In recent years, machine learning (ML) techniques have also been increasingly employed to make predictions and have shown promise as a potentially powerful tool to detect data patterns and trends that could remain undiscovered using traditional statistical method [8,9,10,11,12]. Decision trees (DT) are versatile and intuitive models that use a tree-like structure to make decisions or predictions. Random Forest (RF) is an ensemble learning method that combines multiple decision trees to make predictions by aggregating the votes or averaging the outputs of individual trees. Gradient Boosting Classifier (GBC) is another ensemble learning method that builds a predictive model by combining multiple weak classifiers in a stage-wise manner. Gaussian Naive Bayes (GNB) is a probabilistic classification algorithm based on Bayes’ theorem and calculates the posterior probability of each class for a given instance and assigns the class with the highest probability as the predicted class. Multilayer Perceptron (MLP) is a type of artificial neural network that consists of multiple layers of interconnected nodes. These algorithms have their own strengths and weaknesses. It’s important to experiment and evaluate different algorithms to find the best fit for a specific task. Compared with the CSM approach, ML algorithms could theoretically perform better in measuring nonlinear associations, dealing with high-dimensional data, which may enhance risk prediction. However, the existing studies did not systematically compare and statistically test the difference in performance among ML algorithms and the CSM approach in predicting PO-AKI [13]. Some of the existing ML models did not report more interpretable indicators for model assessment, such as sensitivity and specificity, which were highly emphasized in clinical conditions [8, 14].

Interpretation of ML models is another crucial issue that refers to the challenge of understanding how and why a machine learning model arrived at a particular prediction or decision. Machine learning models are often regarded as black boxes due to their complex and nonlinear nature. This lack of transparency hinders the understanding of the factors driving predictions and can limit their adoption in critical applications. However, several approaches can be used to enhance model interpretability and overcome this limitation. Techniques such as SHAP (SHapley Additive exPlanations) method or permutation feature importance assign weights or rankings to each feature based on their impact on the model’s output, thereby enabling insights into the underlying factors and increasing the trustworthiness of predictive models. Although previous studies suggested that ML approaches provided good prediction discrimination, the interpretability of the models was lacking, which limited their application in an actual clinical setting.

Therefore, we hypothesized that ML techniques may outperform traditional logistic regression analysis in the prediction of postoperative AKI risk following cardiac surgery. The study aimed to systematically compare the performance of multiple different ML algorithms and LR in predicting two outcomes separately, including postoperative AKI and severe AKI as well. We also sought to further explain the prediction of the best-performing model at the population and individual levels.

Methods

Study design

In the present retrospective cohort study, we enrolled 2327 patients who were over eighteen years old and underwent cardiac surgeries in a tertiary teaching hospital during January 2018 through December 2020. A total of 17 patients were excluded according to the criteria below: (a) preoperative dialysis or renal transplantation (4 cases); (b) used nephrotoxic drugs within seven days before surgery: gentamicin and vancomycin (2 cases); (c) preoperative or postoperative serum creatinine was not available (11 cases). The rationale behind the exclusion was to avoid the influence of the baseline factors above on postoperative renal function or PO-AKI diagnosis. Finally, 2310 participants were eligible for the present analysis. The study was approved by the Medical Ethics Committee of Central China Fuwai Hospital of Zhengzhou University (LUNSHEN 2020-07). This work has followed the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) guidance for the reporting of studies developing and validating a prediction model [15].

Data collection

We applied to the China Cardiac Surgery Registry (CCSR) database for the data submitted by our institute. CCSR was a web-based and paperless data collection system. Data quality control measures for the CCSR database were described in another study [16] introducing the design and data audit of the CCSR. Further, to ensure the accuracy and reliability of the data, two investigators screened the data for the missing or outlier values and replaced them with the true values extracted from electronic health records (EHR). The available variables regarding demographics, comorbidities, previous cardiac interventions, preoperative medications, perioperative examinations, intraoperative data, postoperative complications and outcomes were included in the present analyses. Based on evidence from previous studies, clinical guidelines, or expert opinions, predictor selection during model development was conducted among preoperative and intraoperative variables.

Demographics and presentation of comorbidities included gender, age, BMI, smoker regardless of whether the patient quit smoking, diabetes mellitus, hypertension, hyperlipidemia, chronic renal failure, chronic obstructive pulmonary disease (COPD), peripheral vascular disease, cerebrovascular accident (CA), congestive cardiac failure, Canadian Cardiovascular Society (CCS) angina class, New York Heart Association (NYHA) class, arrhythmia, and previous myocardial infarction (MI). Previous cardiac intervention included previous percutaneous coronary intervention (PCI) and previous cardiac surgery. Preoperative medications included intravenous nitroglycerin injection, catecholamine injection, β-blockers, angiotensin-converting-enzyme inhibitor (ACEi)/ angiotensin II receptor blocker (ARB), lipid-lowering agents. Perioperative examinations included the last preoperative serum creatinine (SCr), peak postoperative SCr, last preoperative left ventricular ejection fraction (LVEF), last preoperative left ventricular end-diastolic diameter (LVEDD), number of diseased coronary vessels. Intraoperative data included surgery type, cardiopulmonary bypass (CPB), aortic cross-clamping (ACC), intra-aortic balloon pump (IABP) usage, intraoperative blood product transfusion (BPT), postoperative BPT, and assisted ventilation time. BPT refers to packed red blood cells (pRBC), fresh frozen plasma, cryoprecipitate and platelet.

Outcomes

We considered AKI and severe AKI after cardiac surgery as the primary and secondary outcomes, respectively. AKI diagnosis was based on a modified KDIGO (Kidney Disease: Improving Global Outcomes) definition [17] by using serum creatinine (SCr) criteria with the most recent preoperative result as baseline. AKI was defined as any of the following: SCr increase ≥ 0.3 mg/dL within 48 h or SCr increase ≥ 1.5 times baseline within 7 days of surgery. Severe AKI (stage 2 and 3 AKI classification) refers to the condition of an increase in SCr ≥ 4.0 mg/dL or ≥ 2.0 times baseline or the initiation of renal replacement therapy (RRT).

Statistical analysis

During data preprocessing, the nonphysiologic vital values or outliers were checked and replaced by the true values from electronic medical records when it was mis-recorded or changed to missing. For several variables with missing data (BMI, CCS angina class, COPD), we imputed the missing values by random forest - multiple imputation method in R using the ‘missForest’ package.

For all participants grouped by postoperative AKI or severe AKI, the continuous variables were presented as mean ± standard deviation (SD) and analyzed by unpaired t tests or expressed as median (interquartile range) and compared using the Mann-Whitney U test when they violated the normality assumption. The discrete variables were expressed as frequencies (%) and tested using Chi-square tests.

Model development, validation and comparisons

For model development, the complete dataset was randomly divided into a derivation set and a validation set based on a ratio of 4:1. The derivation set was used to train the model and the validation set was used to access the generalizability of the model.

To reduce redundant data dimensions, univariate analyses were performed to select feature predictors. Only the candidate preoperative and intraoperative variables that were significantly associated with PO-AKI would be incorporated in the model training.

In the derivation dataset, we used traditional logistic regression analysis and five popular supervised ML algorithms to train models, including decision tree (DT), random forest (RF), gradient boosting classifier (GBC), Gaussian Naive Bayes (GNB) and multilayer perceptron (MLP).

Furthermore, the predictive value of each algorithm in discriminating PO-AKI was systematically evaluated in the validation set by using sensitivity, specificity, area under the receiver operating characteristic curve (AUC), accuracy, Youden index (sensitivity plus specificity-1) and the 95% confidence interval (CI). Sensitivity measures the ability of a model to correctly identify positive cases out of all the actual positive cases. Specificity measures the ability of a model to correctly identify negative cases out of all the actual negative cases. AUC is a commonly used metric that summarizes the overall performance of a model to discriminate between positive and negative instances across various classification thresholds. Accuracy calculates the proportion of correctly classified instances over the total number of instances. Youden index is derived from sensitivity and specificity and provides a balance between these two metrics, allowing for the determination of an optimal classification threshold. The statistical significance of the difference among the AUCs was further examined using the method proposed by DeLong et al. (1988) using MedCalc® Statistical Software version 20. This method is a widely used approach for comparing AUC values between two or more models. It is based on the nonparametric approach and uses the theory of generalized U-statistics.

Model interpretation using SHAP

Lastly, we focused on interpreting the output of the final ML models. SHAP (SHapley Additive exPlanations) is a unified approach [18] to explain the variable impact on the output of any model by assigning each feature consistent and accurate attribution values for a particular prediction. The benefits of utilizing SHAP are as follows: (1) the collective SHAP values help to interpret and understand the model at a global level, (2) an observation gets its own set of SHAP values that explain the variable impact on individual predictions at a local level. We further visualized the SHAP explanation by summary plot and force plot to understand how the models interpret the data and make predictions.

Statistical analyses were conducted with SAS version 9.1 (SAS Institute Inc., Cary, North Carolina, USA). Development, validation and interpretation of the models were performed under Python Version 3.6 environment. All P values were two-sided with a significance level of 0.05.

Results

Comparison of characteristics and outcomes

A total of 2310 participants were included in the present retrospective cohort study. Table 1 shows the comparison of characteristics and outcomes in patients with and without postoperative AKI or severe AKI. In the whole cohort of cardiac surgery, 1020 (44.2%) patients developed PO-AKI and 286 (12.4%) developed severe AKI (stage 2 and 3 AKI). Univariate analyses indicated that presentation of hyperlipidemia, cerebrovascular accident, congestive cardiac failure, CCS angina class II-IV, NYHA class III-IV, CABG + valve/other surgery, CPB, aortic cross-clamping, IABP usage, intraoperative BPT, and longer assisted ventilation time were significantly associated with increased risk of PO-AKI (all P < 0.05). Notably, we found that the levels of preoperative SCr in the patients without PO-AKI were significantly higher than those of patients with PO-AKI (P < 0.001). Compared to the patients with PO-AKI, those without PO-AKI had higher percentages of preoperative medications use, such as intravenous nitroglycerin injection (54.1% vs. 43.3%; P < 0.001), β-blockers (50.1% vs. 37.8%; P < 0.001), ACEi/ARB (17.8% vs. 10.4%; P < 0.001), lipid-lowering agents (41.1% vs. 21.0%; P < 0.001)). There was no statistical significance between AKI and no-AKI groups regarding gender, age, BMI, smoker, diabetes mellitus, hypertension, chronic renal failure, COPD, peripheral vascular disease, arrhythmia, previous MI and cardiac interventions.

Table 1 Characteristics of the patients with and without postoperative AKI/severe AKI

Development, validation and comparison of models

Using the sixteen preoperative and intraoperative predictors that were significantly associated with postoperative AKI, we developed five ML models and a conventional logistic regression model in the derivation set (n = 1848, 80%) and validated the models in the validation set (n = 462, 20%) for prediction of postoperative AKI and severe AKI, respectively. There were no significant differences among patient characteristics, postoperative outcome and complications between the derivation set and the validated set (Supplementary Table 1). Model performance metrics in the validated set were demonstrated in Table 2. The ROC curves of prediction models were also illustrated in Fig. 1(a) for AKI prediction and Fig. 1(b) for severe AKI prediction. Comparisons of AUCs among different models are presented in Supplementary Table 2.

Table 2 Performance metrics of the models for prediction of postoperative AKI/ Severe AKI in the deviation set
Fig. 1
figure 1

Model performance illustrated by receiver operating characteristic curves for predicting (a) postoperative AKI and (b) postoperative severe AKI. AKI, acute kidney injury; AUC, area under receiver operating characteristic curves; CI, confidence interval

Among the five ML models, MLP and GNB, with similar AUCs (0.793, 95%CI: 0.735, 0.844 vs. 0.762, 95%CI: 0.701, 0.815, Pdifference =0.179), performed best in predicting postoperative AKI. Compared to MLP, logistic regression had a higher AUC (0.812, 95%CI: 0.756, 0.860 vs. 0.793, 95%CI: 0.735, 0.844, Pdifference =0.036), sensitivity (0.774, 95%CI: 0.719, 0.813 vs. 0.804, 95%CI: 0.749, 0.843), accuracy (0.753, 95%CI: 0.719, 0.781 vs. 0.745, 95%CI: 0.705, 0.778) and Youden index (0.513, 95%CI: 0.451, 0.573 vs. 0.460, 95%CI: 0.391, 0.536).

For postoperative severe AKI prediction, in terms of AUC, GBC (0.86, 95%CI: 0.808, 0.902) performed significantly better than the other four ML models, including DT (0.749, 95%CI: 0.688, 0.803, Pdifference =0.011), RF (0.805, 95%CI: 0.748, 0.854, Pdifference =0.027), GNB (0.734, 95%CI: 0.672, 0.790, Pdifference =0.004) and MLP (0.718, 95%CI: 0.655, 0.775, Pdifference <0.001). The AUC of GBC was larger than the AUC of conventional logistic regression (0.86, 95%CI: 0.808, 0.902 vs. 0.803, 95%CI: 0.746, 0.852) but no significant differences were found (Pdifference =0.173). As illustrated in Table 2, the sensitivity (0.333, 95%CI: 0.224, 0.431 vs. 0.148, 95%CI: 0.078, 0.250), Youden index (0.304, 95%CI: 0.202, 0.424 vs. 0.138, 95%CI: 0.065, 0.260) and accuracy (0.896, 95%CI: 0.868, 0.917 vs. 0.892, 95%CI: 0.877, 0.905) of GBC seemed to be consistently greater than those of conventional logistic regression.

Model interpretation

For AKI prediction, the feature importance matrix plot for the best-performing logistic regression is shown in Fig. 2 (a). The top five most important predictors based on the SHAP values were assisted ventilation time, CPB, lipid-lowering agents, last preoperative SCr, and hyperlipidemia. To summarize the effects of all the features on logistic regression output at a global level, we plotted the SHAP values of every feature for every participant (Fig. 3 (a)). It revealed that CPB and longer assisted ventilation time increased the risk of AKI and vice versa for preoperative lipid-lowering agents and higher preoperative SCr. At a local level, we could get a set of SHAP values for specific participant’s feature values and visualize the impact of each feature on the output by force plot (Supplementary Fig. 1(a)). The color of the arrows shows how the feature impacts the model: a red arrow increases the model output predicted value in the positive direction while a blue arrow decreases it in the negative direction. The size of the color arrows for each feature value represents the impact magnitude of the SHAP value. The base value refers to the mean probability that would be predicted if we do not know any features for the current output. As shown in Supplementary Fig. 1(a), the base value here was 0.49 and the output predicted value was 0.61, lipid-lowering agents = 0 with red arrow and ACC = 1 with the blue arrow have similar impact magnitudes but in opposite directions. Red feature values contribute to postoperative AKI prediction, while blue ones push towards no AKI development.

Fig. 2
figure 2

Feature importance matrix plot based on SHAP for (a) postoperative AKI prediction using logistic regression and (b) postoperative severe AKI prediction using gradient boosting model. AKI, acute kidney injury; CCS, Canadian Cardiovascular Society; NYHA, New York heart association; CA, cerebrovascular accident; ACEi, angiotensin-converting-enzyme inhibitor; ARB, angiotensin II receptor blocker; ACC, Aortic cross-clamping; SCr, serum creatinine; CABG, coronary artery bypass grafting; CPB, cardiopulmonary bypass; IABP, intra-aortic balloon pump; BPT, blood product transfusion

Fig. 3
figure 3

SHAP summary plot for (a) postoperative AKI prediction using logistic regression and (b) postoperative severe AKI prediction using gradient boosting model. AKI, acute kidney injury; CCS, Canadian Cardiovascular Society; NYHA, New York heart association; CA, cerebrovascular accident; ACEi, angiotensin-converting-enzyme inhibitor; ARB, angiotensin II receptor blocker; ACC, Aortic cross-clamping; SCr, serum creatinine; CABG, coronary artery bypass grafting; CPB, cardiopulmonary bypass; IABP, intra-aortic balloon pump; BPT, blood product transfusion

Similarly, we combined the optimal GBC model for severe AKI prediction with SHAP analysis and presented the importance matrix plot in Fig. 2 (b). To get an overview of which features contribute most to the GBC model, we used the SHAP summary plot (Fig. 3 (b)). Of the top five most important variables, there were two risk factors (longer assisted ventilation time and hyperlipidemia) and three protective factors (higher last preoperative SCr, preoperative lipid-lowering agents and intravenous nitroglycerin intake). SHAP force plot (Supplementary Fig. 1(b)) showed the impact direction and magnitude of each feature value on the development of postoperative severe AKI by taking a specific patient case as an example. In this case, the blue-colored explanatory variable values (last preoperative SCr (mg/dl) = 0.49, no preoperative intravenous nitroglycerin injection, no preoperative intake of lipid-lowering agents, NYHA III-IV, no preoperative intake of β-blockers, hyperlipidemia) had a positive contribution towards severe AKI development and outweighed the negative impact of other blue-colored variable values, such as no CPB, assisted ventilation time (hr) = 45 and so on.

Discussions

We conducted a retrospective study to investigate the incidence and risk factors of patients developing PO-AKI and compare the performance of machine learning algorithms and logistic regression in predicting PO-AKI following cardiac surgery. Contrary to our expectations, traditional logistic regression outperformed all five other machine learning models in discriminating postoperative AKI. However, Gradient Boosting Classifier showed higher AUC values than traditional logistic regression when predicting postoperative severe AKI, but no significant difference was found. Taking into account sensitivity, accuracy, and Youden index, GBC appeared to be superior to traditional logistic regression in predicting postoperative severe AKI. The study yielded moderate to good predictive values for both postoperative PO-AKI and severe AKI. Furthermore, the models identified various risk factors, including lower preoperative SCr level, longer assisted ventilation time, cardiopulmonary bypass and hyperlipidemia, and discovered main protective factors, such as preoperative lipid-lowering agents and intravenous nitroglycerin intake, which could potentially inform clinical interventions.

The performance of risk prediction model was typically determined by various factors, such as the quality of the data, sample size, predictor parameters, algorithms and so on. Lee et al. [9] reported that XGboost achieved the best performance (AUC: 0.78 (95%CI 0.75–0.80)) using 39 variables to predict postoperative AKI of all stages compared to traditional logistic regression (AUC: 0.69 (95%CI 0.66–0.72)) and other ML models. On the contrary, based on the same AKI definition and similar incidence of postoperative AKI, we used far fewer predictors (16 variables) and found that the performance of traditional logistic regression (AUC: 0.812(95%CI 0.756–0.860)) was significantly superior to that of other five ML models in predicting PO-AKI of all stages. Notably, we have also attempted to include as many variables as possible in the models before formal analyses. However, we found that the expansion of input features did not imply great improvement of predictive ability, probably due to high correlation between the features or data noise. Thottakkara et al. [19] found that reducing the dimensionality of the predictors with LASSO had minimal effect on the performance of models, while feature extraction using principal component analysis improved predictive performance. We were convinced that models with redundant data and elaborate risk calculation would add extra burden and make it difficult to deploy in real clinical conditions. So we only included the variables significantly associated with postoperative AKI, which may have the risk of omitting some potentially important variables but aid in decreasing overfitting and making the model easier interpretable. Additionally, more advanced and complex algorithms are not certainly synonymous with substantial performance improvement. For example, Tomašev et al. [20] once utilized a broad range of alternative RNN architectures (specifically, the long short-term memory, update gate RNN and intersection RNN, the neural Turing machine, memory-augmented neural network, simple recurrent units, gated recurrent units, the Differentiable Neural Computer and the relational memory core) to create a continuous prediction model of AKI. These alternatives did not show significantly incremental performance than the simple recurrent unit architecture.

Notably, the performance of risk prediction models should be assessed comprehensively and compared based on scientific statistical tests. To identify more positive cases and reduce false negative cases, sensitivity is particularly emphasized over specificity in clinical screening and risk assessment. Excellent discrimination measured by AUC does not necessarily mean good performance in sensitivity. However, some previous studies [3, 8, 9, 21] did not report such more comprehensive metrics for model assessment, which may overestimate their models’ reliability and practical values. Additionally, we could not rule out the possibility of publication bias. Studies that demonstrate better performance with ML algorithms may be more likely to be published compared to those showing insignificant results. This can impact the overall body of evidence by potentially skewing the performance of ML models.

Even if logistic regression outperformed the ML models in discrimination of postoperative AKI, the limitations of logistic regression should be addressed as well. Logistic regression assumes a linear relationship between the independent variables and the logit function. It may not capture non-linear relationships and complex interactions among variables. Logistic regression has limitations in handling high-dimensional data, missing data and unbalanced datasets. Logistic regression relies on manual or statistical methods for feature selection, which requires prior knowledge or assumptions about the relevant variables. Anyway, we could not underestimate the strengths of ML algorithms. Bihorac et al. [22] used an automated generalized additive model (GAM) with logistic link function to assess the risk of death and eight postoperative complications including AKI and reported an AUC of 0.88 (0.87–0.89) and a sensitivity of 0.80 (0.78, 0.82) for AKI after index surgery. The advantages of this automated ML system included automatic covariate selection, universal applicability to any type of surgery, exportability to other EHR systems, and the ability to handle high-dimensional datasets with multiple different data types (such as interrelated, correlated, semistructured or unstructured data, missing or sparse data). Rank et al. [23] built a recurrent neural network (RNN) based on 96 routinely collected clinical parameters for real-time prediction of AKI within 8 future time horizons (6-h, 12-h, 18-h, 24-h, 36-h, 48-h, 60-h and 72-h ahead) after cardiothoracic surgery. The deep-learning model yielded a significantly higher AUC and sensitivity than the experienced clinicians. The RNN model could potentially be integrated into an EHR for real-time patient monitoring and early detection of postoperative AKI. A shift to personalized predictive medicine, where large amounts of multiple different types of features are integrated, suggests that computational methods like ML would become increasingly popular.

While model performance is crucial, the interpretability and clinical implications of the model also play significant roles. As revealed by SHAP analysis, LR and GBC models have identified lower preoperative SCr level as an important contributor to the development of PO-AKI, particularly in cases of postoperative severe AKI. Intriguingly, patients who experienced severe AKI exhibited significantly lower preoperative SCr level (0.65 (95% CI 0.51, 0.82) mg/dl vs. 0.77 (95% CI 0.66, 0.90) mg/dl) and higher postoperative SCr level (1.57 (95% CI 1.24, 2.31) mg/dl vs. 0.98 (95% CI 0.82, 1.16) mg/dl) than patients without severe AKI. Conversely, Teresa et al. [24] and Guan et al. [25] found that patients with AKI following cardiac surgery had higher baseline SCr level. Renal functional reserve (RFR) may help explain the opposite phenomenon we observed. RFR is defined by the difference between maximal functional capacity and baseline function and represents the capacity of the kidneys to increase GFR in response to stimuli (e.g., protein load or during pregnancy) [26, 27]. Those who appear to have sustained no permanent loss of kidney function (normal serum creatinine), may still lose RFR. In the present study, patients with lower baseline SCr level may have higher baseline function and diminished RFR, which makes the renal response to cardiac surgery less powerful and induced kidney injury. Ronco et al. [28] have demonstrated that preoperative RFR, measured with a protocolized protein load of 1.2 g/kg, was highly predictive of cardiac surgery associated-AKI with an AUC of 0.83 (95% CI 0.70–0.96) and patients with an increased renal reserve were less likely to develop postoperative AKI. Husain-Syed et al. [29] noted that patients with postoperative AKI have a persistent decrease in RFR at 3 months after cardiac surgery despite normalization of serum creatinine and patients with severe AKI exhibited higher decreases in RFR. These remind us that preoperative detection of baseline SCr is not enough and highlight the importance of assessing renal functional reserve as a highly potential predictor of postoperative AKI before cardiac surgery. The patients with normal baseline SCr but occult impaired renal reserve should be timely identified and deserve close clinical attention.

The models also established hyperlipidemia and preoperative lipid-lowering agents use as among the top five important predictors for postoperative AKI and severe AKI as well. In this study, statins accounted for over 97% of the lipid-lowering agents used preoperatively. It is speculated that preoperative lipid-lowering agents (e.g., statins) use may be protective against AKI by inhibiting post-operative inflammatory processes. Several randomized controlled trials investigating inflammatory markers and perioperative statin usage in patients undergoing cardiac surgery have demonstrated a reduction in inflammatory cytokines [30]. Meta-analyses of observational studies have revealed that preoperative statin therapy significantly reduced the incidence of postoperative renal dysfunction and the requirement for renal replacement therapy in patients undergoing cardiac surgery [31, 32]. Similarly, other modifiable factors such as assisted ventilation time, cardiopulmonary bypass and intravenous nitroglycerin intake were also found to be key predictors of AKI risk, highlighting their clinical implications in predicting AKI and guiding clinical intervention.

Some advantages of the study could be acknowledged. We comprehensively developed, validated and compared the traditional logistic regression and the five most popular ML models in the prediction of postoperative AKI and severe AKI as well. More variables on preoperative medicine intakes and intraoperative procedures were included as modifiable predictors, which provide potential targets for prevention and intervention of high-risk population. Furthermore, we made intuitive explanations of the models using SHAP at the global and local levels, which shed light on the black box of ML and aid in understanding how and how much each predictor contributes. To determine whether sufficient power was gained to ensure the reliability of our results, we implemented a post hoc sample size calculation using the method proposed by Riley et al. [33] via the pmsampsize package in R. The number of predictor parameters was 16, the R2 was 0.11, and the overall AKI incidence was 44.2%. Finally, the minimum sample size with 33.92 events per predictor required for model development based on the inputs is 1228. The actual sample of 1848 patients for model development likely has sufficient power to draw conclusions.

Limitation

However, the study also presented several limitations. First, the diagnosis of AKI was based on the modified KDIGO definition without incorporating urine output criteria. Because the collection and measurement of urine are sometimes incomplete, inaccurate and unreliable. In addition to the impact of renal function, urine output varies depending on the use of diuretic or fluid treatment regimens. Second, due to the unavailability of the complete time-series data, the study could not make real-time risk prediction of AKI. Future enhancement would include integrating additional perioperative time-series data into the prediction models. Third, the models needed to be further externally validated. Factors such as differences in demographic characteristics, comorbidities, surgical profiles, management protocols, and perioperative care practices may influence the performance of the models. Moreover, the impact of potential confounding factors, such as variations in baseline kidney function and medication regimens, should also be carefully considered when applying these models in different populations or settings. Therefore, rigorous external validation studies are necessary to verify the performance and establish the clinical utility of these prediction models in diverse patient populations and healthcare settings.

Conclusions

The study did not provide consistent evidence supporting the hypothesis that machine learning algorithms could significantly enhance the predictive performance of postoperative AKI and severe AKI after cardiac surgery compared to conventional logistic regression. Logistic regression and the GBC algorithm showed superior performance and exhibited moderate to good predictive ability for the risk of PO-AKI and severe AKI, respectively. We interpreted the models at the population and individual level, which gained insight into why those predictions were made and offered evidence for individual prevention and treatment.