Background

Cardiac resynchronization therapy (CRT) is the preferred treatment for patients with ventricular dyssynchrony accompanied by reduced ejection fraction and bundle branch block [1]. CRT reduces the risk of sudden heart failure due to weakening of the heart muscle and can help alleviate disease symptoms, improving quality of life [2]. The 2008 American Heart Association / American College of Cardiology and 2007 European Society of Cardiology guidelines recommend the following criteria for selecting patients for CRT: sinus rhythm, left ventricular ejection fraction ≤ 35%, QRS > 120 ms, and NYHA class III/IV symptoms [3]. Unfortunately, roughly one-third of CRT recipients do not respond favorably to the treatment [4]. Given the expense and surgical risks of CRT, the ability to accurately predict which individual patients will benefit from this treatment could hold great clinical value [5].

The field of biomedical science has seen a surge in predictive models for disease prognosis and treatment outcomes that use machine learning approaches to detect subtle patterns in underlying datasets [6, 7]. Recent studies have tested the utility of advanced machine learning algorithms for predicting the response to CRT using various patient data, including electronic health records, clinical imaging, and electrocardiograms, and some of these studies have reported moderate predictive accuracy [8,9,10,11,12,13,14]. However, biochemical features are notably absent from most CRT predictive algorithms. Given the important roles that biochemical markers such as extracellular matrix proteins and inflammatory signals can play in cardiac tissue remodeling, it is notable that Spinale and colleagues recently showed that circulating levels of several serum protein biomarkers hold substantial predictive capability for CRT response [15]. Specifically, elevated levels of soluble suppressor of tumorigenicity-2 (sST-2), soluble tumor necrosis factor receptor-II (sTNFR-II), matrix metalloproteinase-2 (MMP-2), and C-reactive protein (CRP) indicated a reduced likelihood of benefit across ~800 patients from the SMART-AV CRT trial.

Past efforts to predict individual patient responses to CRT using machine learning algorithms have largely been limited in two ways. First, most studies have used only a single type of data to make predictions (e.g., electrocardiograms or biochemical markers, but not both). Second, most studies have used sophisticated ‘black-box’ algorithms with limited capacity for interpretation or explanation, which can hinder their adoption and utility in clinical practice. One approach to improving interpretability is to use simpler models such as regression, but a growing set of explainability approaches is now helping to make more complex models interpretable without penalizing prediction accuracy [16,17,18].

In this study, our objective was to computationally predict individual patient responses to CRT using a combination of demographics, physical characteristics, comorbidities, medication history, circulating biomarker levels, and echocardiography-based left ventricular (LV) assessment. Building upon the previous work of Spinale et al., we combined their biomarker-based metric with various features from the SMART-AV clinical patient data [15, 19]. We assessed the performance of the resulting ensemble machine learning classification model using receiver-operating characteristic (ROC) curve analysis on a held-out patient dataset and by comparing 6-month cardiac measures between model-predicted responder and non-responder groups. We also performed SHapley Additive exPlanations (SHAP) analysis to help interpret the global importance of all features included in the model.

Methods

Study population and data preparation

The data source for our model training and testing was the previously published SMART-AV trial [19]. In that study, 794 patients with NYHA class III or IV heart failure, LVEF ≤ 35%, and QRS duration ≥ 120 milliseconds were randomly assigned to different atrioventricular (AV) delay optimization protocols and evaluated at 0, 3, and 6 months with echocardiography and serum biomarker panels. The complete list of recorded features is organized in Table 1, with summary statistics in Table 2. A positive CRT response was defined as a decrease in left ventricular end-systolic volume (ESV) of at least 15 mL between 0 and 6 months post-surgery, and the cohort was nearly evenly split between responders (n = 398) and non-responders (n = 396).
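As a concrete illustration, the minimal sketch below shows how this binary response label could be derived from the baseline and 6-month ESV measurements; the column names are hypothetical placeholders and may not match the SMART-AV data export.

```python
import pandas as pd

def label_responders(df: pd.DataFrame) -> pd.Series:
    """Label each patient as responder (1) if LVESV decreased by at least
    15 mL between baseline (0 months) and 6 months, otherwise non-responder (0).
    Column names are illustrative placeholders."""
    delta_esv = df["esv_6mo_ml"] - df["esv_0mo_ml"]
    return (delta_esv <= -15).astype(int)
```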

Table 1 Variables acquired from the SMART-AV clinical trial
Table 2 Baseline characteristics of CRT Responders and Non-Responders

Missing data were imputed using two different methods. Surgical intervention features (PCI and CABG) were imputed with the most frequent value of each feature. Categorical data were transformed using one-hot encoding. Continuous (non-categorical) data were imputed using the mean value of each variable and then scaled using the RobustScaler method [20]. The patients were split into training and testing datasets, with 80% in the training dataset (n = 635) and 20% in the testing dataset (n = 159). The testing set was completely excluded from model training and feature selection.
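A minimal sketch of this preprocessing pipeline, assuming pandas and scikit-learn and using placeholder column lists (the exact column names, and the ordering of scaling relative to the split, are assumptions), is shown below.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

def preprocess(df, surgical_cols, categorical_cols, continuous_cols):
    df = df.copy()
    # Surgical-history features (e.g., PCI, CABG): impute with the most frequent value.
    for col in surgical_cols:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    # One-hot encode categorical variables.
    df = pd.get_dummies(df, columns=categorical_cols)
    # Mean-impute continuous variables, then apply RobustScaler
    # (centers on the median and scales by the interquartile range).
    df[continuous_cols] = df[continuous_cols].fillna(df[continuous_cols].mean())
    df[continuous_cols] = RobustScaler().fit_transform(df[continuous_cols])
    return df

# 80/20 split; X is the preprocessed feature matrix and y the response labels.
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```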

Machine learning model development

The complete workflow of our model development, testing, and interpretation framework is presented in Fig. 1. Using Python 3.6.4 and scikit-learn 0.23.2, we tested a wide variety of supervised classification machine learning algorithms, including K-Nearest Neighbors, Support Vector Classifier, Decision Tree Classifier, Random Forest, Adaptive Boosting, Gradient Boosted Classifier, Gaussian Naive Bayes classifier, Linear Discriminant Analysis, XGBoost, CatBoost, logistic regression, and Multi-Layer Perceptron Neural Network [20]. We also tested Stacked and Voting ensembles that combined these other approaches [21,22,23]. This list of algorithms includes well-established methods for binary classification tasks in which parameters of the underlying classifier structure are fit to optimize predictive performance. In general, “ensemble” algorithms seek to improve classification performance by combining several individual algorithms, thereby leveraging the different strengths of each underlying algorithm. “Boosting” approaches are techniques that improve relatively weak classifiers by iteratively re-weighting data, enabling the algorithm to adapt over successive iterations of model training.
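As an illustrative sketch (not the exact configuration used here), such a pool of candidate scikit-learn, XGBoost, and CatBoost classifiers might be assembled as follows before hyperparameter tuning.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Candidate classifier pool; the default hyperparameters shown here are
# placeholders that would be tuned by the grid search described below.
candidate_models = {
    "knn": KNeighborsClassifier(),
    "svc": SVC(probability=True),  # probability=True so the model can later vote with probabilities
    "tree": DecisionTreeClassifier(),
    "rf": RandomForestClassifier(),
    "ada": AdaBoostClassifier(),
    "gb": GradientBoostingClassifier(),
    "gnb": GaussianNB(),
    "lda": LinearDiscriminantAnalysis(),
    "logreg": LogisticRegression(max_iter=1000),
    "mlp": MLPClassifier(max_iter=2000),
    "xgb": XGBClassifier(),
    "catboost": CatBoostClassifier(verbose=0),
}
```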

Fig. 1
figure 1

General workflow of algorithm development and testing

Patients were previously enrolled in the SMART-AV clinical trial based upon their New York Heart Association (NYHA) heart failure designation, left ventricular ejection fraction (LVEF), and QRS duration from electrocardiography, and they were then classified as responders or non-responders based upon their change in left ventricular end-systolic volume (LVESV) after six months of therapy [19]. We first processed the dataset by imputing missing values, numerically encoding categorical variables, and scaling the data, and then we separated patients into a training set (for model parameter fitting) and a testing set (for model performance testing). Lastly, we used SHapley Additive exPlanations (SHAP) analysis and Local Interpretable Model-agnostic Explanations (LIME) to improve model interpretation through feature explanation.

Each model was tuned using a cross-validated grid search over hyperparameters, with values selected to maximize the area under the receiver-operating characteristic curve (AUC) for binary classification of patients in the training set. Notably, the algorithm used only 0-month (pre-surgery) feature data to predict the 6-month post-surgery response vs. non-response outcome. The resulting model parameters and hyperparameters are provided in Supplemental Table 1.
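A minimal sketch of this tuning step for a single candidate model is shown below; the hyperparameter grid and random seed are illustrative, and X_train/y_train denote the preprocessed 0-month training features and response labels from the split above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Illustrative grid for one candidate (Random Forest); each model in the pool
# would receive its own hyperparameter grid.
param_grid = {"n_estimators": [100, 300, 500], "max_depth": [3, 5, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",                       # select hyperparameters by training-set AUC
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    n_jobs=-1,
)
search.fit(X_train, y_train)                 # only pre-surgery (0-month) features
best_rf = search.best_estimator_
```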

Feature selection was performed using a backward stepwise methodology, eliminating features that did not improve the model training score. A guiding hypothesis for this work was that combining the previously identified serum biomarkers with demographic and echo-based features would improve predictive capability. To evaluate this hypothesis, we trained and tested our ensemble model using three different sets of features: all features listed in Table 1 plus (1) no biomarker values, (2) all 12 biomarker values, or (3) a biomarker score based on the previous analysis by Spinale et al. [15]. The biomarker score for each patient is calculated by counting how many of the four critical biomarker analytes meet or exceed a risk threshold (MMP-2 ≥ 982,000 pg/mL, sST-2 ≥ 23,721 pg/mL, CRP ≥ 7,381 ng/mL, sTNFR-II ≥ 7,090 pg/mL).
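A small sketch of this scoring, with hypothetical column names for the four analytes, is shown below.

```python
import pandas as pd

# Risk thresholds from Spinale et al. [15]; column names are illustrative placeholders.
BIOMARKER_THRESHOLDS = {
    "mmp2_pg_ml": 982_000,
    "sst2_pg_ml": 23_721,
    "crp_ng_ml": 7_381,
    "stnfr2_pg_ml": 7_090,
}

def biomarker_score(df: pd.DataFrame) -> pd.Series:
    """Count how many of the four analytes meet or exceed their risk threshold (score 0-4)."""
    flags = [(df[col] >= cutoff).astype(int) for col, cutoff in BIOMARKER_THRESHOLDS.items()]
    return pd.concat(flags, axis=1).sum(axis=1)
```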

Model interpretation

Model performance was evaluated using 5-fold cross-validation within the training dataset, and the final model was selected based on the highest mean AUC. After model selection using the training set, the final model performance was validated using the held-out test set. Interpretation of model outputs is difficult with ensemble models due to the inherent complexity of layering multiple algorithms to produce a prediction. To help interpret global feature importance, we performed a SHapley Additive exPlanations (SHAP) analysis using the KernelExplainer (KernelSHAP) from the Python SHAP library (version 0.37.0), with all samples used as input for the SHAP value calculation [24]. With ensemble models, feature importance becomes highly specific to each individual sample, making it difficult to understand a local prediction from global feature importance alone; local interpretation provides a more faithful explanation of an individual prediction. We therefore also selected two example CRT recipients to demonstrate how the model behaves locally for a responder and a non-responder using Local Interpretable Model-agnostic Explanations (LIME) [25].
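A minimal sketch of these two interpretation steps is given below; it assumes a fitted model named `ensemble`, preprocessed numpy feature matrices `X_train`/`X_test`, and a `feature_names` list, and it summarizes the background data with k-means for speed, whereas the analysis above used all samples.

```python
import shap
from lime.lime_tabular import LimeTabularExplainer

# Global importance with the model-agnostic KernelExplainer.
background = shap.kmeans(X_train, 25)          # background summary (the full sample set could be used instead)
explainer = shap.KernelExplainer(ensemble.predict_proba, background)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names)  # class 1 = responder

# Local, patient-specific explanation with LIME for one example patient.
lime_explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["non-responder", "responder"],
    mode="classification",
)
explanation = lime_explainer.explain_instance(X_test[0], ensemble.predict_proba, num_features=10)
explanation.show_in_notebook()
```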

Results

Model predictive performance

Across all the algorithms tested, a majority-voting ensemble classification model demonstrated the best performance. The ensemble consisted of nine equally weighted models, each voting with its predicted probability of surgical success: a Linear Discriminant Analysis classifier, a CatBoost classifier, a Gradient Boosted classifier, a Random Forest classifier, an XGBoost classifier, a Support Vector Classifier, a three-layer Multi-Layer Perceptron neural network, a Logistic Regression classifier, and an AdaBoost classifier. Without biomarker data, our approach demonstrated modest predictive performance, with an AUC of 0.63 in the training patient set (Table 3). The addition of biomarker data substantially improved model performance, with the AUC reaching 0.75 in the training patient set using either all 12 biomarkers or the simplified biomarker composite score (Table 3). Using the biomarker score with a voting classifier achieved the highest AUC in both the training and test patient sets, so we proceeded with this model for the remaining analyses (Table 4; Fig. 2A).
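A sketch of how such a nine-member soft-voting ensemble could be assembled with scikit-learn is shown below; the hyperparameters are illustrative placeholders rather than the tuned values reported in Supplemental Table 1, and X_train/y_train/X_test are the preprocessed matrices from the split described in Methods.

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Nine equally weighted classifiers combined by soft voting, i.e., averaging
# each model's predicted probability of CRT response.
ensemble = VotingClassifier(
    estimators=[
        ("lda", LinearDiscriminantAnalysis()),
        ("catboost", CatBoostClassifier(verbose=0)),
        ("gb", GradientBoostingClassifier()),
        ("rf", RandomForestClassifier()),
        ("xgb", XGBClassifier()),
        ("svc", SVC(probability=True)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=2000)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("ada", AdaBoostClassifier()),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
test_probs = ensemble.predict_proba(X_test)[:, 1]   # probability of response for each test patient
```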

Table 3 Area under the curve (AUC) values for the ML models with or without the biomarker data
Table 4 Comparison of the performance of the top 6 models in our study using Biomarker Scoring
Fig. 2
figure 2

Overall performance of the machine learning model

(A) The receiver-operating characteristic curve for the supervised, binary classification ensemble model demonstrates high predictive capability, with an area under the curve of 0.784 for the majority-voting classifier. (B) Model-predicted responders exhibited a 69% response rate (61/88), while model-predicted non-responders exhibited only a 27% response rate (19/71). Further stratification based on the model-predicted response probability score demonstrated greater predictive accuracy.

Our binary classification model correctly predicted 71% of patient responses in the test set, with 61/88 classified responders and 52/71 classified non-responders matching the trial result (Fig. 2B). In other words, the prediction yielded 61 true positives, 52 true negatives, 27 false positives, and 19 false negatives. To analyze more detailed patient stratifications, we separated patients into five groups according to the model-predicted probability of response (i.e., probability bins of 1.0-0.8, 0.8-0.6, 0.6-0.4, 0.4-0.2, and 0.2-0). Across the stratified patients, the model correctly identified 96% of patients in the highest and lowest response groups, with 14/15 responders in the high probability score group and 8/8 non-responders in the low probability score group (Fig. 2B).
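A brief sketch of this stratification, assuming the test-set response probabilities (`test_probs`) and observed labels (`y_test`) from the fitted ensemble above, is shown below.

```python
import numpy as np
import pandas as pd

# Bin test-set patients by predicted response probability and compute the
# observed response rate within each bin.
bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
labels = ["0-0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8", "0.8-1.0"]
strata = pd.cut(test_probs, bins=bins, labels=labels, include_lowest=True)

summary = (
    pd.DataFrame({"stratum": strata, "responded": np.asarray(y_test)})
    .groupby("stratum", observed=False)["responded"]
    .agg(n="count", response_rate="mean")
)
print(summary)
```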

In addition to response rate (which is judged by a strict over/under -15 mL threshold for the ESV change over six months), we also explored quantitative changes in left ventricle remodeling metrics across the model classification groups (Fig. 3). Over the six months after the procedure, patients predicted by the model as responders showed significant reductions in both ESV and EDV, while patients classified as non-responders showed no change in ESV and a slight increase in EDV. Both responders and non-responders showed increased stroke volumes and ejection fractions, but the model-predicted responders showed a significantly greater improvement in ejection fraction (~40% compared to ~20%). These differences between groups were further amplified across the five-group patient stratification using the model probability score (Fig. 3B). In the most extreme case, the high response probability group exhibited almost a 75% improvement in ejection fraction, while the low response probability group exhibited no change in ejection fraction over the 6 months after surgery.

Fig. 3
figure 3

Cardiac remodeling across patient stratifications

Model-predicted responders showed statistically significant differences in left ventricle remodeling metrics compared to the model-predicted non-responders. (A) Binary classification identified a responder group with substantially greater improvements in ESV, EDV, and EF from 0–6 months after CRT intervention. (B) More detailed patient stratification further amplified the remodeling differences across groups

Model interpretability

To improve the interpretability of our ensemble classification algorithm, we performed a SHAP analysis and corresponding visualization of feature importance (Fig. 4A). Briefly, this technique averages, across all patients, how much each feature value contributed to each patient’s classification, indicating both the magnitude and direction in which each feature shifts the overall probability of falling on either side of the binary classifier (i.e., responder vs. non-responder). SHAP analysis indicated that lower 1D stretch, lower biomarker score, absence of ischemic cardiomyopathy, lower QOL score, and higher age were strong global contributors within the algorithm for identifying responders.

Fig. 4
figure 4

Global and local interpretations of model predictions

(A) SHAP plot shows the feature importance in our model. 1D stretch, biomarker score, ischemic cardiomyopathy, QOL score, and age were indicated as the top 5 most important features for determining patient response probability. The scatter width and separation indicate the feature importance, and the color indicates which direction of that feature value is predictive of high vs. low patient response. (B) LIME plot shows the most significant contributing features for an example responder wherein a 1D Stretch of ≤ 1.08 along with a lack of RBBB, atrial flutter, ischemic cardiomyopathy, AT_PSVT, PAF, and SA surgery increased the probability of responding favorably to CRT treatment. (C) LIME plot shows the most significant contributing features for an example non-responder wherein a 1D Stretch of > 1.14 along with a lack of VT-SVT, Afib, and nonsustained VT increased the probability of not responding to treatment. On the other hand, a history of ischemic cardiomyopathy also affected the predicted non-response to CRT.

To demonstrate feature importance in a local, patient-specific visualization, we used LIME for an example responder and an example non-responder (Fig. 4B, C). For the example responder, who had a high predicted probability of responding to CRT (0.81), a 1D stretch of ≤ 1.08 and no history of RBBB, atrial flutter, ischemic cardiomyopathy, AT-PSVT, PAF, or SA surgery pushed the prediction toward response, whereas no history of VT-SVT or non-sustained VT and a biomarker score > 2 contributed toward non-response for this patient. For the example non-responder, who had a high predicted probability of not responding to CRT (0.75), a 1D stretch of > 1.14, a history of ischemic cardiomyopathy, and no history of VT-SVT, Afib, or non-sustained VT pushed the prediction toward non-response, whereas no history of RBBB, atrial flutter, SA surgery, or PAF and a biomarker score of zero accounted for this patient’s small residual probability of response to CRT.

Discussion

While CRT offers significant clinical benefits for many heart failure patients, a large proportion of the population does not respond positively to treatment [4]. This high patient-to-patient variability presents a need for predictive methods to help identify which patients will or will not benefit from CRT based on information obtained before the procedure. We hypothesized that integrating multiple data sources and including biochemical levels from serum panels would significantly improve the predictive ability of machine learning algorithms.

Using previously obtained patient data in the SMART-AV trial, we built a novel algorithm that integrates demographic data, physical characteristics, medical history, circulating biomarker levels, and echocardiography data to improve the prediction of CRT response before surgical intervention. In a previous study, Spinale and colleagues showed significant predictive power for identifying CRT response using pre-surgical levels of specific serum protein biomarkers (sST-2, sTNFr-II, MMP-2, and CRP) [15]. Given the important roles of inflammation and extracellular matrix turnover for regulating cardiac remodeling related to CRT, it should be no surprise that circulating proteins are associated with CRT response either as upstream regulators or downstream correlates. We combined the Spinale et al. patient biomarker score with 40 other input features spanning echo-based LV metrics, medical history, demographic information, and basic clinical assessments. Using these features enabled our ensemble machine learning classifier to correctly identify 71% of patient response outcomes, achieving an AUC of 0.784 – a substantial improvement over the previous study using the biomarker score alone [15].

A major limitation of many machine learning approaches is the ‘black-box’ nature of their predictions, or in other words, their lack of explainability. Future adoption of artificial intelligence into the clinical decision-making process will undoubtedly depend on the ability to explain (to some degree at least) why models predict what they predict and to identify the driving variables within the algorithms, especially for high-risk and costly decisions like CRT treatment. To improve interpretability for such decisions, a growing emphasis is being placed on ‘glass-box’ or ‘white-box’ techniques. We employed SHAP analysis to elucidate the relative global contribution of each feature to the patient response probability output of our model (Fig. 4). This analysis revealed that important features came from diverse data sources, with the top five features including echo-based data (1D stretch), serum protein data (biomarker score), co-morbidity data (ischemic cardiomyopathy), clinical evaluation data (QOL score), and demographic data (patient age). In addition, LIME revealed the features responsible for personalized predictions and showed that diverse feature sets drive individual responses to treatment. Of course, we must emphasize that the power of these features to predict CRT response reflects their correlation with cardiac remodeling and does not necessarily indicate a mechanistic role in causing cardiac remodeling. Additional notable limitations include a relatively short follow-up time of 6 months and a relatively small patient sample size (compared to the thousands of patients used in electronic health record-based algorithms).

Numerous recent studies have applied a wide range of machine learning approaches to predict CRT response from diverse datasets [8,9,10,11,12,13,14]. These reports have generally produced AUC values > 0.7, with the best-performing algorithms reaching ~0.8 (comparable to our AUC of 0.784). The data types used in these past reports have varied (electronic health records, clinical imaging, demographic data, electrocardiograms, etc.), and the computational algorithms have ranged from simple regression models to more complicated approaches including gradient boosting [8], Naïve-Bayes [9], multiple kernel learning [10], random forest [11], adaptive lasso [12], and support vector machines [13]. In agreement with our results, the most important predictors from past studies have spanned different data types across comorbidity (e.g., ischemic cardiomyopathy and LBBB), electro-mechanical (e.g., systolic blood pressure, QRS width, and wall strains), and demographic data (e.g., age and sex) [8, 10, 12]. This diversity of predictor types further supports our underlying hypothesis that various data sources are not necessarily redundant and can therefore provide additive benefit for identifying CRT response.

Current clinical guidelines define specific eligibility criteria on which physicians base their CRT recommendations [26]. The increasing accuracy of computational predictions suggests that such recommendation criteria could benefit from incorporating personalized model-based probabilities. Encouragingly, our patient stratification demonstrated 96% accuracy in the highest and lowest response subgroups, with significant differences in volume and functional changes over six months post-CRT. The use of higher-resolution (quintile) binning was motivated by the potential practical utility of labeling patients as very high, high, neutral, low, or very low response categories. Clinician decisions are often more complicated than simply “operate vs. do not operate”, so the intermediate group binning could inform when to take other clinical options (e.g., additional measurements or prolonged observation). Our algorithm was built and tested using only baseline, pre-CRT measurements, demonstrating that machine learning algorithms can harness a composite set of demographic, functional, and biomarker data obtained at the time of patient evaluation for CRT and provide predictive value on the ultimate CRT response. As future model developments are likely to further improve prediction accuracy across a broader number of patients, future clinical and ethical discussions will prove vital to appropriately leverage this predictive information in CRT decisions.

Conclusion

In this study, we have shown that integrating multiple types of data, including demographics, circulating biomarkers, and echo-based structural features, can improve the ability of machine learning algorithms to identify CRT responders and non-responders before intervention. Further, interpretability approaches like SHAP and LIME can help elucidate each feature’s contribution to the predicted response at both the cohort level and the individual patient level.