Cine-cardiac magnetic resonance to distinguish between ischemic and non-ischemic cardiomyopathies: a machine learning approach

Objective This work aimed to derive a machine learning (ML) model for the differentiation between ischemic cardiomyopathy (ICM) and non-ischemic cardiomyopathy (NICM) on non-contrast cardiovascular magnetic resonance (CMR).

Methods This retrospective study evaluated CMR scans of 107 consecutive patients (49 ICM, 58 NICM), including atrial and ventricular strain parameters. We used these data to compare an explainable tree-based gradient boosting additive model with four traditional ML models for the differentiation of ICM and NICM. The models were trained and internally validated with repeated cross-validation according to discrimination and calibration. Furthermore, we examined the most important variables for distinguishing between ICM and NICM.

Results A total of 107 patients and 38 variables were available for the analysis. Of those, 49 patients had ICM (34 males, mean age 60 ± 9 years) and 58 had NICM (38 males, mean age 56 ± 19 years). After 10 repetitions of the tenfold cross-validation, the proposed model achieved the highest area under the curve (0.82, 95% CI [0.47–1.00]) and the lowest Brier score (0.19, 95% CI [0.13–0.27]), showing competitive diagnostic accuracy and calibration. At Youden's index, sensitivity was 0.72 (95% CI [0.68–0.76]), the highest of all models. Analysis of the predictions revealed that both atrial and ventricular strain CMR parameters were important for the identification of ICM patients.

Conclusion The current study demonstrated that a ML model using multi-chamber myocardial strain and function parameters on non-contrast CMR enables the discrimination between ICM and NICM with competitive diagnostic accuracy.

Clinical relevance statement A machine learning model based on non-contrast cardiovascular magnetic resonance parameters may discriminate between ischemic and non-ischemic cardiomyopathy, enabling wider access to cardiovascular magnetic resonance examinations with lower costs and faster image acquisition.
Key Points
• The exponential growth in cardiovascular magnetic resonance examinations may require faster and more cost-effective protocols.
• Artificial intelligence models can be utilized to distinguish between ischemic and non-ischemic etiologies.
• Machine learning using non-contrast CMR parameters can effectively distinguish between ischemic and non-ischemic cardiomyopathies.

Supplementary Information The online version contains supplementary material available at 10.1007/s00330-024-10640-8.


Variable Selection
We used a random forest algorithm to rank the variables based on their mean Gini impurity decrease within each tree of the forest. We first fitted a random forest using all available variables in each fold of the 5-fold cross-validation (CV), and then selected only those attributes with an importance score at least 1.25 times greater than the average importance score. We then used the union of the five resulting feature sets for our ML analysis. The Gini impurity index ranges from 0 to 0.5 for binary targets, where a lower value indicates a purer split. It reflects the probability of misclassifying a randomly selected element from a set if it were labeled according to the distribution of target classes in that set. The choice of Gini impurity for variable selection is justified for several reasons. First, in our dataset, which contains roughly half as many features as subjects, multicollinearity is a potential concern. By leveraging Gini impurity, the features with the greatest impact on detecting ICM can be identified and prioritized, thereby mitigating the impact of multicollinearity and promoting feature diversity. Second, Gini impurity facilitates the identification of non-linear relationships between the variables and the target class. Lastly, with a limited sample size, outliers may strongly influence the analysis; Gini impurity is more robust to outliers than many traditional metrics, yielding more reliable and stable variable selection outcomes.
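The selection step described above can be sketched as follows, assuming a scikit-learn implementation; the synthetic data, the number of trees, and the random seeds are illustrative placeholders rather than the study's actual configuration.

```python
# Hedged sketch: per-fold random forest importance ranking with a 1.25x
# average-importance cutoff, then the union of selected features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the 107-patient, 38-variable dataset.
X, y = make_classification(n_samples=107, n_features=38, random_state=0)

selected = set()
for train_idx, _ in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X[train_idx], y[train_idx])
    imp = rf.feature_importances_  # mean decrease in Gini impurity per feature
    # keep features whose importance is at least 1.25x the average importance
    keep = np.where(imp >= 1.25 * imp.mean())[0]
    selected |= set(keep.tolist())  # union over the five folds

print(sorted(selected))
```

The union (rather than intersection) of the per-fold feature sets keeps any variable that proved informative in at least one fold, trading a slightly larger feature set for lower risk of discarding useful predictors.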

Model Definition
We employed the tree-based gradient-boosting generalized additive model (GB-GAM) to build a model for distinguishing between ICM and NICM. In gradient boosting, a type of ensemble learning, multiple estimators (typically decision trees) are trained sequentially, with each new estimator fitted to the residual errors of the ensemble built so far, to produce an estimator with greater generalization performance 1. The GB-GAM model is explainable-by-design, as it learns the relationship between each feature and the log-odds of the outcome separately. When testing on unseen data, the model takes every patient-specific feature value and indexes the learned feature functions to obtain the relative contribution of that feature to the predicted log-odds. It then sums all these contributions to obtain the final predicted log-odds of ICM, which is transformed into a probability score between 0 and 1. GB-GAM is also capable of discovering relevant pairwise feature interactions and integrating them into the final model with complete transparency and interpretability.
In greater detail, the GB-GAM algorithm starts by discretizing all continuous variables into a specific number of bins. The discretization simplifies the model and enhances its interpretability. Then, a standard additive model (the initial model) is fitted to capture the non-linear relationship between each individual binned feature and the response variable. In the GB-GAM algorithm, shallow regression trees are used to represent these smooth functions. The initial model, a linear combination of smooth functions of the individual features, serves as the starting point for the gradient boosting process, which adds shallow tree-like models to capture the pairwise interactions between features. After fitting the initial model, the algorithm focuses on pairwise interactions. To this end, the residuals of each individual feature are calculated by subtracting the values predicted by the initial model from the actual binned values; these residuals represent the part of the response variable that is not explained by the initial model. Then, for each pair of binned features, a shallow tree-like model is fitted to the residuals of one feature with respect to the other, and vice versa; in other words, it uses the residuals of the former to predict the residuals of the latter. This captures the pairwise interaction between the two features in both directions. Finally, the predicted values of the two separate models are multiplied to obtain a single value representing the pairwise interaction. These residuals are then used to fit the next set of shallow models in the gradient boosting process, which attempts to capture the remaining information in the data that was not captured by the previous models. The predictions of these new models are added to the predictions of the previous models to obtain updated predictions for the response variable. The process continues until a stopping criterion is met. The final model is the sum of the initial model and the tree-like models added during the gradient boosting process, and it captures both the main effect of each individual feature and the pairwise interactions between features.
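The per-patient scoring mechanics described above, where each feature's learned function is indexed by the binned feature value and the contributions are summed on the log-odds scale, can be sketched as follows. The bin edges, per-bin scores, intercept, and feature names are hypothetical illustrations, not values learned in the study.

```python
# Minimal sketch of additive log-odds scoring: each feature's learned
# function is a per-bin lookup table; contributions are summed, then
# passed through the logistic function. All values below are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

intercept = -0.2
bin_edges = {"strain_a": [10.0, 20.0], "strain_b": [5.0, 15.0]}   # hypothetical
bin_scores = {"strain_a": [0.8, 0.1, -0.6], "strain_b": [-0.4, 0.2, 0.9]}

def predict_proba(patient):
    log_odds = intercept
    for name, value in patient.items():
        b = np.searchsorted(bin_edges[name], value)  # which bin the value falls in
        log_odds += bin_scores[name][b]              # additive contribution of this feature
    return sigmoid(log_odds)

p = predict_proba({"strain_a": 12.5, "strain_b": 18.0})
print(round(p, 3))  # -> 0.69: sigmoid(-0.2 + 0.1 + 0.9)
```

Because each contribution is read off a lookup table before being summed, the same quantities that produce the prediction also serve directly as its explanation.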
We compared the performance of the GB-GAM model with four traditional machine learning (ML) algorithms, briefly described below. Random forest is a supervised, tree-based ensemble that uses bootstrap aggregating (also known as bagging) to train multiple decision trees and combine their predictions into a stronger classifier 3. Support vector machines are linear classifiers based on the maximum-margin concept; in binary classification settings, they search for the hyperplane that maximally separates the data points of the two classes 4. K-nearest neighbors is an algorithm that does not require explicit training and makes predictions on the fly, based on a similarity measure calculated over the k nearest neighbors of each new data point to be classified 5. Finally, logistic regression is a statistical method that models the expected value of a binary target variable conditional on the value of one or more features 5.
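A minimal sketch of the four comparison classifiers, assuming scikit-learn implementations; the synthetic data and the default hyperparameters shown are stand-ins, not the study's configuration.

```python
# Hedged sketch: the four baseline classifiers evaluated with 10-fold
# cross-validated AUC on a synthetic stand-in for the CMR feature matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=107, n_features=12, random_state=0)

models = {
    "random forest": RandomForestClassifier(random_state=0),
    # SVM and k-NN are distance/margin based, so features are standardized first
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
}

aucs = {name: cross_val_score(m, X, y, cv=10, scoring="roc_auc").mean()
        for name, m in models.items()}
for name, auc in aucs.items():
    print(f"{name}: {auc:.2f}")
```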
The following hyperparameters were used when training the models:

Probability Calibration
Briefly, a non-decreasing function is fitted to the predicted probabilities so that the calibrated predictions for the training data are as close as possible to the targets in terms of mean squared error.
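This description matches isotonic regression; a minimal sketch, assuming the scikit-learn implementation, with toy scores and labels rather than study data:

```python
# Hedged sketch: isotonic calibration fits a non-decreasing map from raw
# classifier scores to probabilities, minimizing squared error against
# the observed outcomes. Scores and labels below are illustrative only.
import numpy as np
from sklearn.isotonic import IsotonicRegression

raw_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])
labels = np.array([0, 0, 1, 1, 0, 1])

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrated = iso.fit_transform(raw_scores, labels)  # calibrated training predictions
print(calibrated)
```

At prediction time, `iso.predict(new_scores)` maps new raw scores through the same monotone function, so the ranking of patients is preserved while the probabilities become better calibrated.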

Model Training and Testing
Tenfold CV has several advantages compared to the classic split-sample approach: 1) the data available for training and testing is maximized, while avoiding any overlap between the data used to train the models and the data used to evaluate their performance, thereby reducing the risk of overfitting; 2) it reduces the bias when estimating the predictive ability of the models; and 3) it reduces the variance when estimating the generalization error 6. Briefly, the available data is divided into ten equally sized partitions, each containing similar proportions of ICM and NICM. Then, 9/10 of the data was used for feature selection and model training, and the remaining 1/10 was used for testing. This protocol ensures that each data point is used for evaluation once and only once, and that the model is always evaluated on previously unseen patients.
For the leave-one-out testing, N-1 subjects were utilized for feature selection, calibration, and model building, while the remaining subject served as the test case.This procedure was then iterated for every subject in the cohort, ensuring each was independently evaluated as the test case.
To increase robustness, we repeated the entire procedure 10 times, employing dataset shuffling in each iteration to create different partitions for feature selection and calibration.
The proposed model produced a continuous score from 0 to 1 representing the likelihood of ICM for all subjects in each testing fold. These scores were then concatenated and used to estimate average performance on unseen data.
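The repeated, stratified out-of-fold scoring protocol described above can be sketched as follows, assuming scikit-learn; the data and the logistic regression classifier are illustrative stand-ins for the study's feature matrix and models.

```python
# Hedged sketch: 10 repetitions of stratified 10-fold CV; each subject
# is scored exactly once per repetition by a model that never saw it,
# and the concatenated out-of-fold scores yield one AUC per repetition.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=107, n_features=12, random_state=0)

aucs = []
for rep in range(10):  # 10 repetitions with reshuffled partitions
    oof = np.empty(len(y))  # out-of-fold scores, one per subject
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
    for tr, te in cv.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        oof[te] = clf.predict_proba(X[te])[:, 1]  # score the unseen fold only
    aucs.append(roc_auc_score(y, oof))  # performance on concatenated scores

print(f"median AUC: {np.median(aucs):.2f}")
```

Reshuffling before each repetition (here via a different `random_state`) changes the fold composition, so the spread of the 10 AUC values reflects the sensitivity of the estimate to the partitioning.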

Variable Importance and Explanations of Case Examples
We performed the ranking based on the average absolute impact on the ML predictions, separately for each cross-validation testing subset. The ten most impactful variables were further analyzed by relating feature values to the ML-predicted log-odds of ICM.
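The ranking criterion can be sketched as follows; the contribution matrix below is a random stand-in for the per-patient, per-feature additive log-odds contributions an explainable model produces, and the feature names are hypothetical.

```python
# Hedged sketch: rank features by their mean absolute contribution to
# the predicted log-odds across the patients of one testing subset.
import numpy as np

rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(12)]  # hypothetical names
# rows = patients in a testing subset, columns = per-feature contributions
contributions = rng.normal(size=(20, 12)) * rng.uniform(0.1, 1.0, size=12)

impact = np.abs(contributions).mean(axis=0)  # average absolute impact
order = np.argsort(impact)[::-1][:10]        # ten most impactful variables
for i in order:
    print(f"{feature_names[i]}: {impact[i]:.3f}")
```

Averaging the absolute (rather than signed) contributions rewards features that consistently move predictions in either direction, so a variable that pushes some patients toward ICM and others away still ranks highly.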

Diagnostic Performance Evaluation
For leave-one-out (LOO) testing, out-of-fold predictions are obtained by pooling the predictions for the independent test case held aside in each LOO cross-validation (CV) fold. After one iteration of LOOCV, 119 predictions are obtained and used to measure the diagnostic performance of the models. The entire procedure is repeated 10 times, each time pooling the out-of-fold predictions. After the 10 repetitions, the resulting set of 10 performance measures is used to compute the median and 95% confidence intervals (CI): for probability-based measures, CI were calculated with the percentile method, based on the scores obtained from the ground truths and predictions across all CV repetitions; for metrics computed on binary labels after dichotomization at Youden's J index (sensitivity, specificity, F1 score, positive predictive value, and negative predictive value), CI were calculated with the bootstrap method (5000 iterations) on the ground truths and predicted probabilities accumulated over the 119 predictions and 10 repetitions, for a total of 1190 predictions. Confidence bands around curves were also computed using the bootstrap method with 5000 iterations.
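The Youden's J dichotomization and bootstrap CI steps can be sketched as follows, assuming scikit-learn for the ROC curve; the scores and labels are synthetic, and sensitivity stands in for the full set of threshold-based metrics.

```python
# Hedged sketch: dichotomize scores at the Youden's J optimal threshold,
# then bootstrap a 95% percentile CI for sensitivity. Data is synthetic.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(y_true * 0.4 + rng.normal(0.3, 0.25, size=200), 0, 1)

# Threshold maximizing Youden's J = sensitivity + specificity - 1 = tpr - fpr
fpr, tpr, thr = roc_curve(y_true, y_score)
cutoff = thr[np.argmax(tpr - fpr)]
y_pred = (y_score >= cutoff).astype(int)

def sensitivity(t, p):
    return ((t == 1) & (p == 1)).sum() / (t == 1).sum()

boot = []
for _ in range(5000):  # resample subjects with replacement
    idx = rng.integers(0, len(y_true), size=len(y_true))
    boot.append(sensitivity(y_true[idx], y_pred[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"sensitivity {sensitivity(y_true, y_pred):.2f} (95% CI [{lo:.2f}-{hi:.2f}])")
```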

Statistical Analysis
We used the Python implementation of the GB-GAM algorithm provided by the interpret framework. All statistical analyses were performed with the R software (R Foundation for Statistical Computing, Vienna, Austria, version 4.1.0) and the Python language (Python Software Foundation, Python Language Reference, version 3.9).

Eur Radiol (2024) Cau R, Pisu F, Pintus A et al