Take home message

This extensive systematic review addresses the characteristics, quality and performance of all mortality prediction models which have ever been developed (n = 58) or externally validated (n = 225) in patients requiring extra-corporeal membrane oxygenation (ECMO) for refractory circulatory and/or respiratory failure. Despite the large number of ECMO mortality prediction models that have been published, current models seem unsuitable for individual decision making due to high concerns for bias and methodological shortcomings. Furthermore, the conditionality of models on the fact that ECMO had already been initiated in all cohorts further limits their applicability for ECMO allocation.

Introduction

Extracorporeal membrane oxygenation (ECMO) is increasingly used as a salvage support modality in patients with refractory cardiogenic shock and/or severe respiratory failure. Despite its widespread application [1, 2], ECMO remains associated with high complication rates [3], and a considerable number of patients eventually die [4], or experience negative long-term sequelae [5, 6]. The increased use of ECMO for a broadened range of indications poses a significant socio-economic burden [7, 8] as its management requires extensive critical care by highly qualified personnel. Therefore, it is important to reserve ECMO support mainly for those who benefit most.

Prediction models could potentially aid in important decisions such as the allocation of ECMO to patients who would experience a clear survival advantage when receiving such support. In recent years, many prediction models have been developed specifically for use in ECMO patients [9,10,11,12,13,14,15,16]. However, a comprehensive overview of these models, their predictive performance and their practical applicability, which could inform clinicians when and in which context they can be applied, is currently lacking.

The primary objective of this systematic review was to summarize and appraise all published prognostic models for patients requiring veno-arterial (V-A) or veno-venous (V-V) ECMO support for severe cardiocirculatory shock and/or respiratory failure. As a secondary objective, model performance was assessed for each model.

Methods

This review was registered at the international prospective register of systematic reviews PROSPERO (University of York) with registration number CRD42021251873. The formalized scope of this review is outlined in the Population, Intervention, Comparator, Outcomes, Timing and Setting (PICOTS) table available in Table 1 of Electronic Supplementary Material (ESM) Appendix 1.

Eligibility criteria

We considered all original research reports reporting on the development, redevelopment and/or external validation of a multivariable (≥ 2 predictors) model aimed at the prediction of all-cause mortality in adult (> 18 years) patients receiving V-A or V-V ECMO support for severe cardiocirculatory and/or respiratory failure. Only peer-reviewed research articles published in English were considered for inclusion. There were no limitations on prediction horizon or year of publication.

Search and data extraction

A systematic literature search (details available in ESM Appendix 2) was undertaken in PubMed and EMBASE on January 3, 2022. Two reviewers (LP and JB) independently screened all articles for eligibility based on title and abstract. Subsequently, both reviewers assessed the remaining articles in full text. Discrepancies between reviewers were resolved through discussion with a third and independent reviewer (CM). Thereafter, articles citing studies reporting on the derivation of prognostic models were extracted through reference search in ISI Web of Science. In addition, a cross-reference check was performed in all retrieved articles to identify other eligible articles.

Two reviewers (LP and JB) independently extracted data using a standardized data extraction form (ESM Appendix 3). Again, discrepancies between reviewers were resolved through discussion with a third independent reviewer (CM). Items of the data extraction form were based on the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modeling Studies (CHARMS) which addresses key information on the source of data, study design, participants, outcomes, candidate predictors, prediction horizon, sample size, missing data, model development, model performance and model evaluation [17]. The Prediction model Risk Of BiAS Tool (PROBAST) was used to assess the risk of bias and concern for applicability per model, using signaling questions in four different domains (participants, predictors, outcomes and analysis) [18]. The overall risk of bias was judged ‘low’ when all four domains were scored as having low risk of bias. When at least one domain indicated a high risk of bias, overall risk of bias was scored high. When the risk of bias was ‘unclear’ in at least one domain and all other domains were scored low risk of bias, the overall risk of bias was scored ‘unclear’. This scoring system was also applied to the assessment of concerns of study applicability, indicating the extent to which the participants, predictors and outcomes in the study matched the review question.
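The rule for combining domain-level judgements described above can be written as a small function. This is an illustrative sketch only: the function name `overall_judgement` and the string labels are assumptions made for this example and are not part of PROBAST itself.

```python
from typing import Iterable

VALID = {"low", "high", "unclear"}


def overall_judgement(domain_ratings: Iterable[str]) -> str:
    """Combine per-domain ratings into one overall rating.

    Hypothetical helper: 'high' if any domain is high, otherwise
    'unclear' if any domain is unclear, otherwise 'low'.
    """
    ratings = list(domain_ratings)
    if not ratings or any(r not in VALID for r in ratings):
        raise ValueError("ratings must be 'low', 'high' or 'unclear'")
    if "high" in ratings:
        return "high"
    if "unclear" in ratings:
        return "unclear"
    return "low"


# Risk of bias is judged over four domains, applicability over three.
risk_of_bias = overall_judgement(["low", "low", "unclear", "low"])
applicability = overall_judgement(["low", "high", "low"])
print(risk_of_bias, applicability)
```

Because a single 'high' domain dominates, a model with one high-risk and one unclear domain is still scored as high risk overall.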

We extracted concordance (c) statistics as a measure of discrimination. A c-statistic represents the ability of a model to distinguish patients with the outcome of interest from those without it. A higher c-statistic indicates better discrimination (range 0–1, with 0.5 corresponding to chance-level discrimination). For the assessment of calibration, observed versus expected (O:E) ratios were extracted. O:E ratios reflect the agreement between the observed and predicted outcomes and the direction of miscalibration, i.e., O:E > 1 reflects underprediction and O:E < 1 overprediction. When the predicted mortality was not reported, the reported mean model score in the dataset was used to estimate the corresponding predicted mortality or survival, if possible. When a study reported separate performance metrics for derivation and validation datasets, both values were included as independent observations.
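To make the two performance measures concrete, the sketch below computes a c-statistic and an O:E ratio from individual predicted risks and observed outcomes. The function names and the five-patient dataset are hypothetical and serve only as illustration; the review itself extracted these metrics from published reports rather than from patient-level data.

```python
from itertools import product
from typing import Sequence


def c_statistic(predicted: Sequence[float], died: Sequence[int]) -> float:
    """Share of (non-survivor, survivor) pairs in which the non-survivor
    received the higher predicted risk; tied pairs count as 0.5."""
    events = [p for p, d in zip(predicted, died) if d == 1]
    non_events = [p for p, d in zip(predicted, died) if d == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e, n in product(events, non_events)
    )
    return score / (len(events) * len(non_events))


def oe_ratio(predicted: Sequence[float], died: Sequence[int]) -> float:
    """Observed number of deaths divided by the expected number,
    where the expected number is the sum of predicted risks."""
    return sum(died) / sum(predicted)


# Hypothetical cohort: predicted mortality risk and observed death (1) or survival (0).
risks = [0.9, 0.7, 0.6, 0.4, 0.2]
outcomes = [1, 1, 0, 1, 0]
print(c_statistic(risks, outcomes))  # 5 of 6 pairs are ranked correctly
print(oe_ratio(risks, outcomes))     # above 1: the model underpredicts mortality here
```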

Statistical analysis

We pooled the logit c-statistics and log O:E ratios [19] of prediction models with random-effects meta-analysis when these metrics were available in at least 10 independent datasets with comparable outcome of interest, i.e., we did not mix short-term and long-term mortality. If measures of uncertainties, i.e., confidence intervals (CI) and/or standard errors (SE) of the c-statistic or O:E ratio, were not reported they were approximated as described earlier [19]. Pooled c-statistics and O:E ratios were presented using forest plots. In addition, we calculated 95% prediction intervals (PI) [19]. A 95% PI reflects the range in which each performance metric is expected to fall in future validation studies and reflects the degree of between-study heterogeneity.
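The pooling step can be sketched as follows. This is a simplified illustration, not the analysis code used in the review: it assumes a DerSimonian-Laird estimate of the between-study variance, converts standard errors to the logit scale with the delta method, and uses a normal quantile for the prediction interval, whereas the 'metamisc' and 'metafor' packages may use other estimators and a t-distribution. All input values are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Dict, Sequence


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def pool_c_statistics(c_values: Sequence[float], c_ses: Sequence[float],
                      level: float = 0.95) -> Dict[str, object]:
    """Random-effects pooling of c-statistics on the logit scale."""
    k = len(c_values)
    if k < 3 or k != len(c_ses):
        raise ValueError("need at least three studies with matching SEs")
    y = [logit(c) for c in c_values]
    # Delta method: SE(logit c) = SE(c) / (c * (1 - c)).
    v = [(se / (c * (1.0 - c))) ** 2 for c, se in zip(c_values, c_ses)]
    w = [1.0 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    scale = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / scale)  # DerSimonian-Laird estimate
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_mu = math.sqrt(1.0 / sum(w_re))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # normal approximation
    half_ci = z * se_mu
    half_pi = z * math.sqrt(tau2 + se_mu ** 2)
    return {
        "pooled": inv_logit(mu),
        "ci": (inv_logit(mu - half_ci), inv_logit(mu + half_ci)),
        "pi": (inv_logit(mu - half_pi), inv_logit(mu + half_pi)),
        "tau2": tau2,
    }


# Hypothetical c-statistics and standard errors from four validation cohorts.
result = pool_c_statistics([0.62, 0.70, 0.75, 0.66], [0.04, 0.03, 0.05, 0.035])
print(result)
```

The prediction interval adds the between-study variance to the uncertainty of the pooled estimate, which is why it is always at least as wide as the confidence interval.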

To assess whether model performance was influenced by differences in study characteristics (prediction moment, geographical location, year in which recruitment started, mean model score and risk of bias), patient characteristics (median age, percentage male patients, percentage of patients with the predicted outcome) or features relating to the disease or ECMO configuration (percentage V-A ECMO, percentage bacterial and viral pneumonia, pre-ECMO cardiac arrest, postcardiotomy, immunocompromised status prior to ECMO), univariable random-effects meta-regression was performed. In addition, subgroup analysis was performed to assess performance of prediction models in specific cohorts (e.g., after excluding extracorporeal cardiopulmonary resuscitation (ECPR) patients). This systematic review was reported according to the PRISMA statement for reporting systematic reviews and meta-analyses [20]. RStudio version 1.3.1093 with the R packages ‘metamisc’ and ‘metafor’ was used for data analysis.
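The core of a univariable meta-regression is a weighted least-squares fit of the (transformed) performance estimate on one study-level covariate. The sketch below shows only that step and treats the residual between-study variance `tau2` as a known input, which is a simplifying assumption; packages such as 'metafor' estimate it from the data. The function name and arguments are hypothetical.

```python
import math
from typing import Sequence, Tuple


def meta_regression(y: Sequence[float], v: Sequence[float],
                    x: Sequence[float], tau2: float = 0.0
                    ) -> Tuple[float, float, float]:
    """Fit y = intercept + slope * x with weights 1 / (v + tau2).

    y: study estimates (e.g., logit c-statistics)
    v: within-study variances of y
    x: one study-level covariate (e.g., median age)
    Returns (intercept, slope, standard error of the slope).
    """
    if not (len(y) == len(v) == len(x)) or len(y) < 3:
        raise ValueError("need at least three studies of equal-length input")
    w = [1.0 / (vi + tau2) for vi in v]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    if sxx == 0.0:
        raise ValueError("covariate must vary between studies")
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    se_slope = math.sqrt(1.0 / sxx)
    return intercept, slope, se_slope


# Hypothetical data: logit c-statistics against median age of four cohorts.
ages = [40.0, 50.0, 60.0, 70.0]
logit_c = [0.5 + 0.01 * a for a in ages]
variances = [0.02, 0.03, 0.01, 0.04]
print(meta_regression(logit_c, variances, ages, tau2=0.01))
```

A slope whose confidence interval (slope ± 1.96 × standard error under a normal approximation) includes zero gives no evidence that the covariate explains differences in performance.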

Results

Our search yielded 4905 unique publications (Fig. 1). Of these, 4485 articles were excluded based on title/abstract and 328 based on full-text review, leaving 92 eligible articles. After identification of 4 additional articles through cross-reference checking, 96 articles were included in our study. Of these 96 articles, 46 described the development of 58 models and 77 reported on 225 external model validations; 27 articles reported on both model development and external validation(s).

Fig. 1 Study flow diagram

Patients and settings

Of 58 developed models, 21 (15 articles) [9, 11, 13, 21,22,23,24,25,26,27,28,29,30,31,32] and 31 (29 articles) [10, 12, 14,15,16, 30, 33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57] were derived from ECMO recipients suffering from respiratory and cardiocirculatory failure, respectively. Six other models (in two studies) were derived in a mixed cohort [58, 59]. All models were developed in patients who received ECMO and the majority were developed in single-center cohorts (n = 40, 69%). The models’ derivation cohort sizes ranged from 17 to 4175 patients. A comprehensive overview of the populations, geographical location and recruitment years of derived models is available in Fig. 2a. Recruitment of cohorts for model derivation and validation occurred between 2002 and 2020. Patients’ median or mean age (depending on normality of the distribution) ranged from 37 to 76 years across all cohorts. All cohorts comprised predominantly males with the percentage of male patients varying between 50 and 85%. The time point of prediction was shortly (within 48 h) after ECMO initiation in the majority of models (n = 42, 72%) [9,10,11,12,13,14,15,16, 21,22,23,24,25,26,27, 29,30,31,32,33,34,35, 37,38,39, 41,42,43,44,45, 49, 50, 52, 54, 59], during ECMO support (n = 8, 14%) [28, 40, 51, 53], after weaning (n = 1, 2%) [36], before durable mechanical circulatory support implantation (n = 1, 2%) [46], before dialysis while on ECMO (n = 1, 2%) [58] and pre-surgery in postcardiotomy patients (n = 1, 2%) [55]. In three models (5%) [47, 48, 57], the time point of prediction was unclear. Pre-ECMO variables were used for mortality prediction in the majority of models (67%); note that even when only pre-ECMO variables were used, the time point of prediction was after initiation of ECMO.
The most frequently included predictors were age, lactate, Sequential Organ Failure Assessment (SOFA) score, duration of mechanical ventilation prior to ECMO, weight, gender and immunocompromised status pre-ECMO (in 23, 18, 14, 7, 6, 5 and 5 models, respectively). An overview of the included predictors per model for ECMO patients suffering respiratory failure, cardiocirculatory failure and mixed cohorts is available in ESM Appendix 4 (Tables 2, 3 and 4).

Fig. 2 (a) Cohorts of all 58 derived models. (b) Cohorts of all 225 external validations

The outcome of interest, i.e., the outcome predicted by the model, was short-term mortality or survival in 49 (84%) models (in-hospital mortality or survival, n = 41 (71%); ICU mortality/survival, n = 3 (5%); 30-day mortality/survival, n = 2 (3%); mortality during ECMO, n = 2 (3%); and mortality in the first 7 days after initiation, n = 1 (2%)) and mid-to-long-term mortality or survival (at 3–36 months after ECMO support) in 9 (16%) models. No dynamic models were identified, i.e., none of the models updated its predictions over time by taking into account serial changes in patient status or events. C-statistics in derivation cohorts varied between 0.602 and 0.970. Relevant study characteristics of derived models are presented in ESM Appendix 5, Table 5.

External validation

Of all 58 models which were developed for ECMO patients, only 14 (24%) were externally validated (see Table 5 of ESM Appendix 5). Of these, the Survival after Veno-Arterial ECMO (SAVE) score, the Respiratory ECMO Survival Prediction (RESP) score and the Predicting death for severe ARDS on VV-ECMO (PRESERVE) score were externally validated more than 10 times (20, 18, and 12 times, respectively). Six prognostic models aiming to predict all-cause mortality in general ICU patients were externally validated in ECMO patients. Of these, the SOFA score, the Simplified Acute Physiology Score (SAPS II), and the Acute Physiology and Chronic Health Evaluation (APACHE II) score were externally validated in more than 10 ECMO cohorts (37, 21 and 22 validations, respectively). Study and population characteristics of the external validation studies are presented in Fig. 2b. The c-statistics of externally validated models ranged from 0.440 to 0.920. An overview of external validations grouped by model is provided in ESM Appendix 6, Table 6.

Predictive performance

Discrimination

Figure 3a presents the pooled c-statistics for the various models. Pooled c-statistics of the SAVE, RESP, SAPS II, APACHE II and SOFA score were comparable (range 0.66–0.70). 95% PIs were wide for all aforementioned models indicating extensive between-study heterogeneity (Fig. 3b).

Fig. 3 (a) Pooled c-statistics from external validation studies for different models. (b) Prediction intervals of the c-statistics for different models. (c) Pooled O:E ratios from external validation studies for different models. (d) Prediction intervals of the O:E ratios for different models

Calibration

Pooled O:E ratios were available for the SAVE, SAPS II, APACHE II and SOFA score (Fig. 3c). All models tended to underestimate mortality (O:E ratio > 1), except for the SAPS II score which, on average, overestimated mortality risk (O:E ratio < 1). Similar to the prediction intervals of the c-statistics, the prediction intervals of the O:E ratios were wide, suggesting large between-study heterogeneity (Fig. 3d). Forest plots displaying reported c-statistics and O:E ratios of studies validating the SAVE, RESP, SAPS II, APACHE II and SOFA score are available in ESM Appendix 7, Fig. 1a–i.

Sources of heterogeneity

Meta-regression analyses for the SAVE, RESP, SAPS II, APACHE II and SOFA score did not reveal an association between model performance and the prespecified factors (ESM Appendix 8, Figs. 2–6). After excluding studies with a high risk of bias in more than two domains (see below), the pooled O:E ratio of the APACHE II model changed from a tendency to underpredict mortality to overprediction (from 1.09, 95% CI 0.92–1.30 to 0.64, 95% CI 0.54–0.74). No significant change in direction of pooled estimates was observed for the other models (ESM Appendix 9.1). Discriminatory performance of the SAVE score was numerically higher in cohorts without ECPR patients (n = 7, c-statistic 0.740 (0.590–0.840)) than in cohorts also comprising patients after ECPR (n = 16, c-statistic 0.700 (0.640–0.750)) (ESM Appendix 9.2). The SAPS II, APACHE II and SOFA score performed best in cardiocirculatory and mixed cohorts (ESM Appendix 9.3). Performance in cohorts of adequate sample size (> 100 patients with the outcome) was comparable to performance in smaller cohorts (ESM Appendix 9.4). Lastly, performance of models in external validations did not differ significantly after excluding studies published before the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines [60] became available (ESM Appendix 9.5). However, all aforementioned findings have to be interpreted with caution due to the small number of studies available for subanalysis.

Risk of bias assessment

Risk of bias was scored high for all but one derived model [14] (n = 57, 98%) and in 213 (95%) out of 225 external validations (Fig. 4a, c). The main reason for the high risk of bias in both model derivations and validations was small sample size. Only 4 (7%) derived models met the recommended events-per-variable (EPV) threshold of ≥ 10 events (death or survival) per candidate variable, and 36 (16%) of the 225 external validations met the PROBAST recommendation of having at least 100 outcome events. Another reason for high risk of bias was a lack of reported calibration measures (a calibration plot was provided in 29 (13%) of external validations). The number of studies with a low risk of bias increased in recent years (from 2018 onwards, see Fig. 8 in ESM Appendix 9.5). Risk of bias assessment for each included derived model and external validation is provided in ESM Appendix 10. Concerns for applicability arose in 41% of derived models and 12% of external validations (Fig. 4b, d) and seemed primarily caused by a mismatch between the type of included patients and the stated aim of the study, namely to support the decision whether or not to initiate ECMO support [10,11,12, 16, 21, 25, 27, 33, 34, 42, 50]: all models were developed in patients who had already received ECMO support.
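The two sample-size criteria mentioned above amount to simple arithmetic, sketched below. The function names and the example cohort are hypothetical; counting the smaller of the two outcome groups as the number of 'events' is an assumption made for this illustration.

```python
def events_per_variable(n_deaths: int, n_survivors: int,
                        n_candidate_predictors: int) -> float:
    """EPV, taking the smaller outcome group as the number of events (assumption)."""
    if n_candidate_predictors <= 0:
        raise ValueError("need at least one candidate predictor")
    return min(n_deaths, n_survivors) / n_candidate_predictors


def adequate_for_development(n_deaths: int, n_survivors: int,
                             n_candidate_predictors: int,
                             threshold: float = 10.0) -> bool:
    """True when the EPV reaches the threshold of 10 used in the text."""
    return events_per_variable(n_deaths, n_survivors,
                               n_candidate_predictors) >= threshold


def adequate_for_validation(n_outcome_events: int, threshold: int = 100) -> bool:
    """True when a validation cohort has at least 100 outcome events."""
    return n_outcome_events >= threshold


# Hypothetical derivation cohort: 150 patients, 60 deaths, 12 candidate predictors.
print(events_per_variable(60, 90, 12))
print(adequate_for_development(60, 90, 12))
print(adequate_for_validation(60))
```

In this hypothetical cohort the EPV is 5, so the same data would need either twice as many events or half as many candidate predictors to reach the threshold.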

Fig. 4 (a) Risk of bias assessment based on the PROBAST tool for all derived models. (b) Applicability assessment based on the PROBAST tool for all derived models. (c) Risk of bias assessment based on the PROBAST tool for all external validations. (d) Applicability assessment based on the PROBAST tool for all external validations

Discussion

In this first comprehensive systematic review and meta-analysis of studies focusing on mortality prediction in the setting of ECMO, we identified 58 unique prediction models designed specifically for this purpose. Among them, the SAVE and RESP score were most frequently externally validated but showed only moderate discriminative performance overall. A similar observation was made for the performance of the general ICU-derived prediction models SAPS II, APACHE II and SOFA score in ECMO cohorts. In addition, the majority of models had a high risk of bias and were conditional on the fact that ECMO support had already been initiated. As such, current models seem unfit to aid in decisions regarding individual cases.

The need for accurate and reliable prognostication as an aid for clinical decision making in patients with ECMO support is underscored by the large number of publications and exists for several reasons. First, prediction models may assist in the decision to cease further treatment during ECMO support when they indicate (near) certainty of a fatal or unwanted outcome. Second, reliable prediction models may improve the cost-effectiveness of ECMO, as the expanding use of this resource-intensive support modality creates a considerable economic burden [1]. Finally, although current models would not suffice for such a purpose, future prediction models based on source populations that also include patients who ultimately did not receive ECMO could provide a more objective assessment tool for selecting patients in whom chances of success (survival and weaning) are best.

Despite the abundance of published prediction models, their clinical applicability seems limited by several factors. First, nearly all studies (both derivation and external validation) had a high risk of bias, which was mainly attributable to the criterion of small sample size. Small sample sizes may cause bias through overfitting in derivation studies [18, 61] and provide distorted estimates of model performance in external validation studies. Second, overall performance of the different models was merely moderate, leaving a substantial possibility of misclassification. Of special interest was the post hoc observation that performance measures were not substantially different between general ICU prediction models and models that were specifically developed in, and for, ECMO recipients. ECMO-specific models would be expected to outperform general ICU models, as the ECMO population likely features additional and typical clinical characteristics, and hence different predictors, compared to a general ICU population, as is the case for models derived specifically for sepsis [62] and cancer patients [63] in the ICU. The observed comparable performance of general ICU models is likely explained by the clinical heterogeneity of ECMO patients in terms of age, indications and management, and by the fact that important prognostic information arising during the course of ECMO is not uniformly incorporated in all models, whether ECMO-specific or general ICU models. In general, it appears that, in current models, predictors of mortality in patients suffering from cardiocirculatory and/or respiratory failure are to a certain extent similar in patients with and without ECMO.

In addition, large heterogeneity in model performance (as illustrated by wide prediction intervals) was found across studies and populations. We were not able to reveal significant associations between model performance and study, patient or disease characteristics in extensive meta-regression analyses. The absence of explanatory findings in these analyses could possibly be caused by the small numbers of external validation studies. The heterogeneity may be explained by differences in ECMO indications and management between physicians, centers and countries. The observation of lower model performance in external validations often prompted the development of a new model, which in turn led to a wide and undesirable variation of available prediction models. This dilemma is well recognized, and it is therefore recommended to update models instead of ‘starting from scratch’, especially when it is difficult to generate sufficiently large sample sizes [64].

The clinical applicability of all the models analyzed here was additionally limited by important conditionalities, i.e., the patients selected in the development cohorts and the timing of prediction. A physician may use the predicted mortality of a specific patient in two potential scenarios: after or before the initiation of ECMO (as described above). It should be noted that all 58 prediction models were derived in patients who had already received ECMO. Thus, physicians are currently informed only about the first scenario, i.e., the patient already being on ECMO, as the predicted mortality is conditional on having already received ECMO and being admitted to an ECMO-practicing center. This is still the case when only pre-ECMO variables are incorporated in the prediction model. This conditionality renders current models methodologically unfit for the purpose of decision making prior to the initiation of ECMO, despite the aim of a substantial number of publications to do so [10,11,12, 16, 21, 25, 27, 33, 34, 42, 50]. For prediction in patients on ECMO, cohorts only comprising patients on ECMO support would obviously be preferred. However, even then clinical applicability is limited, as most models are based on data acquired around a single time point of prediction, usually upon initiation of ECMO.

Future directions

Findings from this systematic review of current prediction models in ECMO highlight the urgent need to (re)design and create reliable models that can assist in the potentially irreversible clinical decision-making processes that are inevitably encountered during ECMO support. Currently available prediction models should be externally validated, regularly updated and recalibrated to adjust for changes in case mix and survival rates [65]. Models should also be evaluated and, possibly, be adjusted for specific subgroups of patients (e.g., ARDS or postcardiotomy). For this purpose, individual patient data (IPD) meta-analysis could help to generate sufficiently large patient cohorts. Another interesting development to improve performance of prediction models would be to update probabilities for survival over time by incorporating variables (e.g., complications and parameters reflecting organ (dys)function) collected during the course of ECMO support. For this purpose, dynamic prediction approaches and artificial intelligence techniques could be of significant interest.

To aid in the decision to start or withhold ECMO for a given patient, it is essential to develop a model in one or multiple source populations also comprising patients who did not yet receive ECMO, or would not receive ECMO support at all. The actual initiation of ECMO can then be modeled as an additional variable in the equation. Bayesian modeling techniques might also prove helpful for this, by incorporating available knowledge from the literature and expert opinion about patients who are rarely represented in current models (e.g., elderly patients and cancer patients) in the statistical model.

Limitations of this review

In our approach, we decided a priori to pool c-statistics and O:E ratios of external validation studies irrespective of heterogeneity in patient cohorts (PROSPERO, CRD42021251873), and potentially also in pooled outcomes. We decided to do so because all studies clearly addressed the same research question, namely the external validation of a mortality risk prediction model in critically ill ECMO patients. As such, we consider the observed heterogeneity an outcome measure rather than a shortcoming of the study design. Second, pooling was only possible when c-statistics and O:E ratios were reported or could be estimated in more than 10 studies. As this was only true for the SAVE, RESP, SAPS II, APACHE II and SOFA score, it could have resulted in selective reporting of pooled c-statistics and O:E ratios and the possibility of bias. To mitigate this risk as much as possible, we calculated these performance metrics in as many validation studies as possible by calculating predicted mortality on the basis of average model scores. We did so as these metrics are of great importance for the interpretation of performance and the subsequent safe use of prediction models in clinical practice [66].

Conclusions

Robust mortality prognostication could significantly help with important decisions regarding treatment benefit and futility in patients supported with ECMO. Although a large number of ECMO mortality prediction models have been published for this purpose, discriminative performance of these models in external validation studies was moderate at best. In addition, these models suffered from methodological shortcomings and were conditional on the fact that ECMO had already been initiated. As such, currently available models seem unsuitable for individual decision-making pre-ECMO and while on ECMO support. Future research should focus on model development in cohorts of ECMO-eligible patients as well as the incorporation of time-dependent parameters and events.