Background

In Europe, more than 300,000 out-of-hospital cardiac arrests (OHCAs) occur each year, resulting in 250,000 deaths [1]. Fewer than 10% of these patients leave the hospital alive without serious neurological sequelae [2]. Nearly all deaths occur early, during the first days and weeks, as a consequence of hypoxic-ischemic brain injury, which is aggravated by the reperfusion that follows the return of spontaneous circulation (ROSC). A reliable early estimate of prognosis helps clinicians give relatives accurate information and adapt the therapeutic strategy. In addition, an accurate estimate of prognosis may help to better identify subgroups of patients eligible for certain interventions and clinical research programs, such as early coronary reperfusion [3] or neuroprotective treatments [4].

To allow an early assessment of prognosis, it is possible to use scores based on variables available immediately upon admission to the intensive care unit (ICU). Severity scores used in the general population of ICU patients have been evaluated in the specific population of OHCA patients, but showed poor calibration and discrimination [5,6,7,8,9]. On the other hand, numerous specific scores, all calculable at hospital admission, have been developed within the restricted framework of OHCA patients [9,10,11,12,13,14,15,16,17,18,19]. For example, the Cardiac Arrest Hospital Prognosis (CAHP) score was evaluated in this situation and showed acceptable discrimination and calibration [10, 11]. However, to our knowledge, the respective performances of all these scores have never been compared simultaneously in the same population of patients in a prospective multicenter study. Historically, Utstein criteria (which are also used in the calculation of all these scores) were used for prognostication after OHCA at the prehospital phase, and recommendations were made to collect these elements under the term “Utstein style” [20,21,22]. Because these Utstein criteria are the minimum dataset to be collected in a study on OHCA, it seemed important to compare the discrimination of each score with that of a historical reference score built from the Utstein variables.

Thus, we designed the AfterROSC1 study with the main objective of comparing the performance of the CAHP score with that of the score derived from the Utstein style criteria for the prediction of functional prognosis after cardiac arrest. The secondary objective was to compare the performance of the other scores specific to cardiac arrest with that of the score derived from the Utstein style criteria.

Methods

The main objective of the present research was to evaluate, in a prospective, multicenter, observational study, the discrimination, calibration, and clinical utility of the CAHP score after OHCA as compared to the Utstein criteria. The secondary objective was to compare the performance of the other cardiac arrest-specific scores with the prediction offered by the Utstein criteria. The study was registered on ClinicalTrials.gov before it began (NCT04167891; August 1, 2020). The statistical analysis plan was approved before enrolment of the first patient by the Ethics Committee in charge of the study in France.

Study settings

This study was conducted in 24 ICUs in France and Belgium between August 2020 and June 2022. All were members of the AfterROSC Network, which is dedicated to the promotion and development of clinical research and education regarding post-cardiac arrest care.

Ethics

Information about the study was delivered to each patient’s relatives. When no relative was available, emergency inclusion was allowed according to French law. Patients included without an available relative were informed as soon as they regained competence. If they subsequently declined to participate, they were removed from the database. The research protocol (available with the full text of this article) was approved by the appropriate ethical committees (2019-A01378-49; CPP-SMIV 190901) and French data-protection authorities, according to the principles of the Declaration of Helsinki and its amendments. The analysis and reporting for this study were conducted in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement [14].

Study population

All patients admitted to the participating centers during the study period were screened for participation. Patients were eligible if they were 18 years or older, if they were admitted to the ICU after OHCA, and if they remained comatose at admission (defined by a Glasgow Coma Score (GCS) equal to or lower than 8) despite ROSC. In patients who had been sedated before ICU admission, the GCS determined by the emergency physician just before sedation was used. Non-inclusion criteria were in-hospital cardiac arrest, traumatic cardiac arrest, patient under guardianship, and previous inclusion in the AfterROSC1 study. Furthermore, we only included in the analysis patients for whom all the Utstein style criteria and the main endpoint (modified Rankin scale on day 90) were available.

Score determination

For each patient, all components of the Utstein style criteria were recorded and combined into a dedicated score [15]. According to reference [15] and to a previous study with the same methodology [13], the following variables were incorporated into the “Utstein style” score: age, gender, cardiac/non-cardiac cause of arrest, presence of a bystander, bystander cardiopulmonary resuscitation (CPR), location (home/other), occurrence of cardiac arrest (CA) before Emergency Medical System (EMS) arrival, shockable/non-shockable rhythm, and time between CA and EMS arrival.

Similarly, for each patient, individual scores were calculated according to published data. To select scores retained in the present analysis, we performed a narrative review of observational studies published from database inception (1947) until September 2019 that included non-traumatic OHCA patients. We included studies that reported both early prognostic scores (including prehospital and in-hospital variables) and patient outcomes, which included early mortality (within 24 h after emergency department admission), survival to hospital admission, survival to hospital discharge, and functional outcome at hospital discharge.

The following databases were searched: PubMed, Embase, Google Scholar, and Web of Science. The search strategies, adapted for each database, included medical subject headings and keywords for “heart arrest, ventricular fibrillation, resuscitation, pulseless electrical activity, asystole” combined using the Boolean operator AND with a comprehensive range of search terms for prognostic score, including “score, early determination of prognosis.”

All risk scores calculable upon admission to hospital and predicting patients’ outcome after OHCA were retained in the analysis. Eight scores were selected: CAHP [9] and its simplified [10] and modified versions [11], OHCA [11], CREST [12], C-GRAPH [14], TTM [15], CAST [16], and NULL-PLEASE [17]. Post hoc, we added two scores, rCAST [16] and MIRACLE2 [17], published during the study period. To facilitate comparison, the variables involved in the calculation of these scores are listed in Additional file 1: Table S1.

Data collection

All data were collected by a dedicated study nurse or investigator in each participating center. The following variables were collected: baseline clinical data and comorbidities; characteristics of cardiac arrest and resuscitation; clinical and biological characteristics at ICU admission; treatments delivered in the ICU; length of stay (LOS) in ICU; invasive mechanical ventilation duration; functional and vital status at ICU discharge; and functional and vital status at hospital discharge. Post-resuscitation shock (PRS) was recorded at ICU admission and was defined as a systolic blood pressure below 90 mm Hg for at least 30 min with impaired end-organ perfusion (cool extremities, mottling, or urine output < 30 mL/h), requiring norepinephrine and/or epinephrine intravenous infusion [18]. The last neurological evaluation was performed on day 90 using the modified Rankin scale [19].

Outcome measures

The neurological outcome was scored using the level reached on the modified Rankin scale (mRS) [20] at day 90, assessed by a research nurse during a telephone interview. The main endpoint was a favorable outcome at day 90, defined as an mRS level of 0 (no symptoms), 1 (no significant disability), 2 (slight disability) or 3 (moderate disability), as recommended in the guidelines [21].
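As an illustration only, the dichotomization of the endpoint can be sketched as follows. This is not the study's analysis code; the function name `is_favorable_outcome` is hypothetical, and the sketch assumes the standard 0 to 6 coding of the mRS, in which 6 denotes death.

```python
def is_favorable_outcome(mrs_day90: int) -> bool:
    """Return True when the day-90 mRS level is 0-3 (favorable outcome).

    Assumes the standard mRS coding: 0 = no symptoms ... 5 = severe
    disability, 6 = death. Levels 4-6 are counted as unfavorable.
    """
    if not 0 <= mrs_day90 <= 6:
        raise ValueError("mRS must be an integer between 0 and 6")
    return mrs_day90 <= 3


# Hypothetical example values, not study data
example_levels = [0, 3, 4, 6]
example_flags = [is_favorable_outcome(level) for level in example_levels]
```

With this coding, the threshold "mRS < 4" and the list "0, 1, 2 or 3" describe the same set of patients.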

Sample size

Using existing data from a large and comprehensive registry of cardiac arrests admitted to the intensive care unit in the Greater Paris area [2], with the same inclusion criteria as the study described here, a multivariable logistic regression integrating the Utstein criteria yields a receiver operating characteristic (ROC) curve with an estimated area of 0.85 [13]. The prediction of the CAHP score has been described as having an area under the curve of 0.93 [10]. With a type I error risk of 0.05, a power of 0.90, and an expected difference in area under the curve of 0.85 versus 0.93, it was necessary to include 574 patients. According to previous data, we anticipated a favorable functional outcome rate of 20% for patients included at ICU admission [2]. Since the endpoint was based on a telephone interview at 90 days, and considering the risk of about 20% of missing responses on the modified Rankin endpoint at 90 days, a total of 597 patients were required [22].

Statistical analysis

We used descriptive statistics to summarize categorical variables as proportions, and continuous variables as mean with standard deviation or median with interquartile range for normal and non-normal distributions, respectively. Proportions were compared using Pearson’s chi-squared test (or Fisher’s exact test, if appropriate), and continuous variables using a t-test (or the Wilcoxon–Mann–Whitney rank-sum test).

The discrimination abilities of the prognostication scores were assessed using ROC analysis and quantified using the area under the ROC curve (AUC). The AUC values were compared in a pairwise manner using the method of DeLong et al. [23]. The calibration of the prognostication scores was assessed using the Hosmer–Lemeshow test. Because the Hosmer–Lemeshow test has low power, we completed the assessment of calibration with calibration belts, which plot expected against observed outcomes for each score, with the related P-value, using the calibration belt function in Stata. In the absence of a dedicated metric for the balance between discrimination ability and simplicity of determination, we added the ratio between the AUC and the number of items for each score. We plotted a decision curve analysis for the Utstein style criteria score and the other scores [24].
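To make the two discrimination metrics concrete, the following minimal sketch computes an AUC as the Mann–Whitney concordance probability, and the AUC-per-item ratio described above. It is an illustration under stated assumptions and not the Stata code used in the study: the function names are hypothetical, the outcome is assumed to be coded 1 for the event that higher score values are meant to predict, and the example values are invented.

```python
from typing import Sequence


def auc_mann_whitney(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """AUC as the probability that a randomly chosen patient with the event
    (outcome == 1) has a higher score than a randomly chosen patient without
    it (outcome == 0); tied scores count for one half."""
    with_event = [s for s, y in zip(scores, outcomes) if y == 1]
    without_event = [s for s, y in zip(scores, outcomes) if y == 0]
    if not with_event or not without_event:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for s_event in with_event:
        for s_no_event in without_event:
            if s_event > s_no_event:
                concordant += 1.0
            elif s_event == s_no_event:
                concordant += 0.5
    return concordant / (len(with_event) * len(without_event))


def auc_per_item(auc: float, n_items: int) -> float:
    """Ratio between the AUC and the number of items in the score."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return auc / n_items


# Invented example: four patients, two with the event
example_auc = auc_mann_whitney([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The quadratic pairwise loop is chosen for readability; a rank-based formula gives the same value more efficiently on large samples. The DeLong comparison of two correlated AUCs is not reproduced here.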

A first sensitivity analysis was performed to determine the area under the ROC curve (AUROC) of each score after handling missing data by multiple imputation using chained equations [25]. It was carried out on the dataset restricted to patients with an available day-90 functional outcome (primary outcome) and was based on M = 10 imputed datasets. A second sensitivity analysis was performed, restricted to non-cardiac causes of cardiac arrest at ICU admission.
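The text does not state how estimates from the M imputed datasets were combined. A common choice is Rubin's rules, sketched below as an assumption rather than as the method actually used; the function name `pool_rubin` and the example values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float],
               std_errors: Sequence[float]) -> Tuple[float, float]:
    """Pool M per-imputation estimates with Rubin's rules.

    Returns (pooled estimate, pooled standard error), where the total
    variance is W + (1 + 1/M) * B, with W the mean within-imputation
    variance and B the between-imputation variance of the estimates.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two estimates with matching standard errors")
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])
    between = variance(estimates)  # sample variance, denominator M - 1
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Invented example with M = 3 imputations (the study used M = 10)
pooled_auc, pooled_se = pool_rubin([0.80, 0.82, 0.84], [0.02, 0.02, 0.02])
```

The pooled standard error is larger than any single-imputation standard error whenever the estimates differ across imputations, which is how the uncertainty due to missing data is carried into the final interval.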

All tests were two-sided, with a P-value of < 0.05 considered significant. Analyses were performed using Stata/SE 14.2 (StataCorp, College Station, TX, USA).

Results

Baseline characteristics

During the study period, 907 patients were screened for participation, and 658 were retained in the analysis (Fig. 1). Baseline characteristics and outcomes are described in Table 1. Patients were mostly male (72%) and collapsed at home (64%) in the presence of a witness (86%) who performed bystander CPR in 68% of cases.

Fig. 1 Study flowchart

Table 1 Characteristics of the study population

Functional outcome and mortality

Survival at ICU discharge and at day 90 was 38%, with a favorable functional outcome (mRS < 4) at day 90 observed in 37% of cases (Fig. 2).

Fig. 2 Distribution of mRS scores in each category at ICU discharge and day 90 follow-up

Discrimination, calibration and comparison of CAHP score to “Utstein style criteria score” (Table 2)

Table 2 Comparison (total sample size N = 658)

The CAHP score could be determined for 98.6% of patients. The AUROC for CAHP was 0.87 [0.84–0.90], which was significantly higher than that of the reference Utstein style score (0.79 [0.76–0.83]; P < 0.001). Calibration was acceptable according to the Hosmer–Lemeshow test and the calibration belt test (both P-values > 0.05).

Discrimination, calibration and comparison of other scores to “Utstein style criteria score” (Table 2)

The proportion of patients for whom it was possible to calculate each of the scores studied varied between 82.6% (CREST) and 98.6% (sCAHP and mCAHP). According to the Hosmer–Lemeshow test, calibration was acceptable for all scores except MIRACLE2 (P = 0.03).

According to the calibration belt test, calibration was acceptable for all scores except CREST and C-GRAPH (P-values of 0.02 and 0.01, respectively). Calibration belts are depicted in Fig. 3.
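For readers unfamiliar with the Hosmer–Lemeshow approach, the statistic behind the test can be sketched as follows. This is a simplified illustration and not the Stata implementation used in the study: it assumes that each score has first been converted to a predicted probability of the outcome (for example by logistic regression), the function name is hypothetical, and the example values are invented.

```python
from typing import Sequence


def hosmer_lemeshow_statistic(pred_probs: Sequence[float],
                              outcomes: Sequence[int],
                              n_groups: int = 10) -> float:
    """Hosmer-Lemeshow chi-square statistic.

    Patients are sorted by predicted probability and split into n_groups
    groups of (nearly) equal size; observed and expected event counts are
    then compared within each group.
    """
    pairs = sorted(zip(pred_probs, outcomes), key=lambda pair: pair[0])
    n = len(pairs)
    if n < n_groups:
        raise ValueError("fewer patients than groups")
    statistic = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        n_g = len(group)
        expected = sum(p for p, _ in group)
        observed = sum(y for _, y in group)
        mean_p = expected / n_g
        denominator = n_g * mean_p * (1 - mean_p)
        if denominator == 0:
            raise ValueError("a group has mean predicted probability 0 or 1")
        statistic += (observed - expected) ** 2 / denominator
    return statistic


# Invented example with two groups of two patients
example_hl = hosmer_lemeshow_statistic([0.2, 0.2, 0.8, 0.8], [0, 0, 1, 1], n_groups=2)
```

The statistic is conventionally compared with a chi-square distribution with n_groups − 2 degrees of freedom; a small P-value suggests miscalibration. The P-value computation is left out to keep the sketch free of external dependencies.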

Fig. 3 Calibration belts

Comparing AUROCs, the three best-performing scores were TTM (0.88 [0.86–0.92]), CAHP (0.87 [0.84–0.90]), and mCAHP (0.86 [0.83–0.89]), while the three worst-performing scores were C-GRAPH (0.76 [0.71–0.80]), CREST (0.79 [0.75–0.83]), and NULL-PLEASE (0.81 [0.77–0.84]). A comparison of the respective AUROCs is depicted in Fig. 4. All scores showed significantly higher AUROC values (P < 0.05) than the Utstein style “score” except for CREST (P = 0.28), NULL-PLEASE (P = 0.20) and rCAST (P = 0.16). For each score, the added value of each component (total AUROC/number of items) appears in Table 2. AUROC values for each score after multiple imputation are available in Additional file 2: Table S2.

Fig. 4 ROC curves of scores included in the analysis

Clinical utility

Decision curve analysis is available as Additional file 4: Fig. S1.
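A decision curve plots the net benefit of a prediction rule across a range of threshold probabilities. The sketch below shows the standard net benefit formula, TP/n − (FP/n) × pt/(1 − pt), together with the "treat all" reference strategy. It is illustrative only: the function names are hypothetical, the inputs are assumed to be predicted probabilities derived from a score, and the example values are invented rather than taken from the study.

```python
from typing import Sequence


def net_benefit(pred_probs: Sequence[float],
                outcomes: Sequence[int],
                threshold: float) -> float:
    """Net benefit of classifying as positive every patient whose predicted
    probability is at or above the threshold probability."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(outcomes)
    true_pos = sum(1 for p, y in zip(pred_probs, outcomes) if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in zip(pred_probs, outcomes) if p >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1 - threshold)


def net_benefit_treat_all(outcomes: Sequence[int], threshold: float) -> float:
    """Net benefit of the reference strategy that classifies everyone as positive."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Invented example: four patients, two with the event
example_probs = [0.9, 0.8, 0.3, 0.1]
example_outcomes = [1, 0, 1, 0]
example_nb = net_benefit(example_probs, example_outcomes, threshold=0.2)
```

Evaluating `net_benefit` over a grid of thresholds for each score, alongside the "treat all" and "treat none" (net benefit of zero) strategies, yields the curves of a decision curve analysis.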

Performances in patients with a non-cardiac cause of arrest (Additional file 3: Table S3)

In the subgroup of patients with a non-cardiac cause of arrest (n = 233), the AUROC of the Utstein style score was 0.75 [0.67–0.83]. The predictive values of CREST and NULL-PLEASE could not be determined because these scores are not usable in this population. AUROCs of the other scores ranged from 0.59 [0.48–0.70] to 0.87 [0.81–0.93]. The CAHP, mCAHP, and TTM scores performed significantly better than the Utstein style score, whereas C-GRAPH performed significantly worse.

Discussion

In this prospective multicenter study, we found that most of the tested predictive scores performed at least as well as, and most often better than, the predictive score derived from the Utstein style. In the population studied, we observed that these predictive scores could be calculated on admission in nearly all patients, confirming that they could be used routinely.

These results should be considered in relation to the data available in this field. Isenschmid et al. [26] found that prediction scores dedicated to cardiac arrest cohorts performed better than general ICU scores and that a presumed asphyxial cause of cardiac arrest was associated with a drop in AUROC (0.71 vs 0.83). Potpara et al. [27], monitoring a cohort of 547 patients with OHCA, observed that the NULL-PLEASE score needed to be modified regarding pH and lactate values, as these two items were inconsistently measured in their cohort. Tsuchida et al. found, in a cohort of 236 OHCA patients, that the AUROCs of NULL-PLEASE, CAST, and rCAST were 0.874, 0.860, and 0.770, respectively [28]. Recently, Blatter et al. [29], observing 415 patients, found that the OHCA, CAHP, APACHE II, and SAPS II scores performed similarly in predicting poor neurological outcome at 2 years after cardiac arrest. Note that “general ICU scores,” such as APACHE II and SAPS II, can only be determined after 24 h, which limits their utility in the early phase. Heo et al. [30] compared 12 scores in a dataset of 1163 patients with OHCA. The PROLOGUE score showed better discrimination performance, without miscalibration. However, their study included a relatively homogeneous population of patients who received targeted temperature management, and the analysis was retrospective, with only 69% of the population eligible for score determination. To summarize, the literature is extensive but has recurrent limitations (mostly retrospective design and small sample size). Endpoint timing is another common limitation, as some of these studies used a short-term evaluation (ICU discharge) and others a long-term evaluation (up to 2 years after the index event), which is questionable.

Following the guidelines of the latest version of the Core Outcome Set for Cardiac Arrest [21], we used the mRS, which allows for a combined assessment of neurological outcome and vital status, at day 90. The Hosmer–Lemeshow test was significant for MIRACLE2, indicating miscalibration of this score. Calibration was not adequate for CREST and C-GRAPH. The TTM score could not be determined for more than 10% of patients. In the subgroup of non-cardiac OHCA, CREST and NULL-PLEASE could not be determined. Taken together, these findings leave CAHP (and its derived versions), OHCA, NULL-PLEASE and rCAST as candidates for universal adoption.

Early evaluation of patients’ prognoses at hospital or ICU admission is very useful to tailor interventions, especially neuroprotective interventions. However, it is likely that too few clinicians perform this assessment. Scores are good candidates for this assessment provided they have been scientifically validated. On the other hand, beyond AUC values, prognostic tests with similar discriminative power may differ in clinical utility: scores with very high specificity can be useful to rule in a specific outcome, whereas scores with very high sensitivity can be useful to rule out a specific outcome. However, our sample size did not allow us to determine the characteristics of each score at a predetermined specificity. Some retrospective studies have already found that scores could help identify subgroups of interest (such as for coronary angiogram [3] or temperature management [11]), and future interventional trials could use them as selection criteria. Guidelines could encourage their use by clinicians on a daily basis. As already highlighted, recent literature has evaluated composite scores with many limitations. Apart from prediction scores, other tools could be used. Deye et al. [31] found that early determination of PS100-B has an acceptable AUROC (AUC 0.83 [95% CI 0.78–0.88]), but this biomarker is rarely available on an emergency basis, which makes its use impossible when a decision must be made without delay. Other brain damage biomarkers exist that are potentially interesting in predicting outcomes in these patients, such as serum neurofilament light (NFL) [32] and Tau protein. However, their half-lives are too long for their use in this early phase, and preliminary data have indicated low performance (AUROC 0.58 [0.48–0.69] for Tau protein [33]).

A quantitative evaluation of the pupillary light reflex may also be a valuable tool, but it is not widely available at this time [34].

To be effective, a risk score needs to be accurate, well calibrated, and easy to employ in routine practice. Whereas several metrics exist to measure usability in commercial areas, there is no equivalent for medical scores. Indeed, effectiveness is defined by discrimination and calibration, whereas efficiency and satisfaction cannot be determined with already validated tools. To help readers, we chose to determine the ratio of added AUROC per item included in the score to reflect the added value of each item in the score. However, this metric did not consider completeness or ease of determination for each value.

Our study has several strengths. First, it is the largest study to date to evaluate prognosis scores. Second, we compared a very wide range of these scores in a single, prospective, and dedicated study. Third, we highlight that prognostication can be performed relatively easily at the bedside.

Our study must be interpreted within its own limits. We did not evaluate all the available scores, either because we did not capture the data required for their determination [35,36,37] or because they were developed in a specific context of care [38,39,40]. In addition, some scores included dynamic values, such as a vasopressor dose [41]. Finally, our sample size, although larger than that used in previous studies, could lack the power to detect small differences, which motivated recruitment in AfterROSC2 (NCT05606809).

Conclusion

In patients admitted to intensive care after a cardiac arrest, most of the scores available for evaluating the subsequent prognosis perform better than the usual Utstein criteria. Some of these scores performed better than others, and calibration was unacceptable for some of them. Our results show that some scores (CAHP, sCAHP, mCAHP, OHCA, rCAST) have superior performance, and their ease and speed of determination should encourage their use.