Take-home message

In a prospective international trial, blinded and standardised qualitative, manual and automated quantitative assessments of CT reliably predicted poor functional outcome in unconscious patients after out-of-hospital cardiac arrest.

Introduction

According to the European Resuscitation Council (ERC) and the European Society of Intensive Care Medicine (ESICM), “diffuse and extensive anoxic injury” on neuroimaging is predictive of poor functional outcome after cardiac arrest [1]. Head computed tomography (CT) is widely available and is frequently used for neuroprognostication [2,3,4]. Recent meta-analyses conclude that the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) level of evidence for CT to predict outcome after cardiac arrest is very low [5,6,7]. Signs of “diffuse and extensive anoxic injury”, seen as a reduced differentiation between grey and white matter and/or effacement of the cerebral sulci on CT, correlate well with elevated levels of neuronal injury markers [8] and histopathological severity of hypoxic–ischaemic encephalopathy (HIE) [9].

In clinical practice, CTs are usually assessed qualitatively using a non-standardised approach [10]. Some specialised centres use non-standardised quantitative methods such as the grey–white-matter ratio (GWR), placing regions of interest at the basal ganglia and/or in (sub)cortical regions to quantify oedema [7]. Manual assessments carry the risk of interrater variability, and standardisation and/or automation may be necessary to ensure a safe translation from research to clinical routine [11,12,13,14]. In stroke imaging, automated quantification of non-contrast CTs is routinely used [15]. This has not yet been achieved for neuroprognostication, and only a few single-centre retrospective studies with automated quantification of GWR after cardiac arrest have been published [16, 17].

Based on our retrospective studies with adult out-of-hospital cardiac arrest patients, we established Standard Operating Procedures (SOPs) for qualitative (visual interpretation) and quantitative (GWR) CT assessments for clinical use [13, 18,19,20]. In line with the previous studies [18, 19, 21,22,23,24], our recent pilot study from the Target Temperature Management after Out-of-hospital Cardiac Arrest (TTM)-trial confirmed that GWR was most accurate at the basal ganglia level where a cutoff < 1.10 predicted poor functional outcome with 100% specificity [18].

We here present the results of a prospective observational substudy of an international multicentre trial in which we applied our previously published criteria for standardised qualitative and quantitative CT assessment as well as an atlas-based automated GWR method for neuroprognostication after cardiac arrest [14]. Our main hypotheses were that “Definite signs of severe HIE”, by standardised assessment and manually or automatically obtained GWR < 1.10, would predict poor functional outcome without false positives in CTs performed 48 h–7 days after cardiac arrest [13].

Method

Study design

This prospective international multicentre observational study (Clinicaltrials.gov NCT03913065) was a substudy of the Targeted Hypothermia versus Targeted Normothermia after out-of-hospital cardiac arrest (TTM2) trial [25] (Clinicaltrials.gov NCT02908308). The design and statistical analysis plan of this substudy have previously been published [13].

Patient selection and ethics

Between November 2017 and January 2020, the TTM2-trial consecutively screened unconscious patients ≥ 18 years admitted to hospital after out-of-hospital cardiac arrest of a presumed cardiac or unknown cause [25]. Approval was waived/obtained from the appropriate ethics committees. The trial was performed in accordance with the ethical standards laid down in the Declaration of Helsinki and its later amendments [26]. Consent was obtained from legal representatives and/or patients according to local legislation.

Thirteen sites from Sweden, Germany, France, and the United Kingdom that routinely use CT for neuroprognostication in patients unconscious > 48 h post-arrest participated (electronic supplementary material [ESM] Table E1). Unconsciousness was defined as not obeying verbal commands. Included patients were managed according to the TTM2-trial protocol regarding randomisation, clinical management, neurological prognostication, decisions on withdrawal of life-sustaining therapy, and follow-up [25, 27,28,29].

Data collection and technical requirements

All types of scanners and software were permitted. Technical prerequisites were availability of axial slices of 4–5 mm thickness obtained with a tube voltage of 120 kV.

CT assessments

CTs with artefacts or structural lesions interfering with reliable evaluation were excluded. Five radiologists and two neurologists from four countries, with 3–15 years of experience of CTs after cardiac arrest, evaluated images individually using a virtual private network (VPN) secured platform (Human Observer Net) [30] (ESM Table E2). Raters were blinded to all information except the patient's age: since brain volume may decrease with age, this information was considered necessary for assessing the extent of cerebral oedema. The raters received approximately 30 min of training in the software used for evaluations, but no training related to the actual rating of images. Raters were encouraged to have the SOP accessible during ratings.

Standardised operating procedures for qualitative assessments

Axial images were evaluated at four levels: brain stem and cerebellum, basal ganglia, frontoparietal cortex at corona radiata level, and high convexity cortex (ESM Fig. E1A) [13]. The raters confirmed or declined the following: “Are there definite signs of severe HIE defined as complete or near complete loss of grey–white-matter differentiation at the basal ganglia level and in the frontoparietal cortex with additional evidence of brain swelling/sulcal effacement?”

Standardised operating procedures for quantitative assessments

Circular 0.1 cm² regions of interest were manually placed at the basal ganglia level in the putamen, the caudate nucleus (caput), the posterior limb of the internal capsule, and the genu of the corpus callosum bilaterally (ESM Fig. E1B) [13].

Automated measurements

The software pipeline for automated GWR determinations has been published [17]. Images were co-registered to a freely available standard brain atlas and mean Hounsfield Units were quantified in each individual CT space using inversely transferred probabilistic tissue maps [31, 32] (ESM Figs. E2–E3).
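As a rough illustration of how a probabilistic tissue map can yield a mean radiodensity, the sketch below computes a probability-weighted mean of Hounsfield Units over a handful of voxels. Whether the published pipeline weights voxels in this way or thresholds the maps is not stated here, so the weighting scheme, the function name `weighted_mean_hu` and all voxel values are assumptions made for illustration only; co-registration is not covered.

```python
def weighted_mean_hu(hu_values, tissue_probabilities):
    """Probability-weighted mean radiodensity (HU) over a set of voxels.

    Assumption: each voxel contributes in proportion to its tissue
    probability. The published pipeline may do this differently.
    """
    if len(hu_values) != len(tissue_probabilities):
        raise ValueError("one probability is required per voxel")
    total_weight = sum(tissue_probabilities)
    if total_weight <= 0:
        raise ValueError("the tissue map has no weight in this region")
    weighted_sum = sum(h * p for h, p in zip(hu_values, tissue_probabilities))
    return weighted_sum / total_weight


# Hypothetical voxel values, not taken from the study data.
putamen_hu = weighted_mean_hu([36.0, 34.0, 30.0], [1.0, 0.5, 0.0])
capsule_hu = weighted_mean_hu([28.0, 26.0], [0.5, 0.5])
automated_gwr = putamen_hu / capsule_hu
```

A voxel with probability 0 does not influence the mean, which is one reason a quality check of the co-registration matters before the resulting ratio is interpreted.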

GWR calculations

GWR was calculated as the sum of the radiodensity of the grey matter regions of interest divided by the sum of the radiodensity of the white matter regions of interest (ESM Fig. E1B). The GWR-8 model included all eight regions of interest. The GWR-4 model and the automated GWR only included the measurements in the putamen and in the posterior limb of the internal capsule.
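The calculation described above can be sketched as follows. The region keys, the helper name `gwr` and the Hounsfield Unit values are hypothetical and serve only to show how the GWR-8 and GWR-4 models differ in the regions they include.

```python
# Keys are (region, side). All HU values are invented for illustration.
roi_hu = {
    ("putamen", "L"): 36.0, ("putamen", "R"): 35.0,
    ("caudate", "L"): 37.0, ("caudate", "R"): 36.0,
    ("internal_capsule", "L"): 28.0, ("internal_capsule", "R"): 27.0,
    ("corpus_callosum", "L"): 27.0, ("corpus_callosum", "R"): 28.0,
}


def gwr(hu, grey_regions, white_regions):
    """Sum of grey matter HU divided by sum of white matter HU, both sides."""
    grey = sum(hu[(region, side)] for region in grey_regions for side in "LR")
    white = sum(hu[(region, side)] for region in white_regions for side in "LR")
    return grey / white


# GWR-8: all eight regions of interest.
gwr_8 = gwr(roi_hu, ["putamen", "caudate"], ["internal_capsule", "corpus_callosum"])
# GWR-4: putamen and posterior limb of the internal capsule only.
gwr_4 = gwr(roi_hu, ["putamen"], ["internal_capsule"])

CUTOFF = 1.10  # a value below the cutoff counts as a positive test
gwr_8_positive = gwr_8 < CUTOFF
```

With these invented values both ratios lie well above 1.10, so neither model would flag the scan.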

Outcome assessment

Functional outcome by the modified Rankin Scale (mRS) was assessed by a trained outcome assessor at a structured face-to-face or telephone follow-up, at six months after randomisation. Functional outcome was dichotomised into good (mRS 0–3) and poor (mRS 4–6) [25, 27, 33].

Statistical analysis

The results are reported according to the Standards of Reporting Diagnostic Studies [34] and the Standards for Studies of Neurological Prognostication in Comatose Survivors of Cardiac Arrest [35]. Continuous variables are reported as median (interquartile range, IQR) or mean (± standard deviation) and categorical variables as numbers (percentages). Sensitivities and specificities for prediction of poor functional outcome, and negative and positive predictive values, are presented with 95% confidence intervals (CI) calculated with Wilson's method. Results from the manual standardised assessments are presented separately for each rater and as median (min–max) of all raters. For GWR, we decided to apply the cutoff < 1.10, since this yielded a 100% specificity for poor outcome prediction in our pilot study [18]. In addition, we analysed the pre-specified GWR cutoff < 1.15. The overall prognostic performance of GWR for good versus poor functional outcome was assessed by the area under the receiver-operating characteristic curve (AUC) with 95% CI. AUC was classified as: < 0.60 = failure, 0.60–0.70 = poor, 0.70–0.80 = fair, 0.80–0.90 = good, and 0.90–1.00 = excellent [36]. The mean AUC for manual GWR was compared to the automated GWR using DeLong's test.
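A minimal sketch of the Wilson score interval and the accuracy measures named above is given below. The function names and the 2 × 2 counts are assumptions: the counts were chosen only to be compatible with the proportions of the cohort, they are not taken from the study tables, and the study itself used SPSS and R rather than this code.

```python
import math


def wilson_ci(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def accuracy_measures(tp, fp, fn, tn):
    """Point estimates with Wilson intervals; None if a denominator is zero."""
    def estimate(k, n):
        return (k / n, wilson_ci(k, n)) if n > 0 else None
    return {
        "sensitivity": estimate(tp, tp + fn),
        "specificity": estimate(tn, tn + fp),
        "ppv": estimate(tp, tp + fp),
        "npv": estimate(tn, tn + fn),
    }


# Hypothetical counts: 105 poor outcome and 35 good outcome patients.
measures = accuracy_measures(tp=43, fp=0, fn=62, tn=35)
spec_estimate, (spec_low, spec_high) = measures["specificity"]
```

The Wilson interval stays within 0 and 1 even when the observed proportion is exactly 100%, which is the situation that arises whenever no false positives are observed.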

The interrater agreement between the blinded raters was calculated with Fleiss’ kappa. Intra-rater agreements for 20% of the images re-evaluated by each rater (identical for all raters) were analysed with Cohen’s kappa. The strength of the agreement was classified as kappa (κ); < 0.20 = poor, 0.21–0.40 = fair, 0.41–0.60 = moderate, 0.61–0.80 = good, and 0.81–1.00 = very good for Fleiss’ and Cohen’s kappa [37,38,39].
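For readers unfamiliar with Fleiss' kappa, the following sketch implements the standard formula for a table of rating counts and maps the result to the agreement bands listed above. The function names and the toy rating table are hypothetical, and the handling of values that fall exactly on a band boundary is an assumption, since the published bands leave small gaps.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters placing scan i in category j."""
    if not counts:
        raise ValueError("at least one subject is required")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number (>= 2) of ratings")
    n_subjects = len(counts)
    n_categories = len(counts[0])
    # Observed agreement per subject, then averaged over subjects.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects
    # Agreement expected by chance from the overall category proportions.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


def classify_kappa(kappa):
    """Agreement band; boundary values are assigned to the lower band (assumption)."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Toy table: 4 scans, 7 raters, categories = (severe HIE, no severe HIE).
toy_counts = [[7, 0], [0, 7], [5, 2], [1, 6]]
toy_kappa = fleiss_kappa(toy_counts)
toy_band = classify_kappa(toy_kappa)
```

Cohen's kappa for the intra-rater comparison follows the same observed-versus-chance logic for two sets of ratings and is not reproduced here.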

CTs performed 48 h–7 days post-arrest were included in our prospective cohort. To assess the accuracy of automated GWR, we included all patients with CTs performed ≤ 7 days in a post hoc cohort. We assessed the impact of timing on prognostic accuracies for automated GWR < 1.10 in time-windows < 2 h, 2–6 h, > 6–48 h, > 48–96 h, and > 96–168 h after cardiac arrest.
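The time-window grouping can be expressed as a small lookup, sketched below. The function name `time_window` is hypothetical, and the assignment of scans performed at exactly 2 h, 6 h, 48 h or 96 h is an assumption, because the window labels do not settle every boundary.

```python
def time_window(hours_after_arrest):
    """Return the analysis window for a CT performed within 7 days (168 h)."""
    if hours_after_arrest < 0 or hours_after_arrest > 168:
        raise ValueError("CT must be performed within 0-168 h after cardiac arrest")
    if hours_after_arrest < 2:
        return "<2 h"
    if hours_after_arrest <= 6:
        return "2-6 h"
    if hours_after_arrest <= 48:
        return ">6-48 h"
    if hours_after_arrest <= 96:
        return ">48-96 h"
    return ">96-168 h"


# A scan at 84 h after the arrest (illustrative value).
window_at_84_h = time_window(84)
```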

Sensitivities and specificities with 95% CI are also presented separately for patients randomised to hypothermia and normothermia within the prospective cohort for each rater for the standardised qualitative assessment, GWR-8 and GWR-4 at cutoff < 1.10 and for automated GWR < 1.10.

We further examined whether "severe HIE" by qualitative rating and GWR-8 cutoff < 1.10 evaluated by four or more raters corresponded with pathological findings of routine prognostic methods.

Statistical analyses were performed with IBM SPSS Statistics (SPSS Statistics for Windows, Version 29.0.0.0 Armonk, NY: IBM Corp) and R, version 4.0.4 (The R Foundation for Statistical Computing).

Results

Patient demographics

387/635 (60.9%) patients underwent at least one head CT ≤ 7 days after cardiac arrest. Forty-six patients were excluded due to unmet technical requirements or the presence of other intracranial pathologies (Fig. 1). N = 140 patients were unconscious and examined with CT 48 h–7 days after cardiac arrest, thereby meeting the inclusion criteria for our prospective cohort. Together with the prospective cohort, a further N = 201 patients examined ≤ 48 h post-arrest were included in a post hoc analysis using automated GWR [13]. The median age of our prospective cohort was 68 years, 76% were male, and the median time from cardiac arrest to CT was 84 h (IQR 66–109) (Table 1). Poor functional outcome was more frequent among included than excluded patients (75% versus 50%). Prognostic accuracies for predicting poor functional outcome by other routine prognostic tests used within the prospective cohort are displayed in ESM Table E3. Life-sustaining therapy was withdrawn due to poor neurological prognosis in 66/140 (47%), at median 121 h (IQR 98–156) post-arrest.

Fig. 1

Flowchart of patient selection and exclusion. The prospective cohort included N = 140 unconscious patients not obeying verbal commands at 48 h post-arrest examined with CT 48 h–7 days post-arrest. The post hoc cohort included all patients examined with CT within 7 days post-arrest (N = 341 patients). For the N = 42 patients examined with ≥ 2 CT examinations, only the latest examination within the 7-day time range was included. CT head computed tomography, h hours, d days

Table 1 Characteristics of included and excluded patients

Prediction of functional outcome

Blinded qualitative assessment

The standardised qualitative assessment of “definite signs of severe HIE” predicted poor outcome with 100% specificity and 100% positive predictive value in all seven raters (980 ratings overall) (Table 2, ESM Table E4). The median sensitivity of all raters was 37% (min–max 11–61%). Inter-rater agreement between raters was moderate (κ = 0.6) (ESM Table E5). Intra-rater agreement ranged from fair (κ = 0.33) to very good (κ = 0.93) (ESM Table E6).

Table 2 Prediction of poor functional outcome by individual raters

Blinded quantitative assessment

Median GWR was significantly lower in poor outcome patients compared to good outcome patients, p < 0.0001 (ESM Table E7). In both models, GWR < 1.10 predicted poor functional outcome with 100% specificity and 100% positive predictive value in all 980 ratings (Table 2, ESM Table E4, ESM Fig. E4A–B). A cutoff < 1.15 yielded two false-positive predictions of poor outcome for GWR-8 (median specificity 100%, min–max 94–100%) and five false-positive predictions for GWR-4 (median specificity 100%, min–max 91–100%) (ESM Table E4, ESM Table E8, ESM Fig. E4A–B). Median sensitivities were higher for GWR-8 than GWR-4 at cutoff < 1.10 (39% versus 30%) and at cutoff < 1.15 (48% versus 38%). Median AUC of all seven raters was 0.86 for GWR-8 and 0.81 for GWR-4 (Fig. 2A–C).

Fig. 2

A–C Manual and automated GWR assessments for overall prediction of functional outcome. Comparison of manual and automated GWR for prediction of good versus poor functional outcome (modified Rankin Scale 0–3 versus 4–6) at 6 months post-arrest in N = 140 patients included in the prospective cohort. The area under the receiver-operating characteristics curve (AUC) is presented with 95% confidence intervals. Results are presented separately for all raters for the manual GWR-4 (A) and the manual GWR-8 (B). C displays the prognostic accuracy for the automated measurements versus the mean of the seven raters’ manual assessments of GWR-4 and GWR-8, respectively. In C, the AUC for automated GWR did not differ significantly from that of mean manual GWR-8 (p = 0.10) or mean manual GWR-4 (p = 0.84)

The interrater agreement for GWR-8 ranged from good (κ = 0.72) for < 1.10 to very good (κ = 0.83) for < 1.15 (ESM Table E5). The interrater agreement for GWR-4 was moderate (κ = 0.57) at both cutoffs. Examples of CTs with low, moderate, and high interrater agreement are found in the ESM Fig. E5A–C. Intra-rater agreements for GWR-8 ranged from moderate to very good (κ = 0.58–1) and from fair to good for GWR-4 (κ = 0.26–0.75) (ESM Table E9).

Effect of targeted temperature management

ESM Table E10 depicts prognostic accuracy measures for hypothermia and normothermia groups. We found no clinically relevant effect of temperature allocation on prognostic accuracies for prediction of poor functional outcome.

Congruence of HIE with routine prognostic methods

In patients with "severe HIE" in qualitative assessment or a GWR-8 < 1.10 diagnosed by the majority of raters, 96% of patients had high levels of neuron-specific enolase (NSE) (> 60 ng/ml), 44–53% had bilaterally absent somatosensory evoked potentials (SSEP) N20 potentials and 66–68% had a highly malignant electroencephalogram (EEG) (ESM Tables E11A–B). The lowest congruence was seen for bilaterally absent corneal and pupillary reflexes (29–37%).

Automated GWR

Automated GWR < 1.10 predicted poor outcome with 100% specificity (95% CI 90–100) and 41% sensitivity (95% CI 32–51) (Table 2, Fig. 3, ESM Table E12). AUC was 0.84 (95% CI 0.77–0.91) with no significant difference compared to the average of manually determined GWR-8 or GWR-4 (Fig. 2C). Automated GWR was significantly lower in poor outcome patients, regardless of temperature management, p < 0.001, ESM Fig. E6.

Fig. 3

Automated GWR for prediction of functional outcome. Automated GWR and functional outcome (modified Rankin Scale 0–3 versus 4–6) at 6 months post-arrest grouped by timing of CT acquisition in hours after cardiac arrest. Examinations are evaluated in the following time-windows after cardiac arrest: < 2 h: N = 67, > 2–6 h: N = 102, > 6–48 h: N = 32, > 48–96 h: N = 90, > 96–168 h: N = 50. Automated GWR was significantly lower in poor outcome patients than in good outcome patients at all timepoints ≥ 2 h post-arrest. ns not significant, **p ≤ 0.01, ***p ≤ 0.001, ****p ≤ 0.0001. Overall, automated GWR < 1.10 predicted poor outcome with 1 false-positive prediction at 2 h post-arrest, with an overall specificity of 99% (95% CI 96–100) and 30% sensitivity (95% CI 24–36). Exact prognostic accuracies and a contingency table for each timepoint in this figure can be seen in ESM Table E12. Median GWR values for good and poor outcome patients are displayed in ESM Table E13. The CT image and case description of the 1 false-positive patient are displayed in ESM Fig. E7

Post hoc analysis of automated GWR

GWR was significantly lower in poor outcome patients compared to good outcome patients at all timepoints except for CT examinations performed < 2 h post-arrest (ESM Table E13). In CTs obtained ≤ 7 days after cardiac arrest (N = 341), overall specificity of automated GWR < 1.10 was 99% (95% CI 96–100) (ESM Table E12). Poor outcome was incorrectly predicted in one patient, probably due to a lacunar infarction and enlarged perivascular spaces (ESM Fig. E7). Sensitivity for automated GWR < 1.10 increased gradually from 9% (95% CI 2–27%) for CTs performed < 2 h post-arrest to a peak of 48% (95% CI 37–59%) for examinations performed > 48–96 h (Fig. 3, ESM Table E12).

Discussion

In this prospective multicentre study, evaluating three different methods of diagnosing severe hypoxic ischaemic injury on CT for prediction of poor functional outcome after cardiac arrest, we validate pre-published standardised criteria and evaluate GWR cutoff < 1.10 for manual and automated assessments [13]. We conclude that CT is a highly specific prognostic tool for neuroprognostication, regardless of assessment method, with highest sensitivities for poor outcome prediction when performed 48–96 h post-arrest. GWR determination at the basal ganglia level with a cutoff < 1.10, performed either manually or automatically, offers a more objective measure of HIE with reduced interrater variability.

CT is a guideline-recommended predictor of outcome after cardiac arrest with very low quality of evidence [1, 5, 7, 40]. The main concerns raised by ERC/ESICM include the lack of multicentre validation and standardised assessments of both qualitative and quantitative methods [1]. Our study provides a framework that is easy to use in clinical practice and addresses several concerns raised in recent publications [5,6,7, 10, 41].

Our standardised qualitative criteria define signs of severe HIE as a “complete or nearly complete loss of grey-white matter differentiation in the basal ganglia and in the frontoparietal cortex with additional evidence of brain swelling/sulcal effacement” [13]. A visual interpretation according to a checklist with mandatory evaluation at several levels of the brain had to be completed before reaching a conclusion. This qualitative assessment predicted poor outcome with 0% false-positive rate in 980 blinded ratings overall. In line with the previous qualitative studies, sensitivities for individual raters ranged between 11 and 61% for imaging performed 48 h–7 days post-arrest [18, 20, 41].

Both the ERC/ESICM and the American Neurocritical Care Society recommendations use similar, undefined terminology to describe signs of severe HIE on CT: "diffuse", "extensive" or "bilaterally across vascular territories", with a "loss of grey-white-matter differentiation" [1, 5]. While our standardised qualitative criteria may offer a more precise definition of severe HIE than those given in the current guidelines, they achieved only moderate interrater reliability. The CT evaluation in our study was mostly performed by experienced raters (3–15 years with CTs of cardiac arrest patients). Rater experience may impact both sensitivity and specificity of CT evaluation using our SOP. This should be kept in mind when implementing our CT analysis in clinical routine. Refinements to improve interrater reliability are necessary and may include better standardisation of windowing during visual analysis, standards for decisions in cases of residual grey–white differentiation, and awareness of the effects of residual contrast agents. In contrast to clinical practice, our raters only had one CT available for analysis and did not have access to pre-cardiac arrest CTs. We plan a subsequent study using serial CTs to evaluate whether an analysis of changes in grey–white-matter differentiation and brain volume over time improves prognostic accuracy.

GWR is the only guideline-recommended method to quantify the extent of HIE on CT and can be applied with routine radiological software, but there is no consensus on the number, size, and exact location of regions of interest [1, 10, 22, 42,43,44]. Based on previous investigations and our retrospective pilot study, we chose to validate manually placed 0.1 cm² regions of interest at the basal ganglia level [18]. Importantly, we included the instruction to place the regions of interest in a subregion with a radiodensity representative of the entire anatomical target region and to avoid potential confounders (artefacts, calcifications, lacunar infarcts, etc.). We confirmed that both manual GWR models had a maximal specificity at GWR < 1.10, which is in accordance with the previous studies [6, 7, 18]. As expected, sensitivities increased at cutoff < 1.15 at the cost of a slightly decreased specificity. None of the false positives through quantitative measurements fulfilled criteria for "severe HIE" with the qualitative assessment, underlining the potential value of combining both approaches. As in our pilot study, GWR-8 was consistently superior to GWR-4 concerning prognostic accuracies and intra- and interrater agreements [18]. A possible explanation for the higher accuracy of GWR-8 is the reduction of noise due to the larger number of regions of interest [18, 45]. GWR-8 was superior to the qualitative assessment for some raters and the interrater reliability for GWR-8 was superior to that of qualitative assessments, highlighting a potential advantage of quantification. We presume that the interrater reliability of manual GWR could be further improved by applying stricter instructions for measurements within anatomical regions or using non-circular and/or larger regions of interest.

Automated atlas-based GWR measurements offer an alternative to manual measurements unaffected by interrater variability and could increase the availability of GWR for hospitals without on-site neuroradiologic expertise [17]. Only a few previous studies have evaluated automated GWR quantification, and they are limited by single-centre, retrospective designs, and early assessment of functional outcome [16, 17]. The prognostic accuracy of automated GWR < 1.10 in our prospective cohort was as good as the manually assessed GWR with 40% sensitivity at 100% specificity. This performance is also in the range of manual and automated GWR from CTs performed > 24 h post-arrest in the previous studies [18, 20, 24, 46], routine EEG, and SSEP [7]. Except for one study on early CTs, 1.10 was the lowest reported GWR cut-off with 100% specificity thus far [47]. Overall, automated GWR < 1.10 performed within 7 days post-arrest had one false-positive prediction of poor outcome in a patient with a subcortical low attenuating area close to the putamen, most likely an old lacunar infarction or perivascular space, but with intact overall grey–white-matter differentiation. Automated GWR relies on anatomic landmarks and its use must include a quality check for co-registration and exclusion of artefacts or acute brain pathologies potentially interfering with measurements [10]. Future studies on larger cohorts should investigate whether machine learning can predict outcome from CTs after cardiac arrest with superior accuracy compared to our human rater-based approach [48].

Data from our current and previous studies do not suggest that CT can predict good outcome/absence of severe HIE. Future studies, for example using analysis of serial CTs, should re-investigate this issue.

Our results on optimal timing support guideline recommendations [1, 5] that CTs performed 48–96 h post-arrest have higher sensitivity for predicting poor outcome than examinations performed within the first hours post-arrest [5, 17, 18, 20, 23, 24, 49]. Examinations performed on hospital admission are often routinely used to exclude cerebral causes of unconsciousness and may be too early to detect HIE for most patients. The increase in sensitivity within the first days corresponds with developing HIE. The higher sensitivity of later examinations is in line with previous observations and supports the notion that an optimal time window of a few days exists for neuroprognostic CTs [17]. We found no clinically relevant effect of temperature allocation on prognostic accuracies for prediction of poor functional outcome. When performed at an optimal timepoint and analysed using standardised interpretation, combining CT with other prognostic methods with higher sensitivities such as EEG or NSE could increase the number of correctly identified poor outcome patients.

Strengths and limitations

Strengths of our study include the prospective, multicentre design with standardised criteria for neuroprognostication and withdrawal of life-sustaining therapy and a structured assessment of functional outcome at 6 months. CTs were prospectively performed in unconscious patients at a timepoint clinically most relevant for neuroprognostication. Radiological assessments were blinded and performed by multiple raters from different countries according to a pre-published protocol using standardised radiological criteria and pre-defined cutoffs. A comparison with automated GWR within the same cohort further strengthens our results.

Our main limitation is imprecision due to sample size [5]. A substantial proportion of patients were examined before the pre-specified time point, reasonably as part of clinical practice, and thus reported as part of a post hoc cohort examined ≤ 7 days. Additional patients did not receive CT > 48 h, because they underwent magnetic resonance imaging rather than CT, used other prognostic methods or because CT could not be performed for logistical reasons.

In contrast to clinical practice, and in order to standardise the protocol within a clinical trial, our raters only had axial CT images available, performed qualitative and quantitative assessments separately, could not revise their ratings, and could not discuss their results with colleagues.

Patients included in the TTM2-trial had a presumed cardiac or unknown cause of cardiac arrest and results may differ from other causes of arrest [25]. The conservative approach to prognostication within the TTM2-trial was designed to limit the risk of self-fulfilling prophecies, reflected in the longer times to withdrawal of life-sustaining therapy in our prospective cohort [25, 28, 29]. Nonetheless, despite the blinded CT evaluations in this study, the risk of self-fulfilling prophecies cannot be entirely excluded, since local radiologists' CT reports were available to treating physicians. Our results should be validated in a cohort where withdrawal of treatment was not performed.

Conclusion

The combination of a structured qualitative assessment of severe HIE with a quantitative assessment at the basal ganglia level (GWR-8 or automated GWR < 1.10) allows the prediction of poor functional outcome after cardiac arrest with high specificity and moderate sensitivity. CT should be considered in patients unconscious later than 48 h after cardiac arrest using a multimodal approach to neuroprognostication. Automated GWR could help avoid errors compared to manual ratings and make head CT quantification more accessible.