Introduction

During the last decade, increased computational power and improved algorithms have led to a renaissance for machine learning as an alternative to traditional regression models for analysing large data sets. Machine learning has been found valuable in various clinical settings, such as interpretation of ECG (electrocardiography) patterns, detection of cardiac arrest in emergency calls or in the emergency department, prediction of outcome in traumatic brain injury and prediction of the need for critical care as an alternative to conventional triage and early warning scores [1,2,3,4,5]. It has also been suggested for mortality prediction in patients admitted to intensive care units (ICUs) [6].

Recently, machine learning models have been used to predict the outcome in out-of-hospital cardiac arrest (OHCA) cohorts with high accuracy early in the chain of resuscitation, where overall mortality is above 80% [7, 8], but these models are not applicable to patients admitted to ICUs after OHCA. Several factors are known to influence the overall outcome in the OHCA population, including patients’ age and comorbidities, cardiac arrest characteristics and status on admission [9,10,11,12,13,14,15,16]. Although each of these variables carries important information, none is taken into account in the currently recommended multimodal neurological prognostication algorithm [17, 18], as their independent predictive ability is limited. A number of prediction models have been developed using clinical variables available on hospital admission. Risk scores using logistic regression have been proposed and typically show moderate to good accuracy, including the “CAHP (Cardiac Arrest Hospital Prognosis) risk score” [19], the “OHCA risk score” [20] and a scoring system published by Aschauer et al. in 2014 [21]. So far, none of these models has been precise enough to be used for individual prediction after OHCA. The “TTM risk score”, based on data from the Target Temperature Management (TTM) trial and using ten independent predictors associated with a poor outcome including death at 6 months after OHCA, achieved excellent discrimination of outcome with an area under the receiver operating characteristic curve (AUC) of 0.818–0.842 [22]. With robust and accurate algorithms for early classification of illness severity and mortality risk, multimodal prognostication could hopefully be further improved to tailor therapy and interventions to individual patients. Such effects may apply only to subgroups of patients, which could be differentiated in future heterogeneous clinical trials.

Using the database from the TTM trial, we aimed to investigate whether an artificial neural network (ANN)—a supervised machine learning algorithm—could detect more complex dependencies between clinical variables available at hospital admission in OHCA survivors and provide early and reliable predictions of long-term functional outcome with even better accuracy than traditional regression models. We also wanted to investigate which part of the “chain of survival” contained the most predictive information, based on background, pre-hospital and admission-centred data. Finally, an attempt was made to detect any difference in treatment effect across risk classes of illness severity in the TTM trial.

Materials and methods

Study setting

We included all 939 patients enrolled in the TTM trial from 2010 to 2013 in 36 ICUs in Europe and Australia. The trial included comatose (Glasgow Coma Scale (GCS) ≤ 8) adults (≥ 18 years of age) with a sustained return of spontaneous circulation (ROSC) after successful resuscitation from OHCA of presumed cardiac cause. Patients were admitted to ICUs and randomised to TTM at 33 °C or 36 °C [23]. The trial protocol was approved by ethical committees in each participating country, and informed consent was waived or obtained from all participants or relatives according to national legislation, in line with the Helsinki Declaration [24]. Patient data were entered in an online electronic case record form and externally monitored. The results of the main trial were subjected to sensitivity analyses for time, study centre and other possible biases and have been elaborated in post hoc analyses and substudies. All have shown similar outcomes in both temperature groups [25,26,27,28]. Therefore, the pooled TTM data set was used for the present analysis.

Variables

Baseline comorbidities, demographics, pre-hospital data, arrest characteristics and physiological variables, as well as admission data, were systematically collected according to the Utstein criteria [29, 30] and categorised as background, pre-hospital and admission variables (Table 1). Time from cardiac arrest (CA) to initiation of basic life support (BLS; administered by bystanders or first responders) and advanced life support (ALS) was recorded. No-flow and low-flow times were defined as the time from CA to the start of cardiopulmonary resuscitation (CPR; BLS or ALS) and the time from the start of CPR to the return of spontaneous circulation (ROSC), respectively. Time to ROSC was defined as the time from CA to the first recorded time point of sustained (> 20 min) spontaneous circulation. “No flow” and “low flow” are often used to describe the circumstances of the CPR treatment, but from a clinical point of view, these terms are less intuitive than “bystander CPR”, “time to advanced CPR” and “time to ROSC”; therefore, two data sets were created. Data set A contained all variables plus “bystander CPR”, “time to advanced CPR” and “time to ROSC”, but not “no flow” and “low flow”. Data set B contained all variables plus “no flow” and “low flow”, but not “bystander CPR”, “time to advanced CPR” and “time to ROSC” (Table 1).
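As a hypothetical illustration of these timing definitions (function and variable names are ours, not from the trial database), the intervals reduce to simple arithmetic on recorded event times:

```python
# Sketch: deriving no-flow and low-flow intervals from recorded event times.
# All times are in minutes from cardiac arrest; names are illustrative only.

def no_flow(time_of_arrest: float, time_cpr_started: float) -> float:
    """Time from cardiac arrest to start of CPR (BLS or ALS)."""
    return time_cpr_started - time_of_arrest

def low_flow(time_cpr_started: float, time_of_rosc: float) -> float:
    """Time from start of CPR to return of spontaneous circulation."""
    return time_of_rosc - time_cpr_started

# Example: arrest at t = 0, bystander CPR at t = 2, ROSC at t = 25
print(no_flow(0, 2))    # 2
print(low_flow(2, 25))  # 23
```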

Table 1 Baseline characteristics stratified into good outcome (CPC 1–2) and poor outcome (CPC 3–5) after 6 months

Outcome

The main outcome of this study was functional outcome, including survival, at 180 days, using a dichotomised Cerebral Performance Category (CPC) scale where CPC 1–2 was categorised as a good functional outcome and CPC 3–5 as a poor functional outcome [31]. A good functional outcome (CPC 1–2) includes patients who are independent in activities of daily living but may have a minor disability. A poor functional outcome (CPC 3–5) includes patients who are dependent on others, in a coma or a vegetative state, or dead [32]. The CPC was graded at follow-up by a blinded assessor during a structured interview, face-to-face or by telephone [33].

Prediction models

We aimed to create two different prediction models: the best possible prediction model, which included the 54 input variables available on patient admission to intensive care (Table 2), and a simplified prediction model built by ranking all variables according to their relative importance and adding them one at a time. The importance of each variable was calculated as its individual effect on the AUC when removed from the overall model. We also wanted to investigate how well our model performed compared to an earlier risk-scoring system based on logistic regression analysis of the same cohort [22].
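The ranking procedure can be sketched generically. In this hedged illustration, `cv_auc` is a stand-in for the full cross-validated ANN training-and-scoring loop used in the study, and the toy scorer at the end exists only to make the example runnable:

```python
def rank_variables(variables, cv_auc):
    """Rank variables by the AUC drop when each is removed from the full model.

    cv_auc: callable taking a list of variables and returning the
    cross-validated AUC of a model trained on them (a stand-in for
    the actual ANN training loop).
    """
    full_auc = cv_auc(variables)
    drops = {v: full_auc - cv_auc([w for w in variables if w != v])
             for v in variables}
    # Largest AUC drop = most important variable
    return sorted(drops, key=drops.get, reverse=True)

# Toy scorer: pretend 'age' matters most, then 'time_to_rosc', etc.
weights = {"age": 0.05, "time_to_rosc": 0.03, "rhythm": 0.02, "sex": 0.0}
toy_auc = lambda vs: 0.75 + sum(weights[v] for v in vs)
print(rank_variables(list(weights), toy_auc))
# ['age', 'time_to_rosc', 'rhythm', 'sex']
```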

Table 2 Predictor ranking and prediction performance in data set A

We also wanted to analyse which variables and categories of clinical information carry the most predictive information among the background, pre-hospital and admission variables and to compare this with the overall model. Finally, we performed an analysis of the intervention effect of 33 °C vs 36 °C stratified by risk class. The five risk classes were defined as 0–20%, 20–40%, 40–60%, 60–80% and 80–100% risk of a poor outcome at 180 days, based on variables available at randomisation.
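The stratification itself amounts to binning each patient’s predicted probability of a poor outcome into one of five 20% intervals. A minimal sketch (the function name and class labels are ours):

```python
def risk_class(p_poor_outcome: float) -> str:
    """Assign a predicted probability of a poor outcome to one of five
    20%-wide risk classes."""
    if not 0.0 <= p_poor_outcome <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    bounds = [(0.2, "0-20%"), (0.4, "20-40%"), (0.6, "40-60%"),
              (0.8, "60-80%"), (1.0, "80-100%")]
    for upper, label in bounds:
        if p_poor_outcome < upper or upper == 1.0:
            return label

print(risk_class(0.05))  # 0-20%
print(risk_class(0.55))  # 40-60%
print(risk_class(1.0))   # 80-100%
```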

Designing and evaluating the ANN

A test set, corresponding to 10% of the data, was randomly chosen and set aside to test the performance of the final ANN model. The remaining data (90%) were used for training. The training set was randomly divided into five equal-sized groups to allow for cross-validation during model development. Missing values were imputed using simple mean or mode substitution based on the training set.
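A minimal, standard-library sketch of this preprocessing, assuming tabular patient records with possible missing values (column names and values are illustrative; the actual pipeline used the full TTM variable set):

```python
import random
from statistics import mean, mode

def split_and_impute(rows, test_fraction=0.10, seed=42):
    """Random train/test split, then mean (numeric) or mode (categorical)
    imputation, with statistics taken from the training set only so that
    no information leaks from the test set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    test, train = rows[:n_test], rows[n_test:]

    for col in rows[0]:
        observed = [r[col] for r in train if r[col] is not None]
        fill = (mean(observed) if isinstance(observed[0], (int, float))
                else mode(observed))
        for r in train + test:
            if r[col] is None:
                r[col] = fill
    return train, test

# Illustrative records with missing values (None)
patients = [{"age": 61, "rhythm": "VF"}, {"age": None, "rhythm": "VF"},
            {"age": 70, "rhythm": None}, {"age": 68, "rhythm": "asystole"}]
train, test = split_and_impute(patients, test_fraction=0.25)
```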

Our ANN consisted of one input layer, a number of hidden layers and one output layer (Fig. 1). A Bayesian optimisation approach, based on the Tree-structured Parzen Estimator (TPE), was used to find the best possible network architecture [34]. The search for optimal hyperparameters was performed within the following limits: 1–4 hidden layers, 5–400 nodes in each layer, batch size between 1 and 128, learning rate between 10⁻⁷ and 1, and an activation function of either the rectified linear unit (ReLU) or the hyperbolic tangent function. To improve generalisation, Bayesian optimisation was used to determine the most suitable regularisation parameters. The algorithm chose between the weight decay techniques of L1- and L2-norm penalties, with strength distributed between 10⁻⁵ and 1, or max-norm regularisation distributed between one and five. To further improve generalisation, dropout [35] and batch normalisation [36] were applied. The probability of a node being dropped was uniformly distributed between 0 and 0.5 in the hidden layers, and between 0 and 0.3 in the input layer. The sigmoid activation function was used for the single node in the output layer [37]. All networks were trained using early stopping, with a patience of 50 epochs. The maximum number of epochs was set to 1000. Two different methods for optimising the loss function were tested: the Adam implementation of stochastic gradient descent (SGD) and a slightly different version called Adam AMSGrad [38]. The hyperparameters resulting in the best performing network were as follows: a one-layer network with 149 nodes using the ReLU activation function and L2-norm weight decay with λ = 0.1374. The input dropout rate was 0.240, and the hidden dropout rate was 0.405. The optimisation algorithm was Adam AMSGrad, with a learning rate of 0.00197 and a batch size of 29. Batch normalisation was used. All networks were created using TensorFlow, an open-source machine learning framework developed by Google [39].
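The selected architecture is small enough to sketch as a plain forward pass: 54 inputs, one hidden layer of 149 ReLU nodes, and a single sigmoid output node. This standard-library sketch shows the computation only; the actual model was built in TensorFlow and additionally used dropout, batch normalisation, L2 weight decay and Adam AMSGrad during training, all omitted here, and the weights below are random placeholders rather than trained values:

```python
import math
import random

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: 54 inputs -> 149 ReLU nodes -> single sigmoid node."""
    hidden = [relu(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(wi * hi for wi, hi in zip(w_out, hidden)) + b_out)

rng = random.Random(0)
n_inputs, n_hidden = 54, 149  # 54 admission variables, 149 hidden nodes
w_hidden = [[rng.gauss(0, 0.1) for _ in range(n_inputs)]
            for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.gauss(0, 0.1) for _ in range(n_hidden)]
b_out = 0.0

x = [rng.random() for _ in range(n_inputs)]
p_poor_outcome = forward(x, w_hidden, b_hidden, w_out, b_out)
print(round(p_poor_outcome, 3))  # a probability strictly between 0 and 1
```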

Fig. 1
figure 1

ANN, artificial neural network. A schematic ANN with one input layer, two hidden layers and a single output layer. All nodes in the network are connected, in resemblance to the human central nervous system. The input layer in our ANN consisted of variables (background, pre-hospital and/or admission data), whereas the output layer was the outcome variable, the Cerebral Performance Category (CPC) scale dichotomised into good (CPC 1–2) or poor (CPC 3–5) functional outcome

Statistical analysis

All continuous variables were presented as medians with upper and lower quartiles, the interquartile range (IQR). Categorical variables were presented as numbers and percentages. The fraction of missing data is reported in Table 1. For comparisons between the groups, the Mann-Whitney U test was used for continuous data and Fisher’s exact test for categorical data. To evaluate the performance of the ANN model, we examined the receiver operating characteristic (ROC) curve, which plots sensitivity against 1 − specificity for all threshold settings. We used the area under the curve (AUC) as a performance measure [40], and the method of DeLong et al. [41] was used for the calculation of AUC differences.
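The AUC can be computed directly from predicted risks via its well-known equivalence to the Mann-Whitney U statistic: it is the probability that a randomly chosen poor-outcome patient receives a higher predicted risk than a randomly chosen good-outcome patient. A minimal sketch of that computation (DeLong’s variance estimate for comparing two AUCs is not shown):

```python
def auc(scores_pos, scores_neg):
    """AUC as the Mann-Whitney probability that a positive (poor outcome)
    case is ranked above a negative (good outcome) case; ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Perfect separation -> AUC 1.0; complete overlap -> 0.5
print(auc([0.9, 0.8], [0.1, 0.2]))  # 1.0
print(auc([0.5, 0.4], [0.5, 0.4]))  # 0.5
```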

A forest plot was created to assess the association between five predefined ANN-stratified risk classes of a poor outcome and treatment with targeted temperature management at 33 °C and 36 °C. All p values were two-tailed, and p < 0.05 was considered significant. The study manuscript follows the STROBE Statement [42].

Results

Of the 939 patients enrolled in the TTM trial, 932 were included in our study for the final analysis. Six patients were excluded due to missing outcomes, and one patient was excluded due to a high number of missing values (> 40). The population characteristics were categorised and presented as background, pre-hospital and admission variables in Table 1. A good functional outcome (CPC 1–2) was found in 440 (47%) patients, and 492 (53%) patients had a poor functional outcome (CPC 3–5) at 180 days of follow-up. Patients with a poor functional outcome were significantly older (68 vs 61 years, p < 0.001), more often female (22.6% vs 15.0%, p < 0.01) and had a higher degree of cardiovascular comorbidity than patients with a good functional outcome. Patients with a poor functional outcome also presented with worse clinical neurological findings, more pronounced metabolic and respiratory acidosis, and more often circulatory shock on admission (Table 1).

The data set was then randomly divided into a training set for developing the ANN model (n = 839) and a test set (n = 93) for independent performance measurement of the model’s generalisability. The overall ANN model, based on the 54 variables of data set A, showed a good prognostic capability in predicting outcome after 6 months. The cross-validated AUC (from the training set) was 0.852 ± 0.017 (Table 2b), and the AUC on the independent internal validation data set (the test set, n = 93) was 0.891, as shown in Fig. 2. Similar results were found when using the 53 variables in data set B (cross-validated AUC 0.852 ± 0.018 and test set AUC 0.889). As the variables in data set A are more intuitive to use in a clinical setting, and the AUC was similar in both data sets, we chose to focus on data set A.

Fig. 2
figure 2

Prediction performance. The prediction performance for long-term functional outcome, expressed as the AUC of a ROC curve, for an ANN model using all 54 variables available on admission to intensive care. Of the 932 patients included in the study, 93 (10%) were randomly chosen and set aside as a test set before the ANN algorithm trained its prediction model on the remaining patients. The trained ANN was then used to predict the outcome of the 93 test-set patients. The mean AUC for our ANN was 0.891, indicating an excellent performance in predicting long-term outcome. AUC, area under the curve; ROC, receiver operating characteristics; ANN, artificial neural network

When using information from the background, pre-hospital or admission variables in separate analyses, the model including only pre-hospital data performed best with an AUC of 0.861 on the validation set (test set) compared to admission data only (AUC test 0.784) or background data only (AUC test 0.670). When comparing the performance difference between the “TTM risk score” [22] and our ANN model on the test set, the ANN model had a significantly better AUC (0.904 vs 0.839, p = 0.029), as shown in Fig. 3.

Fig. 3
figure 3

Prediction performance in comparison. Comparison of the prediction performance for long-term outcome, expressed as AUC in ROC curves, between our ANN model (blue) and the TTM risk score (green) from Martinell et al. The ANN model (AUC = 0.904) significantly outperformed the TTM risk score (AUC = 0.839; p = 0.029) in a comparative analysis based on 80 patients (test set) from the TTM data set. Since the “TTM risk score” does not have a strategy for handling missing values, 13 patients were removed from the original test set of 93 patients when comparing the two models. The ANN AUCs in Figs. 2 and 3 differ for the same reason. AUC, area under the curve; ROC, receiver operating characteristics; ANN, artificial neural network; TTM, targeted temperature management

To create a simplified prediction model, all 54 variables were ranked based on their individual importance and their effect on the AUC when removed from the model. The ranking of the 15 most important variables and the corresponding AUC, when adding them one at a time to the model, is shown in Table 2b. The predictive performance initially increased rapidly, but then levelled out, gradually approaching the value of the reference AUC of the model using all 54 variables (Fig. 4). After adding five variables, there was no further significant increase in performance between the models. Of all variables available at admission to hospital, “age”, “time to ROSC” and “first monitored rhythm” were the three variables carrying the most predictive information. When only these three variables were combined in a neural network model, they showed good discrimination with a cross-validated AUC of 0.820 ± 0.011 (training set) and an AUC of 0.852 on the validation test set (Table 2b). Finally, we divided the trial cohort into five classes of risk of a poor outcome. The ANN-stratified analyses showed a similar treatment effect of TTM at 33 °C or 36 °C in these five predefined risk classes as measured by the logarithm of the diagnostic odds ratio (log (DOR)) (Fig. 5). Risk prediction in risk class 0–20% had a log (DOR) of 2.2 (n = 94, CI95% − 4.2–8.7, p = 0.25); in risk class 20–40%, a log (DOR) of 0.24 (n = 206, CI95% − 0.43–0.92, p = 0.24); in risk class 40–60%, a log (DOR) of 0.01 (n = 170, CI95% − 0.59–0.61, p = 0.49); in risk class 60–80%, a log (DOR) of 0 (n = 202, CI95% − 0.72–0.70, p = 0.51); and in risk class 80–100%, a log (DOR) of − 3.3 (n = 142, CI95% − 9.7–3, p = 0.85).
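The per-class effect measure can be sketched as the log odds ratio of a 2 × 2 table of treatment group against outcome. The counts below are hypothetical, for illustration only, and the confidence intervals and p values reported above are omitted from this sketch:

```python
import math

def log_odds_ratio(good_36, poor_36, good_33, poor_33):
    """Log odds ratio of a good outcome under 36 C vs 33 C treatment.
    A value > 0 favours 36 C, < 0 favours 33 C, and 0 means no difference."""
    return math.log((good_36 / poor_36) / (good_33 / poor_33))

# Hypothetical risk-class counts with no treatment effect:
print(log_odds_ratio(30, 20, 30, 20))  # 0.0
```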

Fig. 4
figure 4

Increased prediction performance when adding variables. The change in cross-validated AUC (AUCCV) when adding one predictor at a time and re-running the optimisation process each time. The predictive performance of the model (represented by the blue line and its corresponding CI in the green area) initially increased rapidly but then levelled out, gradually approaching the reference AUC (represented by the dotted line and its corresponding CI in the pink area) of the model using all 54 variables. After adding five variables, there was no significant difference in prediction performance between the two models, marked by a red X in the figure. AUC, area under the curve; CI, confidence interval

Fig. 5
figure 5

Diagnostic odds ratio for the artificial neural network (ANN)-stratified risk groups. The forest plot shows the logarithmic diagnostic odds ratio for five ANN-stratified risk groups of CPC score > 2 and its association with treatment with targeted temperature management at 33 °C and 36 °C. A diagnostic odds ratio > 1 implies a better functional outcome when treated with 36 °C compared to 33 °C. CPC, cerebral performance category

Discussion

In this study, we performed a post hoc analysis of OHCA patients included in the TTM trial and used an artificial neural network (ANN), a supervised machine learning model, to predict functional outcome, including survival, at 180 days with information readily available at the time of hospitalisation. Our model predicted outcome better than a corresponding logistic regression model in a prior study of the same cohort [22]. The overall ANN model, based on all 54 variables available on admission, showed an excellent capability of outcome prediction during internal cross-validation and performed even better on the test set, with an AUC of 0.891. Using only the three most important independent factors (age, time to ROSC and first monitored rhythm, variables readily known on arrival in the emergency room) in an ANN led to a model with an excellent predictive ability on the test set, with an AUC of 0.852, which is better than most proposed models in the field [19,20,21,22]. To identify which type of information carries the most valuable prediction of outcome, we also designed models that used the three available data categories (background, pre-hospital and admission data) separately. This approach decreased the prediction capability compared to the overall model, but variables from the pre-hospital setting carried the most information.

Large pragmatic clinical trials have been criticised for being heterogeneous and possibly diluting any intervention effect that may be relevant for subgroups of patients [43]. In this study, we performed a stratified analysis using the ANN to define risk classes in relation to the outcome, within which any intervention effect could be studied. Our models did not show any significant difference in the intervention effect of 33 °C or 36 °C on outcome when dividing the TTM trial population into five risk classes for a poor outcome. The intervention effect was thus uniform across the risk classes, which strengthens the main conclusion of the trial but also suggests a possible model for the detection of subgroup effects in other clinical trials.

A number of attempts have been made to create robust and straightforward outcome prediction scores for the OHCA population at admission to intensive care, in order to identify early those patients with a significant risk of a poor outcome and to stratify the severity of illness better than traditional classifications such as the Acute Physiology, Age and Chronic Health Evaluation (APACHE) and the Simplified Acute Physiology Score (SAPS), which are known to underperform in OHCA populations [19,20,21, 44]. An interesting future use of ANN algorithms would be the possibility of reliably assessing the individual risk of a poor outcome in OHCA patients, which could have clinical implications for early allocation to specific interventions (tailored therapy) and, later in the clinical course, for informing prognosis and continued life support. In recent years, machine learning has been used increasingly in various studies and has proved to be a promising method for data analysis. Machine learning has advantages compared to traditional regression models, such as the ability to detect correlations between independent variables in large complex data sets and to find trends or patterns in subsets of data. Recently published studies have shown the potential of machine learning for OHCA prediction with very good performance [7, 8]. In a study by Kwon et al., over 36,000 OHCA patients were included, and a deep learning-based OHCA prognostic system showed an impressive performance in predicting neurologic recovery and survival to discharge, with an AUC of 0.953 ± 0.001. However, no information regarding the long-term outcome in these patients was presented, and the overall mortality was very high, inherently increasing the possibility of reaching high AUCs. The cohort used in the study was heterogeneous, including more than 8000 patients (22%) with cardiac arrest of a traumatic cause, known to have a poor outcome and therefore probably contributing significantly to the predictive performance of the models [8].
In a population with about 50% survival, as for OHCA patients admitted to intensive care, our model reaching an AUC close to 0.9 using early data alone should encourage validation in separate and prospective cohorts. Some studies have indicated a lack of performance benefit of machine learning over logistic regression-based models. In a systematic review by Christodoulou et al. from 2019, no evidence of performance superiority of machine learning over logistic regression was found [45]. The review did, however, conclude that improvements in both methodology and reporting are needed for studies that compare modelling algorithms, and our study indeed indicated a significantly better performance with ANN compared to state-of-the-art logistic regression.

There are a number of limitations to this study. The majority of the variables had missing values, which leads to a number of challenges when developing a prediction model. To ensure that not too many patients or important variables were removed from the data set, we chose a simple strategy of replacing missing values with mean values for continuous variables and mode values for categorical variables. Data collected in the pre-hospital setting might be imprecise due to the challenge of registering exact and valid information in that situation. Moreover, the TTM trial cohort is a selected population, including only patients with a presumed cardiac cause of cardiac arrest, making it difficult to generalise our results to unselected cardiac arrest patients. There are discrepancies between the cross-validation (training) AUC and the resulting AUC from the test set. This is normal [46], but the fact that the models performed better on the test sets is noteworthy. Due to the nature of ANNs, two factors likely play a major role: the number of patients in the test set and the fact that we used the ensemble of networks created during cross-validation to make predictions on the test set. The ensemble technique is a widely used regularisation method, and employing it, as in this case based on 5-fold cross-validation, should result in increased generalisability of the model. Finally, data analysis using ANN models is still somewhat of a “black box” when it comes to applying the results to a real-life clinical setting, due to the complexity of biology and the variable medical contexts.
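The cross-validation ensembling mentioned above can be sketched as simple averaging of the five folds’ predicted probabilities; the stand-in “models” below are plain functions replacing the trained networks, for illustration only:

```python
def ensemble_predict(models, x):
    """Average the predicted probabilities of the cross-validation models."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

# Five stand-in 'models', each a function returning a probability
models = [lambda x, b=b: min(1.0, max(0.0, x + b))
          for b in (-0.02, -0.01, 0.0, 0.01, 0.02)]
print(round(ensemble_predict(models, 0.6), 6))  # 0.6
```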

Study strengths include the use of a well-defined cohort of OHCA patients. The TTM trial was an international multicentre randomised controlled trial with predefined protocol-based criteria for inclusion and treatment. There were strict rules for multimodal neurological prognostication and withdrawal of life-sustaining therapy. The long-term follow-up on outcome was performed with minimal data loss and assessed by a blinded assessor at a meeting with the patient and the patient’s next-of-kin according to a structured protocol, including neurological examinations and face-to-face interviews.

Robust and straightforward prediction scores used as a practical decision tool to support clinical assessments would probably improve overall cardiac arrest care by directing very advanced and potentially high-risk invasive treatment to those patients who may benefit from it. Such scores would hopefully also increase the ability to provide reliable prognostic information to next-of-kin, earlier than the observation time of at least 72 h, which is the current recommendation for neurological prognostication after cardiac arrest [17, 18, 47, 48].

We believe that this study is an important step towards improved outcome prediction in comatose patients surviving cardiac arrest. In the near future, we will have the results from the TTM2 trial with 1900 patients [49], offering an even larger unique registry of OHCA patients for ANN analyses and hopefully improving outcome prediction in these patients. There are obvious medical and ethical implications, as well as resource aspects, that may benefit from the development of future reliable cardiac arrest-specific severity scores for early outcome prediction. Future studies should investigate whether outcome prediction performance increases significantly by adding additional data and clinical variables such as early electroencephalography, neuroimaging and biomarkers. Finally, to be able to detect subgroups in an OHCA population with an increased risk of a poor outcome, or subgroups that may benefit from a specific intervention or need extensive rehabilitation, further studies on larger data sets are necessary to demonstrate significant associations.

Conclusion

Our supervised machine learning model based on an ANN predicted neurological recovery, including survival, excellently and outperformed a conventional model based on logistic regression. Of the data available at the time of hospitalisation, factors related to the pre-hospital setting carried the most predictive information. ANN may stratify a heterogeneous trial population into risk classes and help determine intervention effects across subgroups.