Introduction

Since January 2020, the world has been massively affected by the coronavirus disease 2019 (COVID-19) outbreak. In that context, intensive care units (ICUs) in many countries have frequently been forced to expand bed capacity. Unusually long mechanical ventilation (MV) durations and ICU stays, observed during the first wave, are among the most distinctive characteristics of severe acute respiratory syndrome-coronavirus 2 (SARS-CoV-2)-infection-related acute respiratory distress syndrome (ARDS), with 90-day mortality ranging from 31 to 53% [1,2,3,4]. Although accurately predicting patients’ clinical outcomes throughout this prolonged ICU stay can be difficult, effective recognition (at ICU admission and within the first 14 days) of those at high risk of in-ICU death is crucial to inform clinical decision-making and to inform families of likely prognoses. It could also facilitate adequate resource allocation, including hospital beds and critical care resources, and risk-adjusted comparison of center-specific outcomes. Predicting outcomes of critically ill patients with COVID-19 being treated in the ICU is a major challenge; it aims to avoid futile prolonged ICU stays and resource use, and to provide additional reliable information for decisions about withholding or withdrawing life-sustaining treatment, especially within disease epicenters needing to triage a high-volume influx of patients.

COVID-19-survival models published to date tried to predict the risk of clinical deterioration of acute cases [5, 6] using data from hospitalization day (D)1 [7]. To the best of our knowledge, none focused on predicting the survival of patients after 1-to-2 weeks in ICU. Taking advantage of the COVID–ICU-cohort database containing prospectively collected characteristics, management, and outcomes of patients admitted to ICUs for severe COVID-19 in France, Belgium, and Switzerland, between February and May 2020 [3], we used machine learning to develop three dynamic, clinically useful models able to predict 90-day mortality using in-ICU data collected on ICU D1, D7 or D14, respectively.

Patients and methods

Study population and data collection

COVID–ICU is a multicenter, prospective cohort study, conducted in 149 ICUs from 138 centers, across three countries (France, Switzerland, and Belgium), launched by the Reseau Europeen de recherche en Ventilation Artificielle (REVA) network. Most of the centers were in France (135/138) whereas two were in Belgium and one in Switzerland. All consecutive patients, over 16 years old, admitted to the participating ICUs between February 25, 2020, and May 4, 2020, with laboratory-confirmed SARS-CoV-2 infection, were included. Among the 4643 patients admitted to the ICU, 4244 had available survival status up to D90 post-ICU admission.

Every day, study investigators completed a standardized electronic case report form. Details of the information collected are described elsewhere [3]. Briefly, baseline information collected within the first 24 h post-ICU admission (D1) was: age, sex, body mass index (BMI), Simplified Acute Physiology Score (SAPS)-II [8], Sequential Organ-Failure Assessment (SOFA) score [9], comorbidities, clinical frailty-scale category [10], date of the first symptom(s), and ICU admission date. A daily-expanded dataset included respiratory support devices (oxygen mask, high-flow nasal cannula, noninvasive ventilation, or invasive MV), arterial blood gases, standard laboratory parameters, and adjuvant therapies for ARDS until D90. In-ICU organ dysfunctions included acute kidney failure requiring renal replacement therapy, proven thromboembolic complications, confirmed ventilator-associated pneumonia (VAP) or bacterial coinfection, and cardiac arrest. Each patient’s vital status was obtained 90 days post-ICU admission.

COVID–ICU received approval from the French Intensive Care Society Ethics Committee (CE-SRLF 20–23) in accordance with local regulations. All patients, or close relatives, were informed that their data were included in the COVID–ICU cohort. This study was conducted in accordance with the amended Declaration of Helsinki.

Candidate predictors

We included candidate predictors considered in our previous multivariate Cox regression analyses, which assessed baseline risk factors of death by D90 [3]. D7 and D14 candidate predictors were defined a priori among data available in the COVID–ICU cohort [3] (i.e., before the building of the SOSIC models), based on recent publications describing risk factors and specific complications associated with COVID-19 prognosis [11, 12]. VAP was diagnosed by quantitative distal bronchoalveolar lavage cultures growing ≥ 10⁴ CFU/mL, blind protected specimen-brush distal samples growing ≥ 10³ CFU/mL, or endotracheal aspirates growing ≥ 10⁶ CFU/mL. Pulmonary embolism was proven by pulmonary computed-tomography angiography or echocardiography.

Statistical analyses

Model development

We implemented a systematic machine learning-based framework to construct three mortality-prediction models (SOSIC-1, SOSIC-7, and SOSIC-14) from randomly selected development datasets, comprising 90% of the study sample; the remaining 10% were randomly assigned to the test datasets. Each prediction model was built using a gradient-boosting machine with decision trees, as implemented in the eXtreme Gradient-Boosting (XGBoost) classification algorithm [13]. The XGBoost algorithm has several tuning parameters (e.g., the number of decision trees, the maximal depth of the component decision trees). The best set of parameters was chosen from a large grid of tuning parameters using tenfold cross-validation to maximize the prediction model’s discrimination ability, as assessed by the area under the receiver operating characteristic curve (AUC). We aimed to build models that could accurately estimate D90 survival for patients alive on D1, D7, or D14 following ICU admission. The SOSIC-1 model included only baseline candidate predictors, while the SOSIC-7 and SOSIC-14 models combined baseline and D7 or D14 patient characteristics. The variable importance, which quantifies how much each variable contributed to the classification, was extracted from the models. SHAP (SHapley Additive exPlanations) values were also computed to visualize the influence of each input variable on the final score [14].
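The tuning procedure described above (random 90/10 split, then tenfold cross-validation over a parameter grid, keeping the grid point with the highest mean AUC) can be sketched as follows. This is a minimal illustration in plain Python, not the R/caret/XGBoost code used in the study. The function names, the stratified fold construction, the synthetic data, and the stand-in learner `fit_single_column` are all assumptions made for the example; in practice the `fit` argument would wrap a gradient-boosting learner.

```python
import itertools
import random


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def split_development_test(n, test_fraction, rng):
    """Randomly assign row indices to development and test datasets."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def stratified_folds(indices, y, k):
    """Sort rows by outcome, then deal them round-robin so folds share both classes."""
    ordered = sorted(indices, key=lambda i: y[i])
    return [ordered[i::k] for i in range(k)]


def grid_search_cv(X, y, dev_idx, grid, fit, k=10):
    """Return (params, mean AUC) for the grid point with the best k-fold AUC."""
    folds = stratified_folds(dev_idx, y, k)
    names = sorted(grid)
    best_params, best_auc = None, -1.0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_aucs = []
        for held_out in folds:
            train = [i for fold in folds if fold is not held_out for i in fold]
            predict = fit([X[i] for i in train], [y[i] for i in train], **params)
            fold_aucs.append(
                auc([y[i] for i in held_out], [predict(X[i]) for i in held_out])
            )
        mean_auc = sum(fold_aucs) / k
        if mean_auc > best_auc:
            best_params, best_auc = params, mean_auc
    return best_params, best_auc


def fit_single_column(X_train, y_train, column):
    """Hypothetical stand-in learner: scores a row by one chosen column."""
    return lambda row: row[column]


# Synthetic data (assumption): column 0 separates the classes, column 1 is noise.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = [(label + rng.random(), rng.random()) for label in y]
dev_idx, test_idx = split_development_test(len(y), 0.10, rng)
best_params, best_auc = grid_search_cv(
    X, y, dev_idx, {"column": [0, 1]}, fit_single_column
)
```

The test indices are never touched during tuning, which mirrors the role of the independent test datasets in the study design.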

Model validation

The performances of the three SOSIC models predicting 90-day mortality were evaluated using AUC-assessed discrimination (i.e., the probability that patients who experience the outcome will be ranked above those who do not), and calibration (i.e., the agreement between predicted and observed risks) assessed by the calibration curve (i.e., the ideal calibration intercept is 0 and ideal calibration slope is 1). The Brier score was also computed; it combines calibration and discrimination by quantifying how close predictions are to the observed outcomes (i.e., better performance is observed with a lower Brier score) [15].
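For readers less familiar with these three metrics, the sketch below computes each of them from a vector of observed outcomes and predicted risks. It is illustrative only and uses plain Python rather than the R packages used in the study. Estimating the calibration intercept and slope jointly, by regressing the outcome on the logit of the predicted risk, is an assumption of this example; some authors estimate the intercept with the slope fixed at 1. The function names and the toy data are hypothetical.

```python
import math


def auc(y, p):
    """Discrimination: chance a patient who died is ranked above one who survived."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y, p):
    """Mean squared difference between predicted risk and outcome (lower is better)."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def calibration_intercept_slope(y, p, iterations=25):
    """Fit logit(P(y=1)) = a + b * logit(p) by Newton-Raphson; ideal is a=0, b=1."""
    x = [math.log(pi / (1 - pi)) for pi in p]
    a, b = 0.0, 1.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = 1 / (1 + math.exp(-(a + b * xi)))
            w = mu * (1 - mu)
            g0 += yi - mu            # gradient with respect to the intercept
            g1 += (yi - mu) * xi     # gradient with respect to the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Toy data (assumption): predictions of 10% and 90% where observed rates are 20% and 80%.
y_over = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
p_over = [0.1] * 5 + [0.9] * 5
intercept, slope = calibration_intercept_slope(y_over, p_over)
```

In this toy case the predictions are too extreme, so the fitted slope falls below 1 (it equals ln 4 / ln 9, about 0.63), while the intercept stays at 0 because the over- and underestimation are symmetric.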

A double internal validation was applied to the three SOSIC prediction models. First, internal validity was assessed by estimating the model performance corrected for optimism using bootstrap resampling with 100 repetitions. All the steps leading to the final prediction model (including the selection of the set of XGBoost tuning parameters) were applied to every bootstrap sample [16]. Second, model performance was assessed on the independent test datasets, distinct from the development datasets used for model construction. One of the advantages of the XGBoost algorithm is its sparsity awareness, which allows it to handle missing values [17]. Therefore, no missing value was imputed before model development or validation. Because the COVID-19 pandemic did not affect all regions equally, we tested the performances of the three SOSIC models in two distinct populations: centers in the greater Paris area and Grand Est versus centers in other regions. Lastly, the performance of SOSIC-1 was also compared to those of the SOFA and SAPS II scores in the development and test datasets.
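The bootstrap step can be illustrated with the usual optimism-correction scheme: the whole modelling pipeline is refitted on each bootstrap sample, the difference between its performance on that sample and on the original data estimates the optimism, and the mean optimism is subtracted from the apparent performance. The sketch below assumes this scheme and is not the study's code. The lookup "learner" and its accuracy metric are hypothetical and deliberately overfitted, chosen so that the correction is clearly visible.

```python
import random


def optimism_corrected(data, fit, score, n_boot=100, seed=0):
    """Return (apparent, optimism-corrected) performance of a modelling pipeline."""
    rng = random.Random(seed)
    apparent = score(fit(data), data)
    total_optimism = 0.0
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]  # resample rows with replacement
        model = fit(sample)  # every step, tuning included, is redone on the sample
        total_optimism += score(model, sample) - score(model, data)
    return apparent, apparent - total_optimism / n_boot


def fit_lookup(rows):
    """Hypothetical overfitted learner: memorise the outcome of each patient id."""
    return {patient_id: outcome for patient_id, outcome in rows}


def accuracy(model, rows):
    """Share of rows the lookup reproduces; unseen patient ids count as errors."""
    return sum(model.get(pid) == outcome for pid, outcome in rows) / len(rows)


# Toy cohort (assumption): 50 patients with distinct ids and a binary outcome.
rows = [(pid, pid % 3 == 0) for pid in range(50)]
apparent, corrected = optimism_corrected(rows, fit_lookup, accuracy)
```

Here the apparent accuracy is perfect because the learner has memorised every patient, whereas the corrected estimate drops to roughly the share of patients that a bootstrap sample is expected to contain (about 63%), which is the kind of gap the correction is meant to expose.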

Descriptive analysis

Characteristics of the data included in the SOSIC scores are expressed as number (percentage) for categorical variables and means ± standard deviations or medians (interquartile ranges) for continuous variables. In a univariate analysis, categorical variables were compared with χ² or Fisher’s exact test and continuous variables were compared with Student's t-test or Wilcoxon's rank-sum test. A P value < 0.05 was considered statistically significant. Statistical analyses and predictive model construction were computed with R v4.0.3, caret package v6.0-86, and XGBoost package v1.3.2.1.

Results

Study population

Among 4643 patients enrolled by May 4, 2020, 399 were lost to follow-up by D90. Thus, the predictive survival models were built based on the remaining 4244 patients with available D90 vital status. Then, 4244, 2877, and 1349 patients, respectively, were included in the development datasets to construct the SOSIC-1, SOSIC-7, and SOSIC-14 scores, with 424, 292, and 185 from each group, respectively, randomly assigned to the corresponding test datasets (Fig. 1). The three models selected 15 ICU (baseline) variables: age; sex; BMI; treated hypertension; known diabetes; immunocompromised status; clinical frailty-scale category; bacterial coinfection; ventilation profile; SOFA-score respiratory, cardiovascular, and renal components; lactate concentration; and lymphocyte count. d-Dimers were also selected a priori but were not retained for model development because of their inconsistent collection at ICU admission (Additional file 1). Selected in-ICU parameters obtained on D7 (SOSIC-7) or D14 (SOSIC-14) were: SOFA-score respiratory, cardiovascular, and renal components; lactate level; and ventilation profile. In addition, on D7 or D14, the duration of invasive MV, extubation procedure, prone-positioning, continuous neuromuscular blockade, VAP, cardiac arrest, and/or proven pulmonary embolism since ICU admission were integrated into the SOSIC-7 and SOSIC-14 scores. Table 1 reports the distributions of these variables according to D90 vital status in the D1, D7, or D14 development and test datasets.

Fig. 1
Flowchart of COVID-19 patient screening, inclusion, and assignment to the development and test datasets. ICU intensive care unit, SOSIC Survival of Severely Ill COVID score

Table 1 Demographic, clinical and ventilatory support characteristics of the development and test datasets of the SOSIC scores on ICU day-1 according to their 90-day survival status

Univariate analyses of patient characteristics in the development datasets showed that those who died were significantly older and had a higher clinical frailty-scale category, lower BMI, and shorter intervals between first symptom(s) and ICU admission (except for the SOSIC-14 dataset) compared to D90 survivors (P < 0.01). Similarly, patients who had died by D90 were more likely to be on invasive MV on ICU D1 and had significantly higher SOFA-score respiratory, cardiovascular, and renal components. Their lactate levels during the first 24 h in-ICU were significantly higher and their lymphocyte counts were lower.

Among patients still in-ICU on D7 or D14, the same differences were observed regarding their SOFA-score components, lactate levels, and ventilation profiles on those days (Table 2). As expected, patients who died were more likely to have undergone prone-positioning or received neuromuscular blockade and experienced significantly more complications in-ICU (i.e., VAP, cardiac arrest, pulmonary embolism) within the first 7 or 14 days.

Table 2 Characteristics on ICU day-7 or day-14 of the development and test datasets of the SOSIC-7 and SOSIC-14 scores according to their 90-day survival status

Importance of the 90-day mortality predictors

Figure 2 and Additional file 2 illustrate variable-weighting in the machine-learning models used to build the D1, D7, and D14 SOSIC scores. Briefly, age, clinical frailty-scale category, D1 lymphocyte count, and the interval between first symptom(s) and ICU admission were given significant weight to predict D90 mortality. However, the weights of these baseline characteristics tended to decrease when the prediction was estimated after 7 or 14 days in-ICU. Conversely, other baseline comorbidities, such as known diabetes, immunocompromised status, or treated hypertension, were accorded similar weights in all three scores.

Fig. 2
Variable weighting in the machine-learning models: Survival of Severely Ill COVID (SOSIC) scores on days 1, 7 and 14. A color gradient is used to show variable strength in the machine-learning models at different times, ranging from yellow for the highest preponderance input variables to progressively darker shade of purple for lower input variables. D day, BMI body mass index, ICU intensive care unit, MV mechanical ventilation, SOSIC Survival of Severely Ill COVID score, SOFA Sequential Organ-Failure Assessment, VAP ventilator-associated pneumonia

Interestingly, when the prediction was estimated on D7 (SOSIC-7) or D14 (SOSIC-14), SOFA-score respiratory and cardiovascular components, and respiratory support at ICU admission were accorded greater importance compared to the D1 prediction. Moreover, cardiovascular, renal, and pulmonary functions on the prediction day (D7 or D14) were among the inputs with the highest preponderance in both models, while in-ICU complications since admission had only modest weight.

Performance of the SOSIC scores

We developed three models using XGBoost algorithms that accurately predicted 90-day mortality using data from ICU D1 and the prediction day. Apparent, bootstrap-corrected, and test-dataset validation metrics are reported in Additional file 3. Based on the test dataset, the AUC was slightly higher for SOSIC-7 (0.80 [0.74–0.86]) than for SOSIC-1 (0.76 [0.71–0.81]) and SOSIC-14 (0.76 [0.68–0.83]). Similarly, SOSIC-1 and SOSIC-7 calibration curves were excellent (Fig. 3). Those findings indicate fair agreement between predicted and observed risks for those two scores. Although calibration of SOSIC-14 was poorer (slope 0.83 [0.49–1.22]) than that of SOSIC-1 and SOSIC-7, the Brier scores, which assess both calibration and discrimination of the predictive models, were similar for all three models. In addition, the correlations between the three scores were good (Additional file 4). We did not identify any center effect, as evidenced by the AUC of SOSIC-1, which was similar in the greater Paris area and Grand Est vs other regions (0.75 95%CI [0.69;0.81] vs 0.76 95%CI [0.67;0.85]). Similarly, the calibration slope was close to 1 regardless of region (greater Paris area and Grand Est slope: 0.92 95%CI [0.65;1.21] versus other regions slope: 0.90 95%CI [0.53;1.33]). Similar results were observed for the SOSIC-7 and SOSIC-14 models and are reported in Additional file 5. The internal validation of SOSIC-1 on the test dataset exhibited fair performance (AUC: 0.76 [95%CI 0.71;0.81]) in contrast to the much poorer discrimination of the SAPS II (AUC: 0.64 [95%CI 0.58;0.70]) and SOFA scores (AUC: 0.62 [95%CI 0.56;0.69]). Graphic representation of the SOSIC-1, SAPS II, and SOFA discrimination performances is shown in Fig. 4. Similarly, performances of SOSIC-7 and SOSIC-14 were slightly better than those of SOFA at day-7 and day-14, respectively (Additional file 6).

Fig. 3
Calibration and discrimination of the Survival of Severely Ill COVID (SOSIC)-1, SOSIC-7 and SOSIC-14 scores using the test dataset

Fig. 4
Graphic representation of the SOSIC-1, the SAPS II, and the SOFA performances in the A development and B the test datasets. SOSIC Survival of Severely Ill COVID score; SAPS II Simplified Acute Physiology Score II, SOFA Sequential Organ-Failure Assessment

Discussion

We developed and validated three prognostic models (SOSIC-1, SOSIC-7, and SOSIC-14) to predict 90-day mortality of 4244 critically ill patients with COVID-19 treated in France, Belgium, and Switzerland, evaluated during the 2 weeks following ICU admission. The SOSIC scores showed that entering 15 to 27 baseline and dynamic clinical parameters (depending on the score day) into an automatable XGBoost algorithm had the potential to accurately predict likely mortality 90 days post-ICU admission. Although external validations of the SOSIC scores in other critically ill populations with COVID-19 are still needed, these dynamic tools could enable clinicians to objectively assess the in-ICU mortality risk of patients with COVID-19 for up to 14 days. They offer an additional tool to strengthen decisions about life-sustaining treatments and hospital and ICU resources, and to inform family members of likely prognosis.

Predicting outcomes of critically ill COVID patients is challenging. Patients hospitalized with COVID-19 can be classified into three phenotypes that have prognostic implications [18]. Indeed, patients with more chronic heart, lung, or renal disease(s), obesity, diabetes, an intense inflammatory syndrome, higher creatinine level, and poorer oxygenation parameters were classified as having the highest risk of deterioration that was associated with poorer outcomes [18]. Age is frequently associated with higher rates of hospitalization, ICU admission, and mortality of patients with COVID-19 [18,19,20]. Frailty is a useful tool to stratify the risk of death 90 days post-ICU admission and offers important additional prognostic information to combine with age over 70 years for patients with COVID [21]. Interestingly, the weights accorded age and frailty in our predictive models declined over the ICU stay. In other words, those two variables more weakly affected mortality prediction after 7 or 14 days in-ICU, compared to the prediction at ICU admission (Fig. 2). Similarly, a shorter interval from the onset of COVID-19 symptom(s) to ICU admission, which was associated with a higher risk of death [22], weighed less in SOSIC-7 and SOSIC-14.

Despite being collected on D1, SOFA-score cardiovascular, respiratory, and renal components strongly impacted later predictions but only modestly affected the D1 prediction. Similarly, in an observational multicenter cohort of patients with moderate to severe COVID-19 ARDS, the decrease of the static compliance of the respiratory system observed between ICU day-1 and day-14 was not associated with day-28 outcome [23]. In addition, cardiac injuries appear frequent, with nearly 70% of COVID-19 patients experiencing cardiac injury within the first 14 days of ICU stay [24].

The poor discriminant accuracy of the SOFA-score to predict mortality of patients before intubation for COVID-19 pneumonia was recently highlighted [25]; indeed, these patients generally have severe single-organ dysfunction and overall less SOFA-score variation. However, D7 and D14 respiratory, cardiovascular, and renal statuses are of the utmost importance in the mortality prediction at those times. The SOSIC scores highlight that the influence of some variables on the predicted mortality of patients with COVID-19 changes over time, e.g., demographic variables had less weight after 1 or 2 weeks in-ICU. However, the discrimination of the SOSIC scores did not improve over time. Indeed, the AUC was not better at day-7 or 14 compared to day-1 and its advantage over the SOFA-score was reduced over time. A greater number of variables, inducing higher heterogeneity, combined with the smaller sample sizes of the development and test datasets at day-7 and 14, could explain this finding.

Because no models predicting COVID-19 outcomes focused on patients already in the ICU [5, 6, 26], the SOSIC scores have the potential for clinical usefulness and generalizability. Internal validations of the SOSIC scores showed consistent discrimination and calibration, which deserve further external validation. With an AUC around 0.80, external validation is desirable to assess the mortality prediction beyond population levels and to fully assess the mortality risk of the individual being admitted and cared for up to 2 weeks in-ICU. Although discrimination was largely consistent across the different validation methods, SOSIC-14’s calibration was poorer; that finding suggests its performance in an external independent sample might be lower than that of SOSIC-1 or SOSIC-7. By construction, SOSIC-14 was developed on a smaller sample size than the other two models, which might explain its lower quality in terms of predicting 90-day mortality.

Despite being developed and validated on a substantial cohort with a large number of participating ICUs, these scores were constructed during the first COVID wave in Europe, a period with high pressure on the health systems and before the publication of core randomized trials [4, 27]. Moreover, ventilator strategies have also changed, as the pandemic has evolved and the medical community acquired a greater understanding of the pathophysiology of the disease and how to treat it. Caregiver reluctance to provide noninvasive oxygen strategies has been overcome [3], leading to higher percentages of patients on high-flow oxygen and noninvasive ventilation, and lower rates of intubation on ICU D1 [3, 28].

Debates are still ongoing as to the best timing of intubation in that population, as recent data have suggested poorer outcomes associated with an early intubation strategy [29,30,31]. Thus, the very high percentages of our patients intubated on ICU D1 will probably differ during subsequent COVID-19 outbreaks, in countries with different public healthcare organizations or ICU admission policies.

Indeed, SOSIC predictions should be interpreted as reflecting a profile of critically ill patients with COVID-19 not routinely treated with corticosteroids and outside vaccination campaigns, which may have changed since May 2020. In addition, this cohort study was conducted at a time when the national health system was under extreme pressure, which led to a major reorganization of intensive care supplies in some regions, although we did not find a region effect on the performances of the SOSIC scores. However, we cannot rule out that, outside a surge situation, the model could slightly overestimate mortality. As commonly done for other scoring systems [32], prospective external validations of the SOSIC scores are warranted to determine the need for temporal recalibration and to evaluate model performance in diverse international settings. External validations in more recent cohorts of patients who received recent treatments and ventilation management are warranted. The publicly available calculator (sosic.shinyapps.io/shiny) should help achieve these goals. Another limitation is that we only included predictors that were routinely collected in the COVID–ICU database during the study period. Thus, we cannot rule out that some additional laboratory or ventilatory parameters reflecting respiratory mechanics (especially measured on ICU D7 or D14) would have improved SOSIC-score performances. We were also unable to integrate d-dimer concentrations as initially planned [33], because of their inconsistent collection at ICU admission. Although the XGBoost algorithm incorporates missing data in its split-finding algorithm, we cannot guarantee that this method can handle any pattern of missing data effectively [34]. As this algorithm potentially exploits the data missingness patterns for prediction, a major shift in the missingness mechanism in an external independent sample may affect the performance of the SOSIC scores. Lastly, important detailed information on therapy withholding or withdrawing is lacking.

Conclusion

The SOSIC-1, -7, and -14 scores were able to fairly predict 90-day mortality of critically ill patients with COVID-19 admitted and managed in-ICU (sosic.shinyapps.io/shiny). These machine-learning models, built with XGBoost algorithms, showed good discrimination and excellent calibration. The patient’s demographic characteristics contributed most to SOSIC-1, while ventilatory status and extrapulmonary dysfunctions were the preponderant predictors in SOSIC-7 and SOSIC-14. Further studies are now warranted to externally validate these scores in recent cohorts of critically ill COVID-19 patients and assess their performances at individual levels as the pandemic evolves.