Background

The coronavirus disease (COVID-19) pandemic has been characterized by a high uncertainty in outcomes for those contracting the virus, particularly regarding the severity of symptoms, disease trajectories, and mortality. Additionally, there are differences in governmental public health responses between countries and between surges in COVID-19 cases (“waves”) [1]. As such, outcomes have varied by geographic region and temporally by “wave” [2]. This has further exacerbated uncertainty, making it difficult to predict outcomes among people with COVID-19 who are admitted to the hospital.

Approximately 20% of patients hospitalized with COVID-19 require intensive care and possibly invasive mechanical ventilation [3, 4]. Patient preferences with COVID-19 for mechanical ventilation may be different than for other types of pneumonia, because intubation for these patients is often prolonged, may be administered in settings characterized by severe social isolation and is associated with very high average mortality rates [3, 5]. Supporting patients and surrogate decision-makers in conversations facing decisions regarding admission to the intensive care unit (ICU) and mechanical ventilation requires providing an accurate forecast of their likely outcomes based on their individual characteristics [3, 6]. Further, given the continuous pressure on health care systems, there is a need to support decision-making in triaging people with COVID-19 in the Emergency Department (ED) for hospital or ICU admission. Clinical prediction models have the potential to support health care providers and people with COVID-19 and their families in decision-making by providing accurate prognoses.

Since the start of the pandemic, over 200 prediction models for the diagnosis and prognosis of COVID-19 have appeared in the literature, but few were developed with high methodological rigor [7]. Almost all published models were identified as having a high risk of bias, indicating that their reported performance is most likely overly optimistic [7]. Although some of these models were externally validated — showing highly variable model performance — the validity and generalizability in settings beyond those in which the model was developed remains largely unknown [8]. Poorly calibrated prognostic models may lead to harm, since they yield misinformation that can lead to clinical decision-making that is worse than using best “on average” information [9,10,11].

In addition to examining geographic transportability, since new SARS-CoV-2 variants with different COVID-19 severity are emerging (such as Omicron), natural and vaccine immunity are developing and treatment best practices are rapidly evolving over time (e.g., proning, minimizing paralytics, lung-protective volumes, remdesivir, dexamethasone), validating and updating these prognostic models may be crucial, even within the same geographic setting [12, 13]. Changes over time in the selection of patients who are admitted to the hospital can also have important effects on outcomes and on the consistency of predictor effects. Models developed on abundant first-wave data may have little generalizability to later waves.

Both in the New York City (NYC) area and the Netherlands (NL), prognostic models were developed for predicting outcomes in patients hospitalized with COVID-19: The Northwell COVID-19 Survival (NOCOS) model was developed on a large set of NYC data and the COVID Outcome Prediction in the Emergency Department (COPE) model was developed on a large set of NL data [14, 15]. We aimed to evaluate the geographic and temporal transportability of these two models and to examine updating approaches. Thus, we sought to gain further insight on model transportability to different settings and to different time windows, particularly in a dynamic pandemic.

Methods

Population

The database included anonymized data of COVID-19 patients who were admitted to 12 Northwell Clinics in the NYC area and to 4 Dutch hospitals. NOCOS and COPE were developed on data of patients who presented at the ED and were admitted to the hospital with suspected COVID-19 in the first wave of the pandemic between March and August 2020, in NYC and NL respectively. To evaluate the temporal and geographic transportability of NOCOS and COPE, we used data of patients who presented at the ED and were admitted to the hospital with suspected COVID-19 in the second wave of the pandemic, between September and December 2020. Patients being transferred to other hospitals were excluded since information on outcomes was missing.

Outcomes

The outcomes of interest were (a) death or transfer to a hospice within 28 days after hospital admission and (b) requiring mechanical ventilation (NYC) or ICU admission (NL) within 28 days after hospital admission.

Predictors

Based on prior literature both NOCOS and COPE included patient characteristics (sex, age, BMI), vital parameters (oxygen saturation, systolic blood pressure, heart rate, respiratory rate [RR], body temperature), and blood test values (C-reactive protein [CRP], lactic dehydrogenase [LDH], D-Dimer, leucocytes, lymphocytes, monocytes, neutrophils, eosinophils, Mean Corpuscular Volume [MCV], albumin, bicarbonate, sodium, creatinine, urea), all measured at ED admission [7, 14, 15]. Logarithmic transformations of predictor values were included to capture non-linear associations with the outcomes. The date of admission was included to capture potential secular changes in outcomes over time; these variables were fixed to calibrate risk predictions to outcome rates at the end of the first wave. In the case of multiple measurements for the same patient, we used the first measurement after presentation at the ED. We used Multivariate Imputation by Chained Equations (R-packages mice) for multiple imputation of missing predictor values [16, 17]. Multiple imputation in the validation data was undertaken separately from multiple imputation in the development data to ensure fully independent model validation.

Model development

Details on the development of COPE and NOCOS are described in other publications [14, 15]. A summary of important details is provided in the supplement (Additional file 1: Box S1), together with the model formulas (Additional file 1: Table S1).

Model validation

Model performance was assessed temporally on subsequent second-wave data at the same site and also geographically, i.e., COPE was evaluated on second-wave NYC data and NOCOS on second-wave NL data. We assessed discriminative ability with the area under the operator receiver characteristic curve (AUC). The model-based concordance (mb.c), which provides the expected AUC in a validation dataset based on the distribution of the predicted probabilities (i.e., assuming no model invalidity), was used to understand the impact on the discriminative ability of potential differences in case-mix heterogeneity between the development and validation data [18]. We assessed calibration with calibration plots of ten equally sized groups of predicted risk, with the E-statistic — the average absolute difference between predicted probabilities and observed frequencies according to a smooth calibration curve – and with calibration intercepts and calibration slopes [19]. We used decision curves to assess the net benefit of using the models at a range of decision thresholds [20]. We also evaluated the net benefit after updating the intercept and the slope in the validation data. All analyses were performed in MATLAB 2019b and in R software, at the NYC and Dutch site, respectively [16].

Results

Patient characteristics and outcomes

Mortality

Twenty-eight-day mortality was considerably higher in the NYC first-wave data (2551/12,163 = 21.0%), compared to the second-wave (216/2137 = 10.1%) and the NL data (first wave 629/5831 = 10.8%; second wave 326/3252 = 10.0%). Many predictors were similarly distributed in the NL and the NYC area, with the exception of CRP and LDH, which were higher, and D-Dimer, which was lower in the NYC area (Table 1). These biomarkers may have been measured in sicker patients, reflected in higher biomarker levels when larger proportions were missing.

Table 1 Baseline characteristics of 1st wave and 2nd wave patient cohorts in the Netherlands and NYC. Median, quartile range (“Q1” = first quartile; “Q3” = third quartile) and percentage missing (“% NA”) are presented for all continuous variables. The percentage of patients with male sex is reported in the last row

Need for mechanical ventilation or ICU admission

In the NYC area, the proportion of patients receiving mechanical ventilation decreased from 16.9% (2056/12,163) in the first wave to 10.4% (223/2135) in the second wave of the pandemic. The rate of ICU admission in NL (fully recorded for two out of four hospitals) decreased from 8.1% (214/2633) in the first wave to 5.9% (86/1466) in the second wave of the pandemic. However, for validation of the models predicting the need of mechanical ventilation or ICU admission in the NL data, we only used patients below the age of 70, as the probability of being admitted to the ICU paradoxically decreased with age after the age of 70, reflecting a triage policy not to admit older patients to the ICU rather than using a triage policy based on disease severity [15]. The rate of ICU admission in patients below the age of 70 in NL decreased from 9.9% (128/1296) in the first wave to 6.4% (45/706) in the second wave of the pandemic.

Validation of prognostic models

Mortality

COPE discriminated well at temporal validation (AUC 0.82 [0.80; 0.84]; Fig. 1A), with excellent calibration (E-statistic 0.8%; calibration intercept −0.05 [−0.17; 0.08]; calibration slope 0.98 [0.86; 1.10]). At geographic validation in second-wave NYC data, discrimination was satisfactory (AUC 0.78 [0.75; 0.81]; Fig. 1B), but with moderate over-prediction of mortality risk, particularly in higher risk patients (E-statistic 2.9%; calibration intercept −0.33 [−0.50; −0.17]; calibration slope 0.82 [0.66; 0.97]).

Fig. 1
figure 1

Temporal and geographic validation: performance of COPE and NOCOS for predicting death in second-wave patients. Calibration plots of patients who were admitted since September 2020 in 4 NL hospitals and 12 NYC hospitals. Temporal validations of COPE and NOCOS are in panels A and C, respectively. Geographic validations of COPE and NOCOS are in panels B and D, respectively. n is number of patients; a = calibration intercept (0 is perfect); b = calibration slope (1 is perfect); c = AUC (0.5 is useless; 1 is perfect); mb.c = model-based AUC

In contrast, when NOCOS was evaluated in NYC area data from the second wave, while discrimination was adequate (AUC 0.77 [0.74; 0.81]; Fig. 1C), NOCOS systematically overestimated the mortality risk (E-statistic 5.1%; calibration intercept −0.50 [−0.65; −0.34]). Similarly, when tested in NL data, discrimination remained adequate (AUC 0.81 [0.79; 0.83]; Fig. 1D), but again NOCOS over-predicted mortality risk, particularly in lower risk patients (E-statistic 4.0%; calibration intercept −0.39 [−0.51; −0.28]). Surprisingly, NOCOS was “underfitted” (calibration slope>1), both at temporal validation (calibration slope 1.33 [1.10; 1.57]) and geographic validation (calibration slope 1.55 [1.36; 1.74]), probably due to overly aggressive shrinkage of the predictor effects of NOCOS.

In NL data, both COPE and NOCOS had a positive net benefit for decision thresholds up to 70%, but the net benefit of COPE was considerably higher for decision thresholds up to 50% (Fig. 2A). Recalibration of both models to the second-wave NL data led to limited improvements in net benefit. In the NYC data, the net benefit of COPE was positive and more favorable than the net benefit of NOCOS for decision thresholds up to 30%, but was negative for decision thresholds over 35%, while NOCOS was not negative for the full range of decision thresholds. After recalibration of the intercept and the slope to the second-wave NYC data (Fig. 2B), the net benefit of COPE and NOCOS was more similar.

Fig. 2
figure 2

Temporal and geographic validation: net benefit of COPE and NOCOS for predicting death in second-wave patients. Decision curves of patients who were admitted since September 2020 in 4 NL hospitals (panel A) and 12 NYC hospitals (panel B). Net benefit is plotted against the full range of possible decision threshold probabilities for the original prognostic models (“COPE” in red and “NOCOS” in blue) and for these models with a calibrated intercept and slope (“COPE.recal” in dashed red and “NOCOS.recal” in dashed blue)

Exploring the influence of changes in outcome rates over time in the first wave

To explore variations in outcome rates over time, we examined COPE and NOCOS predictions when the variable for calendar time was excluded from the model and compared that to performance of the full model. When NL data from March was used to predict outcome rates in the second-wave NL data, the average predicted mortality was 17.1%, an over-prediction of 7.1%. When data from the full first wave was used, the average predicted mortality in the second wave decreased to 12.3%, an over-prediction of 2.3%. Correcting for the “March effect” led to the excellent calibration, with an average over-prediction of only 0.5%. The NOCOS model developed on first-wave data from March also over-predicted mortality in the second-wave NYC data; the average predicted mortality was 17.3%, an over-prediction of 7.1%. However, using NYC data from the full first wave or including a time effect did not correct this over-prediction; these models yielded over-predictions of 7.2% and 5.2%, respectively.

Need for mechanical ventilation or ICU admission

Although COPE significantly over-predicted ICU admission in second-wave patients in NL (Fig. 3A; calibration intercept −0.50 [−0.81; −0.19]; E-statistic 4.1%), it was well able to identify the patients at high risk of needing ICU admission, as expressed by good discriminative ability (AUC 0.83 [0.78; 0.89]) and substantially stronger predictor effects than in the development data (calibration slope 1.56 [1.11; 2.01]). COPE also substantially over-predicted the need for mechanical ventilation in NYC, possibly because it was designed to predict the need for ICU admission rather than the need for mechanical ventilation (Fig. 3B; calibration intercept −0.86 [−1.00; −0.71]; E-statistic 9.7%), but the discriminative ability (AUC 0.74 [0.71; 0.77]) was more in line with expectations (mb.c 0.71) and the calibration slope (1.13 [0.90; 1.37]) was much closer to ideal (slope 1).

Fig. 3
figure 3

Temporal and geographic validation: Performance of COPE and NOCOS for predicting the need for ICU admission (NL) or mechanical ventilation (NYC) in second-wave patients. Calibration plots of patients who were admitted since September 2020 in 4 NL hospitals and 12 NYC hospitals. Temporal validations of COPE and NOCOS are in panels A and C, respectively. Geographic validations of COPE and NOCOS are in panels B and D, respectively. n is number of patients; a = calibration intercept (0 is perfect); b = calibration slope (1 is perfect); c = AUC (0.5 is useless; 1 is perfect); mb.c = model-based AUC

NOCOS was well calibrated (Fig. 3C; calibration intercept −0.09 [−0.24; 0.07]; calibration slope 1.18 [0.85; 1.50]; E-statistic 1.7%) with a discriminative performance similar to expectation in second-wave NYC patients (AUC 0.69 [0.65; 0.74] versus mb.c 0.67). NOCOS predicted the need for ICU admission in NL very well on average (Fig. 3D; calibration intercept 0.15 [−0.16; 0.45]; E-statistic 2.3%), but the predictor effects were significantly stronger in NL (calibration slope 1.52 [1.05; 2.00]), also reflected by the much better discriminative ability than expected (AUC 0.80 [0.74; 0.87] versus mb.c 0.69).

In NL data, both COPE and NOCOS had a negative net benefit for decision thresholds over approximately 30% (Fig. 4A). For thresholds below this level, the net benefit of COPE was generally better than that of NOCOS. Surprisingly, recalibration of both models to the second-wave NL data led to worse net benefit in this range, probably because linear recalibration of the intercept and slope was insufficient. In the NYC data, the net benefit of COPE was negative for decision thresholds over 15% and clearly improved after recalibration of the intercept and the slope to the second-wave NYC data (Fig. 4B). The net benefit of NOCOS was positive for the full range of decision thresholds in the NYC data, and did not benefit from recalibration because NOCOS was already well calibrated in second-wave NYC data. After recalibration, the decision curves of COPE and NOCOS were quite similar.

Fig. 4
figure 4

Temporal and geographic validation: Net benefit of COPE and NOCOS for predicting need for ICU admission (NL) or mechanical ventilation (NYC) in second-wave patients. Decision curves of patients who were admitted since September 2020 in 4 NL hospitals (panel A) and 12 NYC hospitals (panel B). Net benefit is plotted against the full range of possible decision threshold probabilities for the original prognostic models (“COPE” in red and “NOCOS” in blue) and for these models with a calibrated intercept and slope (“COPE.recal” in dashed red and “NOCOS.recal” in dashed blue)

Discussion

We examined the performance of prognostic models developed on the “first wave” COVID-19 data to predict mortality during the second wave, both locally and in a different setting. The model developed in the Netherlands (COPE) had reasonably good performance in both settings, except with some over-prediction of risk in the NYC area. This performance was only achieved by carefully modeling the effect of secular changes during the first wave such that predictions were calibrated to yield risks consistent with the end of the first wave (August 2020). The model developed in the NYC area (NOCOS), greatly over-predicted risk in both NL and in NYC during the second wave, despite including a variable to capture the effect of calendar time in the first wave. These results underscore the need for caution when transporting prognostic models over time and space: sometimes these models work, but sometimes they don’t — and specifics matter. In particular, we observed that calibration may be especially sensitive to changes in setting, consistent with our prior work [11, 21].

It is unsurprising that models developed on data from March 2020 at the very beginning of the pandemic led to profound over-prediction of mortality risk during the second wave. Presumably, this in part reflects a “learning curve” as clinical management evolved rapidly over time. This might be due to the development of specific therapeutic approaches – including proning, minimizing paralytics, changes in ventilator volume settings, remdesivir, dexamethasone, and other treatments — as well as general improvements in supportive care, which may relate to the capacity of the health systems to cope with overwhelming volumes. Based on our findings, it appears that the “first wave effect” was more prolonged in NYC, since accounting for the secular trend within the first wave improved COPE predictions substantially but not NOCOS predictions. Again, this might be anticipated given the intensity of the pandemic in this region. Despite less than excellent performance on second-wave data, decision curve analysis generally showed positive net benefit across most thresholds, except at high-risk levels above which there were few patients.

An interesting finding from the models predicting the need for mechanical ventilation is that they both appeared to be under-fit, with stronger predictor effects (slope > 1.0) in second-wave data and a “paradoxical” improvement in discriminative ability on validation data. This suggests that mechanical ventilation might have been better targeted to patients at higher risk in the second wave in both settings.

A recent systematic review examined more than 200 COVID-19 models and found that these did not generally apply rigorous development methods [7]. All reviewed models demonstrated a high or unclear risk of bias when evaluated using the prediction model risk of bias assessment tool (PROBAST) [22]. We specifically used PROBAST as a guide when developing our model, to ensure methods consistent with a low risk of bias. Nevertheless, our results point to fundamental challenges of prediction when developing models during a dynamic pandemic, even when carefully adhering to good methodological practice. Techniques for dynamic updating of models may be needed in such circumstances [23,24,25,26,27]. While some of these challenges may be unique to COVID-19, recently the risk of poor model performance and the need for continual updating to avoid the potential for harmful decision-making has gotten increasing attention [9, 23, 26,27,28,29,30].

We note there are several limitations to our study. Performance of these models as measured here in second-wave data may not currently apply, since the pandemic has continued to evolve. In particular, the widespread dissemination of vaccines may well affect clinical presentation, patient risk, and predictor effects. So-called “breakthrough” COVID generally has a much lower mortality rate. These issues only strengthen the importance of our methodological conclusions. Another limitation is that we were limited to using variables that were routinely collected in both locations. In particular, we were limited to using variables present on ED presentation, which limited the performance of models used to predict outcomes in the subset of patients admitted to the ICU or placed on mechanical ventilation. Finally, the use of similar regression modeling strategies at both geographical sites may be considered a limitation.

Other investigators have underscored the fact that COVID-19 has posed many challenges for mathematical modelers, in particular, accurate forecasting of the pandemic has proven elusive [31]. Our findings underscore that the prediction of COVID-19 clinical outcomes may also have important challenges, since outcomes risks can be affected by variables that are not included in the model and that change over time and space, affecting the baseline risk and also modifying the effect of predictor variables in the model. In particular, mortality rates early in the epidemic were substantially higher than those later in the epidemic and this change over time was different across the two settings we examined. Additionally, improved targeted of mechanical ventilation over time led to paradoxically improved discrimination, although with poor calibration. These concerns point to the importance of dynamic model updating, which may need to be tailored to local circumstances, placing limits on the generalizability of global models.

We note that our study had several unique strengths. The databases used for model development were among the largest first-wave databases, including over 12,000 hospitalized patients from the NYC region and over 5000 hospitalized Dutch patients. They were both developed on multiple hospitals, also permitting rigorous internal-external validation approaches on model development and presumably improving model generalizability [32]. Unlike most prior models developed for in-hospital COVID-19 prognosis, we carefully adhered to methodological practices shown to be associated with a lower risk of bias [7, 33,34,35]. We used both conventional and novel measures of model performance, including decision curve analysis to assess clinical utility. Finally, ours is the only attempt we know of that has examined temporal and geographic validation of prognostic COVID-19 models in different pandemic waves and across different countries.

Future work should focus on methods for continuous dynamic model updating, including a comparison of different methods for updating [23, 25, 27, 36]. Furthermore, whether prognostic models improve process and clinical outcomes need to be studied, together with barriers and facilitators of their uptake in clinical practice.

Conclusions

NOCOS performed moderately worse than COPE, both at temporal and geographic validation, likely reflecting unique aspects of the early pandemic in NYC. Frequent updating of prognostic models is likely to be required to for transportability over time and space during a dynamic pandemic.