Introduction

At the start of the coronavirus disease 2019 (COVID-19) pandemic, prone positioning quickly became an important treatment strategy in the armamentarium of intensivists [1]. This was based on physiological plausibility inferred from clinical experience and clinical trials of proning in non-COVID-19 ARDS [2]. Proning has been shown to reduce mortality in moderate-to-severe ARDS. Proposed physiological mechanisms include improved gas exchange and altered lung mechanics facilitating lung-protective ventilation [3]. Gravitational forces may lead to improved drainage of respiratory secretions, re-expansion of collapsed lung parenchyma, and redistribution of aeration and pulmonary blood flow. This may improve lung compliance and ventilation–perfusion matching by reducing both shunting and dead space ventilation [3]. These effects in turn may facilitate lung-protective ventilation by reducing ventilator mechanical power while maintaining adequate gas exchange, and may therefore reduce the risk of ventilator-induced lung injury [4].

However, proning is not without risks. Recognized adverse events include endotracheal tube obstruction and dislodgement, decreased clearance of mucus, and loss of venous access [5]. In addition, turning patients requires a coordinated team effort, which is a logistical challenge, especially when operating at surge capacity in full personal protective equipment.

Therefore, predicting which critically ill COVID-19 patients will benefit from prone positioning may be of clinical value, and it should come as no surprise that labeling of responders and non-responders quickly became common practice [1]. Response to proning is defined based on intermediate physiological measurements related to shunting, dead space ventilation and respiratory system compliance. As recently reviewed, this short-term physiological response is, on average, consistent with that known from non-COVID-19 ARDS [3]. In contrast to non-COVID-19 ARDS, however, survival was shown to be significantly better in responders than in non-responders [3].

We hypothesized that machine learning techniques known for their classifying power and predictive performance could be used on highly granular electronic health record data from critically ill COVID-19 patients to discriminate responders from non-responders. To perform these analyses, we focused on PaO2/FiO2 ratio, ventilatory ratio, respiratory system compliance and mechanical power as outcomes defining responsiveness. We used the Dutch Data Warehouse (DDW) which contains more than 3,000 critically ill COVID-19 patients in 25 hospitals in the Netherlands [6].

Methods

The Medical Ethics Committee at Amsterdam UMC waived the need for patient informed consent and approved of an opt-out procedure for the collection of COVID-19 patient data during the COVID-19 crisis as documented under number 2020.156. This report adheres to the STROBE reporting guidelines [7].

Patients

We selected all intubated ICU patients admitted during the first and second COVID-19 wave between March 2020 and February 2021 with at least one registration in prone position. Subsequent turns into a prone position were included and patients intubated for less than 24 h were excluded. Prone positioning events were excluded if prone duration was measured as longer than 24 h which could indicate inaccurate registration of turning events. Berlin ARDS criteria were not formally documented [8].

Data preprocessing

Data from the DDW were filtered for unrealistic values (Additional file 1: Table S1). Individual measurements were aggregated for each hour starting from the admission time stamp. Parameters were forward filled for a variable amount of time based on clinical expertise following discussions with senior intensivists. For frequently measured parameters which are likely to be influenced by patient position, forward filling was limited to the time of position change (Additional file 1: Table S2). Missing values were derived (Additional file 1: Table S3). Further details may be found in Additional file 1.
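The bounded forward filling described in this subsection can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function name `forward_fill`, the per-parameter limit `max_hours` and the list of position-change hours are hypothetical, and the actual limits are those listed in Additional file 1: Table S2.

```python
from typing import List, Optional, Sequence


def forward_fill(
    hourly: Sequence[Optional[float]],
    max_hours: int,
    position_change_hours: Sequence[int] = (),
) -> List[Optional[float]]:
    """Carry the last observed value forward for at most `max_hours` hours.

    `hourly` holds one aggregated value per hour since admission (None = missing).
    A carried value is dropped at any hour listed in `position_change_hours`
    (illustrative rule for position-dependent parameters).
    """
    changes = set(position_change_hours)
    filled: List[Optional[float]] = []
    last_value: Optional[float] = None
    age = 0  # hours elapsed since last_value was actually measured
    for hour, value in enumerate(hourly):
        if hour in changes:
            last_value = None  # values from the previous position are not reused
        if value is not None:
            last_value, age = value, 0
        elif last_value is not None:
            age += 1
            if age > max_hours:
                last_value = None  # too old to carry any further
        filled.append(last_value)
    return filled


series = [10.0, None, None, None, 12.0, None]
print(forward_fill(series, max_hours=2))
print(forward_fill(series, max_hours=2, position_change_hours=[2]))
```

In this sketch a measurement recorded in the same hour as a position change is still kept; whether that matches the study's handling is an assumption.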

Outcome parameters

We created 5 different definitions to determine treatment success or failure for proning. We based these outcome definitions on PaO2/FiO2 ratio, ventilatory ratio, respiratory system compliance, and mechanical power, as well as a composite outcome. The composite outcome was defined as any improvement of 10% or more in PaO2/FiO2 ratio, ventilatory ratio or respiratory system compliance, without any deterioration of 10% or more in any of these parameters.
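As an illustration of the composite definition, the following sketch expresses each parameter as a signed relative improvement and then applies the 10% rule. The function names are hypothetical, and the direction of improvement (higher is better for PaO2/FiO2 ratio and compliance, lower is better for ventilatory ratio) is our reading rather than something stated explicitly in the text.

```python
from typing import Dict


def relative_improvement(before: float, after: float, higher_is_better: bool) -> float:
    """Signed relative change, oriented so that positive always means 'better'."""
    change = (after - before) / before
    return change if higher_is_better else -change


def composite_success(improvements: Dict[str, float], threshold: float = 0.10) -> bool:
    """True if any parameter improves by >= threshold and none worsens by >= threshold."""
    values = list(improvements.values())
    any_better = any(v >= threshold for v in values)
    any_worse = any(v <= -threshold for v in values)
    return any_better and not any_worse


event = {
    "pf_ratio": relative_improvement(100.0, 120.0, higher_is_better=True),
    "ventilatory_ratio": relative_improvement(2.0, 1.9, higher_is_better=False),
    "compliance": relative_improvement(40.0, 38.0, higher_is_better=True),
}
print(composite_success(event))
```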

Target values were determined closest to 4 h after turning to a prone position. As clinical practice introduces some variance in the timing of measurements, measurements from 1 to 7 h after proning were included. This time window was chosen because improvement was generally observable within a clinical shift, and it was supported by the median difference in PaO2/FiO2 ratio for each hour after prone positioning (Additional file 1: Fig. S1).

Ventilatory ratio was calculated as (minute volume * PaCO2)/(predicted body weight * 100 * 37.5) [9]. Mechanical power was calculated based on peak pressure and plateau pressure where available (Tidal Volume * (Peak Pressure – 0.5 * (Plateau Pressure – PEEP)) * Respiratory Rate * 0.1) or based on PEEP and pressure above PEEP otherwise (Tidal Volume * (PEEP + Pressure Above PEEP) * Respiratory Rate * 0.098), where pressure above PEEP is defined as peak inspiratory pressure minus PEEP [10,11,12] (Additional file 1: Table S3).
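The two derived parameters can be written out as a short sketch. The units are assumptions not stated in the text (minute volume in mL/min, PaCO2 in mmHg, predicted body weight in kg, tidal volume in L, pressures in cmH2O, mechanical power in J/min), the function names are hypothetical, and the first mechanical power formula is read as Peak Pressure minus half the driving pressure (Plateau Pressure minus PEEP).

```python
from typing import Optional


def ventilatory_ratio(minute_volume_ml_min: float, paco2_mmhg: float,
                      predicted_body_weight_kg: float) -> float:
    """(minute volume * PaCO2) / (predicted body weight * 100 * 37.5)."""
    return (minute_volume_ml_min * paco2_mmhg) / (predicted_body_weight_kg * 100 * 37.5)


def mechanical_power(tidal_volume_l: float, respiratory_rate: float, peep: float,
                     peak_pressure: float,
                     plateau_pressure: Optional[float] = None) -> float:
    """Mechanical power using the plateau-based formula when a plateau pressure exists."""
    if plateau_pressure is not None:
        driving_pressure = plateau_pressure - peep
        return tidal_volume_l * (peak_pressure - 0.5 * driving_pressure) * respiratory_rate * 0.1
    # Fallback: pressure above PEEP is peak inspiratory pressure minus PEEP.
    pressure_above_peep = peak_pressure - peep
    return tidal_volume_l * (peep + pressure_above_peep) * respiratory_rate * 0.098


print(ventilatory_ratio(7000, 37.5, 70))
print(mechanical_power(0.5, 20, peep=10, peak_pressure=30, plateau_pressure=25))
print(mechanical_power(0.5, 20, peep=10, peak_pressure=30))
```

The constants 0.1 and 0.098 are taken from the formulas as given in the text and are kept distinct here.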

These target values were compared to values obtained in the 3 h prior to turning. Turning patients to a prone position was labeled successful based on an improvement in outcome parameters of 10% or greater. This cut-off value was chosen to allow for small improvements to still be regarded as favorable in the most critically ill patients, while requiring a more pronounced effect in patients with more favorable physiology. Re-supination within 4 h after pronation was labeled as failure. As the time of registration of patient positioning may deviate from the actual moment of change in position, measurements in the same hour as registration of position changes were discarded from analysis.
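A minimal sketch of the labeling logic follows, assuming each measurement is stored with its time in hours relative to proning. The names `pick_target` and `label_event` are hypothetical, and the baseline is assumed to be a single value taken from the 3 h before turning.

```python
from typing import List, Optional, Tuple


def pick_target(measurements: List[Tuple[float, float]],
                target_hour: float = 4.0,
                window: Tuple[float, float] = (1.0, 7.0)) -> Optional[float]:
    """Return the value measured closest to `target_hour` within the window, else None.

    `measurements` is a list of (hours_after_proning, value) pairs.
    """
    candidates = [(abs(t - target_hour), v) for t, v in measurements
                  if window[0] <= t <= window[1]]
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[0])[1]


def label_event(baseline: Optional[float], target: Optional[float],
                resupinated_within_4h: bool = False,
                threshold: float = 0.10,
                higher_is_better: bool = True) -> Optional[bool]:
    """True = responder, False = non-responder, None = outcome not measurable."""
    if resupinated_within_4h:
        return False  # early re-supination is labeled as failure
    if baseline is None or target is None:
        return None
    change = (target - baseline) / baseline
    if not higher_is_better:
        change = -change
    return change >= threshold


pf_after = pick_target([(0.5, 100.0), (2.0, 130.0), (5.0, 140.0), (8.0, 150.0)])
print(pf_after, label_event(112.0, pf_after))
```

Events returning None in this sketch correspond to prone events without a measured outcome, which would be dropped per outcome definition.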

As a sensitivity analysis, we used 20% as a cut-off. We also used 20 mmHg as a cut-off for change in PaO2/FiO2 ratio, in line with previous research [3]. In addition, we created a second combined outcome defined as an improvement of at least 10% in either PaO2/FiO2 ratio, ventilatory ratio or respiratory system compliance without a deterioration of more than 10% in the other two. This new combined outcome served to represent a bedside definition of proning success, where proning should improve at least one physiological measure with a plausible association with outcome while not seriously worsening the clinical picture.

Features

Based on a combination of clinical expertise and correlation with PaO2/FiO2 ratio difference, a set of 80 candidate features was selected to prevent overfitting on too many features. These include feature augmentations created as rolling 2-h averages and 8-h slopes, as well as the last values of each outcome shortly before turning to a prone position (Additional file 1: Table S4). Missing data were imputed using median imputation for continuous features and imputed as absent for medical history. To ensure the inclusion of clinically essential parameters, we combined the data for specific sub-parameters [9, 10]. Static and dynamic respiratory compliance were aggregated into a single compliance parameter where static compliance took precedence over dynamic compliance if both were available. Furthermore, driving pressure and pressure above PEEP were combined into a delta-pressure parameter. Previous medical history was aggregated into two groups based on respiratory system involvement as these were deemed likely to have a comparable effect on the outcome parameters. Finally, if a previous proning event was labeled as successful, the patient was marked as a previous responder to prone positioning in the next event. Numeric data were normalized for each analysis. Further details may be found in the online Additional file 1.
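The feature augmentations can be sketched for a single hourly series as follows. The text does not state how slopes were computed; an ordinary least-squares slope over the last 8 hourly values is assumed here, and the function names are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def rolling_mean(values: Sequence[Optional[float]], window: int) -> Optional[float]:
    """Mean of the non-missing values among the last `window` hourly values."""
    recent = [v for v in values[-window:] if v is not None]
    return mean(recent) if recent else None


def slope(values: Sequence[Optional[float]], window: int) -> Optional[float]:
    """Least-squares slope (units per hour) over the last `window` hourly values."""
    points = [(i, v) for i, v in enumerate(values[-window:]) if v is not None]
    if len(points) < 2:
        return None  # a slope needs at least two observations
    x_mean = mean(x for x, _ in points)
    y_mean = mean(y for _, y in points)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in points)
    denominator = sum((x - x_mean) ** 2 for x, _ in points)
    return numerator / denominator


# Hypothetical hourly values for one parameter, most recent value last.
hourly = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(rolling_mean(hourly, window=2), slope(hourly, window=8))
```

Features that remain None after this step would then be handled by the median imputation described above.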

Modeling

Data were split into a training and a test set, where subsequent turns of the same patient were kept in the same set to prevent leakage of information. For classification modeling, we used logistic regression, Random Forest, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Gaussian Naive Bayes (GNB) and XGBoost (XGB) for each of the classification targets. These models were selected based on their general predictive performance and prevalence in medical literature [13,14,15]. For each of the models, hyper-parameter optimization was performed using a grid search on the training set. Contribution to the predictions was evaluated using feature importances, permutation importances and absolute coefficients (Additional file 1: Tables S5, S6).
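The patient-level split can be illustrated with a minimal sketch in which every prone event carries a patient identifier. The test fraction, the seed and the function name `grouped_split` are assumptions for illustration; the split ratio actually used is not given in this section.

```python
import random
from typing import List, Sequence, Tuple


def grouped_split(patient_ids: Sequence[str], test_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split event indices so that all events of one patient land in the same set."""
    unique_patients = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique_patients)
    n_test = max(1, round(len(unique_patients) * test_fraction))
    test_patients = set(unique_patients[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx


# One entry per prone event; repeated IDs are subsequent turns of the same patient.
events = ["a", "a", "b", "c", "c", "c", "d", "e"]
train_idx, test_idx = grouped_split(events)
print(train_idx, test_idx)
```

Splitting on patients rather than on events means the realized test fraction in events can differ from `test_fraction`. Libraries such as scikit-learn offer comparable group-aware splitters (for example `GroupShuffleSplit`).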

Results

Of a total of 3600 patients, 1142 were recorded with at least one prone positioning event, for a total of 3619 prone events, with a median of 2 (IQR 1–4) and a maximum of 27 prone events per patient. The median last PaO2/FiO2 ratio in the 3 h before turning to a prone position was 112 mmHg (IQR 87–142, N = 2958). At 4 h after prone positioning, the median difference in PaO2/FiO2 ratio was 14.9 mmHg (IQR −5 to 41, N = 2211). Median time spent in a prone position was 17 h (IQR 11–20, N = 2632) (Table 1). Median PEEP was 12 cmH2O (IQR 10–14.3, N = 3332) before prone positioning and 12 cmH2O (IQR 10–14.3, N = 3303) after prone positioning. The overall ICU mortality of patients who had spent time in a prone position was 424 out of 1142 (37.1%), with a median length of stay of 15.6 days (IQR 9–26.5). Further details can be found in Additional file 1: Table S7.

Table 1 Patient characteristics including clinically relevant features used for prediction

This initial dataset of prone events was reduced for each outcome to contain only observations with a measured outcome. This resulted in a dataset of 1289 prone events for the composite outcome, 1820 prone events for the PaO2/FiO2 ratio outcome, 1626 prone events for the ventilatory ratio outcome, 1829 prone events for the mechanical power outcome and 2140 prone events for the compliance outcome. Outcome labels ranged from balanced for the PaO2/FiO2 ratio outcome (52.3% success rate) to moderately imbalanced for the composite outcome (15.8% success rate).

Predictive performance varied across models and outcomes. The most accurate predictions originated from Logistic Regression, Random Forest and XGBoost on relative improvement of PaO2/FiO2 ratio, with a comparable area under the receiver operating characteristic curve (ROC AUC) of 0.59–0.62, while Gaussian Naive Bayes and Support Vector Machine provided an ROC AUC close to 0.5 (Fig. 1, Additional file 1: Table S8). For other outcomes, the ROC AUC was relatively close to 0.5. The F1-score, the harmonic mean of precision and recall, was between 0.64 and 0.67 for all models predicting PaO2/FiO2 ratio, but lower for most models on the other outcomes, with the exception of logistic regression (0.61) and XGBoost (0.58) for mechanical power (Fig. 2, Additional file 1: Table S9). Results were similar for sensitivity analyses where the cut-off was set to 20% or 20 mmHg, and for sensitivity analyses where no minimal prone duration was required (Additional file 1: Tables S10, S11).
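For readers less familiar with these metrics, the sketch below computes both from first principles on a toy example. It is illustrative only and is not the evaluation code used in this study; the function names and toy data are hypothetical.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random responder is scored above a random non-responder.

    Ties count as half a win; this is equivalent to the area under the ROC curve.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = [1, 1, 0, 0]
print(roc_auc(labels, [0.9, 0.4, 0.6, 0.1]))
print(f1_score(labels, [1, 0, 1, 0]))
print(f1_score([1, 0], [1, 1]))  # labeling every event a success on balanced labels
```

The last line illustrates that, on perfectly balanced labels, labeling every event as a success already yields an F1-score of 2/3, which is worth keeping in mind when reading F1-scores alongside ROC AUC.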

Fig. 1

Model performance by ROC AUC score for predicting improvement in various outcome parameters after turning patients to a prone position. The ROC AUC summarizes the trade-off between the true positive rate and the false positive rate; a value of 1.0 reflects perfect discrimination, whereas 0.5 corresponds to random guessing. LR logistic regression, RF random forest, KNN K-Nearest Neighbors, SVM support vector machine, GNB Gaussian Naïve Bayes, XGB eXtreme Gradient Boosting

Fig. 2

Model performance by F1-score for predicting improvement in various outcome parameters after turning patients to a prone position. The F1-score combines the precision (positive predictive value) and recall (sensitivity) scores into a single metric for comparing model performance, where 1.0 reflects perfect performance and 0.0 the worst performance. LR logistic regression, RF random forest, KNN K-Nearest Neighbors, SVM support vector machine, GNB Gaussian Naïve Bayes, XGB eXtreme Gradient Boosting

The underlying contribution of features to the predictive performance was generally low, and the most predictive features showed little consistency across models (Table 2, Additional file 1: Tables S12, S13). Correlation between successful outcome of a previous proning episode and successful outcome of a subsequent proning episode was virtually absent. Correlation between PEEP levels before turning to a prone position and a successful outcome was absent as well (Table 3).

Table 2 Features ranked by feature importance, permutation importance or absolute coefficient values for outcome PaO2/FiO2 prediction
Table 3 Correlation of previous response to prone positioning or PEEP levels with the outcome label for each of the outcomes

Discussion

This is the first study to attempt to use machine learning techniques in predicting prone positioning responsiveness in intubated critically ill patients with COVID-19 using routinely registered data in the electronic health records. The authors are also not aware of any papers using traditional statistical regression-based techniques to infer predictors for the success of prone positioning. Despite extensive modeling using a plethora of machine learning techniques and inclusion of a large number of potentially clinically relevant features, discrimination between responders and non-responders based on commonly used physiological outcomes remained poor.

Notably, not even being a previous responder to prone positioning showed any meaningful contribution to the prediction of a next response. While expecting a similar response as before may seem intuitive, lung and respiratory system physiology can change rapidly with progression of disease or through the effect of therapy. Therefore, relying on previous success or failure may be suboptimal.

These findings of poor predictive performance are clinically important given the suggested, although debated, benefit of prone positioning, including on mortality, reported in previous literature. In this context, prone positioning should not be withheld in mechanically ventilated COVID-19 patients based on their characteristics or previous proning failure, despite the extra work involved and potential adverse effects [3].

Among the most consistently important features were the last known PaO2/FiO2 ratio, along with FiO2 slopes. However, their impact varied greatly among models and predictive performance remained poor (Additional file 1: Tables S12, S13).

Mortality in proned COVID-19 patients was 424 out of 1142 patients (37.1%) in this study, while overall mortality for all mechanically ventilated COVID-19 patients was previously shown to be circa 30%, and overall ICU mortality for all COVID-19 patients was 24.4% [6, 16]. This trend is expected as each step corresponds to an increased severity of disease and thus a decreased chance of survival.

In one study on COVID-19 ARDS, survival was better in responders than in non-responders to proning in terms of oxygenation [3]. However, this relationship is subject to debate for non-COVID-19 ARDS. Notably, in the landmark PROSEVA trial that showed an important mortality benefit for proning, there was no evidence that this mortality effect only accrues in patients who show a physiological response to proning [2, 17]. In our study, the association between response in P/F ratio and survival was weak. This should encourage a liberal approach to proning regardless of short-term improvements in physiology given the evidence supporting its effect on survival.

Strengths of this study include the use of large amounts of routinely collected granular data from a large multi-center database, the rigorous and extensive modeling attempts and the broad approach to defining short-term response to proning based on the principles of shunting, dead space ventilation and respiratory system compliance.

However, the physiological effect of proning may be difficult to predict at least partly due to limitations inherent in the chosen outcome variables. Ventilatory ratio is a poor surrogate for dead space ventilation and ventilation/perfusion match. PaO2/FiO2 ratio is a relatively poor surrogate for shunt and ventilation/perfusion match. Pressure above PEEP is a weak surrogate for plateau pressure, feeding into static respiratory system compliance, which is itself a poor surrogate for lung compliance.

Defining the exact cut-off for success is therefore non-trivial. Based on previous literature, we may select an absolute change of at least 20 mmHg [3]. But a relative change may be closer to clinical practice in which physicians may still consider smaller improvement successful in the most severe cases. Nevertheless, sensitivity analyses showed no major differences in predictive performance when adjusting these cut-off values.

This study also comes with limitations. For some potential features, data availability was limited due to varying frequencies of measurements, and imputation strategies were necessary. In addition, imaging data as well as measurements requiring maneuvers such as an inspiratory hold were mostly unavailable. Individual data points were not manually validated due to the vast amount of data, although most evident data entry errors were removed through preprocessing. These models were trained on data from 25 different hospitals, but external validation is needed to generalize these findings. Furthermore, as disease and treatment change over time, this data drift may influence future applicability. Finally, our absence of evidence should not be regarded as evidence of absence. It is certainly conceivable that future developments in machine learning combined with increasing availability of data might facilitate better discrimination between physiological responders and non-responders.

Conclusion

The physiological response to prone positioning of COVID-19 ARDS patients could not be reliably predicted with highly granular EHR data using novel machine learning techniques. Predictors for physiological improvement were inconsistent, and earlier response to proning showed no correlation with future responses. Although definitive proof of unpredictability cannot be provided, we have shown that current EHR data are insufficient to aid in the decision to turn patients to a prone position. Therefore, the decision to turn a patient to a prone position should be based on group-level evidence and only be omitted based on individual contra-indications.