The decision to extubate a COVID-19 patient is challenging and involves a delicate trade-off between early and postponed extubation. In non-COVID patients, extubation failure occurs in 10–20% of intensive care cases and is associated with increased mortality [1]. While postponing extubation and waiting for further clinical improvement appears sensible, unnecessary extubation delays may lead to more ventilator-associated complications and inefficient use of scarce intensive care resources [2, 3].

An understanding of the risk factors for extubation failure will aid the clinician in determining the optimal time point for extubation. Previous studies in non-COVID-19 patients have investigated numerous factors related to extubation outcome, including age, maximum inspiratory pressure, and the rapid shallow breathing index [4]. However, given the complex interplay of many patient- and treatment-related characteristics in extubation success, a single parameter rarely provides sufficient accuracy to guide decision making [5]. Moreover, it remains largely unclear whether these parameters are similar for COVID-19 patients [6].

The collection of large intensive care datasets that span the entire intensive care admission paves the way for machine learning models to capture this complex interplay of predictors. Previous non-COVID-19 machine learning work has aimed to predict simple and difficult weaning [7] and extubation failure [8,9,10,11,12,13,14,15]. However, data were frequently more than a decade old, mechanical ventilator data were usually lacking, and no data from COVID-19 patients were included. Taken together, we identify an opportunity for machine learning models to predict unsuccessful extubation in critically ill COVID-19 patients.

We created the Dutch Data Warehouse (DDW), a multicenter database with critically ill COVID-19 patients [16]. All structured electronic health record (EHR) data for these patients have been combined and cleaned for research purposes. These data therefore represent the structured EHR data readily available to the intensivist at the bedside. In this study, we aim to identify and validate the most important predictors for extubation failure in COVID-19 patients.


This study follows the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guidelines [17].

Data source

All data came from the DDW, a large, multicenter, full-admission, electronic health record data warehouse with data from critically ill COVID-19 patients in the Netherlands [16]. The data warehouse currently contains 3464 patients admitted between the beginning of the crisis in March 2020 and March 2021. The data span both the first and second wave of ICU admissions from 25 hospitals in the Netherlands. The institutional review board of Amsterdam University Medical Center location VUmc waived the need for informed consent from individual patients and approved an opt-out procedure.


All critically ill patients extubated after more than 24 h of invasive mechanical ventilation were eligible for inclusion. Transferred patients were included if the transfer destination data were available. We excluded patients transferred before extubation or within 1 day after extubation if the transfer destination data were not available. Patients transferred more than 24 h after extubation were assumed to be fit for transport and classified as successful extubations. Patients still admitted at the time of data collection were excluded.


The primary outcome was unsuccessful separation from invasive mechanical ventilation. Successful separation was defined according to the WIND criteria [18] as extubation without reintubation or death within the next 7 days, or discharge from the ICU without invasive mechanical ventilation within 7 days. The use of non-invasive ventilation is disregarded in this definition. As secondary outcomes, we applied the same criteria to a 48 h time window after extubation. The definition of extubation in EHR data has been published previously and reasonably excludes palliative care patients [16]. We did not distinguish between accidental and elective extubations, as the reason for extubation is not routinely recorded.
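
As a minimal sketch of how this outcome definition could be applied to timestamps derived from EHR data, the function below labels a single extubation. The function name, its arguments, and the representation of events as hours after extubation are assumptions made for illustration and are not taken from the DDW code.

```python
from typing import Optional

def extubation_failed(reintubation_h: Optional[float],
                      death_h: Optional[float],
                      window_h: float = 7 * 24) -> bool:
    """Return True if reintubation or death occurred within the window.

    Times are hours after extubation; None means the event was not observed.
    Non-invasive ventilation is deliberately ignored, as in the definition.
    """
    events = [t for t in (reintubation_h, death_h) if t is not None]
    return any(0 <= t <= window_h for t in events)

# Hypothetical patient reintubated 60 h after extubation.
print(extubation_failed(reintubation_h=60, death_h=None))               # True
print(extubation_failed(reintubation_h=60, death_h=None, window_h=48))  # False
```

The same patient counts as a failure for the 7-day outcome but as a success for the 48 h outcome, which illustrates why the two outcomes can yield different failure rates.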

Predictors and scoping literature search

Potential predictors for modeling were selected by a team of intensivists. Notably, the list included medication and fluid balance. To facilitate the selection process, machine learning studies that predict extubation failure were identified in the literature. Each of the identified articles was scanned full-text and the included predictors were extracted. The total list of studies can be found in Additional file 1: Table S1. In addition, to account for the wide variety of ventilator settings in the DDW, the parameters from the landmark paper by Amato et al. on the association between ventilator parameters and outcome were included in the selection [19]. The mean or last value from the last 24 h before extubation, as specified by the team of intensivists, was included to facilitate interpretation of the model. The total dose in the last 24 h was included for the medications. For any predictor pair with an interpredictor correlation higher than 0.5, the most clinically insightful predictor was selected. The full list of predictors can be found in Table 1.
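
The correlation screening step can be sketched as follows. This is an illustrative example only: the predictor names and values are made up, and the use of the absolute Pearson correlation is an assumption, since the text does not state which correlation measure was used. The choice of which predictor to keep in a flagged pair remains a clinical judgment.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(columns, threshold=0.5):
    """List predictor pairs whose absolute correlation exceeds the threshold."""
    return [(a, b) for a, b in combinations(sorted(columns), 2)
            if abs(pearson(columns[a], columns[b])) > threshold]

# Toy values for three hypothetical predictors; not DDW data.
toy = {
    "fio2_last": [30, 35, 40, 50, 60],
    "pf_ratio_mean": [320, 290, 250, 200, 150],
    "crp_mean": [80, 20, 120, 40, 60],
}
print(correlated_pairs(toy))  # pairs to hand to the clinicians for review
```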

Table 1 Included parameters


Across all 25 hospitals, a nested cross validation was performed to assess model performance. First, the data was split into five equally large sets called outer folds. These outer folds were then each split into a train and test set. Each of the train sets was again divided into five subsets called the inner folds. A model was trained on these 5 inner folds with a randomized hyperparameter search. Model performance after training on these inner folds was then tested on the corresponding outer fold test set. Importantly, observations belonging to the same patient were always kept in the same split to prevent leakage of information. The overall model performance was the average of all outer fold test set performances.
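
A minimal sketch of patient-level nested splitting is given below. It assumes the common form of nested cross validation in which each outer fold serves once as the test set and the inner folds are drawn from the remaining training patients. The function names and patient identifiers are hypothetical.

```python
import random

def patient_folds(patient_ids, k=5, seed=0):
    """Split unique patient IDs into k roughly equal folds."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

def nested_splits(patient_ids, k_outer=5, k_inner=5, seed=0):
    """Yield (inner_folds, outer_test_patients) for each outer iteration."""
    outer = patient_folds(patient_ids, k_outer, seed)
    for i, test in enumerate(outer):
        train = [p for j, fold in enumerate(outer) if j != i for p in fold]
        # Inner folds contain training patients only and would be used
        # for the hyperparameter search.
        yield patient_folds(train, k_inner, seed + 1), test

patients = [f"p{n:03d}" for n in range(50)]  # hypothetical patient IDs
for inner, test in nested_splits(patients):
    train = {p for fold in inner for p in fold}
    assert train.isdisjoint(test)  # no patient appears on both sides
```

Because the folds are built from patient identifiers rather than from individual observations, all observations of a patient necessarily end up in the same split.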

We trained a logistic regression model, decision trees, and an XGBoost algorithm. These models were selected for their ease of determining predictor importance. Model performance was gauged with the area under the receiver operating characteristic curve (AUROC), Brier score, average precision, and calibration loss. Data imputation, standardization and automated feature selection were carried out on each outer fold separately. Missing values were imputed with the median and predictors were standardized to have a mean of 0 and a standard deviation of 1. Lasso regression was performed for automatic feature selection, and the L1 regularization term was optimized together with the other hyperparameters [20].
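
The imputation and standardization steps can be illustrated with a short sketch. It assumes that the median, mean and standard deviation are learned from the training portion of a fold and then applied unchanged to the test portion, which is one common way to avoid information leakage. Whether the study implemented the steps exactly this way is not stated, and the function names and values are hypothetical.

```python
from statistics import mean, median, pstdev

def fit_preprocessor(train_column):
    """Learn median, mean, and SD from the training values of one predictor."""
    observed = [v for v in train_column if v is not None]
    med = median(observed)
    filled = [med if v is None else v for v in train_column]
    return {"median": med, "mean": mean(filled), "sd": pstdev(filled)}

def transform(column, params):
    """Impute with the training median, then standardize with training stats."""
    filled = [params["median"] if v is None else v for v in column]
    return [(v - params["mean"]) / params["sd"] for v in filled]

train_crp = [10, None, 30, 50, 110]  # toy values; None marks a missing value
test_crp = [None, 80]
params = fit_preprocessor(train_crp)
print(transform(test_crp, params))
```

A constant predictor would give a standard deviation of 0 and would need separate handling, which this sketch omits.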

Predictor importance was estimated with the Shapley additive explanation (SHAP) framework. SHAP values represent a predictor’s marginal contribution to the overall prediction [21] and are state of the art in machine learning explainability. Moreover, partial dependence plots (PD-plots) were created to visualize the average change in probability of successful extubation for all values of a predictor while keeping all other predictors constant [22]. Standard deviations represent the distribution of the data. All analyses were carried out in Python 3.8 (Python Software Foundation).
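
To make the notion of a marginal contribution concrete, the sketch below computes exact Shapley values by brute force for a toy three-predictor risk model. It is not the SHAP library and not the study's model: the coefficients, predictor names and reference patient are invented, and replacing absent predictors with reference values is a simplifying assumption. The SHAP framework relies on more efficient estimation methods.

```python
from itertools import permutations
from math import exp

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Predictors absent from a coalition are set to their baseline value.
    """
    names = list(x)
    phi = dict.fromkeys(names, 0.0)
    orders = list(permutations(names))
    for order in orders:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = x[name]       # add this predictor to the coalition
            value = predict(current)
            phi[name] += (value - previous) / len(orders)
            previous = value
    return phi

def toy_risk(f):
    """Hypothetical failure-risk model; coefficients are made up."""
    score = -2.0 + 0.05 * (f["fio2"] - 30) + 0.01 * f["crp"] - 0.1 * (f["gcs"] - 15)
    return 1 / (1 + exp(-score))

patient = {"fio2": 45, "crp": 120, "gcs": 13}
reference = {"fio2": 30, "crp": 40, "gcs": 15}
phi = shapley_values(toy_risk, patient, reference)
print(phi)
```

The contributions sum to the difference between the patient's predicted risk and the reference risk, which is the additivity property that SHAP plots rely on.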


Population and outcome

A total of 2421 patients were mechanically ventilated during their ICU stay. In case of a patient transfer, data from the transferring and receiving hospital were merged when available. We excluded 517 transfers for which outcome or admission data were lacking, 123 patients who were still intubated when data were extracted, and 139 patients who were intubated for less than 24 h. A further 568 patients died on the mechanical ventilator before their first extubation attempt and 191 patients received a tracheostomy. As a result, a total of 883 patients were included in the modeling. The reintubation rate in this COVID-19 population was 18.9% within 7 days and 13.4% within 48 h. The mortality rate was 1.0% within 7 days and 0.6% in the first 48 h after extubation. Patient characteristics are outlined in Table 2.
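
The patient flow reported above can be checked with simple arithmetic. The snippet uses only the counts given in the text; the variable names are illustrative.

```python
ventilated = 2421
exclusions = {
    "transfers lacking outcome or admission data": 517,
    "still intubated at data extraction": 123,
    "intubated for less than 24 h": 139,
    "died before first extubation attempt": 568,
    "received a tracheostomy": 191,
}
included = ventilated - sum(exclusions.values())
print(included)  # 883
```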

Table 2 Patient characteristics


Model performance for the primary outcome is shown in Additional file 1: Table S2 for each of the models. The XGBoost algorithm yielded the highest performance with an AUROC of 0.70, outperforming logistic regression (AUROC 0.67) and a decision tree (AUROC 0.59). Model performance for the prediction of unsuccessful extubation 48 h after extubation is presented in Additional file 1: Table S2. All algorithms, XGBoost (AUROC 0.67), logistic regression (AUROC 0.66), and a decision tree (AUROC 0.54), performed worse than for the primary outcome.
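
For readers less familiar with these metrics, the sketch below shows how the AUROC and the Brier score can be computed from outcomes and predicted probabilities. The labels and probabilities are toy values and the function names are illustrative; they do not reproduce the study results.

```python
def auroc(y_true, y_prob):
    """Probability that a random failure is ranked above a random success."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Ties between a failure and a success count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

y = [1, 0, 0, 1, 0]            # 1 = extubation failure (toy labels)
p = [0.6, 0.2, 0.7, 0.4, 0.1]  # toy predicted probabilities
print(round(auroc(y, p), 3), round(brier_score(y, p), 3))
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, while a lower Brier score indicates better calibrated and more accurate probabilities.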

Predictor importance

Predictor importance was calculated with the XGBoost model since it yielded the highest performance. The SHAP values for the most important predictors are shown in Fig. 1. The most important predictive feature of extubation failure was the last FiO2 value before extubation. The majority of important predictors can be grouped into ventilatory characteristics, inflammation markers, neurological status and body mass index.

Fig. 1

SHAP values for most important predictors of extubation failure. Overview of SHAP values for the top 20 predictors of successful extubation (negative SHAP values) or unsuccessful extubation (positive SHAP values). Features are ordered according to importance. FiO2: fraction of inspired oxygen, IBW: ideal body weight, PEEP: positive end expiratory pressure, P/F ratio: PaO2/FiO2 ratio

Ventilatory characteristics

Ventilatory characteristics are shown in Table 2. A short time period between the last controlled mode and extubation, and a longer duration in controlled mode throughout the course of mechanical ventilation, were associated with unsuccessful extubation. The PD-plots depict the difference in predicted probability of extubation failure compared to the median value for all of the observed values. The PD-plot shows that a time since the last controlled mode shorter than 2 days and a controlled mode duration longer than 4 days are associated with increased chances of unsuccessful extubation compared to the median value.
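
The way a PD-plot expresses risk relative to the median can be sketched as follows. The risk model, predictor names and patient rows are invented for illustration, and the function names are hypothetical. The sketch only shows the mechanics of forcing one predictor to a grid of values, averaging the predictions, and subtracting the value at the observed median.

```python
from math import exp
from statistics import median

def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    return [sum(predict({**r, feature: g}) for r in rows) / len(rows)
            for g in grid]

def pd_relative_to_median(predict, rows, feature, grid):
    """Partial dependence minus its value at the observed median."""
    med = median(r[feature] for r in rows)
    ref = partial_dependence(predict, rows, feature, [med])[0]
    return [v - ref for v in partial_dependence(predict, rows, feature, grid)]

def toy_risk(f):
    """Hypothetical failure-risk model; coefficients are made up."""
    return 1 / (1 + exp(-(-2.0 + 0.05 * (f["fio2"] - 30) + 0.01 * f["crp"])))

rows = [{"fio2": 30, "crp": 40}, {"fio2": 35, "crp": 90}, {"fio2": 50, "crp": 150}]
curve = pd_relative_to_median(toy_risk, rows, "fio2", [25, 35, 45])
print(curve)  # negative below the median FiO2, zero at it, positive above
```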

For the ventilator settings, a higher fraction of inspired oxygen and a higher average tidal volume in the last 24 h are predictive of extubation failure. The PD-plot in Fig. 2 shows that an FiO2 above 35% or a tidal volume per kg ideal body weight above 8 ml/kg compared to their median values increases the probability of unsuccessful extubation. The median PEEP was 8 cmH2O (IQR 5–8 cmH2O) before extubation, with a median pressure support of 6 cmH2O (IQR 5–9 cmH2O). No patients received PEEP levels below 5 cmH2O, while pressure above PEEP was below 5 cmH2O in 7.3% of patients.
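
As an illustration of the tidal volume threshold, the sketch below converts a tidal volume in ml to ml/kg ideal body weight. The text does not state which ideal body weight formula was used; the ARDSNet predicted body weight formula is assumed here, and the function names and example patient are hypothetical.

```python
def ideal_body_weight_kg(height_cm: float, sex: str) -> float:
    """Assumed formula (ARDSNet predicted body weight); not confirmed by the study."""
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    base = 50.0 if sex == "male" else 45.5
    return base + 0.91 * (height_cm - 152.4)

def tidal_volume_per_kg(tidal_volume_ml: float, height_cm: float, sex: str) -> float:
    """Tidal volume normalized to ideal body weight, in ml/kg."""
    return tidal_volume_ml / ideal_body_weight_kg(height_cm, sex)

# Hypothetical patient: 175 cm male with a mean tidal volume of 600 ml.
vt = tidal_volume_per_kg(tidal_volume_ml=600, height_cm=175, sex="male")
print(round(vt, 1), vt > 8)  # 8.5 True
```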

Fig. 2

Partial dependence plots. PD-plots for the last FiO2 recording, the mean Glasgow Coma Scale score and tidal volume per kg ideal body weight in the last 24 h, and the duration of the controlled mode

Inflammation markers, neurological scores and body mass index

A higher CRP, an elevated leukocyte count and a higher thrombocyte count in the 24 h preceding extubation are predictors of an unsuccessful extubation attempt, while temperature was not among the top predicting features. For neurological scores, on the other hand, low EMV scores predict unsuccessful extubation. Lastly, BMI showed an inverse relationship with extubation failure; patients with a higher BMI had a lower probability of extubation failure. An increase in the chances of unsuccessful extubation is observed below 28 kg/m2 compared to the median in the PD-plot (shown in Additional file 1: Fig. S1).


To the best of our knowledge, this is the first study that identifies predictors for extubation failure in critically ill COVID-19 patients from a large and multicenter cohort that contains a wide variety of routinely collected clinical predictors. The most important predictors of extubation failure are ventilatory characteristics, inflammatory parameters, GCS score, and body mass index. These risk factors may aid intensive care professionals in selecting the optimal time point for extubation.

This study is unique as it provides predictive modeling of extubation failure across twenty-five hospitals. All previous machine learning studies in non-COVID patients for predicting extubation failure have been single center [7,8,9,10,11,12,13,14,15]. Model performance was higher in these studies, presumably due to overfitting resulting from the sole use of local data. Algorithms may be biased towards local extubation practices and extubation readiness assessments, making these models less generalizable to other clinical settings.

In our study, ventilatory characteristics, including ventilator settings, are the most important risk factors for extubation failure. These factors are systematically and frequently recorded by the ventilators, and are potentially modifiable. Two of the most important predictors of extubation outcome are the durations of the controlled and assisted ventilation modes prior to extubation. A longer time in a controlled mode was a stronger predictor of failure than the total duration of mechanical ventilation. Moreover, a longer time in assisted mode was associated with improved chances of successful extubation. A possible explanation may be the reduced activity and consequent atrophy of the diaphragm or other skeletal muscles in controlled modes [23, 24]. Of note, none of the previous machine learning studies included the duration of controlled ventilation as a predictor. Our results show that the duration of ventilation modes should be recorded and taken into account when assessing extubation readiness.

For the ventilator settings, a higher FiO2 before extubation was associated with an increased risk of extubation failure. A higher FiO2 may indicate incomplete resolution of pulmonary pathology. Higher PEEP levels, on the other hand, were associated with better extubation success. The interquartile ranges of PEEP are low, however, indicating that low PEEP is common practice before an extubation attempt. In addition, we observed that higher mean tidal volumes corrected for the ideal body weight in the last day before extubation were an important predictor of extubation failure. Patients with high average tidal volumes may suffer from more lung injury that may increase the risk of unsuccessful extubation [25]. While most of the ventilator settings are readily available, relevant respiratory system maneuvers that would ideally be included, such as spontaneous breathing trials, tracheobronchial suctioning and maximum inspiratory pressure, were inconsistently recorded in the EHR systems and therefore not included in modelling. To evaluate their predictive importance in extubation failure, data on these maneuvers need to be incorporated systematically in the EHR.

Other important predictors included signals of ongoing or developing inflammation, poorer neurological status, and body mass index. Inflammation parameters are routinely determined in most intensive care units when extubation decisions are made. Conversely, neurological scores can be ambiguously scored in the intensive care unit. The Glasgow Coma Scale was originally designed for patients with brain injury [26], but is used for the general intensive care patient. Equivocal interpretation of sedated states, however, may hamper the use of this scale in the context of extubation readiness. Based on these results, we would recommend systematically recording and evaluating the predictive value of other scores such as the Richmond Agitation-Sedation Scale.

Lastly, body mass index upon admission had an inverse relationship with extubation failure. Apart from one small study that found an association between BMI and post-extubation stridor [27], no other studies were identified that found BMI to be an important predictor. As in any predictive study, the effect of BMI may be explained by an unmeasured predictor or a selection bias; that is, a low-BMI patient would have to be sicker to be admitted to the ICU. However, only a negligible correlation was found between BMI and the SOFA score, an indicator of illness severity. Previous studies have also shown that BMI is uncorrelated with immunological responses or adverse outcomes [28]. Overall, once a patient is in the ICU, a higher BMI is not associated with higher chances of unsuccessful extubation and may not be a valid reason to postpone extubation.

Our study has several limitations. We aimed to apply a holistic set of predictors across centers to assess extubation readiness. In routine practice, however, individualized treatment and diagnostic decisions result in variation of available parameters [29], and predictors may be unavailable in the 24 h prior to extubation. For example, it is not possible to conclude that cardiac markers like NT-pro-BNP or troponin do not aid in the prediction of extubation failure, because these markers were not routinely determined. Along the same lines, we had to merge groups of medications, because individual drugs may not be administered frequently enough to be useful in the modeling. To truly exploit the predictive power of machine learning models, we should strive to systematically record the predictors of interest and determine which algorithms work in what clinical circumstances [30].

A further limitation is the missing outcome data because of patient transfers to centers not included in this project. The potential bias is considered small, as we connected all patients’ stays whenever available and transferred patients had baseline characteristics similar to those of the study population as a whole [31]. Lastly, the relationships identified in this study are associations and do not equal causation. As with any clinical observational dataset, we cannot observe counterfactual states; once a patient is extubated, we irretrievably lose the outcome that would have occurred had the patient been kept on mechanical ventilation. While many of the ventilatory settings are predictive of extubation failure, we would ultimately be interested in the effects of continuing mechanical ventilation for another day on extubation success. We believe that these results will provide a crucial step for other study designs to investigate the causal relation between modifiable predictors and successful extubation.


This is the first study to identify risk factors of extubation failure in a large multicenter cohort of critically ill COVID-19 patients. The large number of hospitals included limits the risk of overfitting due to specific local practices. From a large set of clinically important predictors, ventilatory characteristics, inflammatory markers, neurological status and BMI were the most important predictors of failed extubation. These predictors should be taken into account to determine extubation readiness.