Introduction

Acute respiratory distress syndrome (ARDS) is a devastating critical care syndrome that affects over 200,000 patients in the U.S. each year and is an important cause of respiratory failure with significant morbidity and mortality1. Identifying the specific risk factors most strongly associated with ARDS-related death is needed to enable early intervention in patients at risk.

In recent years, research has focused on identifying important clinical features and biomarkers, as well as their combined predictive capabilities for death in ARDS patients2. Among these, ventilation parameters such as positive end-expiratory pressure (PEEP) and plateau pressure have demonstrated predictive importance for mortality in patients with ARDS3. Mean airway pressure (MAP) is a key component of the oxygenation index, which has also been associated with mortality in multiple studies of outcomes in both adult and pediatric respiratory failure4,5,6. Compared with PEEP and plateau pressure, there is a paucity of research on the dynamic change of MAP and its early prognostic importance in ARDS.

Unsupervised learning methods have been increasingly used in complex clinical syndromes such as ARDS to address the issue of clinical and biological heterogeneity7,8. In contrast, current supervised machine learning (ML) models allow analysis of a variety of collected variables and development of an ARDS-specific mortality prediction system. We hypothesized that ARDS is characterized by unique clinical features, and that these crucial clinical features can be applied to ML algorithms for accurate prediction of worse outcomes. To investigate this hypothesis, we employed an integrative approach that combines clinical data collected in the ARDSNet FACTT Trial9 with ML modeling for ARDS prognostication, followed by a comparison of the accuracy of different ML models and a determination of the importance of clinical features, especially the ventilation parameters. Based on the prognostic value of the best-performing ML model, we further determined the importance of crucial clinical features in prioritizing patients for early intervention that could potentially reduce the mortality rate for ARDS.

Methods

Patient population

We performed ML predictive modeling using data obtained from patients enrolled in the ARDSNet FACTT randomized clinical trial. The design of the FACTT study has been described previously9,10. Briefly, the FACTT trial enrolled 1,000 patients with ARDS between 2000 and 2005. The trial randomized subjects in a two-by-two factorial design; one arm compared conservative versus liberal fluid management9, whereas the other arm compared monitoring patients with ARDS with a pulmonary artery catheter versus a central venous catheter10. Patients were followed for 60 days or until discharge home with unassisted breathing. The primary outcome was mortality at 60 days before discharge home9,10,11. This study was approved by the Johns Hopkins University Institutional Review Board and all patients gave informed consent at the time of enrollment. All procedures were followed in accordance with the ethical standards (details provided in Supplementary Methods).

Clinical data

When evaluating clinical conditions of ARDS, clinical data from the baseline and early stages of the disease (such as day 3) have frequently been utilized12,13,14. Fewer studies, however, have assessed the prognostic significance of information gathered beyond baseline to identify, early on, patients at risk for worse outcomes. In this study, we assessed baseline clinical characteristics and data gathered early on day 3. Clinical predictors of 60-day mortality were chosen from 29 parameters collected in the FACTT trial using a backward stepwise selection scheme (see Supplementary Methods); we also considered the importance of selected parameters as indicated by published studies2,11. The backward elimination approach is a well-documented logistic regression method for variable selection and has previously been employed for evaluating risk factors associated with the prognosis of ARDS15. The 29 clinical parameters fell into 5 major categories: (1) baseline characteristics; (2) vital signs and circulatory; (3) respiratory and ventilatory; (4) blood and coagulation; and (5) metabolism and renal. We excluded the following clinical parameters: (1) parameters closely related to fluid strategy; and (2) parameters with a missing data rate greater than 30% (tidal volume, bilirubin). Finally, nine predictors (age, sex, pneumonia as cause of ARDS, heart rate, mean airway pressure, glucose, albumin, platelet count, bicarbonate) were selected and utilized for the development of ML models.

Prediction model building and evaluation

The primary aim of our study was to establish an ML model for predicting 60-day mortality. As summarized in Fig. 1, the analysis comprised five steps: variable selection, application of the analysis strategy, building of the ML models, model comparison, and model application. First, we created non-imputed datasets by dropping observations with missing data. We also imputed missing data using an iterative multivariate imputation technique16 for two datasets (day 0 and day 3). Second, the entire dataset was randomly divided into 70% training and 30% testing for all ML classifiers. The training set was used to build each model, while the testing set was used to evaluate its predictive performance. Third, we employed six commonly used supervised ML classification algorithms (Random Forest [RF], XGBoost, Support vector machine [SVM], Logistic regression [LR], Multi-layer perceptron [MLP], and Stacking Classifier [SC] models) for classifying survivors and non-survivors. Next, we performed fivefold cross-validation of each model on the training dataset to evaluate its cross-validation performance. Finally, the different models were evaluated and compared using the area under the receiver operating characteristic (ROC) curve (AUC) and the confusion matrix (Supplementary figure S1), which can be summarized using precision, sensitivity and F1 score (the harmonic mean of precision and sensitivity). We plotted the calibration curves of the models for predicting survival using the sigmoid regressor based on Platt’s logistic model. See the Supplementary Methods for further details.
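The workflow described above (70/30 split, fivefold cross-validation on the training set, AUC on the held-out testing set) can be sketched as follows. This is a minimal illustration and not the authors' code: the data are synthetic (generated with make_classification, not FACTT data), all hyperparameters, the stratified split, the feature scaling and the choice of base learners in the stacking model are assumptions, and XGBoost is omitted because it requires a separate library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for nine predictors and a binary outcome (NOT FACTT data).
X, y = make_classification(n_samples=400, n_features=9, n_informative=5,
                           weights=[0.72, 0.28], random_state=0)

# 70% training / 30% testing; stratifying on the outcome is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)


def scaled(model):
    """Standardize features before models that are sensitive to scale."""
    return make_pipeline(StandardScaler(), model)


base = {
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "LR": scaled(LogisticRegression(max_iter=1000)),
    "SVM": scaled(SVC(probability=True, random_state=0)),
    "MLP": scaled(MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                random_state=0)),
}
models = dict(base)
# Stacking: a meta-learner combines the base learners' predictions.
models["SC"] = StackingClassifier(
    estimators=[("rf", base["RF"]), ("lr", base["LR"])],
    final_estimator=LogisticRegression(max_iter=1000), cv=5)

results = {}
for name, model in models.items():
    # Fivefold cross-validated AUC on the training set only.
    cv_auc = cross_val_score(model, X_train, y_train, cv=5,
                             scoring="roc_auc").mean()
    # Refit on the full training set, then score the held-out testing set.
    model.fit(X_train, y_train)
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    results[name] = (cv_auc, test_auc)
    print(f"{name}: CV AUC = {cv_auc:.2f}, test AUC = {test_auc:.2f}")
```

Keeping cross-validation inside the training set means the testing set is touched only once per model, which is what makes the reported testing AUC an internal validation estimate.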

Figure 1

Overview of the analysis plan.

Our analysis was conducted in Python version 3.6 (https://www.python.org) using the scikit-learn library17.

Statistical analyses

Continuous variables were expressed as mean ± SE or median (interquartile range), as appropriate, while categorical variables were presented as numbers (percentage). Qualitative and quantitative differences between subgroups were analyzed by chi-square test for categorical parameters and Student’s t test or Mann–Whitney’s test for continuous parameters, as appropriate. Missing values were imputed using multivariate imputation. A general linear model with repeated measures was utilized to evaluate the trend over time of MAP at baseline, days 1 through 3. The group-by-time interaction term was tested first. If significant, between-group (survivor and non-survivor) differences at each time point were examined. Then within-group changes over time (trend) were tested in both survivor and non-survivor groups independently, with Bonferroni correction applied. Two-tailed P values less than 0.05 were considered statistically significant. Statistical analyses were performed with SPSS statistical software (version 22, IBM® SPSS Inc., Chicago, IL, USA).

Ethics approval

This study was approved by the Johns Hopkins University Institutional Review Board (approval number: NA_00034898) and all procedures were followed in accordance with the ethical standards.

Results

Baseline patient characteristics comparison

Table 1 compares baseline patient characteristics between survivors and non-survivors at day 60. After excluding observations with missing data, there were 700 patients (survivors = 505, non-survivors = 195) at day 0 and 593 patients (survivors = 453, non-survivors = 140) at day 3 in the non-imputed datasets. The non-survivors were older and had a higher percentage of pneumonia and sepsis but a lower percentage of trauma compared to survivors. The non-survivors had significantly higher respiratory rate, FiO2, serum potassium, BUN and creatinine, and lower body temperature, mean arterial pressure, serum albumin, bicarbonate and arterial pH at either day 0 or day 3. At day 3, the non-survivors had higher PEEP, peak pressure, plateau pressure, MAP, chloride and blood glucose but lower sodium, hemoglobin and platelet count compared to survivors.

Table 1 Patient characteristics at baseline (day 0) and day 3 comparing survivors to non-survivors at day 60.

Variable selection

Multivariate binary logistic regression (backward elimination) was applied to explore the risk factors for 60-day mortality. Twenty-nine variables were entered into the analysis. Supplementary table S1 shows that nine predictors (age, sex, pneumonia, heart rate, MAP, glucose, albumin, platelet count and bicarbonate) were independently associated with 60-day mortality; these were then included in the development of the ML models. Of note, when we randomly split the non-imputed datasets from day 0 and day 3 into training and testing groups and compared patient characteristics, there were no significant differences in any of the nine parameters (Supplementary table S2). The other two well-established logistic regression methods for variable selection (Enter and Forward selection) yielded comparable results; however, the backward elimination method included albumin, which previous research has indicated may be linked to ARDS mortality. Similar to findings reported in prior research18,19,20,21, we found that increased heart rate and blood glucose, and reduced platelet count and albumin, were associated with a higher likelihood of death in patients with ARDS. Moreover, we observed that acid–base balance was a significant predictor of death, which is consistent with findings from previous studies22. MAP is widely used for ventilatory management in critical illness, including ARDS23. Given its emerging significance in ARDS prognostication, in the following section we further evaluated the association between elevated MAP and organ dysfunction among non-survivors.

Elevated MAP was associated with ARDS mortality

We compared the repeated assessments of MAP between survivors and non-survivors throughout the first three days following the onset of ARDS (Supplementary Results and figure S2). The interaction term between group (survivors and non-survivors) and time was significant (F = 11.67, P < 0.001); thus, we further examined the effects of group status and time separately. For between-group differences at each time point, there was no significant difference between survivors and non-survivors at day 0 (P = 0.597). From day 1 to day 3, however, MAP was significantly lower in survivors than in non-survivors (day 1: P < 0.01; days 2 and 3: P < 0.001). Furthermore, we assessed changes in MAP over time. Over the course of the three-day follow-up, MAP among survivors decreased progressively, with the lowest values at day 3 (P < 0.001). In contrast, MAP in non-survivors did not change over time (P = 0.926). We further dichotomized ARDS patients into MAPhigh (n = 357) and MAPlow (n = 353) subgroups based on their MAP levels using a median split (Supplementary table S3). At day 3, patients in the MAPhigh group showed worse respiratory (P/F ratio) and renal function metrics (higher BUN and creatinine, P < 0.001) as well as raised ventilation parameters. In contrast, we did not observe worse renal function in the MAPhigh group at baseline. Furthermore, comparing 235 survivors and 122 non-survivors in the MAPhigh subgroup at day 3, we found that the bicarbonate and arterial pH levels in the non-survivor group were significantly lower (below normal ranges) than in the survivor group (22.50 ± 0.58 vs. 27.53 ± 0.39, 7.31 ± 0.009 vs. 7.39 ± 0.005, P < 0.001), indicating metabolic acidosis. At baseline, on the other hand, no discernible difference was found. Consequently, at day 3, ARDS patients with higher MAP levels were more likely to experience metabolic dysfunction and organ damage, which may have contributed to the high death rate (38.1% vs. 19.2%, P < 0.001).

Assessment of AUC values of six prediction models in the testing set

The performance of each predictive model was assessed in the testing sets based on its receiver operating characteristic (ROC) curve, judged by its area under the ROC curve (AUC), and the 95% confidence interval (CI) for each AUC value.
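For readers less familiar with the metric, the AUC equals the probability that a randomly chosen non-survivor receives a higher predicted risk than a randomly chosen survivor. The sketch below computes it that way on a small made-up example and adds a percentile bootstrap confidence interval. The bootstrap is an assumption made for illustration; the text does not state how the reported 95% CIs were obtained, and the labels and scores are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (assumed method, for illustration)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        resampled = [labels[i] for i in idx]
        if 0 < sum(resampled) < n:  # both classes must be present
            stats.append(auc(resampled, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical outcomes (1 = non-survivor) and predicted risks.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

point_estimate = auc(labels, scores)  # 20 of 24 pairs ranked correctly
ci_low, ci_high = bootstrap_ci(labels, scores)
print(f"AUC = {point_estimate:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```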

Imputed data

First, we applied six ML classification algorithms (RF, XGBoost, LR, SVM, MLP and SC) to the imputed testing sets. Then, we obtained the average AUC and 95% CI for each model to evaluate the performance (Fig. 2). At baseline (n = 300, panel A), the AUC values and 95% CI were 0.64 (0.56–0.72), 0.64 (0.56–0.72), 0.72 (0.65–0.78), 0.57 (0.49–0.65), 0.7 (0.63–0.77) and 0.64 (0.57–0.72), respectively. Only the LR classifier obtained a satisfactory AUC value of 0.72 (above 0.70). At day 3 (n = 289, panel B), all 6 classifiers achieved satisfactory AUC values above 0.70. The AUC values were 0.84 (0.78–0.89), 0.82 (0.77–0.88), 0.80 (0.74–0.86), 0.77 (0.7–0.83), 0.79 (0.73–0.85) and 0.84 (0.78–0.89), respectively. Overall, we observed enhanced performance for all 6 models at day 3 with the RF and SC classifiers having the highest AUC value (0.84), followed by XGBoost (0.82).

Figure 2

ROC curves of six ML classifiers for predicting 60-day mortality in ARDS patients. Panel (A): data collected from Day 0 in the testing dataset (n = 300 with imputation); Panel (B): data collected from Day 3 in the testing dataset (n = 289 with imputation). Definition of abbreviations: ROC = receiver operating characteristics, AUC = area under the curve.

Non-imputed data

At baseline (n = 210, panel C), the AUC values were 0.68, 0.65, 0.69, 0.55, 0.62 and 0.67, respectively. The LR classifier obtained the highest AUC value of 0.69 (95% CI: 0.61–0.78). At day 3 (n = 178, panel D), all 6 classifiers achieved satisfactory AUC values above 0.70. The AUC values and 95% CI were 0.82 (0.75–0.89), 0.79 (0.71–0.88), 0.79 (0.71–0.87), 0.80 (0.73–0.87), 0.83 (0.77–0.9) and 0.83 (0.76–0.9), respectively. Similar to the imputed data, non-imputed data at day 3 exhibited enhanced performance for all 6 models, with the MLP and SC classifiers having the highest AUC value (0.83), followed by RF (0.82). Intriguingly, the accuracy of the prediction values (ROC-AUCs) appears to increase from approximately 0.7 at day 0 to above 0.8 at day 3 in the prediction models.

Comparing the performance of six prediction models

ML classifiers demonstrated high accuracy as indicated by their prediction values (ROC-AUCs) utilizing data from the FACTT trial. To further demonstrate good discrimination of the prediction models, we compared the comprehensive performance of the 6 classifiers (Supplementary figure S3) utilizing imputed data at day 0 (panel A) and day 3 (panel B) in the testing sets by computing a confusion matrix for each classifier. As shown in Table 2 and Supplementary table S4B, the six models presented varying performances as indicated by the efficacy metrics generated from the confusion matrix: precision, sensitivity and F1 score. The RF and LR classifiers achieved the best F1 score of 0.86 for predicting survivors at baseline. In contrast, the MLP classifier achieved the best F1 score of 0.45 for predicting non-survivors. At day 3, both the RF and SC classifiers achieved the best F1 scores of 0.88 for predicting survivors, whereas the scores for non-survivors were 0.51 and 0.53, respectively. Additionally, the efficacy of the six classifiers in the training sets is shown in Table 2 and Supplementary table S4A. The RF classifier, at day 3, also demonstrated the highest F1 scores of 0.94 and 0.77 for survivors and non-survivors, respectively. These results indicated that the RF classifier consistently exhibited better prognostic values for classifying survivors and non-survivors at either day 0 or day 3.
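To make the per-class metrics concrete, the sketch below derives precision, sensitivity and F1 score from a 2 × 2 confusion matrix, once with non-survivors and once with survivors treated as the class of interest. The counts are hypothetical and are not taken from Table 2; they were chosen only to show how an imbalanced outcome can give a high F1 score for the majority class and a much lower one for the minority class.

```python
def class_metrics(tp, fp, fn):
    """Precision, sensitivity (recall) and F1 for one class of interest."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f1


# Hypothetical confusion-matrix counts for a 300-patient testing set,
# with "non-survivor" as the positive class (illustrative values only).
tn, fp, fn, tp = 200, 16, 54, 30

# Non-survivors as the class of interest.
p_ns, s_ns, f1_ns = class_metrics(tp=tp, fp=fp, fn=fn)

# Survivors as the class of interest: the roles of the cells swap.
p_sv, s_sv, f1_sv = class_metrics(tp=tn, fp=fn, fn=fp)

print(f"non-survivors: precision={p_ns:.2f} sensitivity={s_ns:.2f} F1={f1_ns:.2f}")
print(f"survivors:     precision={p_sv:.2f} sensitivity={s_sv:.2f} F1={f1_sv:.2f}")
```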

Table 2 The efficacy (precision, sensitivity and F1 score) of top three machine-learning classifiers in the training (A) and testing (B) sets (using imputed data).

To test the calibration of the model, we ultimately drew the calibration curves of the models at baseline and 3 days after ARDS onset. At day 3, most of the models under-predicted the true probabilities with the prediction/observation points distributed above the 45° (dashed accuracy-equals-confidence) line (Supplementary figure S4B). In contrast, at baseline models were over-confident until about 0.6 and then under-predicted around 0.8 (Supplementary figure S4A). In addition, the results showed that the RF model performed relatively better in each dataset.
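The reading of a calibration curve can be illustrated with a small sketch that bins predicted probabilities and compares the mean prediction in each bin with the observed event fraction. Points above the 45° line indicate that the observed fraction exceeds the predicted probability (under-prediction). The data, the number of bins and the equal-width binning are assumptions made for illustration; the sketch covers only the plotting step, not the Platt sigmoid fit itself.

```python
def calibration_points(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed fraction) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[k].append((y, p))
    points = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            observed = sum(y for y, _ in members) / len(members)
            points.append((mean_pred, observed))
    return points


# Hypothetical predicted probabilities and observed outcomes.
y_prob = [0.1, 0.1, 0.3, 0.3, 0.5, 0.5, 0.7, 0.7]
y_true = [0, 0, 0, 1, 1, 1, 1, 1]

points = calibration_points(y_true, y_prob)
for mean_pred, observed in points:
    if observed > mean_pred:
        side = "above diagonal (under-predicted)"
    elif observed < mean_pred:
        side = "below diagonal (over-predicted)"
    else:
        side = "on diagonal"
    print(f"predicted {mean_pred:.2f} vs observed {observed:.2f}: {side}")
```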

Feature importance of the RF classifier in the testing set

We determined the importance of quantitative clinical features according to the estimated probability of 60-day mortality from the RF classifier, which was the best-performing model in the testing sets at day 3 with the highest AUC value of 0.84 when applied to imputed data. Feature importance is the average of reduction in impurity index over all trees when a particular feature is used at split point17; we measured feature rankings by using the ‘gini impurity or mean decrease in impurity’ metric in RF (see Supplementary Methods). Figure 3 illustrated the outcomes of relative feature importance for each single attribute. The relative ranked top seven features (from high to low) in the RF predictor at baseline (panel A) were age, platelet count, bicarbonate, MAP, heart rate, glucose, and pneumonia. The relative ranked top features were changed at day 3 (panel B) as the following: MAP, bicarbonate, age, platelet count, albumin, heart rate and glucose. Of note, MAP, one of the ventilator-related features, ranked as the most important mortality risk predictor at day 3, possibly reflecting ARDS progression. We observed similar trends in data without imputation. Age, bicarbonate and MAP were among the top three important features demonstrating improved predictive performance for risk of mortality at day 3 (Supplementary figure S5). As a result, ML modeling offered more proof that MAP is crucial for early ARDS prognostication.
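As a worked illustration of the impurity-based metric, the sketch below computes the Gini impurity of a node and the decrease in impurity achieved by one candidate split. In an RF, such decreases are weighted by the number of samples reaching the node, summed per feature and averaged over all trees. The labels here are hypothetical (0 = survivor, 1 = non-survivor), and the split is arbitrary.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (0/1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node with 6 survivors and 4 non-survivors, split on some
# feature threshold into a mostly-survivor and a mostly-non-survivor child.
parent = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 0, 0, 1]
right = [0, 1, 1, 1]

decrease = impurity_decrease(parent, left, right)
print(f"parent Gini = {gini(parent):.3f}, decrease from split = {decrease:.3f}")
```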

Figure 3

Importance of clinical features in the training dataset according to the estimated probability of 60-day mortality in Random Forest models. Normalized values for each single attribute (in a range of 0–100, the most important predictor variable was assigned with the value of 100) were illustrated as relative ranked features (from high to low). The training datasets from day 0 (n = 700, Panel (A)) and day 3 (n = 674, Panel (B)) with imputation were applied.
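The 0–100 scaling described in the caption amounts to dividing each raw importance by the largest one and multiplying by 100. A minimal sketch follows; the feature names and raw values are placeholders, not the importances behind Fig. 3.

```python
def normalize_importance(raw):
    """Scale raw importances so the top feature is 100, then rank high to low."""
    top = max(raw.values())
    scaled = {name: 100.0 * value / top for name, value in raw.items()}
    return sorted(scaled.items(), key=lambda item: item[1], reverse=True)


# Placeholder raw importances (hypothetical values for illustration only).
raw = {"feature_b": 0.10, "feature_a": 0.20, "feature_c": 0.05}

ranked = normalize_importance(raw)
for name, value in ranked:
    print(f"{name}: {value:.0f}")
```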

Discussion

ARDS is an inflammatory syndrome characterized by acute respiratory failure due to non-cardiogenic pulmonary edema and hypoxemia, and is associated with significant morbidity and mortality1. Despite advances in the understanding of the pathophysiology of ARDS, mortality rates remain high, ranging between 16.1% and 45.4% in recent reports24. There are no known effective pharmacological interventions for ARDS; proven interventions are limited to supportive care. Leveraging cutting-edge analytical workflows (e.g., ML and artificial intelligence approaches) in the analysis of large datasets from the inpatient setting is likely to identify critical risk factors most associated with poor outcomes of ARDS and facilitate the development of strategies for early intervention in patients at the highest risk. In the current study, we applied a mathematical modelling approach based on state-of-the-art ML algorithms to identify the most discriminative clinical features of ARDS-associated mortality. We compared the efficacy of mortality prediction models derived from the ARDSNet FACTT trial datasets (day 0 vs. day 3), and the ML approach revealed key clinical features representing multi-organ dysfunction for ARDS mortality prediction.

There is an increasing trend of utilizing modern ML algorithms to develop and evaluate highly accurate classifier models for both early diagnosis and mortality prediction in ARDS25,26. However, most of these studies examined only data collected at baseline, despite the fact that predictive values tend to improve in the early days after ARDS onset (e.g., at day 3). We applied six commonly adopted ML classifiers to data collected at two early time points, day 0 and day 3, for predicting ARDS mortality. The area under the receiver operating characteristic curve, confusion matrix, precision, sensitivity and F1 score (which is the harmonic mean of the precision and sensitivity) were used to evaluate and compare the comprehensive performance of the model types. As expected, the accuracy of the prediction values (e.g., AUC values) increased from approximately 0.7 at day 0 to above 0.8 at day 3 in the prediction models. The RF classifier consistently demonstrated outstanding performance on both days, and obtained the highest AUC value of 0.84 at day 3 (Fig. 2B). The SC classifier, which also displayed outstanding performance at day 3, is an ensemble ML algorithm that learns how to best combine the predictions from multiple well-performing ML models to obtain better predictive performance27. RF is a powerful ensemble-based ML classifier made up of multiple decision trees28, and it demonstrated high predictive ability for prognostication in the data from the FACTT trial. During the training process, it randomly samples the training dataset with replacement (also known as bootstrapping) to build each decision tree. Additionally, it considers random subsets of features to split the nodes. The final predictions of the RF are made by averaging the predictions of the individual trees. We also analyzed the importance of clinical features in predicting 60-day mortality within the RF model. Our results suggested that seven clinical parameters (MAP, bicarbonate, age, platelet count, albumin, heart rate and glucose) were the most important features at day 3 for ARDS mortality risk in the FACTT trial. The illustration of clinical feature importance (Fig. 3) may give physicians an intuitive understanding of the key features within the RF model, which provided the highest precision and sensitivity.
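The three ingredients of an RF described above (bootstrap sampling of patients, random feature subsets at each split, and averaging of tree predictions) can be sketched in a few lines. This is a deliberately simplified toy, not the scikit-learn implementation used in the study: each "tree" is a single-split decision stump, so the per-tree feature subset coincides with the per-split subset, and the data, tree count and subset size are arbitrary assumptions.

```python
import random


def gini(labels):
    """Gini impurity for binary labels, written as 2p(1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def fit_stump(X, y, features):
    """Find the best single split among the allowed features.

    Returns (feature, threshold, left_rate, right_rate), where the rates
    are the event fractions in each child and serve as predictions.
    """
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # no valid split: predict the sample's event rate
        rate = sum(y) / len(y)
        return (0, float("inf"), rate, rate)
    return best[1:]


def fit_forest(X, y, n_trees=25, n_sub=2, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: rows with replacement
        feats = rng.sample(range(d), n_sub)         # random subset of features
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_proba(forest, row):
    """Average the individual stump predictions."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return sum(votes) / len(votes)


# Toy data: feature 0 drives the outcome, features 1 and 2 are noise.
X = [[i, (i * 7) % 5, (i * 3) % 4] for i in range(20)]
y = [1 if i >= 10 else 0 for i in range(20)]

forest = fit_forest(X, y)
p_low = predict_proba(forest, [1, 0, 0])
p_high = predict_proba(forest, [18, 0, 0])
print(f"predicted risk: low-feature row {p_low:.2f}, high-feature row {p_high:.2f}")
```

Stumps that happen to draw only noise features contribute near-uninformative votes, and averaging over many bootstrapped stumps dilutes their influence; the same averaging is what stabilizes predictions in a full forest.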

To account for the dynamic process of disease progression, efforts have been made previously to compare the performance of mortality risk factors utilizing data collected after ARDS onset. Bone et al.29 analyzed the dynamic change of the PaO2/FiO2 ratio during the first 7 days after ARDS onset and found that survivors had a significantly higher PaO2/FiO2 ratio at days 1–7 than non-survivors, even though the two groups of patients had a similar PaO2/FiO2 ratio at day 0. Lai et al.30 found that using clinical data from 1 day after ARDS onset could predict outcomes better than using data collected at baseline. Go et al.12 examined the change of oxygenation index (OI) over the first seven days of ARDS and found that failure to improve OI at day 7 was associated with higher mortality. In this study, we also found evidence that variables measured after ARDS onset (e.g., day 3) have better predictive performance than those at baseline. In the field of critical care medicine, increased accuracy in predicting mortality may have a major impact on various aspects of patient care; for example, improved prognostication could allow more accurate patient stratification for clinical trials and help inform family discussions. Thus, utilizing clinical indices collected early after ARDS onset may improve the performance of mortality prediction in ARDS.

In this study, we found that MAP was the most significant clinical characteristic for predicting 60-day death, along with other crucial clinical features. Previous studies have reported the value of PEEP, plateau pressure or tidal volume in predicting mortality in ARDS patients31. In contrast, fewer studies have evaluated the predictive value of MAP. Recently, Sahetya et al.6 reported the prognostic value of MAP at baseline (within 24 h of mechanical ventilation). However, they did not assess whether dynamic data may provide enhanced predictive value. In this study (Supplementary figure S2), although there was no significant difference in MAP between survivors and non-survivors at baseline, non-survivors had significantly higher MAP than survivors on days 1 to 3 (P < 0.001), suggesting that MAP is the most important predictor of ARDS mortality. While driving pressure and plateau pressure reflect lung stress, MAP provides a more complete estimation of lung disease severity, respiratory compliance, and need for respiratory support than driving pressure or plateau pressure alone. MAP will increase if airway resistance increases, compliance of the lung or chest wall decreases, or dead space and work of breathing increase23. Plateau pressure represents the stiffness of the respiratory system, which predicts mortality in patients with ARDS3. MAP correlates directly with plateau pressure but also varies with minute ventilation, which could reflect dead space or acidosis. Furthermore, MAP reflects the entire respiratory cycle and therefore contains more information about mechanical ventilation. In our study, we confirmed that: (1) ARDS patients in the MAPhigh subgroup had an increased risk of metabolic dysfunction and organ injuries associated with a high mortality rate (P < 0.001, Supplementary table S3); and (2) utilizing mechanical ventilation parameters such as MAP collected at day 3 after treatment could provide better prognostic value for ARDS mortality.

The strengths of our study include the use of robust clinical parameters collected in a large multi-center study, the FACTT trial, and the implementation of state-of-the-art ML workflows. However, our study has some limitations. First, we generated the ML models from secondary analyses of a previously conducted randomized controlled trial; these models must be evaluated prospectively in observational cohorts before they can be generalized to the ARDS population and used in the clinical setting. Second, given that the proposed ML method is data-driven, repeating the whole procedure with data collected beyond day 3 (e.g., day 7) may yield different performance. Third, a combination of clinical predictors and biomarkers may enhance the performance of mortality prediction models in ARDS2,32. Biological markers of cell-specific injury, acute inflammation, and altered coagulation were correlated with mortality in multicenter clinical trials in ARDS33,34,35,36. Further, novel biomarkers discovered by a systems biology multi-“omics” approach may hold the promise to establish predictive or prognostic stratification methods and ultimately help to develop more tailored therapeutics for ARDS patients37,38.

Of note, we included an internal validation strategy in our study design. We used the common approach of splitting the single FACTT dataset into two parts: a training cohort, and a separate testing/validation cohort that is not used in developing the model itself. The superior performance of ML classifiers in the training sets was confirmed in the testing sets. The decision rules developed within the RF model at day 3 can predict the mortality rates of patients in advance with more than 80% accuracy. Given the novelty of our findings and their potential for translation into practice, the model was further developed into a web-based ARDS mortality ‘calculator’ (https://mortality-predictor.streamlit.app/). This exploratory online tool uses the RF classifier, already trained on data from day 3 of the FACTT trial, to compute the prediction and return a numerical value for ARDS mortality (Supplementary figure S6).

In conclusion, utilizing a large ARDS dataset, we developed ML-based models for risk stratification in critically ill ARDS patients and identified MAP as the most important clinical predictor of mortality. Future prospective research is warranted to validate the proposed models and to translate the advantages of ML models into improved patient outcomes through early intervention.