Introduction

The Acute Respiratory Distress Syndrome (ARDS) is a clinically and biologically heterogeneous syndrome that contributes significantly to morbidity and mortality in critically ill patients [1,2,3]. ARDS has long been a focal point of critical care research, but hundreds of randomized controlled trials (RCTs) have led to merely two guideline recommendations supported by high-level evidence: low tidal volume ventilation and prone positioning in patients with severe ARDS [4, 5].

The paucity of high-level evidence is due to indeterminate and conflicting trial results. Many RCTs in the ARDS population report an indeterminate outcome—detecting neither significant benefit nor harm of investigated therapeutic strategies [6]. Several other large RCTs demonstrated contradictory results, with seemingly beneficial therapies being found ineffective in subsequent trials [7, 8].

It has become clear that treatment effects of interventions in ARDS are highly dependent on the details of the intervention, as small variations of the same treatment have led to disparate results. For example, different definitions of ‘low’ and ‘high’ tidal volumes [9,10,11,12,13] or differences in neuromuscular blockade and sedation [7, 8, 14] have led to different trial outcomes. On the other hand, it is much less clear how differences between methodological trial characteristics and patient characteristics affect study outcomes. Between-trial heterogeneity refers to the non-random variation in treatment effect of an intervention due to methodological or clinical differences between patient populations. Unmeasured or unexplainable heterogeneity—both among patients in a single trial and between trial populations—may adversely affect the validity and generalizability of study results (see: ‘Panel: A practical example of the problem with unexplainable between-trial heterogeneity’) [15, 16].

In this study, we set out to quantify the consistency of reporting baseline characteristics and to measure between-trial heterogeneity in 28-day control group mortality among all ARDS RCTs in the lung-protective ventilation era. Our aim was to determine to what extent between-trial differences in control group mortality could be explained by differences in trial and patient characteristics. We hypothesized that between-trial heterogeneity would be large and trial populations often poorly characterized, leading to a discrepancy between inclusion criteria and patient characteristics on the one hand, and control group outcomes on the other hand.

Panel: A practical example of the problem with unexplainable between-trial heterogeneity

We note two high-profile trials published in the same journal issue [17, 18]. Both trials investigated high-frequency oscillatory ventilation in the same target population of moderate to severe ARDS patients, but they reported a different effect on mortality. Judging by the control group patient characteristics, there were clinically meaningful differences between the trial populations: there was a 32% relative difference in baseline Acute Physiology and Chronic Health Evaluation (APACHE II) scores (22 vs. 29 points) and there was a 41% relative difference in control group mortality (41% vs 29% at 30 days). But, paradoxically, the trial with the lowest baseline mean APACHE II score had the highest control group mortality. This makes the interpretation of the conflicting trial results exceedingly difficult. One trial demonstrated significant harm from the intervention while the other trial found no effect. Was the difference in treatment effect due to subtle unreported differences in the intervention, due to unreported differences in the patient populations, or due to differences in the standard of care? Were the patients more severely ill at baseline in the trial with the highest APACHE II score or in the trial with the highest control group mortality rate? It is clear that unexplainable outcome heterogeneity reduces the generalizability of intervention effects to the global ARDS population [19].

Methods

This systematic review follows the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA). The study protocol and statistical analysis plan were registered online at the International Prospective Register of Systematic Reviews (PROSPERO, registration number: CRD42020161809).

Systematic search

A comprehensive search was conducted in MEDLINE, Embase and Scopus for randomized clinical trials including adult ARDS patients published from January 1st, 2000 until January 31st, 2020. Eligible studies included a) adult ARDS patients diagnosed according to the AECC guidelines from 1994 [20] or the Berlin definition from 2012 [21], subjected to b) invasive lung-protective mechanical ventilation according to the ARDSnet protocol [11], or reporting a tidal volume of ≤ 8 ml/kg. Included studies were c) randomized clinical trials reporting on d) 28-day, hospital, intensive care unit (ICU) or 60-day mortality. There were no restrictions with regard to the intervention or phase of the study. More details about the review process are provided in the supplementary appendix, Sect. 1.1 and 1.2.

Outcome measures

For each study, we recorded trial characteristics, intervention, inclusion and exclusion criteria, mean patient baseline characteristics and mortality outcomes.

The primary outcome was the between-trial heterogeneity based on the 28-day control group mortality rate (I2). The 28-day control group mortality rate reflects the baseline risk of death of the patient population of an individual trial. Secondary outcomes included associations between 28-day control group mortality and characteristics of trial design and outcome, inclusion and exclusion criteria, as well as baseline characteristics.

Estimation of 28-day control group and intervention group mortality

All analyses investigating heterogeneity were conducted using the 28-day control group mortality rate. For trials reporting solely on hospital, ICU or 60-day mortality, 28-day control group mortality was estimated with linear regression using data from trials reporting on both 28-day mortality and any of the other mortality outcomes [22]. The 28-day intervention group mortality was estimated in the same manner for analyses investigating differences between control and intervention group mortality. A sensitivity analysis was conducted using only the trials reporting 28-day control group mortality.
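The estimation step can be illustrated with a minimal sketch, assuming a simple unweighted least-squares line fitted to trials that report both outcomes. The paired values, the helper name `fit_line` and the example trial are hypothetical and are not data or code from this study.

```python
# Hypothetical paired observations from trials that reported BOTH outcomes
# (illustrative values only, not data from the included trials).
hospital_mortality = [0.25, 0.32, 0.40, 0.47, 0.55]
day28_mortality = [0.22, 0.29, 0.35, 0.42, 0.49]


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(hospital_mortality, day28_mortality)

# Estimate 28-day mortality for a (hypothetical) trial that reported
# hospital mortality only.
reported_hospital_mortality = 0.38
estimated_day28 = intercept + slope * reported_hospital_mortality
print(f"estimated 28-day mortality: {estimated_day28:.3f}")
```

A separate line would be fitted for each alternative time frame (hospital, ICU, 60-day), so that each trial is converted with the equation matching the outcome it actually reported.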

Estimation and quantification of between-trial heterogeneity

The 28-day control group mortality rates across studies were analyzed using a random-effects meta-regression model with the log odds of mortality as the dependent variable. Each individual trial was weighted by the inverse of the sampling variance of the mortality rates. A maximum likelihood estimator was applied to estimate the mean mortality (random-effects pooled estimate), the between-study standard deviation due to heterogeneity (τ), and heterogeneity (the percentage of variation in control group mortality due to heterogeneity rather than chance, I2). To make heterogeneity interpretable in a clinically meaningful manner, we calculated the 95% prediction interval. The prediction interval represents the distribution of estimated underlying mortality after correction for random chance and predictive covariates (for further details, see the methods section of reference [22]). This model and its corresponding outcomes were used to present the distribution of 28-day control group mortality between all trials, and to investigate differences between individual trial characteristics.
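The quantities named above (pooled estimate, τ, I2 and the 95% prediction interval) can be sketched as follows. This is an illustration under stated assumptions and not the analysis code of this study: it swaps the maximum likelihood estimator for the simpler DerSimonian-Laird moment estimator, uses the Q-based definition of I2, and uses a normal quantile (1.96) for the prediction interval. The function name and the example counts are hypothetical.

```python
import math


def expit(x):
    """Inverse of the log-odds (logit) transformation."""
    return 1.0 / (1.0 + math.exp(-x))


def random_effects_summary(deaths, totals):
    """Random-effects pooling of control group mortality on the log-odds scale.

    Assumes at least two trials and 0 < deaths < total in every trial.
    tau^2 is estimated with the DerSimonian-Laird moment estimator
    (an assumption made here for simplicity).
    """
    # Log odds of death and their approximate sampling variances.
    y = [math.log(d / (n - d)) for d, n in zip(deaths, totals)]
    v = [1.0 / d + 1.0 / (n - d) for d, n in zip(deaths, totals)]
    k = len(y)

    # Inverse-variance (fixed-effect) weights and Cochran's Q.
    w = [1.0 / vi for vi in v]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))

    # Between-trial variance and the share of variation beyond chance.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0

    # Random-effects pooled estimate and approximate 95% prediction interval.
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_mu = math.sqrt(1.0 / sum(w_re))
    half = 1.96 * math.sqrt(tau2 + se_mu ** 2)

    return {
        "pooled": expit(mu),
        "tau": math.sqrt(tau2),
        "i2": i2,
        "pi_low": expit(mu - half),
        "pi_high": expit(mu + half),
    }


# Hypothetical control groups: deaths and patients per trial.
summary = random_effects_summary([12, 40, 25, 61], [100, 110, 90, 120])
print(summary)
```

Working on the log-odds scale keeps the back-transformed pooled estimate and interval limits between 0 and 1, which is why the interval is asymmetric around the pooled mortality.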

Associations between patient characteristics and mortality rates

The associations between 28-day control group mortality and reported patient characteristics were estimated by adding each individual covariate separately to the random-effects model as a moderator in univariate analysis. The goodness-of-fit of the log-linear, quadratic and power models was compared, and the model with the lowest Akaike information criterion (AIC) was selected [23]. For each model, the regression coefficient (b) and unadjusted R2 were reported. R2 represents the proportion of between-trial heterogeneity in 28-day control group mortality explained by the individual baseline characteristic among the n trials reporting the covariate.
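Selecting a functional form by AIC can be sketched as below. This is a simplified illustration and not the weighted random-effects meta-regression used in this study: the fits are unweighted least squares, the 'power' model is assumed to mean a model linear in the logarithm of the covariate, and the function names and data are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def aic_least_squares(design, y):
    """Gaussian AIC (up to an additive constant) of a least-squares fit."""
    n, k = len(y), len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * xi for b, xi in zip(beta, row))) ** 2
              for row, yi in zip(design, y))
    # k regression coefficients plus one residual variance parameter.
    return n * math.log(rss / n) + 2 * (k + 1)


def compare_models(x, y):
    """Return the AIC of three candidate forms; x must be strictly positive."""
    mean_x = sum(x) / len(x)
    s = [xi / mean_x for xi in x]  # rescaling only improves conditioning
    designs = {
        "log-linear": [[1.0, si] for si in s],
        "quadratic": [[1.0, si, si ** 2] for si in s],
        "power": [[1.0, math.log(si)] for si in s],
    }
    return {name: aic_least_squares(d, y) for name, d in designs.items()}


# Hypothetical trial-level data: mean age and log odds of control mortality.
mean_age = [45, 48, 50, 52, 55, 57, 60, 62, 65, 68]
log_odds = [-1.6, -1.4, -1.3, -1.1, -1.0, -0.8, -0.7, -0.5, -0.4, -0.2]
aics = compare_models(mean_age, log_odds)
print(min(aics, key=aics.get), aics)
```

Because AIC penalizes each additional parameter, the quadratic form is selected only when its improvement in fit outweighs the extra coefficient.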

Prediction of control group mortality based on significant patient characteristics

To predict between-trial differences in mortality based on patient characteristics, a comprehensive multivariate logistic regression model was constructed. Missing observations were imputed using multiple imputation generating 20 datasets with predictive mean matching. For a detailed description of the process, we refer to the supplementary appendix, Sect. 2.5. Significant baseline characteristics reported in at least 25% of all trials with a univariate regression R2 ≥ 0.10 were eligible for the model. The threshold R2 of 0.10 was a compromise between the number of variables and the limited number of observations, as described before [22]. A stepwise backward selection procedure was applied removing regressors if p ≥ 0.05 for the final model. To facilitate comparisons between the individual covariates the standardized regression coefficient (β) and the standardized standard error (SSE) were reported in the supplementary appendix, Sect. 2.5.

Control group mortality differences between trials demonstrating benefit vs. no benefit

A trial demonstrating significant benefit was defined as a reported p value of < 0.05 for the primary endpoint (as defined by authors) in favor of the intervention group. Comparisons between trials demonstrating significant benefit and trials with an indeterminate outcome or harm were performed using the Mann–Whitney U test. Linear mixed-effects and regression models were applied to estimate the probability of a significant trial outcome based on the observed control group mortality and intervention group mortality, respectively.
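The group comparison can be illustrated with a minimal sketch of the Mann–Whitney U test, assuming a two-sided normal approximation without tie or continuity correction (statistical packages typically apply exact or corrected versions). The function name and the mortality values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation.

    No tie correction and no continuity correction are applied, so the
    p value is approximate, particularly for small groups.
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs in which the x value exceeds the y value (ties = 0.5).
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u, p


# Hypothetical 28-day control group mortality rates per trial.
benefit_trials = [0.40, 0.45, 0.50, 0.55]
other_trials = [0.20, 0.25, 0.28, 0.30, 0.35]
u_stat, p_value = mann_whitney_u(benefit_trials, other_trials)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

The test compares ranks rather than means, which makes it suitable for a small number of trials with skewed mortality rates.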

Statistical analyses

A p-value of < 0.05 was considered statistically significant.

Statistical analyses were performed with the RStudio interface (Version 1.1.447. R core team. R: A Language and Environment for Statistical Computing. 2013. http://www.r-project.org/) using the packages ‘tidyverse’, ‘dplyr’, ‘metafor’, ‘mice’, ‘Hmisc’, ‘wCorr’, ‘data.table’, ‘MASS’ and ‘ggplot2’.

Results

Systematic search

The literature search yielded 3479 results. A total of 67 RCTs met all inclusion and exclusion criteria and were included in the analyses (eFigure 1) [7, 8, 17, 18, 24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86]. Table 1 provides an overview of the included trial characteristics.

Table 1 Characteristics of included randomized clinical trials

Estimation of 28-day control group and intervention group mortality

The 28-day control group mortality was reported in 45 trials. For trials reporting another mortality timeframe, 28-day mortality could be reliably estimated (adjusted R2 ≥ 0.98 for all estimation models). Linear equations and corresponding regression plots are shown in eTable 1 and eFigure 2. The sensitivity analysis restricted to trials reporting 28-day mortality is provided in the supplementary appendix, Sect. 2.3.

Estimation and quantification of between-trial heterogeneity

The 28-day control group mortality ranged from 9.7 to 66.7% with a weighted mean mortality rate of 30.9%. Between-trial heterogeneity was large with 87.5% of the differences between control group mortality rates not explained by chance (I2 = 87.5%, τ = 0.509, p < 0.0001). The 95% prediction interval (the estimated range of mortality rates, corrected for small trials and random error) was 14 to 55%. The mean mortality and the magnitude of heterogeneity were similar for all subgroups of trial characteristics (Fig. 1). The exclusion criteria were too manifold for valid analyses. However, the number of reported exclusion criteria was associated with a lower 28-day control group mortality (p = 0.04, eFigure 3).
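As a rough plausibility check, the reported prediction interval can be approximated from the reported mean and τ alone. The sketch below assumes that the weighted mean of 30.9% can stand in for the pooled estimate on the log-odds scale and ignores the uncertainty of that estimate. Both are simplifications, so this is not the exact computation behind the reported interval.

```python
import math


def expit(x):
    """Inverse of the log-odds (logit) transformation."""
    return 1.0 / (1.0 + math.exp(-x))


pooled_mortality = 0.309  # weighted mean mortality reported above
tau = 0.509               # between-study SD on the log-odds scale

center = math.log(pooled_mortality / (1.0 - pooled_mortality))
half_width = 1.96 * tau   # simplification: ignores uncertainty in the mean

lower = expit(center - half_width)
upper = expit(center + half_width)
print(f"approximate 95% prediction interval: {lower:.1%} to {upper:.1%}")
```

Under these simplifications the limits round to 14% and 55%, in line with the reported interval.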

Fig. 1

28-day control group mortality for individual trial characteristics. The diamond represents the mean mortality rate (peak) with the corresponding 95% confidence interval (length of diamond). The black line denotes the 95% prediction interval, which is the estimated between-trial variability in mortality rates after adjusting for random chance and sample size, i.e. the between-trial heterogeneity. I2 represents the proportion of between-trial variability that cannot be explained by chance.

Associations between patient characteristics and mortality rates

Table 2 presents the most-reported baseline characteristics and their associations with 28-day control group mortality. For the goodness-of-fit statistics of the individual associations, we refer to eTable 2 in the supplementary appendix. Figure 2 provides an overview of between-trial differences in control group mortality and patient characteristics. We observed an association between 28-day control group mortality and the following patient characteristics in univariate analyses: mean age; mean body mass index; mean APACHE II score; the proportion of patients treated with vasopressors; the proportion of patients presenting with shock at baseline; mean lung injury score; mean oxygenation index; mean plateau pressure; mean PaO2/FiO2 ratio and mean arterial pH. Individual regression plots are shown in eFigure 3.

Table 2 Univariate associations between 28-day control group mortality and commonly reported mean baseline patient characteristics
Fig. 2

Heatmap of control group outcomes and baseline characteristics. On the y-axis, all included trials are ordered from highest to lowest (estimated) 28-day mortality rate. The color of a tile represents whether, for a specific trial, a reported variable was lowest (blue) or highest (red) among all trials that reported the variable. A white tile represents a variable not reported by a specific trial. The x-axis depicts the most reported baseline characteristics. Some show a concordant pattern with 28-day mortality (e.g. age) while others do not (e.g. SAPS II score, SOFA score). Most importantly, the distribution of white tiles demonstrates the large variability in the reporting of baseline characteristics.

Prediction of control group mortality based on significant patient characteristics

A detailed description of variable selection and construction of the multivariate logistic regression model is available in the supplementary appendix, Sect. 2.5. Significant variables for the final logistic regression model were: mean age (p < 0.0001), mean LIS (p = 0.0099), mean plateau pressure (p = 0.0078) and mean arterial pH (p = 0.0119). The residual 95% prediction interval adjusted for the significant predictors was 18 to 45%. Six trials reported all four variables in the final model.

Control group mortality differences between trials demonstrating benefit vs. no benefit

As shown in Fig. 3a, trials demonstrating significant benefit reported a higher 28-day control group mortality compared to trials with an indeterminate outcome or harm (mean 28-day control group mortality rate: 0.439 vs. 0.275; p = 0.001). Figure 3b demonstrates that trials with higher control group mortality were more likely to demonstrate significant benefit. Conversely, Fig. 3c shows that intervention group mortality did not differ between trials demonstrating significant benefit and trials with an indeterminate outcome or harm (mean 28-day intervention group mortality rate: 0.271 vs. 0.301; p = 0.697). Figure 3d demonstrates that trials with higher or lower intervention group mortality rates were not more or less likely to demonstrate significant benefit.

Fig. 3

Differences in 28-day control group and intervention group mortality between significant and indeterminate trials, and the corresponding probability of a significant treatment effect. a Mean 28-day control group mortality was 43.9% in trials with a beneficial outcome versus 27.5% in trials with an indeterminate outcome or significant harm (p = 0.001). b The higher the control group mortality, the higher the probability of obtaining a beneficial trial outcome for the intervention group (p = 0.012). c Mean 28-day intervention group mortality did not differ between trials with significant benefit and trials with an indeterminate outcome or significant harm (27.1% vs. 30.1%; p = 0.697). d The probability of obtaining a beneficial trial outcome was not affected by intervention group mortality (p = 0.410).

Discussion

This systematic analysis of 67 ARDS RCTs in the lung-protective ventilation era revealed a statistically significant and clinically relevant amount of between-trial heterogeneity in reporting and outcome.

The description of patient characteristics was variable and often incomplete. Basic ventilation characteristics such as mean respiratory rate, FiO2, pH and PaCO2 were reported in a minority of trials.

The estimated range of 28-day control group mortality corrected for small trials and random error was 14 to 55%. This between-trial heterogeneity in control group outcomes could not be explained by differences in trial characteristics. Reported baseline patient characteristics explained some of the outcome heterogeneity, but the residual (unexplainable) range in control group mortality was still 18 to 45% after adjusting for the most predictive baseline characteristics. There was no secular trend in mortality outcomes over the period 2000–2019.

Notably, trials with higher control group mortality were more likely to report a significant benefit, even after adjustment for baseline mean severity of illness.

Relevance for clinicians

To assess the applicability of a trial’s result to individual patients, clinicians need a clear description of the study population and concomitant treatment from trial reports. In the present study, we identified important problems in this respect.

The variation in reporting of baseline variables was considerable, with ≥ 90% of all studies reporting on age and gender, but only 75% describing observed tidal volumes and PEEP, down to only a third of all trials reporting on lung injury scores or results from blood gas analyses.

After adjustment for significant baseline characteristics (mean age, mean LIS, mean plateau pressure, mean arterial pH), the residual (unexplainable) range in control group mortality was 18 to 45%. In other words, among trials with comparable inclusion criteria and comparable baseline patient characteristics, there were inexplicable 2.5-fold mortality differences (45%/18%). This indicates that there are very important yet unreported differences in ARDS populations, and possibly also differences in co-interventions and standard care.

This silent heterogeneity between trials makes it nearly impossible to evaluate whether RCT results are valid outside of the immediate trial context (i.e., the exact population in the actual participating centers). At its most extreme, this can be thought of as a generalizability crisis in ARDS research: we cannot know which trial results are transportable to which patients outside the trial [87, 88]. The generalizability crisis comes into clear focus when different RCTs show conflicting and statistically mutually exclusive results (benefit vs. no benefit or harm) of the same intervention. Conflicting study results are often ascribed to subtle differences in the intervention, while, in fact, it may be important yet unreported differences in the population or standard of care that are driving conflicting outcomes.

The finding that the number of exclusion criteria is inversely associated with the control group mortality rate only exacerbates the generalizability problem. It means that trial populations, especially those with a large number of exclusion criteria, likely differ from the intended (broader) target population.

Clinicians trying to gauge the applicability of a trial’s result should carefully review not only inclusion criteria and baseline characteristics, but also whether the control group mortality fits the apparent patient characteristics.

Relevance for ARDS researchers

What could account for the large unexplainable heterogeneity? Possible factors may be found in biology, standards of care and co-interventions, or measurement variability.

Statistical cluster analyses of various biological and clinical characteristics led to the identification of distinct ARDS subphenotypes, each associated with a different mortality risk, a different biochemical inflammatory profile and, importantly, differential responses to treatments such as PEEP, fluids, low-dose macrolide therapy or simvastatin [89,90,91,92,93,94]. We are only just beginning to appreciate the extraordinary biological heterogeneity of ARDS [95], which is undoubtedly one of the causes of between-trial variability in outcomes. It is currently not clear whether larger, more pragmatic trials or smaller high-adherence trials in more selected populations will provide more useful clinical information in the future.

Variability in standards of care and co-interventions may be another likely cause of between-trial heterogeneity. The LUNG-SAFE study revealed that many ARDS patients are undertreated or not treated according to the current best practices [1]. We cannot know the implications for interpreting RCTs, because standard care and co-interventions are almost never described in trial papers. Historically, this was due to restrictions in space and words allotted by scientific journals. However, there is an urgent need for future studies to report these details, nowadays enabled by the use of supplementary materials and public data repositories.

A final likely cause of between-trial heterogeneity may be measurement variability of baseline characteristics and clinical outcomes. Severity of illness scores, for example, are notoriously dependent on small variations in measurement definitions [96, 97]. The call for standardization of baseline and outcome measures in ARDS trials is not new, but the current study underlines its urgency and importance [98, 99].

Consequently, comprehensive reporting in a standardized manner on patient characteristics, standard care and concomitant treatments may be one step towards solving the generalizability crisis in ARDS research. Clearly, heterogeneity is not the only cause of the large number of indeterminate and conflicting trial results. Statistical shortcomings, such as underpowered studies [100] or overestimations of effect size [101], equally contribute to indeterminate trial outcomes. Moreover, qualifying studies as ‘indeterminate’ is a consequence of the frequentist statistical paradigm, while Bayesian analyses offer another perspective that often provides useful information about trial results [102].

Because we were limited to trial-level data in this study, we should be careful to avoid the ‘ecological fallacy’: individual-level relationships cannot be inferred from group-level data. This important limitation of the present study is also an important message: we will continue to fail to understand outcome heterogeneity between ARDS trials as long as we must rely on aggregated study-level data. Sharing (anonymized) individual patient data is likely to provide a path forward and provides an immense opportunity to stratify and subphenotype patients, to detect treatment benefit and harm for specific patient groups, and to find valuable therapeutic strategies in an inherently heterogeneous syndrome.

Complete standardization of reporting is unwarranted and can even be detrimental. Reported characteristics and outcomes should be tailored to the research question at hand. But the results in this study indicate that different trials lack sufficient common ground to validly compare trial populations. Creating this common ground for between-trial comparisons requires the reporting of a ‘core baseline set’: a commonly agreed-upon minimum set of descriptors to characterize the patient population of a trial (including standards of care and co-interventions). Developing this core population characteristics set requires meta-epidemiological data (which this study provides) and clinical domain expertise from a diverse sample of ARDS researchers.

Conclusion

Randomized controlled ARDS trials in the lung-protective ventilation era present a statistically significant and clinically relevant amount of heterogeneity in reporting and mortality outcomes. Differences in baseline characteristics partly explained the variability in outcome, but large unexplainable heterogeneity remained after extensive statistical adjustments. This study underlines the urgent need for standardized and comprehensive reporting of trial and baseline characteristics to diminish between-trial heterogeneity and to support the transportability of study results across populations.