Introduction

The fundamental criteria from the consensus definitions of septic shock are used to select patients for inclusion in clinical studies [1,2,3,4]. While the mortality rate of septic shock was found to be 46% (95% confidence interval (CI) 43–50%) in a meta-analysis of observational cohorts [5], randomized controlled trials report more diverse numbers. For example, two high-profile septic shock trials published a year apart reported control group mortality rates as disparate as 16% [6] and 80% [7]. Despite this seemingly wide range of mortality rates, there has not yet been a systematic inquiry into its patterns and possible causes.

Identifying the correct patient population to benefit from a specific therapy has been recognized as an essential condition for improving critical care research [8,9,10]. Yet large unexplained mortality differences among trials that all aim to include septic shock patients may hamper reproducibility and generalizability. Insights into the magnitude and sources of between-trial heterogeneity are therefore valuable in the design, reporting, and interpretation of septic shock trials. For example, incorrect prediction of baseline mortality rates has been identified as a major reason for negative critical care trials, as a discrepancy between expected and observed event rates often leads to underpowered studies [11].

We sought to quantify between-trial heterogeneity and identify inclusion criteria and population characteristics associated with differences in control group mortality rates.

Methods

After a systematic search to identify all trials published in the past decade that aimed to include patients with septic shock, we used linear mixed models to estimate the total heterogeneity in control group mortality rates and its association with reported baseline characteristics. Using both a multivariate linear model and a machine learning algorithm, we estimated the proportion of heterogeneity that can be explained by population characteristics.

The review protocol was prospectively registered [12] and adheres to the PRISMA checklist [13], which is included in the electronic supplementary material (ESM). Study screening, application of the inclusion and exclusion criteria, and data extraction were performed independently by two reviewers (HJdG and JP). Conflicting entries were resolved by consensus.

Inclusion criteria and search strategy

PubMed, Embase, and the Cochrane Central Register of Controlled Trials were queried using the search term [“septic shock” AND (random* or rct)]. Embase was additionally queried using the search term “septic shock” with the randomized controlled trial filter activated. The queries were limited to publications from 1 January 2006 onwards and were last performed on 20 January 2018.

We limited the search to trials published between 2006 and 2018 as a compromise between the number of eligible studies and secular trends in clinical practice, research practice, and reporting standards. Publications from 2006 and later had sufficient lead time to incorporate the 2004 update of the Surviving Sepsis Campaign guidelines [4].

Eligible for inclusion were parallel-group randomized controlled trials with adult patients in septic shock according to the published consensus definitions or Surviving Sepsis Campaign guidelines [1, 2, 4]. Trials were excluded if the report was not written in English, if it was only available as an abstract, if no baseline characteristics were reported, or if no mortality outcome was reported. Trials that aimed to include a specific subcategory of septic shock patients (e.g. “septic shock patients requiring renal replacement therapy”) were also excluded, as these would be a major source of between-trial heterogeneity.

Identification of the control group and variables of interest

Because the nature of the randomized intervention could contribute to heterogeneity, we focused on the control groups. For each trial, we identified the control group as defined by the authors as ‘control group’, ‘usual care group’, or a variation thereof. When no control group could be identified (in a comparison of two usual care therapies), we defined the control group as the mean of the two groups in terms of sample size, mortality, and baseline characteristics. We tested the sensitivity of the results to this construct by analyzing whether trials with and without specifically defined control groups differed in terms of mean mortality or the amount of between-trial heterogeneity.

For each trial, we recorded the type of intervention, single- or multicenter design, and the primary endpoint. Trials were graded according to the Jadad scale [14]. For the control group in each trial, we recorded the sample size, the reported baseline characteristics, and the mortality rates.

Estimation of heterogeneity in mortality rates and associations with population characteristics

We used 28-day mortality throughout all analyses. For trials that did not report this outcome, we estimated 28-day mortality based on reported hospital, ICU, or 90-day mortality using linear regression with data from trials that reported both 28-day and another mortality measure.
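The conversion described above can be illustrated with an ordinary least-squares fit. The following is a minimal Python sketch, not the analysis code used in this study: the paired mortality values are hypothetical, and `fit_line` and `r_squared` are illustrative helper names.

```python
# Hypothetical (hospital mortality, 28-day mortality) pairs, as proportions,
# standing in for trials that report both measures. Illustrative values only.
pairs = [(0.30, 0.27), (0.45, 0.41), (0.52, 0.47), (0.38, 0.35), (0.60, 0.55)]


def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(points, intercept, slope):
    """Proportion of variance in y explained by the fitted line."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


intercept, slope = fit_line(pairs)

# Estimate 28-day mortality for a trial that only reports hospital mortality.
hospital_mortality = 0.40
estimated_28_day = intercept + slope * hospital_mortality
print(round(estimated_28_day, 3), round(r_squared(pairs, intercept, slope), 3))
```

A separate fit of this kind would be needed for each alternative measure (hospital, ICU, and 90-day mortality).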

To analyze mortality rates across trials we used a random-effects meta-regression model with the log odds of mortality as dependent variable and a random intercept for each study. Each trial was weighted by the inverse of the sampling variance of the mortality rates. A maximum likelihood estimator was used to estimate the mean mortality (random effects pooled estimate), the between-study standard deviation due to heterogeneity (τ), and the percentage of variation due to heterogeneity rather than chance (I2). To quantify between-trial heterogeneity, we report the 95% prediction interval (mean mortality ± 1.96 τ), which represents the distribution of estimated future mortality rates based on observed mortalities weighted by sampling variance (trial size) and corrected for random chance [15]. In the absence of between-study heterogeneity, the 95% prediction interval is equal to the 95% confidence interval, but when significant heterogeneity is present the prediction interval estimates the bandwidth of expected mortality rates from similar studies [15, 16]. In other words, the 95% prediction interval can be thought of as an estimate of the true between-study distribution of mortality rates. The prediction interval can therefore be used to guide power calculations for future studies [16].
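To make the quantities in this paragraph concrete, the sketch below pools log odds of mortality across a handful of hypothetical control groups. It is a simplified illustration under stated assumptions: it uses the closed-form DerSimonian-Laird moment estimator of τ rather than the maximum likelihood estimator used in this study, the trial counts are invented, and the function names are illustrative.

```python
import math

# Hypothetical control groups as (deaths, patients). Illustrative values only.
trials = [(12, 60), (45, 110), (30, 52), (80, 300), (25, 40)]


def logit_and_variance(deaths, n):
    """Log odds of mortality and its approximate sampling variance.

    Assumes 0 < deaths < n so that both terms are defined.
    """
    survivors = n - deaths
    return math.log(deaths / survivors), 1 / deaths + 1 / survivors


def expit(x):
    """Back-transform log odds to a proportion."""
    return 1 / (1 + math.exp(-x))


def random_effects(trials):
    """Pooled log odds, tau, and I2 (DerSimonian-Laird, not maximum likelihood)."""
    ys, vs = zip(*(logit_and_variance(d, n) for d, n in trials))
    w = [1 / v for v in vs]  # inverse-variance weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))  # Cochran's Q
    df = len(trials) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)  # between-trial variance
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    # Random-effects weights add tau2 to each trial's sampling variance.
    w_re = [1 / (v + tau2) for v in vs]
    mu = sum(wi * yi for wi, yi in zip(w_re, ys)) / sum(w_re)
    return mu, math.sqrt(tau2), i2


mu, tau, i2 = random_effects(trials)
print(f"mean mortality {expit(mu):.1%}, tau {tau:.3f}, I2 {i2:.0%}")
```

Because the pooling is done on the log-odds scale, the mean and any interval around it are back-transformed with `expit` before being reported as percentages.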

The between-trial heterogeneity in mortality rates was calculated for subcategories of trials employing different inclusion criteria: confirmed or suspected infection; confirmed infection only; different definitions of hypotension; mandatory hyperlactatemia; mandatory vasopressor therapy; and mandatory mechanical ventilation. Differences in mortality rates between subcategories were calculated by addition of dummy variables to the mixed-effects model.

To estimate the association between study and population characteristics and mortality, these variables were added to the model as covariates. Residuals were checked for normality with Q–Q plots, and the goodness of fit of the log‐linear model was compared with quadratic and power models by selecting the model with the lowest Akaike information criterion (AIC). To facilitate comparisons between variables, we report standardized regression coefficients (β) and the proportion of between-trial variability in mortality explained by the population variable (unadjusted R2) for all univariate analyses.

Predicting mortality rates using a linear model and recursive partitioning

We then constructed a comprehensive model to predict between-study differences in mortality. Population variables that were reported by at least 25% of the included trials with a univariate regression R2 ≥ 0.10 were included as regressors in a multivariate model and removed in a stepwise manner for P values ≥ 0.05. The threshold R2 of 0.10 was a compromise between the number of variables and the limited number of observations. This model selection process was not prospectively protocolized as the number of eligible variables could not be estimated a priori. Multiple imputation (generating 20 datasets) with predictive mean matching was used for missing observations (i.e., missing population characteristics). The imputation methods are further described in section 7 of the ESM.

As a complementary approach to predict 28-day mortality rates from population characteristics, we constructed a regression tree model based on recursive partitioning (a machine learning algorithm) [17, 18] for its ability to handle partially missing observations (obviating the need for imputation) and its robustness to nonlinear relations. We set up the model to predict 28-day mortality based on all inclusion criteria and population characteristics. In short, the recursive partitioning algorithm selected the most informative variable, which was then ‘split’ at the value that best differentiated low from high mortality. The algorithm then selected the most informative variable for each of the two resulting subgroups, and split it again. When a splitting variable was missing for a specific trial, a surrogate variable (the variable most closely correlated to the splitting variable) was used. After multiple splits, this recursive partitioning resulted in a regression tree (similar to a decision tree) with subgroups of trials ranked from low to high expected mortality. R2 represents the proportion of the variance in mortality explained by the regression tree. Overfitting was examined using the cross-validated error.
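The core step of recursive partitioning, choosing where to split one variable, can be sketched as follows. This is a simplified Python illustration and not the rpart implementation: it handles a single variable, ignores trial weights, surrogate variables, and pruning, and places candidate thresholds midway between adjacent observed values (an assumption of this sketch). The data are hypothetical.

```python
def sum_squared_error(outcomes):
    """Squared deviations of the outcomes around their own mean."""
    if not outcomes:
        return 0.0
    mean = sum(outcomes) / len(outcomes)
    return sum((y - mean) ** 2 for y in outcomes)


def best_split(values, outcomes):
    """Return (threshold, error) for the binary split of `values` that
    minimises the summed squared error of `outcomes` in the two subgroups."""
    pairs = sorted(zip(values, outcomes))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = sum_squared_error(left) + sum_squared_error(right)
        if best is None or error < best[1]:
            best = (threshold, error)
    return best


# Hypothetical trial-level data: mean age (years) and 28-day mortality.
mean_age = [50, 55, 60, 70, 75]
mortality = [0.20, 0.22, 0.21, 0.50, 0.52]
print(best_split(mean_age, mortality))
```

A full tree repeats this search over every candidate variable, keeps the variable and threshold with the largest error reduction, and then recurses into each resulting subgroup.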

For all analyses, P < 0.05 was considered significant. The analyses were performed in R version 3.4.2 using the metafor, mice and rpart packages [19,20,21].

Results

Characteristics of the included trials

The search resulted in 65 trials that met all inclusion and exclusion criteria (eFigure 1 in the ESM), representing a total of 8634 control group patients [6, 7, 22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84]. A list of excluded trials is available in the ESM. The trial characteristics are presented in Table 1.

Table 1 Characteristics of included trials

Twenty trials (31%) did not report 28-day mortality but only hospital mortality, ICU mortality, or 90-day mortality. Using trials that reported multiple mortality measures, 28-day mortality was estimated as a linear function of hospital mortality, ICU mortality, or 90-day mortality (R2 values 0.99, 0.98, and 0.98, respectively). The estimates and validation plots are presented in eTable 1 and eFigure 2 of the ESM.

In 14 trials (21%) the control group could not be identified because two usual care therapies were compared. For these trials, the control group characteristics and mortality rates were defined as the means of the two treatment groups. None of these 14 trials reported significant mortality differences between the treatment groups.

The distribution of mortality rates

The control group mortality rates ranged between 13.8 and 84.6%, with a random-effects estimated mean mortality rate of 38.6%. There was significant heterogeneity among trials (I2 = 93%, τ = 0.710, p < 0.0001), and the 95% prediction interval was 13.5–71.7%.
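The reported prediction interval can be reproduced from the pooled estimate and τ. The short Python sketch below assumes, as described in the Methods, that the interval is calculated as the mean ± 1.96 τ on the log-odds scale and then back-transformed to a proportion; it ignores the uncertainty in the pooled mean itself.

```python
import math


def logit(p):
    """Proportion to log odds."""
    return math.log(p / (1 - p))


def expit(x):
    """Log odds to proportion."""
    return 1 / (1 + math.exp(-x))


mean_mortality = 0.386  # reported random-effects pooled estimate
tau = 0.710             # reported between-trial SD, assumed on the log-odds scale

centre = logit(mean_mortality)
lower = expit(centre - 1.96 * tau)
upper = expit(centre + 1.96 * tau)
print(f"{lower:.1%} to {upper:.1%}")  # prints "13.5% to 71.7%"
```

The interval is asymmetric around 38.6% on the percentage scale because it is symmetric on the log-odds scale.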

Figure 1 shows the mortality rates of trials categorized by inclusion criteria. The mean mortality rate did not differ between trials with different definitions of hypotension, infection (confirmed vs. suspected), or vasopressor or mechanical ventilation inclusion criteria. There were no significant differences in mean mortality rate or in heterogeneity between large vs. small trials, monocenter vs. multicenter trials, unblinded vs. blinded trials, high-quality vs. low-quality trials, or trials with vs. without a specifically defined control group (eTable 2 in the ESM).

Fig. 1 Control-group mortality rates categorized by trial inclusion criteria. The diamonds represent the mean mortality rates and 95% confidence intervals. The 95% prediction intervals (dashed lines) represent the estimated between-trial variability in mortality rates after adjusting for random chance and sample size. I2 represents the proportion of between-trial variability that cannot be explained by chance. There were no significant differences in mean mortality rates between inclusion criteria. MAP mean arterial pressure, SBP systolic blood pressure

The exclusion criteria employed in the trials were too diverse for statistical analysis, but the total number of exclusion criteria (ranging from 0 to 30) was inversely associated with the mortality rate (β = −0.375, R2 = 0.14, P = 0.007).

The heatmap in Fig. 2 provides an overview of the between-trial differences in mortality rates and population characteristics. The log-linear associations between the mortality rate and reported control group baseline characteristics are presented in Table 2 (goodness-of-fit statistics are reported in eTable 3 in the ESM). There was no significant decrease in mortality over the period 2006–2018, with only 4% of the heterogeneity (R2) explained by the year of publication (Table 2, eFigure 3). Baseline variables that were univariately associated with mortality were: mean Sequential Organ Failure Assessment (SOFA) score, the proportion of patients on mechanical ventilation, the proportion of patients on vasopressors, and mean serum creatinine. Regression plots of selected associations are shown in eFigure 3 of the ESM.

Fig. 2 Heatmap of included trials (n = 65) and associated baseline characteristics, ranked by decreasing mortality rates. White tiles represent the mean value across trials, while red and blue tiles are indicative of higher and lower than average values, respectively. Gray tiles (N/A) are variables that were not reported. The 28-day mortality rate ranged between 13.8 and 84.6%, with a mean of 38.6%. APACHE Acute Physiology and Chronic Health Evaluation, SAPS Simplified Acute Physiology Score, SOFA Sequential Organ Failure Assessment score, MAP mean arterial pressure, CVP central venous pressure, CNS central nervous system. * Variables with a significant univariate association with 28-day mortality

Table 2 Univariate associations between mortality rates and reported mean or median population characteristics

Predicting mortality rates from population characteristics

Details of the variable selection process for the multivariate model are available in section 7 of the ESM. Significant independent variables in the final multivariate model were: baseline mean SOFA score (β = 0.39, standardized standard error (SSE) = 0.17, P = 0.019), the proportion of patients on mechanical ventilation (β = 0.42, SSE = 0.18, P = 0.019), and mean serum creatinine (β = 0.31, SSE = 0.10, P = 0.0015). The multivariate model R2 was 0.41 with significant residual heterogeneity (I2 = 82%, τ = 0.544, P < 0.0001). Figure 3 shows the predicted and actual mortality rates of the included trials.

Fig. 3 Included trials ordered by predicted control group mortality rate (diamonds). The predicted mortality rates were based on a multivariate weighted random-effects regression model with baseline mean Sequential Organ Failure Assessment (SOFA) score, the proportion of patients on mechanical ventilation, and mean serum creatinine as significant independent variables. The squares and brackets are the observed control-group mortality rates with 95% confidence interval. The figure illustrates that the model explained 41% (R2) of the variability in mortality rates, with significant residual heterogeneity (P < 0.0001). The red dots are the reported a priori expected mortality rates used for sample size calculations

The recursive partitioning algorithm resulted in a regression tree with the following variables as informative determinants of the mortality rate: mean age (split at 64.8 years); the proportion of patients with a respiratory infection (split at 54.5%); the proportion of patients on mechanical ventilation (split at 74.3%); and the proportion of male patients (splits at 63.8 and 53.8%). The R2 value of the regression tree was 0.42. The cross-validated relative error decreased to below the root (split 0) value, which indicates that the tree was not overfitted. The results from the regression tree analysis are further described in eFigures 4 and 5 of the ESM (section 7).

Discussion

In this analysis of 65 septic shock trials published in the past decade, we found a statistically significant and clinically relevant amount of heterogeneity in control group mortality rates. The mean mortality rate was 38.6% with estimated 95% prediction limits of 13.5–71.7%, revealing a wide range in underlying mortality rates after discounting the effects of random chance and small trials.

In contrast to findings from large observational studies that the mortality of sepsis has decreased in the past decade [85, 86], we found only a small nonsignificant decline in the period 2006–2018. Different inclusion definitions of septic shock did not affect mean mortality rates, but a higher total number of exclusion criteria was associated with lower mortality. We used three statistical methods to analyze the association between population characteristics and mortality.

The univariate associations reflect how the reader of a trial report could interpret the population characteristics in relation to the mortality rate, and show that the proportion of ventilated patients, mean SOFA score, and the proportion of patients on vasopressor support were most informative (i.e. had the highest standardized regression coefficients).

The multivariate linear model (with missing observations imputed) shows which combinations of characteristics were predictive of mortality if all trials hypothetically reported the same variables. A combination of three independently significant characteristics (mean SOFA score, proportion of ventilated patients, and mean creatinine) explained only 41% of the heterogeneity in mortality rates across trials.

The recursive partitioning algorithm, which is not limited by dependence on multiple imputation and the assumption of linearity, shows which characteristics were most informative, given that different trials report different characteristics. The resulting regression tree explained only 42% of the heterogeneity in mortality.

The linear model and the regression tree arrived at different predictor variables because the linear model is biased towards more informative linear associations, while the regression tree allows for nonlinear relations and is biased towards variables with less missing data.

In all, these results indicate that there are clinically significant between-trial differences in control group mortality rates, and that these differences are not associated with differences in inclusion criteria and only weakly associated with reported baseline characteristics. Visual inspection of the heatmap (Fig. 2) shows that there are no unambiguous patterns in the relation between population characteristics and mortality rates. This heterogeneity is reflected in our finding that different statistical methods result in different predictive variables.

Possible sources of residual heterogeneity

Residual heterogeneity among trials may be caused by population differences in nutrition and socio-economic status, heterogeneous exclusion criteria, incomplete reporting, between-trial differences in variable definitions, the timing of randomization, and differences in post-randomization co-interventions and standards of care.

We found that no single measure of chronic comorbidity was reported in more than 40% of the included trials and that characteristics of causative pathogens were reported in only 28–39% of trials. This compromised the power of our analysis to detect associations across all trials, but, more importantly, it also prevents readers of trial reports from evaluating and comparing populations among trials and from judging to what extent a trial population corresponds to the population under their care.

Another source of heterogeneity is the imprecise definition of many variables. It is unclear whether a variable like ‘pre-existing kidney disease’ in one trial has the same meaning as ‘chronic renal insufficiency’ in another trial. Minor variations in variable definitions and data capture methods have been shown to lead to significantly different septic shock populations and to inter-observer variability in severity-of-illness scoring systems [5, 87, 88]. The importance of this ‘fine print’ in defining a population does not receive due attention in the methods section of most trials.

The time of inclusion may be an additional source of heterogeneity. Patients recruited later after the diagnosis of septic shock have not responded to treatment in an earlier phase and are therefore likely to have a worse prognosis. Only 13 trials reported the time from diagnosis to randomization, and for those trials it explained 22% of the heterogeneity.

While we have focused on inclusion criteria and baseline characteristics, the prognosis of septic shock may be largely influenced by post-randomization standards of care and co-interventions. Unfortunately, co-interventions and (control group) treatment standards are often described as ‘according to the Surviving Sepsis Campaign guidelines’ or not discussed at all in trial reports. Variables describing important post-randomization interventions, such as red blood cell transfusions, vasopressor dose, or fluid balance were recently found to be reported in only 33, 17, and 13% of large septic shock trials, respectively [89].

We did not analyze the association between trial countries and the mortality rate because many countries are represented by a single trial in the present sample. Nevertheless, between-country differences in standards of care or access to early healthcare may account for part of the residual heterogeneity. Large international observational studies are a more appropriate instrument for the investigation of differences in mortality rates among countries.

Implications for investigators and clinicians

Clinicians expect clinical trials to be relevant, reproducible, and generalizable to a clearly defined patient population. The results of this study indicate that many of the baseline characteristics upon which clinicians rely to gauge the applicability of trial results to their practice are in fact only weakly or not at all associated with mortality outcomes across trials.

The association between the number of exclusion criteria and mortality suggests that many seemingly inconsequential criteria together may have a significant effect on the composition of a trial population. Investigators should therefore be aware of this phenomenon in the design phase of a trial, as it affects the generalizability and external validity of trial results.

The wide prediction limits of control-group mortality have consequences for sample size calculations. Detecting a relative risk reduction of 25% with 80% power requires 245 patients if mortality is estimated to be 71.7%, while it requires 795 patients if control group mortality is 38.6% or 2980 patients if mortality is 13.5%. In practice, misestimation of the mortality rate by more than 7.5% occurred in 65% of critical care trials [11]. We therefore suggest that sample size calculations should not be based on the mean of reported control-group mortality rates in the literature but should be robust towards a wider range of expected event rates.
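The sensitivity of the required sample size to the assumed control-group mortality can be explored with a standard normal-approximation formula for comparing two proportions. The Python sketch below is an illustration under stated assumptions: a two-sided α of 0.05, equal group sizes, no continuity correction, and an illustrative function name. The formula variant behind the figures quoted above is not specified, so this sketch gives totals of the same order but not identical numbers.

```python
import math
from statistics import NormalDist


def total_sample_size(control_mortality, rrr=0.25, alpha=0.05, power=0.80):
    """Total patients (two equal groups) needed to detect a relative risk
    reduction `rrr`, using the unpooled normal approximation."""
    p1 = control_mortality
    p2 = control_mortality * (1 - rrr)  # expected intervention-group mortality
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    per_group = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return 2 * math.ceil(per_group)


# Upper prediction limit, pooled mean, and lower prediction limit.
for mortality in (0.717, 0.386, 0.135):
    print(f"{mortality:.1%}: {total_sample_size(mortality)} patients")
```

Running the function over a range of plausible control-group mortality rates, rather than a single literature mean, shows how quickly the required sample size grows as the expected event rate falls.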

Reproducibility and generalizability also require a common phenomenological structure with respect to diagnostic definitions, inclusion criteria, patient characteristics, concomitant treatment, and outcomes. A recent review of large septic shock trials found that only half of the information deemed necessary for evaluation of the control group was reported in the investigated trials [89]. In the present study, we now find that many of the reported characteristics are not associated with control-group mortality rates, possibly due to variations in variable definitions.

The third consensus definitions for sepsis and septic shock were partly developed to harmonize the inclusion criteria for clinical studies [3]. We were unable to analyze a subset of trials with populations that might fit the Sepsis-3 septic shock definition, as none of the included trials employed both delta SOFA score and vasopressor inclusion criteria. We do note that SOFA score is independently associated with mortality rates, although baseline SOFA explains only 33% (R2) of the variation in mortality rates in the 37 trials that report it. Furthermore, we found significant heterogeneity within subsets of trials employing similar inclusion criteria (Fig. 1).

We suggest that an international consensus is necessary to standardize variable definitions, data collection, and reporting of patient characteristics and outcomes for sepsis trials, as has been proposed before [89,90,91,92]. The feasibility of harmonizing study protocols has been demonstrated in three large trials investigating early goal-directed therapy [93]. The present results indicate that SOFA score, the proportion of ventilated patients, and creatinine independently reflect baseline risk across trials and should therefore be reported for each trial.

The results from this study also support the practice of data sharing, as we have shown that aggregated population characteristics are less informative than expected. Sharing individual patient data will not only increase the power to detect treatment effects across multiple studies but can also be used to test the generalizability of trial results vis-à-vis large cohorts with septic shock.

Strengths and limitations

This study was performed with a prospectively registered protocol and analysis plan. We chose to include only trials published between 2006 and 2018 to minimize the influence of long-term secular trends in septic shock diagnosis, treatment, and mortality [94, 95]. The search strategy was broad and comprehensive, but we excluded 40 trial reports not written in English, which compromised power and generalizability. We excluded trials that recruited only septic shock patients with specific organ dysfunction (such as kidney or liver failure) to rule out this source of between-trial heterogeneity.

For 20 trials, 28-day mortality was estimated using another reported mortality rate. Although the prediction equations were very precise (R2 values ≥ 0.98), we cannot rule out the possibility that this influenced the results. Excluding these 20 trials would have eroded the power of the study.

Importantly, using study-level data means that, to avoid the ecological fallacy, we cannot make inferences about predictive characteristics at the individual patient level, although several predictor variables are known to be individually associated with mortality (e.g. high SOFA score as a risk factor [96, 97]). The fact that there was substantial variation in the reporting of baseline variables was an important finding in itself, but also limited our power to detect associations across trials. A more in-depth investigation into the heterogeneity among trial populations would require individual patient data, but we think that obtaining such data would lead to significant selection bias.

Conclusion

Septic shock is a syndrome with various etiologies, biochemical characteristics, and phenotypes [9, 98]. Onto this inherently heterogeneous syndrome, a layer of investigator-induced heterogeneity is added when trials employ different inclusion criteria, report different variables, and use different variable definitions. This compounded complexity causes heterogeneity among trial populations that may go unnoticed. We have shown that control-group mortality rates are very dissimilar across trials, and that the majority of this heterogeneity remains unexplained after accounting for reported population characteristics. The lack of standardized reporting limits the usefulness of the variables explaining the mortality differences found in this study. In all, the substantial between-trial heterogeneity limits the reproducibility and generalizability of septic shock research and may inhibit the discovery of beneficial therapies for specific (sub)populations. The findings of this study therefore strongly support the argument for profound standardization and harmonization of septic shock trial reporting as well as data-sharing policies to test the external validity of trial populations.