Meta-analyses underpin guideline recommendations for clinical decisions and synthesize multiple studies into two important estimates: the pooled treatment effect and the non-random variation (heterogeneity) in treatment effect between studies. When different studies produce conflicting results, it is essential to identify the factors that vary between the studies in order to explain the heterogeneity in treatment effects. For example, perhaps corticosteroids are beneficial only in sepsis patients with high severity of illness [1]? Perhaps low tidal volume ventilation is beneficial only in patients affected by acute respiratory distress syndrome (ARDS) with low pulmonary compliance [2]? Here, we illustrate why identifying “treatment responders” requires individual patient data.

When relying on averaged patient characteristics reported in trial papers to explain between-trial differences in treatment effects, the observed direction of effect modification can be completely reversed from the direction that is relevant for individual treatment decisions: patient characteristics that appear to be associated with treatment benefit at the study level may actually be associated with harm at the individual patient level, and vice versa. We illustrate that the circumstances that lead to effect modification reversal can be expected to occur in intensive care-related research questions.

A reversal of conclusions

To illustrate the problem, we simulated study data. The R code to re-generate and analyze similar data can be found in the supplementary materials.

Suppose that four randomised controlled trials (RCTs) have been performed testing the same intervention. One trial showed benefit from the intervention, two trials showed no effect and one trial showed harm. Readers of the four trials noticed that the reported average severity of illness was lowest in the trial that showed benefit.

The hypothesis of effect modification by severity of illness is explored in a conventional meta-analysis that stratifies the trials by average illness severity. While the pooled effect of the two trials with low mean severity of illness is more consistent with benefit, the pooled effect of the two trials with high mean severity of illness clearly indicates harm. A meta-regression analysis is performed [3], which likewise shows a strong negative association between the studies’ mean severity of illness and treatment benefit (p < 0.01). The meta-analysts conclude that the treatment is harmful in patients with high severity of illness.

Meanwhile, the investigators of the individual studies collaborate to perform an individual patient data meta-analysis. They find a strong positive association between severity of illness and treatment benefit (p < 0.01) and conclude—in direct contrast with the first study-level meta-analysis—that treatment is beneficial in patients with high severity of illness.

The reversed conclusion about the direction of effect modification by severity of illness has important consequences for individual patient care. Should the therapy be avoided or recommended in patients with high illness severity?

The results from the meta-analysis that uses individual patient data are correct: within each of the four RCTs, more severely ill patients benefited more from treatment. Figure 1 shows the true relationship between severity of illness and treatment effect (panel A), the misleading study-level association between severity and treatment effect in a meta-analysis (panel B) and a meta-regression (panel C), and the correctly estimated association between severity and treatment effect using individual patient data (panel D).

Fig. 1

Averaged baseline characteristics and trial outcomes can lead to false inferences about factors that influence the efficacy of a tested intervention. In this simulated example, four RCTs are performed testing the same intervention. Panel A shows that, within each trial, the intervention is more beneficial (higher odds ratio for survival) for patients with higher severity of illness scores. Panels B and C show that, when severity scores and treatment effects are averaged at the trial level, the treatment appears to be less beneficial at higher severity scores. This can occur when a confounder that influences both severity of illness and treatment effect is unequally distributed among the four trials. Panel D shows that when individual patient data are available, a proper analysis gives the correct result of more benefit for patients with higher illness severity, even without measuring the hidden confounder variable.

Even though the study-level meta-analysis was wrong, no technical error was made. The true direction of effect modification is inherently unobservable from aggregated data at the study level.
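
The four-trial scenario above can be reproduced in miniature. The following is a minimal Python sketch, not the R code from the supplementary materials. It assumes a continuous outcome in which higher values are better (rather than survival odds ratios), and all numbers, function names, and the trial-level `confounder` variable are hypothetical choices made for illustration. By construction, the benefit of treatment rises by 0.5 per severity point within every trial, while the expected average benefit of a trial falls by 0.5 per point of mean severity.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def simulate_trial(rng, mean_severity, confounder, n=2000):
    """Return (severity, treated, outcome) rows for one hypothetical two-arm RCT."""
    rows = []
    for _ in range(n):
        severity = rng.gauss(mean_severity, 1.0)
        treated = rng.random() < 0.5  # 1:1 randomisation
        # Assumed truth: benefit rises with a patient's own severity (+0.5 per point)
        # but falls with the trial-level confounder (-1.0 per unit).
        benefit = 1.5 + 0.5 * severity - 1.0 * confounder
        outcome = -0.3 * severity + (benefit if treated else 0.0) + rng.gauss(0.0, 1.0)
        rows.append((severity, treated, outcome))
    return rows


def run_simulation(seed=1):
    """Return (study-level slope, average within-trial interaction)."""
    rng = random.Random(seed)
    # Hypothetical designs (mean severity, confounder): the confounder is
    # higher in the trials that enrolled sicker patients.
    designs = [(1.0, 1.0), (2.8, 2.8), (3.2, 3.2), (5.0, 5.0)]
    mean_severities, trial_effects, within_interactions = [], [], []
    for mean_severity, confounder in designs:
        rows = simulate_trial(rng, mean_severity, confounder)
        treated = [(s, y) for s, t, y in rows if t]
        control = [(s, y) for s, t, y in rows if not t]
        # Study-level summaries, as a trial report would publish them.
        mean_severities.append(mean(s for s, _, _ in rows))
        trial_effects.append(mean(y for _, y in treated) - mean(y for _, y in control))
        # Within-trial interaction: severity slope in treated minus slope in control.
        slope_treated = ols_slope([s for s, _ in treated], [y for _, y in treated])
        slope_control = ols_slope([s for s, _ in control], [y for _, y in control])
        within_interactions.append(slope_treated - slope_control)
    study_level_slope = ols_slope(mean_severities, trial_effects)
    within_trial_interaction = mean(within_interactions)
    return study_level_slope, within_trial_interaction


if __name__ == "__main__":
    study_level_slope, within_trial_interaction = run_simulation()
    print(f"study-level slope (meta-regression analogue): {study_level_slope:+.2f}")
    print(f"within-trial interaction (IPD analogue):      {within_trial_interaction:+.2f}")
```

Under these assumptions the two estimates scatter around -0.5 and +0.5, respectively: the same data support opposite conclusions depending on the level of aggregation, and the within-trial analysis never needs to observe the confounder.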

Explaining effect modification reversal

Effect modification reversal is a special case of Simpson’s paradox: the reversal of a statistical association when the level of data aggregation changes [4]. It can be explained as follows.

Suppose that a variable—the confounder—is unevenly distributed across RCTs and that a true effect modifier is correlated to the confounder. For example, one could imagine that the percentage of patients with extrapulmonary (instead of pulmonary) ARDS differs between several different RCTs testing low vs. high positive end-expiratory pressure (PEEP). Suppose that high PEEP is more beneficial in patients with high alveolar recruitability, which tends to be higher on average in extrapulmonary ARDS.

Effect modification reversal occurs when a hypothesized effect-modifying variable is positively correlated to the confounder, but negatively correlated to the true effect modifier, or vice versa. Severity of illness may be positively correlated to extrapulmonary ARDS (those with extrapulmonary ARDS are more severely ill) but negatively correlated with recruitability (patients with high recruitability are less severely ill, given their status of pulmonary vs. extrapulmonary ARDS).

The correlation structure in this example leads to effect modification reversal: studies including more patients with extrapulmonary ARDS have included more patients with high recruitability and, therefore, demonstrate more benefit from a high PEEP strategy. These studies also included, on average, more severely ill patients (extrapulmonary ARDS was positively correlated with severity). At the study level, treatment benefit appears to be positively associated with severity of illness (Fig. 1, panels B, C). In fact, given that any patient has pulmonary or extrapulmonary ARDS, treatment benefit is negatively associated with severity of illness, as severity of illness was negatively correlated with recruitability (Fig. 1, panel D). Note that the directions in this example mirror those of the simulated example in Fig. 1; the mechanism is the same.
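
The mechanism can be made explicit with a short derivation. This is a simplified linear sketch, and all symbols are our own assumptions rather than notation from the cited literature: $b_{ij}$ is the treatment benefit for patient $i$ in trial $j$, $x_{ij}$ is severity of illness, $\bar{x}_j$ is the mean severity in trial $j$, and $c_j$ is the trial-level confounder (for example, the proportion of patients with extrapulmonary ARDS).

```latex
\begin{align}
b_{ij} &= \beta\, x_{ij} + \gamma\, c_j
  && \text{(patient-level benefit)} \\
\bar{b}_j &= \beta\, \bar{x}_j + \gamma\, c_j
  && \text{(averaged within trial } j\text{)} \\
c_j &= c_0 + \kappa\, \bar{x}_j
  && \text{(confounder tracks mean severity across trials)} \\
\bar{b}_j &= \gamma\, c_0 + (\beta + \gamma\kappa)\, \bar{x}_j
  && \text{(what study-level data show)}
\end{align}
```

The study-level slope is $\beta + \gamma\kappa$, not $\beta$. It takes the opposite sign from $\beta$ whenever $\gamma\kappa$ opposes $\beta$ and $|\gamma\kappa| > |\beta|$. In the PEEP example, $\beta < 0$ (within an ARDS type, sicker patients benefit less), $\gamma > 0$ (extrapulmonary ARDS confers more benefit), and $\kappa > 0$ (trials with more extrapulmonary ARDS have higher mean severity), so the study-level slope can be positive although the patient-level slope is negative.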

Failing to identify the true effect modifier is not the heart of the problem. When the data are analyzed at the individual patient level, the correct direction of effect modification will be borne out. In our example, an analysis that estimates the interaction between severity and treatment while conditioning on the study (for example in a mixed-effects model) will demonstrate the true direction of effect modification.
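
One way to write such an analysis, given here as a sketch in our own notation rather than as the model used for Fig. 1, is a regression with study-specific terms and a severity-by-treatment interaction that is centered within each trial. Here $Y_{ij}$ is the outcome, $T_{ij}$ the treatment indicator, $x_{ij}$ severity, $\bar{x}_j$ the trial mean severity, and $g$ a link function (logit for a binary outcome such as survival); these symbols are assumptions introduced for illustration.

```latex
\begin{align}
g\!\left(\mathrm{E}[Y_{ij}]\right) &= \alpha_j + \lambda\, x_{ij} + \theta_j\, T_{ij}
  + \gamma_W\,(x_{ij} - \bar{x}_j)\, T_{ij}
  && \text{(individual patient data)} \\
\hat{\theta}_j &= \theta_0 + \gamma_A\, \bar{x}_j + \varepsilon_j
  && \text{(study-level meta-regression)}
\end{align}
```

The study-specific intercepts $\alpha_j$ and treatment effects $\theta_j$ (fixed or random) are what "conditioning on the study" means in practice. The within-trial coefficient $\gamma_W$ is estimated only from comparisons among patients in the same trial and is therefore unaffected by confounders that differ between trials, whereas the across-trial coefficient $\gamma_A$ absorbs them and may differ from $\gamma_W$ even in sign.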

Take-home message

The cause of between-trial heterogeneity cannot be reliably inferred from study-level data, whether assessed informally in the literature appraisal or formally through meta-regression analysis. Apparent and even statistically significant effect modification in one direction may actually be completely misleading. When only study-level data are available, interpreting between-study differences in patient characteristics as evidence of effect modification is misguided and potentially hazardous for individual care decisions. This underscores the need for international collaborative programs for data sharing and evidence synthesis using individual patient data [5,6,7,8].