Futility decisions rely on a prespecified conditional probability estimate of the likelihood that a trial, if carried to completion, would reach the statistical significance specified in the protocol. When futility is declared per protocol, it is the equivalent of a final analysis. The data monitoring committee reviewed the futility analysis, checked the conditional probability estimates, and in effect performed both an interim and a final efficacy analysis. Here, as well, the DMC discussed results with Biogen, which shared in the decision to stop (2). After futility is determined, subsequent analyses are post hoc subset analyses, not “prespecified” as Biogen would have it. Moreover, multiplicity, interim efficacy analyses for early stopping, and adjustments to control Type I error all come into play.
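As an illustrative sketch only (not Biogen’s actual computation, whose inputs we do not have), the conditional probability behind such a futility rule can be expressed with the standard B-value formulation: given the interim z statistic and the fraction of information observed, project the trend forward and ask how likely the final analysis is to cross the significance boundary. The numbers below are hypothetical.

```python
from math import erf, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def conditional_power(z_interim: float, t: float, z_alpha: float = 1.96) -> float:
    """Conditional power under the current trend (B-value formulation).

    z_interim : interim z statistic
    t         : information fraction observed so far (0 < t < 1)
    z_alpha   : final-analysis critical value (1.96 for two-sided alpha = .05)
    """
    b = z_interim * sqrt(t)       # B-value at information time t
    drift = z_interim / sqrt(t)   # projected drift if the current trend continues
    num = z_alpha - b - drift * (1.0 - t)
    return 1.0 - norm_cdf(num / sqrt(1.0 - t))

# Hypothetical illustration: about halfway through a trial (t = 0.5),
# a weak interim z of 0.5 yields very low conditional power --
# the kind of estimate on which a protocol futility rule acts.
print(round(conditional_power(0.5, 0.5), 3))
```

A trial stopped because this estimate falls below a prespecified threshold has, in that sense, already had its data judged against the protocol’s efficacy criterion.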
Biogen later argued that the assumptions underlying their futility analysis were violated. This is like saying, ‘Sorry, we did the wrong analysis, but never mind.’ According to Biogen these assumptions were, first, that the treatment effect in the two trials would be similar, and second, that the effect would be constant throughout the trials, meaning that later-enrolled patients would show the same effect as earlier-enrolled patients.
Some of this is inherently illogical: if the treatment effects differ between identically designed trials, then the trials do not replicate. One does not confirm or support the other, and the validity of both may be questioned. Facing discordant results while maintaining optimism, Biogen should have immediately started a third phase 3 trial in the spring of 2019, when it first learned the outcomes. FDA fully understands the need for replication and confirmation, yet it first argued, without apparent irony, to its advisory committee in November 2020 that the negative Engage trial did not contradict Emerge and was not needed for regular approval because a phase 1b study could fill the role of a confirmatory trial. Subsequently, as a condition for accelerated approval, FDA required that a third, similar phase 3 trial be done as a post-marketing requirement.
The second assumption, that the effect is constant throughout the trials, is one made in all trials that do not explicitly treat participants randomized earlier versus later differently in the pooled result. Put simply, in trial designs we expect the first patient to be like the last. Yet later participants often do differ from earlier-recruited participants, because the pool of easily available participants becomes depleted or because trial centers recruit at differing rates. If this is the explanation for why the futility analysis failed, then the treatment outcomes should apply only to the early subgroup, with the P value adjusted to correct for Type I error from the multiple analyses. However, since the trials were stopped about halfway through, this would be difficult to calculate, would be roughly equivalent to an interim analysis to stop early for efficacy, and would require a much more stringent boundary for statistical significance than a nominal P = .05. Indeed, Biogen’s statistical analysis plan called for an O’Brien-Fleming stopping boundary for an interim efficacy analysis (2) (FDA biostatistics report, page 20), and thus the critical P value would be much lower than .01.
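To illustrate how stringent such a boundary is, a Lan-DeMets O’Brien-Fleming-type spending function (a common implementation of the O’Brien-Fleming boundary; we do not know the exact spending function or information fraction in Biogen’s plan, so the halfway point used below is an assumption) gives the cumulative two-sided alpha available at an interim look:

```python
from math import erf, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def obf_alpha_spent(t: float, z_alpha: float = 1.96) -> float:
    """Lan-DeMets O'Brien-Fleming-type spending function:
    cumulative two-sided alpha spent by information fraction t,
    for an overall two-sided alpha of .05 (z_alpha = 1.96)."""
    return 2.0 * (1.0 - norm_cdf(z_alpha / sqrt(t)))

# With roughly half the information in hand (t = 0.5, an assumption
# for illustration), the nominal significance level available at the
# interim look is on the order of .005 -- far below .05, and below .01.
print(round(obf_alpha_spent(0.5), 4))
```

At t = 1 the function spends the full .05, which is what makes an O’Brien-Fleming design attractive: almost all of the alpha is preserved for the final analysis, at the price of a very demanding interim threshold.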
In any event, the authors want us to forget the futility analysis, ignore the Engage trial, accept the unusual changes to the Emerge placebo group as ordinary, discount the functional unblinding and other risks for bias, and simply accept Emerge as a positive trial with “clinically meaningful” outcomes. We cannot.