Background

Non-inferiority (NI) randomized clinical trials aim to demonstrate that an experimental treatment is not inferior to the control treatment by more than a pre-specified margin [1]. This margin should be formulated according to prior knowledge and clinical relevance [1, 2]. It has been shown, for instance in paediatrics, that the choice of margin is not well documented in 63% of studies [3]. However, when no reliable placebo-controlled historical data exist, and when conducting such a trial is unethical due to changes in practice, margins based solely on clinical judgement could be acceptable, provided they are constructed with rigorous methods, such as a systematic analysis of several independent experts’ opinions.

When conducting a trial, the analysis of some secondary outcomes, in addition to the primary endpoint, might be challenging, as the sample size was not specifically calibrated for them. This issue is of particular importance for safety events, all the more so for rare but critical safety outcomes, which might not occur at all or be observed only a few times. Consequently, individual trials are usually underpowered to detect safety differences and to ensure reliable conclusions. Efficient methods using meta-analysis tools have been proposed for the detection of rare events, in order to improve overall power. Nevertheless, many meta-analysis methods are based on large-sample approximations and may be unsuitable when events are rare [4]. Moreover, regulatory agencies and investigators may not wish to wait for post-marketing studies to draw conclusions about rare but serious outcomes of a new intervention; they might also want reliable safety information before the end of a trial.

When considering NI trials, investigators would like to monitor whether the difference in safety outcomes between arms is clinically relevant. In this case, reasoning similar to that used for the primary efficacy outcome can be applied, using specific NI margins. In settings where events are rare, a Bayesian approach seems appropriate for constructing sequential stopping rules. Several authors have proposed Bayesian designs for NI trials [5, 6]. Gamalo et al. proposed a Bayesian NI approach for binary endpoints in which the active control’s treatment effect is estimated from historical data under a fixed-margin assumption [7]. However, this Bayesian decision criterion relies on historical placebo-controlled data, depends on a single final analysis, and comes with no recommendation for defining the prespecified decision threshold.

We propose a Bayesian NI sequential design to monitor several dichotomous safety events, with margins based on clinical relevance elicited from several experts.

Motivation

The ongoing BETADOSE study (NCT02897076) aims to demonstrate that a 50% reduced betamethasone dose regimen is not inferior to the full dose in preventing neonatal severe respiratory distress syndrome [8]. Several studies have proven the benefit of antenatal corticosteroids such as betamethasone, which is therefore used worldwide in pregnant women at risk [9–13]. However, concerns persist regarding long-term, mainly dose-related, adverse events of antenatal corticosteroids [14–16].

The trial plans to include 1571 women per arm in 37 French centres. Sequential data analyses have been planned after every additional 300 newborns reach the primary outcome.

As a safety secondary objective, the protocol plans to monitor, at each interim analysis, the absence of an excess of four other neonatal complications, i.e., neonatal death, severe intraventricular haemorrhage (IVH), necrotising enterocolitis and retinopathy, in two gestational age subgroups of neonates (<28 weeks, 28–32 weeks).

Because only 33% of the randomized women are expected to deliver before 32 weeks, and because of the low frequency of some complications in preterm children, the trial planning had to cope with an expected low number of some secondary events (based on the EPIPAGE-2 cohort study; Additional file 1) [17]. As a consequence, a standard frequentist analysis of those outcomes, consisting of tests repeated at each interim analysis, might lack power.

The Bayesian approach proposed in this manuscript will be applied to this trial (as a complement to the frequentist analysis) so that the Data Safety Monitoring Board and the investigators can evaluate the difference in the interpretation of the results and the benefit of such an approach.

Methods

Let i=0,1 be the arm index (1 for the half dose, 0 for the full dose) and j=1,2,3,4 the event index. For the sake of clarity, we show the methodology and results for only one subgroup of neonates (<28 weeks), as the approach can be repeated in the other subgroup. We used a Bayesian non-inferiority approach, detailed in the next subsection. If πi,j denotes the event rate in the ith arm, and Dj∈(0,1) the maximal acceptable difference, the probabilities of interest are \(Pr(\pi_{1,j} - \pi_{0,j} > D_{j})\). To account for differences in event prevalence and relative severity, this approach was applied separately to each event j. In our setting, the quantity Dj is not a fixed value but a distribution fitted from elicited experts’ opinions through a mixture of beta distributions, so as to retain the variability between experts. The setting of prior distributions and decision thresholds is detailed in the following subsections. A practical example is then given using a simulated dataset that mimics the trial. A summary of the general framework is presented in Fig. 1.

Fig. 1 General framework describing the two steps of the decision rule building. This figure summarizes the general framework, divided into two steps: (1) fit margins from experts’ elicitation (Dj); (2) sensitivity analysis to choose the prior and the decision thresholds

Bayesian non-inferiority approach

For each event j and arm i, let yi,j,n denote the observed binary outcome for the nth subject, ni the total number of observations and \(Y_{i,j} = \sum _{n=1}^{n_{i}}{y_{i,j,n}} \) the number of events. Following a Bayesian binomial model, we have

$$ Y_{i,j} \sim Bin(n_{i}, \theta_{i,j}) $$
(1)

where \(\theta_{i,j} \sim Beta(\alpha_{i,j}, \beta_{i,j})\) are considered as random variables following a beta prior density. In this setting, the posterior distribution of each θi,j is given by:

$$ \theta_{i,j} \vert Y_{i,j} \sim Beta(\alpha_{i,j} + Y_{i,j}, \beta_{i,j} + n_{i} - Y_{i,j}) $$
(2)

Indexing the interim analyses by l, l∈{1,…,L}, we calculate for each event at each analysis the posterior probability that the difference in event rates, \(\theta_{1,j} - \theta_{0,j}\), is higher than the acceptable difference distribution Dj:

$$ P\left(\delta^{l}_{j}\right) = P\left(\theta_{1,j} - \theta_{0,j} > D_{j} \ \vert \ Y^{l}_{1,j}, Y^{l}_{0,j}\right) = \int_{0}^{1} P\left(\theta_{1,j} - \theta_{0,j} > x \ \vert \ Y^{l}_{1,j}, Y^{l}_{0,j}, D_{j}=x\right) \, P(D_{j}=x) \, dx $$
(3)

At the lth interim analysis, the Bayesian decision rule will conclude that there is an unacceptable excess of event j in the experimental arm if \(P (\delta ^{l}_{j}) \geq \tau ^{l}_{j}\), where \(\tau ^{l}_{j}\) is a prespecified decision threshold.
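To make this rule concrete, the following sketch evaluates Eq. (3) by Monte Carlo in R, drawing from the conjugate posteriors of Eq. (2) and from the margin distribution Dj. All numerical values (event counts, sample size, mixture parameters) are hypothetical placeholders, not quantities from the trial.

```r
# Monte Carlo evaluation of Eq. (3): P(delta) = E_D[ P(theta1 - theta0 > D | data) ]
set.seed(1)
M <- 1e5                                   # number of Monte Carlo draws

# Posterior draws from Eq. (2): Beta(alpha + Y, beta + n - Y)
post_draw <- function(M, y, n, a = 1, b = 1) rbeta(M, a + y, b + n - y)
theta1 <- post_draw(M, y = 12, n = 162)    # experimental arm (hypothetical data)
theta0 <- post_draw(M, y = 8,  n = 162)    # control arm (hypothetical data)

# Draws from D_j, here a two-component beta mixture (illustrative parameters)
w <- c(0.6, 0.4); a_mix <- c(2, 8); b_mix <- c(40, 30)
comp <- sample(seq_along(w), M, replace = TRUE, prob = w)
d <- rbeta(M, a_mix[comp], b_mix[comp])

# Averaging the indicator over the joint draws approximates the integral of Eq. (3)
p_delta <- mean(theta1 - theta0 > d)
p_delta   # to be compared with the prespecified threshold tau^l_j
```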

Fit margins from experts’ elicitation

To evaluate the distribution of Dj, the acceptable difference in event rates between arms, we performed a formal elicitation with several experts. A questionnaire was sent to the two main investigators (one obstetrician and one neonatologist) of each centre involved in the trial. They were asked about (i) their own characteristics (age, sex, speciality, etc.), (ii) the maximum prevalence of events they would tolerate in the experimental arm, given the expected prevalence of each event in the control arm, and (iii) the weight of each event, that is, the relative severity of the outcomes, with death assigned the maximum weight of 100.

Let \(\tilde {f}_{j} \) denote the estimated event rate in the full-dose arm, based on the EPIPAGE-2 study (Additional file 1), and hj,e the acceptable event rate in the half-dose arm according to the eth expert, e∈{1,…,E}. The acceptable difference between arms according to the eth expert is \(d_{j,e} = h_{j,e} - \tilde {f}_{j} \). For each event, the distribution of the acceptable difference among the E experts was modelled using a mixture of beta distributions, with at most three components. Using the betamix function (betareg package in R [18, 19]), three different estimation methods were adopted (the first mathematically driven and the other two empirically driven); see Table 1 as well as Section 1 of Additional file 7 for details. As a result, the distribution of Dj is denoted \(D_{j} \sim f(a_{1,j},b_{1,j},a_{2,j},b_{2,j},a_{3,j},b_{3,j},w_{1,j},w_{2,j},w_{3,j})\), where (a1,j,b1,j), (a2,j,b2,j) and (a3,j,b3,j) are the parameters of the three beta distributions, and (w1,j,w2,j,w3,j) the corresponding weights. Parameters are omitted when mixtures contain fewer than three components.
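As an illustration of this fitting step, the sketch below applies betamix() from the betareg package to hypothetical elicited differences; the data are simulated placeholders, and the three estimation methods of Table 1 are not reproduced here.

```r
library(betareg)  # betamix() fits finite mixtures of beta distributions via flexmix

set.seed(42)
# Hypothetical elicited acceptable differences d_{j,e} for one event,
# assumed to lie in the open interval (0, 1)
experts <- data.frame(d = c(rbeta(25, 2, 40), rbeta(18, 8, 30)))

# Intercept-only beta mixture; k = 1:3 compares fits with 1 to 3 components
fit <- betamix(d ~ 1, data = experts, k = 1:3)
coef(fit)

# betamix uses a (mean, precision) parametrization with logit and log links,
# so each component's (a_m, b_m) follows as a = mu * phi and b = (1 - mu) * phi,
# with mu = plogis(mean coefficient) and phi = exp(precision coefficient)
```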

Table 1 Three methods of fitting used to model the physicians’ acceptable differences of rates of events

Sensitivity analysis to select the prior and the decision thresholds

The sensitivity analysis aimed to compare the performance of different priors and thresholds \(\tau ^{l}_{j}\) and to select the most appropriate combination. In the reference arm, θ0,j was set from historical data (Table 2) [17]. For the experimental arm, five scenarios were considered, determined by the assumed true values of the response probabilities (θ1,j). Let s be the scenario index (s∈{1,…,5}), and θ1,j,s denote the prevalence in the experimental arm under the sth scenario. In the first scenario, the prevalences in the experimental and control arms are equal (θ1,j,1=θ0,j). In the second scenario, the prevalence is lower in the experimental arm than in the control arm (θ1,j,2=2/3×θ0,j). In the third to fifth scenarios, the prevalence is higher in the experimental arm than in the control arm (θ1,j,3=1.5×θ0,j, θ1,j,4=2×θ0,j and θ1,j,5=3×θ0,j). For each scenario, 1000 trials were generated, with ni=162 (Additional file 1) and Yi,j,s following Eq. (1).
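A minimal sketch of this trial-generation step is shown below; the control-arm prevalence is a placeholder, while the multipliers, per-arm sample size and number of trials follow the scenario definitions above.

```r
set.seed(123)
n_i      <- 162                      # per-arm sample size (< 28 weeks subgroup)
n_trials <- 1000                     # simulated trials per scenario
theta0   <- 0.10                     # hypothetical control-arm prevalence for one event
mult     <- c(1, 2/3, 1.5, 2, 3)     # scenarios 1 to 5: theta_{1,j,s} = mult * theta0

sims <- do.call(rbind, lapply(seq_along(mult), function(s) {
  data.frame(scenario = s,
             Y0 = rbinom(n_trials, n_i, theta0),            # control arm, Eq. (1)
             Y1 = rbinom(n_trials, n_i, mult[s] * theta0))  # experimental arm
}))
head(sims)
```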

Table 2 Prevalence of events assumed in each trial, according to the scenario and to the application data set, and weight and maximal rates of misclassifications assigned to each event to build the decision rule

The observations of each trial were allocated across L interim analyses. At the lth analysis, the analysis population included the patients of the current analysis together with those of the l−1 previous analyses; analyses were thus cumulative.

To address how prior location and precision may affect posterior inferences, we constructed an array of P alternative priors, each obtained by specifying numerical values for two quantities: one that sets the prior’s location E(π1,j−π0,j) and one that sets its precision (see Section 2 of Additional file 7 for details).
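The exact construction of these priors is given in Section 2 of Additional file 7. As an illustration only, one standard way to encode a location and a precision into beta parameters is sketched below, where the precision factor is read as a prior effective sample size relative to n; this is an assumption for illustration, not necessarily the construction detailed there.

```r
# Illustrative construction: encode a prior mean m = E(theta) and a precision
# factor c (read as a prior effective sample size relative to n). This is an
# assumption, not necessarily the construction of Additional file 7.
beta_prior <- function(m, c, n) {
  n0 <- c * n                        # prior effective sample size
  c(alpha = m * n0, beta = (1 - m) * n0)
}

# e.g., a prior centred on the control prevalence with weak precision
beta_prior(m = 0.10, c = 1/20, n = 162)
```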

Choice of the prior and thresholds for the final analysis

The posterior distributions of \(\theta_{1,j} - \theta_{0,j}\) at the final (Lth) analysis were obtained through Hamiltonian Monte Carlo, using the rstan package [20, 21] in R, for the 5000 simulated trials. The posterior probability that this difference exceeds the acceptable difference distribution was calculated. We then computed the overall number of misclassifications obtained when applying the decision rule with different decision thresholds \(\tau ^{L}_{j}\) at the final analysis, with \(\tau ^{L}_{j} \in (0.50, 1.00)\). Considering the contingency table presented below, we defined two types of misclassifications:

                                                    Truth: the difference is Acceptable    Truth: the difference is Unacceptable
Decision rule concludes the difference is Unacceptable    A = class a misclassification          D
Decision rule concludes the difference is Acceptable      C                                      B = class b misclassification

The rates of class a and class b misclassifications are A/(A+C) and B/(B+D), respectively.
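In code, these two rates can be computed from the simulated decisions as follows (a sketch; decided_unacceptable and truly_unacceptable are assumed logical vectors over the simulated trials).

```r
# Class a and class b misclassification rates, following the contingency table above
misclass_rates <- function(decided_unacceptable, truly_unacceptable) {
  A <- sum( decided_unacceptable & !truly_unacceptable)  # false "Unacceptable"
  C <- sum(!decided_unacceptable & !truly_unacceptable)  # correct "Acceptable"
  B <- sum(!decided_unacceptable &  truly_unacceptable)  # false "Acceptable"
  D <- sum( decided_unacceptable &  truly_unacceptable)  # correct "Unacceptable"
  c(class_a = A / (A + C), class_b = B / (B + D))
}
```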

This work was repeated for each event, using the P priors. The most appropriate prior was then selected, along with the final thresholds \(\tau ^{L}_{j}\) for each event, that is, those giving acceptable rates of class a and class b misclassifications. Let p∗ denote the selected prior and \(\tau *^{L}_{j}\) the selected decision threshold at the Lth analysis for event j.
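For reference, a minimal rstan sketch of the posterior sampling step described at the beginning of this subsection is given below. The beta-binomial model matches Eqs. (1) and (2); the data values and the independent uniform beta priors are placeholders, not the priors compared in the sensitivity analysis.

```r
library(rstan)

stan_code <- "
data {
  int<lower=0> n;                      // per-arm sample size
  int<lower=0> y1;                     // events in the experimental arm
  int<lower=0> y0;                     // events in the control arm
  real<lower=0> a1; real<lower=0> b1;  // beta prior, experimental arm
  real<lower=0> a0; real<lower=0> b0;  // beta prior, control arm
}
parameters {
  real<lower=0, upper=1> theta1;
  real<lower=0, upper=1> theta0;
}
model {
  theta1 ~ beta(a1, b1);               // priors
  theta0 ~ beta(a0, b0);
  y1 ~ binomial(n, theta1);            // likelihoods, Eq. (1)
  y0 ~ binomial(n, theta0);
}
generated quantities {
  real diff = theta1 - theta0;         // quantity compared with D_j
}
"

fit <- stan(model_code = stan_code, chains = 4, iter = 2000,
            data = list(n = 162, y1 = 12, y0 = 8,
                        a1 = 1, b1 = 1, a0 = 1, b0 = 1))
diff_draws <- extract(fit, pars = "diff")$diff
```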

Choice of the thresholds for the interim analyses

To construct the decision rule to be applied at each earlier interim analysis, the simulation was repeated for the L interim analyses, using the selected prior p∗. The decision thresholds \(\tau ^{l}_{j}\) were defined as follows: (i) for the final analysis, \(\tau *^{L}_{j}\) was the threshold defined in the previous step; (ii) for the first analysis, \(\tau *^{1}_{j}\) was set to 0.95; (iii) for l∈{2,…,L−1}, four decreasing functions were tested to define \(\tau ^{l}_{j}\) (see Table 3). The overall numbers of misclassifications obtained with these different functions were compared, and the most appropriate function and thresholds \(\tau *^{l}_{j}\) were selected.
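For illustration only, the sketch below builds one plausible decreasing schedule between \(\tau *^{1}_{j}=0.95\) and a selected final threshold; the four functions actually compared are those defined in Table 3, and this stand-in is not claimed to be any of them.

```r
# Illustrative decreasing threshold schedule from tau^1 = 0.95 down to a
# selected final threshold tauL; NOT one of the exact functions of Table 3.
tau_schedule <- function(L, tau1 = 0.95, tauL, lambda = 2) {
  l <- seq_len(L)
  # Exponential transformation of a linear decrease: thresholds stay high at
  # early analyses and approach tauL only near the final analysis
  wgt <- (exp(lambda * (l - 1) / (L - 1)) - 1) / (exp(lambda) - 1)
  tau1 + (tauL - tau1) * wgt
}

round(tau_schedule(L = 11, tauL = 0.80), 3)
```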

Table 3 Four functions applied to define the thresholds at each of the interim analyses

Results

Fit margins from experts’ elicitation

Among the 78 experts to whom the questionnaire was sent, 44 answered (56.4%) (Table 4), including 43 who provided answers about acceptable rates of events in the half-dose arm.

Table 4 Main characteristics of the experts who answered the elicitation questionnaire

Figure 2 presents the histogram of the acceptable differences of IVH among the E experts (dj,e), the fits (Dj) obtained through the three different methods, and their goodness-of-fit criteria. For the other events, see Additional file 2. Additional file 3 summarizes the mixtures retained for the acceptable differences Dj.

Fig. 2 Histogram of the acceptable difference of severe intraventricular haemorrhage between arms, and mixtures of beta distributions fitted from experts’ elicitation through 3 different methods, with their goodness-of-fit criteria. The histogram represents the acceptable difference of IVH among the E experts (dj,e). The 3 lines represent the fits of this difference (Dj), obtained through the 3 different methods. The legend gives the parameters of the fits and their goodness-of-fit criteria

Sensitivity analysis to select the prior and the decision thresholds

A good sequential decision rule should support good decisions, that is, advise stopping the trial when the prevalence of events is truly unacceptable and not stopping when the difference is acceptable. Table 2 summarizes what was considered a “good decision” under each scenario and event (see Section 3 of Additional file 7 for more details).

The maximum rate of class a misclassifications was set to 0.10. For class b misclassifications, we set a maximum inversely proportional to the weight of the event according to the experts (Table 2). Denoting by Wj the median weight of event j among the E experts (Wj∈[0,100] and Wdeath=100), the maximal rate of class b misclassifications was set to: \(\text {Max}(\text {class b misclassification})_{j} = 0.1 + 0.50 \times \frac {100 - W_{j}}{100}\). For instance, an event with a median weight of Wj=60 has a maximal class b rate of 0.1+0.50×0.40=0.30.

Selection of the prior and thresholds for the final analysis

Figure 3 shows the number of posterior misclassifications at the final analysis according to each prior and final threshold for IVH (see Additional file 4 for the other events). In an effort to construct a homogeneous decision rule, we selected the same prior for all of the events. Several priors gave acceptable rates of misclassifications (priors 1, 3, 4, 5, 8, 9 and 13); we arbitrarily chose prior 9. Conversely, we applied a different final threshold \(\tau *^{L}_{j}\) to each event, as these thresholds are influenced by the prevalence of the events and by the acceptable difference Dj (Table 5).

Fig. 3 Plots of posterior class a and class b misclassifications according to the decision thresholds for each of the 13 pairs of priors for severe intraventricular haemorrhage. This figure represents the posterior rates of misclassifications for each pair of priors. Prior 1 is the non-informative prior, with α1,j=α0,j=β1,j=β0,j=1. Priors 2 to 13 are distinguished by (i) the mean of the difference between the two arms: E(π1,j−π0,j)=0 for priors 2, 3, 4 and 5; E(π1,j−π0,j)=median(dj,e) for priors 6, 7, 8 and 9; and E(π1,j−π0,j)=π0,j for priors 10, 11, 12 and 13; and (ii) their precision: 1 for priors 2, 6 and 10; 1/3 for priors 3, 7 and 11; 1/10 for priors 4, 8 and 12; and 1/20 for priors 5, 9 and 13. For each prior, the red solid line represents the number of posterior class a misclassifications (trials concluding that the difference between arms is Unacceptable when it is not) at the final analysis, according to each final threshold. The blue solid line represents the number of posterior class b misclassifications (trials concluding that the difference between arms is Acceptable when it is not)

Table 5 Final decision rule retained through the sensitivity analysis: thresholds to be applied at each interim analysis and final overall rates of misclassifications, according to the event

Selection of the thresholds for the interim analyses

In our case study, we set L=11. The numbers of misclassifications obtained by applying the 4 functions defining \(\tau ^{l}_{j}\) are presented in Additional file 5. We retained the linear function with an exponential transformation because it kept the overall rate of misclassifications under the prespecified acceptable rates; the 3 other functions raised the rate of class a misclassifications above 0.10.

Table 5 summarizes the thresholds finally retained in the decision rule, \(\tau *^{l}_{j}\), and the overall rates of misclassifications. Figure 4 gives the distribution of the conclusions and misclassifications among the trials, at each interim analysis and in total, for IVH. Additional file 6 presents the distribution of the conclusions and misclassifications for the other events. Finally, Fig. 5 presents the overall numbers of misclassifications obtained by applying this decision rule, according to the scenario.

Fig. 4 Distribution of the successive conclusions and errors obtained by applying the decision rule to the 5000 simulated trials, at each interim analysis and overall, for severe intraventricular haemorrhage. The left part of the plot represents the conclusions at each interim analysis; the right part represents the overall count of conclusions across the 11 analyses. The upper part of the plot represents the trials with an Acceptable difference between arms: the orange area corresponds to trials correctly concluding that the difference between arms is Acceptable; the red area corresponds to trials concluding that the difference is Unacceptable when it is not (class a misclassifications). The bottom part of the plot represents the trials with an Unacceptable difference between arms: the green area corresponds to trials correctly concluding that the difference is Unacceptable; the blue area corresponds to trials concluding that the difference is Acceptable when it is not (class b misclassifications)

Fig. 5 Distribution of the overall conclusions and errors obtained by applying the decision rule to the 5000 simulated trials, according to the event and the scenario. This plot presents the overall numbers of misclassifications obtained by applying this decision rule, according to the 5 scenarios and the 4 events. The left part of the plot represents the trials with an Acceptable difference between arms: the orange area corresponds to trials correctly concluding that the difference between arms is Acceptable; the red area corresponds to trials concluding that the difference is Unacceptable when it is not (class a misclassifications). The right part of the plot represents the trials with an Unacceptable difference between arms: the green area corresponds to trials correctly concluding that the difference is Unacceptable; the blue area corresponds to trials concluding that the difference is Acceptable when it is not (class b misclassifications). IVH: intraventricular haemorrhage; NEC: necrotizing enterocolitis

Application to data

We applied our method to a simulated dataset for the BETADOSE trial. In this dataset, we considered a final sample size of 1571 per arm, with ni=162 children born before 28 weeks. The prevalence of events was sampled as detailed in Table 2. For all events, the correct decision for this trial was to conclude an “Unacceptable” difference, following the same reasoning as before (Section 3 of Additional file 7). Table 6 summarizes the results at the end of the trial (expressed as observed prevalences) and the Bayesian sequential results, using the rule built in the previous step (Table 5).
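A compact sketch of such a sequential application for one event is shown below; the cumulative counts, the margin mixture and the threshold schedule are placeholders (in the real analysis the thresholds are those of Table 5).

```r
# Sequential application of the decision rule for one event (placeholder inputs)
p_delta <- function(y1, y0, n, a_mix, b_mix, w_mix, M = 1e5, a = 1, b = 1) {
  th1 <- rbeta(M, a + y1, b + n - y1)                    # posterior, Eq. (2)
  th0 <- rbeta(M, a + y0, b + n - y0)
  comp <- sample(seq_along(w_mix), M, replace = TRUE, prob = w_mix)
  mean(th1 - th0 > rbeta(M, a_mix[comp], b_mix[comp]))   # Eq. (3)
}

set.seed(7)
L   <- 11
n_l <- 15 * seq_len(L)                       # cumulative per-arm sample sizes
Y1  <- cumsum(rbinom(L, 15, 0.15))           # cumulative events, experimental arm
Y0  <- cumsum(rbinom(L, 15, 0.08))           # cumulative events, control arm
tau <- seq(0.95, 0.80, length.out = L)       # placeholder threshold schedule

for (l in seq_len(L)) {
  p <- p_delta(Y1[l], Y0[l], n_l[l],
               a_mix = c(2, 8), b_mix = c(40, 30), w_mix = c(0.6, 0.4))
  cat(sprintf("analysis %2d: P(delta) = %.3f, tau = %.3f\n", l, p, tau[l]))
  if (p >= tau[l]) { cat("-> stop: potential unacceptable excess of events\n"); break }
}
```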

Table 6 Observations and decision obtained by application of the Bayesian decision rule to a data set

At the 6th analysis, since the posterior probability exceeded the prespecified threshold \(\tau *^{6}_{j}\) for death, the trial was stopped because of a potentially unacceptable excess of deaths in the experimental arm. Had the trial continued, it would have stopped at the 10th analysis because of IVH.

Discussion

Motivated by settings in which rare but serious events have to be monitored during a non-inferiority trial, we have proposed a methodology that provides a practical way to support decision making at each interim analysis.

Our approach has the advantage of incorporating experts’ opinions about the non-inferiority margins. As a consequence, it can be used as an alternative when historical placebo-controlled data are not available. We proposed to retain the variability among experts by using a distribution instead of a single fixed margin. We could have averaged all the experts’ opinions, but this would not have reflected the full variability. In a simulation study, we compared our approach to the use of averaged values (see Additional file 9) and found that the mixture gave different results than the mean of the experts’ opinions; the difference between the two approaches increased as the variability among experts increased. We could also have weighted the experts’ opinions according to pertinent covariates. In a previous work, Thall et al. compared different ways of weighting physicians’ opinions using mixture priors on the parameter of interest [22]. The authors found, for their design, that posterior quantities appeared insensitive to how physicians were weighted, so we decided to weight all physicians equally. In our case, the variability among experts was retained in order to reflect all potential opinions, that is, the distribution across the whole range of potential margins. Our method can be applied for any margin values between zero and one.

One limitation of our motivating example is that the majority of the experts set the acceptable difference to zero, whereas zero is not a possible value for a non-inferiority margin. When generalizing this method to another non-inferiority trial, we suggest that investigators remind the experts that the margin cannot be set to zero.

Because the prior chosen for a Bayesian analysis needs to be well documented and robust to its parameter choices, we performed an extensive sensitivity analysis evaluating non-informative and informative priors and several thresholds. The retained thresholds varied between events, reflecting the differences in prevalence, in margins and in the severity conferred by clinicians on each event. Likewise, when we repeated this work in the subgroup of premature infants born after 28 weeks (results not shown), the thresholds were different, reflecting the greater rarity of events and the different margins.

To choose the best priors and stopping thresholds, the rates of misclassifications were computed and compared. As the two types of misclassifications move in opposite directions, we had to find a compromise between them. Since we did not want to conclude inferiority of the experimental arm wrongly too often, we set a maximum of 0.10 for class a misclassifications, were more permissive for class b misclassifications, and adapted this permissiveness to the severity of each event. To define the stopping thresholds at each interim analysis, simulations compared several initial thresholds and four decreasing functions of τ. The purpose was to find the thresholds giving good operating characteristics for the design, i.e., rarely stopping early when stopping is wrong and rarely continuing to the end when stopping is warranted. Finally, as we dealt with some rare events, the overall rates of class a and class b misclassifications were relatively high. This has to be balanced against frequentist type I and type II error rates, which sometimes also have to be compromised, especially for rare secondary events.

When generalizing this method to another trial, this work needs to be repeated before the analysis of the real data: the maximal rates of class a and class b misclassifications have to be balanced against the setting, and the parameters of the decision rule, namely the prior, the margins and the decision thresholds, adapted accordingly. Once all these parameters have been pre-specified, the decision rule can be applied by the statistician to the unblinded data and presented to the Data Safety Monitoring Board. Applying this methodology, we have already designed a non-inferiority trial, due to start in a few months, that uses the same statistical approach in another setting.

In conclusion, our approach was found to be efficient for the safety monitoring of rare and non-rare events in a non-inferiority context. It requires strong collaboration between physicians and trial statisticians, to the benefit of all.

Conclusion

We proposed a practical way to assist with decisions on dichotomous safety events at each interim analysis of a non-inferiority trial. This Bayesian design is suitable for both rare and non-rare events. It incorporates experts’ opinions on the margins, so it can be constructed without historical placebo-controlled data. This Bayesian sequential approach could be applied as a complement to the frequentist analysis, so that both Data Safety Monitoring Boards and investigators can benefit from it.