Background

Research suggests that a majority of randomized clinical trials (RCTs) on medical interventions may not be justified based on established evidence and thus constitute unjustified research. Justified clinical trials may be defined as trials designed around a clear hypothesis for which genuine uncertainty exists, with that uncertainty established through systematic reviews or network meta-analyses (NMA) of the existing evidence [1]. This is of relevance because the estimated cost of each additional piece of evidence in a series of RCTs increases across decades [2, 3]. Optimizing the number of clinical trials to a scientifically justifiable amount is therefore recommended to save resources, reduce the exposure of patients to less effective treatments, and allow for earlier uptake of treatment recommendations in practice [1].

Conditional power of NMA has been introduced as a concept to optimize trial designs, thereby contributing to the reduction of unjustified research [4–6]. Conditional power is the probability that updating existing inconclusive evidence in NMA with additional trial(s) will result in conclusive evidence, given assumptions regarding trial design, anticipated effect sizes, or event probabilities [7, 8]. A key issue when designing an RCT is to determine how large the sample size needs to be in order to achieve a desirable level of power given a predefined significance level α [7]. Further, some interventions may not achieve high levels of power within a single trial considered in isolation. In such situations, two or more RCTs in combination may be appropriate to form a cumulative synthesis of findings from RCTs addressing the same question [5, 6]. This situation may also arise if a direct treatment comparison of interest includes treatments known to be poorly tolerated in patients (e.g., due to known adverse events); in such cases, adding indirect evidence from future trials that include only better tolerated treatments may be more appropriate for the evidence to become conclusive. If a conditional power analysis suggests, for example, at least 80% conditional power (conventionally implying that trial(s) investigating a true effect will correctly reject the null hypothesis [9]), together with a reasonable required sample size, further research may be promising. Conversely, if such an analysis suggests, for example, less than 20% conditional power (conventionally regarded as a futility boundary, with values below it indicating that a trial is likely to be futile under the null hypothesis [10]), it may be recommended to refrain from further RCTs on a given intervention in order to save resources.

The present work aimed to estimate conditional power for NMA on antidepressant treatments. The analysis was based on a network of 502 RCTs on the acute treatment of adult major depressive disorder (MDD) conducted between 1979 and 2018, built on the published GRISELDA dataset [11] provided by Cipriani et al. [12]. Together, the network compares 21 antidepressants, considering outcomes such as efficacy in terms of symptom change on the Hamilton Depression Scale (HAMD) [13] and tolerability in terms of the dropout rate due to adverse events (Supplement 1, Fig. S1).

At the time of writing (October 2020), four ongoing RCTs registered on clinicaltrials.gov cover one or more of the aforementioned antidepressants and fit the inclusion criteria of the present dataset:

  • NCT04364997: intervention bupropion (BUP), escitalopram (ESC), mirtazapine (MIR), sertraline (SER), venlafaxine (VEN); planned sample size N = 400; estimated start and completion dates Jun-18 to Dec-22; Beijing Anding Hospital, China [14].

  • NCT03538691: intervention citalopram (CIT), duloxetine (DUL), escitalopram (ESC), fluoxetine (FLO), paroxetine (PAR), sertraline (SER), venlafaxine (VEN) versus placebo (PLA); planned sample size N = 1450; estimated start and completion dates Jul-18 to Sep-22; Otsuka Pharmaceutical Development & Commercialization, Inc. [15].

  • NCT04345471: intervention desvenlafaxine (DES) versus placebo (PLA); planned sample size N = 594; estimated start and completion dates May-20 to Dec-22; Mochida Investigational sites, Japan [16].

  • NCT04422652: intervention desvenlafaxine (DES) versus vortioxetine (VOR); planned sample size N = 600; estimated start and completion dates Aug-20 to Apr-26; H. Lundbeck A/S [17].

For example, one of the most recently introduced antidepressants is vortioxetine (VOR), approved in 2013 by the US Food and Drug Administration (FDA). The existing evidence on VOR comprises 17 RCTs (16 placebo-controlled RCTs, 1 head-to-head RCT) completed between 2007 and 2017 and published between 2012 and 2018 [18–34]. Based on this evidence, VOR has been shown to be more effective (standardized mean difference (SMD) -0.29 [95% CI -0.38 to -0.20]) but less tolerable (odds ratio (OR) 1.48 [95% CI 1.15 to 1.89]) compared to placebo, with the evidence becoming conclusive in 2009 (efficacy) and 2011 (tolerability), respectively. An ongoing phase IV, double-blind RCT (NCT04448431 [35]) started in August 2020 with an estimated completion date in April 2026. This RCT aims to compare the efficacy of VOR versus desvenlafaxine (DES) in 600 MDD patients who have tried one available treatment without achieving full benefit, with the primary outcome being the change on the Montgomery-Åsberg Depression Rating Scale (MADRS) from baseline to week 8. Based on current evidence, the comparison DES:VOR is inconclusive in terms of efficacy (SMD -0.06 [95% CI -0.19 to 0.08]) and tolerability (OR 0.80 [95% CI 0.54 to 1.18]), suggesting a slight yet inconclusive advantage for VOR compared to DES with respect to both outcomes. To estimate whether this advantage for VOR may turn into conclusive evidence, a conditional power analysis may support the decision as to whether the ongoing research on this comparison is promising or futile. This example shows how the present work may inform decision-makers and researchers regarding the expected clinical relevance of ongoing and future antidepressant RCTs that aim to challenge antidepressant treatment recommendations.

Methods

Data sources

A total of 535 RCTs (445 published trials, 90 unpublished trials) were identified on the acute treatment of MDD conducted between 1979 and 2018. Of these, 522 trials constituted the GRISELDA dataset [11] provided by Cipriani et al. [12]; an additional 13 trials [34, 36–47] were identified through our own literature search. Together, the network compares placebo (PLA) and 21 antidepressants: agomelatine (AGO), amitriptyline (AMI), bupropion (BUP), citalopram (CIT), clomipramine (CLO), desvenlafaxine (DES), duloxetine (DUL), escitalopram (ESC), fluoxetine (FLO), fluvoxamine (FLV), levomilnacipran (LEV), milnacipran (MIL), mirtazapine (MIR), nefazodone (NEF), paroxetine (PAR), reboxetine (REB), sertraline (SER), trazodone (TRA), venlafaxine (VEN), vilazodone (VIL), and vortioxetine (VOR). The supplementary appendix provides a PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow-chart [48] detailing the study selection process (Supplement 1, Fig. S1a, Tab. S1) and a complete list of the included studies (Supplement 1, Tab. S4).

Two outcomes were considered. The continuous outcome efficacy, defined as the symptom change on the Hamilton Depression Scale (HAMD) [13] and estimated on the standardized mean difference (SMD) scale, was available in 438 trials (99 direct comparisons) with a total sample size of N = 109'254 (median sample size N = 249 [range N = 7 - 821]). The binary outcome tolerability, defined as the dropout rate due to adverse events and estimated on the odds ratio (OR) scale, was available in 438 trials (99 direct comparisons) with a total sample size of N = 105'616 (median sample size N = 241 [range N = 3 - 657]). The final dataset, containing information on at least one of the two outcomes, consisted of 502 trials. Other commonly used outcomes related to the effectiveness of antidepressants, such as response and remission rates, were not considered due to well-known methodological difficulties arising from dichotomization, such as reduced statistical power and inflated effect sizes [49–52].

Study year was defined as the year of study completion, the year of publication, or the year of FDA drug approval, in this order of preference where available; preference was given to the year of completion because unpublished trials, by definition, have no year of publication [53]. The resulting study year range was 1977-2017.
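As an illustration of this precedence rule, a minimal R sketch is given below; the column names and values are hypothetical and do not reproduce the dataset.

```r
# Hypothetical trial records; column names and values are illustrative only
trials <- data.frame(
  year_completion  = c(2005, NA,   NA),
  year_publication = c(2007, 2010, NA),
  year_fda         = c(NA,   NA,   1998)
)

# Apply the stated precedence: completion > publication > FDA approval
trials$study_year <- ifelse(!is.na(trials$year_completion), trials$year_completion,
                     ifelse(!is.na(trials$year_publication), trials$year_publication,
                            trials$year_fda))

trials$study_year
#> [1] 2005 2010 1998
```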

Conditional power

Conditional power was estimated using the ConditionalPower package provided by Nikolakopoulou et al. [7, 8, 54] in R [55]. Briefly, for a comparison of interest, conditional power in NMA can be described as [7]:

$$ CP = \Phi \left(\frac{-z_{\alpha/2}\sqrt{C} - HM}{\sqrt{H^{N}\,\nu^{N}\,\left(H^{N}\right)^{\prime}}}\right) + \Phi \left(\frac{-z_{\alpha/2}\sqrt{C} + HM}{\sqrt{H^{N}\,\nu^{N}\,\left(H^{N}\right)^{\prime}}}\right) $$
(1)

where Φ denotes the cumulative standard normal distribution function, C represents the covariance matrix of the NMA (direct and indirect) effect estimates, the vector M contains the NMA (direct and indirect) effect estimates of the old pairwise meta-analyses together with the alternative effect sizes for the comparison of interest, the matrices H and H^N connect the NMA (direct and indirect) effect estimates to the pairwise (direct) effects derived from old and new trials, respectively, and the vector ν^N contains the variances of the pairwise (direct) effect estimates derived from new trials. The reader is referred to Nikolakopoulou et al. [7] for further details.
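As a minimal illustration of Eq. (1), the following R sketch reduces the matrices to scalars for a single comparison updated by one new trial. The function name and all numerical inputs are hypothetical, the variance approximation 4/N for an SMD under 1:1 randomization is a rough rule of thumb, and this is not the interface of the ConditionalPower package.

```r
# Conditional power of updating an existing pairwise summary with one new trial,
# i.e. a scalar special case of Eq. (1). All inputs are illustrative assumptions.
conditional_power <- function(mu_old,   # existing summary effect (e.g. SMD)
                              v_old,    # variance of the existing summary
                              m_new,    # anticipated true effect for the new trial
                              v_new,    # anticipated variance of the new trial estimate
                              alpha = 0.05) {
  w_old <- 1 / v_old
  w_new <- 1 / v_new
  c_upd <- 1 / (w_old + w_new)                       # C: variance of the updated estimate
  e_upd <- (w_old * mu_old + w_new * m_new) * c_upd  # H M: expected updated estimate
  sd_nw <- sqrt(w_new) * c_upd                       # sqrt(H^N v^N (H^N)'): SD from new trial
  z     <- qnorm(1 - alpha / 2)
  pnorm((-z * sqrt(c_upd) - e_upd) / sd_nw) +
    pnorm((-z * sqrt(c_upd) + e_upd) / sd_nw)
}

# Example: inconclusive existing evidence (SMD -0.10, variance 0.01) updated by one new
# trial with anticipated effect -0.10 and variance 4 / N under 1:1 randomization (N = 600).
conditional_power(mu_old = -0.10, v_old = 0.01, m_new = -0.10, v_new = 4 / 600)
```

With no existing evidence (w_old set to zero), this expression collapses to the standard power formula for a single two-arm trial, which may help in interpreting the terms.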

Conditional power was estimated across a range of possible sample sizes (N = 1 - 5000), assuming 1:1 randomization between treatment arms. Results were reported in terms of two conditional power indices quantifying sample sizes (see the sketch following this list):

  • NCP=80%: Sample size at 80% conditional power, which conventionally implies that a trial investigating a true effect will correctly reject the null hypothesis 80% of the time and will report a false negative (i.e., commit a type II error) in the remaining 20% of cases [9].

  • NCP=20%: Sample size at 20% conditional power, which conventionally may be regarded as a futility boundary, with values below it indicating that a trial is likely to be futile under the null hypothesis [10].
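Building on the scalar sketch above, the two indices can be read off a conditional-power curve computed over the range of sample sizes; all inputs again are illustrative assumptions, not network estimates.

```r
# Scan candidate total sample sizes N = 1..5000 and locate the two indices.
n_grid <- 1:5000
cp <- sapply(n_grid, function(n)
  conditional_power(mu_old = -0.10, v_old = 0.01, m_new = -0.10, v_new = 4 / n))

n_cp20 <- n_grid[which(cp >= 0.20)[1]]  # NCP=20%: smallest N reaching the 20% futility boundary
n_cp80 <- n_grid[which(cp >= 0.80)[1]]  # NCP=80%: smallest N reaching 80% conditional power
c(NCP20 = n_cp20, NCP80 = n_cp80)
```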

Three parameters were considered for each outcome of interest:

  • Trial design: The main analysis considered a trial design with a ratio of direct to indirect evidence (r) of r = 1/0, indicating that conditional power for each treatment comparison was assessed by updating the network with one new trial contributing direct evidence regarding the comparison of interest and no new trials contributing indirect evidence. A sensitivity analysis estimated conditional power for two additional ratios, r = 1/1 and r = 1/2. The ratio r = 1/1 indicates that the network was updated with one new trial contributing direct and one new trial contributing indirect evidence regarding the comparison of interest (41 possible combinations per comparison), whereas r = 1/2 indicates one new trial contributing direct and two new trials contributing indirect evidence (820 possible combinations per comparison; see the sketch following this list). Results were reported in terms of the optimal trial design for each comparison, i.e., the design with the smallest NCP=80%.

  • Effect size: The main analysis considered anticipated treatment effects (fxy) set equal to the relative effect estimates (i.e., the relative effects between competing treatments of interest) observed in the network (fxyN). A sensitivity analysis was conducted to estimate conditional power at alternative effect sizes (fxy = 0.01, 0.1, 0.2, 0.3, 0.5, 0.8) in terms of Cohen’s d (small effect d = 0.2, moderate effect d = 0.5, large effect d = 0.8) [56].

  • Event probability: The main analysis considered anticipated event probabilities (pc) set equal to the average event probabilities observed in the entire network (pcN). For the outcome efficacy, anticipated average event probability (pcN = 0.17) was calculated in terms of the proportion of change on the HAMD of at least 4 points (number of trials with change ≥4 points divided by the number of trials with change <4 points) corresponding to Cohen’s d = 0.5 [57]. For the outcome tolerability, anticipated average event probability (pcN = 0.08) was calculated in terms of the proportion of dropouts (total number of dropouts divided by the total sample size in the network) [7]. A sensitivity analysis was conducted to estimate conditional power at alternative event probabilities in terms of small to large event risks (pc = 0.01, 0.1, 0.2, 0.3, 0.5).
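As a rough sketch of how these sensitivity parameters may be varied, the snippet below (reusing the hypothetical conditional_power() function from above, with purely illustrative inputs) shows the number of indirect-evidence combinations per comparison under the assumption of 41 candidate indirect comparisons, and a loop over the alternative anticipated effect sizes.

```r
# Trial design: number of combinations when adding indirect evidence, assuming 41
# candidate indirect comparisons per comparison of interest.
choose(41, 1)   # r = 1/1: 41 combinations
choose(41, 2)   # r = 1/2: 820 combinations

# Effect size: conditional power at the alternative anticipated effect sizes (Cohen's d),
# keeping the illustrative existing evidence and a fixed new-trial size of N = 1000.
f_xy <- c(0.01, 0.1, 0.2, 0.3, 0.5, 0.8)
sapply(f_xy, function(d)
  conditional_power(mu_old = -0.10, v_old = 0.01, m_new = -d, v_new = 4 / 1000))
```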

Conditional power is typically estimated for direct comparisons observed in the network [7]. The antidepressant network, however, contains only 99 direct comparisons out of a total of 231 comparisons. It was therefore considered of clinical interest to include all competing treatment comparisons in the network. For this purpose, dummy connections (with a sample size of 1) were created for treatment comparisons not directly observed in the network and subsequently included in the analysis. The dummy connections did not affect the relative treatment effects, as assessed by the Pearson correlation between the effect estimates from the original and the dummy-augmented networks (efficacy r = 0.999, tolerability r = 0.995) (Supplement 1, Fig. S1d). Between-trial heterogeneity was assumed to be equal to that observed in the original NMA.
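A minimal sketch of this dummy-connection step is given below; the data-frame columns, the treatment pair, and the correlation vectors are purely illustrative and do not reproduce the actual dataset.

```r
# Append a dummy two-arm record (sample size 1 per arm) for a comparison without direct
# evidence, so that the comparison can be included in the conditional power analysis.
network <- data.frame(
  study  = c("trial_001", "trial_002"),
  treat1 = c("FLO", "FLO"),
  treat2 = c("PLA", "SER"),
  n1     = c(150, 120),
  n2     = c(148, 118)
)
dummy <- data.frame(study = "dummy_CLO_VIL", treat1 = "CLO", treat2 = "VIL", n1 = 1, n2 = 1)
network_ext <- rbind(network, dummy)

# Check that the dummies leave the relative treatment effects essentially unchanged,
# e.g. via the Pearson correlation of effect estimates with and without dummies
# (the vectors below are made up for illustration).
effects_orig  <- c(-0.29, -0.15, -0.08)
effects_dummy <- c(-0.29, -0.16, -0.08)
cor(effects_orig, effects_dummy, method = "pearson")
```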

All results reported in the article can be found in the supplementary appendices (Supplement 1 & 2). The data set used in the analysis is provided in comma-separated values (CSV) format (Supplement 3).

Results

Existing evidence

The cumulative evolution of conclusive evidence in the antidepressant network across decades is illustrated in Fig. 1 for the two outcomes efficacy and tolerability. Since 2017, no new conclusive evidence has been observed. As of 2020, the ratio of comparisons with conclusive versus inconclusive evidence was approximately half as large for the outcome efficacy (ratio = 0.41; conclusive N = 67 versus inconclusive N = 164) as for tolerability (ratio = 0.82; conclusive N = 104 versus inconclusive N = 127).

Fig. 1

Evidence across study year. Bar plots illustrating the cumulative sum of comparisons with conclusive versus inconclusive evidence across study year with respect to the two outcomes efficacy and tolerability. The total number of treatment comparisons is 231

Conditional power main analysis

The estimated strength of conditional power across all comparisons with inconclusive evidence is illustrated in Fig. 2, based on the main analysis considering anticipated effect sizes set equal to fxyN and anticipated event probabilities set equal to pcN. The figure further demonstrates how the two conditional power indices quantifying sample sizes were derived, i.e., the sample sizes at 20% and 80% conditional power (NCP=20%, NCP=80%). Across all comparisons with inconclusive evidence, the required sample sizes at 80% conditional power (NCP=80%) were estimated to be approximately twice as large for efficacy (median N = 1586, range N = 894 - 4190) as for tolerability (median N = 791, range N = 521 - 1246). By contrast, sample sizes at the futility boundary of 20% conditional power (NCP=20%) were estimated to be comparable between outcomes (efficacy median N = 250 [range N = 49 - 485], tolerability median N = 198 [range N = 40 - 320]) (Table 1). The relation between the two indices, NCP=20% and NCP=80%, for each individual comparison is detailed in Fig. 3. Finally, the network graphs depicted in Fig. 4 summarize the sample sizes (NCP=80%) needed to achieve sufficient conditional power. To translate these indices to the level of individual antidepressants, the medians of the two indices, NCP=20% and NCP=80%, were computed across all inconclusive comparisons involving each individual antidepressant. The antidepressants with the smallest median sample sizes were CLO, LEV, MIL, NEF, and VIL with respect to both outcomes (Fig. 4). This is plausible, as these antidepressants (or rather the associated comparisons) are the ones for which current direct evidence is sparse. Thus, although the overall strength of the estimated conditional power differed between outcomes, with efficacy being weaker than tolerability, the relative strength of conditional power across individual treatment comparisons was comparable (Pearson r = 0.81). The supplementary appendix provides details on the conditional power for each individual comparison (Supplement 1, Tab. S2 and Supplement 2).
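A sketch of this aggregation step, using a hypothetical long-format table of per-comparison results (values made up for illustration), could look as follows.

```r
# Hypothetical per-comparison results: each row is one inconclusive comparison with its
# estimated sample size at 80% conditional power.
res <- data.frame(
  treat1 = c("CLO", "CLO", "VIL", "NEF"),
  treat2 = c("VIL", "MIL", "LEV", "MIL"),
  n_cp80 = c(950, 1020, 880, 1100)
)

# Median NCP=80% across all inconclusive comparisons involving each antidepressant.
long <- data.frame(drug = c(res$treat1, res$treat2), n_cp80 = rep(res$n_cp80, 2))
aggregate(n_cp80 ~ drug, data = long, FUN = median)
```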

Fig. 2

Conditional power. Box plots illustrating conditional power (CP) across all comparisons with inconclusive evidence as a function of sample size with respect to the two outcomes efficacy and tolerability. Whiskers of the box plots extend to the most extreme data values. Horizontal red dashed lines indicate 20% and 80% conditional power at which sample sizes (NCP=20%, NCP=80%) were estimated. Results are shown based on the main analysis considering a trial design ratio of r = 1/0, anticipated alternative effect sizes equal to the network estimates (fxyN), and anticipated event probabilities equal to the average network event probability (pcN)

Fig. 3

Sample size. Heat map illustrating sample size at 20% (NCP=20%) (lower triangles) versus 80% conditional power (NCP=80%) (upper triangles) for individual comparisons with respect to the two outcomes efficacy and tolerability. The colormap is log-scaled for better visibility. Comparisons with conclusive evidence are marked (white). Results are shown based on the main analysis considering a trial design ratio of r = 1/0, anticipated alternative effect sizes equal to the network estimates (fxyN), and anticipated event probabilities equal to the average network event probability (pcN)

Fig. 4

Network graphs. Network graphs illustrating treatment comparisons with inconclusive evidence with respect to the two outcomes efficacy and tolerability. Circle size is proportional to the actual sample size. Line width is inversely proportional to the sample size at 80% conditional power (NCP=80%), such that thicker connections indicate smaller required sample sizes and thus greater conditional power. Line thickness is log-scaled for better visibility. Results are shown based on the main analysis considering a trial design ratio of r = 1/0, anticipated alternative effect sizes equal to the network estimates (fxyN), and anticipated event probabilities equal to the average network event probability (pcN). See the supplementary appendix for graphs of the original network (Supplement 1, Fig. S2)

Table 1 Conditional power

Conditional power sensitivity analyses

The sensitivity analysis varying the trial design ratio of direct to indirect evidence (r) suggested that adding indirect evidence may considerably increase conditional power and consequently reduce the required sample sizes. Compared to a trial design ratio of r = 1/0, trial design ratios of r = 1/1 and r = 1/2 reduced the median sample sizes (NCP=80%) by median percentage changes of -24% and -35% for efficacy and -7% and -15% for tolerability, respectively (Table 1).

By contrast, the sensitivity analysis assessing varying anticipated effect sizes suggested that the impact of fxy on the strength of conditional power was small. Considering effect sizes larger than the network estimates (fxyN) (e.g., Cohen's d = 0.8, which is arguably unrealistic) would increase sample sizes by up to 5% (efficacy) and 3% (tolerability), whereas smaller effect sizes (e.g., Cohen's d = 0.01) had essentially no impact on sample sizes (0% efficacy, -1% tolerability) (Table 1).

Last, the sensitivity analysis assessing varying event probabilities suggested a relatively larger impact of pc on the strength of conditional power. However, given the average event probabilities observed in the current evidence (efficacy pcN = 0.17, tolerability pcN = 0.08), substantially larger event probabilities can hardly be considered realistic (Table 1). The supplementary appendix provides details on all sensitivity analyses (Supplement 1, Fig. S3, Tab. S3).

Discussion

The recent NMA by Cipriani et al. [12] provided evidence regarding the ongoing debate on the effectiveness of antidepressant treatment. Today, two years after the publication of that NMA, the question arises whether additional RCTs updating the evidence would pay off. Currently ongoing RCTs [14–17] may contribute to answering this question, but final results can only be expected after their estimated completion (completion dates 2022 - 2026). It may therefore be of clinical interest to estimate the probability that the current research will lead to updates in treatment recommendations, or whether it may be considered unjustified.

Overall, the present findings suggest that the probability of achieving new conclusive evidence going beyond the current evidence on antidepressant treatment recommendations is low. Although sufficient conditional power may be obtained for a majority of the evaluated treatment comparisons (Fig. 4), there are substantial limitations in terms of both the required sample sizes and the expected effect sizes.

Considering the median sample size of the four ongoing RCTs (range N = 400 - 1450) [14–17], the sample sizes required according to the present analysis to achieve the conventionally recommended power of at least 80% [9] were estimated to be more than double (tolerability) or even three times (efficacy) that size, and the planned sample sizes may not even exceed the estimated futility boundaries (Table 1). Although sample sizes may be reduced using optimized trial designs including additional indirect evidence, the associated research costs of conducting multiple trials may not pay off.

It should be noted that the present work is limited in its evaluation of optimal trial designs with respect to the relation between direct and indirect evidence. Nikolakopoulou et al. [54] demonstrated how decisions on future trials may be supported by conditional power analyses considering not only 'different ratios of the number of trials' contributing direct versus indirect evidence, as done in the current work, but also 'different ratios of the sample size between trials' assessing direct versus indirect information. An extensive analysis assessing these ratios is feasible in small networks or may be applied to selected treatment comparisons of interest based on a priori hypotheses. The large treatment space of the present network, however, did not allow for such extensive sensitivity analyses for practical reasons, namely the processing time and the exponentially growing number of results. Future research should therefore consider the present findings as an approximation to a more detailed breakdown of the evidence.

Compared to the impact of trial designs on reducing sample sizes, the impact of varying effect sizes or event probabilities may be assumed to be of less practical importance; this is because trial designs can be experimentally modified, whereas effect sizes and event probabilities are inherently limited by the existing evidence on the various treatments. In particular, considering the well-known overall small effect sizes for efficacy of antidepressants in the conclusive treatment comparisons (i.e., drug-placebo differences with a median Cohen's d = 0.3 [57]) and the even smaller effect sizes in the so far inconclusive relative treatment comparisons (median Cohen's d < 0.1 [57]) (Supplement 1, Tab. S2), the clinical relevance of additional trials aiming to challenge current antidepressant treatment recommendations may be low. In other words, it may be questioned whether any additional RCTs on antidepressant treatment can challenge the current treatment recommendations.

Referring to the example in the introduction, the present results may be applied to judge the conditional power of the ongoing RCT (NCT04448431 [35]) aiming to compare the efficacy of VOR versus DES. Although the current evidence may suggest a trend towards an advantage of VOR over DES in terms of both efficacy and tolerability (Supplement 1, Fig. S1), the probability of achieving conclusive evidence at reasonable sample sizes is low. The present analysis suggested required sample sizes to achieve at least 80% conditional power (NCP=80%) of N = 1670 and N = 733 for efficacy and tolerability, respectively (Fig. 3). These estimated sample sizes are considerably larger than the planned sample size of N = 600 [35]. Indeed, the planned sample size of N = 600 corresponds to approximately 56% (efficacy) and 74% (tolerability) conditional power (Supplement 2) and may thus be considered too low to reach new conclusive evidence in an updated NMA.
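For orientation, the conditional power at a planned sample size can be read off an estimated conditional-power curve, for example by linear interpolation; the curve values in the sketch below are made up for illustration and do not reproduce the Supplement 2 results.

```r
# Interpolate conditional power at the planned sample size from an estimated curve
# (x = total sample size, y = conditional power; values are illustrative only).
n_grid   <- c(200, 400, 600, 800, 1000, 1670)
cp_curve <- c(0.15, 0.38, 0.56, 0.66, 0.72, 0.80)

approx(x = n_grid, y = cp_curve, xout = 600)$y  # conditional power at N = 600
```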

The above-mentioned example demonstrates the importance of a priori conditional power analyses if it is the aim of an RCT to challenge current treatment recommendations. Based on the information available for the ongoing RCTs, it is unclear whether a priori conditional power analyses have been performed. The results expected after completion of the ongoing RCTs will show whether a priori conditional power analyses could have contributed to improved trial designs and thus saved resources in terms of clinical trial costs.

It should, however, be made clear that the ongoing RCTs may focus on primary aims other than challenging current antidepressant treatment recommendations. In other words, they may not have been intended to be conditionally powered for possible future updating of NMAs, but may indeed be sufficiently powered as stand-alone trials. As discussed by Salanti and Nikolakopoulou [58], when an NMA is deemed inconclusive and future trials are to be planned, specific recommendations about what sort of trials should be planned are required. Trials can be planned to reduce the risk of bias in particular comparisons, to explain heterogeneity, or to inform outcomes for which evidence is imprecise. When the aim is to include the planned trial in an updated NMA later on, trials are not to be considered as stand-alone trials but may be seen as sequential additions to the existing evidence. The power and findings of individual trials are thus not of primary interest; rather, the conditional power of the NMA once the new trial is added and the resulting summary effect are of importance. Consequently, when an NMA is deemed inconclusive because of imprecision, sample size calculations should be based on the conditional power of an updated NMA.

With this in mind, the present work should not be misunderstood or lead to a possible misuse of conditional power analyses. Weber et al. [59] raised a fundamental question regarding the use of conditional power analyses by asking whether "it is appropriate to gain power for an updated NMA by in- or decreasing the number of planned future trials while manipulating the power of each of the individual planned future trials?" The authors argued that traditional methods of power analysis remain preferable because drug licensing is based on stand-alone RCTs. Regardless of whether one or multiple trials are planned, trials planned using conditional power may require different sample sizes (smaller or larger) than those planned using traditional power analysis aimed at achieving stand-alone conclusiveness. In other words, "individual RCTs should always be designed to satisfy their objectives and stand-alone studies (should not be) substituted by a meta-analysis of trials of inadequate size" [60].

Conclusions

In conclusion, the present analysis may inform decision-makers and researchers in planning future antidepressant trials in MDD. The results suggest that new conclusive evidence leading to potential updates of antidepressant treatment recommendations can hardly be achieved at reasonable trial scales. The main limitations on the use of the presented conditional power analysis arise from the large sample sizes estimated to be required in future trials as well as from the well-known overall small effect sizes of antidepressant treatments. These findings may be of importance for evaluating the clinical relevance and justification of ongoing or future RCTs on antidepressant treatments in MDD.