In a randomized controlled trial (RCT) comparing treatments A and B, a null hypothesis (H0) of no difference in a primary outcome of interest is defined. Whether any observed difference is statistically significant has traditionally been judged by the P value and the confidence interval (CI). For a century, an arbitrary threshold of 0.05 (1/20) has been used to define statistical significance.1 Because this probability is quite low, P ≤ 0.05 is taken to mean that the observed difference between A and B is incompatible with H0, and H0 is rejected in favor of the conclusion that A and B differ. The 95% CI is the range of values within which the true effect is expected to lie with 95% certainty. A 95% CI that excludes the null effect size indicates that the observed difference has reached statistical significance, and H0 is rejected. The 95% CI conveys the precision of the estimate and complements the P value; reporting both is recommended.

Use of the 0.05 P value threshold to dichotomize whether treatments A and B are “truly” different is appealing because of its simplicity. Indeed, for all its many limitations,2 its use actually increased from 1990 to 2015 in MEDLINE and PubMed Central abstracts and articles.3

Statistical significance carries considerable weight. Researchers, editors and reviewers, readers, and the press tend to become more excited about positive results.4 However, a shocking number of scientific studies and meta-analyses are not reproducible or replicable.5,6,7 A reasonable question to ask is: if a positive study were to be repeated, how “easy” might it be for the results to change from statistically significant to non-significant (and thus, rightly or wrongly, lose some of their appeal)? What if there were, by sheer chance, one or several more (or fewer) outcome events in one of the comparative groups? Would the coveted statistical significance be lost? In recent years, there has been much interest in the fragility index (FI),8,9,10,11 a metric first proposed 3 decades ago,12 to test the robustness (or fragility) of statistically significant results. The various applications and implications of the FI are discussed herein.

RCT WITH 1-1 RANDOMIZATION AND DICHOTOMOUS OUTCOMES

In an RCT with 1-to-1 (2-group) randomization and a dichotomous outcome (e.g., cure or no cure), the FI is the minimum number of patients who would hypothetically need to change from a “non-event” (e.g., survival) to an “event” (death), or vice versa, to raise P (recalculated using Fisher’s exact test) from ≤ 0.05 to > 0.05.8 We illustrate the FI with a report in which, at 12 months, injectable diacetylmorphine produced a statistically significant (P = 0.004) reduction in illicit-drug use or other illegal activities, compared with oral methadone, in patients with opioid dependency refractory to previous treatment (Table 1).13 At each iteration, we reassigned 1 additional patient in the diacetylmorphine group from “improved” to “not improved” and recalculated the P value using Fisher’s exact test. After 7 such reassignments, the results became non-significant.

Table 1 Comparison between diacetylmorphine (experimental cohort) and methadone (control) on reduction in illicit-drug use or other illegal activities in opioid addicts measured at 12 months.13 A fragility index of 7 corresponds to hypothetically reassigning 7 patients in the diacetylmorphine group from “reduction” to “no reduction”
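The iteration described above is easy to reproduce. The following minimal Python sketch flips one “event” at a time in the experimental arm and recalculates Fisher’s exact P; the function name fragility_index and the 2 × 2 counts are our own illustrative placeholders, not the published trial data or a published implementation.

```python
from scipy.stats import fisher_exact

def fragility_index(events_exp, n_exp, events_ctrl, n_ctrl, alpha=0.05):
    """Minimum number of patients in the experimental arm whose outcome must
    be flipped from "event" to "non-event" before Fisher's exact P > alpha."""
    flips = 0
    while True:
        table = [[events_exp, n_exp - events_exp],
                 [events_ctrl, n_ctrl - events_ctrl]]
        _, p = fisher_exact(table)
        if p > alpha or events_exp == 0:
            return flips
        events_exp -= 1      # hypothetically reassign one patient to "non-event"
        flips += 1

# Placeholder 2 x 2 counts (not the published trial data)
print(fragility_index(70, 115, 50, 111))
```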

A closer look at the study shows that of the 111 participants randomized to receive methadone, 66 (59.5%) discontinued the intervention because of incarceration, dropout, loss to follow-up, and similar issues, as did 38 (33%) of the 115 participants randomized to receive diacetylmorphine. Because the analysis was intention-to-treat, participants who discontinued the intervention were still included in their respective cohorts in the final analysis. One might argue that the FI is quite low compared with these high discontinuation rates (see the susceptibility index below). This raises the question of whether the results would have been different had those issues been less frequent. The results could conceivably have tilted even further in favor of diacetylmorphine, but since that therapy was already concluded to be superior, the narrative would not be greatly altered. It is also possible, however, that the results could have tilted toward methadone by sheer chance, or through a slight reduction in dropouts in that group in which a few additional participants recorded improvement, resulting in loss of statistical significance.

An FI of 0 is assigned when P changes from ≤ 0.05 to > 0.05 simply by reanalyzing the results with Fisher’s exact test instead of the chi-square test.8 Both tests analyze dichotomous outcomes; Fisher’s exact test is preferred for smaller studies and tends to yield higher P values than the chi-square test. Using Fisher’s exact test to calculate the FI therefore strengthens the “stress test” on the robustness of the results.
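To see the difference between the two tests, the short snippet below applies both to a single hypothetical 2 × 2 table (the counts are invented for demonstration and are not taken from any trial discussed here).

```python
from scipy.stats import fisher_exact, chi2_contingency

# Hypothetical small trial: 12/20 events vs. 5/20 events
table = [[12, 8], [5, 15]]

_, p_fisher = fisher_exact(table)
chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)

print(f"Fisher exact P = {p_fisher:.3f}, chi-square P = {p_chi2:.3f}")
# For small samples, Fisher's exact P is typically the larger of the two.
```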

There is no fixed FI threshold for deciding whether results are fragile or robust. Confidence in the results is undermined when the FI is low relative to the number of patients affected by problems such as baseline differences between the comparative cohorts despite randomization, dropouts, loss to follow-up, outliers, protocol violations, and crossover between groups, as illustrated by the example above.

Walsh et al. reviewed 399 RCTs in top journals and found a median FI of 8 (range 0–109); 25% had an FI ≤ 3, and 53% had an FI less than the number of patients lost to follow-up.8 Researchers have since calculated the FI for RCTs in various specialties (Table 2), and low FIs turn out to be common.

Table 2 A partial list of reviews on fragility index (FI) of randomized controlled trials (RCTs) of various specialties

RCT WITH 1-1 RANDOMIZATION AND CONTINUOUS OUTCOMES

In an RCT with 1-to-1 randomization and a continuous outcome (e.g., cholesterol level), the FI is the number of data points that must be moved from the group with the higher mean to the group with the lower mean to change P from ≤ 0.05 to > 0.05. Data points are chosen by proximity to the higher group’s mean, with the one closest to but greater than the mean moved first, and the process continues systematically until the recalculated P value is > 0.05.9 The Welch t test is used to recalculate P at each iteration.9 This FI is best calculated when the original dataset is being analyzed by the authors themselves.9 Readers computing the FI of a published paper usually have no access to the raw data, but they can generate a candidate dataset of normally distributed random numbers for each treatment arm using the published sample sizes, means, and standard deviations.9 We illustrate this continuous FI (CFI) with an RCT of vagal nerve electrical stimulation (VNS) plus standard rehabilitation to enhance upper limb function after stroke, compared with sham stimulation (electrodes placed but no current applied) plus standard rehabilitation.25 The primary outcome was the change in impairment on the first day after completion of therapy, measured by the Fugl-Meyer Assessment-Upper Extremity (FMA-UE) score (higher number = more improvement). Using the published summary statistics (control group: n = 55, mean = 2.4, SD = 3.8; VNS group: n = 53, mean = 5, SD = 4.4), we generated two sets of normally distributed FMA-UE scores, one for the control group and one for the VNS group. Table 3 shows the partial random number sets and the P values recalculated after moving one data point per iteration from the VNS group to the control group until P > 0.05. The CFI was found to be 3. In this study, no patients were lost to follow-up.

Table 3 Computation of the continuous fragility index for the study comparing vagal nerve stimulation (VNS) + rehabilitation vs. sham stimulation + rehabilitation after stroke25
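A minimal sketch of this reconstruction, assuming Python with NumPy and SciPy, is shown below. The random seed, the helper name continuous_fragility_index, and the rule of always moving the value closest to (but above) the higher arm’s mean are our own illustrative choices; because the data are simulated from summary statistics, the CFI obtained will vary from run to run and need not equal 3.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated FMA-UE change scores from the published summary statistics
ctrl = rng.normal(loc=2.4, scale=3.8, size=55)   # sham + rehabilitation
vns = rng.normal(loc=5.0, scale=4.4, size=53)    # VNS + rehabilitation

def continuous_fragility_index(high, low, alpha=0.05):
    """Move points from the higher-mean arm to the lower-mean arm, starting
    with the value closest to (but above) that arm's mean, until Welch's
    t-test gives P > alpha. Returns the number of points moved."""
    high, low = list(high), list(low)
    moves = 0
    p = stats.ttest_ind(high, low, equal_var=False).pvalue
    while p <= alpha and len(high) > 1:
        m = np.mean(high)
        above = [x for x in high if x > m]
        candidate = min(above, key=lambda x: x - m) if above else max(high)
        high.remove(candidate)
        low.append(candidate)
        moves += 1
        p = stats.ttest_ind(high, low, equal_var=False).pvalue
    return moves, p

cfi, p_final = continuous_fragility_index(vns, ctrl)
print(f"CFI = {cfi}, final Welch P = {p_final:.3f}")
```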

META-ANALYSIS

The FI has recently been used to appraise meta-analyses.10 A meta-analysis pools the results of comparable RCTs for improved statistical power, yielding a point estimate and 95% CI of the treatment effect. Statistical significance is reached if the 95% CI of the pooled treatment effect does not reach or cross the point of no effect (nullity). The FI of a meta-analysis is the minimum number of changes from “non-event” to “event,” or vice versa, in 1 or more of the included trials that would cause the 95% CI of the pooled treatment effect to reach or cross the point of nullity, thus changing the results from statistically significant to non-significant.10 For example, Andò and Capodanno performed a meta-analysis of 4 RCTs (total population 17,133) comparing radial and femoral access for percutaneous coronary intervention in adults with acute coronary syndrome (Fig. 1).26 As shown in the upper forest plot, the pooled risk ratio (RR) for major adverse cardiac events (MACE) was 0.86 (95% CI 0.75, 0.98), i.e., in favor of radial access. The bottom forest plot shows that if 3 more radial-access patients in 1 of the studies (STEMI-RADIAL) had experienced MACE, the pooled 95% CI would have included 1, signaling statistical non-significance. In comparison, the total number of patients lost to follow-up across the 4 RCTs was 62, and protocol violations of various kinds occurred in ~ 610 cases.

Fig. 1 Forest plots of a meta-analysis comparing radial and femoral access for percutaneous intervention after acute coronary syndrome on the incidence of major adverse cardiac event.26 The upper plot is the original analysis, which shows that the 95% confidence interval (CI) does not reach nullity of 1 for the risk ratio. The lower plot is the result after shifting 3 patients from 1 of the component RCTs from the radial to the femoral cohort, resulting in a new 95% CI that reaches nullity. Acknowledgement: These forest plots were plotted with RevMan5, made available free of charge by Cochrane (https://www.cochrane.org/)
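The mechanics can be sketched as follows. This simplified version pools log risk ratios with fixed-effect (inverse-variance) weights and adds hypothetical events to the experimental arm of one trial at a time; the trial counts are invented, and published implementations such as that of Atal et al. allow modifications across several trials and other meta-analytic models, so the sketch is illustrative only.

```python
import numpy as np

def pooled_rr_ci(trials):
    """Fixed-effect (inverse-variance) pooled risk ratio with a 95% CI.
    trials: list of (events_exp, n_exp, events_ctrl, n_ctrl) tuples."""
    log_rr, weights = [], []
    for e1, n1, e0, n0 in trials:
        rr = (e1 / n1) / (e0 / n0)
        var = 1 / e1 - 1 / n1 + 1 / e0 - 1 / n0   # variance of log(RR)
        log_rr.append(np.log(rr))
        weights.append(1 / var)
    pooled = np.average(log_rr, weights=weights)
    se = (1 / np.sum(weights)) ** 0.5
    return np.exp(pooled), np.exp(pooled - 1.96 * se), np.exp(pooled + 1.96 * se)

def meta_fragility_index(trials):
    """Smallest number of extra events added to the experimental arm of a
    single trial that makes the pooled 95% CI reach or cross RR = 1."""
    best = None
    for i in range(len(trials)):
        modified = [list(t) for t in trials]
        added = 0
        while True:
            _, lo, hi = pooled_rr_ci(modified)
            if lo <= 1.0 <= hi:
                break
            if modified[i][0] >= modified[i][1]:   # cannot add more events
                added = float("inf")
                break
            modified[i][0] += 1                    # one more hypothetical event
            added += 1
        if best is None or added < best:
            best = added
    return best

# Hypothetical 2-trial example (events_exp, n_exp, events_ctrl, n_ctrl)
trials = [(80, 1000, 110, 1000), (40, 800, 55, 800)]
print(pooled_rr_ci(trials))
print(meta_fragility_index(trials))
```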

Atal et al. found that in 43% of 400 meta-analyses with statistically significant pooled treatment effects, modifications of the outcomes of < 1% of the total participants in one or several trials would have rendered the results non-significant. Similarly, in 9% of these meta-analyses, modifying < 1% of the total number of events would have produced non-significance. The FI was ≤ 5 in 29% of the meta-analyses.10

FRAGILITY FOR SURVIVAL ANALYSIS

Time-to-event (survival) data can also be subjected to fragility analysis. The survival-inferred FI (SIFI) is the minimum number of reassignments of the best survivors (the patients with the longest follow-up time, whether they had an event or were censored) from the experimental group to the control group that results in loss of statistical significance.11 For illustration, we present a hypothetical study of a cancer therapy vs. control, with 100 patients in each group (Table 4). Over the course of 5 years, some patients die and others drop out of the study for unrelated reasons and are censored. Figure 2 shows the survival curves for the 2 groups using the original data. The observed difference has a P value of 0.026, indicating a statistically significant survival advantage with treatment. After shifting the 1 patient with the longest follow-up time from the treatment group to the control group, P becomes 0.072 (survival curves not shown; they look very similar to the originals because the change is small). The SIFI is therefore 1, a number much smaller than the number of censored patients.

Table 4 A hypothetical 5-year survival analysis in which treatment is associated with superior survival (P = 0.026). The P value is recalculated after shifting 1 or more patients with the longest follow-up time from the treatment group to the control group; in this example, shifting 1 patient results in a new P value of 0.072
Fig. 2 The Kaplan-Meier survival curves of a hypothetical study comparing treatment and control
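A sketch of the SIFI computation is shown below, assuming Python with NumPy and the lifelines package for the log-rank test. The follow-up times and event indicators are randomly generated stand-ins rather than the Table 4 data, and the sifi helper is our own illustrative construction, not the published algorithm.

```python
import numpy as np
from lifelines.statistics import logrank_test

rng = np.random.default_rng(1)

# Hypothetical follow-up times (months) and event indicators (1 = death)
t_treat = list(rng.exponential(scale=60, size=100))
e_treat = list((rng.random(100) < 0.6).astype(int))
t_ctrl = list(rng.exponential(scale=40, size=100))
e_ctrl = list((rng.random(100) < 0.6).astype(int))

def sifi(t_exp, e_exp, t_ctl, e_ctl, alpha=0.05):
    """Reassign the longest-followed patients from the experimental arm to the
    control arm, one at a time, until the log-rank P value exceeds alpha."""
    t_exp, e_exp = list(t_exp), list(e_exp)
    t_ctl, e_ctl = list(t_ctl), list(e_ctl)
    moves = 0
    while t_exp:
        p = logrank_test(t_exp, t_ctl,
                         event_observed_A=e_exp,
                         event_observed_B=e_ctl).p_value
        if p > alpha:
            return moves, p
        i = int(np.argmax(t_exp))      # best survivor in the experimental arm
        t_ctl.append(t_exp.pop(i))
        e_ctl.append(e_exp.pop(i))
        moves += 1
    return moves, p

moves, p_final = sifi(t_treat, e_treat, t_ctrl, e_ctrl)
print(f"SIFI = {moves}, final log-rank P = {p_final:.3f}")
```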

Bomze et al. reviewed 45 phase 3 RCTs on immune checkpoint inhibitors and cancer patient survival. The median SIFI was 5 (interquartile range, -4 to 12), representing 1% of the total sample size in 35% of the studies. In 51% of the RCTs, the SIFI was less than the number of censored patients.11

FI VARIANTS

The fragility quotient (the FI divided by the total number of study participants) adds context to the FI: a low FI despite a large sample size gives a smaller fragility quotient (hence more fragile results) than the same FI in a small study.27 The susceptibility index relates the FI to the number of patients lost to follow-up and is calculated as (number of patients lost to follow-up − FI) / (number of patients lost to follow-up).21 When the FI is small relative to the number lost to follow-up, the unknown outcomes of those patients could plausibly have changed the P value to > 0.05. The solidity index (also called the reverse FI) is the minimum number of event changes needed to render a statistically non-significant result significant (P changes from > 0.05 to ≤ 0.05).28
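For concreteness, the short sketch below works through these definitions with hypothetical numbers (loosely echoing the diacetylmorphine example); the reverse_fragility_index helper adds events to the first arm only and recomputes Fisher’s exact P, which is one simple way, not necessarily the published algorithm, to implement the solidity (reverse fragility) index.

```python
from scipy.stats import fisher_exact

# Hypothetical values: FI = 7, 226 participants, 104 lost to follow-up
fi, n_total, n_lost = 7, 226, 104
fragility_quotient = fi / n_total              # 7 / 226 ≈ 0.031
susceptibility_index = (n_lost - fi) / n_lost  # (104 - 7) / 104 ≈ 0.93

def reverse_fragility_index(e1, n1, e2, n2, alpha=0.05):
    """Minimum number of hypothetical extra events in group 1 needed to turn
    a non-significant 2 x 2 comparison (Fisher's exact P > alpha) significant."""
    changes = 0
    while e1 < n1:
        _, p = fisher_exact([[e1, n1 - e1], [e2, n2 - e2]])
        if p <= alpha:
            return changes
        e1 += 1          # convert one non-event to an event
        changes += 1
    return changes

print(fragility_quotient, susceptibility_index)
print(reverse_fragility_index(20, 100, 12, 100))   # hypothetical counts
```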

FI IS INFLUENCED BY THE P VALUE, SAMPLE SIZE, AND EVENT RATES

Computer simulation shows that the FI is inversely correlated with the P value.29 Intuitively, for any hypothetical study, exaggerating the outcome difference to yield lower P values would lead to correspondingly higher FIs, and vice versa. Analyses of 17 RCTs in the New England Journal of Medicine30 and of the traumatic brain injury literature31 reached the same conclusion (although an analysis of 10 RCTs in the epilepsy literature found no correlation32). Carter et al. argue that the FI is just the P value in disguise29 and adds little to traditional statistics. However, different studies with the same P value can have different FIs,33 so the FI of a study cannot be derived from the P value alone. In the diacetylmorphine vs. methadone RCT cited above,13 a P value of 0.004 seems rather far from 0.05, yet a high FI does not follow. Although the FI of 7 (Table 1) is slightly higher than what has been found in many reviews,33 it is dwarfed by the large number (104 of 226 patients) of protocol violations and data losses,13 leaving the result highly susceptible to the unknown outcomes of those patients (see the susceptibility index above). As another example, the NINDS trial showed that tissue plasminogen activator for ischemic stroke resulted in superior global neurologic outcomes with P = 0.008,34 but the FI is 3,35 suggesting that the results may not be robust. Indeed, the majority of similar studies of thrombolytics in ischemic stroke are neutral or demonstrate harm.36

The FI is positively correlated with sample size.28 Larger sample sizes tend to yield smaller P values, which, as noted above, correspond to higher FIs. A small study with a P value of 0.03 should have a lower FI than a larger study with the same P value.37 However, a large study does not automatically imply a high FI and robustness. As pointed out by Walsh et al.,8 the Leicester Intravenous Magnesium Intervention Trial (LIMIT-2), which tested the effect of intravenous magnesium on 28-day survival in patients with suspected acute myocardial infarction, randomized 2316 patients and demonstrated a 24% relative risk reduction in mortality with a P value of 0.04.38 Three years later, a trial of more than 58,000 patients showed no benefit, and subsequent meta-analyses indicated that any true benefit is unlikely.39 The FI of the LIMIT-2 trial was 1.8

The FI is positively correlated with event rates.31 Intuitively, when the FI is determined by hypothetically changing one event at a time, each change represents a small percentage shift when event rates are high and is less likely to alter the P value until many incremental steps have accumulated, driving up the FI. The implication is that the FI tends to be higher in studies with high event rates and lower in those with low event rates, whose results are therefore more likely to be judged fragile.

PUTTING FI IN PERSPECTIVE

To date, RCTs and meta-analyses from just about every specialty have been analyzed for their FIs, and low FIs are very common. This should not be surprising. In planning an RCT, researchers perform a power analysis to estimate the minimum sample size needed to detect the anticipated effect in the primary outcome at the 0.05 significance threshold. Furthermore, many RCTs are halted once interim results indicate that P ≤ 0.05 has been reached. In other words, many RCTs, including many practice-changing ones,30 are designed and executed in such a way that a low FI is not unexpected. Although it may be interesting to also determine FIs for important secondary outcomes, one must remember that RCTs are not usually powered for secondary outcomes.

The FI, as an offshoot of the P value, inherits the problems of the frequentist paradigm of null hypothesis testing. Much of medicine is not an exact science. The difference between a P value of 0.049 and one of 0.051 is not as meaningful as the frequentist framework makes it out to be, and the FI concept only perpetuates what many would argue is a misguided, binary mindset.2,24 In contrast, in the Bayesian paradigm, the probability of an outcome is derived from prior probability estimates updated by experimental data.24 Such priors may be derived from existing data and, for sensitivity analysis, assigned added uncertainties ranging from conservative to liberal to generate a credibility range for the posterior probability estimate. Indeed, we think, discuss, and argue in a Bayesian way, and there have been calls for increased use of the Bayesian approach.37,40 Nevertheless, it is human nature to seek simple answers to complex problems, and our reliance on the often misunderstood and misused P value as an arbiter of “truth” and importance shows no signs of diminishing.3 In appraising the usefulness of the FI, one must recognize that it is but one metric in an imperfect paradigm. Like the P value, the FI addresses statistical significance, not clinical significance; it can be considered a sensitivity analysis of the statistical significance of a study’s outcomes. The FI is an indicative tool for communicating the strength of statistical conclusions and should be considered alongside supporting evidence, plausibility, sample sizes, effect sizes, attrition rates, crossover rates, protocol violations, and the like. In meta-analyses, the quality and compatibility of the included trials, publication bias (“negative” studies are less likely to be submitted or published3), and the exclusion of non-English papers are also important to consider. There need not be a fixed threshold for the FI or its variants to dichotomize results into “robust” and “not robust,” for that would simply perpetuate our flawed binary mindset. Researchers have only started to dissect the FI and its variants to improve our understanding of this potentially useful but also potentially misleading metric.37 As clinical research and reporting methods continue to evolve, time will tell whether the FI concept will improve decision and policy making.