Statistical inference is the process of analyzing samples to infer characteristics about the populations from which the samples are drawn. Statistical methods are based on the laws of chance. Different frameworks exist for analyzing and reporting data. In medicine—as indeed across all scientific disciplines—a frequentist hypothesis framework predominates, with declarations of “significant” and “not significant” based on P values and confidence intervals (CIs). Nevertheless, P values and CIs are widely misunderstood and declarations of significance are frequently misleading. Consequently, clinical trials may be misinterpreted, with the potential for useful interventions to be discarded and for ineffective or harmful interventions to be adopted.1

Consider the LOVIT randomized trial, which compared high-dose vitamin C with placebo in patients with sepsis.2 Based on weak prior evidence, the authors postulated that vitamin C would be superior to placebo. Surprisingly, the authors observed a higher rate of death or persisting organ dysfunction (the composite primary outcome) in the vitamin C group (44.5%) than in the control group (38.5%). Under the frequentist framework, the result was statistically significant (relative risk [RR], 1.21; 95% CI, 1.04 to 1.40; P = 0.01). As we discuss later, a statistically significant result is equivalent to declaring that the intervention has an effect on the outcome. Therefore, we might reasonably ask, how likely is it that in patients with sepsis, high-dose vitamin C increases the risk of death or organ dysfunction? Unfortunately, P values and CIs provide no direct answer to the question. In a previous paper, one of us (D. S.) argued that the LOVIT trial provides little evidence for or against the existence of a real effect of high-dose vitamin C.3 Indeed, many significant results provide little evidence for real treatment effects.4,5

The precarious relationship that can exist between significant results and real effects is highlighted by a 2018 study by Silberzahn et al., in which 29 groups of researchers were asked to analyze the same data set to answer a simple research question: whether soccer referees are more likely to give red cards to dark skin-toned players than light skin-toned players.6 Twenty groups reported a significant association between skin tone and red cards and nine did not. In all, 21 unique models were used and the estimated effect sizes (odds ratios) varied between 0.89 and 2.93. While the real effect of skin tone on red cards is unknown, it is clear that identical data can lead to different outcomes from significance tests depending on the statistical model.

In 2016, the American Statistical Association published a statement on P values and statistical significance.7 The impetus for the statement was increasing concern about the misuse and misinterpretation of P values and the “numerous deep flaws” in frequentist hypothesis testing.8 There are three possible responses to the “significance problem.” First—and most obviously—we should better understand the limitations of frequentist hypothesis testing. A second—more radical—approach is to do away with hypothesis testing entirely and replace it with something different. A third approach is to quantify the relationship that exists between declarations of significance and the presence of real treatment effects. In this article, we discuss each approach in turn. Figure 1 summarizes the three approaches with reference to the LOVIT trial. Throughout the article, we refer to four multicentre randomized trials, two from the anesthesia literature (OPTIMISE, INPRESS) and two from the critical care literature (LOVIT, EOLIA) (Table 1).2,9,10,11 Each trial in Table 1 involves a control and an intervention group and reports the difference in a binary outcome, which is the most common framework for multicentre clinical trials in our specialties.

Fig. 1

Infographic: Frequentist and Bayesian approaches applied to the LOVIT trial.

The infographic shows the three approaches to analyzing trial data that are described in the text.

The left panel (yellow) shows traditional frequentist hypothesis testing. The authors applied a generalized linear regression model to the observed data and calculated a test statistic (t value), from which a P value and 95% confidence interval (CI) were obtained. A P value is the probability of obtaining a test statistic at least as large as that observed under a true null hypothesis. Since P ≤ 0.05 and the tail of the 95% CI excluded the null value (RR of 1), the null hypothesis was rejected and the alternative hypothesis accepted. The result was “significant” at the 5% threshold. Rejecting the null hypothesis and accepting the alternative hypothesis is equivalent to declaring the intervention has an effect on the outcome.

The lower middle panel (pink) shows the Bayes factor approach, which can be used to interpret declarations of statistical significance. The Bayes factor favouring the alternative hypothesis (BF1:0) is 0.68 (K&V method), indicating the data provide ambiguous/weak evidence for an effect (see Table 2). Since the result of frequentist hypothesis testing was “significant,” we use Bayes’ theorem to calculate the false positive risk (FPR). The FPR is the probability the null hypothesis is true despite the significant test result. The FPR of 60% means we should be cautious of accepting that a real effect exists.

The right panel (purple) shows Bayesian parameter estimation. A prior distribution for the treatment effect (δ) is formulated. In this case, the prior is neutral, centred on an absolute risk reduction of zero. From the data, a likelihood function is calculated, which is the probability of the data given different values for the treatment effect (i.e., data|δ). Bayes’ theorem is used to combine the prior and the likelihood to obtain the posterior distribution, which is the distribution of the treatment effect given the data (i.e., δ|data). From the posterior distribution, a 95% credible interval can be calculated, which is the range of treatment effects (δ|data) encompassing 95% of the probability density.

H0 = null hypothesis; H1 = alternative hypothesis; P(T+|H0) = type I error rate or probability the result is significant given the null hypothesis is true; P(T−|H1) = type II error rate or probability the result is not significant given the alternative hypothesis is true; RR = relative risk; BF1:0 = Bayes factor favouring the alternative hypothesis over the null hypothesis (see Equation 4); K&V = method of calculating the Bayes factor developed by Kass and Vaidyanathan; FPR = false positive risk; P(H0|T+) = probability the null hypothesis is true given the test is significant; δ = treatment effect (absolute risk difference)

Table 1 Results of frequentist and Bayesian analysis of four multicentre trials.

Part I: Frequentist hypothesis testing

There are two key ideas that underpin frequentist hypothesis testing. The first is that two competing explanations for the data are formulated: the null and the alternative hypothesis. The null hypothesis is that there is no effect of the intervention on the outcome. For a trial comparing event rates in a control and an intervention group, the null hypothesis (H0) is:

$$H_{0} :\theta_{1} - \theta_{2} = 0$$
(1)

where θ1 and θ2 represent the event rates (proportions) in the populations. The alternative hypothesis (H1) is that there is an effect of the intervention:

$$H_{1} :\theta_{1} - \theta_{2} \ne 0.$$
(2)

When framed in this way, one of the hypotheses must be true and one must be false. Notice that the null hypothesis is only true when the treatment effect in the population is precisely zero, which is termed a “point null.” While a treatment effect in the population of precisely zero is only rarely true (e.g., placebo vs placebo), it is a useful model under the frequentist framework. As the statistician George Box famously observed, “All models are wrong but some are useful.”12

The second key idea is that the data are used to quantify evidence against the null hypothesis. If the data are sufficiently improbable under a true null, the null hypothesis is rejected and the alternative hypothesis is accepted, which is equivalent to declaring the intervention has an effect on the outcome. The result is deemed “significant.” In medical research, “sufficiently improbable” has usually been defined as P ≤ 0.05 (5%) or, equivalently, exclusion of the null result from the 95% CI. By contrast, if P > 0.05, the null hypothesis is not rejected and the result is deemed “not significant.” Not rejecting the null hypothesis is not the same as declaring the intervention has no effect.13 When the result is not significant, we can make no claims on the truth of the null hypothesis. Thus, with frequentist hypothesis testing, we quantify evidence against but not in favour of the null hypothesis.

The definition of a P value follows naturally from the preceding discussion. A test statistic (e.g., t value, χ2 statistic) is calculated from the data. A P value is the probability of obtaining a test statistic at least as large as that observed under a true null hypothesis (i.e., no real effect). Consequently, a P value is the probability of observing data at least as extreme as that observed given the null hypothesis is true. The key points are: 1) a P value is a probability relating to the data not the hypothesis and 2) a P value is calculated assuming the null hypothesis is true. It is a common misconception that a P value represents the probability the null hypothesis is true, which is not possible, as a P value is calculated assuming the null hypothesis is true.14

While more complicated to untangle, a CI also relates to the data—not the hypothesis. If an identical study were repeated many times (i.e., involving multiple sets of sample data) and on each occasion a 95% CI were calculated on the observed difference (e.g., θ1 − θ2), then we would expect 95% of those intervals to contain the true population difference. Because of sampling variability, the intervals will vary. This definition is different from “there is a probability of 0.95 (95%) that the true population difference lies within the interval,” which is incorrect. Because the interval is the random variable—not the population effect size, which has a fixed but unknown value—it is not appropriate to calculate probabilities for the hypothesis from a CI.
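To make these definitions concrete, the sketch below computes a P value and 95% CI for a two-group binary outcome in base R (the language used for the figures in this article), using the EOLIA event counts reported in Part II (57/125 deaths with conventional treatment, 44/124 with ECMO).

```r
# EOLIA event counts (see Part II): 57/125 deaths (control), 44/124 (ECMO)
deaths <- c(57, 44)
n      <- c(125, 124)

# Two-sided test of H0: theta1 - theta2 = 0, with a 95% CI for the
# absolute risk difference (normal approximation, no continuity correction)
prop.test(deaths, n, correct = FALSE)

# An exact alternative on the 2x2 table (rows: died/survived; columns: group)
fisher.test(matrix(c(57, 125 - 57, 44, 124 - 44), nrow = 2))
```

Both P values fall above the 0.05 threshold, consistent with the classification of EOLIA as “not significant” under the frequentist framework (Table 1).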

Power and errors

If we observe P ≤ 0.05 but there is no effect of the intervention in the population, a type I error occurs (false positive). Nevertheless, when P ≤ 0.05, we do not know if the result is a true positive or a false positive. All we know is that if we repeated the same trial many times over, each with the same sample size, then whenever the null hypothesis was true, we would expect significant results (type I errors) roughly 5% of the time. If we observe P > 0.05 but there is a real effect, a type II error occurs (false negative). Again, we have no way of knowing if the result is a false negative or a true negative.

Statistical power is the probability the test will be significant given a real effect exists (i.e., power is the probability of obtaining a true positive result). When a real effect exists, the test is either significant (true positive) or not significant (false negative). Therefore, power is 1 minus the type II error rate. Most trials report power of at least 0.8 (80%). Nevertheless, there is an Achilles heel to such claims: they are only valid if the treatment effect in the population is at least as large as that used when calculating the sample size. If the true effect size is smaller, the achieved power of the study will be less than the reported (design) power and the type II error rate will be higher than expected. In trials in anesthesia and critical care, overestimating the size of the treatment effect when calculating the sample size is extremely common.4,15,16,17 In fact, low power is endemic across medical research. In a recent analysis, van Zwet et al. estimated that the median achieved power among more than 20,000 randomized trials in the Cochrane Database of Systematic Reviews was only 0.13 (13%)!18 When power is low, the type II error rate is high, meaning few trials report significant results. This is the situation that exists for most multicentre trials in anesthesia and critical care.4,16
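The consequences of an overestimated effect size are easy to demonstrate with base R’s power.prop.test. A minimal sketch, using the event rates assumed in the EOLIA design (0.6 vs 0.4; see the Fig. 6 legend):

```r
# Design: n per group for 80% power to detect 0.6 vs 0.4 (two-sided alpha = 0.05)
power.prop.test(p1 = 0.6, p2 = 0.4, power = 0.80)   # n ~ 97 per group

# Achieved power if the true effect is half that assumed (0.6 vs 0.5)
power.prop.test(n = 97, p1 = 0.6, p2 = 0.5)         # power falls to ~0.29
```

Halving the true effect size does not merely halve the power; it collapses, and the type II error rate rises from 20% to roughly 70%.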

The 5% significance threshold

Imagine we have some trial data and apply two statistical models. Model A gives a P value of 0.04 and model B gives a P value of 0.06. Which model is correct? Is the result a true positive, false positive, true negative, or false negative? The question is meaningless. Nothing magical happens at P = 0.05. In fact, the two models give similar results: assuming the treatment is ineffective, the probability of observing an effect at least as large as that seen is about one in 20.

The 0.05 threshold was introduced by Ronald Fisher in his 1925 book Statistical Methods for Research Workers.19 One of the reasons Fisher chose 0.05 was convenience: because it corresponds to a value “roughly two standard deviations away from the mean of a normally distributed random variable,” it made for “ease of calculation at a time before computers rendered tables and approximations largely obsolete.”20 Fisher might well be surprised and alarmed that his 0.05 threshold has been afforded such importance in modern medical research.

The current hypothesis testing framework for frequentist inference is most unsatisfactory. We use an arbitrary threshold to declare the presence of a real effect based on a metric that has no direct relationship to the presence of a real effect. Could we analyze and report data in a way that avoids dichotomizing the results, using a metric that quantifies something more useful than the probability of the test statistic under the null hypothesis? As it turns out, we can.

Part II: Bayesian inference

Bayesian inference has a simple conceptual framework. First, we consider what is already known about an intervention and express that knowledge as a probability distribution. Next, we do an experiment and obtain data. Finally, we use Bayes’ theorem to combine our prior knowledge with the data to obtain a posterior distribution for the thing we are interested in, which is usually the size of the treatment effect. This simple conceptual framework belies some complicated mathematical theory and advanced computational methods. But we will start simply. What is a probability distribution?

Probability distributions

Imagine the year is 2010 and you are part of the team of researchers planning the EOLIA trial, which was a multicentre randomized trial comparing extracorporeal membrane oxygenation (ECMO) with conventional treatment in patients with severe acute respiratory distress syndrome (ARDS).10 To determine the sample size for the trial, you wish to estimate the proportion of patients with severe ARDS who die with conventional treatment. Based on published data, you think the mortality rate is around 60%. You are uncertain of the exact value but are fairly confident it is between 50% and 70%. These beliefs can be represented as a probability distribution.

Probability distributions can be modelled using mathematical functions, each with particular properties. A property of all probability distributions is that the area under the curve (i.e., the sum of all probabilities) is 1. Proportions, which take values between 0 and 1, can be modelled with beta distributions. Beta distributions have two parameters, a and b, which determine the shape and position of the distribution. A beta distribution with a = 36.18 and b = 24.23 closely resembles the aforementioned beliefs about the mortality from severe ARDS (Fig. 2).
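The stated properties of this distribution are easily checked in base R; a quick sketch:

```r
# Properties of Beta(a = 36.18, b = 24.23)
a <- 36.18; b <- 24.23
(a - 1) / (a + b - 2)          # mode: ~0.60
qbeta(c(0.025, 0.975), a, b)   # central 95% interval: ~0.48 to 0.72,
                               # i.e., roughly the stated 0.5 to 0.7
```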

Fig. 2

Probability distribution.

The figure shows a probability distribution, representing the uncertainty surrounding the population mortality rate from severe acute respiratory distress syndrome for the EOLIA trial. The horizontal axis shows the proportion (θ1) and the vertical axis shows the height of the probability density. Since θ1 is a proportion, values on the horizontal axis are constrained to lie between 0 and 1. The most likely value for θ1 is 0.6, and 95% of the area under the curve (probability density) lies between 0.5 and 0.7. The area bounded by the curve (i.e., the sum of all probabilities) is 1. Notice that the peak of the distribution extends to a height on the vertical axis of roughly 6 units, which is required to obtain a total area for the distribution of 1.

The distribution was obtained by simulated draws from a beta distribution with shape parameters a = 36.18 and b = 24.23. The distribution corresponds to the neutral priors used for θ1 (control event rate) and θ2 (intervention event rate; extracorporeal membrane oxygenation group) for the EOLIA trial shown in Fig. 3.

Prior distributions

The probability distribution shown in Fig. 2 is a prior distribution (“prior”); it is an estimate of the distribution of the population event rate in the control group of the EOLIA trial before any data are observed. For simplicity, and because you do not know if ECMO offers a survival benefit, you decide to use the same prior for the intervention (i.e., θ2). We now have two priors, one for θ1 and one for θ2. If we use a computer to draw random samples from each prior and take the difference each time, we can obtain a prior for the treatment effect. The prior for the treatment effect described above is “neutral”: we are assuming only that θ1 and θ2 are broadly similar (i.e., dependent), not presupposing that ECMO is beneficial.

We could have chosen many different priors for θ1 and θ2. For instance, we might have chosen priors where mortality was on average 10% less for θ2 than θ1. The resulting prior for the treatment effect would be “enthusiastic”—we are assuming ECMO reduces mortality compared with standard care by about 10%. Alternatively, if we had no inkling as to the mortality rate from severe ARDS or ECMO, we might have decided that any possible event rate was equally likely and assigned uniform distributions to θ1 and θ2. The resulting prior for the treatment effect would then be “uninformed.” Figure 3 shows examples of uninformed, neutral, and enthusiastic priors for the treatment effect of ECMO in severe ARDS.
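The three treatment-effect priors in Fig. 3 can be reconstructed by Monte Carlo simulation, using the beta shape parameters given in the Fig. 3 legend; a sketch:

```r
# Priors for the treatment effect (theta1 - theta2), by random draws
draws <- 1e5

# Uninformed: uniform (Beta(1, 1)) priors for both event rates
uninformed <- rbeta(draws, 1, 1) - rbeta(draws, 1, 1)

# Neutral: identical Beta(36.18, 24.23) priors for both event rates
neutral <- rbeta(draws, 36.18, 24.23) - rbeta(draws, 36.18, 24.23)

# Enthusiastic: control centred on ~0.6, ECMO centred on ~0.5
enthusiastic <- rbeta(draws, 36.18, 24.23) - rbeta(draws, 33.41, 33.41)

hist(neutral, breaks = 100)   # centred on 0, most density within +/- 0.2
```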

Fig. 3

Uninformed, neutral, and enthusiastic prior distributions.

The left panels (top, middle, bottom) show prior distributions for the population mortality rate (θ1) for patients with severe acute respiratory distress syndrome (ARDS) treated with conventional therapy. The centre panels (top, middle, bottom) show prior distributions for the population mortality rate (θ2) for patients with severe ARDS treated with extracorporeal membrane oxygenation (ECMO). The right panels (top, middle, bottom) show prior distributions for the population treatment effect of ECMO.

Top panels: Uninformed prior. The left and centre panels show uniform distributions for θ1 and θ2. The resulting distribution for the treatment effect (right panel) is a triangular distribution, centred on a treatment effect of 0 with tails of the distribution at −1 and +1.

Middle panels: Neutral prior. The left and centre panels show identical distributions for θ1 and θ2, with a mode of 0.6 and 95% of the probability density between 0.5 and 0.7. The resulting distribution for the treatment effect (right panel) is centred on a treatment effect of 0 with most of the density between −0.2 and +0.2.

Bottom panels: Enthusiastic prior. The left panel shows a distribution for θ1 with a mode of 0.6 and 95% of the probability density between 0.5 and 0.7. The centre panel shows a distribution for θ2 with a mode of 0.5 and 95% of the probability density between 0.4 and 0.6. The resulting distribution for the treatment effect (right panel) is centred on a value of 0.1 with most of the density between −0.1 and 0.3.

The distributions for θ1 and θ2 were obtained using random draws from beta distributions with the following shape parameters. For the uninformed prior, a = 1 and b = 1 were used for both θ1 and θ2. For the neutral prior, a = 36.18 and b = 24.23 were used for both θ1 and θ2. For the enthusiastic prior, a = 36.18 and b = 24.23 were used for θ1 and a = 33.41 and b = 33.41 were used for θ2. Priors for the treatment effect were obtained by random draws from θ1 and θ2, taking the difference each time.

The “strength” of a prior refers to the degree of uncertainty around the true value and is reflected in the variance (i.e., spread) of the distribution. A low variance indicates high strength, and vice versa. Notice from Fig. 3 that the spread of the uninformed prior is greater than the spread of the neutral and enthusiastic priors. Thus, the neutral and enthusiastic priors are “stronger” than the uninformed prior.

The likelihood function—incorporating our experimental data

A likelihood function (“likelihood”) is a distribution representing the probability of the data for each possible value for the parameter of interest. In the EOLIA trial, there were 57/125 (46%) deaths in the control group and 44/124 (35%) deaths in the ECMO group. The likelihood for the control group tells us how likely we are to observe 57 deaths in 125 patients given different population values for θ1. Similarly, the likelihood for the intervention group tells us how likely we are to observe 44 deaths in 124 patients given different population values for θ2. Figure 4 shows the likelihoods for the control group, intervention group, and treatment effect for the EOLIA data.
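Because the number of deaths in each group follows a binomial distribution, the likelihoods can be computed with base R’s dbinom; a sketch:

```r
# Binomial likelihoods for the EOLIA data over a grid of event rates
theta <- seq(0, 1, by = 0.001)

lik_control <- dbinom(57, 125, theta)   # P(data | theta1)
lik_ecmo    <- dbinom(44, 124, theta)   # P(data | theta2)

# The likelihoods peak at the observed event rates
theta[which.max(lik_control)]   # ~0.456
theta[which.max(lik_ecmo)]      # ~0.355
```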

Fig. 4

Likelihoods for the EOLIA data.

The left panel shows the likelihood for the observed event rate in the control group (standard care). The centre panel shows the likelihood for the observed event rate in the intervention group (extracorporeal membrane oxygenation). The right panel shows the likelihood for the treatment effect. A likelihood is interpreted as a probability for the data given a particular value for the parameter of interest in the population. Thus, the likelihoods for the observed event rates in the two groups can be written as P(data|θ1) and P(data|θ2), respectively. Notice that the mode of the likelihood corresponds to the observed event rates, which for θ1 is 0.46 and for θ2 is 0.35. The mode of the likelihood for the treatment effect corresponds to the observed treatment effect (i.e., 0.46 − 0.35 = 0.11). The likelihood for the treatment effect was estimated by drawing random samples from the likelihoods for θ1 and θ2 and taking the difference each time. Because of random sampling error in the simulation, the mode is estimated as an absolute risk reduction of 0.10 (10%), not the observed value from the trial of 0.11 (11%).

Posterior distributions

A posterior distribution (“posterior”) is a probability distribution for the population value for the parameter of interest based on the prior (what we knew before collecting data) and the likelihood (what we learn from the data). The posterior is calculated using Bayes’ theorem, which in words can be expressed as:

$$\text{posterior} \propto \text{likelihood} \times \text{prior}$$
(3)

Strong priors have a greater impact on the posterior than weak priors do. Figure 5 shows posterior distributions for the treatment effect of ECMO, based on the EOLIA likelihood and three different priors (uninformed, neutral, enthusiastic).
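For binomial data with beta priors, the posterior is available in closed form: the events are added to a and the non-events to b (the conjugate beta-binomial model mentioned in the Fig. 5 legend). A minimal sketch for the neutral prior:

```r
# Conjugate beta-binomial update with the neutral prior
a <- 36.18; b <- 24.23
draws <- 1e5

# Posterior for each event rate: Beta(a + events, b + non-events)
post_control <- rbeta(draws, a + 57, b + 125 - 57)
post_ecmo    <- rbeta(draws, a + 44, b + 124 - 44)

# Posterior for the treatment effect (absolute risk reduction)
post_delta <- post_control - post_ecmo
mean(post_delta)   # ~0.07, between the prior mode (0) and likelihood mode (~0.11)
```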

Fig. 5

Posterior distributions based on uninformed, neutral, and enthusiastic priors.

Each panel shows a prior, likelihood, and posterior distribution for the treatment effect of extracorporeal membrane oxygenation (ECMO) for treating severe acute respiratory distress syndrome. Priors (blue lines) are uninformed (top panel), neutral (middle panel), and enthusiastic (bottom panel), corresponding to the priors shown in Fig. 3. The likelihoods (red lines) are all identical and correspond to the likelihood for the treatment effect from the EOLIA trial shown in Fig. 4. The posterior distributions are shown in green. The vertical green dashed lines correspond to the mode of the posteriors.

In the top panel (uninformed prior), the mode of the posterior occurs at an absolute risk reduction of 0.10 (10%). When the prior is uninformed, the posterior is determined entirely by the data (i.e., the likelihood). Notice that the posterior overlies the likelihood, which is therefore not easily seen.

In the middle panel (neutral prior), the mode of the posterior occurs at an absolute risk reduction of 0.07 (7%), which lies between the mode of the prior (absolute risk reduction of zero) and the likelihood (absolute risk reduction of 0.10).

In the bottom panel (enthusiastic prior), the mode of the posterior occurs at an absolute risk reduction of 0.10 (10%). In this case, the modes of the prior, likelihood, and posterior are all the same (absolute risk reduction of 0.10).

Because of sampling variability, the mode of the likelihood occurs at an absolute risk reduction of 0.10 (10%), not the observed value from the trial of 0.11 (11%). The plots were created in R using a conjugate beta-binomial model. The code to replicate the plots is available at URL: https://github.com/cjdbarlow/papers (accessed June 2023).

Interpreting posterior distributions

We can summarize the posterior in several ways. Commonly, researchers report the treatment effect incorporating 95% of the probability density, termed a 95% credible interval. The 95% credible interval for the EOLIA data based on our neutral prior is −0.04 to 0.17, meaning there is a 95% chance that the absolute mortality reduction associated with ECMO is between −4% and 17%. The peak of the posterior distribution is an absolute mortality reduction of about 6.5%, which, in this case, lies a little closer to the peak of the likelihood (11%) than the prior (0%).

Notice from Table 1 that the 95% credible intervals for each of the four trials include an absolute risk difference of 0%, which is the null value. With frequentist hypothesis testing, inclusion of the null value in the 95% CI means the result would be classified as not significant. Nevertheless, with a Bayesian framework, inclusion of the null value in the tail of the 95% credible interval is not of critical importance.

In addition to a credible interval, we might report the probability that the treatment effect exceeds a minimum clinically important difference (MCID).21 For the trials in Table 1, let us assume that a 5% lower event rate for the intervention group represents a MCID—a value at which most clinicians would adopt the intervention. For the statistically significant INPRESS trial, the probability the treatment effect exceeds our hypothetical clinically important difference is 0.8 (80%). For the two trials that were not significant (EOLIA and OPTIMISE), the probability the treatment effect is clinically important is 0.62 (62%) and 0.56 (56%), respectively. Thus, for all three trials (INPRESS, EOLIA, and OPTIMISE) there is a reasonable chance the treatment effect exceeds our hypothetical MCID. In the LOVIT trial, given the observed effect of vitamin C was for harm, the probability that vitamin C has a beneficial effect that exceeds our 5% threshold is very low.
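Both summaries (the credible interval and the probability of exceeding a MCID) fall directly out of the posterior draws; continuing the beta-binomial sketch from above:

```r
# Posterior summaries for EOLIA under the neutral prior
a <- 36.18; b <- 24.23
post_delta <- rbeta(1e5, a + 57, b + 125 - 57) - rbeta(1e5, a + 44, b + 124 - 44)

quantile(post_delta, c(0.025, 0.975))   # 95% credible interval (~ -0.04 to 0.17)
mean(post_delta > 0.05)                 # P(effect exceeds the 5% MCID), ~0.62
```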

Under the frequentist framework, the not significant result from the EOLIA trial led the authors to conclude, “Among patients with very severe ARDS, 60-day mortality was not significantly lower with ECMO …”10 By contrast, our Bayesian interpretation tells us that more likely than not ECMO does reduce mortality in patients with severe ARDS, with an effect size that might be considered clinically important.

Credible vs confidence intervals

Both a frequentist CI and a Bayesian credible interval provide a plausible range for the treatment effect. Therefore, we might ask, do we not obtain the same information from both? Well, not quite. Because a credible interval is derived from the probability distribution of the treatment effect given the data, it is straightforward to calculate probabilities for the treatment effect; for instance, the probability that the treatment effect exceeds a MCID. As noted above, it is not appropriate to infer such probabilities from a CI. A CI is based entirely on the sample data, which in the case of the EOLIA trial is the 249 participants who took part in the trial. By contrast, a credible interval incorporates prior knowledge. A full Bayesian analysis typically involves calculating posterior distributions using priors of varying enthusiasm. For instance, in 2018, Goligher et al. did a Bayesian reanalysis of the EOLIA data using multiple priors.22 In this way, a reader can see how sensitive the posterior is to different priors and interpret the result accordingly. Finally, under the frequentist model, the main use of a CI is to determine statistical significance. A more useful interpretation of a CI is as a plausible range for the treatment effect. Nevertheless, this aspect is frequently ignored. For instance, in a selection of ten multicentre trials in anesthesia reporting nonsignificance for the primary outcome, only three drew attention to the uncertainty associated with the observed treatment effect.23

Part III: Simplified Bayesian metrics for interpreting declarations of “significant” and “not significant”

In the preceding section, we used Bayesian inference to estimate an unknown parameter—the size of the treatment effect for ECMO in patients with severe ARDS—by estimating a posterior distribution. A Bayesian posterior distribution provides a complete picture of the data informed by prior knowledge.24 Nevertheless, we can also use Bayesian methods to compare two models for the data—in our case the relative merits of the null and the alternative hypothesis. For model comparison, we can use a Bayes factor (BF) or a likelihood ratio (LR).

Bayes factors

A BF is the ratio of the probability of the data (e.g., the observed values for θ1 and θ2) under two competing models. For the null (H0) and alternative (H1) hypotheses, we can define the BF favouring the alternative hypothesis as:

$$BF_{1:0} = \frac{P(\text{data}|H_{1})}{P(\text{data}|H_{0})}$$
(4)

The right side of Equation 4 is the ratio of two conditional probabilities. A conditional probability arises when the chance of an event occurring depends on another event being true. The upper-case “P” means “the probability that” and the symbol “|” means “given that.” So, P(data|H1) means the probability of the data given the alternative hypothesis is true. On the left side of Equation 4, the term “BF1:0” means “the BF favouring the alternative hypothesis relative to the null hypothesis.” A BF1:0 of 5 means the data are five times more likely under the alternative than the null hypothesis. The reciprocal of BF1:0 is the BF favouring the null over the alternative, denoted BF0:1. Therefore, if BF1:0 is 5, then BF0:1 is 1/5th (0.2). A BF is always positive and has no upper bound. Unlike a P value, a BF quantifies the evidence in favour of both the null (BF0:1) and the alternative (BF1:0) hypotheses. The strength of evidence provided by a BF can be classified using a qualitative scale (Table 2).

Table 2 Qualitative scale for the strength of evidence provided by the Bayes factor.

Calculating BFs involves solving complicated integrals. For instance, evaluating P(data|H1) involves integrating (i.e., averaging) the probability of the data across all combinations for the parameters (θ1, θ2) that are possible under the alternative hypothesis, weighted by their prior probabilities.25,26 In practice, BFs are estimated using mathematical approximations to the integrals or other clever workarounds.26,27,28 The phrase “weighted by their prior probabilities” is important, as it means we must specify prior distributions for the parameters.

Several methods are available for calculating a BF for the equivalence of two proportions.29 Two popular approaches are the methods developed by Gunel and Dickey (G&D)30,31 and Kass and Vaidyanathan (K&V).32,33,34 As a default, the G&D method assigns uniform priors to θ1 and θ2. By contrast, the K&V method assumes an inherent dependency between θ1 and θ2 with the effect that the prior for the treatment effect is neutral.32 We see from Table 1 that the BF1:0 values calculated using the G&D and K&V methods for the EOLIA trial are 0.57 and 0.87, respectively. Since in both cases BF1:0 < 1, the data slightly favour the null over the alternative hypothesis, but the strength of evidence is weak (Table 2).
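For readers who wish to reproduce this type of calculation, the contingencyTableBF function in the R package BayesFactor implements the G&D priors. A sketch for the EOLIA table, assuming group sizes are fixed by design (the exact value obtained depends on the sampling model and prior scale assumed):

```r
# Gunel & Dickey Bayes factor for the EOLIA 2x2 table
library(BayesFactor)

# Rows: died / survived; columns: control / ECMO
eolia <- matrix(c(57, 68, 44, 80), nrow = 2)

# Independent multinomial sampling with column (group) totals fixed,
# as in a randomized trial; reports the BF favouring H1 over H0
contingencyTableBF(eolia, sampleType = "indepMulti", fixedMargin = "cols")
```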

Likelihood ratios

LRs are another method of comparing the relative evidence for the alternative and null hypotheses. Again, we can use the notation LR1:0 and LR0:1 to represent the evidence favouring the alternative and null hypotheses. Unlike a BF, which compares the probability of the data under the point null (i.e., θ1 – θ2 = 0) with the probability of the data averaged over all possible values for the alternative hypothesis (i.e., θ1 – θ2 ≠ 0), an LR compares the point null hypothesis with a point (i.e., single) value for the alternative hypothesis (e.g., θ1 – θ2 = 0.1). Since we are only considering a single value for the alternative hypothesis, there is no requirement to specify priors for the parameters. Likelihood ratios provide an intuitive feel for the relative merits of the two hypotheses but are highly dependent on the specific value chosen for the alternative hypothesis.

An obvious choice for the alternative hypothesis is the observed treatment effect. The observed treatment effect is the effect size that is best supported by the data.25,35 Using the observed treatment effect produces the highest possible value for LR1:0, which is termed the maximum likelihood ratio (MLR1:0).25,35 A MLR1:0 can be calculated directly from the data.35 While not strictly a MLR, an upper bound for LR1:0 across a large class of reasonable alternative hypotheses can also be calculated directly from the P value.36,37,38,39 Notice from Table 1 that the values for MLR1:0 are substantially larger than the equivalent values for BF1:0.
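One simple construction of a MLR (a sketch, not necessarily the exact method behind Table 1) compares the binomial likelihood of the EOLIA data at the observed event rates with the likelihood under the point null, evaluated at the pooled event rate:

```r
# Maximum likelihood ratio for the EOLIA data
# H1 evaluated at the observed event rates (the maximum likelihood estimates)
lik_H1 <- dbinom(57, 125, 57 / 125) * dbinom(44, 124, 44 / 124)

# H0 evaluated at the pooled event rate (101 deaths among 249 patients)
lik_H0 <- dbinom(57, 125, 101 / 249) * dbinom(44, 124, 101 / 249)

lik_H1 / lik_H0   # MLR1:0, substantially larger than the corresponding BF1:0
```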

Another choice for the alternative hypothesis is to use the expected treatment effect used by the researchers when they calculated the sample size.35 There are two issues with this approach. First, we know that the expected treatment effects are typically implausibly large.4,15,16 Second, if the observed treatment effect is in the opposite direction to the expected treatment effect, the LR1:0 will be implausibly small. For example, for the LOVIT data, the LR1:0 calculated using the expected treatment effect is zero. Together, these two issues underline the importance of being circumspect about the size and direction of the expected treatment effect when designing clinical trials, but we do not consider this further here.

One of us (P. M. J.) has developed an online calculator for the LR, which may be used for continuous and binary outcome measures using different values for the alternative hypothesis, available at URL: https://medresearch.shinyapps.io/Bayesian_re-analysis/ (accessed June 2023).

Probabilities for the hypothesis: the false positive and false negative risk

We can use a BF or the LR to estimate probabilities for the hypothesis.5,35,37 We can define four probabilities depending on whether the null or the alternative hypothesis is true and whether the test is significant or not significant. We will concentrate on two: the false positive risk (FPR) and the false negative risk (FNR). The FPR is the probability the null hypothesis is true given the test result is significant and the FNR is the probability the alternative hypothesis is true given the test result is not significant (Table 3).

Table 3 Conditional probabilities associated with frequentist hypothesis testing, assuming either the null (H0) or alternative (H1) hypothesis is true.

If we a priori assume that the null and alternative hypotheses each have a 50% chance of being true, we can calculate the FPR and FNR using Bayes’ theorem:4,5

$$\text{FPR} = \frac{1}{BF_{1:0} + 1}$$
(5)
$$\text{FNR} = \frac{1}{BF_{0:1} + 1}$$
(6)

When using an LR, we substitute LR1:0 or LR0:1 for the analogous BF.
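As a worked example, Fig. 1 reports BF1:0 = 0.68 for the LOVIT trial (K&V method), so:

```r
# False positive risk for LOVIT (Equation 5), using BF1:0 from Fig. 1
bf10 <- 0.68
fpr  <- 1 / (bf10 + 1)
fpr   # ~0.60, the 60% FPR quoted in Fig. 1
```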

Given the preceding discussion, an obvious question arises: which metric (LR or BF) should we use to estimate the FPR and FNR?

When using Bayes’ theorem to calculate probabilities for the hypothesis, we assume that either the null or the alternative hypothesis is true. Since a BF compares the probability of the data under a point null with the probability of the data averaged over all possible values for the alternative hypothesis, this assumption is satisfied. Nevertheless, when we use an LR—which compares two specific (point) hypotheses—the assumption is questionable, as all other values for the alternative hypothesis are assigned a probability of zero. For this reason, using a BF may be more valid than using an LR when the purpose is estimating the FPR or FNR.

If we use a BF, we must choose priors for θ1 and θ2. So far, we have described two methods for calculating BFs for two proportions: the G&D method, which as a default assigns an uninformed prior for the treatment effect, and the K&V method, which results in a neutral prior. In fact, there is another widely used technique, the Savage-Dickey method, whereby priors are specified directly by the user.26,28

Due to the inherent dependency that exists between θ1 and θ2, a neutral prior for the treatment effect is more realistic than an uninformed prior. For instance, in the EOLIA trial, the expected population mortality rate for the control group was 60%. Consequently, the expected population mortality rate for the ECMO group is more likely to be closer to 60% than to (say) 1% or 99%. A neutral prior assigns relatively more probability density to plausible values for the alternative hypothesis compared with an uninformed prior (compare the top right and middle right panels of Fig. 3). Consequently, a neutral prior provides relatively more support for the alternative hypothesis than an uninformed prior, meaning the BF1:0 tends to be larger and the BF0:1 smaller. Therefore, neutral priors tend to result in lower values for the FPR and higher values for the FNR. These effects can be seen in Table 1 by comparing the BFs calculated using the G&D (uninformed prior) and the K&V (neutral prior) methods. In Table 1, the differences in the BFs calculated by the G&D and K&V methods are small. Nevertheless, when the event rates are very high or low and the P value is close to the significance threshold, the differences between the two methods are substantial.32

Given neutral priors account for the inherent dependency between θ1 and θ2, the K&V method may be a better choice than the G&D for estimating the FPR and FNR. One of us (C. J. B.) has developed an online calculator for estimating the FPR and FNR for trials reporting binary outcomes using either the K&V or G&D methods, available at URL: https://cjdbarlow.github.io/bayes-trial-eval.html (accessed June 2023).

The false positive and false negative risk versus the type I and type II error rates

Figure 6 shows the relationship between the P value and the FPR (Fig. 6A) and FNR (Fig. 6B). Data were obtained using simulated statistical testing. P values were calculated using a Fisher exact test and BFs were calculated using the K&V method. The striking feature of Fig. 6 is that, for P values close to the significance threshold, the FPR approaches 30% and the FNR exceeds 60%. Figure 6 highlights the poor relationship that exists between real effects and the outcome from the hypothesis test for P values close to the significance threshold. By contrast, for P values well displaced from the significance threshold, the relationship is more robust. For instance, when P < 0.005 the FPR is below 5% and when P > 0.5 the FNR is below 30%.

Fig. 6

The relationship between the false positive risk and the false negative risk and the P value.

Fig. 6A shows the relationship between the P value (for P ≤ 0.05) and the false positive risk (FPR). The vertical axis (labelled “Risk”) indicates the probability the null hypothesis is true given the statistical test is significant, which, using mathematical notation, may be written as P(H0|T+). The vertical hatched lines are at P values of 0.005 and 0.01.

Fig. 6B shows the relationship between the P value (for P > 0.05) and the false negative risk (FNR). The vertical axis (labelled “Risk”) indicates the probability the alternative hypothesis is true given the statistical test is not significant, which, using mathematical notation, may be written as P(H1|T−). The vertical hatched line is at a P value of 0.10.

Data were obtained by simulated statistical testing. Two populations were defined, a “control” population with an event rate of 0.6 and an “intervention” population with an event rate of 0.4, corresponding to the expected event rates in the EOLIA trial. Sample pairs were drawn from the populations. In 50% of the simulations, both samples were drawn from the “control” population (no true effect). For the remaining simulations, one sample was drawn from the “control” population and one sample was drawn from the “intervention” population (true effect). In this way, the pretest probability of a real effect was set at 0.5. The sample size was set to 194 (97 per group) to achieve 80% power at a two-sided type I error rate of 0.05. For each sample pair, a P value was calculated using a Fisher exact test and a Bayes factor (BF) was calculated using the method of Kass and Vaidyanathan. From the BF, the FPR was calculated if P ≤ 0.05 and the FNR calculated if P > 0.05. A total of 10,000 sample pairs were drawn for the simulation. The curves are the lines of best fit. For Fig. 6A (P value vs FPR) the horizontal axis is shown on a square root scale to improve visualization of small P values. Simulations were done in R. The code to replicate the plots is available at URL: https://github.com/cjdbarlow/papers (accessed June 2023).
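A stripped-down version of this simulation takes only a few lines of base R. In the sketch below, the FPR is estimated directly from the simulation’s ground truth (the proportion of significant results for which the null is true) rather than from Bayes factors, which is a simplification of the approach described above:

```r
# Simplified Fig. 6-style simulation: estimate the FPR from known ground truth
set.seed(1)
n <- 97; sims <- 10000

null_true <- runif(sims) < 0.5        # pretest probability of no effect = 0.5
p2 <- ifelse(null_true, 0.6, 0.4)     # intervention event rate
x1 <- rbinom(sims, n, 0.6)            # control group events
x2 <- rbinom(sims, n, p2)             # intervention group events

p <- mapply(function(a, b) {
  fisher.test(matrix(c(a, n - a, b, n - b), nrow = 2))$p.value
}, x1, x2)

mean(null_true[p <= 0.05])              # FPR over all significant results (~0.05)
mean(null_true[p > 0.01 & p <= 0.05])   # FPR near the threshold: much higher (cf. Fig. 6A)
```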

Some readers may be surprised and alarmed to see such high values for the FPR and FNR for P values close to the significance threshold. Furthermore, it may not be obvious how it is possible to have an FPR of 30% when the type I error rate is only 5% (and an FNR of 60% when the type II error rate is 20%). The explanation for this apparent discrepancy is that the FPR and the type I error rate (and the FNR and type II error rate) represent different conditional probabilities (Table 3). The FPR and FNR are probabilities for the truth of the hypothesis conditional on the outcome from the significance test, whereas the error rates are probabilities for the outcome of the significance test conditional on the truth of the hypothesis. That the FPR is not < 5% when P ≤ 0.05 reflects a common misunderstanding about the definition of P values and the meaning of the type I error rate.

Tackling the significance problem

One of the goals in writing this article was to highlight how misleading declarations of “significant” and “not significant” are when P is close to the default threshold of 0.05 or the tail of the CI includes the null value. The issue is particularly relevant in anesthesia and intensive care, as most significant results from multicentre trials report P values close to 0.05.4,16

An obvious solution to the problem of dichotomizing results as significant and not significant is to replace frequentist hypothesis testing with Bayesian inference as the default method of analyzing and reporting data from randomized trials. A Bayesian posterior distribution provides a complete picture of the data informed by prior knowledge. Priors can be updated as new evidence is published. In fact, Bayesian methods have a broad range of applications within statistics, including calculating and updating sample sizes, determining stopping rules for randomized trials, and for combining data from multiple studies in meta-analyses.40,41,42,43

Nevertheless, transitioning from the frequentist hypothesis testing framework—which is firmly established and has dominated medical research for the last 100 years—will not happen overnight. While Bayesian posterior distributions have an intuitive interpretation, the underlying probability theory can be intimidating. We have presented a gentle introduction to Bayesian parameter estimation and largely avoided discussing mathematical theory. We have not addressed the main method of estimating posterior distributions, Markov chain Monte Carlo, which requires knowledge of code-based statistical packages such as R (R Foundation for Statistical Computing, Vienna, Austria) or Stan (Stan Development Team, New York, NY, USA). We have also not fully addressed the details of choosing prior distributions, with which many researchers are unfamiliar.

Therefore, despite increasing interest in Bayesian methods, frequentist hypothesis testing is likely to remain the dominant framework for statistical inference for some time to come. As an interim measure—while we await the Bayesian revolution—we propose two solutions to the significance problem. First, when reporting the results of randomized trials under a frequentist hypothesis framework, researchers should place more emphasis on the CI as a plausible range for the treatment effect and less emphasis on whether the result is “significant” at some prespecified threshold. Reporting a point estimate and 95% CI for the treatment effect as an absolute risk difference gives an intuitive feel for the effect size and aids in interpreting the result. For randomized trials where the outcome is binary, reporting both the absolute and relative treatment effect (95% CI) is recommended in the Consolidated Standards of Reporting Trials (CONSORT) guidelines.44

Second, reporting the FPR (when the result is significant) or the FNR (when the result is not significant) would help calibrate readers to the likely presence or absence of a real treatment effect and reduce the chance that the trial result will be misinterpreted. When estimating the FPR or FNR, we advocate using a BF with a neutral prior, such as the method of Kass and Vaidyanathan.