Abstract
Randomized controlled trials are one of the best ways of quantifying the effectiveness of medical interventions. Therefore, when the authors of a randomized superiority trial report that differences in the primary outcome between the intervention group and the control group are “significant” (i.e., P ≤ 0.05), we might assume that the intervention has an effect on the outcome. Similarly, when differences between the groups are “not significant,” we might assume that the intervention does not have an effect on the outcome. Nevertheless, both assumptions are frequently incorrect.
In this article, we explore the relationship that exists between real treatment effects and declarations of statistical significance based on P values and confidence intervals. We explain why, in some circumstances, the chance an intervention is ineffective when P ≤ 0.05 exceeds 25% and the chance an intervention is effective when P > 0.05 exceeds 50%.
Over the last decade, there has been increasing interest in Bayesian methods as an alternative to frequentist hypothesis testing. We provide a robust but nontechnical introduction to Bayesian inference and explain why a Bayesian posterior distribution overcomes many of the problems associated with frequentist hypothesis testing.
Notwithstanding the current interest in Bayesian methods, frequentist hypothesis testing remains the default method for statistical inference in medical research. Therefore, we propose an interim solution to the “significance problem” based on simplified Bayesian metrics (e.g., Bayes factor, false positive risk) that can be reported along with traditional P values and confidence intervals. We calculate these metrics for four well-known multicentre trials. We provide links to online calculators so readers can easily estimate these metrics for published trials. In this way, we hope decisions on incorporating the results of randomized trials into clinical practice can be enhanced, minimizing the chance that useful treatments are discarded or that ineffective treatments are adopted.
Statistical inference is the process of analyzing samples to infer characteristics about the populations from which the samples are drawn. Statistical methods are based on the laws of chance. Different frameworks exist for analyzing and reporting data. In medicine—as indeed across all scientific disciplines—a frequentist hypothesis framework predominates, with declarations of “significant” and “not significant” based on P values and confidence intervals (CIs). Nevertheless, P values and CIs are widely misunderstood and declarations of significance are frequently misleading. Consequently, clinical trials may be misinterpreted, with the potential for useful interventions to be discarded and for ineffective or harmful interventions to be adopted.1
Consider the LOVIT randomized trial, which compared high-dose vitamin C with placebo in patients with sepsis.2 Based on weak prior evidence, the authors postulated that vitamin C would be superior to placebo. Surprisingly, the authors observed a higher rate of death or persisting organ dysfunction (the composite primary outcome) in the vitamin C group (44.5%) than in the control group (38.5%). Under the frequentist framework, the result was statistically significant (relative risk [RR], 1.21; 95% CI, 1.04 to 1.40; P = 0.01). As we discuss later, a statistically significant result is equivalent to declaring that the intervention has an effect on the outcome. Therefore, we might reasonably ask, how likely is it that in patients with sepsis, high-dose vitamin C increases the risk of death or organ dysfunction? Unfortunately, P values and CIs provide no direct answer to the question. In a previous paper, one of us (D. S.) argued that the LOVIT trial provides little evidence for or against the existence of a real effect of high-dose vitamin C.3 Indeed, many significant results provide little evidence for real treatment effects.4,5
The precarious relationship that can exist between significant results and real effects is highlighted by a 2018 study by Silberzahn et al., in which 29 groups of researchers were asked to analyze the same data set to answer a simple research question: whether soccer referees are more likely to give red cards to dark skin-toned players than light skin-toned players.6 Twenty groups reported a significant association between skin tone and red cards and nine did not. In all, 21 unique models were used and the estimated effect sizes (odds ratios) varied between 0.89 and 2.93. While the real effect of skin tone on red cards is unknown, it is clear that identical data can lead to different outcomes from significance tests depending on the statistical model.
In 2016, the American Statistical Association published a statement on P values and statistical significance.7 The impetus for the statement was increasing concern about the misuse and misinterpretation of P values and the “numerous deep flaws” in frequentist hypothesis testing.8 There are three possible responses to the “significance problem.” First—and most obviously—we should better understand the limitations of frequentist hypothesis testing. A second—more radical—approach is to do away with hypothesis testing entirely and replace it with something different. A third approach is to quantify the relationship that exists between declarations of significance and the presence of real treatment effects. In this article, we discuss each approach in turn. Figure 1 summarizes the three approaches with reference to the LOVIT trial. Throughout the article, we refer to four multicentre randomized trials, two from the anesthesia literature (OPTIMISE, INPRESS) and two from the critical care literature (LOVIT, EOLIA) (Table 1).2,9,10,11 Each trial in Table 1 involves a control and an intervention group and reports the difference in a binary outcome, which is the most common framework for multicentre clinical trials in our specialties.
Part I: Frequentist hypothesis testing
There are two key ideas that underpin frequentist hypothesis testing. The first is that two competing explanations for the data are formulated: the null and the alternative hypothesis. The null hypothesis is that there is no effect of the intervention on the outcome. For a trial comparing event rates in a control and an intervention group, the null hypothesis (H0) is:
H0: θ1 − θ2 = 0

where θ1 and θ2 represent the event rates (proportions) in the populations. The alternative hypothesis (H1) is that there is an effect of the intervention:

H1: θ1 − θ2 ≠ 0
When framed in this way, one of the hypotheses must be true and one must be false. Notice that the null hypothesis is only true when the treatment effect in the population is precisely zero, which is termed a “point null.” While a treatment effect in the population of precisely zero is only rarely true (e.g., placebo vs placebo), it is a useful model under the frequentist framework. As the statistician George Box famously observed, “All models are wrong but some are useful.”12
The second key idea is that the data are used to quantify evidence against the null hypothesis. If the data are sufficiently improbable under a true null, the null hypothesis is rejected and the alternative hypothesis is accepted, which is equivalent to declaring the intervention has an effect on the outcome. The result is deemed “significant.” In medical research, “sufficiently improbable” has usually been defined as a P ≤ 0.05 (5%) or, equivalently, exclusion of the null result from the 95% CI. By contrast, if P > 0.05, the null hypothesis is not rejected and the result is deemed “not significant.” Not rejecting the null hypothesis is not the same as declaring the intervention has no effect.13 When the result is not significant, we can make no claims on the truth of the null hypothesis. Thus, with frequentist hypothesis testing, we quantify evidence against but not in favour of the null hypothesis.
The definition of a P value follows naturally from the preceding discussion. A test statistic (e.g., t value, χ2 statistic) is calculated from the data. A P value is the probability of obtaining a test statistic at least as large as that observed under a true null hypothesis (i.e., no real effect). Consequently, a P value is the probability of observing data at least as extreme as that observed given the null hypothesis is true. The key points are: 1) a P value is a probability relating to the data not the hypothesis and 2) a P value is calculated assuming the null hypothesis is true. It is a common misconception that a P value represents the probability the null hypothesis is true, which is not possible, as a P value is calculated assuming the null hypothesis is true.14
While more complicated to untangle, a CI also relates to the data—not the hypothesis. If an identical study was repeated many times (i.e., involving multiple sets of sample data) and on each occasion a 95% CI was calculated on the observed difference (e.g., θ1 − θ2), then we would expect 95% of those intervals to contain the true population difference. Because of sampling variability, the intervals will vary. This definition is different from “there is a probability of 0.95 (95%) that the true population difference lies within the interval,” which is incorrect. Because the interval is the random variable—not the population effect size, which has a fixed but unknown value—it is not appropriate to calculate probabilities for the hypothesis from a CI.
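The repeated-sampling definition of a CI can be checked with a short simulation. The sketch below (illustrative Python; the population event rates are hypothetical, with group sizes loosely modelled on EOLIA) generates many trials from known population rates, computes a Wald 95% CI for the risk difference each time, and counts how often the interval contains the true difference; the proportion should be close to 95%.

```python
import random
from math import sqrt

random.seed(1)

# Hypothetical true (population) event rates and EOLIA-sized groups
theta1, theta2 = 0.46, 0.35
n1, n2 = 125, 124
true_diff = theta1 - theta2

def simulate_binomial(n, p):
    """Number of events in n patients, each with event probability p."""
    return sum(random.random() < p for _ in range(n))

n_trials = 5_000
covered = 0
for _ in range(n_trials):
    p1 = simulate_binomial(n1, theta1) / n1
    p2 = simulate_binomial(n2, theta2) / n2
    # Wald 95% CI for the difference in proportions (1.96 = two-sided critical value)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    covered += (diff - 1.96 * se) <= true_diff <= (diff + 1.96 * se)

coverage = covered / n_trials
print(f"Proportion of 95% CIs containing the true difference: {coverage:.3f}")
```

Note that the 95% probability attaches to the long-run behaviour of the procedure, not to any single interval.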
Power and errors
If we observe P ≤ 0.05 but there is no effect of the intervention in the population, a type I error occurs (false positive). Nevertheless, when P ≤ 0.05, we do not know if the result is a true positive or a false positive. All we know is that if we repeated the same trial many times over, each with the same sample size, then over the long run, we expect type I errors to occur roughly 5% of the time. If we observe P > 0.05 but there is a real effect, a type II error occurs (false negative). Again, we have no way of knowing if the result is a false negative or a true negative.
Statistical power is the probability the test will be significant given a real effect exists (i.e., power is the probability of obtaining a true positive result). When a real effect exists, the test is either significant (true positive) or not significant (false negative). Therefore, power is 1 minus the type II error rate. Most trials report power of at least 0.8 (80%). Nevertheless, there is an Achilles heel to such claims: they are only valid if the treatment effect in the population is at least as large as that used when calculating the sample size. If the true effect size is smaller, the achieved power of the study will be less than the reported (design) power and the type II error rate will be higher than expected. In trials in anesthesia and critical care, overestimating the size of the treatment effect when calculating the sample size is extremely common.4,15,16,17 In fact, low power is endemic across medical research. In a recent analysis, van Zwet et al. estimated that the median achieved power among more than 20,000 randomized trials in the Cochrane Database of Systematic Reviews was only 0.13 (13%)!18 When power is low, the type II error rate is high, meaning few trials report significant results. This is the situation that exists for most multicentre trials in anesthesia and critical care.4,16
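The effect of overestimating the treatment effect on achieved power can be illustrated with a standard normal-approximation power calculation. The numbers below are hypothetical: a trial designed to detect a 15% absolute reduction from a 60% control event rate with roughly 80% power, when the true reduction is only 7%.

```python
from math import sqrt, erf

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_two_proportions(p1, p2, n_per_group):
    """Approximate power for a two-sided test of two proportions
    (simple normal approximation; alpha = 0.05)."""
    z_alpha = 1.96  # two-sided critical value for alpha = 0.05
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    return normal_cdf(abs(p1 - p2) / se - z_alpha)

# Hypothetical design: 60% control event rate, expecting a 15% absolute reduction
design_power = power_two_proportions(0.60, 0.45, 175)
# If the true reduction is only 7%, the achieved power is far lower
achieved_power = power_two_proportions(0.60, 0.53, 175)
print(f"design power ≈ {design_power:.2f}, achieved power ≈ {achieved_power:.2f}")
# → design power ≈ 0.81, achieved power ≈ 0.26
```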
The 5% significance threshold
Imagine we have some trial data and apply two statistical models. Model A gives a P value of 0.04 and model B gives a P value of 0.06. Which model is correct? Is the result a true positive, false positive, true negative, or false negative? The question is meaningless. Nothing magical happens at P = 0.05. In fact, the two models give similar results: assuming the treatment is ineffective, the probability of observing the effect size is about one in 20.
The 0.05 threshold was introduced by Ronald Fisher in his 1925 book Statistical Methods for Research Workers.19 One of the reasons Fisher chose 0.05 was that it was convenient for “ease of calculation at a time before computers rendered tables and approximations largely obsolete” given that it was “roughly two standard deviations away from the mean of a normally distributed random variable.”20 Fisher might well be surprised and alarmed that his 0.05 threshold has been afforded such importance in modern medical research.
The current hypothesis testing framework for frequentist inference is most unsatisfactory. We use an arbitrary threshold to declare the presence of a real effect based on a metric that has no direct relationship to the presence of a real effect. Could we analyze and report data in a way that avoids dichotomizing the results and uses a metric that quantifies something more useful than the probability of the test statistic under the null hypothesis? As it turns out, we can.
Part II: Bayesian inference
Bayesian inference has a simple conceptual framework. First, we consider what is already known about an intervention and express that knowledge as a probability distribution. Next, we do an experiment and obtain data. Finally, we use Bayes’ theorem to combine our prior knowledge with the data to obtain a posterior distribution for the thing we are interested in, which is usually the size of the treatment effect. This simple conceptual framework belies some complicated mathematical theory and advanced computational methods. But we will start simply. What is a probability distribution?
Probability distributions
Imagine the year is 2010 and you are part of the team of researchers planning the EOLIA trial, which was a multicentre randomized trial comparing extracorporeal membrane oxygenation (ECMO) with conventional treatment in patients with severe acute respiratory distress syndrome (ARDS).10 To determine the sample size for the trial, you wish to estimate the proportion of patients with severe ARDS who die with conventional treatment. Based on published data, you think the mortality rate is around 60%. You are uncertain of the exact value but are fairly confident it is between 50% and 70%. These beliefs can be represented as a probability distribution.
Probability distributions can be modelled using mathematical functions, each with particular properties. A property of all probability distributions is that the area under the curve (i.e., the sum of all probabilities) is 1. Proportions, which take values between 0 and 1, can be modelled with beta distributions. Beta distributions have two parameters, a and b, which determine the shape and position of the distribution. A beta distribution with a = 36.18 and b = 24.23 closely resembles the aforementioned beliefs about the mortality from severe ARDS (Fig. 2).
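As a quick check, we can draw samples from a Beta(36.18, 24.23) distribution (here with Python's standard library) and confirm it encodes the stated beliefs: a mean near 60%, with roughly 90% of the probability between 50% and 70%.

```python
import random

random.seed(7)

# Sample the Beta(36.18, 24.23) prior for the control-group mortality rate
a, b = 36.18, 24.23
draws = sorted(random.betavariate(a, b) for _ in range(100_000))

mean = sum(draws) / len(draws)
lo, hi = draws[5_000], draws[95_000]   # empirical central 90% interval
print(f"mean = {mean:.2f}; 90% of probability between {lo:.2f} and {hi:.2f}")
```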
Prior distributions
The probability distribution shown in Fig. 2 is a prior distribution (“prior”); it is an estimate of the distribution of the population event rate in the control group of the EOLIA trial before observing any data. For simplicity, and because you do not know if ECMO offers a survival benefit, you decide to use the same prior for the intervention (i.e., θ2). We now have two priors, one for θ1 and one for θ2. If we use a computer to draw random samples from each prior, and take the difference each time, we can obtain a prior for the treatment effect. The prior for the treatment effect described above is “neutral.” We are assuming only that θ1 and θ2 are broadly similar (i.e., dependent) but not presupposing ECMO is beneficial.
We could have chosen many different priors for θ1 and θ2. For instance, we might have chosen priors where mortality was on average 10% less for θ2 than θ1. The resulting prior for the treatment effect would be “enthusiastic”—we are assuming ECMO reduces mortality compared with standard care by about 10%. Alternatively, if we had no inkling as to the mortality rate from severe ARDS or ECMO, we might have decided that any possible event rate was equally likely and assigned uniform distributions to θ1 and θ2. The resulting prior for the treatment effect would then be “uninformed.” Figure 3 shows examples of uninformed, neutral, and enthusiastic priors for the treatment effect of ECMO in severe ARDS.
The “strength” of a prior refers to the degree of uncertainty around the true value and is reflected in the variance (i.e., spread) of the distribution. A low variance indicates high strength, and vice versa. Notice from Fig. 3 that the spread of the uninformed prior is greater than the spread of the neutral and enthusiastic priors. Thus, the neutral and enthusiastic priors are “stronger” than the uninformed prior.
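The sampling procedure described above for building a prior for the treatment effect can be sketched directly (illustrative Python, drawing from the neutral Beta(36.18, 24.23) prior for both groups and taking the difference each time):

```python
import random

random.seed(3)

# Neutral prior: the same Beta(36.18, 24.23) distribution for both groups
a, b = 36.18, 24.23
diffs = sorted(
    random.betavariate(a, b) - random.betavariate(a, b) for _ in range(100_000)
)

# The implied prior for the treatment effect is centred on zero
mean_diff = sum(diffs) / len(diffs)
lo, hi = diffs[2_500], diffs[97_500]   # empirical central 95% interval
print(f"prior treatment effect: centred at {mean_diff:.3f}, "
      f"95% of mass between {lo:.2f} and {hi:.2f}")
```

Repeating this with uniform distributions for θ1 and θ2 yields the much wider “uninformed” prior for the treatment effect shown in Fig. 3.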
The likelihood function—incorporating our experimental data
A likelihood function (“likelihood”) is a distribution representing the probability of the data for each possible value for the parameter of interest. In the EOLIA trial, there were 57/125 (46%) deaths in the control group and 44/124 (35%) deaths in the ECMO group. The likelihood for the control group tells us how likely we are to observe 57 deaths in 125 patients given different population values for θ1. Similarly, the likelihood for the intervention group tells us how likely we are to observe 44 deaths in 124 patients given different population values for θ2. Figure 4 shows the likelihoods for the control group, intervention group, and treatment effect for the EOLIA data.
Posterior distributions
A posterior distribution (“posterior”) is a probability distribution for the population value for the parameter of interest based on the prior (what we knew before collecting data) and the likelihood (what we learn from the data). The posterior is calculated using Bayes’ theorem, which in words can be expressed as:

posterior probability ∝ prior probability × likelihood
Strong priors have a greater impact on the posterior than weak priors do. Figure 5 shows posterior distributions for the treatment effect of ECMO, based on the EOLIA likelihood and three different priors (uninformed, neutral, enthusiastic).
Interpreting posterior distributions
We can summarize the posterior in several ways. Commonly, researchers report the treatment effect incorporating 95% of the probability density, termed a 95% credible interval. The 95% credible interval for the EOLIA data based on our neutral prior is −0.04 to 0.17, meaning there is a 95% chance that the absolute mortality reduction associated with ECMO is between −4% and 17%. The peak of the posterior distribution is an absolute mortality reduction of about 6.5%, which, in this case, lies a little closer to the peak of the likelihood (11%) than the prior (0%).
Notice from Table 1 that the 95% credible intervals for each of the four trials include an absolute risk difference of 0%, which is the null value. With frequentist hypothesis testing, inclusion of the null value in the 95% CI means the result would be classified as not significant. Nevertheless, with a Bayesian framework, inclusion of the null value in the tail of the 95% credible interval is not of critical importance.
In addition to a credible interval, we might report the probability that the treatment effect exceeds a minimum clinically important difference (MCID).21 For the trials in Table 1, let us assume that a 5% lower event rate for the intervention group represents a MCID—a value at which most clinicians would adopt the intervention. For the statistically significant INPRESS trial, the probability the treatment effect exceeds our hypothetical clinically important difference is 0.8 (80%). For the two trials that were not significant (EOLIA and OPTIMISE), the probability the treatment effect is clinically important is 0.62 (62%) and 0.56 (56%), respectively. Thus, for all three trials (INPRESS, EOLIA, and OPTIMISE) there is a reasonable chance the treatment effect exceeds our hypothetical MCID. In the LOVIT trial, given the observed effect of vitamin C was for harm, the probability that vitamin C has a beneficial effect that exceeds our 5% threshold is very low.
Under the frequentist framework, the not significant result from the EOLIA trial led the authors to conclude, “Among patients with very severe ARDS, 60-day mortality was not significantly lower with ECMO …”10 By contrast, our Bayesian interpretation tells us that more likely than not ECMO does reduce mortality in patients with severe ARDS, with an effect size that might be considered clinically important.
Credible vs confidence intervals
Both a frequentist CI and a Bayesian credible interval provide a plausible range for the treatment effect. Therefore, we might ask, do we not obtain the same information from both? Well, not quite. Because a credible interval is derived from the posterior probability distribution of the treatment effect given the data, it is straightforward to calculate probabilities for the treatment effect; for instance, the probability that the treatment effect exceeds a MCID. As noted above, it is not appropriate to infer such probabilities from a CI. A CI is based entirely on the sample data, which in the case of the EOLIA trial is the 249 participants who took part in the trial. By contrast, a credible interval incorporates prior knowledge. A full Bayesian analysis typically involves calculating posterior distributions using priors of varying enthusiasm. For instance, in 2018, Goligher et al. did a Bayesian reanalysis of the EOLIA data using multiple priors.22 In this way, a reader can see how sensitive the posterior is to different priors and interpret the result accordingly. Finally, under the frequentist model, the main use of a CI is to determine statistical significance. A more useful interpretation of a CI is as the plausible range for the treatment effect; nevertheless, this aspect is frequently ignored. For instance, in a selection of ten multicentre trials in anesthesia reporting nonsignificance for the primary outcome, only three drew attention to the uncertainty associated with the observed treatment effect.23
Part III: Simplified Bayesian metrics for interpreting declarations of “significant” and “not significant”
In the preceding section, we used Bayesian inference to estimate an unknown parameter—the size of the treatment effect for ECMO in patients with severe ARDS—by estimating a posterior distribution. A Bayesian posterior distribution provides a complete picture of the data informed by prior knowledge.24 Nevertheless, we can also use Bayesian methods to compare two models for the data—in our case the relative merits of the null and the alternative hypothesis. For model comparison, we can use a Bayes factor (BF) or a likelihood ratio (LR).
Bayes factors
A BF is the ratio of the probability of the data (e.g., the observed values for θ1 and θ2) under two competing models. For the null (H0) and alternative (H1) hypotheses, we can define the BF favouring the alternative hypothesis as:

BF1:0 = P(data|H1)/P(data|H0) (Equation 4)
The right side of Equation 4 is the ratio of two conditional probabilities. A conditional probability arises when the chance of an event occurring depends on another event being true. The upper-case “P” means “the probability that” and the symbol “|” means “given that.” So, P(data|H1) means the probability of the data given the alternative hypothesis is true. On the left side of Equation 4, the term “BF1:0” means “the BF favouring the alternative hypothesis relative to the null hypothesis.” A BF1:0 of 5 means the data are five times more likely under the alternative than the null hypothesis. The reciprocal of BF1:0 is the BF favouring the null over the alternative, denoted BF0:1. Therefore, if BF1:0 is 5, then BF0:1 is 1/5th (0.2). A BF is always positive and has no upper bound. Unlike a P value, a BF quantifies the evidence in favour of both the null (BF0:1) and the alternative (BF1:0) hypotheses. The strength of evidence provided by a BF can be classified using a qualitative scale (Table 2).
Calculating BFs involves solving complicated integrals. For instance, evaluating P(data|H1) involves integrating (i.e., averaging) the probability of the data across all combinations for the parameters (θ1, θ2) that are possible under the alternative hypothesis, weighted by their prior probabilities.25,26 In practice, BFs are estimated using mathematical approximations to the integrals or other clever workarounds.26,27,28 The phrase “weighted by their prior probabilities” is important, as it means we must specify prior distributions for the parameters.
Several methods are available for calculating a BF for the equivalence of two proportions.29 Two popular approaches are the methods developed by Gunel and Dickey (G&D)30,31 and Kass and Vaidyanathan (K&V).32,33,34 As a default, the G&D method assigns uniform priors to θ1 and θ2. By contrast, the K&V method assumes an inherent dependency between θ1 and θ2 with the effect that the prior for the treatment effect is neutral.32 We see from Table 1 that the BF1:0 values calculated using the G&D and K&V methods for the EOLIA trial are 0.57 and 0.87, respectively. Since in both cases BF1:0 < 1, the data slightly favour the null over the alternative hypothesis, but the strength of evidence is weak (Table 2).
Likelihood ratios
LRs are another method of comparing the relative evidence for the alternative and null hypotheses. Again, we can use the notation LR1:0 and LR0:1 to represent the evidence favouring the alternative and null hypotheses. Unlike a BF, which compares the probability of the data under the point null (i.e., θ1 − θ2 = 0) to an average over all possible values for the alternative hypothesis (i.e., θ1 − θ2 ≠ 0), an LR compares the point null hypothesis to a single (point) value for the alternative hypothesis (e.g., θ1 − θ2 = 0.1). Since we are only considering a single value for the alternative hypothesis, there is no requirement to specify priors for the parameters. Likelihood ratios provide an intuitive feel for the relative merits of the two hypotheses but are highly dependent on the specific value chosen for the alternative hypothesis.
An obvious choice for the alternative hypothesis is the observed treatment effect. The observed treatment effect is the effect size that is best supported by the data.25,35 Using the observed treatment effect produces the highest possible value for LR1:0, which is termed the maximum likelihood ratio (MLR1:0).25,35 A MLR1:0 can be calculated directly from the data.35 While not strictly a MLR, an upper bound for LR1:0 across a large class of reasonable alternative hypotheses can also be calculated directly from the P value.36,37,38,39 Notice from Table 1 that the values for MLR1:0 are substantially larger than the equivalent values for BF1:0.
Another choice for the alternative hypothesis is to use the expected treatment effect used by the researchers when they calculated the sample size.35 There are two issues with this approach. First, we know that the expected treatment effects are typically implausibly large.4,15,16 Second, if the observed treatment effect is in the opposite direction to the expected treatment effect, the LR1:0 will be implausibly small. For example, for the LOVIT data, the LR1:0 calculated using the expected treatment effect is zero. Together, these two issues raise a very important point about being circumspect about the expected treatment effect and its direction when designing clinical trials, but we do not consider this further here.
One of us (P. M. J.) has developed an online calculator for the LR, which may be used for continuous and binary outcome measures using different values for the alternative hypothesis, available at URL: https://medresearch.shinyapps.io/Bayesian_re-analysis/ (accessed June 2023).
Probabilities for the hypothesis: the false positive and false negative risk
We can use a BF or the LR to estimate probabilities for the hypothesis.5,35,37 We can define four probabilities depending on whether the null or the alternative hypothesis is true and whether the test is significant or not significant. We will concentrate on two: the false positive risk (FPR) and the false negative risk (FNR). The FPR is the probability the null hypothesis is true given the test result is significant and the FNR is the probability the alternative hypothesis is true given the test result is not significant (Table 3).
If we a priori assume that the null and alternative hypothesis each have a 50% chance of being true, we can calculate the FPR and FNR using Bayes’ theorem:4,5

FPR = 1/(1 + BF1:0)

FNR = 1/(1 + BF0:1)
When using an LR, we substitute LR1:0 or LR0:1 for the analogous BF.
Given the preceding discussion, an obvious question arises: which metric (LR or BF) should we use to estimate the FPR and FNR?
When using Bayes’ theorem to calculate probabilities for the hypothesis, we assume that either the null or the alternative hypothesis is true. Since a BF compares the probability of the data under a point null to an averaged value for all possible values for the alternative hypothesis, this assumption is satisfied. Nevertheless, when we use a LR—which compares two specific (point) hypotheses—the assumption is questionable, as all other values for the alternative hypothesis are assigned a probability of zero. For this reason, using a BF may be more valid than using a LR when the purpose is estimating the FPR or FNR.
If we use a BF, we must choose priors for θ1 and θ2. So far, we have described two methods for calculating BFs for two proportions: the G&D method, which as a default assigns an uninformed prior for the treatment effect, and the K&V method, which results in a neutral prior. In fact, there is another widely used technique called the Savage–Dickey method, whereby priors are specified directly by the user.26,28
Due to the inherent dependency that exists between θ1 and θ2, a neutral prior for the treatment effect is more realistic than an uninformed prior. For instance, in the EOLIA trial, the expected population mortality rate for the control group was 60%. Consequently, the expected population mortality rate for the ECMO group is more likely to be closer to 60% than to (say) 1% or 99%. A neutral prior assigns relatively more probability density to plausible values for the alternative hypothesis compared with an uninformed prior (compare the top right and middle right panels of Fig. 3). Consequently, a neutral prior provides relatively more support for the alternative hypothesis than an uninformed prior, meaning the BF1:0 tends to be larger and the BF0:1 smaller. Therefore, neutral priors tend to result in lower values for the FPR and higher values for the FNR. These effects can be seen in Table 1 by comparing the BFs calculated using the G&D (uninformed prior) and the K&V (neutral prior) methods; there, the differences are small. Nevertheless, when the event rates are very high or low and the P value is close to the significance threshold, the differences between the two methods are substantial.32
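The intuition behind a neutral prior can be sketched numerically. The snippet below places a standard normal prior on the log odds ratio (the scale used in the K&V approach; the unit standard deviation is our assumption, not a value from the trial) and compares the prior density at a plausible and an implausible ECMO-group mortality rate, given a 60% control-group rate:

```python
import math

def logit(p: float) -> float:
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))

def normal_pdf(x: float, sd: float = 1.0) -> float:
    """Density of a Normal(0, sd) distribution at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

theta1 = 0.60                      # expected control-group mortality (EOLIA design)
for theta2 in (0.55, 0.60, 0.99):  # candidate ECMO-group mortality rates
    log_or = logit(theta2) - logit(theta1)
    print(theta2, normal_pdf(log_or))
```

An ECMO-group rate of 55% receives thousands of times more prior density than 99%, whereas an uninformed prior would treat both as equally plausible; this concentration of density on realistic effects is why neutral priors lend relatively more support to the alternative hypothesis.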
Given neutral priors account for the inherent dependency between θ1 and θ2, the K&V method may be a better choice than the G&D for estimating the FPR and FNR. One of us (C. J. B.) has developed an online calculator for estimating the FPR and FNR for trials reporting binary outcomes using either the K&V or G&D methods, available at URL: https://cjdbarlow.github.io/bayes-trial-eval.html (accessed June 2023).
The false positive and false negative risk versus the type I and type II error rates
Figure 6 shows the relationship between the P value and the FPR (Fig. 6A) and FNR (Fig. 6B). Data were obtained using simulated statistical testing. P values were calculated using a Fisher exact test and BFs were calculated using the K&V method. The striking feature of Fig. 6 is that, for P values close to the significance threshold, the FPR approaches 30% and the FNR exceeds 60%. Figure 6 highlights the poor relationship that exists between real effects and the outcome from the hypothesis test for P values close to the significance threshold. By contrast, for P values well displaced from the significance threshold, the relationship is more robust. For instance, when P < 0.005 the FPR is below 5% and when P > 0.5 the FNR is below 30%.
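The flavour of this simulation is easy to reproduce. The sketch below is our own toy version, not the simulation used for Fig. 6: the sample size, event rates, and the 50:50 mix of true null and true alternative hypotheses are all assumptions. It estimates the aggregate FPR and FNR by counting how often a significant Fisher exact test arises from a trial where the null is actually true:

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(1)
n = 200                          # patients per arm (assumed)
p_control, p_treat = 0.50, 0.35  # event rates under the alternative (assumed)
null_true, sig = [], []

for _ in range(2000):
    h0 = rng.random() < 0.5      # 50:50 prior chance the null is true
    p2 = p_control if h0 else p_treat
    a = rng.binomial(n, p_control)              # events, control arm
    b = rng.binomial(n, p2)                     # events, treatment arm
    _, p = fisher_exact([[a, n - a], [b, n - b]])
    null_true.append(h0)
    sig.append(p <= 0.05)

null_true, sig = np.array(null_true), np.array(sig)
fpr = (null_true & sig).sum() / sig.sum()       # P(H0 | significant)
fnr = (~null_true & ~sig).sum() / (~sig).sum()  # P(H1 | not significant)
print(f"FPR = {fpr:.2f}, FNR = {fnr:.2f}")
```

The aggregate FPR here is modest because most significant P values fall far below 0.05; conditioning on P values close to the threshold, as in Fig. 6A, yields the much higher values described above.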
Some readers may be surprised and alarmed to see such high values for the FPR and FNR for P values close to the significance threshold. Furthermore, it may not be obvious how it is possible to have an FPR of 30% when the type I error rate is only 5% (and an FNR of 60% when the type II error rate is 20%). The explanation for this apparent discrepancy is that the FPR and the type I error rate (and the FNR and type II error rate) represent different conditional probabilities (Table 3). The FPR and FNR are probabilities for the truth of the hypothesis conditional on the outcome from the significance test, whereas the error rates are probabilities for the outcome of the significance test conditional on the truth of the hypothesis. That the FPR is not < 5% when P < 0.05 reflects a common misunderstanding about the definition of P values and the meaning of the type I error rate.
Tackling the significance problem
One of the goals in writing this article was to highlight how misleading declarations of “significant” and “not significant” are when P is close to the default threshold of 0.05 or the null value lies near the tail of the CI. The issue is particularly relevant in anesthesia and intensive care, as most significant results from multicentre trials have P values close to 0.05.4,16
An obvious solution to the problem of dichotomizing results as significant and not significant is to replace frequentist hypothesis testing with Bayesian inference as the default method of analyzing and reporting data from randomized trials. A Bayesian posterior distribution provides a complete picture of the data informed by prior knowledge. Priors can be updated as new evidence is published. In fact, Bayesian methods have a broad range of applications within statistics, including calculating and updating sample sizes, determining stopping rules for randomized trials, and combining data from multiple studies in meta-analyses.40,41,42,43
Nevertheless, transitioning from the frequentist hypothesis testing framework—which is firmly established and has dominated medical research for the last 100 years—will not happen overnight. While Bayesian posterior distributions have an intuitive interpretation, the underlying probability theory can be intimidating. We have presented a gentle introduction to Bayesian parameter estimation and largely avoided discussing mathematical theory. We have not addressed the main method of estimating posterior distributions, Markov Chain Monte Carlo, which requires knowledge of code-based statistical packages, such as R (R Foundation for Statistical Computing, Vienna, Austria) or Stan (Stan Development Team, New York, NY, USA). We have also not fully addressed the details of choosing prior distributions, with which many researchers are unfamiliar.
Therefore, despite increasing interest in Bayesian methods, frequentist hypothesis testing is likely to remain the dominant framework for statistical inference for some time to come. As an interim measure—while we await the Bayesian revolution—we propose two solutions to the significance problem. First, when reporting the results of randomized trials under a frequentist hypothesis framework, researchers should place more emphasis on the CI as a plausible range for the treatment effect and less emphasis on whether the result is “significant” at some prespecified threshold. Reporting a point estimate and 95% CI for the treatment effect as an absolute risk difference gives an intuitive feel for the effect size and aids in interpreting the result. For randomized trials where the outcome is binary, reporting both the absolute and relative treatment effect (95% CI) is recommended in the Consolidated Standards of Reporting Trials (CONSORT) guidelines.44
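As a concrete illustration of the absolute-effect reporting recommended above, the sketch below computes an absolute risk difference with a simple Wald 95% CI. The Wald interval and the example counts are our assumptions for illustration; trial reports often use more refined interval methods:

```python
import math

def risk_difference_ci(e_t: int, n_t: int, e_c: int, n_c: int, z: float = 1.96):
    """Absolute risk difference (treatment minus control) with a Wald 95% CI."""
    p_t, p_c = e_t / n_t, e_c / n_c
    rd = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return rd, rd - z * se, rd + z * se

# Hypothetical counts: 70/200 events with treatment vs 90/200 with control.
rd, lo, hi = risk_difference_ci(70, 200, 90, 200)
print(f"ARD {rd:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

Reported this way—an absolute difference of 10 percentage points with its interval—the result conveys the plausible range of effect sizes directly, rather than a bare verdict of "significant."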
Second, reporting the FPR (when the result is significant) or the FNR (when the result is not significant) would help calibrate readers to the likely presence or absence of a real treatment effect and reduce the chance that the trial result will be misinterpreted. When estimating the FPR or FNR, we advocate using a BF with a neutral prior, such as the method of Kass and Vaidyanathan.
References
Sidebotham D. Are most randomised trials in anaesthesia and critical care wrong? An analysis using Bayes’ Theorem. Anaesthesia 2020; 75: 1386–93. https://doi.org/10.1111/anae.15029
Lamontagne F, Masse MH, Menard J, et al. Intravenous vitamin C in adults with sepsis in the intensive care unit. N Engl J Med 2022; 386: 2387–98. https://doi.org/10.1056/nejmoa2200644
Sidebotham D. Fooled by significance testing: an analysis of the LOVIT vitamin C trial. J Extra Corpor Technol 2022; 53: 324–9. https://doi.org/10.1182/ject-2200030
Seretny S, Barlow CJ, Sidebotham D. Multicentre randomised trials in anaesthesia: an analysis using Bayesian metrics. Anaesthesia 2022; 78: 73–80. https://doi.org/10.1111/anae.15867
Sidebotham D, Barlow CJ. The false-positive and false-negative risks for individual multicentre trials in critical care. BJA Open 2022; 1: 100003. https://doi.org/10.1016/j.bjao.2022.100003
Silberzahn R, Uhlmann EL, Martin DP, et al. Many analysts, one data set: making transparent how variations in analytic choices affect results. Adv Meth Pract Psychol Sci 2018; 1: 337–56. https://doi.org/10.1177/2515245917747646
Wasserstein RL, Lazar NA. The ASA statement on p-values: context, process, and purpose. Am Stat 2016; 70: 129–33. https://doi.org/10.1080/00031305.2016.1154108
Cumming G. The problem with p values: how significant are they, really? 2013. Available from URL: http://phys.org/wire-news/145707973/the-problem-with-p-values-how-significant-are-they-really.html (accessed April 2023).
Futier E, Lefrant JY, Guinot PG, et al. Effect of individualized vs standard blood pressure management strategies on postoperative organ dysfunction among high-risk patients undergoing major surgery: a randomized clinical trial. JAMA 2017; 318: 1346–57. https://doi.org/10.1001/jama.2017.14172
Combes A, Hajage D, Capellier G, et al. Extracorporeal membrane oxygenation for severe acute respiratory distress syndrome. N Engl J Med 2018; 378: 1965–75. https://doi.org/10.1056/nejmoa1800385
Pearse RM, Harrison DA, MacDonald N, et al. Effect of a perioperative, cardiac output-guided hemodynamic therapy algorithm on outcomes following major gastrointestinal surgery: a randomized clinical trial and systematic review. JAMA 2014; 311: 2181–90. https://doi.org/10.1001/jama.2014.5305
Box GE. Science and statistics. J Am Stat Assoc 1976; 71: 791–9.
Alderson P. Absence of evidence is not evidence of absence. BMJ 2004; 328: 476–7. https://doi.org/10.1136/bmj.328.7438.476
Sidebotham D. Understanding significance testing. Anaesthesia 2021; 76: 1659–64. https://doi.org/10.1111/anae.15591
Aberegg SK, Richards DR, O'Brien JM. Delta inflation: a bias in the design of randomized controlled trials in critical care medicine. Crit Care 2010; 14: R77. https://doi.org/10.1186/cc8990
Sidebotham D, Popovich I, Lumley T. A Bayesian analysis of mortality outcomes in multicentre clinical trials in critical care. Br J Anaesth 2021; 127: 487–94. https://doi.org/10.1016/j.bja.2021.06.026
Chow JT, Turkstra TP, Yim E, Jones PM. Sample size calculations for randomized clinical trials published in anesthesiology journals: a comparison of 2010 versus 2016. Can J Anaesth 2018; 65: 611–8. https://doi.org/10.1007/s12630-018-1109-z
van Zwet E, Schwab S, Senn S. The statistical properties of RCTs and a proposal for shrinkage. Stat Med 2021; 40: 6107–17. https://doi.org/10.1002/sim.9173
Fisher RA. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd; 1925.
Kennedy-Shaffer L. Before p < 0.05 to beyond p < 0.05: using history to contextualize p-values and significance testing. Am Stat 2019; 73: 82–90. https://doi.org/10.1080/00031305.2018.1537891
Yarnell CJ, Abrams D, Baldwin MR, et al. Clinical trials in critical care: can a Bayesian approach enhance clinical and scientific decision making? Lancet Respir Med 2021; 9: 207–16. https://doi.org/10.1016/s2213-2600(20)30471-9
Goligher EC, Tomlinson G, Hajage D, et al. Extracorporeal membrane oxygenation for severe acute respiratory distress syndrome and posterior probability of mortality benefit in a post hoc bayesian analysis of a randomized clinical trial. JAMA 2018; 320: 2251–9. https://doi.org/10.1001/jama.2018.14276
Gibbs NM, Weightman WM. Beta errors in anaesthesia randomised controlled trials in which no statistical significance is found: is there an elephant in the room? Anaesth Intensive Care 2022; 50: 153–8. https://doi.org/10.1177/0310057x221086590
Carlisle JB. The credibility plot for extreme explanations and all explanations in between. Anaesthesia 2022; 78: 17–22. https://doi.org/10.1111/anae.15944
Goodman SN. Toward evidence-based medical statistics. 2: the Bayes factor. Ann Intern Med 1999; 130: 1005–13. https://doi.org/10.7326/0003-4819-130-12-199906150-00019
Stefan AM, Gronau QF, Schönbrodt FD, Wagenmakers EJ. A tutorial on Bayes factor design analysis using an informed prior. Behav Res Methods 2019; 51: 1042–58. https://doi.org/10.3758/s13428-018-01189-8
van Ravenzwaaij D, Etz A. Simulation studies as a tool to understand Bayes factors. Adv Meth Pract Psychol Sci 2021; 4. https://doi.org/10.1177/2515245920972624
Wagenmakers EJ, Lodewyckx T, Kuriyal H, Grasman R. Bayesian hypothesis testing for psychologists: a tutorial on the Savage–Dickey method. Cogn Psychol 2010; 60: 158–89. https://doi.org/10.1016/j.cogpsych.2009.12.001
Held L, Otto M. On p-values and Bayes factors. Annu Rev Stat Appl 2018; 5: 393–419. https://doi.org/10.1146/annurev-statistics-031017-100307
Gûnel E, Dickey J. Bayes factors for independence in contingency tables. Biometrika 1974; 61: 545–57. https://doi.org/10.1093/biomet/61.3.545
Jamil T, Ly A, Morey RD, Love J, Marsman M, Wagenmakers EJ. Default "Gunel and Dickey" Bayes factors for contingency tables. Behav Res Methods 2017; 49: 638–52. https://doi.org/10.3758/s13428-016-0739-8
Dablander F, Huth K, Gronau QF, Etz A, Wagenmakers EJ. A puzzle of proportions: two popular Bayesian tests can yield dramatically different conclusions. Stat Med 2022; 41: 1319–33. https://doi.org/10.1002/sim.9278
Gronau QF, Akashi YJ, Wagenmakers EJ. Informed Bayesian inference for the A/B test. J Stat Softw 2021; 100: 1–39. https://doi.org/10.48550/arXiv.1905.02068
Kass RE, Vaidyanathan SK. Approximate Bayes factors and orthogonal parameters, with application for testing equality of two proportions. J R Stat Soc B 1992; 54: 129–44. https://doi.org/10.1111/j.2517-6161.1992.tb01868.x
Perneger TV. How to use likelihood ratios to interpret evidence from randomized trials. J Clin Epidemiol 2021; 136: 235–42. https://doi.org/10.1016/j.jclinepi.2021.04.010
Chuang Z, Martin J, Shapiro J, Nguyen D, Neocleous P, Jones PM. Minimum false-positive risk of primary outcomes and impact of reducing nominal P-value threshold from 0.05 to 0.005 in anaesthesiology randomised clinical trials: a cross-sectional study. Br J Anaesth 2023; 130: 412–20. https://doi.org/10.1016/j.bja.2022.11.001
Colquhoun D. The false positive risk: a proposal concerning what to do about p-values. Am Stat 2019; 73: 192–201. https://doi.org/10.1080/00031305.2018.1529622
Sellke T, Bayarri MJ, Berger JO. Calibration of p values for testing precise null hypotheses. Am Stat 2001; 55: 62–71. https://doi.org/10.1198/000313001300339950
Benjamin DJ, Berger JO. Three recommendations for improving the use of p-values. Am Stat 2019; 73: 186–91. https://doi.org/10.1080/00031305.2018.1543135
Wilson KJ, Williamson SF, Allen AJ, Williams CJ, Hellyer TP, Lendrem BC. Bayesian sample size determination for diagnostic accuracy studies. Stat Med 2022; 41: 2908–22. https://doi.org/10.1002/sim.9393
Moerbeek M. Bayesian updating: increasing sample size during the course of a study. BMC Med Res Methodol 2021; 21: 137. https://doi.org/10.1186/s12874-021-01334-6
Parmar MK, Griffiths GO, Spiegelhalter DJ, et al. Monitoring of large randomised clinical trials: a new approach with Bayesian methods. Lancet 2001; 358: 375–81. https://doi.org/10.1016/s0140-6736(01)05558-1
Gronau QF, Heck DW, Berkhout SW, Haaf JM, Wagenmakers EJ. A primer on Bayesian model-averaged meta-analysis. Adv Meth Pract Psychol Sci 2021; 4: 1–19. https://doi.org/10.1177/25152459211031256
Butcher NJ, Monsour A, Mew EJ, et al. Guidelines for reporting outcomes in trial reports: the CONSORT-outcomes 2022 extension. JAMA 2022; 328: 2252–64. https://doi.org/10.1001/jama.2022.21022
Author contributions
David Sidebotham contributed to all aspects of this manuscript, including conception and design; acquisition, analysis, and interpretation of data; and drafting and editing the article. C. Jake Barlow contributed to the analysis and interpretation of data, development of the simulations for the figures, development of the online calculator for Bayes factors and the false positive risk, and editing the article. Janet Martin contributed to the analysis and interpretation of data and editing the article. Philip M. Jones contributed to the analysis and interpretation of data, development of the online calculator for likelihood ratios, and editing the article.
Disclosures
Philip M. Jones is Deputy Editor-in-Chief of the Canadian Journal of Anesthesia/Journal canadien d’anesthésie. He had no involvement in the handling of this manuscript. The authors declare no other conflicts of interest.
Funding statement
Open Access funding enabled and organized by CAUL and its Member Institutions.
Editorial responsibility
This submission was handled by Dr. Stephan K. W. Schwarz, Editor-in-Chief, Canadian Journal of Anesthesia/Journal canadien d’anesthésie.
Cite this article
Sidebotham, D., Barlow, C.J., Martin, J. et al. Interpreting frequentist hypothesis tests: insights from Bayesian inference. Can J Anesth/J Can Anesth 70, 1560–1575 (2023). https://doi.org/10.1007/s12630-023-02557-5