In frequentist statistics, there is an intimate connection between the p value null-hypothesis significance test and the confidence interval of the test-relevant parameter. Specifically, a \([100 \times (1-\alpha )]\)% confidence interval contains only those parameter values that would not be rejected if they were subjected to a null-hypothesis test with level \(\alpha\). That is, frequentist confidence intervals can often be constructed by inverting a null-hypothesis significance test (e.g., Natrella 1960; Stuart et al. 1999, p. 175). Thus, the construction of the confidence interval involves, at a conceptual level, the computation of p values.

In Bayesian statistics, in contrast, there exists a conceptual divide between the Bayes factor hypothesis test and the credible interval. On the one hand, the Bayes factor (e.g., Etz and Wagenmakers 2017; Haldane 1932; Jeffreys 1939; Kass et al. 1995; Wrinch and Jeffreys 1921) reflects the relative predictive adequacy of two competing models or hypotheses, say \({\mathcal {H}}_0\) (which postulates the absence of the test-relevant parameter) and \({\mathcal {H}}_1\) (which postulates the presence of the test-relevant parameter). On the other hand, under the assumption that \({\mathcal {H}}_1\) is true, the associated credible interval for the test-relevant parameter provides a range that contains 95% of the posterior mass. In other words, the Bayes factor test seeks to quantify the evidence for the presence or absence of an effect, whereas the credible interval quantifies the size of the effect under the assumption that it is present. For this reason, one may encounter paradoxical situations in which the following are simultaneously true: (1) the Bayes factor supports the point hypothesis \({\mathcal {H}}_0: \theta = \theta _0\) over the composite hypothesis \({\mathcal {H}}_1\) in which \(\theta\) is assigned some continuous prior distribution; and (2) the central 95% credible interval excludes the value \(\theta _0\).

As a concrete example, consider a binomial test with \({\mathcal {H}}_0:\theta _0 = 1/2\) and \({\mathcal {H}}_1:\theta \sim \text {Beta}(1,1)\), and assume we observe 60 successes and 40 failures. Figure 1 shows that the Bayes factor slightly favors \({\mathcal {H}}_0:\theta _0 = 1/2\), whereas the 95% credible interval just excludes that point.

Fig. 1

Based on 60 successes and 40 failures, a binomial test with \({\mathcal {H}}_0:\theta _0 = 1/2\) versus \({\mathcal {H}}_1:\theta \sim \text {Beta}(1,1)\) yields (very slight) evidence in favor of \({\mathcal {H}}_0: \theta _0 = 1/2\), whereas the central 95% credible interval under \({\mathcal {H}}_1\) ranges from 0.502 to 0.691, just excluding the point \(\theta _0 = 1/2\). Figure from JASP,
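The two quantities in this example can be checked with a few lines of code. The following is a minimal sketch, not the JASP implementation: it assumes a uniform prior (so the marginal likelihood of any success count is \(1/(n+1)\)) and uses the binomial-tail identity for the Beta distribution function, which holds only for integer shape parameters. The helper names `beta_cdf_int` and `beta_quantile_int` are hypothetical.

```python
from math import comb

def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) for positive integer a, b.

    Uses the identity P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

def beta_quantile_int(p, a, b):
    """Quantile of Beta(a, b) by bisection on the CDF above."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if beta_cdf_int(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s, f = 60, 40          # successes and failures
n = s + f

# Marginal likelihood under H0: theta fixed at 1/2.
p_y_h0 = comb(n, s) * 0.5**n
# Marginal likelihood under H1 with a Beta(1, 1) prior: 1 / (n + 1).
p_y_h1 = 1 / (n + 1)
bf01 = p_y_h0 / p_y_h1

# Central 95% credible interval of the Beta(61, 41) posterior under H1.
ci = (beta_quantile_int(0.025, s + 1, f + 1),
      beta_quantile_int(0.975, s + 1, f + 1))

print(f"BF01 = {bf01:.3f}")
print(f"95% credible interval = ({ci[0]:.3f}, {ci[1]:.3f})")
```

A Bayes factor above 1 together with a lower interval bound above 1/2 reproduces the conflict described in the text.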

There are different responses to this paradoxical state of affairs:

  1. One may blame the Bayes factor, or, more specifically, one may blame the fact that the prior distribution for \(\theta\) under \({\mathcal {H}}_1\) is overly wide, which harms predictive performance of \({\mathcal {H}}_1\). However, the conflict arises irrespective of the prior distribution; that is, if person X specifies, for instance, a \(\text {Beta}(a,b)\) prior, then person Y can present a fictitious data set for which the paradox emerges. This implies that the prior for \(\theta\) cannot be the cause of the conflict.

  2. One may recognize that Fig. 1 does not in fact present the complete posterior distribution for \(\theta\). Instead, the complete (marginal) distribution for \(\theta\) consists of a posterior spike at \(\theta _0 = 1/2\) under \({\mathcal {H}}_0\) and the continuous distribution for \(\theta\) under \({\mathcal {H}}_1\) (e.g., Rouder et al. 2018). Ignoring the posterior spike at \(\theta _0\) paints an overly optimistic picture of what values \(\theta\) is likely to have.

  3. One may realize that the paradox is a contradiction that is only apparent; thus, one may simply accept that intervals computed under \({\mathcal {H}}_1\) may exclude values that, when considered in isolation, remain relatively plausible.

Here we explore a fourth response, one that attempts to define a Bayesian interval based on the same principles that underlie the construction of the frequentist confidence interval. This rather unconventional interval defines a set of values of \(\theta\) that predicted the observed data relatively well, and it prevents the paradoxical situation outlined above from arising. Before introducing the interval, which is based on earlier work by Keynes (1921), Carnap (1950), Evans (1997, 2015), Morey et al. (2016), and Rouder and Morey (2019), we provide some background information on the Bayes factor.

1 Background on the Bayes Factor

The Bayes factor quantifies the degree to which data y change the relative prior plausibility of two hypotheses (say \({\mathcal {H}}_0\) and \({\mathcal {H}}_1\)) to the relative posterior plausibility, as follows:

$$\begin{aligned} \underbrace{ \frac{p({\mathcal {H}}_0 \mid y)}{p({\mathcal {H}}_1 \mid y)}}_{\begin{array}{c} \text {Relative posterior}\\ \text {uncertainty} \end{array} } = \underbrace{ \frac{p({\mathcal {H}}_0)}{p({\mathcal {H}}_1)}}_{\begin{array}{c} \text {Relative prior}\\ \text {uncertainty} \end{array} } \times \,\,\,\,\,\,\, \underbrace{ \frac{p(y \mid {\mathcal {H}}_0)}{p(y \mid {\mathcal {H}}_1)}}_{\begin{array}{c} \text {Bayes factor}\\ \text {BF}_{01} \end{array} }. \end{aligned}$$

For concreteness, consider a binomial test between \({\mathcal {H}}_0:\theta _0 = 1/2\) vs. \({\mathcal {H}}_1: \theta \sim \text {Beta}(2,2)\). Suppose the data at hand consist of 8 successes and 2 failures. In this specific case, the Bayes factor is given by

$$\begin{aligned} \text {BF}_{01} = \frac{p(y \mid \theta _0 = 1/2)}{p(y \mid \theta \sim \text {Beta}(2,2))}, \end{aligned}$$

the ratio of the predictive performances for \({\mathcal {H}}_0\) and \({\mathcal {H}}_1\). But now consider only \({\mathcal {H}}_1: \theta \sim \text {Beta}(2,2)\), and observe how the data change the relative plausibilities of the different values of \(\theta\) under \({\mathcal {H}}_1\):

$$\begin{aligned} \underbrace{ p(\theta \mid y, {\mathcal {H}}_1)}_{\begin{array}{c} \text {Posterior distribution}\\ \text {under} \,{\mathcal {H}}_1 \end{array} } = \underbrace{ p(\theta \mid {\mathcal {H}}_1)}_{\begin{array}{c} \text {Prior distribution}\\ \text {under} \,{\mathcal {H}}_1 \end{array} } \times \,\,\,\,\,\,\, \underbrace{ \frac{p(y \mid \theta , {\mathcal {H}}_1)}{p(y \mid {\mathcal {H}}_1)} }_{\begin{array}{c} \text {Predictive}\\ \text {updating factor} \end{array} }. \end{aligned}$$

The updating factor in Eq. (3), assessed for the value \(\theta = 1/2\), is identical to the Bayes factor in Eq. (1). In other words, when we consider only \({\mathcal {H}}_1\), and evaluate the change from prior to posterior ordinate at a specific \(\theta _0\), we may equally well interpret this as the Bayes factor for \({\mathcal {H}}_0: \theta = \theta _0\) vs. \({\mathcal {H}}_1: \theta \sim \text {Beta}(2,2)\). This relation holds generally (e.g., Dickey and Lientz 1970; Wetzels et al. 2010; but see Marin and Robert 2010; Verdinelli and Wasserman 1995). Thus, “strength of evidence for a parameter value is precisely the relative gain in predictive accuracy when conditioning on it” (Rouder and Morey 2019). Specifically, we can rewrite Eq. (3) as

$$\begin{aligned} \frac{p(\theta \mid y,{\mathcal {H}}_1)}{p(\theta \mid {\mathcal {H}}_1)}= \frac{p(y \mid \theta , {\mathcal {H}}_1)}{p(y \mid {\mathcal {H}}_1)}, \end{aligned}$$

which shows that the ratio of posterior to prior density for a parameter value is precisely equal to its predictive updating factor.
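The equality can be verified numerically for the running example (8 successes, 2 failures, \(\text {Beta}(2,2)\) prior, \(\theta _0 = 1/2\)). The sketch below is illustrative only; the helper names `log_beta` and `beta_pdf` are hypothetical, and the marginal likelihood under \({\mathcal {H}}_1\) is taken to be the standard beta-binomial form.

```python
from math import comb, exp, lgamma, log

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, for 0 < theta < 1."""
    return exp((a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_beta(a, b))

s, f = 8, 2            # successes and failures
a, b = 2, 2            # Beta prior under H1
theta0 = 0.5           # point value under H0
n = s + f

# Route 1: ratio of marginal likelihoods (the Bayes factor).
p_y_h0 = comb(n, s) * theta0**s * (1 - theta0)**f
p_y_h1 = comb(n, s) * exp(log_beta(a + s, b + f) - log_beta(a, b))
bf01 = p_y_h0 / p_y_h1

# Route 2: posterior ordinate divided by prior ordinate at theta0.
ratio = beta_pdf(theta0, a + s, b + f) / beta_pdf(theta0, a, b)

print(f"BF01 from marginal likelihoods: {bf01:.4f}")
print(f"posterior/prior density ratio:  {ratio:.4f}")
```

Both routes give the same number, which is below 1 here: the data have made \(\theta = 1/2\) less plausible than it was under the prior.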

To underscore this key point, Fig. 2 highlights the changes from prior to posterior distribution for the above binomial example; if the posterior ordinate for a specific \(\theta _0\) is higher than the prior ordinate, the data have made that \(\theta _0\) more credible than it was before: the updating factor exceeded 1, meaning that \(\theta _0\) predicted the observed data better than average (Morey et al. 2016; Wagenmakers et al. 2016).

Fig. 2

In Bayesian parameter estimation, the plausibility update for a specific value of \(\theta\) (e.g., \(\theta _0\)) is mathematically identical to a Bayes factor against a point-null hypothesis \({\mathcal {H}}_0: \theta = \theta _0\). In this example, \(\theta\) is assigned a Beta(2, 2) prior distribution (i.e., the dotted line), the data y consist of 8 successes out of 10 trials, and the resulting posterior for \(\theta\) is a Beta(10, 4) distribution. Note the similarity to the Savage–Dickey density ratio test (e.g., Dickey and Lientz 1970; Wetzels et al. 2010)

2 The Support Interval

As illustrated in Fig. 2, some values of \(\theta\) received support from the data—the updating factor was in their favor—whereas other values of \(\theta\) are undermined by the data; for these values, the posterior ordinate is lower than the prior ordinate, signaling a loss in credibility. We can use this information to define an interval containing only those values of \(\theta\) that receive a certain minimum level of corroboration from the data. This leads to the following definition.

Definition of the \(\text {BF}=k\) Support Interval: A \(\text {BF}=k\) support interval for a parameter \(\theta\) contains only those values for \(\theta\) which predict the observed data y at least k times better than average; these are values of \(\theta\) that are associated with an updating factor \(p( y \mid \theta ) / p(y) \ge k\).

2.1 Example: Choosing a value of k

The definition of the support interval makes it apparent that in practice one must choose a value for the critical updating factor k. The choice of k depends on what we want our interval to convey about the evidence in the data. Consider again the binomial scenario illustrated in Fig. 2 and suppose we seek a \(\text {BF}=1\) support interval for \(\theta\), that is, an interval that contains only those values whose credibility is not decreased by observing the data. This interval contains all values for \(\theta\) where the posterior ordinate is equal to or exceeds the prior ordinate, and serves as a natural default choice for k. In this case, the interval ranges from \(\theta \approx 0.57\) to \(\theta \approx 0.94\).

We may seek an interval of values for \(\theta\) that enjoy more impressive support from the data. This interval is a smaller subset of the initial \(\text {BF}=1\) interval. For instance, choosing \(k=3\) would produce an interval that contains all parameter values that receive at least “moderate” support from the data (according to conventions set by Jeffreys 1939). In our binomial example, the \(\text {BF}=3\) support interval for \(\theta\) ranges from \(\theta \approx 0.75\) to \(\theta \approx 0.84\). On the other hand, by choosing \(k<1\) we may also broaden our interval to encompass those values that are not strongly contraindicated by the data. The interpretation of such intervals would be analogous to how a frequentist confidence interval contains all the parameter values that would not have been rejected if tested at level \(\alpha\). For instance, a \(\text {BF}=1/3\) support interval encloses all values of \(\theta\) for which the updating factor is not stronger than 3 against; in our example this interval ranges from \(\theta \approx 0.47\) to \(\theta \approx 0.97\).
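The three intervals above can be reproduced with a short root-finding routine. This is a minimal sketch under stated assumptions: binomial data with at least one success and one failure, a Beta prior, and a unimodal updating factor whose peak lies at the maximum likelihood estimate, so that bisection on either side of the peak finds the two boundaries. The function names are hypothetical.

```python
from math import lgamma, log

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_updating_factor(theta, s, f, a, b):
    """log of p(y | theta) / p(y) for s successes, f failures, Beta(a, b) prior."""
    return (s * log(theta) + f * log(1 - theta)
            + log_beta(a, b) - log_beta(a + s, b + f))

def support_interval(k, s, f, a, b, eps=1e-12):
    """BF = k support interval, or None when no value of theta reaches k."""
    def g(t):
        return log_updating_factor(t, s, f, a, b) - log(k)

    mle = s / (s + f)              # the updating factor peaks at the MLE
    if g(mle) < 0:
        return None                # threshold exceeds the maximum: empty interval

    def root(lo, hi):
        # Bisection; g(lo) and g(hi) are assumed to differ in sign.
        lo_negative = g(lo) < 0
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if (g(mid) < 0) == lo_negative:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    return root(eps, mle), root(mle, 1 - eps)

# Running example: 8 successes, 2 failures, Beta(2, 2) prior.
for k in (1, 3, 1 / 3, 3.5):
    print(k, support_interval(k, s=8, f=2, a=2, b=2))
```

Working on the log scale avoids underflow for larger samples; the final call with \(k = 3.5\) returns `None` because no value of \(\theta\) is supported that strongly.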

3 Comparison to the Credible Interval

The support interval is based on evidence—how the data change our beliefs—whereas the credible interval is based on the posterior beliefs directly. Because evidence and belief are different concepts, it is straightforward to present situations in which the two intervals yield different results.

For instance, the top panel in Fig. 3 shows an example of unexpected data: a \(\text {Beta}(10,3)\) prior distribution (dotted line) for a binomial parameter \(\theta\) is updated to a posterior distribution (solid line) after having observed \(y = 3\) successes out of \(n = 20\) trials. In order to underscore that the data are unexpected under the prior, the panel also presents the likelihood (dashed line). The posterior is a compromise between prior and likelihood that is blind to any conflict between them; specifically, the exact same posterior (and, consequently, the exact same 95% central credible interval) for \(\theta\) would have resulted if \(\theta\) had been assigned, say, a \(\text {Beta}(5,8)\) prior distribution and \(y = 8\) successes out of \(n = 20\) trials had been observed. The support interval, in contrast, is sensitive to the unexpected nature of the data. Specifically, a \(\text {BF}=1\) support interval for \(\theta\) comprises all values of \(\theta\) that predict the data at least as well as average: these are the values for \(\theta\) where the posterior distribution equals or exceeds the prior distribution, which happens here even for a relatively wide range of values of \(\theta\).

Fig. 3

Differences between the support interval and the credible interval. The top panel shows a Beta(10, 3) prior distribution (dotted line) which is updated to a posterior distribution based on observing \(y = 3\) successes out of \(n = 20\) trials; the \(\text {BF} = 1\) support interval is much larger than the central 95% credible interval. The bottom panel shows a Beta(10, 10) prior distribution which is updated to a posterior distribution based on observing \(y = 1\) success out of \(n = 2\) trials; the \(\text {BF} = 1\) support interval is smaller than the central 95% credible interval. See text for details

The bottom panel in Fig. 3 shows an example of relatively uninformative data: a \(\text {Beta}(10, 10)\) prior distribution (dotted line) is updated to a posterior distribution (solid line) based on having observed a single success out of two trials. The 95% credible interval is relatively wide, indicating substantial uncertainty about the true value of \(\theta\); in contrast, the \(\text {BF}=1\) support interval is relatively narrow, as relatively few values of \(\theta\) predicted the data better than average. For a deeper understanding of the source of the discrepancies we now take a closer look at the likelihood.
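The contrast between the two kinds of interval in both panels can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the code behind Fig. 3: it handles only integer Beta parameters (through the binomial-tail identity for the Beta distribution function), and all function names are hypothetical.

```python
from math import comb, lgamma, log

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def support_interval(k, s, f, a, b, eps=1e-12):
    """BF = k support interval for s successes, f failures, Beta(a, b) prior."""
    def g(t):  # log updating factor minus log k
        return (s * log(t) + f * log(1 - t)
                + log_beta(a, b) - log_beta(a + s, b + f) - log(k))
    mle = s / (s + f)
    if g(mle) < 0:
        return None
    def root(lo, hi):
        lo_negative = g(lo) < 0
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if (g(mid) < 0) == lo_negative:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    return root(eps, mle), root(mle, 1 - eps)

def beta_cdf_int(x, a, b):
    """Beta(a, b) CDF for positive integer a, b (binomial-tail identity)."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

def credible_interval(a, b, mass=0.95):
    """Central credible interval of a Beta(a, b) posterior, by bisection."""
    def quantile(p):
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if beta_cdf_int(mid, a, b) < p:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    tail = (1 - mass) / 2
    return quantile(tail), quantile(1 - tail)

# (prior a, prior b, successes, failures)
cases = {
    "top panel": (10, 3, 3, 17),
    "same posterior, other prior": (5, 8, 8, 12),
    "bottom panel": (10, 10, 1, 1),
}
results = {}
for name, (a, b, s, f) in cases.items():
    si = support_interval(1, s, f, a, b)
    ci = credible_interval(a + s, b + f)
    results[name] = (si, ci)
    print(f"{name}: support {si}, credible {ci}")
```

The first two cases share the posterior \(\text {Beta}(13,20)\) and hence the credible interval, yet their support intervals differ because the prior-to-posterior change differs.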

4 A Likelihood Perspective

The construction of the support interval is based on the change from the prior to the posterior distribution, that is,

$$\begin{aligned} \frac{p(y \mid \theta )}{p(y)} = \frac{p(y \mid \theta )}{\int p(y \mid \theta ) p(\theta ) \, \text {d}\theta }. \end{aligned}$$

The denominator is the marginal likelihood, a constant that does not depend on \(\theta\), so that the updating factor can also be written as \(c \cdot p(y \mid \theta )\). Figure 4 shows the updating factor function from the second binomial example (see Fig. 2). The construction of the \(\text {BF}=k\) support interval involves, first, the selection of a threshold level of evidence, say \(\text {BF} = 1\) (marked in the figure with a dotted horizontal line), and then the identification of the values of \(\theta\) for which the function exceeds that threshold (i.e., the values of \(\theta\) in between the two gray dots that mark the intersection of the threshold with the updating factor function). Higher evidence thresholds mean smaller intervals; for instance, in Fig. 4 the \(\text {BF}=3\) support interval ranges from approximately \(\theta = .75\) to \(\theta = .84\). If the evidence threshold is raised to 3.5 or higher, an empty interval is obtained, indicating that the data do not support any value of \(\theta\) this strongly.
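For this example the updating factor function can be written in closed form. As a worked sketch (here \(B(\cdot ,\cdot )\) denotes the Beta function, a symbol introduced only for this illustration; the binomial coefficient cancels between numerator and denominator):

$$\begin{aligned} \frac{p(y \mid \theta )}{p(y)} = \frac{B(2,2)}{B(10,4)} \, \theta ^{8} (1-\theta )^{2} = \frac{1430}{3} \, \theta ^{8} (1-\theta )^{2}, \end{aligned}$$

since \(B(2,2) = 1/6\) and \(B(10,4) = 1/2860\). The function peaks at \(\theta = .8\), where it equals \(\tfrac{1430}{3} \times .8^{8} \times .2^{2} \approx 3.2\); this is why any threshold of 3.5 or higher leaves no values of \(\theta\) in the interval.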

Fig. 4

Example of an update factor function that quantifies the change from prior to posterior ordinate for binomial rate parameter \(\theta\). Data y consist of 8 successes and 2 failures, and \(\theta\) is assigned a \(\text {Beta}(2,2)\) distribution, as in Fig. 2

For a likelihoodist, the update factor function is simply a representation of the likelihood function, thus conferring on the support interval the invariance properties enjoyed by likelihood-based inferences (Edwards 1992; Royall 1997). However, for a likelihoodist the marginal likelihood constant is arbitrary, and the update factor function may therefore be arbitrarily rescaled (e.g., to have maximum 1) without changing the inference (Etz 2018). Consequently, a likelihood interval (e.g., Cumming 2014; Hudson 1971; Royall 1997) cannot be constructed by reference to any horizontal line. Instead, an interval may be constructed by comparing the maximum height of the function to the height at any other point. For instance, a likelihood ratio interval of 3 (i.e., \(\text {LR}=3\)) would contain all values of \(\theta\) for which the likelihood ratio against the maximum is less than 3, that is, \(p(y \mid {\hat{\theta }}) / p(y \mid \theta ) < 3\) (where \({{\hat{\theta }}}\) denotes the maximum likelihood estimate).

Consider again the case of 8 successes in 10 binomial trials. The maximum likelihood estimate is \({{\hat{\theta }}}=.8\). To obtain the likelihood ratio interval we can find the two boundary values of \(\theta\) whose likelihood ratio against \(\theta =.8\) is equal to 3, which gives an interval from approximately \(\theta =.58\) to \(\theta =.94\). This likelihood ratio interval differs markedly from the BF = 3 interval, which ranged from \(\theta =.75\) to \(\theta =.84\).

Therefore, even though the LR interval and the support interval are based on the same updating/likelihood function, the intervals differ: the support interval is based on a comparison to an average, whereas the likelihood interval is based on a comparison to a maximum.
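The likelihood ratio interval for the 8-out-of-10 example can be computed in the same way as the support interval, except that the reference level is the maximum of the likelihood rather than its prior-weighted average. A minimal sketch, assuming binomial data with at least one success and one failure; the function name is hypothetical, and no prior enters the computation.

```python
from math import log

def likelihood_ratio_interval(ratio, s, f, eps=1e-12):
    """Values of theta whose likelihood is within a factor `ratio` of the maximum."""
    def log_lik(t):
        # Binomial log-likelihood up to an additive constant.
        return s * log(t) + f * log(1 - t)

    mle = s / (s + f)

    def g(t):
        # Positive when the likelihood ratio against the MLE is below `ratio`.
        return log_lik(t) - log_lik(mle) + log(ratio)

    def root(lo, hi):
        # Bisection; g(lo) and g(hi) are assumed to differ in sign.
        lo_negative = g(lo) < 0
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if (g(mid) < 0) == lo_negative:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    return root(eps, mle), root(mle, 1 - eps)

lr_lo, lr_hi = likelihood_ratio_interval(3, s=8, f=2)
print(f"LR = 3 interval: ({lr_lo:.2f}, {lr_hi:.2f})")
```

Unlike a support interval, a likelihood ratio interval with ratio above 1 can never be empty, because the maximum likelihood estimate always belongs to it.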

5 Conceptual Advantages of the Support Interval

The support interval is a unique, transformation-invariant interval that generalizes to situations with multiple parameters of interest in a straightforward fashion (Dickey and Lientz 1970; Wetzels et al. 2010). The main advantage of the support interval, however, is conceptual: it quantifies directly which values of \(\theta\) are supported by the data. Specifically, those values of \(\theta\) that predict the data at least k times better than average are part of the \(\text {BF}=k\) support interval. This definition of an interval for \(\theta\) prevents the interval-versus-testing paradox from arising.

The interval-versus-testing paradox that we present here can be seen as an alternative interpretation of the famous Lindley paradox (Jeffreys 1939; Lindley 1957). Lindley’s paradox states that one can simultaneously have a frequentist test at level \(\alpha\) reject the null hypothesis while at the same time the corresponding Bayesian test overwhelmingly supports the null hypothesis. Whereas this paradox is traditionally used to highlight the inevitable divergence of p values and Bayesian posterior probabilities (or Bayes factors) for hypothesis testing, there is an alternative interpretation of the paradox as a warning against use of improper priors for Bayesian testing (see Bartlett 1957; DeGroot 1982; Robert 2014). However, the duality of p values and confidence intervals suggests yet another re-interpretation of the paradox, namely, that of a divergence between confidence intervals and Bayesian hypothesis tests. In turn, because most Bayesian credible intervals are approximately confidence intervals (the two converge asymptotically), Lindley’s paradox can be seen as highlighting the divergence of Bayesian hypothesis tests and conventional interval estimation more broadly.

Reconsider our first binomial example, shown in Fig. 1, featuring 60 successes and 40 failures and a \(\text {Beta}(1,1)\) distribution for \(\theta\) under \({\mathcal {H}}_1\). In this scenario, a \(\text {BF}=1\) interval ranges from \(\theta \approx .498\) to \(\theta \approx .697\), and a \(\text {BF}=1/3\) interval ranges from \(\theta \approx .475\) to \(\theta \approx .717\). The BF = 1 interval includes \(\theta = 1/2\), indicating that the data have increased its plausibility and it should therefore not be excluded from consideration; the paradoxical difference between conclusions drawn from the interval estimate and the Bayes factor hypothesis test no longer exists.
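The support intervals for this first example can be recomputed as follows. This is a sketch under the same assumptions as before (binomial data, Beta prior, unimodal updating factor peaking at the maximum likelihood estimate); the function names are hypothetical, and small differences in the third decimal relative to the reported values may arise from rounding.

```python
from math import exp, lgamma, log

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_updating_factor(theta, s, f, a, b):
    """log of p(y | theta) / p(y) for s successes, f failures, Beta(a, b) prior."""
    return (s * log(theta) + f * log(1 - theta)
            + log_beta(a, b) - log_beta(a + s, b + f))

def support_interval(k, s, f, a, b, eps=1e-12):
    """BF = k support interval, or None when no value of theta reaches k."""
    def g(t):
        return log_updating_factor(t, s, f, a, b) - log(k)
    mle = s / (s + f)
    if g(mle) < 0:
        return None
    def root(lo, hi):
        lo_negative = g(lo) < 0
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if (g(mid) < 0) == lo_negative:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    return root(eps, mle), root(mle, 1 - eps)

s, f, a, b = 60, 40, 1, 1      # 60 successes, 40 failures, Beta(1, 1) prior

uf_half = exp(log_updating_factor(0.5, s, f, a, b))
si_1 = support_interval(1, s, f, a, b)
si_third = support_interval(1 / 3, s, f, a, b)

print(f"updating factor at theta = 1/2: {uf_half:.3f}")
print(f"BF = 1 support interval:   ({si_1[0]:.3f}, {si_1[1]:.3f})")
print(f"BF = 1/3 support interval: ({si_third[0]:.3f}, {si_third[1]:.3f})")
```

Because the updating factor at \(\theta = 1/2\) exceeds 1, that value necessarily lies inside the \(\text {BF}=1\) interval, in agreement with the Bayes factor test.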

6 Nuisance Parameters

The presence of nuisance parameters can pose a challenge for the Savage–Dickey representation of the Bayes factor, and thus for the interpretation of a support interval as containing those parameters which would result in a \(\text {BF}=k\) test if they were used as the test value of \(\theta\). Consider a case where there is one parameter of interest \(\theta\) and a vector of nuisance parameters \(\phi\). Bayes’ theorem dictates that after observing data y, the joint posterior of \(\phi\) and \(\theta\) under the alternative is given by

$$\begin{aligned} p(\theta ,\phi \mid y,{\mathcal {H}}_1)=p(\phi ,\theta \mid {\mathcal {H}}_1)\times \frac{p(y\mid \theta ,\phi ,{\mathcal {H}}_1)}{p(y\mid {\mathcal {H}}_1)}, \end{aligned}$$

where the final term is a joint updating factor for pairs of \(\theta\) and \(\phi\) values. It would seem that a support interval for \(\theta\) (the parameter of actual interest) could be obtained by marginalizing \(\phi\) out of both the joint posterior and joint prior, and computing the marginal updating factor for \(\theta\) as in Eq. (4) using the marginal posterior and prior of \(\theta\). While it is true that a value of \(\theta\) contained in a k-support interval constructed in this way is indeed one which marginally becomes k-times more plausible, it will not necessarily correspond to a \(\text {BF}=k\) test. Thus, in the presence of nuisance parameters a support interval does not necessarily correspond to an inversion of the Bayes factor hypothesis test.

For the equivalence of the Bayes factor in favor of \(\theta _0\) and the predictive updating factor for \(\theta =\theta _0\) under the alternative to hold in the presence of nuisance parameters \(\phi\), we must ensure that marginalization of \(\phi\) from both models yields \(p(y\mid \theta _0,{\mathcal {H}}_0)=p(y\mid \theta =\theta _0,{\mathcal {H}}_1)\). That is, it must be true that

$$\begin{aligned} \int _\Phi p(y\mid \theta _0,\phi ,{\mathcal {H}}_0)p(\phi \mid {\mathcal {H}}_0) \, \text {d}\phi =\int _\Phi p(y\mid \theta =\theta _0,\phi ,{\mathcal {H}}_1)p(\phi \mid \theta =\theta _0,{\mathcal {H}}_1) \, \text {d}\phi . \end{aligned}$$

The above equality will hold if and only if the prior distribution of \(\phi\) under the null model matches the conditional prior distribution for \(\phi\) given \(\theta =\theta _0\) under the alternative model, that is, if and only if \(p(\phi \,|\,{\mathcal {H}}_0)= p(\phi \,|\,\theta =\theta _0,{\mathcal {H}}_1)\) (Dickey 1971). Verdinelli and Wasserman (1995) show that whenever these priors do not match, the Bayes factor and marginal updating factor will be off by a multiplicative constant, say c. Thus, if we are not careful to construct priors on nuisance parameters in just the right fashion, we will have a set of \(\theta\) values in a k-support interval which correspond to a \(\text {BF}=c\cdot k\) test.
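The constant c can be made explicit. The following is a sketch of the argument, assuming that the likelihood \(p(y\mid \theta _0,\phi )\) is the same function under both models and that \(p(\phi \mid \theta =\theta _0,{\mathcal {H}}_1)\) is positive wherever \(p(\phi \mid {\mathcal {H}}_0)\) is:

$$\begin{aligned} \text {BF}_{01}&= \frac{p(y\mid \theta _0,{\mathcal {H}}_0)}{p(y\mid {\mathcal {H}}_1)} = \frac{p(y\mid \theta =\theta _0,{\mathcal {H}}_1)}{p(y\mid {\mathcal {H}}_1)} \times c, \\ c&= \frac{p(y\mid \theta _0,{\mathcal {H}}_0)}{p(y\mid \theta =\theta _0,{\mathcal {H}}_1)} = \int _\Phi \frac{p(\phi \mid {\mathcal {H}}_0)}{p(\phi \mid \theta =\theta _0,{\mathcal {H}}_1)} \, p(\phi \mid \theta =\theta _0,y,{\mathcal {H}}_1) \, \text {d}\phi . \end{aligned}$$

The first factor in the top line is the marginal updating factor for \(\theta _0\), and c is the posterior expectation, under \({\mathcal {H}}_1\) and conditional on \(\theta =\theta _0\), of the ratio of the two nuisance priors. When the priors match, the ratio is identically 1 and so \(c=1\).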

One way to satisfy this condition on the priors for \(\phi\) is to take \(\phi\) and \(\theta\) as a priori independent under the alternative hypothesis, that is, \(p(\phi ,\theta \,|\,{\mathcal {H}}_1) = p(\phi \,|\,{\mathcal {H}}_1)p(\theta \,|\,{\mathcal {H}}_1)\). Subsequently one can directly set \(p(\phi \,|\,{\mathcal {H}}_1)\) equal to \(p(\phi \,|\,{\mathcal {H}}_0)\). However, in many modeling contexts the construction of independent priors can be difficult—and sometimes undesirable. For instance, Heck (2018) demonstrates that multivariate Cauchy priors do not generally satisfy the marginal-conditional condition above.

7 Earlier Work

The key relation between strength of evidence and relative predictive performance, expressed in Eq. (4) above, was previously discussed by Carnap (1950, pp. 326–333), who called it the “general division theorem”. More specifically, Carnap termed the ratio of posterior plausibility to prior plausibility the “relevance quotient”, and this quotient was a critical component in Carnap’s theory of confirmation: a datum D supports hypothesis \({\mathcal {H}}\) if and only if \(P({\mathcal {H}} \,|\, D)>P({\mathcal {H}})\), that is, if and only if \(P({\mathcal {H}} \,|\, D)/P({\mathcal {H}}) > 1\). Still earlier, the predictive updating factor was discussed by Keynes (1921), who called it the “coefficient of influence” (p. 170; as acknowledged by Carnap). Keynes attributed his coefficient of influence to a set of unpublished notes provided to him by W. E. Johnson, stating that his exposition relating to the coefficient of influence is “derived in its entirety from his [Johnson’s] notes” (p. 170). Carnap, Keynes, and Johnson were all considering how the data impact our belief in a singular claim or hypothesis and did not discuss the possibility of extending these ideas into an estimation context, although we should stress that this is a small step from their original ideas.

Our proposal to construct intervals based on the relative support lent by the data is similar to a more recent proposal by Evans (1997, 2015). Evans first puts forward a method for point estimation which amounts to choosing the parameter value which maximizes the posterior to prior ratio (i.e., the updating factor). To quantify the uncertainty in this point estimate, Evans then constructs a “relative surprise” (or “relative belief”) interval for \(\theta\) that contains \(\gamma\)% of the posterior mass, such that any value in the interval has a higher updating factor than any value outside the interval (see also Shalloway 2014).

The relative surprise interval is similar to a traditional credible interval in that it contains a fixed, predetermined proportion of the posterior mass. Thus, a 95% relative surprise interval for \(\theta\) is constructed by finding a set of \(\theta\) values such that (1) the posterior probability of this set is 95%, and (2) any values not in the set have smaller updating factor than those contained in the set. The construction of a relative surprise interval is not unlike that of a \(\xi\)% highest-density interval for \(\theta\), which is constructed such that (1) the posterior probability of the interval is \(\xi\)%, and (2) any values within the interval have higher posterior density than any values outside the interval. In fact, because the updating factor is proportional to the likelihood function, the relative surprise interval for a one-to-one transformation of \(\theta\), say \(\psi = g(\theta )\), is equivalent to a highest-density interval whenever the prior distribution induced for \(\psi\) is uniform.

Clearly, the relative surprise interval and our proposed support interval are closely related. Both intervals have the property that any values inside the interval have a larger updating factor than those outside the interval, and both intervals are invariant under smooth reparameterization. It is straightforward to shift the interpretation of a relative surprise interval into a support interval, and vice-versa. To determine what relative surprise coefficient \(\gamma\) corresponds to a \(\text {BF}=k\) support interval, one can simply find the posterior probability contained in the support interval. For instance, the \(\text {BF}=3\) support interval for our second binomial example ranges from \(\theta =.75\) to \(\theta =.84\), and the posterior probability of that interval is .274; hence, this \(\text {BF}=3\) support interval is a \(\gamma =27.4\)% relative surprise interval. Likewise, computing the updating factor corresponding to the boundary points of a relative surprise interval gives the critical value of a support interval.
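The conversion from a support interval to its relative surprise coefficient can be sketched in code. The example below is illustrative only: it uses the unrounded interval boundaries, so the resulting posterior mass may differ slightly from the value quoted for the rounded endpoints, and the Beta distribution function is computed through the binomial-tail identity, which assumes integer shape parameters. All function names are hypothetical.

```python
from math import comb, lgamma, log

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def support_interval(k, s, f, a, b, eps=1e-12):
    """BF = k support interval for s successes, f failures, Beta(a, b) prior."""
    def g(t):  # log updating factor minus log k
        return (s * log(t) + f * log(1 - t)
                + log_beta(a, b) - log_beta(a + s, b + f) - log(k))
    mle = s / (s + f)
    if g(mle) < 0:
        return None
    def root(lo, hi):
        lo_negative = g(lo) < 0
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if (g(mid) < 0) == lo_negative:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    return root(eps, mle), root(mle, 1 - eps)

def beta_cdf_int(x, a, b):
    """Beta(a, b) CDF for positive integer a, b (binomial-tail identity)."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

s, f, a, b = 8, 2, 2, 2        # second binomial example
lo, hi = support_interval(3, s, f, a, b)

# Posterior mass of the Beta(10, 4) posterior inside the BF = 3 support interval.
gamma = beta_cdf_int(hi, a + s, b + f) - beta_cdf_int(lo, a + s, b + f)
print(f"BF = 3 support interval ({lo:.3f}, {hi:.3f}) holds posterior mass {gamma:.3f}")
```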

The important difference between the two intervals is that the set of \(\theta\) values in a support interval is defined by a critical value of the updating factor, so the proportion of the posterior distribution in a support interval is not fixed in advance. Indeed, a given choice of critical updating factor can even result in an empty support interval; for instance, this would occur in our second binomial example when the critical updating factor is taken to be 3.5. In contrast, a \(\gamma\)% relative surprise interval will always have posterior probability \(\gamma\). For relatively large \(\gamma\) (e.g., \(\gamma =.95\)), this necessitates including values of \(\theta\) in the relative surprise interval which have an updating factor smaller than one. In other words, the interval can include parameter values that the data have undermined. The relative surprise interval is more a summary of the posterior distribution, whereas the support interval is more a summary of the evidence in the data. This behavior of the relative surprise interval has led Evans (2015) to recommend reporting instead a so-called plausible interval, which only contains parameter values which have evidence in their favor—in other words, a support interval for \(k=1\). These ideas have also been expanded upon by Baskurt and Evans (2013) in the context of evidence calibration, and by Evans and Tomal (2018) in the context of multiple testing and sparsity.

More recently, Rouder and Morey (2019) have argued that the updating factor should receive more attention when teaching Bayes’ rule. They display an example of an updating factor function, and mention that in their teaching, “we ask students to find intervals where the data have decreased the plausibility by more than 10-1.” Similar remarks can be found in Morey et al. (2016), where their Fig. 2 shows an example of a \(\text {BF}=1\) support interval. In sum, the support interval discussed here is a more elaborate description that is inspired by earlier work conducted by Keynes, Carnap, Evans, and Rouder, Romeijn, and Morey.

8 Concluding Comments

The support interval is based on evaluating, for each parameter value, the degree to which it predicted the observed data better than the average prediction across all parameter values. One may argue that by omitting \(p(\theta )\) and focusing entirely on the evidence that is provided by the data, the support interval is not sufficiently Bayesian; on the other hand, one may argue that the quantity \(\int p(y \mid \theta ) p(\theta ) \,\text {d}\theta\), the marginal likelihood or average predictive performance across \(\theta\), is too dependent on \(p(\theta )\); this is the common objection to Bayes factor model selection (e.g., Liu and Aitkin 2008).

All intervals come with assumptions, limitations, and advantages, and we believe it is useful to know which parameter values have received more than a specific level of corroboration from the data. Alternatively, one could of course forgo the computation of an interval altogether and simply plot the prior and the posterior distributions (e.g., Fig. 1). However, the forces of habit or nature are likely to lead researchers to extract, by eye, the intervals of interest. One interval that catches the eye is the credible interval, which is based purely on the posterior; another interval that stands out is the support interval, the collection of parameter values for which the posterior height and the prior height differ by a specific factor.

Despite its intuitive appeal, the support interval has received scant attention in the Bayesian literature on estimation. One may speculate that objective Bayesians perhaps undervalue their prior distribution, whereas subjective Bayesians overvalue it. Regardless of why measures of support in estimation have been spurned for so long, we believe that practical and theoretical considerations suggest that the support interval can provide a useful summary of what was learned from the data.