The foundation of Bayesian statistics is Bayes' theorem:
$$\begin{aligned} p(\theta |d,{\mathcal {M}}_{i}) = \frac{\pi (\theta |{\mathcal {M}}_{i}) {\mathcal {L}}_{{\mathcal {M}}_{i}}(\theta )}{Z_i}, \end{aligned}$$
(1)
where \(\pi (\theta |{\mathcal {M}}_{i})\) and \(p(\theta |d,{\mathcal {M}}_{i})\) are the prior and posterior probabilities for the parameters \(\theta \) given a model \({\mathcal {M}}_{i}\), \({\mathcal {L}}_{{\mathcal {M}}_{i}}(\theta )\) is the likelihood function of the parameters \(\theta \), given the data d and the model \({\mathcal {M}}_{i}\), and
$$\begin{aligned} Z_i = \int _{{\varOmega }_\theta } d\theta \,\pi (\theta |{\mathcal {M}}_{i})\,{\mathcal {L}}_{{\mathcal {M}}_{i}}(\theta ) \end{aligned}$$
(2)
is the Bayesian evidence of \({\mathcal {M}}_{i}\) [13], the integral of prior times likelihood over the entire parameter space \({\varOmega }_\theta \).
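As a minimal numerical illustration of Eqs. (1) and (2), the following sketch computes the evidence and the posterior of a toy one-parameter model by direct quadrature; the Gaussian likelihood and the uniform prior used here are purely illustrative assumptions, not models discussed in the text.

```python
# Toy illustration of Eqs. (1) and (2): evidence and posterior of a
# one-parameter model (illustrative choices of likelihood and prior).
from scipy import stats
from scipy.integrate import quad

def likelihood(theta):
    # toy likelihood: the data prefer theta ~ 1.0 with width 0.3
    return stats.norm.pdf(theta, loc=1.0, scale=0.3)

def prior(theta):
    # uniform prior on [0, 5]
    return stats.uniform.pdf(theta, loc=0.0, scale=5.0)

# Eq. (2): evidence = integral of prior * likelihood over the parameter space
Z, _ = quad(lambda t: prior(t) * likelihood(t), 0.0, 5.0)

def posterior(theta):
    # Eq. (1): posterior = prior * likelihood / evidence
    return prior(theta) * likelihood(theta) / Z

print(f"Z = {Z:.4f}, posterior at theta = 1: {posterior(1.0):.4f}")
```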
While Bayes' theorem indicates how to obtain the posterior probability as a function of all the model parameters \(\theta \), when presenting results we are typically interested in the marginalized posterior probability as a function of one parameter (or two), which we generically denote x. The marginalization is performed over the remaining parameters, denoted \(\psi \):
$$\begin{aligned} p(x|d,{\mathcal {M}}_{i}) = \int _{{\varOmega }_\psi }d\psi \, p(x,\psi |{\mathcal {M}}_{i},d). \end{aligned}$$
(3)
Let us now assume that the prior is separable, so that we can write \(\pi (\theta |{\mathcal {M}}_{i})=\pi (x|{\mathcal {M}}_{i})\cdot \pi (\psi |{\mathcal {M}}_{i})\). Under this hypothesis, Eq. (3) can be written as:
$$\begin{aligned} p(x|d,{\mathcal {M}}_{i}) = \frac{\pi (x|{\mathcal {M}}_{i})}{Z_i} \int _{{\varOmega }_\psi }d\psi \, \pi (\psi |{\mathcal {M}}_{i}){\mathcal {L}}_{{\mathcal {M}}_{i}}(x,\psi ). \end{aligned}$$
(4)
Let us consider the marginalized posterior as written in Eq. (4). The dependence on the prior \(\pi (x|{\mathcal {M}}_{i})\) appears explicitly only outside the integral, and therefore we can obtain a prior-independent quantity simply by dividing the posterior by the prior. The right-hand side of Eq. (4), however, still depends on the value of x through the likelihood that appears in the integral. Note that this integral resembles the definition of the Bayesian evidence in Eq. (2), no longer for the model \({\mathcal {M}}_{i}\), but for a sub-case of \({\mathcal {M}}_{i}\) in which x is a fixed parameter. Let us label this model \({\mathcal {M}}_{i}^{x}\) and define its Bayesian evidence:
$$\begin{aligned} Z_i^{x} \equiv \int _{{\varOmega }_\psi }d\psi \, \pi (\psi |{\mathcal {M}}_{i}){\mathcal {L}}_{{\mathcal {M}}_{i}}(x,\psi ), \end{aligned}$$
(5)
which is independent of the prior \(\pi (x)\), but still depends on the value of x, which is now fixed. Note that Eq. (4) can be rewritten in the following form:
$$\begin{aligned} Z_i = \frac{\pi (x|{\mathcal {M}}_{i})}{p(x|d,{\mathcal {M}}_{i})} Z_i^x. \end{aligned}$$
(6)
Now, let us consider two models \({\mathcal {M}}_{i}^{x_1}\) and \({\mathcal {M}}_{i}^{x_2}\). Since \(Z_i\) is independent of x, we can use Eq. (6) to obtain
$$\begin{aligned} \frac{\pi (x_1|{\mathcal {M}}_{i})}{p(x_1|d,{\mathcal {M}}_{i})} Z_i^{x_1} = \frac{\pi (x_2|{\mathcal {M}}_{i})}{p(x_2|d,{\mathcal {M}}_{i})} Z_i^{x_2}, \end{aligned}$$
(7)
which can be rewritten as
$$\begin{aligned} \frac{Z_i^{x_1}}{Z_i^{x_2}} = \frac{p(x_1|d,{\mathcal {M}}_{i})/\pi (x_1|{\mathcal {M}}_{i})}{p(x_2|d,{\mathcal {M}}_{i})/\pi (x_2|{\mathcal {M}}_{i})}. \end{aligned}$$
(8)
The left-hand side of this equation is a ratio of the Bayesian evidences of the models \({\mathcal {M}}_{i}^{x_1}\) and \({\mathcal {M}}_{i}^{x_2}\), and is therefore a Bayes factor. For reasons that will become clear later, let us rename \(x_1\rightarrow x\) and \(x_2\rightarrow x_0\) and define this ratio as \({\mathcal {R}}(x,x_0|d)\), which was named “relative belief updating ratio” or “shape distortion function” in the past [10,11,12]:
$$\begin{aligned} {\mathcal {R}}(x,x_0|d) \equiv \frac{Z_i^{x}}{Z_i^{x_0}} = \frac{ p(x|d,{\mathcal {M}}_{i})/\pi (x|{\mathcal {M}}_{i}) }{ p(x_0|d,{\mathcal {M}}_{i})/\pi (x_0|{\mathcal {M}}_{i}) }. \end{aligned}$$
(9)
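As a simple limiting case, useful when comparing with the likelihood ratio test discussed below, consider a model with no nuisance parameters \(\psi \): the integral in Eq. (5) becomes trivial, \(Z_i^{x}={\mathcal {L}}_{{\mathcal {M}}_{i}}(x)\), and
$$\begin{aligned} {\mathcal {R}}(x,x_0|d) = \frac{{\mathcal {L}}_{{\mathcal {M}}_{i}}(x)}{{\mathcal {L}}_{{\mathcal {M}}_{i}}(x_0)}, \end{aligned}$$
so that, for instance, for a one-dimensional Gaussian likelihood \({\mathcal {L}}_{{\mathcal {M}}_{i}}(x)\propto \exp (-(x-\mu )^2/(2\sigma ^2))\) one finds \(\ln {\mathcal {R}}(x,x_0|d)=\left[ (x_0-\mu )^2-(x-\mu )^2\right] /(2\sigma ^2)\), independently of the prior \(\pi (x|{\mathcal {M}}_{i})\).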
Although this function has already been employed in the past, see e.g. [14,15,16,17], its use has somewhat faded into obscurity. Here, we review its properties and discuss them in detail.
Let us recall that \(Z_i^x\) is independent of \(\pi (x)\), see Eq. (5): this means that \({\mathcal {R}}(x,x_0|d)\) is also independent of \(\pi (x)\). It therefore provides a prior-independent way to compare results for two values of a parameter x. At the practical level, \({\mathcal {R}}\) is particularly useful when dealing with open likelihoods, i.e. when the data only constrain the value of a parameter from above or from below. In such cases, the likelihood becomes insensitive to parameter variations below (or above) a certain threshold. Consider for example the absolute scale of neutrino masses, on which data (either cosmological or from laboratory experiments) only put an upper limit: the data are insensitive to the value of x as x approaches 0, so we can take \(x_0=0\) as a reference value. Regardless of the prior, when x is sufficiently close to \(x_0\) the likelihoods at x and \(x_0\) are essentially the same at all points of the parameter space \({\varOmega }_\psi \), so \(Z_i^{x}\simeq Z_i^{x_0}\) and \({\mathcal {R}}(x,x_0)\rightarrow 1\). Conversely, when x is sufficiently far from \(x_0\), the data penalize its value (\(Z_i^{x}\ll Z_i^{x_0}\)) and \({\mathcal {R}}(x,x_0)\rightarrow 0\). In between, \({\mathcal {R}}\) indicates how much each value of x is favored or disfavored with respect to \(x_0\), in the same way a Bayes factor indicates how much one model is preferred over another.
While \({\mathcal {R}}\) describes the general behavior of the posterior as a function of x, any probabilistic limit one can compute will always depend on the prior shape and range, which is an unavoidable ingredient of Bayesian statistics. Describing the results through the \({\mathcal {R}}\) function, however, allows one to use the data to define a region above which the parameter values are disfavored, regardless of the prior assumptions, and also guarantees an easier comparison of two experimental results. A good standard could be to provide a (non-probabilistic) “sensitivity bound”, defined as the value of x at which \({\mathcal {R}}\) drops below some level, for example \(|\ln {\mathcal {R}}|=1\), 3 or 5, in accordance with the Jeffreys scale (see e.g. [2, 13]). Let us take \(x_0=0\) as above: we could say, for example, that we consider as “moderately (strongly) disfavored” the region \(x>x_s\) for which \(\ln {\mathcal {R}}<s\), with \(s=-3\) (or \(-5\)), and then use the different values \(x_s\) to compare the strength of different data combinations d in constraining the parameter x. This does not represent an upper bound at some given confidence level, since it is not a probabilistic bound, but rather a hedge “which separates the region in which we are, and where we see nothing, from the region we cannot see” [11].
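A minimal numerical sketch of such a sensitivity bound, assuming a toy open likelihood with no nuisance parameters for which \(\ln {\mathcal {R}}(x,0)=-x^2/(2\sigma ^2)\) (a hypothetical choice made only for illustration), could look as follows:

```python
# Sensitivity bound sketch: find x_s where ln R(x, 0) drops below s = -3,
# for a toy open likelihood with ln R(x, 0) = -x^2 / (2 sigma^2).
import numpy as np
from scipy.optimize import brentq

sigma = 1.5  # assumed width of the toy likelihood

def ln_R(x):
    # relative belief updating ratio with respect to the reference x0 = 0
    return -x**2 / (2.0 * sigma**2)

s = -3.0  # "moderately disfavored" threshold on the Jeffreys scale
x_s = brentq(lambda x: ln_R(x) - s, 0.0, 20.0)

print(f"x_s = {x_s:.3f}  (analytic value: {sigma * np.sqrt(6):.3f})")
```

Different data combinations d would then be compared simply through their values of \(x_s\).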
From the computational point of view, it is not necessary to perform the integrals in the definition of \(Z_i^x\) in order to compute \({\mathcal {R}}\). One can directly use the right-hand side of Eq. (9), i.e. numerically compute \(p(x|d,{\mathcal {M}}_{i})\) under a specific prior assumption, then divide by \(\pi (x|{\mathcal {M}}_{i})\) and normalize appropriately. Notice also that, once \({\mathcal {R}}\) is known, anyone can obtain credible intervals with any prior of choice: the posterior \(p(x|d,{\mathcal {M}}_{i})\) can easily be computed using Eq. (9) and normalizing to a total probability of 1 within the prior range.
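A sketch of this recipe is given below; the posterior samples, the prior used in the run and the alternative prior are all hypothetical placeholders chosen only to make the example self-contained.

```python
# From samples of x (e.g. from an MCMC run with prior pi_run): estimate
# p(x|d), divide by pi_run to get R(x) up to a constant (Eq. (9)), normalize
# at the reference value x0 = 0, then fold in any other prior of choice.
import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import trapezoid

rng = np.random.default_rng(1)
samples = np.abs(rng.normal(0.0, 1.5, size=100_000))  # placeholder for MCMC samples of x

def pi_run(x):
    # prior used in the hypothetical run: flat on [0, 10]
    return np.where((x >= 0) & (x <= 10), 0.1, 0.0)

x_grid = np.linspace(0.0, 6.0, 601)
posterior = gaussian_kde(samples)(x_grid)   # estimate of p(x|d, M_i)
R = posterior / pi_run(x_grid)              # proportional to R(x, x0)
R /= R[0]                                   # normalize so that R(x0 = 0) = 1

# Credible intervals for a different prior choice, e.g. proportional to exp(-x):
pi_new = np.exp(-x_grid)
post_new = R * pi_new
post_new /= trapezoid(post_new, x_grid)     # normalize to unit total probability
```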
A few final comments: in most cases, obtaining limits with the \({\mathcal {R}}\) function is nearly equivalent to using a likelihood ratio test. The difference is that, while the likelihood ratio test only considers the likelihood values at the best fit for fixed \(x_0\) and x, the \({\mathcal {R}}\) method weighs the information of the entire posterior distribution, taking into account the mean likelihood over the prior in \({\varOmega }_\psi \). This means that in cases with multiple posterior peaks or complex posterior distributions, the limits obtained using the \({\mathcal {R}}\) function can be more conservative than those obtained with the likelihood ratio test. As an example, we provide in the lower panel of Fig. 1 a comparison of the likelihood ratio and of the \(-2\ln {\mathcal {R}}\) functions for the following likelihood:
$$\begin{aligned} {\mathcal {L}}(x, \theta )\propto \exp \left( -(x+0.6\theta )^2/(2\cdot 1^2)\right) \left[ \exp \left( -\theta ^2/(2\cdot 3^2)\right) + 0.5\exp \left( -(x-6)^2/(2\cdot 0.5^2)\right) \right] . \end{aligned}$$
(10)
The dependence of the likelihood on x and \(\theta \) is shown in the upper panel of Fig. 1. In this case, the \({\mathcal {R}}\) function takes into account the existence of a second peak in the posterior. The function and the coefficients in Eq. (10) are chosen to show that, while a cut at 1 (corresponding to the \(1\sigma \) limit, in a frequentist sense, for the likelihood ratio test) gives the same result for the likelihood ratio and the \({\mathcal {R}}\) methods, a cut at 4 (corresponding to a \(2\sigma \) significance for the likelihood ratio test) leads to different results, because the likelihood ratio only considers the likelihood values at the best fit, while the \({\mathcal {R}}\) method is also affected by the second peak of the posterior. For the same reason, \(-2\ln {\mathcal {R}}\) exhibits a local minimum at \(x=6\).
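A numerical sketch of this comparison, assuming a flat prior on \(\theta \) over a wide range (an illustrative choice needed to make the integral in Eq. (5) well defined, and not necessarily the one adopted for Fig. 1), could be:

```python
# Compare -2 ln R with the profile likelihood ratio for the likelihood of
# Eq. (10); both curves are normalized to their own best-fit value of x,
# a plotting choice made here only for the comparison.
import numpy as np
from scipy.integrate import trapezoid

def L(x, theta):
    # likelihood of Eq. (10)
    return np.exp(-(x + 0.6 * theta)**2 / 2.0) * (
        np.exp(-theta**2 / (2.0 * 3.0**2))
        + 0.5 * np.exp(-(x - 6.0)**2 / (2.0 * 0.5**2))
    )

x_grid = np.linspace(-4.0, 10.0, 281)
theta_grid = np.linspace(-20.0, 20.0, 4001)

# Z_i^x of Eq. (5) with a flat prior on theta (constant factors cancel in R)
Zx = np.array([trapezoid(L(x, theta_grid), theta_grid) for x in x_grid])
minus2lnR = -2.0 * np.log(Zx / Zx.max())

# Profile likelihood ratio: maximize over theta at each fixed x
Lprof = np.array([L(x, theta_grid).max() for x in x_grid])
minus2lnLR = -2.0 * np.log(Lprof / Lprof.max())

# The second term of Eq. (10), independent of theta, survives the integration
# over theta and produces the second peak (local minimum of -2 ln R) near x = 6.
```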
Another advantage is computational. In cosmological analyses, it is typically difficult to study the maximum of the likelihood, because of the number of dimensions, the numerical noise and the computational cost of the likelihood; an example of the technical difficulties of this kind of analysis can be found in [18], and similar difficulties can emerge in other contexts. Even if the best-fit point is not known with sufficient precision, however, the \({\mathcal {R}}\) function allows one to obtain a prior-independent bound with a Markov Chain Monte Carlo or a similar method.