On the Measure-Theoretic Premises of Bayes Factor and Full Bayesian Significance Tests: a Critical Reevaluation

The Full Bayesian Significance Test (FBST) and the Bayesian evidence value have recently received increasing attention across a variety of sciences, including psychology. Ly and Wagenmakers (2021) have provided a critical evaluation of the method and concluded that it suffers from four problems, which are mostly attributed to the asymptotic relationship of the Bayesian evidence value to the frequentist p-value. While Ly and Wagenmakers (2021) tackle an important question about the best way of statistical hypothesis testing in the cognitive sciences, it is shown in this paper that their arguments are based on a specific measure-theoretic premise. The identified problems hold only under a specific class of prior distributions which are required only when adopting a Bayes factor test. However, the FBST explicitly avoids this premise, which resolves the problems in practical data analysis. In summary, the analysis leads to the more important question of whether precise point null hypotheses are realistic for scientific research, and a shift towards the Hodges-Lehmann paradigm may be an appealing solution when there is doubt about the appropriateness of a precise hypothesis.


Introduction
Over the last two decades, the Full Bayesian Significance Test (FBST) has been developed as a Bayesian counterpart to a frequentist point null hypothesis test (Pereira & Stern, 1999). The FBST and its associated Bayesian evidence value, the e-value, were designed to replace a frequentist hypothesis test while being coherent with the likelihood principle and a fully Bayesian philosophy (Pereira et al., 2008). In their paper, Ly and Wagenmakers (2021) identify four problems of the FBST and conclude that the Bayes factor (BF) avoids these. In this paper, it is shown that the problems are observed only under a measure-theoretic premise which is required for a Bayes factor test, but explicitly avoided under the FBST. This renders the [...]

[...] the Bayes factor in favour of $\Theta_0$ is then defined as the ratio of posterior and prior odds

$$BF_{01} = \frac{P_{\vartheta|Y}(\Theta_0) / P_{\vartheta|Y}(\Theta_1)}{P_\vartheta(\Theta_0) / P_\vartheta(\Theta_1)} \tag{1}$$

where $P_{\vartheta|Y}$ is the posterior distribution of $\vartheta$ given $Y$.¹ An important condition in the definition of the Bayes factor is that both $\Theta_0$ and $\Theta_1$ receive strictly positive prior probability mass, that is, $P_\vartheta(\Theta_0) > 0$ and $P_\vartheta(\Theta_1) > 0$ both need to hold. Otherwise, the prior odds $P_\vartheta(\Theta_0)/P_\vartheta(\Theta_1)$ are either zero (whenever $P_\vartheta(\Theta_0) = 0$ and $P_\vartheta(\Theta_1) > 0$) or not even well-defined (whenever $P_\vartheta(\Theta_0) > 0$ and $P_\vartheta(\Theta_1) = 0$). In the former case, the posterior odds are zero, too, because these can be rewritten as

$$\frac{P_\vartheta(\Theta_0)}{P_\vartheta(\Theta_1)} \cdot BF_{01} = \frac{P_{\vartheta|Y}(\Theta_0)}{P_{\vartheta|Y}(\Theta_1)} \tag{2}$$

so that no matter what value the Bayes factor $BF_{01}(y)$ takes for observed data $Y = y$, the posterior odds $P_{\vartheta|Y}(\Theta_0)/P_{\vartheta|Y}(\Theta_1)$ are zero whenever the prior odds are zero. In the latter case, the posterior odds are not well-defined because one would have to divide by zero. Now, consider a dominated statistical model $\mathcal{P} := \{P_\theta : \theta \in \Theta\}$ for sample data $Y \in \mathcal{Y}$ where the parameter space is a subset of the real numbers, $\Theta \subset \mathbb{R}$ (we could also use $\mathbb{R}^n$ for $n \in \mathbb{N}$ to generalize the argument), and more specifically, an interval $(a, b)$ for $a < b \in \mathbb{R}$.
Suppose we test the precise point null hypothesis²

$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0$$

Now, if the prior $P_\vartheta$ is absolutely continuous with respect to Lebesgue measure $\lambda$ on $\Theta$, the prior odds are zero, and so are the posterior odds. This follows from the absolute continuity of $P_\vartheta$ with respect to $\lambda$: for every Lebesgue null set $A$ with $\lambda(A) = 0$ it follows that $P_\vartheta(A) = 0$. It is well known from standard measure theory that lower-dimensional submanifolds and, in particular, countable sets have Lebesgue measure zero (Bauer, 2001). As a consequence, as $H_0: \theta = \theta_0$ is a countable set with a single element $\theta_0 \in \Theta$, it follows that $\lambda(\Theta_0) = 0$ and thereby $P_\vartheta(\Theta_0) = 0$, where $\Theta_0 := \{\theta_0\}$. The prior probability of $H_1$ is strictly positive, because $\lambda((a, b)) > 0$ for any interval with $a < b$.³ From Eq. (2), it follows that the posterior odds $P_{\vartheta|Y}(\Theta_0)/P_{\vartheta|Y}(\Theta_1)$ are then zero, too.⁴ Importantly, the Bayes factor $BF_{01}$ is not even well-defined in this case: as the prior odds are zero, and $BF_{01}$ is the ratio of posterior to prior odds, as seen from Eq. (2), we would have to divide by zero to obtain $BF_{01}$.

In this context it is important to note that "When the parameter space is uncountable, prior distributions are typically continuous. This means that the prior (and posterior) probability of Θ = θ 0 is 0." (Schervish, 1995, p. 221). That is, in the majority of continuously parameterized models $\mathcal{P}$ (or equivalently, whenever we have uncountable parameter spaces), the prior distributions $P_\vartheta$ are typically absolutely continuous with respect to the Lebesgue measure $\lambda$ and have a continuous Radon-Nikodým density with respect to $\lambda$. Examples include a normal prior, exponential prior, Cauchy prior, or Student-t prior with corresponding $\lambda$-densities. Under any of these priors, we cannot test a simple null hypothesis versus its alternative. To circumvent this problem, one is forced to introduce an arbitrary prior probability $\varrho > 0$ that $\theta = \theta_0$ and additionally select a prior distribution $P^{\Theta_1}_\vartheta$ under the alternative hypothesis $\Theta_1$. The introduction of this prior probability is primarily justified by the wish to be able to calculate a Bayes factor.⁵ Proceeding with the subjective assignment of positive probability to a Lebesgue null set $\{\theta_0\}$, the resulting prior distribution $P_\vartheta$ is then a convex combination of a Dirac measure⁶ $\mathcal{E}_{\Theta_0}$ under $\Theta_0$ and the prior distribution $P^{\Theta_1}_\vartheta$ under $\Theta_1$:

$$P_\vartheta(G) = \varrho \cdot \mathcal{E}_{\Theta_0}(G) + (1 - \varrho) \cdot P^{\Theta_1}_\vartheta(G) \tag{4}$$

for all parameter sets $G \in \mathcal{G}$ in the parameter space $(\Theta, \mathcal{G})$. Even interpreting the Bayes factor as the ratio of marginal likelihoods under $H_0$ and $H_1$ does not help, although it seems to avoid the assignment of an explicit prior probability to $\Theta_0$ and $\Theta_1$: equivalent to (1), the BF can be expressed as (compare Robert (2007, p. 227))

$$BF_{01}(y) = \underbrace{\frac{\int_{\Theta_0} f_\theta(y)\, dP_\vartheta(\theta)}{\int_{\Theta_1} f_\theta(y)\, dP_\vartheta(\theta)}}_{(A)} \overset{(1)}{=} \frac{\int_{\Theta_0} f_\theta(y)\, p_0(\theta)\, d\lambda(\theta)}{\int_{\Theta_1} f_\theta(y)\, p_1(\theta)\, d\lambda(\theta)} \overset{(2)}{=} \frac{\int_{\Theta_0} f_\theta(y)\, p_0(\theta)\, d\theta}{\int_{\Theta_1} f_\theta(y)\, p_1(\theta)\, d\theta} \tag{5}$$

where equality (1) follows from denoting the Lebesgue density of the prior distribution $P_\vartheta$ on $\Theta_i$ as $p_i := dP_\vartheta/d\lambda$ for $i = 0, 1$, and equality (2) by writing integration of $\theta$ with respect to $\lambda$ as usual as $d\theta$. Now, from (A) in Eq. (5), it is immediate that whenever $\Theta_0$ is a $P_\vartheta$-null set, the Bayes factor $BF_{01}$ will be zero because $\int_{\Theta_0} f_\theta\, dP_\vartheta = 0$ then. Thus, the Bayes factor can no longer provide any evidence for $\Theta_0$.⁷ While the definition of the BF could easily be adapted to incorporate this scenario, the more alarming situation occurs when instead $BF_{10} = 1/BF_{01}$ is calculated for a $P_\vartheta$-null set $\Theta_0 \subset \Theta$: again, from (A) in Eq. (5), it follows that $BF_{10}$ has a singularity in $\Theta_0$, and one would arrive at $BF_{10} = 1/0$, so that the BF is no longer well-defined. This is the reason why, by definition, the Bayes factor is only defined for measurable partitions $\{\Theta_0, \Theta_1\}$ with both $P_\vartheta(\Theta_0) > 0$ and $P_\vartheta(\Theta_1) > 0$ (Schervish, 1995, p. 220-221). Thus, it is crucial that the prior distribution $P_\vartheta$ assigns positive mass to $\Theta_0$ for the BF to be well-defined, and only then does Eq. (5) provide the usual interpretation of the Bayes factor as the ratio of marginal likelihoods under $H_0 := \Theta_0$ and $H_1 := \Theta_1$. Also, the more familiar-looking representation $BF_{01} = f_{\theta_0}(y) / \int_{\Theta_1} f_\theta(y)\, p_1(\theta)\, d\theta$ of the Bayes factor for $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$ holds if and only if a mixture prior with Dirac measure component $\mathcal{E}_{\theta_0}$ for the value $\theta_0$ is assigned to the parameter with positive mixing weight $\varrho > 0$ (Robert, 2007, p. 231).

¹ Recall that the parameter is a random variable $\vartheta: (\Omega, \mathcal{A}, P^*) \to (\Theta, \mathcal{G})$ from the product space $(\Omega, \mathcal{A}, P^*)$, where $\Omega := \mathcal{Y} \times \Theta$ is the product space of the data and parameter, $\mathcal{A}$ is the associated product σ-algebra, and the joint probability measure $P^*$ is implicitly defined via the selection of the prior distribution $P_\vartheta$ and the statistical model $\mathcal{P} := \{P_\theta : \theta \in \Theta\}$ for the observed data $Y \in \mathcal{Y}$, that is, $Y \sim P_\theta$; compare Schervish (1995).
² This notation implies $\Theta_0 := \{\theta_0\}$ and $\Theta_1 := \Theta \setminus \{\theta_0\}$ and is mainly used because it is widely established.
³ When $\Theta = \mathbb{R}$, $\lambda(H_1) = \infty$ and the prior odds are zero, too, when adopting the convention $0/\infty := 0$.
⁴ This also follows from the fact that the posterior distribution $P_{\vartheta|Y}$ is absolutely continuous with respect to the prior distribution $P_\vartheta$, that is, $P_{\vartheta|Y} \ll P_\vartheta$, see Schervish (1995, Theorem 1.31).
⁵ However, Schervish (1995) emphasizes that a (possibly) more realistic alternative would be to "replace the hypothesis with (what might be more reasonable) an interval hypothesis of the form H : Θ ∈ [θ 0 − ε, θ 0 + ε]." (Schervish, 1995, p. 221), see also Berger (1985, p. 148).
⁶ The Dirac measure $\mathcal{E}_{\Theta_0}$ is defined as follows: $\mathcal{E}_{\Theta_0}(A) := 1$ if $\theta_0 \in A$ and $\mathcal{E}_{\Theta_0}(A) := 0$ if $\theta_0 \notin A$.
Robert phrases this as follows: "Testing of point-null hypotheses and the like thus impose a drastic modification of the prior distribution ... This modification of the prior is puzzling from a measure-theoretic point of view, since it puts some weight on a set previously of measure 0." (Robert, 2007, p. 229), see also Berger (1985, p. 148).
The above line of thought clarifies why the Bayes factor is only defined for hypotheses Θ 0 and Θ 1 which receive strictly positive prior mass. Importantly, it plays no role for this requirement whether the BF is interpreted as the ratio of posterior to prior odds or as the ratio of marginal likelihoods of both hypotheses under consideration.

Measure-Theoretic Background-the FBST
The idea of the FBST is to use the e-value $\mathrm{ev}(H_0)$, which quantifies the Bayesian evidence against $H_0$, as a Bayesian replacement of the frequentist p-value. Importantly, the FBST was designed as a Bayesian replacement of frequentist point null tests and thus should only be capable of rejecting a point null hypothesis. In the FBST, the Bayesian surprise function $s(\theta) := p(\theta|y)/r(\theta)$ normalizes the posterior density $p(\theta|y)$ by a reference function $r(\theta)$. Possible choices include a flat reference function $r(\theta) := 1$ or any prior probability density $p(\theta)$ for the parameter $\theta$. In most settings, $r(\theta)$ will be the probability density of a probability distribution that is absolutely continuous with respect to the Lebesgue measure $\lambda$, as the introduction of a mixture prior as detailed above for the BF is not required for the FBST. As a consequence, the FBST does not separate hypothesis testing and estimation with regard to the choice of prior distribution, and is not forced to assign positive probability to a Lebesgue null set $\{\theta_0\}$. The calculation of the e-value is explicitly possible under absolutely continuous priors: $s^*$ is defined as the maximum of the surprise function $s(\theta)$ over the null set $\Theta_0$ which belongs to the hypothesis $H_0$, that is, $s^* := s(\theta^*) = \max_{\theta \in \Theta_0} s(\theta)$. [...] The crucial difference to the BF is that the FBST does not force the introduction of an arbitrary prior probability mass on the point null value $\{\theta_0\}$. Thus, the FBST is coherent with a Bayesian parameter estimation perspective (which would not accept the assignment of arbitrary probability mass to a point null value $\theta_0$) and does not separate Bayesian hypothesis testing and parameter estimation. The use of priors which are absolutely continuous with respect to the Lebesgue measure $\lambda$ is explicitly possible and a standard choice when adopting the FBST.⁸

⁷ Note that because $\Theta_1 := \Theta \setminus \Theta_0$ (where $\setminus$ denotes the set complement, that is, $A \setminus B$ includes all elements which are located in $A$ but not in $B$), the set $\Theta_1$ has $P_\vartheta$-measure one, because $P_\vartheta(\Theta_1) = P_\vartheta(\Theta \setminus \Theta_0) = P_\vartheta(\Theta) - P_\vartheta(\Theta_0) = 1 - 0 = 1$. Thus, $BF_{01} = 0/1 = 0$ in Eq. (5) when $P_\vartheta(\Theta_0) = 0$.

The e-value as an Approximate p-value
The first two problems identified by Ly and Wagenmakers (2021) are connected to the alleged relationship of the e-value to the frequentist p-value. As a case where the e-value and p-value coincide, they cite the example of Diniz et al. (2012), who assumed normally distributed data $N(\mu, 1)$ and concluded that under a uniform prior the two quantities become exactly equal. However, as Diniz et al. (2012) stressed: "This result is a consequence of the fact that the normal density depends on t and μ only on (t − μ) 2 ." (Diniz et al., 2012, p. 162). Diniz et al. (2012) considered a Gauß test for normal data with unknown mean $\mu \in \mathbb{R}$ and variance $\sigma^2 = 1$, and assumed that the sample mean $t = 3.9$ is observed for $n = 3$ observations. The null hypothesis $H_0: \mu = 5$ [...] Held and Sabanés-Bové (2014, Example 3.4), and thus the sampling density of the statistic $t$ (not of the original data, which we can replace with the minimal sufficient statistic $t$) under $H_0$ is a $N(5, 1/3)$ density. The density of the posterior distribution based on $t$ under an improper flat prior $p(\mu) = 1$ is also a normal density, $N(3.9, 1/3)$, because the likelihood satisfies $L(\mu \mid t) \propto \exp\left(-\frac{n}{2\sigma^2}(t - \mu)^2\right)$, and the latter is a kernel of a $N(3.9, 1/3)$ density for observed $t = 3.9$ and $\sigma^2 = 1$. Now, the two-sided p-value results in $2 \cdot P_{\mu=5, \sigma^2=1/3}(t < 3.9) = 0.0567$, which is twice the tail area of $t < 3.9$ of the sampling density under $H_0$ (we could equivalently compute $P_{\mu=5, \sigma^2=1/3}(t < 3.9) + P_{\mu=5, \sigma^2=1/3}(t > 6.1)$). The e-value $\mathrm{ev}(H_0)$ in favour of $H_0$ results in $2 \cdot P_{\vartheta=(3.9, 1/3)|Y}(\mu > 5) = 0.0567$, which is twice the tail area of $\mu > 5$, where 5 is the value under $H_0$ (we could equivalently compute $P_{\vartheta=(3.9, 1/3)|Y}(\mu > 5) + P_{\vartheta=(3.9, 1/3)|Y}(\mu < 2.8)$).
Thus, we arrive at $p = \mathrm{ev}(H_0)$, and the only reason for this coincidence is that the tail probabilities we compute for $p$ and $\mathrm{ev}(H_0)$ depend only on the distance $(t - \mu)^2$, because the variances are identical in the sampling density $N(5, 1/3)$ and the posterior $N(3.9, 1/3)$ (equivalently, omit the square and use $|t - \mu|$, because we are interested in the absolute difference between the parameter value and $t$). In both cases, we compute tail probabilities which depend only on the difference $|t - \mu| = 1.1$ (or equivalently $(t - \mu)^2 = 1.1^2$), and thus $p = \mathrm{ev}(H_0)$. Very informally speaking, the "bell shape" of the normal distribution is the same for identical variances in the sampling density and posterior, and the tail probabilities of both coincide when the only difference is the means, which are shifted. However, as soon as we use a proper prior, the relationship disappears.
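The coincidence in the Gauß test example can be checked numerically. The following is a minimal sketch, not the computation of Diniz et al. (2012): it assumes a flat reference function $r(\mu) = 1$, so that the surprise function equals the posterior density, and it uses the standard FBST convention that the e-value in favour of $H_0$ is one minus the posterior mass of the set where the surprise function exceeds its value at $\mu = 5$. The function names and the proper prior $N(4, 1)$ in the second part are hypothetical and chosen only for illustration.

```python
import math

def normal_cdf(x, mean, var):
    """CDF of a N(mean, var) distribution, computed via erfc."""
    return 0.5 * math.erfc(-(x - mean) / math.sqrt(2.0 * var))

def e_value_normal_posterior(post_mean, post_var, mu_0):
    """e-value in favour of H0: mu = mu_0 for a normal posterior and a flat
    reference function. The posterior is symmetric and unimodal, so the set
    with higher density than at mu_0 is an interval centred at post_mean."""
    half_width = abs(mu_0 - post_mean)
    mass_tangential = (normal_cdf(post_mean + half_width, post_mean, post_var)
                       - normal_cdf(post_mean - half_width, post_mean, post_var))
    return 1.0 - mass_tangential

n, sigma2 = 3, 1.0          # sample size and known variance from the example
t_obs, mu_0 = 3.9, 5.0      # observed sample mean and null value
var_t = sigma2 / n          # variance of the sample mean

# Two-sided p-value: the sampling density of t under H0 is N(mu_0, var_t).
p_value = 2.0 * normal_cdf(t_obs, mu_0, var_t)

# Flat prior: the posterior of mu is N(t_obs, var_t).
e_value_flat = e_value_normal_posterior(t_obs, var_t, mu_0)

# Hypothetical proper prior N(4, 1): conjugate normal update of mean and variance.
prior_mean, prior_var = 4.0, 1.0
post_var = 1.0 / (1.0 / prior_var + n / sigma2)
post_mean = post_var * (prior_mean / prior_var + n * t_obs / sigma2)
e_value_proper = e_value_normal_posterior(post_mean, post_var, mu_0)

print(p_value, e_value_flat, e_value_proper)
```

Under the flat prior the two quantities agree up to floating-point error, whereas under the illustrative proper prior the posterior mean and variance change and the e-value no longer equals the p-value.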
Also, it is well known that testing based on improper priors quickly leads to problems, which is why some authors have argued that "improper priors should not be used at all in tests." (Robert, 2007, p. 232), compare DeGroot (1973). Widening their criticism from a single model to the general case, Ly and Wagenmakers (2021) argue that the asymptotic results of Diniz et al. (2012) pose a more serious problem, while admitting that for "non-uniform priors the relation between FBST ev and p-value is only approximate" (Ly & Wagenmakers, 2021, p. 5). Diniz et al. (2012) showed, based on the Bernstein-von-Mises theorem (van der Vaart, 1998, Section 10), that for large sample sizes $n$

$$\mathrm{ev}(H_0) \approx 1 - F_m\left(F^{-1}_{m-h}(1 - p)\right) \tag{7}$$

where $p$ is the frequentist p-value, $m := \dim(\Theta)$, $h := \dim(\Theta_0)$, $F_m$ is the cumulative distribution function (c.d.f.) of the $\chi^2_m$ distribution, and $F^{-1}$ denotes the generalized inverse c.d.f. Ly and Wagenmakers (2021) argue that whenever $\dim(\Theta_0) = 0$ and $\dim(\Theta) = k$ for $k \in \mathbb{N}_0$, Eq. (7) reduces to $\mathrm{ev}(H_0) \approx 1 - F_k\left(F^{-1}_k(1 - p)\right) = p$, so that for large sample sizes $n$, the p-value and e-value become equal. Admittedly, the binomial example chosen by Ly and Wagenmakers (2021) exploits this condition, as $\dim(\Theta_0) = 0$ ($H_0: \theta = \theta_0$ is a single point which has dimension zero) and $\dim(\Theta) = 1$, which shows that at least in univariate models, the asymptotic relationship seems to hold.⁹ The simulations in Diniz et al. (2012) confirm this relationship, but they also show that even under perfectly controllable simulation settings, there remain differences: they try to model the functional relationship between p-value and e-value via a Beta c.d.f., and results show that the asymptotic relationship (not equality!) starts to hold only for $n \geq 1000$ observations in each group in some models (Diniz et al., 2012, Figure 8).
However, the more important aspect is that even when $\dim(\Theta_0) = 0$ the relationship is, in general, a purely theoretical result. The Bernstein-von-Mises theorem assumes that the sample $X_1, X_2, ...$ is generated i.i.d. from $P_{\theta_0}$ for some $\theta_0 \in \Theta$ for the dominated and parameterized statistical model $\mathcal{P} := \{P_\theta : \theta \in \Theta\}$. However, under a prior $P_\vartheta$ on $\Theta$ that is absolutely continuous with respect to $\lambda$, the probability of the parameter taking the value $\theta_0$ is zero: $P_\vartheta(\{\theta_0\}) = 0$, because $\lambda(\{\theta_0\}) = 0$. By definition of a sharp hypothesis as a submanifold¹⁰, this holds for any such hypothesis $H_0 := \Theta_0 := \{\theta_0\}$. Thus, the critical condition that $H_0$ is true, under which the result of Diniz et al. (2012) is established, has probability zero under any prior which is absolutely continuous with respect to Lebesgue measure $\lambda$. As a consequence, data is generated with probability zero from $P_{\theta_0}$, where $\theta_0$ is the value specified in $H_0 := \Theta_0 := \{\theta_0\}$.¹¹ As this is the usual prior choice in continuous models $\mathcal{P}$ in the FBST, even in univariate models, the asymptotic relationship (not equality) identified by Diniz et al. (2012) will be observed with probability zero in practice. That is, Eq. (7) holds with probability zero under this prior choice. As Cohen stressed, the null hypothesis "taken literally (and that's the only way you can take it in formal hypothesis testing) is always false. It can only be true in the bowels of a computer processor running a Monte Carlo study." (Cohen, 1990, p. 1308). Although the relationship identified by Diniz et al. (2012) is interesting from a theoretical perspective, it will only be observed in the laboratory conditions of a computer processor. Furthermore, even then the relationship does not hold whenever the assumptions of the Bernstein-von-Mises theorem are violated, in particular, in all models $\mathcal{P}$ with discrete parameters. In summary, in continuous models the relationship in Eq. (7) thus (1) holds only in univariate models and (2) occurs with probability zero outside of a Monte Carlo simulation with perfectly controllable conditions, whenever an absolutely continuous prior is assumed. Point (2) thus resolves the problem also for univariate models. In discrete models, Eq. (7) does not hold at all, even without the above arguments.
Note, however, that from a measure-theoretic point of view when testing with Bayes factors, the last criticism stays valid at least in univariate models $\mathcal{P}$: the point null value $\theta_0$ then has positive measure as specified in the Dirac component of the mixture prior, compare Eq. (4), and the Bernstein-von-Mises theorem can be applied to recover the asymptotic relationship. However, unless the mixing weight $\varrho$ of the Dirac component is chosen as $\varrho = 1$ in the mixture prior, the probability of obtaining an i.i.d. sample $X_i \sim P_{\theta_0}$ goes to zero as the sample size $n \to \infty$. Thus, even under a mixture prior the assumption of an i.i.d. sample cannot be reconciled with the Bayesian approach. However, when taking a frequentist measure-theoretic stance, $\theta$ is fixed but unknown, so it becomes meaningful again to say that the sample $X_1, X_2, ...$ is i.i.d. $\sim P_{\theta_0}$.¹²

¹¹ Note that from the Bayesian perspective, the sample is not i.i.d., but only i.i.d. conditional upon having observed a parameter value $\theta$. Unconditionally, data is distributed as the marginal distribution (the prior predictive distribution) of the data.
¹² This is also the assumption in the law of the iterated logarithm proven in Feller (1968, p. 204-205), which shows that the assertion that "the proofs on sampling to a foregone conclusion in Feller (1970) also pertain to the FBST procedure" (Ly & Wagenmakers, 2021, p. 6) does not hold, as the i.i.d. assumption $X_1, X_2, ... \sim P_{\theta_0}$ under an absolutely continuous prior $P_\vartheta$, which is the standard choice in the FBST, is not valid.

Revisiting Problem 1: Quantifying Evidence in Favour of the Null Hypothesis
Now, the first problem in Ly and Wagenmakers (2021) is revisited. It is argued there that a defect of the FBST and e-value is that it cannot quantify evidence in favour of the null hypothesis $H_0: \theta = \theta_0$. However, from the above measure-theoretic analysis it is clear that this is a mere consequence of the FBST not assigning arbitrary probability mass to $\Theta_0$ as is assigned by the mixture prior in Eq. (4). Under absolutely continuous priors, $\Theta_0$ has zero prior probability $P_\vartheta(\Theta_0)$, so it cannot have positive posterior probability $P_{\vartheta|Y}(\Theta_0)$. Thus, acceptance of $H_0 := \Theta_0$ is not possible. However, the FBST was designed as a Bayesian replacement of frequentist point null tests and thus should only be capable of rejecting a point null hypothesis. Also, the mixture prior in Eq. (4) was historically chosen in order to be able to confirm general laws (Wrinch & Jeffreys, 1921; Etz & Wagenmakers, 2015). Thus, the introduction of positive probability for the value $\theta_0$ (which in a general law referred to a boundary of the parameter space like $\theta = 1$ in a binomial experiment) was due to the fact that "... Broad used Laplace's theory of sampling, which supposes that if we have a population of n members, r of which may have a property ϕ, and we do not know r, the prior probability of any particular value of r (0 to n) is 1/(n + 1). Broad showed that (...) if we take a sample of number m and find all of them with ϕ, the posterior probability that all n are ϕ's is (m + 1)/(n + 1). A general rule would never acquire a high probability until nearly the whole of the class had been sampled. We could never be reasonably sure that apple trees would always bear apples (...). The result is preposterous, and started the work of Wrinch and myself in 1919-1923."
(Jeffreys, 1980)

While the mixture prior structure enables the statistician to confirm general laws, it is today commonly used for what could be called "arbitrary laws": For example, when testing for a difference between two groups in a clinical trial, "it will virtually never be the case that one seriously entertains the possibility that θ = θ 0 exactly (c.f. Hodges and Lehmann (1954))." (Berger, 1985, p. 148), where $\theta$ could model the effect size between the treatment and control group. Thus, whenever no general law like "all apple trees bear apples" or "all swans are white" is considered, it remains questionable to impose the mixture prior.¹³ Importantly, next to the inadequate use of the mixture prior in situations where no general law is tested, the assignment of positive mass $\varrho > 0$ to $\theta_0$ separates the prior beliefs for hypothesis testing and parameter estimation (where for the latter task an absolutely continuous prior like a normal prior often is much more reasonable than a mixture prior), while the FBST does not separate testing and estimation.¹⁴ Importantly, the FBST does not even need to be able to quantify evidence in favour of $H_0$ by measure-theoretic design. From the above, we know that $H_0$ has probability zero under a prior $P_\vartheta$ that is absolutely continuous with respect to $\lambda$. As the posterior $P_{\vartheta|Y}$ is absolutely continuous with respect to the prior $P_\vartheta$ (Schervish, 1995, Theorem 1.31), the posterior probability of $H_0$ will be zero, too. Thus, quantifying evidence in favour of $H_0$ becomes obsolete under the measure-theoretic assumptions of the FBST.¹⁵ Admittedly, one might ask what the use of testing $H_0$ is if it is known a priori to have probability zero. However, the e-value quantifies the discrepancy between observed data and $H_0$, and the FBST rejects a null hypothesis only when the data display sufficient evidence against it. The BF bypasses this issue via the assignment of probability to $H_0$, but this does not render $H_0$ more realistic in practice. The relevance of the ability of the BF to quantify evidence in favour of $H_0$ may thus be overstated, because in practice, the parameter value is most probably not exactly equal to $\theta_0$. Thus, for large enough sample size $n$, the BF will reject $H_0$, too, even if the true parameter is $\theta_0 + \varepsilon$ for a tiny $\varepsilon$ and thus the choice $H_0: \theta = \theta_0$ was very close to the true parameter $\theta = \theta_0 + \varepsilon$. Thus, from this perspective, both the FBST and BF experience a similar form of sampling to a foregone conclusion, which could be called sampling to trivial effect sizes. Admittedly, the FBST is somewhat begging the question by not assigning prior mass to $H_0: \theta = \theta_0$ and being only able to reject $H_0$. The BF is, however, similarly begging the question by assuming from the outset that $H_0$ is true with probability $\varrho > 0$ a priori, and as a consequence being able to confirm $H_0: \theta = \theta_0$.

¹³ Note also that a small interval hypothesis is clearly not more realistic than a point null hypothesis in the case that a general law is tested. For example, testing $H_0: \theta = 1$ is in sharp contrast to $\tilde{H}_0: \theta \in [0.95, 1]$ and the interpretation is radically different (e.g. now only 95-100% of swans are white when confirming $\tilde{H}_0$, compared to all swans are white when confirming $H_0$).
¹⁴ Which is again close to the frequentist paradigm, where for example confidence intervals are obtained by inverting hypothesis tests (Robert, 2007).
¹⁵ As a sidenote, the convergence rates of the Bayes factor under $H_0$ and $H_1$ differ, which is why solutions like non-local priors have been introduced to establish exponential convergence rates under both hypotheses, see Johnson and Rossell (2010). The faster convergence of the e-value under $H_1$ compared to the BF, see Kelter (2020), is a consequence of the FBST not assigning prior probability to $H_0$.

Revisiting Problem 2: Susceptibility to Sampling to a Foregone Conclusion
Now, problem 2 is revisited: Based on the asymptotic relationship to the frequentist p-value, it is argued that the FBST will sample to a foregone conclusion. Thus, by only collecting enough samples one will eventually produce an e-value $\mathrm{ev}(H_0)$ which rejects $H_0: \theta = \theta_0$. However, the measure-theoretic premises circumvent this caveat again. While under a frequentist perspective, $H_0: \theta = \theta_0$ may be true, and from a Bayes factor perspective, $\theta_0$ in $H_0: \theta = \theta_0$ has positive prior probability (compare Eq. (4)), under a prior $P_\vartheta$ that is absolutely continuous with respect to $\lambda$, as used in the FBST, $P_\vartheta(H_0) = 0$ holds. Therefore, sampling to a foregone conclusion will not occur, as data is generated as i.i.d. from $P_{\theta_0}$ with probability zero, except inside the bowels of a computer processor running a Monte Carlo simulation, to cite Cohen (1990) again. Under the mixture prior which is used for the Bayes factor, the problem can occur as $H_0$ has positive prior probability. However, assuming such a prior is simply not necessary when using the FBST, and thus the problem vanishes when choosing a prior that is absolutely continuous with respect to $\lambda$.¹⁶ Even when adopting a mixture prior, unless the mixing weight is $\varrho = 1$ (which implies the prior probability of $\theta_0$ is one, and all uncertainty vanishes), for large enough $n \in \mathbb{N}$ the probability of sampling $X_1, ..., X_n$ i.i.d. as $X_i \sim P_{\theta_0}$ goes to zero, compare Eq. (4). Thus, even under the measure-theoretic assumptions of the mixture prior which is used in the Bayes factor test, the problem vanishes.¹⁷

¹⁶ Importantly, this argument shows that the criticism of sampling to a foregone conclusion pertains to p-values, as the assumption $X_1, X_2, ... \sim P_{\theta_0}$ remains possible: No probability statements about $\theta_0$ being the true parameter can be made by frequentists, and one can assume such an i.i.d. sampling mechanism.
¹⁷ When $n$ is only moderate, convergence in the Bernstein-von-Mises theorem is questionable because it is an asymptotic result (van der Vaart, 1998). Thus, no sampling to a foregone conclusion occurs under a mixture prior even in moderately sized samples.

Revisiting Problem 3: the Principle of Predictive Irrelevance
The third problem concerns the principle of predictive irrelevance, which goes back to Jeffreys (1961). The binomial example used by Ly and Wagenmakers (2021) shows that the FBST is, in general, not predictively matched. However, although predictive matching is an appealing property, it is at best loosely related to more profound principles of statistical inference (Berger & Wolpert, 1988). Furthermore, predictive matching depends on the definition of a sample. Suppose that in the experiment, tuples $(n_1, n_2)$ are observed (e.g. simultaneous sensor measurements) instead of single measurements $n$ for each step in the sequence. Observing an uninformative tuple which consists of a success and a failure, starting from a symmetric beta posterior, will result in another symmetric beta posterior, leaving $\mathrm{ev}(H_0)$ unchanged at $\mathrm{ev}(H_0) = 1$. The fallacy is to interpret $\mathrm{ev}(H_0) = 1$ as evidence in favour of $H_0$, while it actually only implies the weakest possible evidence against $H_0$. When all parameter values have surprise values which are smaller than or equal to the surprise value under $H_0$, there is essentially no evidence against $H_0$. Thus, $\mathrm{ev}(H_0) = 1$. As the FBST aims at rejection of $H_0$ (and not confirmation of $H_0$), the concept of predictive matching is not helpful. Any modification of the posterior which causes the null value to shift away from the mode of the surprise function indicates (and should indicate) some magnitude of evidence against $H_0$. More generally, Robert (2016) argued that "predictive matching is not a well-defined concept. That both predictives take the same values for such "completely uninformative" data thus sounds more like a post-hoc justification than a way of truly calibrating the Bayes factor." (Robert, 2016, p. 3). Thus, as the FBST cannot accept $H_0$, it cannot be calibrated in this way.
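The effect of an uninformative tuple on the e-value can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the computation of Ly and Wagenmakers (2021): it assumes a flat reference function (so the surprise function is the posterior density), the null value $\theta_0 = 0.5$, a hypothetical symmetric $\mathrm{Beta}(3, 3)$ posterior as starting point, and the standard FBST convention that the e-value in favour of $H_0$ is one minus the posterior mass of the set where the surprise function exceeds its value at $\theta_0$. The midpoint-rule integration and all function names are illustrative choices.

```python
import math

def beta_pdf(theta, a, b):
    """Density of a Beta(a, b) distribution at theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta)
                    + (b - 1) * math.log(1.0 - theta))

def e_value_beta(a, b, theta_0=0.5, grid_size=20000):
    """e-value in favour of H0: theta = theta_0 for a Beta(a, b) posterior
    with flat reference function, using a midpoint rule on (0, 1)."""
    s_star = beta_pdf(theta_0, a, b)       # surprise value under H0
    width = 1.0 / grid_size
    mass_tangential = 0.0
    for i in range(grid_size):
        theta = (i + 0.5) * width
        density = beta_pdf(theta, a, b)
        if density > s_star:               # theta lies in the tangential set
            mass_tangential += density * width
    return 1.0 - mass_tangential

ev_before = e_value_beta(3, 3)   # hypothetical symmetric posterior
ev_tuple = e_value_beta(4, 4)    # after a tuple (one success, one failure)
ev_single = e_value_beta(4, 3)   # after a single success only

print(ev_before, ev_tuple, ev_single)
```

Because the symmetric posteriors have their mode at $\theta_0 = 0.5$, the tangential set is empty before and after the tuple, whereas a single success moves the mode away from $\theta_0$ and yields an e-value below one.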
In fact, predictive matching is important whenever one commits the "all too common sin of assuming that θ can be assigned the same prior distribution, π(θ), under H 0 and H 1 ." (Berger & Guglielmi, 2001, p. 177). A partial justification of such an assumption is thus given by predictive matching, see Berger and Guglielmi (2001, p. 177), as then a completely uninformative "sample of size m should always yield a Bayes factor of 1 (implying that the two models are equally supported by the data)" (Berger & Guglielmi, 2001, p. 177), which justifies the assignment of identical priors under $H_0$ and $H_1$.¹⁸ Thus, predictive matching post-hoc justifies the selection of identical priors (for nuisance parameters) under $H_0$ and $H_1$, but in the FBST, we do not need separate priors for nuisance parameters because there is only a single prior and no separate priors under $H_0$ and $H_1$ exist. Thus, there is no need for predictive matching after all. Ultimately, predictive matching is an important concept for the BF, but a controversial one, as the BF itself is not always predictively matched, see Gronau et al. (2019), Wang and Liu (2016), and Berger and Guglielmi (2001).

¹⁸ Compare also the intrinsic Bayes factor approach of Berger and Pericchi (1996).

Revisiting Problem 4: the Jeffreys-Lindley Paradox
Problem four is formulated as the FBST avoiding the Jeffreys-Lindley paradox. It is argued that "the FBST ev is based on an assessment of the posterior distribution, and therefore, lacks the Bayesian correction for cherrypicking" (Ly & Wagenmakers, 2021, p. 11), where the correction is the prior distribution which prevents the selection of parameter values that the data happen to support. However, the implicit premise is that a flat reference function $r(\theta) = 1$ is used (which is equivalent to an improper prior), so that the surprise function reduces to the posterior density. Whenever a proper absolutely continuous prior (e.g. normal prior, Cauchy prior) is used for $r(\theta)$, the correction of the prior applies, as it prevents the inclusion of parameter values inside the tangential set that the data happen to support. The assumption is similar to the flat prior assumption made in problem 1 for the binomial test, and it shows that improper priors can be problematic for the FBST. However, the ultimate question is what can be learned from the Jeffreys-Lindley paradox, and as argued by Robert (2014), "divergences between different statistical theories of inference and their numerical conclusions are to be expected", and even more importantly, the Jeffreys-Lindley paradox "points at the poor (and even unacceptable) behaviour of improper prior distributions when testing point-null hypotheses" (Robert, 2014, p. 2).
In fact, the Jeffreys-Lindley paradox can be interpreted as the consequence of the failing approximation of a small interval hypothesis through a precise null hypothesis.
When replacing the precise hypothesis H 0 : θ = θ 0 with a small interval hypothesis H 0 : θ ∈ (θ 0 − b, θ 0 + b) for b > 0, the Jeffreys-Lindley paradox does not occur because the approximation is rendered unrealistic before the paradox sets in. For illustration purposes, an example of Berger (1985) is used: Suppose a sample X 1 , ..., X n is observed from a N (θ, σ 2 ) distribution with known σ 2 . The observed likelihood function is then proportional to a N (x̄, σ 2 /n) density for θ, and given that we really should be testing H 0 : θ ∈ (θ 0 − b, θ 0 + b), we need to know when it is suitable to approximate H 0 by H 0 : θ = θ 0 . The "only sensible answer to this question is - the approximation is reasonable if the posterior probabilities of H 0 are nearly equal in the two situations." (Berger, 1985, p. 149), and this happens when the observed likelihood function is approximately constant on (θ 0 − b, θ 0 + b), because then the posterior probabilities will be nearly equal when b is small enough. Berger (1985) showed that the likelihood function varies by no more than 5% on (θ 0 − b, θ 0 + b) if
\[ b \le 0.024 \, z^{-1} \, \frac{\sigma}{\sqrt{n}}, \tag{8} \]
where z = √n |x̄ − θ 0 |/σ is the classical test statistic for a Gauß test. Turning to the Jeffreys-Lindley paradox now, when we assign a N (μ, τ 2 ) density under H 1 : θ ≠ θ 0 to θ, and set μ := θ 0 , the posterior probability of H 0 : θ = θ 0 can be computed as
\[ P_{\vartheta|Y}(H_0) = \left[ 1 + \frac{1-\pi_0}{\pi_0} \cdot \frac{\exp\left\{ \tfrac{1}{2} z^2 \left[ 1 + \sigma^2/(n\tau^2) \right]^{-1} \right\}}{\left( 1 + n\tau^2/\sigma^2 \right)^{1/2}} \right]^{-1}, \tag{9} \]
where π 0 denotes the prior probability assigned to H 0 (in the notation of Berger); for details see Berger (1985, p. 150). Now, suppose a fixed z is observed, which for example for z = 1.960 corresponds to a two-sided p-value of p = 0.05. Berger (1985) provides the posterior probabilities (9) for varying sample sizes n and these start at P ϑ|Y (H 0 ) = 0.35 for n = 1 and then grow steadily to P ϑ|Y (H 0 ) = 0.80 for n = 1000 (Berger, 1985, p. 151, Table 4.2). In fact, "this phenomenon that α 0 → 1 as n → ∞ and the P-value is held fixed, actually holds for virtually any fixed prior and point null testing problem." (Berger, 1985, p. 156), where α 0 equals the posterior probability P ϑ|Y (H 0 ) of H 0 : θ = θ 0 .
As a consequence of P ϑ|Y (H 0 ) → 1 for fixed z (or p-value) under n → ∞, it follows that BF 01 → ∞. Now, the reason why the Jeffreys-Lindley paradox occurs (that is, p = 0.05 seems to reject H 0 while BF 01 → ∞ for n → ∞) is that the assumption of a precise null hypothesis simply is not feasible for even moderately large n: "This large n phenomenon provides an extreme illustration of the conflict between classical and Bayesian testing of a point null. One could classically reject H 0 with a P-value of 10^{-10}, yet, if n were large enough, the posterior probability of H 0 would be very close to 1. This surprising result has been called "Jeffreys' paradox" and "Lindley's paradox," ... We will not discuss this "paradox" here because the point null approximation is rarely justifiable for very large n." (Berger, 1985, p. 156). Thus, when n is small, the Jeffreys-Lindley paradox does not occur for a fixed z (or p-value). When we let n → ∞, the Jeffreys-Lindley paradox sets in as P ϑ|Y (H 0 ) → 1 while the p-value remains constant, but then the validity of our approximation of the more realistic interval hypothesis (θ 0 − b, θ 0 + b) by H 0 : θ = θ 0 breaks down. In fact, it breaks down even for moderate sample sizes. For the situation n = 1 and z = 1.960 under σ = 1, we arrive at the posterior probability P ϑ|Y (H 0 ) = 0.35, which is not in conflict with p = 0.05, but only provides somewhat weaker evidence against H 0 than suggested by the p-value. Increasing the sample size to n = 50, the paradox sets in as then P ϑ|Y (H 0 ) = 0.52, so that the BF favours H 0 (assuming prior weights of 0.5 for both H 0 and H 1 ) while p = 0.05 indicates evidence against H 0 : θ = θ 0 . However, for n = 50, we have b ≤ 0.0017 based on Eq. (8), and we would have to accept the tiny interval hypothesis (θ 0 − 0.0017, θ 0 + 0.0017) as the hypothesis we actually want to test for H 0 : θ = θ 0 to be a reasonable approximation.
Thus, even for moderate sample sizes like n = 50, the innocuous-looking approach of approximating a realistic interval hypothesis via a precise hypothesis can become untrustworthy unless one has extremely good reasons to assume a tiny interval hypothesis. In the majority of research, we will not be willing to accept such a narrow interval hypothesis. Thus, as the point null approximation is rarely justifiable for large n, it matters little that the Jeffreys-Lindley paradox is observed. In these situations, we will already hesitate to trust the test of a precise hypothesis because the approximation has become unreliable.
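The numbers in this argument can be retraced with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the posterior probability follows the standard conjugate normal calculation with prior mean μ = θ 0 , and the defaults (prior weight 0.5 on H 0 , τ = σ = 1) are assumptions chosen because they reproduce the values 0.35 and 0.52 quoted above. The factor 0.024 is the constant implied by the bound b ≤ 0.0017 for n = 50.

```python
import math


def posterior_prob_point_null(z, n, sigma=1.0, tau=1.0, prior_h0=0.5):
    """Posterior probability of H0: theta = theta0 for a fixed z statistic.

    Assumes X_i ~ N(theta, sigma^2) with known sigma, prior mass prior_h0
    on theta0, and a N(theta0, tau^2) prior for theta under H1.
    """
    shrink = 1.0 / (1.0 + sigma**2 / (n * tau**2))
    # Bayes factor of H1 against H0, expressed through z and n
    bf10 = math.exp(0.5 * z**2 * shrink) / math.sqrt(1.0 + n * tau**2 / sigma**2)
    prior_odds_h1 = (1.0 - prior_h0) / prior_h0
    return 1.0 / (1.0 + prior_odds_h1 * bf10)


def max_half_width(z, n, sigma=1.0, factor=0.024):
    """Largest b for which the likelihood varies by at most about 5%
    on (theta0 - b, theta0 + b); factor = 0.024 is assumed from the text."""
    return factor * sigma / (z * math.sqrt(n))


if __name__ == "__main__":
    z = 1.960  # fixed test statistic, two-sided p-value of about 0.05
    for n in (1, 10, 50, 100, 1000):
        p_h0 = posterior_prob_point_null(z, n)
        b = max_half_width(z, n)
        print(f"n={n:5d}  P(H0 | y)={p_h0:.2f}  admissible b <= {b:.4f}")
```

Holding z fixed while n grows, the posterior probability of the point null rises while the admissible half-width b shrinks at rate 1/√n, which is the tension described above.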
The FBST does not suffer from the paradox simply because it does not make use of an approximation. Whereas the positive prior probability in the mixture prior used for a Bayes factor test is conceptualized as the prior probability assigned to the more realistic interval hypothesis H 0 : θ ∈ (θ 0 − b, θ 0 + b), the FBST does not assign mass to H 0 : θ = θ 0 . However, the more realistic interval hypothesis H 0 : θ ∈ (θ 0 − b, θ 0 + b), of course, has positive prior probability under the absolutely continuous prior P ϑ (with respect to the Lebesgue measure λ) which is employed in the FBST. Thus, no paradox occurs. 19 Admittedly, the FBST cannot make it easier for an a priori highly improbable hypothesis to be rejected. In contrast, the Bayes factor approach allows one to adjust the prior probability of H 0 accordingly to incorporate such a priori scepticism. 20 However, while the FBST does not allow for such a straightforward incorporation of lower prior probability, modifying the prior distribution P ϑ which is used for the FBST allows for such a Bayesian calibration. In fact, in a variety of situations the modification of the prior probability (e.g. assigning the bulk of the mass to values inside [0.4, 0.6] for the parameter θ of a newborn being a boy) is straightforward, and under the FBST this prior perspective holds both when opting for Bayesian parameter estimation and for hypothesis testing. Also, the modification of the reference function r(θ) allows one to incorporate such a priori beliefs in a second step.
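As a hedged illustration of this behaviour, the following Python sketch computes the e-value in support of H 0 : θ = θ 0 in the normal model with known σ. The setup is an assumption made for the example, not a prescription: an absolutely continuous N (θ 0 , τ 2 ) prior, a flat reference function r(θ) = 1 (so that the surprise function equals the posterior density), and hypothetical function names. Under these assumptions the posterior is normal, the tangential set consists of all θ closer to the posterior mean than θ 0 , and the e-value is a two-sided posterior tail probability.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def fbst_e_value(z, n, sigma=1.0, tau=1.0):
    """e-value in support of H0: theta = theta0 for a fixed z statistic.

    Assumes a N(theta0, tau^2) prior, known sigma, and a flat reference
    function, so the surprise function is the posterior density.
    """
    # Posterior is N(m, s^2); the distance |m - theta0| / s simplifies to
    # z / sqrt(1 + sigma^2 / (n * tau^2)) in this conjugate model.
    standardized_distance = z / math.sqrt(1.0 + sigma**2 / (n * tau**2))
    # Tangential set: all theta with higher posterior density than theta0,
    # i.e. |theta - m| < |theta0 - m|; its posterior mass is the evidence
    # against H0.
    evidence_against = 2.0 * normal_cdf(standardized_distance) - 1.0
    return 1.0 - evidence_against


if __name__ == "__main__":
    for n in (1, 10, 50, 100, 1000, 10**6):
        print(f"n={n:8d}  e-value in support of H0: {fbst_e_value(1.960, n):.4f}")
```

For fixed z, the e-value in this sketch decreases towards the two-sided p-value as n grows and never approaches 1, in contrast to the posterior probability of a point null under a mixture prior.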

Conclusion: the Validity of Precise Hypotheses
In this paper, it was shown that the problems identified by Ly and Wagenmakers (2021) are mostly consequences of the measure-theoretic premise of assigning positive probability to the point null value of a precise hypothesis H 0 : θ = θ 0 . Thus, they hold only under a perspective which is required to apply a Bayes factor test, but not under the absolutely continuous priors available for use with the FBST. An important lesson from the analysis of Ly and Wagenmakers (2021) is that absolutely continuous priors could be preferred in the FBST to avoid these problems. Whenever a prior is chosen which assigns positive probability mass to the null value, sampling to a foregone conclusion as well as the asymptotic relationship to the p-value could hold. However, as discussed in this paper, due to the requirements of the Bernstein-von-Mises theorem and unless the prior mass on the null value equals 1, they will not hold in practice. Also, improper priors are questionable when the Jeffreys-Lindley paradox and predictive matching are considered.
Importantly, the mathematical arguments brought forward by Ly and Wagenmakers (2021) and in this paper hold depending on which prior beliefs are held, and not depending on whether one wants to be able to only reject or to both reject and confirm a statistical hypothesis. For example, we could hold the prior belief that data are normally distributed as N (μ, σ 2 ), but not be willing to assign positive mass to a point null value θ 0 . Thus, we cannot employ a Bayes factor, and we need to use the FBST. Alternatively, we could have prior beliefs which are reflected by a mixture prior that assigns positive mass to the theoretically interesting value θ 0 as proposed by Ly and Wagenmakers (2021). 21 Then, the more natural choice is the Bayes factor. Thus, our prior beliefs determine whether we can only reject or both reject and confirm a hypothesis. In contrast, our desire to be able to either only reject, or both reject and confirm a hypothesis should not determine our prior beliefs. 22 Importantly, the deficits which are brought forward by Ly and Wagenmakers hold under the latter prior beliefs where we use a mixture prior, and as shown in this paper no mixture prior is needed when employing the FBST.
20 Three examples illustrate this scepticism, see Savage (1961). A musician states he can tell whether a sheet of music is from Mozart or Haydn and succeeds in ten out of ten cases. A lady drinking tea states she can tell whether the tea or the milk was filled in the cup first and she succeeds in ten out of ten cases. A drunken friend states at 3 p.m. that he can tell the result of a flipped coin and succeeds in ten out of ten cases. In the last case, the prior probability of H 0 : θ = 1 will be smaller than in cases one and two for most people.
21 In any case, we should then interpret the assignment of positive mass to H 0 : θ = θ 0 as a proxy for the assignment of positive mass to H 0 : θ ∈ (θ 0 − b, θ 0 + b) and check the quality of this approximation.
22 Note that for any small interval hypothesis H 0 : θ ∈ (θ 0 − b, θ 0 + b), the prior probability is implied by the choice of prior distribution P ϑ and the width b when P ϑ is absolutely continuous with respect to λ, and we do not need to explicitly assign a prior probability to H 0 .
However, the discussion puts the magnifying glass onto a more important problem: are point null hypotheses realistic for scientific research? Both the BF and the FBST test a precise hypothesis, and criticisms of the appropriateness of such hypotheses date back at least to Good (1950). The validity of point null hypotheses was seriously challenged for the first time by Hodges and Lehmann (1954), and the more realistic test of small interval hypotheses was consequently termed the Hodges-Lehmann paradigm. Good (1992, 1993, 1994), Anderson and Hauck (1983), Berger and Delampady (1987), and Rao and Lovric (2016) challenged the appropriateness of precise hypotheses, and Rousseau (2007) showed that the approximation of interval BFs (see Morey and Rouder, 2011) through precise BFs holds only for very small intervals and is unreliable for large sample size n; see also Berger (1985), Bernardo (1999), and Sellke et al. (2001, p. 64). A shift towards the Hodges-Lehmann paradigm may improve the reliability of research while simultaneously removing most of the measure-theoretic bypasses which become necessary for precise hypothesis testing. For the BF, this bypass is the assignment of arbitrary probability mass to a Lebesgue-nullset, which can be seen as the price which is paid to be able to confirm H 0 . For the FBST, the bypass is the resulting inability to confirm H 0 because no such assumption is made. An important open problem is to clarify the "vexing issue of the relevance of point null hypotheses" (Robert, 2016, p. 5), which is achieved neither by the BF nor by the FBST. I agree with Ly and Wagenmakers (2021) that whenever there are strong a priori grounds to believe in the point null value (e.g. when testing a general law), testing a point null hypothesis is reasonable. Also, testing a point null hypothesis is reasonable when interpreted as an approximation to a small interval hypothesis. However, whenever there is suspicion about the validity of such an approximation, the FBST may be an attractive alternative because it does not assign positive prior mass to the null value θ 0 . Thus, checking the approximation, which conceptualizes the prior mass on the point null value as the mass which should actually be assigned to the more realistic small interval hypothesis H 0 : θ ∈ (θ 0 − b, θ 0 + b), and which fails for n → ∞, becomes obsolete when using the FBST (which was one reason for the Jeffreys-Lindley paradox not to occur under the FBST as shown above).
In fact, a shift to the Hodges-Lehmann paradigm is mostly a sociological contribution to statistical science.
When conducting a precise hypothesis test, the statistician (whether frequentist or Bayesian) in the majority of cases must check the quality of the approximation of the more realistic interval hypothesis by the point null hypothesis. However, "it is usually easier for a Bayesian to deal directly with the interval hypothesis than to check the adequacy of the approximation." (Berger, 1985, p. 149). More importantly, "there are many (...) problems that would lead to a hypothesis of the above interval form with large b, but such problems will rarely be well approximated by testing a point null." (Berger, 1985, p. 149), where b is the half-width of the interval hypothesis around the point null value θ 0 ∈ Θ. Thus, shifting to a Hodges-Lehmann test of an interval hypothesis requires eliciting the boundaries of the interval hypothesis first and performing the hypothesis test in a second step. Importantly, performing a test without checking the quality of the approximation becomes impossible.
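The two steps can be sketched in Python for the normal model with known σ. This is a minimal illustration under assumptions that are not part of the analysis above: a single absolutely continuous N (prior_mean, prior_sd 2 ) prior for θ, hypothetical function names, and an interval half-width b supplied by the analyst. The prior probability of the interval hypothesis is then implied by the prior and b, and the posterior probability follows from the conjugate update.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Distribution function of a N(mean, sd^2) random variable."""
    return 0.5 * math.erfc(-(x - mean) / (sd * math.sqrt(2.0)))


def interval_probabilities(xbar, n, theta0, b, sigma=1.0,
                           prior_mean=0.0, prior_sd=1.0):
    """Prior and posterior probability of H0: theta in (theta0 - b, theta0 + b).

    Step 1 (elicitation) is represented by the arguments theta0 and b.
    Step 2 evaluates the interval under a single continuous normal prior
    and the resulting normal posterior (known sigma assumed).
    """
    lower, upper = theta0 - b, theta0 + b
    prior_prob = (normal_cdf(upper, prior_mean, prior_sd)
                  - normal_cdf(lower, prior_mean, prior_sd))
    # Conjugate normal update for the mean with known variance
    post_var = 1.0 / (n / sigma**2 + 1.0 / prior_sd**2)
    post_mean = post_var * (n * xbar / sigma**2 + prior_mean / prior_sd**2)
    post_sd = math.sqrt(post_var)
    post_prob = (normal_cdf(upper, post_mean, post_sd)
                 - normal_cdf(lower, post_mean, post_sd))
    return prior_prob, post_prob


if __name__ == "__main__":
    # Hypothetical data summaries: one sample mean near theta0, one far away
    for xbar in (0.0, 1.0):
        prior_p, post_p = interval_probabilities(xbar, n=100, theta0=0.0, b=0.1)
        print(f"xbar={xbar:.1f}  prior P(H0)={prior_p:.4f}  posterior P(H0)={post_p:.4f}")
```

Because the interval has positive probability under the continuous prior, the same posterior can both support and contradict the interval hypothesis, and no separate prior mass has to be placed on the point θ 0 .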
As Allan Birnbaum stressed in Savage et al. (1962, p. 322), "each scientist and interpreter of experimental results bears ultimate responsibility for his own concepts of evidence and his own interpretation of results.", and practitioners need to decide for themselves which method to choose. However, I suspect that the BF and the FBST provide similar conclusions in the majority of cases, and a shift towards the Hodges-Lehmann paradigm would be beneficial in a variety of situations.