Abstract
Testing a precise hypothesis can lead to substantially different results in the frequentist and Bayesian approaches, a situation which is highlighted by the Jeffreys-Lindley paradox. While there exist various explanations why the paradox occurs, this article extends prior work by placing the less well-studied point-null-zero-probability paradox at the center of the analysis. The relationship between the two paradoxes is analyzed based on accepting or rejecting the existence of precise hypotheses. The perspective provided in this paper aims at demonstrating how the Bayesian and frequentist solutions can be reconciled when paying attention to the assumption of the point-null-zero-probability paradox. As a result, the Jeffreys-Lindley paradox can be reinterpreted as a Bayes-frequentist compromise. The resolution shows that divergences between Bayesian and frequentist modes of inference stem from (a) accepting the existence of a precise hypothesis or not, (b) the assignment of positive measure to a null set and (c) the use of unstandardized p-values or p-values standardized to tail-area probabilities.
1 Introduction
Testing a point null hypothesis is a highly controversial topic in statistical science and of general interest to a broad range of domains. Although various authors have embraced a compromise between Bayesian and frequentist modes of statistical inference (Berger et al. 1997, 1994; Good 1981, 1992), the divergence between Bayesian and frequentist solutions to the test of a parametric point null hypothesis \(H_0:\theta =\theta _0\) versus its alternative \(H_1:\theta \ne \theta _0\) for some \(\theta _0 \in \Theta \) has led to heated debates about the right mode of inference in the past decades. These differences manifest themselves prominently in the Jeffreys-Lindley paradox, and the latter is – as a consequence – still often seen as an obstacle for what was called a Bayes-non-Bayes compromise by Good (1992). The Jeffreys-Lindley paradox was first mentioned by Jeffreys (1939), Good (1950) and Lindley (1957) and was extensively discussed in the statistical literature, see for example Good (1985), Berger (1985) or Naaman (2016). The paradox also attracted attention in the philosophy of science literature, and resolutions were presented by Spanos (2013), Sprenger (2013) and Robert (2014). Importantly, the paradox is sometimes used as an argument for or against a specific statistical method (depending on whether the procedure suffers from the paradox occurring or not) (Kelter, 2021c; Ly and Wagenmakers, 2021). Resolutions to the paradox range from not being impressed by the divergences at all (Spanos, 2013) (because the Bayesian and frequentist approaches to statistical inference make considerably differing assumptions), over attributing the divergences largely to the poor performance of improper priors for testing a point null hypothesis (Robert, 2014), to shifting the focus to different statistical techniques (Sprenger, 2013; Naaman, 2016; Kelter, 2021c).
In this paper, the following questions are considered:

1.
Why does Lindley’s paradox occur from a mathematical perspective?

2.
Why does Lindley’s paradox occur from an extra-mathematical perspective? That is, which arguments can be seen as causal for the occurrence of the paradox that do not derive their force from probability theory?

3.
What are its implications for a methodological debate between Bayesian and frequentist modes of statistical inference in hypothesis testing?
The plan of the paper is as follows: First, Section 2 details the setting of the Jeffreys-Lindley paradox and sets the stage for this article. Then, the point-null-zero-probability paradox is detailed in Section 3. The first question leads to the measure-theoretic assumptions made in Bayesian and frequentist inference. Based on the point-null-zero-probability paradox, the different measure-theoretic bases of the frequentist and Bayesian approaches – in which tests of a precise null hypothesis are framed – are shown to form the foundation for the Jeffreys-Lindley paradox to manifest itself.
The second question then deals with whether and when the test of a precise null hypothesis is appropriate. It is analyzed in Section 4 based on the point-null-zero-probability paradox, and the frequentist and Bayesian perspectives on it are treated separately. It is shown that the occurrence of Lindley’s paradox is closely tied to the validity of a precise hypothesis in scientific research and to whether the null hypothesis is assumed to be precise or not. This relates the Jeffreys-Lindley paradox directly to the point-null-zero-probability paradox. Also, it provides a broader perspective on the purely mathematical arguments which answer the first question.
Based on this analysis it is proven in Section 4 that the paradox resolves when shifting to the appropriate frame of inference, both under the perspective that precise hypotheses exist – Section 4.1 – and under the perspective that precise hypotheses are always false – Section 4.2.
Regarding the third question, Section 4 also shows that, next to the form of the statistical hypotheses under consideration, a major reason for the Jeffreys-Lindley paradox is that p-values are tail-area probabilities that are not standardized to the sample size. This latter fact provides a realignment of the Bayesian and frequentist solutions to the test of a precise null hypothesis and has to date been largely ignored in the literature. The key results of Section 4 are summarized in Section 5, which then details the relationship between both paradoxes.
Section 6 provides a conclusion and shows that the Jeffreys-Lindley paradox should not – as is often the case – be described as a phenomenon separating Bayesian and frequentist inference, but could rather be taken as a unifying fact which emphasizes the necessary shift towards what Rao and Lovric (2016) called a 21st-century perspective on statistical hypothesis testing when the validity of a precise hypothesis is questioned. The latter holds in particular in the biomedical and social sciences.
2 The Jeffreys-Lindley paradox
For illustration purposes and to set the stage for this article, a simple example from Berger (1985) is revisited. Suppose a sample \(X:=(X_1,...,X_n)\) from a \(\mathcal {N}(\theta ,\sigma ^2)\) distribution is taken with \(\sigma ^2>0\) assumed to be known for simplicity. Reduction to the sufficient statistic^{Footnote 1}\(\bar{X}\) then yields a likelihood function of \(\bar{X}\) which is proportional to a \(\mathcal {N}(\bar{x},\frac{\sigma ^2}{n})\) probability density for \(\theta \):
\(f(\bar{x}\mid \theta ) \propto \frac{1}{\sqrt{2\pi \sigma ^2/n}}\exp \left[ -\frac{(\theta -\bar{x})^2}{2\sigma ^2/n}\right] \)
Assuming a \(\mathcal {N}(\mu ,\tau ^2)\) prior on \(\theta \) under the alternative, testing \(H_0:\theta =\theta _0\) versus \(H_1:\theta \ne \theta _0\) yields that the marginal likelihood under \(H_1\) is a \(\mathcal {N}(\mu ,\tau ^2+\sigma ^2/n)\) density (Berger, 1985, p. 127-128). Assuming \(\mu =\theta _0\) for the prior on \(\theta \ne \theta _0\) (which is reasonable since values close to \(\theta _0\) would often be assumed more likely a priori than values far away from \(\theta _0\)), the posterior probability \(P(\theta _0 \mid x)\) is given as
\(P(\theta _0 \mid x)=\left( 1+\frac{P(H_1)}{P(H_0)}\cdot \frac{m_1(x)}{m_0(x)}\right) ^{-1}\)
which for \(m_0(x)=\frac{1}{\sqrt{2\pi \sigma ^2/n}}\exp \left[ -\frac{(\bar{x}-\theta _0)^2}{2\sigma ^2/n}\right] \) and \(m_1(x)=\frac{1}{\sqrt{2\pi (\tau ^2+\sigma ^2/n)}}\exp \left[ -\frac{(\bar{x}-\theta _0)^2}{2[\tau ^2+\sigma ^2/n]}\right] \) reduces to
\(P(\theta _0\mid x)=\left( 1+\frac{P(H_1)}{P(H_0)}\cdot \left( 1+\frac{n\tau ^2}{\sigma ^2}\right) ^{-1/2}\exp \left[ \frac{z^2}{2(1+\sigma ^2/(n\tau ^2))}\right] \right) ^{-1} \qquad (4)\)
where \(z=\sqrt{n}\mid \bar{x}-\theta _0\mid /\sigma \) is the usual statistic for testing \(H_0:\theta =\theta _0\) in a two-sided Gauss test, for details see (Berger, 1985, p. 151). Under \(H_0\), \(Z\) is standard normally distributed, \(Z \sim \mathcal {N}(0,1)\) (exactly so, since the model itself is Gaussian). To illustrate the Jeffreys-Lindley paradox, note that
\(\exp \left[ \frac{z^2}{2(1+\sigma ^2/(n\tau ^2))}\right] \le \exp \left[ \frac{z^2}{2}\right] \quad \text {for all } n\in \mathbb {N},\)
so for fixed and equal prior probabilities \(P(H_0)=P(\theta =\theta _0)=\frac{1}{2}\), \(P(H_1)=P(\theta \ne \theta _0)=\frac{1}{2}\) (which expresses that \(H_0\) and \(H_1\) are equally probable a priori), hyperparameters set to \(\mu =\theta _0\) and \(\tau =\sigma \), and any fixed z statistic, a lower bound for the posterior probability \(P(H_0\mid x)=P(\theta =\theta _0\mid x)\) can be obtained from Eq. 4. Now, for fixed \(z=1.96\), which corresponds to a p-value of \(p=P(\mid z \mid>1.96 \mid H_0)=P(\mid z \mid > 1.96 \mid \theta =\theta _0)=0.05\), a frequentist hypothesis test would reject \(H_0\) based on the test level \(\alpha =0.05\).
However, from Eq. 4 it follows that for fixed \(z=1.96\), a Bayesian will arrive at different posterior probabilities for varying values of the sample size n as shown in Table 1.
For example, fixing \(z=1.96\) yields \(P(\theta =\theta _0\mid x)=0.35\) for \(n=1\), which shows that about two thirds of the posterior probability speak against \(H_0\), while for \(n=1000\), this fraction has shrunk to only a fifth. In the latter case, a Bayesian will – given his prior choice – readily state that the posterior probability of \(H_0\) is \(80\%\), which is in clear contrast to the rejection of \(H_0\) by the frequentist.^{Footnote 2}
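The behavior in Table 1 can be reproduced with a short numerical sketch, assuming the conjugate-normal posterior probability formula of Berger (1985) with equal prior odds, \(\mu =\theta _0\) and \(\tau =\sigma \); the function name is illustrative:

```python
import math

def posterior_h0(z, n):
    """P(H0 | x) for the two-sided Gauss test with mu = theta_0, tau = sigma
    and equal prior probabilities P(H0) = P(H1) = 1/2 (Berger, 1985)."""
    # Posterior odds against H0:
    # (1 + n tau^2/sigma^2)^(-1/2) * exp(z^2 / (2 (1 + sigma^2/(n tau^2))))
    odds_against = (1 + n) ** -0.5 * math.exp(z ** 2 / (2 * (1 + 1 / n)))
    return 1 / (1 + odds_against)

# Fixed z = 1.96 (p = 0.05): the posterior probability of H0 increases with n,
# from about 0.35 at n = 1 to above 0.8 at n = 1000.
for n in [1, 10, 100, 1000]:
    print(n, round(posterior_h0(z=1.96, n=n), 2))
```

The frequentist rejects \(H_0\) at \(\alpha =0.05\) for every row, while the posterior probability of \(H_0\) climbs above \(0.8\): exactly the divergence the paradox describes.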
The divergence between both approaches remains when using Bayes factors instead of posterior probabilities, or when selecting test levels \(\alpha \) other than \(\alpha =0.05\) for the p-value. In total, this leads to a salient divergence between the frequentist and Bayesian solutions to the test of a precise null hypothesis.^{Footnote 3} In sum, the Jeffreys-Lindley paradox can be distilled into the following form, which is attributed to (Lindley 1957, p. 187):
Jeffreys-Lindley Paradox: Let \(\mathcal {N}(\theta ,\sigma ^2)\) be a Gaussian statistical model with \(\sigma ^2>0\) known and consider the test of \(H_0:\theta =\theta _0\) versus \(H_1:\theta \ne \theta _0\) under prior probability \(P(H_0)>0\) and any proper prior distribution on \(H_1\). Then, for any choice of test level \(\alpha \in [0,1]\) there exists a sample size \(N(\alpha )\) and an independent and identically distributed sample x for which the sample mean \(\bar{x}\) is significantly different from \(\theta _0\) at level \(\alpha \) and the posterior probability \(P(H_0\mid x)\ge 1-\alpha \).
For details, see also Sprenger (2013, p. 734-735), who provides an example of Lindley’s paradox in the context of extrasensory perception (ESP), and Robert (2014, p. 217).
3 The point-null-zero-probability paradox
An old criticism of statistical hypothesis testing includes the “relevance of point null hypotheses” (Robert, 2016, p. 5). Criticisms that point null hypotheses are usually unrealistic evolved as the result of a still ongoing debate among statisticians and philosophers of science alike over nearly the last century (Buchanan-Wollaston, 1935). For example, Good (1950, p. 90) argued that when testing the fairness of a die “From one point of view it is unnecessary to look at the statistics since it is obvious that no die could be absolutely symmetrical.” In a footnote, he added: “It would be no contradiction (...) to say that the hypothesis that the die is absolutely symmetrical is almost impossible. In fact, this hypothesis is an idealised proposition rather than an empirical one.” (Good, 1950, p. 90). Similar arguments were brought forward by Savage (1954, p. 332-333), who stressed that “null hypotheses of no difference are usually known to be false before the data are collected” and “their rejection ... is not a contribution to science”.^{Footnote 4}
On the other hand, Good (1994, p. 241) noted that there is at least one example of a precise hypothesis, which states that there is no extrasensory perception. Also, in physics or chemistry the null hypothesis may correspond to a general law or natural constant (Jeffreys, 1939), which may not be measurable with absolute precision but still may have the form of a precise hypothesis.
Although in the majority of cases in the medical and social sciences the assumption of a precise hypothesis is questionable, one approach to save precise hypothesis testing was pursued by Berger and Sellke (1987), who showed that for reasonably small interval hypotheses, point null hypotheses are at least useful approximations (Berger and Delampady, 1987, Theorem 2). Thus, the prior probability \(P(H_0)\) which is assigned to the null value \(\theta _0\) in \(H_0:\theta =\theta _0\) should actually be interpreted as the probability mass allocated to a small area around \(\theta _0\). Good (1994) argued similarly that the precise null hypothesis is “often a good enough approximation” (Good, 1994, p. 241). However, Bernardo (1999) showed that the quality of the approximation decreases for growing sample size, and Rousseau (2007) showed that for large sample sizes the Bayes factor for a point null hypothesis is no longer a reasonable approximation of the Bayes factor for an interval hypothesis, unless the interval sizes are extremely small. Below it will be shown that this is one reason for the Jeffreys-Lindley paradox to occur.
Based on the broad consensus that precise hypotheses are seldom realistic for scientific research, Rao and Lovric (2016) posed the question why statisticians and nonstatisticians keep testing point null hypotheses, “when it is known in advance they are almost never exactly true in the real world” (Rao and Lovric, 2016, p. 6).
They called the following result the (point-null) zero-probability paradox, which shows that the probability of a point null hypothesis \(H_0:\theta =\theta _0\) about the mean of a normally distributed population is zero, where \(H_1:\theta \ne \theta _0\) is the alternative, and \(\Theta =\Theta _{\mathbb {Q}}\cup \Theta _{\mathbb {R}\setminus \mathbb {Q}}\) is the parameter space, \(\mathbb {Q}\) are the rational numbers and thus \(\Theta \) is the disjoint union of \(\Theta _{\mathbb {Q}}\) and \(\Theta _{\mathbb {R}\setminus \mathbb {Q}}\).
Theorem 1
(Rao-Lovric) The probability of the null hypothesis \(H_0:\theta =\theta _0\) (about the mean of a normal population \(\mathcal {N}(\theta ,\sigma ^2)\)) is equal to zero, that is, \(P(\{H_0 \mid \theta _0 \in \mathbb {Q}\})=0\).
Based on their result, Rao and Lovric (2016) make the strong (and debatable) statement that this “unequivocally amounts to the deduction that any single-point null hypothesis about the normal mean has also probability zero.” (Rao and Lovric, 2016, p. 10). Although the set of hypotheses formally includes only rational values, the fact that the latter are dense within the real numbers ensures that any arbitrarily fine measurement precision can be captured by rational numbers. Thus, the result generalizes to all practically measurable values \(\theta _0\). Further discussions of the result are provided in Sawilowsky (2016) and Zumbo and Kroc (2016).
Rao and Lovric (2016) conclude that the term testing is thus a misnomer and should be replaced by the term inexactification, referring to the earlier proposal of Good (1993). Their recommendation is to shift towards what they call the Hodges-Lehmann paradigm, based on the seminal work of Hodges and Lehmann (1954), who proposed to replace the test of a point null hypothesis with the test of the hypotheses
\(H_0:\mid \theta -\theta _0\mid \le \delta \quad \text {versus}\quad H_1:\mid \theta -\theta _0\mid >\delta ,\)
where the interval null hypothesis \(H_0\) postulates a negligible effect size \(\delta \) while the alternative \(H_1\) states a practically meaningful effect size.
Note that for a frequentist the true but unknown parameter \(\theta _0 \in \Theta \) is fixed and not random, so any statement as made in the Rao-Lovric theorem becomes pointless. A frequentist can therefore safely escape the consequences of the result, while a Bayesian will readily accept it when using a prior distribution that is absolutely continuous with respect to the dominating measure \(\mu \) of the statistical model \(\mathcal {P}\). Under such a prior, the prior probability of any single value \(\theta \in \Theta \) is zero (compare also Robert (2007, p. 221), Berger (1985, p. 127-130) and Kelter (2021c)), and therefore \(P(\{H_0 \mid \theta \in \mathbb {Q}\})=0\) as stated in the Rao-Lovric theorem follows immediately under these conditions.
However, the decision to accept or reject the existence of a precise hypothesis does not need to rely on mathematical arguments such as the Rao-Lovric theorem. It can also be decided based on extra-mathematical arguments, such as assuming that an effect size difference of exactly zero between a treatment and control group in a clinical trial is unrealistic and will never occur. In light of such arguments, the Rao-Lovric theorem merely formalizes whether probability theory allows one to test a precise hypothesis or not. It does not mandate whether one should believe in the existence of precise hypotheses or not.
Extra-mathematical arguments alone can determine whether any probability measure should be associated with the parameter, effectively rendering it a random variable and implying a Bayesian mode of inference, or whether the parameter carries no probabilistic element, effectively rendering it a fixed and unknown constant and mandating a frequentist approach.
Both Bayesians and frequentists can accept or deny the existence of precise hypotheses. A frequentist does not need to refer to probability statements about the parameter to do so (in fact he can’t). A Bayesian must align his prior distribution with his extramathematical beliefs about the existence of precise hypotheses. In particular, when a Bayesian believes in such hypotheses he must (artificially) assign some positive probability mass \(\varrho >0\) to the theoretically interesting null value \(\theta _0\) specified in \(H_0:\theta =\theta _0\).
4 A Reanalysis of the Jeffreys-Lindley paradox
In this section, a reanalysis of the Jeffreys-Lindley paradox under two different perspectives is undertaken. First, the Rao-Lovric theorem is taken for granted and it is assumed that \(H_0:\theta =\theta _0\) is always false, no matter which value \(\theta _0 \in \Theta \) is specified in \(H_0\). Second, the Rao-Lovric theorem is not taken for granted. This latter perspective can be justified by extra-mathematical arguments that cause one to believe in the existence of a precise hypothesis \(H_0:\theta =\theta _0\). This belief is decoupled from whether a mathematical result postulates that the probability of such a hypothesis in a formal axiomatic system like probability theory must be zero or not. One can believe in extrasensory perception or in the existence of precise hypotheses, for example, in physics. Whether probability theory allows one to test such hypotheses statistically is a different issue, as shown by the Rao-Lovric theorem. Importantly, however, the beliefs about a precise hypothesis must be incorporated into the form of the hypothesis that is actually tested. This latter fact will demonstrate that the point-null-zero-probability paradox is relevant for the occurrence of Lindley’s paradox.
In the following Subsection 4.1, it is proven that a Bayes-frequentist compromise can be reached in the setting of the Jeffreys-Lindley paradox when precise hypotheses are assumed to exist. In the subsequent Subsection 4.2 it is proven that the same holds when precise hypotheses are assumed to be always false. These new results indicate that a reconciliation of the Bayesian and frequentist solutions in the Jeffreys-Lindley paradox can be reached when adjusting for the point-null-zero-probability paradox.
4.1 First case: Precise Hypotheses \(H_0:\theta =\theta _0\) can be true
Now, under this first perspective it is assumed that the precise hypothesis \(H_0:\theta =\theta _0\) is true (that is, precise hypotheses do exist). Then, the situation in the Jeffreys-Lindley paradox can be analyzed as follows: the Bayesian solution accepts \(H_0\), which is the correct choice in this case. The hypothesis is accepted or confirmed for increasing sample size n as the posterior probability \(P(H_0\mid x)\rightarrow 1\). Jeffreys (1939) proposed the now well-established mixture prior
\(P_\vartheta = \varrho \, \mathcal {E}_{\theta _0} + (1-\varrho )\, P_\vartheta ^{\Theta _1}\)
as the prior distribution \(P_\vartheta \) for the parameter \(\theta \), where \(\varrho \in (0,1)\) determines the prior probability mass assigned to the null value \(\theta _0\), \(\mathcal {E}_{\theta _0}\) is the Dirac measure in \(\theta _0 \in \Theta \), and \(P_\vartheta ^{\Theta _1}\) is a suitable probability measure on the alternative hypothesis space \(\Theta _1:=\{\theta \in \Theta \mid \theta \ne \theta _0 \}\).^{Footnote 5} The standard Bayesian solution to hypothesis testing based on posterior probabilities is thus primarily able to confirm \(H_0\) in the Jeffreys-Lindley paradox when \(H_0:\theta =\theta _0\) is true because the prior explicitly assigns probability mass \(\varrho >0\) to the null hypothesis \(H_0\). If an absolutely continuous prior were used instead, or \(\varrho =0\) were chosen, the situation would collapse and \(P(H_0\mid x)=0\) would hold for all \(n\in \mathbb {N}\), aligning the Bayesian solution with the frequentist one in the Jeffreys-Lindley paradox (Kelter, 2021c; Ly and Wagenmakers, 2021), because then both the frequentist and the Bayesian solution would reject \(H_0:\theta =\theta _0\).
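The role of the spike mass \(\varrho \) can be illustrated numerically; a minimal sketch under the same conjugate-normal setting as above (\(\mu =\theta _0\), \(\tau =\sigma \)), with an illustrative function name:

```python
import math

def posterior_h0_mixture(z, n, rho):
    """P(H0 | x) under the mixture prior with spike mass rho on theta_0.

    rho = 0 corresponds to an absolutely continuous prior: the null set
    {theta_0} keeps measure zero and the posterior probability collapses to 0.
    """
    if rho == 0.0:
        return 0.0
    # Posterior odds against H0 under the conjugate-normal setting (tau = sigma)
    odds_against = ((1 - rho) / rho) * (1 + n) ** -0.5 * math.exp(z ** 2 / (2 * (1 + 1 / n)))
    return 1 / (1 + odds_against)

print(posterior_h0_mixture(1.96, 1000, rho=0.5))  # spike mass: H0 can be confirmed
print(posterior_h0_mixture(1.96, 1000, rho=0.0))  # no spike mass: P(H0 | x) = 0
```

Only a positive \(\varrho \) lets the posterior accumulate on \(H_0\); with \(\varrho =0\) the Bayesian solution coincides with the frequentist rejection, as stated above.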
What about the frequentist solution in the paradox in this first case? The p-value is significant, or \(z=1.96\), and one would reject \(H_0\) although it was assumed to be true, so the frequentist test seems to lead to a false decision. However, Greenland (2019) stressed that p-values behave exactly as they should, but that their scaling as measures of evidence against \(H_0\) is sometimes flawed. I concur with this perspective and adopt an example of Good (1985, p. 260) below to illustrate this point. The example shows that a small p-value (or equivalently, the fixed test statistic \(z=1.96\)) can lose its interpretation as evidence against \(H_0:\theta =\theta _0\) when the sample size n becomes large enough.
Reconsider the example of Berger (1985) which was used to illustrate the Jeffreys-Lindley paradox earlier. There, \(z=1.96\) was fixed and the goal was to test \(H_0:\theta =\theta _0\) against \(H_1:\theta \ne \theta _0\). Now, measurement precision is always finite, and it is assumed that the measurement precision in the example allows recording up to two digits in the application context (other values could be chosen without affecting the argument below). Some possible values for the parameter \(\theta \in \Theta \) are shown in Fig. 1.
For simplicity, assume that \(\theta _0=0\) in \(H_0:\theta =\theta _0\), but the argument applies likewise for arbitrary \(\theta _0 \in \Theta \). Henceforth, we also denote \(\bar{X}\) as \(\bar{X}_n\), solely to stress the dependence on the sample size n. The latter is relevant because the evidential interpretation of \(\bar{X}_n\) for small and large n differs substantially when \(X_i {\mathop {\sim }\limits ^{i.i.d.}} P_{\theta _0}\). The arrow with \(\bar{X}_n\) indicates where the sample mean is located, and as \(H_0\) is assumed to be true, for \(n\rightarrow \infty \) we have \(\bar{X}_n \rightarrow \theta _0=0\) almost surely under the distribution \(P_{\theta _0}\) of \(H_0\) by the strong law of large numbers (SLLN) (Bauer, 2001). Thus, for large enough n the distance \(\mid \bar{X}_n-\theta _0\mid \) between \(\theta _0=0\) and \(\bar{X}_n\) will be smaller than the distance \(\mid \bar{X}_n-\theta _1\mid \) between \(\bar{X}_n\) and \(\theta _1=0.01\) on the right-hand side of \(\bar{X}_n\) in Fig. 1. Thus, from the SLLN we have the following proposition:
Proposition 2
Under \(X_i {\mathop {\sim }\limits ^{\text {i.i.d.}}} P_{\theta _0}\) and the SLLN, \(\mid \bar{X}_n-\theta _0\mid <\mid \bar{X}_n-\theta '\mid \) \(\forall \theta ' \in \Theta , \theta ' \ne \theta _0\) for \(n\rightarrow \infty \) \(P_{\theta _0}\)-almost-surely.
Now, as \(z=1.96\) is fixed, for large enough n the tail-area probability corresponding to \(p=0.05\) stays fixed, while \(\mid \bar{X}_n-\theta _0\mid \rightarrow 0\) under \(P_{\theta _0}\). However, in a normal model with known \(\sigma ^2 >0\) and unknown mean (the setting of the Jeffreys-Lindley paradox), \(\bar{X}_n\) is a sufficient statistic (Rüschendorf, 2014, Theorem 4.1.18). Thus, the sufficient statistic converges \(P_{\theta _0}\)-almost-surely to \(\theta _0\) specified in \(H_0:\theta =\theta _0\), but \(z=1.96\) (respectively \(p=0.05\)) stays constant.
As a consequence, the p-value corresponding to \(z=1.96\) becomes increasingly supportive of \(H_0:\theta =0\) (as the sufficient statistic \(\bar{X}_n\) converges to \(\theta _0=0\)), and less supportive of any alternative value \(\theta _1\ne \theta _0\) located in \(H_1:\theta \ne \theta _0\). In the words of Good (1985), this shows “that a given P-value means less for large N” (Good, 1985, p. 260). Alluding to sufficiency, the above line of thought can be formalized as follows:
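The convergence argument can be made concrete with a small computation: holding \(z=1.96\) fixed forces \(\bar{X}_n=\theta _0+1.96\sigma /\sqrt{n}\), and for large enough n the sample mean lies closer to \(\theta _0=0\) than to the nearest alternative value at two-digit measurement precision, \(\theta _1=0.01\). The concrete numbers below are illustrative:

```python
import math

theta0, theta1, sigma = 0.0, 0.01, 1.0  # theta1: nearest measurable alternative

for n in [100, 10_000, 1_000_000]:
    # Sample mean that keeps the test statistic fixed at z = 1.96
    xbar = theta0 + 1.96 * sigma / math.sqrt(n)
    closer_to_null = abs(xbar - theta0) < abs(xbar - theta1)
    print(f"n={n}: xbar={xbar:.5f}, closer to theta0: {closer_to_null}")
```

For \(n=100\) the sample mean still lies closer to \(\theta _1\), whereas for \(n=1{,}000{,}000\) it lies closer to \(\theta _0\), although \(p=0.05\) is unchanged throughout.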
Definition 1
(Statistical evidence) Let \(T_n:\mathcal {Y}\rightarrow \Theta \) be a sufficient and consistent estimator for the parameter \(\theta \in \Theta \). The statistical evidence \(\text {Ev}:\Theta \rightarrow [0,\infty )\), \(\theta \mapsto \text {Ev}(\theta )\) provided in favor of \(\theta \in \Theta \) is given as \(\text {Ev}(\theta ):=1/\mid T_n(y)-\theta \mid \).
The above definition implements the abstract statistical evidence \(Ev(H_0):=Ev(\theta _0)\) in the sense of Birnbaum (1962) by means of the inverse absolute distance of a sufficient and consistent estimator \(T_n\) from a given value \(\theta \in \Theta \).^{Footnote 6} When \(T_n\) is close to \(\theta _0\), \(\theta _0\) denoting the true parameter value, the statistical evidence for \(\theta _0\) is large. Note that this does not require a frequentist or Bayesian position, because the sufficiency principle is accepted by both frequentists and Bayesians (Berger and Wolpert, 1988; Grossman, 2011). Based on Definition 1 and Proposition 2 we obtain:
Theorem 3
Suppose statistical evidence is measured as \(\text {Ev}(\theta ):=1/\mid T_n(y)-\theta \mid \), where \(T_n:\mathcal {Y}\rightarrow \Theta \) is a sufficient and consistent estimator for the parameter \(\theta \in \Theta \). Suppose \(X_i {\mathop {\sim }\limits ^{\text {i.i.d.}}} P_{\theta _0}\) with \(P_{\theta _0}=\mathcal {N}(\theta _0,\sigma ^2)\) as specified in the Jeffreys-Lindley paradox. Then, \(\text {Ev}(\theta ')<\text {Ev}(\theta _0)\) for \(n\rightarrow \infty \), \(P_{\theta _0}\)-almost-surely, for all \(\theta '\in \Theta , \theta ' \ne \theta _0\). That is, the statistical evidence favors \(H_0:\theta =\theta _0\) in the Jeffreys-Lindley paradox.
Proof
Fix a \(\theta ' \in \Theta \) with \(\theta ' \ne \theta _0\). As \(X_i {\mathop {\sim }\limits ^{\text {i.i.d.}}} P_{\theta _0}\), Proposition 2 implies that
\(\mid \bar{X}_n-\theta '\mid > \mid \bar{X}_n-\theta _0\mid \qquad (7)\)
for all \(\theta ' \in \Theta \) for \(n\rightarrow \infty \) \(P_{\theta _0}\)-almost-surely. As by assumption \(P_{\theta _0}=\mathcal {N}(\theta _0,\sigma ^2)\) with \(\sigma ^2>0\) known, \(\bar{X}_n:=\frac{1}{n}\sum _{i=1}^n X_i\) is a sufficient statistic for the mean and is also consistent. Thus, from Definition 1, \(\text {Ev}(\theta ):=1/\mid \bar{X}_n-\theta \mid \), and thereby Eq. 7 implies \(\text {Ev}(\theta ')<\text {Ev}(\theta _0)\) for \(n\rightarrow \infty \), \(P_{\theta _0}\)-almost-surely. As \(\theta '\) was chosen arbitrarily, the same holds for all \(\theta '\in \Theta , \theta ' \ne \theta _0\) and the conclusion follows.
Three points are important to note here:
\(\blacktriangleright \) The above behavior occurs although the probability of the test statistic \(P_{\theta _0}(Z>1.96)\) is independent of n. For each n, \(Z\sim \mathcal {N}(0,1)\), but here a value \(z=1.96\) corresponding to \(p=0.05\) is fixed, and only then is the evidential interpretation of the p-value p investigated for increasing n. No reference is made to the distribution of Z; in particular, it is not claimed that the distribution of the p-value changes (and that, as a consequence, the interpretation should change).
\(\blacktriangleright \) Theorem 3 makes explicit Good’s argument that a given p-value of \(p=0.05\) means less for large sample size. However, it also shows that this argument rests crucially on the premise that \(1/\mid \bar{X}_n-\theta _0\mid \) measures the strength of evidence in favor of \(H_0\) in a sensible way. One could object to this premise that the test statistic \(Z(X):=\sqrt{n} \frac{\mid \bar{X}_n-\theta _0\mid }{\sigma }\), scaled by \(\sigma \) and \(\sqrt{n}\), should be used instead to measure this evidence. However, note that for fixed \(z=1.96\) and any fixed \(\sigma >0\), keeping \(z=1.96\) fixed for increasing \(\sqrt{n}\) still forces \(\mid \bar{X}_n-\theta _0\mid \rightarrow 0\) \(P_{\theta _0}\)-almost-surely, so one again obtains \(\text {Ev}(\theta _0)\rightarrow \infty \) under \(P_{\theta _0}\). Thus, the sufficient statistic \(\bar{X}_n\) again indicates evidence in favor of \(H_0\) although \(p=0.05\) remains statistically significant.
\(\blacktriangleright \) The original argument of Good does not make use of the concept of sufficiency to motivate the evidential meaning of the absolute distance \(\mid \bar{X}_n-\theta _0\mid \). Definition 1 does, and this comes at the price of formalizing the abstract notion of statistical evidence, see Birnbaum (1962). A hard-nosed sceptic can thus reject the conclusion by rejecting Definition 1. However, this in turn comes at the price of being suspicious about the information provided by a sufficient (and consistent) estimator. We will return to this weak spot later.
In the context of the Jeffreys-Lindley paradox, Theorem 3 demonstrates that the small p-value corresponding to the fixed z-statistic can lose its interpretation as evidence against \(H_0:\theta =\theta _0\) for growing \(n\rightarrow \infty \). In the limiting case, as shown above, the p-value even signals evidence in favor of \(\theta _0\) specified in \(H_0:\theta =\theta _0\) compared to all other values for \(\theta \).^{Footnote 7} The principal assumption needed to arrive at this conclusion is to accept the concept of sufficiency of an estimator and to agree that Definition 1 constitutes a reasonable measure of statistical evidence. The latter is neither frequentist nor Bayesian but measure-theoretic, as sufficiency goes down to the sigma-algebra level (Schervish, 1995; Bauer, 2001).
Thus, under the assumption that \(H_0\) is true, the Bayesian solution correctly accepts \(H_0\), and the frequentist solution seems to reject \(H_0\) solely because the p-value is not a standardized measure of evidence for growing sample size n. An appealing proposal made by Good (1985) is to standardize tail-areas such as p-values to the sample size n. In the above example, “if a small tail-area probability P occurs with sample size N we would say it is equivalent to a tail-area probability of \(P\sqrt{N/100}\) for a sample of 100 if this is also small” (Good, 1985, p. 260). Thus, when observing \(\tilde{p}=0.05\) for \(n=1000\), for which the posterior probability in the example of Berger (1985) resulted in \(P(H_0\mid x)=0.80\), the equivalent p-value for a sample of size \(n=100\) would be
\(p = \tilde{p}\cdot \sqrt{n/100} = 0.05\cdot \sqrt{1000/100} \approx 0.1581,\)
which is far away from any conventional threshold for statistical significance. Based on \(p=0.1581\), which is standardized to the sample size \(n=100\), one would not reject \(H_0:\theta =\theta _0\).
Theorem 3 shows that the usual practice of interpreting p only as evidence against \(H_0\) is problematic, since for large sample sizes n it can – as shown above – become evidence in favor of \(H_0:\theta =\theta _0\). However, p-values in a Fisherian sense are almost always used only to quantify the evidence against a null hypothesis (Greenland, 2019; Kelter, 2021a; Rafi and Greenland, 2020).
A weak spot in Theorem 3 is that it depends on Definition 1. Thus, as the log-Bayes factor provides a natural explication of the weight of evidence for a Bayesian according to Jeffreys (1961), Good (1985) or Sprenger and Hartmann (2019) (for an axiomatic derivation see Good (1960)), a second option to arrive at the conclusion of Theorem 3 is given below:
Definition 2
[Weight of evidence] Let \(\theta _0,\theta ' \in \Theta \), \(\theta _0 \ne \theta '\). The weight of evidence \(Ev:\Theta \rightarrow \mathbb {R}, \theta ' \mapsto Ev(\theta ')\) against \(\theta _0\) compared to \(\theta '\) provided by the observed data \(x\in \mathcal {X}\) is given as \(Ev(\theta '):=\log [\text {BF}_{10}(x)]\), where \(\text {BF}_{10}(x)\) denotes the Bayes factor against \(H_0:\theta =\theta _0\) compared to the hypothesis \(H':\theta =\theta '\) based on x.
Thus, the weight of evidence is precisely the log-Bayes factor when comparing two precise hypotheses \(H_0:\theta =\theta _0\) and \(H':\theta =\theta '\). The weight of evidence equals the log-Bayes factor for general hypotheses as well, but for our purposes the point-versus-point case suffices. We obtain:
Theorem 4
Suppose statistical evidence is measured by the weight of evidence \(\text {Ev}(\theta '):=\log [\text {BF}_{10}(x)]\) and assume \(X_i {\mathop {\sim }\limits ^{\text {i.i.d.}}} P_{\theta _0}\) with \(P_{\theta _0}=\mathcal {N}(\theta _0,\sigma ^2)\) as specified in the Jeffreys-Lindley paradox. Then, \(\text {Ev}(\theta ')<\text {Ev}(\theta _0)\) for \(n\rightarrow \infty \), \(P_{\theta _0}\)-almost-surely, for all \(\theta '\in \Theta , \theta ' \ne \theta _0\).
Proof
Fix \(\theta ' \in \Theta \) with \(\theta ' \ne \theta _0\). As \(X_i {\mathop {\sim }\limits ^{\text {i.i.d.}}} P_{\theta _0}\), Proposition 2 implies that \(\mid \bar{X}_n-\theta '\mid >\mid \bar{X}_n-\theta _0\mid \) for all \(\theta ' \in \Theta \) for \(n\rightarrow \infty \) \(P_{\theta _0}\)-almost-surely. But \(\mid \bar{X}_n-\theta '\mid >\mid \bar{X}_n-\theta _0\mid \) is equivalent to:

$$\exp \left[ -\frac{(\bar{X}_n-\theta ')^2}{2\sigma ^2/n}\right] < \exp \left[ -\frac{(\bar{X}_n-\theta _0)^2}{2\sigma ^2/n}\right] , \quad \text {that is, } \frac{m_1(x)}{m_0(x)}<1.$$
Thus, Proposition 1 implies that for all \(\theta ' \in \Theta \) for \(n\rightarrow \infty \) \(P_{\theta _0}\)-almost-surely we have \(\log [\text {BF}_{10}(x)]<0\), where in the above \(m_0(x)=\frac{1}{\sqrt{2\pi \sigma ^2/n}}\exp [-\frac{(\bar{X}_n-\theta _0)^2}{2\sigma ^2/n}]\) denotes the marginal likelihood under \(H_0:\theta =\theta _0\) and \(m_1(x)=\frac{1}{\sqrt{2\pi \sigma ^2/n}}\exp [-\frac{(\bar{X}_n-\theta ')^2}{2\sigma ^2 /n}]\) denotes the marginal likelihood under \(H':\theta =\theta '\) for any \(\theta ' \in \Theta \), \(\theta ' \ne \theta _0\). As by assumption \(\text {Ev}(\theta '):=\log [\text {BF}_{10}(x)]\), it follows that \(\text {Ev}(\theta ')<0\) for all \(\theta ' \in \Theta \) for \(n\rightarrow \infty \) \(P_{\theta _0}\)-almost-surely. As \(\text {Ev}(\theta _0):=\log (1)=0\), we have \(\text {Ev}(\theta ')<\text {Ev}(\theta _0)\), which is the desired conclusion.
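The behavior established in Theorem 4 can be checked with a minimal numeric sketch. The values \(\theta _0=0\), \(\sigma =1\), \(n=100{,}000\) below are hypothetical; the observed mean is placed at \(\bar{x}=\theta _0+z\sigma /\sqrt{n}\) with fixed \(z=1.96\), as in the paradox setting, and the weight of evidence is the log-likelihood ratio of the two point hypotheses based on the sufficient statistic \(\bar{X}_n \sim \mathcal {N}(\theta ,\sigma ^2/n)\).

```python
import math

# Jeffreys-Lindley setting: fixed z = 1.96 while n grows, so the
# observed mean sits at xbar = theta0 + z * sigma / sqrt(n).
theta0, sigma, z, n = 0.0, 1.0, 1.96, 100_000
xbar = theta0 + z * sigma / math.sqrt(n)

def ev(theta_prime):
    """Weight of evidence Ev(theta') = log BF_10(x) against H0, which
    for two point hypotheses is the log-likelihood ratio of the
    densities of xbar ~ N(theta, sigma^2 / n)."""
    return (n / (2 * sigma**2)) * ((xbar - theta0)**2 - (xbar - theta_prime)**2)

# Every tested alternative theta' != theta0 carries negative weight of
# evidence, i.e. each comparison favors H0, as Theorem 4 states.
print(all(ev(tp) < 0 for tp in (-1.0, -0.1, 0.05, 0.5, 2.0)))  # True
```

For any fixed \(\theta '\ne \theta _0\), increasing n makes \(\text {Ev}(\theta ')\) ever more negative, since \(\bar{x}\) moves toward \(\theta _0\) while staying bounded away from \(\theta '\).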
Thus, for those questioning Definition 1, Definition 2 provides a different, Bayesian definition of the weight of evidence. Theorem 4 then shows that in the situation of the JL paradox, the weight of evidence \(\text {Ev}(\theta ')\) against \(H_0:\theta =\theta _0\) compared to any other point null hypothesis \(H':\theta =\theta '\) is asymptotically negative when \(X_i {\mathop {\sim }\limits ^{\text {i.i.d.}}} P_{\theta _0}\), for \(P_{\vartheta }\)-almost every \(\theta ' \ne \theta _0\). Thus, \(H_0:\theta =\theta _0\) is preferred to every other hypothesis \(H':\theta =\theta '\), \(\theta ' \ne \theta _0\). As in the JL paradox the alternative \(H_1:\theta \ne \theta _0\) is the logical disjunction of these points \(\theta ' \ne \theta _0\), the weight of evidence against \(H_0\) should become negative, meaning that the Bayes factor favors \(H_0\), and Theorem 4 shows that this is indeed the case.
Theorem 4 seems to require an explicitly Bayesian definition of the weight of evidence to arrive at favoring \(H_0\) in the JL paradox. However, the weight of evidence as stated in Definition 2 is not only equal to the log-Bayes factor of a point versus point hypothesis, it is also equal to the log-likelihood ratio

$$\log [\text {BF}_{10}(x)] = \log \left[ \frac{f(x\mid \theta ')}{f(x\mid \theta _0)}\right] ,$$
see e.g. Marin and Robert (2014), and the likelihood ratio is a frequentist concept (Royall, 1997). From this angle, Theorem 4 shows that from a likelihoodist point of view, each likelihood-ratio comparison of \(H_0:\theta =\theta _0\) versus \(H':\theta =\theta '\) for any \(\theta ' \in H_1:\theta \ne \theta _0\) shows evidence for \(H_0\) asymptotically. Thus, Theorem 4 proves not only that a Bayesian will arrive at the conclusion that a small p-value can become evidential in favor of \(H_0\) for large enough sample size n. It also shows that a frequentist who uses likelihood ratios (such as in Neyman-Pearson tests) arrives at the same conclusion. From this angle, Definition 2 can be reinterpreted as quantifying statistical evidence via likelihood ratios.^{Footnote 8}
In summary, in the first case the point-null-zero-probability paradox was not assumed to be true, and the existence of a precise hypothesis is accepted. The Bayesian solution correctly confirmed \(H_0\), and the frequentist solution seemed to incorrectly reject \(H_0\). The latter was resolved by noticing that the p-value loses its interpretation as evidence against \(H_0\) for increasing sample size n, aligning the Bayesian and frequentist solutions. Thus, the Jeffreys-Lindley paradox does not occur when p-values are reinterpreted this way. Theorems 3 and 4 showed that both a Bayesian and a frequentist – whether using likelihood ratios and the Neyman-Pearsonian approach or sufficiency and the Fisherian approach – will arrive at this conclusion.
Another possibility to align the Bayesian and frequentist solutions is to shift to an absolutely continuous prior in the Bayesian approach: then, both approaches reject \(H_0\). Note, however, that such prior choices are in conflict with the extra-mathematical belief that the precise hypothesis \(H_0:\theta =\theta _0\) exists, so such a prior would not be selected in a Bayesian analysis.
4.2 Case 2: Precise Hypotheses such as \(H_0:\theta =\theta _0\) are Always False
Now, in the second case it is assumed that the point-null-zero-probability paradox is true and the Rao-Lovric theorem holds. The extra-mathematical belief is therefore that precise hypotheses are unrealistic and should be replaced by a more reasonable small interval hypothesis, independently of whether a precise hypothesis can be tested via probability theory or some statistical method.
Reconsider the Bayesian solution which accepted \(H_0:\theta =\theta _0\). Clearly, when precise hypotheses are assumed to be always false, the Bayesian approach now yields the incorrect solution. The true hypothesis \(H_0\) in the example of Berger (1985) must now be of the form \(H_0:\theta \in [\theta _0-b,\theta _0+b]\) for some \(b >0\). Importantly, we make the restriction that the size of the interval must be reasonably small.
“Given that one should really be testing \(H_0:\theta \in (\theta _0-b,\theta _0+b)\), we need to know when it is suitable to approximate \(H_0\) by \(H_0:\theta =\theta _0\). From the Bayesian perspective, the only sensible answer to this question is – the approximation is reasonable if the posterior probabilities of \(H_0\) are nearly equal in the two situations.” (Berger, 1985, p. 149)
Now, a condition under which this would be the case is that the observed likelihood function is approximately constant on \((\theta _0-b,\theta _0+b)\) (because then the resulting posterior for any fixed prior is approximately constant, too). In the example of Berger (1985), which was used to illustrate the Jeffreys-Lindley paradox earlier, this is the case if

$$\frac{\sup _{\theta \in (\theta _0-b,\theta _0+b)} f(x\mid \theta )}{\inf _{\theta \in (\theta _0-b,\theta _0+b)} f(x\mid \theta )} \le 1.05,$$
where “nearly constant” is interpreted as the observed likelihood function not varying by more than 5%. Importantly, in this second case the test of \(H_0:\theta =\theta _0\) carried out in the Bayesian analysis of the Jeffreys-Lindley paradox has a different interpretation than in the first case. In the first case we intended to test \(H_0:\theta =\theta _0\) as a precise hypothesis. In the second case, we interpret the prior probability \(\varrho \) assigned to the null hypothesis value \(\theta _0\) as the mass which would have been assigned to the more realistic interval hypothesis \(\tilde{H}:\theta \in (\theta _0-b,\theta _0+b)\), were the point null approximation not being used. Thus, we interpret the test of a precise hypothesis as an approximation of the test of the realistic interval hypothesis.
Of course, a straightforward modification would be to test the interval hypothesis directly^{Footnote 9}, but to stay in the setting of the Jeffreys-Lindley paradox this option is not considered further here.^{Footnote 10}
Now, consider again the case \(z=1.96\) for \(n=1000\), which yielded \(P(H_0\mid x)=0.80\) for the test of \(H_0:\theta =\theta _0\) versus \(H_1:\theta \ne \theta _0\). Based on this result, we would accept the precise null hypothesis, which is false in this second case. However, from Eq. 11 and under the assumption \(\sigma =1\), the condition becomes \(b\le 0.000387\). Using larger values of \(\sigma \) will increase b accordingly, but also implies that the data are highly variable, justifying a larger interval width b. For example, from Eq. 11 it follows that for \(n=1000\) one would require a standard deviation \(\sigma \ge 25.9\) (which is equivalent to a variance of \(\sigma ^2 \approx 670.81\)) for b to be larger than or equal to 0.01, the smallest possible measurement (as we assumed two-digit measurement precision). Thus, for \(n=1000\) and fixed z, Eq. 11 can be read as separating measurable interval sizes – for which \(\sigma \ge 25.9\) and \(b\ge 0.01\) – from the ones that are too small in the application context – for which \(\sigma < 25.9\) and \(b<0.01\). Any interval width smaller than \(b=0.01\) is by assumption not measurable anymore, and in almost all cases one will not be willing to draw any conclusions based on even \(n=1000\) samples when \(\sigma ^2>670.81\).
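The likelihood-constancy check behind these numbers is easy to reproduce. The sketch below uses a hypothetical helper: for the Gaussian likelihood with observed \(\bar{x}-\theta _0=z\sigma /\sqrt{n}\), it computes the ratio of the largest to the smallest likelihood value over \((\theta _0-b,\theta _0+b)\), which is attained at the endpoints whenever \(\bar{x}\) lies outside the interval.

```python
import math

def likelihood_variation(b, n, z, sigma=1.0):
    """Ratio of the largest to the smallest value of the Gaussian
    likelihood over (theta0 - b, theta0 + b), given the observed
    xbar - theta0 = z * sigma / sqrt(n). For b < z * sigma / sqrt(n)
    the likelihood is monotone on the interval, so the ratio is
    attained at the two endpoints."""
    d = z * sigma / math.sqrt(n)  # xbar - theta0
    lik = lambda t: math.exp(-n * (d - t) ** 2 / (2 * sigma**2))
    return lik(b) / lik(-b)

# b = 0.000387 keeps the variation within the 5% tolerance at n = 1000 ...
print(round(likelihood_variation(0.000387, 1000, 1.96), 3))  # 1.049
# ... whereas the smallest measurable width b = 0.01 clearly does not
print(likelihood_variation(0.01, 1000, 1.96) > 1.05)  # True
```

This reproduces the qualitative separation discussed above: interval widths at or above the measurement precision violate the 5% likelihood-constancy condition for \(n=1000\) and \(\sigma =1\).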
Now, when stating \(P(H_0\mid x)=0.80\) we have to keep in mind that we do not believe in the existence of the precise hypothesis, but interpret this posterior probability as an approximation of the posterior probability of the interval hypothesis \(H_0:\theta \in (\theta _0-0.000387,\theta _0+0.000387)\) (under the assumption \(\sigma =1\)). In most realistic situations, one will not be able to accept such a narrow interval hypothesis. Measuring that precisely is not possible with two-digit measurement precision. The interval hypothesis only becomes measurable once \(\sigma \ge 25.9\), which is highly unrealistic in almost any context.
For example, returning to Fig. 1, we assumed that two-digit precision is the best our measurements can offer. Thus, we can reverse the above process and ask what posterior probability for \(H_0\) we can state given our measurement precision. Based on two-digit precision and assuming \(\sigma =1\) known, we arrive at \(n=37\) samples, for which \(b=0.010065\) (for \(n=36\) samples we already have \(b=0.009931<0.01\), which is too small for our maximum measurement precision), and the corresponding posterior probability based on Eq. 4 is \(P(H_0\mid x)=0.47\). Thus, when paying attention to the measurement precision, the Bayesian solution does not confirm \(H_0\) anymore, as now \(P(H_0\mid x)<\frac{1}{2}\).
The frequentist p-value, on the other hand, proceeds identically to the first case. The precise hypothesis \(H_0:\theta =\theta _0\) is false, but under the assumption that b is reasonably small we can still assume that \(Z\sim \mathcal {N}(0,1)\) approximately, and arrive at \(p=0.05\) for \(z=1.96\), correctly rejecting \(H_0\). However, by the arguments for the first case treated above, the rejection of \(H_0\) via statistical significance of the p-value loses its interpretation as weight of evidence against \(H_0\) for increasing sample size n. Thus, while for small sample sizes n the rejection has a valid interpretation, as the p-value functions as a measure of evidence against \(H_0\), for large sample sizes n the interpretation of the p-value is reversed, now functioning as a measure of evidence in favor of \(H_0\), compare Theorems 3 and 4.
Still, this behavior is consistent with that of the Bayesian solution: the Bayesian solution does reject \(H_0\) for small sample sizes. In the setting above, the posterior probability of \(H_0:\theta =\theta _0\) for the Bayesian test of \(H_0:\theta =\theta _0\) versus \(H_1:\theta \ne \theta _0\) passes the threshold \(P(H_0\mid x)=\frac{1}{2}\) at \(n=46\) samples. For sample sizes \(n\ge 46\), the posterior probability of \(H_0:\theta =\theta _0\) then becomes larger than \(\frac{1}{2}\). The frequentist p-value always rejects \(H_0\), but for small sample sizes the rejection has a valid interpretation as evidence against \(H_0\), while for large sample sizes the interpretation is reversed.^{Footnote 11} Although there is no explicit cutoff like \(n=46\) for the p-value, the behavior is identical, aligning the Bayesian and frequentist solutions also in this second case.
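The crossing behavior of the posterior probability can be illustrated with a small sketch. The function below assumes Jeffreys' mixture prior with \(\varrho =\frac{1}{2}\) on \(\theta _0\) and a \(\mathcal {N}(\theta _0,\sigma ^2)\) prior on \(\theta \) under \(H_1\); this prior scale is an assumption made here for illustration, so the exact crossing point differs slightly from the \(n=46\) obtained from Eq. 4, while the qualitative behavior at fixed \(z=1.96\) (rejection for small n, acceptance for large n) is the same.

```python
import math

def posterior_h0(z, n, rho=0.5):
    """Posterior probability of H0: theta = theta0 under Jeffreys'
    mixture prior with mass rho on theta0 and an assumed
    N(theta0, sigma^2) prior on theta under H1. For this choice,
    BF_01 = sqrt(1 + n) * exp(-(z^2 / 2) * n / (n + 1))."""
    bf01 = math.sqrt(1 + n) * math.exp(-(z**2 / 2) * n / (n + 1))
    return rho * bf01 / (rho * bf01 + 1 - rho)

# At fixed z = 1.96 the posterior is below 1/2 for small n but well
# above 1/2 for large n -- the behavior discussed in the text.
print(posterior_h0(1.96, 10) < 0.5)    # True
print(posterior_h0(1.96, 1000) > 0.8)  # True
```

Under this assumed prior the posterior at \(n=1000\) comes out near the \(P(H_0\mid x)=0.80\) of Berger's example, and sweeping n locates a crossing of \(\frac{1}{2}\) at a moderate sample size, mirroring the cutoff described above.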
An important note should be added: Eq. 11 explicitly warns the Bayesian statistician when the approximation of the interval hypothesis via a precise hypothesis becomes unreliable. There is no such warning for the p-value. Even the fact that the p-value reverses its interpretation from a measure of evidence against \(H_0:\theta =\theta _0\) into a measure of evidence in favor of \(H_0\) for a fixed z-statistic and growing sample size n does not help to determine when this change point occurs.^{Footnote 12}
5 Relationship Between the Paradoxes
Table 2 illustrates the relationship between the paradoxes. First, suppose the Rao-Lovric theorem is accepted and the position is adopted that precise hypotheses are always false. When precise hypotheses do not exist and are always false, in the small sample size setting both the Bayesian and the frequentist solution arrive at rejection of \(H_0\), which is correct (blue part of Table 2). Importantly, the ticks and crosses should be interpreted as indicating the correct or incorrect solution with respect to the point-null-zero-probability paradox and the setting of the Jeffreys-Lindley paradox. Thus, the ticks behind the blue entries in Table 2 mean that when precise hypotheses do not exist, rejecting the precise hypothesis \(H_0\) is the desired conclusion in the setting of the Jeffreys-Lindley paradox. In contrast, accepting \(H_0\) (the Bayesian solution for large sample sizes, in red) is the incorrect solution in the setting of the Jeffreys-Lindley paradox when precise hypotheses are assumed to be always false. Therefore, a cross indicates this in the second column, second row of Table 2.^{Footnote 13}
The Jeffreys-Lindley paradox occurs (the Bayesian solution starts to favor the null hypothesis) when the approximation of the interval null hypothesis (assumed to be true) through the precise point null hypothesis becomes unreliable, as illustrated by the application of Eq. 11. In Table 2, this corresponds to the acceptance of \(H_0\) in the large sample setting while the frequentist solution still rejects \(H_0\) (red part of Table 2).
However, when the condition in Eq. 11 is checked against the measurement precision, the paradox will not set in unless exceptionally high measurement precision is available. Therefore, the Bayesian solution will be judged as unreliable and the Bayesian will not proceed with the test of \(H_0:\theta =\theta _0\) (see the \(^{**}\) in Table 2 at the Bayesian solution in this case). The Bayesian solution shifts to the test of the correct hypothesis \(H_0:\theta \in (\theta _0-b,\theta _0+b)\) in this case and, by the consistency of the posterior distribution, will accept the latter hypothesis (Kleijn, 2022).^{Footnote 14} The frequentist solution rejects \(H_0\) in this case, but as shown in Theorems 3 and 4, the p-value \(p=0.05\) loses its interpretation as evidence against \(H_0\) and becomes evidential in favor of \(H_0\). When the frequentist adopts this interpretation (or uses Good’s standardized p-value), he arrives at the conclusion to accept \(H_0:\theta =\theta _0\), while the Bayesian accepts \(H_0:\theta \in (\theta _0-b,\theta _0+b)\). The conclusions are thus highly similar, and they become identical when the frequentist notices that the test of a small interval hypothesis may be more appropriate, for example by computing what Fraser (2011) called the p-value function, which shows that the p-value is significant for each value inside a small interval of parameter values \((\theta _0-b,\theta _0+b)\). Both the Bayesian and the frequentist then arrive at the acceptance of the true small interval hypothesis \(H_0:\theta \in (\theta _0-b,\theta _0+b)\).
When the existence of precise hypotheses is accepted (right part of Table 2), the Bayesian and frequentist solutions both reject \(H_0\) in the small sample setting (purple part of Table 2), which seems to be incorrect. However, the latter can be attributed to the limited sample size n.
Shifting to the large sample size setting (black part of Table 2 on the right-hand side), the frequentist solution rejects \(H_0\) while the Bayesian solution accepts \(H_0\). Now the discrepancy between both solutions can be reconciled by noticing again that the small p-value \(p=0.05\) becomes evidential in favor of \(H_0:\theta =\theta _0\) for large n (also, if standardized p-values were used, they would not reject \(H_0\)). If the frequentist rejects this interpretation, the difference boils down to the fact that the Bayesian solution yields the correct answer only by explicitly assigning probability mass \(\varrho >0\) to the null set \(\{\theta _0\}\) in Jeffreys’ mixture prior, while the frequentist solution does not. A Bayesian could use an absolutely continuous prior to arrive at rejection of \(H_0\) and align the Bayesian with the frequentist solution. However, Theorems 3 and 4 show that a straightforward solution is to notice that the significant p-value in the large sample setting can be evidence in favor of \(H_0\), aligning the frequentist with the Bayesian solution.
In sum, when null hypotheses are judged to be always false and the point-null-zero-probability paradox is accepted (left part of Table 2), the different solutions of Bayesian and frequentist hypothesis tests in the Jeffreys-Lindley paradox can be reconciled by interpreting the test of a precise hypothesis \(H_0:\theta =\theta _0\) versus \(H_1:\theta \ne \theta _0\) as an approximation of the test of the more realistic small interval hypothesis \(H_0:\theta \in (\theta _0-b,\theta _0+b)\) versus its alternative (which corresponds to what a frequentist or Bayesian would naturally prefer when accepting that precise hypotheses are always false). Sample size and measurement precision then determine whether the Jeffreys-Lindley paradox sets in, and by the time the sample size is large enough for the Jeffreys-Lindley paradox to be witnessed, the actually tested interval hypothesis has become unrealistically precise for the context.
Theorems 3 and 4 demonstrate that when the existence of null hypotheses is accepted, the frequentist solution can be aligned with the Bayesian solution (right part of Table 2). For a frequentist, the only requirement is either a Fisherian position based on sufficiency or a Neyman-Pearsonian position based on likelihood ratios. For a Bayesian, the explication of the weight of evidence as the log-Bayes factor is required, which is uncontroversial based on the available results of Good (1960, 1968), see Good (1985).
6 Conclusion
In sum, adjusting for the point-null-zero-probability paradox shows that the Jeffreys-Lindley paradox can be resolved and can even provide a Bayes-non-Bayes compromise in the spirit of Good (1992).
Three questions were considered in this paper. Regarding the first question – the mathematical reasons why the Jeffreys-Lindley paradox occurs – the measure-theoretic premises are essential. The Bayes factor assigns positive mass to the null value, while the frequentist p-value is confined to a framework which does not even allow a probability measure on the parameter space \(\Theta \) (the latter reduces to a plain set in the frequentist approach, compare Schervish (1995)). Also, p-values are not standardized, which is why a significant p can be interpreted as evidential for \(H_0:\theta =\theta _0\) for large enough n based on the concept of sufficiency.^{Footnote 15}
Regarding the extra-mathematical arguments, the point-null-zero-probability paradox showed that the form of the hypothesis under consideration is essential for resolving the Jeffreys-Lindley paradox. When the existence of precise hypotheses is taken for granted, the Bayesian solution proceeds fine (at the price of assigning positive probability mass in the prior to the null value \(\theta _0\)), and the frequentist p-value must be standardized to align both solutions. When the existence of precise hypotheses is questioned, the Bayesian solution must be seen as an approximation and Eq. 11 as a warning sign for the reliability of the Bayesian approximation of an interval null via a precise null hypothesis. Taking the quality of the approximation into account then aligns both approaches again, as the Jeffreys-Lindley paradox sets in as soon as the approximation becomes unreliable. The standardization of p-values works here, too, to yield identical conclusions in both the Bayesian and the frequentist approach, that is, to shift towards the test of \(H_0:\theta \in (\theta _0-b,\theta _0+b)\) versus its alternative instead of the test of \(H_0:\theta =\theta _0\) versus its alternative.
Whether or not one accepts the Rao-Lovric theorem, this paper showed that from a mathematical point of view the Jeffreys-Lindley paradox can be attributed to the fact that the standard Bayesian solution accepts the existence of a precise hypothesis while the p-value does not (at least from a Fisherian perspective, under which the null hypothesis is never true and can only be rejected based on sufficient evidence against it). However, the price the Bayesian pays for this behavior is the risk that the approximation of \(H_0:\theta \in (\theta _0-b,\theta _0+b)\) by \(H_0:\theta =\theta _0\) fails for large enough n. Furthermore, the fact that p-values were designed as evidence measures against a null hypothesis but can in fact become evidential in favor of a null hypothesis when the sample size n is large enough is another reason why the puzzling situation of the Jeffreys-Lindley paradox occurs. Neither point has so far been discussed in detail in the literature.
From an extra-mathematical perspective, the point-null-zero-probability paradox provides the overarching frame to resolve the Jeffreys-Lindley paradox in both cases treated in this paper, compare Table 2.
With regard to the methodological debate between frequentist and Bayesian inference (Wasserstein et al., 2019), the results in this paper therefore demonstrate that the Jeffreys-Lindley paradox may help to rethink (i) the validity of a precise null hypothesis in a broad variety of research contexts, (ii) the Bayesian assignment of positive probability mass to a value of dominating-measure zero, and (iii) the scaling of p-values as measures of evidence against a precise null hypothesis.
Notes
Replacing the original data with the sufficient statistic \(\bar{X}\) and performing any hypothesis test based on this compressed new data \(\bar{x}\) requires both a Bayesian and a frequentist to accept the sufficiency principle (SP) (Berger and Wolpert, 1988). However, the SP follows naturally from the likelihood principle, which is readily accepted by Bayesians, and while a Neyman-Pearsonian frequentist may question any form of the conditionality or ancillarity principle, the sufficiency principle is accepted by him or her as well, so the replacement does not require an assumption a Bayesian or frequentist may not be willing to make.
Although z depends on \(\sqrt{n}\), the strong law of large numbers causes \(\bar{X}\) to converge almost surely under \(H_0\) to \(\theta _0\), counterbalancing the growing \(\sqrt{n}\) and holding z fixed. Thus, even for a large n like \(n=1000\) there exists a corresponding sample \(x=(x_1,...,x_{1000})\) for which the difference \(\mid \bar{x}-\theta _0\mid \) becomes small enough so that \(\mid \bar{x}-\theta _0\mid \frac{\sqrt{n}}{\sigma }=z\) for a fixed z like \(z=1.96\).
Although the prior specifications for the hyperparameters \(\mu ,\tau \) or the prior probabilities for \(H_0,H_1\) could be questioned, the observed phenomenon holds independently of the specific choices made in the example, for details see Berger (1985).
Other criticisms include Meehl (1967, p. 108): “the old point-null hypothesis (...) is [quasi-] always false in biological and social science”, and Cohen (1990, p. 1308), who went as far as stating that the null hypothesis “taken literally (...) is always false in the real world. It can only be true in the bowels of a computer processor running a Monte Carlo study (and even then a stray electron may make it false).”
The subscript \(\vartheta \) in \(P_\vartheta \) serves to indicate that the distribution \(P_\vartheta \) corresponds to the random variable \(\vartheta :\Omega \rightarrow \Theta \) which models the parameter \(\theta \in \Theta \) in the Bayesian approach, compare Kleijn (2022) and Schervish (1995).
We omit the subscript \(T_n\) when denoting \(\text {Ev}\), so we write Ev\((\theta )\) instead of \(\text {Ev}_{T_n}(\theta )\), because it is apparent from the context which sufficient and consistent estimator \(T_n\) for the parameter is used.
This also manifests itself in the seemingly paradoxical situation that a significant result with \(z=1.96\) may occur for large n although the effect size estimate is tiny, which is due to \(\bar{X}_n\rightarrow \theta _0=0\).
A main reason for this situation is that likelihoodists and Bayesians both accept the conditionality principle (Berger and Wolpert, 1988). A frequentist sticking to Fisher’s position may reject the conditionality principle when computing a p-value, but will, in general, accept the sufficiency principle. Theorem 3 then requires only the concept of sufficiency to arrive at the conclusion that a small p-value can become evidential in favor of \(H_0\) for large n.
For example, Berger (1985, p. 151) notes that for a Bayesian it is usually easier to work directly with the interval hypothesis.
Importantly, it must be kept in mind now that for large sample sizes the p-value becomes evidential in favor of values inside the interval hypothesis, in contrast to becoming evidential only for the null value \(\theta _0\) specified in the precise hypothesis \(H_0:\theta =\theta _0\) in the first case.
Using standardized p-values as proposed by Good (1985) may be appealing to safeguard against this uncertainty. Also, inspecting the difference \(\bar{X}_n-\theta _0\) in the example could possibly help to determine whether the change point has already occurred. Whenever \(\bar{X}_n\) is located inside the interval \((\theta _0-b,\theta _0+b)\) and the sample size is large enough, the small p-value corresponding to the fixed z has become evidential in favor of \(H_0:\theta \in (\theta _0-b,\theta _0+b)\).
As noted by one reviewer, the purple entries in Table 2 which both have crosses behind them seem to ignore the possibility that \(H_0\) could indeed be wrong even when precise hypotheses do exist. Then, rejecting \(H_0\) would be the correct conclusion. However, in the setting of the Jeffreys-Lindley paradox the data are generated according to \(X_i \sim P_{\theta _0}\), so \(H_0:\theta =\theta _0\) is known to be true and \(H_0\) should be accepted, as precise hypotheses are assumed to exist. This argument shows that rejecting \(H_0\) in the purple entries of Table 2 is in conflict with the desired conclusion of the Jeffreys-Lindley setting when precise hypotheses are assumed to exist.
This shows that a Bayesian – when forced into the setting of the Jeffreys-Lindley paradox, which assumes a precise hypothesis – arrives at the test of a small interval hypothesis. This is precisely what he would have started with when accepting that precise hypotheses are always false, had he not been forced into the context of the Jeffreys-Lindley paradox setting.
Further mathematical arguments for the occurrence of the paradox have already been presented by Robert (2014, p. 219–220) and are thus not repeated here.
References
Bauer, H. (2001). Measure and integration theory. De Gruyter, Berlin.
Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis. Springer, New York.
Berger, J., Boukai, B. and Wang, Y. (1997). Unified Frequentist and Bayesian testing of a precise hypothesis. Stat. Sci., 12, 133–160.
Berger, J., Brown, L. and Wolpert, R. (1994). A unified conditional frequentist and Bayesian test for fixed and sequential hypothesis testing. Ann. Stat., 22, 1787–1807. https://doi.org/10.1214/aos/1176348654.
Berger, J. and Delampady, M. (1987). Testing precise hypotheses. Stat. Sci., 2, 317–335. https://doi.org/10.1214/ss/1177013238.
Berger, J. and Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of P values and evidence. J. American Stat. Assoc., 82, 112–122. https://doi.org/10.1080/01621459.1987.10478397.
Berger, J. and Wolpert, R.L. (1988). The Likelihood Principle, (S.S. Gupta, ed.). Institute of Mathematical Statistics, Hayward, California.
Bernardo, J. (1999). Nested hypothesis testing: the Bayesian reference criterion. (J. Bernardo, J. Berger, A. Dawid, and A. Smith eds.), Bayesian Statistics (vol. 6) (pp. 101–130, with discussion). Oxford University Press, Oxford.
Birnbaum, A. (1962). On the Foundations of Statistical Inference (with discussion). J American Stat. Assoc., 57, 269–306. https://doi.org/10.2307/2281640.
Buchanan-Wollaston, H.J. (1935). Statistical tests. Nature, 136, 722. https://doi.org/10.1038/136722a0.
Cohen, J. (1990). Things I have learned (so far). Amer. Psychol., 45, 1304–1312. https://doi.org/10.1037/0003-066X.45.12.1304.
Fraser, D. (2011). Is Bayes posterior just quick and dirty confidence? Stat. Sci., 26, 299–316.
Good, I. (1950). Probability and the Weighing of Evidence. Charles Griffin, London.
Good, I. (1960). Weight of evidence, corroboration, explanatory power, information and the utility of experiments. J. Royal Stat. Soc.: Ser. B (Methodological), 22, 319–331. https://doi.org/10.1111/j.2517-6161.1960.tb00378.x.
Good, I. (1968). Corroboration, explanation, evolving probability, simplicity and a sharpened razor. British J. Philosophy Sci., 19, 123–143.
Good, I. (1981). Some logic and history of hypothesis testing. (J.C. Pitt ed.), Philosophy in Economics (pp. 149–174). Springer Netherlands, Dordrecht.
Good, I. (1985). Weight of evidence: A brief survey. (J. Bernardo, M.H. DeGroot, D. Lindley, and A. Smith eds.), Bayesian Statistics (vol. 2) (pp. 249–277). Elsevier Science Publishers B.V. (North-Holland), Amsterdam.
Good, I. (1992). The Bayes/nonBayes compromise: a brief review. J American Stat. Assoc., 87, 597–606. https://doi.org/10.1080/01621459.1992.10475256.
Good, I. (1993). C397. Refutation and rejection versus inexactification, and other comments concerning terminology. J Stat. Comput. Simul., 47, 91–92. https://doi.org/10.1080/00949659308811514.
Good, I. (1994). C420. The existence of sharp null hypotheses. J Stat. Comput. Simul., 49, 241–242. https://doi.org/10.1080/00949659408811587.
Greenland, S. (2019). Valid p-values behave exactly as they should: some misleading criticisms of p-values and their resolution with s-values. Amer. Stat., 73, 106–114. https://doi.org/10.1080/00031305.2018.1529625.
Grossman, J. (2011). The likelihood principle. (P.S. Bandyopadhyay and M.R. Forster eds.), Philosophy of statistics (pp. 553–580). Elsevier NorthHolland, Amsterdam.
Hodges, J.L. and Lehmann, E.L. (1954). Testing the approximate validity of statistical hypotheses. J. Royal Stat. Soc.: Ser. B (Methodological), 16, 261–268. https://doi.org/10.1111/j.2517-6161.1954.tb00169.x.
Jeffreys, H. (1939). Theory of Probability (1st ed.). The Clarendon Press, Oxford.
Jeffreys, H. (1961). Theory of Probability (3rd ed.). Oxford University Press, Oxford.
Kelter, R. (2020). Analysis of Bayesian posterior significance and effect size indices for the two-sample t-test to support reproducible medical research. BMC Medical Research Methodology, 20. https://doi.org/10.1186/s12874-020-00968-2.
Kelter, R. (2021a). Bayesian and frequentist testing for differences between two groups with parametric and nonparametric two-sample tests. Wiley Interdisciplinary Rev Comput Stat, 13, e1523. https://doi.org/10.1002/WICS.1523.
Kelter, R. (2021b). Bayesian Hodges-Lehmann tests for statistical equivalence in the two-sample setting: Power analysis, type I error rates and equivalence boundary selection in biomedical research. BMC Medical Research Methodology, 21. https://doi.org/10.1186/s12874-021-01341-7.
Kelter, R. (2021c). On the measure-theoretic premises of Bayes factor and full Bayesian significance tests: a critical re-evaluation. Computational Brain & Behavior (online first), 1–11. https://doi.org/10.1007/s42113-021-00110-5.
Kleijn, B. (2022). The frequentist theory of Bayesian statistics. Springer, Amsterdam.
Kruschke, J.K. (2018). Rejecting or accepting parameter values in Bayesian estimation. Advan. Method Pract. Psychol. Sci., 1, 270–280. https://doi.org/10.1177/2515245918771304.
Lakens, D., Scheel, A.M. and Isager, P.M. (2018). Equivalence testing for psychological research: a tutorial. Advan. Method Pract. Psychol. Sci., 1, 259–269. https://doi.org/10.1177/2515245918770963.
Linde, M., Tendeiro, J., Selker, R., Wagenmakers, E.J. and van Ravenzwaaij, D. (2020). Decisions About Equivalence: A Comparison of TOST, HDI-ROPE, and the Bayes Factor. PsyArXiv preprint, https://psyarxiv.com/bh8vu.
Lindley, D. (1957). A statistical paradox. Biometrika, 44, 187–192.
Ly, A. and Wagenmakers, E.J. (2021). A Critical Evaluation of the FBST ev for Bayesian Hypothesis Testing. Computational Brain & Behavior. https://doi.org/10.1007/s42113-021-00109-y.
Marin, J.M. and Robert, C. (2014). Bayesian essentials with R. Springer, New York.
Meehl, P.E. (1967). Theory testing in psychology and physics: a methodological paradox. Philosophy Sci., 34, 103–115.
Naaman, M. (2016). Almost sure hypothesis testing and a resolution of the Jeffreys-Lindley paradox. Electronic J Stat., 10, 1526–1550. https://doi.org/10.1214/16-EJS1146.
Rafi, Z. and Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20, 244. https://doi.org/10.1186/s12874-020-01105-9, arXiv:1909.08579.
Rao, C.R. and Lovric, M.M. (2016). Testing point null hypothesis of a normal mean and the truth: 21st Century perspective. Journal of Modern Applied Statistical Methods, 15 (2), 2–21. https://doi.org/10.22237/jmasm/1478001660.
Robert, C.P. (2007). The Bayesian Choice (2nd ed.). New York: Springer. https://doi.org/10.1007/0-387-71599-1.
Robert, C.P. (2014). On the Jeffreys-Lindley paradox. Philosophy of Science, 81 (2), 216–232. Retrieved from https://www.journals.uchicago.edu/doi/abs/10.1086/675729, arXiv:1303.5973.
Robert, C.P. (2016). The expected demise of the Bayes factor. Journal of Mathematical Psychology, 72 (2009), 33–37. arXiv:1506.08292, https://doi.org/10.1016/j.jmp.2015.08.002
Rousseau, J. (2007). Approximating interval hypothesis: p-values and Bayes factors. J. Bernardo, J. Berger, A. Dawid, & A. Smith (Eds.), Bayesian statistics (vol. 8) (pp. 417–452). Valencia: Oxford University Press.
Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. London: Chapman and Hall.
Rüschendorf, L. (2014). Mathematische Statistik. Springer.
Savage, L.J. (1954). The Foundations of Statistics. New York: John Wiley & Sons.
Sawilowsky, S. (2016). Rao-Lovric and the Triwizard Point Null Hypothesis Tournament. Journal of Modern Applied Statistical Methods, 15 (2), 11–12. Retrieved from http://digitalcommons.wayne.edu/jmasm. https://doi.org/10.22237/jmasm/1478001720.
Schervish, M.J. (1995). Theory of Statistics. New York: Springer Verlag.
Spanos, A. (2013). Who should be afraid of the Jeffreys-Lindley paradox? Philosophy of Science, 80 (1), 73–93. https://doi.org/10.1086/673730.
Sprenger, J. (2013). Testing a Precise Null Hypothesis: The Case of Lindley’s Paradox. Philosophy of Science, 80 (5), 733–744.
Sprenger, J. and Hartmann, S. (2019). Bayesian Philosophy of Science. Oxford University Press. https://doi.org/10.1093/oso/9780199672110.001.0001.
Wasserstein, R.L., Schirm, A.L. and Lazar, N.A. (2019). Moving to a World Beyond “p<0.05”. The American Statistician, 73 (sup1), 1–19.
Zumbo, B.D. and Kroc, E. (2016). Some Remarks on Rao and Lovric’s “Testing Point Null Hypothesis of a Normal Mean and the Truth: 21st Century Perspective”. Journal of Modern Applied Statistical Methods, 15 (2), 11–2016. Retrieved from http://digitalcommons.wayne.edu/jmasm. https://doi.org/10.22237/jmasm/1478001780.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Ethics declarations
Competing interests
The authors report there are no competing interests to declare.
Cite this article
Kelter, R. The Case of the Jeffreys-Lindley Paradox as a Bayes-frequentist Compromise: A Perspective Based on the Rao-Lovric Theorem. Sankhya A 86, 337–363 (2024). https://doi.org/10.1007/s13171-023-00321-x