1 Introduction

Testing a point null hypothesis is a highly controversial topic in statistical science and of general interest to a broad range of domains. Although various authors have embraced a compromise between Bayesian and frequentist modes of statistical inference (Berger et al. 1997, 1994; Good 1981, 1992), the divergence between Bayesian and frequentist solutions to the test of a parametric point null hypothesis \(H_0:\theta =\theta _0\) versus its alternative \(H_1:\theta \ne \theta _0\) for some \(\theta _0 \in \Theta \) has led to heated debates about the right mode of inference in the past decades. These differences manifest themselves prominently in the Jeffreys-Lindley paradox, which is consequently still often seen as an obstacle to what Good (1992) called a Bayes-non-Bayes compromise. The Jeffreys-Lindley paradox was first mentioned by Jeffreys (1939), Good (1950) and Lindley (1957) and has been discussed extensively in the statistical literature, see for example Good (1985), Berger (1985) or Naaman (2016). The paradox has also attracted attention in the philosophy of science literature, and resolutions were presented by Spanos (2013), Sprenger (2013) and Robert (2014). Importantly, the paradox is sometimes used as an argument in favor of or against a specific statistical method, depending on whether the procedure suffers from the paradox occurring or not (Kelter, 2021c; Ly and Wagenmakers, 2021). Resolutions to the paradox range from not being impressed by the divergences at all (Spanos, 2013), because the Bayesian and frequentist approaches to statistical inference make considerably differing assumptions, through attributing the divergences largely to the poor performance of improper priors for testing a point null hypothesis (Robert, 2014), to shifting the focus to different statistical techniques (Sprenger, 2013; Naaman, 2016; Kelter, 2021c).

In this paper, the following questions are considered:

  1. Why does Lindley’s paradox occur from a mathematical perspective?

  2. Why does Lindley’s paradox occur from an extra-mathematical perspective? That is, which arguments can be seen as causal for the occurrence of the paradox without deriving their force from probability theory?

  3. What are its implications for the methodological debate between Bayesian and frequentist modes of statistical inference in hypothesis testing?

The plan of the paper is as follows: First, Section 2 details the setting of the Jeffreys-Lindley paradox and sets the stage for this article. Then, the point-null-zero-probability paradox is detailed in Section 3. The first question leads to the measure-theoretic assumptions made in Bayesian and frequentist inference. Based on the point-null-zero-probability paradox, the different measure-theoretic bases of the frequentist and Bayesian approaches – in which tests of a precise null hypothesis are framed – are shown to form the foundation on which the Jeffreys-Lindley paradox manifests itself.

The second question deals with whether and when the test of a precise null hypothesis is appropriate. It is analyzed in Section 4 based on the point-null-zero-probability paradox, and the frequentist and Bayesian perspectives on it are treated separately. It is shown that the occurrence of Lindley’s paradox is closely tied to the validity of a precise hypothesis in scientific research, that is, to whether the null hypothesis is assumed to be precise or not. This relates the Jeffreys-Lindley paradox directly to the point-null-zero-probability paradox. It also provides a broader perspective on the purely mathematical arguments which answer the first question.

Based on this analysis it is proven in Section 4 that the paradox resolves when shifting to the appropriate frame of inference, both under the perspective that precise hypotheses exist – Section 4.1 – and under the perspective that precise hypotheses are always false – Section 4.2.

Regarding the third question, Section 4 also shows that, besides the form of the statistical hypotheses under consideration, a major reason for the Jeffreys-Lindley paradox is that p-values are not standardized tail-area probabilities. This latter fact allows a realignment of the Bayesian and frequentist solutions to the test of a precise null hypothesis and has to date been largely ignored in the literature. The key results of Section 4 are then summarized in Section 5, which details the relationship between both paradoxes.

Section 6 provides a conclusion and shows that the Jeffreys-Lindley paradox should not – as is often the case – be described as a phenomenon separating Bayesian and frequentist inference, but can rather be taken as a unifying fact which emphasizes the necessary shift towards what Rao and Lovric (2016) called a 21st century perspective on statistical hypothesis testing when the validity of a precise hypothesis is questioned. The latter holds in particular in the biomedical and social sciences.

2 The Jeffreys-Lindley paradox

For illustration purposes and to set the stage for this article, a simple example from Berger (1985) is revisited. Suppose a sample \(X:=(X_1,...,X_n)\) from a \(\mathcal {N}(\theta ,\sigma ^2)\) distribution is taken with \(\sigma ^2>0\) assumed to be known for simplicity. Reduction to the sufficient statistic \(\bar{X}\) then yields a likelihood function of \(\bar{X}\) which is proportional to a \(\mathcal {N}(\bar{x},\frac{\sigma ^2}{n})\) probability density for \(\theta \):

$$\begin{aligned} f(\bar{x}\mid \theta )=\frac{1}{\sqrt{2\pi \sigma ^2/n}}\exp \left[ -\frac{n}{2\sigma ^2}(\theta -\bar{x})^2 \right] \end{aligned}$$
(1)

Assuming a \(\mathcal {N}(\mu ,\tau ^2)\) prior on \(\theta \) under the alternative \(H_1\) when testing \(H_0:\theta =\theta _0\) versus \(H_1:\theta \ne \theta _0\), the marginal likelihood under \(H_1\) is a \(\mathcal {N}(\mu ,\tau ^2+\sigma ^2/n)\) density (Berger, 1985, p. 127-128). Assuming \(\mu =\theta _0\) for the prior on \(\theta \ne \theta _0\) (which is reasonable since values close to \(\theta _0\) would often be assumed more likely a priori than values far away from \(\theta _0\)), the posterior probability \(P(\theta _0 \mid x)\) is given as

$$\begin{aligned} \frac{P(\theta \ne \theta _0\mid x)}{P(\theta =\theta _0\mid x)}&=\frac{m_1(x)}{m_0(x)}\frac{P(\theta \ne \theta _0)}{P(\theta =\theta _0)} \Leftrightarrow \frac{1-P(\theta =\theta _0\mid x)}{P(\theta =\theta _0 \mid x)}= \frac{m_1(x)}{m_0(x)} \frac{P(\theta \ne \theta _0)}{P(\theta =\theta _0)}\nonumber \\&\Leftrightarrow \left[ 1+\frac{m_1(x)}{m_0(x)} \frac{P(\theta \ne \theta _0)}{P(\theta =\theta _0)}\right] ^{-1}=P(\theta =\theta _0\mid x) \end{aligned}$$
(2)

which for \(m_0(x)=\frac{1}{\sqrt{2\pi \sigma ^2/n}}\exp \left[ -\frac{(\bar{x}-\theta _0)^2}{2\sigma ^2/n}\right] \) and \(m_1(x)=\frac{1}{\sqrt{2\pi (\tau ^2+\sigma ^2/n)}}\exp \left[ -\frac{(\bar{x}-\theta _0)^2}{2[\tau ^2+\sigma ^2/n]}\right] \) reduces to

$$\begin{aligned} P(\theta =\theta _0\mid x)=\left[ 1+\frac{\exp \left( \frac{1}{2}z^2 [1+\sigma ^2/(n\tau ^2)]^{-1}\right) }{(1+n\tau ^2/\sigma ^2)^{\frac{1}{2}}} \frac{P(\theta \ne \theta _0)}{P(\theta =\theta _0)}\right] ^{-1} \end{aligned}$$
(3)

where \(z=\sqrt{n}\mid \bar{x}-\theta _0\mid /\sigma \) is the usual statistic for testing \(H_0:\theta =\theta _0\) in a two-sided Gauss test; for details see (Berger, 1985, p. 151). Under \(H_0\), the statistic \(Z=\sqrt{n}(\bar{X}-\theta _0)/\sigma \) is exactly standard normally distributed, \(Z \sim \mathcal {N}(0,1)\), since the data themselves are normal. To illustrate the Jeffreys-Lindley paradox, note that

$$\begin{aligned} P(\theta =\theta _0\mid x)\ge \left[ 1+\frac{\exp \left( \frac{1}{2}z^2\right) }{(1+n\tau ^2/\sigma ^2)^{\frac{1}{2}}} \frac{P(\theta \ne \theta _0)}{P(\theta =\theta _0)}\right] ^{-1} \end{aligned}$$
(4)

so for fixed and equal prior probabilities \(P(H_0)=P(\theta =\theta _0)=\frac{1}{2}\), \(P(H_1)=P(\theta \ne \theta _0)=\frac{1}{2}\) (which expresses that \(H_0\) and \(H_1\) are equally probable a priori), hyperparameters set to \(\mu =\theta _0\) and \(\tau =\sigma \), and any fixed z statistic, a lower bound for the posterior probability \(P(H_0\mid x)=P(\theta =\theta _0\mid x)\) can be obtained from Eq. 4. Now, for fixed \(z=1.96\), which corresponds to a p-value of \(p=P(\mid Z \mid>1.96 \mid H_0)=P(\mid Z \mid > 1.96 \mid \theta =\theta _0)=0.05\), a frequentist hypothesis test would reject \(H_0\) at the test level \(\alpha =0.05\).

However, from Eq. 4 it follows that for fixed \(z=1.96\), a Bayesian will arrive at different posterior probabilities for varying values of the sample size n as shown in Table 1.

Table 1 Posterior probabilities of \(H_0:\theta =\theta _0\) in the example of Berger (1985) illustrating the Jeffreys-Lindley paradox

For example, fixing \(z=1.96\) yields \(P(\theta =\theta _0\mid x)=0.35\) for \(n=1\), so that about two thirds of the posterior probability mass speaks against \(H_0\), while for \(n=1000\), this fraction has shrunk to only about a fifth. In the latter case, a Bayesian will – given his prior choice – readily state that the posterior probability of \(H_0\) is \(80\%\), which is in clear contrast to the rejection of \(H_0\) by the frequentist.
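
To make these numbers concrete, the following minimal Python sketch (not part of the original analysis) evaluates Eq. 3 under the choices used in the text, namely \(\mu =\theta _0\), \(\tau =\sigma \) and equal prior probabilities for \(H_0\) and \(H_1\); its output can be compared against Table 1, with small deviations possibly due to rounding.

```python
# Minimal sketch (assumptions: mu = theta_0, tau = sigma, P(H_0) = P(H_1) = 1/2)
# evaluating Eq. (3) for fixed z = 1.96 and increasing sample size n.
import math

def posterior_h0(z: float, n: int, sigma: float = 1.0, tau: float = 1.0,
                 prior_odds_h1: float = 1.0) -> float:
    """Posterior probability P(theta = theta_0 | x) according to Eq. (3)."""
    bf_10 = math.exp(0.5 * z ** 2 / (1.0 + sigma ** 2 / (n * tau ** 2))) \
        / math.sqrt(1.0 + n * tau ** 2 / sigma ** 2)
    return 1.0 / (1.0 + bf_10 * prior_odds_h1)

for n in (1, 10, 100, 1000, 10000):
    print(f"n = {n:5d}   P(H_0 | x) = {posterior_h0(1.96, n):.2f}")
```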

The divergence between both approaches remains when using Bayes factors instead of posterior probabilities, or when selecting test levels \(\alpha \) other than \(\alpha =0.05\) for the p-value. In total, this leads to a salient divergence between the frequentist and Bayesian solutions to the test of a precise null hypothesis. In sum, the Jeffreys-Lindley paradox can be distilled into the following form, which is attributed to Lindley (1957, p. 187):

Jeffreys-Lindley Paradox: Let \(\mathcal {N}(\theta ,\sigma ^2)\) be a Gaussian statistical model with \(\sigma ^2>0\) known and consider the test of \(H_0:\theta =\theta _0\) versus \(H_1:\theta \ne \theta _0\) under prior probability \(P(H_0)>0\) and any proper prior distribution on \(H_1\). Then, for any choice of test level \(\alpha \in [0,1]\) there exists a sample size \(N(\alpha )\) and an independent and identically distributed sample x for which the sample mean \(\bar{x}\) is significantly different from \(\theta _0\) at level \(\alpha \) and the posterior probability \(P(H_0\mid x)\ge 1-\alpha \).

For details, see also Sprenger (2013, p. 734-735), who provides an example of Lindley’s paradox in the context of extrasensory perception (ESP), and Robert (2014, p. 217).

3 The point-null-zero-probability paradox

An old criticism of statistical hypothesis testing concerns the “relevance of point null hypotheses” (Robert, 2016, p. 5). Criticisms that point null hypotheses are usually unrealistic evolved as the result of a still ongoing debate among statisticians and philosophers of science alike over nearly the last century (Buchanan-Wollaston, 1935). For example, Good (1950, p. 90) argued that when testing the fairness of a die “From one point of view it is unnecessary to look at the statistics since it is obvious that no die could be absolutely symmetrical.” In a footnote, he added: “It would be no contradiction (...) to say that the hypothesis that the die is absolutely symmetrical is almost impossible. In fact, this hypothesis is an idealised proposition rather than an empirical one.” (Good, 1950, p. 90). Similar arguments were brought forward by Savage (1954, p. 332-333), who stressed that “null hypotheses of no difference are usually known to be false before the data are collected” and “their rejection ... is not a contribution to science”.

On the other hand, Good (1994, p. 241) noted that there is at least one example of a precise hypothesis, which states that there is no extrasensory perception. Also, in physics or chemistry the null hypothesis may correspond to a general law or natural constant (Jeffreys, 1939), which may not be measurable with absolute precision but still may have the form of a precise hypothesis.

Although in the majority of cases in the medical and social sciences the assumption of a precise hypothesis is questionable, one approach to save precise hypothesis testing was pursued by Berger and Sellke (1987), who showed that for reasonably small interval hypotheses, point null hypotheses are at least useful approximations (Berger and Delampady, 1987, Theorem 2). Thus, the prior probability \(P(H_0)\) which is assigned to the null value \(\theta _0\) in \(H_0:\theta =\theta _0\) should actually be interpreted as the probability mass allocated to a small area around \(\theta _0\). Good (1994) argued similarly that the precise null hypothesis is “often a good enough approximation” (Good, 1994, p. 241). However, Bernardo (1999) showed that the quality of the approximation decreases for growing sample size, and Rousseau (2007) showed that for large sample sizes the Bayes factor for a point null hypothesis is no longer a reasonable approximation of the Bayes factor for an interval hypothesis, unless the interval sizes are extremely small. Below it will be shown that this is one reason for the Jeffreys-Lindley paradox to occur.

Based on the broad consensus that precise hypotheses are seldom realistic for scientific research, Rao and Lovric (2016) posed the question why statisticians and non-statisticians keep testing point null hypotheses, “when it is known in advance they are almost never exactly true in the real world” (Rao and Lovric, 2016, p. 6).

They called the result below the (point null) zero-probability paradox; it shows that the probability of a point null hypothesis \(H_0:\theta =\theta _0\) about the mean of a normally distributed population is zero, where \(H_1:\theta \ne \theta _0\) is the alternative, and \(\Theta =\Theta _{\mathbb {Q}}\cup \Theta _{\mathbb {R}\setminus \mathbb {Q}}\) is the parameter space, \(\mathbb {Q}\) are the rational numbers and thus \(\Theta \) is the disjoint union of \(\Theta _{\mathbb {Q}}\) and \(\Theta _{\mathbb {R}\setminus \mathbb {Q}}\).

Theorem 1

(Rao-Lovric) The probability of the null hypothesis \(H_0:\theta =\theta _0\) (about the mean of a normal population \(\mathcal {N}(\theta ,\sigma ^2)\)) is equal to zero, that is, \(P(\{H_0 \mid \theta _0 \in \mathbb {Q}\})=0\).

Based on their result, Rao and Lovric (2016) make the strong (and debatable) statement that this “inequivocably amounts to the deduction that any single-point null hypothesis about the normal mean has also probability zero.” (Rao and Lovric, 2016, p. 10). Although the set of hypotheses formally includes only rational values, the fact that the latter are dense in the real numbers means that any arbitrarily fine measurement precision can be captured by rational numbers. Thus, the result generalizes to all practically measurable values \(\theta _0\). Further discussions of the result are provided in Sawilowsky (2016) and Zumbo and Kroc (2016).

Rao and Lovric (2016) conclude that the term testing thus is a misnomer and should be replaced by the term inexactification, referring to the earlier proposal of Good (1993). Their recommendation is to shift towards what they call the Hodges-Lehmann paradigm, based on the seminal work of Hodges and Lehmann (1954), who proposed to replace the test of a point null hypothesis with the test of the hypotheses

$$\begin{aligned} H_0:\mid \theta -\theta _0\mid \le \delta \text { versus } H_1:\mid \theta -\theta _0\mid > \delta \end{aligned}$$
(5)

where the interval null hypothesis \(H_0\) postulates that the deviation of \(\theta \) from \(\theta _0\) is at most a negligible effect size \(\delta \), while the alternative \(H_1\) states a practically meaningful effect size.
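
To illustrate how such an interval hypothesis could be tested in practice, the following Python sketch computes the posterior probability of the Hodges-Lehmann null hypothesis in Eq. 5 under a conjugate normal prior on \(\theta \); the prior choice, the numerical values and the function name are illustrative assumptions and are not taken from Rao and Lovric (2016) or Hodges and Lehmann (1954).

```python
# Hypothetical sketch: posterior probability of |theta - theta_0| <= delta
# (the interval null in Eq. (5)) under an assumed conjugate N(m, v) prior
# on theta and a normal likelihood with known variance sigma^2.
import math
from statistics import NormalDist

def posterior_prob_interval_null(xbar: float, n: int, sigma: float,
                                 theta0: float, delta: float,
                                 m: float = 0.0, v: float = 1.0) -> float:
    """P(|theta - theta0| <= delta | data) under a N(m, v) prior."""
    post_var = 1.0 / (1.0 / v + n / sigma ** 2)            # conjugate update
    post_mean = post_var * (m / v + n * xbar / sigma ** 2)
    post = NormalDist(post_mean, math.sqrt(post_var))
    return post.cdf(theta0 + delta) - post.cdf(theta0 - delta)

# Example: a sample mean corresponding to z = 1.96 at n = 100 with sigma = 1.
print(posterior_prob_interval_null(xbar=0.196, n=100, sigma=1.0,
                                   theta0=0.0, delta=0.1))
```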

Note that for a frequentist the true but unknown parameter \(\theta _0 \in \Theta \) is fixed and not random, so any probability statement like the one made in the Rao-Lovric theorem becomes pointless. A frequentist can therefore safely escape the consequences of the result, while a Bayesian will readily accept it when using a prior distribution that is absolutely continuous with respect to the dominating measure \(\mu \) of the statistical model \(\mathcal {P}\). Under such a prior, the prior probability of any single value \(\theta \in \Theta \) is zero (compare also Robert (2007, p. 221), Berger (1985, p. 127-130) and Kelter (2021c)), and therefore \(P(\{H_0 \mid \theta _0 \in \mathbb {Q}\})=0\) as stated in the Rao-Lovric theorem follows immediately under these conditions.

However, the decision to accept or reject the existence of a precise hypothesis does not need to rely on mathematical arguments such as the Rao-Lovric theorem. It can also be made based on extra-mathematical arguments, such as holding that an effect size difference of exactly zero between a treatment and a control group in a clinical trial is unrealistic and will never occur. In light of such arguments, the Rao-Lovric theorem merely formalizes whether probability theory allows one to test a precise hypothesis or not. It does not mandate whether one should believe in the existence of precise hypotheses or not.

Only extra-mathematical arguments can determine whether a probability measure should be associated with the parameter – effectively rendering it a random variable and implying a Bayesian mode of inference – or whether the parameter carries no probabilistic element – effectively rendering it a fixed but unknown constant and mandating a frequentist approach.

Both Bayesians and frequentists can accept or deny the existence of precise hypotheses. A frequentist does not need to refer to probability statements about the parameter to do so (in fact he can’t). A Bayesian must align his prior distribution with his extra-mathematical beliefs about the existence of precise hypotheses. In particular, when a Bayesian believes in such hypotheses he must (artificially) assign some positive probability mass \(\varrho >0\) to the theoretically interesting null value \(\theta _0\) specified in \(H_0:\theta =\theta _0\).

4 A Reanalysis of the Jeffreys-Lindley paradox

In this section, a reanalysis of the Jeffreys-Lindley paradox under two different perspectives is undertaken. First, the Rao-Lovric theorem is taken for granted and it is assumed that \(H_0:\theta =\theta _0\) is always false, no matter which value \(\theta _0 \in \Theta \) is specified in \(H_0\). Second, the Rao-Lovric theorem is not taken for granted. This latter perspective can be justified by extra-mathematical arguments that lead one to believe in the existence of a precise hypothesis \(H_0:\theta =\theta _0\). This belief is decoupled from whether a mathematical result postulates that the probability of such a hypothesis must be zero in a formal axiomatic system like probability theory. One can believe in extrasensory perception or in the existence of precise hypotheses in physics, for example. Whether probability theory allows such hypotheses to be tested statistically is a different issue, as shown by the Rao-Lovric theorem. Importantly, however, the beliefs about a precise hypothesis must be incorporated into the form of the hypothesis that is actually tested. This latter fact will demonstrate that the point-null-zero-probability paradox is relevant for the occurrence of Lindley’s paradox.

In the following Subsection 4.1, it is proven that a Bayes-frequentist compromise can be reached in the setting of the Jeffreys-Lindley paradox when precise hypotheses are assumed to exist. In the subsequent Subsection 4.2 it is proven that the same holds when precise hypotheses are assumed to be always false. These new results indicate that a reconciliation of the Bayesian and frequentist solutions in the Jeffreys-Lindley paradox can be reached when adjusting for the point-null-zero-probability paradox.

4.1 First case: Precise Hypotheses \(H_0:\theta =\theta _0\) can be true

In this first perspective it is assumed that the precise hypothesis \(H_0:\theta =\theta _0\) is true (that is, precise hypotheses do exist). The situation in the Jeffreys-Lindley paradox can then be analyzed as follows: The Bayesian solution accepts \(H_0\), which is the correct choice in this case. The hypothesis is accepted or confirmed for increasing sample size n, as the posterior probability \(P(H_0\mid x)\rightarrow 1\). Jeffreys (1939) proposed the now well-established mixture prior

$$\begin{aligned} P_{\vartheta } = \varrho \cdot \mathcal {E}_{\theta _0}+(1-\varrho )\cdot P_{\vartheta }^{\Theta _1} \end{aligned}$$
(6)

as the prior distribution \(P_\vartheta \) for the parameter \(\theta \), where \(\varrho \in (0,1)\) determines the prior probability mass assigned to the null value \(\theta _0\), \(\mathcal {E}_{\theta _0}\) is the Dirac measure in \(\theta _0 \in \Theta \), and \(P_\vartheta ^{\Theta _1}\) is a suitable probability measure on the alternative hypothesis space \(\Theta _1:=\{\theta \in \Theta \mid \theta \ne \theta _0 \}\). The standard Bayesian solution to hypothesis testing based on posterior probabilities is thus primarily able to confirm \(H_0\) in the Jeffreys-Lindley paradox when \(H_0:\theta =\theta _0\) is true because the prior explicitly assigns probability mass \(\varrho >0\) to the null hypothesis \(H_0\). If an absolutely continuous prior were used instead, or if one picked \(\varrho =0\), the situation would collapse and \(P(H_0\mid x)=0\) would hold for all \(n\in \mathbb {N}\), aligning the Bayesian solution with the frequentist one in the Jeffreys-Lindley paradox (Kelter, 2021c; Ly and Wagenmakers, 2021), because then both the frequentist and the Bayesian solution would reject \(H_0:\theta =\theta _0\).

What about the frequentist solution in this first case? The p-value is significant for \(z=1.96\), so the frequentist would reject \(H_0\) although it was assumed to be true, which seems to lead to a false decision. However, Greenland (2019) stressed that p-values behave exactly as they should, but that their scaling as measures of evidence against \(H_0\) is sometimes flawed. I concur with this perspective and adopt an example of Good (1985, p. 260) below to illustrate this point. The example shows that a small p-value (or equivalently, the fixed test statistic \(z=1.96\)) can lose its interpretation as evidence against \(H_0:\theta =\theta _0\) once the sample size n becomes large enough.

Reconsider the example of Berger (1985) which was used to illustrate the Jeffreys-Lindley paradox earlier. There, \(z=1.96\) was fixed and the goal was to test \(H_0:\theta =\theta _0\) against \(H_1:\theta \ne \theta _0\). Now, measurement precision is always finite, and it is assumed that the measurement precision in the example allows recording up to two decimal digits in the application context (other values could be chosen without affecting the argument below). Some possible values for the parameter \(\theta \in \Theta \) are shown in Fig. 1.

Fig. 1 Relation between p-values and evidence against \(H_0\)

For simplicity, assume that \(\theta _0=0\) in \(H_0:\theta =\theta _0\), but the argument applies likewise for arbitrary \(\theta _0 \in \Theta \). Henceforth, we also denote \(\bar{X}\) as \(\bar{X}_n\), solely to stress the dependence on the sample size n. The latter is relevant because the evidential interpretation of \(\bar{X}_n\) for small and large n differs substantially when \(X_i {\mathop {\sim }\limits ^{i.i.d.}} P_{\theta _0}\). The arrow with \(\bar{X}_n\) indicates where the sample mean is located, and as \(H_0\) is assumed to be true, for \(n\rightarrow \infty \) we have \(\bar{X}_n \rightarrow \theta _0=0\) almost surely under the distribution \(P_{\theta _0}\) of \(H_0\) by the strong law of large numbers (SLLN) (Bauer, 2001). Thus, for large enough n the distance \(\mid \bar{X}_n-\theta _0\mid \) between \(\theta _0=0\) and \(\bar{X}_n\) will be smaller than the distance \(\mid \bar{X}_n-\theta _1\mid \) between \(\bar{X}_n\) and \(\theta _1=0.01\) on the right-hand side of \(\bar{X}_n\) in Fig. 1. Thus, from the SLLN we have the following proposition:

Proposition 2

Under \(X_i {\mathop {\sim }\limits ^{\text {i.i.d.}}} P_{\theta _0}\), the SLLN implies that for every fixed \(\theta ' \in \Theta \) with \(\theta ' \ne \theta _0\), \(\mid \bar{X}_n-\theta _0\mid <\mid \bar{X}_n-\theta '\mid \) holds for \(n\rightarrow \infty \) \(P_{\theta _0}\)-almost-surely.

Now, as \(z=1.96\) is fixed, the tail-area probability corresponding to \(p=0.05\) stays fixed for every n, while \(\mid \bar{X}_n-\theta _0\mid \rightarrow 0\) under \(P_{\theta _0}\). However, in a normal model with known \(\sigma ^2 >0\) and unknown mean \(\theta \in \mathbb {R}\) (the setting of the JL-paradox), \(\bar{X}_n\) is a sufficient statistic (Rüschendorf, 2014, Theorem 4.1.18). Thus, the sufficient statistic converges \(P_{\theta _0}\)-almost-surely to the value \(\theta _0\) specified in \(H_0:\theta =\theta _0\), but \(z=1.96\) (respectively \(p=0.05\)) stays constant.
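
A small numeric illustration (a sketch under the assumptions \(\theta _0=0\) and \(\sigma =1\), not taken from Good or Berger) shows how holding \(z=1.96\) fixed forces the distance \(\mid \bar{X}_n-\theta _0\mid =z\sigma /\sqrt{n}\) implied by the test statistic to shrink with n, while \(p=0.05\) stays constant:

```python
# Sketch: for fixed z = 1.96, the sample mean consistent with that statistic
# approaches theta_0 as n grows, while the corresponding p-value stays at 0.05.
# The inverse distance printed here is formalized as Ev(theta_0) in Definition 1 below.
import math

z, sigma, theta0 = 1.96, 1.0, 0.0
for n in (10, 100, 1000, 100000):
    xbar = theta0 + z * sigma / math.sqrt(n)   # sample mean implied by z = 1.96
    dist = abs(xbar - theta0)
    print(f"n = {n:6d}   |xbar - theta0| = {dist:.5f}   1/|xbar - theta0| = {1 / dist:.1f}")
```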

As a consequence, the p-value corresponding to \(z=1.96\) becomes more and more supportive of \(H_0:\theta =0\) (as the sufficient statistic \(\bar{X}_n\) converges to \(\theta _0=0\)), and less supportive of any alternative value \(\theta _1\ne \theta _0\) located in \(H_1:\theta \ne \theta _0\). In the words of Good (1985), this shows “that a given P-value means less for large N” (Good, 1985, p. 260). Alluding to sufficiency, the above line of thought can be formalized as follows:

Definition 1

(Statistical evidence) Let \(T_n:\mathcal {Y}\rightarrow \Theta \) be a sufficient and consistent estimator for the parameter \(\theta \in \Theta \). The statistical evidence \(\text {Ev}:\Theta \rightarrow [0,\infty )\), \(\theta \mapsto \text {Ev}(\theta )\) provided in favor of \(\theta \in \Theta \) is given as \(\text {Ev}(\theta ):=1/\mid T_n(y)-\theta \mid \).

The above definition implements the abstract statistical evidence \(Ev(H_0):=Ev(\theta _0)\) in the sense of Birnbaum (1962) by means of the inverse absolute distance of a sufficient and consistent estimator \(T_n\) from a given value \(\theta \in \Theta \). When \(T_n\) is close to \(\theta _0\), \(\theta _0\) denoting the true parameter value, the statistical evidence for \(\theta _0\) is large. Note that this does not require a frequentist or Bayesian position, because the sufficiency principle is accepted by both frequentists and Bayesians (Berger and Wolpert, 1988; Grossman, 2011). Based on Definition 1 and Proposition 2 we obtain:

Theorem 3

Suppose statistical evidence is measured as \(\text {Ev}(\theta ):=1/\mid T_n(y)-\theta \mid \), where \(T_n:\mathcal {Y}\rightarrow \Theta \) is a sufficient and consistent estimator for the parameter \(\theta \in \Theta \). Suppose \(X_i {\mathop {\sim }\limits ^{\text {i.i.d.}}} P_{\theta _0}\) with \(P_{\theta _0}=\mathcal {N}(\theta _0,\sigma ^2)\) as specified in the Jeffreys-Lindley-paradox. Then, \(\text {Ev}(\theta ')<\text {Ev}(\theta _0)\) for \(n\rightarrow \infty \), \(P_{\theta _0}\)-almost-surely, for all \(\theta '\in \Theta , \theta ' \ne \theta _0\). That is, the statistical evidence favors \(H_0:\theta =\theta _0\) in the JL-paradox.

Proof

Fix a \(\theta ' \in \Theta \) with \(\theta ' \ne \theta _0\). As \(X_i {\mathop {\sim }\limits ^{\text {i.i.d.}}} P_{\theta _0}\), Proposition 2 implies that

$$\begin{aligned} \mid \bar{X}_n-\theta '\mid >\mid \bar{X}_n-\theta _0\mid \end{aligned}$$
(7)

for \(n\rightarrow \infty \) \(P_{\theta _0}\)-almost-surely. As by assumption \(P_{\theta _0}=\mathcal {N}(\theta _0,\sigma ^2)\) with \(\sigma ^2>0\) known, \(\bar{X}_n:=\frac{1}{n}\sum _{i=1}^n X_i\) is a sufficient statistic for the mean and also consistent. Thus, from Definition 1, \(\text {Ev}(\theta ):=1/\mid \bar{X}_n-\theta \mid \), and thereby Eq. 7 implies \(\text {Ev}(\theta ')<\text {Ev}(\theta _0)\) for \(n\rightarrow \infty \), \(P_{\theta _0}\)-almost-surely. As \(\theta '\) was chosen arbitrarily, the same holds for all \(\theta '\in \Theta , \theta ' \ne \theta _0\) and the conclusion follows.

Three points are important to note here:

\(\blacktriangleright \):

The above behavior occurs although the exceedance probability of the test statistic, \(P_{\theta _0}(Z>1.96)\), is independent of n. For each n, \(Z\sim \mathcal {N}(0,1)\), but here we fix a value \(z=1.96\) corresponding to \(p=0.05\) and only then investigate the evidential interpretation of the p-value for increasing n. We make no reference to the distribution of Z; in particular, we do not state that the distribution of the p-value changes (and that, as a consequence, its interpretation should change).

\(\blacktriangleright \):

Theorem 3 makes explicit Good’s argument that a given p-value of \(p=0.05\) means less for large sample size. However, it also shows that this argument rests crucially on the premise that \(1/\mid \bar{X}_n-\theta _0\mid \) measures the strength of evidence concerning \(H_0\) in a sensible way. One could object to this premise that the test statistic \(Z(X):=\sqrt{n} \frac{\mid \bar{X}_n-\theta _0\mid }{\sigma }\), scaled by \(\sigma \) and \(\sqrt{n}\), should be used instead to measure this evidence. However, note that for fixed \(z=1.96\) and increasing n with any fixed \(\sigma >0\), one still has \(\mid \bar{X}_n-\theta _0\mid \rightarrow 0\) in order to keep \(z=1.96\) fixed, so one also obtains \(\text {Ev}(\theta _0)\rightarrow \infty \) under \(P_{\theta _0}\). Thus, the sufficient statistic \(\bar{X}_n\) again indicates evidence in favor of \(H_0\) although \(p=0.05\) remains statistically significant.

\(\blacktriangleright \):

The original argument of Good does not make use of the concept of sufficiency to motivate the evidential meaning of the absolute distance \(\mid \bar{X}_n-\theta _0\mid \). Definition 1 instead does, and this comes at the price of formalizing the abstract notion of statistical evidence, see Birnbaum (1962). A hard-nosed sceptic can thus reject the conclusion by rejecting Definition 1. However, this in turn comes at the price of being suspicious about the information provided by a sufficient (and consistent) estimator. We will return to this weak spot later.

In the context of the Jeffreys-Lindley paradox, Theorem 3 demonstrates that the small p-value corresponding to the fixed z-statistic can lose its interpretation as evidence against \(H_0:\theta =\theta _0\) for growing \(n\rightarrow \infty \). In the limiting case, as shown above, the p-value even signals evidence in favor of the value \(\theta _0\) specified in \(H_0:\theta =\theta _0\) compared to all other values of \(\theta \). The principal assumption needed to arrive at this conclusion is to accept the concept of sufficiency of an estimator and to agree that Definition 1 constitutes a reasonable measure of statistical evidence. The latter is neither frequentist nor Bayesian but measure-theoretic, as sufficiency is defined down at the sigma-algebraic level (Schervish, 1995; Bauer, 2001).

Thus, under the assumption that \(H_0\) is true, the Bayesian solution correctly accepts \(H_0\), and the frequentist solution seems to reject \(H_0\) solely because the p-value is not a measure of evidence standardized with respect to the sample size n. An appealing proposal made by Good (1985) is to standardize tail-area probabilities such as p-values to the sample size n. In the above example, “if a small tail-area probability P occurs with sample size N we would say it is equivalent to a tail-area probability of \(P\sqrt{100/N}\) for a sample of 100 if this is also small” (Good, 1985, p. 260). Thus, when observing \(\tilde{p}=0.05\) for \(n=1000\), for which the posterior probability in the example of Berger (1985) resulted in \(P(H_0\mid x)=0.80\), the equivalent p-value for a sample of size \(n=100\) would be

$$\begin{aligned} \tilde{p}=p\cdot \sqrt{100/n}\Leftrightarrow 0.05 = p\sqrt{100/1000} \Leftrightarrow p=0.05\sqrt{1000/100}\Leftrightarrow p=0.1581 \end{aligned}$$
(8)

which is far away from any conventional threshold for statistical significance. Based on \(p=0.1581\) which is standardized to the sample size \(n=100\), one would not reject \(H_0:\theta =\theta _0\).
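
A one-line computation makes this standardization concrete; the following sketch rescales a p-value observed at sample size n to the reference sample size 100 as in Eq. 8 (the cap at 1 is an added assumption, merely to keep the result a probability).

```python
# Sketch of Good's standardization of tail areas to a reference sample size
# of 100 (cf. Eq. (8)): the p-value observed at sample size n is rescaled by
# sqrt(n/100); the cap at 1 is an added assumption to keep it a probability.
import math

def standardized_p(p_observed: float, n: int) -> float:
    return min(1.0, p_observed * math.sqrt(n / 100))

print(standardized_p(0.05, 1000))   # approx. 0.158, as in Eq. (8)
```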

Theorem 3 shows that the usual practice of interpreting p only as evidence against \(H_0\) is problematic, as for large sample sizes n the p-value can – as shown above – become evidence in favor of \(H_0:\theta =\theta _0\). However, p-values in a Fisherian sense are almost always used only to quantify the evidence against a null hypothesis (Greenland, 2019; Kelter, 2021a; Rafi and Greenland, 2020).

A weak spot in Theorem 3 is that it depends on Definition 1. Thus, as the log-Bayes factor provides a natural explication of the weight of evidence for a Bayesian according to Jeffreys (1961), Good (1985) or Sprenger and Hartmann (2019) (for an axiomatic derivation see Good (1960)), a second option to arrive at the conclusion of Theorem 3 is given below:

Definition 2

(Weight of evidence) Let \(\theta _0,\theta ' \in \Theta \), \(\theta _0 \ne \theta '\). The weight of evidence \(Ev:\Theta \rightarrow \mathbb {R}, \theta ' \mapsto Ev(\theta ')\) against \(\theta _0\) compared to \(\theta '\) provided by the observed data \(x\in \mathcal {Y}\) is given as \(Ev(\theta '):=\log [\text {BF}_{10}(x)]\), where \(\text {BF}_{10}(x)\) denotes the Bayes factor against \(H_0:\theta =\theta _0\) compared to the hypothesis \(H':\theta =\theta '\) based on x.

Thus, weight of evidence is precisely the log-Bayes factor when we compare two precise hypotheses \(H_0:\theta =\theta _0\) and \(H':\theta =\theta '\). Weight of evidence is even the log-Bayes factor for general hypotheses, but for our purposes this suffices. We obtain:

Theorem 4

Suppose statistical evidence is measured as \(\text {Ev}(\theta '):=\log [\text {BF}_{10}(x)]\) and assume \(X_i {\mathop {\sim }\limits ^{\text {i.i.d.}}} P_{\theta _0}\) with \(P_{\theta _0}=\mathcal {N}(\theta _0,\sigma ^2)\) as specified in the Jeffreys-Lindley-paradox. Then, \(\text {Ev}(\theta ')<\text {Ev}(\theta _0)\) for \(n\rightarrow \infty \), \(P_{\theta _0}\)-almost-surely, for all \(\theta '\in \Theta , \theta ' \ne \theta _0\).

Proof

Fix a \(\theta ' \in \Theta \) with \(\theta ' \ne \theta _0\). As \(X_i {\mathop {\sim }\limits ^{\text {i.i.d.}}} P_{\theta _0}\), Proposition 2 implies that \(\mid \bar{X}_n-\theta '\mid >\mid \bar{X}_n-\theta _0\mid \) for \(n\rightarrow \infty \) \(P_{\theta _0}\)-almost-surely. But \(\mid \bar{X}_n-\theta '\mid >\mid \bar{X}_n-\theta _0\mid \) is equivalent to:

$$\begin{aligned} \mid \bar{X}_n-\theta '\mid>\mid \bar{X}_n-\theta _0\mid&\Leftrightarrow (\bar{X}_n-\theta ')^2 >(\bar{X}_n-\theta _0)^2\nonumber \\&\Leftrightarrow -\frac{(\bar{X}_n-\theta ')^2}{2\sigma ^2 /n}< -\frac{(\bar{X}_n-\theta _0)^2}{2\sigma ^2 /n} \nonumber \\&\Leftrightarrow \exp (-\frac{(\bar{X}_n-\theta ')^2}{2\sigma ^2 /n})< \exp (-\frac{(\bar{X}_n-\theta _0)^2}{2\sigma ^2 /n}) \nonumber \\&\Leftrightarrow \frac{1}{\sqrt{2\pi \sigma ^2/n}}\exp (-\frac{(\bar{X}_n-\theta ')^2}{2\sigma ^2 /n})\nonumber \\&\quad< \frac{1}{\sqrt{2\pi \sigma ^2/n}} \exp (-\frac{(\bar{X}_n-\theta _0)^2}{2\sigma ^2 /n}) \nonumber \\&\Leftrightarrow \frac{m_1(x)}{m_0(x)}<1 \Leftrightarrow \log [\frac{m_1(x)}{m_0(x)}]<\log (1)\nonumber \\&\Leftrightarrow \log [\text {BF}_{10}(x)]<0. \end{aligned}$$
(9)

Thus, Proposition 2 implies that for \(n\rightarrow \infty \) \(P_{\theta _0}\)-almost-surely we have \(\log [\text {BF}_{10}(x)]<0\), where in the above \(m_0(x)=\frac{1}{\sqrt{2\pi \sigma ^2/n}}\exp [-\frac{(\bar{X}_n-\theta _0)^2}{2\sigma ^2/n}]\) denotes the marginal likelihood under \(H_0:\theta =\theta _0\) and \(m_1(x)=\frac{1}{\sqrt{2\pi \sigma ^2/n}}\exp (-\frac{(\bar{X}_n-\theta ')^2}{2\sigma ^2 /n})\) denotes the marginal likelihood under \(H':\theta =\theta '\). As by assumption \(\text {Ev}(\theta '):=\log [\text {BF}_{10}(x)]\), it follows that \(\text {Ev}(\theta ')<0\) for \(n\rightarrow \infty \) \(P_{\theta _0}\)-almost-surely. As \(\text {Ev}(\theta _0):=\log (1)=0\), we have \(\text {Ev}(\theta ')<\text {Ev}(\theta _0)\). As \(\theta '\) was chosen arbitrarily, the same holds for all \(\theta ' \in \Theta \), \(\theta ' \ne \theta _0\), which is the desired conclusion.

Thus, when questioning Definition 1, Definition 2 provides a different, Bayesian definition of the weight of evidence. Theorem 4 then shows that in the situation of the JL-paradox, the weight of evidence \(\text {Ev}(\theta ')\) against \(H_0:\theta =\theta _0\) compared to any other point null hypothesis \(H':\theta =\theta '\) is asymptotically negative when \(X_i {\mathop {\sim }\limits ^{\text {i.i.d.}}} P_{\theta _0}\), \(P_{\theta _0}\)-almost-surely, for every \(\theta ' \ne \theta _0\). Thus, \(H_0:\theta =\theta _0\) is preferred to every other hypothesis \(H':\theta =\theta '\), \(\theta ' \ne \theta _0\). As in the JL-paradox the alternative \(H_1:\theta \ne \theta _0\) is the logical disjunction of these points \(\theta ' \ne \theta _0\), the weight of evidence against \(H_0\) should become negative, which means that the Bayes factor favors \(H_0\), and Theorem 4 shows that this is indeed the case.

Theorem 4 seems to require an explicit Bayesian definition of the weight of evidence to arrive at favoring \(H_0\) in the JL-paradox. However, the weight of evidence as stated in Definition 2 is not only equal to the log-Bayes factor of a point versus point hypothesis, it also is equal to the likelihood ratio

$$\begin{aligned} \log [ \text {BF}_{10}(x) ] = \log \left[ \frac{m_1(x)}{m_0(x)} \right] = \log \left[ \frac{\frac{1}{\sqrt{2\pi \sigma ^2/n}}\exp (-\frac{(\bar{X}_n-\theta ')^2}{2\sigma ^2 /n})}{\frac{1}{\sqrt{2\pi \sigma ^2/n}}\exp [-\frac{(\bar{X}_n-\theta _0)^2}{2\sigma ^2/n}]} \right] \end{aligned}$$
(10)

see e.g. Marin and Robert (2014), which is a frequentist concept (Royall, 1997). From this angle, Theorem 4 shows that, from a likelihoodist point of view, each likelihood-ratio comparison of \(H_0:\theta =\theta _0\) versus \(H':\theta =\theta '\) for any \(\theta ' \in H_1:\theta \ne \theta _0\) shows evidence for \(H_0\) asymptotically. Thus, Theorem 4 proves that not only a Bayesian will arrive at the conclusion that a small p-value can become evidential in favor of \(H_0\) for large enough sample size n; a frequentist who uses likelihood ratios (such as in Neyman-Pearson tests) arrives at the same conclusion. From this angle, Definition 2 can be reinterpreted as quantifying statistical evidence via likelihood ratios.
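
The following simulation sketch (assumptions: \(\theta _0=0\), \(\theta '=0.01\), \(\sigma =1\), values chosen for illustration only) evaluates the point-versus-point log likelihood ratio of Eq. 10 under data generated from \(P_{\theta _0}\); in line with Theorem 4, it tends to be clearly negative once n is large.

```python
# Sketch: log[m_1(x)/m_0(x)] of Eq. (10) for H_0: theta = 0 versus H': theta = 0.01,
# with data generated under H_0; the log likelihood ratio tends to be negative
# (i.e. it favours H_0) for large n.
import math
import random

def log_lr(xbar: float, n: int, sigma: float, theta_prime: float, theta0: float) -> float:
    """Point-vs-point log likelihood ratio log[m_1(x) / m_0(x)]."""
    return (n / (2 * sigma ** 2)) * ((xbar - theta0) ** 2 - (xbar - theta_prime) ** 2)

random.seed(1)
theta0, theta_prime, sigma = 0.0, 0.01, 1.0
for n in (100, 10_000, 1_000_000):
    xbar = random.gauss(theta0, sigma / math.sqrt(n))  # distribution of the sample mean
    print(f"n = {n:8d}   log BF_10 = {log_lr(xbar, n, sigma, theta_prime, theta0):8.2f}")
```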

In summary, in the first case the point-null-zero-probability paradox was not assumed to hold, and the existence of a precise hypothesis was accepted. The Bayesian solution correctly confirmed \(H_0\), and the frequentist solution seemed to incorrectly reject \(H_0\). The latter was resolved by noticing that the p-value loses its interpretation as evidence against \(H_0\) for increasing sample size n, aligning the Bayesian and frequentist solutions. Thus, the Jeffreys-Lindley paradox does not occur when reinterpreting p-values this way. Theorems 3 and 4 showed that both a Bayesian and a frequentist – who either uses likelihood ratios and the Neyman-Pearsonian approach or sufficiency and the Fisherian approach – will arrive at this conclusion.

Another possibility to align the Bayesian and frequentist solutions is to shift to an absolutely continuous prior in the Bayesian approach: Then, both approaches reject \(H_0\). Note, however, that such prior choices are in conflict with the extra-mathematical belief that the precise hypothesis \(H_0:\theta =\theta _0\) exists, so such a prior would not be selected in a Bayesian analysis.

4.2 Second case: Precise Hypotheses such as \(H_0:\theta =\theta _0\) are always false

Now, in the second case it is assumed that the point-null-zero-probability paradox is true and the Rao-Lovric theorem holds. Therefore, the extra-mathematical beliefs are that precise hypotheses are unrealistic and should be replaced with a more reasonable small interval hypothesis, independently of whether we are able to test a precise hypothesis via probability theory or some statistical method or not.

Reconsider the Bayesian solution, which accepted \(H_0:\theta =\theta _0\). Clearly, when precise hypotheses are assumed to be always false, the Bayesian approach now yields the incorrect solution. The true hypothesis \(H_0\) in the example of Berger (1985) must now be of the form \(H_0:\theta \in [\theta _0-b,\theta _0+b]\) for some \(b >0\). Importantly, we make the restriction that the size of the interval must be reasonably small.

“Given that one should really be testing \(H_0:\theta \in (\theta _0-b,\theta _0+b)\), we need to know when it is suitable to approximate \(H_0\) by \(H_0:\theta =\theta _0\). From the Bayesian perspective, the only sensible answer to this question is – the approximation is reasonable if the posterior probabilities of \(H_0\) are nearly equal in the two situations.”(Berger 1985, p. 149)

A condition under which this would be the case is that the observed likelihood function is approximately constant on \((\theta _0-b,\theta _0+b)\) (because then the resulting posterior for any fixed prior is approximately constant, too). In the example of Berger (1985) which was used to illustrate the Jeffreys-Lindley paradox earlier, this is the case if

$$\begin{aligned} b\le (0.024)\frac{\sigma }{z\sqrt{n}} \end{aligned}$$
(11)

where “nearly constant” is interpreted as meaning that the observed likelihood function does not vary by more than 5%. Importantly, in this second case the test of \(H_0:\theta =\theta _0\) which is carried out in the Bayesian analysis of the Jeffreys-Lindley paradox has a different interpretation than in the first case. In the first case we intended to test \(H_0:\theta =\theta _0\) as a precise hypothesis. In the second case, we interpret the prior probability \(\varrho \) assigned to the null value \(\theta _0\) as the mass which would have been assigned to the more realistic interval hypothesis \(\tilde{H}:\theta \in (\theta _0-b,\theta _0+b)\), were the point null approximation not used. Thus, we interpret the test of a precise hypothesis as an approximation of the test of the realistic interval hypothesis.

Of course, a straightforward modification would be to consider the test of the interval hypothesis directly, but to stay within the setting of the Jeffreys-Lindley paradox this option is not considered further here.

Now, consider again the case \(z=1.96\) for \(n=1000\), which yielded \(P(H_0\mid x)=0.80\) for the test of \(H_0:\theta =\theta _0\) versus \(H_1:\theta \ne \theta _0\). Based on this result, we would accept the precise null hypothesis, which is false in this second case. However, from Eq. 11 and under the assumption \(\sigma =1\), the condition becomes \(b\le 0.000387\). Using larger values of \(\sigma \) will increase b accordingly, but also implies that the data are highly variable, justifying a larger interval width b. For example, from Eq. 11 it follows that for \(n=1000\) one would require a standard deviation \(\sigma \ge 25.9\) (which is equivalent to a variance of \(\sigma ^2 \approx 670.81\)) for b to be larger than or equal to 0.01, which is the smallest possible measurement (as we assumed two-digit measurement precision). Thus, for \(n=1000\) and fixed z, Eq. 11 can be read as separating measurable interval sizes – for which \(\sigma \ge 25.9\) and \(b\ge 0.01\) – from those that are too small in the application context – for which \(\sigma < 25.9\) and \(b<0.01\). Any interval width smaller than \(b=0.01\) is by assumption not measurable anymore, and in almost all cases one will not be willing to draw any conclusions based on even \(n=1000\) samples when \(\sigma ^2>670.81\).
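
The numbers above can be checked with a short computation; this is a sketch based on the constant 0.024 given in Eq. 11, so the results agree with the values in the text up to rounding.

```python
# Sketch: evaluate the bound of Eq. (11), b <= 0.024 * sigma / (z * sqrt(n)),
# and the standard deviation needed for the bound to reach the measurement
# precision b = 0.01 (results agree with the text up to rounding).
import math

def b_max(sigma: float, z: float, n: int) -> float:
    return 0.024 * sigma / (z * math.sqrt(n))

print(b_max(sigma=1.0, z=1.96, n=1000))          # approx. 0.000387
print(0.01 * 1.96 * math.sqrt(1000) / 0.024)     # sigma for b >= 0.01, approx. 25.8
```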

Now, when stating \(P(H_0\mid x)=0.80\) we have to keep in mind that we do not believe in the existence of the precise hypothesis, but interpret this posterior probability as an approximation of the posterior probability of the interval hypothesis \(H_0:\theta \in (\theta _0-0.000387,\theta _0+0.000387)\) (under the assumption \(\sigma =1\)). In most realistic situations, one will not be able to accept such a narrow interval hypothesis. Measuring that precisely is not possible based on two-digit measurement precision. The interval hypothesis only becomes measurable when \(\sigma \ge 25.9\), which is highly unrealistic in almost any context.

For example, returning to Fig. 1, we assumed that two digit precision is the best our measurements can offer. Thus, we can reverse the above process and ask what posterior probability for \(H_0\) we can state given our measurement precision. Based on two digit precision and assuming \(\sigma =1\) known, we arrive at \(n=37\) samples for which \(b=0.010065\) (for \(n=36\) samples we already have \(b=0.009931<0.01\) which is too small for our maximum measurement precision) and the corresponding posterior probability based on Eq. 4 is \(P(H_0\mid x)=0.47\). Thus, when paying attention to the measurement precision, the Bayesian solution does not confirm \(H_0\) anymore, as now \(P(H_0\mid x)<\frac{1}{2}\).

The frequentist p-value, on the other hand, behaves identically to the first case. The precise hypothesis \(H_0:\theta =\theta _0\) is false, but under the assumption that b is reasonably small we can still assume that \(Z\sim \mathcal {N}(0,1)\) approximately, and arrive at \(p=0.05\) for \(z=1.96\), rejecting \(H_0\) correctly. However, due to the arguments treated in the first case above, the rejection of \(H_0\) via statistical significance of the p-value loses its interpretation as weight of evidence against \(H_0\) for increasing sample size n. Thus, while for small sample sizes n the rejection has a valid interpretation, as the p-value functions as a measure of evidence against \(H_0\), for large sample sizes n the interpretation of the p-value is reversed, now functioning as a measure of evidence in favor of \(H_0\); compare Theorems 3 and 4.

Still, this behavior is consistent with that of the Bayesian solution: The Bayesian solution does reject \(H_0\) for small sample sizes. In the setting above, the posterior probability of \(H_0:\theta =\theta _0\) for the Bayesian test of \(H_0:\theta =\theta _0\) versus \(H_1:\theta \ne \theta _0\) passes the threshold \(P(H_0\mid x)=\frac{1}{2}\) at \(n=46\) samples. For sample sizes \(n\ge 46\), the posterior probability of \(H_0:\theta =\theta _0\) becomes larger than \(\frac{1}{2}\). The frequentist p-value always rejects \(H_0\), but for small sample sizes the rejection has a valid interpretation as evidence against \(H_0\), while for large sample sizes the interpretation is reversed. Although there is no explicit cutoff like \(n=46\) for the p-value, the behavior is identical, aligning the Bayesian and frequentist solutions also in this second case.
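
The crossing point reported above can be reproduced with the lower bound of Eq. 4; the following sketch (again assuming \(\tau =\sigma \) and equal prior probabilities) searches for the smallest n at which the posterior probability of \(H_0\) exceeds \(\frac{1}{2}\) for fixed \(z=1.96\).

```python
# Sketch: smallest n for which the Eq. (4) lower bound on P(H_0 | x) exceeds 1/2
# at fixed z = 1.96, with tau = sigma and equal prior probabilities.
import math

def posterior_lower_bound(z: float, n: int) -> float:
    return 1.0 / (1.0 + math.exp(0.5 * z ** 2) / math.sqrt(1.0 + n))

n = 1
while posterior_lower_bound(1.96, n) <= 0.5:
    n += 1
print(n, round(posterior_lower_bound(1.96, n), 3))   # n = 46 in this setting
```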

An important note should be added: Eq. 11 explicitly warns the Bayesian statistician when the approximation of the interval hypothesis via a precise hypothesis becomes unreliable. There is no such warning for the p-value. Even the fact that the p-value reverses its interpretation from a measure of evidence against \(H_0:\theta =\theta _0\) into a measure of evidence in favor of \(H_0\) for fixed z statistic and growing sample size n does not help to determine when this changepoint occurs.

Table 2 Bayesian and frequentist solutions for the Jeffreys-Lindley-paradox depending on whether the existence of precise hypotheses is taken for granted or not and depending on sample size

5 Relationship Between the Paradoxes

Table 2 illustrates the relationship between the paradoxes. First, suppose the Rao-Lovric theorem is accepted and the position is adopted that precise hypotheses are always false. When precise hypotheses do not exist and are always false, in the small sample size setting both the Bayesian and frequentist solutions arrive at rejection of \(H_0\), which is correct (blue part of Table 2). Importantly, the ticks and crosses should be interpreted as indicating the correct or wrong solution with respect to the point-null-zero-probability paradox and the setting of the Jeffreys-Lindley paradox. Thus, the ticks behind the blue entries in Table 2 mean that when precise hypotheses do not exist, rejecting the precise hypothesis \(H_0\) is the desired conclusion in the setting of the Jeffreys-Lindley paradox. In contrast, accepting \(H_0\) (Bayesian solution for large sample sizes in red) is the incorrect solution in the setting of the Jeffreys-Lindley paradox when precise hypotheses are assumed to be always false. Therefore, a cross indicates this in the second column, second row of Table 2.

The Jeffreys-Lindley paradox occurs (the Bayesian solution starts to favor the null hypothesis) when the approximation of the interval null hypothesis (assumed to be true) by the precise point null hypothesis becomes unreliable, as illustrated by the application of Eq. 11. In Table 2, this corresponds to the acceptance of \(H_0\) in the large sample setting while the frequentist solution still rejects \(H_0\) (red part of Table 2).

However, when the condition in Eq. 11 is checked against the measurement precision, the paradox will not set in unless exceptionally high measurement precision is available. Therefore, the Bayesian solution will be judged as unreliable and the Bayesian will not proceed with the test of \(H_0:\theta =\theta _0\) (see the \(^{**}\) in Table 2 at the Bayesian solution in this case). The Bayesian solution shifts to the test of the correct hypothesis \(H_0:\theta \in (\theta _0-b,\theta _0+b)\) in this case and, by the consistency of the posterior distribution, will accept the latter hypothesis (Kleijn, 2022). The frequentist solution rejects \(H_0\) in this case, but as shown in Theorems 3 and 4, the p-value \(p=0.05\) loses its interpretation as evidence against \(H_0\) and becomes evidential in favor of \(H_0\). When the frequentist adopts this interpretation (or uses Good’s standardized p-value), he arrives at the conclusion to accept \(H_0:\theta =\theta _0\), while the Bayesian accepts \(H_0:\theta \in (\theta _0-b,\theta _0+b)\). The conclusions are thus highly similar, and become identical when the frequentist notices that the test of a small interval hypothesis may be more appropriate, for example by computing what Fraser (2011) called the p-value function, which shows that the p-value is significant for each value inside a small interval of parameter values \((\theta _0-b,\theta _0+b)\). Both the Bayesian and the frequentist then arrive at the acceptance of the true small interval hypothesis \(H_0:\theta \in (\theta _0-b,\theta _0+b)\).

When accepting the existence of precise hypotheses (right part of Table 2), now the Bayesian and frequentist solutions both reject \(H_0\) in the small sample setting (purple part in Table 2), which seems to be incorrect. However, the latter can be attributed to the limited sample size n.

Shifting to the large sample size setting (black part of Table 2 on the right-hand side), the frequentist solution rejects \(H_0\) while the Bayesian solution accepts \(H_0\). Now the discrepancy between both solutions can be reconciled by noticing again that the small p-value \(p=0.05\) becomes evidential in favor of \(H_0:\theta =\theta _0\) for large n (also, if standardized p-values were used, they would not reject \(H_0\)). If the frequentist rejects this interpretation, the difference boils down to the fact that the Bayesian solution yields the correct answer only by explicitly assigning probability mass \(\varrho >0\) to the null set \(\{\theta _0\}\) in Jeffreys’ mixture prior, while the frequentist solution does not. A Bayesian could use an absolutely continuous prior to arrive at rejection of \(H_0\) and align the Bayesian with the frequentist solution. However, Theorems 3 and 4 show that a straightforward solution is to notice that the significant p-value in the large sample setting can be evidence in favor of \(H_0\), aligning the frequentist with the Bayesian solution.

In sum, when null hypotheses are judged to be always false and the point-null-zero-probability paradox is accepted (left part of Table 2), the different solutions of Bayesian and frequentist hypothesis tests in the Jeffreys-Lindley paradox can be reconciled by interpreting the test of a precise hypothesis \(H_0:\theta =\theta _0\) versus \(H_1:\theta \ne \theta _0\) as an approximation of the test of the more realistic small interval hypothesis \(H_0:\theta \in (\theta _0-b,\theta _0+b)\) versus its alternative (which corresponds to what a frequentist or Bayesian would naturally prefer when accepting that precise hypotheses are always false). Sample size and measurement precision then determine whether the Jeffreys-Lindley paradox sets in, and once the sample size is large enough for the Jeffreys-Lindley paradox to be witnessed, the actually tested interval hypothesis has become unrealistically precise for the context.

Theorems 3 and 4 demonstrate that when the existence of null hypotheses is accepted, the frequentist solution can be aligned with the Bayesian solution (right part of Table 2). The only requirement for a frequentist is either a Fisherian position based on sufficiency or a Neyman-Pearsonian position based on likelihood ratios. For a Bayesian, the explication of the weight of evidence as the Bayes factor is required, which is uncontroversial based on the available results of Good (1960, 1968), see Good (1985).

6 Conclusion

In sum, adjusting for the point-null-zero-probability paradox shows that the Jeffreys-Lindley paradox can be resolved and can even provide a Bayes-non-Bayes compromise in the spirit of Good (1992).

Three questions were considered in this paper. Regarding the first question, the mathematical arguments why the Jeffreys-Lindley paradox occurs, the measure-theoretic premises are essential. The Bayes factor assigns positive mass to the null value, while the frequentist p-value is confined to a framework which does not even allow a probability measure on the parameter space \(\Theta \) (the latter reducing to a plain set in the frequentist approach, compare Schervish (1995)). Also, p-values are not standardized with respect to sample size, which means that a significant p can be interpreted as evidential for \(H_0:\theta =\theta _0\) once n is large enough, based on the concept of sufficiency.

Regarding the extra-mathematical arguments, the point-null-zero-probability paradox showed that the form of the hypothesis under consideration is essential to resolve the Jeffreys-Lindley paradox. When the existence of precise hypotheses is taken for granted, the Bayesian solution proceeds fine (at the price of assigning positive probability mass in the prior to the null value \(\theta _0\)) and the frequentist p-value must be standardized to align both solutions. When the existence of precise hypotheses is questioned, the Bayesian solution must be seen as an approximation, and Eq. 11 serves as a warning sign for the reliability of the Bayesian approximation of an interval null via a precise null hypothesis. Taking the quality of the approximation into account then aligns both approaches again, as the Jeffreys-Lindley paradox sets in as soon as the approximation becomes unreliable. Also, the standardization of p-values works here, too, to yield identical conclusions in both the Bayesian and frequentist approach, that is, to shift towards the test of \(H_0:\theta \in (\theta _0-b,\theta _0+b)\) versus its alternative instead of the test of \(H_0:\theta =\theta _0\) versus its alternative.

No matter whether one accepts the Rao-Lovric theorem, in this paper it was shown that from a mathematical point of view the Jeffreys-Lindley paradox can be attributed to the fact that the standard Bayesian solution accepts the existence of a precise hypothesis while the p-value does not (at least from a Fisherian perspective, under which the null hypothesis is never true and can only be rejected based on sufficient evidence against it). However, the price the Bayesian pays for this behavior is the risk that the approximation of \(H_0:\theta \in (\theta _0-b,\theta _0+b)\) by \(H_0:\theta =\theta _0\) fails for large enough n. Furthermore, the fact that p-values were designed as evidence measures against a null hypothesis but can in fact become evidential in favor of a null hypothesis when the sample size n is large enough is another reason why the puzzling situation of the Jeffreys-Lindley paradox occurs. Neither point has so far been discussed in detail in the literature.

From an extra-mathematical perspective, the point-null-zero-probability paradox provides the overarching frame for resolving the Jeffreys-Lindley paradox in both cases treated in this paper; compare Table 2.

With regard to the methodological debate between frequentist and Bayesian inference (Wasserstein et al., 2019), the results in this paper therefore demonstrate that the Jeffreys-Lindley paradox may help to rethink (i) the validity of a precise null hypothesis in a broad variety of research contexts, (ii) the Bayesian assignment of positive probability mass to a value of dominating-measure zero, and (iii) the scaling of p-values as measures of evidence against a precise null hypothesis.