As a first example, we consider what is perhaps the world’s oldest inference problem, one that has occupied philosophers for over two millennia: given a general law such as “all X’s have property Y,” how does the accumulation of confirmatory instances (i.e., X’s that indeed have property Y) increase our confidence in the general law? Examples of such general laws include “all ravens are black,” “all apples grow on apple trees,” “all neutral atoms have the same number of protons and electrons,” and “all children with Down syndrome have all or part of a third copy of chromosome 21.”
To address this question statistically, we can compare two models (e.g., Etz and Wagenmakers 2017; Wrinch and Jeffreys 1921). The first model corresponds to the general law and can be conceptualized as \(\mathcal {H}_{0}: \theta = 1\), where \(\theta \) is a Bernoulli probability parameter. This model predicts that only confirmatory instances are encountered. The second model relaxes the general law and is therefore more complex; it assigns \(\theta \) a prior distribution, which, for mathematical convenience, we take to be from the beta family. Consequently, we have \(\mathcal {H}_{1}: \theta \sim \text {Beta}(a,b)\).
In the following, we assume that, in line with the prediction from \(\mathcal {H}_{0}\), only confirmatory instances are observed. In such a scenario, we submit that there are at least three desiderata for model selection. First, for any sample size \(n>0\) of confirmatory instances, the data ought to support the general law \(\mathcal {H}_{0}\); second, as n increases, so should the level of support in favor of \(\mathcal {H}_{0}\); third, as n increases without bound, the support in favor of \(\mathcal {H}_{0}\) should grow infinitely large.
How does LOO perform in this scenario? Before proceeding, note that when LOO makes predictions based on the maximum likelihood estimate (MLE), none of the above desiderata are fulfilled. Any training set of size \(n-1\) will contain \(k=n-1\) confirmatory instances, such that the MLE under \(\mathcal {H}_{1}\) is \(\hat {\theta }=k/(n-1)= 1\); of course, the general law \(\mathcal {H}_{0}\) does not contain any adjustable parameters and simply stipulates that \(\theta = 1\). When the models’ predictive performance is evaluated for the test set observation, it then transpires that both \(\mathcal {H}_{0}\) and \(\mathcal {H}_{1}\) have \(\theta \) set to 1 (\(\mathcal {H}_{0}\) on principle, \(\mathcal {H}_{1}\) by virtue of having seen the \(n-1\) confirmatory instances from the training set), so that they make identical predictions. Consequently, according to the maximum likelihood version of LOO, the data are completely uninformative, no matter how many confirmatory instances are observed (Footnote 5).
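The argument for the maximum likelihood version can be checked with a few lines of Python. This is only an illustrative sketch: the function name `mle_loo_elpd` and the sample size of 20 are our own choices, and the data are assumed to consist solely of confirmatory instances coded as 1.

```python
import math


def mle_loo_elpd(y):
    """Leave-one-out log predictive score for a Bernoulli model,
    plugging in the MLE computed from the n-1 training observations."""
    total = 0.0
    for i in range(len(y)):
        train = y[:i] + y[i + 1:]
        theta_hat = sum(train) / len(train)  # MLE k / (n - 1)
        # Predictive probability of the held-out observation.
        # Note: with mixed data this can be 0, making the log undefined;
        # here all observations are confirmatory, so it is always 1.
        p = theta_hat if y[i] == 1 else 1.0 - theta_hat
        total += math.log(p)
    return total


y = [1] * 20                 # only confirmatory instances (hypothetical n = 20)
elpd_h1 = mle_loo_elpd(y)    # H1 with theta replaced by its MLE
elpd_h0 = 0.0                # H0 fixes theta = 1, so every log(1) term is 0
print(elpd_h0 - elpd_h1)     # difference in LOO scores
```

Because every training set yields \(\hat {\theta } = 1\), each held-out observation receives predictive probability 1 under both models, and the difference in scores does not change with the number of observations.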
The Bayesian LOO makes predictions using the leave-one-out posterior distribution for \(\theta \) under \(\mathcal {H}_{1}\), and this means that it at least fulfills the first desideratum: the prediction under \(\mathcal {H}_{0}: \theta = 1\) is perfect, whereas the prediction under \(\mathcal {H}_{1}: \theta \sim \text {Beta}(a+n-1,b)\) involves values of \(\theta \) that do not make such perfect predictions. As a result, the Bayesian LOO will show that the general law \(\mathcal {H}_{0}\) outpredicts \(\mathcal {H}_{1}\) for the test set.
What happens when sample size n grows large? Intuitively, two forces are in opposition: on the one hand, as n grows large, the leave-one-out posterior distribution of \(\theta \) under the complex model \(\mathcal {H}_{1}\) will be increasingly concentrated near 1, generating predictions for the test set data that are increasingly similar to those made by \(\mathcal {H}_{0}\). On the other hand, even with n large, the predictions from \(\mathcal {H}_{1}\) will still be inferior to those from \(\mathcal {H}_{0}\), and these inferior predictions are multiplied by n, the number of test sets.
As it turns out, these two forces are asymptotically in balance, so that the level of support in favor of \(\mathcal {H}_{0}\) approaches a bound as n grows large. We first provide the mathematical result and then show the outcome for a few select scenarios.
Mathematical Result
In Example 1, the data consist of n realizations drawn from a Bernoulli distribution, denoted by \(y_{i}\), \(i = 1,2,\ldots , n\). Under \(\mathcal {H}_{0}\), the success probability \(\theta \) is fixed to 1 and under \(\mathcal {H}_{1}\), \(\theta \) is assigned a \(\text {Beta}(a, b)\) prior. We consider the case where only successes are observed, that is, \(y_{i} = 1, \forall i \in \{1,2,\ldots , n\}\). The model corresponding to \(\mathcal {H}_{0}: \theta = 1\) has no free parameters and predicts \(y_{i} = 1\) with probability one. Therefore, the Bayesian LOO estimate \(\text {elpd}_{\text {loo}}^{\mathcal {H}_{0}}\) is equal to 0. To calculate the LOO estimate under \(\mathcal {H}_{1}\), one needs to be able to evaluate the predictive density for a single data point given the remaining data points. Recall that the posterior based on \(n-1\) observations is a \(\text {Beta}(a + n - 1, b)\) distribution. Consequently, the leave-one-out predictive density is obtained as a generalization (with a and b potentially different from 1) of Laplace’s rule of succession applied to \(n-1\) observations,
$$\begin{array}{@{}rcl@{}} p(y_{i} \mid y_{-i}) &=& {{\int}_{0}^{1}} \underbrace{\theta}_{p(y_{i} \mid \theta)} \underbrace{\tfrac{{\Gamma}\left( a + n - 1 + b\right)}{{\Gamma}\left( a + n - 1\right) {\Gamma}(b)} \theta^{a + n - 2} \left( 1 - \theta\right)^{b - 1}}_{p(\theta \mid y_{-i})} \text{d}\theta \\ &=& \frac{a + n - 1}{a + n - 1 + b}, \end{array} $$
(6)
and the Bayesian LOO estimate under \(\mathcal {H}_{1}\) is given by
$$ \text{elpd}_{\text{loo}}^{\mathcal{H}_{1}} = n \log\left( \frac{a + n - 1}{a + n - 1 + b}\right). $$
(7)
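As a sanity check on Eqs. 6 and 7, the integral in Eq. 6 can be written as a ratio of Beta functions, \(B(a+n, b)/B(a+n-1, b)\), and compared with the closed form. The sketch below uses only the Python standard library; the function names and the particular values of a, b, and n are illustrative assumptions, not part of the original analysis.

```python
import math


def log_beta(x, y):
    """Logarithm of the Beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def loo_predictive_via_beta(a, b, n):
    """Integral of theta against the Beta(a + n - 1, b) posterior,
    expressed as B(a + n, b) / B(a + n - 1, b)."""
    return math.exp(log_beta(a + n, b) - log_beta(a + n - 1, b))


def loo_predictive_closed_form(a, b, n):
    """Closed form from Eq. 6."""
    return (a + n - 1) / (a + n - 1 + b)


def elpd_loo_h1(a, b, n):
    """Bayesian LOO estimate under H1 (Eq. 7): n identical log terms."""
    return n * math.log(loo_predictive_closed_form(a, b, n))


for a, b, n in [(1.0, 1.0, 10), (0.5, 0.5, 3), (2.0, 5.0, 50)]:
    print(a, b, n,
          loo_predictive_via_beta(a, b, n),
          loo_predictive_closed_form(a, b, n),
          elpd_loo_h1(a, b, n))
```

The two predictive values agree up to floating-point error for each parameter setting, and `elpd_loo_h1` is always negative because the predictive probability is strictly below 1 whenever b > 0.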
The difference in the LOO estimates is
$$\begin{array}{@{}rcl@{}} {\Delta}\text{elpd}_{\text{loo}}^{\mathcal{H}_{0}, \mathcal{H}_{1}} &=& \text{elpd}_{\text{loo}}^{\mathcal{H}_{0}} - \text{elpd}_{\text{loo}}^{\mathcal{H}_{1}} \\ &=& - n \log\left( \frac{a + n - 1}{a + n - 1 + b}\right). \end{array} $$
(8)
As the number of confirmatory instances n grows large, the difference in the LOO estimates approaches a bound (see Appendix A for a derivation):
$$ \lim_{n \to \infty} {\Delta}\text{elpd}_{\text{loo}}^{\mathcal{H}_{0}, \mathcal{H}_{1}} = b. $$
(9)
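The limit can be made plausible with a first-order expansion of the logarithm. The following is a condensed sketch under that expansion and does not necessarily follow the route taken in Appendix A:

$$ -n \log\left( \frac{a + n - 1}{a + n - 1 + b}\right) = n \log\left( 1 + \frac{b}{a + n - 1}\right) = \frac{n b}{a + n - 1} + O\left( \frac{1}{n}\right) \longrightarrow b \quad (n \to \infty). $$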
Hence, the asymptotic difference in the Bayesian LOO estimates under \(\mathcal {H}_{0}\) and under \(\mathcal {H}_{1}\) equals the Beta prior parameter b. Consequently, the limit of the pseudo-Bayes factor is
$$ \lim_{n \to \infty} \text{PSBF}_{01} = \exp\left\{b\right\}, $$
(10)
and the limit of the model weight for \(\mathcal {H}_{0}\) is
$$ \lim_{n \to \infty} w_{0} = \frac{\exp\left\{b\right\}}{1 + \exp\left\{b\right\}}. $$
(11)
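The finite-sample quantities and their limits can be compared numerically. The sketch below assumes that the model weight takes the form \(w_{0} = \text{PSBF}_{01}/(1 + \text{PSBF}_{01})\), which is the form consistent with Eq. 11; the function names are our own, and the Laplace prior with a = b = 1 is used purely as an illustration.

```python
import math


def delta_elpd_loo(a, b, n):
    """Eq. 8, rewritten as n * log(1 + b / (a + n - 1)) for numerical stability."""
    return n * math.log1p(b / (a + n - 1))


def loo_weight_h0(a, b, n):
    """Weight for H0, assuming w0 = PSBF01 / (1 + PSBF01) with PSBF01 = exp(delta)."""
    return 1.0 / (1.0 + math.exp(-delta_elpd_loo(a, b, n)))


def limiting_weight_h0(b):
    """Asymptotic weight from Eq. 11: exp(b) / (1 + exp(b))."""
    return 1.0 / (1.0 + math.exp(-b))


a, b = 1.0, 1.0  # Laplace prior, chosen only for illustration
for n in (1, 10, 100, 10_000):
    print(n, loo_weight_h0(a, b, n))
print("limit", limiting_weight_h0(b))
```

Writing the weight as a logistic function of the difference in LOO estimates avoids overflow for large b and makes explicit that the limiting weight depends on the prior only through b.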
Select Scenarios
The mathematical result can be applied to a series of select scenarios. Figure 1 shows the LOO weight in favor of the general law \(\mathcal {H}_{0}\) as a function of the number of confirmatory instances n, separately for five different prior specifications under \(\mathcal {H}_{1}\). The figure confirms that for each prior specification, the LOO weight for \(\mathcal {H}_{0}\) approaches its asymptotic bound as n grows large.
We conclude the following. (1) As n grows large, the support for the general law \(\mathcal {H}_{0}\) approaches a bound. (2) For many common prior distributions, this bound is surprisingly low; for instance, the Laplace prior \(\theta \sim \text {Beta}(1,1)\) (case d) yields a limiting weight of \(e/(1 + e) \approx 0.731\). (3) Contrary to popular belief, our results provide an example of a situation in which the results from LOO are highly dependent on the prior distribution, even asymptotically; this is clear from Eq. 11 and evidenced in Fig. 1. (4) As shown by case e in Fig. 1, the choice of Jeffreys’s prior (i.e., \(\theta \sim \text {Beta}(0.5,0.5)\)) results in a function that approaches the asymptote from above. This means that, according to LOO, the observation of additional confirmatory instances actually decreases the support for the general law, violating the second desideratum outlined above. This violation can be explained by the fact that the confirmatory instances help the complex model \(\mathcal {H}_{1}\) concentrate more mass near 1, thereby better mimicking the predictions from the simple model \(\mathcal {H}_{0}\). For some prior choices, this increased ability to mimic outweighs the fact that the additional confirmatory instances are better predicted by \(\mathcal {H}_{0}\) than by \(\mathcal {H}_{1}\).
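The direction from which the bound is approached can be checked directly from Eq. 8. The sketch below compares Jeffreys’s prior with the Laplace prior over the first 1000 sample sizes; the helper names and the cutoff of 1000 are arbitrary choices made for illustration, so the check is numerical evidence over that range, not a proof.

```python
import math


def delta_elpd_loo(a, b, n):
    """Eq. 8, rewritten as n * log(1 + b / (a + n - 1))."""
    return n * math.log1p(b / (a + n - 1))


def support_curve(a, b, n_max=1000):
    """Difference in LOO estimates for n = 1, ..., n_max."""
    return [delta_elpd_loo(a, b, n) for n in range(1, n_max + 1)]


def is_strictly_decreasing(values):
    return all(later < earlier for earlier, later in zip(values, values[1:]))


def is_strictly_increasing(values):
    return all(later > earlier for earlier, later in zip(values, values[1:]))


jeffreys = support_curve(0.5, 0.5)  # bound b = 0.5
laplace = support_curve(1.0, 1.0)   # bound b = 1

print("Jeffreys decreasing:", is_strictly_decreasing(jeffreys))
print("Laplace increasing:", is_strictly_increasing(laplace))
print("Jeffreys first and last:", jeffreys[0], jeffreys[-1])
print("Laplace first and last:", laplace[0], laplace[-1])
```

Under Jeffreys’s prior the difference starts at \(\log 2\) for n = 1 and declines toward 0.5, whereas under the Laplace prior it starts at \(\log 2\) and rises toward 1.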
One counterargument to this demonstration could be that, despite its venerable history, the case of induction is somewhat idiosyncratic, having to do more with logic than with statistics. To rebut this argument, we present two additional examples.