
Commentary on Gronau and Wagenmakers

  • Suyog H. Chandramouli
  • Richard M. Shiffrin

Abstract

The three examples Gronau and Wagenmakers (Computational Brain and Behavior, 2018; hereafter denoted G&W) use to demonstrate the limitations of Bayesian forms of leave-one-out cross validation (let us term this LOOCV) for model selection have several important properties: The true model instance is among the model classes being compared; the smaller, simpler model is a point hypothesis that in fact generates the data; the larger class contains the smaller. As G&W admit, there is a good deal of prior history pointing to the limitations of cross validation and LOOCV when used in such situations (e.g., Bernardo and Smith 1994). We do not wish to rehash this literature trail, but rather give a conceptual overview of methodology that allows discussion of the ways that various methods of model selection align with scientific practice and scientific inference, and give our recommendation for the simplest approach that matches statistical inference to the needs of science. The methods include minimum description length (MDL) as reported by Grünwald (2007); Bayesian model selection (BMS) as reported by Kass and Raftery (Journal of the American Statistical Association, 90, 773–795, 1995); and LOOCV as reported by Browne (Journal of Mathematical Psychology, 44, 108–132, 2000) and Gelman et al. (Statistics and Computing, 24, 997–1016, 2014). In this commentary, we shall restrict the focus to forms of BMS and LOOCV. In addition, in these days of “Big Data,” one wants inference procedures that will give reasonable answers as the amount of data grows large, one focus of the article by G&W. We discuss how the various inference procedures fare when the data grow large.

Keywords

Model selection · Overlapping model classes · Cross-validation · Bayes factor
At the most conceptual level, our commentary is motivated by the desire to have statistics serve science, not science serve statistics. To help make this happen, we propose that Bayesian inference be carried out with the following components. A chief motivation for these proposals is the assumption that the “true” model is never among the instances in the model classes being considered and is always more complex than any of them:
  1. Discretize all instance hypotheses and probabilities into sufficiently small intervals (multidimensional volumes).

  2. Represent all model classes as combinations of instance intervals.

  3. Approximate each instance interval by a point hypothesis within the interval.

  4. Start Bayesian inference by finding the posterior probability of every one of these point hypotheses (see Gelman and Carlin 2017).

  5. When desired and desirable, form a posterior probability for each class by summing the posterior probabilities for the instances in the class, and use those class posteriors to make class inferences.

  6. Carry out model class comparisons for model classes that do not overlap. To make this proposal generally applicable, one must allow certain types of shared instances that are “distinguishable” to be separated into distinct hypotheses.

We start with terminology: a model instance is fully specified—it has all parameter values defined and predicts a fully specified distribution of outcomes for the experiment in question. A model class is a collection of model instances, often defined by a given parametric form (e.g., a seventh-degree polynomial). Different model classes sometimes overlap by sharing instances. This is commonly seen when model classes are nested hierarchically (e.g., a second-degree polynomial is nested in a seventh-degree polynomial), in which case all the instances in the smaller class are also in the larger class. It is of course possible for model classes to overlap in non-nested fashion; e.g., class 1 for probability of success, p, might specify p to lie in (0, .7) while class 2 might specify p to lie in (.4, 1.0), so p values from .4 to .7 are shared. One might think that instances that predict the same outcome distribution are the same, but this is not usually the case. In fact, the default versions of Bayesian inference, constructing a Bayes factor with equal class priors and some sort of uninformative priors within each class, explicitly assign different prior probabilities to instances that predict the same outcome distribution but are in different classes. We discuss in more detail the situations in which this approach is reasonable scientifically.

An argument for comparing model classes that do not overlap was given in Shiffrin et al. (2016). They considered cases in which the smaller model class was an interval of instances nested in the larger class, and in which instances common to the two classes were not distinguishable (in ways laid out later in this commentary). When the classes compared were not nested, so that there were no instances in common, MDL and BMS aligned well qualitatively. They gave additional reasons for carrying out model selection and comparison for model classes that do not overlap (briefly reiterated later in this commentary). In order to eliminate such overlap from nested class comparisons, they suggested that shared instances be subtracted from the larger class before comparison. That is of course just a way of treating the overlapping instances and the non-overlapping instances as separate classes. This idea can be generalized as in Fig. 1. In Fig. 1a, one class is nested in the larger; in Fig. 1b, two classes overlap without nesting; in Fig. 1c, three classes overlap without nesting. In each case, we suggest that each separate region be considered a class for the purposes of inference, something that can be carried out by inferring the Bayesian posterior probabilities of the instances in each separate region. Along the same lines, one can carry out LOOCV by treating each separate region as a group of instances: calculate the posterior probabilities of the instances in that region based on all data points but one, predict the held-out data point, repeat with each data point in turn treated as the held-out point, and then multiply the results. These results, one for each region, can then be compared.
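
To make the region-wise LOOCV procedure concrete, here is a minimal sketch for a discretized binomial setting. The parameter grid, the uniform within-region priors, and the simulated data are illustrative assumptions for this sketch, not part of the original analyses.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.8, size=20)        # illustrative Bernoulli data (assumption)

grid = np.linspace(0, 1, 21)                # discretized instances p = .00, .05, ..., 1.00
# Two non-overlapping regions of instances, as in Fig. 1 after separation (assumption).
regions = {"MC1 minus MC2": grid[grid < 0.5], "MC2": grid[grid >= 0.5]}

def loo_log_score(instances, data):
    """Sum over held-out points of the log posterior-predictive probability,
    where the posterior over the region's instances uses all data but the held-out point."""
    prior = np.full(len(instances), 1.0 / len(instances))   # uniform prior within the region
    total = 0.0
    for i in range(len(data)):
        train = np.delete(data, i)
        s = train.sum()
        loglik = s * np.log(instances + 1e-300) + (len(train) - s) * np.log(1 - instances + 1e-300)
        post = prior * np.exp(loglik - loglik.max())
        post /= post.sum()
        # predictive probability of the held-out observation under this region
        pred = np.sum(post * (instances if data[i] == 1 else 1 - instances))
        total += np.log(pred)
    return total

for name, inst in regions.items():
    print(name, loo_log_score(inst, data))
```

The region with the larger summed log score is preferred; multiplying the per-point predictives, as in the text, corresponds to summing their logs.
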
Fig. 1

a–c We recommend comparison of each distinct region of an overlapping model space as a separate class (i.e., comparison between (MC1–MC2) and (MC2) rather than between MC1 and MC2 when MC2 is nested within MC1, as shown in the top left figure; comparison between (MC1–MC2), (MC1 ∩ MC2), and (MC2–MC1) rather than between MC1 and MC2 when parts of MC1 and MC2 overlap (top right); and so on)

Shiffrin et al. (2016) considered only model classes consisting of intervals of instances. We believe that this should almost always be the case to align with scientific practice and argue that all classes being compared should consist of intervals (i.e., multidimensional volumes) or combinations of intervals. There are well-known problems in taking seriously comparisons of a point hypothesis to an interval of hypotheses (e.g., Gelman et al. 2004, p.250). We further suggest that point hypotheses are in almost all cases not scientifically plausible and should be considered as stand-ins for, and approximations of, small intervals. Doing so simplifies exposition, aligns with modern technology that uses discretization on computers to carry out statistical inference, and helps make model comparison methods easier to understand. It should be noted that most applications of model comparison involve nested models, in which one compares model classes with differing numbers of parameters. In such cases, the smaller class would typically or always have a lower dimensionality than the larger, and subtracting the smaller class from the larger prior to comparison would make no difference. For example, in the cases considered by Gronau and Wagenmakers (G&W), one could subtract the point hypothesis, of measure zero, from the entire interval; the larger model class, the entire interval, would for all practical purposes, and indeed for Bayesian inference, be unchanged by the subtraction.

Yet, point hypotheses (or more generally lower dimensional hyperplanes) are almost never scientifically plausible. Consider the common example of testing ESP with coin flipping: One hypothesis is p = .5 (chance). The other class of hypotheses might be a distribution across all p values in the interval (0,1). But .5 cannot be exactly correct: A coin is asymmetrical, and its probability of producing heads will not exactly match .5. Nor is any other exact value possible: Different coins are slightly different, different test settings will differ slightly, so what is really meant by “chance” is a small interval that is approximated by the point hypothesis. We argue that this is the case in essentially all scientific settings. From a mathematical perspective, there are interesting issues that arise from the use of a class of hypotheses that is of measure zero within a larger class. Consider for example a test of the hypothesis that the probability of guessing is a rational number in (0,1) vs. the hypothesis that it is a transcendental number in (0,1). The rationals are dense in (0,1) but of measure zero, so technically have zero probability. Yet, we are discussing giving point hypotheses, such as a particular rational number, a positive probability. Measure theory and set theory questions led to important developments in the foundations of mathematics, but are not of scientific interest when comparing models. We argue and will try to show that the differences between LOOCV and Bayesian inference noted by G&W are in part due to taking point hypotheses seriously rather than as stand-ins for intervals, and in part due to comparing overlapping classes.

In this commentary we suggest starting Bayesian inference by calculating posterior probabilities of instances. This suggestion is similar in certain respects to one put forth in Gelman and Carlin (2017). There are nonetheless cases where scientists would want to compare model classes. Good arguments can be made that better information, more information, and more useful information can be found in the distribution of posterior probabilities across instances in each class than in the class summary statistic formed by adding the instance posteriors within each class. However, science has made progress historically by forming, testing, and refining parameterized models (i.e., model classes), and often chooses between model classes on the basis of a binary assessment concerning which does better overall, based on a summary statistic. Thus class comparisons based on a summary statistic for each class will likely remain a core component of model comparison.

When model classes do not share instances, when point hypothesis instances are stand-ins for small intervals, and when classes are comprised of such intervals, everything is easy to understand. The Bayesian procedure is straightforward: Each instance in every class is assigned a prior probability. These add to 1.0. Bayesian inference is used to produce a posterior probability for each instance; these also add to 1.0. A summary statistic for each class can be formed by summing its instance posteriors. These class posteriors will also sum to 1.0. For example, if each class is defined to be all instances in each separate, non-overlapping region of the Venn diagram in Fig. 1c, rather than defined to be all instances in each of the three circles, then the class posteriors will add to 1.0, and class inference will match standard BMS (see Eqs. 1–3 below). When classes overlap, matters are more complicated, but one result is worth noting: the class posteriors will add to more than 1.0 (due to multiple counting), but the ratios of the class posteriors will match standard BMS, as shown in Eqs. 1–3. We argue below that this should be done only when the overlapping instances are distinguishable.

Equation 1 gives the usual Bayesian model selection (BMS) method for comparing two classes.
$$ \frac{p\left({M}_1|{x}_{\mathrm{obs}}\right)}{p\left({M}_2|{x}_{\mathrm{obs}}\right)}=\frac{\sum_ip\left({x}_{\mathrm{obs}}|{\theta}_{1i}\right){p}_0\left({\theta}_{1i}|{M}_1\right)}{\sum_jp\left({x}_{\mathrm{obs}}|{\theta}_{2j}\right){p}_0\left({\theta}_{2j}|{M}_2\right)}\frac{p_0\left({M}_1\right)}{p_0\left({M}_2\right)} $$
(1)
In Eq. 1, M1 and M2 are model classes, xobs is the observed data, and the θ’s are the parameters defining the instances in the model classes. The first ratio on the right-hand side is the Bayes factor; the second is the ratio of class priors.
$$ \frac{p\left({M}_1|{x}_{\mathrm{obs}}\right)}{p\left({M}_2|{x}_{\mathrm{obs}}\right)}=\frac{\sum_ip\left({x}_{\mathrm{obs}}|{\theta}_{1i}\right){p}_0\left({\theta}_{1i}\right)}{\sum_jp\left({x}_{\mathrm{obs}}|{\theta}_{2j}\right){p}_0\left({\theta}_{2j}\right)} $$
(2)
Equation 2 is an equivalent form of Eq. 1, using unconditional priors for the θ’s, obtained by multiplying the two prior terms in Eq. 1.
$$ \frac{p\left({M}_1|{x}_{\mathrm{obs}}\right)}{p\left({M}_2|{x}_{\mathrm{obs}}\right)}=\frac{\sum_ip\left({x}_{\mathrm{obs}}|{\theta}_{1i}\right){p}_0\left({\theta}_{1i}\right)/Z}{\sum_jp\left({x}_{\mathrm{obs}}|{\theta}_{2j}\right){p}_0\left({\theta}_{2j}\right)/Z} $$
(3)
Equation 3 is an equivalent form for Eqs. 1 and 2, where
$$ Z=\sum_kp\left({x}_{\mathrm{obs}}|{\theta}_k\right){p}_0\left({\theta}_k\right) $$
where k indexes all instances in all the classes, as if they were separate and distinguishable and where
$$ p\left({x}_{\mathrm{obs}}|{\theta}_k\right){p}_0\left({\theta}_k\right)/Z $$

is the posterior probability of instance θk in comparison to all distinguishable instances. Thus, Eq. 3 shows that the ratio of sums of the posterior probabilities of the separate distinguishable instances within each class will be the usual result of Bayesian inference: The Bayes factor times the ratio of class priors. (Although we will not pursue the issue, there is an equivalent form of these three equations in which the sums are replaced by integrals, and the instances by intervals.)
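
As a numerical sanity check on the equivalence of Eqs. 1–3 in discretized form, the following sketch uses an illustrative binomial data set, a 21-point grid, equal class priors, and uniform conditional priors within each class; all of these specifics are assumptions made only for the sketch.

```python
import numpy as np
from scipy.stats import binom

n, k = 20, 14                            # illustrative data: 14 successes in 20 trials (assumption)
grid = np.linspace(0, 1, 21)             # discretized instances p = .00, .05, ..., 1.00

# Two non-overlapping classes of instances (separate regions, as we recommend).
theta1, theta2 = grid[grid <= 0.5], grid[grid > 0.5]
p_M1, p_M2 = 0.5, 0.5                    # class priors (assumption)
cond1 = np.full(len(theta1), 1 / len(theta1))   # conditional priors p0(theta | M1)
cond2 = np.full(len(theta2), 1 / len(theta2))   # conditional priors p0(theta | M2)

lik1, lik2 = binom.pmf(k, n, theta1), binom.pmf(k, n, theta2)

# Eq. 1: Bayes factor times the ratio of class priors
eq1 = (lik1 @ cond1) / (lik2 @ cond2) * (p_M1 / p_M2)

# Eq. 2: the same ratio written with unconditional instance priors
u1, u2 = cond1 * p_M1, cond2 * p_M2
eq2 = (lik1 @ u1) / (lik2 @ u2)

# Eq. 3: ratio of summed instance posteriors, normalized over all instances by Z
Z = lik1 @ u1 + lik2 @ u2
post1, post2 = lik1 * u1 / Z, lik2 * u2 / Z
eq3 = post1.sum() / post2.sum()

print(eq1, eq2, eq3)                     # identical up to floating-point rounding
print(post1.sum() + post2.sum())         # class posteriors sum to 1.0 (non-overlapping classes)
```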

We will later discuss the desire to favor simpler models, the Bayesian Ockham’s razor, but simply note here that BMS in any of these equivalent versions has a prior “bias” against whatever model class(es) have a smaller average unconditional prior probability across their instances (see Eq. 2), a bias that typically works in favor of the smaller, simpler class.

There is an important proviso to what we have just presented: In the above presentation, we have used the term distinguishable when describing the separate instances in the various classes. We use this term because it allows us to generalize our proposal that model classes should not overlap, allows us to assign prior probabilities that are our best guesses concerning each instance (as we shall show), and, as we see from Eqs. 1–3, matches traditional Bayesian analysis. As we describe next, there are many settings for which some instances, and perhaps all instances, are shared between classes but the shared instances are distinguishable, even though they predict the same distribution of outcomes of an experiment. When this is true, we suggest the apparently shared instances be separated into distinguishable instances, thereby eliminating the overlap of the classes under comparison. Doing so also allows us to assign prior probabilities to each of the resultant instances that match our best estimates from prior knowledge. The fact that the result matches that from traditional Bayesian analysis suggests that BMS implicitly and perhaps explicitly assumes distinguishable instances: This is most obvious in default uses of the Bayes factor (Eq. 1), because the same instance (in the numerator and denominator) is given a different prior depending on the class in which it resides.

Consider first an example taken from Kruschke (2014) and discussed in Appendix D of Shiffrin et al. (2016). Factories A and B each make biased coins, the bias being determined by different beta distributions for A and B. A coin is selected randomly from one of the two factories and flipped N times, producing n successes. We wish to infer the factory from which it was chosen. The beta distributions can be considered as the model classes and the biases as the instances; in this case every instance is shared. Shiffrin et al. (2016) gave various reasons to eliminate this example (and presumably similar examples) as a model comparison situation. We now propose to deal with this situation using our proposed method of model comparison, starting with instance inference. It is possible to handle this case by differentiating each shared instance (a given value of coin bias) by the factory of origin. Thus we replace each instance, each p value, by two instances, one for each factory: (A, p) and (B, p). That is, we distinguish each instance by a causal model: which factory produced the coin. If, for example, one factory coated each coin with a poisonous substance that would kill anyone who handled the coin 2 weeks after contact, one would certainly want to distinguish the inferred p value by the factory of origin; the predicted distribution of coin flips would be the same for a given p for a coin from either factory, but the factory of origin is a critical factor. Once the coins are distinguished by factory of origin, all instances are separate and none are shared. It is straightforward to assign priors to these separated instances and carry out Bayesian inference. The posterior probability of each factory of origin is then the sum of its instance posterior probabilities. The result will of course match the standard Bayesian analysis, but does so without comparing overlapping model classes.
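
A sketch of this analysis with instances distinguished by factory of origin follows; the particular beta distributions and the coin-flip data below are illustrative assumptions, not Kruschke's exact numbers.

```python
import numpy as np
from scipy.stats import beta, binom

N, n = 30, 22                                  # illustrative data: 22 heads in 30 flips (assumption)
grid = np.linspace(0.005, 0.995, 100)          # discretized coin biases p

# Illustrative factory bias distributions (assumed shapes, not Kruschke's parameters).
w_A = beta.pdf(grid, 8, 4); w_A /= w_A.sum()   # factory A tends to produce high-bias coins
w_B = beta.pdf(grid, 4, 8); w_B /= w_B.sum()   # factory B tends to produce low-bias coins

# Split each shared bias p into two distinguishable instances, (A, p) and (B, p),
# each with its own unconditional prior (equal prior probability on the two factories).
prior_A, prior_B = 0.5 * w_A, 0.5 * w_B
lik = binom.pmf(n, N, grid)

Z = lik @ prior_A + lik @ prior_B              # normalizer over all distinguishable instances
post_A = lik * prior_A / Z                     # posteriors of the (A, p) instances
post_B = lik * prior_B / Z                     # posteriors of the (B, p) instances

print("P(factory A | data) =", post_A.sum())   # factory posterior = sum of its instance posteriors
print("P(factory B | data) =", post_B.sum())
```

Because every (A, p) and (B, p) pair is treated as a distinct instance with its own unconditional prior, no classes overlap, yet the factory posteriors match the standard Bayesian answer.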

A similar case occurs in hierarchical modeling. Suppose there exists a set of master distributions from which instance parameters are sampled. The same parameters might be sampled from different master distributions, but we suggest they be considered separate instances by virtue of the master distribution of origin. In this case the master distributions occupy the positions of the factories in the example in the previous paragraph.

A third example is one that Shiffrin and Chandramouli are pursuing in current research. Space prevents giving details, but they take into account experimenter-produced biases when making inferences about model classes from data. Consider a study of extrasensory perception (ESP) in which a subject successfully predicts the outcomes of 623 of 1000 coin flips. Suppose there exists a model instance with a probability of success much higher than .5, an instance producing a distribution of outcomes with a high likelihood of 623 successes. This instance might well lie in the ESP class, if it represented a process in which prediction is well above chance due to some extrasensory basis. However, the same distribution might be predicted by a pure guessing model, if that pure guessing was distorted by experimenter error or bias. For example, the experimenter might discard a number of trials showing mispredictions (perhaps on the basis that the subject claims “not feeling right”). The “guessing plus distortion” instance might predict the same outcome distribution as the ESP instance, but the two are quite different in kind, based on completely different causal models. We therefore argue they should be treated as separate instances, each assigned an appropriate prior (presumably lower for the ESP instance) at the outset of the inference process.

A fourth example is related to the ESP example above. A subject tries to make a binary decision about a stimulus claimed not to have been seen (Baribault et al. 2018). In this case it would make sense to separate the .5 hypothesis into two instances, one representing “guessing” and one representing probability of success based on features perceived unconsciously. These should be assigned priors based on our knowledge.

A fifth example arises due to precision of measurement, perhaps caused by discretization. Consider one model class being a linear function with zero slope and only intercept as a parameter. Another model class is a flat sine wave with very small amplitude and fixed phase, varying only in intercept and frequency. Due to the small amplitude, at the level of discretization used, there could well be many sine wave instances indistinguishable from the linear function with the same intercept. Yet, one would want to discriminate these instances for several reasons, including the facts that they are based on different generating models and that a finer level of discretization might make the instances discriminable.

A sixth example is illustrated in Fig. 2. Model class 1 predicts accuracy rises as response time rises. Model class 2 predicts accuracy falls as response time rises. Instances in the region of overlap represent different causal models and should be treated as separate instances.
Fig. 2

Model classes with overlapping instances that are distinguishable. Two psychological models propose different causal relations between response time and accuracy

These examples show cases where a shared instance should be treated as a group of separate instances, because the resultant separate instances are actually or potentially distinguishable, by virtue of different mechanisms, causes, processes, or origins. Of course, for scientific reasons, we also ought to separate non-shared instances that are distinguishable, even if the main question is one of class comparison. An ESP researcher might want to test the class that p is greater than .6 vs. any value of p. The value p = .5 is not shared but its posterior probability would reflect some combination of guessing and ESP performance, and it would be of value to separate these possibilities even if the class comparison is aimed at a different contrast. In any event, to clarify exposition in the following, let an instance that predicts a different outcome distribution than any other instance be termed an x-instance (by definition these cannot be forms of a distinguishable instance). Let instances that predict the same outcome distribution but have distinguishable forms be termed d-instances.

Examples three and four are cases where the shared instance is a point hypothesis, commonly seen in statistical practice in the form of null hypothesis testing, and highlighted in the three examples presented by G&W. Separating such a shared instance when distinguishable will match the results obtained with a Bayes factor analysis (in discretized form) using the same prior assignments (as shown in Eqs. 1–3). However, we do not prefer this approach, and rather recommend comparison of non-overlapping models: e.g., treat the small interval representing the point hypothesis as one class, and treat all the other (non-overlapping) intervals as the other class. We return to this point later.

Shared instances are not always distinguishable. Sometimes separate instances are arbitrarily collected into classes that overlap. In such cases, we consider the shared instances to be indistinguishable and recommend that we carry out inference on each separate instance (e.g., Gelman and Carlin 2017). In such settings, we should assign a prior probability to any shared instance as if it were a single instance. For example, suppose we are testing the probability p of success in a gym climbing competition. Someone proposes to compare two classes arbitrarily formed to consist of class 1: p lies in (.75, .80); class 2: p lies in (.10, .90). In this case, p = .78, a shared instance, is not different in kind from p = .70, a non-shared instance. We should therefore assign prior probabilities to each (discretized) value of p and calculate posterior probabilities of each value, ignoring the arbitrary class assignments. If desired, one can assess the arbitrary classes by summing the posteriors of the instances in each class. The ratio of those class posteriors will match the Bayes factor analysis if the class priors are made equal to the summed instance priors in each class. However, we argue against comparing in this way when shared instances are not distinguishable. Instead, we recommend using the instance posteriors directly, or perhaps comparing non-overlapping classes. If we do the latter, any interval of instances (or several such intervals) has a posterior probability equal to the sum of its instance posteriors—this seems natural and sensible.

Here is a trivial example illustrating the difference. Suppose we assess the probability of success, p, and have two model classes, A: {.00, .50, 1.0} and B: {.50}. Suppose the data are five successes and five failures. If we assume class priors are equal, and the instances of class A have equal probabilities, the Bayes factor for A over B is {(1/3)(.5**10)}/{(.5**10)} = (1/3). Suppose we had distinguished the .5 instance into guessing (.5, g) and causal process (.5, c), and assigned them the unconditional priors of the traditional analysis: .5 and (.5)(1/3) respectively. The posterior for (.5, g) among {.00, (.5, g), (.5, c), 1.0} would be (.5)(.5**10)/{(.5)(.5**10) + (1/6)(.5**10)}, which would be the posterior for class B. The class A posterior would be 1.0 minus this. The ratio for class A over B would be (1/6)/(.5) = 1/3, matching the Bayes factor, a result that must occur, as shown in Eqs. 1–3. The inference seems reasonable: If .5 has two causal explanations, the 3:1 preference between them should match the prior ratio, since the likelihoods are the same.
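
The arithmetic in this example can be checked directly; the following minimal snippet reproduces the numbers above.

```python
# Class A = {.00, .50, 1.0}, class B = {.50}; data: five successes and five failures.
lik = lambda p: p**5 * (1 - p)**5                     # likelihood of the mixed data

# Traditional analysis: equal class priors, equal conditional priors within class A.
bf_A_over_B = ((1/3) * (lik(0.0) + lik(0.5) + lik(1.0))) / lik(0.5)
print(bf_A_over_B)                                    # 1/3

# Distinguish .5 into (.5, g) and (.5, c), with unconditional priors .5 and (.5)(1/3).
unnorm = {".00": (1/6) * lik(0.0), "(.5, c)": (1/6) * lik(0.5),
          "1.0": (1/6) * lik(1.0), "(.5, g)": (1/2) * lik(0.5)}
Z = sum(unnorm.values())
post = {name: value / Z for name, value in unnorm.items()}
print(post["(.5, g)"])                                # class B posterior = .75
print(1 - post["(.5, g)"])                            # class A posterior = .25; ratio A:B = 1/3
```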

However, suppose we did not believe the .5 instance was different by virtue of its class membership. In this case, we could assign whatever non-zero priors we liked to the three p values; regardless, the posterior for .5 would be 1.0: Neither p = .00 nor p = 1.0 can predict the mixed data. When the shared instance is not distinguishable, we think this is the inference we would want to reach, not one concerning arbitrary classes.

Let us return briefly to the reasons for preferring to compare model classes that do not overlap. Shiffrin et al. (2016) demonstrated that MDL and BMS are brought into better qualitative alignment when this is done and gave a few other reasons for comparing non-overlapping model classes. Consider comparisons for which shared instances are not distinguishable. There is a logical argument that the focus should be on assessing which instances in the various classes provide the best approximation to the truth, an assessment that we will assume is approximated by the Bayesian posterior probability for each instance. Having made that assessment, summing the probabilities of the instances in separated classes produces what should be an unambiguous class inference. Shiffrin et al. (2016) also discussed comparing a class A to a class B nested within it, and alternatively comparing a class A to a class (A-B) nested within it. Logically, one is interested in which of A and A-B contains instances best approximating the truth. This is best ascertained by treating the instances of A and (A-B) separately in inference. If one instead carries out these two complementary nested model comparisons in the standard way, the results tend to differ due to the potentially different sizes of A and (A-B). Arguments like this are even more persuasive when there are more than two classes to be compared. For example, in the situation depicted in Fig. 1c, it is not clear what ratios of posteriors for overlapping classes should be taken to represent properly all the class preferences. Shiffrin et al. (2016) also discuss complications that arise when comparing model classes that overlap and have instance prior probabilities that are highly non-uniform. In such situations, class comparisons are more coherent when model classes do not overlap. Thus in these cases where shared instances are not distinguishable, we suggest drawing inferences from the posteriors for the various instances. Alternatively and equivalently, one could change the model comparison to classes without shared instances, in which case the class posterior obtained by summing the instance posteriors in the class would be meaningful.

An important issue that deserves more discussion is the proper way to impose various sorts of simplicity preferences when carrying out model class comparison. That discussion would take a long article of its own, but we mention a few considerations here. The Bayesian Ockham’s razor prefers simpler classes to the degree that the average of the unconditional prior probabilities in the larger class is smaller than the average in the smaller class (Eq. 2). When the amount of data is limited and noisy, the instances in the larger class that are given high posterior probability will tend to be far from the instance in that same class that best approximates the truth. This happens because high probability is given to those instances that fit the data, and hence fit the noise in the data. A lower average prior probability for the instances in the larger class tends to counter this factor. However, as more and more data are collected, the posterior tends to converge on one instance, independent of the prior assignments. This is another way of saying the likelihoods get larger and dominate the priors. If one assumes that there is significant probability that the smaller class contains an instance that is true, and this is the case, then the posterior will converge on an instance in the smaller class. However, in science, all models are wrong, and we should be trying to infer which instances best approximate the truth, a truth that is almost always something more complex than any instances in any classes being considered. As long as instances in both the large and small classes “cover” the plausible outcomes of an experiment, the Bayesian posterior will most of the time converge on an instance in the larger class, due to the greater range of instances in the larger class, one of which is more likely to approximate the more complex underlying reality. In effect, the simplicity preference is lost as the data grow large. For various scientific reasons, we would not want to almost always favor a more complex model as the amount of data grows.
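
The loss of the simplicity preference with growing data can be illustrated with a small sketch. Here the data-generating value is assumed to lie just outside the small class, standing in for the more general case in which the truth is more complex than any instance in the classes being compared; the grid, priors, and generating value are illustrative assumptions.

```python
import numpy as np
from scipy.stats import binom

grid = np.linspace(0, 1, 101)                 # discretized instances
small = np.isclose(grid, 0.5)                 # "simple" class: the small interval around .5
large = ~small                                # larger class: everything else
prior = np.empty_like(grid)
prior[small] = 0.5                            # equal class priors (assumption)
prior[large] = 0.5 / large.sum()              # prior spread thinly over the larger class

p_true = 0.53                                 # data-generating value just outside the small class
for N in [10, 100, 1000, 10000]:
    k = int(round(p_true * N))                # representative data for this sketch
    post = binom.pmf(k, N, grid) * prior
    post /= post.sum()
    print(N, "P(small class | data) =", post[small].sum())
```

At modest N the concentrated prior favors the small class; as N grows the posterior concentrates on the best-fitting instance, which lies only in the larger class, and the simplicity preference washes out.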

Shiffrin and Chandramouli (2016) and Chandramouli and Shiffrin (2016) propose a possible solution. In science, we do not want models to fit only the results of the current experimental setting; we want results, and the models of the results, to generalize to other relevant and similar settings. Different settings will produce different outcomes due to a host of factors that will affect the results, but are not specified and in most cases cannot be identified. In effect, a model assessed by the results of the present study and setting will be fitting the “noise” associated with the current setting. Shiffrin and Chandramouli (2016) proposed dealing with this problem by differentiating two forms of variability determining the distribution of outcomes an instance produces: “within-setting variability” that decreases with the amount of data in the current setting, and “between-setting variability” that does not decrease with the amount of data in the study. To the extent that the between-setting variability is large, the simplicity preference induced by the average prior in the larger class will remain even as the data grow large. In this case, the likelihood is not forced to increasingly dominate the prior.

Beyond these factors that impose a preference for simpler models are others that are not statistical but due to the way that humans practice science. Humans seek to understand the processes and primary causal factors that are at work in an experimental setting and that produce the predictions of each model being considered. However, humans have limited cognitive abilities. Often, a very large and complex model will fit well and may approximate the truth, but will be difficult to understand. The large and complex model may nonetheless be preferred for engineering applications, but if the goal is developing theoretical understanding, an additional simplicity preference might be justified beyond what any of the statistical methods produce.

We now discuss the implications of our proposals for the three examples presented by G&W, and for LOOCV. It is fairly easy to see what happens in the limit as data grow large, if we carry out our preferred approach. Let us take the three examples considered by G&W, replace the point hypotheses by small intervals, remove the overlap of classes, and calculate the Bayesian posterior for the resultant classes by summing the instance posteriors in each class. As the data grow without bound, the analyses by BMS and LOOCV come to align: both converge on favoring the point hypothesis with certainty. The reasons are “boringly” obvious. Consider the first example. Suppose we discretize the interval [0,1] into 101 point hypotheses, .00, .01, …, .99, 1.0. Suppose we compare the “class” [1.0] with the larger class [.00, .01, …, .99], and the data are generated by the instance 1.0. As the data grow, the Bayesian posterior for the small class is fixed at its only instance, 1.0, and the Bayesian posterior within the large class converges on .99, because .99 predicts a sequence of successes better than .98 (or any other point hypothesis in the larger class). As the number of successes N grows, the hypothesis 1.0 increasingly wins the competition with the larger class, because the likelihood of N successes is 1.0 for the small class but converges on .99**N for the larger class. Similarly, the LOOCV prediction for each held-out success is 1.0 for the small class and converges on .99 for the larger class; over the N such comparisons, the products are again 1.0 vs. .99**N. Analogous arguments and analyses apply to a point hypothesis of .5, or to a point hypothesis of a zero mean for a Gaussian distribution. Figures 3 and 4 contrast the inferences for our proposals (labeled M1 vs. M2) with traditional Bayesian inference (labeled M1 vs. M3) as N increases.
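
The limiting calculation just described can be sketched for the first example with a 101-point discretization of [0, 1]; all-success data generated by the instance p = 1.0, equal class priors, and uniform priors within each class are the assumed setup for this sketch.

```python
import numpy as np

grid = np.linspace(0, 1, 101)            # .00, .01, ..., 1.00
M1 = np.isclose(grid, 1.0)               # the point hypothesis p = 1.0 as its own class
M2 = ~M1                                 # all remaining instances
M3 = M1 | M2                             # the full, overlapping class

def loo_weight(mask_a, mask_b, N):
    """LOO weight for region a vs. region b when all N observations are successes."""
    def pred_success(mask):
        post = grid[mask] ** (N - 1)     # uniform prior within the region
        post = post / post.sum()
        return np.sum(post * grid[mask]) # predictive probability of a held-out success
    sa, sb = pred_success(mask_a) ** N, pred_success(mask_b) ** N
    return sa / (sa + sb)

for N in [10, 100, 1000]:
    lik = grid ** N                      # likelihood of N successes in a row
    bf_12 = lik[M1].mean() / lik[M2].mean()   # grows without bound as N grows
    bf_13 = lik[M1].mean() / lik[M3].mean()   # approaches the number of grid points (101)
    print(N, "BF(M1,M2) =", bf_12, "BF(M1,M3) =", bf_13,
          "LOO weight M1 vs M2 =", loo_weight(M1, M2, N),
          "LOO weight M1 vs M3 =", loo_weight(M1, M3, N))
```

With a 21-point grid this reproduces the qualitative pattern in Fig. 3: the M1 vs. M2 comparisons converge to complete preference for M1, while the nested M1 vs. M3 comparisons level off at intermediate values that depend on the discretization.
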
Fig. 3

a LOO weights as the number of trials increases for the two binomial cases considered by G&W. Top panels: M1 represents the point hypothesis θ = 1. Bottom panels: M1 represents the point hypothesis θ = 0.5. M3 represents all hypotheses in [0,1]. M2 represents all possible instances except the point hypothesis (i.e., M2 = M3 − M1). The left panels show comparison of M1 with M2: The comparison converges to complete preference for M1. The right panels compare M1 with M3. In these nested comparisons, the weights converge to a value that depends on the fineness of discretization, which for this calculation consisted of 21 hypotheses evenly spaced from .00 to 1.0. b Bayesian inference for the cases considered in a: plotted are the Bayes factor divided by one plus the Bayes factor, assuming equal class priors and equal priors for the 21 evenly spaced instances within the large class. The left panels show comparisons of M1 with M2; these converge to complete preference for M1. The right panels show comparisons of M1 to M3; the plotted weight is based on a Bayes factor that converges to the ratio of priors for the point hypothesis, which in this case is the ratio of class sizes, 21:1 (see text)

Fig. 4

a LOO weights for increasing sample sizes when each sample has mean ȳ = 0 and standard deviation σy = 1 and is generated by N(μ = 0, σ = 1). Results are shown for discretized hypotheses, with means μ ranging from −1 to 1, 0.05 apart. These instances have priors discretized to approximate a truncated Gaussian distribution with zero mean and standard deviation one. The data are samples from the Gaussian with the discretized mean, also discretized by rounding to the nearest .01. M1 is the point hypothesis μ = 0. M3 is all the discrete hypotheses including the point hypothesis. M2 represents all means except the point hypothesis (i.e., M2 = M3 − M1). The left panel shows comparison of M1 with M2; the weights converge to complete preference for M1. The right panel shows comparison of M1 to M3; the weights converge to an intermediate value. b Weights are the Bayes factor divided by one plus the Bayes factor. Comparisons are based on the assumptions in a. The left panel shows comparison of M1 with M2: The weights converge to complete preference for M1. The right panel shows comparison of M1 with M3: The weights converge to a value based on a Bayes factor that converges to the ratio of priors for the point hypothesis (see text)

Although we prefer to compare classes that do not overlap, as implemented in the analyses described in the previous paragraph, the cases considered by G&W involve nested classes: The point hypothesis is also in the larger class. Again, we can reason about what will happen in the limit, when the point hypothesis generates the data, and when we discretize the instances and classes. Consider first Bayesian inference in the form of the Bayes factor analysis. As illustrated in the example discussed earlier, the instance posterior will converge on the point hypothesis as the data grow. That is, each class’s sum of prior times likelihood will converge on the prior times likelihood for the single instance that is the point hypothesis generating the data. Thus, whether or not we consider the point hypothesis as distinguishable, the Bayes factor will approach the ratio of the (unconditional) priors assigned to the point hypothesis in the two classes (because the likelihoods are the same). In most cases of interest, this ratio of priors will grow without bound as the fineness of discretization increases. For example, in the default use of the Bayes factor, a uniform prior within each class spreads the class prior evenly across all of its instances: the more instances, the lower the prior assigned to any one of them, including the point hypothesis. In such cases, the limit as the fineness of discretization increases will match the continuous Bayes factor result.
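
In the notation of Eqs. 1–3, this limiting argument can be written compactly (a sketch, assuming the data xN are generated by a point hypothesis θ* that lies in both classes, and writing p0M1(θ*) and p0M2(θ*) for the unconditional priors the point hypothesis receives as a member of M1 and of M2, respectively):
$$ \frac{p\left({M}_1|{x}_N\right)}{p\left({M}_2|{x}_N\right)}=\frac{\sum_ip\left({x}_N|{\theta}_{1i}\right){p}_0\left({\theta}_{1i}\right)}{\sum_jp\left({x}_N|{\theta}_{2j}\right){p}_0\left({\theta}_{2j}\right)}\;\longrightarrow\;\frac{p\left({x}_N|{\theta}^{\ast}\right){p}_0^{M_1}\left({\theta}^{\ast}\right)}{p\left({x}_N|{\theta}^{\ast}\right){p}_0^{M_2}\left({\theta}^{\ast}\right)}=\frac{{p}_0^{M_1}\left({\theta}^{\ast}\right)}{{p}_0^{M_2}\left({\theta}^{\ast}\right)} $$
because the likelihood concentrates on θ* and each sum becomes dominated by its θ* term; with default uniform within-class priors, this ratio grows as the discretization becomes finer, recovering the continuous Bayes factor result.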

However, when is the Bayes factor result, certainty assigned to the point hypothesis, scientifically sensible? When there are different causal or other bases for distinguishing different forms of the point hypothesis, we regard the ratio of priors, a ratio that is neither infinite nor zero, as a sensible inference as data increase. This is especially likely when the point hypothesis represents an interval, something that would be the case in almost all scientific applications. Why? If the point hypothesis represents an interval, suppose that the data cause inference to converge on certainty that that interval contains the true hypothesis (or best approximation to truth). When that interval is in both classes, the only basis for distinguishing the different forms of that interval hypothesis is the different priors assigned to them. As a corollary, in such cases of distinguishable instances one should not discretize to a finer level than the believed size of the interval represented by the point hypothesis. We believe this is reasonable: We come to believe that the interval containing the point hypothesis does generate the data, but there are two forms of the interval point hypothesis, so which of them generates the data should remain uncertain.

The situation is quite different when different forms of the point hypothesis are not distinguishable. In that case, it does not make sense to assign different priors to the same point hypothesis due to its class membership. Yet assigning different priors in this way is inherent in standard Bayesian analysis using default priors. One could adjust the class prior ratio (in Eq. 1) to remove the difference in conditional priors, and in such cases probably should. However, there is little point in operating this way. We prefer not to carry out standard analyses in such cases, and instead recommend using the posterior probabilities of the various instances (point hypotheses) directly, or comparing non-overlapping classes.

In the case of overlapping classes, the predictions for LOOCV are somewhat harder to see and explain. The point hypothesis that generates the data will predict the data better than any other hypothesis in the large class, and increasingly so as data grow. The point hypothesis, when considered as a class by itself, by definition has a within-class Bayesian posterior probability of 1.0. However, within the larger class, it has a posterior probability smaller than 1.0 because other hypotheses, especially nearby ones, have non-zero posterior probability. This difference becomes increasingly small as N grows, tending to make the class predictions equal, but LOOCV is used N times for prediction, tending to magnify whatever differences exist. The way these conflicting tendencies change as N grows is mathematically interesting but probably not of scientific interest, except to the extent that they produce sensible inference. In any event, we show in our figures the results of inference as N grows for LOOCV for overlapping classes (in the comparisons labeled M1 vs. M3), for the three examples examined by G&W. The results shown are for the discretized situation. Keep in mind that we do not recommend carrying out either Bayesian inference or LOOCV for overlapping classes, so these results are not what we recommend. They are shown for correspondence to the G&W article. We recommend instead comparison of non-overlapping classes comprised of intervals of instances, in which case LOOCV and Bayesian inference align as N grows.

Our recommendations for carrying out inference are easy to understand and will in our opinion produce inference that matches good scientific judgment. In some cases, they will be computationally efficient (see examples one and two in G&W); in other cases, mathematical derivations may be used to augment the method and make it more efficient (see the Gaussian example of G&W). However, we regard the production of good and sensible scientific inference as the best argument in their favor.

References

  1. Baribault, B., Donkin, C., Little, D. R., Trueblood, J., Oravecz, Z., van Ravenzwaaij, D., White, C. N., De Boeck, P., & Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences, 115, 2607–2612.
  2. Bernardo, J. M., & Smith, A. F. (1994). Bayesian theory. John Wiley & Sons.
  3. Browne, M. (2000). Cross-validation methods. Journal of Mathematical Psychology, 44, 108–132.
  4. Chandramouli, S. H., & Shiffrin, R. M. (2016). Extending Bayesian induction. Journal of Mathematical Psychology, 72, 38–42.
  5. Gelman, A., & Carlin, J. (2017). Some natural solutions to the p-value communication problem—and why they won’t work. Journal of the American Statistical Association, 112(519), 899–901.
  6. Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). Chapman and Hall.
  7. Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24, 997–1016.
  8. Gronau, Q. F., & Wagenmakers, E.-J. (2018). Limitations of Bayesian leave-one-out cross-validation for model selection. Computational Brain and Behavior.
  9. Grünwald, P. (2007). The minimum description length principle. Cambridge: MIT Press.
  10. Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
  11. Kruschke, J. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd ed.). Academic Press.
  12. Shiffrin, R. M., & Chandramouli, S. H. (2016). Model selection, data distributions, and reproducibility. In H. Atmanspacher & S. Maasen (Eds.), Reproducibility: principles, problems, and practices (pp. 115–140). New York: John Wiley.
  13. Shiffrin, R. M., Chandramouli, S. H., & Grünwald, P. G. (2016). Bayes factors, relations to minimum description length, and overlapping model classes. Journal of Mathematical Psychology, 72, 56–77.


Authors and Affiliations

  1. Indiana University, Bloomington, USA
