Abstract
This paper considers a key point of contention between classical and Bayesian statistics that is brought to the fore when examining so-called ‘persistent experimenters’—the issue of stopping rules, or more accurately, outcome spaces, and their influence on statistical analysis. First, a working definition of classical and Bayesian statistical tests is given, which makes clear that (1) once an experimental outcome is recorded, other possible outcomes matter only for classical inference, and (2) full outcome spaces are nevertheless relevant to both the classical and Bayesian approaches, when it comes to planning/choosing a test. The latter point is shown to have important repercussions. Here we argue that it undermines what Bayesians may admit to be a compelling argument against their approach—the Bayesian indifference to persistent experimenters and their optional stopping rules. We acknowledge the prima facie appeal of the pro-classical ‘optional stopping intuition’, even for those who ordinarily have Bayesian sympathies. The final section of the paper, however, provides three error theories that may assist a Bayesian in explaining away the apparent anomaly in their reasoning.
Notes
Note that the outcomes of an experiment are often summarized in terms of a test statistic, for example, ‘4 heads in 20 tosses’ as opposed to a precise description of the ordered sequence of tosses. This paper does not refer to test statistics, however, and thus the criteria for determining what is an appropriate test statistic will not be discussed.
A composite hypothesis, on the other hand, specifies a set of possible probability distributions over the outcome space.
Classical statistics does not recognize probabilities for hypotheses, so in this setting the likelihoods \(Pr(x|H_i)\) are not equivalent to the familiar ratio of unconditional probabilities. The likelihoods must instead be primitive, or rather, we should write \(Pr_{H_i}(x)\). This is a detail, however, that need not concern us; likelihoods are expressed as conditional probabilities throughout the paper, for both the classical and Bayesian settings alike.
Another reason for stressing that it is ultimately the role of the full outcome space that is at the heart of the classical/Bayesian dispute, and not the stopping rule per se, is that there are some models of experiments that involve so-called informative stopping rules. In these cases, relevant information that governs stopping is not recorded in the experimental outcomes. Cases of this sort are discussed in Roberts (1967), and will be raised again later. These experimental set-ups are, for the most part, not our interest here. The last part of the paper does ultimately appeal to the implications of informative stopping rules, but our case is more subtle and ambiguous than those discussed by Roberts.
Strictly speaking, classical statisticians of the Neyman-Pearson school have no use for p-values. They focus just on whether or not an experimental outcome lies in the rejection region. The p-values are a feature of Fisherian classical statistics. We nonetheless use p-values in this presentation as a means to determine whether an outcome falls in the rejection region of a Neyman-Pearson classical test.
Any ordinal transformation of the distance functions given will do.
The set of possible hypothesis assessments, D, might be something other than {a, r}, but this paper considers only the simple accept/reject scenario. One might object that it is important to include an additional agnostic assessment (indicating more experimentation is needed), but note that the decision to continue experimenting can always be built into the stopping rule.
Note that, for discrete probability distributions (as per our example), and barring the use of tests with some randomization, not all choices of α are possible, but that is not important here.
Note that this is true even for the second classical test defined above. It can be shown to satisfy Neyman and Pearson’s condition: the outcomes whose conditional probabilities are added to get \(v_2\) are almost the complement of those outcomes whose conditional probabilities are added to get \(v_1\), almost in the sense that one outcome (the experimental outcome in question, x) is common to both. Thus if there is some outcome \(x_i\) in R because it makes \(\frac{v_1}{v_2}<k\), then all \(x_j\) that are more ‘distant’ than \(x_i\) relative to \(H_1\) will also be in R, because these will yield an even smaller value for \(\frac{v_1}{v_2}\). Thus Neyman and Pearson’s condition will be satisfied.
The choice is trickier when neither α nor β is fixed (as per the second type of classical test considered) and the available tests are such that one test has minimal type I error while another has minimal type II error. In such cases there may be a number of admissible tests; the classical approach does not offer an explicit method for determining the optimal balance between type I and type II errors.
The expression here assumes that the decision problem remains unchanged, regardless of the evidence that is received. This is another way of saying that the evidence is free, in the sense of Good (1967).
Moreover, it is not clear in what sense it is epistemically beneficial to base inference on the actual stopping rule employed. Sprenger (2009) argues that the classical statistician does not have the conceptual arsenal to make this claim.
I am referring to the Bayesian ‘convergence theorems’. See Earman (1992).
This result is proved by Robbins (1970), as reported in Sober (2008, p. 77). Kadane et al. (1996) prove a similar result, which they contrast with more problematic cases that involve a probability function that does not satisfy countable additivity. Armitage (1962) first posed the optional stopping problem for Bayesian statistics in the latter kind of setting, as reported by Mayo and Kruse (2001).
If the desired test result is r and decision utilities are constant, then trials are run until:
$$ \frac{Pr(x|H_1)}{Pr(x|H_2)}<t \quad \hbox{ where } \quad t =\frac{(1-p) \cdot (u_{r2} - u_{a2})}{p \cdot (u_{a1} - u_{r1})} $$
As per the above, if t ≤ 1, the maximum probability that the experiment will end is less than or equal to t.
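For readers who want the intermediate steps, the threshold t can be recovered from expected-utility maximization. What follows is only a sketch, under assumptions not spelled out in this note: that p is the prior probability of \(H_1\); that \(u_{a1}\) and \(u_{r1}\) are the utilities of accepting and rejecting \(H_1\) when \(H_1\) is true (and \(u_{a2}\), \(u_{r2}\) likewise when \(H_2\) is true); and that \(u_{a1}>u_{r1}\). Writing \(p'\) for the posterior probability of \(H_1\) given x (a symbol introduced here for illustration), rejection has the greater expected utility when:

$$ p' u_{r1} + (1-p') u_{r2} > p' u_{a1} + (1-p') u_{a2} \quad \Longleftrightarrow \quad \frac{p'}{1-p'} < \frac{u_{r2}-u_{a2}}{u_{a1}-u_{r1}} $$

Since the posterior odds are the prior odds multiplied by the likelihood ratio, this becomes:

$$ \frac{p}{1-p}\cdot \frac{Pr(x|H_1)}{Pr(x|H_2)} < \frac{u_{r2}-u_{a2}}{u_{a1}-u_{r1}} \quad \Longleftrightarrow \quad \frac{Pr(x|H_1)}{Pr(x|H_2)} < \frac{(1-p) \cdot (u_{r2} - u_{a2})}{p \cdot (u_{a1} - u_{r1})} = t $$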
A Bayesian optional stopping test is easy to construct: the trials stop, to yield a possible outcome, just as soon as the likelihood ratio for the experimental outcome falls on the desired side of the rejection threshold value, t. Now let us try to construct a classical optional stopping test: it will halt, to yield a possible outcome, just as soon as the p-value (determined with respect to the full outcome space) is less than some value α. The sticking point here is that we need to know the full outcome space in the very process of trying to work out the possible outcomes. It is not clear that it is possible to construct a test with this kind of circularity, but we leave the question open.
The same is true for the second type of classical test mentioned in Sect. 2. Here too, the outcome x = (TTHHH) speaks less for the rejection of the null if it was produced by an optional stopping test. Again, details are in the Appendix.
Just what counts as ‘free information’ is non-trivial (see Kadane et al. 2008). In the proof of Good’s theorem, evidence is ‘free’ if the relevant decision problem, including the acts under consideration and the utilities for the act-outcomes, remains unchanged once the evidence, whatever it turns out to be, is learned (cf. footnote 12).
I owe the idea of appealing to underdescribed experimental results to Cian Dorr. A similar construction is presented in Romeijn (2011) by way of illustrating how Neyman-Pearson tests may be given a Bayesian representation.
The optimal test (i.e. optimal stopping rule) is the test with greatest expected utility at the outset; it must also have greatest expected utility throughout the experiment if it is to be followed. The optimal test can be determined by depicting the experimenter’s decision at each stage as a choice between three options: accept \(H_1\), reject \(H_1\), and continue running trials. As mentioned, the Arrow et al. result depends on each trial having the same cost. It is also assumed that the utilities for the correct decisions are identical, and the utilities for the incorrect decisions are both less than the utilities for the correct decisions. For this particular result, there is also an assumption that the number of trials is limitless.
See Anscombe (1963) for a discussion of the ethical costs, and implications for optimal test design, of medical trials. A consequentialist would also consider the plight of future patients, as Anscombe recommends, but let us suppose that this particular NGO experimenter does not think harms/benefits to a present patient are on a par with harms/benefits to future patients.
Note that Roberts’ (1967) discussion of informative stopping rules assumes that the inference-maker is identical to the experimenter.
A clearer statement along the same lines, however, appears in Berger (1985, p. 511). Berger explicitly notes that learning the stopping rule may be informative because it tells the inference-maker something of the experimenter’s attitudes (in this case their beliefs), which are relevant to the truth of the hypotheses. Interestingly, Berger also contrasts this kind of informative stopping rule (which affects an agent’s priors before the experiment has begun) with the kind we discussed earlier. Berger does not elaborate on how learning an experimenter’s attitudes may produce the optional stopping intuition, however, as we do here.
Many thanks to Deborah Mayo for discussions, and Franz Dietrich, Jan Sprenger, Jan-Willem Romeijn, Barteld Kooi, Olivier Roy, Luc Bovens and Nancy Cartwright for very helpful commentary on previous drafts of this paper.
References
Anscombe, F. J. (1963). Sequential medical trials. Journal of the American Statistical Association, 58(302), 365–383.
Armitage, P. (1962). Contribution to discussion in L. Savage.
Arrow, K. J., Blackwell, D., & Girshick, M. A. (1949). Bayes and minimax solutions of sequential decision problems. Econometrica, 17(3/4), 213–244.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. New York: Springer.
Berger, J. O., & Wolpert, R. L. (1988). The likelihood principle (2nd ed). Hayward, CA: Institute of Mathematical Statistics.
Earman, J. (1992). Bayes or bust?: A critical examination of Bayesian confirmation theory. Cambridge, MA: MIT Press.
Edwards, A. W. F. (1972). Likelihood (1st ed). Cambridge: Cambridge University Press.
Good, I. J. (1967). On the principle of total evidence. British Journal for the Philosophy of Science, 17, 319–321.
Hacking, I. (1965). Logic of statistical inference. London: Cambridge University Press.
Howson, C., & Urbach, P. (1989). Scientific reasoning: The Bayesian approach. La Salle, IL: Open Court.
Jeffreys, H. (1931). Scientific inference. Cambridge: Cambridge University Press.
Kadane, J. B., Schervish, M. J., & Seidenfeld, T. (1996). Reasoning to a foregone conclusion. Journal of the American Statistical Association, 91(435), 1228–1235.
Kadane, J. B., Schervish, M. J., & Seidenfeld, T. (2008). Is ignorance bliss? Journal of Philosophy, 105(1), 5–36.
Kendall, M., & Stuart, A. (1991). Kendall’s advanced theory of statistics (5th ed), Vol. II. London: Edward Arnold.
Mayo, D. G., & Kruse, M. (2001). Principles of inference and their consequences. In D. Corfield & J. Williamson (Eds.), Foundations of Bayesianism. Kluwer.
Robbins, H. (1970). Statistical methods related to the law of the iterated logarithm. Annals of Mathematical Statistics, 41, 1397–1409.
Roberts, H. V. (1967). Informative stopping rules and inferences about population size. Journal of the American Statistical Association, 62(319), 763–775.
Romeijn, J.-W. (2011). Inductive logic and statistics. In D. M. Gabbay, S. Hartmann & J. Woods (Eds.), Handbook of the history of logic. Volume 10: Inductive logic. Elsevier (forthcoming).
Raiffa, H., & Schlaifer, R. (1961). Applied statistical decision theory. Boston: Harvard University, Graduate School of Business Administration, Division of Research.
Savage, L. J. (1962). The foundations of statistical inference: A discussion. London: Methuen.
Sober, E. (2008). Evidence and evolution. Cambridge: Cambridge University Press.
Sprenger, J. (2009). Evidence and experimental design in sequential trials. Philosophy of Science (forthcoming).
Worrall, J. (2007). Evidence in medicine and evidence-based medicine. Philosophy Compass, 2(6), 981–1022.
Appendices
Appendix: Illustrative Example
The following concern the two hypotheses:
For the Bayesian tests in Tables 5 and 6, the prior probabilities and decision utilities are such that \(\frac{p_{2}\cdot (u_{r2}-u_{a2})}{p_{1}\cdot (u_{a1}-u_{r1})}=1\). (This is the critical likelihood-ratio threshold for rejecting the \(H_1\) hypothesis.)
The example optional stopping test is as follows: the test is stopped when \(H_1\) (that the coin is fair) is rejected, or after 7 tosses, whichever happens first. Given the assumptions above, this means that a sample \(x=(t_{1},t_{2},\ldots )\), where \(t_{i}\in \{H,T\}\), continues until the first \(t_n\) such that \(\frac{Pr(x|H_{1})}{Pr(x|H_{2})}<1\). If the experiment has not already stopped before the maximum of m = 7 trials, then it is stopped at this point, regardless of the test result. A summary of the outcomes is given in Table 5.
We see that \(H_1\) is more likely than not to be rejected, whether or not it is true. The probability that \(H_1\) will be rejected when it is true is the type I error, which is 0.727 (sum the ‘reject’ entries in column 2: 0.5 + 0.125 + 2 × 0.03125 + 5 × 0.007813 = 0.727). The probability that \(H_1\) will be rejected when it is false is 1 − type II error, which is 0.855 (sum the ‘reject’ entries in column 3: 0.6 + 0.144 + 2 × 0.03456 + 5 × 0.008294 = 0.855).
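The two error figures above can be checked by enumerating the outcome space of the optional stopping test directly. The sketch below is illustrative only. It assumes that \(H_2\) assigns heads a probability of 0.6, a value not stated in this appendix as extracted but inferred from the column 3 entries (0.6, 0.144, 0.03456); the function names are hypothetical. Exact fractions are used so that comparisons with the threshold are not affected by rounding.

```python
from fractions import Fraction

# Assumption: H1 is the fair coin; the H2 bias of 0.6 is inferred from the
# reported table entries, not stated explicitly in the text.
P_HEADS = {"H1": Fraction(1, 2), "H2": Fraction(3, 5)}
T = Fraction(1)     # rejection threshold t on the likelihood ratio
MAX_TOSSES = 7      # maximum number of trials m


def prob(seq, hyp):
    """Probability of an ordered toss sequence under a simple hypothesis."""
    p = P_HEADS[hyp]
    result = Fraction(1)
    for toss in seq:
        result *= p if toss == "H" else 1 - p
    return result


def rejects(seq):
    """True if the likelihood ratio Pr(seq|H1)/Pr(seq|H2) is below t."""
    return prob(seq, "H1") / prob(seq, "H2") < T


def optional_stopping_outcomes(max_tosses=MAX_TOSSES):
    """Enumerate outcomes: stop at the first rejection or at max_tosses."""
    outcomes, frontier = [], [""]
    while frontier:
        seq = frontier.pop()
        for toss in "HT":
            nxt = seq + toss
            if rejects(nxt) or len(nxt) == max_tosses:
                outcomes.append(nxt)    # the experiment ends here
            else:
                frontier.append(nxt)    # keep tossing
    return outcomes


outcomes = optional_stopping_outcomes()
type_1_error = sum(prob(x, "H1") for x in outcomes if rejects(x))
power = sum(prob(x, "H2") for x in outcomes if rejects(x))
print(round(float(type_1_error), 3), round(float(power), 3))
```

Under these assumptions the enumeration yields the same rejecting outcomes as the sums quoted above: one of length 1, one of length 3, two of length 5 and five of length 7.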
Now compare the above error probabilities with those of the fixed 5-toss test in Table 6. (For instance, it might be the case that the sequence of tosses is x = TTHHH, and we want to entertain the possibility that the test performed was either the optional stopping test or the fixed 5-trial test.)
The type I error probability is substantially less for this fixed 5-trial test than for the optional stopping test (sum the ‘reject’ entries in column 2: 0.03125 × 16 = 0.5). The Bayesian draws the same inference for the outcome x = (TTHHH) in both cases, however.
Note that we get a different classical inference for the outcome x = (TTHHH), depending on whether the outcome space was that associated with the optional stopping test defined above, \(X_O\), or the outcome space associated with tossing the coin 5 times, \(X_F\). For the first case, the relevant p-value, \(v_1\), of the outcome x = (TTHHH), is 0.688 (from column 2 of Table 5: 0.5 + 0.125 + 2 × 0.03125 = 0.688), while for the second, \(v_1\) is 0.5 (from column 2 of Table 6: 16 × 0.03125 = 0.5). Thus, if we are speaking of a typical classical test (where rejection hinges on the choice of α), the experimental outcome ‘speaks more’ for the rejection of \(H_1\) if the outcome space was \(X_F\) rather than \(X_O\), because the p-value is smaller for \(X_F\) than \(X_O\).
We might also consider the other sort of classical test mentioned above: \(testClass2_{X}(x)\). Here too, the outcome x = (TTHHH) will lead to different decision results for our two outcome spaces \(X_O\) and \(X_F\), for certain values of k. For instance, k = 1 might be deemed appropriate to the context. This would mean rejecting \(H_1\) just in case the ratio of p-values, \(\frac{v_1}{v_2}\), is less than 1. This is indeed the case when the outcome space is \(X_F\): \(v_1\) = 0.5 < \(v_2\) = 0.663 (where the \(v_2\) value is calculated from terms in column 3 of Table 6: 10 × 0.03456 + 10 × 0.02304 + 5 × 0.01536 + 0.01024 = 0.663). Thus \(H_1\) would be rejected. If the outcome space is \(X_O\), however, \(v_1\) = 0.688 > \(v_2\) = 0.256 (where the \(v_2\) term is calculated similarly to the above) and so there would not be sufficient evidence to reject \(H_1\).
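The dependence of the two p-values on the outcome space can be reproduced with a short computation. This is a sketch under stated assumptions, not the paper's own procedure: \(H_2\) is again taken to give heads a probability of 0.6 (inferred from the table entries), and ‘distance’ from a hypothesis is taken to be ordered by the likelihood ratio, so that \(v_1\) sums the \(H_1\)-probabilities of outcomes with a likelihood ratio no greater than that of x, and \(v_2\) sums the \(H_2\)-probabilities of outcomes with a likelihood ratio no smaller. That ordering is an assumption chosen because it recovers the figures reported above; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Assumption: H2 bias of 0.6 inferred from the reported table entries.
P_HEADS = {"H1": Fraction(1, 2), "H2": Fraction(3, 5)}


def prob(seq, hyp):
    """Probability of an ordered toss sequence under a simple hypothesis."""
    p = P_HEADS[hyp]
    result = Fraction(1)
    for toss in seq:
        result *= p if toss == "H" else 1 - p
    return result


def lr(seq):
    """Likelihood ratio Pr(seq|H1) / Pr(seq|H2)."""
    return prob(seq, "H1") / prob(seq, "H2")


def optional_space(max_tosses=7, t=Fraction(1)):
    """X_O: stop at the first likelihood ratio below t, or at max_tosses."""
    outcomes, frontier = [], [""]
    while frontier:
        seq = frontier.pop()
        for toss in "HT":
            nxt = seq + toss
            if lr(nxt) < t or len(nxt) == max_tosses:
                outcomes.append(nxt)
            else:
                frontier.append(nxt)
    return outcomes


def fixed_space(n=5):
    """X_F: all ordered sequences of exactly n tosses."""
    return ["".join(s) for s in product("HT", repeat=n)]


def p_values(x, space):
    """(v1, v2) for outcome x, ordering outcomes by likelihood ratio."""
    v1 = sum(prob(y, "H1") for y in space if lr(y) <= lr(x))
    v2 = sum(prob(y, "H2") for y in space if lr(y) >= lr(x))
    return v1, v2


x = "TTHHH"
v1_o, v2_o = p_values(x, optional_space())
v1_f, v2_f = p_values(x, fixed_space())
for name, v1, v2 in [("X_O", v1_o, v2_o), ("X_F", v1_f, v2_f)]:
    print(name, round(float(v1), 3), round(float(v2), 3), "reject" if v1 / v2 < 1 else "accept")
```

The likelihood ratio of x itself is the same in both outcome spaces, which is why the Bayesian inference does not change while the classical p-values do.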
Proof: Errors for Optional Stopping Tests
Claim: Optional stopping tests (2 simple exhaustive hypotheses) with maximum m trials, designed to stop as soon as a ‘reject’ result is achieved (assuming that the rejection threshold ratio t remains constant), have greater or equal type I error than any fixed-trial test of up to m trials.
Consider any fixed-n-trial test (n ≤ m) for which \(H_1\) is rejected if the relevant likelihood ratio is less than t. The rejection set is denoted here R.
To prove the claim above, it suffices to show that all the sequences in R, and perhaps more, are also in the rejection set, denoted \(R^{\prime}\), for the optional-stopping test of maximum m trials that is designed to stop when the relevant likelihood ratio <t. If so, the type I error for the optional stopping test is clearly greater than (when \(R\subset R^{\prime}\)) or equal (when \(R=R^{\prime}\)) to the type I error for the fixed-n-trial test.
To show this, we appeal to a super-outcome space that is the set of all sequences of length m. Call this set M. Any sequence of length n ≤ m, which we’ll denote \(S_{i}^{n}\), is just a subset of M, i.e. the subset of length-m sequences that begin with the relevant length-n sequence. When n = m these sets \(S_{i}^{n}\) contain only one length-m sequence, otherwise they contain more. Note too that \(\bigcup\nolimits_{i}S_{i}^{n}=M\).
We denote by \(\mathcal{S}_{R^{n}}\) the set of length-n sequence sets \(S_{i}^{n}\) that result in rejection. The rejection region for the fixed-n-trial test, R, can be given as \(R=\bigcup\nolimits_{S_{i}^{n}\in \mathcal{S}_{R^{n}}}S_{i}^{n}\). This is to say that R contains all length-m sequences that are elements of a set \(S_{i}^{n}\) that is rejected.
Turn now to the optional stopping test. It will stop if any \(S_i^n\in \mathcal{S}_{R^{n}}\) occurs, and the set of those that may occur, denoted \(\mathcal{S}_{O^{n}}\), will be rejected. The remaining sequences \(\mathcal{S}_{R^{n}}\backslash \mathcal{S}_{O^{n}}\) are those that cannot occur, which implies that the optional stopping test would stop on a rejection result before the first n trials for all length-n sequences in \(\mathcal{S}_{R^{n}}\backslash \mathcal{S}_{O^{n}}\).
Now there are two cases to consider: (1) \(\mathcal{S}_{R^{n}}\backslash \mathcal{S}_{O^{n}}\neq \varnothing \), and (2) \(\mathcal{S}_{R^{n}}\backslash \mathcal{S}_{O^{n}}=\varnothing \).
First case: \(\mathcal{S}_{R^{n}}\backslash \mathcal{S}_{O^{n}}\neq \varnothing\)
All \(S_i^n\) in \(\mathcal{S}_{R^{n}}\backslash \mathcal{S}_{O^{n}}\) must themselves be subsets of smaller-length sequence sets that resulted in rejection. Call the set of these smaller-length sequence sets that resulted in rejection \(\mathcal{S}_{S}\). Then \(\bigcup\nolimits_{S_i^n\in \mathcal{S}_{R^{n}}\backslash \mathcal{S}_{O^{n}}}S_i^n\subset \bigcup\nolimits_{S_{i}\in \mathcal{S}_{S}}S_{i}\).
It follows that \(\bigcup\nolimits_{S_i^n\in \mathcal{S}_{R^{n}}}S_i^n\subset \bigcup\nolimits_{S_{i}\in \mathcal{S}_{O^{n}}\cup \mathcal{S}_{S}}S_{i}\).
The latter term in the above expression is a subset of the full set of length-m sequences in the rejection region for the optional stopping test, which we denote \(R^{\prime}\). (There may additionally be sequences of length greater than n that are also rejected (cf. case below).) The former term in the expression is R, as defined earlier.
So in this first case \(R^{\prime}\supset R\), and thus the type I error for the optional stopping test is strictly greater than the type I error for the fixed-n-trial test.
Second case: \(\mathcal{S}_{R^{n}}\backslash \mathcal{S}_{O^{n}}=\varnothing. \)
It is easy to see here that the type I error for the optional stopping test is at least as great as the type I error for the fixed-n-trial test, since \(\mathcal{S}_{O^{n}}=\mathcal{S}_{R^{n}}\).
Note, however, that \(\bigcup\nolimits_{S_{i}^{n}\in \mathcal{S}_{O^{n}}}S_{i}^{n}\subseteq R^{\prime}\). The reason the left-hand term (which equals R) may be a proper subset of \(R^{\prime}\) is that there may be other sequences that were accepted in the fixed-n-trial test that in fact also result in rejection in the optional stopping test. (These rejected trials will be such that the likelihood ratio for the first j trials is less than t, where n < j ≤ m.)
So in this second case \(R^{\prime}\supseteq R\) , and thus the type I error for the optional stopping test is greater than or equal to the type I error for the fixed-n-trial test.
\(\ldots\)
Corollary: Note also that the type II error for the optional stopping test with maximum m trials must be less than or equal to the type II error for the fixed-n-trial test where n ≤ m. This is because the type II error is the probability, given hypothesis \(H_2\), of \(M\backslash R^{\prime}\) for the optional stopping test, or else \(M\backslash R\) for the fixed-trial test.
Since \(R^{\prime}\supseteq R\) , then \(M\backslash R^{\prime}\subseteq M\backslash R\) , and from this follows the stated corollary.
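The claim and its corollary can be spot-checked numerically for the coin example. The sketch below is a brute-force illustration rather than a substitute for the proof: it builds R and \(R^{\prime}\) as sets of length-m sequences in the super-outcome space M and confirms that \(R\subseteq R^{\prime}\) for every n ≤ m. It assumes m = 7, t = 1, a fair coin under \(H_1\), and a heads probability of 0.6 under \(H_2\) (the last inferred from the table entries, not stated in the text); the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Assumptions: H2 bias of 0.6 (inferred), threshold t = 1, maximum m = 7.
P_HEADS = {"H1": Fraction(1, 2), "H2": Fraction(3, 5)}
T = Fraction(1)
M_TRIALS = 7

# The super-outcome space M: every ordered sequence of length m.
ALL = ["".join(s) for s in product("HT", repeat=M_TRIALS)]


def prob(seq, hyp):
    """Probability of an ordered toss sequence under a simple hypothesis."""
    p = P_HEADS[hyp]
    result = Fraction(1)
    for toss in seq:
        result *= p if toss == "H" else 1 - p
    return result


def lr(seq):
    """Likelihood ratio Pr(seq|H1) / Pr(seq|H2)."""
    return prob(seq, "H1") / prob(seq, "H2")


def fixed_rejection_set(n):
    """R: length-m sequences whose first n tosses have likelihood ratio < t."""
    return {s for s in ALL if lr(s[:n]) < T}


def optional_rejection_set():
    """R': length-m sequences with some prefix (length 1..m) rejected."""
    return {s for s in ALL if any(lr(s[:j]) < T for j in range(1, M_TRIALS + 1))}


def errors(rejection_set):
    """(type I error, type II error) for a rejection region inside M."""
    type_1 = sum(prob(s, "H1") for s in rejection_set)
    type_2 = 1 - sum(prob(s, "H2") for s in rejection_set)
    return type_1, type_2


r_prime = optional_rejection_set()
opt_type_1, opt_type_2 = errors(r_prime)
for n in range(1, M_TRIALS + 1):
    r = fixed_rejection_set(n)
    fix_type_1, fix_type_2 = errors(r)
    # Containment gives both the claim and the corollary.
    print(n, r <= r_prime, opt_type_1 >= fix_type_1, opt_type_2 <= fix_type_2)
```

Treating every test as a region of the same space M is what makes the comparison a simple subset check, mirroring the construction used in the proof.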
\(\ldots\)
Cite this article
Steele, K. Persistent Experimenters, Stopping Rules, and Statistical Inference. Erkenn 78, 937–961 (2013). https://doi.org/10.1007/s10670-012-9388-1
DOI: https://doi.org/10.1007/s10670-012-9388-1