Counterexamples to a likelihood theory of evidence


The likelihood theory of evidence (LTE) says, roughly, that all the information relevant to the bearing of data on hypotheses (or models) is contained in the likelihoods. There exist counterexamples in which one can tell which of two hypotheses is true from the full data, but not from the likelihoods alone. These examples suggest that some forms of scientific reasoning, such as the consilience of inductions (Whewell, 1858), cannot be represented within Bayesian and Likelihoodist philosophies of science.


  1. Terminology varies. In the computer science literature especially, a simple hypothesis is called a model and what I am calling a model is referred to as a model class.

  2. A peculiar thing about the quote from Barnard (above) is that he refers to the likelihood of a simple hypothesis as a probability function. It is not a function except in the very trivial sense of mapping a single hypothesis to a single number.

  3. Akaike (1973), Sakamoto, Ishiguro, and Kitagawa (1986), Forster and Sober (1994) and Burnham and Anderson (2002).

  4. In contrast, the Law of Likelihood (LL) is very specific about how likelihoods are used in the comparison of simple hypotheses. Forster and Sober (2004) argue that AIC is a counterexample to LL. Unfortunately, Forster and Sober (2004) mistakenly describe LL as the likelihood principle (LP), an error pointed out by Boik (2004) in the same volume. For the record, Forster and Sober (2004) did not intend to say anything about LP; the present paper is the first publication in which I have discussed it.

  5. See Forster (2000) for a description of the best known model selection criteria, and for an argument that the Akaike framework is the conceptually clearest framework for understanding the problem of model selection because it clearly distinguishes criteria from goals.

  6. The term ‘predictive accuracy’ was coined by Forster and Sober (1994), where it is given a precise definition in terms of SOS and likelihood fit functions.

  7. I owe this suggestion to Jason Grossman.

  8. The problem is the same one discussed in Forster (1988b).

  9. While the refutation is not refutation in the strict logical sense, the number of data in the example can be increased to whatever number you like, so it becomes arbitrarily close to that ideal.

  10. Fitelson (1999) shows that choice of the difference measure does matter in some applications. But that issue does not arise here.

  11. Causal modeling of this kind has received a great deal of attention in recent years. See Pearl (2000) for a comprehensive survey of recent results, as well as Woodward (2003) for an introduction that is more accessible to philosophers.

  12. The word ‘constraint’ is borrowed from Sneed (1971), who introduced it as a way of constraining submodels. Although the sense of ‘model’ assumed here is different from Sneed’s, the idea is the same.

  13. Myrvold and Harper (2002) criticize the Akaike criterion of model selection (Forster & Sober, 1994) because it underrates the importance of the agreement of independent measurements in Newton’s argument for universal gravitation (see Harper, 2002 for an intriguing discussion of Newton’s argument). While this paper supports their conclusion, it does so in a more precise and general way. The important advances in this paper are (1) to point out that the limitation applies to all model selection criteria based on the Likelihood Principle and (2) to pinpoint exactly where the limitation lies. It is not my conclusion, however, that statistics lacks the resources to address the problem.

  14. Wasserman (2000) provides a nice survey.

  15. Hooker (1987) and Norton (1993, 2000) discuss relevant issues and examples; in fact, there is a wealth of good literature in the philosophy and history of science that deserves serious attention from outsiders.


  • Aitkin, M. (1991). Posterior Bayes factors. Journal of the Royal Statistical Society B, 53, 111–142.

  • Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov, & F. Csaki (Eds.), 2nd International symposium on information theory (pp. 267–281). Budapest: Akademiai Kiado.

  • Barnard, G. A. (1947). Review of Wald’s ‘Sequential analysis’. Journal of the American Statistical Association, 42, 658–669.

  • Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York: Springer-Verlag.

  • Berger, J. O., & Wolpert, R. L. (1988). The likelihood principle (2nd ed.). Hayward, California: Institute of Mathematical Statistics.

  • Birnbaum, A. (1962). On the foundations of statistical inference (with discussion). Journal of the American Statistical Association, 57, 269–326.

  • Boik, R. J. (2004). Commentary. In M. Taper, & S. Lele (Eds.), The nature of scientific evidence (pp. 167–180). Chicago and London: University of Chicago Press.

  • Burnham, K. P., & Anderson, D. R. (2002). Model selection and multi-model inference. New York: Springer-Verlag.

  • Earman, J. (1978). Fairy tales vs. an ongoing story: Ramsey’s neglected argument for scientific realism. Philosophical Studies, 33, 195–202.

  • Edwards, A. W. F. (1987). Likelihood (Expanded edition). Baltimore and London: The Johns Hopkins University Press.

  • Fitelson, B. (1999). The plurality of Bayesian measures of confirmation and the problem of measure sensitivity. Philosophy of Science, 66, S362–S378.

  • Forster, M. R. (1984). Probabilistic causality and the foundations of modern science. Ph.D. Thesis, University of Western Ontario.

  • Forster, M. R. (1986). Unification and scientific realism revisited. In A. Fine, & P. Machamer (Eds.), PSA 1986 (Vol. 1, pp. 394–405). E. Lansing, Michigan: Philosophy of Science Association.

  • Forster, M. R. (1988a). Unification, explanation, and the composition of causes in Newtonian mechanics. Studies in the History and Philosophy of Science, 19, 55–101.

  • Forster, M. R. (1988b). Sober’s principle of common cause and the problem of incomplete hypotheses. Philosophy of Science, 55, 538–559.

  • Forster, M. R. (2000). Key concepts in model selection: Performance and generalizability. Journal of Mathematical Psychology, 44, 205–231.

  • Forster, M. R. (forthcoming). The miraculous consilience of quantum mechanics. In E. Eells, & J. Fetzer (Eds.), Probability in science. Open Court.

  • Forster, M. R., & Sober, E. (1994). How to tell when simpler, more unified, or less ad hoc theories will provide more accurate predictions. British Journal for the Philosophy of Science, 45, 1–35.

  • Forster, M. R., & Sober, E. (2004). Why likelihood? In M. Taper, & S. Lele (Eds.), The nature of scientific evidence (pp. 153–165). Chicago and London: University of Chicago Press.

  • Friedman, M. (1981). Theoretical explanation. In R. A. Healey (Ed.), Time, reduction and reality (pp. 1–16). Cambridge: Cambridge University Press.

  • Glymour, C. (1980). Explanations, tests, unity and necessity. Noûs, 14, 31–50.

  • Hacking, I. (1965). Logic of statistical inference. Cambridge: Cambridge University Press.

  • Harper, W. L. (2002). Howard Stein on Isaac Newton: Beyond hypotheses. In D. B. Malament (Ed.), Reading natural philosophy: Essays in the history and philosophy of science and mathematics (pp. 71–112). Chicago and La Salle, Illinois: Open Court.

  • Hooker, C. A. (1987). A realistic theory of science. Albany: State University of New York Press.

  • Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford: The Clarendon Press.

  • Mayo, D. G. (1996). Error and the growth of experimental knowledge. Chicago and London: The University of Chicago Press.

  • Myrvold, W., & Harper, W. L. (2002). Model selection, simplicity, and scientific inference. Philosophy of Science, 69, S135–S149.

  • Norton, J. D. (1993). The determination of theory by evidence: The case for quantum discontinuity, 1900–1915. Synthese, 97, 1–31.

  • Norton, J. D. (2000). How we know about electrons. In R. Nola, & H. Sankey (Eds.), After Popper, Kuhn and Feyerabend (pp. 67–97). Kluwer Academic Press.

  • Pearl, J. (2000). Causality: Models, reasoning, and inference. Cambridge: Cambridge University Press.

  • Royall, R. M. (1991). Ethics and statistics in randomized clinical trials (with discussion). Statistical Science, 6, 52–88.

  • Royall, R. M. (1997). Statistical evidence: A likelihood paradigm. Boca Raton: Chapman & Hall/CRC.

  • Savage, L. J. (1976). On rereading R. A. Fisher (with discussion). Annals of Statistics, 4, 441–500.

  • Sakamoto, Y., Ishiguro, M., & Kitagawa, G. (1986). Akaike information criterion statistics. Dordrecht: Kluwer Academic Publishers.

  • Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–465.

  • Sneed, J. D. (1971). The logical structure of mathematical physics. Dordrecht: D. Reidel.

  • Sober, E. (1993). Epistemology for empiricists. In H. Wettstein (Ed.), Midwest studies in philosophy (pp. 39–61). Notre Dame: University of Notre Dame Press.

  • Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44, 92–107.

  • Whewell, W. (1858). Novum organon renovatum. Reprinted as Part II of the 3rd ed. of The philosophy of the inductive sciences. London: Cass, 1967.

  • Whewell, W. (1989). In R. E. Butts (Ed.), Theory of scientific method. Indianapolis/Cambridge: Hackett Publishing Company.

  • Woodward, J. (2003). Making things happen: A theory of causal explanation. Oxford and New York: Oxford University Press.

Author information


Corresponding author

Correspondence to Malcolm R. Forster.

Additional information

Thanks go to all those who responded well to the first version of this paper presented at the University of Pittsburgh Center for Philosophy of Science on January 31, 2006, and especially to Clark Glymour. A revised version was presented at Carnegie-Mellon University on April 6, 2006. I also wish to thank Jason Grossman, John Norton, Teddy Seidenfeld, Elliott Sober, Peter Vranas, and three anonymous referees for valuable feedback on different parts of the manuscript.

This paper is part of the ongoing development of a half-baked idea about cross-situational invariance in causal modeling introduced in Forster (1984). I appreciated the encouragement at that time from Jeff Bub, Bill Demopoulos, Michael Friedman, Bill Harper, Cliff Hooker, John Nicholas, and Jim Woodward. Cliff Hooker discussed the idea in his (1987), and Jim Woodward suggested a connection with statistics, which has taken me 20 years to figure out.




If the maximum likelihood hypothesis in F is \(Y=\frac{10}{\sqrt{101}}X+U\) and the observed variance of X is 101, then the observed variance of Y is also 101. Thus, the maximum likelihood hypothesis in B is \(X=\frac{10}{\sqrt{101}}Y+Z,\) and they have the same likelihood. Moreover, for any α, β, and σ, there exist values of a, b, and s such that Y = α + β X + σ U and X = a + bY + sZ have the same likelihood.

Partial Proof

The observed X variance of data distributed in two Gaussian clusters with unit variance centered at X = −10 and X = +10, where the observed means of X and Y are 0, is equal to

$$ \hbox{Var}X=\frac{1}{2}\frac{1}{N/2}\sum{x_i^2}+\frac{1}{2}\frac{1}{N/2}\sum{x_j^2}, $$

where \(x_i\) denotes X values in the lower cluster and \(x_j\) denotes X values in the upper cluster. If all the \(x_i\) were equal to −10, and all the \(x_j\) were equal to +10, then Var X would be equal to 100. To that, one must add the effect of the local variances. More exactly,

$$ \hbox{Var}X=\frac{1}{2}\frac{1}{N/2}\sum{((x_i+10)-10)^2}+\frac{1}{2}\frac{1} {N/2}\sum{((x_j-10)+10)^2}=101. $$
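
The step to 101 can be spelled out as follows. This is a sketch under the idealization the proof appears to rely on, namely that the observed mean of each cluster sits exactly at its center and that the observed variance about that center is exactly 1. For the lower cluster,

$$ \frac{1}{N/2}\sum{((x_i+10)-10)^2}=\frac{1}{N/2}\sum{(x_i+10)^2}-\frac{20}{N/2}\sum{(x_i+10)}+100=1-0+100=101, $$

and the upper cluster gives the same value by symmetry, so that \(\hbox{Var}X=\frac{1}{2}(101)+\frac{1}{2}(101)=101.\)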

From the equation \(Y=\frac{10}{\sqrt{101}}X+U,\) it follows that \(\hbox{Var}Y=\frac{100}{101}101+1=101.\) Standard formulae for regression curves now prove that \(X=\frac{10}{\sqrt{101}}Y\) is the backwards regression line, where the observed residual variance is also equal to 1. Therefore, the two hypotheses have the same conditional likelihoods, and the same total likelihoods. It follows that the hypotheses \(Y=\frac{10}{\sqrt{101}} X+\sigma U\) and \(X=\frac{10}{\sqrt{101}}Y+\sigma Z\) have the same likelihoods for any value of σ. It is also clear that for any α, β, and σ, there exist values of a, b, and s such that Y = α + β X + σ U and X = a + bY + sZ have the same likelihoods.
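
The equalities in the proof can be checked numerically. The following Python sketch is an illustration only, not the author's procedure. It builds a data set with two unit-variance clusters at X = −10 and X = +10, sets Y = (10/√101)X + U, and fits the forward and backward regression lines by maximum likelihood. The function names (`rescale`, `moments`, `ml_fit`, `make_data`) are hypothetical. As a deliberate assumption, the sample is rescaled so that its observed moments match the idealized ones exactly (cluster means ±10, unit cluster variances, U uncorrelated with X). The fit uses the standard formulae: slope = Cov/Var(predictor) and residual variance = Var(response) − Cov²/Var(predictor).

```python
import math
import random


def rescale(values, mean, sd):
    """Shift and scale values so their sample mean and sd (1/n convention) are exact."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return [mean + sd * (v - m) / s for v in values]


def moments(xs, ys):
    """Return Var(xs), Var(ys), Cov(xs, ys) using the 1/n convention."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return vx, vy, cxy


def ml_fit(pred, resp):
    """Maximum likelihood line resp = a + b*pred + s*noise with Gaussian noise.

    Returns (slope b, residual variance s^2, maximized conditional log-likelihood).
    """
    n = len(pred)
    vp, vr, c = moments(pred, resp)
    slope = c / vp
    resid_var = vr - c * c / vp
    loglik = -0.5 * n * (math.log(2 * math.pi * resid_var) + 1.0)
    return slope, resid_var, loglik


def make_data(n_per_cluster=500, seed=0):
    """Two unit-variance clusters at X = -10 and X = +10, with Y = (10/sqrt(101)) X + U."""
    rng = random.Random(seed)
    lower = [rng.gauss(0.0, 1.0) for _ in range(n_per_cluster)]
    upper = [rng.gauss(0.0, 1.0) for _ in range(n_per_cluster)]
    xs = rescale(lower, -10.0, 1.0) + rescale(upper, 10.0, 1.0)
    # Assumption: force U to be exactly uncorrelated with X, with mean 0 and
    # variance 1, so the observed moments match the idealized ones in the proof.
    raw = [rng.gauss(0.0, 1.0) for _ in xs]
    vx, _, cxu = moments(xs, raw)
    mean_x = sum(xs) / len(xs)
    resid = [u - (cxu / vx) * (x - mean_x) for x, u in zip(xs, raw)]
    us = rescale(resid, 0.0, 1.0)
    beta = 10.0 / math.sqrt(101.0)
    ys = [beta * x + u for x, u in zip(xs, us)]
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_data()
    var_x, var_y, _ = moments(xs, ys)
    print("Var X, Var Y:", var_x, var_y)
    print("forward  (Y on X):", ml_fit(xs, ys))
    print("backward (X on Y):", ml_fit(ys, xs))
```

Up to floating-point rounding, both observed variances come out at 101, both slopes at 10/√101, both residual variances at 1, and the two maximized conditional log-likelihoods coincide. With an ordinary random sample, without the rescaling step, these equalities would hold only approximately.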

About this article

Cite this article

Forster, M.R. Counterexamples to a likelihood theory of evidence. Minds & Machines 16, 319–338 (2006).

  • The likelihood principle
  • The law of likelihood
  • Evidence
  • Bayesianism
  • Likelihoodism
  • Curve fitting
  • Regression
  • Asymmetry of cause and effect