
The principle of maximum entropy and a problem in probability kinematics


Abstract

Sometimes we receive evidence in a form that standard conditioning (or Jeffrey conditioning) cannot accommodate. The principle of maximum entropy (maxent) provides a unique solution for the posterior probability distribution, based on the intuition that the information gain consistent with assumptions and evidence should be minimal. Opponents of objective methods for determining these probabilities prominently cite van Fraassen’s Judy Benjamin case to undermine the generality of maxent. This article shows that an intuitive approach to Judy Benjamin’s case supports maxent. This is surprising because, on the basis of independence assumptions, the anticipated result is that it would support maxent’s opponents. The article also demonstrates that these opponents apply independence assumptions to the problem improperly.


References

  • Bovens, L. (2010). Judy Benjamin is a sleeping beauty. Analysis, 70(1), 23–26.

  • Bradley, R. (2005). Radical probabilism and Bayesian conditioning. Philosophy of Science, 72(2), 342–364.

  • Caticha, A. (2012). Entropic inference and the foundations of physics. Sao Paulo: Brazilian Chapter of the International Society for Bayesian Analysis.

  • Caticha, A., & Giffin, A. (2006). Updating probabilities. In MaxEnt 2006, the 26th international workshop on Bayesian inference and maximum entropy methods.

  • Cover, T., & Thomas, J. (2006). Elements of information theory (Vol. 6). Hoboken, NJ: Wiley.

  • Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2, 299–318.

  • Diaconis, P., & Zabell, S. (1982). Updating subjective probability. Journal of the American Statistical Association, 77, 822–830.

  • Douven, I., & Romeijn, J. (2009). A new resolution of the Judy Benjamin problem. CPNSS Working Paper, 5(7), 1–22.

  • Gardner, M. (1959). Mathematical games. Scientific American, 201(4), 174–182.

  • Grove, A., & Halpern, J. (1997). Probability update: Conditioning vs. cross-entropy. In Proceedings of the thirteenth conference on uncertainty in artificial intelligence, Providence, RI.

  • Grünwald, P. (2000). Maximum entropy and the glasses you are looking through. In Proceedings of the sixteenth conference on uncertainty in artificial intelligence (pp. 238–246). Burlington: Morgan Kaufmann Publishers.

  • Grünwald, P., & Halpern, J. (2003). Updating probabilities. Journal of Artificial Intelligence Research, 19, 243–278.

  • Halpern, J. Y. (2003). Reasoning about uncertainty. Cambridge, MA: MIT Press.

  • Hobson, A. (1971). Concepts in statistical mechanics. New York: Gordon and Breach.

  • Howson, C., & Franklin, A. (1994). Bayesian conditionalization and probability kinematics. The British Journal for the Philosophy of Science, 45(2), 451–466.

  • Jaynes, E. (1989). Papers on probability, statistics and statistical physics. Dordrecht: Springer.

  • Jaynes, E., & Bretthorst, G. (1998). Probability theory: The logic of science. Cambridge, UK: Cambridge University Press.

  • Jeffrey, R. (1965). The logic of decision. New York: McGraw-Hill.

  • Shore, J., & Johnson, R. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory, 26(1), 26–37.

  • Topsøe, F. (1979). Information-theoretical optimization techniques. Kybernetika, 15(1), 8–27.

  • Uffink, J. (1996). The constraint rule of the maximum entropy principle. Studies in History and Philosophy of Science Part B: Studies in History and Philosophy of Modern Physics, 27(1), 47–79.

  • van Fraassen, B. (1981). A problem for relative information minimizers in probability kinematics. The British Journal for the Philosophy of Science, 32(4), 375–379.

  • van Fraassen, B. (1986). A problem for relative information minimizers, continued. The British Journal for the Philosophy of Science, 37(4), 453–463.

  • Walley, P. (1991). Statistical reasoning with imprecise probabilities. London: Chapman and Hall.


Author information

Correspondence to Stefan Lukits.

Appendices

Appendix A: Jaynes’ constraint rule

This appendix provides a concise but comprehensive summary of Jaynes’ constraint rule, which is not easily found in the literature (although the constraint rule is a straightforward application of Lagrange multipliers). Jaynes applies it to the Brandeis Dice Problem (see Jaynes 1989, p. 243), but does not give a mathematical justification.

Let \(f\) be a probability distribution on a finite space \(x_{1},\ldots ,x_{m}\) that fulfills the constraint

$$\begin{aligned} \sum _{i=1}^{m}r(x_{i})f(x_{i})=\alpha . \end{aligned}$$
(2)

An affine constraint can always be expressed by assigning a value to the expectation of a probability distribution (see Hobson 1971). In Judy Benjamin’s case, for example, let \(r(x_{1})=0, r(x_{2})=1-{\vartheta }, r(x_{3})=-{\vartheta }\hbox { and }\alpha =0\). Because \(f\) is a probability distribution it fulfills

$$\begin{aligned} \sum _{i=1}^{m}f(x_{i})=1. \end{aligned}$$
(3)

We want to maximize Shannon’s entropy, given the constraints (2) and (3),

$$\begin{aligned} -\sum _{i=1}^{m}f(x_{i})\ln {}f(x_{i}). \end{aligned}$$
(4)

We use Lagrange multipliers to define the functional

$$\begin{aligned} J(f)=-\sum _{i=1}^{m}f(x_{i})\ln {}f(x_{i})+\lambda _{0}\sum _{i=1}^{m}f(x_{i})+\lambda _{1}\sum _{i=1}^{m}r(x_{i})f(x_{i}) \end{aligned}$$

and differentiate it with respect to \(f(x_{i})\)

$$\begin{aligned} \frac{\partial {}J}{\partial {}f(x_{i})}=-\ln (f(x_{i}))-1+\lambda _{0}+\lambda _{1}r(x_{i}). \end{aligned}$$
(5)

Setting (5) equal to \(0\) and solving for \(f(x_{i})\) gives the necessary condition for maximizing (4):

$$\begin{aligned} g(x_{i})=e^{\lambda _{0}-1+\lambda _{1}r(x_{i})}. \end{aligned}$$

This is the Gibbs distribution. We still need to do two things: (a) show that the entropy of \(g\) is maximal, and (b) show how to find \(\lambda _{0}\) and \(\lambda _{1}\). (a) is proved as Theorem 12.1.1 in Cover and Thomas (2006) and need not be repeated here.

For (b), let

$$\begin{aligned} \lambda _{1}&= -\beta \\ Z(\beta )&= \sum _{i=1}^{m}e^{-\beta {}r(x_{i})}\\ \lambda _{0}&= 1-\ln (Z(\beta )). \end{aligned}$$

To find \(\lambda _{0}\) and \(\lambda _{1}\) we impose constraint (2) in the form

$$\begin{aligned} -\frac{\partial }{\partial {}\beta }\ln (Z(\beta ))=\alpha . \end{aligned}$$

To see how this constraint yields \(\lambda _{0}\) and \(\lambda _{1}\), Jaynes’ solution of the Brandeis Dice Problem (see Jaynes 1989, p. 243) is a helpful example. Here, however, we are interested in a general proof that this choice of \(\lambda _{0}\) and \(\lambda _{1}\) produces the probability distribution maximizing the entropy. That \(g\) so defined maximizes the entropy is shown in (a); we still need to make sure that with this choice of \(\lambda _{0}\) and \(\lambda _{1}\) the constraints (2) and (3) are also fulfilled (a standard result in the application of Lagrange multipliers).
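
Before carrying out the general verification, it may help to see the constraint rule at work numerically. The following minimal sketch (an illustration only, not part of the original derivation; the function name, the bisection bracket \([-50,50]\), and the tolerance are arbitrary choices) solves \(-\partial /\partial {}\beta \ln (Z(\beta ))=\alpha \) by bisection on a finite space and returns the resulting Gibbs distribution:

    import math

    def maxent_distribution(r, alpha, lo=-50.0, hi=50.0, tol=1e-12):
        # Solve -d/dbeta ln Z(beta) = alpha by bisection; return beta and the Gibbs distribution.
        def mean_r(beta):
            # -d/dbeta ln Z(beta) equals the expectation of r under the Gibbs distribution
            w = [math.exp(-beta * ri) for ri in r]
            return sum(ri * wi for ri, wi in zip(r, w)) / sum(w)
        # mean_r is strictly decreasing in beta, so bisection applies
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if mean_r(mid) > alpha:
                lo = mid
            else:
                hi = mid
        beta = 0.5 * (lo + hi)
        w = [math.exp(-beta * ri) for ri in r]
        z = sum(w)
        return beta, [wi / z for wi in w]

    # Judy Benjamin's constraint from above: r = (0, 1 - theta, -theta), alpha = 0
    theta = 0.75
    beta, g = maxent_distribution([0.0, 1 - theta, -theta], 0.0)
    # beta = -lambda_1 should come out as ln(1 - theta) - ln(theta) = -ln 3 here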

First, we show

$$\begin{aligned} \sum _{i=1}^{m}g(x_{i})&= \sum _{i=1}^{m}e^{\lambda _{0}-1+\lambda _{1}r(x_{i})}=e^{\lambda _{0}-1}\sum _{i=1}^{m}e^{\lambda _{1}r(x_{i})}\\&= e^{-\ln (Z(\beta ))}Z(\beta )=1. \end{aligned}$$

Then, we show, by differentiating \(\ln (Z(\beta ))\) using the substitution \(x=e^{-\beta }\)

$$\begin{aligned} \alpha&= -\frac{\partial }{\partial {}\beta }\ln (Z(\beta ))=-\frac{1}{\sum _{i=1}^{m}x^{r(x_{i})}}\left( \,\sum _{i=1}^{m}r(x_{i})x^{r(x_{i})-1}\right) (-x)\\&= \frac{\sum _{i=1}^{m}r(x_{i})x^{r(x_{i})}}{\sum _{i=1}^{m}x^{r(x_{i})}}. \end{aligned}$$

And, finally,

$$\begin{aligned} \sum _{i=1}^{m}r(x_{i})g(x_{i})&= \sum _{i=1}^{m}r(x_{i})e^{\lambda _{0}-1+\lambda _{1}r(x_{i})}=e^{\lambda _{0}-1}\sum _{i=1}^{m}r(x_{i})e^{\lambda _{1}r(x_{i})}\\&= e^{\lambda _{0}-1}\sum _{i=1}^{m}r(x_{i})x^{r(x_{i})}=\alpha {}e^{\lambda _{0}-1}\sum _{i=1}^{m}x^{r(x_{i})}=\alpha {}e^{\lambda _{0}-1}\sum _{i=1}^{m}e^{-\beta {}r(x_{i})}\\&= \alpha {}Z(\beta )e^{\lambda _{0}-1}=\alpha {}Z(\beta )e^{-\ln (Z(\beta ))}=\alpha . \end{aligned}$$

Filling in the variables from Judy Benjamin’s scenario gives us result (1). The lambdas are:

$$\begin{aligned} \lambda _{0}=1-\ln \left( \sum _{i=1}^{m}e^{\lambda _{1}r(x_{i})}\right) \quad \lambda _{1}=\ln {}{\vartheta }-\ln (1-{\vartheta }). \end{aligned}$$

We combine the normalized odds vector \((0.16,0.48,0.36)\) that follows from these lambdas with (MAP), using Dempster’s rule of combination, and obtain result (1).
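
The normalized odds vector itself is easy to reproduce; a minimal sketch (for illustration; it assumes the three cells are listed in the order \((A_{1},A_{2},A_{3})\), so that the \(r\)-values above appear in the permuted order \((-{\vartheta },1-{\vartheta },0)\)):

    import math

    theta = 0.75                                   # Judy Benjamin's case
    lam1 = math.log(theta) - math.log(1 - theta)   # lambda_1 as derived above
    r = (-theta, 1 - theta, 0.0)                   # r-values in the order (A1, A2, A3)
    weights = [math.exp(lam1 * ri) for ri in r]
    total = sum(weights)
    print([round(w / total, 2) for w in weights])  # [0.16, 0.48, 0.36]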

Appendix B: The powerset approach formalized

Let us assume a partition \(\{B_{i}\}_{i=1,{\ldots },4n}\) of \(A_{1}\,\cup \,A_{2}\,\cup \,A_{3}\) into sets that are of equal measure \(\mu \) and whose intersection with \(A_{i}\) is either the empty set or the whole set itself (this is the division into rectangles of scenario III). (MAP) dictates that the number of sets covering \(A_{3}\) equals the number of sets covering \(A_{1}\cup {}A_{2}\). For convenience, we assume the number of sets covering \(A_{1}\) to be \(n\). Let \(\mathcal C \), a subset of the powerset of \(\{B_{i}\}_{i=1,{\ldots },4n}\), be the collection of sets which agree with the constraint imposed by (HDQ), i.e.

$$\begin{aligned} C\in \mathcal C \quad \hbox {iff }C=\{C_{j}\}\quad \hbox {and}\quad t\mu \left( \bigcup {}C_{j}\cap {}A_{1}\right) =\mu \left( \bigcup {}C_{j}\cap {}A_{2}\right) . \end{aligned}$$

In Figs. 5 and 6 there are diagrams of two elements of the powerset of \(\{B_{i}\}_{i=1,{\ldots },4n}\). One of them (Fig. 5) is not a member of \(\mathcal C \), the other one (Fig. 6) is.
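
As a sanity check on this formalization, \(\mathcal C \) can be enumerated by brute force for the smallest admissible partition. The following sketch (an illustration only, with \(t=3\) and \(s=1\), hence \(n=3\) and \(4n=12\) blocks) computes the average number of blocks a random member of \(\mathcal C \) takes from \(A_{1}\cup {}A_{2}\) and from \(A_{3}\); these averages correspond to the expectations \(EX_{12}\) and \(EX_{3}\) introduced below:

    from itertools import product

    t, n = 3, 3                        # smallest admissible case: s = 1, 4n = 12 equal blocks
    members = []
    for picks in product((0, 1), repeat=4 * n):
        a1 = sum(picks[0:n])           # blocks of C lying in A1
        a2 = sum(picks[n:2 * n])       # blocks of C lying in A2
        a3 = sum(picks[2 * n:])        # blocks of C lying in A3
        if t * a1 == a2:               # the (HDQ) constraint on membership in the collection
            members.append((a1 + a2, a3))

    avg12 = sum(c12 for c12, _ in members) / len(members)
    avg3 = sum(c3 for _, c3 in members) / len(members)
    print(avg12, avg3)                 # both equal n = 3 in this tiny example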

Simple combinatorics, via the binomial distribution, determines the expectation \(EX\) of the random variable \(X\) defined below. We require, again for convenience, that \(n\) be divisible by \(t\) and call \(s=n/t\) the ‘grain’ of the partition. Remember that \(t\) is the factor by which (HDQ) indicates that Judy’s chance of being in \(A_{2}\) is greater than her chance of being in \(A_{1}\). In Judy’s particular case, \(t=3\) and \({\vartheta }=0.75\). We introduce a few variables that will serve as abbreviations later on:

$$\begin{aligned} n=ts \quad 2m=n \quad 2j=n-1 \quad {T}=t^{2}+1. \end{aligned}$$

\(EX\), of course, depends both on the grain of the partition and on the value of \(t\). It makes sense to make it independent of the grain by letting the grain become increasingly fine and determining \(EX\) as \(s\rightarrow \infty \). This cannot be done with the binomial distribution directly, as the sums involved become computationally unwieldy for large numbers (even with a powerful computer things get dicey around \(s=10\)). But, just as famously, the normal distribution provides a good approximation of the binomial distribution and will help us arrive at a formula for \(G_{\mathrm{pws}}\) (corresponding to \(G_{\mathrm{ind}}\) and \(G_{\mathrm{max}}\)), which determines the value \(q_{3}\), dependent on \({\vartheta }\), as suggested by the powerset approach.

First, we express the random variable \(X\) by the two independent random variables \(X_{12}\) and \(X_{3}\). \(X_{12}\) is the number of partition elements in the randomly chosen \(C\) which are either in \(A_{1}\) or in \(A_{2}\) (the random variable of the number of partition elements in \(A_{1}\) and the random variable of the number of partition elements in \(A_{2}\) are decidedly not independent, because they need to obey (HDQ)); \(X_{3}\) is the number of partition elements in the randomly chosen \(C\) which are in \(A_{3}\). A relatively simple calculation shows that \(EX_{3}=n\), which is just what we would expect (either the powerset approach or the uniformity approach would give us this result):

$$\begin{aligned} EX_{3}=2^{-2n}\sum _{i=0}^{2n}i\left( {\begin{array}{c}2n\\ i\end{array}}\right) =n\left( \hbox {use }\left( {\begin{array}{c}n\\ k\end{array}}\right) =\frac{n}{k}\left( {\begin{array}{c}n-1\\ k-1\end{array}}\right) \right) . \end{aligned}$$
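
The identity behind this calculation is easily confirmed numerically; a small check (illustration only):

    from math import comb

    # E[Binomial(2n, 1/2)] = n, i.e. sum_i i * C(2n, i) = n * 2^(2n)
    for n in (5, 10, 20):
        assert sum(i * comb(2 * n, i) for i in range(2 * n + 1)) == n * 2 ** (2 * n)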

The expectation of \(X\), \(X\) being the random variable expressing the ratio of the number of sets covering \(A_{3}\) to the number of sets covering \(A_{1}\cup {}A_{2}\cup {}A_{3}\), is

$$\begin{aligned} EX=\frac{EX_{3}}{EX_{12}+EX_{3}}=\frac{n}{EX_{12}+n}. \end{aligned}$$

If we were able to use uniformity and independence, \(EX_{12}=n\) and \(EX=1/2\), just as Grove and Halpern suggest (although their uniformity approach is admittedly less crude than the one used here). Will the powerset approach concur with the uniformity approach, will it support the principle of maximum entropy, or will it make another suggestion on how to update the prior probabilities? To answer this question, we must find out what \(EX_{12}\) is, for a given value \(t\) and \(s\rightarrow \infty \), using the binomial distribution and its approximation by the normal distribution.

Using combinatorics,

$$\begin{aligned} EX_{12}=(t+1)\frac{\sum _{i=1}^{s}i\left( {\begin{array}{c}ts\\ i\end{array}}\right) \left( {\begin{array}{c}ts\\ ti\end{array}}\right) }{\sum _{i=0}^{s}\left( {\begin{array}{c}ts\\ i\end{array}}\right) \left( {\begin{array}{c}ts\\ ti\end{array}}\right) }. \end{aligned}$$

Let us call the numerator of this fraction NUM and the denominator DEN. According to the de Moivre–Laplace Theorem,

$$\begin{aligned} \hbox {DEN}=\sum _{i=0}^{s}\left( {\begin{array}{c}ts\\ i\end{array}}\right) \left( {\begin{array}{c}ts\\ ti\end{array}}\right) \approx {}2^{2n}\sum _{i=0}^{s}\int \limits _{i-\frac{1}{2}}^{i+\frac{1}{2}}\mathcal N \left( \frac{n}{2},\frac{n}{4}\right) (x)\,\mathcal N \left( \frac{n}{2},\frac{n}{4}\right) (tx)\,dx, \end{aligned}$$

where

$$\begin{aligned} \mathcal{N }(\mu ,\sigma ^{2})(x)=\frac{1}{\sqrt{2\pi \sigma ^{2}}}\exp \left( -\frac{(x-\mu )^{2}}{2\sigma ^{2}}\right) . \end{aligned}$$
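
As an aside, the quality of the de Moivre–Laplace approximation used here is easy to gauge numerically; a small sketch (illustration only, with arbitrarily chosen \(n\) and \(k\)) compares the symmetric binomial probabilities with the corresponding normal density:

    from math import comb, exp, pi, sqrt

    def normal_pdf(x, mu, var):
        return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

    # C(n, k) / 2^n versus N(n/2, n/4)(k) for a symmetric binomial
    n = 200
    for k in (80, 100, 120):
        print(k, comb(n, k) / 2 ** n, normal_pdf(k, n / 2, n / 4))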

Substitution yields

$$\begin{aligned} \hbox {DEN}\approx {}2^{2n}\frac{1}{\pi {}m}\sum _{i=0}^{s}\int \limits _{i-\frac{1}{2}}^{i+\frac{1}{2}}\exp \left( -\frac{\left( x-m\right) ^{2}}{m}-\frac{t^{2}\left( x-\frac{m}{t}\right) ^{2}}{m}\right) dx. \end{aligned}$$

Consider briefly the argument of the exponential function:

$$\begin{aligned} -\frac{\left( x-m\right) ^{2}}{m}-\frac{t^{2}\left( x-\frac{m}{t}\right) ^{2}}{m}=-\frac{t^{2}}{m}({a''}x^{2}+{b''}x+{c''})=-\frac{t^{2}}{m}\left( {a''}(x-{h''})^{2}+{k''}\right) \end{aligned}$$

with (the double-primed quantities correspond to the single-primed quantities used for the numerator later on)

$$\begin{aligned} {a''}&= \frac{1}{t^{2}}{T}\quad {b''}=(-2m)\frac{1}{t^{2}}(t+1)\quad {c''}=2m^{2}\frac{1}{t^{2}}\\ {h''}&= -{b''}/2{a''} \quad {k''}={a''}{h''}^{2}+{b''}{h''}+{c''}. \end{aligned}$$

Consequently,

$$\begin{aligned} \hbox {DEN}\approx {}2^{2n}\exp \left( -\frac{t^{2}}{m}{k''}\right) \sqrt{\frac{1}{\pi {}{a''}mt^{2}}}\int \limits _{-\infty }^{s+\frac{1}{2}}\mathcal N \left( {h''},\frac{m}{2{a''}t^{2}}\right) dx. \end{aligned}$$

And, using the error function for the cumulative density function of the normal distribution,

$$\begin{aligned} \hbox {DEN}\approx {}2^{2n-1}\sqrt{\frac{1}{\pi {}{a''}mt^{2}}}\exp \left( -\frac{{k''}t^{2}}{m}\right) \left( 1+\text {erf}({w''})\right) \end{aligned}$$
(6)

with

$$\begin{aligned} {w''}=\frac{t\sqrt{{a''}}\left( s+\frac{1}{2}-{h''}\right) }{\sqrt{m}}. \end{aligned}$$

We proceed likewise with the numerator, although the additional factor \(i\) introduces a small complication:

$$\begin{aligned} \hbox {NUM}&= \sum _{i=1}^{s}i\left( {\begin{array}{c}ts\\ i\end{array}}\right) \left( {\begin{array}{c}ts\\ ti\end{array}}\right) =\sum _{i=1}^{s}s\left( {\begin{array}{c}ts\\ i\end{array}}\right) \left( {\begin{array}{c}ts-1\\ ti-1\end{array}}\right) \\&\approx s2^{2n-1}\sum _{i=1}^{s}\mathcal N \left( m,\frac{m}{2}\right) (i)\mathcal N \left( j,\frac{j}{2}\right) (ti-1). \end{aligned}$$

Again, we substitute and get

$$\begin{aligned} \hbox {NUM}\approx {}s2^{2n-1}\left( \pi \sqrt{mj}\right) ^{-1}\sum _{i=0}^{s-1}\int \limits _{i-\frac{1}{2}}^{i+\frac{1}{2}}\exp \left( -\frac{1}{mj}\left( {a'}(x-{h'})^{2}+{k'}\right) \right) dx, \end{aligned}$$

where the argument for the exponential function is

$$\begin{aligned} -\frac{1}{mj}\left( j(x-m)^{2}+mt^{2}\left( x-\frac{j+1}{t}\right) ^{2}\right) \end{aligned}$$

and therefore

$$\begin{aligned} {a'}&= j+mt^{2} \quad {b'}=2j(1-m)+2mt\left( t-j-1\right) \quad {c'}=j(1-m)^{2}+m\left( t-j-1\right) ^{2}\\ {h'}&= -{b'}/2{a'} \quad {k'}={a'}{h'}^{2}+{b'}{h'}+{c'}. \end{aligned}$$

Using the error function,

$$\begin{aligned} \hbox {NUM}\approx {}2^{2n-2}\frac{s}{\sqrt{\pi {}{a'}}}\exp \left( -\frac{{k'}}{mj}\right) \left( 1+\text {erf}({w'})\right) \end{aligned}$$
(7)

with

$$\begin{aligned} {w'}=\frac{\sqrt{{a'}}\left( s-\frac{1}{2}-{h'}\right) }{\sqrt{mj}}. \end{aligned}$$

Combining (6) and (7),

$$\begin{aligned} EX_{12}&= (t+1)\frac{\hbox {NUM}}{\hbox {DEN}}\\&\approx \frac{1}{2}(t+1)\sqrt{\frac{{T}{}ts}{{T}{}ts-1}}se^{\alpha _{t,s}} \end{aligned}$$

for large \(s\), because the arguments of the error function, \(w'\) and \(w''\), tend to positive infinity in both cases (NUM and DEN), so that the ratio of the error-function terms goes to 1. The argument of the exponential function is

$$\begin{aligned} \alpha _{t,s}=-\frac{{k'}}{mj}+\frac{{k''}t^{2}}{m} \end{aligned}$$

and, for \(s\rightarrow \infty \), goes to

$$\begin{aligned} \alpha _{t}=\frac{1}{2}{T}^{-2}(2t^{3}-3t^{2}+4t-5). \end{aligned}$$

Notice that, for \(t\rightarrow \infty \), \(\alpha _{t}\) goes to \(0\) and

$$\begin{aligned} EX=\frac{n}{EX_{12}+n}\rightarrow \frac{2}{3} \end{aligned}$$

in accordance with intuition T2.
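
These limiting claims can also be checked numerically against the formulas derived above; the following sketch (an illustration only, using the combinatorial expression for \(EX_{12}\) and the limiting exponent \(\alpha _{t}\) exactly as stated in this appendix) evaluates \(EX\) exactly for small \(s\) and evaluates the limiting value as \(s\rightarrow \infty \), which approaches \(2/3\) as \(t\) grows:

    import math
    from math import comb

    def ex_exact(t, s):
        # exact EX = n / (EX_12 + n) from the combinatorial formula for EX_12
        n = t * s
        num = sum(i * comb(n, i) * comb(n, t * i) for i in range(1, s + 1))
        den = sum(comb(n, i) * comb(n, t * i) for i in range(0, s + 1))
        ex12 = (t + 1) * num / den
        return n / (ex12 + n)

    def alpha_t(t):
        # limiting exponent alpha_t as given above
        T = t * t + 1
        return 0.5 * (2 * t ** 3 - 3 * t ** 2 + 4 * t - 5) / T ** 2

    def ex_limit(t):
        # EX_12/n -> (t + 1) e^{alpha_t} / (2t) as s -> infinity, hence EX = 1 / (1 + EX_12/n)
        return 1.0 / (1.0 + (t + 1) * math.exp(alpha_t(t)) / (2 * t))

    for s in (1, 2, 4, 8):
        print('t = 3, s =', s, ex_exact(3, s))
    for t in (3, 10, 100, 1000):
        print('t =', t, 'limit', ex_limit(t))   # tends to 2/3 as t grows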

Cite this article

Lukits, S. The principle of maximum entropy and a problem in probability kinematics. Synthese 191, 1409–1431 (2014). https://doi.org/10.1007/s11229-013-0335-8