The principle of maximum entropy and a problem in probability kinematics

Lukits, Stefan

doi:10.1007/s11229-013-0335-8

Published: 05 September 2013

Synthese, Volume 191, pages 1409–1431 (2014)

Abstract

Sometimes we receive evidence in a form that standard conditioning (or Jeffrey conditioning) cannot accommodate. The principle of maximum entropy (MAXENT) provides a unique solution for the posterior probability distribution based on the intuition that the information gain consistent with assumptions and evidence should be minimal. Opponents of objective methods to determine these probabilities prominently cite van Fraassen’s Judy Benjamin case to undermine the generality of MAXENT. This article shows that an intuitive approach to Judy Benjamin’s case supports MAXENT. This is surprising because, based on independence assumptions, the anticipated result is that the intuitive approach would support the opponents. The article also demonstrates that opponents improperly apply independence assumptions to the problem.

References

Bovens, L. (2010). Judy Benjamin is a sleeping beauty. Analysis, 70(1), 23–26.
Bradley, R. (2005). Radical probabilism and Bayesian conditioning. Philosophy of Science, 72(2), 342–364.
Caticha, A. (2012). Entropic inference and the foundations of physics. Sao Paulo: Brazilian Chapter of the International Society for Bayesian Analysis.
Caticha, A., & Giffin, A. (2006). Updating probabilities. In MaxEnt 2006, the 26th international workshop on Bayesian inference and maximum entropy methods.
Cover, T., & Thomas, J. (2006). Elements of information theory (Vol. 6). Hoboken, NJ: Wiley.
Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2, 299–318.
Diaconis, P., & Zabell, S. (1982). Updating subjective probability. Journal of the American Statistical Association, 77, 822–830.
Douven, I., & Romeijn, J. (2009). A new resolution of the Judy Benjamin problem. CPNSS Working Paper, 5(7), 1–22.
Gardner, M. (1959). Mathematical games. Scientific American, 201(4), 174–182.
Grove, A., & Halpern, J. (1997). Probability update: Conditioning vs. cross-entropy. In Proceedings of the thirteenth conference on uncertainty in artificial intelligence, Citeseer, Providence, RI.
Grünwald, P. (2000). Maximum entropy and the glasses you are looking through. Proceedings of the sixteenth conference on uncertainty in artificial intelligence (pp. 238–246). Burlington: Morgan Kaufmann Publishers.
Grünwald, P., & Halpern, J. (2003). Updating probabilities. Journal of Artificial Intelligence Research, 19, 243–278.
Halpern, J. Y. (2003). Reasoning about uncertainty. Cambridge, MA: MIT Press.
Hobson, A. (1971). Concepts in statistical mechanics. New York: Gordon and Breach.
Howson, C., & Franklin, A. (1994). Bayesian conditionalization and probability kinematics. The British Journal for the Philosophy of Science, 45(2), 451–466.
Jaynes, E. (1989). Papers on probability, statistics and statistical physics. Dordrecht: Springer.
Jaynes, E., & Bretthorst, G. (1998). Probability theory: The logic of science. Cambridge, UK: Cambridge University Press.
Jeffrey, R. (1965). The logic of decision. New York: McGraw-Hill.
Shore, J., & Johnson, R. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory, 26(1), 26–37.
Topsøe, F. (1979). Information-theoretical optimization techniques. Kybernetika, 15(1), 8–27.
Uffink, J. (1996). The constraint rule of the maximum entropy principle. Studies in History and Philosophy of Science Part B: Studies in History and Philosophy of Modern Physics, 27(1), 47–79.
Van Fraassen, B. (1981). A problem for relative information minimizers in probability kinematics. The British Journal for the Philosophy of Science, 32(4), 375–379.
Van Fraassen, B. (1986). A problem for relative information minimizers, continued. The British Journal for the Philosophy of Science, 37(4), 453–463.
Walley, P. (1991). Statistical reasoning with imprecise probabilities. London: Chapman and Hall.

Author information

Authors and Affiliations

University of British Columbia, 1866 Main Mall Buchanan E370, Vancouver, BC, V6T 1Z1, Canada
Stefan Lukits

Corresponding author

Correspondence to Stefan Lukits.

Appendices

Appendix A: Jaynes’ constraint rule

This appendix provides a concise but comprehensive summary of Jaynes’ constraint rule not easily obtainable in the literature (although the constraint rule is a straightforward application of Lagrange multipliers). Jaynes applies it to the Brandeis Dice Problem (see Jaynes 1989, p. 243), but does not give a mathematical justification.

Let $f$ be a probability distribution on a finite space $x_{1},\ldots ,x_{m}$ that fulfills the constraint

$$\begin{aligned} \sum _{i=1}^{m}r(x_{i})f(x_{i})=\alpha . \end{aligned}$$

(2)

An affine constraint can always be expressed by assigning a value to the expectation of a probability distribution (see Hobson 1971). In Judy Benjamin’s case, for example, let $r(x_{1})=0, r(x_{2})=1-{\vartheta }, r(x_{3})=-{\vartheta }\hbox { and }\alpha =0$. Because $f$ is a probability distribution it fulfills

$$\begin{aligned} \sum _{i=1}^{m}f(x_{i})=1. \end{aligned}$$

(3)

We want to maximize Shannon’s entropy, given the constraints (2) and (3),

$$\begin{aligned} -\sum _{i=1}^{m}f(x_{i})\ln {}f(x_{i}). \end{aligned}$$

(4)

We use Lagrange multipliers to define the functional

$$\begin{aligned} J(f)=-\sum _{i=1}^{m}f(x_{i})\ln {}f(x_{i})+\lambda _{0}\sum _{i=1}^{m}f(x_{i})+\lambda _{1}\sum _{i=1}^{m}r(x_{i})f(x_{i}) \end{aligned}$$

and differentiate it with respect to $f(x_{i})$

$$\begin{aligned} \frac{\partial {}J}{\partial {}f(x_{i})}=-\ln (f(x_{i}))-1+\lambda _{0}+\lambda _{1}r(x_{i}). \end{aligned}$$

(5)

Set (5) to $0$ to find the necessary condition to maximize (4)

$$\begin{aligned} g(x_{i})=e^{\lambda _{0}-1+\lambda _{1}r(x_{i})}. \end{aligned}$$

This is the Gibbs distribution. We still need to do two things: (a) show that the entropy of $g$ is maximal, and (b) show how to find $\lambda _{0}$ and $\lambda _{1}$. (a) is shown in Theorem 12.1.1 in Cover and Thomas (2006) and there is no reason to copy it here.

For (b), let

$$\begin{aligned} \lambda _{1}&= -\beta \\ Z(\beta )&= \sum _{i=1}^{m}e^{-\beta {}r(x_{i})}\\ \lambda _{0}&= 1-\ln (Z(\beta )). \end{aligned}$$

To find $\lambda _{0}$ and $\lambda _{1}$ we introduce the constraint

$$\begin{aligned} -\frac{\partial }{\partial {}\beta }\ln (Z(\beta ))=\alpha . \end{aligned}$$

To see how this constraint gives us $\lambda _{0}$ and $\lambda _{1}$, Jaynes’ solution of the Brandeis Dice Problem (see Jaynes 1989, p. 243) is a helpful example. We are, however, interested in a general proof that this choice of $\lambda _{0}$ and $\lambda _{1}$ gives us the probability distribution maximizing the entropy. That $g$ so defined maximizes the entropy is shown in (a). We need to make sure, however, that with this choice of $\lambda _{0}$ and $\lambda _{1}$ the constraints (2) and (3) are also fulfilled (a standard result in the application of Lagrange multipliers).
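As a rough numerical illustration of how the constraint on $\ln (Z(\beta ))$ pins down the multipliers, the sketch below solves it by bisection for a die with $r(x_{i})=i$. The target mean of 4.5 is the value used in Jaynes' Brandeis Dice Problem; the function names, the bracket $[-5,5]$ and the tolerance are assumptions made only for this example.

```python
import math

def gibbs(r, beta):
    """Gibbs distribution g(x_i) = exp(-beta * r(x_i)) / Z(beta)."""
    weights = [math.exp(-beta * ri) for ri in r]
    z = sum(weights)
    return [w / z for w in weights]

def mean(r, beta):
    """Expectation of r under the Gibbs distribution, i.e. -d/dbeta ln Z(beta)."""
    return sum(ri * gi for ri, gi in zip(r, gibbs(r, beta)))

def solve_beta(r, alpha, lo=-5.0, hi=5.0, tol=1e-12):
    """Bisection on beta; the mean decreases in beta (bracket [lo, hi] is assumed)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(r, mid) > alpha:
            lo = mid   # mean too large, so beta must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

r = [1, 2, 3, 4, 5, 6]   # faces of the die
alpha = 4.5              # prescribed mean (Brandeis Dice Problem)
beta = solve_beta(r, alpha)
g = gibbs(r, beta)

# Recover the multipliers as defined in the text.
lambda_1 = -beta
lambda_0 = 1 - math.log(sum(math.exp(-beta * ri) for ri in r))
print(beta, g)
```

With $\lambda _{0}$ and $\lambda _{1}$ recovered this way, `g` coincides with $e^{\lambda _{0}-1+\lambda _{1}r(x_{i})}$ and satisfies both (2) and (3) up to rounding.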

First, we show

$$\begin{aligned} \sum _{i=1}^{m}g(x_{i})&= \sum _{i=1}^{m}e^{\lambda _{0}-1+\lambda _{1}r(x_{i})}=e^{\lambda _{0}-1}\sum _{i=1}^{m}e^{\lambda _{1}r(x_{i})}\\&= e^{-\ln (Z(\beta ))}Z(\beta )=1. \end{aligned}$$

Then, we show, by differentiating $\ln (Z(\beta ))$ using the substitution $x=e^{-\beta }$

$$\begin{aligned} \alpha&= -\frac{\partial }{\partial {}\beta }\ln (Z(\beta ))=-\frac{1}{\sum _{i=1}^{m}x^{r(x_{i})}}\left( \,\sum _{i=1}^{m}r(x_{i})x^{r(x_{i})-1}\right) (-x)\\&= \frac{\sum _{i=1}^{m}r(x_{i})x^{r(x_{i})}}{\sum _{i=1}^{m}x^{r(x_{i})}}. \end{aligned}$$

And, finally,

$$\begin{aligned} \sum _{i=1}^{m}r(x_{i})g(x_{i})&= \sum _{i=1}^{m}r(x_{i})e^{\lambda _{0}-1+\lambda _{1}r(x_{i})}=e^{\lambda _{0}-1}\sum _{i=1}^{m}r(x_{i})e^{\lambda _{1}r(x_{i})}\\&= e^{\lambda _{0}-1}\sum _{i=1}^{m}r(x_{i})x^{r(x_{i})}=\alpha {}e^{\lambda _{0}-1}\sum _{i=1}^{m}x^{r(x_{i})}=\alpha {}e^{\lambda _{0}-1}\sum _{i=1}^{m}e^{-\beta {}r(x_{i})}\\&= \alpha {}Z(\beta )e^{\lambda _{0}-1}=\alpha {}Z(\beta )e^{-\ln (Z(\beta ))}=\alpha . \end{aligned}$$

Filling in the variables from Judy Benjamin’s scenario gives us result (1). The lambdas are:

$$\begin{aligned} \lambda _{0}=1-\ln \left( \sum _{i=1}^{m}e^{\lambda _{1}r(x_{i})}\right) \quad \lambda _{1}=\ln {}{\vartheta }-\ln (1-{\vartheta }). \end{aligned}$$

We combine the normalized odds vector $(0.16,0.48,0.36)$ following from these lambdas using Dempster’s Rule of Combination with (MAP) and get result (1).
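The two steps just described can be sketched numerically as follows. Several things here are assumptions of the sketch rather than statements from the article: the prior $(0.25,0.25,0.5)$ is read off from (MAP) as used in the powerset section; Dempster's rule is taken in its special case for two distributions on singletons (pointwise product, then renormalization); and $r$ is indexed so that the constraint ties $x_{1}$ and $x_{2}$, which reproduces the odds vector in the order stated (with $r$ indexed as written earlier in this appendix the same three numbers appear in reverse order).

```python
import math

theta = 0.75                      # Judy's case
prior = [0.25, 0.25, 0.50]        # (MAP), assumed as P(A1), P(A2), P(A3)
# Assumed indexing: the constraint ties x1 and x2, x3 is unconstrained.
r = [-theta, 1 - theta, 0.0]

lambda_1 = math.log(theta) - math.log(1 - theta)
weights = [math.exp(lambda_1 * ri) for ri in r]
z = sum(weights)
odds = [w / z for w in weights]   # normalized odds vector

# Dempster's rule for two distributions on singletons:
# multiply pointwise and renormalize.
joint = [p * o for p, o in zip(prior, odds)]
total = sum(joint)
posterior = [x / total for x in joint]

print([round(o, 2) for o in odds])
print(posterior)
```

The first line printed is the odds vector rounded to two decimals; the second is the combined distribution, whose ratio `posterior[1] / (posterior[0] + posterior[1])` equals ${\vartheta }$ by construction.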

The powerset approach formalized

Let us assume a partition $\{B_{i}\}_{i=1,{\ldots },4n}$ of $A_{1}\,\cup \,A_{2}\,\cup \,A_{3}$ into sets that are of equal measure $\mu $ and whose intersection with $A_{i}$ is either the empty set or the whole set itself (this is the division into rectangles of scenario III). (MAP) dictates that the number of sets covering $A_{3}$ equals the number of sets covering $A_{1}\cup {}A_{2}$. For convenience, we assume the number of sets covering $A_{1}$ to be $n$. Let $\mathcal C $, a subset of the powerset of $\{B_{i}\}_{i=1,{\ldots },4n}$, be the collection of sets which agree with the constraint imposed by (HDQ), i.e.

$$\begin{aligned} C\in \mathcal C \quad \hbox {iff }C=\{C_{j}\}\quad \hbox {and}\quad t\mu \left( \bigcup {}C_{j}\cap {}A_{1}\right) =\mu \left( \bigcup {}C_{j}\cap {}A_{2}\right) . \end{aligned}$$

In Figs. 5 and 6 there are diagrams of two elements of the powerset of $\{B_{i}\}_{i=1,{\ldots },4n}$. One of them (Fig. 5) is not a member of $\mathcal C $, the other one (Fig. 6) is.

The binomial distribution dictates the expectation $EX$ of $X$, using simple combinatorics. In this case we require, again for convenience, that $n$ be divisible by $t$ and the ‘grain’ of the partition be $s=n/t$. Remember that $t$ is the factor by which (HDQ) indicates that Judy’s chance of being in $A_{2}$ is greater than being in $A_{1}$. In Judy’s particular case, $t=3$ and ${\vartheta }=0.75$. We introduce a few variables which later on will help for abbreviation:

$$\begin{aligned} n=ts \quad 2m=n \quad 2j=n-1 \quad {T}=t^{2}+1. \end{aligned}$$

$EX$, of course, depends both on the grain of the partition and the value of $t$. It makes sense to make it independent of the grain by letting the grain become increasingly finer and by determining $EX$ as $s\rightarrow \infty $. This cannot be done for the binomial distribution, as it is notoriously uncomputable for large numbers (even with a powerful computer things get dicey around $s=10$). But, equally notorious, the normal distribution provides a good approximation of the binomial distribution and will help us arrive at a formula for $G_{\mathrm{pws}}$ (corresponding to $G_{\mathrm{ind}}$ and $G_{\mathrm{max}}$), determining the value $q_{3}$ dependent on ${\vartheta }$ as suggested by the powerset approach.

First, we express the random variable $X$ by the two independent random variables $X_{12}$ and $X_{3}$. $X_{12}$ is the number of partition elements in the randomly chosen $C$ which are either in $A_{1}$ or in $A_{2}$ (the random variable of the number of partition elements in $A_{1}$ and the random variable of the number of partition elements in $A_{2}$ are decidedly not independent, because they need to obey (HDQ)); $X_{3}$ is the number of partition elements in the randomly chosen $C$ which are in $A_{3}$. A relatively simple calculation shows that $EX_{3}=n$, which is just what we would expect (either the powerset approach or the uniformity approach would give us this result):

$$\begin{aligned} EX_{3}=2^{-2n}\sum _{i=0}^{2n}i\left( {\begin{array}{c}2n\\ i\end{array}}\right) =n\left( \hbox {use }\left( {\begin{array}{c}n\\ k\end{array}}\right) =\frac{n}{k}\left( {\begin{array}{c}n-1\\ k-1\end{array}}\right) \right) . \end{aligned}$$
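Spelled out, the hinted identity gives the result in two steps, using only $\sum _{k=0}^{N}\left( {\begin{array}{c}N\\ k\end{array}}\right) =2^{N}$ in addition:

$$\begin{aligned} \sum _{i=0}^{2n}i\left( {\begin{array}{c}2n\\ i\end{array}}\right) =\sum _{i=1}^{2n}2n\left( {\begin{array}{c}2n-1\\ i-1\end{array}}\right) =2n\cdot 2^{2n-1},\quad \hbox {so}\quad EX_{3}=2^{-2n}\cdot 2n\cdot 2^{2n-1}=n. \end{aligned}$$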

The expectation of $X$, where $X$ is the random variable expressing the ratio of the number of sets covering $A_{3}$ to the number of sets covering $A_{1}\cup {}A_{2}\cup {}A_{3}$, is

$$\begin{aligned} EX=\frac{EX_{3}}{EX_{12}+EX_{3}}=\frac{n}{EX_{12}+n}. \end{aligned}$$

If we were able to use uniformity and independence, $EX_{12}=n$ and $EX=1/2$, just as Grove and Halpern suggest (although their uniformity approach is admittedly less crude than the one used here). Will the powerset approach concur with the uniformity approach, will it support the principle of maximum entropy, or will it make another suggestion on how to update the prior probabilities? To answer this question, we must find out what $EX_{12}$ is, for a given value $t$ and $s\rightarrow \infty $, using the binomial distribution and its approximation by the normal distribution.

Using combinatorics,

$$\begin{aligned} EX_{12}=(t+1)\frac{\sum _{i=1}^{s}i\left( {\begin{array}{c}ts\\ i\end{array}}\right) \left( {\begin{array}{c}ts\\ ti\end{array}}\right) }{\sum _{i=0}^{s}\left( {\begin{array}{c}ts\\ i\end{array}}\right) \left( {\begin{array}{c}ts\\ ti\end{array}}\right) }. \end{aligned}$$
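For small and moderate grains the two sums can be evaluated exactly with arbitrary-precision integers. The following is a minimal sketch for inspecting the formula at finite $s$; the function names are hypothetical, and the code makes no claim about the limit $s\rightarrow \infty $ or about how closely the finite values track the normal approximation.

```python
from fractions import Fraction
from math import comb

def ex12_exact(t, s):
    """(t+1) * NUM / DEN, with NUM and DEN the sums in the displayed formula."""
    n = t * s
    num = sum(i * comb(n, i) * comb(n, t * i) for i in range(1, s + 1))
    den = sum(comb(n, i) * comb(n, t * i) for i in range(0, s + 1))
    return Fraction((t + 1) * num, den)

def ex_exact(t, s):
    """EX = n / (EX_12 + n), using the exact EX_12 for grain s."""
    n = t * s
    return n / (ex12_exact(t, s) + n)

for s in (2, 10, 50):
    print(s, float(ex_exact(3, s)))
```

A quick sanity check is $t=1$: the constraint then only demands equally many partition elements in $A_{1}$ and $A_{2}$, and the exact value is $EX_{12}=n$, hence $EX=1/2$, for every $s$.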

Let us call the numerator of this fraction NUM and the denominator DEN. According to the de Moivre–Laplace Theorem,

$$\begin{aligned} \hbox {DEN}=\sum _{i=0}^{s}\left( {\begin{array}{c}ts\\ i\end{array}}\right) \left( {\begin{array}{c}ts\\ ti\end{array}}\right) \approx {}2^{2n}\sum _{i=0}^{s}\int \limits _{i-\frac{1}{2}}^{i+\frac{1}{2}}\mathcal N \left( \frac{n}{2},\frac{n}{4}\right) (i)\mathcal N \left( \frac{n}{2},\frac{n}{4}\right) (ti)di, \end{aligned}$$

where

$$\begin{aligned} \mathcal{N }(\mu ,\sigma ^{2})(x)=\frac{1}{\sqrt{2\pi \sigma ^{2}}}\exp \left( -\frac{(x-\mu )^{2}}{2\sigma ^{2}}\right) . \end{aligned}$$
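To get a feel for the quality of this approximation, one can compare a single binomial coefficient with its normal counterpart $2^{N}\mathcal N (N/2,N/4)(k)$. The sketch below does this for an illustrative $N=100$; the size, the sample points and the function names are assumptions of the example.

```python
from math import comb, exp, pi, sqrt

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x, as defined in the text."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def binom_vs_normal(N, k):
    """Return (C(N, k), 2^N * N(N/2, N/4)(k))."""
    return comb(N, k), 2 ** N * normal_pdf(k, N / 2, N / 4)

N = 100  # illustrative size, an assumption of this sketch
for k in (50, 40, 30):
    exact, approx = binom_vs_normal(N, k)
    print(k, abs(approx - exact) / exact)  # relative error
```

The relative error is small near the mean and grows as $k$ moves into the tails, which is worth keeping in mind when the product of two such densities peaks away from either mean.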

Substitution yields

$$\begin{aligned} \hbox {DEN}\approx {}2^{2n}\frac{1}{\pi {}m}\sum _{i=0}^{s}\int \limits _{i-\frac{1}{2}}^{i+\frac{1}{2}}\exp \left( -\frac{\left( x-m\right) ^{2}}{m}-\frac{t^{2}\left( x-\frac{m}{t}\right) ^{2}}{m}\right) dx. \end{aligned}$$

Consider briefly the argument of the exponential function:

$$\begin{aligned} -\frac{\left( x-m\right) ^{2}}{m}-\frac{t^{2}\left( x-\frac{m}{t}\right) ^{2}}{m}=-\frac{t^{2}}{m}({a''}x^{2}+{b''}x+{c''})=-\frac{t^{2}}{m}\left( {a''}(x-{h''})^{2}+{k''}\right) \end{aligned}$$

with (the double prime sign corresponds to the simple prime sign for the numerator later on)

$$\begin{aligned} {a''}&= \frac{1}{t^{2}}{T}\quad {b''}=(-2m)\frac{1}{t^{2}}(t+1)\quad {c''}=2m^{2}\frac{1}{t^{2}}\\ {h''}&= -{b''}/2{a''} \quad {k''}={a''}{h''}^{2}+{b''}{h''}+{c''}. \end{aligned}$$
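For readers checking the coefficients, expanding both squares with ${T}=t^{2}+1$ gives

$$\begin{aligned} (x-m)^{2}+t^{2}\left( x-\frac{m}{t}\right) ^{2}=(x-m)^{2}+(tx-m)^{2}={T}x^{2}-2m(t+1)x+2m^{2}, \end{aligned}$$

and dividing by $t^{2}$ yields ${a''}$, ${b''}$ and ${c''}$ as listed. Completing the square then places the peak at ${h''}=m(t+1)/{T}$.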

Consequently,

$$\begin{aligned} \hbox {DEN}\approx {}2^{2n}\exp \left( -\frac{t^{2}}{m}{k''}\right) \sqrt{\frac{1}{\pi {}{a''}mt^{2}}}\int \limits _{-\infty }^{s+\frac{1}{2}}\mathcal N \left( {h''},\frac{m}{2{a''}t^{2}}\right) dx. \end{aligned}$$

And, using the error function for the cumulative density function of the normal distribution,

$$\begin{aligned} \hbox {DEN}\approx {}2^{2n-1}\sqrt{\frac{1}{\pi {}{a''}mt^{2}}}\exp \left( -\frac{{k''}t^{2}}{m}\right) \left( 1+\text {erf}({w''})\right) \end{aligned}$$

(6)

with

$$\begin{aligned} {w''}=\frac{t\sqrt{{a''}}\left( s+\frac{1}{2}-{h''}\right) }{\sqrt{m}}. \end{aligned}$$

We proceed likewise with the numerator, although the additional factor introduces a small complication:

$$\begin{aligned} \hbox {NUM}&= \sum _{i=1}^{s}i\left( {\begin{array}{c}ts\\ i\end{array}}\right) \left( {\begin{array}{c}ts\\ ti\end{array}}\right) =\sum _{i=1}^{s}s\left( {\begin{array}{c}ts\\ i\end{array}}\right) \left( {\begin{array}{c}ts-1\\ ti-1\end{array}}\right) \\&\approx s2^{2n-1}\sum _{i=1}^{s}\mathcal N \left( m,\frac{m}{2}\right) (i)\mathcal N \left( j,\frac{j}{2}\right) (ti-1). \end{aligned}$$

Again, we substitute and get

$$\begin{aligned} \hbox {NUM}\approx {}s2^{2n-1}\left( \pi \sqrt{mj}\right) ^{-1}\sum _{i=0}^{s-1}\int \limits _{i-\frac{1}{2}}^{i+\frac{1}{2}}\exp \left( -\frac{1}{mj}\left( {a'}(x-{h'})^{2}+{k'}\right) \right) dx, \end{aligned}$$

where the argument for the exponential function is

$$\begin{aligned} -\frac{1}{mj}\left( j(x-m)^{2}+mt^{2}\left( x-\frac{j+1}{t}\right) ^{2}\right) \end{aligned}$$

and therefore

$$\begin{aligned} {a'}&= j+mt^{2} \quad {b'}=2j(1-m)+2mt\left( t-j\right) \quad {c'}=j(1-m)^{2}+m\left( t-j-1\right) ^{2}\\ {h'}&= -{b'}/2{a'} \quad {k'}={a'}{h'}^{2}+{b'}{h'}+{c'}. \end{aligned}$$

Using the error function,

$$\begin{aligned} \hbox {NUM}\approx {}2^{2n-2}\frac{s}{\sqrt{\pi {}{a'}}}\exp \left( -\frac{{k'}}{mj}\right) \left( 1+\text {erf}({w'})\right) \end{aligned}$$

(7)

with

$$\begin{aligned} {w'}=\frac{\sqrt{{a'}}\left( s-\frac{1}{2}-{h'}\right) }{\sqrt{mj}}. \end{aligned}$$

Combining (6) and (7),

$$\begin{aligned} EX_{12}&= (t+1)\frac{\hbox {NUM}}{\hbox {DEN}}\\&\approx \frac{1}{2}(t+1)\sqrt{\frac{{T}{}ts}{{T}{}ts-1}}se^{\alpha _{t,s}} \end{aligned}$$

for large $s$, because the arguments for the error function $w'$ and $w''$ escape to positive infinity in both cases (NUM and DEN) so that their ratio goes to 1. The argument for the exponential function is

$$\begin{aligned} \alpha _{t,s}=-\frac{{k'}}{mj}+\frac{{k''}t^{2}}{m} \end{aligned}$$

and, for $s\rightarrow \infty $, goes to

$$\begin{aligned} \alpha _{t}=\frac{1}{2}{T}^{-2}(2t^{3}-3t^{2}+4t-5). \end{aligned}$$

Notice that, for $t\rightarrow \infty $, $\alpha _{t}$ goes to $0$ and

$$\begin{aligned} EX=\frac{n}{EX_{12}+n}\rightarrow \frac{2}{3} \end{aligned}$$

in accordance with intuition T2.
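The limiting behaviour can be tabulated directly from the formulas above. The sketch assumes that, for $s\rightarrow \infty $, the square-root factor in $EX_{12}$ is replaced by its limit $1$, so that with $n=ts$ one gets $EX=t/\left( t+\frac{1}{2}(t+1)e^{\alpha _{t}}\right) $; the function names are hypothetical.

```python
import math

def alpha_t(t):
    """Limit of alpha_{t,s} for s -> infinity, as given in the text."""
    T = t ** 2 + 1
    return 0.5 * (2 * t ** 3 - 3 * t ** 2 + 4 * t - 5) / T ** 2

def ex_limit(t):
    """EX = n / (EX_12 + n) with n = t*s and EX_12 ~ (t+1)/2 * s * exp(alpha_t)."""
    return t / (t + 0.5 * (t + 1) * math.exp(alpha_t(t)))

for t in (3, 10, 100, 10_000):
    print(t, ex_limit(t))
```

As $t$ grows, the printed values approach $2/3$, matching the limit stated above.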

About this article

Cite this article

Lukits, S. The principle of maximum entropy and a problem in probability kinematics. Synthese 191, 1409–1431 (2014). https://doi.org/10.1007/s11229-013-0335-8

Received: 22 December 2012
Accepted: 19 August 2013
Published: 05 September 2013
Issue Date: May 2014
DOI: https://doi.org/10.1007/s11229-013-0335-8
