1 Introduction

Information theory is concerned with communication in the presence of noise [1]. A noisy channel may be described by the probability distribution \(p(x|\theta )\) over received messages \(x\in X\), for a given signal \(\theta \in \varTheta \). The mutual information between input and output \(I(X;\varTheta )\) depends not only on the channel, but also on the distribution of input signals \(p(\theta )\). A choice of communication code implies a choice of this input distribution, and we are interested in \(p_\star (\theta )\) which maximizes I, which is then precisely the channel capacity.

Some channels are used repeatedly to send independent signals, as for example in telecommunications. One surprising feature of the optimal distribution \(p_\star (\theta )\) in this context is that it is usually discrete: Even when \(\theta \) may take on a continuum of values, the optimal code uses only a finite number K of discrete symbols. In a sense the best code is digital, even though the channel is analog.
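To make the discreteness concrete, here is a minimal numerical sketch (an illustration of ours, not part of the argument below): it maximizes \(I(X;\varTheta )\) for a Gaussian channel by the standard Blahut–Arimoto iteration, after discretizing \(\theta \) onto a grid. The interval length, grid sizes and iteration count are illustrative choices. The optimal weights concentrate on a few small clusters of grid points, i.e. the capacity-achieving input distribution is effectively discrete.

```
import numpy as np

# Gaussian channel p(x|theta) on the interval [0, L], discretized for numerics.
L, sigma = 4.0, 1.0                                   # illustrative values
theta = np.linspace(0.0, L, 201)                      # candidate input symbols
x = np.linspace(-5*sigma, L + 5*sigma, 2001)
dx = x[1] - x[0]
P = np.exp(-(x[None, :] - theta[:, None])**2/(2*sigma**2))/(sigma*np.sqrt(2*np.pi))

lam = np.full(len(theta), 1.0/len(theta))             # start from a uniform prior
for _ in range(3000):                                 # Blahut-Arimoto iteration
    px = lam @ P                                      # p(x) = sum_a lam_a p(x|theta_a)
    D = np.sum(P*np.log(P/px[None, :]), axis=1)*dx    # D_a = KL( p(x|theta_a) || p(x) )
    lam *= np.exp(D)
    lam /= lam.sum()

print("capacity, in bits:", (lam @ D)/np.log(2))
print("support of the optimum:", theta[lam > 1e-3])   # weight collects on a few clusters
```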

The opposite limit of the same problem has also been studied, where effectively a single signal \(\theta \) is sent a large number of times, generating independent outputs \(x_{i}\), \(i=1,2,\ldots m\). In this case we maximize \(I(X^{m},\varTheta )\) over \(p(\theta )\) with the goal of transmitting \(\theta \) to high accuracy. The natural example here is not telecommunications, but instead comes from viewing a scientific experiment as a channel from the parameters \(\theta \) in a theory, via some noisy measurements, to recorded data \(x_{i}\). Lindley [2] argued that the channel capacity or mutual information may then be viewed as the natural summary of how much knowledge we will gain. The distribution \(p(\theta )\) is then a Bayesian prior, and Bernardo [3, 4] argued that the optimal \(p_\star (\theta )\) provides a natural choice of uninformative prior. This situation is usually studied in the limit \(m\rightarrow \infty \), where they named this a reference prior. Unlike the case \(m=1\) above, in this limit the prior is typically continuous, and in fact usually agrees with the better-known Jeffreys prior [5].

The result we report here is a novel scaling law, describing this approach to a continuum. If K is the optimal number of discrete states, the form of our law plotted in Fig. 1 is this:

$$\begin{aligned} I(X;\varTheta )\sim \zeta \,\log K,\qquad \zeta =\frac{3}{4}. \end{aligned}$$
(1)

Slope \(\zeta =1\) on this figure represents the absolute bound \(I(X;\varTheta )\le \log K\) on the mutual information, which simply encodes the fact that the difference between certainty and complete ignorance among \(K=2^{n}\) outcomes is exactly n bits of information.

While the motivations above come from various fields, our derivation of this law (1) is very much in the tradition of physics. In Sect. 3 we study a field theory for the local number density of delta functions, \(\rho (\theta )\). The maximization gives us an equation of motion for this density, and solving this, we find that the average \(\rho _{0}=\int \rho (\theta )\,d\theta /L\) behaves as

$$\begin{aligned} \rho _{0}=\frac{K}{L}\sim L^{-1+1/\zeta }=L^{1/3}\quad \text { when }L\rightarrow \infty ,\qquad \zeta =3/4. \end{aligned}$$
(2)

Here L is a proper length \(L=\int \sqrt{g_{\theta \theta }\,d\theta ^{2}}\), measured with respect to the natural measure on \(\varTheta \) induced by \(p(x^{m}|\theta )\), namely the Fisher information metric (4). At large L, this length is proportional to the number of distinguishable outcomes, thus the information grows as \(I(X;\varTheta )\sim \log L\). Then the scaling law (1) above reads \(K^{\zeta }\sim L\), equivalent to (2).

Fig. 1

Scaling law for the number of delta functions as L is increased, plotting \(I(X;\varTheta )/\log 2\) against K. The lines drawn are \(I(X;\varTheta )=\zeta \,\log K\) for \(\zeta =1\) (an absolute bound) and \(\zeta =3/4\) (our scaling law). The blue cross data points are for the Bernoulli model discussed in Sect. 4

We derive this law assuming Gaussian noise, but we believe that it is quite general, because this is (as usual) a good approximation in the limit being taken. Section 4 looks explicitly at another one-dimensional model which displays the same scaling (also plotted in Fig. 1) and then at the generalization to D dimensions. Finally Appendix A looks at a paper from 30 years ago which could have discovered this scaling law. But we begin by motivating in more detail why we are interested in this problem, and this limit.

2 Prior Work

How much does observing data x inform us about the parameters \(\theta \) of a model? Lindley [2] argued that this question could be formalized by considering the mutual information between the parameters and the expected data, \(I(X;\varTheta )\). In this framework, before data is seen we have some prior on parameter space \(p(\theta )\), and a resulting expectation for the probability of data x given by \(p(x)=\int d\theta \,p(\theta )\,p(x|\theta )\). After seeing particular data x, and updating \(p(\theta )\) to \(p(\theta |x)\) using Bayes’ rule, the entropy in parameter space will be reduced from \(S(\varTheta )\) to \(S(\varTheta |x)\). Thus on average the final entropy is \(S(\varTheta \vert X)=S(\varTheta )-I(X;\varTheta )\), and we have learned information I. Bernardo and others [6] argued that in the absence of any other knowledge, the prior \(p(\theta )\) should be chosen to maximize I, so as to learn as much as possible from the results of an experiment. The statistics community has mostly focused on the limit where data is plentiful—where each experiment is repeated m times, and m goes to infinity. In this limit, the prior which maximizes \(I(X^{m};\varTheta )\) for the aggregate data \((x_{1},x_{2},\ldots ,x_{m})\) is known as a reference prior [6]. It usually approaches Jeffreys prior [5], which can also be derived from invariance and geometric considerations, described below.
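As a toy illustration of this bookkeeping (our own example, with arbitrary numbers): for a binary parameter observed through a symmetric noisy channel, the average posterior entropy is the prior entropy minus \(I(X;\varTheta )\).

```
import numpy as np

# Two equally likely parameter values, observed through a binary channel
# with error rate eps (illustrative numbers).
eps = 0.1
p_theta = np.array([0.5, 0.5])
p_x_given_theta = np.array([[1 - eps, eps],
                            [eps, 1 - eps]])          # rows: theta, columns: x

joint = p_theta[:, None]*p_x_given_theta
p_x = joint.sum(axis=0)

H = lambda p: -np.sum(p[p > 0]*np.log2(p[p > 0]))     # entropy in bits
S_prior = H(p_theta)                                  # 1 bit
S_posterior = sum(p_x[j]*H(joint[:, j]/p_x[j]) for j in range(2))
print("I(X;Theta) =", S_prior - S_posterior, "bits")  # ~0.53 bits learned on average
```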

In a recent paper [7] we argued that the finite data case (\(m\ne \infty \)) contains surprises which naturally encode a preference for model simplicity. This places model selection and prior selection into the same framework. With finite data, it was long knownFootnote 1 that the optimal prior \(p_\star (\theta )\) is almost always discrete, composed of a finite number of delta functions:

$$\begin{aligned} p_\star (\theta )=\sum _{a=1}^{K}\lambda _{a}\,\delta (\theta -\theta _{a}). \end{aligned}$$
(3)

The delta functions become more closely spaced as the number of repetitions m increases, with their density approaching the smooth Jeffreys prior as \(m\rightarrow \infty \), the limit of plentiful data. However, in the data-starved limit, this prior instead has only a small number of delta functions, placed as far apart as possible, often at the edges of the allowed parameter space. It is the combination of these two limiting behaviors which makes these priors useful for model selection. The typical situation in science is that we have many parameters, of which a few relevant combinations are in the data-rich regime, while many more are in the data-starved regime [16, 17]. If we are able to apply the methods of renormalization, then these unmeasurable parameters are precisely the irrelevant directions. We argued that, in general, \(p_\star (\theta )\) determines the appropriate model class by placing weight on edges of parameter space along irrelevant directions, but in the interior along relevant directions. Thus it selects a sub-manifold of \(\varTheta \) describing an appropriate effective theory, ignoring the irrelevant directions.

The appropriate notion of distance on the parameter manifold \(\varTheta \) should describe how distinguishable the data resulting from different parameter values will be. This is given by the Fisher information metric,

$$\begin{aligned} ds^{2}=\sum _{\mu ,\nu =1}^{D}g_{\mu \nu }d\theta ^{\mu }d\theta ^{\nu },\qquad g_{\mu \nu }(\theta )=\int dx\,p(x|\theta )\,\frac{\partial \log p(x|\theta )}{\partial \theta ^{\mu }}\,\frac{\partial \log p(x|\theta )}{\partial \theta ^{\nu }} \end{aligned}$$
(4)

which measures distances between points \(\theta \) in units of standard deviations of \(p(x\vert \theta )\). Such distances are invariant to changes of parameterization, and this invariance is an attractive feature of Jeffreys prior, which is simply the associated volume form, normalized to have total probability 1:

$$\begin{aligned} p_\mathrm {J}(\theta )=\frac{1}{Z}\sqrt{\det g_{\mu \nu }(\theta )},\qquad Z=\int d\theta \sqrt{\det g_{\mu \nu }(\theta )}. \end{aligned}$$

Notice, however, that normalization destroys the natural scale of the metric. Repeating an experiment m times changes the metric \(g_{\mu \nu }\rightarrow m\,g_{\mu \nu }\), encoding the fact that more data allows us to better distinguish nearby parameter values. But this repetition does not change \(p_\mathrm {J}(\theta )\). We argued in [7] that this invariance is in fact an unattractive feature of Jeffreys prior. The scale of the metric is what captures the important difference between parameters we can measure well (length \(L=\int \sqrt{g_{\theta \theta }\,d\theta ^{2}}\gg 1\) in the Fisher metric) and parameters which we cannot measure at all (length \(L\ll 1\)).
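As a concrete check of (4) and of this scaling (our own numerical example, assuming scipy is available): for the binomial model used in Sect. 4 below, summing over outcomes reproduces \(g_{\theta \theta }=m/\theta (1-\theta )\), which doubles when the experiment is repeated twice, while Jeffreys prior does not change.

```
import numpy as np
from scipy.stats import binom

def fisher_binomial(theta, m):
    """g_{theta theta} from definition (4), by direct summation over outcomes x."""
    x = np.arange(m + 1)
    p = binom.pmf(x, m, theta)
    score = x/theta - (m - x)/(1 - theta)      # d/dtheta log p(x|theta)
    return np.sum(p*score**2)

theta, m = 0.3, 50
print(fisher_binomial(theta, m), m/(theta*(1 - theta)))        # both ~238.1
print(fisher_binomial(theta, 2*m)/fisher_binomial(theta, m))   # ~2: the metric scales with m
# Jeffreys prior, 1/(pi*sqrt(theta*(1-theta))), is identical for m and 2m.
```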

Instead, the optimal prior \(p_\star (\theta )\) has a different invariance, towards the addition of extra irrelevant parameters. Adding extra irrelevant parameters increases the dimensionality of the manifold, and multiplies the volume form by the irrelevant volume. This extra factor, while by definition smaller than 1, can still vary by many orders of magnitude between different points in parameter space (say from \(10^{-3}\) to \(10^{-30}\)). Jeffreys prior, and its implied distribution p(x), is strongly affected by this. It effectively assumes that all parameter directions can be measured, because all are large in the limit \(m\rightarrow \infty \). But this is not true in most systems of interest to science. By contrast \(p_\star (\theta )\) ignores irrelevant directions, giving a distribution p(x) almost unchanged by their addition, or removal.

Bayesian priors have been a contentious subject from the beginning, with arguments foreshadowing some of the later debates about wavefunctions. Uninformative priors such as Jeffreys prior, which depend on the likelihood function \(p(x|\theta )\) describing a particular experiment, fit into the so-called objective Bayesian viewpoint. This is often contrasted with a subjective viewpoint, in which the prior represents our state of knowledge, and we incorporate every possible new experimental result by updating it: “today’s posterior is tomorrow’s prior” [18]. Under such a view the discrete \(p_\star (\theta )\), which is zero at almost every \(\theta \), would be extremely strange. Note however that this subjective viewpoint is already incompatible with Jeffreys prior. If we invent a different experiment to perform tomorrow, we ought to go back and change our prior to the one appropriate for the joint experiment, resulting in an update not described by Bayes’ rule [19]. We differ only in that we regard (say) \(10^{6}\) repetitions of the same experiment as also being a different experiment, since this is equivalent to obtaining a much higher-resolution instrument.

But the concerns of this paper are different. We are interested in the relevant directions in parameter space, as it is along such directions that we are in the regime in which the scaling law (1) holds. In the derivation of this, in Sect. 3 below, we study one such dimension in isolation. While we use the notation of the above statistics problem, we stress the result applies equally to problems from other domains, as discussed above.

3 Derivation

This section studies a model in just one dimension, measuring \(\theta \in [0,L]\) with Gaussian noise of known variance, thus

$$\begin{aligned} p(x\vert \theta )=\tfrac{1}{\sigma \sqrt{2\pi }}\,e^{-(x-\theta )^{2}/2\sigma ^{2}}. \end{aligned}$$
(5)

We consider only one measurement, \(m=1\), since more repetitions are equivalent to having less noise. It is convenient to choose units in which \(\sigma =1\), so that \(\theta \) measures proper distance: \(g_{\theta \theta }=1\). Thus L is the length of parameter space, in terms of the Fisher metric. Jeffreys prior is a constant \(p_\mathrm {J}(\theta )=1/L\). The optimal prior has K points of mass:

$$\begin{aligned} p_\star (\theta )=\sum _{a=1}^{K}\lambda _{a}\,\delta (\theta -\theta _{a}),\qquad \theta _{1}=0,\quad \theta _{K}=L,\qquad \sum _{a=1}^{K}\lambda _{a}=1. \end{aligned}$$

The positions \(\theta _{a}\) and weights \(\lambda _{a}\) should be fixed by maximizing the mutual information. This is symmetric, \(I(X;\varTheta )=I(\varTheta ;X)\), so it can be written

$$\begin{aligned} I(X;\varTheta )=S(X)-S(X\vert \varTheta ) \end{aligned}$$
(6)

where the entropy and conditional entropy are

$$\begin{aligned} S(X)&=-\int dx\,p(x)\log p(x),\qquad p(x)=\int d\theta \,p(x\vert \theta )p(\theta )\\ S(X\vert \varTheta )&=\int d\theta p(\theta )\left[ -\int dx\,p(x\vert \theta )\log p(x\vert \theta )\right] . \end{aligned}$$

For this Gaussian model, the conditional entropy \(S(X\vert \varTheta )=\frac{1}{2}+\frac{1}{2}\log 2\pi \) is independent of the prior, so it remains only to calculate the entropy S(X).
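For later reference, here is a short numerical sketch of this decomposition (ours, with illustrative positions and weights): since \(S(X\vert \varTheta )\) is the fixed Gaussian entropy, \(I(X;\varTheta )\) follows from the entropy of the mixture p(x) alone.

```
import numpy as np

def mutual_info(positions, weights, sigma=1.0):
    """I(X;Theta) in nats for a prior made of delta functions, using (6)."""
    x = np.linspace(positions.min() - 8*sigma, positions.max() + 8*sigma, 40001)
    dx = x[1] - x[0]
    px = sum(w*np.exp(-(x - t)**2/(2*sigma**2))/(sigma*np.sqrt(2*np.pi))
             for w, t in zip(weights, positions))
    S_X = -np.sum(px*np.log(px))*dx
    S_X_given_Theta = 0.5 + 0.5*np.log(2*np.pi*sigma**2)    # independent of the prior
    return S_X - S_X_given_Theta

# two equally weighted atoms at the ends of an interval of length 3 (illustrative):
print(mutual_info(np.array([0.0, 3.0]), np.array([0.5, 0.5]))/np.log(2), "bits")
```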

On an infinite line, the entropy would be maximized by a constant p(x), i.e. a prior with delta functions spaced infinitesimally close together. But on a very short line, we observe that entropy is maximized by placing substantial weight at each end, with a gap before the next delta function. The idea of our calculation is that the behavior on a long but finite line should interpolate between these two regimes. We work out first the cost of a finite density of delta functions, and then the local cost of a spatially varying density, giving us an equation of motion for the optimum \(\rho (x)\). By solving this we learn how the density increases as we move away from the boundary. The integral of this density then gives us K with the desired scaling law.

Since the deviations from a constant p(x) will be small, we write

$$\begin{aligned} p(x)=\frac{1}{L}\left[ 1+w(x)\right] ,\qquad \int dx\,w(x)=0 \end{aligned}$$

and then expand the entropy in powers of \(w(x)\):

$$\begin{aligned} S(X)&=\log L-\frac{1}{2L}\int _{0}^{L}dx\,w(x)^{2}+\mathcal {O}(w^{4})\nonumber \\&=\log L-\frac{1}{2}\sum _{k}\,\left| w_{k}\right| ^{2}+\cdots . \end{aligned}$$
(7)

Here our convention for Fourier transforms is that

$$\begin{aligned} w_{k}=\int _{0}^{L}\frac{dx}{L}\,e^{-ikx}w(x),\qquad k\in \frac{2\pi }{L}{\mathbb {Z}}. \end{aligned}$$

3.1 Constant Spacing

Consider first the effect of a prior which is a long string of delta functions at constant spacing a, which we assume to be small compared to the standard deviation \(\sigma =1\), which in turn is much less than the length L.Footnote 2 This leads to

$$\begin{aligned} p(x)=\frac{a}{L}\sum _{n\in {\mathbb {Z}}}\,\frac{1}{\sqrt{2\pi }}e^{-(x-na)^{2}/2}. \end{aligned}$$
(8)

Because this is a convolution of a Dirac comb with a Gaussian kernel, its Fourier transform is simply a product of such pieces. Let us write the Fourier transform of the comb of source positions as follows:

$$\begin{aligned} c^{0}(x)=a\sum _{n\in {\mathbb {Z}}}\delta (x-na),\qquad \Rightarrow \qquad c_{k}^{0}=\frac{a}{L}\sum _{n\in {\mathbb {Z}}}e^{-ikna}=\sum _{m\in {\mathbb {Z}}}\delta _{k-m\frac{2\pi }{a}}. \end{aligned}$$

The zero-frequency part of \(p_{k}\) is the constant term in p(x), with the rest contributing to \(w(x)\):

$$\begin{aligned} w_{k}=\sum _{m\ne 0}\,\delta _{k-m\frac{2\pi }{a}}\,e^{-k^{2}/2}={\left\{ \begin{array}{ll} e^{-k^{2}/2} &{} k\in \frac{2\pi }{a}{\mathbb {Z}}\text { and }k\ne 0\\ 0 &{} \text {else}. \end{array}\right. } \end{aligned}$$

The lowest-frequency terms at \(k=\pm 2\pi /a\) give the leading exponential correction to the entropy:

$$\begin{aligned} S(X)=\log L-e^{-q^{2}}+\mathcal {O}(e^{-2q^{2}}),\qquad q=\frac{2\pi }{a}. \end{aligned}$$
(9)

As advertised, any nonzero spacing \(a>0\) (i.e. frequency \(q<\infty \)) reduces the entropy from its maximum.
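Equation (9) is easy to check numerically, since for the infinite comb p(x) is periodic with period a and the entropy integral reduces to one period. In the following sketch (ours) the spacing a=2 is chosen only so that the correction is visible at double precision; for \(a\le 1\) it falls below machine accuracy.

```
import numpy as np

a, L = 2.0, 1000.0            # comb spacing and box length (illustrative)
q = 2*np.pi/a
N = 200000
x = np.arange(N)*a/N          # one period of p(x)

# By Poisson summation, the comb (8) has Fourier amplitudes exp(-k^2/2) at k = q, 2q, ...
w = 2*sum(np.exp(-(n*q)**2/2)*np.cos(n*q*x) for n in range(1, 6))
p = (1 + w)/L

S = -(L/a)*np.sum(p*np.log(p))*(a/N)        # -int_0^L dx p log p, using periodicity
print(S - np.log(L), -np.exp(-q**2))        # both ~ -5.17e-5
```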

Fig. 2

Above, numerical solution \(p_{\star }(\theta )\) for \(L=30\), in which we observe that as the delta functions grow closer together, their weights compensate to leave p(x) almost constant, with deviations \(w(x)\) at a wavelength comparable to the spacing. Below, a diagram to show the scales involved when perturbing the positions of the delta functions in our derivation. These are arranged from longest to shortest wavelength, see also (11)

3.2 Variable Spacing

Now consider perturbing the positions of the delta functions by a slowly varying function \(\varDelta (x)\), and multiplying their weights by \(1+h(x)\). We seek a formula for the entropy in terms of \(\varDelta (x)\), while allowing \(h(x)\) to adjust so as to minimize the disturbance. This cannot be done perfectly, as \(h(x)\) is only sampled at spacing a, so only contributions at frequencies lower than \(q=2\pi /a\) will be screened. Thus we expect what survives to appear with the same exponential factor as (9). In particular this ensures that at infinite density, no trace of \(\varDelta (x)\) remains. And that is necessary in order for the limit to agree with Jeffreys prior, which is a constant.

Figure 2 illustrates how the positions and weights of \(p_\star (\theta )\) compensate to leave p(x) almost constant in the interior, in a numerical example. Below that it shows how the functions \(\varDelta (x)\) and \(h(x)\) used here mimic this effect.

The comb of delta functions \(c^{0}(x)\) we had above is perturbed to

$$\begin{aligned} c(x)=\left[ 1+h(x)\right] \,a\sum _{n\in {\mathbb {Z}}}\delta \big (x-na-\varDelta (na)\big )=\left[ 1+h(x)\right] \,c^{\varDelta }(x). \end{aligned}$$

The effect of \(h(x)\) is a convolution in frequency space:

$$\begin{aligned} c_{k}=c_{k}^{\varDelta }+\sum _{k'}h_{k'}c_{k-k'}^{\varDelta }. \end{aligned}$$

It will suffice to study \(\varDelta (x)=\varDelta \cos (\hat{k}x)\), i.e. frequencies \(\pm \hat{k}\) only: \(\varDelta _{k}=\tfrac{1}{2}\varDelta (\delta _{k-\hat{k}}+\delta _{k+\hat{k}})\). The driving frequency is \(\hat{k}\ll q\). We can expand in the amplitude \(\varDelta \) to write

$$\begin{aligned} c_{k}^{\varDelta }&=\frac{a}{L}\sum _{n}e^{-ik\big (na+\varDelta (na)\big )}\\&=c_{k}^{0}-\frac{ik\varDelta }{2}\left[ c_{k-\hat{k}}^{0}+c_{k+\hat{k}}^{0}\right] -\frac{k^{2}\varDelta ^{2}}{8}\left[ 2c_{k}^{0}+c_{k-2\hat{k}}^{0}+c_{k+2\hat{k}}^{0}\right] +\mathcal {O}(\varDelta ^{3}). \end{aligned}$$

The order \(\varDelta \) term has contributions at \(k=\hat{k}\ll q=2\pi /a\), which can be screened in the full \(c_{k}\) by setting \(h_{k}=+ik\varDelta _{k}\) i.e. \(h(x)=-\hat{k}\varDelta \sin (\hat{k}x)\). What survives in \(c_{k}\) then are contributions at \(k=0\), \(k=\pm q\) and \(k=\pm q\pm \hat{k}\):Footnote 3

$$\begin{aligned} c_{k}=\delta _{k}+\sum _{\pm }\delta _{k\pm q}\Big (1-\frac{q^{2}\varDelta ^{2}}{4}\Big )+\sum _{\pm }[\delta _{k\mp q-\hat{k}}+\delta _{k\mp q+\hat{k}}]\Big (\pm \frac{i\,q\,\varDelta }{2}\Big )+\mathcal {O}(\delta _{k\pm 2q},\varDelta ^{2}) \end{aligned}$$

All but the zero-frequency term are part of \(w_{k}=(c_{k}-\delta _{k})e^{-k^{2}/2}\), and enter (7) independently, giving this:

$$\begin{aligned} S(X)&=\log L-e^{-q^{2}}\Big (1-\frac{q^{2}\varDelta ^{2}}{4}\Big )^{2}-\left[ e^{-(q-\hat{k})^{2}}+e^{-(q+\hat{k})^{2}}\right] \Big (\frac{\varDelta q}{2}\Big )^{2}+\mathcal {O}(\varDelta ^{4})+\mathcal {O}(e^{-2q^{2}})\nonumber \\&=\log L-e^{-q^{2}}\left[ 1{+}\varDelta ^{2}\left( q^{4}\hat{k}^{2}{+}\frac{1}{3}q^{6}\hat{k}^{4}{+}\frac{2}{45}q^{8}\hat{k}^{6}+\ldots \right) \left[ 1+\mathcal {O}(1/q^{2})\right] +\ldots \right] +\ldots \end{aligned}$$
(10)

As promised, the order \(\varDelta ^{2}\) term comes with the same overall exponential as in (9) above. Restoring units briefly, the expansion in round brackets makes sense only if \(\hat{k}\,q\,\sigma ^{2}\ll 1\).Footnote 4 In terms of length scales this means \(\frac{2\pi /\hat{k}}{\sigma }\gg \frac{\sigma }{a}\), or writing all the assumptions made:

$$\begin{aligned} \underset{\text {comb spacing}}{a=2\pi /q}\ll \underset{\text {kernel width}}{\sigma =1}\lll \underset{\text {perturbation wavelength}}{2\pi /\hat{k}}\ll \underset{\text {box size}}{L}. \end{aligned}$$
(11)
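The step from the first to the second line of (10) can be reproduced symbolically. A short check (ours, using sympy) of the coefficients \(q^{4}\hat{k}^{2}\), \(\tfrac{1}{3}q^{6}\hat{k}^{4}\), \(\tfrac{2}{45}q^{8}\hat{k}^{6}\):

```
import sympy as sp

q, kh, Delta = sp.symbols('q khat Delta', positive=True)

# First line of (10), divided by the overall -exp(-q^2);
# note exp(q^2)*exp(-(q -+ khat)^2) = exp(+-2*q*khat - khat^2).
expr = (1 - q**2*Delta**2/4)**2 \
     + (sp.exp(2*q*kh - kh**2) + sp.exp(-2*q*kh - kh**2))*(Delta*q/2)**2

order2 = (sp.series(expr, Delta, 0, 3).removeO() - 1)/Delta**2    # coefficient of Delta^2
print(sp.expand(sp.series(order2, kh, 0, 8).removeO()))
# q**4*khat**2 + q**6*khat**4/3 + 2*q**8*khat**6/45, plus terms smaller by powers of 1/q^2
```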

3.3 Entropy Density

We can think of this entropy (10) as arising from some density: \(S(X)=\int \frac{dx}{L}{\mathcal {S}}(x)={\mathcal {S}}_{0}\). Our claim is that this density takes the form

$$\begin{aligned} {\mathcal {S}}(x)=\log L-e^{-(2\pi )^{2}\rho (x)^{2}}\left[ 1+\frac{128\pi ^{6}}{3}\rho (x)^{4}\rho '(x)^{2}\right] . \end{aligned}$$
(12)

The constant term is clearly fixed by (9). To connect the kinetic term to (10), we need

$$\begin{aligned} \rho (x)=\frac{1}{a(1+\varDelta '(x))}=\frac{1}{a}\left[ 1-\varDelta '(x)+\varDelta '(x)^{2}+\mathcal {O}(\varDelta ^{3})\right] \end{aligned}$$

thus \(\rho '(x)=-\frac{1}{a}\varDelta ''(x)+\ldots \) and

$$\begin{aligned} e^{-(2\pi )^{2}\rho (x)^{2}}=e^{-q^{2}}\left[ 1+2q^{2}\varDelta '(x)+2q^{4}\varDelta '(x)^{2}+\mathcal {O}(\varDelta ^{3})\right] \left[ 1+\mathcal {O}(1/q^{2})\right] . \end{aligned}$$

Multiplying these pieces, the order \(\varDelta ^{1}\) term of \({\mathcal {S}}(x)\) integrates to zero. We can write the order \(\varDelta ^{2}\) term in terms of Fourier coefficients (using (7), and \(\varDelta '_{k}=ik\varDelta _{k}\)), and we recover the leading terms in (10). The next term there \(q^{8}\hat{k}^{6}\) would arise from a term \(\rho (x)^{6}\rho ''(x)^{2}\) in the density, which we neglect.Footnote 5

The Euler-Lagrange equations from (12) read

$$\begin{aligned} 0=\rho (x)^{4}\rho ''(x)+2\rho (x)^{3}\rho '(x)^{2}-4\pi ^{2}\rho (x)^{5}\rho '(x)^{2}+\frac{3}{32\pi ^{4}}\rho (x).\end{aligned}$$
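This variation can be checked mechanically. Here is a short sympy verification (ours) that the Euler-Lagrange equation of the density (12) reduces to the equation of motion above, after dividing out an overall positive factor:

```
import sympy as sp
from sympy.calculus.euler import euler_equations

x = sp.symbols('x')
rho = sp.Function('rho')(x)
r1, r2 = sp.diff(rho, x), sp.diff(rho, x, 2)

# The rho-dependent part of the density (12); constants do not affect the variation.
Lag = -sp.exp(-(2*sp.pi)**2*rho**2)*(1 + sp.Rational(128, 3)*sp.pi**6*rho**4*r1**2)

EL = euler_equations(Lag, rho, x)[0].lhs        # dL/drho - d/dx dL/drho'

target = rho**4*r2 + 2*rho**3*r1**2 - 4*sp.pi**2*rho**5*r1**2 + sp.Rational(3, 32)*rho/sp.pi**4
factor = sp.Rational(256, 3)*sp.pi**6*sp.exp(-(2*sp.pi)**2*rho**2)

print(sp.simplify(EL - factor*target))          # 0: the two equations agree
```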

We are interested in the large-x behavior of a solution with boundary condition \(\rho =1\) at \(x=0\) (any other constant density would do; the value is independent of L because the only interaction is of scale \(\sigma =1\)). This is also what we observe numerically, shown in Fig. 3. Making the ansatz \(\rho (x)=1+x^{\eta }\) with \(\eta >0\), these four terms scale as

$$\begin{aligned} x^{5\eta -2},\quad x^{5\eta -2},\quad x^{7\eta -2},\quad x^{\eta },\qquad \text {all }\times \,e^{-x^{2\eta }},\;x\rightarrow \infty . \end{aligned}$$
(13)

Clearly the first two terms are subleading to the third, and thus the last two terms must cancel each other. We have \(7\eta -2=\eta \) and thus \(\eta =1/3\). Then the total number of delta functions in length L is

$$\begin{aligned} K=\int _{0}^{L}dx\,\rho (x)\sim L^{4/3} \end{aligned}$$
(14)

establishing the result (2).

Fig. 3

Positions of delta functions in optimal priors \(p_\star (\theta )=\sum _{a=1}^{K}\lambda _{a}\,\delta (\theta -\theta _{a})\), for various values of L. Each horizontal line represents one prior. We observe that the second (and third...) delta functions occur at fixed proper distance from the first, justifying the fixed boundary condition on \(\rho (x)\). On the right, we show similar data for the Bernoulli model of Sect. 4 below, in terms of proper distance \(\phi \). Here \(L=\pi \sqrt{m}\) for \(m=1,2,3,\ldots ,50\)

4 Extensions

The other one-dimensional example studied in [7] was the Bernoulli problem, of determining the weighting of an unfair coin given the number of heads seen after m flips:

$$\begin{aligned} p(x\vert \theta )=\frac{m!}{x!(m-x)!}\theta ^{x}(1-\theta )^{m-x},\qquad \theta \in [0,1],\qquad x\in \{0,1,2,3,\ldots ,m\}. \end{aligned}$$
(15)

The Fisher metric here is

$$\begin{aligned} g_{\theta \theta }=\frac{m}{\theta (1-\theta )}\qquad \Rightarrow \qquad L=\int _{0}^{1}\sqrt{g_{\theta \theta }\,d\theta ^{2}}=\pi \sqrt{m} \end{aligned}$$

and we define the proper parameter \(\phi \) by

$$\begin{aligned} ds^{2}=\frac{m\,d\theta ^{2}}{\theta (1-\theta )}=d\phi ^{2}\qquad \Leftarrow \qquad \theta =\sin ^{2}\Big (\frac{\phi }{2\sqrt{m}}\Big ),\quad \phi \in \left[ 0,\pi \sqrt{m}\right] . \end{aligned}$$
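A quick numerical check of this proper length and change of variables (ours, assuming scipy):

```
import numpy as np
from scipy.integrate import quad

m = 50
length, _ = quad(lambda th: np.sqrt(m/(th*(1 - th))), 0, 1)
print(length, np.pi*np.sqrt(m))                      # both ~ 22.21

# theta = sin^2(phi/(2 sqrt(m))) maps proper distance phi in [0, pi sqrt(m)] onto [0, 1]
phi = np.linspace(0, np.pi*np.sqrt(m), 5)
print(np.round(np.sin(phi/(2*np.sqrt(m)))**2, 3))    # [0, 0.146, 0.5, 0.854, 1]
```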

The optimal prior found by maximizing the mutual information is again discrete, and when \(m\rightarrow \infty \) it also obeys the scaling law (1) with the same slope \(\zeta \). Numerical data showing this is also plotted in Fig. 1 above. This scaling relies on the behavior far from the ends of the interval, where this binomial distribution can be approximated by a Gaussian:

$$\begin{aligned} p(x\vert \theta )\approx \frac{1}{\sigma \sqrt{2\pi }}e^{-(x-m\theta )^{2}/2\sigma ^{2}},\qquad m\rightarrow \infty ,\;\theta \text { finite},\qquad \sigma ^{2}=m\,\theta (1-\theta ). \end{aligned}$$
(16)

The agreement of these very different models suggests that the \(\zeta =3/4\) power is in some sense universal, for nonsingular one-dimensional models.

Near to the ends of the interval, we observe in Fig. 3 that the first few delta functions again settle down to fixed proper distances. In this regime (16) is not a good approximation, and instead the binomial (15) approaches a Poisson distribution:

$$\begin{aligned} p(x\vert \theta )\approx \frac{e^{-\mu }\mu ^{x}}{x!},\qquad m\rightarrow \infty ,\quad \mu =m\theta \approx \frac{\phi ^{2}}{4}\text { finite}. \end{aligned}$$
(17)

The first few positions and weights are as follows:Footnote 6

$$\begin{aligned} \qquad \phi _{2}&\approx 3.13&\lambda _{2}/\lambda _{1}&\approx 0.63\qquad \nonumber \\ \phi _{3}&\approx 5.41&\lambda _{3}/\lambda _{1}&\approx 0.54\nonumber \\ \phi _{4}&\approx 7.42&\lambda _{4}/\lambda _{1}&\approx 0.49\,. \end{aligned}$$
(18)

This implies that the second delta function is at mean \(\mu \approx 2.47\), skipping the first few integers x.
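A small numerical check of the Poisson limit (17), and of the mean implied by \(\phi _{2}\approx 3.13\) from (18) (ours, assuming scipy):

```
import numpy as np
from scipy.stats import binom, poisson

m, phi2 = 10_000, 3.13
theta = np.sin(phi2/(2*np.sqrt(m)))**2
mu = m*theta
x = np.arange(12)
print(mu)                                            # ~2.45, i.e. mu ~ phi^2/4
print(np.abs(binom.pmf(x, m, theta) - poisson.pmf(x, mu)).max())   # small, and shrinking with m
```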

4.1 More Dimensions

Returning to the bulk scaling law, one obvious thing to wonder is whether this extends to more dimensions. The trivial example is to consider the same Gaussian model (5) in a D-dimensional cube:

$$\begin{aligned} p(\mathbf {x}\vert \varvec{\theta })\propto e^{-(\mathbf {x}-\varvec{\theta })^{2}/2},\quad \varvec{\theta }\in [0,L]^{D},\;\mathbf {x}\in {\mathbb {R}}^{D}. \end{aligned}$$
(19)

This simply factorizes into the same problem in each direction: (6) is the sum of D identical mutual information terms. Thus the optimal prior is simply

$$\begin{aligned} p_{\star }(\varvec{\theta })=\prod _{\mu =1}^{D}p_{\star }(\theta ^{\mu })=\sum _{a_{1}\ldots a_{D}=1}^{K}\lambda _{a_{1}}\cdots \lambda _{a_{D}}\,\delta (\theta ^{1}-\theta _{a_{1}})\cdots \delta (\theta ^{D}-\theta _{a_{D}}) \end{aligned}$$

with the same coefficients as in (3) above. The total number of delta functions is \(K_{\mathrm {tot}}=K^{D}\) which scales as

$$\begin{aligned} K_{\mathrm {tot}}\sim L^{D/\zeta }=V^{1/\zeta },\qquad \zeta =3/4,\quad V\rightarrow \infty . \end{aligned}$$
(20)

We believe that this scaling law is also generic, provided the large-volume limit is taken such that all directions expand together. If the scaling arises from repeating an experiment m times, then this will always be true as all directions grow as \(\sqrt{m}\).
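The factorization can be seen directly in a short numerical sketch (ours, with an illustrative two-atom prior in each direction): for a product prior in the D=2 Gaussian model, the mutual information is simply twice the one-dimensional value.

```
import numpy as np

pos, wts = np.array([0.0, 3.0]), np.array([0.5, 0.5])    # a 1-d two-atom prior (illustrative)
x = np.linspace(-8.0, 11.0, 2001)
dx = x[1] - x[0]

p1 = sum(w*np.exp(-(x - t)**2/2)/np.sqrt(2*np.pi) for w, t in zip(wts, pos))
I1 = -np.sum(p1*np.log(p1))*dx - (0.5 + 0.5*np.log(2*np.pi))

P2 = np.outer(p1, p1)                                    # p(x1,x2) for the product prior
I2 = -np.sum(P2*np.log(P2))*dx*dx - 2*(0.5 + 0.5*np.log(2*np.pi))

print(I1, I2)                                            # I2 ~ 2*I1: information adds over directions
```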

To check this in a less trivial example, we consider now the bivariate binomial problem studied by [21]. We have two unfair coins whose weights we wish to determine, but we flip the second coin only when the first coin comes up heads. After m throws of the first coin, the model is

$$\begin{aligned} p(x,y\vert \theta ,\phi )={m \atopwithdelims ()x}\theta ^{x}(1-\theta )^{m-x}{x \atopwithdelims ()y}\phi ^{y}(1-\phi )^{x-y}. \end{aligned}$$
(21)

with \(\theta ,\phi \in [0,1]\) and \(0\le y\le x\le m\in {\mathbb {Z}}\). The Fisher information metric here is

$$\begin{aligned} ds^{2}=\frac{m}{\theta (1-\theta )}d\theta ^{2}+\frac{m\,\theta }{\phi (1-\phi )}d\phi ^{2} \end{aligned}$$

which implies

$$\begin{aligned} V=\int _{0}^{1}d\theta \,\int _{0}^{1}d\phi \,\sqrt{\det g_{\mu \nu }}=2\pi m \end{aligned}$$

and \(p_{\mathrm {J}}(\theta ,\phi )=\frac{1}{2\pi }\left[ (1-\theta )\phi (1-\phi )\right] ^{-1/2}\). Topologically the parameter space is a triangle, since at \(\theta =0\) the \(\phi \) edge is of zero length. The other three sides are each of length \(\pi \sqrt{m}\), and so will all grow in proportion as \(m\rightarrow \infty \).
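A symbolic check of this volume (ours, using sympy):

```
import sympy as sp

theta, phi, m = sp.symbols('theta phi m', positive=True)

# sqrt(det g) for the diagonal metric above; the factors of theta cancel.
integrand = m/(sp.sqrt(1 - theta)*sp.sqrt(phi*(1 - phi)))

print(sp.integrate(integrand, (theta, 0, 1), (phi, 0, 1)))    # 2*pi*m
```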

Fig. 4

Scaling law in two dimensions. On the right we plot \(I(X;\varTheta )/\log 2\) against \(\log K_{\mathrm {tot}}\) for the bivariate binomial model (21) and the \(D=2\) Gaussian model (19). On the left we show examples of the priors, for \(m=20\) and \(L_{1}=L_{2}=13\). For the bivariate binomial the plot axes are \((\theta ,\theta \phi )\) so as to respect the topology of the parameter space, but the figure is not isometric to \(\varTheta \)

We can find the optimal priors for this numerically.Footnote 7 In Fig. 4 we see that the mutual information obeys the same law as (1) above: \(I(X;\varTheta )\sim \zeta \log K\) with \(\zeta \approx 0.75\). Since the number of distinguishable states is proportional to the Fisher volume, so that \(I(X;\varTheta )\sim \log V\), this implies (20).

Finally, suppose that instead of a square (or an equilateral triangle), a two-dimensional \(\varTheta \) has one direction much longer than the other:

$$\begin{aligned} L_{1}=a_{1}\sqrt{m},\qquad L_{2}=a_{2}\sqrt{m},\qquad a_{1}\ll a_{2}. \end{aligned}$$

Then as we increase m we will pass through three regimes, according to how many of the lengths are long enough to be in the scaling regime:

$$\begin{aligned} K_{\mathrm {tot}}\sim {\left\{ \begin{array}{ll} 1 &{} L_{1},L_{2}\lesssim 1\\ L_{2}^{1/\zeta } &{} L_{1}\lesssim 1\ll L_{2}\\ (L_{1}L_{2})^{1/\zeta } &{} 1\ll L_{1},L_{2}. \end{array}\right. } \end{aligned}$$
(22)

The last regime is the one we discussed above. When plotting K against \(\log m\) (or \(\log L\)), we expect to see a line with a series of straight segments, and an increase in slope every time another dimension becomes relevant.Footnote 8

5 Conclusion

The fact that \(\zeta <1\) is important for the qualitative behavior of the priors studied in [7]. This is what ensures that the number of delta functions \(K\sim L^{1/\zeta }\) grows faster than the Fisher length of parameter space L, so that the discreteness washes out in the asymptotic limit \(L\rightarrow \infty \). Parameters which we can measure with good accuracy are in this limit. For such a parameter, the posterior \(p(\theta \vert x)\), which is also discrete, has substantial weight on an increasing number of points, and in this sense approaches a continuous description.

In Sect. 4 we also studied some generalizations beyond what we did in [7]. Very near the ends of a long parameter direction, the discreteness does not wash out as \(L\rightarrow \infty \), and we wrote down the proper positions of the first few delta functions for the Gaussian and Poisson models. And we observed that this scaling law holds in any number of dimensions, if stated in terms of the mutual information (1). But stated in terms of the length L, it gives a slope which depends on the number of large dimensions (22), and hence has phase transitions as more parameters become relevant.

While our motivation here was finding optimal priors, our conclusions apply to a much larger class of problems, including the maximization of channel capacity over a continuous input distribution [14, 15, 24,25,26], which is formally equivalent to what we did above. This problem is where mutual information was first discussed [1], and discreteness was first seen in this context [8,9,10]. This maximization is also equivalent to a minimax optimization problem [27], and discreteness was known in other minimax problems slightly earlier [11,12,13]; see [3, 4] for other work in statistics. More recently discreteness has been employed in economics [28, 29], and is seen in various systems optimized by evolution [30,31,32]. This scaling law should apply to all of these examples, when interpolating between the coarse discreteness at small L and the continuum \(L\rightarrow \infty \).