An information scaling law: ζ = 3/4

Consider the entropy of a unit Gaussian convolved over a discrete set of K points, constrained to an interval of length L. Maximising this entropy fixes K, and we show that this number exhibits a novel scaling law K ∼ L^{1/ζ} as L → ∞, with exponent ζ = 3/4. This law was observed numerically in a recent paper about optimal effective theories; here we present an analytic derivation. We argue that this law is generic for channel capacity maximisation, or the equivalent minimax problem. We also briefly discuss the behaviour at the boundary of the interval, and higher dimensional versions.


Introduction
In a recent paper [1] we studied the construction of optimal effective theories, defined to be those which maximise the mutual information between their parameters and the expected data. We wrote this in the language of Bayesian prior choice, and the optimal prior turns out to almost always be discrete, consisting of K delta functions. As the noise 1/L in measuring a certain parameter is taken to zero, we observed that there is a scaling law for the number of delta functions:

$$ K \sim L^{1/\zeta}, \qquad \zeta = 3/4 . \tag{1} $$

The main purpose of the present paper is to derive this law analytically.

In more detail, the setup is as follows. We consider a model which, at each point θ in the full parameter space Θ, gives a probability distribution p(x|θ) for x in data space X. An effective model is a restriction of θ to lie in some subspace, and this can be expressed as a prior p(θ) with support only in this subspace. For a given prior, the mutual information of interest is

$$ I(X;\Theta) = S(X) - S(X|\Theta) \tag{2} $$

where the entropy and conditional entropy are

$$ S(X) = -\int dx\, p(x) \log p(x), \qquad p(x) = \int d\theta\, p(x|\theta)\, p(\theta), $$

$$ S(X|\Theta) = \int d\theta\, p(\theta) \Big[ -\int dx\, p(x|\theta) \log p(x|\theta) \Big] . $$
This quantity I(X; Θ) is the natural measure of how much information about the data can be encoded in the parameters, or equivalently, how much information about the parameters can be learned from the data [2]. Choosing a prior by maximising this information was considered by Bernardo [3]. The problem is mathematically identical to maximising the capacity of a communications channel, which was known to often result in a discrete distribution [4-7]. This can be avoided by considering instead I(X^m; Θ) from m independent sets of data, and taking the limit m → ∞, in which case the optimal prior (now called a reference prior) approaches Jeffreys prior [3,8], which is a smooth function. Up to normalization, Jeffreys prior is the volume form arising from the Fisher information metric [9]:

$$ p_J(\theta) \propto \sqrt{\det g_{\mu\nu}(\theta)}, \qquad g_{\mu\nu}(\theta) = \int dx\, p(x|\theta)\, \frac{\partial \log p(x|\theta)}{\partial \theta^\mu}\, \frac{\partial \log p(x|\theta)}{\partial \theta^\nu} . $$

This metric measures the distinguishability of the data which would be produced by different parameter values, in units of standard deviations. The statement that we have enough data to fix all parameters with high precision is equivalent to the statement that the manifold Θ is long in all dimensions, measured with this metric. And the limit m → ∞ amounts to having infinitely much data.

The main point of our paper [1], and of earlier work such as [10], is that in most situations of physical interest we have many very short directions, corresponding to irrelevant parameters. Thus the generic point θ is very near to a boundary of Θ, rather than being safely in the interior, and thus Jeffreys prior is a poor choice. We showed that the prior p⋆(θ) which maximises the mutual information I(X; Θ) typically puts weight only on the boundaries of Θ, and thus describes an appropriate lower-dimensional effective model. The scaling law ensures that, along a relevant parameter in this effective model, neighbouring delta functions become indistinguishable, giving an effectively continuous distribution.
Here however our focus is much narrower. We primarily consider a model in just one dimension, measuring θ ∈ [0, L] with Gaussian noise of known variance, thus

$$ p(x|\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\theta)^2 / 2\sigma^2} . \tag{3} $$

It is convenient to choose units in which σ = 1, so that θ measures proper distance: g_θθ = 1.
Thus L is the length of parameter space, in terms of the Fisher metric. Jeffreys prior is a constant, p_J(θ) = 1/L. The optimal prior is

$$ p_\star(\theta) = \sum_{a=1}^{K} \lambda_a\, \delta(\theta - \theta_a) . \tag{4} $$

With K delta functions there is a bound I(X; Θ) ≤ log K on the mutual information: the difference between certainty and complete ignorance among K = 2^n outcomes is exactly n bits of information. When finding p⋆(θ) numerically in [1], we observed that as L → ∞ the mutual information follows a line with lower slope than this bound (see figure 1):

$$ I(X;\Theta) \sim \zeta \log K, \qquad \zeta \approx 0.75 . \tag{5} $$

Since I(X; Θ) ∼ log L for large L, this implies the scaling law (1). This can also be stated in terms of the number density of delta functions: since K grows slightly faster than L, the average proper density grows without bound:

$$ \rho_0 = \frac{K}{L} \sim L^{1/\zeta - 1} = L^{1/3} \to \infty . $$

To derive this law, Section 2 treats a field theory for the local density of delta functions ρ(θ). Section 3 looks at another one-dimensional model which displays the same scaling law, and then at the generalisation to D dimensions. Appendix A looks at a paper from almost 30 years ago which could have discovered this scaling law.
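The numerical search for the optimal prior can be illustrated with a Blahut-Arimoto style iteration. The sketch below is not the procedure used in [1]: the fixed grid of candidate positions, the grid in x, the padding and the iteration count are all assumptions made for illustration, and the function name `blahut_arimoto` is hypothetical. It returns a prior on the θ grid together with the mutual information (in nats) of the discretised channel.

```python
import numpy as np

def blahut_arimoto(L, n_theta=81, step=0.05, pad=6.0, iters=2000):
    """Approximately maximise I(X;Theta) for x ~ N(theta, 1), theta in [0, L].

    The theta grid restricts where prior weight may sit (an assumption),
    so atoms of the true optimum show up as small clusters of grid points.
    """
    theta = np.linspace(0.0, L, n_theta)
    x = np.arange(-pad, L + pad + step, step)
    # log of the channel matrix P[i, j] ~ p(x_j | theta_i) dx, rows normalised
    log_P = -0.5 * (x[None, :] - theta[:, None]) ** 2
    log_P -= np.log(np.exp(log_P).sum(axis=1, keepdims=True))
    P = np.exp(log_P)

    def divergences(prior):
        # D[i] = KL( p(x|theta_i) || p(x) ) for the current prior
        marginal = prior @ P
        return np.sum(P * (log_P - np.log(marginal)[None, :]), axis=1)

    prior = np.full(n_theta, 1.0 / n_theta)
    for _ in range(iters):
        prior = prior * np.exp(divergences(prior))
        prior /= prior.sum()
    info = float(prior @ divergences(prior))  # I(X;Theta) in nats
    return theta, prior, info

if __name__ == "__main__":
    theta, prior, info = blahut_arimoto(4.0)
    print("I(X;Theta) in nats:", info)
    print("grid points carrying weight > 1e-3:", theta[prior > 1e-3])
```

Counting K from such output requires merging neighbouring grid points into single atoms, which is left out here. Algorithms that move the atom positions directly, such as those cited later in the text, avoid that step.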

Derivation
For this Gaussian model (3), the conditional entropy S(X|Θ) = 1/2 + (1/2) log 2π is independent of the prior, so it remains only to calculate the entropy S(X) in order to maximise the mutual information (2).
On an infinite line, the entropy would be maximised by a constant p(x), i.e. a prior (4) with delta functions spaced infinitesimally close together. But on a very short line, we observe that entropy is maximised by placing substantial weight at each end, with a gap before the next delta function. The idea of our calculation is that the behaviour on a long but finite line should interpolate between these two regimes. We work out first the cost of a finite density of delta functions, and then the local cost of a spatially varying density, giving us an equation of motion for the optimum ρ(x). By solving this we learn how the density increases as we move away from the boundary. The integral of this density then gives us K with the desired scaling law.
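Since the deviations from a constant p(x) will be small, we write

$$ p(x) = \frac{1}{L} \big[ 1 + w(x) \big] $$

and then expand the entropy

$$ S(X) = \log L - \frac{1}{2L} \int dx\, w(x)^2 + O(w^3) = \log L - \frac{1}{2} \sum_k |w_k|^2 + O(w^3) . \tag{6} $$

Here our convention for Fourier transforms is that

$$ w_k = \frac{1}{L} \int_0^L dx\, e^{-ikx}\, w(x), \qquad w(x) = \sum_k e^{ikx}\, w_k, \qquad k \in \frac{2\pi}{L} \mathbb{Z} . $$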

Constant spacing
Consider first the effect of a prior which is a long string of delta functions at constant spacing a, which we assume to be small compared to the standard deviation σ = 1, which in turn is much less than the length L.¹ This leads to

$$ p(x) = \frac{a}{L} \sum_n \frac{1}{\sqrt{2\pi}}\, e^{-(x - na)^2/2} . $$

Because this is a convolution of a Dirac comb with a Gaussian kernel, its Fourier transform is simply a product of such pieces. Let us write the transformation of the positions of the sources as follows:

$$ c^0(x) = a \sum_n \delta(x - na) \quad \Rightarrow \quad c^0_k = \sum_{m \in \mathbb{Z}} \delta_{k - mq}, \qquad q = \frac{2\pi}{a} . $$

The zero-frequency part of p_k is the constant term in p(x), with the rest contributing to w(x):

$$ w_k = (c^0_k - \delta_k)\, e^{-k^2/2} . $$

The lowest-frequency terms at k = ±q = ±2π/a give the leading exponential correction to the entropy:

$$ S(X) = \log L - e^{-q^2} + O(e^{-2q^2}), \qquad q = \frac{2\pi}{a} . \tag{8} $$

As advertised, any nonzero spacing a > 0 (i.e. frequency q < ∞) reduces the entropy from its maximum.
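The size of this correction can be checked numerically. The sketch below assumes that the result labelled (8) reads log L − S(X) ≈ e^{−q²} with q = 2π/a, in nats, and it places the comb on a ring of length L (periodic boundary conditions, an assumption made to remove edge effects). The function name `comb_entropy_deficit` and the grid resolution are illustrative choices.

```python
import numpy as np

def comb_entropy_deficit(a, n_atoms=20, grid_per_unit=100):
    """Return log L - S(X) for equal-weight atoms at spacing a on a ring.

    The ring has length L = n_atoms * a and the noise is a unit Gaussian.
    """
    L = n_atoms * a
    n_grid = int(round(L * grid_per_unit))
    dx = L / n_grid
    x = np.arange(n_grid) * dx
    atoms = a * np.arange(n_atoms)
    # minimum-image separation, so each Gaussian wraps around the ring once
    d = (x[:, None] - atoms[None, :] + 0.5 * L) % L - 0.5 * L
    p = np.exp(-0.5 * d**2).sum(axis=1) / (n_atoms * np.sqrt(2.0 * np.pi))
    # log L - S(X) equals the integral of p(x) log(L p(x))
    return float(np.sum(p * np.log(L * p)) * dx)

if __name__ == "__main__":
    for a in (1.5, 2.0, 2.5):
        q = 2.0 * np.pi / a
        print(a, comb_entropy_deficit(a), np.exp(-q**2))
```

The deficit is written as the integral of p log(Lp) rather than as a difference of two entropies, because for small spacing the correction is many orders of magnitude below log L and the subtraction would lose precision.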

Variable spacing
Now consider perturbing the positions of the delta functions by a slowly varying function Δ(x), and multiplying their weights by 1 + h(x). We seek a formula for the entropy in terms of Δ(x), while allowing h(x) to adjust so as to minimise the disturbance. This cannot be done perfectly, as h(x) is only sampled at spacing a, so only contributions at frequencies lower than q = 2π/a will be screened. Thus we expect what survives to appear with the same exponential factor as (8). In particular this ensures that at infinite density, no trace of Δ(x) remains. And that is necessary in order for the limit to agree with Jeffreys prior, which is a constant. Figure 2 illustrates how the positions and weights of p⋆(θ) compensate to leave p(x) almost constant in the interior, in a numerical example. Below that it shows how the functions Δ(x) and h(x) used here mimic this effect.

¹ We summarise all the scales involved in (10), see also figure 2.

[Figure 2, caption continued: Below, a diagram to show the scales involved when perturbing the positions of the delta functions in our derivation. These are arranged from longest to shortest wavelength, see also (10).]
The comb of delta functions c⁰(x) we had above is perturbed to The effect of h(x) is a convolution in frequency space: It will suffice to study Δ(x) = Δ cos(k̄x), i.e. frequencies ±k̄ only: Δ_k = ½ Δ (δ_{k−k̄} + δ_{k+k̄}). The driving frequency is k̄ ≪ q. We can expand in the amplitude Δ to write The order Δ term has contributions at k = ±k̄ ≪ q = 2π/a, which can be screened in the full c_k by setting h_k = +ikΔ_k, i.e. h(x) = −k̄Δ sin(k̄x). What survives in c_k then are contributions at k = 0, k = ±q and k = ±q ± k̄:² All but the zero-frequency term are part of w_k = (c_k − δ_k) e^{−k²/2}, and enter (6) independently, giving this: As promised, the order Δ² term comes with the same overall exponential as in (8).

² The contributions at and k = ±q ± 2k̄ will only matter at order Δ⁴ in S(X).
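The effect of the perturbation can also be checked numerically. The sketch below shifts the atoms by Δ cos(k̄x) and reweights them by 1 + h(x), with h(x) = −k̄Δ sin(k̄x) as in the text, again on a ring (periodic boundaries are an assumption). The closed form it compares against is our own second-order expansion in Δ under a Fourier-series convention. It is an assumption standing in for the displayed result, not a quotation of it. The function name and default parameters are illustrative.

```python
import numpy as np

def perturbed_comb_deficit(a=2.0, n_atoms=20, mode=2, amp=0.05, grid_per_unit=100):
    """Return (numerical, predicted) values of log L - S(X) for a perturbed comb."""
    L = n_atoms * a
    q = 2.0 * np.pi / a
    kbar = 2.0 * np.pi * mode / L                    # slow driving frequency, kbar << q
    x0 = a * np.arange(n_atoms)
    atoms = x0 + amp * np.cos(kbar * x0)             # positions shifted by Delta(x)
    weights = 1.0 - kbar * amp * np.sin(kbar * x0)   # 1 + h(x), with h = dDelta/dx
    weights /= weights.sum()

    n_grid = int(round(L * grid_per_unit))
    dx = L / n_grid
    x = np.arange(n_grid) * dx
    d = (x[:, None] - atoms[None, :] + 0.5 * L) % L - 0.5 * L
    p = (np.exp(-0.5 * d**2) @ weights) / np.sqrt(2.0 * np.pi)
    numerical = float(np.sum(p * np.log(L * p)) * dx)

    # Assumed second-order estimate: the terms at k = +-q and k = +-q +- kbar
    # each contribute |w_k|^2 to the quadratic expansion of the entropy.
    half = 0.5 * (q * amp) ** 2
    predicted = float(np.exp(-q**2) * (1.0 - half + half * np.exp(-kbar**2) * np.cosh(2.0 * q * kbar)))
    return numerical, predicted

if __name__ == "__main__":
    print(perturbed_comb_deficit())          # perturbed comb
    print(perturbed_comb_deficit(amp=0.0))   # reduces to the constant-spacing case
```

Expanding the assumed estimate for small k̄ gives e^{−q²}[1 + q⁴k̄²Δ² + …], so the order Δ² cost carries the same overall exponential as the constant-spacing result, as the text states.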

Entropy density
We can think of this entropy (9) as arising from some density: S(X) = ∫ (dx/L) 𝒮(x) = 𝒮₀. Our claim is that this density takes the form The constant term is clearly fixed by (8). To connect the kinetic term to (9), we need Multiplying these pieces, the order Δ¹ term of 𝒮(x) integrates to zero. We can write the order Δ² term in terms of Fourier coefficients (using (6), and Δ′_k = ikΔ_k), and we recover the leading terms in (9). The next term there, q⁸k̄⁶, would arise from a term ρ(x)⁶ ρ″(x)² in the density, which we neglect.⁴

The Euler-Lagrange equations from (11) read We are interested in the large-x behaviour of a solution with boundary condition ρ = 1 at x = 0. (Any constant boundary density would do; what matters is that its value is independent of L, because the only interaction is of scale σ = 1.) This is also what we observe numerically, shown in figure 3. Making the ansatz ρ(x) = 1 + x^η with η > 0, these four terms scale as Clearly the first two terms are subleading to the third, and thus the last two terms must cancel each other. We have 7η − 2 = η and thus η = 1/3. Then the total number of delta functions in length L is

$$ K = \int_0^L dx\, \rho(x) \sim L^{1+\eta} = L^{4/3} , $$

which is the scaling law (1) with ζ = 3/4.

³ In footnote 4 we confirm that this indeed holds.

⁴ The term q⁸k̄⁶ in S(X) (9) corresponds to a term ρ⁶(ρ″)² in 𝒮(x) (11). This gives a term in the equations of motion (12) going like x^{9η−4}, which goes to zero as x → ∞ with η = 1/3. Thus we are justified in dropping this.
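The last step, from the exponent η to the scaling law, is simple arithmetic and can be spelled out as follows. The sketch takes the ansatz ρ(x) = 1 + x^η at face value for all x, which is an assumption since the text only claims it at large x. The function names are illustrative.

```python
from fractions import Fraction
import math

# Balance of the two leading terms: 7*eta - 2 = eta, so eta = 2/6 = 1/3
eta = Fraction(2, 6)
assert 7 * eta - 2 == eta

# K ~ L**(1 + eta) = L**(1/zeta), hence zeta = 1 / (1 + eta)
zeta = 1 / (1 + eta)

def atom_count(L, exponent=1.0 / 3.0):
    """K(L) = integral from 0 to L of (1 + x**exponent) dx, in closed form."""
    return L + L ** (1.0 + exponent) / (1.0 + exponent)

def local_slope(L, factor=1.01):
    """Finite-difference estimate of d log K / d log L at length L."""
    return (math.log(atom_count(factor * L)) - math.log(atom_count(L))) / math.log(factor)

if __name__ == "__main__":
    print("eta =", eta, " zeta =", zeta)
    for L in (1e1, 1e3, 1e6, 1e12):
        print(L, local_slope(L))
```

The local slope approaches 4/3 only slowly, because the constant part of ρ contributes a term linear in L that is suppressed by just L^{−1/3} relative to the leading one.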

Extensions
The other one-dimensional example studied in [1] was the Bernoulli problem, of determining the weighting of an unfair coin given the number of heads seen after m flips:

$$ p(x|\theta) = \binom{m}{x}\, \theta^x (1-\theta)^{m-x}, \qquad \theta \in [0,1], \quad x \in \{0, 1, \ldots, m\} . \tag{14} $$

The Fisher metric here is

$$ g_{\theta\theta} = \frac{m}{\theta(1-\theta)} $$

and we define the proper parameter φ by

$$ \phi = 2\sqrt{m}\, \arcsin\sqrt{\theta} \in [0, \pi\sqrt{m}] . $$

The optimal prior found by maximising the mutual information is again discrete, and when m → ∞ it also obeys the scaling law (1) with the same slope ζ. Numerical data showing this is also plotted in figure 1 above. This scaling relies on the behaviour far from the ends of the interval, where this binomial distribution can be approximated by a Gaussian: The agreement of these very different models suggests that the ζ = 3/4 power is in some sense universal for nonsingular one-dimensional models.
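The Fisher metric of the binomial model, and the proper length it gives the interval, can be checked directly. The sketch below assumes the standard closed forms g_θθ = m/(θ(1−θ)) and φ = 2√m arcsin√θ, which are our reading of the definitions in this section. The function names are illustrative.

```python
import math

def binomial_fisher_information(theta, m):
    """Sum over x of p(x|theta) * (d log p / d theta)**2 for m coin flips."""
    total = 0.0
    for x in range(m + 1):
        p = math.comb(m, x) * theta**x * (1.0 - theta) ** (m - x)
        score = x / theta - (m - x) / (1.0 - theta)
        total += p * score**2
    return total

def proper_coordinate(theta, m):
    """Assumed proper parameter: phi = 2 sqrt(m) arcsin(sqrt(theta))."""
    return 2.0 * math.sqrt(m) * math.asin(math.sqrt(theta))

if __name__ == "__main__":
    m, theta = 50, 0.3
    print(binomial_fisher_information(theta, m), m / (theta * (1.0 - theta)))
    print(proper_coordinate(1.0, m), math.pi * math.sqrt(m))  # total proper length
```

Near θ = 0 the same coordinate reduces to φ ≈ 2√(mθ), so a point at proper distance φ from the end corresponds to a mean number of heads mθ ≈ φ²/4.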
Near to the ends of the interval, we observe in figure 3 that the first few delta functions again settle down to fixed proper distances. In this regime (15) is not a good approximation, and instead the binomial (14) approaches a Poisson distribution:

$$ p(x|\mu) = \frac{e^{-\mu} \mu^x}{x!}, \qquad \mu = m\theta \approx \phi^2/4 . $$

In fact the second delta function appears to be within half a percent of φ₂ = π, one of several curious numerical coincidences. The first few positions and weights are as follows:⁵ This implies that the second delta function is at mean μ = π²/4 ≈ 2.47, skipping the first few integers x.

More dimensions
Returning to the bulk scaling law, one obvious thing to wonder is whether this extends to more dimensions. The trivial example is to consider the same Gaussian model (3) in a D-dimensional cube:

$$ p(\vec{x}|\vec{\theta}) \propto e^{-(\vec{x} - \vec{\theta})^2/2}, \qquad \vec{\theta} \in [0, L]^D . $$

This simply factorises into the same problem in each direction: (2) is the sum of D identical mutual information terms. Thus the optimal prior is simply

$$ p_\star(\vec{\theta}) = \prod_{\mu=1}^{D} \sum_{a=1}^{K} \lambda_a\, \delta(\theta_\mu - \theta_a) $$

with the same coefficients as in (4) above. The total number of delta functions is K_tot = K^D, which scales as

$$ K_{\rm tot} \sim L^{D/\zeta} = V^{1/\zeta}, \qquad V = L^D . \tag{19} $$

We believe that this scaling law is also generic, provided the large-volume limit is taken such that all directions expand together. If the scaling arises from repeating an experiment m times, then this will always be true as all directions grow as √m.

To check this in a less trivial example, we consider now the bivariate binomial problem studied by [11]. We have two unfair coins whose weights we wish to determine, but we flip the second coin only when the first coin comes up heads. After m throws of the first coin, the model is

$$ p(x, y|\theta, \phi) = \binom{m}{x}\, \theta^x (1-\theta)^{m-x} \binom{x}{y}\, \phi^y (1-\phi)^{x-y} $$

with θ, φ ∈ [0, 1] and 0 ≤ y ≤ x ≤ m ∈ ℤ. The Fisher information metric here is

$$ ds^2 = \frac{m}{\theta(1-\theta)}\, d\theta^2 + \frac{m\theta}{\phi(1-\phi)}\, d\phi^2 $$

which implies

$$ p_J(\theta, \phi) = \frac{1}{2\pi \sqrt{(1-\theta)\, \phi (1-\phi)}} . $$

Topologically the parameter space is a triangle, since at θ = 0 the φ edge is of zero length. The other three sides are each of length π√m, and so will all grow in proportion as m → ∞.

⁵ For the Gaussian model (3), instead we see θ₂ ≈ e, with one more significant figure numerically. The corresponding table reads:
θ₂ ≈ 2.718, λ₂/λ₁ ≈ 0.672
θ₃ ≈ 4.889, λ₃/λ₁ ≈ 0.582
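The geometry of the two-coin model can be checked by computing its Fisher information matrix by brute force. The sketch below assumes the likelihood is a product of two binomials, one for the first coin and one for the second coin conditioned on the number of heads x. It compares the result with the diagonal form m/(θ(1−θ)), mθ/(φ(1−φ)). Both forms are our reading of this setup, and the function name is illustrative.

```python
import math

def bivariate_binomial_fisher(theta, phi, m):
    """Fisher matrix in (theta, phi) by explicit summation over all (x, y)."""
    g = [[0.0, 0.0], [0.0, 0.0]]
    for x in range(m + 1):
        px = math.comb(m, x) * theta**x * (1.0 - theta) ** (m - x)
        for y in range(x + 1):
            p = px * math.comb(x, y) * phi**y * (1.0 - phi) ** (x - y)
            # derivatives of log p with respect to theta and phi
            score = (x / theta - (m - x) / (1.0 - theta),
                     y / phi - (x - y) / (1.0 - phi))
            for i in range(2):
                for j in range(2):
                    g[i][j] += p * score[i] * score[j]
    return g

if __name__ == "__main__":
    m, theta, phi = 6, 0.3, 0.6
    print(bivariate_binomial_fisher(theta, phi, m))
    print(m / (theta * (1.0 - theta)), m * theta / (phi * (1.0 - phi)))
```

With this metric the φ edge at fixed θ has proper length π√(mθ), which vanishes at θ = 0 and equals π√m at θ = 1, consistent with the triangular picture described in the text.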
We can find the optimal priors for this numerically.⁶ In figure 4 we see that the mutual information obeys the same law as (5) above: I(X; Θ) ∼ ζ log K with ζ ≈ 0.75. Since the Fisher volume is proportional to the number of distinguishable states, I(X; Θ) ∼ log V, and this implies (19).
Finally, suppose that instead of a square (or an equilateral triangle), a two-dimensional Θ has one direction much longer than the other: Then as we increase m we will pass through three regimes, according to how many of the lengths are long enough to be in the scaling regime: The last regime is the one we discussed above. When plotting log K against log m (or log L), we expect to see a line with a series of straight segments, and an increase in slope every time another dimension becomes relevant.⁷

Conclusion
As with [1] our conclusion must reflect surprise that this law has not been noticed before. Discreteness in such optimal distributions has been known for a long time [4-7, 13, 14], and while the numerical work here and in [1] could not have been done on a laptop in 1967, the analytic derivation would have been equally simple. Maximising mutual information is the same problem as maximising channel capacity, and the papers which discovered discreteness [4-6] were working in this area.⁸ There are various algorithms for this maximisation [17-19], and those of [18,19] exploit this discreteness; our Appendix A looks at some data from one of these [18] which turns out to contain a related scaling law.

Besides the intrinsic interest of a scaling law, this served a useful purpose in [1]. While a discrete prior seems very strange at first, we are accustomed to thinking about parameters which we can measure with good accuracy. This means that they have large L. And this law is what ensures that the neighbouring delta functions then become indistinguishable, leaving an effectively continuous prior along relevant directions.
In section 3 we also studied some generalisations beyond [1]. Very near to the end of a long parameter, the discreteness does not wash out as L → ∞. We observed that the proper distance to the second delta function is e in the Gaussian case, and π in the binomial (Poisson) case. At present we cannot explain these numbers, and they may not be exact. Finally we observed that this law holds in any number of dimensions, if stated in terms of the mutual information (5). But stated in terms of the length L, it gives a slope which depends on the number of large dimensions (21).

[Figure caption fragment: Table 1 of [18].]
The standard priors for this problem are the beta distributions

$$ p_{\alpha,\beta}(\theta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)} $$

with α = β = 1 clearly a flat prior, and α = β = 1/2 giving Jeffreys prior. These have the attractive property that the posterior is of the same form, simply p(θ|x, m) = p_{α+x, β+m−x}(θ). Thus it is trivial to calculate the probability of surprise, which depends only on the behaviour of the prior close to θ = 0: In this regard the optimal prior p⋆(θ) seems to behave like α = β = 1/8. Calculating instead for a discrete prior, again it is clearly dominated by the first few delta functions near to φ₁ = 0: On the last line we substitute in the numbers from (17). The result is not convincingly close to an integer, unfortunately. And anyway it is far from clear to us what this p_surprise means physically.