The Möbius function and statistical mechanics

We consider a probabilistic model for square-free numbers and prove limit theorems for several random variables defined on our ensemble. The limit transition corresponds to the thermodynamic limit in Statistical Mechanics. We also prove some inequalities inspired by a recent conjecture of P. Sarnak concerning randomness in the Möbius sequence, and discuss a method of summation for the Riemann zeta function ζ(s) on the vertical line $\Re s = 1$.


Introduction
The purpose of this paper is to study the statistical properties of the classical Möbius function from Number Theory. It is defined for positive integers n by the formula
$$\mu(n) = \begin{cases} 0 & \text{if } n \text{ is not square-free,}\\ (-1)^k & \text{if } n \text{ is the product of } k \text{ distinct primes,}\end{cases}$$
with $\mu(1)=1$, and is closely connected with the Riemann ζ-function by the formulae
$$\sum_{n\ge 1}\frac{\mu(n)}{n^s} = \frac{1}{\zeta(s)}, \qquad \sum_{n:\,\mu(n)\ne 0}\frac{1}{n^s} = \frac{\zeta(s)}{\zeta(2s)}, \qquad \Re s > 1. \tag{1}$$
In [31], Sarnak raised a general question: to what extent can the sequence $\{\mu(n)\}_{n\in\mathbb{N}}$ be treated as a typical realization of some ergodic random process? A possible approach to this problem requires the proof of the so-called Chowla conjecture about the existence of the limits giving arbitrary finite-dimensional distributions of this process. The proof of this hypothesis is presumably very difficult.
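As a numerical illustration (ours, not part of the paper), the first identity in (1) is easy to test at $s = 2$, where $1/\zeta(2) = 6/\pi^2$: a standard sieve for μ makes the truncated Dirichlet series computable directly.

```python
# Check the first identity in (1) at s = 2: sum_{n<=N} mu(n)/n^2 -> 6/pi^2.
import math

def mobius_sieve(N):
    """Return mu[0..N], with mu[n] the Mobius function (mu[0] unused)."""
    mu = [1] * (N + 1)
    is_prime = [True] * (N + 1)
    for p in range(2, N + 1):
        if is_prime[p]:
            for k in range(2 * p, N + 1, p):
                is_prime[k] = False
            for k in range(p, N + 1, p):
                mu[k] *= -1           # one factor of -1 per distinct prime
            for k in range(p * p, N + 1, p * p):
                mu[k] = 0             # not square-free
    return mu

N = 10**5
mu = mobius_sieve(N)
partial = sum(mu[n] / n**2 for n in range(1, N + 1))
assert abs(partial - 6 / math.pi**2) < 1e-4
```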
Below we study the randomness of μ using some ideas from Statistical Mechanics. Fix m and introduce the ensemble $\Omega_m$ consisting of integers of the form $p_1^{\nu_1} p_2^{\nu_2}\cdots p_m^{\nu_m}$, where $2 = p_1 < p_2 < \cdots < p_m$ are the first m prime numbers and $\nu_j \in \{0, 1\}$. It is clear that $\mu(n) \ne 0$ for every $n \in \Omega_m$. Moreover, every square-free number $n \le p_m$ belongs to $\Omega_m$. As n grows above $p_m$, the set of $n \in \Omega_m$ becomes much smaller than the set of all n for which $\mu(n) \ne 0$. The cardinality of $\Omega_m$ is clearly $2^m$ and its largest element is $p_1 p_2 \cdots p_m$. Notice that $\Omega_m \subseteq \Omega_{m+1}$ and the union $\bigcup_{m\ge 1}\Omega_m$ is the set of all square-free numbers.
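The ensemble is concrete enough to enumerate directly; the following sketch (our own illustration, with `ensemble(m)` standing for the ensemble built from the first m primes) verifies the cardinality, square-freeness, and largest-element claims for small m.

```python
# Enumerate {p_1^{v_1} ... p_m^{v_m} : v_j in {0,1}} and check the basic facts.
from itertools import product

def first_primes(m):
    primes, n = [], 2
    while len(primes) < m:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes

def ensemble(m):
    ps = first_primes(m)
    Omega = []
    for nus in product((0, 1), repeat=m):
        n = 1
        for p, nu in zip(ps, nus):
            n *= p**nu
        Omega.append(n)
    return Omega

def is_squarefree(n):
    d = 2
    while d * d <= n:
        if n % (d * d) == 0:
            return False
        d += 1
    return True

m = 4
Omega = ensemble(m)
assert len(Omega) == 2**m                  # cardinality 2^m
assert all(is_squarefree(n) for n in Omega)
assert max(Omega) == 2 * 3 * 5 * 7         # largest element p_1 ... p_m
# every square-free n <= p_m belongs to the ensemble
assert {n for n in range(1, 8) if is_squarefree(n)} <= set(Omega)
```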
Introduce the probability distribution $\pi_m$ on $\Omega_m$ for which
$$\pi_m(n) = \frac{1}{Z_m}\cdot\frac{1}{n},$$
where $Z_m$ is the normalizing factor such that $\sum_{n\in\Omega_m}\pi_m(n) = 1$, i.e.
$$Z_m = \sum_{n\in\Omega_m}\frac{1}{n} = \prod_{j=1}^m\left(1 + \frac{1}{p_j}\right).$$
By analogy with Statistical Mechanics, the probability distribution $\pi_m$ can be called a microcanonical distribution and $Z_m$ a partition function. It will be shown that $Z_m$ plays an important role in our analysis. The asymptotic expression for $Z_m$ follows from the classical Mertens product formula [28]
$$\prod_{p\le N}\left(1 - \frac{1}{p}\right)^{-1} = e^{\gamma}\ln N\,(1 + o(1)).$$
Thus, as $m \to \infty$,
$$Z_m \sim \frac{e^{\gamma}}{\zeta(2)}\ln p_m.$$
As in Statistical Mechanics, $Z_m \to \infty$ as $m \to \infty$, and this opens some possibilities for a thermodynamic limit transition.
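The Mertens-type asymptotics for $Z_m$ can be observed numerically; the sketch below (our illustration) compares the exact product with $\frac{e^{\gamma}}{\zeta(2)}\ln p_m$ for a moderate m.

```python
# Compare Z_m = prod_{j<=m} (1 + 1/p_j) with (e^gamma / zeta(2)) * ln p_m.
import math

def first_primes(m):
    primes, n = [], 2
    while len(primes) < m:
        isp = True
        for p in primes:
            if p * p > n:
                break
            if n % p == 0:
                isp = False
                break
        if isp:
            primes.append(n)
        n += 1
    return primes

gamma = 0.57721566490153286   # Euler-Mascheroni constant
m = 1000
ps = first_primes(m)
Z = 1.0
for p in ps:
    Z *= 1 + 1 / p
asymptotic = math.exp(gamma) / (math.pi**2 / 6) * math.log(ps[-1])
ratio = Z / asymptotic        # should be close to 1 already for m = 1000
assert abs(ratio - 1) < 0.05
```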
In the ensemble $(\Omega_m, \pi_m)$ the random variables $\nu_j$ are mutually independent and
$$\pi_m(\nu_j = 1) = \frac{1}{1 + p_j}, \qquad \pi_m(\nu_j = 0) = \frac{p_j}{1 + p_j}.$$
The proof of this fact follows easily from the representation
$$\pi_m(n) = \prod_{j=1}^m\left(\frac{1}{1+p_j}\right)^{\nu_j}\left(\frac{p_j}{1+p_j}\right)^{1-\nu_j}.$$
In Sect. 2, it will be shown that $\sum_{j=1}^m \nu_j(n)\,\frac{\ln p_j}{\ln p_m}$ does not have a limit in any sense, but it has a limiting distribution w.r.t. $\pi_m$ as $m \to \infty$. Therefore, the classical Shannon-McMillan theorem from Information Theory is not valid here. This means that our case is different from the usual situation in Statistical Mechanics.

Introduce the random variables
$$\zeta_m(n) = \frac{\ln n}{\ln p_m} = \sum_{j=1}^m \nu_j(n)\,\frac{\ln p_j}{\ln p_m}.$$
In Sect. 2 we prove the following

Theorem 1 As $m \to \infty$, the distributions of the random variables $\{\zeta_m\}_{m\in\mathbb{N}}$ converge weakly to a limiting probability distribution whose characteristic function ϕ(λ) has the form
$$\varphi(\lambda) = \exp\left(\int_0^1 \frac{e^{i\lambda v} - 1}{v}\,dv\right). \tag{4}$$
Moreover, the limiting probability distribution is infinitely divisible.
The corresponding probability distribution is called the Dickman-De Bruijn distribution (see [9,6]). It appears naturally in Probability Theory in the following way. Let $\{\eta_j\}$ be a sequence of independent random variables such that
$$\mathbb{P}(\eta_j = j) = \frac{1}{j}, \qquad \mathbb{P}(\eta_j = 0) = 1 - \frac{1}{j},$$
and let $\zeta_m = \sum_{j=1}^m \eta_j$. Then the limiting distribution of $\frac{1}{m}\zeta_m$ as $m \to \infty$ is the Dickman-De Bruijn distribution.
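This probabilistic construction is easy to simulate (our illustration; the distributional form of $\eta_j$ is as stated above). Note that $\mathbb{E}\,\frac{1}{m}\zeta_m = 1$ exactly for every m, and in the limit the density equals $e^{-\gamma}$ on $(0,1]$, so roughly $e^{-\gamma} \approx 0.56$ of the mass sits below 1.

```python
# Simulate eta_j = j with probability 1/j (else 0) and look at (1/m) sum eta_j.
import random

random.seed(1)
m, trials = 200, 20000
samples = []
for _ in range(trials):
    s = 0
    for j in range(1, m + 1):
        if random.random() < 1 / j:
            s += j
    samples.append(s / m)

mean = sum(samples) / trials
assert abs(mean - 1) < 0.05            # the mean is exactly 1 for every m
frac_below_1 = sum(x <= 1 for x in samples) / trials
assert 0.4 < frac_below_1 < 0.7        # limit value is e^{-gamma} ~ 0.56
```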
Let us recall that a random variable ξ is called infinitely divisible if for every N one can find N independent, identically distributed random variables $\xi_1, \ldots, \xi_N$ such that $\xi = \xi_1 + \cdots + \xi_N$ in distribution. A complete characterization of such probability distributions in terms of the logarithm of their characteristic functions was provided by A.N. Kolmogorov, P. Lévy, and A.Ya. Khinchin; see, e.g., [14].
The density d(t) of the distribution given by (4) is
$$d(t) = e^{-\gamma}\rho(t), \tag{6}$$
where ρ is the Dickman function discussed below. The above result should be compared with the well-known asymptotic density of square-free integers. In general, consider an interval $\Delta \subset \mathbb{R}_{\ge 0}$ and denote by $M(\Delta)$ the number of integers $n \in \Delta$ for which $\mu(n) \ne 0$. The usual inclusion-exclusion principle from Probability Theory gives, for $\Delta = [c_1 L, c_2 L]$,
$$M(\Delta) = \frac{|\Delta|}{\zeta(2)}\,(1 + R(L)), \tag{7}$$
where $R(L) \to 0$ as $L \to \infty$. It is easy to show that $R(L) = O(L^{-1/2})$ and, assuming the Riemann Hypothesis, the best known result is $R(L) = O_\varepsilon(L^{-37/54+\varepsilon})$. See [29] for a survey on this topic. Comparing (6) and (7), one can see the difference between our ensemble $\Omega_m$, where the probability of n is proportional to $\frac{1}{n}$, and the ensemble $\{n \le x : \mu(n) \ne 0\}$, where the probability distribution is uniform. Another difference between the two ensembles follows. If we use our Theorem 1 to estimate $M(\Delta)$, we obtain only two-sided inequalities, and these inequalities become less precise (and depend on $c_1, c_2$) as m grows. Nevertheless, there is a connection between the two ensembles. While $\Omega_m$ is very sparse and its largest element is of order $m^m$, the initial segment $\Omega_m \cap \{1, \ldots, p_m\}$ actually coincides with the set $\{n \le p_m : \mu(n) \ne 0\}$. This fact was used in the analysis of the error term in Theorem 1 for shrinking intervals in [4].
It is worth mentioning that there are several examples in Number Theory where the limiting densities are constant on some intervals. This is the case, for instance, for the gap distribution in the sequence { √ n mod 1} n∈N studied by Elkies and McMullen [10], and for the distribution of lattice points visible from the origin discovered by Boca et al. [1]. Presumably, this is a general property of a class of infinitely divisible distributions that appear in problems related to Number Theory.
It is well known from Number Theory that the Prime Number Theorem is equivalent to the statement that
$$\sum_{n \le N}\mu(n) = o(N), \qquad N \to \infty,$$
i.e. for any $\varepsilon > 0$ we have $\bigl|\sum_{n\le N}\mu(n)\bigr| \le \varepsilon N$ for all sufficiently large N; see [23].
In [31], Sarnak discusses a classical heuristic argument according to which
$$\sum_{n \le N}\mu(n)\,\xi(n) = o(N) \tag{8}$$
for any 'reasonable' bounded sequence ξ(n) defined independently of μ(n). If (8) holds, μ is said to be orthogonal to ξ. The sequences ξ(n) considered are generated by dynamical systems $S = (X, T)$, where X is a compact topological space, $T : X \to X$ is continuous, and $\xi(n) = f(T^n x)$ for $x \in X$ and $f \in C(X)$. A system is called deterministic if its topological entropy is zero. The sequence μ(n) is said to be disjoint from a system S if μ(n) is orthogonal to all sequences ξ(n) generated by S as above. Sarnak's conjecture then reads: μ is disjoint from any deterministic system. Known cases of this conjecture include: (i) the trivial case $|X| = 1$, which corresponds to the Prime Number Theorem; (iii) the case of circle rotations, which follows from an estimate of Davenport [5]; (iv) extensions of the latter to Kronecker flows and zero-entropy affine automorphisms (see, e.g., the work by Liu and Sarnak [27]); (v) the case of $S = (\Gamma\backslash G, T_\alpha)$, where G is a nilpotent Lie group, Γ a lattice in G, and $T_\alpha(\Gamma x) = \Gamma x\alpha$, addressed by Green and Tao [19].
The conjecture is open for Ratner sequences, i.e. sequences generated as above when G is a semisimple group and α is unipotent. The particular case of S = (SL(2, Z)\SL(2, R), T α ) is related to the recent work of Sarnak and Ubis [32].
The above conjecture is related to the so-called Chowla conjecture concerning the self-correlations of the Möbius function, which can be phrased as follows: for every $d \in \mathbb{N}$, $0 \le a_1 < a_2 < \cdots < a_d$ and $k_1, k_2, \ldots, k_d \in \{1, 2\}$ not all even, we have
$$\frac{1}{N}\sum_{n \le N}\mu^{k_1}(n + a_1)\,\mu^{k_2}(n + a_2)\cdots\mu^{k_d}(n + a_d) \to 0$$
as $N \to \infty$. One can show, see [31], that the Chowla conjecture implies the above conjecture. Let us also mention the very recent results by J. Bourgain, showing that Walsh functions are orthogonal to μ(n) [2], and by B. Green, who proved that the Möbius sequence is orthogonal to the so-called bounded-depth circuit functions belonging to the class $AC^0(d)$ [18]. Another way to study the randomness of μ(n) is to study the spectral properties of Schrödinger equations on the positive integers, where μ(n) plays the role of a "random" potential. It was shown recently by J. Bourgain that almost every eigenfunction of the Schrödinger operator $\Delta + \mu$ on $\ell^2(\mathbb{Z}_{\ge 0})$ has positive Lyapunov exponent [3], in accordance with the theory of (truly) random Schrödinger operators. It is still not known whether Anderson localization holds for the potential μ(n).
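The simplest instance of the Chowla conjecture ($d = 2$, $a_1 = 0$, $a_2 = 1$, $k_1 = k_2 = 1$) can be probed empirically (our illustration): the correlation $\frac{1}{N}\sum \mu(n)\mu(n+1)$ shows strong cancellation, while the excluded "all even" pattern $\frac{1}{N}\sum\mu(n)^2$ tends to $6/\pi^2$ with no cancellation at all.

```python
# Empirical contrast: mu(n)mu(n+1) cancels, mu(n)^2 does not.
import math

def mobius_sieve(N):
    mu = [1] * (N + 1)
    is_prime = [True] * (N + 1)
    for p in range(2, N + 1):
        if is_prime[p]:
            for k in range(2 * p, N + 1, p):
                is_prime[k] = False
            for k in range(p, N + 1, p):
                mu[k] *= -1
            for k in range(p * p, N + 1, p * p):
                mu[k] = 0
    return mu

N = 10**5
mu = mobius_sieve(N + 1)
corr = sum(mu[n] * mu[n + 1] for n in range(1, N + 1)) / N
dens = sum(mu[n] ** 2 for n in range(1, N + 1)) / N
assert abs(corr) < 0.05               # cancellation, as Chowla predicts
assert abs(dens - 6 / math.pi**2) < 1e-2
```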
In Sect. 3 we prove the following result. Let $f(t) = \sum_{1\le r\le r_0} c_r t^r$ be an arbitrary polynomial with integer coefficients and let $\alpha = \frac{k}{l}$ be a rational number with $(k, l) = 1$.

Theorem 2 There exists $m^* = m^*(l)$ such that for every $m \ge m^*$
$$\sum_{n \in \Omega_m} e(\alpha f(n))\,\mu(n) = 0, \tag{9}$$
where $e(t) = e^{2\pi i t}$.
If $f(t) = t$, then Theorem 2 gives an analog of Davenport's result [5] in the case of our ensemble $\Omega_m$ and for rational α. Similarly, if $f(t) = t^2$ we get an analog of a result by Green and Tao for nilflows [19]. The basic difference is that instead of the intervals $[1, N]$ we use the sets $\Omega_m$, and (9) is much sharper: the sum vanishes identically.
Estimates of the sum in (9) for real α satisfying certain Diophantine conditions are also given in Sect. 3.
In Sect. 4 we prove a theorem which can be considered as related to the Möbius Randomness Law [23,31] and which sheds some light on the cancellations caused by the factor $(-1)^k$ in the definition of the Möbius function.

Theorem 3 Let f be a 1-periodic function of the form $f(t) = \sum_{k\in\mathbb{Z}} f_k\, e(kt)$ such that $\sum_{k\in\mathbb{Z}} |f_k| < \infty$.
Then
$$\sum_{n\in\Omega_m} f\!\left(\frac{\ln n}{\ln p_m}\right)\mu(n) = 0 \quad\text{for every } m. \tag{10}$$
The proof of Theorem 3 exploits the special structure of our ensemble $\Omega_m$ when sampled by functions of the type $f(\frac{\ln\,\cdot}{\ln p_m})$, where f is 1-periodic. The study of the sum (10) for functions of larger period requires further analysis. For $\frac{k}{r} \in \mathbb{Q}$ let
$$I_m\!\left(\frac{k}{r}\right) = \sum_{n\in\Omega_m} e\!\left(\frac{k}{r}\cdot\frac{\ln n}{\ln p_m}\right)\mu(n).$$
The following theorem deals with the case when the period r tends to infinity.
Theorem 4 allows one to construct a class of functions for which (11) holds. Fix three non-decreasing sequences $r(m)$, $K(m)$ and $G(m)$ satisfying suitable growth conditions. While the sequence $r(m)$ controls the period of the functions, the estimates involving $K(m)$ and $G(m)$ put some constraints on the decay of the Fourier coefficients $f_k^{(m)}$. We have the following Corollary of Theorem 4. In other words, if $G(m)$ does not grow too fast as $m \to \infty$, Corollary 5 provides a class of functions that are not disjoint (in the sense discussed above) from μ(n) when sampled along the rescaled ensemble $\{\frac{\ln n}{\ln p_m} : n \in \Omega_m\}$. In particular, if we want to approximate the indicator function of an interval (whose Fourier coefficients decay as $\frac{1}{k}$), we can choose $K(m) = m^\alpha$, $0 < \alpha < 1$, $r(m) = m$ and $G(m) = c\ln m$.

Remark 1 In this case the choice $r(m) = m$ is motivated by the fact that the largest element of our rescaled ensemble $\{\frac{\ln n}{\ln p_m} : n \in \Omega_m\}$ is $\frac{\ln(p_1 p_2\cdots p_m)}{\ln p_m} \sim m$ as $m\to\infty$. The idea is to construct a sequence of functions $\{f_m\}_{m\in\mathbb{N}}$ that approximate a function supported on $[0, 1]$. Since this interval corresponds to the classical ensemble $\{n \le p_m : \mu(n) \ne 0\}$, presumably one could be able to get some estimates for the classical problem (8).
There are natural generalizations of the Möbius function. Fix $r \ge 1$ and consider numbers $n = p_1^{\nu_1}\cdots p_m^{\nu_m}$, $0 \le \nu_j \le r$; the set of such numbers is denoted by $\Omega_m^{(r)}$. The inclusion-exclusion principle (see above) gives the following: if Δ is an arbitrary interval and $M^{(r)}(\Delta) = \#\{n \in \Delta : n \text{ is } (r+1)\text{-free}\}$, then
$$M^{(r)}(\Delta) = \frac{|\Delta|}{\zeta(r+1)}\,(1 + o(1)).$$
Assuming the Riemann Hypothesis one can get better estimates, but they look quite complicated; see again [29] and the references therein. The sum $\sum_{j=1}^m \nu_j$ plays an important role in our analysis. For example, $\mu(n) = (-1)^{\sum_{j=1}^m \nu_j}$ for $n \in \Omega_m$. Since the terms of the sum are independent (but not identically distributed) random variables, we can use some standard tools from Probability Theory to deduce the asymptotic statistical behavior of the sum. In Sect. 5 we prove the following two limit theorems, where $N(0, 1)$ denotes the standard Gaussian distribution with mean 0 and variance 1.

Theorem 6 As $m \to \infty$,
$$\frac{\sum_{j=1}^m \nu_j - A_m}{\sqrt{D_m}} \Rightarrow N(0, 1),$$
where $A_m$ and $D_m$ are the mean and the variance of $\sum_{j=1}^m \nu_j$ with respect to $\pi_m$.
Theorem 7 For $0 < a < b$, the random variable $\eta_m^{(a,b)} = \sum_{j:\, m^a < j \le m^b} \nu_j$ converges weakly as $m \to \infty$ to a random variable having Poisson distribution with parameter $\ln\frac{b}{a}$.
Section 5 also discusses some consequences of the theorems above, indicating the asymptotic probability that an element of our ensemble $\Omega_m$ has no prime factors larger than certain functions of m, such as $p_{\sqrt m}$.

Section 6 discusses some connections between the Möbius function and the Riemann ζ-function, via the first formula in (1). While in the region $\Re s > 1$ all the sums are absolutely convergent, it is a priori not clear what happens when $0 < \Re s \le 1$. We introduce a specific method of summation for the series $\sum_n \mu(n)\,n^{-s}$ on the line $\Re s = 1$.

We conclude the discussion of Theorem 1 with its proof of infinite divisibility. The analysis of the error terms is left to the interested reader and can be found in [4] and the appendix therein. Using the counting function N(t) as before and the change of variables $v = \frac{\ln t}{\ln p_m}$, for which $\frac{dt}{t} = \ln p_m\,dv$, the logarithm of the characteristic function of $\zeta_m$ takes the form $\int \frac{e^{i\lambda v} - 1}{v}\,dv$ plus error terms and, as $m \to \infty$, it converges to $\int_0^1 \frac{e^{i\lambda v} - 1}{v}\,dv$. Let us now prove that the distribution with characteristic function given by (4) is infinitely divisible. Kolmogorov proved [24] that a probability distribution $P_\xi$ over $\mathbb{R}$ with finite variance is infinitely divisible if and only if its characteristic function ϕ(λ) has the form
$$\ln\varphi(\lambda) = i\kappa\lambda + \int_{\mathbb{R}}\frac{e^{i\lambda v} - 1 - i\lambda v}{v^2}\,dK(v), \tag{17}$$
where κ is a constant and $v \mapsto K(v)$ is a non-decreasing function of bounded variation satisfying $\lim_{v\to-\infty} K(v) = 0$. It is easy to check that $\kappa = \int_{\mathbb{R}} x\,dP_\xi(x) = \mathbb{E}\xi$ and $\lim_{v\to\infty} K(v) = \mathbb{E}(\xi - \mathbb{E}\xi)^2$. In our case $\kappa = 1$ and $\lim_{v\to\infty} K(v) = \frac12$ (see Sect. 2.1), and choosing
$$K(v) = \begin{cases} 0 & v < 0,\\ \frac{v^2}{2} & 0 \le v \le 1,\\ \frac12 & v > 1,\end{cases}$$
(17) gives (4). Therefore the limiting distribution is infinitely divisible and Theorem 1 is proven.

On the Dickman-De Bruijn distribution
It is known (see [6]) that ϕ(λ) is the characteristic function of the Dickman-De Bruijn distribution, with density $e^{-\gamma}\rho(t)$, where ρ(t) is determined by the initial condition
$$\rho(t) = 1, \qquad 0 \le t \le 1,$$
and the integral equation
$$t\,\rho(t) = \int_{t-1}^{t}\rho(s)\,ds, \qquad t > 1.$$
The density $e^{-\gamma}\rho(t)$ is plotted in Fig. 1. It also satisfies the delay differential equation
$$t\,\rho'(t) + \rho(t-1) = 0, \qquad t > 1.$$
Among other properties of ρ(t) one can mention that it is log-concave on $[1,\infty)$ and $\rho(t) = t^{-t(1+o(1))}$ as $t \to \infty$. In other words, the limiting density $e^{-\gamma}\rho(t)$ is constant on the interval (0, 1], where it takes the value $e^{-\gamma}$, and then decays faster than exponentially on $(1, \infty)$, like a Poisson distribution. In particular, all its moments are finite. Let us remark that the explicit formula for ϕ(λ) allows us to compute the k-th moment $m_k$ of the Dickman-De Bruijn distribution: since $\ln\varphi(\lambda) = \sum_{k\ge 1}\frac{(i\lambda)^k}{k!\,k}$, all the cumulants of the distribution equal $\frac{1}{k}$, and the moments follow from the cumulants in the standard way. Let $\Psi(x, y)$ denote the number of integers $\le x$ whose prime factors are $\le y$. Dickman [9] showed that $\Psi(x, x^{1/u}) \sim x\rho(u)$ as $x \to \infty$. The range of y for which the asymptotic formula $\Psi(x, y) \sim x\rho(u)$, where $x = y^u$, holds has been significantly enlarged by De Bruijn [6-8] ($y \ge \exp((\ln x)^{5/8+\varepsilon})$) and Hildebrand [22] ($y \ge \exp((\ln\ln x)^{5/3+\varepsilon})$). Notice that in our ensemble $\Omega_m$ (where each element is weighted, not simply counted) we have $x = p_1 p_2\cdots p_m$ and $y = p_m$, and thus $y \sim \ln x$. In this regime Erdös [11] showed that
$$\ln\Psi(x, \ln x) \sim \frac{\ln 4\,\ln x}{\ln\ln x}$$
as $x \to \infty$, and therefore the asymptotics is no longer given by the function ρ. In other words, some kind of phase transition occurs in the asymptotic behavior of $\Psi(x, y)$. For a survey on the theoretical and computational aspects of smooth numbers see [16].
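The delay differential equation pins ρ down numerically; on (1, 2] it integrates in closed form to $\rho(t) = 1 - \ln t$, which makes a convenient correctness check for a simple Euler scheme (our illustration, not part of the paper).

```python
# Tabulate the Dickman function rho on [0,3] by forward Euler applied to
# t * rho'(t) = -rho(t-1), with rho = 1 on [0,1]; check rho(2) = 1 - ln 2.
import math

h = 1e-4
steps = int(round(1 / h))                  # grid points per unit interval
rho = [1.0] * (steps + 1)                  # rho(k*h) = 1 for k*h <= 1
for k in range(steps, 3 * steps):
    t = k * h
    rho.append(rho[-1] - h * rho[k - steps] / t)

rho2 = rho[2 * steps]
assert abs(rho2 - (1 - math.log(2))) < 1e-3
# rho decreases rapidly: rho(3) = 0.0486... already
assert rho[3 * steps] < 0.06
```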
Another remarkable occurrence of the Dickman-De Bruijn density comes from the limit of convolution powers; see [21] for the discussion and a generalization of this result. Goncharov [15] discovered the distribution $e^{-\gamma}\rho(t)$ (although expressed in a somewhat cumbersome form) in 1944 when studying the distribution of the maximal cycle length in random permutations. His work, independent of Dickman's, was later popularized by Vershik and Schmidt [35,36]. The Dickman-De Bruijn distribution also appears when studying the marginals of the so-called Poisson-Dirichlet distribution; see, e.g., [33,34] and the references therein.

Proof of Theorem 2 We consider a polynomial $f \in \mathbb{Z}[t]$, $f(t) = \sum_{1\le r\le r_0} c_r t^r$, and $\alpha = \frac{k}{l}$ with $(k, l) = 1$.
Let $\tilde N_m$ denote the set of primes among $p_1, \ldots, p_m$ that are congruent to 1 modulo l. Every $p \in \tilde N_m$ is of the form $p = 1 + s\,l$ for some integer $s = s(p)$, and hence $p^q \equiv 1 \pmod l$ for every $q \ge 0$. Since f has integer coefficients, this means that $f(pn) \equiv f(n) \pmod l$, so that $e(\alpha f(pn)) = e(\alpha f(n))$, while $\mu(pn) = -\mu(n)$ for $n \in \Omega_m$ not divisible by p. Pairing n with pn, the contributions to the sum in (9) cancel, and the sum vanishes. Now, by Dirichlet's theorem on primes in arithmetic progressions, we can find $m^* = m^*(l)$ large enough so that $\tilde N_{m^*} \ne \emptyset$ (and the same of course happens for every $m \ge m^*$). We thus have (9) for all $m \ge m^*$, and Theorem 2 is proven.
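The vanishing in Theorem 2 is exact and involves only finitely many terms, so it can be checked directly; below we take $f(t) = t$, $\alpha = \frac{1}{5}$, and note that $p_5 = 11 \equiv 1 \pmod 5$ is the pairing prime (our illustration).

```python
# Exact vanishing of sum mu(n) e(k*n/l) over the ensemble once a prime
# p = 1 (mod l) enters it (here l = 5, p_5 = 11).
import cmath, math
from itertools import product

def theorem2_sum(primes, k, l):
    # sum of mu(n) * e(k*f(n)/l) with f(t) = t, over the ensemble of `primes`
    total = 0.0
    for nus in product((0, 1), repeat=len(primes)):
        n = 1
        for p, nu in zip(primes, nus):
            n *= p**nu
        mu = (-1) ** sum(nus)
        total += mu * cmath.exp(2j * math.pi * ((k * n) % l) / l)
    return total

S5 = theorem2_sum([2, 3, 5, 7, 11], 1, 5)   # 11 = 1 (mod 5): sum vanishes
S4 = theorem2_sum([2, 3, 5, 7], 1, 5)       # no such prime yet: no vanishing
assert abs(S5) < 1e-9
assert abs(S4) > 0.5
```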

Remark 2 Theorem 2 can be extended to polynomials with rational coefficients $f \in \mathbb{Q}[t]$.
Indeed, if $f(t) = \sum_{1\le q\le q_0} c_q t^q$ with $c_q = \frac{a_q}{b_q}$, $(a_q, b_q) = 1$, then we can consider $M = \operatorname{lcm}\{b_q : 1 \le q \le q_0\}$ and introduce the polynomial $\tilde f = M\cdot f$ and $\tilde\alpha = \frac{\alpha}{M}$. In view of this fact, it is important to understand how big m has to be as the denominator of α gets larger. The problem of finding the smallest $m^*(l)$ as in the proof of Theorem 2 can be rephrased as follows: given l, find the least prime $p(l)$ in the arithmetic progression $\{1 + l, 1 + 2l, 1 + 3l, \ldots\}$. The problem of estimating how big $p(l)$ can be has a long history and very important contributions have been made. In our notation $p(l) = p_{m^*(l)}$.
As far as a lower bound is concerned, the Prime Number Theorem implies that $p(l) \ge (1 + o(1))\,\varphi(l)\ln l$ as $l \to \infty$, where φ denotes Euler's totient function. The factor $(1 + o(1))$ has been improved to $(e^\gamma + o(1))$ by Pomerance [30]. It is also known that if l has at most $\exp(\frac{\ln\ln l}{\ln\ln\ln l})$ prime factors (as happens for 'almost all' l), then one has the bound
$$p(l) \ge (e^\gamma + o(1))\,\varphi(l)\,\frac{\ln l\,(\ln\ln l)(\ln\ln\ln\ln l)}{(\ln\ln\ln l)^2}.$$
It was conjectured by Granville and Pomerance [17] that $p(l) \asymp \varphi(l)\ln^2 l$. Concerning upper bounds, Linnik [25,26] proved that
$$p(l) \ll l^{L}. \tag{20}$$
The constant L (called Linnik's constant) has been estimated by several authors by exploiting the connection with zero-free regions for Dirichlet L-functions. The current record is due to Xylouris [37], which gives $L \le 5.2$. It is conjectured by Heath-Brown [20] that $p(l) \ll l^2$. Assuming the Generalized Riemann Hypothesis, one can show that $p(l) \ll \varphi(l)^2\ln^2 l$. More generally, the same pairing argument as in the proof of Theorem 2 shows that
$$\sum_{n} e(\alpha f(n))\,\mu(n) = 0 \tag{21}$$
over the ensemble generated by the primes $p_{m_1} \le p \le p_{m_2}$, whenever there exists a prime $p = p(l, m_1, m_2)$ such that $p_{m_1} \le p \le p_{m_2}$ and $p \equiv 1 \pmod l$.
Let us restrict our analysis to the case $f(t) = t^d$. In order to study the sum in (9) for $\alpha \in \mathbb{R}$, we need to assume certain Diophantine properties of α. We have the following

Proposition 8 Suppose
$$\left|\alpha - \frac{k}{l}\right| \le \frac{1}{l^\tau} \tag{22}$$
for some $\tau \ge 1$, and assume that there exists $j \le m$ such that $p_j \equiv 1 \pmod l$. Then the sum in (9) satisfies the estimate (23).

Proof of Proposition 8 We write the sum in (9) as the sum of three terms (24), where, by assumption (22), the error of replacing α by $\frac{k}{l}$ is controlled. The first sum in (24) is zero by our second hypothesis. The second sum in (24) is estimated as in (25), while the third sum in (24) gives (26). Combining (25) and (26) we get the desired estimate.
We can now use Proposition 8 in order to obtain an analog of (8) for the ensemble $\Omega_m$ and $\xi(n) = e(\alpha n^d)$ when $\alpha \in \mathbb{R}$. We have the following Corollary 9, in which L is Linnik's constant (see Remark 2) and the implied constant is the same as in (20).
The assumptions of the above Corollary put a rather stringent condition on the type of $\alpha \in \mathbb{R}$ that can be considered. The requirement is that α can be super-exponentially well approximated by a sequence of rational numbers whose denominators do not grow too fast. The explicit dependence on Linnik's constant L is remarkable: any improvement of the known estimate ($L \le 5.2$, Xylouris [37]) towards the value $L = 2$ conjectured by Heath-Brown [20] would enlarge the set of α covered by the Corollary.

Proof of Corollary 9
The assumption (27) implies that for every $\varepsilon' \le \varepsilon$ the required estimate holds, since for $j = m$ we get a zero factor in the product in (30).
Another way to see that $I_m(k) = 0$ for integer k is the following: splitting $\Omega_m$ according to the value of $\nu_m$, the ensemble $\Omega_m$ can be seen as two copies of the ensemble $\Omega_{m-1}$ (corresponding to the values $\nu_m = 0$ and $\nu_m = 1$), and since $e(k\,\frac{\ln(np_m)}{\ln p_m}) = e(k\,\frac{\ln n}{\ln p_m})$ while $\mu(np_m) = -\mu(n)$, these two copies contribute to the sum $I_m(k)$ in opposite ways.
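This cancellation is finite and exact, so it can be verified directly (our illustration, with $I_m$ as reconstructed above); for a fractional argument the last factor no longer vanishes and the sum is visibly nonzero.

```python
# I_m(k) = sum_n mu(n) e(k * ln n / ln p_m) over the ensemble of the first
# five primes: exact cancellation for integer k, none for fractional k/r.
import cmath, math
from itertools import product

primes = [2, 3, 5, 7, 11]        # p_1, ..., p_5

def I(k):
    logs = [math.log(p) for p in primes]
    L = math.log(primes[-1])
    total = 0.0
    for nus in product((0, 1), repeat=len(primes)):
        mu = (-1) ** sum(nus)
        x = sum(nu * lg for nu, lg in zip(nus, logs)) / L   # ln n / ln p_m
        total += mu * cmath.exp(2j * math.pi * k * x)
    return total

assert abs(I(1)) < 1e-9 and abs(I(3)) < 1e-9   # integer k: exact cancellation
assert abs(I(0.5)) > 1e-3                      # non-integer: no vanishing
```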
Proof of Theorem 4 We use again the special structure of our ensemble $\Omega_m$, for which the sum $I_m(\frac{k}{r})$ can be written as a product.
By summation by parts we obtain (31). The boundary term in (31) satisfies a negligible asymptotic estimate. Since the imaginary part of $J_m(\frac{k}{r})$ does not play any role in the estimate of $I_m(\frac{k}{r})$, we obtain from (33) the desired bound. This concludes the proof of Theorem 4.

Remark 4
Estimates of the higher-order terms in the proof of Theorem 4 could provide a more precise result. Moreover, with a more careful analysis of the cancellations, one can presumably handle the case when $\frac{k}{r}$ is only bounded.
The proof of Corollary 5 is simple and is a direct application of Theorem 4.

Proof of Corollary 5 Let us write the sum as two pieces, splitting the Fourier series of $f_m$ at frequency $K(m)$. The first sum is estimated using Theorem 4 and (12), while the second sum is estimated using (13).

Distribution of ν_j
Let us compute the expectation and the variance (with respect to $\pi_m$) of the sum of the independent (but not identically distributed) random variables $\nu_j$:
$$A_m = \mathbb{E}\sum_{j=1}^m \nu_j = \sum_{j=1}^m \frac{1}{1+p_j}, \qquad D_m = \operatorname{Var}\sum_{j=1}^m \nu_j = \sum_{j=1}^m \frac{p_j}{(1+p_j)^2}.$$
Proof of Theorem 6 It is easy to see that the sum $\sum_{j=1}^m \nu_j$ satisfies the Lindeberg condition
$$\frac{1}{D_m}\sum_{j=1}^m \int_{|x - \mathbb{E}\nu_j| > \varepsilon\sqrt{D_m}} (x - \mathbb{E}\nu_j)^2\,dF_{\nu_j}(x) \to 0, \qquad m \to \infty. \tag{34}$$
In fact, since $dF_{\nu_j} = \frac{p_j}{1+p_j}\,\delta_0 + \frac{1}{1+p_j}\,\delta_1$ and $D_m \to \infty$ as $m \to \infty$, for sufficiently large m we have $\frac{1}{1+p_j} - \varepsilon\sqrt{D_m} < 0$ and $\frac{1}{1+p_j} + \varepsilon\sqrt{D_m} > 1$, and thus the m integrals in (34) are identically zero. The Lindeberg condition implies the Central Limit Theorem for $\sum_{j=1}^m \nu_j$. Notice that this result is an analog of the celebrated Erdös-Kac Central Limit Theorem [12].
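Theorem 6 can be observed in simulation (our illustration): drawing the $\nu_j$ as independent Bernoulli variables with $\mathbb{P}(\nu_j = 1) = \frac{1}{1+p_j}$, the standardized sum has sample mean close to 0 and sample variance close to 1.

```python
# Monte Carlo check of the CLT for sum nu_j with nu_j ~ Bernoulli(1/(1+p_j)).
import math, random

def first_primes(m):
    primes, n = [], 2
    while len(primes) < m:
        isp = True
        for p in primes:
            if p * p > n:
                break
            if n % p == 0:
                isp = False
                break
        if isp:
            primes.append(n)
        n += 1
    return primes

random.seed(7)
m, trials = 500, 4000
q = [1 / (1 + p) for p in first_primes(m)]   # P(nu_j = 1)
A = sum(q)                                   # A_m
D = sum(x * (1 - x) for x in q)              # D_m = sum p_j/(1+p_j)^2
vals = []
for _ in range(trials):
    s = sum(1 for x in q if random.random() < x)
    vals.append((s - A) / math.sqrt(D))

mean = sum(vals) / trials
var = sum(v * v for v in vals) / trials
assert abs(mean) < 0.1 and abs(var - 1) < 0.2
```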
Proof of Theorem 7 Let us compute the characteristic function of $\eta_m^{(a,b)}$ and then take the limit as $m \to \infty$:
$$\mathbb{E}\,e^{i\lambda\eta_m^{(a,b)}} = \prod_{m^a < j \le m^b}\left(1 + \frac{e^{i\lambda} - 1}{1 + p_j}\right) \to e^{(e^{i\lambda} - 1)\ln\frac{b}{a}}$$
as $m \to \infty$. Since the characteristic function of Pois(σ) is $e^{\sigma(e^{i\lambda} - 1)}$, we obtain the claim. Notice that if we replace $m^a$ (or $m^b$) by $c_1 m^a$ (or $c_2 m^b$), we do not affect the limit distribution. In particular, when $a = b = 1$ we get a degenerate distribution concentrated at 0: for every $c_1 < c_2$,
$$\pi_m\!\left(\sum_{c_1 m < j \le c_2 m}\nu_j = 0\right) \to 1 \tag{35}$$
as $m \to \infty$. The exponential rate of convergence in (35) depends on $c_2/c_1$ and can be easily obtained. For example, for $a = \frac12$ and $b = 1$, we obtain the remarkable fact that the probability that n has no prime factors larger than $p_{\sqrt m} \sim \frac{\sqrt m\,\ln m}{2}$ tends to $e^{-\ln 2} = \frac12$ as $m \to \infty$. The asymptotic relations (36) and (37) show precisely how the measure $\pi_m$ is concentrated on square-free numbers with small prime factors.
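The Poisson parameter in Theorem 7 can be computed numerically; the convergence of $\sum_{m^a < j \le m^b}\frac{1}{1+p_j}$ to $\ln\frac{b}{a}$ is only at iterated-logarithm speed, so at accessible m the value is still visibly below the limit (our illustration).

```python
# Sum of 1/(1+p_j) over indices m^a < j <= m^b vs. the limit ln(b/a).
import math

def primes_up_to(N):
    sieve = [True] * (N + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(N**0.5) + 1):
        if sieve[i]:
            for k in range(i * i, N + 1, i):
                sieve[k] = False
    return [i for i, flag in enumerate(sieve) if flag]

m = 100_000
ps = primes_up_to(1_300_000)       # enough for the first 10^5 primes
assert len(ps) >= m
a, b = 0.25, 1.0
lo, hi = int(m**a), int(m**b)
lam = sum(1 / (1 + ps[j - 1]) for j in range(lo + 1, hi + 1))
# limit is ln(b/a) = ln 4 ~ 1.386; the finite-m value lags below it
assert 1.1 < lam < math.log(4)
```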

Large deviations
The probability of large deviations for the sum $\sum_j \nu_j$ can also be studied. A computation by K. Mody provides a sub-exponential bound for the probability of large deviations.

Connection with the Riemann ζ -function
In this section we investigate the connection between our ensemble $\Omega_m$ and the identity
$$\sum_{n\ge1}\frac{\mu(n)}{n^s} = \frac{1}{\zeta(s)}. \tag{38}$$
For $\Re s > 1$ both sides of (38) are absolutely convergent and the series on the left-hand side can be summed in any order. However, for $0 < \Re s \le 1$, if the series on the left-hand side of (38) converges, then it converges only conditionally and, a priori, the limits
$$\kappa(s) = \lim_{N\to\infty}\sum_{n\le N}\frac{\mu(n)}{n^s}$$
obtained from different methods of summation can be different. Let us write $s = s_1 + i s_2$. First, we discuss the case $s = 1 + i s_2$ and, for simplicity, $s_2 > 0$. We prove the following

Lemma 11 For every $s_2 > 0$ there exists $\tau = \tau(s_2) > 1$ such that, for some constant $C > 0$ and all N,
$$\left|\sum_{N < n \le \tau N}\frac{1}{n^{1+is_2}}\right| \le \frac{C}{N}.$$

Proof Introduce the variable $t = \frac{n}{N}$, for which $\Delta t = \frac{1}{N}$. Then the sum is, up to an error of order $\frac{1}{N}$, a Riemann sum for the integral $\int_1^{\tau} t^{-1-is_2}\,dt$. If we choose $\tau = \tau(s_2) = e^{2\pi/s_2} > 1$, then
$$\int_1^{\tau} t^{-1-is_2}\,dt = \frac{1 - \tau^{-is_2}}{i s_2} = 0,$$
since $\tau^{-is_2} = e^{-2\pi i} = 1$, and we obtain the desired estimate.
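The special role of $\tau = e^{2\pi/s_2}$ is easy to see numerically (our illustration, with the lemma's display as reconstructed above): a block of length $\tau N$ produces near-perfect cancellation, while a generic block length does not.

```python
# Block sums of 1/n^{1+i*s2} over N < n <= tau*N: the choice tau = e^{2pi/s2}
# makes the corresponding integral vanish, so the block sum is O(1/N).
import math

def block_sum(N, s2, tau):
    s = 1 + 1j * s2
    return sum(n ** (-s) for n in range(N + 1, int(tau * N) + 1))

s2 = 2 * math.pi                 # then tau = e^{2pi/s2} = e
N = 1000
good = abs(block_sum(N, s2, math.exp(2 * math.pi / s2)))
bad = abs(block_sum(N, s2, 2.0)) # a generic block length, for contrast
assert good < 1e-2               # near-total cancellation, O(1/N)
assert bad > 1e-1                # no such cancellation
```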