1 Introduction

One of the most famous graphs, not only among mathematicians and scientists, is the probability density function of the (standard) normal distribution (see Fig. 1), which adorned the 10 Mark note of the former German currency for many years. Although it already played a central role in a work of Abraham de Moivre (26. May 1667 in Vitry-le-François; 27. November 1754 in London) from 1718, this curve only earned its enduring fame through the work of the famous German mathematician Carl Friedrich Gauß (30. April 1777 in Braunschweig; 23. February 1855 in Göttingen), who used it in the approximation of orbits by ellipses when developing the least squares method, nowadays a standard approach in regression analysis. More precisely, Gauß conceived this method to master the random errors that occur when one tries to measure the orbits of celestial bodies, i.e., those errors which fluctuate due to the unpredictability or uncertainty inherent in the measuring process. The strength of this method became apparent when he used it to predict the future location of the newly discovered asteroid Ceres. Ever since, this curve has seemed to be the key to the mysterious world of chance, and the myth still persists that wherever this curve appears, randomness is at play.

With this article we seek to address mathematicians and a mathematically educated audience alike. The goal of this manuscript is threefold. First, for those less familiar with the subject, we want to undo the fetters that connect chance and the Gaussian curve so one-sidedly. Second, we want to recall the deep and intimate connection between the notion of statistical independence and the Gaussian law of errors beyond classical probability theory. This, thirdly, demonstrates that occasionally one is obliged to step aside from the seemingly ultimate form of probability theory in terms of the Kolmogorov axioms and work with notions having their roots in earlier foundations of the subject.

To achieve this goal we shall, partially embedded in a historical context, present and discuss several results from mathematics where, once an appropriate form of statistical independence has been established, the Gaussian curve emerges naturally. In more modern language this means that central limit theorems describe the fluctuations of mathematical quantities in different contexts. Our focus shall be on results that nowadays are considered to be part of probabilistic number theory. At the very heart of this development lies the true comprehension and appreciation of independence by the Polish mathematician Mark Kac (3. August 1914 in Kremenez; 26. October 1984 in California). His pioneering works and insights, especially his collaboration with Hugo Steinhaus (14. January 1887 in Jasło; 25. February 1972 in Wrocław) and the famous mathematician Paul Erdős (26. March 1913 in Budapest; 20. September 1996 in Warsaw), have revolutionized our understanding and shaped the development of probabilistic number theory for many years with lasting influence. We refer the reader to [10, 11, 49, 50] for general literature on the subject.

Fig. 1 The Gaussian curve

2 The classical central limit theorems and independence–a refresher

In this section we start with two fundamental results of probability theory and the notion of independence. These considerations form the starting point for the deliberations that follow.

2.1 The notion of independence

Independence is one of the central notions in probability theory. It is hard to imagine today that this concept, which seems so elementary and simple to us, was used only vaguely and intuitively for hundreds of years without a formal definition underlying it. Implicitly, the concept can be traced back to the works of Jakob Bernoulli (6. January 1655 in Basel; 16. August 1705 in Basel) and evolved in the capable hands of Abraham de Moivre. In his famous oeuvre “The Doctrine of Chances” [15] he wrote:

…if a Fraction expresses the Probability of an Event, and another Fraction the Probability of another Event, and those two Events are independent; the Probability that both those Events will Happen, will be the Product of those Fractions.

It is to be noted that, even though this definition matches the modern one, neither the notion “Probability” nor “Event” had been introduced in an axiomatic way. It seems that the first formal definition of independence goes back to the year 1900 and the work [12] of German mathematician Georg Bohlmann (23. April 1869 in Berlin; 25. April 1928 in Berlin)Footnote 1. In fact, long before Andrei Nikolajewitsch Kolmogorov (25. April 1903 in Tambow; 20. October 1987 in Moscow) proposed his axioms that today form the foundation of probability theory, Bohlmann had presented an axiomatization–but without asking for \(\sigma\)-additivity. For a detailed exposition of the historical development and the work of Bohlmann, we refer the reader to an article of Ulrich Krengel [37].

Roughly speaking, two events are considered to be independent if the occurrence of one does not affect the probability of occurrence of the other, see also Remark 3. We now continue with the formal definition of independence as it is used today. Let \((\Omega,\mathcal{A},\mathbb{P})\) be a probability space consisting of a non-empty set \(\Omega\) (the sample space), a \(\sigma\)-algebra \(\mathcal{A}\) (the set of events) on \(\Omega\), and a probability measure \(\mathbb{P}:\mathcal{A}\to[0,1]\). We then say that two events \(A,B\in\mathcal{A}\) are (statistically) independent if and only if

$$\mathbb{P}[A\cap B]=\mathbb{P}[A]\cdot\mathbb{P}[B]\,.$$

In other words two events are independent if their joint probability equals the product of their probabilities. This extends to any collection \((A_{i})_{i\in I}\) of events, which is said to be independent if and only if for every \(n\in\mathbb{N}\), \(n\geq 2\), and all subsets \(J\subseteq I\) of cardinality \(n\),

$$\mathbb{P}\Big[\bigcap_{i\in J}A_{i}\Big]=\prod_{i\in J}\mathbb{P}[A_{i}]\,.$$

It is important to note that in this case we require far more than just

$$\mathbb{P}\Big[\bigcap_{i\in I}A_{i}\Big]=\prod_{i\in I}\mathbb{P}[A_{i}]$$

and still much more than pairwise independence. Consequently, we also have to verify much more: the number of conditions to be verified to show that \(n\) given events are independent is exactly

$${n\choose 2}+{n\choose 3}+\dots+{n\choose n}=2^{n}-(n+1)\,.$$
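To make the count concrete, the following minimal Python sketch (our own illustration; the probability space and the events are arbitrary toy choices) enumerates all \(2^{n}-(n+1)\) product conditions for \(n=3\) events, namely the three binary-digit events on a uniform eight-point space.

```python
from itertools import combinations
from fractions import Fraction

# Toy example: uniform probability space on {0,...,7}; the three binary
# digits of omega define three events that should be independent.
sample_space = set(range(8))
events = [{w for w in sample_space if (w >> i) & 1} for i in range(3)]

def P(A):
    return Fraction(len(A), len(sample_space))

n = len(events)
checked = 0
for size in range(2, n + 1):                 # all subsets of cardinality >= 2
    for J in combinations(range(n), size):
        joint = set.intersection(*(events[i] for i in J))
        product = Fraction(1)
        for i in J:
            product *= P(events[i])
        assert P(joint) == product, f"product rule fails for {J}"
        checked += 1

print(checked, "conditions checked, expected", 2**n - (n + 1))   # 4 and 4
```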

Having this notion of independence at hand, we define independent random variables. If \(X:\Omega\to\mathbb{R}\) and \(Y:\Omega\to\mathbb{R}\) are two random variables, then we say they are independent if and only if for all measurable subsets \(A,B\subseteq\mathbb{R}\),

$$\mathbb{P}[X\in A,Y\in B]=\mathbb{P}[X\in A]\cdot\mathbb{P}[Y\in B]\,.$$

We use the standard notation \(\{X\in A\}\) for \(\{\omega\in\Omega\,:\,X(\omega)\in A\}\), \(\mathbb{P}[X\in A]\) for \(\mathbb{P}[\{X\in A\}]\), and \(\mathbb{P}[X\in A,Y\in B]\) for \(\mathbb{P}[\{X\in A\}\cap\{Y\in B\}]\).

This means that the random variables \(X\) and \(Y\) are independent if and only if for all measurable subsets \(A,B\subseteq\mathbb{R}\) the events \(\{X\in A\}\in\mathcal{A}\) and \(\{Y\in B\}\in\mathcal{A}\) are independent. Again, a sequence \(X_{1},X_{2},\dots:\Omega\to\mathbb{R}\) of random variables is said to be independent if and only if for every \(n\in\mathbb{N}\), \(n\geq 2\), any subset \(I\subseteq\mathbb{N}\) of cardinality \(n\), and all measurable sets \(A_{i}\subseteq\mathbb{R}\), \(i\in I\),

$$\mathbb{P}\bigg[\bigcap_{i\in I}\{X_{i}\in A_{i}\}\bigg]=\prod_{i\in I}\mathbb{P}[X_{i}\in A_{i}]\,.$$

2.2 The central limit theorems of de Moivre-Laplace and Lindeberg

The history of the central limit theorem starts with the work of French mathematician Abraham de Moivre, who, around the year 1730, proved a central limit theorem for standardized sums of independent random variables following a symmetric Bernoulli distribution [16].Footnote 2 It was not before 1812 that Pierre-Simon Laplace (28. March 1749 in Beaumont-en-Auge; 5. March 1827 in Paris) generalized this result to the asymmetric case [39]. However, a central limit theorem for standardized sums of independent random variables together with a rigorous proof only appeared much later in a work of Russian mathematician Alexander Michailowitsch Ljapunov (6. June 1857 in Jaroslawl; 3. November 1918 in Odessa) from 1901 [43]. Jarl Waldemar Lindeberg (4. August 1876 in Helsinki; 12. December 1932 in Helsinki) published his works on the central limit theorem, in which he developed his famous and ingenious method of proof (today known as Lindeberg method), in 1922 [41, 42]. While in a certain sense elementary, this technique can be applied in various ways. A very nice exposition on Lindeberg’s method can be found in the survey article [20] of Peter Eichelsbacher and Matthias Löwe. For an exhaustive presentation on the history of the central limit theorem we warmly recommend the monograph of Hans Fischer [24].

Let us start with the classical central limit theorem of de Moivre, hence restricting ourselves to the symmetric case \(p=\frac{1}{2}\) in the Bernoulli distribution.

Theorem 1 (De Moivre, 1730)

Let \(X_{1},X_{2},X_{3},\dots\) be a sequence of independent random variables with a symmetric Bernoulli distribution. Then, for all \(a,b\in\mathbb{R}\) with \(a<b\), we have

$$\lim_{n\to\infty}\mathbb{P}\bigg[a\leq\frac{\sum_{k=1}^{n}X_{k}-\frac{n}{2}}{\sqrt{\frac{n}{4}}}\leq b\bigg]=\frac{1}{\sqrt{2\pi}}\int_{a}^{b}e^{-\frac{x^{2}}{2}}\,\text{d}x\,.$$

The theorem of de Moivre, when discussed in school for instance, can be nicely illustrated using the Galton board (also known as a bean machine). Let us consider the experiment of throwing an ideal and fair coin \(n\) times (i.e., heads shows up with probability \(1/2\)). The single throws are regarded as independent, as none of them influences the others. The number \(k\) of heads showing up in that experiment is a number between 0 and \(n\). The probability that we see heads exactly \(k\) times is described by a binomial distribution. Now de Moivre’s theorem says that, as the number \(n\) of tosses tends to infinity, the shape of a suitably standardized histogram approaches the Gaussian curve.
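What the Galton board shows physically can also be observed numerically. The following minimal Python sketch (our own illustration; the sample sizes are arbitrary choices and NumPy is assumed to be available) compares the standardized head counts with the Gaussian integral from Theorem 1.

```python
import numpy as np
from math import erf, sqrt

# Simulate many runs of n fair coin tosses and standardize the head counts
# as in Theorem 1; then compare interval frequencies with the Gaussian law.
rng = np.random.default_rng(0)
n, trials = 1000, 100_000

heads = rng.binomial(n, 0.5, size=trials)        # heads per experiment
z = (heads - n / 2) / np.sqrt(n / 4)             # standardization of Theorem 1

Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))     # standard normal CDF
for a, b in [(-1, 1), (-2, 2), (0, 3)]:
    empirical = np.mean((a <= z) & (z <= b))
    gaussian = Phi(b) - Phi(a)
    print(f"[{a},{b}]: empirical {empirical:.4f}, Gaussian {gaussian:.4f}")
```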

We have already mentioned at the beginning of this section that under suitable conditions a central limit theorem for general independent random variables may be obtained, not only those describing or modeling a coin toss.

We formulate Lindeberg’s central limit theorem. In what follows, we shall denote by \(1_{A}\) the indicator function of the set \(A\), i.e., \(1_{A}(x)\in\{0,1\}\) with \(1_{A}(x)=1\) if and only if \(x\in A\). The expectation of a random variable \(X\) with respect to the probability measure \(\mathbb{P}\) is defined as \(\mathbb{E}[X]:=\int_{\Omega}X\text{d}\mathbb{P}\), if this integral is defined. \(X\) is called centered if and only if \(\mathbb{E}[X]=0\). If \(\mathbb{E}[|X|]<\infty\) we define the variance by \(\mathrm{Var}[X]:=\mathbb{E}\big[(X-\mathbb{E}[X])^{2}\big]\).

Theorem 2 (Lindeberg CLT, 1922)

Let \(X_{1},X_{2},X_{3},\dots\) be a sequence of independent, centered, and square integrable random variables. Assume that for each \(\varepsilon\in(0,\infty)\),

$$L_{n}(\varepsilon):=\frac{1}{s_{n}^{2}}\sum_{k=1}^{n}\mathbb{E}\big[X_{k}^{2}\,1_{\{|X_{k}|> \varepsilon s_{n}\}}\big]\,\,\overset{n\to\infty}{\longrightarrow}\,\,0\qquad(\textrm{Lindeberg condition}),$$

where \(s_{n}^{2}:=\sum_{k=1}^{n}\mathrm{Var}[X_{k}]\). Then, for all \(a,b\in\mathbb{R}\) with \(a<b\), we have

$$\lim_{n\to\infty}\mathbb{P}\bigg[a\leq\frac{\sum_{k=1}^{n}X_{k}}{s_{n}}\leq b\bigg]=\frac{1}{\sqrt{2\pi}}\int_{a}^{b}e^{-\frac{x^{2}}{2}}\,\text{d}x\,.$$

Lindeberg’s condition guarantees that no single random variable has too much influence. This immediately becomes apparent when looking at the Feller condition, which is implied by Lindeberg’s condition. We refrain from discussing or presenting the details and refer again to [20].

Remark 1

Let us assume that the random variables in Theorem 2 are identically distributed and have variance \(\mathrm{Var}[X_{k}]=\sigma^{2}\in(0,\infty)\) for all \(k\in\mathbb{N}\). Then Lindeberg’s condition is automatically satisfied:

$$s_{n}^{2}=\sum_{k=1}^{n}\mathrm{Var}[X_{k}]=n\sigma^{2}$$

and therefore, since the random variables \(X_{k}\) are identically distributed, we obtain for any \(\varepsilon> 0\) that

$$\begin{aligned}\displaystyle L_{n}(\varepsilon)&\displaystyle=\frac{1}{n\sigma^{2}}\sum_{k=1}^{n}\mathbb{E}\Big[X_{k}^{2}\,1_{\{|X_{k}|> \varepsilon s_{n}\}}\Big]=\frac{1}{n\sigma^{2}}\sum_{k=1}^{n}\mathbb{E}\Big[X_{1}^{2}\,1_{\{|X_{1}|> \varepsilon\sqrt{n}\sigma\}}\Big]\\ \displaystyle&\displaystyle=\frac{1}{\sigma^{2}}\mathbb{E}\Big[X_{1}^{2}\,1_{\{|X_{1}|> \varepsilon\sqrt{n}\sigma\}}\Big]\,\,\overset{n\to\infty}{\longrightarrow}\,\,0,\end{aligned}$$

where the convergence to 0 is a consequence of the Beppo Levi TheoremFootnote 3.

The previous remark immediately implies the classical central limit theorem for independent and identically distributed random variables.

Corollary 1

Let \(X_{1},X_{2},X_{3},\dots\) be a sequence of independent and identically distributed random variables with \(\mathbb{E}[X_{1}]=0\) and \(\mathrm{Var}[X_{1}]=\sigma^{2}\in(0,\infty)\). Then, for all \(a,b\in\mathbb{R}\) with \(a<b\), we have

$$\lim_{n\to\infty}\mathbb{P}\bigg[a\leq\frac{\sum_{k=1}^{n}X_{k}}{\sqrt{n\sigma^{2}}}\leq b\bigg]=\frac{1}{\sqrt{2\pi}}\int_{a}^{b}e^{-\frac{x^{2}}{2}}\,\text{d}x\,.$$

One thing we immediately notice in the general version of Lindeberg’s central limit theorem is its universality with respect to the underlying distribution of the random variables: the particular distribution seems to be irrelevant. On the other hand, in both the central limit theorem of de Moivre and that of Lindeberg, we require the random variables to be independent. Could it be that independence is the key to a Gaussian law of errors? If so, does this connection go deeper, beyond a purely probabilistic framework? In the remaining parts of this work we want to get to the bottom of these questions.
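This universality is easy to observe numerically. The following sketch (our own illustration; the two centered distributions and all sample sizes are arbitrary choices) standardizes sums of iid random variables from two very different distributions and obtains, in both cases, approximately Gaussian frequencies.

```python
import numpy as np
from math import erf, sqrt

# Universality in Corollary 1: the standardized sums of iid samples from two
# rather different centered distributions approach the same Gaussian law.
rng = np.random.default_rng(1)
n, trials = 500, 10_000
Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))

samples = {
    "uniform on [-1,1]": rng.uniform(-1.0, 1.0, (trials, n)),
    "centered exponential": rng.exponential(1.0, (trials, n)) - 1.0,
}
for name, x in samples.items():
    sigma2 = x.var()                             # estimate of Var[X_1]
    z = x.sum(axis=1) / np.sqrt(n * sigma2)      # standardized sums
    empirical = np.mean((-1 <= z) & (z <= 1))
    print(f"{name}: P[-1<=Z<=1] ~ {empirical:.4f} vs {Phi(1)-Phi(-1):.4f}")
```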

2.3 Binary expansion and independence

In this section we present a first example which a priori is non-probabilistic. It concerns intervals corresponding to binary expansions of real numbers \(x\in[0,1]\) and a corresponding product rule for their lengths.

For simplicity, we start by reminding the reader of the decimal expansion of a number \(x\in[0,1)\). One can prove that each number \(x\in[0,1)\) has a non-terminating and unique decimal expansion (see, e.g., [8]). For example,

$$\frac{2}{7}=0.285714285714{\ldots}$$

and this expression is merely a short way for writing

$$\frac{2}{7}=\frac{2}{10}+\frac{8}{10^{2}}+\frac{5}{10^{3}}+\frac{7}{10^{4}}+\dots\,.$$

Generally, for each \(x\in[0,1)\) there exist unique numbers \(d_{1}(x),d_{2}(x),d_{3}(x),{\ldots}\) in \(\{0,1,{\ldots},9\}\) such that

$$x=\frac{d_{1}(x)}{10}+\frac{d_{2}(x)}{10^{2}}+\frac{d_{3}(x)}{10^{3}}+{\ldots}\,.$$

Analogous to the decimal expansion, each number \(x\in[0,1)\) has a binary expansion (also known as dyadic expansion), i.e., there are unique numbers \(b_{1}(x),b_{2}(x),b_{3}(x),{\ldots}\) in the set \(\{0,1\}\) such that

$$x=\frac{b_{1}(x)}{2}+\frac{b_{2}(x)}{2^{2}}+\frac{b_{3}(x)}{2^{3}}+\dots\,.$$
(1)

For instance, we can write

$$\frac{2}{7}=\frac{0}{2}+\frac{1}{2^{2}}+\frac{0}{2^{3}}+\frac{0}{2^{4}}+\frac{1}{2^{5}}+\frac{0}{2^{6}}+\dots\,.$$

To guarantee uniqueness of the expansion, we agree to write it in such a way that infinitely many of the binary digits are zero. As already indicated by our notation, the binary digits are functions of the variable \(x\), i.e.,

$$b_{k}:[0,1)\to\{0,1\},\qquad x\mapsto b_{k}(x).$$

Sometimes these functions are called Rademacher functions, although Hans Rademacher (3. April 1892 in Wandsbek; 7. February 1969 in Haverford) defined a slightly different version [45]. The value that \(b_{k}\) takes at \(x\) not only gives the \(k\)-th binary digit of \(x\), but also provides information about \(x\) itself. Obviously, if \(b_{1}(x)=1\), then \(x\in[1/2,1)\), and if \(b_{2}(x)=0\), then \(x\in[0,1/4)\cup[1/2,3/4)\). More generally, if we define for each \(k\in\mathbb{N}\) the set

$$B_{k}:=\bigcup_{j=1}^{2^{k-1}}\Big[\frac{2j-2}{2^{k}},\frac{2j-1}{2^{k}}\Big),$$

then

$$b_{k}(x)=1_{[0,1)\setminus B_{k}}(x)=\begin{cases}0&:x\in B_{k}\\ 1&:x\in[0,1)\setminus B_{k}\,.\end{cases}$$

These considerations yield the following: if \(n\in\mathbb{N}\), \(k_{1},{\dots},k_{n}\in\mathbb{N}\) are pairwise distinct, and \(\varepsilon_{1},{\dots},\varepsilon_{n}\in\{0,1\}\), then

$$\begin{aligned}\displaystyle\lambda\left(\bigcap_{i=1}^{n}b_{k_{i}}^{-1}(\varepsilon_{i})\right)&\displaystyle=\lambda\big(\{x\in[0,1)\,:\,b_{k_{1}}(x)=\varepsilon_{1},{\dots},b_{k_{n}}(x)=\varepsilon_{n}\}\big)\\ \displaystyle&\displaystyle=\Big(\frac{1}{2}\Big)^{n}=\prod_{i=1}^{n}\lambda\big(\{x\in[0,1)\,:\,b_{k_{i}}(x)=\varepsilon_{i}\}\big),\end{aligned}$$

where \(\lambda\) denotes the 1‑dimensional Lebesgue measure (which in this case simply assigns to an interval its length). This implies that the binary digits, as functions of \(x\in[0,1)\), are independent; a result apparently first discovered by the French mathematician Émile Borel (7. January 1871 in Saint-Affrique; 3. February 1956 in Paris) in 1909 [13]. In particular, the random variables \(X_{k}=b_{k}\) satisfy the assumptions of de Moivre’s theorem (Theorem 1) and so we obtain a central limit theorem for the binary digits \(b_{k}\). Probability in the sense of coin tosses or events has not played any role in our arguments. (Nevertheless, technically the \(X_{k}\)’s are bona fide random variables on the probability space \(\big([0,1),\mathcal{B}([0,1)),\lambda\big)\).)
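This can also be checked empirically: sampling \(x\) uniformly from \([0,1)\), i.e., with respect to \(\lambda\), the digits behave like independent fair coin flips. A minimal sketch (our own illustration; the digit indices and the sample size are arbitrary choices):

```python
import numpy as np

# Sample x uniformly on [0,1) and extract binary digits via
# b_k(x) = floor(2^k x) mod 2; then check the product rule empirically.
rng = np.random.default_rng(2)
x = rng.random(1_000_000)

def digit(x, k):
    return (np.floor(x * 2.0**k) % 2).astype(int)

b1, b3, b7 = digit(x, 1), digit(x, 3), digit(x, 7)
joint = np.mean((b1 == 1) & (b3 == 0) & (b7 == 1))
product = np.mean(b1 == 1) * np.mean(b3 == 0) * np.mean(b7 == 1)
print(joint, "~", product, "~", 0.5**3)          # all close to 1/8
```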

2.4 Prime factors and independence

We shall now consider a fundamentally different example of independence in mathematics. Take a sufficiently large natural number \(N\in\mathbb{N}\). We note that roughly half of the numbers between 1 and \(N\) are divisible by the prime number 2, namely \(2,4,6\) and so on. In the same way, roughly one third of the numbers between 1 and \(N\) are divisible by the prime number 3, namely \(3,6,9\) and so on. If we now consider the numbers between 1 and \(N\) which are divisible by 6, then this is again roughly one sixth. However, divisibility by 6 is equivalent to both divisibility by 2 and 3 and we can write this as

$$\frac{1}{6}=\frac{1}{2}\cdot\frac{1}{3}$$

for the corresponding fractions of numbers between 1 and \(N\). But this reminds us of the multiplication of probabilities–as occurring in the concept of independence! Of course, the same argument applies for divisibility by general distinct primes \(p\) and \(q\) as well as by any finite number of primes. We can say, in this sense, that divisibility of a number by distinct primes is independent.
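The product rule for divisibility can be observed directly. A minimal sketch (our own illustration; \(N\) is an arbitrary cutoff): among \(1,\dots,N\) the fractions of numbers divisible by 2, by 3, and by both multiply, up to an error of order \(1/N\).

```python
import numpy as np

# Fractions of n in {1,...,N} divisible by 2, by 3, and by 6 = 2*3.
N = 1_000_000
n = np.arange(1, N + 1)
div2, div3 = (n % 2 == 0), (n % 3 == 0)
print(div2.mean())            # ~ 1/2
print(div3.mean())            # ~ 1/3
print((div2 & div3).mean())   # ~ 1/6 = (1/2)*(1/3)
```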

Evidently, every second natural number is divisible by 2, so the numbers with this property constitute one half of all natural numbers. One could thus think that a randomly chosen natural number is divisible by 2 with probability \(\frac{1}{2}\). In the same way, this number would be divisible by 3 with probability \(\frac{1}{3}\), and an analogous statement would hold for divisibility by every natural number.

It turns out that this notion, although intuitive, is incompatible with Kolmogorov’s concept of probability in that no probability measure on the naturals with the above property exists (and, as a consequence, it is impossible to define a uniform measure on any countably infinite set).

To see this, define, for every pair of numbers \(n,k\) with \(n\in\mathbb{N}\) and \(k\in\{1,{\dots},n\}\) the set \(A_{n,k}:=\{jn+k\colon j\in\mathbb{N}\cup\{0\}\}\). For \(k\neq n\), \(A_{n,k}\) consists of all natural numbers which yield remainder \(k\) after division by \(n\), while for \(k=n\) we have \(A_{n,k}=A_{n,n}\), which is the set of all natural numbers that are divisible by \(n\). We denote by \(\mathcal{P}(\mathbb{N})\) the set of all subsets of \(\mathbb{N}\).

Lemma 1

Let \(\mu\) be a finite measure on the set \(\mathcal{P}(\mathbb{N})\), which satisfies

$$\mu(A_{p,k})=\mu(A_{p,p})$$
(2)

for every prime number \(p\) and every \(k\in\{1,{\dots},p\}\). Then \(\mu(\{m\})=0\) for every \(m\in\mathbb{N}\), and therefore \(\mu(A)=0\) for all \(A\subseteq\mathbb{N}\).

Proof

First note that (2) implies \(\mu(A_{p,k})=\mu(\mathbb{N})/p\) for every prime number \(p\) and all \(k\in\{1,{\dots},p\}\): indeed, if \(p\) is a prime number, then

$$\mu(\mathbb{N})=\mu\bigg(\bigcup_{k\in\{1,{\dots},p\}}A_{p,k}\bigg)=\sum_{k\in\{1,{\dots},p\}}\mu(A_{p,k})\overset{(2)}{=}p\mu(A_{p,p})\,,$$

where we used the finite additivity of \(\mu\) to obtain the second equality. Combining \(\mu(\mathbb{N})=p\mu(A_{p,p})\) with (2) gives \(\mu(A_{p,k})=\mu(\mathbb{N})/p\) for all \(k\in\{1,{\dots},p\}\).

Now fix \(m\in\mathbb{N}\). For every prime number \(p\) there exist numbers \(j\in\mathbb{N}\cup\{0\}\) and \(k\in\{1,{\dots},p\}\) such that \(m=jp+k\). Thus \(m\in A_{p,k}\). From our earlier considerations it follows

$$\mu(\{m\})\leq\mu(A_{p,k})=\mu(\mathbb{N})/p\,.$$

Since \(\mu(\mathbb{N})<\infty\) by assumption, and since there are arbitrarily large primes, it follows that \(\mu(\{m\})=0\). But since \(\mu\) is a measure, and thus is \(\sigma\)-additive, we get \(\mu(A)=\sum_{m\in A}\mu(\{m\})=0\) for every \(A\subseteq\mathbb{N}\). \(\square\)

So there exists no measure on \(\mathcal{P}(\mathbb{N})\) having the desired property (2). But could it be that we have chosen the domain of \(\mu\) too large? The next proposition shows that there is no smaller domain containing all \(A_{p,k}\).

Proposition 1

We have \(\sigma\big(\big\{A_{p,k}\colon p\;\text{prime},\,k\in\{1,{\dots},p\}\big\}\big)=\mathcal{P}(\mathbb{N})\).

Proof

We define the set \(\Sigma:=\sigma\big(\big\{A_{p,k}\colon p\;\text{prime},\,k\in\{1,{\dots},p\}\big\}\big)\). It is sufficient to show that \(\{m\}\in\Sigma\) for all \(m\in\mathbb{N}\). To this end fix \(m\in\mathbb{N}\). For every prime \(p> m\) we have \(m\in A_{p,m}\), since \(m=0\cdot p+m\). Therefore,

$$m\in\bigcap_{p\;\text{prime},\,p> m}A_{p,m}\,.$$

Let \(\ell\in\bigcap_{p\;\text{prime},\,p> m}A_{p,m}\). Then there exists a prime \(p> \max\{\ell,m\}\), and since \(\ell\in A_{p,m}\) means \(\ell=jp+m\) for some \(j\in\mathbb{N}\cup\{0\}\), the choice \(p>\ell\) forces \(j=0\), i.e., \(\ell=m\). Thus, \(\{m\}=\bigcap_{p\;\text{prime},\,p> m}A_{p,m}\in\Sigma\). \(\square\)

Remark 2

Eq. (2) in Lemma 1 formalizes our earlier intuition: if \(\mu(A_{p,p})\) is the fraction of numbers divisible by \(p\), then this should equal the fraction of numbers giving remainder 1 after division by \(p\), and so on. The lemma shows us that there cannot be a non-trivial finite measure \(\mu\) with this property, and therefore we cannot assign meaningful probabilities to those subsets in the framework of Kolmogorov’s theory. In contrast to the independence of distinct binary digits of a number in \([0,1)\), we cannot capture the independence of divisibility of a number in \(\mathbb{N}\) by distinct primes using Kolmogorov’s notion of independence of random variables.

3 Relative Measures

A possible remedy is a notion related to one of the earlier approaches to probability theory, which goes back at least to Richard von Mises (19. April 1883 in Lviv; 14. July 1953 in Boston) and can be found in early work of Kac and Steinhaus. However, we were unable to trace the original source. In any case, this approach has to a large extent been replaced by Kolmogorov’s axiomatization of probability.

One of the central notions of this manuscript shall be referred to as a relative measure; its definition and properties are discussed in the following section.

3.1 Relative measurable subsets of \(\mathbb{N}\)

Definition 1 (Relatively measurable subsets of \(\mathbb{N}\) and relative measure)

We say that a subset \(A\subseteq\mathbb{N}\) is relatively measurable if and only if the limit

$$\lim_{N\to\infty}\frac{|A\cap\{1,{\dots},N\}|}{N}$$

exists. In that case we define the relative measure \(\mu_{R}\) of \(A\) as exactly this limit,

$$\mu_{R}(A):=\lim_{N\to\infty}\frac{|A\cap\{1,{\dots},N\}|}{N}\,.$$

It is easy to see that the collection of relatively measurable subsets of \(\mathbb{N}\) contains \(\mathbb{N}\) itself, is closed under taking complements and under disjoint unions, and that \(\mu_{R}\) is a non-negative and (finitely-)additive set function on it; note, however, that the intersection of two relatively measurable sets need not be relatively measurable again. Moreover, it is obvious that every finite subset of \(\mathbb{N}\) is relatively measurable with relative measure 0.

The sets \(A_{n,k}\), \(n\in\mathbb{N}\) and \(k\in\{1,{\dots},n\}\), defined in Sect. 2.4 are relatively measurable with

$$\mu_{R}(A_{n,k})=\frac{1}{n}\,.$$

It is a direct consequence of Lemma 1 that \(\mu_{R}\) cannot be \(\sigma\)-additive. Indeed,

$$\mu_{R}\Big(\bigcup_{i\in\mathbb{N}}\{i\}\Big)=\mu_{R}(\mathbb{N})=1\neq 0=\sum_{i\in\mathbb{N}}\mu_{R}(\{i\})\,.$$

On the other hand, we can construct sets which are not relatively measurable.

Example 1

Let \(a_{1}=0\) and define

$$a_{k}:=\begin{cases}0&:2^{2m}<k\leq 2^{2m+1}\text{ for some }m\in\mathbb{N}_{0}\\ 1&:2^{2m+1}<k\leq 2^{2m+2}\text{ for some }m\in\mathbb{N}_{0}\,.\end{cases}$$

Consider the level set \(A:=\{k\in\mathbb{N}\,\colon\,a_{k}=1\}\). Then \(A\) is not relatively measurable because

$$\begin{aligned} 2^{-(2m+2)}\,\big|A\cap\{1,{\dots},2^{2m+2}\}\big| &= 2^{-(2m+2)}\cdot 2\,\big(1+2^{2}+\dots+2^{2m}\big)=2^{-(2m+1)}\,\frac{2^{2m+2}-1}{3}\;\overset{m\to\infty}{\longrightarrow}\;\frac{2}{3}\,,\\ 2^{-(2m+1)}\,\big|A\cap\{1,{\dots},2^{2m+1}\}\big| &= 2^{-(2m+1)}\cdot 2\,\big(1+2^{2}+\dots+2^{2m-2}\big)=2^{-2m}\,\frac{2^{2m}-1}{3}\;\overset{m\to\infty}{\longrightarrow}\;\frac{1}{3}\,.\end{aligned}$$
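The oscillation of the running densities is clearly visible numerically. A minimal sketch (our own illustration; the range of \(m\) is arbitrary):

```python
# Running density of A = {k : a_k = 1} along N = 2^(2m+1) and N = 2^(2m+2);
# the two subsequences approach 1/3 and 2/3, so no overall limit exists.
def a(k):
    if k == 1:
        return 0
    j = (k - 1).bit_length() - 1        # j with 2^j < k <= 2^(j+1)
    return j % 2                        # a_k = 1 on blocks (2^(2m+1), 2^(2m+2)]

for m in range(2, 9):
    for N in (2 ** (2 * m + 1), 2 ** (2 * m + 2)):
        density = sum(a(k) for k in range(1, N + 1)) / N
        print(f"N = 2^{N.bit_length() - 1:2d}: density = {density:.4f}")
```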

The relative measure allows us to conceive and show the independence of divisibility by different primes in a formal way. In this regard this notion is superior to a measure in the sense of Kolmogorov. We are now going to prove the independence of \(A_{p,p}\) and \(A_{q,q}\) for different primes \(p\) and \(q\). By the fundamental theorem of arithmetic a number is divisible by \(p\) as well as \(q\) if and only if it is divisible by their product \(pq\), and so \(A_{p,p}\cap A_{q,q}=A_{pq,pq}\). Therefore, we obtain

$$\mu_{R}(A_{p,p}\cap A_{q,q})=\mu_{R}(A_{pq,pq})=\frac{1}{p\cdot q}=\frac{1}{p}\cdot\frac{1}{q}=\mu_{R}(A_{p,p})\,\mu_{R}(A_{q,q})\,,$$

which is the product rule so characteristic of independence. Similarly, one can show this property for each finite collection of different primes \(p_{1},{\dots},p_{m}\).

The following lemma shows that if the indicator function of a subset of the natural numbers is eventually periodic, then the relative measure of that set is equal to the average over the period. We shall leave the proof to the reader.

Lemma 2

Consider a set \(A\subseteq\mathbb{N}\). If there exist \(k\in\mathbb{N}\) and \(n_{0}\in\mathbb{N}\) such that

$$\forall n\geq n_{0}\colon 1_{A}(n+k)=1_{A}(n)\,,$$

then \(A\) is relatively measurable and

$$\mu_{R}(A)=\frac{|A\cap\{n_{0}+1,{\dots},n_{0}+k\}|}{k}.$$

Remark 3 (Independence and information)

One important property of statistical independence is that knowledge of one event, say \(B\), does not provide any information about an independent event \(A\): for independent \(A,B\) we have \(\mathbb{P}[A|B]=\mathbb{P}[A]\).

A similar situation occurs with numbers: knowledge about divisibility by one prime does not tell us anything about divisibility by another one. The same holds true for the digits considered earlier: if we know the \(k\)-th digit of a number \(x\in[0,1)\), this does not tell us anything about its \(\ell\)-th digit for \(\ell\neq k\).

Consider now, for every \(j\in\mathbb{N}\) the function \(\beta_{j}\colon\mathbb{N}\to\{0,1\}\) defined by

$$\beta_{j}(n):=\begin{cases}0&:\lfloor\frac{n}{2^{j-1}}\rfloor\text{ is even }\\ 1&:\lfloor\frac{n}{2^{j-1}}\rfloor\text{ is odd }\,,\end{cases}$$
(3)

such that \(\beta_{j}(n)\) is the \(j\)-th binary digit of \(n\), and

$$n=\sum_{j=1}^{\infty}\beta_{j}(n)\,2^{j-1}=\sum_{j=1}^{\lfloor\log_{2}(n)\rfloor+1}\beta_{j}(n)\,2^{j-1}\,.$$

To every \(j\in\mathbb{N}\) assign the set \(B_{j}:=\{n\in\mathbb{N}\colon\beta_{j}(n)=1\}\), i.e. the set of all natural numbers for which the \(j\)-th binary digit equals 1.

It follows from the definition of binary digits that for each \(j\in\mathbb{N}\)

$$B_{j}=\bigcup_{m\in\mathbb{N}\cup\{0\}}\{2^{j-1}(2m+1),{\dots},2^{j-1}(2m+1)+2^{j-1}-1\}\,,$$

which means that

$$B_{j}^{c}=\bigcup_{m\in\mathbb{N}\cup\{0\}}\{2^{j}m,{\dots},2^{j}m+2^{j-1}-1\}\,,$$

and so \(\mu_{R}(B_{j})=\frac{1}{2}\). Moreover, for every choice of \(j,k\in\mathbb{N}\) with \(j<k\), we have \(\mu_{R}(B_{j}\cap B_{k})=\mu_{R}(B_{j})\mu_{R}(B_{k})\), which can be proven using Lemma 2.

Definition 2

Let \((A_{j})_{j\in J}\) be a family of relatively measurable subsets of \(\mathbb{N}\). We say that \((A_{j})_{j\in J}\) are independent if and only if for every \(m\in\mathbb{N}\) and every subset \(I\subseteq J\) of cardinality \(m\),

$$\mu_{R}\Big(\bigcap_{i\in I}A_{i}\Big)=\prod_{i\in I}\mu_{R}(A_{i})\,.$$

Summarizing the preceding thoughts, we obtain the following result.

Proposition 2

  1. For \(n\in\mathbb{N}\) and \(k\in\{1,{\dots},n\}\), let \(A_{n,k}:=\{jn+k\colon j\in\mathbb{N}\cup\{0\}\}\). Then the family \(\big(A_{p,p}\big)_{p\in\mathbb{N},\,p\text{ prime}}\) is independent.

  2. For every \(j\in\mathbb{N}\) let \(B_{j}=\bigcup_{m\in\mathbb{N}\cup\{0\}}\{2^{j-1}(2m+1),{\dots},2^{j-1}(2m+1)+2^{j-1}-1\}\). Then the family \(\big(B_{j}\big)_{j\in\mathbb{N}}\) is independent.

It is quite interesting that similar results to the ones for expansions of real numbers in \([0,1)\) with respect to the Lebesgue measure can be obtained for the expansion of natural numbers with respect to the relative measure on \(\mathbb{N}\).

3.2 Relatively measurable sequences and their distribution

In this subsection we shall introduce the notion of a relatively measurable sequence and, in broad similarity to the way independence is defined in the sense of Kolmogorov, we introduce the notion of relatively independent sequences \(x,y\colon\mathbb{N}\to\mathbb{R}\) and define a distribution function with respect to relative measures. As we shall see, such a distribution function does not possess all the properties that—coming from probability theory—we might expect it to have.

Definition 3 (Relatively measurable sequence)

A sequence \(x\colon\mathbb{N}\to\mathbb{R}\) is said to be relatively measurable if and only if the pre-image

$$x^{-1}(I):=\big\{n\in\mathbb{N}\colon x_{n}\in I\big\}$$

of each interval \(I\subseteq\mathbb{R}\) under \(x\) is a relatively measurable subset of \(\mathbb{N}\).

To us an interval means a convex subset of \(\mathbb{R}\); in particular, singleton sets are intervals. Natural examples of relatively measurable sequences are indicator functions of relatively measurable sets and their finite sums.

We shall now introduce what it means for two sequences to be independent with respect to a relative measure. This is again done via a product rule.

Definition 4 (Independent sequences)

Two relatively measurable sequences \(x,y\colon\mathbb{N}\to\mathbb{R}\) are said to be \(\mu_{R}\)-independent if and only if for any two intervals \(I,J\subseteq\mathbb{R}\) we have

$$\mu_{R}\big(x^{-1}(I)\cap y^{-1}(J)\big)=\mu_{R}\big(x^{-1}(I)\big)\,\mu_{R}\big(y^{-1}(J)\big)\,.$$

This definition can be generalized in an obvious way to any finite number of relatively measurable sequences.

We now turn to the definition of a (relative) distribution function of a relatively measurable sequence.

Definition 5 (Distribution function)

Let \(x\colon\mathbb{N}\to\mathbb{R}\) be a relatively measurable sequence. Then the function

$$F_{x}:\mathbb{R}\to[0,1],\quad F_{x}(z):=\mu_{R}\Big(\big\{n\in\mathbb{N}\,:\,x_{n}\in(-\infty,z]\big\}\Big)$$

is called the (relative) distribution function of \(x\).

By its very definition such a distribution function resembles a classical distribution function we know from probability theory. In particular, it is immediately clear that it is non-decreasing. However, in general not all properties we may expect from a relative distribution function have to hold.

Example 2

Consider the sequence \(x:\mathbb{N}\to\mathbb{R}\) given by

$$x_{n}:=\begin{cases}-n&\text{ if }n=4k\text{ for some }k\in\mathbb{N}_{0}\\ 0&\text{ if }n=4k+1\text{ for some }k\in\mathbb{N}_{0}\\ \frac{1}{n}&\text{ if }n=4k+2\text{ for some }k\in\mathbb{N}_{0}\\ n&\text{ if }n=4k+3\text{ for some }k\in\mathbb{N}_{0}\,.\end{cases}$$

Then it is easy to see that \(x\) is relatively measurable and that its relative distribution function is given by

$$F_{x}(z)=\frac{1}{4}1_{(-\infty,0)}(z)+\frac{2}{4}1_{\{0\}}(z)+\frac{3}{4}1_{(0,\infty)}(z)\,.$$

Hence, \(F_{x}\) is neither left nor right continuous, and we have

$$\lim_{z\to-\infty}F_{x}(z)> 0\qquad\text{and}\qquad\lim_{z\to\infty}F_{x}(z)<1\,.$$
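A quick numerical check of this example (our own illustration; the truncation \(N\) is arbitrary) exhibits the two-sided jump at 0:

```python
import numpy as np

# First N terms of the sequence from Example 2; the empirical distribution
# function near 0 shows F_x(0-) ~ 1/4, F_x(0) ~ 1/2, F_x(0+) ~ 3/4.
N = 400_000
n = np.arange(1, N + 1)
x = np.where(n % 4 == 0, -n.astype(float),
    np.where(n % 4 == 1, 0.0,
    np.where(n % 4 == 2, 1.0 / n, n.astype(float))))
for z in (-0.01, 0.0, 0.01):
    print(f"F_x({z:+.2f}) ~ {np.mean(x <= z):.4f}")
```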

Note however that for every bounded relatively measurable sequence \(x\),

$$\lim_{z\to-\infty}F_{x}(z)=0\qquad\text{and}\qquad\lim_{z\to\infty}F_{x}(z)=1\,.$$

Next we introduce and study the notion of an average of a relatively measurable sequence.

Definition 6 (Relative average)

Let \(x:\mathbb{N}\to\mathbb{R}\) be a relatively measurable sequence. Then we define the relative average of \(x\) by

$$M(x):=\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}x_{n}\,,$$

whenever this limit exists.

The following theorem shows that the relative average of a relatively measurable and bounded sequence can be written in terms of a Stieltjes integral with respect to the relative distribution function.

Theorem 4

Let \(x:\mathbb{N}\to\mathbb{R}\) be a relatively measurable and bounded sequence. Then \(M(x)\) exists and

$$M(x)=\int_{-\infty}^{\infty}z\,\text{d}F_{x}(z)\,.$$
(4)

Proof

In this proof we simply write \(F\) instead of \(F_{x}\). By assumption there exists some \(K\in(0,\infty)\) such that \(-K+1\leq x_{n}\leq K\) for every \(n\in\mathbb{N}\). The Stieltjes integral exists since the function \(\text{id}\,\colon[-K,K]\to\mathbb{R}\), \(z\mapsto z\) is continuous and \(F\) is monotone on \([-K,K]\) and constant on the intervals \((-\infty,-K]\) and \([K,\infty)\). Therefore, given \(\varepsilon> 0\) there exists a partition \(Z=\{-K=t_{0}<t_{1}<{\ldots}<t_{m}=K\}\) of \([-K,K]\) such that \(O(\text{id}\,,F,Z)-U(\text{id}\,,F,Z)<\varepsilon\), where \(U\) and \(O\) denote the lower and upper Riemann-Stieltjes sums, i.e.,

$$\begin{aligned} U(\text{id}\,,F,Z)&=\sum_{k=1}^{m}t_{k-1}\big(F(t_{k})-F(t_{k-1})\big)\quad\text{ and}\\ O(\text{id}\,,F,Z)&=\sum_{k=1}^{m}t_{k}\big(F(t_{k})-F(t_{k-1})\big)\,. \end{aligned}$$

We observe that

$$\begin{aligned} \limsup_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}x_{n} &= \sum_{k=1}^{m}\limsup_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}1_{(t_{k-1},t_{k}]}(x_{n})x_{n}\\ &\leq\sum_{k=1}^{m}\limsup_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}1_{(t_{k-1},t_{k}]}(x_{n})t_{k}\\ &\leq\sum_{k=1}^{m}t_{k}\big(F(t_{k})-F(t_{k-1})\big)=O(\text{id}\,,F,Z)\,. \end{aligned}$$

Similarly one can show that \(\liminf_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}x_{n}\geq U(\text{id}\,,F,Z)\), which then proves the assertion. \(\square\)

It follows from the properties of Riemann-Stieltjes integrals that

$$M(x)=\int_{-\infty}^{\infty}zF_{x}^{\prime}(z)\,\text{d}z\,,$$
(5)

whenever \(F_{x}\) is differentiable with derivative \(F_{x}^{\prime}\) outside some at most finite subset of \(\mathbb{R}\).

Remark 4

We see that relatively measurable sequences behave in many ways like random variables, and indeed a relatively measurable sequence can be taken as a mathematical model for a “random number”. As noted before, this kind of model was put forward by the Austrian mathematician Richard von Mises in the first half of the 20th century. It was–at least among the vast majority of probabilists–replaced by Kolmogorov’s approach, mainly because of the potent tools from Lebesgue’s measure theory and the accompanying clean and simple concepts and theorems of convergence.

Nevertheless there is a certain appeal to the alternative, in particular its slim theoretical foundation. Within this approach one can simply state that a real number is a Cauchy sequence of rational numbers and a random number is a relatively measurable sequence of real numbers.

We now assign to every \(\mathbb{Z}\)-valued and relatively measurable sequence \(x\) a function \(\rho_{x}\colon\mathbb{Z}\to[0,1]\) via

$$\rho_{x}(k):=\mu_{R}\big(\{n\in\mathbb{N}\colon x_{n}=k\}\big)\,.$$

Then for bounded, \(\mathbb{Z}\)-valued and relatively measurable sequences we have \(\sum_{k\in\mathbb{Z}}\rho_{x}(k)=1\) and the well-known convolution formula:

Proposition 3

Let \(x,y:\mathbb{N}\to\mathbb{R}\) be bounded and relatively measurable sequences taking values in \(\mathbb{Z}\). If \(x\) and \(y\) are \(\mu_{R}\)-independent, then \(\rho_{x+y}=\rho_{x}\ast\rho_{y}\), where

$$\rho_{x}\ast\rho_{y}(k):=\sum_{j\in\mathbb{Z}}\rho_{x}(j)\rho_{y}(k-j)\,,\quad k\in\mathbb{Z}\,.$$

All in all, we can say that relatively measurable sequences behave in many ways like random variables. For instance, the indicator functions of the sets \(B_{j}\) introduced after Remark 3 form a family of \(\mu_{R}\)-independent, relatively measurable, bounded, and \(\mathbb{Z}\)-valued sequences. Therefore, their sums satisfy

$$\rho_{1_{B_{1}}+{\ldots}+1_{B_{m}}}(k)={m\choose k}2^{-m}\,,\quad k\in\mathbb{Z}\,.$$

This means that the partial sums of the indicator functions of the sets \(B_{j}\) satisfy the central limit theorem of de Moivre (Theorem 1), i.e., for any \(a,b\in\mathbb{R}\) with \(a<b\),

$$\begin{aligned}\begin{array}[]{l}{\lim_{m\to\infty}\mu_{R}\bigg(\bigg\{n\in\mathbb{N}\colon a\leq\frac{\sum_{j=1}^{m}1_{B_{j}}(n)-\frac{m}{2}}{\sqrt{\frac{m}{4}}}\leq b\bigg\}\bigg)}\\ =\lim_{m\to\infty}\sum_{k=0}^{m}{m\choose k}2^{-m}1_{[a,b]}\Big(\frac{k-\frac{m}{2}}{\sqrt{\frac{m}{4}}}\Big)=\frac{1}{\sqrt{2\pi}}\int_{a}^{b}e^{-\frac{x^{2}}{2}}\,\text{d}x\,.\end{array}\end{aligned}$$
(6)

Note again that the set considered above is indeed relatively measurable. To see this, note that the sequence \(1_{B_{1}}+{\ldots}+1_{B_{m}}\) is periodic with period \(2^{m}\), so each of its level sets is relatively measurable by Lemma 2, and hence so is the finite disjoint union of level sets appearing in (6).
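As a numerical cross-check of the binomial density above (our own illustration; \(N\) and \(m\) are arbitrary choices):

```python
import numpy as np
from math import comb

# Empirical density of 1_{B_1} + ... + 1_{B_m} on {1,...,N} versus the
# binomial weights C(m,k) 2^(-m) predicted above.
N, m = 1_000_000, 6
n = np.arange(1, N + 1)
s = sum((n >> (j - 1)) & 1 for j in range(1, m + 1))   # sum of first m digits
for k in range(m + 1):
    print(k, round(np.mean(s == k), 5), comb(m, k) / 2**m)
```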

Thus, for the binary expansion of natural numbers we have the same central limit theorem as for the binary expansion of real numbers in \([0,1)\). In fact, we can now formulate a quite interesting version of this, which can be found, for example, in [18]. Contrary to almost all numbers in \([0,1)\), every natural number has a finite expansion and hence it is reasonable to define for \(n\in\mathbb{N}\) its sum-of-digits function with respect to the binary expansion,

$$s_{2}(n):=\sum_{j=1}^{\lfloor\log_{2}(n)\rfloor+1}1_{B_{j}}(n)=\sum_{j=1}^{\infty}1_{B_{j}}(n)\,,\quad n\in\mathbb{N}\,.$$

The following result describes the Gaussian fluctuations of the sum-of-digits function.

Theorem 5 (Central limit theorem for the sum-of-digits function)

For all \(b\in\mathbb{R}\), we have

$$\mu_{R}\Big(\Big\{n\in\mathbb{N}\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}\log_{2}(n)}+\tfrac{1}{2}\log_{2}(n)\Big\}\Big)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{b}e^{-\frac{x^{2}}{2}}\,\text{d}x\,.$$
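Before turning to the proof, here is a small numerical sketch of this theorem (our own illustration; the cutoff \(N\) is arbitrary and, since the normalization is of order \(\sqrt{\log_{2}n}\), the convergence is very slow, so the agreement is only rough):

```python
import numpy as np
from math import erf, sqrt

# Empirical frequency of s_2(n) <= b*sqrt(log2(n)/4) + log2(n)/2 for n < N,
# compared with Phi(b).
N = 2**22
n = np.arange(1, N)
s2 = np.zeros_like(n)
for j in range(22):                       # add up the binary digits of n
    s2 += (n >> j) & 1
log2n = np.log2(n)
Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))
for b in (-1.0, 0.0, 1.0):
    freq = np.mean(s2 <= b * np.sqrt(log2n / 4) + log2n / 2)
    print(f"b = {b:+.1f}: empirical {freq:.4f}, Phi(b) = {Phi(b):.4f}")
```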

We recall the following lemma from probability theory.

Lemma 3

Let \(F:\mathbb{R}\to[0,1]\) be a continuous cumulative distribution function and let \((F_{n})_{n\in\mathbb{N}}\) be a sequence of non-decreasing functions \(F_{n}\colon\mathbb{R}\to[0,1]\) with \(\lim_{n\to\infty}F_{n}(x)=F(x)\) for all \(x\in\mathbb{R}\). Then \(F_{n}\to F\) uniformly on \(\mathbb{R}\).

We are now able to prove the central limit theorem for the sum-of-digits function.

Proof (Proof of Theorem 5)

Let \(\varepsilon\in(0,\infty)\). For \(b\in\mathbb{R}\) let us write \(\Phi(b):=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{b}e^{-\frac{x^{2}}{2}}\,\text{d}x\). It follows from de Moivre’s central limit theorem (see Eq. (6)) and Lemma 3 that there exists \(m_{0}\in\mathbb{N}\) such that for all \(m\geq m_{0}\) and every \(b\in\mathbb{R}\),

$$-\frac{\varepsilon}{6}<\mu_{R}\bigg(\bigg\{n\in\mathbb{N}\colon\frac{\sum_{j=1}^{m}1_{B_{j}}(n)-\frac{m}{2}}{\sqrt{\frac{m}{4}}}\leq b\bigg\}\bigg)-\Phi(b)<\frac{\varepsilon}{6}\,.$$

Moreover, for each \(m\geq m_{0}\), we have

$$\left|\bigg\{0\leq n<2^{m}\colon s_{2}(n)=k\bigg\}\right|=\left|\bigg\{0\leq n<2^{m}\colon\sum_{j=1}^{m}1_{B_{j}}(n)=k\bigg\}\right|={m\choose k}$$

and therefore,

$$2^{-m}\Big|\Big\{0\leq n<2^{m}\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}m}+\tfrac{1}{2}m\Big\}\Big|\in\Big(\Phi(b)-\tfrac{\varepsilon}{6},\Phi(b)+\tfrac{\varepsilon}{6}\Big)\,.$$

Now let \(\ell\in\mathbb{N}\) with \(2^{-\ell}<\tfrac{\varepsilon}{3}\) and \(j\in\{1,{\ldots},2^{\ell}\}\). For every \(m\geq\ell+m_{0}\),

$$\begin{aligned} & {\tfrac{1}{j2^{m-\ell}}\Big|\Big\{2^{m}\leq n<2^{m}+j2^{m-\ell}\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}\log_{2}(n)}+\tfrac{1}{2}\log_{2}(n)\Big\}\Big|}\\ &\geq\tfrac{1}{j2^{m-\ell}}\Big|\Big\{2^{m}\leq n<2^{m}+j2^{m-\ell}\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}m}+\tfrac{1}{2}m\Big\}\Big|\\ &=\sum_{i=1}^{j}\tfrac{2^{m-\ell}}{j2^{m-\ell}}2^{-(m-\ell)}\Big|\Big\{0\leq n<2^{m-\ell}\colon s_{2}(n)\leq b\sqrt{\tfrac{m}{4}}+\tfrac{m}{2}-s_{2}(i)\Big\}\Big|\,.\end{aligned}$$

Since \(m-\ell\geq m_{0}\),

$$\begin{aligned}&{2^{-(m-\ell)}\Big|\Big\{0\leq n<2^{m-\ell}\colon s_{2}(n)\leq b\sqrt{\tfrac{m}{4}}+\tfrac{m}{2}-s_{2}(i)\Big\}\Big|}\\ &\geq\Phi\Big(b\sqrt{\tfrac{m}{m-\ell}}+\tfrac{\ell-2s_{2}(i)}{\sqrt{m-\ell}}\Big)-\tfrac{\varepsilon}{6}\geq\Phi\Big(b\sqrt{\tfrac{m}{m-\ell}}-\tfrac{\ell}{\sqrt{m-\ell}}\Big)-\tfrac{\varepsilon}{6}\,.\end{aligned}$$

Therefore,

$$\begin{aligned}\displaystyle&\displaystyle\tfrac{1}{j2^{m-\ell}}\Big|\Big\{2^{m}\leq n<2^{m}+j2^{m-\ell}\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}\log_{2}(n)}+\tfrac{1}{2}\log_{2}(n)\Big\}\Big|\\ \displaystyle&\displaystyle\geq\Phi\Big(b\sqrt{\tfrac{m}{m-\ell}}-\tfrac{\ell}{\sqrt{m-\ell}}\Big)-\tfrac{\varepsilon}{6}\,,\end{aligned}$$

and in the same way,

$$\begin{aligned} &\tfrac{1}{j2^{m-\ell}}\Big|\Big\{2^{m}\leq n<2^{m}+j2^{m-\ell}\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}\log_{2}(n)}+\tfrac{1}{2}\log_{2}(n)\Big\}\Big|\\ &\quad=\tfrac{1}{j2^{m-\ell}}\Big|\Big\{2^{m}\leq n<2^{m}+j2^{m-\ell}\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}(m+1)}+\tfrac{1}{2}(m+1)\Big\}\Big|\\ &\quad\leq\Phi\Big(b\sqrt{\tfrac{m+1}{m-\ell}}+\tfrac{1+\ell}{\sqrt{m-\ell}}\Big)+\tfrac{\varepsilon}{6}\,. \end{aligned}$$

Now for fixed \(b\in\mathbb{R}\) there exists \(m_{1}\in\mathbb{N}\) with \(m_{1}\geq m_{0}+\ell\) such that for all \(m\geq m_{1}\)

$$\begin{aligned} \Phi(b)-\tfrac{\varepsilon}{3} &< \tfrac{1}{j2^{m-\ell}}\Big|\Big\{2^{m}\leq n<2^{m}+j2^{m-\ell}\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}\log_{2}(n)}+\tfrac{1}{2}\log_{2}(n)\Big\}\Big|\\ &<\Phi(b)+\tfrac{\varepsilon}{3}\,. \end{aligned}$$

Note that this equation holds in particular for \(j=2^{\ell}\), so that

$$\begin{aligned}\displaystyle\Phi(b)-\tfrac{\varepsilon}{3}<\tfrac{1}{2^{m}}\Big|\Big\{2^{m}\leq n<2^{m+1}\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}\log_{2}(n)}+\tfrac{1}{2}\log_{2}(n)\Big\}\Big|&\displaystyle<\Phi(b)+\tfrac{\varepsilon}{3}\,.\end{aligned}$$

Now let \(N> 2^{m_{1}}\tfrac{3}{\varepsilon}\), and let \(m=\lfloor\log_{2}(N)\rfloor\). Then \(2^{m}+(j-1)2^{m-\ell}\leq N<2^{m}+j2^{m-\ell}\) for some \(j\in\{1,{\dots},2^{\ell}\}\), and we obtain

$$\begin{aligned} & {\tfrac{1}{N}\Big|\Big\{0\leq n<N\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}\log_{2}(n)}+\tfrac{1}{2}\log_{2}(n)\Big\}\Big|}\\ &=\tfrac{1}{N}\Big|\Big\{0\leq n<2^{m_{1}}\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}\log_{2}(n)}+\tfrac{1}{2}\log_{2}(n)\Big\}\Big|\\ &\quad+\sum_{k=m_{1}}^{m-1}\tfrac{2^{k}}{N}\tfrac{1}{2^{k}}\Big|\Big\{2^{k}\leq n<2^{k+1}\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}\log_{2}(n)}+\tfrac{1}{2}\log_{2}(n)\Big\}\Big|\\ &\quad+1_{\{j\neq 1\}}\tfrac{(j-1)2^{m-\ell}}{N}\tfrac{1}{(j-1)2^{m-\ell}}\Big|\Big\{2^{m}\leq n<2^{m}+(j-1)2^{m-\ell}\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}\log_{2}(n)}+\tfrac{1}{2}\log_{2}(n)\Big\}\Big|\\ &\quad+\tfrac{1}{N}\Big|\Big\{2^{m}+(j-1)2^{m-\ell}\leq n<N\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}\log_{2}(n)}+\tfrac{1}{2}\log_{2}(n)\Big\}\Big|\\ &\leq\tfrac{\varepsilon}{3}+\tfrac{1}{N}\sum_{k=0}^{m-1}2^{k}\big(\Phi(b)+\tfrac{\varepsilon}{3}\big)+1_{\{j\neq 1\}}\tfrac{(j-1)2^{m-\ell}}{N}\big(\Phi(b)+\tfrac{\varepsilon}{3}\big)+2^{m-\ell}\tfrac{1}{N}\\ &<\tfrac{\varepsilon}{3}+\tfrac{2^{m}+(j-1)2^{m-\ell}}{N}\big(\Phi(b)+\tfrac{\varepsilon}{3}\big)+\tfrac{\varepsilon}{3}\leq\Phi(b)+\varepsilon\,, \end{aligned}$$

where we have used that since \(2^{-\ell}<\tfrac{\varepsilon}{3}\), we also have \(2^{m-\ell}\tfrac{1}{N}\leq 2^{m-\ell}\tfrac{1}{2^{m}}<\tfrac{\varepsilon}{3}\). In the same way we get

$$\begin{aligned} &{\tfrac{1}{N}\Big|\Big\{0\leq n<N\colon s_{2}(n)\leq b\sqrt{\tfrac{1}{4}\log_{2}(n)}+\tfrac{1}{2}\log_{2}(n)\Big\}\Big|}\\ &\geq\tfrac{1}{N}\sum_{k=m_{1}}^{m-1}2^{k}\left(\Phi(b)-\tfrac{\varepsilon}{3}\right)+1_{\{j\neq 1\}}\tfrac{(j-1)2^{m-\ell}}{N}\left(\Phi(b)-\tfrac{\varepsilon}{3}\right)\\ &=\tfrac{2^{m}-2^{m_{1}}+(j-1)2^{m-\ell}}{N}\big(\Phi(b)-\tfrac{\varepsilon}{3}\big)\\ &=\left(1-\tfrac{N-2^{m}+2^{m_{1}}-(j-1)2^{m-\ell}}{N}\right)\left(\Phi(b)-\tfrac{\varepsilon}{3}\right)\\ &\geq\Phi(b)-\tfrac{2^{m_{1}}}{N}-\tfrac{\varepsilon}{3}-\tfrac{N-2^{m}-(j-1)2^{m-\ell}}{N}\\ &>\Phi(b)-2\tfrac{\varepsilon}{3}-\tfrac{2^{m-\ell}}{N}> \Phi(b)-\varepsilon\,,\end{aligned}$$

which proves the result. \(\square\)

3.3 Uniform distribution mod 1 and Weyl’s theorem

In this section we address a famous theorem of Hermann Weyl (9. November 1885 in Elmshorn; 8. December 1955 in Zürich). Before we start, let us remind the reader that the fractional part of a number \(x\in\mathbb{R}\) is defined as

$$\{x\}:=x-\lfloor x\rfloor$$

where

$$\lfloor x\rfloor:=\max\{k\in\mathbb{Z}\,:\,k\leq x\}\,.$$

If we are given a sequence \(x:\mathbb{N}\to\mathbb{R}\) and a set \(B\subseteq[0,1)\), then we define another set by setting

$$A_{x,B}:=\big\{n\in\mathbb{N}\,:\,\{x_{n}\}\in B\big\}\,.$$

The sequence \(x=(x_{n})_{n\in\mathbb{N}}\) is said to be uniformly distributed modulo 1 (we simply write mod 1) if and only if for all \(a,b\in\mathbb{R}\) with \(0\leq a<b\leq 1\), we have

$$\mu_{R}\big(A_{x,[a,b)}\big)=b-a\,.$$

In particular, this means that for each uniformly distributed sequence \((x_{n})_{n\in\mathbb{N}}\) the sequence \((\{x_{n}\})_{n\in\mathbb{N}}\) is relatively measurable.

Weyl’s theorem [51, 52], also known as Weyl’s criterion, says that a sequence \((x_{n})_{n\in\mathbb{N}}\) of real numbers is uniformly distributed mod 1 if and only if for every \(h\in\mathbb{Z}\setminus\{0\}\) the following condition is satisfied,

$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}e^{2\pi ihx_{n}}=0\,.$$
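The criterion is easy to observe numerically. In the following minimal sketch (our own illustration; the values of \(\alpha\), \(h\) and \(N\) are arbitrary) the exponential sums are small for \(\alpha=\sqrt{2}\), while for the rational \(\alpha=3/7\) the failure of equidistribution is detected at \(h=7\):

```python
import numpy as np

# Weyl sums (1/N) sum_{n<=N} e^{2 pi i h alpha n} for an irrational and a
# rational alpha; for 3/7 the sum with h = 7 stays at 1.
N = 1_000_000
n = np.arange(1, N + 1)
for alpha, label in [(np.sqrt(2.0), "sqrt(2)"), (3.0 / 7.0, "3/7")]:
    for h in (1, 7):
        S = np.exp(2j * np.pi * h * alpha * n).mean()
        print(f"alpha = {label}, h = {h}: |S| = {abs(S):.6f}")
```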

In an extended and multivariate version this theorem reads as follows.

Theorem 6

Let \(m\in\mathbb{N}\) and consider sequences \(x^{1},{\ldots},x^{m}:\mathbb{N}\to\mathbb{R}\). Then the following are equivalent:

  1. Every sequence \(x^{k}\), \(k\in\{1,{\ldots},m\}\), is uniformly distributed mod 1 and \(\{x^{1}\},{\ldots},\{x^{m}\}\) are \(\mu_{R}\)-independent;

  2. For each \(m\)-tuple \((h_{1},{\ldots},h_{m})\in\mathbb{Z}^{m}\setminus\{0\}\),

    $$\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}e^{2\pi i(h_{1}x^{1}_{n}+{\ldots}+h_{m}x^{m}_{n})}=0\,;$$

  3. For every continuous function \(\psi\colon[0,1]^{m}\to\mathbb{R}\),

    $$\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}\psi(\{x^{1}_{n}\},{\ldots},\{x^{m}_{n}\})=\int_{[0,1]^{m}}\psi(z_{1},{\ldots},z_{m})\,\text{d}z_{1}{\ldots}\text{d}z_{m}\,;$$

  4. For every Riemann integrable function \(\psi\colon[0,1]^{m}\to\mathbb{R}\),

    $$\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}\psi(\{x^{1}_{n}\},{\ldots},\{x^{m}_{n}\})=\int_{[0,1]^{m}}\psi(z_{1},{\ldots},z_{m})\,\text{d}z_{1}{\ldots}\text{d}z_{m}\,.$$

An important consequence is that for each \(\alpha\in\mathbb{R}\) the sequence \((n\alpha)_{n\in\mathbb{N}}\) is uniformly distributed mod 1 if and only if \(\alpha\) is irrational, and that for \(\alpha_{1},{\ldots},\alpha_{m}\in\mathbb{R}\) the sequences \(\{\alpha_{1}n\}_{n\in\mathbb{N}},{\ldots},\{\alpha_{m}n\}_{n\in\mathbb{N}}\) are uniformly distributed mod 1 and \(\mu_{R}\)-independent if and only if \(1,\alpha_{1},{\ldots},\alpha_{m}\) are linearly independent over \(\mathbb{Q}\).

Remark 5

Theorem 6 is also of practical interest, as it provides us with a method for numerical integration of a Riemann integrable function \(\psi\) on \([0,1]^{m}\). Note that, if we only know that the coordinate sequences are uniformly distributed mod 1 and \(\mu_{R}\)-independent, we cannot say anything about the speed of convergence of the sums towards the integral.

The concept of discrepancy of a sequence measures the speed with which a sequence in \([0,1)^{m}\) approaches the uniform distribution on \([0,1)^{m}\). Sequences with a “high” speed of convergence are informally called low-discrepancy sequences and give rise to a class of numerical integration algorithms called quasi-Monte Carlo methods. For more information about these sequences and algorithms see [17, 19, 38, 40].
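To illustrate the numerical-integration idea from Remark 5, here is a minimal sketch (our own illustration, not taken from the cited references; the test function, the frequencies and \(N\) are arbitrary choices): it integrates \(\psi(z_{1},z_{2})=z_{1}z_{2}\) over \([0,1]^{2}\) along the sequence \((\{n\sqrt{2}\},\{n\sqrt{3}\})_{n}\), which is admissible since \(1,\sqrt{2},\sqrt{3}\) are linearly independent over \(\mathbb{Q}\).

```python
import numpy as np

# Quasi-Monte Carlo style average of psi along ({n*sqrt(2)}, {n*sqrt(3)});
# by the Weyl theorem it converges to the integral of psi over [0,1]^2.
N = 100_000
n = np.arange(1, N + 1)
u = (n * np.sqrt(2.0)) % 1.0
v = (n * np.sqrt(3.0)) % 1.0
psi = lambda z1, z2: z1 * z2            # exact integral over [0,1]^2 is 1/4
print(psi(u, v).mean(), "vs exact", 0.25)
```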

Definition 7 (Finitely measurable function)

We say that a function \(g\colon I\to\mathbb{R}\) is finitely measurable if and only if the pre-image of each interval \(J\subset\mathbb{R}\) under \(g\) can be written as the union of finitely many subintervals, i.e., there exist \(k\in\mathbb{N}\) and subintervals \(I_{1},{\ldots},I_{k}\) of \(I\) such that

$$g^{-1}(J)=I_{1}\cup{\ldots}\cup I_{k}\,.$$

Examples of finitely measurable functions are the monotone functions and the functions \(g\) with the following so-called Dirichlet property:

A function \(g\colon[a,b]\to\mathbb{R}\) is said to have the Dirichlet property if and only if it is continuous on \([a,b]\) and has only finitely many local extreme points.

A concrete example of a finitely measurable function is thus \(\cos(2\pi\cdot)\colon[0,1]\to\mathbb{R}\), \(z\mapsto\cos(2\pi z)\).

Proposition 4

Let \(m\in\mathbb{N}\) and \(x^{1},{\ldots},x^{m}:\mathbb{N}\to\mathbb{R}\) be sequences. Consider finitely measurable functions \(g^{1},{\ldots},g^{m}\colon\mathbb{R}\to\mathbb{R}\). If \(x^{1},{\ldots},x^{m}\) are relatively measurable and \(\mu_{R}\)-independent, then the sequences \(g^{1}(x^{1}),{\ldots},g^{m}(x^{m})\) are relatively measurable and \(\mu_{R}\)-independent.

The previous result, whose proof is left to the reader, has the following interesting corollary.

Corollary 2

Let \(1,\alpha_{1},{\dots},\alpha_{m}\in\mathbb{R}\) be linearly independent over \(\mathbb{Q}\). Then the sequences \(\big(\cos(2\pi\alpha_{1}n)\big)_{n\in\mathbb{N}},{\ldots},\big(\cos(2\pi\alpha_{m}n)\big)_{n\in\mathbb{N}}\) are relatively measurable and \(\mu_{R}\)-independent.

Proof

We have already concluded, as a consequence of Weyl’s theorem, that the sequences \(\{\alpha_{1}n\}_{n\in\mathbb{N}},{\ldots},\{\alpha_{m}n\}_{n\in\mathbb{N}}\) are uniformly distributed mod 1 and \(\mu_{R}\)-independent. Hence, by Proposition 4 the sequences

$$\big(\cos(2\pi\{\alpha_{1}n\})\big)_{n\in\mathbb{N}},{\ldots},\big(\cos(2\pi\{\alpha_{m}n\})\big)_{n\in\mathbb{N}}$$

are \(\mu_{R}\)-independent as well, and thus so are the sequences

$$\big(\cos(2\pi\alpha_{1}n)\big)_{n\in\mathbb{N}},{\ldots},\big(\cos(2\pi\alpha_{m}n)\big)_{n\in\mathbb{N}}\,.$$

\(\square\)

Proposition 5

Let \(x,y:\mathbb{N}\to\mathbb{R}\) be bounded and relatively measurable sequences with continuous and increasing distribution functions \(F_{x}\) and \(F_{y}\), respectively. If \(x\) and \(y\) are \(\mu_{R}\)-independent, then the distribution function \(F_{x+y}\) of \(x+y\) is given by the convolution of \(F_{x}\) and \(F_{y}\), i.e.,

$$F_{x+y}(z)=F_{x}*F_{y}(z)=\int_{-\infty}^{\infty}F_{x}(z-\eta)\text{d}F_{y}(\eta)=\int_{-\infty}^{\infty}F_{y}(z-\xi)\text{d}F_{x}(\xi)\,.$$

Proof

It is comparatively easy to see that the sequences \((F_{x}(x_{n}))_{n\in\mathbb{N}}\) and \((F_{y}(y_{n}))_{n\in\mathbb{N}}\) are uniformly distributed mod 1. Proposition 4 implies that they are \(\mu_{R}\)-independent. Observe that the restriction of \(F_{x}\) to the closure of \(\{t\in\mathbb{R}\colon F_{x}(t)\in(0,1)\}\) is continuous and increasing and therefore has an inverse, which we denote by \(G_{x}\). Denote by \(G_{y}\) the corresponding inverse function of \(F_{y}\). We have

$$\begin{aligned} \mu_{R}(x+y\leq z) &= \lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}1_{(-\infty,z]}(x_{n}+y_{n})\\ &=\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}1_{(-\infty,z]}\Big(G_{x}\big(F_{x}(x_{n})\big)+G_{y}\big(F_{y}(y_{n})\big)\Big)\\ &\overset{(*)}{=}\int_{[0,1]^{2}}1_{(-\infty,z]}\big(G_{x}(\xi)+G_{y}(\eta)\big)\,\text{d}\xi\,\text{d}\eta\\ &=\int_{\mathbb{R}^{2}}1_{(-\infty,z]}(\xi+\eta)\,\text{d}F_{x}(\xi)\,\text{d}F_{y}(\eta)\\ &=\int_{-\infty}^{\infty}\int_{-\infty}^{z-\eta}\text{d}F_{x}(\xi)\,\text{d}F_{y}(\eta)=\int_{-\infty}^{\infty}F_{x}(z-\eta)\,\text{d}F_{y}(\eta)\,, \end{aligned}$$

where we have used in \((*)\) that \((F_{x}(x_{n}))_{n\in\mathbb{N}}\) and \((F_{y}(y_{n}))_{n\in\mathbb{N}}\) are uniformly distributed mod 1 and independent. \(\square\)

If we consider, for instance, the sequence \(x=\big(\cos(2\pi\alpha n)\big)_{n\in\mathbb{N}}\) with irrational \(\alpha\), then, since \((\alpha n)_{n\in\mathbb{N}}\) is uniformly distributed mod 1,

$$\begin{aligned} F_{x}(z) &= \mu_{R}(x\leq z)=\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}1_{(-\infty,z]}\big(\cos(2\pi\alpha n)\big)\\ &=\int_{0}^{1}1_{(-\infty,z]}\big(\cos(2\pi\xi)\big)\,\text{d}\xi\\ &=2\int_{0}^{\frac{1}{2}}1_{(-\infty,z]}\big(\cos(2\pi\xi)\big)\,\text{d}\xi=\frac{1}{\pi}\int_{1}^{-1}1_{(-\infty,z]}(\eta)\,\arccos^{\prime}(\eta)\,\text{d}\eta\\ &=\frac{1}{\pi}\int_{-1}^{1}1_{(-\infty,z]}(\eta)\,\arcsin^{\prime}(\eta)\,\text{d}\eta=1_{[-1,1]}(z)\Big(\frac{1}{2}+\frac{1}{\pi}\arcsin(z)\Big)+1_{(1,\infty)}(z)\,. \end{aligned}$$

This means that the distribution function of the sequence \(\Big(\cos(2\pi\alpha_{1}n)+{\ldots}+\cos(2\pi\alpha_{m}n)\Big)_{n\in\mathbb{N}}\) is given by \(F_{x}^{*m}\). Therefore, we obtain a central limit theorem for partial sums of cosines with linearly independent frequencies, i.e., with \(1,\alpha_{1},\alpha_{2},\dots\) linearly independent over \(\mathbb{Q}\),

$$\begin{aligned} &\lim_{m\to\infty}\mu_{R}\bigg(\bigg\{n\in\mathbb{N}\colon a\leq\frac{\cos(2\pi\alpha_{1}n)+{\ldots}+\cos(2\pi\alpha_{m}n)}{\sqrt{m/2}}\leq b\bigg\}\bigg)\\ &=\frac{1}{\sqrt{2\pi}}\int_{a}^{b}e^{-\frac{\xi^{2}}{2}}\text{d}\xi\,. \end{aligned}$$
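A numerical sketch of this central limit theorem (our own illustration; we choose the frequencies \(\alpha_{j}=\sqrt{p_{j}}\) with distinct primes \(p_{j}\), so that \(1,\alpha_{1},{\dots},\alpha_{m}\) are linearly independent over \(\mathbb{Q}\); since \(m\) is small, the agreement is only approximate):

```python
import numpy as np
from math import erf, sqrt

# Standardized sums of cosines with Q-linearly independent frequencies,
# compared with Gaussian interval probabilities.
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
m, N = len(primes), 200_000
n = np.arange(1, N + 1)
S = sum(np.cos(2 * np.pi * np.sqrt(p) * n) for p in primes) / np.sqrt(m / 2)
Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))
for a, b in [(-1, 1), (-2, 2)]:
    print(f"[{a},{b}]: {np.mean((a <= S) & (S <= b)):.4f}"
          f" vs {Phi(b) - Phi(a):.4f}")
```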

3.4 Relatively measurable subsets of \((0,\infty)\)–the continuous setting

The deliberations of the previous subsection can quite effortlessly be lifted to a continuous setting. A continuous version of a relative measure on Lebesgue measurable subsets of \(\mathbb{R}\) can be defined as the limit

$$\mu_{R}(A):=\lim_{T\to\infty}\frac{1}{T}\int_{0}^{T}1_{A}(x)\,\text{d}x$$

if it exists. In analogy to the case of sequences, one obtains a continuous version of Weyl’s theorem (see also [38, Chap. 9]) and thus the independence of finitely measurable functions applied to uniformly distributed functions. An example is again given by the cosines with linearly independent frequencies (cf. [34]), i.e., if \(1,\alpha_{1},\alpha_{2},\dots\) are linearly independent over \(\mathbb{Q}\), then for all \(m\in\mathbb{N}\) and all \(s_{1},{\dots},s_{m}\in\mathbb{R}\),

$$\begin{aligned} &\mu_{R}\Big(\Big\{t\in(0,\infty)\,:\,\cos(2\pi\alpha_{1}t)\leq s_{1},\cdots,\cos(2\pi\alpha_{m}t)\leq s_{m}\Big\}\Big)\\ &=\prod_{j=1}^{m}\mu_{R}\Big(\big\{t\in(0,\infty)\,:\,\cos(2\pi\alpha_{j}t)\leq s_{j}\big\}\Big). \end{aligned}$$

Those considerations then yield a central limit theorem of the form

$$\begin{aligned} &\lim_{m\to\infty}\mu_{R}\left(\left\{t\in(0,\infty)\,:\,a\leq\frac{\cos(2\pi\alpha_{1}t)+\dots+\cos(2\pi\alpha_{m}t)}{\sqrt{m/2}}\leq b\right\}\right)\\ &=\frac{1}{\sqrt{2\pi}}\int_{a}^{b}e^{-\frac{\xi^{2}}{2}}\,\text{d}\xi\,. \end{aligned}$$

The original approach to this result is, as we find, more complicated and can be found in [34]. The latter is presented in a more accessible way in [32, Chap. 3].

4 The Erdős-Kac Theorem

This section is devoted to a famous theorem of Paul Erdős and Mark Kac. One can say that this result marks the birth of what is today known as probabilistic number theory. The close link between probability theory and number theory illustrated by this theorem can hardly be overstated and has turned out to be extremely fruitful.

We shall start with the original heuristics of Mark Kac, which led him to conjecture the result he later proved together with Paul Erdős.

4.1 Heuristics–Independence & CLT

A guiding idea of Mark Kac was that wherever there is some sort of independence, the Gaussian law of errors is at play. Exactly this maxim underlies the Erdős-Kac theorem. The object of interest is the number of different prime factors of a given number.

Let us consider the following indicator functions. For each prime number \(p\) and every \(n\in\mathbb{N}\), we define

$$I_{p}(n)=\begin{cases}1&:\text{if }p\text{ divides }n\\ 0&:\text{if }p\text{ does not divide }n.\end{cases}$$

Given a natural number \(n\in\mathbb{N}\), we denote by \(\omega(n)\) the number of different prime factors of \(n\). The indicator functions allow us to express \(\omega(n)\) as follows,

$$\omega(n)=\sum_{p\text{ prime}}I_{p}(n)\,.$$
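For concreteness, \(\omega\) is easy to compute by trial division; the following short Python sketch (ours, for illustration only) does exactly that:

```python
def omega(n: int) -> int:
    """Number of distinct prime factors of n (trial division sketch)."""
    count, p = 0, 2
    while p * p <= n:
        if n % p == 0:
            count += 1
            while n % p == 0:
                n //= p
        p += 1 if p == 2 else 2           # after 2, test odd candidates only
    return count + (1 if n > 1 else 0)    # a leftover factor n > 1 is prime

assert [omega(k) for k in [1, 2, 12, 30, 1024]] == [0, 1, 2, 3, 1]
```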

From Sect. 2.4 we already know that this collection of indicator functions is \(\mu_{R}\)-independent. Following Mark Kac’s original heuristics, we now provide a plausibility argument suggesting that these indicator functions also satisfy Lindeberg’s condition, so that, in analogy to Lindeberg’s central limit theorem, the properly normalized sum of indicator functions should follow a Gaussian law of errors. For this we first note that for all \(x\in\mathbb{R}\) with \(x\geq 2\) we have

$$\begin{aligned}\sum_{p\text{ prime,}\atop{p\leq x}}\frac{1}{p}> \ln\ln x-\frac{1}{2}\,,\end{aligned}$$
(7)

see [28, Chap. 3]. As we already explained in the first part of Sect. 2.4, essentially a fraction \(1/p\) of the natural numbers is divisible by the prime \(p\), i.e., we may say that a number \(n\in\mathbb{N}\) is divisible by \(p\) with probability \(1/p\). In other words, the indicator functions \(I_{p}\) behave like independent Bernoulli random variables with parameter \(1/p\). But then the expectation is \(1/p\) and the variance \(\frac{1}{p}\big(1-\frac{1}{p}\big)\). What does this mean for Lindeberg’s condition? Well, using the notation of Theorem 2, we have for all \(n\geq 2\)

$$\begin{aligned} s_{n} &= \sqrt{\sum_{p\text{ prime}\atop p\leq n}\mathrm{Var}[I_{p}(n)]}=\sqrt{\sum_{p\text{ prime}\atop p\leq n}\frac{1}{p}\bigg(1-\frac{1}{p}\bigg)}\\ &\geq\frac{1}{\sqrt{2}}\sqrt{\sum_{p\text{ prime}\atop p\leq n}\frac{1}{p}}\overset{(7)}{\geq}\frac{1}{\sqrt{2}}\sqrt{\ln\ln n-\frac{1}{2}}\,. \end{aligned}$$

So if \(\varepsilon\in(0,\infty)\), then for sufficiently large \(n\in\mathbb{N}\), we have

$$\begin{aligned} \mathbb{E}\Big[I_{p}(n)^{2}\,{1}\!\!1_{\{|I_{p}(n)|> \varepsilon s_{n}\}}\Big] &\leq\mathbb{P}\big[I_{p}(n)> \varepsilon s_{n}\big]\\ &\leq\mathbb{P}\Big[I_{p}(n)> \frac{\varepsilon}{\sqrt{2}}\sqrt{\ln\ln n-1/2}\,\Big]=0. \end{aligned}$$

The latter holds since \(I_{p}(n)\) only takes the values 0 and 1. Therefore, Lindeberg’s condition in Theorem 2 is satisfied. Together with the independence of the indicator functions \(I_{p}(n)\), \(p\) prime as well as property (7), this suggests that the sequence

$$\frac{\omega(n)-\ln\,\ln\,n}{\sqrt{\ln\,\ln\,n}},\qquad n\in\mathbb{N}$$

satisfies a central limit theorem. Indeed, for every \(m\in\mathbb{N}\) let \(c_{m}=\sum_{p\;\text{prime},\,p\leq m}\frac{1}{p}\) and \(d_{m}^{2}=\sum_{p\;\text{prime},\,p\leq m}\frac{1}{p}\big(1-\frac{1}{p}\big)\). Further, let \(\omega_{m}(n):=\sum_{p\text{ prime},\,p\leq m}I_{p}(n)\). Then for every \(a,b\in\mathbb{R}\) with \(a<b\),

$$\lim_{m\to\infty}\mu_{R}\bigg(\Big\{n\in\mathbb{N}\colon a\leq\frac{\omega_{m}(n)-c_{m}}{d_{m}}\leq b\Big\}\bigg)=\frac{1}{\sqrt{2\pi}}\int_{a}^{b}e^{-x^{2}/2}\,\text{d}x\,,$$

which appears as Lemma 1 in [22]. This means that

$$\begin{aligned} &\lim_{m\to\infty}\lim_{N\to\infty}\frac{1}{N}\bigg|\Big\{n\in\{1,{\dots},N\}\colon a\leq\frac{\omega_{m}(n)-c_{m}}{d_{m}}\leq b\Big\}\bigg|\\ &=\frac{1}{\sqrt{2\pi}}\int_{a}^{b}e^{-x^{2}/2}\,\text{d}x\,. \end{aligned}$$

If one could show that the two limits may be taken simultaneously, then we would obtain

$$\lim_{N\to\infty}\frac{1}{N}\bigg|\Big\{n\in\{1,{\dots},N\}\colon a\leq\frac{\omega_{N}(n)-c_{N}}{d_{N}}\leq b\Big\}\bigg|=\frac{1}{\sqrt{2\pi}}\int_{a}^{b}e^{-x^{2}/2}\,\text{d}x\,.$$

Together with the (proper) asymptotics for \(\omega_{N}(n),c_{N},d_{N}\), this would give

$$\lim_{N\to\infty}\frac{1}{N}\bigg|\Big\{n\in\{1,{\dots},N\}\colon a\leq\frac{\omega(n)-\ln\ln N}{\sqrt{\ln\ln N}}\leq b\Big\}\bigg|=\frac{1}{\sqrt{2\pi}}\int_{a}^{b}e^{-x^{2}/2}\,\text{d}x\,.$$

Of course, this is merely a heuristic argument, not a proof. In any case, the heuristic and the conjecture just presented lead us in the following subsection to the ingenious and famous central limit theorem of Erdős and Kac [22].

4.2 The CLT of Erdős-Kac

After having presented the heuristic of Mark Kac, let us tell the anecdote about the origin of the Erdős-Kac theorem as described by Mark Kac himself in his autobiography [33].

I knew very little number theory at the time, and I tried to find a proof along purely probabilistic lines but to no avail. In March 1939 I journeyed from Baltimore to Princeton to give a talk. Erdős, who was spending the year at the Institute for Advanced Study, was in the audience but he half-dozed through most of my lecture; the subject matter was too far removed from his interests. Toward the end I described briefly my difficulties with the number of prime divisors. At the mention of number theory Erdős perked up and asked me to explain once again what the difficulty was. Within the next few minutes, even before the lecture was over, he interrupted to announce that he had the solution.

When once asked about their famous result, Mark Kac replied as follows (see [14] and [33]):

It took what looks now like a miraculous confluence of circumstances to produce our result…. It would not have been enough, certainly not in 1939, to bring a number theorist and a probabilist together. It had to be Erdős and me: Erdős because he was almost unique in his knowledge and understanding of the number theoretic method of Viggo Brun,… and me because I could see independence and the normal law through the eyes of Steinhaus.

We will now formulate the central limit theorem of Erdős and Kac.

Theorem 1 (Erdős-Kac, 1940)

Let \(a,b\in\mathbb{R}\) with \(a<b\). Then

$$\lim_{N\to\infty}\frac{1}{N}\bigg|\bigg\{n\in\{1,{\dots},N\}\,:\,a\leq\frac{\omega(n)-\ln\ln N}{\sqrt{\ln\ln N}}\leq b\bigg\}\bigg|=\frac{1}{\sqrt{2\pi}}\int_{a}^{b}e^{-x^{2}/2}\,\text{d}x\,.$$

In other words, for large \(N\in\mathbb{N}\) the proportion of natural numbers in the set \(\{1,{\dots},N\}\) for which the suitably normalized number of different prime factors lies between \(a\) and \(b\) is close to the Gaussian integral from \(a\) to \(b\). In short: the suitably normalized number of different prime factors of a large number follows a Gaussian curve.
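A numerical sanity check (our own sketch) makes this visible, although the convergence is very slow because the normalization only grows like \(\sqrt{\ln\ln N}\): we compute \(\omega(n)\) for all \(n\leq N\) with a sieve and compare the empirical proportions with the Gaussian integrals.

```python
import numpy as np
from math import erf, log

# Sketch: empirical distribution of (omega(n) - ln ln N)/sqrt(ln ln N)
# for n = 1, ..., N, with omega computed by a sieve: every prime p
# increments omega[n] for all multiples n of p.
N = 1_000_000
omega = np.zeros(N + 1, dtype=np.int32)
for p in range(2, N + 1):
    if omega[p] == 0:          # no smaller prime divides p, so p is prime
        omega[p::p] += 1

LL = log(log(N))
z = (omega[1:] - LL) / np.sqrt(LL)

Phi = lambda t: 0.5 * (1.0 + erf(t / 2 ** 0.5))
for a, b in [(-1.0, 1.0), (0.0, 2.0)]:
    print(a, b, np.mean((a <= z) & (z <= b)), Phi(b) - Phi(a))
```

Even at \(N=10^{6}\), where \(\ln\ln N\approx 2.6\), the match is only rough, reflecting the very slow \(\sqrt{\ln\ln N}\) normalization.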

Providing a formal proof of Theorem 1 would go beyond the scope of this paper. The original argument of Erdős and Kac uses number theoretic methods from sieve theory (more precisely, Brun’s sieve). Another proof is due to Alfréd Rényi (20. March 1921 in Budapest; 1. February 1970 in Budapest) and Pál Turán (18. August 1910 in Budapest; 26. September 1976 in Budapest) and can be found in [46]. Let us mention that Godfrey Harold Hardy (7. February 1877 in Cranleigh; 1. December 1947 in Cambridge) and Srinivasa Ramanujan (22. December 1887 in Erode; 26. April 1920 in Kumbakonam) proved in their paper [27] from 1917 that for all \(\varepsilon\in(0,\infty)\)

$$\lim_{N\to\infty}\frac{1}{N}\bigg|\bigg\{n\in\{1,{\dots},N\}\,:\,\Big|\frac{\omega(n)}{\ln\ln N}-1\Big|\geq\varepsilon\bigg\}\bigg|=0\,.$$

This means that for large \(N\in\mathbb{N}\), if we pick a number \(n\in\{1,{\dots},N\}\) at random (with respect to the uniform distribution), then the number \(\omega(n)\) of different prime factors is typically of order \(\ln\,\ln\, N\).

Remark 6

Even though Pál Turán already noticed that the result of Hardy and Ramanujan can be obtained from an inequality for the second moment of \(\omega(n)\) together with an application of Chebyshev’s inequality [9], one can say that the Erdős-Kac Theorem marks the beginning of probabilistic number theory. The work [23] of Paul Erdős and Aurel Wintner (8. April 1903 in Budapest; 15. January 1958 in Baltimore) has also been one of the pioneering contributions to this circle of problems.

We close this section with a corollary that gives a different version of the Erdős-Kac theorem, in which \(N\) in the \(\ln\ln\) terms is replaced by \(n\). This version looks more natural in our setup, because it directly states that the distribution function of the sequence \(\big(\frac{\omega(n)-\ln\ln n}{\sqrt{\ln\ln n}}\big)_{n\in\mathbb{N}}\) is that of the standard normal distribution.

Corollary 3

Let \(a,b\in\mathbb{R}\) with \(a<b\). Then

$$\lim_{N\to\infty}\frac{1}{N}\bigg|\bigg\{n\in\{1,{\dots},N\}\,:\,a\leq\frac{\omega(n)-\ln\ln n}{\sqrt{\ln\ln n}}\leq b\bigg\}\bigg|=\frac{1}{\sqrt{2\pi}}\int_{a}^{b}e^{-x^{2}/2}\,\text{d}x\,.$$

Proof

First observe that the function \(u\mapsto b\sqrt{u}+u\) is increasing for \(u> b^{2}/4\), so that for all but finitely many \(n\leq N\) the inequality \(\omega(n)\leq b\sqrt{\ln\ln n}+\ln\ln n\) implies \(\omega(n)\leq b\sqrt{\ln\ln N}+\ln\ln N\). Hence, for every \(b\in\mathbb{R}\), we have

$$\begin{aligned}&{\limsup_{N\to\infty}\frac{1}{N}\bigg|\bigg\{n\in\{1,{\dots},N\}\,:\,\frac{\omega(n)-\ln\ln n}{\sqrt{\ln\ln n}}\leq b\bigg\}\bigg|}\\ &\leq\lim_{N\to\infty}\frac{1}{N}\bigg|\bigg\{n\in\{1,{\dots},N\}\,:\,\frac{\omega(n)-\ln\ln N}{\sqrt{\ln\ln N}}\leq b\bigg\}\bigg|=\Phi(b)\,,\end{aligned}$$

where \(\Phi(t)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t}e^{-x^{2}/2}\,\text{d}x\) for all \(t\in\mathbb{R}\) as before. First note that, by Theorem 1, the distribution functions \(F_{N}\) with \(F_{N}(t):=\tfrac{1}{N}|\{1\leq n\leq N\colon\omega(n)\leq t\sqrt{\ln\,\ln\, N}+\ln\,\ln\, N\}|\) converge pointwise to \(\Phi\), and therefore also uniformly on \(\mathbb{R}\), by Lemma 3.

Now fix \(b\in\mathbb{R}\) and \(\varepsilon\in(0,1)\), and let \(K\in(0,\infty)\) be such that \(e^{-\frac{K}{2}}<\tfrac{\varepsilon}{3}\). Let \(N_{0}\in\mathbb{N}\) be such that for all \(N\geq N_{0}\) and all \(t\in\mathbb{R}\) we have \(F_{N}(t)\in(\Phi(t)-\tfrac{\varepsilon}{3},\Phi(t)+\tfrac{\varepsilon}{3})\), \(\Phi\big(b-\tfrac{K}{\sqrt{\ln\,\ln\, N}}\big)> \Phi(b)-\tfrac{\varepsilon}{3}\), \(\sqrt{\ln\,\ln\, N}> b\), \(\ln\,\ln\, N> 0\), and \(N^{-2/3}<\tfrac{\varepsilon}{3}\). With this

$$\begin{aligned} &\tfrac{1}{N}\big|\big\{n\in\{1,{\dots},N\}\colon\omega(n)\leq b\sqrt{\ln\ln N}+\ln\ln N-K\big\}\big|=F_{N}\Big(b-\tfrac{K}{\sqrt{\ln\ln N}}\Big)\\ &\geq\Phi\Big(b-\tfrac{K}{\sqrt{\ln\ln N}}\Big)-\tfrac{\varepsilon}{3}> \Phi\big(b\big)-\tfrac{2\varepsilon}{3}\,. \end{aligned}$$

If we denote \(N_{1}:=\sup\{n\in\mathbb{N}\colon b\sqrt{\ln\ln N}+\ln\ln N-K> b\sqrt{\ln\ln n}+\ln\ln n\}\), then

$$\begin{aligned}&{\tfrac{1}{N}\big|\big\{n\in\{1,{\dots},N\}\colon\omega(n)\leq b\sqrt{\ln\ln n}+\ln\ln n\big\}\big|}\\ &\geq\tfrac{1}{N}\big|\big\{n\in\{N_{1}+1,{\dots},N\}\colon\omega(n)\leq b\sqrt{\ln\ln n}+\ln\ln n\big\}\big|\\ &\geq\tfrac{1}{N}\big|\big\{n\in\{N_{1}+1,{\dots},N\}\colon\omega(n)\leq b\sqrt{\ln\ln N}+\ln\ln N-K\big\}\big|\\ &\geq\tfrac{1}{N}\big|\big\{n\in\{1,{\dots},N\}\colon\omega(n)\leq b\sqrt{\ln\ln N}+\ln\ln N-K\big\}\big|-\tfrac{N_{1}}{N}\\ &> \Phi(b)-\tfrac{2\varepsilon}{3}-\tfrac{N_{1}}{N}\,.\end{aligned}$$

Now, let us compare \(N\) and \(N_{1}\). We observe that if

$$b(\sqrt{\ln\,\ln\, N}-\sqrt{\ln\,\ln\, N_{1}})+\ln\,\ln\, N-\ln\,\ln\, N_{1}> K\,,$$

then

$$(\sqrt{\ln\,\ln\, N}+\sqrt{\ln\,\ln\, N_{1}})(\sqrt{\ln\,\ln\, N}-\sqrt{\ln\,\ln\, N_{1}})+\ln\,\ln\, N-\ln\,\ln\, N_{1}> K\,,$$

which implies that

$$2(\ln\,\ln\, N-\ln\,\ln\, N_{1})> K\,.$$

Hence, we have

$$\ln\,\ln\, N-\tfrac{K}{2}> \ln\,\ln\, N_{1}$$

and so \(N^{e^{-\frac{K}{2}}}> N_{1}\). Therefore,

$$\begin{aligned}\frac{N_{1}}{N}<N^{e^{-\frac{K}{2}}-1}<N^{-\frac{2}{3}}<\tfrac{\varepsilon}{3}\,,\end{aligned}$$

where the second inequality holds since \(e^{-\frac{K}{2}}-1<\tfrac{\varepsilon}{3}-1<-\tfrac{2}{3}\) and the last by the choice of \(N_{0}\). The lower bound above thus becomes \(\Phi(b)-\varepsilon\). Since \(\varepsilon\in(0,1)\) was arbitrary, this together with the \(\limsup\) bound from the beginning of the proof completes the proof. \(\square\)

A similar calculation shows that the two formulations of the Erdős-Kac theorem are actually equivalent.

5 Some complementary considerations–The case of lacunary series

What we have seen so far shows the power of the concept of relative measure in number theory and how it can naturally (in large parts along the lines of classical probability theory) lead us to central limit theorems for number theoretic quantities, even where the axiomatic framework of Kolmogorov is not applicable. On the other hand, we have seen, when studying binary expansions, that Kolmogorov’s theory is a powerful tool as well and allows us to obtain information about the Gaussian fluctuations of number theoretic quantities. A common spirit of both, and eventually a key to a Gaussian law, has always been a notion of independence.

In what follows, we complement the previous considerations by showing that lacunary series, for instance those formed from the functions \(\cos(2\pi n_{k}\cdot):[0,1]\to\mathbb{R}\) with a rapidly increasing gap sequence \((n_{k})_{k\in\mathbb{N}}\), behave in many ways like independent random variables, and that this almost-independence, or weak form of independence, may still lead to fascinating results within the axiomatic theory of Kolmogorov.

Already in Sect. 2.3 on binary expansions we noted that Hans Rademacher introduced in [45] what are known today as the Rademacher functions. These functions are defined in the following way,

$$r_{k}(t)=\text{sign}\big(\sin(2^{k}\pi t)\big),\qquad t\in[0,1],\,k\in\mathbb{N}\,,$$

where for \(x\in\mathbb{R}\),

$$\text{sign}(x):=\begin{cases}-1&:x<0\\ 0&:x=0\\ +1&:x> 0.\end{cases}$$

Rademacher studied the convergence behavior of series

$$\begin{aligned}\sum_{k=1}^{\infty}a_{k}r_{k}(t),\qquad t\in[0,1],\,(a_{k})_{k=1}^{\infty}\in\mathbb{R}^{\mathbb{N}}\,,\end{aligned}$$
(8)

and proved that such series converge for almost all \(t\in[0,1]\) if

$$\begin{aligned}\sum_{k=1}^{\infty}a_{k}^{2} & <+\infty\,.\end{aligned}$$
(9)

The necessity of the square summability condition (9) was obtained by Alexander Khintchine (19. July 1894 in Kondyrjowo; 18. November 1959 in Moscow) and Andrei Kolmogorov in their 1925 paper [35], where they showed that if

$$\begin{aligned}\sum_{k=1}^{\infty}a_{k}^{2}=+\infty,\end{aligned}$$
(10)

then the series (8) diverges for almost all \(t\in[0,1]\).

Starting in the 1920s, Stefan Banach (30. March 1892 in Krakow; 31. August 1945 in Lviv), Andrei Kolmogorov, Raymond Paley (7. January 1907 in Bournemouth; 7. April 1933 near Banff), Antoni Zygmund (25. December 1900 in Warsaw; 30. May 1992 in Chicago) and others studied the convergence behavior of trigonometric series

$$\begin{aligned}\sum_{k=1}^{\infty}a_{k}\cos(2\pi n_{k}t),\qquad t\in[0,1],\,(a_{k})_{k=1}^{\infty}\in\mathbb{R}^{\mathbb{N}}\,,\end{aligned}$$
(11)

where the sequence \((n_{k})_{k=1}^{\infty}\) satisfies the Hadamard gap condition

$$\frac{n_{k+1}}{n_{k}}> q> 1$$

for some fixed \(q\in(1,\infty)\) and all \(k\in\mathbb{N}\) (see [7, 36, 44, 53]). For such series one can obtain results similar to those for Rademacher series (8). Kolmogorov proved in [36] that the square summability condition (9) is also sufficient for the almost everywhere convergence of lacunary series. The necessity of (9) has been shown by Zygmund in [53].

An important analogy between Rademacher series and lacunary series, in particular in view of our article, remained unnoticed for a long time. In Sect. 2.3 we proved that the Rademacher functions (more precisely a version of them) are independent. In particular, given any sequence \((a_{k})_{k=1}^{\infty}\) of real numbers, the functions \(a_{k}r_{k}\), \(k\in\mathbb{N}\) are independent (but no longer identically distributed), and we have for all \(k\in\mathbb{N}\) that

$$\mathbb{E}[a_{k}r_{k}]=0\qquad\text{and}\qquad\mathrm{Var}[a_{k}r_{k}]=a_{k}^{2}\,.$$

Using the notation from Lindeberg’s theorem (see Theorem 2), we see that

$$s_{n}^{2}=\sum_{k=1}^{n}\mathrm{Var}[a_{k}r_{k}]=\sum_{k=1}^{n}a_{k}^{2}\,.$$

But this means that for \(\varepsilon\in(0,\infty)\), Lindeberg’s condition for the weighted Rademacher functions reads as follows,

$$\begin{aligned} &\frac{1}{\sum\limits_{k=1}^{n}a_{k}^{2}}\,\sum_{k=1}^{n}\mathbb{E}\Bigg[(a_{k}r_{k})^{2}{ 1}\!\!1_{\Big\{|a_{k}r_{k}|\geq\varepsilon\sqrt{\sum_{j=1}^{n}a_{j}^{2}}\Big\}}\Bigg]\\ &=\frac{1}{\sum\limits_{k=1}^{n}a_{k}^{2}}\,\sum_{k=1}^{n}a_{k}^{2}\,\mathbb{P}\Bigg[|a_{k}|\geq\varepsilon\sqrt{\sum_{j=1}^{n}a_{j}^{2}}\,\Bigg]\,. \end{aligned}$$

For Lindeberg’s condition to be satisfied, we require the right-hand side to converge to 0 as \(n\to\infty\). A moment’s thought reveals that this is the case whenever

$$\begin{aligned}\sum_{k=1}^{\infty}a_{k}^{2}=+\infty\qquad\text{ and }\qquad\max_{1\leq k\leq n}|a_{k}|=o\Bigg(\sqrt{\sum_{k=1}^{n}a_{k}^{2}}\Bigg)\,.\end{aligned}$$
(12)

Therefore, under condition (12), we obtain that, for all \(t\in\mathbb{R}\),

$$\lim_{n\to\infty}\lambda\Bigg(\bigg\{x\in[0,1]\,:\,\sum_{k=1}^{n}a_{k}r_{k}(x)\leq t\sqrt{\sum_{k=1}^{n}a_{k}^{2}}\,\bigg\}\Bigg)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t}e^{-\frac{y^{2}}{2}}\,\text{d}y\,.$$
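This classical independent case is easy to simulate (our own sketch): since the distinct binary digits of a uniformly chosen \(t\in[0,1]\) are independent fair coin flips, the \(r_{k}(t)\) can be replaced by independent random signs; we take \(a_{k}=1/\sqrt{k}\), for which \(\sum a_{k}^{2}\) diverges while \(\max_{k\leq n}|a_{k}|=1\), so that (12) is satisfied.

```python
import numpy as np
from math import erf

# Sketch: CLT for weighted Rademacher sums with a_k = 1/sqrt(k).
# The r_k of a uniform t in [0,1] behave like independent fair signs,
# so we simulate them directly.
rng = np.random.default_rng(2)
N, n = 20_000, 400
a = 1.0 / np.sqrt(np.arange(1.0, n + 1.0))        # sum of a_k^2 diverges
s_n = np.sqrt(np.sum(a ** 2))

signs = rng.integers(0, 2, size=(N, n)) * 2 - 1   # the simulated r_k(t)
S = (signs * a).sum(axis=1) / s_n

Phi = lambda t: 0.5 * (1.0 + erf(t / 2 ** 0.5))
for t in (-1.0, 0.0, 1.0):
    print(t, np.mean(S <= t), Phi(t))
```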

It was not until 1947 that Raphaël Salem (7. November 1898 in Saloniki; 20. June 1963 in Paris) and Antoni Zygmund proved in [47] that for Hadamard gap sequences the functions \(\big(\cos(2\pi n_{k}\cdot)\big)_{k\in\mathbb{N}}\) follow a central limit theorem, i.e., for all \(t\in\mathbb{R}\),

$$\lim_{N\to\infty}\lambda\bigg(\Big\{x\in(0,1)\,:\,\sum_{k=1}^{N}\cos(2\pi n_{k}x)\leq t\sqrt{N/2}\Big\}\bigg)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t}e^{-\frac{y^{2}}{2}}\,\text{d}y\,.$$
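For the particular Hadamard gap sequence \(n_{k}=2^{k}\) this is again easy to explore numerically (our own sketch): doubling merely shifts the binary expansion of \(x\), so \(\cos(2\pi\,2^{k}x)\) can be evaluated along the orbit of \(x\) under \(t\mapsto 2t\bmod 1\).

```python
import numpy as np
from math import erf

# Sketch of the Salem-Zygmund CLT for the Hadamard gap sequence n_k = 2^k.
rng = np.random.default_rng(3)
K, samples = 25, 200_000          # number of terms, number of points x
x = rng.random(samples)

S, y = np.zeros(samples), x.copy()
for _ in range(K):
    y = np.mod(2.0 * y, 1.0)      # fractional part of 2^k x
    S += np.cos(2 * np.pi * y)
S /= np.sqrt(K / 2)

Phi = lambda t: 0.5 * (1.0 + erf(t / 2 ** 0.5))
for t in (-1.0, 0.0, 1.0):
    print(t, np.mean(S <= t), Phi(t))
```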

For sequences with very large gaps, i.e., those satisfying the stronger condition

$$\frac{n_{k+1}}{n_{k}}\overset{k\to\infty}{\longrightarrow}+\infty\,,$$

such a central limit theorem had been obtained in 1939 by Mark Kac in [29].

Around the same time as Salem and Zygmund, Mark Kac [30] (see also [31, 32] and the references therein) obtained a central limit theorem for functions \(f:\mathbb{R}\to\mathbb{R}\) of bounded variation on \([0,1]\) satisfying

$$f(t+1)=f(t)\qquad\text{and}\qquad\int_{0}^{1}f(t)\,\text{d}t=0\,.$$

He showed that for such functions

$$\lim_{N\to\infty}\lambda\bigg(\Big\{x\in(0,1)\,:\,\sum_{k=1}^{N}f(2^{k}x)\leq t\sigma\sqrt{N}\Big\}\bigg)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t}e^{-\frac{y^{2}}{2}}\,\text{d}y\,$$

whenever

$$\begin{aligned}\sigma^{2}:=\int_{0}^{1}f(t)^{2}\,\text{d}t+2\sum_{k=1}^{\infty}\int_{0}^{1}f(t)f(2^{k}t)\,\text{d}t\neq 0\,.\end{aligned}$$
(13)

This already indicates that the functions \(f(2^{k}\cdot)\), \(k\in\mathbb{N}\) do not behave like independent random variables. In fact, in that case we would expect something like

$$\sigma^{2}=\int_{0}^{1}f(t)^{2}\,\text{d}t\neq 0$$

rather than condition (13). After further progress had been made by Gapoškin [25] and Takahashi [48], Gapoškin eventually discovered a deep connection between the validity of a central limit theorem and the number of solutions of a certain Diophantine equation [26], i.e., whether a central limit theorem holds or not depends not only on the growth rate of the sequence \((n_{k})_{k\in\mathbb{N}}\), but also critically on its number theoretic properties. In 2010 Christoph Aistleitner and István Berkes presented a paper in which they obtained both necessary and sufficient conditions under which the sequence \(\big(f(n_{k}\cdot)\big)_{k\in\mathbb{N}}\) follows a Gaussian law of errors [1].
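To make the role of the correlation terms in (13) concrete, consider the illustrative choice (ours, not taken from the original papers) \(f(t)=\cos(2\pi t)+\cos(4\pi t)\): here \(\int_{0}^{1}f(t)^{2}\,\text{d}t=1\) and \(\int_{0}^{1}f(t)f(2t)\,\text{d}t=\tfrac{1}{2}\), while all higher correlations vanish, so (13) yields \(\sigma^{2}=1+2\cdot\tfrac{1}{2}=2\), twice the value one would expect for genuinely independent summands. A short simulation sketch:

```python
import numpy as np
from math import erf

# Sketch for Kac's CLT with f(t) = cos(2*pi*t) + cos(4*pi*t), where the
# correlation sum in (13) gives sigma^2 = 2 rather than int f^2 = 1.
f = lambda t: np.cos(2 * np.pi * t) + np.cos(4 * np.pi * t)
sigma = np.sqrt(2.0)

rng = np.random.default_rng(4)
K, samples = 25, 200_000
x = rng.random(samples)

S, y = np.zeros(samples), x.copy()
for _ in range(K):
    y = np.mod(2.0 * y, 1.0)      # orbit 2x, 4x, 8x, ... modulo 1
    S += f(y)
S /= sigma * np.sqrt(K)

Phi = lambda t: 0.5 * (1.0 + erf(t / 2 ** 0.5))
for t in (-1.0, 0.0, 1.0):
    print(t, np.mean(S <= t), Phi(t))
```

Replacing \(\sigma^{2}=2\) by \(\int_{0}^{1}f(t)^{2}\,\text{d}t=1\) in the normalization makes the empirical distribution visibly too wide compared with \(\Phi\), which is precisely the dependence effect described above.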

Please note that the preceding discussion is not intended to be exhaustive. Still, it indicates the development of the subject, highlights some fascinating results, and shows how analytic, probabilistic, and number theoretic arguments and properties intertwine.

Remark 7

The results presented in this final section are not restricted to central limit phenomena. Beyond the normal fluctuations one can also prove laws of the iterated logarithm for lacunary series, and we refer the reader to the work of Erdős and Gál [21], Aistleitner and Fukuyama [4, 5], Aistleitner, Berkes, and Tichy [2, 3], and the references cited therein. The study of large deviation principles for lacunary sums has recently been initiated by Aistleitner, Gantert, Kabluchko, Prochno, and Ramanan in [6].