How many digits are needed?

Let X_1, X_2, ... be the digits in the base-q expansion of a random variable X defined on [0, 1), where q ≥ 2 is an integer. For n = 1, 2, ..., we study the probability distribution P_n of the (scaled) remainder T^n(X) = ∑_{k=n+1}^∞ X_k q^{n−k}: if X has an absolutely continuous CDF, then P_n converges in the total variation metric to the Lebesgue measure µ on the unit interval. Under weak smoothness conditions we establish, first, a coupling between X and a non-negative integer-valued random variable N so that T^N(X) follows µ and is independent of (X_1, ..., X_N), and, second, exponentially fast convergence of P_n and its PDF f_n. We discuss how many digits are needed and show examples of our results. The convergence results are extended to the case of a multivariate random variable defined on a unit cube.


Introduction
Let X be a random variable with 0 ≤ X < 1 and, for x ∈ R, let F(x) = P(X ≤ x) be the cumulative distribution function (CDF) of X. For a given integer q ≥ 2, we consider the base-q transformation T : [0, 1) → [0, 1) given by

T(x) = xq − ⌊xq⌋,    (1)

where ⌊·⌋ is the floor function (so ⌊xq⌋ is the integer part of xq). For n = 1, 2, ..., let T^n = T ∘ ⋯ ∘ T denote the composition of T with itself n times and define

X_n = ⌊T^{n−1}(X) q⌋,    (2)

where T^0(X) = X. Then

X = ∑_{n=1}^∞ X_n q^{−n}    (3)

is the base-q expansion of X with digits X_1, X_2, .... Note that X is in a one-to-one correspondence with the first n digits (X_1, ..., X_n) together with T^n(X) = ∑_{k=n+1}^∞ X_k q^{n−k}, which is the remainder multiplied by q^n. Let µ denote the Lebesgue measure on [0, 1), P_n the probability distribution of T^n(X), and F_n its CDF, so X follows P_0 and has CDF F_0 = F. The following facts are well-known (see [4]):

(a) P_0 = P_1 (i.e., invariance in distribution under T) is equivalent to stationarity of the process X_1, X_2, ....
(b) P_0 = P_1 and F is absolutely continuous if and only if P_0 = µ.

(c) P_0 = µ if and only if X_1, X_2, ... are IID with the uniform distribution on {0, 1, ..., q − 1}.
Items (a)-(c), together with the fact that T is ergodic with respect to µ, are used in metric number theory (see [5], [9], and the references therein) to establish properties such as 'for Lebesgue almost all numbers between 0 and 1, the relative frequency of any finite combination of digits of a given length n which occurs among the first m > n digits converges to q^{−n} as m → ∞' (which is basically the definition of a normal number in base-q, cf. [3]). To the best of our knowledge, less (or perhaps no) attention has been paid to the asymptotic behaviour of the (scaled) remainder T^n(X) as n → ∞. This paper fills this gap.
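To fix ideas, the maps above are easy to compute numerically. The following sketch (our own illustration, not part of the paper; the function name is ours) extracts the first n base-q digits via (2) and the scaled remainder T^n(x), and checks the one-to-one correspondence x = ∑_{k=1}^n X_k q^{−k} + T^n(x) q^{−n}:

```python
def base_q_digits(x, q, n):
    """Return the first n base-q digits X_1, ..., X_n of x in [0, 1)
    together with the scaled remainder T^n(x)."""
    digits = []
    t = x                           # T^0(x) = x
    for _ in range(n):
        digits.append(int(t * q))   # X_k = floor(T^{k-1}(x) q)
        t = t * q - int(t * q)      # T(t) = t q - floor(t q)
    return digits, t

x, q, n = 0.3141592653589793, 10, 6
digits, rem = base_q_digits(x, q, n)
# one-to-one correspondence: x = sum_k X_k q^{-k} + T^n(x) q^{-n}
reconstructed = sum(d * q ** -(k + 1) for k, d in enumerate(digits)) + rem * q ** -n
```

For this x the digits are of course 3, 1, 4, 1, 5, 9, and the reconstruction recovers x up to floating-point error.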
Assuming F is absolutely continuous with a probability density function f, we establish the following. We start in Section 2 by considering a special case of f where T^n(X) follows exactly µ when n is sufficiently large. Then in Section 3, under a weak assumption on f, we specify an interesting coupling construction involving a non-negative integer-valued random variable N so that T^N(X) follows exactly µ and is independent of (X_1, ..., X_N). Moreover, in Section 4, we show that lim_{n→∞} d_TV(P_n, µ) = 0, where d_TV is the total variation metric (as given later in (12)). Because of these results, if in an experiment a realization of X is observed and the first n digits are kept, and if (so far) the only model assumption is absolute continuity of F, then the remainder rescaled by q^n is at least approximately uniformly distributed when n is large. Since we interpret the uniform distribution as the case of complete randomness, no essential information about the distribution is lost. On the other hand, if the distribution of the remainder is far from uniform, this may indicate that the distribution one is trying to find has finer structure that one is missing by looking only at the first n digits. We return to this issue in Section 5 when discussing sufficiency and ancillarity. Furthermore, in Section 4 we study the convergence rate of d_TV(P_n, µ) and other related properties. In Section 5, we illustrate our results from Sections 3 and 4 in connection to various specific choices of F, including the case where F follows the extended Newcomb-Benford law (Example 1). Finally, in Section 6, we generalize our convergence results to the situation where X is extended to a multivariate random variable with values in the k-dimensional unit cube [0, 1)^k and each of the k coordinates of X is transformed by T.
We plan in a future paper to study the asymptotic behaviour of the remainder in other expansions, including a certain base-β expansion of a random variable, namely when q is replaced by β = (1+ √ 5)/2 (the golden ratio) in all places above.
Lemma 2.1. If F has no jump at any base-q fraction in [0, 1), then for every x ∈ [0, 1],

F_n(x) = ∑_{j=0}^{q^n − 1} [F(q^{−n}(j + x)) − F(q^{−n} j)].    (4)

Proof. Clearly, (4) holds for x = 1, so let 0 ≤ x < 1. For j_1, ..., j_n ∈ {0, 1, ..., q − 1} and j = ∑_{i=1}^n j_i q^{n−i}, the event that X_1 = j_1, ..., X_n = j_n, and T^n(X) ≤ x is the same as the event that q^{−n} j ≤ X < q^{−n}(j + 1) and X ≤ q^{−n}(j + x). Hence, since 0 ≤ x < 1,

P(X_1 = j_1, ..., X_n = j_n, T^n(X) ≤ x) = F(q^{−n}(j + x)) − F(q^{−n} j) + P(X = q^{−n} j),

whereby (4) follows by summing over j, since F has no jumps at the base-q fractions. The property that F has no jump at any base-q fraction is of course satisfied when F is continuous.
For the remainder of this section and the following Sections 3-5, we assume that X has a probability density function (PDF) f concentrated on (0, 1), meaning that F is absolutely continuous with F(x) = ∫_{−∞}^x f(t) dt for all x ∈ R. Then, by (4), F_n is absolutely continuous with PDF

f_n(x) = q^{−n} ∑_{j=0}^{q^n − 1} f(q^{−n}(j + x))    (5)

for 0 < x < 1.
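Differentiating (4) shows that the PDF of the scaled remainder is f_n(x) = q^{−n} ∑_{j=0}^{q^n−1} f(q^{−n}(j + x)), which is easy to evaluate numerically. As a quick sanity check (our own code, not from the paper), for the density f(t) = 2t one can watch the sup-distance between f_n and the uniform density shrink like q^{−n}:

```python
def f_n(f, q, n, x):
    """PDF of the scaled remainder T^n(X): f_n(x) = q^{-n} sum_j f(q^{-n}(j + x))."""
    return q ** -n * sum(f(q ** -n * (j + x)) for j in range(q ** n))

f = lambda t: 2.0 * t   # Beta(2, 1) density on (0, 1)
q = 2
grid = [i / 200 for i in range(201)]
# sup-distance between f_n and the uniform density, for n = 0, ..., 5
sup_dev = [max(abs(f_n(f, q, n, x) - 1.0) for x in grid) for n in range(6)]
```

For this f one can check by hand that f_n(x) = q^{−n}(q^n − 1 + 2x), so the sup-deviation is exactly q^{−n}.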
Proof. If P_m = µ then, by invariance of µ under T, P_{m+1}(A) = P_m(T^{−1}A) = µ(T^{−1}A) = µ(A) for every Borel set A ⊆ [0, 1), and inductively P_n = µ for all n ≥ m. Thereby the first assertion follows.

Couplings
Let f be a PDF on [0, 1). We introduce the following notation. Let I_∅ = I_{1;0} = [0, 1) and c_∅ = c_{1;0} = inf_{I_∅} f. For n = 1, 2, ... and x_1, x_2, ... ∈ {0, 1, ..., q − 1}, let k = 1 + ∑_{i=1}^n x_i q^{n−i} and

I_{x_1,...,x_n} = I_{k;n} = [q^{−n}(k − 1), q^{−n}k),  c_{x_1,...,x_n} = c_{k;n} = inf_{I_{k;n}} f.    (6)

Recall that a function f is lower semi-continuous at a point x if for any sequence x_m → x we have lim inf_{m→∞} f(x_m) ≥ f(x). If x = ∑_{n=1}^∞ x_n q^{−n} ∈ [0, 1) is not a base-q fraction, then lower semi-continuity at x is equivalent to

lim_{n→∞} c_{x_1,...,x_n} = f(x).    (7)

Write U ∼ µ if U is a uniformly distributed random variable on [0, 1).
Theorem 3.1. Suppose f is lower semi-continuous at Lebesgue almost all points in [0, 1). Then there is a coupling between X ∼ f and a non-negative integer-valued random variable N such that T^N(X) ∼ µ is independent of (X_1, ..., X_N).
Remark 1. Set {0, 1, ..., q − 1}^0 = {∅}, so we interpret ∅ as no digits. Then (X_1, ..., X_N) is a discrete random variable with state space ∪_{n=0}^∞ {0, 1, ..., q − 1}^n. Commonly used PDFs are lower semi-continuous almost everywhere. For an example where this condition does not hold, let 0 < ϵ < 1 and let H ⊂ [0, 1) be a closed set with µ(H) = 1 − ϵ which contains no base-q fraction. Hence, the uniform distribution µ_H on H is absolutely continuous. Since H is closed, contains no base-q fraction, and the set of base-q fractions is dense in [0, 1), any interval will contain a non-empty open set of points not in H. Now, any PDF f for µ_H will be zero outside H ∪ A for some nullset A (depending on the version of f), so for all integers n ≥ 0 and 1 ≤ k ≤ q^n, f will be zero on a subset of I_{k;n} of positive Lebesgue measure, whence c_{k;n} = 0. Thus the right-hand side in (7) is zero, so f is not lower semi-continuous at any point where f > 0.
Proof. For Lebesgue almost all x = ∑_{n=1}^∞ x_n q^{−n} ∈ [0, 1) with x_n = ⌊T^{n−1}(x)q⌋, assuming x is not a base-q fraction (recalling that the set of base-q fractions is a Lebesgue nullset), (7) gives

f(x) = c_∅ + ∑_{n=1}^∞ (c_{x_1,...,x_n} − c_{x_1,...,x_{n−1}}),    (8)

where c_{x_1,...,x_{n−1}} is interpreted as c_∅ for n = 1. Let N be a random variable such that for f(x) > 0, conditionally on X = x,

P(N = 0 | X = x) = c_∅ / f(x),  P(N = n | X = x) = (c_{x_1,...,x_n} − c_{x_1,...,x_{n−1}}) / f(x),  n = 1, 2, ....    (9)

By (8) and since c_{x_1,...,x_n} ≥ c_{x_1,...,x_{n−1}} ≥ 0, this is a well-defined conditional distribution. By Bayes theorem, conditioned on N = n with P(N = n) > 0, X follows an absolutely continuous distribution with PDF

f(x | n) ∝ c_{x_1,...,x_n} − c_{x_1,...,x_{n−1}}.

Therefore, since f(x | n) is constant on each of the intervals I_{k;n}, conditioned on N = n we immediately see that (X_1, ..., X_n) (interpreted as nothing if n = 0) and T^n(X) are independent and that T^n(X) ∼ µ. The latter implies that T^N(X) ∼ µ is independent of N. Consequently, if we do not condition on N, we have that (X_1, ..., X_N) and T^N(X) ∼ µ are independent.

Corollary 3.2. For the coupling construction in the proof of Theorem 3.1, conditioned on X = x with f(x) > 0, we have

P(N > n | X = x) = 1 − c_{x_1,...,x_n} / f(x),  n = 0, 1, ...,

where unconditionally

P(N > n) = 1 − q^{−n} ∑_{k=1}^{q^n} c_{k;n}.    (10)

Remark 2. Corollary 3.2 is used in Section 5 to quantify how many digits are needed to make the remainder uniformly distributed with sufficiently high probability. Since a PDF is only defined up to a set of measure zero, it is possible for a distribution to have several PDFs that are almost everywhere lower semi-continuous but give rise to different constants c_{x_1,...,x_n}. Hence the distribution of (X_1, ..., X_N) is not uniquely defined. For example, if X ∼ µ, letting f be the indicator function on [0, 1) gives N = 0 almost surely, whilst letting f be the indicator function on [0, 1) \ {x_0} for some x_0 ∈ [0, 1) gives P(N ≤ n) = 1 − q^{−n}. By (10), in order to make N as small as possible, we prefer a version of f which is as large as possible.
Corollary 3.3. Let the situation be as in Theorem 3.1. The output X = (K + U)q^{−N} of the following simulation algorithm is distributed as X ∼ f:

(a) Draw N from (10).

(b) Conditionally on N = n ≥ 1, generate a discrete random variable K with

P(K = k − 1 | N = n) = (c_{k;n} − c_{⌈k/q⌉;n−1}) / (a_n − q a_{n−1}),  k = 1, ..., q^n,    (11)

where a_n = ∑_{k=1}^{q^n} c_{k;n} and I_{⌈k/q⌉;n−1} is the level-(n − 1) interval containing I_{k;n}; for n = 0, set K = 0.

(c) Independently of (N, K), pick a random variable U ∼ µ and return X = (K + U)q^{−N}.
Proof. Let a_n = ∑_{k=1}^{q^n} c_{k;n}, so that a_n − q a_{n−1} is the normalizing constant in (11). Conditioned on N = n with P(N = n) > 0, steps (b) and (c) give that U ∼ µ and K are independent, so the conditional distribution of (K + U)q^{−N} is absolutely continuous with a conditional PDF given by

q^n (c_{k;n} − c_{⌈k/q⌉;n−1}) / (a_n − q a_{n−1})  on I_{k;n},  k = 1, ..., q^n.

Moreover, we get from (10) that P(N = 0) = c_∅ and

P(N = n) = q^{−n} a_n − q^{−(n−1)} a_{n−1} = q^{−n}(a_n − q a_{n−1}),  n = 1, 2, ....

Therefore, the (unconditional) distribution of (K + U)q^{−N} is absolutely continuous with a PDF which at each point x = ∑_{n=1}^∞ x_n q^{−n} ∈ [0, 1) with x_n = ⌊T^{n−1}(x)q⌋ is given by

c_∅ + ∑_{n=1}^∞ (c_{x_1,...,x_n} − c_{x_1,...,x_{n−1}}).

This PDF agrees with (8), so (K + U)q^{−N} ∼ f.

Denote by B the class of Borel subsets of [0, 1). The total variation distance between two probability measures ν_1 and ν_2 defined on B and with PDFs g_1 and g_2, respectively, is given by

d_TV(ν_1, ν_2) = sup_{A ∈ B} |ν_1(A) − ν_2(A)| = 1 − ∫_0^1 min{g_1(t), g_2(t)} dt,    (12)

see e.g. Lemma 2.1 in [11]. Then Theorem 3.1 shows the following.

Corollary 3.4. For n = 0, 1, ...,

d_TV(P_n, µ) ≤ P(N > n).    (13)
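As a concrete illustration of the simulation algorithm in Corollary 3.3 (our own sketch, with our own function names), take f(x) = 2x and q = 2. Then c_{k;n} = inf_{I_{k;n}} f = 2(k − 1)2^{−n}, so the level-to-level increments of the infima vanish on 'left-child' intervals and equal 2^{1−n} on 'right-child' intervals; this gives P(N = n) = 2^{−n} for n ≥ 1 and, given N = n, K = 2M + 1 with M uniform on {0, ..., 2^{n−1} − 1}:

```python
import random

random.seed(0)

def sample_X():
    """One draw of X with density f(x) = 2x on [0, 1), q = 2, via the
    coupling: here c_{k;n} = 2 (k - 1) 2^{-n}, so P(N = n) = 2^{-n} for
    n >= 1 and, given N = n, the interval index is K = 2 M + 1 with M
    uniform on {0, ..., 2^{n-1} - 1} (only 'right-child' intervals carry
    mass); finally X = (K + U) 2^{-N} with U uniform on [0, 1)."""
    n = 1
    while random.random() > 0.5:    # N = n with probability 2^{-n}
        n += 1
    m = random.randrange(2 ** (n - 1))
    k = 2 * m + 1
    u = random.random()
    return (k + u) * 2.0 ** -n

xs = [sample_X() for _ in range(100_000)]
mean = sum(xs) / len(xs)            # should be close to E[X] = 2/3
```

The empirical mean should be close to E[X] = 2/3, and the empirical mass below 1/2 close to P(X ≤ 1/2) = 1/4, confirming that the algorithm reproduces f without rejection steps.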
Remark 3. It is well-known that b_n := ∫_0^1 min{1, f_n(t)} dt = 1 − d_TV(P_n, µ) is the maximal number such that there exists a coupling between T^n(X) ∼ P_n and a uniform random variable U ∼ µ for which T^n(X) = U with probability b_n (see e.g. Theorem 8.2 in [10]). Thus d_TV(P_n, µ) = P(N > n) if and only if ∫_0^1 min{1, f_n(t)} dt = q^{−n} ∑_{k=1}^{q^n} inf_{I_{k;n}} f. In particular, d_TV(P_0, µ) = P(N > 0) if and only if X ∼ µ.
It follows from Corollaries 3.2 and 3.4 that (7) implies lim_{n→∞} d_TV(P_n, µ) = 0. In Theorem 4.1 below we show that (7) is not needed for this convergence result.
Proof. Using Corollary 3.3, let X = (K + U)q^{−N}. For n = 0, 1, ..., if Q_n denotes the probability distribution of T^{n−N}(U) (interpreted as that of U if N > n), then Q_n = µ by invariance of µ under T, and so

d_TV(P_n, µ) = d_TV(P_n, Q_n) ≤ P(T^n(X) ≠ T^{n−N}(U)) ≤ P(N > n),

where the first inequality is the standard coupling inequality for the coupled random variables T^n(X) and T^{n−N}(U), and the last inequality follows since N ≤ n implies T^n(X) = T^{n−N}(T^N(X)) = T^{n−N}(U). Thereby (13) is verified.
Remark 4. By the Kantorovich-Rubinstein theorem, the Wasserstein distance between two probability measures ν_1 and ν_2 on [0, 1] is given by

d_W(ν_1, ν_2) = inf_{γ ∈ Γ(ν_1, ν_2)} E_{(V_1, V_2) ∼ γ} |V_1 − V_2|,

where Γ(ν_1, ν_2) consists of all couplings of ν_1 and ν_2. By [6, Thm 4], d_W(ν_1, ν_2) ≤ d_TV(ν_1, ν_2) for probability measures on [0, 1], so by Remark 3, Corollary 3.4 implies

d_W(P_n, µ) ≤ P(N > n).

The latter bound can be improved by using the coupling between T^n(X) and T^N(X) ∼ µ. See also [6] for an overview of the relation between the total variation distance and other measures of distance between probability measures.
Before proving this theorem we need the following lemma.
We are now ready for the proof of Theorem 4.1.
To prove (19) we use (22) with g replaced by f. Then the claimed bound holds for every A ∈ B, and taking the supremum over all A ∈ B, where the second identity in (12) is used, gives (19).
Remark 6. In continuation of Remark 3, by Theorem 4.1, b_n → 1, and under weak conditions the convergence is exponentially fast.
So how many digits are needed?
This section starts with some theoretical statistical considerations and then continues with some specific examples. Consider a parametric model for the probability distribution of X given by a parametric class of lower semi-continuous densities f_θ, where θ is an unknown parameter. By Theorem 3.1 this specifies a parametric model for (X_1, ..., X_N) which is independent of T^N(X) ∼ µ. In practice we cannot expect N to be observable, but let us imagine it is. Then, according to general statistical principles (see e.g. [1]), statistical inference for θ should be based on the sufficient statistic (X_1, ..., X_N), whilst T^N(X) is an ancillary statistic and hence contains no information about θ. Moreover, Theorem 4.1 ensures (without assuming that the densities are lower semi-continuous) that T^n(X) is approximately uniformly distributed. Hence, if n is 'large enough', nearly all information about θ is contained in (X_1, ..., X_n).
Remark 7. For another paper it could be interesting to consider a so-called missing data approach for a parametric model of the distribution of (X_1, ..., X_N), with an unknown parameter θ and treating N as an unobserved statistic (the missing data): Suppose X^(1), ..., X^(k) are IID copies of X, with corresponding 'sufficient statistics' (X_1^(i), ..., X^(i)_{N^(i)}), i = 1, ..., k. The EM-algorithm may be used for estimation of θ. Or a Bayesian approach may be used, imposing a prior distribution for θ and then considering the posterior distribution of (N^(1), ..., N^(k), θ).
According to Corollary 3.2, the number of digits we need will in general depend on the realization of X = x. As a measure for this dependence, for f(x) > 0 and n = 0, 1, ..., we may consider P(N > n | X = x) as a function of x, which can be calculated from (9). Since N ≤ n implies T^n(X) ∼ µ, an overall measure which quantifies the number n of digits needed is given by P(N > n), cf. (10). The use of these measures requires that f is lower semi-continuous, whilst the bounds in Theorem 4.1 for the total variation distance d_TV(P_n, µ) hold without this condition.
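As a small illustration of the overall measure (our own sketch, not from the paper): by the coupling construction, P(N ≤ n) = q^{−n} ∑_{k=1}^{q^n} inf_{I_{k;n}} f, and for a decreasing density the infima sit at the right endpoints of the intervals. For the hypothetical choice f(t) = 2(1 − t) with q = 2, this yields P(N > n) = 2^{−n} exactly:

```python
def tail_prob(f, q, n):
    """P(N > n) = 1 - q^{-n} * sum_k inf_{I_{k;n}} f, where for a DECREASING
    density f the infimum over I_{k;n} = [(k-1) q^{-n}, k q^{-n}) is f(k q^{-n})."""
    return 1.0 - q ** -n * sum(f(k * q ** -n) for k in range(1, q ** n + 1))

f = lambda t: 2.0 * (1.0 - t)   # a decreasing density on (0, 1)
tails = [tail_prob(f, 2, n) for n in range(8)]
# closed form for this f: P(N > n) = 2^{-n}
```

So for this density, each extra digit halves the probability that the remainder is not yet exactly uniform.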
The following Examples 1 and 2 demonstrate how these measures can be used to quantify the number n of digits needed in order that N > n (conditioned or not on X = x) with a small probability, or that d_TV(P_n, µ) is small.

Example 1. Any number y ≠ 0 can uniquely be written as y = s q^k (y_0 + y_f), where s = s(y) ∈ {±1} is the sign of y, k = k(y) ∈ Z determines the decimal point of y in base-q, y_0 = y_0(y) ∈ {1, ..., q − 1} is the leading digit of y in base-q, and y_0 + y_f is the so-called significand of y in base-q, where y_f = y_f(y) ∈ [0, 1) is the fractional part of y_0 + y_f in base-q. Correspondingly, consider any real-valued random variable Y ≠ 0 (or just P(Y = 0) = 0), so (almost surely) Y = S q^K (X_0 + X), where S = s(Y), K = k(Y), X_0 = y_0(Y), and X = y_f(Y) are random variables. Let X_1, X_2, ... be the digits of X in the base-q expansion, cf. (3). We call X_0, X_1, X_2, ... the significant digits of Y in base-q. By definition Y satisfies the extended Newcomb-Benford law if

P(X_0 = x_0, X_1 = x_1, ..., X_n = x_n) = log_q(1 + 1/∑_{j=0}^n x_j q^{n−j})    (28)

for n = 0, 1, ... and any x_0 ∈ {1, ..., q − 1} and x_j ∈ {0, 1, ..., q − 1} with 1 ≤ j ≤ n. Equivalently, the log-significand of Y in base-q, log_q(X_0 + X), is uniformly distributed on [0, 1) (Theorem 4.2 in [2]). Then X has CDF and PDF given by

F(x) = ∑_{d=1}^{q−1} log_q(1 + x/d),  f(x) = (1/ln q) ∑_{d=1}^{q−1} 1/(d + x),  0 ≤ x < 1.    (29)

The extended Newcomb-Benford law applies to a wide variety of real datasets, see [7, 2] and the references therein. The law is equivalent to appealing scale invariance properties: Equation (28) is equivalent to Y having scale invariant significant digits (Theorem 5.3 in [2]), or just that there exists some d ∈ {1, ..., q − 1} such that P(y_0(aY) = d) does not depend on a > 0 (Theorem 5.8 in [2]). Remarkably, for any positive random variable Z which is independent of Y, if the extended Newcomb-Benford law is satisfied by Y, it is also satisfied by Y Z (Theorem 8.12 in [2]). For the remainder of this example, suppose (28) is satisfied. Since f in (29) is decreasing, inf_{I_{k;n}} f = f(k q^{−n}), so considering (10) gives, for n = 0, 1, ...,

P(N > n) = 1 − q^{−n} ∑_{k=1}^{q^n} f(k q^{−n}).

The tail probabilities P(N > n) decrease quickly as n and q increase; see the left panel in Figure 1 for plots of P(N > n) against n for q = 2, 3, 5, 10. The middle panel of Figure 1 shows P(N > 1 | X = x) as a function of x for q = 10. We see large fluctuations, with probabilities dropping to zero when approaching the right limit of the intervals I_{k;1}, where inf_{I_{k;1}} f is attained. To avoid these fluctuations, the right panel of Figure 1 shows an upper bound on P(N > n | X = x) as a function of x for q = 10 and n = 0, 1, 2, 3. The upper bound is found by noting that on each I_{k;n}, P(N > n | X = x) is convex and decreasing towards zero. Hence an upper bound is given by evaluating at the left end points and interpolating linearly. The plot shows that P(N > n | X = x) is very close to zero for all x already for n = 2. This is also in accordance with Theorem 4.1, stating that T^n(X) converges to a uniform distribution on [0, 1), and hence the first digit X_{n+1} of T^n(X) is approximately uniformly distributed on {0, 1, ..., q − 1} when n is large. For n = 1, 2, ... and x_n ∈ {0, 1, ..., q − 1}, we have

P(X_n = x_n) = F_{n−1}((x_n + 1)/q) − F_{n−1}(x_n/q),

where P(X_n = x_n) is a decreasing function of x_n. The left panel of Figure 2 shows plots of P(X_n = 0) − P(X_n = q − 1) versus n for q = 2, 3, 5, 10, indicating fast convergence to uniformity and that the convergence speed increases with q. The right panel of Figure 2 illustrates the stronger statement in (18) that the PDF f_n of T^n(X) converges uniformly to the uniform PDF.

Figure 2: Left panel: P(X_n = 0) − P(X_n = q − 1) as a function of n for various values of q when f is as in (29). Right panel: f_n when q = 2 and n = 0, ..., 5.
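The Benford quantities above are straightforward to simulate (our own sketch, not from the paper): sampling the log-significand V uniformly on [0, 1) and setting the significand to q^V is equivalent to the law, after which the leading-digit probabilities log_q(1 + 1/d) and the CDF of X can be checked empirically:

```python
import math, random

random.seed(1)
q = 10

def sample_X0_X():
    """Sample (X0, X) under the Newcomb-Benford law: the log-significand is
    uniform on [0, 1), so the significand is q^V with V ~ U[0, 1)."""
    s = q ** random.random()        # significand in [1, q)
    x0 = int(s)                     # leading significant digit X0
    return x0, s - x0               # X = fractional part of the significand

samples = [sample_X0_X() for _ in range(200_000)]

# leading-digit law: P(X0 = d) = log_q(1 + 1/d)
p_lead_1 = sum(1 for x0, _ in samples if x0 == 1) / len(samples)

# CDF of X: F(x) = sum_{d=1}^{q-1} log_q(1 + x/d), checked at x = 0.5
F_half = sum(math.log(1 + 0.5 / d, q) for d in range(1, q))
p_half = sum(1 for _, x in samples if x <= 0.5) / len(samples)
```

With 200,000 draws, the empirical frequency of leading digit 1 should be within about 0.01 of log_10(2) ≈ 0.301, and the empirical CDF at 0.5 should match the analytic value.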
To further illustrate the fast convergence, we drew a sample of 1000 observations with CDF (29) and made a χ² goodness-of-fit test for uniformity of X_n. Considering a significance level of 0.05, the rejection rate over 10,000 repetitions is shown in Table 1. Such a χ² test can also be used as a test for uniformity of the remainder T^{n−1}(X). A more refined test can be performed by basing the goodness-of-fit test on the 2^k combinations of the first k digits (X_n, ..., X_{n+k−1}). The result is shown in Table 1 for k = 1, 2, 3. When n = 1 we always rejected the hypothesis that (X_n, ..., X_{n+k−1}) is uniformly distributed; when n = 2 the rejection rate decreases as k grows, and it is 0.067 for k = 3; and when n ≥ 3 the rejection rates are close to 0.05, as expected if the hypothesis is true. When we instead tried with a sample of 100 observations, even when n = 1 the test had almost no power for k = 1, 2, 3.

n       1      2      3      4      5      6      7      8
k = 1   1.000  0.094  0.050  0.054  0.052  0.051  0.054  0.055
k = 2   1.000  0.081  0.047  0.050  0.051  0.053  0.052  0.047
k = 3   1.000  0.067  0.050  0.049  0.049  0.050  0.052  0.052

Table 1: Rejection rate for a χ² goodness-of-fit test for uniformity of (X_n, ..., X_{n+k−1}).
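The χ² computations behind such a table can be reproduced along the following lines (our own sketch; we use q = 10 and a larger sample than in the table so that the rejection at n = 1 and the acceptance at n = 4 are unambiguous for a single run):

```python
import random

random.seed(2)
q, N = 10, 100_000

def digit_n(x, n):
    """n-th base-q digit of x: X_n = floor(T^{n-1}(x) q)."""
    t = x
    for _ in range(n - 1):
        t = t * q - int(t * q)
    return int(t * q)

# sample X under the Newcomb-Benford law (significand q^V, V uniform)
xs = []
for _ in range(N):
    s = q ** random.random()
    xs.append(s - int(s))

def chi2_uniform(digs):
    """Pearson chi-square statistic against the uniform law on {0, ..., q-1}."""
    counts = [0] * q
    for d in digs:
        counts[d] += 1
    expected = len(digs) / q
    return sum((c - expected) ** 2 / expected for c in counts)

chi1 = chi2_uniform([digit_n(x, 1) for x in xs])  # first digit of X: non-uniform
chi4 = chi2_uniform([digit_n(x, 4) for x in xs])  # fourth digit: nearly uniform
# the 0.05 critical value of chi-square with q - 1 = 9 df is about 16.92
```

The statistic for the first digit of X (the second significant digit of Y) is far above the critical value at this sample size, while the statistic for the fourth digit behaves like a central χ² with 9 degrees of freedom.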
Example 2. To illustrate how the convergence rate in Theorem 4.1 depends on the smoothness of f, let f(t) = α t^{α−1} be a beta-density with shape parameters α > 0 and 1. Then f′ is bounded on (0, 1) if and only if α = 1 or α ≥ 2. Of course, P_n and µ agree if α = 1. For q = 2, Figure 3 shows plots of d_TV(P_n, µ) and log(d_TV(P_n, µ)) versus n when α = 0.1, 0.5, 1, 1.5, 5, 10, as well as a plot of the difference between log((1/8) q^{−2n} ∑_{j=0}^{q^n−1} ∥f′_{n,j}∥_∞) and log(d_TV(P_n, µ)) for three values of α ≥ 2. We used the Newton-Raphson procedure to find x_0 (the procedure always converges). The first plot in Figure 3 shows that for all values of α, d_TV(P_n, µ) goes to zero, as guaranteed by Theorem 4.1. The second plot indicates that for α > 1, d_TV(P_n, µ) decays exponentially at a rate independent of α, while for α < 1, the decay is also exponential, but with a slower rate. The graphs in the third plot seem to approach zero, indicating that for α ≥ 2, the rate of decay is indeed as given by (19), which holds since f′ is bounded. In the middle plot, the decay rate also seems to be q^{−n} for α = 1.5, though this is not guaranteed by Theorem 4.1. To see why the rate q^{−n} also holds for 1 < α < 2, we argue as follows: in (23), (24), and (26), we may refine to the cases j = 0 and j > 0 (observing that ∥f′_{n,j}∥_∞ < ∞ when j > 0) to obtain a modification of (17) with the rate q^{−n}.
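The qualitative behaviour of the first plot in Figure 3 is easy to reproduce by brute force (our own sketch, using a simple midpoint rule instead of the Newton-Raphson-based computation): d_TV(P_n, µ) = (1/2) ∫_0^1 |f_n(t) − 1| dt with f_n as in the beginning of Section 2.

```python
def dtv(alpha, q, n, grid=2000):
    """Midpoint-rule approximation of d_TV(P_n, mu) = (1/2) int_0^1 |f_n(t) - 1| dt,
    where f_n(t) = q^{-n} sum_j f(q^{-n}(j + t)) and f(t) = alpha * t^(alpha - 1)."""
    f = lambda t: alpha * t ** (alpha - 1)
    total = 0.0
    for i in range(grid):
        t = (i + 0.5) / grid
        fn = q ** -n * sum(f(q ** -n * (j + t)) for j in range(q ** n))
        total += abs(fn - 1.0)
    return 0.5 * total / grid

q = 2
decay = [dtv(5.0, q, n) for n in range(7)]   # alpha = 5: smooth density, fast decay
```

Since P_{n+1} is the pushforward of P_n under T and µ is T-invariant, the sequence d_TV(P_n, µ) is non-increasing, and for α = 5 it starts at (1/2)∫|5t⁴ − 1| dt ≈ 0.535 and drops roughly geometrically.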
As in Example 1, we tested for uniformity of T^{n−1}(X) by a χ² goodness-of-fit test for uniform distribution of the k = 3 first digits (X_n, X_{n+1}, X_{n+2}), again using 10,000 replications of samples of 1000 observations from a beta-distribution with α = 0.1, 0.5, 1.5, 5. Table 2 shows that for α = 0.1, uniformity is rejected in all samples for all n, indicating that the distribution of the remainder remains far from uniform even for n = 8. For α = 0.5, the rejection rate reaches the 0.05 level for n = 5, while for α = 1.5 this happens already for n = 2, and for α = 5 it happens around n = 3 or n = 4. For α > 1 close to 1, the results are comparable to those for the Benford law in Example 1, while for large α and for α < 1, the rejection rate is higher, indicating slower convergence.

Remark 8. In conclusion, Examples 1 and 2 demonstrate that the answer to the title of our paper ('How many digits are needed?') of course depends much on q (in Example 1, the higher q is, the fewer digits are needed) and on how much f deviates from the uniform PDF on [0, 1) (in Example 2, the more skewed f is, the more digits are needed). Moreover, as this deviation increases or the sample size decreases, the χ² goodness-of-fit test as used in the examples becomes less powerful; alternative tests are discussed in [8].

Figure 1: Left panel: P(N > n) as a function of n for q = 2, 3, 5, 10. Middle panel: P(N > 1 | X = x) as a function of x for q = 10. Right panel: An upper bound for P(N > n | X = x) as a function of x for n = 0, 1, 2, 3 and q = 10.

Figure 3: The first two plots show d_TV(P_n, µ) and log(d_TV(P_n, µ)), respectively, as a function of n for q = 2 and various values of α. The last plot shows the difference between log((1/8) q^{−2n} ∑_{j=0}^{q^n−1} ∥f′_{n,j}∥_∞) and log(d_TV(P_n, µ)) for three values of α ≥ 2.