1 Introduction

Benford’s law is a much studied probability distribution for significant digits, which has challenged mathematicians and practitioners for decades (Berger and Hill 2011, 2015; Miller 2015a). It has also attracted the interest of statisticians, mainly thanks to the limit theorem presented by Hill (1995), which motivates the adoption of the law as the digit-generating model in many real-world situations. We then call Benford hypothesis the statement, to be made more precise in the following sections, that a random sample of n observations consists of realizations of an absolutely continuous random variable X which is distributed according to Benford’s law.

Statistical assessment of the Benford hypothesis is crucial for many purposes. For instance, when the Benford hypothesis is expected to hold for genuine observations, deviations from it can be taken as evidence of possible data manipulation. A selection of effective developments of this idea can be found in a growing number of fields, including the analysis of financial statements (Tam Cho and Gaines 2007; Nigrini 2012), electoral processes (Mebane 2010; Pericchi and Torres 2011; Fernández-Gracia and Lacasa 2018) and international trade (Barabesi et al. 2018; Cerioli et al. 2019; Lacasa 2019; Barabesi et al. 2021). The Benford hypothesis may also help to corroborate or disprove scientific theories. Shao and Ma (2010) theoretically prove that the first-digit pattern complies with Benford’s law for famous physical models such as the Boltzmann–Gibbs, the Fermi–Dirac and the Bose–Einstein statistics. Nigrini and Miller (2007) check the conformity of streamflow data arising in hydrology and conclude that data related to water bodies should adhere to the Benford hypothesis, except in cases where the data quality is poor, the sampling mechanism is flawed, or the hydrological process follows a power law with a large exponent. In astrophysics, a wealth of research has recently been conducted to address the question of whether the distribution of the digits of physical quantities of celestial bodies is in good agreement with Benford’s law (Alexopoulos and Leontsinis 2014; de Jong et al. 2020; Melita and Miraglia 2021). In the case of distances from the Earth to galaxies, Hill and Fox (2016) theoretically explain, using astrophysical arguments and the mathematical properties of Benford’s law, why these distances should follow the Benford hypothesis, and conclude that empirical observation of such an agreement may be viewed as new independent evidence of the validity of Hubble’s law. Accurate statistical assessment of the Benford hypothesis would then be required, but most of the applied astrophysical literature either relies on diagnostic checks of the data or merely performs statistical evaluation of the looser hypothesis of first-digit compliance (see, e.g., Melita and Miraglia 2021).

Formal tests of the Benford hypothesis already exist and have been widely debated in the literature: see, e.g., Kossovsky (2015, Section 3) for a general account and Barabesi et al. (2022), Cerasa (2022) and Cerqueti and Lupi (2023) for a few recent proposals. Notwithstanding the difficulty of separating the statistical significance from the practical importance of an empirically observed deviation from Benford’s law, which implies that formal tests might be judged too severe in rejecting the Benford hypothesis especially in large samples (Kossovsky 2015), in this work we stress that most existing tests suffer from two main drawbacks. The first shortcoming is that they are rather ad hoc in nature, being derived on the basis of some specific properties of Benford’s law. Each property may yield good statistical performance under certain alternatives, but poorer behavior in other circumstances; see Barabesi et al. (2022) for a detailed study and comparison of two such properties. As a consequence, a uniformly most powerful test does not exist. The second drawback of the available test statistics is their lack of scale invariance, except for the trivial case of multiplication by a power of 10. For example, consider the synthetic data set of \(n=30\) observations displayed in Table 1. On these data the classical first-digit Pearson statistic, which is introduced in Sect. 4.1 and there denoted by \(\chi ^2\), equals 1.79 and yields an exact p-value of 0.998. If all the observations in the table are multiplied by the fixed scaling factor \(\sigma =373\), we instead obtain \(\chi ^2=16.73\) and an exact p-value of 0.036. The same happens with another popular first-digit discrepancy measure, the mean absolute deviation, also introduced in Sect. 4.1 and there denoted by M, for which we obtain \(M=0.022\) (with an exact p-value of 0.984) on the data of Table 1, and \(M=0.055\) (with an exact p-value of 0.147) after scaling by \(\sigma \). While lack of invariance may be tolerable to some extent in certain domains, it must certainly be avoided for the purpose of validating scientific theories and, more generally, in the analysis of scientific data.

Table 1 Synthetic data set of \(n=30\) observations

The main goal of this paper is to provide a principled and unified framework for the assessment of the Benford hypothesis. Our aim is achieved by taking two alternative but connected routes that originate from the same source: the characteristic function of a suitable transform of the (univariate) random variable which is assumed to define the data generating process. One path leads to measuring how close the statistical model of interest is to the Benford hypothesis, while the second track takes an inferential perspective and develops powerful and scale-invariant test statistics for this hypothesis.

Specifically, in the first part of the manuscript we develop a general relationship between the distribution of X and that of its significand S(X), to be defined in Sect. 2. This relationship allows us to study the probabilistic structure of many standard univariate models for X when they are framed in the significand space. Although a number of scattered results exist on this topic (see, e.g., Engel and Leuenberger 2003; Nigrini and Miller 2007; Dümbgen and Leuenberger 2008; Miller 2015b; Leemis 2015; Dümbgen and Leuenberger 2015; Berger and Twelves 2018), most of the findings reported in Sect. 3 are new. Even more importantly, our approach is general, being based on the study of the characteristic function of a convenient transform of X. It is thus readily available for potential extension to other univariate models not considered here and for developments in more complex schemes, such as mixtures and contamination models. We establish the closeness of a particular model to the Benford hypothesis through a suitable Kolmogorov distance, and we suggest an easy strategy for regularization. The proposed metric is a natural choice in our framework and could potentially be extended to deal with outlier contamination in situations where robust validation and testing are of interest (Álvarez-Esteban et al. 2012; del Barrio et al. 2020). We argue that providing a general relationship between the distribution of X and that of its significand S(X) can be fruitful for two reasons. One is that it helps to translate much of the well-understood knowledge on X into the more slippery world of the significand space (see Berger and Hill 2021 for a list of surprisingly common mistakes in the field). The second advantage is that the computed discrepancy between X and the Benford hypothesis helps to sharpen the meaning of what might be called “almost Benford” behavior (Demidenko 2020, Section 2.1.6). Most real-world data generating processes do not exactly follow Benford’s law, although it is known that some of them may closely resemble the law for certain parameter configurations; see, e.g., Berger and Hill (2015, p. 55), Miller (2015b, Section 3.5.3) and Cerqueti and Maggi (2021). Therefore, our discrepancy measure allows us to represent deviations from the Benford hypothesis in terms of more familiar and understandable variations in the parameters of X.

The relationship between the distributions of X and S(X) leads, in the second part of our work, to effective test statistics for assessing the Benford hypothesis when a random sample from X is available. Our approach takes advantage of a density estimation step based on a possibly unknown number of nonnegative trigonometric polynomials. Under this approach, we obtain, in Sect. 4, two test statistics of the Benford hypothesis which are shown to be asymptotically equivalent. One of them provides the so far missing likelihood ratio test of the Benford hypothesis, while the other statistic is an extension of the well-known Rayleigh test which is effective against multimodal alternatives. We allow for a data-driven selection of the number of nonnegative trigonometric polynomials and we show that the proposed test statistics are invariant under scale transformations of the data.

In Sect. 5 the empirical advantage of the proposed tests of the Benford hypothesis is shown through an extensive simulation study encompassing many classical models for X. Two applications of scientific interest, one to astrophysical data and the other to hydrological data, conclude the paper in Sects. 6 and 7, respectively. These applications further motivate our developments and clearly point to the requirement of scale-invariant test statistics. All the technical proofs are deferred to the Appendix, available through the Supplementary Materials. There, we also provide a Supplement with additional simulation results, details about the simulation algorithm and links both to the code that we have used for our computations and to the data on which our tests are performed.

2 Preliminaries

We provide some basic definitions and properties of Benford’s law that are required by the developments of the following sections.

The (base-10) significand function \(S:\mathbb {R}\setminus \{0\}\rightarrow [1,10[\) is defined as

$$\begin{aligned} S(x)=10^{\langle \log _{10}|x|\rangle }, \end{aligned}$$

where \(\langle a\rangle =a-\lfloor a\rfloor \), and \(\lfloor a\rfloor =\max \{n\in \mathbb {Z}:n\le a\}\). Let X be an absolutely continuous random variable defined on the probability space \((\Omega ,\mathcal {F},P)\). According to Berger and Hill (2015, p.30), X is a Benford random variable if the distribution function of S(X) is

$$\begin{aligned} F_{S(X)}(u)=\log _{10}u, \quad u\in [1,10[. \end{aligned}$$
(1)

We say that the distribution of X satisfies the Benford hypothesis when (1) holds.

In our perspective it is often convenient to refer to the (base-10) log-significand function

$$\begin{aligned} s(x)=\langle \log _{10}|x|\rangle , \end{aligned}$$

with the assumption that \(s(0):=0\). If \(F_{s(X)}\) is the distribution function of s(X), the Benford hypothesis is then equivalently stated as

$$\begin{aligned} F_{s(X)}(u)=u, \quad u\in [0,1[. \end{aligned}$$
(2)
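
For illustration, both transforms are straightforward to compute numerically. The following minimal Python sketch (the function names are ours) implements S(x) and s(x) and checks empirically that \(X=10^U\), with U uniform on [0, 1[, satisfies (2).

```python
import numpy as np

def significand(x):
    """Base-10 significand S(x) = 10^<log10|x|>, for x != 0."""
    return 10.0 ** (np.log10(np.abs(x)) % 1.0)

def log_significand(x):
    """Base-10 log-significand s(x) = <log10|x|>, with the convention s(0) := 0."""
    x = np.asarray(x, dtype=float)
    safe = np.where(x == 0.0, 1.0, x)          # avoid log10(0); zeros are mapped to s = 0
    return np.where(x == 0.0, 0.0, np.log10(np.abs(safe)) % 1.0)

# A Benford random variable can be simulated as X = 10^U, with U uniform on [0, 1):
rng = np.random.default_rng(0)
x = 10.0 ** rng.uniform(size=100_000)
for u in (0.2, 0.5, 0.8):
    # Under (2), the empirical distribution function of s(X) at u should be close to u.
    print(u, np.mean(log_significand(x) <= u))
```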

We are interested in providing a bridge between the usual statistical approach, where a model is placed on the distribution of X, and the Benford hypothesis. For this purpose, we acknowledge the contribution of Pinkham (1961), who provided a pioneering theoretical discussion of why and to what extent Benford’s law must hold. The key result of Pinkham (1961) is given in the following theorem.

Theorem 1

Let X be an absolutely continuous random variable defined on the probability space \((\Omega ,\mathcal {F},P)\). Moreover, let \(\varphi _Y\) be the characteristic function of the random variable \(Y=2\pi \log _{10}|X|\), i.e.,

$$\begin{aligned} \varphi _Y(t)={\textrm{E}}\left[ e^{2\pi \textrm{i}t\log _{10}|X|}\right] , \quad t\in \mathbb {R}. \end{aligned}$$
(3)

If \(\varphi _Y(t)=O(|t|^{-h})\) when \(|t|\rightarrow \infty \) for a given \(h>0\), we have

$$\begin{aligned} F_{s(X)}(u)=u+\frac{1}{2\pi }\,\sum _{k\in \mathbb {Z},k\ne 0}\frac{1-e^{-2\pi \textrm{i}ku}}{\textrm{i}k}\,\varphi _Y(k), \quad u\in [0,1[. \end{aligned}$$
(4)

Proof

See Pinkham (1961). \(\square \)

An alternative (and more manageable) expression for (4) is introduced in the following corollary.

Corollary 1

We have

$$\begin{aligned} F_{s(X)}(u)=u+\frac{1}{\pi }\,\sum _{k\in \mathbb {N}}\frac{|\varphi _Y(k)|}{k}\,(\sin (\vartheta _k)-\sin (\vartheta _k-2\pi ku)), \quad u\in [0,1[, \end{aligned}$$
(5)

where \(\vartheta _k=\arg (\varphi _Y(k))\).

Proof

See the Appendix in the Supplementary Materials. \(\square \)

It is apparent from Corollary 1 that the coefficients \((|\varphi _Y(k)|)_{k\ge 1}\) are central, since they determine the series representation (5). Moreover, the Nth partial sum of (5) is denoted by

$$\begin{aligned} G_{N}(u)=u+\frac{1}{\pi }\,\sum _{k=1}^{N}\frac{|\varphi _Y(k)|}{k}\,(\sin (\vartheta _k)-\sin (\vartheta _k-2\pi ku)), \quad u\in [0,1[. \end{aligned}$$
(6)

A series representation for the probability density function of s(X), say \(f_{s(X)}\), is also obtained in the following corollary, although this result is not used in the sequel of the paper.

Corollary 2

If \(\sum _{k\in \mathbb {N}}|\varphi _Y(k)|<\infty \), we have

$$\begin{aligned} f_{s(X)}(u)=1+2\,\sum _{k\in \mathbb {N}}|\varphi _Y(k)|\cos (\vartheta _k-2\pi ku), \quad u\in [0,1[. \end{aligned}$$

Proof

See the Appendix in the Supplementary Materials. \(\square \)

For instance, the condition in Corollary 2 is satisfied when \(|\varphi _Y(k)|=O(k^{-h})\) with \(h>1\) or when \(\varphi _Y\) is absolutely integrable, since \(\sum _{k\in \mathbb {N}}|\varphi _Y(k)|\le \int _{-\infty }^\infty |\varphi _Y(t)|dt\). Actually, the common parametric families of absolutely continuous circular densities (considered, e.g., by Mardia and Jupp 2000) satisfy such an assumption. Moreover, almost all the laws considered in this paper also satisfy the condition, the only exceptions being the Beta distribution with shape parameter \(\beta \le 1\) and the generalized Benford distribution.

We measure the closeness of the distribution of X to the Benford hypothesis (2) through the Kolmogorov distance between the distribution function of the uniform law on [0, 1[ and (6) for \(N\in \mathbb {N}\). This distance is

$$\begin{aligned} \varDelta _N= & {} \sup _{u\in [0,1[}\left|G_{N}(u)-u\right|\nonumber \\= & {} \sup _{u\in [0,1[}\left|\frac{1}{\pi }\,\sum _{k=1}^N\frac{|\varphi _Y(k)|}{k}\,(\sin (\vartheta _k)-\sin (\vartheta _k-2\pi ku))\right|. \end{aligned}$$
(7)

It is apparent that \(\varDelta _N\) is completely determined by the vector \((\varphi _Y(1),\ldots ,\varphi _Y(N))\). Furthermore, the Kolmogorov distance (7) provides a further equivalent statement of the Benford hypothesis, namely that \(\varDelta _N=0\) for each \(N\in \mathbb {N}\).
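
As a numerical aid, the partial sum (6) and the distance (7) can be evaluated directly from the first N coefficients; the sketch below (with purely hypothetical coefficient values) approximates the supremum in (7) on a fine grid.

```python
import numpy as np

def G_N(u, phi):
    """Partial sum (6); phi = [phi_Y(1), ..., phi_Y(N)] as complex numbers."""
    u = np.atleast_1d(u).astype(float)
    total = u.copy()
    for k, ph in enumerate(phi, start=1):
        r, theta = np.abs(ph), np.angle(ph)
        total += (r / (np.pi * k)) * (np.sin(theta) - np.sin(theta - 2 * np.pi * k * u))
    return total

def kolmogorov_distance(phi, grid_size=100_001):
    """Delta_N in (7), with the supremum over [0, 1) approximated on a grid."""
    u = np.linspace(0.0, 1.0, grid_size, endpoint=False)
    return np.max(np.abs(G_N(u, phi) - u))

# Hypothetical coefficients, for illustration only: phi_Y(1) = 0.3, phi_Y(2) = 0.05i.
print(kolmogorov_distance([0.3, 0.05j]))
```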

Another recurrent theme of our work is the effect of scaling. Let \(X_\sigma =\sigma X\) denote the scaled version of the random variable X for \(\sigma \in \mathbb {R}^{+}\). A basic and easily seen property of the significand is that \(F_{s(X_\sigma )}\) has a periodic behavior with respect to the scale parameter \(\sigma \), i.e.,

$$\begin{aligned} F_{s(X_\sigma )}=F_{s(X_{10^n\sigma })}, \quad n\in \mathbb {N}. \end{aligned}$$
(8)

We expect this feature to be retrieved from the behavior of the Kolmogorov distance (7). However, as already anticipated in Sect. 1, it does not automatically translate into a sensible invariance property when testing conformance to (1) and (2).

We finally introduce the symbols \(\varDelta _\infty = \lim _{N\rightarrow \infty } \varDelta _N\) and \(C=\log _{10} e\), to be used in the developments that follow.

3 Statistical models under the Benford lens

In this section we build on (3) and (5) to assess the structure of the coefficients \((|\varphi _Y(k)|)_{k\ge 1}\) for some well-known models for X, which are representative of a wide range of probabilistic behavior. It is clear from (5) that these coefficients have a key role in assessing the discrepancy of the distribution of a given random variable from the distribution of a Benford random variable. For well-behaved statistical models, the first few coefficients should reasonably “characterize” the underlying distribution in the Benford framework, in such a way that \(G_N\) provides a suitable approximation of \(F_{s(X)}\) even for a small N. Therefore, an appropriate analysis of these quantities is required.

Our aim is to recast the structure of some basic models for X in terms of their agreement with the Benford hypothesis. We argue that these relationships can help to better link the usual approach where the focus is on X to the alternative statistical view where the reference space is that of S(X) (or, equivalently, that of s(X)). The reward of such a link is twofold. On the one hand, we can better understand the instances where the data generating process may be expected to be close to (or far from) the Benford behavior by relating this behavior to statistical models that are well understood after many decades (and even centuries) of research and empirical applications. On the other hand, if we trust a statistical model for X to be adequate for our problem of interest, we should find the corresponding information also in the significand space. When this is not the case, suspicion about the data should arise. Such a dual role of the statistician, looking both at the observed value of X and at its digits, is also envisaged by Kossovsky (2015, p. 3–4) as a herald of fruitful statistical investigations. We believe that the results provided in this section could be helpful in this direction as well.

The rationale behind our selection of distributions stems from the well-known popularity of these probability laws. The choice was also made in order to cover a wide range of values of the skewness and kurtosis indexes. Moreover, despite their central importance, we are not aware of existing results on the structure of the coefficients \((|\varphi _Y(k)|)_{k\ge 1}\) for the normal and positive stable distributions, as well as for mixtures of laws.

3.1 Normal law

We start our journey under the Benford umbrella from the ubiquitous normal law.

Proposition 1

If the random variable X follows the normal law \(N(\mu ,\sigma ^2)\), then

$$\begin{aligned} \varphi _Y(t)=\frac{1}{\sqrt{\pi }}\,(2\sigma ^2)^{\pi \textrm{i}Ct}\,\Gamma \left( \frac{1}{2}+\pi \textrm{i}Ct\right) M\left( -\pi \textrm{i}Ct,\frac{1}{2},-\frac{\mu ^2}{2\sigma ^2}\right) , \quad t\in \mathbb {R}, \end{aligned}$$

while

$$\begin{aligned} |\varphi _Y(t)|=\frac{1}{\sqrt{\cosh (\pi ^2Ct)}}\,\left|M\left( -\pi \textrm{i}Ct,\frac{1}{2},-\frac{\mu ^2}{2\sigma ^2}\right) \right|, \end{aligned}$$

where \(M(a,b,z)\) is the Kummer confluent hypergeometric function and \(a,b,z\in \mathbb {C}\).

Proof

See the Appendix in the Supplementary Materials. \(\square \)

Notably, simple expressions can be derived from Proposition 1 when \(\mu =0\), since \(M(a,b,0)=1\) in that case. It is also apparent that the coefficients \((|\varphi _Y(k)|)_{k\ge 1}\) are solely functions of the ratio \(\rho =|\mu |/\sigma \). Numerical evaluation of these coefficients is provided in Table 2 for \(k\le 5\) and selected values of \(\rho \). We can see that the coefficients increase (and approach one) for a fixed k as \(\rho \) increases, while they decrease (and approach zero) for a fixed \(\rho \) as k increases. In addition, the Kolmogorov distance \(\varDelta _N\) is reported in Table 3 for \(N \le 5\) and some pairs \((\mu ,\sigma )\). These distances fluctuate for a fixed \(\sigma \) as \(\mu \) varies, while they tend to decrease as \(\sigma \) increases for a fixed \(\mu \). They also exhibit a periodic behavior with respect to the scale parameter \(\sigma \) for a fixed \(\rho \), in agreement with (8). Finally, from the joint analysis of Tables 2 and 3, it can be seen that a good approximation to \(F_{s(X)}\) is obtained by truncating the series in (5) at the first term.
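
For readers wishing to reproduce computations of this kind, the modulus in Proposition 1 can be evaluated with the Kummer function available in the mpmath library and cross-checked against a direct Monte Carlo estimate of (3); the sketch below uses illustrative parameter values of our own choosing.

```python
import numpy as np
import mpmath as mp

C = mp.log10(mp.e)   # C = log10(e), as defined at the end of Sect. 2

def abs_phi_normal(t, mu, sigma):
    """|phi_Y(t)| for X ~ N(mu, sigma^2), as given in Proposition 1."""
    kummer = mp.hyp1f1(-1j * mp.pi * C * t, 0.5, -mu**2 / (2 * sigma**2))
    return float(abs(kummer) / mp.sqrt(mp.cosh(mp.pi**2 * C * t)))

def abs_phi_monte_carlo(t, sample):
    """Direct estimate of |E[exp(2*pi*i*t*log10|X|)]| from a sample of X, cf. (3)."""
    y = 2.0 * np.pi * np.log10(np.abs(sample))
    return np.abs(np.mean(np.exp(1j * t * y)))

rng = np.random.default_rng(1)
mu, sigma = 2.0, 1.0                      # illustrative parameters (rho = 2)
x = rng.normal(mu, sigma, size=500_000)
for k in (1, 2, 3):
    print(k, abs_phi_normal(k, mu, sigma), abs_phi_monte_carlo(k, x))
```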

Table 2 Coefficients \(|\varphi _Y(k)|\) for \(k \le 5\) and selected parameters of one-population laws
Table 3 Kolmogorov distance \(\varDelta _N\) for the laws of Table 2 and \(N\le 5\)

3.2 Gamma law

Another long-standing distribution with fruitful applications in many domains, including survival analysis, actuarial science and economics, is the Gamma law.

Proposition 2

If the random variable X follows the Gamma law \(G(\alpha )\), where \(\alpha \) is the shape parameter, then

$$\begin{aligned} \varphi _Y(t)=\frac{\Gamma (\alpha +2\pi \textrm{i}Ct)}{\Gamma (\alpha )}, \quad t\in \mathbb {R}, \end{aligned}$$

while

$$\begin{aligned} |\varphi _Y(t)|=\prod _{n=0}^\infty \frac{1}{\sqrt{1+\frac{4\pi ^2C^2t^2}{(\alpha +n)^2}}}. \end{aligned}$$

Proof

See the Appendix in the Supplementary Materials. \(\square \)

Some simplifications occur in special cases. As an example, since for \(b\in \mathbb {R}\) it holds

$$\begin{aligned} |\Gamma (1+\textrm{i}b)|=\sqrt{\frac{\pi b}{\sinh (\pi b)}}, \end{aligned}$$
(9)

for the Exponential distribution (\(\alpha =1\)) we have

$$\begin{aligned} |\varphi _Y(t)|=\sqrt{\frac{2\pi ^2Ct}{\sinh (2\pi ^2Ct)}}, \end{aligned}$$

on the basis of Olver et al. (2010, Formula 5.4.3). Similar closed-form expressions can be obtained for \(\alpha =n+\frac{1}{2}\) and \(\alpha =n+1\), with \(n\in \mathbb {N}\), by considering the properties of the Gamma function.

Numerical computation of the coefficients \((|\varphi _Y(k)|)_{k\ge 1}\) for \(k\le 5\) is again provided in Table 2 for selected values of \(\alpha \). For a fixed k, the coefficients are increasing (and approach one) as \(\alpha \) increases, while they are decreasing (and approach zero) as k increases. Table 3 displays the Kolmogorov distances \(\varDelta _N\), which consistently increase with \(\alpha \). Finally, it is seen that also for the Gamma law a good approximation to \(F_{s(X)}\) can be achieved by truncating the series in (5) at the first term.
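
As above, a short sketch shows how the modulus in Proposition 2 can be evaluated through the complex log-gamma function, together with the closed form for the Exponential case; this is an illustration of the formulas, not the code used to produce Tables 2 and 3.

```python
import numpy as np
from scipy.special import loggamma

C = np.log10(np.e)

def abs_phi_gamma(t, alpha):
    """|phi_Y(t)| = |Gamma(alpha + 2*pi*i*C*t)| / Gamma(alpha), from Proposition 2."""
    z = alpha + 2j * np.pi * C * t
    return np.exp(np.real(loggamma(z)) - np.real(loggamma(alpha + 0j)))

def abs_phi_exponential(t):
    """Closed form for alpha = 1, using |Gamma(1 + i*b)| = sqrt(pi*b / sinh(pi*b))."""
    b = 2.0 * np.pi**2 * C * t
    return np.sqrt(b / np.sinh(b))

# The two expressions agree for alpha = 1 (Exponential law):
print(abs_phi_gamma(1.0, 1.0), abs_phi_exponential(1.0))
```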

3.3 Beta law

The Beta law is a classical model for random variables that have a finite range.

Proposition 3

If the random variable X follows the Beta law \(Be(\alpha ,\beta )\), where \(\alpha \) and \(\beta \) are the shape parameters, then

$$\begin{aligned} \varphi _Y(t)=\frac{\,\Gamma (\alpha +\beta )\Gamma (\alpha +2\pi \textrm{i}Ct)}{\Gamma (\alpha )\Gamma (\alpha +\beta +2\pi \textrm{i}Ct)}, \quad t\in \mathbb {R}, \end{aligned}$$

while

$$\begin{aligned} |\varphi _Y(t)|=\prod _{n=0}^\infty \sqrt{\frac{1+\frac{4\pi ^2C^2t^2}{(\alpha +\beta +n)^2}}{1+\frac{4\pi ^2C^2t^2}{(\alpha +n)^2}}}. \end{aligned}$$

Proof

See the Appendix in the Supplementary Materials. \(\square \)

Similarly to the Gamma law, the expression of \(|\varphi _Y(t)|\) can be simplified in some special cases. For the uniform distribution it reduces to

$$\begin{aligned} |\varphi _Y(t)|=\sqrt{\frac{1}{1+4\pi ^2C^2t^2}}. \end{aligned}$$

If \(\alpha =\beta =\frac{1}{2}\), i.e., when the arcsine distribution is considered, we obtain

$$\begin{aligned} |\varphi _Y(t)|=\sqrt{\frac{\tanh (2\pi ^2Ct)}{2\pi ^2Ct}}, \end{aligned}$$

since for \(b\in \mathbb {R}\) it holds

$$\begin{aligned} \left|\frac{\Gamma (\frac{1}{2}+\textrm{i}b)}{\Gamma (1+\textrm{i}b)}\right|=\sqrt{\frac{\tanh (\pi b)}{b}}. \end{aligned}$$

Moreover, closed-form expressions for \(|\varphi _Y(t)|\) can be derived when \(\alpha \) and \(\beta \) take the values \(n+\frac{1}{2}\) or \(n+1\), with \(n\in \mathbb {N}\).

We give the coefficients \((|\varphi _Y(k)|)_{k\ge 1}\) for \(k\le 5\) in Table 2, and the Kolmogorov distances \(\varDelta _N\) in Table 3, for selected pairs \((\alpha ,\beta )\). For a fixed k, the coefficients are increasing (and approach one) as \(\alpha \) increases for a fixed \(\beta \), they are decreasing (and approach zero) as \(\beta \) increases for a fixed \(\alpha \), while they are decreasing (and approach zero) as k increases. For the Beta model too, a good approximation to \(F_{s(X)}\) is reached by truncating the series in (5) at the first term, or at most at the second.

3.4 Positive stable law

We consider the positive stable law with Laplace transform \(L_X(t)=e^{-t^\alpha }\), for \(\Re (t)\in \mathbb {R}^+\), as a representative of the class of skewed and heavy-tailed laws.

Proposition 4

If the random variable X follows the positive stable law \(PS(\alpha )\), where \(\alpha \) is the tail parameter, then

$$\begin{aligned} \varphi _Y(t)=\frac{\Gamma \left( 1-\frac{2\pi \textrm{i}Ct}{\alpha }\right) }{\Gamma (1-2\pi \textrm{i}Ct)}, \quad t\in \mathbb {R}, \end{aligned}$$

while

$$\begin{aligned} |\varphi _Y(t)|=\sqrt{\frac{\sinh (2\pi ^2Ct)}{\alpha \sinh \left( \frac{2\pi ^2Ct}{\alpha }\right) }}. \end{aligned}$$

Proof

See the Appendix in the Supplementary Materials. \(\square \)

The coefficients \((|\varphi _Y(k)|)_{k\ge 1}\) for \(k\le 5\) are displayed in Table 2 for selected values of \(\alpha \). For a fixed k, they increase with \(\alpha \), while they decrease (and approach zero) as k increases. Table 3 reports the values of the Kolmogorov distances \(\varDelta _N\), which steadily increase with \(\alpha \). Finally, also in the present model a suitable approximation to \(F_{s(X)}\) is achieved by truncating the series in (5) at the first two terms.

3.5 Mixtures

A more complicated situation of interest arises when X is distributed as a mixture of L laws with normalized weights \((w_1,\ldots ,w_L)\), so that its probability density is \(f_X = \sum _{l=1}^L w_l f_{Z_l}\), where \(f_{Z_l}\) is the density of \(Z_l\). Mixture models have gained huge popularity, since they can be seen as the building blocks of both model-based clustering and outlier-contamination models (see, e.g., Farcomeni and Punzo 2020; Hennig 2022; Ingrassia et al. 2022). In the case of a mixture, if \(V_l = 2\pi \log _{10}|Z_l|\) and \(\varphi _{V_l}\) is the characteristic function of \(V_l\), the characteristic function of \(Y=2\pi \log _{10}|X|\) is given by

$$\begin{aligned} \varphi _Y(t)=\sum _{l=1}^Lw_l\varphi _{V_l}(t). \end{aligned}$$
(10)

It is difficult to envisage the effect that the components of the convex combination exert on the distribution of s(X), as simple relationships between \(|\varphi _Y(k)|\) and \(|\varphi _{V_1}(k)|, \ldots , |\varphi _{V_L}(k)|\) do not exist. Result (10) thus anticipates a general lack of predictability in the power results under a mixture model for X. Nevertheless, an amusing specialization of (10) occurs if we assume a location-scale family for the distribution of \(Z_l\) such that the ratio of the location parameter to the scale parameter is given by the fixed constant \(\rho =\frac{\mu }{\sigma }\).

Proposition 5

Let \(f_Z\) be the probability density of a standard random variable Z, and let \((Z_1,\ldots ,Z_L)\) be absolutely continuous random variables with probability densities

$$\begin{aligned} f_{Z_l}(x)=\frac{1}{k_l\sigma }f_Z\left( \frac{x-k_l\mu }{k_l\sigma }\right) , \quad l=1,\ldots ,L, \end{aligned}$$

where \(\mu \in \mathbb {R}\), \(\sigma \in \mathbb {R}^{+}\) and \(k_l>0\). If X is a mixture with probability density \(f_X=\sum _{l=1}^Lw_lf_{Z_l}\), then

$$\begin{aligned} \varphi _Y(t)=\varphi _{Y_1}(t)\varphi _{Y_2}(t), \quad t\in \mathbb {R}, \end{aligned}$$

where \(Y_1=2\pi \log _{10}|X_1|\) and \(Y_2=2\pi \log _{10}|X_2|\), while \(X_1\) and \(X_2\) are independent random variables such that \(X_1\) is discrete with \(P(X_1=k_l)=w_l\), for \(l=1,\ldots ,L\), and \(X_2\overset{\mathcal {L}}{=}\sigma Z+\mu \).

Proof

See the Appendix in the Supplementary Materials. \(\square \)

We also see from Proposition 5 that the stochastic representation

$$\begin{aligned} Y\overset{\mathcal {L}}{=}Y_1+Y_2 \end{aligned}$$

holds irrespective of the number of mixture components.
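
In practice, the coefficients of a mixture can always be approximated by Monte Carlo, since (10) is just a weighted average of the component characteristic functions. The sketch below (with illustrative parameters of our own choosing) estimates \(|\varphi _Y(k)|\) for a two-component normal mixture both directly from a simulated mixture sample and via (10).

```python
import numpy as np

rng = np.random.default_rng(2)

def phi_hat(t, sample):
    """Monte Carlo estimate of phi_Y(t) = E[exp(2*pi*i*t*log10|X|)]."""
    return np.mean(np.exp(2j * np.pi * t * np.log10(np.abs(sample))))

# Illustrative two-component normal mixture with equal weights.
w = np.array([0.5, 0.5])
mu = np.array([1.0, 4.0])
sigma = np.array([0.5, 1.0])

n = 400_000
comp = rng.choice(2, size=n, p=w)                           # mixture labels
x = rng.normal(mu[comp], sigma[comp])                       # sample from the mixture
z = [rng.normal(mu[l], sigma[l], size=n) for l in (0, 1)]   # component samples

for k in (1, 2, 3):
    direct = np.abs(phi_hat(k, x))
    via_10 = np.abs(sum(w[l] * phi_hat(k, z[l]) for l in (0, 1)))   # equation (10)
    print(k, round(direct, 4), round(via_10, 4))
```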

For concreteness, we show the results obtained from Proposition 5 in the case of normal mixtures. Numerical computation of the coefficients \((|\varphi _Y(k)|)_{k\ge 1}\) for \(k \le 5\) and selected pairs \((\mu _1,\sigma _1)\) and \((\mu _2,\sigma _2)\) is provided in Table 4. The chosen parameters are such that the means belong to [0.5, 3.5] and the standard deviations are in [1.5, 3.5], but they are representative of a much wider range of behaviors. Their values are reported below, ordered to show increasing departures from the uniform distribution in terms of \(\varDelta _\infty \):

(a) \(\mu _1=1.0,\mu _2=3.5,\sigma _1=0.5,\sigma _2=1.5\,(\varDelta _\infty =0.0372)\)

(b) \(\mu _1=1.5,\mu _2=5.0,\sigma _1=0.5,\sigma _2=1.5\,(\varDelta _\infty =0.0565)\)

(c) \(\mu _1=2.5,\mu _2=7.5,\sigma _1=1.0,\sigma _2=3.0\,(\varDelta _\infty =0.0680)\)

(d) \(\mu _1=1.5,\mu _2=5.0,\sigma _1=0.5,\sigma _2=2.0\,(\varDelta _\infty =0.0735)\)

(e) \(\mu _1=1.5,\mu _2=4.5,\sigma _1=0.5,\sigma _2=2.0\,(\varDelta _\infty =0.0871)\)

(f) \(\mu _1=1.0,\mu _2=4.0,\sigma _1=0.5,\sigma _2=1.0\,(\varDelta _\infty =0.1191)\)

For a given parameter combination, the coefficients tend to decrease with k, although not monotonically in this case. Table 5 reports the corresponding values of the Kolmogorov distances \(\varDelta _N\) for \(N\le 5\).

Table 4 As Table 2, but now for the normal mixtures (a)–(f)
Table 5 As Table 3, but now for the normal mixtures (a)–(f)

4 Testing the Benford hypothesis

4.1 Statement of the problem

Working with representation (2), a straightforward translation of the Benford hypothesis into a null hypothesis to be tested is

$$\begin{aligned} H_0:s(X)\overset{\mathcal {L}}{=}U, \end{aligned}$$
(11)

where U is a uniform random variable on [0, 1[. Given a random sample of n observations from X, say \((X_1,\ldots ,X_n)\), several statistics already exist for testing (11). The simplest and possibly most popular one in applications is the first-digit Pearson statistic

$$\begin{aligned} \chi ^2=\sum _{d=1}^9\frac{(n\widehat{p}_d-np_d)^2}{np_d}, \end{aligned}$$
(12)

where, for \(d \in \{1,2,\ldots ,9\}\),

$$\begin{aligned} p_d=\log _{10}\left( \frac{d+1}{d}\right) \end{aligned}$$
(13)

is the theoretical first-digit probability for digit d when \(H_0\) is true, while

$$\begin{aligned} \widehat{p}_d=\frac{1}{n}\,\sum _{i=1}^n\varvec{1}_{\{d\}}(D_1(X_i)) \end{aligned}$$

is its sample estimate, based on the first significant digit function \(D_1(x)=\lfloor S(x)\rfloor \), and \(\varvec{1}_B\) denotes the indicator function of the set B.

Strictly speaking, \(\chi ^2\) would test the hypothesis that the first-digit distribution of X conforms to (13), which is a weaker hypothesis than the Benford one. Indeed, the Benford hypothesis (11) on the significand implies that the first digit is Benford, i.e., that expression (13) holds, whereas (13) does not imply that the significand is Benford (see, e.g., Berger and Hill 2015). We refer to Barabesi et al. (2022) for a detailed study of the relationships between (11), (13) and another amazing feature of Benford’s law known as the (first-digit) sum-invariance property. However, since disagreement with (13) implies rejection of (11), \(\chi ^2\) is also a legitimate test of the Benford hypothesis, provided that its null distribution is derived under \(H_0\).

A popular competitor of \(\chi ^2\) is the mean absolute deviation between first-digit proportions and Benford’s probabilities

$$\begin{aligned} M = \frac{1}{9}\sum _{d=1}^9 |\widehat{p}_d-p_d|, \end{aligned}$$

which is usually called the MAD. A more or less formal use of this statistic is often advocated for anti-fraud purposes in financial applications (Nigrini 2012; Barney and Schulzke 2016), perhaps after extension to the second digit, on the grounds that this statistic gives the same weight to all the digits under consideration. A test statistic specifically tailored to (11) is instead the Kolmogorov–Smirnov (two-sided) statistic defined as

$$\begin{aligned} K = \sup _{t\in [1,10[} |\hat{F}_{S(X)}(t) - \log _{10}t |, \end{aligned}$$
(14)

where \(\hat{F}_{S(X)}\) is the empirical distribution function of \((S(X_1),\ldots ,S(X_n))\).

The test statistics considered above, as well as other related test statistics, suffer from two main drawbacks. First, most of them are derived under specific and non-equivalent properties of Benford’s law, such as the first-digit behavior (13) which underlies both \(\chi ^2\) and M, or the sum-invariance property which motivates the Hotelling-type statistic described in the Supplementary Materials. Therefore, as such, they can show excellent performance under certain circumstances but not in general, as shown by Barabesi et al. (2022). Second, as anticipated in Sect. 1, they are not scale invariant. A scale invariant version of K is the Kuiper statistic which is computed as

$$\begin{aligned} V = \left\{ \max _{1\le i\le n}\left( \frac{i}{n} - \log _{10}S_{(i)}\right) + \max _{1\le i\le n}\left( \log _{10}S_{(i)}-\frac{i-1}{n}\right) \right\} , \end{aligned}$$
(15)

where \(S_{(1)}< \cdots < S_{(n)}\) are the order statistics of \((S(X_1),\ldots ,S(X_n))\). However, a serious drawback of V is its loss of power under multimodal alternatives (Pycke 2010).
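
For reference, the four statistics of this subsection can be computed from their definitions as in the following sketch, which is a plain implementation of (12), the MAD, (14) and (15) for illustrative purposes; the Kolmogorov–Smirnov and Kuiper statistics are computed on the uniform transforms \(\log _{10}S(X_i)=s(X_i)\).

```python
import numpy as np

BENFORD_P = np.log10(1.0 + 1.0 / np.arange(1, 10))      # p_d in (13), d = 1, ..., 9

def first_digit(x):
    """First significant digit D_1(x) = floor(S(x))."""
    return np.floor(10.0 ** (np.log10(np.abs(x)) % 1.0)).astype(int)

def chi2_stat(x):
    """First-digit Pearson statistic (12)."""
    n = len(x)
    counts = np.bincount(first_digit(x), minlength=10)[1:10]
    return np.sum((counts - n * BENFORD_P) ** 2 / (n * BENFORD_P))

def mad_stat(x):
    """Mean absolute deviation M of first-digit proportions from Benford's probabilities."""
    counts = np.bincount(first_digit(x), minlength=10)[1:10]
    return np.mean(np.abs(counts / len(x) - BENFORD_P))

def ks_and_kuiper(x):
    """Kolmogorov-Smirnov (14) and Kuiper (15) statistics on the log-significands."""
    n = len(x)
    u = np.sort(np.log10(np.abs(x)) % 1.0)              # sorted s(X_i) = log10 S(X_i)
    d_plus = np.max(np.arange(1, n + 1) / n - u)
    d_minus = np.max(u - np.arange(0, n) / n)
    return max(d_plus, d_minus), d_plus + d_minus

rng = np.random.default_rng(3)
x = 10.0 ** rng.uniform(size=200)                        # Benford sample, cf. (19)
print(chi2_stat(x), mad_stat(x), ks_and_kuiper(x))
```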

4.2 Rayleigh-type tests

Examination of (6) inspires the simple strategy of assessing whether a monotone function of the coefficients \((|\varphi _Y(k)|)_{k\ge 1}\) is non-null. An off-the-shelf option for testing \(H_0\) is the well-known Rayleigh statistic for directional data, which in our problem takes the form \(Z_{1,n}^2 =2n |\widehat{\varphi }_Y(1)|^2\), where \(\widehat{\varphi }_Y(k)=\bar{C}_{k,n}+\textrm{i}\bar{S}_{k,n}\), with

$$\begin{aligned} \bar{C}_{k,n}=\frac{1}{n}\sum _{i=1}^n\cos (2\pi ks(X_i)), \quad \bar{S}_{k,n}=\frac{1}{n}\sum _{i=1}^n\sin (2\pi ks(X_i)). \end{aligned}$$

Unfortunately, this test is unable to reject alternatives that are antipodally symmetric, a property shared by the uniform shape and by several multimodal structures. It can thus be safely used only against unimodal alternatives. For a given N, Buccheri and De Jager (1989) introduced the statistic

$$\begin{aligned} Z_{N,n}^2=2n\,\sum _{k=1}^N(\bar{C}_{k,n}^2+\bar{S}_{k,n}^2), \end{aligned}$$

which is effective against more general alternative hypotheses. Bogdan et al. (2002) investigated this test statistic in a parametric framework for directional data and showed that it is asymptotically optimal for testing uniformity within the family of circular exponential distributions. Data-driven criteria for the selection of N are considered in Sect. 4.4. Before then, we provide a firm foundation for \(Z^2_{N,n}\) by showing that it is the score statistic for testing (11) with a fixed N.
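
In practice \(Z_{N,n}^2\) only requires the sample trigonometric moments, as in the following minimal sketch (function names are ours).

```python
import numpy as np

def z_squared(x, N):
    """Rayleigh-type statistic Z^2_{N,n} = 2n * sum_{k=1}^N (Cbar_k^2 + Sbar_k^2)."""
    s = np.log10(np.abs(x)) % 1.0                 # log-significands s(X_i)
    k = np.arange(1, N + 1)[:, None]
    cbar = np.mean(np.cos(2 * np.pi * k * s), axis=1)
    sbar = np.mean(np.sin(2 * np.pi * k * s), axis=1)
    return 2 * len(x) * np.sum(cbar**2 + sbar**2)

rng = np.random.default_rng(4)
x = 10.0 ** rng.uniform(size=200)                 # Benford sample
print(z_squared(x, 1), z_squared(x, 2))           # asymptotically chi^2_2 and chi^2_4 under H_0
```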

4.3 Likelihood ratio test of the Benford hypothesis

As we have seen in Sect. 3, \(G_N\) proves to be a suitable approximation of (5) even for small N. Thus, a general model for the probability density function of s(X) may be obtained by differentiating (6). We denote this function by \(g=G_N'\), where the subscript N is suppressed for simplicity of notation. Reparametrizing, we thus have

$$\begin{aligned} g(u)=1+2\,\sum _{k=1}^N(a_k\cos (2\pi ku)+b_k\sin (2\pi ku)),\quad u\in [0,1[, \end{aligned}$$
(16)

where \(a_k=|\varphi _Y(k)|\cos (\vartheta _k)\) and \(b_k=|\varphi _Y(k)|\sin (\vartheta _k)\), for \(k=1,2,\ldots ,N\).

In order to achieve a nonnegative trigonometric polynomial in (16), i.e., a bona fide probability density function, the Fejér–Riesz condition must hold. More precisely, expression (16) is nonnegative for each \(u\in [0,1[\) if there exists a vector of order \((N+1)\) of complex numbers, say \(c=(c_0,c_1,\ldots ,c_N)^{\text {T}}\), such that for \(k=1,\ldots ,N\)

$$\begin{aligned} a_k-\textrm{i}b_k=\sum _{j=0}^{N-k}c_{j+k}c_j^*, \end{aligned}$$

with \(\sum _{k=0}^N|c_k|^2=1\) and where \(c_k^*\) denotes the complex conjugate of \(c_k\). Model (16) corresponds to a large family of distributions first introduced by Fernández-Durán (2004) in the framework of directional statistics. On the basis of the Fejér–Riesz Theorem (Grenander and Szegö 1984, Section 1.12), g(u) can be written in a different parametrization as

$$\begin{aligned} g(u)=c^*T_u c, \quad u\in [0,1[, \end{aligned}$$
(17)

where \(c^*\) is the conjugate transpose of vector c, while \(T_u=(t_{k,j,u})\) is a Toeplitz matrix of order \((N+1)\) such that \(t_{k,j,u}=e^{2\pi \textrm{i}(k-j)u}\). The parameter space is assumed to be \(\mathcal {C}=\{c\in \mathbb {C}^{N+1}:c_0\in \mathbb {R},c^*c=1\}\). Hence, there are \((2N+2)\) parameters in model (17), although only 2N of them are free owing to the two constraints. Under model (17), the Benford hypothesis becomes equivalent to

$$\begin{aligned} H_0:c_k=0, \quad k=1,\ldots ,N, \end{aligned}$$
(18)

for a given N. The likelihood ratio test statistic for assessing (18) is

$$\begin{aligned} R_{N,n}= \prod _{i=1}^n(\widehat{c}\,^*T_{s(X_i)}\widehat{c}\,)^{-1}, \end{aligned}$$

where \(\widehat{c}\) is the maximum likelihood estimator of c, which can be obtained by means of the algorithm proposed by Fernández-Durán and Gregorio-Domínguez (2010). In addition, the null finite-sample distribution of \(R_{N,n}\) can be approximated through a simple but computationally efficient Monte Carlo algorithm, by bearing in mind that the random variable

$$\begin{aligned} X = 10^U \end{aligned}$$
(19)

satisfies the Benford property (1) and that \(S(10^U)=10^U\). The usual null large-sample approximation \(\varLambda _{N,n}=-2\log R_{N,n}\overset{\mathcal {L}}{\rightarrow }\chi _{2N}^2\) is also available as \(n\rightarrow \infty \).
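
A sketch of the density evaluation implied by (17) and of the corresponding log-likelihood is reported below; the maximization over \(\mathcal {C}\) yielding \(\widehat{c}\), and hence \(R_{N,n}\), is the task of the algorithm of Fernández-Durán and Gregorio-Domínguez (2010) and is not reproduced here.

```python
import numpy as np

def g_density(u, c):
    """Evaluate g(u) = c* T_u c as in (17), with T_u[k, j] = exp(2*pi*i*(k-j)*u)."""
    c = np.asarray(c, dtype=complex)
    c = c / np.sqrt(np.sum(np.abs(c) ** 2))          # enforce the constraint c*c = 1
    idx = np.arange(len(c))
    diff = idx[:, None] - idx[None, :]               # the exponents (k - j)
    vals = []
    for ui in np.atleast_1d(u):
        T = np.exp(2j * np.pi * diff * ui)
        vals.append(np.real(np.conj(c) @ T @ c))     # the quadratic form is real
    return np.array(vals)

def log_likelihood(x, c):
    """Log-likelihood of the sample under model (17), evaluated at a given c."""
    s = np.log10(np.abs(x)) % 1.0
    return np.sum(np.log(g_density(s, c)))

# With c = (1, 0, ..., 0) the density is uniform, i.e., the null (18) holds:
u_grid = np.linspace(0.0, 1.0, 5, endpoint=False)
print(g_density(u_grid, [1.0, 0.0, 0.0]))            # all ones
print(g_density(u_grid, [0.8, 0.5 + 0.2j, 0.1]))     # a hypothetical non-uniform member
print(log_likelihood(10.0 ** np.random.default_rng(0).uniform(size=50), [0.9, 0.3, 0.3j]))
```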

The following proposition states the large-sample equivalence of \(\varLambda _{N,n}\) and \(Z_{N,n}^2\). Its proof proceeds by showing that \(Z_{N,n}^2\) is the score test statistic for assessing \(H_0\) under model (17). Therefore, \(\varLambda _{N,n}\) and \(Z_{N,n}^2\) share the same optimal properties in the large-sample setting.

Proposition 6

Under \(H_0\), \(\varLambda _{N,n}=Z_{N,n}^2+o_P(1)\).

Proof

See the Appendix in the Supplementary Materials. \(\square \)

4.4 Estimating the number of trigonometric components

Proposition 6 allows us to extend the asymptotic properties of \(\varLambda _{N,n}\) to \(Z_{N,n}^2\), and vice versa. Model selection principles have been mainly addressed from the point of view of the score test. If a simplified version of the BIC criterion is used (see Kallenberg and Ledwina 1995), N is chosen as the smallest integer for which

$$\begin{aligned} \psi (N)=Z_{N,n}^2-2N\log (n) \end{aligned}$$
(20)

is maximum. Therefore, the estimated number of trigonometric components in a data-driven version of \(Z_{N,n}^2\) is

$$\begin{aligned} \widehat{N}=\inf \{N\in \mathbb {N}:\psi (N)\ge \psi (M),M\in \mathbb {N}\}. \end{aligned}$$
(21)

It follows from the results of Bogdan et al. (2002) that \(P(\widehat{N}=\infty )=0\) and that, under (18), \(\widehat{N}\overset{P}{\rightarrow }1\) as \(n\rightarrow \infty \), while the test based on \(Z_{\widehat{N},n}^2\) is consistent against every alternative to uniformity. In addition, \(Z_{\widehat{N},n}^2\overset{\mathcal {L}}{\rightarrow }\chi _2^2\) as \(n\rightarrow \infty \), although convergence has been shown to be slow by Bogdan et al. (2002). In Sect. 5 we thus obtain Monte Carlo estimates of the null critical values of \(Z_{\widehat{N},n}^2\) through relationship (19).

The asymptotic equivalence of the two tests shown in Proposition 6 ensures that the results on data-driven selection of N for \(Z_{N,n}^2\) hold for \(\varLambda _{N,n}\) as well. As a consequence, selection rule (21) is still valid when \(\varLambda _{\widehat{N},n}\) is the test statistic of choice if we take

$$\begin{aligned} \psi (N)=\varLambda _{N,n}-2N\log (n) \end{aligned}$$
(22)

as our objective function. In this case, (22) corresponds to the proper BIC criterion.
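
The selection rule (21) is easily implemented once \(Z_{N,n}^2\) is available for a range of values of N; the sketch below scans N up to a practical cap of our own choosing.

```python
import numpy as np

def z_squared_path(x, N_max):
    """Z^2_{N,n} for N = 1, ..., N_max, from the empirical coefficients |phi_hat_Y(k)|."""
    s = np.log10(np.abs(x)) % 1.0
    k = np.arange(1, N_max + 1)[:, None]
    phi_hat = np.mean(np.exp(2j * np.pi * k * s), axis=1)
    return 2 * len(x) * np.cumsum(np.abs(phi_hat) ** 2)

def select_N(x, N_max=10):
    """Smallest maximizer of psi(N) in (20) over 1 <= N <= N_max, cf. (21)."""
    n = len(x)
    psi = z_squared_path(x, N_max) - 2 * np.arange(1, N_max + 1) * np.log(n)
    return int(np.argmax(psi)) + 1           # argmax picks the first, i.e. smallest, maximizer

rng = np.random.default_rng(5)
x = 10.0 ** rng.uniform(size=500)            # Benford sample: N_hat is typically 1
N_hat = select_N(x)
print(N_hat, z_squared_path(x, N_hat)[-1])   # selected N and the corresponding Z^2_{N_hat, n}
```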

4.5 Scale invariance

A desirable property of a test statistic \(T=T(X_1,\ldots ,X_n)\) is its invariance under scale transformation of the data, which requires that

$$\begin{aligned} P(T(X_1,\ldots ,X_n)=T(\sigma X_1,\ldots ,\sigma X_n))=1, \end{aligned}$$

for \(\sigma \in \mathbb {R}^+\). Given a conversion from parsecs to light-years (as in the application of Sect. 6), or from decimeters to feet (as in the application of Sect. 7), while the observation values may change, any inferential statement about the Benford hypothesis should remain unaffected. However, most of the available test statistics are based on functions of the first significant digits and do not share this property.

It is well known that significant digits are scale invariant if and only if the underlying random variable X is Benford (Berger and Hill 2015, Section 5.1). But even when the Benford hypothesis holds, if we consider for instance the first-digit chi-square statistic \(\chi ^2=\chi ^2(X_1,\ldots ,X_n)\), it only follows that \(\chi ^2(X_1,\ldots ,X_n)\overset{\mathcal {L}}{=}\chi ^2(\sigma X_1,\ldots ,\sigma X_n)\). On the other hand, both the score and the likelihood ratio tests of the Benford hypothesis satisfy the following:

Proposition 7

The statistics \(Z_{N,n}^2\) and \(R_{N,n}\) are scale invariant.

Proof

See the Appendix in the Supplementary Materials. \(\square \)
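
The invariance stated in Proposition 7 can also be verified numerically: rescaling the data shifts every \(s(X_i)\) by the same constant modulo 1, so the empirical moduli \(|\widehat{\varphi }_Y(k)|\), and hence \(Z_{N,n}^2\), are unchanged up to floating-point error. A minimal check with arbitrary illustrative data follows.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # any positive data set
sigma_scale = 373.0                                # arbitrary rescaling factor

def abs_phi_hat(x, k):
    """|phi_hat_Y(k)|, the building block of Z^2_{N,n} and of the trigonometric model."""
    s = np.log10(np.abs(x)) % 1.0
    return np.abs(np.mean(np.exp(2j * np.pi * k * s)))

# The two columns below coincide up to floating-point error:
for k in (1, 2, 3):
    print(k, abs_phi_hat(x, k), abs_phi_hat(sigma_scale * x, k))
```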

5 Simulation experiments

In this section we provide simulation evidence of the empirical advantages of the tests derived in Sect. 4. We compare the performance of the score and likelihood ratio tests, both with \(N \in \{1,2\}\) and with N selected from the data, to those of \(\chi ^2\) and M, taken as representatives of first-digit tests, the Kolmogorov–Smirnov test (14), perhaps the most obvious choice for assessing the uniformity hypothesis (11), and the scale-invariant Kuiper test (15). For simplicity we relegate to the Supplementary Materials the results for two other tests that we have taken into account in our experiments, the first-digit Hotelling-type test of Barabesi et al. (2022) and the two-digit version of \(\chi ^2\); we anticipate that the comparison with the score and likelihood ratio tests yields similar qualitative findings. We only display power results for \(n=100\), while those for different sample sizes are reported in the Supplementary Materials, together with further details about the simulation algorithm.

5.1 Exact null distributions

In our first simulation, B Monte Carlo replicates of the test statistic \(T=T(X_1,\ldots ,X_n)\) are generated under the Benford hypothesis as

$$\begin{aligned} T^\dagger _b = T(10^{U_{b,1}},\ldots ,10^{U_{b,n}}), \quad b = 1,\ldots ,B, \end{aligned}$$
(23)

where \(U_{b,1},\ldots ,U_{b,n}\) are independent uniform random variables on [0, 1[. For a realization t of T, the exact p-values are computed as \(p^\dagger _T(t) = 1 - F^\dagger _T(t)\), where \(F^\dagger _T(t) = \frac{1}{B}\sum _{b=1}^B \varvec{1}_{]-\infty ,t]}(T^\dagger _b)\). In Table 6 we show the estimated \((1-\gamma )\)th quantile

$$\begin{aligned} t^\dagger _{T,1-\gamma } =\inf \{t \in \mathbb {R}: F^\dagger _T(t) \ge 1-\gamma \} \end{aligned}$$

for each of the test statistics described in Sects. 4.2–4.4, for \(\gamma =0.05\) and using \(B=10^6\) replicates under the Benford hypothesis. We also compare the estimated exact quantiles to their asymptotic counterparts. It is clearly seen that asymptotic quantiles are already very accurate with \(n=50\) for both \(Z_{N,n}^2\) and \(\varLambda _{N,n}\) if N is fixed in advance, a result that parallels those of Cerioli et al. (2019) and Barabesi et al. (2022) for \(\chi ^2\) and other first-digit statistics. Convergence is instead much slower when N is selected from the data. The estimates \(\widehat{N}\) exhibit non-negligible positive bias in small samples both under (20) and (22), with respect to the true value \(N=1\) under the Benford hypothesis. These issues point to the use of the reported exact quantiles in the power comparisons that follow.
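
The Monte Carlo calibration step (23) is straightforward to code; the sketch below uses \(Z_{1,n}^2\) as the statistic T and a much smaller B than the \(10^6\) replicates employed above, for speed only.

```python
import numpy as np

def exact_null_calibration(stat, n, B=20_000, gamma=0.05, seed=0):
    """Monte Carlo replicates (23) of a statistic under the Benford hypothesis,
    returning an approximate (1 - gamma)th null quantile and an exact p-value function."""
    rng = np.random.default_rng(seed)
    null_reps = np.array([stat(10.0 ** rng.uniform(size=n)) for _ in range(B)])
    quantile = np.quantile(null_reps, 1.0 - gamma)
    p_value = lambda t: np.mean(null_reps > t)       # p(t) = 1 - F(t), as in the text
    return quantile, p_value

# Illustration with T = Z^2_{1,n} (the Rayleigh statistic of Sect. 4.2):
def z2_1(x):
    s = np.log10(np.abs(x)) % 1.0
    return 2 * len(x) * np.abs(np.mean(np.exp(2j * np.pi * s))) ** 2

q, p_value = exact_null_calibration(z2_1, n=100)
print(q)                                             # compare with the chi^2_2 0.95 quantile, about 5.99
print(p_value(z2_1(10.0 ** np.random.default_rng(1).uniform(size=100))))
```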

Table 6 Estimated 0.95th quantiles under the Benford hypothesis for different sample sizes and \(B=10^6\) replicates

5.2 Power comparisons

The power of each test under comparison is computed with respect to the corresponding exact 0.95th quantile estimated through (23), with \(B=10^6\). We first compare the power, based on \(10^4\) replicates, of the various test statistics under the same models for X studied in Sect. 3, to highlight agreement with the computed Kolmogorov distances.

In the one-population setting, Table 7 shows that the score and likelihood ratio tests behave very similarly. They are the most powerful ones, except when X is distributed as Be(1/2, 1), a case which is equivalent to the generalized Benford law of parameter 1/2 (see Table 9 and the related comments below). The best performance is reached with \(N=1\), since the distribution of s(X) is unimodal under all the considered alternatives and the series (5) is well approximated by the first term, in agreement with the results proved in Sect. 3. Nevertheless, the data-driven choice of N performs almost equally well if no prior information is available. The power of the Kuiper test, although generally inferior, is not too far from that of the top performers in such unimodal scenarios, while it often improves over the non-invariant tests.

The advantage of \(Z_{\widehat{N},n}^2\) and \(\varLambda _{\widehat{N},n}\) is instead paramount under the two-component mixtures considered in Table 8. There, estimation of the number of components becomes crucial, since both \(Z_{1,n}^2\) and \(\varLambda _{1,n}\) fail to detect the underlying structure, while the best performers \(Z_{2,n}^2\) and \(\varLambda _{2,n}\) require knowledge that the second term dominates in the series (5) (see Table 4). Not surprisingly, the Kuiper test also shows a major gap under most of these two-population alternatives and the ordering among V, \(\chi ^2\), M and K becomes less clear.

Table 7 Estimated power under the one-population models of Sect. 3, when \(n=100\)
Table 8 As Table 7, but now for the two-component normal mixtures \(0.5N(\mu _1,\sigma ^2_1) + 0.5 N(\mu _2,\sigma ^2_2)\)

Our second power scenario widens the perspective by considering a worst-case instance for the likelihood ratio and score tests. Specifically, we assume that X is a generalized Benford random variable with parameter \(\alpha \). In this case we have

$$\begin{aligned} F_{s(X)}(u)= {\left\{ \begin{array}{ll} u&{}\alpha =0 \\ \frac{10^{\alpha u}-1}{10^\alpha -1} &{}\alpha \ne 0 \end{array}\right. } \end{aligned}$$

for \(u\in [0,1[\) and \(\alpha \in \mathbb {R}\). This model has been introduced to represent digit distributions when X follows a power law (Pietronero et al. 2001; Barabesi and Pratelli 2020) and describes the leading-digit distribution of the sequences of prime numbers and non-trivial Riemann zeta zeroes (Luque and Lacasa 2009). It may also help to represent smooth departures from the Benford property, which is recovered in the case \(\alpha =0\) by continuity.

The new insight provided by this scenario is that it shows the effect of modeling the significand transform s(X) instead of X itself, providing challenging alternatives to the Benford hypothesis. Indeed, the probability density function of s(X) is now monotonically increasing or decreasing on [0, 1[, according to the sign of \(\alpha \). The generalized Benford model is thus particularly unfavorable to the proposed tests, since several terms are needed to approximate the series (5). Table 9 shows empirical powers under this model for some values of \(\alpha \), while Table 10 reports the estimate of \(E[\widehat{N}]\) for both (20) and (22). It is seen that the performances of the score and likelihood ratio tests are very similar, although the non-invariant statistic K turns out to be the most powerful solution. As anticipated, a value \(N>1\) is required to capture the monotone shape of \(f_{s(X)}\), a task which is best accomplished by (21) in large samples (see the Supplementary Materials).
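
Samples with a generalized Benford log-significand are easily generated by inverting \(F_{s(X)}\); the following sketch may be used to reproduce power experiments of this type.

```python
import numpy as np

def generalized_benford_sample(n, alpha, rng):
    """Draw X = 10^s with s distributed as F_{s(X)} above, by inverse-CDF sampling."""
    v = rng.uniform(size=n)
    if alpha == 0.0:
        s = v                                             # Benford case
    else:
        s = np.log10(1.0 + v * (10.0 ** alpha - 1.0)) / alpha
    return 10.0 ** s

rng = np.random.default_rng(7)
x = generalized_benford_sample(100, alpha=0.5, rng=rng)
# First-digit relative frequencies shift away from Benford's probabilities as |alpha| grows:
print(np.bincount(np.floor(x).astype(int), minlength=10)[1:] / len(x))
```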

Table 9 Estimated power when X is a generalized Benford random variable of parameter \(\alpha \), for different values of \(\alpha \) and \(n=100\)
Table 10 Estimate of \(E[\widehat{N}]\) when X is a generalized Benford random variable of parameter \(\alpha \), for different values of \(\alpha \) and \(n=100\)

6 Astronomical distances

6.1 Motivation and data

In our first experiment, we perform two related analyses on the distribution of star distances from Earth, whose original source is the HYG (Hipparcos–Yale–Gliese) database available at https://www.astronexus.com/hyg. The first investigation is concerned with the data introduced by Alexopoulos and Leontsinis (2014), who explored the star-distance agreement with Benford’s law on the basis of the marginal empirical distribution of their first three significant digits. In our second computation, we improve the analysis by considering a cleaner and more complete version of the same distance database. Our interest in the topic is inspired by several reasons. First, the data are at the center of a quite recent debate in astrophysics and conformance to the Benford hypothesis is still an open issue for them, since there is noticeable uncertainty in the conclusions when comparing empirical evidence with theory. Additionally, previous empirical investigations of this kind of data mostly rely on informal diagnostic checks and single-digit assessment through the non-invariant chi-square test (12). A further reason that motivates our interest is that the validity of the Benford hypothesis has been formally predicted by establishing a link with Hubble’s constant for a related problem that involves galaxy distances (Hill and Fox 2016).

Formal statistical inference has not been rigorously conducted on the cited star-distance data in order to validate their fit to Benford’s law. Indeed, de Jong et al. (2020) note that the size of astronomical databases is so large that even small departures from the law would constitute adverse empirical evidence owing to the huge test power. The HYG 2.0 database contains around 115,000 observations, and this size is still very small when compared to the second release of the GAIA database, which includes 1.3 billion parallaxes. Therefore, the usual practice has been to calculate a discrepancy index and then judge whether it is big (small) enough to informally disprove (support) Benford’s law, on the basis of its membership in one of a set of subjectively predefined value categories and without any distributional argument. Our suggested strategy for inference on the Benford hypothesis instead relies on bootstrapping the available distance databases for sample sizes comparable to those explored in Sect. 5 and in the Supplementary Materials. A relatively high proportion of rejections of (11) even in these moderately sized random samples would then cast doubt on the Benford nature of the random process generating the observed distances.

The first database that we analyze, and that we name Stars 1, consists of the first three significant digits of star distances to Earth explored by Alexopoulos and Leontsinis (2014) and further examined by Hill and Fox (2016). In the second instance, called Stars 2, we replicate a similar experiment on a cleaned version of the HYG 3.0 database, containing 109,398 positive distances recorded in parsecs. Cleaning is performed by removing more than 10,000 stars whose coordinates are deemed to be uncertain and for which distances are replaced by a nonfactual label in the original HYG 3.0 database. In Fig. 1 we include a graphical comparison between theoretical and observed frequencies of the first significant digit in each of the two data sets. Shapes appear to be similar. In both cases the observed data have higher frequencies for the first three digits and smaller ones for the others, perhaps suggesting a convolution between a noise law and the Benford one. This would call for a cleaner and possibly larger database, as suggested by de Jong et al. (2020).

Fig. 1
figure 1

Left: Comparison of the first significant digit of star distances in database Stars 1 (grey) with Benford’s law (white). Right: The same, but for database Stars 2

6.2 Results

In our analysis we draw 1000 samples of sizes 200, 300 and 500 with replacement from both star-distance databases. All the generated distance samples can be accessed through the Supplementary Materials. We then compute the proportion of rejections of the Benford hypothesis in each scenario, at exact size \(\gamma \), for all the tests considered in Sect. 5. Detailed results are given in the Supplementary Materials, while Table 11 focuses on the comparison between \(\chi ^2\) and \(\varLambda _{1,n}\) for a few representative values of \(\gamma \). The score statistic \(Z_{1,n}^2\) and the data-driven selection of N available through (20) and (22) are seen to yield similar conclusions. There is considerable evidence against the Benford hypothesis, with rejection proportions systematically much larger than \(\gamma \) and with \(\varLambda _{1,n}\) proving to be even less ambiguous than its competitors. Our likelihood ratio approach thus suggests that the qualitative assessment of Alexopoulos and Leontsinis (2014) might have been overoptimistic, when contrasted with the full probabilistic structure of s(X) implied by (11).

We argue that one possible explanation of the disagreement between our results and optimistic expectations about the Benford behavior of star distances may reside in the actual quality of the available data. While in the Supplementary Material we show that truncation of X to three significant digits, as is the case in database Stars 1, should not alter conclusions about the Benford hypothesis with the sample sizes under consideration, we note that the data-cleaning step performed on the distances of Stars 2 slightly reduces the disagreement with (11). More importantly, many ties appear in the HYG database and are then reflected in the realized values of s(X), even when several digits are considered. These ties should not occur under Benford’s law and their effect is magnified when looking at the empirical distribution of s(X). Therefore, the current levels of measurement precision and the associated errors that affect databases such as HYG seem to hinder formal statistical support for the sophisticated mathematical arguments put forth by Hill and Fox (2016).

A similar conclusion may also be reached by considering the measurement error model used in the Hipparcos and GAIA data processing (Bailer-Jones 2015). For a star at true distance r, its true but unknown parallax is 1/r. The measured parallax \(\varpi \) is a noisy observation of 1/r under the small angle approximation. It is assumed that \(\varpi \) is normally distributed with unknown mean 1/r and known standard deviation \(\sigma _\varpi \), which ultimately depends on the inverse of the number of photons received from the star. It is then hard to anticipate conformance to the Benford hypothesis in view of the results of Sect. 3. Correspondingly, such a measurement-error model may lead to the simulation of synthetic astronomical data, under the Benford hypothesis, as the convolution of a Benford random variable with a suitable noise random variable.
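
The impact of this measurement-error mechanism can be explored by simulation; the sketch below uses entirely hypothetical values for the true distances and for \(\sigma _\varpi \) and is only meant to illustrate the convolution idea.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100_000

# Hypothetical "true" distances r whose significand is exactly Benford (arbitrary unit),
# spread over three decades.
r = 10.0 ** (rng.uniform(size=n) + rng.integers(1, 4, size=n))

true_parallax = 1.0 / r
sigma_parallax = 0.2 * np.median(true_parallax)     # assumed, purely illustrative noise level
measured_parallax = rng.normal(true_parallax, sigma_parallax)

# Keep positive parallaxes only and invert them to obtain estimated distances.
est_r = 1.0 / measured_parallax[measured_parallax > 0]

def first_digit_freq(x):
    d = np.floor(10.0 ** (np.log10(np.abs(x)) % 1.0)).astype(int)
    return np.bincount(d, minlength=10)[1:] / len(x)

print(np.round(first_digit_freq(r), 4))              # Benford frequencies
print(np.round(first_digit_freq(est_r), 4))          # distorted by the measurement error
```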

Table 11 Astronomical distances. Proportion of rejections of the Benford hypothesis in 1000 samples of size n drawn with replacement from each database, at exact test size \(\gamma \)

We conclude our analysis of astronomical distances by briefly addressing the issue of lack of scale invariance of standard tests when applied to star-distance data. It is well known that parsecs and light years are two similarly popular scales for these distances. The results for \(\chi ^2\) in Table 11 refer to parsecs. Figure 2 shows the distributions of the differences between the p-values for \(\chi ^2\) that are obtained when the values in each sample are rescaled to measure distances in light years. We observe bell shapes roughly centered on zero, together with remarkable variability in all cases. This confirms that the adoption of a different measurement unit can lead to strongly different results with digit statistics, a hardly acceptable outcome in the scientific context under consideration.

Fig. 2
figure 2

Astronomical distances. Distributions of the differences between the p-values of \(\chi ^2\) resulting from data expressed first in parsecs and then in light years

7 Streamflow data

Inspired by a Referee, we present a second case study related to water flows (in cubic feet per second) recorded at more than 17,000 streamgage sites in the U.S. over an extended period of time spanning from 1874 to 2004. This quite extensive collection, that contains 457,440 observations, was obtained and analyzed by Nigrini and Miller (2007) after deletion of blanks, zeroes (possibly corresponding to absence of water flows), negative numbers (possibly corresponding to data errors) and duplicate records. Such a large hydrological data set comes with the R package benford.analysis available at the web address https://github.com/carloscinelli/benford.analysis, from which we have retrieved it (see also Cinelli 2022).

Nigrini and Miller (2007) and Nigrini (2012, Chapter 12), through a detailed data description and a battery of formal digit tests, conclude that the streamflow data set exhibits a practically excellent fit to Benford’s law, in spite of the possible “excess of power” related to the large sample size and in spite of some potential sampling bias in the spatial distribution of the measuring stations. Based on the first-digit counts reported in Fig. 3, we readily agree, since the visual fit is nearly perfect. Therefore, unlike in the astronomical case study, our aim here is to inspect how our findings apply when the Benford hypothesis is a reasonable approximation to the true digit generating process.

As a first experiment, we test the whole data set, obtaining p-values equal to 0.0342 and 0.0144 for \(\chi ^2\) and \(\varLambda _{1,n}\), respectively. Although such quantities may appear smaller than expected, they are in good agreement with the evidence provided by Nigrini and Miller (2007). We thus see that our likelihood ratio approach also does not lead to strong evidence against the Benford hypothesis, in spite of the huge available sample and of possible inaccuracies in the data collection process. By way of comparison, we note that for the astronomical distance data both tests yield a virtually null p-value, even with a smaller sample size. We then apply the sub-sampling procedure of Sect. 6 to the streamflow data. Again, we see in Table 12 that both tests exhibit rejection proportions which are very close to the nominal test sizes, thus further corroborating the plausibility of the Benford hypothesis in this problem.

We conclude our analysis, as in Sect. 6, by assessing the effect of changing the measurement unit of streamflows from cubic feet to cubic decimeters per second. Figure 4 mirrors Fig. 2 by reporting the distributions of the differences between the p-values of \(\chi ^2\) after this change of unit. The results are similar, and perhaps even more alarming, since the distributions are clearly less concentrated around zero. This empirical evidence is particularly interesting because it shows that the sensitivity of \(\chi ^2\) to the data scale does not diminish even when the data are close to being genuinely Benford.

Fig. 3
figure 3

Comparison of the first significant digit of the streamflow data (grey) with Benford’s law (white)

Table 12 Streamflow data. Proportion of rejections of the Benford hypothesis in 2000 samples of size n drawn with replacement from the database, at exact test size \(\gamma \)
Fig. 4
figure 4

Streamflow data. Distributions of the differences between the p-values of \(\chi ^2\) resulting from data expressed first in cubic feet and then in cubic decimeters per second