Abstract
Assessing goodness of fit to a given distribution plays an important role in computational statistics. The probability integral transformation (PIT) can be used to convert the question of whether a given sample originates from a reference distribution into a problem of testing for uniformity. We present new simulation- and optimization-based methods to obtain simultaneous confidence bands for the whole empirical cumulative distribution function (ECDF) of the PIT values under the assumption of uniformity. Simultaneous confidence bands consist of pointwise confidence intervals that jointly attain the desired coverage. These methods can also be applied in cases where the reference distribution is represented only by a finite sample, which is useful, for example, for simulation-based calibration. The confidence bands provide an intuitive ECDF-based graphical test for uniformity that also gives useful information on the nature of the discrepancy. We further extend the simulation and optimization methods to determine simultaneous confidence bands for testing whether multiple samples come from the same underlying distribution. This multiple-sample comparison test is useful, for example, as a complementary diagnostic in multi-chain Markov chain Monte Carlo (MCMC) convergence assessment, where most currently used convergence diagnostics provide a single diagnostic value but do not usually offer insight into the nature of the deviation. We provide numerical experiments to assess the properties of the tests using both simulated and real-world data and give recommendations on their practical application in computational statistics workflows.
1 Introduction
Tests for uniformity play an essential role in computational statistics when estimating goodness of fit to a given distribution (Marhuenda et al. 2005). This is because, even when the distribution of interest is not uniform, there are methods to reduce the problem to testing for uniformity by transforming a sample from the given distribution to a (discrete or continuous) uniform distribution. Common use cases in the Bayesian workflow (Gelman et al. 2020) are simulation-based calibration and Markov chain Monte Carlo convergence diagnostics, which we also use as examples in this paper. A graphical test can provide additional insight into the nature of the discrepancy that goes beyond the dichotomy of a uniformity test.
1.1 Probability integral transformation
Transforming sampled values to a uniform distribution is usually achieved via the probability integral transform (PIT), provided that the distribution of interest has a tractable cumulative distribution function (CDF) (D’Agostino and Stephens 1986). Let \(y_1,\ldots , y_N \sim g(y)\) be an independent sample from an unknown continuous distribution with probability density function (PDF) g. We want to know whether \(g = p\), where p is the PDF of a known distribution with a tractable CDF. The PIT of the sampled value \(y_i\) with respect to p is
If \(g = p\), the transformed values \(u_i\) are continuously, independently, and uniformly distributed on the unit interval [0, 1], reducing the evaluation of the hypothesis into testing for uniformity of the transformed sample \(u_1, \ldots , u_N\). If the integral (1) does not have a closed form, the CDF (and hence the PIT values) can still be computed with sufficient accuracy through numerical integration (e.g. quadrature), provided the corresponding PDF is tractable.
If neither the CDF nor the PDF have closed form, but a comparison sample of independent values \(x^i_1, \ldots , x^i_S \sim p(x)\) can be drawn separately for each \(y_i\), the hypothesis \(g = p\) can be evaluated through the empirical PIT values
where \(\mathbb {I}\) is the indicator function. Now, given \(g = p\), the transformed values \(u = u_1, \ldots , u_N\) are independently distributed according to a discrete uniform distribution with \(S+1\) values \((0,1/S,\ldots ,(S-1)/S,1)\). Accordingly, we can still apply uniformity tests to assess \(g = p\), although this time we need to test for discrete uniformity.
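As a minimal sketch of Eq. (2) (assuming NumPy; the function name is ours, not the paper's), the empirical PIT value of each \(y_i\) is the fraction of its comparison sample not exceeding it:

```python
import numpy as np

def empirical_pit(y, x):
    """Empirical PIT values of Eq. (2): for each y_i, the fraction of its
    comparison sample x_i1, ..., x_iS that is smaller than or equal to y_i.

    y : array of shape (N,), the sample from g
    x : array of shape (N, S), a separate comparison sample from p per y_i
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # u_i takes values on the discrete grid {0, 1/S, ..., (S-1)/S, 1}
    return (x <= y[:, None]).mean(axis=1)
```

Each \(u_i\) falls on the \(S+1\) point grid, so the discrete uniformity tests discussed below apply.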
If either sample has dependencies, as with an autocorrelated sample from a Markov chain, the order statistics are affected by the dependencies and the empirical PIT values (2) are not distributed uniformly even if \(g=p\) (unless the sample size goes to infinity). For Markov chains, the usual remedy is to thin the chain to obtain an approximately independent sample. This issue is illustrated, and thinning recommendations are provided, in Appendix A.
Figure 1 shows an example of the empirical cumulative distribution function (ECDF) for u obtained through both Eq. (1) and Eq. (2). The figure also shows an example of a pointwise confidence interval for each ECDF. For the continuous integral of Eq. (1), the pointwise confidence interval can be computed from the distribution of the order statistics of the continuous uniform distribution, which is a beta distribution. For the discrete sum of Eq. (2), the pointwise confidence interval can be computed from the distribution of the order statistics of the discrete uniform distribution, with the cumulative distribution function of the ith order statistic \(u_{(i)}\) given as
for \(z \in (0,1/S,\ldots ,(S-1)/S,1)\) (Arnold et al. 2008, Example 3.1). The corresponding pointwise intervals do not have a convenient closed form in general, and, more importantly, the discrete order statistics do not exhibit the Markovian structure (exploited by our new optimization-based approach) if there are possible ties in u (Arnold et al. 2008, Theorem 3.4.1).
To make the computation of the simultaneous confidence bands more straightforward and efficient, we propose making an additional transformation by computing the ECDF of u at chosen evaluation points \(z_i\):
We recommend choosing \(z_i\) as the ordered fractional ranks \(\tilde{r}_i\) of \(y_i\), defined as
The ordered fractional ranks form a uniform partition of the unit interval independent of the distribution of \(y_i\). Thus, they provide an ECDF that is easier to interpret than the corresponding ECDF based directly on the original sample \(y_i\). The resulting ECDF is illustrated in Fig. 1c. As we will show, useful properties of this ECDF are that 1) its pointwise confidence intervals can be computed easily from the binomial distribution, with a quantile function already implemented in most widely used environments for statistical computing, and 2) the distribution of the ECDF trajectories is Markovian, which is exploited in Sect. 2.3.
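Evaluating the ECDF of u at the chosen evaluation points \(z_i\), as in Eq. (4), amounts to counting the PIT values not exceeding each \(z_i\). A minimal sketch (assuming NumPy; the function name is ours):

```python
import numpy as np

def ecdf_at(u, z):
    """ECDF of the PIT values u evaluated at points z, as in Eq. (4):
    F(z_i) = (1/N) * #{j : u_j <= z_i}."""
    u = np.sort(np.asarray(u, dtype=float))
    # side="right" counts entries smaller than or equal to each z_i
    return np.searchsorted(u, z, side="right") / len(u)
```

For example, evaluating the ECDF of five equally spaced PIT values at a uniform partition of the unit interval recovers the identity line expected under uniformity.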
1.2 Simultaneous confidence bands
The major challenge that arises when developing a uniformity test based on the ECDF is to obtain simultaneous confidence bands with the desired overall coverage. For this purpose, one needs to take into account the inter-dependency in the ECDF values and adjust the coverage parameter accordingly. (We will discuss this in more detail in Sect. 2.)
When considering whether a given ECDF could present a sample from a uniform distribution, we need to jointly consider all pointwise uncertainties. For a set of evaluation points \((z_i)_{i=1}^K\), we provide lower and upper confidence bands \(L_i\) and \(U_i\), respectively, that jointly satisfy
where \(F(z_i)\) is the ECDF of a sample from either the standard uniform distribution or discrete uniform distribution on the unit interval evaluated at \(z_i \in (0, 1)\) and \(1 - \alpha \) is the desired simultaneous confidence level. In addition to offering a numerical test for uniformity, the simultaneous confidence bands provide an intuitive graphical representation of possible discrepancies from uniformity.
Aldor-Noiman et al. (2013) presented a simulation-based approach for computing simultaneous confidence bands for the ECDF of the transformed sample acquired from Eq. (1) under the assumption of uniformity. In this paper, we present a simulation method inspired by Aldor-Noiman et al. (2013) as well as a new, faster optimization method for computing simultaneous confidence bands under uniformity, when the ECDF is computed from the empirical PIT values using Eqs. (2) and (4). Figure 2 contrasts the simultaneous confidence bands by Aldor-Noiman et al. (2013) against those obtained from our proposed method. Furthermore, we generalize our method and simultaneous confidence bands to test whether multiple samples originate from the same underlying distribution.
1.3 Related work
The idea of utilizing the ECDF to test uniformity is not new, but its potential has not yet been realized in full. For example, the well-known Kolmogorov–Smirnov (KS) test, first introduced by Kolmogorov (see e.g. Massey (1951), original article in Italian is Kolmogorov (1933)), is based on evaluating the maximum deviation of the sample ECDF from the theoretical CDF of the distribution to be tested against. Unfortunately, the KS test is relatively insensitive to deviations in the tails of the distribution (Aldor-Noiman et al. 2013), and numerous tests have been proposed to replace the KS test. An extensive comparison of more than thirty tests of uniformity of a single sample is provided by Marhuenda et al. (2005).
Due to its ease of interpretation and familiarity even to people with only basic statistical knowledge, a graphical method for assessing uniformity commonly used as part of many statistical workflows is plotting histograms. This can even be turned into a formal test of uniformity with confidence intervals for the individual bins (e.g. Talts et al. 2020). Drawbacks of histograms are that binning discards information, there can be binning artefacts depending on the choice of bin width and placement, and they ignore the dependency between bins. The proposed ECDF-based method does not require binning or smoothing, provides an intuitive visual interpretation, and works for both continuous (Eq. (1)) and discrete (Eq. (2)) values. An illustration and comparison of histograms with two binning choices and our new method is shown in Fig. 3. The visual range between the simultaneous confidence bands for the ECDF is often narrow when visualizing a sample with a large number of observations. Thus, to achieve a more dynamic range for the visualization, we recommend showing ECDF difference plots instead, as illustrated in Fig. 3d. The ECDF difference plot is obtained by subtracting the values of the expected theoretical CDF (i.e. the identity function on [0,1] in the case of standard uniformity) from the observed ECDF values.
1.4 Summary of contributions
In this article, we focus on use case examples arising from inference validation and Markov chain Monte Carlo (MCMC) convergence diagnostics as part of a Bayesian workflow (Gelman et al. 2020), but our developed methods are applicable more generally. Our use cases can be divided into two main categories: a single sample test for uniformity and a multiple sample comparison where the hypothesis is tested that the samples are drawn from the same underlying (potentially non-uniform) distribution. We discuss both cases in more detail below.
We offer a graphical test for uniformity by providing simultaneous confidence bands for one or more ECDF trajectories obtained through the empirical probability integral transformation.
As our first contribution, we modify an existing ECDF-based approach proposed by Aldor-Noiman et al. (2013) to take into account the discreteness of the fractional rank-based PIT values. This forms the basis for our proposed single- and multi-sample tests.
As our second contribution, we provide both simulation and optimization methods to determine the adjustment needed to achieve a desired simultaneous confidence level for the ECDF trajectory given the fractional rank-based PIT values. In addition to presenting a simulation-based adjustment following the method of Aldor-Noiman et al. (2013), we introduce a new optimization method that is computationally considerably more efficient in determining the needed adjustment, especially when bands with high resolution are desired for a large sample size.
Although our focus is on providing a test with an intuitive graphical representation, we show that our method performs competitively when compared to existing state-of-the-art uniformity tests. We demonstrate the usefulness of this graphical test in the context of the simulation-based calibration approach for assessing inference methods (Talts et al. 2020).
Finally, as our third contribution, we generalize the graphical test as well as both the simulation and optimization methods to evaluate the hypothesis that two or more samples are drawn from the same underlying distribution. We demonstrate the usefulness of this graphical test in MCMC convergence diagnostics, where the currently most common graphical tools for assessing convergence are trace plots of the individual sampled chains.
1.5 Outline of the paper
In Sect. 2, we first provide a simulation-based method to determine simultaneous confidence bands for the ECDF of a single uniform sample and then present a new, more efficient optimization-based method.
In Sect. 3, we extend the test to multiple sample comparison and follow a similar structure by offering both simulation- and optimization-based methods.
We continue in Sect. 4 with simulated and real-world examples illustrating the application of our proposed method and end with a discussion in Sect. 5.
2 Simultaneous confidence bands for the empirical cumulative distribution
We propose simulation- and optimization-based approaches for providing the ECDF of a uniform sample with \(1-\alpha \)-level simultaneous confidence bands that are compatible with empirical PIT values, that is, confidence bands with a type-1-error rate of \(\alpha \). Our approach is similar to that presented by Aldor-Noiman et al. (2013), with one central distinction illustrated in Fig. 2. The method by Aldor-Noiman et al. (2013) obtains simultaneous confidence bands for the evaluation quantiles with fixed ECDF values based on beta distributions, that is, it obtains confidence bands along the horizontal axis (Fig. 2a). In contrast, our new method provides simultaneous confidence bands for the ECDF values at fixed evaluation quantiles based on binomial distributions, that is, it obtains confidence bands along the vertical axis (Fig. 2b). In the limit, as the sample size approaches infinity, there is no practical difference between the methods. However, when the number of possible unique ranks is small, our proposed method behaves better for the smallest and largest ranks, and remains consistent if the ranks are further binned.
2.1 Pointwise confidence bands
Determining the pointwise confidence interval for the ECDF value of a sample from the continuous uniform distribution at a given evaluation point \(z_i\in (0,1)\) is rather straightforward. By definition, given a sample \(u = u_1, \ldots , u_N\), the ECDF value is
As the sampled values, \(u_j\in (0,1)\), are expected to be continuously uniformly distributed, \(\Pr (u_j \le z_i) = z_i\) for each \(j=1,\ldots ,N\). Thus, the values resulting from scaling the ECDF with the sample size N are binomially distributed as
If we instead expect u to be sampled from a discrete uniform distribution with S distinct equally spaced values, \(s_j = j/S\), by choosing the partition points to form a subset of these category values, we again have \(\Pr (u_j \le z_i) = z_i\) for \(j=1,\ldots ,N\), and the marginal distribution of the scaled ECDF follows Eq. (8). Therefore, the methods introduced in Sects. 2.2 and 2.3 can be used to determine simultaneous confidence bands for both continuous and discrete uniform samples, allowing for testing uniformity of both the continuous PIT values of Eq. (1) and the discrete empirical PIT values in Eq. (2).
From Equation (8), it is straightforward to determine the \(1 - \alpha \) level pointwise lower and upper confidence bands, \(L_i\) and \(U_i\), respectively, satisfying individually for each \(i = 1,\ldots ,K\)
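Assuming SciPy is available, these pointwise bands follow directly from the binomial quantile function (`binom.ppf`). A minimal sketch (the function name is ours):

```python
from scipy.stats import binom

def pointwise_band(z, N, alpha):
    """1 - alpha pointwise confidence interval for the ECDF value F(z) of a
    uniform sample of size N, using that N * F(z) ~ Binomial(N, z) (Eq. (8))."""
    lower = binom.ppf(alpha / 2, N, z) / N
    upper = binom.ppf(1 - alpha / 2, N, z) / N
    return lower, upper
```

For instance, at \(z = 0.5\) with \(N = 100\) and \(\alpha = 0.05\), the interval is centred on 0.5 with a half-width of roughly two binomial standard deviations.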
In contrast, determining the simultaneous confidence bands for ECDF trajectories (i.e. sets of ECDF values) is more complicated. In Fig. 4, we illustrate the dependency between ECDF values at distinct evaluation quantiles, together with simultaneous confidence bands computed via either of the new methods described in the following sections. As is illustrated in the figure, ECDF values evaluated at two quantiles close to each other are strongly dependent, while ECDF values evaluated at two quantiles far away from each other are only weakly dependent. In any case, these dependencies need to be taken into account when constructing simultaneous confidence bands.
Another important remark is that, as the marginal distribution of the scaled ECDF is discrete, the simultaneous confidence intervals do not in all cases meet the desired coverage level exactly. Brown et al. (2001) provide a thorough exploration of the effect discreteness has on the coverage level of various interval estimators for a binomial proportion, with listings of what the authors call lucky and unlucky sample lengths. In our experience, even though discreteness affects the coverage level of the pointwise confidence intervals, this effect is reduced to a deviation of under \(\pm 1\%\) for \(N\in [50,2000]\) in the coverage level of the resulting simultaneous confidence bands we introduce next.
2.2 Simultaneous confidence bands through simulation
Our goal is to define simultaneous confidence bands for the ECDF of a sample of N values drawn from the standard uniform distribution so that the interior of the confidence bands contains trajectories induced by that distribution with rate \(1 - \alpha \), where \(\alpha \in (0,1)\).
In this section, we describe a simulation-based method for determining the simultaneous confidence bands for the ECDF trajectory.
We follow steps similar to those introduced by Aldor-Noiman et al. (2013), with the exception that instead of determining limits for the Q-Q plot, we now determine the upper and lower limits of the ECDF values at the evaluation points \(z_i\):
1. Choose a partition \(\left( z_i\right) _{i=1}^K\) of the unit interval.

2. Determine coverage parameter \(\gamma \) to account for multiplicity in order to obtain the \(1-\alpha \)-level simultaneous confidence bands:
$$\begin{aligned} \Pr \left( L_i(\gamma )\le F(z_i)\le U_i(\gamma ) \text { for all } i\in \{1,\ldots , K\}\right) = 1-\alpha . \end{aligned}$$(10)
In determining these confidence bands, we use the fact that the values of the scaled ECDF at each point \(z_i\) follow a binomial distribution, and denote the value of the cumulative binomial distribution function with parameters N and \(z_i\) at \(k\in \mathbb {N}\) by \({{\,\mathrm{\mathrm {Bin}}\,}}(k {{\,\mathrm{\;|\;}\,}}N,z_i)\) and its inverse by \({{\,\mathrm{\mathrm {Bin}}\,}}^{-1}(q {{\,\mathrm{\;|\;}\,}}N,z_i)\) for quantile \(q \in [0, 1]\).
To find the desired coverage value \(\gamma \), we simulate M samples of size N from the standard uniform distribution. Let \(F^m\) denote the ECDF of the mth sample, \(u_1^m,\ldots ,u_N^m \sim \mathrm{uniform}(0,1)\). For each sample, we find the value of \(\gamma \) such that the equal tail quantiles
and
provide the tightest possible lower and upper limits, respectively, to the sample ECDF, \(F^m\), at each \(z_i\). This value of \(\gamma \) for the mth sample is
As now we have for \(\gamma ^m\) equally that
and as it holds that
\(\gamma ^m\) defines a set of upper and lower limits to the ECDF which is by Eq. (14) the tightest possible pair of limits defining equal tail quantiles for the ECDF at each \(z_i\). To obtain bands covering a \(1-\alpha \) fraction of the ECDFs of the simulated samples, we set \(\gamma \) to the \(\alpha \) quantile of the values \(\lbrace \gamma ^1,\ldots , \gamma ^M\rbrace \). Since \(\gamma ^m > 0\) by construction, we also have \(\gamma > 0\).
The following steps summarize the algorithm for simulating the adjusted coverage parameter \(\gamma \) and determining the \(1-\alpha \) level simultaneous confidence bands:
1. For \(m=1,\ldots ,M\):

   (a) Simulate \(u_1^m,\ldots ,u_N^m \sim \mathrm{uniform}(0,1)\).

   (b) For \(i=1,\ldots ,K\), compute \(F^m(z_i)\).

   (c) For \(i=1,\ldots ,K\), compute
   $$\begin{aligned} {{\,\mathrm{\mathrm {Bin}}\,}}(NF^m(z_i) {{\,\mathrm{\;|\;}\,}}N,z_i) \text { and } {{\,\mathrm{\mathrm {Bin}}\,}}(NF^m(z_i)-1 {{\,\mathrm{\;|\;}\,}}N,z_i). \end{aligned}$$

   (d) Find the minimum probability
   $$\begin{aligned} \gamma ^m= & {} 2\min _i\left\{ \min \left( {{\,\mathrm{\mathrm {Bin}}\,}}(NF^m(z_i) {{\,\mathrm{\;|\;}\,}}N,z_i), \right. \right. \\&\left. \left. 1-{{\,\mathrm{\mathrm {Bin}}\,}}(NF^m(z_i)-1 {{\,\mathrm{\;|\;}\,}}N,z_i)\right) \right\} . \end{aligned}$$

2. Set \(\gamma \) to be the \(100\alpha \) percentile of \(\lbrace \gamma ^1,\ldots ,\gamma ^M\rbrace \).

3. Form the confidence bands
   $$\begin{aligned}&\left[ L_i\left( \gamma \right) ,U_i\left( \gamma \right) \right] \\&\quad = \left[ \frac{1}{N}{{\,\mathrm{\mathrm {Bin}}\,}}^{-1}\left( \frac{\gamma }{2} {{\,\mathrm{\;|\;}\,}}N,z_i\right) ,\frac{1}{N}{{\,\mathrm{\mathrm {Bin}}\,}}^{-1}\left( 1-\frac{\gamma }{2} {{\,\mathrm{\;|\;}\,}}N,z_i\right) \right] \end{aligned}$$
   for \(i=1,\ldots ,K\).
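The simulation algorithm above can be sketched as follows. This is an illustrative implementation, not the authors' reference code; the interior partition of the unit interval and all names are our own assumptions, and SciPy provides the binomial distribution:

```python
import numpy as np
from scipy.stats import binom

def adjust_gamma_simulate(N, K, alpha=0.05, M=10000, seed=None):
    """Simulate the adjusted coverage parameter gamma and the resulting
    simultaneous confidence bands for a uniform sample of size N."""
    rng = np.random.default_rng(seed)
    z = np.arange(1, K + 1) / (K + 1)  # an assumed partition of (0, 1)
    gammas = np.empty(M)
    for m in range(M):
        u = np.sort(rng.uniform(size=N))
        scaled_ecdf = np.searchsorted(u, z, side="right")  # N * F^m(z_i)
        lower = binom.cdf(scaled_ecdf, N, z)               # Bin(N F | N, z_i)
        upper = 1 - binom.cdf(scaled_ecdf - 1, N, z)       # 1 - Bin(N F - 1 | N, z_i)
        gammas[m] = 2 * np.minimum(lower, upper).min()     # step (d)
    gamma = np.quantile(gammas, alpha)                     # the 100*alpha percentile
    # step 3: simultaneous bands, scaled back to ECDF values in [0, 1]
    L_band = binom.ppf(gamma / 2, N, z) / N
    U_band = binom.ppf(1 - gamma / 2, N, z) / N
    return gamma, L_band, U_band
```

Note that the resulting \(\gamma \) is well below \(\alpha \): the pointwise level must be tightened to reach the desired simultaneous coverage.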
2.3 Simultaneous confidence bands through optimization
We also propose a computationally more efficient optimization-based method for determining the simultaneous confidence bands.
In the following derivation of the optimization method, we denote the interior of the confidence bands for the ECDF at quantile \(z_i\) as \(\tilde{I}_i(\gamma )\). By denoting \(r_i = N F(z_i)\), the scaled interior \(I_i(\gamma )\) for \(r_i\) is given by
As is common for discrete statistical tests, we treat the borders between the interior and exterior as being within the confidence bands. Based on \(I_i(\gamma )\), we can easily obtain \(\tilde{I}_i(\gamma )\) as \(r \in I_i(\gamma )\) is equivalent to \(r / N \in \tilde{I}_i(\gamma )\).
A scaled ECDF trajectory defined as
with \(z_0 = 0\) and \(z_K = 1\) stays within the simultaneous confidence bands completely if and only if \(r_i \in I_i(\gamma )\) for all \(i \in \{0, \ldots , K\}\). If we denote the set of trajectories fulfilling \(r_i \in I_i\) as \(T_i\), we can write the set of trajectories which are completely within the simultaneous confidence bands as
In order for the simultaneous confidence bands to have confidence level \(1-\alpha \), we must have
Due to the pairwise independence of the original draws \(u_{i}\) (by assumption), the distribution of the ECDF values within a single trajectory is Markovian in the sense that the ECDF value \(F(z_{i+1})\) only depends on the observed value at the previous evaluation point, \(F(z_i)\) and not on the earlier behaviour of the ECDF trajectory.
This implies that, under uniformity of the original distribution, the remaining \(N - NF(z_i) = N - r_i\) draws are uniformly distributed on the interval \([z_i, 1]\), and thus the growth of the scaled ECDF from \(r_{i}\) to \(r_{i+1}\) between \(z_{i}\) and \(z_{i+1}\) is binomially distributed with \(N - r_{i}\) trials and the success probability
And so we have
The probability for \(r_{i+1} = k \in I_{i+1}\) to occur in a scaled ECDF trajectory \(t_0^K\) which stayed within the simultaneous confidence bands until point i, that is, for which we have
can thus be written recursively as
The recursion is initialized at \(z_0 = 0\) with \(\Pr (r_{0} = 0) = 1\) so that \(\Pr (T_0(\gamma )) = 1\) for all \(\gamma \in [0, 1]\). At any point \(i \in \{0, \ldots , K\}\), we can obtain
which is equal to \(\Pr (T(\gamma ))\) when arriving at \(i = K\). Clearly, \(\Pr (T(\gamma ))\) is monotonically decreasing but not continuous in \(\gamma \) due to the discrete nature of the binomial distribution. Thus, Equation (18) will not have an exact solution in general and so we will not be able to meet the simultaneous confidence level \(1-\alpha \) exactly. We can, however, try to get as close as possible by computing
with a unidimensional derivative-free optimizer. In our experiments, the optimizer proposed by Brent (1973) (which is implemented, e.g. in the R function optimize) converged quickly in all cases to \(\hat{\gamma }\) values implying a simultaneous confidence level very close to the nominal \(1-\alpha \).
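The Markovian recursion and the subsequent one-dimensional search for \(\hat{\gamma }\) can be sketched as follows. This is an illustrative implementation under our own assumptions (the evaluation points z lie strictly inside (0, 1), so the trivial endpoints where the trajectory is always within the bands are omitted; SciPy supplies the binomial distribution and a bounded Brent-type optimizer; all names are ours):

```python
import numpy as np
from scipy.stats import binom
from scipy.optimize import minimize_scalar

def band_coverage(gamma, N, z):
    """Simultaneous coverage Pr(T(gamma)) of the bands with pointwise level
    1 - gamma, computed with the Markovian recursion described above.
    z is assumed to be an increasing partition of the open interval (0, 1)."""
    lo = binom.ppf(gamma / 2, N, z).astype(int)       # interior I_i(gamma)
    hi = binom.ppf(1 - gamma / 2, N, z).astype(int)
    prob = np.zeros(N + 1)
    prob[0] = 1.0          # recursion starts at z_0 = 0 with r_0 = 0
    z_prev = 0.0
    for i in range(len(z)):
        # growth r_{i+1} - r_i ~ Binomial(N - r_i, (z_i - z_prev)/(1 - z_prev))
        p = (z[i] - z_prev) / (1 - z_prev)
        ks = np.arange(lo[i], hi[i] + 1)              # states inside I_i
        rs = np.nonzero(prob)[0]                      # reachable prior states
        inc = ks[None, :] - rs[:, None]               # pmf is 0 for inc < 0
        trans = binom.pmf(inc, (N - rs)[:, None], p)
        new = np.zeros(N + 1)
        new[ks] = prob[rs] @ trans
        prob = new
        z_prev = z[i]
    return prob.sum()

def adjust_gamma_optimize(N, z, alpha=0.05):
    """Find the gamma whose simultaneous coverage is closest to 1 - alpha."""
    res = minimize_scalar(lambda g: abs(band_coverage(g, N, z) - (1 - alpha)),
                          bounds=(1e-6, alpha), method="bounded")
    return res.x
```

Since the coverage is monotonically decreasing in \(\gamma \), the absolute-difference objective is quasi-convex and the bounded scalar optimizer converges quickly in practice.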
With a 2015 laptop equipped with a 2.90 GHz Intel® Core™ i5-5287U processor, the optimization method reduces the time required to compute the adjustment parameter \(\gamma \) from 10 s to 600 ms for a sample of length 250 when compared against the time required for 10,000 steps of the simulation method. With \(N = 1000\), this reduction is from 75 s to 10 s.
Both of the implementations used for this article only use a single computation thread, but would benefit from parallelization, as both methods include independent iterations.
The required computation time can be further reduced by pre-computing the adjustment parameters on a grid of sample sizes and interpolating for other values of N on the log-log scale.
3 Comparison of multiple samples
In this section, we extend the uniformity test of Section 2 to test whether multiple samples originate from the same underlying distribution. When multiple samples share the same distribution, the rank statistics of the values within each sample, when ranked jointly across all samples, are uniformly distributed over the integers \(1, \ldots , \tilde{N}\), where \(\tilde{N}\) is the total length of the combined sample (Vehtari et al. 2021). Thus, instead of considering the sampled values directly, we consider the implied jointly rank-transformed values.
Due to this joint rank-transformation, the resulting chains are dependent on each other, and the confidence intervals we construct in the following two sections are used to answer whether the two or more samples all originate from the same underlying distribution. In other words, if one or more of the ECDF trajectories leaves the confidence bands, we conclude that at least one of the samples exhibits larger-than-expected deviation from the other samples at hand.
An illustration of the connection between the sampled values, the corresponding fractional rank statistics, and the two ECDF plots of these rank statistics is displayed in Fig. 5.
3.1 Pointwise confidence bands
An important distinction to the ECDF case considered in Section 2 is the form of the marginal distribution at quantile \(z_i\) when determining the adjusted coverage parameter \(\gamma \). As our main application is the comparison of distributions induced by MCMC chains, we refer to the L different samples as chains and assume all chains to have the same length N. We define \(r_i\) as the length-L vector whose elements count, per chain, the joint ranks across chains that are smaller than or equal to \(s_i = \lfloor z_i N L \rfloor \). That is, for each of the L elements \(r_{il}\) of \(r_i\), we have
where \(u_{lj}\) is the jth draw of the lth chain before transformation, \(R(u_{lj} {{\,\mathrm{\;|\;}\,}}u)\) is the rank of \(u_{lj}\) within the vector u of all draws across all chains, and \(\mathbb {I}\) is the indicator function. Clearly, because of the definition of ranks, we know for all i that
and we define the set of all \(r_i\) satisfying (26) as \(R_i\). Due to the pairwise independence of the original draws \(u_{lj}\) (by assumption), the marginal distribution of \(r_i\) at quantile \(z_i\) is multivariate hypergeometric
where \(\tilde{N} = (N_1, \ldots , N_L)\) is the vector of chain lengths (i.e. population sizes) and \(N_1 = \ldots = N_L = N\), as we assume chains of equal length. It is well known that, in this case, the marginal distribution of \(r_{il}\), and thus the distribution defining the pointwise confidence bands, is hypergeometric
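A hedged sketch of the pointwise bands implied by this hypergeometric marginal, assuming SciPy's parametrization of the hypergeometric distribution (total population \(NL\), of which N belong to the chain of interest, with \(s_i\) joint ranks "drawn"); the function name is ours:

```python
from scipy.stats import hypergeom

def pointwise_band_chains(z, N, L, alpha):
    """1 - alpha pointwise interval for the scaled per-chain count r_il / N.
    Marginally, r_il is hypergeometric: s_i = floor(z*N*L) smallest joint
    ranks drawn from N "own-chain" and (L-1)*N "other-chain" draws."""
    s = int(z * N * L)
    # SciPy's argument order: (quantile, total M = N*L, tagged n = N, draws = s)
    lower = hypergeom.ppf(alpha / 2, N * L, N, s)
    upper = hypergeom.ppf(1 - alpha / 2, N * L, N, s)
    return lower / N, upper / N
```

For example, with four chains of length 100 evaluated at \(z = 0.5\), the interval is centred on 0.5 and is slightly narrower than the corresponding binomial interval, reflecting the finite-population correction.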
3.2 Simultaneous confidence bands through simulation
In this section, we extend the simulation method presented in Sect. 2.2 to comparison of multiple samples. Our aim is to define simultaneous confidence bands for the ECDFs of multiple, jointly rank-transformed distributions so that the interior of the simultaneous confidence bands jointly contains all trajectories induced by the rank-transformed distributions with rate \(1 - \alpha \). To this end, we define \(r_i\) and \(s_i\) as in Sect. 3.1 and denote the interior of the simultaneous confidence bands at quantile \(z_i\) as \(\tilde{I}_i(\gamma )\), with \(\gamma \) being the adjusted coverage parameter to be determined.
We continue the use of fractional ranks in the ECDF plots to provide illustrations independent of the length of the sampled chains. Suppose we have L chains of length N. The fractional rank score \(\tilde{r}_{il}\) corresponding to the ith value of the lth chain, \(u_{li}\), is
Instead of using the adjusted value of \(\gamma \) to obtain the \(1-\alpha \) level simultaneous confidence bands for a single ECDF trajectory, we adjust \(\gamma \) to account for the dependence between the samples introduced in the transformation into fractional ranks. That is, after choosing the evaluation quantiles \(z_i\), we adjust \(\gamma \) to find upper and lower simultaneous confidence bands satisfying
where \(F_l\) is the ECDF of the fractional rank scores of the lth chain.
We denote the CDF of the hypergeometric distribution as \({{\,\mathrm{\mathrm {Hyp}}\,}}\) and its inverse as \({{\,\mathrm{\mathrm {Hyp}}\,}}^{-1}\). The algorithm to approximate the adjusted coverage parameter \(\gamma \) when comparing L samples is as follows:
1. For \(m=1,\ldots ,M\):

   (a) For \(l=1,\dots ,L\), simulate \(u_{l1}^m,\ldots ,u_{lN}^m \sim \mathrm{uniform}(0,1)\).

   (b) For \(j=1,\ldots ,N\) and \(l=1,\ldots ,L\), compute \(\tilde{r}_{jl}^m\).

   (c) For \(i=1,\ldots ,K\) and \(l=1,\ldots ,L\), compute \(F_l^m(z_i)\).

   (d) For \(i=1,\ldots ,K\) and \(l=1,\ldots ,L\), compute
   $$\begin{aligned}&{{\,\mathrm{\mathrm {Hyp}}\,}}\left( NF_l^m(z_i) {{\,\mathrm{\;|\;}\,}}N, (L-1)N, s_i\right) \text { and }\\&{{\,\mathrm{\mathrm {Hyp}}\,}}\left( NF_l^m(z_i)-1 {{\,\mathrm{\;|\;}\,}}N, (L-1)N, s_i\right) , \end{aligned}$$
   where \(s_i = \lfloor z_i N L \rfloor \).

   (e) Find the minimum probability
   $$\begin{aligned}&\gamma ^m = 2\min _{i,l}\left\{ \min \left( {{\,\mathrm{\mathrm {Hyp}}\,}}\left( NF_l^m(z_i) {{\,\mathrm{\;|\;}\,}}N, (L-1)N, s_i\right) ,\right. \right. \\&\quad \left. \left. 1-{{\,\mathrm{\mathrm {Hyp}}\,}}\left( NF_l^m(z_i)-1 {{\,\mathrm{\;|\;}\,}}N, (L-1)N, s_i\right) \right) \right\} . \end{aligned}$$

2. Set \(\gamma \) to be the \(100\alpha \) percentile of \(\lbrace \gamma ^1,\ldots ,\gamma ^M\rbrace \).

3. Form the confidence bands
   $$\begin{aligned}&\left[ L_i(\gamma ),U_i(\gamma )\right] = \left[ {{\,\mathrm{\mathrm {Hyp}}\,}}^{-1} \left( \frac{\gamma }{2} {{\,\mathrm{\;|\;}\,}}N, N (L - 1), s_i \right) ,\right. \\&\quad \left. {{\,\mathrm{\mathrm {Hyp}}\,}}^{-1} \left( 1 - \frac{\gamma }{2} {{\,\mathrm{\;|\;}\,}}N, N (L - 1), s_i \right) \right] , \end{aligned}$$
   for \(i=1,\dots ,K\).
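The multi-sample simulation algorithm above can be sketched as follows, assuming equal-length chains, SciPy's hypergeometric distribution, and an interior partition of the unit interval; all names and the partition choice are our own assumptions:

```python
import numpy as np
from scipy.stats import hypergeom

def adjust_gamma_chains(N, L, K, alpha=0.05, M=5000, seed=None):
    """Simulate the adjusted coverage parameter gamma for comparing L chains
    of length N, evaluated at K interior quantiles."""
    rng = np.random.default_rng(seed)
    z = np.arange(1, K + 1) / (K + 1)          # assumed interior partition
    s = np.floor(z * N * L).astype(int)        # s_i = floor(z_i N L)
    gammas = np.empty(M)
    for m in range(M):
        u = rng.uniform(size=(L, N))
        # joint ranks across all chains (1, ..., N*L), reshaped per chain
        ranks = u.ravel().argsort().argsort().reshape(L, N) + 1
        # counts N * F_l(z_i): ranks in chain l smaller than or equal to s_i
        counts = (ranks[:, None, :] <= s[None, :, None]).sum(axis=2)
        # SciPy's Hyp(c | M = N*L, n = N, draws = s_i), matching step (d)
        low = hypergeom.cdf(counts, N * L, N, s)
        high = 1 - hypergeom.cdf(counts - 1, N * L, N, s)
        gammas[m] = 2 * min(low.min(), high.min())
    return np.quantile(gammas, alpha)          # the 100*alpha percentile
```

As in the single-sample case, the resulting \(\gamma \) is well below \(\alpha \), since the bands must hold simultaneously across all K quantiles and all L chains.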
3.3 Simultaneous confidence bands through optimization
In this section, we extend the optimization method presented in Sect. 2.3 to comparison of multiple samples. With the marginal distribution of \(r_{il}\) being hypergeometric, the rank interior \(I_i(\gamma )\) for \(z_i\) is given by
We treat the borders between interior and exterior as belonging to the interior. Based on \(I_i(\gamma )\), we can again easily obtain \(\tilde{I}_i(\gamma )\), as \(r \in I_i(\gamma )\) is equivalent to \(r/N \in \tilde{I}_i(\gamma )\).
The remainder of the derivation proceeds similarly to the one-sample case, except that we replace the binomial distribution with the (multivariate) hypergeometric distribution. A (multivariate) rank ECDF trajectory, defined as
where \(z_0 = 0\) and \(z_K = 1\), stays within the simultaneous confidence bands completely if and only if \(r_i \in I_i(\gamma )\) for all \(i \in \{0, \ldots , K\}\). If we denote the set of trajectories fulfilling \(r_i \in I_i\) as \(T_i\), we can write the set of trajectories which are completely in the interior of the simultaneous confidence bands as
In order for the simultaneous confidence bands to have a confidence level \(1-\alpha \), we must satisfy
Due to the pairwise independence of the original draws \(u_{lj}\) (by assumption), the distribution of the rank ECDF trajectories again exhibits a Markovian property similar to that in the single-sample case. That is, any ECDF value \(F(z_{i+1})\) beyond a given point \(z_i\) depends only on \(F(z_i)\) and not on the earlier history of the ECDF trajectory. This implies that, under the assumption that all chains come from the same underlying distribution, the growth \(r_{i+1} - r_{i}\) of the ECDF from \(z_{i}\) to \(z_{i+1}\) is multivariate hypergeometric with \(\tilde{N}_i = \tilde{N} - r_{i}\) and sample size \(\tilde{s}_{i+1} = s_{i+1} - s_i\). Accordingly, we have
where \(p_{{{\,\mathrm{\mathrm {MHyp}}\,}}}\) denotes the discrete PDF of the multivariate hypergeometric distribution. The probability for \(r_{i+1} = k \in I_{i+1}\) to occur in a rank ECDF trajectory \(t_0^K\) which stayed in the simultaneous confidence bands until point i, that is, for which we have
can thus be written recursively as
The recursion is initialized at \(z_0 = 0\) with \(\Pr (x_{0} = (0, \ldots , 0)) = 1\) so that \(\Pr (T_0(\gamma )) = 1\) for all \(\gamma \in [0, 1]\). At any point \(i \in \{0, \ldots , K\}\), we can obtain
which is equal to \(\Pr (T(\gamma ))\) when arriving at \(i = K\). Clearly, \(\Pr (T(\gamma ))\) is monotonically decreasing but not continuous in \(\gamma \) due to the discrete nature of the (multivariate) hypergeometric distribution. We can compute
using a unidimensional derivative-free optimizer. In our experiments, the optimizer proposed by Brent (1973) converged in all cases to \(\hat{\gamma }\) values implying a simultaneous confidence level very close to the nominal \(1-\alpha \).
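Since the multivariate recursion is involved, we sketch here the analogous one-sample computation of \(\Pr (T(\gamma ))\), which uses the binomial distribution, paired with a bounded Brent-style search (a hypothetical illustration of ours, not the paper's implementation):

```python
import numpy as np
from scipy.stats import binom
from scipy.optimize import minimize_scalar

def coverage(gamma, N, K):
    """Pr(T(gamma)): probability that a uniform-sample ECDF trajectory stays
    within the pointwise binomial intervals of level 1 - gamma."""
    z = np.arange(1, K + 1) / K
    lo = binom.ppf(gamma / 2, N, z).astype(int)       # rank interior I_i(gamma)
    hi = binom.ppf(1 - gamma / 2, N, z).astype(int)
    probs = {0: 1.0}                                  # r_0 = 0 at z_0 = 0
    z_prev = 0.0
    for i in range(K):
        step = (z[i] - z_prev) / (1.0 - z_prev)       # Markov transition probability
        new = {}
        for m, p in probs.items():                    # r_i = m, inside the bands so far
            ks = np.arange(max(lo[i], m), hi[i] + 1)  # admissible r_{i+1} = k
            pmf = binom.pmf(ks - m, N - m, step)      # growth is binomial
            for k, q in zip(ks, pmf):
                new[k] = new.get(k, 0.0) + p * q
        probs, z_prev = new, z[i]
    return sum(probs.values())

def adjust_gamma(N, K, alpha=0.05):
    """Pick gamma whose simultaneous coverage is closest to 1 - alpha."""
    res = minimize_scalar(lambda g: abs(coverage(g, N, K) - (1 - alpha)),
                          bounds=(alpha / 1000, alpha), method="bounded")
    return res.x
```

Because \(\Pr (T(\gamma ))\) is a monotonically decreasing step function of \(\gamma \), the bounded search settles on the plateau whose coverage is nearest the nominal level.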
Unfortunately, evaluating Eq. (37) suffers from combinatorial explosion, as the \(R_i\) are L-dimensional sets constrained only by Equation (26), and as \(\Pr (r_{i+1} = k {{\,\mathrm{\;|\;}\,}}r_{i} = m)\) has to be computed for all combinations of elements \(k \in I_{i+1}\) and \(m \in I_{i}\) at each point i. Several measures can be taken to reduce the complexity of the computation. First, the ranks of one of the L chains are redundant, as they follow deterministically from Equation (26) based on the ranks of the other \(L-1\) chains. This implies in particular that the 2-chain case has the same computational complexity as the one-sample case, as only one of the two chains needs to be evaluated. Second, due to the a priori symmetry of the chains, we can, without loss of generality, assume at the first non-zero quantile \(z_1\) that the elements \(r_{1l}\) of \(r_1\) are ordered such that \(r_{11} \le r_{12} \le \ldots \le r_{1L}\). This reduces the number of trajectories to be evaluated by a factor of \(L (L + 1) / 2\). Still, even with these measures in place, computation scales badly with L, and the simulation-based method, which scales almost linearly, or grid-based interpolation from pre-computed values, is faster for larger numbers of chains.
4 Numerical experiments and power analysis
In this section, we provide insights into how the plots produced by our proposed methods should be interpreted. In each of the following cases, we link together the histogram, ECDF plot, and the ECDF difference plot. The code for the experiments and plots is available at https://github.com/TeemuSailynoja/simultaneous-confidence-bands.
4.1 Uniformity of a single sample
We begin by providing two examples connecting the shape of the histogram of the transformed sample to the characteristics of the corresponding ECDF and ECDF difference plots under basic discrepancies between the sample and the comparison distribution. After this, we illustrate an application of our method as part of a workflow to detect issues in model implementation or in the computation of the posterior distribution. Lastly, we provide a power analysis comparing the performance of our proposed method to existing state-of-the-art tests for uniformity.
With the exception of the power analysis tests in Sect. 4.1.4, where the samples are drawn directly from a continuous uniform distribution, the samples in the following examples are transformed to the unit interval from their respective sampling distributions through empirical PIT and are tested against the hypothesis of discrete uniformity.
4.1.1 Effect of difference in sample mean
To observe the typical characteristics of a sample with a mean different from that of the comparison distribution, we draw \(y=y_1,\dots ,y_N\sim {{\,\mathrm{\mathrm {normal}}\,}}(0.25,1)\) and N independent comparison samples \(x^i=x^i_1,\dots ,x^i_N\sim {{\,\mathrm{\mathrm {normal}}\,}}(0,1)\) with \(N=250\). We then test whether y is standard normal distributed by transforming the sampled values to the unit interval through empirical PIT. Figure 6a shows the histogram of the transformed sample exhibiting a higher-than-expected mean. As seen in the figure, a shift in the sample mean leads to the histogram being slanted towards the direction of the shift.
The ECDF plot in Fig. 6b shows this shift through the ECDF of the PIT values remaining under the theoretical CDF, which is also seen in the ECDF difference plot in Fig. 6c.
If the sample in question would instead have a mean lower than expected, the histogram would be slanted to the left and the behaviour of the resulting ECDF plot and ECDF difference plot would be reversed. That is, the ECDF plot would stay above the theoretical CDF as a higher-than-expected density is covered at low fractional ranks and the ECDF difference plot would, respectively, show a \(\cap \)-shape above the zero level.
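The empirical PIT construction used in this example can be sketched as follows (variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 250
y = rng.normal(0.25, 1, size=N)          # sample with a shifted-up mean
# empirical PIT: fraction of the i-th comparison sample not exceeding y_i
pit = np.array([(rng.normal(0, 1, size=N) <= yi).mean() for yi in y])
# with a shifted-up mean, the PIT values pile up near 1, so their ECDF
# stays below the diagonal and the difference plot dips below zero
```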
4.1.2 Effect of difference in sample variance
Next, we investigate an example where the sample has a higher-than-expected variance. To this end, we draw \(y = y_1,\dots ,y_N \sim {{\,\mathrm{\mathrm {normal}}\,}}(0,1.25)\) and for each \(y_i\) a standard normal comparison sample \(x^i = x^i_1,\dots ,x^i_N\sim {{\,\mathrm{\mathrm {normal}}\,}}(0,1)\) with \(N=250\). Figure 7a shows the histogram of the empirical PIT values. In general, a larger-than-expected variance leads to a \(\cup \)-shaped histogram, and one can indeed see some of the histogram bins breaching the 95% confidence bounds.
In the ECDF plot shown in Fig. 7b, the larger-than-expected variance leads to faster-than-expected growth near the edges and slower-than-expected growth in the middle.
The shape is more clearly seen in the ECDF difference plot in Fig. 7c depicting the difference between the ECDF and the theoretical CDF.
If the sample would instead present a variance lower than expected, the histogram would be \(\cap \)-shaped and the behaviour of the resulting ECDF plot and ECDF difference plot would be reversed.
In the ECDF plot, this is shown as faster increase near the middle.
In general, the ECDF difference plot is decreasing when a smaller-than-expected density of samples is covered, and correspondingly increases when covering a higher-than-expected density.
4.1.3 Simulation-based calibration: eight schools
The eight schools (Gelman et al. 2013) is a classic hierarchical model example. The training course effects \(\theta _j\) in the eight schools are modelled using a hierarchical varying intercept model.
If the model is constructed with the centred parameterization, the posterior distribution exhibits a funnel shape contracting to a region of high curvature near the population mean \(\mu \) when sampled with small values of the population standard deviation \(\tau \). This property makes exploring the distribution of \(\tau \) difficult for many MCMC methods. The centred parameterization \((\mu , \sigma , \mu _0, \tau )\) of the problem is as follows:
Cook et al. (2006) proposed a simulation-based calibration (SBC) method for validating Bayesian inference software. The idea is based on the fact that we can factor the joint distribution of data y and parameters \(\theta \) in two ways
By considering \(\theta '\) and \(\theta ''\) the joint distribution is
and it is easy to see that \(\theta '\) and \(\theta ''\) have the same distribution conditionally on y. If we write the joint distribution in an alternative way
\(\theta '\) and \(\theta ''\) still have the same distribution conditionally on y. We can sample from the joint distribution \(\pi (y,\theta ',\theta '')\) by first sampling from \(\pi (\theta ')\) and \(\pi (y|\theta ')\), which is usually easy for generative models. The last step is to sample from the conditional \(\pi (\theta ''|y)\), which is usually non-trivial; instead, for example, a Markov chain Monte Carlo algorithm is used. We can validate the algorithm and its implementation used to sample from \(\pi (\theta ''|y)\) by checking that the samples obtained have the same distribution as \(\theta '\) (conditionally on y).
Cook et al. (2006) operationalize the approach by drawing \(\theta '_i\) from \(\pi (\theta ')\), generating data \(y_i \sim \pi (y_i|\theta '_i)\), and then using the algorithm to be validated to draw a sample \(\theta ''_1,\ldots ,\theta ''_S \sim \pi (\theta ''|y_i)\). If the algorithm and its implementation are correct, then \(\theta '_i,\theta ''_1,\ldots ,\theta ''_S\) conditional on \(y_i\) are draws from the same distribution. Cook et al. (2006) propose to compute empirical PIT values for \(\theta '_i\), which they show to be uniformly distributed as \(S \rightarrow \infty \). The process is repeated for \(i=1,\ldots ,N\), and the N empirical PIT values are used for testing. Cook et al. (2006) propose to use a \(\chi ^2\)-test on the inverse normal CDF of the empirical PIT values. However, with finite S this approach does not correctly take into account the discreteness or the effect of a correlated sample from a Markov chain (Gelman 2017).
By thinning \(\theta _1^{''},\ldots ,\theta _S^{''}\) to be approximately independent, the uniformity of empirical PIT values can be tested with the approach presented in this paper. See Appendix A for more on thinning.
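The SBC ranking step can be sketched with a toy conjugate model in which exact i.i.d. posterior draws are available (an illustration of ours, not the eight schools model):

```python
import numpy as np

rng = np.random.default_rng(0)

def sbc_ranks(prior_draws, posterior_samples):
    """Rank of each prior draw theta'_i among its S (thinned, approximately
    independent) posterior draws; uniform on {0, ..., S} if inference is correct."""
    return np.array([(post < prior).sum()
                     for prior, post in zip(prior_draws, posterior_samples)])

# toy conjugate model: theta ~ normal(0, 1), y | theta ~ normal(theta, 1)
# => theta | y ~ normal(y / 2, sqrt(1 / 2)), sampled here exactly
Nsim, S = 1000, 100
theta_prior = rng.normal(size=Nsim)
y = rng.normal(theta_prior)
theta_post = rng.normal(y[:, None] / 2, np.sqrt(0.5), size=(Nsim, S))
ranks = sbc_ranks(theta_prior, theta_post)
```

The resulting ranks can then be fed to the uniformity test of this paper.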
Figure 8 shows the histogram and ECDF plots of 500 prior draws of the population standard deviation \(\tau \), each ranked based on a thinned posterior sample of 150 draws obtained from a chain of 3000 draws. The graphical test rejects the hypothesis of the prior draws being uniform; moreover, the ECDF plots show that the prior draws of \(\tau \), ranked in relation to the posterior samples obtained from the centred parameterization of the eight schools model, are skewed towards small ranks. This suggests that the MCMC is not sampling correctly from the target distribution (which in this case is known to be caused by the sampler's inability to reach the narrow funnel part of the posterior).
In Sect. 4.2.4, we will return to the eight schools model by providing further analysis on the convergence of individual chains in the centred parameterization case and illustrating how our method can be used to detect these convergence issues.
4.1.4 Power analysis
As our primary focus is on providing a graphical uniformity test, which gives the user useful information regarding the possible deviations from uniformity, we want to also ensure that the overall performance of our test is, if not the best, competitive with tests aimed at accurately detecting specific deviations from uniformity. To this end, we compare the sensitivity of our method with existing tests for uniformity by considering the rejection rate of samples drawn from a uniform distribution and then transformed according to the following three transformation families that Marhuenda et al. (2005) use in their article comparing various tests for uniformity:
As Marhuenda et al. (2005) offer an extensive comparison of tests, we limit our comparison to the test specifically recommended to target each of the transformation families in addition to the widely known Kolmogorov–Smirnov test. For each of the test statistics, a critical value is calculated and samples exceeding that value are rejected.
For transformation family A, the recommended test is the mean distance of the ith value of the ordered sample \(u_{(i)}\) from the expected value \(i/(N+1)\):
For family B, the smooth goodness-of-fit test, \(N_h\), introduced by Neyman (1937) is recommended with the dimension h chosen according to the method recommended by Ledwina (1994) resulting in the test statistic \(N_S\), which also has the best overall performance across the transformation families. The test recommended for transformation family C is the statistic recommended by Watson (1961),
where \(\bar{u}\) is the mean of the \(u_i\) and \(W^2\) is the Cramér–von Mises statistic,
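Both statistics follow from standard sample formulas; a sketch (the helper name is our own):

```python
import numpy as np

def watson_u2(u):
    """Watson's U^2 from the Cramer-von Mises W^2 (standard sample formulas)."""
    u = np.sort(np.asarray(u, dtype=float))
    n = len(u)
    # W^2 = sum_i (u_(i) - (2i - 1) / (2n))^2 + 1 / (12 n)
    w2 = np.sum((u - (2.0 * np.arange(1, n + 1) - 1.0) / (2.0 * n)) ** 2) + 1.0 / (12.0 * n)
    # U^2 = W^2 - n * (u_bar - 1/2)^2
    return w2 - n * (u.mean() - 0.5) ** 2
```

For a perfectly evenly spaced sample \(u_{(i)} = (2i-1)/(2n)\), the statistic attains its minimum value \(1/(12n)\).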
The rejection rates of these tests and our test through simultaneous confidence bands are shown in Fig. 9 for families A, B, and C with sample size \(N=100\) and k varying between 0.20 and 3.00. For each value of k, the rejection rate among 100,000 samples was computed. As seen from these results, the proposed ECDF simultaneous confidence band method performs in a manner similar to the recommended tests with the exception of family C, where our method exhibits a lower rejection rate compared to some of the other tests.
4.2 Comparing multiple samples
When testing if two or more samples are produced from the same underlying distribution, we can compare the ranks of each sample relative to the sample obtained by combining all the samples in the comparison. As mentioned in Sect. 3, we need to adjust the confidence bands to take into account the dependency of the ranks of the values of one sample on the values in other samples in the comparison.
When using the multiple sample test for MCMC convergence diagnostics, we recommend first using existing numerical convergence statistics, such as the \(\widehat{R}\) by Vehtari et al. (2021) or the \(R^*\) by Lambert and Vehtari (2021) which can assess the convergence of all model parameters jointly and can indicate which parameters have possible convergence issues. In the case that these statistics indicate possible issues, further insight into the nature of these deviations can be obtained with the ECDF plots of fractional ranks.
4.2.1 Effect of difference in means and variances
We first compare two cases of MCMC sampling with four chains containing 250 independent draws, which is enough to reliably estimate the variances and autocorrelations required for \(\widehat{R}\) and effective sample size (ESS) as long as the rank-normalized ESS of the sample exceeds 400 (Vehtari et al. 2021), which is the case as the draws are independent. In each case, chains 2 to 4 were sampled from a \({{\,\mathrm{\mathrm {normal}}\,}}(0,1)\) distribution. In the first case, chain 1 is sampled with a larger mean than the other chains, \({{\,\mathrm{\mathrm {normal}}\,}}(0.5, 1)\). In the second case, chain 1 is sampled with a larger variance, \({{\,\mathrm{\mathrm {normal}}\,}}(0,1.5)\).
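The fractional ranks and per-chain ECDFs underlying the plots discussed below can be computed as in this sketch (variable names are our own; chain 1 is given the shifted mean of the first case):

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
L, N = 4, 250
chains = rng.normal(0.0, 1.0, size=(L, N))
chains[0] = rng.normal(0.5, 1.0, size=N)       # chain 1 with a shifted mean
# fractional ranks of every draw within the pooled sample of all L chains
frac = rankdata(chains, axis=None).reshape(L, N) / (L * N)
# per-chain ECDF of the fractional ranks at K quantiles, and the difference plot
K = 50
z = np.arange(1, K + 1) / K
ecdf = (frac[:, :, None] <= z).mean(axis=1)    # shape (L, K)
ecdf_diff = ecdf - z                           # deviation from uniformity
```

Since the pooled ranks are shared, the per-chain deviations sum to zero at each quantile: a chain with a shifted-up mean dips below the diagonal while the remaining chains compensate above it.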
Rank plots for the first case with one chain having a larger mean are shown in Fig. 10a–d. Even though the difference in the sampling distribution of chain 1 can be seen in the histograms with 50 bins, this effect is more clearly represented in the ECDF difference plot in Fig. 10f, where chain 1 shows the shape familiar from Sect. 4.1.1 and chains 2 to 4 show a reverse shape, indicating similar behaviour between these three chains.
Similar remarks regarding the behaviour of the chains can be made from the ECDF plot in Fig. 10e, but the wider dynamic range of the ECDF difference plot in Fig. 10f makes the difference in the behaviour of the chains clearer.
In the second case, where chain 1 is sampled with a higher variance, we can see a \(\cup \)-shape in the rank plot of chain 1 in Fig. 11a, but the behaviour stands out more clearly in the ECDF difference plot in Fig. 11f.
When compared to commonly used convergence diagnostics, which do not offer graphical insight into the nature of the possible underlying problems, both the classical \(\widehat{R}\) diagnostic by Gelman and Rubin (1992) and the improved \(\widehat{R}\) diagnostic proposed by Vehtari et al. (2021) indicate convergence issues, giving estimated \(\widehat{R}\) values of 1.05 and 1.04, respectively, in both the mean and the variance related examples above. Vehtari et al. (2021) suggest that \(\widehat{R} > 1.01\) is an indication of potential convergence issues or too short chains.
4.2.2 Test performance under common deviations
To evaluate the performance of the multiple sample comparison test under a set of common deviations, one of the samples was transformed according to the three transformation families defined in Eq. (45). In the analysis, 2, 4, and 8 chains of length 100 were simulated from U(0, 1), after which one of the chains was transformed according to the transformations \(f_{A,k}\), \(f_{B,k}\), and \(f_{C,k}\). The rejection rates of the multiple sample comparison test when varying the power, k, of the transformation were estimated from 10,000 simulations and are recorded in Fig. 12. The observed test performance is independent of the number of chains used in the sample comparison. When compared to the rejection rates observed in the single sample power analysis in Sect. 4.1.4, the rejection rates show that the test sensitivity depends in a similar way on the transformation.
4.2.3 Chains with autocorrelation
As samples generated by MCMC methods are typically autocorrelated, it is essential to analyse the performance of the sample comparison test on autocorrelated samples. In Fig. 13, we present rejection rates of the multiple sample test for 2, 4, and 8 chains produced by autoregressive models of order 1 (i.e. AR(1) models) with varying AR-parameter values. Each rejection rate is computed as the mean of 100,000 simulations. As seen in the figure, the higher the autocorrelation in the samples and the more chains are sampled, the more likely the test is to reject the hypothesis that the samples are drawn from the same underlying distribution. Thus, before using the graphical illustration or the corresponding test, the chains should be thinned to have negligible autocorrelation. The same holds for other common uniformity tests as well, as they rely on the assumption of pairwise independence of the draws.
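An AR(1) chain with standard normal marginal distribution, as used in this experiment, can be generated as follows (a sketch with our own function name):

```python
import numpy as np

def ar1_chain(n, phi, rng):
    """Stationary AR(1) draws with standard normal marginal distribution."""
    x = np.empty(n)
    x[0] = rng.normal()                   # start from the stationary marginal
    sd = np.sqrt(1.0 - phi ** 2)          # innovation scale keeps Var(x_t) = 1
    for t in range(1, n):
        x[t] = phi * x[t - 1] + sd * rng.normal()
    return x

chain = ar1_chain(20000, 0.8, np.random.default_rng(0))
```

Because the marginal variance is held at one, only the dependence structure, not the marginal distribution, differs between AR-parameter settings.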
4.2.4 Detecting model sampling issues: eight schools
We return to the eight schools model used to demonstrate SBC in Sect. 4.1.3. The issues detected with SBC earlier are apparent when multiple sample comparison is used to inspect the rank distribution between the four individual chains, each containing 1000 posterior draws after a warm-up period of 1000 steps. Even when sampled with more conservative settings of the sampler, we see from Fig. 14 that the chains are not properly exploring the posterior and thus the realized rank transformed chains have clearly different ECDFs.
While the classical \(\widehat{R}\) is estimated at 1.00, the improved \(\widehat{R}\) diagnostic gives a value of 1.02, indicating possible convergence issues. One should also note that the sampling efficiency for \(\tau \) in the model is very low, as both the bulk-ESS and the tail-ESS by Vehtari et al. (2021) are under 150 for the combined sample.
As recommended in Sect. 22.7 of the Stan User’s Guide (Stan Development Team, n.d.), these observed sampling issues of a hierarchical model with weak likelihood contribution can often be avoided by using the non-centred parameterization \((\tilde{\theta }, \mu , \tau , \sigma )\) of the model:
In the above parameterization, the treatment effect \(\theta _j\) is derived deterministically from the other parameter values, and instead \(\tilde{\theta }_j\) is sampled. To keep the models comparable, we use the same conservative sampling options for the non-centred model, although this is not required to obtain well-mixing chains. In Fig. 15, we see an improvement in the sampling compared to the centred parameterization, as the sample ranks are distributed approximately uniformly among the four chains, implying that the chains are mixing well.
Now both of the \(\widehat{R}\) diagnostics agree with the graphical test on convergence, yielding values close to 1.00, and the sampling efficiency issues detected in the centred parameterization model have disappeared, with bulk-ESS and tail-ESS reaching 2200 and 1600, respectively.
5 Discussion
By providing a graphical test for uniformity and comparison of samples, we offer an accessible tool to be used in many parts of practical statistical workflow.
For assessing the uniformity of a single sample, we recommend the optimization-based adjustment method, as it is efficient even for large sample sizes. For comparing multiple samples, the simulation-based method is likely to be computationally more efficient than the optimization-based method. To speed up the computations, we recommend pre-computing adjusted \(\gamma \) values for a set of sample sizes and numbers of samples (chains) and then interpolating (in log-log space) the adjustment as needed.
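The recommended log-log interpolation can be sketched as follows (the grid values shown are made up for illustration):

```python
import numpy as np

def interpolate_gamma(n, n_grid, gamma_grid):
    """Interpolate a pre-computed gamma adjustment in log-log space."""
    return float(np.exp(np.interp(np.log(n), np.log(n_grid), np.log(gamma_grid))))

n_grid = np.array([100, 1000, 10000])          # pre-computed sample sizes
gamma_grid = np.array([0.01, 0.005, 0.002])    # hypothetical adjusted gamma values
```

Interpolating in log-log space works well here because the adjustment varies smoothly and slowly over orders of magnitude of the sample size.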
In the examples, we used empirical PIT with SBC, where the uniformity is expected by construction if the inference algorithm works correctly. PIT has also been used to compare predictive distributions. Specifically, in the LOO-PIT approach, PIT has been used to compare leave-one-out (LOO) cross-validation predictive distributions to the observations (e.g. Gneiting et al. 2007; Czado et al. 2009). Although the graphical LOO-PIT test is useful for visualization of model–data discrepancy, exact uniformity of LOO-PIT values can be expected only asymptotically given the true model. For example, if the data comes from a normal distribution and is modelled with a normal distribution with unknown mean and scale, the posterior predictive distribution is a Student’s t distribution that approaches normal only asymptotically. Thus use of graphical LOO-PIT tests needs further research.
We have assumed that distributions g and p are continuous and only the fractional rank statistics \(u_i\) from Eq. (2) are discrete. Our proposed methods do not work directly if g and p are discrete, as values obtained through PIT are no longer uniform. Also, in the multiple sample comparison case, the rank statistics are no longer mutually distinct as ties are possible. A potential approach to handling discrete g and p is to use randomized or non-randomized modifications of PIT values for discrete distributions, as discussed by Czado et al. (2009). However, developing proven and efficient algorithms for this purpose requires further work, which is left for future research.
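The randomized PIT of Czado et al. (2009) mentioned above can be sketched as follows; under the true discrete model the resulting values are exactly standard uniform (the example model and names are our own):

```python
import numpy as np
from scipy.stats import poisson

def randomized_pit(y, cdf, rng):
    """Randomized PIT for a discrete predictive distribution (Czado et al. 2009):
    draw uniformly between F(y - 1) and F(y)."""
    lo = cdf(y - 1)                        # = 0 for y at the lower support bound
    hi = cdf(y)
    return lo + (hi - lo) * rng.uniform(size=np.shape(y))

rng = np.random.default_rng(0)
y = rng.poisson(4.0, size=20000)           # data from the (assumed true) model
u = randomized_pit(y, poisson(4.0).cdf, rng)
```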
References
Aldor-Noiman, S., Brown, L.D., Buja, A., Rolke, W., Stine, R.A.: The power to see: a new graphical test of normality. Am. Stat. 67(4), 249–260 (2013). https://doi.org/10.1080/00031305.2013.847865
Arnold, B.C., Balakrishnan, N., Nagaraja, H.N.: A First Course in Order Statistics (Classics in Applied Mathematics). Society for Industrial and Applied Mathematics, USA (2008)
Brent, R.: An algorithm with guaranteed convergence for finding the minimum of a function of one variable. In: Algorithms for Minimization Without Derivatives, pp. 61–80. Prentice-Hall, Englewood Cliffs, NJ (1973)
Brown, L.D., Cai, T.T., DasGupta, A.: Interval estimation for a binomial proportion. Stat. Sci. 16(2), 101–133 (2001). https://doi.org/10.1214/ss/1009213286
Cook, S.R., Gelman, A., Rubin, D.B.: Validation of software for Bayesian models using posterior quantiles. J. Comput. Graph. Stat. 15(3), 675–692 (2006)
Czado, C., Gneiting, T., Held, L.: Predictive model assessment for count data. Biometrics 65(4), 1254–1261 (2009). https://doi.org/10.1111/j.1541-0420.2009.01191.x
D’Agostino, R.B., Stephens, M.A.: Goodness-of-Fit Techniques. Marcel Dekker Inc, USA (1986)
Gelman, A.: Correction to Cook, Gelman, and Rubin (2006). J. Comput. Graph. Stat. 26(4), 940–940 (2017). https://doi.org/10.1080/10618600.2017.1377082
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis, 3rd edn. Chapman and Hall/CRC, Boca Raton (2013). https://doi.org/10.1201/b16018
Gelman, A., Rubin, D.B.: Inference from iterative simulation using multiple sequences. Stat. Sci. 7(4), 457–472 (1992)
Gelman, A., Vehtari, A., Simpson, D., Margossian, C.C., Carpenter, B., Yao, Y., Kennedy, L., Gabry, J., Bürkner, P.C., Modrák, M. (2020). Bayesian workflow. arXiv preprint arXiv:2011.01808
Gneiting, T., Balabdaoui, F., Raftery, A.E.: Probabilistic forecasts, calibration and sharpness. J. Royal Stat. Soc. Series B (Stat. Methodol.) 69(2), 243–268 (2007). https://doi.org/10.1111/j.1467-9868.2007.00587.x
Kolmogorov, A.: Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari, Giorn. 4, 83–91 (1933)
Lambert, B., Vehtari, A.: \(R^{*}\): A robust MCMC convergence diagnostic with uncertainty using decision tree classifiers. Bayesian Anal. https://doi.org/10.1214/20-BA1252
Ledwina, T.: Data-driven version of Neyman’s smooth test of fit. J. Am. Stat. Assoc. 89(427), 1000–1005 (1994). https://doi.org/10.1080/01621459.1994.10476834
Marhuenda, Y., Morales, D., Pardo, M.C.: A comparison of uniformity tests. Statistics 39(4), 315–327 (2005). https://doi.org/10.1080/02331880500178562
Massey, F.J., Jr.: The Kolmogorov-Smirnov test for goodness of fit. J. Am. Stat. Assoc. 46(253), 68–78 (1951)
Neyman, J.: Smooth test for goodness of fit. Scand. Actuar. J. 1937(3–4), 149–199 (1937). https://doi.org/10.1080/03461238.1937.10404821
Stan Development Team. (n.d.). Stan user’s guide. Retrieved September 29, 2020, from https://mc-stan.org/docs/2_24/stan-users-guide/index.html
Talts, S., Betancourt, M., Simpson, D., Vehtari, A., & Gelman, A. (2020). Validating Bayesian inference algorithms with simulation-based calibration. arXiv preprint arXiv:1804.06788v2
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., Bürkner, P.C.: Rank-normalization, folding, and localization: an improved \(\widehat{R}\) for assessing convergence of MCMC (with discussion). Bayesian Anal. 16(2), 667–718 (2021). https://doi.org/10.1214/20-BA1221
Watson, G.S.: Goodness-of-fit tests on a circle. Biometrika 48(1–2), 109–114 (1961). https://doi.org/10.1093/biomet/48.1-2.109
Acknowledgements
We thank the Academy of Finland (grant 298742), the Finnish Center for Artificial Intelligence, and the Technology Industries of Finland Centennial Foundation (grant 70007503; Artificial Intelligence for Research and Development) for partial support of this research. We also acknowledge the computational resources provided by the Aalto Science-IT project.
Appendix A: Autocorrelated samples
In this appendix, we highlight the effect autocorrelated draws have when they are used to estimate the extreme rank statistics of the target distribution. Accounting for autocorrelation is important when inspecting the distribution of order statistics, including the PIT values in Sect. 2 or the between chain fractional ranks in Sect. 3.
Given finite variance, the central limit theorem holds also for correlated samples, and many useful expectations can be estimated with the desired accuracy by increasing the sample size. However, the bias in extreme order statistics can be non-negligible. This manifests in the expected values of the smallest and largest order statistics of an autocorrelated sample being less extreme than expected. This phenomenon is demonstrated with AR(1) processes in Fig. 16, which shows expected values of the 100 smallest order statistics computed from a sample of length 1000. The bias is smaller for less extreme order statistics, and, for example, estimates of \(p(x<-1.5)\) or the 10% quantile in this case are likely to have negligible bias. In the uniformity test, extreme PIT estimates can have non-negligible bias, increasing the probability that the ECDF steps outside the simultaneous confidence band.
A standard approach to reducing sample autocorrelation is to thin the sampled chains by keeping only every T-th value in the sample. Below, we compare three thinning strategies. The first is the traditional approach with \(T = S / \mathrm {ESS}\), where ESS is computed for the posterior mean (without rank normalization) (Vehtari et al. 2021). The second is an approach recommended by Talts et al. (2020), where the above ESS is computed for estimating the ECDF, \(P(y < y^*)\), where \(y^*\) are empirical quantiles of the sample y. The authors recommend using the 19 quantiles \((0.05,0.1,\ldots ,0.95)\) and thinning the sample based on the ESS that results in the largest thinning factor. This method is targeted to address differences in sampling efficiency between the distribution quantiles. The third method, which we introduce, calculates the tail-ESS and bulk-ESS as defined by Vehtari et al. (2021) and picks the one resulting in the stricter thinning. This method aims to address possible differences in sampling efficiency between the central \(90\%\) of the distribution and the two \(5\%\) tails. The R package ‘posterior’ was used for all ESS computations.
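As a self-contained illustration of ESS-based thinning (the crude truncated-autocorrelation ESS estimator below is our own stand-in, not the algorithm of the 'posterior' package):

```python
import numpy as np

def crude_ess(x):
    """Rough ESS: n / (1 + 2 * sum of leading positive autocorrelations).
    A simplified stand-in for the estimators in the R package 'posterior'."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    c = np.correlate(xc, xc, mode="full")[n - 1:]   # autocovariance sums by lag
    acf = c / c[0]
    rho_sum = 0.0
    for k in range(1, n):
        if acf[k] <= 0:                             # truncate at first non-positive lag
            break
        rho_sum += acf[k]
    return n / (1.0 + 2.0 * rho_sum)

def thin_by_ess(x, ess):
    """Keep every T-th draw with T = ceil(S / ESS)."""
    T = int(np.ceil(len(x) / ess))
    return x[::T]

# demo on a strongly autocorrelated AR(1) chain (phi = 0.9)
rng = np.random.default_rng(1)
n, phi = 5000, 0.9
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = phi * x[t - 1] + np.sqrt(1 - phi ** 2) * rng.normal()
ess = crude_ess(x)
thinned = thin_by_ess(x, ess)
```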
In Fig. 16, we additionally show the first 100 order statistics of the standard normal distribution compared to AR processes thinned according to the tail-ESS, as our focus here is on the tails of the distribution. In order to arrive at a thinned sample of equal length, an expected tail-ESS was obtained by averaging over 10,000 simulations, and the sample length was chosen accordingly to yield thinned samples of length 1000. After the thinning, the order statistics closely match those of draws made independently from the standard normal distribution.
We inspected how the three above-mentioned thinning strategies manage to reduce the autocorrelation, to which, as shown in Sect. 4.2.3, the ECDF-based test is sensitive. In this experiment, 1, 2, and 4 chains of length 1000 were drawn from the AR(1) process with varying values of the AR parameter \(\phi \). The results of this experiment are displayed in Fig. 17: all three methods produce very similar thinning recommendations, and thus also test results, reducing the rejection rate to near the desired \(5\%\).
If, after using some default thinning approach, there are still many extreme PIT estimates, it is possible that there is still substantial autocorrelation in the sample, and a more careful investigation of the remaining autocorrelation is warranted. There is a trade-off between the computation time and how accurately the behaviour of the extreme tails needs to be examined. Often the major issues can be seen with less accurate computation, and a natural workflow can include iterative refinement of the diagnostic accuracy.
Although thinning may be needed for the uniformity test as part of SBC or posterior predictive checks (PPC), when estimating quantities of interest that are not related to extreme tails, better efficiency is obtained by using all the posterior draws.
Säilynoja, T., Bürkner, PC. & Vehtari, A. Graphical test for discrete uniformity and its applications in goodness-of-fit evaluation and multiple sample comparison. Stat Comput 32, 32 (2022). https://doi.org/10.1007/s11222-022-10090-6