Graphical Test for Discrete Uniformity and its Applications in Goodness of Fit Evaluation and Multiple Sample Comparison

Assessing goodness of fit to a given distribution plays an important role in computational statistics. The probability integral transformation (PIT) can be used to convert the question of whether a given sample originates from a reference distribution into a problem of testing for uniformity. We present new simulation- and optimization-based methods to obtain simultaneous confidence bands for the whole empirical cumulative distribution function (ECDF) of the PIT values under the assumption of uniformity. Simultaneous confidence bands consist of confidence intervals at each point that jointly attain a desired coverage. These methods can also be applied in cases where the reference distribution is represented only by a finite sample. The confidence bands provide an intuitive ECDF-based graphical test for uniformity, which also yields useful information on the nature of the discrepancy. We further extend the simulation and optimization methods to determine simultaneous confidence bands for testing whether multiple samples come from the same underlying distribution. This multiple-sample comparison test is especially useful in Markov chain Monte Carlo convergence diagnostics. We provide numerical experiments to assess the properties of the tests using both simulated and real-world data and give recommendations on their practical application in computational statistics workflows.


Introduction
Tests for uniformity play an essential role in computational statistics when estimating goodness of fit to a given distribution (Marhuenda et al. 2005). This is because, even when the distribution of interest is not uniform, there are methods to reduce the problem into testing for uniformity by transforming a sample from the given distribution to a (discrete or continuous) uniform distribution. Common use cases in the Bayesian workflow are simulation-based calibration and Markov chain Monte Carlo convergence diagnostics, which we also use as examples in this paper. A graphical test can provide additional insight into the nature of the discrepancy that goes beyond the dichotomy of a uniformity test.

Probability integral transformation
Transforming sampled values to a uniform distribution is usually achieved via the probability integral transform (PIT), provided the distribution of interest has a tractable cumulative distribution function (CDF) (D'Agostino and Stephens 1986). Let y_1, . . . , y_N ∼ g(y) be an independent sample from an unknown continuous distribution with probability density function (PDF) g. We want to know whether g = p, where p is the PDF of a known distribution with a tractable CDF. The PIT of the sampled value y_i with respect to p is

u_i = ∫_{−∞}^{y_i} p(t) dt.   (1)

[Figure 1: (a) The pointwise confidence intervals (red bars) for the ordered statistic u_(i) are beta distributed, and the simultaneous confidence intervals for the ECDF of u_i are given by Aldor-Noiman et al. (2013). (b) If one can draw an independent sample x_i1, . . . , x_iS ∼ p(x) for each y_i as in Eq. (2), the empirical PIT values u_i are discrete and, given g = p, uniformly distributed; the pointwise confidence intervals of the discrete ordered statistic u_(i) can be solved from Eq. (3). (c) The pointwise confidence intervals for the values of the ECDF of u_i at evaluation points z_i ∈ [0, 1] are binomially distributed, and the simultaneous confidence intervals are obtained by the method presented in this paper.]
If g = p, the transformed values u_i are continuously, independently, and uniformly distributed on the unit interval [0, 1], reducing the evaluation of the hypothesis to testing for uniformity of the transformed sample u_1, . . . , u_N. If the integral (1) does not have a closed form, the CDF (and hence the PIT values) can still be computed with sufficient accuracy through numerical integration (e.g., quadrature), provided at least the corresponding PDF is tractable.
If neither the CDF nor the PDF has a closed form, but a comparison sample of independent values x_i1, . . . , x_iS ∼ p(x) can be drawn separately for each y_i, the hypothesis g = p can be evaluated through the empirical PIT values

u_i = 1/S ∑_{s=1}^{S} I(x_is ≤ y_i),   (2)

where I is the indicator function. Now, given g = p, the transformed values u = u_1, . . . , u_N are independently distributed according to a discrete uniform distribution with S + 1 values (0, 1/S, . . . , (S − 1)/S, 1). Accordingly, we can still apply uniformity tests to assess g = p, only that this time we need to test for discrete uniformity.
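The empirical PIT of Eq. (2) is simple to compute; a minimal Python sketch (the function name is ours, not from the paper's accompanying code):

```python
def empirical_pit(y, x):
    """Empirical PIT values (Eq. 2): u_i is the fraction of comparison
    draws x[i][0..S-1] that are less than or equal to y[i].

    y -- list of N sampled values to be tested
    x -- list of N comparison samples, each a list of S draws from p
    Returns PIT values on the discrete grid 0, 1/S, ..., 1.
    """
    return [sum(xis <= yi for xis in xi) / len(xi) for yi, xi in zip(y, x)]
```

For example, with S = 4 comparison draws, `empirical_pit([0.5], [[0.1, 0.4, 0.6, 0.9]])` returns `[0.5]`.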
If either sample has dependencies, as in an autocorrelated sample from a Markov chain, the order statistics are affected by the dependencies and the empirical PIT values (2) are not distributed uniformly even if g = p (unless the sample size goes to infinity). For Markov chains, the usual remedy is to thin the chain to obtain an approximately independent sample. This issue is illustrated, and thinning recommendations are provided, in Appendix A.

Figure 1 shows an example of the empirical cumulative distribution function (ECDF) for u obtained through both Eq. (1) and Eq. (2). The figure also shows an example of a pointwise confidence interval for each ECDF. For the continuous integral of Eq. (1), the pointwise confidence interval can be computed from the distribution of the continuous uniform order statistics, which is a common beta distribution. For the discrete sum of Eq. (2), the pointwise confidence interval can be computed from the distribution of the discrete uniform order statistics, with the cumulative distribution function of the ith ordered statistic u_(i) given as

Pr(u_(i) ≤ z) = ∑_{k=i}^{N} C(N, k) z^k (1 − z)^{N−k},   (3)

for z ∈ {0, 1/S, . . . , (S − 1)/S, 1} (Arnold et al. 2008, Example 3.1). The corresponding pointwise intervals do not have a nice form in general and, more importantly, the discrete ordered statistics do not exhibit a Markovian structure (exploited by our new optimization-based approach) if there are possible ties in u (Arnold et al. 2008, Theorem 3.4.1).
To make the computation of the simultaneous confidence bands more straightforward and efficient, we propose an additional transformation: computing the ECDF of u at chosen evaluation points z_i,

F(z_i) = 1/N ∑_{j=1}^{N} I(u_j ≤ z_i).   (4)

We recommend choosing the z_i as the ordered fractional ranks r̃_(i) of the y_i, defined through

r̃_i = R(y_i)/N,   (5)

where R(y_i) is the rank of y_i within y_1, . . . , y_N. The ordered fractional ranks form a uniform partition of the unit interval independent of the distribution of the y_i. Thus, they provide an ECDF that is easier to interpret than the corresponding ECDF based directly on the original sample y_i. The resulting ECDF is illustrated in Figure 1(c). As we will show, useful properties of this ECDF are that 1) its pointwise confidence intervals can be computed easily from the binomial distribution, whose quantile function is already implemented in most widely used environments for statistical computing, and 2) the distribution of the ECDF trajectories is Markovian, which is exploited in Section 2.3.
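The evaluation of the ECDF at the chosen points and the uniform partition of evaluation points can be sketched as follows (helper names are ours; we assume the ordered fractional ranks reduce to i/N for i = 1, . . . , N):

```python
def ecdf_at(u, z):
    """ECDF of the sample u evaluated at the points z (Eq. 4)."""
    n = len(u)
    return [sum(uj <= zi for uj in u) / n for zi in z]

def ordered_fractional_ranks(n):
    """Ordered fractional ranks i/N for i = 1, ..., N: a uniform partition
    of the unit interval, independent of the distribution of the y_i."""
    return [i / n for i in range(1, n + 1)]
```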

Simultaneous confidence bands
The major challenge that arises when developing a uniformity test based on the ECDF is to obtain simultaneous confidence bands with the desired overall coverage. For this purpose, one needs to take into account the inter-dependency of the ECDF values and adjust the coverage parameter accordingly (we discuss this in more detail in Section 2). When considering whether a given ECDF could represent a sample from a uniform distribution, we need to jointly consider all pointwise uncertainties. For a set of evaluation points (z_i)_{i=1}^{K}, we provide lower and upper confidence bands, L_i and U_i respectively, that jointly satisfy

Pr(L_i ≤ F(z_i) ≤ U_i for all i = 1, . . . , K) = 1 − α,

where F(z_i) is the ECDF of a sample from either the standard uniform distribution or a discrete uniform distribution on the unit interval evaluated at z_i ∈ (0, 1), and 1 − α is the desired simultaneous confidence level. In addition to offering a numerical test for uniformity, the simultaneous confidence bands provide an intuitive graphical representation of possible discrepancies from uniformity. Aldor-Noiman et al. (2013) presented a simulation-based approach for computing simultaneous confidence bands for the ECDF of the transformed sample acquired from Eq. (1) under the assumption of uniformity. In this paper, we present a simulation method inspired by Aldor-Noiman et al. (2013) as well as a new, faster optimization method for computing simultaneous confidence bands under uniformity when the ECDF is computed from the empirical PIT values using Eqs. (2) and (4). Figure 2 contrasts the simultaneous confidence bands by Aldor-Noiman et al. (2013) against those obtained from our proposed method. Furthermore, we generalize our method and simultaneous confidence bands to test whether multiple samples originate from the same underlying distribution.

Related work
The idea of utilizing the ECDF to test uniformity is not new, but its potential has not yet been realized in full. For example, the well-known Kolmogorov-Smirnov (KS) test, first introduced by Kolmogorov (see, e.g., Massey Jr (1951); the original article, in Italian, is Kolmogorov (1933)), is based on evaluating the maximum deviation of the sample ECDF from the theoretical CDF of the distribution to be tested against. Unfortunately, the KS test is relatively insensitive to deviations in the tails of the distribution (Aldor-Noiman et al. 2013), and numerous tests have been proposed to replace it. An extensive comparison of more than thirty tests of uniformity of a single sample is provided by Marhuenda et al. (2005).
Due to its ease of interpretation and familiarity even to people with only basic statistical knowledge, a graphical method for assessing uniformity commonly used as part of many statistical workflows is plotting histograms. This can even be turned into a formal test of uniformity with confidence intervals for the individual bins (e.g., Talts et al. 2020). Drawbacks of histograms are that binning discards information, there can be binning artifacts depending on the choice of bin width and placement, and they ignore the dependency between bins. The proposed ECDF-based method does not require binning or smoothing, provides an intuitive visual interpretation, and works for both the continuous values of Eq. (1) and the discrete values of Eq. (2). An illustration and comparison of histograms with two binning choices and our new method is given in Figure 3. The visual range between the simultaneous confidence bands for the ECDF is often narrow when visualizing a sample with a large number of observations.

[Figure 2: (a) The method of Aldor-Noiman et al. (2013): beta distribution-based 95% simultaneous confidence bands for the quantiles reaching a set of ECDF values (along the x-axis). (b) Our method: for a set of evaluation quantiles, binomial distribution-based 95% simultaneous confidence intervals for the ECDF value (along the y-axis).]

[Figure 3: To assess uniformity of the sample, histograms (a) and (b) show a 95% confidence interval for each bin. Histograms can be sensitive to the number and placement of the bins selected, and the confidence intervals do not take into account possible inter-dependencies between the bin heights. For example, given the same sample, a 20-bin histogram stays within the confidence intervals (a), but a 50-bin histogram exceeds them (b). The ECDF plot (c) and ECDF difference plot (d), with 95% simultaneous confidence bands for the ECDF, both show the sample staying within the given limits, with the ECDF difference plot providing a more dynamic range for the visualization.]
Thus, to achieve a more dynamic range for the visualization, we recommend showing ECDF difference plots instead, as illustrated in Figure 3(d). The ECDF difference plot is obtained by subtracting the values of the expected theoretical CDF (i.e., the identity function on [0, 1] in the case of standard uniformity) from the observed ECDF values.
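The difference values themselves are a one-line computation; a minimal sketch (function name ours):

```python
def ecdf_difference(u, z):
    """ECDF difference plot values: ECDF(z_i) - z_i, i.e., the deviation of
    the observed ECDF from the CDF of the standard uniform distribution."""
    n = len(u)
    return [sum(uj <= zi for uj in u) / n - zi for zi in z]
```

A sample that is perfectly uniform over the evaluation points yields differences of zero, e.g. `ecdf_difference([0.25, 0.75], [0.5])` returns `[0.0]`.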

Summary of contributions
In this article, we focus on use case examples arising from inference validation and Markov chain Monte Carlo (MCMC) convergence diagnostics as part of a Bayesian workflow, but our methods are applicable more generally. Our use cases can be divided into two main categories: a single-sample test for uniformity, and a multiple-sample comparison testing the hypothesis that the samples are drawn from the same underlying (potentially non-uniform) distribution. We discuss both cases in more detail below.
We offer a graphical test for uniformity by providing simultaneous confidence bands for one or more ECDF trajectories obtained through the empirical probability integral transformation. As our first contribution, we modify an existing ECDF-based approach proposed by Aldor-Noiman et al. (2013) to take into account the discreteness of the fractional rank-based PIT values. This forms the basis for our proposed single and multi-sample tests.
As our second contribution, we provide both a simulation and an optimization method to determine the adjustment needed to achieve a desired simultaneous confidence level for the ECDF trajectory given the fractional rank-based PIT values. In addition to presenting a simulation-based adjustment following the method of Aldor-Noiman et al. (2013), we introduce a new optimization method that is computationally considerably more efficient in determining the needed adjustment, especially when bands with high resolution are desired for a large sample size. Although our focus is on providing a test with an intuitive graphical representation, we show that our method performs competitively when compared to existing uniformity tests with state-of-the-art performance. We demonstrate the usefulness of this graphical test in the context of the simulation-based calibration approach for assessing inference methods (Talts et al. 2020).
Finally, as our third contribution, we generalize the graphical test as well as both the simulation and the optimization method to evaluate the hypothesis that two or more samples are drawn from the same underlying distribution. We demonstrate the usefulness of this graphical test in MCMC convergence diagnostics, where the currently most common graphical tools for assessing convergence are trace plots of the individual sampled chains.

Outline of the paper
In Section 2, we first provide a simulation-based method to determine simultaneous confidence bands for the ECDF of a single uniform sample and then present a new, more efficient optimization-based method. In Section 3, we extend the test to multiple-sample comparison, following a similar structure by offering both a simulation- and an optimization-based method. We continue in Section 4 with simulated and real-world examples illustrating the application of our proposed methods, and end with a discussion in Section 5.

Simultaneous Confidence Bands for the Empirical Cumulative Distribution
We propose simulation- and optimization-based approaches to providing the ECDF of a uniform sample with 1 − α level simultaneous confidence bands that are compatible with empirical PIT values, that is, confidence bands with a type-1-error rate of α. Our approach is similar to that presented by Aldor-Noiman et al. (2013), with one central distinction, illustrated in Figure 2. The method by Aldor-Noiman et al. (2013) obtains simultaneous confidence bands for the evaluation quantiles with fixed ECDF values based on beta distributions, that is, it obtains confidence bands along the horizontal axis (Figure 2(a)). In contrast, our new method provides simultaneous confidence bands for the ECDF values at fixed evaluation quantiles based on binomial distributions, that is, it obtains confidence bands along the vertical axis (Figure 2(b)). In the limit, as the sample size approaches infinity, there is no practical difference between the methods. However, when the number of possible unique ranks is small, our proposed method behaves better for the smallest and largest ranks, and consistently so if the ranks are further binned.

Pointwise confidence bands
Determining the pointwise confidence interval for the ECDF value of a sample from the continuous uniform distribution at a given evaluation point z_i ∈ (0, 1) is rather straightforward. By definition, given a sample u = u_1, . . . , u_N, the ECDF value is

F(z_i) = 1/N ∑_{j=1}^{N} I(u_j ≤ z_i).   (7)

As the sampled values u_j ∈ (0, 1) are expected to be continuously uniformly distributed, Pr(u_j ≤ z_i) = z_i for each j = 1, . . . , N. Thus, the values resulting from scaling the ECDF by the sample size N are binomially distributed:

N F(z_i) ∼ Bin(N, z_i).   (8)

If we instead expect u to be sampled from a discrete uniform distribution with S distinct equally spaced values s_j = j/S, then, by choosing the partition points to form a subset of these category values, we again have Pr(u_j ≤ z_i) = z_i for j = 1, . . . , N, and the marginal distribution of the scaled ECDF follows Eq. (8). Therefore, the methods introduced in Sections 2.2 and 2.3 can be used to determine simultaneous confidence bands for both continuous and discrete uniform samples, allowing for testing uniformity of both the continuous PIT values of Eq. (1) and the discrete empirical PIT values of Eq. (2).

From Equation (8), it is straightforward to determine the 1 − α level pointwise lower and upper confidence bands, L_i and U_i respectively, satisfying, for all i = 1, . . . , N individually,

Pr(L_i ≤ F(z_i) ≤ U_i) = 1 − α.

In contrast, determining the simultaneous confidence bands for ECDF trajectories (i.e., sets of ECDF values) is more complicated. In Figure 4, we illustrate the dependency between ECDF values at distinct evaluation quantiles, together with simultaneous confidence bands computed via either of the new methods described in the following sections. As the figure illustrates, ECDF values evaluated at two quantiles close to each other are strongly dependent, while ECDF values evaluated at two quantiles far from each other are only weakly dependent. In any case, these dependencies need to be taken into account when constructing simultaneous confidence bands. Another important remark is that, as the marginal distribution of the scaled ECDF is discrete, the simultaneous confidence intervals do not in all cases meet the desired coverage level exactly. Brown et al. (2001) provide a thorough exploration of the effect discreteness has on the coverage level of various interval estimators for a binomial proportion, with listings of what the authors call lucky and unlucky sample lengths. In our experience, even though discreteness plays a role in the coverage level of the pointwise confidence intervals, this effect is reduced to a deviation of under ±1% for N ∈ [50, 2000] in the coverage level of the resulting simultaneous confidence bands we introduce next.
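Because of Eq. (8), the pointwise bands reduce to binomial quantiles (qbinom in R). A self-contained Python sketch, with a naive quantile function written out for completeness:

```python
from math import comb

def binom_quantile(q, n, p):
    """Smallest k with Bin(k | n, p) >= q (the binomial quantile function)."""
    acc = 0.0
    for k in range(n + 1):
        acc += comb(n, k) * p**k * (1 - p)**(n - k)
        if acc >= q:
            return k
    return n

def pointwise_band(n, z, alpha):
    """1 - alpha pointwise confidence interval for the ECDF value F(z)
    of a uniform sample of size n, via the scaled ECDF N*F(z) ~ Bin(n, z)."""
    lower = binom_quantile(alpha / 2, n, z)
    upper = binom_quantile(1 - alpha / 2, n, z)
    return lower / n, upper / n
```

For example, `pointwise_band(100, 0.5, 0.05)` gives `(0.4, 0.6)`.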

Simultaneous confidence bands through simulation
Our goal is to define simultaneous confidence bands for the ECDF of a sample of N values drawn from the standard uniform distribution so that the interior of the confidence bands contains trajectories induced by that distribution with rate 1 − α, where α ∈ (0, 1).
In this section, we describe a simulation-based method for determining the simultaneous confidence bands for the ECDF trajectory. We follow steps similar to those introduced by Aldor-Noiman et al. (2013), with the exception that, instead of determining limits for the Q-Q plot, we determine the upper and lower limits of the ECDF values at the evaluation points z_i in two steps: 1. choose the evaluation points z_i; 2. determine the coverage parameter γ to account for multiplicity in order to obtain the 1 − α level simultaneous confidence bands. In determining these confidence bands, we use the knowledge that the values of the scaled ECDF at each point z_i follow a binomial distribution, and denote the value of the cumulative binomial distribution function with parameters N and z_i at k ∈ N by Bin(k | N, z_i) and its inverse by Bin^{-1}(γ | N, z_i).

To find the desired coverage value γ, we simulate M draws of size N from the standard uniform distribution. Let F_m denote the ECDF of the mth sample, u_m1, . . . , u_mN ∼ uniform(0, 1). For each sample, we find the value of γ such that the equal-tail quantiles

Bin^{-1}(γ/2 | N, z_i)/N and Bin^{-1}(1 − γ/2 | N, z_i)/N

provide the tightest possible lower and upper limits, respectively, to the sample ECDF F_m at each z_i. This value of γ for the mth sample is

γ_m = 2 min_i min{ Bin(N F_m(z_i) | N, z_i), 1 − Bin(N F_m(z_i) − 1 | N, z_i) }.

By construction, γ_m is the largest value of γ for which the corresponding equal-tail quantiles still contain F_m at every z_i, that is, it defines the tightest possible pair of limits defining equal-tail quantiles for the ECDF at each z_i. To obtain bands covering a 1 − α fraction of the ECDFs of the simulated samples, we set γ to the α quantile of the values {γ_1, . . . , γ_M}. Since γ_m > 0 by construction, we also have γ > 0.
The following steps summarize the algorithm for simulating the adjusted coverage parameter γ and determining the 1 − α level simultaneous confidence bands:
1. For each of the M simulated uniform samples, compute γ_m, the smallest equal-tail probability mass whose binomial quantiles contain the sample ECDF F_m at every evaluation point z_i.
2. Set γ to be the 100α percentile of {γ_1, . . . , γ_M}.
3. Compute the band limits L_i = Bin^{-1}(γ/2 | N, z_i)/N and U_i = Bin^{-1}(1 − γ/2 | N, z_i)/N.
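The simulation of the adjusted coverage parameter can be sketched as follows (our own pure-Python illustration, not the paper's R implementation; function names are ours):

```python
import random
from math import comb

def binom_cdf(k, n, p):
    """Cumulative binomial distribution function Bin(k | n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(min(k, n) + 1))

def simulate_gamma(n, z, alpha, n_sims=1000, seed=0):
    """Adjusted coverage parameter gamma: for each simulated uniform sample,
    gamma_m is twice the smallest one-sided tail probability of any of its
    ECDF values; gamma is the alpha quantile of the gamma_m."""
    rng = random.Random(seed)
    gammas = []
    for _ in range(n_sims):
        u = [rng.random() for _ in range(n)]
        g = 1.0
        for zi in z:
            r = sum(uj <= zi for uj in u)  # scaled ECDF value N * F_m(z_i)
            g = min(g, 2 * min(binom_cdf(r, n, zi),
                               1 - binom_cdf(r - 1, n, zi)))
        gammas.append(g)
    gammas.sort()
    return gammas[int(alpha * n_sims)]
```

The returned γ is smaller than α, reflecting the multiplicity adjustment over the evaluation points.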

Simultaneous confidence bands through optimization
We also propose a computationally more efficient optimization based method for determining the simultaneous confidence bands.
In the following derivation of the optimization method, we denote the interior of the confidence bands for the ECDF at quantile z_i as Ĩ_i(γ). Denoting r_i = N F(z_i), the scaled interior I_i(γ) for r_i is given by

I_i(γ) = [Bin^{-1}(γ/2 | N, z_i), Bin^{-1}(1 − γ/2 | N, z_i)].

As is common for discrete statistical tests, we treat the borders between the interior and exterior as being within the confidence bands. Based on I_i(γ), we can easily obtain Ĩ_i(γ), as r ∈ I_i(γ) is equivalent to r/N ∈ Ĩ_i(γ).
A scaled ECDF trajectory, defined as

t_0^K = (r_0, r_1, . . . , r_K) with r_i = N F(z_i),

where z_0 = 0 and z_K = 1, stays within the simultaneous confidence bands completely if and only if r_i ∈ I_i(γ) for all i ∈ {0, . . . , K}. If we denote the set of trajectories fulfilling r_i ∈ I_i as T_i, we can write the set of trajectories which are completely within the simultaneous confidence bands as

T(γ) = ∩_{i=0}^{K} T_i(γ).

In order for the simultaneous confidence bands to have confidence level 1 − α, we must have

Pr(T(γ)) = 1 − α.   (18)

Due to the pairwise independence of the original draws u_i (by assumption), the distribution of the ECDF values within a single trajectory is Markovian in the sense that the ECDF value F(z_{i+1}) only depends on the observed value at the previous evaluation point, F(z_i), and not on the earlier behaviour of the ECDF trajectory.
This implies that, under uniformity of the original distribution, the remaining N − N F(z_i) = N − r_i samples are uniformly distributed on the interval [z_i, 1], and thus the growth of the scaled ECDF from r_i to r_{i+1}, between z_i and z_{i+1}, is binomially distributed with N − r_i trials and the success probability

z̃_{i+1} = (z_{i+1} − z_i) / (1 − z_i).

And so we have

Pr(r_{i+1} = k | r_i = m) = p_Bin(k − m | N − m, z̃_{i+1}),

where p_Bin denotes the discrete PDF of the binomial distribution. The probability for r_{i+1} = k ∈ I_{i+1} to occur in a scaled ECDF trajectory t_0^K which stayed within the simultaneous confidence bands until point i, that is, for which we have

r_j ∈ I_j(γ) for all j ∈ {0, . . . , i},

can thus be written recursively as

Pr(r_{i+1} = k, t_0^{i+1} ∈ ∩_{j=0}^{i+1} T_j(γ)) = ∑_{m ∈ I_i(γ)} p_Bin(k − m | N − m, z̃_{i+1}) Pr(r_i = m, t_0^i ∈ ∩_{j=0}^{i} T_j(γ)).

The recursion is initialized at z_0 = 0 with Pr(r_0 = 0) = 1, so that Pr(T_0(γ)) = 1 for all γ ∈ [0, 1].

[Figure 5: The sampled values (left) and the corresponding jointly rank-transformed values (Eq. 29), presented in the middle. When comparing multiple samples, we would like to take into account the within-sample dependency, but also the between-sample dependency introduced by the joint transformation. In Sections 3.2 and 3.3, we extend our methods in order to provide simultaneous confidence bands for the ECDF and the ECDF difference plots shown on the right.]

At any point i ∈ {0, . . . , K}, we can obtain

Pr(∩_{j=0}^{i} T_j(γ)) = ∑_{k ∈ I_i(γ)} Pr(r_i = k, t_0^i ∈ ∩_{j=0}^{i} T_j(γ)),
which is equal to Pr(T(γ)) when arriving at i = K. Clearly, Pr(T(γ)) is monotonically decreasing but not continuous in γ due to the discrete nature of the binomial distribution. Thus, Equation (18) will not have an exact solution in general, and so we will not be able to meet the simultaneous confidence level 1 − α exactly. We can, however, get as close as possible by computing

γ̂ = arg min_{γ ∈ [0,1]} | Pr(T(γ)) − (1 − α) |

with a unidimensional derivative-free optimizer. In our experiments, the optimizer proposed by Brent (1973) (which is implemented, e.g., in the R function optimize) converged quickly in all cases to γ̂ values implying a simultaneous confidence level very close to the nominal 1 − α.
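The recursion and the outer optimization can be sketched as follows (pure Python, names ours; for simplicity we exploit the monotonicity of Pr(T(γ)) and use bisection in place of Brent's method, and evaluate only interior points 0 < z_i < 1):

```python
from math import comb

def binom_pmf(k, n, p):
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_quantile(q, n, p):
    """Smallest k with Bin(k | n, p) >= q."""
    acc = 0.0
    for k in range(n + 1):
        acc += binom_pmf(k, n, p)
        if acc >= q:
            return k
    return n

def coverage(gamma, n, z):
    """Pr(T(gamma)): probability that the scaled ECDF of a uniform sample
    of size n stays within the gamma-level pointwise bands at all interior
    evaluation points z, computed via the Markov recursion."""
    probs = {0: 1.0}          # distribution of r_i = N * F(z_i); r_0 = 0
    z_prev = 0.0
    for zi in z:
        lo = binom_quantile(gamma / 2, n, zi)
        hi = binom_quantile(1 - gamma / 2, n, zi)
        p_step = (zi - z_prev) / (1 - z_prev)  # success prob. of the growth
        new = {}
        for r, pr in probs.items():
            for k in range(max(lo, r), hi + 1):
                new[k] = new.get(k, 0.0) + pr * binom_pmf(k - r, n - r, p_step)
        probs, z_prev = new, zi
    return sum(probs.values())

def optimize_gamma(n, z, alpha, tol=1e-5):
    """Largest gamma whose simultaneous coverage is still >= 1 - alpha."""
    lo, hi = 0.0, alpha
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if coverage(mid, n, z) >= 1 - alpha:
            lo = mid
        else:
            hi = mid
    return lo
```

Because each evaluation of `coverage` is a single deterministic pass, this approach avoids the Monte Carlo loop of the simulation method.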
With a 2015 laptop equipped with a 2.90GHz Intel® Core™ i5-5287U processor, the optimization method reduces the time required to compute the adjustment parameter γ from 10s to 600ms for a sample of length 250 when compared against the time required for 10,000 steps of the simulation method. With N = 1000 this reduction is from 75s to 10s.
Both of the implementations used for this article only use a single computation thread, but would benefit from parallelization, as both methods include independent iterations.
The computation time required can be further reduced by using a grid of pre-computed adjustment parameter values and interpolating for intermediate values of N on the log-log scale.
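Such a lookup could be sketched as follows (the grid values below are placeholders for illustration, not values from the paper):

```python
from math import log, exp
from bisect import bisect_left

def interpolate_gamma(n, n_grid, gamma_grid):
    """Linearly interpolate the adjustment parameter on the log-log scale
    from a pre-computed grid; n_grid must be sorted ascending."""
    if n <= n_grid[0]:
        return gamma_grid[0]
    if n >= n_grid[-1]:
        return gamma_grid[-1]
    j = bisect_left(n_grid, n)
    x0, x1 = log(n_grid[j - 1]), log(n_grid[j])
    y0, y1 = log(gamma_grid[j - 1]), log(gamma_grid[j])
    t = (log(n) - x0) / (x1 - x0)
    return exp(y0 + t * (y1 - y0))
```

At the geometric midpoint of two grid knots, the interpolated value is the geometric mean of the two stored adjustments.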

Comparison of multiple samples
In this section, we extend the uniformity test of Section 2 to test whether multiple samples originate from the same underlying distribution. If multiple samples share the same distribution, the rank statistics of the values within each sample, when ranked jointly across all samples, are uniformly distributed on the interval (1, Ñ), where Ñ is the total length of the combined sample. Thus, instead of considering the sampled values directly, we consider the implied jointly rank-transformed values.
Due to this joint rank-transformation, the resulting chains are dependent on each other, and the confidence intervals we construct in the following two sections are used to answer whether the two or more samples all originate from the same underlying distribution. In other words, if one or more of the ECDF trajectories leaves the confidence bands, we conclude that at least one of the samples exhibits a larger than expected deviation from the other samples at hand.
An illustration of the connection between the sampled values, the corresponding fractional rank statistics, and the two ECDF plots of these rank statistics is displayed in Figure 5.

Pointwise confidence bands
An important distinction from the ECDF case considered in Section 2 is the form of the marginal distribution at quantile z_i when determining the adjusted coverage parameter γ. As our main application is the comparison of distributions induced by MCMC chains, we speak of the L different samples as chains and assume all chains to have the same length N. We define r_i as the vector (of length L) of counts of joint ranks across chains smaller than or equal to the sample size s_i = z_i N L. That is, for each of the L elements r_il of r_i, we have

r_il = ∑_{j=1}^{N} I(R(u_lj | u) ≤ s_i),   (25)

where u_lj is the jth draw of the lth chain before transformation, R(u_lj | u) is the rank of u_lj within the vector u of all draws across all chains, and I is the indicator function. Clearly, because of the definition of ranks, we know for all i that

∑_{l=1}^{L} r_il = s_i,   (26)

and we define the set of all r_i satisfying (26) as R_i. Due to the pairwise independence of the original draws u_lj (by assumption), the marginal distribution of r_i at quantile z_i is multivariate hypergeometric,

r_i ∼ MHyp(Ñ, s_i),   (27)

where Ñ = (N_1, . . . , N_L) is the vector of chain lengths (i.e., population sizes) and N_1 = . . . = N_L = N as we assume chains of equal length. It is well known that, in this case, the marginal distribution of r_il, and thus the distribution defining the pointwise confidence bands, is hypergeometric,

r_il ∼ Hyp(N L, N, s_i).   (28)
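The joint rank counts and the hypergeometric marginal of Section 3.1 can be sketched as follows (helper names ours; ties are ignored for clarity):

```python
from bisect import bisect_right
from math import comb

def joint_rank_counts(chains, s):
    """For each chain, the number of its draws whose joint rank across all
    chains is at most s (rank = position in the pooled sorted sample)."""
    pooled = sorted(x for chain in chains for x in chain)
    return [sum(bisect_right(pooled, x) <= s for x in chain)
            for chain in chains]

def hypergeom_pmf(k, total, succ, draws):
    """Marginal pmf of a single count: draw `draws` ranks from a pool of
    size `total`, of which `succ` belong to the chain in question."""
    if k < 0 or k > draws:
        return 0.0
    return comb(succ, k) * comb(total - succ, draws - k) / comb(total, draws)
```

The counts always sum to s across chains, matching the constraint on the joint ranks.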

Simultaneous confidence bands through simulation
In this section, we extend the simulation method presented in Section 2.2 to the comparison of multiple samples. Our aim is to define simultaneous confidence bands for the ECDFs of multiple, jointly rank-transformed samples so that the interior of the simultaneous confidence bands jointly contains all trajectories induced by the rank-transformed distributions with rate 1 − α. To this end, we define r_i and s_i as in Section 3.1 and denote the interior of the simultaneous confidence bands at quantile z_i as Ĩ_i(γ), with γ being the adjusted coverage parameter to be determined. We continue the use of fractional ranks in the ECDF plots to provide illustrations independent of the length of the sampled chains. Suppose we have L chains of length N. The fractional rank score r̃_il corresponding to the ith value of the lth chain, u_li, is

r̃_il = R(u_li | u) / (N L).   (29)

Instead of using the adjusted value of γ to obtain the 1 − α level simultaneous confidence bands for a single ECDF trajectory, we adjust γ to account for the dependence between the samples introduced by the transformation into fractional ranks. That is, after choosing the evaluation quantiles z_i, we adjust γ to find upper and lower simultaneous confidence bands satisfying

Pr(L_i ≤ F_l(z_i) ≤ U_i for all i = 1, . . . , K and l = 1, . . . , L) = 1 − α,   (30)

where F_l is the ECDF of the fractional rank scores of the lth chain. We denote the CDF of the hypergeometric distribution as Hyp and its inverse as Hyp^{-1}. The algorithm to approximate the adjusted coverage parameter γ when comparing L samples is as follows:
1. For each of the M simulations:
(a) draw L chains of N values each from the standard uniform distribution;
(b) jointly rank the draws across all chains;
(c) for each evaluation quantile z_i and chain l, compute the count r_il of joint ranks at most s_i;
(d) for each r_il, compute the equal-tail probability mass of its marginal hypergeometric distribution needed to contain r_il;
(e) find the minimum probability γ_m over all i and l.
2. Set γ to be the 100α percentile of {γ_1, . . . , γ_M}.
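The steps above can be sketched in analogy to the single-sample simulation (our own Python illustration; equal-length chains and no ties are assumed, and the evaluation quantiles must satisfy z_i N L ≥ 1):

```python
import random
from math import comb

def hyper_cdf(k, total, succ, draws):
    """Hypergeometric CDF Hyp(k | total, succ, draws)."""
    if k < 0:
        return 0.0
    denom = comb(total, draws)
    return sum(comb(succ, j) * comb(total - succ, draws - j)
               for j in range(min(k, succ, draws) + 1)) / denom

def simulate_gamma_chains(n, n_chains, z, alpha, n_sims=300, seed=1):
    """Adjusted coverage parameter for comparing n_chains chains of length
    n: gamma_m is twice the smallest one-sided hypergeometric tail
    probability of any count r_il in the m-th simulation; gamma is its
    alpha quantile."""
    rng = random.Random(seed)
    total = n * n_chains
    gammas = []
    for _ in range(n_sims):
        chains = [[rng.random() for _ in range(n)] for _ in range(n_chains)]
        pooled = sorted(x for c in chains for x in c)
        g = 1.0
        for zi in z:
            s = round(zi * total)
            thresh = pooled[s - 1]          # the s-th smallest pooled draw
            for chain in chains:
                r = sum(x <= thresh for x in chain)
                g = min(g, 2 * min(hyper_cdf(r, total, n, s),
                                   1 - hyper_cdf(r - 1, total, n, s)))
        gammas.append(g)
    gammas.sort()
    return gammas[int(alpha * n_sims)]
```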

Simultaneous confidence bands through optimization
In this section, we extend the optimization method presented in Section 2.3 to the comparison of multiple samples. With the marginal distribution of r_il being hypergeometric, the rank interior I_i(γ) for z_i is given by the set of count vectors whose elements all satisfy

Hyp^{-1}(γ/2 | N L, N, s_i) ≤ r_il ≤ Hyp^{-1}(1 − γ/2 | N L, N, s_i).

We treat the borders between interior and exterior as belonging to the interior. Based on I_i(γ), we can again easily obtain Ĩ_i(γ), as r ∈ I_i(γ) is equivalent to r/N ∈ Ĩ_i(γ).
The remainder of the derivation proceeds similarly to the one-sample case, except that we replace the binomial distribution with the (multivariate) hypergeometric distribution. A (multivariate) rank ECDF trajectory, defined as

t_0^K = (r_0, r_1, . . . , r_K),

where z_0 = 0 and z_K = 1, stays within the simultaneous confidence bands completely if and only if r_i ∈ I_i(γ) for all i ∈ {0, . . . , K}. If we denote the set of trajectories fulfilling r_i ∈ I_i as T_i, we can write the set of trajectories which are completely in the interior of the simultaneous confidence bands as

T(γ) = ∩_{i=0}^{K} T_i(γ).

In order for the simultaneous confidence bands to have a confidence level 1 − α, we must satisfy

Pr(T(γ)) = 1 − α.

Due to the pairwise independence of the original draws u_lj (by assumption), the distribution of the rank ECDF trajectories again exhibits a Markovian property similar to that of the single-sample case. That is, any ECDF value F(z_{i+1}) beyond a given point z_i only depends on F(z_i) but not on the earlier history of the ECDF trajectory. This implies that, under the assumption of all chains coming from the same underlying distribution, the growth r_{i+1} − r_i of the ECDF from z_i to z_{i+1} is multivariate hypergeometric with population sizes Ñ_i = Ñ − r_i and sample size s̃_{i+1} = s_{i+1} − s_i. Accordingly, we have

Pr(r_{i+1} = k | r_i = m) = p_MHyp(k − m | Ñ − m, s̃_{i+1}),

where p_MHyp denotes the discrete PDF of the multivariate hypergeometric distribution. The probability for r_{i+1} = k ∈ I_{i+1} to occur in a rank ECDF trajectory t_0^K which stayed in the simultaneous confidence bands until point i, that is, for which we have

r_j ∈ I_j(γ) for all j ∈ {0, . . . , i},

can thus be written recursively as

Pr(r_{i+1} = k, t_0^{i+1} ∈ ∩_{j=0}^{i+1} T_j(γ)) = ∑_{m ∈ I_i(γ)} p_MHyp(k − m | Ñ − m, s̃_{i+1}) Pr(r_i = m, t_0^i ∈ ∩_{j=0}^{i} T_j(γ)).   (37)

The recursion is initialized at z_0 = 0 with Pr(r_0 = (0, . . . , 0)) = 1, so that Pr(T_0(γ)) = 1 for all γ ∈ [0, 1]. At any point i ∈ {0, . . . , K}, we can obtain

Pr(∩_{j=0}^{i} T_j(γ)) = ∑_{k ∈ I_i(γ)} Pr(r_i = k, t_0^i ∈ ∩_{j=0}^{i} T_j(γ)),

which is equal to Pr(T(γ)) when arriving at i = K. Clearly, Pr(T(γ)) is monotonically decreasing but not continuous in γ due to the discrete nature of the (multivariate) hypergeometric distribution. We can compute

γ̂ = arg min_{γ ∈ [0,1]} | Pr(T(γ)) − (1 − α) |

using a unidimensional derivative-free optimizer.
In our experiments, the optimizer proposed by Brent (1973) converged in all cases toγ values implying a simultaneous confidence level very close to the nominal 1 − α.
Unfortunately, evaluating Eq. (37) suffers from combinatorial explosion, as the R_i are L-dimensional sets constrained only by Equation (26), and as Pr(r_{i+1} = k | r_i = m) has to be computed for all combinations of elements k ∈ I_{i+1} and m ∈ I_i at each point i. Several measures can be taken to reduce the complexity of the computation. First, the ranks of one of the L chains are redundant, as they follow deterministically from Equation (26) based on the ranks of the other L − 1 chains. This implies in particular that the 2-chain case has the same computational complexity as the one-sample case, as only one of the two chains needs to be evaluated. Second, due to the a priori symmetry of the chains, we can, without loss of generality, assume at the first non-zero quantile z_1 that the elements r_1l of r_1 are ordered such that r_11 ≤ r_12 ≤ . . . ≤ r_1L. This reduces the number of trajectories to be evaluated by a factor of L(L + 1)/2. Still, even with these measures in place, the computation scales badly with L, and the simulation-based method, which scales almost linearly, or grid-based interpolation from pre-computed values, is faster for larger numbers of chains.
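As an illustration of the 2-chain special case, where the second chain's counts follow deterministically and the recursion reduces to univariate hypergeometric steps, a sketch under the no-ties assumption (our own derivation and naming, with the same conventions as the single-sample coverage recursion):

```python
from math import comb

def hyper_pmf(k, total, succ, draws):
    """pmf of the hypergeometric distribution (0 outside the support)."""
    if k < 0 or k > draws or k > succ or draws - k > total - succ:
        return 0.0
    return comb(succ, k) * comb(total - succ, draws - k) / comb(total, draws)

def hyper_quantile(q, total, succ, draws):
    """Smallest k with Hyp(k | total, succ, draws) >= q."""
    acc = 0.0
    for k in range(min(succ, draws) + 1):
        acc += hyper_pmf(k, total, succ, draws)
        if acc >= q:
            return k
    return min(succ, draws)

def coverage_two_chains(gamma, n, z):
    """Pr(T(gamma)) for L = 2 chains of length n: the probability that the
    rank-count trajectory of the first chain stays within the gamma-level
    pointwise hypergeometric bands at every evaluation quantile in z
    (the second chain's counts follow deterministically)."""
    total = 2 * n
    probs = {0: 1.0}   # distribution of r_i, starting from r_0 = 0
    s_prev = 0
    for zi in z:
        s = round(zi * total)
        lo = hyper_quantile(gamma / 2, total, n, s)
        hi = hyper_quantile(1 - gamma / 2, total, n, s)
        step = s - s_prev
        new = {}
        for r, pr in probs.items():
            for k in range(max(lo, r), min(hi, r + step) + 1):
                # growth k - r: `step` new ranks drawn from the remaining
                # pool of total - s_prev, n - r of which belong to chain 1
                p = pr * hyper_pmf(k - r, total - s_prev, n - r, step)
                if p > 0.0:
                    new[k] = new.get(k, 0.0) + p
        probs, s_prev = new, s
    return sum(probs.values())
```

As in the single-sample case, the resulting coverage is monotonically decreasing in γ, so the outer optimization can reuse the same unidimensional search.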

Numerical Experiments and Power Analysis
In this section, we provide insights into how the plots produced by our proposed methods should be interpreted. In each of the following cases, we link together the histogram, ECDF plot, and the ECDF difference plot. The code for the experiments and plots is available at https://github.com/TeemuSailynoja/simultaneous-confidence-bands.

Uniformity of a Single Sample
We begin by providing two examples connecting the shape of the histogram of the transformed sample to the characteristics of the corresponding ECDF and ECDF difference plots for basic discrepancies between the sample and the comparison distribution. After this, we illustrate an application of our method as part of a workflow to detect issues in model implementation or the computation of the posterior distribution. Lastly, we provide a power analysis comparing the performance of our proposed method to existing state-of-the-art tests for uniformity.
With the exception of the power analysis tests in Section 4.1.4, where the samples are drawn directly from a continuous uniform distribution, the samples in the following examples are transformed to the unit interval from their respective sampling distributions through empirical PIT and are tested against the hypothesis of discrete uniformity.

Figure 6: Effect of the sample mean. The histogram, ECDF plot, and ECDF difference plot of the empirical PIT values of $y = y_1, \dots, y_{250} \sim \mathrm{normal}(0.25, 1)$ with respect to $x^i = x^i_1, \dots, x^i_{250} \sim \mathrm{normal}(0, 1)$ for $i = 1, \dots, N$. The larger than expected mean of the sample is somewhat visible as a slant to the right in the histogram, whereas the ECDF difference plot displays a clear ∪-shape. In the histogram, 95% confidence intervals are provided for each of the 50 bins, and the ECDF plots show the 95% simultaneous confidence bands.

Figure 7: Effect of the sample variance. The histogram, ECDF plot, and ECDF difference plot of the empirical PIT values of $y = y_1, \dots, y_{250} \sim \mathrm{normal}(0, 1.25)$ with respect to $x^i = x^i_1, \dots, x^i_{250} \sim \mathrm{normal}(0, 1)$ for $i = 1, \dots, N$. The larger than expected variance of the sample is visible as a ∪-shape in the histogram, whereas the ECDF difference plot displays rapid growth near the ends of the interval, where a larger than expected number of values is covered. In the histogram, 95% confidence intervals are provided for each of the 50 bins, and the ECDF plots show the 95% simultaneous confidence bands.

Effect of Difference in Sample Mean
To observe the typical characteristics of a sample with a mean different from that of the comparison distribution, we draw $y = y_1, \dots, y_N \sim \mathrm{normal}(0.25, 1)$ and $N$ independent comparison samples $x^i = x^i_1, \dots, x^i_N \sim \mathrm{normal}(0, 1)$ with $N = 250$. We then test whether $y$ is standard normal distributed by transforming the sampled values to the unit interval through empirical PIT. Figure 6(a) shows the histogram of the transformed sample exhibiting a higher than expected mean. As seen in the figure, a shift in the sample mean leads to the histogram being slanted towards the direction of the shift. The ECDF plot in Figure 6(b) shows this shift through the ECDF of the PIT values remaining under the theoretical CDF, which is also seen in the ECDF difference plot in Figure 6(c). If the sample in question instead had a mean lower than expected, the histogram would be slanted to the left, and the behaviour of the resulting ECDF plot and ECDF difference plot would be reversed. That is, the ECDF plot would stay above the theoretical CDF, as a higher than expected density is covered at low fractional ranks, and the ECDF difference plot would correspondingly show a ∩-shape above the zero level.
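As a concrete illustration, the empirical PIT of this example can be sketched as follows; the variable names are illustrative and not taken from the paper's code repository.

```python
import numpy as np

# Sketch of the empirical PIT: each draw y_i is compared to its own
# comparison sample x^i of size N from the reference distribution.
rng = np.random.default_rng(seed=1)
N = 250
y = rng.normal(0.25, 1.0, size=N)        # sample with a shifted mean
x = rng.normal(0.0, 1.0, size=(N, N))    # rows are the comparison samples x^i

# empirical PIT value: fraction of the comparison sample below each y_i
u = (x < y[:, None]).mean(axis=1)

# ECDF difference at the sorted PIT values; under uniformity this stays near
# zero, while a positive mean shift produces a U-shaped dip below zero
z = np.sort(u)
ecdf_diff = np.arange(1, N + 1) / N - z
```

Plotting `ecdf_diff` against `z` gives the ECDF difference plot described in the text.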

Effect of Difference in Sample Variance
Next, we investigate an example where the sample has a higher than expected variance. To this end, we draw $y = y_1, \dots, y_N \sim \mathrm{normal}(0, 1.25)$ and for each $y_i$ a standard normal comparison sample $x^i = x^i_1, \dots, x^i_N \sim \mathrm{normal}(0, 1)$ with $N = 250$. Figure 7(a) shows the histogram of the empirical PIT values. In general, a larger than expected variance leads to a ∪-shaped histogram, and one can indeed see some of the histogram bins breaching the 95% confidence bounds. In the ECDF plot shown in Figure 7(b), the larger than expected variance leads to faster than expected growth near the edges and slower than expected growth in the middle. The shape is more clearly seen in the ECDF difference plot in Figure 7(c), depicting the difference between the ECDF and the theoretical CDF. If the sample instead had a variance lower than expected, the histogram would be ∩-shaped, and the behaviour of the resulting ECDF plot and ECDF difference plot would be reversed. In the ECDF plot, this shows as a faster increase near the middle. In general, the ECDF difference plot decreases when a smaller than expected density of sample values is covered and correspondingly increases when a higher than expected density is covered.

Simulation Based Calibration: Eight Schools
The eight schools model (Gelman, Carlin, et al. 2013) is a classic hierarchical model example. The training course effects $\theta_j$ in eight schools are modelled using a hierarchical varying intercept model.
If the model is constructed with the centered parameterization, the posterior distribution exhibits a funnel shape, contracting to a region of high curvature near the population mean $\mu$ when sampled with small values of the population standard deviation $\tau$. This property makes exploring the distribution of $\tau$ difficult for many MCMC methods. The centered parameterization of the problem is as follows:
$$\theta_j \sim \mathrm{normal}(\mu, \tau), \qquad y_j \sim \mathrm{normal}(\theta_j, \sigma_j), \qquad j = 1, \dots, 8.$$
Cook et al. (2006) proposed a simulation-based calibration method for validating Bayesian inference software. The idea is based on the fact that we can factor the joint distribution of data $y$ and parameters $\theta$ in two ways:
$$\pi(y, \theta) = \pi(y \mid \theta)\,\pi(\theta) = \pi(\theta \mid y)\,\pi(y).$$
By considering two parameter draws $\theta'$ and $\theta''$, the joint distribution is $\pi(y, \theta', \theta'') = \pi(y)\,\pi(\theta' \mid y)\,\pi(\theta'' \mid y)$, and it is easy to see that $\theta'$ and $\theta''$ have the same distribution conditionally on $y$. If we write the joint distribution in an alternative way, $\pi(y, \theta', \theta'') = \pi(\theta')\,\pi(y \mid \theta')\,\pi(\theta'' \mid y)$, then $\theta'$ and $\theta''$ still have the same distribution conditionally on $y$. We can sample from the joint distribution $\pi(y, \theta', \theta'')$ by first sampling from $\pi(\theta')$ and $\pi(y \mid \theta')$, which is usually easy for generative models. The last step is to sample from the conditional $\pi(\theta'' \mid y)$, which is usually not trivial; instead, for example, a Markov chain Monte Carlo algorithm is used. We can validate the algorithm and its implementation used to sample from $\pi(\theta'' \mid y)$ by checking that the samples obtained have the same distribution as $\theta'$ (conditionally on $y$). Cook et al. (2006) operationalize the approach by drawing $\theta'_i$ from $\pi(\theta')$, generating data $y_i \sim \pi(y_i \mid \theta'_i)$, and then using the algorithm to be validated to draw a sample $\theta''_1, \dots, \theta''_S \sim \pi(\theta'' \mid y_i)$. If the algorithm and its implementation are correct, then $\theta'_i, \theta''_1, \dots, \theta''_S$ conditional on $y_i$ are draws from the same distribution. Cook et al. (2006) propose to compute empirical PIT values for $\theta'_i$, which they show to be uniformly distributed as $S \to \infty$. The process is repeated for $i = 1, \dots, N$, and the $N$ empirical PIT values are used for testing. Cook et al. (2006) propose to use a $\chi^2$-test for the inverse of the normal CDF of the empirical PIT values. However, with finite $S$, this approach does not correctly take into account the discreteness or the effect of a correlated sample from the Markov chain (Gelman 2017).
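The SBC loop above can be sketched with a toy conjugate model (illustrative, not the eight schools model), so that exact posterior draws are available and the "algorithm" being validated is correct by construction.

```python
import numpy as np

# SBC sketch for a conjugate normal model:
#   prior      theta ~ normal(0, 1)
#   likelihood y | theta ~ normal(theta, 1)
#   posterior  theta | y ~ normal(y / 2, sqrt(1 / 2))
rng = np.random.default_rng(seed=0)
N, S = 1000, 99                    # replications and posterior sample size

ranks = np.empty(N, dtype=int)
for i in range(N):
    theta_prior = rng.normal(0.0, 1.0)                    # theta' ~ pi(theta)
    y = rng.normal(theta_prior, 1.0)                      # y ~ pi(y | theta')
    post = rng.normal(y / 2.0, np.sqrt(0.5), size=S)      # theta'' ~ pi(theta | y)
    ranks[i] = np.sum(post < theta_prior)                 # rank in {0, ..., S}

# With a correct sampler, the ranks are (discrete) uniform on {0, ..., S},
# so their histogram and rank ECDF should pass the uniformity test.
```

Replacing the exact posterior draws with an MCMC sampler, and thinning its output, gives the procedure used in the eight schools example below.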
By thinning $\theta''_1, \dots, \theta''_S$ to be approximately independent, the uniformity of the empirical PIT values can be tested with the approach presented in this paper. See Appendix A for more on thinning. Figure 8 shows the histogram and ECDF plots of 500 prior draws of the population standard deviation $\tau$, each ranked against a thinned posterior sample of 150 draws obtained from a chain of 3000 draws. The graphical test rejects the hypothesis of the prior draws being uniform; moreover, the ECDF plots show that the prior draws of the parameter $\tau$, ranked in relation to the posterior samples obtained from the centered parameterization of the eight schools model, are skewed towards small ranks. This suggests that the MCMC algorithm is not sampling correctly from the target distribution (which in this case is known to be caused by the sampler's inability to reach the narrow funnel part of the posterior).
In Section 4.2.4, we will return to the eight schools model by providing further analysis on the convergence of individual chains in the centered parameterization case and illustrating how our method can be used to detect these convergence issues.

Power analysis
As our primary focus is on providing a graphical uniformity test, which gives the user useful information regarding possible deviations from uniformity, we also want to ensure that the overall performance of our test is, if not the best, at least competitive with tests aimed at accurately detecting specific deviations from uniformity. To this end, we compare the sensitivity of our method with existing tests for uniformity by considering the rejection rate of samples drawn from a uniform distribution and then transformed according to the following three transformation families that Marhuenda et al. (2005) use in their article comparing various tests for uniformity:
$$f_{A,k}(u) = 1 - (1-u)^k, \quad
f_{B,k}(u) = \begin{cases} 2^{k-1}u^k, & u \leq \tfrac12,\\ 1 - 2^{k-1}(1-u)^k, & u > \tfrac12,\end{cases} \quad
f_{C,k}(u) = \begin{cases} \tfrac12 - 2^{k-1}\left(\tfrac12-u\right)^k, & u \leq \tfrac12,\\ \tfrac12 + 2^{k-1}\left(u-\tfrac12\right)^k, & u > \tfrac12. \end{cases} \tag{45}$$
As Marhuenda et al. (2005) offer an extensive comparison of tests, we limit our comparison to the test specifically recommended for each of the transformation families, in addition to the widely known Kolmogorov-Smirnov test.
For each of the test statistics, a critical value is calculated, and samples exceeding that value are rejected. For transformation family A, the recommended test is based on the mean distance of the $i$th value of the ordered sample, $u_{(i)}$, from its expected value $i/(N+1)$. For family B, the smooth goodness-of-fit test $N_h$ introduced by Neyman (1937) is recommended, with the dimension $h$ chosen according to the method recommended by Ledwina (1994), resulting in the test statistic $N_S$, which also has the best overall performance across the transformation families. The test recommended for transformation family C is the statistic of Watson (1961),
$$U^2 = W^2 - N\left(\bar u - \tfrac{1}{2}\right)^2,$$
where $\bar u$ is the mean of the $u_i$ and $W^2$ is the Cramér-von Mises statistic,
$$W^2 = \sum_{i=1}^{N}\left(u_{(i)} - \frac{2i-1}{2N}\right)^2 + \frac{1}{12N}.$$

Figure 9: When compared to existing tests for uniformity, the rejection rate of the graphical test with simultaneous confidence bands performs well in all three families of deviations introduced in (45). A slightly lower rejection rate can be observed in family C for k ∈ (1, 3), which corresponds to samples biased towards the center of the unit interval.

The rejection rates of these tests and our test through simultaneous confidence bands are shown in Figure 9 for families A, B, and C with sample size N = 100 and k varying between 0.20 and 3.00. For each value of k, the rejection rate among 100,000 samples was computed. As seen from these results, the proposed ECDF simultaneous confidence band method performs in a manner similar to the recommended tests, with the exception of family C, where our method exhibits a lower rejection rate compared to some of the other tests.
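The Cramér-von Mises and Watson statistics follow directly from their standard definitions; the sketch below is a plain implementation for reference, not the code used in the experiments.

```python
import numpy as np

def cramer_von_mises(u):
    """Cramér-von Mises statistic W^2 for a sample on the unit interval."""
    u = np.sort(np.asarray(u, dtype=float))
    n = len(u)
    i = np.arange(1, n + 1)
    return np.sum((u - (2.0 * i - 1.0) / (2.0 * n)) ** 2) + 1.0 / (12.0 * n)

def watson_u2(u):
    """Watson's U^2 = W^2 - N * (mean(u) - 1/2)^2."""
    u = np.asarray(u, dtype=float)
    return cramer_von_mises(u) - len(u) * (u.mean() - 0.5) ** 2
```

For the evenly spaced sample $u_{(i)} = (2i-1)/(2N)$, both statistics reduce to $1/(12N)$, which makes a convenient sanity check.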

Comparing Multiple Samples
When testing whether two or more samples are produced from the same underlying distribution, we can compare the ranks of each sample relative to the pooled sample obtained by combining all the samples in the comparison. As mentioned in Section 3, we need to adjust the confidence bands to take into account the dependency of the ranks of the values of one sample on the values in the other samples of the comparison.
When using the multiple sample test for MCMC convergence diagnostics, we recommend first using existing numerical convergence statistics, such as the $\widehat{R}$ of Vehtari et al. (2021) or the $R^*$ of Lambert and Vehtari (2021), which can assess the convergence of all model parameters jointly and can indicate which parameters have possible convergence issues. In the case that these statistics indicate possible issues, further insight into the nature of the deviations can be obtained with the ECDF plots of the fractional ranks.

Effect of difference in means and variances
We first compare two cases of MCMC sampling with four chains, each containing 250 independent draws. This is enough to reliably estimate the variances and autocorrelations required for $\widehat{R}$ and effective sample size (ESS) as long as the rank-normalized ESS of the sample exceeds 400, which is the case here as the draws are independent. In each case, chains 2 to 4 were sampled from a normal(0, 1) distribution. In the first case, chain 1 is sampled with a larger mean than the other chains, normal(0.5, 1). In the second case, chain 1 is sampled with a larger variance, normal(0, 1.5).
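The fractional ranks used in these comparisons can be sketched as follows; the variable names are illustrative, and the first case (shifted mean of chain 1) is used as the example.

```python
import numpy as np

# Fractional ranks for the multiple sample comparison: each draw is ranked
# within the pooled sample of all chains, and ranks are scaled to (0, 1].
rng = np.random.default_rng(seed=3)
n, L = 250, 4
chains = rng.normal(0.0, 1.0, size=(L, n))
chains[0] += 0.5                       # chain 1 drawn from normal(0.5, 1)

# ranks within the pooled sample (1, ..., L * n), then scaled
order = chains.ravel().argsort().argsort()
fractional = (order.reshape(L, n) + 1) / (L * n)

# chain 1 is biased towards large fractional ranks, as seen in Figure 10
```

The per-chain ECDFs of `fractional` against the uniform reference give the ECDF and ECDF difference plots discussed below.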
Rank plots for the first case, with one chain having a larger mean, are shown in Figure 10(a)-(d). Even though the difference in the sampling distribution of chain 1 can be seen in the histograms with 50 bins, the effect is more clearly represented in the ECDF difference plot in Figure 10(f), where chain 1 shows the shape familiar from Section 4.1.1 and chains 2 to 4 show a reversed shape, indicating similar behaviour between these three chains. Similar remarks regarding the behaviour of the chains can be made from the ECDF plot in Figure 10(e), but the larger dynamic range of the ECDF difference plot in Figure 10(f) makes the difference in the behaviour of the chains clearer. In the second case, where chain 1 is sampled with a higher variance, we can see a ∪-shape in the rank plot of chain 1 in Figure 11(a), but the behaviour stands out more clearly in the ECDF difference plot in Figure 11(f).

Figure 10: Effect of differences in sample mean. While chains 2 to 4 are drawn from the standard normal distribution, chain 1 is drawn from normal(0.5, 1), which can be seen as a bias towards large fractional ranks in the rank plot of chain 1 and as a slightly lowered frequency of large fractional ranks in chains 2-4. In the ECDF plot and the ECDF difference plot, the ECDF of chain 1 obtains values considerably lower than expected, resulting in a clear ∪-shape in the ECDF difference plot.
While commonly used convergence diagnostics do not offer graphical insight into the nature of the possible underlying problems, both the classical $\widehat{R}$ diagnostic by Gelman and Rubin (1992) and the improved $\widehat{R}$ diagnostic proposed by Vehtari et al. (2021) indicate convergence issues, as they give estimated $\widehat{R}$ values of 1.05 and 1.04, respectively, for both the mean and variance related examples above. Vehtari et al. (2021) suggest that $\widehat{R} > 1.01$ is an indication of potential convergence issues or too short chains.

Test performance under common deviations
To evaluate the performance of the multiple sample comparison test under a set of common deviations, one of the samples was transformed according to the three transformation families defined in equation (45). In the analysis, 2, 4, and 8 chains of length 100 were simulated from U(0, 1), after which one of the chains was transformed according to the transformations $f_{A,k}$, $f_{B,k}$, and $f_{C,k}$. The rejection rates of the multiple sample comparison test when varying the power, $k$, of the transformation were estimated from 10,000 simulations and are recorded in Figure 12. The observed test performance is independent of the number of chains used in the sample comparison. When compared to the rejection rates observed in the single sample power analysis in Section 4.1.4, the rejection rates show that the test sensitivity depends on the transformation in a similar way.

Figure 11: Effect of differences in sample variances. While chains 2-4 are drawn from the standard normal distribution, chain 1 is drawn from normal(0, 1.5), which can be seen as a ∪-shape in the rank plot of chain 1 and as a low frequency of small and large fractional ranks in the rank plots of chains 2-4. In the ECDF plot and the ECDF difference plot, the ECDF of chain 1 grows fast near the ends of the unit interval, where a higher than expected density of fractional ranks is covered. In the middle of the interval, the ECDF difference plot of chain 1 is decreasing, whereas those of chains 2-4 are increasing.

Figure 13: Test rejection rate when comparing chains with autocorrelation. On the left as a function of the AR-parameter value, and on the right as a function of the ratio between the bulk effective sample size, as defined by Vehtari et al. (2021), and the total sample size. The nominal rejection rate 0.05 is shown with a vertical line in both plots.
As the test is sensitive to autocorrelation, we recommend thinning the samples to reduce autocorrelation before using the sample comparison test.

Chains with autocorrelation
As samples generated by MCMC methods are typically autocorrelated, it is essential to analyse the performance of the sample comparison test on autocorrelated samples. Figure 13 presents rejection rates of the multiple sample test for 2, 4, and 8 chains produced by autoregressive models of order 1 (i.e., AR(1) models) with varying AR-parameter values. Each rejection rate is computed as the mean of 100,000 simulations. As seen in the figure, the higher the autocorrelation in the samples and the more chains are sampled, the more likely the test is to reject the hypothesis that the samples are drawn from the same underlying distribution. Thus, before using the graphical illustration or the corresponding test, the chains should be thinned to have negligible autocorrelation. The same holds for other common uniformity tests as well, as they rely on the assumption of pairwise independence of the draws.
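The effect of thinning on an AR(1) chain can be sketched as follows; the function name and the thinning factor are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

# AR(1) chains and the effect of thinning on lag-1 autocorrelation.
def ar1(phi, n, rng):
    # innovations scaled so that the stationary marginal is normal(0, 1)
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(scale=np.sqrt(1.0 - phi ** 2))
    return x

rng = np.random.default_rng(seed=7)
chain = ar1(0.8, 20_000, rng)

# lag-1 autocorrelation is close to phi; thinning by k leaves roughly phi**k
acf1 = np.corrcoef(chain[:-1], chain[1:])[0, 1]
thinned = chain[::8]
acf1_thinned = np.corrcoef(thinned[:-1], thinned[1:])[0, 1]
```

With $\varphi = 0.8$ and thinning by 8, the residual lag-1 autocorrelation is roughly $0.8^8 \approx 0.17$, illustrating why strong autocorrelation requires aggressive thinning before applying the test.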

Detecting model sampling issues: eight schools
We return to the eight schools model used to demonstrate SBC in Section 4.1.3. The issues detected with SBC earlier are also apparent when the multiple sample comparison is used to inspect the rank distribution between the four individual chains, each containing 1000 posterior draws after a warm-up period of 1000 steps. Even when sampled with more conservative settings of the sampler, we see from Figure 14 that the chains are not properly exploring the posterior, and thus the realized rank transformed chains have clearly different ECDFs. While the classical $\widehat{R}$ is estimated at 1.00, the improved $\widehat{R}$ diagnostic gives a value of 1.02, indicating possible convergence issues. One should also note that the sampling efficiency for $\tau$ in this model is very low, as both the bulk-ESS and the tail-ESS of Vehtari et al. (2021) are under 150 for the combined sample.
As recommended in Section 22.7 of the Stan User's Guide (Stan Development Team n.d.), these observed sampling issues of a hierarchical model with a weak likelihood contribution can often be avoided by using the non-centered parameterization of the model:
$$\tilde\theta_j \sim \mathrm{normal}(0, 1), \qquad \theta_j = \mu + \tau\,\tilde\theta_j, \qquad y_j \sim \mathrm{normal}(\theta_j, \sigma_j), \qquad j = 1, \dots, 8.$$
In this parameterization, the treatment effect $\theta_j$ is derived deterministically from the other parameter values, and instead $\tilde\theta_j$ is sampled. To keep the models comparable, we use the same conservative sampling options for the non-centered model, although this is not required to obtain well mixing chains. In Figure 15, we see an improvement in the sampling compared to the centered parameterization, as the sample ranks are distributed approximately uniformly among the four chains, implying that the chains are mixing well. Now both of the $\widehat{R}$ diagnostics agree with the graphical test on convergence, yielding values close to 1.00, while the sampling efficiency issues detected in the centered parameterization have also disappeared, with bulk-ESS and tail-ESS reaching 2200 and 1600, respectively.

Discussion
By providing a graphical test for uniformity and for the comparison of samples, we offer an accessible tool for use in many parts of a practical statistical workflow.
For assessing the uniformity of a single sample, we recommend the optimization-based adjustment method, as it is efficient even for large sample sizes. For comparing multiple samples, the simulation-based method is likely to be computationally more efficient than the optimization-based method. To speed up the computations, we recommend pre-computing adjusted $\hat\gamma$ values for a grid of sample sizes and numbers of samples (chains) and then interpolating (in log-log space) the adjustment as needed.
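The log-log interpolation of pre-computed adjustments can be sketched as follows; the grid values below are purely illustrative placeholders, not actual pre-computed adjustments.

```python
import numpy as np

# Interpolating pre-computed gamma adjustments in log-log space.
sizes = np.array([100.0, 1_000.0, 10_000.0, 100_000.0])   # sample sizes in grid
gammas = np.array([2.0e-3, 1.1e-3, 6.0e-4, 3.3e-4])       # placeholder gamma-hat

def interp_gamma(n):
    # linear interpolation of log(gamma) against log(n)
    return float(np.exp(np.interp(np.log(n), np.log(sizes), np.log(gammas))))
```

At a grid point the interpolation returns the tabulated value exactly, and between grid points it follows a power law in the sample size.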
In the examples, we used empirical PIT with SBC, where uniformity is expected by construction if the inference algorithm works correctly. PIT has also been used to compare predictive distributions. Specifically, in the LOO-PIT approach, PIT has been used to compare leave-one-out (LOO) cross-validation predictive distributions to the observations (e.g., Czado et al. 2009; Gneiting et al. 2007). Although the graphical LOO-PIT test is useful for visualizing model-data discrepancy, exact uniformity of LOO-PIT values can be expected only asymptotically given the true model. For example, if the data come from a normal distribution and are modeled with a normal distribution with unknown mean and scale, the posterior predictive distribution is a Student's t distribution that approaches a normal distribution only asymptotically. Thus, the use of graphical LOO-PIT tests needs further research.
We have assumed that the distributions g and p are continuous and only the fractional rank statistics $u_i$ from Eq. (2) are discrete. Our proposed methods do not work directly if g and p are discrete, as values obtained through PIT are then no longer uniform. Also, in the multiple sample comparison case, the rank statistics are no longer mutually distinct, as ties are possible. A potential approach to handling discrete g and p is to use randomized or non-randomized modifications of PIT values for discrete distributions, as discussed by Czado et al. (2009). However, developing proven and efficient algorithms for this purpose requires further work, which is left for future research.

Figure 15: Detecting model sampling issues: non-centered parameterization of the eight schools model. When inspecting the sampling of the parameter τ, even though the 95% confidence bands of the rank plots of chains 2 and 3 are exceeded by one bin each, the ECDF plot and the ECDF difference plot of the non-centered parameterization indicate no mixing issues, as the ECDFs of the fractional ranks of each chain stay within the 95% simultaneous confidence bands.
Figure 16: The difference between the expected first 100 order statistics of a sample of size 1000 and those of the target distribution (panels: Normal(0, 1) vs. AR(φ = −0.95), Thinned AR(φ = −0.95), Normal(0, 1) vs. AR(φ = 0.95), Thinned AR(φ = 0.95)). The expectations are computed from 1000 simulations. One can see that both the AR process with strong positive autocorrelation, φ = 0.95, and the process with negative autocorrelation, φ = −0.95, produce samples whose expected order statistics are biased towards the center of the distribution. When thinned according to the tail-ESS, resulting in thinning by 18 and 7 respectively, the samples align well with the expected order statistics of the target distribution.
To obtain a thinned sample of equal length, an expected tail-ESS was obtained by averaging over 10,000 simulations, and the sample length was chosen accordingly to yield thinned samples of length 1000. After the thinning, the order statistics closely match those of independent draws from the standard normal distribution. We inspected how the three above-mentioned thinning strategies manage to reduce the autocorrelation to which, as shown in Section 4.2.3, the ECDF based test is sensitive. In this experiment, 1, 2, and 4 chains of length 1000 were drawn from an AR(1) process with varying values of the AR parameter φ. The results of this experiment are displayed in Figure 17. As one can see, all three methods produce very similar thinning recommendations, and thus also similar test results, reducing the rejection rate to near the desired 5%.
If, after using some default thinning approach, there are still many extreme PIT estimates, it is possible that substantial autocorrelation remains in the sample, and a more careful investigation of the remaining autocorrelation is warranted. There is certainly a trade-off between computation time and how accurately the behaviour of the extreme tails needs to be examined. Often the major issues can be seen with less accurate computation, and a natural workflow can include iterative refinement of the diagnostic accuracy.
Although thinning may be needed for the uniformity test as part of SBC or PPC, when estimating quantities of interest that are not related to extreme tails, better efficiency is obtained by using all the posterior draws.

Figure 17: The behaviour of the three thinning strategies (tail-ESS based, bulk-ESS based, and traditional, compared to leaving the chains unthinned) when applied to a sample consisting of 1, 2, or 4 chains created with an AR(1) process. With the AR(1) process, the three strategies agree on the recommended thinning and are fairly successful at recovering the desired confidence level, while the rejection rate of the unthinned chains grows as more autocorrelation is introduced. For large values of the AR parameter, φ, the thinning amounts are quite high, which would require larger samples to reliably estimate properties of the target distribution.