Multiscale stick-breaking mixture models

We introduce a family of multiscale stick-breaking mixture models for Bayesian nonparametric density estimation. The Bayesian nonparametric literature is dominated by single-scale methods, with the exception of Pólya trees and allied approaches. Our proposal is based on a mixture specification exploiting an infinitely deep binary tree of random weights that grows according to a multiscale generalization of a large class of stick-breaking processes; this multiscale stick-breaking is paired with specific stochastic processes generating sequences of parameters that induce stochastically ordered kernel functions. Properties of this family of multiscale stick-breaking mixtures are described. Focusing on a Gaussian specification, a Markov chain Monte Carlo algorithm for posterior computation is introduced. The performance of the method is illustrated analyzing both synthetic and real data sets. The method is well suited for data living in $\mathbb{R}$ and is able to detect densities with varying degrees of smoothness and local features.

Bayesian nonparametrics (BNP) has received abundant attention in the last decades and is nowadays a well-established modelling option in the data scientist's toolbox. While standard parametric Bayesian inference focuses on the posterior distribution obtained by defining suitable prior distributions over a finite-dimensional parameter space $\Xi$, with $\xi \in \Xi$ typically characterizing a specific parametric distribution $G_\xi$ for data $y = (y_1, \dots, y_n)$, in BNP one defines prior distributions on infinite-dimensional probability spaces flexibly characterizing the distribution $G$. Under these settings, only minor assumptions are made on $G$, making the whole inferential procedure more robust.
The cornerstone of the discipline is the Dirichlet process (DP) introduced by Ferguson (1973). The DP is a stochastic process that defines a prior on the space of distribution functions; several generalizations of the DP have been proposed, such as the Pitman-Yor (PY) process (Perman et al., 1992; Pitman and Yor, 1997), the normalized random measures with independent increments (NRMI) (Regazzini et al., 2003; Nieto-Barajas et al., 2004; James et al., 2006, 2009) and, more generally, the Gibbs-type priors (Gnedin and Pitman, 2006). Realizations from these priors, however, are almost surely discrete probability functions and thus do not admit a density with respect to the Lebesgue measure. As a remedy to this characteristic, the DP and allied priors can be used as a prior distribution for the mixing measure of a mixture model. The first and most useful example of this is the DP mixture (DPM) of Gaussian kernels (Lo, 1984; Escobar and West, 1995).
Consistently with these contributions, in this paper we introduce a general class of multiscale stick-breaking processes with support on the space of discrete probability mass functions suitable as mixing measure in multiscale mixture of continuous kernels. The method generalizes Canale and Dunson (2016) in two directions. First, a more general multiscale stick-breaking process inspired by the PY process is introduced. Second, a multiscale base measure generating kernel densities that are stochastically ordered and defined on a general sample space is introduced. This construction leads to a class of prior measure for continuous densities that is robust to any specific prior parameters elicitation and that naturally adapts to the actual degree of smoothness of the true data generating distribution without the need of specifying several layers of hyper-priors.
The remainder of the paper is organized as follows.
In the next section we introduce our multiscale stick-breaking prior and describe some of its properties. Section 3 describes a Gibbs sampling algorithm for posterior computation. Section 4 illustrates the performance of the method through the analysis of several synthetic and real datasets. Section 5 concludes the paper.

Multiscale stick-breaking mixture
Let $y \in \mathcal{Y} \subset \mathbb{R}$ be a random variable with unknown density $f$. We assume for $f$ the following multiscale construction
$$f(y) = \sum_{s=0}^{\infty} \sum_{h=1}^{2^s} \pi_{s,h}\, \mathcal{K}(y; \theta_{s,h}), \qquad (1)$$
where $\mathcal{K}(\cdot; \theta)$ is a kernel function parametrized by $\theta \in \Theta$, and $\{\pi_{s,h}\}$ and $\{\theta_{s,h}\}$ are unknown sequences of positive weights summing to one and of parameters belonging to $\Theta$, respectively. We will refer to this model as a multiscale mixture (MSM) of kernel densities. This construction can be represented by an infinitely deep binary tree in which each node is indexed by a scale $s$ and a within-scale index $h = 1, \dots, 2^s$, and where each node is characterized by the pair $(\pi_{s,h}, \theta_{s,h})$. A cartoon of a truncation of this binary tree is reported in Figure 1. Model (1) can be equivalently written as
$$f(y) = \int \mathcal{K}(y; \theta)\, \mathrm{d}P(\theta), \qquad P = \sum_{s=0}^{\infty} \sum_{h=1}^{2^s} \pi_{s,h}\, \delta_{\theta_{s,h}}, \qquad (2)$$
where $\delta_x$ is the Dirac delta function. Thus a prior distribution for the multiscale mixture (1) is obtained by specifying suitable stochastic processes for the random mixing measure $P$ or, equivalently, for the random sequences $\{\pi_{s,h}\}$ and $\{\theta_{s,h}\}$. These two characterizations are described separately in the next sections.
Approximations of the mixture model (1) can be obtained by fixing an upper bound $\bar{s}$ for the depth of the tree. This truncation is obtained setting $S_{\bar{s},h} = 1$ for each $h = 1, \dots, 2^{\bar{s}}$, as suggested by Ishwaran and James (2001) for the standard single-scale mixture model and discussed by Canale and Dunson (2016) for the multiscale mixture of Bernstein polynomials. Such a truncation can be applied either when higher levels of resolution are not scientifically relevant or to reduce the computational burden.
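To fix ideas, a truncated version of model (1) can be evaluated in a few lines of Python. The dictionary representation of the tree nodes and the Gaussian kernel are illustrative choices for this sketch, not part of the model specification.

```python
import math

def normal_pdf(y, mu, omega):
    """Gaussian kernel K(y; mu, omega) with mean mu and variance omega."""
    return math.exp(-(y - mu) ** 2 / (2 * omega)) / math.sqrt(2 * math.pi * omega)

def msm_density(y, pi, theta):
    """f(y) = sum_{s,h} pi_{s,h} K(y; theta_{s,h}) over a truncated tree.

    `pi` and `theta` are dicts keyed by node (s, h), with s = 0, ..., s_bar
    and h = 1, ..., 2**s; theta[(s, h)] = (mu, omega) for the Gaussian kernel.
    """
    return sum(w * normal_pdf(y, *theta[node]) for node, w in pi.items())
```

With weights summing to one, `msm_density` integrates to one by construction, whatever the depth of the truncated tree.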

Multiscale mixture weights
We first focus on the sequence of mixture weights $\{\pi_{s,h}\}$. We introduce independent random variables $S_{s,h}$ and $R_{s,h}$ taking values in $(0,1)$ and describing the probability of taking a given path in the binary tree reported in Figure 1. Specifically, $S_{s,h}$ denotes the probability of stopping at node $h$ of scale $s$, while $R_{s,h}$ denotes the probability of taking the right path from scale $s$ to scale $s+1$ conditionally on not stopping at node $h$ of that scale. The weights are then defined as
$$\pi_{s,h} = S_{s,h} \prod_{r=0}^{s-1} \left(1 - S_{r, \lceil h 2^{r-s} \rceil}\right) T_{r, \lceil h 2^{r-s} \rceil}, \qquad (3)$$
where $T_{r, \lceil h 2^{r-s} \rceil} = R_{r, \lceil h 2^{r-s} \rceil}$ if $(r+1, \lceil h 2^{r-s+1} \rceil)$ is the right daughter of node $(r, \lceil h 2^{r-s} \rceil)$, and $T_{r, \lceil h 2^{r-s} \rceil} = 1 - R_{r, \lceil h 2^{r-s} \rceil}$ otherwise. This construction is reminiscent of the stick-breaking process (Sethuraman, 1994; Ishwaran and James, 2001) and can be described by the following metaphor: take a stick of length one and break it according to the law of $S_{0,1}$; the remainder of the stick is then randomly split in two parts according to the law of $R_{0,1}$; at a generic node $(s,h)$ the remainder of the stick, conditionally on the previous breaks, is broken according to $S_{s,h}$ and then split according to $R_{s,h}$. Different distributions for $S_{s,h}$ and $R_{s,h}$ lead to different characteristics of the tree of weights. Inspired by the general stick-breaking prior construction of Ishwaran and James (2001) we can set
$$S_{s,h} \sim \mathrm{Be}(a_{s,h}, b_{s,h}), \qquad R_{s,h} \sim \mathrm{Be}(c_{s,h}, d_{s,h}). \qquad (4)$$
This construction is a flexible generalization of Canale and Dunson (2016) who, mimicking the DP and its stick-breaking representation, fixed $a_{s,h} = 1$, $b_{s,h} = \alpha > 0$, and $c_{s,h} = d_{s,h} = \beta > 0$ for each $s$ and $h$. While considerably more flexible, the specification in (4) has different parameters for each node and its elicitation may be cumbersome in practice.
To avoid these complications while retaining a scale-dependent degree of flexibility in the distribution of the random weights, we consider $\delta \in [0,1)$ and $\alpha > -\delta$ and let
$$S_{s,h} \sim \mathrm{Be}(1-\delta, \, \alpha + (s+1)\delta), \qquad R_{s,h} \sim \mathrm{Be}(\beta, \beta). \qquad (5)$$
This specification is reminiscent of the PY process, a model which stands out as a good compromise between modelling flexibility and mathematical and computational tractability. This construction leads to a proper sequence of weights, as formalized in the next lemma, whose proof is reported in the Appendix.

Lemma 1 If $\delta \in [0,1)$ and $\alpha > -\delta$, the weights defined by (3) and (5) satisfy $\sum_{s=0}^{\infty} \sum_{h=1}^{2^s} \pi_{s,h} = 1$ almost surely.
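A minimal sketch of a prior draw from this multiscale stick-breaking, assuming the beta specification above and the truncation of the previous section (stopping probability set to one at the deepest scale); the node-keyed dictionary is an illustrative data structure.

```python
import random

def sample_msb_weights(alpha, delta, beta, s_bar, rng=None):
    """Sample {pi_{s,h}} from the multiscale stick-breaking prior, truncated
    at depth s_bar by setting S_{s_bar,h} = 1 so the weights sum to one.

    S_{s,h} ~ Be(1 - delta, alpha + (s + 1) * delta), R_{s,h} ~ Be(beta, beta).
    """
    rng = rng or random.Random(1)
    pi = {}
    mass = {(0, 1): 1.0}  # un-allocated stick mass reaching each node
    for s in range(s_bar + 1):
        nxt = {}
        for h in range(1, 2 ** s + 1):
            m = mass[(s, h)]
            stop = 1.0 if s == s_bar else rng.betavariate(1 - delta, alpha + (s + 1) * delta)
            right = rng.betavariate(beta, beta)
            pi[(s, h)] = m * stop
            rest = m * (1 - stop)
            nxt[(s + 1, 2 * h - 1)] = rest * (1 - right)  # left daughter
            nxt[(s + 1, 2 * h)] = rest * right            # right daughter
        mass = nxt
    return pi
```

By construction the returned weights are nonnegative and sum to one, mirroring Lemma 1 under truncation.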
The $\delta$ parameter allows for a greater degree of flexibility in describing how the random probability weights are allocated to the nodes. To see this, consider the expectation of $\pi_{s,h}$, i.e.
$$E(\pi_{s,h}) = E(S_s) \prod_{l=0}^{s-1} E(1 - S_l)\, E(T_l) = \frac{1}{2^s}\, E(S_s) \prod_{l=0}^{s-1} E(1 - S_l), \qquad (6)$$
where we discard the $h$ subscript in $S_l \sim \mathrm{Be}(1-\delta, \alpha+\delta(l+1))$ and $T_l \sim \mathrm{Be}(\beta, \beta)$ for ease of notation. This does not affect the calculation because, a priori, any path taken up to scale $s$ has the same probability and the distribution of the random variables in (5) depends on the scale $s$ only. The expected values of the random weights can be used to calculate the expected scale at which an observation falls, a measure of the expected resolution level, defined by $E(\mathcal{S}) = \sum_{s=0}^{\infty} s \sum_{h=1}^{2^s} E(\pi_{s,h})$. The latter simplifies to $\alpha$ when $\delta = 0$ but can be easily obtained numerically for $\delta > 0$.
To better understand the role of $\delta$, Figure 2 reports the total expected weight of scale $s$, defined as the expectation of $\pi_s = \sum_{h=1}^{2^s} \pi_{s,h}$, for different values of $\delta$ and $\alpha$. It is clear that increasing values of $\delta$ make the first levels of the tree less probable a priori, thus favoring a deeper tree. Note that this characteristic has to be interpreted in terms of prior robustness rather than as favouring rougher densities, since the prior mass is spread more evenly through the whole tree, allowing the posterior to concentrate on a tree of suitable depth. This interpretation is consistent with the role of the discount parameter of the PY process, which controls how much prior probability is concentrated around the prior expected number of occupied clusters and thus induces a posterior distribution that is more robust to the prior specification. See De Blasi et al. (2015) for a related discussion. We will show in Section 4.1 that this conjecture is empirically confirmed via simulations.
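The expected scale can be computed numerically as follows. The routine below evaluates $E(\mathcal{S})$ under a finite truncation depth (which collects the residual mass at the deepest scale, as in the truncation of Section 2) and inverts it by bisection to find the $\alpha$ matching a target value; the truncation depth and bisection bracket are illustrative choices.

```python
def expected_scale(alpha, delta, s_bar=30):
    """E(S) = sum_s s * E(pi_s) for a tree truncated at depth s_bar, where
    E(pi_s) = E(S_s) * prod_{l<s} E(1 - S_l) and S_l ~ Be(1 - delta, alpha + (l+1)*delta),
    so that E(S_l) = (1 - delta) / (alpha + delta * l + 1)."""
    es, surv = 0.0, 1.0
    for s in range(s_bar):
        e_stop = (1 - delta) / (alpha + delta * s + 1)
        es += s * surv * e_stop
        surv *= 1 - e_stop
    return es + s_bar * surv  # truncation collects the remaining mass

def alpha_for_scale(target, delta, lo=1e-6, hi=100.0):
    """Bisection for the alpha matching a target prior expected scale,
    exploiting the monotonicity of expected_scale in alpha."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expected_scale(mid, delta) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For $\delta = 0$ and a deep truncation, `expected_scale(alpha, 0.0)` recovers $\alpha$, consistently with the closed form above.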

Multiscale kernel's parameters
We now discuss the stochastic process for the sequence $\{\theta_{s,h}\}$. For $\mathcal{Y} = (0,1)$, Canale and Dunson (2016) assume that $\mathcal{K}(\cdot; \theta_{s,h})$ is a $\mathrm{Beta}(h, 2^s - h + 1)$ density, so that $\theta_{s,h}$ is identified by the pair $(h, 2^s - h + 1)$, a fixed set of parameters. This construction implicitly induces a mixture of Bernstein polynomials (Petrone, 1999a,b) at each scale $s$, and the randomness is entirely driven by the sequence of mixture weights. Here, instead, we consider the broader case where the $\theta_{s,h}$ are unknown parameters and where $\mathcal{K}(\cdot; \theta)$ is a location-scale kernel defined on a general sample space $\mathcal{Y}$. Under this specification, we partition the kernel's parameter space into a location and a scale part, letting $\Theta = \Theta_\mu \times \Theta_\omega$.

Location parameters
We first focus on defining a suitable sequence of locations $\{\mu_{s,h}\}$ that, consistently with the dyadic partition induced by the binary tree structure, uniformly covers the space $\Theta_\mu$. To this end, for any scale $s$ we introduce a partition of $\Theta_\mu$ by letting
$$\Theta_\mu = \bigcup_{h=1}^{2^s} \Theta_{\mu;s,h}, \qquad \Theta_{\mu;s,h} \cap \Theta_{\mu;s,h'} = \emptyset \ \text{ for } h \neq h', \qquad (7)$$
such that, for two neighboring scales $s$ and $s+1$,
$$\Theta_{\mu;s,h} = \Theta_{\mu;s+1,2h-1} \cup \Theta_{\mu;s+1,2h}. \qquad (8)$$
Let $G_0$ be a base probability measure defined on $\Theta_\mu$ and use it both to define $\Theta_{\mu;s,h}$ and to generate the multiscale locations $\mu_{s,h}$. Specifically, we set
$$\Theta_{\mu;s,h} = \left( q_{(h-1)2^{-s}}, \, q_{h2^{-s}} \right], \qquad (9)$$
where $q_r$ is the $r$-level quantile of the density of $G_0$. Then random $\mu_{s,h}$ are sampled proportionally to $G_0$ truncated to $\Theta_{\mu;s,h}$. While preserving the covering of $\Theta_\mu$, this construction allows for straightforward prior elicitation, similarly to what is done for the DP or the PY process. The next lemma, whose proof is reported in the Appendix, shows that a priori the random probability measure on $\Theta_\mu$ defined by
$$G = \sum_{s=0}^{\infty} \sum_{h=1}^{2^s} \pi_{s,h}\, \delta_{\mu_{s,h}} \qquad (10)$$
is centered around $G_0$.
Lemma 2 Let $G_0$ be a base probability measure defined on $\Theta_\mu$. Introduce a dyadic recursive partition of $\Theta_\mu$ defined by (7), (8), and (9) and $G_0$. If $G$ is the discrete measure (10) and each $\mu_{s,h}$ is randomly sampled proportionally to $G_0$ truncated to $\Theta_{\mu;s,h}$, then, for any set $A \subseteq \Theta_\mu$, $E\{G(A)\} = G_0(A)$.

Note that equation (10) is similar, in spirit, to the approximate PT (APT) prior of Cipolli and Hanson (2017), and in particular to their equation (3). The difference, however, is twofold. First, our weights come from a multiscale stick-breaking process, while those of the APT are the result of the PT recursive partitioning. The second, more evident, difference lies in how the Dirac delta masses are placed. While equation (3) of Cipolli and Hanson (2017) places them at the centers of the intervals $\Theta_{\mu;s,h}$, in our construction the masses are randomly placed inside $\Theta_{\mu;s,h}$. Hence, while the learning in the APT model is totally driven by the random weights, our approach allows for an update of the values $\mu_{s,h}$ a posteriori.
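As an illustration, the quantile-based dyadic partition and the truncated-$G_0$ sampling of the locations can be sketched as follows, taking a standard normal $G_0$ (an illustrative choice) and exploiting the fact that drawing from $G_0$ truncated to $\Theta_{\mu;s,h}$ amounts to drawing a uniform quantile level in $((h-1)2^{-s}, h2^{-s})$ and applying the inverse cdf.

```python
import random
from statistics import NormalDist

def location_interval(s, h, g0=NormalDist(0.0, 1.0)):
    """Dyadic cell Theta_{mu;s,h} = (q_{(h-1)/2^s}, q_{h/2^s}] built from the
    quantiles of the base measure G_0 (standard normal here, for illustration)."""
    lo = g0.inv_cdf((h - 1) / 2 ** s) if h > 1 else float("-inf")
    hi = g0.inv_cdf(h / 2 ** s) if h < 2 ** s else float("inf")
    return lo, hi

def sample_location(s, h, g0=NormalDist(0.0, 1.0), rng=None):
    """Draw mu_{s,h} proportionally to G_0 truncated to Theta_{mu;s,h},
    by inverse-cdf sampling of a uniform level inside ((h-1)/2^s, h/2^s)."""
    rng = rng or random.Random(0)
    # max(..., 1e-12) guards the (measure-zero) event rng.random() == 0.0,
    # which would push the level onto the cell boundary
    u = (h - 1 + max(rng.random(), 1e-12)) / 2 ** s
    return g0.inv_cdf(u)
```

Each draw lands inside its own cell, so the cells at a given scale jointly cover the support of $G_0$.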

Scale parameters
We now focus on describing the sequence of scale parameters $\{\omega_{s,h}\}$. Consistently with our multiscale setup, the scale parameters need to be ordered with respect to the scale levels of the binary tree so as to induce, on average, more concentrated kernels for increasing values of $s$. In general, the direction of the ordering depends on the actual role of the scale parameters in the specific kernel $\mathcal{K}(\cdot; \theta)$: for scale parameters proportional to the variances (respectively, precisions), a decreasing (respectively, increasing) sequence needs to be specified. Assuming that the $\omega_{s,h}$ are proportional to the variances of the kernels, we induce a stochastic ordering of the $\omega_{s,h}$'s at different scales $s$ in the following way. Let $H_0$ be a base probability measure defined on $\Theta_\omega$ with first moment $E_{H_0}(\omega) = \omega_0$ and variance $V_{H_0}(\omega) = \gamma_0$, both finite. Then let
$$\omega_{s,h} = c(s)\, W_{s,h}, \qquad W_{s,h} \overset{iid}{\sim} H_0, \qquad (11)$$
where $c(s)$ is a monotone decreasing deterministic function of $s$. Under this definition the sequence $\{\omega_{s,h}\}$ is stochastically decreasing and
$$E(\omega_{s+1,h'}) < E(\omega_{s,h}), \qquad V(\omega_{s+1,h'}) < V(\omega_{s,h}),$$
for any $h' = 1, \dots, 2^{s+1}$ and $h = 1, \dots, 2^s$. Consistently with our multiscale construction, the first inequality reflects the fact that from scale $s$ to scale $s+1$ we expect more concentrated kernels in equation (2). The second inequality, in addition, implies that the prior uncertainty about $\omega$ scales as well.
In the next section we discuss a specification of this construction by means of Gaussian kernels and suitable choices for G 0 , H 0 , and c(s).

Multiscale mixture of Gaussians
Although several choices for the kernel $\mathcal{K}(\cdot; \theta)$ can be made, the Gaussian one is probably the most natural when $\mathcal{Y} = \mathbb{R}$. Hence we specify the model described in the previous sections assuming $\mathcal{K}(\cdot; \theta) = \phi(\cdot; \mu, \omega)$, where $\phi(\cdot; \mu, \omega)$ is a Gaussian density with mean $\mu \in \mathbb{R}$ and variance $\omega > 0$. Under this specification equation (1) becomes a MSM of Gaussian densities, i.e.
$$f(y) = \sum_{s=0}^{\infty} \sum_{h=1}^{2^s} \pi_{s,h}\, \phi(y; \mu_{s,h}, \omega_{s,h}).$$
A pragmatic choice for the base measures consists in choosing conjugate priors. Specifically, we let $G_0$ be a Gaussian distribution with mean $\mu_0$ and variance $\kappa_0$. Similarly, we restrict the choice for $H_0$ to the inverse-gamma family of distributions. Following (11), we let $W_{s,h} \overset{iid}{\sim} \mathrm{IGa}(k, \lambda)$, leading to $E(W_{s,h}) = \lambda/(k-1)$ and $E(\omega_{s,h}) = c(s)\lambda/(k-1)$. A natural choice for the function $c(\cdot)$ is $c(s) = 2^{-s}$, which is equivalent to letting $\omega_{s,h} \sim \mathrm{IGa}(k, 2^{-s}\lambda)$.
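A prior draw of $\omega_{s,h}$ under this specification can be sketched as follows; the inverse-gamma variate is obtained as $\lambda$ divided by a gamma variate, and the default values of $k$ and $\lambda$ are placeholders rather than recommended settings.

```python
import random

def sample_scale(s, k=2.0, lam=1.0, rng=None):
    """omega_{s,h} = c(s) * W with c(s) = 2**-s and W ~ IGa(k, lam),
    equivalently omega_{s,h} ~ IGa(k, 2**-s * lam), so that
    E(omega_{s,h}) = 2**-s * lam / (k - 1) for k > 1."""
    rng = rng or random.Random(0)
    w = lam / rng.gammavariate(k, 1.0)  # inverse-gamma draw via a gamma draw
    return 2.0 ** -s * w
```

With $c(s) = 2^{-s}$ the prior mean (and, when finite, the prior variance) of the kernel variances halves at each additional scale, which is the stochastic ordering required above.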
Consistently with the discussion at the end of Section 2.2.1, this final specification is reminiscent of the smoothed approximate Pólya tree (SAPT) of Cipolli and Hanson (2017). In both specifications, indeed, the variance of each Gaussian mixture component is the result of a deterministic scale-decreasing component, represented by the function $c(s)$ here and by the parameter $d_k$ in the SAPT, and of a random quantity. The latter, while being controlled by a single parameter in the SAPT model, is component-specific in the proposed formulation, thus allowing for local learning of the value of each scale parameter.

Posterior Computation
In this section we introduce a Markov chain Monte Carlo (MCMC) algorithm to perform posterior inference under the model introduced in the previous section. In its general form, the algorithm consists of three steps: (i) allocate each observation to a multiscale cluster conditionally on the current values of $\{\pi_{s,h}\}$ and $\{\theta_{s,h}\}$; (ii) update $\{\pi_{s,h}\}$ conditionally on the cluster allocations; (iii) update $\{\theta_{s,h}\}$ conditionally on the cluster allocations.
In what follows we focus on the multiscale mixture of Gaussians and the related prior elicitation discussed in Section 2.3, but steps (i) and (ii) apply to a general kernel.
Suppose subject $i$ is assigned to node $(s_i, h_i)$, with $s_i$ the scale and $h_i$ the node within the scale. Conditionally on the values of the parameters, the posterior probability of subject $i$ belonging to node $(s,h)$ is simply
$$\Pr\{(s_i, h_i) = (s, h) \mid y_i\} \propto \pi_{s,h}\, \mathcal{K}(y_i; \theta_{s,h}).$$
Consider the total mass assigned to scale $s$, defined as $\pi_s = \sum_{h=1}^{2^s} \pi_{s,h}$, and let $\tilde{\pi}_{s,h} = \pi_{s,h}/\pi_s$. Under this notation, we can rewrite the likelihood for the $i$-th observation as
$$f(y_i) = \sum_{s=0}^{\infty} \pi_s \sum_{h=1}^{2^s} \tilde{\pi}_{s,h}\, \mathcal{K}(y_i; \theta_{s,h}).$$
Following Kalli et al. (2011), we introduce the auxiliary random variables $u_i \mid y_i, s_i \sim \mathrm{Unif}(0, \pi_{s_i})$ and consider the joint density
$$f(y_i, u_i, s_i) = 1\!\mathrm{I}_{(0, \pi_{s_i})}(u_i) \sum_{h=1}^{2^{s_i}} \tilde{\pi}_{s_i,h}\, \mathcal{K}(y_i; \theta_{s_i,h}),$$
where $1\!\mathrm{I}_A(x)$ is the indicator function that returns 1 if $x \in A$. Then we can update the scale $s_i$ and the node $h_i$ using
$$\Pr(s_i = s \mid u_i, y_i) \propto 1\!\mathrm{I}_{\{\pi_s > u_i\}} \sum_{h=1}^{2^s} \tilde{\pi}_{s,h}\, \mathcal{K}(y_i; \theta_{s,h}), \qquad \Pr(h_i = h \mid s_i, y_i) \propto \tilde{\pi}_{s_i,h}\, \mathcal{K}(y_i; \theta_{s_i,h}).$$
Conditionally on the cluster allocations, the update of the weights is obtained applying (3) to updated values of $S_{s,h}$ and $R_{s,h}$ sampled from
$$S_{s,h} \mid - \sim \mathrm{Be}\left(1 - \delta + n_{s,h}, \, \alpha + (s+1)\delta + v_{s,h} - n_{s,h}\right), \qquad R_{s,h} \mid - \sim \mathrm{Be}\left(\beta + r_{s,h}, \, \beta + v_{s,h} - n_{s,h} - r_{s,h}\right),$$
where $v_{s,h}$ is the number of subjects passing through node $(s,h)$, $n_{s,h}$ is the number of subjects stopping at node $(s,h)$, and $r_{s,h}$ is the number of subjects that continue to the right after passing through node $(s,h)$.
Conditionally on the cluster allocations, the update of the location and scale parameters follows from usual conjugate analysis arguments. Specifically, the location parameters are sampled from
$$\mu_{s,h} \mid - \sim \mathrm{N}(\hat{\mu}_{s,h}, \hat{\kappa}_{s,h}) \ \text{truncated to } \Theta_{\mu;s,h}, \qquad \hat{\kappa}_{s,h} = \left( \kappa_0^{-1} + n_{s,h}/\omega_{s,h} \right)^{-1}, \quad \hat{\mu}_{s,h} = \hat{\kappa}_{s,h} \left( \frac{\mu_0}{\kappa_0} + \frac{1}{\omega_{s,h}} \sum_{i : (s_i, h_i) = (s,h)} y_i \right),$$
and the scale parameters from
$$\omega_{s,h} \mid - \sim \mathrm{IGa}\left( k + \frac{n_{s,h}}{2}, \, 2^{-s}\lambda + \frac{1}{2} \sum_{i : (s_i, h_i) = (s,h)} (y_i - \mu_{s,h})^2 \right).$$
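The conjugate update of the tree probabilities in step (ii) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the dictionary mapping each node to its counts $(v_{s,h}, n_{s,h}, r_{s,h})$ is an assumed data structure, and the beta parameters combine the prior specification (5) with the binomial counts defined in Section 3.

```python
import random

def update_node_probs(counts, alpha, delta, beta, rng=None):
    """Conjugate updates for the tree probabilities, given cluster allocations.

    counts[(s, h)] = (v, n, r): subjects passing through, stopping at, and
    going right after node (s, h).  Returns dicts of sampled S_{s,h}, R_{s,h}.
    """
    rng = rng or random.Random(0)
    S, R = {}, {}
    for (s, h), (v, n, r) in counts.items():
        # stopping probability: beta-binomial conjugacy with n successes out of v
        S[(s, h)] = rng.betavariate(1 - delta + n, alpha + (s + 1) * delta + v - n)
        # right-path probability: r right moves out of the v - n that continue
        R[(s, h)] = rng.betavariate(beta + r, beta + (v - n - r))
    return S, R
```

Nodes not visited by any observation keep their prior distribution, since their counts are all zero.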

Illustrations
In this section we discuss the performance of the proposed MSM of Gaussian densities through the analysis of different synthetic and real data sets. Specifically, we investigate the role of the δ parameter in the next section and compare the method with the DPM and the SAPT in Section 4.2. Finally, in Sections 4.3 and 4.4 the method and one possible extension of it are used to analyze two different astronomical data sets.

The role of δ
As already discussed in the previous sections, the $\delta$ parameter allows for a greater degree of flexibility in the prior specification. In this section we empirically assess its role a posteriori. To this end we generate 100 samples of size $n = 50$ from three different densities and run the Gibbs sampling algorithm described in Section 3 to obtain an estimate of the posterior mean density for different values of the $\alpha$ and $\delta$ parameters. Data are generated from a finite mixture of Gaussian densities $f(y) = \sum_{k=1}^{K} \pi_k \phi(y; \mu_k, \omega_k)$ with an increasing level of local variability. Specifically, the first density is the standard normal distribution; the second density is a mixture of two components with $\mu_1 = -\mu_2 = 0.935$, $\omega_1 = \omega_2 = 1/8$, and $\pi_1 = \pi_2 = 1/2$; the last density has three components and parameters equal to $\mu_1 = 0$, $\mu_2 = 1.392$, $\mu_3 = -1.392$, $\omega_1 = \omega_2 = \omega_3 = 1/32$, $\pi_1 = 1/2$, $\pi_2 = \pi_3 = 1/3$. We considered $\delta$ equal to 0, 0.25, and 0.5 and numerically obtained the values of $\alpha$ in order to match a fixed prior expectation for the scale of the density. We considered three values for the prior expected scale that are consistent with the densities of the data generating processes, namely $E(\mathcal{S}) = 1, 3$, and 5. The related parameters are summarized in Table 1. We ran the Gibbs sampler described in Section 3 for 1000 iterations with a burn-in of 200. Visual inspection of the traceplots of the posterior mean density on a grid of domain points suggests no lack of convergence. Figure 3 reports the average (over the 100 replicates) of the posterior mean scale as a function of the discount parameter $\delta$. Each dot corresponds to a specific configuration of the $\alpha$ and $\delta$ parameters, and configurations with the same prior mean scale are connected. For $\delta = 0$ the prior choice drives the behavior of the posterior, i.e.
on average a lower posterior mean scale is obtained when the prior mean scale is equal to 1, while a higher posterior mean is obtained when the prior mean is equal to 5. This prior dependence is less evident for increasing values of $\delta$. Indeed, regardless of the prior specification, when the value of $\delta$ increases the posterior mean stabilizes in a neighborhood of a specific value. This behavior is consistent with what happens to the posterior mean number of clusters in a PY process mixture model (see De Blasi et al., 2015; Canale and Prünster, 2017, for related discussions).
Note that in addition to this prior robustness of the posterior mean scale, which is related to the actual degree of smoothness of the posterior mean density, we also observe an increasing precision of the density estimates in terms of $L_1$ distance between the posterior mean density and the true density. See the Supplementary Materials for additional details. The same simulation experiment was also carried out for data sets with sample size equal to 250. The qualitative results are similar but less striking, as the different posterior mean scales are closer for small values of $\delta$. This is expected and reflects the informative gain related to a larger sample size. Additional details and plots are reported in the Supplementary Materials.

Comparison with alternative methods
In this section we assess the performance of the proposed method and compare it with standard alternatives, namely a location-scale DPM of Gaussians, the gold standard Bayesian nonparametric model for density estimation, and, for its close relation to our method, a SAPT. Synthetic data are simulated from different scenarios corresponding to varying degrees of global and local smoothness. As benchmark scenarios we used the densities reported in Marron and Wand (1992). For the sake of brevity we report here the results for four scenarios (the results for all the densities of Marron and Wand (1992) are reported in the Supplementary Materials) corresponding to a smooth unimodal skew density (S1), a smooth bimodal density (S2), and two densities with sharp local variability (S3 and S4). These densities are plotted with a thick dark line in Figure 4. Obviously, we expect the DPM to perform better in the first two scenarios, which do not have any multiscale structure, and our MSM and the SAPT to perform better in the last two. For each scenario we generate 100 samples of sizes $n = 100$ and $n = 500$. Before fitting each model, data are standardized to have mean zero and variance one. For the DPM we used the marginal Pólya urn sampler implemented in the R package DPpackage (Jara et al., 2011), while for the SAPT we used the Gibbs sampler described in the paper and implemented in a set of R functions kindly provided by the authors, whom we warmly thank. The performance of the three competing methods is evaluated in terms of $L_1$ distance and Kullback-Leibler (KL) divergence of the posterior mean density from the true density, evaluated on a grid of points.
For our MSM we set $G_0$ to be the standard normal distribution and $\lambda = k = 2$ for the inverse-gamma distribution $H_0$. This choice for $\lambda$ and $k$ leads to a high variance for the scale parameters, reflecting mild prior information about these quantities. The maximum depth of the tree was set to $\bar{s} = 6$. Consistently with the discussion on $\delta$ of the previous sections, we assumed $\delta = 0.5$. The value of $\alpha$ was obtained numerically in order to match $E(\mathcal{S}) = 3$. Finally, we set $\beta = 1$. For the SAPT model we followed the specification presented in Cipolli and Hanson (2017) and additionally let $c \sim \mathrm{Ga}(1,1)$. The tree was grown up to $J = 6$ levels, consistently with the truncation induced in the multiscale stick-breaking. For the DPM the model specification is $f(\cdot) = \int \phi(\cdot; \mu, \omega)\, \mathrm{d}F(\mu, \omega)$, $F \sim \mathrm{DP}(\alpha, F_0)$, with $F_0 = \mathrm{N}(m_1, \omega/\kappa) \times \mathrm{IGa}(\nu_1, \psi_1)$ and additional hyperpriors $m_1 \sim \mathrm{N}(m_2, s_2)$, $\kappa \sim \mathrm{Ga}(\tau_1/2, \tau_2/2)$, with parameter values $a_0 = b_0 = 1$, $m_2 = 0$, $s_2 = 1$, $\nu_1 = \nu_2 = 3$, $\psi_2 = r s_2$, $r = 0.1$, $\tau_1 = 2$, $\tau_2 = 200$, as in Cipolli and Hanson (2017). Each MCMC algorithm was run for 1000 iterations with a burn-in period of 200. Note that, contrarily to the prior specification for our MSM, for both the DPM and the SAPT we are adding several hyperprior distributions. Table 2 reports the results of the simulation study. Overall the performance of the three methods is comparable. For $n = 100$ our MSM of Gaussians performs slightly better, on average, than the competing methods, both in terms of $L_1$ distance and KL divergence. The lower values of $L_1$ distance and KL divergence attained by our MSM are also coupled with less Monte Carlo variability. For the larger sample size of $n = 500$, all the methods improve in terms of precision, with our MSM always performing slightly better except in the first scenario where, as expected, the DPM achieves the best performance.
These results show that our MSM approach is able to adapt to the actual smoothness of the density, and its performance makes it a serious competitor of standard methods not only in situations where a multiscale structure is expected, but also when the density of the data is reasonably smooth. Figure 4 gives additional insights into the results summarized in Table 2. Each subplot of Figure 4 depicts, with thin bright lines, the posterior mean densities for each simulated data set (of size $n = 100$), with different subplots in the same row denoting the three competing methods. While the performance in terms of $L_1$ distance and Kullback-Leibler divergence is most of the time comparable across methods, it is evident that for some specific data sets the DPM estimate is a unimodal density, oversmoothing the true underlying density; see the second, third, and fourth rows. On the other hand, the SAPT estimates avoid this oversmoothing but exhibit a very high variability, with different data sets leading to estimates with prominent differences. The estimates obtained with our proposed method, instead, provide a better compromise between bias and variance, resulting in posterior estimates that are smooth but also able to capture abrupt local changes in the density, if present. Qualitatively similar results hold for the data sets with sample size $n = 500$. See the Supplementary Materials for details.

Roeder's galaxy speed data
As a benchmark data set to assess the performance of our method we use the famous galaxy velocity data set of Roeder (1990), reporting the velocities of 82 galaxies sampled from six conic sections of the Corona Borealis region. Our goal here is to achieve comparable results in terms of goodness of fit with respect to standard methods that have already been shown to provide meaningful results, namely the DPM and allied models, as we do not expect a prominent multiscale structure.
We used the same prior specification as in the previous section but a more conservative truncation of the binary tree, namely $\bar{s} = 8$. Figure 5 depicts the posterior mean density along with the histogram of the raw data and 95% credible bands. As expected, the method has a performance comparable to state-of-the-art competitors, achieving a log-pseudo-marginal-likelihood value (Gelfand and Dey, 1994) of $-217$, comparable to that of the DPM ($-212$) and the SAPT ($-215$).

Sloan Digital Sky Survey data
We consider a second astronomical data set consisting of $n = 24\,312$ galaxies drawn from the Sloan Digital Sky Survey first data release (see Abazajian et al., 2003, for details). The galaxies are partitioned into 25 different groups (Balogh et al., 2004) by crossing a separation into 5 groups by luminosity with a separation into 5 groups by density, the latter being a physical characteristic of the galaxy that should not be confused with a probability density function. Our goal here is twofold. From the astronomical point of view, considering this partition of the data as fixed, we want to estimate the probability density function of the difference of ultraviolet and red filters ($U - R$ color) for each group. Secondly, we use this example to show the flexibility of the proposed approach in dealing with complex situations by proposing a modification of the mixture model discussed in Section 2.3. Specifically, for each group $g = 1, \dots, 25$, we assume the multiscale mixture
$$f_g(y) = \sum_{s=0}^{\infty} \sum_{h=1}^{2^s} \pi^{(g)}_{s,h}\, \mathcal{K}(y; \theta_{s,h}), \qquad (12)$$
where each set of weights $\pi^{(g)}_{s,h}$ is assumed to be generated independently according to the multiscale stick-breaking process introduced in Section 2.1 and each group-specific density $f_g$ shares a common set of kernel parameters. The idea of a shared-kernel model accounts for the existence of common latent information shared between groups and allows for borrowing of information in learning the values of the kernel parameters. See Lock and Dunson (2015) for a related approach.
Posterior sampling under the extension (12) can be performed following the details of Section 3, updating each group-specific set of weights independently by simulating
$$S^{(g)}_{s,h} \mid - \sim \mathrm{Be}\left(1 - \delta + n^{(g)}_{s,h}, \, \alpha + (s+1)\delta + v^{(g)}_{s,h} - n^{(g)}_{s,h}\right), \qquad R^{(g)}_{s,h} \mid - \sim \mathrm{Be}\left(\beta + r^{(g)}_{s,h}, \, \beta + v^{(g)}_{s,h} - n^{(g)}_{s,h} - r^{(g)}_{s,h}\right),$$
where $v^{(g)}_{s,h}$, $n^{(g)}_{s,h}$, and $r^{(g)}_{s,h}$ are defined consistently with $v_{s,h}$, $n_{s,h}$, and $r_{s,h}$ of Section 3, but considering only the subjects preassigned to group $g$.
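Under the shared-kernel extension, evaluating a group-specific density only requires swapping the weight tree while reusing one common kernel dictionary; a minimal sketch, where the Gaussian kernel and the dictionary layout are illustrative assumptions.

```python
import math

def normal_pdf(y, mu, omega):
    """Gaussian kernel with mean mu and variance omega."""
    return math.exp(-(y - mu) ** 2 / (2 * omega)) / math.sqrt(2 * math.pi * omega)

def group_density(y, g, group_pi, theta):
    """f_g(y) = sum_{s,h} pi^{(g)}_{s,h} K(y; theta_{s,h}): one weight tree
    per group in `group_pi[g]`, a single shared kernel dictionary `theta`."""
    return sum(w * normal_pdf(y, *theta[node]) for node, w in group_pi[g].items())
```

Borrowing of information happens through `theta`: every group updates the same kernel parameters, while the group-specific trees in `group_pi` let each density allocate mass at its own resolution.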
Assuming the same prior specification of the previous sections with $\bar{s} = 4$, we ran 1000 iterations of the Gibbs sampler with a burn-in period of 200. Figure 6 reports, for each group, the estimated posterior mean density along with 95% credible bands. Many estimated densities show a clear bimodality, which previous studies justified with the presence of two subpopulations of galaxies, a blue and a red population (Balogh et al., 2004). The estimated densities clearly show different levels of global and local variability that our model is able to capture. Notably, the posterior uncertainty is very low due to the joint effect of the moderately large sample size and of the shared-kernel assumption.

Discussion
We introduced a family of multiscale stick-breaking mixture models for Bayesian nonparametric density estimation. This class of models is made of two building blocks: a flexible multiscale stick-breaking process inspired by the PY literature and a stochastic process that generates a dictionary of stochastically ordered kernel densities. We showed that the $\delta$ parameter of the multiscale stick-breaking process, related to the discount parameter of the PY process, makes the prior flexible and robust. Specifically, it allows the method to achieve results comparable to those obtainable by more basic models endowed with additional layers of hyperpriors, thus relieving the computational burden. The comparison with standard Bayesian nonparametric competitors showed, on average, superior performance in terms of finding the right smoothness of the unknown density. In addition, through the analysis of the Sloan Digital Sky Survey data, we also showed that the proposed formulation is amenable to extensions and generalizations to more complex settings involving hierarchical structures or covariates.

Appendix

Proof of Lemma 1 To establish the result, it is sufficient to show that the limit of each $\Delta_{s,h}$ for $s \to \infty$ is 0 almost surely. By symmetry, each $\Delta_{s,h}$ has the same distribution for every $h$, with expectation equal to
$$(1 - \delta)\, \frac{\prod_{j=0}^{s-1} \left(\alpha + \delta(j+1)\right)}{\prod_{j=0}^{s} \left(\alpha + \delta j + 1\right)},$$
which converges to zero as $s \to \infty$; since the $\Delta_{s,h}$ are nonnegative, the claim follows.

Proof of Lemma 2 The result follows by direct computation: conditioning on the weights and using the fact that each $\mu_{s,h}$ is drawn from $G_0$ restricted to $\Theta_{\mu;s,h}$, one obtains, for any set $A \subseteq \Theta_\mu$, $E\{G(A)\} = G_0(A)$.