Importance conditional sampling for Pitman-Yor mixtures

Nonparametric mixture models based on the Pitman-Yor process represent a flexible tool for density estimation and clustering. A natural generalization of the popular class of Dirichlet process mixture models, they allow for more robust inference on the number of components characterizing the distribution of the data. We propose a new sampling strategy for such models, named importance conditional sampling (ICS), which combines appealing properties of existing methods, including easy interpretability and a within-iteration parallelizable structure. An extensive simulation study highlights the efficiency of the proposed method which, unlike other conditional samplers, shows stable performance for different specifications of the parameters characterizing the Pitman-Yor process. We further show that the ICS approach can be naturally extended to other classes of computationally demanding models, such as nonparametric mixture models for partially exchangeable data.


Introduction
Bayesian nonparametric mixtures are flexible models for density estimation and clustering, nowadays a well-established modelling option for applied statisticians (Frühwirth-Schnatter et al., 2019). The first of such models to appear in the literature was the Dirichlet process (DP) (Ferguson, 1973) mixture of Gaussian kernels by Lo (1984), a contribution which paved the way to the definition of a wide variety of nonparametric mixture models. In recent years, increasing interest has been dedicated to the definition of mixture models based on nonparametric mixing random probability measures that go beyond the DP (e.g. Nieto-Barajas et al., 2004; Lijoi et al., 2005a,b, 2007; Argiento et al., 2016). Among these measures, the Pitman-Yor process (PY) (Perman et al., 1992; Pitman, 1995) stands out for conveniently combining mathematical tractability, interpretability, and modelling flexibility (see, e.g., De Blasi et al., 2015).
Let X = (X_1, . . ., X_n) be an n-dimensional sample of observations defined on some probability space (Ω, A, P) and taking values in X, and let F denote the space of all probability distributions on X. A Bayesian nonparametric mixture model is a random distribution taking values in F, defined as

f(x) = ∫ K(x; θ) p(dθ), (1.1)

where K(x; θ) is a kernel and p is a discrete random probability measure. In this paper we focus on p ∼ PY(σ, ϑ; P_0), that is, we assume that p is distributed as a PY process with discount parameter σ ∈ [0, 1), strength parameter ϑ > −σ, and diffuse base measure P_0 ∈ F. The DP is recovered as a special case when σ = 0. Model (1.1) can alternatively be written in hierarchical form as

X_i | θ_i ∼ K( · ; θ_i), independently for i = 1, . . ., n,
θ_i | p ∼ p, independently for i = 1, . . ., n,
p ∼ PY(σ, ϑ; P_0). (1.2)

The joint distribution of θ = (θ_1, . . ., θ_n) is characterized by the predictive distribution of the PY, which, for any i = 1, 2, . . ., is given by

P(θ_{i+1} ∈ · | θ_1, . . ., θ_i) = ((ϑ + σ k_i)/(ϑ + i)) P_0( · ) + Σ_{j=1}^{k_i} ((n_j − σ)/(ϑ + i)) δ_{θ*_j}( · ), (1.3)

where k_i is the number of distinct values θ*_j observed in the first i draws and n_j is the number of observed θ_l, for l = 1, . . ., i, coinciding with θ*_j, such that Σ_{j=1}^{k_i} n_j = i. Markov chain Monte Carlo (MCMC) sampling methods represent the gold standard for carrying out posterior inference based on nonparametric mixture models. Resorting to the terminology adopted by Papaspiliopoulos and Roberts (2008), most of the existing MCMC sampling methods for nonparametric mixtures can be classified into marginal and conditional, the two classes being characterized by different ways of dealing with the infinite-dimensional random probability measure p. While marginal methods rely on the possibility of analytically marginalizing p out, the conditional ones exploit suitable finite-dimensional summaries of p.
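For illustration, the predictive distribution (1.3) can be simulated directly as a generalized Pólya urn. The following is a minimal Python sketch; the function name and the uniform base measure are our own choices, made purely for concreteness.

```python
import random
from collections import Counter

def py_urn_sample(n, sigma, theta, base_sampler, rng):
    """Draw (theta_1, ..., theta_n) from the PY predictive urn (1.3):
    a new value from P0 with probability (theta + sigma*k)/(theta + i),
    an existing distinct value theta*_j with prob. prop. to n_j - sigma."""
    draws, counts = [], Counter()
    for i in range(n):
        k = len(counts)
        if i == 0 or rng.random() < (theta + sigma * k) / (theta + i):
            val = base_sampler(rng)       # fresh atom from the diffuse P0
        else:
            vals = list(counts)
            w = [counts[v] - sigma for v in vals]
            u, acc = rng.random() * sum(w), 0.0
            val = vals[-1]
            for v, wv in zip(vals, w):
                acc += wv
                if u <= acc:
                    val = v
                    break
        counts[val] += 1
        draws.append(val)
    return draws
```

With σ = 0 this reduces to the Blackwell-MacQueen urn of the DP; larger σ favours the creation of new distinct values.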
Marginal methods for nonparametric mixtures were first devised by Escobar (1988) and Escobar and West (1995), contributions which focused on DP mixtures of univariate Gaussian kernels. Extensions of such proposal include the works of Müller et al. (1996), MacEachern (1994), MacEachern and Müller (1998), Neal (2000), Barrios et al. (2013), Favaro and Teh (2013), and Lomelí et al. (2017). It is worth noting that, despite being the first class of MCMC methods for Bayesian nonparametric mixtures to appear in the literature, marginal methods are still routinely used in popular packages such as DPpackage (Jara et al., 2011), the de facto standard software for many Bayesian nonparametric models. Alternatively, conditional methods rely on the use of summaries, of finite and possibly random dimension, of realizations of p. To this end, the stick-breaking representation of the PY (Pitman and Yor, 1997) turns out to be very convenient. The almost sure discreteness of the PY allows p to be written as an infinite sum of random jumps {p_j}_{j≥1} occurring at random locations {θ̃_j}_{j≥1}, that is

p = Σ_{j=1}^∞ p_j δ_{θ̃_j}. (1.4)

The distribution of the locations is independent of that of the jumps and, while θ̃_j iid∼ P_0, the distribution of the jumps is characterized by the following stick-breaking construction:

V_j ind∼ Beta(1 − σ, ϑ + jσ), j = 1, 2, . . ., (1.5)
p_1 = V_1, (1.6)
p_j = V_j ∏_{l=1}^{j−1} (1 − V_l), j = 2, 3, . . .. (1.7)

A first example of conditional approach can be found in Ishwaran and James (2001) and Ishwaran and Zarepour (2002), contributions that consider a fixed truncation of the stick-breaking representation of a large class of random probability measures, and provide a bound for the introduced truncation error. Along similar lines, Muliere and Tardella (1998) and Arbel et al. (2019) make the truncation level of the DP and the PY, respectively, random, so as to make sure that the resulting error is smaller than a given threshold. Exact solutions avoiding truncation errors are the slice samplers of Walker (2007) and Kalli et al. (2011), the improved slice sampler of Ge et al. (2015), and the retrospective sampler of Papaspiliopoulos and Roberts (2008). It is worth noticing that, although originally introduced for the case of DP mixture models, the ideas behind slice and retrospective sampling algorithms naturally extend to the more general class of mixture models for which the mixing random probability measure admits a stick-breaking representation (Ishwaran and James, 2001), thus including the PY mixture model as a special case. In this context, Favaro and Walker (2013) propose a general framework for slice sampling the class of mixtures of σ-stable Poisson-Kingman models. Henceforth we will use the term slice sampling to refer to the proposals of Walker (2007) and Kalli et al. (2011), and not to the general definition of slice sampling.
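For illustration, the stick-breaking construction (1.5)-(1.7) can be simulated by truncation. The sketch below is ours; a standard normal P_0 is assumed purely for concreteness.

```python
import random

def py_stick_breaking(sigma, theta, n_atoms, rng):
    """Truncated stick-breaking draw from PY(sigma, theta; P0):
    V_j ~ Beta(1 - sigma, theta + j*sigma), p_1 = V_1,
    p_j = V_j * prod_{l<j} (1 - V_l); atoms iid from P0 = N(0, 1)."""
    weights, locations, stick = [], [], 1.0
    for j in range(1, n_atoms + 1):
        v = rng.betavariate(1.0 - sigma, theta + j * sigma)
        weights.append(stick * v)          # p_j
        locations.append(rng.gauss(0.0, 1.0))
        stick *= 1.0 - v                   # mass not yet assigned
    return weights, locations
```

The truncated weights sum to slightly less than one; the leftover mass `stick` is exactly the quantity whose slow decay, for large σ, drives the computational issues discussed below.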
Recent contributions have proposed hybrid strategies for posterior sampling of nonparametric mixture models, which combine steps of marginal and conditional algorithms and therefore cannot be classified as either type of algorithm. Notable examples are the hybrid sampler of Lomelí et al. (2015) for the general class of Poisson-Kingman mixture models, and the hybrid approach proposed by Dubey et al. (2020) for a wide range of Bayesian nonparametric models based on completely random measures.
Marginal methods are appealing for their simplicity and for the fact that the number of random elements that must be drawn at each iteration of the sampler, i.e. the components of θ, is deterministic and thus bounded. At the same time, quantifying the posterior uncertainty, e.g. via posterior credible sets, by using the output of marginal methods is in general not straightforward, since marginal methods do not generate realizations of the posterior distribution of f, but only of its conditional expectation, given θ, where the expectation is taken with respect to p. To this end, convenient strategies have been proposed, which typically exploit the possibility of sampling approximate realizations of p conditionally on the values of θ generated by the marginal algorithm (see discussions in Gelfand and Kottas, 2002; Taddy and Kottas, 2012; Arbel et al., 2016). Conditional methods, instead, produce approximate trajectories from the posterior distribution of f, which can be readily used to quantify posterior uncertainty. Moreover, by exploiting the conditional independence of the parameters θ_i, given p or a finite summary of it, conditional methods conveniently avoid sequentially updating the components of θ at each iteration of the MCMC, thus leading to a fully parallelizable updating step within each iteration. On the other hand, the random truncation at the core of conditional methods such as slice and retrospective samplers makes the number of atoms and jumps that must be drawn at each iteration of the algorithm random and unbounded. Confining our attention to the slice sampler of Walker (2007) and, equivalently, its dependent slice-efficient version (Kalli et al., 2011), we observe that, while its sampling routines are efficient and reliable in the DP case, the same does not hold for the more general class of PY mixtures, especially when large values of σ are considered. In practice, we noticed that, even for small sample sizes, the number of random elements
that must be drawn at each iteration of the algorithm can be extremely large, often so large as to make an actual implementation of the slice sampler for PY mixture models unfeasible. It is clear that this limitation represents a major problem, as the discount parameter σ greatly impacts the robustness of the prior with respect to model-based clustering (see Lijoi et al., 2007; Canale and Prünster, 2017). In order to shed some light on this aberrant behaviour, we investigate the distribution of the random number N_n of jumps that must be drawn at each iteration of a slice sampler, implemented to carry out posterior inference based on a sample of size n. We can define (see Appendix A for details) a data-free lower bound for N_n, that is, a random variable M_n such that N_n(ω) ≥ M_n(ω) for every ω ∈ Ω and for every sample of size n. M_n is distributed as

min{ l ≥ 1 : ∏_{j=1}^{l} (1 − V_j) < B_n },

where the V_j's are defined as in (1.5) and B_n ∼ Beta(1, n): studying the distribution of the lower bound M_n will provide useful insight on N_n. Note that, in addition, M_n coincides with the number of jumps to be drawn in order to generate a sample of size n by adapting to the PY case the retrospective sampling idea introduced for the DP by Papaspiliopoulos and Roberts (2008). Figure 1 shows the empirical distribution of M_n, with n = 100, for various combinations of ϑ and σ. The estimated median of the distribution of M_n grows with σ and, for any given value of σ, with ϑ. It can be appreciated that the size of the values taken by M_n, and thus by N_n, explodes when σ grows beyond 0.5, a fact which leads to the aforementioned computational bottlenecks in routine implementations of the slice sampler.
For example, when σ = 0.8, the estimated probability of M_n exceeding 10^9 is equal to 0.35, 0.42 and 0.63, for ϑ equal to 0.1, 1 and 10, respectively. From an analytic point of view, following Muliere and Tardella (1998), it is easy to show that in the DP case (i.e. σ = 0) the expected value of M_n is finite for any ϑ > 0. Beyond the DP case (i.e. σ ∈ (0, 1)), an application of Arbel et al. (2019) allows us to derive an analogous asymptotic result, which corroborates our empirical findings on the practical impossibility of using the slice sampler for PY mixtures with σ ≥ 0.5. See Proposition A.1 and related discussion in the Appendix.
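The behaviour of the lower bound M_n can be explored with a short Monte Carlo experiment. The sketch below is ours; it uses the characterization of M_n in terms of the stick-breaking variables V_j and of B_n ∼ Beta(1, n) (the minimum of n uniforms), with a hard cap on the number of jumps precisely because of the explosion discussed in the text.

```python
import random

def sample_M_n(n, sigma, theta, rng, cap=10**6):
    """One draw of the lower bound M_n: the number of stick-breaking
    jumps needed before the undrawn mass prod_{j<=l} (1 - V_j)
    falls below B_n ~ Beta(1, n)."""
    b = rng.betavariate(1.0, n)
    stick, l = 1.0, 0
    while stick >= b and l < cap:
        l += 1
        v = rng.betavariate(1.0 - sigma, theta + l * sigma)
        stick *= 1.0 - v
    return l
```

Repeating such draws for a grid of (σ, ϑ) values reproduces, qualitatively, the growth of the median of M_n with σ and ϑ reported in Figure 1.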
Herein, we propose a new sampling strategy for PY mixture models, named importance conditional sampling (ICS), which combines the appealing features of both conditional and marginal methods, while avoiding their weaknesses, including the computational bottleneck depicted in Figure 1. Like marginal methods, the ICS has a simple and interpretable sampling scheme, reminiscent of the Blackwell-MacQueen Pólya urn (Blackwell and MacQueen, 1973), and requires the update of only a bounded number of random elements per iteration; at the same time, being a conditional method, it allows for a fully parallelizable update of the parameters and for straightforward approximate posterior uncertainty quantification.
Our proposal exploits the posterior representation of the PY process derived by Pitman (1996), in combination with an efficient sampling-importance resampling idea. The structure of Pitman (1996)'s representation makes it suitable for numerical implementations of PY-based models, as indicated in Ishwaran and James (2001) and nicely implemented by Fall and Barat (2014).
The rest of the paper is organized as follows. The ICS is described in Section 2. Section 3 is dedicated to an extensive simulation study, comparing the performance of the ICS with state-of-the-art marginal and conditional sampling methods. Section 4 presents an illustrative application, where the proposed algorithm is used to analyse a data set from the Collaborative Perinatal Project (Klebanoff, 2009). In this context, Section 4.2 illustrates how the ICS approach can be extended to the case of nonparametric mixture models for partially exchangeable data. Section 5 concludes the paper with a discussion.
Additional results are presented in the Appendix.

Importance conditional sampling
The random elements involved in a PY mixture model defined as in (1.2) are the observations X, the latent parameters θ, and the PY random probability measure p. The joint distribution of (X, θ, p) can be written as

P(X ∈ dx, θ ∈ dθ, p ∈ dp) = ∏_{i=1}^{n} K(x_i; θ_i) dx_i ∏_{j=1}^{k_n} p(dθ*_j)^{n_j} Q(dp), (2.1)

where θ* = (θ*_1, . . ., θ*_{k_n}) is the vector of unique values in θ, with frequencies (n_1, . . ., n_{k_n}) such that Σ_{j=1}^{k_n} n_j = n, and Q is the distribution of p ∼ PY(σ, ϑ; P_0). In principle, the full conditional distributions of all random elements can be derived from (2.1) and used to devise a Gibbs sampler. Given that the vector X, conditionally on θ, is independent of p, the update of θ is the only step of the Gibbs sampler which works conditionally on a realization of the infinite-dimensional p. The conditional distribution p(θ | X, p) will therefore be the main focus of our attention: its study will allow us to identify a finite-dimensional summary of p, sufficient for the purpose of updating θ from its full conditional distribution. As a result, as far as p is concerned, only the update of its finite-dimensional summary will need to be included in the Gibbs sampler. Our proposal exploits a convenient representation of the posterior distribution of a PY process (Pitman, 1996), reported in the next proposition.
In the context of mixture models, Pitman's result implies that the full conditional distribution of p coincides with the distribution of a mixture composed by a PY process q with updated parameters, and a discrete random probability measure with k_n fixed jump points at t = (t*_1, . . ., t*_{k_n}). This means that, in the context of a Gibbs sampler, while, by conditional independence, the update of each parameter θ_i is done independently of the other parameters (θ_1, . . ., θ_{i−1}, θ_{i+1}, . . ., θ_n), the distinct values θ* taken by the parameters at a given iteration are carried on to the next iteration of the algorithm through p, in the form of fixed jump points t. Specifically, if Θ* = Θ \ {t*_1, . . ., t*_{k_n}}, then, for every i = 1, . . ., n, the full conditional distribution of the i-th parameter θ_i can be written as

P(θ_i ∈ dt | X, p) ∝ p_0 K(X_i; t) q(dt) + Σ_{j=1}^{k_n} p_j K(X_i; t*_j) δ_{t*_j}(dt), (2.2)

where q is the restriction of p to Θ*, p_0 = p(Θ*) and p_j = p(t*_j), for every j = 1, . . ., k_n. The full conditional in (2.2) is reminiscent of the Blackwell-MacQueen urn scheme characterizing the update of the parameters in marginal methods: the parameter θ_i can either coincide with one of the k_n fixed jump points of p or take a new value from a distribution proportional to K(X_i; t)q(dt). The key observation at the basis of the ICS is that, for the purpose of updating the parameters θ, there is no need to know the whole realization of p: it suffices to know the vector t of fixed jump points of p, the value p = (p_0, p_1, . . ., p_{k_n}) taken by p at the partition (Θ*, t*_1, . . ., t*_{k_n}) of Θ, and to be able to sample from a distribution proportional to K(X_i; t)q(dt). For the latter task, we adopt a sampling-importance resampling approach (see, e.g., Smith and Gelfand, 1992) with proposal distribution q. It is remarkable that such a solution allows us to approximately sample from the target distribution while avoiding the daunting task of simulating a realization of q itself. Indeed, for any m ≥ 1, a vector s = (s_1, . . ., s_m) such that s_i | q iid∼ q can be generated by means of an urn scheme exploiting (1.3).
Given the almost sure discreteness of q, the generated vector will show ties with positive probability and thus will feature r_m ≤ m distinct values (s*_1, . . ., s*_{r_m}), with frequencies (m_1, . . ., m_{r_m}) such that Σ_{j=1}^{r_m} m_j = m. In turn, importance weights for the resampling step are computed, for any ℓ = 1, . . ., r_m, as

ω_ℓ ∝ m_ℓ K(X_i; s*_ℓ),

thus without requiring the evaluation of q. As a result, the full conditional (2.2) can be rewritten as

P(θ_i ∈ dt | X, s, t, p) ∝ (p_0/m) Σ_{ℓ=1}^{r_m} m_ℓ K(X_i; s*_ℓ) δ_{s*_ℓ}(dt) + Σ_{j=1}^{k_n} p_j K(X_i; t*_j) δ_{t*_j}(dt). (2.3)

Once more we highlight an interesting analogy between the conditional approach we propose and marginal methods: the introduction of the auxiliary random variables s*_1, . . ., s*_{r_m} is reminiscent of the augmentation introduced in Algorithm 8 of Neal (2000), a marginal algorithm proposed to deal with a non-conjugate specification of the mixture model. From (2.3) it is straightforward to identify (s, t, p) as a finite-dimensional summary of p, sufficient for the purpose of updating the parameters θ_i from their full conditionals. This means that, as far as p is concerned, only its summary (s, t, p) must be included in the updating steps of the Gibbs sampler. To this end, Proposition 1 provides the basis for the update of (s, t, p). Indeed, conditionally on θ, the fixed jump points t coincide with the k_n distinct values appearing in θ, while the random vectors p and s are independent, with p ∼ Dirichlet(ϑ + σk_n, n_1 − σ, . . ., n_{k_n} − σ) and the joint distribution of s characterized by the predictive distribution of a PY(σ, ϑ + σk_n; P_0), that is, for any ℓ = 0, 1, . . ., m − 1,

P(s_{ℓ+1} ∈ · | s_1, . . ., s_ℓ) = ((ϑ + σk_n + σr_ℓ)/(ϑ + σk_n + ℓ)) P_0( · ) + Σ_{j=1}^{r_ℓ} ((m_j − σ)/(ϑ + σk_n + ℓ)) δ_{s*_j}( · ),

where (s*_1, . . ., s*_{r_ℓ}) is the vector of r_ℓ distinct values appearing in (s_1, . . ., s_ℓ), with corresponding frequencies (m_1, . . ., m_{r_ℓ}) such that Σ_{j=1}^{r_ℓ} m_j = ℓ.
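The sampling-importance resampling step just described admits a compact implementation: one of the distinct auxiliary values s*_j is resampled with probability proportional to m_j K(X_i; s*_j). The sketch below is ours; a univariate Gaussian kernel is assumed purely for illustration.

```python
import math
import random
from collections import Counter

def normal_kernel(x, loc, scale=1.0):
    """Gaussian kernel K(x; loc), assumed here for illustration."""
    return math.exp(-0.5 * ((x - loc) / scale) ** 2) / (scale * math.sqrt(2 * math.pi))

def sir_new_value(x_i, s, kernel, rng):
    """Resampling step of the ICS: pick one of the distinct auxiliary
    values s*_j with probability proportional to m_j * K(x_i; s*_j),
    where m_j is the multiplicity of s*_j in the auxiliary sample s."""
    counts = Counter(s)
    vals = list(counts)
    w = [counts[v] * kernel(x_i, v) for v in vals]
    u, acc = rng.random() * sum(w), 0.0
    for v, wv in zip(vals, w):
        acc += wv
        if u <= acc:
            return v
    return vals[-1]
```

Note that only kernel evaluations are required: the density of q never enters the weights, which is the point made in the text.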
Algorithm 1: ICS for PY mixture model
[1] set t^(r) equal to the vector of the k_n distinct values in θ^(r−1), with frequencies (n_1, . . ., n_{k_n});
[4] sample p^(r) ∼ Dirichlet(ϑ + σk_n, n_1 − σ, . . ., n_{k_n} − σ);
[6] for each ℓ = 0, . . ., m − 1, sample s_{ℓ+1}^(r) from the predictive distribution of a PY(σ, ϑ + σk_n; P_0), given (s_1^(r), . . ., s_ℓ^(r)); let r_m^(r) be the number of distinct values in s^(r);
[8] for each i = 1, . . ., n, update θ_i^(r) by sampling from the full conditional (2.3);
[11] let θ*^(r) be the vector of distinct parameters in θ^(r);
[13] for each j = 1, . . ., k_n, update θ*_j^(r) by sampling from its full conditional (acceleration step).

By combining the steps just described, as summarized in Algorithm 1, we can then devise a Gibbs sampler, which we name ICS. In Algorithm 1 and henceforth, the superscript (r) is used to denote the value taken by a random variable at the r-th iteration. In order to improve mixing, the ICS includes an acceleration step which consists in updating, at the end of each iteration, the distinct values θ* from their full conditional distributions. Namely, for every j = 1, . . ., k_n,

P(θ*_j ∈ dt | X, θ) ∝ P_0(dt) ∏_{i : θ_i = θ*_j} K(X_i; t).

Finally, a realization from the posterior distribution of (s, t, p) defines an approximate realization f_m of the posterior distribution of the random density defined in (1.1), that is

f_m(x) = (p_0/m) Σ_{ℓ=1}^{r_m} m_ℓ K(x; s*_ℓ) + Σ_{j=1}^{k_n} p_j K(x; t*_j).

If the algorithm is run for a total of R iterations, the first R_b of which are discarded as burn-in, then the posterior mean is estimated by

f̂(x) = (1/(R − R_b)) Σ_{r=R_b+1}^{R} f_m^(r)(x),

where f_m^(r) denotes the approximate density sampled from the posterior at the r-th iteration.
The set of densities f_m^(r) can also be used to quantify posterior uncertainty. It is worth remarking, though, that any such quantification is based on realizations of a finite-dimensional summary of the infinite-dimensional p and thus is, by its nature, approximate. For a quantification of the approximation error one could resort to Arbel et al. (2019).
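For concreteness, one sweep of the ICS can be sketched as follows, under simplifying assumptions that are ours and not part of the algorithm as stated: a unit-variance Gaussian kernel, a standard normal base measure P_0, and the acceleration step omitted.

```python
import math
import random
from collections import Counter

def dirichlet(params, rng):
    g = [rng.gammavariate(a, 1.0) for a in params]
    tot = sum(g)
    return [x / tot for x in g]

def kernel(x, mu):
    """Unit-variance Gaussian kernel, an assumption for this sketch."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def ics_iteration(X, th, sigma, theta, m, rng):
    """One simplified sweep of the ICS for a PY mixture of
    unit-variance Gaussians with P0 = N(0, 1)."""
    counts = Counter(th)                     # fixed jump points t and frequencies
    t_star = list(counts)
    k_n = len(t_star)
    # step [4]: p ~ Dirichlet(theta + sigma*k_n, n_1 - sigma, ..., n_kn - sigma)
    p = dirichlet([theta + sigma * k_n] + [counts[t] - sigma for t in t_star], rng)
    p0, pj = p[0], p[1:]
    # step [6]: auxiliary sample s via the predictive of PY(sigma, theta + sigma*k_n; P0)
    s_counts = Counter()
    for l in range(m):
        r_l = len(s_counts)
        a = theta + sigma * k_n
        if l == 0 or rng.random() < (a + sigma * r_l) / (a + l):
            val = rng.gauss(0.0, 1.0)
        else:
            vals = list(s_counts)
            w = [s_counts[v] - sigma for v in vals]
            u, acc = rng.random() * sum(w), 0.0
            val = vals[-1]
            for v, wv in zip(vals, w):
                acc += wv
                if u <= acc:
                    val = v
                    break
        s_counts[val] += 1
    # step [8]: update each th_i from the full conditional (2.3)
    new_th = []
    for x in X:
        vals = list(s_counts) + t_star
        w = [p0 * s_counts[v] * kernel(x, v) / m for v in s_counts] + \
            [pj[j] * kernel(x, t_star[j]) for j in range(k_n)]
        u, acc = rng.random() * sum(w), 0.0
        pick = vals[-1]
        for v, wv in zip(vals, w):
            acc += wv
            if u <= acc:
                pick = v
                break
        new_th.append(pick)
    return new_th
```

Iterating `ics_iteration` yields a Markov chain over the latent parameters; the acceleration step and the density estimate f_m would be added on top of this skeleton.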
It is instructive to consider how the ICS works in the special case of DP mixture models, that is, when σ = 0. In such a case, the steps described in Algorithm 1 can be nicely interpreted by resorting to three fundamental properties characterizing the DP, namely conjugacy, self-similarity, and availability of finite-dimensional distributions. More specifically, when σ = 0, step 4 of Algorithm 1 consists in generating the random weights p from a Dirichlet distribution of parameters (ϑ, n_1, . . ., n_{k_n}). This follows by combining the conjugacy of the DP (Ferguson, 1973), for which

p | θ ∼ DP( ϑ + n; (ϑ P_0 + Σ_{j=1}^{k_n} n_j δ_{θ*_j}) / (ϑ + n) ),

with the availability of finite-dimensional distributions of the DP (Ferguson, 1973), which provides the distribution of p, defined as the evaluation of the conditional distribution of p on the partition of Θ induced by θ. Moreover, when σ = 0, according to the predictive distribution displayed in step 6 of Algorithm 1, the auxiliary random variables s are exchangeable from q ∼ DP(ϑ; P_0), with q independent of p. This is nicely implied by the self-similarity of the DP (see, e.g., Ghosal, 2010), according to which q = p|_{Θ*} is independent of p|_{Θ\Θ*}, and therefore of p, and is distributed as a DP(ϑ P_0(Θ*); P_0|_{Θ*}), and by the diffuseness of P_0.
As a result, in the DP case, the auxiliary random variables s are generated from the prior model.

Simulation study
Aware that different implementations can lead to a biased comparison (see Kriegel et al., 2017, for an insightful discussion), we aimed at reducing such bias to a minimum by letting the four algorithms considered here share the same code for most sub-routines.
Throughout this section we consider synthetic data generated from a simple two-component mixture of Gaussians, namely f_0(x) = 0.75 φ(x; −2.5, 1) + 0.25 φ(x; 2.5, 1), with φ( · ; µ, σ²) denoting the density of a Gaussian random variable with mean µ and variance σ². All data were analyzed by means of the nonparametric mixture model defined in (1.1), specified by considering a univariate Gaussian kernel K(x; θ) = φ(x; µ, σ²), with θ = (µ, σ²), and by assuming a normal-inverse gamma base measure P_0 such that σ² ∼ IG(2, 1) and µ | σ² ∼ N(0, 5σ²). Different combinations of values for the parameters σ and ϑ, and for the sample size n, were considered. The results of this section are then obtained as averages over a specified number of replicates. All algorithms were run for 1 500 iterations, of which the first 500 were discarded as burn-in. Convergence of the chains was checked by visual inspection of the trace plots of randomly selected runs, which did not provide any evidence against it.
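Data from f_0 can be generated with a few lines; the sketch below is ours, the function name being purely illustrative.

```python
import random

def sample_f0(n, rng):
    """Draw n points from f0(x) = 0.75*N(-2.5, 1) + 0.25*N(2.5, 1)."""
    return [rng.gauss(-2.5, 1.0) if rng.random() < 0.75 else rng.gauss(2.5, 1.0)
            for _ in range(n)]
```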
The analysis was carried out by running BNPmix on R 4.0.3 on a 64-bit Windows machine with a 3.4-GHz Intel quad-core i7-3770 processor and 16 GB of RAM.
The first part of our investigation is dedicated to the role of m, the size of the auxiliary sample generated for the sampling-importance resampling step within the ICS. To this end, we considered two sample sizes, namely n = 100 and n = 1 000, and generated 10 data sets per size. Such data were then analyzed by considering a combination of values for the PY parameters, namely σ ∈ {0, 0.2, 0.4, 0.6, 0.8} and ϑ ∈ {1, 10}, and by running the ICS with m ∈ {1, 10, 100}. Estimated posterior densities, not displayed here, did not show any noticeable effect of m. More interesting findings were obtained when the analysis focused on the quality of the generated posterior sample: larger values for m appear to lead to a better mixing of the Markov chain, at the price of additional computational cost. These effects were measured by considering the effective sample size (ESS), computed by resorting to the CODA package (Plummer et al., 2006), and the ratio between runtime, in seconds, and ESS (time/ESS), both averaged over 100 replicates. Following the algorithmic performance analyses of Neal (2000), Papaspiliopoulos and Roberts (2008) and Kalli et al. (2011), the ESS was computed on the number of clusters (k_n, as far as the ICS is concerned) and on the deviance of the estimated density, the latter defined, for the r-th MCMC draw, as

dev(X, θ^(r)) = −2 Σ_{i=1}^{n} log f^(r)(X_i),

where f^(r) denotes the density estimated at the r-th iteration. The ratio time/ESS takes into account both the quality of the generated sample and the computational cost, and can be interpreted as the average time needed to sample one independent draw from the posterior. When implementing the ICS, the value of m can be tuned based on the desired algorithm performance in terms of quality of mixing and runtime. For the rest of the paper, and for the ease of illustration, we will work with m = 10, chosen as a sensible compromise between good mixing and controlled computational cost.
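As a rough stand-in for the spectral estimator implemented in CODA, the ESS of a scalar chain can be sketched with the initial-positive-sequence autocorrelation estimator; this simplified version, ours, will generally differ numerically from CODA's effectiveSize.

```python
def effective_sample_size(chain):
    """ESS = R / (1 + 2 * sum of leading positive-lag autocorrelations)."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    if var == 0:
        return float(n)                     # constant chain: no autocorrelation defined
    tau = 1.0
    for lag in range(1, n):
        acf = sum((chain[i] - mean) * (chain[i + lag] - mean)
                  for i in range(n - lag)) / (n * var)
        if acf <= 0:
            break                           # truncate at the first non-positive lag
        tau += 2.0 * acf
    return n / tau
```

Applied to the chain of the number of clusters or of the deviance, it returns values close to the chain length for nearly independent draws and much smaller values for sticky chains.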
The second part of the simulation study compares the performance of the ICS, the marginal sampler, and the dependent and independent slice-efficient samplers. For the sake of clarity, pseudocode of the implemented algorithms is provided in Appendix D. We considered the sample sizes n = 100, n = 250 and n = 1 000, and generated 10 data sets per size from f_0. These data were then analyzed by considering a combination of values for the PY parameters, namely σ ∈ {0, 0.2, 0.4, 0.6, 0.8} and ϑ ∈ {1, 10, 25}. The results we report are obtained, for each scenario, by averaging over the 10 replicates. As for the two slice samplers, due to the aforementioned explosion of the number of drawings per iteration when σ takes large values, our analysis was forcefully confined to the case σ ≤ 0.4. Moreover, the results referring to the case σ = 0.4 are approximate, as they were obtained by constraining the slice sampler to draw at most 10^5 components at each iteration: such a limitation of our study could not be avoided, given the otherwise unmanageable computational burden associated with this specific setting. Table B.1 in Appendix B shows that such bound was reached more often when large data sets were analyzed. For example, while for n = 100 the bound was reached on average in 12% and 15% of the iterations, for the independent and dependent slice-efficient samplers respectively, the same happened on average in 26% and 40% of the iterations when n = 1 000. For this reason, these specific results must be considered approximate and, as far as the runtime is concerned, conservative. The four algorithms were compared by using the same measures adopted in the first part of the simulation study, namely the ESS for the number of clusters, the ESS for the deviance of the estimated density, and the corresponding ratios time/ESS.

Figure 3: Simulated data. ESS computed on the random variable number of clusters, for ICS (gray), marginal sampler (orange), independent slice-efficient sampler (green) and dependent slice-efficient sampler (blue). Results are averaged over 10 replicates. The ×-shaped marker for the two slice samplers indicates that, when σ = 0.4, the value of the ESS is obtained with an arbitrary upper bound at 10^5 for the number of jumps drawn per iteration.
A clear trend can be appreciated in Figure 3, where the focus is on the ESS for the number of clusters: the marginal sampler displays, on average, a larger ESS than the ICS, whose ESS appears, in turn, uniformly larger than the ones characterizing the two slice samplers.
As for the latter two, while the displayed trend is similar, it can be appreciated that the independent algorithm is uniformly characterized by a better mixing. Results referring to the ratio time/ESS, for the variable number of clusters, are displayed in Figure 4: ICS and marginal sampler show in general similar performances. It is interesting to notice, though, that while the ratio time/ESS for the marginal algorithm is rather stable over the values of σ considered in the study, the same quantity for the ICS indicates a slightly better performance when σ takes large values. Conversely, the efficiency of the two slice samplers is heavily affected by the value of σ, with time/ESS exploding when σ moves from 0 to 0.4 and when ϑ increases. On the basis of this study, the slice samplers appear competitive options when σ ∈ {0, 0.2} and a small ϑ are considered. On the contrary, it is apparent that larger values of σ make the two slice samplers less efficient than the ICS and the marginal sampler.

Illustrations
We consider a data set from the Collaborative Perinatal Project (CPP), a large prospective study of the causes of neurological disorders and other pathologies in children in the United States. Pregnant women were enrolled between 1959 and 1966, when they showed up for prenatal care at one of 12 hospitals. While several measurements per pregnancy are available, our attention focuses on two main quantities: the gestational age (in weeks) and the logarithm of the concentration level of DDE, a persistent metabolite of the pesticide DDT, known to have an adverse impact on gestational age (Longnecker et al., 2001). Our analysis has a two-fold goal. First, we focus on estimating and comparing the joint density of gestational age and DDE for two groups of women, namely smokers and non-smokers.
This will also allow us to assess how the probability of premature birth varies conditionally on the level of DDE.Adopting a nonparametric mixture model will allow us to investigate the presence of clusters within the data.Second, we consider the data set partitioned in the 12 hospitals of the study and focus on the estimation of the hospital-specific distribution of the gestational age, by accounting for possible association across subsamples collected at different hospitals.For this analysis we adopt a nonparametric mixture model for partially exchangeable data and propose an extension of the ICS approach presented in Section 2.

Cross-hospital analysis
Smokers and non-smokers groups have sample size of n 1 = 1023 and n 2 = 1290, respectively.
For the two groups we independently model the joint distribution of gestational age and DDE by means of a PY mixture model (1.2) with bivariate Gaussian kernel function K(x; θ) = φ(x; θ), with θ = (µ, Σ), and with conjugate normal-inverse Wishart base measure P_0 = N-IW(m_0, k_0, ν_0, S_0). In the absence of precise prior information on the density to be estimated, we specify a vague base measure following an empirical Bayes approach. Specifically, we let m_0 be equal to the sample average, S_0 be equal to three times the empirical covariance, k_0 = 1/10, and ν_0 = 5. These settings are equivalent to assuming that the scale parameter of the generic mixture component coincides with 1.5 times the empirical covariance, while the location parameter is centered on the sample mean with prior variance equal to 10 times the scale parameter. Next, we set the parameters ϑ and σ on the basis of the prior distribution they imply on the number of clusters k_n within each group. Specifically, we set the prior expectation and prior standard deviation for k_n equal to 10 and 20, respectively. Our choice implies that a small probability (≈ 0.05) is assigned to the event k_n ≥ 50. This argument leads to setting (σ, ϑ) equal to (0.548, −0.485) and (0.5295, −0.4660) for the groups of smokers and non-smokers, respectively. The values specified for σ are thus larger than 0.5, a situation that is conveniently tackled by the ICS, as displayed by the simulation study of Section 3.
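The elicitation of (σ, ϑ) from a prior guess on the number of clusters can be reproduced by Monte Carlo, exploiting the fact that, under the predictive (1.3), only the new-versus-existing decision matters for k_n. The sketch below is ours; in practice one would scan pairs (σ, ϑ) until the simulated prior mean and standard deviation of k_n match the desired values.

```python
import random

def prior_clusters(n, sigma, theta, reps=200, rng=None):
    """Monte Carlo prior mean and sd of the number of clusters k_n
    under PY(sigma, theta), simulating only the urn's new-value events."""
    rng = rng or random.Random(0)
    ks = []
    for _ in range(reps):
        k = 0
        for i in range(n):
            # a new cluster is created with prob. (theta + sigma*k)/(theta + i)
            if i == 0 or rng.random() < (theta + sigma * k) / (theta + i):
                k += 1
        ks.append(k)
    mean = sum(ks) / reps
    sd = (sum((k - mean) ** 2 for k in ks) / reps) ** 0.5
    return mean, sd
```

Increasing σ at fixed ϑ inflates both the prior mean and the prior dispersion of k_n, which is why matching a mean of 10 with a standard deviation of 20 forces σ above 0.5.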
An alternative modelling strategy is achieved by introducing a hyperprior distribution for both σ and ϑ. While not explored in this illustration, it is worth stressing that this strategy might be conveniently implemented by adopting the ICS: if the prior on σ is defined on (0, 1), an implementation of the model requires a sampler whose efficiency is not compromised by the specific values of σ explored by the chain.
The analysis of both samples was carried out by running the ICS for 12 000 iterations, with the first 7 000 discarded as burn-in. Convergence of the chain was assessed as satisfactory by visually investigating the trace plots and by means of Geweke's diagnostic (Geweke, 1992). Running the analysis of the two samples took less than two minutes in total. It is important to stress that, given the model specification, the same analysis could not be carried out by implementing the slice samplers described in Algorithms D.2 and D.3, as the value of σ would make the computation endless. We could instead implement the marginal sampler described in Algorithm D.1, which, as expected, took considerably longer than the ICS (about 11 minutes), due to the moderately large sample sizes.
The contour curves of the estimated joint densities of gestational age and DDE for the two groups are displayed in the left panel of Figure 5 and suggest different distributions between smokers and non-smokers, especially when large values of DDE are considered.
Differences between the two groups are further highlighted by the right panel of Figure 5, which shows the estimated probability, along with the corresponding pointwise 90% posterior credible bands, of premature birth (i.e. gestational age smaller than 37 weeks), conditionally on the value taken by DDE, for the two groups. Once again, a difference between smokers and non-smokers can be appreciated for large values of DDE, although sizeable uncertainty is associated with the posterior estimates, as displayed by the wide credible bands.

Figure 5: CPP cross-hospital data. Left: observations and contour curves of the estimated joint posterior density of gestational age and DDE, for smokers (yellow dots and curves) and non-smokers (black dots and curves). Right: estimated probability of premature birth (gestational age below 37 weeks), conditionally on the level of DDE, for smokers (yellow curves) and non-smokers (black curves), and associated pointwise 90% quantile-based posterior credible bands (filled areas).
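To make explicit how such a conditional probability can be evaluated from a fitted bivariate Gaussian mixture (e.g. one posterior draw of weights, means, and covariances), here is a minimal Python sketch; the function name and input layout are hypothetical, and in practice the computation would be averaged over posterior draws to obtain the point estimate and credible bands.

```python
import math

def norm_cdf(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x, mean, var):
    """Univariate normal density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def premature_prob(components, dde, cutoff=37.0):
    """P(gestational age < cutoff | DDE = dde) under a bivariate Gaussian mixture.

    components: list of (weight, mu, Sigma), with mu = (mu_ga, mu_dde) and
    Sigma = ((s11, s12), (s12, s22)); coordinate 1 is gestational age, 2 is DDE.
    Uses the standard conditional-normal formulas within each component and
    reweights components by their marginal density at the observed DDE value.
    """
    num = den = 0.0
    for w, mu, S in components:
        marg = w * norm_pdf(dde, mu[1], S[1][1])           # w_j * phi(dde; mu_2, s22)
        cmean = mu[0] + S[0][1] / S[1][1] * (dde - mu[1])  # conditional mean of GA
        cvar = S[0][0] - S[0][1] ** 2 / S[1][1]            # conditional variance of GA
        num += marg * norm_cdf((cutoff - cmean) / math.sqrt(cvar))
        den += marg
    return num / den
```

With a single component and zero correlation between the two coordinates, the result reduces to the marginal probability Φ((37 − µ_GA)/σ_GA), independently of the DDE value, as expected.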

A mixture model for partially exchangeable data
Smokers and non-smokers data are analyzed independently. For each group, heterogeneity across hospitals suggests assuming that the data are partially exchangeable, in the sense of de Finetti (1938). To accommodate this assumption, we consider a mixture model for partially exchangeable data, where the stratum-specific mixing random probability measures form the components of a dependent Dirichlet process. Within this flexible class of processes (see Foti and Williamson, 2015, and references therein), we consider the Griffiths-Milne dependent Dirichlet process (GM-DDP), as defined and studied in Lijoi et al. (2014a,b). For an allied approach see Griffin et al. (2013). Let X_{i,l} be the gestational age of the i-th woman in the l-th hospital, and θ_l be the vector of latent variables θ_{i,l} referring to the l-th hospital. The mixture model can be represented in hierarchical form as

X_{i,l} | θ_{i,l} ∼ K(X_{i,l}; θ_{i,l}), θ_{i,l} | p̃_l iid∼ p̃_l, (p̃_1, . . ., p̃_L) ∼ GM-DDP(ϑ, z; P_0), (4.1)

with l = 1, . . ., L, i = 1, . . ., n_l, ϑ > 0, z ∈ (0, 1), and P_0 a probability distribution on R × R^+. The GM-DDP distribution of the vector (p̃_1, . . ., p̃_L) coincides with the distribution of the vector of random probability measures whose components are defined, for every l = 1, . . ., L, as

p̃_l = w_l γ_l + (1 − w_l) γ_0,

where γ_1, . . ., γ_L iid∼ DP(ϑz; P_0) and γ_0 ∼ DP(ϑ(1 − z); P_0) is independent of γ_l, for any l = 1, . . ., L. Moreover, the vector of random weights w = (w_1, . . ., w_L), taking values in [0, 1]^L, is distributed as a multivariate beta with parameters (ϑz, . . ., ϑz, ϑ(1 − z)), as defined in Olkin and Liu (2003), and its components are independent of the random probability measures γ_0, γ_1, . . ., γ_L. As a result, the random probabilities p̃_l are, marginally, identically distributed, with p̃_l ∼ DP(ϑ; P_0) (see Lijoi et al., 2014a, for details).

ICS for GM-DDP mixture model and its application
The ICS can be easily adapted to a variety of models. For example, it naturally fits the partially exchangeable framework of model (4.1). The ICS algorithm for GM-DDP mixture models is described in Algorithm C.1 and consists of three main steps. First, conditionally on the allocation of observations to clusters referring either to the idiosyncratic processes γ_l, with l = 1, . . ., L, or to the common process γ_0, summaries of all the processes, that is (s_l, t_l, p_l) for l = 0, . . ., L, are updated as done in Section 2 for a single process, with the proviso that σ = 0. Second, the latent variables θ_{i,l} are updated for every l = 1, . . ., L and i = 1, . . ., n_l. Third, the components of w are sampled. The full conditional distributions for θ_{i,l} and w are provided in Appendix C.
Model (4.1) is specified by assuming a univariate Gaussian kernel and a normal-inverse gamma base measure P_0 = N-IG(0, 5, 4, 1). Moreover, the specification ϑ = 1 and z = 0.5 is adopted, the latter choice corresponding to equal prior weights assigned to the idiosyncratic components γ_l and the common component γ_0. The ICS algorithm for the GM-DDP mixture model was run for 10 000 iterations, the first 5 000 of which were discarded as burn-in. Estimating the posterior densities for smokers and non-smokers required a total runtime of less than two and a half minutes. Convergence of the chains was assessed by visually investigating the trace plots, which did not provide any evidence against it. Figure 6 shows the estimated densities of gestational age, for each stratum, with a comparison between smokers and non-smokers. The distribution for smokers is overall more skewed and shifted to the left relative to that for non-smokers, indicating the expected adverse effect of smoking on gestational age.

Discussion
We proposed a new sampling strategy for PY mixture models, named ICS, which combines desirable properties of existing marginal and conditional methods: the ICS shares easy interpretability with marginal methods, while allowing, like conditional samplers, for a parallelizable update of the latent parameters θ and for a straightforward quantification of posterior uncertainty. The simulation study of Section 3 showed that the ICS overcomes some of the computational bottlenecks characterizing the conditional methods considered in the comparison. Specifically, the ICS can be implemented for any value of the discount parameter σ, with its efficiency being stable across specifications of σ. This is appealing since the discount parameter plays a crucial modelling role when PY mixture models are used for model-based clustering: the ICS allows for an efficient implementation of such models, without the need to set artificial constraints on the value of σ. As far as the comparison between the ICS and other algorithms is concerned, it is important to remark that the independent slice-efficient algorithm proposed by Kalli et al. (2011) is more general than the one considered in Section 3, as other specifications of the deterministic sequence ξ_1, ξ_2, . . . are possible. As nicely discussed by Kalli et al. (2011), the choice of such a sequence "is a delicate issue and any choice has to balance efficiency and computational time". Alternative specifications of the sequence may be explored on a case-by-case basis but, in our experience, the computational time can be reduced only at the cost of worsening the mixing of the algorithm. It is also worth remarking that the ICS does not rely on any assumption of conjugacy between base measure and kernel, and thus can be considered by all means a valid alternative to the celebrated Algorithm 8 of Neal (2000) when non-conjugate mixture models are to be implemented. Finally, while originally introduced to overcome computational problems arising in the implementation of algorithms for PY mixture models, the idea behind the ICS approach can be naturally extended to other classes of computationally demanding models.
As an example, we applied the same idea to posterior inference based on a flexible class of mixture models for partially exchangeable data. Other extensions are also possible and are currently the subject of ongoing research.
A   On the number of jumps to be drawn with the slice sampler

Let N_n be the random number of jumps which need to be drawn at each iteration of a slice sampler (Walker, 2007) or, equivalently, of its dependent slice-efficient version (Kalli et al., 2011), implemented to carry out posterior inference based on a sample of size n. Conditionally on the cluster assignment variables c_1, . . ., c_n and on the weights p_{c_1}, . . ., p_{c_n} of the non-empty components of the mixture, N_n is given by

N_n = min{ l ≥ 1 : ∑_{j≤l} p_j > 1 − min_{1≤i≤n} U_i p_{c_i} },

where the random weights p_j are defined as in (1.5) and U_1, . . ., U_n are independent uniform random variables, independent of the p_j's. We next define a second random variable

M_n = min{ l ≥ 1 : ∑_{j≤l} p_j > 1 − min_{1≤i≤n} U_i }.

Since U_i p_{c_i} ≤ U_i, the random number M_n is a data-free lower bound for N_n, the inequality M_n(ω) ≤ N_n(ω) holding for every ω ∈ Ω. Studying the distribution of M_n will shed light on the distribution of its upper bound N_n. Interestingly, M_n also represents the random number of jumps to be drawn in order to generate a sample of size n from a PY process by adapting the retrospective sampling idea of Papaspiliopoulos and Roberts (2008), described in their Section 2 for the DP case. The distribution of M_n coincides with the distribution of min{ l ≥ 1 : ∏_{j≤l} (1 − V_j) < B_n }, where the stick-breaking variables (V_j)_{j≥1} are defined as in (1.5) and B_n is a beta random variable with parameters 1 and n. Following Muliere and Tardella (1998), it is easy to show that, when σ = 0, M_n − 1 is distributed as a mixture of Poisson distributions, specifically (M_n − 1) | B_n ∼ Poisson(ϑ log(1/B_n)). This leads to E[M_n] = ϑH_n + 1, where H_n = ∑_{l=1}^{n} l^{−1} is the n-th harmonic number. It is worth noting that, for n → ∞, E[M_n] ≈ ϑ log(n), that is, the growth is logarithmic in n, while the contribution of ϑ is linear. As for the PY process, we resort to Arbel et al.
(2019), where the asymptotic distribution of the minimum number of jumps of a PY process, needed to guarantee that the truncation error is smaller than a deterministic threshold, is studied. In the statement of the proposition below, T_{σ,ϑ}, independent of B_n, denotes a polynomially tilted stable random variable (Devroye, 2009), with probability density function proportional to t^{−ϑ} f_σ(t), where f_σ is the density function of a one-sided σ-stable random variable with Laplace transform exp{−λ^σ}.
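The representation of M_n through the PY stick-breaking variables lends itself to direct simulation; the following sketch (illustrative Python, not the paper's code) draws M_n and numerically checks the identity E[M_n] = ϑH_n + 1 in the DP case σ = 0.

```python
import random

def draw_Mn(sigma, theta, n, rng):
    """One draw of M_n = min{ l >= 1 : prod_{j<=l} (1 - V_j) < B_n }, with
    V_j ~ Beta(1 - sigma, theta + j*sigma) (PY stick-breaking weights) and
    B_n ~ Beta(1, n), i.e. the minimum of n independent uniforms."""
    b = rng.betavariate(1.0, n)
    stick, l = 1.0, 0
    while stick >= b:
        l += 1
        stick *= 1.0 - rng.betavariate(1.0 - sigma, theta + l * sigma)
    return l

def mean_Mn(sigma, theta, n, reps=20000, seed=1):
    """Monte Carlo estimate of E[M_n]."""
    rng = random.Random(seed)
    return sum(draw_Mn(sigma, theta, n, rng) for _ in range(reps)) / reps
```

For σ = 0 the simulated mean is close to ϑH_n + 1, while for σ > 0 the number of required jumps grows dramatically, in line with the heavier tails of the PY stick-breaking weights.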

B Additional details on the simulation study
This section provides additional results for the simulation study presented in Section 3. Table B.1 reports the number of times the upper bound for the number of jumps drawn at each
We performed a simulation study to analyze the performance of the ICS algorithm and to compare it with marginal and slice samplers. For the latter, two versions proposed by Kalli et al. (2011) were considered, namely the dependent and the independent slice-efficient algorithms. The independent version of the algorithm requires the specification of a deterministic sequence ξ_1, ξ_2, . . ., which in our implementation was set equal to E[p_1], E[p_2], . . ., with the p_j's defined in (1.5), in analogy with what was proposed by Kalli et al. (2011) for the DP (see Algorithm D.3 in the Appendix for more details). All algorithms were written in C++ and are implemented in the BNPmix package (Corradin et al., 2021), available on CRAN.

Figure 2: Simulated data. ICS: ESS computed for the number of clusters (top row) and ratio between runtime (in seconds) and ESS for the same quantity (bottom row). Results are averaged over 100 replicates.

Figure 4: Simulated data. Ratio of runtime (in seconds) over ESS, in log scale, computed for the number of clusters, for the ICS (gray), marginal sampler (orange), independent slice-efficient sampler (green), and dependent slice-efficient sampler (blue). Results are averaged over 10 replicates. The ×-shaped marker for the two slice samplers indicates that, when σ = 0.4, the value of time/ESS was obtained by imposing an arbitrary upper bound of 10^5 on the number of jumps drawn per iteration.
Similar considerations can be drawn when analyzing performance in terms of the deviance of the estimated densities, with the plots for ESS and the ratio time/ESS displayed in Figures B.2 and B.3.
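For completeness, the ESS figures used as efficiency measure throughout can be computed with a standard initial-positive-sequence estimator of the integrated autocorrelation time (Geyer, 1992); the sketch below is a generic illustration, not the implementation behind the reported results.

```python
def ess(chain):
    """Effective sample size of an MCMC chain:
      ESS = M / tau,  tau = 1 + 2 * sum_{k>=1} rho_k,
    with the autocorrelation sum estimated by Geyer's initial positive
    sequence: autocovariances are summed in consecutive pairs
    (gamma_{2m} + gamma_{2m+1}) and the sum is truncated at the first
    non-positive pair."""
    m = len(chain)
    mu = sum(chain) / m
    maxlag = m // 2
    c = [sum((chain[t] - mu) * (chain[t + k] - mu) for t in range(m - k)) / m
         for k in range(maxlag)]
    if c[0] == 0.0:
        return float(m)  # constant chain: conventionally return M
    tau, k = 0.0, 0
    while k + 1 < maxlag:
        pair = c[k] + c[k + 1]
        if pair <= 0.0:
            break
        tau += pair
        k += 2
    tau = max(2.0 * tau / c[0] - 1.0, 1.0)
    return m / tau
```

For a chain of posterior draws of the number of clusters, time/ESS is then simply the total runtime divided by this quantity.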

Figure 6: CPP multi-hospital data. Estimated densities of gestational age for the 12 hospitals, with a comparison between smokers (yellow curves) and non-smokers (black curves).
We write a_n a.s.∼ b_n to indicate that P(lim_{n→∞} a_n/b_n = 1) = 1 and, by exploiting Theorem 2 in Arbel et al. (2019), we prove the following proposition.

Proposition A.1. Let M_n = min{ l ≥ 1 : ∏_{j≤l} (1 − V_j) < B_n }, where the sequence (V_j)_{j≥1} is defined as in (1.5) and B_n is a beta random variable with parameters 1 and n. Then, for n → ∞,