Bayesian modeling via discrete nonparametric priors

The availability of complex-structured data has sparked new research directions in statistics and machine learning. Bayesian nonparametrics is at the forefront of this trend thanks to two crucial features: its coherent probabilistic framework, which naturally leads to principled prediction and uncertainty quantification, and its infinite-dimensionality, which exempts from parametric restrictions and ensures full modeling flexibility. In this paper, we provide a concise overview of Bayesian nonparametrics starting from its foundations and the Dirichlet process, the most popular nonparametric prior. We describe the use of the Dirichlet process in species discovery, density estimation, and clustering problems. Among the many generalizations of the Dirichlet process proposed in the literature, we single out the Pitman–Yor process, and compare it to the Dirichlet process. Their different features are showcased with real-data illustrations. Finally, we consider more complex data structures, which require dependent versions of these models. One of the most effective strategies to achieve this goal is represented by hierarchical constructions. We highlight the role of the dependence structure in the borrowing of information and illustrate its effectiveness on unbalanced datasets.


Introduction
Statistical learning and predictions are largely based on the tacit assumption of a correspondence between past and future observations.Whether there is a rational justification for this induction principle has been a long debated question, which can be traced back to the work of the renowned philosopher David Hume in the eighteenth century.In the twentieth century, de Finetti (1937) reformulates the problem in probabilistic terms: the symmetry between past and future can be postulated with an appropriate probabilistic framework through the concept of exchangeability, which requires the distribution of the observations to be invariant with respect to their order.More formally, a sequence (X n ) n≥1 of observations on a Polish space X is said to be exchangeable if for every N ∈ N, and for every permutation π of {1, . . ., N }, (X 1 , . . ., X N ) d = (X π(1) , . . ., X π(N ) ), where d = denotes the equality in distribution.Crucially, de Finetti proved that exchangeability is equivalent to conditional independence and identity in distribution, through the following representation theorem which goes under his name.
Theorem 1 (de Finetti Representation Theorem) A sequence (X n ) n≥1 of random elements on X is exchangeable if and only if there exists a probability law Q on the space P X of probabilities on X such that, for every N ∈ N and any Borel sets (A 1 , . . ., A N ), Q is termed the de Finetti measure of (X n ) n≥1 .
The representation theorem was proved in full generality by Hewitt and Savage (1955), with compelling consequences in terms of both inference and prediction.First of all, it provides a neat justification of the Bayes-Laplace paradigm, that is, the use of a prior distribution on the model parameters for Bayesian inference.Indeed, one can rewrite the representation theorem in hierarchical form, so that for any exchangeable sequence (X n ) n≥1 one can define a random probability P on P X such that, for every n ∈ N, From this expression, it is apparent that the de Finetti measure Q may act as prior distribution, the latent parameter being the distribution of the data.After observing the values X (n) = (X 1 , . . ., X n ), one can perform Bayesian inference by finding the posterior distribution Q( • | X (n) ).
Moreover, de Finetti's theorem can also be used as a technical tool for assigning explicit laws to exchangeable sequences and deriving predictive inference, the most important and fundamental kind of inductive inference according to Carnap (1950).Indeed, under the exchangeability assumption, the one-step predictions are conveniently evaluated as linear functions of the posterior distribution Q( • | X (n) ) as follows: and similarly for m-step ahead predictions P((X n+1 , . . ., Summing up, exchangeability guarantees an elegant and principled framework to perform inference and prediction on homogeneous observations.As for the flexibility of these models, it crucially depends on the support of the de Finetti measure Q on the space of probabilities P X .Though de Finetti had laid out the Bayesian framework in its full generality during the 1930s, for several decades, inference and predictions were confined to parametric models, that is, for priors Q that degenerate on a subclass of P X indexed by a finite-dimensional parameter.Still in 1972, Dennis V. Lindley wrote that it is perhaps worth stopping to remark that the problem is a technical one; the Bayesian method embraces non-parametric problems but cannot solve them because the requisite tool is missing (Lindley, 1972).However, the times were ripe: Ferguson (1973) made the breakthrough with the definition of the Dirichlet process prior, which paved the way for the development of the field of Bayesian Nonparametrics (BNP).See also Ferguson (1974).The rest of this paper will outline some of the most notable uses of the Dirichlet process and highlight some effective generalizations.We start by introducing the Dirichlet process through its stick-breaking representation (Sethuraman, 1994), arguably its simplest construction though perhaps not most suitable to gain insight into its distributional properties.
Definition 1 A random probability P ∼ DP(θ, P * ) is distributed according to a Dirichlet process prior with concentration θ > 0 and base probability P * ∈ P X if For simplicity, we will throughout assume that P * is non-atomic.Exhaustive accounts of the field can be found in the monographs (Hjort et al., 2010;Müller et al., 2015;Ghosal & van der Vaart, 2017).

Species discovery
In species sampling problems, a population of individuals is partitioned into different "types" or "species", and each observation is a species label.Given an observed sample of size n, one of the main goals is predicting the number of new distinct species in a future sample of size m.There are two key elements of species sampling problems that make BNP models the natural choice: (i) since the scope of the notion of species is to group individual observations (according to some criterion of similarity), there should be a positive probability of ties among the data; (ii) since each new observation could potentially represent a new species, the model has to assign a positive probability to this event or, in other terms, incorporate a positive discovery probability at each sampling step.This BNP approach to species sampling was first laid out in Lijoi et al. (2007).Let X (n) = (X 1 , . . ., X n ) be a vector of observed species labels.If one believes that the order with which the species are observed is irrelevant, assuming exchangeability is the natural choice leading to In this modeling framework, it is quite natural to assume that P is discrete and defined as P = i≥1 pi δ Z i , where the p j 's are the random species proportions in the population and the Z j 's are the corresponding species labels such that Z j iid ∼ P * for some non-atomic base measure P * .These assumptions imply that ties may be recorded in X (n) with positive probability, namely P(X i = X j ) > 0 for any i = j.Henceforth, we will denote by K n the total number of distinct species in the sample X (n) , by X * 1 , . . ., X * K n the unique values for the species, and by N i the frequency of the i-th distinct species in order of appearance.

Dirichlet process
The task of learning the next species may be rephrased into finding the predictive distribution P(X n+1 | X (n) ).The species corresponding to the next observation X n+1 should have a positive probability of coinciding with the already discovered species but also a positive probability of being new.A natural probabilistic structure achieving these desiderata is obtained by taking a linear combination of one's prior guess at the (marginal) distribution of the species labels P * and the empirical measure.This results in a learning mechanism of the form Now, if P is distributed according to a Dirichlet process prior (Definition 1), the corresponding predictive distributions replicate exactly (2) with Moreover, it can be proved (Regazzini, 1978;Lo, 1991) that the predictive distributions associated to an exchangeable sequence are a linear combination of P * and the empirical measure 1 n K n i=1 N i δ X * i if and only if the underlying P has a Dirichlet process distribution.
This predictive scheme, framed as a generative model, is known as the Blackwell-MacQueen generalized Pólya urn (Blackwell & MacQueen, 1973).Moreover, when focusing on the induced partition distribution, it reduces to the Chinese restaurant process, whose name originated from the following metaphor, attributed to L. Dubins and J. Pitman by D. Aldous.A sequence of customers arrives at a restaurant with X i denoting the table of customer i.The first customer sits at one of the empty tables.The second customer sits at the same table (X 2 = X 1 ) with probability 1/(θ + 1) and at a different table (X 2 = "new") with probability 1/(1 + θ), and so on.Thus, tables and species may be treated in the same way: the key feature they share is that they both determine a random partition of observations into clusters.
Let n k (n 1 , . . ., n k ) be the probability that a given sample X (n) displays k ≤ n species with cardinalities n 1 , . . ., n k .Whenever P is an almost surely discrete distribution this can be characterized in terms of the de Finetti measure as where X k * = {(x 1 , . . ., x k ) ∈ X k : x i = x j for i = j}, and it is referred to as the exchangeable partition probability function, a notion introduced by Pitman (1995).If P is sampled from a Dirichlet process this coincides with a variation of the popular Ewens sampling formula (Ewens, 1972;Antoniak, 1974), which plays a major role in population genetics and is given by denotes the n-th ascending factorial of θ and we set (θ ) 0 ≡ 1.By summing (4) over all partitions of n elements into k groups, one obtains the probability of observing k distinct species in a sample of size n.Indeed, with |s(n, k)| the signless Stirling number of the first kind.From a modeling perspective, an important aspect is the growth rate of the number of distinct species K n as n increases: under a Dirichlet process prior, K n diverges with a logarithmic behavior.More specifically, as shown in Korwar and Hollander (1973), for n → ∞, We refer to Pitman (2006), Mano (2018) and Yamato (2020) for further stimulating accounts.

Beyond the Dirichlet process
When modeling the probability of discovering a new species, one would expect this probability to depend explicitly on the number K n of distinct species in the sample.More specifically, it is often desirable for the probability of discovering a new species to be monotonically increasing in K n , so that the probability of discovering a new species is higher if mostly distinct species have been recorded in the past, and vice versa.The Dirichlet process prior does not accommodate for this feature, as it is apparent from (3).However, other choices of the de Finetti measure allow for this modeling behavior: a popular instance is given by the two-parameter Poisson-Dirichlet process (Perman et al., 1992;Pitman, 1995;Pitman & Yor, 1997), also known as Pitman-Yor process.See Lijoi and Prünster (2010) and Müller et al. (2018) for reviews of the various classes of discrete random probability measures that share this property.
Definition 2 A random probability P ∼ PY(σ, θ, P * ) is distributed according to a Pitman-Yor process prior with discount σ ∈ [0, 1), concentration θ > −σ , and diffuse base probability P * ∈ P X if Clearly, the Dirichlet process represents a special case, corresponding to σ = 0.When P ∼ PY(θ, σ, P * ) is sampled from a Pitman-Yor process, the exchangeable partition probability function becomes Moreover, the predictive distribution of a Pitman-Yor process satisfies where k is the number of distinct species in the sample X (n) .The complete prediction scheme is of the form which notably features a suitably weighted empirical measure.Depending on the value of σ , this leads to a markedly different learning scheme, as highlighted in Fig. 1.
Here, we compare the prediction of the number of new unique values m in an additional sample of size m, conditionally on X (n) .In particular, we consider the Bayesian nonparametric estimator for K Favaro et al. (2009), that is We refer to Lijoi et al. (2007) and Favaro et al. (2009) for details.In the Dirichlet process case, as shown in Favaro et al. (2011), ( 5) reduces to A practical issue to face is the specification of the parameters (σ, θ ).A first convenient approach proposed in Lijoi et al. (2007) is based on empirical Bayes ideas.It consists in fixing (σ, θ ) to maximize the exchangeable partition probability function, so that in the Pitman-Yor case, we have An alternative way of specifying (σ, θ ) is by placing a prior on it.However, as showcased in Lijoi et al. (2008), this often results in negligible differences for the estimates of K (n) m compared to the empirical Bayes approach, since in many practical scenarios the posterior distribution for (σ, θ ) is highly concentrated.
In order to illustrate the BNP methodology for species sampling problems we consider two real-world applications.First, we analyze a dataset of Holst (1981), which comprises n = 204 coins (obverse side) found in a hoard of ancient coins.They were classified according to different die types, which are the tools used to produce them and represent the "species" in this context: k = 141 distinct dies were recorded.The frequencies are conveniently summarized in terms of the number of "species" of a given size, i.e., r i represents the number of species with frequency i, and we have (r 1 , r 2 , r 3 , r 4 , r 5 , r 6 , r 7 ) = (102, 26, 8, 2, 1, 1, 1).The goal is to predict the number of new dies one may observe in a future hoard of coins.We adopt the BNP estimators m in an additional sample of size m, conditionally on the data X (n) , for a Dirichlet process (red line) and a Pitman-Yor process (blue line).The Good-Toulmin estimator (green line) is also displayed for comparison for the Pitman-Yor process.The two BNP estimators are compared to the Good-Toulmin estimator (Good & Toulmin, 1956;Mao, 2004), which is one of the most popular frequentist estimators.It is well known that the Good-Toulmin estimator usually provides reliable predictions for m < n, that is, if the additional sample over which predictions are made is not larger than the observed sample, whereas for values m > n it often shows an erratic behavior.All three estimators are depicted in Fig. 1a.The Dirichlet process and the Pitman-Yor process behave similarly.This is not surprising since the empirical Bayes estimate favors a low value for the discount parameter ( σ = 0.11), and we recall that the Pitman-Yor process reduces to the Dirichlet process if σ = 0.Moreover, we highlight that they both resemble the behavior of the Good-Toulmin estimator when m is smaller than the observed sample size.For larger values of m, the Good-Toulmin estimator becomes erratic, whereas this is not the case for the BNP estimators since they rely on principled probabilistic models.
Our second application concerns the genomic dataset FlavFr1: it comprises 1593 expressed sequence tags (ESTs) of a cDNA library obtained from the fruits of citrus clementina.These are categorized into 806 different genes with r i = 561, 148, 37, 18, 6, 5, 12, 1, 1, 3, 1, 2 for i ∈ {1, . . ., 12}, and , 15, 16, 19, 22, 23, 58, 117}.As before, we aim to predict the number of new genes in a future sample; the empirical Bayes specification yields θ = 651.05and ( σ , θ) = (0.63, 110.24) for the Dirichlet and the Pitman-Yor processes, respectively.The corresponding estimators together with the Good-Toulmin estimator are depicted in Fig. 1b.In this case, there are striking differences between the estimators obtained from the Dirichlet and the Pitman-Yor processes.This is due to the data favoring a large value for the discount parameter ( σ = 0.63), setting the Pitman-Yor case far apart from the Dirichlet process case ( σ = 0).This clearly showcases the usefulness of the additional flexibility of the Pitman-Yor process, which allows for polynomial growth rates controlled by the parameter σ .Once again, the Good-Toulmin estimator diverges for large values of m.However, for moderately large values of m it nicely resembles the predictions of the Pitman-Yor estimator.This further underpins the need to go beyond the Dirichlet process in species sampling contexts.Further instances of applications requiring nonparametric priors with polynomial growth rate, or equivalently power law tail behavior, can be found in Hoshino (2001), Teh (2006), Caron (2012) and Caron and Fox (2017).

Mixture models
The Dirichlet and Pitman-Yor process priors are laws for almost surely discrete random probabilities.This implies that, when used as de Finetti measures, the induced exchangeable observations (1) will have a positive probability of being equal, i.e., P(X i = X j ) > 0 for every i, j.Thus, different models should be considered whenever the data do not display ties.A popular strategy is to model a random density function through a kernel mixture.Let f be a probability density kernel and P be an almost surely discrete random probability.The random probability density is then defined as Using the law Q f of f as de Finetti measure, the law of induced exchangeable observations (Y i ) i≥1 may be equivalently described in hierarchical form as where the latent variable X i represents the parameter of the probability density from which Y i is sampled.Kernel mixtures over a Dirichlet process have been introduced in Lo (1984) and later extended to more general mixing measures, including the Pitman-Yor process (Ishwaran & James, 2001).See Müller and Quintana (2004), Barrios et al. (2013) andDe Blasi et al. (2015) for reviews.Arguably, kernel mixtures over discrete random probabilities are the most popular Bayesian nonparametric models.This is because they allow one to perform two important statistical tasks at once: (i) flexible density estimation, which avoids parametric constraints and adapts to any data generating distribution; (ii) model-based clustering, which exempts from fixing the number of clusters a priori.Indeed, due to the discreteness of P, any sample from the posterior distribution of the latent variables X i given Y (n) has K n ∈ {1, . . ., n} distinct values.This automatically partitions the n observed data into K n clusters, with two observations Y i and Y j belonging to the same cluster if they are sampled from the same mixture component, i.e., if Different discrete nonparametric priors induce different distributions for the number of clusters K n .Given the important role played by K n for the probabilistic clustering, when eliciting the prior this aspect should be taken into account.In the same way, the posterior distribution of P given the data Y (n) will induce a posterior distribution L (K n | Y (n) ), providing both point estimates and uncertainty quantification on the number of clusters in the data.In practice, posterior computations are customarily carried out using Markov Chain Monte Carlo, an avenue first explored in Escobar and West (1995).This is a highly non-trivial task, since it entails exploring the partition space.To address this issue, several computational methods have been proposed over the years.Another practical concern is the choice of the baseline distribution P * , which is known to have a significant impact on L (K n | Y (n) ).See Richardson and Green (1997).However, this equally affects parametric and nonparametric specifications.
In Fig. 2, we consider synthetic data generated from a finite mixture of normal distributions with three components, and the estimates resulting from a normal mixture model with Dirichlet and Pitman-Yor process as mixing measures.We consider a highly miss-specified scenario, where the expected number of clusters a priori is about 34, significantly larger than the true number of mixing components (three).This leads to consider a Dirichlet process with θ = 10, whereas for the Pitman-Yor process infi-  (300) .Probability masses corresponding to the same number of clusters are slightly shifted for visualization purposes.nite combinations of σ and θ are possible: we set σ = 0.6 to highlight the role of σ , and this results in θ = 0.026.In particular, the plot displays the prior and posterior distributions of the number of clusters in Fig. 2a, and the posterior densities in Fig. 2b.For both models, the posterior distribution moves away from the prior miss-specification.However, in the Pitman-Yor case, this correction is stronger leading to more accurate posterior inference on the number of clusters.The robustness of the clustering with respect to prior miss-specifications represents a highly appealing feature of Pitman-Yor process mixtures.In contrast, both models lead to highly accurate but essentially indistinguishable posterior densities.This is due to the lack of identifiability of the number of mixture components in a mixture density: one can always fit a mixture density with more components than needed.

Borrowing of information
In the previous sections, we focused on exchangeable models, which translate a notion of homogeneity in the data.Yet, there are many situations where this can be seen as a restrictive assumption.In de Finetti's words (de Finetti, 1938) 'But the case of exchangeability can only be considered as a limiting case: the case in which this "analogy" is, in a certain sense, absolute for all events under consideration'.More specifically, one may wish to generalize exchangeability to the case where data is collected under different experimental conditions, such that one retains homogeneity within each experiment though allowing for heterogeneity across different experiments.Typical examples include topic modeling, meta-analysis, two-sample problems, nonparametric regression, time-dependent data, and change point problems, to mention a few.A natural probabilistic framework that achieves this is partial exchangeability, as defined below.To simplify the notation, we focus on two groups of observations, though all the contents of this section may be easily extended to an arbitrary number of groups.
Definition 3 An array (X 1 , X 2 ) = (X 1, j , X 2, j ) j≥1 is partially exchangeable if for any N 1 , N 2 ≥ 1, π permutation of {1, . . ., N 1 }, and φ permutation of {1, . . ., N 2 }, Partially exchangeable sequences may be characterized as conditionally independent through an extension of de Finetti's representation theorem.Specifically, (X 1 , X 2 ) is partially exchangeable if and only if there exists a probability distribution Q on P X × P X such that Equivalently, there exists a vector of dependent random probabilities ( P1 , P2 ) such that We observe that the dependence between random probabilities induces dependence between the sequences X 1 , X 2 of (exchangeable) observations, with two extreme scenarios: when P1 and P2 are independent, so are X 1 and X 2 ; when P1 = P2 = P almost surely, X 1 , X 2 are (fully) exchangeable, as it is clear by comparing the "degenerate" form of de Finetti's representation theorem to the one for exchangeable sequences (1).Thus, we can interpret partial exchangeability as a dependence assumption on the observables that ranges from independence to exchangeability.
The use of partially exchangeable sequences for statistical purposes has been pioneered by Cifarelli and Regazzini (1978) and MacEachern (1999MacEachern ( , 2000)).To this end, one needs to build dependent random probability measures (see Quintana et al., 2022 for a recent review) with two key features in mind: (i) mathematical tractability, which corresponds to obtaining manageable representations for the posterior and/or the marginal structure, i.e., the partition distribution or the prediction rule; (ii) the ability to control the amount of dependence, since this is directly linked to the borrowing of information between the two groups: the more the dependence, the more information will be shared across the two groups.This is usually done by expressing the linear correlation between pairwise set-wise evaluations, Cor( P1 (A), P2 (A)) for any Borel set A, and has been recently extended to an arbitrary number of groups by relying on the Wasserstein distance (Catalano et al., 2021a(Catalano et al., , 2021b)).
Partial exchangeability represents the ideal probabilistic framework in many contexts.We showcase this by means of the hierarchical Dirichlet process mixture model (Teh et al., 2006), which is among the most popular and intuitive ways of introducing dependence between random probabilities and also enjoys attractive frequentist asymptotic properties (Catalano et al., 2022).We define ( P1 , P2 ) to be distributed according to a hierarchical Dirichlet process prior if for α, α * > 0 and P * a diffuse probability measure on X such that We will use the notation ( P1 , P2 ) ∼ HDP(α, α * , P * ).
To test the performance of the HDP in terms of borrowing of information we consider synthetic data generated from mixtures of three normal distributions.More specifically, the first group of n 1 = 15 synthetic observations are iid from the mixture 0.6 N(−0.8, 0.2 2 ) + 0.3 N(0, 0.2 2 ) + 0.1 N(0.8, 0.2 2 ), whereas the second group of n 2 = 200 observations are iid from the mixture 0.575 N(−0.8, 0.2 2 ) + 0.275 N(0, 0.2 2 ) + 0.15 N(0.8, 0.2 2 ).In other words, the two mixtures have the same scale and location parameters, but different probability weights.Moreover, the two groups of observations have markedly different sample sizes.As for the HDP model, we compare two different scenarios.We tune the parameters to achieve different values for the correlation, namely 0.09 and 0.91, corresponding to, respectively, weakly and highly dependent specifications.Among the infinite choices for and α * that induce the aforementioned values for the correlation, in both cases we select those that lead to an overall prior expected number of clusters among the two groups approximately equal to 7.5.This allows for a fair comparison of the two models.The resulting parameters are α = 1 and α * = 20 for the weakly and α = 19 and α * = 2 highly dependent HPD.In Fig. 3  As in the exchangeable case, posterior computations are based on Markov Chain Monte Carlo, leveraging specialized algorithms such as Lijoi et al. (2020).
The plots in Fig. 3 highlight the crucial role of the specification of the dependence structure.In the case of an HDP with weak dependence (Fig. 3b), the data of the first group (left panel) are not sufficiently informative to discover the existence of a third mixture component.Conversely, when we impose a stronger dependence (Fig. 3c), the data of the first group can effectively borrow strength from the second, whose sample size is much larger and therefore contains more information.As a result, the correct number of mixture components is recovered, despite the limited sample size of the first group.Moreover, the borrowing of information is also apparent from the credible intervals: a larger borrowing of information greatly reduces the uncertainty around the point estimator for the first group.

Fig. 1
Fig. 1 Species discovery: Bayesian nonparametric estimators E(K Distribution of the number of clusters K 300 : prior distribution and posterior update given Y of Y(300) , together with the Monte Carlo approximations of the posterior means E( f (y) | Y(300) ), for a grid o values of y.