Dirichlet process mixtures under affine transformations of the data

Location-scale Dirichlet process mixtures of Gaussians (DPM-G) have proved extremely useful in dealing with density estimation and clustering problems in a wide range of domains. Motivated by an astronomical application, in this work we address the robustness of DPM-G models to affine transformations of the data, a natural requirement for any sensible statistical method for density estimation and clustering. First, we devise a coherent prior specification of the model which makes posterior inference invariant with respect to affine transformations of the data. Second, we formalise the notion of asymptotic robustness under data transformation and show that mild assumptions on the true data generating process are sufficient to ensure that DPM-G models feature such a property. Our investigation is supported by an extensive simulation study and illustrated by the analysis of an astronomical dataset consisting of physical measurements of stars in the field of the globular cluster NGC 2419.


Introduction
A natural requirement for statistical methods for density estimation and clustering is that they be robust under affine transformations of the data. This requirement becomes all the more pressing in multivariate problems where data components are incommensurable, that is, not measured in the same physical unit, so that defining a metric on the sample space requires the specification of constants relating units along different axes. As an illustrative example, consider astronomical data consisting of the position and velocity of stars, thus living in the so-called phase space: a metric on such a space can be defined by setting a dimensional constant to relate positions and velocities. In this setting, any sensible statistical procedure should be robust with respect to the specification of such a constant (Ascasibar and Binney, 2005; Maciejewski et al., 2009). This is especially important considering that often little to no a priori guidance about dimensional constants is available, which makes model calibration a daunting task. The motivating example of this work indeed comes from astronomy: the dataset we consider consists of measurements on a set of 139 stars, possibly belonging to a globular cluster called NGC 2419 (Ibata et al., 2011). Globular clusters are sets of stars orbiting some galactic center. NGC 2419, shown in Figure 1, is one of the furthest known globular clusters in the Milky Way. For each star we observe a four-dimensional vector $(Y_1, Y_2, V, [\mathrm{Fe/H}])$, where $(Y_1, Y_2)$ is a two-dimensional projection on the plane of the sky of the position of the star, $V$ is its line-of-sight velocity and $[\mathrm{Fe/H}]$ its metallicity, a measure of the abundance of iron relative to hydrogen. Out of these four components, only $Y_1$ and $Y_2$ are measured in the same physical unit, while dimensional constants need to be specified in order to relate position, velocity and metallicity.
A key question arising with these data consists in identifying the stars that, among the 139 observed, can be rightfully considered as belonging to NGC 2419: a correct classification would be pivotal in the study of the globular cluster dynamics. Astronomers expect the large majority of the observed stars to belong to the cluster: the remaining ones, called field stars or contaminants, are Milky Way stars, unrelated to the cluster, that happen to appear projected in the same region of the plane of the sky. In general, the contaminants have different kinematic and chemical properties with respect to the cluster members. Considering the nature of the problem, this research question can be formalised as an unsupervised classification problem, the goal being the identification of the stars which belong to the largest cluster, which can be interpreted as the NGC 2419 globular cluster. Admittedly, the terms of such a classification problem are not limited to the considered dataset but, on the contrary, are ubiquitous in astronomy and, more generally, might arise in any field where data components are incommensurable.
Bayesian nonparametric methods for density estimation and clustering have been successfully applied in a wide range of fields, including genetics (Huelsenbeck and Andolfatto, 2007), bioinformatics (Medvedovic and Sivaganesan, 2002), clinical trials (Xu et al., 2017) and econometrics (Otranto and Gallo, 2002), to cite but a few. In this work we focus on the Dirichlet process mixture (DPM) model introduced by Lo (1984), arguably the most popular Bayesian nonparametric model. Although its properties have been thoroughly studied (see, e.g., Hjort et al., 2010), little attention has been dedicated to its robustness under data transformations (see Arbel and Nipoti, 2013). To the best of our knowledge, only Bean et al. (2016) study the effect of data transformation under a DPM model: their goal is to transform the sample so as to facilitate the estimation of univariate densities on a new scale and thus to improve the performance of the methodology.
In this paper we investigate the effect of affine transformations of the data on location-scale DPM of multivariate Gaussians (DPM-G) (Müller et al., 1996), which will be introduced in Section 2. This is a very commonly used class of DPM models whose asymptotic properties have been studied by Wu and Ghosal (2010), Shen et al. (2013) and Canale and De Blasi (2017), among others. While rescaling the data, often for numerical convenience, is common practice, the robustness of multivariate DPM-G models under such transformations remains essentially unaddressed to date. We fill this gap by formally studying robustness properties for a flexible specification of DPM-G models, under affine transformations of the data. Specifically, our contribution is two-fold: first, we formalise the intuitive idea that a location-scale DPM-G model on a given dataset induces a location-scale DPM-G model on rescaled data and we provide the parameter mapping for the transformed DPM-G model; second, we introduce the notion of asymptotic robustness under affine transformations of the data and show that, under mild assumptions on the true data generating process, DPM-G models feature such a robustness property. Our theoretical results are supported by an extensive simulation study, focusing on both density and clustering estimation. These findings make the DPM-G model a suitable candidate to deal with problems where an informed choice of the relative scale of different dimensions seems prohibitive. We thus fit a DPM-G model to the NGC 2419 dataset and show that it provides interesting insight on the classification problem motivating this work.
The rest of the paper is organised as follows. In Section 2 we describe the modelling framework and introduce the notation used throughout the paper. Sections 3 and 4 present the main results of the work, with respective focus on finite sample properties and large sample asymptotics. A thorough simulation study is presented in Section 5 while Section 6 is dedicated to the analysis of the NGC 2419 dataset. Conclusions are discussed in Section 7. Finally, proofs of two technical lemmas are postponed to Appendix A.

Modelling framework
Let $X^{(n)} := (X_1, \ldots, X_n)$ be a sample of size $n$ of $d$-dimensional observations $X_i := (X_{i,1}, \ldots, X_{i,d})$ defined on some probability space $(\Omega, \mathcal{A}, \mathrm{P})$ and taking values in $\mathbb{R}^d$. Consider an invertible affine transformation $g : \mathbb{R}^d \to \mathbb{R}^d$, that is, $g(x) = Cx + b$, where $C$ is an invertible $d \times d$ matrix and $b$ is a $d$-dimensional column vector. The nature of the transformation $g$ is such that, if applied to a random vector $X$ with probability density function $f$, it gives rise to a new random vector $g(X)$ with probability density function

$$f_g = |\det(C)|^{-1}\, f \circ g^{-1}. \qquad (1)$$

Henceforth we denote by $\mathcal{F}$ the space of all density functions with support on $\mathbb{R}^d$. The DPM model (Lo, 1984) defines a random density taking values in $\mathcal{F}$ as

$$\tilde f(x) = \int_{\Theta} k(x; \theta)\, d\tilde P(\theta), \qquad (2)$$

where $k(x; \theta)$ is a kernel on $\mathbb{R}^d$ parameterized by $\theta \in \Theta$, and $\tilde P$ is a Dirichlet process (DP) with parameters $\alpha$ (precision parameter) and $P_0 := E[\tilde P]$ (base measure), a distribution defined on $\Theta$ (Ferguson, 1973). The almost sure discreteness of $\tilde P$ allows the random density $\tilde f$ to be rewritten as

$$\tilde f(x) = \sum_{i=1}^{\infty} w_i\, k(x; \theta_i),$$

where the random atoms $\theta_i$ are i.i.d. from $P_0$, and the random jumps $w_i$, independent of the atoms, admit the following stick-breaking representation (Sethuraman, 1994): given a set of random weights $v_i \overset{\mathrm{iid}}{\sim} \mathrm{Beta}(1, \alpha)$ (independent of the atoms $\theta_i$), then $w_1 = v_1$ and, for $j \geq 2$, $w_j = v_j \prod_{i=1}^{j-1} (1 - v_i)$. While several kernels $k(x; \theta)$ have been considered in the literature, including e.g. skew-normal (Canale and Scarpa, 2016), Weibull (Kottas, 2006) and Poisson (Krnjajić et al., 2008), here we focus on the convenient and commonly adopted Gaussian specification of Escobar and West (1995) and Müller et al. (1996). In the latter case, $k(x; \theta)$ represents a $d$-dimensional Gaussian kernel $\phi_d(x; \mu, \Sigma)$, provided that $\theta = (\mu, \Sigma)$, where the column vector $\mu$ and the matrix $\Sigma$ represent, respectively, the mean vector and the covariance matrix of the Gaussian kernel. This specification defines the model referred to as the $d$-dimensional location-scale Dirichlet process mixture of Gaussians (DPM-G), which can be represented in hierarchical form as

$$X_i \mid \theta_i \overset{\mathrm{ind}}{\sim} \phi_d(\,\cdot\,; \mu_i, \Sigma_i), \qquad \theta_i = (\mu_i, \Sigma_i) \mid \tilde P \overset{\mathrm{iid}}{\sim} \tilde P, \qquad \tilde P \sim DP(\alpha, P_0). \qquad (3)$$
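The stick-breaking construction recalled above can be illustrated with a minimal sketch in plain Python. The function name `stick_breaking_weights`, the truncation level of 50 atoms and the seed are illustrative assumptions, not part of the paper; truncating the infinite sequence is only a device for simulation.

```python
import random


def stick_breaking_weights(alpha, n_atoms, rng):
    """Return the first n_atoms stick-breaking weights and the leftover mass.

    Hypothetical helper: truncates the infinite sequence (w_1, w_2, ...) after n_atoms terms.
    """
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)  # v_j ~ Beta(1, alpha)
        weights.append(v * remaining)    # w_j = v_j * prod_{i<j} (1 - v_i)
        remaining *= 1.0 - v             # update prod_{i<=j} (1 - v_i)
    return weights, remaining


rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=1.0, n_atoms=50, rng=rng)
```

By construction the retained weights and the leftover mass add up to one, so `leftover` measures how much of the random density is lost by the truncation.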
The almost sure discreteness of $\tilde P$ implies that the vector $\theta^{(n)} := (\theta_1, \ldots, \theta_n)$ might show ties with positive probability, thus leading to a partition of $\theta^{(n)}$ into $K_n \leq n$ distinct values. This, in turn, leads to a partition of the set of observations $X^{(n)}$, obtained by grouping two observations $X_{i_1}$ and $X_{i_2}$ together if and only if $\theta_{i_1} = \theta_{i_2}$. This observation implies that the posterior distribution of the random density $\tilde f$ carries useful information on the clustering structure of the data, thus making DPM-G models convenient tools for density and clustering estimation problems.
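The way ties among the latent parameters induce a random partition can be sketched by simulating cluster labels from the Blackwell-MacQueen Pólya urn, under which observation $i+1$ joins an existing group with probability proportional to its size and opens a new group with probability proportional to $\alpha$. This is a minimal illustration of the prior on partitions only; the function name `polya_urn_partition` and the chosen values of $n$ and $\alpha$ are assumptions made for the example.

```python
import random


def polya_urn_partition(n, alpha, rng):
    """Sample cluster labels for n items from the Blackwell-MacQueen urn (hypothetical helper)."""
    labels = []
    sizes = []  # sizes[k] = number of items currently allocated to cluster k
    for i in range(n):
        # Existing cluster k is chosen with probability sizes[k] / (alpha + i),
        # a new cluster is opened with probability alpha / (alpha + i).
        u = rng.random() * (alpha + i)
        cumulative = 0.0
        chosen = len(sizes)  # default: open a new cluster
        for k, size in enumerate(sizes):
            cumulative += size
            if u < cumulative:
                chosen = k
                break
        if chosen == len(sizes):
            sizes.append(0)
        sizes[chosen] += 1
        labels.append(chosen)
    return labels, sizes


labels, sizes = polya_urn_partition(n=139, alpha=1.0, rng=random.Random(1))
n_groups = len(sizes)  # realisation of K_n, never larger than n
```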
Although other specifications for the base measure can be considered (see, e.g., Görür and Rasmussen, 2010), we choose to work within the framework set forth by Müller et al. (1996), where $P_0$ is defined as the product of two independent distributions for the location parameter $\mu$ and the scale parameter $\Sigma$, namely a multivariate normal and an inverse-Wishart distribution, that is

$$P_0(d\mu, d\Sigma) = N_d(d\mu; m_0, B_0) \times IW(d\Sigma; \nu_0, S_0). \qquad (4)$$

For the sake of compactness, we use the notation $\pi := (m_0, B_0, \nu_0, S_0)$ to denote the vector of hyperparameters characterising the base measure $P_0$. We denote by $\Pi$ the prior distribution induced on $\mathcal{F}$ by the DPM-G model (2) with base measure (4).

DPM-G model and affine transformation of the data
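Let $\tilde f_\pi$ be a DPM-G model defined as in (2), with base measure (4) and hyperparameters $\pi$. The next result shows that, for any invertible affine transformation $g(x) = Cx + b$, there exists a specification $\pi_g := (m_0^{(g)}, B_0^{(g)}, \nu_0^{(g)}, S_0^{(g)})$ of the hyperparameters characterising the base measure in (4), such that $\tilde f_{\pi_g} = |\det(C)|^{-1} \tilde f_\pi \circ g^{-1}$. That is, for every $\omega \in \Omega$ and given a random vector $X$ distributed according to $\tilde f_\pi(\omega)$, we have that $\tilde f_{\pi_g}(\omega)$ is the density of the transformed random vector $g(X)$.

Proposition 1. Let $\tilde f_\pi$ be a location-scale DPM-G model defined as in (2), with base measure (4) and hyperparameters $\pi = (m_0, B_0, \nu_0, S_0)$. For any invertible affine transformation $g(x) = Cx + b$, we have

$$\tilde f_{\pi_g} = |\det(C)|^{-1}\, \tilde f_\pi \circ g^{-1}, \qquad \text{where } \pi_g = (Cm_0 + b,\; CB_0C^\top,\; \nu_0,\; CS_0C^\top).$$

Proof. Model $\tilde f_\pi$ can be written as

$$\tilde f_\pi(x) = \int \phi_d(x; \mu, \Sigma)\, d\tilde P(\mu, \Sigma), \qquad \tilde P \sim DP\big(\alpha,\, N_d(m_0, B_0) \times IW(\nu_0, S_0)\big).$$

The Gaussian kernel satisfies $|\det(C)|^{-1} \phi_d(g^{-1}(y); \mu, \Sigma) = \phi_d(y; C\mu + b, C\Sigma C^\top)$. By performing the change of variables $S = C\Sigma C^\top$ and $m = C\mu + b$ and observing that, by standard properties of the inverse-Wishart and normal distributions,

$$\Sigma \sim IW(\nu_0, S_0) \;\Rightarrow\; C\Sigma C^\top \sim IW(\nu_0, CS_0C^\top), \qquad \mu \sim N_d(m_0, B_0) \;\Rightarrow\; C\mu + b \sim N_d(Cm_0 + b, CB_0C^\top),$$

we obtain that $|\det(C)|^{-1} \tilde f_\pi \circ g^{-1}$ is a DPM-G model with base measure (4) and hyperparameters $\pi_g$. All the identities in this proof are deterministic, that is, they hold for every $\omega \in \Omega$.

This result implies that, for any invertible affine transformation $g$, modelling the set of observations $X^{(n)}$ with a DPM-G model (2), with base measure (4) and hyperparameters $\pi$, is equivalent to assuming the same model with transformed hyperparameters $\pi_g$ for the transformed observations $g(X)^{(n)} := (g(X_1), \ldots, g(X_n))$. As a by-product, the same posterior inference can be drawn conditionally on both the original and the transformed set of observations, as the conditional distribution of the random density $\tilde f_{\pi_g}$, given $g(X)^{(n)}$, coincides with the conditional distribution of $|\det(C)|^{-1} \tilde f_\pi \circ g^{-1}$, given $X^{(n)}$. Proposition 1 thus provides a formal justification for the procedure of transforming data, e.g. via standardisation or normalisation, often adopted to achieve numerical efficiency: as long as the prior specification of the hyperparameters of a DPM-G model respects the condition of Proposition 1, transforming the data does not affect posterior inference.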
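The change of variables behind Proposition 1 rests on how a single Gaussian kernel transforms under $g$. The following is a minimal numerical check for $d = 2$ in plain Python. The matrix `C`, the vector `b`, the kernel parameters and the evaluation point `y` are arbitrary illustrative values, and the small 2×2 helpers are hypothetical names introduced only for this sketch.

```python
import math


def matmul2(A, B):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def transpose2(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]


def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]


def inv2(A):
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]


def matvec2(A, v):
    return [A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1]]


def gauss_pdf2(x, mu, Sigma):
    """Bivariate Gaussian density phi_2(x; mu, Sigma)."""
    diff = [x[0] - mu[0], x[1] - mu[1]]
    sol = matvec2(inv2(Sigma), diff)
    quad = diff[0] * sol[0] + diff[1] * sol[1]
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det2(Sigma)))


# Illustrative affine map g(x) = C x + b and kernel parameters (mu, Sigma).
C = [[2.0, 0.5], [0.0, 0.25]]
b = [1.0, -3.0]
mu = [0.3, -1.2]
Sigma = [[1.0, 0.6], [0.6, 2.0]]

# Transformed kernel parameters: m = C mu + b and S = C Sigma C^T.
m = [matvec2(C, mu)[i] + b[i] for i in range(2)]
S = matmul2(matmul2(C, Sigma), transpose2(C))

y = [0.7, -2.9]
x = matvec2(inv2(C), [y[0] - b[0], y[1] - b[1]])  # x = g^{-1}(y)

lhs = gauss_pdf2(x, mu, Sigma) / abs(det2(C))  # |det C|^{-1} phi(g^{-1}(y); mu, Sigma)
rhs = gauss_pdf2(y, m, S)                      # phi(y; C mu + b, C Sigma C^T)
```

The two quantities `lhs` and `rhs` agree up to floating-point error, which is the kernel-level identity that the proof then integrates against the Dirichlet process.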
The elicitation of an honest prior, thus independent of the data, for the hyperparameters $\pi$ of the base measure (4) of a DPM model is in general a difficult task. A popular practice, therefore, consists in setting the hyperparameters equal to some empirical estimates $\hat\pi(X^{(n)})$, by applying the so-called empirical Bayes approach (see, e.g., Lehmann and Casella, 2006). Recent investigations (Petrone et al., 2014; Donnet et al., 2018) provide a theoretical justification of this hybrid procedure by shedding light on its asymptotic properties. The next example shows that this procedure satisfies the assumptions of Proposition 1 and, thus, guarantees that posterior Bayesian inference, under an empirical Bayes approach, is not affected by affine transformations of the data.
Example 1 (Empirical Bayes approach). A commonly used empirical Bayes approach for specifying the hyperparameters $\pi$ of a DPM-G model, defined as in (2) and (4), consists in setting

$$m_0 = \bar X, \qquad B_0 = \frac{1}{\gamma_1} S_X^2, \qquad S_0 = \frac{\nu_0 - d - 1}{\gamma_2} S_X^2, \qquad (5)$$

where $\bar X$ and $S_X^2$ are the sample mean vector and the sample covariance matrix, respectively, and $\gamma_1, \gamma_2 > 0$, $\nu_0 > d + 1$. This specification for the hyperparameters $\pi$ has a straightforward interpretation. Namely, the parameter $m_0$, mean of the prior guess distribution of $\mu$, can be interpreted as the overall mean value and, in the absence of available prior information, set equal to the observed sample mean. Similarly, the parameter $B_0$, covariance matrix of the prior guess distribution of $\mu$, is set equal to a penalised version of the sample covariance matrix $S_X^2$, where $\gamma_1$ takes on the interpretation of the size of the ideal prior sample upon which the prior guess on the distribution of $\mu$ is based. Likewise, the hyperparameter $S_0$ is set equal to a penalised version of the sample covariance matrix $S_X^2$, a choice that corresponds to the prior guess that the covariance matrix of each component of the mixture coincides with a rescaled version of the sample covariance matrix. Specifically, $E[\Sigma] = S_0 / (\nu_0 - d - 1) = S_X^2 / \gamma_2$. Finally, the parameter $\nu_0$ takes on the interpretation of the size of an ideal prior sample upon which the prior guess $S_0$ is based. Next we focus on the setting of the hyperparameters $\pi_g$, given the transformed observations $g(X)^{(n)}$. The same empirical Bayes procedure adopted in (5) leads to

$$m_0^{(g)} = \overline{g(X)} = C\bar X + b, \qquad B_0^{(g)} = \frac{1}{\gamma_1} S_{g(X)}^2, \qquad S_0^{(g)} = \frac{\nu_0^{(g)} - d - 1}{\gamma_2} S_{g(X)}^2.$$

Observing that $S_{g(X)}^2 = C S_X^2 C^\top$ and setting $\nu_0^{(g)} = \nu_0$ shows that the described empirical Bayes procedure corresponds to $\pi_g = (Cm_0 + b, CB_0C^\top, \nu_0, CS_0C^\top)$ and, thus, by Proposition 1, $\tilde f_{\pi_g} = |\det(C)|^{-1} \tilde f_\pi \circ g^{-1}$.
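The identity $S_{g(X)}^2 = C S_X^2 C^\top$ used in Example 1 follows from a one-line computation, spelled out here as a worked step. The $1/(n-1)$ normalisation of the sample covariance is an assumption made for concreteness; the argument is unchanged for any other normalising constant.

```latex
\begin{aligned}
\overline{g(X)} &= \frac{1}{n}\sum_{i=1}^{n} (C X_i + b) = C\bar X + b, \\
S^2_{g(X)} &= \frac{1}{n-1}\sum_{i=1}^{n}
   \bigl(C X_i + b - C\bar X - b\bigr)\bigl(C X_i + b - C\bar X - b\bigr)^{\top} \\
 &= C \Bigl[\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)(X_i - \bar X)^{\top}\Bigr] C^{\top}
  = C\, S^2_X\, C^{\top}.
\end{aligned}
```

The shift $b$ cancels in every centred term, so only the linear part $C$ of the transformation affects the covariance.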

Large n asymptotic robustness
We investigate the effect of affine transformations of the data on DPM-G models by studying the asymptotic behaviour of the resulting posterior distribution in the large sample size regime. To this end, we consider a scenario that mimics a situation where no precise information about the scale of the data is available, and thus the prior model must be specified arbitrarily. More specifically, we fit the same DPM-G model $\tilde f_\pi$, defined in (2) and (4), to two versions of the data, that is $X^{(n)}$ and $g(X)^{(n)}$, by using the exact same specification for the hyperparameters $\pi$. Under this setting, the assumptions of Proposition 1 are not met and the posterior distributions obtained by conditioning on the two sets of observations are different random distributions which, thus, might lead to different statistical conclusions. The main result of this section shows that, under mild conditions on the true generating distribution of the observations, the posterior distributions obtained by conditioning $\tilde f_\pi$ on the two sets of observations $X^{(n)}$ and $g(X)^{(n)}$ become more and more similar, up to an affine reparametrisation, as the sample size $n$ grows. More specifically, we show that the probability mass of the joint distribution of these two conditional random densities concentrates in a neighbourhood of the set

$$\{(f_1, f_2) \in \mathcal{F} \times \mathcal{F} : f_2 = |\det(C)|^{-1} f_1 \circ g^{-1}\}$$

as $n$ goes to infinity. Henceforth we will say that the DPM-G model (2) with base measure (4) is asymptotically robust to affine transformations of the data. We first formalise this result and then provide its proof in Section 4.1. The proof is split into intermediate lemmas, whose proofs are deferred to Appendix A.
Henceforth we consider a metric $\rho$ on $\mathcal{F}$ which can be equivalently defined as the Hellinger distance $\rho(f_1, f_2) = \{\int (\sqrt{f_1(x)} - \sqrt{f_2(x)})^2 dx\}^{1/2}$ or the $L^1$ distance $\rho(f_1, f_2) = \int |f_1(x) - f_2(x)|\,dx$ between densities $f_1$ and $f_2$ in $\mathcal{F}$, and we denote by $\|\cdot\|$ the Euclidean norm on $\mathbb{R}^d$. Moreover, we adopt here the usual frequentist validation approach in the large $n$ regime, working 'as if' the observations $X^{(n)}$ were generated from a true and fixed data generating process (see for instance Rousseau, 2016). We also assume that this data generating process admits a density function with respect to the Lebesgue measure, denoted by $f^*$. In the setting we consider, the same model $\tilde f_\pi$ defined in (2) and (4) is fitted to $X^{(n)}$ and $g(X)^{(n)}$, thus leading to two distinct posterior random densities, with distributions on $\mathcal{F}$ denoted by $\Pi(\,\cdot \mid X^{(n)})$ and $\Pi(\,\cdot \mid g(X)^{(n)})$, respectively. We use the notation $\Pi_2(\,\cdot \mid X^{(n)})$ to refer to their joint posterior distribution on $\mathcal{F} \times \mathcal{F}$.
The assumptions of Theorem 1 thus refer to the true generating distribution $f^*$ of $X^{(n)}$. Assumptions A1 and A2 require $f^*$ to be bounded, fully supported on $\mathbb{R}^d$ and with finite entropy. Note that Assumption A2 is not implied by Assumption A1: for instance, if the true density $f^*(x)$ is defined on $\mathbb{R}$ with a right tail behaving as $x^{-1} (\log x)^{-2}$ at infinity, then $f^*(x) \log f^*(x)$ behaves like $(x \log x)^{-1}$, and the entropy is infinite. Assumption A3 is a condition of local regularity of the entropy of $f^*$. Finally, Assumption A4 requires the tails of $f^*$ to be thin enough for some moment of order strictly larger than two to exist.
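The counterexample mentioned above can be verified with two elementary integrals. The following worked step assumes, for illustration, that the tail behaviour holds exactly beyond some threshold $a > e$, which is a simplification of the "behaving as" statement in the text.

```latex
\begin{aligned}
\int_a^{\infty} \frac{dx}{x (\log x)^{2}}
  &= \Bigl[-\frac{1}{\log x}\Bigr]_a^{\infty} = \frac{1}{\log a} < \infty
  && \text{(the tail is integrable)}, \\
-\log\!\bigl(x^{-1}(\log x)^{-2}\bigr) &= \log x + 2\log\log x \sim \log x
  && \text{as } x \to \infty, \\
\int_a^{\infty} \frac{dx}{x \log x}
  &= \bigl[\log\log x\bigr]_a^{\infty} = \infty
  && \text{(the entropy integral diverges)}.
\end{aligned}
```

Hence a density with this tail is a legitimate density whose entropy integral diverges, so finiteness of the entropy is a genuine additional requirement.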

Proof of Theorem 1
The proof relies on results proved by Canale and De Blasi (2017). Let $\lambda(\Sigma^{-1}) := (\lambda_1(\Sigma^{-1}), \ldots, \lambda_d(\Sigma^{-1}))$ be the vector of eigenvalues, in increasing order, of $\Sigma^{-1}$, the precision matrix of the Gaussian kernel. Henceforth we write $f(x) \lesssim g(x)$ to indicate that the inequality $f(x) \leq c\,g(x)$ holds for some constant $c$ and for any $x$.
Theorem 2 provides general conditions on the base measure P 0 which guarantee consistency of the posterior distribution. The next lemma shows that these conditions are met by the normal/inverse-Wishart base measure (4).
Although the proof of Lemma 1 can be found in Canale and De Blasi (2017) (Corollary 1, relying, in turn, on results by Shen et al. (2013)), we provide it in Appendix A for the sake of completeness and in order to account for the slightly different prior specification considered in this work. The next lemma shows that if $f^*$ satisfies conditions A1-A4 of Theorem 1, so does $f^*_g := |\det(C)|^{-1} f^* \circ g^{-1}$, for any invertible affine transformation $g$.
Lemma 2. If conditions A1-A4 of Theorem 1 are satisfied by $f^*$, then for any invertible affine transformation $g(x) = Cx + b$, they are also satisfied by $f^*_g$.
The proof of Lemma 2 is postponed to Appendix A. We are now ready to prove Theorem 1 by combining Theorem 2 with Lemma 1 and Lemma 2.

Simulation study
We performed a simulation study to provide empirical support to our results on the large n asymptotic robustness of a DPM-G model specified as in (2) with base measure (4), under affine transformations of the data. We considered 15 different simulation scenarios. Specifically, we considered three different sample sizes, namely n = 100, n = 300 and n = 1 000. Then, for each sample size, we generated a sample from a mixture of two Gaussian components, one being highly correlated and the other uncorrelated, defined as In order to test the robustness of the model under affine transformations of the data, we stretched or compressed the generated datasets by using five different constants, namely c = 1/5, c = 1/2, c = 1, c = 2 and c = 5.
For each constant, we multiplied the simulated data by $c$, thus obtaining a transformed dataset $X_c^{(n)} := cX^{(n)}$. For each simulation scenario, namely $c \in \{1/5, 1/2, 1, 2, 5\}$, $n \in \{100, 300, 1\,000\}$, we generated 100 replicates. We then fitted a DPM-G model, specified as in (2) and (4), to each one of the 1 500 simulated datasets. In order to enhance the flexibility of the model, we completed its specification by setting a normal/inverse-Wishart prior distribution for the hyperparameters $(m_0, B_0)$ of the base measure (4). Namely, we set $B_0 \sim IW(4, \mathrm{diag}(15))$ and $m_0 \mid B_0 \sim N(0, B_0)$, a specification chosen so that $E[\mu] = 0$ and to guarantee a prior guess on the location component $\mu$ flat enough to cover the support of the non-transformed data. As for the scale component of the base measure (4), we set $(\nu_0, S_0) = (4, \mathrm{diag}(1))$. Finally, the mass parameter $\alpha$ of the Dirichlet process was set equal to 1.
Realisations of the mean of the posterior distribution were obtained by means of a Gibbs sampler relying on a Blackwell-MacQueen Pólya urn scheme (see Müller et al., 1996), implemented in the BNPmix R package. For each replicate, posterior inference was drawn based on 5 000 iterations, obtained after discarding the first 2 500. Convergence of the chains was assessed by visually investigating traceplots referring to randomly selected replicates, which did not provide indication against it. Figure 2 shows, for every $n \in \{100, 300, 1\,000\}$ and $c \in \{1/5, 1/2, 1, 2, 5\}$, a contour plot of the estimated posterior densities. The difference between estimated densities, across different values of $c$, is apparent when $n = 100$, with the two extreme cases, namely $c = 1/5$ and $c = 5$, suggesting a different number of modes in the estimated density. For larger sample sizes, this difference is less evident and, when $n = 1\,000$, the contour plots are hardly distinguishable. These qualitative observations are in agreement with the large $n$ asymptotic results of Theorem 1. The plots of Figure 2 refer to a single realisation of the samples $X^{(100)}$, $X^{(300)}$ and $X^{(1000)}$ considered in the simulation study, although qualitatively similar results can be found in almost any replicate.
The findings drawn from a visual inspection of Figure 2 were confirmed by assessing the distance between estimated posterior densities. Specifically, for any considered sample size $n$ and for any pair of values $c_1$ and $c_2$ taken by the constant $c$, we approximately evaluated the $L^1$ distance between the suitably rescaled estimated posterior densities obtained conditionally on $X_{c_1}^{(n)}$ and on $X_{c_2}^{(n)}$. The results of such analysis are shown in Figure 3 and indicate that, as the sample size grows, the difference in terms of $L^1$ distance decreases.

Figure 3: Simulation study. $L^1$ distance between suitably rescaled estimated densities after data transformations for different constants $c_1$ (X axis) and $c_2$ (Y axis), averaged over 100 replications. Left to right: sample size $n = 100$, sample size $n = 300$, sample size $n = 1\,000$.
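The phrase "suitably rescaled" can be made concrete with a small sketch. For $g(x) = cx$ in dimension $d$, a density $f_c$ estimated on the scaled data is mapped back to the original scale through $x \mapsto c^d f_c(cx)$, after which two estimates can be compared in $L^1$. The example below is univariate ($d = 1$) and uses two hypothetical Gaussian "estimates" in place of actual posterior means; the function names and the integration grid are assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def rescale_to_original(density_c, c):
    """Map a univariate density for c * X back to the scale of X: x -> c * f_c(c * x)."""
    return lambda x: c * density_c(c * x)


def l1_distance(f1, f2, lo, hi, n_grid=20001):
    """Trapezoidal approximation of the integral of |f1 - f2| over [lo, hi]."""
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        x = lo + i * h
        weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += weight * abs(f1(x) - f2(x))
    return total * h


# Hypothetical estimates obtained on data scaled by c = 1/2 and c = 5.
est_half = lambda y: normal_pdf(y, 0.0, 0.5)  # N(0, 0.5^2) on the scale of X / 2
est_five = lambda y: normal_pdf(y, 5.0, 5.0)  # N(5, 5^2) on the scale of 5 X

back_half = rescale_to_original(est_half, 0.5)  # equals the N(0, 1) density
back_five = rescale_to_original(est_five, 5.0)  # equals the N(1, 1) density

distance = l1_distance(back_half, back_five, lo=-12.0, hi=13.0)
```

Once mapped back, the two hypothetical estimates are the $N(0,1)$ and $N(1,1)$ densities, whose $L^1$ distance has the closed form $2\,\mathrm{erf}(0.5/\sqrt{2})$; this gives a simple way to validate the grid approximation.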
The posterior distribution of the random density induced by a DPM-G model provides interesting insight also on the clustering structure of the data. The second goal of the simulation study, thus, consisted in investigating the impact of the scaling factor $c$ on the estimated number of groups in the partition induced on the data. To this end, for each considered $n$ and $c$, we estimated $\hat K_n^{(\mathrm{VI})}$, the number of groups in the optimal partition estimated using a procedure introduced by Wade and Ghahramani (2018) and based on the variation of information loss function. The average values for this quantity, over 100 replicates, are reported in Table 1. There appears to be a clear trend suggesting that a larger scaling constant $c$ leads to a larger $\hat K_n^{(\mathrm{VI})}$: this finding is consistent with the fact that, if the data are stretched while the prior specification is kept unchanged, then we expect the estimated posterior density to need a larger number of Gaussian components to cover the support of the sample. For the purpose of this simulation study the main quantity of interest is the ratio between the values of $\hat K_n^{(\mathrm{VI})}$ obtained under different scaling constants.

Table 1: average values of $\hat K_n^{(\mathrm{VI})}$ over 100 replicates; columns $c = 1/5$, $c = 1/2$, $c = 1$, $c = 2$, $c = 5$; rows indexed by the sample size $n$.

The results in Table 1 clearly indicate that, as the sample size $n$ becomes large, such ratios tend to approach 1. This suggests that the large $n$ robustness property of the DPM-G model nicely translates to an equivalent notion of robustness in terms of the estimated number of groups $\hat K_n^{(\mathrm{VI})}$ in the data.
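The variation of information underlying the partition estimate can be illustrated by computing it between two given partitions, as $\mathrm{VI}(A, B) = H(A) + H(B) - 2 I(A, B)$. The sketch below only evaluates this distance for two label vectors; it does not implement the search for the optimal partition of Wade and Ghahramani (2018). The function name and the use of natural logarithms are assumptions of the example.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Variation of information between two partitions of the same items (natural log)."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both partitions must label the same number of items")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))  # sizes of the pairwise intersections
    entropy_a = -sum((m / n) * math.log(m / n) for m in count_a.values())
    entropy_b = -sum((m / n) * math.log(m / n) for m in count_b.values())
    mutual_info = sum(
        (m / n) * math.log((m / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), m in count_ab.items()
    )
    return entropy_a + entropy_b - 2.0 * mutual_info


vi_same = variation_of_information([0, 0, 1, 1], [5, 5, 3, 3])  # same partition, relabelled
vi_crossed = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])  # independent splits
```

The distance depends only on the groupings, not on the label values, so relabelled copies of the same partition are at distance zero, while two independent balanced splits of four items are at distance $2 \log 2$.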

Astronomical data
The large n asymptotic robustness to affine transformation of the DPM-G model makes it a suitable candidate also for analysing data whose components are not commensurable and for which an informed choice of the relative scale of different dimensions seems prohibitive. We fitted the DPM-G model, specified as in (2) and with base measure (4), to the NGC 2419 dataset described in Section 1. The ultimate goal of our analysis consists in classifying stars as belonging to the NGC 2419 globular cluster or as being contaminants: an accurate classification is crucial for the astronomers to study the dynamics of the globular cluster. Since the large majority of the stars in the dataset is expected to belong to the globular cluster, with only a few of them being contaminants, we will identify the globular cluster as the largest group in the estimated partition of the dataset.
Prior to any analysis, data were standardised component by component, the legitimacy of such a procedure following from the robustness results of Theorem 1. Hyperprior distributions were specified for the location parameter of the base measure (4) and for the DP mass parameter $\alpha$. Specifically, $B_0 \sim IW(6, \mathrm{diag}(15))$ and $m_0 \mid B_0 \sim N(0, B_0)$, a specification chosen to guarantee a prior guess on the location component $\mu$ flat enough to cover the support of the data and centered at 0. In addition, the precision parameter $\alpha$ was given a gamma prior distribution with unit shape parameter and rate parameter equal to 5.26, so as to reflect the prior opinion of astronomers who would expect two distinct groups of stars in the dataset. Finally, as far as the scale component of the base measure (4) is concerned, we set $(\nu_0, S_0) = (26, \mathrm{diag}(21))$, where the number of degrees of freedom $\nu_0 = 26$ of the inverse-Wishart distribution was chosen to guarantee the conditions of Theorem 1 and, in turn, the scale matrix $S_0 = \mathrm{diag}(21)$ so that $E[\Sigma] = \mathrm{diag}(1)$. Realisations of the mean of the posterior distribution were obtained by means of a Gibbs sampler relying on a Blackwell-MacQueen Pólya urn scheme. In turn, posterior inference was drawn based on 20 000 iterations, after a burn-in period of 5 000 iterations. Convergence of the chains was assessed by visually investigating traceplots, which did not provide indication against it. Figure 5 shows the scatter plots of the dataset with individual observations coloured according to their membership in the partition estimated based on the variation of information loss function (Wade and Ghahramani, 2018) and labelled as main group (grey circles) and other groups (coloured triangles). The estimated partition is composed of five groups. The largest one, identified as the globular cluster, consists of 124 stars.
The remaining 15 stars are thus considered contaminants and are further divided into four groups, one composed of eight stars (group A), one containing five stars (group B) and two singletons (groups C and D). A visual investigation of Figure 5 suggests that stars in group A differ from those in the globular cluster   Figure 4) based on ad hoc physical considerations. Specifically, once the best fitting physical model, in the class of either Newtonian or Modified Newtonian Dynamics models, is detected, they use it in order to compute the average values of the physical variables describing the stars. Stars are then assigned to the globular cluster based on a comparison between their velocity and the average model velocity: those lying close enough are deemed to belong to the cluster, while the others are considered as potential contaminants. For the latter, the evidence of being contaminants is measured by evaluating how distant their metallicity is from the average model one. Two classifications are then proposed: the first one assigns to the globular cluster only the 118 stars for which the evidence seems strong, while the second and less conservative strategy classifies as belonging to the globular cluster a total of 130 stars. Following this distinction and for the sake of simplicity, we summarise the results of the analysis of Ibata et al. (2011) by devising three groups of stars:
- globular cluster: 118 stars deemed to belong to the globular cluster,
- likely globular cluster: 12 stars assigned to the globular cluster only when the less conservative procedure is adopted,
- contaminants: 9 stars with strong evidence of being contaminants.
For the purpose of comparison, we report in Table 2  , the approach based on the DPM-G model assigns none to the globular cluster, three to group A, five stars to group B, which is composed only of stars considered contaminants in Ibata et al., and the star of group C, which shows an extremely small value for the velocity variable.
Finally, group D contains only one star, which is not considered a contaminant in Ibata et al. Further insight on the clustering structure of the data is provided by Figure 6, which shows the heatmap representation of the posterior similarity matrix obtained from the MCMC output. In agreement with the partition obtained by applying the approach of Wade and Ghahramani (2018), one main group identified with the globular cluster can be clearly detected in Figure 6. As for the remaining stars, arguably the contaminants, there seem to be two well defined groups, A and B, and a few stars whose group membership is less certain.
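Two of the prior choices made in this section lend themselves to a quick numerical check. The sketch below first verifies that an inverse-Wishart distribution with $\nu_0 = 26$ and $S_0 = \mathrm{diag}(21)$ in dimension $d = 4$ has mean $\mathrm{diag}(1)$, using the standard formula $E[\Sigma] = S_0 / (\nu_0 - d - 1)$. It then evaluates the prior expected number of groups among $n = 139$ stars when $\alpha$ has a gamma prior with unit shape and rate 5.26, using the standard identity $E[K_n \mid \alpha] = \sum_{i=1}^{n} \alpha / (\alpha + i - 1)$. That the rate 5.26 was calibrated through this expectation is an assumption of the example, since the text does not state the calibration rule; the function names, the integration range and the grid are likewise illustrative.

```python
import math


def inverse_wishart_mean_scale(nu, s_diag, d):
    """Diagonal entry of E[Sigma] for Sigma ~ IW(nu, s_diag * I_d), valid for nu > d + 1."""
    return s_diag / (nu - d - 1)


def expected_clusters_given_alpha(alpha, n):
    """E[K_n | alpha] = sum_{i=1}^{n} alpha / (alpha + i - 1) under a Dirichlet process."""
    return sum(alpha / (alpha + i) for i in range(n))


def expected_clusters_gamma_prior(n, shape, rate, upper=5.0, n_grid=4000):
    """Average E[K_n | alpha] over alpha ~ Gamma(shape, rate) with a midpoint rule on (0, upper)."""
    h = upper / n_grid
    total = 0.0
    for j in range(n_grid):
        a = (j + 0.5) * h  # midpoints avoid evaluating the integrand at alpha = 0
        density = rate ** shape * a ** (shape - 1.0) * math.exp(-rate * a) / math.gamma(shape)
        total += expected_clusters_given_alpha(a, n) * density
    return total * h


prior_mean_scale = inverse_wishart_mean_scale(nu=26, s_diag=21.0, d=4)
prior_expected_groups = expected_clusters_gamma_prior(n=139, shape=1.0, rate=5.26)
```

The value of `prior_expected_groups` can be compared with the two groups of stars that astronomers expect a priori.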

Conclusion
The purpose of this paper was to investigate the behaviour of the multivariate DPM-G model when affine transformations are applied to the data. To this end we focused on the DPM-G model with independent normal and inverse-Wishart specification for the base measure. Our investigation covered both the finite sample size and the asymptotic setting. Specifically, in Proposition 1, given any affine transformation $g$, an explicit model specification, depending on $g$, was derived so as to ensure coherence between posterior inferences carried out based on a dataset or its transformation via $g$. We then considered a different setting where the specification of the model is assumed independent of the specific transformation $g$. In this case, we formalised the notion of asymptotic robustness of a model under transformations of the data and showed that mild conditions on the true data generating distribution are sufficient to ensure that the DPM-G model features such a property. Specifically, Theorem 1 shows that the posterior distributions obtained conditionally on a dataset or any affine transformation of it become more and more similar as the sample size grows. Inference on densities and, as a by-product, on the clustering structure underlying the data, thus becomes increasingly less dependent on the affine transformation applied to the data, as the sample size grows to infinity. As a special case, Theorem 1 implies that posterior inference based on DPM-G models is asymptotically robust to data transformations commonly adopted for the sake of numerical efficiency, such as standardisation or normalisation. This observation is particularly relevant when dealing with the astronomical unsupervised clustering problem motivating this work. Due to the lack of prior information on the dimensional constants relating different physical units, we resorted to a standardisation of each component of the data and chose an arbitrary model specification.
Prior information was available in the form of the experts' prior opinion on the expected number of groups in the dataset and was used to elicit the hyperprior distribution for α, the total mass parameter of the DP.
= f * (y) log f * (y) ϕ δ (y) dy < ∞ where the last inequality holds by Assumption A3 on f * with δ . This finally shows that f * g satisfies Assumption A3 with δ.