DPM-G model and affine transformations of the data
Let \(\tilde{f}_{\varvec{\pi }}\) be a DPM-G model defined in (1), with base measure (2) and hyperparameters \(\varvec{\pi }\). The next result shows that, for any invertible affine transformation \(g(\mathbf {x})= \mathbf {C}\mathbf {x}+\mathbf {b}\), there exists a specification \(\varvec{\pi }_{g}:=(\mathbf {m}_0^{(g)},\mathbf {B}_0^{(g)},\nu _0^{(g)},\mathbf {S}_0^{(g)})\) of the hyperparameters characterising the base measure in (2), such that the deterministic relation \(\tilde{f}_{\varvec{\pi }_g}=|\det (\mathbf {C})|^{-1}\tilde{f}_{\varvec{\pi }}\circ g^{-1}\) holds. That is, for every \(\omega \in \varOmega \) and given a random vector \(\mathbf {X}\) distributed according to \(\tilde{f}_{\varvec{\pi }}(\omega )\), we have that \(\tilde{f}_{\varvec{\pi }_g}(\omega )\) is the density of the transformed random vector \(g(\mathbf {X})\).
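For intuition, the stated relation is the usual change-of-variables identity, applied realisation by realisation: if a fixed density \(f\) is the law of \(\mathbf {X}\) and \(g(\mathbf {x})=\mathbf {C}\mathbf {x}+\mathbf {b}\) is invertible, then for any Borel set \(A\subseteq \mathbb {R}^d\)
$$\begin{aligned} \mathbb {P}(g(\mathbf {X})\in A)=\int _{g^{-1}(A)}f(\mathbf {x})\,\mathrm {d}\mathbf {x}=\int _{A}f(g^{-1}(\mathbf {y}))\,|\det (\mathbf {C})|^{-1}\,\mathrm {d}\mathbf {y}, \end{aligned}$$
so that \(g(\mathbf {X})\) has density \(|\det (\mathbf {C})|^{-1}\,f\circ g^{-1}\). The content of the next proposition is that, for a suitable choice of \(\varvec{\pi }_g\), the same identity links the random densities \(\tilde{f}_{\varvec{\pi }}\) and \(\tilde{f}_{\varvec{\pi }_g}\).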
Proposition 1
Let \(\tilde{f}_{\varvec{\pi }}\) be a location-scale DPM-G model defined as in (1), with base measure (2) and hyperparameters \(\varvec{\pi }= (\mathbf {m}_0, \mathbf {B}_0, \nu _0, \mathbf {S}_0)\). For any invertible affine transformation \(g(\mathbf {x})=\mathbf {C}\mathbf {x}+\mathbf {b}\), we have the deterministic relation
$$\begin{aligned} \tilde{f}_{\varvec{\pi }_g}=|\det (\mathbf {C})|^{-1}\tilde{f}_{\varvec{\pi }}\circ g^{-1}, \end{aligned}$$
where \(\varvec{\pi }_{g} :=(\mathbf {C}\mathbf {m}_0+\mathbf {b},\mathbf {C}\mathbf {B}_0 \mathbf {C}^\intercal ,\nu _0,\mathbf {C}\mathbf {S}_0 \mathbf {C}^\intercal )\).
While Proposition 1 can be derived from general properties of the Dirichlet process (see Lijoi and Prünster 2009), a direct proof is provided in “Appendix A.1”. This result implies that, for any invertible affine transformation g, modelling the set of observations \(\mathbf {X}^{(n)}\) with a DPM-G model (1), with base measure (2) and hyperparameters \(\varvec{\pi }\), is equivalent to assuming the same model, with transformed hyperparameters \(\varvec{\pi }_g\), for the transformed observations \(g(\mathbf {X})^{(n)}:=(g(\mathbf {X}_1),\ldots ,g(\mathbf {X}_n))\). As a by-product, the same posterior inference can be drawn conditionally on either the original or the transformed set of observations, as the conditional distribution of the random density \(\tilde{f}_{\varvec{\pi }_g}\), given \(g(\mathbf {X})^{(n)}\), coincides with the conditional distribution of \(|\det (\mathbf {C})|^{-1}\tilde{f}_{\varvec{\pi }} \circ g^{-1}\), given \(\mathbf {X}^{(n)}\). Proposition 1 thus provides a formal justification for the common practice of transforming the data, e.g. via standardisation or normalisation, to achieve numerical efficiency: as long as the prior specification of the hyperparameters of a DPM-G model respects the condition of Proposition 1, transforming the data does not affect posterior inference.
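As an illustration, the sketch below (in Python; not code from the paper, and the helper name is ours) computes \(\varvec{\pi }_g\) when g is the standardisation map, a common preprocessing transformation, so that fitting the model with \(\varvec{\pi }_g\) to the standardised data matches fitting it with \(\varvec{\pi }\) to the raw data.

```python
import numpy as np

def transformed_hyperparameters(m0, B0, nu0, S0, C, b):
    """Hyperparameters pi_g of Proposition 1 for the affine map g(x) = C x + b:
    pi_g = (C m0 + b, C B0 C^T, nu0, C S0 C^T)."""
    return C @ m0 + b, C @ B0 @ C.T, nu0, C @ S0 @ C.T

# Hypothetical example: standardising a data set X (n rows, d columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
C = np.diag(1.0 / X.std(axis=0, ddof=1))   # rescales each coordinate
b = -C @ X.mean(axis=0)                    # centres each coordinate
X_std = X @ C.T + b                        # standardised observations g(X_i)

# Arbitrary hyperparameters for the raw data.
d = X.shape[1]
m0, B0, nu0, S0 = np.zeros(d), np.eye(d), d + 2.0, np.eye(d)
m0_g, B0_g, nu0_g, S0_g = transformed_hyperparameters(m0, B0, nu0, S0, C, b)
# Fitting the DPM-G with (m0_g, B0_g, nu0_g, S0_g) to X_std is then equivalent,
# in the sense of Proposition 1, to fitting it with (m0, B0, nu0, S0) to X.
```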
Empirical Bayes approach
The elicitation of an honest prior, that is, one independent of the data, for the hyperparameters \(\varvec{\pi }\) of the base measure (2) of a DPM model is in general a difficult task. A popular practice therefore consists in setting the hyperparameters equal to empirical estimates \(\hat{\varvec{\pi }}(\mathbf {X}^{(n)})\), following the so-called empirical Bayes approach (see, e.g., Lehmann and Casella 2006). Recent investigations (Petrone et al. 2014; Donnet et al. 2018) provide a theoretical justification for this hybrid procedure by shedding light on its asymptotic properties. We show here that this procedure satisfies the assumptions of Proposition 1 and thus guarantees that posterior Bayesian inference under an empirical Bayes approach is not affected by affine transformations of the data.
A commonly used empirical Bayes approach for specifying the hyperparameters \(\varvec{\pi }\) of a DPM-G model, defined as in (1) and (2), consists in setting
$$\begin{aligned} \mathbf {m}_0=\overline{\mathbf {X}},\qquad \quad \mathbf {B}_0=\frac{1}{\gamma _1}\mathbf {S}_\mathbf {X}^2,\qquad \quad \mathbf {S}_0= \frac{\nu _0 - d - 1}{\gamma _2} \mathbf {S}_\mathbf {X}^2, \end{aligned}$$
(3)
where \(\overline{\mathbf {X}}=\sum _{i=1}^n \mathbf {X}_i/n\) and \(\mathbf {S}_\mathbf {X}^2=\sum _{i=1}^n(\mathbf {X}_i-\overline{\mathbf {X}})(\mathbf {X}_i-\overline{\mathbf {X}})^\intercal /(n-1)\) are the sample mean vector and the sample covariance matrix, respectively, and \(\gamma _1,\gamma _2>0\), \(\nu _0>d+1\). This specification of the hyperparameters \(\varvec{\pi }\) has a straightforward interpretation. The parameter \(\mathbf {m}_0\), mean of the prior guess distribution of \(\varvec{\mu }\), can be interpreted as the overall mean value and, in the absence of prior information, set equal to the observed sample mean. The parameter \(\mathbf {B}_0\), covariance matrix of the prior guess distribution of \(\varvec{\mu }\), is set equal to a penalised version of the sample covariance matrix \(\mathbf {S}^2_\mathbf {X}\), where \(\gamma _1\) takes on the interpretation of the size of the ideal prior sample upon which the prior guess on the distribution of \(\varvec{\mu }\) is based. Similarly, the hyperparameter \(\mathbf {S}_0\) is set equal to a penalised version of the sample covariance matrix \(\mathbf {S}_\mathbf {X}^2\), a choice that corresponds to the prior guess that the covariance matrix of each component of the mixture coincides with a rescaled version of the sample covariance matrix. Specifically, \(\mathbf {S}_0= \mathbf {S}^2_\mathbf {X}(\nu _0 - d - 1)/\gamma _2\) follows by setting \(\mathbb {E}[\varvec{\Sigma }]=\mathbf {S}_\mathbf {X}^2/\gamma _2\) and observing that, by standard properties of the inverse-Wishart distribution, \(\mathbb {E}[\varvec{\Sigma }]=\mathbf {S}_0/(\nu _0 - d - 1)\). Finally, the parameter \(\nu _0\) takes on the interpretation of the size of an ideal prior sample upon which the prior guess \(\mathbf {S}_0\) is based.

We now focus on the specification of the hyperparameters \(\varvec{\pi }_g\), given the transformed observations \(g(\mathbf {X})^{(n)}\). The same empirical Bayes procedure adopted in (3) leads to
$$\begin{aligned} \mathbf {m}_0^{(g)}=\overline{g(\mathbf {X})}=\mathbf {C}\mathbf {m}_0 + \mathbf {b},\qquad \mathbf {B}^{(g)}_0=\frac{1}{\gamma _1}\mathbf {S}_{g(\mathbf {X})}^2,\qquad \mathbf {S}^{(g)}_0= \frac{\nu _0 - d - 1}{\gamma _2} \mathbf {S}_{g(\mathbf {X})}^2. \end{aligned}$$
Observing that \(\mathbf {S}_{g(\mathbf {X})}^2=\mathbf {C}\mathbf {S}_\mathbf {X}^2 \mathbf {C}^\intercal \) and setting \(\nu _0^{(g)}=\nu _0\) shows that the described empirical Bayes procedure corresponds to \(\varvec{\pi }_g=(\mathbf {C}\mathbf {m}_0+\mathbf {b},\mathbf {C}\mathbf {B}_0 \mathbf {C}^\intercal ,\nu _0,\mathbf {C}\mathbf {S}_0 \mathbf {C}^\intercal )\) and, thus, by Proposition 1, \(\tilde{f}_{\varvec{\pi }_g}=|\det (\mathbf {C})|^{-1}\tilde{f}_{\varvec{\pi }}\circ g^{-1}\).
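To make this equivariance explicit, here is a small numerical check (a Python sketch under our own naming, not code from the paper): the hyperparameters (3) computed on the transformed data coincide with the transformed hyperparameters of Proposition 1.

```python
import numpy as np

def empirical_bayes_hyperparameters(X, gamma1, gamma2, nu0):
    """Empirical Bayes specification (3): m0 = sample mean, B0 = S2 / gamma1,
    S0 = (nu0 - d - 1) * S2 / gamma2, with S2 the sample covariance matrix."""
    d = X.shape[1]
    m0 = X.mean(axis=0)
    S2 = np.cov(X, rowvar=False)            # unbiased sample covariance
    return m0, S2 / gamma1, nu0, (nu0 - d - 1) / gamma2 * S2

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
C = np.array([[2.0, 0.5], [0.0, 1.5]])      # any invertible matrix
b = np.array([1.0, -3.0])
Xg = X @ C.T + b                            # transformed observations g(X_i)

gamma1, gamma2, nu0 = 2.0, 2.0, 6.0
m0, B0, _, S0 = empirical_bayes_hyperparameters(X, gamma1, gamma2, nu0)
m0g, B0g, _, S0g = empirical_bayes_hyperparameters(Xg, gamma1, gamma2, nu0)

assert np.allclose(m0g, C @ m0 + b)          # m0^(g) = C m0 + b
assert np.allclose(B0g, C @ B0 @ C.T)        # B0^(g) = C B0 C^T
assert np.allclose(S0g, C @ S0 @ C.T)        # S0^(g) = C S0 C^T
```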
Large n asymptotic robustness
We investigate the effect of affine transformations of the data on DPM-G models by studying the asymptotic behaviour of the resulting posterior distributions in the large sample size regime. To this end, we fit the same DPM-G model \(\tilde{f}_{\varvec{\pi }}\), defined in (1) and (2), to two versions of the data, namely \(\mathbf {X}^{(n)}\) and \(g(\mathbf {X})^{(n)}\), using exactly the same specification of the hyperparameters \(\varvec{\pi }\). Under this setting, the assumptions of Proposition 1 are not met and the posterior distributions obtained by conditioning on the two sets of observations are different random distributions, which might therefore lead to different statistical conclusions. The main result of this section shows that, under mild conditions on the true generating distribution of the observations, the posterior distributions obtained by conditioning \(\tilde{f}_{\varvec{\pi }}\) on the two sets of observations \(\mathbf {X}^{(n)}\) and \(g(\mathbf {X})^{(n)}\) become more and more similar, up to an affine reparametrisation, as the sample size n grows. More specifically, we show that the probability mass of the joint distribution of these two conditional random densities concentrates in a neighbourhood of \(\{(f_1,f_2)\in \mathscr {F}\times \mathscr {F}\text { s.t. }f_1=|\det (\mathbf {C})| f_2 \circ g\}\) as n goes to infinity. Henceforth we will say that the DPM-G model (1) with base measure (2) is asymptotically robust to affine transformations of the data. The rest of the section formalises and discusses this result.

We consider a metric \(\rho \) on \(\mathscr {F}\), which can equivalently be taken to be the Hellinger distance \(\rho (f_1,f_2)=\{\int (\sqrt{f_1(\mathbf {x})}-\sqrt{f_2(\mathbf {x})})^2\mathrm {d}\mathbf {x}\}^{1/2}\) or the \(L^1\) distance \(\rho (f_1,f_2)=\int |f_1(\mathbf {x})-f_2(\mathbf {x})|\mathrm {d}\mathbf {x}\) between densities \(f_1\) and \(f_2\) in \(\mathscr {F}\), and we denote by \(\Vert \cdot \Vert \) the Euclidean norm on \(\mathbb {R}^d\). Moreover, we adopt the usual frequentist validation approach in the large n regime, working ‘as if’ the observations \(\mathbf {X}^{(n)}\) were generated from a true and fixed data generating process \(F^*\) (see, for instance, Rousseau 2016). We write \(F_{n}^*\) for the n-fold product measure \(F^*\times \cdots \times F^*\), and we assume that \(F^*\) admits a density function with respect to the Lebesgue measure, denoted by \(f^*\). In the setting we consider, the same model \(\tilde{f}_{\varvec{\pi }}\) defined in (1) and (2) is fitted to \(\mathbf {X}^{(n)}\) and \(g(\mathbf {X})^{(n)}\), thus leading to two distinct posterior random densities, with distributions on \(\mathscr {F}\) denoted by \(\varPi (\,\cdot \, \mid \mathbf {X}^{(n)})\) and \(\varPi (\,\cdot \, \mid g(\mathbf {X})^{(n)})\), respectively. We use the notation \(\varPi _2(\,\cdot \, \mid \mathbf {X}^{(n)})\) to refer to their joint posterior distribution on \(\mathscr {F}\times \mathscr {F}\).
Theorem 1
Let \(f^*\in \mathscr {F}\), the true generating density of \(\mathbf {X}^{(n)}\), satisfy the following conditions:
- A1. \(0< f^*(\mathbf {x}) < M\), for some constant M and for all \(\mathbf {x}\in \mathbb {R}^d\),
- A2. for some \(\eta > 0\), \(\int \Vert \mathbf {x}\Vert ^{2(1+\eta )}f^*(\mathbf {x})\mathrm {d}\mathbf {x}< \infty \),
- A3. \(\mathbf {x}\mapsto f^*(\mathbf {x})\log ^2(\varphi _\delta (\mathbf {x}))\) is bounded on \(\mathbb {R}^d\), where \(\varphi _\delta (\mathbf {x}) = \inf _{\{\mathbf {t}\,:\,\Vert \mathbf {t}-\mathbf {x}\Vert <\delta \}}f^*(\mathbf {t})\).
Let \(g:\mathbb {R}^d \longrightarrow \mathbb {R}^d\) be an invertible affine transformation and let \(\tilde{f}_{\varvec{\pi }}\) be the random density induced by a DPM-G model as in (1), with base measure (2) and \(\nu _0>(d + 1)(2d - 3)\). Then, for any \(\varepsilon >0\),
$$\begin{aligned} \varPi _2((f_1,f_2): \rho (f_1,|\det (\mathbf {C})| f_2 \circ g)<\varepsilon \mid \mathbf {X}^{(n)})\longrightarrow 1 \end{aligned}$$
in \(F_{n}^*\)-probability, as \(n\rightarrow \infty \).
It is worth stressing that, while in line with the usual posterior consistency approach the existence of a true data generating process \(F^*\) is postulated, the focus of Theorem 1 is not on the asymptotic behaviour of the posterior distribution with respect to the true data generating process, but rather on the relative behaviour of two posterior distributions, obtained by conditioning the same model on two sets of observations that coincide up to an affine transformation. More specifically, according to Theorem 1, as the sample size grows, the joint distribution \(\varPi _2(\,\cdot \, \mid \mathbf {X}^{(n)})\) concentrates its mass on the subset of \(\mathscr {F}\times \mathscr {F}\) where the distance \(\rho \) between \(f_1\) and \(|\det (\mathbf {C})| f_2 \circ g\) is smaller than \(\varepsilon \). In other words, the two posterior distributions become increasingly similar, up to the affine transformation, as n becomes large.
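As a concrete, purely illustrative example of the quantity \(\rho (f_1,|\det (\mathbf {C})| f_2 \circ g)\) controlled by Theorem 1, the following Python sketch (not code from the paper; the function name is ours) evaluates the Hellinger distance numerically in dimension \(d=1\): when \(f_1\) is the density of \(\mathbf {X}\) and \(f_2\) the density of \(g(\mathbf {X})\), the pair lies in the limit set and the distance is (numerically) zero.

```python
import numpy as np
from scipy.stats import norm

def rho_hellinger(f1, f2, grid):
    """Hellinger distance between two univariate densities, approximated on an
    equally spaced grid by the trapezoidal rule; the L1 distance could be used
    interchangeably."""
    return np.sqrt(np.trapz((np.sqrt(f1(grid)) - np.sqrt(f2(grid))) ** 2, grid))

# Take f1 the density of X ~ N(0, 1), g(x) = 2 x + 1 (so C = 2, b = 1),
# and f2 the density of g(X), i.e. N(1, 4).  Then |det C| f2(g(x)) = f1(x).
grid = np.linspace(-15, 15, 20001)
f1 = norm(0, 1).pdf
f2 = norm(1, 2).pdf                        # scipy parameterises by std dev
C, b = 2.0, 1.0
print(rho_hellinger(f1, lambda x: abs(C) * f2(C * x + b), grid))  # ~ 0.0
```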
The assumptions of Theorem 1 refer to the true generating density \(f^*\) of \(\mathbf {X}^{(n)}\). Assumption A1 requires \(f^*\) to be bounded and fully supported on \(\mathbb {R}^d\). Assumption A2 requires the tails of \(f^*\) to be thin enough for some moment of order strictly larger than two to exist. This assumption is not met, for example, by a Student's t-distribution with two degrees of freedom, a case which will be considered in the simulation study of Sect. 4. Finally, assumption A3 is a weak condition ensuring local regularity of the entropy of \(f^*\).
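To see why A2 fails in that example, consider the univariate case: the Student's t density with two degrees of freedom is \(f^*(x)=(2+x^2)^{-3/2}\), which decays like \(|x|^{-3}\), so that for any \(\eta >0\)
$$\begin{aligned} \int |x|^{2(1+\eta )}f^*(x)\,\mathrm {d}x\ \ge \ \int _{|x|\ge 1}\frac{|x|^{2+2\eta }}{(2+x^2)^{3/2}}\,\mathrm {d}x\ \ge \ c\int _{1}^{\infty }x^{2\eta -1}\,\mathrm {d}x=\infty , \end{aligned}$$
for some constant \(c>0\), and the moment condition in A2 cannot hold.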
The proof of Theorem 1 builds on results by Wu and Ghosal (2008) and Canale and De Blasi (2017), originally derived to establish the so-called Kullback–Leibler property at \(f^*\) for certain Gaussian mixture models. Importantly, in Lemma 1 (see Appendix 1), we improve upon their results by showing that the set of assumptions required by Wu and Ghosal (2008) and Canale and De Blasi (2017) can be reduced to the simpler set of assumptions A1, A2 and A3 of Theorem 1 by removing a redundant assumption. More specifically, we prove that A1, A2 and A3 imply that \(f^*\) has finite entropy and regular local entropy, conditions required in the aforementioned works.