While interesting product-form distributions can be found throughout the applied probability literature—ranging from the stationary distributions of Jackson queues (Jackson 1957; Kelly 1979) and complex-balanced stochastic reaction networks (Anderson et al. 2010; Cappelletti and Wiuf 2016) to the mean-field approximations used in variational inference (Ranganath et al. 2014; Blei et al. 2017)—most target distributions encountered in practice are not product-form. In this section, we demonstrate how to combine product-form estimators with other Monte Carlo methodology and expand their utility beyond the product-form case.
We consider three simple extensions: one to targets that are absolutely continuous with respect to fully factorized distributions (Sect. 3.1), resulting in a product-form variant of importance sampling [e.g., see Chapter 8 in Chopin and Papaspiliopoulos (2020)]; another to targets that are absolutely continuous with respect to partially factorized distributions (Sect. 3.2), resulting in a product-form version of importance sampling squared (Tran et al. 2013); and a final one to targets with intractable densities arising from latent variable models (Sect. 3.3), resulting in a product-form variant of pseudo-marginal MCMC (Schmon et al. 2020). In all cases, we show theoretically that the product-form variants achieve smaller variances than their standard counterparts. We then investigate their performance numerically by applying them to a simple hierarchical model (Sect. 3.4).
A further extension, this time to targets that are mixtures of product-form distributions, can be found in Appendix F. Because many distributions may be approximated with these mixtures, this extension potentially opens the door to tackling still more complicated targets (at the expense of introducing some bias).
Importance sampling
Suppose that we are given an unnormalized (but finite) unsigned target measure \(\gamma \) that is absolutely continuous with respect to the product-form distribution \(\mu \) in Sect. 2, and let \(w:=d\gamma /d\mu \) be the corresponding Radon–Nikodym derivative. Instead of the usual importance sampling (IS) estimator, \(\gamma ^N(\varphi ):=\mu ^N(w\varphi )\) with \(\mu ^N\) as in (6), for \(\gamma (\varphi )\), we consider its product-form variant, \(\gamma ^N_\times (\varphi ):=\mu _\times ^N(w\varphi )\) with \(\mu ^N_\times \) as in (7). The results of Sect. 2 immediately give us the following.
Corollary 2
If \(\varphi \) is \(\gamma \)-integrable, then \(\gamma ^N_\times (\varphi )\) is an unbiased estimator for \(\gamma (\varphi )\). If, furthermore, \(w\varphi \) lies in \(L^2_\mu \) , then \(\gamma ^N_\times (\varphi )\) is strongly consistent, asymptotically normal, and its finite sample and asymptotic variances are bounded above by those of \(\gamma ^N(\varphi )\):
$$\begin{aligned} \mathrm{Var}\,(\gamma ^N_{\times }(\varphi ))&=\mathrm{Var}\,(\mu ^{N}_\times (w\varphi ))\\&\le \mathrm{Var}\,(\mu ^{N}(w\varphi ))\\&=\mathrm{Var}\,(\gamma ^N(\varphi ))\quad \forall N>0,\\ \sigma ^2_{\gamma ,\times }(\varphi )&=\sigma ^2_{\times }(w\varphi ) \le \sigma ^2(w\varphi )=\sigma ^2_{\gamma }(\varphi ), \end{aligned}$$
where \(\mathrm{Var}\,(\mu ^{N}_\times (w\varphi ))\) and \(\sigma ^2_{\times }(w\varphi )\) are as in Theorem 1.
Proof
Replace \(\varphi \) with \(w\varphi \) in Theorem 1 and Corollary 1. \(\square \)
Corollary 2 tells us that \(\gamma ^N_\times (\varphi )\) is more statistically efficient than the conventional IS estimator \(\gamma ^N(\varphi )\) regardless of whether the target \(\gamma \) is product-form or not. In a nutshell, \(\mu _\times ^N\) is a better approximation to the proposal \(\mu \) than \(\mu ^N\) and, consequently, \(\gamma ^N_\times (dx)=w(x)\mu _\times ^N(dx)\) is a better approximation to \(\gamma (dx)=w(x)\mu (dx)\) than \(\gamma ^N(dx)=w(x)\mu ^N(dx)\). Indeed, by forming all \(N^K\) permutations of the components of the tuples \(X^1,\dots ,X^N\), we explore areas of the state space that the original tuples miss. This can be particularly useful when the proposal and target are mismatched, as it can amplify the number of tuples landing in the target’s high-probability regions (i.e., achieving high weights w) and, consequently, substantially improve the quality of the finite sample approximation (Fig. 3).
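To make the comparison concrete, the following minimal sketch (our own toy illustration, not code accompanying the paper) estimates \(\gamma (\varphi )\) for \(K=2\): the proposal \(\mu \) is a pair of independent standard normals, the unnormalized target \(\gamma \) is a correlated Gaussian, and the values of \(N\), \(\rho \), and the test function are arbitrary choices. The standard IS estimator uses the \(N\) tuples as drawn, whereas the product-form variant averages \(w\varphi \) over all \(N^2\) recombinations of their components.

```python
import numpy as np

rng = np.random.default_rng(0)
N, rho = 200, 0.8                         # illustrative sample size and target correlation

# Proposal mu = N(0,1) x N(0,1); unnormalised target gamma is a correlated bivariate
# Gaussian, so the weight w = dgamma/dmu genuinely couples the two coordinates.
def w(x1, x2):
    return np.exp(-(x1**2 - 2*rho*x1*x2 + x2**2) / (2*(1 - rho**2)) + (x1**2 + x2**2) / 2)

def phi(x1, x2):                          # arbitrary test function
    return x1 * x2

x1 = rng.standard_normal(N)               # marginal samples X_1^1, ..., X_1^N
x2 = rng.standard_normal(N)               # marginal samples X_2^1, ..., X_2^N

# Standard IS estimator gamma^N(phi): keeps the tuples exactly as drawn.
gamma_is = np.mean(w(x1, x2) * phi(x1, x2))

# Product-form IS estimator gamma^N_x(phi): averages w*phi over all N^2 recombinations.
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
gamma_pfis = np.mean(w(X1, X2) * phi(X1, X2))
```

Both estimators are unbiased for \(\gamma (\varphi )\); Corollary 2 guarantees that the second has the smaller variance.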
Similarly, the self-normalized version \(\pi ^N_\times (\varphi ):=\gamma ^N_\times (\varphi )/\gamma ^N_\times (S)\) of the product-form IS estimator \(\gamma ^N_\times (\varphi )\) is a consistent and asymptotically normal estimator for averages \(\pi (\varphi )\) with respect to the normalized target \(\pi :=\gamma /\gamma (S)\). As in the case of the standard self-normalized importance sampling (SNIS) estimator \(\pi ^N(\varphi ):=\gamma ^N(\varphi )/\gamma ^N(S)\), the ratio in \(\pi ^N_\times (\varphi )\)’s definition introduces an \(\mathcal {O}(N^{-1})\) bias and prevents us from obtaining an analytical expression for the finite sample variance (that the bias is \(\mathcal {O}(N^{-1})\) follows from an argument similar to that given for standard SNIS on p. 35 of Liu (2001) and requires assumptions on the higher moments of \(\varphi (X^1)\)). Otherwise, \(\pi ^N_\times (\varphi )\)’s theoretical properties are analogous to those of the product-form estimator \(\mu ^N_\times (\varphi )\) and its importance sampling extension \(\gamma ^N_\times (\varphi )\):
Corollary 3
If \(w\varphi \) lies in \(L^2_\mu \), then \(\pi ^N_\times (\varphi )\) is strongly consistent, asymptotically normal, and its asymptotic variance is bounded above by that of \(\pi ^N(\varphi )\):
$$\begin{aligned} \sigma ^2_{\pi ,\times }(\varphi )&=\sigma ^2_{\times }(\gamma (S)^{-1}w[\varphi -\pi (\varphi )])\\&\le \sigma ^2(\gamma (S)^{-1}w[\varphi -\pi (\varphi )])=\sigma ^2_{\pi }(\varphi ), \end{aligned}$$
where \(\sigma ^2_{\times }(\gamma (S)^{-1}w[\varphi -\pi (\varphi )])\) is as in Theorem 1.
Proof
Given Theorem 1 and Corollary 1, the arguments here closely follow those for standard SNIS. In particular, writing \(w^\pi :=\gamma (S)^{-1}w\), because \(\pi ^N_\times (\varphi )=\gamma ^N_\times (\varphi )/\gamma ^N_\times (S)=\mu ^N_\times (w\varphi )/\mu ^N_\times (w)\) and \(\mu (w)=\gamma (S)\),
$$\begin{aligned} \pi ^N_\times (\varphi )-\pi (\varphi )&=\frac{\mu ^N_\times (w\varphi )}{\mu ^N_\times (w)} -\pi (\varphi )\\&=\frac{\mu (w)}{\mu ^N_\times (w)}\mu ^N_\times \left( \frac{w[\varphi -\pi (\varphi )]}{\gamma (S)}\right) \\&=\frac{\mu (w)}{\mu ^N_\times (w)}\mu ^N_\times (w^\pi [\varphi -\pi (\varphi )]). \end{aligned}$$
Given that \(\mu ^N_\times (w)\) tends to \(\mu (w)\) almost surely (and, hence, in probability) as N approaches infinity (Theorem 1), the strong consistency and asymptotic normality of \(\pi ^N_\times (\varphi )\) then follow from those of \(\mu ^N_\times (\gamma (S)^{-1}w[\varphi -\pi (\varphi )])\) (Theorem 1) and Slutsky’s theorem. The asymptotic variance bound follows from that in Corollary 1. \(\square \)
This type of approach is best suited for targets \(\pi \) possessing at least some product structure. That structure manifests itself in a partially factorized weight function w and substantially lowers the evaluation costs of \(\gamma ^N_\times (\varphi )\) and \(\pi ^N_\times (\varphi )\) for simple test functions \(\varphi \), as the following example illustrates.
Example 5
(A simple hierarchical model) Consider the following basic hierarchical model:
$$\begin{aligned} Y_{k} \sim \mathcal {N}(X_k,1),\quad X_{k} \sim \mathcal {N}(0,\theta ),\quad \forall k\in [K]. \end{aligned}$$
(21)
It has a single unknown parameter, the variance \(\theta \) of the latent variables \(X_1,\dots ,X_K\), which we infer using a Bayesian approach. That is, we choose a prior \(p(d\theta )\) on \(\theta \) and draw inferences from the corresponding posterior,
$$\begin{aligned}&\pi (d\theta ,dx):=p(d\theta ,dx\vert y)\nonumber \\&\quad \propto p(d\theta )\prod _{k=1}^K\mathcal {N}(y_k;x_k,1)\mathcal {N}(dx_k;0,\theta )=:\gamma (d\theta ,dx), \end{aligned}$$
(22)
where \(y=(y_1,\dots ,y_K)\) denotes the vector of observations. For most priors, no analytic expressions for the normalizing constant can be found and we are forced to proceed numerically. One option is to choose the proposal
$$\begin{aligned} \mu (d\theta ,dx):= p(d\theta )\prod _{k=1}^K\mathcal {N}(dx_k;0,1), \end{aligned}$$
(23)
in which case
$$\begin{aligned} w_{IS}(\theta ,x):=\frac{d\gamma }{d\mu }(\theta ,x)= \prod _{k=1}^K\frac{\mathcal {N}(y_k;x_k,1)\mathcal {N}(x_k;0,\theta )}{\mathcal {N}(x_k;0,1)}. \end{aligned}$$
(Were we using standard IS instead of the product-form variant, the proposal
$$\begin{aligned} \mu (d\theta ,dx):=p(d\theta )\prod _{k=1}^K\mathcal {N}(dx_k;0,\theta ) \end{aligned}$$
(24)
would be the natural choice, a point we return to after the example.) Hence, to estimate the normalizing constant or any integral with respect to a univariate marginal of the posterior, we need to draw samples from \(\mu \) and evaluate the product-form estimator \(\mu ^N_\times (\varphi )\) for a test function of the form \(\varphi (\theta ,x)=f(\theta )\prod _{k=1}^Kg_k(\theta ,x_k)\), the cost of which totals \(\mathcal {O}(KN^2)\) operations because
$$\begin{aligned} \mu ^N_\times (\varphi )=\frac{1}{N^{K+1}}\sum _{m=1}^Nf(\theta ^m) \prod _{k=1}^K\sum _{n_k=1}^Ng_k(\theta ^m,x_k^{n_k}). \end{aligned}$$
We return to this in Sect. 3.4, where we will make use of the following expression for the (unnormalized) posterior’s \(\theta \)-marginal, available thanks to the Gaussianity in (21):
$$\begin{aligned} \gamma (d\theta )=p(d\theta )\prod _{k=1}^K\mathcal {N}(y_k;0,\theta +1). \end{aligned}$$
(25)
Clearly, the above expression opens the door to simpler and more effective methods for computing integrals with respect to this marginal than estimators targeting the full posterior. However, the estimators we discuss can be applied analogously to the many commonplace hierarchical models [e.g., see Gelman and Hill (2006), Gelman (2006), Koller and Friedman (2009), Hoffman et al. (2013), Blei et al. (2003), and the many references therein] for which such expressions are not available.
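For concreteness, here is a possible sketch of the \(\mathcal {O}(KN^2)\) evaluation above for the normalizing-constant estimate \(\gamma ^N_\times (S)=\mu ^N_\times (w_{IS})\), i.e., \(f\equiv 1\) and \(g_k(\theta ,x_k)=\mathcal {N}(y_k;x_k,1)\mathcal {N}(x_k;0,\theta )/\mathcal {N}(x_k;0,1)\). The synthetic data, the Inv-Gamma\((\alpha /2,\alpha \beta /2)\) prior, and the constants follow Sect. 3.4; variable names are ours, and in practice one would prefer to work in log-space.

```python
import numpy as np
from scipy.stats import invgamma, norm

rng = np.random.default_rng(1)
K, N, alpha, beta = 100, 100, 1.0, 1.0              # constants as in Sect. 3.4

# Synthetic observations from (21) with theta = 1: Y_k = X_k + noise, X_k ~ N(0, 1).
y = rng.standard_normal(K) + rng.standard_normal(K)

# Proposal (23): theta ~ p = Inv-Gamma(alpha/2, alpha*beta/2) and x_k ~ N(0, 1) independently.
theta = invgamma.rvs(alpha / 2, scale=alpha * beta / 2, size=N, random_state=rng)
x = rng.standard_normal((N, K))                     # x[n, k] = X_k^n

def g(th):
    # g_k(theta, x_k) = N(y_k; x_k, 1) N(x_k; 0, theta) / N(x_k; 0, 1), for all n and k at once.
    return norm.pdf(y, loc=x) * norm.pdf(x, scale=np.sqrt(th)) / norm.pdf(x)

# mu^N_x(w_IS) via the product-of-sums identity above (f = 1): O(K N^2) operations in total.
Z_hat = np.mean([np.prod(np.mean(g(th), axis=0)) for th in theta])
```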
When applying IS, or extensions thereof like SMC, one should choose the proposal to be as close as possible to the target [e.g., see Agapiou et al. (2017)]. In this regard, the product-form IS approach is not entirely satisfactory for the above example: by definition, the proposal must be fully factorized while the target, \(\pi \) in (22), is only partially so (the latent variables are independent only when conditioned on the parameter variable). As we show in the next section, it is straightforward to adapt this product-form IS approach to match such partially factorized targets.
Partially factorized targets and proposals
Consider a target or proposal \(\mu \) over a product space \((\Theta \times S,\mathcal {T}\times \mathcal {S})\) with the same partial product structure as the target in Example 5:
$$\begin{aligned} \mu (d\theta ,dx)&=(\mu _0\otimes \mathcal {M})(d\theta ,dx)\nonumber \\&:=\mu _0(d\theta )\prod _{k=1}^K\mathcal {M}_k(\theta ,dx_k), \end{aligned}$$
(26)
where, for each k in [K], \(\theta \mapsto \mathcal {M}_k(\theta ,dx_k)\) denotes a Markov kernel mapping from \((\Theta ,\mathcal {T})\) to \((S_k,\mathcal {S}_k)\). Suppose that we are given M i.i.d. samples \(\theta ^1,\dots ,\theta ^M\) drawn from \(\mu _0\) and, for each of these, N (conditionally) i.i.d. samples \(X^{m,1},\dots ,X^{m,N}\) drawn from the product kernel \(\mathcal {M}(\theta ,dx):=\prod _{k=1}^K\mathcal {M}_k(\theta ,dx_k)\) evaluated at \(\theta ^m\). Given a test function \(\varphi \) on \(\Theta \times S\), consider the following ‘partially product-form’ estimator for \(\mu (\varphi )\):
$$\begin{aligned} \mu _{\times }^{M,N}(\varphi )&:=\frac{1}{M}\sum _{m=1}^M\left( \frac{1}{N^K}\sum _{\varvec{n}\in [N]^K}\varphi (\theta ^m,X^{m,\varvec{n}})\right) \nonumber \\&=\frac{1}{MN^K}\sum _{m=1}^M\sum _{\varvec{n}\in [N]^K}\varphi (\theta ^m,X^{m,\varvec{n}}) \end{aligned}$$
(27)
for all \(M,N>0\). This estimator is well founded (for simplicity, we only consider its asymptotics as \(M\rightarrow \infty \) with N fixed, but other limits can be studied by combining the approaches in Appendices A and C).
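As a concrete, if brute-force, illustration of (27), the sketch below enumerates all \(N^K\) recombinations directly. The toy distribution, test function, and constants are our own choices, and \(K\) is kept small because this generic evaluation costs \(\mathcal {O}(MN^K)\) operations; Example 6 shows how the cost collapses when \(\varphi \) and the weights factorize.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
M, N, K = 100, 20, 2          # small K so that the N^K enumeration stays cheap (illustrative)

# A toy partially factorized mu: theta ~ Exp(1) and, given theta, x_k ~ N(0, theta)
# independently for k = 1, ..., K (our choice, not a model from the paper).
theta = rng.exponential(size=M)
x = rng.standard_normal((M, N, K)) * np.sqrt(theta)[:, None, None]   # x[m, n, k]

def phi(th, x_tuple):          # an arbitrary, non-factorized test function
    return np.cos(th + np.sum(x_tuple))

# Partially product-form estimator (27): for each theta^m, average phi over all
# N^K ways of combining the conditionally independent coordinate samples.
inner = [
    np.mean([phi(theta[m], [x[m, n[k], k] for k in range(K)])
             for n in itertools.product(range(N), repeat=K)])
    for m in range(M)
]
mu_hat = np.mean(inner)        # estimate of mu(phi)
```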
Theorem 3
If \(\varphi \) is \(\mu \)-integrable with \(\mu \) as in (26), then \(\mu ^{M,N}_\times (\varphi )\) in (27) is unbiased and strongly consistent: for all \(N>0\),
$$\begin{aligned}&\mathbb {E}\left[ \mu ^{M,N}_\times (\varphi )\right] =\mu (\varphi )\quad \forall M>0,\\&\lim _{M\rightarrow \infty }\mu ^{M,N}_\times (\varphi )= \mu (\varphi )\quad \text {almost surely.} \end{aligned}$$
If, furthermore, \(\varphi \) belongs to \(L^2_\mu \), then \(\mathcal {M}_{[K]\backslash A}(\varphi )\) belongs to \(L^2_{\mu _0\otimes \mathcal {M}_A}\) for all subsets A of [K], where \(\mathcal {M}_A(\theta ,dx_A):=\prod _{k\in A}\mathcal {M}_k(\theta ,dx_k)\), and the estimator is asymptotically normal: for all \(N>0\), and as \(M\rightarrow \infty \),
$$\begin{aligned} M^{1/2}[\mu ^{M,N}_\times (\varphi )-\mu (\varphi )]\Rightarrow \mathcal {N}(0,\sigma ^2_{\times ,N}(\varphi )), \end{aligned}$$
(28)
where \(\Rightarrow \) denotes convergence in distribution and
$$\begin{aligned}&\sigma _{\times ,N}^2(\varphi ):=\mu _0([\mathcal {M}\varphi -\mu (\varphi )]^2)\\&\quad +\,\sum _{\emptyset \ne A\subseteq [K]}\sum _{B\subseteq A}\frac{(-1)^{\left| A \right| -\left| B \right| }\mu _0(\mathcal {M}_B[\mathcal {M}_{[K]\backslash B}\varphi -\mathcal {M}\varphi ]^2)}{N^{\left| A \right| }}. \end{aligned}$$
For any \(N,M>0\), the estimator’s variance is given by \(\mathrm{Var}\,(\mu ^{M,N}_\times (\varphi ))=\sigma ^2_{\times ,N}(\varphi )/M\).
Proof
See Appendix C. \(\square \)
The partially product-form estimator (27) is more statistically efficient than its standard counterpart.
Corollary 4
For any \(\varphi \) belonging to \(L^2_\mu \) and \(N>0\),
$$\begin{aligned} \mathrm{Var}\,(\mu _\times ^{M,N}(\varphi ))&\le \mathrm{Var}\,(\mu ^{M,N}(\varphi ))\quad \forall M>0,\\ \sigma _{\times ,N}^2(\varphi )&\le \sigma _{N}^2(\varphi ), \end{aligned}$$
where \(\mu ^{M,N}(\varphi ):=\frac{1}{MN}\sum _{m=1}^M\sum _{n=1}^N\varphi (\theta ^m,X^{m,n})\) and \(\sigma _{N}^2(\varphi )\) denotes its asymptotic (in M) variance.
Proof
See Appendix C. \(\square \)
In fact, modulo a small caveat (cf. Remark 1 below), \(\mu ^{M,N}_\times (\varphi )\) yields the best unbiased estimates of \(\mu (\varphi )\) achievable using only the knowledge that \(\mu \) is partially factorized and M i.i.d. samples drawn from \(\mu _0\otimes \mathcal {M}^N\): a perhaps unsurprising fact given that it is the composition of two minimum variance unbiased estimators (Theorem 2).
Theorem 4
Suppose that \(\mathcal {T}\) contains all singleton sets (i.e., \(\{\theta \}\) for all \(\theta \) in \(\Theta \)). For any given measurable real-valued function \(\varphi \) on \(\Theta \times S\), \(\mu _\times ^{M,N}(\varphi )\) is a minimum variance unbiased estimator for \(\mu (\varphi )\): if f is a measurable real-valued function on \((\Theta \times S^N)^M\) such that
$$\begin{aligned} \mathbb {E}\left[ f((\theta ^{m},X^{m,1},\dots ,X^{m,N})_{m=1}^M)\right] =\mu (\varphi ) \end{aligned}$$
whenever \((\theta ^{m},X^{m,1},\dots ,X^{m,N})_{m=1}^M\) is an i.i.d. sequence drawn from \(\mu _0\otimes \mathcal {M}^N\), for all partially factorized \(\mu =\mu _0\otimes \mathcal {M}\) on \(\Theta \times S\) satisfying \(\mu (\left| \varphi \right| )<\infty \) and
$$\begin{aligned} \mu _0(\{\theta \})=0\quad \forall \theta \in \Theta , \end{aligned}$$
(29)
then
$$\begin{aligned} \mathrm{Var}\,(f((\theta ^{m},X^{m,1},\dots ,X^{m,N})_{m=1}^M)) \ge \mathrm{Var}\,(\mu ^{M,N}_\times (\varphi )). \end{aligned}$$
Proof
See Appendix D. \(\square \)
Remark 1
(The importance of (29)) Consider the extreme scenario that \(\mu _0\) is a Dirac delta at some \(\theta ^*\), so that \(\theta ^1=\dots =\theta ^M=\theta ^*\) with probability one and
$$\begin{aligned} \mu _{\times }^{M,N}(\varphi )=\frac{1}{M}\sum _{m=1}^M \frac{1}{N^K}\sum _{\varvec{n}\in [N]^K}\varphi (\theta ^*,X^{m,\varvec{n}})\quad \text {a.s.} \end{aligned}$$
In this case, we are clearly better off (at least in terms of estimator variance) stacking all of our X samples into one big ensemble and replacing the partially product-form estimator with the (fully) product-form estimator,
$$\begin{aligned} \mu _{\times }^{MN}(\varphi )=\frac{1}{(MN)^K}\sum _{\varvec{l} \in [MN]^K}\varphi (\theta ^*,\tilde{X}^{\varvec{l}}), \end{aligned}$$
where \((\tilde{X}^{l})_{l\in [MN]}\) denotes \((X^{m,n})_{m\in [M],n\in [N]}\) in vectorized form (indeed Theorem 2 implies that \(\mu _{\times }^{MN}(\varphi )\) is a minimum variance unbiased estimator in this situation). More generally, note that, because
$$\begin{aligned} \mu ^2_0(\{\theta ^1=\theta ^2\})&=\int 1_{\{\theta ^1=\theta ^2\}}\mu ^2_0(d\theta ^1,d\theta ^2)\\&=\int \left( \int 1_{\{\theta ^1=\theta ^2\}}\mu _0(d\theta ^1)\right) \mu _0(d\theta ^2)\\&=\int \mu _0(\{\theta \})\mu _0(d\theta ), \end{aligned}$$
\(\mu _0\) not possessing atoms, i.e., (29), is equivalent to \(\mu ^2_0(\{\theta ^1=\theta ^2\})=0\). It is then straightforward to argue that (29) is equivalent to the impossibility of several \(\theta ^m\) coinciding or, in other words, to
$$\begin{aligned} \mu _0^M(\{\theta ^i\ne \theta ^j\forall i\ne j\})=1. \end{aligned}$$
(30)
Were this not to be the case, the estimator in (27) would not possess the MVUE property. To recover it, we would need to amend the estimator as follows: ‘if several \(\theta ^m\)s take the same value, first stack their corresponding \(X^{m,1},\dots ,X^{m,N}\) samples, and then apply a product-form estimator to the stacked samples.’ However, to not overly complicate this section’s exposition and Theorem 4’s proof, we restrict ourselves to distributions satisfying (29).
We are now in a position to revisit Example 5 and better adapt the proposal to the target. This leads to a special case of an algorithm known as ‘importance sampling squared’ or ‘IS\(^2\)’, cf. Tran et al. (2013).
Example 6
(A simple hierarchical model, revisited) Consider again the model in Example 5. Recall that our previous choice of proposal did not quite capture the conditional independence structure in the target \(\pi \): the former was fully factorized while the latter is only partially so. It seems more natural to instead use the proposal in (24) which is also easy to sample from but both mirrors \(\pi \)’s independence structure and leads to further cancellations in the weight function (in particular, it no longer depends on \(\theta \)):
$$\begin{aligned} w_{IS^2}(x):=\prod _{k=1}^K\mathcal {N}(y_k;x_k,1)=\frac{d\gamma }{d\mu }(\theta ,x). \end{aligned}$$
It follows that, to estimate the normalizing constant or any integral with respect to a univariate marginal of the posterior, we need to draw samples from \(\mu _0\otimes \mathcal {M}^N\) and evaluate the partially product-form estimator \(\mu ^{M,N}_\times (\varphi )\) for a test function of the form \(\varphi (\theta ,x)=f(\theta )\prod _{k=1}^Kg_k(x_k)\). Because
$$\begin{aligned} \mu ^{M,N}_\times (\varphi )=\frac{1}{MN^K}\sum _{m=1}^Mf(\theta ^m)\prod _{k=1}^K\sum _{n_k=1}^Ng_k(X_k^{m,n_k}), \end{aligned}$$
the total cost then reduces to \(\mathcal {O}(KMN)\). We also return to this in Sect. 3.4.
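A possible vectorized sketch of this \(\mathcal {O}(KMN)\) computation for the normalizing-constant estimate (\(f\equiv 1\), \(g_k(x_k)=\mathcal {N}(y_k;x_k,1)\)) follows, again using synthetic data and the prior of Sect. 3.4; as before, a practical implementation would work in log-space.

```python
import numpy as np
from scipy.stats import invgamma, norm

rng = np.random.default_rng(2)
K, M, N, alpha, beta = 100, 100, 100, 1.0, 1.0       # constants as in Sect. 3.4
y = rng.standard_normal(K) + rng.standard_normal(K)  # synthetic observations, theta = 1

# Proposal (24): theta^m ~ p, then X_k^{m,n} ~ N(0, theta^m) conditionally i.i.d.
theta = invgamma.rvs(alpha / 2, scale=alpha * beta / 2, size=M, random_state=rng)
x = rng.standard_normal((M, N, K)) * np.sqrt(theta)[:, None, None]   # x[m, n, k]

# Since w_{IS^2} factorizes, (27) with f = 1 and g_k(x_k) = N(y_k; x_k, 1) reduces to a
# mean (over m) of products (over k) of means (over n): O(KMN) operations in total.
g = norm.pdf(y, loc=x)                               # shape (M, N, K)
Z_hat = np.mean(np.prod(np.mean(g, axis=1), axis=-1))
```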
Grouped independence Metropolis–Hastings
As a further example of how one may embed product-form estimators within more sophisticated Monte Carlo methodology and exploit the independence structure present in the problem, we revisit Beaumont’s Grouped Independence Metropolis–Hastings [GIMH (Beaumont 2003)], a simple and well known pseudo-marginal MCMC sampler (Andrieu and Roberts 2009). Like many of these samplers, it is intended to tackle targets whose densities cannot be evaluated pointwise but are marginals of higher-dimensional distributions whose densities can be evaluated pointwise. Our inability to evaluate the target’s density precludes us from directly applying the Metropolis–Hastings algorithm (MH, e.g., see Chapter XIII in Asmussen and Glynn (2007)) as we cannot compute the necessary acceptance probabilities. For instance, in the case of a target \(\pi (d\theta )\) on a space \((\Theta ,\mathcal {T})\) and an MH proposal \(Q(\theta ,d\tilde{\theta })\) with respective densities \(\pi (\theta )\) and \(Q(\theta ,\tilde{\theta })\), we would need to evaluate
$$\begin{aligned} 1\wedge \frac{\pi (\tilde{\theta })Q(\theta ,\tilde{\theta })}{\pi (\theta ) Q(\tilde{\theta },\theta )} \end{aligned}$$
where \(\theta \) denotes the chain’s current state and \(\tilde{\theta }\sim Q(\theta ,\cdot )\) the proposed move. GIMH instead replaces the intractable \(\pi (\theta )\) and \(\pi (\tilde{\theta })\) in the above with importance sampling estimates thereof: if \(\pi (\theta ,x)\) denotes the density of the higher-dimensional distribution \(\pi (d\theta ,dx)\) whose \(\theta \)-marginal is \(\pi (d\theta )\), and \(w(\theta ,x):=\pi (\theta ,x)/\mathcal {M}(\theta ,x)\) for a given Markov kernel \(\mathcal {M}(\theta ,dx)\) with density \(\mathcal {M}(\theta ,x)\),
$$\begin{aligned} \pi ^N(\theta )&=\frac{1}{N}\sum _{n=1}^Nw(\theta ,X^n),\nonumber \\ \pi ^N(\tilde{\theta })&=\frac{1}{N}\sum _{n=1}^Nw(\tilde{\theta },\tilde{X}^n), \end{aligned}$$
(31)
where \(X^{1},\dots ,X^N\) and \(\tilde{X}^1,\dots ,\tilde{X}^N\) are i.i.d. samples drawn from \(\mathcal {M}(\theta ,\cdot )\) and \(\mathcal {M}(\tilde{\theta },\cdot )\), respectively. Key to Beaumont’s approach is that the samples are recycled from one iteration to another: if \(Z^1,\dots ,Z^N\) and \(\tilde{Z}^1,\dots ,\tilde{Z}^N\) denote the i.i.d. samples used in the previous iteration, then \((X^1,\dots ,X^N):=(Z^1,\dots ,Z^N)\) if the previous move was rejected and \((X^1,\dots ,X^N):=(\tilde{Z}^1,\dots ,\tilde{Z}^N)\) if it was accepted.
As explained in Andrieu and Roberts (2009) [see also Andrieu and Vihola (2015)], the algorithm’s correctness does not require the density estimates to be generated by (31), only that they be unbiased. In particular, if the estimates are unbiased, GIMH may be interpreted as an MH algorithm on an expanded state space with an extension of \(\pi (d\theta )\) as its invariant distribution. Consequently, provided that the density estimator is suitably well behaved, GIMH returns consistent and asymptotically normal estimates of the target under conditions comparable to those for standard MH algorithms [e.g., the GIMH chain is uniformly ergodic whenever the associated ‘marginal’ chain is and the estimator is uniformly bounded (Andrieu and Roberts 2009); see Andrieu and Vihola (2015) for further refinements]. Hence, if the kernel is product-form (i.e., \(\mathcal {M}(\theta ,dx)\) is product-form for each \(\theta \)), we may replace the estimators in (31) with their product-form counterparts:
$$\begin{aligned}&\pi ^N_\times (\theta )=\frac{1}{N^K}\sum _{\varvec{n}\in [N]^K}w(\theta ,X^{\varvec{n}})\nonumber ,\\&\pi ^N_\times (\tilde{\theta })=\frac{1}{N^K} \sum _{\varvec{n}\in [N]^K}w(\tilde{\theta },\tilde{X}^{\varvec{n}}), \end{aligned}$$
(32)
where K denotes the dimensionality of the x-variables (the unbiasedness follows from \(X^{\varvec{n}}\) and \(\tilde{X}^{\varvec{n}}\) having respective laws \(\mathcal {M}(\theta ,dx)\) and \(\mathcal {M}(\tilde{\theta },d\tilde{x})\) for any \(\varvec{n}\) in \([N]^K\)). Thanks to the results in Andrieu and Vihola (2016), it is straightforward to show that this choice leads to lower estimator variances, at least asymptotically.
Corollary 5
Let \((\theta ^{m,N})_{m=1}^\infty \) and \((\theta _\times ^{m,N})_{m=1}^\infty \) denote the GIMH chains generated using (31) and (32), respectively, and the same proposal \(Q(\theta ,d\tilde{\theta })\). If \(\varphi \) belongs to \(L^2_\pi \), then
$$\begin{aligned}&\lim _{M\rightarrow \infty }\text {Var}\left( \frac{1}{\sqrt{M}} \sum _{m=1}^M\varphi (\theta ^{m,N}_\times )\right) \\&\quad \le \lim _{M\rightarrow \infty }\text {Var}\left( \frac{1}{\sqrt{M}} \sum _{m=1}^M\varphi (\theta ^{m,N})\right) \quad \forall N>0. \end{aligned}$$
Proof
See Appendix E. \(\square \)
Given the argument used in the proof, the results of Andrieu and Vihola (2016), Theorem 10 in particular, imply much more than the variance bound in the corollary’s statement. For instance, if the target is not concentrated on points, then the spectral gap of \((\theta ^{m,N}_\times )_{m=1}^\infty \) is bounded below by that of \((\theta ^{m,N})_{m=1}^\infty \). We finish the section by returning to our running example.
Example 7
(A simple hierarchical model, re-revisited) Here, we follow Sect. 5.1 in Schmon et al. (2020). Consider once again the model in Example 5 and suppose we are interested only in the posterior’s \(\theta \)-marginal \(\pi (d\theta )\). Choosing
$$\begin{aligned} \mathcal {M}(\theta ,dx):=\prod _{k=1}^K\mathcal {N}(dx_k;0,\theta ), \end{aligned}$$
the weight function factorizes,
$$\begin{aligned} w_{GIMH}(\theta ,x)=\frac{\pi (\theta ,x)}{\mathcal {M}(\theta ,x)} =p(\theta )\prod _{k=1}^K\mathcal {N}(y_k;x_k,1); \end{aligned}$$
resulting in an evaluation cost of \(\mathcal {O}(KN)\) for both (31) and (32) and, regardless of which density estimates we use, a total cost of \(\mathcal {O}(KMN)\), where M denotes the number of steps we run the chain for. We return to this in the following section.
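The following sketch puts the pieces together for this example: a pseudo-marginal chain on \(\theta \) using the product-form density estimate (32), which, thanks to the factorization of \(w_{GIMH}\), costs \(\mathcal {O}(KN)\) per evaluation. The random-walk standard deviation, chain length, and synthetic data are illustrative choices of ours (Sect. 3.4 tunes the proposal and sets \(M=N\)); note that, as in GIMH, the estimate for the current state is recycled until a move is accepted.

```python
import numpy as np
from scipy.stats import invgamma, norm

rng = np.random.default_rng(3)
K, N, n_steps, alpha, beta = 100, 100, 1_000, 1.0, 1.0   # illustrative sizes
y = rng.standard_normal(K) + rng.standard_normal(K)      # synthetic observations, theta = 1

def pf_log_density(theta):
    """Log of the product-form estimate (32) of the unnormalized theta-marginal:
    log p(theta) + sum_k log[(1/N) sum_n N(y_k; X_k^n, 1)], with fresh samples
    X^n ~ M(theta, .) = prod_k N(0, theta). Costs O(KN) operations."""
    x = rng.standard_normal((N, K)) * np.sqrt(theta)
    return (invgamma.logpdf(theta, alpha / 2, scale=alpha * beta / 2)
            + np.log(norm.pdf(y, loc=x).mean(axis=0)).sum())

# Pseudo-marginal (GIMH-style) random-walk chain on theta; the density estimate of the
# current state is carried over from one iteration to the next, as Beaumont's scheme requires.
theta, log_dens = 1.0, pf_log_density(1.0)
chain = np.empty(n_steps)
for m in range(n_steps):
    theta_new = theta + 0.5 * rng.standard_normal()       # untuned proposal std of 0.5
    if theta_new > 0:                                     # the prior puts no mass on theta <= 0
        log_dens_new = pf_log_density(theta_new)
        if np.log(rng.uniform()) < log_dens_new - log_dens:
            theta, log_dens = theta_new, log_dens_new     # accept and keep the new estimate
    chain[m] = theta
```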
Numerical comparison
Here, we apply the estimators discussed in Sects. 3.1–3.3 to the simple hierarchical model introduced in Example 5 and examine their performance. To benchmark the latter, we choose the prior to be conditionally conjugate to the model’s likelihood: \(p(d\theta )\) is the Inv-Gamma\((\alpha /2,\alpha \beta /2)\) distribution, in which case
$$\begin{aligned}&X_k\vert y_k,\theta \sim \mathcal {N}\left( \frac{y_k}{\theta ^{-1}+1}, \frac{1}{\theta ^{-1}+1}\right) \forall {k\in [K]},\\&\theta \vert y,X\sim \text {Inv-Gamma}\left( \frac{\alpha +K}{2}, \frac{\alpha \beta +\sum _{k=1}^KX_k^2}{2}\right) ; \end{aligned}$$
and we can alternatively approximate the posterior, \(\pi (d\theta ,dx)\) in (22), using a Gibbs sampler. Note that the above expressions are unnecessary for the evaluation of the estimators in Sects. 3.1–3.3. To compare with standard methodology that also does not require such expressions, we also approximate the posterior using Random Walk Metropolis (RWM) with the proposal variance tuned so that the mean acceptance probability (approximately) equals 25%. To keep the comparison honest, we run these two chains for \(N^2\) steps and set \(M=N\) for the estimators in Sects. 3.2 and 3.3, in which case all estimators incur a similar \(\mathcal {O}(KN^2)\) cost. We further fix \(K:=100\), \(\alpha :=1\), \(\beta :=1\), and \(N:=100\) and generate artificial observations \(y_1,\dots ,y_{100}\) by running (21) with \(\theta :=1\).
Table 1 Average-across-repeats W\(_1\) error and KS statistic for the approximations of \(\pi (d\theta )\), and average absolute errors for the corresponding mean and standard deviation estimates, obtained using each of the eight methods

Figure 4 shows approximations to the posterior’s \(\theta \)-marginal \(\pi (d\theta )\) obtained using a Gibbs sampler, RWM, IS (Sect. 3.1), IS\(^2\) (Sect. 3.2), GIMH (Sect. 3.3), and the product-form variants of the last three (PFIS, PFIS\(^2\), and PFGIMH, respectively). In the cases of Gibbs, RWM, GIMH, and PFGIMH, we used a 20% burn-in period and approximated the marginal with the empirical distribution of the \(\theta \)-components of the states visited by the chain. For GIMH and PFGIMH, we also used a random walk proposal with its variance tuned so that the mean acceptance probability hovered around 25%. For IS, PFIS, IS\(^2\), and PFIS\(^2\), we used the proposals specified in Examples 5 and 6 and computed the approximations using
$$\begin{aligned}&\pi ^{N^2}_{IS}(d\theta ) :=\frac{\sum _{n=1}^{N^2}w_{IS}(\theta ^n,X^n) \delta _{\theta ^n}}{\sum _{n=1}^{N^2}w_{IS}(\theta ^n,X^n)},\\&\pi ^{N}_{PFIS}(d\theta ) :=\frac{\sum _{n=1}^N\left( \sum _{\varvec{n}\in [N]^K}w_{IS} (\theta ^n,X^{\varvec{n}})\right) \delta _{\theta ^n}}{\sum _{n=1}^N \sum _{\varvec{n}\in [N]^K}w_{IS}(\theta ^n,X^{\varvec{n}})},\\&\pi ^{N,N}_{IS^2}(d\theta ) :=\frac{\sum _{m=1}^N \left( \sum _{n=1}^Nw_{IS^2}(X^{m,n})\right) \delta _{\theta ^m}}{\sum _{m=1}^N \sum _{n=1}^Nw_{IS^2}(X^{m,n})},\\&\pi ^{N,N}_{PFIS^2}(d\theta ) :=\frac{\sum _{m=1}^{N}\left( \sum _{\varvec{n}\in [N]^K} w_{IS^2}(X^{m,\varvec{n}})\right) \delta _{\theta ^m}}{\sum _{m=1}^{N} \sum _{\varvec{n}\in [N]^K}w_{IS^2}(X^{m,\varvec{n}})}. \end{aligned}$$
(Note that for IS, we are using \(N^2\) samples instead of N so that its cost is also \(\mathcal {O}(KN^2)\).)
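As an illustration of how the PFIS\(^2\) approximation above can be computed without ever enumerating \([N]^K\), the sketch below evaluates the weights of \(\pi ^{N,N}_{PFIS^2}(d\theta )\) in log-space: since \(w_{IS^2}\) factorizes, each atom’s weight is a product of K one-dimensional sums, giving an overall \(\mathcal {O}(KN^2)\) cost with \(M=N\). The setup repeats the IS\(^2\) sketch of Sect. 3.2 and is again only an illustrative implementation of ours.

```python
import numpy as np
from scipy.stats import invgamma, norm

rng = np.random.default_rng(4)
K, N, alpha, beta = 100, 100, 1.0, 1.0                   # M = N, as in Sect. 3.4
y = rng.standard_normal(K) + rng.standard_normal(K)      # synthetic observations, theta = 1

# Proposal (24): theta^m ~ p and X_k^{m,n} ~ N(0, theta^m), as in the IS^2 sketch above.
theta = invgamma.rvs(alpha / 2, scale=alpha * beta / 2, size=N, random_state=rng)
x = rng.standard_normal((N, N, K)) * np.sqrt(theta)[:, None, None]

# Weight of the atom delta_{theta^m}: proportional to prod_k sum_n N(y_k; X_k^{m,n}, 1),
# computed in log-space to avoid underflow, then self-normalized.
log_w = np.log(norm.pdf(y, loc=x).sum(axis=1)).sum(axis=-1)   # shape (N,)
W = np.exp(log_w - log_w.max())
W /= W.sum()

post_mean = np.sum(W * theta)                            # e.g. posterior mean of theta
post_sd = np.sqrt(np.sum(W * (theta - post_mean) ** 2))  # and posterior standard deviation
```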
Our first observation is that the approximations produced by IS, IS\(^2\), and GIMH are very poor. The first two exhibit severe weight degeneracy (in either case, a single particle carried over 50% of the probability mass and three carried over 90%), which is unsurprising given the target’s moderately high dimension of 101. The third possesses a pronounced spurious peak close to zero (with over 70% of the mass) caused by large numbers of rejections in that vicinity. Replacing the i.i.d. estimators embedded within these algorithms with their product-form counterparts removes both the weight degeneracy and the spurious peak; PFIS, PFIS\(^2\), and PFGIMH return much improved approximations. The best approximation is the one returned by the Gibbs sampler: an expected outcome given that the sampler’s use of the conditional distributions makes it the estimator most ‘tailored’ or ‘well adapted’ to the target. However, these distributions are not available for most models (precluding the application of such samplers) and even just taking the (usually obvious) independence structure into account can make a substantial difference: the quality of the approximations returned by PFIS and PFIS\(^2\) exceeds that of the approximation returned by the common, even default, choice of RWM. Note that this is the case even though the proposal variance in RWM was tuned, while that in the other two was simply set to 1 (a reasonable choice given that \(\theta =1\) was used to generate the data, but likely not the optimal one). In fact, for this simple model, it is easy to sensibly incorporate the observations into the PFIS and PFIS\(^2\) proposals [e.g., use \(p(d\theta )\prod _{k=1}^K\mathcal {N}(dx_k;y_k,1)\) for PFIS and \(p(d\theta )\prod _{k=1}^K\mathcal {N}(dx_k; y_k\theta [1+\theta ]^{-1},\theta [1+\theta ]^{-1})\) for PFIS\(^2\)] and potentially improve their performance.
Table 2 Total absolute error for the mean and standard deviation estimates of \(\pi (dx)\)’s univariate marginals

To benchmark the approaches more thoroughly, we generated \(R:=100\) replicates of the eight full posterior approximations and computed various error metrics (Tables 1 and 2). For the \(\theta \)-component, we used the high-quality reference approximation \(\pi _{REF}\) described in Fig. 4’s caption to obtain the average (across repeats) W\(_1\) distance and KS statistic (as described in the caption), and the average absolute error of the posterior mean and standard deviation estimates normalized by the true mean or standard deviation (i.e., \(M_\theta ^{-1}R^{-1}\sum _{r=1}^{R}\left| M^r_\theta -M_\theta \right| \) for the posterior mean estimates, where \(M_\theta \) denotes the true mean and \(M^r_\theta \) the \(r^{th}\) estimate thereof, and similarly for the standard deviation estimates). For the x-components, we instead used high-accuracy estimates of the component-wise means and standard deviations (obtained by running a Gibbs sampler for \(N^4=10^8\) steps) to compute the corresponding total absolute errors across replicates and components (\(\sum _{k=1}^K\sum _{r=1}^{R}\left| M^{r}_{k}-M_{k} \right| \), where \(M_k\) denotes the true mean of the \(k^{th}\) x-component and \(M^r_k\) the \(r^{th}\) estimate thereof, and similarly for the standard deviation estimates).
Once again, the product-form estimators far outperformed their i.i.d. counterparts. Moreover, they performed as well as, or better than, RWM. PFIS\(^2\)’s estimates are particularly accurate: a fact that does not surprise us given that its proposal has the same partially factorized structure as the target, making it, in this sense, the estimator best adapted to the problem; best, that is, except for the Gibbs sampler, which exploits the conditional distributions (encoding more information than this structure alone). We conclude with an interesting detail: PFIS\(^2\) and PFIS perform similarly when approximating the \(\theta \)-marginal (cf. Table 1), but PFIS\(^2\) outperforms PFIS when approximating the latent variable marginals (cf. Table 2). This is perhaps not too surprising because, in the case of the \(\theta \)-marginal approximation, both PFIS\(^2\) and PFIS employ the same number N of \(\theta \)-samples, while, for the \(k^{th}\) latent variable, PFIS\(^2\) uses \(N^2\) \(x_k\)-samples and PFIS uses only N such samples.