As described in Sect. 2.4, the self-density ratios associated with both \(\text {p}_{1}(\phi )\) and \(\text {p}_{2}(\phi )\) may be required by the two-stage MCMC algorithm. To simplify notation, we consider in this section a generic joint density \(\text {p}(\phi , \gamma )\) that we can evaluate pointwise, but whose marginal \(\text {p}(\phi ) = \int \text {p}(\phi , \gamma )\text {d}\gamma \) we cannot obtain analytically. Our interest is in the self-density ratio evaluated at \(\phi _{\text {nu}}\) and \(\phi _{\text {de}}\) (the subscripts are abbreviations of numerator and denominator respectively) which we denote as
$$\begin{aligned} \text {r}(\phi _{\text {nu}}, \phi _{\text {de}}) = \frac{ \text {p}(\phi _{\text {nu}}) }{ \text {p}(\phi _{\text {de}}) } = \frac{ \int \text {p}(\phi _{\text {nu}}, \gamma ) \text {d} \gamma }{ \int \text {p}(\phi _{\text {de}}, \gamma ) \text {d} \gamma }. \end{aligned}$$
In our examples we set \(\phi _{\text {nu}}= \phi \) and \(\phi _{\text {de}}= \phi ^{*}\) for use in Eqs. (4) and (5); and define \(\gamma = (\psi _{m}, Y_{m})\) and \(\text {p}= \text {p}_{m}\) where \(m= 1\) or 2 as appropriate (see Sects. 4 and 5 for details).
To avoid the numerical issues associated with the naive approach, we need to improve the ratio estimate \(\widehat{\text {r}}(\phi _{\text {nu}}, \phi _{\text {de}})\) for improbable values of \(\phi _{\text {nu}}\) and \(\phi _{\text {de}}\), e.g. values more than two standard deviations away from the mean. The fundamental flaw in the naive approach in this context is that it minimises the absolute error in the high density region (HDR) of \(\text {p}(\phi )\), i.e. the region \(R_{\varepsilon }(\text {p}(\phi )) = \{\phi : \text {p}(\phi ) > \varepsilon \}\). But this is not necessarily the sole region of interest, and we are concerned with minimising the relative error. To address this we reweight \(\text {p}(\phi )\) towards a particular region, and thus obtain a more accurate estimate in that region. We then exploit the fact that we only interact with the prior marginal distribution via its self-density ratio to combine estimates from multiple reweighted distributions.
Single weighting function
We can shift \(\text {p}(\phi )\) by multiplying the joint distribution \(\text {p}(\phi , \gamma )\) by a known weighting function \(\text {w}(\phi ;\xi )\), controlled by parameter \(\xi \), then account for this shift in our KDE. This will improve the accuracy of the KDE in the region to which we shift the marginal. We first generate \(N\) samples denoted \(\{(\phi _{n}, \gamma _{n})\}_{n= 1}^{N}\), from a weighted version of the joint distribution
$$\begin{aligned} \{(\phi _{n}, \gamma _{n})\}_{n= 1}^{N} \, \sim \, \frac{1}{Z_{1}} \text {p}(\phi , \gamma ) \text {w}(\phi ; \xi ), \end{aligned}$$
(7)
where \(Z_{1} = \iint \text {p}(\phi , \gamma ) \text {w}(\phi ; \xi ) \text {d}\phi \text {d}\gamma \) is the normalising constant. The samples \(\{\phi _{n}^{(\text {s})}\}_{n= 1}^{N}\), obtained by ignoring the samples of \(\gamma _{n}\), are distributed according to a weighted version \(\text {s}(\phi ; \xi )\) of the marginal distribution \(\text {p}(\phi )\)
$$\begin{aligned} \{\phi _{n}^{(\text {s})}\}_{n= 1}^{N} \, \sim \, \frac{1}{Z_{2}} \text {p}(\phi )\text {w}(\phi ; \xi ) = \text {s}(\phi ; \xi ), \end{aligned}$$
where \(Z_{2} = \int \text {p}(\phi ) \text {w}(\phi ; \xi ) \text {d}\phi \). Typically (7) cannot be sampled by simple Monte Carlo; instead we employ MCMC.
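To see why discarding the \(\gamma _{n}\) yields draws from \(\text {s}(\phi ; \xi )\), one can integrate \(\gamma \) out of the weighted joint density in (7). Because \(\text {w}(\phi ; \xi )\) does not depend on \(\gamma \), it factors out of the integral. A short worked version of this step, using only the quantities defined above, is
$$\begin{aligned} \int \frac{1}{Z_{1}} \text {p}(\phi , \gamma ) \text {w}(\phi ; \xi ) \text {d}\gamma = \frac{1}{Z_{1}} \text {w}(\phi ; \xi ) \int \text {p}(\phi , \gamma ) \text {d}\gamma = \frac{1}{Z_{1}} \text {p}(\phi ) \text {w}(\phi ; \xi ). \end{aligned}$$
Integrating the right-hand side over \(\phi \) must give one, which shows that \(Z_{1} = Z_{2}\).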
Using the samples \(\{\phi _{n}^{(\text {s})}\}_{n= 1}^{N}\) from \(\text {s}(\phi ; \xi )\) we compute a weighted kernel density estimate (Jones 1991), with bandwidth \(h\), kernel \(\text {K}_{h}\), and normalising constant \(Z_{3}\)
$$\begin{aligned} {\hat{\hat{\text {p}}}}(\phi ) = \frac{1}{Z_{3} Nh} \sum _{n= 1}^{N} (\text {w}(\phi _{n}; \xi ))^{-1} \text {K}_{h}(\phi - \phi _{n}^{(\text {s})}), \end{aligned}$$
(8)
and form our weighted-sample self-density ratio estimate
$$\begin{aligned} {\hat{\hat{\text {r}}}}(\phi _{\text {nu}}, \phi _{\text {de}}) = \frac{ {\hat{\hat{\text {p}}}}(\phi _{\text {nu}}) }{ {\hat{\hat{\text {p}}}}(\phi _{\text {de}}) } = \frac{ \sum _{n= 1}^{N} (\text {w}(\phi _{n}; \xi ))^{-1} \text {K}_{h}(\phi _{\text {nu}}- \phi _{n}^{(\text {s})}) }{ \sum _{n= 1}^{N} (\text {w}(\phi _{n}; \xi ))^{-1} \text {K}_{h}(\phi _{\text {de}}- \phi _{n}^{(\text {s})}) } . \end{aligned}$$
The cancellation of the normalisation constant \(Z_{3}\) is crucial, as accurately estimating constants like \(Z_{3}\) is known to be challenging.
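The single-weighting-function estimate can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the target is taken to be a toy \(\text {p}(\phi ) = \text {N}(0, 1)\) so that the exact ratio is known, the weighting function is an unnormalised Gaussian (its constant cancels in the ratio), and the samples from \(\text {s}(\phi ; \xi )\) are drawn directly from its closed form rather than by MCMC. The function names and the bandwidth value are hypothetical.

```python
import math
import random


def gaussian_kernel(u, h):
    """Gaussian kernel K_h evaluated at offset u (bandwidth h)."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def inverse_weight(phi, mu, sigma2):
    """1 / w(phi; xi) for an unnormalised Gaussian weight, xi = (mu, sigma2)."""
    return math.exp(0.5 * (phi - mu) ** 2 / sigma2)


def weighted_ratio(samples, mu, sigma2, h, phi_nu, phi_de):
    """Weighted-sample self-density ratio; the factor 1 / (Z_3 N h) cancels."""
    num = 0.0
    den = 0.0
    for s in samples:
        iw = inverse_weight(s, mu, sigma2)  # undoes the shift induced by w
        num += iw * gaussian_kernel(phi_nu - s, h)
        den += iw * gaussian_kernel(phi_de - s, h)
    return num / den


def demo_single(n=20000, seed=1):
    """Toy check with p(phi) = N(0, 1) and a Gaussian weight, mu = 6, sigma2 = 1."""
    rng = random.Random(seed)
    mu, sigma2 = 6.0, 1.0
    # Assumption: for this toy pair, s(phi; xi) is
    # N(mu / (1 + sigma2), sigma2 / (1 + sigma2)), so we sample it directly
    # instead of running MCMC on the weighted joint density.
    centre = mu / (1.0 + sigma2)
    sd = math.sqrt(sigma2 / (1.0 + sigma2))
    samples = [rng.gauss(centre, sd) for _ in range(n)]
    estimate = weighted_ratio(samples, mu, sigma2, h=0.1, phi_nu=3.5, phi_de=3.0)
    truth = math.exp(-0.5 * (3.5 ** 2 - 3.0 ** 2))  # exact N(0, 1) density ratio
    return estimate, truth


if __name__ == "__main__":
    est, truth = demo_single()
    print(f"estimate = {est:.4f}, exact = {truth:.4f}")
```

Both evaluation points lie three or more standard deviations from the mean of the toy target, yet they sit in the HDR of \(\text {s}(\phi ; \xi )\), which is where the weighted KDE has many samples to work with.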
Choice of weighting function
The choice of \(\text {w}(\phi ;\xi )\) affects both the validity and efficacy of our methodology. The weighted marginal \(\text {s}(\phi ; \xi )\) must satisfy the requirements for a density for our method to be valid. Hence, the specific form of \(\text {w}(\phi ;\xi )\) is subject to some restrictions. Our first requirement is that \(\text {w}(\phi ;\xi ) > 0\) for all \(\phi \) in the support of \(\text {p}(\phi , \gamma )\). We also require that the weighted joint distribution, defined in (7), has finite integral, to ensure that it can be normalised to a probability distribution, and that the marginal \(\text {s}(\phi ; \xi )\) is positive over the support of interest, also with finite integral.
Multiple weighting functions
The methodology of Sect. 3.1 produces a single estimate \({\hat{\hat{\text {r}}}}(\phi _{\text {nu}}, \phi _{\text {de}})\) using \({\hat{\hat{\text {p}}}}(\phi )\) from Eq. (8). It is accurate for values in the HDR of \(\text {s}(\phi ; \xi )\), i.e. \(R_{\varepsilon }(\text {s}(\phi ; \xi ))\), and we can control the location of \(R_{\varepsilon }(\text {s}(\phi ; \xi ))\) through \(\xi \). This is similar to importance sampling, with \(\text {s}(\phi ; \xi )\) acting as the proposal density. Nakayama (2011) notes importance sampling can be used to improve the mean square error (MSE) of a KDE in a specific local region, at the cost of an increase in global MSE. To ameliorate the decrease in global performance, we specify multiple regions in which we want accurate estimates for \({\hat{\hat{\text {p}}}}(\phi )\), and then combine the corresponding estimates of \({\hat{\hat{\text {r}}}}(\phi _{\text {nu}}, \phi _{\text {de}})\) to provide a single estimate that is accurate across all regions.
We elect to use \(W\) different weighting functions \(\text {w}(\phi ;\xi _{w})\), indexed by \(w= 1, \ldots , W\), each with function-specific parameters \(\xi _{w}\). Samples are then drawn from each of the \(W\) weighted distributions \(\text {s}_{w}(\phi ; \xi _{w}) \propto \text {p}(\phi ) \text {w}(\phi ;\xi _{w})\). Denote the samples from the \(w^{th}\) weighted distribution by \(\{\phi _{n}^{(\text {s}_{w})}\}_{n= 1}^{N}\). Each set of samples produces a separate ratio estimate \({\hat{\hat{\text {r}}}}_{w}(\phi _{\text {nu}}, \phi _{\text {de}})\) in the manner described in Sect. 3.1.
Each individual \({\hat{\hat{\text {r}}}}_{w}\) is accurate (in terms of relative accuracy) only in the HDR of \(\text {s}_{w}(\phi ; \xi _{w})\). Thus, when combining multiple ratio estimates, simply taking the mean of all \(w= 1, \ldots , W\) estimates (for a specific value of \(\phi _{\text {nu}}\) and \(\phi _{\text {de}}\)) would not make use of our knowledge about the region in which \({\hat{\hat{\text {r}}}}_{w}\) is accurate. We therefore propose a weighted average of all the individual ratio estimates, where the weights approximately come from \(\text {s}_{w}(\phi _{\text {nu}}; \xi _{w}) \text {s}_{w}(\phi _{\text {de}}; \xi _{w})\) – this quantity is largest when \({\hat{\hat{\text {r}}}}_{w}(\phi _{\text {nu}}, \phi _{\text {de}})\) is most accurate. This ensures the more accurate terms are given more weight in our final estimate. Specifically, we use \(\{\phi _{n}^{(\text {s}_{w})}\}_{n= 1}^{N}\) to compute a standard KDE of \(\text {s}_{w}(\phi ; \xi _{w})\)
$$\begin{aligned} {\hat{\text {s}}}_{w}(\phi ; \xi _{w}) = \frac{1}{Nh} \sum _{n= 1}^{N} \text {K}_{h}(\phi - \phi _{n}^{(\text {s}_{w})}). \end{aligned}$$
Finally, we form the weighted-sample self-density ratio estimate \({\hat{\hat{\text {r}}}}_{\text {WSRE}}(\phi _{\text {nu}}, \phi _{\text {de}})\), which is a weighted mean of the individual ratio estimates
$$\begin{aligned} {\hat{\hat{\text {r}}}}_{\text {WSRE}}(\phi _{\text {nu}}, \phi _{\text {de}}) = \frac{1}{Z_{4}} \sum _{w= 1}^{W} {\hat{\text {s}}}_{w}(\phi _{\text {nu}}, \phi _{\text {de}}; \xi _{w}) {\hat{\hat{\text {r}}}}_{w}(\phi _{\text {nu}}, \phi _{\text {de}}), \end{aligned}$$
where \({{\hat{\text {s}}}}_{w} (\phi _{\text {nu}}, \phi _{\text {de}}; \xi _{w}) = {{\hat{\text {s}}}}_{w} (\phi _{\text {nu}}; \xi _{w})\,{{\hat{\text {s}}}}_{w} (\phi _{\text {de}}; \xi _{w})\) and \(Z_{4} = \sum _{w= 1}^{W} {{\hat{\text {s}}}}_{w}(\phi _{\text {nu}}, \phi _{\text {de}}; \xi _{w})\).
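The combination step can be sketched as follows. As before, this is an illustrative sketch under assumptions rather than the package's code: the target is a toy \(\text {p}(\phi ) = \text {N}(0, 1)\), all weighting functions are unnormalised Gaussians with a common variance, each \(\text {s}_{w}\) is sampled from its closed form instead of by MCMC, and the function names, bandwidth and grid of means are hypothetical. Components whose estimated weight underflows to zero are skipped, which is a practical guard that the text does not discuss.

```python
import math
import random


def gaussian_kernel(u, h):
    """Gaussian kernel K_h evaluated at offset u (bandwidth h)."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def kde(samples, h, phi):
    """Standard (unweighted) KDE of s_w evaluated at phi."""
    return sum(gaussian_kernel(phi - s, h) for s in samples) / len(samples)


def weighted_ratio(samples, mu, sigma2, h, phi_nu, phi_de):
    """Ratio estimate from one weighted distribution (constants cancel)."""
    num = den = 0.0
    for s in samples:
        iw = math.exp(0.5 * (s - mu) ** 2 / sigma2)  # 1 / w(s; xi_w), unnormalised
        num += iw * gaussian_kernel(phi_nu - s, h)
        den += iw * gaussian_kernel(phi_de - s, h)
    return num / den


def wsre_ratio(sample_sets, mus, sigma2, h, phi_nu, phi_de):
    """Weighted mean of the per-component ratios, weights s_hat_w(nu) * s_hat_w(de)."""
    total_weight = 0.0  # plays the role of Z_4
    weighted_sum = 0.0
    for samples, mu in zip(sample_sets, mus):
        s_hat = kde(samples, h, phi_nu) * kde(samples, h, phi_de)
        if s_hat == 0.0:
            continue  # no samples near either point: contributes nothing
        total_weight += s_hat
        weighted_sum += s_hat * weighted_ratio(samples, mu, sigma2, h, phi_nu, phi_de)
    if total_weight == 0.0:
        raise ValueError("no weighted distribution covers the requested points")
    return weighted_sum / total_weight


def demo_multi(n=5000, seed=1):
    """Toy check with p(phi) = N(0, 1) and W = 5 Gaussian weighting functions."""
    rng = random.Random(seed)
    mus, sigma2 = [-6.0, -3.0, 0.0, 3.0, 6.0], 1.0
    sd = math.sqrt(sigma2 / (1.0 + sigma2))
    # Assumption: closed-form sampling of each s_w replaces MCMC in this toy case.
    sample_sets = [[rng.gauss(mu / (1.0 + sigma2), sd) for _ in range(n)] for mu in mus]
    estimate = wsre_ratio(sample_sets, mus, sigma2, h=0.15, phi_nu=3.2, phi_de=2.8)
    truth = math.exp(-0.5 * (3.2 ** 2 - 2.8 ** 2))  # exact N(0, 1) density ratio
    return estimate, truth


if __name__ == "__main__":
    est, truth = demo_multi()
    print(f"estimate = {est:.4f}, exact = {truth:.4f}")
```

In this toy setting nearly all of the weight falls on the component whose weighted distribution is centred near the two evaluation points, so the noisy estimates from distant components have almost no influence on the combined value.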
Choosing values for \(\xi _{w}\)
Consider a D-dimensional \(\phi = (\phi ^{[1]}, \ldots , \phi ^{[D]})\) where \(\phi ^{[d]}\) is the \(d^{th}\) component of \(\phi \), for \(d = 1, \ldots , D\). Assume we have a compact region of interest for the \(d^{th}\) component denoted \(A_{d} = [a_{d}, b_{d}] \subseteq \text {supp}(\phi ^{[d]})\), such that the overall region of interest A can be defined as the Cartesian product of component-wise regions of interest \(A = A_{1} \times \cdots \times A_{D}\). We are interested in accurately evaluating the self-density ratio for two points in this region. We will obtain W choices for \(\xi _{w}\) by specifying V weighting functions for each of the D components, such that \(W = V^{D}\).
Assume that the weighting function \(\text {w}(\phi ; \xi )\) is composed of D independent component weighting functions
$$\begin{aligned} \text {w}(\phi ; \xi ) = \prod _{d = 1}^{D} \text {m}(\phi ^{[d]}; \xi ^{[d]}), \end{aligned}$$
where \(\xi ^{[d]}\) is the \(d^{th}\) component of \(\xi \). We can then define the marginal of the weighted target
$$\begin{aligned} \text {t}(\phi ^{[d]}; \xi ^{[d]}) = \int \text {s}(\phi ; \xi ) \text {d}\phi ^{[-d]}, \end{aligned}$$
where \(\phi ^{[-d]}\) represents the \(D - 1\) components of \(\phi \) that are not \(\phi ^{[d]}\). For typical choices of \(\xi \) and \(\text {w}(\phi ; \xi )\), the corresponding HDR of \(\text {t}(\phi ^{[d]}; \xi ^{[d]})\) does not span the region of interest. That is, \(|R_{\varepsilon }(\text {t}(\phi ^{[d]}; \xi ^{[d]})) |\ll |A_{d} |\).
Our aim is to choose, for each of the D components, V values of \(\xi ^{[d]}\), denoted \(\{\xi _{v, d}\}_{v = 1}^{V}\), yielding weighting functions \(\text {m}(\phi ^{[d]}; \xi _{v, d})\) and corresponding marginals \(\text {t}(\phi ^{[d]}; \xi _{v, d})\), such that \(\bigcup _{v = 1}^{V} R_{\varepsilon }(\text {t}(\phi ^{[d]}; \xi _{v, d})) \approx A_{d} = [a_{d}, b_{d}]\). We employ the following heuristic argument, first choosing a “minimum” \(\xi _{1, d}\) and a “maximum” \(\xi _{V, d}\) such that
$$\begin{aligned} \begin{aligned} \xi _{1, d} :\, a_{d} \in R_{\varepsilon }(\text {t}(\phi ^{[d]}; \xi _{1, d})), \\ \xi _{V, d} :\, b_{d} \in R_{\varepsilon }(\text {t}(\phi ^{[d]}; \xi _{V, d})). \end{aligned} \end{aligned}$$
In words, we choose a minimum value \(\xi _{1, d}\) so that the corresponding HDR of the weighted target includes the lower limit of the region of interest. An analogous argument is used to choose the maximum \(\xi _{V, d}\). We then interpolate \(V - 2\) values between \(\xi _{1, d}\) and \(\xi _{V, d}\) ensuring that there is sufficient, but not complete, overlap between the corresponding HDRs.
Denote an element from the set of all W possible values for the parameter of the weighting function with \(\xi _{w} \in \{\xi _{v, 1}\}_{v = 1}^{V} \times \cdots \times \{\xi _{v, D}\}_{v = 1}^{V}\), noting that \(\xi _{w}\) is a D-vector.
The practitioner typically has some knowledge of \(\text {p}(\phi )\) and A from prior predictive checks and previous attempts at running the two-stage sampler. Thus only a small number of trial-and-error attempts should be needed to determine \(\xi _{1, d}\) and \(\xi _{V, d}\) for all dimensions. These attempts are also used to check for overlap between the HDRs, and to increase V if the overlap is insufficient. Section 6 contains further discussion of this selection process and its relationship to umbrella sampling (Torrie and Valleau 1977).
Practicalities and software
In our examples we use Gaussian density functions for \(\text {m}(\phi ^{[d]};\xi _{v, d})\),
$$\begin{aligned} \text {m}(\phi ^{[d]};\xi _{v, d}) = \frac{1}{\sqrt{2 \pi \sigma ^{2}_{v, d}}} \text {exp} \left\{ - \frac{1}{2 \sigma ^{2}_{v, d}} (\phi ^{[d]} - \mu _{v, d})^2 \right\} , \end{aligned}$$
with \(\xi _{v, d} = (\mu _{v, d}, \sigma ^{2}_{v, d})\), though we fix \(\sigma ^{2}_{v, d} = \sigma ^{2}_{d}\) for all v. Our definition of sufficient overlap is that the 0.95 empirical quantile of \(\text {t}(\phi ^{[d]}; \xi _{v, d})\) is equal to or slightly greater than the 0.05 empirical quantile of \(\text {t}(\phi ^{[d]}; \xi _{v + 1, d})\), for \(v = 1, \ldots , V - 1\).
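For a single component, the selection heuristic with Gaussian weighting functions and the quantile-based overlap criterion might look like the sketch below. It is an illustration under assumptions, not the procedure implemented in the package: the marginal target is a toy \(\text {N}(0, 1)\), for which the weighted marginal with mean \(\mu \) and variance \(\sigma ^{2}\) is \(\text {N}(\mu / (1 + \sigma ^{2}), \sigma ^{2} / (1 + \sigma ^{2}))\); `sample_weighted_marginal` is a hypothetical stand-in for the MCMC run; the means are interpolated linearly; and only the "at least" half of the overlap criterion is checked, so excessive overlap is not penalised.

```python
import math
import random
import statistics


def interpolate_means(mu_min, mu_max, n_values):
    """Linearly interpolate V means between the 'minimum' and 'maximum' choices."""
    step = (mu_max - mu_min) / (n_values - 1)
    return [mu_min + v * step for v in range(n_values)]


def sample_weighted_marginal(mu, sigma2, n, rng):
    """Hypothetical stand-in for MCMC: closed-form draws for a N(0, 1) target."""
    centre = mu / (1.0 + sigma2)
    sd = math.sqrt(sigma2 / (1.0 + sigma2))
    return [rng.gauss(centre, sd) for _ in range(n)]


def sufficient_overlap(sample_sets):
    """True if each set's 0.95 quantile reaches the next set's 0.05 quantile."""
    for left, right in zip(sample_sets, sample_sets[1:]):
        q95_left = statistics.quantiles(left, n=20)[-1]   # last cut point: 95%
        q05_right = statistics.quantiles(right, n=20)[0]  # first cut point: 5%
        if q95_left < q05_right:
            return False
    return True


def choose_v(mu_min, mu_max, sigma2, n=2000, seed=1, v_max=20):
    """Increase V until adjacent weighted marginals overlap sufficiently."""
    rng = random.Random(seed)
    for n_values in range(2, v_max + 1):
        mus = interpolate_means(mu_min, mu_max, n_values)
        sets = [sample_weighted_marginal(mu, sigma2, n, rng) for mu in mus]
        if sufficient_overlap(sets):
            return n_values, mus
    raise ValueError("no V up to v_max gives sufficient overlap")


if __name__ == "__main__":
    # Means -6 and 6 centre the outermost weighted marginals at -3 and 3.
    v, mus = choose_v(mu_min=-6.0, mu_max=6.0, sigma2=1.0)
    print(v, mus)
```

In practice the trial-and-error loop would rerun the sampler for each candidate V, so the cost of each iteration is dominated by MCMC rather than by the quantile check.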
Our implementation of the WSRE methodology is available in an R (R Core Team 2021) package at https://github.com/hhau/wsre. It is built on top of Stan (Carpenter et al. 2017) and Rcpp (Eddelbuettel and François 2011). Package users supply a joint density \(\text {p}(\phi , \gamma )\) in the form of a Stan model; choose the parameters \(\xi _{w}\) of each of the \(W\) weighting functions; and specify the number of samples \(N\) drawn from each \(\text {s}_{w}(\phi ; \xi _{w})\). The package then returns the combined estimate \({\hat{\hat{\text {r}}}}_{\text {WSRE}}(\phi _{\text {nu}}, \phi _{\text {de}})\). A vignette on using wsre is included in the package, and documents the specific form of Stan model required.