1 Introduction

Estimating the normalizing constant of an unnormalized probability density, or the ratio of normalizing constants between two unnormalized densities, is a challenging and important task. In Bayesian inference, such problems are closely related to estimating the marginal likelihood of a model or the Bayes factor between two competing models, and arise in fields such as econometrics (Geweke 1999), astronomy (Bridges et al. 2009) and phylogenetics (Fourment et al. 2020). Monte Carlo methods such as Bridge sampling (Bennett 1976; Meng and Wong 1996), path sampling (Gelman and Meng 1998), reverse logistic regression (Geyer 1994), nested sampling (Skilling 2006) and reverse annealed importance sampling (Burda et al. 2015) have been proposed to address this problem. See Friel and Wyse (2012) for an overview of some popular algorithms. Fourment et al. (2020) also compare the empirical performance of 19 algorithms for estimating normalizing constants in the context of phylogenetics.

Bridge sampling (Bennett 1976; Meng and Wong 1996) is a powerful, easy-to-implement Monte Carlo method for estimating the ratio of normalizing constants. Let \({\tilde{q}}_i(\omega ), \omega \in \Omega _i, \; i=1,2\) be two unnormalized probability densities with respect to a common measure \(\mu \). Let \(q_i(\omega ) = \tilde{q}_i(\omega )/Z_i\) be the corresponding normalized density, where \(Z_i\) is the normalizing constant. Bridge sampling estimates \(r=Z_1/Z_2\) using samples from \(q_1,q_2\) and the unnormalized density functions \({\tilde{q}}_1, {\tilde{q}}_2\). Meng and Schilling (2002) point out that Bridge sampling is equally useful for estimating a single normalizing constant. The relative mean square error (RMSE) of a Bridge estimator depends on the overlap or “distance” between \(q_1,q_2\). The overlap can be quantified by some divergence between them. When \(q_1,q_2\) share little overlap, the corresponding Bridge estimator has large RMSE and therefore is unreliable. In order to improve the efficiency of Bridge estimators, various methods such as Warp Bridge sampling (Meng and Schilling 2002), Warp-U Bridge sampling (Wang et al. 2020) and Gaussianized Bridge sampling (Jia and Seljak 2020) have been introduced. These methods first apply transformations \(T_i\) to \(q_i\) in a tractable way without changing the normalizing constant \(Z_i\) for \(i=1,2\), then compute Bridge estimators based on the transformed densities \(q_i^{(T)}\) and the corresponding samples for \(i=1,2\). If \(q_1^{(T)}, q_2^{(T)}\) have greater overlap than the original ones, then the resulting Bridge estimator of r based on \(q_1^{(T)}, q_2^{(T)}\) would have a lower RMSE.

In this paper, we first demonstrate the connection between Bridge estimators and f-divergence (Ali and Silvey 1966). We show that one can estimate the asymptotic RMSE of the optimal Bridge estimator by equivalently estimating a specific f-divergence between \(q_1,q_2\). Nguyen et al. (2010) propose a general variational framework for f-divergence estimation. We apply this framework to our problem and obtain a new estimator of the asymptotic RMSE of the optimal Bridge estimator using the unnormalized densities \(\tilde{q}_1, {\tilde{q}}_2\) and the corresponding samples. We also find a connection between Bridge estimators and the variational lower bound of f-divergence given by Nguyen et al. (2010). In particular, we show that the problem of estimating an f-divergence between \(q_1,q_2\) using this variational framework naturally leads to a Bridge estimator of \(r=Z_1/Z_2\). Kong et al. (2003) observe that the optimal Bridge estimator is a maximum likelihood estimator under a semi-parametric formulation. We use this f-divergence estimation framework to extend this observation and show that many special cases of Bridge estimators such as the geometric Bridge estimator can also be interpreted as maximizers of some objective functions that are related to the f-divergence between \(q_1,q_2\). This formulation also connects Bridge estimators and density ratio estimation problems: since we can evaluate the unnormalized densities \({\tilde{q}}_1, {\tilde{q}}_2\), we know the true density ratio up to a multiplicative constant \(r=Z_1/Z_2\). Hence, estimating r can be viewed as a problem of estimating the density ratio between \(q_1,q_2\). A similar idea has been explored in, e.g. noise contrastive estimation (Gutmann and Hyvärinen 2010), where the authors treat the unknown normalizing constant as a model parameter, and cast the estimation problem as a classification problem. Similar ideas have also been discussed in, e.g. Geyer (1994) and Uehara et al. (2016).

We then utilize the connection between the asymptotic RMSE of the optimal Bridge estimator and a specific f-divergence between \(q_1,q_2\), and propose the f-GAN-Bridge estimator (f-GB), which improves the efficiency of the optimal Bridge estimator of r by directly minimizing the first-order approximation of its asymptotic RMSE with respect to the densities using an f-GAN. f-GAN (Nowozin et al. 2016) is a class of generative models that aims to approximate the target distribution by minimizing an f-divergence between the generative model and the target. Let \(\mathcal {T}\) be a collection of transformations T such that \({\tilde{q}}_1^{(T)}\), the transformed unnormalized density of \(q_1\), is computationally tractable and has the same normalizing constant \(Z_1\) as the original \({\tilde{q}}_1\). The f-GAN-Bridge estimator is obtained using a two-step procedure: we first use the f-GAN framework to find the transformation \(T^*\) that minimizes a specific f-divergence between \(q_1^{(T)}\) and \(q_2\) with respect to \(T \in \mathcal {T}\). Once \(T^*\) and \(q_1^{(T^*)}\) are chosen, we then compute the optimal Bridge estimator of r based on \(q_1^{(T^*)}\) and \(q_2\) as the f-GAN-Bridge estimator. We show that \(T^*\) asymptotically minimizes the first-order approximation of the asymptotic RMSE of the optimal Bridge estimator based on \(q_1^{(T)}\) and \(q_2\) with respect to T. In contrast, existing methods such as Warp Bridge sampling (Meng and Schilling 2002; Wang et al. 2020) and Gaussianized Bridge sampling (Jia and Seljak 2020) do not offer such a theoretical guarantee. The transformed \(q_1^{(T)}\) can be parameterized in any way as long as it is computationally tractable and preserves the normalizing constant \(Z_1\). In this paper, we parameterize \(q_1^{(T)}\) as a Normalizing flow (Rezende and Mohamed 2015; Dinh et al. 2016) with base density \(q_1\) because of its flexibility.

1.1 Summary of our contributions

The main contribution of our paper is that we give a computational framework to improve the optimal Bridge estimator by minimizing the first-order approximation of its asymptotic RMSE with respect to the densities. We also give a new estimator of the asymptotic RMSE of the optimal Bridge estimator using the variational framework proposed by Nguyen et al. (2010). This formulation allows us to cast the estimation problem as a 1-d optimization problem. We find the f-GAN-Bridge estimator outperforms existing methods significantly in both simulated and real-world examples. Numerical experiments show that the proposed method provides not only a reliable estimate of r, but also an accurate estimate of its RMSE. In addition, we also find a connection between Bridge estimators and the problem of f-divergence estimation, which allows us to view Bridge estimators from a different perspective.

This paper is structured as follows: in Sect. 2, we briefly review Bridge sampling and existing improvement strategies. In Sect. 3, we give a new estimator of the asymptotic RMSE of the optimal Bridge estimator using the variational framework for f-divergence estimation (Nguyen et al. 2010). We also demonstrate the connection between Bridge estimators and the problem of f-divergence estimation. In Sect. 4, we introduce the f-GAN-Bridge estimator and give implementation details. We give both simulated and real-world examples in Sects. 5 and 6. Section 7 concludes the paper with a discussion. A Python implementation of the proposed method along with examples can be found at https://github.com/hwxing3259/Bridge_sampling_and_fGAN.

2 Bridge sampling and related works

Let \(Q_1,Q_2\) be two probability distributions of interest. Let \(q_i(\omega ), \omega \in \Omega _i, i=1,2\) be the densities of \(Q_1,Q_2\) with respect to a common measure \(\mu \) defined on \(\Omega _1 \cup \Omega _2\), where \(\Omega _1\) and \(\Omega _2\) are the corresponding supports. We use \({\tilde{q}}_i(\omega ), i=1,2\) to denote the unnormalized densities and \(Z_i, i=1,2\) to denote the corresponding normalizing constants, i.e. \(q_i(\omega ) = \tilde{q}_i(\omega ) / Z_i\) for \(i=1,2\). Suppose we have samples from \(q_1,q_2\), but we are only able to evaluate the unnormalized densities \({\tilde{q}}_i(\omega ), i=1,2\). Our goal is to estimate the ratio of normalizing constants \(r = Z_1/Z_2\) using only \(\tilde{q}_i(\omega ), i=1,2\) and samples from the two distributions. Bridge sampling (Bennett 1976; Meng and Wong 1996) is a powerful method for this task.

Definition 1

(Bridge estimator) Suppose \(\mu (\Omega _1 \cap \Omega _2)>0\) and \(\alpha : \Omega _1 \cap \Omega _2 \rightarrow \mathbb {R}\) satisfies \(0< \left| \int _{\Omega _1 \cap \Omega _2} \alpha (\omega ) q_1(\omega )q_2(\omega )\right. \left. d\mu (\omega )\right| < \infty \). Given samples \(\{\omega _{ij}\}_{j=1}^{n_i} \sim q_i\) for \(i=1,2\), the Bridge estimator \({\hat{r}}_\alpha \) of \(r=Z_1/Z_2\) is defined as

$$\begin{aligned} {\hat{r}}_\alpha = \frac{{n_2}^{-1}\sum _{j=1}^{n_2}\alpha (\omega _{2j}) {\tilde{q}}_1(\omega _{2j}) }{{n_1}^{-1}\sum _{j=1}^{n_1} \alpha (\omega _{1j}){\tilde{q}}_2(\omega _{1j}) } \end{aligned}$$
(1)
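To make the notation concrete, here is a minimal NumPy sketch of (1) (our own illustration, not code from the paper's repository), using the geometric choice \(\alpha (\omega ) \propto ({\tilde{q}}_1(\omega ){\tilde{q}}_2(\omega ))^{-1/2}\) discussed later; the function and variable names are ours, and the two densities are assumed to be available only up to their normalizing constants.

```python
import numpy as np

def geometric_bridge_log_r(log_q1_tilde, log_q2_tilde, samples_1, samples_2):
    """Bridge estimator (1) with alpha = (q1_tilde * q2_tilde)**(-1/2), so that
    alpha*q1_tilde = sqrt(q1_tilde/q2_tilde) and alpha*q2_tilde = sqrt(q2_tilde/q1_tilde)."""
    # numerator: n2^{-1} sum of alpha*q1_tilde over samples from q2 (kept on the log scale)
    log_num = 0.5 * (log_q1_tilde(samples_2) - log_q2_tilde(samples_2))
    # denominator: n1^{-1} sum of alpha*q2_tilde over samples from q1 (kept on the log scale)
    log_den = 0.5 * (log_q2_tilde(samples_1) - log_q1_tilde(samples_1))
    return (np.logaddexp.reduce(log_num) - np.log(len(log_num))) \
         - (np.logaddexp.reduce(log_den) - np.log(len(log_den)))

# toy check: q1 = N(0,1), q2 = N(1,2^2) given unnormalized, so r = Z_1/Z_2 = 1/2
rng = np.random.default_rng(0)
log_q1 = lambda w: -0.5 * w ** 2
log_q2 = lambda w: -0.5 * ((w - 1.0) / 2.0) ** 2
w1, w2 = rng.normal(0.0, 1.0, 10000), rng.normal(1.0, 2.0, 10000)
print(np.exp(geometric_bridge_log_r(log_q1, log_q2, w1, w2)))  # roughly 0.5
```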

The choice of free function \(\alpha \) directly affects the quality of \({\hat{r}}_\alpha \), which is quantified by the relative mean square error (RMSE) \(E({\hat{r}}_{\alpha } - r)^2/r^2\). Let \(n = n_1+n_2\) and \(s_i = n_i/n\) for \(i=1,2\). Let \({\textit{RE}}^2({\hat{r}}_\alpha )\) denote the asymptotic RMSE of \({\hat{r}}_\alpha \) as \(n_1,n_2 \rightarrow \infty \). Under the assumption that the samples from \(q_1,q_2\) are i.i.d., Meng and Wong (1996) show the optimal \(\alpha \) which minimizes the first-order approximation of \({\textit{RE}}^2({\hat{r}}_\alpha )\) takes the form

$$\begin{aligned} \alpha _{opt}(\omega ) \propto \frac{1}{s_1 {\tilde{q}}_1(\omega ) + s_2 {\tilde{q}}_2(\omega ) r}, \quad \omega \in \Omega _1 \cap \Omega _2 \end{aligned}$$
(2)

The resulting \({\textit{RE}}^2({\hat{r}}_{\alpha _{opt}})\) with the optimal free function \(\alpha _{opt}\) is

$$\begin{aligned}&{\textit{RE}}^2({\hat{r}}_{\alpha _{opt}}) \nonumber \\&\quad = \frac{1}{ns_1s_2} \left[ \left( \int _{\Omega _1 \cap \Omega _2} \frac{q_1(\omega )q_2(\omega )}{s_1 q_1(\omega ) + s_2 q_2(\omega )} d\mu (\omega ) \right) ^{-1} -1\right] \nonumber \\&\qquad + o\left( \frac{1}{n}\right) . \end{aligned}$$
(3)

Note that the optimal \(\alpha _{opt}\) is not directly usable as it depends on the unknown quantity r we would like to estimate in the first place. To resolve this problem, Meng and Wong (1996) give an iterative procedure

$$\begin{aligned} {\hat{r}}^{(t+1)} = \frac{n_2^{-1}\sum _{j=1}^{n_2} \tilde{q}_1(\omega _{2j})/(s_1{\tilde{q}}_1(\omega _{2j}) + s_2 \tilde{q}_2(\omega _{2j}) {\hat{r}}^{(t)})}{n_1^{-1}\sum _{j=1}^{n_1}\tilde{q}_2(\omega _{1j})/(s_1{\tilde{q}}_1(\omega _{1j}) + s_2 \tilde{q}_2(\omega _{1j}) {\hat{r}}^{(t)})}, \quad t=0,1,2,\ldots \end{aligned}$$
(4)

The authors show that for any initial value \({\hat{r}}^{(0)}\), \({\hat{r}}^{(t)}\) is a consistent estimator of r for all \(t \ge 1\), and the sequence \(\{{\hat{r}}^{(t)}\}, \; t=0,1,2,\ldots \) converges to the unique limit \({\hat{r}}_{opt}\). Let \({\textit{MSE}}(\log {\hat{r}}_{opt})\) denote the asymptotic mean square error of \(\log {\hat{r}}_{opt}\).

Under the i.i.d. assumption, the authors also show \({\textit{RE}}^2({\hat{r}}_{opt})\) and \({\textit{MSE}}(\log {\hat{r}}_{opt})\) are asymptotically equivalent to \({\textit{RE}}^2({\hat{r}}_{\alpha _{opt}})\) in (3) up to the first order (i.e. they have the same leading term). Note that \({\hat{r}}_{opt}\) can be found numerically while \({\hat{r}}_{\alpha _{opt}}\) is not directly computable. We will focus on the asymptotically optimal Bridge estimator \({\hat{r}}_{opt}\) for the rest of the paper.
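A minimal NumPy sketch of the iteration (4) is given below (our own illustration with illustrative names, not the paper's implementation); it takes the log unnormalized densities evaluated at the two sample sets and returns the limit on the log scale, which is usually preferable numerically.

```python
import numpy as np
from scipy.special import logsumexp

def optimal_bridge_log_r(log_q1_at_w1, log_q2_at_w1, log_q1_at_w2, log_q2_at_w2,
                         log_r_init=0.0, n_iter=100):
    """Iterative procedure (4) for the asymptotically optimal Bridge estimator.
    log_qi_at_wj is a 1-d array of log q_i_tilde evaluated at the samples from q_j."""
    n1, n2 = len(log_q1_at_w1), len(log_q1_at_w2)
    log_s1, log_s2 = np.log(n1 / (n1 + n2)), np.log(n2 / (n1 + n2))
    log_r = log_r_init
    for _ in range(n_iter):
        # log(s1*q1_tilde + s2*q2_tilde*r_hat) at each sample, computed stably
        log_mix_w2 = np.logaddexp(log_s1 + log_q1_at_w2, log_s2 + log_q2_at_w2 + log_r)
        log_mix_w1 = np.logaddexp(log_s1 + log_q1_at_w1, log_s2 + log_q2_at_w1 + log_r)
        log_num = logsumexp(log_q1_at_w2 - log_mix_w2) - np.log(n2)
        log_den = logsumexp(log_q2_at_w1 - log_mix_w1) - np.log(n1)
        log_r = log_num - log_den
    return log_r  # converges to the log of the unique limit r_hat_opt
```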

2.1 Improving bridge estimators via transformations

From (3) and the fact that \({\textit{RE}}^2({\hat{r}}_{opt})\) and \({\textit{RE}}^2({\hat{r}}_{\alpha _{opt}})\) are asymptotically equivalent, we see \({\textit{RE}}^2({\hat{r}}_{opt})\) depends on the overlap between \(q_1\) and \(q_2\). Even when \(\Omega _1 = \Omega _2\), if \(q_1\) and \(q_2\) put their probability mass on very different regions, the integral in (3) would be close to 0, leading to large RMSE and unreliable estimators. In order to improve the performance of \({\hat{r}}_{opt}\), one may apply transformations to \(q_1,q_2\) (and to the corresponding samples) to increase their overlap while keeping the transformed unnormalized densities computationally tractable and the normalizing constants unchanged. We assume that we are dealing with unconstrained, continuous random variables with a common support, i.e. \(\Omega _1 = \Omega _2 = \mathbb {R}^d\). When the supports \(\Omega _1, \Omega _2\) are constrained or different from each other, we can usually match them by applying simple invertible transformations to \(q_1\), \(q_2\). When \(\Omega _1\),\(\Omega _2\) have different dimensions, Chen and Shao (1997) suggest matching the dimensions of \(q_1,q_2\) by augmenting the lower dimensional distribution using some completely known random variables (see “Appendix B” for details).

Voter (1985) gives a method to increase the overlap in the context of free energy estimation by shifting the samples from one distribution to the other and matching their modes. Meng and Schilling (2002) extend this idea and consider more general mappings. Let \(T_i: \mathbb {R}^d \rightarrow \mathbb {R}^d\), \(i=1,2\) be two smooth and invertible transformations that aim to bring \(q_1,q_2\) “closer”. For \(\omega _i \sim q_i\), define \(\omega _i^{(T)} = T_i(\omega _i)\), \(i=1,2\). Then for \(i=1,2\), the distribution of the transformed sample \(\omega _i^{(T)}\) has density

$$\begin{aligned} q_i^{(T)}(\omega _i^{(T)})&= {\tilde{q}}_i(T_i^{-1}(\omega _i^{(T)}))\left| \det J_i(\omega _i^{(T)})\right| /Z_i \end{aligned}$$
(5)
$$\begin{aligned}&\equiv \tilde{q}_i^{(T)}(\omega _i^{(T)})/Z_i, \quad i=1,2 \end{aligned}$$
(6)

where \(\tilde{q}_i^{(T)}\) is the unnormalized version of \( q_i^{(T)}\), \(T_i^{-1}\) is the inverse transformation of \(T_i\) and \(J_i(\omega )\) is its Jacobian. One can then apply (1) to the transformed samples and the corresponding unnormalized densities \(\tilde{q}_1^{(T)},\tilde{q}_2^{(T)}\), and obtain a Bridge estimator

$$\begin{aligned} {\hat{r}}^{(T)}_\alpha = \frac{n_2^{-1} \sum _{j=1}^{n_2} \tilde{q}_1^{(T)}(T_2(\omega _{2j})) \alpha (T_2(\omega _{2j}))}{n_1^{-1} \sum _{j=1}^{n_1} \tilde{q}_2^{(T)}(T_1(\omega _{1j})) \alpha (T_1(\omega _{1j}))} \end{aligned}$$
(7)

without the need to sample from \(\tilde{q}_1^{(T)}\) or \(\tilde{q}_2^{(T)}\) separately. Let \({\hat{r}}_{opt}^{(T)}\) denote the asymptotically optimal Bridge estimator based on the transformed densities. We stress that the superscript of \({\hat{r}}^{(t)}\) in (4) indicates the number of iterations, while the superscript in \({\hat{r}}_{opt}^{(T)}\) means it is based on the transformed densities. If the transformed \(q_1^{(T)}, q_2^{(T)}\) have a greater overlap than the original \(q_1,q_2\), then \({\hat{r}}^{(T)}_{opt}\) should be a more reliable estimator of r with a lower RMSE. Meng and Schilling (2002) further extend this idea and propose the Warp transformation, which aims to increase the overlap by centring, scaling and symmetrizing the two densities \(q_1,q_2\). One limitation of the Warp transformation is that it does not work well for multimodal distributions. Wang et al. (2020) propose the Warp-U transformation to address this problem. The key idea of the Warp-U transformation is to first approximate \(q_i\) by a mixture of Normal or t distributions, then construct a coupling between them which allows us to map \( q_i\) into a unimodal density in the same way as mapping the mixture back to a single standard Normal or t distribution.
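As a small illustration of (5)–(7) (a sketch of our own, with hypothetical names), suppose \(T_1\) is an elementwise affine map \(T_1(\omega ) = a \odot \omega + b\) applied to \(q_1\) while \(T_2\) is the identity; then \({\tilde{q}}_1^{(T)}\) follows directly from the change of variables and keeps the normalizing constant \(Z_1\):

```python
import numpy as np

def make_transformed_log_q1(log_q1_tilde, a, b):
    """Unnormalized density of T_1(omega) with T_1(omega) = a*omega + b, cf. (5)-(6):
    q1_tilde^(T)(y) = q1_tilde(T_1^{-1}(y)) * |det J of T_1^{-1} at y|
                    = q1_tilde((y - b)/a) / prod(|a|)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    log_det_J_inv = -np.sum(np.log(np.abs(a)))
    def log_q1_T(y):
        return log_q1_tilde((y - b) / a) + log_det_J_inv
    return log_q1_T

# With T_2 the identity, the transformed Bridge estimator (7) is just an ordinary
# Bridge estimator applied to the samples {T_1(w_1j)}, {w_2j} and the pair
# (q1_tilde^(T), q2_tilde); no new sampling is required.
```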

An alternative to the Warp transformation (Meng and Schilling 2002) is a Normalizing flow. A Normalizing flow (NF) (Rezende and Mohamed 2015; Dinh et al. 2016; Papamakarios et al. 2017) parameterizes a continuous probability distribution by mapping a simple base distribution (e.g. standard Normal) to the more complex target using a bijective transformation T, which is parameterized as a composition of a series of smooth and invertible mappings \(f_1,\ldots ,f_K\) with easy-to-compute Jacobians. This T is applied to the “base” random variable \(z_0 \sim p_0\), where \(z_0 \in \mathbb {R}^d\) and \(p_0\) is the known base density. Let

$$\begin{aligned} z_k = f_k \circ f_{k-1} \circ \cdots \circ f_1(z_0), \quad k=1,\ldots ,K \end{aligned}$$
(8)

Since the transformation T is smooth and invertible, by applying the change of variables formula repeatedly, the final output \(z_K\) has density

$$\begin{aligned} p_K(z_K) = p_0(z_0)\prod _{k=1}^K \left| \det J_k(z_{k-1}) \right| ^{-1} \end{aligned}$$
(9)

where \(J_k\) is the Jacobian of the mapping \(f_k\). The final density \(p_K\) can be used to approximate target distributions with complex structure, and one can sample from \(p_K\) easily by applying \(T = f_K \circ f_{K-1} \circ \cdots \circ f_1\) to \(z_0 \sim p_0\). In order to evaluate \(p_K\) efficiently, we are restricted to transformations \(f_k\) whose \(\det J_k(z)\) is easy to compute. For example, Real-NVP (Dinh et al. 2016) uses the following transformation: for \(m \in \mathbb {N}\) such that \(1<m<d\), let \(z_{1:m}\) be the first m entries of \(z \in \mathbb {R}^d\), let \(\times \) be element-wise multiplication and let \(\mu _k, \sigma _k: \mathbb {R}^{m} \rightarrow \mathbb {R}^{d-m}\) be two mappings (usually parameterized by neural nets). The smooth and invertible transformation \(y = f_k(z)\) for each step k in Real-NVP is defined as

$$\begin{aligned} y_{1:m} = z_{1:m}, \quad y_{m+1:d} = \mu _k(z_{1:m}) + \sigma _k(z_{1:m}) \times z_{m+1:d} \end{aligned}$$
(10)

This means \(f_k\) keeps the first m entries of input z, while shifting and scaling the remaining ones. The Jacobian \(J_k\) of \(f_k\) is lower triangular, hence \(\det J_k(z) = \prod _{i=1}^{d-m} \sigma _{ik}(z_{1:m})\), where \(\sigma _{ik}(z_{1:m})\) is the ith entry of \(\sigma _k(z_{1:m})\). Each transformation \(f_k\) is also called a coupling layer. When composing a series of coupling layers \(f_1,\ldots ,f_K\), the authors also swap the ordering of indices in (10) so that the dimensions that are kept unchanged in one step k are to be scaled and shifted in the next step. Jia and Seljak (2020) utilize the idea of transforming \(q_i\) using a Normalizing flow, and propose Gaussianized Bridge sampling (GBS) for estimating a single normalizing constant. The authors set \(q_1\) to be a completely known density, e.g. standard multivariate Normal, and aim to approximate the target \(q_2\) using a Normalizing flow with base density \(q_1\). The transformed density \(q_1^{(T)}\) is estimated by matching the marginal distributions between \(q_1^{(T)}\) and \(q_2\). Once \(q_1^{(T)}\) is chosen, the authors use (7) and the iterative procedure (4) to form the asymptotically optimal Bridge estimator of \(Z_2\) based on the transformed \(q_1^{(T)}\) and the original \(q_2\).
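Returning to the coupling layer (10), the following is a minimal PyTorch sketch of a single Real-NVP coupling layer (our own illustration; the architecture used in the paper's repository may differ). Parameterizing the scale as \(\sigma _k = \exp (s_k)\) is a common choice that keeps the transformation invertible and the log-Jacobian simple.

```python
import torch
import torch.nn as nn

class CouplingLayer(nn.Module):
    """One Real-NVP coupling layer, cf. (10): keeps z[:, :m] unchanged and
    shifts/scales z[:, m:] using neural nets of z[:, :m]."""
    def __init__(self, d, m, hidden=64):
        super().__init__()
        self.m = m
        self.net_mu = nn.Sequential(nn.Linear(m, hidden), nn.ReLU(), nn.Linear(hidden, d - m))
        self.net_s = nn.Sequential(nn.Linear(m, hidden), nn.ReLU(), nn.Linear(hidden, d - m))

    def forward(self, z):
        """Returns y = f_k(z) and log|det J_k(z)| for a batch z of shape (n, d)."""
        z1, z2 = z[:, :self.m], z[:, self.m:]
        mu, s = self.net_mu(z1), self.net_s(z1)
        y = torch.cat([z1, mu + torch.exp(s) * z2], dim=1)
        return y, s.sum(dim=1)          # the Jacobian is triangular: det = prod(exp(s))

    def inverse(self, y):
        """Returns z = f_k^{-1}(y) and the log|det| of the inverse map."""
        y1, y2 = y[:, :self.m], y[:, self.m:]
        mu, s = self.net_mu(y1), self.net_s(y1)
        return torch.cat([y1, (y2 - mu) * torch.exp(-s)], dim=1), -s.sum(dim=1)

# Composing several such layers (swapping which coordinates are kept fixed between
# layers) and applying the change of variables (9) gives the flow density p_K.
```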

The idea of increasing overlap via transformations is also applicable to discrete random variables. For example, Meng and Schilling (2002) suggest using swapping and permutation to increase the overlap between two discrete distributions. Tran et al. (2019) also give Normalizing flow models applicable to discrete random variables based on modulo operations. We give a toy example of increasing the overlap between two discrete distributions using Normalizing flows in “Appendix G”. In the later sections, we will extend the idea of increasing overlap via transformations, and propose a new strategy to improve \({\hat{r}}^{(T)}_{opt}\) by directly minimizing the first-order approximation of \({\textit{RE}}^2({\hat{r}}^{(T)}_{opt})\) with respect to the transformed densities.

3 Bridge estimators and f-divergence estimation

Frühwirth-Schnatter (2004) gives an MC estimator of \({\textit{RE}}^2({\hat{r}}_{opt})\). In this section, we introduce an alternative estimator of \({\textit{RE}}^2({\hat{r}}_{opt})\) and \({\textit{MSE}}(\log {\hat{r}}_{opt})\) by equivalently estimating an f-divergence between \(q_1,q_2\). This formulation allows us to utilize the variational lower bound of f-divergence given by Nguyen et al. (2010), and cast the problem of estimating \({\textit{RE}}^2({\hat{r}}_{opt})\) as a 1-d optimization problem. In Sect. 4, we will also show how to use this new estimator to improve the efficiency of \({\hat{r}}^{(T)}_{opt}\). In addition, we find that estimating different choices of f-divergence under the variational framework proposed by Nguyen et al. (2010) naturally leads to Bridge estimators of r with different choices of free function \(\alpha (\omega )\).

3.1 Estimating \({\textit{RE}}^2({\hat{r}}_{opt})\) via f-divergence estimation

f-divergence (Ali and Silvey 1966) is a broad class of divergences between two probability distributions. By choosing f accordingly, one can recover common divergences between probability distributions such as KL divergence \({\textit{KL}}(q_1,q_2)\), squared Hellinger distance \(H^2(q_1,q_2)\) and total variation distance \(d_{TV}(q_1,q_2)\).

Definition 2

(f-divergence) Suppose the two probability distributions \(Q_1,Q_2\) have absolutely continuous density functions \(q_1\) and \(q_2\) with respect to a base measure \(\mu \) on a common support \(\Omega \). Let the generator function \(f: \mathbb {R}^+ \rightarrow \mathbb {R}\) be a convex and lower semi-continuous function satisfying \(f(1)=0\). The f-divergence \(D_f(q_1,q_2)\) defined by f takes the form

$$\begin{aligned} D_f(q_1,q_2) = \int _{\Omega } f\left( \frac{q_1(\omega )}{q_2(\omega )} \right) q_2(\omega ) d\mu (\omega ) \end{aligned}$$
(11)

Unless otherwise stated, we assume \(\Omega = \mathbb {R}^d\) where \(d \in \mathbb {N}\), i.e. both \(q_1\) and \(q_2\) are defined on \(\mathbb {R}^d\). If the densities \(q_1,q_2\) have different or disjoint supports \(\Omega _1\), \(\Omega _2\), then we apply appropriate transformations and augmentations discussed in the previous sections to ensure that the transformed and augmented densities (if necessary) are defined on the common support \(\Omega =\mathbb {R}^d\). In this paper, we focus on a particular choice of f-divergence that is closely related to \({\textit{RE}}^2({\hat{r}}_{opt})\) in (3).

Definition 3

(Weighted harmonic divergence) Let \(q_1,q_2\) be continuous densities with respect to a base measure \(\mu \) on the common support \(\Omega \). The weighted harmonic divergence is defined as

$$\begin{aligned} H_{\pi }(q_1,q_2) = 1 - \int _{\Omega } \left( \pi q_1^{-1}(\omega ) + (1-\pi ) q_2^{-1}(\omega )\right) ^{-1} d\mu (\omega ) \end{aligned}$$
(12)

where \(\pi \in (0,1)\) is the weight parameter.

Wang et al. (2020) observe that the weighted harmonic divergence \(H_{\pi }(q_1,q_2)\) is an f-divergence with generator \(f(u) = 1-\frac{u}{\pi +(1-\pi )u}\), and \({\textit{RE}}^2({\hat{r}}_{opt})\) can be rearranged as

$$\begin{aligned} {\textit{RE}}^2({\hat{r}}_{opt})&= (s_1s_2n)^{-1} \left( (1 - H_{s_2}(q_1,q_2))^{-1} -1\right) \nonumber \\&\quad + o\left( \frac{1}{n}\right) . \end{aligned}$$
(13)

The same statement also holds for \({\textit{MSE}}(\log {\hat{r}}_{opt})\) since \({\textit{MSE}}(\log {\hat{r}}_{opt})\) is asymptotically equivalent to \({\textit{RE}}^2({\hat{r}}_{opt})\) (Meng and Wong 1996). This means if we have an estimator of \(H_{s_2}(q_1,q_2)\), then we can plug it into the leading term of the right hand side of (13) and obtain an estimator of the first-order approximation of \({\textit{RE}}^2({\hat{r}}_{opt})\) and \({\textit{MSE}}(\log {\hat{r}}_{opt})\). Before we give the estimator of \(H_{s_2}(q_1,q_2)\), we first introduce the variational framework for f-divergence estimation proposed by Nguyen et al. (2010). Every convex, lower semi-continuous function \(f: \mathbb {R}^+ \rightarrow \mathbb {R}\) has a convex conjugate \(f^*\) which is defined as follows,

Definition 4

(Convex conjugate) Let \(f: \mathbb {R}^+ \rightarrow \mathbb {R}\) be a convex and lower semi-continuous function. The convex conjugate of f is defined as

$$\begin{aligned} f^*(t) = \sup _{u \in \mathbb {R}^+} \{ut-f(u)\} \end{aligned}$$
(14)

Nguyen et al. (2010) show that any f-divergence \(D_f(q_1,q_2)\) satisfies

$$\begin{aligned} D_f(q_1,q_2) \ge \sup _{V \in \mathcal {V}} \Big ( E_{q_1}[V(\omega )] - E_{q_2}[f^*(V(\omega ))]\Big ), \end{aligned}$$
(15)

where \(\mathcal {V}\) is an arbitrary class of functions \(V: \Omega \rightarrow \mathbb {R}\), and \(f^*(t)\) is the convex conjugate of the generator f which characterizes the f-divergence \(D_f(q_1,q_2)\). A table of common f-divergences with their generator f and the corresponding convex conjugate \(f^*\) can be found in Nowozin et al. (2016). Nguyen et al. (2010) show that if f is differentiable and strictly convex, then \(D_f(q_1,q_2)\) is equal to \(E_{q_1}[V(\omega )] - E_{q_2}[f^*(V(\omega ))]\) in (15) if and only if \(V(\omega ) = f'\left( \frac{q_1(\omega )}{q_2(\omega )}\right) \), the first-order derivative of f evaluated at \(q_1(\omega )/q_2(\omega )\). The authors then give a new strategy of estimating the f-divergence \(D_f(q_1,q_2)\) by finding the maximum of an empirical estimate of \(E_{q_1}[V(\omega )] - E_{q_2}[f^*(V(\omega ))]\) in (15) with respect to the variational function \(V \in \mathcal {V}\). We now use this framework to give an estimator of \(H_{\pi }(q_1,q_2)\).

Proposition 1

(Estimating \(H_{\pi }(q_1,q_2)\)) Let \(q_1,q_2\) be continuous densities with respect to a base measure \(\mu \) on the common support \(\Omega \). Let \(\{\omega _{ij}\}_{j=1}^{n_i}\) be samples from \(q_i\) for \(i=1,2\). Let \(\pi \in (0,1)\) be the weight parameter. Let r be the true ratio of normalizing constants between \(q_1,q_2\), and \(C_2> C_1 > 0\) be constants such that \(r \in [C_1, C_2]\). For \({\tilde{r}} \in [C_1,C_2]\), define

$$\begin{aligned} G({\tilde{r}}; \pi )&= 1 - \frac{1}{\pi } E_{q_1}\left( \frac{\pi {\tilde{q}}_2(\omega ) {\tilde{r}}}{(1-\pi ) \tilde{q}_1(\omega ) + \pi {\tilde{q}}_2(\omega ) {\tilde{r}}} \right) ^2 \nonumber \\&\quad - \frac{1}{1-\pi } E_{q_2} \left( \frac{(1-\pi ) \tilde{q}_1(\omega )}{(1-\pi ){\tilde{q}}_1(\omega ) + \pi {\tilde{q}}_2(\omega ) {\tilde{r}} }\right) ^2 \end{aligned}$$
(16)

Then, \(H_{\pi }(q_1,q_2)\) satisfies

$$\begin{aligned} H_{\pi }(q_1,q_2) \ge G({\tilde{r}}; \pi ) \quad \forall {\tilde{r}} \in [C_1,C_2] , \end{aligned}$$
(17)

and equality holds if and only if \({\tilde{r}}=r\). In addition, let

$$\begin{aligned}&{\hat{G}}({\tilde{r}} ; \pi , \{\omega _{ij}\}_{j=1}^{n_i}) = 1 \nonumber \\&\quad - \frac{1}{\pi n_1} \sum _{j=1}^{n_1}\left( \frac{\pi \tilde{q}_2(\omega _{1j}) {\tilde{r}}}{(1-\pi ) {\tilde{q}}_1(\omega _{1j}) + \pi {\tilde{q}}_2(\omega _{1j}) {\tilde{r}}} \right) ^2 \nonumber \\&\quad - \frac{1}{(1-\pi )n_2} \sum _{j=1}^{n_2} \left( \frac{(1-\pi ) \tilde{q}_1(\omega _{2j})}{(1-\pi ){\tilde{q}}_1(\omega _{2j}) + \pi \tilde{q}_2(\omega _{2j}) {\tilde{r}} }\right) ^2 \end{aligned}$$
(18)

be the empirical estimate of \(G({\tilde{r}}; \pi )\) based on \(\{\omega _{ij}\}_{j=1}^{n_i} \sim q_i\) for \(i=1,2\). If \({\hat{r}}_{\pi } = {{\,\mathrm{arg\,max}\,}}_{{\tilde{r}} \in [C_1,C_2]} {\hat{G}}({\tilde{r}} ; \pi , \{\omega _{ij}\}_{j=1}^{n_i})\), then \({\hat{r}}_{\pi }\) is a consistent estimator of r, and \({\hat{G}}({\hat{r}}_{\pi };\pi , \{\omega _{ij}\}_{j=1}^{n_i})\) is a consistent estimator of \(H_{\pi }(q_1,q_2)\) as \(n_1,n_2 \rightarrow \infty \).

Proof

See “Appendix A”. \(\square \)

Note that (17) is a special case of the variational lower bound (15) with the f-divergence \(D_f(q_1,q_2) = H_{\pi }(q_1,q_2)\), the corresponding generator \(f(u) = 1-\frac{u}{\pi +(1-\pi )u}\) and variational function \(V_{{\tilde{r}}}(\omega ) = f'\left( \frac{{\tilde{q}}_1(\omega )}{\tilde{q}_2(\omega ){\tilde{r}}}\right) \) with \(\mathcal {V} = \{V_{\tilde{r}}(\omega ) \vert {\tilde{r}} \in [C_1,C_2]\}\), i.e. \({\tilde{r}} \in [C_1,C_2]\) is the sole parameter of \(V_{{\tilde{r}}}(\omega )\). Note that \(V_r(\omega )=f'\left( \frac{q_1(\omega )}{q_2(\omega )}\right) \) since r is the ratio of normalizing constants between \(q_1,q_2\). We parameterize the variational function in this specific form because we would like to take advantage of knowing the unnormalized densities \({\tilde{q}}_1, {\tilde{q}}_2\) in our setup. Here, we assume that \({\tilde{r}} \in [C_1,C_2]\) instead of \({\tilde{r}} \in \mathbb {R}^+\). This is not a strong assumption, since we can set \(C_1\) \((C_2)\) to be arbitrarily small (large). We take \({\hat{G}}({\hat{r}}_{s_2} ; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\) as an estimator of \(H_{s_2}(q_1,q_2)\), and define our estimator of the first-order approximation of \({\textit{RE}}^2({\hat{r}}_{opt})\) as follows:

Definition 5

(Estimator of \({\textit{RE}}^2({\hat{r}}_{opt})\)) Let \(\{\omega _{ij}\}_{j=1}^{n_i}\) be samples from \(q_i\) for \(i=1,2\). Define

$$\begin{aligned} {\widehat{{\textit{RE}}}}^2({\hat{r}}_{opt})&= (s_1s_2n)^{-1} \nonumber \\&\quad \left( (1 - {\hat{G}}({\hat{r}}_{s_2} ; s_2, \{\omega _{ij}\}_{j=1}^{n_i}))^{-1} -1\right) \end{aligned}$$
(19)

as an estimator of the first-order approximation of both \({\textit{RE}}^2({\hat{r}}_{opt})\) and \({\textit{MSE}}(\log {\hat{r}}_{opt})\) in (13).
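For illustration, here is a minimal NumPy/SciPy sketch (our own, with illustrative names) of computing \({\hat{G}}\) in (18) and the error estimator (19) by a 1-d search over \(\log {\tilde{r}}\); the log-scale parameterization, the bounds and the use of a generic bounded optimizer are implementation choices on our part rather than prescriptions from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def G_hat(log_r, log_q1_at_w1, log_q2_at_w1, log_q1_at_w2, log_q2_at_w2, pi):
    """Empirical lower bound (18) of the weighted harmonic divergence H_pi(q1, q2)."""
    n1, n2 = len(log_q1_at_w1), len(log_q1_at_w2)
    la_w1 = np.log(1 - pi) + log_q1_at_w1          # log((1-pi)*q1_tilde) at samples from q1
    lb_w1 = np.log(pi) + log_q2_at_w1 + log_r      # log(pi*q2_tilde*r) at samples from q1
    la_w2 = np.log(1 - pi) + log_q1_at_w2          # the same quantities at samples from q2
    lb_w2 = np.log(pi) + log_q2_at_w2 + log_r
    term1 = np.exp(2 * (lb_w1 - np.logaddexp(la_w1, lb_w1))).sum() / (pi * n1)
    term2 = np.exp(2 * (la_w2 - np.logaddexp(la_w2, lb_w2))).sum() / ((1 - pi) * n2)
    return 1.0 - term1 - term2

def estimate_log_r_and_RE2(log_q1_at_w1, log_q2_at_w1, log_q1_at_w2, log_q2_at_w2):
    n1, n2 = len(log_q1_at_w1), len(log_q1_at_w2)
    n, s1, s2 = n1 + n2, n1 / (n1 + n2), n2 / (n1 + n2)
    # maximize G_hat over log r_tilde, i.e. minimize its negative on a wide bounded interval
    res = minimize_scalar(lambda lr: -G_hat(lr, log_q1_at_w1, log_q2_at_w1,
                                            log_q1_at_w2, log_q2_at_w2, s2),
                          bounds=(-50.0, 50.0), method="bounded")
    log_r_hat, H_hat = res.x, -res.fun
    re2_hat = (1.0 / (1.0 - H_hat) - 1.0) / (s1 * s2 * n)   # the estimator (19)
    return log_r_hat, re2_hat
```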

Even though \({\hat{G}}({\hat{r}}_{s_2} ; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\) is a consistent estimator of \(H_{s_2}(q_1,q_2)\), it suffers from a positive bias (see “Appendix C” for details). We have not found a practical strategy to correct it so far. On the other hand, we believe this bias does not prevent our proposed error estimator \({\widehat{{\textit{RE}}}}^2({\hat{r}}_{opt})\) from being useful in practice. Since our estimator of \({\textit{RE}}^2({\hat{r}}_{opt})\) in (19) is a monotonically increasing function of \({\hat{G}}({\hat{r}}_\pi ; \pi , \{\omega _{ij}\}_{j=1}^{n_i})\) from Proposition 1, the positive bias in \({\hat{G}}({\hat{r}}_\pi ; \pi , \{\omega _{ij}\}_{j=1}^{n_i})\) leads to a positive bias in \({\widehat{{\textit{RE}}}}^2({\hat{r}}_{opt})\). Therefore, \({\widehat{{\textit{RE}}}}^2({\hat{r}}_{opt})\) will systematically overestimate the true error \({\textit{RE}}^2({\hat{r}}_{opt})\), which will lead to more conservative conclusions (e.g. wider error bars). This is certainly not ideal, but we believe that, in practice, it is less harmful than underestimating the variability in \({\hat{r}}_{opt}\). In addition, we see the proposed error estimator provides accurate estimates of \({\textit{RE}}^2({\hat{r}}_{opt})\) in both examples in Sects. 5 and 6, indicating its effectiveness.

3.2 f-divergence estimation and Bridge estimators

In the last section, we focused on estimating \(H_{\pi }(q_1,q_2)\). We now extend the estimation framework to other choices of f-divergence, and show how Bridge estimators naturally arise from this estimation problem. Let an f-divergence \(D_f(q_1,q_2)\) with the corresponding generator f(u) be given. Similar to Proposition 1, under our parameterization of the variational function \(V_{{\tilde{r}}}\), the empirical estimate of \(E_{q_1}[V(\omega )] - E_{q_2}[f^*(V(\omega ))]\) in (15) becomes

$$\begin{aligned}&{\hat{G}}_f({\tilde{r}}; \{\omega _{ij}\}_{j=1}^{n_i}) = \frac{1}{n_1} \sum _{j=1}^{n_1} V_{{\tilde{r}}}(\omega _{1j}) - \frac{1}{n_2} \sum _{j=1}^{n_2} f^*(V_{{\tilde{r}}}(\omega _{2j})) \end{aligned}$$
(20)
$$\begin{aligned}&= \frac{1}{n_1} \sum _{j=1}^{n_1} f'\left( \frac{\tilde{q}_1(\omega _{1j})}{{\tilde{q}}_2(\omega _{1j}){\tilde{r}}}\right) - \frac{1}{n_2} \sum _{j=1}^{n_2} f^* \circ f'\left( \frac{\tilde{q}_1(\omega _{2j})}{{\tilde{q}}_2(\omega _{2j}){\tilde{r}}}\right) , \end{aligned}$$
(21)

where \(\{\omega _{ij}\}_{j=1}^{n_i} \sim q_i\) for \(i=1,2\). Let \({\hat{r}}^{(f)} = {{\,\mathrm{arg\,max}\,}}_{{\tilde{r}} \in \mathbb {R}^+} {\hat{G}}_f({\tilde{r}}; \{\omega _{ij}\}_{j=1}^{n_i})\). By Nguyen et al. (2010), \(V_{{\hat{r}}^{(f)}} = f'\left( \frac{{\tilde{q}}_1(\omega )}{\tilde{q}_2(\omega ) {\hat{r}}^{(f)}}\right) \) is an estimator of \(V_{r}(\omega ) = f'\left( \frac{{\tilde{q}}_1(\omega )}{{\tilde{q}}_2(\omega ) r}\right) \), and \({\hat{G}}_f({\hat{r}}^{(f)}; \{\omega _{ij}\}_{j=1}^{n_i})\) is an estimator of \(D_f(q_1,q_2)\). In Proposition 1 we have shown that \({\hat{r}}^{(f)}\) and \({\hat{G}}_f({\hat{r}}^{(f)}; \{\omega _{ij}\}_{j=1}^{n_i})\) are consistent estimators of r and \(D_f(q_1,q_2)\) when \(D_f(q_1,q_2)\) is the weighted harmonic divergence \(H_\pi (q_1,q_2)\). Here, we show the connection between \({\hat{r}}^{(f)}\) and the Bridge estimators of r with different choices of free function \(\alpha (\omega )\).

Proposition 2

(Connection between \({\hat{r}}^{(f)}\) and Bridge estimators) Suppose \(f(u):\mathbb {R}^+ \rightarrow \mathbb {R}\) is strictly convex, twice differentiable and satisfies \(f(1)=0\). Let \(\{\omega _{ij}\}_{j=1}^{n_i}\) be samples from \(q_i\) for \(i=1,2\). If \({\hat{r}}^{(f)} = {{\,\mathrm{arg\,max}\,}}_{{\tilde{r}} \in \mathbb {R}^+} {\hat{G}}_f(\tilde{r}; \{\omega _{ij}\}_{j=1}^{n_i})\) is a stationary point of \({\hat{G}}_f({\tilde{r}}; \{\omega _{ij}\}_{j=1}^{n_i})\) in (21), then \({\hat{r}}^{(f)}\) satisfies the following equation

$$\begin{aligned} {\hat{r}}^{(f)} = \frac{\frac{1}{n_2} \sum _{j=1}^{n_2} f''\left( \frac{{\tilde{q}}_1(\omega _{2j})}{{\tilde{q}}_2(\omega _{2j}) {\hat{r}}^{(f)}}\right) \frac{{\tilde{q}}_1(\omega _{2j})}{\tilde{q}_2(\omega _{2j})^2}{\tilde{q}}_1(\omega _{2j})}{\frac{1}{n_1} \sum _{j=1}^{n_1} f''\left( \frac{{\tilde{q}}_1(\omega _{1j})}{\tilde{q}_2(\omega _{1j}) {\hat{r}}^{(f)}}\right) \frac{\tilde{q}_1(\omega _{1j})}{{\tilde{q}}_2(\omega _{1j})^2}{\tilde{q}}_2(\omega _{1j}) } \end{aligned}$$
(22)

where \(f''\) is the second-order derivative of f.

Proof

See “Appendix A”. \(\square \)

In Eq. (22), \(f''\left( \frac{\tilde{q}_1(\omega )}{{\tilde{q}}_2(\omega ) {\hat{r}}^{(f)}}\right) \frac{\tilde{q}_1(\omega )}{{\tilde{q}}_2(\omega )^2}\) plays the role of the free function \(\alpha (\omega )\) in a Bridge estimator (1). Common Bridge estimators such as the asymptotically optimal Bridge estimator \({\hat{r}}_{opt}\) and the geometric Bridge estimator can be recovered by choosing f accordingly (see “Appendix D”). Kong et al. (2003) observe that \({\hat{r}}_{opt}\) can be viewed as a semi-parametric maximum likelihood estimator. Proposition 2 extends this observation and shows that in addition to \({\hat{r}}_{opt}\), a large class of Bridge estimators can also be viewed as maximizers of some objective functions that are related to the variational lower bound of some f-divergences. In the next section, we will show how to use this variational framework to minimize the first-order approximation of \({\textit{RE}}^2({\hat{r}}^{(T)}_{opt})\) with respect to the transformed densities.
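As a concrete illustration of Proposition 2 (a worked instance of our own, not necessarily one of the cases treated in “Appendix D”), take the KL generator \(f(u) = u\log u\), so that \(f''(u) = 1/u\) and the induced free function is \(\alpha (\omega ) \propto 1/{\tilde{q}}_2(\omega )\). Substituting into (22), the factor \({\hat{r}}^{(f)}\) cancels and the fixed-point equation collapses to

$$\begin{aligned} {\hat{r}}^{(f)} = \frac{1}{n_2} \sum _{j=1}^{n_2} \frac{{\tilde{q}}_1(\omega _{2j})}{{\tilde{q}}_2(\omega _{2j})}, \end{aligned}$$

which is the familiar importance sampling estimator of r based on samples from \(q_2\).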

4 Improving \({\hat{r}}_{opt}\) via an f-GAN

From Sect. 2.1, we see that one can improve \({\hat{r}}_{opt}\) and reduce its RMSE by first transforming \(q_1,q_2\) appropriately, then computing \({\hat{r}}^{(T)}_{opt}\) using the transformed densities and samples. From Sect. 3, we also see the first-order approximation of \({\textit{RE}}^2({\hat{r}}_{opt})\) is a monotonic function of \(H_{s_2}(q_1,q_2)\). In this section, we utilize this observation and introduce the f-GAN-Bridge estimator (f-GB) that aims to improve \({\hat{r}}^{(T)}_{opt}\) by minimizing the first-order approximation of \({\textit{RE}}^2({\hat{r}}^{(T)}_{opt})\) with respect to the transformed densities. We show it is equivalent to minimizing \(H_{s_2}(q_1^{(T)},q_2)\) with respect to \(q_1^{(T)}\) using the variational lower bound of \(H_{\pi }(q_1,q_2)\) (17) and f-GAN (Nowozin et al. 2016).

4.1 The f-GAN framework

We start by introducing the GAN and f-GAN models. Generative adversarial networks (GANs) (Goodfellow et al. 2014) are an expressive class of generative models. Let \(p_{{\textit{tar}}}\) be the target distribution of interest. In the original GAN, Goodfellow et al. (2014) estimate a generative model \(p_\phi \) parameterized by a real vector \(\phi \) by approximately minimizing the Jensen–Shannon divergence between \(p_\phi \) and \(p_{{\textit{tar}}}\). The key idea of the original GAN is to introduce a separate discriminator which tries to distinguish between “true samples” from \(p_{{\textit{tar}}}\) and artificially generated samples from \(p_\phi \). This discriminator is then optimized alongside the generative model \(p_\phi \) in the training process. See Creswell et al. (2018) for an overview of GAN models.

f-GAN (Nowozin et al. 2016) extends the original GAN model using the variational lower bound of f-divergence (15), and introduces a GAN-type framework that generalizes to minimizing any f-divergence between \(p_{{\textit{tar}}}\) and \(p_\phi \). Let an f-divergence with the generator f be given. Nowozin et al. (2016) parameterize the variational function \(V_\xi \) and the generative model \(p_\phi \) as two neural nets with parameters \(\xi \) and \(\phi \), respectively, and propose

$$\begin{aligned} G(\phi , \xi ) = E_{p_{{\textit{tar}}}}(V_\xi (\omega )) - E_{p_\phi }(f^*(V_\xi (\omega ))) \end{aligned}$$
(23)

as the objective function of the f-GAN model, where \(f^*\) is the convex conjugate of the generator f of the chosen f-divergence. Recall that \(G(\phi , \xi )\) is in the form of the variational lower bound (15) of \(D_f(p_\phi ,p_{{\textit{tar}}})\). Nowozin et al. (2016) show that \(D_f(p_\phi ,p_{{\textit{tar}}})\) can be minimized by solving \(\min _\phi \max _\xi G(\phi , \xi )\). Intuitively, we can view \(\max _{\xi }G(\phi , \xi )\) as an estimate of \(D_f(p_\phi ,p_{{\textit{tar}}})\) (Nguyen et al. 2010). This means minimizing \(\max _\xi G(\phi , \xi )\) with respect to \(\phi \) can be interpreted as minimizing an estimate of \(D_f(p_\phi ,p_{{\textit{tar}}})\).

Now we show how to use the f-GAN framework to construct the f-GAN-Bridge estimator (f-GB). Suppose \(q_1,q_2\) are defined on a common support \(\Omega = \mathbb {R}^d\). Let \(T_\phi : \Omega \rightarrow \Omega \) be a transformation parameterized by a real vector \(\phi \in \mathbb {R}^l\) that aims to map \(q_1\) to \(q_2\). Let \(q_1^{(\phi )}\) be the transformed density obtained by applying \(T_\phi \) to \(q_1\), and \({\tilde{q}}_1^{(\phi )}\) be the corresponding unnormalized density. We also require \({\tilde{q}}_1^{(\phi )}\) to be computationally tractable, and \({\tilde{q}}_1^{(\phi )} = q_1^{(\phi )}Z_1\), i.e. \({\tilde{q}}_1^{(\phi )}\) and \({\tilde{q}}_1\) have the same normalizing constant \(Z_1\). Let \(\mathcal {T} = \{T_\phi : \phi \in \mathbb {R}^l\}\) be a collection of such transformations. Define \({\hat{r}}^{(\phi )}_{opt}\) to be the asymptotically optimal Bridge estimator of r based on the unnormalized densities \(\tilde{q}_1^{(\phi )}, {\tilde{q}}_2\) and corresponding samples \(\{T_{\phi }(\omega _{1j})\}_{j=1}^{n_1}, \{\omega _{2j}\}_{j=1}^{n_2}\). Let \(\pi \in (0,1)\). Define

$$\begin{aligned}&G(\phi , {\tilde{r}}; \pi ) = 1 - \frac{1}{\pi } E_{q_1^{(\phi )}}\left( \frac{\pi {\tilde{q}}_2(\omega ) {\tilde{r}}}{(1-\pi ) {\tilde{q}}_1^{(\phi )}(\omega ) + \pi \tilde{q}_2(\omega ) {\tilde{r}}} \right) ^2 \nonumber \\&\quad - \frac{1}{1-\pi } E_{q_2} \left( \frac{(1-\pi ) {\tilde{q}}_1^{(\phi )}(\omega )}{(1-\pi )\tilde{q}_1^{(\phi )}(\omega ) + \pi {\tilde{q}}_2(\omega ) {\tilde{r}} }\right) ^2. \end{aligned}$$
(24)

By Proposition 1, \(G(\phi , {\tilde{r}}; \pi )\) is the variational lower bound of \(H_{\pi }(q_1^{(\phi )}, q_2)\). In order to illustrate our idea, we first give an idealized Algorithm 1 to find the f-GAN-Bridge estimator. A practical version will be given in the next section.

Algorithm 1 (idealized f-GAN-Bridge estimator): solve \(\min _{\phi \in \mathbb {R}^l} \max _{{\tilde{r}}\in \mathbb {R}^+} G(\phi , {\tilde{r}}; s_2)\) for \((\phi ^*, {\tilde{r}}^*)\), then compute the optimal Bridge estimator \({\hat{r}}^{(\phi ^*)}_{opt}\) and the error estimate \({\widehat{{\textit{RE}}}}^2({\hat{r}}^{(\phi ^*)}_{opt})\) based on \({\tilde{q}}_1^{(\phi ^*)}\), \({\tilde{q}}_2\) and the corresponding samples. (Pseudocode displayed as a figure in the original.)

Since \({\tilde{q}}_1^{(\phi )}\) and \({\tilde{q}}_1\) have the same normalizing constant by (6), \({\hat{r}}^{(\phi )}_{opt}\) is an asymptotically optimal Bridge estimator of r for any transformation \(T_\phi \in \mathcal {T}\). We show that within the given family of transformations \(\mathcal {T}\), Algorithm 1 is able to find \(T_{\phi ^*}\) that minimizes the first-order approximation of \({\textit{RE}}^2({\hat{r}}^{(\phi )}_{opt})\) with respect to \(T_\phi \in \mathcal {T}\) under the i.i.d. assumption.

Proposition 3

(Minimizing \({\textit{RE}}^2({\hat{r}}^{(\phi )}_{opt})\) using Algorithm 1) If \((\phi ^*, {\tilde{r}}^*)\) is a solution of \(\min _{\phi \in \mathbb {R}^l} \max _{{\tilde{r}}\in \mathbb {R}^+} G(\phi , {\tilde{r}}; s_2)\) defined in Algorithm 1, then \(G(\phi , {\tilde{r}}^*; s_2) = H_{s_2}(q_1^{(\phi )},q_2) \) for all \(\phi \in \mathbb {R}^l\), and \(T_{\phi ^*}\) minimizes \(H_{s_2}(q_1^{(\phi )},q_2)\) with respect to \(T_\phi \in \mathcal {T}\). If the samples \(\{\omega _{ij}\}_{j=1}^{n_i} \overset{i.i.d.}{\sim } q_i\) for \(i=1,2\), then \(T_{\phi ^*}\) also minimizes \({\textit{RE}}^2({\hat{r}}^{(\phi )}_{opt})\) with respect to \(T_\phi \in \mathcal {T}\) up to the first order.

Proof

See “Appendix A”. \(\square \)

From Proposition 3, we see that under the i.i.d. assumption, \(T_{\phi ^*}\) and the corresponding f-GAN-Bridge estimator \({\hat{r}}^{(\phi ^*)}_{opt}\) are optimal in the sense that \({\hat{r}}^{(\phi ^*)}_{opt}\) attains the minimal RMSE (up to the first order) among all possible transformations \(T_\phi \in \mathcal {T}\) and their corresponding \({\hat{r}}^{(\phi )}_{opt}\). Since \(G(\phi ^*, {\tilde{r}}^*; s_2) = H_{s_2}(q_1^{(\phi ^*)},q_2)\), \({\widehat{{\textit{RE}}}}^2({\hat{r}}^{(\phi ^*)}_{opt})\) in Algorithm 1 is exactly the leading term of \({\textit{RE}}^2({\hat{r}}_{opt}^{(\phi ^*)})\) in the form of (13). Note that by Proposition 1, \({\tilde{r}}^*\) is equal to the true ratio of normalizing constants r. This means that if we had \((\phi ^*, {\tilde{r}}^*)\) from the idealized Algorithm 1, there would seemingly be no need to carry out the subsequent Bridge sampling step. However, \((\phi ^*, {\tilde{r}}^*)\) is not computable in practice as \(G(\phi , {\tilde{r}}; s_2)\) depends on the unknown normalizing constants \(Z_1,Z_2\). Therefore, \(G(\phi , {\tilde{r}}; s_2)\) has to be approximated by an empirical estimate, and its corresponding optimizer w.r.t. \({\tilde{r}}\) is no longer equal to r. In the next section, we will give a practical implementation of Algorithm 1 and discuss the role of \({\tilde{r}}^*\) when \(G(\phi , {\tilde{r}}; s_2)\) is replaced by an empirical estimate of it.

In Algorithm 1, we use the f-GAN framework to minimize \(H_{s_2}(q_1^{(\phi )},q_2)\) with respect to \(T_\phi \in \mathcal {T}\). We can also apply this f-GAN framework to minimizing other choices of f-divergence such as KL divergence, squared Hellinger distance and weighted Jensen–Shannon divergence. However, these choices of f-divergence are less efficient than the weighted harmonic divergence \(H_{s_2}(q_1^{(\phi )},q_2)\) if our goal is to improve the efficiency of \({\hat{r}}_{opt}^{(\phi )}\), as we can show that minimizing these choices of f-divergence between \(q_1^{(\phi )}\) and \(q_2\) can be viewed as minimizing some upper bounds of the first-order approximation of \({\textit{RE}}^2({\hat{r}}^{(\phi )}_{opt})\) (see “Appendix E”).

4.2 Implementation and numerical stability

In this section, we give a practical implementation of the idealized Algorithm 1 based on an alternative objective function. We first describe the practical version of Algorithm 1 in Sect. 4.2.1, then justify the choice of this alternative objective in Sect. 4.2.2.

4.2.1 A practical implementation of Algorithm 1

In this paper, we parameterize \(q_1^{(\phi )}\) as a Normalizing flow. In particular, we parameterize \(q_1^{(\phi )}\) as a Real-NVP (Dinh et al. 2016) with base density \(q_1\) and a smooth, invertible transformation \(T_\phi \), where \(T_\phi \) is parameterized by a real vector \(\phi \in \mathbb {R}^l\). See Sect. 2.1 for a brief description of Real-NVP. Given samples \(\{\omega _{ij}\}_{j=1}^{n_i} \sim q_i\) for \(i=1,2\), define

$$\begin{aligned}&{\hat{G}}(\phi , {\tilde{r}}; \pi , \{\omega _{ij}\}_{j=1}^{n_i}) = 1 - \frac{1}{\pi n_1} \sum _{j=1}^{n_1}\nonumber \\&\quad \left( \frac{\pi \tilde{q}_2(T_\phi (\omega _{1j})) {\tilde{r}}}{(1-\pi ) \tilde{q}_1^{(\phi )}(T_\phi (\omega _{1j})) + \pi \tilde{q}_2(T_\phi (\omega _{1j})) {\tilde{r}}} \right) ^2 \nonumber \\&\quad - \frac{1}{(1-\pi )n_2} \sum _{j=1}^{n_2} \left( \frac{(1-\pi ) \tilde{q}_1^{(\phi )}(\omega _{2j})}{(1-\pi ){\tilde{q}}_1^{(\phi )}(\omega _{2j}) + \pi {\tilde{q}}_2(\omega _{2j}) {\tilde{r}} }\right) ^2 \end{aligned}$$
(25)

to be the empirical estimate of \(G(\phi , {\tilde{r}}; \pi )\) in (24). Unlike Algorithm 1, we do not aim to solve \(\min _{\phi \in \mathbb {R}^l} \max _{{\tilde{r}} \in \mathbb {R}^+} {\hat{G}}(\phi , \tilde{r}; \pi , \{\omega _{ij}\}_{j=1}^{n_i})\) directly. Instead, we define our objective function as

$$\begin{aligned}&L_{\lambda _1, \lambda _2}(\phi , {\tilde{r}}; \pi , \{\omega _{ij}\}_{j=1}^{n_i}) \nonumber \\&\quad = -\log (1-{\hat{G}}(\phi , {\tilde{r}}; \pi , \{\omega _{ij}\}_{j=1}^{n_i})) \nonumber \\&\qquad - \frac{\lambda _1}{n_1} \sum _{j=1}^{n_1} \left( \log {\tilde{q}}_2 (T_\phi (\omega _{1j})) - \log {\tilde{q}}_1^{(\phi )}(T_\phi (\omega _{1j}))\right) \nonumber \\&\qquad - \frac{\lambda _2}{n_2} \sum _{j=1}^{n_2} \log \tilde{q}_1^{(\phi )}(\omega _{2j}), \end{aligned}$$
(26)

where \(\lambda _1, \lambda _2\ge 0\) are two hyperparameters. We first give Algorithm 2, a practical implementation of Algorithm 1, then justify the choice of the objective function (26) in the following section. See “Appendix F” for implementation details.

Algorithm 2 (practical f-GAN-Bridge estimator): alternately update \(\phi \) and \({\tilde{r}}\) to solve \(\min _{\phi } \max _{{\tilde{r}}} L_{\lambda _1, \lambda _2}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\) on the training samples, then use the final \({\tilde{r}}\) to initialize the iterative procedure (4) and compute \({\hat{r}}'^{(\phi _t)}_{opt}\) and \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) on a separate set of estimating samples \(\{\omega '_{ij}\}_{j=1}^{n'_i}\). (Pseudocode displayed as a figure in the original.)
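To make the alternating updates concrete, below is a self-contained toy sketch of Algorithm 2 (entirely our own and not the authors' implementation): it replaces the Real-NVP by a deliberately simple elementwise affine map \(T_\phi (\omega ) = e^{s}\odot \omega + b\), uses toy Gaussian targets, and fixes illustrative optimizer settings. It shows the stable log-scale evaluation of \(-\log (1-{\hat{G}})\), the two likelihood terms in (26), and the inner/outer gradient steps; in the full algorithm the final \({\tilde{r}}\) would only be used to initialize (4) on a separate set of estimating samples.

```python
import math
import torch

torch.manual_seed(0)
d, n1, n2 = 2, 1000, 1000
pi = n2 / (n1 + n2)                       # pi = s_2
lam1, lam2 = 0.05, 0.05                   # within the (1e-2, 1e-1) range used in the paper

# toy unnormalized densities standing in for the user's q1_tilde, q2_tilde
log_q1_tilde = lambda w: -0.5 * (w ** 2).sum(dim=1)                   # N(0, I)
log_q2_tilde = lambda w: -0.5 * (((w - 3.0) / 0.5) ** 2).sum(dim=1)   # N(3, 0.25 I)
w1 = torch.randn(n1, d)                   # training samples from q1
w2 = 3.0 + 0.5 * torch.randn(n2, d)       # training samples from q2

# a toy stand-in for the Real-NVP: elementwise affine T_phi(w) = exp(s)*w + b
s = torch.zeros(d, requires_grad=True)
b = torch.zeros(d, requires_grad=True)
log_r = torch.zeros((), requires_grad=True)

def hybrid_loss():
    # transformed samples and log q1_tilde^(phi), cf. (5)-(6)
    y1 = torch.exp(s) * w1 + b
    log_q1T_at_y1 = log_q1_tilde(w1) - s.sum()                        # at T_phi(w1)
    log_q1T_at_w2 = log_q1_tilde((w2 - b) * torch.exp(-s)) - s.sum()  # at w2
    log_q2_at_y1, log_q2_at_w2 = log_q2_tilde(y1), log_q2_tilde(w2)
    # the two squared ratios in (25), kept on the log scale
    la1, lb1 = math.log(1 - pi) + log_q1T_at_y1, math.log(pi) + log_q2_at_y1 + log_r
    la2, lb2 = math.log(1 - pi) + log_q1T_at_w2, math.log(pi) + log_q2_at_w2 + log_r
    log_terms = torch.cat([2 * (lb1 - torch.logaddexp(la1, lb1)) - math.log(pi * n1),
                           2 * (la2 - torch.logaddexp(la2, lb2)) - math.log((1 - pi) * n2)])
    neg_log_one_minus_G = -torch.logsumexp(log_terms, dim=0)          # = -log(1 - G_hat)
    # the two "likelihood" terms in (26)
    ll1 = (log_q2_at_y1 - log_q1T_at_y1).mean()
    ll2 = log_q1T_at_w2.mean()
    return neg_log_one_minus_G - lam1 * ll1 - lam2 * ll2

opt_phi = torch.optim.Adam([s, b], lr=1e-2)
opt_r = torch.optim.Adam([log_r], lr=1e-1)
for t in range(3000):
    opt_r.zero_grad()
    (-hybrid_loss()).backward()           # inner step: ascend the objective in log r
    opt_r.step()
    opt_phi.zero_grad()
    hybrid_loss().backward()              # outer step: descend the objective in phi
    opt_phi.step()

# In Algorithm 2 the converged log_r only initializes the iterative procedure (4),
# which is then run with q1_tilde^(phi) and q2_tilde on separate estimating samples.
print(log_r.item())                       # rough training-sample estimate of log r
```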

In Algorithm 2, most of the computational cost is spent on estimating \(q_1^{(\phi )}\). Since we parameterize \(q_1^{(\phi )}\) as a Real-NVP in this paper, we leverage the GPU computing framework for neural networks. In particular, we implement Algorithm 2 using PyTorch (Paszke et al. 2017) and CUDA (NVIDIA et al. 2020). As a result, most of the computation of Algorithm 2 is parallelized and carried out on the GPU. This greatly accelerates the training process in Algorithm 2. We will further compare the computational cost of Algorithm 2 to existing improvement strategies for Bridge sampling (Meng and Schilling 2002; Jia and Seljak 2020; Wang et al. 2020) in Sects. 5 and 6.

4.2.2 Choosing the objective function

Note that the original empirical estimate \({\hat{G}}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\) can be extremely close to 1 when \(q_1^{(\phi )}\) and \(q_2\) share little overlap. In order to improve its numerical stability, we first transform \({\hat{G}}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\) to log scale using a monotonic function \(h(x)=-\log (1-x)\), then apply the log-sum-exp trick on the transformed \(-\log (1-{\hat{G}}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i}))\). Since h(x) is monotonically increasing on \((-\infty ,1)\), applying this transformation does not change the optimizers of \({\hat{G}}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\).

In addition, GAN-type models can be difficult to train in practice (Arjovsky and Bottou 2017). Grover et al. (2018) suggest that one can stabilize the adversarial training process of GAN-type models by incorporating a log likelihood term into the original objective function when the generative model \(q_1^{(\phi )}\) is a Normalizing flow. Since both \({\tilde{q}}_1^{(\phi )}\) and \({\tilde{q}}_2\) are computationally tractable in our setup, we are able to extend this idea and stabilize the alternating training process by incorporating two “likelihood” terms that are asymptotically equivalent to \(\lambda _1 {\textit{KL}}(q_1^{(\phi )}, q_2)\) and \(\lambda _2 {\textit{KL}}(q_2,q_1^{(\phi )})\) up to additive constants into the transformed f-GAN objective \(-\log (1-{\hat{G}}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i}))\). Our proposed objective function \(L_{\lambda _1, \lambda _2}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\) is then a weighted combination of \(-\log (1-{\hat{G}}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i}))\) and the two “likelihood” terms, where the hyperparameters \(\lambda _1, \lambda _2 \ge 0\) control the contribution of the “likelihood” terms.

Similar to Algorithm 1, let \((\phi ^*_L,\tilde{r}^*_L)\) be a solution of the min-max problem \(\min _{\phi \in \mathbb {R}^l} \max _{{\tilde{r}} \in \mathbb {R}^+} L_{\lambda _1, \lambda _2}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\). Note that regardless of the choice of \(\lambda _1, \lambda _2\), the scalar parameter \({\tilde{r}}\) only depends on \(L_{\lambda _1, \lambda _2}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\) through \({\hat{G}}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\). Therefore by Proposition 2, if \({\tilde{r}}^*_L\) is a stationary point of \(L_{\lambda _1, \lambda _2}(\phi ^*_L, {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\) w.r.t. \({\tilde{r}} \in \mathbb {R}^+\), then \({\tilde{r}}^*_L\) can be viewed as a Bridge estimator of r based on the transformed \({\tilde{q}}_1^{(\phi ^*_L)}\) and the original \({\tilde{q}}_2\) with a specific choice of the free function \(\alpha (\omega )\). However, \({\tilde{r}}^*_L\) is sub-optimal since the free function \(\alpha (\omega )\) it uses is different from the optimal \(\alpha _{opt}(\omega )\) in (2). This means \({\tilde{r}}^*_L\) will have a greater asymptotic error than the asymptotically optimal Bridge estimator. In addition, \({\tilde{r}}^*_L\) suffers from an adaptive bias (Wang et al. 2020). Such bias arises from the fact that the estimated transformed density \(q_1^{(\phi _t)}\) in Algorithm 2 is chosen based on the training samples \(\{\omega _{ij}\}_{j=1}^{n_i}\) for \(i=1,2\). This means the density of the distribution of the transformed training samples \(\{T_{\phi _t}(\omega _{1j})\}_{j=1}^{n_1}\) is no longer proportional to \(\tilde{q}_1^{(\phi _t)}(T_{\phi _t}(\omega _{1j}))\) for \(j=1,\ldots ,n_1\), as \(\phi _t\) can be viewed as a function of \(\{\omega _{ij}\}_{j=1}^{n_i}\) (see “Appendix F” for more discussions). Hence, we do not use \({\tilde{r}}^*_L\) as our final estimator of r. Instead, once we have obtained \({\tilde{r}}^*_L\), we use it as a sensible initial value of the iterative procedure in (4), and compute the asymptotically optimal Bridge estimator \({\hat{r}}'^{(\phi ^*_L)}_{opt}\) using a separate set of estimating samples \(\{\omega '_{ij}\}_{j=1}^{n'_i}\), \(i=1,2\). The resulting \({\hat{r}}'^{(\phi ^*_L)}_{opt}\) does not suffer from the adaptive bias as the estimating samples are independent of the estimated \(q_1^{(\phi _t)}\). When \(n'_i = n_i\) for \(i=1,2\), \({\hat{r}}'^{(\phi ^*_L)}_{opt}\) is also statistically more efficient than \({\tilde{r}}^*_L\).

On the other hand, if \(\phi _L^*\) is a minimizer of \(L_{\lambda _1, \lambda _2}(\phi ,{\tilde{r}}^*_L; s_2,\) \(\{\omega _{ij}\}_{j=1}^{n_i})\) with respect to \(\phi \), then it asymptotically minimizes a mixture of \(-\log (1-H_{s_2}(q_1^{(\phi )},q_2))\), \({\textit{KL}}(q_1^{(\phi )}, q_2)\) and \({\textit{KL}}(q_2,q_1^{(\phi )})\). Recall that as \(n_1,n_2 \rightarrow \infty \), the additional log likelihood terms in (26) are asymptotically equivalent to \(\lambda _1 {\textit{KL}}(q_1^{(\phi )}, q_2)\) and \(\lambda _2 {\textit{KL}}(q_2,q_1^{(\phi )})\) up to additive constants. We have demonstrated that minimizing \(-\log (1-H_{s_2}(q_1^{(\phi )},q_2))\) with respect to \(\phi \) is equivalent to minimizing the first-order approximation of \({\textit{RE}}^2({\hat{r}}^{(\phi )}_{opt})\) under the i.i.d. assumption. We can also show that minimizing \({\textit{KL}}(q_1^{(\phi )}, q_2)\) or \({\textit{KL}}(q_2,q_1^{(\phi )})\) corresponds to minimizing an upper bound of the first-order approximation of \({\textit{RE}}^2({\hat{r}}^{(\phi )}_{opt})\) w.r.t. \(\phi \) under the same assumption (see “Appendix E”). Note that when \(\lambda _1, \lambda _2 \ne 0\), Proposition 3 no longer holds for this hybrid objective asymptotically, i.e. \(T_{\phi ^*_L}\) no longer asymptotically minimizes the first-order approximation of \({\textit{RE}}^2({\hat{r}}^{(\phi )}_{opt})\) w.r.t. \(T_\phi \). However, we find Algorithm 2 with the hybrid objective works well in the numerical examples in Sects. 5 and 6 for any value of \(\lambda _1, \lambda _2 \in (10^{-2},10^{-1})\). We want to keep \(\lambda _1, \lambda _2\) small since we do not want the log likelihood terms to dominate \({\hat{G}}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\) in the hybrid objective \(L_{\lambda _1, \lambda _2}(\phi , {\tilde{r}}; s_2, \{\omega _{ij}\}_{j=1}^{n_i})\). In addition, we would like to stress that even though the final \(\phi _t\) in Algorithm 2 does not asymptotically minimize the first-order approximation of \({\textit{RE}}^2({\hat{r}}^{(\phi )}_{opt})\) w.r.t. \(\phi \) when \(\lambda _1,\lambda _2>0\), \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) in Algorithm 2 is still a consistent estimator of the first-order approximation of \({\textit{RE}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) by Proposition 1 and the fact that \({\hat{r}}'^{(\phi _t)}_{opt}\) is the asymptotically optimal Bridge estimator based on the transformed \(q_1^{(\phi _t)}\) and the original \(q_2\).

5 Example 1: mixture of rings

We first demonstrate the effectiveness of the f-GAN-Bridge estimator and Algorithm 2 using a simulated example. Since this paper focuses on improving the original Bridge estimator (Meng and Wong 1996) rather than giving a new estimator of the normalizing constant or the ratio of normalizing constants, we will focus on comparing the performance of the proposed f-GAN-Bridge estimator to existing improvement strategies for Bridge sampling (Meng and Schilling 2002; Wang et al. 2020; Jia and Seljak 2020) in this and the following section. We do not include other classes of methods such as path sampling (Gelman and Meng 1998; Lartillot and Philippe 2006), nested sampling (Skilling 2006) and variational approaches (Ranganath et al. 2014) in the examples. An empirical study (Fourment et al. 2020) finds that Bridge sampling is competitive with a wide range of methods, including those mentioned above, in the context of phylogenetics.

In this example, we set \(q_1\), \(q_2\) to be mixtures of ring-shaped distributions, and we would like to estimate the ratio of their normalizing constants. We choose this example because such a mixture has a multimodal structure, and its normalizing constant is available in closed form. Let \({\varvec{x}} \in \mathbb {R}^2\). In order to define the pdf of \(q_1,q_2\) for this example, we first define the pdf of a 2-d ring distribution as

$$\begin{aligned}&R({\varvec{x}}; \varvec{\mu },b,\sigma ) = \frac{1}{ \sqrt{2\pi ^3\sigma ^2}\Phi (b/\sigma )}\nonumber \\&\quad \exp \left( -\frac{(\Vert {\varvec{x}}-\varvec{\mu }\Vert ^2_2-b)^2}{2\sigma ^2}\right) ; \quad \varvec{\mu } \in \mathbb {R}^2, \ b,\sigma >0 \end{aligned}$$
(27)

where \(\Phi (\cdot )\) is the standard Normal CDF and \(\varvec{\mu }, b, \sigma \) control the location, radius and thickness of the ring, respectively. Let \({\tilde{R}}({\varvec{x}}; \varvec{\mu },b,\sigma ) = \exp \left( -\frac{(\Vert {\varvec{x}}-\varvec{\mu }\Vert ^2_2-b)^2}{2\sigma ^2}\right) \) be the corresponding unnormalized density. Let \(\varvec{\omega } \in \mathbb {R}^p\) where p is an even integer. For \(i=1,2\), let the unnormalized density \({\tilde{q}}_i\) be

$$\begin{aligned} {\tilde{q}}_i(\varvec{\omega }; \varvec{\mu }_{i1},\varvec{\mu }_{i2}, b_{i}, \sigma _{i})&= \prod _{j=1}^{p/2} \left( \frac{1}{2} {\tilde{R}}\left( \{\omega _{2j-1}, \omega _{2j}\}; \varvec{\mu }_{i1},b_i,\sigma _i\right) \right. \nonumber \\&\quad +\left. \frac{1}{2} {\tilde{R}}\left( \{\omega _{2j-1}, \omega _{2j}\}; \varvec{\mu }_{i2},b_i,\sigma _i\right) \right) \end{aligned}$$
(28)

where \(\omega _j\) is the jth entry of \(\varvec{\omega }\). This means that for \(i=1,2\), if \(\varvec{\omega } \sim q_i\), then the consecutive pairs \((\omega _{2j-1}, \omega _{2j})\), \(j=1,\ldots ,p/2\), are independent and identically distributed, each following an equally weighted mixture of two 2-d ring distributions with different location parameters \(\varvec{\mu }_{i1}, \varvec{\mu }_{i2}\) and the same radius and thickness parameters \(b_i, \sigma _i\). It is straightforward to verify that \(Z_i\), the normalizing constant of \({\tilde{q}}_i\), is \(\left( \sqrt{2\pi ^3\sigma _i^2}\Phi (b_i/\sigma _i)\right) ^{p/2}\). In this example, we consider dimensions \(p \in \{12,18,24,30,36,42,48\}\), and set \(\varvec{\mu }_{11} = (2,2), \varvec{\mu }_{12} = (-2, -2), \varvec{\mu }_{21}=(3,-3), \varvec{\mu }_{22}=(-3,3)\), \(b_1 =3 , b_2=6, \sigma _1=1, \sigma _2=2\).
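
For concreteness, the following minimal sketch (illustrative Python, not part of the original experiments) evaluates the unnormalized density in (28) and the closed-form \(\log Z_i\) used as ground truth in this example.

```python
import numpy as np
from scipy.stats import norm

def ring_unnorm(x, mu, b, sigma):
    """Unnormalized 2-d ring density R~(x; mu, b, sigma); x has shape (..., 2), as in Eq. (27)."""
    sq_dist = np.sum((x - mu) ** 2, axis=-1)
    return np.exp(-(sq_dist - b) ** 2 / (2 * sigma ** 2))

def q_tilde(omega, mu1, mu2, b, sigma):
    """Unnormalized density q~_i(omega) in Eq. (28) for omega of even dimension p."""
    pairs = omega.reshape(-1, 2)                      # (p/2, 2) blocks of consecutive coordinates
    vals = 0.5 * ring_unnorm(pairs, mu1, b, sigma) + 0.5 * ring_unnorm(pairs, mu2, b, sigma)
    return np.prod(vals)

def log_Z(p, b, sigma):
    """Closed-form log normalizing constant: (p/2) * log( sqrt(2*pi^3*sigma^2) * Phi(b/sigma) )."""
    return (p / 2) * (0.5 * np.log(2 * np.pi ** 3 * sigma ** 2) + norm.logcdf(b / sigma))

# Ground-truth log ratio for p = 12 with the parameter values given above.
p = 12
log_r = log_Z(p, b=3.0, sigma=1.0) - log_Z(p, b=6.0, sigma=2.0)
print(log_r)
print(q_tilde(np.zeros(p), mu1=np.array([2., 2.]), mu2=np.array([-2., -2.]), b=3.0, sigma=1.0))
```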

In this example, we estimate \( \log r = \log Z_1 - \log Z_2\) using the f-GAN-Bridge estimator (f-GB, Algorithm 2), the Warp-III Bridge estimator (Meng and Schilling 2002), the Warp-U Bridge estimator (Wang et al. 2020) and Gaussianized Bridge sampling (GBS) (Jia and Seljak 2020). We fix \(N_i\), the number of samples from \(q_i\), to be 2000 for \(i=1,2\), and compare the performance of these methods as we increase the dimension p. For each value of p, we run each method 100 times. For Algorithm 2, we set \(\lambda _1, \lambda _2 = 0.05\), and \({\tilde{q}}^{(\phi )}_{1}\) to be a Real-NVP with 4 coupling layers. For Warp-III and GBS, we use the recommended or default settings. For Warp-U, we adopt the cross-splitting strategy suggested by the authors: we first estimate the Warp-U transformation using the first half of the samples as the training set, and compute the Warp-U Bridge estimator using the second half as the estimating set. We then swap the roles of the training and estimating sets to compute another Warp-U Bridge estimator, and take the average of the two as the final output. This idea has also been discussed in Wong et al. (2020). Let \({\hat{r}}\) be a generic estimator of r. For each method and each value of p, we compute an MC estimate of the MSE of \(\log {\hat{r}}\) based on the results of the repeated runs, and use it as the performance benchmark. From Fig. 1, we see f-GB outperforms all other methods for all choices of p. We also include a scatter plot of the first two dimensions of samples from \(q_1,q_2\) and the transformed \(q_1^{(\phi _t)}\) when \(p=48\), where \(q_1^{(\phi _t)}\) is estimated using Algorithm 2 with \(n_i=n'_i=N_i/2\) for \(i=1,2\). We see the transformed \(q_1^{(\phi _t)}\) captures the structure of \(q_2\) accurately, and the two share much greater overlap than the original \(q_1,q_2\).

Fig. 1
figure 1

Left: MC estimates of MSE of \(\log {\hat{r}}\) for each method. Vertical segments are \(2\sigma \) error bars. Note that the y-axis is on a log scale. Right: scatter plot of the first two dimensions of samples from \(q_1,q_2\) and \(q_1^{(\phi _t)}\) when \(p=48\). \(q_1^{(\phi _t)}\) is obtained from Algorithm 2 with \(n_i=n'_i=1000\) for \(i=1,2\)
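
All of the methods compared here are built on the same core: the iterative, asymptotically optimal Bridge estimator of Meng and Wong (1996), applied to a (possibly transformed) pair of unnormalized densities and the corresponding samples. For reference, the following is a minimal log-space sketch of that iteration (our illustration, not the implementation used for the experiments); the transformation-based methods simply feed in the transformed densities and samples.

```python
import numpy as np
from scipy.special import logsumexp

def optimal_bridge_log_r(log_l1, log_l2, n_iter=50):
    """Iterative optimal Bridge estimator of log r = log(Z1/Z2) (Meng and Wong 1996).

    log_l1: log q1~(w) - log q2~(w) evaluated at samples w ~ q1  (length n1)
    log_l2: the same log ratio evaluated at samples w ~ q2       (length n2)
    """
    n1, n2 = len(log_l1), len(log_l2)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    log_r = 0.0
    for _ in range(n_iter):
        # numerator:  (1/n2) * sum_j l_2j / (s1 * l_2j + s2 * r)
        log_num = logsumexp(log_l2 - np.logaddexp(np.log(s1) + log_l2, np.log(s2) + log_r)) - np.log(n2)
        # denominator: (1/n1) * sum_j 1 / (s1 * l_1j + s2 * r)
        log_den = logsumexp(-np.logaddexp(np.log(s1) + log_l1, np.log(s2) + log_r)) - np.log(n1)
        log_r = log_num - log_den
    return log_r

# Toy check with known answer: q1 = N(0, 1), q2 = N(1, 2), so log r = -0.5 * log(2).
rng = np.random.default_rng(1)
x1 = rng.normal(0.0, 1.0, 5000)
x2 = rng.normal(1.0, np.sqrt(2.0), 5000)
log_q1t = lambda x: -0.5 * x ** 2              # unnormalized N(0, 1)
log_q2t = lambda x: -0.25 * (x - 1.0) ** 2     # unnormalized N(1, 2)
print(optimal_bridge_log_r(log_q1t(x1) - log_q2t(x1),
                           log_q1t(x2) - log_q2t(x2)))   # close to -0.347
```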

We now compare the computational cost of these methods. Recall that our Algorithm 2 utilizes GPU acceleration. Because of the difference between GPU and CPU computing, it is not straightforward to compare the computational cost of Algorithm 2 with GBS, Warp-III and Warp-U, which are CPU based, using benchmarks such as CPU seconds or the number of function calls. We simply report the average running time for each method on our machine in Fig. 2. Similar to Wang et al. (2020), we also report the average “precision per second”, which is the reciprocal of the product of the running time and the estimated MSE of \(\log {\hat{r}}\), for each method (higher precision per second means better efficiency). We see that for all methods, the computation time is approximately a linear function of the dimension p. Even though f-GB takes roughly twice as long to run as GBS and \(30 \sim 40\) times as long as Warp-III, it achieves the highest precision per second for all dimensions p we consider. In addition, we also run further simulations with larger sample sizes. We find that when \(p=48\), Warp-U needs around \(N_1=N_2 =7500\) samples to reach a similar level of precision as f-GB based on \(N_1=N_2=2000\) samples. In this case, Warp-U takes around \(3 \sim 4\) times as long to run as f-GB. For Warp-III and GBS, we further increase the sample size to \(N_1=N_2=5\times 10^4\), but find that their performance is still worse than f-GB and Warp-U, and both take more than three times as long to run. For Warp-III and Warp-U, it is not obvious how they would benefit from GPU computation. Although GBS may benefit from GPU acceleration in principle, it would require careful implementation and optimization. Therefore, we compare our Algorithm 2 to these methods based on their publicly available implementations.
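
For reference, “precision per second” is computed as the reciprocal of the product of the running time and the estimated MSE of \(\log {\hat{r}}\); a one-line sketch with purely illustrative numbers (not our measured values) is given below.

```python
import numpy as np

# Illustrative values only, not the measurements reported in Fig. 2.
run_time_sec = np.array([12.0, 30.0, 410.0])   # hypothetical running times for three methods
mse_log_r    = np.array([3e-2, 8e-3, 1e-5])    # hypothetical MC estimates of MSE(log r_hat)
precision_per_second = 1.0 / (run_time_sec * mse_log_r)
print(precision_per_second)                    # higher is better
```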

Fig. 2
figure 2

Left: Averaged running time for each method. Right: Averaged precision per second (i.e. reciprocal of the product of running time and the estimated MSE of \(\log {\hat{r}}\)) for each method

Recall that \({\textit{MSE}}(\log {\hat{r}}_{opt})\) is asymptotically equivalent to \({\textit{RE}}^2({\hat{r}}_{opt})\) (Meng and Wong 1996). Therefore, \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) returned from Algorithm 2 can also be viewed as an estimate of \({\textit{MSE}}(\log {\hat{r}}'^{(\phi _t)}_{opt})\). In order to assess its accuracy, we compare it with both the error estimator of Frühwirth-Schnatter (2004) (hereafter F-S) and a direct MC estimator of \({\textit{MSE}}(\log {\hat{r}}'^{(\phi _t)}_{opt})\): for each value of p, we first run Algorithm 2 with \(N_1=N_2=2000\) samples as before (i.e. we set \(n_i=n'_i=1000\) for \(i=1,2\)). We then fix the transformed density \({\tilde{q}}^{(\phi _t)}_{1}\) obtained from Algorithm 2, repeatedly draw \(n'_1 = n'_2 = 1000\) independent samples from \({\tilde{q}}^{(\phi _t)}_{1},q_2\) and record \({\hat{r}}'^{(\phi _t)}_{opt}\), \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) and the F-S error estimate based on these new samples. We repeat this process 100 times, and report box plots of \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) and the F-S error estimates over the repeated runs. We also compare the results with the direct MC estimate of \({\textit{MSE}}(\log {\hat{r}}'^{(\phi _t)}_{opt})\) based on the repeated estimates \(\log {\hat{r}}'^{(\phi _t)}_{opt}\) and the ground truth \(\log r\). Note that here we fix the transformed \({\tilde{q}}^{(\phi _t)}_{1}\) and only repeat the Bridge sampling step in Algorithm 2. We summarize the results in Fig. 3. We see that \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) returned from Algorithm 2 agrees with the F-S error estimator, and provides a sensible estimate of \({\textit{MSE}}(\log {\hat{r}}'^{(\phi _t)}_{opt})\) for all choices of p.
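
The benchmarking loop just described can be summarized in a few lines (an illustration with placeholder names, not the exact experiment script): with the transformed density held fixed, each repetition redraws the samples and recomputes the Bridge estimate and its \({\widehat{{\textit{RE}}}}^2\), and the collection of estimates is then compared with a direct MC estimate of \({\textit{MSE}}(\log {\hat{r}})\) obtained from the known \(\log r\).

```python
import numpy as np

def benchmark_error_estimates(draw_and_estimate, log_r_true, n_rep=100):
    """draw_and_estimate() should redraw n1', n2' samples (with the transformed
    density fixed) and return (log_r_hat, re2_hat) for one repetition."""
    log_r_hats, re2_hats = zip(*(draw_and_estimate() for _ in range(n_rep)))
    mc_mse = np.mean((np.array(log_r_hats) - log_r_true) ** 2)  # direct MC estimate of MSE(log r_hat)
    return mc_mse, np.mean(re2_hats)

# Toy check: unbiased noisy estimates of log r = 1 with variance 0.01.
rng = np.random.default_rng(0)
toy = lambda: (rng.normal(1.0, 0.1), 0.01)
print(benchmark_error_estimates(toy, log_r_true=1.0))   # both numbers should be close to 0.01
```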

Fig. 3
figure 3

Box plots of 100 repetitions of \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) based on Algorithm 2 and the error estimator of Frühwirth-Schnatter (2004) (F-S) for each dimension p. Blue vertical segments are the \(2\sigma \) error bars of the corresponding MC estimates of \({\textit{MSE}}(\log {\hat{r}}'^{(\phi _t)}_{opt})\) based on 100 repetitions. (Color figure online)

6 Example 2: comparing two Bayesian GLMMs

In this section, we demonstrate the effectiveness of the f-GAN-Bridge estimator and Algorithm 2 by considering a Bayesian model comparison problem based on the six cities dataset (Fitzmaurice and Laird 1993), where \(q_1,q_2\) are the posterior densities of the parameters of two Bayesian GLMMs \(M_1,M_2\). This example is adapted from Overstall and Forster (2010). We choose this example because it is based on a real-world dataset, and the posteriors \(q_1,q_2\) are relatively high dimensional and are defined on disjoint supports with different dimensions.

The six cities dataset consists of the wheezing status \(y_{ij}\) (1 = wheezing, 0 otherwise) of child i at time j for \(i=1,\ldots ,n\), \(n=537\) and \(j=1,\ldots ,4\). It also includes \(x_{ij}\), the smoking status (1 = smoking, 0 otherwise) of the i-th child’s mother at time j, as a covariate. We compare two mixed effects logistic regression models \(M_1,M_2\) with different linear predictors. Define

$$\begin{aligned} M_1: \eta _{ij}^{(1)}&= \beta _0 + u_i; \quad u_i {\mathop {\sim }\limits ^{i.i.d.}} N(0,\sigma ^2) \end{aligned}$$
(29)
$$\begin{aligned} M_2: \eta _{ij}^{(2)}&= \beta _0 + \beta _1x_{ij} + u_i; \quad u_i{\mathop {\sim }\limits ^{i.i.d.}} N(0,\sigma ^2) \end{aligned}$$
(30)

where \(\beta _0,\beta _1\) are regression parameters, \(u_i\) is the random effect of the i-th child and \(\sigma ^2\) controls the variance of the random effects. We use the default priors given by Overstall and Forster (2010) for both models, i.e. we take \(\beta _0 \sim N(0, 4)\), \(\sigma ^{-2} \sim \Gamma (0.5,0.5)\) for \(M_1\), and \((\beta _0,\beta _1)\sim N(0, 4n({\varvec{X}}^T {\varvec{X}})^{-1})\), \(\sigma ^{-2} \sim \Gamma (0.5,0.5)\) for \(M_2\), where \({\varvec{X}} = [{\varvec{1}}_{4n}^T, ({\varvec{x}}_{1},\ldots ,{\varvec{x}}_{n})^T]\) and \({\varvec{x}}_{i} = (x_{i1},\ldots ,x_{i4})\) for \(i=1,\ldots ,n\).

Let \({\varvec{y}} = ({\varvec{y}}_{1},\ldots ,{\varvec{y}}_{n})\) with \({\varvec{y}}_{i} = (y_{i1},\ldots ,y_{i4})\), and let \({\varvec{u}} = (u_1,\ldots ,u_n)\) be the vector of random effects. Let \(q_1(\beta _0, {\varvec{u}}) = p( \beta _0, {\varvec{u}} \vert {\varvec{X}},{\varvec{y}}, M_1)\) be the marginal posterior of \((\beta _0, {\varvec{u}})\) under \(M_1\), and \({\tilde{q}}_1(\beta _0, {\varvec{u}})\) be the corresponding unnormalized density. Let \(q_2(\beta _0,\beta _1 ,{\varvec{u}})\), \({\tilde{q}}_2(\beta _0, \beta _1,{\varvec{u}})\) be defined in a similar fashion under \(M_2\). Samples from \(q_1,q_2\) are obtained using the MCMC package R2WinBUGS (Sturtz et al. 2005; Lunn et al. 2000). For \(k=1,2\), the normalizing constant \(Z_k\) of \({\tilde{q}}_k\) is the marginal likelihood under \(M_k\). We first generate \(2 \times 10^5\) MCMC samples from \(q_1,q_2\) and estimate \(\log Z_1, \log Z_2\) using the method described in Overstall and Forster (2010). The estimated log marginal likelihoods of \(M_1, M_2\) based on \(2 \times 10^5\) MCMC samples are \(-808.139\) and \(-809.818\), respectively. These results are consistent with the estimated log marginal likelihoods reported in Overstall and Forster (2010) based on \(5 \times 10^4\) MCMC samples, and we take them as the baseline “true values” of \(\log Z_1\) and \(\log Z_2\). See Overstall and Forster (2010) for R code and technical details.
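
To fix ideas, the following sketch shows one plausible way to evaluate \(\log {\tilde{q}}_2(\beta _0,\beta _1,{\varvec{u}})\) under the priors above, with \(\sigma ^2\) integrated out analytically (assuming the \(\Gamma (0.5,0.5)\) prior on \(\sigma ^{-2}\) is parameterized by shape and rate), so that the normalizing constant of \({\tilde{q}}_2\) is the marginal likelihood of \(M_2\). This is our reading of the model, given for illustration only; the actual computations in this example follow Overstall and Forster (2010) and their R code.

```python
import numpy as np
from scipy.special import gammaln

def log_q2_tilde(beta, u, y, x):
    """Log unnormalized marginal posterior of (beta0, beta1, u) under M2.

    beta = (beta0, beta1); y, x are n-by-4 arrays of responses and covariates.
    """
    n = y.shape[0]
    eta = beta[0] + beta[1] * x + u[:, None]              # linear predictors, n x 4
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))     # Bernoulli-logit log likelihood
    # N(0, 4n (X^T X)^{-1}) prior on (beta0, beta1), with X = [1, vec(x)]
    X = np.column_stack([np.ones(4 * n), x.reshape(-1)])
    prec = (X.T @ X) / (4 * n)                             # prior precision matrix
    _, logdet_prec = np.linalg.slogdet(prec)
    logprior_beta = 0.5 * logdet_prec - np.log(2 * np.pi) - 0.5 * beta @ prec @ beta
    # u with sigma^2 integrated out under a Gamma(shape=0.5, rate=0.5) prior on 1/sigma^2
    a = b = 0.5
    logprior_u = (-0.5 * n * np.log(2 * np.pi) + a * np.log(b) - gammaln(a)
                  + gammaln(a + 0.5 * n) - (a + 0.5 * n) * np.log(b + 0.5 * np.sum(u ** 2)))
    return loglik + logprior_beta + logprior_u

# Tiny synthetic check (not the six cities data): n = 5 children, 4 time points.
rng = np.random.default_rng(0)
n = 5
x_demo = rng.integers(0, 2, size=(n, 4)).astype(float)
y_demo = rng.integers(0, 2, size=(n, 4)).astype(float)
print(log_q2_tilde(np.array([0.1, -0.2]), rng.normal(size=n), y_demo, x_demo))
```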

Similar to the previous example, we use f-GB to estimate the log Bayes factor \( \log r= \log Z_1 - \log Z_2\) between \(M_1,M_2\). Note that \(q_1,q_2\) are defined on the disjoint supports \(\mathbb {R}^{n+1}, \mathbb {R}^{n+2}\), respectively. In order to apply our Algorithm 2 to this problem, we first augment \(q_1\) with a standard Normal coordinate to match the dimension of \(q_2\): let \(q_{1,{\textit{aug}}}(\beta _0, \gamma , {\varvec{u}}) = q_1(\beta _0, {\varvec{u}})N(\gamma ; 0,1)\) be the augmented density, where \(N(\cdot ; 0,1)\) is the standard Normal pdf, and let \({\tilde{q}}_{1,{\textit{aug}}}\) be the corresponding unnormalized augmented density. Note that \({\tilde{q}}_{1,{\textit{aug}}}\) and \({\tilde{q}}_1\) have the same normalizing constant \(Z_1\). We can then apply Algorithm 2 to \(q_{1,{\textit{aug}}}\) and \(q_2\), since they are now defined on the common support \(\mathbb {R}^{n+2}\). We can sample from \(q_{1,{\textit{aug}}}\) by simply concatenating a sample \((\beta _0, {\varvec{u}}) \sim q_1\) and a sample \(\gamma \sim N(0,1)\).
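
A minimal sketch of this augmentation step is given below (the function names and the ordering of the coordinates as \((\beta _0, \gamma , {\varvec{u}})\) are our illustrative assumptions).

```python
import numpy as np

def augment_samples(samples_q1, rng):
    """samples_q1: (N, n+1) array of (beta0, u) draws; returns (N, n+2) draws of (beta0, gamma, u)."""
    gamma = rng.normal(size=(samples_q1.shape[0], 1))     # independent N(0, 1) coordinate
    return np.concatenate([samples_q1[:, :1], gamma, samples_q1[:, 1:]], axis=1)

def log_q1_aug_tilde(log_q1_tilde, omega_aug):
    """log q1_aug~(beta0, gamma, u) = log q1~(beta0, u) + log N(gamma; 0, 1).

    The N(0, 1) factor integrates to one, so the normalizing constant Z1 is unchanged.
    """
    gamma = omega_aug[..., 1]
    rest = np.delete(omega_aug, 1, axis=-1)
    return log_q1_tilde(rest) - 0.5 * np.log(2 * np.pi) - 0.5 * gamma ** 2

aug = augment_samples(np.zeros((3, 4)), np.random.default_rng(0))   # toy (beta0, u) draws
print(aug.shape)   # (3, 5); gamma is the second column
```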

Let \(N_k\) be the number of MCMC samples drawn from \(q_k\) for \(k=1,2\). In this example, we compare the performance of the f-GAN-Bridge estimator with the Warp-III Bridge estimator and the Warp-U Bridge estimator as we increase the number of MCMC samples \(N_1,N_2\). We consider sample sizes \(N \in \{1000,2000,3000,4000,5000\}\). This is a challenging task since the sample size N is limited compared to the dimension of the problem (recall that \(q_1,q_2\) are defined on \(\mathbb {R}^{n+1}, \mathbb {R}^{n+2}\), respectively, with \(n=537\)). For each choice of N, we repeatedly draw \(N_1=N_2=N\) MCMC samples from \(q_1,q_2\), respectively, and estimate the MSE of \(\log {\hat{r}}\) for each method in the same way as in the previous example. For our Algorithm 2, we augment \(q_1\) as described above, set \(\lambda _1, \lambda _2 = 0.1\) and take \(q_{1,{\textit{aug}}}^{(\phi )}\) to be a Real-NVP with 10 coupling layers. For the Warp-U and Warp-III Bridge estimators, we again use the recommended or default settings. We do not include GBS in this example since we find that, for all values of N, it does not converge for most of the repetitions. From Fig. 4, we see our Algorithm 2 outperforms the Warp-III and Warp-U Bridge estimators for all sample sizes N. We also include a scatter plot of the first two dimensions of samples from \(q_{1,{\textit{aug}}},q_2\) and the transformed \( q^{(\phi _t)}_{1,{\textit{aug}}}\), where \(q^{(\phi _t)}_{1,{\textit{aug}}}\) is obtained from Algorithm 2 with \(N=3000\). We see \(q^{(\phi _t)}_{1,{\textit{aug}}}\) and \(q_2\) share much greater overlap than the original \(q_{1,{\textit{aug}}},q_2\). From Fig. 5, we see that for the same sample size N, the running time of f-GB is \(4 \sim 6\) times that of Warp-III, and roughly \(30\% \sim 40\%\) shorter than that of Warp-U. On the other hand, f-GB achieves the highest precision per second for all sample sizes N in this example. We further increase the sample size N, and find that Warp-U requires around \(10^4\) MCMC samples to reach a level of precision similar to that achieved by f-GB with \(N=5000\) samples, and takes around twice as long to run. Similarly, Warp-III requires around \(8\times 10^4\) samples to reach a similar level of precision, and takes around three times as long to run.

Fig. 4
figure 4

Left: MC estimates of MSE of \(\log {\hat{r}}\) for each method. Vertical segments are \(2\sigma \) error bars. Note that the y-axis is on a log scale. Warp-III does not converge for most of the repetitions when \(N=1000\). Right: scatter plot of the first two dimensions of samples from \(q_{1,{\textit{aug}}},q_2\) and \(q_{1,{\textit{aug}}}^{(\phi _t)}\), where \(q_{1,{\textit{aug}}}^{(\phi _t)}\) is obtained from Algorithm 2 with \(n_i=n'_i=1500\) for \(i=1,2\). The first two dimensions of \(q_{1,{\textit{aug}}}\) and \(q_2\) are \((\beta _0,\gamma ), (\beta _0,\beta _1)\), respectively

Fig. 5
figure 5

Left: Averaged running time for each method. Warp-III does not converge for most of the repetitions when \(N=1000\). Right: Averaged precision per second (i.e. reciprocal of the product of running time and the estimated MSE of \(\log {\hat{r}}\)) for each method

Fig. 6
figure 6

Box plots of 100 repetitions of \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) based on Algorithm 2 and the error estimator of Frühwirth-Schnatter (2004) (F-S) for each sample size N. Blue vertical segments are the \(2\sigma \) error bars of the corresponding MC estimates of \({\textit{MSE}}(\log {\hat{r}}'^{(\phi _t)}_{opt})\) based on 100 repetitions. (Color figure online)

For each choice of N, we also compare \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) returned from Algorithm 2 with the F-S error estimator (Frühwirth-Schnatter 2004) and a direct MC estimator of \({\textit{MSE}}(\log {\hat{r}}'^{(\phi _t)}_{opt})\) in the same way as in the previous example. We summarize the results in Fig. 6. In principle, it is not appropriate to use \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) as an estimate of \({\textit{MSE}}(\log {\hat{r}}'^{(\phi _t)}_{opt})\) in this example, as the MCMC samples are correlated. However, from Fig. 6 we see that it agrees with the F-S error estimator, which does take autocorrelation into account, and still provides sensible estimates of \({\textit{MSE}}(\log {\hat{r}}'^{(\phi _t)}_{opt})\) for all choices of N. This is likely because the autocorrelation in our MCMC samples is weak: for all N, the effective sample sizes for all dimensions of the MCMC samples from \(q_1,q_2\) are greater than 0.8N. When working with weakly correlated MCMC samples, we recommend that users compute both our \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) and the F-S error estimator and check whether they agree. When the MCMC samples are strongly correlated, we do not recommend using \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\) as the error estimate of \({\hat{r}}'^{(\phi _t)}_{opt}\).
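
As a practical diagnostic for the recommendation above, one can check the per-coordinate effective sample size (ESS) of the MCMC output before relying on \({\widehat{{\textit{RE}}}}^2({\hat{r}}'^{(\phi _t)}_{opt})\). The sketch below uses a simple truncated-autocorrelation ESS estimate together with the 0.8N rule of thumb from this example; dedicated MCMC software provides more robust ESS estimators, and the function names here are illustrative.

```python
import numpy as np

def ess_1d(x):
    """Crude ESS estimate: n / (1 + 2 * sum of autocorrelations), truncated at the first negative lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    tau = 1.0
    for k in range(1, n):
        if acf[k] < 0:
            break
        tau += 2.0 * acf[k]
    return n / tau

def weakly_correlated(samples, threshold=0.8):
    """samples: (N, d) array of MCMC draws; True if the minimum per-coordinate ESS exceeds threshold * N."""
    n = samples.shape[0]
    return min(ess_1d(samples[:, j]) for j in range(samples.shape[1])) > threshold * n

print(weakly_correlated(np.random.default_rng(0).normal(size=(2000, 3))))   # i.i.d. draws: typically True
```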

7 Conclusion

In this paper, we give a new estimator of \({\textit{RE}}^2({\hat{r}}_{opt})\) based on the variational lower bound of f-divergence proposed by Nguyen et al. (2010), discuss the connection between Bridge estimators and the problem of f-divergence estimation, and give a computational framework to improve the optimal Bridge estimator using an f-GAN (Nowozin et al. 2016). We show that under the i.i.d. assumption, our f-GAN-Bridge estimator is optimal in the sense that it asymptotically minimizes the first-order approximation of \({\textit{RE}}^2({\hat{r}}_{opt}^{(\phi )})\) with respect to the transformed density \(q_1^{(\phi )}\). In both simulated and real-world examples, our f-GB estimator provides accurate estimates of r and outperforms existing methods significantly. In addition, Algorithm 2 also provides accurate estimates of \({\textit{RE}}^2({\hat{r}}_{opt}^{(\phi )})\) and \({\textit{MSE}}(\log {\hat{r}}_{opt}^{(\phi )})\). In our experience, Algorithm 2 (f-GB) is computationally more demanding than existing methods: in the numerical examples, its running time is roughly 1 to 3 times that of existing methods such as Warp-U and GBS for the same sample sizes. We have not attempted to formalize the difference in computational cost because of the very different nature of GPU and CPU computing. Although in our examples it is possible for a competing method to match the performance of the f-GB estimator by increasing the number of samples drawn from \(q_1,q_2\), doing so takes longer, and can be inefficient or impractical when sampling from \(q_1,q_2\) is computationally expensive. This also means the f-GB estimator is especially appealing when only a limited number of samples from \(q_1,q_2\) is available. In summary, when \(q_1,q_2\) have a simple structure and are low dimensional, the extra computational cost required by f-GB may not be worthwhile. However, when \(q_1,q_2\) are high dimensional or have a complicated multi-modal structure, we recommend the more accurate f-GB estimator of r, given the key summary role r plays in many applications and publications.

7.1 Limitations and future work

One limitation of the f-GB estimator is its computational cost. In this paper, we parameterize \(q_1^{(\phi )}\) as a Normalizing flow. A possible direction for future work is to explore different parameterizations of \(q_1^{(\phi )}\). We expect that Algorithm 2 can be sped up by replacing the Normalizing flow with simpler transformations such as the Warp-I and Warp-II transformations (Meng and Schilling 2002), at the expense of flexibility. Another limitation is that Algorithm 1 is only optimal when the samples from \(q_1,q_2\) are i.i.d. Recall that \({\textit{RE}}^2({\hat{r}}_{opt})\) in (13) is derived under the i.i.d. assumption. Therefore, if the samples from \(q_1,q_2\) are correlated, then Proposition 3 no longer holds, and minimizing \(H_{s_2}(q_1^{(\phi )},q_2)\) with respect to \(q_1^{(\phi )}\) is no longer equivalent to minimizing the first-order approximation of \({\textit{RE}}^2({\hat{r}}^{(\phi )}_{opt})\). It is therefore of interest to develop an algorithm that minimizes the first-order approximation of \({\textit{RE}}^2({\hat{r}}^{(\phi )}_{opt})\) when the samples are correlated. In addition, our approach only focuses on estimating the ratio of normalizing constants between two densities. When we have multiple unnormalized densities and would like to estimate the ratios between their normalizing constants, our approach needs to estimate these quantities separately in a pairwise fashion, which can be inefficient. Meng and Schilling (1996) and Geyer (1994) show that one can estimate multiple normalizing constants simultaneously up to a common multiplicative constant. We are also interested in extending our improvement strategy to this multiple-density setup.