Improving bridge estimators via f-GAN

Bridge sampling is a powerful Monte Carlo method for estimating ratios of normalizing constants. Various methods have been introduced to improve its efficiency. These methods aim to increase the overlap between the densities by applying appropriate transformations to them without changing their normalizing constants. In this paper, we first give a new estimator of the asymptotic relative mean square error (RMSE) of the optimal Bridge estimator by equivalently estimating an f-divergence between the two densities. We then utilize this framework and propose the f-GAN-Bridge estimator (f-GB), which is based on a bijective transformation that maps one density to the other and minimizes the asymptotic RMSE of the optimal Bridge estimator with respect to the densities. This transformation is chosen by minimizing a specific f-divergence between the densities. We show f-GB is optimal in the sense that, within any given set of candidate transformations, the f-GB estimator can asymptotically achieve an RMSE lower than or equal to that achieved by Bridge estimators based on any other transformed densities. Numerical experiments show that f-GB outperforms existing methods in simulated and real-world examples. In addition, we discuss how Bridge estimators naturally arise from the problem of f-divergence estimation.


Introduction
Estimating the normalizing constant of an unnormalized probability density, or the ratio of normalizing constants between two unnormalized densities, is a challenging and important task. In Bayesian inference, such problems are closely related to estimating the marginal likelihood of a model or the Bayes factor between two competing models, and arise in fields such as econometrics (Geweke, 1999), astronomy (Bridges et al., 2009) and phylogenetics (Fourment et al., 2020). Monte Carlo methods such as Bridge sampling (Bennett, 1976; Meng and Wong, 1996), path sampling (Gelman and Meng, 1998), reverse logistic regression (Geyer, 1994), nested sampling (Skilling et al., 2006) and reverse Annealed Importance Sampling (Burda et al., 2015) have been proposed to address this problem. See Friel and Wyse (2012) for an overview of some popular algorithms. Fourment et al. (2020) also compare the empirical performance of 19 algorithms for estimating normalizing constants in the context of phylogenetics.
Bridge sampling (Bennett, 1976; Meng and Wong, 1996) is a powerful, easy-to-implement Monte Carlo method for estimating the ratio of normalizing constants. Let q̃_i(ω), ω ∈ Ω_i, i = 1, 2 be two unnormalized probability densities with respect to a common measure µ. Let q_i(ω) = q̃_i(ω)/Z_i be the corresponding normalized density, where Z_i is the normalizing constant. Bridge sampling estimates r = Z_1/Z_2 using samples from q_1, q_2 and the unnormalized density functions q̃_1, q̃_2. Meng and Schilling (2002) point out that Bridge sampling is equally useful for estimating a single normalizing constant. The relative mean square error (RMSE) of a Bridge estimator depends on the overlap or "distance" between q_1 and q_2, which can be quantified by some divergence between them. When q_1, q_2 share little overlap, the corresponding Bridge estimator has large RMSE and is therefore unreliable. In order to improve the efficiency of Bridge estimators, various methods such as Warp Bridge sampling (Meng and Schilling, 2002), Warp-U Bridge sampling (Wang et al., 2020) and Gaussianized Bridge sampling (Jia and Seljak, 2020) have been introduced. These methods first apply transformations T_i to q_i in a tractable way without changing the normalizing constant Z_i for i = 1, 2, then compute Bridge estimators based on the transformed densities.

In this paper, we first demonstrate the connection between Bridge estimators and f-divergence (Ali and Silvey, 1966). We show that one can estimate the asymptotic RMSE of the optimal Bridge estimator by equivalently estimating a specific f-divergence between q_1 and q_2. Nguyen et al. (2010) propose a general variational framework for f-divergence estimation. We apply this framework to our problem and obtain a new estimator of the asymptotic RMSE of the optimal Bridge estimator using the unnormalized densities q̃_1, q̃_2 and the corresponding samples.
We also find a connection between Bridge estimators and the variational lower bound of f-divergence given by Nguyen et al. (2010). In particular, we show that the problem of estimating an f-divergence between q_1, q_2 using this variational framework naturally leads to a Bridge estimator of r = Z_1/Z_2. Kong et al. (2003) observe that the optimal Bridge estimator is a maximum likelihood estimator under a semi-parametric formulation. We use this f-divergence estimation framework to extend this observation and show that many special cases of Bridge estimators, such as the geometric Bridge estimator, can also be interpreted as maximizers of objective functions that are related to the f-divergence between q_1 and q_2. This formulation also connects Bridge estimators and density ratio estimation problems: since we can evaluate the unnormalized densities q̃_1, q̃_2, we know the true density ratio up to the multiplicative constant r = Z_1/Z_2. Hence estimating r can be viewed as a problem of estimating the density ratio between q_1 and q_2. A similar idea has been explored in e.g. Noise Contrastive Estimation (Gutmann and Hyvärinen, 2010), where the authors treat the unknown normalizing constant as a model parameter and cast the estimation problem as a classification problem. Similar ideas have also been discussed in e.g. Geyer (1994) and Uehara et al. (2016).
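The density-ratio view can be made concrete with a small sketch (our own illustration, not code from the paper): with two unnormalized 1-d Gaussians whose true ratio is known, fitting a logistic classifier with a single intercept over the known unnormalized log ratio recovers log r, in the spirit of Geyer (1994) and Gutmann and Hyvärinen (2010).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Unnormalized densities: q~1 = exp(-x^2/2) (Z1 = sqrt(2*pi)),
# q~2 = 3*exp(-(x-1)^2/2) (Z2 = 3*sqrt(2*pi)), so r = Z1/Z2 = 1/3.
log_q1 = lambda x: -0.5 * x**2
log_q2 = lambda x: np.log(3.0) - 0.5 * (x - 1.0)**2
true_log_r = np.log(1.0 / 3.0)

n = 100_000
x1 = rng.normal(0.0, 1.0, n)   # samples from q1
x2 = rng.normal(1.0, 1.0, n)   # samples from q2

def neg_log_lik(c):
    # Classify "came from q1" vs "came from q2" using the known
    # unnormalized log ratio, shifted by the candidate value c of log r.
    h1 = log_q1(x1) - log_q2(x1) - c
    h2 = log_q1(x2) - log_q2(x2) - c
    # Logistic log likelihood: log sigma(h1) + log(1 - sigma(h2)).
    return -(np.mean(-np.logaddexp(0.0, -h1)) +
             np.mean(-np.logaddexp(0.0, h2)))

c_hat = minimize_scalar(neg_log_lik, bounds=(-10, 10), method="bounded").x
print(c_hat, true_log_r)  # c_hat should be close to log(1/3)
```

With balanced sample sizes, the population optimum of this objective is attained at c = log r, since the ideal discriminator is σ(log q_1(ω) − log q_2(ω)).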
We then utilize the connection between the asymptotic RMSE of the optimal Bridge estimator and a specific f-divergence between q_1, q_2, and propose the f-GAN-Bridge estimator (f-GB), which improves the efficiency of the optimal Bridge estimator of r by directly minimizing the first order approximation of its asymptotic RMSE with respect to the densities using an f-GAN. f-GAN (Nowozin et al., 2016) is a class of generative models that aims to approximate the target distribution by minimizing an f-divergence between the generative model and the target. Let T be a collection of transformations T such that q̃_1^(T), the transformed unnormalized density of q_1, is computationally tractable and has the same normalizing constant Z_1 as the original q̃_1. The f-GAN-Bridge estimator is obtained using a two-step procedure: we first use the f-GAN framework to find the transformation T* that minimizes a specific f-divergence between q_1^(T) and q_2 with respect to T ∈ T. Once T* and q_1^(T*) are chosen, we then compute the optimal Bridge estimator of r based on q_1^(T*) and q_2 as the f-GAN-Bridge estimator. We show T* asymptotically minimizes the first order approximation of the asymptotic RMSE of the optimal Bridge estimator based on q_1^(T) and q_2 with respect to T. In contrast, existing methods such as Warp Bridge sampling (Meng and Schilling, 2002; Wang et al., 2020) and Gaussianized Bridge sampling (Jia and Seljak, 2020) do not offer such a theoretical guarantee. The transformed q_1^(T) can be parameterized in any way as long as it is computationally tractable and preserves the normalizing constant Z_1. In this paper, we parameterize q_1^(T) as a Normalizing flow (Rezende and Mohamed, 2015; Dinh et al., 2016) with base density q_1 because of its flexibility.

Summary of our contributions
The main contribution of our paper is a computational framework that improves the optimal Bridge estimator by minimizing the first order approximation of its asymptotic RMSE with respect to the densities. We also give a new estimator of the asymptotic RMSE of the optimal Bridge estimator using the variational framework proposed by Nguyen et al. (2010). This formulation allows us to cast the estimation problem as a 1-d optimization problem. We find the f-GAN-Bridge estimator outperforms existing methods significantly in both simulated and real-world examples. Numerical experiments show that the proposed method provides not only a reliable estimate of r, but also an accurate estimate of its RMSE. In addition, we find a connection between Bridge estimators and the problem of f-divergence estimation, which allows us to view Bridge estimators from a different perspective.
This paper is structured as follows: In Section 2, we briefly review Bridge sampling and existing improvement strategies. In Section 3, we give a new estimator of the asymptotic RMSE of the optimal Bridge estimator using the variational framework for f-divergence estimation (Nguyen et al., 2010). We also demonstrate the connection between Bridge estimators and the problem of f-divergence estimation. In Section 4, we introduce the f-GAN-Bridge estimator and give implementation details. We give both simulated and real-world examples in Sections 5 and 6. Section 7 concludes the paper with a discussion. A Python implementation of the proposed method alongside examples can be found at https://github.com/hwxing3259/Bridge_sampling_and_fGAN.

Bridge sampling and related works
Let Q_1, Q_2 be two probability distributions of interest. Let q_i(ω), ω ∈ Ω_i, i = 1, 2 be the densities of Q_1, Q_2 with respect to a common measure µ defined on Ω_1 ∪ Ω_2, where Ω_1 and Ω_2 are the corresponding supports. We use q̃_i(ω), i = 1, 2 to denote the unnormalized densities and Z_i, i = 1, 2 to denote the corresponding normalizing constants, i.e. q_i(ω) = q̃_i(ω)/Z_i for i = 1, 2. Suppose we have samples from q_1, q_2, but we are only able to evaluate the unnormalized densities q̃_i(ω), i = 1, 2. Our goal is to estimate the ratio of normalizing constants r = Z_1/Z_2 using only q̃_i(ω), i = 1, 2 and samples from the two distributions.
Bridge sampling (Bennett, 1976;Meng and Wong, 1996) is a powerful method for this task.
The authors show that for any initial value r̂^(0), r̂^(t) is a consistent estimator of r for all t ≥ 1, and the sequence {r̂^(t)}, t = 0, 1, 2, ... converges to the unique limit r̂_opt. Let MSE(log r̂_opt) denote the asymptotic mean square error of log r̂_opt.
Under the i.i.d. assumption, the authors also show RE²(r̂_opt) and MSE(log r̂_opt) are asymptotically equivalent to RE²(r̂_{α_opt}) in (3) up to the first order (i.e. they have the same leading term). Note that r̂_opt can be found numerically, while r̂_{α_opt} is not directly computable.
We will focus on the asymptotically optimal Bridge estimator r̂_opt for the rest of the paper.
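As a minimal illustration of the fixed-point procedure in (4) (our own sketch; all function names are hypothetical, not from the paper's code), the following computes r̂_opt from two sets of samples and the unnormalized log densities using the standard Meng and Wong (1996) iteration.

```python
import numpy as np

def optimal_bridge(log_q1, log_q2, x1, x2, n_iter=50, r0=1.0):
    """Fixed-point iteration for the asymptotically optimal Bridge
    estimator of r = Z1/Z2 (Meng and Wong, 1996)."""
    n1, n2 = len(x1), len(x2)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    # l = q~1 / q~2, evaluated on the samples from q1 and from q2
    l1 = np.exp(log_q1(x1) - log_q2(x1))
    l2 = np.exp(log_q1(x2) - log_q2(x2))
    r = r0
    for _ in range(n_iter):
        num = np.mean(l2 / (s1 * l2 + s2 * r))
        den = np.mean(1.0 / (s1 * l1 + s2 * r))
        r = num / den
    return r

# Toy check: q~1 = exp(-x^2/2), q~2 = 2*exp(-(x-0.5)^2/2), so r = 1/2.
rng = np.random.default_rng(1)
x1 = rng.normal(0.0, 1.0, 50_000)
x2 = rng.normal(0.5, 1.0, 50_000)
r_hat = optimal_bridge(lambda x: -0.5 * x**2,
                       lambda x: np.log(2.0) - 0.5 * (x - 0.5)**2,
                       x1, x2)
print(r_hat)  # should be close to 0.5
```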

Improving Bridge estimators via transformations
From (3) and the fact that RE²(r̂_opt) and RE²(r̂_{α_opt}) are asymptotically equivalent, we see RE²(r̂_opt) depends on the overlap between q_1 and q_2. Even when Ω_1 = Ω_2, if q_1 and q_2 put their probability mass on very different regions, the integral in (3) will be close to 0, leading to a large RMSE and an unreliable estimator. In order to improve the performance of r̂_opt, one may apply transformations to q_1, q_2 (and to the corresponding samples) to increase their overlap while keeping the transformed unnormalized densities computationally tractable and the normalizing constants unchanged. We assume that we are dealing with unconstrained, continuous random variables with a common support, i.e. Ω_1 = Ω_2 = R^d.
When the supports Ω_1, Ω_2 are constrained or differ from each other, we can usually match them by applying simple invertible transformations to q_1, q_2. When Ω_1, Ω_2 have different dimensions, Chen and Shao (1997) suggest matching the dimensions of q_1, q_2 by augmenting the lower dimensional distribution using some completely known random variables (see Appendix 9 for details). Voter (1985) gives a method to increase the overlap in the context of free energy estimation by shifting the samples from one distribution to the other and matching their modes. Meng and Schilling (2002) extend this idea and consider more general mappings.
Let T_i : R^d → R^d, i = 1, 2 be two smooth and invertible transformations that aim to bring q_1 and q_2 closer to each other. The transformed densities are q_i^(T)(ω) = q_i(T_i^{-1}(ω)) |det J_i(ω)|, where T_i^{-1} is the inverse transformation of T_i and J_i(ω) is its Jacobian. One can then apply (1) to the transformed samples and the corresponding unnormalized densities q̃_1^(T), q̃_2^(T), and obtain a Bridge estimator without the need to sample from q̃_1^(T) or q̃_2^(T) separately. Let r̂_opt^(T) denote the asymptotically optimal Bridge estimator based on the transformed densities. We stress that the superscript of r̂^(t) in (4) indicates the number of iterations, while the superscript in r̂_opt^(T) means it is based on the transformed densities. If the transformed q_1^(T), q_2^(T) have a greater overlap than the original q_1, q_2, then r̂_opt^(T) should be a more reliable estimator of r with a lower RMSE. Meng and Schilling (2002) further extend this idea and propose the Warp transformation, which aims to increase the overlap by centering, scaling and symmetrizing the two densities q_1, q_2.
One limitation of the Warp transformation is that it does not work well for multimodal distributions. Wang et al. (2020) propose the Warp-U transformation to address this problem. The key idea of the Warp-U transformation is to first approximate q i by a mixture of Normal or t distributions, then construct a coupling between them which allows us to map q i into a unimodal density in the same way as mapping the mixture back to a single standard Normal or t distribution.
An alternative to the Warp transformation (Meng and Schilling, 2002) is a Normalizing flow. A Normalizing flow (NF) (Rezende and Mohamed, 2015; Dinh et al., 2016; Papamakarios et al., 2017) parameterizes a continuous probability distribution by mapping a simple base distribution (e.g. standard Normal) to the more complex target using a bijective transformation T, which is parameterized as a composition of a series of smooth and invertible mappings f_1, ..., f_K with easy-to-compute Jacobians. This T is applied to the "base" random variable z_0 ∼ p_0, where z_0 ∈ R^d and p_0 is the known base density. Let z_K = T(z_0) = f_K ∘ ... ∘ f_1(z_0). Since the transformation T is smooth and invertible, by applying the change of variables formula repeatedly, the final output z_K has density p_K(z_K) = p_0(z_0) ∏_{k=1}^K |det J_k(z_{k−1})|^{−1}, where J_k is the Jacobian of the mapping f_k and z_k = f_k(z_{k−1}). The final density p_K can be used to approximate target distributions with complex structure, and one can sample from p_K easily by applying T to samples from p_0. In order to evaluate p_K efficiently, we are restricted to transformations f_k whose det J_k(z) is easy to compute. For example, Real-NVP (Dinh et al., 2016) uses the following transformation: for m ∈ N such that 1 < m < d, let z_{1:m} be the first m entries of z ∈ R^d, let × be element-wise multiplication and let µ_k, σ_k : R^m → R^{d−m} be two mappings (usually parameterized by neural nets). The smooth and invertible transformation y = f_k(z) for each step k in Real-NVP is defined as y_{1:m} = z_{1:m}, y_{(m+1):d} = z_{(m+1):d} × exp(σ_k(z_{1:m})) + µ_k(z_{1:m}). This means f_k keeps the first m entries of the input z, while shifting and scaling the remaining d − m entries. Each transformation f_k is also called a coupling layer.
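A single affine coupling layer can be sketched as follows (our illustration; in practice µ_k and σ_k are neural nets, here they are fixed affine maps chosen only to check invertibility and the log determinant):

```python
import numpy as np

def make_coupling(m, mu_fn, sigma_fn):
    """One Real-NVP-style affine coupling layer on R^d.
    mu_fn, sigma_fn: R^m -> R^(d-m); arbitrary callables here
    (neural nets in practice)."""
    def forward(z):
        z1, z2 = z[:m], z[m:]
        y2 = z2 * np.exp(sigma_fn(z1)) + mu_fn(z1)  # shift-and-scale
        log_det = np.sum(sigma_fn(z1))              # log|det J_k|
        return np.concatenate([z1, y2]), log_det
    def inverse(y):
        y1, y2 = y[:m], y[m:]
        z2 = (y2 - mu_fn(y1)) * np.exp(-sigma_fn(y1))
        return np.concatenate([y1, z2])
    return forward, inverse

# Toy check of invertibility in d = 4, m = 2 with simple affine maps.
mu = lambda a: np.array([a[0] + a[1], a[0] - a[1]])
sig = lambda a: 0.1 * np.array([a[0], a[1]])
fwd, inv = make_coupling(2, mu, sig)
z = np.array([0.3, -1.2, 0.7, 2.0])
y, log_det = fwd(z)
print(np.allclose(inv(y), z))  # True: the layer is exactly invertible
```

The Jacobian of this map is triangular, so its log determinant is just the sum of the scale outputs, which is what makes the density in the preceding paragraph cheap to evaluate.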
When composing a series of coupling layers f_1, ..., f_K, the authors also swap the ordering of indices in (10) so that the dimensions that are kept unchanged in one step are scaled and shifted in the next. Jia and Seljak (2020) utilize the idea of transforming q_i using a Normalizing flow, and propose Gaussianized Bridge sampling (GBS) for estimating a single normalizing constant. The authors set q_1 to be a completely known density, e.g. a standard multivariate Normal, and aim to approximate the target q_2 using a Normalizing flow with base density q_1. Once the transformed density q_1^(T) is chosen, the authors use (7) and the iterative procedure (4) to form the asymptotically optimal Bridge estimator of Z_2 based on the transformed q_1^(T) and the original q_2.
The idea of increasing overlap via transformations is also applicable to discrete random variables. For example, Meng and Schilling (2002) suggest using swapping and permutation to increase the overlap between two discrete distributions. Tran et al. (2019) also give Normalizing flow models applicable to discrete random variables based on modulo operations. We give a toy example of increasing the overlap between two discrete distributions using Normalizing flows in Appendix 14. In the later sections, we will extend the idea of increasing overlap via transformations, and propose a new strategy to improve r̂_opt.

Bridge estimators and f-divergence estimation

Frühwirth-Schnatter (2004) gives an MC estimator of RE²(r̂_opt). In this section, we introduce an alternative estimator of RE²(r̂_opt) and MSE(log r̂_opt) by equivalently estimating an f-divergence between q_1, q_2. This formulation allows us to utilize the variational lower bound of f-divergence given by Nguyen et al. (2010), and cast the problem of estimating RE²(r̂_opt) as a 1-d optimization problem. In a later section, we will also show how to use this new estimator to improve the efficiency of r̂_opt^(T). In addition, we find that estimating different choices of f-divergences under the variational framework proposed by Nguyen et al. (2010) naturally leads to Bridge estimators of r with different choices of the free function α(ω). f-divergence (Ali and Silvey, 1966) is a broad class of divergences between two probability distributions. By choosing f accordingly, one can recover common divergences such as the KL divergence KL(q_1, q_2), the squared Hellinger distance H²(q_1, q_2) and the total variation distance d_TV(q_1, q_2).

Estimating RE²(r̂_opt)
Definition 3.1 (f-divergence). Suppose the two probability distributions Q_1, Q_2 have absolutely continuous density functions q_1 and q_2 with respect to a base measure µ on a common support Ω. Let the generator function f : R_+ → R be a convex and lower semi-continuous function satisfying f(1) = 0. The f-divergence D_f(q_1, q_2) defined by f takes the form

D_f(q_1, q_2) = ∫_Ω q_2(ω) f(q_1(ω)/q_2(ω)) dµ(ω).

Unless otherwise stated, we assume Ω = R^d where d ∈ N, i.e. both q_1 and q_2 are defined on R^d. If the densities q_1, q_2 have different or disjoint supports Ω_1, Ω_2, then we apply the appropriate transformations and augmentations discussed in the previous sections to ensure that the transformed and augmented densities (if necessary) are defined on the common support Ω = R^d. In this paper, we focus on a particular choice of f-divergence that is closely related to RE²(r̂_opt) in (3).
Definition 3.2 (Weighted harmonic divergence). Let q_1, q_2 be continuous densities with respect to a base measure µ on the common support Ω. The weighted harmonic divergence is defined as

H_π(q_1, q_2) = 1 − ∫_Ω q_1(ω) q_2(ω) / ((1 − π) q_1(ω) + π q_2(ω)) dµ(ω),

where π ∈ (0, 1) is the weight parameter. Wang et al. (2020) observe that the weighted harmonic divergence H_π(q_1, q_2) is an f-divergence with generator f(u) = 1 − u/(π + (1 − π)u), and RE²(r̂_opt) can be rearranged so that, up to the first order,

RE²(r̂_opt) = (1/(n s_1 s_2)) · H_{s_2}(q_1, q_2)/(1 − H_{s_2}(q_1, q_2)).

The same statement also holds for MSE(log r̂_opt), since MSE(log r̂_opt) is asymptotically equivalent to RE²(r̂_opt) (Meng and Wong, 1996). This means that if we have an estimator of H_{s_2}(q_1, q_2), then we can plug it into the leading term of the right hand side of (13) and obtain an estimator of the first order approximation of RE²(r̂_opt) and MSE(log r̂_opt). Before we give the estimator of H_{s_2}(q_1, q_2), we first introduce the variational framework for f-divergence estimation proposed by Nguyen et al. (2010). Every convex, lower semi-continuous function f : R_+ → R has a convex conjugate f*, defined as follows.

Definition 3.3 (Convex conjugate). Let f : R_+ → R be a convex and lower semi-continuous function. The convex conjugate of f is defined as f*(t) = sup_{u ∈ R_+} {ut − f(u)}.

Nguyen et al. (2010) show that any f-divergence D_f(q_1, q_2) satisfies

D_f(q_1, q_2) ≥ sup_{V ∈ V} { E_{q_1}[V(ω)] − E_{q_2}[f*(V(ω))] },

where V is an arbitrary class of functions V : Ω → R, and f*(t) is the convex conjugate of the generator f which characterizes the f-divergence D_f(q_1, q_2). A table of common f-divergences with their generator f and the corresponding convex conjugate f* can be found in Nowozin et al. (2016). Nguyen et al. (2010) also show that if f is differentiable, the bound is tight for V*(ω) = f'(q_1(ω)/q_2(ω)), the first order derivative of f evaluated at q_1(ω)/q_2(ω). The authors then give a new strategy for estimating the f-divergence D_f(q_1, q_2) by finding the maximum of an empirical estimate of E_{q_1}[V(ω)] − E_{q_2}[f*(V(ω))] in (15) with respect to the variational function V ∈ V. We now use this framework to give an estimator of H_π(q_1, q_2).
Here we assume that r̂ ∈ [C_1, C_2] instead of r̂ ∈ R_+. This is not a strong assumption, since we can set C_1 (C_2) to be arbitrarily small (large). We take Ĝ(r̂_{s_2}; s_2, {ω_ij}_{j=1}^{n_i}) as an estimator of H_{s_2}(q_1, q_2), and define our estimator of the first order approximation of RE²(r̂_opt) as follows.

Definition 3.4 (Estimator of RE²(r̂_opt)). Let {ω_ij}_{j=1}^{n_i} be samples from q_i for i = 1, 2. Define

RÊ²(r̂_opt) = (1/(n s_1 s_2)) · Ĝ(r̂_{s_2}; s_2, {ω_ij}_{j=1}^{n_i}) / (1 − Ĝ(r̂_{s_2}; s_2, {ω_ij}_{j=1}^{n_i}))

as an estimator of the first order approximation of both RE²(r̂_opt) and MSE(log r̂_opt) in (13).
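The resulting 1-d optimization can be sketched as follows (our own illustration with hypothetical function names). It uses the generator f(u) = 1 − u/(π + (1 − π)u), the identity f*(f'(u)) = u f'(u) − f(u) for differentiable convex f, and a variational function indexed by a single scalar through V_r̂(ω) = f'(q̃_1(ω)/(r̂ q̃_2(ω))), which recovers the optimal V* when r̂ = r:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def f_gen(u, pi):                      # generator of H_pi
    return 1.0 - u / (pi + (1.0 - pi) * u)

def f_prime(u, pi):                    # f'(u)
    return -pi / (pi + (1.0 - pi) * u) ** 2

def G_hat(log_r, pi, lq1_x1, lq2_x1, lq1_x2, lq2_x2):
    """Empirical variational lower bound with V(w) = f'(q~1/(r q~2))."""
    u1 = np.exp(lq1_x1 - lq2_x1 - log_r)   # q~1/(r q~2) on samples from q1
    u2 = np.exp(lq1_x2 - lq2_x2 - log_r)   # ... on samples from q2
    term1 = np.mean(f_prime(u1, pi))
    # f*(f'(u)) = u f'(u) - f(u) for differentiable convex f
    term2 = np.mean(u2 * f_prime(u2, pi) - f_gen(u2, pi))
    return term1 - term2

# Toy example: q~1 = exp(-x^2/2), q~2 = 2*exp(-(x-1)^2/2), so r = 1/2.
rng = np.random.default_rng(2)
n1 = n2 = 20_000
x1, x2 = rng.normal(0, 1, n1), rng.normal(1, 1, n2)
lq1 = lambda x: -0.5 * x**2
lq2 = lambda x: np.log(2.0) - 0.5 * (x - 1.0)**2
pi = n2 / (n1 + n2)

obj = lambda c: -G_hat(c, pi, lq1(x1), lq2(x1), lq1(x2), lq2(x2))
c_hat = minimize_scalar(obj, bounds=(-10, 10), method="bounded").x
H_hat = G_hat(c_hat, pi, lq1(x1), lq2(x1), lq1(x2), lq2(x2))
# Plug into the first order RMSE approximation:
n = n1 + n2
re2_hat = H_hat / (n * pi * (1 - pi)) / (1 - H_hat)
print(np.exp(c_hat), H_hat, re2_hat)
```

The maximizer e^{c_hat} is itself an estimate of r, and H_hat plugs into the leading-term formula for the relative error, which is the point of Definition 3.4.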
Even though Ĝ(r̂_{s_2}; s_2, {ω_ij}_{j=1}^{n_i}) is a consistent estimator of H_{s_2}(q_1, q_2), it suffers from a positive bias (see Appendix 10 for details). We have not found a practical strategy to correct it so far. On the other hand, we believe this bias does not prevent our proposed error estimator from being useful in practice. Since our estimator of RE²(r̂_opt) in (19) is a monotonically increasing function of Ĝ(r̂_π; π, {ω_ij}_{j=1}^{n_i}) in Prop 1, the positive bias in Ĝ(r̂_π; π, {ω_ij}_{j=1}^{n_i}) leads to a positive bias in the estimator of RE²(r̂_opt). Therefore it will systematically overestimate the true error RE²(r̂_opt), which leads to more conservative conclusions (e.g. wider error bars). This is certainly not ideal, but we believe that in practice it is less harmful than underestimating the variability in r̂_opt. In addition, the proposed error estimator provides accurate estimates of RE²(r̂_opt) in both the examples in Sections 5 and 6, indicating its effectiveness.

f-divergence estimation and Bridge estimators
In the last section, we focused on estimating H_π(q_1, q_2). We now extend the estimation framework to other choices of f-divergence, and show how Bridge estimators naturally arise from this estimation problem. Let an f-divergence D_f(q_1, q_2) with corresponding generator f(u) be given. Similar to Proposition 1, under our parameterization of the variational function V_r̂, the empirical estimate of the variational lower bound of D_f(q_1, q_2) is

Ĝ_f(r̂; {ω_ij}_{j=1}^{n_i}) = (1/n_1) Σ_{j=1}^{n_1} V_r̂(ω_1j) − (1/n_2) Σ_{j=1}^{n_2} f*(V_r̂(ω_2j)),

where {ω_ij}_{j=1}^{n_i} ∼ q_i for i = 1, 2. Let r̂^(f) = argmax_{r̂ ∈ R_+} Ĝ_f(r̂; {ω_ij}_{j=1}^{n_i}). By Nguyen et al. (2010), Ĝ_f(r̂^(f); {ω_ij}_{j=1}^{n_i}) is an estimator of D_f(q_1, q_2). In Proposition 1 we have shown that r̂^(f) and Ĝ_f(r̂^(f); {ω_ij}_{j=1}^{n_i}) are consistent estimators of r and D_f(q_1, q_2) when D_f(q_1, q_2) is the weighted harmonic divergence H_π(q_1, q_2). Here we show the connection between r̂^(f) and the Bridge estimators of r with different choices of the free function α(ω).
Proposition 2 (Connection between r̂^(f) and Bridge estimators). Suppose f(u) is twice differentiable, and let r̂^(f) be the maximizer of Ĝ_f(r̂; {ω_ij}_{j=1}^{n_i}) in (21). Then r̂^(f) satisfies the following equation

r̂^(f) = [ (1/n_2) Σ_{j=1}^{n_2} q̃_1(ω_2j) α(ω_2j) ] / [ (1/n_1) Σ_{j=1}^{n_1} q̃_2(ω_1j) α(ω_1j) ], with α(ω) = f''(q̃_1(ω)/(r̂^(f) q̃_2(ω))) q̃_1(ω)/q̃_2(ω)²,

where f'' is the second order derivative of f.
In Equation (22), α(ω) = f''(q̃_1(ω)/(r̂^(f) q̃_2(ω))) q̃_1(ω)/q̃_2(ω)² plays the role of the free function α(ω) in a Bridge estimator (1). Common Bridge estimators such as the asymptotically optimal Bridge estimator r̂_opt and the geometric Bridge estimator can be recovered by choosing f accordingly (see Appendix 11). Kong et al. (2003) observe that r̂_opt can be viewed as a semi-parametric maximum likelihood estimator. Proposition 2 extends this observation and shows that, in addition to r̂_opt, a large class of Bridge estimators can be viewed as maximizers of objective functions that are related to the variational lower bounds of some f-divergences.
In the next section, we will show how to use this variational framework to minimize the first order approximation of RE²(r̂_opt^(T)) with respect to the transformed densities.
Improving r̂_opt via an f-GAN

From Sec 2.1, we see that one can improve r̂_opt and reduce its RMSE by first transforming q_1, q_2 appropriately, then computing r̂_opt^(T) using the transformed densities and samples. From Sec 3, we also see that the first order approximation of RE²(r̂_opt) is a monotonic function of H_{s_2}(q_1, q_2). In this section, we utilize this observation and introduce the f-GAN-Bridge estimator (f-GB), which aims to improve r̂_opt^(T) by minimizing the first order approximation of RE²(r̂_opt^(T)) with respect to the transformed densities. We show this is equivalent to minimizing H_{s_2}(q_1^(T), q_2) with respect to q_1^(T) using the variational lower bound of H_π(q_1, q_2) in (17) and an f-GAN (Nowozin et al., 2016).

The f-GAN framework
We start by introducing the GAN and f-GAN models. A Generative Adversarial Network (GAN) (Goodfellow et al., 2014) is an expressive class of generative models. Let p_tar be the target distribution of interest. In the original GAN, Goodfellow et al. (2014) estimate a generative model p_φ, parameterized by a real vector φ, by approximately minimizing the Jensen-Shannon divergence between p_φ and p_tar. The key idea of the original GAN is to introduce a separate discriminator which tries to distinguish between "true" samples from p_tar and artificially generated samples from p_φ. This discriminator is optimized alongside the generative model p_φ in the training process. See Creswell et al. (2018) for an overview of GAN models.
f-GAN (Nowozin et al., 2016) extends the original GAN model using the variational lower bound of f-divergence (15), and introduces a GAN-type framework that generalizes to minimizing any f-divergence between p_tar and p_φ. Let an f-divergence with generator f be given. Nowozin et al. (2016) parameterize the variational function V_ξ and the generative model p_φ as two neural nets with parameters ξ and φ respectively, and propose the objective

G(φ, ξ) = E_{p_φ}[V_ξ(ω)] − E_{p_tar}[f*(V_ξ(ω))].

Nowozin et al. (2016) show that D_f(p_φ, p_tar) can be minimized by solving min_φ max_ξ G(φ, ξ). Intuitively, we can view max_ξ G(φ, ξ) as an estimate of D_f(p_φ, p_tar) (Nguyen et al., 2010). This means minimizing max_ξ G(φ, ξ) with respect to φ can be interpreted as minimizing an estimate of D_f(p_φ, p_tar).
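As a toy illustration of the min-max structure (ours, not from Nowozin et al. (2016)): take the KL generator f(u) = u log u with conjugate f*(t) = exp(t − 1), a Gaussian model p_φ = N(φ, 1), target p_tar = N(2, 1), and a linear variational function V_ξ(x) = ax + b. Both expectations in G(φ, ξ) then have closed forms, and so does the inner maximum over ξ = (a, b), so alternating optimization reduces to gradient descent on φ against an exact best-response discriminator:

```python
import math

# f(u) = u*log(u) gives KL(p_phi || p_tar); its conjugate is f*(t) = exp(t-1).
# Model p_phi = N(phi, 1), target p_tar = N(2, 1), V(x) = a*x + b.
# Closed forms: E_{p_phi}[V] = a*phi + b,
#               E_{p_tar}[f*(V)] = exp(2a + a^2/2 + b - 1).
def G(phi, a, b):
    return a * phi + b - math.exp(2 * a + 0.5 * a * a + b - 1)

def best_response(phi):
    # Exact inner maximizer of G over (a, b) for this toy problem.
    a = phi - 2.0
    b = 1.0 - 2.0 * a - 0.5 * a * a
    return a, b

phi = -1.0
for _ in range(300):
    a, b = best_response(phi)       # "discriminator" update (exact here)
    grad_phi = a                    # dG/dphi = a at the inner optimum
    phi -= 0.1 * grad_phi           # "generator" gradient step
print(phi)  # converges to the target mean 2.0
a, b = best_response(phi)
print(G(phi, a, b))  # max over (a,b) equals KL(N(phi,1) || N(2,1)) -> 0
```

At the inner optimum, max_ξ G(φ, ξ) = (φ − 2)²/2 = KL(p_φ, p_tar), so the outer minimization drives the model onto the target, mirroring the general min_φ max_ξ scheme above.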
Now we show how to use the f-GAN framework to construct the f-GAN-Bridge estimator (f-GB). Suppose q_1, q_2 are defined on a common support Ω = R^d. Let T_φ : Ω → Ω be a transformation, parameterized by a real vector φ ∈ R^l, that aims to map q_1 to q_2. Let q_1^(φ) be the transformed density obtained by applying T_φ to q_1, and q̃_1^(φ) be the corresponding unnormalized density. We also require q̃_1^(φ) to be computationally tractable and to have the same normalizing constant Z_1 as q̃_1. Let r̂_opt^(φ) denote the asymptotically optimal Bridge estimator of r based on the unnormalized densities q̃_1^(φ), q̃_2 and the corresponding samples, and let G(φ, r̂; π) denote the objective obtained by replacing q_1 with q_1^(φ) in (17). By Proposition 1, G(φ, r̂; π) is the variational lower bound of H_π(q_1^(φ), q_2). In order to illustrate our idea, we first give an idealized Algorithm 1 to find the f-GAN-Bridge estimator.
A practical version will be given in the next section.
Use the iterative procedure in (4) to compute the asymptotically optimal Bridge estimator based on the transformed q̃_1^(φ*) and the original q̃_2. Since q̃_1^(φ) and q̃_1 have the same normalizing constant by (6), r̂_opt^(φ) is an asymptotically optimal Bridge estimator of r for any transformation T_φ ∈ T. We show that within the given family of transformations T, Algorithm 1 is able to find the T_φ* that minimizes the first order approximation of RE²(r̂_opt^(φ)) with respect to T_φ ∈ T.
From Proposition 3 we see that under the i.i.d. assumption, T_φ* and the corresponding r̂_opt^(φ*) are optimal in the sense that r̂_opt^(φ*) attains the minimal RMSE (up to the first order), in the form of (13), among all possible transformations T_φ ∈ T and their corresponding Bridge estimators. Note that by Proposition 1, r̂* is equal to the true ratio of normalizing constants r. This means that if we had (φ*, r̂*) from the idealized Algorithm 1, there would seem to be no need to carry out the following Bridge sampling step. However, (φ*, r̂*) is not computable in practice, as G(φ, r̂; s_2) depends on the unknown normalizing constants Z_1, Z_2. Therefore G(φ, r̂; s_2) has to be approximated by an empirical estimate, and the corresponding optimizer w.r.t. r̂ is no longer equal to r. In the next section, we give a practical implementation of Algorithm 1 and discuss the role of r̂* when G(φ, r̂; s_2) is replaced by an empirical estimate.
In Algorithm 1, we use the f-GAN framework to minimize H_{s_2}(q_1^(φ), q_2) with respect to T_φ ∈ T. We could also apply this f-GAN framework to minimize other choices of f-divergence, such as the KL divergence, the squared Hellinger distance and the weighted Jensen-Shannon divergence. However, these choices are less efficient than H_{s_2} for improving r̂_opt^(φ), as we can show that minimizing these f-divergences between q_1^(φ) and q_2 can only be viewed as minimizing upper bounds of the first order approximation of RE²(r̂_opt^(φ)) (see Appendix 12).

Implementation and numerical stability
In this section, we give a practical implementation of the idealized Algorithm 1 based on an alternative objective function. We first describe the practical version of Algorithm 1 in Sec 4.2.1, then justify the choice of this alternative objective in Sec 4.2.2.

A practical implementation of Algorithm 1
In this paper, we parameterize q_1^(φ) as a Normalizing flow. In particular, we parameterize q_1^(φ) as a Real-NVP (Dinh et al., 2016) with base density q_1 and a smooth, invertible transformation T_φ. Let Ĝ(φ, r̂; π, {ω_ij}_{j=1}^{n_i}) be the empirical estimate of G(φ, r̂; π) in (24). Unlike Algorithm 1, we do not aim to solve min_{φ ∈ R^l} max_{r̂ ∈ R_+} Ĝ(φ, r̂; π, {ω_ij}_{j=1}^{n_i}) directly. Instead, we define our objective function as a combination of a numerically stabilized transformation of Ĝ and two "likelihood" terms weighted by hyperparameters λ_1, λ_2 ≥ 0. We first give Algorithm 2, a practical implementation of Algorithm 1, then justify the choice of the objective function (26) in the following section. See Appendix 13 for implementation details.
Transform and augment q_1, q_2 appropriately so that both densities are on a common support.
Use r̂_t as the initial value of the iterative procedure in (4) and compute r̂_opt^(φ_t). In Algorithm 2, most of the computational cost is spent on estimating the transformation. Since we parameterize q_1^(φ) as a Real-NVP in this paper, we leverage the GPU computing framework for neural networks. In particular, we implement Algorithm 2 using PyTorch (Paszke et al., 2017) and CUDA (NVIDIA et al., 2020). As a result, most of the computation of Algorithm 2 is parallelized and carried out on the GPU, which greatly accelerates training. We will further compare the computational cost of Algorithm 2 to existing improvement strategies for Bridge sampling (Meng and Schilling, 2002; Jia and Seljak, 2020; Wang et al., 2020) in Sections 5 and 6.

Choosing the objective function
Note that the original empirical estimate Ĝ(φ, r̂; s_2, {ω_ij}_{j=1}^{n_i}) can be extremely close to 1 when q_1^(φ) and q_2 share little overlap. In order to improve its numerical stability, we first apply the transformation u ↦ log(1/(1 − u)). Since this transformation is monotonically increasing on (−∞, 1), applying it does not change the optimizers of Ĝ(φ, r̂; s_2, {ω_ij}_{j=1}^{n_i}). In addition, GAN-type models can be difficult to train in practice (Arjovsky and Bottou, 2017). Grover et al. (2018) suggest that one can stabilize the adversarial training process of GAN-type models by incorporating a log likelihood term into the original objective function when the generative model is computationally tractable. Since both q̃_1^(φ) and q̃_2 are computationally tractable in our setup, we are able to extend this idea and stabilize the alternating training process by incorporating two "likelihood" terms. The objective (26) is then defined as the combination of log(1/(1 − Ĝ(φ, r̂; s_2, {ω_ij}_{j=1}^{n_i}))) and the two "likelihood" terms, where the hyperparameters λ_1, λ_2 ≥ 0 control the contribution of the "likelihood" terms.
Similar to Algorithm 1, let (φ*_L, r̂*_L) be a solution of the min-max problem min_{φ ∈ R^l} max_{r̂ ∈ R_+} L_{λ_1,λ_2}(φ, r̂; s_2, {ω_ij}_{j=1}^{n_i}). Note that regardless of the choice of λ_1, λ_2, the scalar parameter r̂ only depends on the term log(1/(1 − Ĝ(φ, r̂; s_2, {ω_ij}_{j=1}^{n_i}))), so r̂*_L can be viewed as a Bridge estimator of r based on the transformed q̃_1^(φ*_L) and the original q̃_2 with a specific choice of the free function α(ω). However, r̂*_L is sub-optimal, since the free function α(ω) it uses is different from the optimal α_opt(ω) in (2). This means r̂*_L will have greater asymptotic error than the asymptotically optimal Bridge estimator. In addition, r̂*_L suffers from an adaptive bias (Wang et al., 2020). This bias arises from the fact that the estimated transformed density q_1^(φ_t) in Algorithm 2 is chosen based on the training samples {ω_ij}_{j=1}^{n_i} for i = 1, 2. As a result, the density of the distribution of the transformed training samples {T_{φ_t}(ω_1j)}_{j=1}^{n_1} is no longer proportional to q̃_1^(φ_t)(T_{φ_t}(ω_1j)) for j = 1, ..., n_1, as φ_t can be viewed as a function of {ω_ij}_{j=1}^{n_i} (see Appendix 13 for more discussion). Hence we do not use r̂*_L as our final estimator of r. Instead, once we have obtained r̂*_L, we use it as a sensible initial value for the iterative procedure in (4), and compute the asymptotically optimal Bridge estimator r̂_opt^(φ*_L) using a separate set of estimating samples. The resulting r̂_opt^(φ*_L) does not suffer from the adaptive bias, as the estimating samples are independent of the transformation q_1^(φ_t). When the estimating samples are of the same sizes n_i, i = 1, 2 as the training samples, r̂_opt^(φ*_L) is also statistically more efficient than r̂*_L. Recall that as n_1, n_2 → ∞, the additional log likelihood terms in (26) are asymptotically equivalent to −KL(q_2, q_1^(φ)) and −KL(q_1^(φ), q_2), and minimizing log(1/(1 − H_{s_2}(q_1^(φ), q_2))) with respect to φ is equivalent to minimizing the first order approximation of RE²(r̂_opt^(φ)) under the i.i.d. assumption. We can also show that minimizing KL(q_1^(φ), q_2) and KL(q_2, q_1^(φ)) corresponds to minimizing upper bounds of the first order approximation of RE²(r̂_opt^(φ)) w.r.t.
φ under the same assumption (See Appendix 12). Note that when λ 1 , λ 2 = 0, Proposition 3 no longer holds for this hybrid objective asymptotically, i.e. T φ * L no longer asymptotically minimizes the first order approximation of RE 2 (r (φ) opt ) w.r.t. T φ . However, we find Algorithm 2 with the hybrid objective works well in the numerical examples in Sec 5, 6 for any value of λ 1 , λ 2 ∈ (10 −2 , 10 −1 ). We want to keep λ 1 , λ 2 small since we do not want the log likelihood . In addition, we would like to stress that even though the final φ t in Algorithm 2 does not asymptotically minimize the first order approximation of RE 2 (r opt ) in Algorithm 2 is still a consistent estimator of the first order approximation of RE 2 (r (φt) opt ) by Proposition 1 and the fact thatr (φt) opt is the asymptotically optimal Bridge estimator based on the transformed q (φt) 1 and the original q 2 .
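As a concrete sketch, the iterative procedure for the asymptotically optimal Bridge estimator can be written as the classic Meng and Wong (1996) fixed-point iteration; we assume here that procedure (4) takes this standard form, and the function and argument names below are ours:

```python
import numpy as np

def optimal_bridge(log_l1, log_l2, s1, s2, r_init=1.0, iters=500, tol=1e-10):
    """Fixed-point iteration for the asymptotically optimal Bridge estimator.

    log_l1[j] = log( qtilde1(w_1j) / qtilde2(w_1j) ) for samples w_1j ~ q1,
    log_l2[j] = log( qtilde1(w_2j) / qtilde2(w_2j) ) for samples w_2j ~ q2,
    s1, s2 are the sample proportions n1/n, n2/n.
    """
    l1, l2 = np.exp(log_l1), np.exp(log_l2)
    r = r_init
    for _ in range(iters):
        num = np.mean(l2 / (s1 * l2 + s2 * r))   # average over samples from q2
        den = np.mean(1.0 / (s1 * l1 + s2 * r))  # average over samples from q1
        r_new = num / den
        if abs(r_new - r) < tol * max(1.0, abs(r)):
            return r_new
        r = r_new
    return r
```

In Algorithm 2, r̂*_L would be passed in as `r_init`; a good initial value reduces the number of fixed-point iterations needed.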

Example 1: Mixture of Rings
We first demonstrate the effectiveness of the f -GAN-Bridge estimator and Algorithm 2 using a simulated example. Since this paper focuses on improving the original Bridge estimator (Meng and Wong, 1996) rather than giving a new estimator of the normalizing constant or the ratio of normalizing constants, we focus on comparing the performance of the proposed f -GAN-Bridge estimator to existing improvement strategies for Bridge sampling (Meng and Schilling, 2002; Wang et al., 2020; Jia and Seljak, 2020) in this and the following section. We do not include other classes of methods such as path sampling (Gelman and Meng, 1998; Lartillot and Philippe, 2006), nested sampling (Skilling et al., 2006), variational approaches (Ranganath et al., 2014), etc. in the examples. An empirical study (Fourment et al., 2020) finds evidence that Bridge sampling is competitive with a wide range of methods, including those mentioned above, in the context of phylogenetics.
In this example, we set q_1, q_2 to be mixtures of ring-shaped distributions, and we would like to estimate the ratio of their normalizing constants. We choose this example because such a mixture has a multi-modal structure, and its normalizing constant is available in closed form. Let x ∈ R². In order to define the pdfs of q_1, q_2 for this example, we first define the pdf of a 2-d ring distribution R(x; µ, b, σ), where Φ(·) is the standard Normal CDF and µ, b, σ control the location, radius and thickness of the ring respectively. Let R̃(x; µ, b, σ) be the corresponding unnormalized density. Let ω ∈ R^p, where p is an even integer. For i = 1, 2, let the unnormalized density q̃_i be a product of ring-mixture factors over consecutive pairs of entries of ω, where ω_j is the jth entry of ω. This means for i = 1, 2, if ω ∼ q_i, then every two entries of ω are independent and identically distributed, and follow an equally weighted mixture of 2-d ring distributions with different location parameters µ_i1, µ_i2 and the same radius and thickness parameters b_i, σ_i. It is straightforward to verify that Z_i, the normalizing constant of q̃_i, is available in closed form. In this example, we consider dimension p ∈ {12, 18, 24, 30, 36, 42, 48}, and set µ_11 = (2, 2), µ_12 = (−2, −2), µ_21 = (3, −3), µ_22 = (−3, 3), b_1 = 3, b_2 = 6, σ_1 = 1, σ_2 = 2.

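A minimal numpy sketch of the unnormalized target densities is given below. The paper's exact displayed formula for the ring pdf is not reproduced here; we assume the common form R(x; µ, b, σ) ∝ exp(−(‖x − µ‖ − b)²/(2σ²)), and the function names are ours:

```python
import numpy as np

def ring_logpdf_unnorm(x, mu, b, sigma):
    """Unnormalized log-density of a 2-d ring distribution, assuming the form
    R(x; mu, b, sigma) propto exp(-(||x - mu|| - b)^2 / (2 sigma^2))."""
    r = np.linalg.norm(x - mu, axis=-1)
    return -((r - b) ** 2) / (2.0 * sigma**2)

def qtilde_logpdf(omega, mus, b, sigma):
    """Unnormalized log-density of qtilde_i: every consecutive pair of entries
    of omega follows an equally weighted two-component mixture of 2-d rings."""
    omega = np.asarray(omega).reshape(-1, 2)  # pair up the entries of omega
    comp = np.stack([ring_logpdf_unnorm(omega, mu, b, sigma) for mu in mus])
    # log of the equally weighted mixture per 2-d block, summed over blocks
    return np.sum(np.logaddexp(comp[0], comp[1]) - np.log(2.0))

# Evaluate qtilde_1 at the origin in dimension p = 12
val = qtilde_logpdf(np.zeros(12),
                    mus=[np.array([2.0, 2.0]), np.array([-2.0, -2.0])],
                    b=3.0, sigma=1.0)
```

Because the factors over pairs of entries are independent, the normalizing constant of q̃_i is the per-pair constant raised to the power p/2, which is why Z_i is available in closed form.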
In this example, we estimate log r = log Z_1 − log Z_2 using the f -GAN-Bridge estimator (f -GB, Algorithm 2), the Warp-III Bridge estimator (Meng and Schilling, 2002), the Warp-U Bridge estimator (Wang et al., 2020) and Gaussianized Bridge Sampling (GBS) (Jia and Seljak, 2020). We fix N_i, the number of samples from q_i, to be 2000 for i = 1, 2, and compare the performance of these methods as we increase the dimension p. For each value of p, we run each method 100 times. For Algorithm 2, we set λ_1, λ_2 = 0.05 and parameterize q_1^(φ) as a Real-NVP. This idea has also been discussed in Wong et al. (2020). Let r̂ be a generic estimator of r.
For each method and each value of p, we compute a MC estimate of the MSE of log r̂ based on the results from the repeated runs, and use it as the benchmark of performance. The transformed density q_1^(φ_t) is estimated using Algorithm 2 with n_i = n'_i = N_i/2 for i = 1, 2.
We see the transformed q_1^(φ_t) captures the structure of q_2 accurately, and they share much greater overlap than the original q_1, q_2. We now compare the computational cost of these methods. Recall that our Algorithm 2 utilizes GPU acceleration. Because of the difference between GPU and CPU computing, it is not straightforward to compare the computational cost of Algorithm 2 with GBS, Warp-III and Warp-U, which are CPU based, using benchmarks such as CPU seconds or the number of function calls. We simply report the average running time for each method on our machine in Figure 2. Similar to Wang et al. (2020), we also report the average "precision per second", which is the reciprocal of the product of the running time and the estimated MSE of log r̂, for each method (higher precision per second means better efficiency). We see that for all methods, the computation time is approximately a linear function of the dimension p.
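The "precision per second" benchmark is straightforward to compute from the repeated runs; a small sketch (function names ours):

```python
import numpy as np

def precision_per_second(log_r_hats, log_r_true, runtime_seconds):
    """Reciprocal of (average running time x MC estimate of MSE of log r-hat).

    log_r_hats: estimates of log r from repeated runs of one method;
    log_r_true: ground-truth log r (known in the simulated example).
    """
    mse = np.mean((np.asarray(log_r_hats) - log_r_true) ** 2)
    return 1.0 / (runtime_seconds * mse)
```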
Even though f -GB takes roughly twice as long to run compared to GBS and 30 ∼ 40 times longer compared to Warp-III, it achieves the highest precision per second for all dimensions p we consider. In addition, we also run further simulations with larger sample sizes. We find that when p = 48, Warp-U needs around N_1 = N_2 = 7500 samples to reach a similar level of precision as f -GB based on N_1 = N_2 = 2000 samples. In this case, Warp-U takes around 3 ∼ 4 times longer to run compared to f -GB. For Warp-III and GBS, we further increase the sample size to N_1 = N_2 = 5 × 10^4, but find that their performance is still worse than f -GB and Warp-U, and both take more than three times longer to run. For Warp-III and Warp-U, it is not obvious how they would benefit from GPU computation. Although GBS may benefit from GPU acceleration in principle, it would require careful implementation and optimization. Therefore we compare our Algorithm 2 to these methods based on their publicly available implementations. Recall that MSE(log r̂_opt) is asymptotically equivalent to RE²(r̂_opt) (Meng and Wong, 1996). Therefore the estimate of RE²(r̂_opt^(φ_t)) returned from Algorithm 2 can also be viewed as an estimate of MSE(log r̂_opt^(φ_t)). In order to assess its accuracy, we compare it with the error estimator given in Frühwirth-Schnatter (2004).

Example 2: Comparing two Bayesian GLMMs
In this section we demonstrate the effectiveness of the f -GAN-Bridge estimator and Algorithm 2 by considering a Bayesian model comparison problem based on the six cities dataset (Fitzmaurice and Laird, 1993), where q_1, q_2 are the posterior densities of the parameters of two Bayesian GLMMs M_1, M_2. This example is adapted from Overstall and Forster (2010). We choose this example because it is based on a real-world dataset, and the posteriors q_1, q_2 are relatively high dimensional and are defined on disjoint supports with different dimensions.
The six cities dataset consists of the wheezing status y_ij (1 = wheezing, 0 otherwise) of child i at time j for i = 1, ..., n, n = 537 and j = 1, ..., 4. It also includes x_ij, the smoking status (1 = smoke, 0 otherwise) of the i-th child's mother at time point j, as a covariate. We compare two mixed effects logistic regression models M_1, M_2 with different linear predictors, where β_0, β_1 are regression parameters, u_i is the random effect of the i-th child and σ² controls the variance of the random effects. We use the default prior given by Overstall and Forster (2010) for both models, i.e. we take β_0 ∼ N(0, 4), σ^{−2} ∼ Γ(0.5, 0.5) for M_1. Let y = (y_1, ..., y_n) with y_i = (y_i1, ..., y_i4). Let u = (u_1, ..., u_n) be the vector of random effects. Let q_1(β_0, u) = p(β_0, u|X, y, M_1) be the marginal posterior of (β_0, u) under M_1, and q̃_1(β_0, u) be the corresponding unnormalized density. Let q_2(β_0, β_1, u), q̃_2(β_0, β_1, u) be defined in a similar fashion under M_2. Samples from q_1, q_2 are obtained using the MCMC package R2WinBUGS (Sturtz et al., 2005; Lunn et al., 2000). For k = 1, 2, the normalizing constant Z_k of q̃_k is the marginal likelihood under M_k. We first generate 2 × 10^5 MCMC samples from q_1, q_2 and estimate log Z_1, log Z_2 using the method described in Overstall and Forster (2010). Similar to the previous example, we use f -GB to estimate the log Bayes factor log r = log Z_1 − log Z_2 between M_1, M_2. Note that q_1, q_2 are defined on disjoint supports R^{n+1}, R^{n+2} respectively. In order to apply our Algorithm 2 to this problem, we first augment q_1 using a standard Normal to match the difference in dimension between q_1 and q_2: let q_{1,aug}(β_0, γ, u) = q_1(β_0, u)N(γ; 0, 1) be the augmented density, where N(·; 0, 1) is the standard Normal pdf. Let q̃_{1,aug} be the corresponding unnormalized augmented density.
Note thatq 1,aug andq 1 have the same normalizing constant Z 1 . We can then apply Algorithm 2 to q 1,aug and q 2 since q 1,aug and q 2 are now defined on a common support R n+2 . We can sample from q 1,aug by simply concatenating a sample (β 0 , u) ∼ q 1 and a sample γ ∼ N (0, 1).
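The augmentation step above can be sketched in a few lines; since the appended N(0, 1) factor is itself normalized, multiplying by it leaves Z_1 unchanged (function names ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_samples(samples_q1, rng):
    """Append an independent gamma ~ N(0,1) coordinate to each sample
    (beta_0, u) ~ q_1, giving samples from the augmented q_{1,aug}."""
    gamma = rng.standard_normal((samples_q1.shape[0], 1))
    return np.concatenate([samples_q1, gamma], axis=1)

def log_qtilde_1_aug(log_qtilde_1, sample_aug):
    """Unnormalized augmented log-density:
    log qtilde_1(beta_0, u) + log N(gamma; 0, 1)."""
    gamma = sample_aug[-1]
    return log_qtilde_1(sample_aug[:-1]) - 0.5 * gamma**2 - 0.5 * np.log(2 * np.pi)
```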
Let N_k be the number of MCMC samples drawn from q_k for k = 1, 2. In this example, we compare the performance of the f -GAN-Bridge estimator with the Warp-III Bridge estimator and the Warp-U Bridge estimator as we increase the number of MCMC samples N_1, N_2. We consider sample sizes N ∈ {1000, 2000, 3000, 4000, 5000}. This is a challenging task, since the sample size N is limited compared to the dimension of the problem (recall that q_1, q_2 are defined on R^{n+1}, R^{n+2} respectively with n = 537). For each choice of N, we repeatedly draw N_1 = N_2 = N MCMC samples from q_1, q_2 respectively and estimate the MSE of log r̂ for each method in the same way as in the previous example. For our Algorithm 2, we augment q_1 as described above, set λ_1, λ_2 = 0.1 and parameterize q_{1,aug}^(φ) as a Real-NVP with 10 coupling layers. For the Warp-U and Warp-III Bridge estimators, we still use the recommended or default settings. We do not include GBS in this example since we find that, for all values of N, it does not converge for most of the repetitions. From Figure 4 we see our Algorithm 2 outperforms the Warp-III and the Warp-U Bridge estimators for all sample sizes N. We also include a scatter plot of the first two dimensions of samples from q_{1,aug}, q_2 and the transformed q_{1,aug}^(φ_t), where q_{1,aug}^(φ_t) is obtained from Algorithm 2 with N = 3000. We see q_{1,aug}^(φ_t) and q_2 share much greater overlap than the original q_{1,aug}, q_2. From Figure 5 we see that for the same sample size N, the running time of f -GB is 4 ∼ 6 times as long as Warp-III, and roughly 30% ∼ 40% shorter than Warp-U. On the other hand, f -GB achieves the highest precision per second for all sample sizes N in this example. We further increase the sample size N, and find that Warp-U requires around 10^4 MCMC samples to reach a similar level of precision achieved by f -GB with N = 5000 samples, and takes around 2 times longer to run.
Similarly, Warp-III requires around 8 × 10^4 samples to reach a similar level of precision, and takes around three times longer to run. In the scatter plot, q_{1,aug}^(φ_t) is obtained from Algorithm 2 with n_i = n'_i = 1500 for i = 1, 2, and the first two dimensions of q_{1,aug} and q_2 are (β_0, γ) and (β_0, β_1) respectively.

Conclusion
In this paper, we give a new estimator of RE²(r̂_opt) based on the variational lower bound of f -divergence proposed by Nguyen et al. (2010), discuss the connection between Bridge estimators and the problem of f -divergence estimation, and give a computational framework to improve the optimal Bridge estimator using an f -GAN (Nowozin et al., 2016). We show that under the i.i.d. assumption, our f -GAN-Bridge estimator is optimal in the sense that it asymptotically minimizes the first order approximation of RE²(r̂_opt^(φ)) with respect to the transformed density q_1^(φ). In both simulated and real world examples, our f -GB estimator provides accurate estimates of r and outperforms existing methods significantly. In addition, Algorithm 2 also provides accurate estimates of RE²(r̂_opt^(φ)) and MSE(log r̂_opt^(φ)). In our experience, Algorithm 2 (f -GB) is computationally more demanding than the existing methods. In the numerical examples, the running time of Algorithm 2 is roughly 1 to 3 times as long as that of existing methods such as Warp-U and GBS when the sample sizes are the same.
We have not attempted to formalize the difference in computational cost because of the very different nature of GPU and CPU computing. Although in our examples it is possible for a competing method to match the performance of the f -GB estimator by increasing the number of samples drawn from q_1, q_2, doing so takes longer to run, and can be inefficient or impractical when sampling from q_1, q_2 is computationally expensive. This also means the f -GB estimator is especially appealing when we only have a limited number of samples from q_1, q_2. In summary, when q_1, q_2 are relatively simple-structured and low dimensional, the extra computational cost required by f -GB may not be worthwhile. However, when q_1, q_2 are high dimensional or have a complicated multi-modal structure, we recommend choosing the more accurate f -GB estimator of r, given the key summary role r plays in many applications and publications.

Limitations and future work
One limitation of the f -GB estimator is its computational cost. In this paper we parameterize q_1^(φ) as a Normalizing flow. A possible direction of future work is to explore different parameterizations of q_1^(φ). We expect that we can speed up Algorithm 2 by replacing the Normalizing flow with simpler transformations such as the Warp-I and Warp-II transformations (Meng and Schilling, 2002), at the expense of flexibility. Another limitation is that Algorithm 1 is only optimal when samples from q_1, q_2 are i.i.d. Recall that RE²(r̂_opt) in (13) is derived under the i.i.d. assumption. Therefore if the samples from q_1, q_2 are correlated, Proposition 3 no longer holds, and minimizing H_{s_2}(q_1^(φ), q_2) with respect to q_1^(φ) is no longer equivalent to minimizing the first order approximation of RE²(r̂_opt^(φ)). Hence it is of interest to see whether one can give an algorithm that minimizes the first order approximation of RE²(r̂_opt^(φ)) when the samples are correlated. In addition, our approach only focuses on estimating the ratio of normalizing constants between two densities.
When we have multiple unnormalized densities and would like to estimate the ratios between their normalizing constants, our approach needs to estimate these quantities separately in a pairwise fashion, which can be inefficient. Meng and Schilling (1996) and Geyer (1994) show that one can estimate multiple normalizing constants simultaneously up to a common multiplicative constant. We are also interested in extending our improvement strategy to this multiple-density setup.

SUPPLEMENTARY MATERIAL

8 Proofs
Here we give the proofs of Propositions 1, 2 and 3.
Now we show the consistency of r̂_π. This can be shown in a similar fashion to the proof of the consistency of an extremum estimator, e.g. Theorem 2.1 of Newey and McFadden (1994).
Finally, we show Ĝ(r̂_π; π, {ω_ij}) is a consistent estimator of H_π(q_1, q_2). Recall that G(r; π) = H_π(q_1, q_2). By the triangle inequality, the first term on the RHS converges to 0 in probability by the ULLN, and the second term on the RHS converges to 0 in probability by the continuous mapping theorem and the fact that r̂_π is a consistent estimator of r. Hence Ĝ(r̂_π; π, {ω_ij}) is a consistent estimator of H_π(q_1, q_2).
Proof. Note that the objective function can be rewritten using the identity of Uehara et al. (2016).
If r̂^(f) is a stationary point, then it satisfies the "score" equation, which can be rearranged into the form of a Bridge estimator. The claim then follows: the resulting transformation minimizes RE²(r̂_opt^(φ)) with respect to T_φ ∈ T up to the first order.

9 Dimension matching
The standard Bridge estimator (1) cannot be applied directly when Ω_1, Ω_2 have different dimensions. This is a common and important case: for example, if we would like to compare two models M_1, M_2 by estimating the Bayes factor between them, the standard Bridge estimator (1) is not directly applicable when M_1, M_2 are controlled by parameters that live in spaces of different dimensions.
This ensures the augmented density matches the dimension of q_2, where q̃*_1(ω_1, θ) is the unnormalized augmented density. Let Ω*_1 be the augmented support of q*_1. Since the augmented density q*_1(ω_1, θ) and the original q_1(ω_1) have the same normalizing constant, we can treat r = Z_1/Z_2 as the ratio between the normalizing constants of q*_1(ω_1, θ) and q_2(ω_2), and form an "augmented" Bridge estimator r̂*_α based on the augmented densities. Chen and Shao (1997) also show that when the free function α(ω) = α_opt(ω), the optimal augmenting density p_opt(θ|ω_1), which attains the minimal RE²(r̂*_{α_opt}), is q_2(θ|ω_1), i.e. p_opt(θ|ω_1) is the conditional distribution of the remaining d_2 − d_1 entries of ω_2 ∼ q_2 given that the first d_1 entries equal ω_1. However, q_2(θ|ω_1) is difficult to evaluate or sample from in general. One way to approximate the optimal augmenting distribution q_2(θ|ω_1) is to combine the augmented density q*_1(ω_1, θ) with a Normalizing flow (see Sec 2.1). Assume we start with an arbitrary augmenting density p(θ|ω_1), e.g. a standard Normal N(0, I_{d_2−d_1}). Consider a Normalizing flow with base density q*_1 and a smooth and invertible transformation T*_1 : Ω*_1 → Ω_2 that aims to map the augmented q*_1 to the target q_2. If the transformed density is a good approximation to q_2, then we expect the transformed augmenting density to approximate q_2(θ|ω_1) as well. This means the transformed q*_1^(T) automatically learns the optimal augmenting density.
10 Bias in the estimator of H_π(q_1, q_2) given in Proposition 1

In Proposition 1, the estimator of H_π(q_1, q_2) is given as the maximum of the function Ĝ(r̂; π, {ω_ij}) w.r.t. r̂. Let r = Z_1/Z_2 be the true ratio of normalizing constants. Even though Ĝ(r; π, {ω_ij}) is an unbiased estimator of H_π(q_1, q_2), our proposed estimator Ĝ(r̂_π; π, {ω_ij}) suffers from a positive bias. Intuitively speaking, this bias is analogous to the fact that the training error of a model is an underestimate of the true error. We use a toy example to illustrate this bias. Let x ∈ R³, σ_1 = 1 and σ_2 = 3, and let q̃_1, q̃_2 be the unnormalized pdfs of two Gaussian distributions with zero mean and covariances σ_1 I_3, σ_2 I_3 respectively, where I_p is the p × p identity matrix. Let q_1, q_2 be the corresponding normalized densities. Let π = 0.5. It is straightforward to form an unbiased MC estimate of H_π(q_1, q_2) using (12). Let N ∈ {10, 20, 30, ..., 1000}. For each value of N, we repeatedly compute the proposed estimator Ĝ(r̂_π; π, {ω_ij}) 1000 times based on n_1 = n_2 = N i.i.d. samples from q_1, q_2 respectively. We then report the sample mean of the repeated estimates for each N, and compare it with a high-precision unbiased MC estimate of H_π(q_1, q_2). From Figure 7 we see Ĝ(r̂_π; π, {ω_ij}) does exhibit a positive bias when N < 500, and this bias gradually vanishes as the sample size increases.
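For reference, a direct MC estimate of the weighted harmonic divergence in this toy setting can be sketched as follows. The exact form of (12) is not reproduced in this text, so we assume the standard form H_π(q_1, q_2) = 1 − E_{q_2}[q_1/(π q_1 + (1 − π) q_2)] with normalized densities; function names are ours:

```python
import numpy as np

def h_pi_mc(samples_q2, logq1, logq2, pi):
    """MC estimate of the weighted harmonic divergence, ASSUMING the form
    H_pi(q1, q2) = 1 - E_{q2}[ q1 / (pi*q1 + (1-pi)*q2) ]
    with NORMALIZED densities q1, q2 (as in the Gaussian toy example)."""
    q1 = np.exp(logq1(samples_q2))
    q2 = np.exp(logq2(samples_q2))
    return 1.0 - np.mean(q1 / (pi * q1 + (1.0 - pi) * q2))
```

Unlike Ĝ(r̂_π; π, {ω_ij}), this estimate does not involve an inner maximization, which is why it is unbiased and serves as the reference curve in Figure 7.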
Even though we have not found a practical strategy to correct this bias, we believe it does not prevent our proposed estimator from being useful in practice. Since our estimator of RE²(r̂_opt) in (19) is a monotonically increasing function of Ĝ(r̂_π; π, {ω_ij}), the positive bias in Ĝ(r̂_π; π, {ω_ij}) leads to a positive bias in the estimated RE²(r̂_opt). Therefore the estimated RE²(r̂_opt) will systematically overestimate the true error RE²(r̂_opt), which leads to more conservative conclusions (e.g. wider error bars). This is certainly not ideal, but we believe that in practice it is less harmful than underestimating the variability in r̂_opt. In addition, we see that our proposed error estimator provides accurate estimates of the MSE of log r̂ in the numerical examples.

Figure 7: Sample mean of the estimated H_π(q_1, q_2) for each sample size N. The blue band represents the 2σ error bars of the sample means. The orange line represents a high-precision unbiased MC estimate of H_π(q_1, q_2), and the orange band its 2σ error bar.

Example 1 (KL divergence and the Importance sampling estimator). KL(q_1, q_2) is an f -divergence with f(u) = u log u, f′(u) = 1 + log u and f*(t) = exp(t − 1). This specification corresponds to V_r̂(ω) = 1 + log( q̃_1(ω) / (q̃_2(ω) r̂) ). Suppose we have {ω_1j} ∼ q_1 and {ω_2j} ∼ q_2. The maximizer r̂_KL of equation (20) under this specification is the Importance sampling estimator of r using q_2 as the proposal, which is a special case of a Bridge estimator with free function α(ω) = q̃_2(ω)^{−1}. Therefore we recover the Importance sampling estimator from the problem of estimating KL(q_1, q_2). It is also straightforward to verify that estimating KL(q_2, q_1) leads to the Reciprocal Importance sampling estimator of r by a similar argument.
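The two estimators recovered in Example 1 can be sketched directly; the log-sum-exp stabilization and the function names are ours:

```python
import numpy as np

def r_hat_IS(log_qtilde1, log_qtilde2, samples_q2):
    """Importance sampling estimator of r = Z1/Z2 with proposal q2:
    r_hat = (1/n2) * sum_j qtilde1(w_2j) / qtilde2(w_2j),
    recovered from estimating KL(q1, q2); computed in log space."""
    lw = log_qtilde1(samples_q2) - log_qtilde2(samples_q2)
    return np.exp(np.log(np.mean(np.exp(lw - lw.max()))) + lw.max())

def r_hat_RIS(log_qtilde1, log_qtilde2, samples_q1):
    """Reciprocal importance sampling estimator, arising from KL(q2, q1):
    r_hat = 1 / [ (1/n1) * sum_j qtilde2(w_1j) / qtilde1(w_1j) ]."""
    lw = log_qtilde2(samples_q1) - log_qtilde1(samples_q1)
    return 1.0 / np.exp(np.log(np.mean(np.exp(lw - lw.max()))) + lw.max())
```

Both are special cases of the Bridge estimator: α(ω) = q̃_2(ω)^{−1} gives the first, and the analogous choice gives the second.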
Example 2 (Weighted Jensen-Shannon divergence and the optimal Bridge estimator). The weighted Jensen-Shannon divergence is defined as JS_π(q_1, q_2) = π KL(q_1, q_π) + (1 − π) KL(q_2, q_π), where π ∈ (0, 1) is the weight parameter and q_π = π q_1 + (1 − π) q_2 is a mixture of q_1 and q_2.
In addition to the fact that Bridge estimators with different choices of the free function α(ω) can arise from estimating different f -divergences, the asymptotic RMSE of r̂_KL, r̂_opt and r̂_geo can also be written as functions of certain f -divergences between the two distributions.
For example, Meng and Wong (1996) show that RE²(r̂_geo) is a function of the Hellinger distance between q_1, q_2, and Wang et al. (2020) show that RE²(r̂_opt) is a function of H_π(q_1, q_2) in (13). It is also straightforward to show that RE²(r̂_KL) is a function of the Rényi 2-divergence between q_1, q_2 using the formula for RE²(r̂_α) given by (3.2) in Meng and Wong (1996).
However, the general connection between RE²(r̂_α) and f -divergences between the two distributions is not obvious. For example, suppose we choose the constant free function α(ω) = 1 discussed in Meng and Wong (1996). Then we can work out the asymptotic RMSE of the corresponding Bridge estimator r̂_1 using the formula for RE²(r̂_α) in Meng and Wong (1996), with q_1, q_2 defined on a common support Ω. It is not obvious how the resulting expression for RE²(r̂_1) can be rearranged into a function of some f -divergence between q_1, q_2, as its leading term is a ratio of integrals, which differs from the general functional form of an f -divergence. This example suggests that there may not be a general connection between the f -divergence between two distributions and the RMSE of a Bridge estimator, apart from common Bridge estimators such as the optimal Bridge estimator and the geometric Bridge estimator. We have also tried the other direction: starting from an f -divergence, Proposition 2 shows that estimating it leads to a Bridge estimator with a specific free function α_f(ω). Substituting this α_f(ω) into the formula for RE²(r̂_α) in Meng and Wong (1996), the functional form of the resulting expression is again very different from that of an f -divergence in the general case, and it is not obvious how it can be rearranged into a function of some f -divergence between the two distributions. This also suggests that there may not be a general connection between RE²(r̂_α) and f -divergences between two distributions.
12 Other choices of f -divergence

The weighted Harmonic divergence H_π(q_1^(φ), q_2) is not the only choice of f -divergence to minimize if our goal is to increase the overlap between q_1^(φ) and q_2. Recall that in Algorithm 2 we parameterize q_1^(φ) as a Normalizing flow. Since both q̃_1, q̃_2 are available, it is also possible to estimate q_1^(φ) by maximizing the log likelihood Σ_j log q̃_2(T_φ(ω_1j)) or Σ_j log q̃_1^(φ)(ω_2j) without using the f -GAN framework; this is asymptotically equivalent to minimizing KL(q_1^(φ), q_2) or KL(q_2, q_1^(φ)) respectively. In addition to the KL divergence, other common f -divergences such as the Squared Hellinger distance and the weighted Jensen-Shannon divergence are also sensible measures of overlap between densities, and we can minimize these divergences using the f -GAN framework in a similar fashion to Algorithm 1. However, f -divergences such as the KL divergence, the Squared Hellinger distance and the weighted Jensen-Shannon divergence are inefficient compared to the weighted Harmonic divergence H_{s_2}(q_1^(φ), q_2) if our goal is to minimize RE²(r̂_opt^(φ)). In Proposition 3 we have shown that under the i.i.d. assumption, minimizing H_{s_2}(q_1^(φ), q_2) with respect to q_1^(φ) is equivalent to minimizing the first order approximation of RE²(r̂_opt^(φ)) directly. On the other hand, Meng and Wong (1996) show that asymptotically, up to the first order, RE²(r̂_opt^(φ)) can be bounded in terms of the Squared Hellinger distance H²(q_1^(φ), q_2), where n = n_1 + n_2 and s_i = n_i/n for i = 1, 2, under the same i.i.d. assumption. Note that H²(q_1^(φ), q_2) → 0 also implies RE²(r̂_opt^(φ)) → 0, but minimizing H²(q_1^(φ), q_2) with respect to the density q_1^(φ) can only be viewed as minimizing an upper bound of the first order approximation of RE²(r̂_opt^(φ)), which is less efficient. Here we show minimizing JS_π(q_1^(φ), q_2) can also be viewed as minimizing an upper bound of the first order approximation of RE²(r̂_opt^(φ)).
13 Implementation details of Algorithm 2

Choosing the transformation T_φ

We parameterize q̃_1^(φ) as a Real-NVP (Dinh et al., 2016) with base density q̃_1 (see Sec 2.1 for a brief description of Normalizing flow models and Real-NVP). As discussed before, this ensures that q̃_1^(φ) is both flexible and computationally tractable, and that its normalizing constant is unchanged. It is possible to specify q̃_1^(φ) using a simpler parameterization, e.g.
the Warp-III transformation (Meng and Schilling, 2002). However, such a parameterization is not as flexible as a Normalizing flow. It is also possible to replace the Real-NVP with more sophisticated Normalizing flow architectures, e.g. Autoregressive flows (Papamakarios et al., 2017) or Neural Spline flows (Durkan et al., 2019), but we find a Real-NVP is sufficient for us to illustrate our ideas and achieve satisfactory results in both simulated and real world examples. In addition, both the forward and inverse transformations of a Real-NVP can be computed efficiently. This is an appealing feature, since we need both T_φ and T_φ^{−1} for evaluating L_{λ_1,λ_2}(φ, r̂; s_2, {ω_ij}). Therefore we choose a Real-NVP in Algorithm 2, as it has a good balance of flexibility and computational efficiency.
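The cheap invertibility of Real-NVP comes from its affine coupling layers: both directions require only a forward pass of the scale and shift networks. A minimal numpy sketch of one coupling layer (s_fn, t_fn stand in for the networks; names ours):

```python
import numpy as np

def coupling_forward(x, s_fn, t_fn, mask):
    """One Real-NVP affine coupling layer. Entries selected by mask pass
    through unchanged; the rest are affinely transformed using scale/shift
    functions of the masked part only."""
    x_fixed = x * mask
    s, t = s_fn(x_fixed), t_fn(x_fixed)
    y = x_fixed + (1 - mask) * (x * np.exp(s) + t)
    log_det = np.sum((1 - mask) * s, axis=-1)  # log |det Jacobian|
    return y, log_det

def coupling_inverse(y, s_fn, t_fn, mask):
    """Inverse of the coupling layer; equally cheap, since the masked part
    of y equals the masked part of x, so s and t can be recomputed."""
    y_fixed = y * mask
    s, t = s_fn(y_fixed), t_fn(y_fixed)
    return y_fixed + (1 - mask) * (y - t) * np.exp(-s)
```

Stacking several such layers with alternating masks, and summing the log-determinants, gives the change-of-variables correction needed to evaluate the transformed density q̃_1^(φ).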
The number of coupling layers in a Real-NVP controls its flexibility. Choosing too few coupling layers restricts the flexibility of the Real-NVP, while choosing too many of them increases the computational cost and the risk of overfitting. Choosing the optimal number of coupling layers that balances computational cost and flexibility is problem-dependent.
We demonstrate this using the mixture of rings example in Sec 5 with p = 12. Similar to Sec 5, we set λ_1 = λ_2 = 0.05 and N_1 = N_2 = 2000. We consider different numbers of coupling layers K ∈ {2, 4, 6, 8, 10, 12, 14, 16, 18}. For each choice of K, we parameterize q_1^(φ) in Algorithm 2 as a Real-NVP with K coupling layers, then run Algorithm 2 50 times. We record the average running time and an MC estimate of MSE(log r̂_opt^(φ_t)) based on the repeated runs for each K. From Figure 8 we see the running time is roughly a linear function of the number of coupling layers K. As K increases, the estimated MSE(log r̂_opt^(φ_t)) first decreases then starts increasing, which is likely due to overfitting. Similar to Wang et al. (2020), we also use precision per second, the reciprocal of the product of the average running time and the estimated mean square error, as a benchmark of efficiency. We see that the estimated precision per second is high when K is between 2 and 6, and starts decreasing rapidly when K ≥ 8. Therefore for this example the most efficient choice of K is between 2 and 6. In practice, we recommend setting q_1^(φ) to be a Real-NVP with 2 to 10 coupling layers in Algorithm 2, and running Algorithm 2 multiple times with different numbers of coupling layers in q_1^(φ).

Splitting the samples from q_1, q_2

In Algorithm 2, we first estimate {φ_t, r̂_t} using the training samples {ω_ij}, then compute the optimal Bridge estimator based on the separate estimating samples {ω'_ij}. We use separate samples for the Bridge sampling step because the estimated transformed density q_1^(φ_t) in Algorithm 2 is chosen based on the training samples {ω_ij} for i = 1, 2. This means the density of the distribution of the transformed training samples {T_{φ_t}(ω_1j)} is no longer proportional to q̃_1^(φ_t)(T_{φ_t}(ω_1j)) for j = 1, ..., n_1, as φ_t can be viewed as a function of {ω_ij}.
If we apply the iterative procedure (4) to the densities q_1^(φ_t), q_2 using the transformed training samples {T_{φ_t}(ω_1j)}, {ω_2j}, then the resulting r̂_opt^(φ_t) will be a biased estimate of r. See also Wong et al. (2020) for a detailed discussion under a similar setting. One way to correct this bias is to split the samples from q_1, q_2 into training samples {ω_ij} and estimating samples {ω'_ij} for i = 1, 2. We first estimate the transformation T_{φ_t} using the training samples {ω_ij}, i = 1, 2. Once we have obtained the estimated φ_t, we apply the iterative procedure (4) to the estimating samples, so that the resulting r̂_opt^(φ_t) does not suffer from this bias. The same approach is used in Wang et al. (2020) and Jia and Seljak (2020), and the idea of eliminating this bias by splitting the samples from q_1, q_2 is further discussed in Wong et al. (2020). The above argument also applies to the estimation of RE²(r̂_opt^(φ_t)): we compute RE²(r̂_opt^(φ_t)) based on the independent estimating samples using (19). Since finding RE²(r̂_opt^(φ_t)) is a 1-d optimization problem, the additional computational cost is negligible compared to the rest of Algorithm 2. In practice, we recommend setting n_i = n'_i for i = 1, 2, i.e. splitting the samples from q_1, q_2 into equally sized training and estimating samples.
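The splitting step itself is trivial to implement; a sketch (function name ours):

```python
import numpy as np

def split_samples(samples, rng):
    """Split the samples from q_i into equally sized training and estimating
    halves (n_i = n'_i), so the Bridge step uses samples independent of the
    fitted transformation T_phi and avoids the adaptive bias."""
    idx = rng.permutation(len(samples))
    half = len(samples) // 2
    return samples[idx[:half]], samples[idx[half:]]
```

For MCMC output, random permutation also breaks up any ordering of the chain before the halves are formed.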

Finding the saddle point using the alternating gradient method
In Algorithm 2, we aim to find a saddle point of L_{λ_1,λ_2}(φ, r̂; s_2, {ω_ij}) using the alternating gradient method. This approach is adapted from Algorithm 1 of Nowozin et al. (2016).
The authors show that their Algorithm 1 converges geometrically to a saddle point under suitable conditions. In the alternating training process of Algorithm 2, updating r̂_{t+1} is a 1-d optimization problem when φ_t is treated as fixed at any step t. Hence we can also directly find r̂_{φ_t} = arg max_{r̂ ∈ R^+} L_{λ_1,λ_2}(φ_t, r̂; s_2, {ω_ij}) instead of performing a single gradient ascent step on r̂_t. By Propositions 1 and 2, r̂_{φ_t} can be viewed as a (biased) Bridge estimator of r given φ_t. However, such an estimator r̂_{φ_t} is not reliable when q_1^(φ_t) and q_2 share little overlap.
Therefore directly optimizing r̂ at each iteration t is not always necessary in practice, especially at the early stage of training when q_1^(φ_t) is not yet a sensible approximation of q_2.
In addition, the gradient ascent update of r̂_t is computationally cheaper than finding the optimizer r̂_{φ_t} directly. Therefore we follow Nowozin et al. (2016) and use the alternating gradient method to find the saddle point of L_{λ_1,λ_2}(φ, r̂; s_2, {ω_ij}). We only recommend optimizing r̂_t directly in Algorithm 2 when we know q_1^(φ_t) and q_2 have at least some degree of overlap.
Note that {φ_t, r̂_t} being approximately a saddle point of the objective function does not necessarily imply that it solves min_{φ ∈ R^l} max_{r̂ ∈ R^+} L_{λ_1,λ_2}(φ, r̂; s_2, {ω_ij}). For r̂_t, it is easy to verify whether r̂_t is indeed the maximizer of L_{λ_1,λ_2}(φ_t, r̂; s_2, {ω_ij}) w.r.t. r̂ ∈ R^+, since this is a 1-d optimization problem. However, for φ_t there is no guarantee that it is the global minimizer of L_{λ_1,λ_2}(φ, r̂_t; s_2, {ω_ij}) w.r.t. φ ∈ R^l. One way to address this problem is to run Algorithm 2 multiple times and choose the q_1^(φ_t) that attains the smallest objective value. In the numerical examples, we find q_1^(φ_t) returned from Algorithm 2 is almost always a good approximation of q_2, so we do not worry about this problem in practice.
In the alternating training process, the absolute difference between $L_{\lambda_1,\lambda_2}(\phi_t, \hat r_t; s_2, \{\omega_{ij}\}_{j=1}^{n_i})$ and $L_{\lambda_1,\lambda_2}(\phi_{t-1}, \hat r_{t-1}; s_2, \{\omega_{ij}\}_{j=1}^{n_i})$ falling below the tolerance level $\epsilon_1$ at some iteration $t$ does not necessarily imply that a saddle point has been reached. Therefore we also need to monitor the sequence $\{\hat r_t\}$, $t = 0, 1, 2, \ldots$ during training. If $|\hat r_t - \hat r_{t-1}| > \epsilon_2$, then $\hat r_t$ has not converged to a stationary point, regardless of the value of the objective function. In other words, we know $\{\phi_t, \hat r_t\}$ has approximately converged to a saddle point only if both the objective function $L_{\lambda_1,\lambda_2}(\phi_t, \hat r_t; s_2, \{\omega_{ij}\}_{j=1}^{n_i})$ and $\hat r_t$ have stopped changing. In practice, we recommend setting $\epsilon_1 \in (10^{-3}, 10^{-1})$, depending on the scale of the objective function, and $\epsilon_2 \in (10^{-3}, 10^{-2})$.
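As a hedged illustration of this alternating scheme and the dual stopping rule, the sketch below uses a toy quadratic saddle objective standing in for $L_{\lambda_1,\lambda_2}$ (all names and values here are hypothetical, not the paper's implementation):

```python
# Toy stand-in objective with a saddle at (phi, r) = (1.0, 2.0):
# convex in phi (minimized) and concave in r (maximized).
def L(phi, r):
    return (phi - 1.0) ** 2 - (r - 2.0) ** 2

def grad_phi(phi, r, h=1e-6):
    return (L(phi + h, r) - L(phi - h, r)) / (2 * h)

def grad_r(phi, r, h=1e-6):
    return (L(phi, r + h) - L(phi, r - h)) / (2 * h)

def alternating_gd(phi0, r0, lr=0.1, eps1=1e-4, eps2=1e-3, max_iter=1000):
    phi, r = phi0, r0
    prev_obj, prev_r = L(phi, r), r
    for t in range(max_iter):
        phi -= lr * grad_phi(phi, r)  # descent step in phi
        r += lr * grad_r(phi, r)      # ascent step in r
        obj = L(phi, r)
        # stop only when BOTH the objective and r have stabilized,
        # mirroring the two tolerances eps1 and eps2 in the text
        if abs(obj - prev_obj) < eps1 and abs(r - prev_r) < eps2:
            break
        prev_obj, prev_r = obj, r
    return phi, r

phi_hat, r_hat = alternating_gd(0.0, 0.0)
```

On this toy problem the iterates contract geometrically toward the saddle, and the loop exits once both monitored quantities stop changing, as recommended above.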

Effectiveness of the hybrid objective
As discussed previously, we introduce the hybrid objective to stabilize the alternating training process and accelerate the convergence of Algorithm 2. Here we demonstrate its effectiveness using the mixture of rings example in Sec 5 with $p = 12$, $\mu_{11} = (2, 2)$, $\mu_{12} = (-2, -2)$, $\mu_{21} = (2, -2)$, $\mu_{22} = (-2, 2)$, $b_1 = 3$, $b_2 = 6$, $\sigma_1 = 1$, $\sigma_2 = 2$. We set $q_1^{(\phi)}$ to be a Real-NVP with 5 coupling layers. We first run Algorithm 2 50 times with $n_i = n_i' = 1000$ for $i = 1, 2$ and $\lambda_1 = \lambda_2 = 0.05$, recording the values of the objective function and $\hat r_t$ over the first 25 iterations. We then run Algorithm 2 another 50 times with $n_i = n_i' = 1000$ for $i = 1, 2$ and $\lambda_1 = \lambda_2 = 0$, and record the same values. Recall that setting $\lambda_1 = \lambda_2 = 0$ is equivalent to using the original $f$-GAN objective (25). From Figure 9 we see that most of the hybrid objectives and the corresponding $\hat r_t$ values have stabilized after 20 iterations. The standalone $f$-GAN objective with $\lambda_1 = \lambda_2 = 0$ also shows a decreasing trend, but the objective values fluctuate much more than the hybrid objective due to the adversarial training process, and there is no sign of convergence within 25 iterations. Note that for both the hybrid objective and the original $f$-GAN objective, the corresponding $\hat r_t$ tends to converge to a value slightly different from the true $r$ as the number of iterations increases. This is likely due to the bias we discussed previously.
Our goal is to estimate $\log r = \log Z_1 - \log Z_2 = -\log Z_2$ by first increasing the overlap between $q_1$ and $q_2$ using a Normalizing flow, and then computing the asymptotically optimal Bridge estimator of $r$ based on the transformed distributions. Let $N \in \{500, 1000, 1500, 2000, 2500\}$.
To demonstrate the effectiveness of this approach, for each value of $N$ we first draw $n_1 = n_2 = N$ training samples $\{\omega_{1j}\}_{j=1}^{n_1}$ and $\{\omega_{2j}\}_{j=1}^{n_2}$ from $q_1$, $q_2$ respectively, and use an autoregressive discrete flow (Tran et al., 2019) to estimate a bijective transformation $T$ that maps $q_1$ to $q_2$, based on the training samples and the training procedure given by Tran et al. (2019). One key distinction between discrete flows (Tran et al., 2019) and their continuous counterparts (e.g. Dinh et al., 2016; Kingma et al., 2016) is that for discrete flows, the base distribution $q_1$ is treated as a model parameter and is estimated jointly with the transformation $T$. In our example, this means the parameterization (i.e. the $90 \times 90$ probability table) we chose for $q_1$ in (71) is only treated as the "initial value" of the model parameters, and is updated alongside the transformation $T$. (Note that when $q_1$, $q_2$ have a large number of discrete states, storing or updating the probability table of the base $q_1$ is computationally infeasible. To alleviate this problem, Tran et al. (2019) also considered more sophisticated parameterizations of the "trainable base" $q_1$, such as the autoregressive Categorical distribution.) Let $\hat T$ be the estimated transformation and $\hat q_1$ the updated base distribution (which is also completely known and easy to sample from).

Let $\hat q_1^{(\hat T)}$ be the transformed distribution obtained by applying $\hat T$ to samples from the updated $\hat q_1$. We then draw another $n_1 = n_2 = N$ estimating samples $\{\omega_{1j}\}_{j=1}^{n_1}$ and $\{\omega_{2j}\}_{j=1}^{n_2}$ from $q_1$, $q_2$ respectively, and compute the optimal Bridge estimator $\hat r_{opt}$ based on the transformed distributions. We find $\hat r_{opt}$ is a reliable estimator of $r$ for all choices of $N$, and is much more accurate than the optimal Bridge estimator based on the original $q_1$, $q_2$. From Figure 11 we also see that the transformed $\hat q_1^{(\hat T)}$ accurately captures the multimodal structure of $q_2$.
In addition to the quantized mixture of Gaussians example, more substantial applications of discrete flows can be found in Tran et al. (2019). However, the discrete flows of Tran et al. (2019) are in general not directly applicable to our proposed Algorithm 2. This is because in Algorithm 2 the unnormalized densities $\tilde q_1$ and $\tilde q_2$ are specified by the users and can therefore be arbitrary, whereas for discrete flows the base distribution has to be completely known and is treated as a trainable model parameter (as in this example). This means we are not able to use discrete flows to directly estimate the ratio of normalizing constants between two arbitrary unnormalized probability mass functions in the same way as in Algorithm 2. Nevertheless, one may obtain an estimator of the ratio of normalizing constants between two discrete distributions by estimating their normalizing constants separately, using discrete flows and the procedure used in this example. For future work, we are interested in extending Algorithm 2 so that it can handle arbitrary unnormalized pmfs using, e.g., more sophisticated Normalizing flow architectures.
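One minimal way to realize the "estimate the normalizing constants separately" route is plain importance sampling against a completely known proposal. The sketch below uses a uniform pmf standing in for a trained discrete-flow distribution; the unnormalized pmfs and constants are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_log_Z(q_unnorm, proposal_pmf, n=50_000):
    """Importance-sampling estimate of log Z, where Z = sum_x q_unnorm(x),
    using a fully known proposal pmf (e.g. a trained discrete flow)."""
    states = np.arange(len(proposal_pmf))
    x = rng.choice(states, size=n, p=proposal_pmf)
    w = q_unnorm[x] / proposal_pmf[x]  # importance weights, E[w] = Z
    return np.log(w.mean())

# two illustrative unnormalized pmfs with known normalizing constants
q1_tilde = 3.0 * np.array([0.1, 0.2, 0.3, 0.4])      # Z1 = 3
q2_tilde = 6.0 * np.array([0.25, 0.25, 0.25, 0.25])  # Z2 = 6
uniform = np.full(4, 0.25)                           # stand-in proposal

# ratio estimate from two separate normalizing-constant estimates
log_r = estimate_log_Z(q1_tilde, uniform) - estimate_log_Z(q2_tilde, uniform)
```

Here `log_r` should be close to $\log(3/6) \approx -0.693$; in practice the proposal would be the trained discrete flow rather than a uniform pmf, and the variance of the estimate depends on how well the proposal overlaps each target.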

Simulated example: Mixture of t-distributions
In this example, we let $q_1$ and $q_2$ be two mixtures of $p$-dimensional t-distributions. We are interested in this example because both $q_1$, $q_2$ are multimodal and heavy-tailed. For $i = 1, 2$, let $q_i(\omega) = \sum_{k=1}^K \pi_{ik}\, p_t(\omega; \mu_{ik}, \Sigma_i, \nu_i)$, where $K$ is the number of components, $\pi_i = \{\pi_{ik}\}_{k=1}^K$ are the mixing weights, and $p_t(\cdot; \mu_{ik}, \Sigma_i, \nu_i)$, the $k$th component of $q_i$, is the pdf of a multivariate t-distribution with mean $\mu_{ik} \in \mathbb{R}^p$, positive-definite scale matrix $\Sigma_i \in \mathbb{R}^{p \times p}$ and degrees of freedom $\nu_i \in \mathbb{R}^+$. Note that all $K$ components of $q_i$ share the same scale matrix and degrees of freedom. Let $Z_i$ be the normalizing constant associated with $p_t(\cdot; \mu_{ik}, \Sigma_i, \nu_i)$; note that this quantity does not depend on $\mu_{ik}$. Let $\tilde p_t(\cdot; \mu_{ik}, \Sigma_i, \nu_i) = Z_i\, p_t(\cdot; \mu_{ik}, \Sigma_i, \nu_i)$ be the unnormalized density of each component, and let $\tilde q_i(\omega) = \sum_{k=1}^K \pi_{ik}\, \tilde p_t(\omega; \mu_{ik}, \Sigma_i, \nu_i)$ be the unnormalized density of $q_i(\omega)$. It is easy to verify that $\tilde q_i(\omega) = Z_i q_i(\omega)$, i.e. the normalizing constant of $\tilde q_i(\omega)$ is $Z_i$.
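The mixture-of-t construction above can be sketched directly; the parameters below ($p = 2$, $K = 2$, and the stand-in value of $Z_i$) are illustrative choices, not the paper's settings:

```python
import numpy as np
from scipy.stats import multivariate_t

# Illustrative parameters: p = 2 dimensions, K = 2 components
p, K = 2, 2
mus = [np.zeros(p), np.full(p, 3.0)]  # component means mu_ik
Sigma, nu = np.eye(p), 5.0            # shared scale matrix and degrees of freedom
pis = [0.5, 0.5]                      # mixing weights pi_ik

def q(omega):
    """Normalized mixture density: q_i(w) = sum_k pi_ik p_t(w; mu_ik, Sigma, nu)."""
    return sum(pi * multivariate_t(mu, Sigma, df=nu).pdf(omega)
               for pi, mu in zip(pis, mus))

Z = 4.0  # hypothetical stand-in for the component normalizing constant Z_i

def q_tilde(omega):
    """Unnormalized density: q_tilde_i = Z_i * q_i, since every component
    is scaled by the same Z_i."""
    return Z * q(omega)
```

Because every component is scaled by the same constant, `q_tilde(omega) / q(omega)` recovers $Z_i$ at any $\omega$, which is exactly the claim $\tilde q_i(\omega) = Z_i q_i(\omega)$.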
We estimate $\log r = \log Z_1 - \log Z_2$ in a similar fashion to the previous examples. For each choice of $p$, we run each method 30 times. Let $\hat r$ be a generic estimate of $r$. We use the MC estimate of the relative MSE of $\log \hat r$, i.e. $E((\log \hat r - \log r)^2)/(\log r)^2$, based on the repeated runs as the benchmark of performance for this example. For each repetition, we run each method with $N_1 = N_2 = 6000$ independent samples from $q_1$, $q_2$ respectively. For our Algorithm 2, we parameterize $q_1^{(\phi)}$ as a Real-NVP with 20 coupling layers, and set $\lambda_1 = \lambda_2 = 0.01$.
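The relative-MSE benchmark just defined is straightforward to compute from the repeated runs; a minimal sketch with illustrative numbers (not results from the paper):

```python
import numpy as np

def rel_mse_log(r_hats, r_true):
    """MC estimate of E((log r_hat - log r)^2) / (log r)^2 over repeated runs."""
    sq_err = (np.log(r_hats) - np.log(r_true)) ** 2
    return sq_err.mean() / np.log(r_true) ** 2

# illustrative check: four estimates off by +/-0.1 on the log scale,
# with true log r = 2, so each squared error is 0.01 and the benchmark
# is 0.01 / 2^2 = 0.0025
r_true = np.exp(2.0)
r_hats = np.exp(np.array([2.1, 1.9, 2.1, 1.9]))
benchmark = rel_mse_log(r_hats, r_true)
```

Normalizing by $(\log r)^2$ makes the benchmark comparable across problems with very different magnitudes of $\log r$.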

Computational cost of Algorithm 2
Comparing the computational cost of our Algorithm 2 with existing methods and their existing implementations is not straightforward because of the very different nature of GPU and CPU computing.
In both examples, we compare the existing CPU implementations of Warp-III, Warp-U and GBS with a GPU implementation of our Algorithm 2 in terms of wall-clock time. We believe this comparison is fair because there is no simple way to accelerate the existing algorithms with a GPU, whereas GPU training of neural networks is a built-in feature of the deep learning frameworks, such as Torch (Paszke et al., 2017), used to implement Algorithm 2. If a user has access to both a CPU and a GPU, then the wall-clock time can to some extent be viewed as a natural metric of the time cost a user has to pay for the estimator, and the Precision per Second benchmark can be viewed as the cost-performance ratio of these methods. This measure is not a rigorous metric for comparing computational costs, but we believe it at least gives users an intuitive, rough idea of the time cost and efficiency of these algorithms.
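Under one plausible reading of this benchmark (statistical precision taken as the inverse of the MSE of $\log \hat r$; this particular formula is our assumption, not a definition from the text), the cost-performance ratio is simply:

```python
def precision_per_second(mse_log_r, wall_seconds):
    """Cost-performance ratio: statistical precision (assumed here to be
    1 / MSE(log r_hat)) delivered per second of wall-clock time."""
    return (1.0 / mse_log_r) / wall_seconds

# e.g. a method reaching MSE 0.01 in 10 seconds scores (1/0.01)/10 = 10
score = precision_per_second(0.01, 10.0)
```

A slower method can still win on this metric if its precision advantage outweighs its extra runtime.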
From the numerical examples we see that $f$-GB scales better with dimension than its competitors. For example, in Example 1 we see that even though Warp-III can be 30-40 times faster to compute than our proposed method given the same number of samples from $q_1$, $q_2$, its error (measured by $MSE(\log \hat r)$) is orders of magnitude greater than that of our approach. When $p = 48$, we find that Warp-III is not able to return a sensible estimate of $r$ even with 25 times more samples from $q_1$, $q_2$ than $f$-GB. In Example 2 we also find that Warp-III requires around 18 times more samples to achieve a similar level of accuracy to $f$-GB, and takes around 3 times longer to run. We therefore believe the extra computational cost of our $f$-GB estimator is well spent, as the numerical examples show that Algorithm 2 returns an estimate of $r$ with much higher precision than GBS, Warp-III and Warp-U, and scales better with the dimension of the distributions.
As we acknowledge in Sec 7, if $q_1$, $q_2$ are simply structured and low dimensional, then users can obtain adequate precision more quickly using Warp-III or Warp-U. On the other hand, in speaking to users who report Bayes factors in applied Bayesian work, the overwhelming requirement was that the estimate be reliable, as the Bayes factor value is sometimes the crux of the analysis. In all the numerical examples we considered, $f$-GB never broke down and produced accurate estimates, whereas Warp-III and GBS did break down on the larger problems. Warp-U took a similar amount of time to run as $f$-GB, but was less accurate. Therefore users may still prefer a method that is "over-powered" but more reliable.