Variance reduction for Metropolis–Hastings samplers

We introduce a general framework that constructs estimators with reduced variance for random walk Metropolis and Metropolis-adjusted Langevin algorithms. The resulting estimators require negligible computational cost and are derived in a post-processing manner, utilising all proposal values of the Metropolis algorithms. Variance reduction is achieved by producing control variates through the approximate solution of the Poisson equation associated with the target density of the Markov chain. The proposed method is based on approximating the target density with a Gaussian and then utilising accurate solutions of the Poisson equation for the Gaussian case. This leads to an estimator that uses two key elements: (1) a control variate from the Poisson equation that contains an intractable expectation under the proposal distribution, and (2) a second control variate to reduce the variance of a Monte Carlo estimate of this latter intractable expectation. Simulated data examples are used to illustrate the impressive variance reduction achieved in the Gaussian target case and the corresponding effect when the Gaussianity assumption on the target is violated. Real data examples on Bayesian logistic regression and stochastic volatility models verify that considerable variance reduction is achieved with negligible extra computational cost.


Introduction
Statistical methods for reducing the bias and the variance of estimators have played a prominent role in Monte Carlo based numerical algorithms. Variance reduction via control variates has a long and well studied history, introduced as early as the work of Kahn and Marshall (1953), whereas an early non-parametric estimate of bias, subsequently renamed the jackknife and broadly used for bias reduction, was first presented by Quenouille (1956). However, the corresponding theoretical developments for the more complicated, but extremely popular and practically important, estimators based on MCMC algorithms have been rather limited. The major impediment is the fact that MCMC estimators are based on ergodic averages of dependent samples produced by simulating a Markov chain.
We provide a general methodology to construct control variates for any discrete time random walk Metropolis (RWM) and Metropolis-adjusted Langevin algorithm (MALA) Markov chain that can achieve, in a post-processing manner and with negligible additional computational cost, impressive variance reduction when compared to the standard MCMC ergodic averages. Our proposed estimators are based on an approximate, but accurate, solution of the Poisson equation for a multivariate Gaussian target density of any dimension.
Suppose that we have a sample of size n from an ergodic Markov chain $\{X_n\}_{n\ge 0}$ with continuous state space $\mathcal{X} \subseteq \mathbb{R}^d$, transition kernel $P$ and invariant measure $\pi$. A standard estimator of the mean $\mathbb{E}_\pi[F] := \pi(F) = \int F \, d\pi$ of a real-valued function $F$ defined on $\mathcal{X}$ under $\pi$ is the ergodic mean
$$\mu_n(F) := \frac{1}{n}\sum_{i=0}^{n-1} F(X_i),$$
which satisfies, for any initial distribution of $X_0$, a central limit theorem of the form
$$\sqrt{n}\,\big(\mu_n(F) - \pi(F)\big) \xrightarrow{\;d\;} \mathrm{N}(0, \sigma_F^2),$$
with the asymptotic variance given by
$$\sigma_F^2 := \lim_{n\to\infty} n \, \mathrm{Var}\big(\mu_n(F)\big).$$
Interesting attempts on variance reduction methods for Markov chain samplers include the use of antithetic variables (Barone and Frigessi, 1990; Green and Han, 1992; Craiu et al., 2005), Rao-Blackwellization (Gelfand and Smith, 1990), Riemann sums (Philippe and Robert, 2001) or autocorrelation reduction (Mira and Geyer, 2000; Van Dyk and Meng, 2001; Yu and Meng, 2011). Control variates have played an outstanding role in the MCMC variance reduction quiver. A strand of research is based on Assaraf and Caffarel (1999), who noticed that a Hamiltonian operator together with a trial function are sufficient to construct an estimator with zero asymptotic variance. They considered a Hamiltonian operator of Schrödinger type that led to a series of zero-variance estimators studied by Valle and Leisen (2010), Mira et al. (2013) and Papamarkou et al. (2014). The estimation of the optimal parameters of the trial function is conducted by ignoring the Markov chain sample dependency, an issue that was dealt with by Belomestny et al. (2020) by utilizing spectral methods. The main barrier to the wide applicability of zero-variance estimators is that their computational complexity increases with d, see South et al. (2018). Another approach to construct control variates is a non-parametric version of the methods presented by Mira et al. (2013) and Papamarkou et al. (2014), which leads to the construction of control functionals (Oates et al., 2017; Barp et al., 2018; South et al., 2020). Although their computational cost with respect to d is low, their general applicability is prohibited by the cubic computational cost with respect to n (South et al., 2018; Oates et al., 2019) and the possibility of suffering from the curse of dimensionality that is often met in non-parametric methods (Wasserman, 2006). Finally, Hammer and Tjelmeland (2008) proposed constructing control variates by expanding the state space of the Metropolis-Hastings algorithm.
An approach which is closely related to our proposed methodology attempts to minimise the asymptotic variance $\sigma_F^2$. This seems a hard problem since a closed-form expression of $\sigma_F^2$ is not available and therefore a loss function to be minimised is not readily available; see, for example, Flegal et al. (2010). However, there has been recent research activity based on the following observation by Andradóttir et al. (1993). If a solution $\hat F$ to the Poisson equation for $F$ were available, that is, if for every $x \in \mathcal{X}$
$$P\hat F(x) - \hat F(x) = -F(x) + \pi(F), \qquad (1)$$
where $P\hat F(x) := \int \hat F(y)\, P(x, dy)$, then one could construct the function $F(x) + P\hat F(x) - \hat F(x)$, which is constant and equal to $\pi(F)$. It is then immediate that a zero-variance and zero-bias estimator for $F$ is given by
$$\frac{1}{n}\sum_{i=0}^{n-1}\big\{F(X_i) + P\hat F(X_i) - \hat F(X_i)\big\},$$
which can be viewed as an enrichment of the estimator $\mu_n(F)$ with the (optimal) control variate $P\hat F - \hat F$. Of course, solving (1) is extremely hard for continuous state space Markov chains, even if we assume that $\mathbb{E}_\pi[F]$ is known, because it involves solving a non-standard integral equation. Interestingly, a solution of this equation (also called the fundamental equation) produces the zero-variance estimators suggested by Assaraf and Caffarel (1999) for a specific choice of Hamiltonian operator. One of the rare examples in which (1) has been solved exactly for discrete time Markov chains is the random scan Gibbs sampler with a multivariate Gaussian target density, see Dellaportas and Kontoyiannis (2012) and Dellaportas and Kontoyiannis (2009). They advocated that this solution provides a good approximation to (1) for posterior densities often met in Bayesian statistics that are close to multivariate Gaussian densities. Indeed, since a direct solution of (1) is not available, approximating $\hat F$ has also been suggested by Andradóttir et al. (1993), Atchadé and Perron (2005), Henderson (1997) and Meyn (2008). Tsourti (2012) attempted to extend the work of Dellaportas and Kontoyiannis (2012) to RWM samplers. The resulting algorithms produced estimators with lower variance, but the computational cost required for the post-processing construction of these estimators counterbalanced the variance reduction gains. We build on the work of Tsourti (2012) here, but we differ in that (i) we build new non-linear d-dimensional approximations to $\hat F(x)$, appropriately chosen to facilitate analytic computations, rather than linear combinations of 1-dimensional functions, and (ii) we produce efficient Monte Carlo approximations of the d-dimensional integral $P\hat F(x)$ so that no extra computation is required for its evaluation. Finally, Mijatović et al. (2018) approximate numerically the solution of (1) for 1-dimensional RWM samplers, and Mijatović and Vogrinc (2019) construct control variates for large d by employing the solution of (1) associated with the Langevin diffusion to which the Markov chain converges as $d \to \infty$ (Roberts et al., 1997); this requires very expensive Monte Carlo estimation, so it is prohibitive for realistic statistical applications.
We follow this route and add to this literature by extending the work of Dellaportas and Kontoyiannis (2012) and Tsourti (2012) to RWM and MALA algorithms, producing estimators for the posterior means of each coordinate of a d-dimensional target density with reduced asymptotic variance and negligible extra computational cost. Our Monte Carlo estimator of the expectation $\pi(F)$ makes use of three components: (a) an approximation $G(x)$ to the solution of the Poisson equation associated with the target $\pi(x)$, transition kernel $P$ and function $F(x)$; (b) a construction of $G(x)$ based on first approximating $\pi(x)$ with a Gaussian density $\tilde\pi(x) = \mathrm{N}(x|\mu, \Sigma)$, and then specifying $G(x)$ through an accurate approximation to the solution of the Poisson equation for the approximate target $\tilde\pi(x)$; and (c) an additional control variate, referred to as a static control variate, that is based on the same Gaussian approximation $\tilde\pi(x)$ and allows us to reduce the variance of a Monte Carlo estimator of the intractable expectation $PG(x)$.
In Section 2 we provide full details of the above steps. We start by discussing, in Section 2.1, how all the above ingredients are put together to eventually arrive at the general form of our proposed estimator in equation (7). In Section 3 we present extensive simulation studies which verify that our methodology performs very well with multi-dimensional Gaussian targets, and that it stops reducing the asymptotic variance when we deal with a multimodal 50-dimensional target density with distinct, remote modes. Moreover, we apply our methodology to real data examples consisting of a series of logistic regression examples with parameter vectors of up to 25 dimensions and two stochastic volatility examples with 53 and 103 parameters. In all cases we have produced estimators with considerable variance reduction and negligible extra computational cost.

Some notation
In the remainder of the paper we use a simplified notation where both d-dimensional random variables and their values are denoted by lower case letters, such as $x = (x^{(1)}, \ldots, x^{(d)})$, where $x^{(j)}$ is the jth dimension or coordinate, $j = 1, \ldots, d$; the subscript i refers to the ith sample drawn by an MCMC algorithm, that is, $x_i^{(j)}$ is the ith sample of the jth coordinate of x; the density of the d-variate Gaussian distribution with mean m and covariance matrix S is denoted by $\mathrm{N}(\cdot|m, S)$; for a function $f(x)$ we set $\nabla f := (\partial f/\partial x^{(1)}, \ldots, \partial f/\partial x^{(d)})^\top$; $I_d$ is the $d \times d$ identity matrix and the superscript $\top$ on a vector or matrix denotes its transpose; $\|\cdot\|$ denotes the Euclidean norm; all vectors are understood as column vectors.
For any function $G(x)$, the quantity $PG(x) - G(x)$ has zero expectation with respect to $\pi$ because the kernel $P$ is invariant to $\pi$. Therefore, given n correlated samples from the target, i.e. $x_i \sim \pi$ with $i = 0, \ldots, n-1$, the following estimator is unbiased:
$$\mu_{n,G}(F) = \frac{1}{n}\sum_{i=0}^{n-1}\big\{F(x_i) + PG(x_i) - G(x_i)\big\}, \qquad (2)$$
where $PG(x_i) - G(x_i)$ is the Poisson control variate. For general Metropolis-Hastings algorithms the kernel $P$ is such that the expectation $PG(x)$ takes the form
$$PG(x) = G(x) + \int \alpha(x, y)\big(G(y) - G(x)\big)\, q(y|x)\, dy, \qquad (3)$$
where
$$\alpha(x, y) = \min\{1, r(x, y)\}, \qquad r(x, y) = \frac{\pi(y)\, q(x|y)}{\pi(x)\, q(y|x)}, \qquad (4)$$
and $q(y|x)$ is the proposal distribution. By substituting (3) back into the estimator (2) we obtain
$$\mu_{n,G}(F) = \frac{1}{n}\sum_{i=0}^{n-1}\Big\{F(x_i) + \int \alpha(x_i, y)\big(G(y) - G(x_i)\big)\, q(y|x_i)\, dy\Big\}. \qquad (5)$$
To use this estimator we need to overcome two obstacles: (i) we need to specify the function $G(x)$, and (ii) we need to deal with the intractable integral associated with the control variate. Regarding (i), there is a theoretically best choice, which is to set $G(x)$ to the function $\hat F(x)$ that solves the Poisson equation
$$\int \alpha(x, y)\big(\hat F(y) - \hat F(x)\big)\, q(y|x)\, dy = -F(x) + \pi(F), \quad \text{for every } x \sim \pi, \qquad (6)$$
where we have substituted the Metropolis-Hastings kernel into the general form of the Poisson equation in (1).
For such an optimal choice of G the estimator in (5) has zero variance, i.e. it equals the exact expectation $\pi(F)$. Nevertheless, obtaining $\hat F$ for general high-dimensional intractable targets is not feasible, and hence we need to compromise with an inferior choice of G that can only approximate $\hat F$. To get such a G, we make use of a Gaussian approximation to the intractable target, as indicated by the assumption below.
Assumption 1. The target $\pi(x)$ is approximated by a multivariate Gaussian $\tilde\pi(x) = \mathrm{N}(x|\mu, \Sigma)$ and the covariance matrix of the proposal $q(y|x)$ is proportional to $\Sigma$.
The main purpose of the above assumption is to establish the ability to construct an efficient RWM or MALA sampler. Indeed, it is well known that efficient implementation of these Metropolis-Hastings samplers when d > 1 requires that the covariance matrix of $q(y|x)$ resemble as closely as possible the shape of $\Sigma$. In adaptive MCMC (Roberts and Rosenthal, 2009), such shape matching is achieved during the adaptive phase where $\Sigma$ is estimated. If $\pi(x)$ is a smooth differentiable function, $\Sigma$ could alternatively be estimated by a gradient-based optimisation procedure, and it is then customary to choose a proposal covariance matrix of the form $c^2\Sigma$ for a tuned scalar c.
We then solve the Poisson equation for the Gaussian approximation by finding the function $\hat F_{\tilde\pi}(x)$ that satisfies
$$\int \tilde\alpha(x, y)\big(\hat F_{\tilde\pi}(y) - \hat F_{\tilde\pi}(x)\big)\, q(y|x)\, dy = -F(x) + \tilde\pi(F), \quad \text{for every } x \sim \tilde\pi.$$
It is useful to emphasize the difference between this new Poisson equation and the original Poisson equation in (6). This new equation involves the approximate Gaussian target $\tilde\pi$ and the corresponding "approximate" Metropolis-Hastings transition kernel $\tilde P$, which has been modified so that the acceptance probability $\tilde\alpha(x, y)$ is obtained by replacing the exact target $\pi$ with the approximate target $\tilde\pi$, while the proposal $q(y|x)$ is also modified if needed. Clearly, this modification makes $\tilde P$ invariant to $\tilde\pi$. When $\tilde\pi$ is a good approximation to $\pi$, we also expect $\hat F_{\tilde\pi}$ to closely approximate the ideal function $\hat F$. Therefore, in our method we propose to set G to $\hat F_{\tilde\pi}$ (actually to an analytic approximation of $\hat F_{\tilde\pi}$) and then use it in the estimator (5).
Having chosen $G(x)$, we now discuss the second challenge (ii), i.e. dealing with the intractable expectation $\int \alpha(x_i, y)(G(y) - G(x_i))\, q(y|x_i)\, dy$. Given that for any drawn sample $x_i$ of the Markov chain there is also a corresponding proposed sample $y_i$ generated from the proposal, we can unbiasedly approximate the integral with the single-sample Monte Carlo estimate $\alpha(x_i, y_i)(G(y_i) - G(x_i))$. While this is an unbiased stochastic estimate of the Poisson-type control variate, it can have high variance that needs to be reduced. We introduce a second control variate based on some function $h(x_i, y_i)$ that correlates well with $\alpha(x_i, y_i)(G(y_i) - G(x_i))$ and has an analytic expectation $\mathbb{E}_{q(y|x_i)}[h(x_i, y)]$. We refer to this control variate as static since it involves a standard Monte Carlo problem with exact samples from the tractable proposal density $q(y|x)$. To construct $h(x_i, y)$ we rely again on the Gaussian approximation $\tilde\pi(x) = \mathrm{N}(x|\mu, \Sigma)$, as we describe in Section 2.3.
With $G(x)$ and $h(x, y)$ specified, we can finally write down the general form of the proposed estimator, which can be efficiently computed solely from the MCMC output samples $\{x_i\}_{i=0}^{n-1}$ and the corresponding proposed samples $\{y_i\}_{i=0}^{n-1}$:
$$\mu_{n,G}(F) = \frac{1}{n}\sum_{i=0}^{n-1}\Big\{F(x_i) + \alpha(x_i, y_i)\big(G(y_i) - G(x_i)\big) + \mathbb{E}_{q(y|x_i)}[h(x_i, y)] - h(x_i, y_i)\Big\}, \qquad (7)$$
where $\alpha(x_i, y_i)(G(y_i) - G(x_i))$ is the stochastic Poisson control variate and $\mathbb{E}_{q(y|x_i)}[h(x_i, y)] - h(x_i, y_i)$ is the static control variate. In practice we use a slightly modified version of this estimator that adds a set of adaptive regression coefficients $\hat\theta_n$ to further reduce the variance, following Dellaportas and Kontoyiannis (2012); see Section 2.4.
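To make the structure of (7) concrete, here is a minimal sketch of one summand of the estimator, assembled from quantities stored during the MCMC run; the callables G, h and Eh (the analytic expectation $\mathbb{E}_{q(y|x)}[h(x, y)]$) are placeholders for the constructions of Sections 2.2 and 2.3.

```python
def estimator_summand(F, G, h, Eh, x, y, alpha):
    """One summand of the proposed estimator (7): x is a chain state, y the
    corresponding proposed value and alpha = alpha(x, y) its acceptance
    probability; G, h and Eh are assumed supplied by the user."""
    stochastic_cv = alpha * (G(y) - G(x))  # single-sample estimate of the integral in (5)
    static_cv = Eh(x) - h(x, y)            # mean zero under q(y|x); cancels noise in the stochastic term
    return F(x) + stochastic_cv + static_cv
```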

Standard Gaussian case
In this section we construct an analytical approximation to the exact solution of the Poisson equation for the d-variate standard Gaussian target $\pi_0(x) = \mathrm{N}(x|0, I_d)$ and for the function $F(x) = x^{(j)}$, where $1 \le j \le d$.
We use the function $F(x) = x^{(j)}$ in the remainder of the paper, which corresponds to approximating the mean value $\mathbb{E}_\pi[x^{(j)}]$, while other choices of F are left for future work. We denote the exact unknown solution by $\hat F_{\pi_0}$ and the analytical approximation by $G_0$. Given this target and some choice for $G_0$, we express the expectation in (3) as
$$PG_0(x) = G_0(x) + a_g(x) - G_0(x)\, a(x),$$
where
$$a(x) = \int \min\Big\{1,\, e^{-\frac{\tau^2}{2}(y^\top y - x^\top x)}\Big\}\, q(y|x)\, dy \qquad (8)$$
and
$$a_g(x) = \int \min\Big\{1,\, e^{-\frac{\tau^2}{2}(y^\top y - x^\top x)}\Big\}\, G_0(y)\, q(y|x)\, dy. \qquad (9)$$
The calculation of $PG_0(x)$ thus reduces to the calculation of the integrals $a(x)$ and $a_g(x)$. In both integrals $x^\top x$ is just a constant, since the integration is with respect to y. Moreover, the MCMC algorithm we consider is either RWM or MALA with proposal
$$q(y|x) = \mathrm{N}(y\,|\,rx, c^2 I_d), \qquad (10)$$
where $r = 1$ corresponds to RWM and $r = 1 - c^2/2$ to MALA, while $c > 0$ is the step-size. Both $a(x)$ and $a_g(x)$ are expectations under the proposal distribution $q(y|x)$.
One key observation is that, for any dimension d, $y^\top y$ is a univariate random variable with law induced by $q(y|x)$. Then $y^\top y$, together with $\log\frac{q(x|y)}{q(y|x)}$, induces an overall tractable univariate random variable, so that the computation of $a(x)$ in (8) can be performed analytically. The computation of $a_g(x)$ is more involved since it depends on the form of $G_0$. Therefore, we propose an approximate $G_0$ by first introducing a parametrised family that leads to tractable and efficient closed-form computation of $a_g(x)$. In particular, we consider a weighted sum of exponential functions, given in (11), in which the $w_k$ and $\gamma_k$ are scalars whereas the $\beta_k$ and $\delta_k$ are d-dimensional vectors. It turns out that for $G_0$ of the form (11) we can analytically compute the expectation $PG_0$, as stated in Proposition 1. The proof of this proposition, and the proofs of all remaining propositions and remarks presented throughout Section 2, are given in the Appendix.
Proposition 1. Let $a(x)$ and $a_g(x)$ be given by (8) and (9) respectively, and let $G_0$ in $a_g(x)$ have the form in (11). Then both integrals admit closed-form expressions in terms of the cdf of the non-central chi-squared distribution, where $\tau^2 = 1$ in the case of RWM and $\tau^2 = c^2/4$ in the case of MALA, $f$ follows the non-central chi-squared distribution with d degrees of freedom and non-centrality parameter $r^2 x^\top x/c^2$, and $f_{k,g}$ follows the non-central chi-squared distribution with d degrees of freedom and non-centrality parameter $m_k(x)^\top m_k(x)/s_k^2$.

Proposition 1 states that the calculation of $a_g(x)$ and $a(x)$ is based on the cdf of the non-central chi-squared distribution and allows, for d-variate standard normal targets, the exact computation of the modified estimator $\mu_{n,G}$ given by (2).
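As a quick numerical check of the distributional fact underpinning Proposition 1, namely that $y^\top y/c^2$ under the proposal (10) follows the non-central chi-squared distribution with d degrees of freedom and non-centrality $r^2 x^\top x/c^2$, the following sketch compares a Monte Carlo estimate with the closed-form cdf; the threshold s is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, c, r = 5, 0.8, 1.0                    # r = 1 for RWM; r = 1 - c**2 / 2 for MALA
x = rng.standard_normal(d)

y = r * x + c * rng.standard_normal((200_000, d))   # draws from q(y|x) in (10)
f = (y ** 2).sum(axis=1) / c ** 2                   # f = y'y / c^2
nc = r ** 2 * (x @ x) / c ** 2                      # non-centrality parameter

s = 1.3 * d
print(np.mean(f <= s))                   # Monte Carlo estimate of Pr(f <= s)
print(stats.ncx2.cdf(s, df=d, nc=nc))    # closed-form cdf used in Proposition 1
```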
Having a family of functions for which we can calculate analytically the expectation $PG_0$, we turn to the problem of specifying a particular member of this family to serve as an accurate approximation to the solution of the Poisson equation for the standard Gaussian distribution. We first provide the following proposition, which states that $\hat F_{\pi_0}$ satisfies certain symmetry properties.
Proposition 2. Given $F(x) = x^{(j)}$, the exact solution $\hat F_{\pi_0}(x)$ is: (i) (for d ≥ 1) an odd function in the coordinate $x^{(j)}$; (ii) (for d ≥ 2) an even function in any remaining coordinate $x^{(j')}$, $j' \neq j$; (iii) (for d ≥ 3) permutation invariant over the remaining coordinates.
To construct an approximation family that incorporates the symmetry properties of Proposition 2, we make the following assumptions for the parameters in (11). We set K = 4 and assume that $w_k \in \mathbb{R}$ and $\gamma_k > 0$ for each k = 1, 2, 3, 4, whereas we set the remaining vector parameters equal to zero and take $\delta_3 = -\delta_4$; we set the vectors $\beta_1$ and $\delta_3$ to be filled with zeros except for their jth element, which equals $b_1$ and $c_2$ respectively. We thus specify the function $G_0 : \mathbb{R}^d \to \mathbb{R}$ as in (12). To identify optimal parameters for the function $G_0$ in (12) such that $G_0 \approx \hat F_{\pi_0}$, we first simulate a Markov chain with large sample size n from the d-variate standard Gaussian distribution by employing the RWM algorithm and the MALA. Then, for each algorithm, we minimize the loss function in (13) with respect to the parameters $b_0, b_1, b_2, c_0, c_1$ and $c_2$ by employing the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. Figure 1 provides an illustration of the achieved approximation to $\hat F_{\pi_0}$ in the univariate case, where d = 1 and the model in (12) simplifies accordingly.
For this case we can visualize our optimised $G_0$ and compare it against the numerical solution of Mijatović et al. (2018). Figure 1 shows this comparison, which provides clear evidence that for d = 1 our approximation is very accurate.
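Since the loss (13) is not reproduced above, the following sketch illustrates the fitting step under an assumed squared Poisson-residual loss over the simulated chain states; make_g0 and make_pg0 are hypothetical factories returning $G_0$ and its analytic expectation $PG_0$ from Proposition 1 for a given parameter vector.

```python
import numpy as np
from scipy.optimize import minimize

def fit_g0_params(chain, F, make_g0, make_pg0, theta0):
    """Fit the parameters (b0, b1, b2, c0, c1, c2) of G0 by BFGS. The loss
    below is an assumption standing in for the paper's loss (13): for the
    standard Gaussian target pi0(F) = 0, so an exact Poisson solution would
    make every residual F(x) + P G0(x) - G0(x) vanish."""
    def loss(theta):
        g0, pg0 = make_g0(theta), make_pg0(theta)
        resid = F(chain) + pg0(chain) - g0(chain)
        return np.mean(resid ** 2)
    return minimize(loss, theta0, method="BFGS")
```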

General Gaussian case
Given the general d-variate Gaussian target $\pi(x) = \mathrm{N}(x|\mu, \Sigma)$, we denote by $\hat F_\pi$ the exact solution of the Poisson equation and by $G$ the approximation that we wish to construct. To approximate $\hat F_\pi$ we apply a change of variables transformation from the standard normal case, as motivated by the following proposition and remark.
Proposition 3. Suppose the standard normal target $\pi_0(x) = \mathrm{N}(x|0, I_d)$, the function $F(x) = x^{(1)}$ and the associated solution $\hat F_{\pi_0}$ of the Poisson equation for either RWM with proposal $q(y|x) = \mathrm{N}(y|x, c^2 I_d)$ or MALA with proposal $q(y|x) = \mathrm{N}(y|(1 - c^2/2)x, c^2 I_d)$. Then, the solution $\hat F_\pi$ for the general Gaussian target $\pi(x) = \mathrm{N}(x|\mu, \Sigma)$ and Metropolis-Hastings proposal
$$q(y|x) = \mathrm{N}(y\,|\,x, c^2\Sigma) \;\;\text{(RWM)} \qquad \text{or} \qquad q(y|x) = \mathrm{N}\Big(y\,\Big|\,x + \frac{c^2}{2}\Sigma\nabla\log\pi(x),\; c^2\Sigma\Big) \;\;\text{(MALA)} \qquad (14)$$
is $\hat F_\pi(x) = L_{11}\, \hat F_{\pi_0}\big(L^{-1}(x - \mu)\big)$, where L is the lower triangular Cholesky factor such that $\Sigma = LL^\top$ and $L_{11}$ is its first diagonal element.
Remark 1. To apply Proposition 3 for $F(x) = x^{(j)}$ with $j \neq 1$, the vector x needs to be permuted such that $x^{(j)}$ becomes its first element; the corresponding permutation must also be applied to the mean µ and covariance matrix Σ.
Proposition 3 implies that we can obtain the exact solution of the Poisson equation for any d-variate Gaussian target by applying a change of variables transformation to the solution for the standard normal d-variate target. Therefore, based on this theoretical result, we propose to obtain an approximation G of the solution of the Poisson equation in the general Gaussian case by simply transforming the approximation $G_0$ in (12) from the standard normal case, so that
$$G(x) = G_0\big(L^{-1}(x - \mu)\big). \qquad (15)$$
The constant $L_{11}$ is omitted since it can be absorbed by the regression coefficient θ; see Section 2.4.
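In code, (15) amounts to a triangular solve; a minimal sketch, assuming a fitted G0 for the standard normal case:

```python
from scipy.linalg import cholesky, solve_triangular

def make_G(G0, mu, Sigma):
    """Approximate Poisson solution for a general Gaussian target via (15):
    G(x) = G0(L^{-1}(x - mu)), with Sigma = L L^T; the factor L_11 of
    Proposition 3 is dropped, as it is absorbed by the coefficient theta."""
    L = cholesky(Sigma, lower=True)
    return lambda x: G0(solve_triangular(L, x - mu, lower=True))
```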

Construction of the static control variate h(x, y)
Suppose we have constructed a Gaussian approximation $\tilde\pi(x) = \mathrm{N}(x|\mu, \Sigma)$, where $\Sigma = LL^\top$, to the intractable target $\pi(x)$, and have also obtained the function G from (15) needed for the proposed general estimator in (7). What remains is to specify the function $h(x, y)$, labelled the static control variate in (7), which should correlate well with $\alpha(x, y)(G(y) - G(x))$. The intractable term in this function is the Metropolis-Hastings probability $\alpha(x, y)$ in (4), where the Metropolis-Hastings ratio $r(x, y)$ contains the intractable target π. This suggests choosing
$$h(x, y) = \min\{1, \tilde r(x, y)\}\big(G(y) - G(x)\big), \qquad (16)$$
where $\tilde r(x, y)$ is the acceptance ratio of a Metropolis-Hastings algorithm that targets the Gaussian approximation $\tilde\pi(x)$, that is,
$$\tilde r(x, y) = \frac{\tilde\pi(y)\, q(x|y)}{\tilde\pi(x)\, q(y|x)}, \qquad (17)$$
and $q(\cdot|\cdot)$ is the proposal distribution that we would use for the Gaussian target $\tilde\pi(x)$, as defined by equation (14). Importantly, by assuming that $\tilde\pi$ serves as an accurate approximation to π, the ratio $\tilde r(x, y)$ approximates accurately the exact Metropolis-Hastings ratio $r(x, y)$, and $\mathbb{E}_q[h(x, y)] = \int h(x, y)\, q(y|x)\, dy$ can be calculated analytically. This integral can be computed efficiently as follows. We reparametrize the integral according to the new variable $\tilde y = L^{-1}(y - \mu)$ and also use the shortcut $\tilde x = L^{-1}(x - \mu)$, where x is an MCMC sample. After this reparametrization, the above expectation becomes an expectation under the distribution
$$q(\tilde y|\tilde x) = \mathrm{N}\Big(\tilde y\,\Big|\,\tilde x + \frac{c^2}{2} L^\top \nabla\log\pi(x),\; c^2 I_d\Big) \qquad (18)$$
in the case of MALA (with the gradient term absent in the case of RWM), where we condition on $\tilde x$ with a slight abuse of notation, since the term $\nabla\log\pi(x)$ is the exact pre-computed gradient of the intractable target at the sample x. Thus, the calculation of $\mathbb{E}_q[h(x, y)]$ reduces to the evaluation of the integral
$$\int \min\Big\{1,\, e^{-\frac{1}{2}(\tilde y^\top \tilde y - \tilde x^\top \tilde x)}\, \frac{\tilde q(\tilde x|\tilde y)}{\tilde q(\tilde y|\tilde x)}\Big\}\big(G_0(\tilde y) - G_0(\tilde x)\big)\, q(\tilde y|\tilde x)\, d\tilde y. \qquad (19)$$
Note also that inside the Metropolis-Hastings ratio $\tilde q(\tilde y|\tilde x) = \mathrm{N}(\tilde y|r\tilde x, c^2 I_d)$ with r as in (10). In the case of RWM, by noting that the density $q(\tilde y|\tilde x)$ in (18) coincides with the density $q(\tilde y|\tilde x)$ in (10), the calculation of the integral in (19) reduces to the calculation of the integrals in (8) and (9) and can thus be conducted by utilizing Proposition 1. The calculation of the integral in (19) for the MALA is slightly different, as highlighted by the following remark.
Remark 2. In the case of MALA the mean of the density $q(\tilde y|\tilde x)$ in (18) differs from the mean of $\tilde q(\tilde y|\tilde x)$ due to the presence of the term $\frac{c^2}{2} L^\top \nabla\log\pi(x)$, and the formulas in Proposition 1 are modified accordingly.

Finally, we note that, apart from the tractability of the calculations offered by the particular choice of $h(x, y)$, there is also the following intuition for its effectiveness. If the Gaussian approximation is exact, then the overall control variate, defined in equation (7) as the sum of a stochastic and a static control variate, becomes the exact "Poisson control variate" that we would compute if the initial target were actually Gaussian. Thus, we expect that the function $h(x, y)$, as a static control variate for a non-Gaussian target, enables effective variance reduction whenever the target is well approximated by a Gaussian distribution.
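In the paper the expectation $\mathbb{E}_q[h(x, y)]$ is computed analytically via Proposition 1 and Remark 2; since it is also a standard Monte Carlo problem over the tractable proposal, a brute-force estimate such as the sketch below can serve to validate an implementation (the propose sampler is a hypothetical stand-in for draws from (14)).

```python
import numpy as np

def expected_h_mc(h, x, propose, n_draws=10_000, seed=0):
    """Fallback Monte Carlo estimate of E_{q(y|x)}[h(x, y)] from exact,
    independent proposal draws; useful only as a check against the
    analytic value used by the method."""
    rng = np.random.default_rng(seed)
    ys = propose(rng, x, n_draws)          # hypothetical sampler for q(. | x)
    return np.mean([h(x, y) for y in ys])
```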

The modified estimator with regression coefficients
As pointed out by Dellaportas and Kontoyiannis (2012), the fact that the proposed estimator $\mu_{n,G}(F)$ is based on an approximation G of the true solution $\hat F_\pi$ of the Poisson equation implies that we need to modify $\mu_{n,G}(F)$ as
$$\mu_{n,G}(F_{\hat\theta_n}) = \frac{1}{n}\sum_{i=0}^{n-1}\Big\{F(x_i) + \hat\theta_n\big(\widehat{PG}(x_i) - G(x_i)\big)\Big\}, \qquad (20)$$
where $\hat\theta_n$ estimates the optimal coefficient θ that further minimizes the variance of the overall estimator. Dellaportas and Kontoyiannis (2012) show that for reversible MCMC samplers the optimal estimator $\hat\theta_n$ of the true coefficient θ can be constructed solely from the MCMC output. By re-writing the estimator in (20) in terms of the quantity
$$\widehat{PG}(x_i) = G(x_i) + \alpha(x_i, y_i)\big(G(y_i) - G(x_i)\big) + \mathbb{E}_{q(y|x_i)}[h(x_i, y)] - h(x_i, y_i), \qquad (21)$$
which approximates $PG(x_i)$, we can estimate $\hat\theta_n$ from the MCMC output as in (22). The resulting estimator $\mu_{n,G}(F_{\hat\theta_n})$ in (20) is evaluated solely from the output of the MCMC algorithm and, under some regularity conditions, converges to π(F) a.s. as $n \to \infty$; see Tsourti (2012).

Algorithmic summary
In summary, the proposed variance reduction approach can be applied a posteriori to the MCMC output samples $\{x_i\}_{i=0}^{n-1}$ obtained from either RWM or MALA with proposal density given by (14). The extra computations needed involve the evaluation of $\widehat{PG}(x_i)$ given by (21). This is efficient since it relies on quantities that are readily available, such as the values $G(x_i)$ and $G(y_i)$, where $y_i$ is the value generated from the proposal $q(y|x_i)$ during the main MCMC algorithm, as well as the acceptance probability $\alpha(x_i, y_i)$, which has also been computed and stored at each MCMC iteration. The evaluation of $\widehat{PG}(x_i)$ also requires the construction of the static control variate $h(x_i, y_i)$ defined by (16). This depends on the ratio $\tilde r(x, y)$ given by (17) and on the expectation $\mathbb{E}_{q(y|x_i)}[h(x_i, y)]$. The calculation of the latter expectation is tractable since $\tilde r(x, y)$ is the acceptance ratio of a Metropolis-Hastings algorithm that targets the Gaussian approximation $\tilde\pi(x) = \mathrm{N}(x|\mu, \Sigma)$, where µ and Σ are estimators of the mean and covariance matrix, respectively, of the target π(x); see Assumption 1. Finally, we compute $\hat\theta_n$ using (22) and evaluate the proposed estimator $\mu_{n,G}(F_{\hat\theta_n})$ from (20). Algorithm 1 summarizes the steps of the variance reduction procedure.
Algorithm 1: Variance reduction for Metropolis-Hastings samplers.
Inputs: the samples $x_i$, $i = 0, \ldots, n-1$, simulated by RWM or MALA with proposal distribution given by (14); the proposed samples $y_i$ generated from the proposal during the MCMC; the Metropolis-Hastings probabilities $\alpha(x_i, y_i)$ calculated during the MCMC; estimators µ and Σ of the mean and covariance matrix, respectively, of the target.
Returns: an estimate of the mean of the jth coordinate of the target.
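A compact post-processing sketch of Algorithm 1 follows; every input is an array stored during the MCMC run, and the least-squares coefficient below is an assumed stand-in for the estimator $\hat\theta_n$ in (22).

```python
import numpy as np

def variance_reduced_mean(F_vals, G_x, G_y, alphas, h_vals, Eh_vals):
    """Post-process stored MCMC output: F_vals[i] = F(x_i), G_x[i] = G(x_i),
    G_y[i] = G(y_i), alphas[i] = alpha(x_i, y_i), h_vals[i] = h(x_i, y_i)
    and Eh_vals[i] = E_{q(y|x_i)}[h(x_i, y)] (analytic)."""
    cv = alphas * (G_y - G_x) + Eh_vals - h_vals  # estimate of P G(x_i) - G(x_i), cf. (21)
    C = np.cov(F_vals, cv)                        # 2 x 2 sample covariance matrix
    theta = -C[0, 1] / C[1, 1]                    # least-squares control-variate coefficient
    return np.mean(F_vals + theta * cv)
```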

Application to real and simulated data
We present results from the application of the proposed methodology to real and simulated data examples. First we consider multivariate Gaussian targets, for which we have shown that the function G in (12) allows the explicit calculation of the expectation PG defined by (3). Section 3.1 presents variance reduction factors in the case of d-variate standard Gaussian densities, simulated by employing the RWM and MALA, up to d = 100 dimensions. In Sections 3.2, 3.3 and 3.4 we examine the efficiency of our proposed methodology for targets that depart from the Gaussian distribution, where the expectation PG is not analytically available.
To conduct all the experiments we set the parameters $b_0, b_1, b_2, c_0, c_1$ and $c_2$ of the function $G_0$ in (12) to the values given in Table 1, which were estimated by minimizing the loss function in (13) for d = 2. In practice we observe that these values lead to good performance across all real data experiments, including those with d > 2.
To estimate the variance of $\mu_n(F)$ in each experiment we obtained T = 100 estimates $\mu_n^{(i)}(F)$, $i = 1, \ldots, T$, from T independent MCMC runs. The variance of $\mu_n(F)$ has then been estimated by
$$\frac{1}{T-1}\sum_{i=1}^{T}\big(\mu_n^{(i)}(F) - \bar\mu_n(F)\big)^2,$$
where $\bar\mu_n(F)$ is the average of the $\mu_n^{(i)}(F)$. We estimated the variance of the proposed estimator $\mu_{n,G}(F)$ similarly.
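The factors reported in the tables below are then simply the ratio of the two sample variances across the T repetitions:

```python
import numpy as np

def variance_reduction_factor(standard_estimates, proposed_estimates):
    """Factor by which Var[mu_n(F)] exceeds Var[mu_{n,G}(F)]; each argument
    holds T estimates obtained from independent MCMC repetitions."""
    v_std = np.var(standard_estimates, ddof=1)   # (1/(T-1)) * sum_i (mu_i - mean)^2
    v_new = np.var(proposed_estimates, ddof=1)
    return v_std / v_new
```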

Simulated data: Gaussian targets
The target distribution is a d-variate standard Gaussian and we are interested in estimating the expected value of its first coordinate by setting $F(x) = x^{(1)}$. Samples of size n were drawn by utilising the proposal distribution in (10), with $c^2 = 2.38^2/d$ in the RWM case and with $c^2$ tuned during the burn-in period to achieve an acceptance rate between 55% and 60% in the MALA case. Table 2 presents the factors by which the variance of $\mu_n(F)$ is greater than the variance of $\mu_{n,G}(F)$ for the RWM and MALA. Variance reduction is considerable even for d = 100. Figure 2 shows typical realizations of the sequences of estimates obtained by the standard estimator $\mu_n(F)$ and the proposed $\mu_{n,G}(F_{\hat\theta})$ for different dimensions of the standard Gaussian target, and Figure 3 provides a visualization of the distributions of the estimators $\mu_n(F)$ and $\mu_{n,G}(F_{\hat\theta})$.

Simulated data: mixtures of Gaussian distributions
It is important to investigate how our proposed methodology performs when the target density departs from normality. We used as π(x) a mixture of two d-variate Gaussian distributions with density
$$\pi(x) = \tfrac{1}{2}\,\mathrm{N}(x|m, \Sigma) + \tfrac{1}{2}\,\mathrm{N}(x|-m, \Sigma), \qquad (23)$$
where, following Mijatović and Vogrinc (2019), we set m to be the d-dimensional vector $(h/2, 0, \ldots, 0)$ and Σ is a d × d covariance matrix randomly drawn from an inverse Wishart distribution with its largest eigenvalue required to equal 25. We drew samples from the target distribution by using the Metropolis-Hastings algorithm with proposal distribution $q(y|x) = \mathrm{N}(y|x, c^2\Sigma)$, where setting $c^2 = 2.38^2/d$ achieves an acceptance ratio between 23% and 33%. When h > 6 the MCMC algorithm struggles to converge. Table 3 presents the factors by which the variance of $\mu_n(F)$ is greater than the variance of the modified estimator $\mu_{n,G}(F)$ for dimensions d = 10 and d = 50 and for different values of h. It is very reassuring that even in the very non-Gaussian scenario (h = 6) our modified estimator achieved a slight variance reduction.
Table 3: Estimated factors by which the variance of $\mu_n(F)$ is larger than the variance of $\mu_{n,G}(F)$ for a mixture of d-variate Gaussian distributions with density given by (23), for different values of h. We collect n = 200,000 samples after the first 10,000 iterations of the RWM algorithm.

         h = 2   h = 4   h = 6
d = 10   20.73    2.39    1.26
d = 50    7.88    1.35    1.01

Real data: Bayesian logistic regressions
We tested the variance reduction of our modified estimators on five datasets that have been commonly used in MCMC applications; see, e.g., Girolami and Calderhead (2011) and Titsias and Dellaportas (2019). They consist of one N-dimensional binary response variable and an N × d matrix of covariates that includes a column of ones; see Table 4 for the names of the datasets and details on the sample sizes and dimensions. We consider a Bayesian logistic regression model with an improper prior for the regression coefficients $\gamma \in \mathbb{R}^d$ of the form $p(\gamma) \propto 1$.
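Under the flat prior the log-posterior coincides with the Bernoulli log-likelihood; a sketch of its evaluation and gradient (the latter is what the MALA proposals below require):

```python
import numpy as np

def log_posterior_and_grad(gamma, X, t):
    """Bayesian logistic regression with flat prior p(gamma) propto 1. X is
    the N x d covariate matrix (first column of ones) and t holds the binary
    responses in {0, 1}."""
    eta = X @ gamma
    logp = t @ eta - np.sum(np.logaddexp(0.0, eta))  # sum_i [t_i eta_i - log(1 + e^{eta_i})]
    grad = X.T @ (t - 1.0 / (1.0 + np.exp(-eta)))    # X' (t - sigmoid(eta))
    return logp, grad
```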
Variance reduction for RWM

We draw samples from the posterior distribution of γ by employing the RWM algorithm with proposal distribution $q(\gamma'|\gamma) = \mathrm{N}(\gamma'|\gamma, c^2\Sigma)$, where $c^2 = 2.38^2/d$ and Σ is the maximum likelihood estimator of the covariance of γ. Table 5 presents the range of factors by which the variance of $\mu_n(F)$ is greater than the variance of $\mu_{n,G}(F)$ across all parameters γ. It is clear that our modified estimators achieve impressive variance reductions when compared with the standard RWM ergodic estimators.
Table 5: Range of estimated factors by which the variance of $\mu_n(F)$ is larger than the variance of $\mu_{n,G}(F_{\hat\theta})$ for the posterior distributions of logistic regression models applied to the datasets indicated in the first column. We collect n samples after the first 10,000 iterations of the RWM algorithm.

Variance reduction for MALA
We draw samples from the posterior distribution of γ by employing the MALA with proposal distribution $q(\gamma'|\gamma) = \mathrm{N}\big(\gamma'\,\big|\,\gamma + \frac{c^2}{2}\Sigma\nabla\log\pi(\gamma),\; c^2\Sigma\big)$, where $c^2$ is tuned during the burn-in period in order to achieve an acceptance ratio between 55% and 60%, Σ is the maximum likelihood estimator of the covariance of γ, and π(γ) denotes the density of the posterior distribution of γ. Table 6 presents the range of factors by which the variance of $\mu_n(F)$ is greater than the variance of $\mu_{n,G}(F)$ across all parameters γ. Again, there is considerable variance reduction for all modified estimators.
Table 6: Range of estimated factors by which the variance of $\mu_n(F)$ is larger than the variance of $\mu_{n,G}(F_{\hat\theta})$ for the posterior distributions of logistic regression models applied to the datasets indicated in the first column. We collect n samples after the first 10,000 iterations of the MALA.
Real data: stochastic volatility models

To assess the proposed variance reduction methods we simulated daily log-returns of a stock for N days by using values for the parameters of the model that have been previously estimated in real data applications (Kim et al., 1998; Alexopoulos et al., 2021): φ = 0.98, µ = −0.85 and s = 0.15. To draw samples from the d-dimensional (d = N + 3) target posterior in (24), we first transform the parameters φ and $s^2$ to the real-valued parameters $\tilde\phi$ and $\tilde s^2$ by taking logit and logarithm transformations, and we assign Gaussian prior distributions by matching the first two moments of the Gaussians with the corresponding moments of the beta and gamma distributions used as priors for the parameters in the original formulation. Then, we set $x = (\mu, \tilde\phi, \tilde s^2, h)$ and we draw the desired samples using a Metropolis-Hastings algorithm with proposal distribution as in (14), where $y = (\mu', \tilde\phi', \tilde s^{2\prime}, h')$ are the proposed values, $c^2$ is tuned during the burn-in period in order to achieve an acceptance ratio between 55% and 60%, and Σ is the maximum a posteriori estimate of the covariance matrix of $(\mu, \tilde\phi, \tilde s^2, h)$. Table 7 presents the factors by which the variance of $\mu_n(F)$ is greater than the variance of the proposed estimator $\mu_{n,G}(F_{\hat\theta})$. We report the variance reduction for the static parameters of the volatility process and the range of reductions achieved across the N-dimensional latent path h. All estimators achieve considerable variance reduction.
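For reference, the simulated log-returns can be generated from the standard stochastic volatility specification of Kim et al. (1998) with the parameter values quoted above; the sketch below assumes that model form and does not reproduce the paper's exact posterior (24).

```python
import numpy as np

def simulate_sv_log_returns(N, mu=-0.85, phi=0.98, s=0.15, seed=0):
    """Simulate N daily log-returns: h_t = mu + phi (h_{t-1} - mu) + s eta_t
    and y_t = exp(h_t / 2) eps_t, with eta_t, eps_t iid standard normal."""
    rng = np.random.default_rng(seed)
    h = np.empty(N)
    h[0] = mu + s / np.sqrt(1.0 - phi ** 2) * rng.standard_normal()  # stationary initial state
    for t in range(1, N):
        h[t] = mu + phi * (h[t - 1] - mu) + s * rng.standard_normal()
    return np.exp(h / 2.0) * rng.standard_normal(N), h
```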

Discussion
Typical variance reduction strategies for MCMC algorithms study ways to produce new estimators that have smaller variance than the standard ergodic averages by performing a post-processing manipulation of the drawn samples. Here we have studied a methodology that constructs such estimators, but our development was based on the essential requirement of a negligible post-processing cost. In turn, this feature allows effortless variance reduction for MCMC estimators that are used in a wide spectrum of Bayesian inference applications.
We investigated both the applicability of our strategy in high dimensions and its robustness to departures from normality in the target densities by using simulated and real data examples. Since we have never encountered a case in which the variance increases, we feel that there is strong evidence that our method is risk-free, at least for posterior densities of up to 100 dimensions.
There are many directions for future work. We limited ourselves to the simplest case of the function $F(x) = x^{(j)}$, but higher moments and indicator functions seem interesting avenues to investigate next. Other Metropolis samplers, such as the independent Metropolis or Metropolis-within-Gibbs, are also obvious candidates for study. Finally, an issue that was discussed in some detail in Dellaportas and Kontoyiannis (2009), but has not yet been studied with the care it deserves, is the important problem of reducing the estimation bias of MCMC samplers, which depends on the initial point of the chain $X_0 = x$ and vanishes asymptotically. As also noted by Dellaportas and Kontoyiannis (2009), control variates probably have an important role to play in this setting.

Proof of Proposition 1
Proof. We need to calculate the integrals $a(x)$ and $a_g(x)$ in Eq. (8) and (9) for $G_0(x)$ given by (12). For $q(y|x)$ given by (10) we have that
$$\frac{\pi_0(y)\, q(x|y)}{\pi_0(x)\, q(y|x)} = \exp\Big\{-\frac{\tau^2}{2}\big(y^\top y - x^\top x\big)\Big\},$$
where $\tau^2 = 1$ in the case of RWM and $\tau^2 = c^2/4$ in the case of MALA. To compute $a(x)$ we set $z = (y - rx)/c$, with r as in (10). Then we have that
$$y^\top y = c^2\,(z + \kappa)^\top (z + \kappa), \qquad (25)$$
where $\kappa = rx/c$. By setting $f = (z + \kappa)^\top(z + \kappa)$ we have that f follows the non-central chi-squared distribution with d degrees of freedom and non-centrality parameter $r^2 x^\top x/c^2$. Eq. (25) implies that $a(x)$ in (8) becomes
$$a(x) = \int \min\Big\{1,\, e^{-\frac{\tau^2}{2}(c^2 f - x^\top x)}\Big\}\, p(f)\, df, \qquad (26)$$
where $p(f)$ is the density of the random variable f, which writes $p(f) = \sum_{j=0}^{\infty} \mathrm{Pois}(j\,|\,\kappa^\top\kappa/2)\, \mathrm{Gam}(f;\, d/2 + j,\, 2)$.
Notice that, after splitting the integral in (26) at $f = x^\top x/c^2$, the second term can be calculated by using the cdf of the non-central chi-squared distribution. For the first term, after some algebra, we have that
$$\int_{x^\top x/c^2}^{\infty} e^{-\frac{\tau^2}{2}(c^2 f - x^\top x)}\, p(f)\, df = e^{\frac{\tau^2 x^\top x}{2}} \sum_{j=0}^{\infty} \frac{\mathrm{Pois}(j\,|\,\kappa^\top\kappa/2)}{(c^2\tau^2 + 1)^{d/2+j}} \int_{x^\top x/c^2}^{\infty} \mathrm{Gam}\big(f;\, d/2 + j,\, 2/(c^2\tau^2 + 1)\big)\, df. \qquad (27)$$
To compute $a_g(x)$ we first note that
$$a_g(x) = \sum_{k=1}^{K} w_k\, a_{g_k}(x), \qquad (28)$$
where
$$a_{g_k}(x) = \int \min\Big\{1,\, e^{-\frac{1}{2}(y^\top y - x^\top x)}\, \frac{q(x|y)}{q(y|x)}\Big\}\, g_k(y)\, q(y|x)\, dy$$
and $g_k$ denotes the kth exponential component of $G_0$ in (11). We then calculate $a_{g_k}(x)$ by noting that $g_k(y)\, q(y|x) = A_k(x)\, \mathrm{N}(y|m_k(x), s_k^2 I_d)$, where $A_k(x)$ and $m_k(x)$ are available in closed form and $s_k^2 = c^2/(1 + 2c^2\gamma_k)$. By setting $z_{k,g} = (y - m_k(x))/s_k$ and $f_{k,g} = (z_{k,g} + \zeta_{k,g})^\top(z_{k,g} + \zeta_{k,g})$, where $\zeta_{k,g} = m_k(x)/s_k$, we work as in (25) and obtain the expression (29) for $a_{g_k}(x)$, where the random variable $f_{k,g}$ follows the non-central chi-squared distribution with d degrees of freedom and non-centrality parameter $m_k(x)^\top m_k(x)/s_k^2$ and the expectation is calculated by utilizing the cdf of $f_{k,g}$ as in Eq. (26)-(27). Finally, from Eq. (28) we obtain $a_g(x)$.

Proof of Proposition 2

Let $q(y|x)$ be the proposal distribution defined by (10). We have that $\frac{\pi_0(y)\, q(x|y)}{\pi_0(x)\, q(y|x)} = \exp\{-\frac{\tau^2}{2}(y^\top y - x^\top x)\}$, where $\tau^2 = 1$ in the case of RWM and $\tau^2 = c^2/4$ in the case of MALA. We assume that $F(x) = x^{(j)}$ and we show that (i) $\hat F_{\pi_0}(-x^{(j)}, x^{(j')}) = -\hat F_{\pi_0}(x)$ and (ii) $\hat F_{\pi_0}(x^{(j)}, x^{(j')}) = \hat F_{\pi_0}(x^{(j)}, \Pi x^{(j')})$, where $x^{(j')}$ denotes the vector $x \in \mathbb{R}^d$ without its jth coordinate and Π is a permutation matrix. Since $\pi_0(F) = 0$, the Poisson equation (6) can be written as
$$\hat F_{\pi_0}(x) \int \alpha(x, y)\, q(y|x)\, dy - \int \alpha(x, y)\, \hat F_{\pi_0}(y)\, q(y|x)\, dy = x^{(j)}, \qquad (30)$$
where
$$\alpha(x, y) = \min\Big\{1,\, e^{-\frac{\tau^2}{2}(y^\top y - x^\top x)}\Big\}, \qquad q(y|x) = \mathrm{N}(y|rx, c^2 I_d). \qquad (32)$$
To prove (ii), we denote by z the d-dimensional vector such that $z^{(j)} = x^{(j)}$ and $z^{(j')} = \Pi x^{(j')}$, and we apply the following transformation to (30): we set $\tilde z$ to be the d-dimensional vector such that $\tilde z^{(j)} = y^{(j)}$ and $\tilde z^{(j')} = \Pi y^{(j')}$. Then we have that
$$\hat F_{\pi_0}\big(z^{(j)}, \Pi^{-1} z^{(j')}\big) \int \alpha(z, \tilde z)\, q(\tilde z|z)\, d\tilde z - \int \alpha(z, \tilde z)\, \hat F_{\pi_0}\big(\tilde z^{(j)}, \Pi^{-1} \tilde z^{(j')}\big)\, q(\tilde z|z)\, d\tilde z = z^{(j)}, \qquad (33)$$
where $\alpha(z, \tilde z)$ and $q(\tilde z|z)$ are as in (32), since they are invariant to arbitrary permutations of z and/or $\tilde z$, and $\Pi^{-1}$ is a permutation matrix such that $\Pi^{-1}\Pi z^{(j')} = z^{(j')}$. From (33) we have that $\hat F_{\pi_0}(z^{(j)}, \Pi^{-1} z^{(j')})$ is a solution of the Poisson equation (30), which proves the claim.

For the calculation of $\mathbb{E}_q[h(x, y)]$ in the case of MALA (Remark 2), the random variable corresponding to f follows the non-central chi-squared distribution with d degrees of freedom and non-centrality parameter $k(x)^\top k(x)/c^2$, where $k(x)$ denotes the mean of the reparametrized proposal (18). To compute the integral
$$a_g^h(\tilde x) = \int \min\Big\{1,\, e^{-\frac{1}{2}(\tilde y^\top \tilde y - \tilde x^\top \tilde x)}\, \frac{\tilde q(\tilde x|\tilde y)}{\tilde q(\tilde y|\tilde x)}\Big\}\, G_0(\tilde y)\, q(\tilde y|\tilde x)\, d\tilde y,$$
we work again as in the proof of Proposition 1 for the calculation of $a_g(x)$, and we find that $a_g^h(\tilde x)$ is given by equation (29) with accordingly modified quantities $m_k$ and $A_k$.

Figure 1: Numerical solution of the Poisson equation (black solid lines) and its approximation (red dashed lines) in the case of a univariate standard Gaussian target simulated by using the random walk Metropolis (RWM) algorithm and the Metropolis-adjusted Langevin algorithm (MALA).

Figure 2: Sequence of the standard ergodic averages (black solid lines) and the proposed estimates (blue dashed lines). The red lines indicate the mean of the d-variate standard Gaussian target. The values are based on samples drawn by employing either the RWM (top row) or the MALA (bottom row), with 10,000 iterations discarded as burn-in.

Figure 3: Distributions of the estimators $\mu_n(F)$ and $\mu_{n,G}(F_{\hat\theta})$ for the standard Gaussian target; see Section 3.1.

Table 1: Optimal values for the parameters of the function $G_0$ in (12).

Table 2: Estimated factors by which the variance of $\mu_n(F)$ is larger than the variance of $\mu_{n,G}(F)$ for the standard Gaussian d-variate target. We collect n samples after the first 10,000 iterations of the RWM and the MALA.

Table 4: Summary of the datasets for logistic regression.

Table 7: Estimated factors by which the variance of $\mu_n(F)$ is larger than the variance of $\mu_{n,G}(F)$ for the parameters of the d-dimensional stochastic volatility model. We collect n samples after the first 10,000 iterations of the MALA.