1 Introduction

Statistical methods for reducing the bias and the variance of estimators have played a prominent role in Monte Carlo based numerical algorithms. Variance reduction via control variates has a long and well-studied history, introduced as early as the work of Kahn and Marshall (1953), whereas an early non-parametric estimate of bias, subsequently renamed the jackknife and broadly used for bias reduction, was first presented by Quenouille (1956). However, the corresponding theoretical developments for the more complicated, but extremely popular and practically important, estimators based on MCMC algorithms have been rather limited. The major impediment is the fact that MCMC estimators are based on ergodic averages of dependent samples produced by simulating a Markov chain.

We provide a general methodology to construct control variates for any discrete time random walk Metropolis (RWM) and Metropolis-adjusted Langevin algorithm (MALA) Markov chains that can achieve, in a post-processing manner and with a negligible additional computational cost, impressive variance reduction when compared to the standard MCMC ergodic averages. Our proposed estimators are based on an approximate, but accurate, solution of the Poisson equation for a multivariate Gaussian target density of any dimension.

Suppose that we have a sample of size n from an ergodic Markov chain \(\{X_n\}_{n \ge 0}\) with continuous state space \({\textbf{X}}\subseteq \mathbb {R}^d\), transition kernel P and invariant measure \(\pi \). A standard estimator of the mean \(\textrm{E}_{\pi }[F] := \pi (F) = \int F d\pi \) of a real-valued function F defined on \({\textbf{X}}\) under \(\pi \) is the ergodic mean

$$\begin{aligned} \mu _n(F) :=\frac{1}{n}\sum _{i=0}^{n-1}F(X_i), \end{aligned}$$

which satisfies, for any initial distribution of \(X_0\), a central limit theorem of the form

$$\begin{aligned} \sqrt{n} \left[ \mu _n(F) - \pi (F) \right] = n^{-1/2} \sum _{i=0}^{n-1} \left[ F(X_i) - \pi (F) \right] \overset{D}{\rightarrow } N(0,\sigma ^2_F), \end{aligned}$$

with the asymptotic variance given by

$$\begin{aligned} \sigma ^2_F := \lim _{n \rightarrow \infty } n E_\pi \left[ \left( \mu _n(F) - \pi (F)\right) ^2 \right] . \end{aligned}$$

Interesting attempts at variance reduction for Markov chain samplers include the use of antithetic variables (Barone and Frigessi 1990; Green and Han 1992; Craiu et al. 2005), Rao-Blackwellization (Gelfand and Smith 1990), Riemann sums (Philippe and Robert 2001) and autocorrelation reduction (Mira and Geyer 2000; Van Dyk and Meng 2001; Yu and Meng 2011).

Control variates have played an outstanding role in the MCMC variance reduction quiver. A strand of research is based on Assaraf and Caffarel (1999), who noticed that a Hamiltonian operator together with a trial function is sufficient to construct an estimator with zero asymptotic variance. They considered a Hamiltonian operator of Schrödinger-type that led to a series of zero-variance estimators studied by Valle and Leisen (2010), Mira et al. (2013) and Papamarkou et al. (2014). The estimation of the optimal parameters of the trial function is conducted by ignoring the Markov chain sample dependency, an issue that was dealt with by Belomestny et al. (2020) by utilizing spectral methods. The main barrier to the wide applicability of zero-variance estimators is that their computational complexity increases with d, see South et al. (2018). Another approach to construct control variates is a non-parametric version of the methods presented by Mira et al. (2013) and Papamarkou et al. (2014), which leads to the construction of control functionals (Oates et al. 2017; Barp et al. 2018; South et al. 2020). Although their computational cost with respect to d is low, their general applicability is hindered by the cubic computational cost with respect to n (South et al. 2018; Oates et al. 2019) and the possibility of suffering from the curse of dimensionality often met in non-parametric methods (Wasserman 2006). Finally, Hammer and Tjelmeland (2008) proposed constructing control variates by expanding the state space of the Metropolis-Hastings algorithm.

An approach which is closely related to our proposed methodology attempts to minimise the asymptotic variance \(\sigma ^2_F\). This seems a hard problem since a closed-form expression for \(\sigma ^2_F\) is not available, and therefore neither is a loss function to be minimised; see, for example, Flegal et al. (2010). However, there has been recent research activity based on the following observation by Andradóttir et al. (1993). If a solution \({{{\hat{F}}}}\) to the Poisson equation for F were available, that is, if for every \(x \in {\textbf{X}}\)

$$\begin{aligned} F(x) + P{{{\hat{F}}}}(x) - {{{\hat{F}}}}(x) = \pi (F) \end{aligned}$$
(1)

where

$$\begin{aligned} PF(x) := E_x [ F(X_1)] = E [ F(X_1) \mid X_0 = x], \end{aligned}$$

then the function \(F(x) + P{{{\hat{F}}}}(x) - {{{\hat{F}}}}(x)\) would be constant and equal to \(\pi (F)\). It is then immediate that a zero-variance and zero-bias estimator for F is given by

$$\begin{aligned} \mu _{n,{{{\hat{F}}}}}(F) :=\frac{1}{n}\sum _{i=0}^{n-1}\{F(X_i) + P{{{\hat{F}}}}(X_i) - {{{\hat{F}}}}(X_i)\} \end{aligned}$$

which can be viewed as an enrichment of the estimator \(\mu _n(F)\) with the (optimal) control variate \(P{{{\hat{F}}}} - {{{\hat{F}}}}\). Of course, solving (1) is extremely hard for continuous state space Markov chains, even if we assume that \(E_{\pi }[F]\) is known, because it involves solving a non-standard integral equation. Interestingly, a solution of this equation (also called the fundamental equation) produces the zero-variance estimators suggested by Assaraf and Caffarel (1999) for a specific choice of Hamiltonian operator. One of the rare examples in which (1) has been solved exactly for discrete time Markov chains is the random scan Gibbs sampler with a multivariate Gaussian target density, see Dellaportas and Kontoyiannis (2012), Dellaportas and Kontoyiannis (2009). They advocated that this solution provides a good approximation to (1) for posterior densities often met in Bayesian statistics that are close to multivariate Gaussian densities. Indeed, since a direct solution of (1) is not available, approximating \({{{\hat{F}}}}\) has also been suggested by Andradóttir et al. (1993), Atchadé and Perron (2005), Henderson (1997), Meyn (2008).

Tsourti (2012) attempted to extend the work by Dellaportas and Kontoyiannis (2012) to RWM samplers. The resulting algorithms produced estimators with lower variance, but the computational cost required for the post-processing construction of these estimators counterbalances the variance reduction gains. We build on the work by Tsourti (2012) here but we differ in that (i) we build new non-linear d-dimensional approximations to \({{{\hat{F}}}}(x)\), appropriately chosen to facilitate analytic computations, rather than linear combinations of 1-dimensional functions and (ii) we produce efficient Monte Carlo approximations of the d-dimensional integral \(P{{{\hat{F}}}}(x)\) so that no extra computation is required for its evaluation. Finally, Mijatović et al. (2018) approximate numerically the solution of (1) for 1-dimensional RWM samplers, and Mijatović and Vogrinc (2019) construct control variates for large d by employing the solution of (1) associated with the Langevin diffusion to which the Markov chain converges as the dimension of its state space tends to infinity (Roberts et al. 1997); this requires very expensive Monte Carlo estimation methods and is therefore prohibitive for realistic statistical applications.

We follow this route and add to this literature by extending the work of Dellaportas and Kontoyiannis (2012) and Tsourti (2012) to RWM and MALA algorithms, producing estimators for the posterior means of each co-ordinate of a d-dimensional target density with reduced asymptotic variance and negligible extra computational cost. Our Monte Carlo estimator of the expectation \(\pi (F)\) makes use of three components:

(a) An approximation G(x) to the solution of the Poisson equation associated with the target \(\pi (x)\), transition kernel P and function F(x).

(b) An approximation of the target \(\pi (x)\) with a Gaussian density \({\widetilde{\pi }}(x) = {\mathscr {N}}(x|\mu , \Sigma )\), and then the specification of G(x) in (a) by an accurate approximation to the solution of the Poisson equation for the approximate target \({\widetilde{\pi }}(x)\).

(c) An additional control variate, referred to as the static control variate, that is based on the same Gaussian approximation \({\widetilde{\pi }}(x)\) and allows us to reduce the variance of a Monte Carlo estimator for the intractable expectation PG(x).

In Section 2 we provide full details of the above steps. We start by discussing, in Section 2.1, how all the above ingredients are put together to eventually arrive at the general form of our proposed estimator in equation (7). In Section 3 we present extensive simulation studies that verify that our methodology performs very well with multi-dimensional Gaussian targets and that variance reduction ceases when we deal with a multimodal 50-dimensional target density with distinct, remote modes. Moreover, we apply our methodology to real data examples consisting of a series of logistic regression examples with parameter vectors up to 25 dimensions and two stochastic volatility examples with 53 and 103 parameters. In all cases we have produced estimators with considerable variance reduction at negligible extra computational cost.

1.1 Some notation

In the remainder of the paper we use a simplified notation where both d-dimensional random variables and their values are denoted by lower case letters, such as \(x = (x^{(1)}, \ldots , x^{(d)})\), where \(x^{(j)}\) is the jth dimension or coordinate, \(j=1,\ldots ,d\); the subscript i refers to the ith sample drawn by using an MCMC algorithm, that is, \(x_i^{(j)}\) is the ith sample for the jth coordinate of x; the density of the d-variate Gaussian distribution with mean m and covariance matrix S is denoted by \({\mathscr {N}}(\cdot |m,S)\); for a function f(x) we set \(\nabla f := ( \partial f/\partial x^{(1)},\ldots ,\partial f/\partial x^{(d)})\); \(I_d\) is the \(d \times d\) identity matrix and the superscript \(\top \) in a vector or matrix denotes its transpose; \(||\cdot ||\) denotes the Euclidean norm; all the vectors are understood as column vectors.

2 Metropolis–Hastings estimators with control variates from the Poisson equation

2.1 The general form of estimators for arbitrary targets

Consider an arbitrary intractable target \(\pi \) from which we have obtained a set of correlated samples by simulating a Markov chain with a Metropolis-Hastings transition kernel P that is invariant with respect to \(\pi \). To start with, consider an arbitrary function G(x). By the observation of Henderson (1997), the function \(PG(x) - G(x)\) has zero expectation with respect to \(\pi \) because the kernel P is invariant to \(\pi \). Therefore, given n correlated samples from the target, i.e. \(x_i \sim \pi \) with \(i=0,\ldots ,n-1\), the following estimator is unbiased

$$\begin{aligned} \mu _{n,G}(F) :=\frac{1}{n}\sum _{i=0}^{n-1}\{F(x_i) + \underbrace{P G(x_i) - G(x_i)}_{\text {Poisson control variate}}\}. \end{aligned}$$
(2)

For general Metropolis-Hastings algorithms the kernel P is such that the expectation PG(x) takes the form

$$\begin{aligned} PG(x)&= \int P(x,d y) G(y) \nonumber \\&= \int \alpha (x, y) q(y|x) G(y) dy \nonumber \\&\quad + \Big (1-\int \alpha (x,y) q(y|x) dy \Big ) G(x) \nonumber \\&= G(x) + \int \alpha (x,y)(G(y)-G(x))q(y|x)dy, \end{aligned}$$
(3)

where

$$\begin{aligned} \alpha (x,y) = \min \big \{ 1,r(x,y) \big \},~~r(x,y) = \frac{\pi (y)q(x|y)}{\pi (x)q(y|x)} \end{aligned}$$
(4)

and q(y|x) is the proposal distribution. By substituting (3) back into estimator (2) we obtain

$$\begin{aligned} \mu _{n,G}(F) := \frac{1}{n}\sum _{i=0}^{n-1} \bigg \{ F(x_i) + \underbrace{ \int \alpha (x_i,y)(G(y)-G(x_i))q(y|x_i)dy }_{\text {Poisson control variate}} \bigg \}. \end{aligned}$$
(5)

To use this estimator we need to overcome two obstacles: (i) we need to specify the function G(x) and (ii) we need to deal with the intractable integral associated with the control variate.

Regarding (i), there is a theoretically optimal choice, which is to set G(x) to the function \({\hat{F}}(x)\) that solves the Poisson equation,

$$\begin{aligned} \int \alpha (x,y)({\hat{F}}(y)-{\hat{F}}(x))q(y|x)dy = - F(x) + \pi (F), \end{aligned}$$
(6)

for every \(x \sim \pi \), where we have substituted the Metropolis-Hastings kernel into the general form of the Poisson equation in (1). For this optimal choice of G the estimator in (5) has zero variance, i.e. it equals the exact expectation \(\pi (F)\). Nevertheless, obtaining \({\hat{F}}\) for general high-dimensional intractable targets is not feasible, and hence we need to compromise with an inferior choice for G that can only approximate \({\hat{F}}\). To obtain such a G, we make use of a Gaussian approximation to the intractable target, as indicated by the assumption below.

Assumption 1

The target \(\pi (x)\) is approximated by a multivariate Gaussian \({\widetilde{\pi }}(x) = {\mathscr {N}}(x|\mu , \Sigma )\) and the covariance matrix of the proposal q(y|x) is proportional to \(\Sigma \).

The main purpose of the above assumption is to establish the ability to construct an efficient RWM or MALA sampler. Indeed, it is well-known that efficient implementation of these Metropolis-Hastings samplers when \(d>1\) requires that the covariance matrix of q(y|x) should resemble as much as possible the shape of \(\Sigma \). In adaptive MCMC (Roberts and Rosenthal 2009), such a shape matching is achieved during the adaptive phase where \(\Sigma \) is estimated. If \(\pi (x)\) is a smooth differentiable function, \(\Sigma \) could be alternatively estimated by a gradient-based optimisation procedure and it is then customary to choose a proposal covariance matrix of the form \(c^2 \Sigma \) for a tuned scalar c.
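To illustrate the gradient-based alternative mentioned above, a minimal sketch of a Laplace-type fit is given below; the functions log_pi and grad_log_pi are hypothetical user-supplied routines for the log-target and its gradient, and this is only one possible way of obtaining the pair \((\mu , \Sigma )\) of Assumption 1.

```r
# A possible way to obtain the Gaussian approximation of Assumption 1 via a
# Laplace-type fit; log_pi and grad_log_pi are hypothetical functions
# returning the log-target density and its gradient.
laplace_fit <- function(log_pi, grad_log_pi, x0) {
  opt <- optim(x0,
               fn = function(x) -log_pi(x),
               gr = function(x) -grad_log_pi(x),
               method = "BFGS", hessian = TRUE)
  list(mu    = opt$par,            # mode of the target as the mean
       Sigma = solve(opt$hessian)) # inverse Hessian of -log pi as the covariance
}
```

The proposal covariance matrix is then taken to be \(c^2\Sigma \) with a tuned scalar c, as described above.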

We then aim to solve the Poisson equation for the Gaussian approximation by finding the function \({\hat{F}}_{{\widetilde{\pi }}}(x)\) that satisfies,

$$\begin{aligned} \int {\widetilde{\alpha }}(x,y)({\hat{F}}_{{\widetilde{\pi }}}(y)-{\hat{F}}_{{\widetilde{\pi }}}(x))q(y|x)dy = - F(x) + {\widetilde{\pi }}(F), \end{aligned}$$

for every \(x \sim {\widetilde{\pi }}\). It is useful to emphasize the difference between this new Poisson equation and the original Poisson equation in (6). This new equation involves the approximate Gaussian target \({\widetilde{\pi }}\) and the corresponding “approximate” Metropolis-Hastings transition kernel \({\widetilde{P}}\), which has been modified so that the ratio \({\widetilde{\alpha }}(x,y)\) is obtained by replacing the exact target \(\pi \) with the approximate target \({\widetilde{\pi }}\), while the proposal q(y|x) is also modified if needed. Clearly, this modification makes \({\widetilde{P}}\) invariant to \({\widetilde{\pi }}\). When \({\widetilde{\pi }}\) is a good approximation to \(\pi \), we also expect \({\hat{F}}_{{\widetilde{\pi }}}\) to closely approximate the ideal function \({\hat{F}}\). Therefore, in our method we propose to set G to \({\hat{F}}_{{\widetilde{\pi }}}\) (in fact, to an analytic approximation of \({\hat{F}}_{{\widetilde{\pi }}}\)) and then use it in the estimator (5).

Having chosen G(x), we now discuss the second challenge (ii), i.e. dealing with the intractable expectation \(\int \alpha (x_i,y)(G(y)-G(x_i))q(y|x_i) d y\). Given that for any drawn sample \(x_i\) of the Markov chain there is also a corresponding proposed sample \(y_i\) that is generated from the proposal, we can unbiasedly approximate the integral with a single-sample Monte Carlo estimate,

$$\begin{aligned}&\int \alpha (x_i,y)(G(y)-G(x_i))q(y|x_i) d y \\&\qquad \approx \alpha (x_i,y_i)(G(y_i)-G(x_i)), \end{aligned}$$

where \(y_i \sim q(y|x_i)\). Although \(\alpha (x_i,y_i)(G(y_i)-G(x_i))\) is an unbiased stochastic estimate of the Poisson-type control variate, it can have high variance that needs to be reduced. We introduce a second control variate based on a function \(h(x_i,y_i)\) that correlates well with \(\alpha (x_i,y_i)(G(y_i)-G(x_i))\) and has an analytic expectation \(\textrm{E}_{q(y|x_i)} [h(x_i,y)]\). We refer to this control variate as static since it involves a standard Monte Carlo problem with exact samples from the tractable proposal density q(y|x). To construct \(h(x_i,y)\) we rely again on the Gaussian approximation \({\widetilde{\pi }}(x) = {\mathscr {N}}(x|\mu , \Sigma )\), as we describe in Sect. 2.3.

With G(x) and h(x, y) specified, we can finally write down the general form of the proposed estimator, which can be efficiently computed using only the MCMC output samples \(\{x_i\}_{i=0}^{n-1}\) and the corresponding proposed samples \(\{y_i\}_{i=0}^{n-1}\):

$$\begin{aligned} \mu _{n,G}(F) :=\frac{1}{n}\sum _{i=0}^{n-1}\bigg \{ F(x_i) &+ \underbrace{ \alpha (x_i,y_i)(G(y_i)-G(x_i)) }_{\text {Stochastic Poisson control variate}} \\ &+ \underbrace{h(x_i,y_i) - \textrm{E}_{q(y|x_i)} [h(x_i,y)]}_{\text {Static control variate}} \bigg \}. \end{aligned}$$
(7)

In practice we use a slightly modified version of this estimator by adding a set of adaptive regression coefficients \(\theta _n\) to further reduce the variance following Dellaportas and Kontoyiannis (2012); see Sect. 2.4.

2.2 Approximation of the Poisson equation for Gaussian targets

2.2.1 Standard Gaussian case

In this section we construct an analytical approximation to the exact solution of the Poisson equation for the standard Gaussian d-variate target \({\widetilde{\pi }}_0(x) = {\mathscr {N}}(x|0,I_d)\) and for the function \(F(x)=x^{(j)}\) where \(1 \le j \le d\). We use the function \(F(x)=x^{(j)}\) in the remainder of the paper which corresponds to approximating the mean value \(\textrm{E}_{\pi }[x^{(j)}]\), while other choices of F are left for future work. We denote the exact unknown solution by \({\hat{F}}_{{\widetilde{\pi }}_0}\) and the analytical approximation by \(G_0\). Given this target and some choice for \(G_0\) we express the expectation in (3) as

$$\begin{aligned} PG_0(x) = G_0(x)(1-a(x)) + a_g(x), \end{aligned}$$

where

$$\begin{aligned} a(x) = \int \min \bigg \{1, e^{ -\frac{1}{2}(y^\top y - x^\top x)} \frac{q(x|y)}{q(y|x)} \bigg \} q(y|x) dy, \end{aligned}$$
(8)
$$\begin{aligned} a_g(x) = \int \min \bigg \{1, e^{ - \frac{1}{2}(y^\top y- x^\top x)} \frac{q(x|y)}{q(y|x)} \bigg \} G_0(y) q(y|x) dy. \end{aligned}$$
(9)

The calculation of \(PG_0(x)\) thus reduces to the calculation of the integrals a(x) and \(a_g(x)\). In both integrals \(x^\top x\) is just a constant since the integration is with respect to y. Moreover, the MCMC algorithm we consider is either RWM or MALA with proposal

$$\begin{aligned} q(y|x)={\mathscr {N}}(y| rx,c^2I), \end{aligned}$$
(10)

where \(r=1\) corresponds to RWM and \(r=1-c^2/2\) to MALA while \(c>0\) is the step-size. Both a(x) and \(a_g(x)\) are expectations under the proposal distribution q(y|x).

One key observation is that for any dimension d, \(y^\top y\) is just a univariate random variable with law induced by q(y|x). Then, \(y^\top y\) together with \(\log \tfrac{q(x|y)}{q(y|x)}\) can induce an overall tractable univariate random variable so that the computation of a(x) in (8) can be performed analytically. The computation of \(a_g(x)\) is more involved since it depends on the form of \(G_0\). Therefore, we propose an approximate \(G_0\) by first introducing a parametrised family that leads to tractable and efficient closed-form computation of \(a_g(x)\). In particular, we consider the following weighted sum of exponential functions

$$\begin{aligned} \sum _{k=1}^K w_k \exp \{\beta _k^\top x - \gamma _k (x-\delta _k)^\top (x-\delta _k)\}, \end{aligned}$$
(11)

where \(w_k\) and \(\gamma _k\) are scalars whereas \(\beta _k\) and \(\delta _k\) are d-dimensional vectors. It turns out that using the form in (11) for \(G_0\) we can analytically compute the expectation \(PG_0\) as stated in Proposition 1. The proof of this proposition and the proofs of all remaining propositions and remarks presented throughout Sect. 2 are given in the “Appendix”.

Proposition 1

Let a(x) and \(a_g(x)\) be given by (8) and (9) respectively and let \(G_0\) in \(a_g(x)\) have the form in (11). Then,

$$\begin{aligned} a(x) = \textrm{E}_{f}\big [\min \big (1,\exp \big \{-\tfrac{c^2\tau ^2(f-x^\top x/c^2)}{2}\big \}\big )\big ], \end{aligned}$$

where \(\tau ^2=1\) in the case of RWM and \(\tau ^2=c^2/4\) in the case of MALA and f follows the non-central chi-squared distribution with d degrees of freedom and non-central parameter \(x^\top x/c^2\), and

$$\begin{aligned} a_g(x) = \sum _{k=1}^K A_k(x) \textrm{E}_{f_{k,g}}\big [\min \{1, \exp \{-\tfrac{\tau ^2s^2_k}{2}(f_{k,g}-x^\top x/s_k^2)\}\}\big ], \end{aligned}$$

where \(f_{k,g}\) follows the non-central chi-squared distribution with d degrees of freedom and non-central parameter \(m_k(x)^\top m_k(x)/c^2\) and \( A_k(x) =(1+2c^2\gamma _k)^{-d/2} \exp \bigg \{-\frac{r^2x^\top x}{2c^2}-\gamma _k\delta _k^\top \delta _k + \frac{m_k(x)^\top m_k(x)}{2c^2(1+2\gamma _kc^2)} \bigg \}, \) \(m_k(x) \!= \!\dfrac{rx + c^2(\beta _k+\gamma _k\delta _k)}{1+2c^2\gamma _k}\) and \(s_k^2=c^2/(1+2c^2\gamma _k)\).

Proposition 1 states that the calculation of \(a_g(x)\) and a(x) is based on the cdf of the non-central chi-squared distribution and allows, for d-variate standard normal targets, the exact computation of the modified estimator \(\mu _{n,G}\) given by (2).
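To illustrate how this works in practice, the following sketch evaluates a(x) directly from the representation stated in Proposition 1, by splitting the expectation at the point where the minimum switches between its two branches and applying an exponential change of measure to the non-central chi-squared law; it assumes only this representation, and the exact formulas used in the paper are those derived in the Appendix.

```r
# Sketch: evaluation of a(x) in (8) for a standard Gaussian target, assuming
# the representation of Proposition 1. Here f ~ non-central chi-squared with
# d degrees of freedom and non-centrality b = x'x / c^2, and
# a(x) = E[ min(1, exp{-k (f - b)}) ] with k = c^2 * tau^2 / 2.
a_of_x <- function(x, c2, tau2) {
  d <- length(x)
  b <- sum(x^2) / c2              # threshold and non-centrality parameter
  k <- c2 * tau2 / 2
  # branch f <= b: the minimum equals 1
  p1 <- pchisq(b, df = d, ncp = b)
  # branch f > b: exponentially tilting a non-central chi-squared gives again
  # a non-central chi-squared after rescaling by (1 + 2k)
  tilt <- (1 + 2 * k)^(-d / 2) * exp(-b * k / (1 + 2 * k))
  p2 <- exp(k * b) * tilt *
    pchisq(b * (1 + 2 * k), df = d, ncp = b / (1 + 2 * k), lower.tail = FALSE)
  p1 + p2
}
```

Here tau2 equals 1 for RWM and \(c^2/4\) for MALA, as in Proposition 1; the terms in \(a_g(x)\) are handled analogously, with the extra factors \(A_k(x)\).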

Fig. 1

Numerical solution of the Poisson equation (black solid lines) and its approximation (red dashed lines) in the case of univariate standard Gaussian target simulated by using the random walk Metropolis (RWM) algorithm and the Metropolis-adjusted Langevin algorithm (MALA). (Color figure online)

Having a family of functions for which we can calculate analytically the expectation \(PG_0\) we turn to the problem of specifying a particular member of this family to serve as an accurate approximation to the solution of the Poisson equation for the standard Gaussian distribution. We first provide the following proposition which states that \({\hat{F}}_{{\widetilde{\pi }}_0}\) satisfies certain symmetry properties.

Proposition 2

Given \(F(x) = x^{(j)}\), the exact solution \({\hat{F}}_{{\widetilde{\pi }}_0}(x)\) is: (i) (for \(d \ge 1\)) an odd function in the dimension \(x^{(j)}\); (ii) (for \(d \ge 2\)) an even function in any remaining dimension \(x^{(j')}, j' \ne j\); (iii) (for \(d \ge 3\)) permutation invariant over the remaining dimensions.

To construct an approximation model family that incorporates the symmetry properties of Proposition 2 we make the following assumptions for the parameters in (11). We set \(K=4\) and we assume that \(w_k \in \mathbb {R}\) and \(\gamma _k >0 \) for each \(k=1,2,3,4\), whereas we set \(w_1=-w_2=b_0\), \(w_3=-w_4=c_0\), \(\gamma _1=\gamma _2 =b_2\) and \(\gamma _3=\gamma _4 =c_1\). Moreover, for the d-dimensional vectors \(\beta _k\) and \(\delta _k\) we assume that \(\beta _1 = -\beta _2\), \(\beta _3 = \beta _4 =\delta _1 = \delta _2 =0\) and \(\delta _3 = -\delta _4\); we set the vectors \(\beta _1\) and \(\delta _3\) to be filled everywhere with zeros except for their jth element, which is equal to \(b_1\) and \(c_2\) respectively. We thus specify the function \(G_0:\mathbb {R}^d \rightarrow \mathbb {R}\) as

$$\begin{aligned} G_0(x)&= b_0 (e^{b_1 x^{(j)}}-e^{-b_1 x^{(j)}}) e^{-b_2 ||x||^2} \\&\quad + c_0 (e^{- c_1 (x^{(j)}-c_2)^2} - e^{- c_1(x^{(j)}+c_2)^2}) e^{-c_1 \sum _{j' \ne j} (x^{(j')})^2}. \end{aligned}$$
(12)

We note that the above choices for the parameters of \(G_0\) are not the only ones that result in a function that obeys the symmetry properties of Proposition 2. By imposing, however, the described restrictions on the parameters of \(G_0\), we keep the number of free parameters low, thus allowing the efficient identification of optimal parameter values.
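For concreteness, the function \(G_0\) in (12) can be evaluated with a few lines of R; the parameter vector below collects \((b_0,b_1,b_2,c_0,c_1,c_2)\), for instance the optimised values reported in Table 1.

```r
# The approximation G0 of (12) for the coordinate j of interest;
# par = c(b0, b1, b2, c0, c1, c2).
G0 <- function(x, par, j = 1) {
  b0 <- par[1]; b1 <- par[2]; b2 <- par[3]
  c0 <- par[4]; c1 <- par[5]; c2 <- par[6]
  xj <- x[j]
  b0 * (exp(b1 * xj) - exp(-b1 * xj)) * exp(-b2 * sum(x^2)) +
    c0 * (exp(-c1 * (xj - c2)^2) - exp(-c1 * (xj + c2)^2)) *
      exp(-c1 * sum(x[-j]^2))
}
```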

To identify optimal parameters for the function \(G_0\) in (12) such that \(G_0 \approx {\hat{F}}_{{\widetilde{\pi }}_0}\) we first simulate a Markov chain with large sample size n from the d-variate standard Gaussian distribution by employing the RWM algorithm and the MALA. Then, for each algorithm we minimize the loss function

$$\begin{aligned} {\mathscr {L}} = (1/n)\sum _{i=1}^n(G_0(x_i)-PG_0(x_i)-x^{(1)}_i)^2, \end{aligned}$$
(13)

with respect to the parameters \(b_0\), \(b_1\), \(b_2\), \(c_0\), \(c_1 \) and \(c_2\) by employing the Broyden-Fletcher-Goldfarb-Shanno method (Broyden 1970) as implemented by the routine optim in the statistical software R (R Core Team 2021). Figure 1 provides an illustration of the achieved approximation to \({\hat{F}}_{{\widetilde{\pi }}_0}\) in the univariate case, where \(d=1\) and the model in (12) simplifies to

$$\begin{aligned} G_0(x)&= b_0 (e^{b_1 x -b_2 x^2}-e^{-b_1 x-b_2 x^2}) \\&\quad + c_0 (e^{- c_1 (x-c_2)^2} - e^{- c_1(x+c_2)^2}). \end{aligned}$$

For this case, we can visualize our optimised \(G_0\) and compare it against the numerical solution from Mijatović et al. (2018). Figure 1 shows this comparison, which provides clear evidence that for \(d=1\) our approximation is very accurate.
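A sketch of the fitting step described above is given below; it assumes a matrix xs of pre-simulated chain states and a routine PG0(x, par) that returns the analytic expectation \(PG_0(x)\) of Proposition 1 (built, for example, from quantities such as a_of_x above together with the corresponding \(a_g\) terms), and the initial values are arbitrary.

```r
# Fit the parameters of G0 by minimising the empirical loss (13) with BFGS.
# xs is an n x d matrix of chain states; PG0(x, par) is assumed available.
fit_G0 <- function(xs, PG0, init = c(1, 1, 0.1, 1, 0.1, 1)) {
  loss <- function(par) {
    res <- apply(xs, 1, function(x) G0(x, par) - PG0(x, par) - x[1])
    mean(res^2)
  }
  optim(init, loss, method = "BFGS")$par
}
```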

2.2.2 General Gaussian case

Given the general d-variate Gaussian target \({\widetilde{\pi }}(x) = {\mathscr {N}}(x|\mu ,\Sigma )\) we denote by \({\hat{F}}_{{\widetilde{\pi }}}\) the exact solution of the Poisson equation and by \(G\) the approximation that we wish to construct. To approximate \({\hat{F}}_{{\widetilde{\pi }}}\) we apply a change of variables transformation from the standard normal, as motivated by the following proposition and remark.

Proposition 3

Consider the standard normal target \({\widetilde{\pi }}_0(x) = {\mathscr {N}}(x|0,I_d)\), the function \(F(x) = x^{(1)}\), and let \({\hat{F}}_{{\widetilde{\pi }}_0}\) denote the associated solution of the Poisson equation for either RWM with proposal \(q(y|x) = {\mathscr {N}}(y|x, c^2I)\) or MALA with proposal \(q(y|x) = {\mathscr {N}}(y| (1 - c^2/2)x, c^2I)\). Then, the solution \({\hat{F}}_{{\widetilde{\pi }}}\) for the general Gaussian target \({\widetilde{\pi }}(x) = {\mathscr {N}}(x|\mu ,\Sigma )\) and Metropolis-Hastings proposal

$$\begin{aligned} q(y|x)= {\left\{ \begin{array}{ll} {\mathscr {N}}(y| x,c^2\Sigma ) & \text {if RWM} \\ {\mathscr {N}}(y| x + (c^2/2)\Sigma \nabla \log {\tilde{\pi }}(x),c^2\Sigma ) & \text {if MALA}, \end{array}\right. } \end{aligned}$$
(14)

is \( {\hat{F}}_{{\widetilde{\pi }}}(x) = L_{1 1} {\hat{F}}_{{\widetilde{\pi }}_0}(L^{-1}(x-\mu )), \) where L is a lower triangular Cholesky matrix such that \(\Sigma =L L^T\) and \(L_{11}\) is its first diagonal element.

Remark 1

To apply Proposition 3 for \(F(x) = x^{(j)}\), \(j\ne 1\), the vector x needs to be permuted such that \(x^{(j)}\) becomes its first element; the corresponding permutation also has to be applied to the mean \(\mu \) and covariance matrix \(\Sigma \).

Proposition 3 implies that we can obtain the exact solution of the Poisson equation for any d-variate Gaussian target by applying a change of variables transformation to the solution for the standard normal d-variate target. Therefore, based on this theoretical result, we propose to obtain an approximation G of the solution of the Poisson equation in the general Gaussian case by simply transforming the approximation \(G_0\) in (12) from the standard normal case, so that

$$\begin{aligned} G(x) = G_0(L^{-1}(x-\mu )). \end{aligned}$$
(15)

The constant \(L_{1 1}\) is omitted since it can be absorbed by the regression coefficient \(\theta \); see Sect. 2.4. Note that Remark 1 provides guidelines for the solution of the Poisson equation associated with \({\widetilde{\pi }}\) for \(F(x) = x^{(j)}\) for any \(j=2,\ldots ,d\). However, if we need to perform variance reduction in the estimation of the means of all or a large subset of the marginal distributions of a high-dimensional target, a computationally more efficient method to conduct the desired variance reductions is available: see “Appendix D” for details.
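In code, the change of variables in (15) amounts to a single triangular solve with the Cholesky factor of \(\Sigma \); a minimal sketch for \(F(x) = x^{(1)}\) is given below (for other coordinates the permutation of Remark 1 is applied first).

```r
# Approximation G for a general Gaussian surrogate N(mu, Sigma) as in (15);
# L is the lower-triangular Cholesky factor of Sigma, e.g. L <- t(chol(Sigma)).
G_general <- function(x, mu, L, par) {
  z <- forwardsolve(L, x - mu)   # z = L^{-1} (x - mu)
  G0(z, par, j = 1)
}
```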

2.3 Construction of the static control variate h(x, y)

Suppose we have constructed a Gaussian approximation \({\widetilde{\pi }}(x) = {\mathscr {N}}(x|\mu ,\Sigma )\), where \(\Sigma = L L^\top \), to the intractable target \(\pi (x)\) and have also obtained the function \(G\) from (15) needed for the proposed, general, estimator in (7). What remains is to specify the function h(x, y), labelled as the static control variate in (7), which should correlate well with \( \alpha (x,y)(G(y)- G(x)). \) The intractable term in this function is the Metropolis-Hastings probability \(\alpha (x,y)\) in (4), where the Metropolis-Hastings ratio r(x, y) contains the intractable target \(\pi \). This suggests choosing h(x, y) as

$$\begin{aligned} h(x,y)=\min \{1,\tilde{r}(x,y)\}\big [G(y)- G(x)\big ], \end{aligned}$$
(16)

where \({\widetilde{r}}(x,y)\) is the acceptance ratio of an M-H algorithm that targets the Gaussian approximation \({\widetilde{\pi }}(x)\), that is

$$\begin{aligned} {\widetilde{r}}(x,y)=\min \bigg \{1,\frac{{\widetilde{\pi }}(y){\widetilde{q}}(x|y)}{{\widetilde{\pi }}(x){\widetilde{q}}(y|x)} \bigg \}, \end{aligned}$$
(17)

and \({\widetilde{q}}(\cdot |\cdot )\) is the proposal distribution that we would use for the Gaussian target \({\widetilde{\pi }}(x)\), as defined by equation (14). Importantly, when \({\widetilde{\pi }}\) serves as an accurate approximation to \(\pi \), the ratio \({\widetilde{r}}(x,y)\) approximates accurately the exact M-H ratio r(x, y), and \(\textrm{E}_q[h(x,y)]\) can be calculated analytically. In particular, using (15) we have that

$$\begin{aligned} \textrm{E}_q[h(x,y)]&= \int h(x,y)q(y|x)d y\\&= \int \min \{1,{\widetilde{r}}(x,y)\} \big [G_0(L^{-1}(y-\mu ))\\ {}&- G_0(L^{-1}(x-\mu )) \big ] q(y|x) d y. \end{aligned}$$

This integral can be computed efficiently as follows. We reparametrize the integral in terms of the new variable \(\tilde{y} = L^{-1} (y - \mu )\) and also use the shorthand \(\tilde{x} = L^{-1}(x - \mu )\), where x is an MCMC sample. After this reparametrization, the above expectation becomes an expectation under the distribution

$$\begin{aligned} q(\tilde{y}|\tilde{x})= {\left\{ \begin{array}{ll} {\mathscr {N}}(\tilde{y}| \tilde{x},c^2I) & \text {if RWM} \\ {\mathscr {N}}(\tilde{y} | \tilde{x} + \frac{c^2}{2} L^\top \nabla \log \pi (x), c^2 I) & \text {if MALA}, \end{array}\right. } \end{aligned}$$
(18)

where we condition on \(\tilde{x}\) with a slight abuse of notation, since the term \(\nabla \log \pi (x)\) is the exact pre-computed gradient for the sample x of the intractable target. Thus, the calculation of \(\textrm{E}_q[h(x,y)]\) reduces to the evaluation of the following integral

$$\begin{aligned}&\int \min \left\{ 1,\exp \{- \frac{1}{2}(\tilde{y}^\top \tilde{y} - \tilde{x}^\top \tilde{x}) \}\frac{{\widetilde{q}}(\tilde{x}|\tilde{y})}{{\widetilde{q}}(\tilde{y}|\tilde{x})} \right\} \nonumber \\&\quad \big [G_0(\tilde{y})- G_0(\tilde{x}) \big ]q(\tilde{y}|\tilde{x}) d \tilde{y}. \end{aligned}$$
(19)

Note also that inside the Metropolis-Hastings ratio, \({\widetilde{q}}(\tilde{y}|\tilde{x}) = {\mathscr {N}}(\tilde{y}|r\tilde{x},c^2I)\) with r as in (10). In the case of RWM, noting that the density \(q(\tilde{y}|\tilde{x})\) in (18) coincides with the density \({\widetilde{q}}(\tilde{y}|\tilde{x})\) in (10), the calculation of the integral in (19) reduces to the calculation of the integrals in (8) and (9) and can thus be conducted by utilizing Proposition 1. The calculation of the integral in (19) for the MALA is slightly different, as highlighted by the following remark.

Remark 2

In the case of MALA the mean of the density \(q(\tilde{y}|\tilde{x})\) in (18) differs from the mean of \({\widetilde{q}}(\tilde{y}|\tilde{x})\) due to the presence of the term \(\frac{c^2}{2} L^\top \nabla \log \pi (x)\), and the formulas in Proposition 1 are modified accordingly.

Finally, we note that apart from the tractability of the calculations offered by the particular choice of h(x, y), there is also the following intuition for its effectiveness. If the Gaussian approximation is exact, then the overall control variate, defined in equation (7) as the sum of a stochastic and a static control variate, becomes the exact “Poisson control variate” that we would compute if the initial target was actually Gaussian. Thus, we expect that the function h(x, y), as a static control variate for a non-Gaussian target, enables effective variance reduction as long as the target is well-approximated by a Gaussian distribution.
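To make the construction concrete, a minimal sketch of the static control variate in (16) is given below, written in terms of hypothetical log-density routines for the Gaussian surrogate and its proposal.

```r
# Static control variate h(x, y) of (16): the acceptance probability of a
# Metropolis-Hastings move targeting the Gaussian surrogate, times G(y) - G(x).
# log_pi_t(x) returns log pi-tilde(x); log_q_t(a, b) returns log q-tilde(a | b),
# with q-tilde as in (14); Gx and Gy are pre-computed values of G.
h_static <- function(x, y, Gx, Gy, log_pi_t, log_q_t) {
  log_r <- log_pi_t(y) + log_q_t(x, y) - log_pi_t(x) - log_q_t(y, x)
  min(1, exp(log_r)) * (Gy - Gx)
}
```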

2.4 The modified estimator with regression coefficients

As pointed out by Dellaportas and Kontoyiannis (2012) the fact that the proposed estimator \(\mu _{n,G}(F)\) is based on an approximation G of the true solution \({\hat{F}}_{\pi }\) of the Poisson equation implies that we need to modify \(\mu _{n,G}(F)\) as

$$\begin{aligned} \mu _{n,G}(F_{{\hat{\theta }}_n}) :=\frac{1}{n}\sum _{i=0}^{n-1}\bigg \{ F(x_i) + {\hat{\theta }}_n\big \{&\underbrace{ \alpha (x_i,y_i)(G(y_i)-G(x_i)) }_{\text {Stochastic Poisson control variate}} \\ &+ \underbrace{h(x_i,y_i) - \textrm{E}_{q(y|x_i)} [h(x_i,y)]}_{\text {Static control variate}}\big \}\bigg \} \end{aligned}$$
(20)

where \({\hat{\theta }}_n\) estimates the optimal coefficient \(\theta \) that further minimizes the variance of the overall estimator. Dellaportas and Kontoyiannis (2012) show that for reversible MCMC samplers, the optimal estimator \({\hat{\theta }}_n\) of the true coefficient \(\theta \) can be constructed solely from the MCMC output. By re-writing the estimator in (20) as

$$\begin{aligned} \mu _{n,G}(F_{{\hat{\theta }}_n})&:=\frac{1}{n}\sum _{i=0}^{n-1}\{F(x_i) - {\hat{\theta }}_n \{G(x_i) - \widehat{PG}(x_i) \} \}, \end{aligned}$$

where the term

$$\begin{aligned} \widehat{PG}(x_i)&= G(x_i)+a(x_i,y_i)(G(y_i)-G(x_i))+ h(x_i,y_i)\nonumber \\&\quad - E_{q(y|x_i)}[h(x_i,y)], \end{aligned}$$
(21)

approximates \(PG(x_i)\), we can estimate \({\hat{\theta }}_n\) as

$$\begin{aligned} {\hat{\theta }}_n = \frac{\mu _n(F(G+\widehat{PG}))- \mu _n(F)\mu _n(G+\widehat{PG})}{\tfrac{1}{n}\sum _{i=1}^{n-1}\big (G(x_i)-\widehat{PG}(x_{i-1}) \big )^2 }. \end{aligned}$$
(22)

Notice that, as in the case of control variates for Monte Carlo integration (Glasserman 2004), the denominator in (22) is the standard empirical estimator of the variance of the control variate \(G(x_i)-\widehat{PG}(x_{i})\). However, in contrast to the standard Monte Carlo case, the numerator is not the usual estimator of the covariance between the function F and the control variate, since this covariance is intractable in the case of a Markov chain. Therefore, the numerator in (22) has been constructed by Dellaportas and Kontoyiannis (2012) from an alternative, tractable, form of the stationary covariance between the function F and the control variate \(G(x_i)-\widehat{PG}(x_{i})\).

The resulting estimator \(\mu _{n,G}(F_{{\hat{\theta }}_n})\) in (20) is evaluated by using solely the output of the MCMC algorithm and under some regularity conditions converges to \(\pi (F)\) a.s. as \(n \rightarrow \infty \), see Tsourti (2012).
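A sketch of the computation of \({\hat{\theta }}_n\) from stored per-sample quantities is given below; the three inputs are vectors of \(F(x_i)\), \(G(x_i)\) and \(\widehat{PG}(x_i)\) over the MCMC output.

```r
# Regression coefficient of (22). Fs, Gs and PGhat are length-n vectors with
# F(x_i), G(x_i) and the approximation PG-hat(x_i) of (21), i = 0, ..., n-1.
theta_hat <- function(Fs, Gs, PGhat) {
  n   <- length(Fs)
  S   <- Gs + PGhat
  num <- mean(Fs * S) - mean(Fs) * mean(S)
  den <- sum((Gs[-1] - PGhat[-n])^2) / n   # lagged pairs (G(x_i), PG-hat(x_{i-1}))
  num / den
}
```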

Table 1 Optimal values for the parameters of the function \(G_0\) in (12)
Table 2 Estimated factors by which the variance of \(\mu _n(F)\) is larger than the variance of \(\mu _{n,G}(F)\) for standard Gaussian d-variate target

2.5 Algorithmic summary

In summary, the proposed variance reduction approach can be applied a posteriori to the MCMC output samples \(\{x_i\}_{i=0}^{n-1}\) obtained from either RWM or MALA with proposal density given by (14). The extra computations needed involve the evaluation of \(\widehat{PG}(x_i)\) given by (21). This is efficient since it relies on quantities that are readily available, such as the values \(G(x_i)\) and \(G(y_i)\), where \(y_i\) is the value generated from the proposal \(q(y|x_i)\) during the main MCMC algorithm, as well as the acceptance probability \(a(x_i,y_i)\), which has also been computed and stored at each MCMC iteration. The evaluation of \(\widehat{PG}(x_i)\) also requires the construction of the static control variate \(h(x_i,y_i)\) defined by (16). This depends on the ratio \({\widetilde{r}}(x,y)\) given by (17) and on the expectation \(E_{q(y|x_i)}[h(x_i,y)]\). The calculation of the latter expectation is tractable since \({\widetilde{r}}(x,y)\) is the acceptance ratio of a Metropolis-Hastings algorithm that targets the Gaussian target \({\widetilde{\pi }}(x) = {\mathscr {N}}(x|\mu ,\Sigma )\), where \(\mu \) and \(\Sigma \) are estimators of the mean and covariance matrix respectively of the target \(\pi (x)\); see Assumption 1. It is important to note that the calculation of the covariance matrix \(\Sigma \), as well as of its Cholesky factor L, does not increase the computational cost of the proposed variance reduction technique since they are calculated during the main MCMC algorithm. Finally, we compute \({\hat{\theta }}_n\) using (22) and evaluate the proposed estimator \(\mu _{n,G}(F_{{\hat{\theta }}_n})\) from (20). Algorithm 1 summarizes the steps of the variance reduction procedure.
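As a compact illustration of the whole post-processing step, the sketch below assembles the estimator (20) from quantities stored during the MCMC run; all inputs are per-iteration records of the sampler together with the helper functions sketched earlier in this section, so it should be read as a schematic of Algorithm 1 rather than the exact implementation.

```r
# Post-processing variance reduction (schematic of Algorithm 1).
#   X, Y   : n x d matrices of chain states x_i and proposed values y_i
#   acc    : length-n vector of acceptance probabilities alpha(x_i, y_i)
#   Gfun   : function x -> G(x), e.g. G_general above
#   hfun   : function (x, y) -> h(x, y), e.g. a wrapper around h_static above
#   Ehfun  : function x -> E_{q(.|x)}[h(x, y)] (analytic, via Proposition 1)
#   Ffun   : function x -> F(x)
vr_estimate <- function(X, Y, acc, Gfun, hfun, Ehfun, Ffun = function(x) x[1]) {
  n     <- nrow(X)
  Fs    <- apply(X, 1, Ffun)
  Gx    <- apply(X, 1, Gfun)
  Gy    <- apply(Y, 1, Gfun)
  hs    <- vapply(seq_len(n), function(i) hfun(X[i, ], Y[i, ]), numeric(1))
  Eh    <- apply(X, 1, Ehfun)
  PGhat <- Gx + acc * (Gy - Gx) + hs - Eh   # equation (21)
  theta <- theta_hat(Fs, Gx, PGhat)         # equation (22)
  mean(Fs - theta * (Gx - PGhat))           # estimator (20)
}
```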


3 Application to real and simulated data

We present results from the application of the proposed methodology to real and simulated data examples. First we consider multivariate Gaussian targets, for which we have shown that the function G in (12) allows the explicit calculation of the expectation PG defined by (3). Section 3.1 presents variance reduction factors in the case of d-variate standard Gaussian densities, simulated by employing the RWM and MALA, up to \(d=100\) dimensions. In Sects. 3.2, 3.3 and 3.4 we examine the efficiency of our proposed methodology for targets that depart from the Gaussian distribution and for which the expectation PG is not analytically available. Assumption 1 and Algorithm 1 indicate that our proposed methodology depends on estimators \(\mu \) and \(\Sigma \) of the mean and covariance matrix of the target distribution respectively. Since the technique that we developed is a post-processing procedure which takes as input samples drawn by either the MALA or the RWM algorithm, we utilise these samples to estimate \(\mu \), which is also the quantity whose estimator's variance we aim to reduce, without spending more computational resources than those used to run the MCMC algorithm. The choice of \(\Sigma \) is implied by the construction of the proposed methodology and it is required to be the covariance matrix of the proposal distribution of the MCMC algorithm.

To conduct all the experiments we set the parameters \(b_0,b_1,b_2,c_0,c_1\) and \(c_2\) of the function \(G_0\) in (12) to the values given in Table 1, which were estimated by minimizing the loss function in (13) for \(d=2\). In practice we observe that these values lead to good performance across all real data experiments, including those with \(d>2\).

To estimate the variance of \(\mu _{n}(F)\) in each experiment we obtained \(T=100\) different estimates \(\mu _{n}^{(i)}(F)\), \(i=1,\ldots ,T\), for \(\mu _{n}(F)\) based on T independent MCMC runs. Then, the variance of \(\mu _{n}(F)\) has been estimated by

$$\begin{aligned} \frac{1}{T-1}\sum _{i=1}^{T}\{\mu _{n}^{(i)}(F)-{\bar{\mu }}_{n}(F)\}^2, \end{aligned}$$

where \({\bar{\mu }}_{n}(F)\) is the average of \(\mu _{n}^{(i)}(F)\). We estimated similarly the variance of the proposed estimator \(\mu _{n,G}(F)\).

3.1 Simulated data: Gaussian targets

The target distribution is a d-variate standard Gaussian distribution and we are interested in estimating the expected value of the first coordinate of the target by setting \(F(x) = x^{(1)}\). Samples of size n were drawn from the target densities by utilising the proposal distribution in (10) with \(c^2 = 2.38^2/d\) in the RWM case and by tuning \(c^2\) during the burn-in period to achieve an acceptance rate between \(55\%\) and \(60\%\) in the MALA case; we initiated all the MCMC algorithms by drawing an initial parameter vector from the stationary distribution. Table 2 presents factors by which the variance of \(\mu _n(F)\) is greater than the variance of \(\mu _{n,G}(F)\) in the case of the RWM and MALA. Variance reduction is considerable even for \(d=100\). Figure 2 shows typical realizations of the sequences of estimates obtained by the standard estimators \(\mu _n(F)\) and the proposed \(\mu _{n,G}(F_{\theta })\) for different dimensions of the standard Gaussian target, and Fig. 3 provides a visualization of the distribution of the estimators \(\mu _n(F)\) and \(\mu _{n,G}(F_{\theta })\). Note that in these experiments the covariance matrix of the target is assumed known; in the “Appendix” we repeat these experiments by relaxing this assumption.

Fig. 2

Sequence of the standard ergodic averages (black solid lines) and the proposed estimates (blue dashed lines). The red lines indicate the mean of the d-variate standard Gaussian target. The values are based on samples drawn by employing either the RWM (top row) or the MALA (bottom row) with 10,000 iterations discarded as burn-in period. (Color figure online)

Fig. 3

Each pair of boxplots consists of 100 values of the estimators \(\mu _n(F)\) (left boxplot) and \(\mu _{n,G}(F_{\theta })\) (right boxplot) for the d-variate standard Gaussian target. The estimators have been calculated by using \(n \times 10^3\) samples drawn by employing either the RWM (top row) or the MALA (bottom row), after discarding the first 10,000 samples as the burn-in period

3.2 Simulated data: mixtures of Gaussian distributions

It is important to investigate how our proposed methodology performs when the target density departs from normality. We used as \(\pi (x)\) a mixture of d-variate Gaussian distributions with density

$$\begin{aligned} \pi (x) = \frac{1}{2}{\mathscr {N}}(x|m,\Sigma ) + \frac{1}{2}{\mathscr {N}}(x|-m,\Sigma ), \end{aligned}$$
(23)

where, following Mijatović and Vogrinc (2019), we set m to be the d-dimensional vector \((h/2,0,\ldots ,0)\) and \(\Sigma \) is a \(d \times d\) covariance matrix randomly drawn from an inverse Wishart distribution, with its largest eigenvalue required to equal 25. More precisely, we simulated one matrix \(\Sigma \) for each different value of d and we used the same matrix across the different choices for h.
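For reference, the log-density of the mixture target (23) can be written as follows (assuming the mvtnorm package for the multivariate Gaussian density).

```r
# Log-density of the mixture target (23); m = c(h/2, 0, ..., 0) and Sigma is
# the fixed covariance matrix drawn once for each dimension d.
library(mvtnorm)
log_pi_mix <- function(x, m, Sigma) {
  log(0.5 * dmvnorm(x, mean =  m, sigma = Sigma) +
      0.5 * dmvnorm(x, mean = -m, sigma = Sigma))
}
```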

We drew samples from the target distribution by using the Metropolis-Hastings algorithm with proposal distribution \(q(y|x) = {\mathscr {N}}(y|x,c^2\Sigma )\) where by setting \(c^2 = 2.38^2/d\) we achieve an acceptance ratio between \(23\%\) and \(33\%\). We also note that the covariance matrix \(\Sigma \) of the target was fixed across the T independent MCMC runs used to estimate the variance of \(\mu _n(F)\) and \(\mu _{n,G}(F_{\theta })\). When \(h>6\) the MCMC algorithm struggles to converge. Table 3 presents the factors by which the variance of \(\mu _n(F)\) is greater than the variance of the modified estimator \(\mu _{n,G}(F)\) for dimensions \(d=10\) and \(d=50\) and for different values of h. It is very reassuring that even in the very non-Gaussian scenario \((h=6)\) our modified estimator achieved a slight variance reduction.

Table 3 Estimated factors by which the variance of \(\mu _n(F)\) is larger than the variance of \(\mu _{n,G}(F)\) for a mixture of d-variate Gaussian distributions with density given by (23) for different values of the mean m
Table 4 Summary of datasets for logistic regression
Table 5 Range of estimated factors by which the variance of \(\mu _n(F)\) is larger than the variance of \(\mu _{n,G}(F_{\theta })\) for the posterior distribution of logistic regression models applied on the datasets indicated by the first column

3.3 Real data: Bayesian logistic regressions

We tested the variance reduction of our modified estimators on five datasets that have been commonly used in MCMC applications; see e.g. Girolami and Calderhead (2011), Titsias and Dellaportas (2019). Each dataset consists of one N-dimensional binary response variable and an \(N \times d\) matrix of covariates that includes a column of ones; see Table 4 for the names of the datasets and details on the specific sample sizes and dimensions. We consider a Bayesian logistic regression model by setting an improper prior for the regression coefficients \(\gamma \in \mathbb {R}^d\) of the form \(p(\gamma ) \propto 1\).

3.3.1 Variance reduction for RWM

We draw samples from the posterior distribution of \(\gamma \) by employing the Metropolis-Hastings algorithm with proposal distribution

$$\begin{aligned} q(\gamma '|\gamma ) = {\mathscr {N}}(\gamma '|\gamma ,c^2{\hat{\Sigma }}), \end{aligned}$$

where \(c^2 = 2.38^2/d\) and \({\hat{\Sigma }}\) is the maximum likelihood estimator of the covariance of \(\gamma \). Table 5 presents the range of factors by which the variance of \(\mu _n(F)\) is greater than the variance of \(\mu _{n,G}(F)\) for all parameters \(\gamma \). It is clear that our modified estimators achieve impressive variance reductions when compared with the standard RWM ergodic estimators.

3.3.2 Variance reduction for MALA

We draw samples from the posterior distribution of \(\gamma \) by employing the Metropolis-Hastings algorithm with proposal distribution

$$\begin{aligned} q(\gamma '|\gamma ) = {\mathscr {N}}(\gamma '|\gamma + \tfrac{1}{2}c^2{\hat{\Sigma }}\nabla \log \pi (\gamma ),c^2{\hat{\Sigma }} ), \end{aligned}$$

where \(c^2\) is tuned during the burn-in period in order to achieve an acceptance ratio between \(55\%\) and \(60\%\), \({\hat{\Sigma }}\) is the maximum likelihood estimator of the covariance of \(\gamma \) and \(\pi (\gamma )\) denotes the density of the posterior distribution of \(\gamma \). Table 6 presents the range of factors by which the variance of \(\mu _n(F)\) is greater than the variance of \(\mu _{n,G}(F)\) for all parameters \(\gamma \). Again, there is considerable variance reduction for all modified estimators.

Table 6 Estimated factors by which the variance of \(\mu _n(F)\) is larger than the variance of \(\mu _{n,G}(F_{\theta })\) for the posterior distribution of logistic regression models applied on the datasets indicated by the first column
Table 7 Estimated factors by which the variance of \(\mu _n(F)\) is larger than the variance of \(\mu _{n,G}(F)\) for the parameters of d-dimensional stochastic volatility model

3.4 Simulated data: a stochastic volatility model

We use simulated data from a standard stochastic volatility model often employed in econometric applications to model the evolution of asset prices over time (Kim et al. 1998; Kastner and Frühwirth-Schnatter 2014). Denoting by \(r_t\), \(t=1,\ldots ,N\), the tth observation (usually the log-return of an asset), the model assumes that \(r_t =\exp \{h_t/2\}\epsilon _t\), where \(\epsilon _t \sim N(0,1)\) and \(h_t\) is an autoregressive AR(1) log-volatility process: \(h_t = m +\phi (h_{t-1}-m) + s\eta _t\), \(\eta _t \sim N(0,1)\) and \(h_0 \sim N(m,s^2/(1-\phi ^2))\). To conduct Bayesian inference for the parameters \(m \in \mathbb {R}\), \(\phi \in (-1,1)\) and \(s^2 \in (0,\infty )\) we specify commonly used prior distributions (Kastner and Frühwirth-Schnatter 2014; Alexopoulos et al. 2021): \(m \sim N(0,10)\), \((\phi +1)/2 \sim Beta(20,1/5)\) and \(s^2 \sim Gam(1/2,1/2)\). The posterior of interest is

$$\begin{aligned} \pi (m,\phi ,s^2,h)&= p(m,\phi ,s^2,h|r) \\&\propto p(m)p(s^2)p(\phi ) {\mathscr {N}}(h_0|m,s^2/(1-\phi ^2)) \\&\quad \times \prod _{t=1}^N {\mathscr {N}}(r_t|0,e^{h_t}){\mathscr {N}}(h_t|m+\phi (h_{t-1}-m),s^2), \end{aligned}$$
(24)

where \(h=(h_0,\ldots ,h_N)\) and \(r=(r_1,\ldots ,r_N)\).
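As an illustration of the data-generating process just described, a minimal simulation sketch (with the parameter values used below) is:

```r
# Simulate N observations from the stochastic volatility model: an AR(1)
# log-volatility h_t and returns r_t = exp(h_t / 2) * eps_t.
sim_sv <- function(N, m = -0.85, phi = 0.98, s = 0.15) {
  h      <- numeric(N)
  h_prev <- rnorm(1, m, s / sqrt(1 - phi^2))   # h_0 from the stationary law
  for (t in seq_len(N)) {
    h[t]   <- m + phi * (h_prev - m) + s * rnorm(1)
    h_prev <- h[t]
  }
  r <- exp(h / 2) * rnorm(N)                   # observed log-returns
  list(r = r, h = h)
}
```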

To assess the proposed variance reduction methods we simulated daily log-returns of a stock for N days by using parameter values that have been previously estimated in real data applications (Kim et al. 1998; Alexopoulos et al. 2021): \(\phi =0.98\), \(m=-0.85\) and \(s= 0.15\). To draw samples from the d-dimensional, \(d=N+3\), target posterior in (24) we first transform the parameters \(\phi \) and \(s^2\) to real-valued parameters \({\tilde{\phi }}\) and \(\tilde{s}^2\) by taking the logit and logarithm transformations, and we assign Gaussian prior distributions by matching the first two moments of the Gaussian distributions with the corresponding moments of the beta and gamma distributions used as priors in the original formulation. Then, we set \(x=(m,{\tilde{\phi }},\tilde{s}^2,h)\) and we draw the desired samples using a Metropolis-Hastings algorithm with proposal distribution

$$\begin{aligned} q(y|x) = {\mathscr {N}}(y|x + \tfrac{c^2}{2}{\hat{\Sigma }}\nabla \log \pi (x),c^2{\hat{\Sigma }} ), \end{aligned}$$

where \(y= (m',{\tilde{\phi }}',\tilde{s}^{2'},h')\) are the proposed values, \(c^2\) is tuned during the burn-in period in order to achieve an acceptance ratio between \(55\%\) and \(60\%\) and \({\hat{\Sigma }}\) is the maximum a posteriori estimate of the covariance matrix of \((m,\phi ,s^2,h)\). Table 7 presents the factors by which the variance of \(\mu _n(F)\) is greater than the variance of the proposed estimator \(\mu _{n,G}(F_{\theta })\). We report variance reduction for all static parameters of the volatility process and the range of reductions achieved for the N-dimensional latent path h. All estimators have achieved considerable variance reduction.

3.5 Comparison with alternative methods

We compare the proposed variance reduction methodology with the zero variance (ZV) estimators considered, among others, by Mira et al. (2013) and South et al. (2018). We consider the first order ZV control variates as a competing variance reduction method since their computational cost is lower than that of all other ZV estimators and is thus comparable with the negligible computational cost of our methodology. The comparison that we perform is twofold. First, we compare the computational complexity of our proposed techniques with that of the first order ZV estimators, and then we present the mean squared error (MSE) in the estimation of the mean \(\pi (F)\) obtained by the two approaches.

Fig. 4

MSEs of the estimator \(\mu _{n,G}(F)\) over MSEs of the first order ZV estimator for a mixture of d-variate Gaussian distributions with density given by Eq. (23) for different values of the mean m indicated by the choice of the parameter h in the x-axis. The red dotted line indicates the value 1 in the y-axis. (Color figure online)

Table 8 Range of MSEs of the estimator \(\mu _{n,G}(F)\) over MSEs of the first order ZV estimator in logistic regression models applied to the datasets indicated by the first column
Table 9 Range of MSEs of the estimator \(\mu _{n,G}(F)\) over MSEs of the first order ZV estimator in logistic regression models, where the MSEs are based on n samples collected after the first 10,000 iterations of MALA

We compare the computational efficiency of the two methods in terms of their computational complexity, measured by the number of evaluations of the target (and its derivatives). First note that our proposed technique relies on the following three ingredients: (i) a Monte Carlo integration with sample size one, (ii) the computation of the cdf of the non-central chi-squared distribution and (iii) the calculation of the coefficient \({\hat{\theta }}_n\) in Eq. (22). None of these steps requires any extra target and/or gradient evaluations beyond the pre-computed evaluations required by the RWM algorithm. The computation of the transformation \(L^{-1}(x-\mu )\) can also be performed in an efficient post-processing manner without extra target evaluations; see “Appendix F” for details.

On the other hand, although the first order ZV control variates do not depend on extra evaluations of the target, they require the extra evaluation (in the case of the RWM algorithm) of its gradients. Furthermore, the first order ZV methods are based on polynomials in which the number of terms increases with the dimension of the target, and thus the inversion of a \(d \times d\) matrix is required for each sample drawn from the target distribution.

We additionally compare the two methods in terms of mean squared error (MSE) when estimating \(\pi (F)\). In particular, by using the MCMC output we calculate, for each of the examples in Sects. 3.2–3.4, the MSEs of the proposed estimator \(\mu _{n,G}(F)\) and of the first order ZV estimator; we employed the R-package developed by South (2021) to conduct the calculation of the first order ZV estimators.

Figure 4 displays the ratio of the MSE of the proposed estimator \(\mu _{n,G}(F)\) over the MSE of the first order ZV estimator for the mixture of d-variate Gaussian distributions with density given by Eq. (23). It indicates that the proposed variance reduction methodology is more robust with respect to non-Gaussian targets as well as to the dimension of the target. In particular, the ratio of the MSEs gets closer to one as the dimension d and/or the parameter h of the target increase. Notice that, given the difference in computational complexity, a ratio of MSEs less than or equal to one implies better overall efficiency for the proposed estimator \(\mu _{n,G}(F)\). We also note that the closer the target is to the Gaussian distribution (small h), the lower the MSE of the ZV estimator compared to our proposed estimator. This is a consequence of the fact that the ZV estimators are exact in the case of Gaussian targets. Tables 8, 9 and 10 present ratios of the MSEs of the estimators that we compare in the case of the logistic regression and stochastic volatility models described in Sects. 3.3 and 3.4 respectively.

The combination of the ratios displayed in these Tables with the computational complexity analysis of the competing variance reduction methods provides evidence of the overall advantage of our variance reduction techniques over the first order ZV estimators for the estimation of the mean \(\pi (F)\).

Table 10 Range of MSEs of the estimator \(\mu _{n,G}(F)\) over MSEs of the first order ZV estimator for the parameters of d-dimensional stochastic volatility model

4 Discussion

Typical variance reduction strategies for MCMC algorithms study ways to produce new estimators which have smaller variance than the standard ergodic averages by performing a post-processing manipulation of the drawn samples. Here we studied a methodology that constructs such estimators, but our development was based on the essential requirement of a negligible post-processing cost. In turn, this feature allows effortless variance reduction for MCMC estimators that are used in a wide spectrum of Bayesian inference applications. We investigated both the applicability of our strategy in high dimensions and its robustness to departures from normality in the target densities by using simulated and real data examples.

There are many directions for future work. We limited ourselves to the simplest cases of linear functions such as \(F(x) = x^{(j)}\), but higher moments and indicator functions seem interesting avenues to be investigated next. The extension of the proposed method to other functions F requires the construction of an approximation of the solution \({\hat{F}}_{{\widetilde{\pi }}_0}\) of the Poisson equation associated with a standard Gaussian target, i.e., a function which will play the role of the function G defined by Eq. (15). Importantly, this function should be chosen such that the integral in Eq. (9) can be calculated analytically. We think that the form of G used in the present paper can serve as a starting point for this direction of research.

The developed variance reduction technique can be applied to any output from the MALA and the RWM algorithm. Other Metropolis samplers such as the independent Metropolis or the Metropolis-within-Gibbs are also obvious candidates for future work. Finally, an issue that was discussed in some detail in Dellaportas and Kontoyiannis (2009) but has not yet been studied with the care it deserves is the important problem of reducing the estimation bias of MCMC samplers, which depends on the initial point of the chain \(X_0 = x\) and vanishes asymptotically. As also noted by Dellaportas and Kontoyiannis (2009), control variates probably have an important role to play in this setting.

Supplementary information The R code for reproducing the experiments is available at https://gitlab.com/aggelisalexopoulos/variance-reduction.