1 Introduction

Variations of Stochastic Gradient Algorithms (SGAs) are central in many modern machine learning applications such as large scale Bayesian inference (Welling and Teh 2011), variational inference (Hoffman et al. 2013), generative adversarial networks (Goodfellow et al. 2014), variational autoencoders (Kingma and Welling 2013) and deep reinforcement learning (Mnih et al. 2015). The statistical sampling perspective provides a unified framework for studying the non-asymptotic behaviour of these algorithms, which is the main topic of this work. More precisely, consider a data set \(D=(\xi _i)_{i=1}^{m} \subset \mathbb {R}^n\), with \(m \in {\mathbb {N}} \cup \{ \infty \}\), and the corresponding empirical measure \(\nu ^m:=\frac{1}{m}\sum _{i=1}^{m}\delta _{\xi _i}\), where \(\delta _{\xi _i}\) denotes the Dirac measure at \(\xi _i\). Denote by \(\mathcal {P}({\mathbb {R}}^n)\) the space of all probability measures on \(\mathbb {R}^n\) and consider a potential \(V:{\mathbb {R}}^d \times {\mathcal {P}}({\mathbb {R}}^n) \rightarrow {\mathbb {R}}\). We are then interested in the problem of sampling from the (data-dependent) probability distribution \(\pi \) on \(\mathbb {R}^d\), given by

$$\begin{aligned} \textstyle \pi (x) \propto \exp \left( -\frac{2}{\beta ^2}V(x,\nu ^m)\right) dx \end{aligned}$$
(1.1)

for some fixed parameter \(\beta > 0\). Under some mild assumptions on V, the measure \(\pi \) is a stationary distribution of the (overdamped) Langevin stochastic differential equation (SDE), and classical Langevin Monte Carlo algorithms (Dalalyan 2017) utilise discrete-time counterparts of such SDEs to provide tools for approximate sampling from \(\pi \); these, however, require access to exact evaluations of \(\nabla V(\cdot , \nu ^m)\). SGAs, on the other hand, take as input a noisy evaluation \(\nabla V(\cdot ,\nu ^s)\) for some \(s \in \{1, \ldots , m \}\). The simplest example of \(\nabla V(\cdot ,\nu ^s)\) is based on subsampling with replacement. Namely, consider a sequence of i.i.d. uniformly distributed random variables \(\tau ^k_i\sim {\text {Unif}}(\{ 1,\cdots ,m \})\) for \(k \ge 0\) and \(1\le i \le s\), and define a sequence of random data batches \(D_s^k := (\xi _{\tau ^k_i})_{i=1}^{s}\) and corresponding random measures \(\nu _s^k:=\frac{1}{s}\sum _{i=1}^{s}\delta _{\xi _{\tau ^k_i}}\) for \(k \ge 0\). Fix a learning rate (time-step) \(h>0\). The corresponding algorithm to sample from (1.1) is given by

$$\begin{aligned} \textstyle X_{k+1} = X_k - h \nabla _x V(X_k,\nu _s^k) + \beta \sqrt{h} Z_{k+1}, \end{aligned}$$
(1.2)

where \((Z_i)_{i=1}^{\infty }\) are i.i.d. random variables with the standard normal distribution. This method in its simplest form, without mini-batching (i.e., when we use the exact evaluation \(\nabla V(\cdot ,\nu ^m)\)), is known in computational statistics as the Unadjusted Langevin Algorithm (ULA) (Durmus and Moulines 2017), but it has numerous more sophisticated variants and alternatives (Brosse et al. 2018; Chatterji et al. 2018; Cornish et al. 2019; Ma et al. 2019; Majka et al. 2020).
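To fix ideas, the following minimal Python sketch implements one step of (1.2) with subsampling with replacement; the names grad_V, data and rng are placeholders (not part of any existing implementation), and setting beta = 0 recovers plain stochastic gradient descent.

```python
# Minimal sketch of one step of the stochastic gradient scheme (1.2), assuming a routine
# grad_V(x, batch) evaluating the gradient of x -> V(x, nu) for an arbitrary data batch.
import numpy as np

def sga_step(x, grad_V, data, s, h, beta, rng):
    """One update X_{k+1} = X_k - h * grad_V(X_k, nu_s^k) + beta * sqrt(h) * Z_{k+1}."""
    m = len(data)
    idx = rng.integers(0, m, size=s)          # subsampling with replacement
    batch = data[idx]                          # random batch D_s^k
    z = rng.standard_normal(x.shape)           # Z_{k+1} ~ N(0, I)
    return x - h * grad_V(x, batch) + beta * np.sqrt(h) * z
```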

Numerical methods based on Euler schemes with inaccurate (randomised) drifts such as (1.2) have recently become an object of considerable interest in both the computational statistics and the machine learning communities (Aicher et al. 2019; Chau and Rasonyi 2019; Deniz Akyildiz and Sabanis 2020; Gao et al. 2018; Ma et al. 2015; Nemeth and Fearnhead 2019; Raginsky et al. 2017; Zou et al. 2019). In particular, the Stochastic Gradient Langevin Dynamics (SGLD) method for approximate sampling from invariant measures of Langevin SDEs has been studied e.g. in Welling and Teh (2011), Teh et al. (2016), Vollmer et al. (2016), Dubey et al. (2016), Brosse et al. (2018), Barkhagen et al. (2020), Chau et al. (2019), Zhang et al. (2019). Furthermore, recall that under some mild assumptions on V, when \(\beta \rightarrow 0\), the measure \(\pi \) concentrates on the set of minima of V, i.e., on \(\{x\in {\mathbb {R}}^d: x = \arg \inf V(\cdot ,\nu ^m)\} \), cf. (Hwang 1980). Remarkably, no convexity of V is required for this to be true, which makes SGAs good candidate tools for non-convex optimisation. We would like to stress that throughout the paper, in our analysis of algorithm (1.2), we allow for \(\beta = 0\) and hence we also cover Stochastic Gradient Descent (SGD) (Moulines and Bach 2011; Dalalyan and Karagulyan 2019).

Despite the great success of algorithm (1.2) and its various extensions, relatively little is known about how to optimally choose s and whether sub-sampling (or mini-batching) is a good idea at all (Nagapetyan et al. 2017). One of the reasons is that the performance analysis appears to be problem-specific, as we demonstrate with a simple example below. It is clear that subsampling increases the variance of the estimator and induces an additional non-asymptotic bias, see e.g. (Brosse et al. 2018; Nagapetyan et al. 2017). It is therefore not clear whether the reduced computational cost of running the algorithm compensates for these adverse effects. On the other hand, in a big data regime it may be computationally infeasible to use all data points at every step of the gradient algorithm, and hence subsampling becomes a necessity.

In the present paper we propose a solution to these challenges by constructing a novel Multi-index Antithetic Stochastic Gradient Algorithm (MASGA). In settings where the measure \(\pi \) in (1.1) is log-concave, we rigorously demonstrate that MASGA performs on par with Monte Carlo estimators having access to unbiased samples from the target measure, even though it is built from biased samples. Remarkably, our numerical results in Sect. 3 demonstrate that good performance can be achieved even in some simple non-log-concave settings. To our knowledge, all current state-of-the-art SGAs (Baker et al. 2019; Cornish et al. 2019) require the user to determine the mode of the target distribution a priori, and hence it is not clear how to implement them in non-log-concave settings. This problem is absent with MASGA, whose implementation is independent of the structure of the target measure. Moreover, the analysis in Cornish et al. (2019) is based on the Bernstein-von Mises phenomenon, which describes the asymptotic behaviour of the target measure as \(m \rightarrow \infty \), and hence their algorithm is aimed explicitly at the big data regime, see Sect. 3 therein. Meanwhile, as we will discuss below, MASGA works well irrespective of the size of the dataset.

Mean-square error analysis. In the present paper we study the problem of computing

$$\begin{aligned} \textstyle (f,\pi ) := \int _{{\mathbb {R}}^d} f(x)\pi (dx), \end{aligned}$$

for some \(f\in L^2(\pi )\). This framework covers the tasks of approximating minima of possibly non-convex functions or the computation of normalising constants in statistical sampling. To this end, the Markov chain specified by (1.2) is used to approximate \((f,\pi )\) with \({\mathbb {E}}[f(X_k)]\) for large \(k > 0\). More precisely, one simulates \(N > 1\) independent copies \((X^i_k)_{k=0}^{\infty }\), for \(i \in \{ 1, \ldots , N \}\), of (1.2), to compute the empirical measure \(\mu _{N,k} := \frac{1}{N}\sum _{i=1}^{N} \delta _{X^i_k}\). The usual metric for measuring the performance of such algorithms is the (root) mean square error (MSE). Namely, for any \(f \in L^2(\pi )\), \(k \ge 1\) and \(N \ge 1\), we define

$$\begin{aligned} {\text {MSE}}(\mathcal {A}^{f,k,N}) := \left( \mathbb {E} \left| (f, \pi ) - (f,\mu _{N,k})\right| ^2 \right) ^{1/2} , \end{aligned}$$

where \(\mathcal {A}^{f,k,N}\) is the algorithm specified by the estimator \((f,\mu _{N,k})\). Then, for a given \(\varepsilon > 0\), we look for the optimal number k of steps and the optimal number N of simulated paths, such that for any fixed integrable function f we have \({\text {MSE}}(\mathcal {A}^{f,k,N}) < \varepsilon \). Note that

$$\begin{aligned} {\text {MSE}}(\mathcal {A}^{f,k,N}) \le \left| (f, \pi ) {-} (f, {\text {Law}}(X_k)) \right| {+} \left( N^{-1}{\mathbb {V}}[f(X_k)]\right) ^{1/2}. \end{aligned}$$
(1.3)

If f is Lipschitz with a Lipschitz constant L, then the Kantorovich duality representation of the \(L^1\)-Wasserstein distance \(W_1\), see (Villani 2009, Remark 6.5), allows us to upper bound the first term of the right hand side by \(L W_1({\text {Law}}(X_k), \pi )\). Hence it is possible to control the bias by using the vast literature on such Wasserstein bounds (see e.g. Brosse et al. 2018; Cheng et al. 2018; Durmus and Eberle 2021; Majka et al. 2020 and the references therein). Controlling the variance, however, is a more challenging task. Before we proceed, let us consider an example.

Motivational example. In the context of Bayesian inference, one is interested in sampling from the posterior distribution \(\pi \) on \(\mathbb {R}^d\) given by

$$\begin{aligned} \textstyle \pi (dx) \propto \pi _0(dx) \prod _{i=1}^{m} \pi (\xi _i | x) , \end{aligned}$$

where the measure \(\pi _0(dx) = \pi _0(x)dx\) is called the prior and \((\xi _i)_{i=1}^{m}\) are i.i.d. data points with densities \(\pi (\xi _i | x)\) for \(x \in \mathbb {R}^d\). Note that in this example, for convenience, we assume that the data are conditionally independent given the parameters. See Sect. 4 for more general settings. Taking \(\beta = \sqrt{2}\), the potential V in (1.1) becomes

$$\begin{aligned} \textstyle V(x,\nu ^m) := - \log \pi _0(x) - m \int \log \pi (y | x) \nu ^m(dy) = - \log \pi _0(x) - \sum _{i=1}^{m} \log \pi (\xi _i | x) . \end{aligned}$$
(1.4)

In stochastic gradient algorithms, one replaces the exact gradient \(\nabla V(\cdot ,\nu ^m)\) in (1.2) with an approximate gradient, constructed by sampling only \(s \ll m\) terms from the sum in (1.4). This amounts to considering \(V(x,\nu ^s) = - \log \pi _0(x) - m \int \log \pi (y | x) \nu ^s(dy) = - \log \pi _0(x) - \frac{m}{s}\sum _{i=1}^{s} \log \pi (\xi _{\tau _i^k} | x)\), where \(\tau _i^k \sim {\text {Unif}}(\{ 1, \ldots , m \})\) for \(i = 1, \ldots , s\) are i.i.d. random variables uniformly distributed on \(\{ 1, \ldots , m \}\). Note that for any \(x \in \mathbb {R}^d\) the noisy gradient \(\nabla _x V(x,\nu ^s)\) is an unbiased estimator of \(\nabla _x V(x,\nu ^m)\). If we choose \(\beta = \sqrt{2}\) and the time-step h/(2m) in (1.2), we arrive at the Markov chain

$$\begin{aligned} \textstyle X_{k+1} = X_k + \frac{1}{2} h \left( \frac{1}{m}\nabla \log \pi _0(X_k) + \frac{1}{s} \sum _{i=1}^{s} \nabla \log \pi (\xi _{\tau _i^k} | X_k)\right) + \frac{1}{\sqrt{m}}\sqrt{h} Z_{k+1} . \end{aligned}$$
(1.5)

Note that the term in brackets in (1.5) is an unbiased estimator of \(- \frac{1}{m}\nabla V\).
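For illustration, a minimal sketch of this drift term is given below; grad_log_prior and grad_log_lik are hypothetical model-specific gradients (not taken from any existing code), and the comment at the end only indicates how unbiasedness can be checked by averaging over many batches.

```python
# A sketch of the bracketed drift term in (1.5) for a generic Bayesian model.
import numpy as np

def drift(x, data, s, grad_log_prior, grad_log_lik, rng):
    """(1/m) grad log pi_0(x) + (1/s) sum_i grad log pi(xi_{tau_i} | x)."""
    m = len(data)
    idx = rng.integers(0, m, size=s)          # tau_i^k ~ Unif({1, ..., m}), with replacement
    return grad_log_prior(x) / m + np.mean([grad_log_lik(xi, x) for xi in data[idx]], axis=0)

# The full-data drift (1/m) * (grad log pi_0(x) + sum_i grad log pi(xi_i | x)) equals
# -(1/m) * grad V(x, nu^m); averaging drift(x, ...) over many independent batches
# approximates this value, illustrating the unbiasedness of the subsampled gradient.
```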

The main difficulty in quantifying the cost of Monte Carlo algorithms based on (1.5) stems from the fact that the variance is problem-specific and depends substantially on the interplay between the parameters m, s, h and N. Hence, one may obtain different costs (and thus different answers to the question of profitability of using mini-batching) for different models and different data regimes (Nagapetyan et al. 2017).

In Figs. 1 and 2 we present a numerical experiment for a simple example of a Monte Carlo estimator \((f,\mu _{N,k})\) based on the chain (1.5) with the densities \(\pi (\xi _i | x)\) specified by a Bayesian logistic regression model, cf. Sect. 3 for details. We take \(m = 512\), \(h = 10^{-2}\) and we simulate up to \(t = 2\), hence we have \(k = 200\) iterations. On the left-hand side of both figures we can see the estimated MSE for different numbers of paths N and different numbers of samples \(s \le m\), for two different functions f. On the right hand side we can see how the variance changes with s. Evidently, subsampling/mini-batching works better for \(f(x) = \log |x|\) than for \(f(x) = \exp (x)\), since in the former case it allows us to obtain a small MSE by using just two samples, while in the latter the minimal reasonable number of samples seems to be 16. However, even in this simple example we see that the optimal choice of s and N is far from obvious and very much problem-specific. For an additional discussion on this subject, see Appendix 2.

Fig. 1 MSE and Variance \({\mathbb {V}}[f(X_k)]\) with \(f(x) = \log |x|\)

Fig. 2 MSE and Variance \({\mathbb {V}}[f(X_k)]\) with \(f(x) = \exp (x)\)

It has been observed in Nagapetyan et al. (2017) that in some specific regimes there is no benefit in employing the subsampling scheme. The authors of Nagapetyan et al. (2017) also observed that a subsampling scheme utilizing control variates can exhibit improved performance, but again only in some specific regimes (see also Baker et al. 2019; Brosse et al. 2018 for related ideas). Moreover, in order to implement their scheme one has to know the mode of the sampled distribution in advance, and hence it is not clear how to adapt it to non-convex settings.

The fact that the analysis of the variance of SGLD (and hence of the computational cost of the algorithm \(\mathcal {A}^{f,k,N}\)) becomes cumbersome even in the simplest possible examples, clearly demonstrates the need for developing a different algorithm, for which the benefits of subsampling could be proven in a rigorous way for a reasonably large class of models. To this end, we turn towards the Multi-level Monte Carlo (MLMC) technique.

2 Main result

In order to approximate the measure \(\pi \), we consider a family of (possibly random) measures \((\pi ^{{\varvec{\ell }}})_{{\varvec{\ell }}\in {\mathbb {N}}^r}\), where, for \(r \ge 1\), \(\mathbb {N}^r\) is the set of multi-indices \({\varvec{\ell }} = (\ell _1,\cdots ,\ell _r)\) in which each \(\ell _i \ge 0\) corresponds to a different type of approximation. In this work we focus on the case \(r=2\), with \(\ell _1\) dictating the number of subsamples at each step of the algorithm and \(\ell _2\) the time discretisation error. Note that for each \(i \in \{1, \ldots , r\}\), each possible value of \(\ell _i\) corresponds to a different level of approximation within the given type of approximation, i.e., in our main example \(\ell _1 = k\) will correspond to the k-th level of approximation with respect to the number of subsamples, where we use, say, \(s_k\) subsamples out of the total number of m samples. Note that for some types of approximation, \(\ell _i\) can have the highest possible value (corresponding to the best possible approximation), such as \(\ell _1 = m\) for the number of subsamples, making the total possible number of approximation levels finite. However, for other types of approximation (such as with respect to the time discretisation error), we can keep refining the approximation parameter indefinitely and hence have an infinite number of approximation levels. While for a fixed \({\varvec{\ell }}\) the samples are biased (if at least one type of approximation has an infinite number of possible approximation levels), we work with methods that are asymptotically unbiased in the sense that

$$\begin{aligned} \textstyle \lim _{{\varvec{\ell }} \rightarrow \infty } \mathbb {E} ( f , \pi ^{{\varvec{\ell }}}) = (f, \pi ), \end{aligned}$$
(2.1)

where \({\varvec{\ell }} \rightarrow \infty \) means that \(\min _{i\in {1,\ldots ,r}}\ell _i\rightarrow \infty \). Note that we need to take the expectation in (2.1) since the measures \(\pi ^{{\varvec{\ell }}}\) can be random. Here we use the convention that if for a given type of approximation \(\ell _i = (\ell _i^j)_{j=1}^{\infty }\), there is only a finite number, say K, of possible approximation levels, then for any \(j \ge K\) we have \(\ell _i^j = \infty \), since then the bias with respect to that particular type of approximation is eliminated. We also remark that as the coordinates of \({\varvec{\ell }}\) increase and the bias decreases, the corresponding computational cost of MLMC increases. One therefore faces an optimisation problem and tries to obtain the minimal computational cost for a prescribed accuracy (or, equivalently, to minimise the bias for a fixed computational budget).

It turns out, perhaps surprisingly, that a Multi-level Monte Carlo estimator that consists of a hierarchy of biased approximations can achieve the computational efficiency of a vanilla Monte Carlo estimator built from directly accessible unbiased samples (Giles 2008, 2015). In order to explain this approach, let us define the backward difference operators

$$\begin{aligned} \textstyle \Delta _{s}\pi ^{{\varvec{\ell }}} := \pi ^{{\varvec{\ell }}} - \pi ^{{\varvec{\ell }} - {\varvec{e}}_s}, \quad {\varvec{\Delta }} \pi ^{{\varvec{\ell }}}:= \left( \prod _{s=1}^{r}\Delta _{s} \right) \pi ^{{\varvec{\ell }}}, \end{aligned}$$

where \({\varvec{e}}_s\) is the unit vector in the direction \(s \in \{1, \ldots ,r\}\) and \(\prod _{s=1}^{r}\Delta _{s}\) denotes the composition of the operators \(\Delta _s\). The core idea of MLMC is to observe that, thanks to (2.1),

$$\begin{aligned} \textstyle (f,\pi ) = \mathbb {E} \sum _{{\varvec{\ell }} \in \mathbb {N}^r} (f, {\varvec{\Delta }} \pi ^{{\varvec{\ell }}}), \end{aligned}$$
(2.2)

where we set \(\pi ^{{\varvec{0}} - {\varvec{e}}_s} := 0\) for all unit vectors \({\varvec{e}}_s\), with \({\varvec{0}} = (0, \ldots , 0)\). The original MLMC (Giles 2015) has been developed for \(r=1\), and the extension to an arbitrary r (named Multi-index Monte Carlo, or MIMC) has been developed in Haji-Ali et al. (2016). In MIMC we approximate each term \(\pi ^{{\varvec{\ell }}}\) on the right-hand side of (2.2) using mutually independent, unbiased Monte Carlo estimators \(\pi ^{{\varvec{\ell }}, N_{{\varvec{\ell }}}}\) with \(N_{{\varvec{\ell }}} \ge 1\) samples each, and we choose a finite set of multi-indices \({\mathcal {L}} \subset \mathbb {N}^r\) to define

$$\begin{aligned} \textstyle \mathcal {A}^{{\mathcal {L}}}(f) := \sum _{{\varvec{\ell }} \in {\mathcal {L}}} (f, {\varvec{\Delta }} \pi ^{{\varvec{\ell }},N_{{\varvec{\ell }}}}). \end{aligned}$$
(2.3)

Clearly, \({\mathbb {E}}(f, {\varvec{\Delta }} \pi ^{{\varvec{\ell }},N_{{\varvec{\ell }}}}) = \mathbb {E}(f, {\varvec{\Delta }} \pi ^{{\varvec{\ell }}})\) and hence \(\mathcal {A}^{{\mathcal {L}}}(f)\) is an asymptotically unbiased estimator of \((f,\pi )\) when \(\mathcal {L}\) increases to \(\mathbb {N}^r\). Moreover, we note that due to the independence of \(\pi ^{{\varvec{\ell }}, N_{{\varvec{\ell }}}}\) across levels, the variance of the MIMC estimator satisfies \({\mathbb {V}} [\mathcal {A}^{{\mathcal {L}}}(f)]=\sum _{{\varvec{\ell }} \in {\mathcal {L}}} \mathbb {V}[(f, {\varvec{\Delta }} \pi ^{{\varvec{\ell }},N_{{\varvec{\ell }}}})]\).
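Schematically, and leaving the level-wise coupling abstract, the MIMC estimator (2.3) for \(r=2\) can be sketched as follows; coupled_values is an assumed user-supplied routine returning one coupled sample of the four functionals entering the mixed backward difference, and the variance reduction of MIMC hinges on how these four approximations are coupled.

```python
# Schematic form of the MIMC estimator (2.3) for r = 2. coupled_values(l1, l2, rng) is
# assumed to return one coupled sample of (f, pi^{(l1, l2)}), (f, pi^{(l1-1, l2)}),
# (f, pi^{(l1, l2-1)}), (f, pi^{(l1-1, l2-1)}), simulated from common random inputs.
import numpy as np

def mixed_difference(l1, l2, coupled_values, rng):
    """One sample of (f, Delta pi^l), with the convention pi^{l - e_s} = 0 when l_s = 0."""
    v_ff, v_cf, v_fc, v_cc = coupled_values(l1, l2, rng)
    d = v_ff
    if l1 > 0:
        d -= v_cf
    if l2 > 0:
        d -= v_fc
    if l1 > 0 and l2 > 0:
        d += v_cc
    return d

def mimc_estimator(index_set, N, coupled_values, rng):
    """A^L(f): sum over l in L of independent Monte Carlo averages of N_l mixed differences."""
    return sum(
        np.mean([mixed_difference(l1, l2, coupled_values, rng) for _ in range(N[(l1, l2)])])
        for (l1, l2) in index_set
    )
```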

In this work we develop an antithetic extension of the MIMC algorithm. To this end we define pairs \((\pi ^{+,{\varvec{\ell }}},\pi ^{-,{\varvec{\ell }}})\) of copies of \(\pi ^{{\varvec{\ell }}}\) in the sense that \({\mathbb {E}}[(f,\pi ^{{\varvec{\ell }}})]={\mathbb {E}}[(f,\pi ^{+, {\varvec{\ell }}})]={\mathbb {E}}[(f,\pi ^{-, {\varvec{\ell }}})]\) for all Lipschitz functions f, and

$$\begin{aligned} \textstyle \Delta ^{A}_{s}\pi ^{{\varvec{\ell }}} := \pi ^{{\varvec{\ell }}} - \frac{1}{2}(\pi ^{+,{\varvec{\ell }} - {\varvec{e}}_s}+\pi ^{-,{\varvec{\ell }} - {\varvec{e}}_s}), \quad {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}}:= \left( \prod _{s=1}^{r}\Delta ^{A}_{s} \right) \pi ^{{\varvec{\ell }}}. \end{aligned}$$
(2.4)

The corresponding Antithetic MIMC estimator is given by

$$\begin{aligned} \textstyle \mathcal {A}^{A,{\mathcal {L}}}(f) := \sum _{{\varvec{\ell }} \in {\mathcal {L}}} (f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }},N_{{\varvec{\ell }}}}). \end{aligned}$$
(2.5)
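For readability, the following sketch spells out (2.4) and (2.5) in the single-index case \(r=1\) (this is the AMLMC estimator used in Sect. 3); coupled_triple is an assumed user-supplied routine returning one coupled sample of \((f,\pi ^{\ell })\), \((f,\pi ^{+,\ell -1})\) and \((f,\pi ^{-,\ell -1})\). The concrete coupling, which is what drives the variance reduction, is constructed in Sect. 5 and is not shown here.

```python
# A sketch of the antithetic difference (2.4) and the estimator (2.5) for r = 1 (AMLMC).
import numpy as np

def antithetic_difference(l, coupled_triple, rng):
    """(f, Delta^A pi^l) = (f, pi^l) - 0.5 * ((f, pi^{+, l-1}) + (f, pi^{-, l-1}))."""
    fine, coarse_plus, coarse_minus = coupled_triple(l, rng)
    if l == 0:                          # convention pi^{-e_s} = 0 at the coarsest level
        return fine
    return fine - 0.5 * (coarse_plus + coarse_minus)

def amlmc_estimator(L, N, coupled_triple, rng):
    """A^{A, L}(f): sum over levels of independent Monte Carlo averages of Delta^A samples."""
    return sum(
        np.mean([antithetic_difference(l, coupled_triple, rng) for _ in range(N[l])])
        for l in range(L + 1)
    )
```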

As we will see in the sequel, the reduction of the variance of \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) in comparison to \(\mathcal {A}^{{\mathcal {L}}}(f)\) can be achieved as a consequence of an appropriately chosen coupling between \(\pi ^{{\varvec{\ell }}}\), \(\pi ^{+,{\varvec{\ell }} - {\varvec{e}}_s}\) and \(\pi ^{-,{\varvec{\ell }} - {\varvec{e}}_s}\) for each \({\varvec{\ell }} \in \mathcal {L}\) and \(s \in \{ 1, \ldots , r \}\). Using the notation introduced in (1.2), we will apply (2.5) to the specific case of \({\varvec{\ell }} = (\ell _1, \ell _2)\) and \(\pi ^{{\varvec{\ell }}}\) given as the law of \(X_k^{\ell _1, \ell _2}\) for some fixed \(k \ge 1\), where

$$\begin{aligned} X_{k+1}^{\ell _1, \ell _2} = X_k^{\ell _1, \ell _2} - h_{\ell _2} \nabla _x V(X_k^{\ell _1, \ell _2}, \nu ^{s_{\ell _1}}) + \beta \sqrt{h_{\ell _2}} Z_{k+1} . \end{aligned}$$
(2.6)

In this setting, we call the algorithm \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) specified by (2.5) the Multi-index Antithetic Stochastic Gradient Algorithm (MASGA). Note that for a fixed \(t > 0\) such that \(t = kh_{\ell _2}\) for some \(k \ge 1\), the chain (2.6) can be interpreted as a discrete-time approximation, both in time (with parameter \(h_{\ell _2}\)) and in data (with subsample size \(s_{\ell _1}\)), of the SDE

$$\begin{aligned} dY_t = - \nabla _x V(Y_t, \nu ^m) dt + \beta dW_t , \end{aligned}$$
(2.7)

where \((W_t)_{t \ge 0}\) is the standard Brownian motion in \(\mathbb {R}^d\). Then, MASGA with \(\pi ^{{\varvec{\ell }}}\) given as the law of \(X_k^{\ell _1, \ell _2}\), for any finite subset \(\mathcal {L} \subset \mathbb {N}^2\) provides a biased estimator of \((f,\pi _t)\) (where \(\pi _t := {\text {Law}}(Y_t)\) with \(Y_t\) given by (2.7)), where the bias stems from the use of a finite number of levels. However, since \(\pi \) given by (1.1) is the limiting stationary distribution of (2.7), \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) can be also interpreted as a biased estimator of \((f,\pi )\), with an additional bias coming from the difference between \((f,\pi )\) and \((f,\pi _t)\) due to the simulation up to a finite time.

Note that the construction of MASGA does not require any knowledge of the structure of the target measure, such as the location of its modes (Baker et al. 2019; Brosse et al. 2018; Nagapetyan et al. 2017), or any properties of the potential V. However, in order to formulate the main result of this paper, we will use the following set of assumptions.

Assumptions 2.1

Let the potential \(V : \mathbb {R}^d \times \mathcal {P}(\mathbb {R}^k) \rightarrow \mathbb {R}\) be of the form \(V(x,\nu ^m): = v_0(x) + \int _{\mathbb {R}^k} v(x,y) \nu ^m(dy) = v_0(x) + \frac{1}{m}\sum _{i=1}^{m} v(x, \xi _i)\), where \((\xi _i)_{i=1}^m \subset \mathbb {R}^k\) is the data and the functions \(v_0 : \mathbb {R}^d \rightarrow \mathbb {R}\) and \(v : \mathbb {R}^d \times \mathbb {R}^k \rightarrow \mathbb {R}\) are such that

  1. i)

    For all \(\xi \in \mathbb {R}^k\) we have \(\nabla v(\cdot ,\xi )\), \(\nabla v_0(\cdot ) \in C^2_b(\mathbb {R}^d;\mathbb {R}^d)\), i.e., the gradients of v and \(v_0\) are twice continuously differentiable with all partial derivatives of the first and second order bounded (but the gradients themselves are not necessarily bounded).

  2. ii)

    There exists a constant \(K > 0\) such that for all \(\xi \in \mathbb {R}^k\) and for all x, \(y \in \mathbb {R}^d\) we have

    $$\begin{aligned} \langle x - y , \nabla _x v(x,\xi ) - \nabla _x v(y,\xi ) \rangle \ge K|x - y|^2 \qquad \text { and } \qquad \langle x - y , \nabla v_0(x) - \nabla v_0(y) \rangle \ge K|x - y|^2 . \end{aligned}$$

Note that the first condition above in particular implies that the gradient \(\nabla V(\cdot ,\nu ^m)\) of the potential is globally Lipschitz. Furthermore, as a consequence of Assumption 2.1(i), for all \(\xi \in \mathbb {R}^k\) the gradients \(\nabla v(\cdot ,\xi )\), as well as the gradient \(\nabla v_0(\cdot )\) satisfy a standard linear growth condition \(|\nabla _x v(x,\xi )| \le c(1+|x|)\) and \(|\nabla v_0(x)| \le c(1+|x|)\) with a constant \(c >0\). Using a standard inequality \((a+b)^4\le 8a^4 + 8b^4\), we can easily conclude that there exists a constant \(C > 0\) such that for all \(\xi \in \mathbb {R}^k\) and for all \(x \in \mathbb {R}^d\) we have

$$\begin{aligned} |\nabla _x v(x,\xi )|^4 \le C(1+|x|^4) \qquad \text { and } \qquad |\nabla v_0(x)|^4 \le C(1+|x|^4) . \end{aligned}$$

This growth condition will be important in our proofs in Sect. 4. We further remark that Assumption 2.1(ii) implies that \(\pi \) given via (1.1) is log-concave. Note also that the Bayesian inference example given in (1.5) satisfies Assumptions 2.1 if the functions \(x \mapsto - \nabla \log \pi (\xi | x)\) satisfy all the respective regularity conditions. We remark that, for convenience, we formulate our main result in this section only for the specific form of V given above. Our result holds also for a much more general class of potentials, but the assumptions for the general case are more cumbersome to formulate and hence we postpone their presentation to Sect. 4. Moreover, we stress that assuming convexity of V is not necessary for the construction of our algorithm and it is a choice we made solely to simplify the proofs. By combining our approach with the coupling techniques from (Majka et al. 2020), it should be possible to extend our results to the non-convex case. This, however, falls beyond the scope of the present paper and is left for future work. We have the following result.

Theorem 2.2

Let Assumptions 2.1 hold. Let \(\mathcal {A}^{A,{\mathcal {L}}}\) be the MASGA estimator defined in (2.5) with \(\pi ^{{\varvec{\ell }}}\) given as the law of \(X_k^{\ell _1, \ell _2}\), defined via (2.6) for some fixed \(k \ge 1\). Then there exists a set of indices \(\mathcal {L}\) and a sequence \((N_{{\varvec{\ell }}})_{{\varvec{\ell }} \in \mathcal {L}}\) such that for any \(\varepsilon > 0\) and for any Lipschitz function f, the estimator \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) requires the computational cost of order \(\varepsilon ^{-2}\) to achieve mean square error \(\varepsilon \) for approximating \((f, \pi _t)\) with \(t = kh_{\ell _2}\), where \(\pi _t := {\text {Law}}(Y_t)\) with \(Y_t\) given by (2.7).

The proof of Theorem 2.2 and an explicit description of the sequence \((N_{{\varvec{\ell }}})_{{\varvec{\ell }} \in \mathcal {L}}\), as well as all the details of the relaxed assumptions that can be imposed on V, will be given in Sect. 4. Note that requiring computational cost of order \(\varepsilon ^{-2}\) to achieve \({\text {MSE}} < \varepsilon \) is the best performance that one can expect from Monte Carlo methods, cf. (Giles 2015).

The two quantities that feature in the optimization problem formulated in Theorem 2.2 are the MSE and the computational cost (i.e., we want to optimize the cost given the constraint \({\text {MSE}} < \varepsilon \) for a fixed \(\varepsilon > 0\)). Note that the cost is explicitly defined to be

$$\begin{aligned} \textstyle {\text {cost}}(\mathcal {A}^{A,{\mathcal {L}}}(f)) := \sum _{{\varvec{\ell }} \in \mathcal {L}} N_{\ell }\,C_{{\varvec{\ell }}}, \quad C_{{\varvec{\ell }}} := t h_{\ell _2}^{-1} s_{\ell _1} , \end{aligned}$$
(2.8)

where \(C_{{\varvec{\ell }}}\) is the cost of each path at the level \({\varvec{\ell }}\), i.e., the product of \(k = t h_{\ell _2}^{-1}\) steps and \(s_{\ell _1}\) subsamples for a given level \({\varvec{\ell }} = (\ell _1, \ell _2)\). On the other hand, from our discussion on MSE (1.3) it is evident that in order to control \({\text {MSE}}(\mathcal {A}^{A,{\mathcal {L}}}(f))\), it is crucial to find upper bounds on the variance \(\mathbb {V}[(f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})]\) and the bias of the estimator at each level \({\varvec{\ell }}\). In our proof we will in fact rely on the complexity analysis from (Haji-Ali et al. 2016) (see also Giles 2015 and Theorem 4.12 below for more details), which is concerned exactly with such optimization problems.

For convenience, from now on we will assume that in our MASGA estimator (2.5), both the number of subsamples and the discretisation parameter are rescaled by two when moving between levels. More precisely, for a fixed \(s_0 \in \mathbb {N}_+\) and \(h_0 > 0\), we assume that \(s_{\ell _1} = 2^{\ell _1}s_0\) and \(h_{\ell _2} = 2^{-\ell _2}h_0\) for all \({\varvec{\ell }} = (\ell _1, \ell _2) \in \mathbb {N}^2\). In this setting, the complexity analysis from (Haji-Ali et al. 2016) tells us, roughly speaking (with more details in Sect. 4 below) that in order to obtain \({\text {MSE}}(\mathcal {A}^{A,{\mathcal {L}}}(f)) < \varepsilon \) with \({\text {cost}}(\mathcal {A}^{A,{\mathcal {L}}}(f)) < \varepsilon ^{-2}\), we need to have for each \({\varvec{\ell }} \in \mathbb {N}^2\),

$$\begin{aligned} \textstyle \mathbb {E}[(f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})^2] \lesssim 2^{- \langle {\varvec{\beta }} , {\varvec{\ell }} \rangle } \qquad \text { and } \qquad C_{{\varvec{\ell }}} \lesssim 2^{\langle {\varvec{\gamma }} , {\varvec{\ell }} \rangle } , \end{aligned}$$
(2.9)

where \({\varvec{\beta }} = (\beta _1, \beta _2)\) and \({\varvec{\gamma }} = (\gamma _1, \gamma _2) \in \mathbb {R}^2\) are such that \(\gamma _1 < \beta _1\) and \(\gamma _2 < \beta _2\). Note that the bound on \(\mathbb {E}[(f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})^2]\) trivially implies a bound on \(\mathbb {V}[(f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})]\) of the same order (i.e., with the same \({\varvec{\beta }}\)), as well as the bound \(\mathbb {E}[(f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})] \lesssim 2^{- \langle {\varvec{\alpha }} , {\varvec{\ell }} \rangle }\) with \({\varvec{\alpha }} = {\varvec{\beta }} /2\), which turns out to be crucial in the complexity analysis from (Haji-Ali et al. 2016) and is the reason why it suffices to verify (2.9), cf. also Theorem 2 in Giles (2015) and Theorem 4.12 below. Since, straight from the definition of \(C_{{\varvec{\ell }}}\), it is clear that in our setting we have \({\varvec{\gamma }} = (1,1)\) (recall that \(C_{{\varvec{\ell }}} \lesssim s_{\ell _1} h_{\ell _2}^{-1} \lesssim 2^{\ell _1 + \ell _2}\)), we can infer that in order to prove Theorem 2.2 all we have to do is to find an upper bound on \(\mathbb {E}[(f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})^2]\) proportional to \(2^{- \langle {\varvec{\beta }} , {\varvec{\ell }} \rangle }\) with \({\varvec{\beta }} = (\beta _1, \beta _2)\) such that \(\beta _1 > 1\) and \(\beta _2 > 1\). In the proof of Theorem 2.2 we will in fact obtain \({\varvec{\beta }} = (2,2)\). However, the crucial difficulty in our argument will be to ensure that our upper bound is indeed of the product form, i.e., that we obtain \(\mathbb {E}[(f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})^2] \lesssim s_{\ell _1}^{-2} h_{\ell _2}^2 \lesssim 2^{-2\ell _1 - 2 \ell _2}\).
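As a back-of-the-envelope illustration (and not the exact construction of \((N_{{\varvec{\ell }}})_{{\varvec{\ell }} \in \mathcal {L}}\) from Sect. 4), the sketch below combines the assumed decay \(\mathbb {V}[(f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})] \lesssim 2^{-2\ell _1 - 2\ell _2}\) with the cost \(C_{{\varvec{\ell }}}\) from (2.8) and a standard Giles-type sample allocation; halving \(\varepsilon \) should then roughly quadruple the total cost, consistent with the \(\varepsilon ^{-2}\) complexity. All numerical values below are illustrative.

```python
# Back-of-the-envelope check of the complexity claim (2.9), assuming V_l ~ V0 * 2^{-2 l1 - 2 l2}
# and C_l = (t / h_{l2}) * s_{l1}. The allocation N_l ~ eps^{-2} * sqrt(V_l / C_l) * sum sqrt(V * C)
# is the classical MLMC/MIMC choice, used here only as a stand-in for the sequence of Sect. 4.
import math

def mimc_cost(eps, L, s0=2, h0=0.5, t=1.0, V0=1.0):
    levels = [(l1, l2) for l1 in range(L + 1) for l2 in range(L + 1)]
    C = {l: (t / (h0 * 2 ** -l[1])) * (s0 * 2 ** l[0]) for l in levels}   # C_l ~ 2^{l1 + l2}
    V = {l: V0 * 2.0 ** (-2 * l[0] - 2 * l[1]) for l in levels}           # V_l ~ 2^{-2 l1 - 2 l2}
    lam = sum(math.sqrt(V[l] * C[l]) for l in levels)
    N = {l: max(1, math.ceil(eps ** -2 * math.sqrt(V[l] / C[l]) * lam)) for l in levels}
    return sum(N[l] * C[l] for l in levels)

# print(mimc_cost(0.01, 5) / mimc_cost(0.02, 5))   # roughly 4, i.e. cost = O(eps^{-2})
```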

Further extensions

The assertion of Theorem 2.2 states that \(\mathcal {A}^{A,{\mathcal {L}}}(f)\), interpreted as an estimator of \((f, \pi _t)\), requires computational cost of order \(\varepsilon ^{-2}\) to achieve mean square error \(\varepsilon \). Since the difference between \((f, \pi _t)\) and \((f, \pi )\) is of order \(\mathcal {O}(e^{-\lambda t})\) for some \(\lambda > 0\) (cf. the discussion in Appendix 2), this means that \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) interpreted as an estimator of \((f, \pi )\), requires computational cost of order \(\varepsilon ^{-2}\log (\varepsilon ^{-1})\) to achieve mean square error \(\varepsilon \). However, \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) could be further modified in order to remove the log term, by employing the MLMC in terminal time technique introduced in Giles et al. (2020), cf. Sect. 2.3 and Remark 3.8 therein. This would involve taking \(r = 3\) in (2.5) and modifying the definition as

$$\begin{aligned} \textstyle \mathcal {A}^{A,{\mathcal {L}}}(f) := \sum _{{\varvec{\ell }} \in {\mathcal {L}}} (f, \Delta ^{A}_1 \Delta ^{A}_2 \Delta _3 \pi ^{{\varvec{\ell }},N_{{\varvec{\ell }}}}), \end{aligned}$$
(2.10)

i.e., we would take \(\pi ^{{\varvec{\ell }}}\) with \({\varvec{\ell }} = (\ell _1, \ell _2, \ell _3)\) to be the law of \(X_{k_{\ell _3}}^{\ell _1, \ell _2}\) given by (2.6), hence we would introduce a sequence of terminal times \(t := k_{\ell _3} h_{\ell _2}\) for the chain (2.6), changing at each level. However, in the definition (2.10) of \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) we would use the antithetic difference operators \(\Delta ^{A}\) only with respect to the subsampling level parameter \(\ell _1\) and the discretisation level parameter \(\ell _2\), while applying the plain difference operator \(\Delta \) to the terminal time level parameter \(\ell _3\). The details of how to construct the sequence of terminal times \(t := k_{\ell _3} h_{\ell _2}\) can be found in Giles et al. (2020). We skip this modification in the present paper in order to keep the notation as simple as possible.

In the setting where the computational complexity is \(\varepsilon ^{-2}\) (as it is for MASGA), it is possible to easily modify the biased estimator \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) to obtain its unbiased counterpart. Indeed, let \({\varvec{M}} = (M_1, \ldots , M_r)\) be a random variable on \({\mathbb {N}}^r\), independent of \(({\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }},N_{{\varvec{\ell }}}})_{{\varvec{\ell }} \in \mathcal {L}}\). Define \({\mathcal {L}}^{{\varvec{M}}} := \{\ell \in \mathbb {N}^r : \ell _1 \le M_1, \cdots , \ell _r \le M_r\} \) and

$$\begin{aligned} \textstyle \mathcal {A}^{UA,{\mathcal {L}}^{{\varvec{M}}}}(f) := \sum _{{\varvec{\ell }} \in {\mathcal {L}}^{{\varvec{M}}}} \frac{(f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }},N_{{\varvec{\ell }}}})}{{\mathbb {P}}({\varvec{M}} \ge {\varvec{\ell }})} = \sum _{{\varvec{\ell }} \in \mathbb {N}^r} 1_{\{{\varvec{M}} \ge {\varvec{\ell }}\} }\frac{(f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }},N_{{\varvec{\ell }}}})}{{\mathbb {P}}({\varvec{M}} \ge {\varvec{\ell }})}, \end{aligned}$$

where \({\varvec{M}} \ge {\varvec{\ell }}\) is understood component-wise. One can see that \(\mathcal {A}^{UA,{\mathcal {L}}^{{\varvec{M}}}}(f)\) is then an unbiased estimator of \((f,\pi )\). Indeed,

$$\begin{aligned} \textstyle {\mathbb {E}}\big [\mathcal {A}^{UA,{\mathcal {L}}^{{\varvec{M}}}}(f) \big ] = {\mathbb {E}}\left[ \sum _{{\varvec{\ell }} \in \mathbb {N}^r} (f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }},N_{{\varvec{\ell }}}}) \right] = (f,\pi ), \end{aligned}$$

due to (2.2). It turns out that in order for the variance of \(\mathcal {A}^{UA,{\mathcal {L}}^{{\varvec{M}}}}(f)\) to be finite, we need to be in the regime where the computational complexity of the original estimator is \(\varepsilon ^{-2}\). We refer the reader to (Crisan et al. 2018; Rhee and Glynn 2015) for more details and recipes for constructing \({\varvec{M}}\). The methods from (Rhee and Glynn 2015) have recently been extended to more general classes of MCMC algorithms in Jacob et al. (2020), which contains further discussion of the benefits and costs of debiasing.
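A minimal sketch of this randomisation is given below, assuming (purely for illustration) that the components of \({\varvec{M}}\) are independent geometric random variables; antithetic_level_average stands for an assumed routine returning one sample of \((f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }},N_{{\varvec{\ell }}}})\), and the conditions ensuring finite variance and finite expected cost are those discussed in the references above.

```python
# A sketch of the randomised, unbiased estimator A^{UA, L^M}(f), with illustrative
# geometric truncation variables M_1, ..., M_r (not a prescription from the paper).
import numpy as np

def unbiased_estimator(antithetic_level_average, r, p, rng):
    """Sum of (f, Delta^A pi^{l, N_l}) / P(M >= l) over all multi-indices l <= M."""
    M = rng.geometric(p, size=r) - 1                         # M_i in {0, 1, 2, ...}
    total = 0.0
    for l in np.ndindex(*[int(m) + 1 for m in M]):           # all l with l_i <= M_i
        prob = (1.0 - p) ** sum(l)                           # P(M >= l) = prod_i (1 - p)^{l_i}
        total += antithetic_level_average(l, rng) / prob
    return total
```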

Literature review The idea of employing MLMC apparatus to improve efficiency of stochastic gradient algorithms has been studied before. In Giles et al. (2020) we introduced a Multi-level Monte Carlo (MLMC) method for SGLD in the global convexity setting, based on a number of decreasing discretisation levels. We proved that under certain assumptions on the variance of the estimator of the drift, this technique can indeed improve the overall performance of the Monte Carlo method with stochastic gradients. In Majka et al. (2020) we extended our approach to cover the non-convex setting, allowing for sampling from probability measures that satisfy a weak log-concavity at infinity condition. However, the computational complexity of such algorithms is sub-optimal. As we will observe in Sect. 3 with numerical experiments, the crucial insight of the present paper (and a novelty compared to Giles et al. 2020; Majka et al. 2020) is the application of MLMC with respect to the subsampling parameter. Note that, in a different context, the idea to apply MLMC to stochastic approximation algorithms has been studied in Frikha (2016), see also (Dereich 2021; Dereich and Müller-Gronbach 2019).

At the core of our analysis of Multi-level Monte Carlo estimators lies the problem of constructing the right couplings between Euler schemes (2.6) on different discretisation levels. For Euler schemes with standard (accurate) drifts this is done via a one-step analysis by coupling the driving noise in a suitable way, cf. Sects. 2.2 and 2.5 in Majka et al. (2020) and Sect. 2.4 in Eberle and Majka (2019). However, in the case of SGLD, one is faced with an additional problem of coupling the drift estimators used on different discretisation levels. In both (Giles et al. 2020) and (Majka et al. 2020) we addressed this issue in the simplest possible way, by coupling the drift estimators independently. In the present paper we show that by employing a non-trivial coupling we can substantially reduce the variance of MLMC and thus obtain the required bound \(\mathbb {V}[(f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})] \lesssim 2^{-2\ell _1 - 2 \ell _2}\) as explained above. We achieve this by utilising the antithetic approach to MLMC as defined in (2.4). Related ideas for antithetic multi-level estimators have been used e.g. in Giles and Szpruch (2013), Giles and Szpruch (2014), Szpruch and Tse (2019). However, in the present paper we apply this concept for the first time to Euler schemes with inaccurate drifts. We also remark that, due to our bounds on second moments, we can easily derive confidence intervals for MASGA using Chebyshev’s inequality. However, it would also be possible to derive a Central Limit Theorem and corresponding concentration inequalities, in the spirit of Ben Alaya and Kebaier (2015), Ben Alaya et al. (2020), Jourdain and Kebaier (2019), Kebaier (2005).

The remaining part of this paper is organised as follows. In Sect. 3 we present numerical experiments confirming our theoretical findings. In Sect. 4 we present a more general framework for the MASGA estimator, we explain the intuition behind the antithetic approach to MLMC in more detail (see Example 4.2) and we formulate the crucial Lemma 4.13. We also explain how to prove Theorem 2.2 based on Lemma 4.13. In Sect. 5 we prove Lemma 4.13 in a few steps: we first discuss the antithetic estimator with respect to the discretisation parameter, which corresponds to taking \(r=1\) and \({{\varvec{\ell }}} = \ell _2\) in (2.5), then we consider the antithetic estimator with respect to the subsampling parameter, which corresponds to taking \(r=1\) and \({{\varvec{\ell }}} = \ell _1\) in (2.5) and, finally, we explain how these approaches can be combined in a multi-index estimator with \(r=2\) and \({{\varvec{\ell }}} = (\ell _1, \ell _2)\) to prove Lemma 4.13. Several technical proofs are postponed to Appendices.

3 Numerical experiments

We showcase the performance of the MASGA estimator in Bayesian inference problems, combining different Bayesian models and priors. The code for all the numerical experiments can be found at https://github.com/msabvid/MLMC-MIMC-SGD.

In Sect. 3.1 we compare the MASGA estimator introduced in Sect. 2 with an Antithetic Multi-level Monte Carlo (AMLMC) estimator with respect to the subsampling parameter, corresponding to taking \(r=1\) and \({{\varvec{\ell }}} = \ell _1\) in (2.5). We demonstrate that MASGA indeed achieves the optimal computational complexity. Both these estimators are also compared to a standard Monte Carlo estimator for reference. As we shall see, while the performance of MASGA in our experiments is always better than that of AMLMC, the difference is not substantial. This suggests that, from the practical standpoint, the crucial insight of this paper is the application of the antithetic MLMC approach with respect to the subsampling parameter. Hence in our subsequent experiments in Sects. 3.2 and 3.3, we will focus on the AMLMC estimator, which is easier to implement than MASGA.

Note that this phenomenon is partially explained by our computations, which show that the variance of AMLMC with respect to subsampling is already of the optimal order in both parameters, i.e., \(\mathcal {O}(h^2s^{-2})\), see Sect. 5 for details. However, based on the theory of MIMC, in order to prove the optimal computational complexity, we need to use AMLMC with respect to subsampling in conjunction with AMLMC with respect to discretisation as a multi-index estimator.

In Sect. 3.2 we will check the performance of AMLMC with respect to subsampling in both convex and non-convex settings, whereas in Sect. 3.3 we will compare it to the Stochastic Gradient Langevin Dynamics with Control Variates (SGLD-CV) method introduced in Baker et al. (2019). More precisely, in Sect. 3.3 we present an example of a convex setting in which AMLMC outperforms SGLD-CV. It is worth pointing out that the latter method can be applied only to convex settings (similarly to the method from Cornish et al. (2019)), whereas AMLMC is free of such limitations.

We would like to remark that our main point in implementing AMLMC in non-convex settings is to demonstrate the robustness of our approach and show that AMLMC can perform well in some simple examples where other methods (Baker et al. 2019; Cornish et al. 2019) are not even implementable. However, we do not guarantee (and do not expect) the optimal performance of AMLMC in such settings. We believe that in order to ensure the optimal performance of our algorithms in non-convex settings, we would have to re-design MASGA by employing a different coupling. More precisely, our current algorithm is based on coupling multiple chains in a synchronous way, which is well-suited to the convex setting, but not necessarily to non-convex settings. Adapting our method to non-convex settings would require a re-design of the algorithm by utilising a combination of the maximal and the reflection couplings, as described for ULA and SGLD in Eberle and Majka (2019), Majka et al. (2020). This is, however, highly non-trivial (since MASGA relies on coupling of multiple chains and not just two), and will require a lot of additional work, hence falls beyond the scope of the present paper.

3.1 MASGA and AMLMC with respect to subsampling

Let us begin by introducing the Bayesian logistic regression setting that we will use for our simulations. The data is modelled by

$$\begin{aligned} \textstyle p(y_i | \iota _i, x) = g(y_i x^T \iota _i) \end{aligned}$$
(3.1)

with \(g(z) = 1/(1+e^{-z})\), \(z \in \mathbb {R}\), where \(x\in {\mathbb {R}}^d\) are the parameters of the model that need to be sampled from their posterior distribution, \(\iota _i\) denotes an observation of the predictive variable in the data, and \(y_i\) the binary target variable. Given a dataset of size m, by Bayes’ rule, the posterior density of x satisfies

$$\begin{aligned} \textstyle \pi (x) \propto \pi _0(x) \prod _{i=1}^m g(y_ix^T\iota _i). \end{aligned}$$
(3.2)

In our experiments, we will consider two different priors \(\pi _0\), namely

  1. i)

    a Gaussian prior \(\pi _0 \sim {\mathcal {N}}(0, I)\).

  2. ii)

    a mixture of two Gaussians \(\pi _0 \sim \frac{1}{2} {\mathcal {N}}(0,I) + \frac{1}{2} {\mathcal {N}}(1,I)\).

We use Algorithm 1 to empirically calculate the cost of approximating \({\mathbb {E}}(f(X))\), for a function \(f:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\), where the law of X is the posterior \(\pi \), by the MASGA estimator \(\mathcal {A}^{A,{\mathcal {L}}}_{MASGA}(f) = \mathcal {A}^{A,{\mathcal {L}}}(f)\) defined in (2.5), such that its MSE is under some threshold \(\varepsilon \). Recall that in our notation \((f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})\) denotes the integral of f with respect to the antithetic measure \(\Delta ^{A} \pi ^{{\varvec{\ell }}}\) given by (2.4), where \(\pi ^{{\varvec{\ell }}}\) is specified by the law of the Markov chain (2.6) with the potential V determined from (3.2) in an analogous way as in the Bayesian inference example (1.4). More explicitly, we have

$$\begin{aligned} \textstyle V(x,\nu ^m) = -\log \pi _0(x) - \sum _{i=1}^{m} \log g(y_i x^T \iota _i) \end{aligned}$$

and

$$\begin{aligned} \textstyle X_{k+1}^{\ell _1, \ell _2} = X_k^{\ell _1, \ell _2} + \frac{1}{2}h_{\ell _2} \left( \frac{1}{m} \nabla \log \pi _0(X_k^{\ell _1, \ell _2}) + \frac{1}{s_{\ell _1}} \sum _{i=1}^{s_{\ell _1}} \nabla \log g\left( y_{\tau _i^k}(X_k^{\ell _1, \ell _2})^T \iota _{\tau _i^k}\right) \right) + \frac{1}{\sqrt{m}} \sqrt{h_{\ell _2}} Z_{k+1} , \end{aligned}$$
(3.3)

where \(\tau _i^k\) for \(k \ge 1\) and \(i \in \{ 1, \ldots , s_{\ell _1} \}\) can correspond to subsampling either with or without replacement (in our simulations we choose the latter).
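For concreteness, one step of (3.3) with a standard Gaussian prior can be sketched as follows (variable names are illustrative and do not refer to the repository code); we use the identity \(\frac{d}{dz}\log g(z) = 1 - g(z)\) for the sigmoid g and \(\nabla \log \pi _0(x) = -x\) for the standard Gaussian prior.

```python
# One step of the chain (3.3) for Bayesian logistic regression (3.1)-(3.2) with a standard
# Gaussian prior, using subsampling without replacement as in the experiments.
import numpy as np

def sgld_logistic_step(x, iota, y, s, h, rng):
    """x: parameters (d,); iota: covariates (m, d); y: labels in {-1, +1} (m,)."""
    m = iota.shape[0]
    idx = rng.choice(m, size=s, replace=False)               # subsampling without replacement
    z = y[idx] * (iota[idx] @ x)                              # y_i * x^T iota_i on the batch
    grad_log_lik = (y[idx] * (1.0 - 1.0 / (1.0 + np.exp(-z))))[:, None] * iota[idx]
    drift = (-x) / m + grad_log_lik.mean(axis=0)              # (1/m) grad log pi_0 + batch average
    return x + 0.5 * h * drift + np.sqrt(h / m) * rng.standard_normal(x.shape)
```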

Below we present the results of our experiments for \(f(x) = |x|^2\) (we would like to remark that we obtained similar conclusions for \(f(x) = |x|\) and hence we skip the latter example to save space).

Furthermore, for the Bayesian Logistic regressions we use the covertype dataset (Blackard and Dean 1999), which has \(581\,012\) observations and 54 columns. We create a training set containing 20% of the original observations.

Algorithm 1

On the other hand, for AMLMC with respect to subsampling (denoted below by \(\mathcal {A}^{A,{\mathcal {L}}}_{MLMC}(f)\)) we take \(r = 1\) and \({\varvec{\ell }} = \ell _1\) in (2.5). This corresponds to using a fixed discretisation parameter h and applying the antithetic MLMC estimator only with respect to the subsampling parameter.

Note that in our experiments we apply the estimators \(\mathcal {A}^{A,{\mathcal {L}}}_{MASGA}(f)\) and \(\mathcal {A}^{A,{\mathcal {L}}}_{MLMC}(f)\) to approximate \((f,\pi _t^h)\), where \(\pi _t^h\) is the law given by the chain \(X_k\) with \(t = kh\), defined by

$$\begin{aligned} \textstyle X_{k+1} = X_k + \frac{1}{2}h \left( \frac{1}{m} \nabla \log \pi _0(X_k) + \frac{1}{m} \sum _{i=1}^{m} \nabla \log g(y_i X_k^T \iota _i) \right) + \frac{1}{\sqrt{m}} \sqrt{h} Z_{k+1} , \end{aligned}$$
(3.4)

with a fixed discretisation parameter h (i.e., we do not take into account the error between \((f,\pi _t^h)\) and \((f,\pi _t)\) when calculating the MSE, where \(\pi _t\) is the law of the SDE (2.7)). The value of this h is determined by the final level used in MASGA, i.e., \(h = h_L\).

The experiment is organised as follows:

  1. i)

    let \(L\ge 1\), and \((\ell _s, \ell _d) = (L,L)\) be the highest multi-indices used in the calculation of \({\mathcal {A}}^{A, {\mathcal {L}}}_{MASGA}(f)\).

  2. ii)

    We measure the bias of \({\mathcal {A}}^{A, {\mathcal {L}}}_{MASGA}(f)\) and set \(\varepsilon := \sqrt{2} \left( \mathbb {E} \left[ |(f,\pi ^{h_L}_t) - {\mathcal {A}}^{A, {\mathcal {L}}}_{MASGA}(f)|^2 \right] \right) ^{1/2}\).

  3. iii)

    We then compare the cost of \({\mathcal {A}}^{A, {\mathcal {L}}}_{MASGA}(f)\) against the cost of \({\mathcal {A}}^{A, {\mathcal {L}}}_{MLMC}(f)\) with fixed discretisation parameter \(h = h_{\ell _d} = h_L\) satisfying \({\mathbb {E}} [((f,\pi ^{h_L}_t) - {\mathcal {A}}_{MLMC}^{A, {\mathcal {L}}}(f))^2]<\varepsilon ^2\).

We repeat the above three steps for \(L=1,\ldots ,7\), in order to measure the cost for different values of \(\varepsilon \).

We perform this comparison on two data regimes: first on the covertype dataset with 100K observations and 54 covariates, and second on a smaller synthetic dataset with 1K observations and 5 covariates. Results are shown in Fig. 3, where each \(\varepsilon \) corresponds to different values of L (see Table 1).

Fig. 3 Comparison of MASGA cost against AMLMC cost for the standard Gaussian prior

Table 1 MASGA setting for Bayesian Logistic Regression

As expected, the higher the accuracy (the lower the \(\varepsilon \)), the better the cost of \({\mathcal {A}}^{A, {\mathcal {L}}}_{MASGA}(f)\) compared to the cost of \({\mathcal {A}}^{A, {\mathcal {L}}}_{MLMC}(f)\). Depending on the dataset size, it is necessary to reach different levels of accuracy of the MSE to notice an improvement in the cost. This is due to the variance introduced by the noise in the chain \(X_k^{\ell _1,\ell _2}\) (3.3), which decreases as the dataset size m increases.

3.2 AMLMC with respect to subsampling in convex and non-convex settings

In the subsequent experiments, we study the AMLMC estimator with respect to subsampling, i.e., with a fixed discretisation parameter h. We simulate \(10\times 2^4\) steps of the chain (3.3). We take \(X_0\) to be an approximation of the mode of the posterior that we pre-compute using Stochastic Gradient Descent to replace the burn-in phase of the Markov chain, cf. (Baker et al. 2019). The number of steps and the step size are chosen so as to be consistent with the finest discretisation level of the MASGA experiment provided in the previous section.

A summary of the AMLMC setting is provided in Table 2.

Table 2 AMLMC setting for Bayesian Logistic Regression
Fig. 4 AMLMC estimator \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) with respect to subsampling for \(f(x)= |x|^2\), where the prior is the standard Gaussian

Fig. 5 AMLMC estimator \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) with respect to subsampling for \(f(x)= |x|^2\), where the prior is a mixture of two Gaussians

Plots in Fig. 4 correspond to the results where \(\pi _t^h\) is the approximation of the posterior of a Bayesian Logistic Regression with Gaussian prior. Plots in Fig. 5 use a mixture of Gaussians for the prior. The left plot shows the variance of \((f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})\) and \((f, \pi ^{{\varvec{\ell }}})\) per subsampling level. The right plot displays the computational cost multiplied by \(\varepsilon ^2\).

These figures indicate that the total cost of approximating \((f,\pi _t^h)\) by \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) as described above is \(\mathcal {O}(\varepsilon ^{-2})\), even when the prior is not log-concave, as is the case for a mixture of two Gaussians (Fig. 5).

3.2.1 Bayesian mixture of Gaussians

For our next experiment, we use the setting from Example 5.1 in Welling and Teh (2011) to consider a Bayesian mixture of two Gaussians on a 2-dimensional dataset, in order to make the posterior multi-modal. Given a dataset of size m, by Bayes’ rule

$$\begin{aligned} \textstyle \pi (x) \propto \pi _0(x) \prod _{i=1}^m g(\iota _1, \iota _2 | x), \end{aligned}$$
(3.5)

where \(x=(x_1,x_2)\), g is the joint density of \((\iota _1, \iota _2)\) where each \(\iota _i \sim \frac{1}{2}{\mathcal {N}}(x_1, 5) + \frac{1}{2}{\mathcal {N}}(x_1+x_2, 5)\). For the experiment, we consider a Gaussian prior \({\mathcal {N}}(0,I)\) for \(\pi _0(x)\), and we create a synthetic dataset with 200 observations, by sampling from \(\iota _i \sim \frac{1}{2}{\mathcal {N}}(0, 5) + \frac{1}{2}{\mathcal {N}}(1, 5)\).
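The only model-specific ingredient is the gradient of the log mixture likelihood; a sketch is given below, treating each scalar observation as a draw from \(\frac{1}{2}{\mathcal {N}}(x_1,5) + \frac{1}{2}{\mathcal {N}}(x_1+x_2,5)\) and assuming that 5 denotes the variance of each component (an assumption about the parameterisation).

```python
# Per-observation gradient of the log mixture likelihood entering (3.5); sig2 = 5 is
# assumed to be the component variance, and all names are illustrative.
import numpy as np

def grad_log_mixture(x, iota, sig2=5.0):
    """Returns d/dx log p(iota | x) for x = (x1, x2) and a scalar observation iota."""
    mu = np.array([x[0], x[0] + x[1]])
    log_comp = -0.5 * (iota - mu) ** 2 / sig2              # log N(iota; mu_j, sig2) up to constants
    w = np.exp(log_comp - log_comp.max())
    resp = w / w.sum()                                      # responsibilities of the two components
    score = (iota - mu) / sig2                              # d/dmu_j log N(iota; mu_j, sig2)
    return np.array([resp @ score, resp[1] * score[1]])     # chain rule: mu = (x1, x1 + x2)
```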

In this experiment we again take \(r = 1\) and \({\varvec{\ell }} = \ell _1\) in (2.5), which corresponds to using a fixed discretisation parameter h and applying the antithetic MLMC estimator only with respect to the subsampling parameter. We then use the same setting as before: we apply our estimator \(\mathcal {A}^{A,{\mathcal {L}}}_{MLMC}(f)\) to approximate \((f,\pi _t^h)\), where \(\pi _t^h\) is the law given by the chain \(X_k\) defined in (3.4), with a fixed discretisation parameter \(h = 1\) (i.e., we do not take into account the error between \((f,\pi _t^h)\) and \((f,\pi _t)\) when calculating the MSE). We simulate \(2\times 10^5\) steps of the chain (2.6), starting from \(X_0=0\) (see Table 3).

In this example there is the additional difficulty that the posterior has two modes (Fig. 6). It is therefore necessary to ensure that the number of steps is high enough so that the chain has explored all the space (Fig. 7).

Table 3 AMLMC setting for Bayesian Mixture of Gaussians
Fig. 6 AMLMC estimator \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) with respect to subsampling for \(f(x)= |x|^2\), for a multi-modal posterior, where the prior \(\pi _0\) is the standard Gaussian

Results for this experiment are shown in Fig. 6, indicating that the total cost of approximating \((f,\pi _t^h)\) by the AMLMC estimator \(\mathcal {A}^{A,{\mathcal {L}}}(f)\) is \(\mathcal {O}(\varepsilon ^{-2})\). We obtain the following rates of decay of the variance and the absolute mean of \((f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})\):

$$\begin{aligned} \textstyle i)\,\, \left| {\mathbb {E}}[(f, {\varvec{\Delta }}^A \pi ^{{\varvec{\ell }}})]\right| \lesssim 2^{- 1.01\,\ell }, \quad ii) \,\,\mathbb {V}[(f, {\varvec{\Delta }}^A \pi ^{{\varvec{\ell }}})] \lesssim 2^{- 1.82\, \ell } . \end{aligned}$$

3.3 AMLMC and SGLD with control variates

In this subsection we compare the AMLMC estimator with respect to subsampling against the Stochastic Gradient Langevin Dynamics method with control variates (SGLD-CV) from Baker et al. (2019), Nagapetyan et al. (2017). We work with the standard Gaussian prior \(\pi _0\). For SGLD-CV, for a fixed time step size h, and for a fixed subsample size \(s_{\ell _1}\), instead of the process (2.6) we use

$$\begin{aligned} \textstyle X_{k+1}^{\ell _1} = X_k^{\ell _1} - h \left( \nabla _x V({\hat{x}}, \nu ^{m}) + \left( \nabla _x V(X_k^{\ell _1}, \nu ^{s_{\ell _1}})-\nabla _x V({\hat{x}}, \nu ^{s_{\ell _1}})\right) \right) + \beta \sqrt{h} Z_{k+1} , \end{aligned}$$
(3.6)

where \(\beta = 1/\sqrt{m}\) and \({\hat{x}}\) is a fixed value denoting an estimate of the mode of the posterior \(\pi (x)\). We undertake the following steps:

  1. 1.

    We estimate the mode of the posterior \({\hat{x}}\) by using stochastic gradient descent.

  2. 2.

    For each considered accuracy \(\varepsilon \), we run AMLMC with respect to subsampling (as described in Sect. 3.2) to get the maximum subsample size \(s_{\ell _1}\) necessary to achieve an estimator \(\mathcal {A}^{A,{\mathcal {L}}}_{MLMC}(f)\) such that \({\text {MSE}}(\mathcal {A}^{A,{\mathcal {L}}}_{MLMC}(f)) \lesssim \varepsilon \).

  3. 3.

    We use each pair \((\varepsilon , s_{\ell _1})\) from the previous step to calculate the cost of SGLD-CV.

The AMLMC setting values are listed in Table 4 and results are shown in Fig. 8.

Fig. 7 Histogram of \(X_k = (X_{k,1}, X_{k,2})\) sampled from (2.6)

Table 4 AMLMC setting for Bayesian Logistic Regression

The SGLD-CV method has been shown in Baker et al. (2019), Nagapetyan et al. (2017) to reduce the variance (and hence improve the performance) compared to the standard SGLD and the standard Monte Carlo methods. However, in some numerical examples, as demonstrated in this subsection, this gain can be relatively small compared to the gain from using AMLMC.
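For completeness, a minimal sketch of the control-variate drift in (3.6) is given below; grad_V is a placeholder for \(\nabla _x V(\cdot , \nu )\), the full-data gradient at the mode estimate \({\hat{x}}\) is pre-computed once, and the same subsample is used for both correction terms.

```python
# A sketch of one SGLD-CV step as in (3.6): full-data gradient at x_hat plus a subsampled
# correction; grad_V(x, batch) and the pre-computed grad_V_full_at_mode are placeholders.
import numpy as np

def sgld_cv_step(x, x_hat, grad_V_full_at_mode, grad_V, data, s, h, beta, rng):
    """X_{k+1} = X_k - h * (grad V(x_hat, nu^m) + grad V(X_k, nu^s) - grad V(x_hat, nu^s)) + noise."""
    m = len(data)
    idx = rng.choice(m, size=s, replace=False)
    batch = data[idx]
    drift = grad_V_full_at_mode + grad_V(x, batch) - grad_V(x_hat, batch)
    return x - h * drift + beta * np.sqrt(h) * rng.standard_normal(x.shape)
```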

4 General setting for MASGA

We will work in a more general setting than the one presented in Sect. 2. Namely, we consider an SDE

$$\begin{aligned} dX_t = a(X_t)dt + \beta dW_t , \end{aligned}$$
(4.1)

where \(a: \mathbb {R}^d \rightarrow \mathbb {R}^d\), \(\beta \in \mathbb {R}_{+}\) and \((W_t)_{t \ge 0}\) is a d-dimensional Brownian motion. Furthermore, let \((Z_k)_{k=1}^{\infty }\) be i.i.d. random variables in \(\mathbb {R}^d\) with \(Z_k \sim \mathcal {N}(0,I)\) for \(k \ge 1\). For a fixed discretisation parameter \(h > 0\) we consider a discretisation of (4.1) given by \(X_{k+1} = X_k + ha(X_k) + \beta \sqrt{h} Z_{k+1}\), as well as its inaccurate drift counterpart

$$\begin{aligned} X_{k+1} = X_k + h b(X_k, U_k) + \beta \sqrt{h} Z_{k+1} . \end{aligned}$$
(4.2)

Here \(b:\mathbb {R}^d \times \mathbb {R}^n \rightarrow \mathbb {R}^d\) is an unbiased estimator of a in the sense that \((U_k)_{k=0}^{\infty }\) are mutually independent \(\mathbb {R}^n\)-valued random variables, independent of \((Z_k)_{k=1}^{\infty }\), such that for any \(k \ge 0\) we have

$$\begin{aligned} \mathbb {E}[ b(x,U_k)]=a(x) \text { for all } x \in \mathbb {R}^d . \end{aligned}$$
(4.3)

Note that for each k, the random variable \(X_k\) is independent of \(U_k\) and that \(\mathbb {E}[b(X_k,U_k)|X_k] = a(X_k)\). Moreover, note that the framework where the drift estimator b(xU) depends on a random variable U is obviously a generalisation of (1.2), since as a special case of (4.2) we can take \(b(x,U) = - \nabla _x V (x, {\text {Law}}(U))\), where \({\text {Law}}(U)\) denotes the law of U. We use the name Stochastic Gradient Langevin Dynamics (SGLD) (Dalalyan and Karagulyan 2019; Majka et al. 2020; Nagapetyan et al. 2017) to describe (4.2) even in the general abstract setting where b and a are not necessarily of gradient form.
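A toy instance of this abstract setting is sketched below: the drift a is written as an average of finitely many components and b(x, U) evaluates a uniformly chosen component, so that (4.3) holds by construction; the decomposition and all names are purely illustrative.

```python
# Toy instance of (4.2)-(4.3): a(x) = (1/n) * sum_j a_j(x) and b(x, U) = a_U(x) for
# U ~ Unif({0, ..., n-1}), so that E[b(x, U)] = a(x).
import numpy as np

def make_unbiased_drift(components):
    """components: list of functions a_j; returns (a, b) with a(x) = mean_j a_j(x)."""
    n = len(components)
    def b(x, u):                       # u plays the role of U_k
        return components[u](x)
    def a(x):
        return sum(c(x) for c in components) / n
    return a, b

def sgld_general_step(x, b, u, h, beta, rng):
    """X_{k+1} = X_k + h * b(X_k, U_k) + beta * sqrt(h) * Z_{k+1}, cf. (4.2)."""
    return x + h * b(x, u) + beta * np.sqrt(h) * rng.standard_normal(np.shape(x))
```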

Fig. 8 Comparison of AMLMC with respect to subsampling for \(f(x)= |x|^2\) vs SGLD with control variate

This setting, besides having the obvious advantage of being more general than the one presented in Assumptions 2.1, allows us also to reduce the notational complexity by replacing sums of gradients with general abstract functions a and b. As a motivation for considering such a general framework, let us discuss an example related to generative models.

Example 4.1

Let \(\nu \) denote an unknown data measure, supported on \({\mathbb {R}}^D\), and let \(\nu ^m\) be its empirical approximation. While D is typically very large, in many applications \(\nu \) can be well approximated by a probability distribution supported on a lower dimensional space, say \({\mathbb {R}}^d\), with \(d \ll D\). The aim of generative models (Goodfellow et al. 2014) is to map samples from some basic distribution \(\mu \) supported on \({\mathbb {R}}^d\) into samples from \(\nu \). More precisely, one considers a parametrised map \(G:{\mathbb {R}}^d \times \Theta \rightarrow {\mathbb {R}}^D\), with a parameter space \(\Theta \subseteq {\mathbb {R}}^p\), that transports \(\mu \) into the pushforward measure \(G(\theta )_{\#}\mu \), defined by \(G(\theta )_{\#}\mu (B) := \mu (G(\theta )^{-1}(B))\) for Borel sets \(B \subseteq {\mathbb {R}}^D\). One then seeks \(\theta \) such that \(G(\theta )_{\#}\mu \) is a good approximation of \(\nu \) with respect to a user-specified metric. In this example we consider \(f:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) and \(\Phi :{\mathbb {R}} \rightarrow {\mathbb {R}}\) and define

$$\begin{aligned} \text {dist}(G(\theta )_{\#}\mu ,\nu ^m):=\Phi \left( \int f(x)(G(\theta )_{\#}\mu )(dx) - \int f(x)\nu ^m(dx) \right) . \end{aligned}$$

A popular choice for the generator G is a neural network (Goodfellow et al. 2014). In the case of a one-hidden-layer network with \(\theta =(\alpha ,\beta ) \in {\mathbb {R}}^p \times {\mathbb {R}}^p\) with the activation function \(\psi :{\mathbb {R}}^d \rightarrow {\mathbb {R}}^d\), one takes

$$\begin{aligned} G(x,\theta ):= \frac{1}{p}\sum _{i=1}^p \beta _{i} \psi (\alpha _{i} x). \end{aligned}$$

With this choice of G, the authors of Hu et al. (2019) derived a gradient flow equation that minimises suitably regularised \(\text {dist}(G(\theta )_{\#}\mu ,\nu ^m)\). The gradient flow identified in Hu et al. (2019), when discretised, is given by

$$\begin{aligned} \theta _{k+1} = \theta _k - h \left( b(\theta _k,\nu ^m) - \frac{\sigma ^2}{2}\nabla U(\theta _{k}) \right) + \sigma \sqrt{h} Z_{k+1}, \end{aligned}$$
(4.4)

with \(U:{\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\) being a regulariser, \(\sigma > 0\) a regularisation parameter, \((Z_k)_{k=1}^{\infty }\) a sequence of i.i.d. random variables with the standard normal distribution, and

$$\begin{aligned} b(\theta ,\nu ^m) = \nabla _x \Phi \left( \int (f \circ G)(x,\theta ) \mu (dx) - \int f(x)\nu ^m(dx)\right) \int \nabla _{\theta } (f\circ G)(x,\theta ) \mu (dx). \end{aligned}$$

We refer the reader to Hu et al. (2019) for more details and to Jabir et al. (2019) for an extension to deep neural networks. One can see that b may depend on the data in a non-linear way and hence the general setting of (4.2) becomes necessary for the analysis of the stochastic gradient counterpart of (4.4). An application of MASGA to the study of generative models will be further developed in a future work.

In order to analyse the MASGA estimator (2.5), we need to interpret the Markov chain

$$\begin{aligned} X^{s,h}_{k+1} = X^{s,h}_k + h b(X^{s,h}_k, U^s_k) + \beta \sqrt{h} Z_{k+1} \end{aligned}$$
(4.5)

as characterized by two parameters: the discretisation parameter \(h > 0\) and the drift estimation parameter \(s \in \mathbb {N}\) that corresponds to the quality of approximation of a(x) by \(b(x,U^s_k)\), for some mutually independent random variables \((U^s_k)_{k=0}^{\infty }\). We will now carefully explain how to implement the antithetic MLMC framework from Sect. 2 in this setting.

To this end, suppose that we have a decreasing sequence \((h_\ell )_{\ell =0}^{L} \subset \mathbb {R}_{+}\) of discretisation parameters and an increasing sequence \((s_\ell )_{\ell =0}^{L} \subset \mathbb {N}_{+}\) of drift estimation parameters for some \(L \ge 1\). For any \(\ell _1\), \(\ell _2 \in \{ 1, \ldots , L \}\) and any function \(f : \mathbb {R}^d \rightarrow \mathbb {R}\), we define

$$\begin{aligned} \Delta \Phi ^{s_{\ell _1}, h_{\ell _2}}_{f,k} := \left( f(X^{s_{\ell _1},h_{\ell _2}}_k) - f(X^{s_{\ell _1}, h_{\ell _2-1}}_k) \right) - \left( f(X^{s_{\ell _1-1},h_{\ell _2}}_k)- f(X^{s_{\ell _1-1}, h_{\ell _2-1}}_k) \right) , \end{aligned}$$
(4.6)

and we also put \(\Delta \Phi ^{s_{0}, h_{0}}_{f,k} := f(X^{s_0,h_0}_k)\), \(\Delta \Phi ^{s_{0}, h_{1}}_{f,k} := f(X^{s_0,h_1}_k) - f(X^{s_0,h_0}_k)\) and \(\Delta \Phi ^{s_{1}, h_{0}}_{f,k} := f(X^{s_1,h_0}_k) - f(X^{s_0,h_0}_k)\). Then we can define a Multi-index Monte Carlo estimator

$$\begin{aligned} \mathcal {A} := \sum _{\ell _1 = 0}^{L} \sum _{\ell _2 = 0}^{L} \frac{1}{N_{\ell _1,\ell _2}} \sum _{j=1}^{N_{\ell _1,\ell _2}} \Delta \Phi ^{s_{\ell _1}, h_{\ell _2}, (j)}_{f,k} , \end{aligned}$$
(4.7)

where \(\Delta \Phi ^{s_{\ell _1}, h_{\ell _2}, (j)}_{f,k}\) for \(j = 1, \ldots , N_{\ell _1,\ell _2}\) are independent copies of \(\Delta \Phi ^{s_{\ell _1}, h_{\ell _2}}_{f,k}\). Here \(N_{\ell _1,\ell _2}\) is the number of samples at the (doubly-indexed) level \({\varvec{\ell }} = (\ell _1, \ell _2)\). Note that (4.7) corresponds to the regular (non-antithetic) MLMC estimator defined in (2.3) with \(r = 2\) and with L levels for both parameters.
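
For orientation, once a routine returning independent realisations of \(\Delta \Phi ^{s_{\ell _1}, h_{\ell _2}}_{f,k}\) is available, the estimator (4.7) is assembled by a plain double sum over levels. The sketch below is schematic: `delta_phi_sampler` is a placeholder (the actual sampler depends on the model and on the coupling between levels discussed next), and the toy sampler in the example only exercises the bookkeeping.

```python
import numpy as np

def mimc_estimate(delta_phi_sampler, N):
    """Non-antithetic MIMC estimator (4.7).

    delta_phi_sampler(l1, l2) -> one independent realisation of
    Delta Phi^{s_{l1}, h_{l2}}_{f,k};  N[l1, l2] = number of samples at level (l1, l2).
    """
    L1, L2 = N.shape
    total = 0.0
    for l1 in range(L1):
        for l2 in range(L2):
            samples = [delta_phi_sampler(l1, l2) for _ in range(int(N[l1, l2]))]
            total += np.mean(samples)
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Placeholder sampler with mean and spread decaying across levels.
    toy = lambda l1, l2: 2.0 ** (-(l1 + l2)) * (1.0 + rng.standard_normal())
    print(mimc_estimate(toy, N=np.full((4, 4), 100)))
```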

We will now explain how to obtain the MASGA estimator (2.5) by modifying (4.7) by replacing the difference operator \(\Delta \Phi ^{s_{\ell _1}, h_{\ell _2}}_{f,k}\) with its antithetic counterpart. To this end, we will need to take a closer look at the relation between the chains (4.5) on different levels. From now on, we focus on sequences of parameters \(h_\ell := 2^{-\ell }h_0\) and \(s_\ell := 2^\ell s_0\) for \(\ell \in \{ 1, \ldots , L \}\) and some fixed \(h_0 > 0\) and \(s_0 \in \mathbb {N}\). Then, we observe that for a fixed \(s_{\ell _1}\), the chain \(X^{s_{\ell _1}, h_{\ell _2}}\) has twice as many steps as the chain \(X^{s_{\ell _1}, h_{\ell _2-1}}\), i.e., for any \(k \ge 0\) we have

$$\begin{aligned} \begin{aligned}&X^{s_{\ell _1},h_{\ell _2}}_{k+2} = X^{s_{\ell _1},h_{\ell _2}}_{k+1} + h_{\ell _2} b(X^{s_{\ell _1},h_{\ell _2}}_{k+1}, U^{s_{\ell _1},h_{\ell _2}}_{k+1}) + \beta \sqrt{h_{\ell _2}}Z_{k+2} \\&X^{s_{\ell _1},h_{\ell _2}}_{k+1} = X^{s_{\ell _1},h_{\ell _2}}_{k} + h_{\ell _2} b(X^{s_{\ell _1},h_{\ell _2}}_{k}, U^{s_{\ell _1},h_{\ell _2}}_{k}) + \beta \sqrt{h_{\ell _2}}Z_{k+1} \\&X^{s_{\ell _1},h_{\ell _2-1}}_{k+2} = X^{s_{\ell _1},h_{\ell _2-1}}_{k} + 2h_{\ell _2} b(X^{s_{\ell _1},h_{\ell _2-1}}_{k}, U^{s_{\ell _1},h_{\ell _2-1}}_{k}) \\&\qquad \quad + \beta \sqrt{2h_{\ell _2}}(Z_{k+2} + Z_{k+1}) . \end{aligned} \end{aligned}$$
(4.8)

Throughout the paper, we will refer to \((X^{s_{\ell _1},h_{\ell _2}}_k)_{k \in \mathbb {N}}\) as the fine chain, and to \((X^{s_{\ell _1},h_{\ell _2-1}}_k)_{k\in 2 \mathbb {N}}\) as the coarse chain. Note that for the chain \((X^{s_{\ell _1},h_{\ell _2-1}}_k)_{k\in 2 \mathbb {N}}\) we could in principle use a sequence of standard Gaussian random variables \((\hat{Z}_k)_{k \in 2 \mathbb {N}}\) completely unrelated to the one that we use for \((X^{s_{\ell _1},h_{\ell _2}}_k)_{k \in \mathbb {N}}\) (which is \((Z_k)_{k \in \mathbb {N}}\)). However, we choose \(\hat{Z}_{k+2} := (Z_{k+1} + Z_{k+2})/\sqrt{2}\) for all \(k \ge 0\), which corresponds to using the synchronous coupling between levels (which turns out to be a good choice in the global convexity setting as in Assumptions 2.1, cf. (Giles et al. 2020)). Moreover, note that since the chain \(X^{s_{\ell _1}, h_{\ell _2}}\) moves twice as frequently as \(X^{s_{\ell _1}, h_{\ell _2-1}}\), it needs twice as many random variables \(U^{s_{\ell _1}, h_{\ell _2}}\) as \(X^{s_{\ell _1}, h_{\ell _2-1}}\) needs \(U^{s_{\ell _1}, h_{\ell _2-1}}\). This can be interpreted as having to choose how to estimate the drift a twice as frequently (at each step of the chain).
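
In code, the coupling (4.8) amounts to advancing the fine chain by two steps of size \(h_{\ell _2}\) and the coarse chain by one step of size \(2h_{\ell _2}\), reusing the sum of the two Gaussian increments while each chain draws its own drift randomness. A sketch with an illustrative subsampled drift of the form \(b(x,U) = \frac{1}{s}\sum _{i} (\xi _{U_i} - x)\) (our choice, made only so that the code is self-contained) could look as follows.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 500, 2
data = rng.normal(size=(m, d))

def b(x, idx):
    # Subsampled drift estimator for the toy potential v(x, xi) = |x - xi|^2 / 2.
    return data[idx].mean(axis=0) - x

def coupled_pair_step(xf, xc, h, beta, s):
    """Two fine steps of size h and one coarse step of size 2h, cf. (4.8).

    The Gaussian increments are shared between levels (synchronous coupling);
    the subsamples for the drift are drawn independently on each level."""
    z1, z2 = rng.standard_normal(d), rng.standard_normal(d)
    # fine chain: two Euler steps of size h
    xf = xf + h * b(xf, rng.integers(0, m, size=s)) + beta * np.sqrt(h) * z1
    xf = xf + h * b(xf, rng.integers(0, m, size=s)) + beta * np.sqrt(h) * z2
    # coarse chain: one Euler step of size 2h driven by hat{Z} = (z1 + z2)/sqrt(2)
    z_hat = (z1 + z2) / np.sqrt(2.0)
    xc = xc + 2 * h * b(xc, rng.integers(0, m, size=s)) + beta * np.sqrt(2 * h) * z_hat
    return xf, xc

xf = xc = np.zeros(d)
for _ in range(1000):
    xf, xc = coupled_pair_step(xf, xc, h=0.01, beta=0.1, s=16)
print("fine:", xf, " coarse:", xc)
```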

The idea of the antithetic estimator (with respect to the discretisation parameter) involves replacing \(f(X^{s_{\ell _1}, h_{\ell _2-1}})\) in (4.6) with a mean of its two independent copies, i.e., with the quantity given by \(\frac{1}{2}\left( f(X^{s_{\ell _1}, h_{\ell _2-1}-}) + f(X^{s_{\ell _1}, h_{\ell _2-1}+}) \right) \), where the first copy \(X^{s_{\ell _1}, h_{\ell _2-1}-}\) uses \(U^{s_{\ell _1}, h_{\ell _2}}_k\) and the other copy \(X^{s_{\ell _1}, h_{\ell _2-1}+}\) uses \(U^{s_{\ell _1}, h_{\ell _2}}_{k+1}\) to estimate the drift, instead of drawing their own independent copies of \(U^{s_{\ell _1},h_{\ell _2-1}}_{k}\), i.e.,

$$\begin{aligned} \begin{aligned} X^{s_{\ell _1},h_{\ell _2-1}-}_{k+2}&= X^{s_{\ell _1},h_{\ell _2-1}-}_{k} + 2h_{\ell _2} b(X^{s_{\ell _1},h_{\ell _2-1}-}_{k}, U^{s_{\ell _1},h_{\ell _2}}_{k})\\&\quad + \beta \sqrt{2h_{\ell _2}}(Z_{k+2} + Z_{k+1}) \\ X^{s_{\ell _1},h_{\ell _2-1}+}_{k+2}&= X^{s_{\ell _1},h_{\ell _2-1}+}_{k} + 2h_{\ell _2} b(X^{s_{\ell _1},h_{\ell _2-1}+}_{k}, U^{s_{\ell _1},h_{\ell _2}}_{k+1}) \\&\quad + \beta \sqrt{2h_{\ell _2}}(Z_{k+2} + Z_{k+1}) . \end{aligned} \end{aligned}$$

Hence the term \(\left( f(X^{s_{\ell _1},h_{\ell _2}}_k) - f(X^{s_{\ell _1}, h_{\ell _2-1}}_k) \right) \) appearing in (4.6) would be replaced with the antithetic term \(\left( f(X^{s_{\ell _1},h_{\ell _2}}_k) - \frac{1}{2}\left( f(X^{s_{\ell _1}, h_{\ell _2-1}-}) + f(X^{s_{\ell _1}, h_{\ell _2-1}+}) \right) \right) \) and the same can be done for any fixed \(s_{\ell _1}\). Let us explain the intuition behind this approach on a simple example with \(f(x) = x\) and a state-independent drift a.

Example 4.2

We fix \(s_{\ell _1}\) and suppress the dependence on \(\ell _1\) in the notation, in order to focus only on MLMC via discretisation parameter. Let \(\xi = (\xi _1, \ldots , \xi _m)\) be a collection of m data points and let us consider a state-independent drift \(a = \frac{1}{m}\sum _{i=1}^{m} \xi _i\) and its unbiased estimator \(b(U^s) := \frac{1}{s} \sum _{i=1}^{s} \xi _{U^s_i}\) for \(s \le m\), where \(U^s = (U^s_1, \ldots , U^s_s)\) is such that \(U^s_j \sim {\text {Unif}}\{ 1, \ldots , m \}\) for all \(j \in \{1, \ldots , s \}\) (i.e., we sample with replacement s data points from the data set \(\xi \) of size m). Consider the standard MLMC estimator with the fine \(X^f\) and the coarse \(X^c\) schemes defined as

$$\begin{aligned} \begin{aligned} X^f_{k+1}&= X^f_{k} + h b(U^{s,f}_k) + \beta \sqrt{h} Z_{k+1} , \qquad \\ X^f_{k+2}&= X^f_{k+1} + h b(U^{s,f}_{k+1}) + \beta \sqrt{h} Z_{k+2} , \\ X^c_{k+2}&= X^c_{k} + 2h b(U^{s,c}_k) + \beta \sqrt{2h} \hat{Z}_{k+2} , \end{aligned} \end{aligned}$$

where \(\hat{Z}_{k+2} := (Z_{k+1} + Z_{k+2})/\sqrt{2}\) , \(\beta = 1/\sqrt{m}\), with \(U^{s,f}_{k}\), \(U^{s,f}_{k+1}\) and \(U^{s,c}_{k}\) being independent copies of \(U^s\) for any \(k \ge 0\). Note that \((X^f_k)_{k \in \mathbb {N}}\) corresponds to \((X^{s_{\ell _1},h_{\ell _2}}_k)_{k \in \mathbb {N}}\) in (4.8) whereas \((X^c_k)_{k \in 2\mathbb {N}}\) corresponds to \((X^{s_{\ell _1},h_{\ell _2-1}}_k)_{k \in 2\mathbb {N}}\) for some fixed \(\ell _1\), \(\ell _2\). Recall from the discussion in Sect. 2 that our goal is to find a sharp upper bound on the variance (or the second moment) of \(X^f_k - X^c_k\) for any \(k \ge 1\) (which corresponds to bounding the variance of the standard, non-antithetic MLMC estimator (4.7) for a Lipschitz function f, cf. the difference (4.6) taken only with respect to the time-discretisation parameter h, with fixed s). We have

$$\begin{aligned} \mathbb {E}\left| X^f_{k+2} - X^c_{k+2} \right| ^2&= \mathbb {E} \left| X^f_{k} - X^c_{k} \right| ^2 + 2 \mathbb {E} \left\langle X^f_{k} - X^c_{k}, h b(U^{s,f}_k) + h b(U^{s,f}_{k+1}) - 2h b(U^{s,c}_k) \right\rangle \\&\quad + \mathbb {E} \left| h b(U^{s,f}_k) + h b(U^{s,f}_{k+1}) - 2h b(U^{s,c}_k) \right| ^2 \\&= \mathbb {E} \left| X^f_{k} - X^c_{k} \right| ^2 + h^2 \mathbb {E} \left| b(U^{s,f}_k) + b(U^{s,f}_{k+1}) - 2 b(U^{s,c}_k) \right| ^2 , \end{aligned}$$

where in the second step we used conditioning and the fact that b is an unbiased estimator of a. Hence we can show that, if we choose \(X^f_0 = X^c_0\), then for all \(k \ge 1\) we have \(\mathbb {E}\left| X^f_{k} - X^c_{k} \right| ^2 \le C h\) for some \(C > 0\) and we get a variance contribution of order h. On the other hand, if we want to apply the antithetic approach as in (2.5), we can define

$$\begin{aligned} X^{c-}_{k+2}&= X^{c-}_{k} + 2h b(U^{s,f}_k) + \beta \sqrt{2h} \hat{Z}_{k+2} , \\ X^{c+}_{k+2}&= X^{c+}_{k} + 2h b(U^{s,f}_{k+1}) + \beta \sqrt{2h} \hat{Z}_{k+2} \end{aligned}$$

with \(\beta = 1/\sqrt{m}\), and, putting \(\bar{X}^c_{k} := \frac{1}{2} \left( X^{c-}_{k} + X^{c+}_{k} \right) \), we obtain

$$\begin{aligned} \bar{X}^c_{k+2} = \frac{1}{2} \left( X^{c-}_{k} + X^{c+}_{k} \right) + h \left( b(U^{s,f}_k) + b(U^{s,f}_{k+1}) \right) + \beta \sqrt{2h} \hat{Z}_{k+2} . \end{aligned}$$

Then we have \(\mathbb {E}\left| X^f_{k+2} - \bar{X}^c_{k+2} \right| ^2 = \mathbb {E}\left| X^f_{k} - \bar{X}^c_{k} \right| ^2\) and, choosing \(X^f_0 = X^{c-}_0 = X^{c+}_0\), the variance contribution vanishes altogether.
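
The cancellation in Example 4.2 is easy to verify numerically. The sketch below (one-dimensional, with synthetic data, a fixed time horizon and parameter values of our own choosing) estimates \(\mathbb {E}|X^f_k - X^c_k|^2\) and \(\mathbb {E}|X^f_k - \bar{X}^c_k|^2\) for several step sizes: the former decreases roughly linearly in h, while the latter vanishes up to floating-point error, as the calculation above predicts.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, s = 200, 1, 10
xi = rng.normal(size=(m, d))              # data points
beta = 1.0 / np.sqrt(m)

def b(idx):
    # State-independent subsampled drift b(U^s) = (1/s) * sum_i xi_{U_i}.
    return xi[idx].mean(axis=0)

def second_moments(h, T=1.0, n_paths=2000):
    n_pairs = int(round(T / (2 * h)))     # fixed time horizon T
    gap_std, gap_anti = [], []
    for _ in range(n_paths):
        xf = xc = xcm = xcp = np.zeros(d)
        for _ in range(n_pairs):
            uf1 = rng.integers(0, m, size=s)   # U^{s,f}_k
            uf2 = rng.integers(0, m, size=s)   # U^{s,f}_{k+1}
            uc = rng.integers(0, m, size=s)    # U^{s,c}_k (independent, coarse only)
            z1, z2 = rng.standard_normal(d), rng.standard_normal(d)
            xf = xf + h * b(uf1) + beta * np.sqrt(h) * z1
            xf = xf + h * b(uf2) + beta * np.sqrt(h) * z2
            zh = (z1 + z2) / np.sqrt(2.0)
            xc = xc + 2 * h * b(uc) + beta * np.sqrt(2 * h) * zh       # standard coarse
            xcm = xcm + 2 * h * b(uf1) + beta * np.sqrt(2 * h) * zh    # antithetic coarse -
            xcp = xcp + 2 * h * b(uf2) + beta * np.sqrt(2 * h) * zh    # antithetic coarse +
        gap_std.append(np.sum((xf - xc) ** 2))
        gap_anti.append(np.sum((xf - 0.5 * (xcm + xcp)) ** 2))
    return np.mean(gap_std), np.mean(gap_anti)

for h in (0.02, 0.01, 0.005):
    print(h, second_moments(h))
```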

In the general case, \(b(x,U^s)\) is a nonlinear function of the data and also depends on the state x. Therefore one should not expect the variance contribution of the drift estimator to vanish completely. Nonetheless, a careful analysis will show that applying the antithetic difference operators on all levels of our MASGA estimator yields the desired upper bound on the variance, as described in Sect. 2.

Having explained the motivation behind the antithetic approach to MLMC, let us now focus on the antithetic estimator with respect to the drift estimation parameter s. To this end, let us now fix \(h_{\ell _2}\) and observe that for the chain \(X^{s_{\ell _1}, h_{\ell _2}}\), the value of the drift estimation parameter \(s_{\ell _1} = 2s_{\ell _1-1}\) is twice the value for the chain \(X^{s_{\ell _1-1}, h_{\ell _2}}\). In the context of the subsampling drift as in Assumptions 2.1, this corresponds to the drift estimator in \(X^{s_{\ell _1}, h_{\ell _2}}\) using twice as many samples as the drift estimator in \(X^{s_{\ell _1-1}, h_{\ell _2}}\). Hence, instead of using independent samples for \(X^{s_{\ell _1-1}, h_{\ell _2}}\), we can consider two independent copies of \(X^{s_{\ell _1-1}, h_{\ell _2}}\), the first of which uses the first half \(U^{s_{\ell _1},h_{\ell _2},1}_{k}\) of samples of \(X^{s_{\ell _1}, h_{\ell _2}}\) and the other uses the second half \(U^{s_{\ell _1},h_{\ell _2},2}_{k}\), namely,

$$\begin{aligned} \begin{aligned} X^{s_{\ell _1-1}-, h_{\ell _2}}_{k+1}&:= X^{s_{\ell _1-1}-, h_{\ell _2}}_k + h_{\ell _2} b(X^{s_{\ell _1-1}-, h_{\ell _2}}_k, U^{s_{\ell _1},h_{\ell _2},1}_{k})\\&\quad + \beta \sqrt{h_{\ell _2}} Z_{k+1}\\ X^{s_{\ell _1-1}+, h_{\ell _2}}_{k+1}&:= X^{s_{\ell _1-1}+, h_{\ell _2}}_k + h_{\ell _2} b(X^{s_{\ell _1-1}+, h_{\ell _2}}_k, U^{s_{\ell _1},h_{\ell _2},2}_{k})\\&\quad + \beta \sqrt{h_{\ell _2}} Z_{k+1} . \end{aligned} \end{aligned}$$
(4.9)

Hence, using the antithetic approach, we could replace \(\left( f(X^{s_{\ell _1}, h_{\ell _2}}_k) - f(X^{s_{\ell _1-1}, h_{\ell _2}}_k) \right) \) in (4.6) with the difference \(\left( f(X^{s_{\ell _1}, h_{\ell _2}}_k) - \frac{1}{2}\left( f(X^{s_{\ell _1-1}-, h_{\ell _2}}_k) + f(X^{s_{\ell _1-1}+, h_{\ell _2}}_k) \right) \right) \).
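
In code, the coupling (4.9) simply splits the index set used by the fine-in-s chain into two halves; a sketch with the same illustrative subsampled drift as before (again our own toy choice):

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 500, 2
data = rng.normal(size=(m, d))

def b(x, idx):
    # Subsampled drift for the toy potential v(x, xi) = |x - xi|^2 / 2.
    return data[idx].mean(axis=0) - x

def antithetic_subsampling_step(xf, xcm, xcp, h, beta, s_fine):
    """One step of the fine-in-s chain (subsample of size s_fine) and of the
    antithetic pair (4.9), each member of which reuses one half of that subsample."""
    idx = rng.integers(0, m, size=s_fine)                 # U^{s_{l1}, h_{l2}}_k
    idx1, idx2 = idx[: s_fine // 2], idx[s_fine // 2:]    # the two halves
    z = rng.standard_normal(d)                            # shared Gaussian increment
    xf = xf + h * b(xf, idx) + beta * np.sqrt(h) * z
    xcm = xcm + h * b(xcm, idx1) + beta * np.sqrt(h) * z
    xcp = xcp + h * b(xcp, idx2) + beta * np.sqrt(h) * z
    return xf, xcm, xcp

xf = xcm = xcp = np.zeros(d)
for _ in range(2000):
    xf, xcm, xcp = antithetic_subsampling_step(xf, xcm, xcp, h=0.01, beta=0.1, s_fine=32)
print("fine:", xf, " antithetic average:", 0.5 * (xcm + xcp))
```

Note that for such sample-average estimators the full-batch drift equals the average of the two half-batch drifts, which is exactly the consistency property imposed in Assumption 4.10 below.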

Combining the ideas of antithetic estimators with respect to both parameters s and h, we arrive at a nested antithetic difference \(\Delta {\text {Ant}} \Phi ^{s_{\ell _1}, h_{\ell _2}}_{f,k}\), defined for any \(\ell _1\), \(\ell _2 \in \{ 1, \ldots , L \}\) and any \(k \ge 1\) as

$$\begin{aligned} \Delta {\text {Ant}} \Phi ^{s_{\ell _1}, h_{\ell _2}}_{f,k}&:= \left[ f(X^{s_{\ell _1}, h_{\ell _2}}_k) - \frac{1}{2} \left( f(X^{s_{\ell _1}, h_{\ell _2-1}-}_k) + f(X^{s_{\ell _1}, h_{\ell _2-1}+}_k) \right) \right] \\&\quad - \frac{1}{2}\Bigg [ \left( f(X^{s_{\ell _1-1}-,h_{\ell _2}}_k) - \frac{1}{2}\left( f(X^{s_{\ell _1-1}-,h_{\ell _2-1}-}_k) + f(X^{s_{\ell _1-1}-,h_{\ell _2-1}+}_k) \right) \right) \\&\quad + \left( f(X^{s_{\ell _1-1}+,h_{\ell _2}}_k) - \frac{1}{2}\left( f(X^{s_{\ell _1-1}+,h_{\ell _2-1}-}_k) + f(X^{s_{\ell _1-1}+,h_{\ell _2-1}+}_k) \right) \right) \Bigg ] , \end{aligned}$$
(4.10)

with the same convention as in (4.6) for the case of \(\ell _1 = 0\) or \(\ell _2 = 0\). We can now plug this difference into the definition of a Multi-index Monte Carlo estimator (4.7) to obtain

$$\begin{aligned} {\text {Ant}} \mathcal {A} := \sum _{\ell _1=0}^{L} \sum _{\ell _2=0}^{L} \frac{1}{N_{\ell _1,\ell _2}} \sum _{j=1}^{N_{\ell _1,\ell _2}} \Delta {\text {Ant}} \Phi ^{s_{\ell _1}, h_{\ell _2}, (j)}_{f,k} . \end{aligned}$$
(4.11)

Note that this corresponds to the Antithetic MIMC estimator introduced in (2.5), based on the chain (2.6), with \(\ell _1\) corresponding to the number of samples s and \(\ell _2\) to the discretisation parameter h, but with a more general drift a and its estimator b.

In order to formulate our result for the general setting presented in this section, we need to specify the following assumptions.

Assumptions 4.3

(Lipschitz condition and global contractivity of the drift) The drift function \(a : \mathbb {R}^d \rightarrow \mathbb {R}^d\) satisfies the following conditions:

  i) Lipschitz condition: there is a constant \(L > 0\) such that

     $$\begin{aligned} |a(x) - a(y)| \le L|x - y| \qquad \text { for all } x, y \in \mathbb {R}^d . \end{aligned}$$
     (4.12)

  ii) Global contractivity condition: there exists a constant \(K > 0\) such that

     $$\begin{aligned} \langle x - y , a(x) - a(y) \rangle \le -K|x - y|^2 \qquad \text { for all } x , y \in \mathbb {R}^d . \end{aligned}$$
     (4.13)

  iii) Smoothness: \(a \in C^2_b(\mathbb {R}^d; \mathbb {R}^d)\) (where \(C^2_b(\mathbb {R}^d; \mathbb {R}^d)\) is defined as in Assumptions 2.1). In particular, there exist constants \(C_{a^{(1)}}\), \(C_{a^{(2)}} > 0\) such that

     $$\begin{aligned} |D^{\alpha }a(x)| \le C_{a^{(|\alpha |)}} \end{aligned}$$
     (4.14)

     for all \(x \in \mathbb {R}^d\) and for all multiindices \(\alpha \) with \(|\alpha | = 1\), 2.

We remark that condition (4.14) could be easily removed by approximating a non-smooth drift a with suitably chosen smooth functions. This, however, would create additional technicalities in the proof and hence we decided to work with (4.14).

We now impose the following assumptions on the estimator b of the drift a.

Assumption 4.4

(Lipschitz condition of the estimator) There is a constant \(\bar{L} > 0\) such that for all x, \(y \in \mathbb {R}^d\) and all random variables \(U^s\) such that \(\mathbb {E}[b(x,U^s)] = a(x)\) for all \(x \in \mathbb {R}^d\), we have

$$\begin{aligned} \mathbb {E}|b(x,U^s) - b(y,U^s)| \le \bar{L}|x - y| . \end{aligned}$$
(4.15)

Assumption 4.5

(Variance of the estimator) There exists a constant \(\sigma > 0\), with \(\sigma ^2\) of order \(\mathcal {O}(s^{-1})\), such that for any \(x \in \mathbb {R}^d\) and any random variable \(U^s\) such that \(\mathbb {E}[b(x,U^s)] = a(x)\) for all \(x \in \mathbb {R}^d\), we have

$$\begin{aligned} \mathbb {E} \left| b(x, U^s) - a(x) \right| ^2 \le \sigma ^2 (1+|x|^2) . \end{aligned}$$
(4.16)

Assumption 4.6

(Fourth centered moment of the estimator) There exists a constant \(\sigma ^{(4)} \ge 0\) of order \(\mathcal {O}(s^{-2})\) such that for any \(x \in \mathbb {R}^d\) and for any random variable \(U^s\) such that \(\mathbb {E}[b(x,U^s)] = a(x)\) for all \(x \in \mathbb {R}^d\), we have

$$\begin{aligned} \mathbb {E} \left| b(x, U^s) - a(x) \right| ^4 \le \sigma ^{(4)} (1+|x|^4) . \end{aligned}$$
(4.17)

Note that obviously Assumption 4.6 implies Assumption 4.5. However, we formulate these conditions separately in order to keep track of the constants in the proofs. Moreover, with the same constant \(\sigma ^{(4)}\) as in (4.17), we impose

Assumption 4.7

The estimator b(xU) as a function of x is twice continuously differentiable for any U and, for any \(x \in \mathbb {R}^d\), we have \(\mathbb {E} \left| \nabla b(x,U^s) - \nabla a(x) \right| ^4 \le \sigma ^{(4)}(1+|x|^4)\).

Note that \(\nabla a(x)\), \(\nabla b(x,U^s) \in \mathbb {R}^{d \times d}\) and we use the matrix norm \(|\nabla a(x)|^2 := \sum _{i,j=1}^{d} |\partial _i a_j(x)|^2\), where \(a(x) = (a_1(x), \ldots , a_d(x))\).

Assumptions 4.8

Partial derivatives of the estimator of the drift are estimators of the corresponding partial derivatives of the drift. More precisely, for any multi-index \(\alpha \) with \(|\alpha | \le 2\) and for any random variable \(U^s\) such that \(\mathbb {E}[b(x,U^s)] = a(x)\) for all \(x \in \mathbb {R}^d\), we have \(\mathbb {E}[D^{\alpha } b(x,U^s)] = D^{\alpha } a(x)\) for any \(x \in \mathbb {R}^d\).

Assumption 4.9

(Growth of the drift) There exists a constant \(L_0^{(4)} > 0\) such that for all \(x \in \mathbb {R}^d\) we have

$$\begin{aligned} |a(x)|^4 \le L_0^{(4)}\left( 1 + |x|^4\right) . \end{aligned}$$
(4.18)

Finally, we have the following condition that specifies the behaviour of the drift estimator b with respect to the random variables \(U^{s_{\ell _1},h_{\ell _2},1}_{k}\) and \(U^{s_{\ell _1},h_{\ell _2},2}_{k}\) introduced in (4.9).

Assumption 4.10

For any \(x \in \mathbb {R}^d\) we have \(b(x, U^{s_{\ell _1},h_{\ell _2}}_{k}) = \frac{1}{2}b(x, U^{s_{\ell _1},h_{\ell _2},1}_{k}) + \frac{1}{2}b(x, U^{s_{\ell _1},h_{\ell _2},2}_{k})\).

Even though this set of conditions is long, the assumptions are in fact rather mild and it is an easy exercise to verify that when

$$\begin{aligned} a(x) := \frac{1}{m}\sum _{i=1}^{m} \nabla _x v(x,\xi _i) \quad \text { and } \quad b(x,U^s) := \frac{1}{s} \sum _{i=1}^{s} \nabla _x v(x, \xi _{U^s_i}) , \end{aligned}$$

where \(U^s_i \sim {\text {Unif}}(\{ 1, \ldots , m \})\) for \(i \in \{ 1, \ldots , s \}\) are i.i.d. random variables, whereas \(v : \mathbb {R}^d \times \mathbb {R}^k \rightarrow \mathbb {R}\) is the function satisfying Assumptions 2.1, then a and b satisfy all of Assumptions 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9 and 4.10. The only conditions that actually require some effort to check are Assumptions 4.5 and 4.6; however, they can be verified by extending the argument from Example 2.15 in Majka et al. (2020), where a similar setting was considered. As it turns out, these conditions hold also for the case of subsampling without replacement. All the details are provided in Appendix 1.
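
To indicate the flavour of that argument for Assumption 4.5 in the case of subsampling with replacement (a sketch only; we assume, as Assumptions 2.1 should guarantee, that \(\nabla _x v(\cdot ,\xi )\) grows at most linearly, uniformly in \(\xi \)), the standard variance-of-a-sample-mean computation gives

$$\begin{aligned} \mathbb {E}\left| b(x,U^s) - a(x) \right| ^2 = \frac{1}{s^2} \sum _{i=1}^{s} \mathbb {E}\left| \nabla _x v(x,\xi _{U^s_i}) - a(x) \right| ^2 = \frac{1}{s} \cdot \frac{1}{m} \sum _{j=1}^{m} \left| \nabla _x v(x,\xi _j) - a(x) \right| ^2 \le \frac{C}{s}\left( 1 + |x|^2 \right) , \end{aligned}$$

where the first equality uses the mutual independence and zero mean of the summands, so that \(\sigma ^2\) in (4.16) can indeed be taken of order \(\mathcal {O}(s^{-1})\). The fourth-moment bound of Assumption 4.6 follows in the same spirit from the Marcinkiewicz–Zygmund inequality, which yields a constant of order \(\mathcal {O}(s^{-2})\).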

We have the following result.

Theorem 4.11

Under Assumptions 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9 and 4.10 on the drift a and its estimator b, the MASGA estimator (4.11) achieves a mean square error smaller than \(\varepsilon > 0\) at a computational cost of order \(\varepsilon ^{-2}\). Here, at each level \((\ell _1,\ell _2) \in \{ 0, \ldots , L \}^2\), the number of paths \(N_{\ell _1,\ell _2}\) is given by \(\varepsilon ^{-2}\left( \sqrt{\frac{\mathbb {V}[(f, {\varvec{\Delta }}^{A} \pi ^{(\ell _1,\ell _2)})]}{C_{(\ell _1,\ell _2)}}} \sum _{{\varvec{\ell }} \in [L]^2} \sqrt{\mathbb {V}[(f, {\varvec{\Delta }}^{A} \pi ^{{\varvec{\ell }}})] C_{{\varvec{\ell }}}} \right) \), where \([L]^2 := \{0, \ldots , L \}^2\), \(C_{{\varvec{\ell }}}\) is defined via (2.8) and \((f, {\varvec{\Delta }}^{A} \pi ^{(\ell _1,\ell _2)}) := \Delta {\text {Ant}} \Phi ^{s_{\ell _1},h_{\ell _2}}_{f,k}\) is given by (4.10).
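
The per-level sample sizes in Theorem 4.11 can be computed directly once (pilot) estimates of the per-level variances and costs are available. The sketch below is only an illustration of this formula: the pilot values are synthetic and simply mimic the rates \(\mathbb {V} \approx h_{\ell _2}^2/s_{\ell _1}^2\) and \(C_{{\varvec{\ell }}} \approx s_{\ell _1}/h_{\ell _2}\) used later in the proof.

```python
import numpy as np

def optimal_allocation(V, C, eps):
    """Sample sizes from Theorem 4.11:
    N_{l1,l2} = eps^{-2} * sqrt(V_{l1,l2} / C_{l1,l2}) * sum_l sqrt(V_l * C_l),
    where V and C hold the per-level variances and costs."""
    total = np.sum(np.sqrt(V * C))
    return np.ceil(eps ** (-2) * np.sqrt(V / C) * total).astype(int)

# Synthetic pilot values: V ~ 2^{-2(l1+l2)},  C ~ 2^{l1+l2}.
L = 4
l1, l2 = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
V = 2.0 ** (-2.0 * (l1 + l2))
C = 2.0 ** (l1 + l2)
print(optimal_allocation(V, C, eps=0.01))
```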

As we explained in Sect. 2, the proof of Theorem 2.2 (and its generalisation Theorem 4.11) relies on the MIMC complexity analysis from (Haji-Ali et al. 2016) (see also Giles 2015).

Theorem 4.12

Fix \(\varepsilon \in (0, e^{-1})\). Let \(({\varvec{\alpha }}, {\varvec{\beta }}, {\varvec{\gamma }})\in {\mathbb {R}}^{3r}\) be a triplet of vectors in \(\mathbb {R}^r\) such that for all \(k \in \{ 1, \ldots , r \}\) we have \(\alpha _k \ge \frac{1}{2} \beta _{k}\). Assume that for each \({\varvec{\ell }} \in \mathbb {N}^r\)

$$\begin{aligned} i)\,\, | {\mathbb {E}}[(f, {\varvec{\Delta }}^A \pi ^{{\varvec{\ell }}})]| \lesssim 2^{- \langle {\varvec{\alpha }} , {\varvec{\ell }} \rangle }, \quad ii)\,\, \mathbb {V}[(f, {\varvec{\Delta }}^A \pi ^{{\varvec{\ell }}})] \lesssim 2^{- \langle {\varvec{\beta }}, {\varvec{\ell }} \rangle }, \quad iii)\,\, C_{{\varvec{\ell }}} \lesssim 2^{\langle {\varvec{\gamma }} , {\varvec{\ell }} \rangle } . \end{aligned}$$

If \(\max _{k\in [1,\ldots ,r]}\frac{(\gamma _k-\beta _k)}{\alpha _k} < 0\), then there exists a set \({\mathcal {L}} \subset \mathbb {N}^r\) and a sequence \((N_{{\varvec{\ell }}})_{{\varvec{\ell }}\in {\mathcal {L}}}\) such that the MLMC estimator \(\mathcal {A}^{A, \mathcal {L}}(f)\) defined in (2.5) satisfies

$$\begin{aligned} {\mathbb {E}}[ ((f,\pi ) - \mathcal {A}^{A, \mathcal {L}}(f))^2 ] < \varepsilon ^2, \end{aligned}$$

with the computational cost \(\varepsilon ^{-2}\).

The key challenge in constructing and analysing MIMC estimators is to establish conditions i)–iii) in Theorem 4.12, i.e., to show that the leading error bounds for the bias, the variance and the cost can be expressed in product form. In fact, there are very few results in the literature that present an analysis giving i)–iii), with the exception of (Giles 2015, Sect. 9) and Giles and Haji-Ali (2019). The bulk of the analysis in this paper is devoted to establishing ii). We remark that the optimal choice of \({\mathcal {L}}=[L]^2\) is dictated by the relationship between \(({\varvec{\alpha }}, {\varvec{\beta }}, {\varvec{\gamma }})\), see (Haji-Ali et al. 2016).

The following lemma will be crucial for the proof of Theorem 4.11.

Lemma 4.13

Let Assumptions 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9 and 4.10 hold. Then there is a constant \(h_0 > 0\) and a constant \(C > 0\) (independent of s and h) such that for any Lipschitz function \(f: \mathbb {R}^d \rightarrow \mathbb {R}\), for any \(s \ge 1\), \(h \in (0, h_0)\) and for any \(k \ge 1\),

$$\begin{aligned} \mathbb {E}|\Delta {\text {Ant}} \Phi ^{s, h}_{f,k}|^2 \le C h^2/s^2 . \end{aligned}$$

As we already indicated in Sect. 2, once we have an upper bound on the second moment (and thus on the variance) of \(\Delta {\text {Ant}} \Phi ^{s, h}_{f,k}\) such as in Lemma 4.13, the proof of Theorem 4.11 becomes rather straightforward.

Proof of Theorem 4.11

Note that we have \(\mathbb {V}[(f, {\varvec{\Delta }}^A \pi ^{{\varvec{\ell }}})] \le \mathbb {E} [(f, {\varvec{\Delta }}^A \pi ^{{\varvec{\ell }}})^2] \lesssim h_{\ell _2}^2 / s_{\ell _1}^2\) due to Lemma 4.13. Moreover, the number of time-steps and the number of subsamples are doubled from each level of MIMC to the next (i.e., we have \(s_{\ell _1} = 2^{\ell _1}s_0\) and \(h_{\ell _2} = 2^{-\ell _2}h_0\) for all \(\ell _1\), \(\ell _2 \in \{ 0, \ldots , L \}\)) and hence \({\varvec{\gamma }} = (1,1)\) and \({\varvec{\beta }}=(2,2)\) in the assumptions of Theorem 4.12 (recall that \(C_{{\varvec{\ell }}} \lesssim s_{\ell _1} h_{\ell _2}^{-1}\)). Finally, we have

$$\begin{aligned} |\mathbb {E} (f, {\varvec{\Delta }}^A \pi ^{{\varvec{\ell }}})| \le \left( \mathbb {E} |(f, {\varvec{\Delta }}^A \pi ^{{\varvec{\ell }}})|^2 \right) ^{1/2} \lesssim h_{\ell _2} / s_{\ell _1} , \end{aligned}$$

which implies that \({\varvec{\alpha }} = (1,1)\). Hence \(\max _{k \in \{1,2\}} (\gamma _k - \beta _k)/\alpha _k = (1-2)/1 = -1 < 0\), so the assumptions of Theorem 4.12 are satisfied and the overall complexity of MIMC is indeed \(\varepsilon ^{-2}\). \(\square \)

Proof of Theorem 2.2

Under Assumptions 2.1, the function \(a(x) := - \nabla v_0(x) - \frac{1}{m}\sum _{i=1}^{m} \nabla _x v(x,\xi _i)\) and its estimator \(b(x,U^s) := -\nabla v_0(x) - \frac{1}{s}\sum _{i=1}^{s} \nabla _x v(x,\xi _{U^s_i})\), where \(U^s_i\) are mutually independent random variables, uniformly distributed on \(\{ 1, \ldots , m \}\), satisfy all of Assumptions 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9 and 4.10. Hence we can just apply Theorem 4.11 to conclude. \(\square \)

5 Analysis of AMLMC

The estimator we introduced in (2.5), see also (4.11), can be interpreted as built from two building blocks: the antithetic MLMC estimator with respect to the discretisation parameter, which corresponds to taking \(r=1\) and \({{\varvec{\ell }}} = \ell _2\) in (2.5), and the antithetic MLMC estimator with respect to subsampling, which corresponds to taking \(r=1\) and \({{\varvec{\ell }}} = \ell _1\) in (2.5). Let us begin our analysis by focusing on the former.

5.1 AMLMC via discretisation

We will analyse one step of the MLMC algorithm, for some fixed level \({\varvec{\ell }}\). To this end, let us first introduce the following fine \((X^f_k)_{k \in \mathbb {N}}\) and coarse \((X^c_k)_{k \in 2\mathbb {N}}\) chains

$$\begin{aligned} \begin{aligned} X^f_{k+1}&= X^f_{k} + h b(X^f_{k}, U^f_{k}) + \beta \sqrt{h}Z_{k+1} , \, \,\\ X^f_{k+2}&= X^f_{k+1} + h b(X^f_{k+1}, U^f_{k+1}) + \beta \sqrt{h} Z_{k+2} \\ X^c_{k+2}&= X^c_{k} + 2h b(X^c_{k}, U^c_{k}) + \beta \sqrt{2h} \hat{Z}_{k+2} , \end{aligned} \end{aligned}$$
(5.1)

where \(h > 0\) is fixed, \((U^f_{k})_{k=0}^{\infty }\) and \((U^c_{k})_{k=0}^{\infty }\) are mutually independent random variables such that for \(U \in \{ U^f_{k}, U^c_{k} \}\) we have \(\mathbb {E}[b(x,U)] = a(x)\) for all \(x \in \mathbb {R}^d\) and all \(k \ge 0\) and \((Z_k)_{k=1}^{\infty }\) are i.i.d. random variables with \(Z_k \sim \mathcal {N}(0,I)\). We also have \(\hat{Z}_{k+2} := \frac{1}{\sqrt{2}}\left( Z_{k+1} + Z_{k+2} \right) \).

In order to analyse the antithetic estimator, we also need to introduce two auxiliary chains

$$\begin{aligned} \begin{aligned} X^{c+}_{k+2}&= X^{c+}_{k} + 2h b(X^{c+}_{k}, U^f_{k}) + \beta \sqrt{2h} \hat{Z}_{k+2} \\ X^{c-}_{k+2}&= X^{c-}_{k} + 2h b(X^{c-}_{k}, U^f_{k+1}) + \beta \sqrt{2h} \hat{Z}_{k+2} . \end{aligned} \end{aligned}$$
(5.2)

Furthermore, we denote \(\bar{X}^c_{k} = \frac{1}{2} \left( X^{c+}_{k} + X^{c-}_{k} \right) \).

Before we proceed, let us list a few simple consequences of Assumptions 4.3. We have the following bounds:

$$\begin{aligned} |a(x)| \le L_0(1 + |x|) \text { for all } x \in \mathbb {R}^d , \text { where } L_0 := \max (L, |a(0)|) . \end{aligned}$$
(5.3)

Let \(M_2 := 4 L |a(0)|^2 / K^2 + 2 |a(0)|^2 / K\) and \(M_1 : = K / 2\). Then we have

$$\begin{aligned} \langle x, a(x) \rangle \le M_2 - M_1|x|^2 \text { for all } x \in \mathbb {R}^d . \end{aligned}$$
(5.4)

Finally, for all random variables U satisfying (4.3), we have

$$\begin{aligned} \mathbb {E} |b(x,U)|^2 \le \bar{L}_0(1 + |x|^2) \text { for all } x \in \mathbb {R}^d , \end{aligned}$$
(5.5)

where \(\bar{L}_0 := \sigma ^2 h^{\alpha } + 2 \max (L^2 , |a(0)|^2)\). Note that (5.3) is an immediate consequence of (4.12), (5.4) follows easily from (4.12) and (4.13) (cf. the proof of Lemma 2.11 in Majka et al. (2020)), whereas (5.5) is implied by (4.12) and (4.16), cf. (2.40) in Majka et al. (2020). Throughout our proofs, we will also use uniform bounds on the second and the fourth moments of Euler schemes with time steps h and 2h, i.e., we have

$$\begin{aligned} \mathbb {E}|X^f_{k}|^2 \le C_{IEul} , \quad \mathbb {E}|X^{c-}_{k}|^2 \le C^{(2h)}_{IEul} , \quad \mathbb {E}|X^f_{k}|^4 \le C^{(4)}_{IEul} \quad \text { and } \quad \mathbb {E}|X^{c-}_{k}|^4 \le C^{(4),(2h)}_{IEul} , \end{aligned}$$

where the exact formulas for the constants \(C_{IEul}\), \(C^{(2h)}_{IEul}\), \(C^{(4)}_{IEul}\) and \(C^{(4),(2h)}_{IEul}\) can be deduced from Lemma 7.1 in the Appendix (for \(C_{IEul}\), \(C^{(2h)}_{IEul}\) see also Lemma 2.17 in Majka et al. (2020)).

We now fix \(g \in C_b^2(\mathbb {R}^d; \mathbb {R})\). Denote by \(C_{g^{(1)}}\), \(C_{g^{(2)}}\) positive constants such that \(|D^{\alpha }g(x)| \le C_{g^{(|\alpha |)}}\) for all \(x \in \mathbb {R}^d\) and all multiindices \(\alpha \) with \(|\alpha | = 1\), 2.

We begin by presenting the crucial idea of our proof. We will use Taylor’s formula to write

$$\begin{aligned} \begin{aligned}&g (\bar{X}^c_{k}) - g(X^{c-}_{k}) = - \Big [ \sum _{|\alpha | = 1} D^{\alpha } g(\bar{X}^c_{k}) \left( \bar{X}^c_{k} - X^{c-}_{k} \right) ^{\alpha } \\&\quad + \sum _{|\alpha | = 2} \int _0^1 (1-t) D^{\alpha } g\left( \bar{X}^c_{k} + t\left( \bar{X}^c_{k} - X^{c-}_{k}\right) \right) dt \left( \bar{X}^c_{k} - X^{c-}_{k} \right) ^{\alpha } \Big ] . \end{aligned} \end{aligned}$$

We also express \(g(\bar{X}^c_{k}) - g(X^{c+}_{k})\) in an analogous way. Note that

$$\begin{aligned} \bar{X}^c_{k} - X^{c-}_{k} = \frac{1}{2} X^{c+}_{k} - \frac{1}{2} X^{c-}_{k} = - (\bar{X}^c_{k} - X^{c+}_{k}) \end{aligned}$$
(5.6)

and hence we have

$$\begin{aligned} \begin{aligned}&g(X^f_{k}) - \frac{1}{2}\left( g(X^{c-}_{k}) + g(X^{c+}_{k}) \right) = g(X^f_{k}) - g (\bar{X}^c_{k}) \\&\qquad + \frac{1}{2} \left( g (\bar{X}^c_{k}) - g(X^{c-}_{k}) + g (\bar{X}^c_{k}) - g(X^{c+}_{k}) \right) \\&\quad = g(X^f_{k}) - g (\bar{X}^c_{k}) - \frac{1}{2} \sum _{|\alpha | = 2} \int _0^1 (1-t) D^{\alpha }\\&\qquad g\left( \bar{X}^c_{k} + t\left( \bar{X}^c_{k} - X^{c-}_{k}\right) \right) dt \left( \bar{X}^c_{k} - X^{c-}_{k} \right) ^{\alpha } \\&\qquad - \frac{1}{2} \sum _{|\alpha | = 2} \int _0^1 (1-t) D^{\alpha }\\&\qquad g\left( \bar{X}^c_{k} + t\left( \bar{X}^c_{k} - X^{c+}_{k}\right) \right) dt \left( \bar{X}^c_{k} - X^{c+}_{k} \right) ^{\alpha } , \end{aligned} \end{aligned}$$

i.e., the first order terms in the Taylor expansions cancel out. Thus, using the inequalities \(\sum _{|\alpha | = 2} D^{\alpha } g(x) \left( \bar{X}^c_{k} - X^{c+}_{k} \right) ^{\alpha } \le \Vert \nabla ^2 g \Vert _{{\text {op}}} |\bar{X}^c_{k} - X^{c+}_{k}|^2\) and \(\Vert \nabla ^2 g \Vert _{{\text {op}}} \le C_{g^{(2)}}\) for all \(x \in \mathbb {R}^d\), where \(\Vert \nabla ^2 g \Vert _{{\text {op}}}\) is the operator norm of the Hessian matrix, we see that

$$\begin{aligned} \begin{aligned}&\mathbb {E}\left| g(X^f_{k}) - \frac{1}{2}\left( g(X^{c-}_{k}) + g(X^{c+}_{k}) \right) \right| ^2 \le 2 \mathbb {E}\left| g(X^f_{k}) - g (\bar{X}^c_{k}) \right| ^2 \\&\qquad + \frac{1}{2} C_{g^{(2)}} \left( \mathbb {E}\left| \bar{X}^c_{k} - X^{c-}_{k} \right| ^4 + \mathbb {E}\left| \bar{X}^c_{k} - X^{c+}_{k} \right| ^4 \right) \\&\quad = 2 \mathbb {E}\left| g(X^f_{k}) - g (\bar{X}^c_{k}) \right| ^2 + \frac{1}{16} C_{g^{(2)}} \mathbb {E}\left| X^{c+}_{k} - X^{c-}_{k} \right| ^4. \end{aligned} \end{aligned}$$
(5.7)

Note that we purposefully introduced the term \(\mathbb {E}\left| X^{c+}_{k} - X^{c-}_{k} \right| ^4\), since it will provide us with an improved rate in h. Indeed, we have the following result.

Lemma 5.1

Let Assumptions 4.3, 4.5, 4.6 and 4.9 hold. If \(X^{c+}_0 = X^{c-}_0\), then for all \(k \ge 1\) and for all \(h \in (0,h_0)\) we have

$$\begin{aligned} \mathbb {E} |X^{c+}_{k} - X^{c-}_{k}|^4 \le \frac{\bar{C}_1}{\bar{c}_1} h^2 , \end{aligned}$$
(5.8)

where \(\bar{C}_1 := 72 \frac{1}{\varepsilon } h_0^{2\alpha } \left( 1 + 2 C_{IEul}^{(2h)} + C_{IEul}^{(4),(2h)} \right) \sigma ^4 + 32 \left( 432\sqrt{2} + 648 + 27 h_0 \right) \sigma ^{(4)}(1 + C_{IEul}^{(4),(2h)})\) and \(\bar{c}_1\), \(h_0\), \(\varepsilon > 0\) are chosen such that

$$\begin{aligned} - 8K + 72 \varepsilon + 72 h_0 L^2 + 432 h_0^3 L^4 + 32 h_0^2 \left( \left( (1 + 432L^4)216L^4 \right) ^{1/2} + \frac{1 + 1296L^4}{4} \right) \le - \bar{c}_1 . \end{aligned}$$

Note that the constant \(\bar{C}_1\) above is of order \(\mathcal {O}(s^{-2})\) due to Assumptions 4.5 and 4.6. Hence the bound in (5.8) is in fact of order \(\mathcal {O}(h^2s^{-2})\), which is exactly what is needed in Lemma 4.13.

We remark that, in principle, it would be now possible to bound also the first term on the right hand side of (5.7) and hence to obtain a bound on the variance of the antithetic MLMC estimator with respect to discretisation, corresponding to taking \(r=1\) and \({{\varvec{\ell }}} = \ell _2\) in (2.5). However, such an estimator does not perform on par with the MASGA estimator (even though it is better than the standard MLMC) and hence we skip its analysis. In this subsection, we present only the derivation of the inequality (5.7) and we formulate the lemma about the bound on the term \(\mathbb {E} |X^{c+}_{k} - X^{c-}_{k}|^4\), since they will be needed in our analysis of MASGA.

We remark that the proof of Lemma 5.1 is essentially identical to the proof of Lemma 5.2 below (with different constants) and is therefore skipped. The latter proof can be found in the Appendix.

5.2 AMLMC via subsampling

As the second building block of our MASGA estimator, we discuss a multi-level algorithm for subsampling that involves taking different drift estimators (different numbers of samples) at different levels, but with constant discretisation parameter across the levels.

We fix \(s_0 \in \mathbb {N}_+\), define \(s_{\ell } := 2^{\ell } s_0\) for \(\ell \in \{ 0, \ldots , L \}\), and consider the chains

$$\begin{aligned} X^{f}_{k+1}&= X^{f}_{k} + h b(X^{f}_{k},U^{f}_{k}) + \beta \sqrt{h} Z_{k+1} , \\ X^{c}_{k+1}&= X^{c}_{k} + h b(X^{c}_{k},U^{c}_{k}) + \beta \sqrt{h} Z_{k+1} , \end{aligned}$$
(5.9)

where \(U^{f}_{k}\) is from a higher level (in parameter s) than \(U^c_k\), i.e., we have \(b(x,U^{f}_{k}) = \frac{1}{2}b(x,U^{f,1}_{k}) + \frac{1}{2}b(x,U^{f,2}_{k})\) and \(U^{f,1}_{k}\), \(U^{f,2}_{k}\) are both from the same level (in parameter s) as \(U^{c}_{k}\), cf. Assumption 4.10. In the special case of subsampling, we have

$$\begin{aligned} b(x,U^f_k) := \frac{1}{2s}\sum _{i=1}^{2s} \hat{b}(x,\theta _{(U^f_k)_i}) \qquad \text { and } \qquad b(x,U^c_k) := \frac{1}{s}\sum _{i=1}^{s} \hat{b}(x,\theta _{(U^c_k)_i}) , \end{aligned}$$

whereas \(b(x,U^{f,1}_k) := \frac{1}{s}\sum _{i=1}^{s} \hat{b}(x,\theta _{(U^f_k)_i})\) and \(b(x,U^{f,2}_k) := \frac{1}{s}\sum _{i=s+1}^{2s} \hat{b}(x,\theta _{(U^f_k)_i})\), for some kernels \(\hat{b}: \mathbb {R}^d \times \mathbb {R}^k \rightarrow \mathbb {R}^d\), where \((U^f_k)_i\), \((U^c_k)_i \sim {\text {Unif}}(\{ 1, \ldots , m \})\). In order to introduce the antithetic counterpart of (5.9), we will replace the random variable \(U^c_k\) taken on the coarse level with the two components of \(U^f_k\). Namely, let us denote

$$\begin{aligned} X^{c-}_{k+1}&= X^{c-}_{k} + h b(X^{c-}_k, U^{f,1}_k) + \beta \sqrt{h} Z_{k+1} , \\ X^{c+}_{k+1}&= X^{c+}_{k} + h b(X^{c+}_k, U^{f,2}_k) + \beta \sqrt{h} Z_{k+1} . \end{aligned}$$
(5.10)

We also put \(\bar{X}^c_{k} := \frac{1}{2}\left( X^{c-}_{k} + X^{c+}_{k} \right) \).

Following our calculations from the beginning of Sect. 5.1 we see that for any Lipschitz function \(g \in C^2_b(\mathbb {R}^d; \mathbb {R})\),

$$\begin{aligned} \begin{aligned}&\mathbb {E}\left| g(X^f_{k+1}) - \frac{1}{2}\left( g(X^{c-}_{k+1}) + g(X^{c+}_{k+1}) \right) \right| ^2 \\&\quad \le 2 \mathbb {E}\left| g(X^f_{k+1}) - g (\bar{X}^c_{k+1}) \right| ^2 + \frac{1}{16} C^2_{g^{(2)}} \mathbb {E}\left| X^{c+}_{k+1} - X^{c-}_{k+1} \right| ^4 \end{aligned} \end{aligned}$$
(5.11)

and we need a bound analogous to the one in Lemma 5.1.

Lemma 5.2

Let Assumptions 4.3, 4.5, 4.6 and 4.9 hold. For \(X^{c+}_{k}\) and \(X^{c-}_{k}\) defined in (5.10), if \(X^{c+}_{0} = X^{c-}_{0}\), then for any \(k \ge 1\) and any \(h \in (0,h_0)\) we have

$$\begin{aligned} \mathbb {E}|X^{c+}_{k} - X^{c-}_{k}|^4 \le \frac{C_1}{c_1}h^2 , \end{aligned}$$

where \(C_1 := 18 \frac{1}{\varepsilon } \left( 1 + 2 C^{(2h)}_{IEul} + C^{(4),(2h)}_{IEul} \right) \sigma ^4 + 4 \left( 27 \sqrt{2} + 54h_0 \right) \sigma ^{(4)} (1+ C^{(4),(2h)}_{IEul})\) and \(c_1\), \(\varepsilon \) and \(h_0\) are chosen so that

$$\begin{aligned} - 4h_0K + 18h_0 \varepsilon + 18 h_0^2 L^2 + 27h_0^4 L^4 + 4h_0^3 \left( \left( \frac{1 + 27L^4}{2} 27 L^4 \right) ^{1/2} + \frac{1 + 81 L^4 }{4} \right) \le - c_1 h_0 . \end{aligned}$$

Using the Lipschitz property of g, we see that in order to deal with the first term on the right hand side of (5.11), we need to bound \(\mathbb {E}|X^{f}_{k} - \bar{X}^{c}_{k}|^2\). Indeed, we have the following result.

Lemma 5.3

Let Assumptions 4.3, 4.5, 4.6, 4.7 and 4.9 hold. For \(X^f_{k}\) defined in (5.9) and \(\bar{X}^{c}_{k}\) defined via (5.10), if \(X^{c+}_{0} = X^{c-}_{0}\), then for any \(k \ge 1\) and any \(h \in (0,h_0)\) we have

$$\begin{aligned} \mathbb {E}|X^{f}_{k} - \bar{X}^{c}_{k}|^2 \le \frac{C_2}{c_2}h^2, \end{aligned}$$

where \(C_2 := \frac{3}{4} \sigma ^{(4)} (1 + C^{(4)}_{IEul}) + \frac{C_1}{c_1}\left( \frac{1}{4} d C_{a^{(2)}} \frac{1}{\varepsilon _1} + \frac{3}{4} + \frac{3}{8} h_0 C_{b^{(2)}}^2 \right) \) with \(C_1\) and \(c_1\) given in Lemma 5.2, whereas \(c_2\), \(\varepsilon _1\) and \(h_0\) are chosen so that \(- h_0 K + \frac{1}{4} d C_{a^{(2)}} \varepsilon _1 h_0 + \frac{3}{2} h_0^2 \bar{L}^2 \le - c_2 h_0\).

Note that both \(C_1\) and \(C_2\) in Lemmas 5.2 and 5.3 are of order \(\mathcal {O}(s^{-2})\), which follows from the dependence of both constants on the parameters \(\sigma ^{(4)}\) and \(\sigma ^2\) and from Assumptions 4.5 and 4.6. The proofs of both lemmas can be found in Appendix 4.

5.3 Proof of Lemma 4.13

As in Sects. 5.1 and 5.2, we will analyse our estimator step-by-step. To this end, we first need to define nine auxiliary Markov chains. In what follows, we will combine the ideas for antithetic estimators with respect to the discretisation parameter and with respect to the subsampling parameter. We will therefore need to consider fine and coarse chains with respect to both parameters. We use the notational convention \(X^{{\text {subsampling}},{\text {discretisation}}}\), hence e.g. \(X^{f,c}\) would be a chain that behaves as a fine chain with respect to the subsampling parameter and as a coarse chain with respect to the discretisation parameter. We define three chains that move as fine chains with respect to the discretisation parameter

$$\begin{aligned} \begin{aligned}&{\left\{ \begin{array}{ll} \Delta X^{f,f}_{k+2} = hb(X^{f,f}_{k+1}, U^f_{k+1}) + \beta \sqrt{h} Z_{k+2} \\ \Delta X^{f,f}_{k+1} = hb(X^{f,f}_{k}, U^f_{k}) + \beta \sqrt{h} Z_{k+1} \end{array}\right. } \\&{\left\{ \begin{array}{ll} \Delta X^{c-,f}_{k+2} = hb(X^{c-,f}_{k+1}, U^{f,1}_{k+1}) + \beta \sqrt{h} Z_{k+2} \\ \Delta X^{c-,f}_{k+1} = hb(X^{c-,f}_{k}, U^{f,1}_{k}) + \beta \sqrt{h} Z_{k+1} \end{array}\right. }\\&{\left\{ \begin{array}{ll} \Delta X^{c+,f}_{k+2} = hb(X^{c+,f}_{k+1}, U^{f,2}_{k+1}) + \beta \sqrt{h} Z_{k+2} \\ \Delta X^{c+,f}_{k+1} = hb(X^{c+,f}_{k}, U^{f,2}_{k}) + \beta \sqrt{h} Z_{k+1} \end{array}\right. } \end{aligned} \end{aligned}$$

and six chains that move as coarse chains

$$\begin{aligned} \begin{aligned} \Delta X^{f,c-}_{k+2}&= 2h b(X^{f,c-}_{k}, U^f_{k}) + \beta \sqrt{2h} \hat{Z}_{k+2} , \, \, \\ \Delta X^{f,c+}_{k+2}&= 2h b(X^{f,c+}_{k}, U^f_{k+1}) + \beta \sqrt{2h} \hat{Z}_{k+2} \\ \Delta X^{c-,c-}_{k+2}&= 2h b(X^{c-,c-}_{k}, U^{f,1}_{k}) + \beta \sqrt{2h} \hat{Z}_{k+2} , \, \,\\ \Delta X^{c-,c+}_{k+2}&= 2h b(X^{c-,c+}_{k}, U^{f,1}_{k+1}) + \beta \sqrt{2h} \hat{Z}_{k+2} \\ \Delta X^{c+,c-}_{k+2}&= 2h b(X^{c+,c-}_{k}, U^{f,2}_{k}) + \beta \sqrt{2h} \hat{Z}_{k+2} , \, \,\\ \Delta X^{c+,c+}_{k+2}&= 2h b(X^{c+,c+}_{k}, U^{f,2}_{k+1}) + \beta \sqrt{2h} \hat{Z}_{k+2} . \end{aligned} \end{aligned}$$

Here \(\Delta X^{f,f}_{k+2} = X^{f,f}_{k+2} - X^{f,f}_{k+1}\), \(\Delta X^{f,f}_{k+1} = X^{f,f}_{k+1} - X^{f,f}_{k}\), \(\Delta X^{f,c-}_{k+2} = X^{f,c-}_{k+2} - X^{f,c-}_{k}\) and likewise for other chains, whereas \(\sqrt{2h} \hat{Z}_{k+2} = \sqrt{h} Z_{k+1} + \sqrt{h} Z_{k+2}\). In order to prove Lemma 4.13, we will first show that for any \(k \ge 1\) we have \(\mathbb {E}|\Delta {\text {Ant}} \Phi ^{s, h}_{g,k}|^2 \le {\widetilde{C}} \mathbb {E}|\Delta {\text {Ant}} \Phi ^{s, h}_{k}|^2 + \bar{C}h^2\), where \({\widetilde{C}}\), \(\bar{C}\) are positive constants and \(\bar{C}\) is of order \(\mathcal {O}(s^{-2})\). Here \(\Delta {\text {Ant}} \Phi ^{s, h}_{k}\) corresponds to taking \(\Delta {\text {Ant}} \Phi ^{s, h}_{g,k}\) with \(g(x) = x\) the identity function.

Then we will show that for all \(k \ge 1\) we have \(\mathbb {E}|\Delta {\text {Ant}} \Phi ^{s, h}_{k+2}|^2 \le (1-ch)\mathbb {E}|\Delta {\text {Ant}} \Phi ^{s, h}_{k}|^2 + Ch^3\) for some constants c, \(C > 0\), where C is of order \(\mathcal {O}(s^{-2})\). Finally, we will conclude that for all \(k \ge 1\) we have \(\mathbb {E}|\Delta {\text {Ant}} \Phi ^{s, h}_{k}|^2 \le C_1 h^2/s^2\) for some \(C_1 > 0\), which will finish the proof.

In order to simplify the notation, we define here one step of the MASGA estimator as

$$\begin{aligned} \begin{aligned} \Psi _{k}&:= \left[ X^{f,f}_{k} - \frac{1}{2}\left( X^{f,c-}_{k} + X^{f,c+}_{k} \right) \right] \\&\quad - \frac{1}{2} \left[ \left( X^{c-,f}_{k} - \frac{1}{2} \left( X^{c-,c-}_{k} + X^{c-,c+}_{k} \right) \right) \right. \\&\quad \left. + \left( X^{c+,f}_{k} - \frac{1}{2}\left( X^{c+,c-}_{k} + X^{c+,c+}_{k} \right) \right) \right] . \end{aligned} \end{aligned}$$
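
Before turning to the analysis, it may help to see the nine coupled chains in code. The numpy sketch below (with the same illustrative subsampled drift as earlier; the dictionary keys and function names are our own devices) advances all nine chains by one coarse step, reusing the Gaussian increments and the subsamples exactly as specified above, and evaluates one realisation of the corresponding nested antithetic difference (cf. (4.10)) for a given function g.

```python
import numpy as np

rng = np.random.default_rng(5)
m, d = 500, 2
data = rng.normal(size=(m, d))

def b(x, idx):
    # Subsampled drift for the toy potential v(x, xi) = |x - xi|^2 / 2.
    return data[idx].mean(axis=0) - x

def masga_pair_step(X, h, beta, s_fine):
    """Advance the nine coupled chains by one coarse step (= two fine steps).

    X is a dict keyed by (subsampling level, discretisation level) with
    levels in {'f', 'c-', 'c+'}; all chains share the Gaussian increments."""
    uf_k, uf_k1 = rng.integers(0, m, size=s_fine), rng.integers(0, m, size=s_fine)
    half = s_fine // 2
    u1_k, u2_k = uf_k[:half], uf_k[half:]        # U^{f,1}_k, U^{f,2}_k
    u1_k1, u2_k1 = uf_k1[:half], uf_k1[half:]    # U^{f,1}_{k+1}, U^{f,2}_{k+1}
    z1, z2 = rng.standard_normal(d), rng.standard_normal(d)
    n1, n2 = beta * np.sqrt(h) * z1, beta * np.sqrt(h) * z2
    nc = n1 + n2                                 # = beta * sqrt(2h) * hat{Z}_{k+2}

    # chains that are fine in h: two Euler steps of size h
    for key, (ua, ub) in {('f', 'f'): (uf_k, uf_k1),
                          ('c-', 'f'): (u1_k, u1_k1),
                          ('c+', 'f'): (u2_k, u2_k1)}.items():
        x = X[key]
        x = x + h * b(x, ua) + n1
        x = x + h * b(x, ub) + n2
        X[key] = x

    # chains that are coarse in h: one Euler step of size 2h
    for key, u in {('f', 'c-'): uf_k, ('f', 'c+'): uf_k1,
                   ('c-', 'c-'): u1_k, ('c-', 'c+'): u1_k1,
                   ('c+', 'c-'): u2_k, ('c+', 'c+'): u2_k1}.items():
        X[key] = X[key] + 2 * h * b(X[key], u) + nc
    return X

def nested_antithetic_difference(X, g):
    # The nested antithetic difference (4.10) evaluated on the nine chains.
    fine = g(X[('f', 'f')]) - 0.5 * (g(X[('f', 'c-')]) + g(X[('f', 'c+')]))
    cm = g(X[('c-', 'f')]) - 0.5 * (g(X[('c-', 'c-')]) + g(X[('c-', 'c+')]))
    cp = g(X[('c+', 'f')]) - 0.5 * (g(X[('c+', 'c-')]) + g(X[('c+', 'c+')]))
    return fine - 0.5 * (cm + cp)

X = {(a, c): np.zeros(d) for a in ('f', 'c-', 'c+') for c in ('f', 'c-', 'c+')}
for _ in range(500):
    X = masga_pair_step(X, h=0.01, beta=0.1, s_fine=32)
print(nested_antithetic_difference(X, lambda x: np.sum(x ** 2)))
```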

Let us first focus on the analysis of

$$\begin{aligned} \begin{aligned} \mathbb {E}|\Psi ^g_{k}|^2&:= \mathbb {E} \Bigg | \left[ g(X^{f,f}_{k}) - \frac{1}{2}\left( g(X^{f,c-}_{k}) + g(X^{f,c+}_{k}) \right) \right] \\&\quad - \frac{1}{2} \Big [ \left( g(X^{c-,f}_{k}) - \frac{1}{2} \left( g(X^{c-,c-}_{k}) + g(X^{c-,c+}_{k}) \right) \right) \\&\quad + \left( g(X^{c+,f}_{k}) - \frac{1}{2}\left( g(X^{c+,c-}_{k}) + g(X^{c+,c+}_{k}) \right) \right) \Big ] \Bigg |^2 \end{aligned} \end{aligned}$$

for \(g : \mathbb {R}^d \rightarrow \mathbb {R}\) Lipschitz. To this end, we will introduce three additional chains

$$\begin{aligned} \bar{X}^{f,c}_{k} = \frac{1}{2}\left( X^{f,c-}_{k} + X^{f,c+}_{k} \right) , \quad \bar{X}^{c-,c}_{k} = \frac{1}{2}\left( X^{c-,c-}_{k} + X^{c-,c+}_{k} \right) \quad \text {and} \quad \bar{X}^{c+,c}_{k} = \frac{1}{2}\left( X^{c+,c-}_{k} + X^{c+,c+}_{k} \right) . \end{aligned}$$

Observe that in the expression \(\mathbb {E}|\Psi ^g_{k}|^2\) above we have three rows of the same structure as the antithetic estimator via discretisation, hence we can proceed exactly as in Sect. 5.1 by adding and subtracting \(g\left( \bar{X}^{f,c}_{k} \right) \), \(g\left( \bar{X}^{c-,c}_{k} \right) \) and \(g\left( \bar{X}^{c+,c}_{k} \right) \), respectively in each row, and then applying Taylor’s formula at the points \(\bar{X}^{f,c}_{k}\), \(\bar{X}^{c-,c}_{k}\) and \(\bar{X}^{c+,c}_{k}\), respectively in each row (note that the first order terms will cancel out) to get

$$\begin{aligned} \mathbb {E}|\Psi ^g_{k}|^2&= \mathbb {E} \Bigg | \left[ g(X^{f,f}_{k}) - g\left( \bar{X}^{f,c}_{k} \right) + T\left( X^{f,c-}_{k} - X^{f,c+}_{k} \right) \right] \\&\quad - \frac{1}{2} \Big [ \left( g(X^{c-,f}_{k}) - g\left( \bar{X}^{c-,c}_{k} \right) + T\left( X^{c-,c-}_{k} - X^{c-,c+}_{k} \right) \right) \\&\quad + \left( g(X^{c+,f}_{k}) - g\left( \bar{X}^{c+,c}_{k} \right) + T\left( X^{c+,c-}_{k} - X^{c+,c+}_{k} \right) \right) \Big ] \Bigg |^2, \end{aligned}$$

where

$$\begin{aligned} \begin{aligned}&T(X^{f,c-}_{k} - X^{f,c+}_{k}) := - \frac{1}{4}\sum _{|\alpha |=2} \int _0^1 (1-t) D^{\alpha } g\left( \bar{X}^{f,c}_{k} + t \left( X^{f,c-}_{k} - \bar{X}^{f,c}_{k} \right) \right) dt \\&\qquad \qquad \left( X^{f,c-}_{k} - X^{f,c+}_{k} \right) ^{\alpha } \\&\quad - \frac{1}{4}\sum _{|\alpha |=2} \int _0^1 (1-t) D^{\alpha } g\left( \bar{X}^{f,c}_{k} + t \left( X^{f,c+}_{k} - \bar{X}^{f,c}_{k} \right) \right) dt \left( X^{f,c+}_{k} - X^{f,c-}_{k} \right) ^{\alpha } , \end{aligned} \end{aligned}$$

where we used \(X^{f,c+}_{k} - \bar{X}^{f,c}_{k} = \frac{1}{2} \left( X^{f,c+}_{k} - X^{f,c-}_{k} \right) = - \left( X^{f,c-}_{k} - \bar{X}^{f,c}_{k} \right) \) and T is defined analogously for other terms, and hence \(|T(x)|^2 \le C|x|^4\) for some \(C > 0\) and for any \(x \in \mathbb {R}^d\). Using \((a+b)^2 \le 2a^2 + 2b^2\) for \(\mathbb {E}|\Psi ^g_{k}|^2\) we can separate the terms involving \(T(\cdot )\) and, due to Lemma 5.1, we see that we have the correct order in h and s for all the terms expressed as \(T(\cdot )\). Hence the remaining term whose dependence on s and h we still need to check can be expressed as

$$\begin{aligned} \mathbb {E} \left| \left[ g(X^{f,f}_{k}) - \frac{1}{2}\left( g(X^{c-,f}_{k}) + g(X^{c+,f}_{k}) \right) \right] - \left[ g\left( \bar{X}^{f,c}_{k} \right) - \frac{1}{2}\left( g\left( \bar{X}^{c-,c}_{k} \right) + g\left( \bar{X}^{c+,c}_{k} \right) \right) \right] \right| ^2 . \end{aligned}$$

We will now need yet another auxiliary chain \(\bar{X}^{c,f}_{k} := \frac{1}{2}\left( X^{c-,f}_{k} + X^{c+,f}_{k} \right) \). We repeat our argument from the previous step by adding and subtracting \(g\left( \bar{X}^{c,f}_{k} \right) \) and \(g\left( \frac{1}{2}\left( \bar{X}^{c-,c}_{k} + \bar{X}^{c+,c}_{k} \right) \right) \) to the first and the second term in square brackets above, respectively, and applying Taylor’s formula at the points \(\bar{X}^{c,f}_{k}\) and \(\frac{1}{2}\left( \bar{X}^{c-,c}_{k} + \bar{X}^{c+,c}_{k} \right) \), respectively (the first order terms again cancel out), to obtain

$$\begin{aligned} \begin{aligned}&\mathbb {E} \Bigg | \left[ g(X^{f,f}_{k}) - g\left( \bar{X}^{c,f}_{k} \right) + T\left( X^{c-,f}_{k} - X^{c+,f}_{k} \right) \right] \\&\quad - \left[ g\left( \bar{X}^{f,c}_{k} \right) - g\left( \frac{1}{2}\left( \bar{X}^{c-,c}_{k} + \bar{X}^{c+,c}_{k} \right) \right) + T\left( \bar{X}^{c-,c}_{k} - \bar{X}^{c+,c}_{k} \right) \right] \Bigg |^2 , \end{aligned} \end{aligned}$$

where again \(|T(x)|^2 \le C|x|^4\) for some \(C > 0\) and for any \(x \in \mathbb {R}^d\). Due to Lemma 5.2 we see that \(\mathbb {E} \left| X^{c-,f}_{k} - X^{c+,f}_{k} \right| ^4\) has the correct order in s and h. Moreover, we have \(\bar{X}^{c-,c}_{k} - \bar{X}^{c+,c}_{k} = \frac{1}{2}\left( X^{c-,c-}_{k} - X^{c+,c-}_{k} \right) + \frac{1}{2}\left( X^{c-,c+}_{k} - X^{c+,c+}_{k} \right) \) and hence, after using \((a+b)^4 \le 8a^4 + 8b^4\), Lemma 5.2 applies also to \(\mathbb {E} \left| \bar{X}^{c-,c}_{k} - \bar{X}^{c+,c}_{k} \right| ^4\). Hence we only need to deal with

$$\begin{aligned} I := \mathbb {E} \left| g(X^{f,f}_{k}) - g\left( \bar{X}^{c,f}_{k} \right) - g\left( \bar{X}^{f,c}_{k} \right) + g\left( \frac{1}{2}\left( \bar{X}^{c-,c}_{k} + \bar{X}^{c+,c}_{k} \right) \right) \right| ^2 . \end{aligned}$$

We can now add and subtract \(g\left( \bar{X}^{f,c}_{k} + \bar{X}^{c,f}_{k} - \frac{1}{2}\left( \bar{X}^{c-,c}_{k} + \bar{X}^{c+,c}_{k} \right) \right) \), and, using the Lipschitz property of g (which we assume holds with a Lipschitz constant, say, \(L_g > 0\)), we see that

$$\begin{aligned} I \le 3 L_g \mathbb {E} \left| X^{f,f}_{k} - \bar{X}^{f,c}_{k} - \bar{X}^{c,f}_{k} + \frac{1}{2}\left( \bar{X}^{c-,c}_{k} + \bar{X}^{c+,c}_{k} \right) \right| ^2 + 6 L_g \mathbb {E} \left| \bar{X}^{f,c}_{k} - \frac{1}{2}\left( \bar{X}^{c-,c}_{k} + \bar{X}^{c+,c}_{k} \right) \right| ^2 . \end{aligned}$$

However, observe that \(\mathbb {E} \left| X^{f,f}_{k} - \bar{X}^{f,c}_{k} - \bar{X}^{c,f}_{k} + \frac{1}{2}\left( \bar{X}^{c-,c}_{k} + \bar{X}^{c+,c}_{k} \right) \right| ^2 = \mathbb {E}|\Psi _{k}|^2\) and, moreover,

$$\begin{aligned} \begin{aligned}&\bar{X}^{f,c}_{k} - \frac{1}{2}\left( \bar{X}^{c-,c}_{k} + \bar{X}^{c+,c}_{k} \right) = \frac{1}{2} \left( X^{f,c-}_{k} + X^{f,c+}_{k} \right) \\&\qquad - \frac{1}{4} \left( X^{c-,c-}_{k} + X^{c-,c+}_{k} + X^{c+,c-}_{k} + X^{c+,c+}_{k} \right) \\&\quad = \frac{1}{2} \left( X^{f,c-}_{k} - \frac{1}{2} \left( X^{c-,c-}_{k} + X^{c+,c-}_{k} \right) \right) \\&\qquad + \frac{1}{2} \left( X^{f,c+}_{k} - \frac{1}{2} \left( X^{c-,c+}_{k} + X^{c+,c+}_{k} \right) \right) . \end{aligned} \end{aligned}$$

Note that both terms on the right hand side above correspond to antithetic estimators via subsampling, hence from Lemma 5.3 we infer that \(\mathbb {E} \left| \bar{X}^{f,c}_{k} - \frac{1}{2}\left( \bar{X}^{c-,c}_{k} + \bar{X}^{c+,c}_{k} \right) \right| ^2\) has the correct order in s and h. We have thus demonstrated that for any \(k \ge 1\) we have \(\mathbb {E}|\Psi ^g_{k}|^2 \le C_1\mathbb {E}|\Psi _{k}|^2 + C_2 h^2/s^2\) for some constants \(C_1\), \(C_2 > 0\). Therefore, in order to finish the proof, it remains to be shown that \(\mathbb {E}|\Psi _{k}|^2\) has the correct order in h and s. As explained above, this will be achieved by proving that there exist constants c, \(C > 0\) (with C being of order \(\mathcal {O}(s^{-2})\)) such that for any \(k \ge 1\) we have \(\mathbb {E}|\Psi _{k+2}|^2 \le (1-ch)\mathbb {E}|\Psi _{k}|^2 + Ch^3\). The idea for dealing with \(\mathbb {E}|\Psi _{k+2}|^2\) is to group the terms in a specific way, add and subtract certain drifts in order to set up appropriate combinations of drifts for a Taylor’s formula application and then to cancel out some first order terms. As we have already seen, the remaining second order terms should then be of the correct order in h and s. To this end, we denote

$$\begin{aligned} \begin{aligned} \Xi ^1_{k}&:= hb(X^{f,f}_{k}, U^f_{k}) - hb(X^{f,c-}_{k}, U^f_{k}) - hb(X^{f,c+}_{k}, U^f_{k+1}) \\ \Xi ^2_{k}&:= hb(X^{c-,f}_{k}, U^{f,1}_{k}) - hb(X^{c-,c-}_{k}, U^{f,1}_{k}) - hb(X^{c-,c+}_{k}, U^{f,1}_{k+1}) \\ \Xi ^3_{k}&:= hb(X^{c+,f}_{k}, U^{f,2}_{k}) - hb(X^{c+,c-}_{k}, U^{f,2}_{k}) - hb(X^{c+,c+}_{k}, U^{f,2}_{k+1}) \\ \Xi _{k}&:= \Xi ^1_{k} - \frac{1}{2}\left( \Xi ^2_{k} + \Xi ^3_{k} \right) \text { and } \Upsilon _{k} := hb(X^{f,f}_{k+1}, U^f_{k+1}) \\&\quad - \frac{1}{2}hb(X^{c-,f}_{k+1}, U^{f,1}_{k+1}) - \frac{1}{2}hb(X^{c+,f}_{k+1}, U^{f,2}_{k+1}) \end{aligned} \end{aligned}$$

and hence, observing that all the noise variables cancel out, we can write \(\Psi _{k+2} = \Psi _{k} + \Xi _k + \Upsilon _{k}\). Thus we have \(\mathbb {E}\left| \Psi _{k+2} \right| ^2 = \mathbb {E} \left| \Psi _{k} \right| ^2 + 2 \mathbb {E} \langle \Psi _{k} , \Xi _{k} \rangle + 2 \mathbb {E} \langle \Psi _{k} , \Upsilon _{k} \rangle + \mathbb {E} \left| \Xi _{k} + \Upsilon _{k} \right| ^2\). We will bound \(\mathbb {E}|\Xi _{k} + \Upsilon _{k}|^2 \le 2\mathbb {E}|\Xi _{k}|^2 + 2\mathbb {E}|\Upsilon _{k}|^2\) and we will first deal with the terms involving \(\Upsilon _{k}\). We will need an additional auxiliary Markov chain (moving as a fine chain with respect to the discretisation parameter) defined as

$$\begin{aligned} X_{k+2}&= X_{k+1} + ha(X_{k+1}) + \beta \sqrt{h} Z_{k+2} , \\ X_{k+1}&= X_{k} + ha(X_{k}) + \beta \sqrt{h} Z_{k+1} . \end{aligned}$$

Using \(b(x,U) = \frac{1}{2}b(x,U^1) + \frac{1}{2}b(x,U^2)\), we have

$$\begin{aligned} \begin{aligned} \Upsilon _{k}&= hb(X^{f,f}_{k+1}, U^f_{k+1}) - hb(X_{k+1}, U^f_{k+1}) \\&\quad - \frac{1}{2}h\left( b(X^{c-,f}_{k+1}, U^{f,1}_{k+1}) - b(X_{k+1}, U^{f,1}_{k+1})\right) \\&\quad - \frac{1}{2}h\left( b(X^{c+,f}_{k+1}, U^{f,2}_{k+1}) - b(X_{k+1}, U^{f,2}_{k+1})\right) . \end{aligned} \end{aligned}$$

First, in order to deal with \(\mathbb {E} \langle \Psi _{k} , \Upsilon _{k} \rangle \), we use Taylor’s formula to write

$$\begin{aligned} \Upsilon _{k}&= h \Bigg ( \sum _{|\alpha |=1} D^{\alpha } b (X_{k+1}, U^f_{k+1}) \left( X^{f,f}_{k+1} - X_{k+1}\right) ^{\alpha } \\&\qquad + \sum _{|\alpha |=2} \int _0^1 (1-t) D^{\alpha }b\left( X_{k+1} + t\left( X^{f,f}_{k+1} - X_{k+1} \right) , U^f_{k+1}\right) dt \left( X^{f,f}_{k+1} - X_{k+1}\right) ^{\alpha } \Bigg ) \\&\quad -\frac{1}{2}h \Bigg ( \sum _{|\alpha |=1} D^{\alpha } b (X_{k+1}, U^{f,1}_{k+1}) \left( X^{c-,f}_{k+1} - X_{k+1}\right) ^{\alpha } \\&\qquad + \sum _{|\alpha |=2} \int _0^1 (1-t) D^{\alpha }b\left( X_{k+1} + t\left( X^{c-,f}_{k+1} - X_{k+1} \right) , U^{f,1}_{k+1}\right) dt \left( X^{c-,f}_{k+1} - X_{k+1}\right) ^{\alpha } \Bigg ) \\&\quad -\frac{1}{2}h \Bigg ( \sum _{|\alpha |=1} D^{\alpha } b (X_{k+1}, U^{f,2}_{k+1}) \left( X^{c+,f}_{k+1} - X_{k+1}\right) ^{\alpha } \\&\qquad + \sum _{|\alpha |=2} \int _0^1 (1-t) D^{\alpha }b\left( X_{k+1} + t\left( X^{c+,f}_{k+1} - X_{k+1} \right) , U^{f,2}_{k+1}\right) dt \left( X^{c+,f}_{k+1} - X_{k+1}\right) ^{\alpha } \Bigg ) . \end{aligned}$$

Hence, using Assumptions 4.7 and 4.8, we have

$$\begin{aligned} \mathbb {E} \langle \Psi _{k} , \Upsilon _{k} \rangle&= h \mathbb {E} \langle \Psi _{k} , \sum _{|\alpha |=1} D^{\alpha } a (X_{k+1}) \left( X^{f,f}_{k+1} - \frac{1}{2}X^{c-,f}_{k+1} -\frac{1}{2}X^{c+,f}_{k+1} \right) ^{\alpha } \rangle \\&\quad + h \mathbb {E} \langle \Psi _{k} , \sum _{|\alpha |=2} \int _0^1 (1-t) D^{\alpha }a\left( X_{k+1} + t\left( X^{f,f}_{k+1} - X_{k+1} \right) \right) dt \left( X^{f,f}_{k+1} - X_{k+1}\right) ^{\alpha } \rangle \\&\quad + h \mathbb {E} \langle \Psi _{k} , \sum _{|\alpha |=2} \int _0^1 (1-t) D^{\alpha }a\left( X_{k+1} + t\left( X^{c-,f}_{k+1} - X_{k+1} \right) \right) dt \left( X^{c-,f}_{k+1} - X_{k+1}\right) ^{\alpha } \rangle \\&\quad + h \mathbb {E} \langle \Psi _{k} , \sum _{|\alpha |=2} \int _0^1 (1-t) D^{\alpha }a\left( X_{k+1} + t\left( X^{c+,f}_{k+1} - X_{k+1} \right) \right) dt \left( X^{c+,f}_{k+1} - X_{k+1}\right) ^{\alpha } \rangle . \end{aligned}$$

Now we apply Young’s inequality to each term above with some \(\varepsilon _1\), \(\varepsilon _2\), \(\varepsilon _3\), \(\varepsilon _4 > 0\), respectively, and, using Assumption 4.3 (condition (4.14)), we obtain

$$\begin{aligned} \begin{aligned}&\mathbb {E} \langle \Psi _{k} , \Upsilon _{k} \rangle \le \frac{1}{2} h \varepsilon _1 C_{a^{(1)}} \mathbb {E} \left| \Psi _{k} \right| ^2 \\&\quad + \frac{1}{2} h \frac{1}{\varepsilon _1} C_{a^{(1)}} \mathbb {E} \left| X^{f,f}_{k+1} - \frac{1}{2}X^{c-,f}_{k+1} -\frac{1}{2}X^{c+,f}_{k+1}\right| ^2 \\&\quad + \frac{1}{2} h \varepsilon _2 C_{a^{(2)}} \mathbb {E} \left| \Psi _{k} \right| ^2 + \frac{1}{2} h \frac{1}{\varepsilon _2} C_{a^{(2)}} \mathbb {E} \left| X^{f,f}_{k+1} - X_{k+1}\right| ^4 \\&\quad + \frac{1}{2} h \varepsilon _3 C_{a^{(2)}} \mathbb {E} \left| \Psi _{k} \right| ^2 + \frac{1}{2} h \frac{1}{\varepsilon _3} C_{a^{(2)}} \mathbb {E} \left| X^{c-,f}_{k+1} - X_{k+1}\right| ^4 \\&\quad + \frac{1}{2} h \varepsilon _4 C_{a^{(2)}} \mathbb {E} \left| \Psi _{k} \right| ^2 + \frac{1}{2} h \frac{1}{\varepsilon _4} C_{a^{(2)}} \mathbb {E} \left| X^{c+,f}_{k+1} - X_{k+1}\right| ^4 . \end{aligned} \end{aligned}$$

Now note that the second term on the right hand side above is of order \(\mathcal {O}(h^2/s^2)\) due to Lemma 5.3. For the other three terms, we need the following auxiliary result.
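
For the reader’s convenience, the first of these four applications of Young’s inequality reads (writing \(C_{a^{(1)}}\), as in the assumptions, for the constant bounding the first-order derivatives of \(a\))

$$\begin{aligned}&h \mathbb {E} \langle \Psi _{k} , \sum _{|\alpha |=1} D^{\alpha } a (X_{k+1}) \left( X^{f,f}_{k+1} - \frac{1}{2}X^{c-,f}_{k+1} -\frac{1}{2}X^{c+,f}_{k+1} \right) ^{\alpha } \rangle \\&\quad \le h C_{a^{(1)}} \mathbb {E} \left[ \left| \Psi _{k} \right| \left| X^{f,f}_{k+1} - \frac{1}{2}X^{c-,f}_{k+1} -\frac{1}{2}X^{c+,f}_{k+1} \right| \right] \\&\quad \le \frac{1}{2} h \varepsilon _1 C_{a^{(1)}} \mathbb {E} \left| \Psi _{k} \right| ^2 + \frac{1}{2} h \frac{1}{\varepsilon _1} C_{a^{(1)}} \mathbb {E} \left| X^{f,f}_{k+1} - \frac{1}{2}X^{c-,f}_{k+1} -\frac{1}{2}X^{c+,f}_{k+1} \right| ^2 , \end{aligned}$$

since \(uv \le \frac{\varepsilon _1}{2}u^2 + \frac{1}{2\varepsilon _1}v^2\) for any \(u, v \ge 0\) and \(\varepsilon _1 > 0\); the remaining three terms are treated in the same way with \(\varepsilon _2\), \(\varepsilon _3\), \(\varepsilon _4\) and the constant \(C_{a^{(2)}}\) bounding the second-order derivatives.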

Lemma 5.4

Let Assumptions 4.3, 4.5, 4.6 and 4.9 hold. Assuming \(X^{f,f}_0 = X_0\), there exists a constant \(C > 0\) such that for all \(k \ge 1\),

$$\begin{aligned} \mathbb {E} \left| X^{f,f}_{k} - X_{k} \right| ^4 \le C\frac{1}{s^2}h^2 . \end{aligned}$$

The proof is similar to the proof of Lemma 5.2 and can be found in Appendix 5.

The reasoning in the proof of Lemma 5.4 applies also to \(X^{c-,f}_{k}\) and \(X^{c+,f}_{k}\) in place of \(X^{f,f}_{k}\). Hence we see that \(\mathbb {E} \langle \Psi _{k} , \Upsilon _{k} \rangle \) is bounded from above by an expression of the form \(C_1(\varepsilon _1 + \varepsilon _2 + \varepsilon _3 + \varepsilon _4)h \mathbb {E}|\Psi _k|^2 + C_2h\), where \(\varepsilon _i\) for \(i \in \{ 1, \ldots , 4 \}\) can be chosen as small as necessary and the constant \(C_2\) is of the correct order in s and h, i.e., of order \(\mathcal {O}(h^2/s^2)\). We explain below how to handle the remaining terms and how to choose the values of \(\varepsilon _i\) for \(i \in \{ 1, \ldots , 4 \}\). Now, in order to deal with \(\mathbb {E}|\Upsilon _{k}|^2\), we use a different decomposition of \(\Upsilon _{k}\), namely we write

$$\begin{aligned} \Upsilon _{k}&= \frac{1}{2}hb(X^{f,f}_{k+1}, U^{f,1}_{k+1}) + \frac{1}{2}hb(X^{f,f}_{k+1}, U^{f,2}_{k+1}) \\&\quad - \frac{1}{2}hb(X^{c-,f}_{k+1}, U^{f,1}_{k+1}) - \frac{1}{2}hb(X^{c+,f}_{k+1}, U^{f,2}_{k+1}) \\&= \frac{1}{2}hb(X^{f,f}_{k+1}, U^{f,1}_{k+1}) - \frac{1}{2}hb(\bar{X}^{c,f}_{k+1}, U^{f,1}_{k+1}) \\&\quad + \frac{1}{2}hb(\bar{X}^{c,f}_{k+1}, U^{f,1}_{k+1}) - \frac{1}{2}hb(X^{c-,f}_{k+1}, U^{f,1}_{k+1}) \\&\quad + \frac{1}{2}hb(X^{f,f}_{k+1}, U^{f,2}_{k+1}) - \frac{1}{2}hb(\bar{X}^{c,f}_{k+1}, U^{f,2}_{k+1}) \\&\quad + \frac{1}{2}hb(\bar{X}^{c,f}_{k+1}, U^{f,2}_{k+1}) - \frac{1}{2}hb(X^{c+,f}_{k+1}, U^{f,2}_{k+1}) , \end{aligned}$$

where \(\bar{X}^{c,f}_{k+1} := \frac{1}{2} \left( X^{c-,f}_{k+1} + X^{c+,f}_{k+1} \right) \). Hence, using (4.15) we obtain

$$\begin{aligned} \begin{aligned}&\mathbb {E} \left| \Upsilon _{k} \right| ^2 \le \frac{3}{4} h^2 \bar{L}^2 \mathbb {E} \left| X^{f,f}_{k+1} - \bar{X}^{c,f}_{k+1} \right| ^2 + \frac{3}{4} h^2 \bar{L}^2 \mathbb {E} \left| X^{f,f}_{k+1} - \bar{X}^{c,f}_{k+1} \right| ^2 \\&\quad + \frac{3}{4} h^2 \mathbb {E} \left| b(\bar{X}^{c,f}_{k+1}, U^{f,1}_{k+1}) - b(X^{c-,f}_{k+1}, U^{f,1}_{k+1}) \right. \\&\quad \left. + b(\bar{X}^{c,f}_{k+1}, U^{f,2}_{k+1}) - b(X^{c+,f}_{k+1}, U^{f,2}_{k+1}) \right| ^2 . \end{aligned} \end{aligned}$$

The first two terms on the right hand side above are identical and have the correct order in s and h due to Lemma 5.3. On the other hand, the third term can be dealt with by one more application of Taylor’s formula (cf. the calculation for the term \(J_{33}\) in the proof of Lemma 5.3 in Appendix 4 for more details).
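
Heuristically (the full argument being the \(J_{33}\) calculation referred to above), expanding \(b(\cdot , U^{f,1}_{k+1})\) and \(b(\cdot , U^{f,2}_{k+1})\) around \(\bar{X}^{c,f}_{k+1}\) and using \(\bar{X}^{c,f}_{k+1} - X^{c-,f}_{k+1} = -\left( \bar{X}^{c,f}_{k+1} - X^{c+,f}_{k+1}\right) = \frac{1}{2}\left( X^{c+,f}_{k+1} - X^{c-,f}_{k+1}\right) \), the first-order contributions to the third term combine into

$$\begin{aligned} \frac{1}{2} \sum _{|\alpha |=1} \left( D^{\alpha } b (\bar{X}^{c,f}_{k+1}, U^{f,1}_{k+1}) - D^{\alpha } b (\bar{X}^{c,f}_{k+1}, U^{f,2}_{k+1}) \right) \left( X^{c+,f}_{k+1} - X^{c-,f}_{k+1} \right) ^{\alpha } , \end{aligned}$$

i.e., a difference of the subsampled drifts evaluated with two independent batches, multiplied by half the distance between the two coarse chains; the second-order remainders are then handled as in the previous displays.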

The terms involving \(\Xi _k\) can be handled using ideas similar to those above. In particular, for \(\mathbb {E}|\Xi _k|^2\) we have the following result.

Lemma 5.5

Let Assumptions 4.3, 4.5, 4.6, 4.7 and 4.9 hold. Assuming all the auxiliary chains introduced above are initiated at the same point, there exists a constant \(C_{1,\Xi } > 0\) such that for all \(k \ge 1\),

$$\begin{aligned} \mathbb {E} |\Xi _k|^2 \le C_{1,\Xi }\frac{1}{s^2}h^2 . \end{aligned}$$

The proof of Lemma 5.5 can be found in Appendix 5. The last term to deal with is \(\mathbb {E} \langle \Psi _k, \Xi _k \rangle \). We have the following result.

Lemma 5.6

Let Assumptions 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9 and 4.10 hold. Assuming all the auxiliary chains introduced above are initiated at the same point, there exist constants \(C_{2,\Xi }\), \(C_{3,\Xi }\) and \(C_{4,\Xi } > 0\) such that for all \(k \ge 1\),

$$\begin{aligned} \mathbb {E} \langle \Psi _k , \Xi _k \rangle \le (C_{2,\Xi }\varepsilon - C_{3,\Xi })h \mathbb {E}|\Psi _k|^2 + C_{4,\Xi }\frac{1}{s^2}h^2 , \end{aligned}$$

where \(\varepsilon > 0\) can be chosen arbitrarily small.

The crucial insight about the bound in Lemma 5.6 is that, thanks to the special structure of the term \(\Xi _k = \Xi _k^1 - \frac{1}{2}(\Xi _k^2 + \Xi _k^3)\), we can extract from \(\mathbb {E}\langle \Psi _k , \Xi _k \rangle \), after a few applications of Taylor’s formula, a term of the form \(h\mathbb {E}\langle \Psi _k , \sum _{|\alpha |=1} D^{\alpha } a(X_k) (\Psi _k)^{\alpha } \rangle \), which, due to Assumption 4.3, can be bounded from above by \(-Kh\mathbb {E}|\Psi _k|^2\). This gives the term with the constant \(C_{3,\Xi }\) in Lemma 5.6. Then, after combining all our estimates and choosing all the \(\varepsilon _i > 0\) small enough, we can indeed conclude that for any \(k \ge 1\) we have \(\mathbb {E}|\Psi _{k+2}|^2 \le (1-ch)\mathbb {E}|\Psi _{k}|^2 + Ch^3\) with C of order \(\mathcal {O}(s^{-2})\), which, as explained above, finishes the proof.
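
To see why such a recursion suffices, note that, assuming \(ch < 1\), iterating it and summing the resulting geometric series gives, for all \(k \ge 3\),

$$\begin{aligned} \mathbb {E}|\Psi _{k}|^2 \le (1-ch)^{\lfloor k/2 \rfloor - 1} \max \left( \mathbb {E}|\Psi _{1}|^2 , \mathbb {E}|\Psi _{2}|^2 \right) + \frac{C}{c}\, h^2 , \end{aligned}$$

so that, up to the contribution of the initial values, \(\sup _{k} \mathbb {E}|\Psi _{k}|^2\) is of order \(\mathcal {O}(h^2/s^2)\), consistently with the orders discussed above.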

The proof of Lemma 5.6 is lengthy and tedious but elementary and hence is moved to Appendix 5.