Multi-index antithetic stochastic gradient algorithm

Stochastic Gradient Algorithms (SGAs) are ubiquitous in computational statistics, machine learning and optimisation. Recent years have brought an influx of interest in SGAs, and the non-asymptotic analysis of their bias is by now well developed. However, relatively little is known about the optimal choice of the random approximation of the gradient (e.g., mini-batching) in SGAs, as this choice relies on an analysis of the variance and is problem-specific. While there have been numerous attempts to reduce the variance of SGAs, these typically exploit a particular structure of the sampled distribution by requiring a priori knowledge of the mode of its density. In this paper, we construct a Multi-index Antithetic Stochastic Gradient Algorithm (MASGA) whose implementation is independent of the structure of the target measure. Our rigorous theoretical analysis demonstrates that for log-concave targets, MASGA achieves performance on par with Monte Carlo estimators that have access to unbiased samples from the distribution of interest. In other words, MASGA is an optimal estimator from the mean-square-error/computational-cost perspective within the class of Monte Carlo estimators. To illustrate the robustness of our approach, we also implement MASGA in some simple non-log-concave numerical examples, although without theoretical guarantees on its performance in such settings.


Introduction
Variations of Stochastic Gradient Algorithms (SGAs) are central to many modern machine learning applications, such as large-scale Bayesian inference [58], variational inference [36], generative adversarial networks [34], variational autoencoders [43] and deep reinforcement learning [47]. The statistical sampling perspective provides a unified framework for studying the non-asymptotic behaviour of these algorithms, which is the main topic of this work. More precisely, consider a data set D = (ξ_i)_{i=1}^m ⊂ R^n, with m ∈ N ∪ {∞}, and the corresponding empirical measure ν_m := (1/m) ∑_{i=1}^m δ_{ξ_i}, where δ_x is the Dirac measure at x. Denote by P(R^n) the space of all probability measures on R^n and consider a potential V : R^d × P(R^n) → R. We are then interested in the problem of sampling from the (data-dependent) probability distribution π on R^d, given by

    π(dx) ∝ exp(−(2/β²) V(x, ν_m)) dx, (1.1)

for some fixed parameter β > 0. Under some mild assumptions on V, the measure π is a stationary distribution of the (overdamped) Langevin stochastic differential equation (SDE), and classical Langevin Monte Carlo algorithms [14] utilise discrete-time counterparts of such SDEs to provide tools for approximate sampling from π; these, however, require access to exact evaluations of ∇V(·, ν_m). SGAs, on the other hand, take as input a noisy evaluation ∇V(·, ν_s) for some s ∈ {1, . . . , m}. The simplest example of ∇V(·, ν_s) utilises subsampling with replacement. Namely, consider a sequence of i.i.d. uniformly distributed random variables τ_i^k ∼ Unif({1, . . . , m}) for k ≥ 0 and 1 ≤ i ≤ s, and define a sequence of random data batches D_s^k := (ξ_{τ_i^k})_{i=1}^s and corresponding random measures ν_s^k := (1/s) ∑_{i=1}^s δ_{ξ_{τ_i^k}} for k ≥ 0. Fix a learning rate (time-step) h > 0. The corresponding algorithm to sample from (1.1) is given by

    X_{k+1} = X_k − h ∇V(X_k, ν_s^k) + β √h Z_{k+1}, (1.2)

where (Z_i)_{i=1}^∞ are i.i.d. random variables with the standard normal distribution.
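A single update of the chain (1.2) with subsampling-with-replacement is straightforward to sketch in code. The following is a minimal NumPy illustration of ours (the function name and the quadratic toy gradient in the usage part are hypothetical, not from the paper's implementation):

```python
import numpy as np

def sgld_step(x, grad_V, data, s, h, beta, rng):
    """One step of the stochastic gradient chain (1.2).

    grad_V(x, batch) should return an unbiased estimate of the full-data
    gradient whenever `batch` consists of s points drawn uniformly with
    replacement from `data`.
    """
    idx = rng.integers(0, len(data), size=s)      # tau_i^k ~ Unif({1,...,m})
    batch = data[idx]                             # random batch D_s^k
    noise = rng.standard_normal(x.shape)          # Z_{k+1} ~ N(0, I)
    return x - h * grad_V(x, batch) + beta * np.sqrt(h) * noise

# Toy usage: a quadratic potential, so with beta = 0 the chain reduces to
# plain SGD and drifts towards the data mean.
rng = np.random.default_rng(0)
data = np.array([[1.0], [3.0]])
grad = lambda x, batch: x - batch.mean(axis=0)
x = np.zeros(1)
for _ in range(200):
    x = sgld_step(x, grad, data, s=2, h=0.1, beta=0.0, rng=rng)
```

Setting β = 0 recovers SGD, consistent with the remark below that the analysis allows β = 0.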
In its simplest form, without mini-batching (i.e., when we use the exact evaluation ∇V(·, ν_m)), this method is known in computational statistics as the Unadjusted Langevin Algorithm (ULA) [21], but it has numerous more sophisticated variants and alternatives [8,12,7,45,46].
Numerical methods based on Euler schemes with inaccurate (randomised) drifts such as (1.2) have recently become an object of considerable interest in both the computational statistics and the machine learning communities [51,27,9,16,61,1,50,44]. In particular, the Stochastic Gradient Langevin Dynamics (SGLD) method for approximate sampling from invariant measures of Langevin SDEs has been studied e.g. in [58,55,57,19,7,3,10,60]. Furthermore, recall that under some mild assumptions on V, as β → 0 the measure π concentrates on the set of minima of V, i.e., on {x ∈ R^d : x ∈ arg inf V(·, ν_m)}, cf. [38]. Remarkably, no convexity of V is required for this to be true, which makes SGAs good candidates for a tool for non-convex optimisation. We would like to stress that throughout the paper, in our analysis of algorithm (1.2), we allow for β = 0 and hence we also cover Stochastic Gradient Descent (SGD) [48,15].
Despite the great success of algorithm (1.2) and its various extensions, relatively little is known about how to optimally choose s, or whether sub-sampling (or mini-batching) is a good idea at all [49]. One of the reasons is that the performance analysis appears to be problem-specific, as we shall demonstrate with a simple example below. It is clear that subsampling increases the variance of the estimator and induces an additional non-asymptotic bias, see e.g. [49,7]. It is therefore not clear that the reduced computational cost of running the algorithm compensates for these adverse effects. On the other hand, in a big-data regime it may be computationally infeasible to use all data points at every step of the gradient algorithm, and hence subsampling becomes a necessity.
In the present paper we propose a solution to these challenges by constructing a novel Multi-index Antithetic Stochastic Gradient Algorithm (MASGA). In settings where the measure π in (1.1) is log-concave, we rigorously demonstrate that MASGA performs on par with Monte Carlo estimators having access to unbiased samples from the target measure, even though it is built from biased samples. Remarkably, our numerical results in Section 3 demonstrate that this optimal performance is achieved even in some non-log-concave settings. To our knowledge, all current state-of-the-art SGAs [12,2] require the user to determine the mode of the target distribution a priori, and hence it is not clear how to implement them in non-log-concave settings. This problem is absent with MASGA, whose implementation is independent of the structure of the target measure. Moreover, the analysis in [12] is based on the Bernstein–von Mises phenomenon, which describes the asymptotic behaviour of the target measure as m → ∞, and hence their algorithm is aimed explicitly at the big-data regime, see Section 3 therein. Meanwhile, as we will discuss below, MASGA works optimally irrespective of the size of the dataset.
Mean-square error analysis. In the present paper we study the problem of computing (f, π) := ∫_{R^d} f(x) π(dx) for some f ∈ L²(π). This framework covers the tasks of approximating minima of possibly non-convex functions, or the computation of normalising constants in statistical sampling. To this end, the Markov chain specified by (1.2) is used to approximate (f, π) with E[f(X_k)] for large k > 0. More precisely, one simulates N > 1 independent copies (X_k^i)_{k=0}^∞, for i ∈ {1, . . . , N}, of (1.2), to compute the empirical measure µ^{N,k} := (1/N) ∑_{i=1}^N δ_{X_k^i}. The usual metric for measuring the performance of such algorithms is the (root) mean square error (MSE). Namely, for any f ∈ L²(π), k ≥ 1 and N ≥ 1, we define

    MSE(A_{f,k,N}) := ( E[ ((f, µ^{N,k}) − (f, π))² ] )^{1/2},

where A_{f,k,N} is the algorithm specified by the estimator (f, µ^{N,k}). Then, for a given ε > 0, we look for the optimal number k of steps and the optimal number N of simulated paths such that, for any fixed integrable function f, we have MSE(A_{f,k,N}) < ε. Note that

    MSE(A_{f,k,N})² = |E[f(X_k)] − (f, π)|² + (1/N) V[f(X_k)], (1.3)

i.e., the squared MSE splits into a squared bias term and a variance term. If f is Lipschitz with a Lipschitz constant L, then the Kantorovich duality representation of the L¹-Wasserstein distance W₁, see [56, Remark 6.5], allows us to upper bound the first term on the right-hand side by L W₁(L(X_k), π). Hence it is possible to control the bias by using the vast literature on such Wasserstein bounds (see e.g. [11,20,46,7] and the references therein). Controlling the variance, however, is a more challenging task. Before we proceed, let us consider an example.
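In practice one often monitors the two terms of (1.3) separately. A small sketch of ours (not from the paper's code) of the squared-MSE decomposition computed from N independent chain endpoints:

```python
import numpy as np

def mse_squared(samples, target):
    """Squared MSE of the estimator (f, mu^{N,k}) via the decomposition (1.3):
    squared bias of f(X_k) plus its variance divided by the number of chains N.

    `samples` holds the values f(X_k^i) for N independent chains;
    `target` is the exact value (f, pi), known here for testing purposes.
    """
    n = len(samples)
    bias_sq = (samples.mean() - target) ** 2
    return bias_sq + samples.var(ddof=1) / n
```

For instance, three chain endpoints with values 1, 2, 3 and target 2 give zero bias and a variance contribution of 1/3.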

Motivational example
In the context of Bayesian inference, one is interested in sampling from the posterior distribution π on R^d given by

    π(dx) ∝ π_0(x) ∏_{i=1}^m π(ξ_i | x) dx,

where the measure π_0(dx) = π_0(x) dx is called the prior and (ξ_i)_{i=1}^m are i.i.d. data points with densities π(ξ_i | x) for x ∈ R^d. Note that in this example, for convenience, we assume that the data are conditionally independent given the parameters. See Section 4 for more general settings. Taking β = √2, the potential V in (1.1) becomes

    V(x, ν_m) := − log π_0(x) − m ∫ log π(y | x) ν_m(dy) = − log π_0(x) − ∑_{i=1}^m log π(ξ_i | x). (1.4)
In stochastic gradient algorithms, one replaces the exact gradient ∇V(·, ν_m) in (1.2) with an approximate gradient, constructed by sampling only s ≪ m terms from the sum in (1.4). This amounts to considering

    V(x, ν_s) = − log π_0(x) − m ∫ log π(y | x) ν_s(dy) = − log π_0(x) − (m/s) ∑_{i=1}^s log π(ξ_{τ_i^k} | x),

where τ_i^k ∼ Unif({1, . . . , m}) for i = 1, . . . , s are i.i.d. random variables uniformly distributed on {1, . . . , m}. Note that for any x ∈ R^d the noisy gradient ∇_x V(x, ν_s) is an unbiased estimator of ∇_x V(x, ν_m). If we choose β = √2 and the time-step h/(2m) in (1.2), we arrive at the Markov chain

    X_{k+1} = X_k − (h/(2m)) ∇_x V(X_k, ν_s^k) + √(h/m) Z_{k+1}. (1.5)

The main difficulty in quantifying the cost of Monte Carlo algorithms based on (1.5) stems from the fact that the variance is problem-specific and depends substantially on the interplay between the parameters m, s, h and N. Hence, one may obtain different costs (and thus different answers to the question of the profitability of mini-batching) for different models and different data regimes [49].
In Figures 1.1 and 1.2 we present a numerical experiment for a simple example of a Monte Carlo estimator (f, µ^{N,k}) based on the chain (1.5), with the densities π(ξ_i | x) specified by a Bayesian logistic regression model, cf. Section 3 for details. We take m = 512, h = 10^{−2} and we simulate up to t = 2, hence we have k = 200 iterations. On the left-hand side of both figures we can see the estimated MSE for different numbers of paths N and different numbers of samples s ≤ m, for two different functions f. On the right-hand side we can see how the variance changes with s. Evidently, subsampling/mini-batching works better for f(x) = log |x| than for f(x) = exp(x), since in the former case it allows us to obtain a small MSE by using just two samples, while in the latter the minimal reasonable number of samples seems to be 16. However, even in this simple example we see that the optimal choice of s and N is far from obvious and very much problem-specific. For an additional discussion of this subject, see Appendix 7.
It has been observed in [49] that in some specific regimes there is no benefit to employing the subsampling scheme. The authors of [49] also observed that a subsampling scheme utilising control variates can exhibit improved performance, but again only in some specific regimes (see also [7,2] for related ideas). Moreover, in order to implement their scheme one has to know the mode of the sampled distribution in advance, and hence it is not clear how to adapt it to non-convex settings.
The fact that the analysis of the variance of SGLD (and hence of the computational cost of the algorithm A_{f,k,N}) becomes cumbersome even in the simplest possible examples clearly demonstrates the need for a different algorithm, for which the benefits of subsampling can be proven rigorously for a reasonably large class of models. To this end, we turn towards the Multi-level Monte Carlo (MLMC) technique.

Main result
In order to approximate the measure π, we consider a family of (possibly random) measures (π_ℓ)_{ℓ∈N^r}, where N^r for r ≥ 1 is the set of multi-indices ℓ = (ℓ_1, . . . , ℓ_r), and each ℓ_i ≥ 0 corresponds to a different type of approximation. In this work we focus on the case r = 2, with ℓ_1 dictating the number of subsamples at each step of the algorithm and ℓ_2 the time-discretisation error. While for a fixed ℓ the samples are biased, we work with methods that are asymptotically unbiased in the sense that

    E[(f, π_ℓ)] → (f, π) as ℓ → ∞, (2.1)

where ℓ → ∞ means that min_{i∈{1,...,r}} ℓ_i → ∞. Note that we need to take the expectation in (2.1) since the measures π_ℓ can be random. We also remark that as the coordinates of ℓ increase and the bias decreases, the corresponding computational cost of MLMC increases. One therefore faces an optimisation problem and tries to obtain the minimal computational cost for a prescribed accuracy (or, equivalently, to minimise the bias for a fixed computational budget). It turns out, perhaps surprisingly, that a Multi-level Monte Carlo estimator that consists of a hierarchy of biased approximations can achieve the computational efficiency of vanilla Monte Carlo built from directly accessible unbiased samples [28,29]. In order to explain this approach, let us define backward difference operators

    ∆_s π_ℓ := π_ℓ − π_{ℓ−e_s},  ∆ := ∆_1 ∆_2 · · · ∆_r,

where e_s is the unit vector in the direction s ∈ {1, . . . , r} and ∆_1 ∆_2 · · · ∆_r denotes the concatenation of these operators. The core idea of MLMC is to observe that, thanks to (2.1),

    (f, π) = ∑_{ℓ∈N^r} E[(f, ∆π_ℓ)], (2.2)

where we set π_{0−e_s} := 0 for all unit vectors e_s, with 0 = (0, . . . , 0). The original MLMC [29] was developed for r = 1, and the extension to an arbitrary r (named Multi-index Monte Carlo, or MIMC) was developed in [35].
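The telescoping identity (2.2) is easy to verify numerically for r = 2. In the sketch below, a toy deterministic function a(ℓ_1, ℓ_2) of our own devising stands in for E[(f, π_ℓ)]; summing the concatenated backward differences over a box of multi-indices recovers the value at the box's upper corner:

```python
import itertools

def delta2(a, l1, l2):
    """Concatenated backward difference (Delta_1 Delta_2 a) at (l1, l2),
    with the convention a := 0 whenever an index is negative."""
    def val(i, j):
        return a(i, j) if i >= 0 and j >= 0 else 0.0
    return val(l1, l2) - val(l1 - 1, l2) - val(l1, l2 - 1) + val(l1 - 1, l2 - 1)

# Toy biased approximation: converges to 1 as both indices grow.
a = lambda l1, l2: 1.0 - 0.5 * 2.0 ** -l1 - 0.25 * 2.0 ** -l2
L = 5
total = sum(delta2(a, l1, l2)
            for l1, l2 in itertools.product(range(L + 1), range(L + 1)))
```

Here `total` equals a(L, L) exactly, the two-parameter analogue of the familiar one-dimensional telescoping sum.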
In MIMC we approximate each term π_ℓ on the right-hand side of (2.2) using mutually independent, unbiased Monte Carlo estimators π_{ℓ,N_ℓ} with N_ℓ ≥ 1 samples each, and we choose a finite set of multi-indices L ⊂ N^r to define

    A_L(f) := ∑_{ℓ∈L} (f, ∆π_{ℓ,N_ℓ}). (2.3)

Clearly, E[(f, ∆π_{ℓ,N_ℓ})] = E[(f, ∆π_ℓ)] and hence A_L(f) is an asymptotically unbiased estimator of (f, π) when L increases to N^r. Moreover, we note that due to the independence of π_{ℓ,N_ℓ} across levels, the variance of the MIMC estimator satisfies V[A_L(f)] = ∑_{ℓ∈L} V[(f, ∆π_{ℓ,N_ℓ})]. In this work we develop an antithetic extension of the MIMC algorithm. To this end we define pairs (π^{+,ℓ}, π^{−,ℓ}) of copies of π_ℓ, in the sense that E[(f, π^{+,ℓ})] = E[(f, π^{−,ℓ})] = E[(f, π_ℓ)] for all Lipschitz functions f, and we introduce the antithetic difference operators

    ∆_s^A π_ℓ := π_ℓ − ½ (π^{+,ℓ−e_s} + π^{−,ℓ−e_s}),  ∆^A := ∆_1^A ∆_2^A · · · ∆_r^A. (2.4)

The corresponding Antithetic MIMC estimator is given by

    A^{A,L}(f) := ∑_{ℓ∈L} (f, ∆^A π_{ℓ,N_ℓ}). (2.5)

As we will see in the sequel, the reduction of the variance of A^{A,L}(f) in comparison to A_L(f) can be achieved as a consequence of an appropriately chosen coupling between π_ℓ, π^{+,ℓ−e_s} and π^{−,ℓ−e_s} for each ℓ ∈ L and s ∈ {1, . . . , r}. Using the notation introduced in (1.2), we will apply (2.5) to the specific case of ℓ = (ℓ_1, ℓ_2) and π_ℓ given as the law of X_k^{ℓ_1,ℓ_2} for some fixed k ≥ 1, where

    X_{k+1}^{ℓ_1,ℓ_2} = X_k^{ℓ_1,ℓ_2} − h_{ℓ_2} ∇V(X_k^{ℓ_1,ℓ_2}, ν_{s_{ℓ_1}}^k) + β √(h_{ℓ_2}) Z_{k+1}. (2.6)

In this setting, we call the algorithm A^{A,L}(f) specified by (2.5) the Multi-index Antithetic Stochastic Gradient Algorithm (MASGA). Note that for a fixed t > 0 such that t = k h_{ℓ_2} for some k ≥ 1, the chain (2.6) can be interpreted as a discrete-time approximation, both in time with parameter h_{ℓ_2} and in data with parameter s_{ℓ_1}, of the SDE

    dY_t = −∇V(Y_t, ν_m) dt + β dW_t, (2.7)

where (W_t)_{t≥0} is the standard Brownian motion in R^d. Then MASGA, with π_ℓ given as the law of X_k^{ℓ_1,ℓ_2}, for any finite subset L ⊂ N² provides a biased estimator of (f, π_t) (where π_t := L(Y_t) with Y_t given by (2.7)), the bias stemming from the use of a finite number of levels.
However, since π given by (1.1) is the limiting stationary distribution of (2.7), A^{A,L}(f) can also be interpreted as a biased estimator of (f, π), with an additional bias coming from the difference between (f, π) and (f, π_t) due to the simulation up to a finite time.
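The variance-reduction mechanism behind (2.4) can be isolated in a tiny toy model of the subsampling direction (our own illustration, not the paper's actual coupling construction): let the "fine" quantity average 2s draws, and let the two antithetic "coarse" copies reuse the first and second halves of the same draws. For linear f the antithetic difference vanishes exactly, and for smooth f it is of second order in the half-batch discrepancy:

```python
import numpy as np

def antithetic_subsampling_diff(f, xs):
    """Antithetic difference across one subsampling level: the fine term uses
    all 2s draws; the antithetic coarse pair reuses the two halves of the
    same draws (a shared coupling rather than fresh samples)."""
    half = len(xs) // 2
    fine = f(xs.mean())
    coarse = 0.5 * (f(xs[:half].mean()) + f(xs[half:].mean()))
    return fine - coarse

rng = np.random.default_rng(1)
xs = rng.standard_normal(64)
```

For f(v) = v² a direct computation gives fine − coarse = −(m₁ − m₂)²/4, where m₁, m₂ are the two half-batch means, illustrating the second-order behaviour that drives the improved variance rates.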
Note that the construction of MASGA does not require any knowledge of the structure of the target measure, such as the location of its modes [2,7,49], or any properties of the potential V . However, in order to formulate the main result of this paper, we will use the following set of assumptions.
i.e., the gradients of v and v_0 are twice continuously differentiable, with all partial derivatives of the first and second order bounded (but the gradients themselves not necessarily bounded).
ii) There exists a constant C > 0 such that for all ξ ∈ R^k and for all x ∈ R^d we have |∇_x v(x, ξ)| ≤ C(1 + |x|). iii) There exists a constant K > 0 such that for all ξ ∈ R^k and for all x, y ∈ R^d we have ⟨x − y, ∇_x v(x, ξ) − ∇_x v(y, ξ)⟩ ≥ K |x − y|². Note that the first condition above in particular implies that the gradient ∇V(·, ν_m) of the potential is globally Lipschitz, and the third condition implies that π given via (1.1) is log-concave. Note also that the Bayesian inference example given in (1.5) satisfies Assumptions 2.1 if the functions x → −∇ log π(ξ|x) satisfy all the respective regularity conditions. We remark that we formulate our main result in this section only for the specific form of V given above just for convenience. Our result holds also for a much more general class of potentials, but the assumptions for the general case are more cumbersome to formulate and hence we postpone their presentation to Section 4. Moreover, we stress that assuming convexity of V is not necessary for the construction of our algorithm; it is a choice we made solely to simplify the proofs. By combining our approach with the coupling techniques from [46], it should be possible to extend our results to the non-convex case. This, however, falls beyond the scope of the present paper and is left for future work. We have the following result.

Theorem 2.2. Let Assumptions 2.1 hold and let π_ℓ be the law of X_k^{ℓ_1,ℓ_2} given by (2.6) for some fixed k ≥ 1. Then there exists a set of indices L and a sequence (N_ℓ)_{ℓ∈L} such that for any ε > 0 and for any Lipschitz function f, the estimator A^{A,L}(f) requires a computational cost of order ε^{−2} to achieve mean square error ε for approximating (f, π_t) with t = k h_{ℓ_2}, where π_t := L(Y_t) with Y_t given by (2.7).
The proof of Theorem 2.2 and an explicit description of the sequence (N_ℓ)_{ℓ∈L}, as well as all the details of the relaxed assumptions that can be imposed on V, will be given in Section 4. Note that requiring a computational cost of order ε^{−2} to achieve MSE < ε is the best performance one can expect from Monte Carlo methods, cf. [29].
The two quantities that feature in the optimisation problem formulated in Theorem 2.2 are the MSE and the computational cost (i.e., we want to optimise the cost given the constraint MSE < ε for a fixed ε > 0). Note that the cost is explicitly defined to be

    cost(A^{A,L}(f)) := ∑_{ℓ∈L} N_ℓ C_ℓ, (2.8)

where C_ℓ is the cost of each path at the level ℓ, i.e., the product of k_ℓ = t h_{ℓ_2}^{−1} steps and s_{ℓ_1} subsamples for a given level ℓ = (ℓ_1, ℓ_2). On the other hand, from our discussion of the MSE decomposition (1.3) it is evident that in order to control MSE(A^{A,L}(f)), it is crucial to find upper bounds on the variance V[(f, ∆^A π_ℓ)] and the bias of the estimator at each level ℓ. In our proof we will in fact rely on the complexity analysis from [35] (see also [29] and Theorem 4.12 below for more details), which is concerned exactly with such optimisation problems.
For convenience, from now on we will assume that in our MASGA estimator (2.5), both the number of subsamples and the discretisation parameter are rescaled by two when moving between levels. More precisely, for a fixed s_0 ∈ N_+ and h_0 > 0, we assume that s_{ℓ_1} = 2^{ℓ_1} s_0 and h_{ℓ_2} = 2^{−ℓ_2} h_0 for all ℓ = (ℓ_1, ℓ_2) ∈ N². In this setting, the complexity analysis from [35] tells us, roughly speaking (with more details in Section 4 below), that in order to obtain MSE(A^{A,L}(f)) < ε with cost(A^{A,L}(f)) ≲ ε^{−2}, we need to have

    E[(f, ∆^A π_ℓ)²] ≲ 2^{−β·ℓ} = 2^{−β_1 ℓ_1 − β_2 ℓ_2} (2.9)

for each ℓ ∈ N², where β = (β_1, β_2) and γ = (γ_1, γ_2) ∈ R² are such that γ_1 < β_1 and γ_2 < β_2, with γ characterising the growth of the cost, C_ℓ ≈ 2^{γ·ℓ}. Note that the bound on E[(f, ∆^A π_ℓ)²] trivially implies a bound on V[(f, ∆^A π_ℓ)] of the same order (i.e., with the same β), as well as the bound E[(f, ∆^A π_ℓ)] ≲ 2^{−α·ℓ} with α = β/2, which turns out to be crucial in the complexity analysis from [35] and is the reason why it suffices to verify (2.9), cf. also Theorem 2 in [29] and Theorem 4.12 below. Since, straight from the definition of C_ℓ, it is clear that in our setting we have γ = (1, 1) (recall that C_ℓ ≈ s_{ℓ_1} h_{ℓ_2}^{−1} ≈ 2^{ℓ_1+ℓ_2}), we can infer that in order to prove Theorem 2.2 all we have to do is to find an upper bound on E[(f, ∆^A π_ℓ)²] proportional to 2^{−β·ℓ}, with β = (β_1, β_2) such that β_1 > 1 and β_2 > 1. In the proof of Theorem 2.2 we will in fact obtain β = (2, 2). However, the crucial difficulty in our argument will be to ensure that our upper bound is indeed of the product form 2^{−2ℓ_1} · 2^{−2ℓ_2}, and not merely a bound of order 2^{−2ℓ_1} + 2^{−2ℓ_2}.

Further extensions. The assertion of Theorem 2.2 states that A^{A,L}(f), interpreted as an estimator of (f, π_t), requires a computational cost of order ε^{−2} to achieve mean square error ε. Since the difference between (f, π_t) and (f, π) is of order O(e^{−λt}) for some λ > 0 (cf. the discussion in Appendix 7), this means that A^{A,L}(f), interpreted as an estimator of (f, π), requires a computational cost of order ε^{−2} log(ε^{−1}) to achieve mean square error ε.
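The resulting bookkeeping can be sketched numerically. The toy planner below is ours (the proportionality constants are arbitrary): with variance model V_ℓ ≈ 2^{−β·ℓ} and cost model C_ℓ ≈ 2^{γ·ℓ}, choosing N_ℓ proportional to ε^{−2} √(V_ℓ / C_ℓ) in the spirit of [29,35] keeps the total variance below ε² while the total cost remains of order ε^{−2}:

```python
import math

def mimc_plan(eps, L, beta=(2.0, 2.0), gamma=(1.0, 1.0)):
    """Toy MIMC sample-size plan under the rates of Theorem 2.2."""
    levels = [(l1, l2) for l1 in range(L + 1) for l2 in range(L + 1)]
    V = {l: 2.0 ** -(beta[0] * l[0] + beta[1] * l[1]) for l in levels}   # variance model
    C = {l: 2.0 ** (gamma[0] * l[0] + gamma[1] * l[1]) for l in levels}  # cost model
    s = sum(math.sqrt(V[l] * C[l]) for l in levels)                      # Lagrange constant
    N = {l: max(1, math.ceil(eps ** -2 * s * math.sqrt(V[l] / C[l]))) for l in levels}
    total_cost = sum(N[l] * C[l] for l in levels)
    total_var = sum(V[l] / N[l] for l in levels)
    return N, total_cost, total_var
```

With β = (2, 2) and γ = (1, 1) the sum ∑_ℓ √(V_ℓ C_ℓ) converges, which is precisely the regime where the total cost scales like ε^{−2}.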
However, A^{A,L}(f) could be further modified to remove the log term, by employing the MLMC-in-terminal-time technique introduced in [31], cf. Section 2.3 and Remark 3.8 therein. This would involve taking r = 3 in (2.5), i.e., we would take π_ℓ with ℓ = (ℓ_1, ℓ_2, ℓ_3) to be the law of X_{k_{ℓ_3}}^{ℓ_1,ℓ_2} given by (2.6), hence introducing a sequence of terminal times t_ℓ := k_{ℓ_3} h_{ℓ_2} for the chain (2.6), changing at each level. However, in the resulting definition (2.10) of A^{A,L}(f) we would use the antithetic difference operators ∆^A only with respect to the subsampling level parameter ℓ_1 and the discretisation level parameter ℓ_2, while applying the plain difference operator ∆ to the terminal-time level parameter ℓ_3. The details of how to construct the sequence of terminal times t_ℓ := k_{ℓ_3} h_{ℓ_2} can be found in [31]. We skip this modification in the present paper in order to keep the notation as simple as possible.
In the setting where the computational complexity is ε^{−2} (as it is for MASGA), it is possible to easily modify the biased estimator A^{A,L}(f) to obtain its unbiased counterpart. Indeed, let M = (M_1, . . . , M_r) be a random variable on N^r, independent of (∆^A π_{ℓ,N_ℓ})_{ℓ∈L}. Define L_M := {ℓ ∈ N^r : ℓ_1 ≤ M_1, . . . , ℓ_r ≤ M_r} and

    A^U_{A,L_M}(f) := ∑_{ℓ∈L_M} (f, ∆^A π_{ℓ,N_ℓ}) / P(M ≥ ℓ),

where M ≥ ℓ is understood component-wise. One can see that A^U_{A,L_M}(f) is then an unbiased estimator of (f, π): taking the expectation over M turns each weight P(M ≥ ℓ)^{−1} into 1, and unbiasedness follows due to (2.2). It turns out that in order for the variance of A^U_{A,L_M}(f) to be finite, we need to be in the regime where the computational complexity of the original estimator is ε^{−2}. We refer the reader to [52,13] for more details and recipes for constructing M. The methods from [52] have recently been extended to more general classes of MCMC algorithms in [40], which contains further discussion of the benefits and costs of debiasing.
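For intuition, the debiasing mechanism can be checked exactly in a scalar toy example with r = 1 (our own illustration; cf. [52,13] for the actual constructions of M). Over the finite support of M, the inverse-probability weights cancel in expectation:

```python
import numpy as np

def tail_probs(probs):
    """tail[l] = P(M >= l) for a finitely supported randomisation level M."""
    return np.cumsum(probs[::-1])[::-1]

def truncated_value(a, tail, M):
    """sum_{l <= M} Delta a_l / P(M >= l): the single-index analogue of the
    unbiased estimator above, with the Monte Carlo noise stripped out."""
    total = 0.0
    for l in range(M + 1):
        delta = a(l) - (a(l - 1) if l > 0 else 0.0)
        total += delta / tail[l]
    return total

a = lambda l: 1.0 - 2.0 ** -(l + 1)     # toy biased approximations a_l -> 1
probs = np.array([0.4, 0.3, 0.2, 0.1])  # P(M = l), l = 0,...,3
tail = tail_probs(probs)
expectation = sum(probs[M] * truncated_value(a, tail, M) for M in range(4))
```

Here the expectation over M equals a_3 = 0.9375, the deepest reachable level: the weights 1/P(M ≥ ℓ) exactly undo the random truncation.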

Literature review
The idea of employing the MLMC apparatus to improve the efficiency of stochastic gradient algorithms has been studied before. In [31] we introduced a Multi-level Monte Carlo (MLMC) method for SGLD in the global convexity setting, based on a number of decreasing discretisation levels. We proved that under certain assumptions on the variance of the estimator of the drift, this technique can indeed improve the overall performance of the Monte Carlo method with stochastic gradients. In [46] we extended our approach to cover the non-convex setting, allowing for sampling from probability measures that satisfy a weak log-concavity-at-infinity condition. However, the computational complexity of such algorithms is sub-optimal. As we will observe in Section 3 with numerical experiments, the crucial insight of the present paper (and a novelty compared to [31,46]) is the application of MLMC with respect to the subsampling parameter. Note that, in a different context, the idea of applying MLMC to stochastic approximation algorithms has been studied in [26], see also [18,17]. At the core of our analysis of Multi-level Monte Carlo estimators lies the problem of constructing the right couplings between Euler schemes (2.6) on different discretisation levels. For Euler schemes with standard (accurate) drifts this is done via a one-step analysis by coupling the driving noise in a suitable way, cf. Sections 2.2 and 2.5 in [46] and Section 2.4 in [25]. However, in the case of SGLD, one faces the additional problem of coupling the drift estimators used on different discretisation levels. In both [31] and [46] we addressed this issue in the simplest possible way, by coupling the drift estimators independently. In the present paper we show that by employing a non-trivial coupling we can substantially reduce the variance of MLMC and thus obtain the required bound V[(f, ∆^A π_ℓ)] ≲ 2^{−2ℓ_1−2ℓ_2}, as explained above. We achieve this by utilising the antithetic approach to MLMC as defined in (2.4).
Related ideas for antithetic multi-level estimators have been used e.g. in [32,33,54]. However, in the present paper we apply this concept for the first time to Euler schemes with inaccurate drifts. We also remark that, due to our bounds on second moments, we can easily derive confidence intervals for MASGA using Chebyshev's inequality. However, it would also be possible to derive a Central Limit Theorem and corresponding concentration inequalities, in the spirit of [4,5,41,42].
The remaining part of this paper is organised as follows. In Section 3 we present numerical experiments confirming our theoretical findings. In Section 4 we present a more general framework for the MASGA estimator, explain the intuition behind the antithetic approach to MLMC in more detail (see Example 4.2) and formulate the crucial Lemma 4.13. We also explain how to prove Theorem 2.2 based on Lemma 4.13. In Section 5 we prove Lemma 4.13 in a few steps: we first discuss the antithetic estimator with respect to the discretisation parameter, which corresponds to taking r = 1 and ℓ = ℓ_2 in (2.5); then we consider the antithetic estimator with respect to the subsampling parameter, which corresponds to taking r = 1 and ℓ = ℓ_1 in (2.5); and, finally, we explain how these approaches can be combined in a multi-index estimator with r = 2 and ℓ = (ℓ_1, ℓ_2) to prove Lemma 4.13. Several technical proofs are postponed to the Appendices.

Numerical experiments
We showcase the performance of the MASGA estimator in Bayesian inference problems, combining different Bayesian models and priors. The code for all the numerical experiments can be found at https://github.com/msabvid/MLMC-MIMC-SGD.
In Subsection 3.1 we compare the MASGA estimator introduced in Section 2 with an Antithetic Multi-level Monte Carlo (AMLMC) estimator with respect to the subsampling parameter, corresponding to taking r = 1 and ℓ = ℓ_1 in (2.5). We demonstrate that MASGA indeed achieves the optimal computational complexity. Both estimators are also compared to a standard Monte Carlo estimator for reference. As we shall see, while the performance of MASGA in our experiments is always better than that of AMLMC, the difference is not substantial. This suggests that, from the practical standpoint, the crucial insight of this paper is the application of the antithetic MLMC approach with respect to the subsampling parameter. Hence, in our subsequent experiments in Subsections 3.2 and 3.3, we focus on the AMLMC estimator, which is easier to implement than MASGA. In Subsection 3.2 we check its performance in both convex and non-convex settings, whereas in Subsection 3.3 we compare it to the Stochastic Gradient Langevin Dynamics with Control Variates (SGLD-CV) method introduced in [2]. More precisely, in Subsection 3.3 we present an example of a convex setting in which AMLMC outperforms SGLD-CV. It is worth pointing out that the latter method can be applied only in convex settings, whereas AMLMC is free of such limitations.

MASGA and AMLMC with respect to subsampling
Let us begin by introducing the Bayesian logistic regression setting that we will use for our simulations. The data is modelled by

    p(y_i = 1 | ι_i, x) = g(x^T ι_i),  g(z) := 1 / (1 + e^{−z}),  z ∈ R,

where x ∈ R^d are the parameters of the model that need to be sampled from their posterior distribution, ι_i denotes an observation of the predictive variable in the data, and y_i the binary target variable. Given a dataset of size m, by Bayes' rule, the posterior density of x satisfies

    π(x) ∝ π_0(x) ∏_{i=1}^m g(x^T ι_i)^{y_i} (1 − g(x^T ι_i))^{1−y_i}. (3.2)

In our experiments, we will consider two different priors π_0, namely i) a Gaussian prior π_0 ∼ N(0, I), and ii) a non-log-concave mixture-of-Gaussians prior.
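The gradient of the log-posterior (3.2), together with its minibatch estimate, can be sketched as follows (a NumPy illustration of ours assuming the Gaussian prior; the paper's actual implementation lives in the repository linked above):

```python
import numpy as np

def grad_log_post(x, iota, y, s=None, rng=None):
    """Gradient of log pi(x) for the logistic model (3.2) with a N(0, I) prior.

    With s given, returns an unbiased minibatch estimate based on s indices
    drawn uniformly with replacement, rescaled by m / s as in (1.4)-(1.5).
    """
    m = len(y)
    weight = 1.0
    if s is not None:
        idx = rng.integers(0, m, size=s)
        iota, y, weight = iota[idx], y[idx], m / s
    p = 1.0 / (1.0 + np.exp(-iota @ x))      # g(x^T iota_i)
    return -x + weight * iota.T @ (y - p)    # prior term + likelihood term

iota = np.array([[1.0], [2.0]])
y = np.array([1.0, 0.0])
full = grad_log_post(np.zeros(1), iota, y)
```

The negative of this quantity is the drift ∇_x V fed into the chain (2.6).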
We use Algorithm 1 to empirically calculate the cost of approximating E[f(X)], for a function f, using the MASGA estimator (2.5), such that its MSE is under some threshold ε. Recall that in our notation (f, ∆^A π_ℓ) denotes the integral of f with respect to the antithetic measure ∆^A π_ℓ given by (2.4), where π_ℓ is specified by the law of the Markov chain (2.6) with the potential V determined from (3.2) in an analogous way as in the Bayesian inference example (1.4). More explicitly, we run the chain

    X_{k+1}^{ℓ_1,ℓ_2} = X_k^{ℓ_1,ℓ_2} − h_{ℓ_2} ∇_x V(X_k^{ℓ_1,ℓ_2}, ν_{s_{ℓ_1}}^k) + β √(h_{ℓ_2}) Z_{k+1}, (3.3)

where the indices τ_i^k, i ∈ {1, . . . , s_{ℓ_1}}, defining ν_{s_{ℓ_1}}^k can correspond to subsampling either with or without replacement (in our simulations we choose the latter).
Below we present the results of our experiments for f(x) = |x|² (we remark that we obtained similar conclusions for f(x) = |x| and hence we skip the latter example to save space).
Furthermore, for the Bayesian logistic regressions we use the covertype dataset [6], which has 581,012 observations and 54 columns. We create a training set containing 20% of the original observations.
On the other hand, for AMLMC with respect to subsampling (denoted below by A^{A,L}_{MLMC}(f)) we take r = 1 and ℓ = ℓ_1 in (2.5). This corresponds to using a fixed discretisation parameter h and applying the antithetic MLMC estimator only with respect to the subsampling parameter.
Note that in our experiments we apply the estimators to approximate (f, π_t^h), where π_t^h is the law given by the chain X_k with t = kh, defined with a fixed discretisation parameter h (i.e., we do not take into account the error between (f, π_t^h) and (f, π_t) when calculating the MSE, where π_t is the law of the SDE (2.7)). The value of this h is determined by the finest level used in MASGA, i.e., h = h_L.
The experiment is organised as follows: i) let L ≥ 1, and let (ℓ_s, ℓ_d) = (L, L) be the highest multi-indices used in the calculation of A^{A,L}_{MASGA}(f).
while not converged do
  Calculate the number of paths N_{ℓ_1,ℓ_2} required at each level (proportional to √(V_ℓ / C_ℓ), cf. [29,35]) and calculate the extra N_{ℓ_1,ℓ_2} − n samples in each multi-index level. Update A^{A,L}(f) and estimate its bias (see [29,35]).
  if the bias estimate is less than ε/2 then
    Set convergence = True, and stop.
  else
    Set L := L + 1, and calculate n samples of (f, ∆^A π_ℓ) for ℓ = (ℓ_1, L), ℓ_1 ≤ L and for ℓ = (L, ℓ_2), ℓ_2 ≤ L.
  end if
end while
return A^{A,L}(f) and the number of samples N_{ℓ_1,ℓ_2} in each level.
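The stopping rule in the loop above can be sketched as follows (our toy version; delta_mean(ℓ) stands in for an empirical estimate of |E (f, ∆^A π_ℓ)|, and the remaining bias is read off from the boundary multi-indices, cf. [29,35]):

```python
def adaptive_level(delta_mean, eps, L_max=10):
    """Increase the largest level L until the estimated remaining bias,
    summed over the boundary multi-indices max(l1, l2) = L,
    drops below eps / 2."""
    L = 1
    while L < L_max:
        bias = sum(delta_mean((l1, l2))
                   for l1 in range(L + 1) for l2 in range(L + 1)
                   if max(l1, l2) == L)
        if bias < eps / 2:
            break
        L += 1
    return L
```

With geometrically decaying differences, the boundary sum halves with each increment of L, so the loop terminates after logarithmically many rounds in 1/ε.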
ii) We measure the bias of A^{A,L}_{MASGA}(f) and set the target accuracy ε accordingly. We repeat the above three steps for L = 1, . . . , 7, in order to measure the cost for different values of ε. We perform this comparison on two data regimes: first on the covertype dataset with 100K observations and 54 covariates, and second on a smaller synthetic dataset with 1K observations and 5 covariates. Results are shown in Figure 3.1, where each ε corresponds to a different value of L (see Table 1).
As expected, the higher the accuracy (the lower the ε), the better the cost of A^{A,L}_{MASGA}(f) compares to the cost of A^{A,L}_{MLMC}(f). Depending on the dataset size, different levels of accuracy of the MSE must be reached before an improvement in the cost becomes noticeable. This is due to the variance introduced by the subsampling noise in the chain X_k^{ℓ_1,ℓ_2} (3.3), which decreases as the dataset size m increases.

AMLMC with respect to subsampling in convex and non-convex settings
In the subsequent experiments, we study the AMLMC estimator with respect to subsampling, i.e., with a fixed discretisation parameter h. We simulate 10 × 2^4 steps of the chain (3.3). We take X_0 to be an approximation of the mode of the posterior, which we pre-compute using Stochastic Gradient Descent to replace the burn-in phase of the Markov chain, cf. [2]. The number of steps and the step size are chosen so as to be consistent with the finest discretisation level of the MASGA experiment provided in the previous section.
A summary of the AMLMC setting is provided in Table 2.
These figures indicate that the total cost of approximating (f, π_t^h) by A^{A,L}(f), as described above, is O(ε^{−2}), even when the prior is not log-concave, as in the case of a mixture of two Gaussians (Figure 3.3).

Bayesian Mixture of Gaussians
For our next experiment, we use the setting from Example 5.1 in [58] and consider a Bayesian mixture of two Gaussians on a 2-dimensional dataset, in order to make the posterior multimodal. Given a dataset of size m, the posterior is obtained by Bayes' rule. For the experiment, we consider a Gaussian prior $\mathcal{N}(0, I)$ for $\pi_0(x)$, and we create a synthetic dataset with 200 observations by sampling from $\xi_i \sim \frac{1}{2}\mathcal{N}(0, 5) + \frac{1}{2}\mathcal{N}(1, 5)$. In this experiment we again take r = 1 and = 1 in (2.5), which corresponds to using a fixed discretisation parameter h and applying the antithetic MLMC estimator only with respect to the subsampling parameter. We then use the same setting as before: we apply our estimator to $(f, \pi^h_t)$, where $\pi^h_t$ is the law given by the chain $X_k$ defined in (3.4) with a fixed discretisation parameter h = 1 (i.e., we do not take into account the error between $(f, \pi^h_t)$ and $(f, \pi_t)$ when calculating the MSE). We simulate $2 \times 10^5$ steps of the chain (2.6), starting from $X_0 = 0$ (see Table 3).
In this example there is the additional difficulty that the posterior has two modes. It is therefore necessary to ensure that the number of steps is large enough for the chain to have explored the whole state space (Figure 3.5).
Results for this experiment are shown in Figure 3.4, indicating that the total cost of approximating $(f, \pi^h_t)$ is again $\mathcal{O}(\varepsilon^{-2})$. We obtain the corresponding rates of decay of the variance and of the absolute mean of $(f, \Delta^A \pi_\ell)$.
(Figure 3.5: histogram of $X_k = (X_{k,1}, X_{k,2})$ sampled from (2.6).)

AMLMC and SGLD with control variates
In this subsection we compare the AMLMC estimator with respect to subsampling against the Stochastic Gradient Langevin Dynamics method with control variates (SGLD-CV) from [2,49]. We work with the standard Gaussian prior $\pi_0$. For SGLD-CV, for a fixed time step size h and a fixed subsample size $s_{\ell_1}$, instead of the process (2.6) we use its control-variate counterpart, where $\beta = 1/\sqrt{m}$ and $\bar{x}$ is a fixed value denoting an estimate of the mode of the posterior $\pi(x)$. We undertake the following steps:
1. We estimate the mode of the posterior $\bar{x}$ by using stochastic gradient descent.
2. For each considered accuracy $\varepsilon$, we run AMLMC with respect to subsampling (as described in Subsection 3.2) to obtain the maximum subsample size $s_{\ell_1}$ necessary to achieve this accuracy.
3. We use each pair $(\varepsilon, s_{\ell_1})$ from the previous step to calculate the cost of SGLD-CV.
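A single step of the SGLD-CV chain can be sketched as follows; the quadratic toy potential is our assumption, chosen because it makes the control variate exact (the recentred subsampled gradient then coincides with the full-data gradient):

```python
import math
import random

def grad_V(x, points):
    # gradient of the toy quadratic potential V(x, nu) = (1/2)*mean_i (x - xi_i)^2
    return x - sum(points) / len(points)

def sgld_cv_step(x, x_bar, grad_full_at_mode, data, s, h, beta, rng):
    # One SGLD-CV step, cf. [2,49]: the subsampled gradient is recentred
    # at the pre-computed mode estimate x_bar, using the same subsample
    # for the correction term.
    batch = [data[rng.randrange(len(data))] for _ in range(s)]
    g = grad_full_at_mode + grad_V(x, batch) - grad_V(x_bar, batch)
    return x - h * g + beta * math.sqrt(2 * h) * rng.gauss(0, 1)
```

For a linear gradient the subsampling noise cancels completely, which is the best case for SGLD-CV; in general the cancellation is only approximate near the mode.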
The AMLMC setting values are listed in Table 4 and results are shown in Figure 3.6.
The SGLD-CV method has been shown in [2,49] to reduce the variance (and hence improve the performance) compared to the standard SGLD and the standard Monte Carlo methods. However, in some numerical examples, as demonstrated in this subsection, this gain can be relatively small compared to the gain from using AMLMC.

General setting for MASGA
We will work in a more general setting than the one presented in Section 2. Namely, we consider an SDE (4.1). For a fixed discretisation parameter h > 0 we consider a discretisation of (4.1) given by $X_{k+1} = X_k + h\,a(X_k) + \beta\sqrt{h}\,Z_{k+1}$, as well as its inaccurate-drift counterpart (4.2). Here $b : \mathbb{R}^d \times \mathbb{R}^n \to \mathbb{R}^d$ is an unbiased estimator of a in the following sense: $(U_k)_{k=0}^\infty$ are mutually independent $\mathbb{R}^n$-valued random variables, independent of $(Z_k)_{k=1}^\infty$, such that (4.3) holds for any k ≥ 0. Note that for each k, the random variable $X_k$ is independent of $U_k$ and that $\mathbb{E}[b(X_k, U_k) \mid X_k] = a(X_k)$. Moreover, note that the framework in which the drift estimator b(x, U) depends on a random variable U is obviously a generalisation of (1.2), since (1.2) is recovered as a special case of (4.2), where $\mathcal{L}(U)$ denotes the law of U. We use the name Stochastic Gradient Langevin Dynamics (SGLD) [49,46,15] to describe (4.2) even in the general abstract setting where b and a are not necessarily of gradient form. This setting, besides having the obvious advantage of being more general than the one presented in Assumptions 2.1, also allows us to reduce notational complexity by replacing sums of gradients with the abstract functions a and b. As a motivation for considering such a general framework, let us discuss an example related to generative models. Example 4.1. Let ν denote an unknown data measure, supported on $\mathbb{R}^D$, and let $\nu_m$ be its empirical approximation. While D is typically very large, in many applications ν can be well approximated by a probability distribution supported on a lower dimensional space, say $\mathbb{R}^d$, with $d \ll D$.
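As a minimal illustration of this abstract setting, the sketch below (with a toy linear drift; the names `a`, `b` and `sgld_step` are ours, not the paper's) implements one step of the chain (4.2) and checks the unbiasedness condition $\mathbb{E}[b(x, U) \mid x] = a(x)$ by averaging over all values of U:

```python
import math
import random

def a(x, data):
    # exact drift: toy linear choice a(x) = -(x - mean(data))
    return -(x - sum(data) / len(data))

def b(x, idx, data):
    # one-sample unbiased estimator b(x, U) with U uniform on {0,...,m-1}
    return -(x - data[idx])

def sgld_step(x, data, h, beta, rng):
    # X_{k+1} = X_k + h * b(X_k, U_k) + beta * sqrt(h) * Z_{k+1}
    idx = rng.randrange(len(data))
    return x + h * b(x, idx, data) + beta * math.sqrt(h) * rng.gauss(0, 1)
```

Averaging b over all m possible index values recovers a exactly, which is precisely the conditional unbiasedness used throughout the analysis.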
The aim of generative models [34] is to map samples from some basic distribution µ supported on $\mathbb{R}^d$ into samples from ν. More precisely, one considers a parametrised map G, and one seeks θ such that $G(\theta)_{\#}\mu$ is a good approximation of ν with respect to a user-specified metric. A popular choice for the generator G is a neural network [34]; in the case of a one-hidden-layer network, $\theta = (\alpha, \beta) \in \mathbb{R}^p \times \mathbb{R}^p$ with an activation function ψ. With this choice of G, the authors of [37] derived a gradient flow equation that minimises a suitably regularised $\mathrm{dist}(G(\theta)_{\#}\mu, \nu_m)$. The gradient flow identified in [37], when discretised, is given by (4.4). We refer the reader to [37] for more details and to [39] for an extension to deep neural networks. One can see that b may depend on the data in a non-linear way, and hence the general setting of (4.2) becomes necessary for the analysis of the stochastic gradient counterpart of (4.4). An application of MASGA to the study of generative models will be further developed in future work.
In order to analyse the MASGA estimator (2.5), we need to interpret the Markov chain as characterized by two parameters: the discretisation parameter h > 0 and the drift estimation parameter s ∈ N that corresponds to the quality of approximation of a(x) by b(x, U s k ), for some mutually independent random variables (U s k ) ∞ k=0 . We will now carefully explain how to implement the antithetic MLMC framework from Section 2 in this setting.
To this end, suppose that we have a decreasing sequence $(h_\ell)_{\ell=0}^L \subset \mathbb{R}_+$ of discretisation parameters and an increasing sequence $(s_\ell)_{\ell=0}^L \subset \mathbb{N}_+$ of drift estimation parameters, for some L ≥ 1. For any $\ell_1, \ell_2 \in \{1, \ldots, L\}$ and any function f we define the difference $\Delta\Phi_{\ell_1,\ell_2}$ as in (4.6). Then we can define a Multi-index Monte Carlo estimator (4.7), where $N_{\ell_1,\ell_2}$ is the number of samples at the (doubly-indexed) level $\ell = (\ell_1, \ell_2)$. Note that (4.7) corresponds to the regular (non-antithetic) MLMC estimator defined in (2.3) with r = 2 and with L levels for both parameters.
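A generic (non-antithetic) MIMC estimator of the form (4.7) can be sketched as follows; the per-level sampler `delta_phi` is a hypothetical stand-in for simulating the coupled chains:

```python
def mimc_estimate(delta_phi, N):
    # Multi-index Monte Carlo estimator (4.7): sum over levels (l1, l2)
    # of the sample mean of Delta Phi at that level.
    # delta_phi(l1, l2) draws one sample of the level difference;
    # N[(l1, l2)] is the per-level sample count N_{l1,l2}.
    total = 0.0
    for (l1, l2), n in N.items():
        total += sum(delta_phi(l1, l2) for _ in range(n)) / n
    return total
```

The estimator is simply a double telescoping sum of per-level sample means; the whole analysis of Section 4 concerns how fast the variance of `delta_phi` decays in $(\ell_1, \ell_2)$.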
We will now explain how to obtain the MASGA estimator (2.5) from (4.7) by replacing the difference operator $\Delta\Phi_{\ell_1,\ell_2}$ with its antithetic counterpart. The chain at the finer discretisation level uses the same driving noise $(Z_k)_{k\in\mathbb{N}}$ as the chain at the coarser level; however, we choose $\hat{Z}_{k+2} := (Z_{k+1} + Z_{k+2})/\sqrt{2}$ for all k ≥ 0, which corresponds to using the synchronous coupling between levels (and turns out to be a good choice in the global convexity setting of Assumptions 2.1, cf. [31]). Moreover, note that since the chain $X^{s_{\ell_1}, h_{\ell_2}}$ moves twice as frequently as $X^{s_{\ell_1}, h_{\ell_2 - 1}}$, it needs twice as many random variables $U^{s_{\ell_1}, h_{\ell_2}}$ as $X^{s_{\ell_1}, h_{\ell_2 - 1}}$ needs $U^{s_{\ell_1}, h_{\ell_2 - 1}}$. This can be interpreted as having to choose how to estimate the drift twice as frequently (at each step of the chain).
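The synchronous coupling of the driving noise between consecutive discretisation levels can be sketched as follows (an illustrative helper; note the identity $\sqrt{2h}\,\hat{Z}_{k+2} = \sqrt{h}\,Z_{k+1} + \sqrt{h}\,Z_{k+2}$, so fine and coarse chains see the same Brownian increments):

```python
import math
import random

def coupled_increments(n_coarse, rng):
    # Fine chain (step h) uses Z_{k+1}, Z_{k+2}; coarse chain (step 2h)
    # uses Zhat = (Z_{k+1} + Z_{k+2}) / sqrt(2): the synchronous coupling.
    fine, coarse = [], []
    for _ in range(n_coarse):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        fine += [z1, z2]
        coarse.append((z1 + z2) / math.sqrt(2))
    return fine, coarse
```

Each $\hat{Z}$ is again standard normal (a sum of two independent normals rescaled by $1/\sqrt{2}$), so the coarse chain has the correct marginal law while remaining strongly correlated with the fine chain.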
The idea of the antithetic estimator (with respect to the discretisation parameter) involves replacing $f(X^{s_{\ell_1}, h_{\ell_2 - 1}})$ in (4.6) with the mean of its two conditionally independent copies, both of which reuse the random variables of the finer chain to estimate the drift, instead of drawing their own independent copies of U. Hence the term $f(X^{s_{\ell_1}, h_{\ell_2 - 1}})$ appearing in (4.6) is replaced with the corresponding antithetic term, and the same can be done for any fixed $s_{\ell_1}$. Let us explain the intuition behind this approach on a simple example with f(x) = x and a state-independent drift a.
Suppose that the drift estimator is based on drawing indices uniformly from $\{1, \ldots, m\}$ for all $j \in \{1, \ldots, s\}$ (i.e., we sample with replacement s data points from the data set ξ of size m). Consider the standard MLMC estimator with the fine scheme $X^f$ and the coarse scheme $X^c$. Recall from the discussion in Section 2 that our goal is to find a sharp upper bound on the variance (or the second moment) of $X^f_k - X^c_k$ for any k ≥ 1 (which corresponds to bounding the variance of the standard, non-antithetic MLMC estimator (4.7) for a Lipschitz function f, cf. the difference (4.6) taken only with respect to the time-discretisation parameter h, with fixed s). We then compute this second moment, using conditioning in the second step together with the fact that b is an unbiased estimator of a.
Hence we can show that, choosing step sizes of the form Ch for some C > 0, we get a variance contribution of order h. On the other hand, if we apply the antithetic approach as in (2.5), then, choosing $X^f_0 = X^{c-}_0 = X^{c+}_0$, the variance contribution vanishes altogether.
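The exact cancellation claimed above for a state-independent drift can be checked numerically: with the fine chain driven by the average of two half-batch drift estimates and the two antithetic coarse copies driven by the halves separately (with the same noise), the antithetic difference is identically zero. A sketch (our own toy construction, not the paper's general chains):

```python
import math
import random

def antithetic_subsampling_chains(drifts, h, noises, x0=0.0):
    # State-independent drift: at each step the fine chain uses the
    # average of the two half-batch drift estimates (b1, b2), while the
    # two antithetic coarse copies use b1 and b2 separately, with the
    # same Gaussian noise z.  Their average then tracks the fine chain
    # exactly, so the variance contribution vanishes.
    xf = xcp = xcm = x0
    for (b1, b2), z in zip(drifts, noises):
        xf += h * 0.5 * (b1 + b2) + math.sqrt(h) * z
        xcp += h * b1 + math.sqrt(h) * z
        xcm += h * b2 + math.sqrt(h) * z
    return xf, 0.5 * (xcp + xcm)
```

In the general state-dependent case the cancellation is only partial, which is exactly why the second-order Taylor analysis of Section 5 is needed.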
In the general case, b(x, U s ) is a nonlinear function of the data and also depends on the state x. Therefore one should not expect that the variance of the drift estimator can be completely mitigated. Nonetheless, careful analysis will allow us to conclude that the application of the antithetic difference operators on all levels in our MASGA estimator allows us to obtain a desired upper bound on the variance as described in Section 2.
Having explained the motivation behind the antithetic approach to MLMC, let us now focus on the antithetic estimator with respect to the drift estimation parameter s. To this end, let us fix $h_{\ell_2}$ and observe that for the chain $X^{s_{\ell_1}, h_{\ell_2}}$, the value of the drift estimation parameter $s_{\ell_1} = 2 s_{\ell_1 - 1}$ is twice the value for the chain $X^{s_{\ell_1 - 1}, h_{\ell_2}}$. In the context of the subsampling drift as in Assumptions 2.1, this corresponds to the drift estimator in $X^{s_{\ell_1}, h_{\ell_2}}$ using twice as many samples as the drift estimator in $X^{s_{\ell_1 - 1}, h_{\ell_2}}$. Hence, instead of using independent samples for $X^{s_{\ell_1 - 1}, h_{\ell_2}}$, we can consider two independent copies of $X^{s_{\ell_1 - 1}, h_{\ell_2}}$, the first of which uses the first half of the random variables U employed by the finer chain, and the second of which uses the other half, with the same convention as in (4.6) for the case of $\ell_1 = 0$ or $\ell_2 = 0$. We can now plug this difference into the definition of the Multi-index Monte Carlo estimator (4.7) to obtain (4.11). Note that this corresponds to the Antithetic MIMC estimator introduced in (2.5), based on the chain (2.6), with $\ell_1$ corresponding to the number of samples s and $\ell_2$ to the discretisation parameter h, but with a more general drift a and its estimator b.
In order to formulate our result for the general setting presented in this section, we need to specify the following assumptions. ii) Global contractivity condition: there exists a constant K > 0 such that the contractivity inequality holds (with the same notation as in Assumptions 2.1). In particular, there exist constants $C_a^{(1)}, C_a^{(2)} > 0$ such that $|D^\alpha a(x)| \le C_a^{(|\alpha|)}$ for all $x \in \mathbb{R}^d$ and for all multi-indices α with |α| = 1, 2.
We remark that condition (4.14) could be easily removed by approximating a non-smooth drift a with suitably chosen smooth functions. This, however, would create additional technicalities in the proof and hence we decided to work with (4.14).
We now impose the following assumptions on the estimator b of the drift a.
Note that obviously Assumption 4.6 implies Assumption 4.5. However, we formulate these conditions separately in order to keep track of the constants in the proofs. Moreover, with the same constant $\sigma^{(4)}$ as in (4.17), we impose Assumption 4.7: the estimator b(x, U), as a function of x, is twice continuously differentiable for any U and, for any $x \in \mathbb{R}^d$, we have $\mathbb{E}|\nabla b(x, U^s) - \nabla a(x)|^4 \le \sigma^{(4)}(1 + |x|^4)$.
Even though this set of conditions is long, the assumptions are in fact rather mild, and they can be verified as in [46], where a similar setting was considered. As it turns out, these conditions hold also in the case of subsampling without replacement. All the details are provided in Appendix 6.
We have the following result.
The key challenge in constructing and analysing MIMC estimators is to establish conditions i)-iii) in Theorem 4.12, i.e., to show that the leading error bounds for the bias, the variance and the cost can be expressed in product form. In fact, there are very few results in the literature that present an analysis giving i)-iii), with the exception of [29, Section 9] and [30]. The bulk of the analysis in this paper is devoted to ii). We remark that the optimal choice of the set of levels $[L]^2$ is dictated by the relationship between (α, β, γ), see [35].
The following lemma will be crucial for the proof of Theorem 4.11: it states that $\mathbb{E}|\Delta^{\mathrm{Ant}}\Phi^{s,h}_{f,k}|^2 \le C h^2 / s^2$.
As we already indicated in Section 2, once we have an upper bound on the second moment (and thus on the variance) of $\Delta^{\mathrm{Ant}}\Phi^{s,h}_{f,k}$ such as in Lemma 4.13, the proof of Theorem 4.11 becomes rather straightforward.

Analysis of AMLMC
The estimator we introduced in (2.5), see also (4.11), can be interpreted as built from two building blocks: the antithetic MLMC estimator with respect to the discretisation parameter, which corresponds to taking r = 1 and = 2 in (2.5), and the antithetic MLMC estimator with respect to subsampling, which corresponds to taking r = 1 and = 1 in (2.5). Let us begin our analysis by focusing on the former.

AMLMC via discretisation
We will analyse one step of the MLMC algorithm, for some fixed level $\ell$. To this end, let us first introduce the fine chain $(X^f_k)_{k\in\mathbb{N}}$ and the coarse chain $(X^c_k)_{k\in 2\mathbb{N}}$, where $(Z_k)_{k=1}^\infty$ are i.i.d. random variables with $Z_k \sim \mathcal{N}(0, I)$ and $\hat{Z}_{k+2} := \frac{1}{\sqrt{2}}(Z_{k+1} + Z_{k+2})$. In order to analyse the antithetic estimator, we also need to introduce two auxiliary chains $X^{c+}$ and $X^{c-}$. Furthermore, we denote $\bar{X}^c_k = \frac{1}{2}(X^{c+}_k + X^{c-}_k)$. Before we proceed, let us list a few simple consequences of Assumptions 4.3. We have the bound $|a(x)| \le L_0(1 + |x|)$ for all $x \in \mathbb{R}^d$, where $L_0 := \max(L, |a(0)|)$. Finally, for all random variables U satisfying (4.3), we have an analogous linear growth bound with $\bar{L}_0 := \sigma^2 h^\alpha + 2\max(L^2, |a(0)|^2)$. Note that (5.3) is an immediate consequence of (4.12), (5.4) follows easily from (4.12) and (4.13) (cf. the proof of Lemma 2.11 in [46]), whereas (5.5) is implied by (4.12) and (4.16), cf. (2.40) in [46]. Throughout our proofs, we will also use uniform bounds on the second and the fourth moments of Euler schemes with time steps h and 2h (for the exact formulas for the constants $C_{IEul}$, $C^{(2h)}_{IEul}$ and related constants, see also Lemma 2.17 in [46]). We now fix $g \in C^2_b(\mathbb{R}^d; \mathbb{R})$ and denote by $C_g^{(1)}, C_g^{(2)}$ positive constants such that $|D^\alpha g(x)| \le C_g^{(|\alpha|)}$ for all $x \in \mathbb{R}^d$ and all multi-indices α with |α| = 1, 2.
We begin by presenting the crucial idea of our proof. We use the Taylor formula to expand $g(X^f_k) - g(X^{c-}_k)$, and we express $g(\bar{X}^c_k) - g(X^{c+}_k)$ in an analogous way. Note that $\bar{X}^c_k - X^{c+}_k = -(\bar{X}^c_k - X^{c-}_k)$, and hence the first order terms in the Taylor expansions cancel out. Thus, using the inequality $\|\nabla^2 g(x)\|_{op} \le C_g^{(2)}$ for all $x \in \mathbb{R}^d$, where $\|\nabla^2 g\|_{op}$ is the operator norm of the Hessian matrix, we obtain a bound in terms of second order terms. Note that we purposefully introduced the term $\mathbb{E}|X^{c+}_k - X^{c-}_k|^4$, since it will provide us with an improved rate in h. Indeed, we have the following result.
Note that the constant $C_1$ above is of order $O(s^{-2})$ due to Assumptions 4.5 and 4.6. Hence the bound in (5.8) is in fact of order $O(h^2 s^{-2})$, which is exactly what is needed in Lemma 4.13.
We remark that, in principle, it would now be possible to bound also the first term on the right hand side of (5.7), and hence to obtain a bound on the variance of the antithetic MLMC estimator with respect to discretisation, corresponding to taking r = 1 and = 2 in (2.5). However, such an estimator does not perform on par with the MASGA estimator (even though it is better than the standard MLMC), and hence we skip its analysis. In this subsection, we present only the derivation of the inequality (5.7) and we formulate the lemma about the bound on the term $\mathbb{E}|X^{c+}_k - X^{c-}_k|^4$, since they will be needed in our analysis of MASGA. We remark that the proof of Lemma 5.1 is essentially identical to the proof of Lemma 5.2 below (with different constants) and is therefore skipped. The latter proof can be found in the Appendix.

AMLMC via subsampling
As the second building block of our MASGA estimator, we discuss a multi-level algorithm for subsampling that involves taking different drift estimators (different numbers of samples) at different levels, but with constant discretisation parameter across the levels.
We fix $s_0 \in \mathbb{N}_+$ and for $\ell \in \{0, \ldots, L\}$ we define $s_\ell := 2^\ell s_0$, and we consider chains in which $U^f_k$ is from a higher level (in the parameter s) than $U^c_k$, i.e., we have $b(x, U^f_k) = \frac{1}{2} b(x, U^{f,1}_k) + \frac{1}{2} b(x, U^{f,2}_k)$, where $U^{f,1}_k$ and $U^{f,2}_k$ are both from the same level (in the parameter s) as $U^c_k$, cf. Assumption 4.10. In the special case of subsampling, the random variables U correspond to uniform draws from $\{1, \ldots, m\}$. In order to introduce the antithetic counterpart of (5.9), we replace the random variable $U^c_k$ taken on the coarse level with the two components of $U^f_k$, obtaining the chains $X^{c-}_k$ and $X^{c+}_k$. We also put $\bar{X}^c_k := \frac{1}{2}(X^{c-}_k + X^{c+}_k)$. Following our calculations from the beginning of Section 5.1, we see that for any Lipschitz function g we need a bound similar to the one in Lemma 5.1.

Lemma 5.2. Let Assumptions 4.3, 4.5, 4.6 and 4.9 hold. For $X^{c+}_k$ and $X^{c-}_k$ defined in (5.10), if $X^{c+}_0 = X^{c-}_0$, then for any k ≥ 1 and any $h \in (0, h_0)$ the stated bound holds, where the constants $c_1$, ε and $h_0$ are chosen appropriately. Using the Lipschitz property of g, we see that, in order to deal with the first term on the right hand side of (5.11), we need to bound $\mathbb{E}|X^f_k - \bar{X}^c_k|^2$. Indeed, we have the following result.
Here $C_2$ depends on $\sigma^{(4)}$ and the moment constants, with $C_1$ and $c_1$ given in Lemma 5.2, whereas $c_2$, $\varepsilon_1$ and $h_0$ are chosen so that an appropriate contraction condition involving K and $C_a^{(2)}$ holds. Note that both $C_1$ and $C_2$ in Lemma 5.2 and Lemma 5.3 are of order $O(s^{-2})$, which follows from the dependence of both these constants on the parameters $\sigma^{(4)}$ and $\sigma^2$ in Assumptions 4.5 and 4.6. The proofs of both these lemmas can be found in Appendix 9.

Proof of Lemma 4.13
Similarly as in Sections 5.1 and 5.2, we will analyse our estimator step-by-step. To this end, we first need to define nine auxiliary Markov chains. In what follows, we will combine the ideas for antithetic estimators with respect to the discretisation parameter and with respect to the subsampling parameter. We will therefore need to consider fine and coarse chains with respect to both parameters. We use the notational convention $X^{\text{subsampling},\text{discretisation}}$; hence, e.g., $X^{f,c}$ would be a chain that behaves as a fine chain with respect to the subsampling parameter and as a coarse chain with respect to the discretisation parameter. We define three chains that move as fine chains with respect to the discretisation parameter and six chains that move as coarse chains. Here, and likewise for the other chains, $\sqrt{2h}\,\hat{Z}_{k+2} = \sqrt{h}\,Z_{k+1} + \sqrt{h}\,Z_{k+2}$. In order to prove Lemma 4.13, we will first show that for any k ≥ 1 we have $\mathbb{E}|\Delta^{\mathrm{Ant}}\Phi^{s,h}_{g,k}|^2 \le C\,\mathbb{E}|\Delta^{\mathrm{Ant}}\Phi^{s,h}_{k}|^2 + \tilde{C}h^2$, where C, $\tilde{C}$ are positive constants and $\tilde{C}$ is of order $O(s^{-2})$. Here $\Delta^{\mathrm{Ant}}\Phi^{s,h}_{k}$ corresponds to taking $\Delta^{\mathrm{Ant}}\Phi^{s,h}_{g,k}$ with g(x) = x the identity function. Then we will show that for all k ≥ 1 we have $\mathbb{E}|\Delta^{\mathrm{Ant}}\Phi^{s,h}_{k+2}|^2 \le (1 - ch)\,\mathbb{E}|\Delta^{\mathrm{Ant}}\Phi^{s,h}_{k}|^2 + Ch^3$ for some constants c, C > 0, where C is of order $O(s^{-2})$. Finally, we will conclude that for all k ≥ 1 we have $\mathbb{E}|\Delta^{\mathrm{Ant}}\Phi^{s,h}_{k}|^2 \le C_1 h^2/s^2$ for some $C_1 > 0$, which will finish the proof.
In order to simplify the notation, we define here one step of the MASGA estimator as $\Psi^g_k$. Let us first focus on the analysis of $\mathbb{E}|\Psi^g_{kh}|^2$. To this end, we introduce three additional chains. Observe that in the expression for $\mathbb{E}|\Psi^g_{kh}|^2$ we have three rows of the same structure as the antithetic estimator via discretisation; hence we can proceed exactly as in Section 5.1 by adding and subtracting $g(\bar{X}^{f,c}_k)$, $g(\bar{X}^{c-,c}_k)$ and $g(\bar{X}^{c+,c}_k)$, respectively, in each row, and then applying Taylor's formula at the points $\bar{X}^{f,c}_k$, $\bar{X}^{c-,c}_k$ and $\bar{X}^{c+,c}_k$, respectively, in each row (note that the first order terms cancel out), to obtain a bound in terms of remainders T, where $|T(x)|^2 \le C|x|^4$ for some C > 0 and any $x \in \mathbb{R}^d$. Using $(a+b)^2 \le 2a^2 + 2b^2$ for $\mathbb{E}|\Psi^g_{kh}|^2$, we can separate the terms involving T(·) and, due to Lemma 5.1, all the terms expressed through T(·) have the correct order in h and s. Hence there remains one term whose dependence on s and h we still need to check. We will now need yet another auxiliary chain $\bar{X}^{c,f}_k$, and we apply Taylor's formula once more (the first order terms again cancel out). Hence we only need to deal with the remaining term: we can add and subtract $g(\bar{X}^{f,c}_k)$ and, using the Lipschitz property of g (which it satisfies with some Lipschitz constant $L_g > 0$), we see that both resulting terms on the right hand side correspond to antithetic estimators via subsampling; hence from Lemma 5.3 we infer that $\mathbb{E}|X^{f,c}_k - \frac{1}{2}(X^{c-,c}_k + X^{c+,c}_k)|^2$ has the correct order in s and h. We have thus demonstrated that for any k ≥ 1 we have $\mathbb{E}|\Psi^g_k|^2 \le C_1\,\mathbb{E}|\Psi_k|^2 + C_2 h^2/s^2$ for some constants $C_1, C_2 > 0$. Therefore, in order to finish the proof, it remains to be shown that $\mathbb{E}|\Psi_k|^2$ has the correct order in h and s. As explained above, this will be achieved by proving that there exist constants c, C > 0 (with C of order $O(s^{-2})$) such that for any k ≥ 1 we have $\mathbb{E}|\Psi_{k+2}|^2 \le (1 - ch)\,\mathbb{E}|\Psi_k|^2 + Ch^3$.
The idea for dealing with $\mathbb{E}|\Psi_{k+2}|^2$ is to group the terms in a specific way, add and subtract certain drifts in order to set up appropriate combinations for an application of Taylor's formula, and then cancel out some first order terms. As we have already seen, the remaining second order terms should then be of the correct order in h and s. To this end, we write $\Psi_{k+2}$ in terms of $\Xi_k$ and $\Upsilon_k$ and, observing that all the noise variables cancel out, we bound $\mathbb{E}|\Xi_k + \Upsilon_k|^2 \le 2\,\mathbb{E}|\Xi_k|^2 + 2\,\mathbb{E}|\Upsilon_k|^2$, dealing first with the terms involving $\Upsilon_k$. We will need an additional auxiliary Markov chain (moving as a fine chain with respect to the discretisation parameter). First, in order to deal with $\mathbb{E}\langle \Psi_k, \Upsilon_k\rangle$, we use Taylor's formula and, using Assumptions 4.7 and 4.8, we obtain a bound to which we apply Young's inequality, for each term, with some $\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4 > 0$, respectively; using Assumption 4.3 (condition (4.14)), we then obtain an estimate whose second term on the right hand side is of order $O(h^2/s^2)$ due to Lemma 5.3. For the other three terms, we need the following auxiliary result.
The proof is similar to the proof of Lemma 5.2 and can be found in Appendix 10. The reasoning in the proof of Lemma 5.4 applies also to $X^{c-,f}_k$ and $X^{c+,f}_k$ in place of $X^{f,f}_k$. Hence we see that $\mathbb{E}\langle \Psi_k, \Upsilon_k\rangle$ is bounded from above by an expression of the form $C_1(\varepsilon_1 + \varepsilon_2 + \varepsilon_3 + \varepsilon_4) h\,\mathbb{E}|\Psi_k|^2 + C_2 h$, where the $\varepsilon_i$, $i \in \{1, \ldots, 4\}$, can be chosen as small as necessary and the constant $C_2$ is of the correct order in s and h, i.e., of order $O(h^2/s^2)$. We will explain later how to handle the other terms and how to choose the values of the $\varepsilon_i$. Now, in order to deal with $\mathbb{E}|\Upsilon_k|^2$, we use a different decomposition for $\Upsilon_k$, with $\bar{X}^{c,f}_{k+1} := \frac{1}{2}(X^{c-,f}_{k+1} + X^{c+,f}_{k+1})$. Hence, using (4.15), we obtain a bound in which the first two terms on the right hand side are identical and have the correct order in s and h due to Lemma 5.3, while the third term can be dealt with by one more application of Taylor's formula (cf. the calculation for the term $J_{33}$ in the proof of Lemma 5.3 in Appendix 9 for more details). The terms involving $\Xi_k$ can be handled using similar ideas. In particular, for $\mathbb{E}|\Xi_k|^2$ we have the following result.
Lemma 5.5. Let Assumptions 4.3, 4.5, 4.6, 4.7 and 4.9 hold. Assuming all the auxiliary chains introduced above are initiated at the same point, there exists a constant $C_{1,\Xi} > 0$ such that the stated bound holds for all k ≥ 1. The proof of Lemma 5.5 can be found in Appendix 10. The last term to deal with is $\mathbb{E}\langle \Psi_k, \Xi_k\rangle$. We have the following result.
The crucial insight about the bound in Lemma 5.6 is that, thanks to the special structure of the term $\Xi_k = \Xi^1_k - \frac{1}{2}(\Xi^2_k + \Xi^3_k)$, we can extract from $\mathbb{E}\langle \Psi_k, \Xi_k\rangle$, after a few applications of Taylor's formula, a term of the form $h\,\mathbb{E}\langle \Psi_k, \sum_{|\alpha|=1} D^\alpha a(X_k)(\Psi_k)^\alpha \rangle$, which, due to Assumptions 4.3, can be bounded from above by $-Kh\,\mathbb{E}|\Psi_k|^2$. This gives us the term with the constant $C_{3,\Xi}$ in Lemma 5.6. Then, after combining all our estimates and choosing all the $\varepsilon_i > 0$ small enough, we can indeed conclude that for any k ≥ 1 we have $\mathbb{E}|\Psi_{k+2}|^2 \le (1 - ch)\,\mathbb{E}|\Psi_k|^2 + Ch^3$ with C of order $O(s^{-2})$, which, as explained above, finishes the proof.
The proof of Lemma 5.6 is lengthy and tedious but elementary and hence is moved to Appendix 10.

Appendix: Bounds on moments of subsampling estimators
In this section we present bounds for subsampling both with and without replacement [58,53]. We fix $s, m \in \mathbb{N}$ such that s < m. Let $\theta_i \in \mathbb{R}^d$, for i = 1, . . . , m. Moreover, let $U = (U_i)_{i=1,\ldots,s}$ be a collection of s independent random variables, uniformly distributed over the set {1, . . . , m}. We define a and b as in (6.1), where $\bar{b} : \mathbb{R}^d \times \mathbb{R}^k \to \mathbb{R}^d$ is a kernel satisfying, for any $x, y \in \mathbb{R}^d$ and any $\theta \in \mathbb{R}^k$, suitable Lipschitz and contractivity conditions with some constants L, K > 0. Hence b is an unbiased estimator of a that corresponds to sampling with replacement s terms from the sum of m terms defining a, cf. Example 2.15 in [46]. Moreover, Assumptions 4.3 and 4.4 are satisfied with constants L, K and $\bar{L} = L$. We will now verify Assumptions 4.5 and 4.6.
Note that related calculations for second moments of subsampling estimators were carried out in [46] (see Example 2.15 therein) for the drift a and its estimator b in (6.1) rescaled by m, that is, for $a(x) = \sum_{i=1}^m \bar{b}(x, \theta_i)$ and $b(x, U_k) = \frac{m}{s}\sum_{i=1}^s \bar{b}(x, \theta_{U_i})$. Hence, obviously, all the upper bounds on second moments obtained in [46] still hold for a and b given by (6.1), after rescaling by $1/m^2$.
Based on the calculations in [46], we know that if we assume that for all θ and x we have $|\bar{b}(x, \theta)|^2 \le C(1 + |x|^2)$ with some constant C > 0, then $\mathbb{E}|b(x, U) - a(x)|^2 \le \frac{1}{s} C(1 + |x|^2)$, which verifies Assumption 4.5 for the subsampling-with-replacement scheme. Let us now define a new estimator $b_{wor}(x, U) := \frac{1}{s}\sum_{j=1}^m \bar{b}(x, \theta_j) Z_j$, where $(Z_j)_{j=1}^m$ are correlated random variables such that $\mathbb{P}(Z_j = 1) = \frac{s}{m}$, $\mathbb{P}(Z_j = 0) = 1 - \frac{s}{m}$ and $\mathbb{P}(Z_i = 1, Z_j = 1) = \binom{m-2}{s-2}/\binom{m}{s}$ for any $i, j \in \{1, \ldots, m\}$ with $i \neq j$. Note that this definition of $b_{wor}$ corresponds to sampling s terms without replacement from the sum of m terms defining a in (6.1), see Example 2.15 in [46]. It is immediate to check that this estimator of a is indeed unbiased. In order to bound the variance, we can first compute the second moments of the $Z_j$ and then easily check that $\mathbb{E}|b_{wor}(x, U) - a(x)|^2 \le \frac{1}{s}\left(1 - \frac{s}{m}\right) C(1 + |x|^2)$. Thus we see that the upper bound on the variance of the estimator $b_{wor}$ equals the upper bound on the variance of b multiplied by $(1 - \frac{s}{m})$. In particular, this confirms that Assumption 4.5 holds also for the subsampling-without-replacement scheme.
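The finite-population effect behind the factor $(1 - \frac{s}{m})$ can be verified exactly for small m by enumerating all subsets: the exact variance of the without-replacement subsample mean is $\frac{\sigma^2}{s}\cdot\frac{m-s}{m-1}$, strictly smaller than the with-replacement value $\frac{\sigma^2}{s}$ (the code below checks the exact identity, a sketch for scalar data):

```python
from itertools import combinations

def var_subsample_mean_wor(theta, s):
    # exact variance of the without-replacement subsample mean, computed
    # by enumerating all C(m, s) subsets (feasible for small m only)
    means = [sum(c) / s for c in combinations(theta, s)]
    mu = sum(means) / len(means)
    return sum((v - mu) ** 2 for v in means) / len(means)
```

The exact correction factor $(m-s)/(m-1)$ is slightly larger than $(1 - s/m) = (m-s)/m$, which is consistent with the text's bound, since the latter multiplies the (non-sharp) with-replacement constant C.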
Let us now explain how to estimate the fourth centered moments required for Assumption 4.6, based on the assumption that for all θ and x we have $|\bar{b}(x, \theta)|^4 \le C(1 + |x|^4)$. Since the random variables $\bar{b}(x, \theta_{U_i}) - a(x)$ are mutually independent and centered, the fourth moment of their average decomposes into a sum of fourth moments and of products of pairs of second moments. We can then compute, for any i = 1, . . . , m, the required bounds using the linear growth conditions for a(x) and for $\bar{b}(x, \theta)$. Hence the first term on the right hand side of (6.3) can be bounded by a term of order $1/s^3$, and the second term by a term of order $1/s^2$. By analogy, for the estimator without replacement $b_{wor}$ we have the decomposition (6.4); recall, however, that the random variables $Z_i$ are not independent, and hence we need to compute all the terms in the sum. Focusing on the fourth term on the right hand side of (6.4) and using the definition of the random variables $Z_i$, some straightforward computations allow us to conclude that it is bounded by a term of the required order. Dealing in a similar way with the remaining terms, by tedious but otherwise simple computations, we can conclude the desired bound.

On the advisability of subsampling: A simple example
In this section we illustrate the issues that arise in the analysis of the dependence of the cost of SGAs on the parameters m and s. Let us begin by discussing the MSE estimates (1.3) in more detail.
In order to disentangle various approximation errors in our analysis, it is useful to consider the SDE (7.1), where $(W_t)_{t\ge 0}$ is the standard Brownian motion in $\mathbb{R}^d$. We remark that (7.1) is a time-changed version of the SDE $d\bar{Y}_t = -\nabla V(\bar{Y}_t, \nu_m)\,dt + \sqrt{2}\,dW_t$, and both have the same limiting stationary distribution π, cf. [22,59]. In the analysis of the mean square error, for t = kh, we can estimate the MSE by three terms, cf. (7.2). These terms are, in order, the bias (due to the simulation up to a finite time t > 0), the weak time-discretisation error and the Monte Carlo variance. We choose to work with the SDE (7.1) to mitigate the effect of m on the Lipschitz and convexity constants that play the key role in the first two errors in (7.2); see the discussion in [49]. Consequently, we focus in our analysis on the last term in (7.2), i.e., the variance of $\mathcal{A}^{f,k,N}$.
For convenience we assume that h = 1/n for some n ≥ 1, which corresponds to taking n steps in each unit time interval. There are numerous results in the stochastic analysis literature bounding the first term on the right hand side of (7.2) by a quantity of order $O(e^{-t})$, under fairly general assumptions on ∇V, see e.g. [23,24]. Moreover, in our previous paper [46] we carried out a weak error analysis (see Theorem 1.5 therein) that provides an upper bound for the second term in (7.2) of order O(h). Hence, for any algorithm $\mathcal{A}^{f,k,N}$ based on the chain $(X_k)_{k=0}^\infty$ given above, we have the corresponding MSE bound for some λ > 0 (the exponential rate of convergence of the SDE (7.1) to π in the $L^1$-Wasserstein distance). Here Λ(s, m) is a quantity whose exact value depends on the properties of V and the function f, cf. the discussion below. Fix ε > 0 and set $\mathrm{MSE}(\mathcal{A}^{f,k,N}) \approx \varepsilon$. This enforces the following choice of the parameters: $t \approx \lambda^{-1}\log(\varepsilon^{-1})$, $\Lambda(s, m) N \approx \varepsilon^{-2}$, $n \approx \varepsilon^{-1}$. The cost of simulation of our algorithm is defined as the product of the number of paths N, the number of iterations k of each path and the number of data points s used in each iteration. Since t = kh and h = 1/n, we obtain the cost (7.4). The main difficulty in quantifying the cost of such Monte Carlo algorithms stems from the fact that the value of Λ(s, m) is problem-specific and depends substantially on the interplay between the parameters m, s and h. Hence, one may obtain different costs (and thus different answers to the question of the profitability of mini-batching) for different models and different data regimes [49].
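The resulting cost can be evaluated directly (a sketch; Λ(s, m) is treated as a given constant, and the formula cost = N k s with k = t n follows the parameter choices above):

```python
import math

def sga_cost(eps, lam, Lambda_sm, s):
    # cost (7.4) = N * k * s, with t ~ log(1/eps)/lam, n ~ 1/eps,
    # Lambda(s, m) * N ~ eps^{-2}, and k = t * n iterations per path
    t = math.log(1.0 / eps) / lam
    n = 1.0 / eps
    N = 1.0 / (Lambda_sm * eps ** 2)
    return N * (t * n) * s
```

Up to the logarithmic factor, the cost therefore scales as $\varepsilon^{-3}\log(\varepsilon^{-1})$ times $s/\Lambda(s, m)$, which is exactly why the value of Λ(s, m) decides whether mini-batching pays off.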
In order to gain some insight into possible values of Λ(s, m), we consider a simple example of an Ornstein-Uhlenbeck Markov chain $(X_k)_{k=0}^\infty$, where α > 0 and $(\xi_i)_{i=1}^m$ are data points, together with its stochastic gradient counterpart $(\bar{X}_k)_{k=0}^\infty$. Following the calculations in Example 2.15 in [46], and assuming $\bar{X}_0 = X_0$, we can compute the variances explicitly for any j ≥ 1. Since the sum $\sum_{j=1}^k (1 - \alpha h)^{2(k-j)}$ is of order 1/h, we infer that $\mathbb{V}[X_k]$ is of order 1/m, whereas $\mathbb{V}[\bar{X}_k]$ is of order 1/m + h/s. Note that this corresponds to taking f(x) = x in $\mathcal{A}^{f,k,N}$ and demonstrates that, even in this simple case, it is not clear whether the algorithm $\mathcal{A}^{f,k,N}$ based on $\bar{X}_k$ is more efficient than the one based on $X_k$, since the exact values of their costs (7.4) depend on the relation between m, s and h. Note that our analysis in this case is exact, since we used equalities everywhere.
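The claim that subsampling adds a variance contribution of order h/s can be checked with the exact variance recursion of this linear chain: the per-step variance injection (here assumed, for the sketch, to be $\alpha^2 h^2 \sigma^2/s$) accumulates geometrically to $\frac{\alpha h \sigma^2}{s(2 - \alpha h)} = O(h/s)$:

```python
def stationary_variance(alpha, h, per_step_var, n_iter=200000):
    # iterate the exact variance recursion v <- (1 - alpha h)^2 v + c
    # of the linear chain to its fixed point
    v = 0.0
    for _ in range(n_iter):
        v = (1 - alpha * h) ** 2 * v + per_step_var
    return v

def extra_subsampling_variance(alpha, h, sigma2, s):
    # per-step variance injected by the subsampled drift (assumed
    # alpha^2 h^2 sigma2 / s), accumulated geometrically: O(h/s) overall
    c = alpha ** 2 * h ** 2 * sigma2 / s
    return c / (1 - (1 - alpha * h) ** 2)
```

Because the chain is linear, the recursion is exact, matching the remark that the analysis here uses equalities everywhere.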
Let us also consider the case of $f(x) = x^2$, which turns out to be more cumbersome. We first assume that $\bar{X}_0 = X_0$ and observe that then $\mathbb{E}[\bar{X}_k] = \mathbb{E}[X_k]$ for all k ≥ 0, and hence it is sufficient to compare the variances of the centered versions of $\bar{X}_k$ and $X_k$. More precisely, in our analysis of the algorithm $\mathcal{A}^{f,k,N}$ with $f(x) = x^2$ we compare the respective upper bounds. Expanding the fourth power of the sum as in Section 6 and taking into account all the cross-terms, we see that $\mathbb{E}|X_k - \mathbb{E}[X_k]|^4$ is of order $h/m^2$. On the other hand, using $(a+b)^4 \le 8a^4 + 8b^4$, we can proceed as in [46] and in Section 6 to conclude that the second term on the right hand side of (7.10) is of order $h^3/s^2$. Hence we conclude that for the case $f(x) = x^2$ the algorithm based on $X_k$ has variance of order $h/m^2$, while the algorithm based on $\bar{X}_k$ has variance of order $h/m^2 + h^3/s^2$. Finally, let us analyse $f(x) = \sqrt{x}$. To this end, we again use the centered versions and compare the corresponding quantities. Similarly as in (7.9), we observe that $\mathbb{E}|X_k - \mathbb{E}[X_k]|$ is of order $1/\sqrt{hm}$ (remembering that $\sum_{j=1}^k (1 - \alpha h)^{k-j}$ is of order 1/h). Moreover, similarly as in (7.10), recalling once again from Example 2.15 in [46] that $\mathbb{E}(b_j - \mathbb{E}[b_j])^2$ is of order 1/s for any j ≥ 1, we can use Jensen's inequality to conclude that $\mathbb{E}|b_j - \mathbb{E}[b_j]|$ is of order $1/\sqrt{s}$ and, consequently, that the second term on the right hand side of (7.11) is also of order $1/\sqrt{s}$ (after cancellation of h with the sum $\sum_{j=1}^k (1 - \alpha h)^{k-j}$, which, as we already pointed out, is of order 1/h). Hence for $f(x) = \sqrt{x}$ we observe that $X_k$ leads to variance of order $1/\sqrt{hm}$, whereas $\bar{X}_k$ leads to variance of order $1/\sqrt{hm} + 1/\sqrt{s}$. Again, in all these cases, determining which term is the leading one depends on the interplay between m, s and h. There exist constants $C_{IEul} > 0$ such that for the Markov chain $(X_k)_{k=0}^\infty$ given by (4.2) the following moment bounds hold.
Proof. By a standard computation, we can decompose $\mathbb{E}|X_{k+1}|^4$ into terms $\tilde{I}_1, \ldots, \tilde{I}_{14}$. By conditioning on $X_k$ and $U_k$ and using the properties of the multivariate normal distribution, we see that $\tilde{I}_8 = \tilde{I}_{10} = \tilde{I}_{11} = \tilde{I}_{12} = \tilde{I}_{13} = \tilde{I}_{14} = 0$, since each of these terms involves an odd power of the Gaussian increment. Now observe that due to Assumptions 4.9 and 4.6 we have $\mathbb{E}|b(x, U)|^4 \leq \tilde{L}_0^{(4)}(1 + |x|^4)$. These auxiliary estimates allow us to bound $I_2 \leq h^4 \tilde{L}_0^{(4)} \big(1 + \mathbb{E}|X_k|^4\big)$, and the remaining terms can be treated similarly. Moreover, by conditioning, we bound $I_4$–$I_9$, where in $I_4$ and $I_6$ we use (5.5), and in $I_7$ and $I_9$ we use (5.4). Putting all these estimates together yields the assertion for all $h \in (0, h_0)$.

Proof of Lemma 5.2

Proof. We obtain $B_2 \leq -4hK\, \mathbb{E}\big|X_k^{c+} - X_k^{c-}\big|^4$ by conditioning on $X_k^{c+}$ and $X_k^{c-}$ and using Assumption 4.3(ii). In order to deal with $B_5$, we use the decomposition (9.1). We will now use the inequality $\big(\sum_{j=1}^{n} a_j\big)^k \leq n^{k-1} \sum_{j=1}^{n} a_j^k$, which holds for all $a_j \geq 0$ and all integers $k \geq 2$ and $n \geq 1$ due to the Hölder inequality for sums. In particular, $(a + b + c)^4 \leq 27(a^4 + b^4 + c^4)$, and hence, due to Assumption 4.6, we get $B_5 \leq 27 h^4 \cdot 2\sigma^{(4)}\big(1 + C^{(4),(2h)}\big)$. We bound $B_4$ using the Cauchy–Schwarz inequality, followed by applying the inequalities $ab \leq \frac{1}{2}a^2 + \frac{1}{2}b^2$ and $(a + b)^{1/2} \leq a^{1/2} + b^{1/2}$ several times. In order to deal with $B_3$, we again use the decomposition (9.1). Then, conditioning on $X_k^{c+}$ and $X_k^{c-}$ and using Assumption 4.5, followed by the Cauchy–Schwarz inequality and $ab \leq \frac{1}{2}a^2 + \frac{1}{2}b^2$, we obtain a bound involving a constant $\varepsilon > 0$ to be specified later. Note that here $\sigma^4 = (\sigma^2)^2$, where $\sigma^2$ is given in Assumption 4.5, whereas $\sigma^{(4)}$ appearing in the bounds on $B_4$ and $B_5$ is possibly a different quantity, given in Assumption 4.6. Combining all our estimates together, we obtain, for any $k \geq 1$,
$$\mathbb{E}\big|X_{k+1}^{c+} - X_{k+1}^{c-}\big|^4 \leq (1 - c_1 h)^k\, \mathbb{E}\big|X_0^{c+} - X_0^{c-}\big|^4 + \sum_{j=0}^{k} C_1 (1 - c_1 h)^j h^3 .$$
Taking $X_0^{c+} = X_0^{c-}$ and bounding the sum above by an infinite sum gives $\mathbb{E}\big|X_k^{c+} - X_k^{c-}\big|^4 \leq (C_1/c_1)\, h^2$ for all $k \geq 1$, which finishes the proof.
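The elementary inequality used for $B_5$, namely $\big(\sum_{j=1}^{n} a_j\big)^k \leq n^{k-1} \sum_{j=1}^{n} a_j^k$ for $a_j \geq 0$, together with its special case $(a+b+c)^4 \leq 27(a^4+b^4+c^4)$, can be sanity-checked on random inputs (the values below are arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Check (sum a_j)^k <= n^{k-1} * sum a_j^k for a_j >= 0, a consequence of
# the Hölder inequality for sums; allow a tiny relative slack for rounding.
violations = 0
for _ in range(1000):
    n = int(rng.integers(1, 8))
    a = 10.0 * rng.random(n)
    for k in (2, 3, 4):
        if a.sum() ** k > (1 + 1e-12) * n ** (k - 1) * (a ** k).sum():
            violations += 1

# The special case used in the proof: n = 3, k = 4
a1, a2, a3 = 0.3, 1.2, 2.0
special_ok = (a1 + a2 + a3) ** 4 <= 27 * (a1 ** 4 + a2 ** 4 + a3 ** 4)
print(violations, special_ok)
```

Equality in the first inequality is attained when all $a_j$ are equal (in particular for $n = 1$), which is why the check uses a non-strict comparison.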

Proof of Lemma 5.3
Proof. Using $b(X_k^f, U_k^f) = \frac{1}{2} b(X_k^f, U_k^{f,1}) + \frac{1}{2} b(X_k^f, U_k^{f,2})$, we obtain a decomposition whose terms we now estimate. We begin by bounding $J_2$, which we split as $J_2 =: I_1 + I_2 + I_3 + I_4$.
By conditioning on $X_k^f$, $X_k^{c+}$ and $X_k^{c-}$ and using Assumption 4.3 (specifically condition (4.13)), we bound $I_1$ and $I_3$, while for the other terms we have $I_2 = h\, \mathbb{E}\big\langle X_k^f - \bar{X}_k^c,\, a(\bar{X}_k^c) - a(X_k^{c-}) \big\rangle$ and $I_4 = h\, \mathbb{E}\big\langle X_k^f - \bar{X}_k^c,\, a(\bar{X}_k^c) - a(X_k^{c+}) \big\rangle$. We then use the Taylor formula for $a$ and (5.6), together with the Lipschitz condition (4.15). Note that we have $J_{31} + J_{32} = \frac{3}{2} h^2 \tilde{L}^2\, \mathbb{E}\big|X_k^f - \bar{X}_k^c\big|^2$. On the other hand, in order to deal with $J_{33}$, we use the Taylor theorem once more. Recall that we assume (in Assumption 4.7) that $\mathbb{E}|\nabla b(x, U) - \nabla a(x)|^4 \leq \sigma^{(4)}(1 + |x|^4)$ and that $|D^{\alpha} b(x, U)| \leq C_b^{(2)}$ for all multi-indices $\alpha$ with $|\alpha| = 2$. Hence we can bound $J_{33}$, using Young's inequality in the second step. Combining all our estimates together, we see that if we choose $h$, $c_2$ and $\varepsilon_1 > 0$ such that
$$-hK + \tfrac{1}{4} d\, C_a^{(2)} \varepsilon_1 h + \tfrac{3}{2} h^2 \tilde{L}^2 \leq -c_2 h, \qquad (9.5)$$
then we arrive at the contractive estimate (9.6), with a constant depending on $C_{\mathrm{IEul}}$. We can now finish the proof exactly as we did in the proof of Lemma 5.2.
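The Taylor step for the drift $a$ above only uses boundedness of the second derivatives, so the first-order remainder is controlled by $\frac{1}{2} C |y - x|^2$. This can be illustrated with a toy stand-in drift (here $a = \sin$, so $|a''| \leq 1$; this choice is purely illustrative and not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in drift a = sin, with derivative a' = cos and |a''| <= C2 = 1
a, da, C2 = np.sin, np.cos, 1.0

x = rng.normal(size=1000)
y = x + rng.normal(scale=0.1, size=1000)

# First-order Taylor remainder bound: |a(y) - a(x) - a'(x)(y-x)| <= (C2/2)|y-x|^2
remainder = np.abs(a(y) - a(x) - da(x) * (y - x))
max_excess = float(np.max(remainder - 0.5 * C2 * (y - x) ** 2))
print(max_excess)  # non-positive up to floating-point rounding
```

The bound holds pointwise by the Lagrange form of the remainder, which is exactly the mechanism behind the $J_{33}$ estimate.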

Appendix: Proofs for MASGA
Proof of Lemma 5.4. The argument is very similar to the proof of Lemma 5.2 and is in fact even simpler, as here we have only one inaccurate drift. For completeness, we give an outline of the proof anyway. We have the analogous decomposition into terms $B_2, \ldots, B_5$. By conditioning and Assumption 4.3, we have $B_2 \leq -4hK\, \mathbb{E}\big|X_k^{f,f} - X_k\big|^4$. Furthermore, we bound $B_5$ using Assumptions 4.6 and 4.3 and Lemma 8.1. It is now clear that the terms $B_3$ and $B_4$ can be dealt with exactly as the corresponding terms in the proof of Lemma 5.2, and we obtain essentially the same estimates with slightly different constants, which are, however, of the same order in $s$ and $h$.
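The mechanism behind Lemmas 5.3–5.5, splitting the fine drift as $b(x, U) = \frac{1}{2} b(x, U^1) + \frac{1}{2} b(x, U^2)$ and pairing it with two antithetic coarse chains, can be illustrated on an Ornstein–Uhlenbeck chain with mini-batch drift. The sketch below (illustrative parameters and synthetic data with nonzero mean; not the paper's implementation) compares the variance of the antithetic level difference $f(X^f) - \frac{1}{2}\big(f(X^{c+}) + f(X^{c-})\big)$ with the naive difference $f(X^f) - f(X^{c+})$ for $f(x) = x^2$:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative parameters and synthetic data (not from the paper)
m, s, alpha, h, K, n_rep = 500, 5, 1.0, 0.05, 400, 500
xi = 1.0 + rng.normal(size=m)
f = lambda x: x ** 2

xf = np.zeros(n_rep)   # fine chain: drift averaged over a batch of size 2s
xp = np.zeros(n_rep)   # coarse chain driven by the first half-batch
xm = np.zeros(n_rep)   # antithetic coarse chain, driven by the second half
for _ in range(K):
    b1 = xi[rng.integers(0, m, size=(n_rep, s))].mean(axis=1)
    b2 = xi[rng.integers(0, m, size=(n_rep, s))].mean(axis=1)
    noise = np.sqrt(2 * h / m) * rng.normal(size=n_rep)  # shared increment
    xf = xf - alpha * h * (xf - 0.5 * (b1 + b2)) + noise
    xp = xp - alpha * h * (xp - b1) + noise
    xm = xm - alpha * h * (xm - b2) + noise

var_naive = (f(xf) - f(xp)).var()
var_anti = (f(xf) - 0.5 * (f(xp) + f(xm))).var()
print(var_naive, var_anti)
```

Since the drift is linear here, $x^{c+} + x^{c-} = 2 x^f$ along the coupling, so the first-order fluctuations cancel exactly in the antithetic combination and only the second-order term survives; this is the cancellation exploited in the proofs above, and the printed antithetic variance is orders of magnitude smaller.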
Proof of Lemma 5.5. We denote the three components of $\Xi_k$ by $\Xi_k^A$, $\Xi_k^B$ and $\Xi_k^C$, so that $\Xi_k = \Xi_k^A - \Xi_k^B - \Xi_k^C$. Then, using $b(x, U) = \frac{1}{2} b(x, U^1) + \frac{1}{2} b(x, U^2)$, we expand $\mathbb{E}|\Xi_k^A|^2$. Note that the first two terms on the right-hand side of this expansion are identical and have the correct order in $s$ and $h$ due to Lemma 5.3. Furthermore, the last term can be dealt with by applying Taylor's formula twice in $\bar{X}_k^{c,f}$ and using the argument from the proof of Lemma 5.3 for the term $J_{33}$ therein. Bounds for $\mathbb{E}|\Xi_k^B|^2$ and $\mathbb{E}|\Xi_k^C|^2$ can be obtained in exactly the same way.
Note that $X_k^{f,c-} - \bar{X}_k^{f,c} = -\big(X_k^{f,c+} - \bar{X}_k^{f,c}\big)$. Then, expanding $\Xi_k^{2,2}$ and $\Xi_k^{3,2}$ in an analogous way, we see that the same cancellations occur. Now observe the term $\mathbb{E}\big\langle \Psi_k, \Xi_k^{1,3,1,2} \big\rangle$. Recall that we are now dealing with the group $-\Xi_k^{1,3} - \frac{1}{2}\Xi_k^{2,3} + \Xi_k^{3,3}$. Similarly as above, for $\Xi_k^{2,3}$ and $\Xi_k^{3,3}$ we have the corresponding terms for some $\varepsilon_{10}, \varepsilon_{11} > 0$, and we can again apply Lemma 10.1 to conclude that the fourth moments above have the correct order in $s$ and $h$. On the other hand, repeating the analysis for $\Xi_k^{1,3,1}$ above, we see that $\mathbb{E}\big\langle \Psi_k, \Xi_k^{2,3,1,1} \big\rangle$ involves the terms $D^{\alpha} a(X_k)\, (X_k - \bar{X}_k^c)^{\alpha}$ and $\big(D^{\alpha} a(\bar{X}_k^c) - D^{\alpha} a(X_k)\big)\big(\bar{X}_k^{f,c} - \cdots\big)$