Nonstationary phase of the MALA algorithm
Abstract
The Metropolis-Adjusted Langevin Algorithm (MALA) is a Markov Chain Monte Carlo method which creates a Markov chain reversible with respect to a given target distribution, \(\pi ^N\), with Lebesgue density on \({\mathbb {R}}^N\); it can hence be used to approximately sample the target distribution. When the dimension N is large a key question is to determine the computational cost of the algorithm as a function of N. The measure of efficiency that we consider in this paper is the expected squared jumping distance (ESJD), introduced in Roberts et al. (Ann Appl Probab 7(1):110–120, 1997). To determine how the cost of the algorithm (in terms of ESJD) increases with dimension N, we adopt the widely used approach of deriving a diffusion limit for the Markov chain produced by the MALA algorithm. We study this problem for a class of target measures which is not in product form and we address the situation of practical relevance in which the algorithm is started out of stationarity. We thereby significantly extend previous works which consider either measures of product form, when the Markov chain is started out of stationarity, or nonproduct measures (defined via a density with respect to a Gaussian), when the Markov chain is started in stationarity. In order to work in this nonstationary and nonproduct setting, significant new analysis is required. In particular, our diffusion limit comprises a stochastic PDE coupled to a scalar ordinary differential equation which gives a measure of how far from stationarity the process is. The family of nonproduct target measures that we consider in this paper is obtained by discretizing a measure on an infinite-dimensional Hilbert space; the discretized measure is defined by its density with respect to a Gaussian random field.
The results of this paper demonstrate that, in the nonstationary regime, the cost of the algorithm is of \({{\mathcal {O}}}(N^{1/2})\) in contrast to the stationary regime, where it is of \({{\mathcal {O}}}(N^{1/3})\).
Keywords
Markov Chain Monte Carlo, Metropolis-Adjusted Langevin Algorithm, Diffusion limit, Optimal scaling
Mathematics Subject Classification
Primary 60J22; Secondary 60J20, 60H10
1 Introduction
1.1 Context
A widely used approach to tackle this problem is to study diffusion limits for the algorithm. Indeed the scaling used to obtain a well defined diffusion limit corresponds to the optimal scaling of the proposal variance (see Remark 1.1). This problem was first studied in [19], for the Random Walk Metropolis algorithm (RWM); in this work it is assumed that the algorithm is started in stationarity and that the target measure is in product form. In the case of the MALA algorithm, the same problem was considered in [20, 21], again in the stationary regime and for product measures. In this setting, the cost of RWM has been shown to be \({{\mathcal {O}}}(N)\), while the cost of MALA is \({{\mathcal {O}}}(N^{\frac{1}{3}}).\) The same \({{\mathcal {O}}}(N^{\frac{1}{3}})\) scaling for MALA, in the stationary regime, was later obtained in the setting of nonproduct measures defined via density with respect to a Gaussian random field [17]. In the paper [6] extensions of these results to nonstationary initializations were considered, however only for the Gaussian targets. For Gaussian targets, RWM was shown to scale the same in and out of stationarity, whilst MALA scales like \({{\mathcal {O}}}(N^{\frac{1}{2}})\) out of stationarity. In [12, 13] the RWM and MALA algorithms were studied out of stationarity for quite general product measures and the RWM method shown again to scale the same in and out of stationarity. For MALA the appropriate scaling was shown to differ in and out of stationarity and, crucially, the scaling out of stationarity was shown to depend on a certain moment of the potential defining the product measure. In this paper we contribute further understanding of the MALA algorithm when initialized out of stationarity by considering nonproduct measures defined via density with respect to a Gaussian random field. Considering such a class of measures has proved fruitful, see e.g. [15, 17]. Relevant to this strand of literature, is also the work [5].
In this paper our primary contribution is the study of diffusion limits for the MALA algorithm, out of stationarity, in the setting of general nonproduct measures, defined via density with respect to a Gaussian random field. Significant new analysis is needed for this problem because the work of [17] relies heavily on stationarity in analyzing the acceptance probability, whilst the work of [13] uses propagation of chaos techniques, unsuitable for nonproduct settings.
The challenging diffusion limit obtained in this paper is relevant both to the picture just described and, in general, due to the widespread practical use of the MALA algorithm. The understanding we obtain about the MALA algorithm when applied to realistic nonproduct targets is one of the main motivations for the analysis that we undertake in this paper. The diffusion limit we find is given by an SPDE coupled to a one-dimensional ODE. The evolution of such an ODE can be taken as an indicator of how close the chain is to stationarity (see Remark 1.1 for more details on this). The scaling adopted to obtain such a diffusion limit shows that the cost of the algorithm is of order \(N^{1/2}\) in the nonstationary regime, as opposed to what happens in the stationary phase, where the cost is of order \(N^{1/3}\). It is important to recognize that, for measures absolutely continuous with respect to a Gaussian random field, algorithms exist which require \({{\mathcal {O}}}(1)\) steps in and out of stationarity; see [7] for a review. Such methods were suggested by Radford Neal in [16], and developed by Alex Beskos for conditioned stochastic differential equations in [4], building on the general formulation of Metropolis–Hastings methods in [23]; these methods are analyzed from the point of view of diffusion limits in [18]. It thus remains open and interesting to study the MALA algorithm out of stationarity for nonproduct measures which are not defined via density with respect to a Gaussian random field; however the results in [12] demonstrate the substantial technical barriers that will exist in trying to do so. An interesting starting point of such work might be the study of non i.i.d. product measures as pioneered by Bédard [2, 3].
1.2 Setting and the main result
Main Result
Remark 1.1

Since the effective time-step implied by the interpolation (1.9) is \(N^{-1/2}\), the main result implies that the number of steps required by the Markov chain in its nonstationary regime is \({{\mathcal {O}}}(N^{1/2})\). A more detailed discussion on this fact can be found in Sect. 4.
 Notice that Eq. (1.11) evolves independently of Eq. (1.10). Once the MALA algorithm (2.14) is introduced and an initial state \(x^0\in {\tilde{{\mathcal {H}}}}\) is given such that S(0) is finite, the real valued (double) sequence \(S^{k,N}\),
$$\begin{aligned} S^{k,N}:=\frac{1}{N} \sum _{i=1}^N \frac{\left| x^{k,N}_i\right| ^2}{\lambda _i^2} \end{aligned}$$(1.15)
started at \(S_0^N:=\frac{1}{N} \sum _{i=1}^N \frac{\left| x^{0,N}_i\right| ^2}{\lambda _i^2}\), is well defined. For fixed N, \(\{S^{k,N}\}_k\) is not, in general, a Markov process (however it is Markov if e.g. \(\varPsi =0\)). Consider the continuous interpolant \(S^{(N)}(t)\) of the sequence \(S^{k,N}\), namely
$$\begin{aligned} S^{(N)}(t)=(N^{1/2}t-k)S^{k+1,N}+(k+1-N^{1/2}t)S^{k,N}, \quad t_k\le t< t_{k+1}, \,\, t_k=\frac{k}{N^{\frac{1}{2}}}. \end{aligned}$$(1.16)
In Theorem 4.1 we prove that \(S^{(N)}(t)\) converges in probability in \(C([0,T];{\mathbb {R}})\) to the solution of the ODE (1.11) with initial condition \(S_0:=\lim _{N\rightarrow \infty }S_0^N\). Once such a result is obtained, we can prove that \(x^{(N)}(t)\) converges to x(t). We want to stress that the convergence of \(S^{(N)}(t)\) to S(t) can be obtained independently of the convergence of \(x^{(N)}(t)\) to x(t).
 Let \(S(t):{\mathbb {R}}\rightarrow {\mathbb {R}}\) be the solution of the ODE (1.11). We will prove (see Theorem 3.1) that \(S(t) \rightarrow 1\) as \(t\rightarrow \infty \); this is also consistent with the fact that, in stationarity, \(S^{k,N}\) converges to 1 as \(N \rightarrow \infty \) (for every \(k>0\)), see Remark 4.1. In view of this and the above comment, S(t) (or \(S^{k,N}\)) can be taken as an indication of how close the chain is to stationarity. Moreover, notice that \(h_{\ell }(1)=\ell \); heuristically one can then argue that the asymptotic behaviour of the law of x(t), the solution of (1.10), is described by the law of the following infinite dimensional SDE:
$$\begin{aligned} dz(t)=-\ell (z(t)+{\mathcal {C}}\nabla \varPsi (z(t)))dt+ \sqrt{2\ell } dW(t). \end{aligned}$$(1.17)
It was proved in [9, 10] that (1.17) is ergodic with unique invariant measure given by (1.2). Our deduction concerning computational cost is made on the assumption that the law of (1.10) does indeed tend to the law of (1.17), although we will not prove this here as it would take us away from the main goal of the paper which is to establish the diffusion limit of the MALA algorithm.

In [12, 13] the diffusion limit for the MALA algorithm started out of stationarity and applied to i.i.d. target product measures is given by a nonlinear equation of McKean-Vlasov type. This is in contrast with our diffusion limit, which is an infinite-dimensional SDE. The reason why this is the case is discussed in detail in [14, Section 1.2]. The discussion in the latter paper is in the context of the Random Walk Metropolis algorithm, but it is conceptually analogous to what holds for the MALA algorithm and for this reason we do not spell it out here.

In this paper we make stronger assumptions on \(\varPsi \) than are required to prove a diffusion limit in the stationary regime [17]. In particular we assume that the first derivative of \(\varPsi \) is bounded, whereas [17] requires only boundedness of the second derivative. Removing this assumption on the first derivative, or showing that it is necessary, would be of interest but would require different techniques to those employed in this paper and we do not address the issue here.
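The quantities \(S^{k,N}\) in (1.15) and the interpolant \(S^{(N)}(t)\) in (1.16) are straightforward to compute from a stored trajectory. The following is a minimal sketch, assuming the chain is stored coordinate-wise in the eigenbasis of \({\mathcal {C}}\) and that `lam[i]` holds \(\lambda _i\); the function names are hypothetical and chosen only for illustration.

```python
import math


def s_statistic(x, lam):
    """S = (1/N) * sum_i x_i^2 / lambda_i^2, following (1.15)."""
    n = len(x)
    return sum((xi / li) ** 2 for xi, li in zip(x, lam)) / n


def interpolate(seq, n_dim, t):
    """Piecewise-linear interpolant (1.16) of seq[k] on the grid t_k = k / sqrt(n_dim)."""
    u = math.sqrt(n_dim) * t          # position measured in units of chain steps
    k = int(math.floor(u))
    if k == len(seq) - 1 and u == k:  # t sits exactly on the last grid point
        return seq[k]
    if k < 0 or k + 1 >= len(seq):
        raise ValueError("t lies outside the range covered by the sequence")
    return (u - k) * seq[k + 1] + (k + 1 - u) * seq[k]
```

With this time grid, a unit of continuous time corresponds to \(N^{1/2}\) steps of the chain, which is the origin of the \({{\mathcal {O}}}(N^{1/2})\) cost statement.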
Remark 1.2
1.3 Structure of the paper
The paper is organized as follows. In Sect. 2 we introduce the notation and the assumptions that we use throughout this paper. In particular, Sect. 2.1 introduces the infinite dimensional setting in which we work, Sect. 2.2 discusses the MALA algorithm and the assumptions we make on the functional \(\varPsi \) and on the covariance operator \({\mathcal {C}}\). Section 3 contains the proof of existence and uniqueness of solutions for the limiting Eqs. (1.10) and (1.11). With these preliminaries in place, we give, in Sect. 4, the formal statement of the main results of this paper, Theorems 4.1 and 4.2. In this section we also provide heuristic arguments outlining how the main results are obtained. The complete proof of these results builds on a continuous mapping argument presented in Sect. 5. The heuristics of Sect. 4 are made rigorous in Sects. 6–8. In particular, Sect. 6 contains some estimates of the size of the chain’s jumps and the growth of its moments, as well as the study of the acceptance probability. In Sects. 7 and 8 we use these estimates and approximations to prove Theorems 4.1 and 4.2, respectively. Readers interested in the structure of the proofs of Theorems 4.1 and 4.2 but not in the technical details may wish to skip the ensuing two sections (Sects. 2 and 3) and proceed directly to the statement of these results and the relevant heuristics discussed in Sect. 4.
2 Notation, algorithm, and assumptions
In this section we detail the notation and the assumptions (Sects. 2.1 and 2.3, respectively) that we will use in the rest of the paper.
2.1 Notation

x and y are elements of the Hilbert space \({\mathcal {H}}\);

the letter N is reserved to denote the dimensionality of the space \(X^N\) where the target measure \(\pi ^N\) is supported;

\(x^N\) is an element of \(X^N\)\(\cong {\mathbb {R}}^N\) (similarly for \(y^N\) and the noise \(\xi ^N\));

for any fixed \(N \in {\mathbb {N}}\), \(x^{k,N}\) is the kth step of the chain \(\{x^{k,N}\}_{k \in {\mathbb {N}}} \subseteq X^N\) constructed to sample from \(\pi ^N\); \(x^{k,N}_i\) is the ith component of the vector \(x^{k,N}\), that is \(x^{k,N}_i:=\langle x^{k,N}, \phi _i\rangle \) (with abuse of notation).
 Two (double) sequences of real numbers \(\{A^{k,N}\}\) and \(\{B^{k,N}\}\) satisfy \(A^{k,N} \lesssim B^{k,N}\) if there exists a constant \(K>0\) (independent of N and k) such that
$$\begin{aligned} A^{k,N}\le KB^{k,N}, \end{aligned}$$
for all N and k such that \(\{A^{k,N}\}\) and \(\{B^{k,N}\}\) are defined.

If the \(A^{k,N}\)s and \(B^{k,N}\)s are random variables, the above inequality must hold almost surely (for some deterministic constant K).

If the \(A^{k,N}\)s and \(B^{k,N}\)s are real-valued functions on \({\mathcal {H}}\) or \({\mathcal {H}}^s\), \(A^{k,N}= A^{k,N}(x)\) and \(B^{k,N}= B^{k,N}(x)\), the same inequality must hold with K independent of x, for all x where the \(A^{k,N}\)s and \(B^{k,N}\)s are defined.
2.2 The algorithm
We conclude this section by remarking that, if \(x^{k,N}\) is given, the proposal \(y^{k,N}\) only depends on the Gaussian noise \(\xi ^{k,N}\). Therefore the acceptance probability will be interchangeably denoted by \(\alpha ^N\big (x^N,y^N\big )\) or \(\alpha ^N\big (x^N,\xi ^N\big )\).
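For orientation, one step of the algorithm can be sketched in code. This is a minimal illustration rather than the paper's formal definition (2.14): it assumes the state is stored through its coordinates in the eigenbasis of \({\mathcal {C}}_N\), that the proposal is \(y=(1-\delta )x-\delta {\mathcal {C}}_N\nabla \varPsi ^N(x)+\sqrt{2\delta }\,{\mathcal {C}}_N^{1/2}\xi \) (the form that appears in the proof of (6.12)), and that \(Q^N\) is the standard Metropolis–Hastings log-ratio for a target with density proportional to \(\exp (-\varPsi ^N(x)-\frac{1}{2}\Vert x\Vert _{{\mathcal {C}}_N}^2)\). The name `mala_step` and its arguments are hypothetical.

```python
import math
import random


def mala_step(x, lam, psi, grad_psi, delta, rng):
    """One MALA move in eigenbasis coordinates (hypothetical helper).

    x        : current state, list of N coordinates x_i = <x, phi_i>
    lam      : lam[i]**2 are the eigenvalues of the covariance operator C_N
    psi      : callable returning Psi^N(z)
    grad_psi : callable returning the coordinates of grad Psi^N(z)
    delta    : proposal step size, e.g. ell / sqrt(N)
    Returns (next_state, proposal, log_ratio, accepted).
    """
    def drift_mean(z):
        # z - delta * (z + C_N grad Psi^N(z)); C_N multiplies coordinate i by lam_i^2
        g = grad_psi(z)
        return [zi - delta * (zi + li * li * gi) for zi, li, gi in zip(z, lam, g)]

    def log_target(z):
        # log density with respect to Lebesgue measure, up to an additive constant
        return -psi(z) - 0.5 * sum((zi / li) ** 2 for zi, li in zip(z, lam))

    def log_proposal(src, dst):
        # Gaussian proposal with mean drift_mean(src) and covariance 2 * delta * C_N
        m = drift_mean(src)
        return -sum(((di - mi) / li) ** 2
                    for di, mi, li in zip(dst, m, lam)) / (4.0 * delta)

    noise = [rng.gauss(0.0, 1.0) for _ in x]
    y = [mi + math.sqrt(2.0 * delta) * li * ni
         for mi, li, ni in zip(drift_mean(x), lam, noise)]
    log_ratio = (log_target(y) + log_proposal(y, x)
                 - log_target(x) - log_proposal(x, y))
    # accept with probability 1 ∧ exp(log_ratio)
    accepted = rng.random() < math.exp(min(0.0, log_ratio))
    return (y if accepted else x), y, log_ratio, accepted
```

Given the current state, the only randomness entering the proposal is the Gaussian vector `noise`, which is why the acceptance probability can be written as a function of either the proposal or the noise.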
2.3 Assumptions
Assumption 2.1
 1. Decay of Eigenvalues \(\lambda _j^2\) of \({\mathcal {C}}\): there exists a constant \(\kappa > s+\frac{1}{2}\) such that
$$\begin{aligned} j^{-\kappa }\lesssim \lambda _j \lesssim j^{-\kappa }. \end{aligned}$$
 2. Domain of \(\varPsi \): the functional \(\varPsi \) is defined everywhere on \({\mathcal {H}}^s\).
 3. Derivatives of \(\varPsi \): The derivative of \(\varPsi \) is bounded and globally Lipschitz:
$$\begin{aligned} \left\| \nabla \varPsi (x)\right\| _{-s} \lesssim 1,\qquad \left\| \nabla \varPsi (x)- \nabla \varPsi (y)\right\| _{-s} \lesssim \left\| x-y\right\| _{s}. \end{aligned}$$(2.19)
Remark 2.1
The condition \(\kappa > s+\frac{1}{2}\) ensures that \({\mathrm{Trace}}_{{\mathcal {H}}^s}({\mathcal {C}}_s) < \infty \). Consequently, \(\pi _0\) has support in \({\mathcal {H}}^s\) (\(\pi _0({\mathcal {H}}^s)=1\)). \(\square \)
Example 2.1
The functional \(\varPsi (x) = \sqrt{1+\left\| x\right\| _{s}^2}\) satisfies all of the above.
Remark 2.2
Our assumptions on the change of measure (that is, on \(\varPsi \)) are less general than those adopted in [14, 17] and related literature (see references therein). This is for purely technical reasons. In this paper we assume that \(\varPsi \) grows linearly. If \(\varPsi \) was assumed to grow quadratically, which is the case in the mentioned works, finding bounds on the moments of the chain \(\{x^{k,N}\}_{k\ge 1}\) (much needed in all of the analysis) would become more involved than it already is, see Remark C.1. However, under our assumptions, the measure \(\pi \) (or \(\pi ^N\)) is still, generically, of nonproduct form. \(\square \)
We now explore the consequences of Assumption 2.1. The proofs of the following lemmas can be found in Appendix A.
Lemma 2.1
 1. The function \({\mathcal {C}}\nabla \varPsi (x)\) is bounded and globally Lipschitz on \({\mathcal {H}}^s\), that is
$$\begin{aligned} \left\| {\mathcal {C}}\nabla \varPsi (x)\right\| _{s}\lesssim 1 \quad \text{ and } \quad \left\| {\mathcal {C}}\nabla \varPsi (x)-{\mathcal {C}}\nabla \varPsi (y)\right\| _{s}\lesssim \left\| x-y\right\| _{s}. \end{aligned}$$(2.20)
Therefore, the function \(F(z):=-z-{\mathcal {C}}\nabla \varPsi (z)\) satisfies
$$\begin{aligned} \left\| F(x) - F(y)\right\| _{s} \lesssim \left\| x-y\right\| _{s} \quad \text{ and } \quad \left\| F(x)\right\| _{s} \lesssim 1+ \left\| x\right\| _{s}. \end{aligned}$$(2.21)
 2. The function \(\varPsi (x)\) is globally Lipschitz and therefore also \(\varPsi ^N(x):=\varPsi ({\mathcal {P}}^N(x))\) is globally Lipschitz:
$$\begin{aligned} \left| \varPsi ^N(y)-\varPsi ^N(x)\right| \lesssim \left\| y-x\right\| _{s}. \end{aligned}$$(2.22)
Lemma 2.2
 1. If the bounds (2.19) hold for \(\varPsi \), then they hold for \(\varPsi ^N\) as well:
$$\begin{aligned} \left\| \nabla \varPsi ^N(x)\right\| _{-s}\lesssim 1,\qquad \left\| \nabla \varPsi ^N(x)- \nabla \varPsi ^N(y)\right\| _{-s} \lesssim \left\| x-y\right\| _{s}. \end{aligned}$$(2.24)
 2. Moreover,
$$\begin{aligned} \left\| {\mathcal {C}}_N\nabla \varPsi ^N(x)\right\| _s\lesssim 1, \end{aligned}$$(2.25)
and
$$\begin{aligned} \left\| {\mathcal {C}}_N\nabla \varPsi ^N(x)\right\| _{{\mathcal {C}}_N}\lesssim 1. \end{aligned}$$(2.26)
3 Existence and uniqueness for the limiting diffusion process
The main results of this section are Theorems 3.1, 3.2 and 3.3. Theorems 3.1 and 3.2 are concerned with establishing existence and uniqueness for Eqs. (1.10) and (1.11), respectively. Theorem 3.3 states the continuity of the Itô maps associated with Eqs. (1.10) and (1.11). The proofs of the main results of this paper (Theorems 4.1 and 4.2) rely heavily on the continuity of such maps, as we illustrate in Sect. 5. Once Lemma 3.1 below is established, the proofs of the theorems in this section are completely analogous to the proofs of those in [14, Section 4]. For this reason, we omit them and refer the reader to [14]. In what follows, recall that the definition of the functions \(\alpha _{\ell }, h_{\ell }\) and \(b_{\ell }\) has been given in (1.12), (1.13) and (1.14), respectively.
Lemma 3.1
The functions \(\alpha _{\ell }(s)\), \(h_{\ell }(s)\) and \(\sqrt{h_{\ell }(s)}\) are positive, globally Lipschitz continuous and bounded. The function \(b_{\ell }(s)\) is globally Lipschitz and it is bounded above but not below. Moreover, for any \(\ell >0\), \(b_{\ell }(s)\) is strictly positive for \(s\in [0,1)\), strictly negative for \(s>1\) and \(b_{\ell }(1)=0\).
Proof of Lemma 3.1
When \(s>1\), \(\alpha _{\ell }(s)=1\), while for \(s\le 1\) \(\alpha _{\ell }(s)\) has bounded derivative; therefore \(\alpha _{\ell }(s)\) is globally Lipschitz. A similar reasoning gives the Lipschitz continuity of the other functions. The further properties of \(b_{\ell }\) are straightforward from the definition. \(\square \)
In the case of (1.11) we have the following.
Theorem 3.1
For (1.10) we have that:
Theorem 3.2
Let Assumption 2.1 hold and consider Eq. (1.10), where W(t) is any \({\mathcal {H}}^s\)-valued \({{{\mathcal {C}}}}_s\)-Brownian motion and S(t) is the solution of (1.11). Then for any initial condition \( x^0\in {\mathcal {H}}^s\) and any \(T>0\) there exists a unique solution of Eq. (1.10) in the space \(C([0,T]; {\mathcal {H}}^s)\).
Theorem 3.3
4 Main theorems and heuristics of proofs
Theorem 4.1
Let Assumption 2.1 hold and let \(\delta =\ell /N^{\frac{1}{2}}\). Let \(x^0\in {\mathcal {H}}^s_{\cap }\) and \(T>0\). Then, as \(N\rightarrow \infty \), the continuous interpolant \(S^{(N)}(t)\) of the sequence \(\{S^{k,N}\}_{k\in {\mathbb {N}}} \subseteq {\mathbb {R}}_+\) (defined in (1.16)) and started at \(S^{0,N}=\frac{1}{N}\sum _{i=1}^N \left| x_{i}^{0} \right| ^2 / \lambda _i^2 \), converges in probability in \(C([0,T]; {\mathbb {R}})\) to the solution S(t) of the ODE (1.11) with initial datum \(S^0:=\lim _{N\rightarrow \infty }S^{0,N}\).
For the following theorem recall that the solution of (1.10) is interpreted precisely through Theorem 3.2 as a process driven by an \({\mathcal {H}}^s\)-valued Brownian motion with covariance \({\mathcal {C}}_s\), and solution in \(C([0,T];{\mathcal {H}}^s).\)
Theorem 4.2
Let Assumption 2.1 hold and let \(\delta =\ell /N^{\frac{1}{2}}\). Let \(x^0\in {\mathcal {H}}^s_{\cap }\) and \(T>0\). Then, as \(N \rightarrow \infty \), the continuous interpolant \(x^{(N)}(t)\) of the chain \(\{x^{k,N}\}_{k\in {\mathbb {N}}} \subseteq {\mathcal {H}}^s\) (defined in (1.9) and (2.14), respectively) with initial state \(x^{0,N}:={\mathcal {P}}^N(x^0)\), converges weakly in \(C([0,T]; {\mathcal {H}}^s)\) to the solution x(t) of Eq. (1.10) with initial datum \(x^0\). We recall that the time-dependent function S(t) appearing in (1.10) is the solution of the ODE (1.11), started at \(S(0):= \lim _{N \rightarrow \infty } \frac{1}{N}\sum _{i=1}^N \left| x_i^{0} \right| ^2 / \lambda _i^2\).
Both Theorems 4.1 and 4.2 assume that the initial datum of the chains \(x^{k,N}\) is assigned deterministically. From our proofs it will be clear that the same statements also hold for random initial data, as long as (i) \(x^{0,N}\) is not drawn at random from the target measure \(\pi ^N\) or from any other measure which is a change of measure from \(\pi ^N\) (i.e. we need to be starting out of stationarity) and (ii) \(S^{0,N}\) and \(x^{0,N}\) have bounded moments (bounded uniformly in N) of sufficiently high order and are independent of all the other sources of noise present in the algorithm. Notice moreover that the convergence in probability of Theorem 4.1 is equivalent to weak convergence, as the limit is deterministic.
The rigorous proof of the above results is contained in Sects. 5–8. In the remainder of this section we give heuristic arguments to justify our choice of scaling \(\delta \propto N^{-1/2}\) and we explain how one can formally obtain the (fluid) ODE limit (1.11) for the double sequence \(S^{k,N}\) and the diffusion limit (1.10) for the chain \(x^{k,N}\). We stress that the arguments of this section are only formal; therefore, we often use the notation “\(\simeq \)”, to mean “approximately equal”. That is, we write \(A\simeq B\) when \(A=B+\) “terms that are negligible” as N tends to infinity; we then justify these approximations, and the resulting limit theorems, in the following Sects. 5–8.
4.1 Heuristic analysis of the acceptance probability
Remark 4.1

If we start the chain in stationarity, i.e. \(x_0^N\sim \pi ^N\) (where \(\pi ^N\) has been defined in (1.6)), then \(x^{k,N} \sim \pi ^N\) for every \(k \ge 0\). As we have already observed, \(\pi ^N\) is absolutely continuous with respect to the Gaussian measure \(\pi _0^N \sim {\mathcal {N}}(0, {\mathcal {C}}_N)\); because all the almost sure properties are preserved under this change of measure, in the stationary regime most of the estimates of interest need to be shown only for \(x^N \sim \pi _0^N\). In particular if \(x^N \sim \pi _0^N\) then \(x^N\) can be represented as \(x^N= \sum _{i=1}^N \lambda _i \rho _i \phi _i\), where \(\rho _i\) are i.i.d. \({\mathcal {N}}(0,1)\). Therefore we can use the law of large numbers and observe that \(\Vert x^N\Vert _{{\mathcal {C}}^N}^2=\sum _{i=1}^N \left| \rho _{i} \right| ^2 \simeq N \).
 Suppose we want to study the algorithm in stationarity and we therefore make the choice \(\zeta =1/3\). With the above point in mind, notice that if we start in stationarity then by the Law of Large Numbers \(N^{-1}\sum _{i=1}^N \left| \rho _{i} \right| ^2= S^{k,N}\rightarrow 1\) (as \(N\rightarrow \infty \), with speed of convergence \(N^{-1/2}\)). Moreover, if \(x^N \sim \pi _0^N\), by the Central Limit Theorem the term \(\langle x^N, {\mathcal {C}}_N^{1/2} \xi ^N\rangle _{{\mathcal {C}}_N}/\sqrt{N}\) is O(1) and converges to a standard Gaussian. With these two observations in place we can then heuristically see that, with the choice \(\zeta =1/3\), the terms in (4.10) are negligible as \(N\rightarrow \infty \) while the terms in (4.9) are O(1). The term in (4.8) can be better understood by looking at the LHS of (4.11) which, with \(\zeta =1/3\) and \(x^N \sim \pi _0^N\), can be rewritten as
$$\begin{aligned} \frac{\ell ^2}{2N^{2/3}} \sum _{i=1}^N (\left| \rho _i\right| ^2 -\left| \xi _i\right| ^2 ). \end{aligned}$$(4.12)
The expected value of the above expression is zero. If we apply the Central Limit Theorem to the i.i.d. sequence \(\{\left| \rho _i\right| ^2 -\left| \xi _i\right| ^2 \}_i\), (4.12) shows that (4.8) is \(O(N^{1/2-2/3})\) and therefore negligible as \(N \rightarrow \infty \). In conclusion, in the stationary case the only O(1) terms are those in (4.9); therefore one has the heuristic approximation
$$\begin{aligned} Q^N(x,\xi ) \sim {\mathcal {N}} \left( -\frac{\ell ^3}{4}, \frac{\ell ^3}{2}\right) . \end{aligned}$$
For more details on the stationary case see [17].
 If instead we start out of stationarity the choice \(\zeta =1/3\) is problematic. Indeed in [6, Lemma 3] the authors study the MALA algorithm to sample from an N-dimensional isotropic Gaussian and show that if the algorithm is started at a point \(x^0\) such that \(S(0) <1\), then the acceptance probability degenerates to zero. Therefore, the algorithm stays stuck in its initial state and never proceeds to the next move, see [6, Figure 2] (to be more precise, as N increases the algorithm will take longer and longer to get unstuck from its initial state; in the limit, it will never move with probability 1). Therefore the choice \(\zeta =1/3\) cannot be the optimal one (at least not irrespective of the initial state of the chain) if we start out of stationarity. This is still the case in our context and one can heuristically see that the root of the problem lies in the term (4.8). Indeed if out of stationarity we still choose \(\zeta =1/3\) then, like before, (4.9) is still order one and (4.10) is still negligible. However, looking at (4.8), if \(x^0\) is such that \(S(0)<1\) then, when \(k=0\), (4.8) tends to minus infinity; recalling (4.2), this implies that the acceptance probability of the first move tends to zero. To overcome this issue and make \(Q^N\) of order one (irrespective of the initial datum) so that the acceptance probability is of order one and does not degenerate to 0 or 1 when \(N \rightarrow \infty \), we take \(\zeta =1/2\); in this way the terms in (4.8) are O(1), all the others are small. Therefore, the intuition leading the analysis of the nonstationary regime hinges on the fact that, with our scaling,
$$\begin{aligned} Q^N(x^{k,N}, \xi ^{k,N}) \simeq \frac{\ell ^2}{2}(S^{k,N}- 1); \end{aligned}$$(4.13)
hence
$$\begin{aligned} \alpha ^N(x^{k,N}, \xi ^{k,N}) = (1 \wedge e^{Q^N(x^{k,N}, \xi ^{k,N})}) \simeq \alpha _{\ell }\big (S^{k,N}\big ), \end{aligned}$$(4.14)
where the function \(\alpha _{\ell }\) on the RHS of (4.14) is the one defined in (1.12). The approximation (4.13) is made rigorous in Lemma 6.4, while (4.14) is formalized in Sect. 6.1 (see in particular Proposition 6.1).

Finally, we mention for completeness that, by arguing similarly to what we have done so far, if \(\zeta < 1/2\) then the acceptance probability of the first move tends to zero when \(S(0)<1\). If \(\zeta >1/2\) then \(Q^N \rightarrow 0\), so the acceptance probability tends to one; however the size of the moves is small and the algorithm explores the phase space slowly.
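The effect of the exponent \(\zeta \) on the first move can be checked numerically in the simplest case. The sketch below assumes \(\varPsi =0\) (a Gaussian target), for which the Metropolis–Hastings log-ratio reduces to \(Q^N=-\frac{\delta }{4}\big (\Vert y\Vert _{{\mathcal {C}}_N}^2-\Vert x\Vert _{{\mathcal {C}}_N}^2\big )\), and a deterministic initial state with \(x_i/\lambda _i=\sqrt{S(0)}\) for every i; the function name and the parametrization \(\delta =\ell N^{-\zeta }\) are choices made here for illustration.

```python
import math
import random


def first_move_log_ratio(n, ell, s0, zeta, rng):
    """Log acceptance ratio Q^N of the first MALA move when Psi = 0.

    Works in normalized coordinates u_i = x_i / lambda_i, where the proposal
    reads v_i = (1 - delta) * u_i + sqrt(2 * delta) * xi_i and
    ||x||^2_{C_N} = sum_i u_i^2 = n * s0 for the chosen initial state.
    """
    delta = ell / n ** zeta
    u = math.sqrt(s0)          # every normalized coordinate of the initial state
    norm_x = n * s0
    norm_y = 0.0
    for _ in range(n):
        v = (1.0 - delta) * u + math.sqrt(2.0 * delta) * rng.gauss(0.0, 1.0)
        norm_y += v * v
    return -(delta / 4.0) * (norm_y - norm_x)
```

Calling this with `zeta=0.5` and `s0 < 1` should return values fluctuating around \(\ell ^2(S(0)-1)/2\), in line with (4.13), whereas with `zeta=1/3` the returned value becomes increasingly negative as `n` grows, so that \(1\wedge e^{Q^N}\) collapses towards zero.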
Remark 4.2
Notice that in stationarity the function \(Q^N\) is, to leading order, independent of \(\xi \); that is, \(Q^N\) and \(\xi \) are asymptotically independent (see [17, Lemma 4.5]). This can be intuitively explained because in stationarity the leading order term in the expression for \(Q^N\) is the term with \(\delta ^3 \Vert x\Vert ^2\). We will show that also out of stationarity \(Q^N\) and \(\xi \) are asymptotically independent. In this case such an asymptotic independence can, roughly speaking, be motivated by the approximation (4.13), (as the interpolation of the chain \(S^{k,N}\) converges to a deterministic limit). The asymptotic correlation of \(Q^N\) and the noise \(\xi \) is analysed in Lemma 6.5.
Remark 4.3
4.2 Heuristic derivation of the weak limit of \(S^{k,N}\)
4.3 Heuristic analysis of the limit of the chain \(x^{k,N}\).
4.3.1 Approximate drift
4.3.2 Approximate diffusion
5 Continuous mapping argument
In this section we outline the argument which underlies the proofs of our main results. In particular, the proofs of Theorems 4.1 and 4.2 hinge on the continuous mapping arguments that we illustrate in the following Sects. 5.1 and 5.2, respectively. The details of the proofs are deferred to the next three sections: Sect. 6 contains some preliminary results that we employ in both proofs, Sect. 7 contains the proof of Theorem 4.1 and Sect. 8 that of Theorem 4.2.
5.1 Continuous mapping argument for (3.3)
5.2 Continuous mapping argument for (3.2)
 1. We prove that \(d^N\) converges in \(L_2(\varOmega ; C([0,T]; {\mathcal {H}}^s))\) to zero (Lemma 8.1);
 2. using the convergence in probability (in \(C([0,T]; {\mathbb {R}})\)) of \(S^{(N)}\) to S, we show convergence in probability (in \(C([0,T]; {\mathcal {H}}^s)\)) of \(\upsilon ^N\) to zero (Lemma 8.2);
 3. we show that \(\eta ^N\) converges weakly in \(C([0,T]; {\mathcal {H}}^s)\) to the process \(\eta \), defined in (5.11) (Lemma 8.3).
6 Preliminary estimates and analysis of the acceptance probability
This section gathers several technical results. In Lemma 6.1 we study the size of the jumps of the chain. Lemma 6.2 contains uniform bounds on the moments of the chains \(\{x^{k,N}\}_{k\in {\mathbb {N}}}\) and \(\{S^{k,N}\}_{k\in {\mathbb {N}}}\), much needed in Sects. 7 and 8. In Section 6.1 we detail the analysis of the acceptance probability. This allows us to quantify the correlations between \(\gamma ^{k,N}\) and the noise \(\xi ^{k,N}\), Sect. 6.2. Throughout the paper, when referring to the function \(Q^N\) defined in (4.3), we use interchangeably the notation \(Q^N(x^{k,N}, y^{k,N})\) and \(Q^N(x^{k,N}, \xi ^{k,N})\) (as we have already remarked, given \(x^{k,N}\), the proposal \(y^{k,N}\) is only a function of \(\xi ^{k,N}\)).
Lemma 6.1
Proof of Lemma 6.1
Lemma 6.2
Proof of Lemma 6.2
The proof of this lemma can be found in Appendix C. \(\square \)
6.1 Acceptance probability
The main result of this section is Proposition 6.1, which we obtain as a consequence of Lemma 6.3 (below) and Lemma 6.2. Proposition 6.1 formalizes the heuristic approximation (4.14).
Lemma 6.3
Before proving Lemma 6.3, we state Proposition 6.1.
Proposition 6.1
Proof of Lemma 6.3
Lemma 6.4
Proof of Lemma 6.4
Proof of (6.12). Using (2.8), we rewrite \(I_1^N\) as
$$\begin{aligned}&I_1^N\big (x^{k,N},y^{k,N}\big )\\&\quad =-\frac{\delta }{4}\left( \left\| (1-\delta ) x^{k,N}-\delta {\mathcal {C}}_N\nabla \varPsi ^N\big (x^{k,N}\big )+\sqrt{2\delta }\, {\mathcal {C}}_N^{1/2}\xi ^{k,N}\right\| _{{\mathcal {C}}_N}^2-\left\| x^{k,N}\right\| _{{\mathcal {C}}_N}^2\right) . \end{aligned}$$
Expanding the above we obtain:
$$\begin{aligned} I_1^N\big (x^{k,N},y^{k,N}\big )-\frac{\ell ^2\big (S^{k,N}-1\big )}{2}&= -\left( \frac{\delta ^2}{2}\left\| {\mathcal {C}}_N^{1/2}\xi ^{k,N}\right\| _{{\mathcal {C}}_N}^2 -\frac{\ell ^2}{2}\right) \nonumber \\&\quad +(r_{\varPsi }^N - r^N)+r_{\xi }^N+r_x^N, \end{aligned}$$(6.16)
where the difference \((r_{\varPsi }^N - r^N)\) is defined in (4.5) and we set
$$\begin{aligned} r^N_{\xi }&:= -\frac{(\delta ^{3/2}-\delta ^{5/2})}{\sqrt{2}} \left\langle x^{k,N},{\mathcal {C}}_N^{1/2}\xi ^{k,N} \right\rangle _{{\mathcal {C}}_N}, \end{aligned}$$(6.17)
$$\begin{aligned} r^N_{x}&:= -\frac{\delta ^3}{4}\left\| x^{k,N}\right\| _{{\mathcal {C}}_N}^2. \end{aligned}$$(6.18)
For the reader’s convenience we rearrange (4.5) below:
$$\begin{aligned} r_{\varPsi }^N - r^N&= \frac{\delta ^2-\delta ^3}{2}\left\langle x^{k,N},{\mathcal {C}}_N\nabla \varPsi ^N\big (x^{k,N}\big ) \right\rangle _{{\mathcal {C}}_N} \nonumber \\&\quad -\frac{\delta ^3}{4}\left\| {\mathcal {C}}_N\nabla \varPsi ^N\big (x^{k,N}\big )\right\| _{{\mathcal {C}}_N}^2 +\frac{\delta ^{5/2}}{\sqrt{2}}\left\langle {\mathcal {C}}_N\nabla \varPsi ^N\big (x^{k,N}\big ),{\mathcal {C}}_N^{1/2}\xi ^{k,N} \right\rangle _{{\mathcal {C}}_N}. \end{aligned}$$(6.19)
We come to bound all of the above terms, starting from (6.19). To this end, let us observe the following:
$$\begin{aligned} \left| \left\langle x^{k,N},{\mathcal {C}}_N\nabla \varPsi ^N\big (x^{k,N}\big ) \right\rangle _{{\mathcal {C}}_N}\right| ^2&=\left| \sum _{i=1}^N x^{k,N}_i [\nabla \varPsi ^N\big (x^{k,N}\big )]_i\right| ^2 \end{aligned}$$(6.20)
$$\begin{aligned}&{\mathop {\le }\limits ^{(2.6)}} \left\| x^{k,N}\right\| _{s}^2 \left\| \nabla \varPsi ^N\big (x^{k,N}\big )\right\| _{-s}^2 {\mathop {\lesssim }\limits ^{(2.24)}} \left\| x^{k,N}\right\| _{s}^2. \end{aligned}$$(6.21)
Moreover,
$$\begin{aligned} {\mathbb {E}}_k \left\| {\mathcal {C}}_N^{1/2} \xi ^{k,N}\right\| _{{\mathcal {C}}_N}^2 = {\mathbb {E}}_k \sum _{j=1}^N \left| \xi _j\right| ^2 = N, \end{aligned}$$
hence
$$\begin{aligned} \left| \left\langle {\mathcal {C}}_N\nabla \varPsi ^N\big (x^{k,N}\big ),{\mathcal {C}}_N^{1/2}\xi ^{k,N} \right\rangle _{{\mathcal {C}}_N}\right| ^2 \le \left\| {\mathcal {C}}_N\nabla \varPsi ^N\big (x^{k,N}\big )\right\| _{{\mathcal {C}}_N}^2\left\| {\mathcal {C}}_N^{1/2} \xi ^{k,N}\right\| _{{\mathcal {C}}_N}^2 {\mathop {\lesssim }\limits ^{(2.26)}}N. \end{aligned}$$
From (6.19), (6.20), (2.26) and the above,
$$\begin{aligned} {\mathbb {E}}_k \left| r_{\varPsi }^N-r^N\right| ^2 \lesssim \frac{\left\| x^{k,N}\right\| _{s}^2}{N^2}+\frac{1}{N^{3/2}}. \end{aligned}$$(6.22)
By (6.17),
$$\begin{aligned} {\mathbb {E}}_k \left| r^N_{\xi }\right| ^2&\lesssim \frac{1}{N^{3/2}} {\mathbb {E}}_k\left| \left\langle x^{k,N},{\mathcal {C}}_N^{1/2}\xi ^{k,N} \right\rangle _{{\mathcal {C}}_N}\right| ^2\nonumber \\&= \frac{1}{N^{3/2}}{\mathbb {E}}_k \left( \sum _{i=1}^N \frac{x_i^{k,N} \xi _i^{k,N}}{\lambda _i} \right) ^2 = \frac{1}{\sqrt{N}}S^{k,N}, \end{aligned}$$(6.23)
where in the last equality we have used the fact that \(\{\xi _i^{k,N}:i=1,\ldots ,N\}\) are independent, zero mean, unit variance normal random variables (independent of \(x^{k,N}\)) and (4.6). As for \(r^N_{x}\),
$$\begin{aligned} {\mathbb {E}}_k \left| r_x^N\right| ^2 \lesssim \frac{1}{N^3}\left\| x^{k,N}\right\| _{{\mathcal {C}}_N}^4{\mathop {=}\limits ^{(4.6)}}\frac{(S^{k,N})^2}{N}. \end{aligned}$$
Lastly,
$$\begin{aligned} {\tilde{r}}^N:=\frac{\delta ^2}{2}\left\| {\mathcal {C}}_N^{1/2}\xi ^{k,N}\right\| _{{\mathcal {C}}_N}^2 -\frac{\ell ^2}{2}=\frac{\ell ^2}{2}\left( \frac{1}{N}\sum _{j=1}^N\xi ^2_j-1\right) . \end{aligned}$$
Since \(\sum _{j=1}^N\xi ^2_j\) has chi-squared law, \({\mathbb {E}}_k\left| {\tilde{r}}^N\right| ^2\lesssim \mathrm{Var}\left( N^{-1}\sum _{j=1}^N\xi ^2_j\right) \lesssim N^{-1}\), by (6.5). Combining all of the above, we obtain the desired bound.
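The expansion used in the proof of (6.12) is purely algebraic, so it can be checked numerically. The following is a minimal sketch, not the authors' code, which verifies that \(I_1^N-\ell ^2(S^{k,N}-1)/2\) equals \(-{\tilde{r}}^N+(r_{\varPsi }^N-r^N)+r_{\xi }^N+r_x^N\) for one random draw. It assumes \(\delta =\ell /\sqrt{N}\), \(S^{k,N}=\Vert x^{k,N}\Vert _{{\mathcal {C}}_N}^2/N\) and \({\mathcal {C}}_N\) diagonal with eigenvalues \(\lambda _j^2\); the choices \(\lambda _j=1/j\) and \(\nabla \varPsi ^N=\tanh \) are hypothetical and serve only as placeholders.

```python
import numpy as np


def check_i1_decomposition(N=50, ell=1.3, seed=0):
    """Return both sides of the expansion of I_1 for one random draw."""
    rng = np.random.default_rng(seed)
    lam = 1.0 / np.arange(1, N + 1)      # hypothetical eigenvalues of C_N^{1/2}
    delta = ell / np.sqrt(N)             # assumed scaling delta = ell / sqrt(N)
    x = lam * rng.standard_normal(N)     # current state (coordinates in the eigenbasis)
    xi = rng.standard_normal(N)          # proposal noise
    c_grad = lam**2 * np.tanh(x)         # C_N grad Psi, with a placeholder gradient
    c_half_xi = lam * xi                 # C_N^{1/2} xi

    def norm2_c(u):                      # squared C_N-norm: sum_i u_i^2 / lam_i^2
        return np.sum((u / lam) ** 2)

    def inner_c(u, v):                   # C_N-inner product
        return np.sum(u * v / lam**2)

    # Proposal and I_1 as written at the start of the proof.
    y = (1 - delta) * x - delta * c_grad + np.sqrt(2 * delta) * c_half_xi
    i1 = -(delta / 4) * (norm2_c(y) - norm2_c(x))
    s = norm2_c(x) / N
    lhs = i1 - ell**2 * (s - 1) / 2

    # Remainder terms, with the signs obtained by expanding the square.
    r_tilde = delta**2 / 2 * norm2_c(c_half_xi) - ell**2 / 2
    r_psi_minus_r = ((delta**2 - delta**3) / 2 * inner_c(x, c_grad)
                     - delta**3 / 4 * norm2_c(c_grad)
                     + delta**2.5 / np.sqrt(2) * inner_c(c_grad, c_half_xi))
    r_xi = -(delta**1.5 - delta**2.5) / np.sqrt(2) * inner_c(x, c_half_xi)
    r_x = -delta**3 / 4 * norm2_c(x)
    rhs = -r_tilde + r_psi_minus_r + r_xi + r_x
    return lhs, rhs
```

Calling `check_i1_decomposition()` returns two numbers that should agree up to floating-point rounding. The check says nothing about the size of the remainders, which is what the bounds in the proof control.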
Proof of (6.13). From (6.10),
$$\begin{aligned} I_2^N\big (x^{k,N},y^{k,N}\big )&=-\left[ \varPsi ^N(y^{k,N})-\varPsi ^N\big (x^{k,N}\big ) -\left\langle y^{k,N}-x^{k,N},\nabla \varPsi ^N\big (x^{k,N}\big ) \right\rangle \right] \\&\quad +\frac{1}{2}\left\langle y^{k,N}-x^{k,N},\nabla \varPsi ^N(y^{k,N})-\nabla \varPsi ^N\big (x^{k,N}\big ) \right\rangle \\&\quad +\frac{\delta }{2}\left( \left\langle x^{k,N},\nabla \varPsi ^N\big (x^{k,N}\big ) \right\rangle -\left\langle y^{k,N},\nabla \varPsi ^N(y^{k,N}) \right\rangle \right) =:\sum _{j=1}^3d_j, \end{aligned}$$
where \(d_j\) is the addend on line j of the above array. Using (2.22), (2.24), (2.6) and Lemma 6.1, we have
$$\begin{aligned} {\mathbb {E}}_k \left| d_1\right| ^{2}\lesssim {\mathbb {E}}_k \left\| y^{k,N}-x^{k,N}\right\| _s^{2} \lesssim \frac{1+\left\| x^{k,N}\right\| _{s}^{2}}{\sqrt{N}}. \end{aligned}$$
By the first inequality in (2.24),
$$\begin{aligned} \left\| \nabla \varPsi ^N(y^{k,N})-\nabla \varPsi ^N\big (x^{k,N}\big )\right\| _{-s}\lesssim 1. \end{aligned}$$
Consequently, again by (2.6) and Lemma 6.1,
$$\begin{aligned} {\mathbb {E}}_k \left| d_2\right| ^{2}\lesssim {\mathbb {E}}_k \left\| y^{k,N}-x^{k,N}\right\| _s^{2}\lesssim \frac{1+\left\| x^{k,N}\right\| _{s}^{2}}{\sqrt{N}}. \end{aligned}$$
Next, applying (2.6) and (2.24) gives
$$\begin{aligned} \left| {d_3}\right|&\le \frac{\left\| x^{k,N}\right\| _{s}\left\| \nabla \varPsi ^N\big (x^{k,N}\big )\right\| _{-s} +\left\| y^{k,N}\right\| _{s}\left\| \nabla \varPsi ^N(y^{k,N})\right\| _{-s}}{\sqrt{N}}\\&\lesssim \frac{\left\| x^{k,N}\right\| _{s}+\left\| y^{k,N}\right\| _{s}}{\sqrt{N}} \lesssim \frac{\left\| x^{k,N}\right\| _{s}+\left\| y^{k,N}-x^{k,N}\right\| _{s}}{\sqrt{N}}. \end{aligned}$$
Thus, applying Lemma 6.1 then gives the desired bound.

Proof of (6.14). This follows directly from (2.25). \(\square \)
6.2 Correlations between acceptance probability and noise \(\xi ^{k,N}\)
Lemma 6.5
Lemma 6.6
The proofs of the above lemmata can be found in Appendix B. Notice that if \(\xi ^{k,N}\) and \(\gamma ^{k,N}\) (equivalently \(\xi ^{k,N}\) and \(Q^{N}\)) were uncorrelated, the statements of Lemmas 6.5 and 6.6 would be trivially true.
7 Proof of Theorem 4.1
As explained in Sect. 5.1, due to the continuity of the map \({\mathcal {J}}_2\) (defined in Theorem 3.3), in order to prove Theorem 4.1 all we need to show is convergence in probability of \({\hat{w}}^N(t)\) to zero. Looking at the definition of \({\hat{w}}^N(t)\), Eq. (5.3), the convergence in probability (in \(C([0,T];{\mathbb {R}})\)) of \({\hat{w}}^N(t)\) to zero is a consequence of Lemmas 7.1 and 7.2 below. We prove Lemma 7.1 in Sect. 7.1 and Lemma 7.2 in Sect. 7.2.
Lemma 7.1
Lemma 7.2
7.1 Analysis of the drift
Proof of Lemma 7.1
Lemma 7.3
Lemma 7.4
Proof of Lemma 7.3
Lemma 7.5
Proof of Lemma 7.5
Proof of Lemma 7.4
7.2 Analysis of noise
Proof of Lemma 7.2
8 Proof of Theorem 4.2
Lemma 8.1
Lemma 8.2
If Assumption 2.1 holds, then \(\upsilon ^N\) (defined in (5.9)) converges in probability in \(C([0,T]; {\mathcal {H}}^s)\) to zero.
Lemma 8.3
Let Assumption 2.1 hold. Then the interpolated martingale difference array \(\mathfrak {\eta }^N(t)\) defined in (5.7) converges weakly in \(C([0,T]; {\mathcal {H}}^s)\) to the stochastic integral \(\eta (t)\), defined in Eq. (5.11).
8.1 Analysis of drift
Proof (Lemma 8.1)
Lemma 8.4
Lemma 8.5
Before proving Lemma 8.4, we state and prove the following Lemma 8.6. We then consecutively prove Lemmas 8.4, 8.5 and 8.2. Recall the definitions of \(\varTheta \) and \(\varTheta ^{k,N}\), equations (4.23) and (4.21), respectively.
Lemma 8.6
Proof of Lemma 8.6
Proof of Lemma 8.4
Following steps analogous to those taken in the proof of Lemma 7.3, the proof is a direct consequence of Lemma 8.6, after observing that the sum \(\sum _{j=N+1}^\infty (\lambda _jj^s)^4\) is the tail of a convergent series and hence tends to zero as \(N \rightarrow \infty \). \(\square \)
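To illustrate the tail argument, here is a small sketch under an assumed eigenvalue decay \(\lambda _j=j^{-\kappa }\) with \(\kappa =1\) and \(s=1/4\). These values are hypothetical, chosen only so that \((\lambda _jj^s)^4=j^{-3}\) is summable. The function evaluates a truncated version of the tail \(\sum _{j=N+1}^\infty (\lambda _jj^s)^4\), which in this case is bounded above by \(\int _N^\infty u^{-3}du=1/(2N^2)\).

```python
def tail_sum(N, kappa=1.0, s=0.25, M=200_000):
    """Truncated tail sum_{j=N+1}^{M} (lambda_j * j**s)**4 with lambda_j = j**(-kappa).

    kappa, s and the truncation level M are illustrative assumptions.
    """
    return sum((j ** (s - kappa)) ** 4 for j in range(N + 1, M + 1))


# The tail shrinks as N grows and stays below the integral bound 1 / (2 N^2).
for n in (10, 100, 1000):
    print(n, tail_sum(n), 1.0 / (2 * n**2))
```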
Proof of Lemma 8.5
Proof of Lemma 8.2
8.2 Analysis of noise
Lemma 8.7
 (i) there exists a continuous and positive function \(f:[0,T]\rightarrow {\mathbb {R}}_+\) such that
$$\begin{aligned} \lim _{N\rightarrow \infty } \sum _{k=1}^{k_N(T)} {\mathbb {E}}\bigg (\left\| X^{k,N}\right\| ^2\,\Big \vert \, {\mathcal {F}}_{k-1}^N\bigg )= {\mathrm{Trace}}_{{\mathcal {H}}}(D) \int _0^T f(t) dt \, ; \end{aligned}$$
 (ii) if \(\{{\phi }_j\}_{j\in {\mathbb {N}}}\) is an orthonormal basis of \({\mathcal {H}}\) then
$$\begin{aligned} \lim _{N\rightarrow \infty } \sum _{k=1}^{k_N(T)} {\mathbb {E}}\bigg (\langle X^{k,N},{\phi }_j \rangle \langle X^{k,N},{\phi }_i \rangle \,\Big \vert \, {\mathcal {F}}_{k-1}^N\bigg )=0\, \quad \text{ for } \text{ all } \,\, i\ne j\, ; \end{aligned}$$
 (iii) for every fixed \(\epsilon >0\),
$$\begin{aligned} \lim _{N \rightarrow \infty } \sum _{k=1}^{k_N(T)} {\mathbb {E}}\bigg (\left\| X^{k,N}\right\| ^2 \mathbf{1}_{\left\{ \left\| X^{k,N}\right\| ^2\ge \epsilon \right\} } \,\Big \vert \, {\mathcal {F}}_{k-1}^N \bigg )=0, \qquad \text{ in } \text{ probability }, \end{aligned}$$
Proof of Lemma 8.3
 (i) Note that by the definition of \(L^{k,N}\), \({\mathbb {E}}[L^{k,N}\vert {\mathcal {F}}_{k-1}^N]={\mathbb {E}}_k [L^{k,N}]\) almost surely. We need to show that the limit
$$\begin{aligned} \lim _{N\rightarrow \infty }\frac{1}{\sqrt{N}}\sum _{k=0}^{[T\sqrt{N}]} {\mathbb {E}}_k \left\| L^{k,N}\right\| _{s}^2 = 2 \, {\mathrm{Trace}}_{{\mathcal {H}}^s}({\mathcal {C}}_s) \int _0^T h_{\ell }(S(u))du \end{aligned}$$(8.5)
holds in probability. By (4.28),
$$\begin{aligned} \frac{1}{\sqrt{N}} {\mathbb {E}}_k \left\| L^{k,N}\right\| _{s}^2&= {\mathbb {E}}_k \left\| x^{k+1,N}-x^{k,N}\right\| _{s}^2 - \left\| {\mathbb {E}}_k\left( x^{k+1,N}-x^{k,N}\right) \right\| _{s}^2. \end{aligned}$$
From the above, if we prove
$$\begin{aligned} {\mathbb {E}}_{x^0}\sum _{k=0}^{[T\sqrt{N}]}\left\| {\mathbb {E}}_k\left( x^{k+1,N}-x^{k,N}\right) \right\| _{s}^2 \rightarrow 0 \quad \text{ as } N\rightarrow \infty , \end{aligned}$$(8.6)
and that
$$\begin{aligned}&\lim _{N\rightarrow \infty }\sum _{k=0}^{[T\sqrt{N}]} {\mathbb {E}}_k \left\| x^{k+1,N}-x^{k,N}\right\| _{s}^2 \nonumber \\&\quad = 2 \, {\mathrm{Trace}}_{{\mathcal {H}}^s}({\mathcal {C}}_s) \int _0^T h_{\ell }(S(u))du, \quad \text{ in } \text{ probability }, \end{aligned}$$(8.7)
then (8.5) follows. We start by proving (8.6):
$$\begin{aligned} \left\| {\mathbb {E}}_k\left( x^{k+1,N}-x^{k,N}\right) \right\| _{s}^2 {\mathop {\lesssim }\limits ^{(2.14)}}&\frac{1}{N}\left\| x^{k,N}+{\mathcal {C}}_N \nabla \varPsi ^N(x^{k,N})\right\| _{s}^2 \\&+\frac{1}{\sqrt{N}} \left\| {\mathbb {E}}_k \left( \gamma ^{k,N}({\mathcal {C}}_N)^{1/2}\xi ^{k,N}\right) \right\| _{s}^2\\ \lesssim&\, \frac{1}{N} \left( 1+ \left\| x^{k,N}\right\| _{s}^2\right) , \end{aligned}$$
where the last inequality follows from (2.25) and (6.25). The above and (6.7) prove (8.6). We now come to (8.7):
$$\begin{aligned}&\left| \sum _{k=0}^{[T\sqrt{N}]} {\mathbb {E}}_k \left\| x^{k+1,N}-x^{k,N}\right\| _{s}^2-2 \, {\mathrm{Trace}}_{{\mathcal {H}}^s}({\mathcal {C}}_s) \int _0^T h_{\ell }(S(u))du \right| \\&\quad {\mathop {\lesssim }\limits ^{(2.14)}} \frac{1}{N}\sum _{k=0}^{[T\sqrt{N}]} {\mathbb {E}}_k \left\| x^{k,N}+{\mathcal {C}}_N \nabla \varPsi ^N(x^{k,N})\right\| _{s}^2\\&\quad \qquad + \frac{1}{N^{3/4}}\sum _{k=0}^{[T\sqrt{N}]} {\mathbb {E}}_k \left| \langle x^{k,N}+{\mathcal {C}}_N \nabla \varPsi ^N(x^{k,N}), {\mathcal {C}}_N^{1/2}\xi ^{k,N}\rangle _s\right| \\&\quad \qquad + \left| \frac{2\ell }{\sqrt{N}}\sum _{k=0}^{[T\sqrt{N}]}{\mathbb {E}}_k \left\| \gamma ^{k,N}{\mathcal {C}}_N^{1/2}\xi ^{k,N}\right\| _{s}^2 -2 \, {\mathrm{Trace}}_{{\mathcal {H}}^s}({\mathcal {C}}_s) \int _0^T h_{\ell }(S(u))du \right| . \end{aligned}$$
The first two addends tend to zero in \(L^1\) as N tends to infinity due to (2.25), (2.27) and Lemma 6.2. As for the third addend, we decompose it as follows
$$\begin{aligned}&\left| \frac{2\ell }{\sqrt{N}}\sum _{k=0}^{[T\sqrt{N}]}{\mathbb {E}}_k \left\| \gamma ^{k,N}{\mathcal {C}}_N^{1/2}\xi ^{k,N}\right\| _{s}^2 -2 \, {\mathrm{Trace}}_{{\mathcal {H}}^s}({\mathcal {C}}_s) \int _0^T h_{\ell }(S(u))du \right| \nonumber \\&\quad {\mathop {\lesssim }\limits ^{(1.13), (6.24)}} \left| \frac{\ell }{\sqrt{N}}\sum _{k=0}^{[T\sqrt{N}]}{\mathbb {E}}_k \left\| \varepsilon ^{k,N}\right\| _{s}^2 - \frac{\ell }{\sqrt{N}}\sum _{k=0}^{[T\sqrt{N}]}{\mathrm{Trace}}_{{\mathcal {H}}^s}({\mathcal {C}}_s)\alpha _{\ell }\big (S^{k,N}\big )\right| \nonumber \\&\qquad \qquad + \left| \frac{1}{\sqrt{N}}\sum _{k=0}^{[T\sqrt{N}]}{\mathrm{Trace}}_{{\mathcal {H}}^s}({\mathcal {C}}_s)h_{\ell }\big (S^{k,N}\big ) -{\mathrm{Trace}}_{{\mathcal {H}}^s}({\mathcal {C}}_s) \int _0^T h_{\ell }(S(u))du\right| . \end{aligned}$$(8.8)
Convergence to zero in \(L^1\) of the first term in the above follows from Lemmas 6.2 and 6.6. As for the term in (8.8), we use the identity
$$\begin{aligned} \int _0^Th_{\ell }({\bar{S}}^{(N)}(u))du =\left( T-\frac{[T\sqrt{N}]}{\sqrt{N}}\right) h_{\ell }\big (S^{[T\sqrt{N}],N}\big ) +\frac{1}{\sqrt{N}}\sum _{k=0}^{[T\sqrt{N}]}h_{\ell }\big (S^{k,N}\big ), \end{aligned}$$
to further split it, obtaining:
$$\begin{aligned} (8.8)\lesssim&\, \left| \int _0^T h_{\ell }({\bar{S}}^{(N)}(u)) -h_{\ell }(S^{(N)}(u))du \right| \end{aligned}$$(8.9)
$$\begin{aligned}&+\left| \int _0^T h_{\ell }(S^{(N)}(u)) -h_{\ell }(S(u))du \right| \end{aligned}$$(8.10)
$$\begin{aligned}&+\left( T-\frac{[T\sqrt{N}]}{\sqrt{N}}\right) h_{\ell }(S^{[T\sqrt{N}],N}). \end{aligned}$$(8.11)
Convergence (in \(L^1\)) of (8.9) to zero follows with the same calculations leading to (7.6), the global Lipschitz property of \(h_{\ell }\), and Lemma 6.2. The addend in (8.10) tends to zero in probability since \(S^{(N)}\) tends to S in probability in \(C([0,T];{\mathbb {R}})\) (Theorem 4.1) and the third addend is clearly small. The limit (8.7) then follows.
 (ii) Condition (ii) of Lemma 8.7 can be shown to hold with similar calculations, so we will not show the details.
 (iii) Using (6.3), the last bound follows from a calculation completely analogous to the one in [14, Section 8.2]. We omit the details here. \(\square \)
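The objects handled in the proofs above (the proposal, the accept/reject variable \(\gamma ^{k,N}\), the quantity \(S^{k,N}\) and the time scale \(t_k=k/\sqrt{N}\)) can be seen in a toy simulation. The sketch below is illustrative only and is not the authors' implementation. It assumes \(\varPsi ^N\equiv 0\) (a Gaussian target, for which the log acceptance ratio is taken to reduce to \(I_1^N\)), \(\delta =\ell /\sqrt{N}\), and hypothetical eigenvalues \(\lambda _j=1/j\). The function name and its parameters are made up for this example.

```python
import numpy as np


def mala_gaussian_chain(N=400, ell=1.0, T=2.0, s0=3.0, seed=1):
    """Run [T sqrt(N)] MALA steps for the target N(0, C_N), started out of stationarity.

    Returns the path of S^{k,N} = ||x^{k,N}||_{C_N}^2 / N and the acceptance rate.
    """
    rng = np.random.default_rng(seed)
    lam = 1.0 / np.arange(1, N + 1)                 # hypothetical eigenvalues of C_N^{1/2}
    delta = ell / np.sqrt(N)                        # assumed step-size scaling
    x = np.sqrt(s0) * lam * rng.standard_normal(N)  # S^{0,N} is close to s0, not to 1
    n_steps = int(T * np.sqrt(N))
    path = np.empty(n_steps + 1)
    accepted = 0
    for k in range(n_steps):
        norm_x = np.sum((x / lam) ** 2)             # ||x||_{C_N}^2
        path[k] = norm_x / N
        xi = rng.standard_normal(N)
        # Proposal with Psi = 0: y = (1 - delta) x + sqrt(2 delta) C_N^{1/2} xi.
        y = (1 - delta) * x + np.sqrt(2 * delta) * lam * xi
        # Log acceptance ratio; with Psi = 0 only the I_1 term is kept.
        q = -(delta / 4) * (np.sum((y / lam) ** 2) - norm_x)
        if np.log(rng.uniform()) < min(0.0, q):     # gamma^{k,N} = 1: accept
            x = y
            accepted += 1
    path[n_steps] = np.sum((x / lam) ** 2) / N
    return path, accepted / n_steps
```

For instance, `path, rate = mala_gaussian_chain()` gives the values of \(S^{k,N}\) at the times \(t_k=k/\sqrt{N}\). In this Gaussian toy case one would expect the path to drift from its starting value toward 1, the value it fluctuates around in stationarity.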
Footnotes
 1. In this paper, we commit a slight abuse of our notation by writing \({\mathcal {C}}_s\) to mean the covariance operator on the Sobolev-like subspace \({\mathcal {H}}^s\) and \({\mathcal {C}}_N\) to mean that on the finite dimensional subspace \(X^N\) as defined in (1.5). We distinguish these two by always employing N as the subscript for the latter, and lower case letters such as s or r for the former.
 2. Notice that \(S^{k,N}\) is only a function of \(x^{k,N}\).
 3. Note that in the limit the dependence of the drift on \(S^{k,N}\) becomes explicit.
Notes
Acknowledgements
A.M. Stuart acknowledges support from AMS, DARPA, EPSRC, ONR. J. Kuntz gratefully acknowledges support from the BBSRC in the form of the Ph.D. studentship BB/F017510/1. M. Ottobre and J. Kuntz gratefully acknowledge financial support from the Edinburgh Mathematical Society.
References
 1. Beskos, A., Girolami, M., Lan, S., Farrell, P., Stuart, A.: Geometric MCMC for infinite-dimensional inverse problems. J. Comput. Phys. 335, 327–351 (2017)
 2. Bédard, M.: Weak convergence of Metropolis algorithms for non-i.i.d. target distributions. Ann. Appl. Probab. 17(4), 1222–1244 (2007)
 3. Bédard, M., Rosenthal, J.: Optimal scaling of Metropolis algorithms: heading toward general target distributions. Can. J. Stat. 36(4), 483–503 (2008)
 4. Beskos, A., Roberts, G., Stuart, A., Voss, J.: An MCMC method for diffusion bridges. Stochast. Dyn. 8(3), 319–350 (2008)
 5. Breyer, L., Piccioni, M., Scarlatti, S.: Optimal scaling of MALA for nonlinear regression. Ann. Appl. Probab. 14(3), 1479–1505 (2004)
 6. Christensen, O., Roberts, G., Rosenthal, J.: Scaling limits for the transient phase of local Metropolis–Hastings algorithms. J. R. Stat. Soc. Ser. B Stat. Methodol. 67(2), 253–268 (2005)
 7. Cotter, S., Roberts, G., Stuart, A., White, D., et al.: MCMC methods for functions: modifying old algorithms to make them faster. Stat. Sci. 28(3), 424–446 (2013)
 8. Da Prato, G., Zabczyk, J.: Stochastic Equations in Infinite Dimensions. Encyclopedia of Mathematics and Its Applications. Cambridge University Press, Cambridge (1992)
 9. Hairer, M., Stuart, A., Voss, J.: Analysis of SPDEs arising in path sampling. Part II: the nonlinear case. Ann. Appl. Probab. 17(5–6), 1657–1706 (2007)
 10. Hairer, M., Stuart, A., Voss, J., Wiberg, P.: Analysis of SPDEs arising in path sampling. Part I: the Gaussian case. Commun. Math. Sci. 3, 587–603 (2005)
 11. Hastings, W.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)
 12. Jourdain, B., Lelièvre, T., Miasojedow, B.: Optimal scaling for the transient phase of Metropolis–Hastings algorithms: the long-time behavior. Bernoulli 20(4), 1930–1978 (2014)
 13. Jourdain, B., Lelièvre, T., Miasojedow, B.: Optimal scaling for the transient phase of the random walk Metropolis algorithm: the mean-field limit. Ann. Appl. Probab. 25(4), 2263–2300 (2015)
 14. Kuntz, J., Ottobre, M., Stuart, A.: Diffusion limit for the Random Walk Metropolis algorithm out of stationarity. arXiv preprint (2016)
 15. Mattingly, J., Pillai, N., Stuart, A.: Diffusion limits of the random walk Metropolis algorithm in high dimensions. Ann. Appl. Probab. 22(3), 881–930 (2012)
 16. Neal, R.M.: Regression and classification using Gaussian process priors (with discussion). In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (eds.) Bayesian statistics 6. Oxford University Press (1998). https://www.cs.toronto.edu/~radford/ftp/val6gp.pdf
 17. Pillai, N., Stuart, A., Thiéry, A.: Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions. Ann. Appl. Probab. 22(6), 2320–2356 (2012)
 18. Pillai, N., Stuart, A., Thiéry, A.: Noisy gradient flow from a random walk in Hilbert space. Stoch. Partial Differ. Equ. Anal. Comput. 2(2), 196–232 (2014)
 19. Roberts, G., Gelman, A., Gilks, W.: Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7(1), 110–120 (1997)
 20. Roberts, G., Rosenthal, J.: Optimal scaling of discrete approximations to Langevin diffusions. J. R. Stat. Soc. Ser. B Stat. Methodol. 60(1), 255–268 (1998)
 21. Roberts, G., Tweedie, R.: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4), 341–363 (1996)
 22. Stuart, A.: Inverse problems: a Bayesian perspective. Acta Numerica 19, 451–559 (2010)
 23. Tierney, L.: A note on Metropolis–Hastings kernels for general state spaces. Ann. Appl. Probab. 8(1), 1–9 (1998)
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.