Noisy gradient flow from a random walk in Hilbert space
Abstract
Consider a probability measure on a Hilbert space defined via its density with respect to a Gaussian. The purpose of this paper is to demonstrate that an appropriately defined Markov chain, which is reversible with respect to the measure in question, exhibits a diffusion limit to a noisy gradient flow, also reversible with respect to the same measure. The Markov chain is defined by applying a Metropolis–Hastings accept–reject mechanism (Tierney, Ann Appl Probab 8:1–9, 1998) to an Ornstein–Uhlenbeck (OU) proposal which is itself reversible with respect to the underlying Gaussian measure. The resulting noisy gradient flow is a stochastic partial differential equation driven by a Wiener process with spatial correlation given by the underlying Gaussian structure. There are two primary motivations for this work. The first concerns insight into Markov chain Monte Carlo (MCMC) methods for sampling of measures on a Hilbert space defined via a density with respect to a Gaussian measure. These measures must be approximated on finite dimensional spaces of dimension \(N\) in order to be sampled. A conclusion of the work herein is that MCMC methods based on prior-reversible OU proposals will explore the target measure in \({\mathcal O}(1)\) steps with respect to dimension \(N\). This is to be contrasted with standard MCMC methods based on the random walk or Langevin proposals which require \({\mathcal O}(N)\) and \({\mathcal O}(N^{1/3})\) steps respectively (Mattingly et al., Ann Appl Prob 2011; Pillai et al., Ann Appl Prob 22:2320–2356 2012). The second motivation relates to optimization. There are many applications where it is of interest to find global or local minima of a functional defined on an infinite dimensional Hilbert space. Gradient flow or steepest descent is a natural approach to this problem, but in its basic form requires computation of a gradient which, in some applications, may be an expensive or complex task.
This paper shows that a stochastic gradient descent described by a stochastic partial differential equation can emerge from certain carefully specified Markov chains. This idea is well known in the finite state (Kirkpatrick et al., Science 220:671–680, 1983; Cerny, J Optim Theory Appl 45:41–51, 1985) or finite dimensional context (German, IEEE Trans Geosci Remote Sens 1:269–276, 1985; German, SIAM J Control Optim 24:1031, 1986; Chiang, SIAM J Control Optim 25:737–753, 1987; J Funct Anal 83:333–347, 1989). The novelty of the work in this paper is that the emergence of the noisy gradient flow is developed on an infinite dimensional Hilbert space. In the context of global optimization, when the noise level is also adjusted as part of the algorithm, methods of the type studied here go by the name of simulated annealing; see the review (Bertsimas and Tsitsiklis, Stat Sci 8:10–15, 1993) for further references. Although we do not consider adjusting the noise level as part of the algorithm, the noise strength is a tuneable parameter in our construction and the methods developed here could potentially be used to study simulated annealing in a Hilbert space setting. The transferable idea behind this work is that conceiving of algorithms directly in the infinite dimensional setting leads to methods which are robust to finite dimensional approximation. We emphasize that discretizing, and then applying standard finite dimensional techniques in \(\mathbb {R}^N\), to either sample or optimize, can lead to algorithms which degenerate as the dimension \(N\) increases.
Keywords
Optimisation, Simulated annealing, Markov chain Monte Carlo, Diffusion approximation
Mathematics Subject Classification
60-08, 60H15, 60J25
1 Introduction
Because the SDE (7) does not possess the smoothing property, almost sure fine scale properties under its invariant measure \(\pi ^{\tau }\) are not necessarily reflected at any finite time. For example, if \(C\) is the covariance operator of Brownian motion or Brownian bridge then the quadratic variation of draws from the invariant measure, an almost sure quantity, is not reproduced at any finite time in (7) unless \(z(0)\) has this quadratic variation; the almost sure property is approached asymptotically as \(t \rightarrow \infty \). This behaviour is reflected in the underlying Metropolis–Hastings Markov chain pCN with weak limit (7), where the almost sure property is only reached asymptotically as \(n \rightarrow \infty .\) In a second result of this paper we will show that almost sure quantities such as the quadratic variation under pCN satisfy a limiting linear ODE with globally attractive steady state given by the value of the quantity under \(\pi ^{\tau }\). This gives quantitative information about the rate at which the pCN algorithm approaches statistical equilibrium.
We have motivated the limit theorem in this paper through the goal of creating noisy gradient flow in infinite dimensions with tuneable noise level, using only draws from a Gaussian random variable and evaluation of the nonquadratic part of the objective function. A second motivation for the work comes from understanding the computational complexity of MCMC methods, and for this it suffices to consider \(\tau \) fixed at \(1\). The paper [2] shows that a discretization of the standard Random Walk Metropolis algorithm, SRWM, also has a diffusion limit given by (7) as the dimension \(N\) of the discretized space tends to infinity, provided that the time increment \(\delta \) in (6) decreases at a rate inversely proportional to \(N\). The condition on \(\delta \) is a form of CFL condition, in the language of computational PDEs, and implies that \(\mathcal{O}(N)\) steps will be required to sample the desired probability distribution. In contrast the pCN method analyzed here has no CFL restriction: \(\delta \) may tend to zero independently of dimension; indeed in this paper we work directly in the setting of infinite dimension. The reader interested in this computational statistics perspective on diffusion limits may also wish to consult the paper [3] which demonstrates that the Metropolis adjusted Langevin algorithm, MALA, requires a CFL condition which implies that \(\mathcal{O}(N^{\frac{1}{3}})\) steps are required to sample the desired probability distribution. Furthermore, the formulation of the limit theorem that we prove in this paper is closely related to the methodologies introduced in [2] and [3]; it should be mentioned nevertheless that the analysis carried out in this article allows us to prove a diffusion limit for a sequence of Markov chains evolving in a possibly nonstationary regime. This was not the case in [2] and [3].
We prove in Theorem 4 that for a fixed temperature parameter \(\tau >0\), as the time increment \(\delta \) goes to \(0\), the pCN algorithm behaves as a stochastic gradient descent. By adapting the temperature \(\tau \in (0,\infty )\) according to an appropriate cooling schedule it is possible to locate global minima of \(J\); standard heuristics show that the distribution \(\pi ^{\tau }\) concentrates on a \(\tau ^{1/2}\)-neighbourhood around the global minima of the functional \(J\). We stress though that all the proofs presented in this article assume a constant temperature. The asymptotic analysis of the effect of the cooling schedule is left for future work; the study of such Hilbert space valued simulated annealing algorithms presents several challenges, one of them being that the probability distributions \(\pi ^{\tau }\) are mutually singular for different temperatures \(\tau >0\).
In Sect. 2 we describe some notation used throughout the paper, discuss the required properties of Gaussian measures and Hilbert space valued Brownian motions, and state our assumptions. Section 3 contains a precise definition of the Markov chain \(\{x^{k,\delta }\}_{k \ge 0}\), together with the statement and proof of the weak convergence theorem that is the main result of the paper. Section 4 contains proofs of the lemmas which underlie the weak convergence theorem. In Sect. 5 we state and prove the limit theorem for almost sure quantities such as quadratic variation; such results are often termed “fluid limits” in the applied probability literature. An example is presented in Sect. 6. We conclude in Sect. 7.
2 Preliminaries
In this section we define some notational conventions, Gaussian measure and Brownian motion in Hilbert space, and state our assumptions concerning the operator \(C\) and the functional \({\varPsi }.\)
2.1 Notation

Two sequences \(\{\alpha _n\}_{n \ge 0}\) and \(\{\beta _n\}_{n \ge 0}\) satisfy \(\alpha _n \lesssim \beta _n\) if there exists a constant \(K>0\) satisfying \(\alpha _n \le K \beta _n\) for all \(n \ge 0\). The notation \(\alpha _n \asymp \beta _n\) means that \(\alpha _n \lesssim \beta _n\) and \(\beta _n \lesssim \alpha _n\).

Two sequences of real functions \(\{f_n\}_{n \ge 0}\) and \(\{g_n\}_{n \ge 0}\) defined on the same set \({\varOmega }\) satisfy \(f_n \lesssim g_n\) if there exists a constant \(K>0\) satisfying \(f_n(x) \le K g_n(x)\) for all \(n \ge 0\) and all \(x \in {\varOmega }\). The notation \(f_n \asymp g_n\) means that \(f_n \lesssim g_n\) and \(g_n \lesssim f_n\).

The notation \(\mathbb {E}_x \big [ f(x,\xi ) \big ]\) denotes expectation with variable \(x\) fixed, while the randomness present in \(\xi \) is averaged out.

We use the notation \(a \wedge b\) instead of \(\min (a,b)\).
2.2 Gaussian measure on Hilbert space
2.3 Assumptions
In this section we describe the assumptions on the covariance operator \(C\) of the Gaussian measure \(\pi _0 =\mathrm N (0,C)\) and the functional \({\varPsi }\), and the connections between them. Roughly speaking we will assume that the second derivative of \({\varPsi }\) is globally bounded as an operator acting between two spaces which arise naturally from understanding the domain of the function \({\varPsi }\); furthermore the domain of \({\varPsi }\) must be a set of full measure with respect to the underlying Gaussian. If the eigenvalues of \(C\) decay like \(j^{-2\kappa }\) and \(\kappa >\frac{1}{2}\) then \(\pi _0^{\tau }(\mathcal {H}^s)=1\) for all \(s<\kappa - \frac{1}{2}\) and so we will assume eigenvalue decay of this form and assume the domain of \({\varPsi }\) is defined appropriately. We now formalize these ideas.
Assumption 1
 A1. Decay of Eigenvalues \(\lambda _j^2\) of \(C\): there exists a constant \(\kappa > \frac{1}{2}\) such that$$\begin{aligned} \lambda _j \asymp j^{-\kappa }. \end{aligned}$$(13)
 A2. Domain of \({\varPsi }\): there exists an exponent \(s \in [0, \kappa - 1/2)\) such that \({\varPsi }\) is defined on \(\mathcal {H}^s\).
 A3. Size of \({\varPsi }\): the functional \({\varPsi }:\mathcal {H}^s \rightarrow \mathbb {R}\) satisfies the growth conditions$$\begin{aligned} 0 \le {\varPsi }(x) \lesssim 1 + \Vert x\Vert _s^2 . \end{aligned}$$
 A4. Derivatives of \({\varPsi }\): The derivatives of \({\varPsi }\) satisfy$$\begin{aligned} \Vert \nabla {\varPsi }(x)\Vert _{-s} \lesssim 1 + \Vert x\Vert _s \qquad \text {and} \qquad \Vert \partial ^2 {\varPsi }(x)\Vert _{\mathcal{L}(\mathcal{H}^{s},\mathcal{H}^{-s})} \lesssim 1. \end{aligned}$$
Remark 1
The condition \(\kappa > \frac{1}{2}\) ensures that \(\mathrm {Trace}_{\mathcal {H}^r}(C_r) < \infty \) for any \(r < \kappa - \frac{1}{2}\): this implies that \(\pi ^{\tau }_0(\mathcal {H}^r)=1\) for any \(\tau > 0\) and \(r < \kappa - \frac{1}{2}\).
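The trace condition in Remark 1 can be illustrated numerically. The sketch below assumes the idealised eigenvalues \(\lambda _j = j^{-\kappa }\) (A1 only requires \(\lambda _j \asymp j^{-\kappa }\)) and assumes that the trace of \(C_r\) in \(\mathcal {H}^r\) is \(\sum _j j^{2r} \lambda _j^2\); the function name `trace_partial_sum` is hypothetical. With \(\kappa = 1\) the partial sums stabilise for \(r = 0.25 < \kappa - \frac{1}{2}\) and keep growing for \(r = 0.75 > \kappa - \frac{1}{2}\).

```python
import numpy as np

def trace_partial_sum(kappa, r, n_terms):
    """Partial sum of sum_j j^(2r) * lambda_j^2 with the assumed lambda_j = j^(-kappa)."""
    j = np.arange(1, n_terms + 1, dtype=float)
    return float(np.sum(j ** (2.0 * r - 2.0 * kappa)))

kappa = 1.0
for r in (0.25, 0.75):  # below and above the threshold kappa - 1/2
    sums = [trace_partial_sum(kappa, r, n) for n in (10**3, 10**4, 10**5)]
    print(f"r = {r}: partial sums {sums}")
```

The summand is \(j^{2r - 2\kappa }\), which is summable exactly when \(2\kappa - 2r > 1\), that is, when \(r < \kappa - \frac{1}{2}\).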
Remark 2
The functional \({\varPsi }(x) = \frac{1}{2}\Vert x\Vert _s^2\) is defined on \(\mathcal {H}^s\) and satisfies Assumptions 1. Its derivative at \(x \in \mathcal {H}^s\) is given by \(\nabla {\varPsi }(x) = \sum _{j \ge 0} j^{2s} x_j \varphi _j \in \mathcal {H}^{-s}\) with \(\Vert \nabla {\varPsi }(x)\Vert _{-s} = \Vert x\Vert _s\). The second derivative \(\partial ^2 {\varPsi }(x) \in \mathcal {L}(\mathcal {H}^s, \mathcal {H}^{-s})\) is the linear operator that maps \(u \in \mathcal {H}^s\) to \(\sum _{j \ge 0} j^{2s} \langle u,\varphi _j \rangle \varphi _j \in \mathcal {H}^{-s}\): its norm satisfies \(\Vert \partial ^2 {\varPsi }(x) \Vert _{\mathcal {L}(\mathcal {H}^s, \mathcal {H}^{-s})} = 1\) for any \(x \in \mathcal {H}^s\).
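The identity \(\Vert \nabla {\varPsi }(x)\Vert _{-s} = \Vert x\Vert _s\) in Remark 2 can be checked on a truncated coefficient vector. This is a minimal sketch assuming the Sobolev-like norms \(\Vert x\Vert _r^2 = \sum _j j^{2r} x_j^2\) with indices starting at \(j = 1\); the helper names `sobolev_norm` and `grad_psi` are hypothetical.

```python
import numpy as np

def sobolev_norm(coeffs, r):
    """Assumed norm ||x||_r = sqrt(sum_j j^(2r) x_j^2) on truncated coefficients."""
    j = np.arange(1, coeffs.size + 1, dtype=float)
    return float(np.sqrt(np.sum(j ** (2.0 * r) * coeffs ** 2)))

def grad_psi(coeffs, s):
    """Coefficients of the gradient of Psi(x) = 0.5 * ||x||_s^2, i.e. j^(2s) x_j."""
    j = np.arange(1, coeffs.size + 1, dtype=float)
    return j ** (2.0 * s) * coeffs

rng = np.random.default_rng(0)
s = 0.3
x = rng.standard_normal(200) / np.arange(1, 201) ** 1.5  # arbitrary decaying coefficients
lhs = sobolev_norm(grad_psi(x, s), -s)  # norm of the gradient in H^{-s}
rhs = sobolev_norm(x, s)                # norm of x in H^{s}
print(lhs, rhs)
```

The two printed values agree up to rounding because \(j^{-2s} (j^{2s} x_j)^2 = j^{2s} x_j^2\) term by term.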
The Assumptions 1 ensure that the functional \({\varPsi }\) behaves well in a sense made precise in the following lemma.
Lemma 2
 1. The function \(d(x) \overset{{\tiny {\text{ def }}}}{=}-\Big ( x + C \nabla {\varPsi }(x) \Big )\) is globally Lipschitz on \(\mathcal {H}^s\):$$\begin{aligned} \Vert d(x) - d(y)\Vert _s \lesssim \Vert x-y\Vert _s \qquad \qquad \forall x,y \in \mathcal {H}^s. \end{aligned}$$(14)
 2. The second order remainder term in the Taylor expansion of \({\varPsi }\) satisfies$$\begin{aligned} \big | {\varPsi }(y)-{\varPsi }(x) - \langle \nabla {\varPsi }(x), y-x \rangle \big | \lesssim \Vert y-x\Vert _s^2 \qquad \qquad \forall x,y \in \mathcal {H}^s. \end{aligned}$$(15)
Proof
See [2].
In order to provide a clean exposition, which highlights the central theoretical ideas, we have chosen to make global assumptions on \({\varPsi }\) and its derivatives. We believe that our limit theorems could be extended to localized versions of these assumptions, at the cost of considerable technical complications in the proofs, by means of stopping-time arguments. The numerical example presented in Sect. 6 corroborates this assertion. There are many applications which satisfy local versions of the assumptions given, including the Bayesian formulation of inverse problems [18] and conditioned diffusions [19].
3 Diffusion limit theorem
This section contains a precise statement of the algorithm, the statement of the main theorem showing that the piecewise linear interpolant of the output of the algorithm converges weakly to a noisy gradient flow described by an SPDE, and the proof of the main theorem. The proofs of various technical lemmas are deferred to Sect. 4.
3.1 pCN algorithm
Lemma 3
Proof
The definition of the proposal (16) shows that \( \Vert y-x\Vert ^p_s \lesssim \delta ^p \; \Vert x\Vert ^p_s + \delta ^{\frac{p}{2}} \, \mathbb {E}\big [ \Vert \xi \Vert ^p_s \big ]\). Fernique’s theorem [17] shows that \(\xi \) has exponential moments and therefore \(\mathbb {E}\big [ \Vert \xi \Vert ^p_s \big ] < \infty \). This gives the conclusion.
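For orientation, one step of a pCN-type chain can be sketched in a truncated eigenbasis. The proposal \(y = (1-2\delta )^{\frac{1}{2}} x + (2\tau \delta )^{\frac{1}{2}} \xi \) with \(\xi \sim \mathrm N (0,C)\) is the one used in the paper; the acceptance rule \(1 \wedge \exp \big ( -\tau ^{-1} [{\varPsi }(y) - {\varPsi }(x)] \big )\), the diagonal truncation of \(C\), and the names `pcn_step` and `run_chain` are assumptions made for this illustration only.

```python
import numpy as np

def pcn_step(x, psi, lam, delta, tau, rng):
    """One pCN-type step on the first len(x) eigen-coefficients.

    lam holds assumed square roots of the eigenvalues, so C = diag(lam**2).
    """
    xi = lam * rng.standard_normal(x.size)  # truncated draw xi ~ N(0, C)
    y = np.sqrt(1.0 - 2.0 * delta) * x + np.sqrt(2.0 * delta * tau) * xi
    # Assumed acceptance rule: min(1, exp(-(Psi(y) - Psi(x)) / tau)), in log form.
    log_alpha = min(0.0, -(psi(y) - psi(x)) / tau)
    accepted = bool(np.log(rng.uniform()) < log_alpha)
    return (y if accepted else x), accepted

def run_chain(x0, psi, lam, delta, tau, n_steps, seed=0):
    """Run the chain and return the final state and the empirical acceptance rate."""
    rng = np.random.default_rng(seed)
    x, n_acc = x0.copy(), 0
    for _ in range(n_steps):
        x, accepted = pcn_step(x, psi, lam, delta, tau, rng)
        n_acc += accepted
    return x, n_acc / n_steps

if __name__ == "__main__":
    n = 200
    lam = np.arange(1, n + 1, dtype=float) ** -1.0   # assumed lambda_j = j^(-1)
    psi = lambda v: 0.5 * float(np.sum(v ** 2))      # Psi(x) = 0.5 * ||x||_0^2
    _, rate = run_chain(np.zeros(n), psi, lam, delta=0.05, tau=1.0, n_steps=2000)
    print("acceptance rate:", rate)
```

Only evaluations of \({\varPsi }\) and Gaussian draws are needed; no gradient of \({\varPsi }\) is computed. When \({\varPsi }\) is constant every proposal is accepted, which reflects the reversibility of the proposal with respect to the Gaussian reference measure.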
3.2 Diffusion limit theorem
Fix a time horizon \(T > 0\) and a temperature \(\tau \in (0,\infty )\). The piecewise linear interpolant \(z^\delta \) of the Markov chain (19) is defined by Eq. (6). The following is the main result of this article. Note that “weakly” refers to weak convergence of probability measures.
Theorem 4
Conditions 5
 1. Convergence of the drift: there exists a globally Lipschitz function \(d:\mathcal {H}^s \rightarrow \mathcal {H}^s\) such that$$\begin{aligned} \Vert d^\delta (x)-d(x)\Vert _s \lesssim \delta \cdot \big ( 1+\Vert x\Vert ^{p}_s \big ) \end{aligned}$$(25)
 2. Invariance principle: as \(\delta \) tends to zero the sequence of processes \(\{W^{\delta }\}_{\delta \in (0,\frac{1}{2})}\) defined by Eq. (24) converges weakly in \(C([0,T],\mathcal {H}^s)\) to a Brownian motion \(W\) in \(\mathcal {H}^s\) with covariance operator \(C_s\).
 3. A priori bound: the following bound holds$$\begin{aligned} \sup _{\delta \in \left( 0,\frac{1}{2}\right) } \Big \{ \delta \cdot \mathbb {E}\Big [ \sum _{k \delta \le T} \Vert x^{k,\delta }\Vert ^p_s \Big ] \Big \} < \infty . \end{aligned}$$(26)
Remark 3
We now prove that Conditions 5 are sufficient to obtain a diffusion approximation for the sequence of rescaled processes \(z^{\delta }\) defined by Eq. (6), as \(\delta \) tends to zero. In contrast to more classical diffusion approximation results for Markov processes [21, 22], which are based on infinitesimal generators, the next lemma exploits specific structures which arise when the limiting process has additive noise; in particular, it is based on the preservation of weak convergence under continuous mappings, together with an explicit construction of the noise process. This idea has previously appeared in the literature in, for example, the articles [2, 3] in the context of MCMC and the article [23], and the references therein, in the context of the derivation of SDEs from ODEs with random data.
Lemma 6
Proof

Integral equation representation
Notice that solutions of the \(\mathcal {H}^s\)-valued SDE (27) are nothing else than solutions of the following integral equation,$$\begin{aligned} z(t) = x_* + \int _0^t \, d(z(u)) \, du + \sqrt{2 \tau } W(t) \qquad \forall t \in (0,T), \end{aligned}$$(28)where \(W\) is a Brownian motion in \(\mathcal {H}^s\) with covariance operator equal to \(C_s\). We thus introduce the Itô map \({\varTheta }: C([0,T],\mathcal {H}^s) \rightarrow C([0,T],\mathcal {H}^s)\) that sends a function \(W \in C([0,T],\mathcal {H}^s)\) to the unique solution of the integral Eq. (28): the solution of (27) can be represented as \({\varTheta }(W)\) where \(W\) is an \(\mathcal {H}^s\)-valued Brownian motion with covariance \(C_s\). As is described below, the function \({\varTheta }\) is continuous if \(C([0,T],\mathcal {H}^s)\) is topologized by the uniform norm \(\Vert w\Vert _{C([0,T],\mathcal {H}^s)} \overset{{\tiny {\text{ def }}}}{=}\sup \{ \Vert w(t)\Vert _{s} : t \in (0,T)\}\). It is crucial to notice that the rescaled process \(z^{\delta }\), defined in Eq. (6), satisfies \(z^{\delta } = {\varTheta }(\widehat{W}^{\delta })\) with$$\begin{aligned} \widehat{W}^{\delta }(t)\!:= W^{\delta }(t) + \frac{1}{\sqrt{2\tau }} \int _0^t [ d^{\delta }(\bar{z}^{\delta }(u)) - d(z^{\delta }(u)) ]\,du. \end{aligned}$$(29)In Equation (29), the quantity \(d^{\delta }\) is the approximate drift defined in Eq. (23) and \(\bar{z}^{\delta }\) is the rescaled piecewise constant interpolant of \(\{x^{k,\delta }\}_{k \ge 0}\) defined as$$\begin{aligned} \bar{z}^{\delta }(t) = x^{k,\delta } \qquad \text {for} \qquad t_k \le t < t_{k+1}. \end{aligned}$$(30)The proof follows from a continuous mapping argument (see below) once it is proven that \(\widehat{W}^{\delta }\) converges weakly in \(C([0,T],\mathcal {H}^s)\) to \(W\).
The Itô map \({\varTheta }\) is continuous
It can be proved that \({\varTheta }\) is continuous as a mapping from \(\Big (C([0,T],\mathcal {H}^s), \Vert \cdot \Vert _{C([0,T],\mathcal {H}^s)} \Big )\) to itself. The usual Picard iteration proof of the Cauchy–Lipschitz theorem of ODEs may be employed: see [2].
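To make the Itô map concrete, the following sketch applies Picard iteration to a time-discretised version of the integral equation \(z(t) = x_* + \int _0^t d(z(u))\,du + \sqrt{2\tau }\,W(t)\) in finite dimension. The left-endpoint quadrature, the uniform grid, and the name `ito_map_picard` are assumptions of this illustration; it is not the argument of [2], only a finite dimensional analogue of it.

```python
import numpy as np

def ito_map_picard(w, x0, drift, tau, h, n_iter=50):
    """Approximate Theta(w) on a uniform grid with step h by Picard iteration.

    w: array of shape (n_steps + 1, dim) holding the driving path, with w[0] = 0.
    The integral is replaced by a left-endpoint Riemann sum (an assumption).
    """
    noise = np.sqrt(2.0 * tau) * w
    z = np.tile(x0, (w.shape[0], 1)) + noise  # zeroth iterate: no drift contribution
    for _ in range(n_iter):
        d_vals = np.array([drift(zk) for zk in z[:-1]])
        integral = np.vstack([np.zeros((1, z.shape[1])), h * np.cumsum(d_vals, axis=0)])
        z = x0 + integral + noise  # next Picard iterate
    return z

h, n_steps = 0.05, 20
w = np.zeros((n_steps + 1, 1))  # zero driving path: the map reduces to an ODE solve
z = ito_map_picard(w, np.array([1.0]), lambda v: -v, tau=1.0, h=h)
print(z[-1, 0], (1.0 - h) ** n_steps)
```

On the grid, each iteration fixes one more time point exactly, so once `n_iter` is at least the number of steps the iterate coincides with the explicit Euler solution. Small perturbations of `w` in the supremum norm then change the output by a comparably small amount when `drift` is Lipschitz, which mirrors the continuity of \({\varTheta }\).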

The sequence of processes \(\widehat{W}^{\delta }\) converges weakly to \(W\)
The process \(\widehat{W}^{\delta }(t)\) is defined by \(\widehat{W}^{\delta }(t) = W^{\delta }(t) + \frac{1}{\sqrt{2\tau }} \int _0^t [ d^{\delta }(\bar{z}^{\delta }(u)) - d(z^{\delta }(u)) ]\,du\) and Conditions 5 state that \(W^{\delta }\) converges weakly to \(W\) in \(C([0,T], \mathcal {H}^s)\). Consequently, to prove that \(\widehat{W}^{\delta }(t)\) converges weakly to \(W\) in \(C([0,T], \mathcal {H}^s)\), it suffices (Slutsky’s lemma) to verify that the sequence of processes$$\begin{aligned} (\omega ,t) \mapsto \int _0^t \big [ d^{\delta }(\bar{z}^{\delta }(u)) - d(z^{\delta }(u)) \big ] \,du \end{aligned}$$(31)converges to zero in probability with respect to the supremum norm in \(C([0,T],\mathcal {H}^s)\). By Markov’s inequality, it is enough to check that \(\mathbb {E}\big [ \int _0^T \!\Vert d^{\delta }(\bar{z}^{\delta }(u)) - d(z^{\delta }(u)) \Vert _s \; du \!\big ]\) converges to zero as \(\delta \) goes to zero. Conditions 5 state that there exists an integer \(p \ge 1\) such that \(\Vert d^{\delta }(x)-d(x)\Vert _s \lesssim \delta \cdot (1+\Vert x\Vert _s^{p})\) so that for any \(t_k \le u < t_{k+1}\) we have$$\begin{aligned} \Big \Vert d^{\delta }(\bar{z}^{\delta }(u)) - d(\bar{z}^{\delta }(u)) \Big \Vert _s \lesssim \delta \big ( 1+\Vert \bar{z}^{\delta }(u)\Vert _s^{p} \big ) = \delta \big ( 1+\Vert x^{k,\delta }\Vert _s^{p} \big ). \end{aligned}$$(32)Conditions 5 state that \(d(\cdot )\) is globally Lipschitz on \(\mathcal {H}^s\). Therefore, Lemma 3 shows that$$\begin{aligned} \mathbb {E}\Vert d(\bar{z}^{\delta }(u)) - d(z^{\delta }(u))\Vert _s \lesssim \mathbb {E}\Vert x^{k+1,\delta }-x^{k,\delta }\Vert _s \lesssim \delta ^{\frac{1}{2}} (1+\Vert x^{k,\delta }\Vert _s). \end{aligned}$$(33)From estimates (32) and (33) it follows that \(\Vert d^{\delta }(\bar{z}^{\delta }(u)) - d(z^{\delta }(u)) \Vert _s \lesssim \delta ^{\frac{1}{2}} (1+\Vert x^{k,\delta }\Vert ^{p}_s)\). Consequently$$\begin{aligned} \mathbb {E}\Big [ \int _0^T \Vert d^{\delta }(\bar{z}^{\delta }(u)) - d(z^{\delta }(u)) \Vert _s \; du \Big ] \lesssim \delta ^{\frac{3}{2}} \sum _{k \delta < T} \mathbb {E}\Big [ 1+\Vert x^{k,\delta }\Vert ^{p}_s \Big ]. \end{aligned}$$(34)The a priori bound of Conditions 5 shows that this last quantity converges to zero as \(\delta \) converges to zero, which finishes the proof that the process (31) converges to zero. This concludes the proof of \(\widehat{W}^{\delta }(t) \Longrightarrow W\).
Continuous mapping argument
It has been proved that \({\varTheta }\) is continuous as a mapping from \(\Big (C([0,T],\mathcal {H}^s), \Vert \cdot \Vert _{C([0,T],\mathcal {H}^s)} \Big )\) to itself. The solutions of the \(\mathcal {H}^s\)-valued SDE (27) can be expressed as \({\varTheta }(W)\) while the rescaled continuous interpolant \(z^{\delta }\) also reads \(z^{\delta } = {\varTheta }(\widehat{W}^{\delta })\). Since \(\widehat{W}^{\delta }\) converges weakly in \(\Big (C([0,T],\mathcal {H}^s), \Vert \cdot \Vert _{C([0,T],\mathcal {H}^s)} \Big )\) to \(W\) as \(\delta \) tends to zero, the continuous mapping theorem ensures that \(z^{\delta }\) converges weakly in \(\Big (C([0,T],\mathcal {H}^s), \Vert \cdot \Vert _{C([0,T],\mathcal {H}^s)} \Big )\) to the solution \({\varTheta }(W)\) of the \(\mathcal {H}^s\)-valued SDE (27). This ends the proof of Lemma 6.
In order to establish Theorem 4 as a consequence of the general diffusion approximation Lemma 6, it suffices to verify that if Assumptions 1 hold then Conditions 5 are satisfied by the Markov chain \(x^{\delta }\) defined in Sect. 3.1. In Sect. 4.2 we prove the following quantitative version of the approximation of the function \(d^{\delta }(\cdot )\) by the function \(d(\cdot )\) where \(d(x) = -\Big ( x + C \nabla {\varPsi }(x)\Big )\).
Lemma 7
It follows from Lemma 7 that Eq. (25) of Conditions 5 is satisfied as soon as Assumptions 1 hold. The invariance principle of Conditions 5 follows from the next lemma. It is proved in Sect. 4.5.
Lemma 8
In Sect. 4.4 it is proved that the following a priori bound is satisfied,
Lemma 9
In conclusion, Lemmas 7, 8 and 9 together show that Conditions 5 are consequences of Assumptions 1. Therefore, under Assumptions 1, the general diffusion approximation Lemma 6 can be applied: this concludes the proof of Theorem 4.
4 Key estimates
This section assembles various results which are used in the previous section. Some of the technical proofs are deferred to the appendix.
4.1 Acceptance probability asymptotics
Lemma 10
Proof
See Appendix 1
Recall the local mean acceptance \(\alpha ^{\delta }(x)\) defined in Eq. (18). Define the approximate local mean acceptance probability by \(\bar{\alpha }^{\delta }(x) \overset{{\tiny {\text{ def }}}}{=}\mathbb {E}_x[ \bar{\alpha }^{\delta }(x,\xi )]\). One can use Lemma 10 to approximate the local mean acceptance probability \(\alpha ^{\delta }(x)\).
Corollary 1
Proof
See Appendix 1
4.2 Drift estimates
Explicit computations are available for the quantity \(\bar{\alpha }^{\delta }\). We will use these results, together with quantification of the error committed in replacing \(\alpha ^{\delta }\) by \(\bar{\alpha }^{\delta }\), to estimate the mean drift (in this section) and the diffusion term (in the next section).
Lemma 11
Proof
We now use this explicit computation to give a proof of the drift estimate Lemma 7.
Proof
 Lemma 10 and Corollary 1 show that$$\begin{aligned} \Vert B_1 + x \Vert _s^p&= \Big \{ \frac{(1-2\delta )^{\frac{1}{2}} - 1}{\delta }\alpha ^{\delta }(x) + 1 \Big \}^p \Vert x\Vert ^p_s \\&\lesssim \Big \{ \big | \frac{(1-2\delta )^{\frac{1}{2}} - 1}{\delta } + 1 \big |^p + \big | \alpha ^{\delta }(x) - 1 \big |^p \Big \} \Vert x\Vert ^p_s \nonumber \\&\lesssim \Big \{\delta ^p + \delta ^\frac{p}{2} (1+\Vert x\Vert _s^p) \Big \} \Vert x\Vert ^p_s \lesssim \delta ^\frac{p}{2} (1+\Vert x\Vert _s^{2p}).\nonumber \end{aligned}$$(44)
 Lemma 10 shows that$$\begin{aligned} \Vert B_2 + C\nabla {\varPsi }(x)\Vert _s^p&= \big \Vert \sqrt{\frac{2\tau }{\delta }} \, \mathbb {E}_x[\alpha ^{\delta }(x,\xi ) \, \xi ] + C\nabla {\varPsi }(x) \big \Vert _s^p \\&\lesssim \delta ^{-\frac{p}{2}} \big \Vert \mathbb {E}_x[\{\alpha ^{\delta }(x,\xi ) - \bar{\alpha }^{\delta }(x,\xi ) \} \, \xi ] \big \Vert _s^p\nonumber \\&\quad +\, \big \Vert \underbrace{ \sqrt{\frac{2\tau }{\delta }} \, \mathbb {E}_x[\bar{\alpha }^{\delta }(x,\xi ) \, \xi ] + C\nabla {\varPsi }(x)}_{=0} \big \Vert _s^p.\nonumber \end{aligned}$$(45)By Lemma 11, the second term on the right-hand side equals zero. Consequently, the Cauchy–Schwarz inequality implies that$$\begin{aligned} \Vert B_2 + C\nabla {\varPsi }(x)\Vert _s^p&\lesssim \delta ^{-\frac{p}{2}} \mathbb {E}_x[ \big |\alpha ^{\delta }(x,\xi ) - \bar{\alpha }^{\delta }(x,\xi )\big |^2]^{\frac{p}{2}}\\&\lesssim \delta ^{-\frac{p}{2}} \Big ( \delta ^2 (1+\Vert x\Vert _s^{4})\Big )\!^{\frac{p}{2}} \lesssim \delta ^{\frac{p}{2}} (1+\Vert x\Vert _s^{2p}). \end{aligned}$$
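The \(\delta ^p\) contribution in the bound (44) rests on the elementary fact that \(\frac{(1-2\delta )^{1/2} - 1}{\delta } + 1\) is of order \(\delta \). A small numerical check, with the hypothetical helper `drift_prefactor_gap`, is sketched below; the Taylor expansion \((1-2\delta )^{1/2} = 1 - \delta - \frac{1}{2}\delta ^2 + \mathcal {O}(\delta ^3)\) suggests that the ratio to \(\delta \) tends to \(-\frac{1}{2}\).

```python
import numpy as np

def drift_prefactor_gap(delta):
    """Return ((1 - 2*delta)**0.5 - 1) / delta + 1, expected to be O(delta)."""
    return (np.sqrt(1.0 - 2.0 * delta) - 1.0) / delta + 1.0

for delta in (0.1, 0.01, 0.001):
    gap = drift_prefactor_gap(delta)
    # The ratio gap / delta should stay bounded and approach -1/2 as delta shrinks.
    print(delta, gap, gap / delta)
```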
4.3 Noise estimates
Lemma 12
Proof
See Appendix 1
4.4 A priori bound
Now we have all the ingredients for the proof of the a priori bound presented in Lemma 9 which states that the rescaled process \(z^{\delta }\) given by Eq. (6) does not blow up in finite time.
Proof
 Let us suppose \((i,j,k) = (n-1,0,1)\). Lemma 7 states that the approximate drift has a linearly bounded growth so that$$\begin{aligned} \Big \Vert \mathbb {E}\big [ x^{k+1,\delta } - x^{k,\delta } \,\big |\, x^{k,\delta } \big ] \Big \Vert _s = \delta \Vert d^{\delta }(x^{k,\delta })\Vert _s \lesssim \delta ( 1+\Vert x^{k,\delta }\Vert _s ). \end{aligned}$$Consequently, we have$$\begin{aligned} \mathbb {E}\Big [\Big (\Vert x^{k,\delta } \Vert _s^2\Big )^{n-1} \langle x^{k,\delta }, x^{k+1,\delta }-x^{k,\delta } \rangle _s \Big ]&\lesssim \mathbb {E}\Big [ \Vert x^{k,\delta } \Vert _s^{2(n-1)} \Vert x^{k,\delta } \Vert _s \Big ( \delta (1+\Vert x^{k,\delta }\Vert _s) \Big ) \Big ] \\&\lesssim \delta (1 + V^{k,\delta }). \end{aligned}$$This proves Eq. (52) in the case \((i,j,k) = (n-1,0,1)\).
 Let us suppose \((i,j,k) \not \in \Big \{ (n,0,0), (n-1,0,1) \Big \}\). Because for any integer \(p \ge 1\),$$\begin{aligned} \mathbb {E}_x\Big [ \Vert x^{k+1,\delta }-x^{k,\delta }\Vert _s^p \Big ]^{\frac{1}{p}} \lesssim \delta ^{\frac{1}{2}} (1+\Vert x\Vert _s) \end{aligned}$$it follows from the Cauchy–Schwarz inequality that$$\begin{aligned} \mathbb {E}\Big [\Big (\Vert x^{k,\delta } \Vert _s^2\Big )^i \Big ( \Vert x^{k+1,\delta }-x^{k,\delta }\Vert _s^2 \Big )^j \Big ( \langle x^{k,\delta }, x^{k+1,\delta }-x^{k,\delta } \rangle _s \Big )^k \Big ] \lesssim \delta ^{j + \frac{k}{2}} (1+ V^{k,\delta }). \end{aligned}$$Since we have supposed that \((i,j,k) \not \in \Big \{ (n,0,0), (n-1,0,1) \Big \}\) and \(i+j+k = n\), it follows that \(j + \frac{k}{2} \ge 1\). This concludes the proof of Eq. (52).
4.5 Invariance principle
Combining the noise estimates of Lemma 12 and the a priori bound of Lemma 9, we show that under Assumptions 1 the sequence of rescaled noise processes defined in Eq. (24) converges weakly to a Brownian motion. This is the content of Lemma 8 whose proof is now presented.
Proof
 Condition (53): since \(\mathbb {E}\Big [ \Vert {\varGamma }^{k,\delta }\Vert _s^2 \,\big |\, x^{k,\delta }\Big ] = \mathrm {Trace}_{\mathcal {H}^s}(D^{\delta }(x^{k,\delta }))\), Lemma 12 shows that$$\begin{aligned} \mathbb {E}\Big [ \Vert {\varGamma }^{k,\delta }\Vert _s^2 \,\big |\, x^{k,\delta }\Big ] = \mathrm {Trace}_{\mathcal {H}^s}(C_s) + \mathbf e _1^{\delta }(x^{k,\delta }) \end{aligned}$$where the error term \(\mathbf e _1^{\delta }\) satisfies \(| \mathbf e _1^{\delta }(x) | \lesssim \delta ^{\frac{1}{8}} (1+\Vert x\Vert _s^2)\). Consequently, to prove condition (53) it suffices to establish that$$\begin{aligned} \lim _{\delta \rightarrow 0} \; \mathbb {E}\Big [ \big | \delta \, \sum _{k \delta < T} \mathbf e _1^{\delta }(x^{k,\delta }) \big | \Big ] = 0 . \end{aligned}$$We have \(\mathbb {E}\big [ \big | \delta \, \sum _{k \delta < T} \mathbf e _1^{\delta }(x^{k,\delta }) \big | \big ] \lesssim \delta ^{\frac{1}{8}} \Big \{ \delta \cdot \mathbb {E}\Big [ \sum _{k \delta < T} (1+\Vert x^{k,\delta }\Vert _s^2) \Big ] \Big \}\) and the a priori bound presented in Lemma 9 shows that$$\begin{aligned} \sup _{\delta \in \left( 0,\frac{1}{2}\right) } \quad \Big \{ \delta \cdot \mathbb {E}\Big [ \sum _{k \delta < T} (1+\Vert x^{k,\delta }\Vert _s^2) \Big ] \Big \} < \infty . \end{aligned}$$Consequently \(\lim _{\delta \rightarrow 0} \; \mathbb {E}\big [ \big | \delta \, \sum _{k \delta < T} \mathbf e _1^{\delta }(x^{k,\delta }) \big | \big ] = 0\), and the conclusion follows.
 Condition (54): Lemma 12 states that$$\begin{aligned} \mathbb {E}_k \Big [ \langle {\varGamma }^{k,\delta }, \hat{\varphi }_i \rangle _s \langle {\varGamma }^{k,\delta }, \hat{\varphi }_j \rangle _s \Big ] = \langle \hat{\varphi }_i, C_s \hat{\varphi }_j \rangle _s + \mathbf e _2^{\delta }(x^{k,\delta }) \end{aligned}$$where the error term \(\mathbf e _2^{\delta }\) satisfies \(| \mathbf e _2^{\delta }(x) | \lesssim \delta ^{\frac{1}{8}} (1+\Vert x\Vert _s)\). The exact same approach as in the proof of Condition (53) gives the conclusion.
 Condition (55): from the Cauchy–Schwarz and Markov’s inequalities it follows that$$\begin{aligned} \mathbb {E}\Big [\Vert {\varGamma }^{k,\delta }\Vert _s^2 \; {1\!\!1}_{ \{\Vert {\varGamma }^{k,\delta } \Vert _s^2 \ge \delta ^{-1} \, \varepsilon \}} \Big ]&\le \mathbb {E}\Big [\Vert {\varGamma }^{k,\delta }\Vert _s^4 \Big ]^{\frac{1}{2}} \cdot \mathbb {P}\Big [ \Vert {\varGamma }^{k,\delta } \Vert _s^2 \ge \delta ^{-1} \, \varepsilon \Big ]^{\frac{1}{2}}\\&\le \mathbb {E}\Big [\Vert {\varGamma }^{k,\delta }\Vert _s^4 \Big ]^{\frac{1}{2}} \cdot \Big \{ \frac{\mathbb {E}\big [\Vert {\varGamma }^{k,\delta }\Vert _s^4 \big ]}{(\delta ^{-1} \, \varepsilon )^2}\Big \}^{\frac{1}{2}}\\&\le \frac{\delta }{\varepsilon } \cdot \mathbb {E}\Big [\Vert {\varGamma }^{k,\delta }\Vert _s^4 \Big ]. \end{aligned}$$Lemma 7 readily shows that \(\mathbb {E}\Vert {\varGamma }^{k,\delta }\Vert _s^4 \lesssim 1+\Vert x\Vert _s^4\). Consequently we have$$\begin{aligned} \mathbb {E}\Big [ \Big | \delta \, \sum _{k\delta < T} \mathbb {E}\Big [\Vert {\varGamma }^{k,\delta }\Vert _s^2 \; {1\!\!1}_{ \{\Vert {\varGamma }^{k,\delta } \Vert _s^2 \ge \delta ^{-1} \, \varepsilon \}} \,\big |\, x^{k,\delta } \Big ] \Big | \Big ] \le \frac{\delta }{\varepsilon } \times \Big \{ \delta \cdot \mathbb {E}\Big [ \sum _{k \delta < T} (1+\Vert x^{k,\delta }\Vert _s^4) \Big ] \Big \} \end{aligned}$$and the conclusion again follows from the a priori bound Lemma 9.
5 Quadratic variation
As discussed in the introduction, the SPDE (7), and the Metropolis–Hastings algorithm pCN which approximates it for small \(\delta \), do not satisfy the smoothing property and so almost sure properties of the limit measure \(\pi ^\tau \) are not necessarily seen at finite time. To illustrate this point, we introduce in this section a functional \(V:\mathcal {H}\rightarrow \mathbb {R}\) that is well defined on a dense subset of \(\mathcal {H}\) and such that \(V(X)\) is \(\pi ^{\tau }\)-almost surely well defined and satisfies \(\mathbb {P}\big ( V(X) = \tau \big ) = 1\) for \(X \overset{\mathcal {D}}{\sim }\pi ^{\tau }\). The quantity \(V\) corresponds to the usual quadratic variation if \(\pi _0\) is the Wiener measure. We show that the quadratic-variation-like quantity \(V(x^{k,\delta })\) of a pCN Markov chain converges as \(k \rightarrow \infty \) to the almost sure quantity \(\tau \). We then prove that the piecewise linear interpolation of this quantity solves, in the small \(\delta \) limit, a linear ODE (the “fluid limit”) whose globally attractive stable state is the almost sure quantity \(\tau \). This quantifies the manner in which the pCN method approaches statistical equilibrium.
5.1 Definition and properties
Lemma 13
Proof
5.2 Large \(k\) behaviour of quadratic variation for pCN
Proposition 14
Proof
Let us first show that the number of accepted moves is infinite. If this were not the case, the Markov chain would eventually reach a position \(x^{k,\delta } = x \in \mathcal {H}\) such that all subsequent proposals \(y^{k+l} = (1-2\delta )^{\frac{1}{2}} \, x^k + (2 \tau \delta )^{\frac{1}{2}} \, \xi ^{k+l}\) would be refused. This means that the i.i.d. Bernoulli random variables \(\gamma ^{k+l} = \hbox {Bernoulli} \big (\alpha ^{\delta }(x^k,y^{k+l}) \big )\) satisfy \(\gamma ^{k+l} = 0\) for all \(l \ge 0\). This can only happen with probability zero. Indeed, since \(\mathbb {P}[\gamma ^{k+l} = 1] > 0\), one can use the Borel–Cantelli lemma to show that almost surely there exists \(l \ge 0\) such that \(\gamma ^{k+l} = 1\). To conclude the proof of the Proposition, notice that the sequence \(\{ u_{k} \}_{k \ge 0}\) defined by \(u_{k+1}-u_k = -2 \delta (u_k - \tau )\) converges to \(\tau \).
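The convergence of the sequence \(\{u_k\}\) can be made explicit. The short calculation below is a sketch that assumes \(0 < \delta \le \frac{1}{2}\), a range consistent with the factor \((1-2\delta )^{\frac{1}{2}}\) in the proposal being real. It rewrites the recursion \(u_{k+1}-u_k = -2\delta (u_k-\tau )\) as a geometric contraction towards \(\tau \).

```latex
\begin{aligned}
u_{k+1} - \tau &= (1-2\delta)\,(u_k - \tau), \\
u_k - \tau &= (1-2\delta)^{k}\,(u_0 - \tau) \;\longrightarrow\; 0
\quad \text{as } k \rightarrow \infty, \text{ since } 0 \le 1-2\delta < 1.
\end{aligned}
```

In particular the distance to \(\tau \) shrinks by the factor \(1-2\delta \) at every accepted move, whatever the starting value \(u_0\).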
5.3 Fluid limit for quadratic variation of pCN
Theorem 15
(Fluid limit for quadratic variation) Let Assumptions 1 hold. Let the Markov chain \(x^{\delta }\) start at a fixed position \(x_* \in \mathcal {H}^s\). Assume that \(x_* \in \mathcal {H}\) possesses a finite quadratic variation, \(V(x_*) < \infty \). Then the function \(v^{\delta }(t)\) converges in probability in \(C([0,T], \mathbb {R})\), as \(\delta \) goes to zero, to the solution of the differential equation (58) with initial condition \(v_0 = V(x_*)\).
Lemma 16
Proof
The proof is given in Appendix 2.
We now complete the proof of Theorem 15 using the key Lemma 16.
Proof (of Theorem 15)
The proof consists in showing that the trajectory of the quadratic variation process behaves as if all the moves were accepted. The main ingredient is the uniform lower bound on the acceptance probability given by Lemma 16.
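The idea that the quadratic variation behaves as if every move were accepted can be illustrated with a small simulation. The sketch below rests on several assumptions that are not taken from the text above:

- the chain is run with \({\varPsi } \equiv 0\), so that every OU proposal is accepted;
- coordinates are whitened, so that \(C\) acts as the identity in dimension `n_dim`;
- \(V\) is replaced by the finite-dimensional stand-in \(V_N(z) = \frac{1}{N}\sum _j z_j^2\);
- the limiting ODE is taken to be \(\dot{v} = -2(v-\tau )\), the form suggested by the recursion in the proof of Proposition 14.

All function and parameter names are hypothetical.

```python
import math
import random

def simulate_quadratic_variation(v0, tau, delta, T, n_dim, seed=0):
    """Run the OU (pCN) proposal with every move accepted, in whitened coordinates.

    Returns the list of V_N values along the chain, where
    V_N(z) = (1/N) * sum_j z_j**2 is an assumed stand-in for V.
    """
    rng = random.Random(seed)
    a = math.sqrt(1.0 - 2.0 * delta)   # contraction factor of the proposal
    b = math.sqrt(2.0 * delta * tau)   # noise amplitude of the proposal
    z = [math.sqrt(v0)] * n_dim        # initial state with V_N(z) = v0
    n_steps = int(round(T / delta))
    path = [sum(c * c for c in z) / n_dim]
    for _ in range(n_steps):
        z = [a * c + b * rng.gauss(0.0, 1.0) for c in z]
        path.append(sum(c * c for c in z) / n_dim)
    return path

def fluid_limit(v0, tau, t):
    """Solution of the assumed ODE dv/dt = -2 (v - tau) with v(0) = v0."""
    return tau + (v0 - tau) * math.exp(-2.0 * t)

path = simulate_quadratic_variation(v0=4.0, tau=1.0, delta=0.01, T=1.0, n_dim=2000)
print("simulated V_N at time T:", path[-1])
print("assumed fluid limit at T:", fluid_limit(4.0, 1.0, 1.0))
```

For large `n_dim` and small `delta` the two printed values should be close, with a discrepancy that reflects both the time discretisation and the finite-dimensional fluctuations.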
6 Numerical results
7 Conclusion

Optimization. We have demonstrated a class of algorithms to minimize the functional \(J\) given by (1). Assumptions 1 encode the intuition that the quadratic part of \(J\) dominates. Under these assumptions we study the properties of an algorithm which requires only the evaluation of \({\varPsi }\) and the ability to draw samples from Gaussian measures with Cameron–Martin norm given by the quadratic part of \(J\). We demonstrate that, in a certain parameter limit, the algorithm behaves like a noisy gradient flow for the functional \(J\) and that, furthermore, the size of the noise can be controlled systematically. The advantage of constructing algorithms on Hilbert space is that they are robust to finite dimensional approximation. We turn to this point in the next bullet.
Statistics. The algorithm that we use is a Metropolis–Hastings method with an Ornstein–Uhlenbeck proposal, which we refer to here as pCN, as in [14]. The proposal takes the form, for \(\xi \sim \mathrm N (0,C)\),$$\begin{aligned} y=\bigl (1-2\delta \bigr )^{\frac{1}{2}}x+\sqrt{2\delta \tau }\xi \end{aligned}$$given in (5). The proposal is constructed in such a way that the algorithm is defined on infinite dimensional Hilbert space and may be viewed as a natural analogue of a random walk Metropolis–Hastings method for measures defined via density with respect to a Gaussian. It is instructive to contrast this with the standard random walk method SRWM with proposal$$\begin{aligned} y=x+\sqrt{2\delta \tau }\xi . \end{aligned}$$Although the proposal for SRWM differs only through a multiplicative factor in the systematic component, and thus implementation of either is practically identical, the SRWM method is not defined on infinite dimensional Hilbert space. This turns out to matter if we compare both methods when applied in \({\mathbb R}^N\) for \(N \gg 1\), as would occur if approximating a problem in infinite dimensional Hilbert space: in this setting the SRWM method requires the choice \(\delta =\mathcal{O}(N^{-1})\) to see the diffusion (SDE) limit [2] and so requires \(\mathcal{O}(N)\) steps to see \(\mathcal{O}(1)\) decrease in the objective function, or to draw independent samples from the target measure; in contrast the pCN produces a diffusion limit for \(\delta \rightarrow 0\) independently of \(N\) and so requires \(\mathcal{O}(1)\) steps to see \(\mathcal{O}(1)\) decrease in the objective function, or to draw independent samples from the target measure. Mathematically this last point is manifest in the fact that we may take the limit \(N \rightarrow \infty \) (and thus work on the infinite dimensional Hilbert space) followed by the limit \(\delta \rightarrow 0.\)
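The contrast between the two proposals can be seen in a toy experiment. The sketch below assumes the simplest possible setting: \({\varPsi } \equiv 0\), whitened coordinates in \({\mathbb R}^N\) (so that \(C\) is the identity), and a current state drawn from the Gaussian target \(\mathrm N(0,\tau I)\). In that setting the OU proposal is reversible with respect to the target and every pCN move is accepted, whereas the SRWM move must be corrected by the usual Metropolis ratio for a symmetric proposal. All function names and parameter values are hypothetical.

```python
import math
import random

def pcn_proposal(x, delta, tau, rng):
    """OU (pCN) proposal: y = sqrt(1 - 2*delta) * x + sqrt(2*delta*tau) * xi."""
    a = math.sqrt(1.0 - 2.0 * delta)
    b = math.sqrt(2.0 * delta * tau)
    return [a * c + b * rng.gauss(0.0, 1.0) for c in x]

def srwm_proposal(x, delta, tau, rng):
    """Standard random walk proposal: y = x + sqrt(2*delta*tau) * xi."""
    b = math.sqrt(2.0 * delta * tau)
    return [c + b * rng.gauss(0.0, 1.0) for c in x]

def srwm_accept_prob(x, y, tau):
    """Metropolis acceptance for a symmetric proposal targeting N(0, tau * I)."""
    log_ratio = -(sum(c * c for c in y) - sum(c * c for c in x)) / (2.0 * tau)
    return math.exp(min(0.0, log_ratio))

def mean_srwm_acceptance(n_dim, delta, tau, n_trials, seed=0):
    """Average SRWM acceptance probability when x is drawn from the target."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        x = [math.sqrt(tau) * rng.gauss(0.0, 1.0) for _ in range(n_dim)]
        y = srwm_proposal(x, delta, tau, rng)
        total += srwm_accept_prob(x, y, tau)
    return total / n_trials

# Same step size delta, two dimensions: SRWM acceptance collapses as N grows.
for n_dim in (10, 400):
    rate = mean_srwm_acceptance(n_dim, delta=0.05, tau=1.0, n_trials=300, seed=1)
    print(n_dim, rate)
```

With \(\delta \) held fixed, the SRWM acceptance rate drops sharply as \(N\) grows, which is consistent with the need to take \(\delta =\mathcal{O}(N^{-1})\) for that method. The pCN proposal leaves the Gaussian target invariant in every dimension, so no such rescaling is needed in this toy setting.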
Notes
Acknowledgments
The authors thank an anonymous referee for constructive comments. We are grateful to David Dunson for his comments on the implications of the theory and to Frank Pinski for helpful discussions concerning the behaviour of the quadratic variation; these discussions crystallized the need to prove Theorem 15. NSP gratefully acknowledges the NSF grant DMS 1107070. AMS is grateful to EPSRC and ERC for financial support. Part of this work was done when AHT was visiting the Department of Statistics at Harvard University. The authors thank the Department of Statistics, Harvard University, for its hospitality.
References
 1. Tierney, L.: A note on Metropolis–Hastings kernels for general state spaces. Ann. Appl. Probab. 8(1), 1–9 (1998)
 2. Mattingly, J., Pillai, N., Stuart, A.: SPDE limits of the random walk Metropolis algorithm in high dimensions. Ann. Appl. Prob. (2011)
 3. Pillai, N.S., Stuart, A.M., Thiery, A.H.: Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions. Ann. Appl. Prob. 22(6), 2320–2356 (2012). doi: 10.1214/11-AAP828
 4. Kirkpatrick, S., Gelatt, C.D., Jr., Vecchi, M.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
 5. Černý, V.: Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm. J. Optim. Theory Appl. 45(1), 41–51 (1985)
 6. Geman, D.: Bayesian image analysis by adaptive annealing. IEEE Trans. Geosci. Remote Sens. 1, 269–276 (1985)
 7. Geman, S., Hwang, C.: Diffusions for global optimization. SIAM J. Control Optim. 24, 1031 (1986)
 8. Chiang, T., Hwang, C., Sheu, S.: Diffusion for global optimization in \(\mathbb {R}^n\). SIAM J. Control Optim. 25(3), 737–753 (1987)
 9. Holley, R., Kusuoka, S., Stroock, D.: Asymptotics of the spectral gap with applications to the theory of simulated annealing. J. Funct. Anal. 83(2), 333–347 (1989)
 10. Bertsimas, D., Tsitsiklis, J.: Simulated annealing. Stat. Sci. 8(1), 10–15 (1993)
 11. Hinze, M., Pinnau, R., Ulbrich, M., Ulbrich, S.: Optimization with PDE Constraints. Springer, New York (2008)
 12. Da Prato, G., Zabczyk, J.: Ergodicity for Infinite Dimensional Systems. Cambridge University Press, Cambridge (1996)
 13. Beskos, A., Roberts, G., Stuart, A., Voss, J.: An MCMC method for diffusion bridges. Stoch. Dyn. 8(3), 319–350 (2008)
 14. Cotter, S., Roberts, G.O., Stuart, A., White, D.: MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science
 15. Hairer, M., Stuart, A.M., Voss, J.: Analysis of SPDEs arising in path sampling. Part II: the nonlinear case. Ann. Appl. Probab. 17(5–6), 1657–1706 (2007)
 16. Dashti, M., Law, K., Stuart, A., Voss, J.: MAP estimators and their consistency in Bayesian nonparametric inverse problems. Inverse Problems
 17. Da Prato, G., Zabczyk, J.: Stochastic Equations in Infinite Dimensions, Encyclopedia of Mathematics and Its Applications, vol. 44. Cambridge University Press, Cambridge (1992)
 18. Stuart, A.: Inverse problems: a Bayesian perspective. Acta Numer. 19, 451–559 (2010)
 19. Hairer, M., Stuart, A.M., Voss, J.: Signal processing problems on function space: Bayesian formulation, stochastic PDEs and effective MCMC methods. In: Crisan, D., Rozovsky, B. (eds.) The Oxford Handbook of Nonlinear Filtering (2010). To appear
 20. Cotter, S., Dashti, M., Stuart, A.: Variational data assimilation using targetted random walks. Int. J. Numer. Method Fluids (2011). doi: 10.1002/fld.2510
 21. Stroock, D.W., Varadhan, S.S.: Multidimensional Diffusion Processes, vol. 233. Springer, New York (1979)
 22. Ethier, S.N., Kurtz, T.G.: Markov Processes: Characterization and Convergence. Wiley Series in Probability and Mathematical Statistics. Wiley, New York (1986)
 23. Kupferman, R., Stuart, A., Terry, J., Tupper, P.: Long-term behaviour of large mechanical systems with random initial data. Stoch. Dyn. 2(04), 533–562 (2002)
 24. Berger, E.: Asymptotic behaviour of a class of stochastic approximation procedures. Probab. Theory Relat. Fields 71(4), 517–552 (1986)
 25. Henry, D.: Geometric Theory of Semilinear Parabolic Equations, vol. 61. Springer, New York (1981)
 26. Hairer, M., Stuart, A.M., Voss, J., Wiberg, P.: Analysis of SPDEs arising in path sampling. Part I: the Gaussian case. Comm. Math. Sci. 3, 587–603 (2005)
 27. Chorin, A., Hald, O.: Stochastic Tools in Mathematics and Science. Springer, New York (2006)