Solving high-dimensional Hamilton-Jacobi-Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space

Optimal control of diffusion processes is intimately connected to the problem of solving certain Hamilton-Jacobi-Bellman equations. Building on recent machine learning inspired approaches towards high-dimensional PDEs, we investigate the potential of $\textit{iterative diffusion optimisation}$ techniques, in particular considering applications in importance sampling and rare event simulation, and focusing on problems without diffusion control, with linearly controlled drift and running costs that depend quadratically on the control. More generally, our methods apply to nonlinear parabolic PDEs with a certain shift invariance. The choice of an appropriate loss function being a central element in the algorithmic design, we develop a principled framework based on divergences between path measures, encompassing various existing methods. Motivated by connections to forward-backward SDEs, we propose and study the novel $\textit{log-variance}$ divergence, showing favourable properties of corresponding Monte Carlo estimators. The promise of the developed approach is exemplified by a range of high-dimensional and metastable numerical examples.


Introduction
Hamilton-Jacobi-Bellman partial differential equations (HJB-PDEs) are of central importance in applied mathematics. Rooted in reformulations of classical mechanics [49] in the nineteenth century, they nowadays form the backbone of (stochastic) optimal control theory [89,123], having a profound impact on neighbouring fields such as optimal transportation [120,121], mean field games [20], backward stochastic differential equations (BSDEs) [19] and large deviations [42]. Applications in science and engineering abound; examples include stochastic filtering and data assimilation [87,104], the simulation of rare events in molecular dynamics [55,59,128], and nonconvex optimisation [24]. Many of these applications involve HJB-PDEs in high-dimensional or even infinite-dimensional state spaces, posing a formidable challenge for their numerical treatment and in particular rendering grid-based schemes infeasible. In recent years, approaches to approximating the solutions of high-dimensional elliptic and parabolic PDEs have been developed combining well-known Feynman-Kac formulae with machine learning methodologies, seeking scalability and robustness in high-dimensional and complex scenarios [36,54]. Crucially, the use of artificial neural networks offers the promise of accurate and efficient function approximation which in conjunction with Monte Carlo methods might beat the curse of dimensionality, as investigated in [6,25,53,67]. In this paper, we focus on HJB-PDEs that can be linked to controlled diffusions (see Section 2),

dX^u_s = (b(X^u_s, s) + σ(X^u_s, s) u(X^u_s, s)) ds + σ(X^u_s, s) dW_s,    (1)

where b and σ are coefficients derived from the model at hand, and u is to be thought of as an adaptable steering force to be chosen so as to minimise a given objective functional.
In terms of the problems and applications alluded to in the first paragraph, we are particularly interested in situations where applying a suitable control u improves certain properties of (1); often these are related to sampling efficiency, exploration of state space, or fit to empirical data. We have been particularly motivated by the prospect of directing recent advances in the methodology for solving high-dimensional HJB-PDEs towards the challenges of rare event simulation [17]. Our attention in this paper is restricted to a class of algorithms that may be termed iterative diffusion optimisation (IDO) techniques, related in spirit to reinforcement learning [100]. Speaking in broad terms, those are characterised by the following outline of steps, meant to be executed iteratively until convergence or until a satisfactory control u is found:
1. Simulate N independent realisations {(X^{u,(i)}_s)_{0≤s≤T}, i = 1, . . . , N} of the solution to (1).
2. Compute a performance measure and a corresponding gradient associated to the control u, based on {(X^{u,(i)}_s)_{0≤s≤T}, i = 1, . . . , N}.
3. Modify u according to the gradient obtained in the previous step. Repeat starting from 1.
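As an illustration, the three steps above can be sketched in a few lines of Python. The one-dimensional dynamics, cost functions and constant-control ansatz below are toy assumptions chosen for brevity; actual IDO implementations parametrise u by a neural network and use automatic differentiation rather than finite differences:

```python
import numpy as np

# Toy setup (all modelling choices are illustrative): controlled dynamics
# dX^u = (-X^u + theta) dt + dW on [0, 1], running cost ½|u|², terminal
# cost g(x) = (x - 1)², and a constant control u ≡ theta.
T, K, N = 1.0, 50, 2000      # horizon, time steps, batch size
dt = T / K

def loss(theta):
    """Steps 1-2 of the IDO outline: simulate N controlled trajectories and
    return the empirical control objective. A fixed seed gives common random
    numbers, so finite differences of this function are well behaved."""
    rng = np.random.default_rng(0)
    x = np.zeros(N)
    for _ in range(K):
        x = x + (-x + theta) * dt + rng.normal(scale=np.sqrt(dt), size=N)
    return 0.5 * theta**2 * T + np.mean((x - 1.0) ** 2)

# Step 3: descend a finite-difference gradient (a stand-in for the stochastic
# gradients used in practice), then repeat from step 1.
theta, lr, eps = 0.0, 0.2, 1e-2
for _ in range(30):
    grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
    theta -= lr * grad
print(round(theta, 2))  # a reasonable constant steering force toward x = 1
```

The loop converges to a constant control that balances the quadratic control penalty against the terminal cost; the sketch is only meant to make the simulate-evaluate-update structure concrete.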
Many algorithmic approaches from the literature can be placed in the IDO framework, in particular some that connect forward-backward SDEs and machine learning [36,54] as well as some that are rooted in molecular dynamics and optimal control [59,73,128]. Those instances of IDO mainly differ in terms of the performance measure employed in step 2, or, in other words, in terms of an underlying loss function L(u) constructed on the set of control vector fields. Typically, L(u) is given in terms of expectations involving the solution to (1). Consequently, step 1 can be thought of as providing an empirical estimate of this quantity (and its gradient) based on a sample of size N. For a principled design and understanding of IDO-like algorithms, it is central to analyse the properties of loss functions and corresponding Monte Carlo estimators, and identify guidelines that promise good performance. To be permissible, a loss function must attain its global minimum at the solution of the problem at hand. Moreover, suitable loss functions lend themselves to efficient optimisation procedures (step 3) such as stochastic gradient descent. In this respect, important desiderata are the absence of local minima as well as the availability of low-variance gradient estimators.
In this article, we show that a variety of loss functions can be constructed and analysed in terms of divergences between probability measures on the path space associated to solutions of (1), providing a unifying framework for IDO and extending on previous works in that direction [59,73,128]. As this perspective entails the approximation of a target probability measure as a core element, our approach exposes connections to the theory of variational inference [15,124]. Classical divergences include the relative entropy (or KL-divergence) and its counterpart, the cross-entropy. Motivated by connections to forward-backward SDEs and importance sampling, we propose the novel family of log-variance divergences,

D^{Var(log)}_P(P_1|P_2) = Var_P( log (dP_2/dP_1) ),    (2)

parametrised by a probability measure P. Loss functions based on these divergences can be viewed as modifications of those proposed in [36,54] for solving forward-backward SDEs, essentially replacing second moments by variances, see Section 3.2. Moreover, it turns out that the log-variance divergences are closely related to the KL-divergence (see Proposition 4.6), allowing us to draw (perhaps surprising) connections to methods that directly attempt to optimise the dynamics with respect to a control objective.
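To make definition (2) concrete, the following sketch estimates the log-variance divergence from samples in a setting where the Radon-Nikodym derivative is available in closed form: one-dimensional unit Gaussians standing in for path measures (an illustrative toy, not the path-space setting treated later):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_rn_derivative(x, m1, m2):
    """log(dP2/dP1)(x) for P_i = N(m_i, 1); in the path-space setting this role
    is played by the density supplied by Girsanov's theorem."""
    return 0.5 * ((x - m1) ** 2 - (x - m2) ** 2)

def log_variance_divergence(samples, m1, m2):
    """Monte Carlo estimate of D^{Var(log)}_P(P1|P2) = Var_P[log(dP2/dP1)]."""
    return np.var(log_rn_derivative(samples, m1, m2), ddof=1)

x = rng.normal(size=100_000)                     # samples from P = P1 = N(0, 1)
d_same = log_variance_divergence(x, 0.0, 0.0)    # vanishes iff P1 = P2
d_diff = log_variance_divergence(x, 0.0, 0.5)    # exact value is (m2 - m1)² = 0.25
print(d_same, d_diff)
```

Note that an additive constant in the log-Radon-Nikodym derivative, such as an unknown normalising constant, leaves the variance unchanged; this property is exploited repeatedly in the algorithms discussed in the paper.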
As the loss functions considered in this article are defined in terms of expected values, practical implementations require appropriate Monte Carlo estimators whose variance directly impacts algorithmic performance. We study the associated relative errors, in particular in high-dimensional settings and for P_1 ≈ P_2, i.e. close to the optimal control. The proposed log-variance divergence and its corresponding standard Monte Carlo estimator turn out to be robust in both settings, in a precise sense that will be developed in later sections. After the completion of this manuscript, the potential of the log-variance divergences for inference in computational Bayesian statistics has been explored in [105], along with a more careful analysis of their relations to control variates (see also Remark 4.7 below).

Our contributions and overview
The primary contributions of this article can be summarised as follows:
1. Building on earlier work connecting optimal control functionals and the KL-divergence [59,73,128], we develop the perspective of constructing loss functions via divergences on path space, offering a systematic approach to algorithmic design and analysis.
2. We show that modifications of recently proposed approaches based on forward-backward SDEs [36,54] can be placed within this framework. Indeed, the log-variance divergences (2) encapsulate a family of forward-backward SDE systems (see Section 3.2). The aforementioned adjustments needed to establish the path space perspective often lead to faster convergence and more accurate approximation of the optimal control, as we show by means of numerical experiments.
3. We show that certain instances of algorithms based on the control objective (or KL-divergence) and forward-backward SDEs (or the log-variance divergences) are equivalent when the sample size N in step 1 is large.
4. We investigate the properties of sample based gradient estimators associated to the losses and divergences under consideration. In particular, we define two notions of stability: robustness of a divergence under tensorisation (related to stability in high-dimensional settings) and robustness at the optimal control solution (related to stability of the final approximation). Of the losses and divergences considered in this article, we show that only the log-variance divergences satisfy both desiderata, and we illustrate our findings by means of extensive numerical experiments.
The paper is structured as follows. In Section 2 we provide a literature overview, stating connections between different perspectives on the control problem under consideration and summarising corresponding numerical treatments. As a unifying viewpoint, in Section 3 we define viable loss functions through divergences on path space and discuss their connections to the algorithmic approaches encountered in Section 2. In particular, we elucidate the relationships of the log-variance divergences with forward-backward SDEs. In the two subsequent sections we analyse properties of the suggested losses: in Section 4 we obtain equivalence relations that hold in an infinite batch size limit, and in Section 5 we investigate the variances of the Monte Carlo estimators associated to the losses. In the latter case, we consider stability close to the optimal control solution as well as in high-dimensional settings. In Section 6 we provide numerical examples that illustrate our findings. Finally, we conclude the paper with Section 7, giving an outlook to future research. Most of the proofs are deferred to the appendix.
Problem formulations

In this section we will introduce three different perspectives on essentially the same problem. Throughout, we will assume a fixed filtered probability space (Ω, F, (F_t)_{t≥0}, Θ) satisfying the 'usual conditions' [77, Section 21.4] and consider stochastic differential equations (SDEs) of the form

dX_s = b(X_s, s) ds + σ(X_s, s) dW_s,  X_t = x_init,    (3)

where b : R^d × [t, T] → R^d denotes the drift coefficient, σ : R^d × [t, T] → R^{d×d} denotes the diffusion coefficient, (W_s)_{t≤s≤T} denotes standard d-dimensional Brownian motion, and x_init ∈ R^d is the (deterministic) initial condition. We will work under the following conditions specifying the regularity of b and σ.
Assumption 1 (Coefficients of the SDE (3)). The coefficients b and σ are continuously differentiable, σ has bounded first-order spatial derivatives, and b grows at most linearly in x, i.e. there exists a constant C > 0 such that

|b(x, s)| ≤ C(1 + |x|),  for all (x, s) ∈ R^d × [t, T].    (4a)

Let us furthermore introduce a modified version of (3), where we think of u : R^d × [t, T] → R^d as a control term steering the dynamics,

dX^u_s = (b(X^u_s, s) + σ(X^u_s, s) u(X^u_s, s)) ds + σ(X^u_s, s) dW_s,  X^u_t = x_init.    (5)

We will throughout assume that u ∈ U, the set of admissible controls. For definiteness, we will set

U := { u ∈ C¹(R^d × [t, T]; R^d) : u grows at most linearly in x, in the sense of (4a) },    (6)

but note that the smoothness and boundedness assumptions can be relaxed in various scenarios. Under Assumption 1 and with U as defined in (6), the SDEs (3) and (5) admit unique strong solutions.

Optimal control
Consider the cost functional

J(u; x_init, t) = E[ ∫_t^T ( f(X^u_s, s) + ½ |u(X^u_s, s)|² ) ds + g(X^u_T) ],    (7)

where f ∈ C¹(R^d × [0, T]; R) specifies a part of the running costs and g ∈ C¹(R^d; R) the terminal costs, and (X^u_s)_{t≤s≤T} denotes the unique strong solution to the controlled SDE (5) with initial condition X^u_t = x_init. Throughout we assume that f and g are such that the expectation in (7) is finite, for all (x_init, t) ∈ R^d × [0, T]. Our objective is to find a control u ∈ U that minimises (7):

Problem 2.1 (Optimal control). Find u* ∈ arg min_{u∈U} J(u; x_init, t).    (8)

Defining the value function [45, Section I.4], or 'optimal cost-to-go',

V(x, t) = inf_{u∈U} J(u; x, t),    (9)

it is well-known that under suitable conditions, V satisfies a Hamilton-Jacobi-Bellman PDE involving the infinitesimal generator [96, Section 2.3] L = ½ ∑_{i,j} (σσ^⊤)_{ij} ∂²_{x_i x_j} + b · ∇ associated to the uncontrolled SDE (3). The optimal control solving (8) can then be recovered from u* = −σ^⊤∇V (see Theorem 2.2 for details). Let us state this reformulation of Problem 2.1 as follows:

Problem 2.2 (Hamilton-Jacobi-Bellman PDE). Find a solution V to the PDE

(∂_t + L)V − ½ |σ^⊤∇V|² + f = 0,  V(·, T) = g,    (11)

where f and g are as in (7).
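The formula u* = −σ^⊤∇V can be read off from the pointwise minimisation underlying the HJB equation; the following standard completion of the square (reproduced here as a reminder) shows how the quadratic running cost ½|u|² produces both the minimiser and the nonlinear term in (11):

```latex
\min_{a \in \mathbb{R}^d} \Big\{ a \cdot \sigma^\top \nabla V + \tfrac{1}{2}\,|a|^2 \Big\}
  = \min_{a \in \mathbb{R}^d} \Big\{ \tfrac{1}{2}\,\big|a + \sigma^\top \nabla V\big|^2
      - \tfrac{1}{2}\,\big|\sigma^\top \nabla V\big|^2 \Big\}
  = -\tfrac{1}{2}\,\big|\sigma^\top \nabla V\big|^2,
\qquad \text{attained at } a^* = -\sigma^\top \nabla V .
```

Inserting this minimal value into the dynamic programming equation ∂_t V + LV + min_a { a·σ^⊤∇V + ½|a|² } + f = 0 yields precisely the HJB-PDE (11).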
Throughout, we will focus on solutions to (11) that admit bounded and continuous derivatives of up to first order in time and second order in space (see, however, Remark 2.4). This set will be denoted by C^{2,1}_b(R^d × [0, T]; R). Solutions to elliptic and parabolic PDEs admit probabilistic representations by means of the celebrated Feynman-Kac formulae [99, Sections 1.3.3 and 6.3]. To wit, consider the following coupled system of forward-backward SDEs (in the following FBSDEs for short):

dX_s = b(X_s, s) ds + σ(X_s, s) dW_s,  X_0 = x_init,    (12a)
dY_s = ( ½ |Z_s|² − f(X_s, s) ) ds + Z_s · dW_s,  Y_T = g(X_T),    (12b)

where the terminal condition Y_T = g(X_T) is required to hold almost surely.

Problem 2.3 (FBSDEs). Find continuous adapted processes (Y_s)_{0≤s≤T} and (Z_s)_{0≤s≤T} satisfying (12b), where (X_s)_{0≤s≤T} is the unique strong solution to (12a).
Under suitable conditions, Itô's formula implies that Y is connected to the value function V as defined in (9) via Y_s = V(X_s, s). Similarly, Z is connected to the optimal control u* through Z_s = −u*(X_s, s) = σ^⊤∇V(X_s, s). See [94,95] and Theorem 2.2 for details.

Conditioning and rare events
One major motivation for our work is the problem of sampling rare transition events in diffusion models. In this section we will explain how this challenge can be formalised in terms of weighted measures on path space, leading to a close connection to the optimal control problems encountered in the previous section. We will fix the initial time to be t = 0, i.e. consider the SDEs (3) and (5) on the interval [0, T]. For fixed initial condition x_init ∈ R^d, let us introduce the path space C := { X ∈ C([0, T]; R^d) : X_0 = x_init }, equipped with the supremum norm and the corresponding Borel-σ-algebra, and denote the set of probability measures on C by P(C). The SDEs (3) and (5) induce probability measures on C defined to be the laws associated to the corresponding strong solutions; those measures will be denoted by P and P^u, respectively¹. Furthermore, we define the work functional W : C → R via

W(X) = ∫_0^T f(X_s, s) ds + g(X_T),    (14)

where f and g are as in (7), and introduce the reweighted path measure Q ∈ P(C),

dQ/dP = (1/Z) exp(−W),  Z = E[exp(−W(X))],    (15)

assuming f and g are such that Z is finite (we shall tacitly make this assumption from now on). We may ask whether Q can be obtained as the path measure related to a controlled SDE of the form (5):

Problem 2.4 (Conditioning). Find u* ∈ U such that the path measure P^{u*} associated to (5) coincides with Q.
Referring to the above as a conditioning problem is justified by the fact that (15) may be viewed as an instance of Bayes' formula relating conditional probabilities [104]. This connection can be formalised using Doob's h-transform [33,34] and applied to diffusion bridges and quasistationary distributions, for instance (see [26] and references therein).
Example 2.1 (Rare event simulation). Let us consider SDEs of the form (3), where the drift is a gradient, i.e. b = −∇Ψ, and the potential Ψ is of multimodal type. As an example we shall discuss the one-dimensional case d = 1 and assume that Ψ ∈ C^∞(R) is given by

Ψ(x) = κ(x² − 1)²,

with κ > 0. Furthermore, let us fix the initial conditions x_init = −1 and t = 0, and assume a constant diffusion coefficient of size unity, σ = 1. Observe that Ψ exhibits two local minima at x = ±1, separated by a barrier at x = 0, the height of which is modulated by the parameter κ (see Figure 8 in Section 6.4 for an illustration). When κ is sufficiently large, the dynamics induced by (3) exhibits metastable behaviour: transitions between the two basins happen very rarely as the transition time depends exponentially on the height of the barrier [11,80]. Applications such as molecular dynamics are often concerned with statistics and derived quantities from these rare events as those are typically directly linked to biological functioning [37,109,110]. At the same time, computational approaches face a difficult sampling problem as transitions are hard to obtain by direct simulation from (3). Choosing f = 0 and g such that e^{−g} is concentrated around x = 1 (consider, for instance, g(x) = ν(x − 1)² with ν > 0 sufficiently large), we see that Q as defined in (15) predominantly charges paths that start in x = −1 at t = 0 and enter a neighbourhood of x = 1 at final time T. Problem 2.4 can then be understood as the task of finding a control u that allows efficient simulation of transition paths. Similar issues arise in the context of stochastic filtering, where the objective is to sample paths that are compatible with available data [104].
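A brief simulation illustrates the metastability described above. The concrete double-well Ψ(x) = κ(x² − 1)² used below is an illustrative choice consistent with the description (minima at x = ±1, barrier of height κ at x = 0); all other numbers are toy parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

def transition_fraction(kappa, n_paths=1000, T=5.0, dt=2e-3):
    """Simulate dX = -Psi'(X) dt + dW from x_init = -1 via Euler-Maruyama and
    report the fraction of paths that cross into the right well (x > 0.5)
    before time T. Psi(x) = kappa * (x^2 - 1)^2 is an assumed double well."""
    n_steps = int(T / dt)
    x = np.full(n_paths, -1.0)
    crossed = np.zeros(n_paths, dtype=bool)
    for _ in range(n_steps):
        drift = -4.0 * kappa * x * (x ** 2 - 1.0)   # -Psi'(x)
        x = x + drift * dt + rng.normal(scale=np.sqrt(dt), size=n_paths)
        crossed |= x > 0.5
    return crossed.mean()

f_low = transition_fraction(kappa=1.0)    # low barrier: transitions fairly common
f_high = transition_fraction(kappa=5.0)   # high barrier: transitions exponentially rare
print(f_low, f_high)
```

Raising κ from 1 to 5 collapses the observed transition fraction, which is exactly the sampling difficulty that an appropriate control u is meant to remove.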

Sampling problems
The free energy [58] associated to the dynamics (3) and the work functional (14) is given by

F = −log Z,

¹ Of course, we have that P^0 coincides with the path measure associated to the uncontrolled dynamics, i.e. P^0 = P.
where the normalising constant Z has been defined in (15). The problem of computing Z is ubiquitous in nonequilibrium thermodynamics and statistics [15,113], and, quite often, the variance associated to the random variable exp(−W(X)) is so large as to render direct estimation of the expectation E[exp(−W(X))] computationally infeasible. A natural approach is then to use the identity

Z = E[ exp(−W(X^u)) (dP/dP^u)(X^u) ],    (18)

where we recall that X and X^u refer to the strong solutions to (3) and (5), respectively, and dP/dP^u denotes the Radon-Nikodym derivative, explicitly given by Girsanov's theorem [118, Theorem 2.1.1] as

(dP/dP^u)(X^u) = exp( −∫_0^T u(X^u_s, s) · dW_s − ½ ∫_0^T |u(X^u_s, s)|² ds ),

see the proof of Theorem 2.2. As explained in [58], techniques leveraging (18) may be thought of as instances of importance sampling on path space. Given that (18) holds for all u ∈ U, it is clearly desirable to choose the control such as to guarantee favourable statistical properties:

Problem 2.5 (Importance sampling). Find u* ∈ U that minimises the variance of the estimator (18), i.e. that minimises

Var( exp(−W(X^u)) (dP/dP^u)(X^u) ).    (20)

Under suitable conditions, it turns out that there exists u* ∈ U such that the variance expression (20) is in fact zero (see Theorem 2.2, (1d)), providing a perfect sampling scheme.
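The identity (18) can be probed numerically. The following sketch (with b = 0, σ = 1, f = 0, g(x) = 5(x − 2)² and a constant control — all toy assumptions) accumulates the discretised Girsanov log-weight along each controlled trajectory and compares the vanilla and the controlled estimators of Z:

```python
import numpy as np

rng = np.random.default_rng(3)

T, K, N = 1.0, 100, 100_000
dt = T / K

def estimate_Z(c):
    """Estimate Z = E[exp(-W(X^u)) dP/dP^u (X^u)] for the constant control
    u = c, with W(X) = g(X_T), g(x) = 5 (x - 2)^2, and dX = c dt + dW."""
    x = np.zeros(N)
    log_weight = np.zeros(N)                      # log dP/dP^u along the path
    for _ in range(K):
        dw = rng.normal(scale=np.sqrt(dt), size=N)
        x += c * dt + dw                          # controlled Euler-Maruyama step
        log_weight += -c * dw - 0.5 * c**2 * dt   # discretised Girsanov exponent
    samples = np.exp(-5.0 * (x - 2.0) ** 2 + log_weight)
    return samples.mean(), samples.std() / np.sqrt(N)

z0, err0 = estimate_Z(0.0)   # vanilla Monte Carlo, u = 0
z1, err1 = estimate_Z(1.8)   # drift steering paths toward the mode of exp(-g)
print(z0, z1, err0, err1)    # consistent estimates; the steered error is smaller
```

Both estimators are unbiased for the same Z (here Z ≈ 0.049, computable in closed form for this Gaussian toy model), but steering the paths toward the region where exp(−g) is large reduces the standard error, in line with Problem 2.5.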
The problem formulations detailed so far are intimately connected, as summarised by the following theorem.

Theorem 2.2 (Connections and equivalences). The following holds:
1. Let V ∈ C^{2,1}_b(R^d × [0, T]; R) be a solution to Problem 2.2, i.e. solve the HJB-PDE (11). Set

u* = −σ^⊤∇V.    (21)

Then
(a) the control u* provides a solution to Problem 2.1, i.e. u* minimises the objective (7);
(b) the processes defined by

(Y_s, Z_s) = ( V(X_s, s), σ^⊤∇V(X_s, s) ),  0 ≤ s ≤ T,    (22)

provide a solution to Problem 2.3, i.e. satisfy the FBSDE system (12);
(c) the path measure P^{u*} associated to (5) provides a solution to Problem 2.4, i.e. P^{u*} = Q;
(d) the control u* provides a solution to Problem 2.5, i.e. the random variable

exp(−W(X^{u*})) (dP/dP^{u*})(X^{u*})

is almost surely constant.
Furthermore, we have that

V(x_init, 0) = J(u*; x_init, 0) = Y_0 = −log Z.    (24)

2. Conversely, let u* ∈ U solve Problem 2.4, i.e. assume that P^{u*} coincides with Q. Then the statement (1d) holds. Furthermore, setting

Z_s = −u*(X_s, s),  0 ≤ s ≤ T,    (25)

solves the backward SDE (12b) from Problem 2.3, i.e. (25) together with the first equation in (12b) determines a process (Y_s)_{0≤s≤T} that satisfies the final condition Y_T = g(X_T), almost surely.

Remark 2.3. We extend the connections between the optimal control formulation (Problem 2.1) and FBSDEs (Problem 2.3) in Proposition 4.3, see also Remark 4.4.
Remark 2.4 (Regularity, uniqueness, and further connections). Going beyond classical solvability of the HJB-PDE (11) and introducing the notion of viscosity solutions [45,94], the strong regularity and boundedness assumptions on V in the first statement could be much relaxed and the connections exposed in Theorem 2.2 could be extended [99,123]. As a case in point, we note that in the current setting, neither a solution to Problem 2.1 nor to Problem 2.3 necessarily provides a classical solution to the PDE (11), as optimal controls are known to be non-differentiable, in general. However, assuming classical well-posedness of the HJB-PDE (11), Theorem 2.2 implies that the solution can be found by addressing one of the Problems 2.1, 2.3, 2.4 or 2.5 and using the formulas (21) and (22), as long as those problems admit unique solutions, in an appropriate sense. For the latter issue, we refer the reader to [79] and [115,Chapter 11] in the context of forward-backward SDEs and to [14] in the context of measures on path space. We note that, in particular, the forward SDE (12a) can be thought of as providing a random grid for the solution of the HJB-PDE (11), obtained through the backward SDE (12b).
Remark 2.5 (Random initial conditions). The equivalence between Problems 2.2 and 2.3 shows that u * does not depend on x init . Consequently, the initial condition in (12a) can be random rather than deterministic. In Section 6.3 we demonstrate potential benefits of this extension for FBSDE-based algorithms.
Remark 2.6 (Variational formulas and duality). The identities (24) connect key quantities pertaining to the problem formulations 2.1, 2.2, 2.3 and 2.4. The fact that J(u * ; x init , 0) = − log Z can moreover be understood in terms of the Donsker-Varadhan formula [16], furnishing an explicit expression for the value function, as discussed in [29,30,58].
Remark 2.7 (More general nonlinear PDEs). The approach developed in this paper extends to nonlinear parabolic PDEs of the form

(∂_t + L)V + h(x, s, V, σ^⊤∇V) = 0,  V(·, T) = g,    (27)

with h being a function satisfying appropriate regularity and boundedness assumptions. As in Theorem 2.2 (1b), the nonlinear parabolic PDE (27) is related to a generalisation of the forward-backward system (12), where the connection is still given by (22), see [99, Section 6.3]. From the perspective of optimal control (Problem 2.1), it is possible to extend the discussion to SDEs of the form

dX^u_s = b(X^u_s, u_s, s) ds + σ(X^u_s, s) dW_s    (28)

replacing (5), and to running costs f(X^u_s, u_s, s) instead of f(X^u_s, s) + ½|u(X^u_s, s)|² in (7), assuming that u_s ∈ U ⊂ R^m, for some m ∈ N. This setting gives rise to more general HJB-PDEs of the form

∂_t V + H(x, s, ∇V, ∇²V) = 0,  V(·, T) = g,    (30)

where ∇²V denotes the Hessian of V, and the Hamiltonian H is given by

H(x, s, p, M) = inf_{u∈U} ( b(x, u, s) · p + ½ Tr( σσ^⊤(x, s) M ) + f(x, u, s) ),    (31)

see [45,99]. In certain scenarios [125, Section 4.5.2], it is then possible to relate (30) to (27), noting however that typically h will be given in terms of a minimisation problem as in (31). The relationship to Problems 2.4 and 2.5 as well as the identity (21) rest on the particular structure inherent in (5) and (7), enabling the use of Girsanov's theorem (see the proof of Theorem 2.2 below). The methods developed in this paper based on the log-variance loss (46) can straightforwardly be extended to equations of the form (27) in the case when h depends on V only through ∇V, owing to the invariance of the PDE under shifts of the form V → V + const., see Remark 3.12. In order to address optimal control problems involving additional minimisation tasks posed by Hamiltonians such as (31), it might be feasible to include appropriate penalty terms in the loss functional. We leave this direction for future work.
Proof of Theorem 2.2. The statement (1a) is a classical result in stochastic optimal control theory, often referred to as a verification theorem, and can for instance be found in [45, Theorem IV.4.4]; the statement (1b) then follows by applying Itô's formula to V(X_s, s) and invoking the PDE (11). Before proceeding to (1c), we note that the first equality in (24) now follows from (9) (for background, see [45, Section IV.2]), while the second equality is a direct consequence of (1b). Using (12) and (1b), the third equality follows from

−log Z = −log E[exp(−W(X))] = Y_0 − log E[ exp( −∫_0^T Z_s · dW_s − ½ ∫_0^T |Z_s|² ds ) ] = Y_0,

relying on the facts that Y_0 is deterministic (again using (1b)), and that the term inside the second expectation is a martingale (as u* is assumed to be bounded). Turning to (1c), let us define an equivalent measure Θ̃ on (Ω, F) via

dΘ̃/dΘ = exp( ∫_0^T u*(X_s, s) · dW_s − ½ ∫_0^T |u*(X_s, s)|² ds ).    (33)

Since u* is assumed to be bounded, Novikov's condition is satisfied, and hence Girsanov's theorem asserts that the process (W̃_t)_{0≤t≤T} defined by

W̃_t = W_t − ∫_0^t u*(X_s, s) ds    (35)

is a Brownian motion with respect to Θ̃. Consequently, the law of X under Θ̃ coincides with P^{u*}, and we have that

(dP^{u*}/dP)(X) = exp( −∫_0^T Z_s · dW_s − ½ ∫_0^T |Z_s|² ds ) = exp( V(x_init, 0) − W(X) ) = (1/Z) exp(−W(X)),

i.e. P^{u*} = Q, using (12) and (24) in the last step. We note that similar arguments can be found in [75], [20, Section 3.3.1]. For the proof of (1d) we refer to [58, Theorem 2]. The proof of the second statement is very similar to the argument presented for (1c), resting primarily on (33) and (35), and is therefore omitted.

Algorithms and previous work
The numerical treatment of optimal control problems has been an active area of research for many decades, and multiple perspectives on solving Problem 2.1 have been developed. The monographs [13] and [82] provide good overviews of policy iteration and Q-learning, strategies that have been further investigated in the machine learning literature and that are generally subsumed under the term reinforcement learning [100]. We also recommend [72] as an introduction to the specific setting considered in this paper. To cope with the key issue of high dimensionality, the authors of [92] suggest solving a certain type of control problem in the framework of hierarchical tensor products. Another strategy for dealing with the curse of dimensionality is to first apply a model reduction technique and only then solve for the reduced model. Here, balanced truncation methods for controlled linear S(P)DEs have for instance been suggested in [10], and approaches for systems with a slow-fast scale separation via the homogenisation method can be found in [127].
In contrast, the recently introduced Multilevel Picard method [66] based on a combination of the Feynman-Kac and Bismut-Elworthy-Li formulas has been proven to beat the curse of dimensionality in a variety of settings, see [7,65,68,69,70].
The FBSDE formulation (Problem 2.3) has opened the door for Monte Carlo based methods that have been developed since the early 90s. We mention in particular least-squares Monte Carlo, where (Z s ) 0≤s≤T is approximated iteratively backwards in time by solving a regression problem in each time step, along the lines of the dynamic programming principle [99,Chapter 3]. A good introduction can be found in [46]; for extensive analysis on numerical errors we refer the reader to [47,126]. Recently, this approach has also been connected with deep learning, replacing Galerkin approximations by neural networks [64], as well as with the tensor train format, exploiting inherent low rank structures [106].
Another method leveraging the FBSDE perspective has been put forward in [36,54] and further developed in [4,5].
Here, the main idea is to enforce the terminal condition Y_T = g(X_T) in (12b) by iteratively minimising the loss function

L(y_0, u) = E[ |Y_T(y_0, u) − g(X_T)|² ]    (36)

using a stochastic gradient descent IDO scheme. The notation Y_T(y_0, u) indicates that the process in (12b) is to be simulated with given initial condition y_0 and control u (these representing a priori guesses or current approximations, typically relying on neural networks), hence viewing (12b) as a forward process. Consequently, the approach thus described can be classified as a shooting method for boundary value problems. We note that this idea allows treating rather general parabolic and elliptic PDEs [52,67], as well as, with some modifications, optimal stopping problems [8,9], going beyond the setting considered in this paper. Using neural network approximations in conjunction with FBSDE-based Monte Carlo techniques holds the promise of alleviating the curse of dimensionality; understanding this phenomenon and proving rigorous mathematical statements has been the focus of intense current research [12,52,53,67,71]. Let us also mention that similar algorithms have been suggested in [101,102], in particular proposing to modify the loss function (36) in order to encode the backward dynamics (12b), and extensive investigation of optimal network design and choice of tunable parameters has been carried out [23]. Furthermore, we refer to [21,22] for convergence results in the broader context of mean field control. In [56, Section III.B] it has been proposed to modify the forward dynamics (12a) (and, to compensate, also the backward dynamics (12b)) by an additional control term. This idea is central for the main results of this paper, see Section 3.2. Similar ideas for other types of PDEs have been proposed as well, see for instance [39,102].
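The shooting idea can be made concrete in a toy setting. With b = 0, σ = 1, f = 0 and g(x) = x (illustrative choices, for which Y_s = X_s − (T − s)/2 together with Z ≡ 1 solves the FBSDE exactly), the loss (36) vanishes precisely at the correct initial value y_0 = x_init − T/2:

```python
import numpy as np

rng = np.random.default_rng(4)

T, K, N, x_init = 1.0, 100, 20_000, 0.0
dt = T / K

def shooting_loss(y0, z_fn):
    """Empirical version of L(y0, u) = E|Y_T(y0, u) - g(X_T)|^2 with g(x) = x:
    the backward SDE dY = ½|Z|² ds + Z dW is simulated *forward* from the
    guess y0, viewing the backward equation as a forward process."""
    x = np.full(N, x_init)
    y = np.full(N, float(y0))
    for k in range(K):
        dw = rng.normal(scale=np.sqrt(dt), size=N)
        z = z_fn(x, k * dt)
        y = y + 0.5 * z**2 * dt + z * dw    # forward-simulated backward process
        x = x + dw                          # forward SDE dX = dW
    return np.mean((y - x) ** 2)            # terminal mismatch |Y_T - g(X_T)|²

z_true = lambda x, s: np.ones_like(x)       # exact Z-process for this toy problem
l_opt = shooting_loss(x_init - T / 2, z_true)
l_bad = shooting_loss(x_init, z_true)
print(l_opt, l_bad)   # zero at the correct y0; an offset in y0 appears squared
```

In practice y_0 and the Z-process are of course unknown and are parametrised by neural networks, with (36) minimised by stochastic gradient descent; the sketch only isolates the shooting structure of the loss.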
Conditioned diffusions (Problem 2.4) have been considered in a large deviation context [35] as well as in a variational setting [56,58] motivated by free energy computations, building on earlier work in [16,30], see also [3,26,29,43]. The simulation of diffusion bridges has been studied in [86] and conditioning via Doob's h-transform has been employed in a sequential Monte Carlo context [61]. The formulation in Problem 2.4 identifies the target measure Q, motivating approaches that seek to minimise certain divergences on path space. This perspective will be developed in detail in Section 3.1, building bridges to Problems 2.1, 2.2, 2.3 and 2.5. Prior work following this direction includes [14,50,59,73,103], in particular relying on a connection between the KL-divergence (or relative entropy) on path space and the cost functional (7), see also Proposition 3.5. A similar line of reasoning leads to the cross-entropy method [58,74,108,128], see Proposition 3.7 and equation (62) in Section 3.3.
Problem 2.5 motivates minimising the variance of importance sampling estimators. We refer the reader to [88, Section 5.2] for a recent attempt based on neural networks, to [2] for a theoretical analysis of convergence rates, to [57] for potential non-robustness issues, and to [18] for a general overview regarding adaptive importance sampling techniques. The relationship between optimal control and importance sampling (see Theorem 2.2) has been exploited by various authors to construct efficient samplers [74,114], in particular also with a view towards the sampling based estimation of hitting times, in which case optimal controls are governed by elliptic rather than parabolic PDEs [55,56,59,60]. Similar sampling problems have been addressed in the context of sequential Monte Carlo [31,61] and generative models [116,117]. The latter works examine the potential of the controlled SDE (5) as a sampling device targeting a suitable distribution of the final state X u T .

Approximating probability measures on path space
In this section we demonstrate that many of the algorithmic approaches encountered in the previous section can be recovered as minimisation procedures of certain divergences between probability measures on path space. Similar perspectives (mostly discussing the relative entropy and cross-entropy in Definition 3.1 below) can be found in the literature, see [59,73,128]. Recall from Section 2.2 that we denote by C the space of R^d-valued paths on the time interval [0, T] with fixed initial point x_init ∈ R^d. As before, the probability measures on C induced by (3) and (5) will be denoted by P and P^u, respectively. From now on, let us assume that there exists a unique optimal control with convenient regularity properties:

Assumption 2 (Existence and regularity of the optimal control). The HJB-PDE (11) admits a unique classical solution V ∈ C^{2,1}_b(R^d × [0, T]; R), and the associated control u* := −σ^⊤∇V belongs to the set of admissible controls U.

For Assumption 2 to be satisfied, it is sufficient to impose the regularity and boundedness conditions b, σ, f ∈ C^∞_b(R^d × [0, T]) and g ∈ C^∞_b(R^d). The strong boundedness assumption on V could be weakened and for instance be replaced by the condition σ^⊤∇V ∈ U. For existence and uniqueness results involving unbounded controls we refer to [44], and for specific examples to Sections 6.2 and 6.3. In the sense made precise in Theorem 2.2, the control u* defined above provides solutions to the Problems 2.1-2.5 considered in Section 2. Moreover, there exists a corresponding optimal path measure Q (in the following also called the target measure) defined in (15) and satisfying Q = P^{u*}. We further note that Assumption 2 together with the results from [115, Chapter 11] imply that the solution to the FBSDE (12) is unique.

Divergences and loss functions
The SDE (5) establishes a measurable map U ∋ u ↦ P^u ∈ P(C) that can be made explicit in terms of Radon-Nikodym derivatives using Girsanov's theorem (see Lemma A.1 in Appendix A.1). Consequently, we can elevate divergences between path measures to loss functions on vector fields. To wit, let D : P(C) × P(C) → R_{≥0} ∪ {+∞} be a divergence, i.e. a functional satisfying D(P_1|P_2) ≥ 0, with equality if and only if P_1 = P_2. Then, setting

L_D(u) := D(P^u | Q),

we immediately see that L_D ≥ 0, with Theorem 2.2 implying that L_D(u) = 0 if and only if u = u*. Consequently, an approximation of the optimal control vector field u* can in principle be found by minimising the loss L_D. In the remainder of the paper, we will suggest possible losses and study some of their properties.
Starting with the KL-divergence, we introduce the relative entropy loss and the cross-entropy loss, corresponding to the two possible orders of arguments in KL(·|·).

Definition 3.1 (Relative entropy and cross-entropy losses). The relative entropy loss is given by

L_RE(u) := KL(P^u | Q),   (40)

and the cross-entropy loss by

L_CE(u) := KL(Q | P^u),   (41)

where the target measure Q has been defined in (15).

Remark 3.2 (Notation). Note that, by definition, the expectations in (40) and (41) are understood as integrals on C, i.e. E_{P^u}[F] = ∫_C F dP^u for measurable F : C → R, and similarly for Q. In contrast, the expectation operator E (without subscript, as used in (7) and (18), for instance) throughout denotes integrals on the underlying abstract probability space (Ω, F, (F_t)_{t≥0}, Θ).
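The asymmetry between the two KL-based losses can be illustrated in a toy example, in which discrete distributions stand in for the path measures P^u and Q (an illustrative simplification only; neither the arrays nor the function name correspond to anything in the paper):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p | q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Toy stand-ins for the controlled path measure P^u and the target Q.
p_u = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

loss_re = kl(p_u, q)   # relative entropy loss direction, KL(P^u | Q)
loss_ce = kl(q, p_u)   # cross-entropy loss direction,    KL(Q | P^u)

# Both quantities vanish if and only if P^u = Q, but for P^u != Q they
# weight the discrepancies differently, reflecting the asymmetry of KL.
```

Minimising either direction identifies the same optimum P^u = Q, but, as discussed below, the two directions lead to estimators with very different statistical properties.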
For P̂ ∈ P(C), it is straightforward to verify that

D^Var_P̂(P_1 | P_2) := Var_P̂( dP_2/dP_1 )   (43)

and

D^Var(log)_P̂(P_1 | P_2) := Var_P̂( log dP_2/dP_1 )   (44)

define divergences on the set of probability measures equivalent to P̂. Henceforth, these quantities shall be called variance divergence and log-variance divergence, respectively.

Remark 3.3. Setting P̂ = P_1, the quantity D^Var_{P_1}(P_1 | P_2) coincides with the Pearson χ²-divergence [32,84] measuring the importance sampling relative error [2,57], hence relating to Problem 2.5. The divergence D^Var(log)_P̂ seems to be new; it is motivated by its connections to the forward-backward SDE formulation of optimal control (see Problem 2.3), as will be explained in Section 3.2. Let us already mention that inserting the log in (43) to obtain (44) has the potential benefit of making sample based estimation more robust in high dimensions (see Section 5.2). Furthermore, we point the reader to Proposition 4.3 revealing close connections between D^Var(log)_P̂ and the relative entropy.
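A minimal Monte Carlo sketch of the variance and log-variance divergences, with one-dimensional Gaussians playing the role of path measures: we sample from the first measure and take the variance of the likelihood ratio between target and sampling measure (respectively of its logarithm). All names and parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# P1 = N(0, 1) plays the role of the sampling/auxiliary measure,
# P2 = N(mu, 1) the role of the target; w stands for the likelihood
# ratio dP2/dP1 evaluated at the samples.
mu = 0.5
x = rng.normal(0.0, 1.0, size=100_000)   # samples from P1
log_w = mu * x - 0.5 * mu**2             # log(dP2/dP1) for these Gaussians
w = np.exp(log_w)

d_var = np.var(w)          # Monte Carlo estimate of Var_{P1}(dP2/dP1)
d_var_log = np.var(log_w)  # Monte Carlo estimate of Var_{P1}(log dP2/dP1)

# Closed forms for this Gaussian pair: Var_{P1}(dP2/dP1) = exp(mu^2) - 1
# (the chi^2-divergence), while Var_{P1}(log dP2/dP1) = mu^2.
```

Already in this scalar setting the variance divergence involves exponential moments of the log-likelihood ratio, foreshadowing the high-dimensional instabilities analysed in Section 5.2.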
Using (43) and (44) with P̂ = P^v, we obtain two additional families of losses, indexed by v ∈ U:

Definition 3.4 (Variance and log-variance losses). For v ∈ U, the variance loss is given by

L_{Var_v}(u) := D^Var_{P^v}(P^u | Q),   (45)

and the log-variance loss by

L_{log Var_v}(u) := D^Var(log)_{P^v}(P^u | Q).   (46)

By direct computations invoking Girsanov's theorem, the losses defined above admit explicit representations in terms of solutions to SDEs of the form (3) and (5). Crucially, the propositions that follow replace the expectations on C used in the definitions (40), (41), (43) and (44) by expectations on Ω that are more amenable to direct probabilistic interpretation and Monte Carlo simulation (see also Remark 3.2). Recall that the target measure Q is assumed to be of the type (15), where W has been defined in (14). We start with the relative entropy loss:

Proposition 3.5 (Relative entropy loss). For u ∈ U, let (X^u_s)_{0≤s≤T} denote the unique strong solution to (5). Then

L_RE(u) = E[ ∫_0^T ( f(X^u_s, s) + ½ |u(X^u_s, s)|² ) ds + g(X^u_T) ] + log Z.   (47)

Proof. See [59,73]. For the reader's convenience, we provide a self-contained proof in Appendix A.1.
Remark 3.6. Up to the constant log Z, the loss L_RE coincides with the cost functional (7) associated to the optimal control formulation in Problem 2.1. The approach of minimising the KL-divergence between P^u and Q as defined in (40) is thus directly linked to the perspective outlined in Section 2.1. We refer to [59,73] for further details.

The cross-entropy loss admits a family of representations, indexed by v ∈ U:

Proposition 3.7 (Cross-entropy loss). For v ∈ U, let (X^v_s)_{0≤s≤T} denote the unique strong solution to (5), with u replaced by v. Then there exists a constant C ∈ R (not depending on u) such that the representation (48) holds for all u ∈ U, under suitable integrability conditions.⁷

⁷ These integrability conditions can readily be checked using the formulas provided in Proposition 3.10 below.
Proof. See [128] or Appendix A.1 for a self-contained proof.
Remark 3.8. The appearance of the exponential term in (48b) can be traced back to the reweighting of the forward measure,⁸ recalling that P^v denotes the path measure associated to (5) controlled by v. While the choice of v evidently does not affect the loss function, judicious tuning may have a significant impact on the numerical performance by means of altering the statistical error of the associated estimators (see Section 3.3). We note that the expression (47) for the relative entropy loss can similarly be augmented by an additional control v ∈ U. However, Proposition 5.7 in Section 5.2 discourages this approach, and our numerical experiments using a reweighting for the relative entropy loss have not been promising. In general, we feel that exponential terms of the form appearing in (48b) often have a detrimental effect on the variance of estimators, which should also be compared to the analysis in [57]. Therefore, an important feature of both the relative entropy loss and the log-variance loss (see Proposition 3.10) seems to be that expectations can be taken with respect to controlled processes (X^v_s)_{0≤s≤T} without incurring exponential factors as in (48b).
For v = 0, the expectations in Proposition 3.7 are taken with respect to (X_s)_{0≤s≤T} solving the uncontrolled SDE (3). The quadratic dependence of L_CE on u has been exploited in [128] to construct efficient Galerkin-type approximations of u*.
Finally, we derive corresponding representations for the variance and log-variance losses:

Proposition 3.10 (Variance and log-variance losses). For v ∈ U, let (X^v_s)_{0≤s≤T} denote the unique strong solution to (5), with u replaced by v. Then the variance loss L_{Var_v} and the log-variance loss L_{log Var_v} admit the explicit representations (52) and (53), respectively, for all u ∈ U.
Setting v = u in (52) recovers the importance sampling objective in (18), i.e. the variance divergence D^Var_{P^u} encodes the formulation from Problem 2.5. See also [57,88].
Remark 3.11. While different choices of v merely lead to distinct representations for the cross-entropy loss L_CE according to Proposition 3.7 and Remark 3.8, the variance losses L_{Var_v} and L_{log Var_v} do indeed depend on v. However, the property L_{Var_v}(u) = 0 ⇐⇒ u = u* (and similarly for L_{log Var_v}) holds for all v ∈ U, by construction.

FBSDEs and the log-variance loss
As it turns out, the log-variance loss L_{log Var_v} as computed in (53) is intimately connected to the FBSDE formulation in Problem 2.3 (and we already used the notation Y^{u,v}_T in hindsight). Indeed, setting v = 0 in Proposition 3.10 and writing Y^{u,0}_0 = y_0 for some (at this point, arbitrary) constant y_0 ∈ R, we recover the forward SDE (12a) from (3) and the backward SDE (12b) from (51), in conjunction with the optimality condition L_{log Var_v}(u) = 0, using also the identification u*(X_s, s) =: −Z_s suggested by (22). For arbitrary v ∈ U, we similarly obtain the generalised FBSDE system (55). In this sense, the divergence D^Var(log)_{P^v} encodes (55). Let us again insist on the fact that by construction the solution (Y_s, Z_s)_{0≤s≤T} to (55) does not depend on the choice of v. When u*(X_s, s) = −Z_s is approximated in an iterative manner (see Section 6.1), the choice v = u is natural, as it amounts to applying the currently obtained estimate for the optimal control to the forward process (55a). In this context, the system (55) was put forward in [56, Section III.B]. The bearings of appropriate choices for v will be further discussed in Section 5.

It is instructive to compare the expression (54) for the log-variance loss to the 'moment loss' (57) suggested in [36,54] in the context of solving more general nonlinear parabolic PDEs⁹. More generally, for v ∈ U we can define a moment loss L_{moment_v}(u; y_0) as a counterpart to the expression (53). Note that, unlike the losses considered so far, the moment losses depend on the additional parameter y_0 ∈ R, which has implications in numerical implementations. Also, these losses do not admit a straightforward interpretation in terms of divergences between path measures. As we show in Proposition 4.6, algorithms based on L_{moment_v} are in fact equivalent to their counterparts based on L_{log Var_v} in the limit of infinite batch size when y_0 is chosen optimally or when the forward process is controlled in a certain way.
We already anticipate that optimising an additional parameter y 0 can slow down convergence towards the solution u * considerably (see Section 6).
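The relationship between moment-type and variance-type objectives can already be seen at the level of scalar random variables: minimising the second moment E[(Y − y_0)²] over the additional parameter y_0 yields the variance, with minimiser y_0 = E[Y]. A small numerical sketch of this elementary fact (all names and the distribution of Y are illustrative assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(2.0, 0.7, size=50_000)   # stand-in for samples of a terminal quantity Y

def moment_objective(y0):
    """Empirical second moment around y0, mimicking a moment-type loss."""
    return np.mean((y - y0) ** 2)

# Scanning y0 shows that the minimum sits at the empirical mean, where the
# objective equals the empirical variance -- the variance-type counterpart.
y0_grid = np.linspace(0.0, 4.0, 401)
values = np.array([moment_objective(y0) for y0 in y0_grid])
y0_best = y0_grid[np.argmin(values)]
```

This is the mechanism behind the equivalence alluded to above: with y_0 optimally chosen, the additional parameter can be eliminated, while a badly initialised y_0 inflates the objective by the offset (y_0 − E[Y])².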
Remark 3.12. Reversing the argument, the log-variance loss can be obtained from (57) by replacing the second moment by the variance and using the translation invariance (54) to remove the dependence on y_0. The fact that this procedure leads to a viable loss function (i.e. satisfying L(u) = 0 ⇐⇒ u = u*) can be traced back to the fact that the Hamilton-Jacobi PDE (11a) is itself translation invariant (i.e. it remains unchanged under the transformation V → V + const). Following this argument, the log-variance loss can be applied to solving more general PDEs of the form (27) in the case when h depends on V only through ∇V. Furthermore, our interpretation in terms of divergences between probability measures on path space remains valid, at least in the case when σ is constant (in the following we let σ = I_{d×d} for simplicity)¹⁰. Indeed, denoting as before the path measure associated to (28a) by P, defining the target Q via dQ/dP ∝ e^{−g}, and introducing the neural network approximation u ≈ −σ^⊤∇V, the backward SDE (28b) induces a u-dependent path measure P^u, assuming that the right-hand side is P-integrable. Using Z ≈ −u in (28b) and denoting the corresponding process by Y^u, we then obtain an implementable loss function (60), with straightforward modifications to (28) when P is replaced by P^v, see (55). Note, however, that in general the vector field u does not lend itself to a straightforward interpretation in terms of a control problem. The PDEs treated in [36,54] do not possess the shift-invariance property (that is, h depends on V), and thus the vanishing of (60) does not characterise the solution to the PDE (27a) uniquely (not even up to additive constants).

⁹ We have employed the notation Y^{u,0}_T(y_0) in order to stress the dependence on y_0 through (56).
¹⁰ For more general diffusion coefficients, we can make similar arguments considering measures on the path space associated to (W_t)_{t≥0}, however departing slightly from the set-up in this paper.
Uniqueness may be restored by including appropriate terms in (60) enforcing the terminal condition (27b). Theoretical and numerical properties of such extensions may be fruitful directions for future work.

Algorithmic outline and empirical estimators
In order to motivate the theoretical analysis in the following sections, let us give a brief overview of algorithmic implementations based on the loss functions developed so far. We refer to Section 6.1 for a more detailed account. Recall that by the construction outlined in Section 3.1, the solution u* as defined in (37) is characterised as the global minimum of L, where L represents a generic loss function. Assuming a parametrisation R^p ∋ θ → u_θ (derived from, for instance, a Galerkin truncation or a neural network), we apply gradient-descent type methods to the function θ → L(u_θ), relying on the explicit expressions obtained in Propositions 3.5, 3.7 and 3.10. It is an important aspect that those expressions involve expectations that need to be estimated on the basis of ensemble averages. To approximate the loss L_RE, for instance, we use the estimator

L^(N)_RE(u) = (1/N) Σ_{i=1}^N [ ∫_0^T ( f(X^{u,(i)}_s, s) + ½ |u(X^{u,(i)}_s, s)|² ) ds + g(X^{u,(i)}_T) ],   (61)

where (X^{u,(i)}_s)_{0≤s≤T}, i = 1, …, N, denote independent realisations of the solution to (5). The estimators for the cross-entropy, variance, log-variance and moment losses are obtained analogously, replacing the expectations (and, for the variance-type losses, the variances) in the corresponding representations by their empirical counterparts; we denote the log-variance estimator by L^{log(N)}_{Var_v}, see (64). In the previous displays, the overline denotes an empirical mean, for example \overline{g(X^v_T)} = (1/N) Σ_{i=1}^N g(X^{v,(i)}_T), and (W^{(i)}_t)_{t≥0}, i = 1, …, N, denote independent Brownian motions associated to (X^{u,(i)}_t)_{t≥0}. By the law of large numbers, the convergence L^(N)(u) → L(u) holds almost surely up to additive and multiplicative constants¹¹, but as we show in Section 6, the fluctuations for finite N play a crucial role for the overall performance of the method. The variance associated to empirical estimators will hence be analysed in Section 5.

Remark 3.13. The estimators introduced in this section are standard, and more elaborate constructions, for instance involving control variates [107, Section 4.4.2], can be considered to reduce the variance. We leave this direction for future work. It is noteworthy, however, that the log-variance estimator (64) appears to act as a control variate in a natural way, see Propositions 4.3 and 4.6 and Remark 4.7.
We provide a sketch of the algorithmic procedure in Algorithm 1. Clearly, choosing different loss functions (and corresponding estimators) at every gradient step as indicated leads to viable algorithms. In particular, we have in mind the option of adjusting the forward control v ∈ U using the current approximation u_θ. More precisely, denoting by u^(j)_θ the approximation at the j-th step, it is reasonable to set v = u^(j)_θ in the subsequent step. In the remainder of this paper, we will focus on this strategy for updating v, leaving differing schemes for future work.
Algorithm 1: Approximation of u*
Choose a parametrisation R^p ∋ θ → u_θ and initialise the parameter vector θ ∈ R^p.
Choose an optimisation method (for instance gradient descent), a batch size N ∈ N and a learning rate η > 0.
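As a self-contained illustration of the overall procedure in Algorithm 1 (a sketch under simplifying assumptions, not the authors' implementation), the following runs gradient descent on the relative entropy loss for a hypothetical one-dimensional toy problem with b = 0, σ = 1, f = 0 and linear terminal cost g(x) = x, using the constant control ansatz u_θ ≡ θ; for this choice the optimal control is the constant u* = −1 (cf. the x-independent optimal controls in Section 6.2):

```python
import numpy as np

rng = np.random.default_rng(2)

T, K, N = 1.0, 50, 1000          # horizon, number of time steps, batch size
dt = T / K
g = lambda x: x                  # linear terminal cost
theta = 0.0                      # constant control ansatz u_theta(x, t) = theta

for step in range(200):
    # Euler-Maruyama simulation of dX = theta dt + dW (b = 0, sigma = 1)
    x = np.zeros(N)
    grad = np.zeros(N)
    for _ in range(K):
        xi = rng.normal(size=N)
        grad += theta * dt                   # derivative of running cost (1/2)|u|^2
        x += theta * dt + np.sqrt(dt) * xi
    grad += T                                # pathwise derivative g'(X_T) * dX_T/dtheta
    theta -= 0.1 * np.mean(grad)             # plain gradient descent step

# Evaluate the (shifted) relative entropy loss at the trained control.
x = np.zeros(N)
cost = 0.0
for _ in range(K):
    cost += 0.5 * theta**2 * dt
    x += theta * dt + np.sqrt(dt) * rng.normal(size=N)
loss_re = cost + np.mean(g(x))   # estimator of L_RE up to the constant log Z
```

At the optimum theta = −1 the (shifted) loss equals 1/2 − 1 = −1/2, which the estimator should reproduce up to Monte Carlo error.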

Equivalence properties in the limit of infinite batch size
In this section we will analyse some of the properties of the losses defined in Section 3.1, not taking into account the approximation by ensemble averages described in Section 3.3. In other words, the results in this section are expected to be valid when the batch size N used to compute the estimators L^(N) is sufficiently large. The derivatives relevant for the gradient-descent type methodology described in Section 3.3 are of the form

(∂/∂θ_i) L(u_θ) = (δ/δu) L(u_θ; φ_i),  φ_i := ∂u_θ/∂θ_i,   (67)

where (δ/δu) L(u; φ) denotes the Gâteaux derivative in direction φ. We recall its definition [112, Section 5.2]:

Definition 4.1 (Gâteaux derivative). We say that L is Gâteaux differentiable at u ∈ U in the direction φ if the real-valued function ε → L(u + εφ) is differentiable at ε = 0. In this case we define the Gâteaux derivative in direction φ to be (δ/δu) L(u; φ) := (d/dε) L(u + εφ)|_{ε=0}.

Remark 4.2. The functions φ_i defined in (67) depend on the chosen parametrisation for u. In the case when a Galerkin truncation is used, u_θ = Σ_i θ_i α_i, these coincide with the chosen ansatz functions (i.e. φ_i = α_i). Concerning neural networks, the family (φ_i)_i reflects the choice of the architecture, the function φ_i encoding the response to a change in the i-th weight. For convenience, we will throughout work under the assumption (implicit in Definition 4.1) that the functions φ_i are bounded, noting however that this could be relaxed with additional technical effort. Furthermore, note that Definition 4.1 extends straightforwardly to the estimator versions L^(N).
The following result shows that algorithms based on ½ L_{log Var_v} and L_RE behave equivalently in the limit of infinite batch size, provided that the update rule v = u for the log-variance loss is applied (see the discussion towards the end of Section 3.3) and that all other things are equal, for instance in terms of network architecture and choice of optimiser. Furthermore, we provide an analytical expression for the gradient for future reference.

Proposition 4.3 (Equivalence of log-variance loss and relative entropy loss). Let u, v ∈ U and φ ∈ C¹_b(R^d × [0, T]; R^d). Then L_{log Var_v} and L_RE are Gâteaux-differentiable at u in direction φ; furthermore,

(δ/δu) L_{log Var_v}(u; φ)|_{v=u} = 2 (δ/δu) L_RE(u; φ),

with an explicit expression for the right-hand side given in (69).

Remark 4.4. Proposition 4.3 extends the connection between the cost functional (7) and the FBSDE formulation (12) exposed in Theorem 2.2. Indeed, the Problems 2.1 and 2.3 do not only agree on identifying the solution u*; it is also the case that the gradients of the corresponding loss functions agree for u = u*. Moreover, it is instructive to compare the expressions (47) and (53) (or their sample based variants (61) and (64)). Namely, computing the derivatives associated to the relative entropy loss entails differentiating both the SDE solution X^u as well as f and g, determining the running and terminal costs. Perhaps surprisingly, the latter is not necessary for obtaining the derivatives of the log-variance loss, opening the door for gradient-free implementations.
Proof of Proposition 4.3. We present a heuristic argument based on the perspective introduced in Section 3.1 and refer to Appendix A.2 for a rigorous proof. For fixed P ∈ P(C), let us consider perturbations P + εU, where U is a signed measure with U(C) = 0. Assuming sufficient regularity, we then expect

(d/dε) D_RE(P + εU | Q)|_{ε=0} = ∫_C dU + ∫_C log(dP/dQ) dU,   (70)

where the first term on the right-hand side vanishes because of U(C) = 0. Likewise,

(d/dε) D^Var(log)_P̂(P + εU | Q)|_{ε=0} = 2 ( ∫_C log(dP/dQ) (dP̂/dP) dU − E_P̂[log(dP/dQ)] ∫_C (dP̂/dP) dU ).   (71)

For P̂ = P, the second term in (71) vanishes (again, because of U(C) = 0), and hence (71) agrees with (70) up to a factor of 2.
Remark 4.5 (Local minima). It is interesting to note that (71) can be expressed as

2 ∫_C (dP̂/dP) ( log(dP/dQ) − E_P̂[log(dP/dQ)] ) dU.   (72)

In particular, the derivative is zero for all U with U(C) = 0 if and only if P = Q. In other words, we expect the loss landscape associated to losses based on the log-variance divergence to be free of local minima where the optimisation procedure could get stuck. A more refined analysis concerning the relative entropy loss can be found in [83].
In the following proposition, we gather results concerning the moment loss L_{moment_v} defined in (57). The first statement is analogous to Proposition 4.3 and shows that L_{moment_v} and L_{log Var_v} are equivalent in the infinite batch size limit, provided that the update strategy v = u is employed. The second statement deals with the alternative v ≠ u. In this case, y_0 = − log Z (i.e. finding the optimal y_0 according to Theorem 2.2) is necessary for L_{moment_v} to identify the correct u*. Consequently, approximation of the optimal control will be inaccurate unless the parameter y_0 is determined without error.

Interpreting the additive shift as a control variate with coefficient β for the gradient expression (69), the optimal value for β in terms of variance reduction is given by the usual covariance-to-variance ratio [107, Section 4.4.2], which splits into a φ-independent (i.e. shared across network weights) and a φ-dependent (i.e. weight-specific) term. The φ-independent term is reproduced in expectation by the log-variance estimator. Numerical evidence suggests that the φ-dependent term is often small and fluctuates around zero, but implementations that include this contribution (based on Monte Carlo estimates) hold the promise of further variance reductions. We note, however, that determining a control variate for every weight carries a significant computational overhead and that Monte Carlo errors need to be taken into account. Finally, if y_0 in the moment loss differs greatly from −E[ g(X^u_T) − Y^{u,u}_T ], we expect the corresponding variance to be large, hindering algorithmic performance. In our follow-up paper [105], we have provided a more detailed analysis of the connections between the log-variance divergences and variance reduction techniques in the context of computational Bayesian inference.
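The generic control variate mechanism invoked above can be sketched as follows: for samples of a quantity F and a correlated, mean-zero control variate H, the variance-optimal coefficient is β* = Cov(F, H)/Var(H), and subtracting β*H preserves the mean while shrinking the variance. A minimal sketch with hypothetical F and H (not the paper's estimators):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 200_000
h = rng.normal(size=n)                           # control variate samples, E[H] = 0
f = 2.0 + 1.5 * h + 0.3 * rng.normal(size=n)     # correlated quantity of interest

beta = np.cov(f, h)[0, 1] / np.var(h)            # beta* = Cov(F, H) / Var(H)
controlled = f - beta * h                        # variance-reduced samples

var_plain = np.var(f)
var_controlled = np.var(controlled)
# The controlled estimator retains the mean of F but has smaller variance.
```

In this synthetic example beta* recovers the true regression coefficient 1.5, and the residual variance is that of the noise term alone.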

Finite sample properties and the variance of estimators
In this section we investigate properties of the sample versions of the losses as outlined in Section 3.3 and, in particular, study their variances and relative errors. We will highlight two different types of robustness, both of which prove significant for convergence speed and stability concerning practical implementations of Algorithm 1, see the numerical experiments in Section 6.

Robustness at the solution u *
By construction, the optimal control solution u* represents the global minimum of all considered losses. Consequently, the associated directional derivatives vanish at u*, i.e. (δ/δu) L(u*; φ) = 0 for all φ ∈ C¹_b(R^d × [0, T]; R^d).
A natural question is whether similar statements can be made with respect to the corresponding Monte Carlo estimators. We make the following definition.

Definition 5.1 (Robustness at u*). We call an estimator L^(N) robust at u* if its Gâteaux derivatives vanish at u* almost surely, i.e. if (δ/δu) L^(N)(u*; φ) = 0 holds Θ-almost surely for all directions φ ∈ C¹_b(R^d × [0, T]; R^d) and all N ∈ N.
Remark 5.2. Robustness at the solution u* implies that fluctuations in the gradient due to Monte Carlo errors are suppressed close to u*, facilitating accurate approximation. Conversely, if robustness at u* does not hold, then the relative error (i.e. the Monte Carlo error relative to the size of the gradients (67)) grows without bounds near u*, potentially incurring instabilities of the gradient-descent type scheme. We refer to Figure 12 and the corresponding discussion for an illustration of this phenomenon.

Proposition 5.3 (Robustness at u*).
1. For all v ∈ U, the log-variance estimator L^{log(N)}_{Var_v} is robust at u*.
2. For all v ∈ U, the moment estimator L^(N)_{moment_v}(·, y_0) is robust at u* if and only if y_0 = − log Z.

3. The relative entropy estimator L^(N)_RE is robust at u* if and only if the condition (79) involving ∇u* is satisfied, where (A_s)_{0≤s≤T} denotes the unique strong solution to an associated linear SDE.
4. For all v ∈ U, the cross-entropy estimator L^(N)_{CE,v} is not robust at u*.

Remark 5.4. The fact that robustness of the moment estimator at u* requires y_0 = − log Z might lead to instabilities in practice, as this relation is rarely satisfied exactly. Note that the variance of the relative entropy estimator at u* depends on ∇u*. We thus expect instabilities in metastable settings, where this quantity is often fairly large. For numerical confirmation, see Figure 12 and the related discussion.
Proof. For illustration, we show the robustness of the log-variance estimator L^{log(N)}_{Var_v}; the remaining proofs are deferred to Appendix A.3. By a straightforward calculation (essentially equivalent to (119) in Appendix A.1), the Gâteaux derivative of the empirical variance in (64) can be computed explicitly at u = u*. The claim then follows from observing that the quantity inside the empirical variance, evaluated at u = u*, is almost surely constant (i.e. does not depend on the sample index i), according to the second equation in (55b).
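The mechanism in this proof can be mimicked in a toy Gaussian setting: at the optimal 'control', the sampled log-Radon-Nikodym derivatives are constant across realisations, so the empirical variance (and hence any derivative thereof) vanishes identically, whatever the batch size. A sketch under these illustrative assumptions (one-dimensional Gaussians standing in for path measures):

```python
import numpy as np

rng = np.random.default_rng(4)

def log_rn_derivative(x, mu):
    """log(dQ/dP^u) for P^u = N(mu, 1) and target Q = N(mu_star, 1)."""
    mu_star = 1.0
    return (mu_star - mu) * x + 0.5 * (mu**2 - mu_star**2)

x_opt = rng.normal(1.0, 1.0, size=16)   # samples from the 'optimally controlled' measure
x_sub = rng.normal(0.0, 1.0, size=16)   # samples from a suboptimal measure

var_at_opt = np.var(log_rn_derivative(x_opt, 1.0))   # exactly zero, sample by sample
var_subopt = np.var(log_rn_derivative(x_sub, 0.0))   # fluctuates for finite N
```

Note that var_at_opt is zero not merely in expectation but realisation by realisation, which is precisely the content of robustness at u*.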

Stability in high dimensions -robustness under tensorisation
In this section we study the robustness of the proposed algorithms in high-dimensional settings. As a motivation, consider the case when the drift and diffusion coefficients in the uncontrolled SDE (3) split into separate contributions along different dimensions, for x = (x_1, …, x_d) ∈ R^d, and analogously for the running and terminal costs f and g as well as for the control vector field u. It is then straightforward to show that the path measure P^u associated to the controlled SDE (5) and the target measure Q defined in (15) factorise,

P^u = ⊗_{i=1}^d P^u_i,  Q = ⊗_{i=1}^d Q_i.   (85)

From the perspective of statistical physics, (85) corresponds to the scenario where d non-interacting systems are considered simultaneously. To study the case when d grows large, we leverage the perspective put forward in Section 3.1, recalling that D(P|Q) denotes a generic divergence. In what follows, we will denote corresponding estimators based on a sample of size N by D^(N)(P|Q), and study the quantity

r^(N)(P|Q) := √Var(D^(N)(P|Q)) / D(P|Q),

measuring the relative statistical error when estimating D(P|Q) from samples, noting that r^(N)(P|Q) = O(N^{−1/2}). As r^(N) is clearly linked to algorithmic performance and stability, we are interested in divergences, corresponding loss functions and estimators whose relative error remains controlled when the number of independent factors in (85) increases:

Definition 5.5 (Robustness under tensorisation). We call the divergence D (together with a corresponding estimator D^(N)) robust under tensorisation if there exists a constant C > 0 such that r^(N)( ⊗_{i=1}^M P_i | ⊗_{i=1}^M Q_i ) ≤ C for all M ∈ N. Here, P_i and Q_i represent identical copies of P and Q, respectively. Clearly, if P and Q are measures on C([0, T], R), then M coincides with the dimension of the combined problem.
Remark 5.6. The variance and log-variance divergences defined in (43) and (44) depend on an auxiliary measure P̂. Definition 5.5 extends straightforwardly by considering the product measures ⊗_{i=1}^M P̂_i. In a similar vein, the relative entropy and cross-entropy divergences admit estimators based on samples X^(j) ∼ P̂ from a further probability measure P̂, motivated by the identities

D_RE(P|Q) = E_P̂[ log(dP/dQ) (dP/dP̂) ]  and  D_CE(P|Q) = E_P̂[ log(dQ/dP) (dQ/dP̂) ].

We refer to Remark 3.8 for a similar discussion.
Proposition 5.7. We have the following robustness and non-robustness properties:

1. The log-variance divergence D^Var(log)_P̂, approximated using the standard Monte Carlo estimator, is robust under tensorisation, for all P̂ ∈ P(C).

3. The variance divergence D^Var_P̂ is not robust under tensorisation when approximated using the standard Monte Carlo estimator. More precisely, if dQ/dP is not P-almost surely constant, then, for fixed N ∈ N, there exist constants a > 0 and C > 1 such that

r^(N)( ⊗_{i=1}^M P_i | ⊗_{i=1}^M Q_i ) ≥ a C^M   (89)

for all M ≥ 1.

4. The cross-entropy divergence D_CE, estimated using the auxiliary measure P̂ as in Remark 5.6, is not robust under tensorisation. More precisely, for fixed N ∈ N there exists a constant a > 0 such that

r^(N)( ⊗_{i=1}^M P_i | ⊗_{i=1}^M Q_i ) ≥ a ( χ²(Q | P̂) + 1 )^{M/2}   (90)

for all M ≥ 1. Here χ²(Q | P̂) := E_P̂[ (dQ/dP̂)² ] − 1 denotes the χ²-divergence between Q and P̂.
Remark 5.8. Proposition 5.7 suggests that the variance and cross-entropy losses perform poorly in high-dimensional settings, as the relative errors (89) and (90) scale exponentially in M. Numerical support can be found in Section 6. We note that in practical scenarios we have P̂ ≠ Q, as it is not feasible to sample from the target, and hence χ²(Q | P̂) + 1 > 1.
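The contrast described in Proposition 5.7 can be probed numerically in a toy product-Gaussian setting, with i.i.d. Gaussian factors standing in for path measures (all names and parameter values below are illustrative assumptions): the relative error of the Monte Carlo estimator for the variance divergence blows up as the number of factors M grows, while that of the log-variance divergence stays flat.

```python
import numpy as np

rng = np.random.default_rng(5)

def relative_errors(M, mu=0.3, N=1000, reps=200):
    """Relative statistical errors of Monte Carlo estimators for the variance
    and log-variance divergences between P = N(0, I_M) and Q = N(mu*1, I_M),
    sampling from P; estimated by repeating the estimation reps times."""
    est_var, est_log = [], []
    for _ in range(reps):
        x = rng.normal(size=(N, M))
        log_w = mu * x.sum(axis=1) - 0.5 * M * mu**2   # log(dQ/dP), product structure
        est_var.append(np.var(np.exp(log_w)))          # variance divergence estimate
        est_log.append(np.var(log_w))                  # log-variance divergence estimate
    est_var, est_log = np.array(est_var), np.array(est_log)
    return est_var.std() / est_var.mean(), est_log.std() / est_log.mean()

r_var_1, r_log_1 = relative_errors(M=1)
r_var_20, r_log_20 = relative_errors(M=20)
```

For the log-variance divergence the relative error is governed by the sample variance of a Gaussian and is essentially independent of M, whereas the variance divergence requires exponential moments of the log-likelihood ratio, whose Monte Carlo estimation degrades rapidly.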

Numerical experiments
In this section we illustrate our theoretical results on the basis of numerical experiments. In Subsection 6.1 we discuss computational details of our implementations, complementing the discussion in Section 3.3. Subsections 6.2 and 6.3 focus on the case when the uncontrolled SDE (3) describes an Ornstein-Uhlenbeck process and the dimension is comparatively large. In Section 6.4 we consider metastable settings (of both low and moderate dimensionality), representative of those typically encountered in rare event simulations (see Example 2.1). We rely on PyTorch as a tool for automatic differentiation and refer to the code at https://github.com/lorenzrichter/path-space-PDE-solver.

Computational aspects
The numerical treatment of the Problems 2.1-2.5 using the IDO methodology is based on the explicit loss function representations in Section 3.1, together with a gradient descent scheme relying on automatic differentiation¹². Following the discussion in Section 3.3, a particular instance of an IDO algorithm is determined by the choice of a loss function and, in the case of the cross-entropy, moment and variance-type losses, by a strategy to update the control vector field v in the forward dynamics (see Propositions 3.7 and 3.10). As mentioned towards the end of Section 3.3, we focus on setting v = u at each gradient step, i.e. on using the current approximation as a forward control. Importantly, we do not differentiate the loss with respect to v; in practice this can be achieved by removing the corresponding variables from the autodifferentiation computational graph (for instance using the detach command in the PyTorch package). Including differentiation with respect to v as well as more elaborate choices of the forward control might be rewarding directions for future research. Practical implementations require approximations at three different stages: first, the time discretisation of the SDEs (3) or (5); second, the Monte Carlo approximation of the losses (as outlined in Section 3.3), or, to be precise, the approximation of their respective gradients; and third, the function approximation of either the optimal control vector field u* or the value function V. Moreover, implementations vary according to the choice of an appropriate gradient descent method. Concerning the first point, we discretise the SDE (5) using the Euler-Maruyama scheme [78] along a time grid 0 = t_0 < · · · < t_K = T, namely iterating

X̂_{n+1} = X̂_n + ( b(X̂_n, t_n) + σ(X̂_n, t_n) u(X̂_n, t_n) ) ∆t + σ(X̂_n, t_n) √∆t ξ_{n+1},

where ∆t > 0 denotes the step size, and ξ_n ∼ N(0, I_{d×d}) are independent standard Gaussian random variables.
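The Euler-Maruyama iteration just described can be written compactly as follows; a minimal sketch for a batch of trajectories of (5), with illustrative coefficient choices (scalar Ornstein-Uhlenbeck drift b(x, t) = −x, σ = 1, zero control), not the repository implementation:

```python
import numpy as np

rng = np.random.default_rng(6)

def euler_maruyama(b, sigma, u, x0, T=1.0, K=100, N=1000):
    """Simulate N scalar trajectories of dX = (b + sigma*u) dt + sigma dW
    on [0, T] with K Euler-Maruyama steps; returns the terminal values."""
    dt = T / K
    x = np.full(N, x0, dtype=float)
    for n in range(K):
        t = n * dt
        xi = rng.normal(size=N)
        drift = b(x, t) + sigma(x, t) * u(x, t)
        x = x + drift * dt + sigma(x, t) * np.sqrt(dt) * xi
    return x

# Illustrative choices: OU drift, unit diffusion, zero control.
x_T = euler_maruyama(b=lambda x, t: -x,
                     sigma=lambda x, t: 1.0,
                     u=lambda x, t: 0.0,
                     x0=2.0)
```

For this uncontrolled OU process the terminal mean and variance are known in closed form (2e^{−1} and (1 − e^{−2})/2, respectively), which provides a convenient sanity check for the discretisation.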
Recall that the initial value can be random rather than deterministic (see Remark 2.5). We demonstrate the potential benefit of sampling X_0 from a given density in Section 6.3. We next discuss the approximation of u*. First, note that a viable and straightforward alternative is to instead approximate V and compute u* = −σ^⊤∇V whenever needed (for instance by automatic differentiation), see [101]. However, this approach has performed slightly worse in our experiments, and, furthermore, V can be recovered from u* by integration along an appropriately chosen curve. To approximate u*, a classic option is to use a Galerkin truncation, i.e. a linear combination of ansatz functions,

u(x, t_n) = Σ_{m=1}^M θ^n_m α_m(x),  n ∈ {0, …, K − 1},

with parameters θ^n_m ∈ R. Choosing an appropriate set {α_m}_{m=1}^M is crucial for algorithmic performance -- a task that in high-dimensional settings requires detailed a priori knowledge about the problem at hand. Instead, we focus on approximations of u* realised by neural networks.

Definition 6.1 (Neural networks). We define a standard feed-forward neural network Φ : R^{n_0} → R^{n_L},

Φ(x) = A_L ϱ( A_{L−1} ϱ( · · · ϱ(A_1 x + b_1) · · · ) + b_{L−1} ) + b_L,

with matrices A_l ∈ R^{n_l × n_{l−1}}, vectors b_l ∈ R^{n_l}, 1 ≤ l ≤ L, and a nonlinear activation function ϱ : R → R that is to be applied componentwise. We further define the DenseNet [38,63]

Φ(x) = A_L x_L + b_L,

containing additional skip connections, where x_L is defined recursively by

x_{l+1} = ( x_l, ϱ(A_l x_l + b_l) ),  1 ≤ l ≤ L − 1,

with A_l ∈ R^{n_l × Σ_{i=0}^{l−1} n_i}, b_l ∈ R^{n_l} for 1 ≤ l ≤ L − 1, and x_1 = x, n_0 = d. In both cases the collection of matrices A_l and vectors b_l comprises the learnable parameters θ.
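The DenseNet forward pass of Definition 6.1 amounts to concatenating each new hidden layer output to the running feature vector, so that later layers see all previous outputs. A minimal numpy sketch with freshly drawn random weights (purely illustrative; in the actual method the weights would of course be the learnable parameters):

```python
import numpy as np

rng = np.random.default_rng(7)

def relu(z):
    return np.maximum(z, 0.0)

def densenet_forward(x, layer_sizes):
    """Forward pass with skip connections: layer l sees all previous outputs.
    layer_sizes lists the hidden widths followed by the output dimension."""
    features = x                       # x_1 = x, of dimension n_0
    in_dim = x.shape[0]
    for n_l in layer_sizes[:-1]:       # hidden layers
        A = rng.normal(scale=0.1, size=(n_l, in_dim))
        b = np.zeros(n_l)
        # concatenate the new hidden activations to the running features
        features = np.concatenate([features, relu(A @ features + b)])
        in_dim += n_l
    A_L = rng.normal(scale=0.1, size=(layer_sizes[-1], in_dim))
    return A_L @ features              # affine output layer (zero bias here)

y = densenet_forward(np.ones(3), layer_sizes=[30, 30, 2])
```

The growing input dimension of each layer (A_l ∈ R^{n_l × Σ n_i}) is exactly the bookkeeping expressed by the recursion in Definition 6.1.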
Neural networks are known to be universal function approximators [28,62], with recent results indicating favourable properties in high-dimensional settings [40,41,52,97,111]. The control u can be represented either by u(x, t) = Φ(y) with y = (x, t), i.e. using one neural network for both the space and time dependence, or by u(x, t_n) = Φ_n(x), using one neural network per time step. The former alternative led to better performance in our experiments, and the reported results rely on this choice. For the gradient descent step we choose either SGD with constant learning rate [51, Algorithm 8.1] or Adam [51, Algorithm 8.7], [76], a variant that relies on adaptive step sizes and momenta. Further numerical investigations on network architectures and optimisation heuristics can be found in [23]. To evaluate algorithmic choices we monitor the following two performance metrics:

1. The importance sampling relative error (97), namely the ratio between the standard deviation and the mean of the importance sampling estimator associated to the control u, where u is the approximated control in the corresponding iteration step. This quantity is zero if and only if u = u* (cf. Theorem 2.2) and measures the quality of the control in terms of the objective introduced in Problem 2.5. Since its Monte Carlo version fluctuates heavily if u is far away from u*, we usually estimate this quantity with additional samples not being used in the gradient computation.
2. An L²-error between the approximation u and a reference u*_ref, where u*_ref is computed either analytically or using a finite difference scheme for the HJB-PDE (11). This quantity is more robust w.r.t. deviations from u*, and we therefore compute the corresponding Monte Carlo estimator using just the samples from the training iteration.

Ornstein-Uhlenbeck dynamics with linear costs
Let us consider the controlled Ornstein-Uhlenbeck process

dX^u_s = ( A X^u_s + B u(X^u_s, s) ) ds + B dW_s,

where A, B ∈ R^{d×d}. Furthermore, we assume zero running costs, f = 0, and linear terminal costs g(x) = γ · x, for a fixed vector γ ∈ R^d. As shown in Appendix A.4, the optimal control is given by

u*(x, t) = −B^⊤ e^{A^⊤(T−t)} γ,

which remarkably does not depend on x. Therefore, not only the variance and log-variance losses are robust at u* in the sense of Definition 5.1, but also the relative entropy loss, according to (79) in Proposition 5.3. We choose A = −I_{d×d} + (ξ_{ij})_{1≤i,j≤d} and B = I_{d×d} + (ξ_{ij})_{1≤i,j≤d}, where ξ_{ij} ∼ N(0, ν²) are sampled i.i.d. once at the beginning of the simulation. Note that this choice corresponds to a small perturbation of the product setting from Section 5.2. We set T = 1, ν = 0.1, γ = (1, …, 1)^⊤ and as function approximation take the DenseNet from Definition 6.1 with two hidden layers, each of width n_1 = n_2 = 30, and the nonlinearity ϱ(x) = max(0, x). Lastly, we choose the Adam optimiser as a gradient descent scheme. Figure 1 shows the algorithm's performance for d = 1 with batch size N = 200, learning rate η = 0.01 and step size ∆t = 0.01. We observe that the log-variance, relative entropy and moment losses perform similarly and converge well to a suitable approximation. The cross-entropy loss decreases, but at later gradient steps fluctuates more than the other losses (we note that the fluctuations appear to be less pronounced when using SGD, however at the cost of substantially slowing down the overall speed of convergence). The inferior quality of the control obtained using the cross-entropy loss may be explained by its non-robustness at u*, see Proposition 5.3. Figure 2 shows the algorithm's performance in a high-dimensional case, d = 40, where we now choose N = 500 as the batch size, η = 0.001 as the learning rate, ∆t = 0.01 as the time step, and as before rely on a DenseNet with two hidden layers.
We observe that the relative entropy loss and the log-variance loss perform best, and that the moment and cross-entropy losses converge at a significantly slower rate. The variance loss is numerically unstable and hence not represented in Figure 2. We encounter similar problems in the subsequent experiments and thus do not consider the variance loss in what follows. In Figure 3 we plot some of the components of the 40-dimensional approximated optimal control vector field, as well as the analytic solution u*_ref(x, t) for a fixed value of x and varying time t, showcasing the inferiority of the approximation obtained using the cross-entropy loss. The comparatively poor performance of the cross-entropy and the variance losses can be attributed to their non-robustness with respect to tensorisations, see Section 5.2. To further illustrate these results, Figure 4 displays the relative error associated to the loss estimators computed from N = 15 · 10⁶ samples in different dimensions. The dimensional dependence agrees with what is expected from Proposition 5.7, but we note that our numerical experiment goes beyond the product case. Figure 5 shows that the initialisation of y_0 can have a significant impact on the convergence speed. Indeed, with the initialisation y_0 = − log Z, the moment and log-variance losses perform very similarly, in accordance with Proposition 4.6. In contrast, choosing the initial value y_0 such that the discrepancy |y_0 + log Z| is large incurs a much slower convergence. Comparing the two plots in Figure 5 shows that the Adam optimiser achieves a much faster convergence overall in comparison to SGD. Moreover, the difference in performance between y_0-initialisations is more pronounced when the Adam optimiser is used. The observations in these experiments are in agreement with those in [23].

Ornstein-Uhlenbeck dynamics with quadratic costs
We consider the Ornstein-Uhlenbeck process described by (99) with quadratic running and terminal costs, i.e. $f(x, s) = x^\top P x$ and $g(x) = x^\top R x$, with $P, R \in \mathbb{R}^{d\times d}$. This setting is known as the linear quadratic Gaussian control problem [119]. The optimal control is given by [119, Section 6.5]
$$ u^*(x, t) = -2 B^\top F_t x, $$
where the matrices $F_t$ fulfil the matrix Riccati equation
$$ -\dot{F}_t = A^\top F_t + F_t A - 2 F_t B B^\top F_t + P, \qquad F_T = R. $$
In this example, we demonstrate an approach leveraging a priori knowledge about the structure of the solution. Motivated by (101), we consider the linear ansatz functions $u(x, s) = \Xi_n x$ for $s \in [n\Delta t, (n+1)\Delta t)$, where the entries of the matrices $\Xi_n \in \mathbb{R}^{d\times d}$, $n = 0, \ldots, K-1$, represent the parameters to be learnt. The matrices $A$ and $B$ are chosen as in Subsection 6.2 and we set $P = \frac{1}{2}\mathrm{I}_{d\times d}$, $R = \mathrm{I}_{d\times d}$ and $T = 0.5$. Figure 6 shows the performance using Adam with learning rate $\eta = 0.001$ and SGD with learning rate $\eta = 0.01$, respectively. The relative entropy loss converges fastest, followed by the log-variance loss. The convergence of the cross-entropy loss is significantly slower, in particular in the SGD case. We also note that the cross-entropy loss diverges if larger learning rates are used. These findings are in line with the results from Proposition 5.7. When SGD is used, the moment loss experiences fluctuations at later gradient steps. This can be explained by the fact that the moment loss is robust at $u^*$ only if $y_0 = -\log Z$ is satisfied exactly (see Proposition 4.6).
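For reference solutions, the Riccati equation can be integrated backward in time with a standard ODE solver. The sketch below assumes the common LQG convention $-\dot F_t = A^\top F_t + F_t A - 2 F_t BB^\top F_t + P$ with $F_T = R$ (signs and factors may differ from the paper's exact normalisation) and uses explicit fourth-order Runge-Kutta steps.

```python
import numpy as np

def riccati_backward(A, B, P, R, T=0.5, K=2000):
    """Integrate -dF/dt = A^T F + F A - 2 F BB^T F + P backward from F(T) = R
    with RK4 in reversed time; returns F(0). A sketch, not production code."""
    dt = T / K
    BBt = B @ B.T
    def rhs(F):  # dF/ds in reversed time s = T - t, i.e. -dF/dt
        return A.T @ F + F @ A - 2.0 * F @ BBt @ F + P
    F = R.astype(float).copy()
    for _ in range(K):
        k1 = rhs(F)
        k2 = rhs(F + 0.5 * dt * k1)
        k3 = rhs(F + 0.5 * dt * k2)
        k4 = rhs(F + dt * k3)
        F = F + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return F

# scalar sanity check: for A = 0, B = 1, P = 0 the solution is F(t) = R / (1 + 2R(T - t))
F0 = riccati_backward(np.zeros((1, 1)), np.eye(1), np.zeros((1, 1)), np.eye(1))
```

In the scalar case with $A = 0$, $B = 1$, $P = 0$, $R = 1$ and $T = 0.5$, the closed form gives $F(0) = 1/(1 + 2 \cdot 0.5) = 0.5$, which the numerical integration reproduces.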
Let us illustrate the potential benefit of sampling $X_0$ from a prescribed density (see Remark 2.5), here $X_0 \sim \mathcal{N}(0, \mathrm{I}_{d\times d})$. The overall convergence is hardly affected and the $L^2$ error dynamics agrees qualitatively with the one shown in Figure 6. However, the approximation is more accurate at the initial time $t = 0$, see Figure 7. This phenomenon appears to be particularly pronounced in this example, as independent ansatz functions are used at each time step.

Metastable dynamics in low and high dimensions
We now come back to the double well potential from Example 2.1 and consider the SDE
$$ \mathrm{d}X_s = -\nabla\Psi(X_s)\,\mathrm{d}s + B\,\mathrm{d}W_s, \qquad X_0 = x_{\mathrm{init}}, $$
where $B \in \mathbb{R}^{d\times d}$ is the diffusion coefficient,
$$ \Psi(x) = \sum_{i=1}^d \kappa_i \left(x_i^2 - 1\right)^2 $$
is the potential (with $\kappa_i > 0$ being a set of parameters) and $x_{\mathrm{init}} = (-1, \ldots, -1)^\top$ is the initial condition. We consider zero running costs, $f = 0$, terminal costs
$$ g(x) = \sum_{i=1}^d \nu_i \left(x_i - 1\right)^2, $$
where $\nu_i > 0$, and a terminal time $T = 1$. Recall from Example 2.1 that choosing higher values of $\kappa_i$ and $\nu_i$ accentuates the metastable features, making sample-based estimation of $\mathbb{E}\left[\exp(-g(X_T))\right]$ more challenging. For an illustration, Figure 8 shows the potential $\Psi$ and the weight at final time, $e^{-g}$ (see (15)), for different values of $\nu$ and $\kappa$, in dimension $d = 1$ and for $B = 1$. We furthermore plot the 'optimally tilted potentials' $\Psi^* = \Psi + BB^\top V$, noting that $-\nabla\Psi^* = -\nabla\Psi + Bu^*$. Finally, the right-hand side shows the gradients $\nabla u^*$ at initial time $t = 0$. For a first experiment, let us consider the one-dimensional case, choosing $B = 1$, $\kappa = 5$ and $\nu = 3$. In this setting the relative error associated with the standard Monte Carlo estimator, i.e. the estimator version of (97), which we denote by $\delta$, is roughly $\delta(0) = 63.86$ for a batch size of $N = 10^7$ trajectories, of which only about $2 \cdot 10^3$ (i.e. $0.02\%$) cross the barrier. Given that $e^{-g}$ is supported mostly in the right well, the optimal control $u^*$ steers the dynamics across the barrier. Using an approximation of $u^*$ obtained by a finite difference scheme, we achieve a relative error of $\delta(u^*) = 1.94$ (the theoretical optimum being zero, according to Theorem 2.2) and a crossing ratio of approximately $87.28\%$.
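The severity of the naive estimator can be reproduced qualitatively with a few lines of Euler-Maruyama. The sketch below assumes the one-dimensional choices $\Psi(x) = \kappa(x^2 - 1)^2$ and $g(x) = \nu(x - 1)^2$ (the precise form of the potential and terminal cost is an assumption here) and uses a much smaller batch than the $N = 10^7$ in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
kappa, nu, T, dt, N = 5.0, 3.0, 1.0, 0.005, 20_000
steps = int(T / dt)

grad_Psi = lambda x: 4.0 * kappa * x * (x**2 - 1.0)    # Psi(x) = kappa (x^2 - 1)^2
x = np.full(N, -1.0)                                   # all paths start in the left well
for _ in range(steps):                                 # Euler-Maruyama for dX = -Psi'(X) dt + dW
    x += -grad_Psi(x) * dt + np.sqrt(dt) * rng.normal(size=N)

weights = np.exp(-nu * (x - 1.0) ** 2)                 # importance weights e^{-g(X_T)}
crossing_ratio = np.mean(x > 0.0)                      # fraction of paths over the barrier
rel_sd = weights.std() / weights.mean()                # per-sample relative standard deviation
```

With paths started in the left well, only a tiny fraction crosses the barrier and the relative standard deviation of the weights $e^{-g(X_T)}$ is large, mirroring the poor relative error $\delta(0)$ reported in the text.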
To run IDO-based algorithms, we use the standard feed-forward neural network (see Definition 6.1) with the activation function $\tanh$ and choose $\Delta t = 0.005$, $\eta = 0.05$. We try batch sizes of $N = 50$ and $N = 1000$ and plot the training progress in Figures 9 and 10, respectively. In Figure 11 we display the approximation obtained using the log-variance loss and compare it with the reference solution $u^*_{\mathrm{ref}}$. It can be observed that the log-variance and moment losses perform well with both batch sizes, with the log-variance loss, however, achieving a satisfactory approximation with fewer gradient steps. The cross-entropy loss appears to work well only if the batch size is sufficiently large. We attribute this observation to the non-robustness at $u^*$ (see Proposition 5.3) and, tentatively, to the exponential factor appearing in (48b), see Remark 3.8. The optimisation using the relative entropy loss is frustrated by instabilities in the vicinity of the solution $u^*$. In order to further investigate this aspect we numerically compute the variances of the gradients and the associated relative errors with respect to the mean, using 50 realisations at each gradient step. Figure 12 shows the averages of the relative errors and variances over the weights in the network, confirming that the gradients associated with the log-variance loss have significantly lower variances. This phenomenon is in accordance with Proposition 5.3 (in particular noting that $|\nabla u^*|^2$ is expected to be rather large in a metastable setting, see Figure 8) and explains the unsatisfactory behaviour of the relative entropy loss observed in Figures 9 and 10. Let us now consider the multidimensional setting, namely $d = 10$, where the dynamics exhibits 'highly' metastable characteristics in 3 dimensions and 'weakly' metastable characteristics in the remaining 7 dimensions. To be precise, we set $\kappa_i = 5$, $\nu_i = 3$ for $i \in \{1, 2, 3\}$ and $\kappa_i = 1$, $\nu_i = 1$ for $i \in \{4, \ldots, 10\}$.
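For concreteness, a minimal numpy version of such a feed-forward control ansatz $u(x, t; \theta)$ with $\tanh$ activations can be sketched as follows (layer sizes and initialisation are hypothetical; the actual Definition 6.1 architecture and the gradient-based training loop are omitted).

```python
import numpy as np

rng = np.random.default_rng(2)

class MLPControl:
    """Feed-forward ansatz u(x, t; theta): input (x, t), two tanh hidden layers."""
    def __init__(self, d, width=30):
        self.W1 = rng.normal(0.0, 0.1, size=(width, d + 1))  # time appended to state
        self.b1 = np.zeros(width)
        self.W2 = rng.normal(0.0, 0.1, size=(width, width))
        self.b2 = np.zeros(width)
        self.W3 = rng.normal(0.0, 0.1, size=(d, width))      # output: control in R^d
        self.b3 = np.zeros(d)

    def __call__(self, x, t):
        z = np.concatenate([x, [t]])
        h = np.tanh(self.W1 @ z + self.b1)
        h = np.tanh(self.W2 @ h + self.b2)
        return self.W3 @ h + self.b3

u = MLPControl(d=10)
out = u(np.zeros(10), 0.0)   # control evaluated at (x, t) = (0, 0)
```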
Moreover, we choose the diffusion coefficient to be $B = \mathrm{I}_{d\times d}$ and conduct the experiment with a batch size of $N = 500$.
In Figure 13 we see that only the log-variance loss achieves a reasonable approximation. Interestingly, the training progresses in stages, successively overcoming the potential barriers in the highly metastable directions. On the right-hand side we display the components of the approximated optimal control associated with one highly and one weakly metastable direction, for fixed $t = 0$. We observe that the approximation is fairly accurate, and that comparatively large control forces are needed to push the dynamics over the highly metastable potential barrier.

Conclusion and outlook
Motivated by the observation that optimal control of diffusions can be phrased in a number of different ways, we have provided a unifying framework based on divergences between path measures, encompassing various existing numerical methods in the class of IDO algorithms. In particular, we have shown that the novel log-variance divergences are closely connected to forward-backward SDEs. We have furthermore shown a fundamental equivalence between approaches based on the KL-divergence and the log-variance divergences. Turning to the variance of Monte Carlo gradient estimators, we have defined and studied two notions of stability: robustness under tensorisation and robustness at the optimal control solution. Of the losses and estimators under consideration, only the log-variance loss is stable in both senses, often resulting in superior numerical performance. The consequences of robustness and non-robustness as defined have been exemplified by extensive numerical experiments. The results presented in this paper can be extended in various directions. First, it would be interesting to consider other divergences on path space and to construct and study the ensuing algorithms. In this respect, we also mention the development of more elaborate schemes for updating the control of the forward dynamics. Second, one may attempt to generalise the current framework to other types of control problems and PDEs (for instance to elliptic PDEs and hitting time problems as considered in [55,56,59,60], or to the Schrödinger problem as discussed in [104]). A deeper understanding of the design of IDO algorithms could be achieved by extending our stability analysis beyond the product case and to controls that differ greatly from the optimal one. In particular, advances in this direction might help to develop more sophisticated variance reduction techniques. Finally, we envision applications of the log-variance divergences in other settings.
Acknowledgements. This research has been funded by Deutsche Forschungsgemeinschaft (DFG) through the grant CRC 1114 'Scaling Cascades in Complex Systems' (projects A02 and A05, project number 235221301). We would like to thank Carsten Hartmann and Wei Zhang for many very useful discussions. We thank the referees for their useful comments and suggestions that have led to various improvements in the presentation of this paper.

A Appendix
A.1 Proofs for Section 3.1
The Radon-Nikodym derivatives appearing in the divergences defined in Section 3.1 can be computed explicitly:
Lemma A.1. For $u \in \mathcal{U}$, the measures $P$ and $P^u$ are equivalent. Moreover, the Radon-Nikodym derivative satisfies
$$ \frac{\mathrm{d}P^u}{\mathrm{d}P}(X) = \exp\left( \int_0^T u(X_s, s) \cdot \mathrm{d}W_s - \frac{1}{2} \int_0^T |u(X_s, s)|^2 \,\mathrm{d}s \right). $$
Proof. The fact that the two measures are equivalent follows from the linear growth assumption on $u$ (see (6)) and Girsanov's theorem, where $P^W$ denotes the measure on $C$ induced by the driving Brownian motion. Using $\frac{\mathrm{d}P^u}{\mathrm{d}P} = \frac{\mathrm{d}P^u}{\mathrm{d}P^W} \cdot \frac{\mathrm{d}P^W}{\mathrm{d}P}$ and inserting (106) and (107), we obtain the desired result.
Proof of Proposition 3.5. Using (15) and (105) (or arguing as in the proof of Theorem 2.2) we compute
Proof of Proposition 3.7. Similarly, we compute where $C \in \mathbb{R}$ does not depend on $u$.
Proof of Proposition 3.10. With $Y^{u,v}_T$ defined as in (51), we compute for the variance loss Similarly, the log-variance loss equals

A.2 Proofs for Section 4
Proof of Proposition 4.3. For $\varepsilon \in \mathbb{R}$ and $\varphi \in C^1_b(\mathbb{R}^d \times [0, T]; \mathbb{R}^d)$, let us define the change of measure According to Girsanov's theorem, the process $(\widetilde{W}_s)_{0\le s\le T}$, defined as is a Brownian motion under $\Theta$. We therefore obtain Using dominated convergence, we can interchange derivatives and integrals (for technical details, we refer to [83]) and compute where we have used Itô's isometry. Turning to the log-variance loss, we see that Setting $v = u$, we obtain from which the result follows by comparison with (117).
Proof of Proposition 4.6. We compute Setting $v = u$ and using that $\mathbb{E}\big[y_0 \int_0^T \varphi(X^v_s, s) \cdot \mathrm{d}W_s\big] = 0$, the first statement follows by comparison with (69). The second statement follows from where we have used the fact that $Y^{u^*,v}_T - g(X^v_T) = \log Z$, almost surely.

A.3 Proofs for Section 5
Proof of Proposition 5.3. 1.) We compute where $\frac{\delta Y^{u,v,(i)}_T}{\delta u}(u; \varphi)$ is given in (82). As in the proof for the log-variance estimator, the quantity is almost surely constant and thus the statement follows.
2.) Similarly to the computations involved in 1.) we have where we have used the fact that $Y^{u^*,v,(i)}_T - g\big(X^{u^*,(i)}_T\big) = -\log Z$ according to (24) and (55b). The variance of this expression equals implying the claim.
3.) Let $\varphi \in C^1_b(\mathbb{R}^d \times [0, T]; \mathbb{R}^d)$ and $\varepsilon \in \mathbb{R}$. As usual, we denote by $(X^{u^* + \varepsilon\varphi}_s)_{0\le s\le T}$ the unique strong solution to (5), with $u$ replaced by $u^* + \varepsilon\varphi$. By a slight modification of [81], $\int_0^T f(X^{u^* + \varepsilon\varphi}_s, s)\,\mathrm{d}s + g(X^{u^* + \varepsilon\varphi}_T)$ From (11b) and using integration by parts, we see that the last term in (128b) satisfies Next, we employ Itô's formula and Einstein's summation convention to compute where we used (37) from the second to the third line and (11) to manipulate the first term in the third line. Using (80) and (130), we see that the quadratic variation process $\big\langle A_\cdot\,, \nabla V(X^{u^*}_\cdot, \cdot) \big\rangle$ satisfies Combining (80), (129), (130) and (131), it follows that (128) equals The claim is now implied by Itô's isometry.
Since $\mathbb{E}\big[\frac{\delta}{\delta u}\big|_{u=u^*} L_{\mathrm{CE},v}(u; \varphi)\big] = 0$ by construction, we see that Let us assume, for the sake of contradiction, that $\mathrm{Var}\big(\frac{\delta}{\delta u}\big|_{u=u^*} L_{\mathrm{CE},v}(u; \varphi)\big) = 0$ for all $\varphi \in C^1_b(\mathbb{R}^d \times [0, T]; \mathbb{R}^d)$. It then follows that which is clearly false, in general.
Proof of Proposition 5.7. Throughout the proof, we will use the notation to denote the product measures on The sample variance satisfies [27] Var where We calculate where we have used the fact that, for instance, For P = P we have from which the robustness follows immediately. For P = P, on the other hand, and the proof of the non-robustness proceeds as in 4.).
3.) As in the proof of 1.) we have

$\mathrm{Var}\big(D^{\mathrm{Var},(N)}\big)$ where and We can write the relative error as where the second bound is implied by the $c_r$-inequality [85, Section 9.3]. By Jensen's inequality and since $\frac{\mathrm{d}Q}{\mathrm{d}P}$ is not $P$-almost surely constant by assumption, it holds that $C_1 > 1$ and $C_2 < 1$. The claim therefore follows from combining (148) and (149).
4.) Employing the notation introduced in (136), we see that Manipulating the first term, we obtain Notice that The claim now follows from combining (150) and (151) in definition (86).

A.4 Optimal control for Ornstein-Uhlenbeck dynamics with linear cost
The control problem considered in Section 6.2 can be solved analytically. Using (17), we note that the value function solving the HJB-PDE (11) fulfils $V(x, t) = -\log \psi(x, t)$, with
$$ \psi(x, t) = \mathbb{E}\left[\exp\left(-\gamma \cdot X_T\right) \,\middle|\, X_t = x\right], $$
where $(X_s)_{t\le s\le T}$ solves $\mathrm{d}X_s = AX_s\,\mathrm{d}s + B\,\mathrm{d}W_s$, $X_t = x$.
The distribution of $X_T$ is known explicitly, namely $X_T \sim \mathcal{N}(\mu_t, \Sigma_t)$, with $\mu_t = e^{A(T-t)}x$ and $\Sigma_t = \int_0^{T-t} e^{As} BB^\top e^{A^\top s}\,\mathrm{d}s$.
We can now compute
$$ \psi(x, t) = \exp\left(-\gamma \cdot \mu_t + \tfrac{1}{2}\gamma^\top \Sigma_t \gamma\right), $$
and the value function
$$ V(x, t) = \gamma \cdot e^{A(T-t)}x - \tfrac{1}{2}\gamma^\top \Sigma_t \gamma, $$
and therefore with (21)
$$ u^*(x, t) = -B^\top \nabla V(x, t) = -B^\top e^{A^\top(T-t)}\gamma. $$