A modified MSA for stochastic control problems

The classical Method of Successive Approximations (MSA) is an iterative method for solving stochastic control problems and is derived from Pontryagin's optimality principle. It is known that the MSA may fail to converge. Using careful estimates for the backward stochastic differential equation (BSDE), this paper suggests a modification to the MSA algorithm. This modified MSA is shown to converge for general stochastic control problems with control in both the drift and diffusion coefficients. Under some additional assumptions, a linear rate of convergence is proved. The results are valid without restrictions on the time horizon of the control problem, in contrast to iterative methods based on the theory of forward-backward stochastic differential equations.


Introduction
Stochastic control problems appear naturally in a range of applications in engineering, economics and finance. With the exception of very specific cases, such as linear-quadratic control in engineering or the Merton portfolio optimization problem in finance, stochastic control problems typically have no closed form solutions and have to be solved numerically. In this work, we consider a modification to the method of successive approximations (MSA), see Algorithm 1. The MSA is essentially a way of applying Pontryagin's optimality principle to obtain numerical solutions of stochastic control problems.
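We will consider the continuous space, continuous time problem where the controlled system is modelled by an $\mathbb{R}^d$-valued diffusion process. Let $W$ be a $d$-dimensional Wiener martingale on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t\ge 0}, \mathbb{P})$. We will provide the exact assumptions we need in Section 2. For now, let us fix a finite time $T \in (0, \infty)$ and consider the controlled stochastic differential equation (SDE)
$$dX_s = b(s, X_s, \alpha_s)\,ds + \sigma(s, X_s, \alpha_s)\,dW_s, \quad s \in [0,T], \quad X_0 = x, \tag{1}$$
for given measurable functions $b$ and $\sigma$ defined on $[0,T] \times \mathbb{R}^d \times A$, with $b$ taking values in $\mathbb{R}^d$ and $\sigma$ matrix-valued. Here $\alpha = (\alpha_s)_{s\in[0,T]}$ is a control process belonging to the space of admissible controls $\mathcal{A}$, valued in a separable metric space $A$, and we will write $X^\alpha$ to denote the unique solution of (1) which starts from $x$ at time $0$ whilst being controlled by $\alpha$. Furthermore, let $f : [0,T] \times \mathbb{R}^d \times A \to \mathbb{R}$ and $g : \mathbb{R}^d \to \mathbb{R}$ be given measurable functions and consider the gain functional
$$J(x, \alpha) := \mathbb{E}\Big[\int_0^T f(s, X^\alpha_s, \alpha_s)\,ds + g(X^\alpha_T)\Big] \tag{2}$$
for all $x \in \mathbb{R}^d$ and $\alpha \in \mathcal{A}$. We want to solve the optimisation problem of minimising $J(x, \alpha)$ over $\alpha \in \mathcal{A}$, i.e. to find the optimal control $\alpha^*$ which achieves the minimum of (2) (or, if the infimum cannot be reached by any $\alpha \in \mathcal{A}$, an $\varepsilon$-optimal control $\alpha^\varepsilon \in \mathcal{A}$ such that $J(x, \alpha^\varepsilon) \le \inf_{\alpha \in \mathcal{A}} J(x, \alpha) + \varepsilon$).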
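As an illustration of (1) and (2), the gain functional of a fixed control can be estimated by simulating the controlled SDE with the Euler-Maruyama scheme and averaging over sample paths. The sketch below is not taken from the paper. It assumes a one-dimensional state, a Markovian feedback control of the form `control(t, x)`, and vectorised coefficient functions; the name `estimate_gain` and all parameters are hypothetical.

```python
import numpy as np

def estimate_gain(b, sigma, f, g, control, x0, T=1.0,
                  n_steps=100, n_paths=20000, seed=0):
    """Monte Carlo estimate of J(x0, alpha) for a 1-d controlled SDE.

    Assumption: the control is a feedback map control(t, x); this is an
    illustrative restriction, not the general admissible class.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(n_paths, float(x0))   # all paths start from x0
    running = np.zeros(n_paths)       # accumulates the integral of f
    for k in range(n_steps):
        t = k * dt
        a = control(t, x)
        running += f(t, x, a) * dt
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        # Euler-Maruyama step for dX = b ds + sigma dW
        x = x + b(t, x, a) * dt + sigma(t, x, a) * dw
    return float(np.mean(running + g(x)))

# Toy check: zero control, constant diffusion 0.5, terminal cost x^2.
J = estimate_gain(
    b=lambda t, x, a: a,
    sigma=lambda t, x, a: 0.5,
    f=lambda t, x, a: 0.0 * x,
    g=lambda x: x ** 2,
    control=lambda t, x: np.zeros_like(x),
    x0=1.0,
)
```

For this toy choice $X_T$ is Gaussian with mean $1$ and variance $0.25$, so the exact value is $J = 1 + 0.25 = 1.25$, and the estimate should agree with it up to Monte Carlo error.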
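In the present paper, we study an approach based on Pontryagin's optimality principle, see e.g. [4], [7] or [24]. The main idea is to consider optimality conditions for controls of the problem (2). Given $b$, $\sigma$ and $f$ we define the Hamiltonian $H$ by
$$H(t, x, y, z, a) = b(t, x, a) \cdot y + \mathrm{tr}\big(\sigma^\top(t, x, a)\, z\big) + f(t, x, a). \tag{3}$$
Consider, for each $\alpha \in \mathcal{A}$, the BSDE, called the adjoint equation,
$$dY^\alpha_s = -D_x H(s, X^\alpha_s, Y^\alpha_s, Z^\alpha_s, \alpha_s)\,ds + Z^\alpha_s\,dW_s, \quad s \in [0,T], \quad Y^\alpha_T = D_x g(X^\alpha_T). \tag{4}$$
It is well known from Pontryagin's optimality principle that, if an admissible control $\alpha \in \mathcal{A}$ is optimal, $X^\alpha$ is the corresponding optimally controlled dynamic (1) and $(Y^\alpha, Z^\alpha)$ is the solution to the associated adjoint equation (4), then for all $a \in A$ and all $s \in [0,T]$ the following holds:
$$H(s, X^\alpha_s, Y^\alpha_s, Z^\alpha_s, \alpha_s) \le H(s, X^\alpha_s, Y^\alpha_s, Z^\alpha_s, a). \tag{5}$$
We now define the augmented Hamiltonian $\tilde H$, for some $\rho \ge 0$, by
$$\tilde H(t, x, y, z, a', a) := H(t, x, y, z, a) + \tfrac{1}{2}\rho\,|b(t,x,a) - b(t,x,a')|^2 + \tfrac{1}{2}\rho\,|\sigma(t,x,a) - \sigma(t,x,a')|^2 + \tfrac{1}{2}\rho\,|D_x H(t,x,y,z,a) - D_x H(t,x,y,z,a')|^2. \tag{6}$$
Notice that when $\rho = 0$ we have exactly the definition of the Hamiltonian (3). Given the augmented Hamiltonian, let us introduce the modified MSA in Algorithm 1, which consists of successive integrations of the state and adjoint systems and updates to the control. Notice that the backward SDE depends on the Hamiltonian $H$, while the control update step comes from minimizing the augmented Hamiltonian $\tilde H$.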

Algorithm 1 Modified Method of Successive Approximations:
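Initialisation: make a guess of the control $\alpha^0 = (\alpha^0_s)_{s\in[0,T]}$.
while difference between $J(x, \alpha^n)$ and $J(x, \alpha^{n-1})$ is large do
    Given a control $\alpha^{n-1} = (\alpha^{n-1}_s)_{s\in[0,T]}$, solve the following forward SDE, then solve the backward SDE:
    $$\begin{aligned} dX^n_s &= b(s, X^n_s, \alpha^{n-1}_s)\,ds + \sigma(s, X^n_s, \alpha^{n-1}_s)\,dW_s, & X^n_0 &= x, \\ dY^n_s &= -D_x H(s, X^n_s, Y^n_s, Z^n_s, \alpha^{n-1}_s)\,ds + Z^n_s\,dW_s, & Y^n_T &= D_x g(X^n_T). \end{aligned} \tag{7}$$
    Update the control
    $$\alpha^n_s \in \arg\min_{a \in A} \tilde H(s, X^n_s, Y^n_s, Z^n_s, \alpha^{n-1}_s, a), \quad s \in [0,T]. \tag{8}$$
end while
return $\alpha^n$.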
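To make the loop of Algorithm 1 concrete, here is a minimal sketch on a toy problem that is not taken from the paper: the deterministic special case $\sigma \equiv 0$ with $b(t,x,a) = a$, $f(t,x,a) = (x^2 + a^2)/2$ and $g \equiv 0$. Without noise the adjoint BSDE reduces to the ODE $y' = -x$, $y(T) = 0$, with $Z \equiv 0$. The sketch assumes that in this setting the augmented Hamiltonian reduces to $a\,y + (x^2 + a^2)/2 + \rho\,|a - a'|^2/2$, whose minimiser over $a$ is available in closed form. The function name, the grid sizes and the choice $\rho = 1$ are illustrative.

```python
import numpy as np

def modified_msa(rho=1.0, T=1.0, x0=1.0, n_steps=400, n_iter=60):
    """Modified MSA on the toy problem x' = a, cost = int (x^2 + a^2)/2 dt."""
    dt = T / n_steps
    a = np.zeros(n_steps)              # initial guess alpha^0 = 0
    costs = []
    for _ in range(n_iter):
        # Forward step: explicit Euler for x' = a, x(0) = x0.
        x = np.empty(n_steps + 1)
        x[0] = x0
        x[1:] = x0 + np.cumsum(a) * dt
        costs.append(0.5 * np.sum(x[:-1] ** 2 + a ** 2) * dt)
        # Backward step: y' = -x, y(T) = 0, i.e. y(t) = int_t^T x(s) ds.
        y = np.zeros(n_steps + 1)
        y[:-1] = np.cumsum((x[:-1] * dt)[::-1])[::-1]
        # Control update: pointwise argmin over a of
        #   a*y + a^2/2 + rho/2 * (a - a_prev)^2   (toy augmented Hamiltonian).
        a = (rho * a - y[:-1]) / (1.0 + rho)
    return a, np.array(costs)

a_final, costs = modified_msa()
```

For $T = 1$ and $x = 1$ the Riccati equation of this linear-quadratic problem gives the optimal cost $\tanh(1)/2$, so the recorded costs should decrease from $0.5$ (the cost of the zero control) towards that value, up to discretisation error.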
The method of successive approximations (i.e. the case ρ = 0) for the numerical solution of deterministic control problems was already proposed in [5]. A recent application of the modified MSA to a deep learning problem has been studied in [31], where the authors formulated the training of deep neural networks as an optimal control problem and introduced the modified method of successive approximations as an alternative training algorithm for deep learning. For us, the main motivation to explore the modified MSA for stochastic control problems is to obtain convergence, ideally with a rate, of an iterative algorithm applicable to problems with the control in the diffusion part of the controlled dynamics. This is in contrast to [33], where a convergence rate of the Bellman-Howard policy iteration is shown but only for control problems with no control in the diffusion part of the controlled dynamics.
In Lemma 2.3, which can be established using careful BSDE estimates, we obtain an estimate on the change of J when we perform a minimization step of the Hamiltonian as in (8). If the sum of the last three terms of (15) is bigger than the first term, then for the classical MSA algorithm (i.e. the case ρ = 0) we cannot guarantee that the update of the control is in the optimal descent direction of J. That means that the method of successive approximations may diverge. To overcome this, we need to modify the algorithm in such a way that we ensure convergence. With this in mind, the desirability of the augmented Hamiltonian (6) for updating the control becomes clear, as long as it still characterises optimal controls like H does. Theorem 2.4 answers this question affirmatively, which opens the way to the modified MSA. In Theorem 2.5 we show that the modified method of successive approximations converges for arbitrary T, and in Corollary 2.6 we show a linear convergence rate for certain stochastic control problems. We observe that the forward and backward dynamics in (7) are decoupled, due to the iteration used. Therefore, they can be efficiently approximated, even in high dimension, using deep learning methods, see [30] and [29]. However, the minimization step (8) might be computationally expensive for some problems. A possible approach circumventing this is to replace the full minimization in (8) by gradient descent. A continuous version of this gradient flow is analyzed in [34].
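The idea of replacing the full minimization by gradient descent can be sketched as follows. This is an illustration under stated assumptions, not the scheme analysed in [34]: it uses a hypothetical scalar toy objective $a \mapsto a\,y + a^2/2 + \rho\,|a - a'|^2/2$ in place of the augmented Hamiltonian, and the step size `lr` and iteration count are arbitrary choices. For this toy objective the exact minimiser $(\rho a' - y)/(1 + \rho)$ is known, which makes it possible to check the result.

```python
import numpy as np

def update_control_by_gradient_descent(y, a_prev, rho=1.0, lr=0.2, n_steps=100):
    """Approximate, pointwise in time, the minimiser over a of
    a*y + a^2/2 + rho/2 * (a - a_prev)^2 (a toy augmented Hamiltonian)."""
    y = np.asarray(y, dtype=float)
    a_prev = np.asarray(a_prev, dtype=float)
    a = a_prev.copy()                          # start from the previous control
    for _ in range(n_steps):
        grad = y + a + rho * (a - a_prev)      # derivative of the objective in a
        a = a - lr * grad                      # plain gradient step
    return a

y = np.linspace(0.0, 1.0, 5)
a_prev = np.ones(5)
a_new = update_control_by_gradient_descent(y, a_prev)
a_exact = (1.0 * a_prev - y) / (1.0 + 1.0)     # closed-form minimiser for rho = 1
```

The iteration contracts with factor $|1 - \mathrm{lr}\,(1 + \rho)|$, so the step size must satisfy $\mathrm{lr} < 2/(1 + \rho)$ for the gradient steps to converge in this toy setting.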
The main contributions of this paper are the probabilistic proof of convergence of the modified method of successive approximations and the establishment of a convergence rate for a specific type of optimal control problem. This paper is organised as follows: in Section 1.1 we compare our results with existing work. In Section 2 we state the assumptions and main results. In Section 3 we collect all proofs. Finally, in Appendix A we recall an auxiliary lemma which is needed in the proof of Corollary 2.6.
1.1. Related work. One can solve the stochastic optimal control problem using the dynamic programming principle. It is well known, see e.g. Krylov [8], that under reasonable assumptions the value function, defined as the infimum of (2) over all admissible controls, satisfies the Bellman partial differential equation (PDE). There are several approaches to solving this nonlinear problem. One may apply a finite difference method to discretise the Bellman PDE and get a high dimensional nonlinear system of equations, see e.g. [19] or [21]. Or one may linearize the Bellman PDE and then iterate. The classical approach is the Bellman-Howard policy improvement / iteration algorithm, see e.g. [1], [2] or [3]. The algorithm is initialised with a "guess" of a Markovian control. Given a Markovian control strategy at step n, one solves a linear PDE with the given control fixed and then uses the solution of the linear PDE to update the Markovian control, see e.g. [26], [27] or [28]. In [33], a global rate of convergence and stability for the policy iteration algorithm has been established using backward stochastic differential equation (BSDE) theory. However, the result only applies to stochastic control problems with no control in the diffusion coefficient of the controlled dynamics.
It is known that the solution of the stochastic optimal control problem can be obtained from a corresponding forward-backward stochastic differential equation (FBSDE) via the stochastic optimality principle, see [25, Chapter 8.1]. Indeed, let us consider (1) and (4), and recall from the stochastic optimality principle, see [24, Theorem 4.12], that the optimal control $\alpha = (\alpha_s)_{s\in[0,T]}$ satisfies a first order condition. Assume that, under some conditions on $b$, $\sigma$ and $f$, this first order condition uniquely determines $\alpha_s$ for $s \in [0,T]$ by
$$\alpha_s = \varphi(s, X_s, Y_s, Z_s) \tag{10}$$
for some function $\varphi$. Therefore, after plugging (10) into (1) and (4), we obtain the coupled FBSDE (11). It is worth mentioning that when $\sigma$ does not depend on the control, the diffusion coefficient of the forward equation in (11) depends on the forward process and time only. This means that it does not have $Y$ and $Z$ components.
The theory of FBSDEs has been studied widely. There are several methods to show existence and uniqueness, and a number of numerical algorithms have been proposed based on those methods.
The first is the method of contraction mapping. It was first studied by Antonelli [9] and later by Pardoux and Tang [14]. The main idea there is to show that a certain map is a contraction, and then to apply a fixed point argument. However, it turns out that this method works only for a small enough time horizon $T$. In the case when the forward diffusion coefficient does not depend on $Y$ and $Z$, having small $T$ is sufficient to get a contraction. Otherwise, one needs to assume additionally that the Lipschitz constant of the forward diffusion coefficient in $z$ and that of $g$ in $x$ satisfy a certain inequality, see [25, Theorem 8.2.1]. Using the method of contraction mapping one can then implement a Picard-iteration-type numerical algorithm and show exponential convergence for small $T$.
The second method is the Four Step Scheme. It was introduced by Ma, Protter and Yong, see [10], and was later studied by Delarue [16]. The idea is to use a decoupling function and then study an associated quasi-linear PDE.
We note that in [10,16] the forward diffusion coefficient of the FBSDE does not depend on Z. This corresponds to stochastic control problems with an uncontrolled diffusion coefficient. The numerical algorithms based on this method exploit the numerical solution of the associated quasi-linear PDE and therefore face some limitations for high dimensional problems, see Douglas, Ma and Protter [12], Milstein and Tretyakov [18], Ma, Shen and Zhao [20] and Delarue and Menozzi [17]. Guo, Zhang and Zhuo [23] proposed a numerical scheme for the high-dimensional quasi-linear PDE associated with the coupled FBSDE when the forward diffusion coefficient does not depend on Z, which is based on a monotone scheme and on a probabilistic approach.
Finally, there is the method of continuation. This method was developed by Hu and Peng [11], Peng and Wu [15] and by Yong [13]. It allows them to show existence and uniqueness for arbitrary T under monotonicity conditions on the coefficients, which one would not expect to apply to FBSDEs arising from a control problem as described by (10), (11).
Recently, deep learning methods have been applied to solving FBSDEs. In [32], three algorithms for solving fully coupled FBSDEs are provided, which have good accuracy and performance for high-dimensional problems. One of the algorithms is based on the Picard iteration and it converges, but only for small enough T.

Main results
We fix a finite horizon $T \in (0, \infty)$. Let $A$ be a separable metric space. This is the space where the control processes $\alpha$ take values. We fix a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t\ge 0}, \mathbb{P})$ and a Wiener martingale $W$ on this space. The state of the system is governed by the controlled SDE (1). The corresponding adjoint equation is (4).
Assumption 2.1. The functions b and σ are continuous in t and twice differentiable in x. There exists K ≥ 0 such that ∀x, and Under these assumptions, we can obtain the following estimate.
The proof will be given in Section 3. We now state a necessary condition for optimality for the augmented Hamiltonian.
Theorem 2.4 (Extended Pontryagin's optimality principle). Let α be the optimal control, X α be the associated controlled state solving (1), and (Y α , Z α ) be the associated adjoint processes solving (4). Then for any θ ∈ A the inequality (16) holds.
The proof of Theorem 2.4 will be given in Section 3. We are now ready to present the main result of the paper.
for all t ∈ [0, T ], x ∈ R d and a ∈ A. In addition, assume that f and g are convex in x. Then we have the following estimate for the sequence (α n ) n∈N from Algorithm 1: where C is a positive constant.
The proof of Corollary 2.6 will be given in Section 3. Theorem 2.5 and Corollary 2.6 are extensions of the result in [5] to the stochastic case.

Proofs
We start working towards the proof of Theorem 2.5. Recall the adjoint equation for an admissible control α: and Then Y α and Z α are bounded. Proof.
Step 1. First, we show an estimate on Y α . From the definition of the Hamiltonian (3) we have
$$D_x H(t, x, y, z, a) = (D_x b(t, x, a))^\top y + D_x\,\mathrm{tr}\big(\sigma^\top(t, x, a)\, z\big) + D_x f(t, x, a).$$
Hence, one can observe that (19) is a linear BSDE. Thus, we can use an exponential integrating factor and Girsanov's theorem [25, Sec. 4.1] to derive the explicit solution of the equation (19). Since D x σ is bounded by assumption, we can apply Girsanov's transformation and define a new probability measure Q as Hence, we have for all s ∈ [0, T ] that
Step 2. In order to show a bound on Z α , we need to differentiate (Y α , Z α ) with respect to the initial condition x. To make the notation clear in this proof, we write X x,α and (Y x,α , Z x,α ). Hence, we have for all t ∈ [0, T ] that where by ∇ we denote the derivative with respect to the initial condition x. Observe that Thus, the equation (25) becomes Using Z x,α t = ∇Y x,α t (∇X x,α t ) −1 σ(t, X x,α t , α t ), see [25, Lemma 5.2.3], and using the change of measure from Step 1 we obtain Hence, after taking the conditional expectation and using the integrating factor we get Therefore, due to the assumption, we have for some constant C. Observe that (U s ) t≤s≤T , where U s := ∇X x,α s (∇X x,α t ) −1 , solves the SDE Therefore, since D x b and D x σ are bounded, by standard estimates for SDEs we get that Hence, we conclude that |Z x,α s | is a.s. bounded for all s ∈ [0, T ].
Proof of Lemma 2.3. Let ϕ and θ be some generic admissible controls. We will write (X ϕ s ) s∈[0,T ] for the solution of (1) controlled by ϕ and (X θ s ) s∈[0,T ] for the solution of (1) controlled by θ. We denote the solutions of the corresponding adjoint equations by (Y ϕ , Z ϕ ) and (Y θ , Z θ ). Due to Taylor's theorem, we note that for some random variable r 1 ∈ [0, 1], we have Recall that Y θ T = D x g(X θ T ). Hence, using Itô's product rule, the forward SDE (1), and the adjoint equation (4) we get tr[(σ(s, X ϕ s , ϕ s ) − σ(s, X θ s , θ s )) Z θ s ] ds On the other hand, by definition of the Hamiltonian we have Summing up (33) and (34) we get Due to Taylor's theorem, there is a process (r 2 (s)) s∈[0,T ] with values in [0, 1] such that After substituting (36) into (35) we get Let us now get a standard SDE estimate for the difference of X ϕ and X θ . Using (a + b) 2 ≤ 2a 2 + 2b 2 , taking the expectation, and applying Hölder's inequality, Assumption 2.1, the Burkholder-Davis-Gundy inequality and Gronwall's inequality, we obtain Young's inequality allows us to get the estimate Hence, from (39), Assumption 2.2, and (38) we have that for some constant C > 0, which depends on K, T , and d.
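Proof of Theorem 2.4. Since $\alpha$ is the optimal control for the problem (2), Pontryagin's optimality principle holds, see e.g. [22]. Hence for any $a \in A$ and all $s \in [0,T]$ we have
$$H(s, X_s, Y_s, Z_s, \alpha_s) \le H(s, X_s, Y_s, Z_s, a). \tag{42}$$
By definition of the augmented Hamiltonian (6), for all $s \in [0,T]$ we have
$$\tilde H(s, X_s, Y_s, Z_s, \alpha_s, a) = H(s, X_s, Y_s, Z_s, a) + \tfrac{1}{2}\rho\Big(|b(s,X_s,a) - b(s,X_s,\alpha_s)|^2 + |\sigma(s,X_s,a) - \sigma(s,X_s,\alpha_s)|^2 + |D_x H(s,X_s,Y_s,Z_s,a) - D_x H(s,X_s,Y_s,Z_s,\alpha_s)|^2\Big). \tag{43}$$
In particular, the penalty terms are non-negative and vanish for $a = \alpha_s$. Therefore, due to (42) and (43) we have
$$\tilde H(s, X_s, Y_s, Z_s, \alpha_s, \alpha_s) = H(s, X_s, Y_s, Z_s, \alpha_s) \le H(s, X_s, Y_s, Z_s, a) \le \tilde H(s, X_s, Y_s, Z_s, \alpha_s, a).$$
This concludes the proof.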
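Proof of Theorem 2.5. Let us apply Lemma 2.3 for $\varphi = \alpha^n$ and $\theta = \alpha^{n-1}$. Hence, for some $C > 0$ we have the inequality (45). Due to the definition of $\alpha^n$ in (8) and due to (16), we obtain (47). Therefore, we can observe that $\mu_n \le 0$. Hence we can rewrite the inequality (45) as
$$J(x, \alpha^n) - J(x, \alpha^{n-1}) \le D\,\mu_n,$$
where $D := 1 - \frac{2C}{\rho} > 0$. Notice that for any integer $M > 1$ we have
$$\sum_{n=1}^{M} (-\mu_n) \le \frac{1}{D} \sum_{n=1}^{M} \big(J(x, \alpha^{n-1}) - J(x, \alpha^n)\big) = \frac{1}{D}\big(J(x, \alpha^0) - J(x, \alpha^M)\big) \le \frac{1}{D}\Big(J(x, \alpha^0) - \inf_{\alpha \in \mathcal{A}} J(x, \alpha)\Big).$$
Since $(-\mu_n) \ge 0$ and $\sum_{n=1}^{\infty} (-\mu_n) < +\infty$, we have that $\mu_n \to 0$ as $n \to \infty$. This concludes the proof.
Now, to prove Corollary 2.6, we need to introduce new notation. Consider τ ∈ By definition of α n notice that ∆ α H(t) ≤ 0 for all t ∈ [0, T ]. Let us show an auxiliary lemma.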
Now we are ready to prove Corollary 2.6.
Proof of Corollary 2.6. If 1 − qc k < 0 we have Hence c k+1 < c k . On the other hand, if 1 − qc k ≥ 0, we have c k ≤ 1 q . Therefore, we conclude that for all k we have