1 Introduction

The goal of this article is to propose and analyse a deep learning scheme for PDEs of integral type, to which we refer as PIDEs. The integral part of the considered equation is defined by a finite Lévy measure \(\lambda \) on \(\mathbb {R}^d\) (see Sect. 1.2).

A difficult problem in Applied Mathematics is to approximate solutions of Partial Differential Equations (PDEs) in high dimensions. In low dimensions such as 1, 2 or 3, classical methods such as finite differences or finite elements are commonly applied with satisfactory convergence orders (see e.g. Allaire [1, Chapters 2 and 6]). Difficulties appear in high dimensional problems such as portfolio management, where each dimension represents the size of some financial derivative in the portfolio. Further complications appear when the PDE is nonlocal, as happens in many applications. For finite difference methods, one needs to construct a mesh whose computational cost grows exponentially with the dimension \(d\in \mathbb {N}\) of the considered PDE. This problem is known in the literature as the curse of dimensionality, and the most common attempt to overcome it is via stochastic methods. Deep Learning (DL) methods have proven to be an efficient tool to handle this problem and to approximate solutions of high dimensional second order fully nonlinear PDEs. This is achieved by noting that the solution of the PDE, evaluated at a certain diffusion process, solves a Stochastic Differential Equation (SDE); then an Euler scheme together with DL is applied to solve the SDE, see [9, 31] for key developments.

Without being exhaustive, we present some of the current developments in this direction. First of all, Monte Carlo algorithms are an important approach to this dimensional problem. This can be done by means of the classical Feynman–Kac representation, which allows us to write the solution of a linear PDE as an expected value, and then approximate the high dimensional integrals with an average over simulations of random variables. Key developments in this area can be found in Han-Jentzen-E [29] and Beck-E-Jentzen [9]. The Multilevel Picard method (MLP) is another approach; it consists in interpreting the stochastic representation of the solution to a semilinear parabolic (or elliptic) PDE as a fixed point equation. Then, by using Picard iterations together with Monte Carlo methods for the computation of integrals, one is able to approximate the solution to the PDE, see [8, 32] for fundamental advances in this direction. On the other hand, the so-called Deep Galerkin method (DGM) is another DL approach used to solve quasilinear parabolic PDEs of the form \(\mathcal {L}(u)=0\) with initial and boundary conditions. The cost function in this framework is defined in an intuitive way: it consists of the differences between the approximate solution \(\hat{u}\), evaluated at the initial time and spatial boundary, and the true initial and boundary conditions, plus \(\mathcal {L}(\hat{u})\). These quantities are measured in an \(L^2\)-type norm, which in high dimensions is minimized using the Stochastic Gradient Descent (SGD) method. See [44] for the development of the DGM and [38] for an application.

In [31], the principal source of inspiration of this article, Huré, Pham, and Warin consider the framework previously introduced in [9] and present new approximation schemes, via Neural Networks, for the solution of a parabolic nonlinear PDE and its gradient. Via an intricate use of intermediate numerical approximations for each term in their scheme, they prove the numerical consistency and high accuracy of the method, at least in low dimensions.

In general, standard PDEs model situations where, in order to know the state of a system at a particular point, one only needs information about the state in an arbitrarily small neighborhood of that point. On the contrary, PIDEs can model more general phenomena where long distance interactions and effects are not negligible and therefore must be taken into account. An important example of PIDEs are those involving fractional derivatives, such as the Fractional Laplacian. This operator has been extensively studied, from the PDE point of view, during the past ten years, starting from the fundamental work by Caffarelli and Silvestre [15]. See [21, 45] and references therein for nice introductions to this operator, one of the most relevant examples of integro-differential operators. More generally, nonlocal equations are used in a wide range of scientific areas, see [10] for applications in advection dispersion equations, [25] for image processing, [23] for peridynamics, [46] for hydrodynamics, and [16, 17] for finance. For more theoretical results on nonlocal equations, see e.g. [6, 11, 18] and references therein. In [19], the authors give a complete introduction to nonlocal equations and then develop nonlocal versions of three numerical methods: finite differences, finite elements and Spectral-Galerkin.

In [35], the authors present a discrete-time approximation of a BSDEJ (Backward SDE with Jumps) whose solution converges weakly to a solution of the continuous time equation. They also use this method to approximate the solution to the corresponding PIDE. Very recently, we have learned about a rigorous and complete work by L. Gonon and C. Schwab [26, 27], where it is proved that deep ReLU neural networks (NNs) are able to approximate expectations of a certain type of functionals defined on a space of stochastic processes. In particular, viscosity solutions of a linear PIDE can be represented in such a way by means of a variation of the Feynman–Kac formula. Furthermore, by controlling the size of the NNs that approximate the parameters for a fixed tolerance, they mathematically prove that neural networks, as considered in their setting, are capable of breaking the curse of dimensionality in such approximation problems. They follow a similar procedure and generalize the results shown in [24, 28]. Our result is different in some sense: we work with a nonlinear equation and do not require NN approximation of the parameters, but we do not show that our scheme can overcome the curse of dimensionality. As mentioned before, we propose and provide error bounds for a deep learning scheme for nonlinear parabolic PIDEs based on the work [31]. We emphasize that their methods differ from ours and were developed independently.

We present here an extension and generalization of [31] to PIDEs, obtained by adding a nonlocal contribution to the PDE. Some important changes are needed in the algorithm, including the use of a third Neural Network to approximate the nonlocal part of the solution. Of particular utility will be the results of [12], which we use to prove convergence of the proposed numerical scheme.

The Euler scheme presented in this article is based on the one introduced by Zhang in [49]. In that paper, the author gives a discrete time approximation of a BSDE (backward SDE) without jump terms. That scheme involves the computation of conditional expectations and provides important bounds and results that were used in [31] to prove the convergence of a DL algorithm solving a second order fully nonlinear PDE. In our case, nonlocal integral models require additional treatment. The work by Bouchard and Elie [12], very important for the work presented here, generalizes the properties given in [49] to the nonlocal setting by considering Lévy processes. We will closely follow their approach to construct our numerical scheme.

1.1 Notation

For any \(m\in \mathbb {N}\), \(\mathbb {R}^m\) represents the finite dimensional Euclidean space with elements \(x=(x_1,...,x_m)\), endowed with the usual norm \(|x|^2=\sum _{i=1}^m |x_i|^2\). Note that for scalars \(a\in \mathbb {R}\) we also denote the norm by \(|a| = \sqrt{a^2}\). For \(x,y\in \mathbb {R}^m\), their scalar product is denoted by \(x\cdot y =\sum _{i=1}^m x_i y_i\). For a general measure space \((E,\Sigma ,\nu )\), \(p\ge 1\) and \(m\in \mathbb {N}\), \(L^p(E,\Sigma ,\nu ;\mathbb {R}^m)\) represents the standard Lebesgue space of p-integrable functions from E to \(\mathbb {R}^m\), endowed with the norm

$$\begin{aligned} \left\Vert f\right\Vert ^p_{L^p(E,\Sigma ,\nu ;\mathbb {R}^m)} = \int _{E} |f(x)|^p\nu (dx). \end{aligned}$$

We write \(L^p(E,\Sigma ,\nu )\) when \(m=1\). Given a general probability space \((\Omega ,\mathcal {F},\mathbb {P})\) and a random vector (or variable if \(m=1\)) \(X:\Omega \rightarrow \mathbb {R}^m\), for the sake of simplicity and to avoid an overload of parentheses we denote \(\mathbb {E}|X|^2 = \mathbb {E}(|X|^2)\). We also write

$$\begin{aligned} \int _E f(s)ds = \begin{pmatrix}\int _E f_1(s) ds\\ \vdots \\ \int _E f_m(s) ds\end{pmatrix}, \end{aligned}$$

whenever \(f:E\rightarrow \mathbb {R}^m\) with \(f=(f_1,...,f_m)\). Throughout the paper we use several times that, for \(x_1,...,x_k\in \mathbb {R}\), the following bound holds,

$$\begin{aligned} (x_1+\cdots +x_k)^2\le k(x_1^2+\cdots +x_k^2). \end{aligned}$$
(1.1)
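For completeness, (1.1) is a direct consequence of the Cauchy–Schwarz inequality applied to the vectors \((x_1,\ldots ,x_k)\) and \((1,\ldots ,1)\):

$$\begin{aligned} (x_1+\cdots +x_k)^2=\Big (\sum _{i=1}^k 1\cdot x_i\Big )^2\le \Big (\sum _{i=1}^k 1^2\Big )\Big (\sum _{i=1}^k x_i^2\Big )=k(x_1^2+\cdots +x_k^2). \end{aligned}$$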

1.2 Setting

Let \(d\ge 1\) and \(T>0\). Consider the following integro-differential PDE

$$\begin{aligned} \left\{ \begin{aligned} \mathcal {L}u(t,x) + f(t,x,u(t,x),\sigma (x) \nabla u(t,x),\mathcal {I}[u](t,x))&=0,{} & {} (t,x)\in [0,T]\times \mathbb {R}^d,\\ u(T,x)&= g(x),{} & {} x\in \mathbb {R}^d. \end{aligned} \right. \end{aligned}$$
(1.2)

Here, \(u=u(t,x)\) is the unknown of the problem. The operator \(\mathcal {L}\) above is of parabolic nonlocal type, and is defined, for \(u\in \mathcal {C}^{1,2}([0,T]\times \mathbb {R}^d)\), as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}u(t,x)&= \partial _t u(t,x)+ \nabla u(t,x)\cdot b(x) + \frac{1}{2}\text {Trace}(\sigma (x)\sigma (x)^T D^2 u(t,x))\\&\quad + \int _{\mathbb {R}^d} [u(t,x+\beta (x,y))-u(t,x)-\nabla u(t,x)\cdot \beta (x,y)]\lambda (dy), \end{aligned} \end{aligned}$$
(1.3)

where \(\lambda \) is a finite measure on \(\mathbb {R}^d\) (equipped with its Borel \(\sigma \)-algebra) which is also a Lévy measure, meaning that

$$\begin{aligned} \lambda (\left\{ 0\right\} ) =0\qquad \text {and} \qquad \int _{\mathbb {R}^d} (1\wedge |y|^2)\lambda (dy) < \infty . \end{aligned}$$

Also, \(f:[0,T]\times \mathbb {R}^d\times \mathbb {R}\times \mathbb {R}^d\times \mathbb {R}\rightarrow \mathbb {R}\) is a given nonlinearity. We assume the following standard Lipschitz conditions on the data in order to have a unique solution to (1.2) in the class \(C^{1,2}\): there exists a universal constant \(K>0\) such that

$$\begin{aligned} {\textbf {(C)}}\; {\left\{ \begin{array}{ll} \bullet \hbox { (Regularity) }g:\mathbb {R}^d\rightarrow \mathbb {R}, b:\mathbb {R}^d\rightarrow \mathbb {R}^d\hbox { and }\sigma :\mathbb {R}^d\rightarrow \mathbb {R}^{d\times d}\hbox { are }\\ K-\hbox {Lipschitz real, vector} \hbox {\quad and matrix valued functions, respectively.}\\ \bullet \hbox { (Boundedness) }\beta :\mathbb {R}^d\times \mathbb {R}^d\rightarrow \mathbb {R}^d\hbox { and }\sup _{y\in \mathbb {R}^d} |\beta (0,y)|\le K.\\ \bullet \hbox { (Uniformly Lipschitz) }\sup _{y\in \mathbb {R}^d}|\beta (x,y)-\beta (x',y)|\le K|x-x'|,\ \forall \ x,x'\in \mathbb {R}^d.\\ \bullet \hbox { (H}\ddot{\textrm{o}}\hbox {lder continuity) For each }t,t'\in [0,T], y,y',w,w'\in \mathbb R\hbox { and }x,x',z,z' \in \mathbb {R}^d,\\ \hbox { one has} \quad |f(t,x,y,z,w)-f(t',x',y',z',w')|\le K\big (|t-t'|^{1/2}\\ +|x-x'|+|y-y'|+|z-z'|+|w-w'|\big ).\\ \bullet \hbox { (Invertibility) For each }y\in \mathbb {R}^d,\hbox { the map }x\rightarrow \beta (x,y)\hbox { admits a Jacobian matrix }\\ \nabla \beta (x, y) \quad \hbox {such that the function }a(x,\xi ; y)=\xi ^T (\nabla \beta (x, y) + I)\xi \hbox { satisfies, for all }\\ x,\xi \in \mathbb {R}^d, \quad a(x,\xi ;y)\ge |\xi |^2 K^{-1}\hbox { or }a(x,\xi ;y)\le -|\xi |^2 K^{-1}. \end{array}\right. } \end{aligned}$$
(1.4)

The last condition is technical and is needed to ensure the validity of certain approximation results (see Theorem 4.1). On the other hand, the nonlocal, integro-differential operator \(\mathcal {I}\) is defined as

$$\begin{aligned} \mathcal {I}[u](t,x) = \int _{\mathbb {R}^d} \big (u(t,x+\beta (x,y))-u(t,x) \big ) \lambda (dy). \end{aligned}$$
(1.5)
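To fix ideas, here is a purely illustrative one-dimensional example (not claimed to satisfy all the conditions in (1.4)): taking \(\beta (x,y)=y\) and \(\lambda =\delta _{1}+\delta _{-1}\), which is a finite Lévy measure, the operator reduces to a centered second difference,

$$\begin{aligned} \mathcal {I}[u](t,x) = u(t,x+1)+u(t,x-1)-2u(t,x), \end{aligned}$$

so that the nonlocal term acts as a discrete diffusion.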

The conditions stated in (1.4) are standard in the literature (see [5, 12, 35]) and are needed to ensure the existence and uniqueness (with satisfactory bounds mentioned below) of solutions to a FBSDEJ (forward BSDEJ) related to (1.2).

Remark 1.1

In the literature (see [5, 20]) a Lipschitz condition imposed on \(\beta \) is often written as

$$\begin{aligned} |\beta (x,y) - \beta (x',y)| \le K_1|x-x'|(1 \wedge |y|), \quad \hbox {for some }K_1>0\hbox { and for all }x,x',y\in \mathbb {R}^d. \end{aligned}$$

The reason to impose this requirement is to ensure that

$$\begin{aligned} \int _{\mathbb {R}^d} |\beta (x,y)-\beta (x',y)|^2\lambda (dy) \le K_2|x-x'|^2, \end{aligned}$$

for some constant \(K_2>0\); this is another way of saying that \(\beta \) is Lipschitz with respect to its first variable in an integral sense. Our uniform Lipschitz requirement on \(\beta \), together with the fact that \(\lambda \) is a finite measure, is enough to satisfy this restriction.

1.3 Forward backward formulation of (1.2)

In the previous context, consider the following stochastic setting for (1.2). Let \((\Omega ,\mathcal {F},\mathbb {F},\mathbb {P})\), \(\mathbb {F}=(\mathcal {F}_t)_{0\le t\le T}\), be a stochastic basis satisfying the usual conditions: \(\mathbb {F}\) is right continuous and \(\mathcal F_0\) is complete (it contains all zero measure sets). The filtration \(\mathbb {F}\) is generated by a d-dimensional Brownian motion (BM) \(W=(W_t)_{0\le t\le T}\) and a Poisson random measure \(\mu \) on \(\mathbb {R}_+\times \mathbb {R}^d\) with intensity measure \(\lambda \); these two random objects are assumed to be mutually independent.

Recall that \(\lambda \) is a finite Lévy measure on \(\mathbb {R}^d\). The compensated measure of \(\mu \) is denoted as

$$\begin{aligned} \overline{\mu }(dt,dy)=\mu (dt,dy) - \lambda (dy)dt, \end{aligned}$$
(1.6)

and is such that for every measurable set A satisfying \(\lambda (A)<\infty \), \((\overline{\mu }(t,A):=\overline{\mu }([0,t],A))_t\) is a martingale. Given a time \(t_i\in [0,T]\), the operator \(\mathbb {E}_i\) will denote the conditional expectation with respect to \(\mathcal {F}_{t_i}\):

$$\begin{aligned} \mathbb {E}_i\left( X\right) := \mathbb {E}\left( X\big | \mathcal {F}_{t_i}\right) . \end{aligned}$$
(1.7)

Recall the equation (1.2)–(1.3)–(1.5). As usual, \(X_{r^-}\) denotes the left limit of \(X_s\) as \(s\uparrow r\). Let us consider the following forward and backward stochastic differential equations with jumps, in terms of the unknown variables \((X,Y,Z,U)\):

$$\begin{aligned} X_t&=x+\int _0^t b(X_s)ds+\int _0^t\sigma (X_{s}) dW_s+\int _0^t\int _{\mathbb {R}^d}\beta (X_{s^-},y)\overline{\mu }(ds,dy), \end{aligned}$$
(1.8)
$$\begin{aligned} Y_t&=g(X_T)+\int _t^T f(\Theta _s)ds-\int _t^T Z_s\cdot dW_s -\int _t^T\int _{\mathbb {R}^d}U_s(y)\overline{\mu }(ds,dy) , \end{aligned}$$
(1.9)
$$\begin{aligned} \Gamma _t&=\int _{\mathbb {R}^d}U_t (y)\lambda (dy), \end{aligned}$$
(1.10)

where \(\Theta _s=(s,X_s,Y_s,Z_s,\Gamma _s)\) for \(0\le s\le T\) and \(x\in \mathbb {R}^d\). Note that \((Z_t)_{0\le t\le T}\) is a vector valued process.

By applying Itô’s lemma (see [20, Thm 2.3.4]) to \(u(t,X_t)\), where \(X_t\) solves (1.8) and u is a \(\mathcal {C}^{1,2}([0,T]\times \mathbb {R}^d)\) solution of the PIDE (1.2) playing the role of \(Y_t\) in (1.9), we obtain the compact stochastic formulation of (1.2):

$$\begin{aligned} u(t,X_t)=&~{} u(0,X_0)-\int _0^t f(s,X_{s^-},u(s,X_{s^-}),\sigma (X_{s^-})\nabla u(s,X_{s^-}), \mathcal {I}[u](s,X_{s^-})) ds\nonumber \\&+\int _0^t [\sigma (X_{s^-}) \nabla u(s,X_{s^-})]\cdot dW_s\nonumber \\&+\int _0^t\int _{\mathbb {R}^d}[u(s,X_{s^-}+\beta (X_{s^-},y))-u(s,X_{s^-})]\overline{\mu }(ds,dy), \end{aligned}$$
(1.11)

valid for \(t\in [0,T]\). This tells us that whatever we use as approximations of

$$\begin{aligned} u(t,X_t),\qquad \sigma (X_t)\nabla u(t,X_t)\qquad \text {and}\qquad u(t,X_{t}+\beta (X_{t},\cdot ))-u(t,X_{t}), \end{aligned}$$

must satisfy (1.11) in some proper metric. An important point here is that the conditions (1.4) ensure the existence of a viscosity solution \(u\in \mathcal {C}([0,T]\times \mathbb {R}^d)\) with at most polynomial growth such that \(u(t,X_t) = Y_t\) (see [5, Thm 3.4]), and this is the reason why our scheme seeks to approximate the solution to the FBSDEJ (1.8)–(1.9). The Neural Networks used in our algorithm are introduced in Sect. 2.

1.4 Organization of this paper

The rest of this work is organized as follows. In Sect. 2 we give a concrete definition of the NNs that we will be using together with the approximation results needed in this paper. In Sect. 3 we introduce the discretization of the stochastic system that allows us to train our NNs. In Sect. 4 we state all the preliminary results and definitions needed for the proof of our main result. Finally, in Sect. 5 we state and prove the main result of this paper.

2 Neural networks and approximation theorems

Neural Networks (NNs) are not recent. In [41, 43], published in 1943 and 1958 respectively, the authors introduced the concept of a NN, although still far from its modern definition. Through the years, the use of NNs as a way to approximate functions gained importance due to their good performance in applications. A rigorous justification of this property was proven in [30, 36] using the Stone-Weierstrass theorem: these papers state that, under suitable conditions on the approximated functions, NNs can approximate them arbitrarily well. See [2, 48] for a review on the origins and a state of the art survey of DL, respectively.

The huge amount of available data, due to social media, astronomical observatories and even Wikipedia, together with the progress of computational power, has allowed us to train more and more efficient Machine Learning (ML) algorithms and to consider data that years ago were not possible to analyze. Deep Learning is a class of supervised ML algorithms and it concerns the problem of approximating an unknown nonlinear function \(f:X\rightarrow Y\), where X represents the set of possible inputs and Y the outputs; for example, Y could be a finite set of classes and then f performs a classification task. In order to run a DL algorithm we need a set of observations \(D = \left\{ (x,f(x)): x\in A\right\} \) of the phenomenon under consideration; in the literature this set is also known as the training set. Here, A is a finite subset of X. The next step is to define a family of candidates \(\left\{ f_\theta : \theta \in \Xi \right\} \), with \(\Xi \subset \mathbb {R}^{\kappa }\) for some \(\kappa \in \mathbb {N}\), in which we search for a good approximation of f. Finally, how good the approximation is will be measured by a cost function \(L(\cdot ;D):\Xi \rightarrow \mathbb {R}\); intuitively, we take \(f_{\theta ^*}\) as the chosen approximation, where \(\theta ^*\) minimizes \(L(\cdot ;D)\) over \(\Xi \).
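As a toy illustration of this abstract setup (unrelated to the scheme developed later; the target f, the quadratic family and the step size below are arbitrary choices), one could fit a one-dimensional function by minimizing an empirical \(L^2\)-type cost with stochastic gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown target f and a finite training set D = {(x, f(x)) : x in A}.
f = np.sin
x_train = rng.uniform(-np.pi, np.pi, size=200)
y_train = f(x_train)

# Family of candidates f_theta(x) = theta_0 + theta_1*x + theta_2*x^2, i.e. Xi = R^3.
def f_theta(theta, x):
    return theta[0] + theta[1] * x + theta[2] * x ** 2

# Empirical L^2-type cost L(theta; D).
def cost(theta):
    return np.mean((f_theta(theta, x_train) - y_train) ** 2)

# Stochastic gradient descent over mini-batches of D.
theta, lr = np.zeros(3), 1e-2
for step in range(2000):
    idx = rng.choice(len(x_train), size=32, replace=False)
    xb, yb = x_train[idx], y_train[idx]
    res = f_theta(theta, xb) - yb
    grad = 2 * np.array([res.mean(), (res * xb).mean(), (res * xb ** 2).mean()])
    theta -= lr * grad

print("theta* =", theta, " cost =", cost(theta))
```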

The complexity and generality of the main problem that DL tries to solve makes it useful in a large variety of scientific disciplines. In astronomy, the large amount of data collected by observatories makes it a suitable field to implement ML; see [7] for a review of ML in astronomy and [39] for a concrete use of Convolutional Neural Networks (CNNs) to classify light curves. See [13] for a review of ML in experimental high energy physics and [47] for an application of NNs to quantum state tomography. In [40], the authors use DL to find patterns in fashion and style trends across space and time using data from Instagram. In [3] the authors train a CNN to classify brain tumors into Glioma, Meningioma, and Pituitary Tumor, reaching high levels of accuracy. See [37] for a survey on the use of DL in medical science, where CNNs are the most common type of DL structure.

To fix ideas, in this paper we focus on a simpler setting, where the input and output variables belong to multidimensional real spaces \(\mathbb {R}^d\) and \(\mathbb {R}^m\) respectively, with \(d,m\in \mathbb {N}\). In order to define the family of candidates we need \(L+1\in \mathbb {N}\) layers with \(l_i\in \mathbb {N}\) neurons each, for \(i\in \left\{ 0,...,L\right\} \), where \(l_0=d\) and \(l_L=m\); weight matrices \(\left\{ W_i\in \mathbb {R}^{l_i\times l_{i-1}}\right\} _{i=1}^{L}\); bias vectors \(\left\{ b_i\in \mathbb {R}^{l_i}\right\} _{i=1}^{L}\); and an activation function \(\phi :\mathbb {R}\rightarrow \mathbb {R}\), which is what breaks the linearity (see Definition 2.1). The resulting function maps the input space \(\mathbb {R}^{l_0}\) to the output space \(\mathbb {R}^{l_L}\).

Remark 2.1

The first and last layers are called the input and output layer, respectively; the other \(L-1\) layers are often called hidden layers.

Definition 2.1

Given \(L\in \mathbb {N}\) and \(l_0,(l_i,W_i,b_i)_{i=1}^{L}\) as above, consider the parameter \(\theta =\left( W_i,b_i\right) _{i=1}^{L}\) which can be seen as an element of \(\mathbb {R}^\kappa \) with \(\kappa =\sum _{i=1}^{L}(l_i l_{i-1} + l_i)\) and a function \(\phi :\mathbb {R}\rightarrow \mathbb {R}\). We define the neural network \(f(\cdot ;\theta ,\phi ):\mathbb {R}^{l_0}\rightarrow \mathbb {R}^{l_L}\) as the following composition,

$$\begin{aligned} f(x;\theta ,\phi )=\left( A_L\circ \phi \circ A_{L-1}\circ \cdots \circ A_2 \circ \phi \circ A_1\right) (x), \end{aligned}$$

where \(A_i:\mathbb {R}^{l_{i-1}}\rightarrow \mathbb {R}^{l_i}\) is an affine linear function such that \(A_i(x)=W_ix+b_i\) for \(i\in \left\{ 1,...,L\right\} \) and \(\phi \) is applied component-wise. We denote \(f(\cdot ;\theta ,\phi ) = f(\cdot ;\theta )\) when the activation function is fixed and no confusion can arise.
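For illustration, a minimal numpy transcription of Definition 2.1 (the layer sizes and activation below are arbitrary choices, not the ones used later in the paper) could read:

```python
import numpy as np

def neural_network(x, weights, biases, phi):
    """f(x; theta, phi) = (A_L o phi o A_{L-1} o ... o phi o A_1)(x),
    where A_i(x) = W_i x + b_i and phi is applied component-wise.
    Note that no activation is applied after the last affine map A_L."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = phi(W @ a + b)
    return weights[-1] @ a + biases[-1]

# Example: l_0 = 3 (input), one hidden layer with l_1 = 5 neurons, l_2 = 1 (output).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(1, 5))]
biases = [rng.normal(size=5), rng.normal(size=1)]
phi = np.tanh  # bounded and non-constant (cf. Theorem 2.1 below)

x = np.array([0.1, -0.2, 0.3])
print(neural_network(x, weights, biases, phi))
```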

In the following, the activation function will be fixed, as well as the input and output dimensions, since those parameters are given by the mapping that we are trying to approximate; the number of layers L will also be fixed. The range of neural networks that we can reach by varying the remaining parameters, namely the sizes and values of the weight matrices and bias vectors, will be called the set of neural networks and denoted by \(\mathcal {N}_{\phi ,L,l_0,l_L}\). The following definition materializes the previous explanation.

Definition 2.2

The set of Neural Networks associated to \(L\in \mathbb {N}\), \(l_0=d\), \(l_L=m\) and the function \(\phi :\mathbb {R}\rightarrow \mathbb {R}\) is defined by

$$\begin{aligned} \mathcal {N}_{\phi ,L,d,m} = \bigcup _{\kappa \in \mathbb {N}}\mathcal {N}_{\phi ,L,d,m,\kappa } \end{aligned}$$

where,

$$\begin{aligned} \mathcal {N}_{\phi ,L,d,m,\kappa } = \Big \{ ~ f(\cdot ;\theta ,\phi ) ~ \Big | ~&\theta =\left( W_i,b_i\right) _{i=1}^{L}, ~ l_0=d, ~ l_L=m, ~ W_i\in \mathbb {R}^{l_i\times l_{i-1}}, ~ b_i\in \mathbb {R}^{l_i},~ l_i\in \mathbb {N},\\&i\in \left\{ 1,...,L\right\} ,~\kappa = \sum _{i=1}^L (l_i l_{i-1} + l_i) \Big \} \end{aligned}$$

In this paper m will typically be d, the space dimension of the PIDE (1.2), or 1. Using functional analysis arguments, K. Hornik proved in [30] that NNs are able to approximate functions in \(L^2\) spaces within any given tolerance. The space of NNs used in his work, to which we refer as \(\mathcal {H}\), is slightly different from ours; indeed,

$$\begin{aligned} \mathcal {H}=\mathcal {N}_{\phi ,2,d,1}\cap \left\{ f(\cdot ;\theta ,\phi )\in \mathcal {N}_{\phi ,2,d,1} ~ | ~ \theta =\left( W_1,b_1,W_2,0\right) \in \mathbb {R}^{nd+n+n+1},\ n\in \mathbb {N}\right\} . \end{aligned}$$

Note that in this space the free parameter \(\kappa \) depends on the size \(n\in \mathbb {N}\) of the first (and only) hidden layer in the following way, \(\kappa =\sum _{i=1}^2 (l_i l_{i-1} + l_i) = nd + n + n + 1\). It is straightforward that a function \(f\in \mathcal {H}\) takes the following form

$$\begin{aligned} f(x;\theta ,\phi ) = W_2\cdot \phi (W_1 x + b_1), \end{aligned}$$

for \(\left( W_1,b_1,W_2,0\right) \in \mathbb {R}^{nd+n+n+1}\) and \(n\in \mathbb {N}\). Hornik proves the following important result.

Theorem 2.1

([30], Theorem 1) If \(\phi :\mathbb {R}\rightarrow \mathbb {R}\) is bounded and non-constant, then \(\mathcal {H}\) is dense in \(L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R})\) for every finite measure \(\mu \) in \(\mathbb {R}^d\).

Let \(m\in \mathbb {N}\). For a measure \(\mu \) on \(\mathbb {R}^d\), consider the space \(L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R}^m)\) of square integrable vector valued functions endowed with the norm

$$\begin{aligned} \left\Vert h\right\Vert _{L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R}^m)}^2 = \int _{\mathbb {R}^d} \sum _{i=1}^m |h_i(x)|^2\mu (dx) \end{aligned}$$

for \(h = (h_1,...,h_m)\) with \(h_i\) a scalar function for \(i\in \left\{ 1,...,m\right\} \). We also need to approximate the derivative \(\nabla u\) of the solution u to the PIDE (1.2); the following lemma proves density of NNs in the space of square integrable vector valued functions.

Lemma 2.1

Let \(m\in \mathbb {N}\) with \(m\ge 1\). If the activation function \(\phi \) is bounded and non-constant, then \(\mathcal {N}_{\phi ,2,d,m}\) is dense in \(L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R}^m)\) for every finite measure \(\mu \) on \(\mathbb {R}^d\).

Proof

Given \(\varepsilon >0\) and a function \(h=(h_1,...,h_m)\in L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R}^m)\) we need to find \(f(\cdot ;\theta ,\phi )=(f_1,...,f_m)\in \mathcal {N}_{\phi ,2,d,m}\) such that

$$\begin{aligned} \int _{\mathbb {R}^d} |h(x) - f(x;\theta ,\phi )|^2\mu (dx) < \varepsilon . \end{aligned}$$

First, observe that \(\mathcal {H}\subset \mathcal {N}_{\phi ,2,d,1}\) which implies, by using Theorem 2.1, that \(\mathcal {N}_{\phi ,2,d,1}\) is also dense in \(L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R})\) and therefore for every \(i\in \left\{ 1,...,m\right\} \) we can find \(f_i(\cdot ;\theta ^i,\phi )\) with \(\theta ^i=\left( W_1^{i},b_1^i,W_2^i,b_2^i\right) \) and \(\kappa ^i=n^i d + n^i + n^i + 1\), depending on \(\varepsilon \), such that

$$\begin{aligned} \int _{\mathbb {R}^d} |h_i(x) - f_i(x;\theta ^i,\phi )|^2\mu (dx) < \frac{\varepsilon }{m}. \end{aligned}$$

Consider \(f\in \mathcal {N}_{\phi ,2,d,m}\) defined by \(\widehat{\theta }=\left( \widehat{W}_1, \widehat{b}_1, \widehat{W}_2, \widehat{b}_2\right) \) with

$$\begin{aligned}&\widehat{W}_1 = \begin{pmatrix} W_1^1\\ \vdots \\ W_1^m \end{pmatrix}\in \mathbb {R}^{\left( \sum _{i=1}^m n^i\right) \times d},\ \ \widehat{b}_1 = \begin{pmatrix} b_1^1\\ \vdots \\ b_1^m \end{pmatrix}\in \mathbb {R}^{\sum _{i=1}^m n^i}\\&\widehat{W}_2 = \begin{pmatrix} W_2^{1,T} &{} 0 &{} 0\\ 0 &{} \ddots &{} 0 \\ 0 &{} 0 &{} W_2^{m,T} \end{pmatrix}\in \mathbb {R}^{m\times \sum _{i=1}^m n^i},\ \ \widehat{b}_2 = \begin{pmatrix} b_2^1\\ \vdots \\ b_2^m \end{pmatrix}\in \mathbb {R}^m, \end{aligned}$$

and which satisfies that for \(x\in \mathbb {R}^d\)

$$\begin{aligned} f(x;\widehat{\theta },\phi )&= \widehat{W}_2 \phi (\widehat{W}_1 x + \widehat{b}_1) + \widehat{b}_2 = \begin{pmatrix} W^{1,T}_2\phi (W_1^1 x + b_1^1) + b_2^1\\ \vdots \\ W^{m,T}_2\phi (W_1^m x + b_1^m) + b_2^m \end{pmatrix} = \begin{pmatrix} f_1(x;\theta ^1,\phi )\\ \vdots \\ f_m(x;\theta ^m,\phi ) \end{pmatrix}. \end{aligned}$$

Therefore,

$$\begin{aligned} \int _{\mathbb {R}^d} |h(x) - f(x;\widehat{\theta },\phi )|^2\mu (dx) = \int _{\mathbb {R}^d} \sum _{i=1}^m |h_i(x) - f_i(x;\theta ^i,\phi )|^2\mu (dx) < \varepsilon . \end{aligned}$$

This ends the proof.\(\square \)
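The construction used in this proof (stacking the first-layer parameters and placing the second-layer weights in a block-diagonal matrix) can be checked numerically; a small numpy sketch, with arbitrary sizes, might look as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
phi = np.tanh
d, m = 3, 2                     # input dimension and number of scalar networks
n = [4, 6]                      # hidden sizes n^1, ..., n^m

# m scalar networks f_i(x) = W2_i . phi(W1_i x + b1_i) + b2_i
W1 = [rng.normal(size=(n[i], d)) for i in range(m)]
b1 = [rng.normal(size=n[i]) for i in range(m)]
W2 = [rng.normal(size=n[i]) for i in range(m)]
b2 = [rng.normal() for i in range(m)]

def f_scalar(i, x):
    return W2[i] @ phi(W1[i] @ x + b1[i]) + b2[i]

# Stacked network: concatenate (W1, b1) vertically, put the W2_i in a block-diagonal matrix.
W1_hat = np.vstack(W1)
b1_hat = np.concatenate(b1)
W2_hat = np.zeros((m, sum(n)))
offset = 0
for i in range(m):
    W2_hat[i, offset:offset + n[i]] = W2[i]
    offset += n[i]
b2_hat = np.array(b2)

def f_vector(x):
    return W2_hat @ phi(W1_hat @ x + b1_hat) + b2_hat

x = rng.normal(size=d)
assert np.allclose(f_vector(x), [f_scalar(i, x) for i in range(m)])
```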

Lemma 2.1 allows us to state that if we take some function \(h:\mathbb {R}^d\rightarrow \mathbb {R}^{m}\) in \(L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R}^m)\), then the quantity

$$\begin{aligned} \underset{\theta \in \mathbb {R}^{\kappa }}{\text {inf}} \int _{\mathbb {R}^d} |f(x;\theta ,\phi ) - h(x)|^2 \mu (dx) \end{aligned}$$
(2.1)

can be made arbitrarily small by taking \(\kappa \) sufficiently large, whenever \(\mu \) is a finite measure on \(\mathbb {R}^d\) and the activation function that defines the NN is bounded and non-constant. The following lemma states the \(L^2(\Omega ,\mathcal {F},\mathbb {P})\) integrability of a random variable constructed from a NN; it assumes an activation function with linear growth, which is more general than the ones presented here.

Lemma 2.2

Let \(m\in \mathbb {N}\) with \(m\ge 1\), \(X\in L^2 (\Omega ,\mathcal {F},\mathbb {P};\mathbb {R}^d)\) and \(f(\cdot ;\theta ,\phi )\in \mathcal {N}_{\phi ,2,d,m}\). Assume that \(|\phi (x)|\le C(1+|x|)\) for \(x\in \mathbb {R}\) and some positive constant C, then \(f(X;\theta ,\phi )\in L^2 (\Omega ,\mathcal {F},\mathbb {P};\mathbb {R}^m)\).

Proof

Let f be represented by \(\theta =\left( W_1,b_1,W_2,b_2\right) \in \mathbb {R}^{nd+n+mn+m}\) then,

$$\begin{aligned} \mathbb {E}|f(X;\theta ,\phi )|^2 = \mathbb {E}|W_2\phi (W_1X+b_1) + b_2|^2 = \sum _{i=1}^m \mathbb {E}|W_{2,i}\cdot \phi (W_1X + b_1) + b_{2,i}|^2. \end{aligned}$$

Therefore, without loss of generality we can assume \(f\in \mathcal {N}_{\phi ,2,d,1}\) and take \(\theta = \left( W_1,b_1,W_2,0\right) \in \mathbb {R}^{nd+n+n+1}\). By using the growth condition on \(\phi \) and the fact that \((a+b+c)^2\le 3(a^2 + b^2 + c^2)\) for \(a,b,c\in \mathbb {R}\), we get

$$\begin{aligned} \mathbb {E}|f(X;\theta ,\phi )|^2&= \mathbb {E}|W_2\cdot \phi (W_1 X + b_1)|^2\\&= \mathbb {E} \Big |\sum _{i=1}^n W_{2,i} \phi \Big (\sum _{j=1}^d W_{1,i,j} X_j+ b_{1,i}\Big )\Big |^2\\&\le \mathbb {E} \left( \sum _{i=1}^n \Big |W_{2,i}\Big | \Big |\phi \Big (\sum _{j=1}^d W_{1,i,j} X_j+ b_{1,i}\Big )\Big |\right) ^2\\&\le \mathbb {E}\left( \sum _{i=1}^n |W_{2,i}|C\big (1 + \big |\sum _{j=1}^d W_{1,i,j} X_j + b_{1,i}\big |\big )\right) ^2\\&\le 3C^2\Big [\underbrace{\Big (\sum _{i=1}^n |W_{2,i}|\Big )^2 + \Big (\sum _{i=1}^n |W_{2,i}| |b_{1,i}|\Big )^2}_{< \infty } \\ {}&\quad + \mathbb {E}\Big ( \sum _{i=1}^n \sum _{j=1}^d |W_{2,i}| |W_{1,i,j}| |X_j|\Big )^2 \Big ]. \end{aligned}$$

Note that the first two terms in the last expression are deterministic and finite. Then, by using the Cauchy–Schwarz inequality twice on the third term, we get

$$\begin{aligned} \mathbb {E}\Big ( \sum _{i=1}^n \sum _{j=1}^d |W_{2,i}| |W_{1,i,j}| |X_j|\Big )^2\le \sum _{i=1}^n |W_{2,i}|^2\sum _{i=1}^n \sum _{j=1}^d |W_{1,i,j}|^2 \mathbb {E}|X|^2 < \infty . \end{aligned}$$

This finishes the proof.\(\square \)

3 Discretization of the dynamics and the deep learning algorithm

Fix a constant step partition of the interval [0, T], defined as \(\pi =\left\{ \frac{iT}{N}\right\} _{i\in \left\{ 0,...,N\right\} }\), \(t_i= \frac{iT}{N}\), and set \(\Delta W_i=W_{t_{i+1}}-W_{t_i}\). Also, define \(h:=\frac{T}{N}\) and (with a slight abuse of notation), \(\Delta t_i=(t_i,t_{i+1}]\). Recall the compensated measure \(\overline{\mu }\) from (1.6). Let

$$\begin{aligned} M_t:=\overline{\mu }((0,t],\mathbb {R}^d) \quad \hbox {and} \quad \Delta M_i:=\overline{\mu }((t_i,t_{i+1}],\mathbb {R}^d)=\int _{t_i}^{t_{i+1}}\!\!\int _{\mathbb {R}^d}\overline{\mu }(ds,dy). \end{aligned}$$
(3.1)

It is well-known that an Euler scheme for the forward equation (1.8) takes the form

$$\begin{aligned} X^{\pi }_0&= x, \end{aligned}$$
(3.2)
$$\begin{aligned} X^{\pi }_{t_{i+1}}&=X^{\pi }_{t_i}+ \, b(X^{\pi }_{t_i})h+\sigma (X^{\pi }_{t_i})\Delta W_i+\int _{\mathbb {R}^d}\beta (X^{\pi }_{t_i},y)\overline{\mu }((t_i,t_{i+1}],dy). \end{aligned}$$
(3.3)

Note that this scheme neglects the left limits that appear in the original equation; nevertheless, it satisfies the following error bound (see [20, Thm. 5.1.1], [12] or [26]),

$$\begin{aligned} \max _{i=1,...,N}\mathbb {E}\left( \sup _{t\in [t_i,t_{i+1}]} |X_t-X^{\pi }_{t_i}|^2\right) = O(h). \end{aligned}$$
(3.4)
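To make the scheme (3.2)–(3.3) concrete, the following numpy sketch simulates one trajectory of \(X^{\pi }\); all model choices (the coefficients b, \(\sigma \), \(\beta \), the total mass \(\lambda (\mathbb {R}^d)\) and the sampler for \(\lambda \)) are illustrative placeholders, and the compensator \(\int \beta (x,y)\lambda (dy)\) is approximated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- illustrative model choices (placeholders, not the paper's) ----------
d, T, N = 2, 1.0, 50
h = T / N
Lam = 0.7                                        # lambda(R^d): total mass of the finite Levy measure
sample_nu = lambda k: rng.normal(size=(k, d))    # i.i.d. marks ~ nu = lambda / Lam (Gaussian here)
b = lambda x: -0.5 * x                           # drift
sigma = lambda x: 0.3 * np.eye(d)                # diffusion matrix
beta = lambda x, y: 0.1 * np.tanh(y)             # jump coefficient (x-independent for simplicity)

def compensator(x, n_mc=1000):
    """Monte Carlo estimate of int beta(x, y) lambda(dy) = Lam * E[beta(x, Y)], Y ~ nu."""
    ys = sample_nu(n_mc)
    return Lam * np.mean([beta(x, y) for y in ys], axis=0)

# --- Euler scheme (3.2)-(3.3) ---------------------------------------------
X = np.ones(d)                                   # X_0^pi = x
for i in range(N):
    dW = np.sqrt(h) * rng.normal(size=d)
    n_jumps = rng.poisson(Lam * h)               # jumps of mu on (t_i, t_{i+1}]
    jump_sum = sum(beta(X, y) for y in sample_nu(n_jumps)) if n_jumps > 0 else np.zeros(d)
    # compensated jump integral = sum over jump marks - h * int beta(X, y) lambda(dy)
    X = X + b(X) * h + sigma(X) @ dW + jump_sum - h * compensator(X)

print("X_T^pi =", X)
```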

Under suitable conditions, mostly Lipschitz and linear growth assumptions, it can be proved that the constant behind the O(h) in (3.4) does not depend exponentially on d, see Lemma 4.3 in [26]. Adapting the argument of [31] to the nonlocal case, and in view of (1.11), we propose the following modified Euler scheme: for \(i=0,1,\ldots ,N-1\),

$$\begin{aligned} u(t_{i+1},X^\pi _{t_{i+1}})\approx&~{} F_i\Big ( t_i,X^\pi _{t_i},u(t_i,X^\pi _{t_i}),\sigma (X^\pi _{t_i}) \nabla u(t_i,X^\pi _{t_i}) ,u(t_i,X^\pi _{t_i}\\&\quad +\beta (X^\pi _{t_i},\cdot ))-u(t_i,X^\pi _{t_i}),h,\Delta W_i \Big ), \end{aligned}$$

where \(F_i:\Omega \times [0,T]\times \mathbb {R}^d\times \mathbb {R}\times \mathbb {R}^d\times L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\lambda )\times \mathbb {R}_+\times \mathbb {R}^d\rightarrow \mathbb {R}\) is defined as

$$\begin{aligned}&F_i(\omega , t,x,y,z,\psi ,h,w):= y - hf \left( t,x,y,z,\int _{\mathbb {R}^d}\psi (y)\lambda (dy) \right) \\&\quad + w\cdot z + \int _{\mathbb {R}^d} \psi (y)\bar{\mu }\left( (t_i,t_{i+1}],dy\right) . \end{aligned}$$

Note that \(\omega \) is passed to \(F_i\) through its dependence on the compensated measure \(\bar{\mu }\). The function \(F_i\) is, indeed, a random variable.

Remark 3.1

Note that the nonlocal term in (1.2) forces us to define \(F_i\) in such a way that its fifth argument must be a function \(\psi \) in \(L^2 (\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\lambda )\). In view of the integrals involved in \(F_i\), it may appear that we are again facing the same high dimensional problem; however, these integrals can instead be treated with Monte Carlo approximations, see below.
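Concretely, since \(\lambda \) is finite one can write \(\lambda =\Lambda \,\nu \) with \(\Lambda =\lambda (\mathbb {R}^d)\) and \(\nu \) a probability measure, so that \(\int _{\mathbb {R}^d}\psi (y)\lambda (dy)=\Lambda \,\mathbb {E}_{Y\sim \nu }[\psi (Y)]\) can be estimated from i.i.d. samples of \(\nu \); a minimal sketch (the value of \(\Lambda \), the sampler and \(\psi \) below are placeholders) is:

```python
import numpy as np

rng = np.random.default_rng(0)

Lam = 0.7                                        # total mass lambda(R^d) of the finite Levy measure
sample_nu = lambda k: rng.normal(size=(k, 2))    # i.i.d. samples from nu = lambda / Lam (illustrative)

def lambda_integral(psi, n_mc=10_000):
    """Monte Carlo estimate of int psi(y) lambda(dy) = Lam * E[psi(Y)], Y ~ nu."""
    ys = sample_nu(n_mc)
    return Lam * np.mean([psi(y) for y in ys])

# Example: psi(y) = |y|^2, whose lambda-integral equals Lam * E|Y|^2 = 0.7 * 2 = 1.4 here.
print(lambda_integral(lambda y: np.dot(y, y)))
```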

Remark 3.2

In the nonlocal setting, the function \(F_i\) also depends on the time interval \((t_i,t_{i+1}]\) in terms of the integrated measure \(\bar{\mu }\left( (t_i,t_{i+1}],dy\right) \). This is an important change in the Euler scheme, since we do not approximate the nonlocal term at time \(t_i\) in this case, but instead take into account how the measure \(\bar{\mu }\) behaves on the time interval \((t_i,t_{i+1}]\).

Recall Theorem 2.1 and let \(\phi :\mathbb {R}\rightarrow \mathbb {R}\) be a bounded and non-constant activation function. From now on we will be using NNs with a single hidden layer parameterized by \(\theta \in \Xi \), where \(\Xi =\mathbb {R}^{\kappa }\) for some free parameter \(\kappa \in \mathbb {N}\) depending on the size of the hidden layer. For every time \(t_i\) on the grid consider,

$$\begin{aligned} \mathcal {U}_i(\cdot ;\theta )&:\mathbb {R}^d\rightarrow \mathbb {R}\end{aligned}$$
(3.5)
$$\begin{aligned} \mathcal {Z}_i(\cdot ;\theta )&:\mathbb {R}^d\rightarrow \mathbb {R}^d\end{aligned}$$
(3.6)
$$\begin{aligned} \mathcal {G}_i(\cdot ,\circ ;\theta )&:\mathbb {R}^d\times \mathbb {R}^d\rightarrow \mathbb {R}\end{aligned}$$
(3.7)

with \(\mathcal {U}_i(\cdot ;\theta )\in \mathcal {N}_{\phi ,2,d,1,\kappa }\), \(\mathcal {Z}_i(\cdot ;\theta )\in \mathcal {N}_{\phi ,2,d,d,\kappa }\) and \(\mathcal {G}_i(\cdot ,\circ ;\theta )\in \mathcal {N}_{\phi ,2,d+d,1,\kappa }\) approximating

$$\begin{aligned} (u(t_i,\cdot ),~\sigma (\cdot ) \nabla u(t_i,\cdot ), ~u(t_i,\cdot + \beta (\cdot ,\circ ))-u(t_i,\cdot )), \end{aligned}$$

respectively, in some sense to be specified. Note that, as in Definition 2.1, we drop the activation function \(\phi \) from the notation. Let also

$$\begin{aligned} \langle \mathcal {G} \rangle _i(x;\theta ) = \int _{\mathbb R^d}\mathcal {G}_i(x,y;\theta )\lambda (dy). \end{aligned}$$
(3.8)

We propose an extension of the DBDP1 algorithm presented in [31]. The main idea of the algorithm is that the NNs, evaluated at \(X_{t_{i}}^{\pi }\), are good approximations of the processes solving the FBSDEJ. Let \(L_i\) be a cost function defined for \(\theta \in \Xi \) as

$$\begin{aligned} L_i(\theta )=\mathbb {E}\left| \widehat{\mathcal {U}}_{i+1}(X^\pi _{t_{i+1}})-F_i(t_i,X^\pi _{t_i},\mathcal {U}_i(X^\pi _{t_i};\theta ),\mathcal {Z}_i(X^\pi _{t_i};\theta ),\mathcal {G}_i(X^\pi _{t_i},\cdot ;\theta ),h,\Delta W_i)\right| ^2. \end{aligned}$$
(3.9)
[Algorithm figure: the deep backward scheme, minimizing \(L_i\) over \(\theta \) backward in time \(i=N-1,\ldots ,0\) to obtain \(\widehat{\mathcal {U}}_{i}\), \(\widehat{\mathcal {Z}}_{i}\) and \(\widehat{\mathcal {G}}_{i}\).]

For the minimization step we need to compute an expected value, but this is a complicated task due to the nonlinearity and the fact that the distribution of the random variables involved is not always known. To overcome this, as in [31], one uses a Monte Carlo approximation together with Stochastic Gradient Descent (SGD). See also Remark 3.1.
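For illustration only, the following PyTorch sketch performs this minimization for a single time step \(t_i\): the model coefficients, the driver f, the sampling of \(X^{\pi }_{t_i}\) and the role of \(\widehat{\mathcal {U}}_{i+1}\) (taken here as the terminal condition, as at the last step) are all placeholder assumptions, and the \(\lambda \)-integrals are replaced by Monte Carlo averages as discussed in Remark 3.1; the full algorithm repeats this backward in time.

```python
import torch

torch.manual_seed(0)

# ---- illustrative placeholders (coefficients and dimensions are NOT the paper's) ----
d, hidden, batch, n_mc = 2, 32, 64, 128
h, Lam, t_i = 0.05, 0.7, 0.0                           # time step, lambda(R^d), current time
sample_nu = lambda k: torch.randn(k, d)                # marks ~ nu = lambda / Lam
b_fn     = lambda x: -0.5 * x
sigma_fn = lambda x: 0.3 * x                           # diagonal sigma(x), for simplicity
beta_fn  = lambda x, y: 0.1 * torch.tanh(y)
f_fn     = lambda t, x, y, z, gam: -0.05 * y + 0.1 * gam
U_next   = lambda x: (x ** 2).sum(-1, keepdim=True)    # stands in for \hat{U}_{i+1}

def net(n_in, n_out):                                  # single hidden layer, bounded activation
    return torch.nn.Sequential(torch.nn.Linear(n_in, hidden), torch.nn.Tanh(),
                               torch.nn.Linear(hidden, n_out))

U_i, Z_i, G_i = net(d, 1), net(d, d), net(2 * d, 1)    # networks (3.5)-(3.7) at time t_i
params = list(U_i.parameters()) + list(Z_i.parameters()) + list(G_i.parameters())
opt = torch.optim.SGD(params, lr=1e-3)

def G_pair(x, y):                                      # G_i(x, y) for a single x and many marks y
    return G_i(torch.cat([x.expand(y.shape[0], -1), y], dim=-1))

for step in range(50):
    X = 1.0 + torch.randn(batch, d)                    # samples of X_{t_i}^pi (stand-in distribution)
    dW = h ** 0.5 * torch.randn(batch, d)
    losses = []
    for j in range(batch):                             # per-sample jump handling (clarity over speed)
        x = X[j:j + 1]
        n_j = int(torch.poisson(torch.tensor([Lam * h])).item())
        marks = sample_nu(n_j) if n_j > 0 else None    # jump marks on (t_i, t_{i+1}]
        # Euler step (3.3): compensated jump integral of beta
        comp = Lam * beta_fn(x, sample_nu(n_mc)).mean(0, keepdim=True)
        jmp = beta_fn(x, marks).sum(0, keepdim=True) if n_j > 0 else torch.zeros(1, d)
        X_next = x + b_fn(x) * h + sigma_fn(x) * dW[j:j + 1] + jmp - h * comp
        # Monte Carlo of <G>_i(x) in (3.8) and of the compensated jump integral of G_i
        G_avg = Lam * G_pair(x, sample_nu(n_mc)).mean(0, keepdim=True)
        G_jmp = (G_pair(x, marks).sum(0, keepdim=True) if n_j > 0 else torch.zeros(1, 1)) - h * G_avg
        # F_i from the modified Euler scheme, and one summand of the empirical loss (3.9)
        y, z = U_i(x), Z_i(x)
        F = y - h * f_fn(t_i, x, y, z, G_avg) + (dW[j:j + 1] * z).sum(-1, keepdim=True) + G_jmp
        losses.append((U_next(X_next) - F) ** 2)
    loss = torch.cat(losses).mean()                    # Monte Carlo approximation of L_i(theta)
    opt.zero_grad(); loss.backward(); opt.step()       # one SGD step on theta
```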

4 Preliminaries

For a general measure space \((E,\Sigma ,\nu )\), \(p\ge 1\) and \(m\in \mathbb {N}\), recall the definition and norm for the Lebesgue space \(L^p(E,\Sigma ,\nu ;\mathbb {R}^m)\) introduced in Sect. 1.1. For \(s,t\in [0,T]\) such that \(s\le t\), we define some spaces of stochastic processes:

  • \(\mathcal {S}_{[s,t]}^2(\mathbb {R})\) denotes the space of adapted càdlàg processes \(Y: \Omega \times [s,t]\rightarrow \mathbb {R}\) such that

    $$\begin{aligned} \left\Vert Y\right\Vert ^2_{\mathcal {S}^2_{[s,t]}}:=\mathbb {E}\left( \underset{r\in [s,t]}{\sup }|Y_r|^2\right) <\infty . \end{aligned}$$
  • \(L^2_{W,[s,t]}(\mathbb {R}^d)\) denotes the space of predictable processes \(Z: \Omega \times [s,t]\rightarrow \mathbb {R}^d\) such that

    $$\begin{aligned} \left\Vert Z\right\Vert _{L^2_{W,[s,t]}}^2:=\mathbb {E}\left( \int _s^t\left\Vert Z_r\right\Vert ^2 dr\right) <\infty . \end{aligned}$$
  • \(L^2_{\mu ,[s,t]}(\mathbb {R})\) denotes the space of \(\sigma (\mathcal {P}_{[s,t]}\times \mathcal {B}(\mathbb {R}^d))\)-measurable processes \(U: \Omega \times [s,t]\times \mathbb {R}^d\rightarrow \mathbb {R}\) with \(\mathcal {P}_{[s,t]}\) denoting the predictable sigma algebra on \(\Omega \times [s,t]\). These processes are such that

    $$\begin{aligned} \left\Vert U\right\Vert _{L^2_{\mu ,[s,t]}}^2:=\mathbb {E}\left( \int _s^t\int _{\mathbb {R}^d}|U_r(y)|^2\lambda (dy)dr\right) <\infty . \end{aligned}$$

Whenever \([s,t]=[0,T]\), we drop the time interval from the notation, and denote \(\mathcal {B}^2 = \mathcal {S}^2\times L_{W}^2(\mathbb {R}^d)\times L_{\mu }^2(\mathbb {R})\). In the following, \(C>0\) will denote a constant that may change from one line to another. Also, the notation \(a\lesssim b\) means that there exists \(C>0\) such that \(a\le Cb\).

4.1 Existence and uniqueness for the FBSDEJ

In order to estimate errors we need a solution to compare with; the following lemmas present well-known results concerning the existence and uniqueness of a solution to the decoupled system (1.8)–(1.9). We only check that our hypotheses match those of [4] and [5]; these results are the same as those given in [12] and [20, Section 4.1].

Lemma 4.1

There exists a unique solution \(X\in \mathcal {S}^2\) to (1.8) such that

$$\begin{aligned} \mathbb {E}\left( \underset{s\le u \le t}{\sup }|X_u - X_s|^2\right) \le C|t-s|\left( 1 + \mathbb {E}|X_s|^2\right) . \end{aligned}$$
(4.1)

Proof

Recall Remark 1.1. Observe that conditions (C), particularly those imposed on \(\beta \), imply that

$$\begin{aligned} \int _{\mathbb {R}^d}|\beta (x,y) - \beta (x',y)|^2\lambda (dy) \le K^2\lambda (\mathbb {R}^d)|x-x'|^2. \end{aligned}$$

This, together with the rest of the conditions (C), is enough to fulfill the Lipschitz and growth hypotheses needed in [4, Section 6.2] to ensure the existence and uniqueness of a solution \(X\in \mathcal {S}^2\) to the FSDEJ (1.8). Estimate (4.1) follows by considering the process \((X_u-X_s)_{u\in [s,t]}\) and using Doob’s maximal inequality [42, Theorem 20, Section 1] and Gronwall’s inequality. \(\square \)

Lemma 4.2

There exists a solution \((Y,Z,U)\in \mathcal {B}^2\) to (1.9).

Proof

We apply Theorem 2.1 of [5] with \(k=1\), \(Q=g(X_T)\) and a nonlinearity \(\bar{f}:\Omega \times [0,T]\times \mathbb {R}\times \mathbb {R}^d\times L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\lambda )\rightarrow \mathbb {R}\) defined as

$$\begin{aligned}&\bar{f}(\omega ,t,y,z,w) = f\left( t,X_t(\omega ),y,z,\int _{\mathbb {R}^d}w(x)\lambda (dx)\right) , \\&\quad \hbox {for}\ (\omega ,t,y,z,w) \in \Omega \times [0,T]\times \mathbb {R}\times \mathbb {R}^d\times L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\lambda ). \end{aligned}$$

By the Lipschitz property of g and the bound given in Lemma 4.1 we can see that \(Q\in L^2(\Omega ,\mathcal {F}_T,\mathbb {P})\). The Lipschitz condition on f implies that for all \(\omega \in \Omega ,t\in [0,T],y,y'\in \mathbb {R},z,z'\in \mathbb {R}^d\) and \(w,w'\in L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\lambda )\),

$$\begin{aligned}&|\bar{f}(\omega ,t,y,z,w) - \bar{f}(\omega ,t,y',z',w')|\\&\quad = \Bigg |f\left( t,X_t(\omega ),y,z,\int _{\mathbb {R}^d}w(x)\lambda (dx)\right) - f\left( t,X_t(\omega ),y',z',\int _{\mathbb {R}^d}w'(x)\lambda (dx)\right) \Bigg |\\&\quad \le K\left( |y-y'| + |z-z'| + \Big |\int _{\mathbb {R}^d}(w-w')(x)\lambda (dx)\Big |\right) \\&\quad \le K\left( |y-y'| + |z-z'| + \lambda (\mathbb {R}^d)^{1/2} \left\Vert w-w'\right\Vert _{L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\lambda ;\mathbb {R})}\right) , \end{aligned}$$

this proves the Lipschitz condition on \(\bar{f}\). Using the previous bound, it is clear that

$$\begin{aligned} \mathbb {E}\int _0^T |\bar{f}(\cdot ,t,0,0,0)|^2 dt < \infty . \end{aligned}$$

These computations allow us to directly apply Theorem 2.1 of [5], which finishes the proof.\(\square \)

Combining the previous lemmas we obtain that there exists a unique solution \((X,Y,Z,U)\) to the system (1.8)–(1.9) in the space \(\mathcal {S}^2\times \mathcal {B}^2\); this implies

$$\begin{aligned} \mathbb {E}\left( \underset{s\in [0,T]}{\sup }|X_s|^2\right) + \mathbb {E}\left( \underset{s\in [0,T]}{\sup }|Y_s|^2\right) + \mathbb {E}\int _0^T |Z_s|^2 ds + \mathbb {E}\int _0^T \int _{\mathbb {R}^d}|U_s(y)|^2 \lambda (dy)ds < \infty . \end{aligned}$$
(4.2)

4.2 Useful results from stochastic calculus

The following lemma strongly depends on the filtration under consideration: recall that \((\mathcal {F}_t)_{t\in [0,T]}\) is generated by the two independent objects W and \(\mu \), which allows us to state the representation property. See the end of Section 2.4 in [20], where it is stated that the required representation holds when the filtration is generated by a Brownian motion and an independent jump process.

Lemma 4.3

(Martingale Representation Theorem) For any square integrable martingale M there exists \((Z,U)\in L^2_{W}(\mathbb {R}^d)\times L^2_{\mu }(\mathbb {R})\) such that for \(t\in [0,T]\)

$$\begin{aligned} M_t = M_0 + \int _0^t Z_s\cdot dW_s + \int _0^t \int _{\mathbb {R}^d} U(s,y)\overline{\mu }(ds,dy). \end{aligned}$$

We will need the following property, involving the conditional expectation, the Itô isometry and the fact that W is independent of \(\overline{\mu }\).

Lemma 4.4

(Conditional Itô isometry) For \(V^1,V^2\in L^2_{\mu }(\mathbb {R})\) and \(H,K\in L^2_W(\mathbb {R}^d)\),

$$\begin{aligned} \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}H_r\cdot dW_r\int _{t_i}^{t_{i+1}}K_r\cdot dW_r\right)&= \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}H_r \cdot K_r \,dr\right) ,\\ \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}V^1(s,z)\overline{\mu }(ds,dz)\int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}V^2(s,z)\overline{\mu }(ds,dz)\right)&=\mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}V^1(s,z)V^2(s,z)\lambda (dz)ds\right) ,\\ \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}V^1(r,y)\overline{\mu }(dr,dy)\int _{t_i}^{t_{i+1}}H_r\cdot dW_r\right)&= 0. \end{aligned}$$
(4.3)

Proof

This follows from the classical Itô isometry and the definition of conditional expectation.\(\square \)

Lemma 4.5

(Conditional Fubini) Let \(H\in L^2_{\mu }(\mathbb {R}^d)\) be an \(\mathbb {F}\)-adapted process. Then

$$\begin{aligned} \mathbb {E}\left( \int _{\mathbb {R}^d}\int _{t_i}^{t_{i+1}} H(s,y)ds\lambda (dy)\bigg |\mathcal {F}_{t_i}\right) = \int _{\mathbb {R}^d}\mathbb {E}\left( \int _{t_i}^{t_{i+1}} H(s,y)ds\bigg |\mathcal {F}_{t_i}\right) \lambda (dy). \end{aligned}$$

Proof

The proof is standard, but we include it for the sake of completeness. Let \(A\in \mathcal {F}_{t_i}\); we have to prove that

$$\begin{aligned} \int _A \left( \int _{\mathbb {R}^d} \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}H(s,y)ds\right) \lambda (dy) \right) d\mathbb {P}(\omega ) = \int _{A}\left( \int _{\mathbb {R}^d}\int _{t_i}^{t_{i+1}}H(s,y)(\omega )ds\lambda (dy)\right) d\mathbb {P}(\omega ). \end{aligned}$$

Note that, since \(H\in L^2_{\mu }(\mathbb {R}^d)\),

$$\begin{aligned} \int _\Omega \int _{t_i}^{t_{i+1}}\!\! \int _{\mathbb {R}^d} \! |H(s,y)(\omega )|^2 \lambda (dy) ds d\mathbb {P}(\omega ) < \infty ; \end{aligned}$$

this means that H can be seen as an element of \(L^2(\Omega \times [t_i,t_{i+1}]\times \mathbb {R}^d)\subset L^1(\Omega \times [t_i,t_{i+1}]\times \mathbb {R}^d)\), both spaces endowed with the corresponding finite product measure. Then we can use the classical Fubini theorem:

$$\begin{aligned}&\int _{A}\left( \int _{\mathbb {R}^d}\int _{t_i}^{t_{i+1}}H(s,y)(\omega )ds\lambda (dy)\right) d\mathbb {P}(\omega )\\ {}&\quad = \int _{\mathbb {R}^d}\left( \int _{A} \int _{t_i}^{t_{i+1}}H(s,y)(\omega )dsd\mathbb {P}(\omega )\right) \lambda (dy)\\&\quad =\int _{\mathbb {R}^d}\left( \int _{A} \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}H(s,y)(\omega )ds\right) d\mathbb {P}(\omega )\right) \lambda (dy)\\&\quad = \int _{A}\left( \int _{\mathbb {R}^d} \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}H(s,y)(\omega )ds\right) \lambda (dy)\right) d\mathbb {P}(\omega ). \end{aligned}$$

This finishes the proof.\(\square \)

4.3 Measuring the error

We first introduce the conditional expectations of the averaged processes

$$\begin{aligned} \overline{Z}_{t_i}=\dfrac{1}{h}\mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}Z_t dt\right) ,\quad \overline{\Gamma }_{t_i}= \dfrac{1}{h}\mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\Gamma _t dt\right) . \end{aligned}$$
(4.4)

These quantities allow us to define the \(L^2\)-regularity of the solutions \((Z,\Gamma )\) (see [12] and [31]) as follows:

$$\begin{aligned} \begin{aligned} \varepsilon ^Z(h)&:= \mathbb {E}\left( \sum _{i=0}^{N-1}\int _{t_i}^{t_{i+1}}|Z_t-\overline{Z}_{t_i}|^2 dt\right) ,\\ \varepsilon ^\Gamma (h)&:= \mathbb {E}\left( \sum _{i=0}^{N-1}\int _{t_i}^{t_{i+1}}|\Gamma _t-\overline{\Gamma }_{t_i}|^2 dt\right) . \end{aligned} \end{aligned}$$
(4.5)

Both quantities can be made arbitrarily small, as shown in [12] and presented in the following theorem.

Theorem 4.1

Under assumptions (C), there exists a constant \(C>0\) such that

$$\begin{aligned} \varepsilon ^\Gamma (h) \le Ch \qquad \text {and} \qquad \varepsilon ^Z (h) \le Ch. \end{aligned}$$

Proof

See [12, Theorem 2.1 (i)] for the bound on \(\varepsilon ^{\Gamma }(h)\) and [12, Theorem 2.1 (ii)] for the bound on \(\varepsilon ^{Z}(h)\). Note that in the cited reference this result is presented, using our notation, as follows,

$$\begin{aligned} \left\Vert \Gamma - \overline{\Gamma }\right\Vert ^2_{L^2_{W}} \le C N^{-1} \qquad \text {and} \qquad \left\Vert Z - \overline{Z}\right\Vert ^2_{L^2_{W}} \le C N^{-1}. \end{aligned}$$

Here \(\overline{\Gamma }_t = \overline{\Gamma }_{t_i}\) and \(\overline{Z}_t = \overline{Z}_{t_i}\) for \(t\in [t_i,t_{i+1})\).\(\square \)

We now introduce an auxiliary scheme which, at the same time, depends on the main one. Let \(i\in \{0,\ldots , N-1\}\), with the notation of Sect. 3. We follow the procedure of [31], with key modifications, and use the ideas of [12] to define the \(\mathcal {F}\)-adapted discrete processes

$$\begin{aligned} \widehat{\mathcal {V}}_{t_i}&=\mathbb {E}_i\left( \widehat{\mathcal {U}}_{i+1}(X^\pi _{t_{i+1}})\right) + f\left( t_i,X^\pi _{t_i},\widehat{\mathcal {V}}_{t_i}, \overline{\widehat{Z}}_{t_i},\overline{\widehat{\Gamma }}_{t_i}\right) h, \end{aligned}$$
(4.6)
$$\begin{aligned} \overline{\widehat{Z}}_{t_i}&= \dfrac{1}{h}\mathbb {E}_i \left( \widehat{\mathcal {U}}_{i+1}(X^\pi _{t_{i+1}})\Delta W_{i} \right) , \end{aligned}$$
(4.7)
$$\begin{aligned} \overline{\widehat{\Gamma }}_{t_i}&= \dfrac{1}{h} \mathbb {E}_i\left( \widehat{\mathcal {U}}_{i+1}(X^\pi _{t_{i+1}})\Delta M_{i}\right) , \end{aligned}$$
(4.8)

where \(\widehat{\mathcal {V}}_{t_i}\) is well-defined for sufficiently small h by Lemma 4.6, and the variables \(\overline{\widehat{Z}}_{t_i}\), \(\overline{\widehat{\Gamma }}_{t_i}\) defined in (4.7)–(4.8) are further characterized below.

Lemma 4.6

The process \(\widehat{\mathcal {V}}_{t_i}\) is well-defined for h sufficiently small.

Proof

Let \(i\in \left\{ 0,...,N-1\right\} \) and \(\psi :L^2(\Omega ,\mathcal {F},\mathbb {P})\rightarrow L^2(\Omega ,\mathcal {F},\mathbb {P})\) be defined as

$$\begin{aligned} \psi (\xi )(\omega )=\mathbb {E}_i\left( \widehat{\mathcal {U}}_{i+1}(X^\pi _{t_{i+1}})\right) (\omega ) + f\left( t_i,X^\pi _{t_i}(\omega ),\xi (\omega ), \overline{\widehat{Z}}_{t_i}(\omega ),\overline{\widehat{\Gamma }}_{t_i}(\omega )\right) h. \end{aligned}$$

Here \(\xi \in L^2(\Omega ,\mathcal {F},\mathbb {P})\) and \(\omega \in \Omega \) are arbitrary, and the map \(\psi \) is well-defined by the properties of f and Lemma 2.2. Let \(\xi ,\overline{\xi }\in L^2(\Omega ,\mathcal {F},\mathbb {P})\); then, \(\mathbb {P}\)-a.s., \(|\psi (\xi )-\psi (\overline{\xi })| \le Kh |\xi -\overline{\xi }|\), and therefore

$$\begin{aligned} \left\Vert \psi (\xi )-\psi (\overline{\xi })\right\Vert _{L^2(\Omega ,\mathcal {F}, \mathbb {P})} \le Kh \left\Vert \xi -\overline{\xi }\right\Vert _{L^2(\Omega ,\mathcal {F}, \mathbb {P})}. \end{aligned}$$

Taking h sufficiently small (so that \(Kh<1\)), this map is a contraction on \(L^2(\Omega ,\mathcal {F},\mathbb {P})\), and therefore, by applying Banach’s fixed point theorem, we conclude the proof.\(\square \)

For fixed \(i\in \left\{ 0,...,N-1\right\} \), let \(N_t\) be the process defined as \(N_t := \mathbb {E}\left( \widehat{\mathcal {U}}_{i+1}(X_{t_{i+1}}^{\pi })\Big |\mathcal {F}_t\right) \) for \(t\in [t_i,t_{i+1}]\). Using Lemma 2.2, it is not difficult to see that \(N_t\) is a square integrable martingale and therefore, by the Martingale Representation Theorem (see Lemma 4.3), there exists \((\widehat{Z},\widehat{U})\in L^2_{W}\times L^2_{\mu }\) such that

$$\begin{aligned} N_t = N_{t_i} + \int _{t_i}^{t}\widehat{Z}_s \cdot dW_s + \int _{t_i}^{t}\int _{\mathbb {R}^d} \widehat{U}_s (y) \overline{\mu }(ds,dy). \end{aligned}$$

By taking \(t=t_{i+1}\) and using (1.7),

$$\begin{aligned} \widehat{\mathcal {U}}_{i+1}(X_{t_{i+1}}^{\pi }) = \mathbb {E}_i\left( \widehat{\mathcal {U}}_{i+1}(X_{t_{i+1}}^{\pi })\right) + \int _{t_i}^{t_{i+1}}\widehat{Z}_s \cdot dW_s + \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d} \widehat{U}_s (y) \overline{\mu }(ds,dy). \end{aligned}$$

By multiplying by \(\Delta W_i\) and \(\Delta M_i\), then taking \(\mathbb {E}_i\) and using the conditional Itô isometry (Lemma 4.4),

$$\begin{aligned} \overline{\widehat{Z}}_{t_i}&= \frac{1}{h} \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\widehat{Z}_sds\right) ,\\ \overline{\widehat{\Gamma }}_{t_i}&= \frac{1}{h} \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}\widehat{U}_s(y)\lambda (dy)ds\right) . \end{aligned}$$

Let

$$\begin{aligned} \overline{\widehat{U}}_{t_i}(y):=\frac{1}{h}\mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\widehat{U}_s(y)ds\right) . \end{aligned}$$
(4.9)

By Lemma 4.5 one can see that

$$\begin{aligned} \overline{\widehat{\Gamma }}_{t_i} = \frac{1}{h} \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}\widehat{U}_s(y)\lambda (dy)ds\right) = \int _{\mathbb {R}^d} \overline{\widehat{U}}_{t_i}(y) \lambda (dy). \end{aligned}$$
(4.10)

The last equality can be seen as an analogue of (1.10) and is consistent with the notation \(\overline{\widehat{\Gamma }}_{t_i}=\langle \overline{\widehat{U}}_{t_i} \rangle \). Also, we can establish the following useful bound:

$$\begin{aligned} \mathbb {E}\left| \overline{\widehat{\Gamma }}_{t_i} - \langle \mathcal {G} \rangle _i (X_{t_{i}}^{\pi };\theta ) \right| ^2 \lesssim \mathbb {E}\left( \left\Vert \overline{\widehat{U}}_{t_i}(\cdot ) - \mathcal {G}_i(X_{t_{i}}^{\pi },\cdot ,\theta )\right\Vert ^2_{L^2(\lambda )} \right) . \end{aligned}$$

Indeed, this follows from (4.10) and (3.8), the Hölder inequality and the fact that \(\lambda \) is a finite measure:

$$\begin{aligned} \begin{aligned} \mathbb {E}\left| \overline{\widehat{\Gamma }}_{t_i} - \langle \mathcal {G} \rangle _i (X_{t_{i}}^{\pi };\theta ) \right| ^2 =&~{} \mathbb {E}\left| \int _{\mathbb {R}^d} \overline{\widehat{U}}_{t_i}(y) \lambda (dy) - \int _{\mathbb R^d}\mathcal {G}_i(X_{t_{i}}^{\pi },y;\theta )\lambda (dy) \right| ^2 \\ \lesssim&~{} \mathbb {E}\left( \left\Vert \overline{\widehat{U}}_{t_i}(\cdot ) - \mathcal {G}_i(X_{t_{i}}^{\pi },\cdot ,\theta )\right\Vert ^2_{L^2(\lambda )} \right) . \end{aligned} \end{aligned}$$

Following [31], we can find deterministic functions \(v_i, z_i, \gamma _i\) such that \(v_i(X_{t_{i}}^{\pi }) = \widehat{\mathcal {V}}_{t_i}\), \(z_i(X_{t_{i}}^{\pi }) = \overline{\widehat{Z}}_{t_i}\) and \(\gamma _i(y, X_{t_{i}}^{\pi }) = \overline{\widehat{U}}_{t_i}(y)\) for \(y\in \mathbb {R}^d\). The corresponding \(L^2\)-integrability of these functions is ensured by the properties of \(\widehat{\mathcal {V}}_{t_i}, \overline{\widehat{Z}}_{t_i}\) and \(\overline{\widehat{U}}_{t_i}\). With the previous setup, the natural extension of the quantities used to estimate the error of the scheme in [31] is

$$\begin{aligned} \begin{aligned} \mathcal {E}_i^v&= \underset{\xi \in \mathbb {R}^{\kappa }}{\inf } \mathbb {E}\left| v_i(X_{t_{i}}^{\pi })-\mathcal {U}_i(X_{t_{i}}^{\pi };\xi ) \right| ^2, \quad \mathcal {E}_i^z = \underset{\xi \in \mathbb {R}^{\kappa }}{\inf } \mathbb {E}\left| z_i(X_{t_{i}}^{\pi })-\mathcal {Z}_i(X_{t_{i}}^{\pi };\xi ) \right| ^2\\ \mathcal {E}_i^{\gamma }&= \underset{\xi \in \mathbb {R}^{\kappa }}{\inf } \mathbb {E}\left( \int _{\mathbb {R}^d} \left| \gamma _i(y, X_{t_{i}}^{\pi })-\mathcal {G}_i(X_{t_{i}}^{\pi },y;\xi ) \right| ^2\lambda (dy)\right) . \end{aligned} \end{aligned}$$
(4.11)

The expected values above can be written as integrals with respect to a probability measure on \(\mathbb {R}^d\) and therefore, applying Theorem 2.1 and Lemma 2.1, these quantities can be made arbitrarily small as \(\kappa \) increases.

The following results will be useful in the proof of the main result. In Section 2.5 of [12], it is explained that the results presented there still hold for a time-dependent non-linearity.

Proposition 4.1

([12], Proposition 2.1) There exists a constant \(C>0\) independent of the step h such that

$$\begin{aligned} \sum _{i=0}^{N-1}\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-Y_{t_i}|^2ds\right) \le Ch. \end{aligned}$$

We will also need the following result.

Lemma 4.7

Consider \((X,Y,Z,U)\in \mathcal {S}^2\times \mathcal {B}^2\) the solution to (1.8)–(1.9), \(\Gamma \) defined as in (1.10) and \(\Theta _s = (s,X_s,Y_s,Z_s,\Gamma _s)\). Then,

$$\begin{aligned} \mathbb {E}\left( \int _0^T |f(\Theta _s)|^2ds\right) < \infty . \end{aligned}$$

Proof

First, note that by using the useful bound (1.1) we have, for every \(s\in [0,T]\),

$$\begin{aligned} |f(\Theta _s)|^2 \le 2(|f(\Theta _s) - f(s,0,0,0,0)|^2 + |f(s,0,0,0,0)|^2). \end{aligned}$$

Applying again (1.1) and the Lipschitz bound on f,

$$\begin{aligned} |f(\Theta _s)|^2 \le 2\Big [ K^2 5 (|X_s|^2 + |Y_s|^2 + |Z_s|^2 + |\Gamma _s|^2) + |f(s,0,0,0,0)|^2 \Big ]. \end{aligned}$$

Then, integrating on \(\Omega \times [0,T]\) with respect to \(d\mathbb {P}\times ds\), using Hölder inequality and bound (4.2),

$$\begin{aligned} \mathbb {E}\left( \int _0^T |f(\Theta _s)|^2ds\right)&\le 10K^2 T \mathbb {E}\left( \underset{s\in [0,T]}{\sup }|X_s|^2 + \underset{s\in [0,T]}{\sup }|Y_s|^2\right) \\&+ 10K^2 \left( \mathbb {E}\int _0^T |Z_s|^2 ds + \lambda (\mathbb {R}^d) \mathbb {E}\int _0^T \int _{\mathbb {R}^d} |U_s(y)|^2\lambda (dy)ds\right) \\&+ 2T\underset{s\in [0,T]}{\sup }\ |f(s,0,0,0,0)|^2\\&< \infty \end{aligned}$$

this finishes the proof.\(\square \)

5 Main result

As stated previously, the proof of our main result, Theorem 5.1, is deeply inspired by the case without jumps considered in [31]. We follow the lines of that proof, with some important differences coming from the nonlocal character of our problem. Also, along the proof we use several times the useful bound (1.1): for \(x_1,...,x_k\in \mathbb {R}\) the following holds,

$$\begin{aligned} (x_1+\cdots +x_k)^2\le k(x_1^2+\cdots +x_k^2). \end{aligned}$$

Theorem 5.1

Under (C), there exists a constant \(C>0\) independent of the partition such that for sufficiently small h,

$$\begin{aligned}&\underset{i=0,...,N-1}{\max }\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) \right| ^2 + \sum _{i=0}^{N-1}\mathbb {E}\left( \int _{t_i}^{t_{i+1}}\Big [ |Z_t-\widehat{\mathcal {Z}}_{i}(X_{t_{i}}^{\pi })|^2 + |\Gamma _t-\widehat{\langle \mathcal {G} \rangle }_{i}(X_{t_{i}}^{\pi })|^2\Big ] dt\right) \\&\qquad \le C\Big [ h + \sum _{i=0}^{N-1} (N\mathcal {E}_i^v + \mathcal {E}_i^z + \mathcal {E}_i^\gamma ) + \varepsilon ^{Z}(h)+\varepsilon ^{\Gamma }(h) + \mathbb {E}\left| g(X_T)-g(X_T^\pi ) \right| ^2 \Big ], \end{aligned}$$

with \(\mathcal {E}_i^v\), \(\mathcal {E}_i^z\) and \(\mathcal {E}_i^\gamma \) given in (4.11), and \(\varepsilon ^{Z}(h)\) and \(\varepsilon ^{\Gamma }(h)\) defined in (4.5).

Proof

Step 1: Recall \(\widehat{\mathcal {V}}_{t_i}\) introduced in (4.6). The purpose of this part is to bound the term \(\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2\) by more tractable quantities. We have:

Lemma 5.1

There exists a fixed constant \(C>0\) such that, for any sufficiently small \(0<h<1\), one has

$$\begin{aligned} \mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2 \le&~{} Ch^2+C\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-Y_{t_i}|^2ds\right) + C\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{Z}_{t_i}|^2ds\right) \nonumber \\&~{} +C\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\Gamma }_{t_i}|^2ds\right) +Ch \mathbb {E}\left( \int _{t_i}^{t_{i+1}}f(\Theta _r)^2dr\right) \nonumber \\&~{} + C(1+Ch)\mathbb {E} \left| Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1} (X^\pi _{t_{i+1}}) \right| ^2, \end{aligned}$$
(5.1)

with \(\Theta _r=(r,X_r,Y_r,Z_r,\Gamma _r)\).

The rest of this subsection is devoted to the proof of this result.

Proof

Taking the difference of equation (1.9) between the times \(t_i\) and \(t_{i+1}\), we obtain

$$\begin{aligned} \Delta Y_i = Y_{t_{i+1}}-Y_{t_i}=-\int _{t_i}^{t_{i+1}} f(\Theta _s)ds +\int _{t_i}^{t_{i+1}}Z_s\cdot dW_s+\int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d} U_s(y)\overline{\mu }(ds,dy). \end{aligned}$$
(5.2)

Using the definition of \(\widehat{\mathcal {V}}_{t_i}\) in (4.6),

$$\begin{aligned} \begin{aligned} Y_{t_i}-\widehat{\mathcal {V}}_{t_i}=&~{} Y_{t_{i+1}} -\Delta Y_i -\widehat{\mathcal {V}}_{t_i}\\ =&~{} Y_{t_{i+1}}+\int _{t_i}^{t_{i+1}}[f(\Theta _s)-f(\widehat{\Theta }_{t_i})]ds-\int _{t_i}^{t_{i+1}}Z_s\cdot dW_s \\&~- \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d} U_s(y)\overline{\mu }(ds,dy){} -\mathbb {E}_i(\widehat{\mathcal {U}}_{i+1}(X^\pi _{t_{i+1}})). \end{aligned} \end{aligned}$$

Here \(\widehat{\Theta }_{t_i}=(t_i,X^\pi _{t_i},\widehat{\mathcal {V}}_{t_i}, \overline{\widehat{Z}}_{t_i},\overline{\widehat{\Gamma }}_{t_i})\). Then, applying the conditional expectation at time \(t_i\), denoted \(\mathbb {E}_i\), and using that the stochastic integrals are martingales, we get

$$\begin{aligned} Y_{t_i}-\widehat{\mathcal {V}}_{t_i}=\mathbb {E}_i(Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1}(X^\pi _{t_{i+1}})) + \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}[f(\Theta _s)-f(\widehat{\Theta }_{t_i})]ds\right) = a+b. \end{aligned}$$

Using the classical inequality \((a+b)^2\le (1+\gamma h)a^2+(1+\frac{1}{\gamma h})b^2\) for \(\gamma >0\) to be chosen, we get

$$\begin{aligned} \mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2&\le (1+\gamma h) \mathbb {E} \left[ \mathbb {E}_i\left( Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1}(X^\pi _{t_{i+1}})\right) \right] ^2 \nonumber \\&\quad + \left( 1+\frac{1}{\gamma h}\right) \mathbb {E} \left[ \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}[f(\Theta _s)-f(\widehat{\Theta }_{t_i})]ds\right) \right] ^2.\qquad \end{aligned}$$
(5.3)
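For completeness, the classical inequality invoked above follows from Young's inequality \(2ab\le \gamma h\,a^2+\frac{1}{\gamma h}\,b^2\):

$$\begin{aligned} (a+b)^2=a^2+2ab+b^2\le (1+\gamma h)a^2+\Big (1+\frac{1}{\gamma h}\Big )b^2. \end{aligned}$$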

Without loss of generality, since we are only looking for upper bounds, we may replace \([f(\Theta _s)-f(\widehat{\Theta }_{t_i})]\) by \(|f(\Theta _s)-f(\widehat{\Theta }_{t_i})|\). We may also drop the \(\mathbb {E}_i\) by the law of total expectation together with the conditional Jensen inequality. The Lipschitz condition on f in (1.4) then allows us to bound this difference in terms of the difference between \(\Theta _s\) and \(\widehat{\Theta }_{t_i}\). Indeed, for a fixed constant \(K>0\),

$$\begin{aligned} |f(\Theta _s)-f(\widehat{\Theta }_{t_i})|\le K\left( |s-t_i|^{1/2}+|X_s-X^{\pi }_{t_i}|+|Y_s-\widehat{\mathcal {V}}_{t_i}|+|Z_s-\overline{\widehat{Z}}_{t_i}|+|\Gamma _s-\overline{\widehat{\Gamma }}_{t_i}|\right) . \end{aligned}$$

Therefore, we have the bound

$$\begin{aligned}&\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|f(\Theta _s)-f(\widehat{\Theta }_{t_i})|ds\right) ^2 \\ {}&\quad \le ~{} Ch\left[ h^2+\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|X_s-X^{\pi }_{t_i}|^2ds\right) + \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-\widehat{\mathcal {V}}_{t_i}|^2ds\right) \right. \\&\qquad +\left. \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{\widehat{Z}}_{t_i}|^2ds\right) + \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\widehat{\Gamma }}_{t_i}|^2ds\right) \right] , \end{aligned}$$

where the Lipschitz constant K has been absorbed into C. Using now the triangle inequality \(|Y_s-\widehat{\mathcal {V}}_{t_i}|^2 \le 2|Y_s-Y_{t_i}|^2 +2|Y_{t_i}-\widehat{\mathcal {V}}_{t_i}|^2\), and the approximation error of the scheme for X in (3.4), we find

$$\begin{aligned}&\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|f(\Theta _s)-f(\widehat{\Theta }_{t_i})|ds\right) ^2 \nonumber \\&\quad \le C h\left[ h^2+ 2\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-Y_{t_i}|^ 2 ds\right) +2h \mathbb {E}\left| Y_{t_i}- \widehat{\mathcal {V}}_{t_i} \right| ^2 \right. \nonumber \\&\left. \qquad \qquad + \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{\widehat{Z}}_{t_i}|^2 ds\right) + \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\widehat{\Gamma }}_{t_i}|^2ds\right) \right] , \end{aligned}$$
(5.4)

and therefore, replacing in (5.3),

$$\begin{aligned}&\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2 \nonumber \\&\le \left( 1+\gamma h \right) \mathbb {E}\left| \mathbb {E}_i\left[ Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1} (X^\pi _{t_{i+1}}) \right] \right| ^2 \nonumber \\&\quad + \left( 1+\gamma h\right) \frac{C}{\gamma } \left[ h^2 + \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-Y_{t_i}|^ 2 ds\right) +h \mathbb {E}\left| Y_{t_i}- \widehat{\mathcal {V}}_{t_i} \right| ^2 \right. \nonumber \\&\left. \qquad \qquad \qquad \quad \quad + \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{\widehat{Z}}_{t_i}|^2 ds\right) + \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\widehat{\Gamma }}_{t_i}|^2ds\right) \right] . \end{aligned}$$
(5.5)

Recall \(\overline{Z}_{t_i}\) and \(\overline{\Gamma }_{t_i}\) introduced in (4.4). We now prove the following identities:

$$\begin{aligned} \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{\widehat{Z}}_{t_i}|^2 ds\right)&=\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{Z}_{t_i}|^2 ds\right) +h \mathbb {E}\left| \overline{Z}_{t_i}-\overline{\widehat{Z}}_{t_i} \right| ^2. \end{aligned}$$
(5.6)
$$\begin{aligned} \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\widehat{\Gamma }}_{t_i}|^2 ds\right)&=\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\Gamma }_{t_i}|^2ds\right) +h \mathbb {E}\left| \overline{\Gamma }_{t_i}-\overline{\widehat{\Gamma }}_{t_i} \right| ^2. \end{aligned}$$
(5.7)

Let us prove the latter; the former is analogous. Recall that the \(\Gamma \) component represents the nonlocal part and is therefore one-dimensional. We write

$$\begin{aligned} |\Gamma _t-\overline{\widehat{\Gamma }}_{t_i}|^2&= | (\Gamma _t-\overline{\Gamma }_{t_i}) + (\overline{\Gamma }_{t_i}-\overline{\widehat{\Gamma }}_{t_i}) |^2 \\ {}&= (\Gamma _t-\overline{\Gamma }_{t_i})^2 + (\overline{\Gamma }_{t_i}-\overline{\widehat{\Gamma }}_{t_i})^2 + 2(\Gamma _t-\overline{\Gamma }_{t_i})(\overline{\Gamma }_{t_i}-\overline{\widehat{\Gamma }}_{t_i}). \end{aligned}$$

It suffices to show that the cross term vanishes after integrating in time and taking expectation. Recall that \(\overline{\Gamma }_{t_i}\) from (4.4) is an \(\mathcal {F}_{t_i}\)-measurable random variable. Then,

$$\begin{aligned} \int _{t_i}^{t_{i+1}}\left( \Gamma _t-\overline{\Gamma }_{t_i}\right) (\overline{\Gamma }_{t_i}-\overline{\widehat{\Gamma }}_{t_i})dt&= \left( \int _{t_i}^{t_{i+1}}(\Gamma _t-\overline{\Gamma }_{t_i})dt\right) (\overline{\Gamma }_{t_i}-\overline{\widehat{\Gamma }}_{t_i})\\&= \left[ \int _{t_i}^{t_{i+1}}\Gamma _t dt - \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\Gamma _t dt\right) \right] (\overline{\Gamma }_{t_i}-\overline{\widehat{\Gamma }}_{t_i}). \end{aligned}$$

Since the second factor in the last product is \(\mathcal {F}_{t_i}\)-measurable, taking expectation annihilates this term by the \(L^2(\mathbb {P})\) orthogonality. Therefore, equations (5.6) and (5.7) are proven. Multiplying (5.2) by \(\Delta W_i\) and taking \(\mathbb {E}_i\),

$$\begin{aligned} \mathbb {E}_i\left( \Delta W_i Y_{t_{i+1}}\right)&+ \mathbb {E}_i\left( \Delta W_i\int _{t_i}^{t_{i+1}}f(\Theta _r)dr\right) \\ =&~{} \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}dW_r\int _{t_i}^{t_{i+1}}Z_r dW_r\right) \\&+\mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}U_r(y)\overline{\mu }(dy,dr)\int _{t_i}^{t_{i+1}}dW_r\right) \\ =&~{} \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}Z_r dr\right) = h \overline{Z}_{t_i}, \end{aligned}$$

where we have used Lemma 4.3. Then, subtracting \(h \overline{\widehat{Z}}_{t_i}=\mathbb {E}_i(\widehat{\mathcal {U}}_{i+1}(X^{\pi }_{t_{i+1}})\Delta W_i)\),

$$\begin{aligned} h (\overline{Z}_{t_i}- \overline{\widehat{Z}}_{t_i})= \mathbb {E}_i\left[ \Delta W_i (Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1}(X^{\pi }_{t_{i+1}}))\right]&+ \mathbb {E}_i\left( \Delta W_i\int _{t_i}^{t_{i+1}}f(\Theta _r)dr\right) . \end{aligned}$$

By multiplying (5.2) by \(\Delta M_i\) and taking \(\mathbb {E}_i\),

$$\begin{aligned}&\mathbb {E}_i\left( \Delta M_i Y_{t_{i+1}}\right) + \mathbb {E}_i\left( \Delta M_i\int _{t_i}^{t_{i+1}}f(\Theta _r)dr\right) \\&\quad = \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}\overline{\mu }(ds,dy)\int _{t_i}^{t_{i+1}}Z_r\cdot dW_r\right) \\&\qquad + \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d} \overline{\mu }(dr,dy)\int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}U_r(y) \overline{\mu }(dr,dy)\right) \\&\quad = \mathbb {E}_i\left( \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}U_r(y)\lambda (dy)dr\right) = h \overline{\Gamma }_{t_i}. \end{aligned}$$

Then, subtracting \(h\overline{\widehat{\Gamma }}_{t_i}=\mathbb {E}_i\left( \widehat{\mathcal {U}}_{i+1}(X^{\pi }_{t_{i+1}})\Delta M_i\right) \),

$$\begin{aligned} h(\overline{\Gamma }_{t_i}-\overline{\widehat{\Gamma }}_{t_i}) = \mathbb {E}_i\left[ \Delta M_i\left( Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1}(X^{\pi }_{t_{i+1}})\right) \right] + \mathbb {E}_i\left( \Delta M_i\int _{t_i}^{t_{i+1}}f(\Theta _r)dr\right) . \end{aligned}$$

Summarizing, one has

$$\begin{aligned} h(\overline{Z}_{t_i}-\overline{\widehat{Z}}_{t_i}) =&~{} \mathbb {E}_i \left[ \Delta W_i \left( Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1}(X^{\pi }_{t_{i+1}})-\mathbb {E}_i\left[ Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1}(X^{\pi }_{t_{i+1}})\right] \right) \right] \\&~{} + \mathbb {E}_i\left[ \Delta W_i\int _{t_i}^{t_{i+1}}f(\Theta _r)dr \right] ;\\ h(\overline{\Gamma }_{t_i}-\overline{\widehat{\Gamma }}_{t_i}) =&~{} \mathbb {E}_i \left[ \Delta M_i \left( Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1}(X^{\pi }_{t_{i+1}})-\mathbb {E}_i\left[ Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1}(X^{\pi }_{t_{i+1}})\right] \right) \right] \\&~{} + \mathbb {E}_i\left[ \Delta M_i\int _{t_i}^{t_{i+1}}f(\Theta _r)dr \right] . \end{aligned}$$

For the sake of brevity, define now

$$\begin{aligned} H_{i}:=Y_{t_{i}}-\widehat{\mathcal {U}}_{i} (X^\pi _{t_{i}}); \end{aligned}$$
(5.8)

note that it depends on i. By the properties related to the Itô isometry (and its analogue for the compensated Poisson integral), the previous identities give

$$\begin{aligned} \mathbb {E}\left( h^2 \left| \overline{Z}_{t_i}-\overline{\widehat{Z}}_{t_i} \right| ^2\right) \le&~{} 2dh \left( \mathbb {E}(H_{i+1}^2)-\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2\right) + 2d h^2\mathbb {E}\left[ \int _{t_i}^{t_{i+1}}f(\Theta _r)^2dr\right] ; \end{aligned}$$
(5.9)
$$\begin{aligned} \mathbb {E}\left( h^2 \left| \overline{\Gamma }_{t_i}-\overline{\widehat{\Gamma }}_{t_i} \right| ^2\right) \le&~{} 2\lambda (\mathbb {R}^d) h \left( \mathbb {E}(H_{i+1}^2)-\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2\right) + 2\lambda (\mathbb {R}^d) h^2\mathbb {E}\left[ \int _{t_i}^{t_{i+1}}f(\Theta _r)^2dr\right] . \end{aligned}$$
(5.10)


Remark 5.1

Note that the finiteness of the Lévy measure \(\lambda \) is essential in the previous bound. The case of more general integro-differential operators, such as the fractional Laplacian mentioned in the introduction, is an interesting open problem.

Let us now come back to (5.5). Using (5.6) and (5.7),

$$\begin{aligned} \mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2 \le&~{} \left( 1+\gamma h\right) \mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2 \\&~{} + \left( 1+\gamma h\right) \frac{C}{\gamma } \left[ h^2+\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-Y_{t_i}|^2 ds\right) \right. \left. +h \mathbb {E}|Y_{t_i}- \widehat{\mathcal {V}}_{t_i}|^2\right. \nonumber \\&~{} \qquad \qquad \qquad \qquad + \left. \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{Z}_{t_i}|^2ds\right) +h \mathbb {E}\left| \overline{Z}_{t_i}-\overline{\widehat{Z}}_{t_i} \right| ^2 \right. \nonumber \\&~{} \qquad \qquad \qquad \qquad \left. + \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\Gamma }_{t_i}|^2ds\right) +h \mathbb {E}\left| \overline{\Gamma }_{t_i}-\overline{\widehat{\Gamma }}_{t_i} \right| ^2 \right] . \end{aligned}$$

Now use (5.9) and (5.10) to find that

$$\begin{aligned}&~{} \mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2\\&~{} \le (1+\gamma h)\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2 \\&~{} \qquad + (1+\gamma h)\frac{C}{\gamma }\left[ h^2 + \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-Y_{t_i}|^2ds\right) \right. + h\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i}\right| ^2\\&~{} \qquad \qquad \qquad \qquad \qquad +\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{Z}_{t_i}|^2ds\right) +\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\Gamma }_{t_i}|^2ds\right) \\&~{} \qquad \qquad \qquad \qquad \qquad +2d\left[ \mathbb {E}\left( H_{i+1}^2\right) -\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2\right] + 2dh \mathbb {E}\left( \int _{t_i}^{t_{i+1}}f(\Theta _r)^2 dr\right) \\&~{} \qquad \qquad \qquad \qquad \qquad + 2\lambda (\mathbb {R}^d)\left[ \mathbb {E}\left( H_{i+1}^2\right) -\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2\right] \left. +2\lambda (\mathbb {R}^d)h \mathbb {E}\left( \int _{t_i}^{t_{i+1}}f(\Theta _r)^2 dr\right) \right] . \end{aligned}$$

Let \(\gamma =C(\lambda (\mathbb {R}^d) + d)\) and define \(D:=(1+\gamma h)\frac{C}{\gamma }\); then the expression above is bounded by

$$\begin{aligned}&(1+\gamma h)\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2+ Dh^2 + D\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-Y_{t_i}|^2\right) \\&\quad + Dh \mathbb {E}|Y_{t_i}-\widehat{\mathcal {V}}_{t_i}|^2 + D \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z-\overline{Z}_{t_i}|^2ds\right) \\&\qquad + D\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\Gamma }_{t_i}|^2ds\right) + (1+\gamma h)\frac{C}{\gamma } 2d\mathbb {E}\left( H_{i+1}^2\right) + 2dDh\mathbb {E}\left( \int _{t_i}^{t_{i+1}}f(\Theta _r)^2dr\right) \\&\qquad + (1+\gamma h)\frac{C}{\gamma } 2\lambda (\mathbb {R}^d)\mathbb {E}\left( H_{i+1}^2\right) + 2\lambda (\mathbb {R}^d) Dh\mathbb {E}\left( \int _{t_i}^{t_{i+1}}f(\Theta _r)^2dr\right) \\&\qquad - 2(1+\gamma h) \mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2 \end{aligned}$$

Note that the first and last terms in the last expression combine: their sum equals \(-(1+\gamma h)\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2\le 0\), so it can be bounded from above by 0. We also group the terms involving \(\mathbb {E}\left( H_{i+1}^2\right) \) and those involving the integral of f, and bound them respectively. Due to the definition of D, from now on the constant C depends linearly on the dimension d and satisfies \(D\le C\). Replacing the last computation and moving the term involving \(h\,\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2\) to the left-hand side, we obtain

$$\begin{aligned}&(1-Ch)\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2\\ {}&\quad \le Ch^2+C\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-Y_{t_i}|^2ds\right) + C\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{Z}_{t_i}|^2ds\right) \\ {}&\qquad +C\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\Gamma }_{t_i}|^2ds\right) + C(1+Ch)\mathbb {E}\left( H_{i+1}^2\right) + Ch \mathbb {E}\left( \int _{t_i}^{t_{i+1}}f(\Theta _r)^2dr\right) . \end{aligned}$$

Taking h small enough that, for example, \(Ch\le \frac{1}{2}\), we obtain

$$\begin{aligned} \mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2 \le&~{} Ch^2+C\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-Y_{t_i}|^2ds\right) + C\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{Z}_{t_i}|^2ds\right) \\&~{} +C\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\Gamma }_{t_i}|^2ds\right) +Ch \mathbb {E}\left( \int _{t_i}^{t_{i+1}}f(\Theta _r)^2dr\right) \\&+ C(1+Ch)\mathbb {E}\left( H_{i+1}^2\right) . \end{aligned}$$

Finally, by recalling that \(H_{i+1} =Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1} (X^\pi _{t_{i+1}})\), we have established (5.1).\(\square \)

Step 2: The last term in (5.1),

$$\begin{aligned} C(1+Ch)\mathbb {E} \left| Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1} (X^\pi _{t_{i+1}}) \right| ^2, \end{aligned}$$

was left uncontrolled in the previous step. In what follows we provide a bound for this term. Recall the error terms \(\varepsilon ^{Z}(h)\) and \(\varepsilon ^{\Gamma }(h)\) introduced in (4.5). The purpose of this step is to show the following estimate:

Lemma 5.2

There exists a constant \(C>0\) such that,

$$\begin{aligned}&\max _{i\in \left\{ 0,...,N-1\right\} }\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X_{t_i}^\pi ) \right| ^2\nonumber \\&\quad \le ~{} C\Bigg [N \sum _{i=0}^{N-1} \mathbb {E}\left| \widehat{\mathcal {U}}_{i}(X_{t_i}^\pi )-\widehat{\mathcal {V}}_{t_i} \right| ^2 + h + \varepsilon ^{Z}(h)+\varepsilon ^{\Gamma }(h) + \mathbb {E}\left| g(X_T)-g(X_T^\pi ) \right| ^2 \Bigg ]. \end{aligned}$$
(5.11)

The rest of this section is devoted to the proof of this result.

Proof of Lemma 5.2

We use the elementary inequality \((a+b)^2\ge (1-h)a^2+(1-\frac{1}{h})b^2\), which follows from \(2ab\ge -h\,a^2-\frac{1}{h}\,b^2\), to obtain

$$\begin{aligned} \mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2&=\mathbb {E}\left| \left( Y_{t_i}-\widehat{\mathcal {U}}_{i}(X^\pi _{t_i})\right) + \left( \widehat{\mathcal {U}}_{i}(X^\pi _{t_i})-\widehat{\mathcal {V}}_{t_i}\right) \right| ^2\\&\ge (1-h)\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X^\pi _{t_i}) \right| ^2 + \left( 1-\frac{1}{h}\right) \mathbb {E}\left| \widehat{\mathcal {U}}_{i}(X^\pi _{t_i})-\widehat{\mathcal {V}}_{t_i} \right| ^2. \nonumber \end{aligned}$$
(5.12)

Therefore, we have an upper bound (5.1) and a lower bound (5.12) for \(\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2\). Combining these bounds,

$$\begin{aligned}&(1-h)\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X^\pi _{t_i}) \right| ^2 + \left( 1-\frac{1}{h}\right) \mathbb {E}\left| \widehat{\mathcal {U}}_{i}(X^\pi _{t_i})-\widehat{\mathcal {V}}_{t_i} \right| ^2 \\&\quad \le Ch^2+C\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-Y_{t_i}|^2ds\right) + C\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{Z}_{t_i}|^2ds\right) \\&\quad \quad +C\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\Gamma }_{t_i}|^2ds\right) +Ch \mathbb {E}\left( \int _{t_i}^{t_{i+1}}f(\Theta _r)^2dr\right) + C(1+Ch)\mathbb {E}\left( H_{i+1}^2\right) . \end{aligned}$$

Using that \((1-h)^{-1}\le 2\) for sufficiently small h, we get

$$\begin{aligned}&~{} \mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i} (X_{t_{i}}^{\pi }) \right| ^2 \\&~{} \le CN\mathbb {E}\left| \widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi })-\widehat{\mathcal {V}}_{t_i} \right| ^2 + Ch^2 \\&~{} \quad + C\left[ \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-Y_{t_i}|^2ds\right) + \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{Z}_{t_i}|^2ds\right) +\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\Gamma }_{t_i}|^2ds\right) \right] \\&~{} \quad + Ch\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|f(\Theta _s)|^2ds\right) +C\mathbb {E}\left| Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1}(X^{\pi }_{t_{i+1}}) \right| ^2. \end{aligned}$$

Notice that the quantity we want to estimate at time \(t_i\) appears on the right-hand side at time \(t_{i+1}\); hence we can iterate the bound and obtain, for all \(i\in \left\{ 0,...,N-1\right\} \),

$$\begin{aligned}&\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i} (X_{t_i}^\pi ) \right| ^2 \\&~{} \le N C\sum _{k=i}^{N-1}\mathbb {E}\left| \widehat{\mathcal {U}}_{k} (X_{t_k}^\pi ) -\widehat{\mathcal {V}}_{t_k} \right| ^2 + C(N-i)h^2 \\&~{} \quad + C\sum _{k=i}^{N-1}\left[ \mathbb {E}\left( \int _{t_k}^{t_{k+1}} |Y_s-Y_{t_k}|^2ds\right) +\mathbb {E}\left( \int _{t_k}^{t_{k+1}} |Z_s-\overline{Z}_{t_k}|^2ds\right) \right. \\ {}&\quad \left. + \mathbb {E}\left( \int _{t_k}^{t_{k+1}} |\Gamma _s-\overline{\Gamma }_{t_k}|^2ds\right) \right] \\&\quad +Ch\sum _{k=i}^{N-1}\mathbb {E}\left( \int _{t_k}^{t_{k+1}} |f(\Theta _s)|^2ds\right) +C\mathbb {E}\left| Y_{t_{N}}-g(X^{\pi }_{t_N}) \right| ^2\\&~{} \le NC\sum _{k=0}^{N-1}\mathbb {E}\left| \widehat{\mathcal {U}}_{k} (X_{t_k}^\pi ) -\widehat{\mathcal {V}}_{t_k} \right| ^2 + CNh^2 \\&~{} \quad + C\sum _{k=0}^{N-1}\left[ \mathbb {E}\left( \int _{t_k}^{t_{k+1}} |Y_s-Y_{t_k}|^2ds\right) \right. +\mathbb {E}\left( \int _{t_k}^{t_{k+1}} |Z_s-\overline{Z}_{t_k}|^2ds\right) \\ {}&\quad + \left. \mathbb {E}\left( \int _{t_k}^{t_{k+1}} |\Gamma _s-\overline{\Gamma }_{t_k}|^2ds\right) \right] \\&~{} \quad + Ch\sum _{k=0}^{N-1}\mathbb {E}\left( \int _{t_k}^{t_{k+1}}| f(\Theta _s)|^2ds\right) +C\mathbb {E}\left| Y_{t_{N}}-g(X^{\pi }_{t_N}) \right| ^2. \end{aligned}$$

Taking the maximum over \(i\in \left\{ 0,...,N-1\right\} \), recalling (4.5), and using the bounds from Lemmas 4.7 and 4.1,

$$\begin{aligned}&\max _{i\in \left\{ 0,...,N-1\right\} }\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X_{t_i}^\pi ) \right| ^2\\&\qquad \le ~{} C\Bigg [N\sum _{i=0}^{N-1} \mathbb {E}\left| \widehat{\mathcal {U}}_{i}(X_{t_i}^\pi )-\widehat{\mathcal {V}}_{t_i} \right| ^2 + O(h) +\varepsilon ^{Z}(h)+\varepsilon ^{\Gamma }(h) + \mathbb {E}\left| g(X_T)-g(X_T^\pi ) \right| ^2 \Bigg ]. \end{aligned}$$

This is exactly (5.11).\(\square \)

Remark 5.2

The classical bound used at the beginning of Step 2 could have been stated with a fixed parameter \(\delta \in (0,1)\) in the form \((a+b)^2 \ge (1-h^{\delta })a^2 + (1-\frac{1}{h^\delta })b^2\). This change turns the factor N into \(N^{\delta }\), which is better. However, at some point of the proof the value \(\delta = 1\) is necessary.

Step 3: Estimate (5.11) contains some uncontrolled terms on its right-hand side. The purpose of this step is to bound the term

$$\begin{aligned} \sum _{i=0}^{N-1} \mathbb {E}\left| \widehat{\mathcal {U}}_{i}(X_{t_i}^\pi )-\widehat{\mathcal {V}}_{t_i} \right| ^2, \end{aligned}$$

by more tractable quantities. In this step we will prove:

Lemma 5.3

There exists \(C>0\) such that,

$$\begin{aligned}&\max _{i\in \left\{ 0,...,N-1\right\} }\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X_{t_i}^\pi ) \right| ^2\nonumber \\&\quad \le ~{} C\Bigg [ h + \sum _{i=0}^{N-1} (N\mathcal {E}_i^v + \mathcal {E}_i^z + \mathcal {E}_i^\gamma ) +\varepsilon ^{Z}(h)+\varepsilon ^{\Gamma }(h) + \mathbb {E}\left| g(X_T)-g(X_T^\pi ) \right| ^2 \Bigg ], \end{aligned}$$
(5.13)

with \(\mathcal {E}_i^v\), \(\mathcal {E}_i^z\) and \(\mathcal {E}_i^\gamma \) defined in (4.11).

In what follows, we prove (5.13).

Proof

Fix \(i\in \left\{ 0,...,N-1\right\} \). Recall the martingale \((N_t)_{t\in [t_i,t_{i+1}]}\) and take \(t=t_{i+1}\),

$$\begin{aligned} \widehat{\mathcal {U}}_{i+1} (X^{\pi }_{t_{i+1}}) = \mathbb {E}_i \left( \widehat{\mathcal {U}}_{i+1}(X^{\pi }_{t_{i+1}})\right) + \int _{t_i}^{t_{i+1}} \widehat{Z}_s\cdot dW_s + \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d} \widehat{U}_s(y)\overline{\mu }(ds,dy). \end{aligned}$$

Now we replace the definition of \(\widehat{\mathcal {V}}_{t_i}\),

$$\begin{aligned} \widehat{\mathcal {U}}_{i+1} (X^{\pi }_{t_{i+1}}) = \widehat{\mathcal {V}}_{t_i} - f(t_i,X^{\pi }_{t_i},\widehat{\mathcal {V}}_{t_i}, \overline{\widehat{Z}}_{t_i},\overline{\widehat{\Gamma }}_{t_i})h +\int _{t_i}^{t_{i+1}} \widehat{Z}_s\cdot dW_s + \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d} \widehat{U}_s(y)\overline{\mu }(ds,dy). \end{aligned}$$
(5.14)

In what follows, recall the value of F in the loss function \(L_i(\theta )\) from (3.9), evaluated at the point

$$\begin{aligned} (t_i,X^\pi _{t_i},\mathcal {U}_i(X^\pi _{t_i};\theta ) ,\mathcal {Z}_i(X^\pi _{t_i};\theta ),\mathcal {G}_i(X^\pi _{t_i},\cdot ;\theta ),h,\Delta W_i), \end{aligned}$$

and that \(\langle \mathcal {G} \rangle _i(X_{t_i};\theta )\) is given in (3.8):

$$\begin{aligned}&F\left( t_i,X^\pi _{t_i},\mathcal {U}_i(X^\pi _{t_i};\theta ) ,\mathcal {Z}_i(X^\pi _{t_i};\theta ),\mathcal {G}_i(X^\pi _{t_i},\cdot ;\theta ),h,\Delta W_i \right) \\&~{} = \mathcal {U}_i(X^\pi _{t_i};\theta ) -hf(t_i,X^\pi _{t_i},\mathcal {U}_i(X^\pi _{t_i};\theta ),\mathcal {Z}_i(X^\pi _{t_i};\theta ) , \langle \mathcal {G} \rangle _i(X_{t_i};\theta )) + \mathcal {Z}_i(X^\pi _{t_i};\theta )\cdot \Delta W_i \\&\quad +\int _{\mathbb {R}^d} \mathcal {G}_i(X^\pi _{t_i},y;\theta ) \overline{\mu }\big ((t_i,t_{i+1}],dy\big ). \end{aligned}$$

Now fix a parameter \(\theta \) and substitute (5.14) into \(L_i(\theta )\):

$$\begin{aligned}&\mathbb {E} \Big |\widehat{\mathcal {U}}_{i+1} (X^{\pi }_{t_{i+1}}) -F(t_i,X^\pi _{t_i},\mathcal {U}_i(X^\pi _{t_i};\theta ),\mathcal {Z}_i(X^\pi _{t_i};\theta ),\mathcal {G}_i(X^\pi _{t_i},\cdot ;\theta ),\Delta t_i,\Delta W_i)\Big |^2\\&\quad =\mathbb {E}\Big |\widehat{\mathcal {V}}_{t_i} - f(t_i,X^{\pi }_{t_i},\widehat{\mathcal {V}}_{t_i}, \overline{\widehat{Z}}_{t_i},\overline{\widehat{\Gamma }}_{t_i})h +\int _{t_i}^{t_{i+1}} \widehat{Z}_s\cdot dW_s \\&\qquad + \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d} \widehat{U}_s(y)\overline{\mu }(ds,dy) - \mathcal {U}_i(X^{\pi }_{t_i};\theta )\\&\qquad +h f(t_i,X^{\pi }_{t_i},\mathcal {U}_i(X_{t_i};\theta ),\mathcal {Z}_i(X_{t_i};\theta ), \langle \mathcal {G} \rangle _i(X_{t_i};\theta )) -\mathcal {Z}_i(X^{\pi }_{t_i};\theta )\cdot \Delta W_i\\&\qquad -\int _{\mathbb {R}^d}\mathcal {G}_i(X^{\pi }_{t_i},y;\theta )\overline{\mu }(\Delta t_i,dy) \Big |^2\\&\quad = \mathbb {E} \Bigg |\left[ \widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta )\right. \\&\qquad \left. + h\left( f(t_i,X^{\pi }_{t_i},\mathcal {U}_i(X^{\pi }_{t_i};\theta ),\mathcal {Z}_{i}(X^{\pi }_{t_i};\theta ),\langle \mathcal {G} \rangle _{i}(X^{\pi }_{t_i};\theta )) - f(t_i,X^{\pi }_{t_i},\widehat{\mathcal {V}}_{t_i},\overline{\widehat{Z}}_{t_i},\overline{\widehat{\Gamma }}_{t_i}) \right) \right] \\&\qquad + \left[ \int _{t_i}^{t_{i+1}}\widehat{Z}_s\cdot dW_s - \int _{t_i}^{t_{i+1}}\mathcal {Z}_i(X^{\pi }_{t_i};\theta )\cdot dW_s\right. \\&\qquad \left. + \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}\widehat{U}(s,y)\overline{\mu }(ds,dy)-\int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}\mathcal {G}_i(X^{\pi }_{t_i},y;\theta )\overline{\mu }(ds,dy) \right] \Bigg |^2\\&\quad =\mathbb {E}\left| a+b \right| ^2. \end{aligned}$$

Note that b is a sum of martingale differences and therefore \(\mathbb {E}_i (b)=0\). By the independence of \(\mu \) and W, we deduce that

$$\begin{aligned} \mathbb {E}(b^2)&= \mathbb {E}\left( \int _{t_i}^{t_{i+1}}[\widehat{Z}_s-\mathcal {Z}_{i}(X^{\pi }_{t_i};\theta )]dW_s\right) ^2\\&\quad +\mathbb {E}\left( \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}[\widehat{U}(s,y)-\mathcal {G}_i(X^{\pi }_{t_i},y;\theta )]\overline{\mu }(ds,dy)\right) ^2; \end{aligned}$$

and, since the random variables appearing in a are \(\mathcal {F}_{t_i}\)-measurable, \(\mathbb {E}(ab)=\mathbb {E}\left( \mathbb {E}_i(ab)\right) =\mathbb {E}\left( a\mathbb {E}_i(b)\right) =0\); hence

$$\begin{aligned} L_i(\theta )&= \mathbb {E} \left( \widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta ) + h\left[ f(t_i,X^{\pi }_{t_i},\mathcal {U}_i(X^{\pi }_{t_i};\theta ),\mathcal {Z}_{i}(X^{\pi }_{t_i};\theta ),\langle \mathcal {G} \rangle _{i}(X^{\pi }_{t_i};\theta ))\right. \right. \\&\quad \left. \left. - f(t_i,X^{\pi }_{t_i},\widehat{\mathcal {V}}_{t_i},\overline{\widehat{Z}}_{t_i},\overline{\widehat{\Gamma }}_{t_i}) \right] \right) ^2\\&\quad + \underbrace{\mathbb {E}\left( \int _{t_i}^{t_{i+1}}[\widehat{Z}_s-\mathcal {Z}_{i}(X^{\pi }_{t_i};\theta )]dW_s\right) ^2+\mathbb {E}\left( \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}[\widehat{U}(s,y)-\mathcal {G}_i(X^{\pi }_{t_i},y;\theta )]\overline{\mu }(ds,dy)\right) ^2}_{c_0}. \end{aligned}$$

By the same arguments on equations (5.6) and (5.7),

$$\begin{aligned} c_0&= \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\widehat{Z}_s-\overline{\widehat{Z}}_{t_i}|^2ds\right) + h\mathbb {E}\left| \overline{\widehat{Z}}_{t_i}-\mathcal {Z}_{i}(X^{\pi }_{t_i};\theta ) \right| ^2\\&\quad \quad \quad + \mathbb {E}\left( \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}|\widehat{U}_s (y)-\overline{\widehat{U}}_{t_i}(y)|^2\lambda (dy)ds\right) \\&\quad \quad \quad + h\mathbb {E}\left( \int _{\mathbb {R}^d} \big (\overline{\widehat{U}}_{t_i}(y)-\mathcal {G}_{i}(X^{\pi }_{t_i},y;\theta )\big )^2\lambda (dy)\right) . \end{aligned}$$

With this decomposition of \(L_i(\theta )\), for the purposes of optimization we can ignore the part that does not depend on the parameter \(\theta \). Let

$$\begin{aligned}&\hat{L}_i(\theta ) \\&\quad = \mathbb {E} \left( \widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta ) + h\left[ f(t_i,X^{\pi }_{t_i},\mathcal {U}_i(X^{\pi }_{t_i};\theta ), \mathcal {Z}_{i}(X^{\pi }_{t_i};\theta ),\langle \mathcal {G} \rangle _{i}(X^{\pi }_{t_i};\theta )) \right. \right. \\ {}&\quad \quad \left. \left. - f(t_i,X^{\pi }_{t_i},\widehat{\mathcal {V}}_{t_i},\overline{\widehat{Z}}_{t_i},\overline{\widehat{\Gamma }}_{t_i})\right] \right) ^2\\&\qquad + h\, \mathbb {E}\left| \overline{\widehat{Z}}_{t_i}-\mathcal {Z}_{i}(X^{\pi }_{t_i};\theta ) \right| ^2 + h\,\mathbb {E}\left( \int _{\mathbb {R}^d} \big (\overline{\widehat{U}}_{t_i}(y)-\mathcal {G}_{i}(X^{\pi }_{t_i},y;\theta )\big )^2\lambda (dy)\right) . \end{aligned}$$

Let \(\gamma >0\); using Young's inequality and the Lipschitz condition on f, we find that

$$\begin{aligned}&\mathbb {E} \left( \widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta ) + h\left[ f(t_i,X^{\pi }_{t_i},\mathcal {U}_i(X^{\pi }_{t_i};\theta ),\mathcal {Z}_{i}(X^{\pi }_{t_i};\theta ),\langle \mathcal {G} \rangle _{i}(X^{\pi }_{t_i};\theta )) \right. \right. \\&\quad \quad \left. \left. - f(t_i,X^{\pi }_{t_i},\widehat{\mathcal {V}}_{t_i},\overline{\widehat{Z}}_{t_i},\overline{\widehat{\Gamma }}_{t_i}) \right] \right) ^2\le \left( 1\!+\!\gamma h\right) \mathbb {E}\left| \widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta ) \right| ^2\\&\quad \quad + \left( 1+\frac{1}{\gamma h}\right) h^2K^2\mathbb {E}\left( |\widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta )|^2+|\mathcal {Z}_i (X^{\pi }_{t_i};\theta )-\overline{\widehat{Z}}_{t_i}|^2+|\langle \mathcal {G} \rangle _i(X_{t_{i}}^{\pi };\theta )-\overline{\widehat{\Gamma }}_{t_i}|^2\right) \\&\quad \le (1+Ch)\mathbb {E}\left| \widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta ) \right| ^2 \\&\quad \quad + Ch\Bigg [\mathbb {E}|\mathcal {Z}_i (X^{\pi }_{t_i};\theta )-\overline{\widehat{Z}}_{t_i}|^2+\mathbb {E}\left( \left\Vert \overline{\widehat{U}}_{t_i}(\cdot )-\mathcal {G}_i(X_{t_{i}}^{\pi },\cdot ;\theta )\right\Vert _{L^2(\lambda )}^2 \right) \Bigg ]. \end{aligned}$$

Therefore, we have an upper bound on \(\hat{L}(\theta )\), valid for all \(\theta \):

$$\begin{aligned} \hat{L}(\theta )&\le C\mathbb {E}\left| \widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta ) \right| ^2 + h\left( \mathbb {E}|\mathcal {Z}_i (X^{\pi }_{t_i};\theta ) - \overline{\widehat{Z}}_{t_i}|^2\right) \\&\quad +h\mathbb {E}\left( \left\Vert \overline{\widehat{U}}_{t_i}(\cdot )-\mathcal {G}_i(X_{t_{i}}^{\pi },\cdot ;\theta )\right\Vert _{L^2(\lambda )}^2\right) . \end{aligned}$$

To find a lower bound, we use \((a+b)^2\ge (1-\gamma h)a^2+\left( 1-\frac{1}{\gamma h}\right) b^2\) with \(\gamma >0\)

$$\begin{aligned}&\mathbb {E} \left( \widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta ) + h\left[ f(t_i,X^{\pi }_{t_i},\widehat{\mathcal {V}}_{t_i},\overline{\widehat{Z}}_{t_i}, \overline{\widehat{\Gamma }}_{t_i})\right. \right. \\ {}&\quad \quad \left. \left. - f(t_i,X^{\pi }_{t_i},\mathcal {U}_i(X^{\pi }_{t_i};\theta ),\mathcal {Z}_{i}(X^{\pi }_{t_i};\theta ),\langle \mathcal {G} \rangle _{i}(X^{\pi }_{t_i};\theta )) \right] \right) ^2\\&\quad \ge \left( 1-Ch\right) \mathbb {E}\left| \widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta ) \right| ^2 -\frac{h}{2} \left( \mathbb {E}|\mathcal {Z}_i (X^{\pi }_{t_i};\theta )-\overline{\widehat{Z}}_{t_i}|^2+\mathbb {E}|\langle \mathcal {G} \rangle _i (X^{\pi }_{t_i};\theta )-\overline{\widehat{\Gamma }}_{t_i} |^2\right) ; \end{aligned}$$

where we used \(\gamma = 6K^2\). Then,

$$\begin{aligned} \hat{L}(\theta )&\ge \left( 1- Ch\right) \mathbb {E}\left| \widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta ) \right| ^2 \\&\quad -\frac{h}{2} \Bigg [\mathbb {E}|\mathcal {Z}_i (X^{\pi }_{t_i};\theta )-\overline{\widehat{Z}}_{t_i}|^2+\mathbb {E}\left( \int _{\mathbb {R}^d} \big (\overline{\widehat{U}}_{t_i}(y)-\mathcal {G}_{i}(X^{\pi }_{t_i},y;\theta )\big )^2\lambda (dy)\right) \Bigg ]. \end{aligned}$$

Combining these bounds with the fact that \(\hat{L}(\theta ^*)\le \hat{L}(\theta )\) yields, for all \(\theta \),

$$\begin{aligned}&\left( 1- Ch\right) \mathbb {E}\left| \widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta ^*) \right| ^2 +\frac{h}{2} \mathbb {E}|\overline{\widehat{Z}}_{t_i}-\mathcal {Z}_i (X_{t_i},\theta ^*)|^2\\&\quad +\frac{h}{2}\mathbb {E}\left( \int _{\mathbb {R}^d} \big (\overline{\widehat{U}}_{t_i}(y)-\mathcal {G}_{i}(X^{\pi }_{t_i},y;\theta ^*)\big )^2\lambda (dy)\right) \\&\le C\mathbb {E}\left| \widehat{\mathcal {V}}_{t_i}-\mathcal {U}_i(X^{\pi }_{t_i};\theta ) \right| ^2 \\&\quad + h\left( \mathbb {E}|\overline{\widehat{Z}}_{t_i}-\mathcal {Z}_i (X_{t_i},\theta )|^2\right) +h\mathbb {E}\left( \left\Vert \overline{\widehat{U}}_{t_i}(\cdot )-\mathcal {G}_i(X_{t_{i}}^{\pi },\cdot ;\theta )\right\Vert _{L^2(\lambda )}^2\right) . \end{aligned}$$

Taking the infimum over \(\theta \) on the right-hand side and choosing h small enough that \((1-Ch)\ge \frac{1}{2}\),

$$\begin{aligned}&~{} \mathbb {E}\left| \widehat{\mathcal {V}}_{t_i}-\widehat{\mathcal {U}}_i(X^{\pi }_{t_i}) \right| ^2 +\frac{h}{2} \mathbb {E}|\overline{\widehat{Z}}_{t_i}-\widehat{\mathcal {Z}}_i (X_{t_i})|^2+\frac{h}{2}\mathbb {E}\left( \int _{\mathbb {R}^d} \big (\overline{\widehat{U}}_{t_i}(y)-\widehat{\mathcal {G}}_{i} (X^{\pi }_{t_i},y)\big )^2\lambda (dy)\right) \nonumber \\&~{} \qquad \le C\left( \mathcal {E}_i^v + h\mathcal {E}_i^z + h\mathcal {E}_i^{\gamma } \right) . \end{aligned}$$
(5.15)

Using this bound together with what we found in Steps 1 and 2, we find

$$\begin{aligned} \max _{i\in \left\{ 0,...,N-1\right\} }\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X_{t_i}^\pi ) \right| ^2\le&~{} C\left[ h + \sum _{i=0}^{N-1} (N\mathcal {E}_i^v + \mathcal {E}_i^z + \mathcal {E}_i^\gamma ) + \sum _{i=0}^{N-1}\mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s-Y_{t_i}|^2ds\right) \right. \\&\qquad + \varepsilon ^{Z}(h)+\varepsilon ^{\Gamma }(h) + \mathbb {E}\left| g(X_T)-g(X_T^\pi ) \right| ^2 \Bigg ]. \end{aligned}$$

Finally, using Proposition 4.1, one ends the proof of (5.13).\(\square \)

Step 4: We now show some bounds for the terms involving the \(\Gamma \) and U components; the analogous bounds hold for the Z component and are shown in [31]. Using (5.10) in (5.7),

$$\begin{aligned} \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _t - \overline{\widehat{\Gamma }}_{t_i}|^2 dt\right) \!\le \!&~{} \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _t \!-\! \overline{\Gamma }_{t_i}|^2 dt\right) \!+\! 2\lambda (\mathbb {R}^d)\left( \mathbb {E}\left( H_{i+1}^2\right) - \mathbb {E}\left| \mathbb {E}_i (H_{i+1}) \right| ^2\right) \\&+ 2h\lambda (\mathbb {R}^d)\mathbb {E}\left( \int _{t_i}^{t_{i+1}}f(\Theta _r)^2 dr\right) , \end{aligned}$$

which implies, after summing over i and using (4.5) and Lemma 4.7,

$$\begin{aligned} \mathbb {E}\left( \sum _{i=0}^{N-1}\int _{t_i}^{t_{i+1}}|\Gamma _t - \overline{\widehat{\Gamma }}_{t_i}|^2 dt\right)&\le \mathbb {E}\left( \sum _{i=0}^{N-1}\int _{t_i}^{t_{i+1}}|\Gamma _t - \overline{\Gamma }_{t_i}|^2 dt\right) \\&\quad + C\sum _{i=0}^{N-1}\left( \mathbb {E}\left( H_{i+1}^2\right) - \mathbb {E}\left| \mathbb {E}_i (H_{i+1}) \right| ^2\right) + Ch\\&= \varepsilon ^{\Gamma }(h) + Ch + C\sum _{i=0}^{N-1}\left( \mathbb {E}\left( H_{i+1}^2\right) - \mathbb {E}\left| \mathbb {E}_i (H_{i+1}) \right| ^2\right) . \end{aligned}$$

From [31] we get the analogous bound for the Z component; putting these two together yields

$$\begin{aligned}&\mathbb {E}\left( \sum _{i=0}^{N-1}\int _{t_i}^{t_{i+1}}\big (|Z_t - \overline{\widehat{Z}}_{t_i}|^2 + |\Gamma _t - \overline{\widehat{\Gamma }}_{t_i}|^2\big ) dt\right) \nonumber \\&\quad \le \varepsilon ^{Z}(h)+\varepsilon ^{\Gamma }(h) + Ch + C\sum _{i=0}^{N-1}\left( \mathbb {E}\left( H_{i+1}^2\right) - \mathbb {E}\left| \mathbb {E}_i (H_{i+1}) \right| ^2\right) . \end{aligned}$$
(5.16)

This tells us that the next task in the proof is to give a suitable bound for \(\mathbb {E}\left( H_{i+1}^2\right) - \mathbb {E}\left| \mathbb {E}_i (H_{i+1}) \right| ^2\). Recall from (5.8) that \(H_{i+1} = Y_{t_{i+1}} - \widehat{\mathcal {U}}_{i+1} (X_{t_{i+1}}^{\pi })\); then

$$\begin{aligned} \begin{aligned} \sum _{i=0}^{N-1}\left( \mathbb {E}\left( H_{i+1}^2\right) - \mathbb {E}\left| \mathbb {E}_i (H_{i+1}) \right| ^2\right)&= \sum _{i=0}^{N-1}\mathbb {E}(H_{i+1}^2)-\sum _{i=0}^{N-1}\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2\\&= \mathbb {E}\left| Y_{t_N}-\widehat{\mathcal {U}}_N(X_{t_{N}}^{\pi }) \right| ^2 + \sum _{i=0}^{N-2}\mathbb {E}(H_{i+1}^2) - \sum _{i=0}^{N-1}\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2\\&\le \mathbb {E}\left| Y_{t_N}-\widehat{\mathcal {U}}_N(X_{t_{N}}^{\pi }) \right| ^2 +\mathbb {E}(H_{0}^2) + \sum _{i=1}^{N-1}\mathbb {E}(H_{i}^2) \\&\quad - \sum _{i=0}^{N-1}\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2\\&= \mathbb {E}\left| g(X_T)-g(X_T^\pi ) \right| ^2 +\sum _{i=0}^{N-1} \left( \mathbb {E}(H_{i}^2) - \mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2 \right) . \end{aligned} \end{aligned}$$
(5.17)

From (5.12) and (5.5) we have a lower and an upper bound on \(\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2\). Indeed, one first has

$$\begin{aligned} (1-h)\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X^\pi _{t_i}) \right| ^2 \le \mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2 + \left( \frac{1}{h}-1\right) \mathbb {E}\left| \widehat{\mathcal {U}}_{i}(X^\pi _{t_i})-\widehat{\mathcal {V}}_{t_i} \right| ^2. \end{aligned}$$
(5.18)

Second, we have that for all \(\gamma >0\)

$$\begin{aligned}&\left( 1-h\right) \, \mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) \right| ^2 \le \left( \frac{1}{h} -1\right) \mathbb {E}\left| \widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi })-\widehat{\mathcal {V}}_{t_i} \right| ^2 + (1+\gamma h)\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2 +(1+\gamma h)\frac{C}{\gamma }\\&\bigg [\underbrace{h^2+ \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Y_s - Y_{t_i}|^2 ds\right) + h\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2+ \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|Z_s-\overline{\widehat{Z}}_{t_i}|^2 ds\right) + \mathbb {E}\left( \int _{t_i}^{t_{i+1}}|\Gamma _s-\overline{\widehat{\Gamma }}_{t_i}|^2 ds\right) }_{B_i}\bigg ]. \end{aligned}$$

where \(B_i\) denotes the expression inside the square brackets. Subtracting \((1-h) \mathbb {E}\left| \mathbb {E}_i(H_{i+1})\right| ^2\) from both sides and dividing by \((1-h)\),

$$\begin{aligned} \begin{aligned} \mathbb {E}(H_i^2) - \mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2 \le&~{} \frac{1}{h}\mathbb {E}\left| \widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) - \widehat{\mathcal {V}}_{t_i} \right| ^2 + \left( \frac{h+ \gamma h}{1-h}\right) \mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2 +\frac{C}{\gamma }\frac{(1+ \gamma h)}{(1-h)}B_i. \end{aligned} \end{aligned}$$

For \(\gamma = 3C\) and h sufficiently small, we can ensure

$$\begin{aligned} \frac{C}{\gamma }\frac{(1+\gamma h)}{(1-h)} \le \frac{1}{2}\qquad \text {and}\qquad \frac{1}{1-h}\le 2. \end{aligned}$$

Hence,

$$\begin{aligned} \mathbb {E}(H_i^2)&- \mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2 \le \frac{1}{h}\mathbb {E}\left| \widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) - \widehat{\mathcal {V}}_{t_i} \right| ^2 + Ch\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2 +\frac{1}{2}B_i. \end{aligned}$$

Finally, note that, by the conditional Jensen inequality and the definition of \(H_{i+1}\),

$$\begin{aligned} \sum _{i=0}^{N-1}\mathbb {E}\left| \mathbb {E}_i(H_{i+1}) \right| ^2\le \mathbb {E}\left| g(X_T)-g(X_T^\pi ) \right| ^2+N\underset{i=0,...,N-1}{\max }\ \mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) \right| ^2. \end{aligned}$$
(5.19)

Remark 5.3

Note that in equation (5.19) the factor N multiplies the last term. With the bounds at our disposal it is impossible to get rid of this N, and this is why the \(\delta \) improvement mentioned in Remark 5.2 is not of much help.

Coming back to (5.17),

$$\begin{aligned} \begin{aligned} \sum _{i=0}^{N-1}\left( \mathbb {E}\left( H_{i+1}^2\right) - \mathbb {E}\left| \mathbb {E}_i (H_{i+1}) \right| ^2\right) \le&~{} 2\mathbb {E}\left| g(X_T)-g(X_T^\pi ) \right| ^2 + N \sum _{i=0}^{N-1}\mathbb {E}\left| \widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) - \widehat{\mathcal {V}}_{t_i} \right| ^2 \\&~{} + Ch N \underset{i=0,...,N-1}{\max }\ \mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) \right| ^2 +\frac{1}{2} \sum _{i=0}^{N-1} B_i. \end{aligned} \end{aligned}$$

Therefore, plugging this bound into (5.16), noting that \(|Y_{t_i}-\widehat{\mathcal {V}}_{t_i}|^2\le 2|Y_{t_i}-\widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) |^2 + 2|\widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) - \widehat{\mathcal {V}}_{t_i}|^2\) and that \(hN = 1\), and using Lemma 4.1, we have, for some \(C>0\),

$$\begin{aligned}&\mathbb {E}\bigg (\sum _{i=0}^{N-1}\int _{t_i}^{t_{i+1}}\big (|Z_t - \overline{\widehat{Z}}_{t_i}|^2 + |\Gamma _t - \overline{\widehat{\Gamma }}_{t_i}|^2\big ) dt\bigg )\\&\quad \le C\bigg [\mathbb {E}\left| g(X_T)-g(X_T^\pi ) \right| ^2 + \varepsilon ^{Z}(h)+\varepsilon ^{\Gamma }(h) + h\\&~{}\qquad + N \sum _{i=0}^{N-1}\mathbb {E}\left| \widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) - \widehat{\mathcal {V}}_{t_i} \right| ^2 + \underset{i=0,...,N-1}{\max }\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) \right| ^2\bigg ] . \end{aligned}$$

Now, use (5.15) together with Lemma 5.3 to get

$$\begin{aligned}&\mathbb {E}\bigg (\sum _{i=0}^{N-1}\int _{t_i}^{t_{i+1}}\big (|Z_t - \overline{\widehat{Z}}_{t_i}|^2 + |\Gamma _t - \overline{\widehat{\Gamma }}_{t_i}|^2\big ) dt\bigg ) \\&\quad \le C\bigg [\mathbb {E}\left| g(X_T)-g(X_T^\pi ) \right| ^2 + \varepsilon ^{Z}(h)+\varepsilon ^{\Gamma }(h) + h\\&\quad + \sum _{i=0}^{N-1} (N\mathcal {E}_i^v + \mathcal {E}_i^z + \mathcal {E}_i^\gamma )\bigg ] . \end{aligned}$$

Finally, recalling (5.15) and the previous bound, and noting that

$$\begin{aligned}&\sum _{i=0}^{N-1}\mathbb {E}\left( \int _{t_i}^{t_{i+1}}\Big [|Z_t-\widehat{\mathcal {Z}}_{i}(X_{t_{i}}^{\pi })|^2 + |\Gamma _t-\widehat{\langle \mathcal {G} \rangle }_{i}(X_{t_{i}}^{\pi })|^2\Big ]dt\right) \\&\quad \le \sum _{i=0}^{N-1}\mathbb {E}\left( \int _{t_i}^{t_{i+1}}\Big [|Z_t-\overline{\widehat{Z}}_{t_i}|^2 + |\Gamma _t-\overline{\widehat{\Gamma }}_{t_i}|^2\Big ]dt\right) \\&\qquad + \sum _{i=0}^{N-1}h\mathbb {E}\left( \Bigg [|\overline{\widehat{Z}}_{t_i}-\widehat{\mathcal {Z}}_{i}(X_{t_{i}}^{\pi })|^2 + \left\Vert \overline{\widehat{U}}_{t_i}(\cdot )-\widehat{\mathcal {G}}_{i}(X_{t_{i}}^{\pi },\cdot )\right\Vert _{L^2(\lambda )}^2\Bigg ]\right) , \end{aligned}$$

we conclude that there exists \(C>0\), independent of the partition, such that, for h sufficiently small,

$$\begin{aligned}&\underset{i=0,...,N-1}{\max }\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) \right| ^2 + \sum _{i=0}^{N-1}\mathbb {E}\left( \int _{t_i}^{t_{i+1}}\Big [|Z_t-\widehat{\mathcal {Z}}_{i}(X_{t_{i}}^{\pi })|^2 + |\Gamma _t-\widehat{\langle \mathcal {G} \rangle }_{i}(X_{t_{i}}^{\pi })|^2\Big ]dt\right) \\&\quad \le C\left[ h + \sum _{i=0}^{N-1} (N\mathcal {E}_i^v + \mathcal {E}_i^z + \mathcal {E}_i^\gamma ) +\right. \varepsilon ^{Z}(h)+\varepsilon ^{\Gamma }(h) + \mathbb {E}\left| g(X_T)-g(X_T^\pi ) \right| ^2 \Bigg ]. \nonumber \end{aligned}$$

This establishes the claimed estimate and completes the proof of Theorem 5.1.\(\square \)

We conclude with some remarks on the proof.

Remark 5.4

Note that the terms \(\mathcal {E}_i^v\), \(\mathcal {E}_i^z\) and \(\mathcal {E}_i^\gamma \) can be made arbitrarily small, in view of Lemma 2.1. The challenge here, as in almost every DL algorithm, is that we do not know how many units per layer, i.e., how large \(\kappa \), we need to take in order to achieve a fixed tolerance; we can only ensure the existence of an NN architecture satisfying the approximation property.

Remark 5.5

The main difficulty in adapting the proof given in [31] was to find a useful definition of the third NN, whose task is to approximate the nonlocal component. This was delicate because there are two options. The first is to define the NN to approximate the whole integral

$$\begin{aligned} \int _{\mathbb {R}^d}[u(t_i,X_{t_i}^{\pi }+\beta (X_{t_i}^{\pi },y))-u(t_i,X_{t_i}^{\pi })]\lambda (dy), \end{aligned}$$

This seems intuitive because the third NN would then approximate the nonlocal part of the PIDE directly and would therefore take a single argument, \(X_{t_{i}}^{\pi }\). However, we also need to approximate, or be able to compute, the stochastic integral

$$\begin{aligned} \int _{\mathbb {R}^d}[u(t_i,X_{t_i}^{\pi }+\beta (X_{t_i}^{\pi },y))-u(t_i,X_{t_i}^{\pi })]\bar{\mu }\left( (t_i,t_{i+1}],dy\right) , \end{aligned}$$

which cannot be done knowing only the first integral. To overcome this issue, we chose to approximate the integrand itself and to handle the actual integration of this function with other tools.

Remark 5.6

The nonlocal part of the PIDE (1.2) leads us to add a Lévy process, which is a canonical tool when dealing with nonlocal operators such as the one appearing in equation (1.2). This addition results in natural analogues of the objects from [31], such as the \(\Gamma , \bar{\Gamma }\) components for the nonlocal case.

Remark 5.7

The result of the theorem states that the better we can approximate \(v_i, z_i, \gamma _i\) by NN architectures, the better we can approximate \((Y_{t_i}, Z_{t_i}, \Gamma _{t_i})\) by \((\widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }),\widehat{\mathcal {Z}}_{i}(X_{t_{i}}^{\pi }),\langle \widehat{\mathcal {G}} \rangle _{i}(X_{t_{i}}^{\pi }))\).

Remark 5.8

Because Theorem 5.1 requires the measure \(\lambda \) to be finite, the case of the fractional Laplacian mentioned in the introduction is not covered. We hope to extend our results to this case in forthcoming work.

5.1 Optimization step of the algorithm

In this subsection we give a brief explanation of how to compute the loss function from Algorithm 1 in order to run it. As usual, we extend the computation of the loss function presented in [31] to our nonlocal case, for which we need to introduce the following definitions. For a càdlàg process \((C_s)_{s\in [0,T]}\), \(\Delta C_s:=C_s - C_{s^-}\) denotes the jump of C at time \(s\in [0,T]\), and for a process \(U\in L_{\mu }^1(\mathbb {R}^d)\) the stochastic integral with respect to \(\mu \) ([4, Sections 2 and 4]) is defined as follows:

$$\begin{aligned} \int _{s}^t \int _{\mathbb {R}^d} U(r,y)\mu (dr,dy) := \sum _{r\in (s,t]} U(r,\Delta P_r)\mathbb {1}_{\mathbb {R}^d} (\Delta P_r), \end{aligned}$$

where

$$\begin{aligned} \left( P_s = \int _{\mathbb {R}^d} x\mu (s,dx)\right) _s, \end{aligned}$$

is a compound Poisson process (see [4, Thm 2.3.10]). Therefore,

$$\begin{aligned} \int _{s}^t \int _{\mathbb {R}^d} U(r,y)\bar{\mu }(dr,dy) = \sum _{r\in (s,t]} U(r,\Delta P_r)\mathbb {1}_{\mathbb {R}^d} (\Delta P_r) - \int _{s}^t \int _{\mathbb {R}^d} U(r,y)\lambda (dy) dr. \end{aligned}$$

For simplicity, assume that \(\lambda \) is a probability measure absolutely continuous with respect to the Lebesgue measure. As we will see, several simulations of the Lévy process \((X_t)_{t\in [0,T]}\) are needed.

As shown in Algorithm 1, given \(\widehat{\mathcal {U}}_{i+1}\) for \(i\in \left\{ 0,...,N-1\right\} \), we need to minimize \(L_i(\cdot )\) and define the NNs for step i. Recalling the definition of \(L_i\) in (3.9), the idea is to replace the expected value in the loss function by an average over simulations. Let \(M\in \mathbb {N}\) and \(I = \left\{ 1,..., M\right\} \), and generate samples \(\left\{ x^i_k: k\in I\right\} \), \(\left\{ x^{i+1}_k: k\in I\right\} \), \(\left\{ w_k:k\in I\right\} \) of \(X_{t_{i}}^{\pi }\), \(X_{t_{i+1}}^{\pi }\) and \(\Delta W_i\), respectively. Then,

$$\begin{aligned} L_i(\theta )\approx \frac{1}{M}\sum _{k\in I} \big (\widehat{\mathcal {U}}_{i+1}(x^{i+1}_k)-F(t_i,x^i_k,\mathcal {U}_i(x^i_k;\theta ),\mathcal {Z}_i(x^i_k;\theta ),\mathcal {G}_i(x^i_k,\cdot ;\theta ),h,w_k)\big )^2. \end{aligned}$$
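To fix ideas, the following minimal Python sketch (not the implementation used in this article) assembles this empirical loss, writing out F as in (3.9). The callables `f`, `U_i`, `Z_i`, `U_next` (playing the role of \(\widehat{\mathcal {U}}_{i+1}\)) and the per-sample nonlocal quantities `lam_int` and `comp_int` (the two integrals of \(\mathcal {G}_i\) discussed below) are assumed to be given; all names are illustrative.

```python
import numpy as np

def empirical_loss(theta, f, U_next, U_i, Z_i, t_i, h, x_i, x_ip1, dW,
                   lam_int, comp_int):
    """Monte Carlo estimate of L_i(theta) over M simulated samples.

    x_i, x_ip1 : arrays of shape (M, d) with samples of X^pi_{t_i}, X^pi_{t_{i+1}}
    dW         : array of shape (M, d) with samples of Delta W_i
    lam_int[k] : estimate of  int_{R^d} G_i(x_i[k], y; theta) lambda(dy)
    comp_int[k]: estimate of  int G_i(x_i[k], y; theta) bar{mu}((t_i, t_{i+1}], dy)
    """
    M = x_i.shape[0]
    total = 0.0
    for k in range(M):
        u = U_i(x_i[k], theta)                      # U_i(x^i_k; theta)
        z = Z_i(x_i[k], theta)                      # Z_i(x^i_k; theta), shape (d,)
        # F from (3.9): one Euler-type step of the backward equation
        F_k = (u - h * f(t_i, x_i[k], u, z, lam_int[k])
               + np.dot(z, dW[k]) + comp_int[k])
        total += (U_next(x_ip1[k]) - F_k) ** 2
    return total / M
```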

Note that we are using an Euler scheme for the simulations of \((X_t)_{t\in [0,T]}\); nevertheless, other methods exist depending on the structure of the diffusion, see [14, 34]. Recall that F requires two different integrals of \(\mathcal {G}_i(x^i_k,\cdot ;\theta )\). To approximate them, let \(L\in \mathbb {N}\) and \(J=\left\{ 1,...,L\right\} \), and consider, for every \(k\in I\), samples \(\left\{ y^k_{l}: l\in J\right\} \) of a random variable \(Y\sim \lambda \); here the finiteness of the measure is essential. Then, the quantities we need can be computed as follows:

$$\begin{aligned} \int _{\mathbb {R}^d}\mathcal {G}_i(x^i_k,y;\theta )\lambda (dy)&= \mathbb {E}(\mathcal {G}_i(x^i_k,Y;\theta ))\approx \frac{1}{L}\sum _{l\in J} \mathcal {G}_i(x^i_k,y^k_{l};\theta )\\ \int _{\mathbb {R}^d}\mathcal {G}_i(x^i_k,y;\theta )\bar{\mu }\left( (t_i,t_{i+1}],dy\right)&= \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d}\mathcal {G}_i(x^i_k,y;\theta )\mu \left( dt,dy\right) \\ {}&- \int _{t_i}^{t_{i+1}}\int _{\mathbb {R}^d} \mathcal {G}_i(x^i_k,y;\theta )dt\lambda (dy) \\&\approx \sum _{t_i\le s<t_{i+1}} \mathcal {G}_i(x^i_k,\Delta P_s;\theta )\mathbb {1}_{\mathbb {R}^d} (\Delta P_s) - \frac{h}{L}\sum _{l\in J} \mathcal {G}_i(x^i_k,y^k_{l};\theta ). \end{aligned}$$
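A minimal Python sketch of these two approximations, under the stated assumption that \(\lambda \) is a probability measure we can sample from; the names `G_i`, `lam_samples` (the draws \(y^k_l\) of \(Y\sim \lambda \)) and `jump_sizes` (the jumps \(\Delta P_s\) of the compound Poisson process on \((t_i,t_{i+1}]\)) are illustrative.

```python
import numpy as np

def nonlocal_integrals(G_i, x, theta, lam_samples, jump_sizes, h):
    """Monte Carlo approximation of the two integrals of G_i(x, . ; theta).

    lam_samples : array of shape (L, d), i.i.d. draws of Y ~ lambda
    jump_sizes  : array of shape (n_jumps, d), jumps Delta P_s with s in (t_i, t_{i+1}]
    """
    # int_{R^d} G_i(x, y; theta) lambda(dy)  ~  (1/L) * sum_l G_i(x, y^k_l; theta)
    lam_int = np.mean([G_i(x, y, theta) for y in lam_samples])

    # int int G_i(x, y; theta) mu(dt, dy): sum over the jumps of P in (t_i, t_{i+1}]
    jump_sum = sum(G_i(x, dp, theta) for dp in jump_sizes)

    # integral against the compensated measure bar{mu} = mu - lambda(dy) dt
    comp_int = jump_sum - h * lam_int
    return lam_int, comp_int
```

In this notation, `lam_int` and `comp_int` are exactly the per-sample quantities fed into the empirical loss sketched above.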

Therefore, provided we can simulate trajectories of \((X_t)_{t\in [0,T]}\) and \((W_t)_{t\in [0,T]}\), realizations of \(Y\sim \lambda \), and the compound Poisson process \((P_t)_{t\in [0,T]}\), we can minimize \(L_i\), find the optimal \(\theta ^*\), and define

$$\begin{aligned} (\widehat{\mathcal {U}}_{i},\widehat{\mathcal {Z}}_{i},\widehat{\mathcal {G}}_{i})=(\mathcal {U}_i(\cdot ;\theta ^*),\mathcal {Z}_i(\cdot ;\theta ^*),\mathcal {G}_i(\cdot ,\circ ;\theta ^*)). \end{aligned}$$

Remark 5.9

The nonlocal term in equation (1.2) adds complexity not only to the proof of the consistency of the algorithm but also to the algorithm itself. As we saw, the finiteness of the measure \(\lambda \) is key, as is the ability to simulate integrals with respect to Poisson random measures and trajectories of the Lévy process. The implementation of this method and its extension to PIDEs with more general integro-differential operators, such as the fractional Laplacian, are left to future work.