Abstract
In this paper we consider the numerical approximation of nonlocal integro-differential parabolic equations via neural networks. These equations appear in many recent applications, including finance, biology and others, and have recently been studied in great generality starting from the work of Caffarelli and Silvestre (Comm PDE 32(8):1245–1260, 2007). Based on the work of Hure, Pham and Warin (Math Comp 89:1547–1579, 2020), we generalize their Euler scheme and consistency result for backward forward stochastic differential equations to the nonlocal case. We rely on Lévy processes and a new neural network approximation of the nonlocal part to overcome the lack of a suitable approximation of the nonlocal part of the solution.
1 Introduction
The goal of this article is to propose and analyse a deep learning scheme for PDEs of integral type which we refer to as PIDE models. The integral part of the considered equation is defined by a finite Lévy measure \(\lambda \) on \(\mathbb {R}^d\) (see Sect. 1.2).
A difficult problem in applied mathematics is to approximate solutions of Partial Differential Equations (PDEs) in high dimensions. In low dimensions such as 1, 2 or 3, classical methods such as finite differences or finite elements are commonly applied with satisfactory convergence orders (see e.g. Allaire [1, Chapters 2 and 6]). An important problem appears when we deal with high-dimensional problems such as portfolio management, where each dimension represents the size of some financial derivative in the portfolio. Further complications appear when the PDE is nonlocal, as is the case in many applications. For finite difference methods, one needs to construct a mesh whose computational cost grows exponentially in the dimension \(d\in \mathbb {N}\) of the considered PDE. This problem is known in the literature as the curse of dimensionality, and the most common attempt to overcome it is via stochastic methods. Deep Learning (DL) methods have proven to be an efficient tool to handle this problem and to approximate solutions of high-dimensional second-order fully nonlinear PDEs. This is achieved by observing that the solution of the PDE, evaluated along a certain diffusion process, solves a Stochastic Differential Equation (SDE); an Euler scheme together with DL is then applied to solve the SDE, see [9, 31] for key developments.
Without being exhaustive, we present some of the current developments in this direction. First of all, Monte Carlo algorithms are an important approach to this high-dimensional problem. This can be done by means of the classical Feynman–Kac representation, which allows us to write the solution of a linear PDE as an expected value, and then approximate the resulting high-dimensional integrals by an average over simulations of random variables. Key developments in this area can be found in Han-Jentzen-E [29] and Beck-E-Jentzen [9]. The Multilevel Picard (MLP) method is another approach; it consists in interpreting the stochastic representation of the solution to a semilinear parabolic (or elliptic) PDE as a fixed point equation. Then, by using Picard iterations together with Monte Carlo methods for the computation of integrals, one is able to approximate the solution to the PDE, see [8, 32] for fundamental advances in this direction. On the other hand, the so-called Deep Galerkin Method (DGM) is another DL approach, used to solve quasilinear parabolic PDEs of the form \(\mathcal {L}(u)=0\) with initial and boundary conditions. The cost function in this framework is defined in an intuitive way: it consists of the differences between the approximate solution \(\hat{u}\), evaluated at the initial time and on the spatial boundary, and the true initial and boundary conditions, plus \(\mathcal {L}(\hat{u})\). These quantities are captured by an \(L^2\)-type norm, which in high dimensions is minimized using the Stochastic Gradient Descent (SGD) method. See [44] for the development of the DGM and [38] for an application.
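To fix ideas, the Feynman–Kac Monte Carlo approach can be sketched in a few lines. The following is a minimal illustration with a toy choice of our own (the backward heat equation with a quadratic terminal condition, not an example from the cited references); note that the cost of the estimate grows with the number of sampled paths, not exponentially in the dimension.

```python
import numpy as np

def feynman_kac_mc(g, t, x, T, n_paths=200_000, seed=0):
    """Monte Carlo estimate of u(t, x) = E[g(x + W_{T-t})], the Feynman-Kac
    representation of u_t + (1/2) * Laplacian(u) = 0 with u(T, .) = g."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_paths, len(x))) * np.sqrt(T - t)
    return g(x + noise).mean()

# Terminal condition g(y) = |y|^2, for which u(t, x) = |x|^2 + d * (T - t).
g = lambda y: (y ** 2).sum(axis=1)
est = feynman_kac_mc(g, t=0.0, x=np.zeros(10), T=1.0)
print(est)  # close to d * (T - t) = 10
```

The same sampling loop works verbatim in dimension 100: only the width of the Gaussian sample changes, which is the point of the Monte Carlo approach.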
In [31], the principal source of inspiration for this article, Hure, Pham, and Warin consider the framework previously introduced in [9] and present new approximation schemes, via neural networks, for the solution of a parabolic nonlinear PDE and its gradient. Through an intricate use of intermediate numerical approximations for each term in their scheme, they prove the numerical consistency and high accuracy of the method, at least in low dimensions.
In general, standard PDEs model situations where, in order to know the state of a system at a particular point, one needs information about the state in an arbitrarily small neighborhood of that point. In contrast, PIDEs can model more general phenomena where long-distance interactions and effects are not negligible and therefore must be considered. Important examples of PIDEs are those involving fractional derivatives, such as the fractional Laplacian. This operator has been extensively studied from the PDE point of view during the past ten years, starting from the fundamental work by Caffarelli and Silvestre [15]. See [21, 45] and references therein for nice introductions to this operator, one of the most relevant examples of integro-differential operators. More generally, nonlocal equations are used in a wide range of scientific areas: see [10] for applications in advection-dispersion equations, [25] for image processing, [23] for peridynamics, [46] for hydrodynamics, and [16, 17] for finance. For more theoretical results on nonlocal equations, see e.g. [6, 11, 18] and references therein. In [19], the authors give a complete introduction to nonlocal equations and develop nonlocal versions of three numerical methods: finite differences, finite elements and spectral-Galerkin.
In [35], the authors present a discrete-time approximation of a BSDEJ (Backward SDE with Jumps) whose solution converges weakly to a solution of the continuous-time equation. They also use this method to approximate the solution to the corresponding PIDE. Very recently, we have learned of a rigorous and complete work by L. Gonon and C. Schwab [26, 27], where it is proved that deep ReLU neural networks (NNs) are able to approximate expectations of certain functionals defined on a space of stochastic processes. In particular, viscosity solutions of a linear PIDE can be represented in such a way by means of a variation of the Feynman–Kac formula. Furthermore, by controlling the size of the NNs that approximate the parameters for a given fixed tolerance, they mathematically prove that neural networks, as considered in their setting, are capable of breaking the curse of dimensionality in such approximation problems. They follow a similar procedure and generalize the results shown in [24, 28]. Our result is different in several respects: we work with a nonlinear equation and do not require NN approximations of the parameters, but we do not show that our scheme can overcome the curse of dimensionality. As mentioned before, we propose and provide error bounds for a deep learning scheme for nonlinear parabolic PIDEs based on the work [31]. We emphasize that their methods differ from ours, and were developed independently.
We present here an extension and generalization of [31] to PIDEs, obtained by adding a nonlocal contribution to the PDE. Some important changes are needed in the algorithm, including the use of a third neural network to approximate the nonlocal part of the solution. Of particular utility will be the results of [12], used to prove convergence of the proposed numerical scheme.
The Euler scheme presented in this article builds on the one introduced by Zhang in [49]. In that paper, the author gives a discrete-time approximation of a BSDE (backward SDE) with no jump terms. That scheme involves the computation of conditional expectations and provides important bounds and results that were used in [31] to prove the convergence of a DL algorithm for solving a second-order fully nonlinear PDE. In our case, nonlocal integral models require additional treatment. The work by Bouchard and Elie [12], very important for the work presented here, generalizes the properties given in [49] to the nonlocal setting by considering Lévy processes. We will closely follow their approach to construct our numerical scheme.
1.1 Notation
For any \(m\in \mathbb {N}\), \(\mathbb {R}^m\) represents the finite dimensional Euclidean space with elements \(x=(x_1,...,x_m)\) and endowed with the usual norm \(|x|^2=\sum _{i=1}^m |x_i|^2\). Note that for scalars \(a\in \mathbb {R}\) we also denote its norm as \(|a| = \sqrt{a^2}\). For \(x,y\in \mathbb {R}^m\) their scalar product is denoted as \(x\cdot y =\sum _{i=1}^m x_i y_i\). For a general measure space \((E,\Sigma ,\nu )\), \(p\ge 1\) and \(m\in \mathbb {N}\), \(L^p(E,\Sigma ,\nu ;\mathbb {R}^m)\) represents the standard Lebesgue space of p-integrable functions from E to \(\mathbb {R}^m\) and norm
We write \(L^p(E,\Sigma ,\nu )\) when \(m=1\). Given a general probability space \((\Omega ,\mathcal {F},\mathbb {P})\) and a random vector (or variable if \(m=1\)) \(X:\Omega \rightarrow \mathbb {R}^m\), for sake of simplicity and to avoid an overload of parenthesis we denote \(\mathbb {E}|X|^2 = \mathbb {E}(|X|^2)\). We also write
whenever \(f:E\rightarrow \mathbb {R}^m\) with \(f=(f_1,...,f_m)\). Throughout the paper we repeatedly use that for \(x_1,...,x_k\in \mathbb {R}\) the following bound holds: \((x_1+\cdots +x_k)^2\le k\,(x_1^2+\cdots +x_k^2)\).
1.2 Setting
Let \(d\ge 1\) and \(T>0\). Consider the following integro-differential PDE
Here, \(u=u(t,x)\) is the unknown of the problem. The operator \(\mathcal {L}\) above is of parabolic nonlocal type, and is defined, for \(u\in \mathcal {C}^{1,2}([0,T]\times \mathbb {R}^d)\), as follows:
where \(\lambda \) is a finite measure on \(\mathbb {R}^d\), equipped with its Borel \(\sigma \)-algebra, and a Lévy measure as well which means that
Also, \(f:[0,T]\times \mathbb {R}^d\times \mathbb {R}\times \mathbb {R}^d\times \mathbb {R}\rightarrow \mathbb {R}\). We also assume the standard Lipschitz conditions on the functions in order to have a unique solution to (1.2) in the class \(C^{1,2}\): there exists a universal constant \(K>0\) such that
The last condition is technical and is needed to ensure the validity of certain approximation results (see Theorem 4.1). On the other hand, the nonlocal, integro-differential operator \(\mathcal {I}\) is defined as
The conditions stated in (1.4) are standard in the literature (see [5, 12, 35]) and are needed to ensure the existence and uniqueness (with satisfactory bounds mentioned below) of solutions to a FBSDEJ (forward BSDEJ) related to (1.2).
Remark 1.1
In the literature (see [5, 20]) a Lipschitz condition imposed on \(\beta \) is often written as
The reason to impose this requirement is to ensure that
for some constant \(K_2>0\); this is another way of saying that \(\beta \) is Lipschitz with respect to its first variable in an integral sense. Our uniform Lipschitz requirement on \(\beta \), together with \(\lambda \) being a finite measure, is enough to satisfy this restriction.
1.3 Forward backward formulation of (1.2)
In the previous context, consider the following stochastic setting for (1.2). Let \((\Omega ,\mathcal {F},\mathbb {F},\mathbb {P})\), \(\mathbb {F}=(\mathcal {F}_t)_{0\le t\le T}\), be a stochastic basis satisfying the usual conditions: \(\mathbb {F}\) is right continuous and \(\mathcal F_0\) is complete (it contains all \(\mathbb {P}\)-null sets). The filtration \(\mathbb {F}\) is generated by a d-dimensional Brownian motion (BM) \(W=(W_t)_{0\le t\le T}\) and a Poisson random measure \(\mu \) on \(\mathbb {R}_+\times \mathbb {R}^d\) with intensity measure \(\lambda \); these two random objects are assumed to be mutually independent.
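A Poisson random measure with finite intensity is straightforward to simulate, and one can check numerically that compensating it yields a centered (martingale) quantity, the property used throughout the paper. A minimal sketch, with an invented total intensity and target set \(A\) (both placeholders, not taken from the paper):

```python
import numpy as np

def compensated_mass(t, lam_total, p_A, n_paths=100_000, seed=1):
    """Sample mu_bar([0, t] x A) = mu([0, t] x A) - t * lambda(A) for a Poisson
    random measure mu with finite total intensity lam_total = lambda(R^d),
    where p_A = lambda(A) / lambda(R^d) is the chance that a mark lands in A."""
    rng = np.random.default_rng(seed)
    n_jumps = rng.poisson(lam_total * t, size=n_paths)  # total jumps on [0, t]
    marks_in_A = rng.binomial(n_jumps, p_A)             # jumps whose mark is in A
    return marks_in_A - t * lam_total * p_A             # compensated count

samples = compensated_mass(t=1.0, lam_total=2.0, p_A=0.3)
print(samples.mean())  # approximately 0: the compensated measure is centered
```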
Recall that \(\lambda \) is a finite Lévy measure on \(\mathbb {R}^d\). The compensated measure of \(\mu \) is denoted as
and is such that for every measurable set A satisfying \(\lambda (A)<\infty \), \((\overline{\mu }(t,A):=\overline{\mu }([0,t],A))_t\) is a martingale. Given a time \(t_i\in [0,T]\), the operator \(\mathbb {E}_i\) will denote the conditional expectation with respect to \(\mathcal {F}_{t_i}\):
Recall the equation (1.2)–(1.3)–(1.5). As usual, \(X_{r^-}\) denotes the left limit of \(X_s\) as \(s\uparrow r\). Let us consider the following forward and backward stochastic differential equations with jumps, in terms of the unknowns (X, Y, Z, U):
where \(\Theta _s=(s,X_s,Y_s,Z_s,\Gamma _s)\) for \(0\le s\le T\) and \(x\in \mathbb {R}^d\). Note that \((Z_t)_{0\le t\le T}\) is a vector valued process.
By applying Itô’s lemma (see [20, Thm 2.3.4]) to the solution \(X_t\) in (1.8) and a \(\mathcal {C}^{1,2}([0,T]\times \mathbb {R}^d)\) solution u of PIDE (1.2) as \(Y_t\) in (1.9), we obtain the compact stochastic formulation of (1.2):
valid for \(t\in [0,T]\). This tells us that whatever we use as approximations of
must satisfy (1.11) in some proper metric. An important fact here is that the conditions (1.4) ensure the existence of a viscosity solution \(u\in \mathcal {C}([0,T]\times \mathbb {R}^d)\) with at most polynomial growth such that \(u(t,X_t) = Y_t\) (see [5, Thm 3.4]), and this is the reason why our scheme seeks to approximate the solution to the FBSDEJ (1.8–1.9). The neural networks used in our algorithm are introduced in Sect. 2.
1.4 Organization of this paper
The rest of this work is organized as follows. In Sect. 2 we give a concrete definition of the NNs that we will be using together with the approximation results needed in this paper. In Sect. 3 we introduce the discretization of the stochastic system that allows us to train our NNs. In Sect. 4 we state all the preliminary results and definitions needed for the proof of our main result. Finally, in Sect. 5 we state and prove the main result of this paper.
2 Neural networks and approximation theorems
Neural Networks (NNs) are not recent. In [41, 43], published in 1943 and 1958 respectively, the authors introduced the concept of a NN, although still far from its modern definition. Over the years, the use of NNs as a way to approximate functions gained importance due to their good performance in applications. A rigorous justification of this property was proven in [30, 36] using the Stone–Weierstrass theorem: under suitable conditions on the functions to be approximated, measured in precise mathematical terms, NNs achieve arbitrarily good accuracy. See [2, 48] for a review of the origins of DL and a state-of-the-art survey, respectively.
The huge amount of available data, due to social media, astronomical observatories and even Wikipedia, together with the progress of computational power, has allowed us to train increasingly efficient Machine Learning (ML) algorithms and to consider data that years ago were impossible to analyze. Deep Learning is a class of supervised ML algorithms concerned with the problem of approximating an unknown nonlinear function \(f:X\rightarrow Y\), where X represents the set of possible inputs and Y the outputs; for example, Y could be a finite set of classes, in which case f performs a classification task. In order to run a DL algorithm we need a set of observations \(D = \left\{ (x,f(x)): x\in A\right\} \) of the phenomenon under consideration; in the literature this set is also known as the training set. Here, A is a finite subset of X. The next step is to define a family of candidates \(\left\{ f_\theta : \theta \in \Xi \right\} \) in which we search for a good approximation of f, with \(\Xi \subset \mathbb {R}^{\kappa }\) for some \(\kappa \in \mathbb {N}\). Finally, the quality of the approximation is measured by a cost function \(L(\cdot ;D):\Xi \rightarrow \mathbb {R}\), and, intuitively, we take \(f_{\theta ^*}\) as the chosen approximation, where \(\theta ^*\) minimizes \(L(\cdot ;D)\) over \(\Xi \).
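A toy instance of this pipeline (training set, parametric family, cost function, minimizer), with an invented one-dimensional target and an affine family \(f_\theta\) standing in for a neural network:

```python
import numpy as np

# Training set D = {(x, f(x))} for a target f(x) = 3x + 1 (invented for the demo),
# family of candidates f_theta(x) = theta_0 + theta_1 * x, squared-error cost.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 1.0

theta = np.zeros(2)
lr = 0.5
for _ in range(500):  # gradient descent on L(theta; D) = mean((f_theta(x) - f(x))^2)
    resid = theta[0] + theta[1] * x - y
    theta -= lr * np.array([resid.mean(), (resid * x).mean()])

print(theta)  # theta* is close to (1, 3), recovering the target
```

Here the family is rich enough to contain \(f\) itself, so the minimizer is exact; with a NN family one only gets an approximation within the class.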
The complexity and generality of the main problem that DL tries to solve makes it useful to a large variety of scientific disciplines. In astronomy, the large amount of data collected by observatories makes it a natural field for ML; see [7] for a review of ML in astronomy and [39] for a concrete use of Convolutional Neural Networks (CNNs) to classify light curves. See [13] for a review of ML in experimental high-energy physics and [47] for an application of NNs to quantum state tomography. In [40], the authors use DL to find patterns in fashion and style trends across space and time using data from Instagram. In [3] the authors train a CNN to classify brain tumors into glioma, meningioma, and pituitary tumor, reaching high levels of accuracy. See [37] for a survey on the use of DL in medical science, where CNNs are the most common type of DL structure.
To fix ideas, in this paper we focus on a simpler setting, where the input and output variables belong to multidimensional real spaces \(\mathbb {R}^d\) and \(\mathbb {R}^m\) respectively, with \(d,m\in \mathbb {N}\). In order to define the family of candidates we need \(L+1\in \mathbb {N}\) layers with \(l_i\in \mathbb {N}\) neurons each, for \(i\in \left\{ 0,...,L\right\} \), where \(l_0=d\) and \(l_L=m\), weight matrices \(\left\{ W_i\in \mathbb {R}^{l_i\times l_{i-1}}\right\} _{i=1}^{L}\), bias vectors \(\left\{ b_i\in \mathbb {R}^{l_i}\right\} _{i=1}^{L}\), and an activation function \(\phi :\mathbb {R}\rightarrow \mathbb {R}\), which breaks the linearity of the composition (see Definition 2.1). The resulting function maps the input space \(\mathbb {R}^{l_0}\) to the output space \(\mathbb {R}^{l_L}\).
Remark 2.1
The first and last layers are called the input and output layer, respectively; the other \(L-1\) are often called hidden layers.
Definition 2.1
Given \(L\in \mathbb {N}\) and \(l_0,(l_i,W_i,b_i)_{i=1}^{L}\) as above, consider the parameter \(\theta =\left( W_i,b_i\right) _{i=1}^{L}\) which can be seen as an element of \(\mathbb {R}^\kappa \) with \(\kappa =\sum _{i=1}^{L}(l_i l_{i-1} + l_i)\) and a function \(\phi :\mathbb {R}\rightarrow \mathbb {R}\). We define the neural network \(f(\cdot ;\theta ,\phi ):\mathbb {R}^{l_0}\rightarrow \mathbb {R}^{l_L}\) as the following composition,
where \(A_i:\mathbb {R}^{l_{i-1}}\rightarrow \mathbb {R}^{l_i}\) is an affine linear function such that \(A_i(x)=W_ix+b_i\) for \(i\in \left\{ 1,...,L\right\} \) and \(\phi \) is applied component-wise. We denote \(f(\cdot ;\theta ,\phi ) = f(\cdot ;\theta )\) when the activation function is fixed and no confusion can arise.
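A direct transcription of this definition (as we read it, with \(\phi\) applied between consecutive affine maps and no activation after the final one, consistently with the single-hidden-layer form below); the layer sizes and weights are arbitrary illustrations:

```python
import numpy as np

def neural_net(x, weights, biases, phi):
    """f(x; theta, phi) = A_L(phi(A_{L-1}(... phi(A_1(x)) ...))), where
    A_i(x) = W_i x + b_i and phi acts component-wise; no activation is
    applied after the final affine map."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = phi(W @ x + b)
    return weights[-1] @ x + biases[-1]

# L = 2: input dimension l_0 = 3, one hidden layer l_1 = 5, output l_L = 2.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 3)), rng.standard_normal((2, 5))]
biases = [rng.standard_normal(5), rng.standard_normal(2)]
out = neural_net(np.ones(3), weights, biases, np.tanh)
print(out.shape)  # (2,)
```

The parameter \(\theta\) of the definition is exactly the flattened collection of `weights` and `biases`, of total length \(\kappa=\sum_i (l_i l_{i-1}+l_i)\).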
In the following, the activation function will be fixed, as well as the input and output dimensions, since those parameters are determined by the mapping that we are trying to approximate; the number of layers L will also be fixed. The range of neural networks that we can reach by varying the remaining parameters, namely the size and values of the weight matrices and bias vectors, will be called the set of neural networks and denoted by \(\mathcal {N}_{\phi ,L,l_0,l_L}\). The following definition materializes the previous explanation.
Definition 2.2
The set of Neural Networks associated to \((L, l_0=d, l_L=m)\subset \mathbb {N}\) and the function \(\phi :\mathbb {R}\rightarrow \mathbb {R}\) is defined by,
where,
In this paper m will typically be d, the space dimension of the PIDE (1.2), or 1. Using functional analysis arguments, K. Hornik proves in [30] that NNs are able to approximate functions in \(L^2\) spaces within any given tolerance. The space of NNs used in his work, which we denote by \(\mathcal {H}\), is slightly different from ours; indeed,
Note that in this space the free parameter \(\kappa \) depends on the size \(n\in \mathbb {N}\) of the first (and only) hidden layer in the following way: \(\kappa =\sum _{i=1}^2 (l_i l_{i-1} + l_i) = nd + n + n + 1\). It is straightforward to check that a function \(f\in \mathcal {H}\) takes the following form
for \(\left( W_1,b_1,W_2,0\right) \in \mathbb {R}^{nd+n+n+1}\) and \(n\in \mathbb {N}\). Hornik proves the following important result.
Theorem 2.1
([30], Theorem 1) If \(\phi :\mathbb {R}\rightarrow \mathbb {R}\) is bounded and non-constant, then \(\mathcal {H}\) is dense in \(L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R})\) for every finite measure \(\mu \) in \(\mathbb {R}^d\).
Let \(m\in \mathbb {N}\). For a measure \(\mu \) on \(\mathbb {R}^d\), consider the space \(L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R}^m)\) of square integrable vector valued functions endowed with the norm
for \(h = (h_1,...,h_m)\) with \(h_i\) a scalar function for \(i\in \left\{ 1,...,m\right\} \). Since we also need to approximate the derivative \(\nabla u\) of the solution u to PIDE (1.2), the following lemma proves the density of NNs in the space of square integrable vector valued functions.
Lemma 2.1
Let \(m\in \mathbb {N}\) with \(m\ge 1\). If the activation function \(\phi \) is bounded and non-constant, then \(\mathcal {N}_{\phi ,2,d,m}\) is dense in \(L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R}^m)\) for every finite measure \(\mu \) on \(\mathbb {R}^d\).
Proof
Given \(\varepsilon >0\) and a function \(h=(h_1,...,h_m)\in L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R}^m)\) we need to find \(f(\cdot ;\theta ,\phi )=(f_1,...,f_m)\in \mathcal {N}_{\phi ,2,d,m}\) such that
First, observe that \(\mathcal {H}\subset \mathcal {N}_{\phi ,2,d,1}\) which implies, by using Theorem 2.1, that \(\mathcal {N}_{\phi ,2,d,1}\) is also dense in \(L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R})\) and therefore for every \(i\in \left\{ 1,...,m\right\} \) we can find \(f_i(\cdot ;\theta ^i,\phi )\) with \(\theta ^i=\left( W_1^{i},b_1^i,W_2^i,b_2^i\right) \) and \(\kappa ^i=n^i d + n^i + n^i + 1\), depending on \(\varepsilon \), such that
Consider \(f\in \mathcal {N}_{\phi ,2,d,m}\) defined by \(\widehat{\theta }=\left( \widehat{W}_1, \widehat{b}_1, \widehat{W}_2, \widehat{b}_2\right) \) with
and which satisfies that for \(x\in \mathbb {R}^d\)
Therefore,
This ends the proof.\(\square \)
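The construction in the proof, concatenating the m scalar networks into a single network whose output weight matrix is block diagonal, can be sketched as follows (sizes and weights are random placeholders):

```python
import numpy as np

def stack_scalar_nets(nets):
    """Combine m one-hidden-layer scalar networks (W1_i, b1_i, W2_i, b2_i),
    all with the same input dimension d, into one vector-valued network
    f = (f_1, ..., f_m): hidden weights are stacked vertically and the
    output weights form a block-diagonal row pattern."""
    W1 = np.vstack([W1_i for W1_i, _, _, _ in nets])        # (sum n_i, d)
    b1 = np.concatenate([b1_i for _, b1_i, _, _ in nets])
    sizes = [W1_i.shape[0] for W1_i, _, _, _ in nets]
    W2 = np.zeros((len(nets), sum(sizes)))
    offset = 0
    for i, (_, _, W2_i, _) in enumerate(nets):
        W2[i, offset:offset + sizes[i]] = W2_i              # block for f_i
        offset += sizes[i]
    b2 = np.array([b2_i for _, _, _, b2_i in nets])
    return W1, b1, W2, b2

# Two scalar networks on R^3 with hidden sizes 4 and 2.
rng = np.random.default_rng(0)
nets = [(rng.standard_normal((n, 3)), rng.standard_normal(n),
         rng.standard_normal(n), rng.standard_normal()) for n in (4, 2)]
W1, b1, W2, b2 = stack_scalar_nets(nets)
x = rng.standard_normal(3)
stacked = W2 @ np.tanh(W1 @ x + b1) + b2
single = np.array([W2_i @ np.tanh(W1_i @ x + b1_i) + b2_i
                   for W1_i, b1_i, W2_i, b2_i in nets])
print(np.allclose(stacked, single))  # each component is unchanged
```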
Lemma 2.1 allows us to state that if we take some function \(h:\mathbb {R}^d\rightarrow \mathbb {R}^{m}\) in \(L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\mu ;\mathbb {R}^m)\), then the quantity
can be made arbitrarily small by making \(\kappa \) sufficiently large, whenever \(\mu \) is a finite measure on \(\mathbb {R}^d\) and the activation function that defines the NN is bounded and non-constant. The following lemma states the \(L^2(\Omega ,\mathcal {F},\mathbb {P})\) integrability of a random variable constructed from a NN; it assumes an activation function with linear growth, which is more general than the ones considered so far.
Lemma 2.2
Let \(m\in \mathbb {N}\) with \(m\ge 1\), \(X\in L^2 (\Omega ,\mathcal {F},\mathbb {P};\mathbb {R}^d)\) and \(f(\cdot ;\theta ,\phi )\in \mathcal {N}_{\phi ,2,d,m}\). Assume that \(|\phi (x)|\le C(1+|x|)\) for \(x\in \mathbb {R}\) and some positive constant C, then \(f(X;\theta ,\phi )\in L^2 (\Omega ,\mathcal {F},\mathbb {P};\mathbb {R}^m)\).
Proof
Let f be represented by \(\theta =\left( W_1,b_1,W_2,b_2\right) \in \mathbb {R}^{nd+n+mn+m}\) then,
Therefore, without loss of generality we can assume \(f\in \mathcal {N}_{\phi ,2,d,1}\) and take \(\theta = \left( W_1,b_1,W_2,0\right) \in \mathbb {R}^{nd+n+n+1}\). Using the growth condition on \(\phi \) and the bound \((a+b+c)^2\le 3(a^2 + b^2 + c^2)\) for \(a,b,c\in \mathbb {R}\),
Note that the first two terms in the last expression are deterministic and finite; then, by using the Cauchy–Schwarz inequality twice on the third term, we get,
This finishes the proof.\(\square \)
3 Discretization of the dynamics and the deep learning algorithm
Fix a constant step partition of the interval [0, T], defined as \(\pi =\left\{ \frac{iT}{N}\right\} _{i\in \left\{ 0,...,N\right\} }\), \(t_i= \frac{iT}{N}\), and set \(\Delta W_i=W_{t_{i+1}}-W_{t_i}\). Also, define \(h:=\frac{T}{N}\) and (with a slight abuse of notation), \(\Delta t_i=(t_i,t_{i+1}]\). Recall the compensated measure \(\overline{\mu }\) from (1.6). Let
It is well-known that an Euler scheme for the first equation in (1.8) obeys the form
Note that this scheme neglects the left limits that appear in the original equation; nonetheless, it satisfies the following error bound (see [20, Thm. 5.1.1], [12] or [26]),
Under suitable conditions, mostly Lipschitz and linear growth assumptions, it can be proved that the constant behind O(h) in (3.4) does not depend exponentially on d, see Lemma 4.3 in [26]. Adapting the argument of [31] to the non-local case, and in view of (1.11), we propose the following modified Euler scheme: for \(i=0,1,\ldots ,N\),
where \(F_i:\Omega \times [0,T]\times \mathbb {R}^d\times \mathbb {R}\times \mathbb {R}^d\times L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\lambda )\times \mathbb {R}_+\times \mathbb {R}^d\rightarrow \mathbb {R}\) is defined as
Note that \(\omega \) is passed to \(F_i\) through its dependence on the compensated measure \(\bar{\mu }\). The function \(F_i\) is therefore itself a random variable.
Remark 3.1
Note that the nonlocal term in (1.2) forces us to define \(F_i\) in such a way that its fifth argument must be a function \(\psi \) in \(L^2 (\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\lambda )\). In view of the integrals involved in \(F_i\), it may appear that we are again facing the same high-dimensional problem; this problem, however, can be treated with Monte Carlo approximations, see below.
Remark 3.2
In the nonlocal setting, the function \(F_i\) also depends on the time interval \((t_i,t_{i+1}]\) through the integrated measure \(\bar{\mu }\left( (t_i,t_{i+1}],dy\right) \). This is an important change in the Euler scheme: we do not approximate the nonlocal term at time \(t_i\), but instead take into account how the measure \(\bar{\mu }\) behaves on the whole interval \((t_i,t_{i+1}]\).
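For the forward component (1.8), the Euler step with jumps can be sketched as follows. All coefficients and intensities below are illustrative assumptions of ours (constant \(\sigma\), \(\beta(x,y)=y\), standard normal jump marks), not the paper's example; the point is the handling of the compensated jumps over each interval \((t_i,t_{i+1}]\).

```python
import numpy as np

def euler_forward(x0, b, sigma, beta, lam0, comp, T=1.0, N=25,
                  n_paths=4_000, seed=0):
    """Euler scheme for dX = b(X) dt + sigma(X) dW + int beta(X_{t-}, y) mu_bar(dt, dy):
    at each step, sum beta over the Poisson jumps on (t_i, t_{i+1}] and subtract
    the compensator h * comp(x), where comp(x) = int beta(x, y) lambda(dy)."""
    rng = np.random.default_rng(seed)
    h = T / N
    X = np.full(n_paths, float(x0))
    for _ in range(N):
        dW = rng.standard_normal(n_paths) * np.sqrt(h)
        n_jumps = rng.poisson(lam0 * h, size=n_paths)       # jumps per interval
        jumps = np.array([beta(xk, rng.standard_normal(k)).sum()
                          for xk, k in zip(X, n_jumps)])
        X = X + b(X) * h + sigma(X) * dW + jumps - h * comp(X)
    return X

# Illustrative coefficients: b = 0, sigma = 1, beta(x, y) = y, total jump
# intensity lam0 = 2 with standard normal marks, so comp(x) = 0 and the
# compensated dynamics preserve the mean: E[X_T] = x0.
XT = euler_forward(x0=1.0, b=lambda x: 0.0 * x, sigma=lambda x: 1.0 + 0.0 * x,
                   beta=lambda x, y: y, lam0=2.0, comp=lambda x: 0.0 * x)
print(XT.mean())  # approximately 1.0
```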
Recall Theorem 2.1 and let \(\phi :\mathbb {R}\rightarrow \mathbb {R}\) be a bounded and non-constant activation function. From now on we will be using NNs with a single hidden layer parameterized by \(\theta \in \Xi \), where \(\Xi =\mathbb {R}^{\kappa }\) for some free parameter \(\kappa \in \mathbb {N}\) depending on the size of the hidden layer. For every time \(t_i\) on the grid consider,
with \(\mathcal {U}_i(\cdot ;\theta )\in \mathcal {N}_{\phi ,2,d,1,\kappa }\), \(\mathcal {Z}_i(\cdot ;\theta )\in \mathcal {N}_{\phi ,2,d,d,\kappa }\) and \(\mathcal {G}_i(\cdot ,\circ ;\theta )\in \mathcal {N}_{\phi ,2,d+d,1,\kappa }\) approximating
respectively, in some sense to be specified below. Let also
We propose an extension of the DBDP1 algorithm presented in [31]. The main idea of the algorithm is that the NNs, evaluated at \(X_{t_{i}}^{\pi }\), are good approximations of the processes solving the FBSDEJ. Let \(L_i\) be a cost function defined for \(\theta \in \Xi \) as
[Figure a]
For the minimization step we need to compute an expected value, but this is a complicated task due to the nonlinearity and the fact that the distributions of the random variables involved are not always known. To overcome this, as in [31], one uses a Monte Carlo approximation together with Stochastic Gradient Descent (SGD). See also Remark 3.1.
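The role of this minimization can be illustrated on a toy problem: minibatch SGD on an empirical squared loss recovers a conditional expectation, which is exactly what each network in the scheme is asked to approximate at each time step. The affine model and all numbers below are illustrative stand-ins, not the paper's networks:

```python
import numpy as np

# Minimizing E[(target - U(X; theta))^2] over theta recovers E[target | X].
rng = np.random.default_rng(0)
X = rng.standard_normal(20_000)
target = 2.0 * X + rng.standard_normal(20_000)  # so E[target | X] = 2 X

theta = np.zeros(2)  # affine stand-in for the NN: U(x; theta) = theta_0 + theta_1 x
lr = 0.1
for _ in range(2_000):
    idx = rng.integers(0, len(X), size=64)       # Monte Carlo minibatch
    resid = theta[0] + theta[1] * X[idx] - target[idx]
    theta -= lr * np.array([resid.mean(), (resid * X[idx]).mean()])

print(theta)  # close to (0, 2): the loss is minimized by the conditional expectation
```

Note that the noisy targets are never averaged explicitly; the \(L^2\) minimization performs the conditioning, which is why known distributions are not required.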
4 Preliminaries
For a general measure space \((E,\Sigma ,\nu )\), \(p\ge 1\) and \(m\in \mathbb {N}\), recall the definition and norm for the Lebesgue space \(L^p(E,\Sigma ,\nu ;\mathbb {R}^m)\) introduced in Sect. 1.1. For \(s,t\in [0,T]\) such that \(s\le t\), we define some spaces of stochastic processes:
-
\(\mathcal {S}_{[s,t]}^2(\mathbb {R})\) denotes the space of adapted càdlàg processes \(Y: \Omega \times [s,t]\rightarrow \mathbb {R}\) such that
$$\begin{aligned} \left\Vert Y\right\Vert ^2_{\mathcal {S}^2_{[s,t]}}:=\mathbb {E}\left( \underset{r\in [s,t]}{\sup }|Y_r|^2\right) <\infty . \end{aligned}$$
-
\(L^2_{W,[s,t]}(\mathbb {R}^d)\) denotes the space of predictable processes \(Z: \Omega \times [s,t]\rightarrow \mathbb {R}^d\) such that
$$\begin{aligned} \left\Vert Z\right\Vert _{L^2_{W,[s,t]}}^2:=\mathbb {E}\left( \int _s^t\left\Vert Z_r\right\Vert ^2 dr\right) <\infty . \end{aligned}$$
-
\(L^2_{\mu ,[s,t]}(\mathbb {R})\) denotes the space of \(\sigma (\mathcal {P}_{[s,t]}\times \mathcal {B}(\mathbb {R}^d))\)-measurable processes \(U: \Omega \times [s,t]\times \mathbb {R}^d\rightarrow \mathbb {R}\), with \(\mathcal {P}_{[s,t]}\) denoting the predictable sigma algebra on \(\Omega \times [s,t]\). These processes are such that
$$\begin{aligned} \left\Vert U\right\Vert _{L^2_{\mu ,[s,t]}}^2:=\mathbb {E}\left( \int _s^t\int _{\mathbb {R}^d}|U_r(y)|^2\lambda (dy)dr\right) <\infty . \end{aligned}$$
Whenever \([s,t]=[0,T]\), we avoid mentioning the interval of time, and denote \(\mathcal {B}^2 = \mathcal {S}^2\times L_{W}^2(\mathbb {R}^d)\times L_{\mu }^2(\mathbb {R})\). In the following, \(C>0\) will denote a constant that may change from one line to another. Also, the notation \(a\lesssim b\) means that there exists \(C>0\) such that \(a\le Cb\).
4.1 Existence and uniqueness for the FBSDEJ
In order to estimate errors we need a reference solution; the following lemmas present well-known results concerning the existence and uniqueness of a solution to the decoupled system (1.8–1.9). We only check that our hypotheses match those of [4] and [5]; these results are the same as those given in [12] and [20, Section 4.1].
Lemma 4.1
There exists a unique solution \(X\in \mathcal {S}^2\) to (1.8) such that.
Proof
Recall Remark 1.1. Observe that conditions (C), particularly those imposed on \(\beta \), imply that
This, together with the rest of conditions (C), is enough to fulfill the Lipschitz and growth hypotheses needed in [4, Section 6.2] to ensure the existence and uniqueness of a solution \(X\in S^2\) to the FSDEJ (1.8). Estimate (4.1) follows by considering the process \((X_u-X_s)_{u\in [s,t]}\) and using Doob’s maximal inequality [42, Theorem 20, Section 1] and Gronwall's inequality. \(\square \)
Lemma 4.2
There exists a solution \((Y,Z,U)\in \mathcal {B}^2\) to (1.9).
Proof
We apply Theorem 2.1 of [5] with \(k=1\), \(Q=g(X_T)\) and a nonlinearity \(\bar{f}:\Omega \times [0,T]\times \mathbb {R}\times \mathbb {R}^d\times L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\lambda )\rightarrow \mathbb {R}\) defined as
By the Lipschitz property on g and the bound given in Lemma 4.1 we can see that \(Q\in L^2(\Omega ,\mathcal {F}_T,\mathbb {P})\). The Lipschitz condition on f implies that for all \(\omega \in \Omega ,t\in [0,T],y,y'\in \mathbb {R}^d,z,z'\in \mathbb {R}^d\) and \(w,w'\in L^2(\mathbb {R}^d,\mathcal {B}(\mathbb {R}^d),\lambda )\),
this proves the Lipschitz condition on \(\bar{f}\). Using the previous bound, it is clear that
These computations allow us to directly apply Theorem 2.1 of [5], which finishes the proof.\(\square \)
Combining the previous lemmas, we get that there exists a unique solution (X, Y, Z, U) to the system (1.8–1.9) in the space \(\mathcal {S}^2\times \mathcal {B}^2\); this implies,
4.2 Useful results from stochastic calculus
The following lemma strongly depends on the filtration under consideration. Recall that \((\mathcal {F}_t)_{t\in [0,T]}\) is generated by the two independent objects W and \(\mu \), which allows us to state the representation property. See the end of Section 2.4 in [20], where it is stated that the required representation holds when the filtration is generated by a Brownian motion and an independent jump process.
Lemma 4.3
(Martingale Representation Theorem) For any square integrable martingale M there exists \((Z,U)\in L^2_{W}(\mathbb {R}^d)\times L^2_{\mu }(\mathbb {R})\) such that for \(t\in [0,T]\)
We will need the next property, which involves conditional expectation, the Itô isometry and the independence of W from \(\overline{\mu }\).
Lemma 4.4
(Conditional Itô isometry) For \(V^1,V^2\in L^2_{\mu }(\mathbb {R})\) and \(H,K\in L^2_W(\mathbb {R}^d)\),
Proof
This follows from the classical Itô isometry and the definition of conditional expectation.\(\square \)
Lemma 4.5
(Conditional Fubini) Let \(H\in L^2_{\mu }(\mathbb {R}^d)\) be an \(\mathbb {F}\)-adapted process and \(t>0\), then
Proof
The proof is standard, but we include it for the sake of completeness. Let \(A\in \mathcal {F}_{t_i}\); we have to prove that
Note that, since \(H\in L^2_{\mu }(\mathbb {R}^d)\),
this means that H can be seen as an element of \(L^2(\Omega \times [t_i,t_{i+1}]\times \mathbb {R}^d)\subset L^1(\Omega \times [t_i,t_{i+1}]\times \mathbb {R}^d)\), both spaces endowed with the corresponding finite product measure. Then we can use the classical Fubini theorem:
This finishes the proof.\(\square \)
4.3 Measuring the error
We first introduce the conditional expectations of the averaged processes
these quantities allow us to define the \(L^2\)-regularity of the solutions \((Z,\Gamma )\) (see [12] and [31]) as follows
Both quantities can be made arbitrarily small, as shown in [12] and presented in the following theorem.
Theorem 4.1
Under assumptions (C), there exists a constant \(C>0\) such that
Proof
See [12, Theorem 2.1 (i)] for the bound on \(\varepsilon ^{\Gamma }(h)\) and [12, Theorem 2.1 (ii)] for the bound on \(\varepsilon ^{Z}(h)\). Note that in the cited reference this result is presented, using our notation, as follows,
where \(\overline{\Gamma }_t = \overline{\Gamma }_{t_i}\) for \(t\in [t_i,t_{i+1})\) and \(\overline{Z}_t = \overline{Z}_{t_i}\) for \(t\in [t_i,t_{i+1})\).\(\square \)
We introduce an auxiliary scheme that at the same time depends on the main one. Let \(i\in \{0,\ldots , N-1\}\), as stated in Sect. 3. We follow the procedure of [31], with key modifications. Following the ideas of [12], we define the \(\mathcal {F}\)-adapted discrete processes
where \(\widehat{\mathcal {V}}_{t_i}\) is well-defined for sufficiently small h by Lemma 4.6 and the variables \(\overline{\widehat{Z}}_{t_i}\), \(\overline{\widehat{\Gamma }}_{t_i}\) are defined below.
Lemma 4.6
The process \(\widehat{\mathcal {V}}_{t_i}\) is well-defined.
Proof
Let \(i\in \left\{ 0,...,N-1\right\} \) and \(\psi :L^2(\Omega ,\mathcal {F},\mathbb {P})\rightarrow L^2(\Omega ,\mathcal {F},\mathbb {P})\) be defined as
for all \(\xi \in L^2(\Omega ,\mathcal {F},\mathbb {P})\) and \(\omega \in \Omega \). This function is well-defined by the properties of f and Lemma 2.2. Let \(\xi ,\overline{\xi }\in L^2\); then, \(\mathbb {P}\)-a.s., \(|\psi (\xi )-\psi (\overline{\xi })| \le h |\xi -\overline{\xi }|\), and therefore
Taking h sufficiently small, this function is a contraction on \(L^2(\Omega ,\mathcal {F},\mathbb {P})\) and therefore, by Banach’s fixed point theorem, we conclude the proof.\(\square \)
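The implicit step solved above via Banach's fixed point theorem can be mirrored numerically by a simple Picard iteration. The following sketch is illustrative only (the scalar setting and all names are ours, not part of the scheme): for \(h\cdot \mathrm{Lip}(f)<1\) the map \(y\mapsto a+hf(y)\) is a contraction, so the iteration converges to the unique fixed point.

```python
import numpy as np

def implicit_step(a, f, h, tol=1e-12, max_iter=100):
    """Solve y = a + h * f(y) by fixed-point (Picard) iteration.

    For h * Lip(f) < 1 the map y -> a + h * f(y) is a contraction on R,
    so Banach's fixed point theorem gives a unique solution.
    (Illustrative sketch; names are hypothetical, not from the paper.)
    """
    y = a  # initial guess
    for _ in range(max_iter):
        y_new = a + h * f(y)
        if abs(y_new - y) < tol:
            return y_new
        y = y_new
    return y

# Example: f = cos, step h = 0.1, starting value a = 1.0.
y_star = implicit_step(1.0, np.cos, 0.1)
# The returned value satisfies the implicit relation up to tolerance.
assert abs(y_star - (1.0 + 0.1 * np.cos(y_star))) < 1e-10
```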
For fixed \(i\in \left\{ 0,...,N\right\} \), let \(N_t\) be the process defined as \(N_t := \mathbb {E}\left( \widehat{\mathcal {U}}_{i+1}(X_{t_{i+1}}^{\pi })\Big |\mathcal {F}_t\right) \) for \(t\in [t_i,t_{i+1}]\). Using Lemma 2.2, it is not difficult to see that \(N_t\) is a square integrable martingale and therefore, by the Martingale Representation Theorem (see Lemma 4.3), there exists \((\widehat{Z},\widehat{U})\in L^2_{W}(\mathbb {R}^d)\times L^2_{\mu }(\mathbb {R})\) such that
By taking \(t=t_{i+1}\) and using (1.7),
By multiplying by \(\Delta W_i\) and \(\Delta M_i\), then taking \(\mathbb {E}_i\) and using Itô isometry,
Let
By Lemma 4.5 one can see that
The last equality can be seen as analogous to (1.10) and is consistent with the notation \(\overline{\widehat{\Gamma }}_{t_i}=\langle \overline{\widehat{U}}_{t_i} \rangle \). Also, we can establish the following useful bound:
Indeed, from (4.10) and (3.8), Hölder’s inequality and the fact that \(\lambda \) is a finite measure,
Following [31], we can find deterministic functions \(v_i, z_i, \gamma _i\) such that \(v_i(X_{t_{i}}^{\pi }) = \widehat{\mathcal {V}}_{t_i}\), \(z_i(X_{t_{i}}^{\pi }) = \overline{\widehat{Z}}_{t_i}\) and \(\gamma _i(y, X_{t_{i}}^{\pi }) = \overline{\widehat{U}}_{t_i}(y)\) for \(y\in \mathbb {R}^d\). The corresponding \(L^2\)-integrability of these functions is ensured by the properties of \(\widehat{\mathcal {V}}_{t_i}, \overline{\widehat{Z}}_{t_i}\) and \(\overline{\widehat{U}}_{t_i}\). With the previous setup, the natural extension of the terms used to estimate the error of the scheme in [31] must be
The expected values can be written as integrals with respect to a probability measure on \(\mathbb {R}^d\) and therefore, applying Theorem 2.1, these quantities can be made arbitrarily small as \(\kappa \) increases.
The following results will be useful in the proof of the main result. In Section 2.5 of [12], it is explained that the results presented there still hold for a time-dependent non-linearity.
Proposition 4.1
([12], Proposition 2.1) There exists a constant \(C>0\) independent of the step h such that
We will also need the following result.
Lemma 4.7
Consider \((X,Y,Z,U)\in \mathcal {S}^2\times \mathcal {B}^2\) the solution to (1.8–1.9), \(\Gamma \) defined as in (1.10) and \(\Theta _s = (s,X_s,Y_s,Z_s,\Gamma _s)\). Then,
Proof
First, note that by using the useful bound (1.1) we have, for every \(s\in [0,T]\),
Applying again (1.1) and the Lipschitz bound on f,
Then, integrating over \(\Omega \times [0,T]\) with respect to \(d\mathbb {P}\times ds\), using Hölder’s inequality and the bound (4.2),
this finishes the proof.\(\square \)
5 Main result
As stated previously, the proof of our main result, Theorem 5.1, is deeply inspired by the case without jumps considered in [31]. We follow the lines of that proof, with some important differences due to the nonlocal character of our problem. Along the proof we also use the useful bound (1.1) several times: for \(x_1,...,x_k\in \mathbb {R}\) the following holds,
Theorem 5.1
Under (C), there exists a constant \(C>0\) independent of the partition such that for sufficiently small h,
with \(\mathcal {E}_i^v\), \(\mathcal {E}_i^z\) and \(\mathcal {E}_i^\gamma \) given in (4.11), and \(\varepsilon ^{Z}(h)\) and \(\varepsilon ^{\Gamma }(h)\) defined in (4.5).
Proof
Step 1: Recall \(\widehat{\mathcal {V}}_{t_i}\) introduced in (4.6). The purpose of this part is to bound the term \(\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2\) by more tractable terms. We have
Lemma 5.1
There exists \(C>0\) fixed such that for any \(0<h<1\) sufficiently small, one has
with \(\Theta _r=(r,X_r,Y_r,Z_r,\Gamma _r)\).
The rest of this subsection is devoted to the proof of this result.
Proof
Subtracting the equation (1.9) between \(t_i\) and \(t_{i+1}\), we obtain
Using the definition of \(\widehat{\mathcal {V}}_{t_i}\) in (4.6),
Here \(\widehat{\Theta }_{t_i}=(t_i,X^\pi _{t_i},\widehat{\mathcal {V}}_{t_i}, \overline{\widehat{Z}}_{t_i},\overline{\widehat{\Gamma }}_{t_i})\). Then, applying the conditional expectation at time \(t_i\), given by \(\mathbb {E}_i\), and using that, in this case, the stochastic integrals are martingales,
Using the classical inequality \((a+b)^2\le (1+\gamma h)a^2+(1+\frac{1}{\gamma h})b^2\) for \(\gamma >0\) to be chosen, we get
Without loss of generality, because we are looking for bounds, we can replace \([f(\Theta _s)-f(\widehat{\Theta }_{t_i})]\) by \(|f(\Theta _s)-f(\widehat{\Theta }_{t_i})|\). Also, we can drop \(\mathbb {E}_i\) thanks to the law of total expectation. The Lipschitz condition on f in (1.4) allows us to give a bound in terms of the difference between \(\Theta _s\) and \(\widehat{\Theta }_{t_i}\). Indeed, for a fixed constant \(K>0\),
Therefore, we have the bound
where the Lipschitz constant K was absorbed by C. Using now the triangle inequality \(|Y_s-\widehat{\mathcal {V}}_{t_i}|^2 \le 2|Y_s-Y_{t_i}|^2 +2|Y_{t_i}-\widehat{\mathcal {V}}_{t_i}|^2\), and the approximation error of the X scheme (3.4), we find
and therefore, replacing in (5.3),
Recall \(\overline{Z}_{t_i}\) and \(\overline{\Gamma }_{t_i}\) introduced in (4.4). Now, we are going to prove the following
Let us prove the latter; the former is analogous. Recall that the \(\Gamma \) component represents the nonlocal part and is therefore one-dimensional.
It is sufficient to establish that the double product vanishes when integrating and taking expectation. Recall that \(\overline{\Gamma }_{t_i}\) from (4.4) is an \(\mathcal {F}_{t_i}\)-measurable random variable. Then,
Due to the \(\mathcal {F}_{t_i}\)-measurability of the right factor of the last product and the \(L^2(\mathbb {P})\) orthogonality, taking expectation annihilates the last term. Therefore, equations (5.6) and (5.7) are proven. By multiplying (5.2) by \(\Delta W_i\) and taking \(\mathbb {E}_i\),
where we have used Lemma 4.3. Then, subtracting \(h \overline{\widehat{Z}}_{t_i}=\mathbb {E}_i(\widehat{\mathcal {U}}_{i+1}(X^{\pi }_{t_{i+1}})\Delta W_i)\),
By multiplying (5.2) by \(\Delta M_i\) and taking \(\mathbb {E}_i\),
Then, subtracting \(h\overline{\widehat{\Gamma }}_{t_i}=\mathbb {E}_i\left( \widehat{\mathcal {U}}_{i+1}(X^{\pi }_{t_{i+1}})\Delta M_i\right) \),
Summarizing, one has
For the sake of brevity, define now
note that it depends on i. By the properties related to the Itô isometry, from the previous identities we have
Remark 5.1
Note that the finiteness of the Lévy measure \(\lambda \) is important in the previous bound. The case of more general integro-differential operators, such as the fractional Laplacian mentioned in the introduction, is an interesting open problem.
Let us work with equation (5.4). Using (5.6) and (5.7),
Now use (5.9) and (5.10) to find that
Let \(\gamma =C(\lambda (\mathbb {R}^d) + d)\) and define \(D:=(1+\gamma h)\frac{C}{\gamma }\), then the above term is bounded by
Note that the first and last terms in the last expression are similar, so they can be subtracted, which yields a nonpositive quantity that can be bounded from above by 0. Also, we group and respectively bound the similar terms involving \(\mathbb {E}\left( H_{i+1}^2\right) \) and the integral of f. Due to the definition of D, from now on the constant C has a linear dependence on the dimension d such that \(D\le C\). Replacing the last calculation and moving \(\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2\) to the left side,
Now we take h small enough that, for example, \(Ch\le \frac{1}{2}\), and then
Finally, by recalling that \(H_{i+1} =Y_{t_{i+1}}-\widehat{\mathcal {U}}_{i+1} (X^\pi _{t_{i+1}})\), we have established (5.1).\(\square \)
Step 2: The last term in (5.1),
was left without control in the previous step; here we provide a control on this term. Recall the error terms \(\varepsilon ^{Z}(h)\) and \(\varepsilon ^{\Gamma }(h)\) introduced in (4.5). The purpose of this section is to show the following estimate:
Lemma 5.2
There exists a constant \(C>0\) such that,
The rest of this section is devoted to the proof of this result.
Proof of Lemma 5.2
We have that \((a+b)^2\ge (1-h)a^2+(1-\frac{1}{h})b^2\) and
Therefore, we have an upper bound (5.1) and a lower bound for \(\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2\). By connecting these bounds,
Using that for sufficiently small h we have \((1-h)^{-1}\le 2\), we get,
Notice that the expression at time \(t_i\) that we want to estimate appears on the right side at time \(t_{i+1}\); iterating the bound, we get that for all \(i\in \left\{ 0,...,N-1\right\} \)
Taking the maximum over \(i\in \left\{ 0,...,N-1\right\} \), recalling (4.5) and the bounds from Lemmas 4.7 and 4.1,
This is nothing but (5.11).\(\square \)
Remark 5.2
The classic bound used at the beginning of Step 2 could have been stated using a fixed parameter \(\delta \in (0,1)\) in the form \((a+b)^2 \ge (1-h^{\delta })a^2 + (1-\frac{1}{h^\delta })b^2\). This change makes N become \(N^{\delta }\), which is better. However, at some point of the proof the value \(\delta = 1\) is necessary.
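For completeness, both the upper bound used in Step 1 and the lower bound discussed here are instances of Young's inequality \(2|ab| \le \varepsilon a^2 + \varepsilon ^{-1} b^2\), valid for \(\varepsilon >0\):

```latex
(a+b)^2 = a^2 + 2ab + b^2 \le (1+\varepsilon)\,a^2 + \bigl(1+\varepsilon^{-1}\bigr)\,b^2,
\qquad
(a+b)^2 \ge a^2 - 2|a||b| + b^2 \ge (1-\varepsilon)\,a^2 + \bigl(1-\varepsilon^{-1}\bigr)\,b^2,
```

with \(\varepsilon = \gamma h\) in Step 1 and \(\varepsilon = h^{\delta }\) in the variant above.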
Step 3: Estimate (5.11) contains some uncontrolled terms on its RHS. Here the purpose is to bound the term
in terms of more tractable terms. In this step we will prove
Lemma 5.3
There exists \(C>0\) such that,
with \(\mathcal {E}_i^v\), \(\mathcal {E}_i^z\) and \(\mathcal {E}_i^\gamma \) defined in (4.11).
In what follows, we prove (5.13).
Proof
Fix \(i\in \left\{ 0,...,N-1\right\} \). Recall the martingale \((N_t)_{t\in [t_i,t_{i+1}]}\) and take \(t=t_{i+1}\),
Now we replace the definition of \(\widehat{\mathcal {V}}_{t_i}\),
In what follows recall the value of F in the loss function \(L_i(\theta )\) (3.9) evaluated at the point
and that \(\langle \mathcal {G} \rangle _i(X_{t_i};\theta )\) is given in (3.8):
Now fix a parameter \(\theta \) and replace (5.14) on \(L_i(\theta )\):
Note that b is a sum of martingale differences and therefore \(\mathbb {E}_i (b)=0\). By the independence of \(\mu \) and W, we can deduce that
and, since the random variables that appear in a are \(\mathcal {F}_{t_i}\)-measurable, \(\mathbb {E}(ab)=\mathbb {E}\left( \mathbb {E}_i(ab)\right) =\mathbb {E}\left( a\mathbb {E}_i(b)\right) =0\), we have that
By the same arguments on equations (5.6) and (5.7),
Given this decomposition of \(L_i(\theta )\), for optimization purposes we can ignore the part that does not depend on the optimization parameter \(\theta \). Let
Let \(\gamma >0\); using Young’s inequality and the Lipschitz condition on f, we find that
Therefore, we have an upper bound on \(L(\theta )\) for all \(\theta \)
To find a lower bound, we use \((a+b)^2\ge (1-\gamma h)a^2+\left( 1-\frac{1}{\gamma h}\right) b^2\) with \(\gamma >0\)
where we used \(\gamma = 6K^2\). Then,
Connecting these bounds using that \(\hat{L}(\theta ^*)\le \hat{L}(\theta )\) yields that for all \(\theta \),
By taking the infimum on the right side and h small enough that \((1-Ch)\ge \frac{1}{2}\),
Using this bound together with what we found in Steps 1 and 2, we find
Finally, using Proposition 4.1, one ends the proof of (5.13).\(\square \)
Step 4: We now show some bounds for the terms involving the \(\Gamma \) and U components; the same bounds hold for the Z component and are shown in [31]. By using (5.10) in (5.7),
this implies, after using (4.5) and (),
From [31] we get the analogous bound for the Z component; putting these two together yields
This tells us that the next task in this proof is to give a suitable bound for \(\mathbb {E}\left( H_{i+1}^2\right) - \mathbb {E}\left| \mathbb {E}_i (H_{i+1}) \right| ^2\). Recall from (5.8) that \(H_{i+1} = Y_{t_{i+1}} - \widehat{\mathcal {U}}_{i+1} (X_{t_{i+1}}^{\pi })\); then
From (5.12) and (5.4) we have an upper and lower bound on \(\mathbb {E}\left| Y_{t_i}-\widehat{\mathcal {V}}_{t_i} \right| ^2\). Indeed, first one has
Second, we have that for all \(\gamma >0\)
Let us denote the expression inside the square brackets by \(B_i\). Subtracting \((1-h) \mathbb {E}\left| \mathbb {E}_i(H_{i+1})\right| ^2\) and dividing by \((1-h)\),
For \(\gamma = 3C\) and sufficiently small h, we can force,
Hence,
Finally, note that,
Remark 5.3
Note that in equation (5.19) the last term is multiplied by N. With the bounds that we have, it is impossible to get rid of the N, and this is why the \(\delta \) improvement mentioned in Remark 5.2 is not of much help.
Coming back to (5.17),
Therefore, by plugging this bound in (5.16), noting that \(|Y_{t_i}-\widehat{\mathcal {V}}_{t_i}|^2\le 2|Y_{t_i}-\widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) |^2 + 2|\widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }) - \widehat{\mathcal {V}}_{t_i}|^2\), \(hN = 1\), and using Lemma 4.1, we have for some \(C>0\),
Now, use (5.15) together with Lemma 5.3 to get
Again recalling (5.15), using the previous bound and
we conclude that there exist \(C>0\), independent of the partition, such that for h sufficiently small,
Thus the theorem has been demonstrated.\(\square \)
We now state some remarks from the proof.
Remark 5.4
Note that the terms \(\mathcal {E}_i^v\), \(\mathcal {E}_i^z\) and \(\mathcal {E}_i^\gamma \) can be made arbitrarily small, in view of Lemma 2.1. The challenge here, as in almost every DL algorithm, is that we do not know how many units per layer, i.e., how large a \(\kappa \), we need in order to achieve a fixed tolerance; we can only ensure the existence of a NN architecture satisfying the approximation property.
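As a toy illustration of this approximation property (our own sketch, not the paper's construction): a shallow ReLU network of modest width can already represent simple functions exactly, while universal approximation results of Hornik type only guarantee the existence of an \(\varepsilon \)-approximation for sufficiently large \(\kappa \), without an explicit size.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def shallow_net(x, weights, biases, out_weights):
    """Width-kappa one-hidden-layer ReLU network:
    f(x) = sum_j out_weights[j] * relu(weights[j] * x + biases[j])."""
    return sum(c * relu(w * x + b)
               for w, b, c in zip(weights, biases, out_weights))

# Width kappa = 2 suffices to represent |x| exactly:
# |x| = relu(x) + relu(-x).
xs = np.linspace(-2.0, 2.0, 101)
out = shallow_net(xs, weights=[1.0, -1.0], biases=[0.0, 0.0],
                  out_weights=[1.0, 1.0])
assert np.allclose(out, np.abs(xs))
```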
Remark 5.5
The main difficulty in adapting the proof given in [31] was to give a useful definition of the third NN, whose mission is to approximate the nonlocal component. This was problematic because we have two options: the first is to define the NN to approximate the whole integral
this seems intuitive because it leads our third NN to approximate the nonlocal part of the PIDE and, therefore, to receive one parameter: \(X_{t_{i}}^{\pi }\). But we also need to approximate, or be able to calculate, the stochastic integral
which cannot be done by only knowing the first integral. To overcome this issue, we proposed to approximate the integrand and to solve the problem of actually integrating this function with other tools.
Remark 5.6
The nonlocal part of the PIDE (1.2) makes us add a Lévy process, which is a canonical tool when dealing with nonlocal operators such as the one appearing in equation (1.2). This addition results in the natural definition of objects analogous to those of [31], such as the \(\Gamma , \bar{\Gamma }\) components for the nonlocal case.
Remark 5.7
The result of the theorem states that the better we can approximate \(v_i, z_i, \gamma _i\) by NN architectures, the better we can approximate \((Y_{t_i}, Z_{t_i}, \Gamma _{t_i})\) by \((\widehat{\mathcal {U}}_{i}(X_{t_{i}}^{\pi }),\widehat{\mathcal {Z}}_{i}(X_{t_{i}}^{\pi }),\langle \widehat{\mathcal {G}} \rangle _{i}(X_{t_{i}}^{\pi }))\).
Remark 5.8
Because of the finiteness of the measure \(\lambda \), the case of the fractional Laplacian mentioned in the introduction is not covered by Theorem 5.1. We hope to extend our results to this case in forthcoming work.
5.1 Optimization step of the algorithm
In this subsection we give a brief explanation of how to compute the loss function from Algorithm 1 in order to implement it. As usual, we extend the computation of the loss function shown in [31] to our nonlocal case, for which we need the following definitions. For a càdlàg process \((C_s)_{s\in [0,T]}\), \(\Delta C_s:=C_s - C_{s^-}\) stands for the jump of C at time \(s\in [0,T]\), and for a process \(U\in L_{\mu }^1(\mathbb {R}^d)\) the stochastic integral with respect to \(\mu \) ([4, Sections 2 and 4]) is defined as follows,
where
is a compound Poisson process (see [4, Thm 2.3.10]). Therefore,
For simplicity, assume that \(\lambda \) is a probability measure absolutely continuous with respect to the Lebesgue measure. As we will see, several simulations of the Lévy process \((X_t)_{t\in [0,T]}\) are needed.
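The compound Poisson process above can be simulated in the standard way: draw a Poisson number of jumps on \([0,T]\), uniform jump times, and i.i.d. jump sizes from the normalized jump distribution. The following sketch is illustrative (all names are ours; the jump distribution below is a stand-in, not the paper's \(\lambda \)):

```python
import numpy as np

def simulate_compound_poisson(T, intensity, jump_sampler, rng):
    """Simulate a compound Poisson path on [0, T].

    The number of jumps is Poisson(intensity * T), jump times are
    uniform on [0, T], and jump sizes are drawn from the normalized
    jump distribution. Returns sorted jump times and jump sizes.
    """
    n_jumps = rng.poisson(intensity * T)
    times = np.sort(rng.uniform(0.0, T, size=n_jumps))
    jumps = jump_sampler(n_jumps, rng)
    return times, jumps

def value_at(t, times, jumps):
    """P_t = sum of the jumps that occurred up to time t."""
    return jumps[times <= t].sum()

# Example: finite Levy measure with total mass 2 and N(0,1) jump sizes.
rng = np.random.default_rng(0)
times, jumps = simulate_compound_poisson(
    T=1.0, intensity=2.0,
    jump_sampler=lambda n, r: r.normal(0.0, 1.0, size=n),
    rng=rng,
)
assert value_at(1.0, times, jumps) == jumps.sum()
```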
As shown in Algorithm 1, given \(\widehat{\mathcal {U}}_{i+1}\) for \(i\in \left\{ 0,...,N-1\right\} \), we need to minimize \(L_i(\cdot )\) and define the NNs for step i. Recall the definition of \(L_i\) in (3.9); the idea is to write the expected value from the loss function as an average over simulations. Let \(M\in \mathbb {N}\) and \(I = \left\{ 1,..., M\right\} \), and generate simulations \(\left\{ x^i_k: k\in I\right\} \), \(\left\{ x^{i+1}_k: k\in I\right\} \), \(\left\{ w_k:k\in I\right\} \) of \(X_{t_{i}}^{\pi }\), \(X_{t_{i+1}}^{\pi }\) and \(\Delta W_i\), respectively. Then,
Note that we use an Euler scheme for the simulations of \((X_t)_{t\in [0,T]}\); nevertheless, there exist other methods depending on the structure of the diffusion, see [14, 34]. Recall that F needs two different integrals of \(\mathcal {G}_i(x^i_k,\cdot ;\theta )\). To approximate these values, let \(L\in \mathbb {N}\) and \(J=\left\{ 1,...,L\right\} \) and consider, for every \(k\in I\), simulations \(\left\{ y^k_{l}: l\in J\right\} \) of a random variable \(Y\sim \lambda \); here the finiteness of the measure is important. Then, the quantities we need can be computed as follows,
Therefore, provided we can simulate trajectories of \((X_t)_{t\in [0,T]}\) and \((W_t)_{t\in [0,T]}\), realizations of \(Y\sim \lambda \) and the compound Poisson process \((P_t)_{t\in [0,T]}\), we can minimize \(L_i\), find the optimal \(\theta ^*\) and define
Remark 5.9
The nonlocal term in equation (1.2) adds complexity not only to the proof of the consistency of the algorithm but also to the algorithm itself. As we saw, the finiteness of the measure \(\lambda \) is key, as is the capability to simulate integrals with respect to Poisson random measures and trajectories of the Lévy process. The implementation of this method and an extension to PIDEs with more general integro-differential operators, such as the fractional Laplacian, are left to future work.
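To make the Monte Carlo recipe of this subsection concrete, here is a minimal sketch of its two ingredients (hypothetical names; the true loss involves the full term F from (3.9)): estimating an integral against the probability measure \(\lambda \) by averaging samples of \(Y\sim \lambda \), and replacing the expectation in the loss by an empirical mean over M simulations.

```python
import numpy as np

def mc_nonlocal_integral(g, x, y_samples):
    """Monte Carlo estimate of int g(x, y) lambda(dy) when lambda is a
    probability measure: average g(x, y_l) over L samples y_l ~ lambda."""
    return np.mean([g(x, y) for y in y_samples])

def empirical_loss(u_next, F, x_i, x_next):
    """Empirical counterpart of the loss L_i(theta): the expectation is
    replaced by an average over M simulated samples."""
    residuals = u_next(x_next) - F(x_i)
    return np.mean(residuals ** 2)

# Example with lambda = N(0, 1): int (x + y)^2 lambda(dy) = x^2 + 1.
rng = np.random.default_rng(1)
ys = rng.normal(size=200_000)
est = mc_nonlocal_integral(lambda x, y: (x + y) ** 2, 1.0, ys)
assert abs(est - 2.0) < 0.05

# Sanity check: if F reproduces u_{i+1} exactly, the loss vanishes.
assert empirical_loss(lambda x: x, lambda x: x + 1.0,
                      np.zeros(4), np.ones(4)) == 0.0
```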
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.
References
Allaire, G.: Numerical Analysis and Optimization: An Introduction to Mathematical Modelling and Numerical Simulation. Oxford University Press, Oxford (2007)
Md, Z.A., Tarek, M.T., Chris, Y., Stefan, W., Paheding, S., Mst, S.N., Mahmudul, H., Brian, C.V.E., Abdul, A.S.A., Vijayan, K.A.: A state-of-the-art survey on deep learning theory and architectures. Electronics 8(3), 292 (2019). https://doi.org/10.3390/electronics8030292
Ali, M.A., Hiam, A., Isam, A.Q., Amin, A., Wafaa, A.S.: Brain tumor classification using deep learning technique—a comparison between cropped, uncropped, and segmented lesion images with different sizes. Int. J. Adv. Trends Comput. Sci. Eng. 8(6), 2 (2019)
Applebaum, D.: Lévy Processes and Stochastic Calculus, 2nd edn. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge (2009)
Barles, G., Buckdahn, R., Pardoux, E.: Backward stochastic differential equations and integral-partial differential equations. Stoch. Stoch. Rep. 60, 57–83 (1996)
Guy, B., Olivier, L., Erwin, T.: Lipschitz regularity for integro-differential equations with coercive hamiltonians and applications to large time behavior. Nonlinearity, Volume 30, Number 2 (2017), arXiv:1602.07806 [math.AP]
Dalya, B.: Machine learning in astronomy: a practical overview, arXiv:1904.07248v1 [astro-ph.IM] 15 Apr 2019
Christian, B., Fabian, H., Martin, H., Arnulf, J., Thomas, K.: Overcoming the curse of dimensionality in the numerical approximation of Allen-Cahn partial differential equations via truncated full-history recursive multilevel Picard approximations, Accepted in J. Numer. Math. arXiv:1907.06729 [math.NA], 2019
Christian, B., Weinan, E., Arnulf, J.: Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations, J. Nonlinear Sci. 29 (2019), 1563–1619, arXiv:1709.05963v1 [math.NA], 2017
Benson, D.A., Wheatcraft, S.W., Meerschaert, M.M.: Application of a fractional advection-dispersion equation. Water Resour. Res. 36(6), 1403–1412 (2000)
Birindelli, I., Galise, G., Topp, E.: Fractional truncated Laplacians: representation formula, fundamental solutions and applications. arXiv:2010.02707 [math.AP] (2020)
Bouchard, B., Elie, R.: Discrete time approximation of decoupled forward-backward SDE with jumps. Stoch. Process. Appl. 118(1), 53–75 (2008). hal-00015486
Dimitri, B.: Machine and deep learning applications in particle physics. Int. J. Modern Phys. A 34(35):1930019 https://doi.org/10.1142/S0217751X19300199. arXiv:1912.08245v1 [physics.data-an]
Buckwar, E., Riedler, M.G.: Runge-Kutta methods for jump-diffusion differential equation. J. Comput. Appl. Math. 236, 1155–1182 (2011)
Caffarelli, L., Silvestre, L.: An extension problem related to the fractional Laplacian. Comm. Partial Differ. Equ. 32(8), 1245–1260 (2007)
Carr, P., Geman, H., Madan, D.B., Yor, M.: The fine structure of asset returns: An empirical investigation. J. Bus. 75, 305–332 (2002)
Cont, R., Tankov, P.: Financial Modelling with Jump Processes, 1st edn. Chapman and Hall/CRC, Boca Raton (2003)
Gonzalo, D., Erwin, T.: The nonlocal inverse problem of Donsker And Varadhan, arXiv:2011.13295 [math.AP], 2020
D’Elia, M., Du, Q., Glusa, C., Gunzburger, M., Tian, X., Zhou, Z.: Numerical methods for nonlocal and fractional models. Acta Numer. 29, 1–124 (2020)
Delong, Ł.: Backward Stochastic Differential Equations with Jumps and Their Actuarial and Financial Applications. EAA Series. Springer, London (2013). https://doi.org/10.1007/978-1-4471-5331-3
Eleonora, D.N., Giampiero, P., Enrico, V.: Hitchhiker’s guide to the fractional Sobolev spaces. Bull. Sci. Math. 136(5), 521–573 (2012)
Di Nunno, G., Øksendal, B., Proske, F.: Malliavin Calculus for Lévy Processes with Applications to Finance. Springer, Berlin, Heidelberg (2009). https://doi.org/10.1007/978-3-540-78572-9
Qiang, D., Xiaochuan, T.: Stability of nonlocal dirichlet integrals and implications for peridynamic correspondence material modeling, SIAM J. Appl. Math. (2018) Vol. 78, No. 3, pp. 1536–1552, arXiv:1710.05119 [physics.comp-ph]
Dennis, E., Philipp, G., Arnulf, J., Christoph, S.: DNN Expression Rate Analysis of High-dimensional PDEs: Application to Option Pricing, Tech. Report 2018-33. Seminar for Applied Mathematics, ETH Zürich, Switzerland, 2018
Gilboa, G., Osher, S.: Nonlocal operators with applications to image processing. Multisc. Model. Simul. 7, 1005–1028 (2008)
Lukas, G., Christoph, S.: Deep ReLU neural network approximation for stochastic differential equations with jumps, arXiv:2102.11707 (2021)
Lukas, G., Christoph, S.: Deep ReLU network expression rates for option prices in high-dimensional, exponential Lévy models, arXiv:2101.11897 (2021)
Philipp, G., Fabian, H., Arnulf, J., Philippe, von W.: A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations, To appear in Mem. Amer. Math. Soc.; arXiv:1809.02362 (2018), 124 pages
Han, J., Jentzen, A.: Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci. 115, 8505–8510 (2018)
Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4, 251–257 (1991)
Huré, C., Pham, H., Warin, X.: Deep backward schemes for high-dimensional nonlinear PDEs. Math. Comp. 89, 1547–1579 (2020)
Martin, H., Arnulf, J., Thomas, K., Tuan, A.N., Philippe, von W.: Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations, arXiv:1807.01212 [math.PR], 2018. Accepted in Proc. Roy. Soc. A
Benjamin, J., Sylvie, M., Wojbor, A.W.: Nonlinear SDEs driven by Lévy processes and related PDEs. ALEA Lat. Am. J. Probab. Math. Stat. 4 (2008), 1–29. arXiv:0707.2723, 2007
Arturo, K.-H., Peter, T.: Jump-adapted discretization schemes for Lévy-driven SDEs, Stochastic Processes and their Applications, Volume 120, Issue 11, 2010, Pages 2258-2285, ISSN 0304-4149, https://doi.org/10.1016/j.spa.2010.07.001
Lejay, A., Mordecki, E., Torres, S.: Numerical approximation of backward stochastic differential equations with jumps (2007). inria-00357992v2
Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 6, 861–867 (1993)
Geert, L., Thijs, K., Babak, E.B., Arnaud, A.A.S., Francesco, C., Mohsen, G., Jeroen, A.W.M., van der Laak, Bram, van G., Clara, I.S.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, Pages 60–88 (2017), arXiv:1702.05747v2 [cs.CV] 4 Jun 2017
Martin, M., Andrew, M.N., Hendrick, W.H.: Neural network solutions to differential equations in non-convex domains: solving the electric field in the slit-well microfluidic device. Phys. Rev. Research 2, 033110 – Published 21 July 2020. ArXiv:2004.12235v1 [physics.comp-ph], 2020
Mahabal, A., Sheth, K., Gieseke, F., Pai, A., Djorgovski, S. G., Drake, A. J., Graham, M. J., CSS/CRTS/PTF Teams: Deep-learnt classification of light curves, 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, 2017, pp. 1-8, https://doi.org/10.1109/SSCI.2017.8280984. arXiv:1709.06257v1 [astro-ph.IM]
Kevin, M., Kavita, B., Noah, S.: StreetStyle: Exploring world-wide clothing styles from millions of photos, arXiv:1706.01869v1 [cs.CV] 6 Jun 2017
McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943)
Protter, P.E.: Stochastic Integration and Differential Equations. Springer, Berlin (2004)
Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65, 6 (1958)
Justin, S., Konstantinos, S.: DGM: A deep learning algorithm for solving partial differential equations. J. Comput. Phys. 375, 15 December 2018, Pages 1339–1364, arXiv:1708.07469 [q-fin.MF], 2017
Pablo, R.S.: User’s guide to the fractional Laplacian and the method of semigroups, in: Fractional Differential Equations, Walter de Gruyter GmbH & Co KG, pp. 235–266, arXiv:1808.05159 [math.AP], 2018
Tadmor, E., Tan, C.: Critical thresholds in flocking hydrodynamics with non-local alignment. Philos. Trans. R. Soc. Lond. A Math. Phys. Eng. Sci. 372, 20130401 (2014)
Giacomo, T., Guglielmo, M., Juan, C., Matthias, T., Roger, M., Giuseppe, C.: Neural-network quantum state tomography for many-body systems. Nat. Phys. 14, pages 447–450 (2018), arXiv:1703.05334v2 [cond-mat.dis-nn]
Haohan, W., Bhiksha, R.: On the origin of deep learning, arXiv:1702.07800v4 [cs.LG] 3 Mar 2017
Zhang, J.: A numerical scheme for BSDEs. Ann. Appl. Probab. 14(1), 459–488 (2004)
Zhang, X.: Stochastic functional differential equations driven by Lévy processes and quasi-linear partial integro-differential equations. Ann. Appl. Probab. 22(6), 2505–2538 (2012). arXiv:1106.3601 [math.PR]
Funding
Open Access funding enabled and organized by Projekt DEAL. Partial financial support was received from Chilean research grants FONDECYT 1191412, and CMM Conicyt PIA AFB170001.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest
The author declares that he has no financial interest.
Additional information
This article is part of the section “Computational Approaches” edited by Siddhartha Mishra.
J.C.’s work was funded in part by Chilean research grants FONDECYT 1191412, and CMM Conicyt PIA AFB170001.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Castro, J. Deep learning schemes for parabolic nonlocal integro-differential equations. Partial Differ. Equ. Appl. 3, 77 (2022). https://doi.org/10.1007/s42985-022-00213-z
Keywords
- Deep learning
- Deep neural networks
- Approximation
- Nonlocal diffusion equations
- Lévy processes
- Stochastic differential equations