Deep Learning for the Benes Filter

The Benes filter is a well-known continuous-time stochastic filtering model in one dimension that has the advantage of being explicitly solvable. From an evolution equation point of view, the Benes filter is also the solution of the filtering equations given a particular set of coefficient functions. In general, the filtering stochastic partial differential equations (SPDEs) arise as the evolution equations for the conditional distribution of an underlying signal given partial, and possibly noisy, observations. Their numerical approximation presents a central issue for theoreticians and practitioners alike, who are actively seeking accurate and fast methods, especially for high-dimensional settings such as numerical weather prediction. In this paper we present a brief study of a new numerical method based on the mesh-free neural network representation of the density of the solution of the Benes model achieved by deep learning. Based on the classical SPDE splitting method, our algorithm includes a recursive normalisation procedure to recover the normalised conditional distribution of the signal process. Within the analytically tractable setting of the Benes filter, we discuss the role of nonlinearity in the filtering model equations for the choice of the domain of the neural network. Further, we present the first study of the neural network method with an adaptive domain for the Benes model.


Introduction
In this paper we present a further study of the deep learning method developed in [4], using the example of the Benes filter. The algorithm is derived from the splitting method for SPDEs and replaces the PDE approximation step by a neural network representation and learning algorithm. Combined with the Monte-Carlo method for the approximation of the required normalisation constant, this method becomes completely mesh-free. Furthermore, an important property of the methodology in the filtering context is the ability to iterate it over several time steps. This allows the algorithm to be run online and to successively process observations arriving sequentially. In [4] it was noted that a possible extension of the approximation method would be to use an adaptive domain as the support of the neural network. We present in this work the first results obtained using an adaptive domain in the nonlinear and analytically tractable case of the Benes filter.
The paper is structured as follows. In subsection 1.1 we briefly introduce the nonlinear, continuous-time stochastic filtering framework. The setting is identical to the one assumed in [4] and the reader may consult [1] for an in-depth treatment of stochastic filtering. Thereafter, in subsection 1.2, we formulate the Benes filtering model we will be using in the numerical studies and state the explicit solution for the filter in this case. Then, in subsection 1.3 we introduce the filtering equation and the classical SPDE splitting method upon which the new algorithm in [4] was built.
Next, in section 2 we present an outline of the derivation of the new methodology. For details, the reader is referred to the original article [4]. The first idea of the algorithm, presented in subsection 2.1, is to reformulate the solution of the PDE for the density of the unnormalised filter as an expected value via the Feynman-Kac formula, based on an auxiliary diffusion process derived from the model equations. Moreover, in subsection 2.2 we briefly specify the neural network parameters used in the method, as well as the employed loss function. The theoretical part of the paper is concluded with subsection 2.3, where we show how to normalise the obtained neural network from the prediction step using Monte-Carlo approximation for linear sensor functions. Section 3 contains the detailed parameter values and results of the numerical studies that we performed. The first result in this work is presented in subsection 3.1 and is a simulation of the Benes filter using the deep learning method over a larger domain, as well as a longer time interval, than in the paper [4]. Here, we have not employed any domain adaptation. However, the size of the domain needed to accommodate the support of the filter over a longer time period was estimated using the solution of the Benes model. This is necessary, as the nonlinearity of the Benes model makes it difficult to know the evolution of the posterior a priori. Thus, a much larger domain would be required if it were chosen in an ad hoc way. Then, in subsection 3.2 we show our results for the Benes filter using domain adaptation. The adaptation was performed using precomputed estimates of the support of the filter, again employing the solution formula for the Benes filter.
Finally, we formulate the conclusions from our experiments in section 4. In short, the domain-adapted method was more effective in resolving the bimodality in our study than the non-adapted one. However, this came at the cost of a linear trend in the error.

Nonlinear stochastic filtering problem
The stochastic filtering framework consists of a pair of stochastic processes (X, Y) on a probability space (Ω, F, P) with a normal filtration (F_t)_{t≥0}, modelled, P-a.s., as

dX_t = f(X_t) dt + σ(X_t) dV_t

and

dY_t = h(X_t) dt + dW_t.

Here, the time parameter is t ∈ [0, ∞), d, p, m ∈ N, and f : R^d → R^d and σ : R^d → R^{d×p} are the drift and diffusion coefficient functions of the signal. The processes V and W are p- and m-dimensional independent, (F_t)_{t≥0}-adapted Brownian motions. We call X the signal process and Y the observation process.
The function h : R d → R m is often called the sensor function, or link function, because it models the possibly nonlinear connection of the signal and observation processes.
Further, consider the observation filtration (Y_t)_{t≥0} given as

Y_t = σ(Y_s; s ∈ [0, t]) ∨ N,

where N are the P-nullsets of F. The aim of nonlinear filtering is to compute the probability-measure-valued, (Y_t)_{t≥0}-adapted stochastic process π defined by the requirement that for all bounded measurable test functions ϕ : R^d → R and t ∈ [0, ∞) we have, P-a.s., that

π_t(ϕ) = E[ϕ(X_t) | Y_t].

We call π the filter. Furthermore, let the process Z be defined such that for all t ∈ [0, ∞),

Z_t = exp( −∫_0^t h(X_s)^T dW_s − (1/2) ∫_0^t ‖h(X_s)‖² ds ).

Then, assuming that

E[ exp( (1/2) ∫_0^t ‖h(X_s)‖² ds ) ] < ∞,   t ∈ [0, ∞),

we have that Z is an (F_t)_{t≥0}-martingale, and by the change of measure (for details, see [1]) given by

dP̃_t/dP |_{F_t} = Z_t,   t ≥ 0,

the processes X and Y are independent under P̃ and Y is a P̃-Brownian motion. Here, P̃ is the consistent measure defined on ⋃_{t∈[0,∞)} F_t. Finally, under P̃, we can define the measure-valued stochastic process ρ by the requirement that for all bounded measurable functions ϕ : R^d → R and t ∈ [0, ∞) we have, P-a.s., that

ρ_t(ϕ) = Ẽ[ ϕ(X_t) Z_t^{−1} | Y_t ].

The Kallianpur-Striebel formula (see [1]) justifies the terminology of calling ρ the unnormalised filter.
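As an illustration of the signal and observation dynamics, the following sketch simulates one path of the pair (X, Y) with the Euler-Maruyama scheme; the coefficient functions f, σ and h and all numerical parameters are hypothetical stand-ins, not the ones used in the paper.

```python
import numpy as np

# Euler-Maruyama simulation of one path of the signal/observation pair
# (X, Y) in dimension d = p = m = 1.  The coefficients below are
# hypothetical examples, not the model used in the paper.
rng = np.random.default_rng(0)

def f(x):      # signal drift (assumed example)
    return -x

def sigma(x):  # signal diffusion coefficient (assumed example)
    return 1.0

def h(x):      # sensor function (assumed example)
    return 2.0 * x

T, n_steps = 1.0, 1000
dt = T / n_steps
X = np.empty(n_steps + 1)
Y = np.empty(n_steps + 1)
X[0], Y[0] = 0.5, 0.0
for k in range(n_steps):
    dV = rng.normal(0.0, np.sqrt(dt))   # signal noise increment
    dW = rng.normal(0.0, np.sqrt(dt))   # observation noise increment
    X[k + 1] = X[k] + f(X[k]) * dt + sigma(X[k]) * dV
    Y[k + 1] = Y[k] + h(X[k]) * dt + dW
```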

The Benes filtering model
The Benes filter is a one-dimensional nonlinear model that is used as a benchmark in the numerical studies below. As we show below, it is one of the rare cases of explicitly solvable continuous-time stochastic filtering models. Here, we are considering a special case of the more general class of Benes filters presented, for example, in [1, Chapter 6.1]. The signal is given by drift and diffusion coefficient functions f and σ parametrised by constants α, β ∈ R, and the observation is given by the affine-linear sensor function h(x) = h_1 x + h_2. The density p^B of the filter solving the Benes model is then given by a weighted sum of two Gaussian densities, whose weights, means and common variance are explicit functions of time and of the observation path; the full formulas can be found in [1, Chapter 6.1].
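Since the Benes density is a weighted sum of two Gaussians, its evaluation is straightforward once the weights, means and variance are known. The sketch below evaluates such a two-component mixture; the weight, means and variance are placeholder values, not the explicit Benes formulas from [1].

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def benes_type_density(x, w, mu1, mu2, var):
    # Two-component Gaussian mixture of the Benes form; w, mu1, mu2 and
    # var are placeholders for the explicit time-dependent quantities.
    return w * gaussian_pdf(x, mu1, var) + (1.0 - w) * gaussian_pdf(x, mu2, var)

x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
p = benes_type_density(x, 0.5, -1.0, 1.0, 0.3)
mass = np.sum(p) * dx   # should be close to 1
```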

Filtering equation and general splitting method
Note that, under the conditions given in [4], X admits the infinitesimal generator A : D(A) → B(R^d) given, for all ϕ ∈ D(A), by

Aϕ = Σ_{i=1}^d f_i ∂ϕ/∂x_i + Σ_{i,j=1}^d a_{ij} ∂²ϕ/∂x_i∂x_j,

where D(A) denotes the domain of the differential operator A and a = (1/2) σσ^T. It is well known (see, e.g., [1]) that the unnormalised filter ρ satisfies the filtering equation, i.e. for all t ≥ 0 we have, P̃-a.s., that

ρ_t(ϕ) = π_0(ϕ) + ∫_0^t ρ_s(Aϕ) ds + ∫_0^t ρ_s(ϕ h^T) dY_s.

The classical splitting method for the filtering equation is given in [3] and seeks to approximate the following SPDE for the density p_t of the unnormalised filter, given for all t ≥ 0, x ∈ R^d, and P̃-a.s. as

dp_t(x) = A^* p_t(x) dt + p_t(x) h(x)^T dY_t,

and relies on the splitting-up algorithm described in [9] and [10]. Here A^* is the formal adjoint of the infinitesimal generator A of the signal process X.
We summarise the splitting-up method below in Note 1.
Note 1. The splitting method for the filtering problem is defined by iterating the steps below with initial density p̃_0(·) = p_0(·):

1. (Prediction) Compute an approximation p̃_n of the solution to

∂p̃/∂t = A^* p̃ on (t_{n−1}, t_n] × R^d,   p̃(t_{n−1}, ·) = p_{n−1}(·),   (7)

at time t_n, and

2. (Normalisation) compute the normalisation constant

C_n = ∫_{R^d} ξ_n(x) p̃_n(x) dx,

with z_n = (Y_{t_n} − Y_{t_{n−1}})/(t_n − t_{n−1}) and the function

ξ_n(x) = exp( −((t_n − t_{n−1})/2) ‖z_n − h(x)‖² ),   x ∈ R^d,

so that we can set

p_n(x) = ξ_n(x) p̃_n(x) / C_n.

The deep learning method studied below replaces the prediction step of the splitting method above by a deep neural network approximation algorithm to avoid an explicit space discretisation. This is achieved by representing each p̃_n(z) by a feed-forward neural network and approximating the initial value problem (7) based on its stochastic representation using a sampling procedure. The normalisation step may then be computed either using quadrature or, to preserve the mesh-free characteristic, by Monte-Carlo approximation.
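A minimal grid-based sketch of one predict/normalise cycle of Note 1 may look as follows; here the prediction step is a placeholder (the identity), the sensor function is an assumed linear example, and the observation increment is made up.

```python
import numpy as np

# One predict/normalise cycle of the splitting method on a grid.
# `predict` is a placeholder for a Fokker-Planck solver (the paper
# replaces it by a trained neural network); h is an assumed linear
# sensor and dY a made-up observation increment.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
dt = 0.1

def h(z):
    return z

def predict(p_prev):          # placeholder prediction step
    return p_prev

def normalise(p_tilde, z_n):
    xi = np.exp(-0.5 * dt * (z_n - h(x)) ** 2)   # likelihood weights
    C = np.sum(xi * p_tilde) * dx                # normalisation constant
    return xi * p_tilde / C

p = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)  # initial density
dY = 0.12                                          # observation increment
p = normalise(predict(p), dY / dt)
```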

Derivation and outline of the deep learning algorithm
Here, we present a concise version of the derivation laid out in detail in [4].

Feynman-Kac representation
Assuming sufficient differentiability of the coefficient functions, the operator A^* may be expanded such that for all ϕ ∈ D(A^*),

A^*ϕ = Tr(a Hess ϕ) + ⟨2 div(a) − f, grad ϕ⟩ + r ϕ.   (8)

Subtracting the zero-order term from (8), we obtain an operator that generates the auxiliary diffusion process, denoted X̂, which is instrumental in the deep learning method.

Definition 1. Define the partial differential operator Â such that

Âϕ = Tr(a Hess ϕ) + ⟨2 div(a) − f, grad ϕ⟩,

and the function r : R^d → R such that for all x ∈ R^d,

r(x) = Σ_{i,j=1}^d (∂²a_{ij}/∂x_i∂x_j)(x) − (div f)(x),

where div(a) denotes the vector with components (div(a))_i = Σ_{j=1}^d ∂a_{ij}/∂x_j.
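The expansion above can be sanity-checked numerically in one dimension, where A*ϕ = (aϕ)'' − (fϕ)', Âϕ = aϕ'' + (2a' − f)ϕ' and r = a'' − f'. The following sketch compares both sides by finite differences for assumed example coefficients.

```python
import numpy as np

# Finite-difference check (d = 1) of  A* phi = Â phi + r phi  with
#   A* phi = (a phi)'' - (f phi)',   Â phi = a phi'' + (2a' - f) phi',
#   r = a'' - f',   a = sigma^2 / 2.   f and sigma are assumed examples.
f = lambda x: np.sin(x)
sigma = lambda x: 1.0 + 0.5 * np.cos(x)
a = lambda x: 0.5 * sigma(x) ** 2
phi = lambda x: np.exp(-x ** 2)

eps = 1e-4
d1 = lambda g, x: (g(x + eps) - g(x - eps)) / (2.0 * eps)            # first derivative
d2 = lambda g, x: (g(x + eps) - 2.0 * g(x) + g(x - eps)) / eps ** 2  # second derivative

x0 = 0.7
lhs = d2(lambda y: a(y) * phi(y), x0) - d1(lambda y: f(y) * phi(y), x0)
rhs = (a(x0) * d2(phi, x0)
       + (2.0 * d1(a, x0) - f(x0)) * d1(phi, x0)
       + (d2(a, x0) - d1(f, x0)) * phi(x0))
# lhs and rhs agree up to finite-difference error
```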

Lemma 1.
For all x ∈ R^d, the operator Â defined in Definition 1 is the infinitesimal generator of the Itô diffusion X̂ : [0, ∞) × Ω → R^d given, for all t ≥ 0 and P-a.s., by

X̂_t = X̂_0 + ∫_0^t (2 div(a) − f)(X̂_s) ds + ∫_0^t σ(X̂_s) dB_s,

where B is a p-dimensional Brownian motion. From the well-known Feynman-Kac formula (see Karatzas and Shreve [6, Chapter 5, Theorem 7.6]) we can deduce Corollary 1 below for the initial value problem.
Corollary 1. Let T > 0, let Â be the operator defined in Definition 1, let k : R^d → [0, ∞) be measurable, and let ψ : R^d → R be an at most polynomially growing function. Suppose that u ∈ C^{1,2}_b((0, T] × R^d, R) satisfies the Cauchy problem

∂u/∂t = Âu − k u on (0, T] × R^d,   u(0, ·) = ψ.   (9)

Then, for all (t, x) ∈ (0, T] × R^d, we have that

u(t, x) = E[ ψ(X̂^x_t) exp( −∫_0^t k(X̂^x_s) ds ) ],

where X̂^x is the diffusion generated by Â, started at x. Recall that our aim is to approximate the Fokker-Planck equation (7). Written in the form as in Corollary 1, (7) reads as

∂p̃/∂t = Âp̃ − (−r) p̃.

Thus, with k = −r, and assuming that −r is non-negative in (9), we obtain by Corollary 1 the representation, for all n ∈ {1, . . . , N}, t ∈ (t_{n−1}, t_n], z ∈ R^d,

p̃_n(t, z) = E[ p_{n−1}(X̂^z_{t−t_{n−1}}) exp( ∫_0^{t−t_{n−1}} r(X̂^z_s) ds ) ].   (10)

Note that [4, Proposition 2.4] shows that we have a feasible minimisation problem to approximate by the learning algorithm (see also [2, Proposition 2.7]).

Remark 1. For the Benes model, the diffusion coefficient σ is constant, so that div(a) = 0 and the auxiliary diffusion is given as

X̂_t = X̂_0 − ∫_0^t f(X̂_s) ds + σ B_t,

with r = −f'. Therefore the representation of the solution to the Fokker-Planck equation (7) in the Benes case reads

p̃_n(t, z) = E[ p_{n−1}(X̂^z_{t−t_{n−1}}) exp( −∫_0^{t−t_{n−1}} f'(X̂^z_s) ds ) ].
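The representation in Remark 1 can be sketched as a Monte-Carlo estimator: simulate the auxiliary diffusion with drift −f by Euler-Maruyama and average the terminal values weighted by the exponential of the integrated potential r = −f'. The drift f = tanh, the initial density ψ and all parameters below are assumed examples.

```python
import numpy as np

# Monte-Carlo evaluation of the Feynman-Kac representation
#   u(t, z) = E[ psi(Xhat_t^z) exp( int_0^t r(Xhat_s^z) ds ) ]
# with auxiliary drift -f and potential r = -f' (constant unit diffusion).
# The drift f = tanh and the initial density psi are assumed examples.
rng = np.random.default_rng(1)

f = lambda x: np.tanh(x)
fprime = lambda x: 1.0 - np.tanh(x) ** 2
r = lambda x: -fprime(x)
psi = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def feynman_kac(t, z, n_paths=20000, n_steps=100):
    dt = t / n_steps
    X = np.full(n_paths, z, dtype=float)
    I = np.zeros(n_paths)            # running integral of r along each path
    for _ in range(n_steps):
        I += r(X) * dt
        X += -f(X) * dt + rng.normal(0.0, np.sqrt(dt), n_paths)
    return float(np.mean(psi(X) * np.exp(I)))

u = feynman_kac(0.5, 0.0)
```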

Neural network model for the prediction step
To solve the Fokker-Planck equation over a rectangular domain Ω_d = [α_1, β_1] × ··· × [α_d, β_d], we employ the sampling-based deep learning method from [2]. Using the representation (10), the solution of the Fokker-Planck equation is reformulated into an optimisation problem over function space, given in [4, Proposition 2.4]. This in turn yields the loss functions for the learning algorithm. Writing X̂^ξ for the auxiliary diffusion with Unif(Ω_d)-random initial value ξ, the optimisation problem is approximated by the optimisation

inf_θ E[ | NN_θ(ξ) − p_{n−1}(X̂^ξ_T) exp( ∫_0^T r(X̂^ξ_s) ds ) |² ],

where the solution of the PDE is represented by a neural network NN_θ and the infinite-dimensional function space has been parametrised by θ. Here, L denotes the depth of the neural net and the parameters l_i are the respective layer widths. Further details can be found in [4]. A comprehensive textbook on deep learning is [5]. We apply a modified gradient descent method, called ADAM [7], to determine the parameters in the model by minimising the loss function

L(θ) = (1/N_b) Σ_{i=1}^{N_b} | NN_θ(ξ_i) − p_{n−1}(X̂^{ξ_i,i}_{τ_J}) exp( Σ_{j=0}^{J−1} r(X̂^{ξ_i,i}_{τ_j})(τ_{j+1} − τ_j) ) |²,

where N_b is the batch size, {ξ_i}_{i=1}^{N_b} is a training batch of independent identically distributed realisations of ξ ∼ Unif(Ω_d), and {X̂^{ξ_i,i}_{τ_j}}_{j=0}^J are the approximate i.i.d. realisations of sample paths of the auxiliary diffusion started at ξ_i over the time grid τ_0 = 0 < τ_1 < ··· < τ_{J−1} < τ_J = T. For the approximation of the sample paths of the diffusion we use the Euler-Maruyama method [8].
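For completeness, a minimal sketch of the ADAM update rule [7] on a toy quadratic objective is given below; the learning rate and iteration count are arbitrary choices, and the toy loss stands in for the batch loss L(θ).

```python
import numpy as np

# Minimal ADAM update rule [7] on a toy quadratic loss; the toy loss
# and all hyperparameter values stand in for the actual batch loss.
def adam(grad, theta0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=2000):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)   # first-moment (mean) estimate
    v = np.zeros_like(theta)   # second-moment (uncentred variance) estimate
    for k in range(1, n_iter + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** k)   # bias correction
        v_hat = v / (1.0 - beta2 ** k)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# toy objective: L(theta) = ||theta - target||^2, gradient 2 (theta - target)
target = np.array([1.0, -2.0])
theta_star = adam(lambda th: 2.0 * (th - target), np.zeros(2))
```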
Additionally, we augment the loss L by a further term that encourages positivity of the neural network output. Thus, in practice, we use the loss

L_λ(θ) = L(θ) + λ Π(θ),

where the penalty Π(θ) penalises negative values of NN_θ and the hyperparameter λ > 0 is to be chosen. In the notation of subsection 1.3, we thus replace the Fokker-Planck solution by a neural network model, i.e. we postulate a neural network model with support on Ω_d. Therefore we require the a priori chosen domain to capture most of the mass of the probability distribution it is approximating.
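The augmented loss can be sketched as follows, assuming (as one common choice) a quadratic penalty on negative network outputs; the network values and targets are hypothetical numbers.

```python
import numpy as np

# Base L2 loss plus a positivity penalty on negative network outputs.
# The quadratic penalty form is an assumed choice; `net_vals` and
# `targets` are hypothetical network outputs and Feynman-Kac targets.
def penalised_loss(net_vals, targets, lam):
    base = np.mean((net_vals - targets) ** 2)             # L(theta)
    penalty = np.mean(np.minimum(net_vals, 0.0) ** 2)     # punishes NN < 0
    return base + lam * penalty

vals = np.array([0.2, -0.1, 0.4])
tgts = np.array([0.25, 0.05, 0.35])
loss = penalised_loss(vals, tgts, lam=10.0)   # = 0.0275/3 + 10 * 0.01/3
```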

Monte-Carlo normalisation step
We then realise the normalisation step via Monte-Carlo sampling over the bounded rectangular domain Ω_d to approximate the integral

C_n = ∫_{Ω_d} ξ_n(z) p̃_n(z) dz,   (11)

where, as defined earlier, z_n = (Y_{t_n} − Y_{t_{n−1}})/(t_n − t_{n−1}). Note that, since Ω_d is the support of the neural network NN_θ, the right-hand side above is indeed identical to the integral over the whole space.
The sensor function in the Benes model is given by h(x) = h_1 x + h_2. Then the likelihood function becomes

ξ_n(x) = exp( −((t_n − t_{n−1})/2) (z_n − h_1 x − h_2)² ) = √(2π s_n²) N_pdf(μ_n, s_n²)(x),

with μ_n = (z_n − h_2)/h_1 and s_n² = 1/(h_1² (t_n − t_{n−1})), where N_pdf(μ, σ²) denotes the probability density function of a normal distribution with mean μ and variance σ². Therefore, we can write the integral (11) as

C_n = √(2π s_n²) E[ p̃_n(ζ_n) 1_{Ω_d}(ζ_n) ],   ζ_n ∼ N_pdf(μ_n, s_n²),

which can be approximated by sampling from the normal distribution. This is an implementable method to compute the normalisation constant C_n. Thus, we can express the approximate posterior density as

p_n(z) = ξ_n(z) p̃_n(z) / C_n.
Therefore, the methodology is fully recursive and can be applied sequentially.
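The normalisation step for the linear sensor can be sketched as follows: since ξ_n is proportional to a Gaussian density in x, one samples from that Gaussian, rejects samples outside Ω_d, and averages the prior density. The prior p̃_n, the observation increment and all constants are assumed examples.

```python
import numpy as np

# Monte-Carlo normalisation for a linear sensor h(x) = h1*x + h2:
# xi_n(x) is proportional to a Gaussian pdf in x, so we sample from that
# Gaussian, reject samples outside Omega_d, and average the prior.
# The prior density and all constants are assumed examples.
rng = np.random.default_rng(2)

h1, h2 = 1.0, 0.0
dt_obs = 0.1                  # t_n - t_{n-1}
z_n = 0.3                     # rescaled observation increment (made up)
omega = (-5.0, 5.0)           # network domain Omega_d

def p_tilde(x):               # stand-in for the unnormalised prior network
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

mu = (z_n - h2) / h1
var = 1.0 / (h1 ** 2 * dt_obs)
samples = rng.normal(mu, np.sqrt(var), 100000)
inside = (samples >= omega[0]) & (samples <= omega[1])
acceptance_rate = inside.mean()                      # fraction of samples kept
C_n = np.sqrt(2.0 * np.pi * var) * np.mean(p_tilde(samples) * inside)
```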

Remark 2.
In low dimensions, the use of the Monte-Carlo method to perform the normalisation is optional, since efficient quadrature methods are an alternative. We chose the sampling-based method to preserve the grid-free nature of the algorithm.

Numerical results for the Benes filter
The neural network architecture for all our experiments below is a feed-forward fully connected neural network with a one-dimensional input layer, two hidden layers with a layer width of 51 neurons each and batch normalisation, and an output layer of dimension one (a detailed illustration can be found in [4]). For the first experiment, without domain adaptation, Figure 2(b) shows that the neural net training consistently succeeds, as measured by the L² distance between the Monte-Carlo reference prior and the neural net prior. The second experiment uses the adaptive domain to resolve the bimodality in the nonlinear case by increasing the spatial resolution while keeping the computational cost equal. Figure 3(c)+(d) again show snapshots of the progression of the filter. The absolute error in the means with respect to the Benes reference solution is plotted in Figure 4(a) and shows a clear linear trend. This is an interesting phenomenon, likely due to the reduced domain size and subsequent error accumulation. The probability mass (Figure 4(c)) and the Monte-Carlo acceptance rate (Figure 4(d)) fluctuate stably. Figure 4(b) shows, here again, that the neural net training consistently succeeds.

Conclusion and outlook
We have studied the domain adaptation in our method from [4] using the example of the Benes filter. We observed that the domain-adapted method was more effective in resolving the bimodality than the non-adapted one. However, this came at the cost of a linear trend in the error. A possible direction for future work would thus be to investigate the optimal domain size more closely, in order to mitigate the error trend and make full use of the increased resolution from the domain adaptation. This is the subject of future research in connection with more general domain adaptation methods than the one employed here, which is specific to the Benes filter. As already noted in the previous work [4], the possibility of transfer learning in our method should also be explored.
A long-term goal in the development of neural network based numerical methods must of course be the rigorous error analysis, which remains a challenging task.