Deep Learning for the Benes Filter

Lobbe, Alexander

doi:10.1007/978-3-031-18988-3_12


Part of the book series: Mathematics of Planet Earth ((MPE,volume 10))

Included in the following conference series:

Stochastic Transport in Upper Ocean Dynamics Annual Workshop


Abstract

The filtering problem is concerned with the optimal estimation of a hidden state given partial and noisy observations. Filtering is extensively studied in the theoretical and applied mathematical literature. One of the central challenges in filtering today is the numerical approximation of the optimal filter. Here, accurate and fast methods are actively sought after, especially for such high-dimensional settings as numerical weather prediction, for example. In this paper we present a brief study of a new numerical method based on the mesh-free neural network representation of the density of the solution of the filtering problem achieved by deep learning. Based on the classical SPDE splitting method, our algorithm includes a recursive normalisation procedure to recover the normalised conditional distribution of the signal process. The present work uses the Benes model as a benchmark. The Benes filter is a well-known continuous-time stochastic filtering model in one dimension that has the advantage of being explicitly solvable. Within the analytically tractable setting of the Benes filter, we discuss the role of nonlinearity in the filtering model equations for the choice of the domain of the neural network. Further, we present the first study of the neural network method with an adaptive domain for the Benes model.


1 Introduction

Stochastic filtering, i.e. the estimation of a signal process given only partial and noisy observations, is a well-studied problem in both the theoretical and the applied literature. It is relevant in many practical domains, for example in numerical weather prediction. Therefore, there is a high demand for efficient numerical methods to approximate the optimal filter. Many such methods are known in the literature. Among them, the SPDE splitting method can be used to solve the filtering problem in low dimensions, but it becomes inefficient in higher dimensions because the underlying state space must be explicitly discretised. This is problematic as the required number of discretisation points, which together form the mesh, grows exponentially with the dimension of the state space. For this reason, the authors of [4] present a modified splitting method for the filtering problem which does not rely on an explicit space discretisation. The method developed in [4] is therefore called mesh-free and relies on a neural network representation of the solution. This means that, instead of approximating the values of the solution on a discrete mesh, we optimise the parameters of a neural network defined on the state space itself.

In this paper we present a further study of the deep learning method developed in [4] on the example of the Benes filter. The algorithm is derived from the classical splitting method for SPDEs which consists of a deterministic PDE approximation step and a normalisation step to incorporate the randomness of the SPDE. Our algorithm replaces the PDE approximation step of the splitting method by a neural network representation and learning algorithm. Combined with the Monte-Carlo method for the normalisation step, this method becomes completely mesh-free. Furthermore, an important property of the methodology in the filtering context is the ability to iterate it over several time steps. This allows the algorithm to be run online and to successively process observations arriving sequentially. In order to be computationally feasible, the domain of the neural network needs to be restricted. This restricted domain needs to cover the support of the density as well as possible in order to yield a sensible solution. In [4] the neural network domain is fixed a priori and does not move with the solution. This presents two problems. First, it is unnecessarily large to cover the support over all timesteps. Second, the solution may eventually move outside the computational domain, rendering the approximation inadequate. It was therefore noted in [4] that a possible extension of the approximation method would be given by an adaptive domain as the support of the neural network. We present in this work the first results obtained using an adaptive domain in the nonlinear and analytically tractable case of the Benes filter.

The paper is structured as follows. In Sect. 1.1 we briefly introduce the nonlinear, continuous-time stochastic filtering framework. The setting is identical to the one assumed in [4] and the reader may consult [1] for an in-depth treatment of stochastic filtering. Then, in Sect. 1.2 we introduce the filtering equation and the classical SPDE splitting method, upon which the new algorithm in [4] was built. The Benes filtering model used as a benchmark is formulated later, in Sect. 2.2.

Next, in Sect. 2 we present an outline of the derivation of the new methodology. For details, the reader is referred to the original article [4]. The first idea of the algorithm, presented in Sect. 2.1, is to reformulate the solution of the PDE for the density of the unnormalised filter as an expected value. This is done using the Feynman–Kac formula, based on an auxiliary diffusion process derived from the model equations. Moreover, in Sect. 2.3 we briefly specify the neural network parameters used in the method, as well as the employed loss function. The theoretical part of the paper is concluded with Sect. 2.4, where we show how to normalise the neural network obtained from the prediction step using Monte-Carlo approximation for linear sensor functions.

Section 3 contains the detailed parameter values and results of the numerical studies that we performed. Specifically, we perform two experiments. The first one, in Sect. 3.1, is carried out without any domain adaptation and highlights the limitations of an ad hoc parametrisation of the domain. It is a simulation of the Benes filter using the deep learning method over a larger domain, as well as a longer time interval, than in the paper [4]. In particular, the size of the domain was estimated using the exact solution of the Benes model. This is necessary, as the nonlinearity of the Benes model makes it difficult to know the evolution of the posterior a priori. A domain chosen in an ad hoc way would therefore have to be much larger. The second experiment, in Sect. 3.2, reports the performance of the proposed framework with domain adaptation. The adaptation was performed using precomputed estimates of the support of the filter by employing the solution formula for the Benes filter.

Finally, we formulate the conclusions from our experiments in Sect. 4. In short, the domain adapted method was more effective in resolving the bimodality in our study than the non-domain adapted one. However, this came at the cost of a linear trend in the error.

1.1 Nonlinear Stochastic Filtering Problem

The stochastic filtering framework consists of a pair of stochastic processes (X, Y ) on a probability space $(\varOmega , \mathcal {F}, \mathrm {P})$ with a normal filtration $\,(\mathcal {F}_t)_{t\geq 0}$ modelled, P-a.s., as

$$\displaystyle \begin{aligned} X_t = X_0 + \int_0^t f(X_s) \,\mathrm{d} s + \int_0^t \sigma(X_s) \,\mathrm{d} V_s \;, \end{aligned} $$

(1)

and

$$\displaystyle \begin{aligned} Y_t = \int_0^t h(X_s) \,\mathrm{d} s + W_t \;. \end{aligned} $$

(2)

Here, the time parameter is t ∈ [0, ∞), $d,p\in \mathbb {N}$ and $f: \mathbb {R}^d \rightarrow \mathbb {R}^d$ and $\sigma : \mathbb {R}^d \rightarrow \mathbb {R}^{d \times p}$ are the drift and diffusion coefficient functions of the signal. The processes V and W are p- and m-dimensional independent, $(\mathcal {F}_t)_{t\geq 0}$-adapted Brownian motions. We call X the signal process and Y the observation process. The function $h: \mathbb {R}^{d} \rightarrow \mathbb {R}^{m}$ is often called the sensor function, or link function, because it models the possibly nonlinear connection of the signal and observation processes.
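To make the model equations (1) and (2) concrete, the following is a minimal sketch of how a pair of signal and observation paths could be simulated in one dimension with the Euler–Maruyama scheme. The function name, the step size and the example coefficients are illustrative assumptions and are not taken from the experiments reported later.

```python
import numpy as np

def simulate_signal_observation(f, sigma, h, x0, dt, n_steps, rng):
    """Euler-Maruyama paths of the signal X and the observation Y (d = p = m = 1)."""
    x = np.empty(n_steps + 1)
    y = np.empty(n_steps + 1)
    x[0], y[0] = x0, 0.0  # Y starts at zero by (2)
    for k in range(n_steps):
        dv = rng.normal(0.0, np.sqrt(dt))  # increment of V
        dw = rng.normal(0.0, np.sqrt(dt))  # increment of W, independent of V
        x[k + 1] = x[k] + f(x[k]) * dt + sigma(x[k]) * dv
        y[k + 1] = y[k] + h(x[k]) * dt + dw
    return x, y

# Illustrative (assumed) coefficients: tanh drift, constant diffusion, linear sensor.
rng = np.random.default_rng(0)
xs, ys = simulate_signal_observation(
    f=lambda x: 1.5 * np.tanh(6.0 * x),
    sigma=lambda x: 0.5,
    h=lambda x: 3.0 * x,
    x0=0.0, dt=1e-3, n_steps=4000, rng=rng,
)
```

Only the observation path `ys` would be available to a filter; the signal path `xs` serves as ground truth in a simulation study.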

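Further, consider the observation filtration $(\mathcal {Y}_t)_{t\geq 0}$ given as

$$\displaystyle \begin{aligned} \mathcal{Y}_t = \sigma\left(Y_s, \, s\in[0,t]\right) \vee \mathcal{N}, \end{aligned}$$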

where $\mathcal {N}$ are the P-nullsets of $\mathcal {F}$. The aim of nonlinear filtering is to compute the probability measure valued $(\mathcal {Y}_t)_{t\geq 0}$-adapted stochastic process π that is defined by the requirement that for all bounded measurable test functions $\varphi : \mathbb {R}^d \to \mathbb {R}$ and t ∈ [0, ∞) we have P-a.s. that

$$\displaystyle \begin{aligned} \pi_t\varphi = \mathbb{E}\left[ \varphi(X_t) \left| \mathcal{Y}_t \right. \right]. \end{aligned}$$

We call π the filter.

Furthermore, let the process Z be defined such that for all t ∈ [0, ∞),

$$\displaystyle \begin{aligned} Z_t = \exp\{-\int_0^t h(X_s)\,\mathrm{d} W_s - \frac{1}{2} \int_0^t h(X_s)^2 \,\mathrm{d} s\}. \end{aligned}$$

Then, assuming that

$$\displaystyle \begin{aligned} \mathbb{E}\left[ \int_0^t h(X_s)^2 \,\mathrm{d} s \right] < \infty \; \text{ and } \; \mathbb{E}\left[ \int_0^t Z_s h(X_s)^2 \,\mathrm{d} s \right] < \infty, \end{aligned}$$

we have that Z is an $(\mathcal {F}_t)_{t\geq 0}$-martingale and by the change of measure (for details, see [1]) given by $\left . \frac {\,\mathrm {d} \tilde {\mathrm {P}}^t} {\,\mathrm {d} \mathrm {P}}\right |{ }_{\mathcal {F}_t} = Z_t$, t ≥ 0, the processes X and Y are independent under $\tilde {\mathrm {P}}$ and Y is a $\tilde {\mathrm {P}}$-Brownian motion. Here, $\tilde {\mathrm {P}}$ is the consistent measure defined on $\bigcup _{t\in [0,\infty )}\mathcal {F}_t$. Finally, under $\tilde {\mathrm {P}}$, we can define the measure valued stochastic process ρ by the requirement that for all bounded measurable functions $\varphi : \mathbb {R}^d\to \mathbb {R}$ and t ∈ [0, ∞) we have P-a.s. that

$$\displaystyle \begin{aligned} \rho_t\varphi = \mathbb{E}\left[ \left.\varphi(X_t) \exp\{\int_0^t h(X_s)\,\mathrm{d} Y_s - \frac{1}{2} \int_0^t h(X_s)^2 \,\mathrm{d} s\} \right| \mathcal{Y}_t \right]. \end{aligned} $$

(3)

The Kallianpur–Striebel formula (see [1]) justifies the terminology to call ρ the unnormalised filter.
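For reference, the Kallianpur–Striebel formula mentioned above relates the filter and the unnormalised filter as follows (stated here in the notation of this section; see [1] for the precise conditions):

$$\displaystyle \begin{aligned} \pi_t\varphi = \frac{\rho_t\varphi}{\rho_t\mathbf{1}}, \end{aligned}$$

where $\mathbf{1}$ denotes the constant function equal to one. Thus the filter is recovered from ρ by normalisation, which is the role played by the normalisation step of the splitting method.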

1.2 Filtering Equation and General Splitting Method

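Note that under the conditions given in [4], X admits the infinitesimal generator $A: \mathcal {D}(A) \rightarrow B(\mathbb {R}^{d})$ given, for all $\varphi \in \mathcal {D}(A)$, by

$$\displaystyle \begin{aligned} A\varphi = \langle f, \nabla\varphi \rangle + \operatorname{\mathrm{Tr}}(a \operatorname{\mathrm{Hess}} \varphi), \end{aligned} $$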

(4)

where $\mathcal {D}(A)$ denotes the domain of the differential operator A and $a = \frac {1}{2}\sigma \sigma ^\prime $. The symbol $B(\mathbb {R}^{d})$ denotes the set of real-valued, bounded, Borel-measurable functions defined on $\mathbb {R}^d$.

It is well-known (see, e.g., [1]), that the unnormalised filter ρ satisfies the filtering equation, i.e. for all t ≥ 0, we have $\tilde {\mathrm {P}}$-a.s. that

$$\displaystyle \begin{aligned} \rho_t(\varphi) = \pi_0(\varphi) + \int_0^t \rho_s(A\varphi)\,\mathrm{d} s + \int_0^t \rho_s(\varphi h^\prime) \,\mathrm{d} Y_s. \end{aligned} $$

(5)

The classical splitting method for the filtering equation is given in [3] and seeks to approximate the following SPDE for the density $p_t$ of the unnormalised filter given, for all t ≥ 0, $x\in \mathbb {R}^d$, and P-a.s. as

$$\displaystyle \begin{aligned} p_t(x) = p_0 (x) + \int_0^t A^* p_s(x)\,\mathrm{d} s + \int_0^t h^\prime(x) p_s(x) \,\mathrm{d} Y_s \end{aligned}$$

and relies on the splitting-up algorithm described in [9] and [10]. Here $A^*$ is the formal adjoint of the infinitesimal generator A of the signal process X.

We summarise the splitting-up method below in Note 1.

Note 1

The splitting method for the filtering problem is defined by iterating the steps below with initial density $p^0(\cdot) = p_0(\cdot)$:

1.
(Prediction) Compute an approximation $\tilde {p}^n$ of the solution to
$$\displaystyle \begin{aligned} \frac{\partial q^n}{\partial t} (t,z) &= A^* q^n (t,z), &\; & (t,z)\in (t_{n-1},t_n]\times\mathbb{R}^d,\\ q^n(t_{n-1},z) &= {p}^{n-1}(z), &\; & z\in \mathbb{R}^d, \end{aligned} $$
(6)

at time $t_n$, and
2.
(Normalisation) Compute the normalisation constant with $z_n = (Y_{t_n}-Y_{t_{n-1}}) / (t_n-t_{n-1})$ and the function
$$\displaystyle \begin{aligned} \mathbb{R}^d\ni z\mapsto\xi_n(z) = \exp \left( -\frac{t_n-t_{n-1}}{2} ||z_n - h(z)||{}^2 \right), \end{aligned}$$

so that we can set,
$$\displaystyle \begin{aligned} p^n(z) = \frac{1}{C_n} \xi_n(z) \tilde{p}^n(z); \; z\in\mathbb{R}^d, \end{aligned}$$

where $C_n = \int _{\mathbb {R}^d} \xi _n(z)\tilde {p}^n(z) \,\mathrm {d} z $.

The deep learning method studied below replaces the predictor step of the splitting method above by a deep neural network approximation algorithm to avoid an explicit space discretisation. This is achieved by representing each $\tilde {p}^n(z)$ by a feed-forward neural network and approximating the initial value problem (6) based on its stochastic representation using a sampling procedure. The normalisation step may then be computed either using quadrature, or, to preserve the mesh-free characteristic, by Monte-Carlo approximation.
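The normalisation step of Note 1 can be illustrated with the quadrature variant mentioned above. The sketch below assumes a one-dimensional state, a predicted density already evaluated on a grid, and a simple trapezoidal rule; the function name and the example values are hypothetical and only serve to show the mechanics.

```python
import numpy as np

def correction_step(z_grid, prior_vals, y_incr, dt, h):
    """Multiply the predicted density by xi_n and renormalise by the trapezoidal rule."""
    z_n = y_incr / dt                                # z_n = (Y_{t_n} - Y_{t_{n-1}}) / (t_n - t_{n-1})
    xi = np.exp(-0.5 * dt * (z_n - h(z_grid)) ** 2)  # likelihood factor xi_n(z)
    unnormalised = xi * prior_vals
    c_n = np.sum(0.5 * (unnormalised[1:] + unnormalised[:-1]) * np.diff(z_grid))
    return unnormalised / c_n, c_n

# Example with an assumed standard normal predicted density and a linear sensor.
z_grid = np.linspace(-6.0, 6.0, 2001)
prior_vals = np.exp(-0.5 * z_grid**2) / np.sqrt(2.0 * np.pi)
posterior_vals, c_n = correction_step(z_grid, prior_vals, y_incr=0.05, dt=0.1, h=lambda z: 3.0 * z)
posterior_mass = np.sum(0.5 * (posterior_vals[1:] + posterior_vals[:-1]) * np.diff(z_grid))
posterior_mean = np.sum(0.5 * ((z_grid * posterior_vals)[1:] + (z_grid * posterior_vals)[:-1]) * np.diff(z_grid))
```

This grid-based version is only practical in low dimensions; the Monte-Carlo variant used in this paper avoids the grid altogether.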

2 Derivation and Outline of the Deep Learning Algorithm

Here, we present a concise version of the derivation laid out in detail in [4].

2.1 Feynman–Kac Representation

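Assuming sufficient differentiability of the coefficient functions, the operator $A^*$ may be expanded such that for all compactly supported smooth test functions $\varphi \in C_c^\infty (\mathbb {R}^d, \mathbb {R})$ we have

$$\displaystyle \begin{aligned} A^*\varphi = \operatorname{\mathrm{Tr}}(a \operatorname{\mathrm{Hess}} \varphi) + \langle 2\overrightarrow{\mathrm{div}}(a) - f, \nabla\varphi \rangle + \operatorname{\mathrm{div}}(\overrightarrow{\mathrm{div}}(a)-f)\,\varphi. \end{aligned} $$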

(7)

Subtracting the zero-order term from (7), we obtain an operator that generates the auxiliary diffusion process, denoted $\hat {X}$, which is instrumental in the deep learning method.

Definition 1

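Define the partial differential operator $\hat {A}:C_c^\infty (\mathbb {R}^d,\mathbb {R})\rightarrow C_b(\mathbb {R}^d,\mathbb {R})$, with image in the set of bounded continuous functions on $\mathbb {R}^d$, such that for all $\varphi \in C_c^\infty (\mathbb {R}^d,\mathbb {R})$,

$$\displaystyle \begin{aligned} \hat{A}\varphi = \operatorname{\mathrm{Tr}}(a \operatorname{\mathrm{Hess}} \varphi) + \langle 2\overrightarrow{\mathrm{div}}(a) - f, \nabla\varphi \rangle, \end{aligned}$$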

and the function $r:\mathbb {R}^d\rightarrow \mathbb {R}$ such that for all $x\in \mathbb {R}^d$,

$$\displaystyle \begin{aligned} r(x) = \operatorname{\mathrm{div}}(\overrightarrow{\mathrm{div}}(a)-f)(x). \end{aligned}$$

Lemma 1

For all $x\in \mathbb {R}^d$ the operator $\hat {A}$ defined in Definition 1 is the infinitesimal generator of the Itô diffusion $\hat {X}:[0,\infty )\times \varOmega \rightarrow \mathbb {R}^d$ given, for all t ≥ 0 and P-a.s. by

$$\displaystyle \begin{aligned} \hat{X}_t =x + \int_0^t b(\hat{X}_s) \mathrm{d} s + \int_0^t \sigma(\hat{X}_s) \mathrm{d} \hat{W}_s, \end{aligned}$$

where $\hat {W}:[0,\infty )\times \varOmega \rightarrow \mathbb {R}^d$ is a d-dimensional Brownian motion and $b:\mathbb {R}^d \rightarrow \mathbb {R}^d$ is the function

$$\displaystyle \begin{aligned} b = 2\overrightarrow{\mathrm{div}}(a) - f. \end{aligned}$$

From the well-known Feynman–Kac formula (see Karatzas and Shreve [6, Chapter 5, Theorem 7.6]) we can deduce the Corollary 1 below for the initial value problem.

Corollary 1

Let $d\in \mathbb {N}$ , T > 0, let $k\colon \mathbb {R}^d \to [0,\infty )$ be a continuous function, let $\hat {A}$ be the operator defined in Definition 1 , and let $\psi :\mathbb {R}^d \rightarrow \mathbb {R}$ be an at most polynomially growing function. Suppose that $u \in C_b^{1,2}((0,T]\times \mathbb {R}^d,\mathbb {R})$ is continuously differentiable with bounded derivative in time and twice continuously differentiable with bounded derivatives in space, and satisfies the Cauchy problem

$$\displaystyle \begin{aligned} \frac{\partial u}{\partial t}(t,x) + k(x)u(t,x) &= \hat{A}u(t,x), &\; &(t,x)\in (0,T] \times \mathbb{R}^d,\\ u(0,x) &= \psi(x), &\; &x\in \mathbb{R}^d. \end{aligned} $$

(8)

Then, for all $(t,x) \in (0,T]\times \mathbb {R}^d$ , we have that

$$\displaystyle \begin{aligned} u(t,x) = \mathbb{E}\left[ \left. \psi(\hat{X}_t)\exp\left(-\int_0^t k(\hat{X}_\tau) \,\mathrm{d}\tau\right) \right| \hat{X}_0 = x \right], \end{aligned}$$

where $\hat {X}$ is the diffusion generated by $\hat {A}$.

Recall that our aim is to approximate the solution of the Fokker–Planck equation (6). Assume from now on the discrete times $\{t_0 = 0, t_1, t_2, \dots\}$, indexed by n. Written in the form of Corollary 1, for any timestep n = 1, 2, …, (6) reads as

$$\displaystyle \begin{aligned} \frac{\partial q^n}{\partial t} (t,z) &= \hat{A}q^n(t,z) + r(z)q^n(t,z), &\; & (t,z)\in (t_{n-1},t_n]\times\mathbb{R}^d,\\ q^n(t_{n-1},z) &= {p}^{n-1}(z), &\; & z\in \mathbb{R}^d. \end{aligned}$$

Thus, with k = −r, and assuming that −r is non-negative in (8), we obtain by Corollary 1 the representation, for all n ∈ {1, …, N}, $t \in (t_{n-1}, t_n]$, $z\in \mathbb {R}^d$,

$$\displaystyle \begin{aligned} q^n(t,z) = \mathbb{E}\left[ \left. p^{n-1}(\hat{X}_t) \exp\left(\int_{t_{n-1}}^t r(\hat{X}_\tau) \,\mathrm{d}\tau\right) \right| \hat{X}_{t_{n-1}} = z \right]. \end{aligned} $$

(9)

Note that [4, Proposition 2.4] shows that we have a feasible minimisation problem to approximate by the learning algorithm (see also [2, Proposition 2.7]).

2.2 The Benes Filtering Model

The Benes filter is a one-dimensional nonlinear model and is used as a benchmark in the numerical studies below. As we show below, it is one of the rare cases of explicitly solvable continuous-time stochastic filtering models. Here, we are considering a special case of the more general class of Benes filters, presented, for example, in [1, Chapter 6.1].

The signal is given by the coefficient functions

$$\displaystyle \begin{aligned} f(x) = \alpha\sigma \tanh (\beta+\alpha x/ \sigma) \;\text{ and } \; \sigma(x) \equiv \sigma \in \mathbb{R},\end{aligned} $$

where $\alpha , \beta \in \mathbb {R}$ and the observation is given by the affine-linear sensor function

$$\displaystyle \begin{aligned} h(x) = h_1 x + h_2,\end{aligned} $$

with $h_1, h_2 \in \mathbb {R}$. The density $p_B$ of the filter solving the Benes model is then given by two weighted Gaussians (see [1, Chapter 6.1]) as

$$\displaystyle \begin{aligned} p_B(z) = w^{+}\varPhi(\mu_t^{+}, \nu_t)(z) + w^{-}\varPhi(\mu_t^{-}, \nu_t)(z),\end{aligned} $$

(10)

where $\mu _t^{\pm } = M_t^{\pm }/(2v_t)$, $\nu_t = 1/(2v_t)$, and

$$\displaystyle \begin{aligned} w^{\pm} = \frac{\exp ( (M_t^{\pm})^2/(4v_t))}{\exp ( (M_t^{+})^2/(4v_t)) + \exp ( (M_t^{-})^2/(4v_t))}\end{aligned} $$

with

$$\displaystyle \begin{aligned} M_t^{\pm} = \pm\frac{\alpha}{\sigma} + h_1\int_0^t \frac{\sinh(s\zeta\sigma )}{\sinh(t\zeta\sigma)}\,\mathrm{d} Y_s +\frac{h_2+h_1x_0}{\sigma\sinh(t\zeta\sigma)} - \frac{h_2}{\sigma}\coth(t\zeta\sigma),\end{aligned} $$

$v_t = {h_1}\coth (t\zeta \sigma )/{2\sigma }$, and $\zeta = \sqrt {\alpha ^2/\sigma ^2 + h_1^2}$.
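As a small illustration of the structure of (10), the sketch below evaluates a two-component mixture once the quantities $M_t^{\pm}$ and $v_t$ are available. It assumes that Φ(μ, ν) denotes the Gaussian density with mean μ and variance ν, and that the weights are normalised by the sum of the two exponentials so that they add up to one. The inputs in the example are arbitrary numbers, not values from the experiments.

```python
import numpy as np

def benes_mixture_density(z, m_plus, m_minus, v):
    """Two-component Gaussian mixture with means M^{+/-}/(2v) and common variance 1/(2v)."""
    m = np.array([m_plus, m_minus], dtype=float)
    nu = 1.0 / (2.0 * v)
    mu = m / (2.0 * v)
    log_w = m**2 / (4.0 * v)
    w = np.exp(log_w - np.max(log_w))  # subtract the maximum for numerical stability
    w /= w.sum()                       # weights sum to one
    z = np.asarray(z, dtype=float)[..., None]
    components = np.exp(-0.5 * (z - mu) ** 2 / nu) / np.sqrt(2.0 * np.pi * nu)
    return components @ w

# Arbitrary illustrative inputs.
z = np.linspace(-5.0, 5.0, 4001)
density = benes_mixture_density(z, m_plus=7.0, m_minus=-5.0, v=3.0)
total_mass = np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(z))
```

Working with the logarithms of the weights avoids overflow when $(M_t^{\pm})^2/(4v_t)$ is large.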

Further, for the Benes model, the auxiliary diffusion is given as

$$\displaystyle \begin{aligned} \hat{X}_t = \hat{X}_0 - \int_0^t \alpha\sigma\tanh(\beta + \alpha \hat{X}_s / \sigma) \, \mathrm{d} s + \int_0^t \sigma \,\mathrm{d} \hat{W}_s, \end{aligned}$$

and the coefficient

$$\displaystyle \begin{aligned} r(x) = - \operatorname{\mathrm{div}} f(x) = -\alpha^2 \mathrm{sech}^2(\beta + \alpha x / \sigma). \end{aligned}$$
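Both coefficients can be checked directly from Definition 1 and Lemma 1. The short derivation below uses only the Benes coefficients given above; since σ is constant, $a = \sigma^2/2$ is constant and its divergence vanishes.

$$\displaystyle \begin{aligned} b(x) &= 2\,\overrightarrow{\mathrm{div}}(a)(x) - f(x) = -\alpha\sigma\tanh(\beta + \alpha x/\sigma),\\ r(x) &= \operatorname{\mathrm{div}}(\overrightarrow{\mathrm{div}}(a)-f)(x) = -f'(x) = -\alpha\sigma\cdot\frac{\alpha}{\sigma}\,\mathrm{sech}^2(\beta + \alpha x/\sigma) = -\alpha^2\mathrm{sech}^2(\beta + \alpha x/\sigma). \end{aligned}$$

In particular, −r ≥ 0, so the non-negativity assumption on k = −r in Corollary 1 holds for the Benes model.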

Therefore the representation of the solution to the Fokker–Planck equation (6) in the Benes case reads

$$\displaystyle \begin{aligned} q^n(t,z) = \mathbb{E}\left[ \left. p^{n-1}(\hat{X}_t) \exp\left(- \int_{t_{n-1}}^t \alpha^2\mathrm{sech}^2(\beta + \alpha \hat{X}_\tau /\sigma) \,\mathrm{d} \tau\right) \right| \hat{X}_{t_{n-1}} = z \right]. \end{aligned}$$
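The representation above can be evaluated pointwise by plain Monte-Carlo simulation. The following sketch assumes an Euler–Maruyama discretisation of the auxiliary diffusion and a left-point rule for the time integral; the function name, the number of sub-steps and the previous density `p_prev` are illustrative assumptions. In the deep learning method this pointwise estimate is not computed for each z separately; the same samples are instead used as regression targets for the network.

```python
import numpy as np

def benes_prediction_mc(z, p_prev, alpha, beta, sigma, dt, n_sub, n_paths, rng):
    """Monte-Carlo estimate of q^n(t_n, z) for the Benes model."""
    tau = dt / n_sub
    x = np.full(n_paths, float(z))   # all paths of the auxiliary diffusion start at z
    integral = np.zeros(n_paths)     # accumulates the time integral of alpha^2 sech^2(...)
    for _ in range(n_sub):
        arg = beta + alpha * x / sigma
        integral += alpha**2 / np.cosh(arg) ** 2 * tau
        drift = -alpha * sigma * np.tanh(arg)
        x = x + drift * tau + sigma * np.sqrt(tau) * rng.normal(size=n_paths)
    return np.mean(p_prev(x) * np.exp(-integral))

# Example with an assumed Gaussian previous density (standard deviation 0.5).
rng = np.random.default_rng(0)
p_prev = lambda x: np.exp(-0.5 * (x / 0.5) ** 2) / (0.5 * np.sqrt(2.0 * np.pi))
q_est = benes_prediction_mc(0.2, p_prev, alpha=3.0, beta=0.0, sigma=0.5,
                            dt=0.1, n_sub=10, n_paths=20_000, rng=rng)
```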

2.3 Neural Network Model for the Prediction Step

To solve the Fokker–Planck equation over a rectangular domain $\varOmega_d = [\alpha_1, \beta_1] \times \cdots \times [\alpha_d, \beta_d]$, we employ the sampling-based deep learning method from [2]. Using the representation (9), the solution of the Fokker–Planck equation is reformulated into an optimisation problem over function space given in [4, Proposition 2.4]. This in turn yields the loss function for the learning algorithm. Writing $\hat {\mathrm {X}}^\xi $ for the auxiliary diffusion with $\mathrm{Unif}(\varOmega_d)$-random initial value ξ, the optimisation problem is approximated by the optimisation

$$\displaystyle \begin{aligned} \inf_{\theta \in \mathbb{R}^{\sum_{i=2}^L l_{i-1}l_i+l_i}} \mathbb{E}\left[ \left| \psi(\hat{\mathrm{X}}_T^\xi) \exp\left(-\int_0^T k(\hat{\mathrm{X}}_\tau^\xi) \,\mathrm{d}\tau\right) - \mathcal{N}\mathcal{N}_\theta(\xi)\right|{}^2 \right] \end{aligned}$$

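where the solution of the PDE is represented by a neural network $\mathcal {N}\mathcal {N}_\theta $ and the infinite-dimensional function space has been parametrised by θ. Here, L denotes the depth of the neural net, and the parameters $l_i$ are the respective layer widths. Further details can be found in [4]. A comprehensive textbook on deep learning is [5]. We apply a modified gradient descent method, called ADAM [7], to determine the parameters in the model by minimising the loss function

$$\displaystyle \begin{aligned} \mathcal{L}(\theta ; \{\xi^i, \{\hat{\mathrm{X}}_{\tau_j}^{\xi,i}\}_{j=0}^J \}_{i=1}^{N_b}) = \frac{1}{N_b} \sum_{i=1}^{N_b} \left| \psi(\hat{\mathrm{X}}_{\tau_J}^{\xi,i}) \exp\left(-\sum_{j=0}^{J-1} k(\hat{\mathrm{X}}_{\tau_j}^{\xi,i}) (\tau_{j+1}-\tau_j)\right) - \mathcal{N}\mathcal{N}_\theta(\xi^i)\right|{}^2, \end{aligned}$$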

where $N_b$ is the batch size and $ \{\xi ^i, \{\hat {\mathrm {X}}_{\tau _j}^{\xi ,i}\}_{j=0}^J \}_{i=1}^{N_b}$ is a training batch of independent identically distributed realisations $\xi^i$ of $\xi \sim \mathcal {U}(\varOmega _d)$ and $\{\hat {\mathrm {X}}_{\tau _j}^{\xi ,i}\}_{j=0}^J$ the approximate i.i.d. realisations of sample paths of the auxiliary diffusion started at $\xi^i$ over the time-grid $\tau_0 = 0 < \tau_1 < \cdots < \tau_{J-1} < \tau_J = T$. For the approximation of the sample paths of the diffusion we use the Euler–Maruyama method [8]. Additionally, we augment the loss $\mathcal {L}$ by an additional term to encourage the positivity of the neural network. Thus, in practice, we use the loss

$$\displaystyle \begin{aligned} \tilde{\mathcal{L}}(\theta ; \{\xi^i, \{\hat{\mathrm{X}}_{\tau_j}^i\}_{j=0}^J \}_{i=1}^{N_b}) = {\mathcal{L}}(\theta ; \{\xi^i, \{\hat{\mathrm{X}}_{\tau_j}^i\}_{j=0}^J \}_{i=1}^{N_b}) + \lambda \sum_{i=1}^{N_b} \max\{0,-\mathcal{N}\mathcal{N}_\theta(\xi^i)\} \end{aligned}$$

with the hyperparameter λ to be chosen.
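The structure of the augmented loss can be sketched in a few lines. The snippet assumes that the base loss is the batch mean of the squared residuals and that the regression targets (the sampled values of ψ times the exponential factor) have already been computed; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def augmented_loss(targets, predictions, lam):
    """Mean squared residual plus lam times the summed negative part of the network output."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    mse = np.mean((targets - predictions) ** 2)
    penalty = lam * np.sum(np.maximum(0.0, -predictions))  # zero when all outputs are non-negative
    return mse + penalty

# Toy batch: one negative prediction triggers the positivity penalty.
loss_value = augmented_loss(targets=[0.0, 0.0], predictions=[-1.0, 0.0], lam=2.0)
```

The penalty vanishes whenever the network output is non-negative on the batch, so it only acts where the approximation would otherwise fail to be a density.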

Thus, in the notation of Sect. 1.2 we replace the Fokker–Planck solution by a neural network model, i.e. we postulate a neural network model

$$\displaystyle \begin{aligned} \tilde{p}^n(z) = \mathcal{N}\mathcal{N}(z), \end{aligned}$$

with support on $\varOmega_d$. Therefore we require the a priori chosen domain to capture most of the mass of the probability distribution it is approximating.
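A minimal sketch of such a network model is given below. It uses the layer widths reported in Sect. 3, but omits batch normalisation and assumes a tanh activation and a random initialisation, none of which are specified here; the function names are hypothetical. The output is set to zero outside the domain to reflect the restricted support.

```python
import numpy as np

def init_params(widths, rng):
    """Random weights and zero biases for a fully connected network with the given layer widths."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def count_params(widths):
    """Number of weights and biases: the sum of l_{i-1} * l_i + l_i over the layers."""
    return sum(m * n + n for m, n in zip(widths[:-1], widths[1:]))

def forward(params, z, domain):
    """Evaluate the network at the points z and set the output to zero outside the domain."""
    z = np.asarray(z, dtype=float).ravel()
    a = z[:, None]
    for w, b in params[:-1]:
        a = np.tanh(a @ w + b)  # assumed activation function
    w, b = params[-1]
    out = (a @ w + b)[:, 0]
    lo, hi = domain
    return np.where((z >= lo) & (z <= hi), out, 0.0)

rng = np.random.default_rng(0)
widths = [1, 51, 51, 1]
params = init_params(widths, rng)
values = forward(params, np.array([-10.0, 0.0, 3.0]), domain=(-9.0, 2.5))
n_params = count_params(widths)
```

For the widths above the count is 2806, which is the dimension of the parameter θ in the optimisation problem when batch-normalisation parameters are not counted.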

2.4 Monte-Carlo Normalisation Step

We then realise the normalisation step via Monte-Carlo sampling over the bounded rectangular domain $\varOmega_d$ to approximate the integral

$$\displaystyle \begin{aligned} \int_{\mathbb{R}^d} \xi_n(z)\mathcal{N}\mathcal{N}(z) \,\mathrm{d} z = \int_{\varOmega_d} \exp \left( -\frac{t_n-t_{n-1}}{2} ||z_n - h(z)||{}^2 \right) \mathcal{N}\mathcal{N}(z) \,\mathrm{d} z, \end{aligned} $$

(11)

where, as defined earlier, $z_n = \frac {1}{t_n-t_{n-1}}(Y_{t_n}-Y_{t_{n-1}})$. Note that, since $\varOmega_d$ is the support of the neural network $\mathcal {N}\mathcal {N}$, the right-hand side above is indeed identical to the integral over the whole space.

The sensor function in the Benes model is given by $h(x) = h_1 x + h_2$. Then, the likelihood function becomes

$$\displaystyle \begin{aligned} \xi_n(z) = \frac{\sqrt{2\pi}}{\sqrt{(t_n-t_{n-1})h_1^2}} \mathcal{N}_{\text{pdf}}\left(\frac{z_n - h_2}{h_1}, \frac{1}{(t_n-t_{n-1})h_1^2}\right)(z), \end{aligned}$$

where $\mathcal {N}_{\text{pdf}}(\mu ,\sigma ^2)$ denotes the probability density function of a normal distribution with mean μ and variance σ ². Therefore, we can write the integral (11) as

$$\displaystyle \begin{aligned} \frac{\sqrt{2\pi}}{\sqrt{(t_n-t_{n-1})h_1^2}} \mathbb{E}_{Z}[\mathcal{N}\mathcal{N}(Z)]; \qquad \qquad Z \sim \mathcal{N}\left(\frac{z_n - h_2}{h_1}, \frac{1}{(t_n-t_{n-1})h_1^2}\right). \end{aligned}$$

This is an implementable method to compute the normalisation constant $C_n$. Thus, we can express the approximate posterior density as

$$\displaystyle \begin{aligned} p^n(z) = \frac{1}{C_n} \xi_n(z) \tilde{p}^n(z). \end{aligned}$$

Therefore, the methodology is fully recursive and can be applied sequentially.
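The Monte-Carlo evaluation of the normalisation constant described above can be sketched as follows. The function `nn` stands in for the trained network and is replaced in the example by a standard normal density, for which the constant is known in closed form; all names and numbers are illustrative assumptions.

```python
import numpy as np

def normalisation_constant_mc(nn, z_n, dt, h1, h2, n_samples, rng):
    """Estimate C_n by sampling Z ~ N((z_n - h2)/h1, 1/(dt * h1^2)) and averaging nn(Z)."""
    mean = (z_n - h2) / h1
    var = 1.0 / (dt * h1**2)
    samples = rng.normal(mean, np.sqrt(var), size=n_samples)
    # sqrt(2 pi var) equals sqrt(2 pi) / sqrt(dt * h1^2), the prefactor of the expectation
    return np.sqrt(2.0 * np.pi * var) * np.mean(nn(samples))

rng = np.random.default_rng(0)
gauss = lambda z: np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)  # stand-in for the network
c_n = normalisation_constant_mc(gauss, z_n=0.5, dt=0.1, h1=3.0, h2=0.0,
                                n_samples=200_000, rng=rng)
```

No grid is involved: the cost depends on the number of samples and on the cost of evaluating the network, not on a spatial discretisation.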

Remark 1

In low dimensions, the usage of the Monte-Carlo method to perform the normalisation is optional, since efficient quadrature methods are an alternative. We chose the sampling-based method to preserve the grid-free nature of the algorithm.

3 Numerical Results for the Benes Filter

The neural network architecture for all our experiments below is a feed-forward fully connected neural network with a one-dimensional input layer, two hidden layers with a layer width of 51 neurons each and batch-normalisation, and an output layer of dimension one (a detailed illustration can be found in [4]). For the optimisation algorithm we chose the ADAM optimiser and performed the training over 6002 epochs with a batch size of 600 samples. The initial signal and observation values are $x_0 = y_0 = 0$ and the coefficients of the Benes model were chosen as α = 3, β = 0, σ = 0.5, $h_1 = 3$, $h_2 = 0$, with timestep Δt = 0.1 over N = 40 steps. The initial condition is a Gaussian density with mean 0 and standard deviation 0.001. The posterior was calculated over the domain [−9, 2.5]. The domain boundaries were pre-estimated using a simulation of the exact Benes filter with fixed random seed. In the case of the domain adaptation we used the precomputed evolutions from the true solution to estimate the support of the posterior and set a fixed domain adaptation schedule. The spatial resolution is 1000 uniformly spaced values in the domain of definition of the neural network. At each time step, the training of the network consumes 6002 ⋅ 600 = 3,601,200 Monte-Carlo samples. Additionally, we employ a piecewise constant learning rate schedule $\mathrm{lr}(\mathrm{epoch}) = 10^{-(2+\lfloor \mathrm{epoch}/2001 \rfloor)}$, and the normalisation constant is computed using $10^7$ samples at each timestep. The regularising parameter is λ = 1.
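The learning rate schedule and the sample count can be written out explicitly. The sketch below reads the schedule as constant over consecutive blocks of 2001 epochs, which is an interpretation consistent with the description "piecewise constant"; the function and variable names are hypothetical.

```python
def learning_rate(epoch, block=2001, base_exponent=2):
    """Piecewise constant schedule: 1e-2, then 1e-3, then 1e-4 over blocks of 2001 epochs."""
    return 10.0 ** (-(base_exponent + epoch // block))

n_epochs, batch_size = 6002, 600
samples_per_step = n_epochs * batch_size  # Monte-Carlo samples consumed per time step
rates = sorted({learning_rate(e) for e in range(n_epochs)}, reverse=True)
```

With 6002 epochs this gives three levels of the learning rate, one per block of 2001 epochs.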

3.1 No Domain Adaptation

Figure 1 shows the plots for the Benes filter without domain adaptation. In Fig. 1a we observe the drift of the posterior toward the left edge of the domain. The initial bimodality, reflecting the uncertainty due to few observed values, quickly resolves and the approximate posterior tracks the signal within the domain. In Fig. 1b the bimodality is mostly visible in the Monte-Carlo prior and smoothed out by the neural network. Figure 1c and d show snapshots of the progression of the filter. The absolute error in means with respect to the Benes reference solution is plotted in Fig. 2a and shows that the error increases as the posterior reaches the left domain boundary. This is reflected as well in the drop of probability mass, Fig. 2c, and Monte-Carlo acceptance rate, Fig. 2d, at later times. It is not clear from Fig. 2a whether there is a trend in the error. Further experiments need to be performed to check this hypothesis. Figure 2b shows that the neural net training consistently succeeds as measured by the $L_2$ distance between the Monte-Carlo reference prior and the neural net prior.

3.2 With Domain Adaptation

Figure 3 shows the plots for the Benes filter with domain adaptation. In Fig. 3a we observe again the drift of the posterior toward the left edge of the domain, and the initial bimodality resolves. The approximate posterior tracks the signal within the domain. In Fig. 3b the bimodality is visible both in the prior and the posterior network. This shows that the domain adaptation helps resolve the bimodality in the nonlinear case by increasing the spatial resolution while keeping the computational cost equal. Figure 3c and d again show snapshots of the progression of the filter. The absolute error in means with respect to the Benes reference solution is plotted in Fig. 4a and shows a clear linear trend. This is an interesting phenomenon, likely due to the reduced domain size and subsequent error accumulation. The probability mass, Fig. 4c, and Monte-Carlo acceptance rate, Fig. 4d, fluctuate stably. Figure 4b shows here again that the neural net training consistently succeeds.
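The precise rule used to derive the domain adaptation schedule from the exact solution is not spelled out here. Purely as an illustration, one hypothetical rule would be to cover both mixture components of (10) up to a fixed number of standard deviations, as in the sketch below; the function name and the factor `n_std` are assumptions.

```python
import numpy as np

def adaptive_domain(mu_plus, mu_minus, nu, n_std=4.0):
    """Interval covering both Gaussian components up to n_std standard deviations."""
    half_width = n_std * np.sqrt(nu)
    lower = min(mu_plus, mu_minus) - half_width
    upper = max(mu_plus, mu_minus) + half_width
    return lower, upper

# Toy values for the component means and the common variance.
domain = adaptive_domain(mu_plus=1.0, mu_minus=-1.0, nu=0.25)
```

A rule of this kind trades off resolution against truncation: a smaller `n_std` concentrates the samples but cuts off more of the tails, which may be related to the error accumulation observed above.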

4 Conclusion and Outlook

We have studied the domain adaptation in our method from [4] on the example of the Benes filter. We observed that the domain adapted method was more effective in resolving the bimodality than the non-domain adapted one. However, this came at the cost of a linear trend in the error. A possible direction for future work would thus be to investigate the optimal domain size more closely, in order to mitigate the error trend and make full use of the increased resolution from the domain adaptation. This is the subject of future research in connection with more general domain adaptation methods than the one employed here, which is specific to the Benes filter.

As already noted in the previous work [4], the possibility for transfer learning in our method should be explored.

A long-term goal in the development of neural network based numerical methods must of course be the rigorous error analysis, which remains a challenging task.

References

1. Alan Bain and Dan Crisan. Fundamentals of Stochastic Filtering. Springer, 2008.
2. Christian Beck, Sebastian Becker, Philipp Grohs, Nor Jaafari, and Arnulf Jentzen. Solving stochastic differential equations and Kolmogorov equations by means of deep learning. arXiv preprint arXiv:1806.00421, 2018.
3. Zhiqiang Cai, François Le Gland, and Huilong Zhang. An adaptive local grid refinement method for nonlinear filtering. PhD thesis, INRIA, 1995.
4. Dan Crisan, Alexander Lobbe, and Salvador Ortiz-Latorre. An application of the splitting-up method for the computation of a neural network representation for the solution for the filtering equations. arXiv preprint arXiv:2201.03283, 2022.
5. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
6. Ioannis Karatzas and Steven E. Shreve. Brownian Motion and Stochastic Calculus. Springer, 1998.
7. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
8. Peter E. Kloeden and Eckhard Platen. Numerical Solution of Stochastic Differential Equations. Springer, 1992.
9. François Le Gland. Time discretization of nonlinear filtering equations. In Proceedings of the 28th IEEE Conference on Decision and Control, pages 2601–2606. IEEE, 1989.
10. François Le Gland. Splitting-up approximation for SPDE's and SDE's with application to nonlinear filtering. In Stochastic Partial Differential Equations and Their Applications, pages 177–187. Springer, 1992.


Author information

Authors and Affiliations

Department of Mathematics, Imperial College London, London, UK
Alexander Lobbe


Corresponding author

Correspondence to Alexander Lobbe.

Editor information

Editors and Affiliations

Ifremer – Institut Français de Recherche pour l'Exploitation de la Mer, Plouzané, France
Bertrand Chapron
Imperial College London, London, UK
Dan Crisan
Imperial College London, London, UK
Darryl Holm
Campus Universitaire de Beaulieu, Inria – Institut National de Recherche en Sciences et Technologies du Numérique, Rennes, France
Etienne Mémin
Imperial College London, London, UK
Anna Radomska

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Cite this paper

Lobbe, A. (2023). Deep Learning for the Benes Filter. In: Chapron, B., Crisan, D., Holm, D., Mémin, E., Radomska, A. (eds) Stochastic Transport in Upper Ocean Dynamics. STUOD 2021. Mathematics of Planet Earth, vol 10. Springer, Cham. https://doi.org/10.1007/978-3-031-18988-3_12


DOI: https://doi.org/10.1007/978-3-031-18988-3_12
Published: 24 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18987-6
Online ISBN: 978-3-031-18988-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
