1 Introduction

Stochastic filtering, i.e. the estimation of a signal process given only partial and noisy observations, is a well-studied problem in both the theoretical and the applied literature. It is relevant in many practical domains, for example in numerical weather prediction, so there is a high demand for efficient numerical methods to approximate the optimal filter. Many such methods are known in the literature. Among them, the SPDE splitting method can be used to solve the filtering problem in low dimensions. Its inefficiency in higher dimensions stems from the fact that the underlying state space must be explicitly discretised: the required number of discretisation points, which together form the mesh, grows exponentially with the dimension of the state space. For this reason, the authors of [4] present a modified splitting method for the filtering problem which does not rely on an explicit space discretisation. The method developed in [4] is therefore called mesh-free and relies on a neural network representation of the solution. This means that, instead of approximating the values of the solution on a discrete mesh, we optimise the parameters of a neural network defined on the state space itself.

In this paper we present a further study of the deep learning method developed in [4] on the example of the Benes filter. The algorithm is derived from the classical splitting method for SPDEs, which consists of a deterministic PDE approximation step and a normalisation step that incorporates the randomness of the SPDE. Our algorithm replaces the PDE approximation step of the splitting method by a neural network representation and a learning algorithm. Combined with the Monte-Carlo method for the normalisation step, this method becomes completely mesh-free. Furthermore, an important property of the methodology in the filtering context is the ability to iterate it over several time steps. This allows the algorithm to be run online and to process observations as they arrive sequentially. For the method to be computationally feasible, the domain of the neural network needs to be restricted. This restricted domain needs to cover the support of the density as well as possible in order to yield a sensible solution. In [4] the neural network domain is fixed a priori and does not move with the solution. This presents two problems. First, a fixed domain must be unnecessarily large in order to cover the support over all timesteps. Second, the solution may eventually move outside the computational domain, rendering the approximation inadequate. It was therefore noted in [4] that a possible extension of the approximation method would be an adaptive domain as the support of the neural network. We present in this work the first results obtained using an adaptive domain in the nonlinear and analytically tractable case of the Benes filter.

The paper is structured as follows. In Sect. 1.1 we briefly introduce the nonlinear, continuous-time stochastic filtering framework. The setting is identical to the one assumed in [4] and the reader may consult [1] for an in-depth treatment of stochastic filtering. Thereafter, in Sect. 2.2, we formulate the Benes filtering model used as a benchmark. Then, in Sect. 1.2 we introduce the filtering equation and the classical SPDE splitting method. This is the method upon which the new algorithm in [4] was built.

Next, in Sect. 2 we present an outline of the derivation of the new methodology. For details, the reader is referred to the original article [4]. The first idea of the algorithm, presented in Sect. 2.1, is to reformulate the solution of the PDE for the density of the unnormalised filter as an expected value. This is done using the Feynman–Kac formula, based on an auxiliary diffusion process derived from the model equations. Moreover, in Sect. 2.3 we briefly specify the neural network parameters used in the method, as well as the loss function employed. The theoretical part of the paper is concluded with Sect. 2.4, where we show how to normalise the neural network obtained from the prediction step using Monte-Carlo approximation for linear sensor functions.

Section 3 contains the detailed parameter values and results of the numerical studies that we performed. Specifically, we perform two experiments. The first one, in Sect. 3.1, is carried out without any domain adaptation and highlights the limitations of an ad-hoc parameterisation of the domain. It is a simulation of the Benes filter using the deep learning method over a larger domain, as well as a longer time interval, than in the paper [4]. In particular, the size of the domain was estimated using the exact solution of the Benes model. This is necessary, as the nonlinearity of the Benes model makes it difficult to know the evolution of the posterior a priori. A domain chosen in an ad-hoc way would therefore have to be much larger. The second experiment, in Sect. 3.2, reports the performance of the proposed framework with domain adaptation. The adaptation was performed using precomputed estimates of the support of the filter by employing the solution formula for the Benes filter.

Finally, we formulate the conclusions from our experiments in Sect. 4. In short, the domain adapted method was more effective in resolving the bimodality in our study than the non-domain adapted one. However, this came at the cost of a linear trend in the error.

1.1 Nonlinear Stochastic Filtering Problem

The stochastic filtering framework consists of a pair of stochastic processes (X, Y) on a probability space \((\varOmega , \mathcal {F}, \mathrm {P})\) with a normal filtration \((\mathcal {F}_t)_{t\geq 0}\), modelled, P-a.s., as

$$\displaystyle \begin{aligned} X_t = X_0 + \int_0^t f(X_s) \,\mathrm{d} s + \int_0^t \sigma(X_s) \,\mathrm{d} V_s \;, \end{aligned} $$
(1)

and

$$\displaystyle \begin{aligned} Y_t = \int_0^t h(X_s) \,\mathrm{d} s + W_t \;. \end{aligned} $$
(2)

Here, the time parameter is t ∈ [0, ∞), \(d,p,m\in \mathbb {N}\) and \(f: \mathbb {R}^d \rightarrow \mathbb {R}^d\) and \(\sigma : \mathbb {R}^d \rightarrow \mathbb {R}^{d \times p}\) are the drift and diffusion coefficient functions of the signal. The processes V and W are p- and m-dimensional independent, \((\mathcal {F}_t)_{t\geq 0}\)-adapted Brownian motions. We call X the signal process and Y the observation process. The function \(h: \mathbb {R}^{d} \rightarrow \mathbb {R}^{m}\) is often called the sensor function, or link function, because it models the possibly nonlinear connection of the signal and observation processes.
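To make the model (1)–(2) concrete, the following minimal sketch simulates a scalar signal and observation pair with the Euler–Maruyama scheme. It is an illustration under stated assumptions, not the implementation used for the experiments: the function name `simulate_signal_observation`, the uniform time grid, and the coefficient choices in the usage lines are all hypothetical.

```python
import math
import random

def simulate_signal_observation(f, sigma, h, x0, dt, n_steps, rng):
    """Euler-Maruyama paths of a scalar signal X and observation Y.

    f, sigma and h are callables for the drift, diffusion and sensor
    function; V and W are simulated as independent Brownian motions.
    """
    xs, ys = [x0], [0.0]
    sqrt_dt = math.sqrt(dt)
    for _ in range(n_steps):
        x, y = xs[-1], ys[-1]
        dv = sqrt_dt * rng.gauss(0.0, 1.0)  # increment of V
        dw = sqrt_dt * rng.gauss(0.0, 1.0)  # increment of W
        xs.append(x + f(x) * dt + sigma(x) * dv)  # discretised equation (1)
        ys.append(y + h(x) * dt + dw)             # discretised equation (2)
    return xs, ys

# Illustrative coefficients only (mean-reverting signal, linear sensor).
rng = random.Random(0)
xs, ys = simulate_signal_observation(
    f=lambda x: -x, sigma=lambda x: 0.5, h=lambda x: 3.0 * x,
    x0=0.0, dt=0.1, n_steps=40, rng=rng,
)
```

Only the path `ys` would be available to a filter; `xs` plays the role of the hidden signal.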

Further, consider the observation filtration \((\mathcal {Y}_t)_{t\geq 0}\) given as

$$\displaystyle \begin{aligned} \mathcal{Y}_t = \sigma(Y_s,\; s\in[0,t]) \vee \mathcal{N}, \end{aligned}$$

where \(\mathcal {N}\) are the P-nullsets of \(\mathcal {F}\). The aim of nonlinear filtering is to compute the probability measure valued \((\mathcal {Y}_t)_{t\geq 0}\)-adapted stochastic process π that is defined by the requirement that for all bounded measurable test functions \(\varphi : \mathbb {R}^d \to \mathbb {R}\) and t ∈ [0, ∞) we have P-a.s. that

$$\displaystyle \begin{aligned} \pi_t\varphi = \mathbb{E}\left[ \varphi(X_t) \left| \mathcal{Y}_t \right. \right]. \end{aligned}$$

We call π the filter.

Furthermore, let the process Z be defined such that for all t ∈ [0, ∞),

$$\displaystyle \begin{aligned} Z_t = \exp\{-\int_0^t h(X_s)\,\mathrm{d} W_s - \frac{1}{2} \int_0^t h(X_s)^2 \,\mathrm{d} s\}. \end{aligned}$$

Then, assuming that

$$\displaystyle \begin{aligned} \mathbb{E}\left[ \int_0^t h(X_s)^2 \,\mathrm{d} s \right] < \infty \; \text{ and } \; \mathbb{E}\left[ \int_0^t Z_s h(X_s)^2 \,\mathrm{d} s \right] < \infty, \end{aligned}$$

we have that Z is an \((\mathcal {F}_t)_{t\geq 0}\)-martingale. By the change of measure (for details, see [1]) given by \(\left . \frac {\,\mathrm {d} \tilde {\mathrm {P}}^t} {\,\mathrm {d} \mathrm {P}}\right |{ }_{\mathcal {F}_t} = Z_t\), t ≥ 0, the processes X and Y are independent under \(\tilde {\mathrm {P}}\) and Y is a \(\tilde {\mathrm {P}}\)-Brownian motion. Here, \(\tilde {\mathrm {P}}\) is the consistent measure defined on \(\bigcup _{t\in [0,\infty )}\mathcal {F}_t\). Finally, under \(\tilde {\mathrm {P}}\), we can define the measure valued stochastic process ρ by the requirement that for all bounded measurable functions \(\varphi : \mathbb {R}^d\to \mathbb {R}\) and t ∈ [0, ∞) we have P-a.s. that

$$\displaystyle \begin{aligned} \rho_t\varphi = \mathbb{E}\left[ \left.\varphi(X_t) \exp\{\int_0^t h(X_s)\,\mathrm{d} Y_s - \frac{1}{2} \int_0^t h(X_s)^2 \,\mathrm{d} s\} \right| \mathcal{Y}_t \right]. \end{aligned} $$
(3)

The Kallianpur–Striebel formula (see [1]) justifies the terminology to call ρ the unnormalised filter.

1.2 Filtering Equation and General Splitting Method

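Note that under the conditions given in [4], X admits the infinitesimal generator \(A: \mathcal {D}(A) \rightarrow B(\mathbb {R}^{d})\) given, for all \(\varphi \in \mathcal {D}(A)\), by

$$\displaystyle \begin{aligned} A\varphi = \sum_{i=1}^d f_i \frac{\partial \varphi}{\partial x_i} + \sum_{i,j=1}^d a_{ij} \frac{\partial^2 \varphi}{\partial x_i \partial x_j}, \end{aligned} $$
(4)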

where \(\mathcal {D}(A)\) denotes the domain of the differential operator A and \(a = \frac {1}{2}\sigma \sigma ^\prime \). The symbol \(B(\mathbb {R}^{d})\) denotes the set of real-valued, bounded, Borel-measurable functions defined on \(\mathbb {R}^d\).

It is well-known (see, e.g., [1]), that the unnormalised filter ρ satisfies the filtering equation, i.e. for all t ≥ 0, we have \(\tilde {\mathrm {P}}\)-a.s. that

$$\displaystyle \begin{aligned} \rho_t(\varphi) = \pi_0(\varphi) + \int_0^t \rho_s(A\varphi)\,\mathrm{d} s + \int_0^t \rho_s(\varphi h^\prime) \,\mathrm{d} Y_s. \end{aligned} $$
(5)

The classical splitting method for the filtering equation is given in [3] and seeks to approximate the following SPDE for the density \(p_t\) of the unnormalised filter, given for all t ≥ 0, \(x\in \mathbb {R}^d\), and P-a.s. as

$$\displaystyle \begin{aligned} p_t(x) = p_0 (x) + \int_0^t A^* p_s(x)\,\mathrm{d} s + \int_0^t h^\prime(x) p_s(x) \,\mathrm{d} Y_s \end{aligned}$$

and relies on the splitting-up algorithm described in [9] and [10]. Here \(A^*\) is the formal adjoint of the infinitesimal generator A of the signal process X.

We summarise the splitting-up method below in Note 1.

Note 1

The splitting method for the filtering problem is defined by iterating the steps below with initial density \(p^0(\cdot) = p_0(\cdot)\):

  1. (Prediction) Compute an approximation \(\tilde {p}^n\) of the solution to

    $$\displaystyle \begin{aligned} \frac{\partial q^n}{\partial t} (t,z) &= A^* q^n (t,z), &\; & (t,z)\in (t_{n-1},t_n]\times\mathbb{R}^d,\\ q^n(0,z) &= {p}^{n-1}(z), &\; & z\in \mathbb{R}^d, \end{aligned} $$
    (6)

    at time \(t_n\), and

  2. (Normalisation) Compute the normalisation constant with \(z_n = (Y_{t_n}-Y_{t_{n-1}}) / (t_n-t_{n-1})\) and the function

    $$\displaystyle \begin{aligned} \mathbb{R}^d\ni z\mapsto\xi_n(z) = \exp \left( -\frac{t_n-t_{n-1}}{2} ||z_n - h(z)||{}^2 \right), \end{aligned}$$

    so that we can set,

    $$\displaystyle \begin{aligned} p^n(z) = \frac{1}{C_n} \xi_n(z) \tilde{p}^n(z); \; z\in\mathbb{R}^d, \end{aligned}$$

    where \(C_n = \int _{\mathbb {R}^d} \xi _n(z)\tilde {p}^n(z) \,\mathrm {d} z \).
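As an illustration of the normalisation step in Note 1, the sketch below applies it on a one-dimensional grid, using the trapezoidal rule for \(C_n\). This is the classical grid-based variant and not the mesh-free method studied in this paper; the helper names `trapezoid` and `normalisation_step`, the grid, and the toy predicted density are assumptions made for the example.

```python
import math

def trapezoid(grid, values):
    """Trapezoidal rule on a (possibly non-uniform) one-dimensional grid."""
    return sum(0.5 * (values[i] + values[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))

def normalisation_step(grid, predicted, h, z_n, dt):
    """Weight the predicted density by xi_n and renormalise (step 2 of Note 1)."""
    weighted = [math.exp(-0.5 * dt * (z_n - h(z)) ** 2) * p
                for z, p in zip(grid, predicted)]
    c_n = trapezoid(grid, weighted)  # quadrature approximation of C_n
    return [w / c_n for w in weighted], c_n

# Toy check: standard normal predicted density, sensor h(z) = z.
grid = [-8.0 + 16.0 * i / 2000 for i in range(2001)]
predicted = [math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in grid]
posterior, c_n = normalisation_step(grid, predicted, lambda z: z, z_n=1.0, dt=0.1)
```

In this toy case the result is again Gaussian, with precision \(1 + \Delta t\) and mean \(\Delta t\, z_n/(1+\Delta t)\), which gives a simple sanity check for an implementation.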

The deep learning method studied below replaces the predictor step of the splitting method above by a deep neural network approximation algorithm to avoid an explicit space discretisation. This is achieved by representing each \(\tilde {p}^n(z)\) by a feed-forward neural network and approximating the initial value problem (6) based on its stochastic representation using a sampling procedure. The normalisation step may then be computed either using quadrature, or, to preserve the mesh-free characteristic, by Monte-Carlo approximation.

2 Derivation and Outline of the Deep Learning Algorithm

Here, we present a concise version of the derivation laid out in detail in [4].

2.1 Feynman–Kac Representation

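Assuming sufficient differentiability of the coefficient functions, the operator \(A^*\) may be expanded such that for all compactly supported smooth test functions \(\varphi \in C_c^\infty (\mathbb {R}^d, \mathbb {R})\) we have, writing \(\overrightarrow{\mathrm{div}}(a)_i = \sum_{j=1}^d \partial a_{ij}/\partial x_j\),

$$\displaystyle \begin{aligned} A^*\varphi = \sum_{i,j=1}^d a_{ij} \frac{\partial^2 \varphi}{\partial x_i \partial x_j} + \sum_{i=1}^d \left(2\overrightarrow{\mathrm{div}}(a) - f\right)_i \frac{\partial \varphi}{\partial x_i} + \operatorname{\mathrm{div}}(\overrightarrow{\mathrm{div}}(a)-f)\,\varphi. \end{aligned} $$
(7)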

Subtracting the zero-order term from (7), we obtain an operator that generates the auxiliary diffusion process, denoted \(\hat {X}\), which is instrumental in the deep learning method.

Definition 1

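Define the partial differential operator \(\hat {A}:C_c^\infty (\mathbb {R}^d,\mathbb {R})\rightarrow C_b(\mathbb {R}^d,\mathbb {R})\), with image in the set of bounded continuous functions on \(\mathbb {R}^d\), such that for all \(\varphi \in C_c^\infty (\mathbb {R}^d,\mathbb {R})\),

$$\displaystyle \begin{aligned} \hat{A}\varphi = \sum_{i,j=1}^d a_{ij} \frac{\partial^2 \varphi}{\partial x_i \partial x_j} + \sum_{i=1}^d \left(2\overrightarrow{\mathrm{div}}(a) - f\right)_i \frac{\partial \varphi}{\partial x_i}, \end{aligned}$$

and the function \(r:\mathbb {R}^d\rightarrow \mathbb {R}\) such that for all \(x\in \mathbb {R}^d\),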

$$\displaystyle \begin{aligned} r(x) = \operatorname{\mathrm{div}}(\overrightarrow{\mathrm{div}}(a)-f)(x). \end{aligned}$$

Lemma 1

For all \(x\in \mathbb {R}^d\) the operator \(\hat {A}\) defined in Definition 1 is the infinitesimal generator of the Itô diffusion \(\hat {X}:[0,\infty )\times \varOmega \rightarrow \mathbb {R}^d\) given, for all t ≥ 0 and P-a.s. by

$$\displaystyle \begin{aligned} \hat{X}_t =x + \int_0^t b(\hat{X}_s) \mathrm{d} s + \int_0^t \sigma(\hat{X}_s) \mathrm{d} \hat{W}_s, \end{aligned}$$

where \(\hat {W}:[0,\infty )\times \varOmega \rightarrow \mathbb {R}^d\) is a d-dimensional Brownian motion and \(b:\mathbb {R}^d \rightarrow \mathbb {R}^d\) is the function

$$\displaystyle \begin{aligned} b = 2\overrightarrow{\mathrm{div}}(a) - f. \end{aligned}$$

From the well-known Feynman–Kac formula (see Karatzas and Shreve [6, Chapter 5, Theorem 7.6]) we can deduce Corollary 1 below for the initial value problem.

Corollary 1

Let \(d\in \mathbb {N}\), T > 0, let \(k\colon \mathbb {R}^d \to [0,\infty )\) be a continuous function, let \(\hat {A}\) be the operator defined in Definition 1, and let \(\psi :\mathbb {R}^d \rightarrow \mathbb {R}\) be an at most polynomially growing function. Suppose that \(u \in C_b^{1,2}((0,T]\times \mathbb {R}^d,\mathbb {R})\) is continuously differentiable with bounded derivative in time and twice continuously differentiable with bounded derivatives in space, and satisfies the Cauchy problem

$$\displaystyle \begin{aligned} \frac{\partial u}{\partial t}(t,x) + k(x)u(t,x) &= \hat{A}u(t,x), &\; &(t,x)\in (0,T] \times \mathbb{R}^d,\\ u(0,x) &= \psi(x), &\; &x\in \mathbb{R}^d. \end{aligned} $$
(8)

Then, for all \((t,x) \in (0,T]\times \mathbb {R}^d\) , we have that

$$\displaystyle \begin{aligned} u(t,x) = \mathbb{E}\left[ \left. \psi(\hat{X}_t)\exp\left(-\int_0^t k(\hat{X}_\tau) \,\mathrm{d}\tau\right) \right| \hat{X}_0 = x \right], \end{aligned}$$

where \(\hat {X}\) is the diffusion generated by \(\hat {A}\).

Recall that our aim is to approximate the Fokker–Planck equation (6). Assume from now on the discrete times \(\{t_0 = 0, t_1, t_2, \dots \}\), indexed by n. Written in the form of Corollary 1, for any timestep n = 1, 2, …, (6) reads as

$$\displaystyle \begin{aligned} \frac{\partial q^n}{\partial t} (t,z) &= \hat{A}q^n(t,z) + r(z)q^n(t,z), &\; & (t,z)\in (t_{n-1},t_n]\times\mathbb{R}^d,\\ q^n(0,z) &= {p}^{n-1}(z), &\; & z\in \mathbb{R}^d. \end{aligned}$$

Thus, with k = −r, and assuming that −r is non-negative in (8), we obtain by Corollary 1 the representation, for all n ∈ {1, …, N}, \(t\in (t_{n-1}, t_n]\), \(z\in \mathbb {R}^d\),

$$\displaystyle \begin{aligned} q^n(t,z) = \mathbb{E}\left[ \left. p^{n-1}(\hat{X}_t) \exp\left(\int_{t_{n-1}}^t r(\hat{X}_\tau) \,\mathrm{d}\tau\right) \right| \hat{X}_{t_{n-1}} = z \right]. \end{aligned} $$
(9)

Note that [4, Proposition 2.4] shows that we have a feasible minimisation problem to approximate by the learning algorithm (see also [2, Proposition 2.7]).

2.2 The Benes Filtering Model

The Benes filter is a one-dimensional nonlinear model and is used as a benchmark in the numerical studies below. As we show below, it is one of the rare cases of explicitly solvable continuous-time stochastic filtering models. Here, we are considering a special case of the more general class of Benes filters, presented, for example, in [1, Chapter 6.1].

The signal is given by the coefficient functions

$$\displaystyle \begin{aligned} f(x) = \alpha\sigma \tanh (\beta+\alpha x/ \sigma) \;\text{ and } \; \sigma(x) \equiv \sigma \in \mathbb{R},\end{aligned} $$

where \(\alpha , \beta \in \mathbb {R}\) and the observation is given by the affine-linear sensor function

$$\displaystyle \begin{aligned} h(x) = h_1 x + h_2,\end{aligned} $$

with \(h_1, h_2 \in \mathbb {R}\). The density \(p_B\) of the filter solving the Benes model is then given by two weighted Gaussians (see [1, Chapter 6.1]) as

$$\displaystyle \begin{aligned} p_B(z) = w^{+}\varPhi(\mu_t^{+}, \nu_t)(z) + w^{-}\varPhi(\mu_t^{-}, \nu_t)(z),\end{aligned} $$
(10)

where \(\mu _t^{\pm } = M_t^{\pm }/(2v_t)\), \(\nu_t = 1/(2v_t)\), and

$$\displaystyle \begin{aligned} w^{\pm} = \frac{\exp ( (M_t^{\pm})^2/(4v_t))}{\exp ( (M_t^{+})^2/(4v_t)) + \exp ( (M_t^{-})^2/(4v_t))}\end{aligned} $$

with

$$\displaystyle \begin{aligned} M_t^{\pm} = \pm\frac{\alpha}{\sigma} + h_1\int_0^t \frac{\sinh(s\zeta\sigma )}{\sinh(t\zeta\sigma)}\,\mathrm{d} Y_s +\frac{h_2+h_1x_0}{\sigma\sinh(t\zeta\sigma)} - \frac{h_2}{\sigma}\coth(t\zeta\sigma),\end{aligned} $$

\(v_t = {h_1}\coth (t\zeta \sigma )/{2\sigma }\), and \(\zeta = \sqrt {\alpha ^2/\sigma ^2 + h_1^2}\).
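The mixture (10) is straightforward to evaluate once \(M_t^{+}\), \(M_t^{-}\) and \(v_t\) are known. The sketch below does so under two assumptions that should be checked against [1, Chapter 6.1]: the second argument of \(\varPhi\) is read as a variance, and the weights are normalised so that \(w^{+} + w^{-} = 1\). The function names and the numerical values in the usage line are hypothetical, and the stochastic integral entering \(M_t^{\pm}\) is not computed here.

```python
import math

def gaussian_pdf(z, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def benes_density(z, m_plus, m_minus, v):
    """Two-component Gaussian mixture for given M_t^+, M_t^- and v_t > 0."""
    # Log-weights (M^{+/-})^2 / (4 v); subtracting the maximum avoids overflow.
    a_plus, a_minus = m_plus ** 2 / (4.0 * v), m_minus ** 2 / (4.0 * v)
    shift = max(a_plus, a_minus)
    e_plus, e_minus = math.exp(a_plus - shift), math.exp(a_minus - shift)
    w_plus = e_plus / (e_plus + e_minus)  # assumed normalisation: w^+ + w^- = 1
    w_minus = 1.0 - w_plus
    var = 1.0 / (2.0 * v)                 # nu_t, read here as a variance
    return (w_plus * gaussian_pdf(z, m_plus / (2.0 * v), var)
            + w_minus * gaussian_pdf(z, m_minus / (2.0 * v), var))

# Symmetric toy values: modes at -1 and +1 with equal weights.
value_at_zero = benes_density(0.0, m_plus=6.0, m_minus=-6.0, v=3.0)
```

Working with shifted log-weights matters in practice, because \((M_t^{\pm})^2/(4v_t)\) can become large as observations accumulate.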

Further, for the Benes model, the auxiliary diffusion is given as

$$\displaystyle \begin{aligned} \hat{X}_t = \hat{X}_0 - \int_0^t \alpha\sigma\tanh(\beta + \alpha \hat{X}_s / \sigma) \, \mathrm{d} s + \int_0^t \sigma \,\mathrm{d} \hat{W}_s, \end{aligned}$$

and the coefficient

$$\displaystyle \begin{aligned} r(x) = - \operatorname{\mathrm{div}} f(x) = -\alpha^2 \mathrm{sech}^2(\beta + \alpha x / \sigma). \end{aligned}$$

Therefore the representation of the solution to the Fokker–Planck equation (6) in the Benes case reads

$$\displaystyle \begin{aligned} q^n(t,z) = \mathbb{E}\left[ \left. p^{n-1}(\hat{X}_t) \exp\left(- \int_{t_{n-1}}^t \alpha^2\mathrm{sech}^2(\beta + \alpha \hat{X}_\tau /\sigma) \,\mathrm{d} \tau\right) \right| \hat{X}_{t_{n-1}} = z \right]. \end{aligned}$$
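The representation above can be estimated pointwise by plain Monte-Carlo simulation of the auxiliary diffusion, which is also how a reference solution for the prior can be produced. The following sketch assumes an Euler–Maruyama discretisation with `n_sub` substeps and a left-point rule for the time integral; the function name `benes_prediction_mc`, its parameters, and the density `p_prev` in the usage lines are hypothetical choices for illustration.

```python
import math
import random

def benes_prediction_mc(p_prev, z, dt, alpha, beta, sigma, n_paths, n_sub, rng):
    """Monte-Carlo estimate of q^n(t_{n-1} + dt, z) via Feynman-Kac."""
    tau = dt / n_sub
    sqrt_tau = math.sqrt(tau)
    total = 0.0
    for _ in range(n_paths):
        x, integral = z, 0.0  # auxiliary diffusion started at z
        for _ in range(n_sub):
            arg = beta + alpha * x / sigma
            # r(x) = -alpha^2 sech^2(arg), accumulated with a left-point rule
            integral += -(alpha / math.cosh(arg)) ** 2 * tau
            # Euler-Maruyama step with drift b = -f and constant diffusion sigma
            x += -alpha * sigma * math.tanh(arg) * tau + sigma * sqrt_tau * rng.gauss(0.0, 1.0)
        total += p_prev(x) * math.exp(integral)
    return total / n_paths

# Illustrative previous density: centred Gaussian with standard deviation 0.5.
rng = random.Random(0)
p_prev = lambda x: math.exp(-0.5 * (x / 0.5) ** 2) / (0.5 * math.sqrt(2.0 * math.pi))
q = benes_prediction_mc(p_prev, z=0.2, dt=0.1, alpha=3.0, beta=0.0,
                        sigma=0.5, n_paths=2000, n_sub=10, rng=rng)
```

For α = 0 the drift and the potential vanish, so the estimate should agree with a Gaussian convolution, which is a convenient correctness check.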

2.3 Neural Network Model for the Prediction Step

To solve the Fokker–Planck equation over a rectangular domain \(\varOmega_d = [\alpha_1, \beta_1] \times \cdots \times [\alpha_d, \beta_d]\), we employ the sampling based deep learning method from [2]. Using the representation (9), the solution of the Fokker–Planck equation is reformulated into an optimisation problem over function space given in [4, Proposition 2.4]. This in turn yields the loss functions for the learning algorithm. Writing \(\hat {\mathrm {X}}^\xi \) for the auxiliary diffusion with Unif(\(\varOmega_d\))-random initial value ξ, the optimisation problem is approximated by the optimisation

$$\displaystyle \begin{aligned} \inf_{\theta \in \mathbb{R}^{\sum_{i=2}^L l_{i-1}l_i+l_i}} \mathbb{E}\left[ \left| \psi(\hat{\mathrm{X}}_T^\xi) \exp\left(-\int_0^T k(\hat{\mathrm{X}}_\tau^\xi) \,\mathrm{d}\tau\right) - \mathcal{N}\mathcal{N}_\theta(\xi)\right|{}^2 \right] \end{aligned}$$

where the solution of the PDE is represented by a neural network \(\mathcal {N}\mathcal {N}_\theta \) and the infinite-dimensional function space has been parametrised by θ. Here, L denotes the depth of the neural net, and the parameters l i are the respective layer widths. Further details can be found in [4]. A comprehensive textbook on deep learning is [5]. We apply a modified gradient descent method, called ADAM [7], to determine the parameters in the model by minimising the loss function

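$$\displaystyle \begin{aligned} \mathcal{L}(\theta ; \{\xi^i, \{\hat{\mathrm{X}}_{\tau_j}^i\}_{j=0}^J \}_{i=1}^{N_b}) = \frac{1}{N_b}\sum_{i=1}^{N_b} \left| \psi(\hat{\mathrm{X}}_{\tau_J}^i) \exp\left(-\sum_{j=0}^{J-1} k(\hat{\mathrm{X}}_{\tau_j}^i)(\tau_{j+1}-\tau_j)\right) - \mathcal{N}\mathcal{N}_\theta(\xi^i)\right|{}^2, \end{aligned}$$

where \(N_b\) is the batch size and \( \{\xi ^i, \{\hat {\mathrm {X}}_{\tau _j}^{\xi ,i}\}_{j=0}^J \}_{i=1}^{N_b}\) is a training batch consisting of independent identically distributed realisations \(\xi^i\) of \(\xi \sim \mathcal {U}(\varOmega _d)\) and of the approximate i.i.d. realisations \(\{\hat {\mathrm {X}}_{\tau _j}^{\xi ,i}\}_{j=0}^J\) of sample paths of the auxiliary diffusion started at \(\xi^i\) over the time-grid \(\tau_0 = 0 < \tau_1 < \cdots < \tau_{J-1} < \tau_J = T\). For the approximation of the sample paths of the diffusion we use the Euler–Maruyama method [8]. Additionally, we augment the loss \(\mathcal {L}\) by a further term to encourage the positivity of the neural network. Thus, in practice, we use the loss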

$$\displaystyle \begin{aligned} \tilde{\mathcal{L}}(\theta ; \{\xi^i, \{\hat{\mathrm{X}}_{\tau_j}^i\}_{j=0}^J \}_{i=1}^{N_b}) = {\mathcal{L}}(\theta ; \{\xi^i, \{\hat{\mathrm{X}}_{\tau_j}^i\}_{j=0}^J \}_{i=1}^{N_b}) + \lambda \sum_{i=1}^{N_b} \max\{0,-\mathcal{N}\mathcal{N}_\theta(\xi^i)\} \end{aligned}$$

with the hyperparameter λ to be chosen.
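The structure of the penalised loss can be seen in a few lines of code. The sketch below assumes that \(\mathcal{L}\) is the batch mean of squared residuals between the discounted terminal values and the network outputs, and it treats both as precomputed lists; the function name `penalised_loss` and the toy numbers are hypothetical. In an actual training loop the same expression would be written in an automatic differentiation framework so that ADAM can compute gradients with respect to θ.

```python
def penalised_loss(targets, predictions, lam):
    """Batch mean of squared residuals plus a penalty on negative outputs.

    targets[i]     : psi(X_T^i) * exp(-sum_j k(X_{tau_j}^i) * (tau_{j+1} - tau_j))
    predictions[i] : network output NN_theta(xi^i)
    """
    n = len(targets)
    mse = sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n
    penalty = sum(max(0.0, -p) for p in predictions)  # zero when all outputs >= 0
    return mse + lam * penalty

# Toy batch of size two: the second prediction is negative and gets penalised.
loss = penalised_loss(targets=[1.0, 0.0], predictions=[0.5, -0.5], lam=1.0)
```

In the toy batch the squared-error part contributes 0.25 and the positivity penalty contributes 0.5.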

Thus, in the notation of Sect. 1.2 we replace the Fokker–Planck solution by a neural network model, i.e. we postulate a neural network model

$$\displaystyle \begin{aligned} \tilde{p}^n(z) = \mathcal{N}\mathcal{N}(z), \end{aligned}$$

with support on \(\varOmega_d\). Therefore we require the a priori chosen domain to capture most of the mass of the probability distribution it is approximating.

2.4 Monte-Carlo Normalisation Step

We then realise the normalisation step via Monte-Carlo sampling over the bounded rectangular domain \(\varOmega_d\) to approximate the integral

$$\displaystyle \begin{aligned} \int_{\mathbb{R}^d} \xi_n(z)\mathcal{N}\mathcal{N}(z) \,\mathrm{d} z = \int_{\varOmega_d} \exp \left( -\frac{t_n-t_{n-1}}{2} ||z_n - h(z)||{}^2 \right) \mathcal{N}\mathcal{N}(z) \,\mathrm{d} z, \end{aligned} $$
(11)

where, as defined earlier, \(z_n = \frac {1}{t_n-t_{n-1}}(Y_{t_n}-Y_{t_{n-1}})\). Note that, since \(\varOmega_d\) is the support of the neural network \(\mathcal {N}\mathcal {N}\), the right-hand side above is indeed identical to the integral over the whole space.

The sensor function in the Benes model is given by \(h(x) = h_1 x + h_2\). Then, the likelihood function becomes

$$\displaystyle \begin{aligned} \xi_n(z) = \frac{\sqrt{2\pi}}{\sqrt{(t_n-t_{n-1})h_1^2}} \mathcal{N}_{\text{pdf}}\left(\frac{z_n - h_2}{h_1}, \frac{1}{(t_n-t_{n-1})h_1^2}\right)(z), \end{aligned}$$

where \(\mathcal {N}_{\text{pdf}}(\mu ,\sigma ^2)\) denotes the probability density function of a normal distribution with mean μ and variance \(\sigma^2\). Therefore, we can write the integral (11) as

$$\displaystyle \begin{aligned} \frac{\sqrt{2\pi}}{\sqrt{(t_n-t_{n-1})h_1^2}} \mathbb{E}_{Z}[\mathcal{N}\mathcal{N}(Z)]; \qquad \qquad Z \sim \mathcal{N}\left(\frac{z_n - h_2}{h_1}, \frac{1}{(t_n-t_{n-1})h_1^2}\right). \end{aligned}$$

This is an implementable method to compute the normalisation constant \(C_n\). Thus, we can express the approximate posterior density as

$$\displaystyle \begin{aligned} p^n(z) = \frac{1}{C_n} \xi_n(z) \tilde{p}^n(z). \end{aligned}$$

Therefore, the methodology is fully recursive and can be applied sequentially.
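The Monte-Carlo normalisation for the affine sensor \(h(x) = h_1 x + h_2\) can be sketched as follows. The function name `normalisation_constant_mc` and its arguments are hypothetical, `nn` stands for any callable representing the network prior, and samples falling outside the interval `domain` are assumed to contribute zero because the network is supported on \(\varOmega_d\).

```python
import math
import random

def normalisation_constant_mc(nn, z_n, dt, h1, h2, domain, n_samples, rng):
    """Monte-Carlo estimate of C_n for the sensor h(x) = h1 * x + h2, h1 != 0."""
    lo, hi = domain
    mean = (z_n - h2) / h1                   # mean of the Gaussian likelihood in z
    std = 1.0 / (abs(h1) * math.sqrt(dt))    # square root of 1 / (dt * h1^2)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mean, std)
        if lo <= z <= hi:                    # network taken to vanish outside the domain
            total += nn(z)
    # Prefactor sqrt(2 pi) / sqrt(dt * h1^2) equals sqrt(2 pi) * std.
    return math.sqrt(2.0 * math.pi) * std * total / n_samples

# Stand-in for the network: a standard normal density.
phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
c_n = normalisation_constant_mc(phi, z_n=1.5, dt=0.1, h1=3.0, h2=0.0,
                                domain=(-9.0, 2.5), n_samples=10000,
                                rng=random.Random(0))
```

When the sampling distribution places much of its mass outside the domain, many samples are discarded and the variance of the estimate grows, so the placement of the domain directly affects the quality of \(C_n\).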

Remark 1

In low dimensions, the use of the Monte-Carlo method to perform the normalisation is optional, since efficient quadrature methods are an alternative. We chose the sampling based method to preserve the grid-free nature of the algorithm.

3 Numerical Results for the Benes Filter

The neural network architecture for all our experiments below is a feed-forward fully connected neural network with a one-dimensional input layer, two hidden layers with a layer width of 51 neurons each and batch-normalisation, and an output layer of dimension one (a detailed illustration can be found in [4]). For the optimisation algorithm we chose the ADAM optimiser and performed the training over 6002 epochs with a batch size of 600 samples. The initial signal and observation values are \(x_0 = y_0 = 0\) and the coefficients of the Benes model were chosen as α = 3, β = 0, σ = 0.5, \(h_1 = 3\), \(h_2 = 0\), with timestep Δt = 0.1 over N = 40 steps. The initial condition is a Gaussian density with mean 0 and standard deviation 0.001. The posterior was calculated over the domain [−9, 2.5]. The domain boundaries were pre-estimated using a simulation of the exact Benes filter with fixed random seed. In the case of the domain adaptation we used the precomputed evolutions from the true solution to estimate the support of the posterior and set a fixed domain adaptation schedule. The spatial resolution is 1000 uniformly spaced values in the domain of definition of the neural network. At each time step, the training of the network consumes 6002 ⋅ 600 = 3,601,200 Monte-Carlo samples. Additionally, we employ a piecewise constant learning rate schedule \(\mathrm{lr}(\mathrm{epoch}) = 10^{-(2+\mathrm{epoch} \bmod 2001)}\), and the normalisation constant is computed using \(10^7\) samples at each timestep. The regularising parameter is λ = 1.
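The text does not state the rule by which the support of the posterior was estimated from the exact solution. Purely as a hypothetical illustration, one simple rule is to take an interval that covers both Gaussian components of the Benes density (10) up to a fixed number of standard deviations; the function name `support_interval` and the default `n_std` are assumptions and not the schedule used in the experiments.

```python
import math

def support_interval(mu_plus, mu_minus, nu, n_std=4.0):
    """Interval covering both mixture components up to n_std standard deviations.

    mu_plus, mu_minus : component means of the mixture
    nu                : common variance of the two components
    """
    half_width = n_std * math.sqrt(nu)
    return min(mu_plus, mu_minus) - half_width, max(mu_plus, mu_minus) + half_width

# Toy values: modes at -1 and +1 with standard deviation 0.5.
lower, upper = support_interval(mu_plus=1.0, mu_minus=-1.0, nu=0.25)
```

A rule of this kind trades coverage against resolution: a larger `n_std` loses less probability mass, while a smaller one concentrates the fixed number of samples on a shorter interval.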

3.1 No Domain Adaptation

Figure 1 shows the plots for the Benes filter without domain adaptation. In Fig. 1a we observe the drift of the posterior toward the left edge of the domain. The initial bimodality, reflecting the uncertainty due to few observed values, quickly resolves and the approximate posterior tracks the signal within the domain. In Fig. 1b the bimodality is mostly visible in the Monte-Carlo prior and smoothed out by the neural network. Figure 1c and d show snapshots of the progression of the filter. The absolute error in means with respect to the Benes reference solution is plotted in Fig. 2a and shows that as the posterior reaches the left domain boundary, the error increases. This is reflected as well in the drop of probability mass, Fig. 2c, and Monte-Carlo acceptance rate, Fig. 2d, at later times. It is not clear from Fig. 2a if there is a trend in the error. Further experiments need to be performed to check this hypothesis. Figure 2b shows that the neural net training consistently succeeds as measured by the \(L^2\) distance between the Monte-Carlo reference prior and the neural net prior.

Fig. 1

Results of the combined splitting-up/machine-learning approximation applied iteratively to the Benes filtering problem (no domain adaptation). (a) The full evolution of the estimated posterior distribution produced by our method, plotted at all intermediate timesteps. (b–d) Snapshots of the approximation at times t = 0.6, t = 1.8, and t = 3.9. The black dotted line in each graph shows the estimated posterior, the yellow line the prior estimate represented by the neural network, and the light-blue shaded line shows the Monte-Carlo reference solution for the prior

Fig. 2

Error and diagnostics for the Benes filter (no domain adaptation). (a) Absolute error in means between the approximated distribution and the exact solution. (b) \(L^2\) error of the neural network during training with respect to the Monte-Carlo reference solution. (c) Probability mass of the neural network prior. (d) Monte-Carlo acceptance rate

3.2 With Domain Adaptation

Figure 3 shows the plots for the Benes filter with domain adaptation. In Fig. 3a we observe again the drift of the posterior toward the left edge of the domain, and the initial bimodality resolves. The approximate posterior tracks the signal within the domain. In Fig. 3b the bimodality is visible both in the prior and the posterior network. This shows that the domain adaptation helps resolve the bimodality in the nonlinear case by increasing the spatial resolution while keeping the computational cost equal. Figure 3c and d again show snapshots of the progression of the filter. The absolute error in means with respect to the Benes reference solution is plotted in Fig. 4a and shows a clear linear trend. This is an interesting phenomenon, likely due to the reduced domain size and subsequent error accumulation. The probability mass, Fig. 4c, and Monte-Carlo acceptance rate, Fig. 4d, fluctuate stably. Figure 4b shows here again that the neural net training consistently succeeds.

Fig. 3

Results of the combined splitting-up/machine-learning approximation applied iteratively to the Benes filtering problem (with domain adaptation). (a) The full evolution of the estimated posterior distribution produced by our method, plotted at all intermediate timesteps. (b–d) Snapshots of the approximation at times t = 0.6, t = 1.8, and t = 3.9. The black dotted line in each graph shows the estimated posterior, the yellow line the prior estimate represented by the neural network, and the light-blue shaded line shows the Monte-Carlo reference solution for the prior

Fig. 4

Error and diagnostics for the Benes filter (with domain adaptation). (a) Absolute error in means between the approximated distribution and the exact solution. (b) \(L^2\) error of the neural network during training with respect to the Monte-Carlo reference solution. (c) Probability mass of the neural network prior. (d) Monte-Carlo acceptance rate

4 Conclusion and Outlook

We have studied the domain adaptation in our method from [4] on the example of the Benes filter. We observed that the domain adapted method was more effective in resolving the bimodality than the non-domain adapted one. However, this came at the cost of a linear trend in the error. A possible direction for future work would thus be to investigate the optimal domain size more closely, in order to mitigate the error trend and make full use of the increased resolution from the domain adaptation. This is the subject of future research in connection with more general domain adaptation methods than the one employed here, which is specific to the Benes filter.

As already noted in the previous work [4], the possibility for transfer learning in our method should be explored.

A long-term goal in the development of neural network based numerical methods must of course be the rigorous error analysis, which remains a challenging task.