
1 Introduction and Motivations

The current approach to forecasting modelling consists of simulating explicitly only the largest-scale phenomena, while taking into account the smaller-scale ones by means of “physical parameterisations”. All numerical models introduce uncertainty through the selection of scales and parameters. Additionally, any computational methodology contributes to uncertainty due to discretisation, finite precision and accumulation of round-off errors. Finally, the ever-growing size of the computational domains leads to increasing sources of uncertainty. Taking these uncertainties into account is essential for the acceptance of any numerical simulation. Numerical forecasting models often use Data Assimilation methods for uncertainty quantification in medium- to long-term analysis. Data Assimilation (DA) is the approximation of the true state of some physical system at a given time by combining time-distributed observations with a dynamic model in an optimal way. DA can be classically approached in two ways: as variational DA [16] and as filtering [5]. In both cases we seek an optimal solution. The most popular filtering approach for data assimilation is the Kalman Filter (KF) [15]. Statistically, the KF seeks a solution with minimum variance. Variational methods seek a solution that minimises a suitable cost function. In certain cases the two approaches are identical and provide exactly the same solution [16]. However, the statistical approach, though often complex and time-consuming, can provide a richer information structure, i.e. an average and some characteristics of its variability (probability distribution). During the last 20 years, hybrid approaches [11, 18] have become very popular, as they combine the two approaches into a single framework, taking advantage of the relative rapidity and robustness of variational approaches while obtaining an accurate solution [2] thanks to the statistical approach.

In this paper, in order to achieve the accuracy of the KF solution and reduce the execution time, we use Recurrent Neural Networks (RNNs). Today the computational power of RNNs is exploited for several applications in different fields. Any non-linear dynamical system can be approximated to any accuracy by a Recurrent Neural Network, with no restrictions on the compactness of the state space, provided that the network has enough sigmoidal hidden units. This is what the Universal Approximation Theorem [12, 20] claims. Only during the last few years has the DA community started to use machine learning models to improve the efficiency of DA models. In [17], the authors combined Deep Learning and Data Assimilation to predict the production of gas from mature gas wells. They used a modified deep LSTM model as their prediction model in the EnKF framework for parameter estimation. Even if the prediction phase is sped up by the introduction of Deep Learning, this only partially affects the whole prediction-correction cycle, which remains time-consuming. In [9], the authors presented an approach for employing artificial neural networks (NNs) to emulate the local ensemble transform Kalman filter (LETKF) as a method of data assimilation. Even if the feed-forward NN they implemented is able to emulate the DA process for the fixed time window, assimilating observations at new time steps still requires the prediction-correction cycle, and this affects the execution time, which is only 90 times faster than the reference DA model.
To further speed up the process, in [8] the authors combined the power of Neural Networks and High Performance Computing to assimilate meteorological data. These studies, alongside others discussed in conferences and still under publication, highlight the necessity of avoiding the prediction-correction cycle by developing a Neural Network able to completely emulate the whole Data Assimilation process. In this context, we developed Neural Assimilation (NA), a Coupled Neural Network made of two RNNs. NA captures the features of a Data Assimilation process by interleaving the training of the two component RNNs on the forecasting data and the observed data. That is, the two component RNNs are trained on forecasting and observed data respectively, with additional inputs provided by the interaction of the two. This NA network emulates the KF and runs much faster than the KF prediction-correction cycle for data assimilation. In this paper we develop the NA architecture and prove its equivalence to the KF. The equivalence between NA and KF is independent of the structure of the RNNs. Here we show results obtained employing two Long Short-Term Memory (LSTM) architectures for the two RNNs. We then apply the NA model to a practical problem, predicting the diffusion of oxygen (and drugs) across the Blood-Brain Barrier (BBB) [1], to demonstrate its correctness and efficiency.

This paper is structured as follows. In Sect. 2 the Data Assimilation problem is described. Neural Assimilation is introduced in Sect. 3, where we investigate the accuracy of the proposed method and present a theorem demonstrating that the novel model is consistent with the KF result. Experimental results are provided in Sect. 4. Conclusions and future work are summarised in Sect. 5.

2 Data Assimilation

Data Assimilation (DA) is the approximation of the true state of some physical system at a given time by combining time-distributed observations o(t) with a dynamic model \(\dot{x}=\mathcal {M}(x,t)\) in an optimal way. DA can be classically approached in two ways: as variational DA [3] and as filtering. One of the best known tools for the filtering approach is the Kalman filter (KF) [15]. We seek to estimate the state x(t) of a discrete-time dynamic process governed by the linear difference equation

$$\begin{aligned} x(t)=M\ x(t-1)+w_t \end{aligned}$$
(1)

with an observation o(t):

$$\begin{aligned} o(t)=H\ x(t)+v_t \end{aligned}$$
(2)

Note that M and H are discrete operators. The random vectors \(w_t\) and \(v_t\) represent the modeling and the observation errors respectively. They are assumed to be independent, white-noise processes with normal probability distributions

$$\begin{aligned} w_t \sim \mathcal {N}(0,B_t), \qquad v_t \sim \mathcal {N}(0,R_t) \end{aligned}$$
(3)

where \(B_t\) and \(R_t\) are the covariance matrices of the modeling and observation errors respectively. All these assumptions about unbiased and uncorrelated errors (in time and between each other) are not limiting, since extensions of the standard KF can be developed should any of these not be valid [5]. The KF problem can be summarised as follows: given a background estimate x(t) of the system state at time t, what is the best analysis z(t) based on the currently available observation o(t)?

The typical assimilation scheme is made up of two major steps: a prediction step and a correction step. At time t we have the result of the previous forecast, x(t), and an ensemble of observations, o(t). Based on these two vectors, we perform an analysis that produces z(t). We then use the evolution model to obtain a prediction of the state at time \(t+1\). The result of the forecast at the prediction step is denoted by \(x(t+1)\):

$$\begin{aligned} x(t+1)=M z(t), \end{aligned}$$
(4)
$$\begin{aligned} B_{t+1}=M \left( (I-K_{t} H )B_{t} \right) M^T, \end{aligned}$$
(5)

and becomes the background for the next correction time step:

$$\begin{aligned} K_{t+1}=B_{t+1} H^T (H B_{t+1} H^T+R_{t+1})^{-1}, \end{aligned}$$
(6)
$$\begin{aligned} z(t+1)=x(t+1)+K_{t+1} \left( o(t+1)-H x(t+1)\right) , \end{aligned}$$
(7)

We observe that, in case the observed data are defined in the same space as the state variable, the operator H in (2) is the identity matrix and Eqs. (6)–(7) simplify to:

$$\begin{aligned} K_{t+1}=B_{t+1} ( B_{t+1} +R_{t+1})^{-1}, \end{aligned}$$
(8)
$$\begin{aligned} z(t+1)=x(t+1)+K_{t+1} \left( o(t+1)- x(t+1)\right) , \end{aligned}$$
(9)

Due to the high computational cost of updating the covariance matrices \(B_t\) by Eq. (5), in operational DA it is common to assume \(B_t=B_{t+1}\) \(\forall t\). This assumption leads to a model which is also called Optimal Interpolation [16].
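For illustration, the prediction-correction cycle of Eqs. (4), (8) and (9) under this Optimal Interpolation assumption (fixed B, H = I) can be sketched in a few lines of NumPy; the function and variable names below are ours and purely illustrative.

```python
import numpy as np

def oi_cycle(x0, observations, M, B, R):
    """Prediction-correction cycle with fixed background covariance B and H = I."""
    K = B @ np.linalg.inv(B + R)       # Kalman gain, Eq. (8), constant in time
    x = x0                             # background state at the first time step
    analyses = []
    for o in observations:             # observations o(t), t = t0, ..., t1
        z = x + K @ (o - x)            # correction step, Eq. (9)
        analyses.append(z)
        x = M @ z                      # prediction step, Eq. (4)
    return np.array(analyses)
```

Here x0, M, B and R would be provided by the forecasting model and by the error statistics of Eq. (3).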

Statistically, the KF seeks a solution with minimum variance. This approach, though often complex and time-consuming, can provide a rich information structure (often richer than the information provided by variational DA), such as an average and some characteristics of its variability (probability distribution). In order to maintain the accuracy of the KF solution and reduce the execution time, we introduce, in the next section, Neural Assimilation (NA), a network that represents the KF but is much faster than a KF prediction-correction cycle.

3 Neural Assimilation

For a fixed time window \([t_0, t_1]\) and a fixed discretisation time step \(\varDelta t\), let x(t) still denote the forecasting result at each time step \(t \in [t_0, t_1]\), and let o(t) denote an observation of the state value (Fig. 1). As it does not affect the generality of our study, we assume here that the observed data are defined in the same space as the state variable, i.e. the operator H in (2) is the identity matrix.

Fig. 1. Available data in the fixed time window.

Given the data sets \(\{x(t)\}_{t\in [t_0,t_1]}\) and \(\{o(t)\}_{t\in [t_0,t_1]}\), the Neural Assimilation (NA) is a Coupled Neural Network (for temporal processing) as shown in Fig. 2, where:

  • a forecasting network NN\(_F\), a Recurrent Neural Network trained on forecasting data x(t) with an additional input provided by a second network NN\(_O\) trained on observed data o(t);

  • a second network NN\(_O\), a Recurrent Neural Network trained on observed data o(t) with an additional input provided by the forecasting network NN\(_F\).

A fundamental feature of each network is that it contains a feedback connection, so the activations can flow round in a loop. That enables the networks to do temporal processing and learn sequences with temporal prediction. The form of NA is an RNN with the previous set of hidden unit activations feeding back into the network along with the inputs.

Fig. 2. Neural Assimilation.

Note that the time t is discretised, with the activations updated at each time step. The time scale may correspond to any step size appropriate for the given problem. A delay unit provided by the network NN\(_F\) needs to be introduced to hold activations in NN\(_O\) until they are processed at the next time step, and vice versa. As with simple architectures and deterministic activation functions, learning is achieved using gradient descent procedures similar to those leading to the back-propagation algorithm for feed-forward networks.

The NA scheme is made up of two major steps: a pre-processing step and a training step. During the pre-processing step, the data set is normalised using the information we have about the error estimations and the error covariance matrices introduced in (3). To normalise, we use the inverses of the error covariance matrices, so that data with large covariance/variance receive a small weight [5, 16]. We set

$$\begin{aligned} \bar{x}(t)=B_t^{-1} x(t)\quad \text {and} \quad \bar{o}(t)=R_t^{-1}o(t). \end{aligned}$$
(10)
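As a minimal sketch (assuming that the covariance matrices \(B_t\) and \(R_t\) are available and invertible at every time step; names are illustrative), this pre-processing amounts to:

```python
import numpy as np

def normalise(x_seq, o_seq, B_seq, R_seq):
    """Weight the data by the inverse error covariances, Eq. (10):
    data with large covariance/variance receive a small weight."""
    x_bar = np.array([np.linalg.solve(B, x) for B, x in zip(B_seq, x_seq)])
    o_bar = np.array([np.linalg.solve(R, o) for R, o in zip(R_seq, o_seq)])
    return x_bar, o_bar
```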

The computed vectors \(\bar{x}(t)\) and \(\bar{o}(t)\) are the data used in the training step:

$$\begin{aligned} \bar{o}(t) = f_{O_O} \left( W_{HO_O} h(t-1) \right) \end{aligned}$$
(11)
$$\begin{aligned} h(t)=f_H\left( W_{IH} \bar{x}(t-1) +W_{HH} h(t -1)\right) \end{aligned}$$
(12)
$$\begin{aligned} \bar{x}(t) = f_{O_F} \left( W_{HO_F} h(t) \right) \end{aligned}$$
(13)

where the vectors \(\bar{x}(t-1)\) are the inputs, the matrices \(W_{IH}\), \(W_{HH}\), \(W_{HO_F}\) and \(W_{HO_O}\) are the four connection weight matrices, and \(f_H\), \(f_{O_F}\) and \(f_{O_O}\) are the hidden and output unit activation functions. The state of the dynamical system is a set of values that summarises all the information about the past behaviour of the system that is necessary to provide a unique description of its future behaviour, apart from the effect of any external factors. In this case the state is defined by the set of hidden unit activations h(t). Back-propagation Through Time for this algorithm is a natural extension of standard back-propagation that performs gradient descent on the complete unfolded network ([21], Chapter 5 of [6]). If the NA training sequence starts at time \(t_0\) and ends at time \(t_1\), the total cost function is simply the sum over time of the standard error function C(t) at each time step:

$$\begin{aligned} C_{total}=\sum _{t=t_0}^{t_1} C(t) \end{aligned}$$
(14)

where

$$\begin{aligned} C(t)= \frac{1}{2} \sum _{k=1}^n \left( (\bar{o}_{k}(t-1)-h_{k}(t-1))^2 + (\bar{x}_{k}(t)- h_{k}(t))^2\right) \end{aligned}$$
(15)

and n is the total number of training samples. The gradient descent weight updates have contributions from each time-step [19]:

$$\begin{aligned} \varDelta w_{ij} = - \eta \frac{\partial C_{total}(t_0,t_1) }{\partial w_{ij}} = - \eta \sum _{t=t_0}^{t_1} \frac{\partial C(t) }{\partial w_{ij}} \end{aligned}$$
(16)

where \(\eta \) is the learning rate [14]. The constituent partial derivatives \(\frac{\partial C(t) }{\partial w_{ij}}\) have contributions from the multiple instances of each weight

$$w_{ij}\in \left\{ W_{IH} ,W_{HH}, W_{HO_O} , W_{HO_F} \right\} $$

and depend on the inputs and hidden unit activations at previous time steps. The errors now have to be back-propagated through time as well as through the network [23].
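To make the recurrence concrete, the following NumPy sketch unrolls Eqs. (11)-(13) and accumulates the cost of Eqs. (14)-(15). It is illustrative only: the activation functions are arbitrary choices, the hidden size is taken equal to the state size as in Eq. (15), and in practice the gradients of Eq. (16) are obtained automatically by the deep-learning framework rather than coded by hand.

```python
import numpy as np

def na_unroll(x_bar, o_bar, W_IH, W_HH, W_HOF, W_HOO,
              f_H=np.tanh, f_OF=lambda a: a, f_OO=lambda a: a):
    """Unrolled NA recurrence (Eqs. (11)-(13)) with the total cost of Eqs. (14)-(15)."""
    T, n = x_bar.shape
    h = np.zeros(n)                                   # initial hidden state h(t0)
    o_hat, x_hat, cost = [], [], 0.0
    for t in range(1, T):
        o_hat.append(f_OO(W_HOO @ h))                 # Eq. (11): observation read-out from h(t-1)
        h_new = f_H(W_IH @ x_bar[t - 1] + W_HH @ h)   # Eq. (12): hidden state update
        x_hat.append(f_OF(W_HOF @ h_new))             # Eq. (13): forecasting read-out from h(t)
        cost += 0.5 * (np.sum((o_bar[t - 1] - h) ** 2)
                       + np.sum((x_bar[t] - h_new) ** 2))  # Eq. (15)
        h = h_new
    return np.array(o_hat), np.array(x_hat), cost     # cost is C_total of Eq. (14)
```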

We prove that the output function h(t) of the NA model corresponds to the solution of the Kalman filter with fixed covariance matrices, i.e. in its Optimal Interpolation version [16]. The following result holds.

Theorem 1

Let h(t) be the solution of NA given by Eqs. (10)–(16) and let z(t) denote the solution of the KF algorithm as defined in (9). We have

$$\begin{aligned} h(t)=z(t), \quad \forall t\in [t_0,t_1] \end{aligned}$$
(17)

Proof: By the definition of the \(L^2\) norm, and up to the constant factor \(\frac{1}{2}\), the loss function in (15) can be written as

$$\begin{aligned} C(t)=\Vert \bar{o}(t-1)-h(t-1) \Vert _2^2+ \Vert \bar{x}(t)- h(t) \Vert _2^2 \end{aligned}$$
(18)

then, from Eq. (1), and neglecting the numerical errors (which are already included in the data sets), Eq. (18) can be written as:

$$\begin{aligned} C(t)=\Vert \bar{o}(t-1)-h(t-1) \Vert _2^2+ \Vert M\ \bar{x}(t-1)- M\ h(t-1) \Vert _2^2 \end{aligned}$$
(19)

From the properties of the \(L^2\) norm, Eq. (19) can be written as

$$\begin{aligned} C(t)=&(\bar{o}(t-1)-h (t-1))^T (\bar{o}(t-1)-h(t-1)) \nonumber \\&+\,(M \bar{x}(t-1)- M h(t-1) )^T(M \bar{x}(t-1)- M h(t-1) ). \end{aligned}$$
(20)

To minimise this loss function, we compute the gradient

$$\begin{aligned} \nabla _{h(t-1)} C(t)= 2 (\bar{o}(t-1)-h(t-1)) + 2 M^T (M \bar{x}(t-1)- M\ h(t-1) ) \end{aligned}$$
(21)

where \(M^T\) denotes the adjoint operator of the linear operator M [7]. Setting \(\nabla _{h(t-1)} C(t)=0\), we have:

$$\begin{aligned} 2 h(t-1) = \bar{o}(t-1)+ \bar{x}(t-1) \end{aligned}$$
(22)

From the definitions of \(\bar{x}\) and \(\bar{o}\) in (10), Eq. (22) gives:

$$\begin{aligned} h(t-1) \left( B_{t-1}+R_{t-1}\right) = R_{t-1} x(t-1) + B_{t-1} o(t-1) \end{aligned}$$
(23)

Then, adding and subtracting the quantity \(B_{t-1} x(t-1)\) and merging the common factors, Eq. (23) becomes

$$\begin{aligned} h(t-1) \left( B_{t-1}+R_{t-1}\right) = x(t-1) \left( B_{t-1}+R_{t-1}\right) +B_{t-1}\left( o(t-1)-x(t-1)\right) \end{aligned}$$
(24)

Finally, setting \(Q_{t-1}=B_{t-1} \left( B_{t-1}+R_{t-1}\right) ^{-1}\), Eq. (24) gives:

$$\begin{aligned} h(t-1)= x(t-1) +Q_{t-1}\left( o(t-1)-x(t-1)\right) \end{aligned}$$
(25)

which is the expression of the KF solution \(z(t-1)\) in (9) for the time step \(t-1\) and for the case of observed data defined in the same space as the state variable (i.e. \(H=I\), where I is the identity matrix). \(Q_{t-1}\) is the Kalman gain matrix in (8).

Equation (25) in Theorem 1 establishes the consistency of NA with the KF.

In Sect. 4, we validate the results provided in this section. We also show that employing NA reduces the computational cost, making the execution less expensive.

4 Experimental Results

In this section we provide experimental results that demonstrate the applicability and efficiency of NA. In our experiments, NA is implemented by adopting the Long Short-Term Memory (LSTM) architecture for the two RNNs. The reason we use LSTMs is that they can store information outside the normal flow of the recurrent network, which makes it easier to couple the two networks. Also, LSTMs preserve the error that is backpropagated through time and layers, which is an important property for discrete forecasting models. A description of the NA we implemented is provided in Fig. 3.

Fig. 3. Implementation of Neural Assimilation.

The test case we consider is a numerical model to predict the oxygen diffusion across the Blood-Brain Barrier (BBB). Nevertheless, the model can be used for any drug by replacing the diffusion constant and the initial and boundary conditions [1]. The Blood-Brain Barrier protects the central nervous system: it controls the entry of compounds into the brain by restricting access for blood-borne compounds and facilitating access for nutrients. This protection makes it difficult to deliver therapeutic compounds to brain cells when they are affected by brain diseases such as Alzheimer's disease or autism [13]. The BBB is composed of endothelial cells connected by tight junctions. The main mechanisms allowing the transport of drugs across the membrane are passive transport, carrier-mediated transport, receptor-mediated transcytosis, and adsorption-mediated transcytosis [22]. Passive transport is the easiest method of drug transport for lipophilic molecules of low molecular size: it is a simple diffusion across the membrane without expenditure of energy or use of carrier proteins. Opioids and steroids are examples of drugs which can be passively diffused [4]. Assuming that the main transport mechanism is passive diffusion, the initial three-dimensional space problem can be reduced to a one-dimensional space problem. In fact, passive diffusion allows several simplifications: no reaction term, uniform movement in all directions and a single overall diffusion constant. Therefore, a 1D partial differential equation (PDE) such as (26), with one initial condition and two boundary conditions, is an accurate model for this problem [22], where 0 corresponds to the location at which the blood meets the Blood-Brain Barrier and \(L=400\) nm is the real average thickness of the Blood-Brain Barrier.

$$\begin{aligned} \left\{ \begin{array}{l} \frac{\partial x}{\partial t}=D \frac{\partial ^2x}{\partial y^2} \\ x(0,y)=x_{0,y} \\ x(t,0)= x_{t,0}\\ x(t,L)=x_{t,L} \end{array}\right. \end{aligned}$$
(26)

where \(t\in [0,10\,\mathrm{{ms}}]\) (ms denotes microsecond) and \(y\in [0,L]\). We consider that at time 0 there is no oxygen, i.e. \( x_{0,y}=0 \). Moreover, for our boundary conditions, we consider that we have a constant concentration of oxygen in the bloodstream and that at the interface of the barrier and the brain tissue all oxygen is consumed, i.e. \( x_{t,0}= 0.02945\) L/L blood and \( x_{t,L}=0 \). We assume the diffusivity of oxygen through the Blood-Brain Barrier to be \(3.24\times 10^{-5}\) cm\(^2\)/s [1].

Equation (26) is discretised by a second order central finite difference in space with \(\varDelta y=8\) nm and a backward Euler method in time with \(\varDelta t=0.1\) ms:

$$ -Fx_{i-1}^n+(1+2F)x_i^n-Fx_{i+1}^n=x_i^{n-1} $$

where \(F=D\frac{\varDelta t}{\varDelta y^2}\), \(i=1,\dots ,50\) and \(n=1,\dots , 100\). As it does not affect the generality of our study, in this paper we show results of NA using observed data o(t) provided in [1] by the analytical solution of (26) for the oxygen diffusion. Data sets for observed data can be found at http://cheminformatics.org/datasets/. The NA code and the pre-processed data can be downloaded using the link https://drive.google.com/drive/folders/1C_O-rk5wyqFsG5U-T7_vugBOddTPmOlY?usp=sharing.
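A minimal NumPy sketch of this implicit scheme, with grid sizes taken from the text and function names chosen for illustration, is:

```python
import numpy as np

def diffuse_bbb(F, n_points=51, n_steps=100, x_left=0.02945, x_right=0.0):
    """Backward Euler / second-order central scheme for Eq. (26):
    -F*x[i-1] + (1+2F)*x[i] - F*x[i+1] = x_old[i], with F = D*dt/dy**2."""
    m = n_points - 2                              # number of interior unknowns
    A = ((1 + 2 * F) * np.eye(m)
         - F * np.eye(m, k=1)
         - F * np.eye(m, k=-1))                   # tridiagonal system matrix
    x = np.zeros(n_points)                        # initial condition: no oxygen
    x[0], x[-1] = x_left, x_right                 # boundary conditions
    history = [x.copy()]
    for _ in range(n_steps):
        rhs = x[1:-1].copy()
        rhs[0] += F * x_left                      # boundary values moved to the RHS
        rhs[-1] += F * x_right
        x[1:-1] = np.linalg.solve(A, rhs)
        history.append(x.copy())
    return np.array(history)                      # forecasting data x(t) on the space grid
```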

The NA network has been trained using \(85\%\) of the data and tested on the remaining \(15\%\). Figure 4 shows the values of the loss function for training and testing the forecasting network.

Fig. 4. Values of the loss function.

NA has been compiled as a sequential neural network with a single LSTM layer of 48 units, using the mean squared error as loss function and Adam as optimiser; a minimal Keras sketch of this configuration is given after the list below. Weights are automatically initialised by Keras using:

  • Glorot uniform for the kernel weights matrix for the linear transformation of the inputs;

  • Orthogonal for the linear transformation of the recurrent state.
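The sketch referenced above might look as follows. It is illustrative only: the sequence length and feature dimension are placeholders, and the final dense read-out layer is an assumption used here to map the 48 hidden units back to the state dimension; it is not taken from the released code.

```python
# A minimal sketch of one NA branch with the configuration described above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def build_branch(timesteps, n_features):
    model = Sequential([
        LSTM(48,                                    # single LSTM layer of 48 units
             input_shape=(timesteps, n_features),
             kernel_initializer="glorot_uniform",   # Keras default for the input kernel
             recurrent_initializer="orthogonal"),   # Keras default for the recurrent state
        Dense(n_features),                          # read-out to the state dimension (assumption)
    ])
    model.compile(loss="mse", optimizer="adam")     # mean squared error loss + Adam optimiser
    return model
```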

Fig. 5. Temporal evolution of the concentration at (a) \(y=12\) nm and (b) \(y=35\) nm.

Figure 5 shows the temporal evolution of the concentration at \(y=12\) nm (Fig. 5a) and \(y=35\) nm (Fig. 5b). The accuracy of the NA results is evaluated by the absolute error

$$\begin{aligned} e_{NA}(t,y)=\vert z(t,y)-h(t,y) \vert \end{aligned}$$
(27)

and the mean squared error

$$\begin{aligned} MSE(h(t,y)) = \frac{\Vert z(t,y)-h(t,y) \Vert _{L^2}}{\Vert z(t,y) \Vert _{L^2}} \end{aligned}$$
(28)

where z(t, y) is the solution of KF performed at each time step. Table 1 shows values of the absolute error computed every 10 time steps. We can see that the order of magnitude of the error is between \(e{-}07\) and \(e{-}04\). The corresponding values of the mean squared error are \(MSE(h(t,y))=1.31e{-}07\) for \(y=12\) nm and \(MSE(h(t,y))=8.16e{-}08\) for \(y=35\) nm, where \(t\in [0,0.10\,\mathrm{{ms}}]\). Figure 6 shows the comparison of the KF result and the NA result for the temporal evolution of the concentration at each point of the BBB we are modelling. Values of the execution time are provided in Table 2. The values are computed as the mean of the execution times over 100 runs. We can observe that the prediction with NA is 1000 times faster than the prediction with KF.
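For reference, a minimal sketch of these two metrics (Eqs. (27)-(28)) is:

```python
import numpy as np

def absolute_error(z, h):
    """Pointwise absolute error, Eq. (27)."""
    return np.abs(z - h)

def relative_l2_error(z, h):
    """Eq. (28): L2 misfit normalised by the L2 norm of the KF solution."""
    return np.linalg.norm(z - h) / np.linalg.norm(z)
```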

Table 1. Error computed every 10 time steps at (a) \(y=12\) nm and (b) \(y=35\) nm
Table 2. Execution time for 100 time steps and all the distances
Fig. 6. Comparison between Data Assimilation (KF) and the Neural Network version for \(t\in [0,10\,\mathrm{{ms}}]\) (ms denotes microsecond) and \(y\in [0,400\,\mathrm{{nm}}]\).

Finally, Table 3 shows the values of mean square forecasting error:

$$\begin{aligned} MSE^{F}(x(t,y)) = \frac{\Vert x(t,y)-o(t,y) \Vert _{L^2}}{\Vert o(t,y) \Vert _{L^2}} \end{aligned}$$
(29)

and mean square assimilation error:

$$\begin{aligned} MSE^{NA}(h(t,y)) = \frac{\Vert h(t,y)-o(t,y) \Vert _{L^2}}{\Vert o(t,y) \Vert _{L^2}} \end{aligned}$$
(30)

computed with respect to the observations o(t, y). The values of the errors in the assimilation results show a reduction of approximately one order of magnitude with respect to the errors in the forecasting data.

Table 3. Mean square forecasting error \(MSE^F\) and mean square assimilation error \(MSE^{NA}\) computed every 10 time steps at \(y=12\) nm

5 Conclusions and Future Works

We introduced a new neural network for Data Assimilation (DA) that we named Neural Assimilation (NA). We proved that the solution of NA is the same as that of the KF. We tested the validity of the theoretical results by showing values of the misfit between the solution of NA and the solution of KF for the same test case. We provided experimental results on a realistic test case studying oxygen diffusion across the Blood-Brain Barrier. NA is trained on both forecasting and observed data, and it is used for predictions without needing a correction based on the information provided by observations. This avoids the prediction-correction cycle of a Kalman filter and makes the assimilation process very fast: we show that the prediction with NA is 1000 times faster than the prediction with KF. An implementation of NA to emulate variational DA [10] will be developed as future work. In particular, we will focus on a 4D variational (4DVar) method [5]. 4DVar is a computationally expensive method, as it is designed to assimilate several observations (distributed in time) for each time step of the forecasting model. We will develop an extended version of NA able to assimilate sets of distributed observations for each time step and, thus, able to emulate 4DVar.