1 Introduction

In order to improve the reconstruction and prediction of dynamical systems with uncertainties, data assimilation (DA) techniques, originally developed in numerical weather prediction (NWP) [1] and geosciences [2], are widely applied to industrial problems, such as hydrology [3], wildfire forecasting [4], drought monitoring [5] and nuclear engineering [6]. DA algorithms aim to find the optimal approximation (also known as the analyzed state) of the state variables (usually representing a physical field of interest, such as velocity or temperature), relying on prior estimations and real-time observations, both assumed to be noisy. Due to the large problem dimension (often ranging from \(10^6\) to \(10^{10}\) in NWP and geoscience problems), prior errors are assumed to be Gaussian distributed for the sake of simplicity [7]. As a consequence, the prior error distribution can be fully characterized by its first (mean) and second (covariance) moments. The output of a DA algorithm is determined through an optimization function in which the weights given to prior simulations and observations are set by the associated error covariance matrices, respectively named the background and observation covariances. These error covariance matrices thus provide crucial information in DA algorithms [8], not only for the estimation of the analyzed state but also for specifying posterior error distributions [9]. The prior errors represented by these matrices, especially observation errors, consist of an ensemble of different sources of noise/uncertainty, including model error, instrument noise, and representativity error [10, 11].

In statistics, the covariance matrix of a random vector is often obtained via empirical estimation, where a sufficient number of simultaneous samples is required to avoid estimation bias [12]. Moreover, when the number of samples is smaller than the problem dimension, the estimated covariance will be rank deficient. In DA problems, the high dimensionality and the lack of simultaneous data (i.e., several background or observation trajectories in the same time window) represent significant obstacles to covariance computation [13]. To overcome these difficulties, we often rely on calibration (e.g., least-squares) methods based on some generic correlation kernels, often with homogeneous and isotropic characteristics [14]. Balance operators can be employed for multivariate systems [15]. In terms of correlation kernels, the family of Matérn functions, including the exponential kernel (Matérn 1/2), the Balgovind kernel (Matérn 3/2, also known as the second-order auto-regressive (SOAR) function), and the Gaussian kernel (the limit of the Matérn family as the smoothness parameter tends to infinity), is often prioritized for covariance computation owing to its smoothness and capability to capture spatial correlations in physical processes [10, 16, 17]. Other stationary covariance models involve, for instance, convolution formulations [18] or diffusion-based operators [19], both of which contribute to an efficient storage of the covariance matrices. However, limited by homogeneous and isotropic assumptions, it remains cumbersome to represent complex spatial correlations (often multidimensional and multivariate) using these one-dimensional kernels.

In this study, we develop and test a novel data-driven approach based on recurrent neural networks (RNNs) to improve both the accuracy and the efficiency of observation covariance specification in dynamical data assimilation problems. The novel approach is tested and compared with two state-of-the-art covariance tuning algorithms in two different twin experiments, one with non-parametric and one with parametric covariance estimation.

The paper is organized as follows. In Sect. 2, we introduce the related work for error covariance specification. The problem statement and the contribution of this paper are described in Sect. 3. Data assimilation techniques and the ensemble methods are introduced briefly in Sect. 4. We then describe traditional posterior covariance tuning algorithms DI01 and D05 in Sect. 5. The novel LSTM-based method is introduced in Sect. 6, followed by the comparison in the Lorenz (Sect. 7) and the shallow water twin experiments (Sect. 8). We close the paper with a discussion in Sect. 9.

2 Related work

To gain a clearer insight into covariance evolution, some ensemble-based methods, such as the NMC method [1] and the EnKF [20], have been developed to provide a non-parametric covariance estimation. These methods depend on the propagation of an ensemble of simulated trajectories, initialized either at different forecasting time steps (NMC) or by adding artificially set perturbations to the current state (EnKF). These methods are more appropriate for modeling the background matrix rather than the observation matrix. The latter, independent of the numerical simulations, cannot be represented by the propagation of artificially added noise. The Particle-Aided Unscented Kalman Filter [21] can estimate systems with high nonlinearity with a real-time update of the background matrix. However, the observation matrix cannot be estimated directly via the Particle-Aided Unscented Kalman Filter. In practice, the observation matrix is often set to be diagonal or spatially isotropic for the sake of simplicity (e.g., [22]). However, it is shown in the work of [10] that well-specified correlated observation covariances can significantly improve the performance of DA algorithms.

Several methods of data-derived posterior diagnosis have also been developed based on the analysis of innovation quantities, which consist of the difference between the observations and the projected background/analyzed state in the observation space. As a strong contributor to this topic, the meteorology community has developed several well-known posterior diagnoses and their improved versions [23,24,25] to adjust the background/observation ratio, the correlation scale length, or the full covariance structure in the observation space (both the observation matrix and the projected background matrix). Some iterative processes [26, 27] based on fixed-point theory have also been proposed for error covariance tuning. Recent works [28, 29] have proved the convergence of the so-called “Desroziers iterative method” [24] (also known as D05) in the ideal case. In brief, they have mathematically proved that, starting from a semi-positive definite matrix as an initial guess, the D05 iterative method converges to the exact time-invariant (at least over a sufficiently long time period) observation error covariance when the background matrix and the transformation operator (which maps the state variables to real-time observations) are perfectly known a priori. On the other hand, it is also mentioned in [29] that a regularization step is necessary in practice when applying D05, and the convergence of the regularized iterations remains an open question [3, 29]. To deal with time-varying systems, lag-innovation statistics are used for error covariance estimation [30]. The essential idea is to build a secondary Kalman-filtering process for adjusting error covariances using time-shifted innovation vectors. For more details on innovation-based methods, we refer to the overview of [13], which also covers some other estimation approaches, such as the family of likelihood-based approaches and expectation-maximization (EM) methods.

3 Problem statement and contribution

Our work considers a similar setting to [24, 29], where both the state forward model and the transformation operator are presumed to be well known. As the main difficulty concerns the non-synchronous, time-variant observations in dynamical systems (which prevents empirical estimation), in this work we propose the use of recurrent neural networks (RNNs) [31] for the specification of the observation matrix from the underlying dynamics of the observed quantities. RNNs have been widely adopted for the prediction/reconstruction of dynamical systems, especially in natural language processing (NLP) [32] and image/video processing [33], due to their convincing capacity for dealing with time series. More recently, RNNs have also made their way into other engineering fields such as biomedical applications and computational fluid mechanics [34]. In general, the combination of deep learning and data assimilation methods [35, 36] has been widely adopted and analyzed in a variety of industrial applications, including air pollution [37] and ocean-atmosphere modeling [38]. A convolutional neural network (CNN) for covariance estimation has also been suggested in the work of [39]. In this study, we propose a novel methodology for LSTM-based covariance estimation which can be easily integrated into any DA schema for dynamical systems. We first construct a set of training covariance matrices, either parametric or non-parametric, within a certain range defined a priori. For each matrix in the training set, we then simulate a dynamical trajectory of the observation vector relying on the knowledge of the forward model, where the noise at each time step is generated following a centered Gaussian distribution characterized by the error covariance. These trajectories are later used as input variables to train the long short-term memory (LSTM) RNN regression model, where the time-invariant observation matrices stand for the learning target. For the online evaluation, only the historical observation data is needed to predict the error covariances. Compared to traditional posterior tuning methods [24, 40], which require several applications of DA algorithms, the proposed machine learning (ML) method can be much more computationally efficient for real-time covariance estimation. Moreover, unlike most of the traditional methods, no prior knowledge concerning either the background or the observation matrix is necessary for the proposed ML approach. For example, DI01 [23] requires precise knowledge of the correlation structures of both the background and observation matrices, while D05 [24] makes use of perfect knowledge of the background covariance.

In order to make a comprehensive comparison with traditional methods, two different twin experiment frameworks are implemented in this paper, using respectively the Lorenz96 and the 2D shallow-water models. The Lorenz system, characterized by only three state variables, is associated with a non-parametric covariance modeling, while an isotropic correlation kernel is used to parameterize the observation matrix in the shallow water dynamics. In both cases, we compare the performance of the proposed LSTM-based method against the state-of-the-art tuning approaches D05 and DI01 in terms of both covariance specification and posterior DA accuracy. An ensemble DA schema is used to estimate the time-variant background matrix for each of these methods.

4 Data assimilation

4.1 Principle of data assimilation

The objective of data assimilation algorithms is to bring the estimation of the system state \(\mathbf{x}\) as close as possible to its true value \(\mathbf{x}_{\mathrm{true}}\), also known as the true state, by taking advantage of two sources of information: the prior estimation or forecast \(\mathbf{x}_b\), also called the background state, and the measurement or observation \(\mathbf{y}\). DA algorithms aim to find an optimally weighted compromise between \(\mathbf{x}_b\) and \(\mathbf{y}\) by minimizing the loss function J defined as:

$$\begin{aligned} J(\mathbf{x})&=\frac{1}{2}(\mathbf{x}-\mathbf{x}_b)^T\mathbf{B}^{-1} (\mathbf{x}-\mathbf{x}_b) + \frac{1}{2}(\mathbf{y}-{\mathcal {H}} (\mathbf{x}))^T \mathbf{R}^{-1} (\mathbf{y}-{\mathcal {H}}(\mathbf{x})) \end{aligned}$$
(1)
$$\begin{aligned}&=\frac{1}{2}||\mathbf{x}-\mathbf{x}_b||^2_{\mathbf{B}^{-1}} +\frac{1}{2}||\mathbf{y} -{\mathcal {H}}(\mathbf{x})||^2_{\mathbf{R}^{-1}}, \end{aligned}$$
(2)

where \({\mathcal {H}}\) denotes the transformation operator from the state space to observation space. \(\mathbf{B}\) and \(\mathbf{R}\) are, respectively, the background and the observation error covariance matrices, i.e.

$$\begin{aligned} \mathbf{B} = \text {Cov}(\epsilon _b, \epsilon _b), \quad \mathbf{R} = \text {Cov}(\epsilon _y, \epsilon _y), \end{aligned}$$
(3)

where

$$\begin{aligned} \epsilon _b = \mathbf{x}_b - \mathbf{x}_{\mathrm{true}}, \quad \epsilon _y = {\mathcal {H}}(\mathbf{x}_{\mathrm{true}})-\mathbf{y}. \end{aligned}$$
(4)

Errors \(\epsilon _b, \epsilon _y\) are assumed to be centered Gaussian, following:

$$\begin{aligned} \epsilon _b \sim {\mathcal {N}} (0, \mathbf{B}), \quad \epsilon _y \sim {\mathcal {N}} (0, \mathbf{R}). \end{aligned}$$
(5)

In Eq. 1, the first term incorporates the prior information \(\mathbf{x}_b\), while the second term penalizes the difference between the observation \(\mathbf{y}\) and the state variables mapped to the observation space, \({\mathcal {H}}(\mathbf{x})\). Both terms are weighted by the inverse of the corresponding error covariance matrix (\(\mathbf{B}^{-1}\), \(\mathbf{R}^{-1}\)) to reflect the confidence in each source of information.

The optimization problem of Eq. 1, the so-called three-dimensional variational (3D-Var) formulation, is a general representation of variational assimilation which does not take model error into account. The output of this minimization is denoted as \(\mathbf{x}_a\), i.e.

$$\begin{aligned} \mathbf{x}_a = \underset{\mathbf{x}}{\mathrm{argmin}} (J(\mathbf{x})). \end{aligned}$$
(6)

If the observation operator is linear, represented by the matrix \(\mathbf{H}\), Eq. 6 can be solved via the BLUE (Best Linear Unbiased Estimator) formulation:

$$\begin{aligned} \mathbf{x}_a&= \mathbf{x}_b+\mathbf{K}(\mathbf{y} -\mathbf{H} \mathbf{x}_b), \end{aligned}$$
(7)
$$\begin{aligned} \mathbf{A}&= (\mathbf{I}-\mathbf{K} \mathbf{H})\mathbf{B}, \end{aligned}$$
(8)

where \(\mathbf{A} = \text {Cov}(\mathbf{x}_a-\mathbf{x}_{\mathrm{true}})\) is the analyzed error covariance, and \(\mathbf{K}\) is the Kalman gain matrix described by

$$\begin{aligned} \mathbf{K}=\mathbf{B} \mathbf{H}^T (\mathbf{H} \mathbf{B} \mathbf{H}^T+\mathbf{R})^{-1}. \end{aligned}$$
(9)
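For concreteness, the following is a minimal numpy sketch of the BLUE analysis step (Eqs. 7-9), assuming a linear observation operator given as a matrix \(\mathbf{H}\); the function and variable names are illustrative.

```python
import numpy as np

def blue_analysis(x_b, y, H, B, R):
    """BLUE analysis step (Eqs. 7-9) for a linear observation operator H.

    x_b : background state, shape (n,)
    y   : observation vector, shape (m,)
    H   : observation operator, shape (m, n)
    B, R: background / observation error covariances, (n, n) and (m, m)
    """
    # Kalman gain K = B H^T (H B H^T + R)^{-1}  (Eq. 9)
    S = H @ B @ H.T + R
    K = B @ H.T @ np.linalg.inv(S)
    # Analyzed state and analyzed error covariance (Eqs. 7-8)
    x_a = x_b + K @ (y - H @ x_b)
    A = (np.eye(B.shape[0]) - K @ H) @ B
    return x_a, A, K
```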

In the rest of this paper, we consider \(\mathbf{H}\) to be a linear transformation operator. Nevertheless, it is usually more challenging to find the minimum of Eq. 1 when \({\mathcal {H}}\) is non-linear, and even more so when the state is high-dimensional. The minimization then often involves gradient descent algorithms (such as L-BFGS-B [41]) or adjoint-based [42] numerical techniques.

DA algorithms can be applied to dynamical systems through sequential applications, expressed by a transition operator \({\mathcal {M}}_{t^k \rightarrow t^{k+1}}\) (from discrete time \(t^{k}\) to \(t^{k+1}\)), where

$$\begin{aligned} \mathbf{x}_{t^{k+1}} = {\mathcal {M}}_{t^k \rightarrow t^{k+1}} (\mathbf{x}_{t^k}). \end{aligned}$$
(10)

\(\mathbf{x}_{b,t^{k+1}}\) thus depends on the knowledge of \({\mathcal {M}}_{t^k \rightarrow t^{k+1}}\) and the DA correcting state \(\mathbf{x}_{a,t^k}\), i.e.,

$$\begin{aligned} \mathbf{x}_{b,t^{k+1}} = {\mathcal {M}}_{t^{k} \rightarrow t^{k+1}} (\mathbf{x}_{a,t^{k} }) \end{aligned}$$
(11)

Obviously, the more accurate \(\mathbf{x}_{a,t^{k}}\) is, the more reliable \(\mathbf{x}_{b,t^{k+1}}\) will be.

To leverage the information embedded in the background state and the observations, the modeling of the covariance matrices is a pivotal point in DA, as they influence not only how prior errors are spread but also the DA results themselves [26].

4.2 Ensemble methods

Ensemble data assimilation (EnDA) [43, 44] methods have shown strong performance in dealing with non-linear chaotic DA systems by creating an ensemble of size M of the system state, denoted \(\{\mathbf{x}_{t^{k}}^{(i)}|1\le i \le M\}\). The ensemble is used to represent both the prior and the posterior probability distribution of the state variables. The ensemble members evolve under \({\mathcal {M}}_{t^{k} \rightarrow t^{k+1}}\) and DA is applied to each of them at every assimilation window. Furthermore, instead of explicitly evolving the system to obtain the \(\mathbf{B}\) matrix, which is a time-consuming and computationally expensive process for large state dimensions, \(\mathbf{B}\) is estimated as a sample covariance:

$$\begin{aligned} \mathbf{B}_{b,t^{k}} \approx \frac{1}{M-1}\sum _{i=1}^{M} (\mathbf{x}_{b,t^{k}}^{(i)}-\overline{\mathbf{x}}_{b,t^{k}}) (\mathbf{x}_{b,t^{k}}^{(i)}-\overline{\mathbf{x}}_{b,t^{k}})^{T}, \end{aligned}$$
(12)

where \(\overline{\mathbf{x}}_{b,t^{k}}=\frac{1}{M} \sum _{i=1}^{M} \mathbf{x}_{b,t^{k}}^{(i)}\), and the estimation becomes more reliable as M increases. For the applications in this study, EnDA, with a sufficiently large ensemble, is used to estimate \(\mathbf{x}_{b,t^{k}}\) and \(\mathbf{B}_{b,t^{k}}\) so that we can focus on the comparison of \(\mathbf{R}\) matrix modelings.
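A minimal numpy sketch of the sample estimation of \(\mathbf{B}\) in Eq. 12 is given below, where each row of the input array is one ensemble member; the function name is illustrative.

```python
import numpy as np

def ensemble_background_covariance(X_b):
    """Sample estimate of B (Eq. 12) from an ensemble of background states.

    X_b : array of shape (M, n), one background member per row.
    """
    M = X_b.shape[0]
    x_mean = X_b.mean(axis=0)                 # ensemble mean (1/M sum)
    anomalies = X_b - x_mean                  # shape (M, n)
    return anomalies.T @ anomalies / (M - 1)  # unbiased sample covariance
```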

4.3 Observation error covariances specification

For the estimation of \(\mathbf{R}\), under the assumption that the system model is stationary, a wide variety of methods have been explored, for example, the DI01 method [23], which adjusts the ratio between \(Tr(\mathbf{B})\) and \(Tr(\mathbf{R})\), and the D05 approach [24], which iteratively estimates the full covariance matrix in the observation space. However, these methods, based on posterior innovation quantities (i.e., \(\mathbf{y}-{\mathcal {H}}(\mathbf{x}_a)\)) which require several applications of DA algorithms, can be computationally expensive. Moreover, these tuning methods, especially D05 which estimates the full matrix, are not suitable for different matrix parameterizations. In this paper, working with time-series observation data, we use LSTM to predict the corresponding \(\mathbf{R}\) matrix under assumptions similar to those of DI01 and D05. The two classical methods, introduced in Sect. 5, are implemented to compare their results with the proposed machine learning approach.

5 Posterior covariance tuning algorithms

5.1 Desroziers and Ivanov (DI01) tuning algorithm

Because \(\mathbf{B}\) and \(\mathbf{R}\) determine the weights of the background and observation information in the loss function (Eq. 1), the knowledge of \(\text {Tr}(\mathbf{B})\) and \(\text {Tr}(\mathbf{R})\) is crucial to DA accuracy. The DI01 tuning algorithm [23], relying on the diagnosis of innovation quantities, has been widely adopted in meteorology [28, 45] and geoscience [46]. Consecutive works have been carried out to improve its performance and feasibility in problems of large dimension [47]. Without modifying the error correlation structures, DI01 adjusts the prior error amplitudes by applying an iterative fixed-point procedure.

As demonstrated in [23, 48], when \(\mathbf{B}\) and \(\mathbf{R}\) are perfectly specified,

$$\begin{aligned} {\mathbb {E}}\left[ J_b(\mathbf{x}_a) \right]&= \frac{1}{2} {\mathbb {E}}\left[ (\mathbf{x}_a -\mathbf{x}_b)^T \mathbf{B}^{-1}(\mathbf{x}_a-\mathbf{x}_b) \right] \nonumber \\&=\frac{1}{2} \text {Tr}(\mathbf{K}\mathbf{H}), \end{aligned}$$
(13)
$$\begin{aligned} {\mathbb {E}} \left[ J_o(\mathbf{x}_a) \right]&= \frac{1}{2} {\mathbb {E}}\left[ (\mathbf{y} -\mathbf{H}\mathbf{x}_a)^T \mathbf{R}^{-1}(\mathbf{y}- \mathbf{H}\mathbf{x}_a)\right] \nonumber \\&=\frac{1}{2} \text {Tr}(\mathbf{I}-\mathbf{H}\mathbf{K}), \end{aligned}$$
(14)

where \(\mathbf{H}\) is a linearized observation operator. Based on Eqs. 13 and 14 it is possible to iteratively correct the magnitudes of \(\mathbf{B}\) and \(\mathbf{R}\), following

$$\begin{aligned} \mathbf{B}_{q+1}=s_{b,q} \mathbf{B}_q, \quad \mathbf{R}_{q+1}=s_{o,q} \mathbf{R}_q, \end{aligned}$$
(15)

using the two indicators

$$\begin{aligned} s_{b,q}&=\frac{2J_b(\mathbf{x}_a)}{\text {Tr}(\mathbf{K}_q \mathbf{H})}, \end{aligned}$$
(16)
$$\begin{aligned} s_{o,q}&=\frac{2J_o(\mathbf{x}_a)}{\text {Tr}(\mathbf{I}-\mathbf{H}\mathbf{K}_q)}, \end{aligned}$$
(17)

where q is the current iteration.

Acting as scaling coefficients, the sequences \(\{ s_{b,q}\}\) and \(\{s_{o,q}\}\) modify the error variance magnitudes in the iterative process. It is worth noting that both the analyzed state \(\mathbf{x}_a\) and the gain matrix \(\mathbf{K}_q\) are obtained using \(\mathbf{B}_q\) and \(\mathbf{R}_q\), which depend on \(s_{b,q}\) and \(s_{o,q}\). When the correlation patterns of both \(\mathbf{B}\) and \(\mathbf{R}\) are well known, DI01 is equivalent to a maximum-likelihood parameter tuning, as pointed out in [28, 47].

Unlike other posterior covariance diagnoses/computations, such as [24, 26], the estimation of the full matrices is not needed in DI01. Instead, only the estimation of two scalar values (\(J_b,J_o\)) is required, which significantly reduces the computational cost. As a consequence, this method can be more appropriate for online covariance tuning.
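A minimal sketch of the DI01 fixed-point iteration (Eqs. 15-17) is given below, assuming a linear observation operator and a single BLUE analysis per iteration; the function name and default iteration number are illustrative.

```python
import numpy as np

def di01_tuning(x_b, y, H, B, R, n_iter=2):
    """Sketch of the DI01 fixed-point tuning (Eqs. 15-17) for a linear H."""
    m = R.shape[0]
    for _ in range(n_iter):
        # Analysis with the current B_q and R_q
        K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
        x_a = x_b + K @ (y - H @ x_b)
        # Observed cost-function values at the analysis
        J_b = 0.5 * (x_a - x_b) @ np.linalg.inv(B) @ (x_a - x_b)
        J_o = 0.5 * (y - H @ x_a) @ np.linalg.inv(R) @ (y - H @ x_a)
        # Scaling indicators (Eqs. 16-17) and covariance rescaling (Eq. 15)
        s_b = 2.0 * J_b / np.trace(K @ H)
        s_o = 2.0 * J_o / np.trace(np.eye(m) - H @ K)
        B, R = s_b * B, s_o * R
    return B, R
```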

5.2 Desroziers iterative method (D05) in the observation space

The Desroziers diagnosis (D05) [24], based on prior and posterior state-observation residuals, has been widely applied in engineering problems, including numerical weather prediction [28] and hydrology [3]. The work of [24] proved that when \(\mathbf{B}\) and \(\mathbf{R}\) are well known a priori, the analysis and background residuals satisfy, in expectation:

$$\begin{aligned} {\mathbb {E}}\left( [\mathbf{y}-{\mathcal {H}}(\mathbf{x}_a)][\mathbf{y} -{\mathcal {H}}(\mathbf{x}_b)]^T\right) = \mathbf{R}. \end{aligned}$$
(18)

The difference between the left-hand side and the right-hand side of Eq. 18,

$$\begin{aligned} || \mathbf{R} - {\mathbb {E}}\left( [\mathbf{y}-{\mathcal {H}} (\mathbf{x}_a)][\mathbf{y}-{\mathcal {H}}(\mathbf{x}_b)]^T\right) ||_F, \end{aligned}$$
(19)

can be used as a validation indicator of the \(\mathbf{R}\) matrix estimation, where \(||.||_F\) denotes the Frobenius norm. With this method, time-variant observation/background data can contribute to the estimation of the \(\mathbf{R}\) matrix because the expectation in Eq. 18 can be evaluated using residuals at different time steps. When the \(\mathbf{B}\) matrix is well known, an iterative process has been introduced to estimate the \(\mathbf{R}\) matrix:

$$\begin{aligned} \mathbf{R}_{q+1} = {\mathbb {E}}\left( [\mathbf{y}-{\mathcal {H}} (\mathbf{x}_{a,q})][\mathbf{y}-{\mathcal {H}}(\mathbf{x}_b)]^T\right) , \end{aligned}$$
(20)

based on fixed-point theory [29]. The current analysis state \(\mathbf{x}_{a,q}\) is obtained using the specification of \(\mathbf{R}_q\), while \(\mathbf{x}_b\), \(\mathbf{B}\) and \(\mathbf{y}\) remain invariant. As proved in [28, 29], under the assumption of sufficient observation data and a well known \(\mathbf{B}\) matrix, the iterative process of Eq. 20 converges to the exact observation error covariance. However, as shown in [29], the intermediate matrices \(\mathbf{R}_q\) can be non-symmetric and possibly contain negative or complex eigenvalues, which is cumbersome for DA algorithms to deal with.

In practice, a posterior regularization at each iteration step is often required to ensure the positive definiteness of \(\mathbf{R}_q\) [29] where the first step of the regularization could be symmetrizing the estimated \(\mathbf{R}_q\) matrix, i.e.,

$$\begin{aligned} \mathbf{R}_q \longleftarrow \frac{1}{2} (\mathbf{R}_q + \mathbf{R}^T_q). \end{aligned}$$
(21)

The spectrum of \(\mathbf{R}_q\) now contains only real numbers, but they are not necessarily positive. The hybrid method [2] is a standard approach in ensemble-based DA methods to ensure positive definiteness; it consists of combining a covariance matrix \(\mathbf{C}\) defined a priori with the one obtained from empirical estimation. We thus obtain the formulation of the regularized observation matrix:

$$\begin{aligned} \mathbf{R}_{q} \longleftarrow (1-\mu )\mathbf{R}_q + \mu \mathbf{C}, \end{aligned}$$
(22)

following Eq. 21 with \(\mu \in (0,1)\). The matrix \(\mathbf{C}\) is often set as a diagonal matrix since it helps to enhance the matrix conditioning. In this work, we choose to set

$$\begin{aligned} \mu = 0.2 \quad \text {and} \quad \mathbf{C}= \frac{Tr(\mathbf{R}_q) \times \mathbf{I}}{\mathrm{dim}(\mathbf{y})}, \end{aligned}$$
(23)

so that \(Tr(\mathbf{R}_q)\) is not modified by the post-regularization. As mentioned in the discussion of [3, 29], the convergence of the regularized observation matrices remains an open question. Therefore, a small iteration number is often assigned for D05 tuning in industrial problems (e.g., \(q=2\) in [3, 49]). Since the right-hand side of Eq. 20 can be estimated using residual quantities at different time steps, D05 is often used to deal with time-series observation data (e.g., [3, 49]) under the assumption that the \(\mathbf{R}\) matrix is time-invariant.
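The following is a minimal sketch of the regularized D05 iteration (Eqs. 20-23), assuming a linear observation operator and approximating the expectation of Eq. 20 by an average over residuals at several time steps; the function name and arguments are illustrative.

```python
import numpy as np

def d05_iteration(X_b, Y, H, B, R0, n_iter=2, mu=0.2):
    """Sketch of the regularized D05 iteration (Eqs. 20-23) for a linear H.

    X_b, Y : background states and observations at T time steps,
             shapes (T, n) and (T, m); they stay fixed across iterations.
    """
    m = R0.shape[0]
    R = R0.copy()
    for _ in range(n_iter):
        K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
        # Average of the residual products E[(y - H x_a)(y - H x_b)^T]
        R_hat = np.zeros((m, m))
        for x_b, y in zip(X_b, Y):
            x_a = x_b + K @ (y - H @ x_b)
            R_hat += np.outer(y - H @ x_a, y - H @ x_b)
        R_hat /= len(Y)
        R_hat = 0.5 * (R_hat + R_hat.T)          # symmetrization (Eq. 21)
        C = np.trace(R_hat) * np.eye(m) / m      # diagonal target (Eq. 23)
        R = (1.0 - mu) * R_hat + mu * C          # hybrid regularization (Eq. 22)
    return R
```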

6 LSTM for error covariance estimation

6.1 Introduction of RNN and LSTM

LSTM, first introduced in [50], is a kind of RNN [31] capable of solving long-term dependency problems [51] that traditional RNNs could not handle. As with other recurrent neural networks, LSTM has a chain-like structure. This structure is created by repeating the same module, shown on the left side of Fig. 1. The repeating module comprises four neural network layers instead of only one; its specific structure is shown on the right side of Fig. 1.

Fig. 1 LSTM diagrams

An essential part of LSTM is the cell state \(\mathbf{C}_{t^{k-1}}\), which is the long-term memory storing information about past behaviors. LSTM uses three gates, each composed of a sigmoid-layer neural network (a single-layer neural network with a sigmoid activation function at the output layer) and a pointwise multiplication operation, to protect and control the information of the cell state, as shown in Fig. 1.

The first gate is the forget gate following:

$$\begin{aligned} \mathbf{f}_{t^k}=\sigma (\mathbf{W}_f\cdot [\mathbf{h}_{t^{k-1}}, \mathbf{x}_{t^k}]+b_f), \end{aligned}$$
(24)

where the recurrent variable \(\mathbf{h}_{t^{k-1}}\) summarizes the information about past behaviors, \(\mathbf{x}_{t^k}\) carries the information about the current time step, and \(\mathbf{W}_f\) and \(b_f\) are, respectively, the weights and bias parameterizing the sigmoid-layer neural network. The forget gate decides what information is to be discarded from \(\mathbf{C}_{t^{k-1}}\).

The second gate is the input gate, which determines which new information is added into \(\mathbf{C}_{t^{k-1}}\). This new candidate information \(\tilde{\mathbf{C}}_{t^{k}}\), given by

$$\begin{aligned} \tilde{\mathbf{C}}_{t^{k}}=tanh(\mathbf{W}_c \cdot [\mathbf{h}_{t^{k-1}},\mathbf{x}_{t^k}]+b_c), \end{aligned}$$
(25)

is attained by passing \(\mathbf{h}_{t^{k-1}}\) and \(\mathbf{x}_{t^k}\) to a tanh layer neural network (single layer neural network with tanh activation function at the output layer) with parameters \(\mathbf{W}_c\) and \(b_c\).

\(\tilde{\mathbf{C}}_{t^{k}}\) is then multiplied by the weight coefficients \(\mathbf{i}_{t^{k}}\), which are computed by the input gate, i.e.,

$$\begin{aligned} \mathbf{i}_{t^{k}}=\sigma (\mathbf{W}_i\cdot [\mathbf{h}_{t^{k-1}}, \mathbf{x}_{t^{k}}]+b_i). \end{aligned}$$
(26)

\(\mathbf{i}_{t^{k}}\) decides which part of the new information is employed to update \(\mathbf{C}_{t^{k-1}}\).

Hence, the cell state \(\mathbf{C}_{t^{k}}\) at the current time step \(t^k\) can be obtained using

$$\begin{aligned} \mathbf{C}_{t^{k}}=\mathbf{f}_{t^k}\odot \mathbf{C}_{t^{k-1}} +\mathbf{i}_{t^{k}}\odot \tilde{\mathbf{C}}_{t^{k}}. \end{aligned}$$
(27)

Finally, obtaining \(\mathbf{h}_{t^{k}}\) requires the output gate and a tanh activation function: first, the tanh activation is applied to the cell state to create the candidate information \(tanh(\mathbf{C}_{t^{k}})\), which is then multiplied by weight coefficients following

$$\begin{aligned} \mathbf{h}_{t^{k}}=\mathbf{o}_{t^{k}}\odot tanh(\mathbf{C}_{t^{k}}), \end{aligned}$$
(28)

to decide which information of \(tanh(\mathbf{C}_{t^{k}})\) contributes to \(\mathbf{h}_{t^{k}}\). Here, \(\mathbf{o}_{t^{k}}\) is generated by the output gate with neural network parameters \(\mathbf{W}_o\) and \(b_o\), i.e.,

$$\begin{aligned} \mathbf{o}_{t^{k}}=\sigma (\mathbf{W}_o[\mathbf{h}_{t^{k-1}}, \mathbf{x}_{t^{k}}]+b_o). \end{aligned}$$
(29)
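For illustration, a minimal numpy sketch of one forward step of the LSTM cell described by Eqs. 24-29 is given below; the weight shapes and function names are illustrative and do not correspond to any particular deep learning library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM step implementing Eqs. 24-29.

    x_t, h_prev, c_prev : current input, previous hidden and cell states.
    Each W_* has shape (hidden, hidden + input); each b_* has shape (hidden,).
    """
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)              # forget gate (Eq. 24)
    i_t = sigmoid(W_i @ z + b_i)              # input gate (Eq. 26)
    c_tilde = np.tanh(W_c @ z + b_c)          # candidate cell state (Eq. 25)
    c_t = f_t * c_prev + i_t * c_tilde        # cell state update (Eq. 27)
    o_t = sigmoid(W_o @ z + b_o)              # output gate (Eq. 29)
    h_t = o_t * np.tanh(c_t)                  # hidden state (Eq. 28)
    return h_t, c_t
```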

6.2 LSTM for \(\mathbf{R}\) matrix estimation using time series observation data

The tuning methods presented in Sect. 5 have been applied in various engineering applications with significant improvements in covariance specification and DA accuracy. However, these methods, which require several applications of DA algorithms, can be computationally expensive for high-dimensional problems. Another important drawback is the requirement of precise knowledge of either the correlation patterns of \(\mathbf{B}\) and \(\mathbf{R}\) (for DI01) or the full \(\mathbf{B}\) matrix (for D05). In this study, we aim to build a data-driven surrogate model for efficient online \(\mathbf{R}\) matrix specification using LSTM. Unlike DI01 or D05, no specific knowledge about the error covariances or the state/observation dynamical systems is required, other than the transformation operator \({\mathcal {H}}\) and the forward model \({\mathcal {M}}_{t^k \rightarrow t^{k+1}}\) (which is also indispensable for standard DA algorithms, including variational methods and Kalman filters).

Based on an initial state \(\mathbf{x}_{b,t^{0}}^{[\mathrm{iter}]} =\mathbf{x}_{g,t^{0}}^{[\mathrm{iter}]}\), where \([\mathrm{iter}]\) indexes a given training sample and \(\mathbf{x}_{g,t^{0}}^{[\mathrm{iter}]}\) denotes the generated initial state of this sample, our main idea is to build a training set for the specific problem, including predefined time-invariant observation matrices \(\{\mathbf{R}^{[\mathrm{iter}]}\}\) within a certain range and generated dynamical observation vectors \(\{\mathbf{y}_{t^k}^{[\mathrm{iter}]} \}\). Setting the dynamical observations \(\{\mathbf{y}_{t^k}^{[\mathrm{iter}]} \}\) as the system input and the \(\mathbf{R}\) matrices as output, LSTM networks are used to learn the error distribution across the underlying observation dynamics. More precisely, a real function \(g^{\mathbf{R}}(.): \Phi _{\mathbf{R}} \longrightarrow {\mathbb {R}}^{m \times m}\) is predefined, where \(\Phi _{\mathbf{R}}\) is an empirically estimated real space which defines the range of a set of parameters, such as the marginal error variance and the correlation scale length [18], used for computing the \(\mathbf{R}\) matrices. The generated observation matrices \(\{\mathbf{R}^{[\mathrm{iter}]}\}\) are set to be symmetric positive definite (SPD) thanks to the function \(g^{\mathbf{R}}(.)\). Both \(g^{\mathbf{R}}(.)\) and \(\Phi _{\mathbf{R}}\) vary for different applications.

Generated states \(\{ \mathbf{x}^{[\mathrm{iter}]}_{g,t^{k}}\}\), \(t^{k}\in \{0,\cdots t^T\}\), with \(t^T\) denoting the final time step, can be obtained by evolving the system according to \({\mathcal {M}}_{t^k \rightarrow t^{k+1}}\):

$$\begin{aligned} \left\{ \mathbf{x}^{[\mathrm{iter}]}_{g,t^{k+1}}\right\} = {\mathcal {M}}_{t^k \rightarrow t^{k+1}}\left( \left\{ \mathbf{x}^{[\mathrm{iter}]}_{g,t^{k}}\right\} \right) . \end{aligned}$$
(30)

The observations \(\{\mathbf{y}^{[\mathrm{iter}]}_{t^{k}} \}\), \(t^{k}\in \{0,\cdots t^T\}\), are obtained by mapping \(\{ \mathbf{x}^{[\mathrm{iter}]}_{g,t^{k}}\}\) through \({\mathcal {H}}\) and adding random Gaussian noise:

$$\begin{aligned}&\left\{ \mathbf{y}^{[\mathrm{iter}]}_{t^{k}}\right\} = {\mathcal {H}} \left( \left\{ \mathbf{x}^{[\mathrm{iter}]}_{g,t^{k}}\right\} \right) + \left\{ \epsilon ^{[\mathrm{iter}]}_{g,t^{k}}\right\} \quad \text {for} \quad t^{k}\in \{0,\cdots t^T\} \quad \text {and} \quad \mathrm{iter} = 1...N, \nonumber \\&\qquad \text {where } \quad \left\{ \epsilon ^{[\mathrm{iter}]}_{g,t^{k}}\right\} \sim {{\mathcal {N}}(0,\{\mathbf{R}^{[\mathrm{iter}]}\}}). \end{aligned}$$
(31)
Algorithm 1
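A minimal sketch of the data generation loop of Algorithm 1 (Eqs. 30-31) is given below, assuming a linear observation operator; forward_model and sample_R are placeholders for the application-specific forward model \({\mathcal {M}}\) and the covariance generator \(g^{\mathbf{R}}(.)\).

```python
import numpy as np

def generate_training_set(n_samples, x0, forward_model, H, sample_R, T, rng):
    """Sketch of the offline data generation in Algorithm 1 (Eqs. 30-31).

    forward_model(x) advances the state by one time step; sample_R() draws a
    random SPD observation covariance from the predefined range Phi_R.
    """
    inputs, targets = [], []
    for _ in range(n_samples):
        R = sample_R()                              # R^[iter] drawn via g^R(.)
        x, ys = x0.copy(), []
        for _ in range(T):
            x = forward_model(x)                    # Eq. 30
            noise = rng.multivariate_normal(np.zeros(R.shape[0]), R)
            ys.append(H @ x + noise)                # Eq. 31
        inputs.append(np.stack(ys))                 # observation trajectory (T, m)
        targets.append(R)                           # learning target (or its parameters)
    return np.stack(inputs), np.stack(targets)
```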
Fig. 2 Offline training schema of the proposed method, including covariance generation, observation generation, and LSTM training

After generating \(\{\mathbf{R}^{[\mathrm{iter}]}\}\) and \(\{\mathbf{y}_{t^k}^{[\mathrm{iter}]} \}\) as illustrated in Algorithm 1, an LSTM network is then trained to predict \(\{\mathbf{R}^{[\mathrm{iter}]}\}\) from \(\{ \mathbf{y}_{t^k}^{[\mathrm{iter}]} \}\). The general process is illustrated in Fig. 2: for each sample, \(\mathbf{y}^{[\mathrm{iter}]}\) is simulated by evolving the system using the known system dynamics and the observation error covariance generator, and then \(\mathbf{y}^{[\mathrm{iter}]}\) and \(\mathbf{R}^{[\mathrm{iter}]}\) are used to train the LSTM so that it can predict \(\mathbf{R}^{[\mathrm{iter}]}\) when receiving unseen \(\mathbf{y}^{[\mathrm{iter}]}\).

Fig. 3 Many-to-one LSTM training and \(\mathbf{R}\) prediction process

Following the many-to-one LSTM principle of Fig. 3, the input features of the LSTM consist of the observations \(\mathbf{y}^{[\mathrm{iter}]}=\{\mathbf{y}_{t^0}^{[\mathrm{iter}]},\mathbf{y}_{t^1}^{[\mathrm{iter}]},\cdots ,\mathbf{y}_{t^k}^{[\mathrm{iter}]},\cdots ,\mathbf{y}_{t^T}^{[\mathrm{iter}]}\}\), while the output is the \(\mathbf{R}^{[\mathrm{iter}]}\) matrix. Unlike classical covariance tuning algorithms, the LSTM network only makes use of historical observation data, requiring neither the background states nor the error covariance matrix. The advantage of using LSTM is more salient when the observation dimension is large, for example millions or even billions, and such dimensions are not uncommon in real-world applications [2, 7].

To estimate \(\mathbf{R}\), the LSTM is first trained to learn:

– either the variables used to constitute the symmetric observation error covariance matrix (i.e., the input variables of the \(g^{\mathbf{R}}(.)\) function) in a parametric modeling;

– or the elements of the \(\mathbf{R}\) matrix (e.g., the variables in the upper triangle and in the diagonal of the covariance matrix) in a non-parametric modeling.

The whole process for \(\mathbf{R}\) estimation using LSTM is described in Algorithm 2.

Algorithm 2

Algorithm 2 indicates that the LSTM fitting consists of a training process and a validation process: the training process comprises the forward prediction and the backward updating of the neural network weight parameters, while the validation process predicts the target outputs and computes the validation loss between the predicted outputs and the true output values. \(N_{\mathrm{epoch}}\) indicates the number of times the entire dataset is passed forward and backward through the LSTM during training. Early stopping, which terminates training when the validation loss has not improved for \(N_{\mathrm{patience} \_\text {epoch}}\) epochs, is applied to reduce the LSTM training time.
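As an illustration, a minimal Keras-style sketch of the many-to-one LSTM training with early stopping is given below; the layer sizes loosely follow Table 1, while the file names, number of epochs, batch size and patience are illustrative choices rather than the settings used in the experiments.

```python
import numpy as np
import tensorflow as tf

# Y_train: observation trajectories, shape (n_samples, T, m)
# p_train: covariance parameters (e.g. r_0, r_1, r_2, v_R), shape (n_samples, 4)
Y_train = np.load("observations.npy")       # hypothetical file names
p_train = np.load("cov_parameters.npy")

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(200, input_shape=Y_train.shape[1:]),  # many-to-one LSTM layer
    tf.keras.layers.Dense(p_train.shape[1]),                   # covariance parameters
])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

model.fit(Y_train, p_train, validation_split=0.2,
          epochs=200, batch_size=64, callbacks=[early_stop])
```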

It is important to note that the offline data generation and LSTM training processes need to be carried out individually for different DA applications (Fig. 4).

Fig. 4 Online LSTM prediction of the observation matrix compared to traditional covariance tuning approaches, followed by online data assimilation

7 Lorenz twin experiment

7.1 Twin experiment principle

In a realistic experiment, \(\mathbf{x}_{\mathrm{true}}\) is usually unknown and \(\mathbf{y}\) is often mixed with noise. To overcome this drawback, a twin experiment, in which a prototypical test case is selected to simulate real situations, is applied so as to provide \(\mathbf{x}_{\mathrm{true}}\) for comparison. In this setting, a transformation is applied to a sampled true trajectory \(\mathbf{x}_{\mathrm{true},t^k}\) at some points in space and time, and random noise is added to obtain simulated raw measurements \(\mathbf{y}_{t^k}\). DA is then implemented starting from the initial background state \(\mathbf{x}_{b,t^0}\), representing the prior information available about the corresponding state \(\mathbf{x}_{\mathrm{true},t^0}\), along with the initial raw measurement \(\mathbf{y}_{t^0}\). The output state is then compared against \(\mathbf{x}_{\mathrm{true}}\); the distance between these two states is used to evaluate and improve the performance of DA. In this section, we use a twin experiment to evaluate the performance of applying DA to a simple Lorenz system in which the raw measurement error covariance is estimated/adjusted using, respectively, DI01, D05 and LSTM.

7.2 Experiment set up

The Lorenz system, first studied by Edward Lorenz, is a system of ordinary differential equations. For certain parameter values and initial conditions, the Lorenz system is notable for having chaotic solutions, in particular the Lorenz attractor, toward which the system tends to evolve. The Lorenz 96 system [52] has been widely used as a prototypical test case to compare the performance of DA algorithms [34, 35, 53]. Here we build a twin experiment framework with a simple three-dimensional Lorenz system in which the state vector is denoted as \(\mathbf{x}= [x_{(0)}, x_{(1)}, x_{(2)}]\). The studied Lorenz system can be characterized as:

$$\begin{aligned} \frac{\partial x_{(0)}}{\partial t}&=\sigma (x_{(1)}-x_{(0)})\nonumber \\ \frac{\partial x_{(1)}}{\partial t}&=\alpha x_{(0)}-x_{(1)} -x_{(0)}x_{(2)} \nonumber \\ \frac{\partial x_{(2)}}{\partial t}&=x_{(0)}x_{(1)}-\beta x_{(2)}. \end{aligned}$$
(32)

where the integration time step is \(\delta _t=0.001s\), and \(\sigma =10\), \(\alpha =28\), \(\beta =2.667\).

The initial values of the true state \(\mathbf{x}_{\mathrm{true},t^0}\) are set to be [0, 1, 1.05] while the initial background state \(\mathbf{x}_{b,t^0}\) is generated by combining \(\mathbf{x}_{\mathrm{true}, t^0}\) with a centered Gaussian noise \(\epsilon _{b,t^0}\):

$$\begin{aligned} \mathbf{x}_{b,t^0} = \mathbf{x}_{\mathrm{true},t^0} + \epsilon _{b,t^0} \quad \text {where} \quad \epsilon _{b,t^0} \sim {\mathcal {N}} \left( 0, 0.05 \times \mathbf{I}_3\right) . \end{aligned}$$
(33)

Then both the true states \(\mathbf{x}_{\mathrm{true}}=\{\mathbf{x}_{\mathrm{true},t^0},\cdots ,\mathbf{x}_{\mathrm{true},t^T}\}\) and the background states \(\mathbf{x}_{b}=\{\mathbf{x}_{b,t^0},\cdots , \mathbf{x}_{b,t^T}\}\) of the Lorenz system evolve according to the Lorenz equations (Eq. 32) until t = 1s, for a total of \(T=1000\) time steps.

Subsequently, observations \(\mathbf{y} =\{\mathbf{y}_{t^0},\cdots ,\mathbf{y}_{t^T}\}\) can be acquired by mapping \(\mathbf{x}_{\mathrm{true}}\) through a linear observation operator

$$\begin{aligned} \mathbf{H}= \begin{bmatrix} 1 & 1 & 0 \\ 2 & 0 & 1\\ 0 & 0 & 3 \end{bmatrix}, \end{aligned}$$
(34)

and adding noise following the multivariate normal distribution \({\mathcal {N}}(0,\mathbf{R})\), where \(\mathbf{R}\) is randomly generated following the process described in Sect. 7.3.
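A minimal sketch of the generation of one observation trajectory for this twin experiment is given below, assuming an explicit Euler time-stepping of Eq. 32 (the time integration scheme is an assumption of this sketch) and a placeholder \(\mathbf{R}\).

```python
import numpy as np

def lorenz_step(x, dt=0.001, sigma=10.0, alpha=28.0, beta=2.667):
    """One explicit-Euler step of the three-variable Lorenz system (Eq. 32)."""
    dx = np.array([sigma * (x[1] - x[0]),
                   alpha * x[0] - x[1] - x[0] * x[2],
                   x[0] * x[1] - beta * x[2]])
    return x + dt * dx

rng = np.random.default_rng(0)
H = np.array([[1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 0.0, 3.0]])                 # observation operator of Eq. 34
R = np.eye(3)                                   # placeholder; R is randomly drawn in Sect. 7.3
x_true = np.array([0.0, 1.0, 1.05])             # initial true state
observations = []
for _ in range(1000):
    x_true = lorenz_step(x_true)
    observations.append(H @ x_true + rng.multivariate_normal(np.zeros(3), R))
observations = np.array(observations)           # trajectory of shape (1000, 3)
```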

EnDA is then applied in this twin experiment to update the background ensemble using the available observations. More precisely, every ten time steps, EnDA is applied to \(\mathbf{x}_{b,t^k}\) along with \(\mathbf{y}_{t^k}\) to obtain the analysis states \(\mathbf{x}_{a,t^k}\). To simulate the background states before the next assimilation step, artificial Gaussian noise is added to the analysis state before propagation, i.e.,

$$\begin{aligned} \mathbf{x}_{b,t^{k+1}}&= {\mathcal {M}}_{t^k \rightarrow t^{k+1}} (\mathbf{x}_{a,t^k} + \epsilon _{b,t^{k}}), \nonumber \\ \mathbf{x}_{b,t^{(k+\gamma )}}&= {\mathcal {M}}_{t^{(k+\gamma -1)} \rightarrow t^{(k+\gamma )} }(\mathbf{x}_{b,t^{(k+\gamma -1)}}) \quad \text {for} \quad \gamma \in \{ 2,...,10\}, \end{aligned}$$
(35)

where

$$\begin{aligned} \epsilon _{b,t^{k+1}} \sim {\mathcal {N}}(0, \mathbf{Q}) \quad \text {and} \quad \mathbf{Q}= \begin{bmatrix} 1 & 0.2 & 0 \\ 0.2 & 1 & 0.2\\ 0 & 0.2 & 1 \end{bmatrix}. \end{aligned}$$

The model error covariance matrix \(\mathbf{Q}\) is assumed to be time-invariant for all generated trajectories.

7.3 DA with LSTM-based covariance estimation

Sect. 7.2 describes the process of generating artificial training data for the LSTM model. This section gives more details about the generation of the \(\mathbf{R}\) matrices, which constitute the outputs of the LSTM model within the training data. \(\mathbf{R}\) is parameterized by four real coefficients \(r_0\), \(r_1\), \(r_2\) and \(v_{\mathbf{R}}\), which determine, respectively, the three correlation coefficients and the error amplitude, i.e.,

$$\begin{aligned} \mathbf{R}= v_{\mathbf{R}} \times \begin{bmatrix} 1 & r_0 & r_1 \\ r_0 & 1 & r_2\\ r_1 & r_2 & 1 \end{bmatrix}. \end{aligned}$$
(36)

In this study, \(v_{\mathbf{R}}\) is generated uniformly between 0 and 100, i.e., \(v_{\mathbf{R}} \sim {\mathcal {U}}(0,100)\), with \({\mathcal {U}}\) denoting the uniform probability distribution. The correlation coefficients \(\{r_0,r_1,r_2\} \in (-1,1)^3\) are obtained via randomly generated SPD matrices.
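A minimal sketch of one possible way to draw such a random covariance is given below; the recipe for generating the SPD correlation part is an illustrative choice and not necessarily the exact procedure used here.

```python
import numpy as np

def sample_R(rng):
    """Draw a random 3x3 observation covariance of the form of Eq. 36.

    The correlation part is obtained from a random SPD matrix normalized to
    unit diagonal (one possible generation recipe, used here for illustration).
    """
    A = rng.standard_normal((3, 3))
    S = A @ A.T + 1e-6 * np.eye(3)          # random SPD matrix
    d = np.sqrt(np.diag(S))
    corr = S / np.outer(d, d)               # unit-diagonal correlation (r_0, r_1, r_2)
    v_R = rng.uniform(0.0, 100.0)           # error amplitude v_R ~ U(0, 100)
    return v_R * corr
```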

The LSTM network is thus trained to learn the \(\mathbf{R}\) matrix by building a mapping from the time-series observation data \(\mathbf{y}_{t^k}\) to \(r_0\), \(r_1\), \(r_2\) and \(v_{\mathbf{R}}\). The specific structure of the LSTM model, which consists of an LSTM input layer, a hidden layer with 200 neurons, and an output layer comprising the four neurons used to obtain \(\mathbf{R}\), is shown in Table 1. In this Lorenz twin experiment, two LSTM networks, respectively named LSTM1000 and LSTM200, are designed with different input sizes of time-series data. LSTM1000 is trained on a total of 1000 time steps for predicting the \(\mathbf{R}\) matrix, while LSTM200 makes use of only the first 200 time steps to simulate a realistic application where the time-invariant \(\mathbf{R}\) matrix is estimated using historical data for improving future DA performance. The evaluation of both LSTM1000 and LSTM200, in terms of DA accuracy, is made using the full test dataset with 1000 time steps. By leveraging the LSTM model along with available observations, we can still perform DA algorithms even though \(\mathbf{R}\) is not explicitly given. The results are then compared with the ones obtained using the exact \(\mathbf{R}\) matrix.

Table 1 LSTM structure for \(\mathbf{R}\) prediction in the Lorenz experiment

7.4 Results

To evaluate the LSTM performance in predicting \(\mathbf{R}\), we first compare the LSTM-predicted \(\mathbf{R}\) with the predefined true \(\mathbf{R}\) matrix in the test set and analyze the impact of the predicted \(\mathbf{R}\) on DA accuracy. Since the LSTM prediction of the \(\mathbf{R}\) matrix is non-parametric in this twin experiment, we compare each element of the predicted matrix (i.e., \(r_0\), \(r_1\), \(r_2\) and \(v_{\mathbf{R}}\)) against the ground truth. As for the DA accuracy, we calculate the difference between the error-free true states \(\mathbf{x}_{\mathrm{true},t}=\{x_{\mathrm{true},(0),t},x_{\mathrm{true},(1),t},x_{\mathrm{true},(2),t}\}\) and \(\mathbf{x}_{b,t}=\{x_{b,(0),t}, x_{b,(1),t},x_{b,(2),t}\}\) (obtained via Eq. 35), refined in DA using \(\mathbf{R}\) estimated via DI01 (\(q=2\)), D05 (\(q = 3\)) or LSTM. Both D05 and DI01 are initialized with a random \(\mathbf{R}\) matrix, generated through \(g^{\mathbf{R}}(.)\) with the same range of parameters \(\Phi _{\mathbf{R}}\) as for the LSTM training.

Figure 5 shows the elements of the estimated \(\mathbf{R}\) matrix (i.e., \(r_0\), \(r_1\), \(r_2\) and \(v_{\mathbf{R}}\)) obtained by LSTM1000 and LSTM200, both trained on 103,486 Lorenz system observation samples and evaluated on a test dataset of 10,000 samples. The result of the D05 approach, with \(r_0\), \(r_1\), \(r_2\) and \(v_{\mathbf{R}}\) calibrated against the true values, is also displayed in Fig. 5 (i-l) on the same test dataset for comparison purposes. The blue circles are the LSTM/D05 prediction results, while the red line is the true value of the corresponding parameter. We observe that the LSTM1000 and LSTM200 predictions fit the predefined true value of each parameter much better than the D05 tuning approach, especially for the error amplitude \(v_{\mathbf{R}}\), which is of most importance in covariance specification. Based on these experimental results, we can conclude that the proposed LSTM approach is capable of predicting the observation matrix, in terms of both correlation coefficients and error amplitude, when time-series observation data \(\mathbf{y}_{t^k}\) are given.

Fig. 5 Prediction results of the non-parametric error covariance against the true values in the test dataset for LSTM1000 (a-d), LSTM200 (e-h) and D05 (\(q = 2\)) (i-l), for \(r_0\), \(r_1\), \(r_2\) and \(v_{\mathbf{R}}\) of the Lorenz system

Figure 6 and Table 2 illustrate the averaged DA performance with \(\mathbf{R}\) obtained in different ways, over the 10,000 observation samples of the test dataset. We recall that for each observation sample \(\{ \mathbf{y}_{t^k}\}\), \(t^{k}\in \{0,\cdots ,t^T\}\), 100 background trajectories are generated to perform EnDA. Among these algorithms, DI01 uses two iterations to correct the magnitudes of \(\mathbf{B}\) and \(\mathbf{R}\) following Eq. 15. Each D05 iteration calculates the innovation quantities \((\mathbf{y}-{\mathcal {H}}(\mathbf{x}_a))\) and \((\mathbf{y} -{\mathcal {H}}(\mathbf{x}_b))\) every 10 time steps through a DA procedure, and then applies Eq. 20 to obtain the updated \(\mathbf{R}\) for the corresponding Lorenz system.

Figure 6 displays the evolution of the mean square error (MSE) \(\epsilon _{{\mathrm{std}}\_{\mathrm{mse}},(i),t}\) between DA refined \(\mathbf{x}_{b,t}\) (following Eq. 35) and the true states \(\mathbf{x}_{\mathrm{true},t}\), i.e.,

$$\begin{aligned} \epsilon _{{\mathrm{std}}\_{\mathrm{mse}},(i),t} =\sum _{j=1}^{N}\frac{\sqrt{\sum _{m=1}^{M}{\left( x_{b,(i),t}^{(m),[j]} -x_{(i),t}^{[j]}\right) }^2}}{{\left\| \mathbf{x}_{(i)}\right\| }_2}\Big /N \end{aligned}$$
(37)

where \(i \in \{ 0,1,2\}\), \(N=1988\) is the number of examples, and \(M=100\) is the size of the DA ensemble. Table 2 shows the time-averaged absolute error

$$\begin{aligned} \epsilon _{\mathrm{mse},(i)}=\sum _{j=1}^{N}\frac{\sum _{t=t^0}^{t^{T}} \sqrt{\sum _{m=1}^{M}{\left( x_{b,(i),t}^{(m),[j]} -x_{(i),t}^{[j]}\right) }^2}}{t^T} \Big /N \end{aligned}$$
(38)

to quantify the difference between the assimilated and true states, where \(t^T=1000\) is the total number of time steps over which the Lorenz system is evolved.

It should be noted that the DI01 approach, which exclusively adjusts the \(Tr(\mathbf{B})/Tr(\mathbf{R})\) ratio without modifying the correlation structure, is outperformed by more refined covariance tuning/specification methods such as D05 and LSTM, as shown in Fig. 6. Furthermore, Table 2 shows that \(\mathrm{lstm1000:} \epsilon _{\mathrm{mse}}\) is smaller than \(\mathrm{d05:} \epsilon _{\mathrm{mse}}\), which is consistent with the results shown in Fig. 5. Thus, we can conclude that LSTM1000 is sound at predicting \(\mathbf{R}\), contributing to a better DA performance compared to DI01 and D05. This conclusion is further supported by Table 2, in which \(\epsilon _{\mathrm{mse}}\) based on LSTM1000 has values, for all three state variables displayed, close to those based on the true \(\mathbf{R}\), which is predefined and used throughout the Lorenz system sample generations.

Fig. 6 DA performance evaluated via \(\epsilon _{{\mathrm{std}} \_{\mathrm{mse}}}\) based on \(\mathbf{R}\) obtained via LSTM, D05 and DI01

Table 2 DA performance of the Lorenz system evaluated via \(\epsilon _{\mathrm{mse}}\) for the three state variables, based on \(\mathbf{R}\) obtained from the predefined true value, LSTM1000, LSTM200, DI01 and D05

The curves of \(\mathrm{lstm200:}\epsilon _{{\mathrm{std}}\_{\mathrm{mse}}}\) and \(\mathrm{true:}\epsilon _{{\mathrm{std}}\_{\mathrm{mse}}}\), representing the MSE based on the LSTM200-predicted \(\mathbf{R}\) and the manually predefined true \(\mathbf{R}\) respectively, are not shown in Fig. 6 because they almost completely overlap with \(\mathrm{lstm1000:}\epsilon _{{\mathrm{std}}\_{\mathrm{mse}}}\). This fact leads to the same conclusion as for LSTM1000 and shows that LSTM200, which makes use of the observation data of only the first 200 time steps, is sound at predicting \(\mathbf{R}\) and contributes to a good DA performance with future observations. This conclusion is supported by Table 2, where \(\epsilon _{\mathrm{mse}}\) based on the LSTM200-predicted \(\mathbf{R}\) has almost the same values as that based upon the manually predefined true \(\mathbf{R}\) (Fig. 7).

8 Application to shallow water equations

8.1 Experiment setup

To further evaluate the performance of error covariance estimation using LSTM when incorporated with predefined correlation kernels, we also set up a twin experiment framework with a simplified 2D shallow-water dynamical model, which is frequently used for testing data assimilation algorithms (e.g., [26, 42]). A cylinder of water is positioned in the middle of a study field of size \(20mm \times 20mm\) and released at the initial time step \(t^0\) (i.e., with no initial speed), leading to a non-linear wave propagation. The dynamics of the water level h (in mm), as well as the horizontal and vertical velocity fields (respectively denoted as u and v, in units of 0.1m/s), are given by the non-conservative shallow water equations,

$$\begin{aligned} \frac{\partial u}{\partial t}&=-g\frac{\partial }{\partial x}(h)-bu \nonumber \\ \frac{\partial v}{\partial t}&=-g\frac{\partial }{\partial y}(h)-bv \nonumber \\ \frac{\partial h}{\partial t}&=-\frac{\partial }{ \partial x}(uh) -\frac{\partial }{\partial y}(v h) \nonumber \\ u_{t^0}&= 0 \nonumber \\ v_{t^0}&= 0 \end{aligned}$$
(39)
Fig. 7 State (u, v) - observation (\(\mathbf{y}\)) transformation

In Eq. 39, \(b=0.1\) is the viscous drag coefficient, while g is the earth gravity constant. These equations are discretized on a \(20 \times 20\) regular grid and solved by a first-order finite difference method with a time discretization \(\delta _t = 10^{-4}s\), from \(t^0 = 0s\) to \(t^{1000} = 0.1s\). This numerical solution is considered as the reference (i.e., the true state \(\mathbf{x}_{\mathrm{true}}\)) when performing DA algorithms. The state variables in this DA modeling are the combination of the velocity fields \(\{u\}_{20 \times 20}\) and \(\{v\}_{20 \times 20}\), leading to a state dimension of 800. The evolution of the reference state (\(\mathbf{x}_{\mathrm{true},t^k}\)), together with the error-free model equivalent observations (i.e., \(\mathbf{H}(\mathbf{x}_{\mathrm{true}})\)), is illustrated in Fig. 8. Spatially correlated observation errors are generated artificially and combined with the transformation operator to simulate real-time observations. More precisely, the observations are generated from the model equivalent \(\mathbf{H}(\mathbf{x}_{\mathrm{true}})\) separately for the fields u and v. \(\mathbf{H}\) is defined as a sparse matrix to reflect the fact that measurements in real-world applications are sparser than the true states, owing to interference and to the limited coverage and performance of sensors. As shown in Fig. 7, the spatial observations at time \(t^k\) are defined as the average of \(u_{t^k}\) and \(v_{t^k}\) over \(2 \times 2\) cell areas with an observation error \(\epsilon _{y_{t^k}}\),

$$\begin{aligned} \mathbf{y}_{u,i,j,t^k}&= \frac{1}{4} (u_{\mathrm{true},2i,2j,t^k} + u_{\mathrm{true},2i+1,2j,t^k} + u_{\mathrm{true},2i,2j+1,t^k}\nonumber \\&\qquad \qquad + u_{\mathrm{true},2i+1,2j+1,t^k}) + \epsilon _{y_{u,i,j,t^k}}, \end{aligned}$$
(40)

and identically for \(\mathbf{y}_{v,i,j,t^k}\). Therefore, the dimension of the observation vector \(\mathbf{y}= [\mathbf{y}_{u,t^k}, \mathbf{y}_{v,t^k} ]\) is 200. In this experiment, we assume that the observation errors \(\epsilon _{y_{u,i,j,t^k}}\) and \(\epsilon _{y_{v,i,j,t^k}}\), respectively of the velocity fields u and v, follow the same Gaussian distribution \({\mathcal {N}}(0,\mathbf{R})\). Thus, the observation error covariance in this shallow water system can be fully characterized by a \(100 \times 100\) \(\mathbf{R}\) matrix after the observations (originally on a 2D grid) are converted to a 1D vector. Here we adopt a different parameterization of the \(\mathbf{R}\) matrix, using an isotropic correlation function \(\psi _{\mathbf{R}}(.)\),

$$\begin{aligned} \mathbf{R}=10^{-6} \cdot \sqrt{diag(\mathbf{D})}\cdot \psi _{\mathbf{R}} (r) \cdot \sqrt{diag(\mathbf{D})} \end{aligned}$$
(41)

where \(\mathbf{D} = \left[ D_0, ..., D_{99}\right]\) represents the error variances in the 2D (\(10 \times 10\)) velocity field. Each element of \(\mathbf{D}\) is generated individually following a uniform distribution,

$$\begin{aligned} \mathbf{D}_{\iota } \sim {\mathcal {U}}(1, 1000) \quad \text {for} \quad \iota \in \{ 0,...,99\}, \end{aligned}$$
(42)

which produces only positive elements to guarantee the positive definiteness of \(\mathbf{R}\).

\(\psi _{\mathbf{R}} (.)\) is the second-order auto-regressive (also known as Balgovind) function,

$$\begin{aligned} \psi _{\mathbf{R}} (r) = \left( 1+\frac{r}{L_{\mathbf{R}}}\right) \exp \left( -\frac{r}{L_{\mathbf{R}}}\right) , \end{aligned}$$
(43)

where \(L_{\mathbf{R}}\) is the correlation scale length, fixed as \(L_{\mathbf{R}} =10\) in this application, and r is a correlation scale parameter in the 2D space, also generated uniformly with \(r \sim {\mathcal {U}}(1,5)\). Being part of the Matérn family of kernels, the SOAR function is often used in DA for prior error covariance modeling [6, 26] thanks to its smoothness and good conditioning. The simulation of \(\mathbf{x}_{b,t^k} = [{u}_{b,t^k}, {v}_{b,t^k}]\) via the same discretization of Eq. 39 (except for the initial conditions) is used as the background state at time \(t^k\) in the DA modeling. Similar to the Lorenz experiment (i.e., Eq. 35), \(\{\mathbf{x}_{b,t^k}\}\) is obtained by combining \(\{\mathbf{x}_{a,t^k}\}\) with randomly generated Gaussian errors, while \(\{\mathbf{x}_{a,t^k}\}\) is obtained every 100 time steps (i.e., 0.01s) from ensemble DA with the time-series observation data \(\{\mathbf{y}_{t^k}\}\) and the estimated observation error covariance \(\mathbf{R}\).
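A minimal sketch of the construction of the parametric \(\mathbf{R}\) matrix of Eqs. 41-43 is given below; the evaluation of the kernel at pairwise Euclidean distances between observation cells, and the role of the scale argument, are assumptions of this sketch.

```python
import numpy as np

def build_R_shallow_water(D, L=10.0, grid=10):
    """Sketch of the parametric R in Eq. 41 on a (grid x grid) observation field.

    D : marginal error variances, shape (grid**2,).
    L : scale length of the Balgovind/SOAR kernel (Eq. 43), assumed here to be
        applied to pairwise Euclidean distances between observation cells.
    """
    ii, jj = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    coords = np.column_stack([ii.ravel(), jj.ravel()]).astype(float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    psi = (1.0 + dist / L) * np.exp(-dist / L)        # Balgovind correlation matrix
    std = np.sqrt(D)
    return 1e-6 * np.outer(std, std) * psi            # sqrt(diag(D)) psi sqrt(diag(D))
```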

Fig. 8 Evolution of the shallow water model of h, u, v (true states) at different time steps (a, b, d, e) and the error-free model equivalent observations \(\mathbf{H}(\mathbf{x}_{\mathrm{true}})\) (c, f)

8.2 DA with LSTM estimated \(\mathbf{R}\)

As in the Lorenz experiment, simulated observations \(\{\mathbf{y}_{t^k}\}\), generated following the same process as in the Lorenz system, are used as input training data for the LSTM model, while the \(\mathbf{D}\) vector and the correlation scale r serve as the training output. The specific structure of this LSTM network is shown in Table 3. This model has the same structure as the one used for the Lorenz system (Table 1), except for the input and output dimensions. As in the Lorenz system, two types of LSTM are proposed: LSTM1000 employs the whole 1000 time steps of observation data as training and prediction inputs, while LSTM200 makes use of only the first 200 time steps of observation data as inputs. Once the LSTM is trained on the training set of 173,000 generated observation trajectories in this experiment, \(\mathbf{R}\) can be estimated when only the time-series observation data \(\{\mathbf{y}_{t^k}\}\) is available.

Table 3 LSTM structure for \(\mathbf{R}\) prediction in the shallow water experiment

Similar to the Lorenz system, EnDA is performed here with an ensemble of 100 state trajectories initialized from the same initial state \(\mathbf{x}_{t^0}\) for each observation series. EnDA takes place every 200 time steps with the \(\mathbf{R}\) matrix estimated through different methods.

8.3 Results

Figures 9 and 10 illustrate, respectively, the prediction results of LSTM1000 and LSTM200 against the true values on 10,000 examples of the test dataset, which demonstrate that the trained LSTM exhibits a good performance in predicting the values used to compose \(\mathbf{R}\), including both the marginal error variances (i.e., the elements of the \(\mathbf{D}\) vector) and the correlation scale length r. The prediction results of LSTM200 (Fig. 10) are almost as accurate as the ones obtained via LSTM1000 (Fig. 9). Training and evaluating the LSTM network using the first 200 time steps is therefore sufficient to obtain an accurate estimation of the \(\mathbf{R}\) matrix. The \(\mathbf{R}\) matrices composed from the LSTM1000 predictions are shown in Fig. 11, in comparison with the true \(\mathbf{R}\) matrix and the one obtained via D05 (\(q=3\)) after regularization. In order to estimate the high-dimensional \(\mathbf{R}\) matrix in this application, D05 makes use of the DA residuals every 10 time steps, differently from the final DA algorithm, in each of its three iterations. Four different sample \(\mathbf{R}\) matrices from the test dataset are displayed in Fig. 11. We observe that both LSTM and D05 manage to recover a covariance structure similar to that of the true \(\mathbf{R}\) matrix, while the considerable advantage of the LSTM approach can still be noticed. These results confirm the findings of Figs. 9 and 10, namely that LSTM can perform well in the parametric prediction of the \(\mathbf{R}\) matrix even in high-dimensional systems.

Fig. 9 Prediction results of the parametric error covariance against the true values in the test dataset of LSTM1000, for selected elements of \(\mathbf{D}\) (representing marginal error variances in the 2D space) (a–g) and the correlation scale r (h) of the shallow water system

Fig. 10 Prediction results of the parametric error covariance against the true values in the test dataset of LSTM200, for selected elements of \(\mathbf{D}\) (representing marginal error variances in the 2D space) (a–g) and the correlation scale r (h) of the shallow water system

Fig. 11 Comparison between the true and the estimated (LSTM1000, D05) \(\mathbf{R}\) matrices, which represent the error covariance of the 2D observation data, for 4 samples with different \(\mathbf{D}\) and r in the test dataset

Table 4 DA performance of the shallow water system, evaluated in \(\epsilon _{\mathrm{mse}}\), with \(\mathbf{R}\) obtained from the predefined true values and from the LSTM1000, LSTM200, DI01 and D05 algorithms

To assess the DA accuracy, the metric \(\epsilon _{\mathrm{mse},(i)}\) (cf. Eq. 38), estimated using a set of 53 observation time series, is displayed in Table 4. Since it would take too much space to display the full set \(\{\epsilon _{\mathrm{mse},(i)}\}_{i=0,1,\cdots ,800}\) for all state variables, we report only a subset in Table 4, together with the averaged value \(\bar{\epsilon }_{\mathrm{mse}}\) over all 800 cell coordinates of the u and v fields. The displayed results further support the conclusion drawn from Figs. 9, 10 and 11: the LSTM-based approaches have an advantage in DA accuracy over the D05 (\(q=3\)) and DI01 (\(q=2\)) tuning algorithms. Furthermore, the performance of LSTM200 is very close to that of LSTM1000, with an even slightly smaller average MSE, probably due to sampling randomness. This further confirms the analysis of Fig. 10: an LSTM model that employs only the first 200 time steps of observation data as input manages to provide an accurate estimation of the \(\mathbf{R}\) matrix. The averaged computational time (on a laptop CPU) of the online covariance tuning/specification is also shown in Table 4, where only the evaluation time of the trained LSTM model is taken into account, since the training of the LSTM can be performed entirely offline. As for D05 and DI01, we exclude the final DA step (i.e., the computational time is estimated for \(q-1\) iterations). As shown in Table 4, the LSTM approach is also considerably faster than the traditional tuning methods, which allows near-real-time application in dynamical systems.
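As a rough guide to how such table entries can be computed, a minimal per-coordinate MSE sketch is given below. Eq. 38 in the paper gives the exact definition, so the averaging shown here is only an assumed form.

```python
import numpy as np

def epsilon_mse(x_analysis, x_true):
    """Per-coordinate mean squared error over an assimilation window (cf. Eq. 38).

    x_analysis, x_true : (n_times, n_state) analysed and true state trajectories.
    Returns one value per state coordinate (cell of the u and v fields).
    """
    return np.mean((x_analysis - x_true) ** 2, axis=0)

# Average over all cell coordinates and over the set of test observation series:
# eps_bar = np.mean([epsilon_mse(xa, xt) for xa, xt in zip(analyses, truths)])
```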

9 Discussion

The precision of DA reconstruction/prediction depends heavily on the specification of both the background and the observation error covariances. The latter is often challenging to estimate in real-world applications because of the dynamic nature of the observation data. Furthermore, unlike the background error covariance, the observation matrix \(\mathbf{R}\) cannot be empirically estimated from an ensemble of simulated trajectories. In this paper, we have reviewed in detail some well-known observation covariance tuning algorithms [23, 54] based on time-variant posterior innovation quantities. These methods, widely adopted in the geosciences, rely on specific prior assumptions such as knowledge of the correlation structure [23] or of the background matrix [24], which are difficult to fulfil in domains where very little knowledge about the prior errors is available.
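For reference, the posterior-innovation diagnostic that such tuning methods build on can be sketched as follows. This is the standard Desroziers-type estimate, which we assume underlies D05; the iterative versions used in practice add further regularization and pre-processing that are omitted here.

```python
import numpy as np

def innovation_based_R(Y, Xb, Xa, H):
    """Posterior-innovation diagnostic of R (Desroziers-type sketch).

    Y  : (n_cycles, n_obs)   observations collected over assimilation cycles
    Xb : (n_cycles, n_state) background states
    Xa : (n_cycles, n_state) analysed states
    H  : (n_obs, n_state)    linear observation operator
    """
    d_b = Y - Xb @ H.T                 # background innovations  y - H x_b
    d_a = Y - Xa @ H.T                 # analysis residuals      y - H x_a
    R_hat = d_a.T @ d_b / Y.shape[0]   # R ~ E[(y - H x_a)(y - H x_b)^T]
    return 0.5 * (R_hat + R_hat.T)     # symmetrise the sample estimate
```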

In this study, we have proposed a novel machine learning approach based on LSTM neural networks to predict the \(\mathbf{R}\) matrix using time series observation data as model input. Similar to the work of [23, 24], \(\mathbf{R}\) is assumed to be time-invariant, at least over a sufficiently long time period. Both Kalman- and variational-type assimilation methods can benefit from the proposed method to improve the assimilation accuracy. By mapping raw sensor observations directly to the observation error covariance matrix, the proposed data-driven approach also helps to tackle one of the major bottlenecks of DA, namely the time-consuming and computationally expensive update of the covariance matrix. In both the Lorenz96 and the shallow water models presented in this paper, the LSTM-based approach displays significant strength compared to the classical posterior tuning methods DI01 and D05 in terms of: (i) estimation accuracy of the observation covariance \(\mathbf{R}\); (ii) reconstruction and prediction accuracy of the DA scheme using the estimated \(\mathbf{R}\) matrix; (iii) computational efficiency of the online covariance estimation; and (iv) flexibility with respect to different model parameterizations. An important limitation of the proposed LSTM-based method is the specification of \(\Phi _{\mathbf{R}}\), which defines the range of parameters used for training.

Since we assume that the observation matrix is time-invariant, the proposed approach can only deal with fixed sensor placement for dynamical systems, which is also the case for the DI01 and D05 tuning algorithms. The possibility of time-variant sensor placement warrants further investigation. As pointed out by [55], a DL model can be stolen or reverse engineered through model inversion or model extraction attacks. Although all data used in the current study are generated from toy models, it is important to ensure data privacy when applying the model to real applications. Future research should also consider applying the new method to a broader range of real-world problems, including NWP, hydrology, and object tracking, where the offline data simulation could be more computationally expensive than for the two test models presented in this paper. To this end, future studies could also investigate combining the current covariance estimation method with model reduction methods such as domain localization [56], proper orthogonal decomposition (POD), information-based data compression [57], and auto-encoder neural networks [58]. More precisely, the data assimilation can be performed in a compressed low-dimensional space (e.g., obtained from POD or an auto-encoder), and the LSTM-based covariance specification algorithm developed in this work can be used to estimate the observation error covariance matrices in that low-dimensional space, improving the accuracy of reduced-order data assimilation approaches.