1 Introduction

In order to improve the reconstruction and prediction of dynamical systems with uncertainties, data assimilation (DA) techniques, originally developed in numerical weather prediction (NWP) [1] and geosciences [2], are widely applied to industrial problems, such as hydrology [3], wildfire forecasting [4], drought monitoring [5] and nuclear engineering [6]. DA algorithms aim to find the optimal approximation (also known as the analyzed state) of the state variables (usually representing a physical field of interest, such as velocity or temperature), relying on prior estimations and real-time observations, both assumed to be noisy. Due to the large problem dimension (often ranging from \(10^6\) to \(10^{10}\) in NWP and geoscience problems), prior errors are assumed to be Gaussian distributed for the sake of simplicity [7]. As a consequence, the prior error distribution can be fully characterized by its first (mean) and second (covariance) moments. The output of a DA algorithm is determined through an optimization function in which the weights given to prior simulations and observations are set by the associated error covariance matrices, respectively named the background and observation covariances. These error covariance matrices thus provide crucial information in DA algorithms [8], not only for the estimation of the analyzed state but also for specifying posterior error distributions [9]. The prior errors represented by these matrices, especially observation errors, consist of an ensemble of different sources of noise/uncertainty, including model error, instrument noise, and representativity error [10, 11].

In statistics, the covariance matrix of a random vector is often obtained via empirical estimation, where a sufficient number of simultaneous samples is required to avoid estimation bias [12]. Moreover, when the number of samples is smaller than the problem dimension, the estimated covariance will be rank deficient. In DA problems, the high dimensionality and the lack of simultaneous data (i.e., several background or observation trajectories in the same time window) represent significant obstacles to covariance computation [13]. To overcome these difficulties, we often rely on calibration (e.g., least-squares) methods based on some generic correlation kernels, often with homogeneous and isotropic characteristics [14]. Balance operators can be employed for multivariate systems [15]. In terms of correlation kernels, the family of Matérn functions, including the exponential kernel (Matérn 1/2), the Balgovind kernel (Matérn 3/2, also known as the second-order auto-regressive (SOAR) function), and the Gaussian kernel (the limit of the Matérn family as the smoothness parameter tends to infinity), is often prioritized for covariance computation owing to its smoothness and capability to capture spatial correlations in physical processes [10, 16, 17]. Other stationary covariance models involve, for instance, convolution formulations [18] or diffusion-based operators [19], both of which contribute to an efficient storage of the covariance matrices. However, limited by homogeneous and isotropic assumptions, it remains cumbersome to represent complex spatial correlations (often multidimensional and multivariate) using these one-dimensional kernels.

In this study, we develop and test a novel data-driven approach based on recurrent neural networks (RNNs) to improve both the accuracy and the efficiency of observation covariance specification in dynamical data assimilation problems. The novel approach is tested and compared with two state-of-the-art covariance tuning algorithms in two different twin experiments, one with non-parametric and one with parametric covariance estimation.

The paper is organized as follows. In Sect. 2, we introduce the related work for error covariance specification. The problem statement and the contribution of this paper are described in Sect. 3. Data assimilation techniques and the ensemble methods are introduced briefly in Sect. 4. We then describe traditional posterior covariance tuning algorithms DI01 and D05 in Sect. 5. The novel LSTM-based method is introduced in Sect. 6, followed by the comparison in the Lorenz (Sect. 7) and the shallow water twin experiments (Sect. 8). We close the paper with a discussion in Sect. 9.

2 Related work

To gain a clearer insight into covariance evolution, some ensemble-based methods, such as the NMC method [1] and the EnKF [20], have been developed to provide a non-parametric covariance estimation. These methods depend on the propagation of an ensemble of simulated trajectories, initialized either at different forecasting time steps (NMC) or by adding artificially set perturbations to the current state (EnKF). These methods are more appropriate for modeling the background matrix rather than the observation matrix. The latter, independent of the numerical simulations, cannot be represented by the propagation of artificially added noise. The Particle-Aided Unscented Kalman Filter [21] can estimate systems with high nonlinearity with a real-time update of the background matrix. However, the observation matrix cannot be estimated directly via the Particle-Aided Unscented Kalman Filter. In practice, the observation matrix is often set to be diagonal or spatially isotropic for the sake of simplicity (e.g., [22]). However, it is shown in the work of [10] that well-specified correlated observation covariances can significantly improve the performance of DA algorithms.

Several methods of data-derived posterior diagnosis have also been developed based on the analysis of innovation quantities, which consist of the difference between the observations and the projected background/analyzed state in the observation space. As a strong contributor to this topic, the meteorology community has developed several well-known posterior diagnoses and their improved versions [23,24,25] to adjust the background/observation ratio, the correlation scale length, or the full covariance structure in the observation space (both the observation matrix and the projected background matrix). Some iterative processes [26, 27] based on fixed-point theory have also been proposed for error covariance tuning. Recent works [28, 29] have proved the convergence of the so-called “Desroziers iterative method” [24] (also known as D05) in the ideal case. In brief, they have mathematically proved that, starting from a semi-positive definite matrix as an initial guess, the D05 iterative method converges to the exact time-invariant (at least over a sufficiently long time period) observation error covariance when the background matrix and the transformation operator (which maps the state variables to real-time observations) are perfectly known a priori. On the other hand, it is also mentioned in [29] that a regularization step is necessary in practice when applying D05, and the convergence of the regularized iterations remains an open question [3, 29]. To deal with time-varying systems, lag-innovation statistics are used for error covariance estimation [30]. The essential idea is to build a secondary Kalman-filtering process for adjusting error covariances using time-shifted innovation vectors. For more details on innovation-based methods, we refer to the overview of [13], which also covers some other estimation approaches, such as the family of likelihood-based approaches and expectation-maximization (EM) methods.

3 Problem statement and contribution

Our work considers a similar setting to [24, 29], where both the state forward model and the transformation operator are presumed to be well known. As the main difficulty concerns the non-synchronous, time-variant observations in dynamical systems (which prevents empirical estimation), in this work we propose the use of recurrent neural networks (RNNs) [31] for the specification of the observation matrix from the underlying dynamics of the observed quantities. RNNs have been widely adopted for the prediction/reconstruction of dynamical systems, especially in natural language processing (NLP) [32] and image/video processing [33], due to their convincing capacity for dealing with time series. More recently, RNNs have also made their way into other engineering fields such as biomedical applications and computational fluid mechanics [34]. In general, the combination of deep learning and data assimilation methods [35, 36] has been widely adopted and analyzed in a variety of industrial applications, including air pollution [37] and ocean-atmosphere modeling [38]. A convolutional neural network (CNN) for covariance estimation has also been suggested in the work of [39]. In this study, we propose a novel methodology for LSTM-based covariance estimation which can be easily integrated into any DA schema for dynamical systems. We first construct a set of training covariance matrices, either parametric or non-parametric, within a certain range defined a priori. For each matrix in the training set, we then simulate a dynamical trajectory of the observation vector relying on the knowledge of the forward model, where the noise at each time step is generated following a centered Gaussian distribution characterized by the error covariance. These trajectories are later used as input variables to train the long short-term memory (LSTM) RNN regression model, where the time-invariant observation matrices stand for the learning target. For the online evaluation, only the historical observation data is needed to predict the error covariances. Compared to traditional posterior tuning methods [24, 40], which require several applications of DA algorithms, the proposed machine learning (ML) method can be much more computationally efficient for real-time covariance estimation. Moreover, unlike most of the traditional methods, no prior knowledge concerning either the background or the observation matrix is necessary for the proposed ML approach. For example, DI01 [23] requires precise knowledge of the correlation structures of both the background and observation matrices, while D05 [24] makes use of perfect knowledge of the background covariance.

In order to make a comprehensive comparison with traditional methods, two different twin experiment frameworks are implemented in this paper, using respectively the Lorenz96 and the 2D shallow-water models. The Lorenz system, characterized by only three state variables, is associated with a non-parametric covariance modeling, while an isotropic correlation kernel is used to parameterize the observation matrix in the shallow water dynamics. In both cases, we compare the performance of the proposed LSTM-based method against the state-of-the-art tuning approaches D05 and DI01 in terms of both covariance specification and posterior DA accuracy. An ensemble DA schema is used to estimate the time-variant background matrix for each of these methods.

4 Data assimilation

4.1 Principle of data assimilation

The objective of data assimilation algorithms is to bring the estimation of the system state \(\mathbf{x}\) as close as possible to its true value \(\mathbf{x}_{\mathrm{true}}\), also known as the true state, by taking advantage of two sources of information: the prior estimation or forecast \(\mathbf{x}_b\), also called the background state, and the measurement or observation \(\mathbf{y}\). DA algorithms aim to find an optimally weighted compromise between \(\mathbf{x}_b\) and \(\mathbf{y}\) by minimizing the loss function J defined as:

$$\begin{aligned} J(\mathbf{x})&=\frac{1}{2}(\mathbf{x}-\mathbf{x}_b)^T\mathbf{B}^{-1} (\mathbf{x}-\mathbf{x}_b) + \frac{1}{2}(\mathbf{y}-{\mathcal {H}} (\mathbf{x}))^T \mathbf{R}^{-1} (\mathbf{y}-{\mathcal {H}}(\mathbf{x})) \end{aligned}$$
(1)
$$\begin{aligned}&=\frac{1}{2}||\mathbf{x}-\mathbf{x}_b||^2_{\mathbf{B}^{-1}} +\frac{1}{2}||\mathbf{y} -{\mathcal {H}}(\mathbf{x})||^2_{\mathbf{R}^{-1}}, \end{aligned}$$
(2)

where \({\mathcal {H}}\) denotes the transformation operator from the state space to observation space. \(\mathbf{B}\) and \(\mathbf{R}\) are, respectively, the background and the observation error covariance matrices, i.e.

$$\begin{aligned} \mathbf{B} = \text {Cov}(\epsilon _b, \epsilon _b), \quad \mathbf{R} = \text {Cov}(\epsilon _y, \epsilon _y), \end{aligned}$$
(3)

where

$$\begin{aligned} \epsilon _b = \mathbf{x}_b - \mathbf{x}_{\mathrm{true}}, \quad \epsilon _y = {\mathcal {H}}(\mathbf{x}_{\mathrm{true}})-\mathbf{y}. \end{aligned}$$
(4)

Errors \(\epsilon _b, \epsilon _y\) are assumed to be centered Gaussian, following:

$$\begin{aligned} \epsilon _b \sim {\mathcal {N}} (0, \mathbf{B}), \quad \epsilon _y \sim {\mathcal {N}} (0, \mathbf{R}). \end{aligned}$$
(5)

In Eq. 1, the first term incorporates the prior information \(\mathbf{x}_b\), while the second term penalizes the difference between the observation \(\mathbf{y}\) and the state variables mapped to the observation space, \({\mathcal {H}}(\mathbf{x})\). Both terms are weighted by the inverse of the corresponding error covariance matrix (\(\mathbf{B}^{-1}\), \(\mathbf{R}^{-1}\)) to reflect the confidence in each source of information.

The optimization problem of Eq. 1, the so-called three-dimensional variational (3D-Var) formulation, is a general representation of variational assimilation which does not take model error into account. The output of this minimization is denoted as \(\mathbf{x}_a\), i.e.

$$\begin{aligned} \mathbf{x}_a = \underset{\mathbf{x}}{\mathrm{argmin}} (J(\mathbf{x})). \end{aligned}$$
(6)

If the observation operator is linear, represented by the matrix \(\mathbf{H}\), Eq. 6 can be solved via the BLUE (Best Linear Unbiased Estimator) formulation:

$$\begin{aligned} \mathbf{x}_a&= \mathbf{x}_b+\mathbf{K}(\mathbf{y} -\mathbf{H} \mathbf{x}_b), \end{aligned}$$
(7)
$$\begin{aligned} \mathbf{A}&= (\mathbf{I}-\mathbf{K} \mathbf{H})\mathbf{B}, \end{aligned}$$
(8)

where \(\mathbf{A} = \text {Cov}(\mathbf{x}_a-\mathbf{x}_{\mathrm{true}})\) is the analyzed error covariance, and \(\mathbf{K}\) is the Kalman gain matrix described by

$$\begin{aligned} \mathbf{K}=\mathbf{B} \mathbf{H}^T (\mathbf{H} \mathbf{B} \mathbf{H}^T+\mathbf{R})^{-1}. \end{aligned}$$
(9)
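For concreteness, the following is a minimal numpy sketch of the BLUE analysis step (Eqs. 7-9), assuming a linear observation operator given as a matrix \(\mathbf{H}\); the function and variable names are illustrative.

```python
import numpy as np

def blue_analysis(x_b, y, H, B, R):
    """BLUE analysis step (Eqs. 7-9) for a linear observation operator H.

    x_b : background state, shape (n,)
    y   : observation vector, shape (m,)
    H   : observation operator, shape (m, n)
    B, R: background / observation error covariances, (n, n) and (m, m)
    """
    # Kalman gain K = B H^T (H B H^T + R)^{-1}  (Eq. 9)
    S = H @ B @ H.T + R
    K = B @ H.T @ np.linalg.inv(S)
    # Analyzed state and analyzed error covariance (Eqs. 7-8)
    x_a = x_b + K @ (y - H @ x_b)
    A = (np.eye(B.shape[0]) - K @ H) @ B
    return x_a, A, K
```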

In the rest of this paper, we consider \(\mathbf{H}\) to be a linear transformation operator. Nevertheless, it is usually more challenging to find the minimum of Eq. 1 when \({\mathcal {H}}\) is non-linear, and even more so when the state is high-dimensional. The minimization then often involves gradient descent algorithms (such as L-BFGS-B [41]) or adjoint-based [42] numerical techniques.

DA algorithms can be applied to dynamical systems through sequential applications, expressed by a transition operator \({\mathcal {M}}_{t^k \rightarrow t^{k+1}}\) (from discrete time \(t^{k}\) to \(t^{k+1}\)), where

$$\begin{aligned} \mathbf{x}_{t^{k+1}} = {\mathcal {M}}_{t^k \rightarrow t^{k+1}} (\mathbf{x}_{t^k}). \end{aligned}$$
(10)

\(\mathbf{x}_{b,t^{k+1}}\) thus depends on the knowledge of \({\mathcal {M}}_{t^k \rightarrow t^{k+1}}\) and the DA correcting state \(\mathbf{x}_{a,t^k}\), i.e.,

$$\begin{aligned} \mathbf{x}_{b,t^{k+1}} = {\mathcal {M}}_{t^{k} \rightarrow t^{k+1}} (\mathbf{x}_{a,t^{k} }) \end{aligned}$$
(11)

Obviously, the more accurate \(\mathbf{x}_{a,t^{k}}\) is, the more reliable \(\mathbf{x}_{b,t^{k+1}}\) will be.

To leverage the information embedded in the background state and the observations, the modeling of the covariance matrices is a pivotal point in DA, as they influence not only how prior errors are spread but also the DA results themselves [26].

4.2 Ensemble methods

Ensemble data assimilation (EnDA) [43, 44] methods have shown strong performance in dealing with non-linear chaotic DA systems by creating an ensemble of size M of the system state, denoted \(\{\mathbf{x}_{t^{k}}^{(i)}|1\le i \le M\}\). The ensemble is used to represent both the prior and the posterior probability distribution of the state variables. The ensemble members evolve under \({\mathcal {M}}_{t^{k} \rightarrow t^{k+1}}\) and DA is applied to each of them at every assimilation window. Furthermore, instead of explicitly evolving the system to obtain the \(\mathbf{B}\) matrix, which is a time-consuming and computationally expensive process for large state dimensions, \(\mathbf{B}\) is estimated as a sample covariance:

$$\begin{aligned} \mathbf{B}_{b,t^{k}} \approx \frac{1}{M-1}\sum _{i=1}^{M} (\mathbf{x}_{b,t^{k}}^{(i)}-\overline{\mathbf{x}}_{b,t^{k}}) (\mathbf{x}_{b,t^{k}}^{(i)}-\overline{\mathbf{x}}_{b,t^{k}})^{T}, \end{aligned}$$
(12)

where \(\overline{\mathbf{x}}_{b,t^{k}}=\frac{1}{M} \sum _{i=1}^{M} \mathbf{x}_{b,t^{k}}^{(i)}\), and the estimation becomes more reliable as M increases. For the applications in this study, EnDA, with a sufficiently large ensemble, is used to estimate \(\mathbf{x}_{b,t^{k}}\) and \(\mathbf{B}_{b,t^{k}}\) so that we can focus on the comparison of \(\mathbf{R}\) matrix modelings.
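A minimal numpy sketch of the sample estimation of \(\mathbf{B}\) in Eq. 12 is given below, where each row of the input array is one ensemble member; the function name is illustrative.

```python
import numpy as np

def ensemble_background_covariance(X_b):
    """Sample estimate of B (Eq. 12) from an ensemble of background states.

    X_b : array of shape (M, n), one background member per row.
    """
    M = X_b.shape[0]
    x_mean = X_b.mean(axis=0)                 # ensemble mean (1/M sum)
    anomalies = X_b - x_mean                  # shape (M, n)
    return anomalies.T @ anomalies / (M - 1)  # unbiased sample covariance
```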

4.3 Observation error covariances specification

For the estimation of \(\mathbf{R}\), under the assumption that the system model is stationary, a wide variety of methods have been explored, for example, the DI01 method [23], which adjusts the ratio between \(Tr(\mathbf{B})\) and \(Tr(\mathbf{R})\), and the D05 approach [24], which iteratively estimates the full covariance matrix in the observation space. However, these methods, based on posterior innovation quantities (i.e., \(\mathbf{y}-{\mathcal {H}}(\mathbf{x}_a)\)) which require several applications of DA algorithms, can be computationally expensive. Moreover, these tuning methods, especially D05 which estimates the full matrix, are not suitable for different matrix parameterizations. In this paper, working with time-series observation data, we use LSTM to predict the corresponding \(\mathbf{R}\) matrix under assumptions similar to those of DI01 and D05. The two classical methods, introduced in Sect. 5, are implemented to compare their results with the proposed machine learning approach.

5 Posterior covariance tuning algorithms

5.1 Desroziers and Ivanov (DI01) tuning algorithm

Because \(\mathbf{B}\) and \(\mathbf{R}\) determine the weights of the background and observation information in the loss function (Eq. 1), the knowledge of \(\text {Tr}(\mathbf{B})\) and \(\text {Tr}(\mathbf{R})\) is crucial to DA accuracy. The DI01 tuning algorithm [23], relying on the diagnosis of innovation quantities, has been widely adopted in meteorology [28, 45] and geoscience [46]. Consecutive works have been carried out to improve its performance and feasibility in problems of large dimension [47]. Without modifying the error correlation structures, DI01 adjusts the prior error amplitudes by applying an iterative fixed-point procedure.

As demonstrated in [23, 48], when \(\mathbf{B}\) and \(\mathbf{R}\) are perfectly specified,

$$\begin{aligned} {\mathbb {E}}\left[ J_b(\mathbf{x}_a) \right]&= \frac{1}{2} {\mathbb {E}}\left[ (\mathbf{x}_a -\mathbf{x}_b)^T \mathbf{B}^{-1}(\mathbf{x}_a-\mathbf{x}_b) \right] \nonumber \\&=\frac{1}{2} \text {Tr}(\mathbf{K}\mathbf{H}), \end{aligned}$$
(13)
$$\begin{aligned} {\mathbb {E}} \left[ J_o(\mathbf{x}_a) \right]&= \frac{1}{2} {\mathbb {E}}\left[ (\mathbf{y} -\mathbf{H}\mathbf{x}_a)^T \mathbf{R}^{-1}(\mathbf{y}- \mathbf{H}\mathbf{x}_a)\right] \nonumber \\&=\frac{1}{2} \text {Tr}(\mathbf{I}-\mathbf{H}\mathbf{K}), \end{aligned}$$
(14)

where \(\mathbf{H}\) is a linearized observation operator. Based on Eqs. 13 and 14 it is possible to iteratively correct the magnitudes of \(\mathbf{B}\) and \(\mathbf{R}\), following

$$\begin{aligned} \mathbf{B}_{q+1}=s_{b,q} \mathbf{B}_q, \quad \mathbf{R}_{q+1}=s_{o,q} \mathbf{R}_q, \end{aligned}$$
(15)

using the two indicators

$$\begin{aligned} s_{b,q}&=\frac{2J_b(\mathbf{x}_a)}{\text {Tr}(\mathbf{K}_q \mathbf{H})}, \end{aligned}$$
(16)
$$\begin{aligned} s_{o,q}&=\frac{2J_o(\mathbf{x}_a)}{\text {Tr}(\mathbf{I}-\mathbf{H}\mathbf{K}_q)}, \end{aligned}$$
(17)

where q is the current iteration.

Acting as scaling coefficients, the sequences \(\{ s_{b,q}\}\) and \(\{s_{o,q}\}\) modify the error variance magnitudes in the iterative process. It is worth noting that both the analyzed state \(\mathbf{x}_a\) and the gain matrix \(\mathbf{K}_q\) are obtained using \(\mathbf{B}_q\) and \(\mathbf{R}_q\), which depend on \(s_{b,q}\) and \(s_{o,q}\). When the correlation patterns of both \(\mathbf{B}\) and \(\mathbf{R}\) are well known, DI01 is equivalent to a maximum-likelihood parameter tuning, as pointed out in [28, 47].

Unlike other posterior covariance diagnoses/computations, such as [24, 26], the estimation of the full matrices is not needed in DI01. Instead, only the estimation of two scalar values (\(J_b,J_o\)) is required, which significantly reduces the computational cost. As a consequence, this method can be more appropriate for online covariance tuning.
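A minimal sketch of the DI01 fixed-point iteration (Eqs. 15-17) is given below, assuming a linear observation operator and a single BLUE analysis per iteration; the function name and default iteration number are illustrative.

```python
import numpy as np

def di01_tuning(x_b, y, H, B, R, n_iter=2):
    """Sketch of the DI01 fixed-point tuning (Eqs. 15-17) for a linear H."""
    m = R.shape[0]
    for _ in range(n_iter):
        # Analysis with the current B_q and R_q
        K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
        x_a = x_b + K @ (y - H @ x_b)
        # Observed cost-function values at the analysis
        J_b = 0.5 * (x_a - x_b) @ np.linalg.inv(B) @ (x_a - x_b)
        J_o = 0.5 * (y - H @ x_a) @ np.linalg.inv(R) @ (y - H @ x_a)
        # Scaling indicators (Eqs. 16-17) and covariance rescaling (Eq. 15)
        s_b = 2.0 * J_b / np.trace(K @ H)
        s_o = 2.0 * J_o / np.trace(np.eye(m) - H @ K)
        B, R = s_b * B, s_o * R
    return B, R
```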

5.2 Desroziers iterative method (D05) in the observation space

The Desroziers diagnosis (D05) [24], based on prior and posterior state-observation residuals, has been widely applied in engineering problems, including numerical weather prediction [28] and hydrology [3]. The work of [24] proved that when \(\mathbf{B}\) and \(\mathbf{R}\) are well known a priori, the analysis and background residuals satisfy, in expectation:

$$\begin{aligned} {\mathbb {E}}\left( [\mathbf{y}-{\mathcal {H}}(\mathbf{x}_a)][\mathbf{y} -{\mathcal {H}}(\mathbf{x}_b)]^T\right) = \mathbf{R}. \end{aligned}$$
(18)

The difference between the left-hand side and the right-hand side of Eq. 18,

$$\begin{aligned} || \mathbf{R} - {\mathbb {E}}\left( [\mathbf{y}-{\mathcal {H}} (\mathbf{x}_a)][\mathbf{y}-{\mathcal {H}}(\mathbf{x}_b)]^T\right) ||_F, \end{aligned}$$
(19)

can be used as a validation indicator of the \(\mathbf{R}\) matrix estimation, where \(||.||_F\) denotes the Frobenius norm. With this method, time-variant observation/background data can contribute to the estimation of the \(\mathbf{R}\) matrix because the expectation in Eq. 18 can be evaluated using residuals at different time steps. When the \(\mathbf{B}\) matrix is well known, an iterative process has been introduced to estimate the \(\mathbf{R}\) matrix:

$$\begin{aligned} \mathbf{R}_{q+1} = {\mathbb {E}}\left( [\mathbf{y}-{\mathcal {H}} (\mathbf{x}_{a,q})][\mathbf{y}-{\mathcal {H}}(\mathbf{x}_b)]^T\right) , \end{aligned}$$
(20)

based on fixed-point theory [29]. The current analysis state \(\mathbf{x}_{a,q}\) is obtained using the specification of \(\mathbf{R}_q\), while \(\mathbf{x}_b\), \(\mathbf{B}\) and \(\mathbf{y}\) remain invariant. As proved in [28, 29], under the assumption of sufficient observation data and a well known \(\mathbf{B}\) matrix, the iterative process of Eq. 20 converges to the exact observation error covariance. However, as shown in [29], the intermediate matrices \(\mathbf{R}_q\) can be non-symmetric and possibly contain negative or complex eigenvalues, which is cumbersome for DA algorithms to deal with.

In practice, a posterior regularization at each iteration step is often required to ensure the positive definiteness of \(\mathbf{R}_q\) [29] where the first step of the regularization could be symmetrizing the estimated \(\mathbf{R}_q\) matrix, i.e.,

$$\begin{aligned} \mathbf{R}_q \longleftarrow \frac{1}{2} (\mathbf{R}_q + \mathbf{R}^T_q). \end{aligned}$$
(21)

The spectrum of \(\mathbf{R}_q\) now contains only real numbers, but they are not necessarily positive. The hybrid method [2] is a standard approach in ensemble-based DA methods to ensure positive definiteness; it consists of combining a covariance matrix \(\mathbf{C}\) defined a priori with the one obtained from empirical estimation. We thus obtain the formulation of the regularized observation matrix:

$$\begin{aligned} \mathbf{R}_{q} \longleftarrow (1-\mu )\mathbf{R}_q + \mu \mathbf{C}, \end{aligned}$$
(22)

following Eq. 21 with \(\mu \in (0,1)\). The matrix \(\mathbf{C}\) is often set as a diagonal matrix since it helps to enhance the matrix conditioning. In this work, we choose to set

$$\begin{aligned} \mu = 0.2 \quad \text {and} \quad \mathbf{C}= \frac{Tr(\mathbf{R}_q) \times \mathbf{I}}{\mathrm{dim}(\mathbf{y})}, \end{aligned}$$
(23)

so that \(Tr(\mathbf{R}_q)\) is not modified by the post-regularization. As mentioned in the discussion of [3, 29], the convergence of the regularized observation matrices remains an open question. Therefore, a small iteration number is often assigned for D05 tuning in industrial problems (e.g., \(q=2\) in [3, 49]). Since the right-hand side of Eq. 20 can be estimated using residual quantities at different time steps, D05 is often used to deal with time-series observation data (e.g., [3, 49]) under the assumption that the \(\mathbf{R}\) matrix is time-invariant.
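The following is a minimal sketch of the regularized D05 iteration (Eqs. 20-23), assuming a linear observation operator and approximating the expectation of Eq. 20 by an average over residuals at several time steps; the function name and arguments are illustrative.

```python
import numpy as np

def d05_iteration(X_b, Y, H, B, R0, n_iter=2, mu=0.2):
    """Sketch of the regularized D05 iteration (Eqs. 20-23) for a linear H.

    X_b, Y : background states and observations at T time steps,
             shapes (T, n) and (T, m); they stay fixed across iterations.
    """
    m = R0.shape[0]
    R = R0.copy()
    for _ in range(n_iter):
        K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
        # Average of the residual products E[(y - H x_a)(y - H x_b)^T]
        R_hat = np.zeros((m, m))
        for x_b, y in zip(X_b, Y):
            x_a = x_b + K @ (y - H @ x_b)
            R_hat += np.outer(y - H @ x_a, y - H @ x_b)
        R_hat /= len(Y)
        R_hat = 0.5 * (R_hat + R_hat.T)          # symmetrization (Eq. 21)
        C = np.trace(R_hat) * np.eye(m) / m      # diagonal target (Eq. 23)
        R = (1.0 - mu) * R_hat + mu * C          # hybrid regularization (Eq. 22)
    return R
```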

6 LSTM for error covariance estimation

6.1 Introduction of RNN and LSTM

LSTM, first introduced in [50], is a kind of RNN [31] capable of solving long-term dependency problems [51] that traditional RNNs could not handle. As with other recurrent neural networks, LSTM has a chain-like structure. This structure is created by repeating the same module, shown on the left side of Fig. 1. The repeating module comprises four neural network layers instead of only one; its specific structure is shown on the right side of Fig. 1.

Fig. 1 LSTM diagrams

An essential part of LSTM is the cell state \(\mathbf{C}_{t^{k-1}}\), which is the long-term memory storing information about past behaviors. LSTM uses three gates, each composed of a sigmoid-layer neural network (a single-layer neural network with a sigmoid activation function at the output layer) and a pointwise multiplication operation, to protect and control the information of the cell state, as shown in Fig. 1.

The first gate is the forget gate following:

$$\begin{aligned} \mathbf{f}_{t^k}=\sigma (\mathbf{W}_f\cdot [\mathbf{h}_{t^{k-1}}, \mathbf{x}_{t^k}]+b_f), \end{aligned}$$
(24)

where the recurrent variable \(\mathbf{h}_{t^{k-1}}\) summarizes the information about past behaviors, \(\mathbf{x}_{t^k}\) carries the information about the current time step, and \(\mathbf{W}_f\) and \(b_f\) are, respectively, the weights and bias parameterizing the sigmoid-layer neural network. The forget gate decides what information is to be discarded from \(\mathbf{C}_{t^{k-1}}\).

The second gate is the input gate, which determines which new information is added into \(\mathbf{C}_{t^{k-1}}\). This new candidate information \(\tilde{\mathbf{C}}_{t^{k}}\), given by

$$\begin{aligned} \tilde{\mathbf{C}}_{t^{k}}=tanh(\mathbf{W}_c \cdot [\mathbf{h}_{t^{k-1}},\mathbf{x}_{t^k}]+b_c), \end{aligned}$$
(25)

is attained by passing \(\mathbf{h}_{t^{k-1}}\) and \(\mathbf{x}_{t^k}\) to a tanh layer neural network (single layer neural network with tanh activation function at the output layer) with parameters \(\mathbf{W}_c\) and \(b_c\).

\(\tilde{\mathbf{C}}_{t^{k}}\) is then multiplied by the weight coefficients \(\mathbf{i}_{t^{k}}\), which are computed by the input gate, i.e.,

$$\begin{aligned} \mathbf{i}_{t^{k}}=\sigma (\mathbf{W}_i\cdot [\mathbf{h}_{t^{k-1}}, \mathbf{x}_{t^{k}}]+b_i). \end{aligned}$$
(26)

\(\mathbf{i}_{t^{k}}\) decides which part of the new information is employed to update \(\mathbf{C}_{t^{k-1}}\).

Hence, the cell state \(\mathbf{C}_{t^{k}}\) at the current time step \(t^k\) can be obtained using

$$\begin{aligned} \mathbf{C}_{t^{k}}=\mathbf{f}_{t^k}\odot \mathbf{C}_{t^{k-1}} +\mathbf{i}_{t^{k}}\odot \tilde{\mathbf{C}}_{t^{k}}. \end{aligned}$$
(27)

Finally, obtaining \(\mathbf{h}_{t^{k}}\) requires the output gate and a tanh activation function: first, the tanh activation is applied to the cell state to create the candidate information \(tanh(\mathbf{C}_{t^{k}})\), which is then multiplied by weight coefficients following

$$\begin{aligned} \mathbf{h}_{t^{k}}=\mathbf{o}_{t^{k}}\odot tanh(\mathbf{C}_{t^{k}}), \end{aligned}$$
(28)

to decide which information of \(tanh(\mathbf{C}_{t^{k}})\) contributes to \(\mathbf{h}_{t^{k}}\). Here, \(\mathbf{o}_{t^{k}}\) is generated by the output gate with neural network parameters \(\mathbf{W}_o\) and \(b_o\), i.e.,

$$\begin{aligned} \mathbf{o}_{t^{k}}=\sigma (\mathbf{W}_o[\mathbf{h}_{t^{k-1}}, \mathbf{x}_{t^{k}}]+b_o). \end{aligned}$$
(29)
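For illustration, a minimal numpy sketch of one forward step of the LSTM cell described by Eqs. 24-29 is given below; the weight shapes and function names are illustrative and do not correspond to any particular deep learning library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM step implementing Eqs. 24-29.

    x_t, h_prev, c_prev : current input, previous hidden and cell states.
    Each W_* has shape (hidden, hidden + input); each b_* has shape (hidden,).
    """
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)              # forget gate (Eq. 24)
    i_t = sigmoid(W_i @ z + b_i)              # input gate (Eq. 26)
    c_tilde = np.tanh(W_c @ z + b_c)          # candidate cell state (Eq. 25)
    c_t = f_t * c_prev + i_t * c_tilde        # cell state update (Eq. 27)
    o_t = sigmoid(W_o @ z + b_o)              # output gate (Eq. 29)
    h_t = o_t * np.tanh(c_t)                  # hidden state (Eq. 28)
    return h_t, c_t
```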

6.2 LSTM for \(\mathbf{R}\) matrix estimation using time series observation data

The tuning methods presented in Sect. 5 have been applied in various engineering applications with significant improvements in covariance specification and DA accuracy. However, these methods, which require several applications of DA algorithms, can be computationally expensive for high-dimensional problems. Another important drawback is the requirement of precise knowledge of either the correlation patterns of \(\mathbf{B}\) and \(\mathbf{R}\) (for DI01) or the full \(\mathbf{B}\) matrix (for D05). In this study, we aim to build a data-driven surrogate model for efficient online \(\mathbf{R}\) matrix specification using LSTM. Unlike DI01 or D05, no specific knowledge about the error covariances or the state/observation dynamical systems is required, other than the transformation operator \({\mathcal {H}}\) and the forward model \({\mathcal {M}}_{t^k \rightarrow t^{k+1}}\) (which is also indispensable for standard DA algorithms, including variational methods and Kalman filters).

Based on an initial state \(\mathbf{x}_{b,t^{0}}^{[\mathrm{iter}]} =\mathbf{x}_{g,t^{0}}^{[\mathrm{iter}]}\), where \([\mathrm{iter}]\) indexes a given training sample and \(\mathbf{x}_{g,t^{0}}^{[\mathrm{iter}]}\) denotes the generated initial state of this sample, our main idea is to build a training set for the specific problem, including predefined time-invariant observation matrices \(\{\mathbf{R}^{[\mathrm{iter}]}\}\) within a certain range and generated dynamical observation vectors \(\{\mathbf{y}_{t^k}^{[\mathrm{iter}]} \}\). Setting the dynamical observations \(\{\mathbf{y}_{t^k}^{[\mathrm{iter}]} \}\) as the system input and the \(\mathbf{R}\) matrices as output, LSTM networks are used to learn the error distribution across the underlying observation dynamics. More precisely, a real function \(g^{\mathbf{R}}(.): \Phi _{\mathbf{R}} \longrightarrow {\mathbb {R}}^{m \times m}\) is predefined, where \(\Phi _{\mathbf{R}}\) is an empirically estimated real space which defines the range of a set of parameters, such as the marginal error variance and the correlation scale length [18], used for computing the \(\mathbf{R}\) matrices. The generated observation matrices \(\{\mathbf{R}^{[\mathrm{iter}]}\}\) are set to be symmetric positive definite (SPD) thanks to the function \(g^{\mathbf{R}}(.)\). Both \(g^{\mathbf{R}}(.)\) and \(\Phi _{\mathbf{R}}\) vary for different applications.

Generated states \(\{ \mathbf{x}^{[\mathrm{iter}]}_{g,t^{k}}\}\), \(t^{k}\in \{0,\cdots t^T\}\), with \(t^T\) denoting the final time step, can be obtained by evolving the system according to \({\mathcal {M}}_{t^k \rightarrow t^{k+1}}\):

$$\begin{aligned} \left\{ \mathbf{x}^{[\mathrm{iter}]}_{g,t^{k+1}}\right\} = {\mathcal {M}}_{t^k \rightarrow t^{k+1}}\left( \left\{ \mathbf{x}^{[\mathrm{iter}]}_{g,t^{k}}\right\} \right) . \end{aligned}$$
(30)

The observations \(\{\mathbf{y}^{[\mathrm{iter}]}_{t^{k}} \}\), \(t^{k}\in \{0,\cdots t^T\}\), are obtained by mapping \(\{ \mathbf{x}^{[\mathrm{iter}]}_{g,t^{k}}\}\) through \({\mathcal {H}}\) and adding random Gaussian noise:

$$\begin{aligned}&\left\{ \mathbf{y}^{[\mathrm{iter}]}_{t^{k}}\right\} = {\mathcal {H}} \left( \left\{ \mathbf{x}^{[\mathrm{iter}]}_{g,t^{k}}\right\} \right) + \left\{ \epsilon ^{[\mathrm{iter}]}_{g,t^{k}}\right\} \quad \text {for} \quad t^{k}\in \{0,\cdots t^T\} \quad \text {and} \quad \mathrm{iter} = 1...N, \nonumber \\&\qquad \text {where } \quad \left\{ \epsilon ^{[\mathrm{iter}]}_{g,t^{k}}\right\} \sim {{\mathcal {N}}(0,\{\mathbf{R}^{[\mathrm{iter}]}\}}). \end{aligned}$$
(31)
Algorithm 1
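A minimal sketch of the data generation loop of Algorithm 1 (Eqs. 30-31) is given below, assuming a linear observation operator; forward_model and sample_R are placeholders for the application-specific forward model \({\mathcal {M}}\) and the covariance generator \(g^{\mathbf{R}}(.)\).

```python
import numpy as np

def generate_training_set(n_samples, x0, forward_model, H, sample_R, T, rng):
    """Sketch of the offline data generation in Algorithm 1 (Eqs. 30-31).

    forward_model(x) advances the state by one time step; sample_R() draws a
    random SPD observation covariance from the predefined range Phi_R.
    """
    inputs, targets = [], []
    for _ in range(n_samples):
        R = sample_R()                              # R^[iter] drawn via g^R(.)
        x, ys = x0.copy(), []
        for _ in range(T):
            x = forward_model(x)                    # Eq. 30
            noise = rng.multivariate_normal(np.zeros(R.shape[0]), R)
            ys.append(H @ x + noise)                # Eq. 31
        inputs.append(np.stack(ys))                 # observation trajectory (T, m)
        targets.append(R)                           # learning target (or its parameters)
    return np.stack(inputs), np.stack(targets)
```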
Fig. 2 Offline training schema of the proposed method, including covariance generation, observation generation, and LSTM training

After generating \(\{\mathbf{R}^{[\mathrm{iter}]}\}\) and \(\{\mathbf{y}_{t^k}^{[\mathrm{iter}]} \}\) as illustrated in Algorithm 1, an LSTM network is then trained to predict \(\{\mathbf{R}^{[\mathrm{iter}]}\}\) from \(\{ \mathbf{y}_{t^k}^{[\mathrm{iter}]} \}\). The general process is illustrated in Fig. 2: for each sample, \(\mathbf{y}^{[\mathrm{iter}]}\) is simulated by evolving the system using the known system dynamics and the observation error covariance generator, and then \(\mathbf{y}^{[\mathrm{iter}]}\) and \(\mathbf{R}^{[\mathrm{iter}]}\) are used to train the LSTM so that it can predict \(\mathbf{R}^{[\mathrm{iter}]}\) when receiving unseen \(\mathbf{y}^{[\mathrm{iter}]}\).

Fig. 3 Many-to-one LSTM training and \(\mathbf{R}\) prediction process

Following the many-to-one LSTM principle of Fig. 3, the input features of the LSTM consist of the observations \(\mathbf{y}^{[\mathrm{iter}]}=\{\mathbf{y}_{t^0}^{[\mathrm{iter}]},\mathbf{y}_{t^1}^{[\mathrm{iter}]},\cdots ,\mathbf{y}_{t^k}^{[\mathrm{iter}]},\cdots ,\mathbf{y}_{t^T}^{[\mathrm{iter}]}\}\), while the output is the \(\mathbf{R}^{[\mathrm{iter}]}\) matrix. Unlike classical covariance tuning algorithms, the LSTM network only makes use of historical observation data, requiring neither the background states nor the error covariance matrix. The advantage of using LSTM is more salient when the observation dimension is large, for example millions or even billions, and such dimensions are not uncommon in real-world applications [2, 7].

To estimate \(\mathbf{R}\), the LSTM is first trained to learn:

– either the variables used to constitute the symmetric observation error covariance matrix (i.e., the input variables of the \(g^{\mathbf{R}}(.)\) function) in a parametric modeling;

– or the elements of the \(\mathbf{R}\) matrix (e.g., the variables in the upper triangle and in the diagonal of the covariance matrix) in a non-parametric modeling.

The whole process for \(\mathbf{R}\) estimation using LSTM is described in Algorithm 2.

Algorithm 2

Algorithm 2 indicates that the LSTM fitting consists of a training process and a validation process: the training process comprises the forward prediction and the backward updating of the neural network weight parameters, while the validation process predicts the target outputs and computes the validation loss between the predicted outputs and the true output values. \(N_{\mathrm{epoch}}\) indicates the number of times the entire dataset is passed forward and backward through the LSTM during training. Early stopping, which terminates training when the validation loss has not improved for \(N_{\mathrm{patience} \_\text {epoch}}\) epochs, is applied to reduce the LSTM training time.
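As an illustration, a minimal Keras-style sketch of the many-to-one LSTM training with early stopping is given below; the layer sizes loosely follow Table 1, while the file names, number of epochs, batch size and patience are illustrative choices rather than the settings used in the experiments.

```python
import numpy as np
import tensorflow as tf

# Y_train: observation trajectories, shape (n_samples, T, m)
# p_train: covariance parameters (e.g. r_0, r_1, r_2, v_R), shape (n_samples, 4)
Y_train = np.load("observations.npy")       # hypothetical file names
p_train = np.load("cov_parameters.npy")

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(200, input_shape=Y_train.shape[1:]),  # many-to-one LSTM layer
    tf.keras.layers.Dense(p_train.shape[1]),                   # covariance parameters
])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

model.fit(Y_train, p_train, validation_split=0.2,
          epochs=200, batch_size=64, callbacks=[early_stop])
```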

It is important to note that the offline data generation and LSTM training processes need to be carried out individually for different DA applications (Fig. 4).

Fig. 4 Online LSTM prediction of the observation matrix compared to traditional covariance tuning approaches, followed by online data assimilation

7 Lorenz twin experiment

7.1 Twin experiment principle

In a realistic experiment, \(\mathbf{x}_{\mathrm{true}}\) is usually unknown and \(\mathbf{y}\) is often mixed with noise. To overcome this drawback, a twin experiment, in which a prototypical test case is selected to simulate real situations, is applied so as to provide \(\mathbf{x}_{\mathrm{true}}\) for comparison. In this setting, a transformation is applied to a sampled true trajectory \(\mathbf{x}_{\mathrm{true},t^k}\) at some points in space and time, and random noise is added to obtain simulated raw measurements \(\mathbf{y}_{t^k}\). DA is then implemented starting from the initial background state \(\mathbf{x}_{b,t^0}\), representing the prior information available about the corresponding state \(\mathbf{x}_{\mathrm{true},t^0}\), along with the initial raw measurement \(\mathbf{y}_{t^0}\). The output state is then compared against \(\mathbf{x}_{\mathrm{true}}\); the distance between these two states is used to evaluate and improve the performance of DA. In this section, we use a twin experiment to evaluate the performance of applying DA to a simple Lorenz system in which the raw measurement error covariance is estimated/adjusted using, respectively, DI01, D05 and LSTM.

7.2 Experiment set up

The Lorenz system, first studied by Edward Lorenz, is a system of ordinary differential equations. For certain parameter values and initial conditions, the Lorenz system is notable for having chaotic solutions, in particular the Lorenz attractor, toward which the system tends to evolve. The Lorenz 96 system [52] has been widely used as a prototypical test case to compare the performance of DA algorithms [34, 35, 53]. Here we build a twin experiment framework with a simple three-dimensional Lorenz system in which the state vector is denoted as \(\mathbf{x}= [x_{(0)}, x_{(1)}, x_{(2)}]\). The studied Lorenz system can be characterized as:

$$\begin{aligned} \frac{\partial x_{(0)}}{\partial t}&=\sigma (x_{(1)}-x_{(0)})\nonumber \\ \frac{\partial x_{(1)}}{\partial t}&=\alpha x_{(0)}-x_{(1)} -x_{(0)}x_{(2)} \nonumber \\ \frac{\partial x_{(2)}}{\partial t}&=x_{(0)}x_{(1)}-\beta x_{(2)}. \end{aligned}$$
(32)

where the integration time step is \(\delta _t=0.001s\), and \(\sigma =10\), \(\alpha =28\), \(\beta =2.667\).

The initial values of the true state \(\mathbf{x}_{\mathrm{true},t^0}\) are set to be [0, 1, 1.05] while the initial background state \(\mathbf{x}_{b,t^0}\) is generated by combining \(\mathbf{x}_{\mathrm{true}, t^0}\) with a centered Gaussian noise \(\epsilon _{b,t^0}\):

$$\begin{aligned} \mathbf{x}_{b,t^0} = \mathbf{x}_{\mathrm{true},t^0} + \epsilon _{b,t^0} \quad \text {where} \quad \epsilon _{b,t^0} \sim {\mathcal {N}} \left( 0, 0.05 \times \mathbf{I}_3\right) . \end{aligned}$$
(33)

Then both the true states \(\mathbf{x}_{\mathrm{true}}=\{\mathbf{x}_{\mathrm{true},t^0},\cdots ,\mathbf{x}_{\mathrm{true},t^T}\}\) and the background states \(\mathbf{x}_{b}=\{\mathbf{x}_{b,t^0},\cdots , \mathbf{x}_{b,t^T}\}\) of the Lorenz system evolve according to the Lorenz equations (Eq. 32) until t = 1s, for a total of \(T=1000\) time steps.

Subsequently, observations \(\mathbf{y} =\{\mathbf{y}_{t^0},\cdots ,\mathbf{y}_{t^T}\}\) can be acquired by mapping \(\mathbf{x}_{\mathrm{true}}\) through a linear observation operator

$$\begin{aligned} \mathbf{H}= \begin{bmatrix} 1 & 1 & 0 \\ 2 & 0 & 1\\ 0 & 0 & 3 \end{bmatrix}, \end{aligned}$$
(34)

and adding noise following the multivariate normal distribution \({\mathcal {N}}(0,\mathbf{R})\), where \(\mathbf{R}\) is randomly generated following the process described in Sect. 7.3.
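A minimal sketch of the generation of one observation trajectory for this twin experiment is given below, assuming an explicit Euler time-stepping of Eq. 32 (the time integration scheme is an assumption of this sketch) and a placeholder \(\mathbf{R}\).

```python
import numpy as np

def lorenz_step(x, dt=0.001, sigma=10.0, alpha=28.0, beta=2.667):
    """One explicit-Euler step of the three-variable Lorenz system (Eq. 32)."""
    dx = np.array([sigma * (x[1] - x[0]),
                   alpha * x[0] - x[1] - x[0] * x[2],
                   x[0] * x[1] - beta * x[2]])
    return x + dt * dx

rng = np.random.default_rng(0)
H = np.array([[1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 0.0, 3.0]])                 # observation operator of Eq. 34
R = np.eye(3)                                   # placeholder; R is randomly drawn in Sect. 7.3
x_true = np.array([0.0, 1.0, 1.05])             # initial true state
observations = []
for _ in range(1000):
    x_true = lorenz_step(x_true)
    observations.append(H @ x_true + rng.multivariate_normal(np.zeros(3), R))
observations = np.array(observations)           # trajectory of shape (1000, 3)
```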

EnDA is then applied in this twin experiment to update the background ensemble using the available observations. More precisely, every ten time steps, EnDA is applied to \(\mathbf{x}_{b,t^k}\) along with \(\mathbf{y}_{t^k}\) to obtain the analysis states \(\mathbf{x}_{a,t^k}\). To simulate the background states before the next assimilation step, artificial Gaussian noise is added to the analysis state before propagation, i.e.,

$$\begin{aligned} \mathbf{x}_{b,t^{k+1}}&= {\mathcal {M}}_{t^k \rightarrow t^{k+1}} (\mathbf{x}_{a,t^k} + \epsilon _{b,t^{k}}), \nonumber \\ \mathbf{x}_{b,t^{(k+\gamma )}}&= {\mathcal {M}}_{t^{(k+\gamma -1)} \rightarrow t^{(k+\gamma )} }(\mathbf{x}_{b,t^{(k+\gamma -1)}}) \quad \text {for} \quad \gamma \in \{ 2,...,10\}, \end{aligned}$$
(35)

where

$$\begin{aligned} \epsilon _{b,t^{k+1}} \sim {\mathcal {N}}(0, \mathbf{Q}) \quad \text {and} \quad \mathbf{Q}= \begin{bmatrix} 1 & 0.2 & 0 \\ 0.2 & 1 & 0.2\\ 0 & 0.2 & 1 \end{bmatrix}. \end{aligned}$$

The model error covariance matrix \(\mathbf{Q}\) is assumed to be time-invariant for all generated trajectories.

7.3 DA with LSTM-based covariance estimation

Sect. 7.2 describes the process of generating artificial training data for the LSTM model. This section gives more details about the generation of the \(\mathbf{R}\) matrices, which constitute the outputs of the LSTM model within the training data. \(\mathbf{R}\) is parameterized by four real coefficients \(r_0\), \(r_1\), \(r_2\) and \(v_{\mathbf{R}}\), which determine, respectively, the three correlation coefficients and the error amplitude, i.e.,

$$\begin{aligned} \mathbf{R}= v_{\mathbf{R}} \times \begin{bmatrix} 1 & r_0 & r_1 \\ r_0 & 1 & r_2\\ r_1 & r_2 & 1 \end{bmatrix}. \end{aligned}$$
(36)

In this study, \(v_{\mathbf{R}}\) is generated uniformly between 0 and 100, i.e., \(v_{\mathbf{R}} \sim {\mathcal {U}}(0,100)\), with \({\mathcal {U}}\) denoting the uniform probability distribution. The correlation coefficients \(\{r_0,r_1,r_2\} \in (-1,1)^3\) are obtained via randomly generated SPD matrices.
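A minimal sketch of one possible way to draw such a random covariance is given below; the recipe for generating the SPD correlation part is an illustrative choice and not necessarily the exact procedure used here.

```python
import numpy as np

def sample_R(rng):
    """Draw a random 3x3 observation covariance of the form of Eq. 36.

    The correlation part is obtained from a random SPD matrix normalized to
    unit diagonal (one possible generation recipe, used here for illustration).
    """
    A = rng.standard_normal((3, 3))
    S = A @ A.T + 1e-6 * np.eye(3)          # random SPD matrix
    d = np.sqrt(np.diag(S))
    corr = S / np.outer(d, d)               # unit-diagonal correlation (r_0, r_1, r_2)
    v_R = rng.uniform(0.0, 100.0)           # error amplitude v_R ~ U(0, 100)
    return v_R * corr
```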

The LSTM network is thus trained to learn the \(\mathbf{R}\) matrix by building a mapping from the time-series observation data \(\mathbf{y}_{t^k}\) to \(r_0\), \(r_1\), \(r_2\) and \(v_{\mathbf{R}}\). The specific structure of the LSTM model, which consists of an LSTM input layer, a hidden layer with 200 neurons, and an output layer comprising the four neurons used to obtain \(\mathbf{R}\), is shown in Table 1. In this Lorenz twin experiment, two LSTM networks, respectively named LSTM1000 and LSTM200, are designed with different input sizes of time-series data. LSTM1000 is trained on a total of 1000 time steps for predicting the \(\mathbf{R}\) matrix, while LSTM200 makes use of only the first 200 time steps to simulate a realistic application where the time-invariant \(\mathbf{R}\) matrix is estimated using historical data for improving future DA performance. The evaluation of both LSTM1000 and LSTM200, in terms of DA accuracy, is made using the full test dataset with 1000 time steps. By leveraging the LSTM model along with available observations, we can still perform DA algorithms even though \(\mathbf{R}\) is not explicitly given. The results are then compared with the ones obtained using the exact \(\mathbf{R}\) matrix.

Table 1 LSTM structure for \(\mathbf{R}\) prediction in the Lorenz experiment

7.4 Results

To evaluate the LSTM performance in predicting \(\mathbf{R}\), we first compare the LSTM-predicted \(\mathbf{R}\) with the predefined true \(\mathbf{R}\) matrix in the test set and analyze the impact of the predicted \(\mathbf{R}\) on DA accuracy. Since the LSTM prediction of the \(\mathbf{R}\) matrix is non-parametric in this twin experiment, we compare each element of the predicted matrix (i.e., \(r_0\), \(r_1\), \(r_2\) and \(v_{\mathbf{R}}\)) against the ground truth. As for the DA accuracy, we calculate the difference between the error-free true states \(\mathbf{x}_{\mathrm{true},t}=\{x_{\mathrm{true},(0),t},x_{\mathrm{true},(1),t},x_{\mathrm{true},(2),t}\}\) and \(\mathbf{x}_{b,t}=\{x_{b,(0),t}, x_{b,(1),t},x_{b,(2),t}\}\) (obtained via Eq. 35), refined in DA using \(\mathbf{R}\) estimated via DI01 (\(q=2\)), D05 (\(q = 3\)) or LSTM. Both D05 and DI01 are initialized with a random \(\mathbf{R}\) matrix, generated through \(g^{\mathbf{R}}(.)\) with the same range of parameters \(\Phi _{\mathbf{R}}\) as for the LSTM training.

Figure 5 shows the elements of the estimated \(\mathbf{R}\) matrix (i.e., \(r_0\), \(r_1\), \(r_2\) and \(v_{\mathbf{R}}\)) obtained by LSTM1000 and LSTM200, both trained on 103,486 Lorenz system observation samples and evaluated on a test dataset of 10,000 samples. The result of the D05 approach, with \(r_0\), \(r_1\), \(r_2\) and \(v_{\mathbf{R}}\) calibrated against the true values, is also displayed in Fig. 5 (i-l) on the same test dataset for comparison purposes. The blue circles are the LSTM/D05 prediction results, while the red line is the true value of the corresponding parameter. We observe that the LSTM1000 and LSTM200 predictions fit the predefined true value of each parameter much better than the D05 tuning approach, especially for the error amplitude \(v_{\mathbf{R}}\), which is of most importance in covariance specification. Based on these experimental results, we can conclude that the proposed LSTM approach is capable of predicting the observation matrix, in terms of both correlation coefficients and error amplitude, when time-series observation data \(\mathbf{y}_{t^k}\) are given.

Fig. 5 Prediction results of the non-parametric error covariance against the true values in the test dataset for LSTM1000 (a-d), LSTM200 (e-h) and D05 (\(q = 2\)) (i-l), for \(r_0\), \(r_1\), \(r_2\) and \(v_{\mathbf{R}}\) of the Lorenz system

Figure 6 and Table 2 illustrate the averaged DA performance with \(\mathbf{R}\) obtained in different ways, over the 10,000 observation samples of the test dataset. We recall that for each observation sample \(\{ \mathbf{y}_{t^k}\}\), \(t^{k}\in \{0,\cdots ,t^T\}\), 100 background trajectories are generated to perform EnDA. Among these algorithms, DI01 uses two iterations to correct the magnitudes of \(\mathbf{B}\) and \(\mathbf{R}\) following Eq. 15. Each D05 iteration calculates the innovation quantities \((\mathbf{y}-{\mathcal {H}}(\mathbf{x}_a))\) and \((\mathbf{y} -{\mathcal {H}}(\mathbf{x}_b))\) every 10 time steps through a DA procedure, and then applies Eq. 20 to obtain the updated \(\mathbf{R}\) for the corresponding Lorenz system.

Figure 6 displays the evolution of the mean square error (MSE) \(\epsilon _{{\mathrm{std}}\_{\mathrm{mse}},(i),t}\) between DA refined \(\mathbf{x}_{b,t}\) (following Eq. 35) and the true states \(\mathbf{x}_{\mathrm{true},t}\), i.e.,

$$\begin{aligned} \epsilon _{{\mathrm{std}}\_{\mathrm{mse}},(i),t} =\sum _{j=1}^{N}\frac{\sqrt{\sum _{m=1}^{M}{\left( x_{b,(i),t}^{(m),[j]} -x_{(i),t}^{[j]}\right) }^2}}{{\left\| \mathbf{x}_{(i)}\right\| }_2}\Big /N \end{aligned}$$
(37)

where \(i \in \{ 0,1,2\}\), \(N=1988\) is the number of examples, and \(M=100\) is the size of the DA ensemble. Table 2 shows the time-averaged absolute error

$$\begin{aligned} \epsilon _{\mathrm{mse},(i)}=\sum _{j=1}^{N}\frac{\sum _{t=t^0}^{t^{T}} \sqrt{\sum _{m=1}^{M}{\left( x_{b,(i),t}^{(m),[j]} -x_{(i),t}^{[j]}\right) }^2}}{t^T} \Big /N \end{aligned}$$
(38)

to quantify the difference between the assimilated and true states, where \(t^T=1000\) is the total number of time steps over which the Lorenz system is evolved.

It should be noted that the DI01 approach, which exclusively adjusts the \(Tr(\mathbf{B})/Tr(\mathbf{R})\) ratio without modifying the correlation structure, is outperformed by more refined covariance tuning/specification methods such as D05 and LSTM, as shown in Fig. 6. Furthermore, Table 2 shows that \(\mathrm{lstm1000:} \epsilon _{\mathrm{mse}}\) is smaller than \(\mathrm{d05:} \epsilon _{\mathrm{mse}}\), which is consistent with the results shown in Fig. 5. Thus, we can conclude that LSTM1000 is sound at predicting \(\mathbf{R}\), contributing to a better DA performance compared to DI01 and D05. This conclusion is further supported by Table 2, in which \(\epsilon _{\mathrm{mse}}\) based on LSTM1000 has values, for all three state variables displayed, close to those based on the true \(\mathbf{R}\), which is predefined and used throughout the Lorenz system sample generations.

Fig. 6 DA performance evaluated via \(\epsilon _{{\mathrm{std}} \_{\mathrm{mse}}}\) based on \(\mathbf{R}\) obtained via LSTM, D05 and DI01

Table 2 DA performance of the Lorenz system evaluated via \(\epsilon _{\mathrm{mse}}\) for the three state variables, based on \(\mathbf{R}\) obtained from the predefined true value, LSTM1000, LSTM200, DI01 and D05

The curves of \(\mathrm{lstm200:}\epsilon _{{\mathrm{std}}\_{\mathrm{mse}}}\) and \(\mathrm{true:}\epsilon _{{\mathrm{std}}\_{\mathrm{mse}}}\), representing the MSE based on the LSTM200-predicted \(\mathbf{R}\) and the manually predefined true \(\mathbf{R}\) respectively, are not shown in Fig. 6 because they almost completely overlap with \(\mathrm{lstm1000:}\epsilon _{{\mathrm{std}}\_{\mathrm{mse}}}\). This fact leads to the same conclusion as for LSTM1000 and shows that LSTM200, which makes use of the observation data of only the first 200 time steps, is sound at predicting \(\mathbf{R}\) and contributes to a good DA performance with future observations. This conclusion is supported by Table 2, where \(\epsilon _{\mathrm{mse}}\) based on the LSTM200-predicted \(\mathbf{R}\) has almost the same values as that based upon the manually predefined true \(\mathbf{R}\) (Fig. 7).

8 Application to shallow water equations

8.1 Experiment setup

To further evaluate the performance of error covariance estimation using LSTM when incorporated with predefined correlation kernels, we also set up a twin experiment framework with a simplified 2D shallow-water dynamical model, which is frequently used for testing data assimilation algorithms (e.g., [26, 42]). A cylinder of water is positioned in the middle of a study field of size \(20mm \times 20mm\) and released at the initial time step \(t^0\) (i.e., with no initial speed), leading to a non-linear wave propagation. The dynamics of the water level h (in mm), as well as the horizontal and vertical velocity fields (respectively denoted as u and v, in units of 0.1m/s), are given by the non-conservative shallow water equations,

$$\begin{aligned} \frac{\partial u}{\partial t}&=-g\frac{\partial }{\partial x}(h)-bu \nonumber \\ \frac{\partial v}{\partial t}&=-g\frac{\partial }{\partial y}(h)-bv \nonumber \\ \frac{\partial h}{\partial t}&=-\frac{\partial }{ \partial x}(uh) -\frac{\partial }{\partial y}(v h) \nonumber \\ u_{t^0}&= 0 \nonumber \\ v_{t^0}&= 0 \end{aligned}$$
(39)
Fig. 7 State (u, v) - observation (\(\mathbf{y}\)) transformation

In Eq. 39, \(b=0.1\) is the viscous drag coefficient, while g is the earth gravity constant. These equations are discretized on a \(20 \times 20\) regular grid and solved by a first-order finite difference method with a time discretization \(\delta _t = 10^{-4}s\), from \(t^0 = 0s\) to \(t^{1000} = 0.1s\). This numerical solution is considered as the reference (i.e., the true state \(\mathbf{x}_{\mathrm{true}}\)) when performing DA algorithms. The state variables in this DA modeling are the combination of the velocity fields \(\{u\}_{20 \times 20}\) and \(\{v\}_{20 \times 20}\), leading to a state dimension of 800. The evolution of the reference state (\(\mathbf{x}_{\mathrm{true},t^k}\)), together with the error-free model equivalent observations (i.e., \(\mathbf{H}(\mathbf{x}_{\mathrm{true}})\)), is illustrated in Fig. 8. Spatially correlated observation errors are generated artificially and combined with the transformation operator to simulate real-time observations. More precisely, the observations are generated from the model equivalent \(\mathbf{H}(\mathbf{x}_{\mathrm{true}})\) separately for the fields u and v. \(\mathbf{H}\) is defined as a sparse matrix to reflect the fact that measurements in real-world applications are sparser than the true states, owing to interference and to the limited coverage and performance of sensors. As shown in Fig. 7, the spatial observations at time \(t^k\) are defined as the average of \(u_{t^k}\) and \(v_{t^k}\) over \(2 \times 2\) cell areas with an observation error \(\epsilon _{y_{t^k}}\),

$$\begin{aligned} \mathbf{y}_{u,i,j,t^k}&= \frac{1}{4} (u_{\mathrm{true},2i,2j,t^k} + u_{\mathrm{true},2i+1,2j,t^k} + u_{\mathrm{true},2i,2j+1,t^k}\nonumber \\&\qquad \qquad + u_{\mathrm{true},2i+1,2j+1,t^k}) + \epsilon _{y_{u,i,j,t^k}}, \end{aligned}$$
(40)

and identically for \(\mathbf{y}_{v,i,j,t^k}\). Therefore, the dimension of the observation vector \(\mathbf{y}= [\mathbf{y}_{u,t^k}, \mathbf{y}_{v,t^k} ]\) is 200. In this experiment, we assume that the observation errors \(\epsilon _{y_{u,i,j,t^k}}\) and \(\epsilon _{y_{v,i,j,t^k}}\), respectively of the velocity fields u and v, follow the same Gaussian distribution \({\mathcal {N}}(0,\mathbf{R})\). Thus, the observation error covariance in this shallow water system can be fully characterized by a \(100 \times 100\) \(\mathbf{R}\) matrix after the observations (originally on a 2D grid) are converted to a 1D vector. Here we adopt a different parameterization of the \(\mathbf{R}\) matrix, using an isotropic correlation function \(\psi _{\mathbf{R}}(.)\),

$$\begin{aligned} \mathbf{R}=10^{-6} \cdot \sqrt{diag(\mathbf{D})}\cdot \psi _{\mathbf{R}} (r) \cdot \sqrt{diag(\mathbf{D})} \end{aligned}$$
(41)

where \(\mathbf{D} = \left[ D_0, ..., D_{99}\right]\) represents the error variances in the 2D (\(10 \times 10\)) velocity field. Each element of \(\mathbf{D}\) is generated individually following a uniform distribution,

$$\begin{aligned} \mathbf{D}_{\iota } \sim {\mathcal {U}}(1, 1000) \quad \text {for} \quad \iota \in \{ 0,...,99\}, \end{aligned}$$
(42)

which produces only positive elements to guarantee the positive definiteness of \(\mathbf{R}\).

\(\psi _{\mathbf{R}} (.)\) is the second-order auto-regressive (also known as Balgovind) function,

$$\begin{aligned} \psi _{\mathbf{R}} (r) = \left( 1+\frac{r}{L_{\mathbf{R}}}\right) \exp \left( -\frac{r}{L_{\mathbf{R}}}\right) , \end{aligned}$$
(43)

where \(L_{\mathbf{R}}\) is the correlation scale length, fixed as \(L_{\mathbf{R}} =10\) in this application, and r is a correlation scale parameter in the 2D space, also generated uniformly with \(r \sim {\mathcal {U}}(1,5)\). Being part of the Matérn family of kernels, the SOAR function is often used in DA for prior error covariance modeling [6, 26] thanks to its smoothness and good conditioning. The simulation of \(\mathbf{x}_{b,t^k} = [{u}_{b,t^k}, {v}_{b,t^k}]\) via the same discretization of Eq. 39 (except for the initial conditions) is used as the background state at time \(t^k\) in the DA modeling. Similar to the Lorenz experiment (i.e., Eq. 35), \(\{\mathbf{x}_{b,t^k}\}\) is obtained by combining \(\{\mathbf{x}_{a,t^k}\}\) with randomly generated Gaussian errors, while \(\{\mathbf{x}_{a,t^k}\}\) is obtained every 100 time steps (i.e., 0.01s) from ensemble DA with the time-series observation data \(\{\mathbf{y}_{t^k}\}\) and the estimated observation error covariance \(\mathbf{R}\).
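A minimal sketch of the construction of the parametric \(\mathbf{R}\) matrix of Eqs. 41-43 is given below; the evaluation of the kernel at pairwise Euclidean distances between observation cells, and the role of the scale argument, are assumptions of this sketch.

```python
import numpy as np

def build_R_shallow_water(D, L=10.0, grid=10):
    """Sketch of the parametric R in Eq. 41 on a (grid x grid) observation field.

    D : marginal error variances, shape (grid**2,).
    L : scale length of the Balgovind/SOAR kernel (Eq. 43), assumed here to be
        applied to pairwise Euclidean distances between observation cells.
    """
    ii, jj = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    coords = np.column_stack([ii.ravel(), jj.ravel()]).astype(float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    psi = (1.0 + dist / L) * np.exp(-dist / L)        # Balgovind correlation matrix
    std = np.sqrt(D)
    return 1e-6 * np.outer(std, std) * psi            # sqrt(diag(D)) psi sqrt(diag(D))
```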

Fig. 8 Evolution of the shallow water model of h, u, v (true states) at different time steps (a, b, d, e) and the error-free model equivalent observations \(\mathbf{H}(\mathbf{x}_{\mathrm{true}})\) (c, f)

8.2 DA with LSTM estimated \(\mathbf{R}\)

As in the Lorenz experiment, simulated observations \(\{\mathbf{y}_{t^k}\}\), generated following the same process as in the Lorenz system, are used as input training data for the LSTM model, while the \(\mathbf{D}\) vector and the correlation scale r serve as the training output. The specific structure of this LSTM network is shown in Table 3. This model has the same structure as the one used for the Lorenz system (Table 1), except for the input and output dimensions. As in the Lorenz system, two types of LSTM are proposed: LSTM1000 employs the whole 1000 time steps of observation data as training and prediction inputs, while LSTM200 makes use of only the first 200 time steps of observation data as inputs. Once the LSTM is trained on the training set of 173,000 generated observation trajectories in this experiment, \(\mathbf{R}\) can be estimated when only the time-series observation data \(\{\mathbf{y}_{t^k}\}\) is available.

Table 3 LSTM structure for \(\mathbf{R}\) prediction in the shallow water experiment

Similar to the Lorenz system, EnDA is performed here with an ensemble of 100 state trajectories initialized from the same initial state \(\mathbf{x}_{t^0}\) for each observation series. EnDA takes place every 200 time steps with the \(\mathbf{R}\) matrix estimated through different methods.

8.3 Results

Figures 9 and 10 illustrate, respectively, the prediction results of LSTM1000 and LSTM200 against the true values on 10,000 examples of the test dataset, which demonstrate that the trained LSTM exhibits a good performance in predicting the values used to compose \(\mathbf{R}\), including both the marginal error variances (i.e., the elements of the \(\mathbf{D}\) vector) and the correlation scale length r. The prediction results of LSTM200 (Fig. 10) are almost as accurate as the ones obtained via LSTM1000 (Fig. 9). Training and evaluating the LSTM network using the first 200 time steps is therefore sufficient to obtain an accurate estimation of the \(\mathbf{R}\) matrix. The \(\mathbf{R}\) matrices composed from the LSTM1000 predictions are shown in Fig. 11, in comparison with the true \(\mathbf{R}\) matrix and the one obtained via D05 (\(q=3\)) after regularization. In order to estimate the high-dimensional \(\mathbf{R}\) matrix in this application, D05 makes use of the DA residuals every 10 time steps, differently from the final DA algorithm, in each of its three iterations. Four different sample \(\mathbf{R}\) matrices from the test dataset are displayed in Fig. 11. We observe that both LSTM and D05 manage to recover a covariance structure similar to that of the true \(\mathbf{R}\) matrix, while the considerable advantage of the LSTM approach can still be noticed. These results confirm the findings of Figs. 9 and 10, namely that LSTM can perform well in the parametric prediction of the \(\mathbf{R}\) matrix even in high-dimensional systems.

Fig. 9 Prediction results of the parametric error covariance against the true values in the test dataset of LSTM1000, for selected elements of \(\mathbf{D}\) (representing marginal error variances in the 2D space) (a–g) and the correlation scale r (h) of the shallow water system

Fig. 10 Prediction results of the parametric error covariance against the true values in the test dataset of LSTM200, for selected elements of \(\mathbf{D}\) (representing marginal error variances in the 2D space) (a–g) and the correlation scale r (h) of the shallow water system

Fig. 11 Comparison between the true and the estimated (LSTM1000, D05) \(\mathbf{R}\) matrices, which represent the error covariance of the 2D observation data, for 4 samples with different \(\mathbf{D}\) and r in the test dataset

Table 4 DA performance of the shallow water system, evaluated in \(\epsilon _{\mathrm{mse}}\), with \(\mathbf{R}\) obtained from the predefined true values and from the LSTM1000, LSTM200, DI01 and D05 algorithms

To assess the DA accuracy, the metric \(\epsilon _{\mathrm{mse},(i)}\) (cf. Eq. 38), estimated using a set of 53 observation time series, is displayed in Table 4. Since it would take too much space to display the full set \(\{\epsilon _{\mathrm{mse},(i)}\}_{i=0,1,\cdots ,800}\) for all state variables, we report only a subset in Table 4, together with the averaged value \(\bar{\epsilon }_{\mathrm{mse}}\) over all 800 cell coordinates of the u and v fields. The displayed results further support the conclusion drawn from Figs. 9, 10 and 11: the LSTM-based approaches have an advantage in DA accuracy over the D05 (\(q=3\)) and DI01 (\(q=2\)) tuning algorithms. Furthermore, the performance of LSTM200 is very close to that of LSTM1000, with an even slightly smaller average MSE, probably due to sampling randomness. This further confirms the analysis of Fig. 10: an LSTM model that employs only the first 200 time steps of observation data as input manages to provide an accurate estimation of the \(\mathbf{R}\) matrix. The averaged computational time (on a laptop CPU) of the online covariance tuning/specification is also shown in Table 4, where only the evaluation time of the trained LSTM model is taken into account, since the training of the LSTM can be performed entirely offline. As for D05 and DI01, we exclude the final DA step (i.e., the computational time is estimated for \(q-1\) iterations). As shown in Table 4, the LSTM approach is also considerably faster than the traditional tuning methods, which allows near-real-time application in dynamical systems.
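As a rough guide to how such table entries can be computed, a minimal per-coordinate MSE sketch is given below. Eq. 38 in the paper gives the exact definition, so the averaging shown here is only an assumed form.

```python
import numpy as np

def epsilon_mse(x_analysis, x_true):
    """Per-coordinate mean squared error over an assimilation window (cf. Eq. 38).

    x_analysis, x_true : (n_times, n_state) analysed and true state trajectories.
    Returns one value per state coordinate (cell of the u and v fields).
    """
    return np.mean((x_analysis - x_true) ** 2, axis=0)

# Average over all cell coordinates and over the set of test observation series:
# eps_bar = np.mean([epsilon_mse(xa, xt) for xa, xt in zip(analyses, truths)])
```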

9 Discussion

The precision of DA reconstruction/prediction depends heavily on the specification of both the background and the observation error covariances. The latter is often challenging to estimate in real-world applications because of the dynamic nature of the observation data. Furthermore, unlike the background error covariance, the observation matrix \(\mathbf{R}\) cannot be empirically estimated from an ensemble of simulated trajectories. In this paper, we have reviewed in detail some well-known observation covariance tuning algorithms [23, 54] based on time-variant posterior innovation quantities. These methods, widely adopted in the geosciences, rely on specific prior assumptions such as knowledge of the correlation structure [23] or of the background matrix [24], which are difficult to fulfil in domains where very little knowledge about the prior errors is available.
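For reference, the posterior-innovation diagnostic that such tuning methods build on can be sketched as follows. This is the standard Desroziers-type estimate, which we assume underlies D05; the iterative versions used in practice add further regularization and pre-processing that are omitted here.

```python
import numpy as np

def innovation_based_R(Y, Xb, Xa, H):
    """Posterior-innovation diagnostic of R (Desroziers-type sketch).

    Y  : (n_cycles, n_obs)   observations collected over assimilation cycles
    Xb : (n_cycles, n_state) background states
    Xa : (n_cycles, n_state) analysed states
    H  : (n_obs, n_state)    linear observation operator
    """
    d_b = Y - Xb @ H.T                 # background innovations  y - H x_b
    d_a = Y - Xa @ H.T                 # analysis residuals      y - H x_a
    R_hat = d_a.T @ d_b / Y.shape[0]   # R ~ E[(y - H x_a)(y - H x_b)^T]
    return 0.5 * (R_hat + R_hat.T)     # symmetrise the sample estimate
```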

In this study, we have proposed a novel machine learning approach based on LSTM neural networks to predict the \(\mathbf{R}\) matrix using time series observation data as model input. Similar to the work of [23, 24], \(\mathbf{R}\) is assumed to be time-invariant, at least over a sufficiently long time period. Both Kalman- and variational-type assimilation methods can benefit from the proposed method to improve the assimilation accuracy. By mapping raw sensor observations directly to the observation error covariance matrix, the proposed data-driven approach also helps to tackle one of the major bottlenecks of DA, namely the time-consuming and computationally expensive update of the covariance matrix. In both the Lorenz96 and the shallow water models presented in this paper, the LSTM-based approach displays significant strength compared to the classical posterior tuning methods DI01 and D05 in terms of: (i) estimation accuracy of the observation covariance \(\mathbf{R}\); (ii) reconstruction and prediction accuracy of the DA scheme using the estimated \(\mathbf{R}\) matrix; (iii) computational efficiency of the online covariance estimation; and (iv) flexibility with respect to different model parameterizations. An important limitation of the proposed LSTM-based method is the specification of \(\Phi _{\mathbf{R}}\), which defines the range of parameters used for training.

Since we assume that the observation matrix is time-invariant, the proposed approach can only deal with fixed sensor placement for dynamical systems, which is also the case for the DI01 and D05 tuning algorithms. The possibility of time-variant sensor placement warrants further investigation. As pointed out by [55], a DL model can be stolen or reverse engineered through model inversion or model extraction attacks. Although all data used in the current study are generated from toy models, it is important to ensure data privacy when applying the model to real applications. Future research should also consider applying the new method to a broader range of real-world problems, including NWP, hydrology, and object tracking, where the offline data simulation could be more computationally expensive than for the two test models presented in this paper. To this end, future studies could also investigate combining the current covariance estimation method with model reduction methods such as domain localization [56], proper orthogonal decomposition (POD), information-based data compression [57], and auto-encoder neural networks [58]. More precisely, the data assimilation can be performed in a compressed low-dimensional space (e.g., obtained from POD or an auto-encoder), and the LSTM-based covariance specification algorithm developed in this work can be used to estimate the observation error covariance matrices in that low-dimensional space, improving the accuracy of reduced-order data assimilation approaches.