1 Introduction

Spatial field prediction and reconstruction are crucial for the control of high-dimensional physical systems, with applications in CFD, geoscience and medical science. Running physics-informed simulations is often computationally expensive, especially for high-resolution and multivariate systems. Over the past years, numerous studies have been devoted to speeding up the simulation and prediction of dynamical systems by constructing surrogate models via reduced-order modelling (ROM) and machine learning (ML) techniques [1,2,3,4]. More precisely, the simulation/experimental data are first compressed to a low-dimensional latent space through an Autoencoder (AE). A recurrent neural network (RNN) is then trained on the compressed data to obtain a reduced-order surrogate model for predicting the dynamics in the latent space. Once the ML surrogate model is computed, monitoring the model prediction with limited sensor information constitutes another major challenge. Making use of a weighted combination of simulation (also known as ‘background’) and observation data [5], data assimilation (DA) methods are widely used in engineering applications for field prediction or parameter identification [5, 6].

To incorporate real-time observations for correcting the prediction of the surrogate model, the idea of Latent Assimilation (LA) was introduced [7,8,9], where DA is performed directly in the reduced-order latent space. It has been shown in [7] that LA has a significant advantage in terms of computational efficiency compared to classical full-space DA methods. However, current approaches to LA require the observation data to be compressed into the same latent space as the state variables, which is cumbersome for applications where the states and the observations are either compressed using different AEs or represent different physical quantities. The latter is common practice in geoscience and CFD applications. For example, observations of wind speed/direction can be used to improve the quality of the initial conditions of weather forecasts [10], and precipitation data can be used to correct the river flow prediction in hydrology [11, 12].

The DA is performed through a transformation operator (usually denoted by \(\mathcal {H}\)) which links the state variables to real-time observations. In real applications, \(\mathcal {H}\) is often highly non-linear [13]. In the case of LA, since the assimilation is carried out in the latent space, the \(\mathcal {H}\) function also includes several encoder and decoder functions, leading to extra difficulties in solving the assimilation problem. Furthermore, if the state vector and the observation vector are not in the same physical space, the latent spaces to which the data are reduced might be different too. In this case, the operator of the data assimilation inverse problem includes the two ML-based functions used to compress the data (state vector and observations) into two different latent spaces. Moreover, ML functions often involve many parameters and are difficult to train in real time. This means that performing variational LA, when the background simulation and the observation vector are not in the same physical space, is cumbersome.

The idea of applying ML algorithms, namely recurrent neural networks, in a low-dimensional latent space for learning complex dynamical systems has recently been adopted in a wide range of applications including CFD [2, 14], hydrology [12], nuclear science [15] and air pollution quantification [3]. Both proper orthogonal decomposition (POD)-type (e.g., [2, 3, 12, 16]) and neural networks (NNs)-based autoencoding methods [1, 14] have been used to construct the reduced-order latent spaces. The work of [3] is extended in [17], which relies on an adversarial RNN when the training dataset is insufficient. In terms of compression accuracy, much effort has been devoted to comparing the performance of different auto-encoding approaches. The study of [18] shows a significant advantage of NNs-based methods compared to classical POD-type approaches when dealing with highly non-linear CFD applications. A novel ROM method, combining POD and NNs AE, has been introduced in the very recent work of [19]. The authors have demonstrated that one of the advantages of this approach, for projection-based ROMs, is that it does not matter whether the high-fidelity solution is on a structured or unstructured mesh. Other approaches applying convolutional autoencoders to data on unstructured meshes include space-filling curves [20], spatially varying kernels [21] and graph-based networks [22].

Performing DA in the latent space in order to monitor surrogate models with real-time observations has attracted increasing research interest recently. The approaches used in the work of [3, 23] consist of learning assimilated results directly via an RNN to reduce forecasting errors. With a similar idea, [24] proposes an iterative process of deep learning (DL) and DA, i.e., a NN is retrained after each DA step (based on NN predictions and real observations) until convergence is achieved. Collectively, the methods in [3, 23, 24] aim to enhance the system prediction by including assimilated dynamics in the training data. However, the requirement to retrain the NN when new observation data become available leads to considerable computational cost for online application of these methods.

In order to incorporate unseen real-time observation data efficiently, the recent works of [7, 8, 25] introduce the concept of LA where an AE network is used to compress the state variables and pre-processed observation data. The DA update is subsequently performed in the reduced-order latent space. Similarly, in [9], a Generative Adversarial Network (GAN) was trained to produce time series data of POD coefficients, and this algorithm was extended to assimilate data by modifying the loss function and using the back-propagation algorithm of the GAN. Again, this produces an efficient method as no additional simulations of the high-fidelity model are required during the data assimilation process. Also, [26] proposes the use of a recurrent Kalman network in the latent space to make locally linear predictions. However, as mentioned above, an important bottleneck of current LA techniques is that the state and observation variables often cannot be encoded into the same latent space for complex physical systems. Performing online LA thus requires a smooth, explainable and efficient-to-train local surrogate transformation function, leading to our idea of implementing polynomial regression.

Local polynomial regression has been widely used for the prediction and calibration of chaotic systems by providing smooth and easily interpretable surrogate functions. The work of [27] uses multivariate local polynomial fitting (M-MLP), which takes previous time steps of a multivariate dynamical system as input and forecasts the evolution of the state variables. It is demonstrated numerically that the M-MLP outperforms a standard NN in the Lorenz twin experiment. Recently, this work has been extended by the same authors into a local polynomial autoregressive model [28], which shows good performance in one-step prediction. A detailed numerical comparison between polynomial regression (PR) and NNs has also been given in [29, 30]. Their results show that PR, with a polynomial degree lower than five, can achieve similar results to NNs when fitting a variety of multivariate real functions. Using a similar idea, [31] applies local polynomial regression to provide not only a single mean forecast but an ensemble of future time steps, which yields better forecasts with noisy data, as demonstrated in their paper on geological applications.

Polynomial regression, or more generally, interpretable surrogate models such as Lasso or a Decision Tree (DT), have been widely used to approximate sophisticated deep learning algorithms in order to improve interpretability [32]. For example, [33] developed the model of Local Interpretable Model-agnostic Explanations (LIME) for improving the interpretability of ML classifiers. More precisely, they make use of a linear regression model to approximate a NN classifier, where the loss function is defined as a fidelity-interpretability tradeoff. The training set of the linear surrogate model is generated via sampling for local exploration around each ML input. It is pointed out by both [32] and [33] that the distribution and the range of the local sampling are crucial to the robustness of the local surrogate model. A small range may lead to overfitting, while the efficiency and the local fidelity can decrease when the sampling range is too large.

A graph-based sampling strategy is proposed in the recent work of [34] to improve the performance of LIME. The principle of LIME can be easily extended by using a polynomial regression since our prime concern is not interpretability but the smoothness of the local surrogate model. On the other hand, some effort has been devoted to replacing computationally expensive ML models by polynomial functions, which are much more efficient to evaluate. The use of a data-driven polynomial chaos expansion (PCE) has been proposed recently by [35] to perform ML regression tasks with a performance similar to that of DL and support vector machines. Furthermore, PCE is able to deliver a probability density function instead of a single mean prediction for the model output. A similar idea can be found in [36], where the authors compare PCE- and NNs-based surrogate models for sensitivity analysis in a real-world geophysical problem. The study of [37] aims to reduce the over-parametrization of neural networks by using polynomial functions to fit a trained NN with the same inputs. Their study includes sophisticated NN structures, such as two-dimensional (2D) convolutional neural networks (CNNs), in the global space. Despite the fact that the classification accuracy of the surrogate polynomial regression is slightly lower than that of state-of-the-art DL approaches, the former exhibits a significantly higher noise robustness on real datasets. In addition, the theoretical study in [37] provides an upper bound of the PR learning error with respect to the number of samples. Another important advantage of PR compared to other ML models, in particular deep learning approaches, is its good performance for small training sets thanks to the small number of tuning parameters required [35]. Moreover, unlike DL methods, polynomial regression requires much less fine-tuning of hyper-parameters, which makes it more appropriate for online training tasks.

In this study, we develop a novel LA scheme which generalises the current LA framework [7] to heterogeneous latent spaces and non-linear transformation operators while keeping the important advantage of LA in terms of low computational cost. We use local surrogate functions to approximate the transformation operator from the latent space of the state vector to that of the observations. This approach can incorporate observation data from different sources in one assimilation window, as shown in Fig. 1. The latent transformation operator, which combines the different encoder/decoder networks with the state-observation mapping \(\mathcal {H}\) in the full physical space, is then used to solve the LA inverse problem. A crucial requirement is ensuring both the approximation accuracy (for unseen data) and the smoothness and interpretability of the surrogate function. For these reasons, we use local PR, which is sufficiently accurate and infinitely differentiable [38]. We provide both a theoretical and a numerical analysis (based on a high-dimensional CFD application) of the proposed method. The surrogate models we build are based on AE and long short-term memory (LSTM) technologies, which have been shown to provide stable and accurate solutions for ROMs [17].

Fig. 1 Flowchart of the generalised latent assimilation with machine learning surrogate models

In summary, we make the following contributions in this study:

  • We propose a novel Generalised Latent Assimilation (GLA) algorithm. Making use of a local PR to open the black box of DL functions addresses one of the major bottlenecks of current LA approaches for combining information sources (namely the state vector and observations) issued from different latent spaces. The main differences between the proposed Generalised LA and existing LA approaches are underlined in red in Fig. 1.

  • We provide a theoretical error upper-bound for the expectation of the cost function in LA when using the local surrogate polynomial function instead of the original DL function. This upper-bound, depending on the polynomial degree and the input dimension, is obtained based on the general results of learning NNs functions via PR [37].

  • The new approach proposed in this work can be easily applied/extended to other dynamical systems. The repository of python code scripts, including ROM (POD, Convolutional autoencoder (CAE) and POD AE), latent LSTM and Generalised LA can be found at c

The rest of this paper is organised as follows. In Sect. 2.1, several dimension reduction methods, including POD, ML-based AE and POD AE are introduced. We then address the RNN latent surrogate model in Sect. 2.2. The novel Generalised LA approach with a theoretical analysis is described in Sect. 3 after the introduction of classical variational DA. The CFD application, as a test case in this paper, is briefly explained in Sect. 4.1. The numerical results of this study are split into two parts: Sect. 4.2 for latent surrogate modelling (including ROM reconstruction and LSTM prediction), and Sect. 5 for Generalised LA with heterogeneous latent spaces. Finally, concluding remarks are provided in Sect. 6.

2 Methodology: ROM and RNN

2.1 Reduced-Order-Modelling

Different ROM approaches are introduced in this section with the objective of building an efficient rank-reduction model with a low-dimensional latent space and high reconstruction accuracy. Their performance is later compared on the oil-water flow application in Sect. 4.2.1.

2.1.1 Proper Orthogonal Decomposition

The principle of proper orthogonal decomposition was introduced in the work of [39]. In general, a set of \(n_\text {state}\) state snapshots, issued from one or several simulated or observed dynamics, is represented by a matrix \({\textbf {X}}\in \mathbb {R}^{[\text {dim}({\textbf {x}}) \times n_\text {state}]}\) where each column of \({\textbf {X}}\) represents an individual state vector at a given time instant (also known as a snapshot), i.e.

$$\begin{aligned} {\textbf {X}}[:,i] = {\textbf {x}}_{t=t_i}, \quad \forall i \in {\{0,1, \ldots , n_\text {state}-1\}}. \end{aligned}$$
(1)

Thus the ensemble \({\textbf {X}}\) describes the evolution of the state vectors. Its empirical covariance \({\textbf {C}}_{{\textbf {x}}}\) can be written and decomposed as

$$\begin{aligned} {\textbf {C}}_{{\textbf {x}}} = \frac{1}{n_\text {state}-1} {\textbf {X}}{\textbf {X}}^T = {{\textbf {L}}}_{{\textbf {X}}} {{\textbf {D}}}_{{\textbf {X}}} {{{\textbf {L}}}_{{\textbf {X}}}}^T \end{aligned}$$
(2)

where the columns of \({{\textbf {L}}}_{{\textbf {X}}}\) are the principal components of \({\textbf {X}}\) and \({{\textbf {D}}}_{{\textbf {X}}}\) is a diagonal matrix collecting the associated eigenvalues \(\{ \lambda _{{\textbf {X}},i}, i=0,\ldots ,n_\text {state}-1\}\) in a decreasing order, i.e.,

$$\begin{aligned} {{\textbf {D}}}_{{\textbf {X}}} = \begin{bmatrix} \lambda _{{\textbf {X}},0} &{} &{} \\ &{} \ddots &{} \\ &{} &{} \lambda _{{\textbf {X}},n_\text {state}-1} \end{bmatrix}. \end{aligned}$$
(3)

For a truncation parameter \(q \le n_\text {state}\), one can construct a projection operator \({{\textbf {L}}}_{{\textbf {X}},q}\) with minimum loss of information by keeping the first q columns of \({{\textbf {L}}}_{{\textbf {X}}}\). This projection operator can also be obtained by a singular value decomposition (SVD) [40] which does not require computing the full covariance matrix \({\textbf {C}}_{{\textbf {x}}}\). More precisely,

$$\begin{aligned} {\textbf {X}}= {{\textbf {L}}}_{{\textbf {X}},q} \varvec{\Sigma } {{{\textbf {V}}}_{{\textbf {X}},q}}^T \end{aligned}$$
(4)

where \({{\textbf {L}}}_{{\textbf {X}},q}\) and \({{\textbf {V}}}_{{\textbf {X}},q}\) have, by definition, orthonormal columns, i.e.,

$$\begin{aligned} {{{\textbf {L}}}_{{\textbf {X}},q}}^T {{\textbf {L}}}_{{\textbf {X}},q} = {{{\textbf {V}}}_{{\textbf {X}},q}}^T {{{\textbf {V}}}_{{\textbf {X}},q}} = {\textbf {I}}\quad \text {and} \quad \varvec{\Sigma }\varvec{\Sigma }^T = {{\textbf {D}}}_{q,X}, \end{aligned}$$
(5)

where \({{\textbf {D}}}_{q,X}\) is a diagonal matrix containing the first q eigenvalues of \({{\textbf {D}}}_{X}\). For a single state vector \({\textbf {x}}\), the compressed latent vector \(\tilde{{\textbf {x}}}\) can be written as

$$\begin{aligned} \tilde{{\textbf {x}}} = {{{\textbf {L}}}_{{\textbf {X}},q}}^T {\textbf {x}}, \end{aligned}$$
(6)

which provides a reduced-rank representation of the full state vector \({\textbf {x}}\). The POD reconstruction then reads,

$$\begin{aligned} {\textbf {x}}^r_\text {POD} = {{{\textbf {L}}}_{{\textbf {X}},q}} \tilde{{\textbf {x}}} = {{{\textbf {L}}}_{{\textbf {X}},q}} {{{\textbf {L}}}_{{\textbf {X}},q}}^T {\textbf {x}}. \end{aligned}$$
(7)

The compression rate \(\rho _{{\textbf {x}}}\) and the compression accuracy \(\gamma _{{\textbf {x}}}\) are defined respectively as:

$$\begin{aligned} \gamma _{{\textbf {x}}} = \sum _{i=0}^{q-1} \lambda ^2_{{\textbf {X}},i} \Big / \sum _{i=0}^{n_\text {state}-1} \lambda ^2_{{\textbf {X}},i} \quad \text {and} \quad \rho _{{\textbf {x}}} = q \big / n_\text {state}. \end{aligned}$$
(8)
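The truncated POD described above can be implemented directly with an economy-size SVD. The following is a minimal Python sketch, assuming numpy; it is not taken from the authors' repository, the names are illustrative, and the retained-energy criterion is expressed here in terms of singular values (one common convention for the accuracy of Eq. (8)).

```python
import numpy as np

def pod_truncated(X, q):
    """Truncated POD of a snapshot matrix X of shape (dim_x, n_state).

    Returns the projection operator L_q (Eqs. 4-6), the encoded snapshots
    and the fraction of energy retained by the first q modes."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # economy-size SVD
    L_q = U[:, :q]                                    # L_{X,q}: first q left singular vectors
    energy = np.sum(s[:q] ** 2) / np.sum(s ** 2)      # retained-variance ratio
    X_tilde = L_q.T @ X                               # latent snapshots (Eq. 6)
    return L_q, X_tilde, energy

# POD reconstruction of a single state vector x (Eq. 7):
#   x_rec = L_q @ (L_q.T @ x)
```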

2.1.2 Convolutional Auto-encoder

An auto-encoder is a special type of artificial NN used to perform data compression via an unsupervised learning of the identity map. The network structure of an AE can be split into two parts: an encoder which maps the input vector to the latent space, and a decoder which connects the latent space and the output. More precisely, the encoder \(\mathcal {E}_{{\textbf {x}}}\) first encodes the inputs \({\textbf {x}}\) to a latent vector \(\tilde{{\textbf {x}}} = \mathcal {E}_{{\textbf {x}}} ({\textbf {x}})\), which is often of a much lower dimension (i.e., \(\text {dim}(\tilde{{\textbf {x}}}) \ll \text {dim}({\textbf {x}}) \)). A decoder \(\mathcal {D}_{{\textbf {x}}} \) is then added to approximate the input vector \({\textbf {x}}\) by computing a reconstructed vector \({\textbf {x}}^r_\text {AE} = \mathcal {D}_{{\textbf {x}}} \big ( \mathcal {E}_{{\textbf {x}}} ({\textbf {x}}) \big )\). The encoder and the decoder are trained jointly with, for instance, the mean square error (MSE) as the loss function

$$\begin{aligned} J\big ( \varvec{\theta }_{\mathcal {E}}, \varvec{\theta }_{\mathcal {D}}\big ) = \frac{1}{N_\text {train}^{\text {AE}}} \sum _{j=1}^{N_\text {train}^{\text {AE}}} \vert \vert {\textbf {x}}_j - {\textbf {x}}^\text {r}_{\text {AE},j}\vert \vert ^2 \end{aligned}$$
(9)

where \( \varvec{\theta }_{\mathcal {E}}, \varvec{\theta }_{\mathcal {D}}\) denote the parameters in the encoder and the decoder respectively, and \(N_\text {train}^{\text {AE}}\) represents the size of the AE training dataset.

Neural networks with additional layers or more sophisticated structures (e.g., CNN or RNN) can better recognise underlying spatial or temporal patterns, resulting in a more effective representation of complex data. Since we aim to obtain a static encoding (i.e., a single latent vector will not contain temporal information) at this stage, we make use of a CNN to build our first AE. In general, a convolutional layer makes use of a local filter to compute the values in the next layer: by sliding a convolutional window of fixed size over the input tensor, we obtain the output of the convolutional layer [41]. Compared to a standard AE with dense layers, the advantage of a CAE is mainly two-fold: the reduction of the number of parameters in the AE and the capability of capturing local information. Standard 2D CNNs are widely applied in image processing problems, while for unstructured meshes, 1D CNNs and graph NNs [21] are often prioritised due to the irregular structure. For more details about CNNs and CAEs, interested readers are referred to [41].
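As an illustration, a minimal convolutional autoencoder could be written as follows. This is a hedged sketch in PyTorch (an assumption; the framework is not specified above), with arbitrary layer sizes and a 64x64 single-channel input assumed purely for concreteness; it is not the architecture used later in the experiments.

```python
import torch
import torch.nn as nn

class CAE(nn.Module):
    """Minimal 2D convolutional autoencoder: x -> x_tilde -> x_rec."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 16 * 16, latent_dim),                   # assumes 64x64 inputs
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16 * 16 * 16),
            nn.Unflatten(1, (16, 16, 16)),
            nn.ConvTranspose2d(16, 8, kernel_size=3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(8, 1, kernel_size=3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):
        x_tilde = self.encoder(x)       # latent vector
        return self.decoder(x_tilde)    # reconstruction x_rec

# training with the MSE loss of Eq. (9), e.g.:
#   model = CAE(); opt = torch.optim.Adam(model.parameters(), lr=1e-3)
#   loss = nn.functional.mse_loss(model(x_batch), x_batch)
```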

2.1.3 POD AE

The combination of POD and AE (also known as POD AE or SVD AE) was first introduced in the recent work of [19] for applications in nuclear engineering. The accuracy and efficiency of this approach have also been assessed in urban pollution applications (e.g., [17]), especially for problems with unstructured meshes. This method consists of two steps of dimension reduction. We first apply the POD to obtain the full set of principal components of the associated dynamical system. Using a subset of the principal components as input, a dense autoencoder with fully connected neural networks is then employed to further reduce the problem dimension [17]. As an important note, including all of the PCs can introduce some redundancy and noise which affects the performance of the AE. To avoid this effect, a prior POD truncation can be performed. In other words, both the input and output of this AE (with encoder \(\mathcal {E}'_{{\textbf {x}}}\) and decoder \(\mathcal {D}'_{{\textbf {x}}}\)) are the compressed latent vectors \(\tilde{{\textbf {x}}}_\lambda \) associated with the POD coefficients, i.e.,

$$\begin{aligned} \tilde{{\textbf {x}}}_\lambda = {{{\textbf {L}}}_{q',{\textbf {X}}}}^T {\textbf {x}}, \quad \tilde{{\textbf {x}}} = \mathcal {E}'_{{\textbf {x}}} (\tilde{{\textbf {x}}}_\lambda ) \quad \text {while} \quad \tilde{{\textbf {x}}}^r_\lambda = \mathcal {D}'_{{\textbf {x}}} (\tilde{{\textbf {x}}}), \quad {\textbf {x}}^r_\text {POD AE} = {{\textbf {L}}}_{q',{\textbf {X}}} \tilde{{\textbf {x}}}^r_\lambda \end{aligned}$$
(10)

where \(\tilde{{\textbf {x}}}^r_\lambda \) and \({\textbf {x}}^r_\text {POD AE}\) denote the reconstruction of the POD coefficients and the reconstruction of the full physical field respectively. The prior POD truncation parameter is denoted as \(q'\). Since the POD considerably reduces the size of the input vectors of the AE, applying fully connected NN layers is computationally affordable without the concern of over-parameterization, as pointed out by [19]. Furthermore, the training time is reduced in comparison to a full CNN AE applied directly to the high-fidelity solutions. It is important to point out that convolutional layers can also be used in the POD AE approach.
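A minimal sketch of the second stage of the POD AE, i.e., a dense autoencoder acting on the \(q'\) leading POD coefficients of Eq. (10), is given below. PyTorch is again an assumption and the layer sizes are illustrative; the truncated basis L_qprime corresponds to \({{\textbf {L}}}_{q',{\textbf {X}}}\) and could be obtained from the POD sketch above.

```python
import torch
import torch.nn as nn

class DenseAE(nn.Module):
    """Dense autoencoder acting on the q' POD coefficients (Eq. 10)."""
    def __init__(self, q_prime=64, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(q_prime, 32), nn.ReLU(), nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, q_prime))

    def forward(self, x_lambda):
        return self.decoder(self.encoder(x_lambda))

# Encoding / decoding pipeline for a full state vector x (Eq. 10):
#   x_lambda = L_qprime.T @ x                       # POD coefficients
#   x_tilde  = ae.encoder(x_lambda)                 # latent vector
#   x_rec    = L_qprime @ ae.decoder(x_tilde)       # POD AE reconstruction
```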

2.2 Surrogate Model Construction and Monitoring

Now that the ROM is in place, we aim to construct a lower-dimensional surrogate model by learning the evolution of the latent variables. For this purpose, we build a ML surrogate model in the latent space, which is trained on encoded simulation data. With the development of ML techniques, there is an increasing interest in using RNNs to learn the dynamics of CFD or geoscience applications. By treating temporal sequences as directed graphs, RNNs manage to handle complex dynamical systems thanks to their ability to capture historical dependencies through feedback loops [42]. However, training standard RNNs to solve problems with long-term temporal dependencies can be computationally difficult because the gradient of the loss function may decrease exponentially with time. This is also known as the vanishing gradient problem [43]. A specific type of RNN, the long short-term memory (LSTM) network, was developed to deal with long-term temporal dependencies. In brief, different from standard RNN units, LSTM units \(C^\text {LSTM}_{t}\) (here t denotes the time) are capable of maintaining information in memory over long periods with the help of a memory cell. Three gates, each composed of a Sigmoid activation function \(\sigma (x) = 1/(1 + e^{-x})\), are used to decide when information is memorised or forgotten. The different gates and their transition functions are listed below:

  • Forget gate decides whether the information is going to be forgotten for the current cell unit. Here the recurrent variable \(\textbf{h}_{t-1}\) summarises all historical information and \(\textbf{x}_t\) is the current layer input,

    $$\begin{aligned} f^{LSTM}_t=\sigma (\textbf{W}_f\cdot [\textbf{h}_{t-1},\textbf{x}_t]+b_f) \end{aligned}$$
    (11)
  • Input gate determines the new information which is going to be added with

    $$\begin{aligned} \tilde{C}^{LSTM}_{t}&=\tanh (\textbf{W}_C\cdot [\textbf{h}_{t-1},\textbf{x}_t]+b_C), \end{aligned}$$
    (12)
    $$\begin{aligned} \textbf{i}_{t}&=\sigma (\textbf{W}_i\cdot [\textbf{h}_{t-1},\textbf{x}_t]+b_i), \end{aligned}$$
    (13)

    while \(\tilde{C}^{LSTM}_{t}\) is multiplied by weight coefficients, leading to an update of \(C^{LSTM}_{t}\),

    $$\begin{aligned} C^{LSTM}_{t}=f^{LSTM}_t\odot C^{LSTM}_{t-1}+\textbf{i}_t\odot \tilde{C}^{LSTM}_{t}, \end{aligned}$$
    (14)

    where \(\odot \) denotes the Hadamard product of vectors and matrices.

  • Output gate decides the recurrent state \(\textbf{h}_t\) as a function of previous recurrent output \(\textbf{h}_{t-1}\) and the current layer input \(\textbf{x}_t\) through a Sigmoid activation function, i.e.,

    $$\begin{aligned} \textbf{o}_{t}&=\sigma (\textbf{W}_o[\textbf{h}_{t-1},\textbf{x}_t]+b_o) \end{aligned}$$
    (15)
    $$\begin{aligned} \textbf{h}_{t}&=\textbf{o}_t\odot \tanh (C^{LSTM}_t) \end{aligned}$$
    (16)

Here \({\textbf {W}}\) and \({\textbf {b}}\) denote the weight and the bias coefficients for different gates respectively. Once the LSTM NN is trained in the latent space, a low dimensional surrogate model can then be established for predicting the evolution of the dynamical system with a low computational cost.
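As an illustration of how such a latent surrogate can be set up in practice, the sketch below defines a sequence-to-one LSTM acting on encoded latent vectors. It is a minimal example in PyTorch (an assumption; the framework, layer sizes and sequence handling are illustrative and not taken from the authors' implementation).

```python
import torch
import torch.nn as nn

class LatentLSTM(nn.Module):
    """Sequence-to-one LSTM surrogate acting on encoded (latent) states."""
    def __init__(self, latent_dim=8, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(latent_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x_seq):                 # x_seq: (batch, seq_len, latent_dim)
        out, _ = self.lstm(x_seq)             # gates of Eqs. (11)-(16) applied internally
        return self.head(out[:, -1, :])       # predicted next latent state

# usage: roll the surrogate forward autoregressively,
#   x_next = model(x_past)   # x_past holds the last seq_len encoded state vectors
```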

3 Methodology: Generalised Latent Assimilation

Latent Assimilation techniques [7, 8] have been developed for the real-time monitoring of latent surrogate models. Here we develop a new generalised LA approach which can incorporate observation data encoded in a latent space different from that of the state variables. Since we aim to assimilate a dynamical system, the dependence on time t is introduced for all state/observation variables in the rest of this paper.

3.1 Variational Assimilation Principle

Data assimilation algorithms aim to improve the prediction of some physical fields (or a set of parameters) \({\textbf {x}}_t\) based on two sources of information: a prior forecast \({\textbf {x}}_{b,t}\) (also known as the background state) and an observation vector \({\textbf {y}}_t\). The true state, which represents the theoretical value of the current state, is denoted by \({\textbf {x}}_{\text {true},t}\). In brief, variational DA searches for an optimal weighting between \({\textbf {x}}_{b,t}\) and \({\textbf {y}}_t\) by minimising the cost function J defined as

$$\begin{aligned} J_t({\textbf {x}})&=\frac{1}{2}({\textbf {x}}-{\textbf {x}}_{b,t})^T{\textbf {B}}_t^{-1}({\textbf {x}}-{\textbf {x}}_{b,t}) + \frac{1}{2}({\textbf {y}}_t-\mathcal {H}_t({\textbf {x}}))^T {\textbf {R}}_t^{-1} ({\textbf {y}}_t-\mathcal {H}_t({\textbf {x}})) \nonumber \\&=\frac{1}{2}\vert \vert {\textbf {x}}-{\textbf {x}}_{b,t}\vert \vert ^2_{{\textbf {B}}_t^{-1}}+\frac{1}{2}\vert \vert {\textbf {y}}_t-\mathcal {H}_t({\textbf {x}})\vert \vert ^2_{{\textbf {R}}_t^{-1}} \end{aligned}$$
(17)

where \(\mathcal {H}_t\) denotes the state-observation mapping function, and \({\textbf {B}}_t\) and \({\textbf {R}}_t\) are the error covariance matrices related to \({\textbf {x}}_{b,t}\) and \({\textbf {y}}_t\), i.e.,

$$\begin{aligned} {\textbf {B}}_t = \text {Cov}(\epsilon _{b,t}, \epsilon _{b,t}), \quad {\textbf {R}}_t = \text {Cov}(\epsilon _{y,t}, \epsilon _{y,t}), \end{aligned}$$
(18)

where

$$\begin{aligned} \epsilon _{b,t} = {\textbf {x}}_{b,t} - {\textbf {x}}_{\text {true},t}, \quad \epsilon _{y,t} = \mathcal {H}_t({\textbf {x}}_{\text {true},t})-{\textbf {y}}_t. \end{aligned}$$
(19)

Since DA algorithms often deal with problems of large dimension, for the sake of simplicity, the prior errors \(\epsilon _{b,t}, \epsilon _{y,t}\) are often assumed to be centred Gaussian, i.e.,

$$\begin{aligned} \epsilon _{b,t} \sim \mathcal {N} (0, {\textbf {B}}_t), \quad \epsilon _{y,t} \sim \mathcal {N} (0, {\textbf {R}}_t). \end{aligned}$$
(20)

Equation (17), also known as the three-dimensional variational (3D-Var) formulation, represents the general objective function of variational assimilation. The time-dependent variational assimilation (so-called 4D-Var) formulation can also be reformulated into Eq. (17) as long as the error of the forward model is not considered. The minimisation point of Eq. (17) is denoted as \({\textbf {x}}_{a,t}\),

$$\begin{aligned} {\textbf {x}}_{a,t} = \underset{{\textbf {x}}}{\textrm{argmin}}\ \Big (J_t({\textbf {x}})\Big ) , \end{aligned}$$
(21)

known as the analysis state. When \(\mathcal {H}_t\) is non-linear, approximate iterative methods [44] have been widely used to solve variational data assimilation. To do so, one has to compute the gradient \(\nabla J({\textbf {x}})\), which can be approximated by

$$\begin{aligned} \nabla J({\textbf {x}}) \approx {\textbf {B}}_t^{-1}({\textbf {x}}-{\textbf {x}}_{b,t}) - {\textbf {H}}^T {\textbf {R}}_t^{-1} ({\textbf {y}}_t-\mathcal {H}_t({\textbf {x}})). \end{aligned}$$
(22)

In Eq. (22), \({\textbf {H}}\) is obtained via a local linearization in the neighbourhood of the current vector \({\textbf {x}}\). The minimization of 3D-Var is often performed via quasi-Newton methods, including for instance BFGS approaches [45], where each iteration can be written as:

$$\begin{aligned} {\textbf {x}}_{k+1} = {\textbf {x}}_{k} - L_\text {3D-Var} \big [ \text {Hess}(J) ({\textbf {x}}_{k})\big ]^{-1} \nabla J({\textbf {x}}_k) \end{aligned}$$
(23)

Here k denotes the current iteration, \(L_\text {3D-Var}>0\) is the learning rate of the descent algorithm, and

$$\begin{aligned} \text {Hess}\Big (J({\textbf {x}}= [x_0,\ldots ,x_{n-1}])\Big )_{i,j} = \frac{\partial ^2 J}{\partial x_i \partial x_j} \end{aligned}$$
(24)

is the Hessian matrix related to the cost function J. The process of the iterative minimization algorithm is summarised in Algorithm 1.

Algorithm 1 Iterative minimisation of the variational (3D-Var) cost function
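For illustration, the snippet below minimises the 3D-Var cost function of Eq. (17) around a background state. It is a minimal sketch only: it relies on SciPy's L-BFGS-B with numerically approximated gradients in place of the explicit quasi-Newton step of Eq. (23), the operator H is passed as a generic callable, and it is not the authors' implementation of Algorithm 1.

```python
import numpy as np
from scipy.optimize import minimize

def three_d_var(x_b, y, H, B_inv, R_inv):
    """Minimise the 3D-Var cost function of Eq. (17) around the background x_b.

    H is the (possibly non-linear) state-observation operator; gradients are
    approximated numerically by the optimiser."""
    def J(x):
        dx = x - x_b
        dy = y - H(x)
        return 0.5 * dx @ B_inv @ dx + 0.5 * dy @ R_inv @ dy

    res = minimize(J, x0=x_b, method="L-BFGS-B")
    return res.x   # analysis state x_a of Eq. (21)

# example with an identity observation operator and unit covariances:
#   x_a = three_d_var(x_b, y, H=lambda x: x, B_inv=np.eye(n), R_inv=np.eye(n))
```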

Variational assimilation algorithms can be applied to dynamical systems to improve future predictions by using a transition operator \(\mathcal {M}_{t^k \rightarrow t^{k+1}}\) (from time \(t^k\) to \(t^{k+1}\)), thus

$$\begin{aligned} {\textbf {x}}_{t^{k+1}} = \mathcal {M}_{t^k \rightarrow t^{k+1}} ({\textbf {x}}_{t^k }). \end{aligned}$$
(25)

In our study, the \(\mathcal {M}_{t^k \rightarrow t^{k+1}}\) operator is defined by a latent LSTM surrogate model. Typically in DA, the current background state is often provided by the forecasting from the previous time step, i.e.

$$\begin{aligned} {\textbf {x}}_{b,t^k} = \mathcal {M}_{t^{k-1} \rightarrow t^{k}} ({\textbf {x}}_{a,t^{k-1} }). \end{aligned}$$
(26)

A more accurate reanalysis \({\textbf {x}}_{a,t^{k-1}}\) leads to a more reliable forecast \({\textbf {x}}_{b,t^k}\). However, in practice, perfect knowledge of \(\mathcal {M} \) is often out of reach. The recent work of [24] makes use of deep learning algorithms to improve the estimation of \(\mathcal {M}_{t^{k-1} \rightarrow t^{k}}\). From Algorithm 1, one observes that the linearization of \(\mathcal {H}\) and the evaluation of \(\text {Hess}\big (J({\textbf {x}}_k)\big )\) are necessary for variational assimilation. Since, in this application, the latent variables and observations are linked via NN functions, the linearization and the computation of partial derivatives are almost infeasible due to:

  • the huge number of parameters in the NNs combined with non-linear transformation functions;

  • the non-differentiability of NNs functions, for instance, when using activation functions such as ReLu or LeakyReLu [46].

Therefore, we propose the use of a smooth local surrogate function to overcome these difficulties.

3.2 Assimilation with Heterogeneous Latent Spaces

Latent Assimilation techniques were introduced in the very recent work of [7, 8] where the DA is performed after having compressed the state and the observation data into the same latent space. In other words, it is mandatory to have the transformation operator \(\tilde{\mathcal {H}}_t = {\textbf {I}}\) in the latent space. To fulfil this condition, [7] preprocesses the observation data via a linear interpolation to the full space of the state variables. However, as mentioned in their work, this preprocessing introduces additional errors, which may impact the assimilation accuracy. More importantly, it is almost infeasible to compress \({\textbf {x}}\) and \({\textbf {y}}\) into the same latent space in a wide range of DA applications, due to, for instance:

  • partial observation: only a part of the state variables are observable, usually in certain regions of the full state space;

  • a complex \(\mathcal {H}\) function in the full space: \({\textbf {x}}\) and \({\textbf {y}}\) are different physical quantities (e.g., temperature vs. wind in weather prediction, river flow vs. precipitation in hydrology).

A general latent transformation operator \(\tilde{\mathcal {H}}_t\) which links the state and the observation latent spaces can be formulated as

$$\begin{aligned}&\tilde{\mathcal {H}}_t = \mathcal {E}_{\textbf {y}}\circ \mathcal {H}_t \circ \mathcal {D}_{\textbf {x}}, \quad \text {i.e.,} \quad \tilde{{\textbf {y}}} = \mathcal {E}_{\textbf {y}}\circ \mathcal {H}_t \circ \mathcal {D}_{\textbf {x}}\big (\tilde{{\textbf {x}}}\big ) = \tilde{\mathcal {H}}_t \big (\tilde{{\textbf {x}}} \big ),\nonumber \\&\text {with} \quad \tilde{{\textbf {y}}}_t = \mathcal {E}_{\textbf {y}}({\textbf {y}}_t), \quad {\textbf {x}}_t = \mathcal {D}_{\textbf {x}}(\tilde{{\textbf {x}}}_t), \end{aligned}$$
(27)

where \(\mathcal {E}_{\textbf {y}}, \mathcal {D}_{\textbf {x}}\) denote the encoder of the observation vectors and the decoder of the state variables respectively. A flowchart of the generalised LA is illustrated in Fig. 2. The cost function \(\tilde{J}_t\) of general LA problems reads

$$\begin{aligned} \tilde{J}_t(\tilde{{\textbf {x}}})&=\frac{1}{2}(\tilde{{\textbf {x}}}-\tilde{{\textbf {x}}}_{b,t})^T\tilde{{\textbf {B}}}_t^{-1}(\tilde{{\textbf {x}}}-\tilde{{\textbf {x}}}_{b,t}) + \frac{1}{2}(\tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}))^T \tilde{{\textbf {R}}}_t^{-1} (\tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})). \end{aligned}$$
(28)
$$\begin{aligned} \tilde{{\textbf {x}}}_{a,t}&= \underset{\tilde{{\textbf {x}}}}{\textrm{argmin}} \Big (\tilde{J}_t(\tilde{{\textbf {x}}})\Big ). \end{aligned}$$
(29)
Fig. 2 Flowchart of the LA with heterogeneous latent spaces

The latent covariance matrices \( \tilde{{\textbf {B}}}_t\) and \(\tilde{{\textbf {R}}}_t\) which represent the error covariances in the latent spaces, are defined as

$$\begin{aligned} \tilde{{\textbf {B}}}_t&= \text {Cov}(\tilde{{\textbf {x}}}_{b,t}-\tilde{{\textbf {x}}}_{\text {true},t}, \tilde{{\textbf {x}}}_{b,t}-\tilde{{\textbf {x}}}_{\text {true},t}), \end{aligned}$$
(30)
$$\begin{aligned} \tilde{{\textbf {R}}}_t&= \text {Cov}(\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})-\tilde{{\textbf {y}}}_t, \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})-\tilde{{\textbf {y}}}_t). \end{aligned}$$
(31)

In the rest of this paper, it is supposed that the latent error covariances \(\tilde{{\textbf {B}}}_t = \tilde{{\textbf {B}}}, \tilde{{\textbf {R}}}_t = \tilde{{\textbf {R}}}\) are time invariant.
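In code, the latent transformation operator of Eq. (27) is simply a composition of the observation encoder, the full-space mapping and the state decoder, which is then used in the latent cost function of Eq. (28). The sketch below assumes these three components are available as Python callables; all names are illustrative.

```python
def make_latent_H(encoder_y, H_full, decoder_x):
    """Latent transformation operator of Eq. (27): H_tilde = E_y o H o D_x."""
    def H_tilde(x_tilde):
        x = decoder_x(x_tilde)     # back to the full state space
        y = H_full(x)              # full-space state-observation mapping H
        return encoder_y(y)        # into the observation latent space
    return H_tilde

# usage: y_tilde_pred = make_latent_H(enc_y, H, dec_x)(x_tilde_b)
```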

3.3 Polynomial Regression for Surrogate Transformation Function

Despite the fact that traditional variational DA approaches can deal with complex \(\mathcal {H}\) functions, it is almost impossible to perform descent methods for Algorithm 1 because of the drawbacks described at the end of Sect. 3.1. Our idea consists of building a local smooth and differentiable surrogate function \(\tilde{\mathcal {H}}_t^p \) such that

$$\begin{aligned} \tilde{\mathcal {H}}_t^p (\tilde{{\textbf {x}}}^s_t) \approx \tilde{\mathcal {H}}_t (\tilde{{\textbf {x}}}^s_t) \quad \text {for} \quad \tilde{{\textbf {x}}}^s_t \text { in a neighbourhood of }\tilde{{\textbf {x}}}_{b,t}. \end{aligned}$$
(32)

It is important to note that the computation of \(\tilde{\mathcal {H}}^p\) will also depend on the value of the latent variable \(\tilde{{\textbf {x}}}\). The approximate cost function can then be written as

$$\begin{aligned} \tilde{J}_t^p({\tilde{{\textbf {x}}}})&=\frac{1}{2}(\tilde{{\textbf {x}}}-\tilde{{\textbf {x}}}_{b,t})^T\tilde{{\textbf {B}}}^{-1}(\tilde{{\textbf {x}}}-\tilde{{\textbf {x}}}_{b,t}) + \frac{1}{2}(\tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}}))^T \tilde{{\textbf {R}}}^{-1} (\tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})). \end{aligned}$$
(33)

The way in which the surrogate function is computed has a crucial impact on both the accuracy and the computational cost of DA, since the \(\tilde{\mathcal {H}}\) function may vary significantly with time for chaotic dynamical systems. From now on, we denote by \(\tilde{\mathcal {H}}_t\) and \(\tilde{\mathcal {H}}_t^p\) the latent transformation function at time t and the associated surrogate function, respectively. For time-variant \(\tilde{\mathcal {H}}_t\) and \({\textbf {x}}_t\), the computation of \(\tilde{\mathcal {H}}_t^p\) must be performed online. Thus the choice of local surrogate modelling approach should be a trade-off between approximation accuracy and computational time. As mentioned in the Introduction of this paper, the idea of computing a local surrogate model has been developed in the field of interpretable AI. Linear regression (including Lasso and Ridge) and simple ML models such as DT are prioritised there for the sake of interpretability (e.g., [33]). In this study, the local surrogate function is built via polynomial regression since our main criteria are smoothness and differentiability. Compared to other approaches, employing PR in LA has several advantages in terms of smoothness and computing efficiency.

To perform the local PR, we rely on local training datasets \(\{ \tilde{{\textbf {x}}}_{b,t}^q \}_{q = 1.. n_s}\) generated randomly around the current background state \(\tilde{{\textbf {x}}}_{b,t}\), since the true state is out of reach. The sampling is performed using Latin Hypercube Sampling (LHS) to cover the local neighbourhood efficiently and homogeneously [47]. Other sampling techniques, such as Gaussian perturbation, can also be considered depending on the prior knowledge of the dynamical system. We then fit the output of the transformation operator by a local polynomial function,

$$\begin{aligned} \tilde{\mathcal {H}}_t^p = \underset{p \in P(d_p)}{\textrm{argmin}} \left( \sum _{q=1}^{n_s} \vert \vert p(\tilde{{\textbf {x}}}_{b,t}^q) - \tilde{\mathcal {H}}_t (\tilde{{\textbf {x}}}_{b,t}^q)\vert \vert ^2_2\right) ^{1/2}, \end{aligned}$$
(34)

where \(P(d_p)\) represents the set of polynomial functions of degree \(d_p\). We then evaluate the \(\tilde{\mathcal {H}}_t\) function to generate the learning targets of the local PR, as shown in Fig. 3. The pipeline of the LA algorithm for dynamical models is summarised in Algorithm 2, where \(\tilde{\mathcal {M}}\) denotes the forward operator in the latent space. In the context of this paper, \(\tilde{\mathcal {M}}\) is the latent LSTM surrogate model. When using a sequence-to-sequence prediction, the forecasting model can be accelerated in the sense that a sequence of future background states can be predicted by one evaluation of the LSTM. The PR degree, the sampling range and the sampling size are denoted as \(d_p, r_s\) and \(n_s\) respectively. These parameters considerably affect the performance of Generalised LA, and their values should be chosen carefully, as shown later in Sect. 5.2.

Fig. 3 Flowchart of the polynomial-based local surrogate model in latent assimilation

Algorithm 2 Generalised Latent Assimilation for dynamical models
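A minimal sketch of the local surrogate construction of Eq. (34) followed by one Generalised LA update with the approximate cost of Eq. (33) is given below. It assumes SciPy (version 1.7 or later, for the scipy.stats.qmc Latin Hypercube sampler) and scikit-learn are available; the half-width r_s, sample size n_s and degree d_p correspond to the hyper-parameters discussed above, and all names are illustrative rather than those of the authors' repository.

```python
import numpy as np
from scipy.stats import qmc
from scipy.optimize import minimize
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def fit_local_surrogate(H_tilde, x_b_tilde, r_s=0.1, n_s=200, d_p=4):
    """Local polynomial surrogate of Eq. (34), fitted on LHS samples drawn
    in a hypercube of half-width r_s around the latent background state."""
    dim = x_b_tilde.size
    unit = qmc.LatinHypercube(d=dim).random(n_s)          # samples in [0, 1)^dim
    samples = x_b_tilde + r_s * (2.0 * unit - 1.0)        # centred on x_b_tilde
    targets = np.array([H_tilde(s) for s in samples])     # evaluate the DL operator
    model = make_pipeline(PolynomialFeatures(degree=d_p), LinearRegression())
    model.fit(samples, targets)
    return lambda x: model.predict(np.atleast_2d(x))[0]   # smooth surrogate H_tilde^p

def generalised_la_update(x_b_tilde, y_tilde, H_tilde, B_inv, R_inv, **kw):
    """One Generalised LA step: minimise the approximate cost of Eq. (33)."""
    Hp = fit_local_surrogate(H_tilde, x_b_tilde, **kw)
    def J(x):
        dx, dy = x - x_b_tilde, y_tilde - Hp(x)
        return 0.5 * dx @ B_inv @ dx + 0.5 * dy @ R_inv @ dy
    return minimize(J, x0=x_b_tilde, method="L-BFGS-B").x  # latent analysis state
```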

3.4 Theoretical Analysis of the Loss Function

The accuracy of the surrogate model with LA depends on a variety of uncertainties, including the ROM error, the RNN error, the observation error, the minimization error of DA, and the approximation error of GLA. In this section, we focus on the latter, which is induced by the approximation of the transformation operator using local polynomial functions in our new model.

Objective We aim to provide a theoretical upper bound for the expected absolute and relative approximation error evaluated on the true state, i.e.,

$$\begin{aligned} \mathbb {E} \big ( J_t^p(\tilde{{\textbf {x}}}_{\text {true},t}) - J_t(\tilde{{\textbf {x}}}_{\text {true},t}) \big ) \quad \text {and} \quad \frac{\mathbb {E} \big ( J_t^p(\tilde{{\textbf {x}}}_{\text {true},t}) - J_t(\tilde{{\textbf {x}}}_{\text {true},t}) \big )}{\mathbb {E} \big (J_t(\tilde{{\textbf {x}}}_{\text {true},t}))}. \end{aligned}$$
(35)

Assumptions The following assumptions are made in this section:

  1.

    Both background and observation prior errors follow a centred Gaussian distribution [5] and the prior error covariances are perfectly specified, i.e.,

    $$\begin{aligned} \tilde{\epsilon }_{b,t} = \tilde{{\textbf {x}}}_{b,t} - \tilde{{\textbf {x}}}_{{\text {true}},t} \sim \mathcal {N}(0, \tilde{{\textbf {B}}}), \quad \tilde{\epsilon }_{y,t} =\tilde{\textbf{y}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{{\text {true}},t}) \sim \mathcal {N}( 0, \tilde{{\textbf {R}}}). \end{aligned}$$
    (36)
  2.

    For simplicity, all the activation functions in the NNs are assumed to be the Rectified Linear Unit (ReLu).

Analysis In fact, the difference between \(J_t(\tilde{{\textbf {x}}})\) and \(J_t^p(\tilde{{\textbf {x}}})\) for any point \(\tilde{{\textbf {x}}}\) in the space can be bounded as

$$\begin{aligned} J_t^p(\tilde{{\textbf {x}}})&= \frac{1}{2} \Big ( \vert \vert \tilde{{\textbf {x}}}-\tilde{{\textbf {x}}}_{b,t}\vert \vert ^2_{\tilde{{\textbf {B}}}^{-1}}+\vert \vert \tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}) + \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}) - \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})\vert \vert ^2_{\tilde{{\textbf {R}}}^{-1}} \Big ) \end{aligned}$$
(37)
$$\begin{aligned}&\le \frac{1}{2} \Big ( \vert \vert \tilde{{\textbf {x}}}-\tilde{{\textbf {x}}}_{b,t}\vert \vert ^2_{\tilde{{\textbf {B}}}^{-1}}+\vert \vert \tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})\vert \vert ^2_{\tilde{{\textbf {R}}}^{-1}} +\vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})\vert \vert ^2_{\tilde{{\textbf {R}}}^{-1}} \nonumber \\&\quad + 2\vert \vert \tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})\vert \vert _{\tilde{{\textbf {R}}}^{-1}} \cdot \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})\vert \vert _{\tilde{{\textbf {R}}}^{-1}}\Big ) \end{aligned}$$
(38)
$$\begin{aligned}&\le J_t(\tilde{{\textbf {x}}}) + \frac{1}{2} \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})\vert \vert ^2_{\tilde{{\textbf {R}}}^{-1}} + \vert \vert \tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})\vert \vert _{\tilde{{\textbf {R}}}^{-1}} \cdot \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})\vert \vert _{\tilde{{\textbf {R}}}^{-1}}. \end{aligned}$$
(39)

We are interested in the expectation value of the loss function evaluated on the true state, i.e., \(\mathbb {E}(J_t^p(\tilde{{\textbf {x}}}_{\text {true},t}))\). Following Eq. (39),

$$\begin{aligned} \mathbb {E} \big ( J_t^p(\tilde{{\textbf {x}}}_{\text {true},t}) \big )&\le \mathbb {E} \big ( J_t(\tilde{{\textbf {x}}}_{\text {true},t}) \big )+ \frac{1}{2} \mathbb {E} \big ( \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert ^2_{\tilde{{\textbf {R}}}^{-1}} \big )\nonumber \\&\quad + \mathbb {E} \big ( \vert \vert \tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert _{\tilde{{\textbf {R}}}^{-1}} \cdot \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert _{\tilde{{\textbf {R}}}^{-1}} \big ) . \end{aligned}$$
(40)

Following Eq. (36) ,

$$\begin{aligned} \sqrt{\tilde{{\textbf {B}}}^{-1}} (\tilde{{\textbf {x}}}_{b,t} - \tilde{{\textbf {x}}}_{\text {true},t}) \sim \mathcal {N}(0, {\textbf {I}}_{\text {dim}(\tilde{{\textbf {x}}})}), \quad \sqrt{\tilde{{\textbf {R}}}^{-1}} (\tilde{\textbf{y}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t}) )\sim \mathcal {N}( 0, \textbf{I}_{\text {dim}(\tilde{{\textbf {y}}})}). \end{aligned}$$
(41)

Here we recall that, by definition, \(\tilde{{\textbf {B}}}\) and \(\tilde{{\textbf {R}}}\) are real constant symmetric positive definite matrices, thus \(\sqrt{\tilde{{\textbf {B}}}^{-1}} \) and \(\sqrt{\tilde{{\textbf {R}}}^{-1}} \) are well-defined.

$$\begin{aligned} \mathbb {E}(\vert \vert \tilde{{\textbf {x}}}_{\text {true},t}-\tilde{{\textbf {x}}}_{b,t}\vert \vert ^2_{\tilde{{\textbf {B}}}^{-1}})&= \mathbb {E}\Bigg ((\tilde{{\textbf {x}}}_{\text {true},t}-\tilde{{\textbf {x}}}_{b,t})^T \tilde{{\textbf {B}}}^{-1} (\tilde{{\textbf {x}}}_{\text {true},t}-\tilde{{\textbf {x}}}_{b,t}) \Bigg ) \end{aligned}$$
(42)
$$\begin{aligned}&= \mathbb {E}\Bigg (\Big ( \sqrt{\tilde{{\textbf {B}}}^{-1}} (\tilde{{\textbf {x}}}_{b,t} - \tilde{{\textbf {x}}}_{\text {true},t}) \Big )^T \cdot \Big ( \sqrt{\tilde{{\textbf {B}}}^{-1}} (\tilde{{\textbf {x}}}_{b,t} - \tilde{{\textbf {x}}}_{\text {true},t}) \Big ) \Bigg ) \end{aligned}$$
(43)
$$\begin{aligned}&= \mathbb {E}\Bigg (\vert \vert \sqrt{\tilde{{\textbf {B}}}^{-1}} (\tilde{{\textbf {x}}}_{b,t} - \tilde{{\textbf {x}}}_{\text {true},t})\vert \vert ^2_2 \Bigg )\end{aligned}$$
(44)
$$\begin{aligned}&= \text {dim}(\tilde{{\textbf {x}}}) \end{aligned}$$
(45)

For the same reason, \( \mathbb {E}(\vert \vert \tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert ^2_{\tilde{{\textbf {R}}}^{-1}}) = \text {dim}(\tilde{{\textbf {y}}}_t)\). One can then deduce

$$\begin{aligned} \mathbb {E} \big ( J_t (\tilde{{\textbf {x}}}_{\text {true},t}) \big ) = \text {dim}(\tilde{{\textbf {x}}}_t) + \text {dim}(\tilde{{\textbf {y}}}_t). \end{aligned}$$
(46)

A similar reasoning via the Mahalanobis norm can be found in the work of [48].
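The identity of Eqs. (42)-(46) can be checked numerically with a small Monte-Carlo experiment: for centred Gaussian errors, the expected squared Mahalanobis norm equals the dimension. The snippet below is purely illustrative; the covariance matrix is arbitrary and not related to the test case of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5
A = rng.standard_normal((dim, dim))
B = A @ A.T + dim * np.eye(dim)            # an arbitrary symmetric positive definite covariance
eps = rng.multivariate_normal(np.zeros(dim), B, size=100_000)   # background errors
B_inv = np.linalg.inv(B)
maha_sq = np.einsum("ij,jk,ik->i", eps, B_inv, eps)             # ||eps||^2_{B^{-1}} per sample
print(maha_sq.mean())   # close to dim, as stated in Eqs. (45)-(46)
```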

Now we focus on the other terms of Eq. (40). In fact, the observation error \( \vert \vert \tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert _{\tilde{{\textbf {R}}}^{-1}}\) is only related to instrument noise or representation error if the encoder error can be neglected. On the other hand, the approximation error \(\vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})\vert \vert _{\tilde{{\textbf {R}}}^{-1}}\) is only related to the polynomial regression, where the real observation vector \({\textbf {y}}\) is not involved. Therefore, we can suppose that \( \vert \vert \tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert _{\tilde{{\textbf {R}}}^{-1}}\) is uncorrelated with \(\vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})\vert \vert _{\tilde{{\textbf {R}}}^{-1}}\). This assumption will be verified numerically in the experiments. One can further deduce that,

$$\begin{aligned}&\mathbb {E} \big ( \vert \vert \tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert _{\tilde{{\textbf {R}}}^{-1}} \cdot \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert _{\tilde{{\textbf {R}}}^{-1}} \big ) \nonumber \\&\quad = \mathbb {E} \big ( \vert \vert \tilde{{\textbf {y}}}_t-\tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert _{\tilde{{\textbf {R}}}^{-1}} \big ) \cdot \mathbb {E} \big ( \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert _{\tilde{{\textbf {R}}}^{-1}} \big ) = 0. \end{aligned}$$
(47)

Now we only need to bound the polynomial regression error. For this, we rely on the recent theoretical results in the work of [37], which prove that, for learning a teacher NN via polynomial regression,

$$\begin{aligned} N^{*} = d^{O (L/\epsilon ^{*})^L} \quad \text {for the ReLU activation function}, \end{aligned}$$
(48)

where \(N^{*} \) is the required number of samples in the training dataset, d is the input dimension, L is the number of NNs layers and \(\epsilon ^{*}\) is the relative target prediction error (i.e., in our case \(\epsilon = \Big ( \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})\vert \vert _{2} / \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})\vert \vert _{2} \Big ) \le \epsilon ^{*} \)). Since we are looking for a bound of the regression error \(\epsilon \),

$$\begin{aligned}&N^{*} = d^{\big ( c (L/\epsilon ^{*})^L \big )} \quad \text {where} \quad c\text { is a real constant}\end{aligned}$$
(49)
$$\begin{aligned} \Leftrightarrow \quad&\log _d N^{*} = c (L/\epsilon ^{*})^L\end{aligned}$$
(50)
$$\begin{aligned} \Leftrightarrow \quad&\Big ({\frac{\log _d N^{*}}{c}}\Big )^{1/L} = L/\epsilon ^{*} \end{aligned}$$
(51)
$$\begin{aligned} \Leftrightarrow \quad&\epsilon \le \epsilon ^{*} = L \Big ( \frac{c}{\log _d N^{*}} \Big )^{1/L} \end{aligned}$$
(52)
$$\begin{aligned} \Leftrightarrow \quad&\vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})\vert \vert _{2} \le L \Big ( \frac{c}{\log _d N^{*}} \Big )^{1/L} \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})\vert \vert _{2}. \end{aligned}$$
(53)

Now that we have a relative bound of the polynomial prediction error in the \(L^2\) norm, we want to extend this bound to the matrix norm \({\vert \vert .\vert \vert }_{\tilde{{\textbf {R}}}^{-1}}\). For this we use a general algebraic result:

$$\begin{aligned}&\forall a \in \mathbb {R}^{\text {dim}(a)}, \quad {\textbf {C}}_{p,d} \in \mathbb {R}^{\text {dim}(a) \times \text {dim}(a)} \quad \text { is a symmetric positive definite matrix then} \nonumber \\&\sqrt{\lambda _{\text {min}}} \vert \vert a \vert \vert _2 \le \vert \vert a \vert \vert _{{\textbf {C}}_{p,d}} \le \sqrt{\lambda _{\text {max}}} \vert \vert a \vert \vert _2 \end{aligned}$$
(54)

where \(\lambda _{\text {min}}, \lambda _{\text {max}}\) represent the smallest and the largest eigenvalues of \({\textbf {C}}_{p,d}\) respectively. Since \({\textbf {C}}_{p,d}\) is positive definite, \(0 < \lambda _{\text {min}} \le \lambda _{\text {max}}\). We denote \( 0 < {\lambda ^{\tilde{{\textbf {R}}}}}_{\text {dim}(\tilde{{\textbf {y}}})} \le \cdots \le {\lambda ^{\tilde{{\textbf {R}}}}}_{1} \) the eigenvalues of \(\tilde{{\textbf {R}}}\). Thus the eigenvalues of \({\tilde{{\textbf {R}}}}^{-1}\) are \( 0 < 1/{\lambda ^{\tilde{{\textbf {R}}}}}_{1} \le \cdots \le 1/{\lambda ^{\tilde{{\textbf {R}}}}}_{\text {dim}(\tilde{{\textbf {y}}})}\). Following the result of Eq. (54),

$$\begin{aligned}&\vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})\vert \vert _{2} \le \sqrt{\lambda ^{\tilde{{\textbf {R}}}}_1} \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})\vert \vert _{\tilde{{\textbf {R}}}^{-1}} \quad \text {and} \nonumber \\&\vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})\vert \vert _{2} \ge \sqrt{\lambda ^{\tilde{{\textbf {R}}}}_{\text {dim}(\tilde{{\textbf {y}}})}} \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})\vert \vert _{\tilde{{\textbf {R}}}^{-1}}. \end{aligned}$$
(55)

Therefore, we can deduce from Eq. (53) that

$$\begin{aligned} \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}})\vert \vert _{\tilde{{\textbf {R}}}^{-1}} \le \sqrt{\lambda ^{\tilde{{\textbf {R}}}}_1/\lambda ^{\tilde{{\textbf {R}}}}_{\text {dim}(\tilde{{\textbf {y}}})}} L \Big ( \frac{c}{\log _d N^{*}} \Big )^{1/L} \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}})\vert \vert _{\tilde{{\textbf {R}}}^{-1}}. \end{aligned}$$
(56)

Thus,

$$\begin{aligned}&\mathbb {E} \big ( \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})- \tilde{\mathcal {H}}_t^p(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert ^2_{\tilde{{\textbf {R}}}^{-1}} \big ) \nonumber \\&\quad \le \text {cond}({\textbf {R}}) L^2 \Big ( \frac{c}{\log _d N^{*}} \Big )^{2/L} \mathbb {E} \big ( \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert ^2_{\tilde{{\textbf {R}}}^{-1}}), \end{aligned}$$
(57)

where \(\text {cond}({\textbf {R}}) = \lambda ^{\tilde{{\textbf {R}}}}_1/\lambda ^{\tilde{{\textbf {R}}}}_{\text {dim}(\tilde{{\textbf {y}}})}\) is the condition number of the \({\textbf {R}}\) matrix. Combining Eqs. (40), (47) and (57),

$$\begin{aligned}&\mathbb {E} \big ( J_t^p(\tilde{{\textbf {x}}}_{\text {true},t}) \big ) \nonumber \\&\quad \le \mathbb {E} \big ( J_t(\tilde{{\textbf {x}}}_{\text {true},t}) \big )+ \frac{1}{2} \text {cond}({\textbf {R}}) L^2 \Big ( \frac{c}{\log _d N^{*}} \Big )^{2/L} \mathbb {E} \big ( \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert ^2_{\tilde{{\textbf {R}}}^{-1}}) \nonumber \\&\quad = \text {dim}(\tilde{{\textbf {x}}}_t) + \text {dim}(\tilde{{\textbf {y}}}_t) + \frac{1}{2} \text {cond}({\textbf {R}}) L^2 \Big ( \frac{c}{\log _d N^{*}} \Big )^{2/L} \mathbb {E} \big ( \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert ^2_{\tilde{{\textbf {R}}}^{-1}}). \end{aligned}$$
(58)

Therefore we have upper bounds of \(\mathbb {E} \big ( J_t^p(\tilde{{\textbf {x}}}_{\text {true},t}) \big ) \) and \(\mathbb {E} \big ( J_t^p(\tilde{{\textbf {x}}}_{\text {true},t}) \big ) - \mathbb {E} \big ( J_t(\tilde{{\textbf {x}}}_{\text {true},t}) \big )\) which do not depend on the local polynomial surrogate model \(\tilde{\mathcal {H}}^p_t \). An upper bound for the relative error can also be found, i.e.,

$$\begin{aligned} \frac{\mathbb {E} \big ( J_t^p(\tilde{{\textbf {x}}}_{\text {true},t}) - J_t(\tilde{{\textbf {x}}}_{\text {true},t}) \big )}{\mathbb {E} \big (J_t(\tilde{{\textbf {x}}}_{\text {true},t}))} \le \frac{\text {cond}({\textbf {R}}) L^2 \Big ( \frac{c}{\log _d N^{*}} \Big )^{2/L} \mathbb {E} \big ( \vert \vert \tilde{\mathcal {H}}_t(\tilde{{\textbf {x}}}_{\text {true},t})\vert \vert ^2_{\tilde{{\textbf {R}}}^{-1}})}{2 (\text {dim}(\tilde{{\textbf {x}}}) + \text {dim}(\tilde{{\textbf {y}}}))}. \end{aligned}$$
(59)

Furthermore, in the case where the target NN is fixed and we have infinite local training data for the polynomial surrogate model,

$$\begin{aligned} \mathbb {E} \big ( J_t^p(\tilde{{\textbf {x}}}_{\text {true},t}) - J_t(\tilde{{\textbf {x}}}_{\text {true},t}) \big ) \xrightarrow {N^{*} \rightarrow +\infty } 0. \end{aligned}$$
(60)

This result is consistent with the Stone-Weierstrass theorem, which states that every continuous function defined on a closed interval can be approximated as closely as desired by a polynomial function [49]. The proof in this section assumes that all the activation functions in the NNs are ReLu. The analysis can be extended to other activation functions, such as sigmoid, based on the recent results in [37].

4 Results: ROM and RNN Approaches

In this section, we describe the test case of an oil-water two-phase flow CFD simulation, used for numerical comparison of different ML surrogate models and LA approaches.

4.1 CFD Modelling

Liquid-liquid two-phase flows are widely encountered in many industrial sectors, including petroleum, chemical and biochemical engineering, food technology and pharmaceutics. In crude-oil pipelines or oil recovery equipment, both dispersed and separated oil-water flows can be observed, and the transition between these flow patterns can impact operating cost and safety. Therefore, a fundamental understanding of oil-water flow behaviour in pipelines has long been pursued through theoretical, experimental and simulation efforts. However, the problem is not yet fully solved due to the complexity of multiphase flow characteristics. Even for the very simple case of the separation of oil droplets in water in a horizontal pipeline, it is still challenging to predict the separation length and the layer height distribution. Although a large amount of experimental data is available, the prediction of such flow regime transitions is still poor due to the limited understanding of the underlying physics.

The experiment in this study is conducted in the flow rig developed by [50]. The flow pipe consists of a front and a back leg, each 4 m long with a uniform diameter of 26 mm, as shown in Fig. 4. The two legs are connected by a U-bend. Measurements are conducted in the front leg only; high-speed imaging, combined with Particle Image Velocimetry and Laser Induced Fluorescence experiments, is used to study the drop evolution, velocity profiles and flow patterns. As shown in Table 2, the two test cases explored in this work have initial mixture velocities of 0.52 m/s and 1.04 m/s respectively. The average oil inlet volume fraction of both simulations is set to 30%. The first simulation (i.e., the one with \(U_m = 0.52\) m/s) is used to train the surrogate model while the second one is used later to test the performance of the ROMs. The simulations are validated against experimental data of the concentration profiles and layer heights, and adopt the same physical properties and operating parameters as the experiment. The related parameters are given in Tables 1 and 2.

Fig. 4
figure 4

Dimension and parameters of the pipe and the two-phase flow

Table 1 Physical properties of the experimental system
Table 2 Operating parameters of the experiment
Fig. 5
figure 5

CFD modelling of the two-phase flow

The CFD simulation (as illustrated in Fig. 5) aims to study the flow separation characteristics. The two-phase flow of silicone oil and water in a pipe with a length of 4 m and a diameter of 26 mm is studied. Eulerian–Eulerian simulations are performed with the open-source CFD platform OpenFOAM (version 8.0), and population balance models [51] are used to model the droplet size and coalescence behaviour. The governing equations of the Eulerian framework are given below:

$$\begin{aligned}&\frac{\partial }{\partial t}\left( \alpha _\textrm{k} \rho _\textrm{k}\right) + \nabla \cdot \left( \alpha _\textrm{k} \rho _\textrm{k} \varvec{U}_\textrm{k}\right) =0, \end{aligned}$$
(61)
$$\begin{aligned}&\frac{\partial }{\partial t}\left( \alpha _\textrm{k} \rho _\textrm{k} \varvec{U}_\textrm{k}\right) +\nabla \cdot \left( \alpha _\textrm{k} \rho _\textrm{k} \varvec{U}_\textrm{k} \varvec{U}_\textrm{k}\right) =-\alpha _\textrm{k} \nabla p+\nabla \cdot \left( \alpha _\textrm{k} \varvec{\tau }_\textrm{k}\right) +\alpha _\textrm{k} \rho _\textrm{k} \varvec{g}+\varvec{M}_\textrm{k}, \end{aligned}$$
(62)

where the subscript k denotes the water or oil phase, and \(\varvec{\tau }_\textrm{k}\) is the stress tensor expressed as

$$\begin{aligned} \varvec{\tau }_\textrm{k}=\mu _\textrm{eff}\left[ \nabla \varvec{U}_\textrm{k}+\left( \nabla \varvec{U}_\textrm{k}\right) ^\textrm{T}-\frac{2}{3}\left( \nabla \cdot \varvec{U}_\textrm{k}\right) I \right] . \end{aligned}$$
(63)

A structured mesh with 180000 nodes is generated with the blockMesh utility, and the volume concentration at the inlet boundary is prescribed by patch manipulation (the createPatch utility in OpenFOAM). In all cases, the mixture \(k{-}\epsilon \) model and wall functions are used to model the turbulence equations. In order to obtain a steady flow pattern, the flow time is set to 10 s. The time step is 0.005 s for all the cases, which ensures convergence at the current mesh resolution. The running time is 40 h in a four-node parallel computing mode; the computing nodes use an Intel\(^\circledR \) Xeon(R) CPU E5-2620 (2.00 GHz, RAM 64 GB). Finally, snapshots of the oil concentration \(\varvec{\alpha }_t\) and the velocities \({\textbf {V}}_{x,t}, {\textbf {V}}_{y,t}, {\textbf {V}}_{z,t}\) along the x, y and z axes respectively (i.e., \(\varvec{U}_{k,t} = [{\textbf {V}}_{x,t}, {\textbf {V}}_{y,t}, {\textbf {V}}_{z,t}] \)) are generated from the CFD model to describe the two-phase flow dynamics. In this study, we are interested in building a machine learning surrogate model for predicting the evolution of \(\varvec{\alpha }_t\) along the test section. The training of the autoencoders and the LSTM is based on 1000 snapshots (i.e., one every 0.01 s) as described in Sect. 4.2.

4.2 Numerical Results of Latent Surrogate Modelling

In this section, we compare different latent surrogate modelling techniques, including both ROM and RNN approaches in the CFD application described in Sect. 4.1.

4.2.1 ROM Reconstruction

We first compare the performance of the different autoencoding approaches introduced in Sect. 2.1. The single-trajectory simulation data of 1000 snapshots in total are split into a training (including validation) dataset with \(80\%\) of snapshots and a test dataset with the remaining \(20\%\) snapshots. Following the setup in [7], the data split is performed homogeneously where the four snapshots between two consecutive test snapshots are used for training. In other words, the test dataset contains the snapshots \(\{ \varvec{\alpha }_4, \varvec{\alpha }_9, \varvec{\alpha }_{14}, \ldots , \varvec{\alpha }_{999} \}\). Since we are dealing with cylindrical meshes and the length of the pipe (4 m) is much larger than its diameter (26 mm), we decide to first flatten the snapshots to 1D vectors before auto-encoding as shown in Fig. 6.
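
As a minimal sketch, the homogeneous split can be reproduced with a few lines of NumPy; the snapshot array and its dimensions below are placeholders, not the paper's actual data.

```python
import numpy as np

# Placeholder snapshot matrix: 1000 flattened snapshots (rows) of the oil
# concentration field; a reduced node count is used here, whereas the real
# mesh contains 180000 nodes.
n_snapshots, n_nodes = 1000, 1800
snapshots = np.random.rand(n_snapshots, n_nodes).astype(np.float32)

# Homogeneous split: every 5th snapshot (indices 4, 9, ..., 999) is held out
# for testing; the four snapshots in between are kept for training.
test_idx = np.arange(4, n_snapshots, 5)
train_idx = np.setdiff1d(np.arange(n_snapshots), test_idx)

x_train, x_test = snapshots[train_idx], snapshots[test_idx]
print(x_train.shape, x_test.shape)   # (800, 1800) (200, 1800)
```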

Fig. 6
figure 6

Encoder–decoder modelling for the two-phase flow in the pipe

POD The distributions of the eigenvalues for \(\varvec{\alpha }\) and the normalised \( {\textbf {V}}_x, {\textbf {V}}_y, {\textbf {V}}_z\) are shown in Fig. 7, while the compression accuracy \(\gamma \) and rate \(\rho \), as defined in Eq. (8), are displayed in Table 3 for the truncation parameter \(q=30\). In this application, POD exhibits a high compression accuracy with an extremely low compression rate on the training dataset issued from one CFD simulation. The performance on the test dataset will be further examined in Sect. 4.2.1.
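
A minimal sketch of this POD compression is given below, computing the truncated basis by SVD. The definitions of \(\gamma \) (retained-energy fraction of the first q modes) and \(\rho \) (ratio of latent to full dimension) are assumptions based on the usual conventions, since Eq. (8) is not reproduced here, and the data array is a placeholder.

```python
import numpy as np

# Placeholder training snapshots (rows: time steps, columns: mesh nodes).
X = np.random.rand(800, 1800)
X_mean = X.mean(axis=0)
Xc = X - X_mean                                # centre the data before POD

# POD via a thin SVD of the centred snapshot matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

q = 30                                          # truncation parameter
L_x = Vt[:q].T                                  # POD basis, shape (n_nodes, q)

# Assumed definitions of the compression accuracy and rate.
gamma = (s[:q] ** 2).sum() / (s ** 2).sum()     # energy retained by q modes
rho = q / Xc.shape[1]                           # latent over full dimension

# Encoding and reconstruction of one snapshot with the truncated basis.
x_tilde = (X[0] - X_mean) @ L_x                 # latent coefficients, length q
x_rec = X_mean + x_tilde @ L_x.T                # approximation in the full space

print(f"gamma = {gamma:.4f}, rho = {rho:.5f}")
```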

Fig. 7
figure 7

Eigenvalues for \(\varvec{\alpha }, {\textbf {V}}_x, {\textbf {V}}_y\) and \({\textbf {V}}_z\) on the training set, issued from one simulation

Table 3 Compression accuracy \(\gamma \) and rate \(\rho \) with truncation parameter \(q = 30\) for \(\varvec{\alpha }, {\textbf {V}}_x, {\textbf {V}}_y\) and \({\textbf {V}}_z\)

1D CAE Since the meshes have a non-square structure and the pipe's length is much larger than its diameter, we decide to proceed with a 1D CAE. As pointed out by [52], the ordering of points is crucial in CNN algorithms, especially for problems with non-square meshes. Denoting \( \mathcal {Z} = \{z_1, z_2, \ldots , z_{n_z} \}\) the set of nodes in the mesh structure, their links can be represented by the adjacency matrix \({\textbf {A}}^z\) defined as

$$\begin{aligned} {{\textbf {A}}^z_{i,j} = \left\{ \begin{array}{c l} 1 &{} \quad \text {if }z_i \text { is connected to }z_j \\ 0 &{} \quad \text {otherwise.} \end{array}\right. } \end{aligned}$$
(64)

In this study, when the 3D meshes are flattened to a 1D vector, the corresponding adjacency matrix contains many non-zero values outside the diagonal band, as shown in Fig. 8a. In other words, when applying a 1D CNN, the edges \({\textbf {A}}^z_{i,j}\) represented by non-zero values far from the diagonal cannot be included in the same convolutional window, so the information of these links is lost during the encoding. This is a common problem when dealing with unstructured or non-square meshes [17, 19]. Much effort has been devoted to finding the optimal ordering of sparse matrices for reducing the matrix bandwidth [53, 54]. In this work, we make use of the Cuthill-McKee algorithm [55], based on ideas from graph theory, which has proved efficient for symmetric sparse matrices. The adjacency matrix of the reordered nodes is shown in Fig. 8b, where all non-zero elements lie within a diagonal band of width 10. We then perform the 1D CNN on these reordered nodes. The exact NNs structure of this 1D CAE can be found in Table 4.
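
A sketch of this reordering step is given below using SciPy's reverse Cuthill-McKee routine (the standard variant of the Cuthill-McKee ordering); the toy random adjacency matrix stands in for the 180000-node mesh.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Toy symmetric adjacency matrix A^z: entry (i, j) is 1 if nodes i and j share
# an edge. A small random graph stands in for the flattened CFD mesh.
rng = np.random.default_rng(0)
n = 200
rows = rng.integers(0, n, size=600)
cols = rng.integers(0, n, size=600)
A = csr_matrix((np.ones_like(rows), (rows, cols)), shape=(n, n))
A = ((A + A.T) > 0).astype(np.int8)            # symmetrise and binarise

# Bandwidth-reducing permutation (reverse Cuthill-McKee, as provided by SciPy).
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_reordered = A[perm][:, perm]

def bandwidth(M):
    i, j = M.nonzero()
    return int(np.abs(i - j).max())

print(bandwidth(A), "->", bandwidth(A_reordered))
```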

Fig. 8
figure 8

Adjacency matrices before (a) and after (b) mesh reordering

Table 4 NN structure of the CAE with ordered meshes

POD AE We first apply the POD operators to obtain the full set of PCs of \(\varvec{\alpha }, {\textbf {V}}_x, {\textbf {V}}_y\) and \({\textbf {V}}_z\) respectively, as described in Sect. 2.1.1. Since \(20\%\) of the snapshots are used for testing, we obtain 799 PCs for each variable. The auto-encoding of \(\varvec{\alpha }, {\textbf {V}}_x, {\textbf {V}}_y\) and \({\textbf {V}}_z\) to the compressed latent variables \(\tilde{\varvec{\alpha }}, \tilde{{\textbf {V}}}_x, \tilde{{\textbf {V}}}_y\) and \(\tilde{{\textbf {V}}}_z\) is then performed individually with the same NNs structure, as displayed in Table 5. The training of the POD AE is so efficient that it can easily be performed on a laptop CPU in less than 15 min. In contrast, the 1D CAE takes several hours to train on the full set of snapshots.
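
A minimal Keras sketch of such a POD AE is given below, assuming 799 PCs per variable compressed to a latent dimension of 30; the layer sizes are illustrative and do not reproduce Table 5.

```python
import numpy as np
from tensorflow.keras import layers, models

n_pc, latent_dim = 799, 30                          # full PCs -> assumed latent size
pcs = np.random.rand(800, n_pc).astype("float32")   # placeholder POD coefficients

encoder = models.Sequential([
    layers.Input(shape=(n_pc,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(latent_dim, activation="linear"),
], name="encoder")

decoder = models.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(n_pc, activation="linear"),
], name="decoder")

# The autoencoder reconstructs the POD coefficients from the latent variables.
inp = layers.Input(shape=(n_pc,))
autoencoder = models.Model(inp, decoder(encoder(inp)))
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(pcs, pcs, epochs=5, batch_size=32, validation_split=0.1)
```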

Table 5 NN structure of the POD AE

Numerical comparison The relative mean square error (RMSE) of the different ROM reconstructions of the oil concentration \(\alpha \) is illustrated in Fig. 9 for the CFD simulations. The first simulation (Fig. 9a, b) includes both training (\(80\%\)) and test (\(20\%\)) data while the second simulation (Fig. 9c) consists of purely unseen test data. In order to further inspect the ROM accuracy against the dimension of the latent space (i.e., the truncation parameter), we show in Fig. 9 the performance for both \(q=5\) (a) and \(q=30\) (b, c). It can be clearly observed that the POD and the 1D CAE (with reordered nodes) are outperformed by the POD AE in terms of both average accuracy and temporal robustness on the first CFD simulation data. For all ROM approaches, a higher dimension of the latent space (\(5 \longrightarrow 30\)) significantly enhances the reconstruction; in the case of the POD AE, the RMSE is reduced from around \(10\%\) to around \(3\%\). We thus choose the POD AE approach for computing the latent surrogate model in this work. As expected, the RMSE evaluated on the second simulation dataset is larger than on the first one. In Fig. 9c, the POD and the POD AE show better generalizability than the 1D CAE, which confirms our choice of the POD AE in this application.

Fig. 9
figure 9

Comparison of reconstruction errors of the oil concentration \(\alpha \) using different auto-encoder approaches. Figures a, b are evaluated on the simulation data of \(U_m = 0.52\) m/s (i.e., the first row of Table 2) while figure c is evaluated on the simulation data of \(U_m = 1.04\) m/s (i.e., the second row of Table 2)

4.2.2 LSTM Surrogate Model

In this study, instead of the classical many-to-one LSTM setting (e.g., [1, 7]), we make use of a sequence-to-sequence LSTM structure to speed up the evaluation of the surrogate model. More precisely, instead of a single-step output, the LSTM predicts a time series of latent variables whose internal connections follow the time steps. For more details about sequence-to-sequence LSTM, interested readers are referred to the work of [56]. The recent work of [57] shows that an incremental LSTM, which forecasts the difference between the output and input variables, can significantly improve the accuracy and efficiency of the learning procedure, especially for multiscale and multivariate systems. Therefore, we adopt the incremental LSTM in the sequence-to-sequence learning with

  • LSTM input: \({\textbf {u}}_\text {input} = [\tilde{{\textbf {x}}}_t, \tilde{{\textbf {x}}}_{t+1},\ldots , \tilde{{\textbf {x}}}_{t+l_\text {input}-1}]\),

  • LSTM output: \({\textbf {u}}_\text {output} = [\tilde{{\textbf {x}}}_{t+l_\text {input}}-\tilde{{\textbf {x}}}_{t+l_\text {input}-1}, \tilde{{\textbf {x}}}_{t+l_\text {input}+1}-\tilde{{\textbf {x}}}_{t+l_\text {input}},\ldots , \tilde{{\textbf {x}}}_{t+l_\text {input}+l_\text {output}-1}-\tilde{{\textbf {x}}}_{t+l_\text {input}+l_\text {output}-2}]\),

where \(l_\text {input}\) and \(l_\text {output}\) denote the lengths of the input and the output sequences respectively, and \(\tilde{{\textbf {x}}}_{t}\) represents the latent vector encoded via the POD AE approach at time step t. The training data are generated from the simulation snapshots by shifting the beginning of the input sequence, as shown in Fig. 10. Similar to the setup of the AEs, \(80\%\) of the input and output sequences are used as training data while the remaining \(20\%\) form the test dataset. In this work, we implement two LSTM models: the first one includes only the encoded concentration (i.e., \(\tilde{\varvec{\alpha }}\)) and the second one uses both concentration and velocity variables (i.e., \( \tilde{\varvec{\alpha }}, \tilde{{\textbf {V}}}_x, \tilde{{\textbf {V}}}_y, \tilde{{\textbf {V}}}_z\)), as illustrated in Fig. 10. We set \(l_\text {input} = l_\text {output} = 30\) for the joint LSTM model (i.e., the one including the velocity data), meaning that 33 iterative applications of the LSTM are required to predict the whole CFD model. On the other hand, the single concentration model is trained using an LSTM 10to10 (i.e., \(l_\text {input} = l_\text {output} = 10\)) since the instability of the single model does not allow long-range predictions, as demonstrated later in this section. For clarity, in the rest of this paper, the single and joint models refer to the following (a schematic sketch of the incremental sequence-to-sequence setup is given after these definitions):

  • Single model: LSTM 10to10 predictive model based on encoded concentration \(\tilde{\varvec{\alpha }}\)

  • Joint model: LSTM 30to30 predictive model based on encoded concentration and velocity variables \( \tilde{\varvec{\alpha }}, \tilde{{\textbf {V}}}_x, \tilde{{\textbf {V}}}_y, \tilde{{\textbf {V}}}_z\)
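
The following Keras sketch illustrates the incremental sequence-to-sequence setup for the joint model, assuming the four 30-dimensional latent vectors are concatenated into 120 features per time step; the layer sizes are illustrative and do not reproduce Table 7.

```python
import numpy as np
from tensorflow.keras import layers, models

seq_len, n_feat = 30, 120      # 30to30 joint model; 4 latent vectors of size 30

# Sequence-to-sequence LSTM predicting one latent increment per time step.
model = models.Sequential([
    layers.Input(shape=(seq_len, n_feat)),
    layers.LSTM(128, return_sequences=True),
    layers.TimeDistributed(layers.Dense(n_feat)),
])
model.compile(optimizer="adam", loss="mse")

# Placeholder latent time series; in the paper these are the POD AE encodings.
latents = np.random.rand(1000, n_feat).astype("float32")

# Incremental targets: the network learns x_{t+1} - x_t rather than x_{t+1}.
X, Y = [], []
for t in range(len(latents) - 2 * seq_len):
    X.append(latents[t:t + seq_len])
    Y.append(latents[t + seq_len:t + 2 * seq_len]
             - latents[t + seq_len - 1:t + 2 * seq_len - 1])
model.fit(np.stack(X), np.stack(Y), epochs=5, batch_size=32)
```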

The exact NNs structure of the joint LSTM model is shown in Table 7, where the sequence-to-sequence learning is performed. The single concentration model is implemented thanks to the RepeatVector layer. The reconstructed principal components obtained via LSTM prediction (i.e., \(\mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)\) following the notation in Sect. 2.1.3) are compared against the compressed ground truth (i.e., \({\textbf {L}}_{{\textbf {x}}}^T ({\textbf {x}})\)) in Figs. 11 and 12. As observed in Fig. 12, the latent prediction is accurate until around 200 time steps (2 s) for all the POD coefficients. However, a significant divergence can be observed just after \(t=2\) s for most principal components due to the accumulation of prediction error. On the other hand, the joint LSTM model, with a similar NNs structure, exhibits a much more robust prediction performance, although some temporal lag can still be observed. The reconstructed prediction of the oil concentration \(\varvec{\alpha }\) at \(t = 7\) s (i.e., \(\mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_{t=700})\)), together with the CFD simulation of \(\varvec{\alpha }_{t=700}\), is illustrated in Fig. 13. The joint LSTM model reproduces the CFD simulation reasonably well, with a slight delay in the oil dynamics, while the prediction of the single LSTM model diverges at \(t=7\) s. These results are consistent with our analysis of Figs. 11 and 12. In summary, although the objective here is to build a surrogate model for simulating the oil concentration, it is demonstrated numerically that including more physics information can improve the prediction performance.

The computational time of the LSTM surrogate models (on a laptop CPU) and the CFD approach (in parallel computing mode) for the entire simulation is given in Table 6. For both LSTM models, the online prediction takes place from \(t=1\) s (the \(100\text {th}\) time step) until \(t = 10\) s (the \(1000\text {th}\) time step), where the first 100 time steps of exact encoded latent variables are provided to 'warm up' the prediction system. From Table 6, one observes that the online computational time of the LSTM surrogate models is around 1000 times shorter than that of the CFD. Table 6 also shows that a longer prediction sequence in sequence-to-sequence LSTM can significantly reduce the online prediction complexity. The offline computation of both approaches is also very fast thanks to the training efficiency of the POD AE (Table 7).

Fig. 10
figure 10

LSTM training in the latent space for a joint model of concentration and velocity

Fig. 11
figure 11

The LSTM prediction of reconstructed POD coefficients (i.e., \(\mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)\)) with joint LSTM 30to30 surrogate model

Fig. 12
figure 12

The LSTM prediction of reconstructed POD coefficients (i.e., \(\mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)\)) with single LSTM 10to10 surrogate model

Table 6 Computational time of LSTM surrogate models and CFD modelling
Table 7 LSTM structure in POD AE latent space for the single model (only concentration) and the joint model (concentration and velocity)
Fig. 13
figure 13

The original CFD simulation against LSTM predictions at \(t = 7\) s

5 Results: GLA Approach

In this section, we test the performance of the novel generalised latent assimilation algorithm on the CFD test case of oil-water two-phase flow. The strength of the new approach proposed in this paper, compared to existing LA methods, is that DA can be performed with heterogeneous latent spaces for the state and observation data. The algorithm performance is evaluated using randomly generated observation functions \(\mathcal {H}\) in the full space.

5.1 Non-linear Observation Operators

In order to evaluate the performance of the novel approach, we work with different synthetically generated non-linear observation vectors for LA. Since we would like to remain as general as possible, we prefer not to impose a particular form of the observation operator, which could promote some space-filling properties. For this purpose, we model the observation operator with a random matrix \({\textbf {H}}\) acting as a binomial selection operator. The full-space transformation operator \(\mathcal {H}\) consists of the selection operator \({\textbf {H}}\) and a marginal non-linear function \(f_\mathcal {H}\). Each observation is constructed as the sum of a few true state variables randomly collected over the subdomain. To do so, we introduce the notation for a subset sample \(\left\{ \textsc {x}_t^*(i) \right\} _{i=1\ldots n_\text {sub}}\) randomly but homogeneously chosen (with replacement) with probability P among the available dataset \(\left\{ \textsc {x}_t(k) \right\} _{k=1\ldots n=180000}\). The evaluations of the \(f_\mathcal {H}\) function on the subsets (i.e., \(f_\mathcal {H}(\textsc {x}_t^*)\)) are summed up and the process is repeated \(m \in \{10000, 30000\}\) times in order to construct the observations:

$$\begin{aligned} y_t(j) = \sum _{i=1}^{n_j} f_\mathcal {H} (\textsc {x}_t^*(i)), \quad \text {for} \; j=1,\ldots ,m, \end{aligned}$$
(65)

where the size \(n_j\) (invariant with time) of the collected sample used for each \(j{\text {th}}\) observation data point \(y_t(j)\) is random and by construction follows a binomial distribution \(\mathcal {B}(n,P)\). As for the entire observation vector,

$$\begin{aligned}&{\textbf {y}}_t = \begin{bmatrix} y_t(0) \\ y_t(1)\\ \vdots \\ y_t(m-1) \end{bmatrix} = \mathcal {H} ({\textbf {x}}_t) = {\textbf {H}}f_\mathcal {H} ({\textbf {x}}_t) = \begin{bmatrix} {\textbf {H}}_{0,0}, \ldots {\textbf {H}}_{0,n-1} \\ \vdots \\ {\textbf {H}}_{m-1,0}, \ldots {\textbf {H}}_{m-1,n-1} \\ \end{bmatrix} \begin{bmatrix} f_\mathcal {H} (x_t(0)) \\ f_\mathcal {H} (x_t(1))\\ \vdots \\ f_\mathcal {H} (x_t(n-1)) \end{bmatrix} \nonumber \\&\text {with} \quad {\textbf {H}}_{i,j} = \left\{ \begin{array}{c l} &{} 0 \quad \text {with probability} \quad 1-P\\ &{} 1 \quad \text {with probability} \quad P \end{array}\right. . \end{aligned}$$
(66)
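
A minimal sketch of this synthetic observation generation (Eqs. (65)-(66)) is given below; the dimensions are reduced placeholders and the quadratic marginal function used by default is one of the two choices introduced just after.

```python
import numpy as np

rng = np.random.default_rng(42)

# Reduced placeholder sizes; the paper uses n = 180000 nodes, m in {10000, 30000}
# observation points and a selection probability P = 0.1%.
n, m, P = 18000, 1000, 0.001

# Binomial selection operator H: each entry equals 1 with probability P.
H = rng.binomial(1, P, size=(m, n)).astype(np.float32)

def observe(x_t, f_H=lambda x: x ** 2):
    """Synthetic observation y_t = H f_H(x_t); the default marginal function is
    the quadratic one, the reciprocal choice being 1 / (x + 0.5)."""
    return H @ f_H(x_t)

x_t = rng.random(n).astype(np.float32)      # placeholder state snapshot
y_t = observe(x_t)                           # observation vector of length m
```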

Using a randomly generated selection operator to produce observation values is common practice for testing the performance of DA algorithms (e.g., [58, 59]). In this work we choose a sparse representation with \(P=0.1\%\). Once \({\textbf {H}}\) is randomly chosen, it is kept fixed for all the numerical experiments in this work. Two marginal non-linear functions \(f_\mathcal {H}\) are employed in this study:

  • quadratic function: \(f_\mathcal {H}(x) = x^2 \)

  • reciprocal function: \(f_\mathcal {H}(x) = 1/(x + 0.5) \).

After the observation data are generated based on Eq. (66), we apply the POD AE approach to build an observation latent space of dimension 30 with the associated encoder \(\mathcal {E}_y\) and decoder \(\mathcal {D}_y\). In this application, the dimension of the observation latent space is chosen arbitrarily as 30; in general, the latent state space and the latent observation space need not have the same dimension. Following Eqs. (27) and (66), the state variables \(\tilde{{\textbf {x}}}_t\) and the observations \(\tilde{{\textbf {y}}}_t\) in LA can be linked as:

$$\begin{aligned} \tilde{{\textbf {y}}}_t&= \tilde{\mathcal {H}} (\tilde{{\textbf {x}}}_t) = \mathcal {E}_{{\textbf {y}}} \circ {\textbf {H}}\circ f_\mathcal {H} \circ \mathcal {D}_{{\textbf {x}}} (\tilde{{\textbf {x}}}_t). \end{aligned}$$
(67)
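
The composition of Eq. (67) can be sketched as below, with simple linear maps standing in for the trained state decoder \(\mathcal {D}_{{\textbf {x}}}\) and observation encoder \(\mathcal {E}_{{\textbf {y}}}\); only the structure of the composition is meant to be representative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, latent_dim = 18000, 1000, 30                      # reduced placeholder sizes

H = rng.binomial(1, 0.001, size=(m, n)).astype(np.float32)   # selection operator
f_H = lambda x: x ** 2                                        # quadratic marginal function

# Linear stand-ins for the trained POD AE state decoder D_x and observation
# encoder E_y; in practice these are the trained neural networks.
W_dec = rng.standard_normal((n, latent_dim)).astype(np.float32)
W_enc = rng.standard_normal((latent_dim, m)).astype(np.float32)
D_x = lambda z: W_dec @ z
E_y = lambda y: W_enc @ y

def tilde_H(x_tilde):
    """Latent observation operator of Eq. (67): E_y o H o f_H o D_x."""
    return E_y(H @ f_H(D_x(x_tilde)))

y_tilde = tilde_H(np.ones(latent_dim, dtype=np.float32))
print(y_tilde.shape)    # (30,)
```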

5.2 Numerical Validation and Parameter Tuning

Local polynomial surrogate functions are then used to approximate the transformation operator \(\tilde{\mathcal {H}} = \mathcal {E}_{{\textbf {y}}} \circ {\textbf {H}}\circ f_\mathcal {H} \circ \mathcal {D}_{{\textbf {x}}}\) in Latent Assimilation. In order to investigate the PR accuracy and perform the hyper-parameter tuning, we start by computing the local surrogate function at a fixed time step \(t=3\) s with \((\tilde{{\textbf {x}}}_{300}, \tilde{{\textbf {y}}}_{300})\). Two LHS ensembles \(\{ \tilde{{\textbf {x}}}_\text {train}^q \}_{q = 1.. 1000}\) and \(\{ \tilde{{\textbf {x}}}_\text {test}^q \}_{q = 1.. 1000}\), each of 1000 sample vectors, are generated for training and validating the local PR respectively. As mentioned in Sect. 3.2, the polynomial degree \(d^{p}\) and the LHS range \(r_s\) are two important hyper-parameters which impact the surrogate function accuracy; \(r_s\) also determines the expected range of prediction errors in the GLA algorithm. For the hyper-parameter tuning, we evaluate the root-mean-square error (RMSE) on \(\{ \tilde{{\textbf {x}}}_\text {test}^q \}_{q = 1.. 1000}\) and the computational time of the local PR for a range of parameters, i.e.,

$$\begin{aligned}&\{ \tilde{{\textbf {x}}}_\text {train}^q \}_{q = 1.. 1000} / \{ \tilde{{\textbf {x}}}_\text {test}^q \}_{q = 1.. 1000} = \text {LHS Sampling}_{\{d^{p},r_s,1000\}} (\tilde{{\textbf {x}}}_{t=300}) \nonumber \\&\text {for} \quad d^{p} \in \{1,\ldots ,5 \} \quad \text {and} \quad r_s \in \{10\%, \ldots , 90\% \}. \end{aligned}$$
(68)
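
A sketch of this local surrogate construction is given below, combining SciPy Latin hypercube sampling with a scikit-learn polynomial regression; the interpretation of the sampling range \(r_s\), the stand-in for \(\tilde{\mathcal {H}}\) and the reduced polynomial degree are assumptions made to keep the example light.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
latent_dim, n_samples, r_s = 30, 1000, 0.30
degree = 2                     # kept low here for speed; the paper retains d^p = 4

# Centre of the local surrogate: the background latent state (stand-in values).
x_center = rng.standard_normal(latent_dim)

# Latin hypercube samples in a box of half-width r_s * |x_center| around the
# centre (one possible reading of the LHS range r_s).
unit = qmc.LatinHypercube(d=latent_dim, seed=0).random(n=n_samples)
x_train = x_center + (2.0 * unit - 1.0) * (r_s * np.abs(x_center))

# Smooth non-linear stand-in for the latent observation operator tilde_H.
W = rng.standard_normal((latent_dim, latent_dim))
tilde_H = lambda x: np.tanh(x @ W)
y_train = tilde_H(x_train)

# Local polynomial surrogate fitted by least squares.
surrogate = make_pipeline(PolynomialFeatures(degree), LinearRegression())
surrogate.fit(x_train, y_train)

rmse = np.sqrt(np.mean((surrogate.predict(x_train) - y_train) ** 2))
print(f"training RMSE = {rmse:.3e}")
```
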
Fig. 14
figure 14

Logarithm of RMSE in the test dataset (evaluated on 1000 points) and the training time in seconds

The results are presented in Fig. 14 with a logarithmic scale for both the RMSE and the computational time (in seconds). Here the quadratic function is chosen as the transformation operator to perform the tests. Figure 14a reveals a steady rise of the RMSE against the LHS range \(r_s\). This shows the difficulty of PR prediction when the input vector is far from the LHS centre (i.e., \(\tilde{{\textbf {x}}}_{300}\)) due to the high non-linearity of the NNs functions. The PR performance for \(d^p = 2,3,4\) on the test dataset \(\{ \tilde{{\textbf {x}}}_\text {test}^q \}_{q = 1.. 1000}\) is more robust compared to linear predictions (i.e., \(d^p = 1\)), especially when the LHS range grows. However, overfitting occurs when \(d^p \ge 5\), where an increase of the prediction error is observed. One has to make a trade-off between prediction accuracy and application range when choosing the value of \(r_s\). In general, the PR presents good performance with a relatively low RMSE (bounded above by \(e^{3} \approx 20.09\)) given that \(\vert \vert \tilde{{\textbf {x}}}_{t=300}\vert \vert _2 = 113.07\). As for the computational time of a local PR, it stays in the same order of magnitude for the different sets of parameters (from \(e^{5.2} \approx 181\) s to \(e^{5.5} \approx 244\) s), where the cases of \(d^p = 1,2,3,4\) are extremely close. Considering the numerical results shown in Fig. 14 and further experiments in Latent Assimilation, we fix the parameters as \(d^p = 4\) and \(r_s = 0.3\) in this application.

The PR predictions against the compressed truth in the latent space are shown in Fig. 15 for four different latent observations. What can be clearly seen is that the local PR fits the \(\tilde{\mathcal {H}}\) function very well on the training dataset (Fig. 15a–d) while also providing good predictions of unseen data (Fig. 15e–h), which is consistent with our conclusion from Fig. 14. When the sampling range increases in the test dataset (Fig. 15i–l), the predictions clearly start to perform less well. This represents the case where we have under-estimated the prediction error by 100% (i.e., \(r_s = 30\%\) for training and \(r_s = 60\%\) for testing). The required number of samples (i.e., \(n_s=1000\)) is obtained from offline experiments performed at \((\tilde{{\textbf {x}}}_{300},\tilde{{\textbf {y}}}_{300})\). For the different polynomial degrees \(d^p \in \{ 1,2,3,4,5\}\), no significant improvement in prediction accuracy on the test dataset can be observed when the number of samples \(n_s>1000\). We have also performed other experiments at different time steps (other than \(t=3\) s) and obtained qualitatively similar results.

Fig. 15
figure 15

Latent variable prediction results in the training (a–d) and test (e–l) datasets against the true values with the polynomial degree \(d^p = 4\). The LHS sampling range is \(r_s = 30\%\) for a–h and \(r_s = 60\%\) for i–l

5.3 Generalised Latent Assimilation

In this section, we illustrate the experimental results of performing variational Generalised LA with the POD AE reduced-order-modelling and the LSTM surrogate model. The loss functions in the variational methods are computed thanks to the local polynomial surrogate functions. The obtained results are compared with CFD simulations both in the low dimensional basis and the full physical space.

5.3.1 GLA with a Quadratic Operator Function

Following the setup in Sect. 5.1, the full-space observation operator is computed with a binomial random selection matrix \({\textbf {H}}\) and the quadratic marginal function \(f_\mathcal {H}(x) = x^2 \) as shown in Eq. (66). Separate POD AEs (i.e., \(\mathcal {E}_{{\textbf {y}}}\) and \(\mathcal {D}_{{\textbf {y}}}\)) are trained for encoding the observation data. The prediction of the LSTM surrogate model starts at \(t=3\) s, i.e., the \(300\text {th}\) time step. Since the prediction of the joint model is made using a 30to30 LSTM, the LA takes place every 1.5 s starting from 5.7 s, for 30 consecutive time steps each time. In other words, the LA takes place at time steps 570 to 599, 720 to 749 and 870 to 899, resulting in 90 assimilation steps among 700 prediction steps. As for the 10to10 single concentration LSTM model, since its prediction accuracy is relatively mediocre as shown in Fig. 12, more assimilation steps are required. In this case the LA takes place every 0.6 s starting from 5 s, for 10 consecutive time steps each time, leading to 180 assimilation steps in total. For the minimization of the cost function in the variational LA (Eq. (33)), Algorithm 2 is performed with the maximum number of iterations \(k_{max} = 50\) and the tolerance \(\epsilon = 0.05\) in each assimilation window. Identifying the error covariances in the latent space is challenging. Current LA methods either make use of diagonal matrices [7, 60] or perform ensemble-type [8, 25, 61] data assimilation to estimate \(\tilde{{\textbf {B}}}\) and \(\tilde{{\textbf {R}}}\); the latter can be computationally difficult for complex and highly non-linear transformation functions. Thus, in this test case, scaled identity matrices are chosen as the latent covariance matrices. To increase the importance of the observation data, the error covariance matrices in Algorithm 1 are fixed as:

$$\begin{aligned} \tilde{{\textbf {B}}} = {\textbf {I}}_{30} \quad \text {and} \quad \tilde{{\textbf {R}}} = 0.1 \times {\textbf {I}}_{30}, \end{aligned}$$
(69)

where \({\textbf {I}}_{30}\) denotes the identity matrix of dimension 30. In this particular test, no artificial error has been added to the synthetic observations. However, encoding the observation data from the full space to the latent space inevitably creates compression errors. This noise can be taken into account in the DA process through the modelling of the \({\textbf {R}}\) matrix [62, 63].
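
The assimilation step can be sketched as the minimisation below; SciPy's L-BFGS-B optimiser is used here in place of the ADAO package, and the background state, the encoded observation and the polynomial surrogate are simple stand-ins.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
latent_dim = 30

# Latent error covariances as in Eq. (69): B = I_30, R = 0.1 * I_30.
B_inv = np.eye(latent_dim)
R_inv = np.eye(latent_dim) / 0.1

# Stand-ins for the LSTM background state, the local polynomial surrogate of
# the latent observation operator and the encoded observation.
x_b = rng.standard_normal(latent_dim)
W = rng.standard_normal((latent_dim, latent_dim))
surrogate_H = lambda x: np.tanh(x @ W)
y_obs = surrogate_H(x_b) + 0.05 * rng.standard_normal(latent_dim)

def cost(x):
    """Variational GLA cost with the surrogate in the observation term."""
    dx = x - x_b
    dy = y_obs - surrogate_H(x)
    return 0.5 * dx @ B_inv @ dx + 0.5 * dy @ R_inv @ dy

# Quasi-Newton minimisation starting from the background, mirroring Algorithm 2.
res = minimize(cost, x_b, method="L-BFGS-B", options={"maxiter": 50})
x_analysis = res.x
```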

The latent assimilation of the reconstructed principal components (i.e., \(\mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)\)) against the compressed ground truth is illustrated in Figs. 16 and 17 for the joint and single LSTM surrogate models respectively. The red curves include both prediction and assimilation results starting at \(t = 3\) s (i.e., the \(300\text {th}\) time step). What can be clearly observed is that, compared to the pure LSTM predictions shown in Figs. 11 and 12, the mismatch between the predicted curves and the ground truth (CFD simulation) is considerably reduced by the novel generalised LA technique, especially for the single LSTM model. As for the joint LSTM surrogate model (Fig. 16), the improvement is significant for \(\mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)_4, \mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)_5 \text { and } \mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)_6\). These results show that the novel approach can incorporate real-time observation data with partial and non-linear transformation operators that the state-of-the-art LA cannot handle. The prediction/assimilation mismatch in the full physical space is discussed later in Sect. 5.3.3.

Fig. 16
figure 16

The LA of reconstructed POD coefficients (i.e., \(\mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)\)) with the joint LSTM 30to30 surrogate model and the quadratic observation function. Results of the same experiment without GLA are shown in Fig. 11

Fig. 17
figure 17

The LA of reconstructed POD coefficients (i.e., \(\mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)\)) with the single LSTM 10to10 surrogate model and the quadratic observation function. Results of the same experiment without GLA are shown in Fig. 12

5.3.2 GLA with a Reciprocal Operator Function

Here we keep the same assimilation setup as in Sect. 5.3.1 in terms of assimilation frequency and error covariance specification. Instead of a quadratic function, the observation data are generated using the reciprocal function \(f_\mathcal {H}(x) = 1/(x + 0.5) \) in the full space, as described in Sect. 5.1. Therefore, new autoencoders are trained to compress the observation data for \(\varvec{\alpha }_t, {\textbf {V}}_{x,t}, {\textbf {V}}_{y,t}, {\textbf {V}}_{z,t}\) to latent spaces of dimension 30. The results of the predicted/assimilated POD coefficients \(\mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)\) are shown in Figs. 18 and 19. A similar conclusion to Sect. 5.3.1 can be drawn: the generalised LA approach manages to correctly update the LSTM predictions (for both joint and single models) on a consistent basis. Some non-physical oscillatory behaviours can be observed in Figs. 16, 17, 18 and 19; this is due to the application of LA, which modifies the dynamics in the latent space. Comparing the assimilated curves obtained with the quadratic and reciprocal observation functions, the latter are slightly more chaotic because reciprocal functions, when combined with DL encoder–decoders (as shown in Fig. 3), can be more difficult for the local polynomial surrogate functions to learn.

Fig. 18
figure 18

The LA of reconstructed POD coefficients (i.e., \(\mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)\)) with joint LSTM 30to30 surrogate model and reciprocal observation function

Fig. 19
figure 19

The LA of reconstructed POD coefficients (i.e., \(\mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)\)) with single LSTM 10to10 surrogate model and reciprocal observation function

Fig. 20
figure 20

Prediction relative error in the space of the principal components and in the full space

Fig. 21
figure 21

Prediction in the full CFD space after LA with quadratic (a, c) and reciprocal (b, d) observation functions

5.3.3 Prediction Error in the Latent and the Full Space

In this section, we illustrate the evolution of the global prediction/assimilation errors and the forecasting of the global physical field based on the results obtained in Sects. 5.3.1 and 5.3.2. The relative \(L_2\) errors in the space of the principal components and in the full space of the concentration, i.e.,

$$\begin{aligned} \frac{\vert \vert {\textbf {L}}^T_{{\textbf {x}}} \varvec{\alpha }_t - \mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)\vert \vert _2}{\vert \vert {\textbf {L}}^T_{{\textbf {x}}} \varvec{\alpha }_t \vert \vert _2} \quad \text {and} \quad \frac{\vert \vert \varvec{\alpha }_t - {\textbf {L}}_{{\textbf {x}}} \mathcal {D}'_{{\textbf {x}}}(\tilde{{\textbf {x}}}^{\text {predict}}_t)\vert \vert _2}{\vert \vert \varvec{\alpha }_t\vert \vert _2}, \end{aligned}$$
(70)
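
For reference, the two relative errors of Eq. (70) amount to the following computation; the fields below are placeholders.

```python
import numpy as np

def relative_l2(truth, pred):
    """Relative L2 error ||truth - pred||_2 / ||truth||_2 as used in Eq. (70)."""
    return np.linalg.norm(truth - pred) / np.linalg.norm(truth)

# Placeholder fields: a concentration snapshot and its decoded prediction.
rng = np.random.default_rng(0)
alpha_t = rng.random(18000)
alpha_pred = alpha_t + 0.03 * rng.standard_normal(18000)

print(relative_l2(alpha_t, alpha_pred))
```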

for both the joint and single models are shown in Fig. 20. The evolution of the relative error in the global space is consistent with our analysis of Figs. 16, 17, 18 and 19 for the decoded POD coefficients. The LA with quadratic (in red) and reciprocal (in green) observation operators can significantly reduce the relative error compared to the original LSTM model (in blue). More importantly, the DA does not only impact the estimation at the current time steps; it also improves future predictions after the assimilation, demonstrating the stability of the proposed approach. The prediction errors in the latent space and in the full physical space share very similar shapes for both the single and joint models, showing that the ROM reconstruction errors are dominated by the LSTM prediction error. The reconstructed model prediction/assimilation in the full space at \(t=7\) s is shown in Fig. 21. Compared to Fig. 13, the prediction of the single LSTM model (Fig. 21a, b) is greatly improved, with an output that is much more realistic and closer to the CFD simulation (Fig. 13a). As for the joint model, the initial delay of the oil dynamics is also well corrected thanks to the variational LA approach, although some noise can still be observed. In summary, the novel LA technique with local polynomial surrogate functions manages to improve the current assimilation reconstruction and, more importantly, the future predictions of the latent LSTM. The average computational time of GLA on a laptop CPU for one time step is presented in Table 8; little difference is found between the quadratic and reciprocal marginal functions. The optimization of Eq. (33) is implemented using the ADAO [64] package, where the maximum number of iterations and the stopping tolerance of the BFGS algorithm are fixed at 50 and 0.01, respectively.

6 Conclusion and Future Work

Performing DA with simulation and observation data encoded in heterogeneous latent spaces is an important challenge since the background and observation quantities are often different in real DA scenarios. On the other hand, it is extremely difficult, if not infeasible, to directly apply classical variational DA approaches because of the complexity and non-smoothness of the NNs function which links the different latent spaces. In this paper, we introduce a novel algorithm, named generalised Latent Assimilation, which makes use of a polynomial surrogate function to approximate the NNs transformation operator in a neighbourhood of the background state. Variational DA can then be performed by computing an observation loss using this local polynomial function. This new method enables a much more flexible use of LA with machine learning surrogate models. A theoretical analysis is also given in the present study, where an upper bound of the approximation error of the DA cost function (evaluated on the true state) is specified. Future work can further focus on the minimization error related to the surrogate loss function in GLA. The numerical tests in the high-dimensional CFD application show that the proposed approach can ensure both the efficiency of the ROMs and the accuracy of the assimilation/prediction.

In this study, the training and the validation of both the ROM and the LSTM are performed using a single CFD simulation with well separated training and testing datasets. Future work will investigate building robust models for both autoencoding and machine learning prediction using multiple CFD simulations as training data; building such a training dataset can, however, be time-consuming due to the complexity of the CFD code. The local polynomial surrogate function is computed relying on LHS samplings in this work; other sampling strategies, such as Gaussian perturbations, can also be considered. Representing model or observation error (originally in the full space) in the latent space is challenging due to the non-linearity of the ROMs. Future work could also enhance the error covariance specification in the latent space by investigating, for instance, uncertainty propagation from the full physical space to the latent space, posterior error covariance tuning (e.g., [58, 65, 66]) or ensemble-type [67] DA approaches.

Table 8 Averaged computational time of GLA for one time step