1 Introduction

The demand for HR data in the flow field has always been a major pursuit in computational fluid dynamics (CFD). Many CFD methods, such as the finite volume (FV) [1], the finite difference (FD) [2] and the lattice Boltzmann (LBM) [3, 4], can be performed for obtaining HR turbulence data. Unlike the CFD methods that directly solve the Navier-Stokes (NS) equations on a macroscopic scale, the LBM simulates the flow field based on the lattice Boltzmann equation on a mesoscopic scale. Compared to FD and FV solvers, LBM is simpler in modeling and implementation, has parallel processing capabilities, and is more successful in processing complex boundaries [5, 6]. However, for HR turbulent flow data, LBM usually requires a large number of lattice numbers with a very small time step, which makes the acquisition of HR data time-consuming and computationally expensive [7]. Due to the high data cost of LBM results for the flow field, the rapid development of data-driven models provides an effective way to obtain HR flow data in real-time [8,9,10]. In this study, we focus on obtaining HR turbulent flows from LR spatio-temporal data in real-time using deep learning (DL) models.

With the mushroom growth of deep learning technology, deep neural networks have been widely used in flow field SR and reconstruction tasks [11,12,13]. Various deep learning models have been applied in turbulent flow field super-resolution [14,15,16,17]. Fukami et al. [18] proposed a hybrid downsampled skip-connection/multi-scale (DSC/MS) model to reconstruct two-dimensional homogeneous turbulence. Considering the multi-scale information of the flow field, Liu et al. [19] designed a multi-resolution convolutional autoencoder (MrCAE) SR architecture that leverages the multigrid method and transfer learning. MrCAE can dynamically capture different scaled flow information at different network depths. Given the difficulty of obtaining HR label information, Gao et al. [20] proposed the CNN-SR model, which is trained without any high-resolution labels in a physics-driven way. Han et al. [21] proposed a novel deep learning solution that recovers temporal super-resolution (TSR) of three-dimensional vector field data (VFD) for unsteady flow. In addition to the CNN model, the generative adversarial network (GAN) [22] model is also widely used in flow field super-resolution. Güemes et al. [23] used the super-resolution generative adversarial network (SRGAN) to reconstruct the turbulent flow measured by the coarse wall. The results show that SRGAN can obtain perfect reconstruction even at very low resolution. In addition, Deng et al. [24] compared the results of two models, SRGAN and enhanced-SRGAN (ESRGAN), on super-resolution reconstructions of complicated wake flows behind cylinders. Their results found that ESRGAN has better reconstruction ability than SRGAN in both mean and fluctuation flow fields. To reconstruct HR turbulent flow by minimal flow field data, Yousif et al. [25] proposed the multi-scale ESRGAN with a physics-based loss function. Considering the great performance of reduced-order models in flow field prediction, the combination of reduced-order models with deep learning can also be well applied to flow field super-resolution tasks. Guastoni et al. [26] combined a fully convolutional neural network (FCN) with proper orthogonal decomposition (POD) [27]. They proposed the FCN-POD model, which has great performance in reconstructing the flow field from coarse wall measurements. The flow field SR technology based on deep learning can accurately obtain the HR flow field in real-time, which is of great significance for obtaining high-precision flow field information.

Most of the models mentioned above only use the spatial information of the low-resolution flow field in the SR reconstruction of the flow field, which will increase the prediction error of the model. In order to improve the quality of the SR reconstruction, Liu et al. [28] proposed multiple temporal paths convolutional neural network (MTPC). The LR information of the current moment and two adjacent moments is adopted as the input of the MTPC network, and then the output of the three moments is weighted to predict the HR image of the current moment. The results indicate that the MTPC has a better reconstruction capability than only using LR information at the current moment. In addition, Fukami et al. [29] further proposed a spatio-temporal SR model based on the temporal SR task. They constructed a temporal model and a spatial model, respectively, to reconstruct the HR information over a period of time, only using the LR information at the first and last moments. Their work has made important contributions to reconstructing the spatio-temporal flow field.

In this paper, to further improve the prediction accuracy of the HR flow field, we propose the MST-UNet model for flow field SR tasks with spatio-temporal combination. Since the flow field is usually spatially and temporally coupled, the MST-UNet model not only uses the LR spatial information of the flow field at the current frame but also incorporates the HR information predicted at the previous frame. The information input of this spatio-temporal combination can further improve prediction accuracy. In addition, we apply the MST-UNet model to the spatio-temporal SR task, which can reconstruct the HR information of a series of intermediate frames only from the beginning and end frames of LR information.

The remainder of this paper is structured as follows. Section 2 introduces the MST-UNet model and spatio-temporal SR task. The construction of datasets and the results of the MST-UNet model in spatial SR and spatio-temporal SR tasks are presented in Section 3. Section 4 concludes the paper.

2 Methods

The purpose of this study is to reconstruct the spatio-temporal HR flow field from LR data in real time using CNN. In order to predict the HR flow field at the current frame, we use the LR information at the current frame and the HR information at the previous frame. In this way, we can obtain HR flow field information by learning the end-to-end mapping function F,

$$\begin{aligned} P^{SR}_{t}=F\left( P_{t-1}^{HR}, P_{t}^{LR}, \theta \right) , \end{aligned}$$
(1)

where \(P^{SR}_{t}\) is the result of SR reconstruction, and \(\theta\) means the learnable network parameters in CNN, \(P_{t}^{LR}\) and \(P_{t-1}^{HR}\) represent LR information at the current frame and HR information at the previous frame, respectively. It should be noted that \(P_{t-1}^{HR}\) is not the label information but the model output result at the previous frame instead. We can take advantage of the LR information at the current moment only, which is a normal single-input-single-output situation. Compared with this case, using the previous time information can improve the data utilization efficiency and prediction accuracy. In this multiple-input-single-output case, the shared explicit redundancy from adjacent frames will provide more constraints.

In the current work, the MST-UNet model is first introduced for the SR tasks. Then, we apply the MST-UNet model in the spatio-temporal SR task.

2.1 MST-UNet model

In the MST-UNet model, the HR images of the previous frame (\(P_{t-1}^{HR}\)) and the LR images of the current frame (\(P_{t}^{LR}\)) are combined as the model input, which is multi-scale information. To fuse this multi-scale information as model input, the \(P_{t}^{LR}\) is first up-sampled to the target resolution \(P_{t}^{inter}\) using the bicubic interpolation. Then, the \(P_{t-1}^{HR}\) and \(P_{t}^{inter}\) are concatenated into a two-channel image as the input of the model (Fig. 1).

Fig. 1
figure 1

Data fusion for the input of the model. The HR images of the previous frame (\(P_{t-1}^{HR}\)) and the LR images of the current frame (\(P_{t}^{LR}\)) are combined as the input of the model, and the interpolation method is used to make two inputs the same size

In this paper, the UNet model [30] is adopted as the basic model of the SR task, which is one outstanding image-to-image CNN model. The UNet model is an encoder-decoder architecture consisting of a contracting path that captures global information (encoder) and an asymmetric expanding path to reconstruct HR information (decoder). In addition to combining LR information during downsampling and HR information during upsampling, the UNet model fills in the underlying information through a feature fusion operation of skip connections to improve prediction accuracy. Due to this particular framework, UNet has gained great success on computer vision tasks.

The super-resolution task we deal with is a typical image-to-image regression problem, and the classic UNet structure is modified for better application to this assignment, as shown in Fig. 2. The input of the model is a two-channel image consisting of \(P_{t-1}^{HR}\) and \(P_{t}^{inter}\), and the output of the model is the HR image at the current frame. The blue arrow is a basic block made up of three parts, followed by a \(3\times 3\) convolutional layer, GroupNorm (GN) [31] and ReLU activation functions (\(\phi (x)=\max (x, 0)\)). Since the batch normalization method is relatively dependent on the batch size, when the batch size is too small, the calculated mean and variance are insufficient to represent the entire data distribution. The GN method is employed in this problem, independent of the batch size. The UNet model consists of the contracting path (on the left) and the expansive path (on the right). As for the contracting path, four downsampling blocks (red arrow) using average pooling with \(2\times 2\) filters and stride 2 are adopted for feature extraction. Correspondingly, the expansive path consists of four identical \(2\times\) upsampling blocks, using bilinear interpolation instead of the transpose convolution. For our modified UNet model, the contracting path and the expansive path are completely symmetric, which makes the input and output of the model the same size.

Fig. 2
figure 2

The UNet architecture for SR tasks. The input of the model is a two-channel image consisting of \(P_{t-1}^{HR}\) and \(P_{t}^{inter}\), and the output of the model is the HR image at the current frame

Fig. 3
figure 3

Schematic diagram of the multi-scale temporal path UNet model

In the training process, in order to predict the HR image at the t frame, the LR information of the t (\(P_{t}^{LR}\)) and \(t-1\) (\(P_{t-1}^{LR}\)) frames is used. As shown in Fig. 3, in order to obtain the HR results of the previous frame, we use interpolation to obtain an initial value in the first epoch of the training process. In the remaining epochs, the idea of the memory bank [32] is adopted. The prediction results of the previous epoch will be stored and used as the HR images of the previous frame. The L1 loss is chosen as the loss function,

$$\begin{aligned} Loss_{mst} = \frac{1}{N} \sum \limits _{t=1}^{N}\left\| P_{t}^{SR}-P_{t}^{H R}\right\| . \end{aligned}$$
(2)

\(Loss_{mst}\) is the loss function of the MST-UNet. According to (2), the network parameters can be updated using the back-propagation algorithm.

2.2 Spatio-temporal super resolution

For reconstructing the spatio-temporal flow field, Fukami et al. [29] proposed two methods. Both methods include temporal and spatial models, but in different orders.

Fig. 4
figure 4

Spatio-temporal SR reconstruction of the flow field using deep learning. (a) shows the combine model using the temporal model as the inbetweening model while (b) demonstrates the combine model using the spatial model

As shown in Fig. 4(a), for LR spatial and temporal information (\(P(x_{LR},t_{LR})\)), we apply the inbetweening temporal model (\({F}_{t}^{*}: \mathbb {R}^{n_{L R} \times m_{L R}} \rightarrow \mathbb {R}^{n_{L R} \times m_{H R}}\)) to obtain (\(P(x_{LR},t_{HR})\)), which has HR temporal information. Then the spatial model (\({F}_{x}\)) is employed to gain HR spatial and temporal information (\(P(x_{LR},t_{HR})\)).

$$\begin{aligned}{} & {} P(x_{LR},t_{HR})={F}_{t}^{*}\left( P\left( x_{L R}, t_{L R}\right) \right) , \nonumber \\{} & {} P(x_{HR},t_{HR})={F}_{x}(P(x_{LR},t_{HR})),\\{} & {} P(x_{HR},t_{HR})={F}_{x}\left( {F}_{t}^{*}\left( q\left( x_{L R}, t_{L R}\right) \right) \right) . \nonumber \end{aligned}$$
(3)

Correspondingly, we can use the spatial model as the inbetweening model to obtain HR spatial information (\(P(x_{HR},t_{LR})\)), and then use the temporal model to obtain HR spatial and temporal information (Fig. 4(b)).

$$\begin{aligned}{} & {} P(x_{HR},t_{LR})={F}_{x}^{*}\left( P\left( x_{L R}, t_{L R}\right) \right) , \nonumber \\{} & {} P(x_{HR},t_{HR})={F}_{t}(P(x_{HR},t_{LR})),\\{} & {} P(x_{HR},t_{HR})={F}_{t}\left( {F}_{x}^{*}\left( q\left( x_{L R}, t_{L R}\right) \right) \right) , \nonumber \end{aligned}$$
(4)

where n is the dimension of spatial information and m is the dimension of temporal information. \(P(x_{LR},t_{LR})\) represents LR spatial and temporal information, which is the model input. \(P(x_{LR},t_{HR})\) represents HR temporal information while LR spatial information. Similarly, \(P(x_{HR},t_{LR})\) represents HR spatial while LR temporal information, and \(P(x_{HR},t_{HR})\) represents HR temporal and spatial information, which is the output of the model.

In this study, the first model is selected as the spatio-temporal SR model, which uses the temporal model as the inbetweening model. In the temporal model, we still use the UNet model, taking the first and last two frames of the LR image as input, and the output is a series of LR images in the middle frame. In this way, the MST-UNet model proposed in Section 2.1 can be used as a spatial model to improve the model prediction accuracy further. The loss function of the spatio-temporal SR model is as follows,

$$\begin{aligned} Loss_{ts}= & {} \Vert \hat{P}\left( x_{H R}, t_{HR}\right) -{F}_{x}\left( {F}_{t}^{*}\left( P\left( x_{L R}, t_{LR}\right) \right) \right) \Vert , \nonumber \\= & {} \Vert \hat{P}\left( x_{H R}, t_{HR}\right) -{P}\left( x_{H R}, t_{HR}\right) \Vert . \end{aligned}$$
(5)

Among them, \(\hat{P}\left( x_{H R}, t_{HR}\right)\) represents the label information of a series of frames.

3 Numerical results

To evaluate the ability of the MST-UNet model to reconstruct HR flow fields, the flow around cylinders is used as our test [33, 34]. The entire flow field domain is shown in Fig. 5.

Fig. 5
figure 5

The illustration of the flow domain around three cylinders

In Fig. 5, \(D=0.2\) m and \(u_0=0.2\) m/s represent the cylinder diameter and initial velocity, respectively. The LBM is applied in the construction of the data set.

3.1 The lattice Boltzmann method

The LBM is performed on a square lattice where at each node there is a discrete number of directions in which the fluid particles can move. In this study, a 2D model with nine lattice velocities - the D2Q9, will be discussed and applied.

A general form of the LBM equation can be written as follows [35],

$$\begin{aligned} f_i\left( \varvec{x}+\varvec{e}_i \Delta t, t+\Delta t\right) -f_i(\varvec{x}, t)=\Omega _i^{\textrm{coll}}(\varvec{x}, t), \end{aligned}$$
(6)

where \(f_{i}\) is a discrete distribution function at position \(\varvec{x}\), with average particle lattice velocity \(\varvec{e}_i\), pointing in direction (see Fig. 6), at time t. \(\Omega ^{\text {coll}}\) is the collision operator, and will be discussed later. \(\varvec{e}_i=\left( e_{x, i}, e_{y, i}\right)\) is the directional vector and it takes the following values:

$$\begin{aligned} \varvec{e}=\left( \begin{array}{lll} e_{x, 0}, &{} e_{x, 1}, \ldots &{} e_{x, 8} \\ e_{y, 0}, &{} e_{y, 1}, \ldots &{} e_{y, 8} \end{array}\right) =\left( \begin{array}{l} 0,1,0,-1,0,1,-1,-1,1 \\ 0,0,1,0,-1,1,1,-1,-1 \end{array}\right) . \end{aligned}$$
(7)

The local fluid density \(\rho\) is computed as the sum of all \(f_i\) on site. The expression for the D2Q9 lattice model is the following,

$$\begin{aligned} \rho (\varvec{x}, t)=\sum \limits _{i=0}^8 f_i(\varvec{x}, t). \end{aligned}$$
(8)

The local fluid velocity \(\varvec{u}=\left( u_x, u_y\right)\) can be computed from the expression for the local momentum.

Fig. 6
figure 6

D2Q9 basic lattice velocity directions

$$\begin{aligned} \rho (\varvec{x}, t) \varvec{u}(\varvec{x}, t)=\sum \limits _{i=1}^8 f_i(\varvec{x}, t) \varvec{e}_i. \end{aligned}$$
(9)

The Bhatnagar-Gross-Krook (BGK) model has the simplest form of the presented collision operators, which reads as follows [35, 36],

$$\begin{aligned} \Omega _i^{\textrm{coll}}=-\omega \left( f_i(\varvec{x}, t)-f_i^{\textrm{eq}}(\varvec{x}, t)\right) . \end{aligned}$$
(10)

\(f_i^{\textrm{eq}}\) is the local equilibrium distribution function, and is computed from the local macroscopic velocity \(\varvec{u}\), and the local density \(\rho\).

$$\begin{aligned} f_i^{\textrm{eq}}(\varvec{u}, \rho )= w_i \rho (\varvec{x}, t)\left[ 1+3 \varvec{e}_i \cdot \varvec{u}(\varvec{x}, t)+\frac{9}{2}\left( \varvec{e}_i \cdot \varvec{u}(\varvec{x}, t)\right) ^2-\frac{3}{2} \varvec{u}(\varvec{x}, t) \cdot \varvec{u}(\varvec{x}, t)\right] , \end{aligned}$$
(11)

where \(w_i\) is the equilibrium weight factor in the i-th direction. Its values add up to 1, and in D2Q9 they equal:

$$\begin{aligned} \varvec{w}=\left( w_0, w_1, \ldots w_8\right) =\left( \frac{4}{9}, \frac{1}{9}, \frac{1}{9}, \frac{1}{9}, \frac{1}{9}, \frac{1}{36}, \frac{1}{36}, \frac{1}{36}, \frac{1}{36}\right) . \end{aligned}$$
(12)

w is the relaxation rate and is the inverse of the relaxation time \(\tau =\frac{1}{\omega }\). From the relaxation rate, the fluid’s kinematic viscosity v can be calculated,

$$\begin{aligned} v=\frac{1}{3}\left( \frac{1}{\omega }-\frac{1}{2}\right) . \end{aligned}$$
(13)

There are three cylinders in the domain, with center coordinates \(\left( x_0,y_0\right)\), \(\left( x_1,y_1\right)\), and \(\left( x_2,y_2\right)\). In our problem, the range of six coordinates is shown in Table 1. Two datasets of different sizes are generated to illustrate the effectiveness of our model. One is a small dataset with the size of 5000, denoted as \(Dataset_{s}\): 500 groups of different cylindrical positions are generated using the Latin hypercube sampling method. For each set of positions, we generate 10 frames of data at consecutive moments. The other one is a large dataset completed with the size of 19000, which has 1000 groups of different cylindrical positions and 19 frames of data at each position, denoted as \(Dataset_{l}\). In each dataset, \(80\%\) of the samples are used for training, and the remaining \(20\%\) for testing. The position of the cylinder changes, and the flow field at each position varies with time, which increases the difficulty of the SR task. In our problem, the resolution of HR and LR flow fields is \(128 \times 256\) and \(16 \times 32\) separately with an upscaling factor of 8.

Table 1 The value range of six coordinates

In the experiments, we first investigate the spatial SR task using the MST-UNet model. Then, on the basis of the spatial super-resolution task, the spatial-temporal SR task is further studied. The mean absolute error (MAE) and \(L_2\) error norms (\(\varepsilon\)) are adopted as the metrics to evaluate the performance of the MST-UNet model and spatial-temporal SR task, where \(\varepsilon\) is defined as

$$\begin{aligned} \varepsilon =\left\| P^{H R}-F\left( P^{L R}, \theta \right) \right\| _{2} /\left\| P^{H R}\right\| _{2}. \end{aligned}$$
(14)

3.2 Spatial super-resolution task

In this task, our MST-UNet model combines the HR information of the previous moment with the LR information of the current moment. By contrast, models with LR information as inputs only are introduced,

$$\begin{aligned} P^{S R}=F\left( P_{[t-d: t]}^{L R}, \theta \right) , \end{aligned}$$
(15)

where \(P_{[t-d: t]}^{L R}\) represents a sequence of LR data and \(\theta\) represents model parameters. In the case of \(d = 0\), the mapping between input and output becomes \(P^{S R}=F\left( P_{t}^{L R}, \theta \right)\), which is the case of general single-input-single-output, denoted as \(\text {SLR-UNet}\). Under the circumstances, the model input contains only instantaneous spatial information, not temporal information. When \(d>0\), the model input combines LR information from multiple moments, as shown in Fig. 7. Besides the case of \(d=0\), the two cases \(d=1\) and \(d=2\) are also known as our comparative experiments, denoted as \(\text {MTLR1-UNet}\) and \(\text {MTLR2-UNet}\), respectively. The loss function for these models is as follows,

$$\begin{aligned} { Loss_{MTLR} }=\left\| F\left( P_{[t-d: t]}^{L R}, \theta \right) , P^{H R}\right\| _{p}. \end{aligned}$$
(16)

It is worth noting that in order to ensure the fairness of the comparison experiments, the different models used for comparison only have different input information while with the same network structure.

Fig. 7
figure 7

Schematic diagram of the multi-scale time path network (MST-UNet)

We perform SR on the curl of the flow field. These models are evaluated on the test set, and the results of MAE and \(L_2\) error norms on two different dataset sizes are shown in Table 2. Where the MST-UNet model is the model proposed in this paper, \(\text {SLR-UNet}\), \(\text {MTLR1-UNet}\) and \(\text {MTLR2-UNet}\) correspond to the model in (15) where d is 0, 1 and 2 respectively. It can be seen from the results that the error decreases significantly as the dataset size increases. In the comparison of different models, on both datasets the spatial model containing time information (\(\text {MTLR1-UNet}\) and \(\text {MTLR2-UNet}\)) is better than the spatial model using the LR information at the current frame (\(\text {SLR-UNet}\)) only. For models with different frames of LR information as input, i.e. \(\text {MTLR1-UNet}\) and \(\text {MTLR2-UNet}\), we find that \(\text {MTLR2-UNet}\) does not perform well, compared to the \(\text {MTLR1-UNet}\) model with less input LR information. The main reason is that the LR information of multiple frames as input is redundant or invalid, which will affect the prediction accuracy to a certain extent. In the performance of MST-UNet model, it is obvious from Table 2 that the error of the MST-UNet model proposed in this paper is significantly lower than the rest of the models in each indicator. In addition, this advantage is more obvious as the size of the dataset increases. For the results of \(Dataset_{l}\), it can be seen that the model prediction errors of the three models \(\text {SLR-UNet}\), \(\text {MTLR1-UNet}\) and \(\text {MTLR2-UNet}\), have little difference. From the results, we can find that in the case of sufficient data, the increase of LR temporal information has little effect on the prediction accuracy of the model. However, the MST-UNet model that uses the HR information of the previous frame can greatly improve the prediction accuracy compared to the model which only uses the LR temporal information, since the MST-UNet model uses the HR information of the previous frame in addition to the LR information of the current frame. Generally speaking, the HR information of the previous frame can contain most of the information before the current frame, which allows us to use the information of the previous frame very efficiently compared to the \(\text {MTLR1-UNet}\) and \(\text {MTLR2-UNet}\) models.

Table 2 Results of the four models on the test set
Table 3 The MAE (\(\times 10^{-4}\text {s}^{-1}\)) of the four models in the test set on \(Dataset_{5000}\)
Fig. 8
figure 8

The MAE (\(\times 10^{-4}\text {s}^{-1}\)) of the four models in the test set on \(Dataset_{19000}\)

For the spatio-temporal flow field, since we usually pay more attention to the flow field of the last frame, the overall low average error does not reflect the prediction error of the model at a certain moment, and it is very likely that the error accumulates with time. We calculate the average error over 10 frames across all different cylinder positions in the test set, and the MAE of \(Dataset_{s}\) and \(Dataset_{l}\) are shown in Table 3 and Fig. 8, respectively.

Fig. 9
figure 9

Performance of different models in the test set on \(Dataset_{s}\)

Fig. 10
figure 10

Performance of different models in the test set on \(Dataset_{l}\)

From the results, it can be seen that the prediction error of the MST-UNet model is much lower than that of the \(\text {SLR-UNet}\), \(\text {MTLR1-UNet}\) and \(\text {MTLR2-UNet}\) models on both \(Dataset_{s}\) and \(Dataset_{l}\) except for the first frame. Furthermore, the prediction errors eventually stabilize over time and do not increase. The reason for the high prediction error of the first frame is that the HR input of the previous frame is obtained by interpolation. In order to improve the prediction accuracy of the first frame, we can also build a separate model for the first frame to improve the prediction accuracy. However, for the space-time flow field, we generally focus on the information at later moments. The errors of the last frame at one position on the test sets are shown in Figs. 9 and 10, respectively. The results show that the performance of MST-UNet is greatly improved compared to the other three models.

Fig. 11
figure 11

Performance of the TS model on the spatio-temporal super-resolution task

Fig. 12
figure 12

Performance of the ST model on the spatio-temporal super-resolution task

3.3 Spatio-temporal super-resolution task

As shown in Section 2.2, the purpose of the spatio-temporal SR task is to reconstruct HR information of a time series from the LR information at the first and last two frames. As shown in Section 2, two models can be used in the spatio-temporal SR task. One is to obtain HR temporal information through the temporal model first, and then gain HR spatial and temporal information through the spatial model, denoted as TS; the other model is in reverse order, taking the spatial model as the inbetweening model, which acquires the HR spatial information first, denoted as ST. In order to facilitate comparison, for different cylinder positions in \(Dataset_{l}\), we select the first ten frames. In our spatio-temporal SR task, we predict the HR information of eight frames through the LR information at the first and last two frames (frame 1 and frame 10).

Our model adopts the TS model. For the temporal model, we employ the UNet model in Fig. 2, which takes the LR images of the 1st and 10th frames as input and the output is a series of LR images between the two frames. Since the temporal model has output a series of LR images, on the spatial model, we adopt the MST-UNet model proposed in Section 2.1. A set of results for the TS model is shown in Fig. 11. The comparison model adopts the ST model. On the spatial model, since there are only two frames of LR images, we use the single model in Section 3.2. The temporal model in ST adopts the UNet model too, while the input and output are HR images. Figure 12 presents a set of results for the ST model.

We evaluate the results of both models on the test set, and the results of MAE and L2 error norm are shown in Table 4. The results show that the TS model we use is lower than the ST model in all indicators, which shows that the TS containing the MST-UNet model has good performance on the spatio-temporal SR task.

Table 4 Results of the two models on the test set

4 Conclusions

In this paper, we propose the MST-UNet model for SR reconstruction of spatio-temporal flow fields. In order to obtain the HR result of the current frame, the MST-UNet model combines the LR information of the current moment and the HR information of the previous moment. In addition, we apply the MST-UNet model to the task of flow field reconstruction around cylinders and verify the proposed model on two tasks: spatial SR and spatio-temporal SR. The experimental results show that the MST-UNet model is much better than the other models that only use LR information as input in the spatial SR task. In the spatio-temporal SR task, the results of the model with MST-UNet are significantly improved compared with the model that only uses low-resolution information.

Although we explore a novel model for the spatio-temporal SR reconstruction problem, many topics still exist for further investigation. Our model adopts a data-driven approach, which requires a lot of label data. We will adopt a physics-informed approach to reduce data costs in future work.