1 Introduction

Nowadays, with the development of ultra-high-definition (UHD) display technology, UHD televisions can display videos at up to 8K spatial resolution and 240 fps frame rate. However, common video capture devices can only shoot videos with a maximum spatial resolution of 2K and a maximum frame rate of 60 fps [1]. One promising way to convert existing videos for display on UHD televisions with a more satisfying visual experience is space-time video super-resolution (STVSR). STVSR aims to simultaneously increase the spatial resolution and frame rate of low-resolution (LR) low-frame-rate videos in order to reconstruct high-resolution (HR) high-frame-rate videos.

Previous STVSR methods usually adopt optical flow and motion field techniques to capture the offsets between frames, and then leverage these offsets to interpolate intermediate frames. They also tend to rely heavily on prior assumptions (e.g., the linear motion assumption between frames [2]) [3–5]. In real scenes, however, these assumptions are often violated, leading to inaccurate spatial details and motion blur in the reconstructed frames. Recently, some deep learning methods achieve STVSR by applying advanced video frame interpolation (VFI) [6–8] and video super-resolution (VSR) [9–11] models consecutively: they first interpolate the missing intermediate low-resolution frames with the VFI model and then reconstruct high-resolution frames with the VSR model. Since STVSR is achieved in two independent steps using two models, these methods are called two-stage approaches. However, they cannot fully exploit the inherent correlations between spatial and temporal information [12], resulting in severe frame inconsistency and fake artifacts. Moreover, these two-stage approaches require expensive computation and thus show poor inference efficiency [6–11].

To resolve the above issues, numerous one-stage STVSR approaches have been proposed to achieve STVSR directly in an end-to-end manner. These models are normally designed with several phases. They usually first interpolate intermediate frame features between the given LR frames. Each frame feature is then refined at the local and global levels according to its local neighbours and the overall feature sequence. Finally, the spatial resolutions of these features are increased for video reconstruction [13, 14]. Compared with two-stage approaches, one-stage approaches can effectively model the temporal correlations (object deformations and movements) between frames to obtain realistic spatial details. However, their most important phase, feature interpolation, only captures spatial-temporal information from the two adjacent frames when interpolating each intermediate frame feature, implicitly assuming that motion occurs at a constant speed between consecutive frames. In practice, object movement speeds are usually variable, e.g., for moving cars and running people, and we argue that the long-term spatial-temporal dependencies among multiple consecutive frames are also helpful for recovering the details of each intermediate frame. Hence, our motivation is to aggregate long-term temporal information from multiple frames when interpolating intermediate frame features.

In this paper, we propose a novel long-term temporal feature aggregation network (LTFA-Net) for STVSR. Specifically, LTFA-Net takes LR video frames as input and first performs feature interpolation. To handle complex motions and aggregate long-term temporal information for interpolating intermediate features, we design a long-term mixture of experts (LTMoE) module that encodes and integrates complementary information from long-term frame feature sequences into the interpolation results. LTMoE can compensate for missing spatial details and reduce motion blur in the interpolated features, since it exploits long-term correlations between multiple frames to recover spatial-temporal contexts. Then we calculate deformable feature alignment functions between each feature and its local neighbours for local refinement, and we further perform global refinement over the whole feature sequence via a bidirectional deformable ConvLSTM [13] layer. In the last phase, the refined feature maps are reconstructed into the HR high-frame-rate video.

In summary, our contribution is two-fold:

(1) We propose a novel long-term temporal feature aggregation network (LTFA-Net) for STVSR. To the best of our knowledge, LTFA-Net is the first unified model that models long-term temporal dependencies and aggregates the corresponding information to recover realistic spatial details and motion continuity in the interpolated features. This is achieved by our long-term mixture of experts (LTMoE) module.

(2) Our method shows superior performance to state-of-the-art approaches on two standard benchmarks: Adobe240fps and GoPro.

2 Related work

2.1 Video frame interpolation

The purpose of video frame interpolation (VFI) is to synthesize intermediate frames between given adjacent frames at the same spatial resolution. It is a challenging problem, especially when restoring fast non-linear object motions under illumination and occlusion changes. Early methods adopted optical flow to estimate motion cues and warped the adjacent frames towards the target intermediate frames. Due to fast motions and occlusions, the estimated optical flow may describe object movements inaccurately [15], and these approaches also suffer from huge computational costs [15]. Recent methods attempt to tackle these problems with kernel-based approaches [15, 16], which extract information from local patches of the given frames using various convolutional kernels and thus preserve a large amount of local texture and shape detail for synthesizing the target frame. For instance, Lee et al. proposed spatially adaptive convolution kernels to select and map existing pixels from the given frames to appropriate locations in the target intermediate frame [17]. Moreover, Bao et al. [18] and Xue et al. [7] estimated depth maps and occlusion masks, respectively, from the given frames to discard occluded areas and thus improve interpolation performance.

2.2 Video super resolution

Video super-resolution (VSR) aims to increase the spatial resolution of given low-resolution frames to their high-resolution versions. Previous approaches were dedicated to aggregating spatial information from neighbouring frames to reconstruct each target frame. For example, Tian et al. [10] captured spatial information from adjacent frames via deformable convolutional layers and then reconstructed high-resolution frames. Wang et al. [11] proposed a Pyramid, Cascading and Deformable (PCD) alignment module to perform feature alignment at three different scales; the aligned features are then fused for spatial super-resolution. Chan et al. [19] introduced the BasicVSR model, in which a bidirectional propagation operation exploits global information over the whole sequence for the spatial alignment of each frame feature. Haris et al. [20] iteratively projected features of neighbouring frames onto that of the reference frame for its spatial super-resolution.

2.3 Space time video super resolution

The main difference between space-time video super-resolution (STVSR) and VFI or VSR is that STVSR aims to increase the spatial and temporal resolutions of a video simultaneously. Shechtman et al. [3] first tackled STVSR by aggregating information from the input video frames and then performing directional spatial-temporal regularization for frame reconstruction. Recent approaches achieve STVSR with end-to-end deep neural networks. For example, Haris et al. [12] applied an optical flow estimation network to encode motion representations between adjacent frames and then iteratively projected adjacent features to the target intermediate frames with the help of these representations. Xiang et al. [13] used deformable feature alignment functions to interpolate intermediate features and then designed a bidirectional deformable ConvLSTM (BDConvLSTM) layer for global feature refinement. Xu et al. [14] enabled the reconstruction of multiple intermediate frames at arbitrary moments. Shi et al. [21] introduced an adapted pixel shuffle layer to reconstruct video frames at arbitrary target spatial resolutions. You et al. [22] developed a memory graph aggregation module to capture long-range dependencies over the whole feature sequence for global refinement. Cao et al. [23] designed a Fourier data transform layer and a recurrent video enhancement layer to handle motion blur and motion aliasing, respectively, in the reconstructed frames. Hu et al. [24] iteratively projected features between the low-resolution and high-resolution scales to eliminate their differences and update the corresponding representations for feature refinement. Several recent methods achieve STVSR via vision transformers [25, 26]; e.g., Geng et al. [25] used the Swin Transformer [27] as their backbone to perform information interaction between features of different frames.

However, these approaches interpolate each intermediate feature guided only by its two most adjacent frames, ignoring the spatial-temporal information in other frames that could help restore fast and variable-speed object movements, and therefore suffer from poor long-term motion continuity in the reconstructed frames. We therefore propose a novel long-term mixture of experts (LTMoE) module to model long-term spatial-temporal dependencies and capture discriminative spatial-temporal contexts from more neighbouring frames to boost STVSR performance.

2.4 Mixture of experts

Mixture of Experts (MoE) is an ensemble learning method for subtask optimization. MoE divides the problem space into several sub-tasks and handles each of them with a dedicated expert; a gating network then controls the activation of the experts and combines their outputs [28]. MoE was introduced by Jacobs et al. [29] and is currently applied in many computer vision tasks, such as semantic segmentation [30], crowd counting [31] and image super-resolution [32–34]. For instance, Liu et al. [33] tackled single image super-resolution (SISR) via MoE: they partitioned the whole image space into several subspaces and assigned a super-resolution inference module (expert) to each subspace, while a gating net generates pixel-level weight maps that are multiplied with the partitioned image to reconstruct the high-resolution version. Rasti et al. [35] applied a multi-scale MoE to retinal optical coherence tomography images for macular pathology diagnosis; each expert operates on one feature scale to provide discriminative spatial information, and a gating net combines them for image classification. For our STVSR task, given that long-term spatial-temporal features from multiple consecutive adjacent frames can help interpolate the intermediate frame, we for the first time employ multiple experts to extract discriminative information from these long-term frames and aggregate them with gating nets for intermediate frame feature interpolation.

3 Method

3.1 Overview

The overview of our LTFA-Net is illustrated in Fig. 1. LTFA-Net aims to reconstruct HR high-frame-rate video frames \(\mathcal{H}= \{ H_{t} \}_{t=1}^{7}\) from LR low-frame-rate frames \(\mathcal{L}= \{ L_{2t-1} \}_{t=1}^{4}\) via four main phases: feature interpolation, feature local refinement, feature global refinement, and frame reconstruction. Specifically, we first pass \(\mathcal{L}\) into a novel long-term mixture of experts (LTMoE) module, which aggregates long-term temporal information from the given frames for feature interpolation. LTMoE contains four experts \(\{ E_{1}, E_{2}, E_{3}, E_{4} \}\), built on ConvNext [36], to extract features of the given frames, and three gating nets \(\{ \mathcal{G}^{1}, \mathcal{G}^{2}, \mathcal{G}^{3} \}\) that assign weights to the experts for their combination. The three different expert combinations yield three intermediate features \(\{ F_{2}, F_{4}, F_{6} \}\). Note that our LTMoE can also aggregate information from more neighbouring frames; here we describe LTMoE with four supporting frames for simplicity. Then, with the features of the given frames and the interpolated features forming the sequence \(\mathcal{F}= \{ F_{t} \}_{t=1}^{7}\), we follow [14] and feed them into the Locally-temporal Feature Comparison (LFC) module, which aligns each feature with its local neighbours. We then adopt the widely used bidirectional deformable ConvLSTM layer to globally refine each feature over the whole sequence. Finally, we apply a reconstruction module, composed of ten ConvNext blocks [36] and a pixel shuffle layer [21], to increase the feature spatial resolutions and output the HR high-frame-rate video frames \(\mathcal{H}\).

Figure 1: LTFA-Net has four main phases. Given LR low-frame-rate frames \(\mathcal{L}\), we first interpolate intermediate frame features via the LTMoE module. Then we perform local feature refinement using the Locally-temporal Feature Comparison module. Next, we feed the whole feature sequence \(\mathcal{F}^{R}\) into a bidirectional deformable ConvLSTM (BDConvLSTM) for global refinement. Finally, we reconstruct HR high-frame-rate video frames \(\mathcal{H}\) with ten ConvNext blocks and a pixel shuffle layer
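To make the data flow across the four phases concrete, the following minimal Python sketch composes the modules described above. The function and module names are placeholders for illustration and are not the released implementation.

```python
def ltfa_forward(lr_frames, ltmoe, lfc, bdconvlstm, reconstruct):
    """Sketch of the four-phase data flow; `ltmoe`, `lfc`, `bdconvlstm` and
    `reconstruct` are placeholder callables standing in for the modules below."""
    # Phase 1: extract features of the 4 given LR frames and interpolate 3 intermediate ones
    given_feats, inter_feats = ltmoe(lr_frames)               # {F1, F3, F5, F7}, {F2, F4, F6}
    feats = [given_feats[0], inter_feats[0], given_feats[1], inter_feats[1],
             given_feats[2], inter_feats[2], given_feats[3]]  # F1 .. F7
    # Phase 2: local refinement with a sliding window of size 3 (LFC module)
    refined = lfc(feats)
    # Phase 3: global refinement over the whole sequence (BDConvLSTM)
    refined = bdconvlstm(refined)
    # Phase 4: HR reconstruction (ConvNext blocks + pixel shuffle)
    return [reconstruct(f) for f in refined]                  # H1 .. H7
```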

3.2 Long-term mixture of experts for feature interpolation

All existing STVSR approaches explicitly or implicitly contain a feature interpolation phase that recovers the features of intermediate frames, but they only capture information from the two most adjacent frames when interpolating each intermediate frame feature, failing to maintain long-term motion continuity and to handle variable-speed object movements across the reconstructed frame sequence. We therefore aggregate long-term temporal information from multiple consecutive adjacent frames for interpolation. Considering that different adjacent frames contribute differently to the interpolation results, indiscriminately concatenating them for feature extraction may lead to suboptimal reconstruction performance. Instead, we design the long-term mixture of experts (LTMoE) module to aggregate long-term spatial-temporal information from the given frames with different weights. Our LTMoE is adapted from the popular mixture of experts (MoE) [29]. The original MoE leverages several experts \(\mathcal{E}= \{ E_{i} \}_{i =1}^{n}\) to handle different subtasks of a complex task independently and utilizes a gating net \(\mathcal{G}\) to assign weights to them for their fusion. The output is \(E_{\mathrm{out}} = \sum_{i =1}^{n} E_{i} G_{i}\), where \(G_{i}\) denotes the weight that \(\mathcal{G}\) assigns to the i-th expert \(E_{i}\) and n is the total number of experts.

The structure of LTMoE is shown in Fig. 1. LTMoE comprises four shareable experts \(\{ E_{1}, E_{2}, E_{3}, E_{4} \}\) that extract discriminative features \(\{ F_{1}, F_{3}, F_{5}, F_{7} \}\) from the given frames, respectively. Each expert contains two 2D convolutional layers and a ConvNext [36] model. Since these features carry mutual and complementary information, we design three gating nets \(\{ \mathcal{G}^{1}, \mathcal{G}^{2}, \mathcal{G}^{3} \}\) to combine them with different weights and obtain the intermediate features \(\{ F_{2}, F_{4}, F_{6} \}\). For instance, \(F_{2} = \sum_{t =1}^{4} G_{t}^{1} \times E_{t} ( L_{2 t -1} )\), where \(G_{t}^{1}\) is the weight that gating net \(\mathcal{G}^{1}\) assigns to control the contribution of expert \(E_{t}\). The gating nets take the concatenation of \(\mathcal{L}\) as input and share the same network structure and weights: a 2D convolutional layer for feature extraction, a 2D global average pooling layer to collapse the spatial dimensions, and a fully connected layer with a Softmax activation for weight calculation and normalization. Only the concatenation orders of their input frames differ. In this way, the gating nets automatically control the activation degree of each expert for passing its spatial-temporal information into the interpolated features. Our LTMoE thus implicitly exploits inter-frame spatial-temporal correlations to recover long-term motion continuity across multiple frames.
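A minimal PyTorch sketch of this interpolation scheme is given below. It illustrates the gating mechanism only: the expert bodies are plain convolutions rather than the two-conv-plus-ConvNext experts described above, the channel sizes are arbitrary, and the concatenation orders fed to the shared gating net are hypothetical.

```python
import torch
import torch.nn as nn


class GatingNet(nn.Module):
    """Gating net sketch: conv -> global average pooling -> FC -> softmax over the experts."""

    def __init__(self, in_ch, num_experts=4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 64, 3, padding=1)
        self.fc = nn.Linear(64, num_experts)

    def forward(self, frames_cat):                     # (B, 4*C, H, W), concat order differs per gate
        x = self.conv(frames_cat).mean(dim=(2, 3))     # collapse spatial dimensions
        return torch.softmax(self.fc(x), dim=1)        # (B, 4) expert weights


class LTMoE(nn.Module):
    """LTMoE sketch: four experts and one shared gating net reused with three concat orders."""

    def __init__(self, in_ch=3, feat_ch=64, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(feat_ch, feat_ch, 3, padding=1))   # ConvNext blocks omitted
            for _ in range(num_experts))
        self.gate = GatingNet(num_experts * in_ch, num_experts)        # shared structure and weights

    def forward(self, frames):                          # list of 4 LR frames L1, L3, L5, L7
        feats = [E(L) for E, L in zip(self.experts, frames)]           # F1, F3, F5, F7
        inter = []
        for order in ([0, 1, 2, 3], [1, 2, 0, 3], [2, 3, 0, 1]):       # hypothetical concat orders
            w = self.gate(torch.cat([frames[i] for i in order], dim=1))            # (B, 4)
            inter.append(sum(w[:, t, None, None, None] * feats[t] for t in range(4)))  # F2, F4, F6
        return feats, inter
```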

3.3 Local and global feature refinement

Local Feature Refinement. In this phase, we refine each feature map using its local neighbours to maintain local motion consistency. To this end, we inherit the Locally-temporal Feature Comparison (LFC) module from [14]. Figure 2 illustrates how LFC refines \(F_{t}\) with the help of \(F_{t-1}\) and \(F_{t+1}\). Specifically, LFC first concatenates \(F_{t}\) with \(F_{t-1}\) and \(F_{t+1}\), respectively, and feeds the results into two convolutional layers to learn the forward and backward motion offsets \(o_{t-1 \rightarrow t}\) and \(o_{t+1 \rightarrow t}\). These learned offsets describe motion cues, including the movement directions and displacements of objects, and can thus be used to align the adjacent features \(F_{t -1}, F_{t +1}\) towards the current frame using a deformable convolutional layer [37]. The deformable convolutional layer adds learnable offsets to the regular grid sampling locations of the standard convolution operation, which makes the sampling pattern of the convolutional kernels deformable and thus encodes object features more comprehensively. We follow [13, 14] and use it for feature spatial alignment. After that, the aligned feature maps are concatenated with \(F_{t}\) and fed into four convolutional layers and five LReLU activation layers for feature fusion.

Figure 2: The Locally-temporal Feature Comparison (LFC) module refines the intermediate feature \(F_{t}\) using its adjacent frame features \(F_{t-1}\) and \(F_{t+1}\)

The LFC module performs local refinement on \(\mathcal{F}= \{ F_{t} \}_{t =1}^{7}\) with a sliding window of size 3. For the first (last) feature in this sequence, its previous (next) neighbour is itself. Finally, we have the refined features \(\mathcal{F}^{R} = \{ F_{t}^{R} \}_{t=1}^{7}\).
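The offset-prediction-plus-deformable-alignment step can be illustrated with torchvision's `DeformConv2d` as below; the channel width, kernel size and fusion layers are illustrative assumptions rather than the exact configuration of [14].

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class LFCAlign(nn.Module):
    """Sketch of deformable alignment in the LFC module: predict offsets from the
    concatenated feature pair, then warp the neighbour towards F_t with a
    deformable convolution [37]."""

    def __init__(self, ch=64, k=3):
        super().__init__()
        self.offset = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(ch, 2 * k * k, 3, padding=1))       # 2 offset values per kernel sample
        self.dcn = DeformConv2d(ch, ch, k, padding=k // 2)

    def align(self, f_nb, f_t):
        o = self.offset(torch.cat([f_nb, f_t], dim=1))    # e.g. o_{t-1 -> t}
        return self.dcn(f_nb, o)                          # neighbour aligned towards F_t

    def forward(self, f_prev, f_t, f_next):
        a_prev = self.align(f_prev, f_t)
        a_next = self.align(f_next, f_t)
        return torch.cat([a_prev, f_t, a_next], dim=1)    # fused afterwards by conv + LReLU layers
```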

Global Feature Refinement. This phase refines each frame feature using information from the whole sequence; it models the motions over the sequence and thus addresses global motion inconsistency. To pass and aggregate global temporal information, we take inspiration from [13] and feed \(\mathcal{F}^{R}\) into a bidirectional deformable ConvLSTM (BDConvLSTM) layer for global refinement. The backbone of the BDConvLSTM layer is the ConvLSTM [38] layer, which captures motion information within small receptive fields but fails to handle large and fast motions over the whole feature sequence. Therefore, [13] augments the original ConvLSTM layer with additional deformable convolutional layers to better encode the temporal correlations between frames. The memory cells and the four gates enable the BDConvLSTM layer to extract long-term temporal contexts and exchange information in both the forward and backward directions, so motion artifacts and blur can be reduced by exploiting global temporal contexts. Moreover, it can model long-term dependencies within sequences containing arbitrary numbers of features. We denote the output of this layer by \(\mathcal{F}^{G} = \{ F_{t}^{G} \}_{t=1}^{7}\).
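A simplified sketch of the bidirectional recurrence is given below. It uses a vanilla ConvLSTM cell [38] and omits the deformable alignment that [13] adds between time steps; `fuse` is a hypothetical 1×1 convolution that merges the forward and backward hidden states.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell [38]; the deformable alignment of [13] is omitted for brevity."""

    def __init__(self, channels, hidden, k=3):
        super().__init__()
        self.hidden = hidden
        self.gates = nn.Conv2d(channels + hidden, 4 * hidden, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)   # update memory cell
        h = torch.sigmoid(o) * torch.tanh(c)                          # new hidden state
        return h, c


def bidirectional_refine(cell_fwd, cell_bwd, fuse, feats):
    """Run a forward and a backward ConvLSTM pass over the feature sequence and
    fuse the two hidden states per time step (fuse: e.g. nn.Conv2d(2*hid, hid, 1))."""
    B, _, H, W = feats[0].shape
    h = feats[0].new_zeros(B, cell_fwd.hidden, H, W)
    c = h.clone()
    fwd = []
    for f in feats:                       # forward scan
        h, c = cell_fwd(f, h, c)
        fwd.append(h)
    h = feats[0].new_zeros(B, cell_bwd.hidden, H, W)
    c = h.clone()
    bwd = []
    for f in reversed(feats):             # backward scan
        h, c = cell_bwd(f, h, c)
        bwd.append(h)
    bwd = bwd[::-1]
    return [fuse(torch.cat([hf, hb], dim=1)) for hf, hb in zip(fwd, bwd)]
```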

3.4 High-resolution frame reconstruction

In the last phase of LTFA-Net, we reconstruct the HR video frames. Specifically, we first feed the features \(\mathcal{F}^{G}\) into ten ConvNext blocks [36] for spatial refinement. Then we follow [21] and project the outputs to the HR frames \(\mathcal{H}= \{ H_{t} \}_{t =1}^{7}\) using a pixel shuffle layer.
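A sketch of this reconstruction head, assuming a 4× upscaling factor, is given below; plain residual convolution blocks stand in for the ten ConvNext blocks [36], and the channel width is arbitrary.

```python
import torch
import torch.nn as nn


class Reconstructor(nn.Module):
    """Reconstruction head sketch: residual refinement blocks followed by
    pixel-shuffle upsampling to the HR resolution."""

    def __init__(self, ch=64, out_ch=3, n_blocks=10, scale=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.GELU(),
                          nn.Conv2d(ch, ch, 3, padding=1))
            for _ in range(n_blocks))
        self.up = nn.Sequential(nn.Conv2d(ch, out_ch * scale ** 2, 3, padding=1),
                                nn.PixelShuffle(scale))

    def forward(self, f):
        for block in self.blocks:
            f = f + block(f)          # residual spatial refinement
        return self.up(f)             # (B, out_ch, 4H, 4W)
```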

For model optimization, our LTFA-Net is optimized by the following reconstruction loss in an end-to-end manner:

$$\begin{aligned} l_{\mathrm{rec}} = \sqrt{ \bigl\Vert H_{t} - H_{t}^{gt} \bigr\Vert ^{2} + \epsilon ^{2}}, \end{aligned}$$

where \(H_{t}^{gt}\) represents the ground truth of the t-th frame and ϵ is a small penalty constant. As suggested in [13, 14], we empirically set \(\epsilon = 10^{-3}\).
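A direct implementation of this loss is shown below; the mean reduction over pixels and frames is our choice, which the text does not specify.

```python
import torch


def reconstruction_loss(pred, gt, eps=1e-3):
    """Charbonnier-style reconstruction loss from the equation above.
    `pred` and `gt` hold the reconstructed and ground-truth HR frames;
    averaging over pixels and frames is an assumption."""
    return torch.sqrt((pred - gt) ** 2 + eps ** 2).mean()
```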

4 Experiments

4.1 Datasets

We select the widely used Vimeo-90K septuplet dataset [7] as our training dataset. This dataset contains 91,701 video clips, each consisting of 7 consecutive frames with a fixed spatial resolution of \(448\times 256\). We follow [13, 14] and transform the odd-index frames to their low-resolution (LR) versions using bicubic interpolation. These LR frames have a spatial resolution of \(112\times 64\) and serve as the training inputs for reconstructing the original HR frames.

For model evaluation, we follow [39] and test the model on the Adobe240fps [40] and GoPro [41] datasets. The Adobe240fps dataset contains 71 videos with a resolution of \(1280\times 720\) and a frame rate of 30 fps. The GoPro dataset is composed of 3214 videos at \(1280\times 720\) resolution and a 240 fps frame rate. The number of frames per video is not fixed in these two datasets. Similar to the training setup, we downsample the odd-index frames by a factor of 4 to obtain the LR low-frame-rate videos, which are then input to the proposed LTFA-Net for reconstructing the original videos.
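The construction of LR low-frame-rate inputs from an HR clip can be sketched as follows; the tensor layout and the absence of anti-aliasing are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def make_lr_inputs(hr_frames, scale=4):
    """Keep the odd-index frames (1-based) of a (T, C, H, W) HR clip and
    bicubically downsample them by `scale` to form the LR low-frame-rate input."""
    odd = hr_frames[0::2]                       # frames 1, 3, 5, ... in 1-based indexing
    return F.interpolate(odd, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)
```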

4.2 Implementation details and evaluation protocols

Implementation Details. LTFA-Net is optimized with the SGD [42] optimizer with an initial learning rate of \(2\times 10^{-4}\). Following [14], we apply a cosine annealing scheduler that gradually decays the learning rate to \(1\times 10^{-7}\) every 150,000 iterations. We set the number of training iterations to 600,000 and the batch size to 6. Moreover, the training set is augmented by image rotation (90°, 180°, and 270°) as well as horizontal and vertical flips [14]. We performed our experiments on an NVIDIA RTX 2070 Super GPU.
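The optimization schedule above can be summarized in the following sketch; the SGD momentum value and the data-loading interface are assumptions, and `reconstruction_loss` refers to the Charbonnier loss sketched in Sect. 3.4.

```python
import torch


def train_ltfa(model, loader, reconstruction_loss, max_iters=600_000):
    """Training loop sketch: SGD at 2e-4 decayed by cosine annealing to 1e-7
    every 150,000 iterations. `loader` is assumed to yield (LR clip, HR target) pairs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=2e-4, momentum=0.9)  # momentum is an assumption
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=150_000, eta_min=1e-7)
    for it, (lr_clip, hr_clip) in enumerate(loader):
        optimizer.zero_grad()
        loss = reconstruction_loss(model(lr_clip), hr_clip)
        loss.backward()
        optimizer.step()
        scheduler.step()
        if it + 1 >= max_iters:
            break
```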

Evaluation Protocols. We follow [2, 13, 14] and adopt the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) metrics [43] as our evaluation protocols. We also compare the number of model parameters and the inference speed with state-of-the-art approaches. Specifically, PSNR measures the reconstruction quality between the reference frame x and the reconstructed frame y. It is defined as:

$$\begin{aligned} \mathrm{PSNR}(x,y)=20 \log _{10} \biggl( \frac{255}{\sqrt{\mathrm{MSE}(x, y)}} \biggr), \end{aligned}$$

where MSE denotes the mean squared error. SSIM [43] aims to measure the human visual perception similarity between images. It is defined as:

$$\begin{aligned} \mathrm{SSIM} ( x,y ) = \frac{2 x_{\mathrm{norm}} y_{\mathrm{norm}} + C_{1}}{x_{\mathrm{norm}}^{2} + y_{\mathrm{norm}}^{2} + C_{1}} \times \frac{2\operatorname{cov}(x,y)+ C_{2}}{x_{\mathrm{std}}^{2} + y_{\mathrm{std}}^{2} + C_{2}}. \end{aligned}$$

The first term of this formula compares the mean luminance of the two images, where \(x_{\mathrm{norm}}\) and \(y_{\mathrm{norm}}\) denote their mean intensity values. The second term measures the contrast and structure similarity of the two images, where cov represents the covariance of the two images and \(x_{\mathrm{std}}\) is the standard deviation of image x. \(C_{1}\) and \(C_{2}\) are constants empirically derived from 0.01 and 0.03, respectively [43]. Larger PSNR and SSIM values indicate better reconstruction performance.
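For reference, the two metrics can be computed as below. The SSIM here is a single-window (global) simplification of the formula above; standard implementations [43] average it over local windows, and the constants follow the usual \(C_{i} = (k_{i}\cdot 255)^{2}\) convention, which is our reading of the 0.01 and 0.03 values.

```python
import torch


def psnr(x, y):
    """PSNR between 8-bit frames stored as float tensors in [0, 255]."""
    mse = torch.mean((x - y) ** 2)
    return 20 * torch.log10(255.0 / torch.sqrt(mse))


def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM following the formula above
    (luminance term x contrast/structure term)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(unbiased=False), y.var(unbiased=False)
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)) * \
           ((2 * cov + c2) / (vx + vy + c2))
```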

4.3 Comparison to state of the art

The state-of-the-art STVSR approaches can be divided into two categories: two-stage and one-stage models. For the two-stage approaches, we combine advanced VFI methods (SuperSloMo [44], QVI [8] and DAIN [18]) with VSR methods (Bicubic, EDVR [11] and BasicVSR [19]). For the one-stage models, we select Zooming SlowMo [13], STARnet [12], VideoINR [39], RSTT-L [25] and TMNet [14] as the main comparison approaches.

The quantitative comparison results are shown in Table 1. First, we observe that one-stage models clearly outperform two-stage models; e.g., our LTFA-Net shows PSNR advantages of 1.69 dB and 2.16 dB over the best two-stage approach, DAIN+BasicVSR, on the two datasets, respectively. These results confirm the superiority of one-stage approaches in exploiting the intrinsic spatial-temporal correlations among different frames for reconstructing realistic details. Moreover, LTFA-Net also outperforms the other one-stage approaches: it surpasses the best competing one-stage approach, Zooming SlowMo, by 1.02 dB and 1.22 dB in PSNR on the two datasets. LTFA-Net has 12.41 million parameters, and its inference speed for reconstructing a single HR frame is 14.53 FPS, which is comparable to that of recent approaches. These results indicate the advantage of aggregating long-term temporal features in LTFA-Net for STVSR.

Table 1 Performance comparison between our LTFA-Net and state-of-the-art two-stage and one-stage approaches. We report the PSNR, SSIM, model size and inference speed of different approaches on the Adobe240fps and GoPro datasets. The best result is highlighted in bold

Qualitative results are shown in Fig. 3. In the first row, we display the reconstruction of 7 consecutive frames by our LTFA-Net; the car moves smoothly and the wheel structure is well preserved across the frames. In the second row, we display the reconstructed second frame of the video sequence “GOPRO384-11-00” from the GoPro dataset. The one-stage approaches produce reconstructions with fewer blurred pixels and artifacts than the two-stage approaches. In particular, our LTFA-Net reconstructs the cloth collar with clearer edges and less motion blur than TMNet and Zooming SlowMo. Overall, the quantitative and qualitative results demonstrate the superiority of LTFA-Net for STVSR.

Figure 3: Qualitative results of different STVSR approaches. Our LTFA-Net outperforms other methods with more visually appealing details

4.4 Ablation study

In this section, we examine the key LTMoE module of our LTFA-Net in detail. We explore the importance of LTMoE and evaluate the performance of different LTMoE variants: 1) LTMoE variants using different expert aggregation mechanisms; 2) an LTMoE variant whose experts are designed with a different structure. LTFA-Net with each LTMoE variant is trained on the Vimeo-90K dataset and tested on the Adobe240fps and GoPro datasets. We report the corresponding PSNR and SSIM values in Table 2.

Table 2 Ablation study on the proposed LTMoE

(1) LTMoE with different expert aggregation mechanisms. Our LTMoE module leverages gating nets to assign different weights to different experts, controlling their activations to achieve feature interpolation. To test the effectiveness of these gating nets, we remove them from LTMoE and obtain the interpolated features by (i) averaging the expert outputs directly, (ii) concatenating the expert outputs and applying a \(1\times 1\times 1\) 3D convolutional layer for dimensionality reduction, or (iii) assigning weights to the experts using a temporal closeness function following [1], which measures the temporal distance between the target feature and the given features and thus assigns larger/smaller weights to nearer/farther experts. These three variants are termed LTMoE (gating→avg), LTMoE (gating→con) and LTMoE (gating→tcf); a minimal code sketch of these gating-free variants is given after this ablation study. We observe that LTMoE (gating→avg) decreases PSNR by 0.34 dB on the Adobe240fps dataset compared with the original LTFA-Net. LTMoE (gating→con) and LTMoE (gating→tcf) only reach 28.29 dB and 28.46 dB PSNR on the GoPro dataset, which is significantly worse than LTFA-Net. These results indicate that each expert carries both useful spatial-temporal information and irrelevant noise, and that different experts should contribute differently to the interpolation results. The gating nets learn to assign weights that enhance the meaningful information and suppress the irrelevant noise, thereby improving STVSR performance.

(2) LTMoE with different expert structures. We further investigate the effect of the expert structure in our LTMoE module. Each expert is currently designed with a ConvNext [36] model for feature extraction, and it has been reported that deformable convolutional layers (DCN) can help aggregate temporal information during interpolation [14]. Therefore, we add an additional DCN to each expert to model the temporal correlations between adjacent frames. We term this variant LTMoE w/ DCN. From Table 2 we observe that LTMoE w/ DCN decreases PSNR by 0.51 dB and 0.07 dB on the two datasets. These results indicate that our LTMoE already aggregates information from the overall sequence effectively, and that including an additional DCN may lead to overfitting.
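For illustration, the three gating-free aggregation variants in (1) can be sketched as follows; the temporal-distance values, the softmax form of the closeness weighting, and the stacking axis used for the 3D convolution are our assumptions, not details given above.

```python
import torch
import torch.nn as nn


def aggregate_avg(feats):
    """LTMoE (gating->avg): average the four expert outputs directly."""
    return torch.stack(feats, dim=0).mean(dim=0)


def aggregate_tcf(feats, temporal_dists=(1.0, 1.0, 3.0, 5.0)):
    """LTMoE (gating->tcf): weight experts by temporal closeness to the target frame.
    The distance values and the softmax form are illustrative assumptions about [1]."""
    w = torch.softmax(-torch.tensor(temporal_dists), dim=0)
    return sum(w[i] * f for i, f in enumerate(feats))


class AggregateConcat(nn.Module):
    """LTMoE (gating->con): stack expert outputs along an extra axis and reduce
    them with a 1x1x1 3D convolution (the stacking axis is an assumption)."""

    def __init__(self, num_experts=4):
        super().__init__()
        self.reduce = nn.Conv3d(num_experts, 1, kernel_size=1)

    def forward(self, feats):                  # each feat: (B, C, H, W)
        x = torch.stack(feats, dim=1)          # (B, 4, C, H, W): experts as 3D-conv channels
        return self.reduce(x).squeeze(1)       # (B, C, H, W)
```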

5 Conclusion

In this work, we proposed a novel long-term temporal feature aggregation network (LTFA-Net) for STVSR. Our main contribution is to aggregate long-term temporal information from multiple neighbouring frames for feature interpolation. This is achieved by our long-term mixture of experts (LTMoE) module, which is designed with several experts for feature extraction and gating nets that combine the expert outputs into the interpolation results. The interpolated features are then fed into the Locally-temporal Feature Comparison module and a bidirectional deformable ConvLSTM layer for local and global refinement. Experimental results on two standard benchmarks demonstrate the superiority of LTFA-Net over state-of-the-art approaches. The limitations of this work are two-fold: similar to [12, 13], LTFA-Net cannot interpolate frames at arbitrary intermediate time stamps between the given frames but only at the middle time stamp; also, its loss function does not integrate a motion factor into the network optimization process. We will dedicate ourselves to solving these problems in future work.