Abstract
Space-time video super-resolution (STVSR) aims to reconstruct high-resolution high-frame-rate videos from their low-resolution low-frame-rate counterparts. Recent approaches utilize end-to-end deep learning models to achieve STVSR. They first interpolate intermediate frame features between given frames, then perform local and global refinement among the feature sequence, and finally increase the spatial resolutions of these features. However, in the most important feature interpolation phase, they only capture spatial-temporal information from the most adjacent frame features and ignore the long-term spatial-temporal correlations between multiple neighbouring frames, which are needed to restore variable-speed object movements and maintain long-term motion continuity. In this paper, we propose a novel long-term temporal feature aggregation network (LTFA-Net) for STVSR. Specifically, we design a long-term mixture of experts (LTMoE) module for feature interpolation. LTMoE contains multiple experts to extract mutual and complementary spatial-temporal information from multiple consecutive adjacent frame features, which are then combined with different weights by several gating nets to obtain the interpolation results. Next, we perform local and global feature refinement using the Locally-temporal Feature Comparison (LFC) module and a bidirectional deformable ConvLSTM layer, respectively. Experimental results on two standard benchmarks, Adobe240fps and GoPro, indicate the effectiveness and superiority of our approach over the state of the art.
1 Introduction
Nowadays, with the development of ultra-high-definition (UHD) video display technology, UHD televisions are able to display videos at up to 8K spatial resolution and a 240 fps frame rate. However, common video capturing devices can only shoot videos with a maximum spatial resolution of 2K and a maximum frame rate of 60 fps [1]. One promising solution for converting existing videos so that they can be displayed on UHD televisions with a more satisfying visual experience is space-time video super-resolution (STVSR). STVSR aims to simultaneously increase the spatial resolution and frame rate of low-resolution (LR) low-frame-rate videos in order to reconstruct high-resolution (HR) high-frame-rate videos.
Previous STVSR methods usually adopt optical and motion field techniques to capture the offsets between frames, and then leverage these offsets to interpolate intermediate frames. They also tend to rely heavily on prior assumptions (e.g., the linear motion assumption between frames [2]) [3–5]. In real scenes, however, these assumptions are usually not satisfied, which leads to inaccurate spatial details and motion blur in the reconstructed frames. Recently, some deep learning methods achieve STVSR by applying advanced video frame interpolation (VFI) [6–8] and video super resolution (VSR) [9–11] models consecutively. They first interpolate the missing intermediate low-resolution frames via the VFI model and then reconstruct high-resolution frames using the VSR model. Since STVSR is achieved in two independent steps using two models, these methods are called two-stage approaches. However, these methods cannot fully exploit the inherent correlations between spatial and temporal information [12], resulting in severe frame inconsistency and artifacts. Moreover, these two-stage approaches are computationally expensive and thus show poor inference efficiency [6–11].
To resolve the above issues, numerous one-stage STVSR approaches have been proposed to achieve STVSR directly in an end-to-end manner. These models are normally designed with several phases. They usually interpolate intermediate frame features between the given LR frames in the first phase. Then each frame feature is refined at the local and global levels according to its local neighbours and the overall feature sequence. Finally, they increase the spatial resolutions of these features for video reconstruction [13, 14]. In comparison to the two-stage approaches, the one-stage approaches are able to effectively model the temporal correlations (object deformations and object movements) between frames to obtain realistic spatial details. However, their most important phase, the feature interpolation phase, only captures the spatial-temporal information from two adjacent frames for interpolating each intermediate frame feature. They implicitly assume that motions occur at a constant speed between consecutive frames. In practice, object movement speeds usually vary, e.g., for moving cars and running people. We therefore argue that the long-term spatial-temporal dependencies between multiple consecutive adjacent frames are also helpful to recover the details of each intermediate frame. Hence, our motivation is to aggregate long-term temporal information from multiple frames for interpolating intermediate frame features.
In this paper, we propose a novel long-term temporal feature aggregation network (LTFA-Net) for STVSR. Specifically, LTFA-Net takes LR video frames as input and first performs feature interpolation. To handle complex motions and aggregate long-term temporal information for interpolating intermediate features, we design a long-term mixture of experts (LTMoE) module to encode and integrate complementary information from long-term frame feature sequences into the interpolation results. LTMoE can compensate spatial details and reduce blurred motions in the interpolated features since it exploits long-term correlations between multiple frames for recovering spatial-temporal contexts. Then we calculate the deformable feature alignment functions between each feature and its local neighbours for its local refinement, and we further perform global refinement over the whole feature sequence via a bidirectional deformable ConvLSTM [13] layer. In the last phase, the refined feature maps are reconstructed into the HR high-frame-rate video.
In summary, our contribution is two-fold:
(1) We propose a novel long-term temporal feature aggregation network (LTFA-Net) for STVSR. To the best of our knowledge, our LTFA-Net is the first unified model which achieves long-term temporal dependencies modelling and information aggregation for recovering realistic spatial details and motion continuity within interpolated features. This is achieved by our long-term mixture of experts (LTMoE) module.
(2) Our method shows superior performance to state-of-the-art approaches on two standard benchmarks: Adobe240fps and GoPro.
2 Related work
2.1 Video frame interpolation
The purpose of video frame interpolation (VFI) is to synthesize intermediate frames between given adjacent frames at the same spatial resolution. It is a challenging problem, especially when fast non-linear object motions have to be restored under illumination and occlusion changes. Early methods adopted optical flows to estimate motion cues and warped the adjacent frames to the target intermediate frames. Due to fast motions and occlusions, the estimated optical flows may describe object movements inaccurately [15]. Also, these approaches suffer from huge computational costs [15]. Recent methods attempted to tackle these problems by applying kernel-based methods [15, 16]. The kernel-based methods extract information from local patches of the given frames using various kinds of convolutional kernels, which preserves much of the local texture and shape detail needed to synthesize the target frame. For instance, Lee et al. proposed spatially adaptive convolution kernels to select and map existing pixels from given frames to appropriate locations in the target intermediate frame [17]. Moreover, Bao et al. [18] and Xue et al. [7], respectively, estimated depth maps and occlusion masks from given frames to discard occluded areas, which helps to improve the interpolation performance.
2.2 Video super resolution
Video super resolution (VSR) aims to increase the spatial resolution of given low-resolution frames to obtain their high-resolution versions. Previous approaches are dedicated to aggregating the spatial information from neighbouring frames for reconstructing each target frame. For example, Tian et al. [10] captured spatial information from adjacent frames via the deformable convolutional layer and then reconstructed high-resolution frames. Wang et al. [11] proposed a Pyramid, Cascading and Deformable (PCD) module to perform spatial feature alignment at three different feature scales. The aligned features are fused for spatial super-resolution. Chan et al. [19] introduced the BasicVSR model, in which they designed a bidirectional propagation operation to exploit global information from the overall sequence for the spatial alignment of each frame feature. Haris et al. [20] iteratively projected the features of neighbouring frames to that of the reference frame for its spatial super-resolution.
2.3 Space time video super resolution
The main difference between space-time video super-resolution (STVSR) and VFI or VSR is that STVSR aims to increase the video spatial and temporal resolutions simultaneously. Shechtman et al. [3] first tackled STVSR by aggregating information from the input video frames and then performing directional spatial-temporal regularization for frame reconstruction. Recent approaches achieve STVSR with end-to-end deep neural networks. For example, Haris et al. [12] applied an optical flow estimation network to encode the motion representations between adjacent frames, and then iteratively projected adjacent features to the target intermediate frames with the help of these representations. Xiang et al. [13] used deformable feature alignment functions to interpolate intermediate features and then designed the bidirectional deformable ConvLSTM (BDConvLSTM) layer for global feature refinement. Xu et al. [14] enabled the reconstruction of multiple intermediate frames at arbitrary moments. Shi et al. [21] introduced an adapted pixel shuffle layer to reconstruct video frames at arbitrary target spatial resolutions. You et al. [22] developed a memory graph aggregation module to capture long-range dependencies over the whole feature sequence for global refinement. Cao et al. [23] designed the Fourier data transform layer and the recurrent video enhancement layer to handle, respectively, the motion blur and motion aliasing within the reconstructed frames. Hu et al. [24] iteratively projected features at the low-resolution and high-resolution scales to eliminate the differences and update the corresponding representations for feature refinement. Several recent methods achieved STVSR via vision transformers [25, 26], e.g., Geng et al. [25] used the Swin Transformer [27] as their model backbone to perform information interaction between features of different frames.
However, these approaches interpolate each intermediate feature only guided by its two most adjacent frames, ignoring the spatial-temporal information of other frames for restoring fast and variable-speed object movements, leading to their poor long-term motion continuity within reconstructed frames. Therefore, we propose a novel long-term mixture of experts (LTMoE) module to model long-term spatial-temporal dependencies and capture discriminative spatial-temporal contexts from more neighbouring frames to boost STVSR performance.
2.4 Mixture of experts
Mixture of Experts (MoE) is an ensemble learning method for subtask optimization. MoE divides the problem space into several sub-tasks and handles each of them with a separate expert. A gating network then controls the activation of the experts for their combination [28]. MoE was introduced by Jacobs et al. [29] and is currently applied in many computer vision tasks, such as semantic segmentation [30], crowd counting [31] and image super-resolution [32–34]. For instance, Liu et al. [33] tackled single image super-resolution (SISR) via MoE. They partitioned the whole image space into several subspaces and assigned a super-resolution inference module (expert) to each subspace. A gating net generates pixel-level weight maps and multiplies them with the partitioned image to reconstruct the high-resolution version. Rasti et al. [35] applied multi-scale MoE to retinal optical coherence tomography images for the diagnosis of macular pathologies. Each expert operates on the feature of one scale to provide discriminative spatial information, and a gating net then combines them for image classification. For our STVSR task, given that the long-term spatial-temporal features from multiple consecutive adjacent frames can help interpolate the intermediate frame, we employ, for the first time, multiple experts to extract discriminative information from these long-term frames and aggregate them using gating nets for the interpolation of intermediate frame features.
3 Method
3.1 Overview
The overview of our LTFA-Net is illustrated in Fig. 1. LTFA-Net aims to reconstruct HR high-frame-rate video frames \(\mathcal{H}= \{ H_{t} \}_{t=1}^{7}\) from LR low-frame-rate frames \(\mathcal{L}= \{ L_{2t-1} \}_{t=1}^{4}\) via four main phases: feature interpolation, feature local refinement, feature global refinement, and frame reconstruction. Specifically, we first pass \(\mathcal{L}\) into a novel long-term mixture of experts (LTMoE) module to aggregate long-term temporal information from the given frames for feature interpolation. LTMoE contains four experts \(\{ E_{1}, E_{2}, E_{3}, E_{4} \}\), designed using ConvNext [36], for feature extraction from the given frames, and three gating nets \(\{ \mathcal{G}^{1}, \mathcal{G}^{2}, \mathcal{G}^{3} \}\) to assign hard weights to the experts for their combinations. From three different combinations of experts, we obtain three intermediate features \(\{ F_{2}, F_{4}, F_{6} \}\) accordingly. Note that our LTMoE can also support aggregating information from more neighbouring frames, but here we introduce LTMoE using four supporting frames for the sake of simplicity. Then, with the features of the given frames and the interpolated features forming the sequence \(\mathcal{F}= \{ F_{t} \}_{t=1}^{7}\), we follow [14] and feed them into the Locally-temporal Feature Comparison (LFC) module to align each feature using its local neighbours. Then we adopt the widely used bidirectional deformable ConvLSTM layer to globally refine each feature over the whole sequence. Finally, we apply a reconstruction module, which consists of ten ConvNext blocks [36] and a Pixel shuffle layer [21], to increase the feature spatial resolutions and output the HR high-frame-rate video frames \(\mathcal{H}\).
Fig. 1 LTFA-Net has four main phases. Given LR low-frame-rate frames \(\mathcal{L}\), we first interpolate intermediate frame features via the LTMoE module. Then we perform local feature refinement using the Locally-temporal Feature Comparison module. Next, we feed the whole feature sequence \(\mathcal{F}^{R}\) into a bidirectional deformable ConvLSTM (BDConvLSTM) for global refinement. Finally, we reconstruct HR high-frame-rate video frames \(\mathcal{H}\) by ten ConvNext blocks and a Pixel shuffle layer.
3.2 Long-term mixture of experts for feature interpolation
All existing STVSR approaches explicitly or implicitly contain a feature interpolation phase that aims to recover the features of intermediate frames, but they only capture information from the two most adjacent frames for interpolating each intermediate frame feature. As a result, they fail to maintain long-term motion continuity and to handle variable-speed object movements within the reconstructed frame sequence. Thus, we attempt to aggregate long-term temporal information from multiple consecutive adjacent frames for interpolation. Considering that different adjacent frames contribute to the interpolation results to different degrees, indiscriminately and directly concatenating them for feature extraction may lead to suboptimal reconstruction performance. We instead design the long-term mixture of experts (LTMoE) to aggregate long-term spatial-temporal information from the given frames with different weights. Our LTMoE is adapted from the popular mixture of experts (MoE) [29]. The original MoE leverages several experts \(\mathcal{E}= \{ E_{i} \}_{i =1}^{n}\) to handle different subtasks of a complex task independently and utilizes a gating net \(\mathcal{G}\) to assign weights to them for their fusion. The output is \(E_{\mathrm{out}} = \sum_{i =1}^{n} E_{i} G_{i}\), where \(G_{i}\) indicates the weight that \(\mathcal{G}\) assigns to the i-th expert \(E_{i}\) and there are n experts in total.
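The weighted fusion \(E_{\mathrm{out}} = \sum_{i} E_{i} G_{i}\) can be illustrated with a minimal NumPy sketch. The Softmax normalisation of the gate scores, the function names and the toy expert outputs are assumptions made for illustration only, not details taken from [29].

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of gate scores."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts(expert_outputs, gate_logits):
    """Compute E_out = sum_i G_i * E_i, with G = softmax(gate_logits) (assumed normalisation)."""
    weights = softmax(np.asarray(gate_logits, dtype=float))
    stacked = np.stack(expert_outputs)               # shape (n, ...)
    fused = np.tensordot(weights, stacked, axes=1)   # weighted sum over the expert axis
    return fused, weights

# Three toy experts producing constant 2x2 maps with values 1, 2 and 3.
experts = [np.full((2, 2), v) for v in (1.0, 2.0, 3.0)]
out, w = mixture_of_experts(experts, gate_logits=[0.0, 0.0, 0.0])
```

With equal gate scores every expert receives the same weight, so the fused map is simply the mean of the expert outputs; unequal scores shift the result towards the favoured experts.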
The structure of LTMoE is shown in Fig. 1. LTMoE comprises four shareable experts \(\{ E_{1}, E_{2}, E_{3}, E_{4} \}\) that extract discriminative features \(\{ F_{1}, F_{3}, F_{5}, F_{7} \}\) from the given frames, respectively. Each expert contains two 2D convolutional layers and a ConvNext [36] model. These features carry mutual and complementary information, so we design three gating nets \(\{ \mathcal{G}^{1}, \mathcal{G}^{2}, \mathcal{G}^{3} \}\) to combine them with different weights for obtaining the intermediate features \(\{ F_{2}, F_{4}, F_{6} \}\). For instance, \(F_{2} = \sum_{t =1}^{4} G_{t}^{1} \times E_{t} ( L_{2 t -1} )\), where \(G_{t}^{1}\) represents the hard weight that gating net \(\mathcal{G}^{1}\) assigns for controlling the contribution of expert \(E_{t}\). These gating nets take the concatenation of \(\mathcal{L}\) as input and have the same network structure and weights: a 2D convolutional layer for feature extraction, a 2D global average pooling layer for collapsing the spatial dimensions, and a fully connected layer with a Softmax activation function for weight calculation and normalization. Only the concatenation orders of their input features differ. In this way, the gating nets can automatically control the activation degree of each expert when passing its spatial-temporal information into the interpolated features. Our LTMoE implicitly exploits inter-frame spatial-temporal correlations to recover long-term motion continuity across multiple frames.
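The gating mechanism described above can be sketched in NumPy under several loudly stated assumptions: each expert is replaced by a stand-in 1×1 channel projection (instead of two convolutions plus ConvNext), the gating convolution is also 1×1, all tensor sizes are illustrative, and the three concatenation orders are hypothetical rotations of the frame list, since the exact orders are not given here.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, D, K = 3, 8, 8, 16, 8   # illustrative sizes: frame channels, height, width, feature and gate channels

def expert(frame, proj):
    """Stand-in for an expert E_t: a 1x1 channel projection of one LR frame."""
    return np.einsum("dc,chw->dhw", proj, frame)

def gating_net(frames_in_order, conv_w, fc_w):
    """Shared gating net: 1x1 'conv' -> global average pooling -> FC -> Softmax."""
    x = np.concatenate(frames_in_order, axis=0)      # (4*C, H, W)
    feat = np.einsum("kc,chw->khw", conv_w, x)       # (K, H, W)
    pooled = feat.mean(axis=(1, 2))                  # collapse spatial dimensions -> (K,)
    logits = fc_w @ pooled                           # one score per expert -> (4,)
    e = np.exp(logits - logits.max())
    return e / e.sum()

frames = [rng.standard_normal((C, H, W)) for _ in range(4)]        # L1, L3, L5, L7
expert_proj = [rng.standard_normal((D, C)) for _ in range(4)]
conv_w = rng.standard_normal((K, 4 * C))                           # gating weights shared by all three nets
fc_w = rng.standard_normal((4, K))

expert_feats = [expert(f, p) for f, p in zip(frames, expert_proj)]  # E_t(L_{2t-1})

# Hypothetical concatenation orders, one per gating net (keys are the target frame indices).
orders = {2: [0, 1, 2, 3], 4: [1, 2, 3, 0], 6: [2, 3, 0, 1]}

gate_weights, interpolated = {}, {}
for target, order in orders.items():
    g = gating_net([frames[i] for i in order], conv_w, fc_w)
    gate_weights[target] = g
    # F_target = sum_t G_t * E_t(L_{2t-1})
    interpolated[target] = sum(g_t * e for g_t, e in zip(g, expert_feats))
```

Because the gating parameters are shared, the three weight vectors differ only through the order in which the frames are concatenated, which is the property the text relies on.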
3.3 Local and global feature refinement
Local Feature Refinement. In this phase, we aim to refine each feature map using its local neighbours to maintain the local motion consistency. For this purpose, we adopt the Locally-temporal Feature Comparison (LFC) module from [14]. Figure 2 illustrates how LFC refines \(F_{t}\) with the help of \(F_{t-1}\) and \(F_{t+1}\). Specifically, LFC first concatenates \(F_{t}\) with \(F_{t-1}\) and \(F_{t+1}\), respectively, and feeds the results into two convolutional layers to learn the forward and backward motion offsets \(o_{t-1 \rightarrow t}\) and \(o_{t+1 \rightarrow t}\). These learned offsets describe motion cues, including the movement directions and displacements of objects, and thus can be used to align the adjacent features \(F_{t -1}, F_{t +1}\) towards the current frame using the deformable convolutional layer [37]. The deformable convolutional layer adds learnable offsets to the regular grid sampling locations of the standard convolution operation. This makes the sampling pattern of the convolutional kernels deformable, so that object features are encoded more comprehensively. We follow [13, 14] and use it for spatial feature alignment. After that, the aligned feature maps are concatenated with \(F_{t}\) and then fed into four convolutional layers and five LReLU activation layers for feature fusion.
The LFC module performs local refinement on \(\mathcal{F}= \{ F_{t} \}_{t =1}^{7}\) with a sliding window of size 3. For the first (last) feature in this sequence, its previous (next) neighbour is itself. Finally, we have the refined features \(\mathcal{F}^{R} = \{ F_{t}^{R} \}_{t=1}^{7}\).
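The sliding-window rule, including the boundary case where the first (last) feature uses itself as its previous (next) neighbour, can be written out as a small sketch. The function name `local_windows` is hypothetical; indices are 1-based to match \(F_{1}, \dots, F_{7}\).

```python
def local_windows(num_features=7):
    """Return (previous, current, next) index triples for a window of size 3.

    Indices are 1-based. At the sequence boundaries the missing neighbour
    is replaced by the current feature itself.
    """
    windows = []
    for t in range(1, num_features + 1):
        prev_t = t - 1 if t > 1 else t
        next_t = t + 1 if t < num_features else t
        windows.append((prev_t, t, next_t))
    return windows

windows = local_windows(7)
# windows[0] is (1, 1, 2) and windows[-1] is (6, 7, 7)
```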
Global Feature Refinement. This phase aims to refine each frame feature using information from the whole sequence, and it can model the motions over the sequence to solve the global motion inconsistency problem. To achieve global temporal information passing and aggregation, we take inspiration from [13] to input \(\mathcal{F}^{R}\) into a bidirectional deformable ConvLSTM (BDConvLSTM) layer for global refinement. The backbone of BDConvLSTM layer is the ConvLSTM [38] layer which can capture the motion information from small receptive fields. But ConvLSTM layer fails to handle large and fast motions from the overall feature sequence. Therefore, [13] improves the original ConvLSTM layer with additional deformable convolutional layers to better encode temporal correlations between frames. The design of memory cells and four effective gates enables the BDConvLSTM layer to extract long-term temporal contexts and perform information interaction in the forward and backward directions. Motion artifacts and blurs can be tackled by exploiting global temporal contexts in the BDConvLSTM layer. Moreover, it can model long-term dependency within sequences with arbitrary numbers of features. We denote the output of this layer by \(\mathcal{F}^{G} = \{ F_{t}^{G} \}_{t=1}^{7}\).
3.4 High-resolution frame reconstruction
In the last phase of our LTFA-Net, we aim to reconstruct HR video frames. Specifically, we first feed the features \(\mathcal{F}^{G}\) into ten ConvNext blocks [36] for spatial refinement. Then we follow [21] to project the outputs to HR frames \(\mathcal{H}= \{ H_{t} \}_{t =1}^{7}\) using a Pixel shuffle layer.
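For readers unfamiliar with the pixel shuffle operation, the following NumPy sketch shows how it trades channels for spatial resolution. It assumes a single rearrangement with upscaling factor r = 4 and illustrative tensor sizes; how the upsampling is actually staged in the reconstruction module is not specified here.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r).

    Output pixel (c, h*r + i, w*r + j) is taken from input channel
    c*r*r + i*r + j at spatial position (h, w).
    """
    c_r2, h, w = x.shape
    if c_r2 % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)        # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)      # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

# Illustrative sizes: 3 output channels, 64 x 112 LR features, factor 4.
lr_features = np.zeros((3 * 4 * 4, 64, 112))
hr_frame = pixel_shuffle(lr_features, r=4)   # shape (3, 256, 448)

# Tiny example that makes the index mapping visible.
tiny = np.arange(4 * 2 * 2).reshape(4, 2, 2)
tiny_up = pixel_shuffle(tiny, r=2)           # shape (1, 4, 4)
```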
For model optimization, our LTFA-Net is optimized by the following reconstruction loss in an end-to-end manner:
where \(H_{t}^{gt}\) represents the ground truth of t-th frame and ϵ is the penalty term. As suggested in [13, 14], we empirically set \(\epsilon = 10^{-3} \).
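Assuming the reconstruction loss takes the Charbonnier form adopted in [13, 14], which is an assumption that matches the symbols defined above, it can be written for the t-th frame as follows, where the symbol \(\ell_{rec}\) is introduced here for illustration:

\[ \ell_{rec} = \sqrt{ \left\| H_{t}^{gt} - H_{t} \right\|^{2} + \epsilon^{2} } \]

With \(\epsilon = 10^{-3}\), a loss of this form behaves like an \(\ell_{1}\)-type penalty for large errors while remaining differentiable when the error approaches zero.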
4 Experiments
4.1 Datasets
We select the widely used Vimeo-90K septuplet dataset [7] as our training dataset. This dataset has 91,701 different videos. Each video contains 7 consecutive frames with a fixed spatial resolution of \(448\times 256\). We follow [13, 14] to transform the odd-index frames to their low-resolution (LR) versions using bicubic interpolation. These LR frames have a spatial resolution of \(112\times 64\) and are treated as the training inputs to reconstruct the original HR frames.
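The bookkeeping of this setup (which frames are inputs, which are targets, and how the sizes relate) can be summarised in a short sketch. The function name and dictionary keys are hypothetical, and the bicubic resampling itself is omitted.

```python
def make_training_sample(num_frames=7, hr_size=(448, 256), scale=4):
    """Describe one training sample: odd-index LR inputs and all HR targets.

    Frame indices are 1-based; sizes are (width, height).
    """
    hr_indices = list(range(1, num_frames + 1))
    lr_indices = [t for t in hr_indices if t % 2 == 1]   # odd-index frames only
    lr_size = (hr_size[0] // scale, hr_size[1] // scale)
    return {
        "input_frames": lr_indices,
        "input_size": lr_size,
        "target_frames": hr_indices,
        "target_size": hr_size,
    }

sample = make_training_sample()
# sample["input_frames"] is [1, 3, 5, 7] and sample["input_size"] is (112, 64)
```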
For model evaluation, we follow [39] to test the model performance on the Adobe240fps [40] and GoPro [41] datasets. The Adobe240fps dataset contains 71 videos with the resolution of \(1280\times 720\) and the frame rate of 30 FPS. The GoPro dataset is composed of 3214 videos at \(1280\times 720\) resolution and 240 FPS frame rate. The frame number of videos in these two datasets is not fixed. Similar to the setup of the training set, we down-sampled the odd-index frames with a factor of 4 to obtain the LR low-frame-rate videos, and then input them to the proposed LTFA-Net for reconstructing the original videos.
4.2 Implementation details and evaluation protocols
Implementation Details. The LTFA-Net is optimized by the SGD [42] optimizer with an initial learning rate of \(2\times 10^{-4}\). Following [14], we apply the cosine annealing scheduler to gradually decay the learning rate to \(1\times 10^{-7}\) every 150,000 iterations. We set the number of training iterations to 600,000 and the batch size to 6. Moreover, the training set is augmented by image rotation (90°, 180°, and 270°), horizontal flip and vertical flip operations [14]. We performed our experiments on an NVIDIA RTX 2070 Super GPU.
Evaluation Protocols. We follow [2, 13, 14] and adopt the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) metrics [43] as our evaluation protocols. We also compare the number of model parameters and the inference speed with state-of-the-art approaches. Specifically, PSNR is utilized to measure the reconstruction quality between the reference frame x and the reconstructed frame y. The equation of PSNR is defined as:
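The learning-rate schedule can be sketched with the standard cosine annealing formula. This sketch assumes that the learning rate restarts at its initial value at the beginning of each 150,000-iteration period, which is one common reading of such a schedule; the function name is hypothetical.

```python
import math

def cosine_annealing_lr(iteration, lr_max=2e-4, lr_min=1e-7, period=150_000):
    """Cosine annealing from lr_max to lr_min within each period (restart assumed)."""
    t = iteration % period   # position inside the current period
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / period))

lr_start = cosine_annealing_lr(0)            # equals lr_max
lr_middle = cosine_annealing_lr(75_000)      # halfway between lr_max and lr_min
lr_end = cosine_annealing_lr(149_999)        # close to lr_min
lr_restart = cosine_annealing_lr(150_000)    # back to lr_max under the restart assumption
```

Under this reading, 600,000 training iterations correspond to four such periods.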
where MSE denotes the mean squared error. SSIM [43] aims to measure the human visual perception similarity between images. It is defined as:
The first term of this formula compares the mean luminance values of the two images. The second term compares the contrast of the two images. cov represents the covariance of the two images. \(x_{\mathrm{std}}\) is the squared standard deviation (i.e., the variance) of image x. \(C_{1}\) and \(C_{2}\) are constants that are empirically set to 0.01 and 0.03, respectively [43]. Larger PSNR and SSIM values indicate better reconstruction performance.
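Both metrics can be sketched in NumPy using their standard definitions. Two simplifications are assumed: SSIM is computed over the whole image as a single window (the usual implementation averages over local windows), and the values 0.01 and 0.03 quoted above are treated as the factors \(K_{1}\) and \(K_{2}\) that are scaled by the dynamic range, as in the standard definition of [43]. The function names and the 8-bit range are illustrative.

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def global_ssim(x, y, max_val=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM: a luminance term times a contrast/structure term."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast_structure = (2 * cov + c2) / (var_x + var_y + c2)
    return luminance * contrast_structure

rng = np.random.default_rng(0)
reference = rng.integers(0, 200, size=(32, 32)).astype(np.uint8)
shifted = reference + 10            # uniform brightness shift, MSE = 100

psnr_value = psnr(reference, shifted)
ssim_value = global_ssim(reference, shifted)
```

A uniform brightness shift leaves the contrast/structure term at 1 and lowers only the luminance term, which illustrates how the two factors respond to different kinds of distortion.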
4.3 Comparison to state of the art
The state-of-the-art STVSR approaches can be divided into two categories: two-stage and one-stage models. For the two-stage approaches, we combine advanced VFI methods (SuperSloMo [44], QVI [8] and DAIN [18]) with VSR methods (Bicubic, EDVR [11] and BasicVSR [19]). For the one-stage models, we select Zooming SlowMo [13], STARnet [12], VideoINR [39], RSTT-L [25] and TMNet [14] as the main comparison approaches.
The quantitative comparison results are shown in Table 1. First, we can observe that one-stage models clearly outperform two-stage models, e.g., our LTFA-Net achieves PSNR gains of 1.69 dB and 2.16 dB over the best two-stage approach, DAIN+BasicVSR, on the two datasets, respectively. These results demonstrate the superiority of one-stage approaches in exploiting intrinsic spatial-temporal correlations among different frames for reconstructing realistic details. Moreover, LTFA-Net also shows better performance than other one-stage approaches. It significantly outperforms the best one-stage approach, Zooming SlowMo, by 1.02 dB and 1.22 dB on the two datasets in terms of the PSNR metric. LTFA-Net has 12.41 million parameters, and its inference speed for reconstructing a single HR frame is 14.53 FPS, which is comparable to that of recent approaches. These results indicate the advantage of LTFA-Net in aggregating long-term temporal features to achieve STVSR.
Qualitative results are shown in Fig. 3. In the first row, we display the 7 consecutive frames reconstructed by our LTFA-Net. We can observe the smooth movement of the car with a well-preserved wheel structure across the frames. In the second row, we display the reconstructed second frame of the video sequence “GOPRO384-11-00” in the GoPro dataset. The one-stage approaches produce reconstructions of better quality than the two-stage approaches in terms of blurred pixels and artifacts. It can be clearly observed that our LTFA-Net reconstructs the cloth collar with clearer edges and less motion blur than TMNet and Zooming SlowMo. Overall, the quantitative and qualitative results demonstrate the superiority of LTFA-Net on STVSR.
4.4 Ablation study
In this section, we provide detailed examinations of the key LTMoE module in our LTFA-Net. We explore the importance of LTMoE and evaluate the performance of different LTMoE variants. We conducted experiments on the following variants: 1) LTMoE variants using different expert aggregation mechanisms; 2) an LTMoE variant whose experts are designed with another structure. LTFA-Net with each LTMoE variant is trained on the Vimeo-90K dataset and tested on the Adobe240fps and GoPro datasets. We report the corresponding PSNR and SSIM metrics in Table 2.
(1) LTMoE using different expert aggregation mechanisms. Currently, our LTMoE module leverages gating nets to assign different weights to different experts, controlling their activations and thereby achieving feature interpolation. To test the effectiveness of these gating nets, we remove them from LTMoE and obtain the interpolated features in three alternative ways: averaging the experts directly; concatenating the experts and then applying a \(1\times 1\times 1\) 3D convolutional layer for dimensionality reduction; and assigning weights to the experts using a temporal closeness function following [1]. This temporal closeness function measures the temporal distance between the target feature and the given features and thus assigns larger weights to nearer experts and smaller weights to farther ones. These three variants are termed LTMoE (gating→avg), LTMoE (gating→con) and LTMoE (gating→tcf). We can observe that LTMoE (gating→avg) shows a significant PSNR decrease of 0.34 dB on the Adobe240fps dataset compared with the original LTFA-Net. LTMoE (gating→con) and LTMoE (gating→tcf) only reach 28.29 dB and 28.46 dB PSNR on the GoPro dataset, which is significantly worse than LTFA-Net. These results indicate that each expert carries its own useful spatial-temporal information as well as irrelevant noise, and that different experts should contribute to the interpolation results to different degrees. The gating nets can learn to assign different weights to these experts to enhance the meaningful information and suppress the irrelevant noise, which improves the STVSR performance.
(2) LTMoE using different expert structures. We further investigate the effectiveness of the expert structure of our LTMoE module. Currently, each expert is designed with a ConvNext [36] model for feature extraction, and it is reported that deformable convolutional layers (DCN) can help aggregate temporal information in the interpolation process [14]. Therefore, we added an additional DCN to each expert to model temporal correlations between adjacent frames for better interpolation. We term this variant LTMoE w/ DCN. From Table 2 we can observe that LTMoE w/ DCN shows PSNR decreases of 0.51 dB and 0.07 dB on the two datasets. These results indicate that our LTMoE already has the ability to aggregate information from the overall sequence and that the inclusion of an additional DCN may lead to overfitting.
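The three gating-free aggregation variants can be contrasted in a small NumPy sketch. All of it is illustrative: the temporal closeness function below is a hypothetical inverse-distance weighting, since the exact function of [1] is not given here, and the \(1\times 1\times 1\) 3D convolution is simplified to a channel-mixing matrix applied to the concatenated features.

```python
import numpy as np

def aggregate_avg(expert_feats):
    """gating -> avg: plain average of the expert features."""
    return np.mean(np.stack(expert_feats), axis=0)

def aggregate_concat(expert_feats, mix_w):
    """gating -> con: concatenate along channels, then mix channels (simplified 1x1x1 conv)."""
    x = np.concatenate(expert_feats, axis=0)          # (4*D, H, W)
    return np.einsum("dk,khw->dhw", mix_w, x)         # (D, H, W)

def temporal_closeness_weights(target_t, source_ts):
    """Hypothetical closeness function: inverse temporal distance, normalised to sum to 1."""
    inv = np.array([1.0 / abs(target_t - s) for s in source_ts])
    return inv / inv.sum()

def aggregate_tcf(expert_feats, target_t, source_ts):
    """gating -> tcf: fixed weights that depend only on temporal distance."""
    w = temporal_closeness_weights(target_t, source_ts)
    return np.tensordot(w, np.stack(expert_feats), axes=1)

rng = np.random.default_rng(0)
D, H, W = 4, 6, 6                                      # illustrative feature sizes
feats = [rng.standard_normal((D, H, W)) for _ in range(4)]   # features of frames 1, 3, 5, 7
mix_w = rng.standard_normal((D, 4 * D))

f2_avg = aggregate_avg(feats)
f2_con = aggregate_concat(feats, mix_w)
f2_tcf = aggregate_tcf(feats, target_t=2, source_ts=(1, 3, 5, 7))
w2 = temporal_closeness_weights(2, (1, 3, 5, 7))
```

Unlike the gating nets, none of these variants adapts its weights to the content of the input frames, which is the difference the ablation isolates.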
5 Conclusion
In this work, we proposed a novel long-term temporal feature aggregation network (LTFA-Net) for STVSR. Our main contribution is to aggregate long-term temporal information from multiple neighbouring frames for feature interpolation. This process is achieved by our long-term mixture of experts (LTMoE) module, which is designed with several experts for feature extraction and several gating nets for combining the expert outputs into the interpolation results. The interpolated features are then fed into the Locally-temporal Feature Comparison module and a bidirectional deformable ConvLSTM layer for local and global refinement. The experimental results on two standard benchmarks demonstrate the superiority of LTFA-Net over state-of-the-art approaches. The limitations of this work are two-fold: first, similar to [12, 13], LTFA-Net cannot interpolate frames at arbitrary intermediate time stamps between the given frames but only at the middle time stamp; second, its loss function does not integrate the motion factor into the network optimization process. We will dedicate ourselves to solving these problems in our future work.
Availability of data and materials
The Vimeo-90K dataset is available at http://toflow.csail.mit.edu/. The Adobe240fps dataset is available at http://www.cs.ubc.ca/labs/imager/tr/2017/DeepVideoDeblurring/. The GoPro dataset is available at https://seungjunnah.github.io/Datasets/gopro.
References
Z. Yue, M. Shi, S. Ding, S. Yang, Enhancing space-time video super-resolution via spatial-temporal feature interaction (2022). http://arxiv.org/abs/2207.08960
S.Y. Kim, J. Oh, M. Kim, FISR: deep joint frame interpolation and super-resolution with a multi-scale temporal loss, in AAAI (2020). http://arxiv.org/abs/1912.07213
E. Shechtman, Y. Caspi, M. Irani, Increasing space-time resolution in video, in ECCV (2002)
E. Shechtman, Y. Caspi, M. Irani, Space-time super-resolution. IEEE Trans. Pattern Anal. Mach. Intell. 27, 531–545 (2005). https://doi.org/10.1109/TPAMI.2005.85
O. Shahar, A. Faktor, M. Irani, Space-time super-resolution from a single video, in CVPR (2011), pp. 3353–3360. https://doi.org/10.1109/CVPR.2011.5995360
S. Niklaus, F. Liu, Softmax splatting for video frame interpolation, in CVPR (2020). http://arxiv.org/abs/2003.05534
T. Xue, B. Chen, J. Wu, D. Wei, W.T. Freeman, Video enhancement with task-oriented flow, in IJCV, vol. 127 (2019), pp. 1106–1125. https://doi.org/10.1007/s11263-018-01144-2
X. Xu, L. Siyao, W. Sun, Q. Yin, M.-H. Yang, Quadratic video interpolation, in NeurIPS (2019). http://arxiv.org/abs/1911.00627
W.-S. Lai, J.-B. Huang, N. Ahuja, M.-H. Yang, Fast and accurate image super-resolution with deep Laplacian pyramid networks. IEEE Trans. Pattern Anal. Mach. Intell. 41(11), 2599–2613 (2018). http://arxiv.org/abs/1710.01992
Y. Tian, Y. Zhang, Y. Fu, C. Xu, TDAN: temporally deformable alignment network for video super-resolution, in CVPR (2020). http://arxiv.org/abs/1812.02898
X. Wang, K.C.K. Chan, K. Yu, C. Dong, C.C. Loy, EDVR: video restoration with enhanced deformable convolutional networks, in CVPR Workshop (2019). http://arxiv.org/abs/1905.02716
M. Haris, G. Shakhnarovich, N. Ukita, Space-time-aware multi-resolution video enhancement, in CVPR (2020). http://arxiv.org/abs/2003.13170
X. Xiang, Y. Tian, Y. Zhang, Y. Fu, J.P. Allebach, C. Xu, Zooming slow-mo: fast and accurate one-stage space-time video super-resolution, in CVPR (2020). http://arxiv.org/abs/2002.11616
G. Xu, J. Xu, Z. Li, L. Wang, X. Sun, M.-M. Cheng, Temporal modulation network for controllable space-time video super-resolution, in CVPR (2021). http://arxiv.org/abs/2104.10642
S. Niklaus, L. Mai, F. Liu, Video frame interpolation via adaptive separable convolution, in ICCV (2017). http://arxiv.org/abs/1708.01692
S. Niklaus, L. Mai, F. Liu, Video frame interpolation via adaptive convolution, in CVPR (2017). http://arxiv.org/abs/1703.07514
H. Lee, T. Kim, T. Chung, D. Pak, Y. Ban, S. Lee, AdaCoF: adaptive collaboration of flows for video frame interpolation, in CVPR (2020). http://arxiv.org/abs/1907.10244
W. Bao, W.-S. Lai, C. Ma, X. Zhang, Z. Gao, M.-H. Yang, Depth-aware video frame interpolation, in CVPR (2019). http://arxiv.org/abs/1904.00830
K.C.K. Chan, X. Wang, K. Yu, C. Dong, C.C. Loy, BasicVSR: the search for essential components in video super-resolution and beyond, in CVPR (2021). http://arxiv.org/abs/2012.02181
M. Haris, G. Shakhnarovich, N. Ukita, Deep back-projection networks for super-resolution, in CVPR (2018). http://arxiv.org/abs/1803.02735
W. Shi, J. Caballero, F. Huszár, J. Totz, A.P. Aitken, R. Bishop, D. Rueckert, Z. Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in CVPR (2016). http://arxiv.org/abs/1609.05158
C. You, L. Han, A. Feng, R. Zhao, H. Tang, W. Fan, Megan: memory enhanced graph attention network for space-time video super-resolution, in WACV (2022)
J. Cao, J. Liang, K. Zhang, W. Wang, Q. Wang, Y. Zhang, H. Tang, L.V. Gool, Towards interpretable video super-resolution via alternating optimization, in ECCV (2022)
M. Hu, K. Jiang, L. Liao, J. Xiao, J. Jiang, Z. Wang, Spatial-temporal space hand-in-hand: spatial-temporal video super-resolution via cycle-projected mutual learning, in CVPR (2022)
Z. Geng, L. Liang, T. Ding, I. Zharkov, RSTT: real-time spatial temporal transformer for space-time video super-resolution, in CVPR (2022). http://arxiv.org/abs/2203.14186
H. Wang, X. Xiang, Y. Tian, W. Yang, Q. Liao, STDAN: deformable attention network for space-time video super-resolution (2022). http://arxiv.org/abs/2203.06841
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: hierarchical vision transformer using shifted windows, in ICCV (2021). http://arxiv.org/abs/2103.14030
S. Masoudnia, R. Ebrahimpour, Mixture of experts: a literature survey. Artif. Intell. Rev. 42, 275–293 (2014). https://doi.org/10.1007/s10462-012-9338-y
R.A. Jacobs, M.I. Jordan, S.J. Nowlan, G.E. Hinton, Adaptive mixtures of local experts. Neural Comput. 3, 79–87 (1991). https://doi.org/10.1162/neco.1991.3.1.79
S. Pavlitskaya, C. Hubschneider, M. Weber, R. Moritz, F. Huger, P. Schlicht, J.M. Zollner, Using mixture of expert models to gain insights into semantic segmentation, in CVPR Workshop (2020), pp. 1399–1406. https://doi.org/10.1109/CVPRW50498.2020.00179
Z. Du, M. Shi, J. Deng, S. Zafeiriou, Redesigning multi-scale neural network for crowd counting (2022). http://arxiv.org/abs/2208.02894
Y. Wang, L. Wang, H. Wang, P. Li, H. Lu, Blind single image super-resolution with a mixture of deep networks. Pattern Recognit. 102, 107169 (2020). https://doi.org/10.1016/j.patcog.2019.107169
D. Liu, Z. Wang, N. Nasrabadi, T. Huang, Learning a mixture of deep networks for single image super-resolution, in ACCV (2017). http://arxiv.org/abs/1701.00823
M. Emad, M. Peemen, H. Corporaal, MoESR: blind super-resolution using kernel-aware mixture of experts, in WACV (2022), pp. 4009–4018. https://doi.org/10.1109/WACV51458.2022.00406
R. Rasti, H. Rabbani, A. Mehridehnavi, F. Hajizadeh, Macular OCT classification using a multi-scale convolutional neural network ensemble. IEEE Trans. Med. Imaging 37, 1024–1034 (2018). https://doi.org/10.1109/TMI.2017.2780115
Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A ConvNet for the 2020s, in CVPR (2022). http://arxiv.org/abs/2201.03545
J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, Y. Wei, Deformable convolutional networks, in ICCV (2017). http://arxiv.org/abs/1703.06211
X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W. Wong, W. Woo, Convolutional LSTM network: a machine learning approach for precipitation nowcasting, in NIPS (2015). http://arxiv.org/abs/1506.04214
Z. Chen, Y. Chen, J. Liu, X. Xu, V. Goel, Z. Wang, H. Shi, X. Wang, VideoINR: learning video implicit neural representation for continuous space-time super-resolution, in CVPR (2022). http://arxiv.org/abs/2206.04647
D. Sun, X. Yang, M.-Y. Liu, J. Kautz, PWC-net: CNNs for optical flow using pyramid, warping, and cost volume, in CVPR (2018). http://arxiv.org/abs/1709.02371
S. Nah, T.H. Kim, K.M. Lee, Deep multi-scale convolutional neural network for dynamic scene deblurring, in CVPR (2017). http://arxiv.org/abs/1612.02177
I. Loshchilov, F. Hutter, SGDR: stochastic gradient descent with warm restarts, in ICLR (2017). http://arxiv.org/abs/1608.03983
Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612 (2004). https://doi.org/10.1109/TIP.2003.819861
H. Jiang, D. Sun, V. Jampani, M.-H. Yang, E. Learned-Miller, J. Kautz, Super SloMo: high quality estimation of multiple intermediate frames for video interpolation, in CVPR (2018). http://arxiv.org/abs/1712.00080
Acknowledgements
The authors thank the lab members from VISCOM, Department of Informatics, King’s College London, UK, for their helpful discussions.
Funding
Not applicable.
Author information
Contributions
Dataset processing, experimental result analysis, and manuscript preparation were done by KC under the guidance and supervision of ZY and MS. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
Prof. Miaojing Shi is an editorial board member for Autonomous Intelligent Systems and was not involved in the editorial review of, or the decision to publish, this article. All authors declare that there are no other competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chen, K., Yue, Z. & Shi, M. Space-time video super-resolution using long-term temporal feature aggregation. Auton. Intell. Syst. 3, 5 (2023). https://doi.org/10.1007/s43684-023-00051-9
DOI: https://doi.org/10.1007/s43684-023-00051-9