Introduction

Super-resolution, the generation of high-resolution visuals from low-resolution inputs, is a classic problem in computer vision. It was first addressed by Image Super-Resolution (ISR), which utilises only the spatial information of a single image, or of multiple discrete images, to produce fundamental visual quality improvements [1, 2]. When the subject to be super-resolved is extended from single images to video signals, directly applying the approaches used in conventional ISR to Video Super-Resolution (VSR) fails to capture the temporal information present in videos [3, 4]. VSR instead aims to exploit several temporally correlated low-resolution frames within a video sequence to super-resolve the frame series. The joint consideration of spatial and temporal dimensions across multiple input frames renders VSR a highly non-linear, multi-dimensional problem that remains an active research field.

In recent years, Deep Neural Networks (DNN) have been widely adopted in VSR to leverage the highly non-linear, multi-dimensional characteristics and features of the input video frames, with promising results [5]. These learning-based VSR approaches [6,7,8,9,10,11,12,13] utilise the temporal information in a video as a learning feature, followed by stages of frame alignment and fusion to reconstruct and up-sample the resulting pixels. However, the commonly adopted frame alignment techniques, whether conventional Motion Estimation and Motion Compensation (MEMC) using optical flow and warping [12] or modern machine learning techniques such as deformable convolution [14], may be sensitive to large changes in luminance [15, 16] and motion [5]. To counter this, 2D/3D and recurrent convolutions have been used to learn inter-frame correlation without any implicit or explicit frame alignment [17].

To reveal inter-frame correlation along a video sequence without any implicit or explicit frame alignment, the input frames are commonly chosen by a sliding-window mechanism that includes a fixed number of consecutive frames from past and/or future timestamps relative to the target frame [18]. Most VSR models using such a sliding-window mechanism treat all neighbouring input frames as equally important, without rank or preference. However, each neighbouring frame in a sliding window may exhibit a different degree of correlation with the target because of context and content changes across the time domain. As a result, a fixed selection of consecutive frames around the target frame in a sliding window may not be optimal for learning spatiotemporal correlation [7].

Using a fixed number of consecutive neighbouring frames around a given target frame can impede the ability of VSR models to capture the temporal context of video sequences. This limitation can deprive the models of information about motion and changes in the scene, negatively impacting their performance. To address this issue, recent VSR models have employed all-frames-in bidirectional neural networks, which benefit from the information available in a larger temporal window at each given timestamp. However, these models are complex and may not be applicable to real-world applications, as they require all frames to be available simultaneously and impose considerable time and memory requirements [17].

A practical alternative is to apply an efficient frame selection mechanism on top of the conventional sliding-window mechanism in VSR models. By comparing frames within the sliding window and selecting those most relevant to the target frame, VSR models can extract more discriminative features for the super-resolution task. The relevance of frames can be defined through various measures of similarity based on properties such as features, visual appearance, luminance, and element structure.

In this study, we aim to investigate the potential impact of using image comparison measures for input frame selection in VSR. Despite the intuitive reasoning behind this approach, it has yet to be explored in the literature in a comprehensive way. To address this gap, we conduct an analysis of image similarity measures and develop two dynamic content-based input frame selection algorithms for VSR: the SpatioTemporal Input Frame Selection (STIFS) algorithm and the Feature-based Input Frame Selection (FIFS) algorithm. Through an empirical study, we evaluate the performance of these algorithms compared to conventional sliding window methods. Additionally, we extend the applicability of the best-performing selection algorithm to a state-of-the-art 360° video super-resolution model. Overall, the key contributions of this paper are:

  1. an analysis of the effectiveness of widely used image similarity measures for input frame selection in VSR.

  2. the development of two dynamic content-aware input frame selection algorithms for VSR, namely STIFS [19] and FIFS.

  3. an empirical study evaluating the impact of input selection using the proposed algorithms for VSR compared to the conventional sliding window method.

  4. an application of the best-performing frame selection algorithm to a state-of-the-art 360° video super-resolution model.

Background

Sliding Window-Based VSR and its Limitations

Frame alignment in VSR via Motion Estimation and Motion Compensation (MEMC) [6, 9, 20, 21] remains challenging, particularly when inter-frame motion or luminance variance is pronounced across neighbouring frames [22]. Alternatively, deformable convolutions, proposed by Dai et al. [14], have been used for frame alignment by enhancing a DNN's capacity to model the geometric transformations of objects. Although deformable convolution is more tolerant of variance in luminance or motion, it involves higher computational overhead [7, 10, 23]. Recently, more VSR methods have been proposed that do not rely on frame alignment techniques, alleviating the above-mentioned limitations. These methods employ 2D convolution [24], 3D convolution [8, 25], or Recurrent Neural Networks (RNN) [17, 26, 27] to exploit spatial or spatiotemporal information in a video.

However, most VSR models simply use a fixed set of consecutive frames for super-resolving each target frame in the video. Some recent methods have introduced variations of the model architecture to extract different features from the given consecutive frames, attempting to capture the unique temporal characteristics between video frames. Enhanced Deformable Convolutional Networks (EDVR) [7] employ a Temporal-Spatial Attention (TSA) mechanism in which a convolution-based similarity distance generates temporal attention maps that are multiplied element-wise with the original feature maps of each frame, after which a spatial attention mask is computed through a fusion process. Even after incorporating complex components like TSA, the information fed to these models via the input frames remains the same. This implies that the models' learning relies on the same inputs to map low-resolution frames to a higher-resolution output, even when the operations applied to extract features from those inputs vary.

The literature shows that the field lacks a mechanism to effectively select the input frames for either alignment-based or non-alignment-based VSR models. Non-frame-alignment models suffer more from monotony in the input space resulting from the conventional sliding-window mechanism, with the exception of RNN-based models, which commonly use one consecutive frame in addition to the target frame and the hidden state propagated from super-resolving frames at past timestamps. Two of the non-frame-alignment-based methods are VSRResFeatGAN [24] and Dynamic Upsampling Filters (DUF) [8], which use 2D and 3D convolution, respectively. Both methods use a sliding-window mechanism to select a fixed number of frames from both past and future temporal dimensions and rely on either 2D convolution to extract spatial correlation or 3D convolution to extract spatiotemporal correlation. The capability of the convolution layers in these models to learn the optimal spatial features within a frame, or spatiotemporal features between frames, could be limited, since the temporal proximity that signals cross-correlation, relevance, and mutual information between video frames may not be fully utilised.

VSR Challenges

Although the evaluation and comparison of a new VSR model with the current state-of-the-art models are beyond the scope of this paper, our intention with this discussion is to acknowledge the fierce competition in VSR research and the relatively minor gains achieved through modelling and addressing the complex problem of video super-resolution. As an example, IconVSR [12] harnessed the sequential modelling ability of bidirectional recurrent neural networks in combination with MEMC to obtain a peak signal-to-noise ratio (PSNR) improvement of only 0.03 dB over the previously best-performing model, EDVR [7], on the Vimeo90k [21] test set. This exemplifies the challenge of improving the performance of existing VSR models; it is worth noting the extent of the architectural changes required to obtain this modest improvement. Similarly, despite its complexity, the recent BasicVSR model improves the PSNR on Vid4 by only 0.04 dB compared to the previously best-performing Recurrent Structure-Detail Network (RSDN) [13]. RSDN, in turn, improved the super-resolution outcome on Vid4 [28], in PSNR terms, by only 0.07 dB compared to EDVR, the best-performing model preceding it. These examples emphasise the trend of increasing architectural complexity in the VSR literature to achieve only marginal improvements in super-resolution results. Therefore, any improvement achieved without added architectural overhead constitutes a cost-effective, efficient and practical alternative to increasing architectural complexity. The proposed input frame selection strategy offers such an alternative.

Scope of this Work

Our literature study concludes that, although limited attempts have been made in some alignment-based VSR methods to treat frames at different timestamps differently, no work has addressed the effective selection of input frames in current VSR models. This is despite the hypothesis that such an approach would likely enrich the feature space and could achieve improved super-resolution outcomes, especially for non-frame-alignment-based VSR models. At the same time, it is hypothesised that selecting the most relevant input frames will improve VSR results at a lower computational cost than adding learnable parameters, which constitutes a more computationally expensive approach.

By leveraging temporal information and considering pixel-level and feature-level comparisons of neighbouring frames with the target frame, our proposed input selection algorithms aim to determine the most relevant frames for super-resolution reconstruction. This approach prioritises frames with relevant content and visual patterns, thereby capturing and leveraging temporal information effectively. By integrating these algorithms into VSR models, significant improvements in super-resolution outcomes are anticipated, particularly for non-frame-alignment-based methods, while also reducing computational complexity. Therefore, in this study, we explore the effectiveness of employing frame comparison metrics for input selection in VSR and investigate their impact on VSR performance, specifically through the four major contributions highlighted in “Introduction”.

Input Selection Mechanisms

Based on the properties used to define relevance and facilitate selection, frame selection mechanisms can be broadly categorized into three types:

  1. Pixel-based similarity measures compare the similarity between a given target frame and its neighbouring frames based on pixel values. These methods are computationally efficient, but they may not be as effective in cases with significant motion or luminance changes between frames. Examples of pixel-based measures include Mean Pixel Value Difference (MPVD), Normalized Cross Correlation (NCC), Correlation Coefficient, and Mutual Information (MI) [29, 30]. Among these, MPVD is one of the simplest yet most effective measures for pixel-based comparison.

  2. Quality-based similarity measures compare the similarity between a given target frame and its neighbouring frames based on the quality or visual appearance of the frames. These methods consider factors such as luminance and contrast and are expected to be more effective than pixel-based measures in cases with significant noise or compression artefacts. Examples of quality-based measures include Peak Signal to Noise Ratio (PSNR) [31], Structural Similarity Index (SSIM) [31], and Learned Perceptual Image Patch Similarity (LPIPS) [32].

  3. Feature-based similarity measures compare the similarity between a given target frame and its neighbouring frames based on feature points or descriptors extracted from the frames. These methods can be effective in cases with distinct features in the video frames, such as text or objects. However, they are the most computationally intensive of the three approaches. Examples of feature-based measures include shallow features using SIFT (Scale-Invariant Feature Transform) [33], FAST (Features from Accelerated Segment Test) [34], BRIEF (Binary Robust Independent Elementary Features) [35], ORB (Oriented FAST and Rotated BRIEF) [36], and BRISK (Binary Robust Invariant Scalable Keypoints) [37], or deep features using VGG16 [38] or ResNet [39]. Using deep features for selection can prove computationally expensive in a VSR model. Among the conventional shallow feature detection methods, ORB has been identified as one of the most efficient and robust [40].

Analysis of Selection Measures

Pixel-Based vs. Quality-Based

We perform a frame-to-frame comparison between example target frames and their neighbours using the pixel-based measure (MPVD) and the quality-based measures (PSNR and SSIM) for all four clips of the Vid4 dataset. For this analysis, we consider the target frame at timestamp \(t=12\) and its 11 neighbours in each temporal direction. From the graphs shown in Fig. 1a and b, it is evident that MPVD is highly correlated with both PSNR and SSIM, demonstrating that MPVD captures inter-frame similarity/difference at a level comparable to the quality-based metrics.

The computational cost of the MPVD method for selecting input frames in VSR is significantly lower than that of the PSNR and SSIM methods, as shown in Table 1. The ORB method, on the other hand, is the most computationally expensive among the methods presented in Table 1, owing to the need to explicitly extract features from each frame before making comparisons and selections. The timing measurements presented in Table 1 were performed on a machine with an Intel(R) Core(TM) i7-8665U CPU @ 1.90 GHz (2.11 GHz) processor, 16 GB of installed RAM (15.8 GB usable), running a 64-bit Windows 10 Enterprise operating system.

This highlights the need to weigh the trade-off between the computational cost and the effectiveness of a selection measure when choosing neighbouring frames for a given target frame in VSR, as selection is performed repeatedly using a sliding window over the entire video. It is important to note that, despite the higher computational cost of PSNR-based and SSIM-based selection, the resulting selections are highly correlated with MPVD-based selection, as demonstrated in Fig. 1a and b. Therefore, for developing the selection algorithms in the following sections, we adopt the MPVD method as the preferred pixel-based selection measure.
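For illustration, the sketch below shows how such a frame-to-frame comparison could be computed. It is a minimal sketch assuming 8-bit grayscale frames loaded as NumPy arrays, and it uses the scikit-image implementations of PSNR and SSIM rather than the scripts used for the analysis above; the function name `compare_frames` is illustrative.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def compare_frames(target: np.ndarray, neighbour: np.ndarray):
    """Return (MPVD, PSNR, SSIM) between two equally sized 8-bit grayscale frames."""
    diff = target.astype(np.float64) - neighbour.astype(np.float64)
    mpvd = np.mean(np.abs(diff))  # mean absolute pixel value difference
    psnr = peak_signal_noise_ratio(target, neighbour, data_range=255)
    ssim = structural_similarity(target, neighbour, data_range=255)
    return mpvd, psnr, ssim
```

Running such a comparison between a target frame and each of its neighbours yields correlation curves of the kind shown in Fig. 1.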

Fig. 1

Correlation of PSNR, SSIM and MPVD between target frame \(F_t\), where \(t=12\), and its 11 neighbours in each temporal direction in the 4 clips of the benchmark Vid4 dataset

Table 1 Time taken in seconds to perform PSNR, SSIM, MPVD and ORB-based selection for 34 frames of the City clip of Vid4 Dataset
Fig. 2

Comparison between the percentage of target frames for which immediate consecutive past and future neighbouring frames were not selected when using MPVD and ORB as selection measures on ten randomly selected clips of the Vimeo90k [21] dataset

Pixel-Based vs. Feature-Based

Despite the higher computational cost associated with feature-based selection, it is important to note that its nature of comparison and level of sophistication differ from those of the other approaches. Feature-based methods are well suited to deep learning VSR models for several reasons:

  1. Robustness: Feature-based methods can be more robust to changes in luminance and motion, as they extract distinct and consistent features across frames, regardless of the appearance of the frames.

  2. Spatial and temporal information: Feature-based methods can extract both spatial and temporal information from the video frames. This can be useful for deep learning models that must capture both information types to generate high-quality HR frames.

  3. Scale-invariance: Feature-based methods like ORB are scale-invariant, meaning they can detect feature points across different scales. This can be beneficial in cases where the scale of the objects or the details in the scene change between the frames.

As demonstrated in Fig. 2, the percentage of target frames for which non-consecutive neighbouring frames were selected varies significantly between MPVD and ORB, illustrating the diversity of these two selection measures. Therefore, to study the impact of each measure on VSR results individually, we have also chosen ORB for the input selection algorithm developed in the following sections.

Spatial vs. Spatiotemporal

MPVD is computationally efficient and provides a mechanism for spatial comparison between frames. However, MPVD alone does not adequately consider the spatiotemporal inter-dependencies among video frames. We analysed the sensitivity of MPVD when used alone for selection; the result is depicted in Fig. 2. The figure shows that for most of the clips selected from the Vimeo90k septuplet dataset, MPVD selected more non-consecutive neighbouring frames than the ORB method. This higher frequency of selecting non-consecutive neighbouring frames may be due to MPVD's heightened sensitivity to noise and pixel value variations in the frames.

Additionally, we investigated the impact of MPVD-based (pixel-based) ranking and selection when factors beyond the spatial domain are considered, revealing the underlying spatiotemporal relationship. As shown in Fig. 3, if five out of eleven frames are selected with reference to target frame \(F_t\), where \(t=12\), for the City clip based on the spatial metric MPVD alone, the five frames most distant from the target frame are selected because they exhibit the largest spatial differences, as highlighted by the dashed bounding box in Fig. 3. However, when Temporal Distance (TD) is considered, the most distant frames rank lowest, despite having the largest MPVD with \(F_t\); thus, the nearest five frames are selected, as highlighted by the solid bounding box in Fig. 3. Considering the spatial dimension alone reduces VSR to multi-image super-resolution, which is undesirable. Both spatial and temporal dimensions must be considered to capture the true spatiotemporal interdependence between the target frame and its neighbouring frames.

Fig. 3

Comparison between spatial and spatiotemporal selection. The dashed bounding box represents frame selection based on a spatial metric (MPVD) alone. The solid bounding box represents frame selection based on the spatiotemporal metric (MPVD/TD)

The Proposed Input Frame Selection Algorithms

STIFS for Pixel-Based Selection

To mitigate the shortcomings of the sliding-window approach in current VSR models, our novel SpatioTemporal Input Frame Selection (STIFS) algorithm uses the frame-wise spatiotemporal correlation between neighbouring frames and the target frame to capture their relationship in the input space of a VSR network. The frame-wise spatiotemporal correlation comprises spatial and temporal differences between frames. To compute the spatial difference, we make use of the Mean Pixel Value Difference (MPVD) between the target frame \(F_t\) and the neighbouring frames \(F_{t \pm \delta }\), where \(\delta \in \{1,\dots ,n-1\}\) and n is the total number of frames in the video. MPVD is defined as:

$$\begin{aligned} \textit{MPVD}(F_t, F_{t \pm \delta }) = \frac{1}{{h}\times {w}} \sum _{j=1}^{{h}\times {w}} \Vert p_j(F_t) - p_j(F_{t \pm \delta })\Vert \end{aligned}$$
(1)

where h and w are the height and width of the frames in pixels, respectively, and \(p_j(\cdot )\) is the value of the \(j^{th}\) pixel of a given frame.

Algorithm 1

STIFS Algorithm

Fig. 4

Visualisation of key points identified as part of the FIFS algorithm

Fig. 5

Visual example of feature matching obtained from the FIFS algorithm between frames 7 and 6 of the Calendar clip of the Vid4 dataset

The temporal component of the spatiotemporal correlation is the Temporal Distance (TD) between a target frame \(F_t\) and neighbour \(F_{t \pm \delta }\) calculated as,

$$\begin{aligned} \textit{TD}(F_t, F_{t \pm \delta }) = \Vert \delta \Vert . \end{aligned}$$
(2)

The rank score for each frame \(F_{t \pm \delta }\) in the neighbouring space of target frame \(F_t\) is then computed as,

$$\begin{aligned} {r}(F_{t \pm \delta }) = \frac{\text{ MPVD }(F_t, F_{t \pm \delta })}{\text{ TD }(F_t, F_{t \pm \delta })}. \end{aligned}$$
(3)

The STIFS algorithm then uses the rank scores of the neighbouring frames to select two neighbours from the given \(n-1\) frames, which may belong to either the past or the future relative to the target frame \(F_t\). The overall procedure for selecting the input space of a VSR model for a video sequence with n frames, each of size \({h}\times {w}\), is presented in Algorithm 1. Based on the proposed STIFS Algorithm 1, the selection is repeated for each target frame \(F_t\) in the video sequence, finally giving an input space of size \(n \times 3\), with two neighbouring frames \(F_{t \pm \delta }\) and one target frame \(F_t\) for each LR frame in frames[1, ..., n]. The algorithm selects neighbouring frames by ranking them while capturing both the spatial and the temporal correlation between \(F_t\) and each neighbouring frame \(F_{t \pm \delta }\).

By considering the rank scores of neighbouring frames and capturing both spatial and temporal correlations, the algorithm dynamically chooses two neighbouring frames, either from the past or future relative to the target frame, for each low-resolution frame in the video sequence. The resulting input space for the VSR model is a collection of the selected frames, which exhibit higher spatial and temporal correlation with respect to the target frame. This frame selection mechanism optimises the utilisation of relevant information for super-resolution, potentially leading to improved reconstruction quality and enhanced visual fidelity in the resulting high-resolution videos.
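To make the ranking concrete, the following is a minimal Python sketch of the STIFS scoring in Eqs. (1)-(3). It assumes `frames` is a list of equally sized grayscale frames stored as NumPy arrays; the helper name `stifs_select` and the tie-breaking details are illustrative rather than taken from Algorithm 1 verbatim.

```python
import numpy as np

def stifs_select(frames, t, k=2):
    """Rank the neighbours of frames[t] by r = MPVD / TD (Eq. 3) and
    return the indices of the k highest-ranked frames."""
    target = frames[t].astype(np.float64)
    scores = []
    for i, frame in enumerate(frames):
        if i == t:
            continue
        mpvd = np.mean(np.abs(target - frame.astype(np.float64)))  # Eq. (1)
        td = abs(i - t)                                            # Eq. (2)
        scores.append((mpvd / td, i))                              # Eq. (3)
    # A higher score favours frames that are spatially different yet
    # temporally close to the target (see the discussion around Fig. 3).
    scores.sort(reverse=True)
    return sorted(i for _, i in scores[:k])
```

Applying `stifs_select` to every timestamp of a clip produces the \(n \times 3\) input space described above.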

Algorithm 2

FIFS Algorithm

FIFS for Feature-Based Selection

Feature Descriptor in FIFS

Image features can be broadly categorized as deep features and shallow features. Deep features are extracted using deep learning techniques, such as convolutional neural networks like ResNet [39] and VGG16 [38], and capture high-level, semantic information about the content of an image. Shallow features, on the other hand, are extracted using conventional image processing techniques such as edge detection, colour histograms, and texture analysis. These features capture low-level information about the image, such as edges, colour distribution, and texture patterns. Despite the sophistication of the information representation in deep features, they are not suitable as selection measures in VSR because of their heavy computational complexity and high memory requirements. They are also not rotation-invariant and are therefore easily affected by changes in orientation across video frames.

In contrast, shallow features, such as those extracted using ORB [36], are better suited for use as a selection measure in VSR due to their computational efficiency and robustness to changes in image scale and orientation [40]. The ORB extractor identifies shallow features by detecting key points with the FAST detector, ranked using the Harris corner measure, and then extracting rotation-aware binary descriptors based on intensity comparisons in a patch around each key point. Figure 4 illustrates the key points identified by the ORB feature extractor for a conventional 2D video frame and an equirectangular 360° video frame, respectively, using the proposed FIFS algorithm. By adopting ORB as the feature extraction method, our proposed Feature-based Input Frame Selection (FIFS) algorithm offers rotation-invariant and scale-invariant capabilities, implying that it is robust to changes in frame orientation and in the size of objects in the video. ORB also has a relatively low computational cost compared to other shallow and deep feature extraction methods, making it better suited to real-time processing as part of the proposed FIFS algorithm.

Fig. 6

Prototype VSR model architecture with three input frames

Matching the Descriptors in FIFS

Figure 5 illustrates a visual example of a feature matching result from the proposed FIFS algorithm. The FIFS algorithm successfully identifies the relevance between the two frames by adopting the Brute Force matching technique [42] to match ORB key point descriptors between the frames. This technique compares each descriptor in the target frame (\(F_t\)) to every descriptor in the neighbouring frame (\(F_{t \pm \delta }\)) and finds the closest match. The process is repeated for every descriptor in the target frame, resulting in a set of matches between each pair of frames, as shown in Fig. 5.

The FIFS algorithm adopts the Brute Force matching technique because it is a general-purpose method that makes no assumptions about data structures or distributions [42]. As a result, it works well with any feature descriptors whose pairwise distances can be computed. The proposed FIFS algorithm performs feature descriptor matching between a target frame (\(F_t\)) and a given neighbouring frame (\(F_{t \pm \delta }\)) by applying the Brute Force matching technique after the ORB feature extraction process. The matching step finally returns a set of matches, where each match represents a corresponding key point between the two frames. The steps of the proposed FIFS algorithm, using ORB features and the Brute Force matching technique, are outlined in Algorithm 2.

By leveraging feature-based techniques such as keypoint extraction, descriptor computation, and matching, the FIFS algorithm dynamically selects neighbouring frames for video super-resolution. The algorithm identifies distinctive keypoint features and computes descriptors to capture local structure and appearance information. By comparing these descriptors, the algorithm determines the number of matches between the target frame and other frames, indicating their similarity. The frames with the highest number of matches are selected as neighbours, resulting in an input space that incorporates frames exhibiting high relevance to the target frame. This feature-based frame selection approach employed by the FIFS algorithm enhances the utilisation of relevant information for super-resolution, potentially resulting in improved frame reconstruction accuracy and enhanced observable details in the resulting high-resolution videos.
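As an illustration of this extract-match-select pipeline, the sketch below uses OpenCV's ORB extractor and brute-force Hamming matcher. The parameter values (e.g. `nfeatures`) and helper names are assumptions for illustration, not the exact configuration of Algorithm 2.

```python
import cv2

def fifs_match_count(target, neighbour, nfeatures=500):
    """Return the number of ORB descriptor matches between two frames."""
    orb = cv2.ORB_create(nfeatures=nfeatures)
    _, des_t = orb.detectAndCompute(target, None)
    _, des_n = orb.detectAndCompute(neighbour, None)
    if des_t is None or des_n is None:
        return 0  # no key points detected in one of the frames
    # Hamming distance suits ORB's binary descriptors; crossCheck keeps
    # only mutually best matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    return len(matcher.match(des_t, des_n))

def fifs_select(frames, t, k=2):
    """Select the k neighbours with the most descriptor matches to frames[t]."""
    scores = [(fifs_match_count(frames[t], f), i)
              for i, f in enumerate(frames) if i != t]
    scores.sort(reverse=True)
    return sorted(i for _, i in scores[:k])
```

The match count serves as the relevance score: frames sharing more distinctive local structure with the target rank higher, mirroring the selection described above.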

Empirical Study

A two-stage empirical study is conducted to investigate the effects of the proposed selection algorithms on super-resolution performance. The impact of the proposed algorithms is first evaluated on a prototype VSR model, as discussed in “Selection in Prototype VSR Model”. A sophisticated state-of-the-art 360° video super-resolution model is then considered in “Selection in State-of-the-Art VSR Model” to explore the applicability of the proposed selection algorithms to a more complex and challenging task, specifically 360° video super-resolution. The results of this study provide a deeper understanding of the potential benefits of using selection algorithms in VSR tasks.

Selection in Prototype VSR Model

A prototype VSR model is built to facilitate the empirical study of the impact of the input frame selection algorithms. As shown in Fig. 6, this model is based on a residual convolutional neural network architecture with a total of 5.4 million parameters. The model employs a feature extraction module composed of a series of convolution layers to extract the spatial and spatiotemporal information from the input video frames. It is a non-alignment model that uses joint feature extraction between a target frame (\(F_t\)) and its two neighbouring frames (\(F_{t \pm \delta }\)) to allow the extraction of unique temporal characteristics between them, even without any implicit or explicit frame alignment. Additionally, the module includes a self-attention mechanism, which uses spatial attention and channel attention to enhance the model’s ability to focus on the most relevant features in the input frames.

The extracted features are then refined through a series of ten residual blocks, each consisting of two convolution operations, a ReLU activation layer and a skip connection, as shown in Fig. 6. The feature refinement step using residual blocks is followed by up-sampling using a pixel-shuffle operation. The model follows the residual learning paradigm: it learns the residue features that are added element-wise to the bicubically interpolated target frame input (\(F_t\)) to generate the corresponding high-resolution output frame (\(F_t \times 4\)), as shown in Fig. 6.
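The description above can be made concrete with a brief PyTorch sketch of two of the building blocks, the residual block and the x4 pixel-shuffle up-sampler, together with the residual reconstruction step. Channel counts and kernel sizes are illustrative assumptions, not the prototype's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two convolutions, a ReLU and a skip connection (cf. Fig. 6)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # skip connection

class PixelShuffleUpsampler(nn.Module):
    """x4 up-sampling as two x2 pixel-shuffle stages, ending in an RGB residue."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1),
            nn.PixelShuffle(2),
            nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1),
            nn.PixelShuffle(2),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

def reconstruct(residue: torch.Tensor, target_lr: torch.Tensor) -> torch.Tensor:
    """Add the learned residue to the bicubically up-scaled target frame."""
    base = F.interpolate(target_lr, scale_factor=4,
                         mode="bicubic", align_corners=False)
    return residue + base
```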

Model Training

This prototype VSR model was trained on the Vimeo90k train set until learning saturation occurred at thirty epochs. Bicubic downsampling was used to generate the LR input frames. The prototype model was trained under three separate settings that varied only in the input frame selection mechanism used: (i) FIFS as the input selection algorithm, (ii) STIFS as the input selection algorithm, and (iii) a conventional sliding window with no selection. The corresponding input selection mechanisms were likewise used in the test phase.

The Adam optimiser with SmoothL1 loss was used to train the model, as this loss combines the advantages of L1 loss (steady gradients for large values) and L2 loss (less oscillation during updates when values are small); it is thus less sensitive to outliers and helps prevent exploding gradients. The initial learning rate was set to \(1 \times 10^{-4}\) and decayed by a factor of 10 after every 10 epochs. Model training and testing were performed using two NVIDIA Tesla V100 GPUs.
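A minimal PyTorch sketch of this training configuration follows; `model` and `train_loader` are placeholders for the prototype network and a Vimeo90k data loader, and are assumptions rather than the actual training script.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, epochs: int = 30, device: str = "cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Decay the learning rate by a factor of 10 every 10 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    criterion = nn.SmoothL1Loss()
    for epoch in range(epochs):  # learning saturated at thirty epochs
        for lr_frames, hr_target in train_loader:
            lr_frames, hr_target = lr_frames.to(device), hr_target.to(device)
            optimizer.zero_grad()
            loss = criterion(model(lr_frames), hr_target)
            loss.backward()
            optimizer.step()
        scheduler.step()
```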

Table 2 Super-resolution results in terms of PSNR/SSIM on benchmark Vid4 [28] dataset from Prototype VSR model with varied input frame selection approaches
Table 3 Super-resolution results in terms of PSNR/SSIM on benchmark UDM10 [43] dataset from Prototype VSR model with varied input frame selection approaches
Fig. 7

Subjective inspection of visual quality generated by the prototype model when using selection (with FIFS) compared to the conventional sliding window with no selection on the "camera" clip (a) and "city" clip (b)

Table 4 Comparative evaluation showcasing the impact of using selection in the input space versus no selection on the state-of-the-art 360° VSR model S3PO with the 360 Video Dataset [41]

Comparison

The empirical evaluation presented in Tables 2 and 3 provides valuable insights into the performance of the prototype VSR model under different input frame selection approaches. The results indicate that incorporating the dynamic content-aware frame selection algorithms, namely FIFS and STIFS, significantly enhances super-resolution performance compared to the model without frame selection.

In terms of quantitative metrics, both FIFS and STIFS consistently outperform the model without frame selection, as demonstrated by higher PSNR and SSIM scores across various video clips. The average PSNR/SSIM values in Tables 2 and 3 reinforce the superiority of FIFS and STIFS over the no-select approach. The FIFS algorithm, in particular, consistently achieves the best performance in terms of PSNR/SSIM, demonstrating its effectiveness in selecting frames with high spatial and temporal correlations to the target frames.

Moreover, the individual clip analysis reveals notable improvements achieved by the FIFS and STIFS algorithms over the no-select scenario. For instance, in Table 3, the "camera" clip exhibits a significant PSNR improvement of 1.1934 dB using FIFS and 0.9642 dB using STIFS. This demonstrates the capability of the proposed dynamic content-aware selection to capture essential details and enhance the overall visual quality in challenging scenarios.

The qualitative visualisations in Fig. 7 further reinforce the quantitative findings, showcasing the merit of FIFS and STIFS in restoring fine details, textures, and sharper edges in the high-resolution frames compared to the no-select scenario. The restored frames exhibit enhanced visual fidelity, indicating the ability of the frame selection algorithms to effectively leverage spatial and temporal correlations between frames, thereby improving the reconstruction quality.

Selection in State-of-the-Art VSR Model

We study the applicability of an input frame selection mechanism, specifically the proposed Feature-based Input Frame Selection (FIFS) algorithm, to 360° video super-resolution. 360° Video Super-Resolution (360VSR) is a challenging task, as conventional video processing methods are not well-suited to the distorted nature of equirectangular 360° video frames. However, the recently proposed Spherical Signal Super-resolution with Proportioned Optimization (S3PO) [41] model addresses these 360°-specific requirements by incorporating strategic optimisation and feature extraction while utilising 2D convolution layers. Despite its non-alignment architecture, S3PO has been shown to outperform state-of-the-art VSR models in 360° video super-resolution.

Therefore, in this study, we aim to enhance the performance of the S3PO model by incorporating the FIFS algorithm for input frame selection. The FIFS algorithm has been shown to significantly improve super-resolution outcomes for conventional 2D videos, as discussed in “Comparison”. Furthermore, the scale- and rotation-invariant capability of FIFS makes it well-suited to feature-based selection in 360° videos, as the likelihood of scale and rotation variations across frames is even higher in these videos because of the distortions present in equirectangular frames.

To carry out this investigation, we fine-tuned the pre-trained S3PO model, replacing its default selection of the immediate consecutive past and future neighbouring frames as input with the proposed FIFS algorithm. We then evaluated the performance of the fine-tuned model using PSNR, SSIM, Weighted-Spherically PSNR (WS-PSNR) and Weighted-Spherically SSIM (WS-SSIM) [44]. The evaluation results on five sampled clips from the test set of the 360 Video Dataset (360VDS) [41] are presented in Table 4, demonstrating consistent improvement from the FIFS algorithm across all evaluation metrics. This further signifies the applicability and effectiveness of the proposed input frame selection mechanism for 360° video super-resolution.
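For reference, WS-PSNR weights the per-pixel squared error by the cosine of each row's latitude before computing the PSNR, compensating for the over-representation of polar regions in equirectangular projection. The sketch below follows the standard equirectangular weighting described in [44] and is an illustrative implementation, not the evaluation script used here.

```python
import numpy as np

def ws_psnr(ref: np.ndarray, dist: np.ndarray, max_val: float = 255.0) -> float:
    """WS-PSNR for equirectangular frames of shape (H, W) or (H, W, C)."""
    h = ref.shape[0]
    # Weight each pixel row by the cosine of its latitude on the sphere.
    w = np.cos((np.arange(h) + 0.5 - h / 2) * np.pi / h)
    w = w.reshape(-1, *([1] * (ref.ndim - 1)))  # broadcast over width/channels
    err = (ref.astype(np.float64) - dist.astype(np.float64)) ** 2
    wmse = np.average(err, weights=np.broadcast_to(w, err.shape))
    return 10.0 * np.log10(max_val ** 2 / wmse)
```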

Conclusion and Future Directions

In this study, we investigate the impact of using image comparison measures for input frame selection in video super-resolution. Despite the potential of this approach, it had yet to be explored in the VSR literature. Addressing this gap, we conduct an extensive analysis of image similarity measures and develop two dynamic content-aware input frame selection algorithms for VSR: the SpatioTemporal Input Frame Selection (STIFS) algorithm and the Feature-based Input Frame Selection (FIFS) algorithm. Our empirical study shows that these algorithms outperform conventional sliding window methods in terms of both PSNR and SSIM quality metrics on benchmark datasets. Furthermore, we extend the applicability of the best-performing selection algorithm to a state-of-the-art 360° video super-resolution model, resulting in even greater improvement. Our key contribution is a cost-effective, efficient and practical alternative to the increasingly complex architectures that drive the VSR literature, opening up a new avenue of research for the field.

Based on the results of the empirical study, the shallow feature-based approach could be further enhanced for the VSR task by not only using the identified features for the selection process but also feeding them as additional inputs to the VSR model. Shallow features extracted in this way contain low-level information about the frames, such as edges, colour distributions, and texture patterns, and could complement the deep features extracted by VSR models when learning the super-resolution task.

The proposed dynamic content-aware input frame selection mechanism also opens up further pathways for future work. STIFS and FIFS currently select a fixed number of input frames from fixed-size selection windows; their adaptability could be enhanced by allowing a varied number of input frames and varying sizes of selection windows. This flexibility would enable fine-tuning of the selection process based on video characteristics, dataset properties, and application requirements, reinforcing STIFS and FIFS as adjunct techniques that can be seamlessly integrated with any VSR model to enhance super-resolution performance. Finally, the scalability and efficiency of the selection algorithms could be improved to accommodate real-time or near-real-time applications; investigating parallelisation, hardware acceleration, or optimisation techniques could reduce computational complexity and enable faster selection of input frames.