iSeeBetter: Spatio-temporal video super-resolution using recurrent generative back-projection networks

Recently, learning-based models have enhanced the performance of single-image super-resolution (SISR). However, applying SISR successively to each video frame leads to a lack of temporal coherency. Convolutional neural networks (CNNs) outperform traditional approaches in terms of image quality metrics such as peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). On the other hand, generative adversarial networks (GANs) offer a competitive advantage by being able to mitigate the issue of a lack of finer texture details, usually seen with CNNs when super-resolving at large upscaling factors. We present iSeeBetter, a novel GAN-based spatio-temporal approach to video super-resolution (VSR) that renders temporally consistent super-resolution videos. iSeeBetter extracts spatial and temporal information from the current and neighboring frames using the concept of recurrent back-projection networks as its generator. Furthermore, to improve the “naturality” of the super-resolved output while eliminating artifacts seen with traditional algorithms, we utilize the discriminator from the super-resolution generative adversarial network (SRGAN). Although mean squared error (MSE) as a primary loss-minimization objective improves PSNR/SSIM, these metrics may not capture fine details in the image, resulting in misrepresentation of perceptual quality. To address this, we use a four-fold (MSE, perceptual, adversarial, and total-variation) loss function. Our results demonstrate that iSeeBetter offers superior VSR fidelity and surpasses state-of-the-art performance.


Introduction
The goal of super-resolution (SR) is to enhance a low-resolution (LR) image to a higher-resolution (HR) image by filling in missing fine-grained details in the LR image. The domain of SR research can be divided into three main areas: single image SR (SISR) [6,15,17,27], multi image SR (MISR) [9,10] and video SR (VSR) [3,16,23,38,43].
Consider an LR video source which consists of a sequence of LR video frames $LR_{t-n}, \ldots, LR_t, \ldots, LR_{t+n}$, where we super-resolve a target frame $LR_t$.
The idea behind SISR is to super-resolve $LR_t$ by utilizing spatial information inherent in the frame, independently of other frames in the video sequence. However, this technique fails to exploit the temporal details inherent in a video sequence, resulting in temporal incoherence. MISR seeks to address just that: it utilizes the missing details available from the neighboring frames $LR_{t-n}, \ldots, LR_t, \ldots, LR_{t+n}$ and fuses them for super-resolving $LR_t$. After spatially aligning frames, missing details are extracted by separating differences between the aligned frames from missing details observed only in one or some of the frames. However, in MISR, the alignment of the frames is done without any concern for temporal smoothness, which is in stark contrast to VSR, where the frames are typically aligned in temporally smooth order.
Traditional VSR methods upscale based on a single degradation model (usually bicubic interpolation) followed by reconstruction. This is sub-optimal and adds computational complexity [39]. Recently, learning-based models that utilize convolutional neural networks (CNNs) have outperformed traditional approaches in terms of widely-accepted image reconstruction metrics such as peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).
In some recent VSR methods that utilize CNNs, frames are concatenated [23] or fed into recurrent neural networks (RNNs) [20] in temporal order, without explicit alignment. In other methods, the frames are aligned explicitly, using motion cues between temporal frames with the alignment modules [3,33,38,43]. The latter set of methods generally renders temporally smoother results compared to the methods with no explicit spatial alignment [20,31]. However, these VSR methods suffer from a number of problems. In the frame-concatenation approach [3,23,33], many frames are processed simultaneously in the network, resulting in significantly higher network training times. With methods that use RNNs [20,38,43], modeling both subtle and significant changes simultaneously (e.g., slow and quick motions of foreground objects) is a challenging task even if long short-term memory units (LSTMs), which are designed for maintaining long-term temporal dependencies [12], are deployed. A crucial aspect of an effective VSR system is the ability to handle motion sequences, which are often integral components of videos [3,34].
The proposed method, iSeeBetter, is inspired by recurrent back-projection networks (RBPNs) [16], which utilize "back-projection" as their underpinning approach, originally introduced in [21,22] for MISR. The basic concept behind back-projection is to iteratively calculate residual images as reconstruction error between a target image and a set of neighboring images. The residuals are then back-projected to the target image for improving super-resolution accuracy. The multiple residuals enable representation of subtle and significant differences between the target frame and its adjacent frames, thus exploiting temporal relationships between adjacent frames as shown in Fig. 1. Deep back-projection networks (DBPNs) [15] use back-projection to perform SISR using learning-based methods by estimating the output frame $SR_t$ using the corresponding $LR_t$ frame. To this end, DBPN produces a high-resolution feature map that is iteratively refined through multiple up- and down-sampling layers. RBPN offers superior results by combining the benefits of the original MISR back-projection approach with DBPN. Specifically, RBPN uses the idea of iteratively refining HR feature maps from DBPN, but extracts missing details using neighboring video frames like the original back-projection technique [21,22]. This results in superior SR accuracy.
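The back-projection idea can be illustrated with a toy example. The sketch below is a classic iterative back-projection loop on a single 1-D signal, not RBPN itself: the block-averaging `downsample`, the repetition-based `upsample`, the `step` size, and all function names are assumptions chosen for brevity.

```python
def downsample(hr, scale):
    """Average each block of `scale` samples (assumed degradation model)."""
    return [sum(hr[i:i + scale]) / scale for i in range(0, len(hr), scale)]


def upsample(lr, scale):
    """Repeat each sample `scale` times (assumed back-projection kernel)."""
    return [v for v in lr for _ in range(scale)]


def back_project(lr, scale, n_iters=10, step=0.5):
    """Refine an HR estimate until its downsampled version matches `lr`."""
    sr = [0.0] * (len(lr) * scale)          # deliberately poor initial estimate
    errors = []
    for _ in range(n_iters):
        # Residual: reconstruction error measured in LR space.
        residual = [a - b for a, b in zip(lr, downsample(sr, scale))]
        errors.append(max(abs(r) for r in residual))
        # Back-project the (upsampled) residual onto the HR estimate.
        sr = [s + step * r for s, r in zip(sr, upsample(residual, scale))]
    return sr, errors


sr, errors = back_project([1.0, 2.0, 4.0], scale=2)
```

With these toy operators the residual halves at every iteration, so `errors` shrinks monotonically. In a learned setting such as DBPN or RBPN, the fixed operators are replaced by trainable up- and down-sampling layers.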
To mitigate the issue of a lack of finer texture details when super-resolving at large upscaling factors that is usually seen with CNNs [30], iSeeBetter utilizes GANs with a loss function that weighs adversarial loss, perceptual loss [30], mean squared error (MSE)-based loss and total-variation (TV) loss [37]. Our approach combines the merits of RBPN and SRGAN [30]: it is based on RBPN as its generator and is complemented by SRGAN's discriminator architecture, which is trained to differentiate between super-resolved images and original photo-realistic images. Blending these techniques yields iSeeBetter, a state-of-the-art system that is able to recover precise photo-realistic textures and motion-based scenes from heavily downsampled videos.
Our contributions include the following key innovations.
Combining the state-of-the-art in SR: We propose a model that leverages two superior SR techniques: (i) RBPN, which is based on the idea of integrating SISR and MISR in a unified VSR framework using back-projection, and (ii) SRGAN, which is a framework capable of inferring photorealistic natural images. RBPN enables iSeeBetter to extract details from neighboring frames, complemented by the generator-discriminator architecture in GANs which pushes iSeeBetter to generate more realistic and appealing frames while eliminating artifacts seen with traditional algorithms [47]. iSeeBetter thus yields more than the sum of the benefits of RBPN and SRGAN.
"Optimizing" the loss function: Pixel-wise loss functions such as L1 loss, used in RBPN [16], struggle to handle the uncertainty inherent in recovering lost high-frequency details such as complex textures that commonly exist in many videos. Minimizing MSE encourages finding pixel-wise averages of plausible solutions that are typically overly-smooth and thus have poor perceptual quality [2,7,24,36].
To address this, we adopt a four-fold (MSE, perceptual, adversarial, and TV) loss function for superior results. Similar to SRGAN [30], we utilize a loss function that optimizes perceptual quality by minimizing adversarial loss and MSE loss. Adversarial loss helps improve the "naturality" associated with the output image using the discriminator. On the other hand, perceptual loss focuses on optimizing perceptual similarity instead of similarity in pixel space. Furthermore, we use a de-noising loss function called TV loss [1]. We carried out experiments comparing L1 loss with our four-fold loss and found significant improvements with the latter (cf. Section 4).
Extended evaluation protocol: To evaluate iSeeBetter, we used standard datasets: Vimeo90K [48], Vid4 [32] and SPMCS [43]. Since Vid4 and SPMCS lack significant motion sequences, we included Vimeo90K, a dataset containing various types of motion. This enabled us to conduct a more holistic evaluation of the strengths and weaknesses of iSeeBetter. To make iSeeBetter more robust and enable it to handle real-world videos, we expanded the spectrum of data diversity and wrote scripts to collect additional data from YouTube. As a result, we augmented our dataset to about 170,000 clips.
User-friendly infrastructure: We built several useful tools to download and structure datasets, visualize temporal profiles of intermediate blocks and the output, and run predefined benchmark sequences on a trained model to be able to iterate on different models quickly. In addition, we built a video-to-frames tool to directly input videos to iSeeBetter, rather than frames. We also ensured our script infrastructure is flexible (such that it supports a myriad of options) and can be easily leveraged. The code and pre-trained models are available at https://iseebetter.amanchadha.com.

Related work
Since the seminal work by Tsai on image registration [44] two decades ago, many SR techniques based on various underlying principles have been proposed. Initial methods included spatial or frequency domain signal processing, statistical models and interpolation approaches [49]. In this section, we focus our discussion on learning-based methods which have emerged as superior VSR techniques compared to traditional statistical methods.

Deep SISR
First introduced by SRCNN [6], deep SISR required a predefined up-sampling operator. Further improvements in this field include better up-sampling layers [39], residual learning [42], back-projection [15], recursive layers [28], and progressive up-sampling [29]. A significant milestone in SR research was the introduction of a GAN-powered SR approach [30], which achieved state-of-the-art performance.

Deep VSR
Deep VSR can be divided into five types based on the approach to preserving temporal information.
(a) Temporal Concatenation. The most popular approach to retain temporal information in VSR is concatenating multiple frames [3,23,26,31]. This approach can be seen as an extension of SISR to accept multiple input images. However, this approach fails to represent multiple motion regimes within a single input sequence since the input frames are simply concatenated together.
(b) Temporal Aggregation. To address the dynamic motion problem in VSR, [33] proposed multiple SR inferences which work on different motion regimes. The final layer aggregates the outputs of all branches to construct the SR frame. However, this approach still concatenates many input frames, resulting in lengthy convergence during global optimization.
(c) Recurrent Networks. RNNs deal with temporal inputs and/or outputs and have been deployed in a myriad of applications ranging from video captioning [25,35,50] and video summarization [5,45] to VSR [20,38,43]. Two types of RNNs have been used for VSR. A many-to-one architecture is used in [20,43], where a sequence of LR frames is mapped to a single target HR frame. A many-to-many RNN has recently been used by [38], where an optical flow network accepts $LR_{t-1}$ and $LR_t$, and its output is fed to an SR network along with $LR_t$. This approach was first proposed by [20] using bidirectional RNNs. However, the network has a small capacity and has no frame alignment step. Further improvement is proposed by [43] using a motion compensation module and a ConvLSTM layer [40].
(d) Optical Flow-Based Methods. The above methods estimate a single HR frame by combining a batch of LR frames and are thus computationally expensive. They often result in unwanted flickering artifacts in the output frames [37]. To address this, [38] proposed a method that utilizes a network trained on estimating the optical flow along with the SR network. Optical flow methods allow estimation of the trajectories of moving objects, thereby assisting in VSR. [26] warp video frames $LR_{t-1}$ and $LR_{t+1}$ onto $LR_t$ using the optical flow method of [8], concatenate the three frames, and pass them through a CNN that produces the output frame $SR_t$. [3] follow the same approach but replace the optical flow model with a trainable motion compensation network.
(e) Pre-Training then Fine-Tuning vs. End-to-End Training. While most of the above-mentioned methods are end-to-end trainable, certain approaches first pre-train each component before fine-tuning the system as a whole in a final step [3,33,43].
Our approach is a combination of (i) an RNN-based optical flow method that preserves spatio-temporal information in the current and adjacent frames as the generator and, (ii) a discriminator that is adept at ensuring the generated SR frame offers superior fidelity.

Datasets
To train iSeeBetter, we amalgamated diverse datasets with differing video lengths, resolutions, motion sequences, and number of clips. Tab. 1 presents a summary of the datasets used. When training our model, we generated the corresponding LR frame for each HR input frame by performing 4× down-sampling using bicubic interpolation. We thus perform self-supervised learning by automatically generating the input-output pairs for training without any human intervention. To further extend our dataset, we wrote scripts to collect additional data from YouTube. The dataset was shuffled for training and testing. Our training/validation/test split was 80%/10%/10%.
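The shuffle-and-split step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the clip identifiers, the fixed seed, and the function name `split_clips` are hypothetical.

```python
import random


def split_clips(clips, train=0.8, val=0.1, seed=0):
    """Shuffle clips and split them into train/validation/test subsets."""
    clips = list(clips)
    random.Random(seed).shuffle(clips)      # fixed seed for a reproducible split
    n_train = int(len(clips) * train)
    n_val = int(len(clips) * val)
    return (clips[:n_train],
            clips[n_train:n_train + n_val],
            clips[n_train + n_val:])        # the remainder becomes the test set


clips = [f"clip_{i:04d}" for i in range(1000)]   # hypothetical clip identifiers
train_set, val_set, test_set = split_clips(clips)
```

Splitting at the clip level rather than the frame level keeps frames of one clip from landing in both the training and the test subsets.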

Network architecture
Fig. 2 shows the iSeeBetter architecture that consists of RBPN [16] and SRGAN [30] as its generator and discriminator respectively. Tab. 2 shows our notational convention. RBPN has two approaches that extract missing details from different sources: SISR and MISR. Fig. 3 shows the horizontal flow (represented by blue arrows in Fig. 2) that enlarges $LR_t$ using SISR. Fig. 4 shows the vertical flow (represented by red arrows in Fig. 2) which is based on MISR that computes residual features from (i) a pair of $LR_t$ and its neighboring frames ($LR_{t-1}, \ldots, LR_{t-n}$) coupled with (ii) the precomputed dense motion flow maps ($F_{t-1}, \ldots, F_{t-n}$). At each projection step, RBPN observes the missing details from $LR_t$ and extracts residual features from neighboring frames to recover details.
The convolutional layers that feed the projection modules in Fig. 2 thus serve as initial feature extractors. Within the projection modules, RBPN utilizes a recurrent encoder-decoder mechanism for fusing details extracted from adjacent frames in SISR and MISR and incorporates them into the estimated frame $SR_t$ through back-projection.
The convolutional layer that operates on the concatenated output from all the projection modules is responsible for generating $SR_t$. Once $SR_t$ is synthesized, it is sent over to the discriminator (shown in Fig. 5) to validate its "authenticity".

Loss functions
The perceptual image quality of the resulting SR image is dependent on the choice of the loss function.
MSE is the most commonly used loss function across a wide variety of state-of-the-art SR approaches; minimizing it aims to improve the PSNR of an image [19]. While optimizing MSE during training improves PSNR and SSIM, these metrics may not capture fine details in the image, leading to misrepresentation of perceptual quality [30]. The ability of MSE to capture intricate texture details based on pixel-wise frame differences is very limited, and can cause the resulting video frames to be overly-smooth [4]. In a series of experiments, it was found that even manually distorted images had an MSE score comparable to the original image [46]. To address this, iSeeBetter uses a four-fold (MSE, perceptual, adversarial, and TV) loss instead of solely relying on pixel-wise MSE loss. We weigh these losses together as a final evaluation standard for training iSeeBetter, thus taking into account both pixel-wise similarities and high-level features. Fig. 6 shows the individual components of the iSeeBetter loss function.
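The link between MSE and PSNR follows from the standard definition of PSNR as $10 \log_{10}(MAX^2 / MSE)$, which decreases monotonically as MSE grows. The sketch below assumes 8-bit frames (peak value 255); the function name and the sample MSE values are illustrative only.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse == 0:
        return math.inf                      # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


psnr_low_error = psnr(25.0)     # smaller MSE ...
psnr_high_error = psnr(100.0)   # ... gives a higher PSNR than a larger MSE
```

Because PSNR depends on the frames only through MSE, a model trained on MSE alone optimizes PSNR directly, with no term that rewards texture realism.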

MSE loss
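We use pixel-wise MSE loss (also called content loss [30]) for the estimated frame $SR_t$ against the ground truth $HR_t$:

$$MSE_t = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left( (HR_t)_{x,y} - G_{\theta_G}(LR_t)_{x,y} \right)^2$$

where $G_{\theta_G}(LR_t)$ is the estimated frame $SR_t$. $W$ and $H$ represent the width and height of the frames respectively.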
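The pixel-wise MSE described above averages squared differences over all W × H pixels. A minimal sketch, using nested Python lists as stand-ins for single-channel frames (the function name and the toy frames are assumptions):

```python
def mse_loss(sr, hr):
    """Mean squared error between two equally sized 2-D frames."""
    height, width = len(hr), len(hr[0])
    total = sum((h - s) ** 2
                for hr_row, sr_row in zip(hr, sr)
                for h, s in zip(hr_row, sr_row))
    return total / (width * height)


hr_frame = [[0.0, 1.0], [2.0, 3.0]]     # toy ground-truth frame
sr_frame = [[0.0, 1.0], [2.0, 5.0]]     # toy estimate with one wrong pixel
loss = mse_loss(sr_frame, hr_frame)     # (5 - 3)^2 spread over 4 pixels
```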

Perceptual loss
[2,11] introduced a new loss function called perceptual loss, also used in [24,30], which focuses on perceptual similarity instead of similarity in pixel space. Perceptual loss relies on features extracted from the activation layers of the pre-trained VGG-19 network in [41], instead of low-level pixel-wise error measures. We define perceptual loss as the Euclidean distance between the feature representations of the estimated SR image $G_{\theta_G}(LR_t)$ and the ground truth $HR_t$:

$$PerceptualLoss_t = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( VGG_{i,j}(HR_t)_{x,y} - VGG_{i,j}(G_{\theta_G}(LR_t))_{x,y} \right)^2$$

where $VGG_{i,j}$ denotes the feature map obtained by the $j$-th convolution (after activation) before the $i$-th max-pooling layer in the VGG-19 network. $W_{i,j}$ and $H_{i,j}$ are the dimensions of the respective feature maps in the VGG-19 network.
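The difference between comparing frames in pixel space and in feature space can be shown with a toy example. In the sketch below, a horizontal-gradient map stands in for the VGG-19 feature map; this stand-in, along with every function name, is an assumption made so that the example runs without a pre-trained network.

```python
def mean_squared_diff(a, b):
    """Mean squared difference between two equally sized 2-D maps."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)) / n


def horizontal_gradient(frame):
    """Toy feature extractor (assumption): stands in for a VGG-19 layer."""
    return [[row[i + 1] - row[i] for i in range(len(row) - 1)] for row in frame]


def perceptual_loss(sr, hr, features=horizontal_gradient):
    """Compare frames in feature space rather than in pixel space."""
    return mean_squared_diff(features(sr), features(hr))


hr_frame = [[0.0, 1.0, 3.0], [2.0, 2.0, 5.0]]
sr_frame = [[v + 0.5 for v in row] for row in hr_frame]   # uniform brightness shift

pixel_loss = mean_squared_diff(sr_frame, hr_frame)    # penalizes the shift
feature_loss = perceptual_loss(sr_frame, hr_frame)    # unaffected under this toy feature
```

The uniform shift changes every pixel but leaves the structure of the frame intact, so the pixel-space loss is non-zero while the feature-space loss is zero. A real VGG-based loss behaves differently in detail, but the principle of measuring error on extracted features is the same.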
Fig. 3 DBPN [15] architecture for SISR, where we perform up-down-up sampling using 8 × 8 kernels with a stride of 4 and padding of 2. Similar to the ResNet architecture above, the DBPN network also uses Parametric ReLUs [18] as its activation functions.
Fig. 4 ResNet architecture for MISR that is composed of three tiles of five blocks where each block consists of two convolutional layers with 3 × 3 kernels, a stride of 1 and padding of 1. The network uses Parametric ReLUs [18] for its activations.
Fig. 5 Discriminator architecture from SRGAN [30]. The discriminator uses Leaky ReLUs for computing its activations.
Fig. 6 The MSE, perceptual, adversarial, and TV loss components of the iSeeBetter loss function.

Adversarial loss
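We use the generative component of iSeeBetter as the adversarial loss to limit model "fantasy", thus improving the "naturality" associated with the super-resolved image. Adversarial loss is defined as:

$$AdversarialLoss_t = -\log\left( D_{\theta_D}(G_{\theta_G}(LR_t)) \right)$$

where $D_{\theta_D}(G_{\theta_G}(LR_t))$ is the discriminator's estimate of the probability that the reconstructed frame $G_{\theta_G}(LR_t)$ is a real HR frame. We minimize $-\log(D_{\theta_D}(G_{\theta_G}(LR_t)))$ instead of $\log(1 - D_{\theta_D}(G_{\theta_G}(LR_t)))$ for better gradient behavior [13].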
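The "better gradient behavior" [13] is commonly understood as follows: early in training the discriminator rejects generated frames with high confidence, and a generator that minimizes $\log(1 - D(G(\cdot)))$ then receives vanishing gradients, whereas one that minimizes $-\log(D(G(\cdot)))$ does not. The sketch below assumes a discriminator with a sigmoid output over a logit `z`; the function names and the sample logit are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def adversarial_loss(p_real):
    """Generator loss -log(p), with p the discriminator output for a generated frame."""
    return -math.log(p_real)


def grad_non_saturating(z):
    """d/dz of -log(sigmoid(z)), where z is the discriminator logit."""
    return -(1.0 - sigmoid(z))


def grad_saturating(z):
    """d/dz of log(1 - sigmoid(z)), the original minimax generator objective."""
    return -sigmoid(z)


z = -6.0                                # discriminator confidently says "fake"
g_ns = grad_non_saturating(z)           # magnitude close to 1: strong signal
g_s = grad_saturating(z)                # magnitude close to 0: vanishing signal
```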

Total-Variation loss
TV loss was introduced as a loss function in the domain of SR by [1].It is defined as the sum of the absolute differences between neighboring pixels in the horizontal and vertical directions [47].Since TV loss measures noise in the input, minimizing it as part of our overall loss objective helps de-noise the output SR image and thus encourages spatial smoothness.TV loss is defined as follows:
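Following the textual definition above, a minimal sketch of TV loss sums absolute differences between horizontally and vertically adjacent pixels. The absence of normalization, the single-channel frames, and the function name are assumptions; implementations also exist that use squared differences or a square root of their sum.

```python
def tv_loss(frame):
    """Sum of absolute differences between horizontal and vertical neighbors."""
    height, width = len(frame), len(frame[0])
    horizontal = sum(abs(frame[y][x + 1] - frame[y][x])
                     for y in range(height) for x in range(width - 1))
    vertical = sum(abs(frame[y + 1][x] - frame[y][x])
                   for y in range(height - 1) for x in range(width))
    return horizontal + vertical


flat = [[1.0, 1.0], [1.0, 1.0]]     # perfectly smooth frame
noisy = [[1.0, 3.0], [0.0, 1.0]]    # same size, with pixel-level variation
```

A constant frame has zero total variation, while pixel-level noise raises it, which is why minimizing this term pushes the output toward spatial smoothness.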

Loss formulation
We define our overall loss objective for each frame as the weighted sum of the MSE, adversarial, perceptual, and TV loss components, with the weights α, β, γ, δ set to 1, $6 \times 10^{-3}$, $10^{-3}$ and $2 \times 10^{-8}$ respectively [14]. A discriminator loss is likewise computed for each frame. The total loss of an input sample is the average loss of all frames.
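The weighting and frame-averaging steps can be sketched as below. The four weight values are those quoted in the text, but which of the two middle weights belongs to the perceptual term and which to the adversarial term is an assumption here, as are the per-frame component values and the function names.

```python
def generator_loss(mse, perceptual, adversarial, tv,
                   w_mse, w_perceptual, w_adversarial, w_tv):
    """Weighted sum of the four per-frame loss components."""
    return (w_mse * mse + w_perceptual * perceptual
            + w_adversarial * adversarial + w_tv * tv)


def sample_loss(frame_losses):
    """Total loss of an input sample: the average over its frames."""
    return sum(frame_losses) / len(frame_losses)


# Hypothetical per-frame component values: (mse, perceptual, adversarial, tv).
components = [(0.010, 0.50, 0.70, 1200.0), (0.020, 0.40, 0.60, 1000.0)]
# Assumed weight-to-term pairing; only the four values come from the text.
weights = dict(w_mse=1.0, w_perceptual=6e-3, w_adversarial=1e-3, w_tv=2e-8)

per_frame = [generator_loss(*c, **weights) for c in components]
total = sample_loss(per_frame)
```

Passing the weights by name keeps the pairing explicit, so swapping the two middle weights is a one-line change if the other reading is intended.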

Experimental evaluation
To train the model, we used an Amazon EC2 P3.2xLarge instance with an NVIDIA Tesla V100 GPU with 16GB VRAM, 8 vCPUs and 64GB of host memory. We used the hyperparameters from RBPN and SRGAN. Tab. 3 compares iSeeBetter with six state-of-the-art VSR algorithms: DBPN [15], $B_{123}+T$ [33], DRDVSR [43], FRVSR [38], VSR-DUF [23] and RBPN/6-PF [16]. Tab. 4 offers a visual analysis of VSR-DUF and iSeeBetter. Tab. 5 shows ablation studies to assess the impact of using a generator-discriminator architecture and the four-fold loss as design decisions.

Conclusions and future work
We proposed iSeeBetter, a novel spatio-temporal approach to VSR that uses recurrent-generative back-projection networks. iSeeBetter couples the virtues of RBPN and SRGAN. RBPN enables iSeeBetter to generate superior SR images by combining spatial and temporal information from the input and neighboring frames. In addition, SRGAN's discriminator architecture fosters generation of photorealistic frames. We used a four-fold loss function that emphasizes perceptual quality. Furthermore, we proposed a new evaluation protocol for video SR by collating diverse datasets. With extensive experiments, we assessed the role played by various design choices in the ultimate performance of iSeeBetter, and demonstrated that on a vast majority of test video sequences, iSeeBetter advances the state-of-the-art.
To improve iSeeBetter, a couple of ideas could be explored. In visual imagery, the foreground receives much more attention than the background since it typically includes subjects such as humans. To improve perceptual quality, we can segment the foreground and background, and make iSeeBetter perform "adaptive VSR" by utilizing different policies for the foreground and background. For instance, we could draw details from a wider span of frames for the foreground than for the background. Another idea is to decompose a video sequence into scenes on the basis of frame-similarity and make iSeeBetter assign weights to adjacent frames based on which scene they belong to. Adjacent frames from a different scene can be weighed lower compared to frames from the same scene, thereby making iSeeBetter focus on extracting details from frames within the same scene, à la the concept of attention applied to VSR.

Tab. 4 Visually inspecting examples from Vid4, SPMCS and Vimeo90K comparing VSR-DUF and iSeeBetter. We chose VSR-DUF for comparison because it was the state-of-the-art at the time of publication. Top row: fine-grained textual features that help with readability; middle row: intricate high-frequency image details; bottom row: camera panning motion.
Tab. 1 Datasets used for training and evaluation