1 Introduction

Images or videos acquired from cameras are the outcome of algorithmic transformations of the sensor measurements. Such transformations are often modeled manually with the aim of maximizing visual perceptual quality. Better sensor measurements obviously result in higher perceptual quality. However, improving the sensor's output quality is not always an option due to several limitations, including but not limited to economical and/or physical constraints. Moreover, images/videos of higher perceptual quality could also be carefully selected from a wide variety of data, or even enhanced manually, to serve the same purpose.

The perceptual quality of mediocre images/videos, e.g. those acquired with consumer-grade cameras, can be improved by algorithmic enhancement techniques. For this purpose, learning example-based enhancement has attracted considerable interest in the literature and has been shown to be promising in Ignatov et al. (2017). Example-based enhancement methods assume that the source (mediocre) and target (high-quality) data examples are provided in some form. The assumed forms, however, are either impractical to obtain or limit the learning capabilities. We are concerned with the impractical assumption of the availability of pixel- and frame-wise pairing of source and target video examples.

In the literature, the problem of frame-wise alignment between example video pairs, i.e. the synchronization problem, is circumvented by merely addressing the enhancement frame-by-frame, which essentially boils down to the unpaired image enhancement problem (Chen et al., 2018; Ignatov et al., 2018). Here, unpairedness refers to the unknown pixel-wise alignment (or the correspondences) between source and target images. The unpaired image enhancement methods are both tailored and suitable for image enhancement. When they are applied independently to video frames, however, the output videos are very likely to be temporally inconsistent. By temporal inconsistency we refer to inhomogeneous enhancement across video frames. Such inconsistency results in unnatural videos, and thus lower perceptual quality.

Besides spatio-temporal inconsistency, the problem of video enhancement brings new challenges of higher computational cost and larger data volumes when compared to that of stills. Furthermore, video enhancement algorithms are often required to work in real time, subject to hard time constraints in applications such as live TV, streaming services, and video acquisition using smartphones. This work addresses all the aforementioned challenges, namely: (i) learning from unpaired videos; (ii) spatio-temporally consistent enhancement; (iii) speed- and memory-efficient training; and (iv) real-time inference. To tackle these challenges, we propose an efficient recurrent adversarial framework with the contributions listed below.

Learning from Unpaired Videos Inspired by Chen et al. (2018), Ignatov et al. (2018), we propose a new cyclic Generative Adversarial Network (GAN) based framework to learn the distribution map from the source domain (input videos) to the target domain (output videos) with the desired perceptual qualities, such that pairedness of the video examples is not required. The framework consists of two generators and a single discriminator. One of the two generators is trained to learn the quality map from the source domain to the target domain, while the other learns the reverse map (also referred to as the degrader). The discriminator is tasked to process both source and target videos. In particular, the discriminator takes a pair of videos – one from the source and the other from the target – as input, and learns the joint distribution between them. This allows us to use a single discriminator for both domains, leading to a simpler and more effective design.

Spatio-Temporal Consistency We exploit a new recurrent design for both the generators and the discriminator, such that the desired consistency is directly encouraged. In this regard, we introduce new recurrent cells that comprise interleaved local and global modules for better aggregation of spatial and temporal information. This design enables a fully connected view from the past spatio-temporal information in a sequence to each pixel location in the output.

Efficient Training Our generators and discriminator efficiently process videos using an improved recurrent latent space propagation (Fuoli et al., 2019). Such a design not only efficiently propagates the spatio-temporal information, but also allows us to build smaller networks. This greatly decreases the complexity of the involved discriminator and generators, thereby making our network both memory- and computation-efficient during training. Note that efficient training is necessary for high-resolution videos due to the large data volumes.

Real-Time Inference The real-time performance of our network is a direct consequence of our efficient design. This specifically applies to the generator that translates mediocre videos into high-quality ones, in other words, the enhancer. To the best of our knowledge, the proposed method is the first to achieve real-time performance for high-definition deep video enhancement. The attained inference speed is over 35 frames per second (fps), which we regard as suitable for many practical applications.

In order to achieve state-of-the-art results in real time, along with the necessary training efficiency and stability towards a more elegant data-driven framework, the following considerations were made in this paper. The desired properties are achieved by avoiding: (i) data/domain-specific handcrafted losses and division into auxiliary subtasks; (ii) normalization in the recurrent adversarial setting, which unnecessarily reduces the discriminator's capacity; and (iii) two separate discriminators – one for each of the source and target domains. To achieve better training stability, a temporal padding strategy is introduced. This enables the network to accumulate information in the latent state without exposing the unavoidable initialization to the discriminator. This strategy was observed to be very effective in our experiments with regard to training stability.

To evaluate the proposed recurrent framework, we rely on the Vid3oC dataset (Kim et al., 2019) as well as a new large-scale dataset that includes many more videos. In both datasets, the target videos are captured with better sensors than the source videos. This setup allows data to be collected in a very practical manner, e.g. using a rig where the source and target cameras are mounted side-by-side. Although this offers the possibility of collecting data with the same video content, pixel- and frame-wise alignment is not possible due to differences in viewpoint, camera lens and frame acquisition time stamps. In practice, this essentially offers only unpaired videos. The evaluation on these two datasets shows that our proposed method achieves highly competitive performance in a user study and on a set of standard benchmark metrics, compared to the state-of-the-art methods. Notably, the proposed method offers significantly better temporal consistency and higher computational efficiency (more than 4x higher than the compared methods), which enables us to perform real-time video enhancement.

In summary, the task of real-time video enhancement with efficient training is achieved by means of several considerations and contributions. In this regard, our main contributions can be summarized as follows:

  1. We exploit a novel cyclic generative adversarial framework to learn the target distribution from unpaired videos for the task of real-world video enhancement.

  2. We introduce a new recurrent design for both the generators and the discriminator, yielding better spatio-temporal consistency of the enhanced videos.

  3. We apply an improved recurrent latent space propagation for efficient spatio-temporal information diffusion, which greatly speeds up training.

  4. We achieve real-time video enhancement with highly competitive performance in terms of both quantitative and qualitative measures.

The rest of the paper is organized as follows. After setting our work in the context of the literature in Sect. 2, we explain the motivation for our method's design and describe the technical details of the proposed modules in Sect. 3. In Sect. 4, we outline the experimental setup and present the results of our extensive evaluations, comprising quantitative results, a user study, visual examples, temporal consistency curves, and an ablation study. Finally, we conclude our work in Sect. 5.

2 Related Work

The goal of video quality enhancement is to enhance low-quality videos towards videos with less artificial noise, more vivid colorization, sharper texture details or higher contrast. In the literature, there are two major families of video quality enhancement methods. The first family focuses on a single subtask of video enhancement. For example, some methods (Wang et al., 2017; Dai et al., 2017; Yang et al., 2018) are suggested for video compression artifact removal, some (Liu & Freeman, 2010; Varghese & Wang, 2010; Maggioni et al., 2012; Godard et al., 2018; Mildenhall et al., 2018) are designed for video denoising, some (Baker et al., 2011; Werlberger et al., 2011; Yu et al., 2013; Jiang et al., 2018; Mathieu et al., 2015; Niklaus et al., 2017; Jiang et al., 2018; Niklaus & Liu, 2018) are proposed for frame interpolation, and some (Su et al., 2017; Aittala & Durand, 2018; Gast & Roth, 2019) are developed for video deblurring. Besides, some (Liu et al., 2017; Jo et al., 2018; Liu et al., 2020; Wang et al., 2019; Tao et al., 2017; Sajjadi et al., 2018; Fuoli et al., 2019; Chu et al., 2018) aim to enhance the resolution of a given video by adding missing high-frequency information. For comprehensive reviews on these topics, we refer readers to Ghoniem et al. (2010), Nasrollahi and Moeslund (2014). Below we provide the details of four closely related works that belong to the first family of video enhancement methods. The first method (Xue et al., 2019) uses a neural network with separately trainable motion estimation and video processing components. Both components are then trained jointly to learn a task-oriented flow, which can be applied to various video enhancement subtasks separately. Differently, the second method (Chu et al., 2018) proposes a GAN-based approach for video super-resolution that results in temporally coherent solutions without sacrificing spatial detail. Similarly, the third method (Galteri et al., 2019) applies the MobileNetV2 architecture (Sandler et al., 2019) to the GAN (Goodfellow et al., 2014) model for fast artifact removal from compressed videos. Finally, the method proposed in Xiong et al. (2020) learns a two-stage GAN-based framework to enhance real-world low-light images in an unsupervised manner, while (Iizuka et al., 2016) provides a solution to colorize black-and-white images with full supervision. More general video domain translation methods have been proposed to handle large domain gaps in a broad range of tasks (e.g. Chen et al. 2019; Park et al. 2019).

The second family of video enhancement methods is the most closely related to the method proposed in this work. These methods aim to strengthen the compound visual quality of videos, which includes increasing color vividness, boosting contrast, and sharpening textures. The major issue with enhancing videos using methods of the second family is the practicality of collecting well-aligned training data. More precisely, these methods require input and target videos aligned in both the spatial and the temporal domains, which is a challenging requirement to meet in practice. To address this problem, some reinforcement learning based techniques (Hu et al., 2018; Park et al., 2018; Kosugi & Yamasaki, 2019) create pseudo input-retouched pairs by applying retouching operations sequentially. Other methods (Chen et al., 2018; Ignatov et al., 2018; Ni et al., 2020) exploit GAN models for the same task. In this regard, Chen et al. (2018) suggest training an image enhancement model by learning the target distribution from unpaired photographs. Particularly, the suggested method learns an enhancing map from a set of low-quality photos to a set of high-quality photographs using the GAN technique (Goodfellow et al., 2014), which has proven to be good at learning real data distributions. Similarly, Ignatov et al. (2018) apply the GAN technique to learn the distribution of separate visual elements (i.e., color and texture) of images. Such separation allows Ignatov et al. (2018) to map low-quality images to high-quality ones with relative ease, with more vivid colors and sharpened textures. Although these methods have demonstrated promising success for unsupervised image enhancement, their extension to videos is not trivial beyond applying them frame-by-frame. In this work, we show that such a straightforward extension produces temporally inconsistent enhancement results. In contrast, thanks to its well-designed recurrent framework, our proposed method is capable of achieving clearly better temporal consistency as well as much faster inference, with highly competitive performance in terms of other benchmark metrics.

3 Proposed Method

Fig. 1

General loss setting. The figure shows two opposite cycles and the single-discriminator GAN-loss. A high-quality sequence \(\hat{y}=G(x)\) and a low-quality sequence \(\hat{x}=H(y)\) are generated from a sample \((x,y)\sim p(x,y)\) drawn from the training data. To favor a consistent mapping between the domains, generated samples are mapped back and the cyclic loss is evaluated between \(x'=H(G(x))\), \(y'=G(H(y))\) and the real samples \(x,y\). The real \((x,y)\) and fake \((\hat{x},\hat{y})\) samples are fed to our single joint-distribution discriminator D to compute the respective logits \(s_\rho \), \(s_\phi \). The final relative real \(\rho \) and fake \(\phi \) scores are computed by applying the relativistic transformation. The GAN-loss is defined as the sigmoid cross-entropy using the sigmoid \(\sigma (\cdot )\)

One major aim in video enhancement is to strengthen the perceptual quality of videos \(X=\{x_1,x_2,\ldots , x_N\}\in \mathbb {R}^{N \times T\times H\times W\times C}\), which are captured on cameras with compromised capabilities, such that they resemble videos \(Y=\{y_1,y_2,\ldots ,y_M\}\in \mathbb {R}^{M\times T\times H\times W\times C}\) from the high-quality target distribution with the desired properties. In addition to the enhancement of stills, video enhancement suffers from new challenges, including: (i) the difficulty of collecting spatio-temporally aligned video samples for supervision, (ii) the complexity of achieving spatio-temporal consistency within individual videos, and (iii) higher computational cost and larger data volumes.

We avoid problem (i) of collecting aligned pairs by learning only from unpaired samples in both the source and target domains through a cyclic GAN-based solution. Our objective function is designed to match the distributions of generated and true samples from the training data alone, without supervision or handcrafted losses. To address the issue of inconsistencies in the temporal and spatial domains (ii), we introduce a fully recurrent framework, consisting of a recurrent generator G, degrader H, and discriminator D, all featuring our novel local/global module (LGM). We achieve fast training and real-time state-of-the-art performance at inference (iii) by employing effective losses and streamlined recurrent architectures, both of which are designed to cope with the heavy computational demand associated with video processing. In contrast to most generative adversarial frameworks, which require extensive human effort for designing multiple handcrafted losses and the associated expensive tuning of hyperparameters, we address the aforementioned issues from a holistic view of the problem. We shift our design effort from explicit losses to implicit ones, thereby enabling effective learning in a data-driven fashion. We therefore design a more efficient, data-driven GAN framework, where only two architectures need to be manually designed, i.e. the generator/degrader and the discriminator. The proposed framework features only one recurrent discriminator network (to enforce all desired properties) and leverages joint distribution learning for enhanced coupling between source and target distributions. This leads to memory- and computation-efficient discrimination in both the source domain \(\mathcal {X}\) and the target domain \(\mathcal {Y}\) without the need for separate discriminators. Our discriminator learns the joint distribution p(x, y) directly from unpaired video samples (x, y) in an unsupervised manner.

3.1 Objective Function

Our proposed GAN framework (Fig. 1) features a novel recurrent generator/degrader and a novel recurrent discriminator to learn the joint distribution p(x, y) using a single network. From the outputs of discriminator D we calculate real logits \(\rho (x,y)\) and fake logits \(\phi (\hat{x},\hat{y})=\phi (H(y),G(x))\) for sampled sequences \((x,y)\sim p(x,y)\), which are further evaluated through a standard sigmoid cross-entropy loss.

In addition to the GAN objective, we introduce a cyclic constraint to favor a bijective mapping between the distributions of the source domain \(\mathcal {X}\) and the target domain \(\mathcal {Y}\). The constraint acts mainly as a regularizer to stabilize the training process and has been successfully applied to unsupervised quality mapping before, e.g. in Chen et al. (2018). Our generator \(G: \mathcal {X} \rightarrow \mathcal {Y}\) maps videos from the source to the target domain, while the degrader \(H: \mathcal {Y} \rightarrow \mathcal {X}\) constitutes the inverse mapping. In contrast to previous methods, we are left with only a single hyperparameter \(\alpha \) to balance the GAN-loss with the cyclic constraint, which significantly reduces the burden of tuning hyperparameters. For enhanced coupling, two symmetric cycles are employed: one starting from the source sequence, \(H(G(x))=x'\), and the other starting from the target sequence, \(G(H(y))=y'\). Note that, contrary to recently proposed GAN training strategies, our setting requires no normalization or advanced regularization techniques (Miyato et al., 2018; Gulrajani et al., 2017) for stable training, even with a batch size of only 1 sample. Generator G, degrader H and discriminator D are trained in alternating fashion:

$$\begin{aligned} \begin{aligned} \min _{G,H}\text { }&\mathcal {L}^G_{GAN}\left( G,H,D\right) + \alpha \mathcal {L}_{cyc}\left( G,H\right) , \\ \min _{D}\text { }&\mathcal {L}^D_{GAN}\left( G,H,D\right) , \end{aligned} \end{aligned}$$
(1)

where the above three loss terms are given by

$$\begin{aligned} \begin{aligned} \mathcal {L}^G_{GAN}\left( G,H,D\right) =&-\mathbb {E}_{(x,y)\sim p(x,y)}\left[ \log \sigma \left( \phi (x,y)\right) \right] \\&- \mathbb {E}_{(x,y)\sim p(x,y)}\left[ \log \left( 1 - \sigma \left( \rho (x,y)\right) \right) \right] , \end{aligned} \end{aligned}$$
(2)
$$\begin{aligned} \begin{aligned} \mathcal {L}^D_{GAN}\left( G,H,D\right) =&-\mathbb {E}_{(x,y)\sim p(x,y)}\left[ \log \sigma \left( \rho (x,y)\right) \right] \\&- \mathbb {E}_{(x,y)\sim p(x,y)}\left[ \log \left( 1 - \sigma \left( \phi (x,y)\right) \right) \right] , \end{aligned} \end{aligned}$$
(3)
$$\begin{aligned} \begin{aligned} \mathcal {L}_{cyc}\left( G,H\right) = \mathbb {E}_{(x,y)\sim p(x,y)}\left[ (x' - x)^2 + (y' - y)^2\right] . \end{aligned} \end{aligned}$$
(4)

We use a relativistic average GAN objective (RaGAN) (Jolicoeur-Martineau, 2018) to train our networks, since it has been shown to be effective for enhancement tasks (Wang et al., 2018). We however adapt RaGAN to train with a single sample per batch by avoiding the original averaging over the batch dimension. On memory-constrained GPUs, our single-sample training enables larger crop sizes, which we observed to be important for video enhancement with global properties. The discriminator D accepts real (x, y) and fake \((\hat{x},\hat{y})\) samples and outputs the respective logit maps \(s_\rho = D(x,y), s_\phi = D(\hat{x},\hat{y}) \in \mathbb {R}^{T\times H\times W\times 1}\) with one logit per pixel location. We then average these logit maps in the spatial domain. Analogous to Jolicoeur-Martineau (2018), this adaption leverages the fact that all patches in a frame should be classified exclusively as real or fake. The averages \(\overline{s}_\rho , \overline{s}_\phi \), obtained as follows, are then used for the relativistic transformation.

$$\begin{aligned} \overline{s}_\rho =\frac{1}{HW}\sum _{h}^{H}\sum _{w}^{W} s_\rho (t,h,w), \end{aligned}$$
(5)
$$\begin{aligned} \overline{s}_\phi =\frac{1}{HW}\sum _{h}^{H}\sum _{w}^{W} s_\phi (t,h,w). \end{aligned}$$
(6)

After the relativistic transformation, the modified logits for real \(\rho (x,y)\) and fake \(\phi (x,y)\) are given by,

$$\begin{aligned} \begin{aligned} \rho (x,y) =&D(x,y)-\overline{s}_\phi , \end{aligned} \end{aligned}$$
(7)
$$\begin{aligned} \begin{aligned} \phi (x,y) =&D(\hat{x},\hat{y}) - \overline{s}_\rho . \end{aligned} \end{aligned}$$
(8)

These modified logits are then evaluated with the standard sigmoid cross-entropy loss, using the loss functions (2) and (3). Please refer to Fig. 1 for our loss setting.
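
To make the loss setting concrete, the following PyTorch sketch evaluates Eqs. (2)–(8) for a single unpaired sample. It assumes that G, H and D follow the interfaces described above (D returns a per-pixel logit map for a pair of sequences), uses a (T, C, H, W) tensor layout, and omits the recurrent unrolling; it is a minimal illustration rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def spatial_mean(s):
    # s: (T, 1, H, W) map of per-pixel logits -> per-frame average, Eqs. (5)-(6)
    return s.mean(dim=(2, 3), keepdim=True)

def rave_objective(G, H, D, x, y, alpha=1.0):
    """Losses for one unpaired sample (x, y), each of shape (T, C, H, W).
    G, H, D are placeholders for the recurrent enhancer, degrader and joint
    discriminator; D(x, y) is assumed to return a (T, 1, H, W) logit map."""
    y_fake, x_fake = G(x), H(y)               # hat{y} = G(x), hat{x} = H(y)
    x_cyc, y_cyc = H(y_fake), G(x_fake)       # x' = H(G(x)), y' = G(H(y))

    s_real = D(x, y)                          # s_rho: logit map for the real pair
    s_fake = D(x_fake, y_fake)                # s_phi: logit map for the fake pair

    # relativistic transformation with spatial averages, Eqs. (5)-(8)
    rho = s_real - spatial_mean(s_fake)
    phi = s_fake - spatial_mean(s_real)

    ones, zeros = torch.ones_like(rho), torch.zeros_like(rho)
    bce = F.binary_cross_entropy_with_logits  # sigmoid cross-entropy

    loss_G_gan = bce(phi, ones) + bce(rho, zeros)                     # Eq. (2)
    loss_D_gan = bce(rho, ones) + bce(phi, zeros)                     # Eq. (3)
    loss_cyc = ((x_cyc - x) ** 2).mean() + ((y_cyc - y) ** 2).mean()  # Eq. (4)

    return loss_G_gan + alpha * loss_cyc, loss_D_gan
```

In practice, G/H and D are updated in alternating steps as in Eq. (1), with the fake sequences detached from the computation graph when optimizing D.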

Fig. 2

LGM module. A standard convolution branch (right) processes local information, while a second branch (left) extracts global information from the input feature map f. The features g, computed from f with kernel size \(k_g\) and stride \(k_g\), are average pooled to form the global feature vector \(v_g\). This vector is replicated spatially and concatenated with the local features l. Adding the LGM module to our model increases the runtime only marginally, by 2.5 ms, but improves the performance significantly

Fig. 3

Spatio-temporal generator. The figure shows the recurrent cell at time step t. From a batch of input frames \(x_{t-1}, x_t, x_{t+1}\), hidden state \(h_{t-1}\) and feedback \(y_{t-1}\), a single enhanced frame \(y_t\) is produced. The employment of our novel LGM block enables processing of global spatial information; \(k_l,k_g\) denote the kernel sizes for local and global feature maps, F denotes the total number of filters in an LGM block, and \(F'\) is the number of filters in the global feature maps. These parameters are set according to their respective spatial input sizes, see Sect. 3.2 and Fig. 2. The residual connection from input \(x_t\) to output \(y_t\) enables learning only the residuals with respect to the input center frame \(x_t\); \(\oplus \) denotes element-wise tensor addition

Fig. 4

Spatio-temporal discriminator. The figure shows the recurrent cell at time step t. As the discriminator is designed to learn the joint distribution of the source \(\mathcal {X}\) and target \(\mathcal {Y}\) domains, the sequences x, y are concatenated along the channel dimension and fed to the discriminator. A map of logits \(s_t\) is produced from a batch of concatenated input frames, hidden state \(h_{t-1}\) and feedback \(s_{t-1}\). The employment of our novel LGM block enables processing of global spatial information; \(k_l,k_g\) denote the kernel sizes for local and global feature maps, F denotes the total number of filters in an LGM block, and \(F'\) is the number of filters in the global feature maps, see Sect. 3.2 and Fig. 2. Note that we compute residual logits in relation to the previous map of logits \(s_{t-1}\); \(\oplus \) denotes element-wise tensor addition

3.2 Architecture Design

In order to achieve fast runtimes, temporally consistent video, and strong training signals, we modify RLSP (Fuoli et al., 2019), a powerful recurrent architecture for supervised real-time video super-resolution (VSR). Our modifications allow us to perform video quality mapping (VQM) with fully convolutional recurrent cells. Since every component in our setting benefits from recurrence, all networks (G, H and D) are derived from RLSP. Recurrence allows for consistency in the temporal domain, as the processing at the current time step is conditioned on the previous time step's hidden state, and it enables efficient temporal information accumulation in all architectures. In addition to a local view for improving properties like sharpness and contrast, VQM benefits from a global view of the input in order to perform spatially coherent enhancement, e.g. of overall illumination and tone. For this purpose, we introduce a novel local and global module (LGM) within both generators and the discriminator. The combined application of recurrence and LGM establishes a fully connected view from all past spatial and temporal information in a sequence to individual pixels in the output, improving spatio-temporal consistency. Furthermore, such a design suits online inference with our enhancer G, as it facilitates maximum usage of the available information at each time step (Fig. 3).

Local/Global Module (LGM) Our introduced LGM block for efficient interleaved local and global information processing consists of two convolutional blocks. In addition to a standard local convolution block with a \(k_l\times k_l\) kernel, a second block with a larger \(k_g\times k_g\) kernel with stride \(k_g\) extracts features from equidistant patches. Features \(g\in \mathbb {R}^{H'\times W'\times F'}\) from this second block are average pooled in the spatial domain to obtain a global feature vector \(v_g \in \mathbb {R}^{1\times 1\times F'}\).

$$\begin{aligned} v_g=\frac{1}{H'W'}\sum _{h,w}^{H',W'} g(h,w,f) \end{aligned}$$
(9)

Then, the pooled features are spatially replicated to match the previous input size and concatenated with the local feature maps \(l\in \mathbb {R}^{H\times W\times (F-F')}\) (global concatenation), acting as global priors for the downstream convolutions, see Fig. 2.

Repeated LGM blocks enable interleaved processing from local to global information. Note, after only one processing step with the LGM module, a fully connected global view is established, which dynamically adapts to the spatial size of the input. Subsequent processing steps further refine the global and local relations in multiple stages and establish a rich set of features containing information from the full global view all the way down to pixel-level.
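
The following PyTorch sketch illustrates one LGM block under simplifying assumptions (per-frame 2D convolutions, ReLU activations, and illustrative filter counts); the exact layer configuration follows Figs. 2–4.

```python
import torch
import torch.nn as nn

class LGM(nn.Module):
    """Local/Global Module (sketch). A local branch keeps per-pixel features;
    a strided global branch is average-pooled into a global feature vector,
    which is replicated spatially and concatenated to the local features."""

    def __init__(self, in_ch, filters=64, global_filters=8, k_l=3, k_g=8):
        super().__init__()
        # local branch: standard k_l x k_l convolution, F - F' output maps
        self.local = nn.Conv2d(in_ch, filters - global_filters, k_l, padding=k_l // 2)
        # global branch: k_g x k_g convolution with stride k_g over equidistant patches
        self.global_conv = nn.Conv2d(in_ch, global_filters, k_g, stride=k_g)
        self.act = nn.ReLU(inplace=True)

    def forward(self, f):
        l = self.act(self.local(f))                       # local features l
        g = self.act(self.global_conv(f))                 # patch features g
        v_g = g.mean(dim=(2, 3), keepdim=True)            # global feature vector v_g, Eq. (9)
        v_g = v_g.expand(-1, -1, l.shape[2], l.shape[3])  # spatial replication
        return torch.cat([l, v_g], dim=1)                 # global concatenation

# usage: LGM(64)(torch.randn(1, 64, 64, 64)) yields a (1, 64, 64, 64) feature map,
# consisting of 56 local channels plus 8 replicated global channels.
```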

3.2.1 Generator

We introduce the LGM module as a replacement convolution block in RLSP to efficiently process local and global information with the help of recurrent connections. The proposed recurrent cell features a hidden state and feedback, which enables the architecture to produce temporally consistent video enhancement. Furthermore, a shuffle operation (/4) at the input increases the receptive field and enables efficient computation by reducing the spatial size. At the output stage, the features are upscaled to the original resolution through combined bilinear resizing and convolution, yielding the enhanced frames \(\hat{y}=G(x)\). In each upsampling stage, the number of filters in the LGM modules (\(F,F'\)) is divided by 2. The kernel sizes \(k_g\) in the LGM modules are set according to the current spatial resolution in relation to a full-resolution kernel size of \(k_g=8\). In the first blocks after the shuffling operation (/4) we set \(k_g=8/4=2\), and in the subsequent two upscaling blocks we set \(k_g=8/2=4\) and \(k_g=8\), respectively. Two extra LGM blocks are placed in parallel to the upsampling block for further refinement of the hidden state. The number of layers is set to 8. The resulting architecture allows joint processing of spatio-temporal as well as global and local information in a single pipeline. We initialize the feedback and hidden state \(y^0,h^0\) with zeros. We use the same network for the degrader H to impose the cyclic constraint of Eq. 4.
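
A simplified sketch of a single generator time step is given below. It assumes pixel-unshuffle for the /4 operation, a single bilinear-resize-plus-convolution upsampling stage instead of the staged upsampling described above, and a plain convolutional stand-in for the LGM stack; names and filter counts are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneratorCell(nn.Module):
    """One recurrent time step of the enhancer G (sketch). It consumes frames
    x_{t-1}, x_t, x_{t+1}, the previous hidden state h_{t-1} (kept at 1/4
    resolution) and the previous output y_{t-1}, and returns y_t and h_t."""

    def __init__(self, ch=3, hidden=64, feats=128, shuffle=4):
        super().__init__()
        self.shuffle = shuffle
        in_ch = 4 * ch * shuffle ** 2 + hidden  # 3 frames + feedback (shuffled) + hidden state
        self.body = nn.Sequential(              # stand-in for the stack of LGM blocks
            nn.Conv2d(in_ch, feats, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(feats, feats, 3, padding=1), nn.ReLU(True),
        )
        self.to_hidden = nn.Conv2d(feats, hidden, 3, padding=1)
        self.to_rgb = nn.Conv2d(feats, ch, 3, padding=1)

    def forward(self, x_prev, x_t, x_next, h_prev, y_prev):
        frames = torch.cat([x_prev, x_t, x_next, y_prev], dim=1)
        z = F.pixel_unshuffle(frames, self.shuffle)   # /4: space-to-depth shuffle
        z = self.body(torch.cat([z, h_prev], dim=1))
        h_t = torch.relu(self.to_hidden(z))           # new hidden state h_t
        # output stage: bilinear resize to full resolution + convolution
        up = F.interpolate(z, scale_factor=self.shuffle, mode="bilinear",
                           align_corners=False)
        y_t = x_t + self.to_rgb(up)                   # residual w.r.t. the center frame
        return y_t, h_t
```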

3.2.2 Discriminator

Our recurrent discriminator architecture, which is again based on RLSP, also incorporates the LGM module. Adding recurrence to the discriminator amounts to introducing a suitable inductive bias for spatio-temporal data. With similar considerations as for the generator, recurrence encourages the discrimination of temporal dynamics. This efficiently improves the capacity of the discriminator, leading to stronger training signals for the generator. Here again, the LGM blocks exploit local and global information, which plays a vital role in VQM.

The architecture of the proposed network is fully convolutional and recurrent, and it operates at full resolution, see Fig. 4. The cell is implemented with 8 layers. We predict pixel-dense logits \(s \in \mathbb {R}^{T\times H\times W\times 1}\) per sequence, which are fed to the GAN objective after applying the relativistic transformation, see Eqs. 7 and 8. Due to the repeated LGM blocks and recurrence, every local pixel classification in s also includes the full past spatio-temporal information. Since all operations are conducted at full resolution, we set the global kernel size to \(k_g=8\) in order to be consistent with the choice of \(k_g\) in the generator. The discriminator also features a hidden state and feedback. The logits s are fed back to the input by concatenation with the inputs and the hidden state, serving as priors at the next time step. The network predicts residual logits \(s'_t\in \mathbb {R}^{H\times W\times 1}\), which are added to the previous output \(s_{t-1}\) to form the logits \(s_t\) at the current time step t. The hidden state propagates information along time and thereby enables the discrimination of temporal dynamics. Along with discrimination in the spatial domain, our recurrent discriminator can thus also enforce realistic temporal dynamics over long sequences. Note that the latter is not the case for non-recurrent discriminator networks.
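
A corresponding sketch of one discriminator time step is shown below, again with a plain convolutional stand-in for the LGM blocks and illustrative channel counts; it only demonstrates the joint input concatenation, the hidden-state recurrence and the residual logit feedback.

```python
import torch
import torch.nn as nn

class DiscriminatorCell(nn.Module):
    """One recurrent time step of the joint discriminator D (sketch, operating
    at full resolution). x_t and y_t are concatenated along the channel
    dimension; the previous logit map s_{t-1} and hidden state h_{t-1}
    serve as priors for the current step."""

    def __init__(self, ch=3, hidden=32, feats=64):
        super().__init__()
        in_ch = 2 * ch + 1 + hidden              # x_t, y_t, s_{t-1}, h_{t-1}
        self.body = nn.Sequential(               # stand-in for the stack of LGM blocks
            nn.Conv2d(in_ch, feats, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(feats, feats, 3, padding=1), nn.ReLU(True),
        )
        self.to_hidden = nn.Conv2d(feats, hidden, 3, padding=1)
        self.to_logit = nn.Conv2d(feats, 1, 3, padding=1)

    def forward(self, x_t, y_t, h_prev, s_prev):
        z = self.body(torch.cat([x_t, y_t, s_prev, h_prev], dim=1))
        h_t = torch.relu(self.to_hidden(z))
        s_t = s_prev + self.to_logit(z)          # residual logits added to s_{t-1}
        return s_t, h_t
```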

Fig. 5

Temporal padding, shown for the generator. The first frame is replicated and prepended before processing any sequence. This enables initialization of hidden state \(h_{t-1}\) and feedback \(\hat{y}_{t-1}\) without the pressure of generating meaningful output right away. The padding is removed after processing, i.e. \(\hat{y}_{-3}, \hat{y}_{-2}\) and \(\hat{y}_{-1}\) are removed

4 Experiments and Discussion

4.1 Experimental Setup

To obtain manageable training samples, we crop the videos online in the spatio-temporal domain to short sequences \(x,y\in \mathbb {R}^{5\times 256\times 256\times 3}\). Since our proposed recurrent adversarial video enhancement (RAVE) framework is recurrent, the hidden state and feedback need to be initialized at the beginning of a sequence; both \(h^0\) and \(y^0\) are set to zero. In addition, the efficiency gain of our proposed recurrent networks stems from access to past information, which is encoded in the current hidden state. This information is not available at the beginning of a sequence, as it first needs to accumulate over a couple of frames. Therefore, the enhancement quality gradually increases during the first few frames, which is a discriminating cue for the discriminator by itself and also affects the cyclic losses. In order to prevent such initialization-related issues during training, all sequences are padded by repeating the first frame for 3 time steps, see Fig. 5. At every step during training, i.e. after a sequence is processed, the excess frames are removed. Therefore, the padded frames are never exposed to any losses directly, leaving room for the generator G and degrader H to warm-start the hidden state. The proposed temporal padding strategy alleviates the need for constant enhancement quality from the first frame on and enables the network to make use of past information without exposing the initialization process to the discriminator. We observed improved stability during training and better training results with this modification. We use the Adam optimizer (Kingma & Ba, 2014) with standard settings from PyTorch and set the learning rate to \(10^{-6}\). Due to hardware constraints, all models are trained with a batch size of 1. The cyclic loss weight is set to \(\alpha =1\) for all experiments, and the generator/degrader and discriminator are trained in alternating steps.
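
The padding and warm-start procedure can be sketched as follows, reusing the hypothetical GeneratorCell interface from Sect. 3.2.1; cropping, losses and optimizer steps are omitted.

```python
import torch

def temporally_pad(x, pad=3):
    """Repeat the first frame `pad` times at the start of a sequence x of
    shape (T, C, H, W), so the recurrent state can warm up (Fig. 5)."""
    return torch.cat([x[:1].expand(pad, -1, -1, -1), x], dim=0)

def enhance_sequence(cell, x, pad=3, hidden=64, shuffle=4):
    """Run a recurrent generator cell over a padded sequence and drop the
    outputs of the padded frames afterwards, so they are never exposed to
    any loss. `cell` is assumed to follow the GeneratorCell sketch."""
    xp = temporally_pad(x, pad)
    T = xp.shape[0]
    h = xp.new_zeros(1, hidden, xp.shape[2] // shuffle, xp.shape[3] // shuffle)  # h^0 = 0
    y_prev = torch.zeros_like(xp[:1])                                            # y^0 = 0
    outputs = []
    for t in range(T):
        x_prev = xp[max(t - 1, 0)].unsqueeze(0)
        x_next = xp[min(t + 1, T - 1)].unsqueeze(0)
        y_prev, h = cell(x_prev, xp[t].unsqueeze(0), x_next, h, y_prev)
        outputs.append(y_prev)
    return torch.cat(outputs[pad:], dim=0)   # remove the padded outputs
```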

4.2 Datasets

We use two FullHD (\(1920\times 1080\)) video datasets to evaluate and compare our proposed RAVE method against competitors. Vid3oC (Kim et al., 2019) is a small dataset consisting of 50 roughly-aligned videos for training and 16 videos for validation, recorded with 3 different cameras (a ZED stereo camera, a Canon 5D Mark IV, and a Huawei P20) on a calibrated camera rig. This dataset can be publicly downloaded and has been used in challenges for video super-resolution (Fuoli et al., 2019, 2020) and video quality mapping (Fuoli et al., 2020). Following the setup of Fuoli et al. (2020), we use the videos taken by the ZED (low-quality, i.e., source) and Canon (high-quality, i.e., target) cameras for the video enhancement evaluation. The data consists of a training set with 50 source videos and 50 target videos, as well as a validation set with 16 source videos and 16 target videos. Alignment is not required by our framework; nevertheless, we use this dataset as it served as a benchmark in Fuoli et al. (2020) and it allows us to compute approximate LPIPS scores, which rely on aligned frames. We completely ignore the fact that the videos are roughly aligned and sample x, y independently during training. However, there are several issues attached to this dataset. Good generalization to the validation set from learning on only 50 training sequences is highly doubtful, as the sample size is simply too small for video. On top of that, the target domain contains serious motion blur, which is not a desired property in video enhancement and causes the generator to produce blurred frames.

To address these shortcomings, we employ a large dataset with videos from a smartphone camera (Huawei P30 Pro) and a high-quality DSLR (Panasonic GH5s), which we call Huawei. Videos are captured by the P30 Pro at 29.95 fps, while those taken by the Panasonic GH5s are at 25 fps. The resolution of the captured videos is also \(1920 \times 1080\). The videos for both devices have been shot at the same locations, but without proper alignment in any dimension; the viewpoints, lenses and recording times are very different. The data is split into 1270 source videos and 1270 target videos for training, and 16 pairs of validation source/target videos. Again, we sample x, y independently.

4.3 Evaluation

4.3.1 Quantitative Metrics

The design of reliable quantitative metrics for unsupervised generation and enhancement, which correlate well with human perception, is difficult and still ongoing research. The lack of pixel-level aligned targets in these settings prohibits the use of standard distance metrics like PSNR and SSIM, and even more recent feature-level metrics like LPIPS (Zhang & Isola, 2018) are not an option when source and target are unaligned. Nevertheless, we provide approximate LPIPS scores on the Vid3oC dataset, as its frames are roughly aligned. Due to the severe misalignment, no meaningful LPIPS scores can be provided on the Huawei dataset.

A popular metric to measure the distance between two distributions at the image level is the Fréchet Inception Distance (FID) (Heusel et al., 2017; Obukhov et al., 2020). FID is the Fréchet distance between generated samples and samples from the target distribution in the feature space of Inception v3 (Szegedy et al., 2016). We calculate FID scores on all frames in the validation set to compare the frame-level mapping quality of our method. Although this metric is suitable for our purpose, it is restricted to measuring frame-level quality and neglects the temporal part of our video distributions entirely. We therefore also compute the video-based FID (i.e., FVD) scores proposed in Unterthiner et al. (2018). Contrary to FID, FVD computes the Fréchet distance from video-based features, which ultimately enables measuring the distance between two video distributions.
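
For reference, the Fréchet distance underlying both FID and FVD can be computed from extracted features as in the following sketch; the feature extraction itself (Inception v3 for FID, a video backbone for FVD) is assumed to be given.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feat_a, feat_b):
    """Fréchet distance between two sets of features (rows are samples),
    as used for FID (Inception-v3 features) and FVD (video features).
    Only the distance itself is sketched here; feature extraction with the
    respective backbone networks is assumed to be done beforehand."""
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    covmean = covmean.real                     # discard tiny imaginary parts
    diff = mu_a - mu_b
    return diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean)
```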

4.3.2 Qualitative Metrics

Furthermore, a user study is conducted to support the metrics. To this end, we asked users to mark their preference regarding the mapping quality towards the target distribution between two methods. In each comparison, the enhanced frames from the two competing methods were shown together with a frame from the target sequence. The users were asked to rate which method enhances the image features closer to the target. Together with the frames, we also provided a zoomed-in crop for low-level inspection. One frame per validation video sequence is randomly extracted for comparison and rated by 20 users, resulting in \(16\times 20 = 320\) ratings per method pair. To rank all methods among each other, we calculate the total number of wins against the competing methods, normalized by the total number of comparisons (\(2\times 320 = 640\)). This ratio is given as a percentage in Table 1. We also provide visual examples of results on the validation sets of the Vid3oC and Huawei datasets in Figs. 6 and 7.

4.3.3 Speed Metric

To compare the methods in terms of inference speed, we compute runtimes on an NVIDIA V100. The runtimes are measured per frame with equal settings on a FullHD (\(1080\times 1920\)) sequence from the validation set. Since we use temporal padding during training and inference for RAVE, the processing time for padded frames is also considered, which slightly increases the reported runtimes for RAVE. The best out of 10 runs is reported for each method.
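
A sketch of such a per-frame timing measurement is given below; it assumes the model enhances an entire sequence in one call and ignores warm-up and padded-frame bookkeeping for brevity.

```python
import time
import torch

@torch.no_grad()
def per_frame_runtime(model, frames, runs=10):
    """Best-of-`runs` per-frame GPU runtime (sketch). `frames` is a FullHD
    sequence, e.g. of shape (T, C, 1080, 1920), and the model is assumed to
    enhance the whole sequence in a single call."""
    model.eval()
    best = float("inf")
    for _ in range(runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        model(frames)
        torch.cuda.synchronize()                  # wait for all launched kernels
        best = min(best, (time.perf_counter() - start) / frames.shape[0])
    return best                                   # seconds per frame
```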

4.3.4 Temporal Consistency Metric

Current metrics for perceptual quality mostly neglect the temporal domain entirely. Consistent enhancement along time, however, is crucial for high-quality videos, as human vision is sensitive to inconsistencies like flickering artifacts. Due to the lack of reliable temporal consistency metrics for unsupervised settings in the field, we propose a novel way of quantifying our model as well as the competing methods with respect to temporal consistency. Since there is no pixel-level ground truth in unsupervised video enhancement, it is difficult to obtain such a metric with reference to the target domain. However, the source x and the enhanced video \(\hat{y}=G(x)\) are aligned on the pixel level. Under the assumption that the source video is temporally consistent, since we do not expect temporal inconsistencies in recordings, we can use it as a reference.

The pixel-flow of the source, \(F_{S,t}(h,w)=x_{t+1}(h,w) - x_t(h,w)\), and of a method, \(F_{M,t}(h,w)=\hat{y}_{t+1}(h,w) - \hat{y}_t(h,w)\), are calculated per sequence. In order to compare a method's flow against the reference x, we calculate the statistics of the flow difference \(d_t = F_{M,t}(h,w) - F_{S,t}(h,w)\) per time step, which we call relative pixel-flow (RPF), and plot them over time.

$$\begin{aligned} \mu _{d_t}&= \frac{1}{HW}\sum _{h}^{H} \sum _{w}^{W} F_{M,t}(h,w) - F_{S,t}(h,w) \end{aligned}$$
(10)
$$\begin{aligned} \sigma _{d_t}&= \sqrt{\frac{1}{HW}\sum _{h}^{H} \sum _{w}^{W} \left[ F_{M,t}(h,w) - F_{S,t}(h,w) - \mu _{d_t} \right] ^2} \end{aligned}$$
(11)

The mean flow \(\mu _{d_t}\) and standard deviation \(\sigma _{d_t}\) for the first 8 sequences from both validation sets are plotted in Figs. 8 and 9 for all methods. A temporally consistent generated video \(\hat{y}\) should have a similar pixel-flow as the source video. This similarity is indicated by a small mean \(\mu _{d_t}\) and standard deviation \(\sigma _{d_t}\), together with a smooth evolution of \(\mu _{d_t}\) and \(\sigma _{d_t}\) over time, i.e. no spikes and no valleys.
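
The RPF statistics of Eqs. (10)–(11) can be computed as in the following sketch; for brevity, color channels are pooled together with the spatial dimensions, which is an implementation choice of this illustration.

```python
import numpy as np

def relative_pixel_flow(source, enhanced):
    """Relative pixel-flow (RPF) statistics of Eqs. (10)-(11).
    source, enhanced: arrays of shape (T, H, W, C), aligned on pixel level.
    Returns the per-time-step mean and standard deviation of the flow
    difference d_t between the enhanced and the source sequence."""
    flow_src = np.diff(source.astype(np.float64), axis=0)     # F_{S,t}
    flow_enh = np.diff(enhanced.astype(np.float64), axis=0)   # F_{M,t}
    d = flow_enh - flow_src                                    # d_t per pixel
    mu = d.mean(axis=(1, 2, 3))                                # mu_{d_t}
    sigma = d.std(axis=(1, 2, 3))                              # sigma_{d_t}
    return mu, sigma
```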

Table 1 Comparison with state-of-the-art. We conduct a user study and compute FID, FVD, LPIPS and runtime to evaluate the mapping quality and performance on the Vid3oC (Kim et al., 2019) and Huawei datasets. Wins denotes the normalized relative preference over both other methods in %
Fig. 6

Visual results for Vid3oC

Fig. 7

Visual results for Huawei. We do not highlight details for the target, due to different view, scale and time

4.4 Comparison with State-of-the-Art Methods

Table 2 Ablation study. We show the effectiveness of our introduced modules on the Vid3oC dataset (Kim et al., 2019) in terms of FID, FVD and LPIPS scores

To the best of our knowledge, existing published works rarely apply unsupervised quality enhancement to videos; so far, only methods for image-level quality mapping exist. We therefore compare our proposed model RAVE to the most prominent state-of-the-art single-image enhancers for quality mapping in the literature, Weakly-Supervised Photo Enhancement (WESPE) (Ignatov et al., 2018) and Deep Photo Enhancement (DPE) (Chen et al., 2018). WESPE divides the enhancement problem into two subtasks, treating texture and color enhancement separately, and features a cyclic constraint which is enforced through a VGG loss (Ledig et al., 2017; Simonyan & Zisserman, 2015). DPE uses a cyclic GAN setting with an identity loss, a cyclic constraint and two GAN losses, one for each domain. The major differences to our proposed method are the counterproductive identity loss, the lack of consideration of the temporal properties required for smooth videos, and the neglect of fast runtimes needed for real-time applications. Besides, the DPE network is not fully convolutional, which inherently limits its practical application due to the fixed input size. For both competing methods, we employ the implementations from the original authors. We evaluate quantitative metrics and compare the perceived visual quality by providing examples and a user study.

We additionally compare our model against a method for video translation (Bansal et al., 2018), which shares similarities with the VQM task (see supplementary material). The method is unable to produce competitive quantitative (FID of 56.68 on Vid3oC) or qualitative results, which we attribute to the differences between the two tasks. The results highlight the need for a specialized solution for VQM, which we provide with our proposed framework.

4.4.1 Quantitative Results

The results on both datasets are listed in Table 1. We outperform the current state-of-the-art methods DPE and WESPE in all quantitative metrics at both the image level (FID, LPIPS) and the video level (FVD). We gain up to 4.25 (Vid3oC) and 2.30 (Huawei) in FID, achieve better FVD scores by up to 100.84 (Vid3oC) and 34.68 (Huawei), and obtain up to 0.053 lower LPIPS values over both competing methods WESPE and DPE. Despite the (implicit) additional temporal consistency constraint imposed by our recurrent discriminator, our method achieves better quantitative results at the image level, which is quite remarkable as both image-level enhancers are not bound to this constraint and can therefore solve a simpler objective. Our RAVE clearly outperforms the other methods in terms of FVD scores, owing to the recurrent discriminator, and also achieves the fastest runtimes. Our proposed RAVE method is the only method that achieves real-time performance, producing over 35 fps of high-quality FullHD (\(1080\times 1920\)) video. RAVE is over \(\times 10\) faster than WESPE and over \(\times 4\) faster than DPE. DPE only produces video at a reduced resolution (\(512\times 512\)) and therefore has to process about \(\times 7.9\) fewer pixels than WESPE and RAVE. Hence, the speed difference would be even larger if RAVE processed the same reduced resolution. Please note that high-quality real-time performance is generally hard to achieve in deep learning, mainly because designing the optimal network structure is crucial when its model complexity should be minimized.

4.4.2 Qualitative Results

In addition to the quantitative metrics, we provide visual examples to inspect the perceived visual quality. We show a frame-level comparison between the unaltered source frame, WESPE, DPE, RAVE and a corresponding target frame from the validation set of Vid3oC in Fig. 6. Note that temporal consistency is covered later in Sect. 4.4.3. As previously stated in Sect. 4.2, the dataset size is limited for the task. DPE can match the colors a bit more closely to the target, which however comes at the expense of a significantly lower resolution due to the resizing operations required by the fixed input size of the network. The results are therefore severely blurred, as can be seen in the close-ups. Effectively, DPE is trained on complexity-reduced data due to the resampling to the lower resolution. The resulting lower Nyquist rate, with its associated lack of higher frequencies, removes the task of matching the sharpness of generated and target frames and reduces the problem to color matching only. Additionally, the model is learned on the full spatial content, which improves generalization compared to learning on smaller crops only. WESPE produces high frequencies which perceptually sharpen the output. However, a closer look reveals low-level artifacts, e.g. discontinuous lines and ringing artifacts on edges. Additionally, there is a pattern of bright, green pixels distributed over the whole frame, which appears in all outputs. RAVE produces balanced frames without those artifacts and achieves a better trade-off between color and sharpness.

The results on the larger Huawei dataset in Fig. 7, with additional close-ups in Fig. 10, are more representative, following the discussion in Sect. 4.2. RAVE achieves significantly better color matching than on Vid3oC and is on par with or better than DPE in this regard, despite the more complex requirements for sharpness and temporal consistency. Again, WESPE produces higher frequencies, but exactly the same low-level artifacts can be observed as on Vid3oC. Also, WESPE fails completely to learn the target color distribution and generates colors which remain close to the source and appear even further away from the target. Additionally, it does not pick up the contrast of the target distribution, which RAVE handles impressively. RAVE attains good color matching and higher sharpness compared to DPE, and can process the data in real time.

The user study reflects the aforementioned issues associated with each method on the Vid3oC dataset, since no method can match the target distribution satisfactorily. Hence, the preference scores among the users are indecisive. RAVE outperforms DPE and is on par with WESPE, separated by only one vote, while WESPE achieves better scores against DPE. RAVE is the clear winner on the Huawei dataset. RAVE's video enhancement is significantly preferred over both DPE and WESPE, as it wins 62% of all direct comparisons. The user study's results align well with our visual quality discussion above, which revealed significant issues for WESPE. WESPE is favored in only 37% of all direct comparisons. RAVE is preferred over WESPE with a ratio of 220/100, and outperforms DPE with a ratio of 186/134.

4.4.3 Temporal Consistency Results

Fig. 8

Relative pixel-flow (RPF) curves on the Vid3oC dataset (Sequences 1-8). Mean \(\mu _{d_t}\) (solid lines) and standard deviation \(\sigma _{d_t}\) (borders of filled area) of the per-frame pixel-flow difference \(d_t = F_{M,t}(h,w) - F_{S,t}(h,w)\) between method M and source S sequences. Ripples in the \(\mu \) and \(\sigma \) curves indicate temporal inconsistencies, and their amplitude shows the extent of the artifacts. See Sect. 4.3.4 for implementation details

Fig. 9

Relative pixel-flow (RPF) curves on the Huawei dataset (Sequences 1-8). Mean \(\mu _{d_t}\) (solid lines) and standard deviation \(\sigma _{d_t}\) (borders of filled area) of the per-frame pixel-flow difference \(d_t = F_{M,t}(h,w) - F_{S,t}(h,w)\) between method M and source S sequences. Ripples in the \(\mu \) and \(\sigma \) curves indicate temporal inconsistencies, and their amplitude shows the extent of the artifacts. See Sect. 4.3.4 for implementation details

We compare all methods by assessing the mean flow \(\mu _{d_t}\) and standard deviation \(\sigma _{d_t}\) of our suggested RPF on the first 8 sequences from the Vid3oC and Huawei validation sets. The curves are plotted in Figs. 8 and 9 and show a clear advantage of our model over the single-frame enhancers DPE and WESPE, which both exhibit more serious artifacts in the temporal domain. RAVE generates the most consistent video, as indicated by a significantly lower standard deviation \(\sigma _{d_t}\), together with a smooth evolution of \(\mu _{d_t}\) and \(\sigma _{d_t}\) over time. Both DPE and WESPE exhibit more ripples than RAVE. Additionally, WESPE's RPF curves show clearly higher ripples in the mean, which is an indication of flickering (illumination) artifacts. These temporal artifacts are most likely caused by inconsistent enhancement due to isolated single-frame processing. In that regard, the curves show equal behavior on both datasets.

4.5 Ablation Study

Owing to the possibility of providing LPIPS scores, we evaluate 6 different architectures on Vid3oC to show the advantages of our model. All experiments are conducted with the same hyperparameter settings. We show the advantage of our modified RaGAN* objective by comparing it with the standard sigmoid cross-entropy evaluation. In combination with recurrence, RaGAN* clearly improves the enhancement results. In order to show the effectiveness of our novel LGM module, we remove all LGM modules from the generator, degrader and discriminator. Further, we show the effects of recurrence on accuracy and speed in various settings by removing all recurrent connections from our networks. The results of this study are summarized in Table 2. They show that our proposed LGM module plays a crucial part in both the recurrent and non-recurrent settings, as the configurations with LGM achieve the best scores among all configurations. This emphasizes the importance of having a global view for VQM. Our final recurrent model with LGM and RaGAN* achieves the best enhancement quality scores among all configurations.

As can be seen from the comparison of runtimes in Table 2, the LGM module introduces only a marginal increase of 2 ms (non-recurrent) and 2.5 ms (recurrent) relative to the respective non-LGM configurations. This shows the effectiveness of coupled recurrence and LGM, which eventually facilitates high-quality real-time video enhancement in FullHD.

Moreover, we compare the complexity of our single discriminator to a two-discriminator setting in Table 3. The difference in FID is small in relation to the significantly larger complexity of the two-discriminator setting (approximately a factor of \(\times 2\) in all metrics). Our single discriminator largely reduces the computational demand with only a comparatively small decrease in performance.

Fig. 10

Close-up visual results of RAVE (left), DPE (center) and WESPE (right) for Huawei. Note the blurry content of DPE and the heavy low-level artifacts (green) of WESPE. Additionally, WESPE is missing the desired color properties from the target domain. RAVE produces clean frames with properties similar to the target domain

We further conducted experiments to determine the impact of the ratio between local and global channels in our LGM module. We obtain the following FID scores for a range of global channel sizes (global channel size / FID score): 4 / 52.20, 8 / 51.85, 16 / 51.88, 32 / 51.82, 48 / 54.85, 64 / 55.31. Large global channel sizes (48, 64) and the smallest global channel size (4) achieve the worst FID scores, while sizes in between achieve similar scores. Our models are trained with a global channel size of 8, which provides a good balance between global and local processing with good visual quality and speed.

5 Conclusion

In this paper, we introduced a novel GAN-based framework, named RAVE, for unsupervised video enhancement. RAVE is designed to effectively deal with the major practical challenges associated with the video enhancement task. More specifically, our framework allows efficient, unsupervised, and data-driven learning from unpaired video sequences, without the impractical requirement of collecting aligned data for supervision. Moreover, RAVE produces spatio-temporally consistent output thanks to the enhanced guidance of the proposed recurrent architecture and LGM module. Efficiency in memory and computation (both during training and inference) is achieved by means of the joint-distribution learning strategy and our computationally efficient recurrent cells.

Table 3 Comparison between single discriminator and two discriminators setting. We compare number of parameters (#Par.), MACs (multiply-accumulate operations), FLOPs (floating-point operations), memory requirements (Mem. Req.) and FID score on Vid3oC

Consequently, our data-driven method performs the quality mapping without any strong inductive bias, which is also reflected by its training stability and by the fact that the hyperparameter \(\alpha \) did not need to be adjusted across datasets. Our experiments on two datasets demonstrate that RAVE produces consistently better output than the compared methods, both in terms of quantitative and qualitative measures. In fact, RAVE is the first method that achieves real-time performance while addressing the problem of temporal consistency. In addition to the mentioned benefits, the proposed framework is also generic by design. Therefore, RAVE might also be adapted to other video enhancement/manipulation tasks, merely by replacing the recurrent cells with suitable ones.