1 Introduction

Recent advances in deep convolutional neural networks (CNNs) have led to many powerful image processing techniques, including image filtering [30, 37], enhancement [10, 24, 38], style transfer [17, 23, 29], colorization [19, 41], and general image-to-image translation [21, 27, 43]. However, extending these CNN-based methods to video is non-trivial due to memory and computational constraints as well as the availability of training datasets. Applying image-based algorithms independently to each video frame typically leads to temporal flickering, caused by the instability of global optimization algorithms or highly non-linear deep networks. One approach to achieving temporally coherent results is to explicitly embed a flow-based temporal consistency loss in the design and training of the networks. However, such an approach suffers from two drawbacks. First, it requires domain knowledge to re-design the algorithm [1, 16] or re-train a deep model [12, 15], as well as video datasets for training. Second, because these approaches depend on flow computation at test time, they tend to be slow.

Fig. 1.

Applications of the proposed method. Our algorithm takes per-frame processed videos with serious temporal flickering as inputs (lower-left) and generates temporally stable videos (upper-right) while maintaining perceptual similarity to the processed frames. Our method is blind to the specific image processing algorithm applied to input videos and runs at a high frame-rate. This figure contains animated videos (see supplementary material).

Bonneel et al. [6] propose a general approach to achieving temporally coherent results that is blind to the specific image processing algorithm. The method takes the original video and the per-frame processed video as inputs and solves a gradient-domain optimization problem to minimize the temporal warping error between consecutive frames. Although the results of Bonneel et al. [6] are temporally stable, their algorithm depends heavily on the quality of dense correspondence (e.g., optical flow or PatchMatch [2]) and may fail when severe occlusion occurs. Yao et al. [39] extend the method of Bonneel et al. [6] to account for occlusion by selecting a set of key-frames. However, the computational cost increases linearly with the number of key-frames, so their approach cannot be efficiently applied to long video sequences. Furthermore, both approaches assume that the gradients of the original video are similar to those of the processed video, which prevents them from handling tasks that generate new content (e.g., stylization).

In this work, we formulate the problem of video temporal consistency as a learning task. We propose to learn a deep recurrent network that takes the input and processed videos and generates temporally stable output videos. We minimize the short-term and long-term temporal losses between output frames and impose a perceptual loss from the pre-trained VGG network [34] to maintain the perceptual similarity between the output and processed frames. In addition, we embed a convolutional LSTM (ConvLSTM) [36] layer to capture the spatial-temporal correlation of natural videos. Our network processes video frames sequentially and can be applied to videos with arbitrary lengths. Furthermore, our model does not require computing optical flow at test time and thus can process videos at real-time rates (\(400+\) FPS on \(1280 \times 720\) videos).

As existing video datasets typically contain low-quality frames, we collect a high-quality video dataset with 80 videos for training and 20 videos for evaluation. We train our model on a wide range of applications, including colorization, image enhancement, and artistic style transfer, and demonstrate that a single trained model generalizes well to unseen applications (e.g., intrinsic image decomposition, image-to-image translation, see Fig. 1). We evaluate the quality of the output videos using the temporal warping error and a learned perceptual metric [42]. We show that the proposed method strikes a good balance between temporal stability and perceptual similarity. Furthermore, we conduct a user study to evaluate the subjective preference between the proposed method and state-of-the-art approaches.

We make the following contributions in this work:

  1.

    We present an efficient solution to remove temporal flickering in videos via learning a deep network with a ConvLSTM module. Our method does not require pre-computed optical flow or frame correspondences at test time and thus can process videos in real-time.

  2.

    We propose to minimize the short-term and long-term temporal loss for improving the temporal stability and adopt a perceptual loss to maintain the perceptual similarity.

  3.

    We provide a single model for handling multiple applications, including but not limited to colorization, enhancement, artistic style transfer, image-to-image translation, and intrinsic image decomposition. Extensive subjective and objective evaluations demonstrate that the proposed algorithm performs favorably against existing approaches on various types of videos.

2 Related Work

We address the temporal consistency problem on a wide range of applications, including automatic white balancing [14], harmonization [4], dehazing [13], image enhancement [10], style transfer [17, 23, 29], colorization [19, 41], image-to-image translation [21, 43], and intrinsic image decomposition [3]. A complete review of these applications is beyond the scope of this paper. In the following, we discuss task-specific and task-independent approaches that enforce temporal consistency on videos.

Task-Specific Approaches. A common solution to embedding the temporal consistency constraint is to use optical flow to propagate information between frames, e.g., colorization [28] and intrinsic decomposition [40]. However, estimating optical flow is computationally expensive and thus impractical for high-resolution frames and long sequences. Temporal filtering is an efficient way to extend image-based algorithms to videos and has been used in tone-mapping [1], color transfer [5], and visual saliency [25] to generate temporally consistent results. Nevertheless, these approaches assume a specific filter formulation and do not generalize to other applications.

Recently, several approaches have been proposed to improve the temporal stability of CNN-based image style transfer. Huang et al. [15] and Gupta et al. [12] train feed-forward networks by jointly minimizing content, style, and temporal warping losses. These methods, however, are limited to the specific styles used during training. Chen et al. [7] learn flow and mask networks to adaptively blend the intermediate features of the pre-trained style network. While the architecture design is independent of the style network, it requires access to intermediate features and cannot be applied to non-differentiable tasks. In contrast, the proposed model is entirely blind to the specific algorithm applied to the input frames and is thus applicable to optimization-based techniques, CNN-based algorithms, and combinations of Photoshop filters.

Table 1. Comparison of blind temporal consistency methods. Both the methods of Bonneel et al. [6] and Yao et al. [39] require dense correspondences from optical flow or PatchMatch [2], while the proposed method does not explicitly rely on these correspondences at test time. The algorithm of Yao et al. [39] involves a key-frame selection from the entire video and thus cannot generate output in an online manner.

Task-Independent Approaches. Several methods have been proposed to improve temporal consistency for multiple tasks. Lang et al. [25] approximate global optimization for a class of energy formulations (e.g., colorization, optical flow estimation) via temporal edge-aware filtering. Dong et al. [9] propose a segmentation-based algorithm and assume that the image transformation is spatially and temporally consistent. More general approaches assume gradient similarity [6] or local affine transformations [39] between the input and the processed frames. These methods, however, cannot handle more complicated tasks (e.g., artistic style transfer). In contrast, we use the VGG perceptual loss [23] to impose high-level perceptual similarity between the output and processed frames. We list feature-by-feature comparisons between Bonneel et al. [6], Yao et al. [39], and the proposed method in Table 1.

Fig. 2.

Overview of the proposed method. We train an image transformation network that takes \(I_{t-1}, I_t, O_{t-1}\) and processed frame \(P_t\) as inputs and generates the output frame \(O_t\) which is temporally consistent with the output frame at the previous time step \(O_{t-1}\). The output \(O_t\) at the current time step then becomes the input at the next time step. We train the image transformation network with the VGG perceptual loss and the short-term and long-term temporal losses.

3 Learning Temporal Consistency

In this section, we describe the proposed recurrent network and the design of the loss functions for enforcing temporal consistency on videos.

3.1 Recurrent Network

Figure 2 shows an overview of the proposed recurrent network. Our model takes as input the original (unprocessed) video \(\{I_t | t=1 \cdots T\}\) and the per-frame processed video \(\{P_t | t=1 \cdots T\}\), and produces a temporally consistent output video \(\{O_t | t=1 \cdots T\}\). To efficiently process videos of arbitrary length, we design the image transformation network as a recurrent convolutional network that generates output frames in an online manner (i.e., sequentially from \(t = 1\) to T). Specifically, we set the first output frame \(O_1 = P_1\). At each time step, the network learns to generate an output frame \(O_t\) that is temporally consistent with \(O_{t-1}\). The current output frame is then fed as an input at the next time step. To capture the spatial-temporal correlation of videos, we integrate a ConvLSTM layer [36] into our image transformation network. We discuss the detailed design of our image transformation network in Sect. 3.3.
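To make the online recurrence concrete, the following PyTorch sketch shows how frames could be processed sequentially at test time. `TransformNet`, its argument order, and the returned ConvLSTM state are illustrative assumptions for this sketch, not the released implementation.

```python
# A minimal sketch of the online recurrent inference loop. `net` is assumed
# to be a hypothetical TransformNet (see the sketch in Sect. 3.3) that returns
# the next output frame and the updated ConvLSTM state.
import torch

def stabilize_video(net, inputs, processed):
    """inputs, processed: lists of frames, each a (1, 3, H, W) tensor in [0, 1]."""
    outputs = [processed[0]]          # O_1 = P_1
    state = None                      # ConvLSTM hidden/cell state carried over time
    with torch.no_grad():
        for t in range(1, len(inputs)):
            # The network sees the current/previous input frames, the current
            # processed frame, and the previous *output* frame (the recurrence).
            frame, state = net(inputs[t], inputs[t - 1],
                               processed[t], outputs[t - 1], state)
            outputs.append(frame)
    return outputs
```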

Fig. 3.

Temporal losses. We adopt the short-term temporal loss on neighboring frames and the long-term temporal loss between the first and all other output frames.

3.2 Loss Functions

Our goal is to reduce the temporal inconsistency in the output video while maintaining the perceptual similarity with the processed frames. Therefore, we propose to train our model with (1) a perceptual content loss between the output frame and the processed frame and (2) short-term and long-term temporal losses between output frames.

Content Perceptual Loss. We compute the similarity between \(O_t\) and \(P_t\) using the perceptual loss from a pre-trained VGG classification network [34], which is commonly adopted in several applications (e.g., style transfer [23], super-resolution [26], and image inpainting [31]) and has been shown to correspond well to human perception [42]. The perceptual loss is defined as:

$$\begin{aligned} \mathcal {L}_p = \sum _{t=2}^T \sum _{i=1}^N \sum _{l} \left\| \phi _l(O_t^{(i)}) - \phi _l(P_t^{(i)}) \right\| _1, \end{aligned}$$
(1)

where \(O_t^{(i)} \in \mathbb{R}^3\) is the vector of RGB values of the i-th pixel of the output frame at time t, N is the total number of pixels in a frame, and \(\phi _l(\cdot )\) denotes the feature activation at the l-th layer of the VGG-19 network \(\phi \). We choose the 4-th layer (i.e., relu4-3) to compute the perceptual loss.
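As a concrete reference, a minimal PyTorch sketch of this loss is given below. It assumes inputs in [0, 1] with ImageNet normalization; mapping relu4-3 to index 24 of torchvision's VGG-19 feature stack is our assumption and should be checked against the actual layer naming.

```python
# A minimal sketch of the content perceptual loss in Eq. 1.
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    def __init__(self, layer_idx=25):   # slice [:25] ends at relu4-3 (our assumption)
        super().__init__()
        vgg = models.vgg19(weights="IMAGENET1K_V1").features  # pretrained=True on older torchvision
        self.feat = nn.Sequential(*list(vgg.children())[:layer_idx]).eval()
        for p in self.feat.parameters():
            p.requires_grad = False
        # ImageNet normalization, assuming inputs in [0, 1].
        self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

    def forward(self, output_frame, processed_frame):
        fo = self.feat((output_frame - self.mean) / self.std)
        fp = self.feat((processed_frame - self.mean) / self.std)
        return torch.mean(torch.abs(fo - fp))   # L1 distance between feature activations
```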

Short-Term Temporal Loss. We formulate the temporal loss as the warping error between the output frames:

$$\begin{aligned} \mathcal {L}_{st} = \sum _{t=2}^T \sum _{i=1}^N M_{t \Rightarrow t-1}^{(i)} \left\| O_t^{(i)} - \hat{O}_{t-1}^{(i)} \right\| _1, \end{aligned}$$
(2)

where \(\hat{O}_{t-1}\) is the frame \(O_{t-1}\) warped by the optical flow \(F_{t \Rightarrow t-1}\), and \(M_{t \Rightarrow t-1} = \exp (-\alpha \Vert I_t - \hat{I}_{t-1} \Vert _2^2)\) is the visibility mask computed from the warping error between the input frame \(I_t\) and the warped input frame \(\hat{I}_{t-1}\). The optical flow \(F_{t \Rightarrow t-1}\) is the backward flow between \(I_{t-1}\) and \(I_t\). We use FlowNet2 [20] to efficiently compute flow on-the-fly during training. We use the bilinear sampling layer [22] to warp frames and empirically set \(\alpha = 50\) (with pixel values in [0, 1]).
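The sketch below illustrates one way to implement Eq. 2 with PyTorch's `grid_sample` as the bilinear sampling layer. The flow layout (channel 0 as horizontal displacement in pixels) and the use of a mean instead of a sum are assumptions of this sketch.

```python
# A sketch of the short-term temporal loss in Eq. 2. `flow` is the backward
# optical flow F_{t=>t-1} (e.g., from FlowNet2), shaped (B, 2, H, W) with
# channel 0 assumed to be the horizontal displacement; frames are (B, 3, H, W).
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W), (x, y)
    coords = grid.unsqueeze(0) + flow                               # follow the flow
    # Normalize to [-1, 1]; grid_sample expects a (B, H, W, 2) grid of (x, y).
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=3)
    return F.grid_sample(frame, grid_norm, align_corners=True)

def short_term_loss(O_t, O_prev, I_t, I_prev, flow, alpha=50.0):
    O_prev_warp = backward_warp(O_prev, flow)
    I_prev_warp = backward_warp(I_prev, flow)
    # Visibility mask from the warping error of the *input* frames (Eq. 2).
    mask = torch.exp(-alpha * torch.sum((I_t - I_prev_warp) ** 2, dim=1, keepdim=True))
    return torch.mean(mask * torch.abs(O_t - O_prev_warp))  # mean instead of sum
```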

Long-Term Temporal Loss. While the short-term temporal loss in Eq. 2 enforces temporal consistency between consecutive frames, there is no guarantee of long-term (e.g., beyond 5 frames) coherence. A straightforward way to enforce long-term temporal consistency is to apply the temporal loss to all pairs of output frames. However, such a strategy incurs significant computational cost (e.g., optical flow estimation) during training. Furthermore, computing the temporal loss between two intermediate outputs is not meaningful before the network converges.

Instead, we propose to impose long-term temporal losses between the first output frame and all of the output frames:

$$\begin{aligned} \mathcal {L}_{lt} = \sum _{t=2}^T \sum _{i=1}^N M_{t \Rightarrow 1}^{(i)} \left\| O_t^{(i)} - \hat{O}_{1}^{(i)} \right\| _1. \end{aligned}$$
(3)

We illustrate an unrolled version of our recurrent network as well as the short-term and long-term losses in Fig. 3. During training, we enforce long-term temporal coherence over a maximum of 10 frames (\(T = 10\)).

Overall Loss. The overall loss function for training our image transformation network is defined as:

$$\begin{aligned} \mathcal {L} = \lambda _p \mathcal {L}_p + \lambda _{st} \mathcal {L}_{st} + \lambda _{lt} \mathcal {L}_{lt}, \end{aligned}$$
(4)

where \(\lambda _p\), \(\lambda _{st}\) and \(\lambda _{lt}\) are the weights for the content perceptual loss, short-term and long-term losses, respectively.
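Putting the pieces together, the following sketch outlines one unrolled training step over a clip of T frames. The helper functions refer to the sketches above, `flow_net` stands in for a frozen flow estimator (FlowNet2 in the paper), and the loss weights are placeholders chosen to be consistent with the ratio r = 10 discussed in Sect. 4.4, not the authors' exact values.

```python
# A hedged sketch of one unrolled training step combining Eqs. 1-4.
# `perceptual`, `short_term_loss`, and `backward_warp` are the sketches above.
def training_step(net, flow_net, perceptual, I, P,
                  lambda_p=10.0, lambda_st=100.0, lambda_lt=100.0, alpha=50.0):
    """I, P: lists of T input / processed frames, each (B, 3, H, W)."""
    O = [P[0]]                                   # O_1 = P_1 is not optimized
    state = None
    loss_p = loss_st = loss_lt = 0.0
    for t in range(1, len(I)):
        out, state = net(I[t], I[t - 1], P[t], O[t - 1], state)
        O.append(out)
        loss_p = loss_p + perceptual(out, P[t])                      # Eq. 1
        flow_st = flow_net(I[t], I[t - 1])                           # F_{t => t-1}
        loss_st = loss_st + short_term_loss(out, O[t - 1], I[t], I[t - 1],
                                            flow_st, alpha)          # Eq. 2
        flow_lt = flow_net(I[t], I[0])                               # F_{t => 1}
        loss_lt = loss_lt + short_term_loss(out, O[0], I[t], I[0],
                                            flow_lt, alpha)          # Eq. 3
    return lambda_p * loss_p + lambda_st * loss_st + lambda_lt * loss_lt  # Eq. 4
```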

3.3 Image Transformation Network

The input to our image transformation network is the concatenation of the current processed frame \(P_t\), the previous output frame \(O_{t-1}\), and the current and previous unprocessed frames \(I_t\) and \(I_{t-1}\). As the output frame typically looks similar to the current processed frame, we train the network to predict the residual rather than the actual pixel values, i.e., \(O_t = P_t + \mathcal {F}(P_t)\), where \(\mathcal {F}\) denotes the image transformation network. Our image transformation network consists of two strided convolutional layers, B residual blocks, one ConvLSTM layer, and two transposed convolutional layers.

Fig. 4.

Architecture of our image transformation network. We split the input into two streams to avoid transferring low-level information from the input frames to the output.

We add skip connections from the encoder to the decoder to improve the reconstruction quality. However, for some applications, the processed frames may have a dramatically different appearance from the input frames (e.g., style transfer or intrinsic image decomposition). We observe that the skip connections may transfer low-level information (e.g., color) to the output frames and produce visual artifacts. Therefore, we divide the input into two streams: one for the processed frames \(P_t\) and \(O_{t-1}\), and the other for the input frames \(I_t\) and \(I_{t-1}\). As illustrated in Fig. 4, we add skip connections only from the processed-frame stream to avoid transferring low-level information from the input frames. We provide all the implementation details in the supplementary material.
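For illustration, a structural sketch of such a two-stream network is given below. The channel widths, the number of residual blocks, the minimal ConvLSTM cell, and the single skip connection are our simplifications under the description above; the released architecture may differ (see the supplementary material).

```python
# A structural sketch of the two-stream transformation network in Fig. 4.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 4 * ch, 3, padding=1)

    def forward(self, x, state):
        h, c = state if state is not None else (torch.zeros_like(x),) * 2
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

def conv_block(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1), nn.ReLU(inplace=True))

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(conv_block(ch, ch), nn.Conv2d(ch, ch, 3, 1, 1))

    def forward(self, x):
        return x + self.body(x)

class TransformNet(nn.Module):
    def __init__(self, ch=32, num_blocks=5):
        super().__init__()
        # Two encoder streams: one for {P_t, O_{t-1}}, one for {I_t, I_{t-1}}.
        self.enc_p1 = conv_block(6, ch, stride=2)
        self.enc_p2 = conv_block(ch, 2 * ch, stride=2)
        self.enc_i1 = conv_block(6, ch, stride=2)
        self.enc_i2 = conv_block(ch, 2 * ch, stride=2)
        self.blocks = nn.Sequential(*[ResBlock(4 * ch) for _ in range(num_blocks)])
        self.lstm = ConvLSTMCell(4 * ch)
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(4 * ch, ch, 4, 2, 1),
                                  nn.ReLU(inplace=True))
        self.dec1 = nn.ConvTranspose2d(ch, 3, 4, 2, 1)

    def forward(self, I_t, I_prev, P_t, O_prev, state=None):
        p1 = self.enc_p1(torch.cat([P_t, O_prev], dim=1))        # 1/2 resolution
        p2 = self.enc_p2(p1)                                     # 1/4 resolution
        i2 = self.enc_i2(self.enc_i1(torch.cat([I_t, I_prev], dim=1)))
        x = self.blocks(torch.cat([p2, i2], dim=1))              # fuse the two streams
        x, state = self.lstm(x, state)
        x = self.dec2(x) + p1          # skip connection from the processed stream only
        residual = self.dec1(x)
        return P_t + residual, state   # predict the residual: O_t = P_t + F(.)
```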

4 Experimental Results

In this section, we first describe the employed datasets for training and testing, followed by the applications of the proposed method and the metrics for evaluating the temporal stability and perceptual similarity. We then analyze the effect of each loss term in balancing the temporal coherence and perceptual similarity, conduct quantitative and subjective comparisons with existing approaches, and finally discuss the limitations of our method. The source code and datasets are publicly available at http://vllab.ucmerced.edu/wlai24/video_consistency.

4.1 Datasets

We use the DAVIS-2017 dataset [32], which is designed for video segmentation and contains a variety of moving objects and motion types. The DAVIS dataset has 60 videos for training and 30 videos for validation. However, the videos in the DAVIS dataset are usually short (less than 3 s), with 4,209 training frames in total. Therefore, we collect an additional 100 high-quality videos from Videvo.net [35], of which 80 are used for training and 20 for testing. We scale the height of the video frames to 480 pixels and keep the aspect ratio. We use both the DAVIS and Videvo training sets, which contain a total of 25,735 frames, to train our network.

4.2 Applications

As we do not make any assumptions about the underlying image-based algorithms, our method is applicable to a wide variety of applications.

Artistic Style Transfer. Image style transfer has been shown to be sensitive to minor changes in the content images due to the non-convexity of the Gram matrix matching objective [12]. We apply our method to the results of state-of-the-art style transfer approaches [23, 29].

Colorization. Single image colorization aims to hallucinate plausible colors from a given grayscale input image. Recent algorithms [19, 41] learn deep CNNs from millions of natural images. When colorization methods are applied to a video frame-by-frame, they typically produce low-frequency flickering.

Image Enhancement. Gharbi et al. [10] train deep networks to learn the user-created action scripts of Adobe Photoshop for enhancing images. Their models produce high-frequency flickering on most of the videos.

Intrinsic Image Decomposition. Intrinsic image decomposition aims to decompose an image into a reflectance and a shading layer. The problem is highly ill-posed due to the scale ambiguity. We apply the approach of Bell et al. [3] to our test videos. As expected, the image-based algorithm produces serious temporal flickering artifacts when applied to each frame in the video independently.

Image-to-Image Translation. In recent years, image-to-image translation has attracted considerable attention due to the success of Generative Adversarial Networks (GANs) [11]. The CycleGAN model [43] learns mappings from one image domain to another without paired training data. When the transformation generates new texture (e.g., photo \(\rightarrow \) painting, horse \(\rightarrow \) zebra) or the mapping admits multiple plausible solutions (e.g., gray \(\rightarrow \) RGB), the resulting videos inevitably suffer from temporal flickering artifacts.

The above algorithms are general and can be applied to any type of video; when applied frame-by-frame, they produce temporal flickering artifacts on most videos in our test sets. We use the WCT [29] style transfer algorithm with three style images, one of the enhancement models of Gharbi et al. [10], the colorization method of Zhang et al. [41], and the shading layer of Bell et al. [3] as our training tasks; the remaining tasks are used for testing. We demonstrate that the proposed method learns a single model for multiple applications and also generalizes to unseen tasks.

4.3 Evaluation Metrics

Our goal is to generate a temporally smooth video while maintaining the perceptual similarity with the per-frame processed video. We use the following metrics to measure the temporal stability and perceptual similarity on the output videos.

Temporal Stability. We measure the temporal stability of a video based on the flow warping error between two frames:

$$\begin{aligned} E_{\text {warp}}(V_t, V_{t+1}) = \frac{1}{\sum _{i=1}^N M_t^{(i)}} \sum _{i=1}^N M_t^{(i)} \Vert V_t^{(i)} - \hat{V}_{t+1}^{(i)} \Vert _2^2, \end{aligned}$$
(5)

where \(\hat{V}_{t+1}\) is the warped frame \(V_{t+1}\) and \(M_t \in \{0, 1\}\) is a binary mask indicating non-occluded regions. We use the occlusion detection method of [33] to estimate the mask \(M_t\). The warping error of a video is calculated as:

$$\begin{aligned} E_{\text {warp}}(V) = \frac{1}{T-1} \sum _{t=1}^{T-1} E_{\text {warp}}(V_t, V_{t+1}), \end{aligned}$$
(6)

which is the average warping error over the entire sequence.
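A possible implementation of Eqs. 5 and 6 is sketched below, assuming pre-computed backward flows and binary non-occlusion masks; `backward_warp` is the helper from the sketch in Sect. 3.2, and the names here are illustrative.

```python
# A sketch of the temporal stability metric in Eqs. 5-6. `flows[t]` is the
# backward flow from frame t+1 to frame t and `masks[t]` is the corresponding
# (1, 1, H, W) non-occlusion mask (the paper uses the occlusion detection of [33]).
import torch

def warping_error(video, flows, masks, eps=1e-8):
    """video: list of (1, 3, H, W) frames; flows, masks: lists of length T-1."""
    errors = []
    for t in range(len(video) - 1):
        warped_next = backward_warp(video[t + 1], flows[t])
        diff = masks[t] * (video[t] - warped_next) ** 2        # masked squared error
        errors.append(diff.sum() / (masks[t].sum() + eps))     # Eq. 5
    return torch.stack(errors).mean()                          # average over t (Eq. 6)
```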

Perceptual Similarity. Recently, the features of the pre-trained VGG network [34] have been shown to be effective as a training loss for generating realistic images in several vision tasks [8, 26, 31]. Zhang et al. [42] further propose a perceptual metric by calibrating the deep features of ImageNet classification networks. We adopt the calibrated SqueezeNet [18] model (denoted as \(\mathcal {G}\)) to measure the perceptual distance between the processed video P and the output video O:

$$\begin{aligned} D_{\text {perceptual}}(P, O) = \frac{1}{T-1} \sum _{t=2}^{T} \mathcal {G}(O_t, P_t). \end{aligned}$$
(7)

We note that the first frame is fixed as a reference in both Bonneel et al. [6] and our algorithm. Therefore, we exclude the first frame from computing the perceptual distance in Eq. 7.
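For reference, the sketch below computes Eq. 7 with the official `lpips` package, which implements the calibrated metric of Zhang et al. [42]; exact scores depend on the package version and backbone weights, so treat this as an approximation of the reported setup.

```python
# A sketch of the perceptual distance in Eq. 7 using the calibrated
# SqueezeNet variant of LPIPS.
import torch
import lpips

def perceptual_distance(processed, output):
    """processed, output: lists of T frames, each (1, 3, H, W) in [0, 1]."""
    metric = lpips.LPIPS(net="squeeze")            # calibrated SqueezeNet model
    dists = []
    with torch.no_grad():
        for t in range(1, len(output)):            # skip the fixed first frame
            # LPIPS expects inputs scaled to [-1, 1].
            d = metric(output[t] * 2 - 1, processed[t] * 2 - 1)
            dists.append(d.view(-1))
    return torch.cat(dists).mean()                 # average over frames 2..T
```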

Fig. 5.

Analysis of parameters. (Left) When \(\lambda _t\) is large enough, choosing \(r = 10\) (shown in red) achieves a good balance between reducing the temporal warping error and the perceptual distance. (Right) The trade-off between perceptual similarity and temporal warping for different ratios r, compared to Bonneel et al. [6] and the original processed video \(V_p\).

4.4 Analysis and Discussions

An extremely blurred video may have high temporal stability but with low perceptual similarity; in contrast, the processed video itself has perfect perceptual similarity but is temporally unstable. Due to the trade-off between the temporal stability and perceptual similarity, it is important to balance these two properties and produce visually pleasing results.

To understand the relationship between the temporal and content losses, we train models with several combinations of \(\lambda _p\) and \(\lambda _t\) (\(= \lambda _{st} = \lambda _{lt}\)). We use one of the styles (i.e., udnie) from the fast neural style transfer method [23] for evaluation. We show the quantitative evaluation on the DAVIS test set in Fig. 5. We observe that the ratio \(r = \lambda _t / \lambda _p\) plays an important role in balancing temporal stability and perceptual similarity. When the ratio \(r < 10\), the perceptual loss dominates the optimization, and temporal flickering remains in the output videos. When the ratio \(r > 10\), the output videos become overly blurred and thus have a large perceptual distance to the processed videos. When \(\lambda _{t}\) is sufficiently large (i.e., \(\lambda _t \ge 100\)), the setting \(r = 10\) strikes a good balance between reducing temporal flickering and maintaining a small perceptual distance. We find similar observations for other applications as well.

Fig. 6.

Visual comparisons on style transfer. We compare the proposed method with Bonneel et al. [6] on smoothing the results of WCT [29]. Our approach maintains the stylized effect of the processed video and reduces the temporal flickering.

Table 2. Quantitative evaluation on temporal warping error. The “Trained” column indicates the applications used for training our model. Our method achieves a similarly reduced temporal warping error as Bonneel et al. [6], which is significantly less than the original processed video (\(V_p\)).
Table 3. Quantitative evaluation on perceptual distance. Our method has lower perceptual distance than Bonneel et al. [6].

4.5 Comparison with State-of-the-Art Methods

We evaluate the temporal warping error (Eq. 6) and the perceptual distance (Eq. 7) on the two video test sets. We compare the proposed method with Bonneel et al. [6] on 16 applications: 2 styles of Johnson et al. [23], 6 styles of WCT [29], 2 enhancement models of Gharbi et al. [10], the reflectance and shading layers of Bell et al. [3], 2 photo-to-painting models of CycleGAN [43], and 2 colorization algorithms [19, 41]. We provide the average temporal warping error and perceptual distance in Tables 2 and 3, respectively. In general, our results achieve a lower perceptual distance while maintaining a temporal warping error comparable to that of Bonneel et al. [6].

We show visual comparisons with Bonneel et al. [6] in Figs. 6 and 7. Although the method of Bonneel et al. [6] produces temporally stable results, the assumption of identical gradients in the processed and original videos leads to overly smoothed content, for example washing out stylization effects. Furthermore, when occlusion occurs over a large region, their method fails due to the lack of a long-term temporal constraint. In contrast, the proposed method dramatically reduces the temporal flickering while maintaining perceptual similarity to the processed videos. We note that our approach is not limited to the above applications but can also be applied to tasks such as automatic white balancing [14], image harmonization [4], and image dehazing [13]. Due to the space limit, we provide more results and videos on our project website.

4.6 Subjective Evaluation

We conduct a user study to measure user preference on the quality of videos. We adopt pairwise comparisons, i.e., we ask participants to choose the preferred result from a pair of videos. In each test, we provide the original and processed videos as references and show two results (Bonneel et al. [6] and ours) for comparison. We randomize the presentation order of the result videos in each test. In addition, we ask participants to provide the reasons for preferring the selected video from the following options: (1) the video flickers less; (2) the video preserves the effect of the processed video well.

We evaluate all 50 test videos with the 10 test applications that were held out during training. Each participant compares 20 video pairs, and we obtain results from a total of 60 subjects. Figure 8(a) shows the percentage of votes obtained, where our approach is preferred on all 5 applications. Figure 8(b) shows the reasons given when a method is selected. The results of Bonneel et al. [6] are chosen mainly for their temporal stability, while users prefer our results because they preserve the effect of the processed video well. The observations in the user study are consistent with the quantitative evaluation in Sect. 4.5.

Fig. 7.

Visual comparisons on colorization. We compare the proposed method with Bonneel et al. [6] on smoothing the results of image colorization [19]. The method of Bonneel et al. [6] cannot preserve the colorized effect when occlusion occurs.

Fig. 8.

Subjective evaluation. On average, our method is preferred by \(62\%\) users. The error bars show the \(95\%\) confidence interval.

4.7 Execution Time

We evaluate the execution time of the proposed method and Bonneel et al. [6] on a machine with a 3.4 GHz Intel i7 CPU (64 GB RAM) and an Nvidia Titan X GPU. As the proposed method does not require computing optical flow at test time, it runs at 418 FPS on the GPU for videos with a resolution of \(1280 \times 720\). In contrast, the method of Bonneel et al. [6] runs at 0.25 FPS on the CPU.

4.8 Limitations and Discussion

Our approach is not able to handle applications that generate entirely different image content on each frame, e.g., image completion [31] or synthesis [8]. Extending those methods to videos would require incorporating strong video priors or temporal constraints, most likely into the design of the specific algorithms themselves.

In addition, given the way the task is formulated, there is always a trade-off between being temporally coherent and being perceptually similar to the processed video. Depending on the specific effect applied, there will be cases where flicker (temporal instability) is preferable to blur, and vice versa. In our current method, a user can choose a model based on their preference for flicker or blur; an interesting direction for future work would be to investigate perceptual models of what constitutes acceptable flicker and acceptable blur. Nonetheless, we use the same trained model (same parameters) for all our results and show a clear viewer preference over prior methods for blind temporal stability.

5 Conclusions

In this work, we propose a deep recurrent neural network to reduce temporal flickering in per-frame processed videos. We optimize short-term and long-term temporal losses as well as a perceptual loss to reduce temporal instability while preserving perceptual similarity to the processed videos. Our approach is agnostic to the underlying image-based algorithm applied to the video and generalizes to a wide range of unseen applications. We demonstrate that the proposed algorithm performs favorably against existing blind temporal consistency methods on a diverse set of applications and various types of videos.