Temporally Consistent Video Colorization with Deep Feature Propagation and Self-regularization Learning

Video colorization is a challenging and highly ill-posed problem. Although recent years have witnessed remarkable progress in single image colorization, there is relatively less research effort on video colorization, and existing methods often suffer from severe flickering artifacts (temporal inconsistency) or unsatisfactory colorization performance. We address this problem from a new perspective, by jointly considering colorization and temporal consistency in a unified framework. Specifically, we propose a novel temporally consistent video colorization framework (TCVC). TCVC effectively propagates frame-level deep features in a bidirectional way to enhance the temporal consistency of colorization. Furthermore, TCVC introduces a self-regularization learning (SRL) scheme to minimize the prediction differences obtained with different time steps. SRL does not require any ground-truth color videos for training and can further improve temporal consistency. Experiments demonstrate that our method not only obtains visually pleasing colorized video, but also achieves clearly better temporal consistency than state-of-the-art methods.


INTRODUCTION
VIDEO colorization aims to generate a fully colored video from its monochrome version. This topic is attractive and has wide applications, since numerous legacy black-and-white movies were produced in past decades. Colorization can also assist other computer vision tasks such as detection [1], [2], tracking [3], [4] and video action recognition [5].
Colorization is a challenging problem due to its highly ill-posed and ambiguous nature. In recent years, plenty of single image colorization methods have been proposed and have achieved remarkable progress [6], [7], [10], [11], [12]. Compared with image colorization, video colorization [8], [13], [14] is more complex and receives relatively less attention. It requires not only satisfactory colorization performance but also good temporal consistency, as evaluated in Figure 1. A simple way to realize this task is to treat a video sequence as a series of frames and to process each frame independently using an image-based colorization model. In practice, however, when colorizing consecutive sequences, this naive solution tends to produce results suffering from flickering artifacts (temporal inconsistency). As shown in Figure 2, the results of InsColor [12], a recent state-of-the-art image-based method, are not temporally consistent. Although the colorization effect of each frame is good, the overall results contain unstable flickering, e.g., the colors of the sky and the clothes are inconsistent. This highlights the temporal consistency problem of video colorization.

Fig. 1. Compared with existing algorithms (CIC [6], IDC [7], FAVC [8] and BTC [9]), our method achieves both satisfactory colorization performance and good temporal consistency. b denotes the image-based method backbone.
In general, there are currently two ways to realize temporally consistent video colorization. The first is to design a specialized video colorization model that explicitly considers temporal coherence. This demands tedious domain knowledge to devise the algorithm, involving delicate exploration of network structures and loss functions [6], [7]. A recent work, FAVC [8], first employed deep learning to achieve automatic video colorization by utilizing self-regularization and diversity loss. However, with their focus mainly on consistency, their colorization performance for individual frames is far from satisfactory. As shown in the third row of Figure 2, the sky is grayish and hazy, and the overall results are not vivid. More results of FAVC can be found in Figure 5. Its results are usually unsaturated with a grayish or yellowish hue. Without good colorization, the temporal consistency will be of less significance.

Fig. 2. Image-based colorization methods, e.g., InsColor [12], tend to bring about severe flickering artifacts with inconsistent colors (highlighted in green rectangles). The colorization effect of the video-based method FAVC [8] is not satisfactory: the sky is hazy, the grass is not fully colorized and the overall results are grayish. Instead, our method can achieve good temporal consistency while maintaining excellent colorization performance. More comparison results are shown in Section 4.
Another way is to apply post-processing on the output frames to generate a more temporally consistent video [9], [15], [16], [17]. For instance, Lai et al. [9] present a deep network with a ConvLSTM module for blind video temporal consistency (BTC), which minimizes short-term and long-term temporal losses to constrain temporal stability. Although such post-processing methods can enhance temporal consistency, they directly regenerate all the frames of the original video, which largely alters the overall frame contents and increases the risk of incorrect modification when outliers occur. Moreover, these methods cannot achieve task-specific processing, because different videos in various tasks are manipulated by the same operators, which can lead to a dramatic quantitative performance drop compared with the original output (see Figure 1). Further, methods like BTC [9] only consider the information of previous frames in the forward propagation direction; it is necessary to integrate bidirectional information when handling a consecutive video sequence.
Unlike the aforementioned approaches, we tackle video colorization from a new perspective. Rather than designing a complicated and specialized model, we jointly account for both frame-level colorization and temporal consistency constraints in a unified deep architecture. Specifically, we propose a novel Temporally Consistent Video Colorization framework (TCVC) that leverages deep features extracted from an image-based model G to generate contiguous adjacent features by bidirectional feature propagation. We only utilize G to extract several anchor frame features, while the remaining internal frame features are all generated from the anchor frames. Eventually, the colorization performance of our method surpasses that of the image-based model G, and the temporal consistency is largely improved as well. Throughout the process, we formulate the spatial-temporal alignment and propagation in high-dimensional feature space rather than image space. Differing from conventional supervised learning, we do not employ any explicit loss with the ground-truth color video, but only adopt a temporal warping loss for self-regularization. As a result, our method is label-free and data-independent. This self-regularization mechanism also makes the training procedure very efficient. Experiments demonstrate that the proposed framework can favorably preserve the colorization performance of the image-based method while simultaneously achieving state-of-the-art temporal consistency for video colorization (as compared in Figure 1).

RELATED WORK
Image and video colorization. Conventional colorization methods resort to additional information provided by user scribbles [18], [19], [20], [21], [22] or example images [23], [24], [25]. These methods treat colorization as a constrained optimization problem; e.g., Levin et al. [18] proposed an interactive colorization technique that propagated colors from scribbles to neighboring similar pixels. Recently, deep learning techniques have been applied to colorization [6], [7], [10], [11], [12], [26], [27]. Iizuka et al. [10] devised a two-branch network for jointly learning colorization and classification. Zhang et al. [6] modeled colorization as a classification problem to predict the distribution of possible colors for each pixel. Su et al. [12] proposed an instance-aware image colorization model which integrated object detection and colorization together. The aforementioned works have achieved impressive performance on single images but suffer heavily from flickering artifacts when tested on video. Video colorization [8], [13], [14], [28], [29] needs to consider both colorization performance and temporal consistency. Recently, a pioneering deep-learning-based work, FAVC [8], was proposed for automatic video colorization, which is the most relevant work to ours. FAVC regularized its model with a KNN graph built on the ground-truth color video and simultaneously posed a temporal loss term for constraining temporal consistency. However, the colorization performance of FAVC is not satisfactory.
Video temporal consistency. The temporal consistency problem has been addressed in diverse applications, such as artistic style transfer [30], [31], [32], [33], [34], image enhancement [35], [36], [37], [38] and colorization [8], [9]. Bonneel et al. [15] proposed a gradient-domain technique to infer the temporal regularity from the original unprocessed video. Yao et al. [16] developed an online keyframe strategy to keep track of dynamic objects and to handle occlusions. Lai et al. [9] presented a ConvLSTM-based method which took advantage of a deep recurrent network and perceptual similarity [39]. These video temporal consistency algorithms are usually post-processing methods, which modify each frame of the input video and produce a new output video. Instead of applying post-processing, our work addresses temporal consistency and video colorization in a unified framework.

Temporally Consistent Video Colorization
Given an input grayscale video, the objective of video colorization is to obtain its corresponding colorized version. Following previous works [6], [7], [10], [12], we perform this task in the CIE Lab color space and predict the two chrominance channels associated with a grayscale image, i.e., the a and b channels.

Overview
For a long input grayscale video sequence, we can decompose it into several intervals. Assuming each interval sequence contains N consecutive grayscale frames X = {x_1, x_2, ..., x_N}, we denote the start frame x_1 and the last frame x_N as anchor frames, and the remaining N − 2 frames as internal frames. Thereupon, the input sequence is divided by several anchor frames with internal frames in between. Our method works on each interval sequence, and we consider the input sequence as a continuum with continual camera and object motions.
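The interval decomposition above can be sketched in a few lines. This is an illustrative helper (the names `split_into_intervals`, `frames` and `N` are our own, not the paper's); it only encodes the property stated in the text that anchor frames bound each interval, so consecutive intervals share their boundary frame:

```python
def split_into_intervals(frames, N):
    """Split a frame list into intervals of (at most) N frames.

    Consecutive intervals share their boundary frame, so the last frame
    of one interval serves as the first anchor of the next. The first
    and last frame of each interval are anchors; the rest are internal.
    """
    intervals = []
    start = 0
    while start < len(frames) - 1:
        end = min(start + N, len(frames))
        intervals.append(frames[start:end])
        start = end - 1  # boundary frame is reused as the next anchor
    return intervals
```

With 10 frames and N = 4 this yields three intervals, each sharing one anchor with its neighbor.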
The proposed TCVC framework leverages the temporal and motional information of consecutive frames in high-dimensional feature space. Any image-based colorization model G can be naturally separated into two parts: a feature extraction module G_E and a color mapping module G_C. Generally, the color mapping module corresponds to the last output layer of G, while the feature extraction module includes all the layers before the output layer. As illustrated in Figure 3, firstly, the deep features of the two anchor frames are extracted through the feature extraction module G_E. Then, the features are sequentially propagated in the forward and backward directions frame by frame; they contain essential information for colorization. Finally, at each frame step, the associated deep features are fed into the shared color mapping module G_C to obtain the predicted color chrominance channels ŷ_i.
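The split of G into G_E and G_C can be illustrated abstractly. In the sketch below, a model is represented as an ordered list of layer functions; everything up to the penultimate layer becomes G_E and the final layer becomes G_C. The function name and representation are hypothetical (a real backbone would be a CNN, e.g. in PyTorch), but the partitioning logic matches the description above:

```python
def split_backbone(layers):
    """Split a model, given as an ordered list of layer functions, into
    a feature extractor G_E (all layers up to the penultimate one) and
    a color mapping head G_C (the final output layer)."""
    def G_E(x):
        for layer in layers[:-1]:
            x = layer(x)
        return x
    G_C = layers[-1]
    return G_E, G_C
```

Composing G_C(G_E(x)) then reproduces the full model G(x), which is why anchor frames colorized through this split keep exactly the backbone's style and performance.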

Anchor Frame Processing
Given the anchor frames at both ends of the interval sequence, our goal is to generate the color channels of each internal frame by propagating the features of the anchor frames. The anchor frames are directly processed by the colorization backbone G. As shown in Figure 3, in each interval sequence, the anchor frame branch colorizes the two anchor frames and extracts the deep features for propagation:

F^f_1 = G_E(x_1), F^b_N = G_E(x_N). (1)

The superscripts f and b represent the forward and backward directions, respectively. As described in Equation (1), the extracted features are the output of the penultimate layer of model G, which properly matches the color mapping module G_C. The color mapping module G_C accepts these features and outputs the predicted color channels.
Remarkably, G can be any CNN-based algorithm, such as CIC [6], IDC [7], etc., making TCVC a plug-and-play framework. Note that, in the TCVC framework, the colorization backbone G is fixed without training; we only adopt its G_E to extract anchor frame features and its G_C to predict ŷ_i. Since the anchor frame branch applies model G on the anchor frames directly, it does not change the colorization style or performance of G. With the initial deep features extracted from the anchor frames, the features of each internal frame are progressively generated by bidirectional propagation. The forward feature propagation is initiated at the start anchor frame x_1 and the backward feature propagation is initiated at the last anchor frame x_N.

Bidirectional Deep Feature Propagation
For internal frames, we make use of the temporal and motional characteristics of the video sequence to generate the associated features from the anchor frame features. This procedure is carried out by backward propagation and forward propagation sequentially. We adopt bidirectional feature propagation since all the information needed to generate the internal frame features is encoded and contained in the forward and backward directions. The feature propagation initiates from the backward direction.
Backward propagation. As depicted in Figure 3, the feature propagation begins in the backward direction. We first estimate the optical flow between two adjacent frames. Based on the estimated motion fields, we can obtain coarsely warped internal frame features in the backward direction:

F^b_i = warp(F^b_{i+1}, f_{i+1→i}),

where F^b_i is the backward warped feature at the i-th frame, warp(•, •) denotes the warping function, which can be implemented using bilinear interpolation [40], and f_{i+1→i} represents the optical flow from frame x_{i+1} to x_i. We adopt FlowNet2 [41] to compute the flow, due to its effectiveness in related tasks. The backward propagation starts from F^b_N and generates the features of each internal frame F^b_i. However, backward propagation alone is insufficient. Without complementary information from the opposite direction, the warping operation will cause the features to continuously shift in one direction, resulting in information loss. Therefore, after the backward propagation, the forward propagation starts from the other direction.
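The bilinear warping operation can be sketched as follows. This is a minimal NumPy illustration of warping a feature map with a dense flow field, not the paper's implementation (which would typically use a differentiable warp such as PyTorch's `grid_sample`); the flow convention (horizontal displacement in channel 0, vertical in channel 1) is our assumption:

```python
import numpy as np

def warp(feature, flow):
    """Backward-warp a feature map (H, W, C) with a dense optical flow
    (H, W, 2) via bilinear interpolation: the output at (y, x) samples
    the input at (y + flow_y, x + flow_x), clamped to the image."""
    H, W = feature.shape[:2]
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # displaced sampling positions, clamped to valid coordinates
    sx = np.clip(xs + flow[..., 0], 0, W - 1)
    sy = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0
    # bilinear blend of the four neighboring feature vectors
    top = feature[y0, x0] * (1 - wx)[..., None] + feature[y0, x1] * wx[..., None]
    bot = feature[y1, x0] * (1 - wx)[..., None] + feature[y1, x1] * wx[..., None]
    return top * (1 - wy)[..., None] + bot * wy[..., None]
```

A zero flow leaves the feature unchanged, and an integer flow reduces to a pure shift, which makes the function easy to sanity-check.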
Forward propagation. The forward propagation starts from the first frame x_1. Similar to the backward propagation, it first obtains a coarsely warped internal frame feature based on the estimated optical flow. Furthermore, the forward propagation is responsible for more functions, including integrating the backward and forward features and generating color channels with the fused features. To integrate the features propagated in the backward and forward directions, we devise an effective frame-specific feature fusion module (FFM), whose details will be described later. We denote the output feature of the FFM in the forward propagation as F^f_i, which combines fine bidirectional information for subsequent colorization. Note that, except for the first forward feature F^f_1, which is directly delivered to the next frame, the forward features to be propagated are the features after fusion, i.e., F^f_i, where i = 1, 2, ..., N − 2. After feeding F^f_i into the shared color mapping module G_C, the predicted color channels of the internal grayscale frame x_i are obtained:

ŷ_i = G_C(F^f_i).

Feature fusion module. The structure of the proposed feature fusion module is detailed in Figure 4. It contains a weighting network (WN) and a feature refine network (FRN), which are both three-layer plain CNNs. In the FFM, three consecutive images x_{i−1}, x_i, x_{i+1} are first fed into G_E to obtain the corresponding features, which are then concatenated together with the other inputs and fed into the WN and FRN.

Feature Refine Network
Intuitively, the warped backward feature F^b_i and forward feature F^f_i are both coarsely aligned with the current frame x_i. However, due to the different propagation directions, they contain complementary and redundant parts at different pixel locations. The weighting network predicts a weighting map W ∈ R^{H×W×1} ranged in [0, 1]. Then, the forward and backward features are fused by a simple linear interpolation:

F^fb_i = W ⊙ F^f_i + (1 − W) ⊙ F^b_i,

where ⊙ denotes the element-wise multiplication operation.
F^fb_i contains the information of both the forward and backward features. Due to inaccurate flow estimation and the information loss caused by the warping operation, errors will accumulate during the propagation process. Therefore, we further refine the feature according to the adjacent spatio-temporal information.
As shown in Figure 4, the feature refine network accepts the roughly fused feature F^fb_i and generates a refining residual F^res_i. Specifically, the feature refine network additionally takes into account the backward feature F^b_{i+1} of the latter frame and the forward feature F^f_{i−1} of the previous frame. The reason for this design is that F^b_{i+1} and F^f_{i−1} implicitly encode all the information needed to obtain the aligned feature at the current i-th frame. 1 × 1 convolutions are used to reduce and unify the dimensionality. The final refined feature at the i-th frame is obtained as:

F^f_i = F^fb_i + F^res_i,

and F^f_i will be propagated to the next frame. By utilizing the information of the current frame and adjacent frames, the FFM achieves frame-specific feature fusion in a coarse-to-fine manner. With bidirectional feature propagation, the internal frame features are all generated from the anchor frame features.
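The two FFM stages, coarse linear blending followed by residual refinement, can be sketched in a few lines. Here `refine` stands in for the feature refine network (a small CNN in the paper); the function name and signature are illustrative assumptions:

```python
import numpy as np

def feature_fusion(F_f, F_b, W, refine):
    """Sketch of the FFM: a per-pixel weighting map W (H, W, 1) in
    [0, 1] blends the forward and backward warped features, then a
    refine network's residual corrects accumulated warping errors."""
    F_fb = W * F_f + (1.0 - W) * F_b   # coarse fusion (linear interpolation)
    return F_fb + refine(F_fb)         # coarse-to-fine residual refinement
```

With W ≡ 1 the fusion falls back to the forward feature alone, and with a zero residual it reduces to the plain blend, which mirrors the coarse-to-fine role of the two sub-networks.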

Self-regularization Learning
One unique characteristic of the proposed framework is that it utilizes a self-regularization learning scheme without relying on ground-truth color videos. Here, self-regularization learning means that we do not employ any explicit loss with the ground-truth color video, which is different from conventional supervised learning. In TCVC, we do not need to train the colorization backbone G. To let the network learn temporal consistency, we adopt the temporal warping loss as follows:

L_temporal = Σ_i Σ_d M_{i,i+d} ⊙ ||ŷ_i − ŷ^warp_{i+d}||_1,

where ŷ^warp_{i+d} = warp(ŷ_{i+d}, f_{i+d→i}), d represents the time interval for temporal warping, and M_{i,i+d} = exp(−α ||x_i − x^warp_{i+d}||_2^2) is the visibility mask. Following [9], we set α = 50. This loss function explicitly poses a penalty on the temporal consistency between adjacent frames. It is noteworthy that no ground-truth color video is used during training; a consecutive grayscale input video is all we need. More discussions can be found in the supplementary file.
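A per-pair sketch of this loss is below. It penalizes the masked difference between the prediction at frame i and the warped-back prediction of a later frame; the L1 difference and the exact reduction are assumptions following the BTC-style temporal loss [9], since the excerpt omits these details:

```python
import numpy as np

def temporal_warping_loss(y_i, y_warped, x_i, x_warped, alpha=50.0):
    """Masked temporal warping penalty for one frame pair.

    y_i / y_warped: predicted ab channels at frame i and the warped-back
    prediction of frame i+d, shape (H, W, 2).
    x_i / x_warped: the grayscale inputs used for the visibility mask
    M = exp(-alpha * ||x_i - warp(x_{i+d})||_2^2), with alpha = 50.
    """
    M = np.exp(-alpha * np.sum((x_i - x_warped) ** 2, axis=-1))
    diff = np.sum(np.abs(y_i - y_warped), axis=-1)  # per-pixel L1
    return float(np.mean(M * diff))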
Further, self-regularization learning also frees our framework from the influence of training and testing data, i.e., the proposed method is data-independent and label-free. As long as the input video contains consecutive motional frames, it can be adopted as our training set. Another advantage of the proposed self-regularization is that it does not require massive training data and has few trainable parameters. Thus, the training procedure can be very efficient (about two days).

Multiple Anchor Frame Sampling
For a long input sequence (over dozens of frames), we first divide it into several intervals by uniformly sampling anchor frames or specifying the interval length N. Once the interval length N is determined, the anchor frames are also determined during the inference phase. Empirically, this scheme already works well. However, we also provide an optional post-processing scheme to further enhance the performance. Specifically, we sample the anchor frames multiple times (choosing different N), and then average the results of each output. This procedure can be regarded as an ensemble method during testing. It can eliminate the uncertainty and inconsistency of anchor frames to some extent and achieve better temporal consistency. In the experiments, we adopt N = 15 and N = 17 for ensembling.

Differences with Other Methods
The proposed TCVC framework is conceptually different from previous works on video colorization and video temporal consistency in both motivation and methodology. We address the video colorization problem from a new perspective. In summary, the proposed method differs from previous solutions in three aspects: i) the proposed framework takes advantage of an ingenious image-based model and focuses on temporally consistent constraints, so it can favorably achieve both good colorization performance and satisfactory temporal consistency; ii) we formulate the spatial-temporal alignment and propagation in high-dimensional feature space; iii) different from conventional supervised learning, it adopts a self-regularization learning scheme without relying on ground-truth color videos.

Uniqueness of TCVC
Our goal is to improve temporal consistency based on an image-based model G without retraining it. The unique feature of TCVC is that it requires no ground-truth color videos during training. This trait is a concomitant byproduct of the method itself, introduced by feature propagation and self-regularization learning. Specifically, TCVC first leverages a pretrained image-based model G to extract anchor frame features containing color information. Then, TCVC propagates the color information from the anchor frames to the remaining internal frames. The color information is inherited from G, and TCVC focuses on the propagation of this sparse information while explicitly considering temporal consistency. The temporal warping loss poses a penalty to implicitly guide the network to better propagate the color information in a self-regularized and self-supervised manner, while explicitly enhancing the temporal consistency between adjacent frames. Further, we have conducted experiments to demonstrate the effectiveness of this learning scheme.
With this unique characteristic, TCVC obtains state-of-the-art results on temporally consistent video colorization. As shown in Figure 2, existing image-based methods like InsColor [12] tend to produce severe flickering artifacts with inconsistent colorization. The colorization performance of the video-based method FAVC [8] is not satisfactory, producing a grayish and unsaturated hue. Instead, the colorization effect of the TCVC framework is temporally consistent and visually pleasing, due to bidirectional feature propagation and self-regularization learning.

Image-based methods [7], [12] are prone to produce severe flickering artifacts. The post-processing method BTC [9] cannot achieve long-term temporal consistency well and cannot handle outliers. The results of FAVC [8] are usually unsaturated and sometimes contain a strange greenish hue, e.g., there are strange greenish regions on the gun in the upper sequence. Please zoom in for best view.

EXPERIMENTS
Datasets. Following previous works [8], [9], we adopt the DAVIS dataset [42] and the Videvo dataset [9] for training and testing.
Metrics. We evaluate the results in two facets: colorization performance and video temporal consistency. The colorization performance is paramount for the colorization task. Without good colorization, the temporal consistency will be of less significance. For example, unsaturated images with few colors could result in better consistency; however, such neutral results cannot meet the requirements of good colorization. To measure the colorization performance, we adopt PSNR and the L_2 error in Lab color space. Moreover, we also utilize the colorfulness measurement proposed by Hasler and Suesstrunk [43] to roughly evaluate the color diversity of the resulting images produced by different methods. For temporal consistency, we adopt the warp error proposed in [9]. However, the warp error is uncorrelated with the video color and is easily affected by the performance of the flow estimation module used in the measurement. Therefore, we propose a more suitable Color Distribution Consistency index (CDC) to further measure the temporal consistency, which is specially devised for the video colorization task. Specifically, it computes the Jensen-Shannon (JS) divergence of the color distribution between consecutive frames:

CDC_t = (1 / (3(N − t))) Σ_{c∈{r,g,b}} Σ_{i=1}^{N−t} JS(P_c(I_i), P_c(I_{i+t})), (7)

where N is the video sequence length and P_c(I_i) is the normalized probability distribution of color image I_i across channel c, which can be calculated from the image histogram. t denotes the time step: a smaller t indicates short-term temporal consistency, while a larger t indicates long-term temporal consistency. The JS divergence measures the similarity between two color probability distributions. Considering long-term and short-term temporal consistency together, we propose the following index:

CDC = (CDC_{t=1} + CDC_{t=2} + CDC_{t=4}) / 3.

It takes t = 1, t = 2 and t = 4 into account, which can appropriately reflect the temporal consistency of the color distribution. Too large a t will lead to much difference in content between the two frames, causing the color distribution to change rapidly. Moreover, we also conducted a user study for subjective evaluation.
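A sketch of the CDC computation is given below. It follows Eq. (7): per-channel 256-bin histograms as color distributions, base-2 JS divergence between frames i and i+t, averaged over channels, frame pairs, and t ∈ {1, 2, 4}. Implementation details beyond the equation (histogram binning, averaging order) are our assumptions:

```python
import numpy as np

def cdc(frames, t_values=(1, 2, 4), bins=256):
    """Color Distribution Consistency sketch for uint8 video frames of
    shape (T, H, W, 3): mean JS divergence between per-channel color
    histograms of frames i and i+t, averaged over t in t_values."""
    def hist(img, c):
        h, _ = np.histogram(img[..., c], bins=bins, range=(0, 256))
        return h / h.sum()
    def js(p, q):  # Jensen-Shannon divergence, base 2, in [0, 1]
        m = 0.5 * (p + q)
        def kl(a, b):
            mask = a > 0
            return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)
    scores = []
    for t in t_values:
        vals = [np.mean([js(hist(frames[i], c), hist(frames[i + t], c))
                         for c in range(3)])
                for i in range(len(frames) - t)]
        scores.append(np.mean(vals))
    return float(np.mean(scores))
```

A perfectly static video scores 0 (best consistency), while alternating unrelated color distributions drive the short-term term toward its maximum of 1.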

Implementation Details
In our implementation, we adopt CIC [6] and IDC [7] as the image-based colorization backbone G. Note that we do not need to train G. During training, the input interval length is N = 10, while N is set to 17 during testing. The batch size is 4 and the patch size of the input frames is 256 × 256. The learning rate is initialized to 5 × 10^−5 and is decayed by half every 10,000 iterations. The Adam optimizer [44] is adopted. We use the PyTorch framework and train all models on four RTX 2080 Ti GPUs.

Comparison with State-of-the-art Methods
Since this paper focuses on temporally consistent video colorization, FAVC [8] is the main competitor. FAVC is the newest, and the first, learning-based fully automatic video colorization method; unfortunately, it is the only such method published at a top conference or journal. We follow FAVC and conduct sufficient comparisons with image-based, video-based and post-processing methods. Specifically, we compare our method with representative single image colorization methods [6], [7], [10], [12] and the video colorization method FAVC [8]. In addition, we apply the blind temporal consistency methods BTC [9] and DVP [17] on [6] and [7] to form another two groups of comparison methods.
Quantitative comparison. The quantitative results are summarized in Table 1 and Figure 1. Image-based methods [7], [12] can achieve relatively higher PSNR, while their temporal consistency is poor. The video-based method FAVC [8] slightly improves the temporal consistency, but its colorization performance is not satisfactory, as shown in Figure 5. Quantitatively, FAVC yields the lowest colorfulness value among all the methods. BTC [9] and DVP [17] can largely enhance the temporal consistency, but at the cost of PSNR values that decrease dramatically compared to the original outputs of [6], [7]. Further, BTC is vulnerable to outliers, and DVP tends to produce colorless results (see Figure 8). Moreover, DVP [17] is an image-specific one-shot algorithm which requires independent training during testing; thus, its inference is time-consuming, making it impractical for real-time or high-speed applications.
For comparison, we adopt [6], [7] as our backbones. After being integrated into TCVC, the temporal consistency is improved, validating the effectiveness of TCVC. Moreover, TCVC achieves impressive colorization performance with high PSNR values; it can even slightly boost the PSNR values and reduce the L_2 error in Lab space. TCVC can also preserve the colorfulness, while BTC and DVP could lower the resulting colorfulness values. Note that, for fairness, we do not use scene cut techniques on the test datasets, yet we still achieve the best results. For very long videos, some simple techniques can be used, such as histogram or block matching, which are easy to incorporate with TCVC. With scene cut techniques, the performance of TCVC is expected to improve further.
Qualitative comparison. Visual comparisons are shown in Figures 2, 5, 6, 7 and 8. Image-based methods [7], [12] are prone to produce severe flickering artifacts: the predicted color of one object differs across consecutive frames. For example, in Figure 6, the car is colorized red by InsColor [12] in the first four frames, while it is painted bluish in the last frame; the dancer's clothes are colorized a brighter red by IDC [7] in the first and fourth frames, while in the other frames the color of the clothes becomes lighter and less saturated. After applying the post-processing method BTC [9], the results become more temporally consistent. However, BTC modifies all the frames of the original output video, which can immensely decrease the PSNR values, as discussed before. Further, BTC is susceptible to outliers and cannot deal with extreme outliers properly and thoroughly. As shown in the lower part of Figure 6, BTC fails to achieve temporal consistency in this consecutive sequence: the outlier red region stays unchanged after applying BTC. As shown in Figure 8, DVP [17] could remove the color of the original images or produce results with a weird green tone. Further, the results of DVP are likely to contain color contaminations.

Fig. 8. Comparison with post-processing methods BTC [9] and DVP [17]. DVP sometimes removes the color of the original images (first row) or produces results with a weird green tone (second and third rows).

Compared with state-of-the-art image-based methods, the results of FAVC [8] are usually not vivid, with an unsaturated and grayish hue. FAVC [8] can sometimes even produce strange greenish colors on objects (see the lower part of Figure 6). Compared with previous works, our method achieves both good colorization performance and temporal consistency. Particularly, TCVC can produce colorized results with long-term temporal consistency, since all the internal frames are generated by continual feature propagation. Thus, different from BTC, TCVC can handle outliers and achieve impressive quantitative performance.
Results on legacy black-and-white movies. Additionally, we display several visual results on legacy black-and-white movies to demonstrate the good generalization ability of our method. This is an attractive application of video colorization. As shown in Figure 7, our model is able to produce good colorization results on legacy grayscale films.
User study. We also conducted a user study with 20 participants for subjective evaluation. 15 videos were randomly selected from the test datasets. We compared our method with the video colorization methods FAVC [8], CIC [6]+BTC and IDC [7]+BTC in a pairwise manner. The participants were asked to choose the better result in terms of colorization performance and temporal consistency. As the results in Figure 9 show, the proposed TCVC framework surpasses all other methods by a large margin: more than 75.0% (225) of users' choices favor our results.

Advantages of Adopting Anchor Frames
In the TCVC framework, the anchor frames are directly processed by the well-performing image-based model G, and the internal frames are generated by bidirectional propagation from the anchor frames. We demonstrate the advantages of adopting anchor frames by statistical analysis. Specifically, we aim to answer the following questions: 1) Since all the anchor frames are the same as those of G, what is the influence of sampling anchor frames with different interval lengths N? 2) What is the effect of adopting deep feature propagation to generate internal frames? 3) What are the advantages of TCVC compared with the post-processing method BTC [9]? To answer these questions, we calculate the PSNR values of the anchor frames and the internal frames produced by TCVC under different interval lengths N on the DAVIS [42] dataset. Then, we compare the corresponding PSNR values with those of the backbone model IDC [7] and the post-processing method BTC [9]. Note that different N lead to different separated sets of anchor and internal frames.
A larger interval length N means that fewer anchor frames will be sampled. One may be concerned that TCVC could sample anomalous anchor frames (outliers), resulting in the accumulation of errors throughout the feature propagation. From Table 2, however, the probability of sampling anomalous anchor frames is marginal. Further, as N increases, the number of sampled anchor frames is reduced, and more outlier frames fall among the internal frames to be regenerated. In such cases, compared with a post-processing method like BTC [9], TCVC can better avoid the influence of outliers. As shown in Table 3, although BTC can enhance the video temporal consistency, the PSNR is significantly reduced. For TCVC, since the anchor frames are directly processed by the image-based method, the PSNR of the anchor frames is the same as that of IDC [7], while the PSNR of the internal frames is further improved. This is because TCVC can avoid the influence of anomalous internal frames with low PSNR values, since all the internal frames are regenerated by feature propagation from the anchor frames. Hence, TCVC successfully achieves satisfactory temporal consistency while maintaining good colorization performance.

ABLATION STUDY
We further conduct ablation studies to demonstrate the effectiveness of the proposed FFM, bidirectional propagation and self-regularization learning. We test the models with interval length N = 11 on the DAVIS dataset.

Effectiveness of Feature Fusion Module
The purpose of the feature fusion module (FFM) is to integrate the backward and forward features in a dedicated coarse-to-fine manner. It leverages the information of the current frame and its adjacent frames to achieve frame-specific feature fusion. To demonstrate its effectiveness, we replace the FFM with plain convolutional networks to fuse the bidirectional features. The experimental results are shown in the second and third rows of Table 4. By adopting FFM, the temporal consistency is further improved from 0.004003 to 0.003874, which validates the effectiveness of FFM.
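The core idea of the FFM can be sketched as follows, assuming (this is our simplification, not the paper's architecture) that the weighting network reduces to a function producing a per-pixel blending weight and the refinement network to a generic post-processing function:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_fusion(f_fwd, f_bwd, weight_fn, refine_fn):
    """Sketch of FFM-style fusion of forward/backward features.

    weight_fn stands in for the weighting network: it maps the
    channel-concatenated bidirectional features to a per-pixel weight
    map (squashed to (0, 1) by the sigmoid).  refine_fn stands in for
    the feature refinement network.
    """
    w = sigmoid(weight_fn(np.concatenate([f_fwd, f_bwd], axis=0)))
    coarse = w * f_fwd + (1.0 - w) * f_bwd   # coarse weighted blend
    return refine_fn(coarse)                 # fine-grained refinement
```

The learned weight lets each pixel trust whichever propagation direction is more reliable there, which a plain convolution over the concatenated features cannot express as directly.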

Effectiveness of Bidirectional Propagation
In this paper, we propose bidirectional propagation to generate the consecutive internal features from the anchor features. If we only conduct unidirectional propagation, without complementary information from the opposite direction, the warping operation causes the features to continuously shift in one direction, leading to information loss. We conduct an ablation study to validate the effectiveness of bidirectional propagation. For unidirectional propagation, there is no need to fuse forward and backward features with FFM, so we replace FFM with a plain network. As shown in the first and second rows of Table 4, the TCVC model with one direction is clearly inferior to that with two directions. By utilizing bidirectional propagation, both the PSNR and the temporal consistency are largely improved.
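A minimal sketch of the idea (our simplification: `warp_fn` stands in for flow-based warping, and a temporal-distance blend stands in for the learned FFM fusion) is:

```python
import numpy as np

def propagate_bidirectional(f_left, f_right, t, t_left, t_right, warp_fn):
    """Generate the feature of internal frame t from its two anchors.

    warp_fn(feature, src_t, dst_t) stands in for optical-flow warping
    from time src_t to dst_t.  Fusing both directions keeps information
    that would be lost if features only drifted one way.
    """
    f_fwd = warp_fn(f_left, t_left, t)    # forward pass from left anchor
    f_bwd = warp_fn(f_right, t_right, t)  # backward pass from right anchor
    alpha = (t_right - t) / (t_right - t_left)  # closer anchor weighs more
    return alpha * f_fwd + (1.0 - alpha) * f_bwd
```

With only one direction, every internal feature is a chain of warps from a single anchor, so occluded or out-of-frame content can never be recovered; the opposite direction supplies exactly that missing evidence.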

Effectiveness of Self-regularization Learning
We compare conventional supervised learning and the proposed self-regularization within the same TCVC framework. In particular, we train the proposed framework with different regularization terms: 1) only the L2 loss with ground-truth color videos; 2) only the temporal warping loss for self-regularization; 3) both the L2 and temporal warping losses simultaneously. As shown in Table 5, interestingly, adopting only the L2 loss performs much worse than adopting the temporal warping loss: it achieves neither satisfactory colorization performance nor temporal consistency. As shown in Figure 11, the TCVC model trained with only the L2 loss produces results with severe visual artifacts. This is because the L2 loss cannot regularize the feature warping and fusion procedure in the TCVC framework. In addition, the L2 loss is not robust to the intrinsically ill-posed nature of the colorization problem, as also addressed in [6]. When adopting both the L2 loss and the temporal warping loss, the results are better than with the L2 loss alone but still inferior to the temporal warping loss alone. Adopting the L2 loss on ground truth degrades colorization performance, since we do not retrain G; avoiding ground-truth color lets the framework concentrate on reorganizing the consecutive features. This experiment validates the effectiveness of self-regularization learning for TCVC, and also highlights the difficulty of achieving both good colorization and temporal consistency. Therefore, unlike conventional supervised learning, we design this elaborate mechanism, making it efficient and unique.
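The self-regularization term can be sketched as a masked warping loss between predictions at adjacent time steps (our simplified form: `warp_fn` stands in for flow-based alignment and `occ_mask` for the occlusion mask; the paper's exact loss may differ in weighting and norm):

```python
import numpy as np

def temporal_warping_loss(pred_curr, pred_prev, warp_fn, occ_mask):
    """Self-regularization sketch: penalize the color difference between
    the current prediction and the flow-warped previous prediction,
    ignoring occluded pixels.  No ground-truth color video is needed --
    the model is supervised only by its own temporal consistency.
    """
    warped_prev = warp_fn(pred_prev)          # align the t-1 prediction to t
    diff = np.abs(pred_curr - warped_prev)    # per-pixel L1 color difference
    valid = occ_mask.sum()
    return (diff * occ_mask).sum() / max(valid, 1.0)
```

Because the loss compares the model's own outputs rather than ground-truth colors, it constrains the warping-and-fusion pipeline directly, which is exactly what the plain L2 loss fails to do in the experiment above.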

Exploration on Interval Length
To further explore the influence of the interval length N, we conduct experiments with different N during the training and testing phases. In particular, we adopt N = 5, N = 7 and N = 10 for training, and N = 3, N = 5, N = 9, N = 11, N = 17 and N = 19 for testing. The experimental results are listed in Table 6. The interval length N mainly affects the temporal consistency, with only marginal impact on the PSNR values. Adopting more internal frames for training and testing achieves better temporal consistency. Specifically, when the testing interval length is fixed, the more frames adopted in training, the better the consistency. This is because our framework propagates features within intervals, and longer intervals yield longer-range temporal consistency. In addition, adopting more consecutive frames for training gives the model a larger temporal receptive field and lets it learn more motion patterns. Nevertheless, an overly large interval length N for training costs more GPU memory and computational resources, while an overly large N for testing results in a PSNR drop, because the difficulty of optical flow estimation and feature fusion increases as well. Thus, we adopt a moderate N = 10 for training and N = 17 for testing in the main experiments.

FAILURE CASES
Here we show several failure cases of TCVC. As shown in Figure 10, in some cases TCVC produces results with ghost artifacts or color contamination. This is mainly due to inaccurate optical flow estimation, especially when large motions or severe occlusions occur. The estimation of optical flow (OF) and occlusion (OCC) is crucial for most video-related tasks, e.g., video super-resolution [45], video frame interpolation [46] and video compression [47]. Nevertheless, we have tested TCVC on a large number of videos: it outperforms all other works qualitatively and yields the best quantitative evaluations on average. In the Appendix, we provide a detailed evaluation of each test video; the performance is stable and robust. Certainly, there is room for improvement: with better OF/OCC estimation, TCVC can be further improved, and research effort on better optical flow and occlusion estimation will benefit many computer vision tasks.

CONCLUSION
We propose a temporally consistent video colorization framework (TCVC) with deep feature propagation and self-regularization learning. TCVC generates contiguous adjacent features for colorizing video. It adopts a self-regularization learning scheme and does not require any ground-truth color video for training. TCVC achieves both good colorization effect and temporal consistency.

Fig. 3 .
Fig. 3. The proposed TCVC framework (take N = 4 for example). The anchor frame branch colorizes the two anchor frames and extracts the deep features for propagation. With bidirectional deep feature propagation, the internal frame features are all generated from anchor frames, which ensures the temporal consistency in high-dimensional feature space.

Fig. 4 .
Fig. 4. The structure of the feature fusion module (FFM), which contains a weighting network and a feature refinement network.

Fig. 5 .
Fig. 5. Visual comparison with state-of-the-art methods. Image-based methods [7], [12] are prone to produce severe flickering artifacts. The post-processing method BTC [9] cannot achieve long-term temporal consistency well and cannot handle outliers. The results of FAVC [8] are usually unsaturated and sometimes contain a strange greenish hue, e.g., there are strange greenish regions on the gun in the upper sequence. Please zoom in for best view.
The DAVIS dataset is designed for video segmentation and includes a variety of moving objects and motion types. It has 60 videos for training and 30 videos for testing. The Videvo dataset contains 80 videos for training and 20 videos for testing. The training videos are all resized to 300 × 300. We mix the DAVIS and Videvo training sets to conduct self-regularization learning as in Section 3.2.

Fig. 10 .
Fig. 10. Failure cases of TCVC due to erroneous estimation of optical flow and occlusions.

TABLE 1
Quantitative performance on the DAVIS30 and Videvo20 datasets. Applying BTC [9] improves the temporal consistency but decreases the PSNR values dramatically. The TCVC framework favorably achieves both satisfactory colorization effect and temporal consistency. b indicates the backbone we choose for TCVC and + denotes adopting multiple anchor frame sampling ensemble.

TABLE 3
Effectiveness of TCVC in achieving both good colorization performance and satisfactory temporal consistency (N = 17). We observe that the PSNR values of anchor frames under different samplings are relatively stable; statistically, there are few outliers in a sequence.

TABLE 6
Exploration on interval length N .More internal frames could benefit temporal consistency.