Pixelated Semantic Colorization

While many image colorization algorithms have recently shown the capability of producing plausible color versions from gray-scale photographs, they still suffer from limited semantic understanding. To address this shortcoming, we propose to exploit pixelated object semantics to guide image colorization. The rationale is that human beings perceive and distinguish colors based on the semantic categories of objects. Starting from an autoregressive model, we generate image color distributions, from which diverse colored results are sampled. We propose two ways to incorporate object semantics into the colorization model: through a pixelated semantic embedding and a pixelated semantic generator. Specifically, the proposed network includes two branches. One branch learns what the object is, while the other branch learns the object colors. The network jointly optimizes a color embedding loss, a semantic segmentation loss and a color generation loss, in an end-to-end fashion. Experiments on Pascal VOC2012 and COCO-stuff reveal that our network, when trained with semantic segmentation labels, produces more realistic and finer results compared to the colorization state-of-the-art.

Human beings excel in assigning colors to gray-scale images since they can easily recognize the objects and have gained knowledge about their colors. No one doubts that the sea is typically blue and that a dog is never naturally green. Although many objects have diverse colors, which makes their prediction quite subjective, humans can get around this by simply applying a bit of creativity. However, it remains a significant challenge for machines to acquire both the world knowledge and the "imagination" that humans possess.
Previous works in image colorization require reference images (Gupta et al (2012); Liu et al (2008); Charpiat et al (2008)) or color scribbles (Levin et al (2004)) to guide the colorization. Recently, several automatic approaches (Iizuka et al (2016); Larsson et al (2016); Zhang et al (2016); Royer et al (2017); Guadarrama et al (2017)) have been proposed based on deep convolutional neural networks. Despite the improved colorization, there are still common pitfalls that make the colorized images appear less realistic. We show some examples in Figure 1. The cases in (a) without semantics suffer from incorrect semantic understanding. For instance, the cow is assigned a blue color. The cases in (b) without semantics suffer from color pollution. Our objective is to effectively address both problems to generate better colorized images with high quality.

Fig. 1: Colorization without and with semantics, generated using the network from this paper. (a) The method without semantics assigns unreasonable colors to objects, such as a colorful sky and a blue cow, while the method with semantics generates realistic colors for the dog and the old man. (b) The method without semantics fails to capture long-range pixel interactions (Royer et al (2017)); with semantics, the model performs better.
Both traditional (Chia et al (2011); Ironi et al (2005)) and recent colorization solutions (Larsson et al (2016); Iizuka et al (2016); He et al (2016); Zhang et al (2016, 2017)) have highlighted the importance of semantics. However, they only explore image-level class semantics for colorization. As stated by Dai et al (2016), image-level classification favors translation invariance. Colorization, however, requires representations that are, to a certain extent, translation-variant. From this perspective, semantic segmentation (Long et al (2015); Chen et al (2018); Noh et al (2015)), which also requires translation-variant representations and predicts a class label for each pixel, provides more suitable semantic guidance for colorization. Similarly, according to (Zhang et al (2016); Larsson et al (2016)), colorization assigns each pixel a color distribution. Both challenges can be viewed as an image-to-image prediction problem and formulated as a pixel-wise prediction task. We show several colorized examples obtained with pixelated semantic guidance in Figure 1 (a) and (b). Clearly, pixelated semantics helps to reduce color inconsistency through a better semantic understanding.
In this paper, we study the relationship between colorization and semantic segmentation. Our proposed network can be trained harmoniously for semantic segmentation and colorization. Using such multi-task learning, we explore how pixelated semantics affects colorization. Differing from the preliminary conference version of this work (Zhao et al (2018)), we view colorization here as a sequential pixel-wise color distribution generation task, rather than a pixel-wise classification task. We design two ways to exploit pixelated semantics for colorization: one guides a color embedding function and the other guides a color generator. Using these strategies, our methods produce diverse vibrant images on two datasets, Pascal VOC2012 (Everingham et al (2015)) and COCO-stuff (Caesar et al (2018)). We further study how colorization can help semantic segmentation and demonstrate that the two tasks benefit each other. We also propose a new quantitative evaluation method based on semantic segmentation accuracy.
The rest of the paper is organized as follows. In Section 2, we introduce related work. In Section 3, we describe the details of our colorization network using pixelated semantic guidance. Experiments and results are presented in Section 4. We conclude our work in Section 5.

Colorization by Reference
Colorization using references was first proposed by Welsh et al (2002), who transferred colors by matching statistics within the pixel's neighborhood. Rather than relying on independent pixels, Ironi et al (2005) transferred colors from a segmented example image, based on their observation that pixels with the same luminance value and similar neighborhood statistics may appear in different regions of the reference image, which may have different semantics and colors. Tai et al (2005) and Chia et al (2011) also performed local color transfer by segmentation. Bugeau et al (2014) and Gupta et al (2012) proposed to transfer colors at the pixel level and super-pixel level. Generally, finding a good reference with similar semantics is key for this type of method. Previously, Liu et al (2008) and Chia et al (2011) relied on image retrieval methods to choose good references. Recently, deep learning has supplied more automatic methods (Cheng et al (2015); He et al (2018)). In our approach, we use a deep network to learn the semantics from data, rather than relying on a reference with similar semantics.

Colorization by Scribble
Another interactive way to colorize a gray-scale image is by placing scribbles, first proposed by Levin et al (2004). The authors assumed that pixels nearby in space-time that have similar gray levels should have similar colors as well. Hence, they solved an optimization problem to propagate sparse scribble colors. To reduce color bleeding over object boundaries, Huang et al (2005) adopted adaptive edge detection to extract reliable edge information. Qu et al (2006) colorized manga images by propagating scribble colors within pattern-continuous regions. Yatziv and Sapiro (2006) developed a fast method to propagate scribble colors based on color blending. Luan et al (2007) further extended Levin et al (2004) by grouping not only neighboring pixels with similar intensity but also remote pixels with similar texture. Several more recent works (Zhang et al (2017); Sangkloy et al (2017)) used deep neural networks with scribbles trained on a large dataset and achieved impressive colorization results. All these methods, which use hints like strokes or points, provide an important means for segmenting an image into different color regions. We prefer to learn the segmentation rather than manually labelling it.

Colorization by Deep Learning
The earliest work applying a deep neural network was proposed by Cheng et al (2015). They first grouped images from a reference database into different clusters and then learned deep neural networks for each cluster. Later, Iizuka et al (2016) pre-trained a network on ImageNet for a classification task, which provided global semantic supervision. The authors leveraged a large-scale scene classification database to train a model, exploiting the class labels of the dataset to learn global priors. Both of these works treated colorization as a regression problem. In order to generate more saturated results, Larsson et al (2016) and Zhang et al (2016) modeled colorization as a classification problem. Zhang et al (2016) applied cross-channel encoding as self-supervised feature learning with semantic interpretability. Larsson et al (2016) claimed that interpreting the semantic composition of the scene and localizing objects are key to colorizing arbitrary images. Nevertheless, these works only explored image-level classification semantics. Our method takes the semantics one step further and utilizes finer pixelated semantics from segmentation.
More recently, generative models have been applied to produce diverse colorization results. Several works (Cao et al (2017); Isola et al (2017); Frans (2017)) have applied a generative adversarial network (GAN) (Radford et al (2016)). They were able to produce sharp results, but not as good as the approach proposed by Zhang et al (2016). Variational autoencoders (VAE) (Kingma and Welling (2014)) have also been used to learn a color embedding (Deshpande et al (2017)). This method produced results with large-scale spatial coordination, but toneless colors. Royer et al (2017) and Guadarrama et al (2017) applied PixelCNN (van den Oord et al (2016); Salimans et al (2017)) to generate better results. We use PixelCNN as the backbone in this paper.

Methodology
In this section, we detail how pixelated semantics improves colorization. We first introduce our basic colorization backbone. Then, we present two ways to exploit object semantics for colorization. Our network structure is summarized in Figure 2.

Pixelated Colorization
To arrive at image colorization with pixelated semantics, we start from an autoregressive model. It colorizes each pixel conditioned on the input gray image and previously colored pixels. Specifically, a conditional PixelCNN (van den Oord et al (2016)) is utilized to generate per-pixel color distributions, from which we sample diverse colorization results.
We rely on the CIE Lab color space to perform the colorization, since it was designed to be perceptually uniform with respect to human color vision, and only the two chrominance channels a and b need to be learned. An image with height H and width W is defined as X ∈ R H×W ×3 .
X contains n (= H × W ) pixels. In raster scan order (row by row, and pixel by pixel within every row), the value of the i-th pixel is denoted as X i . The input gray-scale image, represented by the lightness channel L, is defined as X L ∈ R H×W ×1 . To reduce computation and memory requirements, we prefer to produce color images at low resolution. This is reasonable since the human visual system resolves color less precisely than intensity (Van der Horst and Bouman (1969)). As stated by Royer et al (2017), image compression schemes such as JPEG, as well as previously proposed techniques for automatic colorization, also apply chromatic subsampling.
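To make the chromatic-subsampling point concrete, here is a minimal numpy sketch (an illustrative stand-in, not the paper's actual pipeline) that average-pools the two chrominance channels to a lower resolution while the lightness channel would be kept at full resolution:

```python
import numpy as np

def chroma_subsample(ab, factor=4):
    """Average-pool the a and b chrominance channels over factor x factor
    blocks, mimicking the chromatic subsampling discussed above."""
    H, W, C = ab.shape
    assert H % factor == 0 and W % factor == 0 and C == 2
    return ab.reshape(H // factor, factor, W // factor, factor, C).mean(axis=(1, 3))

ab = np.ones((128, 128, 2), dtype=np.float32)  # dummy chrominance channels
low = chroma_subsample(ab)
print(low.shape)  # -> (32, 32, 2)
```

The subsampling factor here is an assumption for illustration; the point is only that color can be predicted on a coarser grid than intensity.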
By adopting PixelCNN for image colorization, a joint distribution conditioned on the gray-scale image is modelled as in van den Oord et al (2016):

p(Y |X L ) = ∏ i=1..n p(Y i |Y 1 , ..., Y i−1 , X L ), (1)

where Y denotes the chrominance channels to be predicted. All the elementary per-pixel conditional distributions are modelled using a shared convolutional neural network. As all variables in the factors are observed, training can be executed in parallel. Furthermore, X L can be replaced by a good embedding learned from a neural network. Taking g ω as the generator function and f θ as the embedding function, each distribution in Equation (1) can be rewritten as:

p(Y i |Y 1 , ..., Y i−1 , X L ) = g ω (Y 1 , ..., Y i−1 , f θ (X L )). (2)

As the purple flow in Figure 2 shows, there are two components included in our model. A deep convolutional neural network (f θ ) produces a good embedding of the input gray-scale image. Then an autoregressive model uses the embedding to generate a color distribution for each pixel. The final colorized results are sampled from the distributions using a pixel-level sequential procedure. We first sample Ŷ 1 from p(Ŷ 1 |X L ), then sample Ŷ i from p(Ŷ i |Ŷ 1 , ..., Ŷ i−1 , X L ) for all i in {2, ..., n}.
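The sequential sampling procedure above can be sketched as follows. The learned gated PixelCNN g ω and the embedding f θ are replaced by dummy stand-ins (a random softmax over a handful of color bins), so only the autoregressive control flow is faithful to the method:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_bins = 16, 8  # a tiny 4x4 image with 8 hypothetical color bins

def color_distribution(prev_colors, embedding):
    """Stand-in for g_w: returns a categorical distribution over color bins
    for the next pixel, conditioned on previously sampled colors and the
    gray-image embedding. A real model would be a gated PixelCNN."""
    logits = rng.normal(size=n_bins) + 0.01 * sum(prev_colors)
    e = np.exp(logits - logits.max())
    return e / e.sum()

embedding = np.zeros(32)  # placeholder for f_theta(X^L)
sampled = []
for i in range(n_pixels):  # strictly sequential: pixel i sees pixels 0..i-1
    p = color_distribution(sampled, embedding)
    sampled.append(int(rng.choice(n_bins, p=p)))

print(len(sampled))  # -> 16
```

Because each pixel is drawn from a distribution rather than a point estimate, re-running the loop with a different seed yields a different, equally valid colorization.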

Pixelated Semantic Colorization
Intuitively, semantics is the key to colorizing objects and scenes. We now discuss how to embed pixelated semantics in our colorization model for generating diverse colored images.

Pixelated Semantic Embedding
Considering the conditional PixelCNN model introduced above, a good embedding f θ (X L ) of the gray-scale image greatly helps to generate a precise color distribution for each pixel. We first incorporate semantic segmentation to improve the color embedding. We use X S to denote the corresponding segmentation map. Then, we learn an embedding of the gray-scale image conditioned on X S , replacing f θ (X L ) with f θ (X L |X S ). Thus, the new model learns the distribution in Equation (2) as:

p(Y i |Y 1 , ..., Y i−1 , X L , X S ) = g ω (Y 1 , ..., Y i−1 , f θ (X L |X S )). (3)

Here, the semantics only directly affects the color embedding generated from the gray-scale image, not the autoregressive model.
Incorporating semantic segmentation can be straightforward, i.e., using segmentation masks to guide the colorization learning procedure. This way, the training phase directly obtains guidance from the segmentation masks, which clearly and correctly contain semantic information. However, it is not suitable for the test phase, as segmentation masks are then also needed. Naturally, we could rely on an off-the-shelf segmentation model to obtain segmentation masks for all test images, but this is not elegant. Instead, we believe it is best to simultaneously learn the semantic segmentation and the colorization, making the two tasks benefit each other, as we originally proposed in (Zhao et al (2018)).
Modern semantic segmentation can easily share low-level features with the color embedding function. We simply need to plant an additional segmentation branch h ϕ after a few bottom layers, like the blue flow shown in Figure 2. Specifically, we adopt the semantic segmentation strategies from Chen et al (2018). At the top layer, we apply atrous spatial pyramid pooling, which exploits multi-scale features by employing multiple parallel filters with different rates. The final prediction h ϕ (X L ) is the fusion of the features from the different scales, which helps to improve segmentation. The two tasks have different top layers for learning the high-level features. In this way, semantics is injected into the color embedding function, and a color embedding with more semantic awareness is learned as input to the generator. This is illustrated in Figure 2 by combining the purple flow and the blue flow.
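The shared-trunk idea can be sketched shape-wise in a few lines of numpy. The random per-pixel linear maps below are hypothetical stand-ins for the convolutional layers of the two branches; only the data flow (one shared feature computation feeding two task-specific heads) mirrors the architecture described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_trunk(x):
    # stand-in for the shared low-level layers used by both branches
    return np.maximum(0, x @ rng.normal(scale=0.1, size=(1, 64)))

def color_head(feat):
    # stand-in for the top layers of f_theta -> color embedding
    return feat @ rng.normal(scale=0.1, size=(64, 32))

def seg_head(feat):
    # stand-in for h_phi -> per-pixel class scores (e.g. 21 VOC classes)
    return feat @ rng.normal(scale=0.1, size=(64, 21))

pixels = rng.normal(size=(128 * 128, 1))  # flattened gray values
feat = shared_trunk(pixels)               # computed once, reused by both heads
print(color_head(feat).shape, seg_head(feat).shape)
```

The design choice illustrated here is that the trunk is evaluated once per image, so the segmentation branch adds little cost while injecting semantics into the shared features.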

Pixelated Semantic Generator
A good color embedding with semantics helps the generator to produce more correct color distributions. Furthermore, the generator itself can be improved further with semantic labels. Here, we propose to learn a distribution conditioned on previously colorized pixels, a semantic-aware color embedding of the gray-scale image (f θ (X L |X S )), and pixel-level semantic labels. We rewrite Equation (3) as:

p(Y i |Y 1 , ..., Y i−1 , X L , X S ) = g ω (Y 1 , ..., Y i−1 , f θ (X L |X S ), X S ). (4)

Intuitively, this method is capable of using semantics to produce more correct object colors and more continuous colors within one object. It is designed to address the two issues mentioned in Figure 1. The whole idea is illustrated in Figure 2 by combining the purple flow with the blue and green flows.
We consider two different ways to use pixelated semantic information to guide the generator. The first is to simply concatenate the color embedding f θ (X L ) and the segmentation prediction h ϕ (X L ) along the channel dimension and feed the fusion to the generator. The second is to apply a feature transformation, as introduced by Perez et al (2018) and Wang et al (2018). Specifically, we use convolutional layers to learn a pair of transformation parameters from the segmentation predictions. Then, a transformation is applied to the color embedding using these learned parameters. We find that the first way works better. Results are shown in the Experiments section.
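Both fusion strategies can be sketched shape-wise in numpy. The channel counts and the 1×1 projections below are random stand-ins for the learned convolutional layers, so this only illustrates the two data flows, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 32, 32
color_emb = rng.normal(size=(H, W, 64))  # stand-in for f_theta(X^L)
seg_pred = rng.normal(size=(H, W, 21))   # stand-in for h_phi(X^L)

# Way 1: concatenate along the channel dimension and feed the fusion
# to the generator.
fused = np.concatenate([color_emb, seg_pred], axis=-1)
assert fused.shape == (H, W, 64 + 21)

# Way 2: a feature-wise transformation in the spirit of Perez et al (2018):
# learn a per-pixel scale and shift from the segmentation prediction, then
# modulate the color embedding with them.
W_gamma = rng.normal(scale=0.1, size=(21, 64))  # stand-in for a learned layer
W_beta = rng.normal(scale=0.1, size=(21, 64))   # stand-in for a learned layer
gamma, beta = seg_pred @ W_gamma, seg_pred @ W_beta
modulated = gamma * color_emb + beta
assert modulated.shape == color_emb.shape
```

Concatenation keeps the segmentation signal as extra channels for the generator to read directly, whereas the transformation reshapes the embedding itself; the paper reports that the first, simpler option works better.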

Networks
In this section, we provide the details of the network structure and the optimization procedure.
Network structure. Following the scheme in Figure 2, three components are included: the color embedding function f θ , the semantic segmentation head h ϕ and the autoregressive model g ω . Correspondingly, three loss functions are jointly learned. The three flows represent the three methods introduced above: the purple flow illustrates the basic pixelated colorization; the purple flow combined with the blue flow results in the pixelated semantic embedding; and the purple flow combined with the blue and green flows results in the pixelated semantic generator. Inspired by the success of the residual block (He et al (2016); Chen et al (2018)) and following Royer et al (2017), we apply gated residual blocks (van den Oord et al (2016); Salimans et al (2017)), each of which has two convolutions with 3 × 3 kernels, a skip connection and a gating mechanism. We apply atrous (dilated) convolutions to several layers to increase the network's field-of-view without reducing its spatial resolution. Tables 1 and 2 list the details of the color embedding branch and the semantic segmentation branch, respectively. The gray rows are shared by the two branches.

Loss functions. During the training phase, we train the colorization and segmentation simultaneously. We minimize the negative log-likelihood of the probabilities:

L gen = − ∑ i=1..n log p(Y i |Y 1 , ..., Y i−1 , X L , X S ). (5)

Specifically, we have three loss functions L emb , L seg and L gen to train the color embedding, the semantic segmentation and the generator, respectively. The final loss function L sum is the weighted sum of these losses:

L sum = λ 1 L emb + λ 2 L seg + λ 3 L gen . (6)

Following Salimans et al (2017), we use discretized mixture logistic distributions to approximate the distributions in Equation (3) and Equation (4). A mixture of 10 logistic distributions is applied. Thus, both L emb and L gen are discretized mixture logistic losses.
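The weighted loss combination can be sketched as follows. The loss values are made up for illustration; only the 1:100:1 weighting follows the implementation details given in this section:

```python
def total_loss(l_emb, l_seg, l_gen, lambdas=(1.0, 100.0, 1.0)):
    """Weighted sum of the three losses. The default weights follow the
    1:100:1 ratio chosen so the terms are similar in magnitude."""
    l1, l2, l3 = lambdas
    return l1 * l_emb + l2 * l_seg + l3 * l_gen

# Hypothetical per-batch loss values: a cross-entropy segmentation loss is
# typically much smaller than the mixture-logistic color losses, hence the
# large lambda_2.
print(total_loss(2.0, 0.02, 2.0))  # -> 6.0
```

With these weights, each of the three terms contributes roughly equally to the gradient, which is the stated rationale for the ratio.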
As for semantic segmentation, generally it should be performed in the RGB image domain, as colors are important for semantic understanding. However, the input of our network is a gray-scale image, which is more difficult to segment. Fortunately, the network incorporating colorization learning supplies color information, which in turn strengthens the semantic segmentation of gray-scale images. The mutual benefit among the three learning parts is the core of our network. It is also important to realize that semantic segmentation, as a supplementary means for colorization, is not required to be very precise. We use the cross-entropy loss with the standard softmax function for semantic segmentation (Chen et al (2018)). We use gray-scale converted images as input and rescale each image to 128×128. Figure 3 shows some examples with natural scenes, objects and artificial objects from the datasets.

Implementation. Commonly available pixel-level annotations intended for semantic segmentation are sufficient for our colorization method; we do not need new pixel-level annotations for colorization. We train our network with the joint color embedding, semantic segmentation and generation losses, with weights λ 1 : λ 2 : λ 3 = 1 : 100 : 1, so that the three losses are similar in magnitude. Our multi-task learning, simultaneously optimizing colorization and semantic segmentation, effectively avoids overfitting. The Adam optimizer (Kingma and Ba (2015)) is adopted. We set the initial learning rate to 0.001, momentum to 0.95 and second momentum to 0.9995. We apply Polyak parameter averaging (Polyak and Juditsky (1992)).

Effect of segmentation on the embedding function f θ
We first study how semantic segmentation helps to improve the color embedding function f θ . Following the method introduced in Section 3.2.1, we jointly train the purple and blue flows shown in Figure 2. In this case, the semantic segmentation branch only influences the color embedding function. To illustrate the effect of pixelated semantics, we compare the color embeddings generated from the embedding function f θ in Figure 4. Clearly, semantic guidance enables better color embeddings. For example, the sky in the first picture looks more consistent, and the sheep are assigned reasonable colors. The results without semantic guidance, in contrast, appear less consistent. For instance, there is color pollution on the dogs and the sheet in the second picture.
Further, in order to show the predicted color channels of the color embeddings more clearly, we remove the lightness channel L and only visualize the chrominances a and b in Figure 4 (b). Interestingly, without semantic guidance, the predicted colors are noisier, as shown in the top row. With semantic guidance, the colors are more consistent and echo the objects well. From these results, one clearly sees that colorization profits from semantic information. These comparisons support our idea and illustrate that pixelated semantics enhances semantic understanding, leading to more consistent colorization.
In theory, we should obtain better colorization when a better color embedding is input to the generator. In Figure 5, we show some final colorizations produced by the generator g ω . Our method using pixelated semantics works well on both datasets, and the results look more realistic. For instance, the fifth example in the Pascal VOC dataset is a very challenging case; the proposed method generates consistent and reasonable color for the earth even with an occluded object. For the last example in Pascal VOC, it is surprising that the horse bit is assigned a red color although it is very tiny. The proposed method processes details well. We also show various examples from COCO-stuff, including animals, humans, fruits, and natural scenes. The model trained with semantics performs better: humans are given normal skin color in the third and fifth examples, and the fruits have uniform colors and look fresh.

Effect of segmentation on the generator g ω
In the next experiment, we add semantics to the generator as described in Section 3.2.2 (combining the purple flow with the blue and green flows). This means the generator produces the current pixel's color distribution conditioned not only on the previously colorized pixels and the color embedding of the gray image, but also on the semantic labels. As we train the three loss functions L emb , L seg and L gen simultaneously, we want to know whether the color embeddings produced by the embedding function are further improved. In Figure 6, we compare the color embeddings generated by the embedding functions of the purple flow (top row), the purple-blue flow (second row) and the purple-blue-green flow (bottom row). Visualizations of the color embeddings, each followed by the corresponding predicted chrominances, are given. As can be seen, the addition of the green flow further improves the embedding function. From the predicted a and b visualizations, we observe better cohesion of colors for the objects. Clearly, the colorization benefits from multi-task learning by jointly training the three different losses.
Indeed, using semantic labels as a condition to train the generator results in better color embeddings. Moreover, the final generated colorized results are also better. In Figure 7, we compare the results of three methods: pixelated colorization without semantic guidance (the purple flow), pixelated semantic color embedding (the purple-blue flow), and pixelated semantic color embedding and generator (the purple-blue-green flow). The purple flow does not always understand the object semantics well, sometimes assigning unreasonable colors to objects, such as the cow in the third example of Pascal VOC, and the hands in the second example and the apples in the last example of COCO-stuff. In addition, it also suffers from inconsistency and noise on objects. Using pixelated semantics to guide the color embedding function reduces the color noise and somewhat improves the results. Adding semantic labels to guide the generator improves the results further. As shown in Figure 7, the purple-blue-green flow produces the most realistic and plausible results. Note that it is particularly apt at processing details and tiny objects. For instance, the tongue of the dog is red, and the lip and skin of the baby have very natural colors.
To conclude, these experiments demonstrate that our strategies using pixelated semantics for colorization are effective.

Effect of colorization on the segmentation
From the previous discussion, we conclude that semantic segmentation aids in training the color embedding function and the generator, and that the color embedding function and the generator also help each other. As stated in Section 3, the three learning parts can benefit each other. Thus, we study whether colorization is able to improve semantic segmentation.
Color is important for semantic segmentation. As we observed in (Zhao et al (2018)), color is quite critical for semantic segmentation since it captures some semantics. A simple experiment stresses this point. We apply the Deeplab-ResNet101 model (Chen et al (2018)), without a conditional random field as post-processing, trained on the Pascal VOC2012 training set for semantic segmentation. We test three versions of the validation images: gray-scale images, original color images and our colorized images. The mean intersection over union (mean-IoU) is adopted to evaluate the segmentation results. As seen in Figure 8, with the original color information, the accuracy of 72.1% is much better than the 66.9% accuracy on the gray images. The accuracy obtained using our colorized images is only 1.8% lower than using the original RGB images, which again demonstrates that our colorized images are realistic. More importantly, our colorized images outperform the gray-scale images by 3.4%, which further supports the importance of color for semantic understanding.
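The mean-IoU metric used in this comparison can be computed as in the small numpy sketch below. This is a simplified version that skips classes absent from both prediction and ground truth; evaluation toolkits may handle ignore labels and averaging slightly differently:

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Mean intersection-over-union across classes, skipping classes that
    appear in neither the prediction nor the ground truth."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# A tiny 2x2 example: one of the two class-0 pixels is mispredicted.
gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
print(round(mean_iou(pred, gt, n_classes=2), 3))  # -> 0.583
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is about 0.583.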
Colorization helps semantic segmentation. In order to illustrate how colorization influences semantic segmentation, we train three semantic segmentation models on gray-scale images using our network structure: (1) a model trained from scratch, (2) a model fine-tuned from a pre-trained colorization model, and (3) our multi-task model jointly trained with semantic segmentation and colorization. As seen in Figure 9, the model trained from a pre-trained colorization model converges first. Its loss is stable from the 18-th epoch, at a value of about 0.043. The model trained from scratch has the lowest starting loss but converges very slowly; from the 55-th epoch, its loss plateaus at 0.060. As expected, the pre-trained colorization model helps semantic segmentation achieve better accuracy. We believe the colorization model has already learned some semantic information from the colors, as also observed by Zhang et al (2016). Further, our multi-task model jointly trained with semantic segmentation and colorization obtains the lowest validation loss of 0.030, around the 25-th epoch. This supports our statement that the two tasks with the three loss functions can be learned harmoniously and benefit each other.

Sample Diversity
As our model is capable of producing diverse colorization results for one gray-scale input, it is of interest to know whether pixelated semantics reduces the sample diversity. Following Guadarrama et al (2017), we compare two outputs sampled from the same gray-scale image with multi-scale structural similarity (SSIM) (Wang et al (2004)).

Colorization aims to produce plausible colors rather than recover the ground-truth. As discussed previously, colorization is a subjective challenge; thus, both qualitative and quantitative evaluations are difficult. As for quantitative evaluation, some papers (Zhang et al (2016); Iizuka et al (2016)) adopt classification accuracy. In this paper, we propose a new evaluation method: we use semantic segmentation accuracy to assess the performance of each method, since we know semantics is key to colorization. This is stricter than classification accuracy. Specifically, we calculate the mean-IoU of the semantic segmentation results obtained from the colorized images. We use this procedure to compare our method with single colorization methods. For qualitative evaluation, we use the method from our previous work (Zhao et al (2018)). We ask 20 human observers, including research students and people without any image processing knowledge, to take a test on a combined dataset including the Pascal VOC2012 validation set and the COCO-stuff subset. Given a colorized image or the real ground-truth image, the observers should decide whether it looks natural or not.
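As an illustration of how two samples can be compared, here is a simplified single-window SSIM in numpy. The actual evaluation uses the multi-scale, windowed SSIM of Wang et al, so treat this only as a sketch of the underlying similarity measure:

```python
import numpy as np

def global_ssim(x, y, L=1.0):
    """Single-window SSIM between two images with values in [0, L].
    A simplification: real SSIM averages this statistic over local windows."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2  # standard stabilizing constants
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / (
        (mx ** 2 + my ** 2 + C1) * (vx + vy + C2))

rng = np.random.default_rng(0)
a = rng.random((16, 16))
print(round(global_ssim(a, a), 3))  # identical samples -> 1.0
```

Lower SSIM between two colorizations sampled from the same gray-scale input indicates higher sample diversity, which is what this experiment measures.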

Single Colorization State-of-the-art
We compare the proposed method with the single colorization state-of-the-art (Zhang et al (2016); Iizuka et al (2016); Larsson et al (2016)). In addition to the proposed semantic segmentation accuracy evaluation, we also report PSNR. We use the Deeplab-ResNet101 model again for semantic segmentation. In this case, we only sample one result for each input using our method.
Result comparisons are shown in Table 3. Our method has a lower PSNR than Iizuka et al (2016) and Larsson et al (2016), as PSNR depends on the ground-truth and over-penalizes plausible but different colorizations (He et al (2018)). However, our method outperforms all the others in semantic segmentation accuracy. This demonstrates that our colorizations are more realistic and contain more perceptual semantics.
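PSNR itself is easy to reproduce; the minimal numpy sketch below assumes 8-bit images with a peak value of 255 and makes the over-penalization point concrete, since any deviation from the ground-truth lowers the score even if the colors are plausible:

```python
import numpy as np

def psnr(pred, gt, peak=255.0):
    """Peak signal-to-noise ratio in dB between a prediction and the
    ground-truth; higher means closer to the reference image."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

gt = np.full((8, 8), 100.0)
pred = gt + 10.0  # a uniform shift of 10 gray levels
print(round(psnr(pred, gt), 2))  # -> 28.13
```

A plausible but differently colored object produces exactly this kind of uniform deviation, which PSNR punishes even though a human observer may find the result natural.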
For qualitative comparison, we report the naturalness of each method according to the 20 human observers in Table 4. The three single colorization methods perform comparably, while our results are more natural. Selected examples are shown in Figure 11. The method by Iizuka et al (2016) produces good results, but sometimes assigns unsuitable colors to objects, like the earth in the fourth example. The results from Larsson et al (2016) look somewhat grayish. Zhang et al (2016)'s method can generate saturated results but suffers from color pollution. Compared to these, our colorizations are spatially coherent and visually appealing. For instance, the color of the bird in the third example and the skin of the human in the last example both look very natural.

Diverse Colorization State-of-the-art
We also compare our method with the diverse colorization state-of-the-art (Royer et al (2017); Guadarrama et al (2017)).

Failure Cases
Our method is able to output realistic colorized images, but it is not perfect. There are still some failure cases encountered by the proposed approach, as well as by other automatic systems. We provide a few failure cases in Figure 13. Usually, it is highly challenging to colorize different kinds of food, as they are artificial and variable. It is also difficult to learn the semantics of images containing several tiny and occluded objects. Moreover, our method cannot handle objects with unclear semantics. Although we exploit semantics for improving colorization, we do not have very many categories. We believe a finer semantic segmentation with more class labels will further enhance the results.

Conclusion
We propose pixelated semantic colorization to address a limitation of automatic colorization: object color inconsistency due to limited semantic understanding. We study how to effectively use pixelated semantics to achieve good colorization. Specifically, we design a pixelated semantic color embedding and a pixelated semantic generator. Both strengthen semantic understanding, so that content confusion can be reduced. We train our network to jointly optimize colorization and semantic segmentation. The final colorized results on two datasets demonstrate that the proposed strategies generate plausible, realistic and diverse colored images. Although we have achieved good results, our system is not perfect yet and some challenges remain. For instance, it cannot properly process images with artificial objects, like food, or tiny objects. More learning examples and finer semantic segmentation may further improve the colorization results in the future.

Fig. 2: Pixelated semantic colorization. The three colored flows (arrows) represent three variations of our proposal. The purple flow illustrates the basic pixelated colorization backbone (Section 3.1). The purple flow combined with the blue flow obtains a better color embedding with more semantics (Section 3.2.1). The purple flow, blue flow and green flow together define our final model, a pixelated colorization model conditioned on the gray-scale image and semantic labels (Section 3.2.2). Here, f θ is the color embedding function, h ϕ is the semantic segmentation head and g ω is the autoregressive generation model. There are three loss functions: L seg , L emb and L gen (Section 3.3).
In Figure 2, three components are included: the color embedding function f θ , the semantic segmentation head h ϕ and the autoregressive model g ω . Correspondingly, three loss functions are jointly learned, which will be introduced later. The three flows represent the three different methods introduced above. The purple flow illustrates the basic pixelated colorization. The purple flow combined with the blue flow results in the pixelated semantic embedding. The purple flow combined with the blue and green flows results in the pixelated semantic generator. Inspired by the success of the residual block (He et al (2016); Chen et al (2018)) and following Royer et al (2017), we apply gated residual blocks (van den Oord et al (2016); Salimans et al (
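The gated residual blocks referenced above follow the PixelCNN-style gated activation, where a tanh content path is modulated by a sigmoid gate and added back to the input. A simplified sketch, using 1x1 convolutions expressed as per-pixel matrix products (a real implementation would use masked spatial convolutions; the weight shapes here are illustrative):

```python
import numpy as np

def gated_residual_block(x, w_f, w_g, w_out):
    """PixelCNN-style gated residual block, sketched with 1x1 convs.

    x: (H, W, C) feature map; w_f, w_g, w_out: (C, C) weight matrices.
    """
    features = np.tanh(x @ w_f)               # content path
    gate = 1.0 / (1.0 + np.exp(-(x @ w_g)))   # sigmoid gate path
    return x + (features * gate) @ w_out      # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 8))
w = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
y = gated_residual_block(x, *w)
```

The residual connection lets the block fall back to the identity when the gated term contributes nothing, which eases optimization of deep stacks.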

Fig. 3: Color images, gray-scale images and segmentation maps from (a) Pascal VOC and (b) COCO-stuff. COCO-stuff has more semantic categories than Pascal VOC.

Fig. 4: Colorizations from the embedding functions f θ using the purple flow and the purple-blue flow. (a) Colorization without semantic guidance (first row) and with semantic guidance (second row). With semantics, better colorizations are produced. (b) Visualization of the predicted a and b color channels of the colorizations. The top row shows the results without semantic guidance and the bottom row with semantic guidance. With semantics, the predicted colors have less noise and look more consistent.

Fig. 5: Colorizations from the generators g ω , when relying on the purple flow and the purple-blue flow. Examples from (a) Pascal VOC and (b) COCO-stuff are shown. For both datasets, the top row shows results from the model without semantic guidance and the bottom row shows the ones with semantic guidance. The results with semantic guidance have more reasonable colors and better object consistency.

Fig. 6: Colorizations generated by the embedding functions f θ , using three variants of our network. The top row shows the results of the purple flow. The second row shows the results of the purple-blue flow. The bottom row shows the results of the purple-blue-green flow. Each colorization is followed by the corresponding predicted chrominances. The purple-blue-green flow produces the best colorization.

Fig. 7: Colorizations produced by the generators g ω , using three variants of our network on (a) Pascal VOC and (b) COCO-stuff: the purple flow (first row), the purple-blue flow (second row) and the purple-blue-green flow (third row). Using pixel-level semantics to guide the generator in addition to the color embedding function achieves the most realistic results.

Fig. 8: Segmentation results in terms of mean IoU on gray-scale images, our colorized images and the original color images, on the Pascal VOC2012 validation set. Color aids semantic segmentation.
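The mean-IoU metric reported in Fig. 8 averages the per-class intersection-over-union between predicted and ground-truth label maps. A minimal sketch (classes absent from both maps are ignored in the average, one common convention):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

A perfect prediction yields a mean IoU of 1.0; partial overlap lowers each class score toward 0.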

Fig. 9: Comparison of semantic segmentation validation losses. Three models are trained for 50 epochs. Training from a pre-trained colorization model is better than training from scratch, and joint training obtains the lowest validation loss, which demonstrates that colorization helps to improve semantic segmentation.

Fig. 10: Sample diversity. The histogram of SSIM scores on the Pascal VOC validation set shows the diversity of the multiple colorized results. Some examples with their specific SSIM scores are also shown. Our model is able to produce appealing and diverse colorizations.
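SSIM between pairs of sampled colorizations quantifies diversity: lower similarity between samples means more diverse outputs. The standard metric uses local sliding windows; the sketch below is a simplified whole-image variant of the SSIM formula, not the exact measure used for Fig. 10:

```python
import numpy as np

def global_ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Simplified whole-image SSIM between two images in [0, 255]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()  # cross-covariance
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical images score 1.0; scores well below 1.0 across sample pairs indicate diverse colorizations.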
)) apply Top-5 and/or Top-1 classification accuracies after colorization to assess the performance of the methods. Other papers (He et al (2018); Larsson et al (2016)) use the peak signal-to-noise ratio (PSNR), although it is not a suitable criterion for colorization, especially not for a method like ours, which produces multiple results. For qualitative evaluation, human observation is mostly used (Zhang et al (2016); Iizuka et al (2016); He et al (2018); Royer et al (2017); Cao et al (2017)).
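PSNR compares an estimate against a single reference image, which is exactly why it fits colorization poorly: a gray-scale input admits many plausible color versions, so penalizing deviation from the one ground-truth coloring is misleading. For reference, the standard computation:

```python
import numpy as np

def psnr(reference, estimate, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a reference image and
    an estimate. Higher is closer to the single reference."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A diverse but perfectly plausible colorization can score a low PSNR simply because it differs from the reference colors.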

Fig. 11: Comparisons with the single colorization state-of-the-art. Our results look more saturated and realistic.
); Cao et al (2017); Deshpande et al (2017)). All of these are based on a generative model. We only compare these qualitatively, by human observation. We use each model to pro-

Fig. 12: Comparisons with the diverse colorization state-of-the-art. The diverse results generated by our method look fairly good.

Fig. 13: Failure cases. Food, tiny objects and artificial objects are still very challenging.

Table 1: Color embedding branch structure, f θ (X L ). Feature spatial resolution, number of channels and dilation rate are listed for each module. The gray rows indicate the bottom layers shared with the semantic segmentation branch (detailed in Table 2).

Table 2: Semantic segmentation branch structure, h ϕ (X L ). Feature spatial resolution, number of channels and dilation rate are listed for each module. #class means the number of semantic categories. The gray rows indicate the bottom layers shared with the color embedding branch (detailed in Table 1).
4.6 Comparisons with State-of-the-art

Generally, we want to produce visually compelling colorization results, which can fool a human observer,

Table 4: Qualitative evaluation. Comparisons of naturalness. Our colorizations are judged more natural than the others.