Reference-guided structure-aware deep sketch colorization for cartoons

Digital cartoon production requires extensive manual labor to colorize sketches with visually pleasant color composition and color shading. During colorization, the artist usually takes an existing cartoon image as color guidance, particularly when colorizing related characters or an animation sequence. Reference-guided colorization is more intuitive than colorization with other hints, such as color points, scribbles, or text-based hints. Unfortunately, reference-guided colorization is challenging since the style of the colorized image should match the style of the reference image in terms of both global color composition and local color shading. In this paper, we propose a novel learning-based framework which colorizes a sketch based on a color style feature extracted from a reference color image. Our framework contains a color style extractor to extract the color feature from a color image, a colorization network to generate multi-scale output images by combining a sketch and a color feature, and a multi-scale discriminator to improve the realism of the output image. Extensive qualitative and quantitative evaluations show that our method outperforms existing methods, providing both superior visual quality and style reference consistency in the task of reference-based colorization.


Introduction
With the increasing popularity of digital cartoons, various computer-assisted technologies for producing them have rapidly been developed in recent years. Colorizing cartoon sketches is one such technology that has attracted extensive research attention in the field of computer graphics. Since the sketch itself contains no hint about the colors to apply, existing methods either colorize the sketch by pure guesswork [1–3], which may lead to unnatural colors (see Fig. 1(b)), or by using user-provided hints, such as color points or scribbles [4,5], or text hints [6]. However, manually crafting color hints is time-consuming, especially for line drawings with complex content. For example, in Fig. 1(c), 62 color points were created by the user to obtain a relatively satisfactory colorized result. Besides, crafting proper color hints is also challenging, especially for amateur users, as it requires the user to apply a certain level of aesthetic judgement to generate a visually pleasant color cartoon. Moreover, when coloring a cartoon animation, it would be extremely difficult for the user to achieve color consistency across frames when color hints are provided for each frame individually.
To resolve the above issues, we propose an automatic system that can colorize a cartoon line drawing based on a reference image. Reference-guided colorization has two key advantages over color hint-based methods. Firstly, it saves the user the trial-and-error effort of creating a proper set of hints for colorizing the sketches. Instead, the user only needs to provide a reference image in a similar color style to that desired in the colorized image. The system can automatically learn the color style from this reference image and apply it to colorize the sketch. Secondly, when the user needs to colorize a set of sketches with similar content, such as a sequence of sketched frames for producing a cartoon animation or a set of sketched character designs for the same cartoon character in different poses, the user only needs to colorize one of the sketches in the set and can then transfer the color of this image to the other images with ease.
Fig. 1 Comparison between our method and existing sketch colorization and style transfer methods.
In the current literature, very few works focus on reference-based colorization and most suffer from low colorization quality: it is challenging to properly propagate the visual style of the reference to the sketch input. However, various existing techniques tailored for other applications may be borrowed for this application. Style transfer methods [7,8] take an input image and a style image as input and transfer the style of the style image to the input. While these methods may be adapted to apply the style of the reference image to the input sketch in our application, the structural lines in the sketch generally cannot be well preserved, and the result may exhibit obvious artifacts (see Fig. 1(e)). Content-style disentanglement methods separate style from content [9,10], but they generally cannot ensure that the content component corresponds exactly to the structural lines in the sketch. As a result, the encoded style space is usually not orthogonal to the content space, which further affects the quality of the colorized result.
In this paper, we propose a novel learning-based solution for colorizing cartoon sketches based on reference images. A key requirement of our application is that structural lines in the sketch should be preserved during colorization. To do so, we formulate colorization as an image-to-image translation problem where the input is a cartoon sketch and the output is a colorized version of it based on the color style of a reference image. To colorize the sketch based on a reference image, we extract the color style of the reference image and then fuse the extracted color style feature into a deep hierarchical representation of the sketch via adaptive instance normalization (AdaIN) [7]. In order to improve the visual quality of the generated color output, we adopt a multi-scale discriminator which can improve the realism of the output image in terms of both global color composition and local color shading. We have applied our method to various images, with convincing results obtained in all cases.
The contributions of our work can be summarized as follows:
• a novel reference-based cartoon sketch colorization method using a deep learning approach which does not need manual color hinting,
• a formulation of sketch colorization as an image-to-image translation problem where the structural lines in the sketch should be faithfully preserved, and
• a multi-scale discriminator to improve the visual realism of the generated cartoon in terms of both global color composition and local color shading.

Background
With the rapid development of digitization, many approaches have been proposed to ease the difficulties of cartoon creation. For example, Ref. [11] proposed using geodesic distance to recolor cartoon images, Ref. [12] used 2D sketches to reconstruct symmetric 3D free-form shapes, and Ref. [13] used the shading of a cartoon image to estimate the reflectance and shape of objects. Similarly, many methods have been proposed for sketch colorization, and plenty of them perform reference-guided colorization. We can roughly divide these methods into three major categories: image style transfer, conditional image-to-image generation, and style-content disentanglement.

Image style transfer
Image style transfer aims to alter the style of an image, including its color and texture, based on a style exemplar. Ref. [14] proposed a style editing method which learns the texture mapping from pyramidal features based on the exemplar. However, its need for paired supervision greatly restricts its usage in our colorization task. Recently, thanks to the revamped construction of image features with deep neural network models, style transfer tasks can achieve higher output quality and style editing precision. Among these methods, Ref. [15] used a content loss and a style loss to handle different image components in the feature domain to achieve style transfer. Ref. [16] further used perceptual loss as a universal metric for all style-related editing tasks.
Ref. [7] extended the style transfer method to perform on-the-fly style alteration with a single neural network using a feature rescaling technique, adaptive instance normalization. Later, further methods were proposed to achieve better visual quality [8, 17–20], or efficiency [21,22]. However, style transfer methods are usually designed with global style objectives and cannot be directly applied to propagate local characteristics of the reference style. In other words, these methods cannot apply customized local color and textures according to underlying structural constraints (see Fig. 1(e)). In contrast, our method not only extracts the global color style to guide colorization, but also cares about detailed local color shading.

Deep conditional image-to-image synthesis
Deep generative models have been widely used to accomplish cross-domain image-to-image translation tasks. These methods apply adversarial training [23] to synthesize natural-looking images. Training can be conducted under either paired supervision [1] or cycle-consistency [2,24]. While these methods achieve decent image translation quality and can be used for sketch colorization [3], they are generally deterministic, i.e., they cannot create diversified outputs based on different user requirements. Conditional image generation methods extended the generative model to allow conditional user inputs. Refs. [4,25] proposed the use of scribbled or point color hints as the conditional input to guide colorization. However, they cannot use full-page cartoon pictures as a reference. Additionally, placing color hints is time-consuming and requires user experience. Ref. [6] proposed reading a list of text-level visual tags to guide decoration of the sketch input with the given visual properties. However, these visual tags may not capture the color style the user targets as well as a reference image does. Ref. [5] attempted to apply the style of the reference cartoon image to an arbitrary sketch input with a feature-based encoder and decoder design. However, the output quality is poor, with blurring artifacts. Furthermore, the color style of the result may be inconsistent with that of the reference image. Unlike the above conditional colorization methods that need explicit visual cues, Ref. [26] proposed encoding visual styles into a latent space with a style encoder. During the image translation process, the style of the reference image is extracted as the conditional input. However, it lacks a strong style editing mechanism for complex datasets such as comics and cartoons, which contain almost unlimited color and texture combinations. More recently, Ref. [27] proposed a reference-based sketch colorization method based on a transformation-aware attention module. However, this method still shares the problem of color style inconsistency between the reference and the output. We will further demonstrate this weakness in Section 3.2.4.

Reference-based photo colorization
Many works attempt to colorize photographs with reference-based priors. The pioneering work [28] proposed transferring chromatic information to the corresponding regions by matching luminance and texture. Various correspondence techniques [29–33] have been proposed to improve the result of local color transfer using hand-crafted low-level features. Still, these correspondence methods are not robust to complex appearance variations of the same object because low-level features do not capture semantic information. With the development of deep learning, recent studies [34,35] composed semantically close source-reference pairs based on features extracted from pretrained networks and exploited their semantic correlation for colorization. Although they may perform well for photo colorization, their performance in sketch colorization is relatively weak. This is due to the abstract nature of sketches, which cannot offer enough visual semantic cues for the dense color propagation widely used in photo colorization.

Image style-content disentanglement
We may also achieve reference-guided colorization by style-content disentanglement, in which images are decomposed into a content space and a style space. The content space encodes the structural information, while the style space encodes colors, textures, and other style-related information. With disentanglement techniques, multi-modal [9,36,37] or multi-domain [10] image-to-image translation can be achieved with structure preservation. Moreover, these methods allow aligning the output image style to the reference by converging their style space representations. While disentanglement methods manage to create smooth translations across different image categories or styles, they always encode the content (or the structural information) of the images into latent feature vectors, which generally differ from the structural information in line art. However, in our task, the structural lines in line art should be exactly preserved in the colorized image. Therefore, style-content disentanglement methods cannot be directly applied to this task.

Overview
The key insight of our proposed method is to achieve reference-guided sketch colorization by specifying the sketch as the content component in the colorized cartoon image. During colorization, we transfer the color style features of the reference image into the sketch input to create our final color cartoon output. We propose a deep learning framework to tackle this challenging problem. In this section, we first present the detailed network design of our proposed sketch colorization framework. Then we discuss its training, including training dataset preparation, loss function design, and training configuration.

Components
Traditional colorization networks do not allow image-based color style reference or have limited abilities to read from users' color guidance. Our goal is to automatically recognize the color style from the guidance image and apply it to the input sketch with consistency of color composition and shading. To do so, we first use a style extraction network that takes the style guidance image as input and outputs a representative style code. The style code contains essential color style information from the guidance image which is used during the colorization process to regularize the style of the output. We subsequently use the colorization network to fuse the color style of the guidance image and the deep semantics of the input sketch to create the final color cartoon output, with style consistent with the guidance image and content consistent with the input sketch. A multi-scale discriminator ensures realistic colorization in terms of both global color composition and local color shading. The overall network structure is shown in Fig. 2.
Fig. 2 Overview. Our framework takes an input image (a) and a reference image (b) as input, and outputs a color cartoon (c) that is consistent with the input in content and consistent with the reference in style.

Color style extractor
First, we use a color style extractor to extract a diversity of style variations in our training data into a unified color style space representation. The design of the color style space aims to collect as wide a range of different color styles as possible while excluding structural information from the style representation.
We use a 4-block downscaling residual network [38] as the style extractor. We input the style guidance image into the style extractor, and use global average pooling to output a style code with a dimension of 256. Our fully convolutional style extraction can extract the style code from a reference image of any resolution, but for convenience of training, we resize the image to 256 × 256 before input to the style extractor. We can combine the extracted color style features with any sketch to generate a new color cartoon image.
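The resolution independence of the style code follows from global average pooling, which collapses each channel of the final feature map to a single number. The sketch below illustrates only this pooling step on toy data; the function name `global_average_pool` and the nested-list feature maps are illustrative assumptions, and the residual blocks that would produce a 256-channel feature map are not modeled.

```python
def global_average_pool(feature_map):
    """Collapse a C x H x W feature map (nested lists) into a length-C vector."""
    code = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        code.append(sum(values) / len(values))
    return code

# Toy feature maps with 3 channels at two different spatial sizes
# (hypothetical stand-ins for the extractor output at two input resolutions).
small = [[[float(c)] * 4 for _ in range(4)] for c in range(3)]
large = [[[float(c)] * 16 for _ in range(8)] for c in range(3)]

code_small = global_average_pool(small)
code_large = global_average_pool(large)

# The code length depends only on the channel count, not on H or W.
print(len(code_small), len(code_large))
```

With a real extractor the channel count would be 256, giving the 256-dimensional style code regardless of the spatial size of the reference image.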

Colorization network with multi-scale outputs
The colorization network takes a cartoon sketch as input, colors it according to the color style code extracted from the reference image, and outputs a color cartoon with content consistent with the input and style consistent with the reference. Our colorization network is based on the U-Net structure, with a downscaling sketch encoder and an upscaling style-content fusion decoder. The encoder transforms a sketch image into deep feature maps with rich semantic information. The decoder fuses the color style code with the high-level feature maps of the input sketch to align the color style of the output to the reference. Unlike existing colorization networks that usually produce only one output image, we use a multi-scale output mechanism to help to train the discriminator to better distinguish realistic cartoon images in terms of both global and local color characteristics. In particular, low-resolution output images help more to improve global color composition, while high-resolution images help more to improve local color shading.
In the encoder, we use 5 levels of downsampling blocks to obtain a hierarchical feature map with rich semantic information about the input sketch. We add instance normalization [39] to the encoder to ensure sketch style information erasure [7]. The feature map is then fed to the decoder. The decoder contains 5 upsampling blocks. We use concatenation operations to propagate information from the encoder to the decoder for better synthesis and reconstruction. The final output has the same resolution as the input sketch. In the decoder, we use AdaIN layers [7] to control the statistics of the feature map to achieve style editing and alignment to the reference image. After each upsampling block of the decoder, we produce an extra output image of a lower resolution, which is fed to the multi-scale discriminator and is also useful in constraining the loss function of our system; see Sections 3.2.4 and 3.3.2.
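As a rough illustration of how an AdaIN layer controls feature statistics, the sketch below normalizes each channel to zero mean and unit variance and then applies a per-channel scale and bias. It assumes the scale and bias have already been derived from the style code; how the paper maps the 256-dimensional code to these parameters is not stated, so `scales` and `biases` are hypothetical inputs, as is the function name `adain`.

```python
import math

def adain(feature_map, scales, biases, eps=1e-5):
    """Per-channel instance normalization followed by a style-driven affine map.

    feature_map: C x H x W nested lists; scales, biases: length-C lists
    (assumed to be predicted from the style code elsewhere).
    """
    out = []
    for channel, scale, bias in zip(feature_map, scales, biases):
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(var + eps)
        out.append([[scale * (v - mean) / std + bias for v in row]
                    for row in channel])
    return out

# Two-channel toy feature map from the sketch branch.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 10.0], [20.0, 30.0]]]
styled = adain(features, scales=[2.0, 0.5], biases=[5.0, -1.0])
```

After the call, each output channel has a mean equal to its bias and a standard deviation close to its scale, whatever the statistics of the input channel were. This is what lets the reference style override the statistics of the sketch features.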

Multi-scale discriminator
We further employ a multi-scale discriminator to regularize the colorization network, for more realistic cartoon image generation. Unlike the commonly used discriminator network that judges generation quality based on a single input, we allow the discriminator to be compatible with different resolutions of generator output, vastly improving the receptive field of discrimination. We find that such a design leads to better colorization quality, especially in terms of global color composition. Figure 3 shows an example using single-scale and multi-scale discriminators.
We construct our discriminator with 3 downscaling residual blocks. The discriminator is patch-based [1] and we compute the mean of the output as the discriminator output. The adversarial objective will be introduced in Section 3.3.2.

Training dataset
To train our networks, we use a publicly available dataset [40] which contains 17,769 pairs of color cartoons and corresponding sketches. We use 14,224 pairs for training and 3,545 for evaluation.
During training, for each color cartoon and the corresponding sketch, we take the sketch as the input image and feed it to the colorization network. Then we take the color cartoon as the reference image and feed it to the color style extractor. Ideally, the output image should take the content from the sketch and take the color style from the color cartoon, i.e., the output image should be the same image as the color cartoon.

Loss function
Our loss function contains two loss terms, a multi-scale reconstruction loss and a multi-scale adversarial loss.
The reconstruction loss ensures the functionality of style extraction and style propagation by measuring the reconstruction ability of the colorization network. This is done by computing the difference between a ground-truth color cartoon and the network output obtained using the color cartoon as style guidance and its sketch counterpart as input. By minimizing the reconstruction error, the network learns a more precise style encoding and provides a color output similar to the ground truth. As mentioned before, our colorization network provides multi-scale versions of the colored output, so we apply our reconstruction loss at each level of the output so that the network can learn colorization and style propagation in a coarse-to-fine manner and balance the learning load for all upscaling convolutions. Specifically, for each output resolution, the reconstruction loss combines the perceptual loss [16] and the pixel-wise mean square error (MSE) via a weighted sum. Here, ŷ_i is the image predicted at level i by the colorization network, and y_i is the ground-truth image of the same resolution, obtained by rescaling the original-resolution ground-truth image using bilinear interpolation. λ_i is the weight for each level of output, set to [1, 2, 3, 10] respectively; output images with higher resolutions have higher weights. ϕ(·) is the output of the VGG16 network [41]. ω_1 is a weight set to 5 in all our experiments.
We further adopt a multi-scale adversarial loss [42] for our multi-scale discriminator. To further improve stability, we apply gradient penalty regularization [43] to the discriminator. In the multi-scale adversarial loss, D is the discriminator, E[·] is the expectation operator, and ω_2 weights the gradient penalty and is set to 10 in all our experiments. The multi-scale adversarial loss increases the realism of the colorization results in terms of both global color composition and local color shading, as shown in Fig. 3. It also widens the range of style references that can be handled. The overall loss function L of our framework is the sum of the reconstruction loss L_rec and the multi-scale adversarial loss L_adv: L = L_rec + L_adv.
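For orientation, one form of the losses that is consistent with the symbol definitions above can be written as follows. This is an illustrative sketch and not the authors' exact equations: the choice of squared L2 norms, the placement of ω_1 on the perceptual term, the Wasserstein-style critic objective, and the interpolated sample \tilde{y}_i used in the gradient penalty are all assumptions.

```latex
% Illustrative only: norms, the placement of \omega_1, and the critic form are assumptions.
\mathcal{L}_{rec} = \sum_{i} \lambda_i \left(
    \lVert \hat{y}_i - y_i \rVert_2^2
    + \omega_1 \, \lVert \phi(\hat{y}_i) - \phi(y_i) \rVert_2^2 \right)

\mathcal{L}_{adv} = \sum_{i} \left(
    \mathbb{E}\left[ D(\hat{y}_i) \right] - \mathbb{E}\left[ D(y_i) \right]
    + \omega_2 \, \mathbb{E}\left[ \left(
        \lVert \nabla_{\tilde{y}_i} D(\tilde{y}_i) \rVert_2 - 1 \right)^2 \right] \right)

\mathcal{L} = \mathcal{L}_{rec} + \mathcal{L}_{adv}
```

Under this reading, the level weights λ_i = [1, 2, 3, 10] emphasize the full-resolution output, while the coarser outputs still receive a reconstruction signal.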

Training details
We use the Adam optimizer [44] to train our networks. All networks are jointly trained. The learning rate is initially set to 10^−4 and gradually decreased to 2 × 10^−6. We employ a learning rate adjustment policy where the initial learning rate is multiplied by 1 − (iter/max_iter)^0.9. The optimization converges after about 150 epochs: see Fig. 4.
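The learning rate policy can be sketched in a few lines. The sketch reads the multiplier literally as 1 − (iter/max_iter)^0.9; the function name `learning_rate` and the clamping to the final value of 2 × 10^−6 are assumptions made for illustration, since the text does not say how the lower bound is enforced.

```python
def learning_rate(iteration, max_iter, base_lr=1e-4, min_lr=2e-6, power=0.9):
    """Decay base_lr by the factor 1 - (iteration / max_iter) ** power.

    The floor at min_lr is an assumption; the paper only states the
    initial and final learning rates.
    """
    factor = 1.0 - (iteration / max_iter) ** power
    return max(base_lr * factor, min_lr)

# Sample the schedule at a few points of a hypothetical 1000-iteration run.
for it in (0, 250, 500, 750, 1000):
    print(it, learning_rate(it, 1000))
```

The schedule starts at the initial rate, decreases monotonically, and reaches the floor near the end of training.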

Results
In this section, we present an in-depth evaluation of our reference-guided sketch colorization framework. First, we present visual comparisons between our method and several competitors in different categories to qualitatively evaluate the performance of colorization and style alignment of our framework. We further perform a quantitative comparison to mainstream state-of-the-art colorization methods in terms of our sketch colorization task. Moreover, we present an ablation study to investigate the design and contribution of each component in our framework.
We categorize our competitors into three major categories: image style transfer, style-content disentanglement, and conditional sketch colorization. We choose several state-of-the-art works in each category as the benchmark. For image style transfer, we choose three recent CNN-based methods, the Gatys method [15], WCT [8], and AdaIN [7] as our competitors. For style-content disentanglement, we compare with CCD [9] and DMIT [10]. Both methods are trained with our prepared cartoon dataset. For conditional sketch colorization, we choose two state-of-the-art reference-based colorization methods [5,27] and one state-of-the-art hint-based colorization method [25] as our competitors.
Figure 5 visually compares our method and state-of-the-art image style transfer, style-content disentanglement, and reference-based sketch colorization methods. Gatys' method [15] fails to colorize the sketch and only splashes random color patterns. AdaIN [7] and WCT [8] achieve better results, but structural lines are not well preserved, with obvious distortions and artifacts. The style-content disentanglement methods preserve the structural lines of the sketch better, but the color styles of the generated images are usually dissimilar to those of the reference cartoon. This is mainly because they encode the content component via an implicit representation, which may introduce bias to the shape and introduce extra errors when encoding the color style. Similarly, the reference-based colorization method [27] fails to propagate the exact color from the reference during colorization. In sharp contrast, our method preserves the content component by faithfully reproducing the structural lines in the sketch, so that both the style and the content are represented in our colorization framework with less bias.
Figure 6 visually compares our method and state-of-the-art hint-based sketch colorization methods. As shown in Fig. 6(c), the results of reference-based colorization [5] contain unexpected color mixing and obvious color discontinuity. Moreover, the results do not fully reflect the color characteristics of the reference image. The hint-based colorization method in Ref. [25] produces a style similar to the reference image given a certain number of user hints (at least 25 color points), as shown in Figs. 6(d) and 6(e). However, the quality of the output highly depends on the user's expertise in color composition and requires extensive color hints. The colorization results often deviate from the reference image and the overall coloring procedure may be inconvenient for amateur users. In comparison, our method faithfully propagates the reference color styles to the sketches with much less user effort.

Qualitative evaluations
Besides, we also investigated applying reference-based colorization methods from other domains to our sketch colorization task. Icon colorization [45] and photo colorization [35] were tested. A visual comparison is presented in Fig. 7. We can observe that Ref. [45] learns to propagate the colors from the reference, but the output colors are too saturated for real-life sketch colorization purposes. Ref. [35] failed to obtain semantic correspondence in our sketch colorization task and thus outputs very dull colorization results. The domain gap is too large to directly adopt photo and icon colorization techniques for cartoon sketches.

Quantitative evaluation
We next present a quantitative evaluation based on the quality of reconstruction. We randomly sampled ground-truth cartoon and sketch pairs from the evaluation dataset and used the color cartoon as the style guidance to colorize the sketch. Ideally, the colorized output image should be exactly the same as the color cartoon. To estimate the reconstruction quality, we measured the similarity between the reconstructed color image and the ground-truth color image using two commonly used similarity metrics, peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [46]: see Table 1. Our method outperforms all competitors on both metrics in terms of reconstruction quality.
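To make the reconstruction metric concrete, the sketch below computes PSNR for a pair of tiny grayscale images using the standard definition 10·log10(MAX² / MSE). The function name `psnr`, the toy pixel values, and the 8-bit peak value of 255 are illustrative assumptions; the paper does not describe its exact evaluation code.

```python
import math

def psnr(image_a, image_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    flat_a = [v for row in image_a for v in row]
    flat_b = [v for row in image_b for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 2 x 2 example: two pixels differ by 4 intensity levels, so MSE = 8.
ground_truth = [[0, 64], [128, 255]]
prediction = [[0, 60], [132, 255]]
score = psnr(prediction, ground_truth)
print(round(score, 2))
```

Higher PSNR means the colorized output is closer to the ground-truth color cartoon; a perfect reconstruction gives an infinite score.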
Furthermore, to estimate the realism of our colorized outputs, we also present a quantitative study by calculating the Fréchet inception distance (FID) [47] between our colorization results and ground-truth color cartoons. In this case, the style reference and the input sketch do not need to be the same. Again see Table 1: our method produces the closest colorization results to the ground truth, again demonstrating the superiority of our method in terms of output realism.
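For reference, FID compares the statistics of Inception features computed on two image sets. In the standard definition below, the symbols are introduced here for illustration and do not appear in the original text: μ_r and Σ_r denote the mean and covariance of the features of the ground-truth cartoons, and μ_g and Σ_g those of the colorized outputs.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
    + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

A lower FID indicates that the distribution of colorized outputs is closer to that of real color cartoons.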

Ablation study
To validate the impact of each component in our network design, we performed an ablation study, determining the reconstruction metrics and FID metrics with different network designs. See Table 2; note that the PSNR/SSIM metrics are evaluated with paired sketch/reference input and FID is evaluated with random references.
Firstly, without the multi-scale output design, both PSNR and SSIM values dropped significantly, which shows the effectiveness of our multi-scale output design. We also studied the use of the multi-scale discriminator by only feeding a single full-resolution output to the discriminator. The results showed that the multi-scale discriminator design is essential to the output quality. The multi-scale discriminator design brings together the adversarial learning at different scales and can be seen as a general case of Ref. [48], which fuses global and local information together to improve generated results. With the multi-scale network design, the colorization can take care of both global color composition and local texture synthesis. In addition, we find that the multi-scale reconstruction loss is also very important for generating visually pleasant results. We also tested replacing adaptive instance normalization with an alternative style injection approach in which style codes of the same spatial size are reshaped as feature maps and concatenated. However, as shown in the second row, both measurements dropped significantly with this concatenation design.
We also investigated the appropriate dimension of the code used to encode the color style. Higher-dimensional encoding has a larger capacity but may encode some extra information that may not be directly related to style. Also, using higher dimensions may reduce the generalizability of the style encoding. Conversely, lower-dimensional encoding has better generalization but may lead to a shift in color style in the reconstructed image. In our experiments, we found 256 dimensions to be optimal for the color style code, best balancing reconstruction and generalization.
We also explored the effectiveness of the discriminator by visual assessment of an ablation, as shown in Fig. 8. Without the discriminator, the diversity of color in the output image can be restricted, and the color styles can be overfitted. For example, in the first row, without adversarial learning, the colorization of the girl's eyes loses symmetry and so appears unnatural. The same issue is also observed in the last row, where the network without adversarial learning incorrectly propagates white colors to the top-right of the canvas due to similar modes in the style reference. Adversarial learning minimizes the impact of these style-unrelated modes and ensures natural colorization in these cases. In addition, adversarial learning effectively alleviates the problem of color bleeding, e.g., on the girl's chest in the second row and the girl's arms and hair in the third row.

Limitation
Although our method can be applied to most sketch images, our output results may still be affected by the color and texture composition of the style reference. If the style reference contains very little color information, the output quality may not be satisfactory. Additionally, if the style reference and the input sketch are too different in structural composition, our method may have difficulty propagating color styles from the colored regions of the reference to the input sketch, as shown in Fig. 9.
Fig. 9 Our method may not work well when the style reference contains very little color information or when the style reference and the input sketch are too different in structural composition.

Conclusions
In this paper, we proposed a novel deep learning approach for reference-guided cartoon sketch colorization. Our system consists of a color style extractor that extracts a color style code from a color cartoon image, a colorization network that fuses a sketch with a color style code to generate a set of multi-scale color cartoon outputs, and a multi-scale discriminator that improves the realism of the generated color cartoon in terms of both global color composition and local color shading. Experiments show that our method significantly outperforms existing methods in content preservation of the input and style consistency with the reference images. As future work, we intend to explore the potential of colorizing an animation sequence by incorporating temporal information.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.