1 Introduction

Old legacy movies and historical videos exist only in black and white because, at the time they were captured, there was no suitable technology to preserve color information. Colorization brings such material back to life: its aim is to add color to a black and white or grayscale image so that the newly generated image is visually appealing and meaningful. In recent years, a variety of colorization techniques based on generative adversarial networks (GANs) [13] have been proposed, and state-of-the-art performance has been reported on current databases [8, 11, 31]. These techniques differ in many aspects, such as network architecture, loss functions and learning strategies. However, the existing colorization processes [15, 21, 27, 33, 34] mostly follow unconditional generation, where the colors are predicted from the grayscale input alone. This can lead to ambiguous results, as predicting color from grayscale information is inherently ill-posed. To increase the fidelity of the colorization pipeline, we propose a text-guided colorization pipeline in which color descriptions of the objects present in the grayscale image are provided as auxiliary conditions to achieve more robust colorized results (Fig. 1).

Fig. 1
figure 1

Images generated by the proposed algorithm: the first column indicates the input grayscale images; the second column shows the ground truth color images and the third column illustrates the respective colorized outputs of the proposed model. [Best viewed with 300% zoom in the digital version]

The major contributions of our work are as follows.

  • A novel GAN pipeline is proposed that exploits textual descriptions as an auxiliary condition.

  • We extensively evaluate our framework using qualitative and quantitative measures. In comparison with the state-of-the-art (SOTA) algorithms, it is found that the proposed method generates results with better perceptual quality.

  • To the best of our knowledge, this is the first attempt to integrate textual information into an end-to-end colorization pipeline to improve the quality of generation. The textual color description acts as additional conditioning to increase the fidelity in the final colorized output.

It is important to note that the SOTA text-based colorization method [6] is not an end-to-end model. It first tries to estimate a color palette from the textual description and then attempts to colorize the input grayscale image. The proposed method is an end-to-end model that completely circumvents the necessity of any intermediate color palette estimation.

The rest of the paper is organized as follows. Section 2 introduces the SOTA colorization techniques. In Section 3, the proposed colorization framework is discussed in detail. Section 4 presents the experimental settings that are used to train and evaluate the pipeline. We present our results and compare our proposed framework with the SOTA algorithms using qualitative and quantitative metrics in Section 5. Finally, in Section 6, the paper is concluded by pointing out the overall findings of the proposed work, its limitations and future prospects.

2 Related work

Image colorization has been the focus of significant research over the last two decades. Most of the earlier methods were based on conventional machine learning approaches [7, 14, 19]. In the last few years, the trend has shifted to deep learning (DL) owing to its success in many different fields [5, 24, 26, 28], and DL-enabled automatic image colorization systems have shown impressive performance on the colorization task [6, 9, 10, 19, 28, 35]. Attention-based mechanisms [1,2,3,4] have also been used in the literature, for example for crowd-management work.

Deep colorization [10] was the first network to incorporate DL for image colorization. During training, five fully connected layers with ReLU activations are used, with the least-squares error as the loss function. The neurons of the first layer depend on the features extracted from the grayscale patch, and the output layer has only two neurons, corresponding to the U and V channels. During testing, grayscale image features are extracted at three levels: sequential gray values at the low level, DAISY features [26] at the mid level, and semantic labels at the high level.

Deep depth colorization [9] uses a network pre-trained on ImageNet [11] to colorize images from their depth information. The network is designed for object recognition by learning a mapping from depth to the RGB channels. The pre-trained weights are kept frozen, and the pre-trained network is used merely as a feature extractor.

Wang et al. [28] proposed SAR-GAN to colorize synthetic aperture radar (SAR) images, using a cascaded generative adversarial network as the underlying architecture. SAR-GAN comprises two subnetworks: a despeckling subnetwork and a colorization subnetwork. The despeckling subnetwork generates a noise-free SAR image, which the colorization subnetwork then processes to produce the colorized image. The despeckling subnetwork consists of eight convolution layers with batch normalization and element-wise division in a residual structure, while the colorization subnetwork uses an encoder-decoder architecture with eight convolution layers and skip connections. The entire network is trained with the Adam [17] optimizer using a hybrid loss that combines an l1 loss and an adversarial loss.

The text2color [6] model consists of two conditional adversarial networks: a Text-to-Palette Generation network and a Palette-based Colorization network. The Text-to-Palette Generation network is trained on a palette-and-text dataset; its generator learns to produce a color palette from the text, while its discriminator distinguishes real palettes from generated ones. The Huber loss is used as the loss function in this network. The Palette-based Colorization network follows a U-Net architecture in which the color palette serves as a conditional input to the generator. The discriminator is built from a series of Conv2d and LeakyReLU [22] modules followed by a fully connected layer that classifies the colorized image as real or fake. In this method, the palette contains only four to six colors, and the generated color image depends entirely on this palette.

In colorful image colorization [35], the authors proposed an early CNN architecture for colorization based on a cross-channel encoder. However, this method cannot always color each object with an appropriate color.

In colorization using optimization [19], the authors use color scribbles to colorize the image. They formulate a quadratic cost function, yielding an optimization problem that can be solved efficiently with standard techniques. If the number of colors in the image is large, however, this technique cannot maintain the colors properly.

In Towards Vivid and Diverse Image Colorization with Generative Color Prior [32], the authors first use a pre-trained GAN for feature matching and then generate diverse colors by manipulating the latent space of a second GAN. If the pre-trained GAN produces misleading features, the colorization network generates an unnatural colorization of the image.

In the colorization transformer [18], the authors use a conditional autoregressive transformer to generate the image at low resolution. The network then continues with two parallel branches, one for coarse colorization and one for fine colorization.

In the proposed work, a novel end-to-end deep model takes a grayscale image and a textual embedding [23] as input and colorizes the image using the textual embedding as side information. The textual embedding is fused with a low-dimensional representation of the input image using residual-in-residual dense blocks (RRDBs) [30] to impose the conditioning.

3 Methodology

Image colorization aims to generate a color image from a grayscale image. Deep learning pipelines typically use RGB images as ground truth for image generation. In the proposed method, the RGB images are instead converted into the CIE LAB color space, so that only the ‘A’ and ‘B’ channels need to be predicted rather than all three RGB channels. The input text, which contains the color information of the image, is converted to a word vector of size 256 using word2vec. The input to the network is the ‘L’ channel of the LAB image, resized to 256 × 256. The ‘L’ channel is then combined with the ‘AB’ channels predicted by the generator to reconstruct the fully colorized image. The discriminator assesses the visual authenticity of the image in a patch-based manner.
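A minimal sketch of this preprocessing step is given below, assuming scikit-image for the color-space conversion; the function name and normalization constants are illustrative rather than the authors' exact code.

```python
# Illustrative preprocessing: convert an RGB image to CIE LAB,
# use L as the generator input and AB as the prediction target.
import numpy as np
from skimage import color, transform

def prepare_sample(rgb_image: np.ndarray, size: int = 256):
    """rgb_image: HxWx3 uint8 array. Returns (L, AB) resized to size x size."""
    rgb = transform.resize(rgb_image, (size, size), anti_aliasing=True)  # floats in [0, 1]
    lab = color.rgb2lab(rgb)                  # L in [0, 100], A/B roughly in [-128, 127]
    L = lab[..., 0:1] / 100.0                 # normalized intensity channel (network input)
    AB = lab[..., 1:3] / 128.0                # chrominance channels (prediction target)
    return L.astype(np.float32), AB.astype(np.float32)

# At inference time the predicted AB channels are concatenated back with L and
# mapped to RGB with color.lab2rgb after undoing the normalization.
```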

Motivation

Though image colorization has a wide range of applications, the majority of the existing methods do not provide any direct control over the colorization process. Scribble-based techniques, which can regulate the final colorized output, demand extensive human intervention. Although text-based generative pipelines are extremely user-friendly and give direct control over the generation process, there have been only a few attempts to design text-based colorization algorithms due to the inherent complexity of the overall methodology. Most of the existing text-based colorization algorithms first predict a color palette and then perform the colorization. To the best of our knowledge, there is no existing end-to-end model that can exploit the flexibility and richness of a text-based generation pipeline, which is the prime motivation for the proposed model. Regarding the architecture, we observe that RRDB modules, though largely unexplored in this context, perform well in ill-posed inverse problems [12]. We therefore designed an encoder-decoder-based generative architecture with RRDBs in the generator.

3.1 Generator

The idea of the proposed generator (Fig. 2) is that the textual color information is fused with the grayscale image (Li) at the last downsampling step of the network. The input L image is first resized to a fixed size of 256 × 256. The generator has two pathways: an image pathway, through which the image information flows through the network, and a text pathway, through which the textual color information flows as a conditional input. The two pathways finally meet in the residual-in-residual dense block (RRDB).

In the image path, each resolution level has two convolution layers, and the last convolution of a level is followed by a downsampling by a factor of 2 to move to the next resolution. Each convolution block uses a 3 × 3 kernel with 64 filters, followed by batch normalization and ReLU activation. The text vector (Si) is processed by two fully connected layers of sizes 256 and 4096. The output of the last fully connected layer is reshaped to 1 × 64 × 64, and an element-wise product is computed between the image features and the text features to impose the text-guided conditioning. The text-conditioned features are then fed to an RRDB before being forwarded to the expanding part of the generator. The RRDB consists of several dense blocks with skip connections; the output of each dense block is scaled by β before being fed to the next one. Each dense block consists of a convolutional layer followed by batch normalization and leaky ReLU activation with a residual connection. As shown in Fig. 3, the skip connections are introduced to tackle the vanishing gradient problem.

The output of the RRDB is fed to a ConvTranspose2d layer with 64 filters. The expanding pathway uses three upsampling operations working at four different resolutions. To enrich the feature information in the expanding path, the features after each upsampling layer are concatenated with the features of the same resolution from the contracting path. The convolution blocks in the expanding path are similar to those in the contracting path, and the number of filters is halved each time we move to a higher resolution. At the highest resolution, two filters with kernel size 1 × 1 are applied to generate the estimated AB channels of the color image. At the end of the network, the color image is obtained by combining the generated AB channels with the input grayscale image (Li). The proposed generator is illustrated in Fig. 2.
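As a concrete illustration of the fusion step, the following PyTorch sketch reshapes the text features to 1 × 64 × 64 and multiplies them element-wise with the 64-channel image features; the module name and minor details (activations, bias terms) are assumptions based on the description above.

```python
# Illustrative text-image fusion at the bottleneck of the generator.
import torch
import torch.nn as nn

class TextImageFusion(nn.Module):
    def __init__(self, embed_dim: int = 256, spatial: int = 64):
        super().__init__()
        self.spatial = spatial
        # two fully connected layers of sizes 256 and 4096 (4096 = 64 x 64)
        self.fc = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, spatial * spatial), nn.ReLU(inplace=True),
        )

    def forward(self, img_feat: torch.Tensor, text_vec: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, 64, 64, 64) features from the contracting path
        # text_vec: (B, 256) word2vec embedding of the color description
        t = self.fc(text_vec).view(-1, 1, self.spatial, self.spatial)  # (B, 1, 64, 64)
        return img_feat * t  # element-wise product broadcast over the 64 channels

# fused = TextImageFusion()(torch.randn(2, 64, 64, 64), torch.randn(2, 256))
```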

Fig. 2
figure 2

The block diagram of the proposed architecture. The network predicts the color components of the image, which is combined with the intensity image to produce the final colorized image. [Best viewed with 300% zoom in the digital version.]

Fig. 3
figure 3

The block diagram of the Residual in Residual Dense Block (RRDB) architecture. [Best viewed with 300% zoom in the digital version.]
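For clarity, a simplified RRDB sketch is given below; the growth rate, number of dense layers and the value of the residual scaling factor β are assumptions, since only the overall structure (dense connections, β-scaled residuals, BN and leaky ReLU) is specified above.

```python
# Simplified RRDB: dense blocks with skip connections and beta-scaled residuals.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, channels: int = 64, growth: int = 32, beta: float = 0.2):
        super().__init__()
        self.beta = beta
        self.layers = nn.ModuleList()
        for i in range(4):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                nn.BatchNorm2d(growth),
                nn.LeakyReLU(0.2, inplace=True),
            ))
        self.fuse = nn.Conv2d(channels + 4 * growth, channels, 3, padding=1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))        # dense connections
        return x + self.beta * self.fuse(torch.cat(feats, dim=1))  # scaled residual

class RRDB(nn.Module):
    def __init__(self, channels: int = 64, beta: float = 0.2):
        super().__init__()
        self.blocks = nn.Sequential(*[DenseBlock(channels) for _ in range(3)])
        self.beta = beta

    def forward(self, x):
        return x + self.beta * self.blocks(x)  # residual over the whole group
```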

3.2 Discriminator

For the colorization task, the discriminator needs to detect the local quality of the generated colorized image. A PatchGAN discriminator D is therefore used to judge the quality of the generated image: it penalizes the generated structure at the patch level, resulting in high-quality generation. The grayscale image (Li) is stacked with either the target image (Ti) or the estimated image (Ei), where Ti and Ei are the AB channels of the color image. The (Li, Ti) stack is labeled as real and the (Li, Ei) stack as fake. In this way, discrimination is enforced on the image transition rather than on the image itself. The patch discriminator takes a three-channel input of dimension 256 × 256. It has three convolution blocks with 64, 128 and 256 filters, respectively, each using 4 × 4 filters. The first two convolution blocks use a stride of 2, whereas the remaining convolutions use a stride of 1. Each convolution layer is followed by batch normalization and leaky ReLU activation. After the convolution blocks, one filter of kernel size 4 × 4 is applied with stride 1 to compute the final response, and the average of this response is the output of the discriminator.
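The sketch below is an illustrative reconstruction of this discriminator (not the authors' exact code); stacking the L channel with the AB channels yields the three-channel input described above.

```python
# PatchGAN-style discriminator operating on the (L, AB) stack.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels: int = 3):
        super().__init__()
        def block(cin, cout, stride):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=4, stride=stride, padding=1),
                nn.BatchNorm2d(cout),
                nn.LeakyReLU(0.2, inplace=True),
            )
        self.net = nn.Sequential(
            block(in_channels, 64, stride=2),
            block(64, 128, stride=2),
            block(128, 256, stride=1),
            nn.Conv2d(256, 1, kernel_size=4, stride=1, padding=1),  # per-patch response
        )

    def forward(self, L, AB):
        x = torch.cat([L, AB], dim=1)   # discriminate the (L, AB) transition
        return self.net(x)              # patch map; mean-reduce for a scalar score

# score = PatchDiscriminator()(torch.randn(1, 1, 256, 256), torch.randn(1, 2, 256, 256)).mean()
```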

3.3 Training

The PatchGAN discriminator focuses mostly on high-frequency information. Thus, to preserve the fidelity of the low-frequency information in the colorized image, an L1 loss is used in the generator G, calculated as

$$ \mathcal{L}^{G}_{1}=\|E^{i}-T^{i}\|_{1}=\|G(L^{i},S^{i})-T^{i}\|_{1} $$
(1)
$$ \|x-y\|_{1}= \sum\limits_{i=1}^{d}|x_{i}-y_{i}| $$
(2)

where xi and yi are the i-th elements of the d-dimensional vectors x and y, respectively. As the generator is trained in an adversarial manner, the adversarial (GAN) losses of the generator and the discriminator are defined as:

$$ \mathcal{L}^{G}_{GAN}=\mathcal{L}_{BCE}(D(L^{i},G(L^{i},S^{i})),1) $$
(3)
$$ \mathcal{L}^{D}_{GAN}=\mathcal{L}_{BCE}(D(L^{i},T^{i}),1)+\mathcal{L}_{BCE}(D(L^{i},G(L^{i},S^{i})),0) $$
$$ \mathcal{L}_{BCE} = -{(y\log(p) + (1 - y)\log(1 - p))} $$
(4)

where \({L}^{G}_{GAN}\) and \({L}^{D}_{GAN}\) denote the adversarial generator loss and the adversarial discriminator loss, respectively. In \({\mathscr{L}}_{BCE}\), y is the label and p is the predicted probability. To further increase the visual quality of the image, a perceptual loss is also used to train the generator:

$$ \mathcal{L}_{p_{\rho}}^{G}=\frac{1}{h_{\rho} w_{\rho} c_{\rho}}\sum\limits_{x=1}^{h_{\rho}}\sum\limits_{y=1}^{w_{\rho}}\sum\limits_{z=1}^{c_{\rho}}\|\phi_{\rho}(E^{i})-\phi_{\rho}(T^{i})\|_{1} $$
(5)

where \({\mathscr{L}}_{p_{\rho }}^{G}\) is the perceptual loss computed at the ρth layer, ϕρ is the output from the ρth layer of a pretrained VGG19 model, and hρ, wρ and cρ are the height, width and the number of channels at that layer, respectively.

The total generator loss \({\mathscr{L}}^{G}\) can be defined as

$$ \mathcal{L}^{G}= \arg\min_{G} \max_{D}\; \lambda_{1}\mathcal{L}^{G}_{GAN}+\lambda_{2}\mathcal{L}^{G}_{P_{4}}+\lambda_{3}\mathcal{L}^{G}_{1} $$
(6)
$$ tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}} $$
(7)
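A condensed sketch of how the generator objective in (6) could be assembled in PyTorch is given below. The use of BCE with logits, the choice of intermediate VGG19 layer, and feeding the stacked (L, AB) tensor as a three-channel proxy to VGG are assumptions made only for illustration.

```python
# Illustrative generator objective: adversarial + perceptual + L1 terms.
import torch
import torch.nn as nn
from torchvision.models import vgg19

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
vgg_features = vgg19(weights="IMAGENET1K_V1").features[:21].eval()  # intermediate layer
for p in vgg_features.parameters():
    p.requires_grad_(False)

def generator_loss(D, L, fake_AB, real_AB, lambdas=(1.0, 1.0, 1.0)):
    pred_fake = D(L, fake_AB)
    adv = bce(pred_fake, torch.ones_like(pred_fake))            # try to fool D, cf. (3)
    fake_img = torch.cat([L, fake_AB], dim=1)                   # 3-channel proxy for VGG
    real_img = torch.cat([L, real_AB], dim=1)
    perc = l1(vgg_features(fake_img), vgg_features(real_img))   # perceptual term, cf. (5)
    rec = l1(fake_AB, real_AB)                                  # low-frequency fidelity, cf. (1)
    return lambdas[0] * adv + lambdas[1] * perc + lambdas[2] * rec
```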

4 Experimental details

The PyTorch framework is used to build the model and perform the experiments. The Adam [17] optimizer is used to train both the generator and the discriminator for up to 350K iterations with β1 = 0.5 and β2 = 0.999. The learning rate is 1 × 10−4 with no decay. All leaky ReLU activations have a negative slope coefficient of 0.2. We set λ1 = 1, λ2 = 1 and λ3 = 1.

While training the discriminator D, Li is concatenated with either Ti or Ei and fed as the input. D and G are trained iteratively, i.e., D is kept fixed while training G and vice versa. As the training process of a GAN is highly stochastic, the network weights are stored at the end of each iteration. At inference time, the discriminator is dropped and the A, B channels are generated using only the generator network.
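The alternating scheme can be summarized by the following sketch, which reuses the loss and discriminator sketches from Section 3.3 and assumes that `G`, `D` and `dataloader` are defined elsewhere; it is a condensed illustration, not the authors' training script.

```python
# Alternating D/G updates with the optimizer settings stated above.
import torch

opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.999))
bce = torch.nn.BCEWithLogitsLoss()

for L, real_AB, text_vec in dataloader:   # L: (B,1,256,256), real_AB: (B,2,256,256)
    fake_AB = G(L, text_vec)

    # --- discriminator step (generator frozen) ---
    opt_D.zero_grad()
    pred_real = D(L, real_AB)
    pred_fake = D(L, fake_AB.detach())
    loss_D = bce(pred_real, torch.ones_like(pred_real)) + \
             bce(pred_fake, torch.zeros_like(pred_fake))
    loss_D.backward()
    opt_D.step()

    # --- generator step (discriminator frozen) ---
    opt_G.zero_grad()
    loss_G = generator_loss(D, L, fake_AB, real_AB)   # sketch from Section 3.3
    loss_G.backward()
    opt_G.step()
```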

4.1 Datasets

To evaluate the performance of our model, three popular datasets are used: Caltech-UCSD Birds 200 [31], MS COCO [20] and the Natural Color Dataset (NCD) [5].

The Birds dataset (Fig. 4) contains 6032 bird images with their color information. The dataset is split into two parts (train, test). The total number of images for training is 5032, and the remaining images are used in the test set.

Fig. 4
figure 4

A typical example in our dataset: each sample contains a color image and corresponding color descriptions of the bird. To use the image while training the network, the color image is converted to the LAB color space, and the ‘L’ channel is used as the input

The NCD set contains 730 fruit images; 600 images are used for training and the remaining 130 for testing. Each class label is converted into a single color (e.g., the tomato class is converted into red), which is used as the color information for training and testing.

From the MS COCO [20] dataset, 39k images are used for training and 6225 images for testing. In COCO Stuff [8], text descriptions of the images are available. From each text description, the sentence(s) related to the color information of an object serve as auxiliary information for the network; all such sentences are collected and used as the final auxiliary information for the respective image.
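Since the exact filtering procedure is not described, the helper below is only an illustrative sketch of how color-bearing sentences could be collected from a caption; the color-word list and function name are assumptions.

```python
# Illustrative caption filtering: keep sentences that mention a color word.
COLOR_WORDS = {"red", "orange", "yellow", "green", "blue", "purple",
               "pink", "brown", "black", "white", "gray", "grey"}

def color_description(caption: str) -> str:
    sentences = [s.strip() for s in caption.replace("\n", " ").split(".") if s.strip()]
    color_sentences = [s for s in sentences
                       if any(w in s.lower().split() for w in COLOR_WORDS)]
    return ". ".join(color_sentences)

# color_description("A man rides a horse. The horse is brown and the saddle is black.")
# -> "The horse is brown and the saddle is black"
```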

5 Experimental results

To understand the overall performance of the proposed framework, an extensive set of experiments is performed to evaluate the quality of the final colorized images. In Fig. 5, we compare our algorithm with [35, 36] and [6]. As shown in the figure, the proposed algorithm colorizes the grayscale images with higher fidelity. Although the existing methods colorize the grayscale images successfully, the colors are often significantly different from the actual ground truth, and the colorized images produced by the SOTA algorithms are also less colorful. As the proposed method utilizes the textual description as auxiliary information, our algorithm generates more realistic and colorful images from the respective grayscale inputs. The network's performance is evaluated by generating color images for the three public databases; Fig. 6 shows image samples from the Caltech-UCSD Birds 200 dataset [31], the MS COCO dataset [8] and the Natural Color Dataset [5], respectively.

Fig. 5
figure 5

Qualitative comparison results: the first column contains the ground truth images, the second, third and fourth columns contain the results generated by the SOTA algorithms, and the last column shows the results generated by the proposed algorithm. [Best viewed with 300% zoom in the digital version.]

Fig. 6
figure 6

Images generated by the proposed algorithm from the Caltech-UCSD Birds 200 [31], NCD [5] and MS COCO Stuff [8] datasets: the first column contains the grayscale images, the second column contains the ground truth images and the third column shows the colorized outputs of the proposed model. [Best viewed with 300% zoom in the digital version.]

To further validate the effectiveness of the proposed model, the quality of the generated images is also evaluated using quantitative metrics. Average PSNR, SSIM [29], LPIPS(vgg) [25, 37] and LPIPS(sqz) [16] measures are used to compare the similarity of the generated images with the ground truth. As shown in Table 1, the proposed algorithm outperforms the SOTA algorithms in the SSIM, PSNR and LPIPS(vgg) measures.
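For reference, a per-image-pair computation of these metrics could look like the sketch below, assuming scikit-image (0.19+) for PSNR/SSIM and the lpips package for the LPIPS(vgg) and LPIPS(sqz) scores; it is an illustration of the evaluation protocol, not the exact scripts used.

```python
# Illustrative metric computation for one predicted/ground-truth RGB pair.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_vgg = lpips.LPIPS(net="vgg")
lpips_sqz = lpips.LPIPS(net="squeeze")

def evaluate_pair(pred_rgb: np.ndarray, gt_rgb: np.ndarray):
    """pred_rgb, gt_rgb: HxWx3 uint8 images."""
    psnr = peak_signal_noise_ratio(gt_rgb, pred_rgb, data_range=255)
    ssim = structural_similarity(gt_rgb, pred_rgb, channel_axis=-1, data_range=255)
    # lpips expects NCHW tensors scaled to [-1, 1]
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    d_vgg = lpips_vgg(to_tensor(pred_rgb), to_tensor(gt_rgb)).item()
    d_sqz = lpips_sqz(to_tensor(pred_rgb), to_tensor(gt_rgb)).item()
    return psnr, ssim, d_vgg, d_sqz
```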

Table 1 Quantitative comparison among different colorization methods – Zhang et al. [35], Zhang et al. [36], Bhang et al. [6] and our method. The bold emphasis indicates the best result in that metric

5.1 Ablation study

To further validate the importance of the textual description, a new model is trained without using the textual information. As shown in Fig. 7, without the textual conditioning, the proposed pipeline fails to colorize the grayscale images properly. Figure 8 demonstrates that the textual description can also be used for a recolorization task. In Fig. 8(a), the grayscale image is colorized with the actual textual description of the ground truth; in Fig. 8(b), the grayscale image is kept unchanged while the textual description of a different image is used. It is observed that the proposed framework follows the textual conditioning and can produce significantly different colorized outputs from the same grayscale image based on the textual encoding. We also ablate the RRDB part of the network by varying the number of RRDBs. Figure 9 shows the images generated with 1, 32 and 64 RRDBs; the results in the fourth column (generated with 64 RRDBs) are the best in this ablation study. To support this observation, quantitative results are reported in Table 2.

Fig. 7
figure 7

Validation of the importance of the textual encoding: first column contains the grayscale images, second column contains the ground truth images, the third and fourth columns show the results generated without and with the textual encoding, respectively. [Best viewed with 300% zoom in the digital version.]

Fig. 8
figure 8

Recolorization: The first column shows the grayscale images, the second column shows the images whose textual descriptions are used as conditioning. The third column shows the final colorized images. [Best viewed with 300% zoom in the digital version.]

Fig. 9
figure 9

The first column shows the real images, the second column shows the images generated with 1 RRDB, the third column shows the images generated with 32 RRDBs, and the fourth column shows the images generated with 64 RRDBs. [Best viewed with 300% zoom in the digital version.]

Table 2 Quantitative comparison among different numbers of RRDBs in the ablation study. The bold emphasis indicates the best result in that metric

6 Conclusions

In this paper, we proposed a novel image colorization algorithm that utilizes textual encoding as auxiliary conditioning in the color generation process. It is found that the proposed framework exhibits higher color fidelity compared to the state-of-the-art algorithms. We have also demonstrated that the proposed framework can be used for recolorization by modulating the textual conditioning. It is important to note that we have considered only textual conditioning of foreground objects in this work. Although the proposed algorithm outperforms the SOTA methods in this setting, the textual descriptions mostly depict the foreground objects and ignore the backgrounds, so our method exhibits less fidelity for the background colors. This problem can be addressed by adding additional color descriptions for the background. We also observed that, as the textual descriptions define the colors of the objects only coarsely, the proposed method fills the gaps with colors that are not present in the respective ground truths. Thus, in certain cases, our method produces less colorful backgrounds (Fig. 10), which establishes the necessity of more exhaustive textual descriptions for the grayscale images in the future. Efforts should also be made to design a more robust colorization process for backgrounds, for which the textual descriptions are not rich.

Fig. 10
figure 10

Some of the failure cases. [Best viewed with 300% zoom in the digital version.]