Introduction

With decreasing price and increasing resolution, thermal infrared cameras have gradually moved from their initial military applications to civilian applications such as autonomous driving [1,2,3] and security [4, 5]. However, because of the imaging principle and manufacturing process, thermal infrared images suffer from low contrast and a lack of color information. These limitations can reduce the performance of computer vision tasks [6,7,8,9] based on thermal infrared images, so thermal infrared image processing has become a hot research topic in academia and industry. Colorization is one of the important algorithms in thermal infrared image processing. On the one hand, people prefer color information to gray-scale information; on the other hand, compared with visible color images, thermal infrared images lose not only color information but also semantic information.

As a result, the colorization of thermal infrared images involves recovering both color information and semantic information. Colorizing single-channel gray thermal images into color visible images with sufficient detail is valuable for improving the scene representation of thermal infrared images. However, thermal infrared image colorization is a highly ill-posed problem, which makes the design of a colorization network architecture extremely difficult. The key challenge is how to accurately predict color information and recover semantic information from the contour details of objects in thermal infrared images.

After several decades of development, many colorization methods have been proposed, which can be broadly classified into three categories: (1) scribble-based methods [10,11,12], (2) reference image-based methods [13,14,15,16], and (3) fully automated methods [17,18,19,20,21,22,23,24,25,26]. The former two types require manual interaction to assign reasonable colors to various objects in the input image, which makes them labor-intensive and less robust to errors: [10,11,12] require reference colors to be scribbled onto the gray-scale image in advance, and [13,14,15,16] depend heavily on manually selected reference color images. With the rapid development of deep learning, image colorization methods have increasingly adopted generative adversarial networks (GANs) to colorize images automatically. TICC-GAN [18] used conditional generative adversarial networks to colorize infrared images, but the generated colorized images were blurred in detail and lacked texture information. SCGAN [22] boosts colorization quality by increasing the feature extraction capability, but its attention prediction network only works at a single resolution level, which leads to insufficient feature information. I2VGAN [23] enhances colorization quality by segmenting infrared and visible videos into images and exploiting the relationship between consecutive frames. MobileAR-GAN [24] improves the network and loss function and achieves state-of-the-art colorization quality. However, because gray-scale detail recovery is insufficiently considered, the colorized images look similar to visible images while the semantic information of the details remains unsatisfactory.

To address the above problems, this paper proposes an Efficient and Effective Generative Adversarial Network (E2GAN). The generator adopts a hybrid structure of a multi-level dense module, a feature fusion module, and a color-aware attention module, which enhances the feature extraction capability, captures more semantic details, and generates high-quality colorized images. The discriminator is a PatchGAN with a color-aware attention module, which strengthens its ability to distinguish real from fake colorized images. Meanwhile, this paper proposes a novel composite loss function to achieve realistic image quality.

To be more specific, the main contributions of the work are as follows:

  • The proposed multi-level dense module extracts multi-level context information with fewer parameters and computational cost, which enhances the feature extraction capability and improves the detail recovery capability. As shown in Fig. 1, the proposed E2GAN can generate higher quality color thermal infrared images while maintaining high robustness compared to previous methods.

  • This paper applies the proposed color-aware attention module in the decoder part, which effectively improves the colorization quality of the network by enhancing important features and suppressing unnecessary ones. It also enhances perceptual information, captures more semantic details, and focuses on more critical objects.

  • This paper proposes the feature fusion module, which processes the features down-sampled by the encoder and sends the fused features to the decoder. The feature fusion module can reduce the information loss caused by encoder down-sampling and improve the prediction of the image’s detailed color.

  • This paper proposes a composite loss function consisting of synthetic loss, adversarial loss, feature loss, total variation loss, and gradient similarity loss, which can improve the quality of colorized images and generate fine local details, meanwhile recovering semantic and texture information.

  • This paper evaluated the proposed network on the KAIST dataset and the FLIR dataset. Extensive experimental results show that, compared with existing methods, the proposed network generates colorized images with state-of-the-art results in SSIM (Structural Similarity), PSNR (Peak Signal to Noise Ratio), LPIPS (Learned Perceptual Image Patch Similarity), and NIQE (Natural Image Quality Evaluator).

In the remainder of this paper, the next section summarizes related work on IR image colorization and attention mechanisms. The subsequent section describes the architecture of the proposed E2GAN, followed by the experiments on the KAIST dataset and the FLIR dataset. The final section draws conclusions from the work.

Fig. 1

Performance of different thermal infrared image colorization networks on the FLIR dataset. As can be seen in the box plot, the proposed E2GAN produces no outliers and very high robustness during the test, while obtaining competitive results

Related work

Knowledge reserve

After decades of technological development, the field of artificial intelligence (AI) has gradually matured. Machine learning (ML) [27] is a branch of AI. It includes NAS [28], WANN [29], and hybrid approaches between meta-heuristics and machine learning [30,31,32], the latter being a novel research area that successfully combines machine learning and swarm intelligence approaches and has been shown to yield excellent results in different domains. Deep learning (DL) [33, 34] is a subset of ML; in other words, deep learning is a technique that implements machine learning. In recent years, deep learning-based infrared techniques have emerged [35, 36]; these methods are relatively new and have achieved good performance in many scenarios.

Image colorization

Traditional colorization methods have required manual interaction to assign reasonable colors to different objects in the input image, including scribbles [10,11,12] and reference images [13,14,15,16], among others. After the fast development of deep learning, Convolutional Neural Networks (CNNs) have been successfully applied to colorization tasks in gray-scale [37,38,39,40,41], near-infrared [17, 24, 42], and thermal infrared [20, 25, 43,44,45].

Larsson et al. [37] predicted a color histogram for each pixel to handle the ambiguity of possible colors. Deshpande et al. [38] used a variational auto-encoder for colorization. Limmer et al. [42] proposed a deep multi-scale CNN for near-infrared colorization. Berg et al. [43] proposed two fully automatic methods for colorizing thermal infrared images into visible images. Almasri et al. [44] proposed a deep learning method that maps thermal infrared features in the thermal image spectrum to their visible representations in the low-frequency domain; the predicted low-frequency representation is then combined with the high-frequency representation extracted from the thermal image using a pan-sharpening method. Cheng et al. [20] proposed a coarse-to-fine colorization strategy to generate more realistic colorized images. Liu et al. [45] proposed a two-stage colorization method that first converts infrared images into gray-scale images and then converts the gray-scale images into color images, enabling the network to learn intrinsic features. Wang et al. [25] improved the quality of color thermal infrared images with a hierarchical architecture, an attention mechanism, and a composite loss function. All in all, previous methods have achieved impressive results; however, because the loss functions of CNNs insufficiently constrain image colorization, semantic encoding entanglement and geometric distortion in colorization tasks are still not well addressed.

Generative adversarial networks

Generative adversarial networks (GANs) [46] have shown excellent performance in tasks such as image-to-image translation [39,40,41, 47, 48], image style transfer [49,50,51], and image super-resolution [52,53,54]. For image-to-image translation, Isola et al. [39] used paired images to embed conditional vectors for the generated colorized images. To reduce the difficulty of collecting paired data, Zhu et al. [40] proposed an unpaired image-to-image translation method with a cycle-consistency loss. Kniaz et al. [47] used thermal histograms and feature descriptors as thermal features. Vitoria et al. [41] proposed a colorization method that combines semantic information; the model learns colorization by combining the perception of color and category distributions with semantic understanding. Zhao et al. [22] improved colorization quality by enhancing the feature extraction capability. Li et al. [23] segmented infrared and visible videos into images and enhanced colorization quality through the relationship between consecutive frames. Saxena et al. [48] proposed a new multi-constraint adversarial model, in which several adversarial constraints are applied to the multi-scale output of the generator through a single discriminator; the gradients are passed to all scales simultaneously and help the generator learn to capture the difference in appearance between two domains. Jung et al. [55] proposed a novel semantic relation consistency (SRC) regularization along with decoupled contrastive learning, which exploits the heterogeneous semantics between the patches of a single image. However, images generated by SRC still exhibit distortion. Generative adversarial networks produce diverse images but suffer from incorrect mappings and lack of detail, as well as the risk of training instability and mode collapse. Chen et al. [56] alleviated texture distortion and detail blurring through an improved generator and a new contrastive loss function that ensures the consistency of image content and structure. Compared with CNN-based methods, GAN-based colorization methods reduce artifacts and improve colorization quality. However, the feature information of the generated image is incomplete, resulting in mismatches between the structure and content of objects in the image, which in turn leads to texture distortion and blurred details.

Attention mechanisms

In recent years, attention mechanisms [57, 58] have been extensively used in computer vision fields, such as semantic segmentation [59, 60], image fusion [61, 62], and object detection [63, 64], which help models to focus their attention on important feature parts. To address the lack of texture information and semantic information, some methods employed attention mechanisms to aid image-to-image transformation. Kim et al. [65] introduced channel attention mechanisms in generators and discriminators to better achieve overall shape transformation. To produce more realistic output images, Emami et al. [66] proposed a new spatial attention that is used to help the generator to pay more attention to the most discriminative regions between the source and target domains. Lai et al. [67] introduced a residual attention mechanism that further facilitates the propagation of features through the encoder. Tang et al. [68] created attention masks based on CycleGAN with an embedded attention mechanism, which combines the generated results with an attention mask to obtain high-quality colorized images. Yadav et al. [24] utilized attention gates in the decoder part of the generator network, which helps to focus on more important features to learn. Luo et al. [26] designed a novel attention module, which not only progressively utilizes extensive contextual information to assist in the semantic perception of complex regions but also reduces the spatial entanglement of features. The results show that the proposed E2GAN generates visually more realistic RGB images compared to the above methods. The comparison in Figs. 7 and 9 clearly demonstrates the advantage.

Proposed E2GAN

The proposed E2GAN aims to learn the optimal mapping from a thermal infrared image \(I_\textrm{ir}\) to the Ground Truth \(I_\textrm{vis}\) and to generate optimal colorized images \(I_\textrm{col}\). However, thermal infrared image colorization requires estimating texture information and semantic information simultaneously, which makes the optimal mapping very difficult to learn. Inspired by GANs, as shown in Fig. 2, the proposed E2GAN consists of a designed generator and discriminator. The generator is used to generate the colorized images, while the discriminator tries to identify the input images as real or fake.
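As a rough illustration of this adversarial setup, the following PyTorch sketch shows one alternating training step. It is illustrative only: the generator G, discriminator D, and their optimizers are hypothetical, the conditional discriminator is assumed to accept the image pair, and a standard binary cross-entropy adversarial objective stands in for the exact losses defined in the "Loss function" section.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, I_ir, I_vis):
    # 1) Update the discriminator: real pairs -> 1, generated pairs -> 0.
    with torch.no_grad():
        I_col = G(I_ir)
    d_real = D(I_vis, I_ir)   # thermal image passed as the conditional item
    d_fake = D(I_col, I_ir)
    loss_D = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2) Update the generator: try to fool the discriminator.
    #    (The content losses of the composite objective are omitted here.)
    I_col = G(I_ir)
    d_fake = D(I_col, I_ir)
    loss_G = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return I_col
```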

Fig. 2

The framework of the proposed E2GAN. \(L_\textrm{adv}\): adversarial loss. \(L_\textrm{syn}\): synthetic loss. \(L_\textrm{feat}\): feature loss. \(L_\textrm{tv}\): total variation loss. \(L_\textrm{grad}\): gradient similarity loss

Overall framework

Figure 3 shows the technical details of the proposed E2GAN, which uses an encoder–decoder structure. The generator consists of five parts: multi-level dense module, down-sampling module, up-sampling module, feature fusion module, and color-aware attention module.

Fig. 3

Detailed structure of the proposed E2GAN generator and discriminator (C denotes the concatenate fusion operation, IN denotes Instance Normalization and S2 denotes the convolutional stride size of 2)

The multi-level dense module is the basic module used to fully extract information from thermal infrared images. It extracts multi-level contextual information with fewer parameters and computational cost, which enhances feature extraction capability and improves detail recovery. The down-sampling module utilizes a convolutional block with a convolutional kernel of 3 and a stride size of 2, which can retain more information from thermal infrared images. The up-sampling module adopts a transpose convolution with a convolution kernel of 4 and a stride size of 2. A convolution block with a convolution kernel of 3 and a stride size of 1 is added to the transpose convolution to reduce the resulting checkerboard artifacts. The feature fusion module is applied between the encoder and the decoder. It reduces the information loss caused by encoder down-sampling and improves the prediction of the image’s fine colors. The color-aware attention module is used for up-sampling, and it effectively improves the colorization quality of the network by enhancing important features and suppressing unnecessary features. It also enhances the perceptual information to capture more semantic details and focus on more critical objects. The discriminator is a modified PatchGAN with a color-aware attention module in the middle convolutional layer. It can boost perceptual information and facilitate quick convergence. Moreover, the proposed E2GAN satisfies 1-Lipschitz continuity by attaching spectral normalization [69] to each discriminator layer.
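A minimal PyTorch sketch of the sampling blocks and the spectrally normalized discriminator convolution described above is given below. The placement of Instance Normalization and LeakyReLU, the negative slope of 0.2, and the discriminator kernel size are assumptions based on Fig. 3 and common practice.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def down_block(in_ch, out_ch):
    # Down-sampling: 3x3 convolution with stride 2 (IN + LeakyReLU assumed).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

def up_block(in_ch, out_ch):
    # Up-sampling: 4x4 transposed convolution with stride 2, followed by a
    # 3x3 stride-1 convolution to suppress checkerboard artifacts.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

def sn_conv(in_ch, out_ch, stride=2):
    # Discriminator convolution wrapped in spectral normalization (1-Lipschitz).
    return spectral_norm(nn.Conv2d(in_ch, out_ch, 4, stride=stride, padding=1))
```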

Fig. 4

Detailed structure of the proposed multi-level dense module. M denotes the number of input channels of the module and N the number of output channels

Network modules

Multi-level dense module

Currently, the feature extraction module commonly used in colorization is the residual block [18, 24], which effectively mitigates the gradient propagation problem and retains most of the low-frequency information. However, multi-level information is underutilized in this structure, and addressing this limitation has been shown to produce positive effects in many fields [70,71,72]. To sufficiently reuse the abundant multi-level information in the encoder–decoder, this paper proposes a multi-level dense module that extracts multi-level context information with fewer parameters and lower computational cost, which enhances the feature extraction capability and improves detail recovery.

Figure 4 shows the technical details of the proposed multi-level dense module. Specifically, in the main route, this paper employs a convolutional block with a kernel size of 3 and a stride of 1. Given M input channels and N output channels for the multi-level dense module, the number of channels of the convolutional layer in the ith convolutional block is \(N/2^i\), except for the last convolutional block, whose channel number is the same as that of the previous convolutional block. In each slave route of the multi-level dense block, this paper uses a convolutional block with a kernel size of 3 whose input and output channel numbers are the same. This paper uses \(f_i\) and \(m_i\) to denote the ith convolutional block of the main route and the slave route, respectively. Therefore, the output of the ith convolutional block is calculated as follows:

$$\begin{aligned} x_{i} = m_{i}\left( f_{i}\left( x_{i-1},k_{i}\right) , k_{i}\right) , \end{aligned}$$
(1)

where \(x_{i-1}\) and \(x_{i}\) are the input and output of the ith convolutional block, respectively. \(f_i\) and \(m_i\) denote special convolution blocks, each consisting of a convolutional layer, an instance normalization layer, and a LeakyReLU activation layer. \(k_i\) is the kernel size of the convolutional layer.

This paper focuses on scalable receptive fields and multi-level information. The lower levels need enough channels to encode fine-grained, small-receptive-field information, while the higher levels with large receptive fields focus more on the generalization of higher-level information. However, giving them the same number of channels as the lower levels may lead to information redundancy, so this paper assigns different proportions of output channels within the multi-level dense module. To enrich the feature information, this paper concatenates the feature maps \(x_1\) to \(x_n\) as the output of the multi-level dense module. In the settings, the final output of the multi-level dense module is:

$$\begin{aligned} x=C\left\{ x_{1}, x_{2}, \ldots , x_{n}\right\} , \end{aligned}$$
(2)

where x denotes the output of the multi-level dense module, C denotes the concatenate fusion operation, which preserves the enriched multi-level information across channels, and \(x_{1}, x_{2}, \ldots , x_{n}\) are the feature maps from all n convolutional blocks. Thus x captures the multi-scale information of all blocks. Based on experimental experience, n is chosen to be 4.
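The following PyTorch sketch illustrates one possible reading of Eqs. (1) and (2) with n = 4. The padding choices and the LeakyReLU slope are assumptions; the channel schedule follows the \(N/2^i\) rule described above, so the concatenated output again has N channels when N is divisible by 8.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, k=3, s=1):
    # Conv + InstanceNorm + LeakyReLU, as described for f_i and m_i.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=s, padding=k // 2),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class MultiLevelDenseModule(nn.Module):
    """Sketch of the multi-level dense module (Eqs. 1-2) with n = 4 blocks."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        n = out_channels
        # N/2, N/4, N/8, and the last block repeats the previous width,
        # so N/2 + N/4 + N/8 + N/8 = N channels after concatenation.
        widths = [n // 2, n // 4, n // 8, n // 8]
        self.main, self.slave = nn.ModuleList(), nn.ModuleList()
        prev = in_channels
        for w in widths:
            self.main.append(conv_block(prev, w))   # f_i
            self.slave.append(conv_block(w, w))     # m_i (same in/out channels)
            prev = w

    def forward(self, x):
        feats = []
        for f, m in zip(self.main, self.slave):
            x = m(f(x))                 # Eq. (1): x_i = m_i(f_i(x_{i-1}))
            feats.append(x)
        return torch.cat(feats, dim=1)  # Eq. (2): concatenate x_1 ... x_n
```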

Feature fusion module

Nowadays, the vast majority of encoder–decoder architectures perform up-sampling recovery directly after the end of down-sampling, with no additional operations in between. However, in this structure the down-sampling information is not fully utilized. For thermal infrared image colorization, predicting fine color information in local regions plays an important role in generating the color information of the three channels. Inspired by [73, 74], this paper designs feature fusion modules applied between the encoder and decoder; the structure of the feature fusion module is shown in Fig. 5. The first convolution block has a kernel size of 3 and a stride of 1: the number of channels is reduced to half, while the feature map scale remains the same. The second convolution block has a kernel size of 3 and a stride of 2: the number of channels remains the same, while the feature map scale is reduced to half.

Fig. 5

Detailed structure of the proposed feature fusion module. \(f_1\), \(f_2\) and \(f_3\) separately extract multi-scale information and increase the network recognition capability after the information fusion operation

Fig. 6

Detailed structure of the proposed color-aware attention module (D3 denotes the convolutional dilation rate of 3)

The third convolution block has a kernel size of 3 and a stride of 1, and the number of channels is reduced to half. The three convolution blocks separately extract multi-scale information. The down-scaled branch is then restored to the initial feature map size by bilinear up-sampling interpolation, and the global features and local features are concatenated by a concatenate fusion operation, increasing the network recognition capability. Finally, the final feature map is obtained by the fourth convolution block and a skip connection. The feature fusion module can reduce the information loss of encoder down-sampling and improve the prediction of the image's fine colors. The entire process of the feature fusion module can be described as follows:

$$\begin{aligned} x_\textrm{output}=f_{4}\left( C\left\{ f_{1}\left( x_\textrm{input}\right) , \textrm{Up}\left( f_{3}\left( f_{2}\left( x_\textrm{input}\right) \right) \right) \right\} \right) +x_\textrm{input}, \end{aligned}$$
(3)

where \(x_\textrm{input}\) and \(x_\textrm{output}\) are the input and output of the feature fusion module, \(f_{i}\) is the ith convolution block, Up is the bilinear up-sampling interpolation, and C is the concatenate fusion operation.
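A sketch of Eq. (3) in PyTorch is shown below. Instance Normalization and LeakyReLU inside each convolution block and a 3 × 3, stride-1 kernel for the fourth block are assumptions, since the text does not fully specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch, k=3, s=1):
    # Conv + InstanceNorm + LeakyReLU (assumed block composition).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=s, padding=k // 2),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class FeatureFusionModule(nn.Module):
    """Sketch of the feature fusion module (Eq. 3)."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.f1 = conv_block(channels, half, k=3, s=1)      # halve channels, keep scale
        self.f2 = conv_block(channels, channels, k=3, s=2)  # keep channels, halve scale
        self.f3 = conv_block(channels, half, k=3, s=1)      # halve channels
        self.f4 = conv_block(half * 2, channels, k=3, s=1)  # fuse back to the input width

    def forward(self, x):
        local_feat = self.f1(x)
        global_feat = self.f3(self.f2(x))
        # Bilinear up-sampling back to the initial feature map size.
        global_feat = F.interpolate(global_feat, size=x.shape[2:],
                                    mode="bilinear", align_corners=False)
        fused = torch.cat([local_feat, global_feat], dim=1)  # concatenate fusion C{...}
        return self.f4(fused) + x                            # residual skip connection
```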

Color-aware attention module

The convolution operation limits the receptive field to a certain range. This leads to differences between pixels of the same category, which can result in lower-quality colorization results. To solve this problem, this paper proposes the color-aware attention module. As shown in Fig. 6, it mainly consists of position-aware attention and semantic-aware attention. In particular, the position-aware attention can capture the contextual information between any two positions, while the semantic-aware attention can capture the contextual information along the channel dimension.

Position-aware attention: The technical details of the position-aware attention are shown in the top region of Fig. 6. The input feature map of the position-aware attention is \(X \in R^{H \times W \times C}\), where H, W, and C represent the height, width, and channel dimensions, respectively. The input X is passed through a 1\(\times \)1 convolution to obtain \(x^1\). The output of this stage is then processed by two parallel dilated convolution blocks and a standard convolution block to obtain the feature maps \(x^2\), \(x^3\), and \(x^4\). Two of these mappings are multiplied and passed through a softmax to obtain the attention matrix, as shown in Eqs. (4) and (5):

$$\begin{aligned} x^{1}=f_{1}(X), \quad x^{2}=f_{2}\left( x^{1}\right) , \quad x^{3}=f_{3}\left( x^{1}\right) , \quad x^{4}=f_{4}\left( x^{1}\right) , \end{aligned}$$
(4)
$$\begin{aligned} s_{i, j}=\frac{\exp \left( x_{i}^{1} \times x_{j}^{2}\right) }{\sum _{i=1}^{H \times W} \exp \left( x_{i}^{1} \times x_{j}^{2}\right) }, \end{aligned}$$
(5)

where f denotes a convolution and \(s_{i, j}\) measures the influence of the jth position on the ith position.

The feature map \(x^3\) provides feature mappings at different scales; the fusion layer fuses these mappings with the attention matrix \(s_{i, j}\) to help and guide the model to better recover the color and structural features of the image, prompting the model to generate a more realistic colorized image. The result is then passed through a convolutional block and summed with \(x^1\). Finally, the output feature map is obtained by a standard convolutional block. Position-aware attention can be expressed as:

$$\begin{aligned} \textrm{PA}=f_{6}\left( f_{5}\left( \sum _{i=1}^{H \times W} s_{i, j} \times x_{i}^{3}\right) +x^{1}\right) , \end{aligned}$$
(6)

Semantic-aware attention: In high-level semantic features, each channel can be considered as a special response to a certain class. Enhancing the feature channels of this response can effectively improve the colorized image quality. Each channel is weighted by calculating a weighting factor. This highlights the important channels and enhances the feature representation. Semantic-aware attention can be expressed as

$$\begin{aligned} \textrm{SA}=f_{9}\left( f_{8}\left( {\text {AvgPool}}\left( f_{7}(X)\right) \right) \cdot f_{7}(X)\right) . \end{aligned}$$
(7)

This paper applies the proposed color-aware attention module to the decoder part, which effectively improves the colorization quality of the network by enhancing the important features and suppressing the unnecessary ones. Meanwhile, it enhances perceptual information, captures more semantic details, and focuses on more critical objects.
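The sketch below illustrates one plausible implementation of the color-aware attention module. Several details are assumptions: the reduced channel width r, which mappings serve as the query/key pair (the text and Eq. (5) differ slightly), the SE-style form of \(f_8\), the omission of the unused \(x^4\) branch, and the additive fusion of the two branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorAwareAttention(nn.Module):
    """Rough sketch of the color-aware attention module (Eqs. 4-7)."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        r = reduced or max(channels // 8, 1)
        # Position-aware branch (Eqs. 4-6); dilation rate 3 follows Fig. 6.
        self.f1 = nn.Conv2d(channels, channels, 1)                    # x^1
        self.f2 = nn.Conv2d(channels, r, 3, padding=3, dilation=3)    # x^2
        self.f3 = nn.Conv2d(channels, r, 3, padding=3, dilation=3)    # x^3
        self.f5 = nn.Conv2d(r, channels, 3, padding=1)
        self.f6 = nn.Conv2d(channels, channels, 3, padding=1)
        # Semantic-aware branch (Eq. 7); SE-style channel weighting assumed.
        self.f7 = nn.Conv2d(channels, channels, 3, padding=1)
        self.f8 = nn.Sequential(nn.Conv2d(channels, r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(r, channels, 1), nn.Sigmoid())
        self.f9 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, X):
        b, c, h, w = X.shape
        x1 = self.f1(X)
        q = self.f2(x1).flatten(2).transpose(1, 2)       # (B, HW, r)
        k = self.f3(x1).flatten(2)                       # (B, r, HW)
        s = F.softmax(q @ k, dim=-1)                     # (B, HW, HW) attention, Eq. (5)
        agg = (k @ s.transpose(1, 2)).view(b, -1, h, w)  # aggregate x^3 features, Eq. (6)
        pa = self.f6(self.f5(agg) + x1)                  # position-aware output
        feat = self.f7(X)
        weights = self.f8(F.adaptive_avg_pool2d(feat, 1))
        sa = self.f9(weights * feat)                     # semantic-aware output, Eq. (7)
        return pa + sa                                   # branch fusion assumed to be addition
```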

Loss function

To penalize the distance between real color images and colorized images, this paper proposes a composite loss function consisting of synthetic loss, adversarial loss, feature loss, total variation loss, and gradient similarity loss. It can improve the quality of the colorized image and generate fine local details while recovering semantic and texture information. Each loss function is described in detail below.

Synthetic loss. The synthetic loss is in fact an L1 loss. By adding the synthetic loss, the brightness and contrast differences between colorized images and Ground Truth can be effectively minimized. However, if the GAN focuses too heavily on the synthetic loss, the brightness and contrast of the thermal infrared image will be lost. To prevent the generator from over-emphasizing pixel-to-pixel relationships, this paper assigns an appropriate weight to the synthetic loss. The synthetic loss can be expressed as

$$\begin{aligned} L_\textrm{syn}=\frac{1}{W H} \sum _{i=1}^{W} \sum _{j=1}^{H}\left\| \left( I_\textrm{vis}\right) _{i, j}-G\left( I_\textrm{ir}\right) _{i, j}\right\| _{1}, \end{aligned}$$
(8)

where W and H denote the width and height of the thermal infrared image, respectively. \(I_\textrm{ir}\) denotes the input thermal infrared image, \(I_\textrm{vis}\) denotes the Ground Truth, \(G(\cdot )\) denotes the generator, and \(\Vert \cdot \Vert _{1}\) denotes the L1 norm.

Adversarial loss. Although the application of synthetic loss facilitates high PSNR, colorized images will lose some of their detailed content. To encourage the network colorization results with more realistic details, this paper utilizes adversarial loss. Adversarial loss is used to learn the implicit relationship between thermal infrared images and Ground Truth. Adversarial loss is defined as

$$\begin{aligned} L_\textrm{adv}=E_{I_\textrm{ir}}\left[ \log \left( 1-D\left( G\left( I_\textrm{ir}\right) , I_\textrm{ir}\right) \right) \right] , \end{aligned}$$
(9)

where the input thermal infrared image \(I_\textrm{ir}\) is not only the input to the generator, but also to the discriminator as a conditional item.

Feature loss. \(L_\textrm{syn}\) and \(L_\textrm{adv}\) sometimes fail to ensure consistency between perceptual quality and objective metrics. To alleviate this problem, this paper uses a feature loss [75], which compares the differences between feature maps extracted by a dedicated pre-trained model. Specifically, the VGG-16 CNN model [76] is used to extract the feature maps of the colorization results and the Ground Truth, and the Manhattan distance between each pair of feature maps is computed. The feature loss is expressed as

$$\begin{aligned} L_\textrm{feat}=\sum _{n} \frac{1}{C_{n} H_{n} W_{n}} \sum _{i=1}^{H_{n}} \sum _{j=1}^{W_{n}}\left\| \varphi _{n}\left( I_\textrm{vis}\right) _{i, j}-\varphi _{n}\left( G\left( I_\textrm{ir}\right) \right) _{i, j}\right\| _{1}, \end{aligned}$$
(10)

where \(\varphi _{n}(\cdot )\) denotes the nth layer feature map in the VGG-16 network. \(C_n\), \(H_n\) and \(W_n\) denote the number of channels, height and width of the layer, respectively.

Total variation loss. In addition to using feature loss to recover high-level content, this paper also employs total variation loss [77] to enhance the spatial smoothness of color thermal images. The total variation loss is defined as:

$$\begin{aligned} L_\textrm{tv}=\frac{1}{W H}\sum \left( |\nabla _{x} G\left( I_\textrm{ir}\right) |+|\nabla _{y} G\left( I_\textrm{ir}\right) |\right) , \end{aligned}$$
(11)

where \(\nabla _{x}\) and \(\nabla _{y}\) denote the gradient of the image in the x and y directions. \(|\cdot |\) denotes the element-by-element absolute value of the given input.

Gradient similarity loss. This paper is the first to use a gradient similarity loss to enhance colorized image quality. The gradient of the visible image is used to constrain the network so that it captures the texture and luminance information of the images, and the gradient similarity loss also facilitates faster convergence of the network. The gradient similarity loss is defined as

$$\begin{aligned} L_\textrm{grad}=\left\| \nabla G\left( I_\textrm{ir}\right) -\nabla I_\textrm{vis}\right\| _{1}, \end{aligned}$$
(12)

where \(\nabla \) denotes the gradient.

Final loss function. The final loss function of the proposed E2GAN is presented as follows:

$$\begin{aligned} L_\textrm{total}=\lambda _\textrm{adv} L_\textrm{adv}+\lambda _\textrm{feat} L_\textrm{feat}+\lambda _\textrm{syn} L_\textrm{syn}+\lambda _\textrm{tv} L_\textrm{tv}+\lambda _\textrm{grad} L_\textrm{grad}, \end{aligned}$$
(13)

where \(\lambda _\textrm{adv}\), \(\lambda _\textrm{feat}\), \(\lambda _\textrm{syn}\), \(\lambda _\textrm{tv}\) and \(\lambda _\textrm{grad}\) denote the weights controlling the shares of the different losses in the complete loss function. The weights are set based on preliminary experiments on the training dataset.
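For concreteness, a minimal PyTorch sketch of the composite loss with the weights used in the experiments (\(\lambda _\textrm{adv}\) = 1, \(\lambda _\textrm{feat}\) = 1, \(\lambda _\textrm{syn}\) = 15, \(\lambda _\textrm{tv}\) = 1, \(\lambda _\textrm{grad}\) = 10) is given below. The VGG-16 truncation point, the non-saturating form of the adversarial term, and forward differences for the gradients are assumptions; the paper sums the feature loss over several VGG layers, whereas a single truncation is used here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class CompositeLoss(nn.Module):
    """Sketch of the composite loss (Eqs. 8-13). Inputs are assumed to be
    suitably normalized image tensors; requires torchvision >= 0.13."""
    def __init__(self):
        super().__init__()
        self.vgg = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    @staticmethod
    def _grad(img):
        # Forward differences along x and y.
        dx = img[..., :, 1:] - img[..., :, :-1]
        dy = img[..., 1:, :] - img[..., :-1, :]
        return dx, dy

    def forward(self, fake, real, d_fake_logits):
        l_syn = F.l1_loss(fake, real)                               # Eq. (8)
        l_adv = F.binary_cross_entropy_with_logits(                 # Eq. (9), non-saturating form
            d_fake_logits, torch.ones_like(d_fake_logits))
        l_feat = F.l1_loss(self.vgg(fake), self.vgg(real))          # Eq. (10), single layer
        dx_f, dy_f = self._grad(fake)
        l_tv = dx_f.abs().mean() + dy_f.abs().mean()                # Eq. (11)
        dx_r, dy_r = self._grad(real)
        l_grad = F.l1_loss(dx_f, dx_r) + F.l1_loss(dy_f, dy_r)      # Eq. (12)
        # Eq. (13) with the reported weights.
        return 1 * l_adv + 1 * l_feat + 15 * l_syn + 1 * l_tv + 10 * l_grad
```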

Experiments

To verify the capability and quality of the proposed E2GAN, this paper performed a series of experiments on the processed KAIST (denoted as KAIST) dataset and the processed FLIR (denoted as FLIR) dataset, comparing against representative visible and infrared colorization networks from recent years. This section first introduces the datasets used, then describes the experimental settings of the proposed E2GAN. Next, the performance is compared with other advanced networks on the KAIST dataset and the FLIR dataset. This is followed by ablation studies to verify the reasonableness and effectiveness of the proposed modules and loss function. Subsequently, generalization experiments are conducted to verify that the proposed E2GAN has stronger robustness and generalization ability. Finally, the limitations of the proposed E2GAN are analyzed.

Dataset information

Although many visible-infrared datasets are available, most of them are unaligned. When used for colorization tasks, such data lead colorization networks to generate images with poor colorization quality that are prone to color confusion. Therefore, this paper selected two processed datasets for extensive experiments.

KAIST dataset: The KAIST multispectral pedestrian dataset [78] consists of a total of 95,328 images, each containing both an RGB color and an infrared version of the scene. The dataset captures a variety of regular traffic scenes, including schoolyards, streets, and the countryside, during daytime and nighttime. The image size is 640 \(\times \) 512. Although the KAIST dataset itself states that it is aligned, careful observation showed that some of the infrared images still deviate partially from the visible images. In addition, because the KAIST dataset consists of consecutive video frames, neighboring images differ very little from each other. After this cleaning operation, this paper selected 4755 daytime images and 2846 nighttime images for the training dataset, and 1455 daytime images and 797 nighttime images for the test dataset. Only the daytime training dataset is used for training.

FLIR dataset: The FLIR dataset has 8862 unaligned visible and infrared image pairs covering abundant scenes, such as roads, vehicles, and pedestrians. These images are highly representative scenes from FLIR videos. For this dataset, this paper selects feature points with a control point selection algorithm to form a feature point matrix and uses this matrix to manually align the FLIR dataset. After filtering and alignment, 4346 daytime images and 363 nighttime images remain in the training dataset, and 1455 daytime images and 104 nighttime images in the test dataset. Only the daytime training dataset is used for training. Table 1 provides specific details about these datasets.

Table 1 Specifications of the KAIST dataset and the FLIR dataset

Implementation details

Each image is resized from its original size to a fixed size of 256 \(\times \) 256 before being fed to the training network. The proposed network is trained for 200 epochs with a batch size of 1. During training, this paper uses a “Lambda” learning rate decay strategy [40], with the learning rate set to 0.0002. The number of channels in the first convolutional layer of the generator and discriminator is set to 64. This paper uses the Adam optimizer [79] with \(\beta _1\) = 0.5 and \(\beta _2\) = 0.999. The discriminator and generator are trained alternately until the proposed E2GAN converges. In all experiments, based on experimental experience, the weights of the different losses in the final loss function are set to \(\lambda _\textrm{adv}\) = 1, \(\lambda _\textrm{feat}\) = 1, \(\lambda _\textrm{syn}\) = 15, \(\lambda _\textrm{tv}\) = 1 and \(\lambda _\textrm{grad}\) = 10, respectively. The network is implemented with the PyTorch framework [80] and trained on an NVIDIA Titan Xp GPU.

$$\begin{aligned} \textrm{lr}_\textrm{lambda}=1-\frac{\max \left( 0, \textrm{epoch}+1-n_\textrm{epoch}\right) }{n_\textrm{epoch\_decay}+1}, \end{aligned}$$
(14)
$$\begin{aligned} \textrm{lr}_\textrm{new}=\textrm{lr}_\textrm{lambda}\left( \textrm{epoch}_\textrm{last}\right) \times \textrm{lr}_\textrm{base}, \end{aligned}$$
(15)

where \(\textrm{lr}_\textrm{lambda}\) denotes the decaying learning rate factor and epoch denotes the current training epoch. \(n_\textrm{epoch}\) denotes the number of epochs with a constant learning rate, and \(n_\textrm{epoch\_decay}\) denotes the number of epochs with a decaying learning rate. \(\textrm{epoch}_\textrm{last}\) denotes the last training epoch, \(\textrm{lr}_\textrm{base}\) denotes the base learning rate, and \(\textrm{lr}_\textrm{new}\) denotes the updated learning rate.
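Equations (14) and (15) correspond directly to PyTorch's LambdaLR scheduler. The sketch below assumes a 100 constant + 100 decaying epoch split for the 200 training epochs, which the text does not state explicitly, and a placeholder model in place of the generator.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

n_epoch, n_epoch_decay, lr_base = 100, 100, 0.0002  # assumed split of the 200 epochs

def lr_lambda(epoch):
    # Eq. (14): constant for the first n_epoch epochs, then linear decay to 0.
    return 1.0 - max(0, epoch + 1 - n_epoch) / (n_epoch_decay + 1)

model = torch.nn.Conv2d(3, 3, 3)  # placeholder for the generator
optimizer = torch.optim.Adam(model.parameters(), lr=lr_base, betas=(0.5, 0.999))
scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)  # Eq. (15): lr_new = lr_lambda * lr_base

for epoch in range(n_epoch + n_epoch_decay):
    # ... train one epoch ...
    scheduler.step()
```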

This paper uses four representative image quality evaluation metrics for quantitative evaluation: SSIM [81], PSNR, LPIPS [82], and NIQE [83]. PSNR measures the intensity difference between colorization results and Ground Truth; a higher PSNR indicates that the intensities of the two images are more similar. SSIM measures the structural similarity between colorization results and Ground Truth; a higher SSIM indicates a smaller difference. LPIPS is more consistent with human perception than PSNR and SSIM; a lower LPIPS indicates that the colorization results are more realistic and diverse. NIQE is an effective no-reference indicator of colorization quality; a lower NIQE indicates better colorization quality. All quantitative experimental results are averaged over all images in each test dataset. This paper also uses three evaluation metrics of computational complexity: generator parameters, Graphics Processing Unit (GPU) memory, and Multiply-Accumulate Operations (MACs); the lower these three values, the lower the computational complexity.
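For reference, the full-reference metrics can be computed as in the sketch below, assuming the scikit-image (≥ 0.19 for the channel_axis argument) and lpips packages; NIQE is omitted because it requires a separate no-reference implementation.

```python
import torch
import lpips                                   # pip install lpips (assumed available)
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

lpips_fn = lpips.LPIPS(net="alex")             # learned perceptual metric

def evaluate_pair(colorized, ground_truth):
    """colorized, ground_truth: HxWx3 uint8 arrays of the same size."""
    ssim = structural_similarity(colorized, ground_truth,
                                 channel_axis=2, data_range=255)
    psnr = peak_signal_noise_ratio(ground_truth, colorized, data_range=255)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_fn(to_tensor(colorized), to_tensor(ground_truth)).item()
    return ssim, psnr, lp
```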

Comparison experiments

This paper compares eight representative colorization methods with the proposed E2GAN on the KAIST dataset and the FLIR dataset. The comparison mainly includes recent approaches without attention mechanisms, such as Pix2Pix [39], CycleGAN [40], ThermalGAN [47], TICC-GAN [18] and I2VGAN [23]; approaches based on attention mechanisms, such as U-GAT-IT [65] and CWT-GAN [67]; and the latest contrastive-learning-based method SRC [55]. Among them, TICC-GAN is the most similar to the proposed approach, so this paper sets it as the baseline. Compared with these eight representative colorization methods, the proposed E2GAN accurately generates locally realistic textures and fine details. For the comparison methods, this paper trained and tested them on the KAIST dataset and the FLIR dataset using the source code published by the authors and the hyperparameters they provided.

Fig. 7

Comparison by observational analysis on the KAIST dataset between the proposed E2GAN and other colorization methods. From these results, it can be seen that the proposed E2GAN can generate high-quality colorized images

Fig. 8

Comparison of more details by observational analysis on the KAIST dataset between the proposed E2GAN and other colorization methods. From these results, it can be seen that the proposed E2GAN has a clear advantage in color realism and detail accuracy, especially for objects such as “vehicles” and “crosswalks”. It indicates that the method performs well in terms of both global quality and local perception

Comparison on the KAIST dataset

First, this paper performed comprehensive comparisons between the comparison methods and the proposed E2GAN on the KAIST dataset. As can be seen in Fig. 7, CycleGAN, ThermalGAN, and U-GAT-IT can only reconstruct most of the global structure, and their colorization results still look unrealistic while lacking color information and local details. For example, visually critical objects such as “vehicles”, “crosswalks”, and “traffic signs” are not accurately restored. Among all comparison methods, Pix2Pix has the most model parameters. However, it only roughly recovers the overall chromaticity and fails to accurately estimate the luminance, and its colorized images are severely blurred and lack fine textures; although colorization is accomplished, there are still deficiencies in image structure and detail recovery. As current state-of-the-art methods, TICC-GAN, I2VGAN, CWT-GAN, and SRC return quite reasonable and natural color images compared with the first four methods. However, they fail in terms of image details, especially tiny structures: the contours of “crosswalks”, “buildings” and “roadblocks” are confusing, there are artifacts in the colored sky, and a semantic confusion effect appears. The proposed E2GAN has a clear advantage in color realism and detail accuracy, especially for objects such as “vehicles” and “crosswalks”, which indicates that the method performs well in terms of both global quality and local perception. More detailed colorization results are shown in Fig. 8. The quantitative performance of each comparison method on the KAIST dataset is shown in Table 2, and the optimal values of each metric are highlighted in red. SRC achieves the best NIQE, but its qualitative results show serious distortion, which indicates that no-reference metrics are susceptible to noise, artifacts, and other conditions. The proposed E2GAN achieves competitive metrics with low computational complexity. Its good performance in PSNR and SSIM shows that the proposed E2GAN retains more structural information and texture features of the infrared images, and its LPIPS score shows that, from an information theory point of view, its results are closer to the real images.

Comparison on the FLIR dataset

To fully demonstrate the advancedness of the proposed E2GAN, it is also validated on the FLIR dataset with the same experimental settings and evaluation metrics as in the KAIST comparison experiments. Figure 9 shows the colorization results of each comparison method for multiple scenes in the FLIR dataset. As can be seen from Fig. 10, the colorized images of the other comparison algorithms do not clearly represent the sky details in the infrared images and produce serious artifacts. In contrast, the colorized images of the proposed method not only retain the complete sky texture but also contain no artifacts. Meanwhile, the other comparison methods do not colorize distant targets well, whereas the colorized images of the proposed method show distant targets more completely and clearly, fully preserve the details, and as a whole are more consistent with human vision. This indicates that the proposed method can still generate color images with higher perceptual quality and can learn the data distribution of different datasets with high robustness.

The quantitative performance of each comparison method on the FLIR dataset is shown in Table 3, and the optimal values of each metric are highlighted in red. The proposed method still has clear advantages in PSNR, SSIM, and LPIPS compared with the other comparison methods, and its performance on NIQE is also favorable. The analysis results of each method are the same on the two datasets, and the quantitative results are consistent with the qualitative results, which proves the effectiveness and advancement of the proposed method. It also indicates that the proposed E2GAN is able to learn the data distribution of different datasets with extremely high robustness. The performance of each comparison method over a large number of random samples is shown in Fig. 11. It can be seen that, compared with the other methods, the proposed E2GAN suffers less performance degradation in complex scenarios and has stronger robustness.

Table 2 Comparison analysis of test results of various state-of-the-art methods, including Pix2Pix, CycleGAN, ThermalGAN, U-GAT-IT, TICC-GAN, I2VGAN, CWT-GAN, and SRC, as well as the proposed E2GAN on the KAIST dataset
Fig. 9

Comparison by observational analysis on the FLIR dataset between the proposed E2GAN and other colorization methods. From these results, it can be seen that the proposed E2GAN can generate high-quality colorized images

Fig. 10

Comparison of more details by observational analysis using the FLIR dataset between proposed E2GAN and other colorization methods. From these results, it can be seen that the proposed E2GAN has a clear advantage in color realism and detail accuracy, especially for objects such as “vehicles” and “trees”. It indicates that the method performs well in terms of both global quality and local perception

Table 3 Comparison analysis of test results of various state-of-the-art methods, including Pix2Pix, CycleGAN, ThermalGAN, U-GAT-IT, TICC-GAN, I2VGAN, CWT-GAN, and SRC, as well as the proposed E2GAN on the FLIR dataset

Ablation study

In addition to performing comparative experiments, this paper also performed ablation studies, which were used to demonstrate the reasonableness and validity of the proposed modules and composite loss functions in the proposed E2GAN. Specifically, this paper performed quantitative and qualitative analyses on the KAIST dataset and the FLIR dataset.

Importance of different modules

To further validate the reasonableness and effectiveness of the modules proposed in E2GAN, this paper performed an experimental analysis of different module structures on the KAIST dataset and the FLIR dataset. Specifically, four module structures were designed. Table 4 shows the quantitative comparison of the different module structures on the KAIST dataset and the FLIR dataset. Structure1 is E2GAN without the multi-level dense module (replaced with one standard dense block [84]), Structure2 is E2GAN without the feature fusion module (replaced with one standard residual block [85]), Structure3 is E2GAN without the color-aware attention module (replaced with CBAM [86]), and Structure4 is the complete E2GAN structure. A comparison of the perceptual quality of the different module structures on the KAIST dataset and the FLIR dataset is shown in Fig. 12. It can be seen that removing the multi-level dense module or the feature fusion module reduces the quality of the colorized images generated by the proposed E2GAN and blurs the edges, so many important objects cannot be recovered. Without the feature fusion module, the network also fails to predict the image's detailed colors, which leads to color confusion. The absence of the color-aware attention module leads to semantic confusion and fails to emphasize the important parts of the image. In conclusion, each module of the proposed E2GAN is indispensable.

Importance of different loss functions

This paper further explored the contribution of each loss to the composite loss function by performing an experimental analysis of different loss combinations on the KAIST dataset and the FLIR dataset. Specifically, five combinations of loss functions were designed. Table 5 shows the quantitative comparison of the different loss combinations on the KAIST dataset and the FLIR dataset. In particular, Loss1 is \(L_\textrm{adv}\), Loss2 is \(L_\textrm{adv}\)+\(L_\textrm{syn}\), Loss3 is \(L_\textrm{adv}\)+\(L_\textrm{syn}\)+\(L_\textrm{feat}\), Loss4 is \(L_\textrm{adv}\)+\(L_\textrm{syn}\)+\(L_\textrm{feat}\)+\(L_\textrm{tv}\), and Loss5 is \(L_\textrm{adv}\)+\(L_\textrm{syn}\)+\(L_\textrm{feat}\)+\(L_\textrm{tv}\)+\(L_\textrm{grad}\). The perceptual quality of the different loss combinations on the KAIST dataset and the FLIR dataset is compared in Fig. 13. It can be seen that \(L_\textrm{adv}\) can learn the implicit relationship between thermal infrared images and Ground Truth to generate accurate details, \(L_\textrm{syn}\) can improve the quality of colorized images, \(L_\textrm{feat}\) can increase the detail information, \(L_\textrm{tv}\) can reduce the impact of noise in the infrared dataset and generate smoother colorized images, and \(L_\textrm{grad}\) can effectively learn the gradient characteristics of real visible images from the training data, thus helping the generator to reconstruct more realistic and clear color images. Therefore, each of the loss functions is essential for generating high-quality colorized images.

Night to day colorization generalization

This paper also performed generalization experiments on thermal infrared images captured at nighttime, comparing with Pix2Pix, CycleGAN, ThermalGAN, U-GAT-IT, TICC-GAN, I2VGAN, CWT-GAN, and SRC. Figure 14 shows the colorization results of the proposed E2GAN and the comparison methods; all methods were trained on the daytime dataset only. Due to the lower contrast of nighttime infrared images, colorization is more difficult. Consequently, the comparison methods' nighttime results are more blurred than their daytime results and suffer from color confusion: the regions containing people, cars, and trees are seriously distorted, making significant targets unidentifiable. In contrast, the proposed E2GAN still recovers most of the detailed textures, which are clearly visible in the colorization results, and significantly outperforms the other comparison methods.

Fig. 11

Qualitative comparison over a large number of random samples between the proposed E2GAN and other colorization methods. It can be seen that, compared with the other comparison methods, the proposed E2GAN suffers less performance degradation in complex scenarios and has stronger robustness

Table 4 Quantitative analysis of different module structures for the KAIST dataset and the FLIR dataset
Fig. 12

Comparison by observational analysis on the KAIST dataset and the FLIR dataset between different module structures. From these results, it can be seen that the proposed E2GAN with the complete structure can generate high-quality colorized images

Limitations

Table 5 Quantitative analysis of different loss combinations for the KAIST dataset and the FLIR dataset
Fig. 13

Comparison by observational analysis on the KAIST dataset and the FLIR dataset between different loss combinations. From these results, it can be seen that the proposed E2GAN with the complete loss function can generate high-quality colorized images

The proposed E2GAN achieves excellent results on the KAIST dataset and the FLIR dataset, but some problems remain. Figure 15 shows some failure cases. It can be seen that the proposed E2GAN can recover the global structure, but local details, such as “shadows”, “buildings”, and “cars”, are easily colored incorrectly or ignored entirely. These failure cases contain repeating textures and little structural information, which makes them difficult for the network to colorize. This may be due to problems in the datasets themselves: the existing datasets are not diverse enough for the model to learn the color and detail information of objects that do not appear in the training set, and they also suffer from low resolution, low contrast, and blurred visuals. Considering the difficulty of dataset acquisition, it may be necessary to pre-train the proposed network so that it learns the semantic information of different scenes from other existing datasets. This will be investigated in future work.

Conclusion

This paper proposes an Efficient and Effective Generative Adversarial Network (E2GAN), which exhibits excellent visual quality in colorizing thermal infrared images. It demonstrates the advantages of the multi-level dense module, the color-aware attention module, and the feature fusion module over existing structures in feature extraction and detail recovery. Meanwhile, the proposed composite loss function is explored to facilitate the recovery of texture information and semantic information. Extensive thermal infrared image colorization experiments are performed on the KAIST dataset and the FLIR dataset, and the results are compared with Pix2Pix, CycleGAN, ThermalGAN, U-GAT-IT, TICC-GAN, I2VGAN, CWT-GAN, and SRC. Quantitative results show that the method improves on all datasets in terms of different evaluation metrics such as SSIM, PSNR, LPIPS, and NIQE. Qualitative results also show that the colorized images of the method improve in detail recovery capability, semantic information, and artifact reduction. The agreement of the quantitative and qualitative results proves the superiority of the proposed E2GAN.

Fig. 14

Qualitative analysis of night to day colorization generalization for the KAIST dataset and the FLIR dataset. Due to the lower contrast of nighttime infrared images, colorization is more difficult. In contrast, the proposed E2GAN still recovers most of the detailed textures, which are clearly visible in the colorization results, and significantly outperforms other comparison methods

Fig. 15

Failure colorized cases by proposed E2GAN. It can be seen that the proposed E2GAN can recover the global structure, but the local details are easily colored incorrectly or ignored directly. Examples include “shadows”, “buildings”, and “cars”. These failures contain repeating textures and less structural information, which makes it difficult for the network to colorize them

There are three aspects to future work. First, to build a complete dataset, high-quality and diverse aligned thermal infrared-visible image pairs need to be acquired, which would allow the network to learn more diverse color information. Second, more generator architectures need to be explored to further improve image quality, such as using a transformer encoder for the generator. Finally, a pre-training approach will be attempted to improve colorization quality by allowing the network to learn the semantic information of different scenes from other existing datasets.