Introduction

Images collected by a single-mode sensor fail to describe imaging scenes effectively and comprehensively due to theoretical and technical limitations [1]. Infrared sensors capture the thermal radiation emitted by objects and can generate infrared images with salient targets, even in adverse conditions such as low brightness, occlusion, or harsh weather. However, infrared images are susceptible to noise and lack textural detail. In contrast, visible images offer abundant texture and structural information but are sensitive to imaging conditions. Infrared and visible image fusion therefore aims to reconstruct a single image with comprehensive information from multimodal data, providing both salient targets and valuable texture information. Motivated by the diversity of imaging scenes, several excellent fusion algorithms have been proposed and applied broadly in advanced vision tasks, including object detection [2], semantic segmentation [3], pedestrian re-identification [4], and visual tracking [5].

In recent years, the fusion of infrared and visible images has attracted the attention of many scholars and has developed rapidly as a result. Existing technologies can be categorized into two groups: traditional methods [6, 7] and deep learning-based methods [8,9,10,11,12]. Traditional image fusion algorithms are typically implemented using multi-scale transform (MST)-based methods [13], sparse representation (SR)-based methods [14], low-rank representation (LRR)-based methods [15], saliency-based methods [16], subspace-based methods [17], and other methods [18]. Although traditional methods have shown superior fusion performance in some aspects, they face specific challenges. (1) They generally require manually selected feature representations and carefully designed fusion rules to generate high-quality fused images; this manual intervention can degrade fusion performance. (2) For SR and LRR techniques, constructing a suitable overcomplete dictionary is difficult, and the runtime of the corresponding fusion algorithms is not conducive to real-time image fusion. (3) Complex feature extraction and fusion strategies often introduce halos and blurred edges due to the overlapping of asymmetric feature information.

To address these issues, deep learning-based methods have been introduced for infrared and visible image fusion. These frameworks can typically be divided into three categories: auto-encoder (AE)-based [8, 9], convolutional neural network (CNN)-based [19], and generative adversarial network (GAN)-based architectures [20]. Deep learning offers stronger representation capabilities, but with certain limitations. First, introducing a down-sampling operation to reduce the image resolution and network complexity inevitably causes the loss of important information in the fused image; moreover, modern convolutional networks are not shift-invariant [21], as small shifts or translations in the input cause substantial changes in the output. Second, some methods use only simple feature fusion rules, such as addition and concatenation, which can cause artifacts or blurred edges in fused images. Third, the existing infrared and visible images used for training and testing are mainly derived from the TNO [22] and RoadScene [23] datasets, which restricts a comprehensive evaluation of a model’s generalization performance. For example, FusionGAN [24] crops the source images into image patches with a stride of 14, so it lacks the long-range global information needed to handle complex scenes.

Fig. 1

A schematic illustration of the proposed method. The first row displays the source images, the second row presents fused images with IFCNN [19] and FusionGAN [24], and the last row provides results produced by NestFuse [8] and our proposed method

A novel deep learning architecture for the fusion of infrared and visible images is proposed in this paper to address the issues discussed above.

Inspired by traditional multi-scale frameworks, we design an encoder network consisting of hybrid dilated convolutional blocks that obtain multi-scale deep salient features by applying different dilation rates. It is worth noting that no down-sampling is used during feature extraction, so the resulting feature maps are the same size as the source images. In addition, to make full use of the multi-scale layer characteristics, we introduce different attention modules at each scale to ensure the network attends to specific features and compensates for information loss. To demonstrate the effectiveness of our approach, a representative fusion sample is shown in Fig. 1 and compared with three other excellent deep learning-based algorithms. Our method not only produces higher image contrast (e.g., the person in the infrared image is brighter using our technique), but also improves visual effects (e.g., smoke from the visible image is preserved, and trees in the background exhibit clearer edges). The primary contributions of our work can be summarized as follows:

  • We propose a lightweight network architecture for infrared and visible image fusion, which can capture fine-grained detailed features with a high semantic level and does not require a down-sampling operation.

  • Both spatial attention and channel attention mechanisms are introduced in the encoder-decoder framework at different scales. The proposed method not only forces the network to focus on foreground targets of the infrared image and the background information in the visible image, it also enhances local and global contextual information and attenuates noise.

  • A total loss function is designed to jointly focus on pixel distribution information and texture details in both infrared and visible images, to preserve essential complementary information in each modality.

  • Extensive experiments demonstrate our method’s superiority over state-of-the-art methods. The experimental results on public datasets reveal that our method achieves significant enhancements in entropy (EN) by 4.80%, standard deviation (SD) by 3.97%, correlation coefficient (CC) by 1.86%, correlations of differences (SCD) by 9.98%, and multi-scale structural similarity (MS_SSIM) by 5.64%.

Related work

Traditional image fusion methods

Traditional image fusion algorithms can be divided into three steps: feature extraction, fusion, and reconstruction, where feature extraction and reconstruction are typically inverse operations. Several multi-scale techniques, such as the Gaussian pyramid [7], shearlet [25], and nonsubsampled contourlet [26] transforms, have been proposed over the past few decades, some of which are also utilized in deep learning-based fusion frameworks. Feature extraction methods based on sparse representation include joint sparse representation [14] and latent low-rank representation [27]. Inspired by human visual perception, these methods require an over-complete dictionary, so their computational complexity has always been an issue. In addition, by reducing the original features to a set of mutually independent low-dimensional features, representative subspace-based extraction techniques have been developed, including independent component analysis [28], principal component analysis [29], and non-negative matrix factorization [30].

Deep learning-based fusion methods

Convolutional neural networks can learn prior knowledge from large quantities of images and have been widely used for image fusion and related tasks. Image fusion methods based on deep learning include AE-based algorithms, convolutional neural networks, and GAN-based image fusion models. Liu et al. [31] first proposed a CNN-based fusion framework; since the purpose of the network is to generate a decision map, this approach is only suitable for multi-focus images. Li et al. [8] proposed a fusion method based on a nest-connection architecture comprising three parts, an encoder network, a fusion strategy, and a decoder network, which extracts deep features at different scales. However, the feature fusion is manually supervised by handcrafted rules, which affects fusion performance to a certain extent. Residual end-to-end auto-encoder fusion networks were later proposed to overcome this issue [9].

Fig. 2

The overall framework for the infrared and visible image fusion algorithm based on HDC blocks and different attention mechanisms

In addition, infrared and visible image fusion algorithms based on end-to-end convolutional neural networks address this problem by forcing the network to focus on the intensity distribution and texture structures in the images. For example, Ma et al. [32] used a salient target mask to force the network to focus on texture details in visible images and salient information in infrared images. However, it is difficult to provide ground truth data to the network for image fusion tasks. Considering extreme illumination conditions in the source images, Tang et al. [33] introduced an illumination-aware sub-network that maintains the intensity distribution of salient targets and preserves texture information in the background. Furthermore, to facilitate advanced vision tasks, the same group introduced semantic segmentation into the image fusion module to improve the semantic information in the fused images. They also proposed a joint low-level and high-level adaptive training strategy to simultaneously achieve superior performance and close the gap between image fusion and high-level vision tasks [34].

In 2019, Ma et al. [24] first introduced generative adversarial networks into the field of infrared and visible image fusion. Specifically, content loss and adversarial loss are employed to preserve thermal radiation details in the fused images generated from the concatenated source images. However, a single discriminator cannot focus on both infrared and visible regions. Li et al. [35] therefore not only introduced a dual-discriminator conditional generative adversarial network, but also used a multi-scale attention mechanism to constrain the discriminators to focus more on regions of interest, balancing the data distribution and improving fused image fidelity.

Dilated convolutional and attention mechanism applications

Dilated convolution, inspired by wavelet decomposition, enhances the receptive field of a convolutional kernel by inserting zeros between its pixels. This expansion aids the network in capturing detailed information within the scene. Dilated convolution has been widely applied in image classification, object detection, and semantic segmentation. Yu et al. [36] addressed the issue of gridding artifacts introduced by dilation by designing dilated residual networks, which can be effectively employed in downstream tasks such as object localization and semantic segmentation.

The attention mechanism, motivated by the human visual system, has been successfully incorporated into computer vision systems such as image recognition, object detection, semantic segmentation, and action recognition [37]. Channel attention focuses on important objects by assigning new weights to the channels of the feature map. Hu et al. [38] first proposed the concept of channel attention, known as SENet. The core squeeze-and-excitation (SE) block of SENet effectively captures the channel-wise relationship, thereby enhancing the representation capability of the network model. Qin et al. [39] demonstrated that global average pooling can be viewed as a special case of the discrete cosine transform and designed a multi-spectral channel attention mechanism to further enhance the model’s representation capabilities. Spatial attention, on the other hand, can be seen as the adaptive selection of important spatial regions. Hu et al. [40] designed GENet to capture long-distance spatial contextual information in feature maps, enabling the highlighting of important features while suppressing noise. Building upon the success of self-attention in natural language processing, Wang et al. [41] proposed Non-Local networks that expand the receptive fields of the network, enabling the capture of global information. In the context of image fusion, Ma et al. [42] introduced Swin Transformer and proposed intra-domain and inter-domain fusion units based on self-attention and cross-attention, respectively. This approach achieves the integration of complementary information and captures global long-range dependencies, facilitating the effective fusion of multi-domain images.

Methodology

This section describes the proposed lightweight infrared and visible image fusion network architecture in detail. First, we present the overall network pipeline. Hybrid dilated convolutional (HDC) blocks and multi-scale spatial/channel attention are then introduced. Finally, the proposed loss function is discussed.

Problem formulation

Given a pair of registered infrared \(I_{ir}\in R^{H\times W\times 1}\) and visible images \(I_{vis}\in R^{H\times W\times 3}\), the fused image \(I_{f}\in R^{H\times W\times 3} \) can be generated by feature extraction, feature fusion, and reconstruction under the guidance of a total loss function. Previous deep learning methods emphasized the influence of feature extraction on the quality of the fusion results, which led to increasingly complex feature extractors, while the requirement of real-time image fusion was ignored. To improve the feature representation capability while ensuring real-time infrared and visible image fusion, lightweight HDC blocks and multi-scale attention mechanisms are designed as the key components to produce high-quality fused images and prevent artifacts (the network architecture is discussed in Section “Network architecture”). The overall framework for our proposed infrared and visible image fusion algorithm is shown in Fig. 2.

First, a fusion network based on HDC blocks is devised to fully extract the high-level semantic information in source images. More specifically, we apply a feature extraction module \(F_E\) to extract fine-grained feature information from infrared and visible images. This process can be represented as:

$$\begin{aligned}&\left\{ F_{i r}, F_{v i s}\right\} =\left\{ F_E\left( I_{i r}\right) , F_E\left( I_{v i s}\right) \right\} \text{, } \end{aligned}$$
(1)

where \(F_{ir}\) and \(F_{vis}\) represent feature maps for infrared and visible images, respectively. Moreover, HDC blocks are deployed in the feature extraction module to expand the receptive field while ensuring that important coarse-grained and fine-grained feature information is extracted, as shown in Fig. 3. Given the HDC input \(F_{i}\), the corresponding output \(F_{i+1}\) can be represented as:

$$\begin{aligned} F_{i+1}=HDC\left( F_i\right) =\phi \left( DConv^{n} (F_{i} )\right) , \end{aligned}$$
(2)

where \(DConv^{n}\) denotes n cascaded \(3\times 3\) dilated convolutional layers and \(\phi \) represents the LReLU activation function. Information flows through the HDC blocks at successive hierarchical levels of the pipeline. In this way, the HDC blocks capture local and global information from the source images and effectively strengthen the feature representation capability.
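To make this structure concrete, the following is a minimal PyTorch sketch of an HDC block consistent with Eq. (2) and Fig. 3: cascaded \(3\times 3\) dilated convolutions, each followed by BN and LReLU, with padding chosen so that the spatial resolution is preserved. The channel widths, the LReLU slope, and the number of cascaded layers n are illustrative assumptions rather than the paper’s exact settings.

```python
import torch.nn as nn

class HDCBlock(nn.Module):
    """Sketch of one hybrid dilated convolutional block (Eq. 2):
    n cascaded 3x3 dilated convolutions, each followed by BN and LReLU."""

    def __init__(self, in_ch, out_ch, dilation, n=1):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(n):
            layers += [
                # padding = dilation keeps H x W unchanged (no down-sampling)
                nn.Conv2d(ch, out_ch, kernel_size=3, stride=1,
                          padding=dilation, dilation=dilation),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.2, inplace=True),   # LReLU slope assumed
            ]
            ch = out_ch
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

# One encoder branch: three HDC blocks with dilation rates 1, 3, 5 (Fig. 3);
# the channel widths are assumed for illustration.
encoder_branch = nn.Sequential(
    HDCBlock(1, 16, dilation=1),
    HDCBlock(16, 32, dilation=3),
    HDCBlock(32, 64, dilation=5),
)
```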

The feature fusion and reconstruction module is responsible for converting the feature maps into the fused image. However, simply reconstructing the fused image using convolution operations may result in information loss. Therefore, we introduce different attention modules at different layers of the extractor to fully exploit contextual information from the source images and alleviate the information loss of the feature maps in reconstruction.

Fig. 3

The specific arrangement of the hybrid dilated convolutional blocks. Left to right: convolutional layers with a kernel size of \(3\times 3\) and dilation rates of 1, 3, and 5, respectively. The HDC blocks naturally enlarge the network receptive field without adding extra modules

To integrate the abundant fine-grained detailed features in infrared and visible images and reconstruct the fused image, the element-wise addition strategy in [43] is used. The formula for this fusion process is as follows:

$$\begin{aligned}&F_f={\text {Add}}\left( \alpha _i\left( F_{i r}\right) , \alpha _i\left( F_{v i s}\right) \right) \text{, } \end{aligned}$$
(3)

where \(F_{f}\) denotes the fused feature maps, \(Add(\cdot , \cdot )\) represents the element-wise addition strategy, and \(\alpha _i\) denotes the attention mechanism at the corresponding scale. Specifically, \(\alpha _1\) focuses on coarse-grained information from the infrared and visible images using a spatial attention mechanism, while \(\alpha _{2}\) and \(\alpha _{3}\) strengthen the abundant fine-grained feature information using a channel attention mechanism. Finally, the fused image \(I_f\) is reconstructed from \(F_f\) via an image reconstructor \(R_i\) as follows:

$$\begin{aligned}&I_f=R_i\left( F_f\right) . \end{aligned}$$
(4)
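As a minimal illustration of Eqs. (3) and (4), the fusion step can be written as a plain function that applies the scale-specific attention module to both modalities, adds the results element-wise, and hands the fused features to the reconstructor. The attention modules and the reconstructor are placeholders here, assumed to be trained network components.

```python
from typing import Callable, Sequence

def fuse_and_reconstruct(feats_ir: Sequence, feats_vis: Sequence,
                         attentions: Sequence[Callable],
                         reconstructor: Callable):
    """Eqs. (3)-(4): at each scale i, apply alpha_i to both modalities,
    add the attended feature maps element-wise, and pass the fused
    multi-scale features to the reconstructor R_i."""
    fused = [att(f_ir) + att(f_vis)
             for f_ir, f_vis, att in zip(feats_ir, feats_vis, attentions)]
    return reconstructor(fused)
```

In the architecture of Fig. 2, `attentions` would correspond to the list [spatial attention, channel attention, channel attention] described in the next subsection.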

Network architecture

The framework for the proposed lightweight fusion network based on hybrid dilated convolutional blocks (HDCBs), shown in Fig. 2, consists of encoder and decoder networks for feature extraction and image reconstruction, respectively.

Fig. 4

A diagram of the spatial and channel attention-based modules is shown. \(C\times H\times W\) denotes feature maps with channel number C, height H and width W. \(\otimes \) denotes matrix multiplication, \(\oplus \) represents element-wise addition, and \(\odot \) indicates element-wise multiplication

The feature extractor utilizes three HDCBs to increase the receptive field of the network and capture more contextual information while ensuring that fine-grained features are extracted from the infrared and visible images. In addition, a multi-scale spatial/channel attention module is proposed to retain valuable information and reduce artifacts in multi-modality images. In the feature extractor, the multi-scale shallow layer of the encoder focuses on elementary features using a spatial attention module, while a channel attention module attends to fine-grained features in the source images at the multi-scale deep layers of the encoder. These multi-scale attention features are added as inputs to the corresponding layers of the decoder network to reconstruct the fused image. As shown in Fig. 2, two parallel encoder modules are used to extract features from the infrared and visible images, each containing three HDCBs with dilation rates of 1, 3, and 5, respectively. The specific design of the HDCB is shown in Fig. 3. The block changes only the dilation rates of ordinary convolutions, which are set to prevent gridding artifacts. The main stream consists of three convolutional layers with a kernel size of \(3\times 3\) and a stride of 1, together with batch normalization (BN) layers and LReLU layers. To preserve more diverse and important contextual information, different attention modules are introduced at each scaling layer of the encoder, as shown in Fig. 4. \(FM_{i}\) serves as the input to the attention module, acquired from the feature maps output by each HDCB in the encoder, while \(FM_{o}\) denotes the output of the attention module. The spatial attention mechanism is applied to the shallow features of the first HDCB, while the channel attention mechanism is exploited in the deeper scaling layers.
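For reference, the following PyTorch sketch gives plausible stand-ins for the two attention modules: an SE-style channel attention and a simple spatial attention built from channel-pooled statistics. The exact module design in Fig. 4 (including its matrix-multiplication paths) may differ; the reduction ratio and kernel size below are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention, a stand-in for the channel module in Fig. 4."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.weight = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: global average pooling
            nn.Conv2d(channels, channels // reduction, 1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # excitation: per-channel weights
        )

    def forward(self, fm_i):
        return fm_i * self.weight(fm_i)                    # element-wise multiplication

class SpatialAttention(nn.Module):
    """Simple spatial attention: a 2-D weight map from channel-pooled statistics."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, fm_i):
        avg = fm_i.mean(dim=1, keepdim=True)               # channel-wise mean
        mx, _ = fm_i.max(dim=1, keepdim=True)              # channel-wise max
        weight = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return fm_i * weight
```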

Attention maps for infrared and visible images at different scales are then integrated via an element-wise addition strategy, and the results are fed into the decoder network to achieve image reconstruction. The decoder network in the image reconstructor generates fused images using three \(3\times 3\) convolutional layers and three BN layers, all of which are followed by an LReLU activation function. The stride is set to 1 in the fused network with no down-sampling operation, to reduce information loss. As such, fused images are the same size as the source images.

Loss function

A total loss function is proposed in this study to obtain more comprehensive detail in the resulting images, drawing on the salient target information in the infrared images and the fine-grained features in the visible images. The total loss consists of an intensity loss \(L_{intensity}\) and a detail loss \(L_{detail}\) and is defined as follows:

$$\begin{aligned}&L_{total}=L_{intensity}+\gamma L_{detail}, \end{aligned}$$
(5)

where \(\gamma \) is a weight factor used to balance the intensity loss \(L_{intensity}\) and detail loss \(L_{detail}\).

The intensity loss is designed to constrain intensity similarity between the fused and input images at the pixel level. Therefore, the intensity loss is expressed as:

$$\begin{aligned}&L_{intensity}=\frac{1}{H W}\left\| I_f-\left( p I_{i r}+(1-p) I_{v i s}\right) \right\| _1, \end{aligned}$$
(6)

where W and H represent the width and height of the image, respectively, \(\left\| \cdot \right\| _{1}\) is the \(l_{1}\)-norm, and p denotes the weight of constraints used to integrate the distribution of pixel intensities in infrared and visible images.

However, fused images not only include the pixel intensity distribution of the source images, but also exhibit a fine-grained detail distribution. Hence, a detail loss is introduced to force the fused image to preserve more structure and fine-grained texture information. Detail loss can be expressed as:

$$\begin{aligned}&L_{detail}=\frac{1}{H W}\left\| \left| \nabla I_f\right| -\left( q\left| \nabla I_{i r}\right| +(1-q)\left| \nabla I_{vis}\right| \right) \right\| _1, \end{aligned}$$
(7)

where \(\nabla \) indicates the Sobel gradient operation used to measure the fine-grained information in the source images, q is a weight parameter that constrains the fine-grained features in infrared and visible images, and \(\left| \, \cdot \, \right| \) indicates the absolute value operation.
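A minimal PyTorch sketch of Eqs. (5)-(7) for single-channel image batches is shown below; the Sobel gradient magnitude is approximated as the sum of the absolute horizontal and vertical responses, and the default hyper-parameter values are those reported later in the training details (\(\gamma =100\), p = 0.68, q = 0.08).

```python
import torch
import torch.nn.functional as F

def sobel_gradient(img):
    """Approximate |grad I| (Eq. 7) for a batch of shape (B, 1, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                     # Sobel kernel in the vertical direction
    return F.conv2d(img, kx, padding=1).abs() + F.conv2d(img, ky, padding=1).abs()

def total_loss(fused, ir, vis, gamma=100.0, p=0.68, q=0.08):
    """Eqs. (5)-(7): intensity loss plus gamma-weighted detail loss (mean L1)."""
    l_intensity = torch.mean(torch.abs(fused - (p * ir + (1 - p) * vis)))
    l_detail = torch.mean(torch.abs(
        sobel_gradient(fused)
        - (q * sobel_gradient(ir) + (1 - q) * sobel_gradient(vis))))
    return l_intensity + gamma * l_detail
```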

Finally, guided by the total loss function, our proposed fused network based on HDCBs and multi-scale attention provides fused images with a better pixel intensity distribution and larger quantities of detail information, to efficiently generate high-quality images.

Experiments

In this section, we first describe the experimental settings and training details. Then, we conduct both quantitative and qualitative comparative experiments and generalization experiments to fully evaluate the performance of our proposed fusion algorithm. Finally, we introduce ablation experiments to demonstrate the effectiveness of the model design, including detail loss and multi-scale spatial/channel attention.

Experimental settings

We perform extensive quantitative and qualitative experiments using the TNO [22], RoadScene [23], and VIFB [44] datasets to comprehensively evaluate the proposed fusion method. In addition, seven state-of-the-art image fusion algorithms are selected for comparison with our approach, including three typical traditional methods, i.e., IFEVIP [45], GTF [18] and CBF [46], two AE-based models, i.e., MFEIF [47] and NestFuse [8], one CNN-based method IFCNN [19], and one GAN-based method FusionGAN [24]. Implementations of these algorithms are publicly available and corresponding parameters are set in agreement with those in their respective papers.

Nine statistical evaluation metrics are used to quantitatively compare our method with the seven other fusion methods: entropy (EN) [48], the modified fusion artifacts measure (Nabf) [49], correlations of differences (SCD) [50], spatial frequency (SF) [51], standard deviation (SD) [52], peak signal-to-noise ratio (PSNR) [53], multi-scale structural similarity (MS_SSIM) [54], feature mutual information (FMI), and the correlation coefficient (CC). Higher values indicate better fusion performance for all metrics except Nabf, for which a lower value is better.

The EN measures the amount of information contained in a fused image as follows:

$$\begin{aligned}&E N=-\sum _{l=0}^L p_l \log _2 p_l, \end{aligned}$$
(8)

where L and \(p_l\) represent the total number of gray levels and the normalized histogram value of the corresponding gray level in the fused image, respectively. A large EN indicates that a large amount of information is available, representing better fusion performance; however, larger EN values may also be caused by noise.
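For illustration, Eq. (8) can be computed with a few lines of NumPy for an 8-bit grayscale image (images normalized to [0, 1] would first be rescaled to 256 gray levels).

```python
import numpy as np

def entropy(img, levels=256):
    """Entropy (EN, Eq. 8) of a grayscale image with integer values in [0, levels)."""
    hist, _ = np.histogram(img.ravel(), bins=levels, range=(0, levels))
    p = hist.astype(np.float64) / hist.sum()   # normalized histogram p_l
    p = p[p > 0]                               # convention: 0 * log2(0) = 0
    return float(-(p * np.log2(p)).sum())
```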

The Nabf, which quantifies the number of noises or artifacts added in the fused image due to the fusion process, can be expressed as:

$$\begin{aligned}&N_{m}^{\frac{A B}{F}}=\frac{\sum _{\forall i}\sum _{\forall j}A M_{i,j}\left[ \left( 1-Q_{i,j}^{A F}\right) w_{i,j}^{A}+\left( 1-Q_{i,j}^{B F}\right) w_{i,j}^{B}\right] }{\sum _{\forall i}\sum _{\forall j}\left( w_{i,j}^{A}+w_{i,j}^{B}\right) }, \end{aligned}$$
(9)
$$\begin{aligned}&A M_{i,j}\;=\;\left\{ \begin{array}{ll} {{1,}}&{}{{g_{i,j}^{F}>g_{i,j}^{A}~\textrm{and}~\,g_{i,j}^{F}>g_{i,j}^{B}}}\\ {{0,}}&{}{{\textrm{otherwise}}}\end{array}\right. , \end{aligned}$$
(10)

where \(A M_{i,j}\) indicates the locations of fusion artifacts, i.e., where the fused gradient is stronger than the input gradients, \(Q_{i,j}^{A F}\) and \(Q_{i,j}^{B F}\) denote the gradient information preservation estimates of source images A and B, respectively, \(w_{i,j}^{A}\) and \(w_{i,j}^{B}\) are the perceptual weights of the source images, and \(g_{i,j}^{A}\), \(g_{i,j}^{B}\), and \(g_{i,j}^{F}\) are the edge strengths of A, B, and the fused image F, respectively. A low Nabf value is indicative of superior visual performance in the fused image.

The SCD, which measures the amount of information transmitted from source images to the fused image, can be represented as:

$$\begin{aligned}&SCD=r(D_{1},S_{1})+r(D_{2},S_{2}), \end{aligned}$$
(11)

where \(S_{1}\) and \(S_{2}\) denote the source images, \(D_{1}\) and \(D_{2}\) are the difference images between the fused image and \(S_{2}\) and \(S_{1}\), respectively, and \(r(\cdot ,\cdot )\) denotes the correlation function.
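Assuming the difference-image convention stated above, a compact NumPy sketch of Eq. (11) is:

```python
import numpy as np

def scd(fused, src1, src2):
    """Sum of correlations of differences (SCD, Eq. 11)."""
    def corr(a, b):
        a, b = a - a.mean(), b - b.mean()
        return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum()))
    f, s1, s2 = (x.astype(np.float64) for x in (fused, src1, src2))
    return corr(f - s2, s1) + corr(f - s1, s2)   # D1 = F - S2, D2 = F - S1
```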

The SF metric effectively measures the gradient distribution of images, which reveals the details and texture of images. It can be defined as follows:

$$\begin{aligned}&S F={\sqrt{R F^{2}+C F^{2}}}, \end{aligned}$$
(12)
$$\begin{aligned}&R F={\sqrt{\sum _{i=1}^{M}\sum _{j=1}^{N}\left( F(i,j)-F(i,j-1)\right) ^{2}}}, \end{aligned}$$
(13)
$$\begin{aligned}&CF={\sqrt{\sum _{i=1}^{M}\sum _{j=1}^{N}(F(i,j)-F(i-1,j))^{2}}}, \end{aligned}$$
(14)

where the row frequency (RF) and column frequency (CF) are computed from the horizontal and vertical gradients, respectively.
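The following NumPy sketch implements Eqs. (12)-(14); unlike the formulas above, the sums are normalized by the image size, a common convention that does not change the ranking of methods.

```python
import numpy as np

def spatial_frequency(fused):
    """Spatial frequency (SF, Eqs. 12-14) from first-order image differences."""
    f = fused.astype(np.float64)
    rf = np.sqrt(np.mean((f[:, 1:] - f[:, :-1]) ** 2))   # row frequency (horizontal diffs)
    cf = np.sqrt(np.mean((f[1:, :] - f[:-1, :]) ** 2))   # column frequency (vertical diffs)
    return float(np.sqrt(rf ** 2 + cf ** 2))
```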

The CC metric measures the degree of linear correlation between the fused image and the source images, as defined below:

$$\begin{aligned}&CC={\frac{r_{a f}+r_{b f}}{2}}, \end{aligned}$$
(15)
$$\begin{aligned}&r_{x f} = \frac{\sum _{i=1}^{M}\sum _{j=1}^{N}(x_{i,j}-\mu _{x})(f_{i,j}-\mu _{f})}{\sqrt{\sum _{i=1}^{M}\sum _{j=1}^{N}(x_{i,j}-\mu _{x})^{2}\sum _{i=1}^{M}\sum _{j=1}^{N}(f_{i,j}-\mu _{f})^{2}}}, \end{aligned}$$
(16)

where \(\mu _{x}\) and \(\mu _{f}\) indicate the mean values of the input image x and the fused image f, respectively. A higher value of CC indicates a better correlation and higher image quality for the fused image.
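A direct NumPy implementation of Eqs. (15) and (16) is shown below.

```python
import numpy as np

def correlation_coefficient(fused, src1, src2):
    """CC (Eqs. 15-16): mean Pearson correlation between the fused and source images."""
    def pearson(x, f):
        x, f = x - x.mean(), f - f.mean()
        return float((x * f).sum() / np.sqrt((x ** 2).sum() * (f ** 2).sum()))
    f = fused.astype(np.float64)
    return 0.5 * (pearson(src1.astype(np.float64), f)
                  + pearson(src2.astype(np.float64), f))
```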

The SD reflects the distribution and contrast of the fused image from a statistical perspective and can be defined mathematically as:

$$\begin{aligned}&S D=\sqrt{\sum _{i=1}^{M}\sum _{j=1}^{N}\left( f(i,j)-\mu \right) ^{2}},&\end{aligned}$$
(17)

where \(\mu \) denotes the mean value of the fused image. A larger SD value indicates that the fused image has higher contrast and more favorable visual effects.
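Eq. (17) reduces to the standard deviation of the pixel values; the sketch below uses the normalized form (division by \(M\times N\)), which is the common convention.

```python
import numpy as np

def standard_deviation(fused):
    """SD (Eq. 17): spread of pixel values around the mean of the fused image."""
    f = fused.astype(np.float64)
    return float(np.sqrt(np.mean((f - f.mean()) ** 2)))
```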

The MS_SSIM measures the structural similarity between two images across multiple scales. The multi-scale SSIM index is given by:

$$\begin{aligned}&MS\_SSIM({x},{y})=[l_{M}({x},{y})]^{\alpha _{M}}\cdot \prod _{j=1}^{M}[c_{j}({x},{y})]^{\beta _{j}}[s_{j}({x},{y})]^{\gamma _{j}}, \end{aligned}$$
(18)

where M is the highest scale, \(\alpha _{M}\), \(\beta _{j}\) and \(\gamma _{j}\) are used to adjust the relative importance of different components, and \(c_{j}({x},{y})\) and \(s_{j}({x},{y})\) provide a comparison of contrast and structure at the j-th scale image, respectively, while \(l_{M}({x}, {y})\) is only the luminance comparison at scale M.

The PSNR is used to evaluate the ratio of peak signal power to noise power and therefore reflects the amount of distortion during the fusion process. This metric is defined as follows:

$$\begin{aligned}&PSNR=10\log _{10}{\frac{r^{2}}{MSE}}, \end{aligned}$$
(19)

where r indicates the peak value of the fused image. A higher PSNR value indicates that the fused image is closer to the source images and contains less distortion.
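A straightforward NumPy version of Eq. (19) is given below; in fusion evaluation, PSNR is commonly computed against each source image in turn and averaged, a convention assumed here rather than stated in the text.

```python
import numpy as np

def psnr(fused, reference, peak=255.0):
    """PSNR (Eq. 19) between the fused image and a reference image."""
    mse = np.mean((fused.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))

def fusion_psnr(fused, ir, vis, peak=255.0):
    """Average PSNR over the two source images (assumed convention)."""
    return 0.5 * (psnr(fused, ir, peak) + psnr(fused, vis, peak))
```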

The FMI is used to measure the amount of feature information transmitted from the source images to the fused image. It is defined as follows:

$$\begin{aligned}&FMI_{F}^{A B}={\frac{1}{n}}\sum _{i=1}^{n}\left( {\frac{I_{i}(A;F)}{H_{i}(A)+H_{i}(F)}}+{\frac{I_{i}(B;F)}{H_{i}(B)+H_{i}(F)}}\right) , \end{aligned}$$
(20)

where \(H_{i}(A)\) and \(H_{i}(B)\) are the entropy of the corresponding windows from the input images, \(I_{i}(A;F)\) and \(I_{i}(B;F)\) indicate the regional mutual information between corresponding windows in the fused image and source images. A larger FMI value commonly implies that a considerable amount of feature information is transferred from the source images to the fused image.

Training details

We train the proposed fusion network on the Multi-Spectral Road Scenarios (MSRS) [33] dataset. The training set includes 1078 pairs of infrared and visible images, while the test set contains 361 image pairs. This dataset is constructed based on MFNet [55] and consists of a large number of nighttime and daytime scenes. Before feeding the training set to the fusion network, all images are normalized to [0, 1] and the parameters are set as follows. The total loss hyper-parameters are set to \(\gamma \) = 100, p = 0.68, and q = 0.08. The batch size and number of epochs are set to 8 and 80, respectively. The model parameters are updated by the Adam optimizer with a learning rate of 0.001 and a weight decay of 0.0001. All experiments are performed on an NVIDIA RTX A5000 GPU and a 2.40 GHz Intel(R) Xeon(R) Silver 4214R CPU. Since MSRS contains color visible images, a specific fusion strategy [43] is used to process color image fusion. We first transfer the input visible images from the RGB color space to the YCbCr color space. The Y channel of the visible image is then fused with the infrared image to obtain a new fused Y channel. Finally, the fused Y channel is combined with the Cb and Cr channels of the visible image and converted back to the RGB color space.
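This color handling can be sketched in NumPy as follows, assuming images in [0, 1] and the BT.601 YCbCr conversion; `fuse_y` stands for the trained fusion network applied to the luminance and infrared channels and is a placeholder here.

```python
import numpy as np

def fuse_color(vis_rgb, ir, fuse_y):
    """Fuse a color visible image with an infrared image via the Y channel."""
    r, g, b = vis_rgb[..., 0], vis_rgb[..., 1], vis_rgb[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b          # RGB -> YCbCr (BT.601)
    cb = 0.564 * (b - y) + 0.5
    cr = 0.713 * (r - y) + 0.5

    y_f = fuse_y(y, ir)                             # fused luminance channel

    r_f = y_f + 1.403 * (cr - 0.5)                  # YCbCr -> RGB
    g_f = y_f - 0.344 * (cb - 0.5) - 0.714 * (cr - 0.5)
    b_f = y_f + 1.773 * (cb - 0.5)
    return np.clip(np.stack([r_f, g_f, b_f], axis=-1), 0.0, 1.0)
```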

Fig. 5

Four pairs of source images. The top row contains visible images, and the second row displays infrared images

Fig. 6

Visual result comparisons for different methods applied to the ‘man-in-doorway’ image from the TNO dataset. Our method excels at preserving abundant texture details, particularly in the zoomed-in region (i.e., the red box), and effectively highlights a salient region (i.e., the green box)

Fig. 7

Visual result comparisons for different methods applied to the ‘Marne-04’ image from the TNO dataset. Our method excels at preserving abundant texture details, particularly in the zoomed-in region (i.e., the red box), and effectively highlights a salient region (i.e., the green box)

Results analysis on TNO dataset

We compare the fusion performance for our method with the seven state-of-the-art algorithms applied to 24 image pairs acquired from the TNO dataset. All infrared and visible images display different scenes and are registered before being fed to the network. Samples of these images are shown in Fig. 5.

Qualitative results

Table 1 Average evaluation metric values for all methods applied to 24 image pairs from the TNO dataset

For the qualitative experiments, fused images produced by existing fusion methods and our proposed method are shown in Figs. 6 and 7. Some representative regions of the fused images are selected and enlarged near the bottom to display and analyze the visual effects of the fused results more intuitively. A significant target is evident in the green box and abundant textural details can be seen in the red box.

Fig. 8

Qualitative comparisons of the proposed method with seven state-of-the-art methods applied to ‘\(FLIR\_07210\)’ from the RoadScene dataset. Our method excels at preserving abundant texture details, particularly in the zoomed-in region (i.e., the red box), and effectively highlights a salient region (i.e., the green box)

Fig. 9

Qualitative comparisons of the proposed method with seven state-of-the-art methods on ‘\(FLIR\_08954\)’ from the RoadScene dataset. Our method excels at preserving abundant texture details, particularly in the zoomed-in region (i.e., the red box), and effectively highlights a salient region (i.e., the green box)

Table 2 Average evaluation metric values for all methods applied to 24 image pairs from the RoadScene dataset

As shown, nearly all methods generate some meaningless information due to thermal radiation contamination in the background. However, our method not only highlights the target but also preserves detail information. The region in the green box indicates that, although the CBF result includes a bright target, the pixel distribution in this area suffers heavily from noise compared with the proposed method. The IFEVIP, GTF, and FusionGAN models severely weaken significant targets in the fused images. In the case of NestFuse, IFCNN, and MFEIF, the fused images show that while some target edges are highlighted, other salient features and textural details are blurred. In contrast, our fusion method produces more realistic contrast and successfully preserves both the intensity of significant areas and the texture detail of the visible images. For example, the proposed scheme keeps the internal contours and details of the cars and clouds intact in Fig. 7. This improvement demonstrates one of the primary advantages of our method.

Quantitative results

Quantitative evaluation experiments are conducted on the TNO dataset, employing nine metrics to comprehensively compare our method with the seven state-of-the-art methods. Average values for the compared fusion methods and the proposed algorithm across the nine metrics are shown in Table 1, where the best and second-best values for each metric are bold and underlined, respectively. As the statistical results demonstrate, the proposed fusion method achieves the largest average values in four metrics: CC, SCD, MS_SSIM, and FMI. It also achieves reasonable performance in EN and SD, producing the second-largest average values. The best SCD performance indicates that the correlation between our fused images and the source images is the highest. In addition, the largest average values for CC and MS_SSIM indicate that our fused images transfer more meaningful information while preserving the structural information of the input images. The FMI values likewise show that our method preserves feature information from the source images well. Together, these results indicate that our method transfers more meaningful information from the source images, especially the richest fine-grained details and significant structural information.

Results analysis on RoadScene dataset

Qualitative results

An additional 24 image pairs showing different day and night scenes are selected from the RoadScene dataset, including cars, streetlights, roads, pedestrians, bicycles, trees, and houses. The fused results produced by the different fusion methods are shown in Figs. 8 and 9. It is evident that undesirable artifacts appear in the CBF results, while the GTF and IFEVIP fused images do not retain details from the infrared image, resulting in significant information loss, particularly in the red box region. In addition, FusionGAN produces under-exposed results and fails to retain sharp target edges. In contrast, NestFuse, IFCNN, MFEIF, and the proposed method obtain better fusion performance in subjective evaluation than the other three fusion methods. However, the fused images obtained by the proposed method exhibit more reasonable luminance information.

Table 3 Quantitative comparisons of ablation studies using the TNO dataset

Quantitative results

The results of the quantitative comparison between our method and the other state-of-the-art algorithms are provided in Table 2. Our method achieves the largest average values across four metrics: SD, CC, SCD, and MS_SSIM. The best SD value indicates that our fused images exhibit the highest contrast. In addition, our algorithm produces the highest CC and MS_SSIM values, suggesting that the fused results share strong correlation and structural information with the source images. The highest SCD value further implies that our fused images contain less pseudo-information and have the strongest correlation with the source images.

In summary, both the qualitative and quantitative results demonstrate that our proposed method excels at transferring meaningful information and highlighting significant contrast, giving it remarkable advantages over the other methods.

Ablation studies

Multi-scale attention analysis

The multi-scale attention module plays a critical role in our fusion network, as it enhances the contextual representation of both local and global features. Therefore, we conduct an ablation study in which the multi-scale attention module is removed; the results are shown in Fig. 10. Without the attention module, the fused images still preserve the texture details of the source images but exhibit low contrast, and some of the visualized results contain a few artifacts.

Detail loss analysis

Ablation experiments are also conducted to determine the role of the detail loss. More specifically, we train a network without the detail loss, the results of which are shown in Fig. 10. Notice that when the detail loss is removed, the fusion network fails to preserve useful information from the source images, specifically the texture details in background regions and the pixel intensity and contours of salient targets. In addition, the results of the quantitative comparison are provided in Table 3, where all metrics decrease except SD. These experimental results demonstrate the importance of the detail loss, which preserves the texture details in the fused images.

Fig. 10

Qualitative comparisons of ablation analysis results for four image pairs acquired from the TNO dataset. The source images are shown in the first two rows, followed by the fused images produced without a multi-scale attention network (Without Attention), fused images without detail loss (Without Detail Loss), and fused images produced by our method

Fig. 11

Qualitative comparisons of eight methods applied to ‘elecbike’ image pairs from the extended VIFB dataset. Our method excels at preserving abundant texture details, particularly in the zoomed-in region (i.e., the red box), and effectively highlights a salient region (i.e., the green box)

Fig. 12

Qualitative comparisons of eight methods applied to ‘manCar’ image pairs from the extended VIFB dataset. Our method excels at preserving abundant texture details, particularly in the zoomed-in region (i.e., the red box), and effectively highlights a salient region (i.e., the green box)

Table 4 Quantitative comparisons of 21 image pairs from the extended VIFB dataset

Efficiency comparisons

To verify the computational efficiency of the fusion algorithms, the traditional methods are tested on the CPU, while the others are implemented on the GPU. As can be seen in Table 5, the average running times of the image fusion algorithms vary widely, and the running times of the traditional methods are longer than those of the deep learning-based methods, which benefit from GPU acceleration. Specifically, IFCNN, with its simple network architecture, is the fastest algorithm on all datasets. Our proposed fusion algorithm focuses on features at different scales and compensates for missing comprehensive information via attention modules; as such, its running time trails only IFCNN. The experiments show that our fusion algorithm has an efficiency advantage over most of the compared methods and is thus feasible for real-time applications.

Table 5 Average running time for all methods across three datasets (unit: second)

Extension to the VIFB dataset

To further verify the generalization ability of the proposed method, experiments are also conducted on the VIFB dataset, which includes 21 pairs of registered visible and infrared images. These samples not only cover a wide range of environments and working conditions (e.g., indoor, outdoor, low illumination, and over-exposure), but also include various image resolutions, such as \(320\times 240\), \(630\times 460\), \(512\times 184\), and \(452\times 332\).

Fused results for the VIFB dataset are shown in Figs. 11 and 12, where it is evident that GTF, FusionGAN, and NestFuse lose vital information. CBF also suffers from noise interference and other undesirable artifacts. In addition, IFEVIP fails to display significant targets due to overexposure in the visible images. In contrast, MFEIF, IFCNN, and the proposed method preserve detail information and highlight targets from the source images. Quantitative results for the VIFB dataset are provided in Table 4, where our method achieves the largest average values across three metrics: CC, SCD, and MS_SSIM. These metrics indicate that the fused results contain meaningful structure and texture information transferred from the source images. The proposed method trails only CBF in the EN metric, because the fused images generated by CBF contain additional noise.

Conclusion

In this paper, a novel lightweight deep learning fusion network based on multi-scale attention and hybrid dilated convolutional blocks is proposed to effectively improve the fusion of infrared and visible images. By designing hybrid dilated convolution blocks, the feature extraction module with a larger receptive field efficiently extracts more contextual information and fine-grained details without changing the size of the feature maps. The use of a unique total loss allows our proposed fusion network to simultaneously preserve texture features and salient target intensity from both infrared and visible images. In addition, the spatial/channel attention modules at different scales are designed to focus on shallow local and deep global detail features, which compensate for missing detail in the fusion process and improve the contrast of fused images. Experiments performed on two public infrared and visible image datasets demonstrate that our fused images not only include large amounts of detailed textural features but also reduce noise and artifacts. In addition, these experiments are extended to the VIFB dataset and further verify the generalizability of our proposed model.