GANFuse: a novel multi-exposure image fusion method based on generative adversarial networks

In this paper, a novel multi-exposure image fusion method based on generative adversarial networks (termed GANFuse) is presented. Conventional multi-exposure image fusion methods improve their fusion performance by designing sophisticated activity-level measurements and fusion rules. However, these methods have had limited success in complex fusion tasks. Inspired by the recent FusionGAN, which was the first to use generative adversarial networks (GAN) to fuse infrared and visible images and achieved promising performance, we improve its architecture and customize it for the task of extreme exposure image fusion. Specifically, in order to keep the content of the extreme exposure image pair in the fused image, we increase the number of discriminators that differentiate between the fused image and the extreme exposure image pair, while a generator network is trained to generate fused images. Through the adversarial relationship between the generator and the discriminators, the fused image contains more information from the extreme exposure image pair, which leads to better fusion performance. In addition, the proposed method is an end-to-end, unsupervised learning model, which avoids designing hand-crafted features and does not require ground truth images for training. We conduct qualitative and quantitative experiments on a public dataset, and the experimental results show that the proposed model demonstrates better fusion ability than existing multi-exposure image fusion methods in both visual effect and evaluation metrics.


Introduction
With the advance of digital imaging technology, the demands on visual image quality are higher than ever before. High dynamic range (HDR) technology, one of the ways to improve image quality, has attracted extensive attention. It is widely applied in digital electronic products, remote sensing, security monitoring and so on. The dynamic range of an image is the ratio of its maximum brightness to its minimum brightness. The dynamic range of real-world scenes is very wide [1]. However, ordinary image sensors have fixed exposure settings and can only capture images with low dynamic range (LDR). Because of this limitation, it is difficult for ordinary image sensors to fully present the visual information in a real scene. HDR technology can improve the dynamic range of the image, so that the visual information in the extremely exposed areas of real-world scenes can be preserved [2]. Multi-exposure image fusion (MEF) is the most common HDR technique; it merges LDR images with different exposures into a single well-exposed HDR image.
MEF was first proposed in 1984 [3]. Since then, MEF has become an active field, and many related methods have been proposed. Existing MEF methods can generally be divided into three categories: pixel-based [4][5][6][7], region-based [8][9][10][11][12], and deep learning-based methods [13][14][15][16][17][18]. The first two categories have been developed for many years and are widely used in all kinds of scenarios. Consequently, these methods are known as traditional fusion methods. Generally, traditional methods contain three major steps: image transformation, activity-level measurement and fusion rule design [19]. However, these steps are limited by implementation difficulty and high computational cost. Deep learning-based methods can avoid these problems. Because the trained network can model the complex relationship between the source images and the fused image, it can automatically extract feature information from the images and fuse these features without manual design of the transformation and activity-level measurement. The fusion process is simpler and more widely applicable. Under the constraint of the loss function, the fused image has prominent targets, rich details and good visual effects. Existing deep learning-based methods have made some progress, but there is still room for improvement. A more effective loss function and network structure will lead to better fusion results. Specifically, SSIM is one of the quality measures for image fusion, which measures the correlation loss, brightness loss and contrast loss between the source images and the fused image. DeepFuse uses SSIM as the loss function in its model [13], but this loses other key information, such as contrast and texture information. IFCNN uses a pre-trained CNN as a tool to extract features from the source images [15]. However, its fusion rule is still designed manually. FusionGAN formulates image fusion as an adversarial game between keeping the infrared thermal radiation information and preserving the visible appearance texture information [16]. However, FusionGAN pays too much attention to information from the visible image and neglects information from the infrared image, which may cause the loss of information from the infrared image.
To overcome the above-mentioned problems, we propose a novel unsupervised MEF method based on GAN, named GANFuse. GANFuse consists of three components: a generator and two discriminators. The generator attempts to produce a fused image that contains the valid information of the source images, whereas the discriminators are trained to distinguish between the fused image and the source images. This adversarial process forces the generator to perform better. As for the loss functions, a pixel intensity loss and a gradient loss are applied in our network, which help the fused image to preserve luminance information and texture information from the source images. As shown in Fig. 1, the result of our GANFuse shows a better visual effect in terms of both luminance and texture. Furthermore, in order to improve the robustness of the algorithm, we build the training dataset from extreme exposure image pairs captured in different environments (indoor/outdoor, day/night, side-lighting/back-lighting and so on).
The contributions of this work are as follows:
• A GAN-based unsupervised image fusion algorithm for fusing extreme-exposure images is proposed. The adversarial relationship enables the fused image to retain more details from the source images.
• Different from FusionGAN, a novel GAN structure is developed, which is more suitable for the task of MEF.
• We design a new loss function for MEF, which helps the fused image to preserve more information from the source images.
• We construct a new training dataset that covers a wide variety of conditions. This dataset enhances the robustness of our method.
The rest of this paper is organized as follows. In Sect. 2, we briefly review related works from the literature. In Sect. 3, we introduce the proposed GANFuse, including the architecture, loss function, and training and testing processes. The results of comparison experiments are presented in Sect. 4, and the conclusion is given in Sect. 5.

Related works
This section provides a brief summary of existing image fusion methods based on deep learning. Furthermore, since our fusion method is based on GAN, we also discuss the basic theory of GAN, representative variants of GAN and their applications.

Fusion methods based on deep learning
In recent years, deep learning has attracted extensive attention and has been applied to image fusion because of its outstanding feature extraction ability and generality. The main lines of research fall into the following three categories. (1) Methods that combine traditional methods with deep learning. In these methods, the deep learning framework functions as a tool to extract image features. Representatively, Liu et al. [20] decompose the source images into a detail layer and a base layer, and then use convolutional sparse representation (CSR) to merge these layers. Finally, the fused image is reconstructed from the fused base layer and detail layer. In IFCNN [15], Liu et al. proposed a universal network for image fusion and designed different fusion rules for different types of source images. (2) Methods that use a convolutional network to generate a weight map showing the importance of each pixel in the source images. For instance, Li et al. [21] use VGGNet to extract image features and construct a robust weight map for fusion. (3) Methods that present an end-to-end learning framework for image fusion. Prabhakar et al. [13] propose an unsupervised deep learning framework for multi-exposure fusion. They use a novel CNN architecture and a no-reference quality metric as the loss function. In FusionGAN and its variants [16,22,23], a generative adversarial network is applied to fuse infrared and visible images. The fused image produced by the generator is forced to contain more of the details present in the visible image by using the discriminator to distinguish between them.
Although these works have achieved promising progress, there are still some drawbacks. (a) Many existing methods use neural networks to extract features and to reconstruct the image from these features, but the fusion rules are still designed manually. Thus, these methods retain the limitations of traditional fusion methods. (b) Unsupervised deep learning methods depend on a suitable loss function, and finding an effective loss function is still a challenge. (c) Existing GAN-based fusion methods apply the discriminator to force the fused image to contain more details from only one of the source images, leading to the loss of information from the other source image.
To address these drawbacks, we investigate an approach to MEF that preserves more effective information from the source images under the framework of GAN. Motivated by the success of FusionGAN on infrared and visible image fusion, we aim to extend the structure of FusionGAN and make it suitable for MEF. In general, there are three improvements in our method.

1. FusionGAN feeds the visible image and the fused image into the discriminator. In fact, the fused image contains not only visible information but also infrared information. Therefore, it is easy for the discriminator to distinguish between the visible image and the fused image. According to the principle of GAN, the stronger the distinguishing ability of the discriminator, the better the fused image produced by the generator. To improve the discriminators' ability, we look for a way to obtain the contribution of each source image to the fused image and use these contributions, rather than the fused image itself, as the input of the discriminators. This idea is captured by the sum of the correlations of differences (SCD) [24]. The main idea of SCD is that the difference between the fused image ($F$) and the source image ($S_1$) represents the contribution of the source image ($S_2$), and vice versa. Therefore, we consider that the difference image between the fused image ($F$) and the input image ($S_2$) approximately contains the information transmitted from the other input image ($S_1$). Consequently, in our networks, by feeding $|F - S_2|$ and $S_1$ into discriminator 1, and $|F - S_1|$ and $S_2$ into discriminator 2, we make it difficult for the discriminators to distinguish their inputs, which makes the adversarial relationship between the two discriminators and the generator more intense. The overall architecture of the proposed network is shown in Fig. 2.
2. In the generator, instead of using the concatenation operation applied in FusionGAN to fuse feature maps, we choose tensor addition as the fusion rule. The reason is that the purpose of MEF is to obtain a well-exposed image whose exposure value is the average of those of the source images. Following this reasoning, IFCNN uses the elementwise-mean fusion rule [15] to fuse multi-exposure images. However, simply using the elementwise-mean fusion rule may cause the loss of information from the source images. Therefore, we choose tensor addition to merge feature maps, and the averaging is left to the subsequent layers.
3. As mentioned in Sect. 1, DeepFuse uses SSIM as its loss function. Because DeepFuse was the first work to use a deep CNN architecture for MEF, many later deep learning-based image fusion methods have followed it and employ the SSIM metric as the loss function [15,20,25]. However, relying only on SSIM to constrain the whole network leads to the loss of other information. The most important information in an image is its texture information and luminance information. Consequently, to preserve this key information in the fused image, we use a gradient loss and a pixel intensity loss as the loss function. Moreover, in the experimental results section, we use SSIM as one of the metrics to evaluate the fused image, and our result shows the highest value among the compared methods.

The basic theory of GAN
The generative adversarial network was initially proposed by Ian J. Goodfellow et al. in 2014 [26]. Unlike conventional neural networks, training a GAN requires a generator (G) and a discriminator (D) to work simultaneously. This framework corresponds to a minimax two-player game in which the players are the generator and the discriminator. During training, the abilities of G and D gradually improve until the two sides reach equilibrium. Given the input variable $x$, the generator G is used to generate the output $y = G(x)$. Through training, G learns a distribution $P_G(x)$ that approximates the real data distribution $P_{Data}(x)$. The discriminator D is trained to determine whether its input comes from $P_{Data}(x)$ or $P_G(x)$. The purpose of G is to generate fake data that can fool D, whereas D aims to differentiate between real data and fake data. Through this adversarial relationship, the distribution generated by G gradually approximates the real data distribution. The optimization of G is formulated as:
$$G^* = \arg\min_G \mathrm{Div}\left(P_G(x), P_{Data}(x)\right)$$
where Div denotes the divergence between $P_{Data}(x)$ and $P_G(x)$. The function of D can be expressed as:
$$D^* = \arg\max_D V(G, D)$$
where $V(G, D)$ is defined as follows:
$$V(G, D) = \mathbb{E}_{x \sim P_{Data}}\left[\log D(x)\right] + \mathbb{E}_{x \sim P_G}\left[\log\left(1 - D(x)\right)\right]$$
Thus, the optimization of the generative adversarial network can be expressed as:
$$G^* = \arg\min_G \max_D V(G, D)$$
G and D are trained alternately. As the adversarial process advances, the data generated by G become gradually more similar to the real data.
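As a small numerical illustration of the value function V(G, D), the sketch below estimates it from discriminator outputs on a batch of real and generated samples. The function name `value_function` and the stabilizing constant `eps` are illustrative choices, not part of the paper.

```python
import numpy as np

def value_function(d_real, d_fake, eps=1e-12):
    """Empirical estimate of V(G, D) = E[log D(x_real)] + E[log(1 - D(x_fake))].

    d_real: discriminator outputs in (0, 1) on real samples.
    d_fake: discriminator outputs in (0, 1) on generated samples.
    eps is a small constant (an assumption here) that avoids log(0).
    """
    d_real = np.asarray(d_real, dtype=np.float64)
    d_fake = np.asarray(d_fake, dtype=np.float64)
    return float(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives V = 2 * log(0.5) = -log(4).
v_equilibrium = value_function([0.5, 0.5], [0.5, 0.5])

# A confident and correct discriminator pushes V up towards its maximum of 0.
v_confident = value_function([0.99, 0.98], [0.01, 0.02])
```

The discriminator tries to increase this quantity, while the generator tries to decrease it, which is the minimax game described above.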

Variants of GAN and their applications
GAN is a network framework that can generate realistic data. However, GAN suffers from unstable training. Since 2014, several works have attempted to solve this problem. For example, deep convolutional GAN [27] defines a set of constraints on the architecture of GAN that make the model stable to train. To replace the unsuitable divergence measure of the original GAN, WGAN [28] introduces the Wasserstein distance to improve the stability of training. To overcome the vanishing gradient problem caused by the loss function, least squares GAN (LSGAN) [29] adopts the least squares loss function for the discriminator. StyleGAN [30] embeds the input latent code into an intermediate latent space and proposes two new metrics for evaluating the generator architecture, leading to a more linear, less entangled representation of the factors of variation. Conditional GAN (cGAN) extends GAN to a conditional model by feeding auxiliary information, such as class labels or data from other modalities, into the discriminator and generator [31]. For translating clothing images between two specific clothing categories, Liu et al. [32] propose the category-attribute GAN (CA-GAN) framework, which includes three discriminators. Overall, GANs have the potential to be applied in many fields.

Proposed method
The whole procedure of the proposed fusion model, including the color conversion, is presented in Fig. 3. We decompose the source images into three channels: Y, Cb and Cr. The proposed model is used to fuse the Y channel of the source images, since the texture details of an image are mainly carried by its luminance channel (Y). The fusion rule for the chrominance channels (Cb and Cr) is introduced in Sect. 3.5. The architecture of the networks, the loss function, and the training and testing processes are described in the remainder of this section.
Fig. 3 The whole procedure of the proposed fusion model
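The decomposition into Y, Cb and Cr can be sketched as follows. The paper does not state which YCbCr variant is used; the full-range ITU-R BT.601 (JPEG) coefficients below are an assumption, and `rgb_to_ycbcr` is a hypothetical helper name.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an (..., 3) RGB array with values in [0, 255] to Y, Cb, Cr planes.

    Assumes full-range BT.601 (JPEG) coefficients; the paper does not
    specify the exact conversion it uses.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

# A neutral gray pixel has no chroma: Y equals the gray level, Cb = Cr = 128.
y, cb, cr = rgb_to_ycbcr(np.array([[[100.0, 100.0, 100.0]]]))
```

Only the Y plane would be passed to the fusion network; the Cb and Cr planes are fused separately.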

GANFuse
The learning ability of a GAN depends on the structure of the network and the loss function. There are three differences between GANFuse and FusionGAN. Firstly, we train two discriminators to optimize the generator, which makes the fused image contain more details from the extreme exposure image pair. Secondly, we design a new input mode to improve the discriminative ability of the discriminators. Thirdly, in line with the purpose of MEF, we use the pixel intensity loss and gradient loss with respect to the source images as the generator's loss function. The overall architecture of the proposed network is shown in Fig. 2. The Y channel of the under-exposure image and the Y channel of the over-exposure image are fed to the generator (G), and the output of the generator is the Y channel of the fused image.
As mentioned in Sect. 2.1, we input $|F - S_2|$ and $S_1$ into discriminator 1 ($D_1$) to distinguish between the contribution of $S_1$ to the fused image and $S_1$ itself. In the meantime, we input $|F - S_1|$ and $S_2$ into discriminator 2 ($D_2$) to distinguish between the contribution of $S_2$ to the fused image and $S_2$ itself. This input mode enhances the differentiating capacity of the discriminators, which forces the generator to produce realistic fused images. In the training phase, the two discriminators are trained simultaneously against G. After training, $D_1$ cannot differentiate between the contribution of $S_1$ to the fused image and $S_1$, and $D_2$ cannot differentiate between the contribution of $S_2$ to the fused image and $S_2$.
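The input mode of the two discriminators can be sketched as below: each discriminator receives an absolute difference image as its "fake" sample and the corresponding source image as its "real" sample. The function name `discriminator_inputs` is hypothetical, and images are assumed to be single-channel arrays of equal shape.

```python
import numpy as np

def discriminator_inputs(fused, s1, s2):
    """Build the (fake, real) input pairs for discriminator 1 and discriminator 2.

    Discriminator 1 compares |F - S2| (estimated contribution of S1) with S1.
    Discriminator 2 compares |F - S1| (estimated contribution of S2) with S2.
    """
    fused = np.asarray(fused, dtype=np.float64)
    s1 = np.asarray(s1, dtype=np.float64)
    s2 = np.asarray(s2, dtype=np.float64)
    d1_pair = (np.abs(fused - s2), s1)
    d2_pair = (np.abs(fused - s1), s2)
    return d1_pair, d2_pair

# Toy case: if the fused image were exactly S1 + S2, each difference image
# would recover the other source image exactly.
s1 = np.array([[0.2, 0.4]])
s2 = np.array([[0.1, 0.3]])
(d1_fake, d1_real), (d2_fake, d2_real) = discriminator_inputs(s1 + s2, s1, s2)
```

In a real fused image the difference images only approximate the source contributions, which is what makes the discriminators' task non-trivial.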

Loss function
The loss function contains two parts: the loss of the generator and the losses of the discriminators. The details are presented as follows. The loss function of the generator (G), $L_G$, consists of the over-exposure image loss $L_{I_o}$ and the under-exposure image loss $L_{I_u}$, where the weight $\gamma$ is used to control the trade-off between the two.
$L_{I_o}$ consists of two terms, a content loss $L^{con}_{I_o}$ and an adversarial loss $L^{adv}_{I_o}$, where $\alpha$ is a weight controlling the trade-off between the two terms. $L^{con}_{I_o}$ denotes the content loss of the over-exposure image, which aims to preserve the over-exposure image information in the fused image. As mentioned in Sect. 2, we aim to preserve the gradient information and pixel intensity information in the fused image. Therefore, $L^{con}_{I_o}$ combines a pixel intensity term and a gradient term, where the weight $\sigma$ is used to control the trade-off between them, $h$ and $w$ denote the height and width of the source image, sum(·) represents the element-wise summation of its input, $\|\cdot\|_F$ is the matrix Frobenius norm, and $\nabla$ denotes the gradient operation. $L^{adv}_{I_o}$ denotes the adversarial loss between G and $D_1$. In order to establish an adversarial relationship between the discriminators and the generator, a negative sign is placed in front of the $D_1$ term.
The second term of $L_G$ is the loss of the under-exposure image, $L_{I_u}$, in which the weight $\beta$ is used to control the trade-off between its two terms. Similarly, $L^{con}_{I_u}$ is the content loss between $I_u$ and $I_f$, and $L^{adv}_{I_u}$ denotes the adversarial loss between G and $D_2$. The discriminators serve to reduce the difference between the fused image and the source images: the adversarial losses of $D_1$ and $D_2$ judge the similarity between the source images and the fused image. In the loss functions of $D_1$ and $D_2$, we regard $D_1(|I_f - I_u|)$ and $D_2(|I_f - I_o|)$ as scores of fake data, which the discriminators decrease, and regard $D_1(I_o)$ and $D_2(I_u)$ as scores of real data, which the discriminators increase. Thus, there is a negative sign in front of the real-data terms.
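The following is a minimal sketch of one plausible reading of these losses, not the paper's exact equations. It assumes that the discriminators output unbounded scores, that the content loss is a squared Frobenius norm on intensities plus a weighted squared Frobenius norm on gradients normalized by the image size, and that the gradient operator is a simple forward difference. All function names are hypothetical.

```python
import numpy as np

def gradient(img):
    """Forward-difference gradient proxy |dx| + |dy| (an assumed choice of operator)."""
    dx = np.diff(img, axis=1, append=img[:, -1:])
    dy = np.diff(img, axis=0, append=img[-1:, :])
    return np.abs(dx) + np.abs(dy)

def content_loss(fused, source, sigma):
    """Assumed form: (||F - S||_F^2 + sigma * ||grad F - grad S||_F^2) / (h * w)."""
    h, w = fused.shape
    intensity_term = np.sum((fused - source) ** 2)
    gradient_term = np.sum((gradient(fused) - gradient(source)) ** 2)
    return float((intensity_term + sigma * gradient_term) / (h * w))

def generator_adversarial_loss(score_fake):
    """Negative discriminator score on the difference image (sign as described in the text)."""
    return float(-np.mean(score_fake))

def discriminator_loss(score_fake, score_real):
    """Fake scores are pushed down and real scores up, hence the minus on the real term."""
    return float(np.mean(score_fake) - np.mean(score_real))

# Toy check: identical images give zero content loss.
img = np.arange(12, dtype=np.float64).reshape(3, 4)
zero_loss = content_loss(img, img, sigma=0.1)
d_loss = discriminator_loss(np.array([0.2]), np.array([0.9]))
g_adv = generator_adversarial_loss(np.array([0.2]))
```

Under this reading, the full generator loss would weight the adversarial and content terms for the over-exposure and under-exposure images and sum them, using the trade-off weights described above.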

Network structure
From Fig. 2, we can see that the whole network structure consists of two discriminators ($D_1$ and $D_2$) and one generator (G). In this section, the structures of $D_1$, $D_2$ and G are introduced.

Generator
The structure of G consists of three parts, namely feature extraction layers, a fusion operation and reconstruction layers, as illustrated in Fig. 4. The function of the feature extraction layers is to obtain features from the source images. We use the same feature extraction layers to obtain the features of the under-exposure image and the over-exposure image. Therefore, we can add the extracted feature maps and then feed the sum into the reconstruction layers. The output of the reconstruction layers is the fused image.
Owing to the randomly initialized kernels, training the end-to-end model is unstable and difficult. An effective way to handle this issue is to use a well-trained feature extraction model [33,34]. Thus, we choose the pre-trained ResNet V1 [35] as the feature extraction layers. It learns residual representations between inputs and outputs by using multiple parametric layers, which helps to avoid vanishing gradients. As shown in Fig. 4, our feature extraction layers have five bottlenecks, and n48 on bottleneck 1 denotes that the depth of bottleneck 1 is 48. The architecture of each bottleneck is illustrated in Fig. 5. To avoid losing information from the extreme exposure image pair, we set the stride of all kernels to 1. The reconstruction layers comprise five CNN layers. Batch normalization and ReLU are applied to alleviate exploding gradients and accelerate training.
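The three-part structure of the generator (shared feature extraction, fusion by tensor addition, reconstruction) can be sketched schematically as below. The `extract` and `reconstruct` callables are toy stand-ins for the ResNet-based feature extraction layers and the five reconstruction layers; they are assumptions for illustration only.

```python
import numpy as np

def fuse_features(feat_under, feat_over):
    """Fusion rule: element-wise tensor addition of the two feature maps."""
    return feat_under + feat_over

def generator_forward(y_under, y_over, extract, reconstruct):
    """Apply the same feature extractor to both inputs, add, then reconstruct."""
    feat_under = extract(y_under)
    feat_over = extract(y_over)  # same callable, so the weights are shared
    return reconstruct(fuse_features(feat_under, feat_over))

# Toy stand-ins: a two-channel "feature extractor" and a "reconstruction"
# that halves the first channel (i.e. performs the averaging after addition).
toy_extract = lambda y: np.stack([y, 2.0 * y])
toy_reconstruct = lambda feats: feats[0] / 2.0

y_under = np.full((4, 4), 0.2)
y_over = np.full((4, 4), 0.8)
fused_y = generator_forward(y_under, y_over, toy_extract, toy_reconstruct)
```

In the toy case the reconstruction step performs the averaging explicitly; in the network, this role is left to the learned reconstruction layers.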

Discriminator
The discriminators are designed so that the details of the fused image become more similar to those of the under-exposure image and the over-exposure image. The two discriminators share the same network structure, which is shown in Fig. 6. The stride of all layers is set to 2 without padding.

Training
For the training dataset, we collect 30 pairs of exposure stacks that are publicly available on the Internet [36]. The dataset covers a wide variety of conditions. Because of the large size of the source data, we down-sample the source images and crop them into 7552 patch pairs of size 84 × 84. We set the learning rate to $10^{-4}$ and train the network for 5 epochs, with all the training patches being processed in each epoch.
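Cropping aligned patch pairs from an under-exposed and an over-exposed image can be sketched as follows. The patch size of 84 comes from the text; the non-overlapping stride and the function name `extract_patch_pairs` are assumptions, since the paper does not state how the crops are sampled.

```python
import numpy as np

def extract_patch_pairs(under, over, patch=84, stride=84):
    """Crop spatially aligned patch pairs from two images of identical shape.

    The stride is an assumption; the paper only gives the patch size.
    """
    if under.shape != over.shape:
        raise ValueError("under- and over-exposed images must have the same shape")
    h, w = under.shape[:2]
    pairs = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            window = (slice(top, top + patch), slice(left, left + patch))
            pairs.append((under[window], over[window]))
    return pairs

# A 168 x 252 image yields 2 x 3 = 6 non-overlapping 84 x 84 patch pairs.
under_img = np.zeros((168, 252))
over_img = np.ones((168, 252))
patch_pairs = extract_patch_pairs(under_img, over_img)
```

A smaller stride would produce overlapping patches and therefore more training pairs from the same images.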

Testing
After the training phase, the network produces the fused image in the Y channel. The chrominance channels of the fused image are obtained by a weighted sum of the input chrominance channel values. The main information is carried by the Y channel, so different fusion strategies are applied in the literature for Y, Cb and Cr fusion [13,37]. Alternatively, the source images could be merged directly in the RGB channels. However, there is usually a substantial correlation between the RGB channels, so fusing the source images channel by channel in RGB ignores this correlation and causes obvious color differences. We therefore merge the chrominance channels of the source images by following the strategy of Prabhakar [37], which is as follows:
$$x_f = \frac{x_1 |x_1 - \tau| + x_2 |x_2 - \tau|}{|x_1 - \tau| + |x_2 - \tau|}$$
where $x_1$ and $x_2$ denote the chrominance pixel intensities of the image pair and $x_f$ denotes the fused chrominance value. The fused chrominance value is obtained by weighting each of the two chrominance values by its absolute difference from $\tau$. In our work, the value of $\tau$ is set to 128.
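The chrominance fusion rule described above (each chrominance value weighted by its absolute distance from the constant 128) can be sketched as below. The function name `fuse_chroma`, the parameter name `tau`, and the handling of the degenerate case where both weights are zero are assumptions.

```python
import numpy as np

def fuse_chroma(x1, x2, tau=128.0):
    """Weighted fusion of two chrominance planes (Cb or Cr).

    Each value is weighted by its absolute distance from tau. Where both
    inputs equal tau, the weights vanish and tau is returned (an assumed
    convention for this degenerate case).
    """
    x1 = np.asarray(x1, dtype=np.float64)
    x2 = np.asarray(x2, dtype=np.float64)
    w1 = np.abs(x1 - tau)
    w2 = np.abs(x2 - tau)
    denom = w1 + w2
    safe_denom = np.where(denom == 0, 1.0, denom)
    fused = (x1 * w1 + x2 * w2) / safe_denom
    return np.where(denom == 0, tau, fused)

# (148 * 20 + 118 * 10) / 30 = 138; a neutral input (128) gets zero weight.
chroma = fuse_chroma(np.array([148.0, 138.0, 128.0]),
                     np.array([118.0, 128.0, 128.0]))
```

Because 128 is the neutral chroma value, this rule favors whichever source pixel is more saturated.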

Experiments
We have conducted an extensive evaluation and comparison study against state-of-the-art algorithms. To verify the effectiveness of the method, we select image pairs captured under different conditions as the test set, including indoor and outdoor, day and night, and natural and artificial lighting. To evaluate the performance of the algorithms objectively, we adopt five metrics. All the experiments are conducted on a desktop with a 2.4 GHz Intel Xeon CPU E5-2673 v3, a GeForce RTX 2080 Ti GPU, and 64 GB of memory.

Comparison methods
The proposed method is compared with five representative methods: GFF [38], DSIFT [39], FLER [40], the gradient-based method (GBM) [36] and DeepFuse [13]. GFF is a guided filtering-based fusion method for creating a highly informative fused image [38]. In DSIFT, the dense SIFT descriptor is applied as the activity-level measurement to extract information from the source images [39]. FLER brightens the highlight regions in the dark image and darkens the darkest regions in the bright image, and finally generates virtual images via intensity mapping functions [40]. In GBM, two different fusion strategies are applied to the chrominance and luminance channels separately [36]. DeepFuse is the landmark multi-exposure fusion method based on deep learning [13].

Qualitative comparisons
We first perform qualitative comparison experiments on three typical image sequences. The fused results of our method and the five comparison methods are shown in Figs. 7, 8 and 9.
In this paper, we evaluate the fusion results from two aspects: the overall visual effect and the detail of the image. In terms of the overall visual effect, the result of the proposed method has a well-balanced light distribution and is closer to the actual scene. There are local dark areas in the results of the compared methods, which lead to the loss of detail features. In terms of detail, our method provides additional texture information in some regions. From Figs. 7, 8 and 9, we can see that the results of GFF, DSIFT and FLER have obvious black regions in the fused image. The fusion results of GBM and DeepFuse are more consistent with human visual perception. However, as shown in the regions marked by the red box in the ground truth, they also lose some detail textures. Specifically, as can be seen in Fig. 7, the tree in the result of our method has abundant texture information. The same phenomenon can be found in Fig. 8: in the red box, the texture on the wall is clearer in our result and the outline of the texture is closer to the ground truth. As for the window marked in Fig. 9, there is a bird pattern in the center of the marked region, and our result shows the most colorful and clearest bird pattern.

Quantitative comparisons
In the multi-exposure image fusion community, MEF-SSIM [36] is a commonly used metric for quantitative evaluation. In addition, we select SD, PSNR, CC and SCD as metrics. These metrics are commonly used in MEF evaluation. We apply these metrics to the source images and the fused results of our method and the five comparison methods. The five metrics are introduced as follows.

Standard deviation (SD)
SD is a metric reflecting the contrast and intensity distribution of an image. Because human vision pays more attention to regions with high contrast, a larger SD indicates a higher contrast in the fused image. In the definition of SD, $h$ and $w$ denote the height and width of the image, and $\mu_f$ denotes the mean value of image $f$.
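A minimal sketch of the SD metric is given below. It assumes the population form, in which the squared deviations from the image mean are averaged over all h × w pixels before taking the square root; some fusion papers omit this normalization, so the choice is an assumption. The function name is hypothetical.

```python
import numpy as np

def standard_deviation(img):
    """Standard deviation of pixel values around the image mean (population form)."""
    img = np.asarray(img, dtype=np.float64)
    mu = img.mean()
    return float(np.sqrt(np.mean((img - mu) ** 2)))

# Half the pixels at 0 and half at 2: mean 1, every deviation is 1, so SD = 1.
sd_high_contrast = standard_deviation([[0.0, 2.0], [0.0, 2.0]])
# A constant image has no contrast.
sd_flat = standard_deviation(np.full((3, 3), 7.0))
```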

Peak signal-to-noise ratio (PSNR)
PSNR is a metric reflecting distortion through the ratio of peak signal power to noise power:
$$\mathrm{PSNR} = 10 \log_{10} \frac{r^2}{\mathrm{MSE}}$$
where $r$ is the maximum value of the fused image, which is set to 256 in this paper. MSE is the mean square error, which measures the dissimilarity between the source images and the fused image. A larger PSNR indicates less distortion between the source images and the fused image.
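The PSNR computation for a fused image can be sketched as follows. The peak value of 256 follows the text; averaging the MSE over the two source images is an assumption, since the exact MSE definition is not reproduced here. Function names are hypothetical.

```python
import numpy as np

def mse(a, b):
    """Mean square error between two images of the same shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def psnr_fusion(fused, s1, s2, peak=256.0):
    """PSNR with the MSE averaged over both source images (assumed convention)."""
    err = 0.5 * (mse(fused, s1) + mse(fused, s2))
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / err))

# Both sources differ from the fused image by 16 everywhere: MSE = 256,
# so PSNR = 10 * log10(256^2 / 256) = 10 * log10(256).
fused_img = np.zeros((2, 2))
psnr_value = psnr_fusion(fused_img, fused_img + 16.0, fused_img - 16.0)
```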

Correlation coefficient (CC)
CC measures the degree of linear correlation between the source images and the fused image. A large CC indicates that there is a strong correlation between the fused image and the source images.
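A minimal sketch of the CC metric is shown below. It assumes that CC is the average of the Pearson correlation coefficients between the fused image and each of the two source images, which is a common convention in the fusion literature but is not confirmed by the text. Function names are hypothetical.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two images, computed over all pixels."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

def cc_fusion(fused, s1, s2):
    """Average correlation between the fused image and the two sources (assumed form)."""
    return 0.5 * (pearson(s1, fused) + pearson(s2, fused))

# One source is perfectly correlated (+1) and the other perfectly
# anti-correlated (-1) with the fused image, so the average is 0.
fused_img = np.array([[1.0, 2.0], [3.0, 4.0]])
cc_value = cc_fusion(fused_img, 2.0 * fused_img + 1.0, -fused_img)
```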

Mean structural similarity (MSSIM)
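MSSIM is the average of the individual SSIM values computed over each sliding window. SSIM is a metric used to model image loss and distortion. It is defined as follows:
$$\mathrm{SSIM}(x, f) = \frac{2\mu_x \mu_f + C_1}{\mu_x^2 + \mu_f^2 + C_1} \cdot \frac{2\sigma_x \sigma_f + C_2}{\sigma_x^2 + \sigma_f^2 + C_2} \cdot \frac{\sigma_{xf} + C_3}{\sigma_x \sigma_f + C_3}$$
where $x$ and $f$ are the image patches of the source image $X$ and the fused image $F$, respectively, $\sigma$ denotes the covariance ($\sigma_{xf}$) or the standard deviation ($\sigma_x$, $\sigma_f$), and $\mu$ denotes the mean value. $C_1$, $C_2$ and $C_3$ are parameters for stability.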

The sum of the correlations of differences (SCD)
In SCD, the difference image between one of the input images ($S_2$) and the fused image ($F$) approximately reveals the information transferred from the other input image ($S_1$). These differences ($D_1$ and $D_2$) can then be formulated as:
$$D_1 = F - S_2, \qquad D_2 = F - S_1$$
$D_1$ and $D_2$ indicate the amount of information transferred from each of the input images into the fused image. SCD is formulated as follows:
$$\mathrm{SCD} = r(D_1, S_1) + r(D_2, S_2)$$
where $r(D_k, S_k)$ calculates the similarity between $D_k$ and $S_k$, which is defined as follows:
$$r(D_k, S_k) = \frac{\sum_i \sum_j \left(D_k(i,j) - \bar{D}_k\right)\left(S_k(i,j) - \bar{S}_k\right)}{\sqrt{\left(\sum_i \sum_j \left(D_k(i,j) - \bar{D}_k\right)^2\right)\left(\sum_i \sum_j \left(S_k(i,j) - \bar{S}_k\right)^2\right)}}$$
where $\bar{D}_k$ and $\bar{S}_k$ are the averages of the pixel values of $D_k$ and $S_k$.
These metrics can only handle single-channel images. Thus, we compute them on the Y channel. We test the five metrics on 30 multi-exposure image pairs, and the results are presented in Fig. 10. The mean values are shown in Table 1, where the red and blue entries mark the largest and second largest values, respectively. Table 1 shows that our method performs well: it obtains the largest mean value in CC, PSNR, SSIM and SCD. For SD, GFF and DSIFT take the first and second place, respectively. However, these methods suffer from inhomogeneous illumination, which results in a high SD value. Among the remaining methods, our method achieves the largest mean SD.
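The SCD computation described above can be sketched as follows, using signed difference images and the Pearson correlation as the similarity measure. Function names are hypothetical, and inputs are assumed to be single-channel arrays of equal shape.

```python
import numpy as np

def correlation(a, b):
    """Pearson correlation between two images, computed over all pixels."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

def scd(fused, s1, s2):
    """Sum of the correlations of differences: r(F - S2, S1) + r(F - S1, S2)."""
    fused = np.asarray(fused, dtype=np.float64)
    s1 = np.asarray(s1, dtype=np.float64)
    s2 = np.asarray(s2, dtype=np.float64)
    d1 = fused - s2  # information attributed to S1
    d2 = fused - s1  # information attributed to S2
    return correlation(d1, s1) + correlation(d2, s2)

# Toy case: if the fused image were exactly S1 + S2, each difference image
# would equal the other source, giving the maximum SCD of 2.
src1 = np.array([[1.0, 2.0], [3.0, 4.0]])
src2 = np.array([[4.0, 1.0], [2.0, 3.0]])
scd_value = scd(src1 + src2, src1, src2)
```

Since each correlation lies in [-1, 1], SCD ranges from -2 to 2, with larger values indicating that more complementary information from both sources reached the fused image.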

Comparative experiment
To verify the effect of the innovations in our framework, we perform an ablation study on the components of GANFuse. Comparative experiment 1 shows the result of GANFuse without discriminators. In comparative experiment 2, we remove the SCD-based input and directly feed $F$ and $S_1$ into discriminator 1, and $F$ and $S_2$ into discriminator 2. The qualitative result is displayed in Fig. 11. The result in Fig. 11b is clearly the most similar to the ground truth in both color and textural detail. In particular, as shown in the red box, the tableware in Fig. 11b has a better visual effect. The quantitative result is given in Table 2, which also confirms the effect of the innovations in our method.
Moreover, we perform a parametric study on the important parameters of GANFuse. In GANFuse, the weight $\sigma$ trades off the gradient loss against the pixel intensity loss, and we show the results for different values of $\sigma$. In our model, we set $\sigma$ to 0.1, and the weight $\sigma$ is 0.3 and 0.5 in comparative experiment 3 and comparative experiment 4, respectively. The quantitative result is presented in Table 3, from which we can see that GANFuse obtains the best quantitative result when $\sigma$ is 0.1. Because the quantitative values are close to each other, the qualitative results are almost identical and are therefore not shown.

Conclusion and future work
In this paper, we propose a novel GAN-based multi-exposure image fusion method, termed GANFuse. On the basis of FusionGAN, we increase the number of discriminators and propose a novel way of forming the input of the discriminators. By doing so, we preserve more information from the source images in the fused image. Furthermore, we train and test our networks on data covering a wide variety of conditions, so our method is more robust under different conditions. Compared with five other state-of-the-art fusion methods, our method achieves better performance both qualitatively and quantitatively. In our current work, GANFuse is trained to fuse static multi-exposure images. For images containing moving objects, however, our method does not produce a good visual effect, because moving objects may lead to ghosting in the fused image. In future research, we aim to handle ghosting and to generalize GANs or their variants to fuse multi-modal images.