
1 Introduction

A single low dynamic range (LDR) image cannot capture the full range of scene luminance owing to the physical sensor limitations of a camera. In regions that are too bright or too dark, the boundaries between objects and their surroundings disappear. A high dynamic range (HDR) image, which combines LDR images captured at different exposure levels and thus contains richer brightness information, does not suffer from this problem. Owing to this property, interest in HDR imaging has been growing in various fields. Unfortunately, creating an HDR image from multiple LDR images requires multiple shots, and HDR cameras remain unaffordable. As a result, alternative methods are needed to infer an HDR image from a single LDR image.

Generating an HDR image from only a single LDR image is referred to as the inverse tone mapping problem. It is an ill-posed problem, because signal information missing from the given image must be restored. Recently, deep learning techniques have been applied to this problem [1,2,3]. Endo et al. [1], Lee et al. [2], and Eilertsen et al. [3] successfully restored the lost dynamic range using deep learning. However, these methods either require additional training to generate additional LDR images or fail to restore some patterns.

Deep learning processes information by acting as a function approximator, deriving a function that connects two domains whose relationship is difficult to specify by hand. Deep neural networks demonstrate noteworthy performance on real-world problems (image classification, image restoration, and image generation) that are difficult to solve with hand-crafted methods. Deep learning, which emerged in the field of supervised learning and thus required labeled data during training, has recently reached a new turning point with the stabilization of the generative adversarial network (GAN) structure [4,5,6,7,8].

We propose a novel method for inverse tone mapping using the GAN structure. This paper makes the following three main contributions:

  1.

    The GAN structure creates more realistic images than a network trained with a simple pixel-wise loss function, because the discriminator acts as an adaptive loss that captures both global and local information in the input image during training. Thus, we use this structural advantage of the GAN to infer natural HDR images that extend the dynamic range of a given image.

  2.

    We propose a novel network architecture that reconfigures the deep chain HDRI network [2], a state-of-the-art method for restoring the lost dynamic range. The reconfigured network is significantly smaller in scale than the existing network while maintaining its performance.

  3.

    Unlike conventional deep learning-based inverse tone mapping methods [1, 2], which produce a fixed number of images with different exposure values, we model the relationship between images with relative exposure values, which has the advantage of generating images with a wider dynamic range at no additional cost.

Fig. 1. Three-dimensional distribution of the image dataset with different exposure values in the image manifold space: for images labeled with their corresponding exposure values, we visualized the image space by reducing it to three dimensions using t-distributed stochastic neighbor embedding [9]. Images of the same scene change gradually in this space, and when the difference in exposure value between two images is large, they lie far from each other on the manifold. (Color figure online)

2 Related Works

Deep Learning-Based Inverse Tone Mapping. As with other image restoration problems, inverse tone mapping involves restoring lost signal information. To solve this problem, conventional hand-crafted algorithms in this field deduce a function that infers pixel luminance from the lightness and the relations between spatially adjacent pixels of a given image [10, 11], create a pseudo multi-exposure image stack [12], or merge optimally exposed regions of the LDR red/green/blue color components to generate an HDR image [13]. By contrast, methods using deep learning [1,2,3] belong to example-based learning and have been successfully applied to restore the lost dynamic range of LDR images. In other words, these deep neural networks estimate a mapping from pixel brightness to luminance from a given training set and generate HDR images from given LDR images. Endo et al.'s method [1] creates a multi-exposure stack for a given LDR image using a convolutional neural network (CNN) architecture that consists of three-dimensional convolutional layers. Similarly, Lee et al.'s method [2] constructs a multi-exposure image stack using a CNN-based network designed so that the network becomes deeper as the difference in exposure values between the input and the image to be generated increases. By contrast, Eilertsen et al.'s method [3] detects the saturated regions of an underexposed LDR image using a CNN-based network and produces the final HDR image by combining the given LDR image with the estimated saturated regions. All of these methods require further networks (or parameters) to generate the additional images needed to create a final HDR image with a wider dynamic range.

Deep Learning and Adversarial Network Architecture. Since AlexNet [14] garnered considerable attention in image classification, deep learning has been used in various fields, such as computer vision and signal processing, achieving performance that conventional methods could not reach. To train deep neural networks, techniques such as the residual block [15] and skip connection [16] have been introduced. These techniques smooth the loss landscape over the weight space and make networks easier to train [17]. Based on these methods, various neural network structures have been proposed, making it possible to generate high-quality images with neural networks in image restoration.

The GAN structure proposed by Goodfellow et al. [4] is a neural network framework that enables far more efficient unsupervised learning than conventional generative models. However, GAN training is unstable. Hence, various formulations of the min-max problem have recently been proposed for stable training: WGAN [18], LSGAN [19], and f-GAN [20]. In addition, by extending the basic GAN structure, recent studies have shown remarkable success in image-to-image translation between two different domains [6,7,8]. Ledig et al. [21] proposed SRGAN, a network capable of recovering high-frequency detail using the GAN structure, and successfully restored photo-realistic images with it. Isola et al. [6] demonstrated success in image-to-image translation using a simple combination of a modified conditional GAN loss [22] and the L1 loss.

Fig. 2. The structural relationship between the deep chain HDRI [2] and the proposed network: the proposed network folds the chain of sub-networks, which can be interpreted as a structure in which the sub-networks share weight parameters.

3 Proposed Method

We first analyze the latest deep learning-based algorithms that focus on stack restoration and identify their problems. As a solution, we propose a novel neural network that reconstructs the deep chain HDRI structure [2]. Figure 2 shows the overall structure of the proposed method.

3.1 Problems of Previous Stack-Based Inverse Tone Mapping Methods Using Deep Learning

Inverse tone mapping algorithms that reconstruct the HDR image from an estimated multi-exposure stack must generate images with different exposure values. When producing these images, previous methods [1, 2] generate LDR images separated from the given input by a uniform exposure difference T (e.g., \(T=1\) or 0.7). In this case, generating 2M images with different exposure values from a given image requires 2M sub-networks, because each sub-network represents the relationship between the input image and an image whose exposure value differs by \(i \times T\), for \(i=\pm 1, \pm 2, \cdots , \pm M\). Hence, these methods have the disadvantage that the number of networks grows linearly as the dynamic range is widened. In addition, training the additional networks requires different datasets and optimization processes. Moreover, these methods fail to restore some patterns, creating artifacts that do not exist in the scene. To solve these problems, we define two neural networks, \(G^{plus}\) and \(G^{minus}\), one for each direction of change in the exposure value (plus or minus). These networks are constrained to generate images that account for adjacent pixels using a conditional GAN [22]. Using them, we infer images with relative exposure \(+T\) and \(-T\) for a given image.

3.2 Training Process Using an Adversarial Network Architecture

A conditional GAN architecture constrained by input images produces higher-quality images than the basic GAN structure [6]. Therefore, we design the architecture to be conditioned on the exposure value of the given input using a conditional GAN structure. In other words, to convert images to a relative exposure value of \(+T\) (or \(-T\)), we define a discriminator network \(D^{plus}\) (or \(D^{minus}\)) that outputs the probability that a given pair of images is real rather than fake.

The proposed architecture seeks the optimal solution of the min-max problems in Eqs. (1) and (2):

$$\begin{aligned} G^{plus},D^{plus} = \min _G \max _D \{\mathbb {E}_{I^{EV i+1},I^{EV i}} [\log D(I^{EV i+1},I^{EV i})] \\ + \mathbb {E}_{I^{EV i},z} [\log (1-D(G(I^{EV i},z),I^{EV i}))]\}, \nonumber \end{aligned}$$
(1)
$$\begin{aligned} G^{minus},D^{minus} = \min _G \max _D \{\mathbb {E}_{I^{EV i-1},I^{EV i}} [\log D(I^{EV i-1},I^{EV i})] \\ + \mathbb {E}_{I^{EV i},z} [\log (1-D(G(I^{EV i},z),I^{EV i}))]\}, \end{aligned}$$
(2)

where \(I^{EV i}\) is an image with exposure value EV i, z is a random noise vector, and \(\mathbb {E}\) is the expectation. For \(D^{plus}\), we set the pair \((I^{EV i+1}, I^{EV i})\) as real and the pair \((G(I^{EV i},z), I^{EV i})\) as fake.
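
As an illustration of Eq. (1), the following PyTorch-style sketch (our illustration, not the authors' code) shows how the conditional real/fake pairs for \(D^{plus}\) can be formed. Channel-wise concatenation of the pair is an assumption, and the noise z enters implicitly as dropout inside the generator (see Sect. 3.5).

```python
import torch

# Hedged sketch: forms the conditional pairs fed to D^plus per Eq. (1).
# (I^{EV i+1}, I^{EV i}) is real; (G^plus(I^{EV i}, z), I^{EV i}) is fake.
def make_pairs(g_plus, img_ev_i, img_ev_i_plus_1):
    fake = g_plus(img_ev_i)  # z is supplied as dropout noise inside g_plus
    real_pair = torch.cat([img_ev_i_plus_1, img_ev_i], dim=1)
    fake_pair = torch.cat([fake.detach(), img_ev_i], dim=1)  # detach for the D update
    return real_pair, fake_pair
```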

3.3 Structure of the Proposed Neural Network Architecture

The specific network settings of the generator and discriminator are provided in the supplementary document (Fig. 3).

Generator: U-Net [23] Structure. We adopt an encoder-decoder model as the generator structure. As the data passes to the next encoder layer, the size of the feature map is halved vertically and horizontally; in the decoder, it is conversely doubled. The abstracted feature map is then reassembled with the earlier feature maps to create the desired output through a structure that increases the width and height of the feature map. In this structure, we add skip-connections between encoder layers and decoder layers, so that the characteristics of low-level features are reflected in the output. The downsampling block consists of a convolutional layer, one batch normalization layer, and one parametric ReLU (PReLU) [24]. The upsampling block contains an upsampling layer, one convolutional layer, one batch normalization layer, and one PReLU. The upsampling layer doubles the feature map size using nearest-neighbor interpolation. As with the deep chain HDRI, we used PReLU for the network inferring relative \(EV +1\) and MPReLU [2] for the opposite direction. A minimal sketch of these blocks follows.
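
The following PyTorch sketch illustrates the downsampling and upsampling blocks described above. Kernel sizes and channel widths are our assumptions (the exact settings are in the supplementary document), and the MPReLU of [2] is replaced by a plain PReLU for brevity.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Encoder block: strided conv halves H and W (assumed kernel/stride)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(),
        )

    def forward(self, x):
        return self.body(x)

class UpBlock(nn.Module):
    """Decoder block: nearest-neighbor upsampling doubles H and W."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(),
        )

    def forward(self, x, skip):
        # Skip connection: reassemble decoder features with encoder features.
        return torch.cat([self.body(x), skip], dim=1)
```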

Fig. 3. Structure of the proposed generators \(G^{plus}\) and \(G^{minus}\).

Discriminator: Feature Matching. Neural networks with the GAN structure are difficult to train [4, 5, 18,19,20]. In particular, when the discriminator fails to distinguish clearly between real and fake, finding the desired solution of the min-max problem becomes difficult. To solve this problem, we train the generator to match the features on an intermediate layer of the discriminator, as in feature matching for the basic GAN [5]. Accordingly, the proposed discriminator is similar to the Markovian discriminator structure [6, 25]. This discriminator generates feature maps that account for the neighboring pixels of an input through convolutional layers, and outputs the probability that each patch of the input image is real. Unlike a pixel-wise loss, the loss expressed by the discriminator network represents a structured loss, such as structural similarity, feature matching, or a conditional random field [26]. In other words, the loss produced by this discriminator allows the generator to create natural images that reflect the relationships between adjacent pixels. The proposed discriminator is composed of convolution blocks, each including one convolution layer, one batch normalization layer, and one leaky ReLU layer [27]. The activation function of the last convolution block is a sigmoid function, and there is no batch normalization layer in the first and last layers (Fig. 4).
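
A minimal PatchGAN-style sketch of \(D^{plus}\) / \(D^{minus}\) follows. The depth and channel widths are our assumptions (the supplement gives the exact settings); the six input channels correspond to the candidate image concatenated with the conditioning input image.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, use_bn=True):
    """One discriminator block: conv -> (batch norm) -> leaky ReLU."""
    layers = [nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1)]
    if use_bn:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return layers

discriminator = nn.Sequential(
    *conv_block(6, 64, use_bn=False),           # no batch norm on the first layer
    *conv_block(64, 128),
    *conv_block(128, 256),
    nn.Conv2d(256, 1, 4, stride=1, padding=1),  # no batch norm on the last layer
    nn.Sigmoid(),                               # per-patch real/fake probability map
)
```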

Fig. 4. Structure of the proposed discriminators \(D^{plus}\) and \(D^{minus}\).

3.4 Loss Functions

For \(G^{plus}\) and \(G^{minus}\), we set an objective function that combines the following two losses for training. We set the relative weight of the losses to \(\lambda =100\) through experiments. The final objective is:

$$\begin{aligned} G^{plus}&= \mathop {\mathrm {arg\,min}}\limits _G L_{LSGAN} (G) + \lambda L_{L1} (G) \text { for training pairs } (I^{EV 1},I) \text { and} \end{aligned}$$
(3)
$$\begin{aligned} G^{minus}&= \mathop {\mathrm {arg\,min}}\limits _G L_{LSGAN} (G) + \lambda L_{L1} (G) \text { for training pairs } (I^{EV -1},I), \end{aligned}$$
(4)

where I is an input image and \(I^{EV 1}\) (or \(I^{EV -1}\)) is an image with a relative exposure difference of \(+1\) (or \(-1\)) from the given I.

GAN Loss. Because the basic GAN structure [4] is unstable during training, we use LSGAN [19] to find the optimal solution of the min-max problem. For an input image x, a reference image y, and random noise z,

$$\begin{aligned} L_{LSGAN}(D)&= \frac{1}{2} \mathbb {E}_{x,y} [(D(y,x)-1)^2 ]+ \frac{1}{2} \mathbb {E}_{x,z} [(D(G(x,z),x))^2 ], \end{aligned}$$
(5)
$$\begin{aligned} L_{LSGAN}(G)&= \mathbb {E}_{x,z} [(D(G(x,z),x)-1)^2], \end{aligned}$$
(6)

where G and D are the networks being trained. We halve the loss of the discriminator relative to the generator to delay the discriminator's training and thereby stabilize overall learning.

Content Loss. The pixel-wise mean absolute error (MAE) loss \(L_{L1}\) is defined as:

$$\begin{aligned} L_{L1} (G) = \mathbb {E}_{x,y,z} [||y - G(x,z)||_1]. \end{aligned}$$
(7)

Calculating the pixel-wise difference between two images with the L2 norm produces blurrier images than the L1 norm in image restoration [28]. Therefore, we use the L1 loss as a term of the objective function to recover the low-frequency components.
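
A compact sketch of the combined objective in Eqs. (3)-(7), with \(\lambda = 100\) as above; d_real and d_fake denote the per-patch discriminator outputs D(y, x) and D(G(x, z), x), as produced by the discriminator sketched earlier.

```python
LAMBDA = 100.0  # relative weight of the L1 term (Sect. 3.4)

def d_loss(d_real, d_fake):
    # Eq. (5): halved relative to the generator loss to delay D's training.
    return 0.5 * ((d_real - 1).pow(2).mean() + d_fake.pow(2).mean())

def g_loss(d_fake, fake, reference):
    lsgan = (d_fake - 1).pow(2).mean()    # Eq. (6)
    l1 = (reference - fake).abs().mean()  # Eq. (7): pixel-wise MAE
    return lsgan + LAMBDA * l1            # Eqs. (3)-(4)
```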

Fig. 5. The training process of the proposed network architecture: the generators are trained to minimize the L1 loss and to defeat the discriminator networks. As training progresses, the discriminator learns to distinguish the pair (reference, input) from the pair (estimated image, input).

3.5 Optimization Process

The proposed architecture is trained in two phases, as shown in Fig. 5. In the first phase, we use only the L1 loss; in the second phase, we additionally use the GAN loss. The two phases run for the same number of epochs (a 1:1 ratio). In the second phase, the discriminator and generator alternate, one step each, to minimize their respective objective functions. We use the Adam optimizer [29] with a learning rate of 0.00005 and momentum parameters \(\beta _1 = 0.5\) and \(\beta _2 = 0.999\). We set the batch size to one. Dropout noise is added during training.
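
The schedule can be summarized with the following sketch, which reuses d_loss and g_loss from the sketch above and assumes a U-Net generator and the PatchGAN discriminator assembled from the earlier blocks; the data loader name is a placeholder.

```python
import torch

opt_g = torch.optim.Adam(generator.parameters(), lr=5e-5, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=5e-5, betas=(0.5, 0.999))

for epoch in range(20):                      # 10 epochs per phase (Sect. 4)
    use_gan = epoch >= 10                    # phase 2 adds the GAN loss
    for inp, ref in loader:                  # batch size 1; dropout acts as z
        fake = generator(inp)
        if use_gan:                          # one discriminator step
            opt_d.zero_grad()
            real_pair = torch.cat([ref, inp], dim=1)
            fake_pair = torch.cat([fake.detach(), inp], dim=1)
            d_loss(discriminator(real_pair), discriminator(fake_pair)).backward()
            opt_d.step()
        opt_g.zero_grad()                    # one generator step
        if use_gan:
            g_loss(discriminator(torch.cat([fake, inp], dim=1)), fake, ref).backward()
        else:
            (ref - fake).abs().mean().backward()  # phase 1: L1 only
        opt_g.step()
```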

3.6 Inference

First, we generate images \(\hat{I}^{EV 1}\) and \(\hat{I}^{EV -1}\) from the given LDR image using \(G^{plus}\) and \(G^{minus}\), as shown in Fig. 6. Next, we obtain \(\hat{I}^{EV 2}\) and \(\hat{I}^{EV -2}\) by using \(\hat{I}^{EV 1}\) and \(\hat{I}^{EV -1}\) as the inputs of \(G^{plus}\) and \(G^{minus}\), respectively. We repeat this process recursively to create a multi-exposure stack, as sketched below. Figure 6 shows an example of generating the multi-exposure stack up to \(EV \pm 3\).
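
The recursion can be summarized as follows (a minimal sketch; g_plus and g_minus denote the trained \(G^{plus}\) and \(G^{minus}\)):

```python
def generate_stack(image, g_plus, g_minus, m=3):
    """Each pass through g_plus (or g_minus) shifts the relative EV by +1 (or -1)."""
    stack = {0: image}
    up, down = image, image
    for i in range(1, m + 1):
        up, down = g_plus(up), g_minus(down)  # EV +/- i from EV +/- (i-1)
        stack[i], stack[-i] = up, down
    return stack                              # 2m + 1 images for HDR merging
```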

Fig. 6. The multi-exposure stack generation process of the proposed structure.

4 Experimental Results

For the dataset, we used 48 stacks of the VDS dataset [2] for training, and the other 48 stacks of the VDS dataset plus 41 stacks of the HDREye dataset [30] for testing. The VDS dataset is composed of images taken with a Nikon 7000, and HDREye consists of images taken with a Sony DSC-RX100 II, Sony NEX-5N, and Sony \(\alpha 6000\). Both the VDS and HDREye datasets consist of seven images per scene, each with uniformly spaced exposure levels. We set the unit exposure value T to one exposure stop at \(ISO-100\), as in the deep chain HDRI [2]. Using Debevec et al.'s algorithm [31], we synthesized an HDR image from the generated stack, and we generated tone-mapped images using Reinhard et al.'s [32] and Kim and Kautz's [33] methods through the HDR Toolbox [34]. For each image pair with an exposure value difference, we set the image with the lower exposure value as the input and the other image as the reference when training \(G^{plus}\) (\(G^{minus}\) was trained in the opposite way). We randomly cropped sub-images with \(256 \times 256\) pixel resolution from the training set, choosing crops that contained adequate information about the entire image rather than arbitrary patches, thereby providing 20,700 training pairs. We set the number of epochs of the first and second training phases to 10 each.
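
The pair preparation can be sketched as follows (a hypothetical helper; the criterion for selecting informative crops is not specified beyond the description above, so this sketch samples uniformly at random):

```python
import random

def random_crop_pair(low_ev_img, high_ev_img, size=256):
    """Same-location crops from an adjacent-EV pair: lower EV as input,
    higher EV as reference for G^plus. Assumes (H, W, C) arrays."""
    h, w = low_ev_img.shape[:2]
    top = random.randint(0, h - size)
    left = random.randint(0, w - size)
    window = (slice(top, top + size), slice(left, left + size))
    return low_ev_img[window], high_ev_img[window]
```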

First, to verify that the images were generated successfully, we compared them with the ground truth using the peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and multi-scale SSIM (MS-SSIM) on test images with \(512 \times 512\) pixel resolution. Second, we compared our method with the state-of-the-art deep learning algorithms [1,2,3]. Finally, we confirmed the performance of the proposed method by testing two loss configurations: L1 loss alone and L1 + GAN loss.
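
A minimal evaluation sketch for the first comparison, assuming scikit-image as the metric implementation (a tooling choice of ours, not the paper's; MS-SSIM is omitted here):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred, gt):
    """PSNR and SSIM of a predicted 8-bit image against the ground truth."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return psnr, ssim
```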

4.1 Comparison Between the Ground Truth LDR and Inferred LDR Image Stacks

Table 1 and Fig. 7 show several results and comparisons between the estimated and ground truth stacks. In addition, we compared our results with the deep chain HDRI method [2], which estimates a stack with the same unit exposure value \(T=1\). In the proposed method, the similarity between the inferred LDR and reference images decreases as the difference in exposure value increases, because artifacts are amplified as the input image passes recursively through the network to generate images with higher exposure values. However, because the proposed method uses the GAN structure, in which the discriminator evaluates image quality by considering adjacent pixels, its inferred images are more similar to the ground truth than those of the deep chain HDRI method.

Table 1. Comparison of the ground truth LDR and inferred LDR image stacks.
Fig. 7. Comparison of the ground truth LDR and inferred LDR image stacks.

4.2 Comparisons with State-of-the-art Methods

For quantitative comparisons with the state-of-the-art methods, we computed the PSNR, SSIM, and MS-SSIM of tone-mapped HDR images against the ground truth. We also used HDR-VDP-2 [35], which is based on the human visual system, to evaluate the estimated HDR images. We set the input parameters of the HDR-VDP-2 evaluation as follows: a 24-inch display, a viewing distance of 0.5 m, peak contrast of 0.0025, and gamma of 2.2. To establish a baseline, we report a comparison with HDR images inferred by Masia et al.'s method [36] using exponential expansion. Table 2 and Fig. 8 show the evaluation results. In addition, to verify physics-based reconstruction, we converted an LDR image of a color checker into an HDR image, using LDR and HDR image pairs that include a color checker board [30]. The results of this verification are shown in Fig. 9.

The proposed method exhibited performance similar to that of the deep chain HDRI [2]. Moreover, the average PSNR of the tone-mapped images was 3 dB higher than that of Endo et al. [1] and 10 dB higher than that of Eilertsen et al. [3]. On the HDREye dataset, which consists of images with characteristics different from the training set, the proposed method outperformed the other methods [1,2,3] in HDR-VDP Q-score in almost all cases. The images reconstructed by the proposed method were more similar to the ground truth than the others in overall tone and average brightness, as shown in Fig. 8. In addition, the dark and saturated regions of the input image were restored.

Table 2. Comparison of the ground truth HDR images with HDR images inferred by [1,2,3, 36] and ours. Colored entries indicate the best and second-best performance.
Fig. 8. Comparison of the ground truth HDR images with HDR images inferred by [1,2,3] and the proposed method (ours).

Fig. 9. Comparison of the ground truth HDR with HDR images inferred by [1,2,3] and the proposed method (ours) in terms of physical luminance.

4.3 Comparison of the Different Loss Functions

To evaluate the effect of the GAN loss term, we compared images generated by the proposed method with results trained using only the L1 loss. When using only the L1 loss, we trained the network for 20 epochs. Table 3 presents the quantitative comparison. For images tone-mapped with Reinhard's TMO [32], the average PSNR of the proposed method with L1 + GAN was 2.27 dB higher than the L1-only variant; for images tone-mapped with Kim and Kautz's TMO [33], it was 1.29 dB higher. Figure 10 shows tone-mapped HDR images generated by the proposed method using Reinhard's TMO. The network trained with the L1 loss alone generated images with prominent artifacts, whereas the architecture with the GAN loss did not.

Table 3. Average values of image quality metrics PSNR and VDP quality score on the testing dataset for different cost functions.
Fig. 10. Comparison of the ground truth HDR images with HDR images inferred using L1 and L1 + GAN losses. The proposed method generates fewer artifacts than the network trained with L1 alone.

5 Conclusion

We proposed a deep neural network architecture based on the GAN framework to solve the inverse tone mapping problem, reconstructing missing signals from a single LDR image. We trained this CNN-based network to infer the relationship between relative exposure values using a conditional GAN structure. As a result, the proposed method generates an HDR image in which the saturated (or dark) regions of the given LDR image are recovered. This network differs from existing networks [1, 2] in that it converts an LDR image into a non-linear LDR image corresponding to \(+1\) or \(-1\) exposure stops. This property allows the architecture to generate images with varying exposure levels without additional networks or training processes. In addition, we obtained a relatively simple network structure by replacing the deep chain structure of the deep chain HDRI with a recursive structure.