1 Introduction

High-dynamic-range (HDR) images capture the luminance of real-world scenes, which ranges from extreme darkness to direct sunlight. Details of both the shadow and highlight areas present in a high-dynamic scene can be recovered in a single HDR image. In contrast, standard digital imaging produces low-dynamic-range (LDR) images, in which the luminance dynamics are replaced by a discrete luma range. This limits the capture of details in the shadow and highlight regions of the scene, resulting in under-/over-exposure.

The camera response function (CRF) gives the relationship between the luminance and the luma up to a scale factor. Hence, to recover an HDR image using standard digital imaging, one must estimate the CRF and recover details for black/white image pixels. The most common way of creating an HDR image that satisfies these two requirements is to merge multiple LDR images, taken at various exposure times and referred to in this paper as multi-exposure images.

Despite the efficiency of multi-exposure methods, the process of creating an HDR image is usually time-consuming. First, users are required to use a tripod and to adjust the camera exposure time each time they take an image (if the camera does not include an exposure bracketing function). Moreover, during the shooting process, misalignment can become an issue, especially when there are moving objects in the scene. Consequently, more time is likely to be spent aligning the images in the hope of correcting ghosting artifacts.

To represent the atmosphere and the details of a real-world environment, users often need to take more than three exposure images [25]. However, this may increase the risk of misalignment and noise. In the particular case of dark environments with a high luminance range, decreasing the exposure time makes it possible to capture fine details in the highlights, but may significantly increase the level of noise in the computed HDR image.

Instead of taking several multi-exposure images of the same scene, we propose a method that uses only two images to recover an HDR image—a non-flash image, taken at a certain exposure value, and its corresponding flash image. Our method can also be used for low-dynamic scenes to enhance the quality of a non-flash image with the help of a flash image. Non-flash images represent the genuine atmosphere of the original scene lighting. However, especially when shot in dark environments, non-flash images are often noisy and lack important details (in under-/over-exposed pixels). In contrast, flash images contain more details, but they do not preserve the original scene lighting.

The first key idea behind our HDR image creation lies in mimicking the CRF with a brightness function. The brightness function used in our method aims to represent the human perception of a scene at various brightness levels, which is precisely what digital cameras are designed to reproduce. Therefore, we strongly believe that the CRF can be well approximated by our brightness function. We alter the brightness of a non-flash image by a one-parameter gamma correction, yielding a sequence of brightness exposure images. To create an HDR image, we then need to recover the missing details of the multiple brightness images (details for which the brightness correction provides no new information).

The second key idea of our method consists in recovering these missing details by using reliable information from the flash image. To retain the original ambience of the scene while preserving the details of the flash image, we propose a novel bi-local chromatic adaptation transform (CAT). The bi-local CAT is applied directly to the flash image in order to adapt its brightness to that of the non-flash image. As the non-flash brightness is lower than the flash brightness, the bi-local CAT remaps the flash pixel values to values corresponding to a lower brightness. Therefore, to increase the brightness dynamics and the contrast of the flash image, we carry out the bi-local CAT on each of the multiple brightness images. That way, we obtain the final multi-exposure images, which we merge into an HDR image. We apply our method to dark environment scenes with a high dynamic range, for which the reach of the flash is significant.

The main contributions of this paper are fivefold:

  • Automatic non-flash image brightness correction;

  • Bi-local CAT for automatic creation of multi-exposure images from only two images—flash and non-flash;

  • Automatic recovery of HDR images from the computed multi-exposure images;

  • Enhancement of a non-flash image using a flash image;

  • Automatic removal of the soft shadows of the flash image as well as diminution of flash reflections.

The advantages of our method over classical multi-exposure methods are the following: (1) the number of images required to obtain an HDR image is reduced to two; (2) a tripod is not required when the flash and non-flash images can be taken one after the other within a short period of time (therefore, our method is suitable for handheld device applications); (3) ghosting artifacts and misalignment are kept to a minimum. If a small misalignment between the two images occurs, our method is able to overcome it.

2 Related works

The entire dynamic range of a real-world scene cannot be captured by today’s camera sensors. That is why digital images of scenes with a high-dynamic luminance range are either under- or over-exposed. The classical technique for obtaining a high-dynamic-range image without under-/over-exposed regions uses a set of images taken at various exposure settings [3, 17]. Debevec et al. [3] first exploit the reciprocity property of imaging systems to construct the response curve of multi-exposure images and to recover their HDR radiance map. Conversely, Mann et al. [17] compute a floating-point image as a representation of an “undigital” image with an extended dynamic range, without any prior knowledge about the response curve of the imaging device. The floating-point HDR image is again computed from a set of multi-exposure images. The general concept of using multi-exposure images for creating HDR images is widely exploited in today’s photography. However, this approach has several drawbacks, including possible image misalignment and ghosting for scenes with moving objects. To this end, a number of techniques have been designed to handle misalignment and ghosting artifacts [9, 10, 26, 28].

Fig. 1
figure 1

Main flowchart. In the noise removal step (blue boxes), we denoise the two input images. The brightness correction computes a sequence of multiple brightness images. Then, for each of the images in this sequence, we apply the bi-local CAT N times (by using the flash image) and we obtain the final multi-exposure images. They are merged into an HDR image in the last step (in red)

To overcome the main limitations of multi-exposure methods, Tocci et al. [29] propose an optical architecture which automatically captures three optically aligned images at different exposures by splitting the light from a single lens and focusing it onto high-, medium- and low-exposure imaging sensors. However, this optical advancement is not available for mass use, and its construction is costly. In contrast, other methods use a single coded image to recover per-pixel exposures [22, 23]. They rely on a spatially varying optical mask on the sensor, giving different exposures to adjacent pixels. The coded exposures are mapped to an HDR image using reconstruction techniques such as interpolation [23], piece-wise linear estimators based on Gaussian mixture models [1], and the recently proposed sparse reconstruction based on convolutional sparse coding [27]. However, such reconstructions are computationally costly, they require hardware modification, and they can introduce artifacts if the mask is regular and a simple interpolation is used.

Furthermore, a method for image brightening from a single image, using standard digital cameras, has recently been introduced [16]. Li et al. create three virtually exposed images from a single image by increasing the brightness of the under-exposed regions. The brightness increase is carried out by a non-decreasing function in a newly designed “simplified” CIE Lab color space. Unlike our method, Li et al.’s method does not explicitly compute and modify the brightness of the original image. It is used to brighten dark objects in outdoor scenes as well as to create a tone-mapped version of an HDR image by fusing the three virtual exposures. A brightening approach from a single image would not give plausible results if the input image contains a significant number of under-/over-exposed pixels, for which no information can be recovered from a single image.

Mertens et al. [21] propose an exposure fusion, which merges a sequence of multi-exposure images into an image with an extended luminance range that can be directly displayed on an LDR screen (a tone-mapped image). The fusion is guided by a series of metrics which ensure that only the well-exposed values of each exposure image are kept in the result. Unlike the exposure fusion, which combines several multi-exposure images into one enhanced image, two other methods introduce image enhancement techniques for flash photography, relying on only two images. The methods in [5, 24] exploit the properties of flash and non-flash image pairs for dark environments. These methods combine the non-flash ambient light with the details from the flash image using a bilateral-filter-based image decomposition. That way, they enhance the quality of the non-flash image. Unlike HDR imagery, which admits a number of LDR renderings (via tone mapping), the methods in [5, 24] generate a single LDR image, which cannot be extended to an HDR image. Other methods also take two differently exposed images as input. The method in [31] is applied to blurred and noisy image pairs for the purpose of image deblurring, whereas the methods in [11, 14] take differently exposed subsequent frames from a video sequence to reconstruct an HDR video.

Matsuoka et al. [20] also exploit the properties of the flash image, this time in the context of HDR imagery. To construct an HDR image, the authors integrate a sequence of multi-exposure images in the wavelet domain. Before merging the multi-exposure images, two steps are performed. First, the flash image is used to find an alpha mask of the shadow regions of the long-exposure image. Second, a noise removal technique, guided by the flash image, is applied to denoise these shadow regions. Unlike our method, Matsuoka et al.’s method does not explicitly involve the flash image in the creation of the HDR image (no flash image information is transferred into the final HDR image). Furthermore, similarly to multi-exposure methods, Matsuoka et al.’s method requires a tripod to shoot the multi-exposure images, and it is suitable only for static scenes.

3 Our method

In the present section, we introduce our method for computing an HDR image from two images—a flash image F and a non-flash image \(E_{0}\). Figure 1 illustrates the main flowchart of our method. The proposed method starts with a noise removal step, yielding noise-free flash and non-flash images. The brightness of the noise-free non-flash image is modified during a brightness correction step, at the end of which we obtain a sequence of multiple brightness images. The images in this sequence contain black and/or white pixels. In the next step, the iterative bi-local CAT, the missing details are recovered using information from the noise-free flash image. That way, we generate a final sequence of multi-exposure images, which we then merge into an HDR image.

3.1 Noise removal

Our method starts by denoising the flash and non-flash images. Even though the flash image is considered a reliable image containing no or very little noise, the flash may introduce grainy noise. Therefore, we apply a bilateral filter with a small kernel size to the flash image to handle any possible noise. The bilateral filter behaves well for well-lit images, such as flash images. In contrast, for images shot in dark environments without a flash, experiments show that the guided filter [12] performs better than both the bilateral and cross-bilateral filters [4, 5]. Therefore, we apply the guided filter to denoise the non-flash image.
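As a concrete illustration of this step, the gray-scale guided filter of He et al. [12] can be sketched with plain NumPy box filters. The radius `r` and regularization `eps` below are illustrative defaults, not values from the paper; using the image as its own guide gives the edge-preserving denoising applied to the non-flash image.

```python
import numpy as np

def box_filter(img, r):
    """Mean filter with window radius r, computed with integral images (edge-padded)."""
    pad = np.pad(img, r, mode="edge")
    c = np.cumsum(np.cumsum(pad, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))  # zero row/column so window sums are simple differences
    h, w = img.shape
    k = 2 * r + 1
    s = c[k:, k:] - c[:h, k:] - c[k:, :w] + c[:h, :w]
    return s / k ** 2

def guided_filter(p, guide, r=4, eps=1e-2):
    """Gray-scale guided filter (He et al. [12]): edge-preserving smoothing of p.
    With guide = p, this is the self-guided denoising of the non-flash image."""
    mean_I = box_filter(guide, r)
    mean_p = box_filter(p, r)
    var_I = box_filter(guide * guide, r) - mean_I ** 2
    cov_Ip = box_filter(guide * p, r) - mean_I * mean_p
    a = cov_Ip / (var_I + eps)       # per-window linear coefficients
    b = mean_p - a * mean_I
    return box_filter(a, r) * guide + box_filter(b, r)
```

In flat regions `var_I` is small, so `a` shrinks toward 0 and the output approaches the local mean (smoothing), while near strong edges `a` approaches 1 and the edge is preserved.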

3.2 Brightness gamma correction

Creating an HDR image from multi-exposure images requires a knowledge of the CRF. Several methods for recovering the CRF exist [3, 17]. They recover the CRF up to a scale factor from at least two multi-exposure images. Once recovered, the CRF could be used to compute multi-exposure images from a single image.

To simplify the process of creating multi-exposure images, we do not rely on prior knowledge of the CRF. Instead, we mimic the CRF with a brightness function. The brightness, one of the absolute color appearance attributes, is fundamental for our approach. It describes the intensity of the light source, and its sensation depends on the adaptation to the scene light source. Furthermore, it varies with the environment (dark, dim, bright, etc.). A key advantage of brightness over other color appearance attributes, such as lightness, is its unbounded range.

We compute the brightness \(Q_0\) of the non-flash image \(E_0\) using CIECAM02 [6]. The brightness \(Q_0\) is modified using a gamma correction function, where gamma is derived from a brightness-dependent parameter p. By varying this parameter, we obtain multiple brightness images \(E_p\). Their brightness \(Q_p\) is computed using the gamma correction, proposed by Bist et al. [2]:

$$\begin{aligned} Q_{p} = Q_0^{\gamma (p)} ,\,{\text {where}}\,\gamma (p) = 1 + \log \left( \frac{Q_{\max }}{p}\right) . \end{aligned}$$
(1)

The gamma value \(\gamma (p)\) is obtained as a function of the correction parameter p and the maximum brightness \(Q_{\max }\) of the non-flash image \(E_0\). The parameter p is expressed in terms of \(Q_{\max }\). Therefore, we either increase (for \(p \le Q_{\max }\)) or decrease (for \(p > Q_{\max }\)) the brightness of the non-flash image \(E_0\) to obtain each brightness exposure image \(E_p\). The optimal choice of parameter p is discussed in Sect. 3.5.

Fig. 2
figure 2

Images a, b and c are obtained with a bi-local CAT, a local iCAM CAT and a local CAT, discussed in [13], respectively. The highly contrasting areas of the non-flash image cannot be well-represented by a global illuminant, and this is the main reason for the local CATs to fail when used in the context of flash/non-flash photography

3.3 Iterative bi-local CAT

The brightness gamma correction, presented in the previous subsection, does not introduce new information, and therefore, details in the under-/over-exposed areas of the non-flash image \(E_0\) cannot be recovered. To tackle the limitations of using a single image for the recovery of an HDR image, we consider an extra image—the flash image F. This image can easily be taken alongside the non-flash image in less than one second and contains reliable information about the shadows of the non-flash image as well as more scene details.

We propose a novel CAT, which carries out a transformation of the flash image F with respect to each image \(E_p\). This transformation, which we call the bi-local CAT, aims to adapt the colors of the image F to those of the image \(E_p\) and to remove the impact of the flash on the original scene lighting, while preserving the details of the image F (except for the flash shadows and reflections). Compared to previous works [5, 24], our method allows for an advanced combination of flash/non-flash light, color and detail, and at the same time is robust to small misalignment, flash shadows and reflections.

The bi-local CAT extends the local CAT presented in iCAM [7, 15]. The local iCAM CAT would compute a global illuminant for the image \(E_p\) and would locally adapt the colors of the flash image F to this illuminant. However, the wide luminance range of the image \(E_p\), varying from pure black to pure white, cannot be correctly described by a single illuminant. To transfer the high-contrast areas of the image \(E_p\) onto the flash image F, the bi-local CAT computes a local representation of the illuminant of the image \(E_p\), instead of a global one, as well as a local representation of the illuminant of the flash image F. Like standard CATs [6, 15], the bi-local CAT starts by converting the RGB stimuli of both images F and \(E_p\) into spectrally sharpened RGB signals [6]. Then, we apply the von Kries normalization pixel-wise to convert the spectrally sharpened RGB stimuli (\(R^{F}\), \(G^F\), \(B^F\)) of the flash image into the adapted tristimulus responses (\(R_c\), \(G_c\), \(B_c\)) as follows:

$$\begin{aligned} R_c = \left( \frac{R^{E_p}_{w}}{R^{F}_w}D + (1 - D)\right) R^{F} \end{aligned}$$
(2)
$$\begin{aligned} G_c = \left( \frac{G^{E_p}_{w}}{G^{F}_w}D + (1 - D)\right) G^{F} \end{aligned}$$
(3)
$$\begin{aligned} B_c = \left( \frac{B^{E_p}_{w}}{B^{F}_w}D + (1 - D)\right) B^{F}, \end{aligned}$$
(4)

where the triples (\(R^{E_p}_{w}\), \(G^{E_p}_{w}\), \(B^{E_p}_{w}\)) and (\(R^{F}_{w}\), \(G^{F}_{w}\), \(B^{F}_{w}\)) are pixels from low-pass versions of the images \(E_p\) and F, respectively (more details in the following paragraph). The adaptation factor D is given as follows [6, 15]:

$$\begin{aligned} D = KS\left( 1 - \frac{1}{3.6}\mathrm{e}^{\left( \frac{-L_A-42}{92}\right) }\right) , \end{aligned}$$
(5)

where the scalar \(L_A\) is the adapting luminance, taken as 20\(\%\) of the luminance of a white object in the scene. The surround factor, denoted by S, equals 1 for an average surround, 0.9 for a dim surround and 0.8 for dark environments. In our method, we carry out an adaptation of the colors of the flash image, and therefore, the surround is considered average (\(S = 1\)). A coefficient \(K = 0.3\) is used by Kuang et al. [15] to avoid full adaptation and de-saturation of the colors. In contrast, we use a coefficient \(K = 1\) to perform a full adaptation. The adaptation factor D ranges from 0 (no adaptation) to 1 (full adaptation).

The von Kries normalization in Eqs. (2), (3) and (4) computes the per-pixel ratio of two low-pass images (\(R^{F}_{w}\), \(G^{F}_{w}\), \(B^{F}_{w}\)) and (\(R^{E_p}_{w}\), \(G^{E_p}_{w}\), \(B^{E_p}_{w}\)), called white images (following the notation in [6]). So far, the von Kries normalization has been carried out either globally [8], in which case the white images boil down to white points, or locally between a single-point illuminant and a white image [13, 15]. To the best of our knowledge, a CAT has never been applied in a bi-local context. Figure 2 shows the advantage of the bi-local CAT over two local CATs for the purposes of this paper.

The white images are computed directly from the flash image F and the image \(E_p\) as follows.

  • The flash white image is computed by applying the guided filter. We observed that in our context the guided filter outperforms the Gaussian and bilateral filters. Experiments show that the Gaussian filter fails to properly transfer the shadows of the image \(E_p\), introducing entirely new shadow regions. Moreover, the bilateral filter introduces visible halo artifacts around the edges. In contrast, the guided filter suppresses such halo artifacts, preserves the shadow boundaries of the image \(E_p\) and robustly sharpens the details of the flash image.

  • The white image of the image \(E_p\) is the image \(E_p\) itself. The image \(E_p\) is obtained from the image \(E_0\), to which we have applied the guided filter.
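Putting Eqs. (2)–(5) and the white images together, a single bi-local CAT step can be sketched as follows. The `low_pass` callable stands in for the guided filter used in the paper; the default adapting luminance `L_A = 100.0` and the small `eps` guarding against division by zero are illustrative assumptions.

```python
import numpy as np

def adaptation_factor(L_A, S=1.0, K=1.0):
    """Degree of adaptation D, Eq. (5); S = 1 (average surround) and K = 1
    (full adaptation) are the choices made in the paper."""
    return K * S * (1.0 - (1.0 / 3.6) * np.exp((-L_A - 42.0) / 92.0))

def bicat_step(F, Ep, low_pass, L_A=100.0):
    """One bi-local CAT step, Eqs. (2)-(4), applied channel-wise to
    spectrally sharpened RGB arrays of shape (H, W, 3).
    low_pass stands in for the guided filter building the flash white image;
    the white image of Ep is Ep itself (see the list above)."""
    D = adaptation_factor(L_A)
    F_white = low_pass(F)            # flash white image
    eps = 1e-6                       # guard against division by zero in dark pixels
    ratio = Ep / (F_white + eps)     # per-pixel von Kries gain
    return (ratio * D + (1.0 - D)) * F
```

When the two images coincide the per-pixel gain is 1 and the step leaves the flash image unchanged, which is the expected fixed point of the adaptation.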

When applied iteratively, the bi-local CAT robustly adapts the colors of the image F to the colors of the image \(E_p\) and progressively removes flash shadows and reflections. The iterations are performed as follows:

$$\begin{aligned} F^{p}_t = {\left\{ \begin{array}{ll} \mathrm{biCAT}(F, E_p), & \text {if }\, t=1;\\ \mathrm{biCAT}\left( F^{p}_{t-1}, E_p\right) , & \text {if }\, t \in [2, N]. \end{array}\right. } \end{aligned}$$
(6)

During the first iteration (\(t = 1\)), we carry out the bi-local CAT between the images F and \(E_{p}\). For the following iterations, we perform the bi-local CAT between the result from the previous iteration \(F^{p}_{t-1}\) and the image \(E_{p}\). After N iterations, we obtain the final exposure image \(F^{p}_{N}\). During each iteration t, the flash white image is recomputed from the result \(F_{t-1}^p\), whereas the white image of the image \(E_p\) remains unchanged. The two main properties of the iterative bi-local CAT are discussed hereafter.
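The iteration of Eq. (6) is a simple fixed-point loop; the sketch below assumes a `bicat_step` callable implementing Eqs. (2)–(4).

```python
import numpy as np

def iterate_bicat(F, Ep, bicat_step, N=8):
    """Iterative bi-local CAT, Eq. (6): the flash white image is recomputed
    from the previous result at each iteration (inside bicat_step), while
    the white image of Ep stays fixed."""
    result = F
    for _ in range(N):               # F_1 = biCAT(F, Ep); F_t = biCAT(F_{t-1}, Ep)
        result = bicat_step(result, Ep)
    return result
```

With a step performing a partial von Kries adaptation (e.g., D = 0.5 and identity white images, giving \(F_t = 0.5\,E_p + 0.5\,F_{t-1}\)), the iterates converge geometrically toward \(E_p\), illustrating the darkening and brightening behavior discussed hereafter.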

Property 1: Darkening When the ratio of the white images is less than 1, i.e., \(I^{E_p}_{w} / I^{F^p_t}_w < 1\), where I stands for the R, G and B channels, the bi-local CAT darkens the flash image F (left-hand plot in Fig. 3). As the white image of the image F is recomputed iteratively, the pixels of the flash image keep decreasing until reaching an iteration k, for which the white image ratio becomes close to 1 (the values \((R_c, G_c, B_c)\) remain unchanged after the iteration k; see Eqs. (2), (3) and (4)). To recover information in the under-exposed regions of the brightness multi-exposures, the maximum number of iterations need not exceed k. However, it still needs to be large enough for the bi-local CAT to transfer the scene ambience and remove flash shadows and reflections. More information on the optimal number of iterations is presented in Sect. 3.5.

Property 2: Brightening When the ratio of the white images is greater than 1, i.e., \(I^{E_p}_{w} / I^{F^p_t}_w > 1\), the bi-local CAT brightens the flash image (left-hand plot in Fig. 3). The pixel values of the flash image keep increasing until reaching an iteration l, for which the white image ratio becomes close to 1. After the iteration l, the values of the flash image remain unchanged.

Fig. 3
figure 3

The left-hand plot shows luminance histograms of a flash image (in green) and of the results \(F_t^p\). The bi-local CAT progressively darkens and brightens \(F_t^p\), extending its range and contrast. The right-hand plot shows the influence of the parameter p on the transformation of \(F_t^p\) for \(t = 8\). The smaller the value of p, the brighter the result \(F_t^p\) and the fewer the under-/over-exposed pixels. When \(p = Q_{\max }\), i.e., when the brightness image is identical to the non-flash image, no brightening is performed (the flash pixels are progressively darkened)

These two properties reveal the ability of the bi-local CAT to increase the dynamic range of the flash image F (by both darkening and brightening). They also reveal the importance of the brightness correction step in our algorithm. If we applied the iterative bi-local CAT only between the flash and the non-flash images (without computing multiple brightness images), we would progressively darken the values of the result \(F_t^p\) by shifting its histogram to the left (right-hand plot in Fig. 3). In this case, the final result would represent the brightness of the non-flash image \(E_0\) rather than the brightness of the scene. In contrast, once we obtain the sequence of multiple brightness images and perform the bi-local CAT, the histogram of the result \(F_t^p\) is shifted both to the left and to the right (we darken the pixels in the shadows and brighten the ones in the highlights).

3.4 Image fusion

The bi-local CAT yields a sequence of multi-exposure images, which are then merged to recover an HDR image. We use Debevec et al.’s fusion method [3], which relies on a CRF estimation. In our method, we estimate the CRF from the final multi-exposure images.
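The core of the merge in [3] can be sketched as follows, assuming the log-CRF g has already been estimated (for a linear sensor, g is simply the natural logarithm). The hat-shaped weighting follows the spirit of Debevec and Malik's weighting function, though the exact weights and the clamping below are illustrative.

```python
import numpy as np

def merge_hdr(images, exposures, g):
    """Debevec-Malik-style merge:
    ln E = sum_j w(z_j) * (g(z_j) - ln dt_j) / sum_j w(z_j).
    images: luma arrays in [0, 1]; exposures: exposure times dt_j;
    g: estimated log-CRF, applied element-wise."""
    w_sum = np.zeros_like(images[0])
    ln_E = np.zeros_like(images[0])
    for z, dt in zip(images, exposures):
        w = 1.0 - np.abs(2.0 * z - 1.0)      # hat weight favoring mid-tones
        ln_E += w * (g(z) - np.log(dt))
        w_sum += w
    return np.exp(ln_E / np.maximum(w_sum, 1e-6))
```

The weighting discards under-/over-exposed pixels of each image, so each radiance value is estimated mostly from the exposures in which it is well recorded.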

Additionally, we compute the real CRF from a sequence of real multi-exposure images to verify whether the CRF used in our method is similar to the real one. Figure 4 presents plots of the CRF computed from the final multi-exposure images and of the real CRF. We observe that the CRF used in our method approximates the real CRF well. This conclusion is based on several experiments involving various real image sets. Figure 4 also shows the CRF computed from the multiple brightness images. The CRF estimated after the iterative bi-local CAT is more accurate than the CRF estimated after the brightness correction. This reveals a key advantage of our method over methods based only on a brightness correction.

3.5 Choice of optimal values of p and N

The efficiency of our method greatly depends on the parameter p, used in the brightness correction step. We analyze which values of p allow us to compute a plausible approximation of the real CRF.

Fig. 4
figure 4

The CRF, computed after the bi-local CAT, is highly accurate for exposure values less than 0 but tends to overestimate the luma for positive exposure values. This overestimation is due to the use of a flash image, which successfully captures details in shadows but may not manage to represent all the finest details, belonging to a light source (see Fig. 7)

Fig. 5
figure 5

Plots a, b represent the SSIM scores (on the Y axis) between a real multi-exposure image and each of 400 multi-exposure images obtained with our method. The 400 images are computed for each iteration \(t \in [1, 10]\) and each value of the parameter p, indexed by \(j \in [0, 39]\). The X axis of each plot represents the index \(m \in [1, 400]\) of the final multi-exposure images, where each m corresponds to a pair of an iteration t and a value of p. The curves in each plot correspond to the different iterations t. We find a clearly defined peak per iteration (circled in blue), optimizing the SSIM score. The highest value of p per curve (iteration) is circled in red, whereas the lowest value is circled in purple (analogously for plot (b), where we give an example only for the fourth iteration)

First, for every iteration \(t \in [1, 10]\) of the bi-local CAT, we compute multi-exposure images by using each value of p from the set \(\{(0.6 + 0.1i)Q_{\max }\}_{i = 0}^{39}\) (for a total of 400 final multi-exposure images). Experiments showed that values of p lower than \(0.6\cdot Q_{\max }\) result in over-exposure of the majority of pixels in the result, and we therefore exclude them. Second, we compute the structural similarity metric (SSIM) [30] between each of the 400 multi-exposure images and each of several real multi-exposure images of the same scene (taken manually by a professional photographer). We observe a clearly defined peak, optimizing the SSIM value, for each iteration t (Fig. 5). The peaks for all iterations t (per real multi-exposure image) correspond to the same p, which remains unchanged across all the different sets of real multi-exposure images for which we performed this analysis. These sets of images were taken with two different types of cameras. Therefore, the value of p is also independent of the choice of camera. The value of p depends only on the exposure of the real multi-exposure image, but at the same time, it is insensitive to the choice of an exposure for the non-flash image.

We have experimentally derived the value \(p_i\) of the \(i\mathrm{th}\) final multi-exposure image as a function of \(Q_{\max }\) and the image index i, \(i \in \{1, \dots , M\}\):

$$\begin{aligned} p_i = \left( 1 - \frac{S \cdot C\cdot i}{10}\right) Q_{\max }, \end{aligned}$$
(7)

where C is a constant, which has been experimentally set to 0.7. The sign S is equal to 1 for an increase of the non-flash brightness \(Q_0\), or to \(-1\) for a brightness decrease. We have experimentally found that the use of \(M = 6\) final multi-exposure images (one of which is the non-flash image) helps generate HDR images close to the ground truth. Our experiments have indicated that the exposure value \(X_i\) of the \(i\mathrm{th}\) final multi-exposure image can be expressed as \(X_i = X_0 + S\cdot i\), where \(X_0\) is the exposure of the image \(E_0\). The final multi-exposure images, together with the computed exposure values, allow us to recover plausible HDR images.
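Equation (7) and the exposure values \(X_i\) can be sketched together as below; treating the non-flash image as index 0 and computing \(p_i\) for \(i = 1, \dots , M\) is an illustrative indexing assumption.

```python
import numpy as np

def exposure_parameters(Q_max, X0, S=1, M=6, C=0.7):
    """Correction parameters p_i (Eq. (7)) and exposure values X_i = X0 + S*i
    for the i-th final multi-exposure image, i = 1..M.
    S = +1 brightens (p_i < Q_max), S = -1 darkens (p_i > Q_max)."""
    i = np.arange(1, M + 1)
    p = (1.0 - S * C * i / 10.0) * Q_max
    X = X0 + S * i
    return p, X
```

For example, with \(Q_{\max } = 100\), \(X_0 = 0\) and \(S = 1\), the parameters decrease in steps of \(0.07\,Q_{\max }\) while the exposure values increase by one stop per image.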

In our experiments, the maximum SSIM score was reached during the eighth iteration. We therefore chose to perform \(N=8\) iterations of the bi-local CAT.

Fig. 6
figure 6

The real HDR image is computed by merging five real multi-exposure images. Our method produces an HDR image, the dynamic range of which is similar to that of the real HDR image (with a number of f-stops equal to 13). Moreover, the log2 luminance histograms show a similarity between the luminance distributions of our result and the real HDR image

Fig. 7
figure 7

Our HDR result is compared to four real HDR images, computed from two consecutive, two non-consecutive, three and five real multi-exposure images (images (a), (b), (c) and (d), respectively). The log2 luminance distribution of our result is similar to that of the HDR image obtained by merging five real exposure images. The similarity is also reflected in the HDR-VDP-2 color maps. As the merge of five multi-exposure images represents the ground truth better than the merge of either two or three multi-exposure images, our method performs better than the classical multi-exposure methods merging two and three multi-exposure images

4 Results and evaluation

In this section, we present our HDR results and evaluate their similarity to the ground truth.

4.1 Experimental setup

We have built a data set of images of real-world scenes, consisting of flash and non-flash images and real multi-exposure images. The flash and non-flash images were taken within a short period of time (less than one second) and were used to compute the results shown in this paper. Additionally, we took real multi-exposure images to recover a real HDR image per scene. A professional photographer chose the best exposure values in order to capture the finest details in the shadows and the highlights. To make the real HDR images representative of the ground truth, we used a tripod to avoid misalignment during the shooting process. We compare our results to the ground truth in the evaluation part of this section.

4.2 Recommendations for the choice of non-flash images

In our method, we choose the non-flash image \(E_{0}\) to be the lowest exposed image with less than 5% black pixels. Although the iterative bi-local CAT is able to recover the missing details in black pixel regions, when the percentage of black pixels exceeds 5%, the non-flash image becomes too low-exposed and noisy. This results in a trade-off between the fidelity of the result and the successful noise removal when applying our method. The noisier the image, the bigger the kernel size of the guided filter and the greater the loss of details. The non-flash image \(E_0\) may not be the lowest exposed image for a given scene; however, its exposure time is still sufficiently short to allow the flash and non-flash images to be taken in quick succession without the use of a tripod.

4.3 Evaluation

Figure 6 presents an HDR result obtained with our method, as well as a real HDR image of the scene. To evaluate the similarity between our HDR result and the real HDR image, we compute their luminance histograms (Fig. 6). Our method recovers the dynamic range of the real HDR images in our data set (resulting in the same number of f-stops as the real HDR images). The luminance distribution of our results is strongly correlated with the ground-truth luminance. Moreover, we adopt the perceptual metric HDR-VDP-2 [18] to visualize the perceptual difference between our HDR results and the ground truth. Red regions in the HDR-VDP-2 color-coded map indicate deviations from the ground-truth luminance. The color-coded maps in Fig. 6 reveal an overall high perceptual similarity between our result and the ground truth.

Fig. 8

Result of applying our method to dark-environment scenes with a high dynamic range. Our method is able to recover most of the scene dynamics, as shown in the false-color images. Snippets (a) and (b) visualize the most significant perceptual difference (also indicated by HDR-VDP-2) between the two HDR images. Our method recovers fine details from the flash image (the DVD labels), whereas the multi-exposure approach introduces noise. Moreover, our method avoids ghosting artifacts like those in the real HDR image (the tree branches)

Fig. 9

Image (a) presents our result. All under-exposed pixels (below a threshold) of the non-flash image are shown in white in image (b). These pixels are recovered in image (a) with the use of our iterative bi-local CAT. The flash and non-flash images were taken with a handheld camera, resulting in a small misalignment (the green circles), which is correctly handled by our bi-local CAT

The real HDR images aim to represent the ground truth by merging a number of multi-exposure images. The more multi-exposure images we merge, the closer the HDR image is to the real-world scene. To show how close our results are to the ground truth, in Fig. 7 we compare them to several HDR images, obtained by combining two, three and five real multi-exposure images. The HDR-VDP-2 metric indicates that our HDR result is visually similar to the HDR image computed from five real multi-exposure images. Moreover, the \(\log_2\) luminance distribution of our HDR image is highly correlated with that of the real HDR image obtained from five real multi-exposure images. In this sense, our result is closer to the ground truth than the HDR images recovered by merging two or three real multi-exposure images.
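For reference, merging multi-exposure images into an HDR radiance map is commonly done with a weighted average in the log domain, in the style of Debevec and Malik. The sketch below assumes linearized, aligned 8-bit inputs and a simple hat weighting; it is an illustration of the general technique, not the exact pipeline used to build our ground truth.

```python
import numpy as np

def merge_exposures(images, exposure_times):
    """Merge linearized LDR exposures into an HDR radiance map.

    Weighted average of log radiance estimates (Debevec & Malik style),
    assuming the 8-bit images are already linearized and aligned.
    """
    eps = 1e-8
    num = np.zeros_like(images[0], dtype=np.float64)
    den = np.zeros_like(images[0], dtype=np.float64)
    for img, t in zip(images, exposure_times):
        z = img.astype(np.float64) / 255.0
        # Hat weight: trust mid-tones, down-weight under-/over-exposed pixels.
        w = 1.0 - np.abs(2.0 * z - 1.0)
        # Each exposure gives a radiance estimate z / t; average in log space.
        num += w * (np.log(np.clip(z, eps, None)) - np.log(t))
        den += w
    return np.exp(num / np.clip(den, eps, None))
```

With more exposures, each pixel receives more well-exposed samples, which is why the five-exposure merge serves as the most reliable ground truth in our comparison.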

Despite the similarity with HDR images computed from five real multi-exposure images, our results may differ from real HDR images in shadow areas. While adapting to the colors of the image \(E_p\), our bi-local CAT preserves the details of the flash image in the shadows of the result. Conversely, taking low-exposure images in dark environments may cause noise in the shadows and compromise the integrity of the real HDR image. Figure 8 illustrates a key property of our method, i.e., the detail recovery. Our HDR image preserves the DVD labels in the shadows of the scene, unlike the HDR image obtained from five real multi-exposures.

The main advantage of our method over the multi-exposure approach is illustrated in Fig. 9. The flash and non-flash images, shown in the figure, were taken with a handheld camera, imitating a typical use case. Our method successfully recovers HDR images of non-still (slowly moving) objects (such as people posing for portraits) and avoids ghosting artifacts.

Fig. 10

The flash shadows and reflections from the flash image have been automatically removed by our method

Fig. 11

Non-flash quality enhancement. The gamma correction of the non-flash image reveals the missing details on the top-right and lower-right corners. These details are recovered with our method using the flash image (best viewed on screen)

Fig. 12

Images (a) and (b) are our results, obtained respectively by fusing the final multi-exposure images (with the method in [21]) and by applying the tone-mapper in [19] to the reconstructed HDR image. Image (c) is Eisemann et al.'s result [5], whereas image (d) is Mertens et al.'s result [21]. The flash and non-flash images are courtesy of [5]

Finally, another advantage of our method is the automatic removal of soft shadows from the flash image, carried out by the bi-local CAT. If the flash image contains shadows created by the flash, there is a risk that they will appear in the final result \(F^{p}_{N}\), making it look unnatural. It turns out, though, that eight iterations of our bi-local CAT are enough to completely remove soft shadows from our HDR result, as illustrated in Fig. 10. Our method also reduces reflections caused by the flash.
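While the paper's iterative bi-local CAT is considerably more elaborate, its core von Kries-style idea, rescaling colors by the ratio of locally estimated illuminants, can be sketched as follows. The box-filter local mean, the window radius, and the single (non-iterated) application are simplifications made purely for illustration.

```python
import numpy as np

def local_mean(img, r):
    """Box-filter local mean over a (2r+1) x (2r+1) window, per channel,
    computed with an integral image and replicated (edge) borders."""
    h, w = img.shape[:2]
    pad = np.pad(img, ((r + 1, r), (r + 1, r), (0, 0)), mode='edge')
    s = pad.cumsum(axis=0).cumsum(axis=1)
    win = s[2*r+1:, 2*r+1:] - s[:h, 2*r+1:] - s[2*r+1:, :w] + s[:h, :w]
    return win / float((2 * r + 1) ** 2)

def bilocal_adapt(flash, target, radius=15, eps=1e-6):
    """Von Kries-style local adaptation (illustrative only).

    Each flash pixel is rescaled, per channel, by the ratio of the local
    mean colors of the target and flash images, so the flash colors adapt
    toward the target's local illumination.
    """
    flash = np.asarray(flash, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mf = local_mean(flash, radius)
    mt = local_mean(target, radius)
    return flash * (mt + eps) / (mf + eps)
```

Because a flash shadow darkens the local mean of the flash image but not that of the target, the ratio compensates for it; the paper's iterative scheme repeats and refines this adaptation until soft shadows disappear.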

4.4 Non-flash image enhancement

Our method can be used in the context of non-flash image enhancement. We increase the quality of a non-flash image in terms of detail recovery and scene illumination enhancement, as shown in Fig. 11. Given flash and non-flash images, we automatically recover an HDR image and then use various tone-mapping operators to visualize it on an LDR screen. Figure 12 shows a comparison between our method and two state-of-the-art methods, all used in the context of non-flash image enhancement. Mertens et al. [21] fail to properly combine the flash and non-flash images, because the flash image is already well exposed. Eisemann et al. [5] produce a single image as the outcome of their method. In contrast, our method provides a number of enhanced images, each resulting from a different tone-mapping operator.
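As one example of such a tone-mapping step, the sketch below applies a Reinhard-style global operator to render an HDR image on an LDR display. This is an illustrative choice, not necessarily the operator of [19], and the key value of 0.18 is the conventional default, assumed here for the example.

```python
import numpy as np

def reinhard_tonemap(hdr, key=0.18, eps=1e-6):
    """Reinhard-style global tone mapping.

    Scale luminance to a target key (middle gray), compress it with
    L / (1 + L), then reapply the per-pixel luminance ratio to the colors.
    """
    lum = 0.2126 * hdr[..., 0] + 0.7152 * hdr[..., 1] + 0.0722 * hdr[..., 2]
    log_avg = np.exp(np.mean(np.log(lum + eps)))   # geometric mean luminance
    scaled = key * lum / log_avg
    ldr_lum = scaled / (1.0 + scaled)              # compress to [0, 1)
    ratio = ldr_lum / (lum + eps)
    return np.clip(hdr * ratio[..., None], 0.0, 1.0)
```

Swapping in a different operator at this stage yields a different enhanced image, which is what gives our method its variety of enhancement options.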

5 Conclusion

In this paper, we proposed a novel method for creating HDR images that relies on only two input images: a flash image and a non-flash image. Our method automatically creates multiple exposure images by brightening the non-flash image and bi-locally adapting the colors of the flash image to the brightened image. Our method computes HDR images that do not significantly differ from HDR images obtained by merging five manually taken multi-exposure images. We proposed a way of handling challenging dark-environment scenes, in which the non-flash image is often unreliable because it contains noise and lacks information. Moreover, our method can be used in the context of non-flash image enhancement, where, in comparison with existing methods, it provides various enhancement options. Due to the limited reach of the flash, our approach has limited applicability to outdoor scenes, which we leave for future work.