1 Introduction and related work

White balance is a basic procedure that is applied to nearly all images by the camera image signal processor (ISP). The goal of image white balancing is to remove undesirable color casts caused by the scene illumination. That is, this process aims to normalize the camera-rendered image’s colors such that any achromatic object appears gray (i.e., \(R = G = B\)) [1]. While it cannot correct all other colors, white-balance correction is often assumed to approximate color constancy (i.e., the appearance of object colors remains constant under different scene lighting conditions) [2]. Image white balancing does not target only the aesthetic aspect of camera-rendered images; it also improves the accuracy of other computer vision tasks, such as image classification and semantic segmentation [3,4,5].

Cameras apply white balance by first estimating the scene illumination color. This estimation is performed by a color constancy method, which typically falls into one of two categories: (1) statistical methods (e.g., [6,7,8]) that use heuristic statistical hypotheses to estimate the illuminant color of the captured image, and (2) learning-based methods (e.g., [9,10,11,12,13,14,15]) that rely on machine learning to predict the illuminant color given the input image [10, 13, 14] or its color histogram [9, 11, 15]. A global color channel scaling operation is then applied to remove the undesirable color cast. This white-balance procedure is applied to the camera sensor's raw image (i.e., a linear representation of the incoming light), after which the camera ISP applies a set of nonlinear procedures to render the final output image in the sRGB space [16, 17]. Due to these nonlinear procedures, the simple color scaling process, which is intended to fix linear raw images, does not work properly for camera-rendered images with white-balance errors [16].
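To make the raw-domain correction concrete, the following minimal NumPy sketch shows the per-channel (von Kries style) scaling described above; the function and variable names are ours, and the illuminant is assumed to be given by some color constancy method:

```python
import numpy as np

def white_balance_raw(raw_rgb, illuminant_rgb):
    """Diagonal white-balance correction applied to a linear raw image.

    raw_rgb:        H x W x 3 linear raw image, values in [0, 1].
    illuminant_rgb: estimated illuminant color (R, G, B) of the scene light.
    """
    illuminant = np.clip(np.asarray(illuminant_rgb, dtype=float), 1e-6, None)
    # Scale each channel so the illuminant maps to a neutral (gray) color.
    scales = illuminant[1] / illuminant
    return np.clip(raw_rgb * scales[None, None, :], 0.0, 1.0)
```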

An intuitive way to deal with such errors is to reconstruct a linear version of the input image, fix its white balance, and then re-process the image to its final form in the sRGB space. Though it seems trivial, this solution is impractical, as the camera ISP's procedures are unknown and a careful camera calibration process is required to accomplish this linearization task [16, 17].

Recently, Afifi et al. [16, 18,19,20] proposed solutions that deal with such white-balance errors in camera-rendered sRGB images without the need for raw reconstruction. The authors in [16, 18,19,20] process the improperly white-balanced input image in the sRGB space to generate a correctly white-balanced image. This sRGB-level processing was performed either by a polynomial color mapping, e.g., the KNN white-balance method (KNN-WB) [16] and the interactive white-balance method (I-WB) [19], or via deep neural networks, e.g., the deep white-balance methods (D-WB) [18, 20].

Inspired by this research direction, our solution also avoids any raw reconstruction and processes the input image in its original space (i.e., the sRGB space). Motivated by the remarkable results achieved by deep learning in several research fields, such as computer networking and communications [21,22,23], we propose a deep learning-based method to solve our problem. In contrast with prior work (e.g., [16, 18,19,20]), our proposed solution consists of two different stages to re-white balance the input image. The first stage estimates global mapping parameters to correct the input image colors without considering spatial information. The second stage then locally processes our initial corrected image based on its spatial information to generate the final corrected image. Figure 1 shows our result compared to state-of-the-art methods [16, 20].

Fig. 1

This work focuses on fixing potential white-balance errors in camera-rendered sRGB images. Our method is a two-stage learning framework that first applies a global correction to the input image colors and then applies a local process to produce the final result. Our method produces competitive results compared to state-of-the-art methods (i.e., KNN-WB [16] and D-WB [20]) (color figure online)

The major contributions of this work can be summarized as follows:

  1. We propose a novel framework to fix white-balance errors in sRGB images through a two-stage correction procedure.

  2. Unlike recent work that treats improperly white-balanced images with a single global color mapping operation, our two-stage framework first processes the input image colors globally, based on the image's color distribution, to generate an initial corrected image. This initial solution is then refined by a learned residual layer that locally adjusts our initial result to generate the final image. This two-stage strategy improves the perceptual results by considering local enhancements of the input image.

Extensive experiments are conducted on challenging test sets, and we show that our method produces competitive results when compared with state-of-the-art methods for correcting improperly white-balanced sRGB images. The rest of this paper is organized as follows. In Sect. 2, the methodology is presented. Section 3 presents the evaluation of our method through a set of experiments, ablation studies, and comparisons between the proposed method and the state-of-the-art methods. Finally, the paper is concluded in Sect. 4.

2 Method

We present a learning framework to correct the sRGB colors of input images that were rendered with camera white-balance errors. Figure 2 shows an overview of our framework. As shown, our framework consists of two steps to re-white balance the input sRGB image. In the first step, we apply a global mapping to the input sRGB image, \(I_{\textit{in}}\), in order to correct its colors in the sRGB space. This mapping process can be described as follows:

$$\begin{aligned} \hat{I}_{Gcrr}&= C \, \varphi \left( I_{\textit{in}}\right) , \end{aligned}$$
(1)
$$\begin{aligned} C&= N_1\left( H\left( I_{\textit{in}}\right) , \Theta _1\right) , \end{aligned}$$
(2)

where \(\varphi \) is a kernel operator that projects the three sRGB color channels into a higher-dimensional space as follows: \(\varphi \left( \langle R, G, B\rangle ^{T} \right) = \langle R, G, B, RG, RB, GB, R^2, G^2, B^2\rangle ^{T}\), and C is a \(3\times 9\) matrix generated by our network \(N_1\), which accepts the color histogram \(H\left( \cdot \right) \) of \(I_{\textit{in}}\) and processes it with its trainable weights \(\Theta _1\).
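As an illustration of Eqs. (1)–(2), the following PyTorch sketch applies a predicted \(3\times 9\) matrix C to every pixel of an image; the function names (kernel_phi, apply_global_mapping) are ours and not part of the original implementation:

```python
import torch

def kernel_phi(rgb):
    """Project N x 3 sRGB colors into the 9-D polynomial space used in Eq. (1)."""
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    return torch.stack([r, g, b, r * g, r * b, g * b, r ** 2, g ** 2, b ** 2], dim=1)

def apply_global_mapping(image, C):
    """Apply the 3 x 9 matrix C predicted by N_1 to an H x W x 3 image (Eq. 1)."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3)                 # N x 3
    corrected = kernel_phi(pixels) @ C.t()        # (N x 9) times (9 x 3) -> N x 3
    return corrected.reshape(h, w, 3).clamp(0, 1) # clipping is our assumption
```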

Fig. 2

Our re-white-balancing method processes the input image through two networks (\(N_1\) and \(N_2\)). Our networks (i.e., \(N_1\) and \(N_2\)) are trained in an end-to-end manner to produce the final sRGB image, \(\hat{I}_{\textit{crr}}\), with correctly white-balanced colors. See Sect. 2 for more details (color figure online)

In Eq. (2), our first network, \(N_1\), accepts a color histogram feature of the input image, \(I_{\textit{in}}\). This histogram feature represents the color distribution of the input sRGB image, \(I_{\textit{in}}\). To create a dense histogram feature, we rely on the RGB-uv histogram proposed in prior work [1, 13, 16, 24]. In particular, we create a \(64\times 64\times 3\) histogram feature of the input image \(I_{\textit{in}}\) in the log-chrominance space [9, 11, 25] as follows:

$$\begin{aligned} H(I_{\textit{in}}, d) = \sum _{i} I_{{\textit{in}}_{y(i)}} \left[ \left| I_{{\textit{in}}_{uj(i)}} - u\right|< \frac{\varepsilon }{2} \;\wedge \; \left| I_{{\textit{in}}_{vj(i)}} - v\right| < \frac{\varepsilon }{2} \right] , \end{aligned}$$
(3)

where \([\cdot ]\) is the indicator (Iverson) bracket, d refers to the histogram bin \(\langle u,v,c \rangle \), j refers to each color channel in the generated histogram feature, \(\varepsilon \) is the histogram's bin size, and \(i = \{1,\ldots ,n\}\) is the pixel index (here, n is the total number of pixels in the input image). The values of \(I_{{\textit{in}}_{y(i)}}\), \(I_{{\textit{in}}_{uj(i)}}\), and \(I_{{\textit{in}}_{vj(i)}}\) are computed as follows:

$$\begin{aligned} I_{{\textit{in}}_{y(i)}}&= \sqrt{I_{{\textit{in}}_{R(i)}}^{2} + I_{{\textit{in}}_{G(i)}}^{2} + I_{{\textit{in}}_{B(i)}}^{2}}, \nonumber \\ I_{{\textit{in}}_{u1(i)}}&= \log {\left( I_{{\textit{in}}_{R(i)}}\right) } - \log {\left( I_{{\textit{in}}_{G(i)}}\right) }, \nonumber \\ I_{{\textit{in}}_{v1(i)}}&= \log {\left( I_{{\textit{in}}_{R(i)}}\right) } - \log {\left( I_{{\textit{in}}_{B(i)}}\right) }. \end{aligned}$$
(4)

Likewise, \(I_{{\textit{in}}_{u2}}\), \(I_{{\textit{in}}_{v2}}\), \(I_{{\textit{in}}_{u3}}\), and \(I_{{\textit{in}}_{v3}}\) are generated as follows:

$$\begin{aligned} I_{{\textit{in}}_{u2}}&= -I_{{\textit{in}}_{u1}} , \quad I_{{\textit{in}}_{v2}} = -I_{{\textit{in}}_{u1}} + I_{{\textit{in}}_{v1}}, \nonumber \\ I_{{\textit{in}}_{u3}}&= -I_{{\textit{in}}_{v1}}, \quad I_{{\textit{in}}_{v3}} = -I_{{\textit{in}}_{v1}} + I_{{\textit{in}}_{u1}}. \end{aligned}$$
(5)
Table 1 Results on the Rendered WB dataset [16]

Finally, the histogram feature is normalized by the summation of the histogram’s bin values.
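For concreteness, a minimal PyTorch sketch of this histogram feature is given below; the uv value range, the bin-assignment details, and the small constant added before the logarithms are our assumptions and may differ from the original implementation:

```python
import torch

def rgb_uv_histogram(pixels, bins=64, boundary=3.0, eps=1e-6):
    """Sketch of the RGB-uv histogram of Eqs. (3)-(5): a bins x bins x 3 feature.

    pixels: N x 3 tensor of sRGB values in [0, 1]. The uv range
    [-boundary, boundary] and eps are assumptions, not taken from the paper.
    """
    r, g, b = pixels[:, 0] + eps, pixels[:, 1] + eps, pixels[:, 2] + eps
    y = torch.sqrt(r ** 2 + g ** 2 + b ** 2)                      # intensity I_y
    u1, v1 = torch.log(r) - torch.log(g), torch.log(r) - torch.log(b)
    u2, v2 = -u1, -u1 + v1
    u3, v3 = -v1, -v1 + u1
    edges = torch.linspace(-boundary, boundary, bins + 1)
    hist = torch.zeros(bins, bins, 3)
    for c, (u, v) in enumerate([(u1, v1), (u2, v2), (u3, v3)]):
        iu = torch.bucketize(u, edges).clamp(1, bins) - 1          # u-bin per pixel
        iv = torch.bucketize(v, edges).clamp(1, bins) - 1          # v-bin per pixel
        hist[:, :, c].index_put_((iu, iv), y, accumulate=True)     # intensity-weighted
    return hist / hist.sum().clamp_min(eps)                        # normalize bins
```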

This global correction is similar in spirit to prior work [16, 19], which also uses a global correction to map the input sRGB colors to the corresponding ground-truth sRGB colors. In [16], this global correction was computed using a K-nearest-neighbor (KNN) search. In contrast, our mapping is learned by a neural network and is followed by a second step that locally processes the image. This design is motivated by the fact that camera ISPs apply local tone mapping, which results in spatially varying color changes in the final sRGB image [17]. Thus, we first generate our initial correction through a global mapping and then apply local processing in the second step.

Table 2 Ablation study on the effect of the networks (i.e., \(N_1\) and \(N_2\)) on our final results
Fig. 3

Qualitative results for post-capture white-balance correction on the Extrinsic Test Set of the Rendered WB dataset [16]. This figure shows the results of: (1) KNN-WB [16], (2) I-WB [19], and (3) our method. For each image, we show the corresponding ground truth correctly white-balanced sRGB image

The second step of our framework locally adjusts the fine details of the globally corrected image, \(\hat{I}_{Gcrr}\), via another neural network, \(N_2\). This fine-detail adjustment improves the quality of our initial correction and deals with over-saturated pixels that are hard to correct with a global polynomial mapping alone. Our second network, \(N_2\), accepts the output of the first step and produces a residual layer to generate our final output image \(\hat{I}_{\textit{crr}}\), as described in the following equation:

$$\begin{aligned} \hat{I}_{\textit{crr}} = N_2\left( \hat{I}_{Gcrr}, \Theta _2\right) + \hat{I}_{Gcrr}, \end{aligned}$$
(6)

where \(\Theta _2\) represents the trainable weights of our second network \(N_2\).

2.1 Network architecture

As explained earlier, our framework consists of two neural networks (i.e., \(N_1\) and \(N_2\)). The first network, \(N_1\), accepts a \(64\times 64\times 3\) histogram feature and produces the mapping parameters in C. This network, \(N_1\), consists of four fully connected layers, as shown in Fig. 2. We use the leaky ReLU (LReLU) operator as our activation function, applied to the outputs of the second and third fully connected layers. A dropout rate of 0.5 is applied to the output of the third fully connected layer. The output fully connected layer has 27 neurons, which are reshaped to construct the mapping matrix C.
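The following PyTorch sketch mirrors this description of \(N_1\); the hidden-layer widths are our assumptions (the text above specifies only the layer structure), and we assume the first layer outputs 384 values so that they can later be split into the 192 scales and 192 shifts used in Eq. (7):

```python
import torch
import torch.nn as nn

class GlobalMappingNet(nn.Module):
    """Sketch of N_1: 64 x 64 x 3 RGB-uv histogram -> 3 x 9 mapping matrix C."""

    def __init__(self, hidden=512):
        super().__init__()
        self.fc1 = nn.Linear(64 * 64 * 3, 384)      # its output also feeds N_2 (Eq. 7)
        self.fc2 = nn.Linear(384, hidden)
        self.fc3 = nn.Linear(hidden, hidden)
        self.fc4 = nn.Linear(hidden, 27)            # 27 neurons -> 3 x 9 matrix C
        self.act = nn.LeakyReLU()
        self.drop = nn.Dropout(0.5)

    def forward(self, hist):
        z = self.fc1(hist.flatten(1))               # latent feature shared with N_2
        x = self.act(self.fc2(z))                   # LReLU after the 2nd FC layer
        x = self.drop(self.act(self.fc3(x)))        # LReLU + dropout after the 3rd
        C = self.fc4(x).view(-1, 3, 9)
        return C, z
```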

After processing the histogram feature of the input sRGB image, \(I_{\textit{in}}\), we produce the globally corrected image, \(\hat{I}_{Gcrr}\) (as described in Eq. 1). This image is then fed to the second network, \(N_2\), which is a U-Net-based network [26] with four encoder blocks, four decoder blocks, two bottleneck blocks and skip connections.

Each encoder block in \(N_2\) consists of conv–LReLU–conv–LReLU–maxpool layers, each bottleneck block has conv–LReLU layers, and each decoder block applies a 2D transposed conv (Tconv) operator followed by conv–LReLU–conv–LReLU layers. The first encoder block maps the input image into a latent space with 24 output channels. The number of output channels is doubled at each subsequent encoder block, reaching 192 output channels at the last encoder block. After the second and fourth encoder blocks, we apply an instance normalization operation [27]. The latent representation, \(\mathcal {X}\), produced by the last encoder block is then modulated by the latent feature generated by the first fully connected layer in \(N_1\). This dual use of the latent feature from \(N_1\) helps the local network (i.e., \(N_2\)) obtain cues about the global color distribution of the image. We use this feature from \(N_1\) to apply an affine transformation to the latent representation \(\mathcal {X}\) as follows:

$$\begin{aligned} \mathcal {X}'_{(i, j, f)} = t_f \mathcal {X}_{(i, j, f)} + s_f, \end{aligned}$$
(7)

where \(\{t_1, t_2, \ldots , t_{192}\}\) and \(\{s_1, s_2, \ldots , s_{192}\}\) represent the latent feature vector produced by the first fully connected layer in \(N_1\) (see Fig. 2).
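A minimal sketch of this modulation is given below; the split of the 384-dimensional feature from \(N_1\) into scales \(t_f\) and shifts \(s_f\) (and its ordering) is our assumption:

```python
import torch

def modulate_latent(x, z):
    """Affine transformation of Eq. (7).

    x: B x 192 x H x W latent representation from the last encoder block of N_2.
    z: B x 384 feature from the first fully connected layer of N_1.
    """
    t, s = z.chunk(2, dim=1)                          # per-channel scales and shifts
    return t[:, :, None, None] * x + s[:, :, None, None]
```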

After applying this affine transformation (Eq. 7), the output feature is processed by two bottleneck blocks with 384 and 192 output channels, respectively. Our decoder blocks then process this latent representation along with the skipped latent representations produced by the corresponding encoder blocks. Note that the skip connections are taken before applying the maxpool operator in each encoder block and are concatenated with the result of the 2D transposed conv operator in each corresponding decoder block.
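The following PyTorch sketch shows one encoder block of \(N_2\) with the skip tensor taken before max-pooling, as described above; the kernel size and padding are our assumptions:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Sketch of one N_2 encoder block: conv-LReLU-conv-LReLU-maxpool."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.LeakyReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.LeakyReLU())
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.convs(x)      # kept and concatenated after the Tconv in the decoder
        return self.pool(skip), skip
```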

Finally, we apply a single conv layer with a \(1\times 1\) kernel followed by a tanh activation function to generate a residual layer that is then used to generate the final output image \(\hat{I}_{\textit{crr}}\) (as described in Eq. 6).
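The output head can be sketched as follows; we assume the last decoder block outputs 24 channels (matching the first encoder block), which is not stated explicitly above:

```python
import torch.nn as nn

# 1x1 conv + tanh producing the bounded residual layer used in Eq. (6).
output_head = nn.Sequential(nn.Conv2d(24, 3, kernel_size=1), nn.Tanh())

def final_correction(decoder_features, i_gcrr):
    residual = output_head(decoder_features)     # residual in [-1, 1]
    return (i_gcrr + residual).clamp(0, 1)       # Eq. (6); clipping is our assumption
```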

Fig. 4

Qualitative results for post-capture white-balance correction on the Cube Test Set of the Rendered WB dataset [16]. This figure shows the results of: (1) KNN-WB [16], (2) I-WB [19], and (3) our method. For each image, we show the corresponding ground truth correctly white-balanced sRGB image

2.2 Training details

We trained our networks (i.e., \(N_1\) and \(N_2\)) for 100 epochs in an end-to-end manner to minimize the following loss function:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{N_1} + \mathcal {L}_{N_2} + \lambda \mathcal {L}_{\textit{per}}, \end{aligned}$$
(8)

where \(\mathcal {L}_{N_2}\) is the L2 loss between our corrected image, \(\hat{I}_{\textit{crr}}\), and the ground-truth white-balanced image, \(I_{\textit{crr}}\); \(\mathcal {L}_{N_1}\) is the L2 loss between the globally corrected image, \(\hat{I}_{Gcrr}\), and the ground-truth white-balanced image, \(I_{\textit{crr}}\); \(\mathcal {L}_{\textit{per}}\) is the perceptual loss [28] between \(\hat{I}_{\textit{crr}}\) and \(I_{\textit{crr}}\); and \(\lambda \) is a hyperparameter used to control the effect of \(\mathcal {L}_{\textit{per}}\) on the final loss. In our experiments, we set \(\lambda = 0.1\).
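A sketch of this objective is shown below; we read the L2 terms as mean-squared errors, and perceptual_fn stands in for the VGG-feature loss of [28], whose exact configuration is not specified here:

```python
import torch.nn.functional as F

def total_loss(i_gcrr, i_crr_hat, i_crr_gt, perceptual_fn, lam=0.1):
    """Combined objective of Eq. (8)."""
    loss_n1 = F.mse_loss(i_gcrr, i_crr_gt)       # L2 on the globally corrected image
    loss_n2 = F.mse_loss(i_crr_hat, i_crr_gt)    # L2 on the final corrected image
    return loss_n1 + loss_n2 + lam * perceptual_fn(i_crr_hat, i_crr_gt)
```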

Note that, as our framework is trained in an end-to-end scheme, the output \(3\times 9\) correction matrix from \(N_1\) is first applied to the input image, and then the first loss term in Eq. (8), \(\mathcal {L}_{N_1}\), is computed. This \(\mathcal {L}_{N_1}\) loss term encourages the network to produce proper parameters in the output matrix to correct the colors of \(I_{\textit{in}}\). That is, the network learns the entries of this matrix without any direct supervision on the matrix itself. Afterward, the second network, \(N_2\), receives the globally corrected image, \(\hat{I}_{Gcrr}\), and the second loss term, \(\mathcal {L}_{N_2}\), encourages \(N_2\) to correct local residual errors in \(\hat{I}_{Gcrr}\) to produce the final corrected image, \(\hat{I}_{\textit{crr}}\).

To optimize Eq. (8), we used the Adam optimizer [29] with beta values of 0.9 and 0.999 and a learning rate of \(10^{-4}\), dropped by a factor of 0.5 every 25 epochs. We used a mini-batch size of 16 and regularized the weights \(\Theta _1\) and \(\Theta _2\) of our networks using L2 regularization with a multiplier of \(10^{-5}\).
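These settings correspond to an optimizer configuration along the following lines (a sketch; n1 and n2 refer to the two networks, and expressing the L2 regularization as Adam's weight_decay is one common realization, not necessarily the original one):

```python
import torch

params = list(n1.parameters()) + list(n2.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5)  # halve lr every 25 epochs
```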

To improve the training process, we alternate the optimization at each iteration: we first optimize \(\Theta _1\) to separately minimize \(\mathcal {L}_{N_1}\) (by processing the input data with only \(N_1\), with \(N_2\) disabled), and we then process the input data through the entire framework (i.e., \(N_1\) and \(N_2\)) to minimize Eq. (8).
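One training iteration of this interleaved scheme could look as follows (a sketch that reuses the hypothetical helpers introduced earlier, e.g., total_loss; apply_global_mapping_batch is an assumed batched form of Eq. (1)):

```python
import torch.nn.functional as F

def train_step(histograms, patches, targets):
    # Step A: update Theta_1 alone on L_N1 (N_2 is not in the graph, so its
    # parameters receive no gradient and are skipped by the optimizer).
    C, z = n1(histograms)
    i_gcrr = apply_global_mapping_batch(patches, C)   # batched form of Eq. (1)
    loss_global = F.mse_loss(i_gcrr, targets)
    optimizer.zero_grad(set_to_none=True)
    loss_global.backward()
    optimizer.step()

    # Step B: full forward pass through N_1 and N_2, minimizing Eq. (8).
    C, z = n1(histograms)
    i_gcrr = apply_global_mapping_batch(patches, C)
    i_crr = n2(i_gcrr, z) + i_gcrr                    # Eq. (6): N_2 outputs the residual
    loss = total_loss(i_gcrr, i_crr, targets, perceptual_fn)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```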

We train our framework using patch-wise training, where we randomly select \(128\times 128\) training patches from the full-size training images, while generating the histogram feature H from the entire training image. This allows our first network, \(N_1\), to have global cues of the color distribution in the training image in order to predict suitable mapping parameters for each image. We follow the same procedure in the inference phase by using the histogram of the entire test image, while feeding the full-size test image to our second network, \(N_2\).
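A sketch of this sampling strategy, reusing the hypothetical rgb_uv_histogram helper from above:

```python
import random

def sample_training_example(full_image, full_target, patch=128):
    """Random 128 x 128 patch plus a histogram computed on the full image."""
    h, w, _ = full_image.shape
    top = random.randint(0, h - patch)
    left = random.randint(0, w - patch)
    hist = rgb_uv_histogram(full_image.reshape(-1, 3))        # global color cue for N_1
    in_patch = full_image[top:top + patch, left:left + patch]
    gt_patch = full_target[top:top + patch, left:left + patch]
    return hist, in_patch, gt_patch
```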

3 Experimental results

We trained our network on the Intrinsic Set of the Rendered WB dataset [16]. This set includes 62,535 sRGB images rendered by several DSLR camera devices. For evaluation, we tested our method on the Extrinsic Test Set of the Rendered WB dataset [16], which includes 2881 images captured by a DSLR camera and several mobile phone cameras. Furthermore, we tested our method on the Cube Test Set that has 10,242 camera-rendered images with several white-balance settings [16, 30].

Fig. 5

Qualitative comparison of our results with and without the perceptual loss \(\mathcal {L}_{\textit{per}}\). For each image, we show the corresponding ground truth correctly white-balanced sRGB image

Table 3 Ablation study on the effect of the mini-batch size and the input patch size on our results

3.1 Comparisons

We compared our results with recently published methods for correcting improperly white-balanced images. In particular, we report the results of the following methods: deep white balance (D-WB) [20], KNN white balance (KNN-WB) [16], interactive white balance (I-WB) [19], and mixed white balance (M-WB) [31]. We further compare our method with other deep-learning-based and statistical color constancy methods: FC4 [32], quasi-unsupervised color constancy (Quasi-U CC) [33], the shades-of-gray (SoG) method [6], and the gray-world (GW) method [34] (Table 1).

We followed prior work [16, 20] in adopting the mean angular error (MAE) and \(\Delta \)E 2000 [35] as our evaluation metrics. Table 1 shows the quantitative results. Finally, we show qualitative comparisons in Figs. 1, 3, and 4.
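For reference, the two metrics can be computed along these lines (a sketch; we assume the per-pixel formulation of the angular error and rely on scikit-image for the CIEDE2000 formula):

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

def mean_angular_error(pred, gt, eps=1e-9):
    """Mean per-pixel angle (degrees) between predicted and ground-truth RGB."""
    p, g = pred.reshape(-1, 3), gt.reshape(-1, 3)
    cos = np.sum(p * g, axis=1) / (np.linalg.norm(p, axis=1) * np.linalg.norm(g, axis=1) + eps)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()

def mean_delta_e2000(pred, gt):
    """Mean CIEDE2000 color difference between two sRGB images in [0, 1]."""
    return deltaE_ciede2000(rgb2lab(pred), rgb2lab(gt)).mean()
```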

3.2 Ablation studies

We studied the behavior of our method under different design options and loss functions. The first part of Table 2 shows the results of our ablation study. We report our results on the Extrinsic Test Set of the Rendered WB dataset [16] after training our framework without the global processing network, \(N_1\), and without the local processing network, \(N_2\), respectively. It is worth mentioning that when we disabled \(N_1\), we substituted \(\hat{I}_{Gcrr}\) with \(I_{\textit{in}}\) in Eq. (6). In addition to this ablation study, we show the results of our framework after training without the perceptual loss term, \(\mathcal {L}_{\textit{per}}\) (i.e., \(\lambda = 0\)), and with different values of \(\lambda \), which controls the contribution of the perceptual loss in Eq. (8). The qualitative evaluation of our results with and without the perceptual loss is shown in Fig. 5. Finally, we studied the effect of different training mini-batch and image patch sizes on our results. The results of this study are shown in Table 3.

4 Conclusion

We have presented a deep learning framework to correct sRGB images that were saved by camera ISPs with wrong white-balance settings. Our framework processes the input image through two main stages: it first corrects the image colors by learning global mapping parameters that map the input image colors to the corresponding correctly white-balanced ones, and it then produces a learnable residual layer that locally adjusts our initial result. We have evaluated our two-stage correction method on large test sets of improperly white-balanced sRGB images and showed promising results that are on par with or better than recently published methods for color constancy and image white balancing. One interesting direction for future work is the use of XAI techniques [36, 37] for understanding (and possibly improving) the behavior of the proposed network.