Two-stage deep learning framework for sRGB image white balance

This work aims to correct white-balance errors in sRGB images. These white-balance errors are hard to fix due to the nonlinear color-processing procedures applied by camera image signal processors (ISPs) to produce the final sRGB colors. Camera ISPs apply these nonlinear procedures after the essential white-balance step to render sensor raw images to the sRGB space through a camera-specific set of tone curves and look-up tables. To correct improperly white-balanced images, projecting nonlinear sRGB colors back to their original raw space is required. Recent work formulates the problem as an image translation problem, where input sRGB colors are mapped using nonlinear polynomial correction functions to fix such white-balance errors. In this work, we show that correcting white-balance errors in sRGB images through a global color mapping followed by spatially local adjustments, learned through end-to-end training, introduces perceptual improvements in the final results. Qualitative and quantitative comparisons with recently published methods for camera-rendered image white balancing validate our method's efficacy and show that our method achieves results competitive with state-of-the-art methods.


Introduction and related work
White balance is a basic procedure that the camera image signal processor (ISP) applies to nearly all images. The goal of image white balancing is to remove undesirable color casts caused by scene lights. That is, this process aims to normalize the camera-rendered image's colors such that any achromatic object appears grayish (i.e., R = G = B) [1]. While it cannot fix all other colors, white-balance correction is often assumed to approximate color constancy (i.e., object color appearance remains constant under different lighting conditions of the scene) [2]. Image white balancing does not target only the aesthetic aspect of camera-rendered images but also improves the accuracy of other computer vision tasks, such as image classification and image semantic segmentation [3][4][5].

Romany F. Mansour and Adel A. Sewisy contributed equally to this study.
Cameras apply white balance by first estimating the scene illumination color. This estimation is performed by a color constancy method, which typically falls into one of the following categories: (1) statistical methods (e.g., [6][7][8]) that use heuristic, statistics-based hypotheses to estimate the illuminant color of the captured image, and (2) learning methods (e.g., [9][10][11][12][13][14][15]) that rely on machine learning techniques to predict the illuminant color given the input image [10,13,14] or its color histogram [9,11,15]. Then, a global color channel scaling operation is applied to remove such undesirable color casts. This white-balance procedure is applied to the camera sensor raw image (i.e., a linear representation of the incoming light), and afterwards camera ISPs apply a set of nonlinear procedures to render the final output image in the sRGB space [16,17]. Due to these nonlinear procedures, the simple color scaling process, which is intended to fix linear raw images, does not work properly for camera-rendered images with white-balance errors [16].
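The global channel scaling described above can be illustrated with a minimal sketch. It assumes a linear image stored as a NumPy array and uses the gray-world hypothesis as a stand-in statistical illuminant estimator; the function names, the green-channel normalization, and the example illuminant are illustrative assumptions, not part of the method proposed in this paper.

```python
import numpy as np

def gray_world_illuminant(linear_rgb):
    """Statistical estimate: assume the average scene color is achromatic."""
    return linear_rgb.reshape(-1, 3).mean(axis=0)

def white_balance_linear(linear_rgb, illuminant):
    """Scale each channel so that the estimated illuminant maps to gray."""
    illuminant = np.asarray(illuminant, dtype=np.float64)
    # Normalizing the gains by the green channel is a common convention (assumption).
    gains = illuminant[1] / illuminant
    return linear_rgb * gains  # broadcasts over the channel axis

rng = np.random.default_rng(0)
scene = rng.uniform(0.1, 0.9, size=(4, 4, 3))
cast = np.array([0.8, 1.0, 0.5])      # hypothetical illuminant color
observed = scene * cast               # linear image with a color cast
corrected = white_balance_linear(observed, gray_world_illuminant(observed))
```

Because this correction is a per-channel multiplication, it is only meaningful on linear data; applying the same gains after nonlinear rendering is what fails on sRGB images.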
An intuitive way to deal with such errors is to reconstruct a linear version of the input image followed by fixing its white balance, and then re-processing the image to its final form in the sRGB space. Though it seems trivial, this solution is impractical as camera ISP's procedures are unknown and a careful camera calibration process is required to accomplish this linearization task [16,17].

Fig. 1 This work focuses on fixing potential white-balance errors in camera-rendered sRGB images. Our method consists of a two-stage learning framework that first applies a global correction to input image colors, and then a local process is applied to produce the final result. Our method produces competitive results compared to state-of-the-art methods (i.e., KNN-WB [16] and D-WB [20]) (color figure online)
Recently, Afifi et al. [16,18,19,20] proposed solutions to deal with such white-balance errors in camera-rendered sRGB images without the need for raw reconstruction. These methods process the improperly white-balanced input image in the sRGB space to generate the correctly white-balanced image. This sRGB-level processing is performed either by a polynomial color mapping, as in the KNN white-balance method (KNN-WB) [16] and the interactive white-balance method (I-WB) [19], or via deep neural networks, as in the deep white-balance methods (D-WB) [18,20].
Inspired by this research direction, our solution also avoids any raw reconstruction and processes the input image in its original space (i.e., the sRGB space). Given the remarkable results achieved by deep learning methods in several research fields, such as computer networking and communications [21][22][23], we propose a deep learning-based method to solve our problem. In contrast with prior work (e.g., [16,18,19,20]), our proposed solution consists of two different stages to re-white balance the input image. The first stage estimates global mapping parameters to correct the input image colors without considering spatial information. Then, the second stage locally processes our initial corrected image based on its spatial information to generate the final corrected image. Figure 1 shows our result compared to the state-of-the-art methods [16,20].
The major contributions of this work can be summarized as follows:
1. We propose a novel framework to fix white-balance errors in sRGB images through a two-stage correction procedure.
2. Unlike recent work that treats improperly white-balanced images through a global color mapping operation, our two-stage framework first processes the input image colors globally, based on the image's color distribution, to generate an initial corrected image. This initial solution is then improved by learning a residual layer that locally adjusts our initial result to generate the final image. This two-stage strategy improves the perceptual results by considering local enhancements of the input image.
Extensive experiments are conducted on challenging test sets, and we show that our method produces competitive results compared with the state-of-the-art methods for correcting improperly white-balanced sRGB images. The rest of this paper is organized as follows. In Sect. 2, the methodology is presented. Section 3 presents the evaluation of our method through a set of experiments, ablation studies, and comparisons between the proposed method and the state-of-the-art methods. Finally, the paper is concluded in Sect. 4.

Method
We present a learning framework to correct the sRGB colors of input images that were rendered with camera white-balance errors. Figure 2 shows an overview of our framework. As shown, our framework consists of two steps to re-white balance the input sRGB image. In the first step, we apply a global mapping to the input sRGB image, $I_{in}$, in order to correct its colors in the sRGB space. This mapping process can be described as follows:

$$\hat{I}_{Gcrr} = C\,\phi(I_{in}), \quad (1)$$

where $\phi$ is a kernel operator that projects the three sRGB color channels into a higher-dimensional space (nine dimensions, matching the 3 × 9 shape of $C$), and $C$ is a 3 × 9 matrix generated by our network $N_1$, which accepts the color histogram $H(\cdot)$ of $I_{in}$ and processes it with its trainable weights $\theta_1$.
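As a rough illustration of this global step, the sketch below applies a 3 × 9 matrix to kernel-expanded pixel colors. The specific nine-term polynomial kernel used here (linear, cross, and squared terms) is an assumption chosen only because it yields nine features; the kernel actually used by the framework may differ. The function names are hypothetical.

```python
import numpy as np

def phi(rgb):
    """Assumed 9-term kernel: [R, G, B, RG, RB, GB, R^2, G^2, B^2]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.stack([r, g, b, r * g, r * b, g * b, r * r, g * g, b * b], axis=-1)

def apply_global_mapping(image, C):
    """Apply a 3 x 9 mapping matrix C to every pixel of an H x W x 3 image."""
    h, w, _ = image.shape
    feats = phi(image).reshape(-1, 9)   # (H*W, 9) kernel features
    mapped = feats @ C.T                # (H*W, 3) corrected colors
    return mapped.reshape(h, w, 3)

# A matrix whose first three columns form the identity leaves colors unchanged.
C_identity = np.hstack([np.eye(3), np.zeros((3, 6))])
img = np.random.default_rng(0).uniform(size=(8, 8, 3))
out = apply_global_mapping(img, C_identity)
```

The mapping ignores pixel positions entirely: every pixel with the same color receives the same corrected color, which is why a second, spatially aware step is needed for local effects.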
In Eq. (2), we mentioned that our first network, $N_1$, accepts a color histogram feature of the input image, $I_{in}$. This histogram feature represents the color distribution of the input sRGB image. To create a dense histogram feature, we rely on the RGB-uv histogram proposed in prior work [1,13,16,24]. In particular, we create a 64 × 64 × 3 histogram feature of the input image $I_{in}$ in the log-chrominance space [9,11,25]. In the histogram definition, $d$ refers to $(u, v, c)$, $j$ refers to each color channel in the generated histogram feature, $\varepsilon$ is the histogram's bin size, and $i \in \{1, \ldots, n\}$ is the pixel index (here, $n$ is the total number of pixels in the input image). The histogram is computed from the per-pixel values $I_{in,y(i)}$, $I_{in,u_j(i)}$, and $I_{in,v_j(i)}$; the values for the second and third channels ($I_{in,u_2}$, $I_{in,v_2}$, $I_{in,u_3}$, and $I_{in,v_3}$) are generated likewise. Finally, the histogram feature is normalized by the summation of the histogram's bin values.
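The following sketch shows one way such a 64 × 64 × 3 log-chrominance histogram could be computed. The details are assumptions modeled on the general RGB-uv formulation of prior work rather than the exact definition used here: each pixel is weighted by its intensity (the root of the sum of squared channels), the u and v coordinates of channel j are log ratios between channel j and the two other channels, the histogram covers the range [-3, 3], and the whole feature is normalized jointly. The function name and the small constant `eps` are hypothetical.

```python
import numpy as np

def rgb_uv_histogram(image, bins=64, bound=3.0, eps=1e-6):
    """Sketch of a bins x bins x 3 intensity-weighted log-chrominance histogram."""
    rgb = image.reshape(-1, 3).astype(np.float64)
    log_rgb = np.log(rgb + eps)                   # eps avoids log(0) (assumption)
    intensity = np.sqrt((rgb ** 2).sum(axis=1))   # assumed per-pixel weight
    hist = np.zeros((bins, bins, 3))
    for j in range(3):
        k, l = [c for c in range(3) if c != j]    # the two other channels
        u = log_rgb[:, j] - log_rgb[:, k]
        v = log_rgb[:, j] - log_rgb[:, l]
        hist[:, :, j], _, _ = np.histogram2d(
            u, v, bins=bins,
            range=[[-bound, bound], [-bound, bound]],
            weights=intensity)
    total = hist.sum()
    # Normalize by the sum of all bin values so the feature sums to one.
    return hist / total if total > 0 else hist

img = np.random.default_rng(0).uniform(0.05, 1.0, size=(32, 32, 3))
feature = rgb_uv_histogram(img)
```

Since the histogram discards pixel positions, images of different sizes map to a feature of the same fixed size, which suits a fully connected network.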
This global correction is similar to what was proposed by prior work [16,19] in the sense that the work in [16,19] also uses global correction to map from the sRGB input colors to the corresponding sRGB ground truth colors. This global correction was computed based on a K nearest-neighbor search (KNN) in prior work [16]. In contrast, our proposed mapping is learned by a neural network and it is followed by our second step that locally processes the image. This is motivated by the fact that camera ISPs apply local tone mapping that results in spatially varying color changes in the final sRGB image [17]. Thus, we first generate our initial correction through a global mapping followed by a local processing that is applied in the second step.
The second step of our framework locally adjusts the fine details of the globally corrected image, $\hat{I}_{Gcrr}$, via another neural network, $N_2$. This fine-detail adjustment is applied to improve the quality of our initial correction and to deal with over-saturated pixels that are hard to correct solely by a global polynomial mapping. Our second network, $N_2$, accepts the output of the first step and produces a residual layer to generate our final output image, $\hat{I}_{crr}$, as described in the following equation:

$$\hat{I}_{crr} = \hat{I}_{Gcrr} + N_2(\hat{I}_{Gcrr}; \theta_2), \quad (6)$$

where $\theta_2$ represents the trainable weights of our second network $N_2$.

Network architecture
As explained earlier, our framework consists of two neural networks (i.e., $N_1$ and $N_2$). The first network, $N_1$, accepts a 64 × 64 × 3 histogram feature and produces the mapping parameters in $C$. This network consists of four fully connected layers, as shown in Fig. 2. We use the leaky ReLU (LReLU) operator as the activation function applied to the outputs of the second and third fully connected layers. A dropout rate of 0.5 is applied to the output of the third fully connected layer. The output fully connected layer has 27 output neurons to construct the 3 × 9 mapping matrix $C$.
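A forward pass through a network of this shape can be sketched in plain NumPy as follows. Only the input size (64 × 64 × 3), the placement of LReLU and dropout, and the 27 outputs come from the description above; the hidden widths (including a 384-dimensional first layer, inferred from the two 192-element vectors it later provides), the LReLU slope, and the weight initialization are assumptions for illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.2):  # slope value is an assumption
    return np.where(x > 0, x, slope * x)

def init_n1(rng, sizes=(64 * 64 * 3, 384, 256, 128, 27)):
    """Random weights for four fully connected layers (hidden widths hypothetical)."""
    return [(rng.normal(0.0, 0.01, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def n1_forward(hist, params, rng=None, drop_rate=0.5):
    """Return the 3 x 9 matrix C and the first-layer latent feature."""
    (w1, b1), (w2, b2), (w3, b3), (w4, b4) = params
    latent = hist.reshape(-1) @ w1 + b1   # first FC output, no activation
    x = leaky_relu(latent @ w2 + b2)      # LReLU on the second FC output
    x = leaky_relu(x @ w3 + b3)           # LReLU on the third FC output
    if rng is not None:                   # dropout is active only in training
        keep = rng.random(x.shape) >= drop_rate
        x = x * keep / (1.0 - drop_rate)
    C = (x @ w4 + b4).reshape(3, 9)       # 27 outputs arranged as 3 x 9
    return C, latent

rng = np.random.default_rng(0)
params = init_n1(rng)
hist = np.full((64, 64, 3), 1.0 / (64 * 64 * 3))
C, latent = n1_forward(hist, params)
```

Passing `rng` to `n1_forward` enables dropout for a training-style pass; omitting it gives the deterministic inference behavior.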
In addition to our results, this table shows the results of the following methods: gray-world (GW) [34], shades-of-gray (SoG) [6], FC4 [32], quasi-unsupervised color constancy (Quasi-U CC) [33], interactive white balance (I-WB) [19], KNN white balance (KNN-WB) [16], and deep white-balance (D-WB) [20]. We used the following evaluation metrics: the mean angular error (MAE) and ΔE 2000 [35]. We also report model sizes in megabytes of learning-based methods (including ours) and the average testing GPU time to process a single image in seconds. Our results are indicated with boldface.

We further report the results of our method trained with and without the perceptual loss $L_{per}$. In these experiments, we used the Extrinsic Test Set in the Rendered WB dataset [16]. The best results are indicated with boldface.

After processing the histogram feature of the input sRGB image, $I_{in}$, we produce the globally corrected image, $\hat{I}_{Gcrr}$ (as described in Eq. 1). This image is then fed to the second network, $N_2$, which is a U-Net-based network [26] with four encoder blocks, four decoder blocks, two bottleneck blocks, and skip connections. Each encoder block in $N_2$ consists of conv-LReLU-conv-LReLU-maxpool layers, each bottleneck block has conv-LReLU layers, and each decoder block consists of a 2D transposed conv (Tconv) operator followed by conv-LReLU-conv-LReLU layers. The first encoder block maps the input image into a latent space with 24 output channels. The number of output channels is doubled in each subsequent encoder block to reach 192 output channels by the last encoder block. After the second and fourth encoder blocks, we apply an instance normalization operation [27]. Then, the latent representation, $X$, produced by the last encoder block is first processed by the latent feature generated by the first fully connected layer in $N_1$.
This dual use of the latent feature in $N_1$ helps the local network (i.e., $N_2$) to get some cues about the global color distribution in the image. We use the feature shared from $N_1$ to apply an affine transformation to the latent representation $X$ as follows:

$$X'_k = s_k X_k + t_k, \quad k = 1, \ldots, 192, \quad (7)$$

where $X_k$ is the $k$-th channel of $X$, and $\{t_1, t_2, \ldots, t_{192}\}$ and $\{s_1, s_2, \ldots, s_{192}\}$ represent the latent feature vector produced by the first fully connected layer in $N_1$ (see Fig. 2). After applying this affine transformation (Eq. 7), the output feature is processed by two bottleneck blocks with 384 and 192 output channels, respectively. Our decoder blocks then process this latent representation along with the skipped latent representation produced by each corresponding encoder block. Note that the skip connections are taken before applying the maxpool operator in each encoder block and concatenated with the result of the 2D transposed conv operator in each corresponding decoder block.
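The channel-wise affine transformation can be sketched as below. It assumes that the vector from the first fully connected layer of N1 holds the 192 shift values t followed by the 192 scale values s, and that the transformation has the usual scale-then-shift form; both the ordering and the function name are assumptions. The snippet also lists the encoder channel widths implied by doubling from 24.

```python
import numpy as np

# Encoder widths implied by starting at 24 channels and doubling three times.
encoder_channels = [24 * 2 ** i for i in range(4)]   # [24, 48, 96, 192]

def modulate_latent(X, fc1_feature):
    """Scale and shift each channel of X (H x W x 192) using the N1 feature."""
    channels = X.shape[-1]
    t = fc1_feature[:channels]    # assumed: first half holds the shifts
    s = fc1_feature[channels:]    # assumed: second half holds the scales
    return X * s + t              # broadcasts over the spatial axes

X = np.ones((2, 2, 192))
feature = np.concatenate([np.arange(192.0), np.full(192, 2.0)])
X_mod = modulate_latent(X, feature)
```

Because the same scale and shift are applied at every spatial position, this step injects image-level color statistics into the otherwise local convolutional features.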
Finally, we apply a single conv layer with a 1 × 1 kernel followed by a tanh activation function to generate a residual layer that is then used to generate the final output image, $\hat{I}_{crr}$ (as described in Eq. 6).
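Since a 1 × 1 convolution is a per-pixel linear map across channels, this output head can be sketched with a matrix product. The number of decoder feature channels (24 here) and the function name are assumptions; the sketch also assumes the residual is simply added to the globally corrected image.

```python
import numpy as np

def residual_head(features, W, b, global_corrected):
    """1x1 conv (per-pixel linear map) -> tanh -> add to the global result."""
    residual = np.tanh(features @ W + b)   # (H, W, 3), values in (-1, 1)
    return global_corrected + residual

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 24))     # hypothetical decoder output
W = rng.normal(scale=0.1, size=(24, 3))    # 1x1 conv weights
b = np.zeros(3)
global_corrected = rng.uniform(size=(4, 4, 3))
final = residual_head(features, W, b, global_corrected)
```

The tanh bounds each residual value to (-1, 1), so the local network can only nudge the globally corrected colors rather than replace them arbitrarily.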

Training details
We trained our networks (i.e., $N_1$ and $N_2$) for 100 epochs in an end-to-end manner to minimize the following loss function:

$$L = L_{N_1} + L_{N_2} + \lambda L_{per}, \quad (8)$$

where $L_{N_2}$ is the L2 loss between our corrected image, $\hat{I}_{crr}$, and the ground-truth white-balanced image, $I_{crr}$; $L_{N_1}$ is the L2 loss between the globally corrected image, $\hat{I}_{Gcrr}$, and the ground-truth white-balanced image, $I_{crr}$; $L_{per}$ is the perceptual loss [28] between $\hat{I}_{crr}$ and $I_{crr}$; and $\lambda$ is a hyperparameter used to control the effect of $L_{per}$ on the final loss. In our experiments, we set $\lambda = 0.1$. Note that as our framework is trained in an end-to-end scheme, the output 3 × 9 correction matrix from $N_1$ is first applied to the input image, and then the first loss term in Eq. (8), $L_{N_1}$, is computed. This $L_{N_1}$ loss term encourages the network to produce proper parameters in the output matrix to correct the colors of $I_{in}$. That is, the network learns the parameters of this matrix without direct supervision on the matrix itself, since only the resulting image is compared with the ground truth. Afterward, the second network, $N_2$, receives the globally corrected image, $\hat{I}_{Gcrr}$, and the second loss term, $L_{N_2}$, encourages $N_2$ to correct local residual errors in $\hat{I}_{Gcrr}$ to get the final corrected image, $\hat{I}_{crr}$.
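The combination of the three loss terms can be sketched as follows. The sketch assumes that each L2 term is a mean squared error, that the two L2 terms are weighted equally, and that the perceptual loss is an L2 distance in the feature space of some fixed extractor; `toy_features` is a hypothetical stand-in for a pretrained feature network, used only to keep the example self-contained.

```python
import numpy as np

def l2_loss(a, b):
    """Mean squared error (assumed form of the L2 terms)."""
    return float(np.mean((a - b) ** 2))

def toy_features(img):
    """Hypothetical stand-in for a pretrained perceptual feature extractor."""
    return np.diff(img, axis=1)   # horizontal gradients

def total_loss(pred_final, pred_global, target, features=toy_features, lam=0.1):
    loss_n1 = l2_loss(pred_global, target)   # supervises the global mapping
    loss_n2 = l2_loss(pred_final, target)    # supervises the final output
    loss_per = l2_loss(features(pred_final), features(target))
    return loss_n1 + loss_n2 + lam * loss_per

target = np.zeros((2, 2, 3))
pred_global = np.ones((2, 2, 3))
pred_final = np.full((2, 2, 3), 0.5)
value = total_loss(pred_final, pred_global, target, features=lambda x: x)
```

Keeping a separate term on the intermediate image means the global stage is pushed toward a usable correction on its own, instead of leaving all of the work to the local stage.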
To optimize Eq. (8), we used the Adam optimizer [29] with beta values of 0.9 and 0.999 and a learning rate of $10^{-4}$, dropped by a factor of 0.5 every 25 epochs. We used a mini-batch size of 16 and regularized the weights $\theta_1$ and $\theta_2$ of our networks using L2 regularization with a multiplier of $10^{-5}$.
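The step-wise learning rate decay can be written as a small function. This is a sketch that assumes zero-indexed epochs and a drop applied at the start of epochs 25, 50, and 75; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, drop=0.5, every=25):
    """Halve the learning rate every 25 epochs (epoch is zero-indexed)."""
    return base_lr * drop ** (epoch // every)

schedule = [learning_rate(e) for e in range(100)]
```

Under these assumptions, the 100-epoch run uses four distinct rates, ending at one eighth of the initial value.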
In order to improve the training process, we alternate between two updates at each iteration: we first optimize $\theta_1$ alone to minimize $L_{N_1}$ (by processing the input data with only $N_1$ and disabling $N_2$), and then we process the input data through the entire framework (i.e., $N_1$ and $N_2$) to minimize Eq. (8).
We train our framework using patch-wise training, where we randomly select 128 × 128 training patches from the full-size training images, while generating the histogram feature $H$ from the entire training image. This allows our first network, $N_1$, to have global cues of the color distribution in the training image in order to predict suitable mapping parameters for each image. We follow the same procedure in the inference phase by using the histogram of the entire test image, while feeding the full-size test image to our second, fully convolutional network, $N_2$.
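The patch-wise sampling can be sketched as below: the same random window is cropped from the input and ground-truth images, while the full input image is kept for computing the histogram feature. The function name and the uniform choice of the window position are assumptions.

```python
import numpy as np

def sample_training_pair(inp, target, rng, patch=128):
    """Crop aligned patches; return the full input too, for the histogram."""
    h, w, _ = inp.shape
    top = rng.integers(0, h - patch + 1)    # upper bound is exclusive
    left = rng.integers(0, w - patch + 1)
    window = (slice(top, top + patch), slice(left, left + patch))
    # The histogram feature is computed from `inp`, not from the patch.
    return inp[window], target[window], inp

rng = np.random.default_rng(0)
inp = rng.uniform(size=(200, 300, 3))
target = inp + 1.0                          # toy ground truth, aligned with inp
patch_in, patch_gt, full_in = sample_training_pair(inp, target, rng)
```

Cropping keeps the memory cost of the convolutional network fixed, while the histogram retains color statistics of regions outside the patch.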

Experimental results
We trained our network on the Intrinsic Set of the Rendered WB dataset [16]. This set includes 62,535 sRGB images rendered by several DSLR camera devices. For evaluation, we tested our method on the Extrinsic Test Set of the Rendered WB dataset [16], which includes 2881 images captured by a DSLR camera and several mobile phone cameras. Furthermore, we tested our method on the Cube Test Set that has 10,242 camera-rendered images with several white-balance settings [16,30].

Qualitative results for post-capture white-balance correction on the Cube Test Set of the Rendered WB dataset [16]. This figure shows the results of: (1) KNN-WB [16], (2) I-WB [19], and (3)
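The quantitative comparisons in this paper report the mean angular error (MAE) between corrected and ground-truth images. A minimal sketch of that metric is given below, assuming it is the per-pixel angle (in degrees) between corresponding RGB vectors, averaged over all pixels; the function name and the guard constant `eps` are hypothetical.

```python
import numpy as np

def mean_angular_error(pred, target, eps=1e-9):
    """Average angle, in degrees, between corresponding RGB vectors."""
    p = pred.reshape(-1, 3).astype(np.float64)
    t = target.reshape(-1, 3).astype(np.float64)
    norms = np.linalg.norm(p, axis=1) * np.linalg.norm(t, axis=1)
    cos = (p * t).sum(axis=1) / np.maximum(norms, eps)   # guard against zero norm
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(angles.mean())

red = np.array([[[1.0, 0.0, 0.0]]])
green = np.array([[[0.0, 1.0, 0.0]]])
error = mean_angular_error(red, green)
```

The angle depends only on the direction of each color vector, so the metric is insensitive to brightness differences and isolates errors in chromaticity.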

Ablation studies
We studied the behavior of our method with different options in design and loss function. The first part of Table 2 shows the results of our ablation study. We report our results on the Extrinsic Set of the Rendered WB dataset [16] after training our network without the global processing network, $N_1$, and without the local processing network, $N_2$. It is worth mentioning that when we disabled $N_1$, we substituted $\hat{I}_{Gcrr}$ with $I_{in}$ in Eq. (6). In addition to this ablation study, we show the results of our framework after training without the perceptual loss term, $L_{per}$ (i.e., $\lambda = 0$), and with different values of $\lambda$, which controls the contribution of the perceptual loss in Eq. (8). The qualitative evaluation of our result with and without the perceptual loss is shown in Fig. 5. Finally, we studied the effect of different training mini-batch and image patch sizes on our results. The results of this study are shown in Table 3. In these experiments, we used the Extrinsic Test Set in the Rendered WB dataset [16]. The best results are indicated with boldface.

Conclusion
We have presented a deep learning framework to correct sRGB images that were saved by camera ISPs with wrong white-balance settings. Our framework processes the input image through two main stages. The first stage corrects the image colors by learning global mapping parameters that map the input image colors to the corresponding white-balanced ones. The second stage produces a learned residual layer that locally adjusts our initial result. We have evaluated our two-stage correction method on large test sets of improperly white-balanced sRGB images and showed promising results that are on par with or better than recently published methods for color constancy and image white balancing. One interesting direction for future work may be the use of XAI techniques [36,37] for understanding (and possibly improving) the behavior of the proposed network.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.