1 Introduction

If the human visual system had not evolved to discount the effects of illumination, color, which is one of the most significant features we use to give meaning to our surroundings, would lose its importance. Semir Zeki, a well-known neuroscientist, emphasized the importance of identifying colors irrespective of the type of illuminant present in the scene by stating that without this ability, objects could no longer be reliably identified by their color [1]. The ability to perceive colors as constant regardless of the type of illuminant is called color constancy, and it is performed unconsciously by the human visual system [2]. Even though many studies have been conducted to understand this phenomenon, it is still unknown how the brain arrives at the color-constant descriptors of the scene [3]. Investigating how we perform color constancy might be the key to unraveling how visual color processing works in the visual cortex, and the outcomes can help us design more robust artificial systems [3, 4]. Therefore, color constancy is an important field not only for neuroscientists seeking to understand how the human visual system works but also for computer scientists aiming to mimic our capabilities and create algorithms that perform better in scene understanding.

We can explain computational color constancy by making use of the following image formation model. An image captured by a device is an integrated signal of three key components which can be formulated as follows [3];

$$\begin{aligned} I_i(x,y) = \int R(x,y,\lambda ) \cdot L(x,y,\lambda ) \cdot S_i(\lambda ) {\text {d}}\lambda \end{aligned}$$
(1)

where \(I_i\) represents a pixel measured at spatial location \((x,y)\), R denotes the reflectance of objects, L is the wavelength distribution of the illumination, \(S_i\) is the capturing device’s sensor function with \(i \in \) {red, green, blue}, and \(\lambda \) is the wavelength of the visible spectrum.

Without the knowledge of the measuring device and the type of light source illuminating the scene, it is very challenging for machine vision systems to identify the true colors of the objects. Therefore, in the field of computational color constancy, researchers aim at developing algorithms that discount the illumination conditions to assist various computer vision tasks. Since color constancy is an ill-posed problem, frequently, the assumption is made that the sensors’ responses are narrow-band and the illumination is uniform throughout the entire scene. It is worth mentioning that while the relaxation of the problem helps us to estimate the illuminant of the scene, most scenes are not illuminated by uniform light sources [5, 6].

The formation of the measured data I can then be simplified as follows [3];

$$\begin{aligned} I(x,y) = R(x,y) \cdot {\textbf{L}}. \end{aligned}$$
(2)

With this simplification, researchers widely accept that after estimating \({\textbf{L}}\), a white-balanced image \(I_{\text {wb}}\) can be obtained from color cast image I by using a \(3\times 3\) diagonal matrix as follows [7, 8];

$$\begin{aligned} I_{\text {wb}} = \begin{bmatrix} L_{\text {est}_{\text {g}}}/L_{\text {est}_{\text {r}}} & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & L_{\text {est}_{\text {g}}}/L_{\text {est}_{\text {b}}} \end{bmatrix} \cdot I \end{aligned}$$
(3)

where \({\textbf{L}}_{{\text {est}}} = [L_{{\text {est}}_{\text {r}}}, L_{{\text {est}}_{\text {g}}}, L_{{\text {est}}_{\text {b}}}]\) is the estimated color vector of the illuminant, and r, g, and b represent the red, green, and blue color channels, respectively.
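As a minimal illustration of Eq. 3, the following NumPy sketch applies the diagonal correction to a linear RGB image; it is not our MATLAB implementation, and the function and variable names are purely illustrative.

```python
# Minimal sketch of the diagonal (von Kries-style) correction in Eq. 3.
# `image` is assumed to be a linear RGB array of shape (H, W, 3) and
# `L_est` the estimated illuminant [L_r, L_g, L_b]; names are illustrative.
import numpy as np

def white_balance(image: np.ndarray, L_est: np.ndarray) -> np.ndarray:
    """Scale the red and blue channels so the estimated illuminant maps to gray."""
    L_r, L_g, L_b = L_est
    gains = np.array([L_g / L_r, 1.0, L_g / L_b])  # diagonal of the matrix in Eq. 3
    return image * gains                           # broadcast over the channel axis
```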

In the field of computational color constancy, numerous methods have been developed. Two algorithms inspired by the human visual system, namely max-RGB and gray world, were proposed decades ago, yet they are still used as building blocks of many methods [3, 9]. In one of our previous studies, we also used the assumptions of the max-RGB and gray world algorithms to design a color constancy algorithm. In that method, we further assumed that if the scene is achromatic on average, the shift of the highest-luminance pixels away from an achromatic value should be the result of external factors, and this shift should be in the direction of the global illumination source [10]. We observed that our learning-free method provides satisfying results in estimating the illuminant even when lights beyond the standard illuminants are present in the scene, whereas learning-based methods struggle to discount such lights since they are seldom included in existing benchmarks. On the other hand, we realized that some pixels reduce the performance of our algorithm, which coincides with the observations reported in several studies [11,12,13,14,15]. In order to minimize the negative impact of these pixels, we adjusted our method so that only the blocks containing salient image elements, i.e., the whitest pixels, are used [16]. In the same study, we showed that block-based computations and salient pixels increase the performance of the max-RGB and gray world algorithms. Taking inspiration from our previous investigations, we recently further modified our method with a scale-space approach and demonstrated that, like block-based operations, multi-scale computations can also improve the effectiveness of several color constancy algorithms [17].

In this paper, we extend our previous works by adding new discussions and experiments on new datasets, while also modifying our algorithm for mixed illumination conditions with a simple yet effective approach. First, we investigate how the individual stages of our method contribute to illumination estimation accuracy and analyze how the best-performing stages affect the efficiency of learning-free color constancy algorithms. With this investigation, we not only introduce a color constancy algorithm but also provide a comprehensive analysis of how scale-space computations, block-based operations that consider only the salient pixels, and their combination affect the performance of existing methods. As a result, we demonstrate a simple yet effective approach to modify these algorithms so that their accuracy in estimating the illuminant improves significantly. Moreover, we propose an approach to transform our global color constancy method into an algorithm that can achieve color constancy under mixed illumination conditions without utilizing any prior information about the scene.

Overall, we can summarize our contributions as follows:

  • We propose a multi-scale block-based color constancy algorithm that takes advantage of scale-space and the varying local spatial statistics, while only considering the informative image elements to estimate the illuminant.

  • We demonstrate that the efficiency of several learning-free color constancy algorithms can be improved by using different parts of our algorithm.

  • We show that with a simple modification, the proposed technique designed for global color constancy can be converted into an algorithm for mixed illumination conditions which does not require any information about the scene, i.e., the number of the light sources.

This paper is organized as follows. We provide a brief literature review in Sect. 2. We detail the proposed method in Sect. 3. We present our experimental setup in Sect. 4, and we discuss our results in Sect. 5. Lastly, we give a brief summary of the study in Sect. 6.

2 Related work

Numerous algorithms have been proposed to overcome the ill-posed nature of color constancy. In this section, we provide a brief review of global and multi-illuminant color constancy algorithms, which we utilize for comparison in our experiments. We group these methods into two categories: traditional algorithms and learning-based algorithms. While the former estimate the illuminant purely from image statistics, the latter extract features from large-scale datasets to discount the effects of the illuminant. We would like to note that providing a full literature survey is outside the scope of this work; research dedicated to this aim can be found in the following studies [3, 18, 19].

2.1 Traditional algorithms

There are two well-known traditional color constancy methods, i.e., the gray world and the max-RGB. The gray world assumption was formalized by Buchsbaum in 1980. The gray world method depends on the assumption that on average the world is gray [20]. To estimate the color vector of the light source, the gray world method takes the mean of each color channel individually and outputs a vector formed by these average values as its illuminant estimate. The max-RGB method is based on the Retinex algorithm proposed by Land in 1971 [21]. To find an illuminant estimate, the max-RGB algorithm finds the maximum response of each color channel separately to form the color vector of the illuminant. These two methods establish the foundations of many color constancy approaches due to their simplicity and effectiveness. For instance, the shades of gray algorithm supposes that the mean of pixels raised to a certain power is gray [22]. The gray edge method and weighted gray edge algorithm assume that the mean of the high-frequency components of the image is achromatic [23,24,25]. The bright pixels method stresses the importance of the bright image elements [11]. The mean-shifted gray pixels method detects the gray pixels and uses them to find the illuminant [13]. The gray pixels color constancy algorithm detects the gray pixels in a scene by using a grayness measure and utilizes them to estimate the illuminant [14].
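As a minimal point of reference for the rest of this section, these two estimators can be sketched as follows for a linear RGB image stored as an (H, W, 3) NumPy array; the function names are illustrative.

```python
# Minimal sketches of the gray world and max-RGB estimators described above.
import numpy as np

def gray_world(image: np.ndarray) -> np.ndarray:
    """Illuminant estimate as the per-channel mean (gray world assumption)."""
    L = image.reshape(-1, 3).mean(axis=0)
    return L / np.linalg.norm(L)

def max_rgb(image: np.ndarray) -> np.ndarray:
    """Illuminant estimate as the per-channel maximum (max-RGB / Retinex)."""
    L = image.reshape(-1, 3).max(axis=0)
    return L / np.linalg.norm(L)
```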

There are also other traditional methods that depend on different approaches. For instance, the local surface reflectance statistics approach considers the feedback modulation mechanism in the eye and the linear image formation model [26]. The principal component analysis-based color constancy method uses only certain pixels, which have the largest gradient in the data matrix [12]. The double-opponency-based color constancy method is based on the physiological findings on color processing [27]. The biologically inspired color constancy method is based on the hierarchical structure of the human visual system and mimics our sensation on color illusions [9].

Alongside the algorithms assuming the illuminant is uniform throughout the scene, there are also several methods trying to solve the problem of color constancy for spatially varying illumination conditions. One early algorithm called local space average color was introduced by Ebner [28], who computationally modeled the biological finding that the human visual system might be discounting the illuminant based on local space average color. Gijsenij et al. introduced a block-based method that can be applied to global color constancy algorithms in a local manner [6]. The conditional random fields-based method modifies existing global color constancy methods to apply them to mixed-illuminant conditions [29]. The retinal-inspired color constancy model investigates the color processing mechanisms at a certain level of the retina [30]. The color constancy weighting factors method divides the image into regions and checks whether a region contains sufficient information for color constancy by utilizing the normalized average absolute difference of each area as a measure of confidence [31]. The color constancy for images of non-uniformly lit scenes divides the image into regions and estimates the illuminant by using the regions containing sufficient color variation [32]. The color constancy adjustment based on texture of image method takes advantage of textures to detect image elements that have sufficient color variation and utilizes these to find the illuminant estimate [33]. The visual mechanism-based color constancy with the bottom-up method mimics the bottom-up mechanisms of the human visual system [34]. The N-white balancing algorithm finds the number of white points in the image to discount the illuminant of the scene [35].

Fig. 1 The flowchart of our proposed illumination estimation algorithm. We use the luma image to obtain the salient pixels and their weights to form our informative image. Subsequently, we estimate the illuminant by carrying out computations in scale-space. We compute the scaling vectors \(\textbf{C}\) to find the deviation of the brightest pixels from the gray value. Then, we take the mean of the scaling vectors at each level to find an illuminant estimate \({\textbf{L}}_s\) for the corresponding scale s. Lastly, we average all the illuminant estimates to find the color vector of the global light source \({\textbf{L}}_{\text {est}}\) illuminating the scene

2.2 Learning-based algorithms

The convolutional color constancy study is one of the earliest color constancy methods based on convolutional neural networks (CNNs) [36]. In this study, Barron transformed the operation of illumination estimation to an object detection application by formulating color constancy as a 2D spatial localization task. The deep specialized network for illuminant estimation is based on a convolutional network architecture, and it is sensitive to diverse local regions of the scene [37]. The fast Fourier color constancy method carries out computations in the frequency domain and transforms the light source estimation task into a spatial localization operation on a torus [38]. The quasi-unsupervised color constancy algorithm relies on the detection of gray pixels without using a huge amount of labeled data during its training phase [39]. The color constancy convolutional autoencoder method utilizes convolutional autoencoders and unsupervised pre-training to estimate the illuminant of the scenes [40]. The sensor-independent color constancy model maps the input to a sensor-independent space by using an image-specific matrix [41]. The bag of color features color constancy method utilizes convolutional neural networks, and it is based on bag-of-features pooling [42]. The cross-camera convolutional color constancy algorithm is trained on images captured with several different cameras, and during inference, additional unlabeled images are given as input so that the model can calibrate itself to the spectral properties of the testing set [43]. The combining color constancy algorithms model efficiently merges color constancy algorithms from the motivation that not all algorithms perform well on all scenes since the performance of algorithms is sensitive to the content of the input scene [44]. The one-net: convolutional color constancy simplified model does not utilize pre-trained layers and large kernels to show that complex models are not necessary to achieve high performance [45]. Alongside single-illuminant cases, learning-based models are also being used in mixed-illumination conditions. Bianco et al. introduced a CNN-based color constancy method which has a detector that finds the number of light sources in the scene [46]. The physics-driven and generative adversarial networks (GAN)-based color constancy method transforms the illumination estimation task into an image-to-image domain translation problem [47].

Apart from algorithms aiming at estimating the illuminant to perform color constancy, there also exist models that correct improperly white-balanced images without explicitly estimating the color vector of the light source. For instance, the KNN white-balance method computes a nonlinear color mapping function for correcting the colors of the image [48]. The extension of the KNN white-balance method is the interactive white-balance, which relates the nonlinear color mapping functions directly to the colors chosen by the users to allow interactive white-balance adjustment [49]. The deep white-balance algorithm realistically edits the white-balance of an sRGB image by mapping the input to two white-balance settings by using a deep neural network [50]. The auto-white balance for mixed scenes renders the input scene with a small number of predefined white-balance settings, through which it forms weight maps that are used to fuse the rendered scenes into a white-balanced output [51]. The style white-balance improves contemporary auto-white balance models for single- and mixed-illuminant inputs by modeling the illumination as a style factor [52].

While the learning-based algorithms usually outperform the traditional methods on well-known benchmarks, their performance may decrease when they face images captured with unknown hardware specifications and/or when the test samples contain lighting conditions different from their training set. These observations have been reported in several studies, where researchers shared their concerns about the problems of learning-based methods [14, 53,54,55]. Also, in our recent study, we explicitly showed the performance decrease of learning-based algorithms when they process images containing unique illumination conditions that are seldom considered in available benchmarks [10]. We can briefly summarize the reasons behind the performance drop as follows: most public benchmarks are formed with similar hardware, out-of-the-ordinary lights are usually not considered during dataset creation, and learning-based algorithms suppose that their training and test sets are similar in one way or another [10, 54].

3 Proposed method

In this section, we detail our method (Fig. 1). We build our algorithm upon two assumptions, which have a correspondence in the human visual system. Since the human visual system might be discounting the illuminant of the scene based on space-average color and highest-luminance patches [5, 20, 21, 56,57,58,59,60], we assume that, on average, the world is gray and that there are several bright pixels somewhere in the scene. We form our main idea around these assumptions and posit that the deviation of the brightest pixels from the achromatic value should be caused by the light source (Fig. 2). We note that both the bright pixels and the achromatic value might change throughout the scene due to the varying local surface orientations. Therefore, our method relies on a block-based approach to ensure that the algorithm is sensitive to local spatial information, which is usually neglected in studies operating at the image level. Moreover, as stated in many color constancy studies, not all image elements are informative for estimating the illuminant. Therefore, we utilize only the image elements that we extract from the brightest pixels in the scene, since pixels having the highest luminance might be useful for human color constancy [57,58,59,60]. We adaptively weight these salient pixels since their contribution to the task of color constancy can vary throughout the scene, i.e., they may not contribute equally to the illumination estimation task due to the changing local statistics of the scene. Furthermore, we carry out our computations at multiple scales, since many studies have demonstrated the importance of scale-space, especially for tasks relying on color features, due to its sensitivity to the low-level features of images [61,62,63,64,65].

Fig. 2 The proposed method is based on two assumptions: (i) the world is gray on average and (ii) there are several bright pixels somewhere in the scene. Our aim is to find the color vector of the light source by finding the deviation of the brightest pixels \(\textbf{P}_{\text {max}}\) from the gray value \(P_\mu \) by using the scaling vectors \(\textbf{C}\). Since the local surface orientations vary throughout the image, we calculate the scaling vectors for each non-overlapping block. Finally, we estimate the color of the global light source by taking the mean of all scaling vectors

In our method, we first apply a gamma correction in case an sRGB image is provided as input. We carry out this operation to recover the linear relationship between pixel values. Moreover, we do not consider the under- and over-saturated pixels, i.e., approximately the top \(5\%\) and the bottom \(2\%\) of image elements, in our calculations to reduce possible noise. Then, we focus on the extraction of the most informative pixels, i.e., salient regions, in the scene. Several studies have pointed out that not all pixels in an image are useful for the task of color constancy [9, 11, 12, 14, 15]. For instance, dominant sky regions tend to bias the estimates of the light source; hence, they should be handled separately. In order to find the regions containing only the informative pixels, we make use of the brightest pixels in the scene since it is known that the human visual system might be discounting the illuminant based on the highest luminance in the scene [57, 58]. To determine the salient pixels, we benefit from the black-white opponent channel \(O_{\text {BW}}\) of the opponent color space. We compute \(O_{\text {BW}}\) as follows [3];

$$\begin{aligned} O_{\text {BW}} = \frac{r+g+b}{\sqrt{3}} \end{aligned}$$
(4)

Subsequently, we form a binary saliency map \({\mathcal {S}}\) by selecting the informative pixels that correspond to the top \(3.8\%\) brightest pixels in the black-white opponent channel \(O_{\text {BW}}\). The selection of this percentage of the brightest pixels is described in detail in Sect. 5.1.

For a scene, all the pixels highlighted by the saliency map may not have equal brightness. Thus, using the binary saliency map directly may not be effective while estimating the illuminant. Thereupon, we adaptively weight the pixels in the saliency map by forming a map \({\mathcal {W}}\) from the black-white opponent channel \(O_{\text {BW}}\) by fitting the pixels into a Gaussian function (Eq. 5). This weight map allows us to weaken the contribution of the darker image elements while giving more attention to the pixels having the highest luminance.

$$\begin{aligned} \small {\mathcal {W}}(x,y) = 1 - \frac{1}{2\pi \sigma ^2} \cdot {\text {exp}}\left( -\frac{ (O_{\text {BW}}(x,y) - \mu )^2}{2\sigma ^2}\right) \end{aligned}$$
(5)

where \(\sigma \) and \(\mu \) represent the standard deviation and the mean of \(O_{\text {BW}}\), respectively.

Subsequently, by using the saliency and weight maps, we create an informative image \({\mathcal {I}}\), where the salient regions are adaptively weighted as follows;

$$\begin{aligned} {\mathcal {I}} = I(x,y) \cdot {\mathcal {W}}(x,y) \cdot {\mathcal {S}}(x,y). \end{aligned}$$
(6)
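The informative image formation stage (Eqs. 4-6) can be sketched as follows for a linear RGB image of shape (H, W, 3); this is an illustrative NumPy sketch rather than our implementation, and the 3.8% threshold anticipates the parameter study in Sect. 5.1.

```python
# Sketch of Eqs. 4-6: black-white opponent channel, binary saliency map of the
# top brightest pixels, adaptive Gaussian weighting, and the informative image.
import numpy as np

def informative_image(image: np.ndarray, top_percent: float = 3.8) -> np.ndarray:
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    o_bw = (r + g + b) / np.sqrt(3.0)                     # Eq. 4

    thresh = np.percentile(o_bw, 100.0 - top_percent)
    saliency = (o_bw >= thresh).astype(float)             # binary saliency map S

    mu, sigma = o_bw.mean(), o_bw.std()
    weights = 1.0 - np.exp(-(o_bw - mu) ** 2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)  # Eq. 5

    return image * (weights * saliency)[..., None]        # Eq. 6
```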

Afterward, we carry out our computations in scale-space, where we can highlight the low-level features of images. We obtain representations of the images at different levels, and we determine the number of scales \({\mathcal {L}}\) adaptively based on the image resolution as \({\mathcal {L}} = \lfloor {\text {log}}({\text {min}}(h,w)) / {\text {log}}(2) \rfloor \), where h and w are the height and width of the image, respectively. Then, to respect the local surface orientations, we divide the image into non-overlapping blocks consisting of m pixels (Fig. 1). The parameter m depends on the image resolution at the scale of interest, and it is calculated as \(m = \sqrt{(h \cdot w) / \eta }\), where \(\eta \) is the controlling parameter of m, which is taken as 120 based on practical experiments (the process of determining this parameter is provided in Sect. 5.1).
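The two adaptive parameters can be computed as in the short sketch below; here m is treated as the side length of a square block, which is an assumption about how the block size relates to \(\eta \).

```python
# Sketch of the adaptive number of scales L and the block-size parameter m,
# with the controlling parameter eta = 120 (Sect. 5.1).
import numpy as np

def num_scales(h: int, w: int) -> int:
    return int(np.floor(np.log(min(h, w)) / np.log(2.0)))

def block_side(h: int, w: int, eta: float = 120.0) -> int:
    # m = sqrt((h * w) / eta); interpreted here as the block side length
    return max(1, int(round(np.sqrt((h * w) / eta))))
```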

It is worth mentioning that we divide the image into non-overlapping blocks only at levels higher than half of the number of possible scales that can be obtained in the pyramid. The reason behind this is that at coarser scales the blocks would contain a small number of pixels which would violate one of our algorithm’s assumptions, i.e., the world is achromatic on average, since the gray world assumption requires a sufficient number of distinct colors to be present in the scene [3]. This requirement also coincides with the mechanisms of the human visual system which are explicitly demonstrated in Land’s study [56]. In the experiments, Land showed that two color patches with different colors, i.e., yellow and green, taken from a Mondrian image are perceived as grayish-white only when their centers are viewed through viewing tubes in void mode, i.e., patches are viewed so that they are isolated from their local neighbors (Fig. 3) [3].

Fig. 3 The observers viewed only the center of the colored patches of the Mondrian through an adjustable aperture. Even though the colors of the patches are yellow and green, the observers perceived them as grayish-white. When the Mondrian is viewed as a whole, the observers identify the actual reflectance of the patches

After we divide the images in scale-space into blocks, we determine the descriptors to estimate the color vector of the light source in every block. Note that for simplicity, we refer to each image in coarser scales that is not divided into blocks as a block. As we mentioned before, our algorithm builds upon the assumptions of the gray world method and the max-RGB. For each block P we assume that there is a unique achromatic value, which is our first descriptor that is used to find the illuminant estimate. We compute this gray value \(P_\mu \) by taking the average over all channels within the block of interest instead of taking the mean of all pixels in the image directly. We calculate a particular gray value for each block to take the varying local spatial statistics into account. For the same reason, we determine our second descriptor by taking the maximum response of each block channel individually and represent them by \(\textbf{P}_{\text {max}} = [ P_{{\text {r}},{\text {max}}}, P_{{\text {g}},{\text {max}}}, P_{{\text {b}},{\text {max}}}]\).

We calculate the color vector of the light source of each block in every scale by using both of our assumptions. Based on our main idea we compute how much the brightest values \(\textbf{P}_{\text {max}}\) deviate from the gray value \(P_\mu \). We assume that if the world is achromatic on average, then the summation of the intensity values of \(\textbf{P}_{\text {max}}\) should equal \(P_\mu \). However, if there is a shift away from the achromatic value this deviation should be in the direction of the color vector of the light source. We can find this deviation by using a vector \(\textbf{C}_P = [c_{\text {r}}, c_{\text {g}}, c_{\text {b}}]\), where each element of \(\textbf{C}_P\) scales the intensities of \(\textbf{P}_{\text {max}}\) such that they sum to \(P_\mu \) as follows;

$$\begin{aligned} P_{{\text {r}},{\text {max}}} \cdot c_{\text {r}} + P_{{\text {g}},{\text {max}}} \cdot c_{\text {g}} + P_{{\text {b}},{\text {max}}} \cdot c_{\text {b}} = P_\mu . \end{aligned}$$
(7)

We convert Eq. 7 into an optimization problem to find \(\textbf{C}_P\) as follows;

$$\begin{aligned} \textbf{C}_P = \underset{\textbf{C}_P}{\arg \min } \left\| \textbf{P}_{\text {max}} \, \textbf{C}_P - P_\mu \right\| _2 \quad {\text {with}}~\forall c \in \textbf{C}_P: c > 0. \end{aligned}$$
(8)

Note that we minimize not only the residual of this optimization problem but also the Euclidean norm of \(\textbf{C}_P\). In other words, if the problem admits multiple solutions, we take the one that minimizes the norm of \(\textbf{C}_P\).
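Since Eq. 7 is a single equation in three unknowns, the constrained problem in Eq. 8 admits a simple closed form when the block maxima and the gray value are strictly positive: the minimum-norm least-squares solution \(\textbf{C}_P = P_\mu \textbf{P}_{\text {max}} / \Vert \textbf{P}_{\text {max}}\Vert ^2\) is then automatically positive. The sketch below uses this closed form in place of an explicit solver; it is an illustrative shortcut under the stated assumption, not our implementation.

```python
# Sketch of the per-block scaling vector of Eq. 8, assuming strictly positive
# block maxima and gray value so that the minimum-norm solution is feasible.
import numpy as np

def block_scaling_vector(block: np.ndarray) -> np.ndarray:
    """block: linear RGB pixels of one block, any shape ending in 3 channels."""
    p_max = block.reshape(-1, 3).max(axis=0)    # per-channel maxima P_max
    p_mu = block.mean()                         # block gray value P_mu
    return p_mu * p_max / np.dot(p_max, p_max)  # minimum-norm solution of Eq. 7
```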

After we obtain the \(\textbf{C}_P\) values for each block, we calculate the illuminant estimate for each scale by averaging the \(\textbf{C}_P\) as follows;

$$\begin{aligned} \textbf{L}_{s} = \frac{1}{n} \sum _{k=1}^{n}{\textbf{C}_{P_k}} \end{aligned}$$
(9)

where \(\textbf{L}_{s}\) is the color vector of the light source at a certain scale s, and n is the number of blocks. In case we obtain \(\textbf{C}_P\) in coarser scales where we do not divide the image into blocks, we directly take the deviation obtained from Eq. 8 as our illuminant estimate for that scale.

Since we assume that the scenes are uniformly illuminated, we linearly combine the estimations from each level to obtain a single-illuminant estimate \(\textbf{L}_{\text {est}}\) for the given image as follows;

$$\begin{aligned} \textbf{L}_{\text {est}} = \frac{1}{{\mathcal {L}}} \sum _{k=1}^{{\mathcal {L}}}{\textbf{L}_{s_k}} \end{aligned}$$
(10)

where \(\textbf{L}_{\text {est}}\) is then converted into a unit vector.
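The per-scale and global averaging of Eqs. 9 and 10 can then be sketched as below. The `pyramid` list, the `blocks_of` helper that splits an image into non-overlapping blocks, and the assumption that block division is applied only at the finer half of the pyramid (with level 0 being the finest scale) are illustrative choices based on the description above.

```python
# Sketch of Eqs. 9-10: average the per-block scaling vectors within each scale,
# then average across scales and normalize to a unit vector.
import numpy as np

def estimate_illuminant(pyramid, blocks_of):
    per_scale = []
    for level, image in enumerate(pyramid):          # level 0 = finest scale (assumed)
        if level < len(pyramid) // 2:                # finer scales: block-based estimates
            C = [block_scaling_vector(b) for b in blocks_of(image)]
            per_scale.append(np.mean(C, axis=0))     # Eq. 9
        else:                                        # coarser scales: whole image as one block
            per_scale.append(block_scaling_vector(image))
    L_est = np.mean(per_scale, axis=0)               # Eq. 10
    return L_est / np.linalg.norm(L_est)
```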

Lastly, we can discount the illuminant to obtain a white-balanced image \(I_{\text {wb}}\) by using Eq. 3.

3.1 Application to multi-illuminant color constancy

A common drawback of several multi-illuminant color constancy methods is the requirement of prior information about the number of clusters or segments, which depends on the number of illuminants present in the scene [66]. We argue that, just as we cannot know the type of light source and/or the type of capturing device, we also cannot be sure about the number of lights illuminating the scene. Therefore, algorithms that utilize the number of illuminants as prior information will face a significant challenge in discounting the illuminant when this parameter is not provided correctly. Consequently, a robust algorithm for mixed-illumination conditions has to be designed free of such prior information.

As aforementioned, we develop our method for the cases where we have a uniformly illuminated scene. However, we can modify our method with a simple yet effective approach so that it can provide pixel-wise estimates of the illuminant for the scenes illuminated by varying illumination conditions. In this subsection, we explain our modification that can transform our global color constancy method into a multi-illuminant color constancy approach, which does not require any prior information about the scene.

Our first adjustment concerns the informative image formation stage (Fig. 1). For scenes illuminated by multiple light sources, we use all pixels that are close to white, i.e., pixels having the highest luminance, instead of using only the top brightest image elements. We need this modification since, in multi-illuminant color constancy, we need to take more spatial information into account. If we utilized only a small number of pixels, we could not provide accurate pixel-wise illuminant estimates since we would lose the local relationship between neighboring pixels, which is an important cue for mixed illumination conditions [9, 61]. As a result, we use all the pixels that are closest to white. We can explain the idea of using the whitest pixels from two different perspectives: (i) digital photography, and (ii) human color constancy. In digital photography, we know that we can easily determine the color vector of the illumination by using the white pixels in the image instead of utilizing the colored ones [9]. For instance, let us assume that we capture a picture of a room illuminated with yellow light. The room has white walls and contains different objects having distinct colors. We can determine the color vector of the light source more easily from the white walls than from the objects since the capturing device will measure the light reflected from the white walls as yellow. Furthermore, from the findings on human color constancy, we know that the areas having the highest luminance might be used by the human visual system to discount the effects of the light source [3, 9, 57, 59, 60].

We determine the image elements closest to white by using a simple yet effective approach [9]. In order to find such image elements, we form a temporary color vector of the light source by taking the mean of each color channel individually; in other words, we apply the gray world algorithm since we are assuming that the world is gray on average. Then, we obtain a temporary white-balanced image \(I_{\text {temp}}\) by scaling the input image according to this temporary color vector of the light source via Eq. 3. Afterward, we create a pixel-wise whiteness map \({\mathcal {W}}\) by calculating the pixel-wise angular distance between the white vector \({\textbf{w}} = [1~1~1]\) and \(I_{\text {temp}}\) as follows;

$$\begin{aligned} {\mathcal {W}}(x,y) = cos^{-1} \begin{pmatrix} \frac{{\textbf{w}} \cdot I_{\text {temp}}}{ \left\| {\textbf{w}} \right\| \cdot \left\| I_{\text {temp}} \right\| } \end{pmatrix}. \end{aligned}$$
(11)

Since the contribution of all spatial locations differs, we obtain a certainty map \({\mathcal {C}}\), by fitting \({\mathcal {W}}\) into a Gaussian function as follows;

$$\begin{aligned} {\mathcal {C}}(x,y) = \frac{1}{2\pi \sigma _{{\mathcal {W}}}^2} \cdot {\text {exp}}\left( -\frac{ ({\mathcal {W}}(x,y) - \mu _{{\mathcal {W}}})^2}{2\sigma _{{\mathcal {W}}}^2}\right) \end{aligned}$$
(12)

where \(\mu _{{\mathcal {W}}}\) and \(\sigma _{{\mathcal {W}}}\) are the mean and the standard deviation of \({\mathcal {W}}\), respectively.

By using \({\mathcal {C}}\), we form our informative image (Eq. 13) which we use for our multi-scale block-based computations.

$$\begin{aligned} {\mathcal {I}} = I(x,y) \cdot {\mathcal {C}}(x,y) \end{aligned}$$
(13)
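A sketch of this whiteness-based informative image (Eqs. 11-13) is given below; it reuses the gray world idea and the white_balance sketch from earlier, and all names are illustrative.

```python
# Sketch of Eqs. 11-13: temporary gray-world correction, pixel-wise angular
# whiteness map, Gaussian certainty map, and the resulting informative image.
import numpy as np

def informative_image_multi(image: np.ndarray) -> np.ndarray:
    L_temp = image.reshape(-1, 3).mean(axis=0)       # temporary gray-world illuminant
    I_temp = white_balance(image, L_temp)            # temporary white-balanced image (Eq. 3)

    w = np.ones(3) / np.sqrt(3.0)                    # unit white vector
    norms = np.linalg.norm(I_temp, axis=-1) + 1e-12
    W = np.arccos(np.clip(I_temp @ w / norms, -1.0, 1.0))                          # Eq. 11

    mu, sigma = W.mean(), W.std()
    C = np.exp(-(W - mu) ** 2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)   # Eq. 12

    return image * C[..., None]                      # Eq. 13
```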

After obtaining our informative image, we follow similar operations to our global color constancy approach, i.e., we carry out block-based operations in scale-space. However, contrary to the single-illuminant case, for the multi-illuminant scenario, we do not consider the coarser scales of the pyramid where the locality starts to degrade since locality is an important cue for multi-illuminant color constancy [67]. We use half of the number of possible scales that can be reached in a pyramid since we observed in our experiments that the performance generally starts to degrade afterward. Moreover, we do not take the mean of the computed deviation \(\textbf{C}_P\), but we place \(\textbf{C}_P\) into the center of the corresponding block to obtain a sparsely populated image \({\mathcal {I}}_{C_P}\), i.e., an image only containing the estimated deviations in the center of each block. Then, in order to fill the missing pixels in \({\mathcal {I}}_{C_P}\), i.e., to obtain a dense image for a specific scale, where for every spatial location a pixel-wise estimate \({L}_{s}(x,y)\) is present, we carry out an interpolation between the neighboring center pixels by convolving \({\mathcal {I}}_{C_P}\) with a Gaussian kernel (Eq. 14). We follow this approach to obtain smooth transitions between adjacent blocks instead of assuming that all the pixels in a block have the same deviation, which would result in sharp changes between adjacent blocks.

$$\begin{aligned} {L}_{s}(x,y) = {\mathcal {I}}_{C_P} * \frac{1}{2\pi \sigma ^2} \cdot {\text {exp}}\left( -\frac{x^2 + y^2}{2\sigma ^2}\right) \end{aligned}$$
(14)

where \(*\) denotes the convolution operation. It is important to note that the scaling factor \(\sigma \), i.e., the controlling parameter of the Gaussian kernel, should be large enough to ensure that at least two neighboring deviations lie inside the kernel. This parameter is determined practically, and it is calculated as \(\sigma = 0.5 \beta \), where \(\beta = {\text {min}}(h,w)/2\).
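The interpolation of Eq. 14 can be sketched with SciPy's Gaussian filtering as below; the normalization by the blurred indicator of the block centers is an added assumption so that the interpolated values stay on the scale of the original estimates.

```python
# Sketch of Eq. 14: blur the sparse image of block-center estimates with a
# Gaussian whose sigma is tied to the image size, yielding a dense per-pixel
# estimate for one scale. The mask normalization is an assumption.
import numpy as np
from scipy.ndimage import gaussian_filter

def dense_scale_estimate(sparse: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """sparse: (H, W, 3) with C_P at block centers, zeros elsewhere; mask: (H, W) of centers."""
    h, w = mask.shape
    sigma = 0.5 * (min(h, w) / 2.0)                  # sigma = 0.5 * beta
    blurred = np.stack([gaussian_filter(sparse[..., c], sigma) for c in range(3)], axis=-1)
    support = gaussian_filter(mask.astype(float), sigma)[..., None] + 1e-12
    return blurred / support
```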

Fig. 4 (Left-to-right) The RECommended ColorChecker, the INTEL-TAU, the NUS-8, the MIMO, and the Mixed-Illuminant Test Set datasets, respectively

After obtaining \({L}_{s}\) for each scale, in order to find our pixel-wise estimates \(L_{\text {est}}(x,y)\), we process every \({L}_{s}\) as follows. We first upsample the coarsest scale so that it matches the size of the consecutive finer level. Then, we linearly combine the upsampled image with the one on the finer scale. Afterward, we upsample the resulting image to linearly combine it with its consecutive finer level. We carry out this operation until the finest scale is reached. The resulting image represents our pixel-wise illumination estimate \(L_{\text {est}}\).
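The coarse-to-fine combination described above can be sketched as follows; the list `estimates` (ordered from coarsest to finest), the equal weights of the linear combination, and the generic bilinear resize standing in for the upsampling step are all assumptions.

```python
# Sketch of collapsing the per-scale estimates L_s into the pixel-wise
# estimate L_est: upsample the coarser result and linearly combine it with
# the next finer level until the finest scale is reached.
import numpy as np
from scipy.ndimage import zoom

def collapse_estimates(estimates):
    acc = estimates[0]                                         # coarsest scale
    for finer in estimates[1:]:
        factors = (finer.shape[0] / acc.shape[0], finer.shape[1] / acc.shape[1], 1)
        acc = 0.5 * zoom(acc, factors, order=1) + 0.5 * finer  # upsample, then combine
    return acc                                                 # pixel-wise L_est
```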

4 Experimental setup

In this section, we describe our experimental setup. In Sect. 4.1, we introduce the datasets that we utilize to compare our algorithm’s performance with the studies briefly explained in Sect. 2 and with the initial and previous versions of our method, i.e., block-based color constancy and block-based color constancy with salient pixels, respectively [10, 16]. In Sect. 4.2, we explain the error metric that we use to report the statistical results. We would like to note that we performed the experiments on an Intel i7 CPU @ 2.7 GHz Quad-Core 16GB RAM machine using MATLAB R2021a.

4.1 Datasets

In order to investigate the contribution of the stages of the proposed algorithm to its performance, and to benchmark our method and the modified algorithms, we carry out comprehensive evaluations on 3 well-known global color constancy datasets, namely, the RECommended ColorChecker [68], INTEL-TAU [69], and NUS-8 [12] datasets. While in our previous works we used the RECommended ColorChecker and INTEL-TAU datasets, in this study we extend our discussion by also utilizing the well-known NUS-8 dataset. Moreover, to provide an analysis of multi-illuminant cases, we utilize 2 benchmarks, the Multiple Illuminant and Multiple Object (MIMO) Dataset [29], and the Mixed-Illuminant Test Set, which was recently created by Afifi et al. [51] (Fig. 4). While using these datasets, if necessary, we mask out the calibration objects, i.e., color checkers, and subtract the black level from the original images. Also, we clip the under- and over-saturated pixels to prevent the contribution of noisy pixels since it is known that these image elements negatively affect the performance of color constancy methods. In the following, we briefly introduce these benchmarks.

4.1.1 The RECommended ColorChecker dataset

The RECommended ColorChecker dataset is the updated version of the Gehler-Shi dataset [70]. In this modified version, the researchers provide accurate ground truths for each scene to solve the problems of the Gehler-Shi dataset. The dataset contains 568 scenes captured under a single dominant illumination. The scenes comprise 254 indoor, 85 outdoor, and 229 close-up images, which are taken by two different capturing devices, namely, Canon 1D and Canon 5D [54].

4.1.2 The INTEL-TAU dataset

The INTEL-TAU dataset is one of the largest color constancy benchmarks containing 7022 images. The 1466 indoor, 2327 outdoor, and 3229 close-up scenes are captured with different devices, namely, Nikon D810, Canon 5DSR, and Sony IMX135 [54]. The images are captured under one light source. All the images in INTEL-TAU are already processed, i.e., images have a linear response and their black level is calibrated. Moreover, contrary to other datasets, all the sensitive data in the INTEL-TAU dataset are handled, i.e., the faces and license plates are masked out. To utilize this dataset, we used the images belonging to the sets of “field1” and “field3” since the calibration object is unmasked in other sets, i.e., “lab printouts,” and “lab realscene,” and their masks are not provided.

4.1.3 The NUS-8 dataset

The NUS-8 dataset is another publicly available global color constancy benchmark, which contains a total of 1736 raw images. The images containing 415 indoor, 279 outdoor, and 1159 close-up scenes are captured with 8 different cameras, namely, Canon EOS-1Ds Mark III, Canon EOS 600D, Fujifilm X-M1, Nikon D5200, Olympus E-PL6, Panasonic Lumix DMC-GX1, Samsung NX2000, and Sony SLT-A57 [54].

4.1.4 The MIMO dataset

The MIMO dataset is one of the well-known benchmarks in the field of color constancy for mixed illumination conditions. We evaluate our illumination estimation strategy on this dataset since many algorithms have already been tested on this benchmark [54]. The MIMO dataset contains a total of 78 linear images divided into two sets: (i) a real-world set containing 20 complex scenes, and (ii) a laboratory set containing 58 simple scenes.

4.1.5 The Mixed-Illuminant Test Set

The Mixed-Illuminant Test Set is a publicly available benchmark. This recently created dataset is rendered by computer graphics; hence, the ground truths are not biased by camera sensor specifications. The synthetic dataset contains a total of 150 images with 30 varying scenes. Each scene is rendered with 5 different mixed illumination conditions at different color temperatures, and for each scene, the ground truth white-balanced image is provided.

4.2 Error metric

To present statistical results, we adopt the well-known error metric, the angular error, between the color vector of the estimated illuminant \({\textbf{L}}_{\text {est}}\), and the ground truth \(\textbf{L}_{\text {gt}}\). The angular error between two vectors can be calculated as follows;

$$\begin{aligned} \varepsilon ({\textbf{L}}_{\text {est}}, \textbf{L}_{\text {gt}}) = {\text {cos}}^{-1} \begin{pmatrix} \frac{{\textbf{L}}_{\text {est}} \cdot \textbf{L}_{\text {gt}}}{\left\| {\textbf{L}}_{\text {est}} \right\| \cdot \left\| \textbf{L}_{\text {gt}} \right\| } \end{pmatrix}. \end{aligned}$$
(15)

While we report the mean, the median, the mean of the best 25%, and the mean of the worst 25% of the angular error for uniform illuminant cases, we analyze the mean and the median pixel-wise angular error for the mixed-illuminant cases.
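For completeness, the angular error of Eq. 15 can be computed as in the following sketch, here returned in degrees.

```python
# Sketch of the angular error in Eq. 15 between an estimate and the ground truth.
import numpy as np

def angular_error(L_est: np.ndarray, L_gt: np.ndarray) -> float:
    cos_angle = np.dot(L_est, L_gt) / (np.linalg.norm(L_est) * np.linalg.norm(L_gt))
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
```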

5 Experimental discussion

In this section, we provide a detailed experimental discussion while presenting both statistical and visual analyses on all datasets. Firstly, in Sect. 5.1, we demonstrate the parameter selection process that we followed during our algorithm design. Then, in Sect. 5.2, we investigate the contributions of the stages of our algorithm to the performance in detail. After analyzing the stages of our method, in Sect. 5.3, we discuss our outcomes on single-illuminant color constancy by adopting 3 datasets, while we also investigate the effects of modifying existing color constancy algorithms by using our observations in the ablation study. Lastly, in Sect. 5.4, we provide our results for the application to multi-illuminant color constancy.

As a final note, the detailed ablation study, the modification of existing algorithms by using the best-performing stages of our method, and the discussions on the NUS-8, MIMO, and Mixed-Illuminant Test Set datasets are all extensions to our previous works’ experimental discussions.

5.1 Discussion on parameter selection

As we mentioned in Sect. 3, our method depends on two parameters, i.e., the parameter that controls the size of the non-overlapping blocks, and the percentage of the brightest pixels which we utilize to form our informative image. In order to analyze their effects on performance, we investigate how the efficiency of our method changes for global color constancy with the consideration of their different combinations. For this analysis, we create a subset containing random samples from the global color constancy datasets. In order to determine the best-performing combination, we check the parameter combination having the lowest mean angular error (Table 1).

As aforementioned, our algorithm is built upon the gray world assumption. Thus a sufficient number of pixels have to fall into each non-overlapping block since the gray world assumption is only valid when there are an adequate number of distinct colors in the image, i.e., block. In our algorithm, both of the parameters (number of brightest pixels and number of pixels per block) affect the number of pixels that we use to estimate the illuminant in each block. While the top brightest pixels are related to the image statistics, the number of image elements in each non-overlapping block is dependent on the image size. As shown in Table 1, the performance of our algorithm increases when a sufficient number of blocks having an adequate number of image elements are taken into account. We can explain this observation with the fact that the possibility of changing surface orientations increases as we select a sufficient number of blocks with an adequate number of pixels. Moreover, since the chance of obtaining uniform colored areas decreases as we choose blocks with an appropriate size, we can satisfy the assumptions of the gray world by choosing blocks with a sufficient number of pixels. Due to similar reasons, the number of the brightest pixels falling into a block has to satisfy our assumptions.

As seen in Table 1, the best combination is obtained by choosing the top \(3.8\%\) brightest pixels and the controlling parameter of the block size as 120.

Table 1 Selecting the best parameter combination on a subset containing random samples

5.2 Ablation study

Table 2 Ablation study on the steps of the proposed method

We conduct an ablation study on a single dataset, which we obtain by combining all benchmarks, to analyze the contribution of each step of our method to the performance of color constancy. In Table 2, we provide the results of our investigation, where baseline refers to solving only Eq. 8 without the informative image formation step and without carrying out block-based computations in scale-space. The outcomes we provide alongside the baseline correspond to solving Eq. 8 while considering only particular stages of our proposed approach, such as using only the informative image or only carrying out the computations in scale-space. We divide our investigation into three parts: first, we analyze the contribution of each stage alone, then we investigate their dual combinations, and lastly, we present the results of the full proposed technique. For each part, we choose the best-performing strategy alongside the proposed method, which we later use to modify several learning-free color constancy algorithms.

We observe that each component of our algorithm contributes noticeably to the performance. Solving Eq. 8 by utilizing the informative image slightly increases the performance of our method, whereas, as presented in Table 2, using either the blocks or the scale-space increases the performance substantially. Comparing the angular errors, carrying the computations into scale-space results in slightly better performance than utilizing blocks. The reason can be explained as follows: when we reduce the image from a finer scale to the consecutive coarser scales, we effectively apply a local averaging between the pixels, which acts like a block-based operation, especially at the coarser scales. Therefore, due to its sensitivity to the low-frequency components of the image, i.e., colors, and this implicit block-based behavior at the coarser scales, the scale-space approach performs slightly better than estimating the illuminant only at the finest scale with block-based operations.

After investigating the individual steps of our algorithm, we analyze their dual combinations. We notice that the efficiency of our method increases more when we combine the blocks with the informative image than when we combine the scale-space with the informative image. This performance difference can be explained by the contribution of the varying local statistics: when we divide the image into non-overlapping blocks, we take more varying local estimates into account than in the scale-space, since the number of blocks is higher than the number of scales for an image. Nevertheless, we obtain the best outcomes when we combine all three steps, since the locality is respected the most when all stages are considered and only the informative regions are taken into account.

5.3 Results of single-illuminant color constancy

In order to present the performance of our method and to analyze the effectiveness of carrying out the computations of several learning-free color constancy algorithms by using the highlighted steps in the ablation study, we make a comprehensive comparison with numerous color constancy algorithms. For each benchmark, we obtain the results of the methods by either running their codes without any optimization or by making use of the reported outcomes of their works and recent publications, which are considered to be up-to-date and comprehensive [14, 41, 43]. While discussing the experimental outcomes, we first focus on the results of our proposed method. Then, we analyze the effects of modifying existing color constancy methods with our approach.

Table 3 Statistical results on the RECommended ColorChecker dataset
Table 4 Statistical results on the INTEL-TAU dataset
Table 5 Statistical results on the NUS-8 dataset

We provide statistical analyses for global color constancy in Tables 3, 4, and 5. On the RECommended ColorChecker dataset (Table 3), the first noticeable outcome is that we obtain the lowest mean and the mean of the worst 25% of the angular error among the traditional algorithms, while we achieve competitive results compared to the learning-based models. Furthermore, we observe that the extensions we made to the former version of our algorithm improved the performance of our method in all metrics. On the INTEL-TAU dataset (Table 4), again we obtain the lowest mean angular error among the learning-free algorithms, while we outperform 4 of the learning-based methods. On the NUS-8 dataset (Table 5), we achieve the lowest mean and the mean of the worst 25% of the angular error among the learning-free methods. Compared to the proposed algorithm’s previous version, the improvement in our best and worst cases leads to a significant decrease in the mean angular error.

Fig. 5 Visual comparison on random samples. (Left-to-right) Input image, ground truth, proposed algorithm, and color constancy method (top-to-bottom) max-RGB [21], wGE [25], GI [14], GW [20], DOCC [27], and PCA-CC [12]. The angular error is provided on the bottom-right side of the image

Fig. 6 The visual comparison of the proposed method for random samples taken among the worst cases of its previous version, block-based color constancy with salient pixels. (Left-to-right) Input image, ground truth, proposed method, and former version [16]. The angular error is provided on the bottom-right side of the image

We provide visual comparisons in Figs. 5 and 6 by using random samples from the benchmarks. It is known that in color constancy, scenes containing a limited number of distinct colors are challenging for the algorithms. In our visual results, we observe that even for these scenes our angular error is less than \(5^\circ \) (second and third rows of Fig. 5). Also, in Fig. 6, where we provide an analysis by taking random samples among the worst cases of our previous version, block-based color constancy with salient pixels, we can see that our worst cases improved significantly. Yet, scenes containing uniformly distributed colors and a limited number of bright pixels are still challenging (last row of Fig. 6).

After investigating the outcomes of our algorithm, we analyze how our modifications affect the performance of existing color constancy methods. As aforementioned, we modify the learning-free methods by replacing Eq. 8 with their own illuminant estimation computations. In particular, we investigate the effects of modifying several color constancy algorithms through our observations in the ablation study (Table 2), i.e., we select the best-performing steps of our algorithm. It is worth mentioning that we do not modify traditional algorithms that require information from parts of the image that are discarded in our approach. For instance, principal component analysis-based color constancy [12] needs information from both the brightest and darkest regions of the image; thus, our strategy, which does not consider the darkest pixels, is not suitable for modifying this algorithm. Furthermore, we do not apply our approach to learning-based methods, since they have a fixed input size requirement, while we use blocks with varying sizes, and resizing these blocks to meet the input requirements would distort the image.
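As an illustration of this modification strategy, the sketch below runs an existing global estimator on each non-overlapping block of the informative image and averages the per-block estimates; it reuses the hypothetical informative_image, gray_world, and block_side helpers sketched earlier and is only one possible reading of the "blocks + informative image" variant.

```python
# Sketch of modifying a learning-free estimator with the blocks and the
# informative image: run the estimator per block and average the estimates.
import numpy as np

def modified_estimator(image: np.ndarray, estimator, side: int) -> np.ndarray:
    info = informative_image(image)                  # salient, adaptively weighted pixels
    h, w, _ = info.shape
    estimates = []
    for y in range(0, h - side + 1, side):
        for x in range(0, w - side + 1, side):
            block = info[y:y + side, x:x + side]
            if block.max() > 0:                      # skip blocks without salient pixels
                estimates.append(estimator(block))
    L = np.mean(estimates, axis=0)
    return L / np.linalg.norm(L)

# Usage, e.g.: L_est = modified_estimator(image, gray_world, block_side(*image.shape[:2]))
```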

As shown in Tables 3, 4, and 5, the modified algorithms achieve lower mean angular errors compared to their original versions, while they also outperform several other color constancy algorithms. We observe that all three steps of our method, which are used to modify the algorithms, increase the performance significantly, while the highest performance increase is usually obtained by utilizing the blocks and the informative image. We can explain this outcome by two facts: (i) not all pixels are informative for color constancy, and (ii) taking varying local spatial information into account allows us to highlight local features that might not be noticed when operating on a global scale. In short, the noteworthy result is that, with slight modifications, existing simple yet effective methods can be improved so that they compete with the state-of-the-art algorithms or even outperform them. For instance, the original version of max-RGB has a mean angular error of 7.78 on the RECommended ColorChecker dataset; however, when we modify this algorithm by using the blocks and informative image, its mean angular error reduces to 3.46. Also, when the weighted gray edge method is applied in scale-space, it outperforms most of the state-of-the-art algorithms on the NUS-8 dataset, as shown in Table 5. Moreover, when we modify the algorithms, the worst-case errors decrease significantly on all benchmarks. Since improving the worst-case performance of color constancy algorithms is known to be important, this is a valuable outcome.

Table 6 Statistical results on the MIMO dataset

5.4 Results of multi-illuminant color constancy

For mixed illumination conditions, we provide statistical results on the MIMO dataset and on the Mixed-Illuminant Test Set. We report the results of existing methods by either running their codes or making use of already published works considered up-to-date and comprehensive. As aforementioned, the MIMO dataset contains two sets, i.e., the “Real-World” set and the “Laboratory” set. The Real-World set includes images that are closer to the scenes we observe in our daily lives; thus, it contains more complex scenes than the Laboratory set, which makes it more challenging [14]. The Mixed-Illuminant Test Set is a recent dataset, and it contains synthetic images rendered with computer graphics. This benchmark includes images with different room layouts that are illuminated under varying mixed illumination conditions. Thus, in our experiments, we evaluate our approach not only on real-world scenes but also on synthetically created “real-world-like” challenging images. In this section, we first provide statistical results for the MIMO dataset and discuss the outcomes. Then, we report our results for the Mixed-Illuminant Test Set.

Fig. 7 Results on the MIMO dataset. (Left-to-right) Input scenes, pixel-wise ground truths, and pixel-wise estimation of the proposed method

Table 7 Statistical results on the mixed-illuminant test set

According to the statistical analysis on the MIMO dataset (Table 6), our approach can provide pixel-wise estimates for mixed illumination conditions, and it surpasses several methods that are specifically designed to transform global color constancy algorithms into multi-illuminant ones. Also, while the statistical results seem competitive on both sets, it is worth pointing out that our algorithm neither requires any prior information about the scene, i.e., the number of illuminants or the number of image segments/clusters, nor is it trained using the illuminants from the MIMO dataset, as is the case for the GAN-based color constancy. We provide our pixel-wise estimations for both sets in Fig. 7.

Among all traditional algorithms, our method obtains the best average angular error on the Real-World set. Compared to the learning-based techniques, we provide the best median angular error together with GAN-based CC, while the best mean angular error is obtained by CNNs-based CC. On the Laboratory set, we provide competitive results; our performance there is higher than on the Real-World set, which arises from the complexity difference between these sets.

For the Mixed-Illuminant Test Set (Table 7), the proposed illumination estimation strategy obtains the second-best mean and the third-best median angular error. It is important to stress that our approach is a learning-free method; hence, it is independent of data and does not incur high-cost training phases, which we see as our main advantage over the state-of-the-art works, i.e., Auto White-Balance for Mixed-Scenes [51] and Style White-Balance [52].

As a final note, we would like to highlight the advantages and limitations of our algorithm that we addressed throughout the paper. Our algorithm is learning-free; thus, it has lower computational costs since it does not require a training phase. Also, our method is easy to implement, and we utilize only two parameters, which is considerably fewer than learning-based methods. Moreover, as stated in other color constancy studies, not all pixels are informative; therefore, we only consider the salient regions by utilizing the bright pixels, which enables us to reduce the impact of non-informative image elements. Furthermore, we carry out our computations on non-overlapping blocks, which allows us to take the varying local statistics of the scenes into account, something that might not be possible while operating on a global scale. Hence, modifying the algorithms by using the salient regions and blocks improves their average performance significantly. Also, we use scale-space computations, which are sensitive to the low-level features of images, to highlight color features that can be missed while operating only on a single scale.

On the other hand, it is well known in the field of color constancy that methods utilizing statistical properties, in particular traditional algorithms relying on the gray world assumption, have difficulty estimating the color vector of the light source in uniformly colored images, i.e., scenes containing dominant grass and sky regions, since the gray world assumption does not hold when there are only a limited number of distinct colors. To tackle this problem, most studies, including our method, guide their approaches by considering only specific regions or pixels. While highlighting the color features and taking local statistics only in the salient regions into account allows us to improve the efficiency of our traditional algorithm, scenes containing large uniformly colored areas remain more challenging than other images.

6 Conclusion

Color is an important feature not only for humans but also for various computer vision pipelines performing high-level vision tasks, e.g., object recognition and image dehazing. Due to its importance, computational color constancy has been an attractive field of study, and researchers in this domain have developed many successful color constancy algorithms. Yet, the aim of researchers is not only to develop new techniques to find the color vector of the light source but also to improve existing methods by combining various strategies, since this might help us to design simple yet efficient methods. From this motivation, we develop a computational color constancy algorithm based on the observation that space-average color and highest-luminance patches carry significant cues for human color constancy. We estimate the color vector of the light source by assuming that, on average, the world is achromatic and that there are several bright image elements somewhere in the scene. We further assume that if the scene is gray on average, the shift of the brightest pixels from the achromatic value should be caused by the light source. We carry out our computations in scale-space, where we find the estimates for each non-overlapping block individually by only considering the salient regions of the scene. Thereby, we take into account that surface orientations might vary throughout the scene and that not every image element is informative for performing color constancy. According to the experiments, the proposed algorithm achieves better performance than existing learning-free algorithms, while providing competitive results compared with learning-based methods. Furthermore, we demonstrate that the performance of several learning-free algorithms can be significantly improved by using particular steps of our algorithm. Lastly, we propose an approach that converts our global color constancy algorithm into a method for mixed illumination conditions that is free from prior information about the scene and obtains competitive results compared to the state-of-the-art.