1 Introduction

A well-performing color constancy (CC) algorithm is a key component of camera color processing pipelines. Color constancy is obtained by algorithms that estimate the illuminant white point from captured images. There are static methods that are based on physical or statistical properties of scenes (Yang et al., 2015; Qian et al., 2019) and learning-based methods that learn the white point mapping from training data (Barron, 2015; Barron & Tsai, 2017; Hu et al., 2017). While color constancy has been studied for a long time, the problem is not fully solved. Even the best algorithms may fail, for example, when the scene is dominated by a single color.

In this work, we propose a novel approach for computational color constancy: we replace the raw RGB images used by the existing methods with average color spectra of the captured scenes. Such spectral sensors are already available in high-end mobile phones; for example, the Huawei P40 Pro is equipped with an 8-channel average spectral sensor. It is noteworthy that average spectral measurements completely lack the spatial dimension, but the spectral domain captures the spectral fingerprints of illuminants, and thus the illuminant white point can be estimated by simple regression.

Fig. 1

Real spectral dataset examples. The solid black line denotes the light source power spectrum (ground truth) and the dashed line the measured average reflected spectrum. Gray dotted lines mark the 14 spectral channels used in our experiments. For each image, the three most important channels found by leave-one-out are colored with their corresponding wavelength and the most important is denoted by an asterisk (percentage numbers denote the increase in angular error when that channel is removed, compared to the second most important channel)

The core idea of spectral fingerprints is illustrated in Fig. 1. Typical light sources such as daylight, fluorescent, LED and tungsten are recognizable by the shapes of their power spectra. The claim can be validated by taking a spectral white point regressor trained with all channels and testing it on unseen images while switching off one channel at a time, i.e., a spectral channel is set to zero for each separate run without re-training the model. After running all the combinations, we compared the results to the reference where all channels were used normally. The increase of the error when a channel is set to zero indicates the importance of that channel for the given test image. The most important channel(s) should be characteristic of each light source. The results for the MLP regressor in Sect. 3 are shown in Fig. 1 for various scenes and light sources. The reflected spectra are not that different from the ground truth illuminant spectra even though the scenes are very chromatic in several of the illustrations. For example, for both daylight cases the most important channel is the same, around 415nm, even though the color content of the two scenes is very different; that wavelength contains a characteristic bump of the daylight spectrum. For tungsten halogen the important wavelengths are in the near-infrared region, which is characteristic of tungsten sources that emit a substantial amount of IR energy compared to the visible region. The LED cases illustrate how the illuminant fingerprints identify specific spectral peaks: the blue die peak of the cool white LED is captured by the most important channel, the other important channels record more information about the blue peak and the phosphor bump, and the remaining channels are clearly less meaningful. The warm white LED has so much more power in the yellow region that the important channels are focused there. The fluorescent spectrum is a similar and equally intuitive case.
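The leave-one-out protocol can be expressed in a few lines. The sketch below is a minimal illustration, assuming a trained regressor with a scikit-learn-style predict() and a per-sample angular error helper; both names are placeholders rather than our actual implementation.

```python
import numpy as np

def channel_importance(model, S_test, wp_gt, angular_error):
    """Zero out each spectral channel in turn (no re-training) and record
    the increase in mean angular error over the all-channels reference."""
    def mean_err(S):
        return np.mean([angular_error(e, g)
                        for e, g in zip(model.predict(S), wp_gt)])

    ref = mean_err(S_test)              # reference: all channels intact
    importance = []
    for c in range(S_test.shape[1]):
        ablated = S_test.copy()
        ablated[:, c] = 0.0             # switch channel c off
        importance.append(mean_err(ablated) - ref)
    return np.array(importance)         # one importance value per channel
```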

Our main contributions are:

  1. a novel approach for computational color constancy using the average color spectrum;

  2. a method to generate spectral data from the existing tristimulus (RGB) color constancy datasets for training purposes; and

  3. a simulation-based analysis of optimal spectral sensor design.

In all experiments our method obtains a lower average angular error than the existing RGB-based methods. It is noteworthy that the results are better even in the cross-dataset experiments, where our method is trained with generated data but tested with real data.

This work is an extended version of our recent paper (Koskinen et al., 2021) and covers additional research and design questions that were not addressed in the preliminary work. As the first extension, (4) we study the performance upper bound when the practical 14-channel sensor is replaced with a dedicated 65-channel sensor that corresponds to commercial spectrometers. As the second extension, (5) we study the difficult but important multi-illuminant case, which could be assumed too hard for our single pixel sensor without spatial information. Finally, this work includes (6) a more detailed description of the image data augmentation used to train spectral color constancy with limited samples, and additional visualizations and illustrations of the approach and its experimental results.

2 Related Work

Color constancy algorithms estimate the illuminant L in order to recover the scene R under white light. In the conventional setting L is estimated from the raw RGB image I. The existing algorithms can be divided into learning-free (static) and learning-based methods. Classical learning-free methods use image statistics in the RGB color space to find the illuminant white point. The most common such algorithm is the gray world algorithm (Buchsbaum, 1980), which assumes that the image chromaticity is gray on average. That assumption works in scenes with a lot of color variation. Extended versions of the gray world algorithm are the max-RGB (Barnard et al., 2002) and gray edge (van de Weijer & Gevers, 2005) algorithms. They assume that achromatic content is more likely to be found in certain areas of the image, such as regions near edges (gray edge) or around the maximum value (max-RGB). Updated versions of these algorithms can also weight each pixel based on its spatial statistics, such as the pixel's gradient or relative brightness. The classical methods work well in fairly many cases, but they perform poorly in challenging conditions, such as when the scene is dominated by a single chromatic color.
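As a concrete example, the gray world estimate that the extended methods build on can be written in a couple of lines. This is a minimal sketch of the Buchsbaum (1980) assumption, not code from any of the cited implementations.

```python
import numpy as np

def gray_world(raw_rgb):
    """Gray world: the scene is assumed gray on average, so the mean
    R, G, B values of the raw image estimate the illuminant white point."""
    wp = raw_rgb.reshape(-1, 3).mean(axis=0)
    return wp / np.linalg.norm(wp)      # unit-length white point estimate
```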

In recent evaluations on multiple datasets (Qian et al., 2019; Keshav & GVSL, 2019) the best performing learning-free algorithms are Grayness Index (GI) (Qian et al., 2019), Local Surface Reflectance Statistics (LSRS) (Gao et al., 2014), and Cheng et al. (2014), and the best performing learning-based ones are Decoupled Semantic Context and Color Correlation (DSCCC) (Keshav & GVSL, 2019), Fast Fourier Color Constancy (FFCC) (Barron & Tsai, 2017) and Fully Convolutional Color Constancy with Confidence (FC4) (Hu et al., 2017). The best method varies between the datasets and depends on whether the evaluation is single- or cross-dataset, but overall the differences are small.

There are a few works that study color constancy for (multi)spectral images. For example, Gevers et al. (2000) use spectral sensing for color constancy assuming that a white reference is available in the scene. Chakrabarti et al. (2011) model color constancy via spatio-spectral statistics similarly to conventional RGB white balance algorithms. Khan et al. (2017) also extend traditional color constancy algorithms to multispectral images with varying spectral resolutions. These works assume that a full spatial spectral image is available, but compact high-resolution spectral cameras are difficult to manufacture. Chen (2017) studies how the Corrected-Moments algorithm (Finlayson, 2013) can be extended and improved when applied to multispectral images. Spectral sharpening by Finlayson et al. (1994) aims to improve color constancy with the help of spectral sensing. Hui et al. (2018, 2019) have studied an illuminant source separation task for which they utilize spectral data. Their training data generation in the former paper is physics-based and uses pre-defined databases of illuminant and reflectance spectra. They also weight their spectral estimation according to a camera spectral response.

Research on spectral measurements is timely as new technological advances make it possible to manufacture miniaturized multispectral sensors. The recent works of Jensen (2020) and Wang et al. (2019) investigate practical implementations of portable spectral sensors.

3 Methods

Spectral sensors can be expressed mathematically in the same way as the RGB sensors of digital cameras. The formation of a raw RGB image I of a scene R with a camera C of known spectral sensitivities \(S_{i=R,G,B}\) under a global illumination L can be expressed as (von Kries, 1970)

$$\begin{aligned} \begin{aligned} I_{i}(x,y) = \int L(\lambda )S_{i}(x,y,\lambda )R(x,y,\lambda ) d\lambda , \\ i\in \{\text{ R,G,B }\} \end{aligned} \hspace{5.0pt}, \end{aligned}$$
(1)

where \(S_{i}(x,y,\lambda )\) denotes the spectral sensitivity of the Red, Green and Blue elements, \(i=\left\{ R,G,B\right\} \), and \(\lambda \) is the wavelength, which for human-perceivable colors spans 380–700 nanometers (nm). Below 380nm is the ultraviolet band and above 700nm the infrared band.

The RGB sensors are designed to capture photographs that match the color sensitive cells of the human visual system (HVS) (Palmer, 1999). However, for accurate color measurements the HVS-inspired wide-band RGB sensors \(C=C^{RGB}\) cause various problems such as metamerism. The problems can be largely avoided by spectral imaging with a spectral camera \(C^{spec}\) that has multiple narrowband sensor elements \(S_{i=1,\ldots ,N}\). Manufacturing a spectral camera with a high spatial resolution is difficult as it requires a mechanical filter wheel or a large number of photoreceptors for each band (Nathan & Michael, 2013; Gao & Wang, 2016).

3.1 Average Spectral Measurement

In this work, we omit the spatial dimension for color constancy. In that case, a spectral camera is not needed: the average spectrum can be measured by a point sensor that needs

  1. a wide-angle lens or a diffuser that covers the scene on the image plane (x, y) of Eq. 1, and

  2. N narrowband spectral sensor elements \(S_i\) behind the lens.

The sensor \(S_i\) response is

$$\begin{aligned} \bar{I}_i = \int _x\int _y I_{i}(x,y) \,dx\,dy = \int L(\lambda )S_{i}(\lambda )R(\lambda ) d\lambda \hspace{5.0pt}. \end{aligned}$$
(2)

The average spectral measurement of a scene R under the illumination L is stored as a vector \(\textbf{s}=\left( \bar{I}_1, \bar{I}_2,\ldots , \bar{I}_N\right) \). The color constancy problem is to obtain the illuminant L from the spectral response vector \(\textbf{s}\). In our simulations, \(\textbf{s}\) of only N=14 elements provides good accuracy. This means that sufficient information is available in five orders of magnitude (10\(^5\times \)) less data than in a 10MPix camera image.

The field of view (FOV) of the sensor should be as wide as possible in order to integrate and average the changes in the surrounding scene. This reduces the effect of small chromatic objects on the shape of the reflected spectrum, in the same way as the classic gray world (Buchsbaum, 1980) color constancy algorithm. The field of view should be at least as wide as the camera's FOV.

3.2 Sensor Design

The physical design has restrictions due to optics, electronics and material properties (Hamamatsu, 2019), but for simulation purposes the sensor responses \(S_i\) can be approximated by a Gaussian function, \(Gauss(\mu ,\sigma )\), with the maximum at 1.0, i.e., perfect quantum efficiency at the peak wavelength. The Gaussian filter response \(S_i\) is defined by its central wavelength \(\mu _i\) and bandwidth \(\sigma _i\)

$$\begin{aligned} S_i (\lambda ) = \exp \left( {-\frac{1}{2} \left( \frac{\lambda -\mu _i}{\sigma _i}\right) ^2}\right) \hspace{5.0pt}. \end{aligned}$$
(3)

The Gaussian spectral shape is a fair assumption also for a practical implementation (Jensen, 2020; Wang et al. , 2019).

Our objective is to find the optimal spectral sensor for color constancy so that it can be implemented in miniaturized hardware. The number of channels was experimentally tested for N = 4, 6, ..., 16. The central wavelengths, i.e., the Gaussian peaks, were adjusted to uniformly cover the visible spectrum from 380nm to 700nm. This range covers the core of the CIE photopic luminosity function (Guild & Petavel, 1931). The channel bandwidth was defined by the full width at half maximum (FWHM), and FWHM bandwidths of 10nm, 20nm and 30nm were tested. These bandwidths were selected to match the capabilities of current technologies. These settings give 21 different configurations, evaluated in Sect. 5.1.
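For clarity, the simulated filter bank can be written out directly from Eq. 3, with the FWHM converted to \(\sigma\) via \(\text{FWHM} = 2\sqrt{2\ln 2}\,\sigma\). The sketch below makes one assumption not fixed by the text: the outermost channel centers are placed exactly at 380nm and 700nm.

```python
import numpy as np

def gaussian_sensor(n_channels, fwhm_nm, lambdas=np.arange(380, 701)):
    """Simulated sensor: n_channels Gaussian responses (Eq. 3) with unit
    peak quantum efficiency, spaced uniformly over 380-700nm."""
    sigma = fwhm_nm / (2.0 * np.sqrt(2.0 * np.log(2.0)))   # FWHM -> sigma
    centers = np.linspace(380.0, 700.0, n_channels)
    return np.exp(-0.5 * ((lambdas[None, :] - centers[:, None]) / sigma) ** 2)

S = gaussian_sensor(n_channels=14, fwhm_nm=20)  # the design selected in Sect. 5.1
```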

3.3 65- and 3-Channel Reference Sensors

In addition to finding the best practical spectral sensor design for mobile use, we included in our experiments a “high quality reference sensor” that mimics the best available scientific spectrometers. For that purpose we defined a sensor with 5nm wide (FWHM) channels at 5nm intervals, resulting in 65 channels over the same 380–700nm range. This setting is similar to the Konica Minolta CL-70F spectrometer for the given spectral range. The 65-channel version is considered an upper-bound performance target for the more practical designs in both theoretical and real-world use cases.

Some experiments were also done with a 3-channel sensor that used the spectral response of a Huawei Mate 20 Pro as its channels. This “RGB” sensor acts like a normal mobile camera downscaled to a single pixel. While the shapes of its channels are very different from the other, Gaussian shaped designs, this simulated sensor gives us a lower bound of the performance, opposite to the 65-channel design.

3.4 White Point Regression

The spectral sensor produces a measurement vector \(\textbf{s}=\left( \bar{I}_1, \bar{I}_2,\ldots , \bar{I}_N\right) \) from (2) using the Gaussian responses \(S_i\) (Sect. 3.2). Color constancy corresponds to estimating the global ambient scene illumination \(L = {\varvec{{\hat{\ell }}}} \approx \varvec{\ell }\) (Finlayson et al., 2001). The estimated white point is used to normalize the image colors so that achromatic regions appear gray. The white point estimation is defined as a regression problem \(\varvec{\ell }= \left( \ell _R, \ell _G, \ell _B\right) ^T = f(\textbf{s}_{N\times 1})\), where \(\varvec{\ell }\) is the illuminant white point in RGB and \(f(\cdot )\) is a regression function that maps the spectral measurement \(\textbf{s}\) to a white point estimate of L.

For f we tested a number of popular regression methods: Kernel Ridge regression (KR) (Murphy, 2012), Random Forest regression (RF) (Breiman, 2001), and the Multilayer Perceptron (MLP) (Geoffrey, 1989). The Scikit-Learn Python library was used for KR and RF. The methods' hyperparameters were optimized by grid search and cross-validation on the training data, separately for each sensor configuration. The MLP was implemented using TensorFlow; it has three fully connected hidden layers of sizes 512-1024-512 and was trained with the standard Adam optimizer. In our experiments the differences between the KR, RF and MLP regressors were small and thus any of them is a feasible choice.
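A minimal sketch of the MLP variant is given below. The layer sizes and optimizer follow the description above; the activation function, loss, and training schedule are not specified in the text and are assumptions here.

```python
import tensorflow as tf

def build_mlp(n_channels):
    """MLP white point regressor: s (N-vector) -> RGB white point."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_channels,)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(3),        # estimated (R, G, B) white point
    ])
    model.compile(optimizer="adam", loss="mse")  # loss is an assumption
    return model
```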

4 Data

4.1 Generated Spectral Data

In order to train the white point regressors in Sect. 3.4 we need spectral color constancy training data. It would be straightforward to convert existing spectral image datasets (Parkkinen et al., 1988; Westland et al., 2000; Kerekes et al., 2008) for our purposes, but they are too small and do not contain natural scenes. Alternatively, spectral training data can be generated from existing color constancy datasets using one of the RGB-to-Spectral conversion methods (Kawakami et al., 2011; Arad & Ben-Shahar, 2016; Jia et al., 2017). The recent Cube+ dataset (Banić & Lončarić, 2017) fits our purposes. For spectral approximation we adopt parts of our recent Sensor-to-Sensor Transfer (SST) model (Koskinen et al., 2020). The original model is designed for RGB-to-RGB conversion between two different RGB sensors and therefore we adapt it for RGB-to-Spectral conversion using the following spectral processing steps (Fig. 2):

  1. Illuminant spectrum estimation: \(\varvec{\ell }\) to \(\hat{L}'_{spec}\),

  2. Raw to spectral image transform: \(I_{raw}\) to \(\hat{R}_{spec}\),

  3. Spectral image refinement: \(\hat{R}_{spec}\) to \(\hat{R}'_{spec}\),

  4. Sensor sampling of the average reflected illuminant: \(\bar{R}'_{spec} \cdot \hat{L}'_{spec}\) to \(\textbf{s}\).

Fig. 2

The RGB-to-Spectral conversion model used to generate spectral training data (Sect. 4.1)

4.2 Illuminant Spectrum Estimation

\(\hat{L}'_{spec}\) estimation is made by finding the closest matching spectrum from an existing database and then refining it to perfectly match the ground truth RGB tristimulus white points of Cube+. For this purpose, we gathered an illuminant database of 100 spectra. Most illuminants were picked from the CIE standard illuminants (International Organization for Standardization, 2006). The standard does not contain modern LEDs and therefore 13 different LED spectra were measured and added. The standard provides an equation to calculate different daylight spectra as a function of the correlated color temperature: \(L(\lambda )=L_{0}(\lambda )+M_{1}L_{1}(\lambda )+M_{2}L_{2}(\lambda )\), where \(L_i\) are predefined illuminant characteristic vectors and \(M_i\) are coefficients depending on the selected white point. We selected 70 different daylight illuminants ranging from 2500K to 9400K to cover various conditions from sunsets to cloudy days. The standard also provides typical fluorescent spectra, of which we selected 8. Finally, we added 9 tungsten halogen spectra ranging from 2200K to 3250K computed using Planck's law.
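For reference, the daylight entries can be generated with the standard CIE procedure. The sketch below computes \(M_1\) and \(M_2\) from the correlated color temperature using the published CIE coefficients; the component spectra \(L_0, L_1, L_2\) are assumed to be loaded from a table (the file name is hypothetical), and note that the standard defines the chromaticity equations from 4000K upwards, so our 2500K entries are an extrapolation.

```python
import numpy as np

# Tabulated CIE daylight components L0, L1, L2 (hypothetical file).
L0, L1, L2 = np.loadtxt("cie_daylight_components.csv", delimiter=",").T

def daylight_spd(cct):
    """CIE daylight spectrum L = L0 + M1*L1 + M2*L2 for a given CCT."""
    t = float(cct)
    if t <= 7000.0:   # chromaticity x_D of the daylight locus
        x = -4.6070e9/t**3 + 2.9678e6/t**2 + 0.09911e3/t + 0.244063
    else:
        x = -2.0064e9/t**3 + 1.9018e6/t**2 + 0.24748e3/t + 0.237040
    y = -3.000*x*x + 2.870*x - 0.275
    m  = 0.0241 + 0.2562*x - 0.7341*y
    m1 = (-1.3515 -  1.7703*x +  5.9114*y) / m
    m2 = ( 0.0300 - 31.4424*x + 30.0717*y) / m
    return L0 + m1*L1 + m2*L2
```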

As in Eq. 1 the image I is formed according to (von Kries, 1970):

$$\begin{aligned} \begin{aligned} I_{i}(x,y) = \int L(\lambda )S_{i}(\lambda )R(x,y,\lambda ) d\lambda ,\\ i\in \{\text{ R,G,B }\} . \end{aligned} \hspace{5.0pt}\end{aligned}$$
(4)

Since we are only comparing illuminant spectra against the Cube+ ground truth white points, we can set the reflectance spectrum R to a perfect white and thus effectively omit it from the equation. For the same reason, the spatial information (x, y) can be removed. We obtained the camera model used in Cube+ and measured its sensor response spectra \(S_i\) using a Labsphere QES-1000. For spectral matching the image term \(I_{i}\) is replaced with the ground truth illuminant white point \(\varvec{\ell }\). Therefore, we need to find the illuminant \(L_{d}\) in our database that minimizes

$$\begin{aligned} \begin{aligned} \hat{L}_{spec} = \underset{L_{d}}{\hbox {arg min}} \Vert {\int L_{d}(\lambda )S_{i}(\lambda ) d\lambda } - \varvec{\ell }\Vert ^2, \\ i\in \{\text{ R,G,B }\} . \end{aligned} \hspace{5.0pt}\end{aligned}$$
(5)

\(\hat{L}_{spec}\) is the best match within the 100 illuminants. Since our database contains real illuminant spectra, the best matching illuminant has the natural spectral shape of the corresponding white point. The found spectrum also has a similar tristimulus response, but needs fine-tuning. To keep the spectral shape and naturalness intact, the refinement linearly adjusts the red and blue parts of the spectrum around the pivot point of 530nm, which is selected to lie in the middle of a typical green channel response. The refinement is iterated until a perfect tristimulus match is achieved for \(\hat{L}_{spec}'\) using (with \(\hat{L}_{spec}'^{(0)} = \hat{L}_{spec}\))

$$\begin{aligned} \hat{L}_{spec}'^{(t+1)}(\lambda ) = \hat{L}_{spec}'^{(t)}(\lambda ) w(\lambda ) \hspace{5.0pt}, \end{aligned}$$
(6)

where w is a weight vector that has the value 1 at the 530nm pivot and varies linearly towards the blue and red ends of the spectrum.

4.3 Raw to Spectral Image Transform

After estimating the illuminant spectrum \(\hat{L}_{spec}' \approx L\), the only unknown in Eq. 4 is the scene reflectance spectrum R. The same matching approach as in Sect. 4.2 can be used for reflectance spectrum estimation; the only difference is that the illuminant database is replaced with a database of natural reflectance spectra. The Munsell Glossy dataset (Orava, 1995) is suitable for our purposes: the spectra are well spread over the gamut and their shapes are smooth in nature. Another change for the reflectance spectrum estimation is that the matching is made in the CIE L*a*b* color space (International Organization for Standardization, 2008), where the luminance component L* can be omitted so that the matching is done in a 2D space using the Euclidean distance. We use the k nearest neighbors and the weighted sum of their Munsell spectra to replace the RGB values of each location (x, y) with a spectral vector. The results were not very sensitive to the selection of k and thus k was set to 2 in

$$\begin{aligned} \begin{aligned}&\hat{R}_{spec}(x,y) = \sum _{k} w_{k} R_{Munsell}^{k}\\&\left\{ w_{k}\right\} = \underset{\left\{ w_{k}\right\} }{\arg \min } \Vert {I_{raw,i}}(x,y)\\&\qquad -\sum _k w_{k} \int \hat{L}_{spec}'(\lambda ) S_{i}(\lambda ) {R}_{Munsell}^{k}(\lambda ) d\lambda \Vert _{a,b}^{2}. \end{aligned} \hspace{5.0pt} \end{aligned}$$
(7)
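A simplified sketch of this per-pixel matching is shown below. It assumes the Munsell spectra have been rendered through Eq. 4 with \(\hat{L}_{spec}'\) and projected to their a*b* coordinates beforehand, and it fits the k=2 mixing weights directly in the a*b* plane; the full Eq. 7 fits the weights against the raw camera RGB values instead.

```python
import numpy as np

def estimate_reflectance(pixel_ab, munsell_ab, munsell_spectra, k=2):
    """Weighted k-NN reflectance estimate: find the k nearest Munsell
    patches in the (a*, b*) plane and solve their mixing weights by
    least squares at the pixel's chromaticity."""
    d = np.linalg.norm(munsell_ab - pixel_ab, axis=1)   # Euclidean in a*b*
    nn = np.argsort(d)[:k]
    # Columns of A are the neighbors' a*b* vectors; solve A w ~= pixel_ab.
    w, *_ = np.linalg.lstsq(munsell_ab[nn].T, pixel_ab, rcond=None)
    return w @ munsell_spectra[nn]       # estimated reflectance spectrum
```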

4.4 Spectral Image Refinement

The spectral image refinement is required to perfectly match the Cube+ image RGB values. We normalized the camera spectral responses \(S_{i}\) so that the sum of the color channels (\(i\in \{\text{ R,G,B }\}\)) at each wavelength is one. The normalized curves \(\bar{S}_{i}\) are used as weighting functions in the iteration (with \(\hat{R}_{spec}'^{(0)} = \hat{R}_{spec}\))

$$\begin{aligned} \begin{aligned} \hat{R}_{spec}'^{(t+1)}(x,y,\lambda ) = \hat{R}_{spec}'^{(t)}(x,y,\lambda ) + \\ \left( \frac{e_{i}+\epsilon }{\hat{e_{i}}}-1\right) \cdot \left( \hat{R}_{spec}'^{(t)}(x,y,\lambda ) \cdot \bar{S}_{i}(\lambda )\right) , \end{aligned} \hspace{5.0pt} \end{aligned}$$
(8)

where the color channel specific (RGB) variables are \(\hat{e_{i}}\) for the estimate and \(e_{i}\) for the target. The iteration is finished when the spectrum matches the raw tristimulus values, i.e., \(\hat{e}_{i} = e_{i}\). We use \(\epsilon =10^{-6}\) to ensure that the spectra are always positive. The raw input image \(I_{raw}\) contains the target values and the estimates are calculated using Eq. 4 with \(L = \hat{L}_{spec}'\), \(S = S_{i}\) (the measured Cube+ camera spectral characterization curves) and \(R = \hat{R}_{spec}'\).
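One update pass of Eq. 8 can be sketched as follows, assuming the per-pixel target RGB e (from \(I_{raw}\)) and the current estimate ê (rendered via Eq. 4) have been computed; applying the three channel corrections sequentially within a pass is one possible reading of the equation.

```python
import numpy as np

EPS = 1e-6  # keeps the spectra strictly positive

def refine_step(R_spec, e, e_hat, S_bar):
    """One Eq. 8 iteration: correct the pixel spectrum R_spec per color
    channel, weighting each correction by the normalized camera response
    S_bar[i] so it mainly affects wavelengths that channel actually sees."""
    for i in range(3):                   # i in {R, G, B}
        gain = (e[i] + EPS) / e_hat[i] - 1.0
        R_spec = R_spec + gain * (R_spec * S_bar[i])
    return R_spec
```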

Fig. 3

Visualized accuracy of the RGB-to-Spectral conversion. The spectral accuracy is shown for the ColorChecker patches indicated with cyan squares. The ground truths (solid lines) and the estimates (dashed lines) are plotted on the right with colors corresponding to the patches (note that there is no visible difference between the colors of the solid and dashed lines). The spectra were normalized to the peak wavelength

4.5 Sensor Sampling

In the final step the estimated scene reflectance spectra and the estimated light source spectrum are used to construct the spectral sensor response. First, the image spectra are averaged: \(\hat{R}_{spec}'\rightarrow \bar{R}_{spec}'\). The spectral response S now corresponds to the wide-angle multi-channel sensor of Sect. 3.2 and in the following the index i refers to the channel number. The final sensor response \(\textbf{s}\) is computed from

$$\begin{aligned} \textbf{s} = \int \hat{L}'_{spec}(\lambda )S_{i}(\lambda )\bar{R}'_{spec}(\lambda ) d\lambda \hspace{5.0pt}. \end{aligned}$$
(9)

4.6 Data Augmentation

During the preliminary experiments it was noticed that the MLP method needed more training samples than the 1657 vectors obtained from the averaged Cube+ images. To generate more data, the spectral images were split into 12 equal-sized sub-images for which Eq. 9 was computed separately. Since the illuminant spectrum is the same for all pixels, the augmentation expanded the number of different natural surfaces. This way the Cube+ dataset produced 19,884 spectral sensor vectors and white point ground truths. It is noteworthy that the amount of data is still vastly less than typically used for conventional RGB image color constancy algorithms.
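A sketch of the split, assuming a 3×4 grid (the text fixes only the count of 12 equal-sized sub-images, not the grid shape):

```python
import numpy as np

def augment(spectral_image, rows=3, cols=4):
    """Split the spectral image into rows*cols equal sub-images and return
    each sub-image's average spectrum, to which Eq. 9 is then applied."""
    samples = []
    for rs in np.array_split(np.arange(spectral_image.shape[0]), rows):
        for cs in np.array_split(np.arange(spectral_image.shape[1]), cols):
            block = spectral_image[rs[0]:rs[-1] + 1, cs[0]:cs[-1] + 1]
            samples.append(block.mean(axis=(0, 1)))
    return np.stack(samples)             # (rows*cols) x n_wavelengths
```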

4.7 Noise Model

For more realistic results we added noise to the generated training samples. The noise benefits wider channels with better signal-to-noise ratios. The computational spectral sensor channels were defined to have a 100% peak quantum efficiency. We empirically set a very low light condition where the number of photons arriving at the most sensitive sensor channel is 20 times the FWHM width W of the channel (in nm); in effect we assume the same exposure time for each sensor design. We only modeled the photon noise and disregarded the less significant noise sources, such as read-out and ADC noise, as those depend heavily on the hardware design, which is not known. The photon noise is signal-dependent Poisson-distributed noise whose standard deviation grows with the square root of the signal level (Foi et al., 2008; Hasinoff, 2014). Therefore, the equation \(\textbf{s} = \textbf{s} + \sqrt{20W\textbf{s}}\,X\) was used to add noise to the sensor response \(\textbf{s}\), whose most sensitive channel is normalized to one. X is a random sample from the normal distribution \(\mathcal {N}(\mu ,\rho ^{2})=\mathcal {N}(0,20W)\).
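Equivalently, the photon noise can be sampled directly from a Poisson distribution at the assumed photon level, which is what the Gaussian formula above approximates. A minimal sketch:

```python
import numpy as np

def add_photon_noise(s, fwhm_nm, rng=np.random.default_rng()):
    """Photon (shot) noise: with the most sensitive channel normalized to
    one, a channel of FWHM width W nm collects about 20*W photons, so we
    draw Poisson counts at that level and renormalize."""
    scale = 20.0 * fwhm_nm               # photons per unit signal
    return rng.poisson(scale * s) / scale
```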

4.8 Transform Accuracy Verification

In order to verify the accuracy of the RGB-to-Spectral conversion, we measured the spectral reflectances of the color patches of an X-Rite ColorChecker with a Photo Research PR-670 spectrometer. The spectra were then converted to RGB values using Eq. 1, where the camera spectral sensitivities were those of a Huawei Mate 20 Pro and the illuminant was set to illuminant E. The RGB values were then transformed back to spectral values using the proposed RGB-to-Spectral conversion and compared to the original measured ground truth spectra. Any visible errors in the spectral domain are metameric as the differences in the RGB values are negligible. The results are shown in Fig. 3 for the challenging saturated content. The average spectrum of a scene is typically much less saturated and thus easier to estimate, as indicated by the plotted white patch accuracy.

4.9 Multi-Illuminant Data

Multi-illuminant color constancy is a complex and largely unsolved problem. However, we wanted to study whether the spectral sensor can be helpful in the multi-illuminant case even though it completely lacks spatial information. To succeed, the spectral method should detect multiple illuminant spectral fingerprints simultaneously.

The multi-illuminant data was generated by adapting the processing pipeline of Sect. 4.1 so that each image was re-illuminated by a mixture of two random illuminants. Specifically, we replaced the estimated ground truth illuminant spectrum \(\hat{L}_{spec}'\) by a mixture of two randomly selected illuminant spectra. The dominant illuminant's intensity was randomly selected from \((50\%, 90\%]\), and a second illuminant was then randomly selected and added with an intensity of at least 10%. The illuminants were picked from the set of 100 light source spectra used for the illuminant spectrum estimation in Sect. 4.2. Data augmentation similar to the single illuminant case was applied, resulting in a total of 83,000 spectral samples.
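The mixing step can be sketched as follows (drawing the dominant weight uniformly from the stated range is an assumption; the text does not name the distribution):

```python
import numpy as np

def mix_illuminants(spectra, rng=np.random.default_rng()):
    """Pick two distinct illuminants; the dominant one gets a share in
    (0.5, 0.9], so the secondary always contributes at least 10%."""
    i, j = rng.choice(len(spectra), size=2, replace=False)
    w1 = rng.uniform(0.5, 0.9)           # dominant illuminant share
    mixture = w1 * spectra[i] + (1.0 - w1) * spectra[j]
    return mixture, (w1, 1.0 - w1)
```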

4.10 Real Spectral Color Constancy Data

To validate the results with real data, we collected a spectral color constancy dataset. Each sample contains a raw image captured with a Huawei Mate 20 Pro mobile phone and two spectral measurements taken with a Konica Minolta CL-70F spectrometer. The first spectral measurement represents the average spectrum of the illuminant reflected from the scene, and the second the ground truth illuminant. The first measurement was made by placing the spectrometer next to the phone and pointing it towards the scene; the second by placing the spectrometer in the scene to measure the ground truth illumination falling on the area. The data gathering setup is illustrated in Fig. 4. The ground truth white points were calculated using the illuminant spectrum, the camera spectral response and a perfect white reflectance spectrum in Eq. 4.

Fig. 4

The setup used to capture the real spectral color constancy dataset

The dataset consists of 235 raw images with their corresponding spectral measurements. The dataset was purposely made difficult for color constancy by including scenes that are dominated by a few chromatic colors and often lack any clear gray areas. These cases are challenging also for spectral color constancy as the illuminant spectrum and the reflected spectrum are clearly different (the solid and dashed lines in Fig. 1). Examples from the dataset are shown in Figs. 1 and 6.

5 Experiments

5.1 Sensor Design

We tested the 21+2 sensor configurations of Sects. 3.2 and 3.3: 7 channel counts from \(N=4\) to \(N=16\) with 3 filter bandwidths from 10nm to 30nm, plus the 65-channel reference design that serves as the target for the other configurations and the 3-channel design that gives an understanding of the lower-bound performance. The evaluations were made with the generated Cube+ spectral images (Sect. 4.1) and with the real spectral data (Sect. 4.10). All results are averages from 3-fold cross-validation, and the experiments were carried out with both noise-free and noise-added measurements. The noisy measurements better reflect the performance in realistic low light conditions and demonstrate the difference between the narrow (10nm) and wide (30nm) band sensors. The performance measure in all our experiments is the mean angular error between the ground truth white point \(\varvec{\ell }\) and the estimated white point \(\varvec{\ell }'\) (Finlayson et al., 2017)

$$\begin{aligned} err=\cos ^{-1}\left( \frac{\varvec{\ell }\cdot \varvec{\ell }'}{\Vert \varvec{\ell }\Vert \cdot \Vert \varvec{\ell }'\Vert }\right) \hspace{5.0pt}. \end{aligned}$$
(10)
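In code, Eq. 10 is a one-liner; the clip only guards against floating-point values falling just outside \([-1, 1]\):

```python
import numpy as np

def angular_error(wp_est, wp_gt):
    """Recovery angular error (degrees) between two RGB white points."""
    cos = wp_gt @ wp_est / (np.linalg.norm(wp_gt) * np.linalg.norm(wp_est))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```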

Results are shown in Fig. 5 and provide two expected findings:

  1. Adding more channels systematically improves the results until they saturate at \(N \ge 10\).

  2. Wider filters are more robust to low light and noisy scenes (Cube+).

The average error with the real data (\(\approx 2.4^\circ \)) is clearly worse than with the generated Cube+ (\(\approx 0.5^\circ \) for clean and \(\approx 1.0^\circ \) for noisy measurements), which can be explained by the real dataset being more challenging. However, both results are well below \(3.0^\circ \), the generally used just-noticeable difference of human color perception.

The results with our spectral dataset are clearly worse than with Cube+ and there is no significant difference between the clean and noisy results. The main reasons are that our scenes are more difficult for color constancy (often only a few dominating colors), there are much fewer scenes, and the spectra were measured using a real spectrometer. Based on the noise-free and noisy results, and to keep the design feasible, we selected the 14-channel option with 20nm bandwidth as the best miniaturized sensor design for the remaining experiments, in addition to the 65-channel reference design representing a high-end spectrometer.

Supplementary tests beyond the Gaussian shaped sensors were carried out with the 3-channel design that represents a typical pixel of a mobile camera. While the "RGB" sensor does not see any spatial information, and thus cannot perform as well as a real mobile camera, it gives a relatable lower bound for the multi-channel sensors. We conducted the evaluation using the Cube+ dataset as its image count is high enough to give very stable results. The accuracy of the "RGB" sensor dropped 54% on average and 64% on the 95\(^{th}\) percentile compared to the Gaussian (10nm) 4-channel sensor. The results are in line with the expectations based on the results for the Gaussian shaped sensors in Fig. 5.

Fig. 5

Results for the Gaussian shaped sensors using the MLP white point regressor. The y-axis is the mean angular error from 3-fold cross-validation and the x-axis the number of channels; the black lines show the reference 65-channel sensor as the target performance

Table 1 Comparison of the proposed spectral (both 14 channel and 65 channel options) and SotA color constancy methods in the 3-fold cross-validation. The numbers are angular errors (lower is better)
Table 2 Angular errors for the cross-dataset experiment
Fig. 6

Visualized errors for the tested algorithms on the Real Spectral Dataset. The images also include a static color space transform and sRGB gamma for display purposes. The average results of the algorithms are similar, as shown statistically for this dataset in Table 2

5.2 Method Comparison

We compared spectral color constancy with \(N=14\) channels and 20nm sensor bandwidth against three SotA methods: Grayness Index (GI) (Qian et al., 2019), Fast Fourier Color Constancy (FFCC) (Barron & Tsai, 2017) and Fully Convolutional Color Constancy with Confidence (\(\hbox {FC}^{4}\)) (Hu et al., 2017). GI is a static method that does not need training data, but it is competitive against the learning-based methods and particularly effective in cross-dataset evaluations. FFCC and FC4 are SotA learning-based methods with an important difference: FFCC omits the spatial dimension and uses image RGB distributions, while FC4 directly uses the RGB images.

We repeated the 3-fold cross-validation of the previous experiment with the generated Cube+ and the Real Spectral Dataset. The results in Table 1 provide two important findings:

  1. All variants of spectral color constancy outperform the SotA RGB methods on both datasets.

  2. The spectral method is particularly effective on the most difficult scenes (95\(^{\textrm{th}}\) percentile), for which it obtains remarkable improvements of 39% to 74% even with the 14-channel configuration.

5.3 Cross-dataset Evaluation

The cross-dataset evaluations are important as the methods are not allowed to use training data from the tested datasets, and therefore the results better reflect practical performance. For the cross-dataset evaluations all methods were trained with the Cube+ images. From the popular color constancy benchmarks we selected those for which we were able to find the same camera model and measure its spectral response. The selected test datasets were Intel-TUT (Aytekin et al., 2017), NUS (Cheng et al., 2014) and Shi-Gehler (Hemrit et al., 2018), with 142, 197 and 482 images, respectively, in addition to our own Real Spectral Dataset with 235 images.

The results are shown in Table 2 and visualized in Fig. 6. The spectral color constancy method achieved superior or on-par accuracy on all four datasets. Similar to the previous experiment, the performance was particularly good for the most difficult images (95\(^{\textrm{th}}\) percentile), where the spectral method achieved notable improvements of 38–54% with the 14-channel design.

5.4 Multi-Illuminant Case

The MLP network created for the single illuminant color constancy (3 outputs) was modified to produce two white points and their relative intensities (6+2 outputs). The output of the "dual-MLP" setting can be expressed as a weighted sum of the two white points \(w_1'\varvec{\ell }_1'+w_2'\varvec{\ell }_2'\), where \(\varvec{\ell }_i'\) are the estimated white points and \(w_i'\) their weights. For simplicity, the second weight could be defined as \(w_2' = 1.0-w_1'\), but we did not find much difference between the two choices and therefore used two weight outputs. The dual-MLP is able to estimate the illuminant(s) in both the single and dual illuminant cases; in a correctly estimated single illuminant instance, the other weight evaluates to 0. It should be noted that in practice two illuminants are often also spatially separated; for example, consider an image captured in an office that includes a window viewing outdoors. However, since the single pixel sensor has no spatial information, the weights represent the spatial extents of the two lights.

During the experiments it was noted that the regressor detected the two illuminants well but was less successful in estimating the mixing weights. Therefore we used the following compound error that uses the ground truth mixing weights to measure how well the correct illuminants were detected (for the single illuminant MLP, \(\varvec{\ell }_1'=\varvec{\ell }_2'\))

$$\begin{aligned} err_{dual} = w_1 \cdot err(\varvec{\ell }_1,\varvec{\ell }_1') + w_2 \cdot err(\varvec{\ell }_2,\varvec{\ell }_2') \hspace{5.0pt}, \end{aligned}$$
(11)

where err() is the angular error of Eq. 10. Note that since the order of the two white points is arbitrary, the error is computed for both assignments and the minimum is recorded.
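The compound error, including the order swap, can be sketched with the angular_error helper from the Sect. 5.1 sketch:

```python
def dual_error(gt, est, w1, w2):
    """Weighted dual-illuminant error: score both assignments of the two
    estimated white points to the two ground truths, return the minimum.
    Uses angular_error() defined in the Sect. 5.1 sketch."""
    a = w1 * angular_error(est[0], gt[0]) + w2 * angular_error(est[1], gt[1])
    b = w1 * angular_error(est[1], gt[0]) + w2 * angular_error(est[0], gt[1])
    return min(a, b)
```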

Table 3 Weighted angular errors for the generated images containing random mixtures of two illuminants (3-fold cross-validation)

The results for the dual-MLP are shown in Table 3. The numbers are clearly worse than in the single illuminant experiments, which demonstrates the difficulty of multi-illuminant color constancy. However, the dual-MLP architecture obtains systematically 15–25% better accuracy than the single illuminant MLP, indicating that the single pixel spectral measurement can detect the spectral fingerprints of multiple illuminants. It should be noted that these results are promising but only preliminary, as practical multi-illuminant color constancy would also require estimating the spatial segments of the different illuminants.

6 Conclusions

We introduced a new approach to computational color constancy. Instead of the conventional procedure of using RGB images, our approach uses average color spectra sampled from the visible part of the electromagnetic spectrum. Spectral color constancy achieved the highest accuracy with clear margins to the SotA RGB methods. In particular, a remarkable improvement of over 50% in the challenging cross-dataset evaluations was achieved in the most difficult cases using a design that is practical for mobile devices. The data generation method also proved effective, as training with generated data and testing on real measured data still achieved superior results. In addition, we showed that a single pixel spectral sensor is able to detect multiple illuminants from a single global measurement. We conclude that for estimating the illuminant white point, the spectral dimension is more important than the spatial dimension.