1 Introduction

A camera response function (CRF) describes the mapping between the radiant energy received by an image sensor and the intensity output of a camera in the final images [1]. Most cameras are manufactured with nonlinear CRFs [2]. Such nonlinearity is introduced during the stages of image formation in the camera, for instance analog-to-digital conversion in the image sensor, white balance adjustment that minimises image colour drift due to differing illumination, gamma correction that expands the luminance range to be interpreted, and tone mapping that optimises the visual quality of the image [3]. Popular CRF models include the Empirical Model of Response (EMoR), the generalised gamma curve model (GGCM), polynomial curves, and gamma curves [2, 4].

Calibration of a camera response is crucial in many computer vision tasks. Examples include image mosaicing, where multiple images need to be seamlessly stitched together [5], high dynamic range imaging, where images of multiple exposures are combined to produce images with greater dynamic range [6], and deblurring, where motion blur is removed [7]. CRF calibration also has applications in digital forensics [1].

An expressive CRF representation model is the foundation for accurate and rapid CRF calibration. Calibration can be seen as an optimisation process in which the optimal parameters of a selected CRF representation model are computed to best describe the camera response. Existing CRF models are mostly parametric with multiple parameters, and the solution spaces for optimising these parameters are complex with arbitrary distributions. Calibrating the optimal parameters with existing models therefore takes a long time.

In this paper, a novel and high-performance non-deterministic CRF representation model, the Single Latent Representation (SLR) model, is proposed based on autoencoder, neural architecture search (NAS), and latent distribution learning (LDL) techniques. This work makes the following contributions. 1) Patterns of real-world CRFs are extracted by unsupervised learning and represented by a single latent variable using an autoencoder. 2) Two approaches (an LDL approach and a supervised learning approach using a handcrafted feature) are proposed and applied during representation learning to constrain the latent distribution, which further improves the accuracy of camera calibration. 3) A naïve NAS algorithm is used to search for the optimal autoencoder architecture considering both model accuracy and complexity. 4) The proposed model achieves state-of-the-art accuracy in CRF modelling and executes in less than half the time of the current best algorithms during CRF calibration.

2 Related work

Perhaps the most successful CRF representation model is the EMoR, proposed by Grossberg and Nayar in 2004 [2]. The EMoR describes a CRF as a linear combination of principal components, or eigenvectors, generated by applying Principal Component Analysis (PCA) to 201 real-world CRFs known as the Database of Response Functions (DoRF). Each CRF curve consists of 1024 uniformly sampled irradiance-to-intensity conversion ratios and is normalised such that it passes through \(\left( 0,0\right) \) and \(\left( 1,1\right) \). With EMoR, an approximation \({\widetilde{f}}\) to the CRF f can be constructed from k coefficients and the corresponding eigenvectors:

$$\begin{aligned} {\widetilde{f}}=f_0+\varvec{c}_k^T \varvec{H}_k \end{aligned}$$
(1)

where \(f_0\) is the base function calculated by averaging all the CRFs in DoRF, \(\varvec{c}_k=\varvec{H}_k^T\left( f-f_0\right) \) are the model coefficients, and \(\varvec{H}_k:=\left[ \varvec{h}_1\cdots \varvec{h}_k\right] \) contains the first k eigenvectors, those with the largest eigenvalues.

EMoR is an efficient model that represents CRFs with a very small number of parameters or coefficients. As reported in [2], the first three eigenvectors capture 99.5 percent of the cumulative energy associated with the eigenvalues in DoRF. It remains the most widely adopted CRF representation model due to its high accuracy and simplicity [3, 5, 8,9,10,11,12].
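
For concreteness, the PCA construction behind EMoR can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' code; the function names and the `crfs` array (assumed to hold the 201 normalised DoRF curves, each with 1024 samples) are our own conventions:

```python
import numpy as np

def fit_emor_basis(crfs, k=3):
    """PCA basis from a (201, 1024) array of normalised DoRF curves."""
    f0 = crfs.mean(axis=0)                        # base function f0: mean CRF
    # PCA via SVD of the mean-centred curves; rows of vt are eigenvectors
    _, _, vt = np.linalg.svd(crfs - f0, full_matrices=False)
    hk = vt[:k].T                                 # first k eigenvectors, (1024, k)
    return f0, hk

def emor_approximate(f, f0, hk):
    c = hk.T @ (f - f0)                           # coefficients c_k = H_k^T (f - f0)
    return f0 + hk @ c                            # reconstruction, Eq. (1)
```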

Polynomial and gamma curves are two other popular models used for CRF representation; their performance is slightly worse than that of the EMoR according to a benchmark [4]. A high-order polynomial has the general form:

$$\begin{aligned} f_{\varvec{\omega }} \left( \varvec{x} \right) =\sum _{i=1}^{M}{\varvec{\omega }}_i \varvec{x}^i \end{aligned}$$
(2)

where M and \(\varvec{\omega }\) are the polynomial order and model coefficients, respectively; they are the parameters determined through camera calibration. \(\varvec{x}\in \left[ 0,1\right] \) is the model input and represents image pixel intensity.

In general, gamma curves follow the basic form:

$$\begin{aligned} f\left( \varvec{x} \right) =\varvec{x}^\gamma \end{aligned}$$
(3)

where \(\gamma \) is the gamma value, typically determined through calibration. This model has been applied in numerous works [13, 14].

An extended version of the gamma curve, named GGCM, has been proposed [4] and applied [15]. It is given in (4).

$$\begin{aligned} f_{\varvec{\omega }} \left( \varvec{x} \right) =\varvec{x}^{P_{\varvec{\omega }} \left( \varvec{x}\right) } \end{aligned}$$
(4)

where the gamma value in the basic form is replaced by a polynomial term \(P_{\varvec{\omega }} \left( \varvec{x}\right) =\sum _{i=0}^{N}{\varvec{\omega }}_i\varvec{x}^i\).
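
For reference, the three parametric forms in (2)–(4) can be written as simple Python functions. The names and coefficient conventions below are illustrative assumptions, not a library API:

```python
import numpy as np

def gamma_crf(x, g):
    return x ** g                                # Eq. (3)

def poly_crf(x, w):
    # Eq. (2): f(x) = sum_{i=1}^{M} w_i x^i, with w = [w_1, ..., w_M]
    return sum(wi * x ** (i + 1) for i, wi in enumerate(w))

def ggcm_crf(x, w):
    # Eq. (4): f(x) = x ** P_w(x), with P_w(x) = sum_{i=0}^{N} w_i x^i
    p = sum(wi * x ** i for i, wi in enumerate(w))
    return x ** p
```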

A limitation of current CRF representation models is the high-dimensional and complex solution space that must be searched for the optimal model parameters during calibration. A CRF representation with a minimum number of model parameters, e.g. the single gamma value of gamma curves, is desirable to simplify calibration. Autoencoders generalise well and have been used for representation modelling [16]. An autoencoder compresses data into a much lower-dimensional latent space represented by a few latent variables and thus offers a potential solution to CRF representation. However, such work has not been reported yet.

In general, an autoencoder is a neural network that consists of an encoder and a decoder. The encoder maps the input data \(\varvec{x}\) to a latent representation \(\varvec{z}\), and the decoder reconstructs \(\varvec{z}\) back into an approximation \(\tilde{\varvec{x}}\) of the input. The latent representation and the model weights are trained by minimising the difference between the input and reconstructed data in an unsupervised process [17].

In the work by Makhzani et al. [18], the Adversarial Autoencoder (AAE) was introduced, combining an autoencoder with generative adversarial training to deliver unsupervised learning on multiple objectives. It can impose a constraint on the latent distribution through the adversarial training process. The value function of adversarial training can be represented as:

$$\begin{aligned} \begin{aligned} \min _{G} \max _{D} V\left( D, G \right)&= {\mathbb {E}}_{\varvec{x} \sim p_d}\left[ \log D\left( \varvec{x}\right) \right] \\&+ {\mathbb {E}}_{\varvec{z} \sim p\left( \varvec{z}\right) }\left[ \log \left( 1 - D \left( G \left( \varvec{z} \right) \right) \right) \right] \end{aligned} \end{aligned}$$
(5)

where the encoder of the autoencoder also acts as the generator G of the adversarial network, producing the latent representations \(\varvec{z}\) from the data distribution \(p_d\). At the same time, a discriminator D estimates the probability that a representation was generated from the data or the prior distribution. AAE has been successfully applied in applications such as image anomaly detection [19] and classification [18].

A Variational Autoencoder (VAE) [20] is another popular autoencoder model capable of constraining the distribution of the latent representation. The constraint is achieved by a recognition network that predicts the posterior distribution of the latent space.
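
For illustration, sampling from the predicted posterior is commonly realised with the reparameterisation trick; the minimal sketch below is a generic VAE ingredient, not code from [20]:

```python
import torch

def reparameterise(mu, log_var):
    # Draw z ~ N(mu, sigma^2) differentiably: z = mu + sigma * eps, eps ~ N(0, 1)
    std = torch.exp(0.5 * log_var)
    return mu + std * torch.randn_like(std)
```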

Recent advances in neural networks for end-to-end feature representation and data processing have increased the demand for automating architecture engineering, which is time-consuming and usually done manually. NAS automates the neural network engineering process. It can be summarised by three topics: search space, search strategy, and performance estimation strategy [21]. The search space defines the scope of the architecture search and usually encodes human prior knowledge. The search strategy determines how the search space is explored. The performance estimation strategy quantifies the performance of candidate models.

3 Proposed method

3.1 Autoencoder-based CRF representation model

Fig. 1 Architecture of the proposed model for CRF representation. The top row represents a multi-layer fully connected autoencoder with a single latent variable. The bottom row demonstrates the latent distribution and an objective prior distribution

As shown in Fig. 1, the proposed Single Latent variable camera response Representation (SLR) model takes as input a CRF represented by 1024 uniformly sampled points on the function, reduces its dimensionality to the latent space through the encoder, and outputs the reconstructed CRF through the decoder. As a result, a CRF can be represented by the latent variables in the latent space of the proposed model. A multi-layer fully connected autoencoder with the same number of input and output neurons is selected as the representation model of CRFs.

In our model, the number of hidden layers in the encoder and in the decoder is denoted by L. Both the encoder and decoder contain one, two, or three hidden layers, i.e. \(L \in \left\{ {1,2,3} \right\} \). Each hidden layer contains a varying number of neurons, denoted by \({C_l}\), where \(l \in \left\{ {1, \ldots ,L} \right\} \) is the layer index. \({C_z}\) is the number of latent variables in the model. A dropout operation is added to prevent the model from overfitting [22]. Nonlinearity is introduced by an activation function on each unit. The feed-forward operation of the proposed model has the form:

$$\begin{aligned} \begin{aligned} r_j^{\left( l \right) }&\sim \mathrm{{Bernoulli}}\left( p \right) \\ {{{{\tilde{u}}}}^{\left( l \right) }}&= {r^{\left( l \right) }} * {u^{\left( l \right) }} \\ v_i^{\left( {l + 1} \right) }&= w_i^{\left( {l + 1} \right) }{{{{\tilde{u}}}}^{\left( l \right) }} + b_i^{\left( {l + 1} \right) } \\ u_i^{\left( {l + 1} \right) }&= g\left( {v_i^{\left( {l + 1} \right) }} \right) \end{aligned} \end{aligned}$$
(6)

where \({r^{\left( l \right) }}\) is a vector of independent Bernoulli random variables, each element having probability p of being 1, * denotes the element-wise product, \({u^{\left( l \right) }}\) denotes the output vector calculated from the input vector \({v^{\left( l \right) }}\) of layer l, \({w^{\left( l \right) }}\) and \({b^{\left( l \right) }}\) are the model weights and biases at layer l, and g is the activation function.

The output vector from layer l is first sampled by the dropout operation and then processed by the weights and biases. The processed outputs are nonlinearly activated and used as inputs to the next layer. This process is repeated layer by layer. At test time, the model weights are scaled by p so that inference is performed without the effect of dropout. For CRF reconstruction, the latent variable is used as the input to the decoder, and the reconstructed CRF \({{\tilde{x}}}\) is obtained at the final output layer.
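
A compact sketch of such an autoencoder is given below, assuming PyTorch. The hidden-layer sizes are placeholders, since the actual sizes are selected by the NAS in Sect. 3.2; note also that PyTorch's inverted dropout rescales activations during training, so the test-time weight scaling by p described above is handled implicitly:

```python
import torch
import torch.nn as nn

class SLR(nn.Module):
    """Fully connected autoencoder with a single latent variable (illustrative)."""

    def __init__(self, hidden=(100, 20), n_in=1024, n_z=1, p_drop=0.1):
        super().__init__()
        sizes = [n_in, *hidden, n_z]
        self.encoder = self._stack(sizes, p_drop)
        self.decoder = self._stack(sizes[::-1], p_drop)

    @staticmethod
    def _stack(sizes, p_drop):
        layers = []
        for i in range(len(sizes) - 1):
            # nn.Dropout takes the drop probability, i.e. 1 - p in Eq. (6)
            layers += [nn.Dropout(p_drop), nn.Linear(sizes[i], sizes[i + 1])]
            if i < len(sizes) - 2:
                layers.append(nn.ReLU())    # activation g
        return nn.Sequential(*layers)

    def forward(self, x):
        z = self.encoder(x)                 # latent representation
        return self.decoder(z), z           # reconstructed CRF and latent value
```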

The model weights of the autoencoder are learnt by back-propagating the gradients of the losses. The reconstruction loss is the mean-square error (MSE) between the input CRF x and the reconstructed CRF \({{\tilde{x}}}\):

$$\begin{aligned} MSE\left( {x,{{\tilde{x}}}} \right) = \frac{1}{N}\sum \limits _{i = 1}^N {{{\left( {{x_i} - {{{{\tilde{x}}}}_i}} \right) }^2}} \end{aligned}$$
(7)

where N is the number of training data. Meanwhile, a smoothness loss is imposed on the reconstructed CRF \({{\tilde{x}}}\), since a CRF is usually a smooth and continuous function, based on observation of the CRFs in the DoRF:

$$\begin{aligned} {{{\mathcal {L}}}}\left( {{{\tilde{x}}}} \right) = {\left\| {{{{{\tilde{x}}}}^\prime }} \right\| _2} \end{aligned}$$
(8)

where \({{{\tilde{x}}}^\prime }\) is the first derivative of the reconstructed CRF and \({\left\| \cdot \right\| _2}\) denotes the \(l_2\)-norm.
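
Both losses are straightforward to implement; the following is a hedged sketch, assuming PyTorch tensors and approximating the derivative by finite differences:

```python
import torch

def reconstruction_loss(x, x_hat):
    return torch.mean((x - x_hat) ** 2)           # MSE, Eq. (7)

def smoothness_loss(x_hat):
    dx = x_hat[..., 1:] - x_hat[..., :-1]         # finite-difference first derivative
    return torch.linalg.norm(dx, dim=-1).mean()   # l2-norm of the derivative, Eq. (8)
```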

The optimal number of hidden layers and number of neurons in each hidden layer are determined by NAS.

3.2 Naïve neural architecture search

The optimal architecture of the proposed SLR model, in terms of both model accuracy and complexity, is determined by NAS. NAS not only helps find the desired model architecture but also brings flexibility to the model design (e.g. when an extension of the number of latent variables is needed). Since devices with relatively limited computing resources, such as mobile phones, are being considered for running the proposed model, the performance estimation needs to take account of both model complexity and accuracy.

The search space is defined as up to three hidden layers with optional neuron numbers \({h_1} = \left[ {10,20,50,100,200,500} \right] \), \({h_2} = \left[ {0,10,20,50,100,200} \right] \), and \({h_3} = \left[ {0,10,20,50,100} \right] \) for both the encoder and decoder, where 0 means the layer is absent. Note that when hidden layer two has no neurons, hidden layer three must also have no neurons.
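
As a sanity check on the size of this search space, the valid combinations can be enumerated directly; the following Python sketch (our own illustration) reproduces the count of 156 valid candidate architectures quoted later in this section:

```python
from itertools import product

h1 = [10, 20, 50, 100, 200, 500]
h2 = [0, 10, 20, 50, 100, 200]
h3 = [0, 10, 20, 50, 100]

# 0 means the layer is absent; layer three requires layer two to exist.
space = [a for a in product(h1, h2, h3) if not (a[1] == 0 and a[2] != 0)]
print(len(space))  # 156 valid candidate architectures
```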

We aim to minimise the model complexity while maximising its accuracy. However, balancing the trade-offs between model complexity and accuracy is a persistent challenge in NAS.

In this paper, the optimal model architecture is determined by a newly proposed NAS method named naïve NAS. The naïve NAS first evaluates every candidate architecture in the search space. It then selects the M candidate architectures with the highest accuracies. Finally, the optimal architecture is chosen as the one among those M candidates with the lowest model complexity. The naïve NAS is illustrated in Algorithm 1.

Algorithm 1 Naïve neural architecture search

Existing search strategies can be coupled with the proposed naïve NAS. Grid Search [23] is the selected strategy since it is exhaustive, the proposed model is light-weight (the performance estimation of each candidate architecture completes in less than a minute), and the search space is discrete and small (a total of 156 valid candidate architectures).

The model complexity is calculated as the total number of weights and biases in either the encoder or the decoder of the SLR, taking the latent variable into account:

$$\begin{aligned} Complexity\sim \left[ {\left( {\sum \limits _{l = 1}^L {{C_{l - 1}}{C_l} + {C_l}} } \right) + {C_L}{C_z} + {C_z}} \right] \end{aligned}$$
(9)

where L is the total number of hidden layers in the encoder or decoder, \({C_l}\) is the number of neurons in layer l (\({C_0}\) being the number of input neurons), and \({C_z}\) is the number of latent variables in the model. The model accuracy is measured by three-fold cross-validation using (7):

$$\begin{aligned} Accuracy \sim MSE\left( {x,{{\tilde{x}}}} \right) \end{aligned}$$
(10)

where x and \({{\tilde{x}}}\) are the validation and reconstructed CRF curves, respectively.
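
Putting (9) and (10) together, Algorithm 1 can be sketched as follows. This is an illustrative rendering: `estimate_mse` stands in for the three-fold cross-validated training run, and the default value of M is an assumption, as the paper does not fix it here:

```python
def complexity(arch, n_in=1024, n_z=1):
    # Eq. (9): weights and biases of the hidden layers plus the latent layer
    sizes = [n_in] + [c for c in arch if c > 0]
    total = sum(sizes[l - 1] * sizes[l] + sizes[l] for l in range(1, len(sizes)))
    return total + sizes[-1] * n_z + n_z

def naive_nas(space, estimate_mse, m=10):
    """estimate_mse(arch) -> three-fold cross-validated MSE, Eq. (10)."""
    ranked = sorted(space, key=estimate_mse)   # grid search: rank all candidates
    return min(ranked[:m], key=complexity)     # least complex of the top M
```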

3.3 Constraint on the latent distribution

By default, the latent variable z of an autoencoder follows an arbitrary distribution. Two approaches (a distribution learning approach and a supervised learning approach using a handcrafted feature) are proposed to constrain the latent variable to follow a prior distribution, helping the optimisation process find the best z more accurately during calibration.

In the first approach, the latent distribution is constrained by “learning” the objective distribution. It is named latent distribution learning (LDL) and is achieved by minimising the Kullback–Leibler divergence (KL-divergence) between the latent and objective distributions.

The latent distribution is approximated by a normal distribution \(y\sim {{{\mathcal {N}}}}\left( {\mu ,\sigma } \right) \). The maximum-likelihood estimates of its parameters are:

$$\begin{aligned} \begin{aligned} \mu&= {{\bar{y}}},\\ {\sigma ^2}&= \frac{1}{M}\sum \limits _{i = 1}^M {{{\left( {{y_i} - \mu } \right) }^2}} \end{aligned} \end{aligned}$$
(11)

where y denotes the samples of the latent distribution and M is the number of samples.

KL-divergence between two normal distributions has the form:

$$\begin{aligned} \begin{aligned} KL&\left( {{{{{\mathcal {N}}}}_1}\left( {{\mu _1},{\sigma _1}} \right) ,{{{{\mathcal {N}}}}_2}\left( {{\mu _2},{\sigma _2}} \right) } \right) \\&= - \int {{{{{\mathcal {N}}}}_1}\log \left( {{{{{\mathcal {N}}}}_2}} \right) } dy + \int {{{{{\mathcal {N}}}}_1}\log \left( {{{{{\mathcal {N}}}}_1}} \right) } dy\\&= \log \left( {\frac{{{\sigma _2}}}{{{\sigma _1}}}} \right) + \frac{{\sigma _1^2 + {{\left( {{\mu _1} - {\mu _2}} \right) }^2}}}{{2\sigma _2^2}} - \frac{1}{2} \end{aligned} \end{aligned}$$
(12)

The KL-divergence between the estimated latent distribution \(\mathcal{N}\left( {{\mu _1},{\sigma _1}} \right) \) and the objective standard normal distribution \({{{\mathcal {N}}}}\left( {0,1} \right) \) can be simplified to:

$$\begin{aligned} \begin{aligned} KL&\left( {{{{\mathcal {N}}}}\left( {{\mu _1},{\sigma _1}} \right) ,{{{\mathcal {N}}}}\left( {0,1} \right) } \right) \\&= \frac{1}{2}\left( {\mu _1^2 + \sigma _1^2 - 2\log {\sigma _1} - 1} \right) \end{aligned} \end{aligned}$$
(13)

This KL-divergence is used as the cost for latent distribution learning in the proposed SLR model. The second approach, named AUC, generates a label for each CRF to serve as the true latent value for constraining the distribution. The label is generated by a so-called area-under-curve approach, which calculates the area between the CRF and the diagonal:

$$\begin{aligned} \iota = \sum \limits _{i = 1}^N {\left( {{x_i} - \frac{i}{N}} \right) } \end{aligned}$$
(14)

where N is the number of samples in each CRF curve, which is 1024 for those in the DoRF. The latent distribution is then trained by supervised learning, minimising the MSE between the latent values and these labels.
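
Both constraints admit short implementations. The sketch below, assuming PyTorch tensors and mini-batch statistics for the latent values, is illustrative rather than the authors' code:

```python
import torch

def ldl_loss(z):
    # KL( N(mu, sigma) || N(0, 1) ) over the batch of latent values, Eq. (13)
    mu, sigma = z.mean(), z.std()
    return 0.5 * (mu ** 2 + sigma ** 2 - 2 * torch.log(sigma) - 1)

def auc_label(x):
    # Eq. (14): sum of deviations of a sampled CRF from the diagonal
    n = x.shape[-1]
    i = torch.arange(1, n + 1, dtype=x.dtype, device=x.device) / n
    return (x - i).sum(dim=-1)
```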

4 Experiments and results

This section details the experimental setup used to examine and test the proposed model. All processing and evaluations were performed on a laptop computer with a 2.6 GHz Intel Core i7 processor and 16 GB of memory. An NVIDIA GeForce RTX 2060 GPU was employed to accelerate the optimisation process.

4.1 Datasets

Fig. 2 Distribution of the two datasets used for model validation. (a) The CRFs of 201 real-world cameras in the DoRF; each green curve represents a CRF in the database. (b) Irradiance-intensity scatter plot of the CCPs extracted from the 14 cameras selected from the Middlebury dataset; CCPs of different cameras are rendered in varied colours

Two datasets, the DoRF and a modified Middlebury dataset [24], were prepared for the validations and benchmarks. The data distributions of the two datasets are shown in Fig. 2.

The DoRF contains 201 CRFs and is currently the most comprehensive dataset of CRFs produced from real-world camera models. This dataset was used in our experiments without modification.

The modified Middlebury dataset contains a total of 112 images. Images of 14 cameras were selected from the original dataset; these cameras were chosen for their higher cross-channel response uniformity. Each camera took eight images of a Macbeth colour chart under two uniform illuminations and four fixed exposures. This dataset provides abundant variation for evaluating CRF calibration accuracy.

The colour patch (CP) locations in the images of the second dataset (24 CPs per image) were carefully labelled using a custom-developed Python script so that the CPs could be extracted and aligned with each other across different images. The true colour values of the CPs were extracted from the RAW images.

4.2 Evaluation metrics

The root-mean-square error (RMSE) [2, 8, 25,26,27] has been widely used to quantify colour difference. It measures the Euclidean distance between two compared vectors:

$$\begin{aligned} d\left( {u,v} \right) = \sqrt{\frac{1}{N}\sum \limits _{i = 1}^N {{{\left( {{u_i} - {v_i}} \right) }^2}} } \end{aligned}$$
(15)

where u and v are the compared vectors and N is the number of items in each vector. A smaller RMSE indicates a better result; an RMSE of 0 indicates identical vectors.
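
As a reference implementation, (15) is a one-liner in NumPy (illustrative only):

```python
import numpy as np

def rmse(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(np.sqrt(np.mean((u - v) ** 2)))  # Eq. (15)
```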

In the experiments, the RMSEs, calculated by comparing the reconstructed CRFs with the CRFs in the DoRF in the first experiment, or by comparing the colour values of JPG images and the corresponding RAW images in the second experiment, were collected into a result vector h for statistical analysis:

$$\begin{aligned} h = \left[ {\begin{array}{*{20}{c}} {{e_0}}\\ \vdots \\ {{e_{C - 1}}} \end{array}} \right] \end{aligned}$$
(16)

where C is the number of camera models to be compared.

The Mean of the result vector h was used as the overall performance indicator in the first experiment. In the second experiment, five metrics are used to evaluate the result vector produced by each method. The first four are statistical metrics (Mean, Standard Deviation, Maximum, and 95th Percentile) that reflect model accuracy; among them, the Mean of h can be seen as the overall accuracy metric. The fifth, the time metric, is the total time in seconds (s) needed to calibrate all 14 camera models in the second dataset. We considered \(\Delta RMSE > 0.005\) to be the threshold for a significant performance difference in the second experiment.
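
The statistical metrics over h are equally simple to compute; an illustrative NumPy sketch:

```python
import numpy as np

def summarise(h):
    # Statistical metrics of the RMSE result vector h (Sect. 4.2)
    h = np.asarray(h, float)
    return {"mean": h.mean(), "std": h.std(),
            "max": h.max(), "p95": np.percentile(h, 95)}
```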

4.3 Latent distribution constraint benchmarks

Fig. 3 Visualisation of (a) the objective distribution compared to the latent distributions developed by the SLR using (b-e) four different constraining approaches and (f) no constraint, as the training epoch grows

Four constraining approaches (the two proposed approaches, AAE, and VAE), a baseline (imposing no constraint), and the objective distribution were compared in this benchmark, as shown in Fig. 3.

Besides the two proposed approaches, AAE constrains the latent distribution through an adversarial training network. The network employs the encoder of the SLR model as the generator. The discriminator consists of two hidden layers with 100 neurons each, and a single neuron in both the input and output layers. The adversarial training process is represented by (5).

Instead of imposing an additional constraint on the latent distribution, as the previous three approaches do, VAE incorporates the posterior distribution of the latent space into the autoencoder architecture. Since a normal latent distribution is required, the encoder outputs two neurons (a Mean and a Standard Deviation) representing a normal distribution and then generates the single latent variable by sampling this posterior normal distribution.

The last approach imposes no constraint on the latent distribution and serves as the baseline for the comparisons.

The objective latent distribution is the standard normal distribution \({{{\mathcal {N}}}}\left( {0,1} \right) \) as visualised in Fig. 3(a), except for the AUC approach.

The results demonstrate that the proposed latent distribution learning approach converged rapidly and produced a latent distribution that closely matches the objective distribution. The proposed supervised learning approach produced a narrow yet sharp latent distribution. The latent distribution developed by AAE was unstable compared to the rest. The distributions developed by VAE and the baseline both converged slowly during model training, with the baseline distribution also being narrow.

Overall, the proposed latent distribution learning (LDL) approach performed best. Thus, this approach was selected to constrain the latent distribution in the remaining experiments.

4.4 DoRF curve-fitting benchmark

Table 1 DoRF curve-fitting performance of various approximation models with different numbers of parameters, in terms of averaged RMSE

We first compared the performance of the proposed SLR with four other popular models, i.e. gamma, polynomial, GGCM, and EMoR, in a DoRF curve-fitting benchmark. In this experiment, every CRF curve in the DoRF was represented by each model using the optimal parameters calculated for a given number of parameters. Parameter counts of 1, 2, 3, and 4 were tested for each of the polynomial, GGCM, and EMoR models, while gamma and our method were tested with only one parameter, since our method works with a single latent variable. The benchmark results are presented in Table 1.

The results indicate that our model, with only a single parameter, achieved more than tenfold better performance (Mean RMSE: 5.61E-4) than most of the other tested methods in the DoRF curve-fitting benchmark. This is not surprising, as our model learned the nonlinear CRF features from real-world CRFs.

4.5 Camera radiometric calibration

The performance and applicability of the proposed SLR model are further validated in a camera radiometric calibration application [28]. This computer vision task estimates the inverse CRF from real camera images.

Fig. 4 Radiometric calibration results of a specific camera model (Canon PowerShot G9). The grey dots are the ground-truth values. The coloured curves are the inverse CRFs calibrated using 3, 6, 12, and 24 NoCCPs on a Macbeth colour chart, respectively, and the blue diagonal line is shown for reference. Our model (a) produced more accurate and stable inverse CRFs than the other three tested methods (b, c, d)

Table 2 Stability evaluation and comparison of four commonly used CRF models (our SLR, polynomial, GGCM, and EMoR) in terms of the total variance between CRFs estimated using 3, 6, 12, and 24 NoCCPs on a Macbeth colour chart

The true irradiance-intensity mapping values of a specific camera model (Canon PowerShot G9) and the inverse CRFs produced by four different methods, calibrated using 3, 6, 12, and 24 corresponding colour patches (NoCCPs), are visualised in Fig. 4. Our model fitted the true values more accurately (see Table 2 for details). Ours also performed more stably when varied NoCCPs were used for calibration (the total variance of the four curves in each plot of Fig. 4: our SLR 0.66; polynomial 1.63; GGCM 8.11; EMoR 3.87; smaller is better).

Table 3 Camera radiometric calibration results produced by five different methods (our SLR, gamma, polynomial, GGCM, and EMoR) using eight calibration images and three colour patches in each image. Our SLR was evaluated with four latent distribution constraining approaches and the baseline. Six metrics are used to evaluate the performance of each method. The first five are statistical metrics (mean, median, standard deviation, maximum, and 95th percentile) of the RMSE that reflect model accuracy; among them, the mean can be seen as the overall accuracy metric. The time metric is the total time needed in seconds for calibrating all 14 camera models

The radiometric calibration performance of the five methods (our SLR, gamma, third-degree polynomial, third-degree GGCM, and EMoR with three parameters) was further evaluated on 14 camera models. Their performance in terms of the six metrics is presented in Table 3. The first five metrics are statistics of the RMSEs calculated from the inverse CRFs of the 14 camera models, where the RMSE of each camera model was calculated by comparing the true values and the calibrated inverse CRF; these metrics evaluate the accuracy of the inverse CRFs calibrated by each method. Our SLR with a single latent variable and the LDL (Mean RMSE 0.062) clearly outperformed the other methods, even those using three parameters. The calibration accuracy improvement contributed by the LDL can be quantified by comparing our SLR against its baseline. The sixth metric evaluates the total time needed to calibrate all 14 camera models (i.e. to find the optimal model parameters); it reflects model efficiency, which is important for deployment on mobile platforms. Our SLR with LDL (57.4s) completed all the calibrations more than twice as fast as gamma (112.6s), which also works with a single parameter, and faster than the other methods, which use more parameters. This is partly attributable to the simple yet efficient autoencoder architecture found by NAS. Our SLR with AUC (43.1s) achieved even faster calibration but sacrificed calibration accuracy (Mean RMSE 0.105).

5 Conclusion

In this paper, a CRF model that represents camera responses with only a single latent variable has been described. The model learns from real-world CRFs through unsupervised training of an autoencoder. A simple yet efficient autoencoder architecture was found by applying a naïve NAS algorithm. A latent distribution learning approach was introduced to effectively constrain the latent variable to a normal distribution, improving the accuracy of the CRF calibration process. We demonstrated the superior performance of the proposed model in terms of both CRF modelling accuracy (tenfold better accuracy in the curve-fitting cross-validation benchmark) and calibration efficiency (around twice as fast as the best current models for CRF calibration in a double cross-validation benchmark).