5.1 Databases
Many different databases have been used for evaluating image difference metrics. In order to test WLF-DEE extensively, we have chosen three: the public Tampere Image Database 2008 (TID2008) and two databases developed at the Norwegian Colour and Visual Computing Laboratory.
The first database, TID2008 [24], contains a total of 1,700 images, with 25 reference images (Figure 5) and 17 types of distortions over four distortion levels. The mean opinion scores are the results of 654 observers attending the experiments. No viewing distance is stated for TID2008; therefore, we have used a standard viewing distance of 50 cm for the metrics requiring this setting. We have included only the images whose distortions directly or indirectly change contrast, narrowing TID2008 to a total of 400 images equally divided among the following four categories: masked noise, quantization noise, denoising and contrast change.
The second database, proposed by Pedersen et al. [41], contains four original images (Figure 6): three portraits and one illustration. The originals were altered in lightness, where each image had four versions with global lightness differences and four versions with local lightness changes. The lightness changes were 3 and 5. Four versions were brighter than the original, and four darker, for a total of 32 modified images. The psychophysical experiment was done on a calibrated CRT monitor, LaCie electron 22 blue II (LaCie, Basel, Switzerland), in a grey room. The observers were seated approximately 80 cm from the screen. The illuminance measured in front of the monitor was approximately 17 lux. A total of 25 observers were recruited for the experiment, and they were asked in a pairwise comparison experiment to choose the image most similar to the original. This database is of particular interest to us because contrast is directly related to change in luminance [40], which is related to lightness [7, 8].
The third database, from Ajagamelle [42], contains a total of 10 original images covering a wide range of characteristics and scenes (Figure 7). The images were modified on a global scale using Adobe Photoshop, with separate and simultaneous variations of contrast, lightness and saturation, resulting in a total of 80 test images. The experiment was carried out as a category-judgment experiment with 14 observers. Each pair of images was displayed on an Eizo ColorEdge CG241W digital LCD display (Eizo Corporation, Ishikawa, Japan). The monitor was calibrated and profiled using GretagMacbeth Eye-One Match 3. The settings on the monitor were sRGB with a resolution of 1,600 × 1,200 pixels. The experiment took place in a windowless room with neutral grey walls, ceiling and floor. The ceiling lights in the room were set to provide a level of ambient illumination of around 40 lux, which is below the upper threshold of 64 lux recommended by the CIE [43]. The white point was set to D65, the gamma to 2.2 and the luminance level to 80 cd/m². The display was placed at a viewing distance of 70 cm. The images presented were 750 × 499 pixels or 499 × 750 pixels, which subtended roughly 20° of visual angle when viewed at this distance.
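The relation between viewing distance and visual angle quoted above can be checked with a short sketch. It assumes the standard flat-target formula, angle = 2·atan(size / (2·distance)); the function names are illustrative and not taken from the cited databases. Only the 20° and 70 cm figures come from the text; the physical width is derived from them, not measured.

```python
import math

def visual_angle_deg(size_cm: float, distance_cm: float) -> float:
    """Visual angle (degrees) subtended by a flat target of the given size
    viewed head-on from the given distance."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))

def size_for_angle_cm(angle_deg: float, distance_cm: float) -> float:
    """Inverse relation: physical size that subtends angle_deg at distance_cm."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)

# Physical extent implied by the figures quoted in the text (20 degrees at 70 cm).
implied_width_cm = size_for_angle_cm(20.0, 70.0)

# Round trip: the implied width should subtend 20 degrees again.
round_trip_deg = visual_angle_deg(implied_width_cm, 70.0)
```

Under these assumptions the longer image side would measure a little under 25 cm on screen. The same helper could be used to set the viewing-distance parameter for metrics that require it.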
5.2 Performance measures
Two types of correlation coefficients are computed [45] in order to evaluate the performance of WLF-DEE:
1. The Pearson product-moment correlation coefficient. It assumes that the variables are measured on at least an interval scale, and it evaluates the linear relationship between two variables. This is a performance measure relating to the prediction accuracy of the metric [46].
2. The Spearman rank correlation coefficient. It is a non-parametric measure of correlation, and it evaluates the linear relationship between two sets of ranked data instead of the actual values, which amounts to assessing how monotonic the relationship between the actual values is. It makes no assumptions about the frequency distribution of the variables or about how tightly the ranked data cluster around a straight line. This is a performance measure relating to the prediction monotonicity of the metric [46].
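The difference between the two coefficients can be illustrated with a minimal sketch using only the Python standard library. The score lists are made-up illustrative values, not data from any of the databases, and the helper names are our own. Spearman correlation is computed here as the Pearson correlation of the ranks, with tied values receiving their average rank.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical metric outputs and observer scores: monotonic but non-linear.
metric_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
observer_scores = [1.0, 8.0, 27.0, 64.0, 125.0]

r_pearson = pearson(metric_scores, observer_scores)
r_spearman = spearman(metric_scores, observer_scores)
```

For this monotonic but non-linear toy relationship, the Spearman coefficient is 1 while the Pearson coefficient is lower, which is why the former is read as prediction monotonicity and the latter as prediction accuracy.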
The relationships between the metrics and the observers are not necessarily linear. In order to remove any non-linearities due to the subjective experimental process and to facilitate comparison of the metrics in a common analysis space, we investigate the relationship between the metrics and observers by using non-linear regression [46]. In this work, we apply the same mapping function as that of Sheikh et al. [47]:
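$$ f(x) = \theta_1 \left( \frac{1}{2} - \frac{1}{1 + \exp\left[\theta_2 (x - \theta_3)\right]} \right) + \theta_4 x + \theta_5 \tag{18} $$
where x is the metric score and θ_i, i = 1, 2, 3, 4, 5, are the parameters to be fitted. The 95% confidence intervals for the correlation values are calculated using Fisher’s Z transformation as described by the Video Quality Expert Group [48].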
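As a rough sketch of these two steps, the code below assumes the five-parameter logistic form commonly attributed to Sheikh et al. (a logistic term plus a linear term) and the usual Fisher Z interval with standard error 1/sqrt(N − 3). The function names, the parameter values and the sample size are illustrative assumptions. In practice the θ parameters would be estimated with a non-linear least-squares routine (for example `scipy.optimize.curve_fit`), which is not shown here.

```python
import math

def logistic_map(x, t1, t2, t3, t4, t5):
    """Assumed five-parameter mapping: logistic term plus linear term."""
    return t1 * (0.5 - 1.0 / (1.0 + math.exp(t2 * (x - t3)))) + t4 * x + t5

def fisher_ci(r, n, z_crit=1.96):
    """Confidence interval for a correlation r computed from n samples.

    Uses Fisher's Z transformation: z = atanh(r), standard error
    1 / sqrt(n - 3); z_crit = 1.96 corresponds to a 95% interval.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# At x == t3 the logistic term vanishes, leaving only the linear part.
mapped = logistic_map(2.0, 3.0, 1.5, 2.0, 0.5, 1.0)

# Hypothetical example: correlation 0.65 measured on 100 images.
ci_low, ci_high = fisher_ci(0.65, 100)
```

The interval is asymmetric around r because it is symmetric in the transformed z domain, and it narrows as the number of images grows.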
In order to have a complete analysis, the following coefficients are also presented:
- Root-mean-square error (RMSE) [48]. It is a measure of the differences between the values predicted by the metric and the scores actually given by the observers.
- Significance of the difference between the Pearson correlation coefficients (t-value) [48]. This measure assumes that a good fit for observers’ quality score is given by the normal distribution. It uses the H₀ hypothesis, which assumes that there is no significant difference between the correlation coefficients, and the H₁ hypothesis, which considers that the difference is significant, although not specifying better or worse.
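Both measures can be sketched in a few lines. This is an illustration under stated assumptions, not the exact procedure of [48]. The RMSE here divides by the plain number of samples (some reports subtract the number of fitted parameters, which is omitted here). The test statistic is taken as the difference of the Fisher-transformed correlations divided by sqrt(1/(N1 − 3) + 1/(N2 − 3)). The function names and the numeric inputs are hypothetical.

```python
import math

def rmse(predicted, observed):
    """Root-mean-square error between metric predictions and observer scores."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

def correlation_difference_statistic(r1, n1, r2, n2):
    """Statistic for H0: the two correlation coefficients do not differ.

    Each coefficient is Fisher-transformed; under H0 the difference is
    approximately normal with the standard deviation computed below.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    sigma = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / sigma

def significantly_different(statistic, critical=1.96):
    """Two-sided decision at the 5% level: reject H0 when |statistic| > 1.96."""
    return abs(statistic) > critical

# Hypothetical values, for illustration only.
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
close_pair = correlation_difference_statistic(0.65, 100, 0.60, 100)
distant_pair = correlation_difference_statistic(0.90, 100, 0.50, 100)
```

With 100 images each, correlations of 0.65 and 0.60 are not distinguishable at the 5% level in this sketch, whereas 0.90 and 0.50 are. The test is two-sided, so it indicates only that a difference exists, not its direction.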
5.3 Results
As mentioned in Section 4.1.1, WLF-DEE has been tested for a total of 48 configurations, but in order to give a more readable presentation of the results, we present only a selection of them. As WLF-DEE using the two DOG schemes in Equations 15a and 15b shows lower correlation than WLF-DEE using the DOG scheme in Equation 15c, these results are excluded. This also confirms the statement of Tadmor and Tolhurst that the DOG model analogous to the Michelson formula performs better for contrast assessment [4]. In the same way, all configurations with ρ = 0.85 in Equation 13 are excluded, as they show lower correlation than the configurations with ρ = 1.00. This leaves a total of only eight WLF-DEE results, shown in Table 2 for Pearson correlation, Table 3 for Spearman correlation and Table 4 for RMSE. The significance of the difference between the Pearson correlation coefficients is presented for each database in Tables 5, 6, 7, 8, 9 and 10.
Table 2
Pearson correlation for WLF-DEE and the selected state-of-the-art metrics on all databases
Table 3
Spearman correlation for WLF-DEE and the selected state-of-the-art metrics on all databases
Table 4
RMSE for WLF-DEE and the selected state-of-the-art metrics on all databases
Table 5
Significance of the difference between Pearson correlation coefficients on TID-masked noise database
Table 6
Significance of the difference between Pearson correlation coefficients on TID-quantization noise database
Table 7
Significance of the difference between Pearson correlation coefficients on TID-image denoising database
Table 8
Significance of the difference between Pearson correlation coefficients on TID-contrast change database
Table 9
Significance of the difference between Pearson correlation coefficients on Ajagamelle database
Table 10
Significance of the difference between Pearson correlation coefficients on Pedersen database
Considering the TID database, SSIM has the highest Pearson correlation in all four categories. In the masked noise category, SSIM is followed by WLF-DEE K, while in the quantization noise category, it is followed by S-CIELAB and then WLF-DEE K. In the denoising and contrast change categories, SSIM is instead followed by WLF-DEE I and then WLF-DEE K. For all four categories, all the metrics give a higher correlation with the perceived observer difference when logistic fitting is used. Furthermore, as the confidence intervals (Figures 8, 9, 10 and 11) of WLF-DEE K overlap with those of SSIM, the two metrics can be considered to have the same performance. Overall, for the four categories of the TID database, WLF-DEE K proves to be significantly better than PSNR and VSNR and to perform on a par with SSIM and S-CIELAB.
For the Ajagamelle database, PSNR shows the highest Pearson correlation, followed by PSNR-HVS-M, S-CIELAB, SSIM and then four configurations (I, J, K, L) of WLF-DEE, which have very close results. Since in this case the confidence interval (Figure 12) of WLF-DEE K overlaps with the confidence intervals of the metrics with slightly higher correlation, these metrics can be considered to have the same performance. For the Pedersen database, S-CIELAB shows the highest Pearson correlation, followed by PSNR-HVS-M, PSNR, four configurations (I, J, K, L) of WLF-DEE with very close results and then SSIM. In this database, the confidence intervals (Figure 13) instead show that WLF-DEE K has a slightly lower performance than S-CIELAB, but not than SSIM. For these two databases as well, all the metrics give a higher correlation when logistic fitting is used.
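The overlap criterion used in this comparison can be written as a small helper. This is a sketch only: the interval bounds below are hypothetical numbers chosen for illustration, not values read from Figures 8 to 13, and the function name is our own.

```python
def intervals_overlap(a, b):
    """Return True when two closed intervals (low, high) share at least one point."""
    (a_low, a_high), (b_low, b_high) = a, b
    return a_low <= b_high and b_low <= a_high

# Hypothetical 95% confidence intervals for three metrics.
metric_a = (0.52, 0.75)
metric_b = (0.60, 0.81)
metric_c = (0.83, 0.93)

a_vs_b = intervals_overlap(metric_a, metric_b)  # overlapping: treated as equal performance
a_vs_c = intervals_overlap(metric_a, metric_c)  # disjoint: treated as different performance
```

Overlap of confidence intervals is a simple visual criterion; the significance test on the difference between correlation coefficients reported in Tables 5 to 10 gives a complementary view.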
Considering all six data sets examined, WLF-DEE gives a higher correlation with configurations I, J, K and L than with configurations M, N, O and P. In particular, WLF-DEE K agrees most closely with the difference perceived by observers, indicating that large radii of the Gaussians and uniform weighting of the levels should be used for the estimation of perceived difference. Furthermore, WLF-DEE K with logistic fitting shows a stable trend across the six data sets, with an average correlation of 0.65. This also holds true for other tested metrics such as S-CIELAB and PSNR-HVS-M, but not for SSIM and VSNR, which show very high correlation in one data set and very low correlation in another.
Analysis with the Spearman correlation leads to the same conclusions as with the Pearson correlation, except for the Ajagamelle database, where the highest correlation is shown by S-CIELAB, although it does not outperform most of the other metrics. The results are presented only with linear fitting, as logistic fitting brings no improvement for any metric. With the Spearman correlation as well, WLF-DEE K shows its stability, with an average correlation of 0.59.
Analysis with the root-mean-square error shows that SSIM has the lowest RMSE for all four categories of the TID database. As the confidence intervals of SSIM overlap with those of two configurations of WLF-DEE-C (J, K) (Figures 14, 15, 16 and 17), it cannot be claimed that the two metrics are significantly different in performance.
For the Ajagamelle database instead, WLF-DEE-C N shows the lowest RMSE, followed by S-CIELAB and then by several other configurations of WLF-DEE-C (M, I, J, K, L) and VSNR. The confidence intervals (Figure 18) show that these three metrics are not significantly different from each other, but they outperform other tested metrics such as SSIM, PSNR and PSNR-HVS-M. For the Pedersen database, S-CIELAB shows the lowest RMSE, followed by PSNR, PSNR-HVS-M and then all the configurations of WLF-DEE-C. SSIM and VSNR have the highest RMSE. The confidence intervals (Figure 19) show that WLF-DEE-C (I, J, K, L, P) does not differ in performance from the other tested metrics, but the overlap with the S-CIELAB confidence interval is minimal.
The analysis of the significance of the difference is presented at the 5% significance level, for Pearson correlation with logistic fitting only. Based on the definition in [48], two metrics are not significantly different if −1.96 < t-value < 1.96. This analysis confirms that WLF-DEE-C K is not significantly different in performance from the other tested metrics for the TID-masked noise and TID-image denoising databases. For TID-quantization noise instead, WLF-DEE-C K is not significantly different in performance from SSIM and S-CIELAB. For TID-contrast change, WLF-DEE-C K is significantly different in performance only from VSNR. For the Ajagamelle database, WLF-DEE-C K is not significantly different in performance only from SSIM, while for the Pedersen database, it is not significantly different in performance from the other tested metrics except S-CIELAB.
Overall, WLF-DEE-C K shows its particular strength on those databases where a change in contrast between the original and its reproduction is triggered by a change of color attributes rather than by specific distortions. In conclusion, WLF-DEE-C K is a promising new metric for predicting the perceived magnitude of the contrast difference between an original and a reproduction, fulfilling the purpose for which it was developed.