Introduction

High dynamic range (HDR) imaging makes it possible to capture, represent and reproduce the wide range of colors and luminous intensities present in everyday life, ranging from bright sunshine to dark shadows (Dufaux 2016). These extended capabilities are expected to significantly improve the quality of experience (QoE) of emerging multimedia services with respect to conventional low dynamic range (LDR) technology. Commercial HDR video cameras and displays are becoming available, and parts of the HDR end-to-end delivery chain, such as image and video compression, are currently the subject of standardization activities in MPEG (Luthra et al. 2015; Hanhart et al. 2016) and JPEG (Richter 2013). In this context, evaluating the visual quality of compressed HDR pictures is of critical importance for designing and optimizing video codecs and processing algorithms.

Evaluating HDR visual quality presents new challenges with respect to conventional LDR quality assessment (Narwaria et al. 2016b). The higher peak brightness and contrast offered by HDR increase the visibility of artifacts and, at the same time, change the way viewers focus their attention compared to LDR (Narwaria et al. 2014b). Moreover, color distortion assumes a major role in the overall quality judgment as a result of the increased luminance level (Fairchild 2013). Since these and other factors interact in a complex way to determine HDR visual quality, the most accurate approach to assess it is, in general, through subjective test experiments. However, these are expensive to design and implement, require specialized expertise, and are time-consuming. Furthermore, in the case of HDR, subjective testing requires specialized devices such as HDR displays, which still have a high cost and limited diffusion. Therefore, designing and tuning full-reference (fidelity) quality metrics for HDR content is very timely, and has motivated research in both the multimedia and computer graphics communities in the past few years (Mantiuk et al. 2011; Narwaria et al. 2015a, b; Aydın et al. 2008; Narwaria et al. 2016a).

Two main approaches have been proposed to measure HDR fidelity. On one hand, some metrics require modeling of the human visual system (HVS), such as the HDR-VDP (Mantiuk et al. 2011) or HDR-VQM (Narwaria et al. 2015a) metrics for images and videos, respectively. For example, the HDR-VDP metric accurately models the early stages of the HVS, including intra-ocular scattering, luminance masking, and the achromatic response of the photoreceptors, in order to precisely predict the visibility and strength of per-pixel distortion. On the other hand, one can resort to metrics developed in the context of LDR imagery, such as simple arithmetic (PSNR, MSE), structural [SSIM (Wang et al. 2004) and its multiscale version (Wang et al. 2003)] and information-theoretic [e.g., VIF (Sheikh and Bovik 2006)] metrics. All these LDR metrics are based on the assumption that pixel values are perceptually linear, i.e., that equal increments of pixel values correspond to equivalent changes in perceived luminance. This is not true in the case of HDR content, where pixel values store linear light, i.e., pixels are proportional to the physical luminance of the scene. Human perception, instead, has a more complex behavior: it can be approximated by a square-root law at low luminance levels and is approximately proportional to luminance ratios at higher luminance levels, as expressed by the DeVries–Rose and Weber–Fechner laws, respectively (Kundu and Pal 1986). Thus, in order to employ these metrics, the HDR content needs to be perceptually linearized, e.g., using a logarithmic or perceptually uniform (PU) encoding (Aydın et al. 2008).

The capability of both kinds of fidelity metrics to predict viewers’ mean opinion scores (MOS) has been assessed in a number of recent subjective studies using compressed HDR pictures (Valenzise et al. 2014; Hanhart et al. 2015a; Narwaria et al. 2013, 2012a). Nevertheless, these studies sometimes show discrepancies in their conclusions about the ability of these metrics to yield consistent and accurate predictions of MOSs. For instance, the correlation values of PU-SSIM, i.e., the SSIM metric applied after the PU encoding of Aydın et al. (2008), differ substantially between the study of Narwaria et al. (2015b) and that of Valenzise et al. (2014). The difference is essentially related to the size and characteristics of the subjective material. In Valenzise et al. (2014), the performance of objective metrics was assessed on a small image database (50 subjectively annotated images), using different coding schemes including JPEG, JPEG 2000 and JPEG-XT. In Narwaria et al. (2015b), the authors evaluate metric correlations using a number of subjectively annotated databases, with varied distortions and, especially, with scores gathered in separate tests (each with its own experimental conditions). Both studies have their advantages and limitations, which makes it difficult to draw a simple and clear conclusion about the performance of fidelity metrics. In other cases, such as Hanhart et al. (2015a), metrics have been tested on a single type of distortion only (specifically JPEG-XT compression), and it is therefore desirable to extend those conclusions to more realistic and varied conditions.

The aim of this paper is to bring more clarity to this field, by providing an extensive, reliable, and consistent benchmark of the most popular HDR image fidelity metrics. To this end, we collected as many publicly available databases of HDR compressed images with subjective scores as possible, in addition to proposing a new one which mixes different codecs and pixel encoding functions. This gives a total of 690 HDR images, which is, to our knowledge, the largest set on which HDR metrics have been tested so far. We then align the MOSs of these databases using the iterated nested least square algorithm (INLSA) proposed in Pinson and Wolf (2003), in order to obtain a common subjective scale. Based on this data, we analyze the prediction accuracy and the discriminability (i.e., the ability to detect when two images have different perceived quality) of 25 fidelity metrics, including those currently tested in MPEG standardization.

The main contributions of this paper include:

  • the most extensive evaluation (using 690 subjectively annotated HDR images) of HDR full-reference image quality metrics available so far;

  • the proposal of a new subjective database with 50 distorted HDR images, combining 3 image codecs and 2 pixel encoding algorithms (SMPTE-2084 perceptual quantization (SMPTE 2014) and a global tone-mapping operator);

  • an evaluation of metric discriminability, that complements the conventional statistical accuracy analysis, based on a novel classification approach.

Assessment of image quality differs from assessment of video quality, as the HVS involves additional temporal mechanisms. Nevertheless, image quality metrics are often applied to video on a frame-by-frame basis, e.g., PSNR or SSIM. Therefore, the results of this work could also be indicative of the frame-by-frame performance of objective metrics on video.

The rest of this paper is organized as follows. “Considered subjective databases” describes the subjective databases considered within this paper. The alignment procedure is explained in “Alignment of Database MOSs”. In “Analysis of objective quality metrics”, existing objective image quality metrics have been compared using both statistical evaluation and a classification approach. Finally, “Conclusion” concludes the paper.

Considered subjective databases

Although there are several publicly available repositories of high-quality HDR pictures (Debevec and Malik 2008; EMPA 2013; Fairchild 2007; Drago and Mantiuk 2004; pfstools 2015), there is only a small number of subjectively annotated image quality databases. For this study, we selected four publicly available HDR image quality assessment databases, in addition to proposing a new one, described in “Database #5—New subjective database”. Each database contains compressed HDR pictures with related subjective scores. The databases differ in size, kind of distortion (codec) and subjective methodology. A brief description of these databases is given in the following, while a summary of their characteristics is reported in Table 1. The interested reader can refer to the original publications for further details.

Database #1—Narwaria et al. (2013)

In the work of Narwaria et al. (2013), a tone mapping based HDR image compression scheme has been proposed and assessed via a subjective test. Subjective scores were collected from 27 observers, using a SIM2 HDR47E S 4K display in a 130 \(cd/m^2\) illuminated room. The participants were asked to rate overall image quality using the absolute category rating with hidden reference (ACR-HR) methodology, employing a five-level discrete scale where 1 is bad and 5 is excellent quality. The test material was obtained from 10 pristine HDR pictures, including both indoor and outdoor, natural or computer-generated scenes. The distorted images are generated through a backward compatible scheme (Ward et al. 2006): the HDR image is first converted to LDR by using a tone mapping operator (TMO); then, the LDR picture is coded using a legacy image codec; finally, the compressed image is expanded by inverse tone mapping to the original HDR range. The coding scheme in Narwaria et al. (2013) employs iCAM06 (Kuang et al. 2007) as TMO, and JPEG compression at different qualities. In addition, the authors proposed two criteria to optimize the quality of the reconstructed HDR. As a result, a total of 10 contents \(\times \) 7 bitrates \(\times \) 2 optimization criteria \(=140\) test images were evaluated. This database is publicly available at http://ivc.univ-nantes.fr/en/databases/JPEG_HDR_Images/.

The analysis in Narwaria et al. (2013) shows that the mean squared error (MSE) and the structural similarity index measure (SSIM) perform well in predicting subjective scores and ranking distorted images when each content is assessed separately. However, these results do not hold when different contents are considered at the same time. HDR-VDP-2 was found to be the best performing metric (in terms of linear correlation with MOSs), although not statistically different from the metric proposed in Narwaria et al. (2012b).

Table 1 Number of observers, subjective methodology, number of stimuli, compression type and tone mappings employed in the HDR image quality databases used in this paper

Database #2—Narwaria et al. (2014a)

Narwaria et al. (2014a) subjectively evaluate the impact of using different TMOs in HDR image compression. The test material includes six original scenes, both indoor and outdoor, from which a total of 210 test images were created using the JPEG 2000 compression algorithm after the application of several TMOs, including Ashikhmin (2002), both the local and global versions of Reinhard (2002), Durand and Dorsey (2002), and a logarithmic TMO. The experimental setup was the same as for Database #1 (Narwaria et al. 2013) described above. The subjective test was conducted with 29 observers using the ACR-HR methodology.

Results show that the choice of TMO greatly affects the quality scores. It is also found that local TMOs, with the exception of Durand’s, generally yield better results than global TMOs as they tend to preserve more details. No evaluation of objective quality metrics is reported in the original paper (Narwaria et al. 2014a).

Database #3—Korshunov et al. (2015)

In the study of Korshunov et al. (2015), an HDR image quality database, publicly available at http://mmspg.epfl.ch/jpegxt-hdr, has been created using the backward-compatible JPEG-XT standard (Richter 2013) with different profiles and quality levels. For this database, 240 test images have been produced, using either the Reinhard (2002) or the Mantiuk et al. (2006) TMO for the base layer, 4 bit rates for each original image and 3 profiles of JPEG-XT. The test room was illuminated with a 20 lux lamp, and a SIM2 HDR display was used. At any time, 3 observers took the test simultaneously. The subjective scores were collected from 24 participants, using the double stimulus impairment scale (DSIS) Variant I methodology, i.e., images were displayed side by side, one being the reference and the other the distorted version.

This subjective database has been used in the work of Artusi et al. (2015), where an objective evaluation of JPEG-XT compressed HDR images was carried out. The results show that LDR metrics such as PSNR, SSIM, and multi-scale SSIM (MSSIM) give high correlation scores when they are used with the PU encoding of Aydın et al. (2008), while the overall best correlated quality metric is HDR-VDP-2.

Database #4—Valenzise et al. (2014)

Valenzise et al. (2014) were the first to collect subjective data with the specific goal of analyzing the performance of HDR image fidelity metrics. Their database is composed of 50 compressed HDR images, obtained from 5 original scenes of the Fairchild HDR image survey (Fairchild 2007). Three different coding schemes have been used to produce the test material, i.e., JPEG, JPEG 2000 and JPEG-XT. In the first two cases, the HDR image is first tone mapped to LDR using the minimum-MSE TMO proposed by Mai et al. (2011). The images were displayed on a SIM2 HDR47E S 4K display, with an ambient luminance of 20 \(cd/m^2\). Subjective scores were collected using the DSIS methodology, i.e., pairs of images (original and distorted) were presented to the viewers, who had to evaluate the level of annoyance of the distortion in the second image on a continuous quality scale ranging from 0 to 100, where 0 corresponds to very annoying artifacts and 100 to imperceptible artifacts. Fifteen observers rated the images. The database is available at http://webpages.l2s.centralesupelec.fr/perso/giuseppe.valenzise/download.htm.

The results of this study showed that LDR fidelity metrics could accurately predict image quality, provided that the display response is somehow taken into account (in particular, its peak brightness), and that a perceptually uniform (PU) encoding (Aydın et al. 2008) is applied to HDR pixel values to make them linear with respect to perception.

Fig. 1 Original contents for the new proposed image database described in “Database #5—New subjective database”, rendered using the TMO in Mantiuk et al. (2008)

Database #5—New subjective database

In addition to the databases described above, we construct a new subjective HDR image database of 50 images, as an extension of our previous work (Valenzise et al. 2014). The new database features five original contents, selected so as to be representative of different image features, including dynamic range, image key and spatial information. The five contents are shown in Fig. 1. The images “Balloon”, “FireEater2”, and “Market3” are chosen among the frames of the MPEG HDR sequences proposed by Technicolor (Lasserre et al. 2013). “Showgirl” is taken from the Stuttgart HDR Video Database (Froehlich et al. 2014). “Typewriter” is from the HDR photographic survey dataset (Fairchild 2007). All images either have a spatial resolution of \(1920 \times 1080\) pixels or are zero-padded to that resolution.

Similarly to Valenzise et al. (2014), the test images are obtained by using a backward compatible HDR coding scheme (Ward et al. 2006), using JPEG and JPEG 2000 (with different bitrates) as LDR codecs. We did not include JPEG-XT in this experiment, since some of the contents we selected (e.g., “Showgirl” and “Typewriter”) were already part of the Database #3. In order to convert HDR to LDR, we use two options: (i) the TMO of Mai et al. (2011); and (ii) the electro-optical transfer function SMPTE ST 2084 (Miller et al. 2012; SMPTE 2014), commonly known as perceptual quantization (PQ). The latter is a fixed, content-independent transfer function which has been designed in such a way that the increments between codewords have minimum visibility, according to Barten’s contrast sensitivity function (Barten 1999). We choose this transfer function as an alternative to tone mapping, as it has been proposed as the anchor scheme in current MPEG HDR standardization activities (Luthra et al. 2015). Both PQ and Mai et al.’s TMO are applied per color channel.
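To make the PQ option concrete, the following is a minimal sketch of the ST 2084 encoding and decoding of absolute luminance values; the function names are our own, while the constants are those defined in the SMPTE ST 2084 specification (luminance is normalized by the 10,000 \(cd/m^2\) reference peak).

```python
import numpy as np

# SMPTE ST 2084 (PQ) constants
M1 = 2610 / 16384          # 0.1593017578125
M2 = 2523 / 4096 * 128     # 78.84375
C1 = 3424 / 4096           # 0.8359375
C2 = 2413 / 4096 * 32      # 18.8515625
C3 = 2392 / 4096 * 32      # 18.6875

def pq_encode(luminance_cd_m2):
    """Map absolute luminance (cd/m^2) to PQ code values in [0, 1]."""
    y = np.clip(luminance_cd_m2 / 10000.0, 0.0, 1.0)  # normalize to the 10,000 cd/m^2 peak
    y_m1 = np.power(y, M1)
    return np.power((C1 + C2 * y_m1) / (1.0 + C3 * y_m1), M2)

def pq_decode(code_value):
    """Inverse mapping: PQ code values in [0, 1] back to absolute luminance."""
    v = np.power(code_value, 1.0 / M2)
    y = np.power(np.maximum(v - C1, 0.0) / (C2 - C3 * v), 1.0 / M1)
    return 10000.0 * y
```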

The test environment and methodology are carefully controlled to be the same as in Database #4 (Valenzise et al. 2014). The DSIS methodology is employed, where the reference image is shown for 6 s, followed by 2 s of mid-gray screen and 8 s of degraded image. The asymmetry in timing between the distorted and reference image was determined in a pilot test, taking into account the fact that the reference image is shown several times, while the degraded image is different at each round and requires a longer evaluation interval. After both the original and distorted image are displayed, the observer takes as much time as needed to rate the level of annoyance on the same continuous scale as in Valenzise et al. (2014). The sequence of tested images is randomized to avoid context effects (De Simone 2012). Moreover, very bright (“Market3”) and very dark (“FireEater2”) stimuli are not placed one after the other, in order to avoid masking caused by sudden brightness changes. In addition to randomization, stabilizing images (one from each content and featuring each quality level) are shown at the beginning of the experiment to stabilize viewers’ votes; the votes for these images are discarded.

In addition to the contents reported in Fig. 1, a small subset of the stimuli of Database #4 was included in the test. This enabled us to align the two databases, #4 and #5, so that the corresponding MOS values are on the same scale (Pitrey et al. 2011). Thus, in the following we will refer to the union of these two databases as Databases #4 & 5.

A panel of 15 people (3 women, 12 men; average age of 26.8 years), mainly Ph.D. students naive to HDR technology and image compression, participated in the test. Subjects reported normal or corrected-to-normal vision. The outlier detection and removal procedure described in BT.500-13 (ITU 2012) did not detect any outliers. Then, mean opinion scores and their confidence intervals (CI) were computed assuming the data follow a Student’s t-distribution.
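As an illustration of this last step, a short sketch of the MOS and confidence interval computation is given below; the observers-by-stimuli matrix layout of the raw scores (after outlier removal) is a hypothetical assumption.

```python
import numpy as np
from scipy import stats

def mos_with_ci(scores, confidence=0.95):
    """Per-stimulus MOS and confidence interval half-width.

    `scores` is assumed to be an (observers x stimuli) array of raw opinion
    scores, already cleaned with the BT.500-13 outlier removal procedure.
    """
    n = scores.shape[0]
    mos = scores.mean(axis=0)
    sem = scores.std(axis=0, ddof=1) / np.sqrt(n)           # standard error of the mean
    t_crit = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1)  # Student's t critical value
    return mos, t_crit * sem
```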

Alignment of database MOSs

During the training phase, subjects are generally instructed to use the whole range of grades (or distortion levels) of the scale while evaluating. However, the quality range of the test material may differ from one experiment to another, and viewers may not share the same understanding and expectations of image or video quality. Hence, MOS values generally do not reflect the absolute quality of the stimuli. In Fig. 2a, we observe the MOS distribution for the non-aligned databases as a function of the HDR-VQM metric. Due to the characteristics of the experiments and test material, a similar level of impairment on the subjective scale may correspond to very different values of the objective metrics. Therefore, in order to use the MOS values of different subjective databases in a consistent way, they need to be mapped onto a common quality scale.

In order to align the MOS values of all five HDR image databases, we use the iterated nested least square algorithm (INLSA) proposed in Pinson and Wolf (2003). This algorithm requires objective parameters for the alignment, under the assumption that those are sufficiently well correlated and linear with respect to MOS. Therefore, we selected the five most linear and best correlated objective quality metrics: HDR-VDP-2.2, HDR-VQM, PU-IFC, PU-UQI, and PU-VIF (the calculation of PU metrics will be explained in detail in “Objective quality metrics under consideration”). The INLSA algorithm first normalizes the MOS scores from each source to the [0, 1] interval, and then aligns them by solving two least-squares problems: first, the MOS values are corrected by an affine transformation in order to span the same subjective scale; second, the MOS values are aligned to the corresponding objective values by finding the optimal (in the least-squares sense) combination of weights such that the corrected MOSs can be predicted as a linear combination of the objective parameters. These two steps, prediction and correction, are repeated iteratively until a convergence criterion is met. Details about the algorithm can be found in Pinson and Wolf (2003).
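The following sketch illustrates the alternating prediction/correction structure of such an alignment; it is a simplified illustration under our own assumptions about the data layout and the convergence test, not a reimplementation of the INLSA of Pinson and Wolf (2003).

```python
import numpy as np

def inlsa_style_align(mos_per_db, obj_per_db, n_iter=50, tol=1e-6):
    """Simplified sketch of an INLSA-style alignment.

    mos_per_db: list of 1-D arrays, MOS values of each database, normalized to [0, 1].
    obj_per_db: list of 2-D arrays (stimuli x objective parameters), same order.
    Alternates (i) a least-squares fit of prediction weights on the pooled data and
    (ii) a per-database affine correction of the MOS toward the current prediction.
    """
    mos = [m.astype(float).copy() for m in mos_per_db]
    designs = [np.column_stack([np.ones(len(o)), o]) for o in obj_per_db]
    prev_rmse = np.inf
    for _ in range(n_iter):
        # Prediction step: one set of weights for all databases.
        X = np.vstack(designs)
        y = np.concatenate(mos)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        # Correction step: affine-map each database's MOS toward the prediction.
        for k, D in enumerate(designs):
            pred = D @ w
            A = np.column_stack([np.ones(len(mos[k])), mos[k]])
            a, *_ = np.linalg.lstsq(A, pred, rcond=None)
            mos[k] = A @ a
        rmse = np.sqrt(np.mean((np.concatenate(mos) - X @ w) ** 2))
        if abs(prev_rmse - rmse) < tol:   # illustrative convergence criterion
            break
        prev_rmse = rmse
    return mos
```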

Fig. 2 Plots of MOS vs HDR-VQM scores before and after INLSA alignment. The INLSA algorithm scales MOS values so that images which have similar objective scores also have similar MOS values. In order to compare the scatter plots quantitatively, the root mean squared error (RMSE) of the data is reported for each case

The scatter plots of MOS values and HDR-VQM metric values after alignment can be seen in Fig. 2b. It can be observed that data points having similar HDR-VQM values have similar MOS values after INLSA alignment. After the alignment, all the MOS values have been mapped onto a common subjective scale, and they can be used in the evaluation of the objective quality metrics.

From Fig. 2b and an initial inspection of the test images, we notice that the images in Database #2 (Narwaria et al. 2014a) have very different characteristics compared to the others, and their MOS values are much more scattered than those of the other databases after alignment. This is mainly due to the characteristics of this database: the stimuli were mainly obtained by changing the tone mapping algorithm used in the compression, including many TMOs which are not suited for use in coding, as they produce strong color artifacts in the reconstructed HDR image, and which are therefore not used in any practical coding scheme. Also, different kinds of distortion are present simultaneously, such as banding, color saturation, etc. In some cases, false contours are generated and some color channels are saturated. Initial inspection of both the test images and the objective metric results indicates that the considered metrics do not capture the effect of color on quality as humans do.

As viewers were rating very different distortions with respect to the other databases, which instead contain similar kinds of visual impairments, Database #2 is very challenging for all the quality metrics we considered in this work. Therefore, in order to provide a complete overview of the performance of HDR fidelity metrics, in the following we report results both with and without including Database #2 in the evaluations.

Analysis of objective quality metrics

After the alignment of the MOS values of the databases, we obtain an image data set consisting of 690 images (or 480 images if Database #2 is excluded) compressed using JPEG, JPEG-XT, and JPEG 2000. In this section, we provide a thorough analysis of the performance of several HDR image fidelity metrics, both from the point of view of prediction accuracy and of their ability to tell whether two images are actually perceived as being of different quality.

Objective quality metrics under consideration

We include in our evaluation a number of commonly used full-reference image quality metrics, including the mean square error (MSE), peak signal to noise ratio (PSNR), structural similarity index (SSIM) (Wang et al. 2004), multi-scale SSIM (MSSIM) (Wang et al. 2003), information fidelity criterion (IFC) (Sheikh et al. 2005), universal quality index (UQI) (Wang and Bovik 2002), VIF (Sheikh and Bovik 2006), and its pixel-domain version (VIFp). In addition to those metrics, we consider HDR-VDP-2.2 (Narwaria et al. 2015b), HDR-VQM (Narwaria et al. 2015a), and additional full-reference metrics recently proposed for HDR video, such as mPSNR, tPSNR, CIE \(\Delta {}E\) 2000 (Tourapis and Singer 2015), and the spatial extension of CIE \(\Delta {}E\) 2000 (Zhang and Wandell 1997), which is computed with the S-CIELAB model.

In order to calculate the quality metrics, we first scale pixel values to the range of luminance emitted by the HDR display used in each subjective experiment. This is especially important for metrics such as HDR-VDP-2.2 which rely on physical luminance. In order to compute these values, we convert HDR pixels into the luminance emitted by a hypothetical HDR display, assuming it has a linear response between the minimum and maximum luminance of the display. As the same display (i.e., SIM2 HDR47E S 4K) has been used in all the experiments, we selected the same parameters for all of them, i.e., \(0.03\,cd/m^2\) and \(4250\,cd/m^2\) for the minimum and maximum luminance, respectively. Although the emitted luminance on HDR displays depends on many factors and is not exactly a linear function of the input pixel values, we found in our previous work that it is adequately close to linear (Zerman et al. 2016) and that, from a practical point of view, this simple linear assumption is equivalent to more sophisticated luminance estimation techniques, which require detailed knowledge of the reproduction device (Valenzise et al. 2014).
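A minimal sketch of this linear display model is shown below; the Rec. 709 luma weights and the normalization by the content maximum are illustrative assumptions, while the display range is the one used in our experiments.

```python
import numpy as np

def to_display_luminance(hdr_rgb, l_min=0.03, l_max=4250.0):
    """Map relative linear-light HDR pixels to emitted display luminance (cd/m^2),
    assuming the simple linear display model described above."""
    # Relative luminance from linear-light RGB (Rec. 709 weights assumed here).
    y = 0.2126 * hdr_rgb[..., 0] + 0.7152 * hdr_rgb[..., 1] + 0.0722 * hdr_rgb[..., 2]
    y = y / y.max()                      # normalize to the content maximum (assumption)
    return l_min + (l_max - l_min) * y   # linear mapping onto the display range
```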

The objective quality metrics under consideration can be grouped as follows:

  • HDR-specific metrics: HDR-VDP-2.2 and HDR-VQM are recent fidelity metrics developed for HDR images and video, respectively. They model several phenomena that characterize the perception of HDR content, and thus require some knowledge of the viewing conditions (such as distance from the display, ambient luminance, etc.). The mPSNR is the PSNR applied to an exposure bracket extracted from the HDR image and averaged across exposures.

  • Color difference metrics: we use CIE \(\Delta {}E\) 2000 (denoted as CIE \(\Delta {}E_{00}\)), which entails a color space conversion in order to obtain perceptually uniform color differences (Luo et al. 2001), and its spatial extension (Zhang and Wandell 1997) (denoted as CIE \(\Delta {}E_{00}^S\)). More sophisticated color appearance models have not been considered in this study, as their use in quality assessment has been marginal so far; however, they are an interesting direction for future work.

  • LDR metrics applied after a transfer function: LDR metrics such as MSE, PSNR, VIF, SSIM, MSSIM, IFC, and UQI (see the sketch after this list). To compute these LDR metrics we use one of the following encodings:

    • Physical luminance of the scene directly, denoted as Photometric-,

    • Perceptually uniform (Aydın et al. 2008) encoded pixel values, denoted as PU-,

    • Logarithmic coded pixel values, denoted as Log-, or

    • Perceptually quantized (Miller et al. 2012; SMPTE 2014) pixel values. For this case, only tPSNR-YUV has been considered as in Tourapis and Singer (2015).
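The sketch below illustrates how an LDR metric is applied after a transfer function, using the logarithmic encoding (the Log- variant) and PSNR as an example; the PU- variant would replace the log curve with the lookup table of Aydın et al. (2008). The 8-bit rescaling and the clipping range are illustrative assumptions.

```python
import numpy as np

def log_encode(luminance, l_min=0.03, l_max=4250.0):
    """Logarithmic perceptual linearization (the 'Log-' variant)."""
    l = np.clip(luminance, l_min, l_max)
    return (np.log10(l) - np.log10(l_min)) / (np.log10(l_max) - np.log10(l_min))

def log_psnr(ref_luminance, dist_luminance):
    """Log-PSNR: PSNR computed on log-encoded luminance, rescaled to an 8-bit range."""
    r = 255.0 * log_encode(ref_luminance)
    d = 255.0 * log_encode(dist_luminance)
    mse = np.mean((r - d) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```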

When possible, we use the publicly available implementations of these metrics, i.e., HDR-VDP-2.2.1, available at http://sourceforge.net/projects/hdrvdp/files/hdrvdp/, HDR-VQM, available at http://www.sourceforge.net/projects/hdrvdp/files/hdrvdp/, HDRtools version 0.4 (Tourapis and Singer 2015), developed within MPEG, and the MeTriX MuX library for Matlab, available at http://foulard.ece.cornell.edu/gaubatz/metrix_mux/.

Statistical analysis

The performance of the aforementioned fidelity metrics has been evaluated in terms of prediction accuracy, prediction monotonicity, and prediction consistency (De Simone 2012). For prediction accuracy, Pearson correlation coefficient (PCC), and root mean square error (RMSE) are computed. Spearman rank-order correlation coefficient (SROCC) is used to find the prediction monotonicity, and outlier ratio (OR) is calculated to determine the prediction consistency. These performance metrics have been computed after a non-linear regression performed on objective quality metric results using a logistic function, as described in the final report of VQEG FR Phase I (Rohaly et al. 2000). This logistic function is given in Eq. 1:

$$\begin{aligned} Y_i = \beta _2 + \frac{\beta _1 - \beta _2}{1 + e^{-(\frac{X_i - \beta _3}{|\beta _4|})}}, \end{aligned}$$
(1)

where \(X_i\) is the objective score for the i-th distorted image, and \(Y_i\) is the mapped objective score. The fitting minimizes the least-squares error between the MOS values and the mapped objective results, and has been carried out using the nlinfit function of Matlab to find the optimal \(\beta \) parameters for each objective quality metric. After fitting, the performance indexes have been computed using the mapped objective results, \(Y_i\), and the MOS values.
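For reproducibility, a possible Python counterpart of this fitting step (using scipy.optimize.curve_fit in place of Matlab’s nlinfit) is sketched below; the initial parameter guesses are our own assumption.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, b1, b2, b3, b4):
    """Eq. 1: monotonic logistic mapping from objective scores to the MOS scale."""
    return b2 + (b1 - b2) / (1.0 + np.exp(-(x - b3) / abs(b4)))

def fit_logistic(objective_scores, mos):
    """Least-squares fit of the beta parameters of Eq. 1."""
    p0 = [mos.max(), mos.min(),                     # asymptotes (initial guesses)
          np.median(objective_scores),              # inflection point
          np.std(objective_scores) + 1e-6]          # slope scale
    betas, _ = curve_fit(logistic, objective_scores, mos, p0=p0, maxfev=10000)
    return logistic(objective_scores, *betas), betas
```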

Table 2 Pearson correlation coefficient (PCC) results for each database and for aligned data
Table 3 Spearman rank-ordered correlation coefficient (SROCC) results for each database and for aligned data
Table 4 Root mean squared error (RMSE) results for each database and for aligned data (note that, in order to have comparable results, RMSE values were calculated after scaling all MOS values to the range [0, 100])
Table 5 Outlier ratio (OR) results for each database and for aligned data

The results for these performance indexes (SROCC, PCC, RMSE, and OR) have been computed for each database separately, as well as considering all the data together. They are reported in Tables 2, 3, 4 and 5. The scores for the aligned data are denoted as “Combined”, and as “Except Database #2” for the data aligned excluding Database #2, as explained in “Alignment of database MOSs”.

These results show that the performance of many fidelity metrics may significantly vary from one database to another, due to the different characteristics of the test material and of the subjective evaluation procedure. In particular, Database #2 is the most challenging for all the considered metrics, due to its more complex distortion features, as discussed in  “Alignment of Database MOSs”. Despite the variations across databases, we can observe a consistent behavior for some metrics. Photometric-MSE is the worst correlated one, for all databases. This is expected as mean square error is computed on photometric values, without any consideration of visual perception phenomena. On the other hand, HDR-VQM, HDR-VDP-2.2 Q, and PU-MSSIM are the best performing metrics, with the exception of Database #2.

When we analyze the objective metrics for each transfer function, we observe that Photometric-IFC is the best correlated and Photometric-MSE the worst in the linear domain, while Log-SSIM is the best correlated and Log-VIF the worst in the logarithmic domain. Among the objective metrics in the PU domain, PU-MSSIM and PU-SSIM display high correlation coefficients, while PU-MSE is again the worst performer. Comparing the three transfer functions, PU is the most effective, as PU-MSSIM and PU-SSIM achieve performance very close to HDR-VDP-2.2 Q and HDR-VQM. In general, metrics which are based on MSE and PSNR (PU-MSE, Log-MSE, PU-PSNR, mPSNR, etc.) yield worse results than the other metrics, whereas more advanced LDR metrics such as IFC, UQI, SSIM, and MSSIM yield much better results. We also notice that mPSNR, tPSNR-YUV, and CIE \(\Delta {}E\) 2000, which have recently been used in MPEG standardization activities, perform rather poorly in comparison to the others.

We also evaluate the significance of the differences between the considered performance indexes, as proposed in ITU-T Recommendation P.1401 (ITU 2012). The results are provided in Figs. 3 and 4 for the “Combined” and “Except Database #2” cases, respectively. The bars indicate statistical equivalence between the quality metrics. We observe that the performance of HDR-VQM on the combined database is significantly different from all the others, while PU-MSSIM, PU-VIF, and some other metrics have essentially equivalent performance across the combined databases. Although HDR-VDP-2.2 performs worse on the combined dataset than on the individual databases, it is among the three best correlated metrics, together with HDR-VQM and PU-MSSIM, when Database #2 is excluded. Interestingly, the HDR-VQM metric, which has been designed to predict video fidelity, gives excellent results also in the case of static images, and is indeed more accurate on Database #2 than HDR-VDP-2.2. Furthermore, we notice that all metrics except CIE \(\Delta {}E_{00}\) and CIE \(\Delta {}E_{00}^S\) consider only luminance values. Although CIE \(\Delta {}E_{00}\) and CIE \(\Delta {}E_{00}^S\) have been found to be among the most relevant color difference metrics in a recent study (Ortiz-Jaramillo et al. 2016), they have lower correlation scores than the luminance-only metrics. This result does not contradict Ortiz-Jaramillo et al. (2016), since that study did not consider compression artifacts, whose impact on image quality appears to be much stronger than that of color differences. Thus, our analysis confirms that luminance artifacts such as blocking play a dominant role in the formation of quality judgments, also in the case of HDR.

Fig. 3 Statistical analysis results for the correlation indices for the combined data, according to ITU-T Recommendation P.1401 (ITU 2012). Bars indicate statistical equivalence: quality metrics aligned with the same bar are statistically equivalent. For example, there is a statistically significant difference between HDR-VQM and all the other metrics considered in terms of PCC, SROCC, and RMSE

Fig. 4 Statistical analysis results for the correlation indices for the combined data excluding Database #2, according to ITU-T Recommendation P.1401 (ITU 2012). Bars indicate statistical equivalence: quality metrics aligned with the same bar are statistically equivalent. For example, HDR-VDP-2.2 Q, HDR-VQM, PU-SSIM and PU-MSSIM are statistically equivalent to each other in terms of OR

Discriminability analysis

Table 6 Results of discriminability analysis: area under the ROC curve (AUC), threshold \(\tau \) at 5% false positive rate, maximum classification accuracy. We report for comparison the fraction of correct decisions (CD) at 95% confidence level as proposed in Brill et al. (2004)

MOS values are estimated from a sample of human observers, i.e., they represent expected values of random variables (the perceived annoyance or quality). Therefore, MOS values are themselves random variables known with some uncertainty, typically represented by their confidence intervals (ITU 2012). As a result, different MOS values could correspond to the same underlying distribution of subjective scores, and two images with different MOS might indeed have the same visual quality in practice (at a given confidence level). The performance scores considered in “Statistical analysis” assume instead that MOS values are deterministically known, and that the goal of fidelity metrics is to predict them as precisely as possible, without taking into account whether two different subjective scores actually correspond to different quality. Therefore, in the following we consider another evaluation approach, which aims at assessing whether an objective fidelity metric is able to discriminate between images with significantly different subjective quality.

The intrinsic variability of MOS scores is not a completely new problem, and several approaches have been proposed in the literature to take it into account while evaluating objective metrics. Brill et al. (2004) introduced the concept of resolving power of an objective metric, which indicates the minimum difference in the output of a quality prediction algorithm such that at least \(p\%\) of viewers (where generally \(p=95\%\)) would observe a difference in quality between two images. This approach has also been standardized in ITU Recommendation J.149 (ITU 2004), and used in subsequent work (Pinson and Wolf 2008; Barkowsky 2009; Hanhart et al. 2015b; Nuutinen et al. 2016). Nevertheless, this technique has a number of disadvantages. Resolving power is computed after transforming MOS to a common scale, which requires applying a fitting function; however, the fitting problem can be ill-posed in some circumstances, yielding incorrect results. Also, the resolving power in the common scale corresponds to a variable metric resolution in the original scale, which makes it difficult to interpret. Moreover, it is not always possible to fix the level of significance p to be the same for different metrics, as there can be cases where the percentage of observers seeing a difference between image qualities is lower than p for any metric difference value. Finally, the results of this approach are generally evaluated in a qualitative manner, e.g., by considering how the number of correct decisions, false rankings, false differentiations, etc., varies as a function of the objective metric differences (Brill et al. 2004; Hanhart et al. 2015b); conversely, a compact, quantitative measure is desirable in order to fairly compare different metrics. Another approach to this problem has recently been proposed by Krasula et al. (2016), who assess the accuracy of an objective image or video quality metric by transforming the problem into a classification problem. For this purpose, they compute the z-scores of the subjective scores and the difference of the objective scores for each pair of stimuli, and then measure the accuracy of the metric by calculating classification rates.

Due to the factors above, which limit the effectiveness of resolving power, in this work we propose an alternative approach that operates in the original scale of the metric, similar to the one presented in Krasula et al. (2016), and that enables evaluating its discrimination power while avoiding the shortcomings discussed above. Despite the similarities, the implementation and data processing steps of their work and of the proposed algorithm are not the same; therefore, we detail the proposed algorithm below in order to clarify the differences.

The basic idea of the proposed method is to convert the classical regression problem of accurately predicting MOS values, into a binary classification (detection) problem (Kay 1998). We denote by S(I) and O(I) the subjective (MOS) and objective quality of stimulus I, respectively, for a certain objective quality metric. Given two stimuli \(I_i, I_j\), we model the detection problem as one of choosing between the two hypotheses \(\mathcal {H}_0\), i.e., there is no significant difference between the visual quality of \(I_i\) and \(I_j\), and \(\mathcal {H}_1\), i.e., \(I_i\) and \(I_j\) have significantly different visual quality. Formally:

$$\begin{aligned} \mathcal {H}_0&: S(I_i)\cong S(I_j); \nonumber \\ \mathcal {H}_1&: S(I_i)\,\ncong \,S(I_j), \end{aligned}$$
(2)

where we use \(\cong \) (resp. \(\ncong \)) to indicate that the means of two populations of subjective scores (i.e., two MOS values) are the same (resp. different). Given a dataset of subjective scores, it is possible to apply a pairwise statistical test (e.g., a two-sample t-test or z-test) to determine whether two MOSs are the same, at a given significance level. In our work, we employ a one-way analysis of variance (ANOVA), with Tukey’s honestly significant difference criterion to account for the multiple comparison bias (Hogg and Ledolter 1987), which is also indicated as the preferred way to assess statistical significance in Krasula et al. (2016). Figure 5a shows the results of the ANOVA on our combined database, thresholded at a confidence level of 95% (i.e., 5% significance). For convenience of visualization, MOS values have been sorted in ascending order before applying the ANOVA. White entries represent MOS pairs which are statistically indistinguishable.
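A compact sketch of this step is given below, using SciPy’s tukey_hsd (available from SciPy 1.8) as a stand-in for the ANOVA with Tukey’s HSD correction; the per-stimulus list layout of the raw scores is an assumption.

```python
import numpy as np
from scipy.stats import tukey_hsd

def equivalence_map(scores_per_stimulus, alpha=0.05):
    """Boolean matrix: True where two stimuli have statistically equivalent MOS.

    scores_per_stimulus: list of 1-D arrays, the raw opinion scores of each
    stimulus (hypothetical layout). Pairwise comparisons use Tukey's HSD
    correction, as in the equivalence maps of Fig. 5a.
    """
    res = tukey_hsd(*scores_per_stimulus)   # pairwise p-value matrix, HSD-corrected
    return res.pvalue > alpha               # True = statistically indistinguishable
```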

In order to decide between \(\mathcal {H}_0\) and \(\mathcal {H}_1\), similar to Krasula et al. (2016), we consider the simple test statistic \(\Delta ^O_{ij} = |O(I_i) - O(I_j)|\), i.e., we look at the difference between the objective scores for the two stimuli and compare it with a threshold \(\tau \), that is:

$$\begin{aligned} \text {Decide:} {\left\{ \begin{array}{ll} \mathcal {H}_0 &{} \quad \text {if } \; \Delta ^O_{ij} \le \tau \\ \mathcal {H}_1 &{} \quad \text {otherwise.} \end{array}\right. } \end{aligned}$$
(3)

For a given value of \(\tau \), we can then label each pair of stimuli as being equivalent or not, as shown in Fig. 5b. The performance of the detector in (3) depends on the choice of \(\tau \). We call true positive rate (TPR) the fraction of image pairs with different MOSs correctly classified as being of different quality, and false positive rate (FPR) the fraction of image pairs with statistically equivalent MOSs incorrectly classified as being of different quality. By varying the value of \(\tau \), we can trace a receiver operating characteristic (ROC) curve, which represents the TPR at a given value of FPR (Kay 1998). The area under the ROC curve (AUC) is higher when the overlap between the marginal distributions of \(\Delta ^O_{ij}\) under each hypothesis, that is, \(p(\Delta ^O_{ij};\mathcal {H}_0)\) and \(p(\Delta ^O_{ij};\mathcal {H}_1)\), is smaller. Therefore, the AUC is a measure of the discrimination power of an objective quality metric.
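A minimal sketch of the resulting ROC/AUC computation is shown below; it assumes the equivalence matrix produced in the previous step and uses scikit-learn only for the ROC bookkeeping.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import roc_auc_score, roc_curve

def discriminability_roc(objective_scores, same_quality):
    """ROC analysis of the detector in Eq. (3).

    objective_scores: 1-D array with O(I) for each stimulus.
    same_quality: boolean matrix from the ANOVA step (True = H0, equivalent MOS).
    Each stimulus pair is labeled and |O(I_i) - O(I_j)| is used as test statistic.
    """
    deltas, labels = [], []
    for i, j in combinations(range(len(objective_scores)), 2):
        deltas.append(abs(objective_scores[i] - objective_scores[j]))
        labels.append(0 if same_quality[i, j] else 1)   # 1 = H1, different quality
    deltas, labels = np.asarray(deltas), np.asarray(labels)
    auc = roc_auc_score(labels, deltas)                 # area under the ROC curve
    fpr, tpr, thresholds = roc_curve(labels, deltas)    # curve traced by varying tau
    return auc, fpr, tpr, thresholds, deltas, labels
```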

Fig. 5 Equivalence maps for the (sorted) combined database. White entries correspond to \(S(I_i) \cong S(I_j)\), black entries to \(S(I_i)\,\ncong \,S(I_j)\)

Table 6 reports the AUC values for the combined case and for the combination without Database #2. In addition to the area under the ROC curve, we also compute the balanced classification accuracy, which is an extension of the conventional accuracy measure to unbalanced datasets, i.e., datasets where the numbers of positive and negative samples differ (Brodersen et al. 2010):

$$\begin{aligned} Acc = \frac{TP}{2\,(TP + FN)} + \frac{TN}{2\,(TN + FP)}. \end{aligned}$$
(4)

In Table 6 we report the maximum classification accuracy, \(Acc^*=\max _{\tau } Acc\), which characterizes the global detection performance, as well as the value of the detector threshold at \(\text {FPR}=5\%\), that is,

$$\begin{aligned} \tau _{.05}=\min \{ \tau : p(\Delta ^O_{ij} > \tau ;\mathcal {H}_0) \le 0.05\}, \end{aligned}$$
(5)

which indicates the minimum value of \(\tau \) needed to keep the probability of incorrectly classifying two stimuli as being of different quality below 5%. This latter measure provides, in a sense, the resolution of an objective metric (with a 5% tolerance) in its original scale.
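Continuing the sketch above, \(\tau _{.05}\) and the maximum balanced accuracy can be obtained from the pairwise differences and labels as follows; the use of an empirical quantile for the 5% false positive constraint is an illustrative simplification.

```python
import numpy as np

def threshold_and_accuracy(deltas, labels, fpr_target=0.05):
    """tau_{.05} (Eq. 5) and maximum balanced accuracy from the pairwise data
    produced by the ROC sketch (deltas = |O(I_i)-O(I_j)|, labels = 1 under H1)."""
    h0 = deltas[labels == 0]
    h1 = deltas[labels == 1]
    # Smallest tau keeping the fraction of H0 pairs above tau (false positives) <= 5%.
    tau_05 = np.quantile(h0, 1.0 - fpr_target)
    # Sweep tau over observed deltas and keep the best balanced accuracy (Eq. 4).
    best_acc = 0.0
    for tau in np.unique(deltas):
        tpr = np.mean(h1 > tau)    # sensitivity
        tnr = np.mean(h0 <= tau)   # specificity
        best_acc = max(best_acc, 0.5 * (tpr + tnr))
    return tau_05, best_acc
```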

Fig. 6 Statistical analysis results for the discriminability analysis, according to the procedure described in Krasula et al. (2016). Bars indicate statistical equivalence: quality metrics aligned with the same bar are statistically equivalent. For example, there is no statistically significant difference among PU-UQI, Log-UQI, and Photometric-UQI, whereas there is a statistically significant difference between HDR-VQM and all the other metrics considered

The results in Table 6 are complemented by the percentage of correct decisions (CD) of Brill et al. (2004), which is to be compared with \(Acc^*\). Furthermore, we present the results of the statistical significance evaluation of the reported AUC values, according to the guidelines presented in Krasula et al. (2016). The results of this evaluation are presented in Fig. 6. They show that HDR-VQM is the best performing metric, and that PU-VIF and PU-MSSIM perform better than most of the considered metrics. Although its performance is reduced in the combined case, HDR-VDP-2.2 Q is also statistically better than the other metrics in the case excluding Database #2.

We notice that, in general, the values of CD are much lower than \(Acc^*\). This is due to the fact that the method in Brill et al. (2004) aims not only at distinguishing whether two images have the same quality, but also at determining which of the two has better quality. Thus the classification task is more difficult, as there are three classes (equivalent, better, or worse) to label. Indeed, we observe a certain coherence between our approach, that of Brill et al. (2004), and the statistical analysis in “Statistical analysis”: the best performing metrics are HDR-VQM and those based on the PU transfer function, such as PU-MSSIM, PU-VIF, and PU-SSIM. Nevertheless, our analysis provides better insight into the discrimination power of fidelity metrics than Brill et al. (2004), and gives practical guidelines on the minimal difference between the objective scores of two images needed to claim that they have different visual quality. Finally, the fact that, even for the metrics best correlated with MOSs, the maximum accuracy saturates at 0.8 suggests that there is still room for improving existing HDR objective quality measures, as long as discriminability (and not only prediction accuracy) is included in the evaluation of performance.

Conclusion

In this paper, we conduct an extensive evaluation of full-reference HDR image quality metrics. For this purpose, we collect four publicly available HDR image databases featuring compression distortions, in addition to a newly created one. In order to have consistent MOS values across all databases, we align the subjective scores using the INLSA algorithm. After the alignment, a total of 690 compressed HDR images have been evaluated using several full-reference HDR image quality assessment metrics. The performance of these fidelity metrics has been assessed from two different perspectives: on one hand, by treating quality estimation as a regression problem, using conventional statistical accuracy and monotonicity measures (De Simone 2012); on the other hand, by focusing on the ability of objective metrics to discriminate whether two stimuli have the same perceived quality.

Our analysis shows that recent metrics designed for HDR content, such as HDR-VQM and, to some extent, HDR-VDP-2.2, provide accurate predictions of MOSs, at least for compression-like distortions. We also confirm the findings of previous work (Valenzise et al. 2014; Hanhart et al. 2015a) that legacy LDR image quality metrics have good prediction and discrimination performance, provided that a proper transformation such as PU encoding is applied beforehand. This suggests that the quality assessment problem for HDR image compression is similar to the LDR case, if HDR pixels are properly preprocessed. Yet, the absolute performance figures of these metrics show that, when databases with heterogeneous characteristics are merged (Database #2 in our experiments), none of the tested metrics provides highly reliable predictions. All but two of the considered metrics are computed on the luminance channel only. Interestingly, the non color-blind metrics, CIE \(\Delta {}E_{00}\) and CIE \(\Delta {}E_{00}^S\), display poor performance in our evaluation, similar to other MSE-based metrics. While other studies report different results in terms of correlation with MOSs (Hanhart et al. 2016), we believe that a partial explanation is that, in the case of coding artifacts, the structural distortion (blocking, blur) in the luminance channel dominates the color differences captured by CIE \(\Delta {}E_{00}\) and CIE \(\Delta {}E_{00}^S\). The important question of color fidelity for HDR content, however, is still little understood and is part of our current research.

Finally, the alternative evaluation methodology proposed in this work, based on the discriminability of a metric, provides a complementary perspective on the performance of objective quality metrics. It recognizes the stochastic nature of MOSs, which are samples from a population and hence are known with some uncertainty. Accordingly, we consider the quality estimation task as one of detecting when images have significantly different quality. The relevance of this alternative point of view is demonstrated by the amount of effort spent in the last decade to go beyond classical statistical measures such as correlation, from the seminal work of Brill et al. (2004) to the very recent work of Krasula et al. (2016), developed in parallel to our study. These analyses show that, even for metrics which can accurately predict MOS values, the rate of incorrect classifications is still quite high (20% or more). This suggests that novel and better performing objective quality metrics could be designed, provided that new criteria such as discriminability are taken into account alongside the correlation indices used to assess statistical accuracy.