1 Introduction

With the rapid development of various image processing methods, considerable attention has been given to techniques that attempt to mimic human visual perception [9, 46]. Image quality assessment (IQA) techniques measure the quality of presented images and often support, or compare, image enhancement, restoration, or denoising techniques [3, 5, 29, 32]. According to the availability of a reference image, IQA measures are divided into full-reference, reduced-reference, and no-reference techniques [4, 19, 31, 47].

This paper focuses on full-reference IQA measures. Over the last decade, many different full-reference IQA approaches have been introduced [4], starting from the simple peak signal-to-noise ratio (PSNR), or the noise quality measure (NQM) [8], which models the distorted image using linear frequency distortion and additive noise injection. The popular structural similarity (SSIM) measure [41], an extension of the universal image quality index (UQI) [39], uses loss of correlation, luminance distortion, and contrast distortion. SSIM was further extended using a multi-scale approach (MSSIM) [41] or statistical models of natural scenes, as in the information content weighted SSIM (IWSSIM) [40]. Such statistical models are also utilised in the information fidelity criterion (IFC) [33] and visual information fidelity (VIF) [34]. The feature similarity index (FSIM) [51], and its version for colour images (FSIMc), employ phase congruency and image gradient magnitude. The Riesz-transform based feature similarity measure (RFSIM) [50] compares Riesz-transform features at key locations between the distorted image and its reference image. SURF-SIM [38], in turn, uses Speeded-Up Robust Features (SURF) in order to detect multi-scale differences between features. The spectral residual based similarity measure (SRSIM) [49] and the visual saliency-induced index (VSI) [52] use visual saliency maps. Contrast changes and luminance distortions are used in the gradient similarity (GSM) measure [16], and inter-patch and intra-patch similarities were modelled in [54] using a modified normalised correlation coefficient and image curvature. The edge based image quality assessment (EBIQA) measure is based on different edge features extracted from the distorted image and its pristine equivalent [1]. In [11], a novel pooling strategy based on the harmonic mean was proposed.

In the literature, there are also approaches in which several IQA techniques are combined into a hybrid, or fusion, measure. For example, in the most apparent distortion (MAD) algorithm [14], local luminance and contrast masking are used to evaluate high-quality images, while changes in the local statistics of spatial-frequency components are used for low-quality images. Information obtained using saliency maps, gradient and contrast information was fused in [30]. In [21, 22], scores of MSSIM, VIF and R-SVD were non-linearly combined. A preliminary work on the non-linear combination of several IQA measures selected by a genetic algorithm was shown in [23]. In [17], SNR, SSIM, VIF, and VSNR were combined using canonical correlation analysis, and a regularised regression was used to combine up to seven IQA models in [13]. In [25], a support vector machine classifier predicting the distortion type, followed by a fusion of SSIM, VSNR, and VIF using k-nearest-neighbour regression, was proposed. An adaptive combination of IQA approaches with an edge-quality measure based on the preservation of edge direction was introduced in [26]. In [18], a fusion measure using a support vector regression approach was proposed. Lukin et al. [20] introduced a fusion measure which combines six IQA measures using a neural network. In [48], kernel ridge regression was used to combine detected perceptually meaningful structures and local distortion measurements. In other approaches, adaptive weighting [2] or an internal generative mechanism [43] were considered in order to obtain hybrid measures.

For the evaluation of IQA approaches, specific IQA benchmark databases have been introduced [14, 27, 28, 42]. They contain pristine images, their distorted equivalents, and subjective human evaluations in the form of mean opinion scores (MOS) or differential MOS (DMOS). Some images from these benchmarks, with subjective scores, are often used for tuning the parameters of many developed methods, e.g., [13, 20, 44, 52, 54]. Here, the number of images used should be small in order to obtain a benchmark-independent solution. In this paper, a novel full-reference hybrid IQA measure is proposed which employs regularised least-squares regression using the least absolute shrinkage and selection operator (lasso) [36, 37]. This technique combines objective scores produced by up to 16 full-reference IQA measures. The lasso regression was applied since it performs selection of the most important predictors, which makes the combined measure more practical; finally, only several IQA measures take part in the fusion. In the proposed approach, the regression coefficients are determined using a part of the images and their scores from benchmark databases. It is shown that the proposed hybrid measure is significantly better if pairwise score differences (PSD) are used instead of raw scores. These differences can also be used for the performance evaluation of IQA measures. The application of PSD is motivated by the organisation of some IQA tests with human subjects [28], in which the observer compares distorted images with each other, taking into account the pristine image. It can be assumed that PSD can also be used in the development of other IQA measures that require supervised learning. The hybrid measures developed using raw scores or PSD in the lasso regression are compared with state-of-the-art techniques on the four largest IQA benchmark image datasets using a well-established evaluation protocol, as well as statistical significance tests.

The rest of this paper is organised as follows. Section 2 presents the proposed hybrid IQA measure. In Section 3, the approach is compared with state-of-the-art measures using four IQA benchmarks, and, finally, concluding remarks are presented in Section 4.

2 Proposed approach

Let $Q_{1}, \dots, Q_{M}$ be the objective scores of $M$ IQA measures, seen as predictor variables in a multiple linear regression model [36]. In the model, $\boldsymbol{S^{o}}$ is the estimated response, or objective score, of the resulting hybrid IQA measure. It can be written as follows:

$$ \boldsymbol{S^{o}}= B_{0} + \sum\limits_{m=1}^{M} Q_{m}B_{m} + \epsilon, $$
(1)

where $\boldsymbol{B}$ contains fitted coefficients estimated by minimising the mean squared difference between the outcome, i.e., the vector of subjective scores $\boldsymbol{S^{s}}$, and the predicted outcome, $\boldsymbol{S^{o}}$; $\epsilon$ is the error term representing the part of the relationship between $\boldsymbol{Q}$ and $\boldsymbol{S^{s}}$ that is not captured by the model.

For a large number of predictors, it is desirable to select those which are the most informative. This also leads to a more practical hybrid IQA approach, consisting of only several IQA measures. One possible approach to the problem of predictor selection is to use a penalised regression in the lasso form [36]. In the regression, for a given λ, the lasso determines $\boldsymbol{B}$ by solving the following optimisation problem:

$$ \min_{\boldsymbol{B}} \left( \frac{1}{2}\sum\limits_{n=1}^{N} \left({S^{s}_{n}}-B_{0} - \sum\limits_{m=1}^{M} Q_{nm}B_{m}\right)^{2} + \lambda \sum\limits_{m=1}^{M} | B_{m} | \right), $$
(2)

where N is the number of training samples, and λ is a regularisation parameter. In other words, the lasso minimises the residual sum of squares subject to the constraint with the bound α:

$$ \sum\limits_{m=1}^{M} | B_{m} | \leq \alpha. $$
(3)

In the proposed approach, the λ value which minimises the mean squared error was used to determine the coefficients.
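The paper uses the lasso implementation from Matlab's Statistics Toolbox; as an illustration only, the optimisation in (2) can be sketched in Python with a minimal coordinate-descent solver. The function names and the fixed λ below are illustrative assumptions, not part of the original implementation, in which λ is chosen to minimise the mean squared error.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used in lasso coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(Q, s, lam, n_iter=200):
    """Minimise 0.5*||s - B0 - Q@B||^2 + lam*||B||_1 by coordinate descent.

    Q : (N, M) matrix of objective scores of M IQA measures,
    s : (N,) vector of subjective scores.
    Returns the intercept B0 and coefficient vector B; zero entries in B
    correspond to IQA measures excluded from the hybrid measure.
    """
    N, M = Q.shape
    B = np.zeros(M)
    B0 = s.mean()
    col_sq = (Q ** 2).sum(axis=0)
    for _ in range(n_iter):
        B0 = (s - Q @ B).mean()          # intercept is not penalised
        for m in range(M):
            # partial residual excluding the m-th predictor
            r_m = s - B0 - Q @ B + Q[:, m] * B[m]
            B[m] = soft_threshold(Q[:, m] @ r_m, lam) / col_sq[m]
    return B0, B
```

The soft-thresholding step is what drives uninformative coefficients exactly to zero, which is the property exploited here to select only several IQA measures for the fusion.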

In the experiments, the following M=16 publicly available full-reference IQA measures were used: VSI [52], FSIM [51], FSIMc [51], GSM [16], IFC [33], IW-SSIM [40], MAD [14], MSSIM [41], NQM [8], PSNR [35], RFSIM [50], SR-SIM [49], SSIM [42], VIF [34], IFS [7], and SFF [6]. They were used for the assessment of processed images, and then PSD were obtained. Most of these approaches offer state-of-the-art performance, and their inclusion was mainly influenced by the need to achieve a broad sample of various approaches mimicking the human visual system. It is assumed that the lasso regression is able to select several IQA measures and develop a well-performing hybrid measure.

The proposed approach uses the first 20 % of images and their subjective scores from a given benchmark dataset in order to obtain the regression coefficients. Since the four largest IQA benchmark image datasets are used, four hybrid measures are introduced. In the literature, different numbers of images with scores have been used for this purpose, ranging from 20 % [38], through 30 % [52], 50 % [48], and 100 % [13, 26, 44], to several datasets jointly [54].

In the experiments, the following four largest IQA benchmarks were used: TID2013 [28], TID2008 [27], CSIQ [14], and LIVE [42]. The number of images in each benchmark, as well as the number of distortions and their levels, are shown in Table 1. Since the number of learning images in the subset is small, the number of scores used in the regression can be considerably increased by employing PSD. To the best of the author's knowledge, PSD have not been used for training IQA measures. In this paper, the lasso regression produces the hybrid IQA measure trained with the small subset of images and scores obtained for M=16 IQA measures, as well as trained with pairwise differences of these scores. The obtained fitted coefficients, B, indicate the number and contribution of the most informative IQA measures. Only these measures were used in the quality assessment of the test images. For a selected reference image, all score differences between its distorted equivalents are calculated. For example, for 5 reference images with 24 distortions and 5 distortion levels, 600 images and scores are available in the typical learning scenario, or, as introduced in this paper, \({\sum }_{k=1}^{5} \displaystyle {120 \choose 2} = 35700\) pairwise score differences for these images can be used. It is assumed that only scores of distorted images that have the same reference image are compared. The usage of PSD can also be motivated by the tristimulus methodology for performing tests with human observers [28], in which two distorted images are presented with their pristine equivalent at the same time. Then, the observer selects the distorted image with the better quality, which requires evaluating each distorted image separately, looking at the pristine image, and jointly, while making the decision on their relative quality. Such pairwise image comparison is used to determine subjective opinions for assessed images [28].
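The construction of PSD described above can be sketched as follows. This is a minimal illustration assuming that scores are stored per distorted image together with the index of its reference image; the function name and data layout are assumptions for illustration.

```python
from itertools import combinations
import numpy as np

def pairwise_score_differences(scores, ref_ids):
    """Build pairwise score differences (PSD) from per-image scores.

    scores  : (N, K) array, one row per distorted image (K = number of
              IQA measures, or 1 for subjective scores).
    ref_ids : length-N sequence; only images sharing the same reference
              image are paired, as assumed in the paper.
    Returns an array of row differences, one per unordered pair.
    """
    scores = np.asarray(scores, dtype=float)
    ref_ids = np.asarray(ref_ids)
    diffs = []
    for ref in np.unique(ref_ids):
        idx = np.flatnonzero(ref_ids == ref)
        for i, j in combinations(idx, 2):
            diffs.append(scores[i] - scores[j])
    return np.array(diffs)
```

For TID2013, with 25 reference images and 120 distorted versions each, this yields 25 · 7140 = 178500 rows, matching the PSD count reported for that benchmark in Section 3.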

Table 1 IQA benchmark image datasets

In the experiments, images from a given benchmark dataset were divided into five disjoint subsets, each containing 20 % of all images, and each image was evaluated by 16 IQA measures. Finally, after the application of the proposed approach, 40 hybrid IQA measures were obtained; half of them were trained on PSD. For convenience of presentation, only the measures, named lasso regression SImilarity Measures (lrSIMs), obtained for the first 20 % of benchmark images are written as follows:

$$\begin{array}{@{}rcl@{}} lrSIM_{1}^{1a}&=& 10.214 VSI -1.5221 MAD -0.5705 PSNR \\ &&+0.7827 RFSIM +0.5723 VIF +1.9253 IFS \end{array} $$
(4)
$$\begin{array}{@{}rcl@{}} lrSIM_{1}^{2a}&=& 8.2432 VSI -2.9136 MAD -1.0000 PSNR \\ && + 1.0432 VIF+1.8354 IFS \end{array} $$
(5)
$$\begin{array}{@{}rcl@{}} lrSIM_{2}^{1a}&=& 0.5107 VSI -1.5079 MAD + 0.5439 PSNR \\ &&+1.1451 RFSIM +0.3124 SRSIM +1.0850 VIF \\ && +0.6202 IFS + 5.7429 SFF \end{array} $$
(6)
$$\begin{array}{@{}rcl@{}} lrSIM_{2}^{2a}&=& -2.5348 MAD + 0.6056 RFSIM +1.6761 SRSIM \\ &&+ 1.3234 VIF +0.8086 IFS +3.8507 SFF \end{array} $$
(7)
$$\begin{array}{@{}rcl@{}} lrSIM_{3}^{1a}&=& 0.3887 MAD -0.1408 RFSIM -0.1969 VIF \end{array} $$
(8)
$$\begin{array}{@{}rcl@{}} lrSIM_{3}^{2a}&=& 0.5193 MAD -0.2754 VIF -0.0543 IFS \end{array} $$
(9)
$$\begin{array}{@{}rcl@{}} lrSIM_{4}^{1a}&=& 14.913 IFC + 72.26 MAD + 1.5549 NQM\\ &&+2.5175 PSNR + 20.989 SRSIM \\ && -36.315 SSIM -43.421 VIF \end{array} $$
(10)
$$\begin{array}{@{}rcl@{}} lrSIM_{4}^{2a}&=& 11.906 GSM + 6.8190 IWSSIM+ 71.034 MAD \\ &&+6.0730 MSSIM -38.154 VIF -15.709 IFS \end{array} $$
(11)

In (4)–(11), the number in the subscript denotes the benchmark whose images were partly used for the development of the measure: 1 for TID2013; 2, 3, 4 for TID2008, CSIQ, and LIVE, respectively. The number in the superscript indicates whether the measure was developed using raw scores (”1”) or PSD (”2”), and the letter in the superscript denotes which subset of training images was used (five letters: a–e). In the evaluation (see Section 3), results for the a subset, or for all subsets together in the form of the mean value, are reported. Taking into account all obtained hybrid measures, one hybrid measure uses 5.725 single IQA measures, on average. Among the most frequently used IQA measures, VIF was selected 40 times, MAD 39, IFS 29, FSIM 25, RFSIM 22, PSNR 20, SFF 19, and VSI 17. The remaining IQA measures were used fewer than 15 times each. Interestingly, FSIMc was not used at all, and NQM, SSIM, and MSSIM were used fewer than five times, which may indicate that their features were covered by the remaining IQA measures. Some measures contributed more than others, as reflected by the weights. For example, in hybrid measures (4)–(5) VSI was the most contributing technique, in (6)–(7) MAD with SFF contributed more than the other techniques, and in (8)–(11) MAD with VIF. The sign of a weight mostly depends on the sign of the correlation between the objective scores produced by the measure and the subjective scores in the benchmark. Experiments were performed using Matlab 7.14 with the Statistics Toolbox.
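Once the coefficients are fitted, applying a hybrid measure reduces to a weighted sum of the selected single-measure scores. The following sketch evaluates lrSIM\(_{1}^{2a}\) from (5); the input scores are made-up placeholder values, since in practice they come from running the five selected IQA measures on an image pair.

```python
# Coefficients of lrSIM_1^{2a} from (5); keys are the IQA measures
# selected by the lasso for this hybrid measure.
LRSIM_1_2A = {"VSI": 8.2432, "MAD": -2.9136, "PSNR": -1.0000,
              "VIF": 1.0432, "IFS": 1.8354}

def hybrid_score(measure_scores, coefficients):
    """Weighted sum of single-measure scores forming the hybrid score."""
    return sum(coefficients[name] * measure_scores[name]
               for name in coefficients)

# Illustrative (made-up) per-image scores; real values would be produced
# by the five selected IQA measures.
scores = {"VSI": 0.98, "MAD": 0.12, "PSNR": 0.31, "VIF": 0.85, "IFS": 0.95}
q = hybrid_score(scores, LRSIM_1_2A)
```

Only the measures with non-zero coefficients need to be computed at test time, which is the practical benefit of the lasso selection.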

3 Experimental evaluation

According to the widely-used protocol [10, 35], IQA measures are compared with each other using the following performance indices: Spearman Rank order Correlation Coefficient (SRCC), Kendall Rank order Correlation Coefficient (KRCC), Pearson linear Correlation Coefficient (PCC) and Root Mean Square Error (RMSE). The prediction monotonicity is evaluated by SRCC and KRCC, and the prediction accuracy by PCC and RMSE. These performance indices are calculated after a non-linear mapping between a vector of objective scores, S o, and subjective scores, MOS or DMOS, denoted here by S s, using the following mapping function for the non-linear regression [35]:

$$ {{S_{m}^{o}}}= \beta_{1}\left(\frac{1}{2}-\frac{1}{1+\exp(\beta_{2}(S^{o}-\beta_{3}))}\right)+\beta_{4} S^{o}+\beta_{5}, $$
(12)

where $\boldsymbol{\beta}=[\beta_{1}, \beta_{2}, \dots, \beta_{5}]$ are parameters of the non-linear regression model [35], and \(\boldsymbol {{{S_{m}^{o}}}}\) is the non-linearly mapped $\boldsymbol{S^{o}}$. PCC and RMSE use \(\boldsymbol {{{S_{m}^{o}}}}\):

$$ {PCC}(\boldsymbol{{S_{m}^{o}}},\boldsymbol{S^{s}}) = \frac{\bar{\boldsymbol{{S_{m}^{o}}}}^{T}\bar{\boldsymbol{S^{s}}}}{\sqrt{\bar{\boldsymbol{{S_{m}^{o}}}}^{T}\bar{\boldsymbol{{S_{m}^{o}}}}\bar{\boldsymbol{S^{s}}}^{T}\bar{\boldsymbol{S^{s}}} }}, $$
(13)

where \(\bar {\boldsymbol {{S_{m}^{o}}}}\) and \(\bar {\boldsymbol {S^{s}}}\) denote the mean-removed vectors.

$$ {RMSE}(\boldsymbol{{S_{m}^{o}}},\boldsymbol{S^{s}}) = \sqrt{ \frac{(\boldsymbol{{S_{m}^{o}}} - \boldsymbol{S^{s}})^{T} (\boldsymbol{{S_{m}^{o}}} - \boldsymbol{S^{s}}) }{m}}. $$
(14)

SRCC is calculated as:

$$ {SRCC}(\boldsymbol{S^{o}},\boldsymbol{S^{s}}) =1- \frac{6{\sum}_{i=1}^{m} {d_{i}^{2}} }{m(m^{2}-1)}, $$
(15)

where $d_{i}$ is the difference between the ranks of the $i$th image in $\boldsymbol{S^{o}}$ and $\boldsymbol{S^{s}}$, and $m$ is the total number of images. In KRCC, the number of concordant pairs in the dataset, $m_{c}$, is used, as well as the number of discordant pairs, $m_{d}$:

$$ {KRCC}(\boldsymbol{S^{o}},\boldsymbol{S^{s}}) = \frac{m_{c}-m_{d} }{0.5m(m-1)}. $$
(16)

The values of RMSE closer to 0 are considered better, in contrast to SRCC, KRCC, and PCC, whose values should be close to 1.
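The four performance indices can be sketched in Python with numpy. This is a minimal illustration assuming untied scores; in the full protocol, the non-linear mapping (12) is applied to the objective scores before computing PCC and RMSE, which is omitted here.

```python
import numpy as np

def ranks(x):
    """Ordinal ranks (1-based); sufficient when the scores are untied."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

def srcc(so, ss):
    """Spearman rank order correlation, as in (15)."""
    d = ranks(so) - ranks(ss)
    m = len(so)
    return 1 - 6 * (d ** 2).sum() / (m * (m ** 2 - 1))

def krcc(so, ss):
    """Kendall rank order correlation, as in (16)."""
    m = len(so)
    mc = md = 0
    for i in range(m):
        for j in range(i + 1, m):
            sgn = np.sign((so[i] - so[j]) * (ss[i] - ss[j]))
            mc += sgn > 0   # concordant pair
            md += sgn < 0   # discordant pair
    return (mc - md) / (0.5 * m * (m - 1))

def pcc(so, ss):
    """Pearson linear correlation on mean-removed vectors, as in (13)."""
    a, b = so - so.mean(), ss - ss.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

def rmse(so, ss):
    """Root mean square error, as in (14)."""
    return np.sqrt(((so - ss) ** 2).mean())
```

In practice, library routines such as scipy.stats.spearmanr and scipy.stats.kendalltau would be preferred, since they also handle tied scores.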

Table 2 contains mean values of the evaluation indices for all developed hybrid measures on four benchmarks. There are 178500 values of PSD for TID2013, and 56950, 12068, and 10081 for TID2008, CSIQ, and LIVE, respectively. It can be seen that, within a benchmark, mean values for hybrid measures trained with PSD, indicated with ”2” in the superscript (e.g., lrSIM\(_{1\textit {-}4}^{2}\)), were in almost all cases better than those for hybrid measures learned using images from the benchmark and their raw scores, lrSIM\(_{1\textit {-}4}^{1}\). This indicates that PSD carry more information than raw scores.

Table 2 Comparison of mean values of RMSE for developed hybrid measures on four IQA benchmark datasets using raw scores or PSD

Extension of the typical testing protocol with results obtained using PSD may lead to more quantitative conclusions about the performance of the evaluated IQA measures. Therefore, the proposed hybrid measures, trained using PSD and represented by lrSIM\(_{1-4}^{2a}\), are compared with state-of-the-art IQA measures using four performance indices calculated with raw scores, and with PSD. The results of the comparison are shown in Tables 3 and 4. The overall results for RMSE do not take the LIVE dataset into account due to the difference in score range, and weighted results were obtained using the number of images in a benchmark as its weight. The tables contain the six best IQA measures that were considered in the regression, out of 16, and lrSIM\(_{1-4}^{2a}\); the four best performing measures for each evaluation index are written in bold. The obtained results reveal that all presented lrSIMs clearly outperformed the compared IQA measures. For TID2013, where VSI was the best performing non-hybrid measure, hybrid measures trained on images from benchmarks with considerably fewer distortion types than TID2013, i.e., on CSIQ and LIVE, performed worse than measures trained on the TID benchmarks. IQA measures selected in models trained on CSIQ and LIVE, which share most of their distortion types, perform worse on the newer TID benchmarks, which can explain the worse results for lrSIM\(_{3}^{2a}\) and lrSIM\(_{4}^{2a}\). However, taking into account overall performance, these results are still better than those of the other non-hybrid IQA measures. Weighted results seem to favour IQA approaches performing better on TID2013 due to the large number of images used as its weight. However, weighted results for tests with PSD show the superior performance of all introduced hybrid measures. Interestingly, the evaluation results on benchmarks with PSD, seen in Table 4, allow further assessment of the performance of the compared non-hybrid IQA measures.
There are cases in which some measures were better in this test than in the typical evaluation with raw scores. For example, for TID2008 with PSD, SFF was better than IFS and VSI, while in the previous evaluation their precedence was reversed, with VSI as the leading technique. Interestingly, SFF was introduced before IFS by the same authors. Furthermore, for the results on CSIQ with PSD, MAD clearly outperformed the newly introduced SFF and IFS, which was not evident in the typical evaluation. MAD’s performance was also confirmed in tests on LIVE with PSD, where it was the fourth best IQA measure.

Table 3 Comparison of hybrid measures with the six best state-of-the-art IQA measures on four benchmark datasets
Table 4 Comparison of hybrid measures with the six best state-of-the-art IQA measures on four benchmark datasets

The evaluation results on benchmark datasets showed the superior performance of the introduced family of hybrid IQA measures, lrSIMs. However, it would be desirable to show that the approach is statistically significantly better. In statistical significance tests, hypothesis tests based on the prediction residuals of each measure after the non-linear mapping were conducted using the F-test [14], where a smaller residual variance denotes a better prediction. The results of the statistical significance tests on the LIVE benchmark are presented in Fig. 1. The tests cover all 16 IQA measures that were considered in the regression. In the figure, the number ”1”, ”0” or ”-1” in the cell denotes that the measure in the row is statistically better with the confidence greater than 95 %, indistinguishable, or worse than the measure in the column, respectively. The test revealed that lrSIMs are statistically better than other IQA measures, and, in many cases, hybrid measures trained with PSD are better than hybrid measures trained with raw scores. Figure 2 presents a summary of significance tests covering all benchmarks, including tests with PSD. For each benchmark, the numbers in cells were added. Since there are eight benchmarks, the number in the cell denotes the number of benchmarks in which the measure in the row is significantly better, or worse in the case of a negative value, than the measure in the column. Taking into account the summary of significance tests, it can be seen that lrSIM\(_{2}^{2a}\) is the best performing measure, with non-negative values in cells. It is worth noticing that lrSIM\(_{2}^{2a}\) is worse than lrSIM\(_{1}^{2a}\) if only significance tests with non-hybrid measures are taken into account. All lrSIMs have non-negative values in cells shared with non-hybrid measures, where they are in rows, and hybrid IQA measures developed with PSD, lrSIM\(_{1-4}^{2a}\), are at the top of the ranking. Among non-hybrid approaches, SFF is the leading IQA measure, followed by FSIMc, MAD and VSI.
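The F-test above compares residual variances; a minimal numpy sketch of the test statistic is given below. The function name is an assumption, and a full implementation would compare the ratio against the critical value of the F distribution at the 95 % confidence level (e.g., via scipy.stats.f.ppf), which is omitted here to keep the sketch self-contained.

```python
import numpy as np

def residual_variance_ratio(res_a, res_b):
    """F statistic: ratio of residual variances of two IQA measures.

    res_a, res_b are prediction residuals (after the non-linear mapping)
    of measures A and B on the same benchmark. A ratio well below 1
    suggests that A predicts better than B; the decision at 95 %
    confidence compares the ratio against the F distribution with
    (len(res_a) - 1, len(res_b) - 1) degrees of freedom.
    """
    var_a = np.var(res_a, ddof=1)  # unbiased sample variance
    var_b = np.var(res_b, ddof=1)
    return var_a / var_b
```

This is the same variance-ratio logic that underlies the ”1”/”0”/”-1” entries in Figs. 1 and 2, with the threshold supplied by the F distribution.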

Fig. 1
figure 1

Significance tests on LIVE benchmark using raw scores (a) and PSD (b). The number ”1”, ”0” or ”-1” in the cell denotes that the measure in the row is statistically better with the confidence greater than 95 %, indistinguishable, or worse than the measure in the column, respectively. It can be seen that lrSIMs are statistically better than state-of-the-art IQA measures on this dataset

Fig. 2
figure 2

Summary of significance tests on four benchmarks, including tests with PSD (eight benchmarks in total). For each test, the number ”1”, ”0” or ”-1” in the cell denotes that the measure in the row is statistically better with the confidence greater than 95 %, indistinguishable, or worse than the measure in the column, respectively. The values in cells for eight tests were added in order to form this summary. The higher value in the cell indicates the better IQA measure in the row

The experimental evaluation showed that it is worth using PSD in the training of the proposed hybrid IQA measure family. This can also be seen in the scatter plots of subjective opinion scores against objective scores of the two best IQA measures and lrSIM\(_{1}^{2a}\) on four benchmarks (see Fig. 3). Here, lrSIM\(_{1}^{2a}\) is better correlated with subjective scores than the compared measures.

Fig. 3
figure 3

Scatter plots of subjective opinion scores against objective scores of the two best IQA measures and lrSIM\(_{1}^{2a}\) on four benchmarks. Plots also contain curves fitted with logistic functions, names of benchmark datasets (vertical axis) and IQA measures (horizontal axis). Colours represent different distortions; each dataset has its own set of colours

Since a hybrid approach is presented in this paper, it should be compared with similar approaches from the literature. Therefore, Table 5 contains a comparison with such approaches on the basis of published SRCC values. This also gives the opportunity to compare the results with non-hybrid IQA measures that can be found in the literature. The table contains results for the TID2008, CSIQ and LIVE benchmarks, since most of the compared measures were not evaluated on TID2013. Here, the best three results for a given benchmark are written in boldface; results not reported are denoted by ”-”. Furthermore, ”-” also denotes overall results for IQA measures that were not evaluated on all three benchmarks or are not independent, i.e., whose authors developed a separate measure for each benchmark without providing cross-benchmark results ([13, 18, 20, 22, 25, 48]).

Table 5 Comparison of the approach with other IQA approaches and hybrid measures based on SRCC values reported in the literature

The results of the comparison based on SRCC are presented in Table 5. They reveal that lrSIMs outperformed the other measures, being in most cases among the three best IQA techniques. For TID2008, SM-HM-FSIM [11] was the second best technique. However, SM-HM-FSIM is a non-hybrid approach, which makes all lrSIMs better than the other compared hybrid approaches. The presented lrSIM family, together with the approach introduced by Barri et al. [2], outperformed the other techniques on CSIQ. Overall results, as well as tests on LIVE, showed the superior performance of lrSIMs over the other measures. More specifically, lrSIM\(_{1}^{2a}\) and lrSIM\(_{2}^{2a}\) were clearly better than the other measures, as indicated in the previous experiments.

4 Conclusions

In this paper, a hybrid full-reference IQA measure was introduced. The measure was obtained using the lasso regression with pairwise score differences of up to 16 IQA measures seen as predictors. The lasso was able to select the several most important IQA measures. This resulted in a family of hybrid measures, lrSIMs, which was extensively evaluated on the four largest IQA image benchmarks employing SRCC, KRCC, PCC, and RMSE. The evaluation was also based on PSD. The introduced approach outperformed widely used full-reference IQA measures, as well as other hybrid techniques. It can be assumed that the usage of PSD will support the development of other IQA measures based on supervised learning.

The Matlab code of the approach that allows adding other IQA measures, scripts performing pairwise score differences on used benchmarks, as well as the evaluation of compared approaches, are available to download at: http://marosz.kia.prz.edu.pl/lrSIMpsd.html.