
1 Introduction

The development of image and video processing technologies and the exponential increase of new multimedia services raise the critical issue of assessing visual quality. For several years, a number of investigations have been conducted to design robust Image Quality Assessment (IQA) metrics. Such metrics aim at predicting image quality that correlates well with Mean Opinion Scores (MOS). No-Reference IQA (NR-IQA) metrics are of particular interest as they assume no knowledge of the reference image and can be embedded in practical, real-time applications. Three approaches may be used in the design of IQA algorithms. The first one seeks to mimic the behavior of the Human Visual System (HVS). The HVS models used in this context include relevant properties such as the contrast sensitivity function, masking effects and detection mechanisms. A number of investigations [1] have shown that these models, when included in IQA algorithms, improve their performance. The second approach is well suited for assessing the quality of images affected by known distortions. Its algorithms quantify one or more distortions such as blockiness [20, 27], blur [2, 22] or ringing [9, 10] and score the image accordingly. The third and last approach is a general-purpose method. It considers that the HVS is very sensitive to structural information in the image, so that any loss of structural information results in a perceptual loss of quality. To quantify this loss, the approach relies on Natural Scene Statistics (NSS). NSS-based algorithms generally combine a learning-based approach with NSS-based extracted features. When a large ground truth is available, statistical modeling algorithms can achieve good performance. However, further effort is still required to reach the subjective consistency of the HVS.

The work proposed in this paper is motivated by the promising results obtained when visual attention models are used in IQA. Computational visual saliency models extract regions that attract the human gaze; these regions are of great interest in IQA. This paper presents a new NR-IQA metric that uses saliency maps to better weight the extracted distortions and combines these weighted distortions using a MultiVariate Gaussian Distribution (MVGD).

Fig. 1. Overall synopsis of the proposed multi-scale approach

2 The Proposed Approach

Figure 1 presents the overall synopsis of the proposed multi-scale approach, namely the SABIQ (SAliency-based Blind Image Quality) index. First, a multi-scale decomposition is performed on the input image and a saliency map is computed at each level. The base level corresponds to the original image, while the remaining ones are obtained by low-pass filtering followed by sub-sampling. Secondly, different distortion maps are generated at the same scale levels. At each level, the Renyi entropy of the sub-sampled image is also computed. Thirdly, for each level, each computed distortion map is weighted by the corresponding saliency map in order to emphasize degradations in visually attractive areas. Finally, the weighted distortion maps at each level are combined with the computed Renyi entropy to design a multiresolution distortion map. The final stage of the pipeline is a simple Bayesian model that predicts the quality score of the input image. The Bayesian approach maximizes the probability that the image has a certain quality score given the features extracted from the image. The associated posterior probability is modeled as a MultiVariate Gaussian Distribution (MVGD).
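As a rough illustration of this front end (not the authors' implementation), the sketch below builds the multi-scale pyramid by Gaussian low-pass filtering and dyadic sub-sampling and computes a histogram-based Renyi entropy; the filter choice, the entropy order and the function names are assumptions.

```python
# Minimal sketch of the multi-scale front end, assuming a Gaussian low-pass
# filter and a histogram-based Renyi entropy (illustrative choices only).
import numpy as np
from scipy.ndimage import gaussian_filter

def build_pyramid(image, n_levels=3, sigma=1.0):
    """Base level is the input image; each next level is low-pass filtered
    then sub-sampled by a factor of two."""
    levels = [image.astype(np.float64)]
    for _ in range(n_levels - 1):
        smoothed = gaussian_filter(levels[-1], sigma=sigma)
        levels.append(smoothed[::2, ::2])
    return levels

def renyi_entropy(values, alpha=2.0, n_bins=256):
    """Renyi entropy of order alpha computed from a histogram of the values."""
    hist, _ = np.histogram(values.ravel(), bins=n_bins)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)
```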

2.1 Visual Saliency Map

Visual attention is the ability of the HVS to rapidly direct our gaze towards regions of interest in our visual environment. Two attentional mechanisms are involved in this selection: bottom-up and top-down. The main features known to influence bottom-up attention include color, orientation and motion. Top-down attention is rather driven by the observer's experience, task and expectations. Many investigations have helped in understanding visual attention, and many computational saliency models have been proposed in the literature [12, 13]. A recent review of the state of the art in visual attention is given in [5]. Most of these models follow a bottom-up approach and are based on the Feature Integration Theory of Treisman and Gelade [28]. They compute a 2D map that highlights locations where fixations are likely to occur. These image-based (stimulus-driven) models share the same architecture but vary in the selection of characteristics used to compute the global saliency map.

Saliency models have been applied in various domains, including computer vision [21], robotics [6] and visual signal processing [7, 29]. In the context of IQA algorithms, saliency models are intended to extract the most relevant visual features that, when combined, produce a quality score highly correlated with human judgment [3].

Much research has been devoted to modeling the phenomenon whereby a human viewer focuses on attractive points at first glance, and many saliency models have been proposed in the literature.

Saliency models can be categorized into (1) pixel-based models and (2) object-based models. The pixel-based models aim to highlight pixel locations where fixations are likely to occur. The object-based models focus on detecting salient objects in a visual scene. The majority of saliency models in the literature are pixel-based saliency models, such as ITTI [11], STB [30], PQFT [8], etc.

In this paper, the ITTI model [12] is employed. This model combines multiscale image features into a single topographical saliency map. Three channels (intensity, color and orientation) are used as low-level features. First, feature maps are calculated for each channel via center-surround difference operations. Three kinds of conspicuity maps are then obtained by across-scale combination. The final saliency map is built by combining all the conspicuity maps.
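For intuition only, the following heavily simplified sketch illustrates the center-surround principle on the intensity channel alone; the full ITTI model additionally uses color and orientation channels, a dedicated normalization operator and across-scale combination, none of which are reproduced here. The pyramid depth and scale pairs are illustrative assumptions.

```python
# Simplified, intensity-only illustration of center-surround saliency.
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.transform import resize

def intensity_saliency(gray, levels=6, center_scales=(1, 2), deltas=(2, 3)):
    """Center-surround differences on an intensity Gaussian pyramid,
    summed across scale pairs and normalized to [0, 1]."""
    pyr = [gray.astype(np.float64)]
    for _ in range(levels - 1):
        pyr.append(gaussian_filter(pyr[-1], 1.0)[::2, ::2])
    target_shape = pyr[0].shape
    sal = np.zeros(target_shape)
    for c in center_scales:
        for d in deltas:
            center = resize(pyr[c], target_shape, anti_aliasing=False)
            surround = resize(pyr[c + d], target_shape, anti_aliasing=False)
            sal += np.abs(center - surround)
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)
```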

2.2 Distortion Maps

Many studies have shown that image quality degradations are well measured by features of local structure [31], contrast [31, 32], and multi-scale, multi-orientation decompositions [34].

Contrast Distortion Map. The image gradient is an interesting descriptor to capture both local image structure and local contrast [33]. According to the same study, the partial derivatives and gradient magnitudes vary with the strength of the applied distortions.

Following this strategy, and in order to generate the contrast distortion map, we compute the horizontal and vertical gradient component images \(\partial {I}/\partial {x}\) and \(\partial {I}/\partial {y}\) from the image I. From these two gradient images, the gradient magnitude image is computed as \(\sqrt{(\partial {I}/\partial {x})^2 + (\partial {I}/\partial {y})^2}\) and then modeled by a Weibull distribution. This distribution fits the gradient magnitude of natural images well [25], and its two parameters (the scale parameter and the shape parameter) roughly approximate the local contrast and the texture activity of the gradient magnitude map, respectively. Larger values of the scale parameter imply greater local contrast.

However, instead of computing the contrast on the entire image, the image is first partitioned into equally sized \(n\times n\) blocks (referred to as local image patches), and the local contrast is then computed for each block, finally yielding a local contrast map \(\mathcal {M}_C\).
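A possible sketch of this computation is given below, assuming square \(n\times n\) blocks and scipy's Weibull fit; the block size and the small epsilon added to keep magnitudes strictly positive are illustrative choices, not the authors' settings.

```python
# Sketch of the local contrast map: gradient magnitude, then a per-block
# Weibull fit whose scale parameter approximates local contrast.
import numpy as np
from scipy.stats import weibull_min

def contrast_map(image, n=16):
    gy, gx = np.gradient(image.astype(np.float64))
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    rows, cols = grad_mag.shape[0] // n, grad_mag.shape[1] // n
    m_c = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            block = grad_mag[r * n:(r + 1) * n, c * n:(c + 1) * n].ravel()
            # Fit returns (shape, loc, scale); loc is fixed to 0 and the
            # scale parameter is kept as the local contrast estimate.
            _, _, scale = weibull_min.fit(block + 1e-6, floc=0)
            m_c[r, c] = scale
    return m_c
```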

Structural Distortion Map. The structural distortion map considered here uses structural distortion features extracted from both spatial and frequency information. To extract image structure information in the frequency domain, the image is partitioned into equally sized \(n\times n\) local image patches and a 2D-DCT (Discrete Cosine Transform) is applied to each patch. The feature extraction is thus performed locally in the spatio-frequency domain, in accordance with the local spatial visual processing property of the HVS [4]. To capture degradations depending on directional information in the image, the block DCT coefficients are modeled along three orientations (0, 45 and 90\(^\circ \)). For each orientation, a Generalized Gaussian distribution is fitted to the associated coefficients, and the coefficient \(\zeta \) is computed from the histogram model as \(\zeta =\sigma (X)/\mu (X)\), where \(\sigma (X)\) and \(\mu (X)\) are the standard deviation and the mean of the DCT coefficient magnitudes, respectively. In order to select the most significant of the three generated distortion maps, the variance of \(\zeta \) is computed for each orientation. The distortion map associated with the highest variance of \(\zeta \) is finally chosen and serves as the structural distortion map, namely \(\mathcal {M}_S\).

Since the DC (Direct Current) coefficient does not convey any structural information, it is removed from all computations.
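The block-level computation of \(\zeta \) could be sketched as follows. The partition of the DCT coefficients into three orientation sectors by the angle of their frequency index is an assumption inspired by BLIINDS2-style models, and the GGD fit itself is omitted here since \(\zeta \) only requires the mean and standard deviation of the coefficient magnitudes.

```python
# Sketch of the per-block, per-orientation coefficient zeta = sigma / mu
# computed on 2D-DCT magnitudes, with the DC coefficient removed.
import numpy as np
from scipy.fft import dctn

def block_zeta(block):
    coeffs = dctn(block.astype(np.float64), norm='ortho')
    u, v = np.meshgrid(np.arange(block.shape[1]), np.arange(block.shape[0]))
    angle = np.degrees(np.arctan2(v, u))       # 0 = horizontal, 90 = vertical
    not_dc = (u + v) > 0                       # discard the DC coefficient
    zetas = []
    for lo, hi in [(0, 30), (30, 60), (60, 90.1)]:   # three orientation sectors
        sel = not_dc & (angle >= lo) & (angle < hi)
        mags = np.abs(coeffs[sel])
        zetas.append(np.std(mags) / (np.mean(mags) + 1e-12))
    return zetas                               # one zeta per orientation sector
```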

Multi-orientation Image Property Map. It is widely accepted that the HVS is sensitive to spatial frequency and orientation. In order to capture this sensitivity, the steerable pyramid transform [26] is used.

Let \(a(i,j,f,\theta )\) be an original coefficient produced by the decomposition process, located at position \((i,j)\) in frequency band f and orientation band \(\theta \). The associated squared and normalized coefficient \(r(i,j,f,\theta )\) is defined as:

$$\begin{aligned} r(i,j,f,\theta )=k \frac{a(i,j,f,\theta )^2}{\sum _{\phi \in \left[ 0, 45, 90, 135\right] } a(i,j,f,\phi )^2+\sigma ^2} \end{aligned}$$
(1)

In this paper, four orientation bands with 45\(^\circ \) bandwidths, centered at 0, 45, 90 and 135\(^\circ \), plus one isotropic low-pass filter are used, yielding five response maps \(\{R_{\theta }, R_\text {iso}\}, \theta \in [ 0, 45, 90, 135]\). The distortion map associated with the highest value of the variance is finally selected and serves as the frequency variation distortion map, namely \(\mathcal {M}_F\).

From the four orientation bands, we compute an energy ratio in order to take into account the modification of the local spectral signatures of an image. This approach is inspired by the BLIINDS2 quality index [24]. Each orientation map \(\{R_{\theta }\}, \theta \in [ 0, 45, 90, 135]\) is decomposed into equally sized \(n\times n\) blocks. For each obtained patch, the average energy in band \(\theta \) is modeled by the variance of its coefficients, \(e_\theta =\sigma _\theta ^2\).

For each \(\theta \in [45, 90, 135]\), the relative distribution of energies in lower and higher bands is then computed as:

$$\begin{aligned} E_\theta = \frac{|e_\theta - 1/n \sum _{t<\theta }e_t|}{|e_\theta + 1/n \sum _{t<\theta }e_t|} \end{aligned}$$
(2)

where \(1/n \sum _{t<\theta }e_t\) represents the average energy over the bands below \(\theta \) (n being the number of such bands). Three distortion maps are then generated.

The distortion map associated with the highest value of the variance of \(E_\theta \) is finally selected and serves as the energy ratio distortion map, namely \(\mathcal {M}_E\).
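A sketch of the per-block energies \(e_\theta \) and ratios \(E_\theta \) of Eq. (2) is given below. The oriented response maps are assumed to come from a steerable pyramid implementation (e.g., the pyrtools package) after the normalization of Eq. (1); here they are simply passed in as a dictionary, and the block size is illustrative.

```python
# Sketch of the energy ratio maps of Eq. (2) from four oriented response maps.
import numpy as np

def energy_ratio_maps(responses, n=16):
    """responses: dict mapping orientation (0, 45, 90, 135) to a 2-D response map."""
    angles = sorted(responses)                     # [0, 45, 90, 135]
    rows, cols = responses[angles[0]].shape[0] // n, responses[angles[0]].shape[1] // n
    e = {t: np.zeros((rows, cols)) for t in angles}
    for t in angles:
        for r in range(rows):
            for c in range(cols):
                block = responses[t][r * n:(r + 1) * n, c * n:(c + 1) * n]
                e[t][r, c] = np.var(block)         # e_theta = variance of band theta
    ratio_maps = {}
    for i, t in enumerate(angles[1:], start=1):    # theta in {45, 90, 135}
        lower = np.mean([e[angles[j]] for j in range(i)], axis=0)  # avg of lower bands
        ratio_maps[t] = np.abs(e[t] - lower) / (np.abs(e[t] + lower) + 1e-12)
    return ratio_maps
```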

2.3 Multiscale Features Computation

In this block, each distortion map is combined with the saliency map in order to obtain a saliency-based distortion map. From each saliency-based distortion map, a pooling strategy is applied by averaging the highest 10th percentile of coefficients across the distortion map. This pooling strategy is motivated by the fact that the “worst” distortions in an image heavily influence subjective impressions and that they are concentrated in the few coefficients having the highest values [18]. The obtained values are referred to as \(df^{10}(\cdot )\), where \((\cdot )\) denotes one of the computed distortion maps \(\{\mathcal {M}_C,\mathcal {M}_S,\mathcal {M}_F,\mathcal {M}_E\}\). In order to capture information about the spatial distribution of the distortions (spread over space or isolated), the 100th percentile average of the local scores is also computed; the obtained values are referred to as \(df^{100}(\cdot )\). The whole computation thus leads to 8 distortion features \(\{df^{10}(k),df^{100}(k)\},\) \(\forall k\in \{\mathcal {M}_C,\mathcal {M}_S,\mathcal {M}_F,\mathcal {M}_E\}\).
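A possible reading of this pooling, for illustration only:

```python
# Percentile pooling of a (saliency-weighted) distortion map: the 10th-percentile
# feature averages only the highest 10% of coefficients, the 100th-percentile
# feature averages all of them.
import numpy as np

def percentile_pool(dist_map, percent=10):
    values = np.sort(dist_map.ravel())[::-1]                 # largest first
    k = max(1, int(np.ceil(values.size * percent / 100.0)))
    return values[:k].mean()

# df10 = percentile_pool(weighted_map, 10); df100 = percentile_pool(weighted_map, 100)
```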

The final feature is computed at each scale level l as

$$\begin{aligned} \text {final-feature}^p_l(k)= df^p_l(k)* \text {entropy}_l \end{aligned}$$
(3)

where \(\ p \in \{10,100\}\), \(\ k \in \{\mathcal {M}_C,\mathcal {M}_S,\mathcal {M}_F,\mathcal {M}_E\}\), \(df^p_l(k)\) represents the value of the distortion feature \(df^{p}(k)\) at level l, and \(\text {entropy}_l\) is the Renyi entropy of the associated saliency-based distortion map. This strategy allows us to include information about the anisotropy of the distortion maps. In this paper, the number of scales l is set to 3, as this value achieves the best performance.
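For one scale level, the feature assembly could be sketched as follows; this is illustrative only, it reuses the percentile_pool and renyi_entropy helpers sketched earlier, and the dictionary keys are arbitrary labels for the four maps.

```python
def level_features(weighted_maps):
    """weighted_maps: dict {'C': M_C, 'S': M_S, 'F': M_F, 'E': M_E}, each already
    weighted by the saliency map of the current level."""
    feats = []
    for key in ('C', 'S', 'F', 'E'):
        ent = renyi_entropy(weighted_maps[key])      # entropy of the weighted map
        for p in (10, 100):
            feats.append(percentile_pool(weighted_maps[key], p) * ent)   # Eq. (3)
    return feats   # 8 features per level; with the 3 levels used here, 24 in total
```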

2.4 Probabilistic Model and Quality Score Prediction

The computed features and the DMOS (Difference of Mean Opinion Scores) values of the training images are then used by the learning block to fit an MVGD. The resulting SABIQ model is given by:

$$\begin{aligned}&\text {SABIQ}\left( x\right) = \nonumber \\&\,\,\, \frac{1}{\left( 2\pi \right) ^{k/2}\left| \varSigma \right| ^{1/2}}\exp \left( -\frac{1}{2}\left( x-\beta \right) ^{T}\varSigma ^{-1}\left( x-\beta \right) \right) \end{aligned}$$
(4)

where \(x = \left( \{\text {final-feature}^p_l(k)\}, DMOS\right) \) corresponds to the extracted features (Eq. 3) augmented with the DMOS value. \(\beta \) and \(\varSigma \) denote the mean vector and covariance matrix of the MVGD model and are estimated using the maximum likelihood method. To assess the quality of a test image, its extracted features are paired with candidate DMOS values lying between 0 and 100 with a step of 0.5 and fed into the learned SABIQ model; the candidate that maximizes the modeled probability is retained as the predicted quality score.
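As an illustration of this prediction step (a minimal sketch, not the authors' code), the snippet below fits the joint MVGD over (features, DMOS) on training data and scores a test image by scanning candidate DMOS values; the function names and the use of scipy are assumptions.

```python
# MVGD fit on (features, DMOS) and prediction by maximizing the joint density
# over a DMOS grid from 0 to 100 with a step of 0.5.
import numpy as np
from scipy.stats import multivariate_normal

def fit_mvgd(train_features, train_dmos):
    """train_features: (N, d) array, train_dmos: (N,) array."""
    X = np.column_stack([train_features, train_dmos])
    beta = X.mean(axis=0)                              # mean vector
    sigma = np.cov(X, rowvar=False)                    # covariance matrix
    return multivariate_normal(mean=beta, cov=sigma, allow_singular=True)

def predict_dmos(model, features, grid=np.arange(0.0, 100.5, 0.5)):
    candidates = np.column_stack([np.tile(features, (len(grid), 1)), grid])
    return grid[np.argmax(model.logpdf(candidates))]
```

For instance, predict_dmos(fit_mvgd(F_train, y_train), f_test) would return the predicted DMOS of a test image whose feature vector is f_test.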

Table 1. SROCC values of NR-IQA models on each distortion type for the TID2013 database.
Table 2. SROCC values of NR-IQA models on each distortion type for the CSIQ database.

3 Performance Evaluation

3.1 Apparatus

To compare NR-IQA algorithms, two publicly available databases are used: (1) the TID2013 database [23] and (2) the CSIQ database [15]. Since the LIVE database [14] has been used to train both the proposed metric and most of the competing NR-IQA schemes, it was not used to evaluate performance. To train our model, we used the LIVE database, running multiple train-test sequences. For each sequence, the image database is divided into distinct training and test sets: 80% of the LIVE IQA database content was used for the training set and the remaining 20% for the test set. Each training set therefore contains 23 reference images and their associated distorted images. The quality scores are computed using a bootstrap process with 999 replicates.

To assess the performance of SABIQ, the Spearman Rank Order Correlation Coefficient (SROCC) between DMOS values and predicted scores is computed for SABIQ and for six state-of-the-art NR-IQA methods: BRISQUE [17], BLIINDS2 [24], DIIVINE [19], CORNIA [17], ILNIQE [33] and SSEQ [16], all of which are widely accepted in the research community.
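For reference, a simplified stand-in for this evaluation protocol (repeated random 80/20 splits grouped by reference image and the median SROCC over repetitions, rather than the 999-replicate bootstrap mentioned above) could look as follows; the function names and arguments are hypothetical.

```python
# Repeated content-independent 80/20 splits and median SROCC over repetitions.
import numpy as np
from scipy.stats import spearmanr

def median_srocc(features, dmos, ref_ids, train_fn, predict_fn, n_splits=100, seed=0):
    rng = np.random.default_rng(seed)
    refs = np.unique(ref_ids)
    sroccs = []
    for _ in range(n_splits):
        train_refs = rng.choice(refs, size=int(0.8 * len(refs)), replace=False)
        train = np.isin(ref_ids, train_refs)       # split by reference image
        model = train_fn(features[train], dmos[train])
        preds = [predict_fn(model, f) for f in features[~train]]
        rho, _ = spearmanr(dmos[~train], preds)
        sroccs.append(rho)
    return np.median(sroccs)
```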

3.2 Performance Evaluation

The SROCC between predicted and subjective DMOS is reported in Table 1 for the TID2013 database. From Table 1, one observes that SABIQ performs much better than the six other NR-IQA methods when the SROCC value for the whole database is considered. This significant gain in performance is likely induced by the use of visual attention in the weighting of distortions. When single distortions are considered, SABIQ achieves performance comparable with CORNIA and performs better than the five remaining methods. For multiple distortions, SABIQ performs better than BRISQUE, BLIINDS2, DIIVINE and SSEQ, and competes very well with CORNIA and ILNIQE.

Similar results are shown in Table 2 for the CSIQ database. SABIQ achieves better results for 4 out of the 6 distortions and outperforms all the competing NR-IQA algorithms when the entire database is considered. In this case, the gain in performance is about 7% compared to ILNIQE and at least 32% compared to the other metrics.

We also trained the methods on TID2013, excluding the multiply-distorted subsets (MD), and then tested them on the two other databases and on the remaining MD subsets of TID2013. The results are shown in Table 3. ILNIQE and SABIQ clearly outperform the other methods when trained on single distortions. On the LIVE database, ILNIQE and SABIQ achieve almost the same results, which is not surprising since many recent NR-IQA schemes reach high correlations on that database. Furthermore, SABIQ presents the highest SROCC value on the CSIQ database. All these results tend to highlight the high generalization capability of the proposed approach.

Table 3. SROCC values when trained on TID2013, excluding multi-distortion subsets (MD)

4 Conclusion

In this paper, we investigated how the visual attention property of the HVS can be embedded in NR-IQA algorithm design and to what extent it can improve the prediction of image quality. The proposed approach, namely SABIQ, relies on a computational model of visual attention to compute saliency maps. At each of the three levels of the multiresolution scheme, distortion maps of the input image are generated and weighted by the saliency maps in order to emphasize degradations in visually attractive regions. The extracted features are used by a probabilistic model to predict the final quality score. The obtained results demonstrate the effectiveness of the approach.