Structured Computational Modeling of Human Visual System for No-reference Image Quality Assessment

Objective image quality assessment (IQA), which automatically and efficiently predicts the perceived quality of images, plays an important role in various visual communication systems. The human eye is the ultimate evaluator of visual experience, so modeling the human visual system (HVS) is a core issue for objective IQA and visual experience optimization. Traditional models based on black-box fitting have low interpretability and can hardly guide experience optimization effectively, while models based on physiological simulation are difficult to integrate into practical visual communication services due to their high computational complexity. To bridge the gap between signal distortion and visual experience, in this paper we propose a novel perceptual no-reference (NR) IQA algorithm based on structural computational modeling of the HVS. Following the mechanism of the human brain, we divide visual signal processing into a low-level visual layer, a middle-level visual layer and a high-level visual layer, which conduct pixel information processing, primitive information processing and global image information processing, respectively. Natural scene statistics (NSS) based features, deep features and free-energy based features are extracted from these three layers. Support vector regression (SVR) is employed to aggregate the features into the final quality prediction. Extensive experimental comparisons on three widely used benchmark IQA databases (LIVE, CSIQ and TID2013) demonstrate that our proposed metric is highly competitive with or outperforms state-of-the-art NR IQA measures.


Introduction
In the 21st century, an era of pervasive information networks, the Internet has become an important way for people to acquire the latest information and entertainment. Visual information, including images and videos, accounts for more than 80% of total Internet traffic. A high-quality visual experience is the common basis of major applications such as the digital media industry and network information services. Image quality assessment (IQA), dedicated to evaluating human visual perception and predicting image quality, has been a fundamental issue in the image processing field [1]. Although subjective IQA is the most accurate approach, it is slow, time-consuming, laborious and difficult to reproduce, which immensely limits its application. By contrast, objective IQA, which resorts to mathematical metrics to predict the perceived quality of images automatically and efficiently, has been widely researched. In common IQA databases, the distorted images are usually degraded from a pristine image called the reference image. According to the available information of the reference image, objective IQA algorithms can be classified into full-reference (FR), reduced-reference (RR) and no-reference (NR) algorithms.
FR IQA models calculate the target image quality with fully accessible reference images, and they usually measure the "distance" between the perfect original image and its degraded version [2]. The mean-squared error (MSE) and peak signal-to-noise ratio (PSNR) are two classic and widely used metrics proposed long ago. However, they correlate poorly with subjective perception in some conditions [3]. To address this, Wang et al. [4] propose the structural similarity index (SSIM), which combines luminance, contrast and structure information based on the assumption that the human visual system (HVS) is highly sensitive to structures in the image. In addition, plenty of successful FR IQA algorithms have subsequently been designed, such as the visual information fidelity (VIF) [5], the visual signal-to-noise ratio (VSNR) [6] and the perceptual similarity (PSIM) [7]. When only partial information about the original image is available, RR IQA models take full advantage of this information to evaluate image quality. Wang et al. [8] propose an effective method using natural scene statistics (NSS) features in the transform domain. In the spatial domain, Liu et al. [9] report an RR model compositing bottom-up and top-down features as reference data. Soundararajan and Bovik [10] develop the reduced-reference entropic differencing (RRED) metric, which assesses quality via the wavelet coefficients of the original and distorted images. Min et al. [11] measure image quality based on saliency similarity.
However, the pristine image is nonexistent or unavailable in most actual scenarios, and thus NR IQA models, which require no original references, are highly desirable. Since there is no prior knowledge of the reference image, NR IQA is more difficult and challenging than FR and RR IQA. In fact, the development of NR IQA has been rapid in recent years. Based on their design philosophies, NR IQA algorithms can be divided into three types: NSS-based, learning-based and HVS-based measures. NSS-based NR IQA models are the earliest NR metrics; their motivation is that high-quality natural images possess certain statistical properties, while degraded images no longer do. Typical NSS-based NR models follow three major steps: feature extraction, NSS modeling, and feature regression [12]. In the literature, Moorthy and Bovik [13] propose a distortion identification based image verity and integrity evaluation (DIIVINE) model based on the statistical properties of wavelet coefficients. Mittal et al. [14] design a natural image quality evaluator (NIQE) using the NSS of image patches with high image contrast in the spatial domain. Min et al. [15] develop a blind IQA model called the blind pseudo-reference image based metric (BPRI), built on the NSS of pseudo-reference images. Liu et al. [16] propose an unsupervised method combining structure, naturalness, and perceptual quality variations based on a pristine multivariate Gaussian model. An increasing number of learning-based NR measures, which try to learn and integrate quality features, have been proposed in recent years. For example, Ye et al. [17] report an unsupervised feature learning framework, CORNIA (codebook representation for no-reference image assessment), which constructs an unlabeled codebook from raw image patches via K-means clustering. Xu et al. [18] introduce an NR IQA metric based on high order statistics aggregation (HOSA). A blind image evaluator using an optimized end-to-end convolutional neural network is proposed by Kim and Lee [19].
The human eye is the final receiver of visual signals and the ultimate criterion of visual experience for human beings. Computational modeling of the HVS is a key scientific problem in visual experience optimization. Thus, in addition to the above two categories of NR models, HVS-based NR metrics form another important family of NR algorithms; they are motivated by properties of the HVS and extract perceptually based features to predict image quality. Zhai et al. [20] propose a psychovisual no-reference free-energy-based quality metric (NFEQM) inspired by the free-energy principle, which interprets the perception of an image as an active inference process. Gu et al. [21] put forward an NR method incorporating free-energy based features, structural information and gradient magnitude. Li et al. [22] design an NR IQA metric employing no-reference structural and luminance (NRSL) features inspired by human visual perception of images. Li et al. [23] report an NR IQA algorithm extracting perceptual features from first-order and second-order structural patterns of images. Saha and Wu [24] extract features from the scale space, Fourier domain and wavelet domain to compute the quality score of the target image. Although many HVS-based NR algorithms exist and their effectiveness has been proved, these metrics still have defects. Traditional black-box regression models have low interpretability and can hardly guide visual experience optimization effectively, while models based on physiological simulation are difficult to integrate into practical visual communication services due to their high computational complexity. How to construct an NR IQA metric with high interpretability that bridges the gap between signal distortion and visual experience still deserves research.
In the literature of neuroscience and brain theory, visual experience can be induced by external stimuli, such as the appearance of an image [25]. Localization of the neural structure is an important step toward comprehending the fundamental mechanisms of the visual system [26]. The human brain is limited in its ability to interpret all perceptual stimuli appearing anywhere in the visual field at any point in time, and it relies on a gradual cognitive process of the stimuli based on the contingencies of the moment [27]. During perception, activation of visual imagery ultimately results from bottom-up impacts from the retina [28, 29]. Therefore, we propose a bottom-up structured HVS-based approach to model the information transfer and feedback mechanism of visual perception in the human brain. Combined with knowledge of image processing, we divide visual stimuli into three bottom-up layers: a low-level visual layer, a middle-level visual layer and a high-level visual layer. Specifically, for an image, the global image can be regarded as the high-level visual excitation, the local primitives obtained from the decomposition of the global image can be treated as the middle-level visual stimulus, and the low-level visual layer is composed of all individual pixels in the image. Conversely, the complete global image can be acquired by synthesizing its local primitives, which are in turn constituted by individual pixels of the image. The diagram of our proposed structural computational modeling of the human visual system is shown in Fig. 1.
In this paper, inspired by the above framework, we propose a new NR IQA algorithm based on structural computational modeling of the human visual system, called NSCHM (no-reference structural computational modeling of the human visual system metric). We investigate and analyze the perception mechanism in the HVS based on multi-layer representations of the image. A set of quality-aware NSS-based features are extracted as low-level visual features. Deep features from a convolutional network are considered middle-level features, and free-energy based features are treated as high-level features in our proposed method. Finally, support vector regression (SVR) is used to aggregate these three layers' features into a perceptual quality index that predicts the quality scores of target images. To demonstrate the effectiveness of our method, extensive experiments are performed on three common image quality databases (LIVE [30], CSIQ [31] and TID2013 [32]), and the method is compared with eight mainstream general-purpose NR algorithms. Experimental results show that the proposed NSCHM method is effective and superior or comparable to state-of-the-art NR models.
The remainder of this paper is organized as follows. In Section 2, we describe the details of the NSCHM metric. Validation is given in Section 3, which demonstrates that NSCHM is comparable to or outperforms the state-of-the-art NR IQA models. Finally, we draw some general conclusions in Section 4.

The proposed algorithm
To characterize image quality using structural computational modeling, we investigate three layers of the perception mechanism in the HVS. In this section, three levels of features, namely low-level, middle-level and high-level visual features, are analyzed and devised to characterize the quality of distorted images effectively. After feature extraction, we adopt SVR to regress these features into a final index representing the quality of target images. The overall diagram of the NSCHM method is illustrated in Fig. 2.

Feature extraction in the low-level visual layer
Features extracted from NSS have been widely accepted in the NR IQA field because of their stability and efficiency. NSS-based features in the spatial domain can judge the degree of image degradation from pixel-level characteristics, since high-quality natural scene images satisfy certain statistical regularities while quality degradation alters them. This is consistent with the low-level visual features we expect. Therefore, in this section we introduce the selection of low-level visual features based on NSS in the spatial domain.

Specifically, inspired by previous studies [33, 34], the locally mean subtracted and contrast normalized (MSCN) coefficients of the intensity image of a target image can effectively denote the luminance. Given an image $I$, we first transform $I$ to its intensity image $\hat{I}$, and then the MSCN coefficients of $\hat{I}$ can be calculated as

$$\tilde{I}(i,j) = \frac{\hat{I}(i,j) - \mu(i,j)}{\sigma(i,j) + 1}$$

where $\hat{I}(i,j)$ and $\tilde{I}(i,j)$ represent the pristine and normalized values of the intensity image at position $(i,j)$, $i \in \{1, 2, \dots, H\}$ and $j \in \{1, 2, \dots, W\}$ are spatial indices, and $W$ and $H$ denote the width and height of the image, respectively. $\mu(i,j)$ and $\sigma(i,j)$ denote the mean and standard deviation of the local patch centered at $(i,j)$, which can be computed as

$$\mu(i,j) = \sum_{k=-K}^{K} \sum_{l=-L}^{L} \omega_{k,l}\, \hat{I}(i+k, j+l)$$

$$\sigma(i,j) = \sqrt{\sum_{k=-K}^{K} \sum_{l=-L}^{L} \omega_{k,l}\, \big[\hat{I}(i+k, j+l) - \mu(i,j)\big]^2}$$

where $\omega = \{\omega_{k,l}\}$ stands for a 2D circularly-symmetric Gaussian weighting function and $K = L = 3$ in this implementation, as in [33]. Ruderman [34] observes that these normalized luminance values of natural images correlate strongly with a unit normal Gaussian characteristic. These properties of the MSCN coefficients can be used to describe the distortion level of the target image. To demonstrate this fact, the MSCN coefficient distributions of a reference image selected from the TID2013 IQA database [32] and of its corresponding degraded versions with different distortion types are shown in Fig. 3. It is obvious that the distributions of the reference image and its various distorted versions differ, which indicates that the statistical properties of the MSCN coefficients can be changed by various distortions.
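As a concrete illustration, the MSCN computation described above can be sketched in a few lines of NumPy. The 7×7 Gaussian window and the stabilizing constant of 1 follow the common BRISQUE settings; the function names are illustrative, not part of the original method's code:

```python
import numpy as np

def gaussian_window(size=7, sigma=7.0 / 6.0):
    """2D circularly-symmetric Gaussian weighting function, normalized to unit sum."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    w = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return w / w.sum()

def mscn(img, C=1.0):
    """Mean-subtracted, contrast-normalized coefficients of a grayscale image."""
    img = img.astype(float)
    w = gaussian_window()
    k, r = w.shape[0], w.shape[0] // 2
    pad = np.pad(img, r, mode="reflect")
    H, W = img.shape
    mu = np.zeros((H, W))    # weighted local mean
    ex2 = np.zeros((H, W))   # weighted local second moment
    for di in range(k):
        for dj in range(k):
            patch = pad[di:di + H, dj:dj + W]
            mu += w[di, dj] * patch
            ex2 += w[di, dj] * patch ** 2
    sigma = np.sqrt(np.maximum(ex2 - mu ** 2, 0.0))  # weighted local std
    return (img - mu) / (sigma + C)
```

On a flat image the local deviation is zero, so all MSCN coefficients collapse to zero; on natural images their histogram is approximately Gaussian, which is the property the low-level features exploit.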
In addition, as reported in [33], the MSCN distribution of an original image presents a Gaussian-like appearance, and each distortion deviates from this property in its own way. To describe the MSCN coefficient distribution specifically, a generalized Gaussian distribution (GGD) is employed, which can effectively depict the broad spectrum of distorted image statistics. The zero-mean GGD is expressed as

$$f(x; \alpha, \sigma^2) = \frac{\alpha}{2\beta\,\Gamma(1/\alpha)} \exp\left(-\left(\frac{|x|}{\beta}\right)^{\alpha}\right)$$

where

$$\beta = \sigma \sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}}$$

and the gamma function $\Gamma(\cdot)$ is defined as

$$\Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\, \mathrm{d}t, \quad a > 0$$

Here $\alpha$ and $\sigma^2$ are the parameters controlling the shape and the variance of the distribution, respectively. Then, we employ this GGD model to fit the above-mentioned MSCN distributions of the target images and extract $(\alpha, \sigma^2)$ as the quality-aware features for our low-level visual feature group.
In addition to the statistical distribution of individual pixels, we also consider the statistical relationship of adjacent pixels, which exhibits a regular structure and is sensitive to the presence of distortion [33]. Thus, we compute the pairwise products of adjacent MSCN coefficients along four orientations: horizontal, vertical, main-diagonal and secondary-diagonal. The distributions of the pairwise products of adjacent MSCN coefficients of the reference image and its degraded versions along the horizontal direction are illustrated in Fig. 4. The difference between the distribution of the original image and those of its distorted versions can be clearly distinguished. Accordingly, we adopt a zero-mode asymmetric generalized Gaussian distribution (AGGD) model to fit these distributions of adjacent coefficients:

$$f(x; \nu, \sigma_l^2, \sigma_r^2) = \begin{cases} \dfrac{\nu}{(\beta_l + \beta_r)\,\Gamma(1/\nu)} \exp\left(-\left(\dfrac{-x}{\beta_l}\right)^{\nu}\right), & x < 0 \\[2mm] \dfrac{\nu}{(\beta_l + \beta_r)\,\Gamma(1/\nu)} \exp\left(-\left(\dfrac{x}{\beta_r}\right)^{\nu}\right), & x \ge 0 \end{cases}$$

where

$$\beta_l = \sigma_l \sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}}, \qquad \beta_r = \sigma_r \sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}}$$

Here $\beta_l$ and $\beta_r$ control the expansion of the left and right sides respectively, while $\nu$ is the parameter controlling the shape of the mode. Then the mean of the above distribution can be calculated as

$$\eta = (\beta_r - \beta_l)\,\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}$$

The means $\eta$ of the fitted AGGD models along the four orientations are extracted as low-level visual features, considering their high sensitivity to various image degradations, as proved by extensive experiments. Since multi-scale processing helps improve the correlation between the predicted scores of QA models and human perception, we extract all features at two scales: the original scale and a resolution downsampled by a factor of two. Finally, a total of twelve features, six at each scale, are employed as the low-level visual features to measure the quality of the target image.
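In practice the GGD shape parameter is usually estimated by moment matching rather than maximum likelihood: the ratio of the squared mean absolute value to the second moment pins down the shape uniquely. A minimal sketch of this standard procedure (grid search over candidate shapes; `fit_ggd` is an illustrative name):

```python
import numpy as np
from math import gamma

def fit_ggd(x):
    """Estimate zero-mean GGD shape (alpha) and scale (sigma) by moment matching.

    Uses the identity r(alpha) = Gamma(2/alpha)^2 / (Gamma(1/alpha) * Gamma(3/alpha)),
    matched against the empirical ratio (E|x|)^2 / E[x^2] over a grid of shapes.
    """
    x = np.asarray(x, dtype=float)
    r_hat = np.mean(np.abs(x)) ** 2 / np.mean(x ** 2)
    alphas = np.arange(0.2, 10.0, 0.001)
    r = np.array([gamma(2 / a) ** 2 / (gamma(1 / a) * gamma(3 / a)) for a in alphas])
    alpha = alphas[np.argmin((r - r_hat) ** 2)]
    sigma = np.sqrt(np.mean(x ** 2))   # standard deviation of the zero-mean samples
    return float(alpha), float(sigma)
```

For Gaussian-distributed coefficients the estimate is close to alpha = 2, and for Laplacian ones close to alpha = 1, matching the intuition that distortion pushes the MSCN statistics away from the Gaussian case.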

Feature extraction in the middle-level visual layer
Following the low-level visual feature extraction, in this section we discuss middle-level visual feature extraction. As mentioned above, we consider the middle-level visual features more advanced than the low-level visual features: they no longer capture information at the pixel level, but rather the characteristics of local primitives in the image. It is known that convolutional neural networks (CNNs) can extract local features of images by calculating the cross-correlation between convolution kernels and feature maps. With the development of deep learning in recent years, deep CNNs show great performance in solving various visual signal problems, such as image recognition [35, 36], detection [37, 38] and tracking [39, 40]. Moreover, many studies indicate that local features extracted by CNNs respond to edges, textures, etc., which is consistent with the responses of neurons in the human visual system. The core of deep learning is to pass kernels through successive convolutions between layers toward the final goal, which accords with the properties of the middle visual layer we conceive. How to extract suitable deep features as middle-level visual features is the target of this section.
As a novel concept, the pseudo-reference image, which uses the worst image as a reverse reference, has proved effective in NR IQA models [15]. Inspired by this concept [41], we combine a deep convolutional neural network with this framework to extract middle-level visual features. The framework of the proposed middle-level feature extraction is illustrated in Fig. 5. First, we need to confirm the distortion types used for distortion aggravation to produce the pseudo-reference images. Since different categories of distortion introduce different artifacts, the pseudo-reference image associated with a specific distortion needs to be defined to comply with the properties of that distortion. Generally, in the most widely used subjective IQA databases, AWGN, GBlur, JPEG and JP2K are the four most commonly encountered distortion types. Thus, these four distortion types are used to further measure noising, blurring, blocking and ringing artifacts by degrading the distorted image. Since the visual geometry group (VGG) network has great power in representing local image features, we calculate VGG-based representation maps for the images under each category of distortion aggravation and extract the middle-level visual features from these maps. The details are introduced as follows.

Firstly, we introduce the methods of distortion aggravation for each distortion type. To achieve the noising effect, we add white noise to the distorted image $I_d$ to obtain the multiple pseudo-reference images (MPRIs):

$$D_{n_i} = I_d + N_i, \quad N_i \sim \mathcal{N}(0, \xi_i^2)$$

where $D_{n_i}$ represents the $i$-th degree of distortion aggravation and $\mathcal{N}(0, \xi_i^2)$ indicates a random normal distribution with zero mean and variance $\xi_i^2$. For the blurring effect, we blur the distorted image into MPRIs by employing Gaussian kernels:

$$D_{b_i} = I_d * g_{\delta_i}$$

where $g_{\delta_i}$ is a Gaussian kernel with determinate standard deviation $\delta_i$ and $*$ denotes a convolution operator. To produce the blocking effect, we compress the distorted image into the MPRIs with the JPEG encoder:

$$D_{j_i} = \mathrm{JPEG}(I_d, q_i)$$

where JPEG indicates the JPEG encoder and $q_i$ adjusts the compression quality. To produce the ringing effect, we compress the distorted image into the MPRIs by adopting the JP2K encoder:

$$D_{r_i} = \mathrm{JP2K}(I_d, c_i)$$

where JP2K means the JP2K encoder and $c_i$ is used to change the compression ratio. The subscripts $n$, $b$, $j$ and $r$ denote the noising, blurring, blocking and ringing effects, respectively. In addition, the degree of distortion aggravation is divided into five levels for each distortion type, i.e., $i \in \{1, 2, \dots, 5\}$ in this work. After distortion aggravation, we extract middle-level features from the target distorted image and its corresponding MPRIs. Ding et al. [42] find that calculating only the spatial means and variances of the feature maps in a convolutional neural network yields an efficient parametric model of visual quality. Thus, in this work we feed the target distorted images and their corresponding MPRIs into a VGG network and calculate the mean and variance of each feature map of the VGG network as well as of the input image. Specifically, the representation of an MPRI $D$ is composed of the global statistics of the convolution responses of the five corresponding VGG stages:

$$R(D) = \big\{\mu_{s,k}(D),\, \sigma^2_{s,k}(D) \;\big|\; s = 0, 1, \dots, S;\; k = 1, 2, \dots, K_s\big\}$$

where $S = 5$ in this work, which is the number of convolution stages, $s = 0$ denotes the input image itself, and $K_s$ denotes the number of feature maps in the $s$-th convolution stage.
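The noising and blurring aggravations can be sketched directly in NumPy (the level values below mirror the parameter settings reported later in Section 3, but are otherwise illustrative; the blocking and ringing aggravations additionally require JPEG/JP2K codecs, so they are omitted from this sketch):

```python
import numpy as np

def aggravate_noise(img, levels=(0.5, 1.0, 1.5, 2.0, 2.5), seed=0):
    """MPRIs for the noising effect: add zero-mean white Gaussian noise
    with increasing standard deviation."""
    rng = np.random.default_rng(seed)
    return [img + rng.normal(0.0, s, img.shape) for s in levels]

def aggravate_blur(img, sigmas=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """MPRIs for the blurring effect: convolve with Gaussian kernels of
    increasing spread (separable convolution with reflect padding)."""
    out = []
    for sigma in sigmas:
        r = max(1, int(3 * sigma))
        ax = np.arange(-r, r + 1)
        k = np.exp(-ax ** 2 / (2 * sigma ** 2))
        k /= k.sum()
        p = np.pad(img, r, mode="reflect")
        # horizontal pass, then vertical pass of the separable Gaussian
        h = sum(w * p[:, i:i + img.shape[1]] for i, w in enumerate(k))
        v = sum(w * h[i:i + img.shape[0], :] for i, w in enumerate(k))
        out.append(v)
    return out
```

Because each kernel is normalized to unit sum and reflect padding preserves constant regions, a flat image passes through the blur unchanged, which is a convenient sanity check for the implementation.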
$R(D_i)$ is the representation for the $i$-th MPRI. Similarly, we can derive the representation $R(I_d)$ for the target distorted image. After that, the quality features extracted from $R(D_i)$ and $R(I_d)$ need to be specified. Inspired by the features in SSIM [4], we calculate texture and structure quality features for each pair of feature maps of the target image and its corresponding MPRI based on the global means and variances:

$$l_{s,k} = \frac{2\,\mu_{s,k}(I_d)\,\mu_{s,k}(D_i) + c_1}{\mu_{s,k}(I_d)^2 + \mu_{s,k}(D_i)^2 + c_1}$$

$$c_{s,k} = \frac{2\,\sigma_{s,k}(I_d, D_i) + c_2}{\sigma^2_{s,k}(I_d) + \sigma^2_{s,k}(D_i) + c_2}$$

where $l_{s,k}$ and $c_{s,k}$ denote the similarities of the global means (texture features) and the global correlation (structure features), respectively. $\mu_{s,k}(I_d)$, $\mu_{s,k}(D_i)$, $\sigma^2_{s,k}(I_d)$ and $\sigma^2_{s,k}(D_i)$ indicate the global means and variances of the corresponding feature maps of $I_d$ and $D_i$, and $\sigma_{s,k}(I_d, D_i)$ is the global covariance between them. $c_1$ and $c_2$ are two small constants that prevent instability when the denominators are close to zero.

Finally, based on the texture and structure features of the target distorted image and its corresponding MPRIs at the different convolution layers, the middle-level visual features are extracted as

$$M_i = \sum_{s=0}^{S} \sum_{k=1}^{K_s} \big( a_{s,k}\, l_{s,k} + b_{s,k}\, c_{s,k} \big)$$

where $a_{s,k}$ and $b_{s,k}$ represent the positive learnable weights, which satisfy $\sum_{s,k} (a_{s,k} + b_{s,k}) = 1$.
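For a single pair of feature maps, the texture and structure terms reduce to a few global statistics; a hedged NumPy sketch (function name and the small constants are illustrative, not from the original implementation):

```python
import numpy as np

def texture_structure_features(fmap_x, fmap_y, c1=1e-6, c2=1e-6):
    """SSIM-style similarity of two feature maps from global statistics:
    a mean (texture) term and a covariance (structure) term."""
    mx, my = fmap_x.mean(), fmap_y.mean()
    vx, vy = fmap_x.var(), fmap_y.var()
    cov = ((fmap_x - mx) * (fmap_y - my)).mean()
    texture = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    structure = (2 * cov + c2) / (vx + vy + c2)
    return float(texture), float(structure)
```

Identical maps yield both terms equal to 1, and sign-flipped maps drive the structure term negative, so the pair behaves as a similarity in [-1, 1] just as in SSIM.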

Feature extraction in the high-level visual layer
Having discussed the extraction of low-level and middle-level visual features, in this section we explore and analyze high-level visual feature extraction. Since the high-level visual features treat the global image as a whole, we need a model that operates on the whole image to extract these features. We thoroughly investigate visual perception models of the human brain and attempt to characterize image quality from high-level visual perception in the HVS.
Specifically, we employ the free-energy principle, which unifies several findings in brain theory and neuroscience, to model the processes of human action, perception and learning [43]. A fundamental tenet of the free-energy principle is that the process of cognition or comprehension is an active inference behavior managed by an internal generative model (IGM) in the human brain [44]. When a "surprise", such as an image signal, is transmitted to the human brain via the retina, the brain spontaneously extracts the useful part of the information and ignores the redundant uncertain components, explaining its sensations with this IGM [20]. The perceptual quality of the input thus correlates highly with the discrepancy between the input image and the representation generated for it by the IGM [21]. Since the IGM yields the perception of visual signals from the integrated input image, free-energy based features are regarded as high-level visual features in this work. For the mathematical formulation, we use $m$ to represent the internal generative model. We also assume that visual perception is parametric, adjusting a parameter vector $\theta$ to explain visual scenes. Given the input image $I$, its "surprise" can be measured by integrating the joint distribution $P(I, \theta | m)$ over the model parameter vector $\theta$:

$$-\log P(I|m) = -\log \int P(I, \theta|m)\, \mathrm{d}\theta$$

To simplify this mathematical expression, an auxiliary term $Q(\theta|I)$ is introduced into both the numerator and denominator of the above equation. Using Jensen's inequality and dropping the generative model $m$ to keep the notation clear, we can rewrite the equation as

$$-\log P(I) = -\log \int Q(\theta|I)\, \frac{P(I, \theta)}{Q(\theta|I)}\, \mathrm{d}\theta \le \int Q(\theta|I)\, \log \frac{Q(\theta|I)}{P(I, \theta)}\, \mathrm{d}\theta$$

According to statistical physics and thermodynamics [45], the right-hand side of this inequality is exactly the free energy:

$$F(\theta) = \int Q(\theta|I)\, \log \frac{Q(\theta|I)}{P(I, \theta)}\, \mathrm{d}\theta$$

Noticing that $P(I, \theta) = P(\theta|I)\,P(I)$, we can further rewrite the above equation as

$$F(\theta) = -\log P(I) + \int Q(\theta|I)\, \log \frac{Q(\theta|I)}{P(\theta|I)}\, \mathrm{d}\theta = -\log P(I) + \mathrm{KL}\big(Q(\theta|I)\,\|\,P(\theta|I)\big)$$

where $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence between the approximate posterior and the true posterior distributions. A more detailed derivation of the free energy can be found in [20].
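This decomposition can be checked numerically on a toy discrete model: for any approximate posterior $Q$, the free energy equals the surprise $-\log P(I)$ plus a non-negative KL term, so it is minimized, and the bound is tight, at the true posterior. The three-state joint distribution below is purely illustrative:

```python
import numpy as np

# Toy discrete generative model: P(I, theta) over three parameter states.
joint = np.array([0.10, 0.25, 0.05])   # illustrative joint probabilities
p_I = joint.sum()                      # evidence P(I), marginalizing over theta
posterior = joint / p_I                # true posterior P(theta | I)

def free_energy(q):
    """F(Q) = sum_theta Q(theta) * log(Q(theta) / P(I, theta))."""
    return float(np.sum(q * np.log(q / joint)))

surprise = -float(np.log(p_I))         # -log P(I)
q_uniform = np.full(3, 1.0 / 3.0)      # a deliberately crude approximation
```

Evaluating `free_energy(posterior)` reproduces the surprise exactly, while any other choice of Q, such as `q_uniform`, gives a strictly larger value; the gap is precisely the KL term in the last equation.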
Since the human brain is extremely complicated and far beyond our current knowledge, an explicit computational model of the free energy has not yet been developed. To address this problem, some research attempts to approximate the free energy using existing models that simulate image perception in the human brain. In some earlier works [20, 21], a linear auto-regressive (AR) model is employed to approximate the free energy. However, the computation of the AR model is complex, which leads to relatively long feature extraction times. According to neurobiological findings, sparse representation is suitable for describing natural images and agrees with properties of the mammalian primary visual cortex, such as spatial localization, orientation selectivity and bandpass characteristics [46]. Thus, Liu et al. [47] and Zhu et al. [48] use a sparse representation method to approximate the free energy. The sparse representation method has been demonstrated to be more efficient and effective than the linear AR model for predicting image quality. Therefore, in this paper we employ the sparse representation model to approximate the IGM.
Specifically, given the input image $I$, we first select a patch $x_i = R_i I$ with an extraction operator $R_i$, where $x_i \in \mathbb{R}^n$ and $n$ denotes the size of the patch vector. Then, the sparse representation of $x_i$ over an over-complete dictionary $D = [d_1, d_2, \dots, d_m] \in \mathbb{R}^{n \times m}$ amounts to computing a coefficient vector $\alpha_i$ to represent $x_i$, which can be formulated as

$$\hat{\alpha}_i = \arg\min_{\alpha_i} \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1$$

where $\alpha_i$ is the representation coefficient vector of the extracted patch and $m$ represents the number of atoms in the sparse representation model. $\lambda$ is a positive constant used to balance the reconstruction fidelity term and the sparsity penalty term, and $\|\cdot\|_1$ represents the $\ell_1$ norm. From the above formula, the sparse vector $\hat{\alpha}_i$ representing $x_i$ can be obtained. After that, the sparse representation of the whole input image can be expressed as

$$\hat{I} = \left( \sum_{i=1}^{N} R_i^{\mathrm{T}} D \hat{\alpha}_i \right) ./ \left( \sum_{i=1}^{N} R_i^{\mathrm{T}} \mathbf{1}_n \right)$$

where $\hat{I}$ is the sparse representation of the entire image $I$, which is regarded as the representation of $I$ in the human brain. Here "./" means the element-wise division of two matrices, $N$ refers to the number of patches, $R_i^{\mathrm{T}}$ represents the transpose of $R_i$, and $\mathbf{1}_n$ denotes the all-ones vector of size $n$.
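In practice the sparse coding problem above is solved greedily; Section 3 uses orthogonal matching pursuit (OMP). A minimal NumPy sketch of OMP with a fixed sparsity budget (the signature is illustrative, not the interface of the toolbox cited in [52]):

```python
import numpy as np

def omp(D, x, sparsity):
    """Orthogonal matching pursuit: greedily select the dictionary atom most
    correlated with the residual, then re-fit all selected coefficients by
    least squares, repeating up to the sparsity budget."""
    residual, support = x.astype(float), []
    coef = np.zeros(0)
    for _ in range(sparsity):
        j = int(np.argmax(np.abs(D.T @ residual)))   # best-matching atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef          # orthogonal re-fit
    alpha = np.zeros(D.shape[1])
    alpha[support] = coef
    return alpha
```

With an orthonormal dictionary the recovery is exact for any signal that is at most `sparsity`-sparse, which makes a convenient correctness check.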

According to the above analysis, the free energy indicates a discrepancy between the input image and its best prediction by the IGM. Thus, free energy can be considered a natural proxy for perceptual quality. Based on the expression of the free energy, the prediction residual of the input image $I$ is defined as

$$R = |I - \hat{I}|$$

where $R$ refers to the prediction residual of the input image and $|\cdot|$ is the magnitude operation. After that, the uncertainty of $R$ can be obtained by measuring its entropy:

$$F(I) = -\sum_{g=0}^{255} p_g(R) \log p_g(R)$$

where $p_g(R)$ refers to the probability density of the $g$-th gray level in $R$, and $F(I)$ is the entropy of $R$, which is also regarded as the value of the free energy. To illustrate intuitively the effectiveness of the free-energy feature in describing image quality, distorted images generated from two reference images are selected from the TID2013 database [32]. The free-energy values of these images at different distortion levels and with different distortion types are illustrated in Fig. 6. As exhibited, the values of $F(I)$ reduce gradually with the deepening of degradation. Based on the great capacity of the free-energy feature to measure degradations of image quality effectively and on its high-level visual properties, we select the free-energy feature as the high-level visual feature in this work.
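The residual-entropy computation can be sketched as follows (residuals quantized to 8-bit gray levels; a base-2 logarithm is used here, so the value is in bits; the function name is illustrative):

```python
import numpy as np

def free_energy_entropy(img, pred):
    """Free-energy proxy: entropy (in bits) of the absolute prediction
    residual between an image and its IGM-style prediction, quantized
    to 256 gray levels."""
    residual = np.abs(img.astype(float) - pred.astype(float))
    r = np.clip(np.round(residual), 0, 255).astype(int)
    hist = np.bincount(r.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()        # empirical gray-level probabilities
    p = p[p > 0]                 # drop empty bins (0 * log 0 := 0)
    return float(-(p * np.log2(p)).sum())
```

A perfect prediction gives a zero residual and hence zero entropy, while a residual spread uniformly over four gray levels gives exactly 2 bits, matching the definition of $F(I)$ above up to the choice of logarithm base.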

Quality evaluation
After extracting the quality-aware features from the low-level, middle-level and high-level visual layers, we need to seek an appropriate mapping from the feature space to the subjective MOS values using a regression module, and then employ it to produce objective quality scores. A total of 33 features are extracted from the three visual layers, as shown in Table 1. Based on the number of features and the effectiveness of regressors, we adopt SVR [49], which has been widely used in the NR IQA field [21, 33], to aggregate the quality-aware features.
Specifically, given the training set $\Phi$, the quality-aware features $F_i$ and the corresponding subjective quality labels (MOS) $q_i$ of the images are employed to train the model:

$$f = \mathrm{SVR}\big(\{(F_i, q_i)\}_{i \in \Phi}\big)$$

where $F_i$ is composed of the low-level visual features, the middle-level visual features and the high-level visual features of the $i$-th training image in the training set $\Phi$. Then, we can utilize this regressor to predict the quality score of any target image from its corresponding feature vector $F$:

$$\hat{q} = f(F)$$

where $\hat{q}$ stands for the predicted objective quality score of the target image. In this work, the LIBSVM package [50] with a radial basis function (RBF) kernel is utilized to train our proposed model.
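The aggregation step can be mimicked without LIBSVM. The sketch below uses kernel ridge regression with an RBF kernel as a stand-in for epsilon-SVR (same kernel family, different loss), purely to illustrate how the 33-D feature vectors are mapped to quality scores; all names and hyperparameter values are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Pairwise RBF kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class RBFRegressor:
    """NumPy stand-in for RBF-kernel epsilon-SVR (kernel ridge regression).
    The paper trains with LIBSVM; this is only a sketch of the mapping."""

    def fit(self, X, y, lam=1e-3, gamma=0.5):
        self.X, self.gamma = X, gamma
        K = rbf_kernel(X, X, gamma)
        # solve (K + lam*I) alpha = y for the dual coefficients
        self.alpha = np.linalg.solve(K + lam * np.eye(len(X)), np.asarray(y, float))
        return self

    def predict(self, X):
        return rbf_kernel(X, self.X, self.gamma) @ self.alpha

# In the paper's setting, each row of X would be the concatenation of the
# low-, middle- and high-level features of one training image (33 values).
```

Kernel ridge and epsilon-SVR differ in their loss (squared vs. epsilon-insensitive) and sparsity of the dual solution, but both produce a prediction of the form "kernel evaluations against training points, weighted by learned coefficients".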

Experimental results and analysis
In this section, we first compare the performance of our proposed method with that of popular NR IQA models on three common large-scale image databases, LIVE [30], CSIQ [31] and TID2013 [32], to validate the proposed NSCHM quality metric. The four most mainstream distortion types mentioned above, AWGN, GBlur, JPEG and JP2K, are employed in the experiment, and the Rayleigh fast-fading channel simulation (FF) distortion in the LIVE database [30] is also included. The performance on single distortion types is also discussed. In addition, we analyze the robustness of our proposed method through cross-validation under mismatched conditions. Finally, an ablation experiment is employed to demonstrate the effect of the features from the different visual layers.

Parameter settings and training procedure
In the process of aggravating distortion for the middle-level visual feature extraction, the distorted image is degraded by AWGN, GBlur, JPEG and JP2K distortions with five degradation levels each. We employ Matlab to apply these four distortions, with the following parameters: the five standard deviations $\xi_i$ of the AWGN range from 0.5 to 2.5 with a step of 0.5; the five variances of the GBlur kernels range from 0.3 to 0.7 with a step of 0.1; the five quality parameters of the JPEG encoder range from 0 to 8 with a step of 2; and the five compression ratios of the JPEG2000 encoder range from 150 to 250 with a step of 25. In addition, since the positive perceptual weights in the middle-level feature computation are undetermined, we train the VGG-based representation model on the KADID dataset [51] to learn them.
For the sparse representation in the high-level visual feature extraction, the predefined dictionary adopts an overcomplete discrete cosine transform (DCT) dictionary of size 64×144, i.e., 144 atoms for sparse representation. The size of each patch vector is set to 64. The orthogonal matching pursuit (OMP) algorithm [52] is used to solve the optimization problem of the sparse representation.
Since our proposed model requires training, we randomly divide the distorted images in each testing database into two parts: a training set and a testing set, containing 80% and 20% of the images, respectively. We train our proposed algorithm on the training set and measure its performance on the testing set. This 80% train, 20% test process is repeated one thousand times to guarantee the robustness of our metric [53]. The median results over these one thousand iterations are reported as the final performance to avoid performance bias as much as possible.
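This evaluation protocol is itself easy to misimplement; the sketch below shows the repeated-split median in a few lines. The `evaluate` callback, which would train the model on the training indices and return a test metric such as SRCC, is a placeholder, and strictly the split should be made by reference image so that the same content never appears in both sets:

```python
import numpy as np

def median_performance(evaluate, n_images, n_iters=1000, train_frac=0.8, seed=0):
    """Repeated random train/test splits; report the median test metric.

    evaluate(train_idx, test_idx) is assumed to train on the first index set
    and return a scalar performance number on the second.
    """
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_iters):
        perm = rng.permutation(n_images)
        cut = int(train_frac * n_images)
        results.append(evaluate(perm[:cut], perm[cut:]))
    return float(np.median(results))
```

Using the median over many splits, rather than a single split or the mean, reduces sensitivity both to lucky partitions and to the occasional badly conditioned training run.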

Experimental protocol
1) Databases: To examine the performance of the proposed model, three widely used IQA databases are employed as testbeds: LIVE [30], CSIQ [31] and TID2013 [32]. A brief introduction of these three databases is presented below. The LIVE database [30] was released by the University of Texas at Austin and is the most famous IQA database. It contains 779 lossy images generated from 29 pristine images by degrading them with five different types of distortions: AWGN, GBlur, JPEG, JP2K and FF.
The CSIQ database [31] is provided by Oklahoma State University and includes 866 images created from 30 original images. Six types of distortions are considered in the CSIQ database: GBlur, AWGN, JPEG, JP2K, global contrast decrements (CC) and additive pink Gaussian noise (APGN), each at four or five distortion levels.
The TID2013 database [32] is the updated version of the TID2008 database and was developed through a joint international cooperation among Finland, Italy and Ukraine. It consists of 3 000 distorted images generated by corrupting 25 reference images with 24 distortion types at five distinct distortion levels.
2) Comparing algorithms: Eight popular IQA algorithms are compared with our proposed NSCHM metric: DIIVINE [13], BLIINDS2 [54], BRISQUE [33], NIQE [14], QAC [55], IL-NIQE [56], LPSI [57] and BPRI [15]. Among these NR models, DIIVINE, BLIINDS2 and BRISQUE are opinion-aware models that need to be trained to integrate the NSS features extracted from the wavelet domain, DCT domain and spatial domain, respectively. The rest are opinion-unaware models: NIQE and IL-NIQE are based on spatial-domain NSS, QAC learns a codebook to achieve quality-aware clustering, LPSI uses local image structure statistics and BPRI utilizes a local binary pattern.
3) Evaluation criteria: Four commonly used evaluation criteria are applied to measure the performance of the compared IQA metrics: the Spearman rank-order correlation coefficient (SRCC), Kendall's rank-order correlation coefficient (KRCC), the Pearson linear correlation coefficient (PLCC) and the root mean squared error (RMSE) [58, 59]. The mathematical expressions of these four measurements are as follows:

$$\mathrm{SRCC} = 1 - \frac{6\sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$$

$$\mathrm{KRCC} = \frac{N_c - N_d}{\frac{1}{2}N(N-1)}$$

$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N}(q_i - \bar{q})(s_i - \bar{s})}{\sqrt{\sum_{i=1}^{N}(q_i - \bar{q})^2}\sqrt{\sum_{i=1}^{N}(s_i - \bar{s})^2}}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(q_i - s_i)^2}$$

where $d_i$ represents the difference between the ranks of the $i$-th image in the subjective and objective assessments, and $N$ denotes the number of images in the testing data set. $N_c$ and $N_d$ denote the numbers of concordant and discordant pairs in the testing database. $q_i$ and $s_i$ indicate the converted objective score and the subjective score of the $i$-th image after the nonlinear regression, and $\bar{q}$ and $\bar{s}$ are the means of all $q_i$ and $s_i$. Specifically, SRCC represents the prediction monotonicity by considering only the relative orders of the inputs, and KRCC is another monotonicity index employed to evaluate the association between the data. PLCC describes the prediction linearity of an IQA metric and RMSE indicates the prediction accuracy. A good IQA measure is expected to acquire high values, close to 1, in SRCC, KRCC and PLCC, and low values, near 0, in RMSE.
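For instance, the four criteria can be computed with scipy; the toy score vectors below are illustrative, and for PLCC and RMSE the objective scores should first be passed through the VQEG nonlinear mapping:

```python
import numpy as np
from scipy import stats

def iqa_criteria(objective, subjective):
    """SRCC, KRCC, PLCC and RMSE between mapped objective scores and MOS."""
    srcc = stats.spearmanr(objective, subjective)[0]
    krcc = stats.kendalltau(objective, subjective)[0]
    plcc = stats.pearsonr(objective, subjective)[0]
    diff = np.asarray(objective, dtype=float) - np.asarray(subjective, dtype=float)
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    return srcc, krcc, plcc, rmse

# Toy example: four mapped objective scores vs. four subjective MOS values.
srcc, krcc, plcc, rmse = iqa_criteria([1.0, 2.0, 3.0, 4.0], [1.0, 2.1, 2.9, 4.0])
```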

Furthermore, following the suggestions of the video quality experts group (VQEG) [60], PLCC and RMSE cannot be calculated directly from the subjective scores and the corresponding objective ratings. According to the guidance of VQEG, we adopt a regression analysis to conduct a nonlinear mapping between the subjective MOSs and the corresponding objective ratings predicted by the target IQA metrics. For the nonlinear regression, a monotonic logistic function of five parameters is employed:

$$f(x) = \zeta_1\left(\frac{1}{2} - \frac{1}{1 + e^{\zeta_2 (x - \zeta_3)}}\right) + \zeta_4 x + \zeta_5$$

where $x$ and $f(x)$ represent the raw input ratings and the mapped scores, and $\{\zeta_1, \zeta_2, \zeta_3, \zeta_4, \zeta_5\}$ stand for the five parameters to be ascertained during the process of the nonlinear fitting.
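This fitting step can be sketched with scipy's curve_fit, using the standard five-parameter monotonic logistic recommended by VQEG on synthetic data (the initial guesses p0 and the synthetic MOS curve are illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic5(x, z1, z2, z3, z4, z5):
    """Monotonic five-parameter logistic mapping recommended by VQEG."""
    return z1 * (0.5 - 1.0 / (1.0 + np.exp(z2 * (x - z3)))) + z4 * x + z5

objective = np.linspace(0.0, 1.0, 50)                    # raw metric outputs
mos = 100.0 / (1.0 + np.exp(-6.0 * (objective - 0.5)))   # synthetic MOS values
p0 = [np.max(mos), 5.0, np.mean(objective), 0.0, np.mean(mos)]
params, _ = curve_fit(logistic5, objective, mos, p0=p0, maxfev=10000)
mapped = logistic5(objective, *params)   # mapped scores used for PLCC and RMSE
```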

Overall performance comparison
First, we compare the overall performance of our proposed algorithm with the above-mentioned eight state-of-the-art NR IQA models on three widely used databases: LIVE, CSIQ and TID2013. For a fair comparison, we retrain the opinion-aware algorithms DIIVINE, BLIINDS2 and BRISQUE, as well as our proposed method, on the same training set and measure them on the testing set of each database. For the remaining models, we employ the same testing set to test their performance. The overall performance in terms of SRCC, KRCC, PLCC and RMSE is tabulated in Table 2, where the three top-performing models are highlighted.
It is observed that our proposed algorithm shows great comprehensive performance and ranks among the top three on all databases in terms of the various criteria. By comparison, DIIVINE, NIQE and LPSI show relatively moderate performance on the three databases. BRISQUE demonstrates good performance on LIVE, and BLIINDS2 correlates highly with the subjective scores on LIVE and CSIQ. Another observation is that IL-NIQE and BPRI achieve great prediction performance on TID2013 and CSIQ, respectively. These experiments clearly demonstrate that our proposed method has high stability and superiority in assessing the perceived quality of images.

Performance on different distortion types
In addition to testing the overall performance of the algorithms on individual databases, we also examine the prediction performance of all NR IQA metrics on individual distortions. The same training-testing process described in Section 3.3 is implemented: all distorted images in the training set (80% of the database) are employed to train the models, while only images with the target distortion type among the remaining 20% are applied for testing. The mean results of our proposed method and the compared blind IQA models on single distortion types are summarized in Table 3, where the three best performances for each distortion type on the different databases are highlighted with boldface. For simplicity, we only list SRCC values in Table 3; similar evaluation results are obtained with the other evaluation criteria.
From Table 3, it can be clearly observed that the competition among the NR IQA algorithms is more intense on individual distortions, and each metric has its own advantages. Specifically, our proposed NSCHM remains comparable to these popular metrics when evaluated on individual distortions, which is consistent with the results of the overall performance evaluation introduced in Section 3.3. In addition, we find that BRISQUE obtains the best results on LIVE but relatively mediocre performance on TID2013, while BPRI and LPSI perform much better on CSIQ and TID2013. Furthermore, our proposed model shows more stable performance than the other NR measures: NSCHM has no SRCC value lower than 0.88 for any single distortion type and thus has no obvious weakness in these four common distortion types on the three popular databases.

Cross-validation under mismatched conditions
In Sections 3.3 and 3.4, the performance of the NR algorithms is based on the training-testing procedure on the same database. Thus, in this section, we carry out cross-validation experiments to test the robustness of our proposed method under mismatched conditions. We use the LIVE, CSIQ and TID2013 databases as the training set in turn, and employ the corresponding remaining two databases as the testing set. The results are summarized in Table 4.

Table 2  Overall performance comparison of the ten popular IQA methods and our proposed metric on the LIVE, CSIQ and TID2013 databases. We highlight the three top-performing models in each row.
To demonstrate that our algorithm also has acceptable performance under mismatched conditions, we compare NSCHM with other competitive algorithms in this section. For fairness, we select the opinion-aware algorithms DIIVINE, BLIINDS2 and BRISQUE, as well as a state-of-the-art training-based algorithm, NFERM [21], to compare with our proposed method. We employ the TID2013 database as the training set and measure the performance of these metrics on the LIVE and CSIQ databases. The performance results are shown in Table 5. It is obvious that our proposed algorithm has advantages over the other opinion-aware algorithms: the PLCC, KRCC and RMSE results on LIVE, as well as the PLCC and RMSE results on CSIQ, are the best among these models. In addition, there are no relatively poor results for any sub-item, indicating that the robustness of our algorithm is good.
From Table 6, it is easy to find that NSCHM is superior to all competing NR IQA models on the LIVE database and has great advantages over the other competitors on the TID2013 database, where only IL-NIQE and BPRI are comparable to our method. In addition, although the performance on the CSIQ database is not as outstanding as that on the other two databases, no competing algorithm is superior to NSCHM. Thus, this experiment demonstrates the advantage of NSCHM in evaluating image quality statistically.

Ablation experiment
As described in Section 2, our proposed NSCHM consists of three groups of features, namely low-level visual features, middle-level visual features and high-level visual features. Therefore, it is interesting to analyze the contribution of each part to the overall algorithm. We conduct the ablation study on the LIVE, CSIQ and TID2013 databases. For quantitative analysis, we compute the median values of SRCC, PLCC, KRCC and RMSE via the same 80% train − 20% test process described above for each group of features. In addition, to make a more detailed division, we divide the low-level visual features into the MSCN coefficient features and the adjacent MSCN coefficient features. The performance of each feature group on the different databases is shown in Table 7, where LOW1 and LOW2 stand for the MSCN coefficient features and the adjacent MSCN coefficient features in the low-level visual features, respectively, and MIDDLE and HIGH denote the features extracted from the middle-level and high-level visual layers. It is observed that each set of features achieves favourable performance, with LOW1 and MIDDLE performing better and LOW2 and HIGH performing relatively worse. Another observation is that the performance of each set of features is inferior to that of the final proposed algorithm, which means that each feature group makes its own contribution to the prediction accuracy of our proposed metric in evaluating the perceived quality of images.

Conclusions
In this paper, a novel perceptual NR IQA metric named NSCHM is proposed based on structural computational modeling of the HVS. The proposed metric is inspired by the fact that the human brain processes visual stimuli in a hierarchical manner. We first analyze how the human brain handles images and introduce the framework of the structured computational model. After that, three groups of features are extracted: the low-level visual features at the pixel level, the middle-level visual features at the primitive level and the high-level visual features at the global image level. Then, we employ SVR to integrate these three feature groups and predict the image quality ratings. Validation experiments are conducted on three widely used IQA databases, i.e., LIVE, CSIQ and TID2013, demonstrating that NSCHM is highly competitive with state-of-the-art NR methods in the overall performance comparison. For individual distortion types, our metric still maintains favourable performance, and the cross-validation experiments testify to the stable performance of NSCHM under mismatched conditions.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Fig. 1 Diagram of our proposed structural computational modeling of human visual system

Fig. 4  Distributions of the products of the adjacent MSCN coefficients along the horizontal orientation of an original image and its corresponding degraded versions distorted by AWGN, GBlur, JPEG and JP2K.

Fig. 5  Framework of the proposed middle-level visual feature extraction. The inputs include multiple pseudo reference images and the corresponding representation maps in VGG. The VGG-based perceptual representation includes six stages, in which the zeroth stage is the raw pixels. The outputs are the distorted image's middle-level visual features, combining the texture and structure features of the target distorted image and its corresponding multiple pseudo reference images at different convolution layers.
As shown in Figs. 6(a) and 6(b), these two images have different image complexity: Fig. 6(a) possesses simple image content while Fig. 6(b) has complicated texture information. Two common distortion types, GBlur and JPEG compression, with five distortion levels are employed to illustrate the relationship between the high-level visual feature and the distortion level.

Fig. 6  Relationship between the high-level visual feature and distortion levels of different types. (a) and (b) are two reference images selected from the TID2013 database. (c) shows the visual feature of (a) over different distortion levels distorted by GBlur. (d) shows the visual feature of (b) over different distortion levels degraded by JPEG.

Table 1  Summary of the quality-aware features extracted from the three visual layers

Table 3  SRCC values of our NSCHM and other IQA metrics on various individual distortion types on the LIVE, CSIQ and TID2013 databases. We highlight the three top-performing models with boldface.

Table 4  Cross-validation experiments under mismatched conditions using the LIVE, CSIQ and TID2013 databases

Table 5  Performance results of our NSCHM and other NR metrics in cross-validation experiments under mismatched conditions. The TID2013 database is employed as the training set and the LIVE and CSIQ databases are applied to test the models.

Table 7  Performance of the ablation study measured by SRCC, PLCC, KRCC and RMSE on the LIVE, CSIQ and TID2013 databases. LOW1 and LOW2 mean the MSCN coefficient features and the adjacent MSCN coefficient features in the low-level visual features, respectively. MIDDLE and HIGH denote the features extracted from the middle-level and high-level visual layers.

Foundation of China (Nos. 61831015 and 61901260), Key Research and Development Program of China (No. 2019YFB1405902).