A genetic programming-based convolutional neural network for image quality evaluations

Monitoring the perceptual quality of digital images is fundamentally important since digital image transmissions through the Internet continue to increase exponentially. Many automatic image quality evaluation (IQE) metrics have been developed based on image features correlated to image distortions; however, those metrics are only effective on particular image distortion types. In recent years, convolutional neural networks (CNNs) have been developed for IQEs. These CNNs first capture image features from distorted images; image qualities are then predicted based on the captured image features. Since the CNN weights are randomly initialized and are updated with respect to a loss function, image features which are strongly correlated to image quality are not guaranteed to be captured. In this paper, a hybrid deep neural network (DNN) is proposed which integrates image quality metrics to capture image features that are correlated to image quality; the approach guarantees that significant image features are included when predicting image quality. Also, a tree-based classifier, namely geometric semantic genetic programming, is proposed to perform the overall predictions by incorporating CNN predictions and image features; the approach is simpler than the fully connected network but is able to model the nonlinear image qualities. The performance of the proposed hybrid DNN is evaluated on an image quality database with 3000 distorted images. The mean correlation achieved by the proposed hybrid DNN is 0.57, which is higher than that of the other tested methods. Experimental results with the t-test, F-test and Tukey’s range test show that the proposed hybrid DNN achieves more accurate image quality predictions with a 99.9% confidence level, compared to the state-of-the-art IQE metrics and the most recently developed CNN for IQEs.


Introduction
Digital images uploaded to social media such as Facebook and Instagram are essential in our daily life for communication and entertainment. Those images are captured by a camera at the front stage and are watched by humans at the end stage of a pipeline. Between the front and end stages, image digitization and compression are performed; the processed images are transmitted through a communication channel such as the Internet or a wireless network; image decompression and reconstruction are performed before the images are watched by human users. Within the pipeline, the original images are contaminated by certain distortions. Coding, decoding, capturing, storing and displaying images generate further distortions, and those processes further degrade the visual quality. At the end of the pipeline, human users watch the distorted versions. Visual quality can be quantified by subjective quality assessments with human judgements. However, subjective quality assessments are time-consuming and impractical for online applications. To automatically evaluate image qualities, image quality evaluation (IQE) models can be used [1,2]. Those IQE models [3][4][5][6] have been developed to engage with many image processing tasks. The predictions of the IQE models can be used to optimize the many parameters of the image processing tasks in the pipeline, in order to improve the image qualities.
IQE models can be classified into three categories, namely the full-reference, reduced-reference and no-reference models. Full-reference models use the original images without distortion as a reference in order to determine the qualities of distorted images; they can predict image qualities only when the original images are available. Reduced-reference models use image features captured from the original images and compare those features with those of the distorted images, in order to perform image quality predictions [7,8]. However, in many applications the original images are not available for comparison with the distorted images. As an example, when an image is transmitted from Smartphone A to Smartphone B, the original image on Smartphone A is unknown to Smartphone B. Therefore, no-reference models are implemented in many applications, although IQE is then more challenging since no information about the original images is available. No-reference models automatically predict image qualities with respect to human perception when only the distorted images are given.
No-reference models can be categorized into two types, namely the image feature models and the machine learning models, which are the explicit and implicit typed models, respectively. The image feature model captures an image feature from distorted images which is correlated to the subjective image qualities evaluated by people. The image feature indicates the quantity of a distortion type, such as blur, blocking or ringing artifacts. Based on the distortion quantity, the image quality can be estimated [9]. However, image feature models cannot be applied for general or multiple purposes; they can be used only when the distortion type is already known or the distortion type of the images is covered by the model. The other type of no-reference models, machine learning models, map a set of image features to the subjective image qualities [3]. They are developed based on a dataset which consists of image features and subjective image qualities scored by human judgement. The image features correlated to perceived image qualities can be extracted using statistical analysis [3]. The machine learning models have been developed based on probabilistic prediction models [10,11], support vector machines [12], fuzzy methods [13,14] and simple neural networks with shallow network architectures [15,16]. The machine learning models do not rely on explicit information between image features and subjective image qualities. Motivated by the effectiveness of deep neural networks (DNNs) such as convolutional neural networks (CNNs) for object classification and detection, CNNs have been developed to perform IQEs [5]. More recently, Bosses et al. [17] have developed a CNN for predicting image qualities, which achieves better results than many recently developed no-reference IQE models [4,5,11,18,19] and other DNNs for IQEs [20,21].
So far, only standard CNNs with cascaded convolutional and pooling operations have been implemented for IQEs [17,20,21]. The CNN consists of two main stages, namely the feature extraction stage and the classification stage. In the feature extraction stage, the CNN captures the image features from the distorted images using a set of cascaded convolutional and pooling operations. In the classification stage, those image features are fed into a fully connected neural network in order to quantify image qualities. Generally, a CNN consists of millions of network weights which are initialized randomly. The initialized weights are updated iteratively by minimizing a loss function which quantifies the differences between the real image quality scores and the predictions. Therefore, some captured features are significant to classification, but some are insignificant; also, some image features are strongly correlated to image qualities but are not captured by the cascaded convolutional and pooling operations. When those excluded image features are included as an input to the classification stage, the performance of image quality evaluations can be improved.
In this paper, we propose a hybrid DNN in order to improve the feature extraction stage and the classification stage of the currently used CNNs for IQEs. In the feature extraction stage, the proposed hybrid DNN integrates significant image features which are generated by image feature models, and those significant features are fed to the classification stage. The proposed hybrid DNN ensures that significant image features are included to evaluate image qualities. Since significant image features are generally correlated to image qualities, they are likely to improve the IQEs when they are integrated in the CNN for IQE. In the classification stage, we propose another machine learning approach, namely geometric semantic genetic programming, which uses the DNN predictions and the image features to perform the IQEs. In place of the fully connected neural network which is commonly used in the classification stage of DNNs, geometric semantic genetic programming represents the model as a tree which consists of branches and nodes. Small trees are initialized and are grown iteratively to achieve a better classification. The size of the tree is flexible, unlike the fully connected neural network, which has a fixed architecture and consists of many links of which some are not significant for classification. Unnecessary branches are not included in the tree, which is therefore simpler than the cumbersome fully connected neural network.
The performance of the proposed hybrid DNN is evaluated using an image quality database which contains 3000 distorted images [22]. The backbone of the proposed hybrid DNN is implemented with the recently developed CNN [17] and is integrated with several commonly used simple image feature models, namely blocking [23], blur [24], and two ringing artifacts [25]. Here we deliberately select these four image feature models, which are computationally simpler than the modern IQE models. If these four computationally simple models are able to assist the hybrid DNN to perform better IQEs, better performance can be expected when more accurate and computationally complex models are integrated. Experimental results show that the proposed hybrid DNN is capable of performing more accurate IQEs compared to the four image feature models and the recently developed CNN [17], which outperformed many IQA models [4,5,11,18,19] and neural network models [20,21]. The rest of the paper is organized as follows: Sect. 2 discusses the commonly used IQE models and the metric for evaluating the models. Section 3 presents the mechanisms of the proposed DNN and the motivations for proposing the hybrid DNN. Section 4 presents how the proposed DNN is implemented and how its performance is evaluated. Experimental results and analysis are also shown. A conclusion is drawn in Sect. 5. In summary, the novel hybrid DNN attempts to improve the currently used CNNs, in which some significant image features cannot be included for image quality evaluations. This limitation is overcome by incorporating typical image features generated by classical models with the CNN. The hybrid DNN further improves the prediction capabilities of the CNN for IQEs.

Image quality evaluation
When an image, $I \in \mathbb{R}^{n \times m}$, with the dimension $n \times m$, is given, an IQE model, $F_{IQE}$, can be used to estimate the image quality, $q$, for $I$:

$$q = F_{IQE}(I) \tag{1}$$

To evaluate the performance of $F_{IQE}$, we can use a set of $N_I$ images, $I_i$ with $i = 1, 2, \ldots, N_I$, which are contaminated with different types of noise and different levels of distortion. Image qualities are scored by many people, and $MOS_i$ is the corresponding mean opinion score (MOS) for $I_i$. Based on $MOS_i$ and $I_i$, the Pearson linear correlation, $r$ in (2), is widely used to evaluate the performance of IQE models such as $F_{IQE}$ [2]:

$$r = \frac{\sum_{i=1}^{N_I} \left( F_{IQE}(I_i) - \bar{q} \right)\left( MOS_i - \overline{MOS} \right)}{\sqrt{\sum_{i=1}^{N_I} \left( F_{IQE}(I_i) - \bar{q} \right)^2}\,\sqrt{\sum_{i=1}^{N_I} \left( MOS_i - \overline{MOS} \right)^2}} \tag{2}$$

where $F_{IQE}(I_i)$ is the predicted image quality for the image $I_i$; $\bar{q}$ and $\overline{MOS}$ are the means of all $F_{IQE}(I_i)$ and all $MOS_i$, respectively; $r$ is the covariance between the predictions of $F_{IQE}$ and the actual mean opinion scores, divided by the product of the two standard deviations. The performance of $F_{IQE}$ is good when the correlation (2) between the predictions for all $I_i$ and their corresponding $MOS_i$ is high. The correlation is strong when $0.5 < r \le 1$ or $-1 \le r < -0.5$; the correlation is moderate when $0.3 < r \le 0.5$ or $-0.5 \le r < -0.3$; the correlation is weak when $0.1 \le r < 0.3$ or $-0.3 \le r < -0.1$; the correlation is very weak when $-0.1 \le r \le 0.1$.
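The evaluation in (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the experiments; the function names `pearson_r` and `correlation_strength` are hypothetical, and values that fall exactly on a category boundary are assigned to the weaker category, which is a choice made here for simplicity.

```python
import math


def pearson_r(predictions, mos):
    """Pearson linear correlation between predicted qualities and MOS values."""
    n = len(predictions)
    if n != len(mos) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 items")
    mean_q = sum(predictions) / n
    mean_mos = sum(mos) / n
    # Numerator of (2): co-variation of predictions and MOS around their means.
    cov = sum((q - mean_q) * (m - mean_mos) for q, m in zip(predictions, mos))
    var_q = sum((q - mean_q) ** 2 for q in predictions)
    var_mos = sum((m - mean_mos) ** 2 for m in mos)
    if var_q == 0 or var_mos == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_q * var_mos)


def correlation_strength(r):
    """Verbal category of r; boundary values go to the weaker category (assumption)."""
    magnitude = abs(r)
    if magnitude > 0.5:
        return "strong"
    if magnitude > 0.3:
        return "moderate"
    if magnitude > 0.1:
        return "weak"
    return "very weak"
```

For instance, `correlation_strength(pearson_r(predicted, mos))` labels the agreement between a list of predicted qualities and the matching list of MOS values.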

Image feature models for IQE
Some distortion types, such as blurring, blocking and ringing, are correlated to image qualities. If the type of distortion contaminating the image is known, one can quantify the distortion level using an image feature model which is particularly developed to quantify this distortion type. Based on the quantified distortion, one can estimate the image quality. Given a distorted image, $F_{IQE}$ can be developed as an image feature model to quantify its image quality when its distortion type is known. Three image features, namely blur, blocking and ringing, are commonly used to quantify image qualities [2].
1. Blur artifact is caused by camera movements, long exposure times, object movement or improper camera focus. Blur in an image can be observed as a loss of semantic information such as object shapes. The amount of blur is correlated to the loss of spatial detail in an image. Hence, blur in an image can be quantified in the spatial-frequency domain [26]. Blur artifact also affects object edges and fine details in an image. Blur can also be measured by edge spreads [27], edge gradients [28] and edge widths [29], which are correlated to the magnitude of blur.
2. Blocking artifact is caused by block-based image coding at low bit rates, packet loss in image transmissions or block-based image compression. The blocking artifact can be observed as artificial horizontal and vertical contours at block edges. The blocking artifact can also be caused by quantization of pixels in image blocks, which causes image discontinuity at block boundaries. To quantify the magnitude of the blocking artifact, one can quantify the edge strength at block boundaries. Blocking magnitudes can be quantified by measuring the energy level of the blocky signal [23], detecting step edges with low amplitudes [30] and using a discrete Fourier transform (DFT)-based measure of the blocking artifact [31].
3. Ringing artifact is caused by the coarse quantization of the discrete wavelet transform or improper truncation of high-frequency components. It can also be generated by high-frequency irregularities during the reconstruction. Ringing distortion can be observed around high-contrast edges in smooth texture regions. Ringing distortion can be quantified by edge-detection techniques which measure the overall magnitude of edge spread in an image. Those techniques include principal components analysis [32], quantifying pixel and edge distortion [33], and changes in the statistical regularities of DWT/DCT coefficients [34].
However, it is ineffective to implement $F_{IQE}$ based on the quantity of a single distortion type, since the distortion type is not known in most practical applications. When only a single distortion type is used to develop $F_{IQE}$, other distortion types cannot be quantified. The resulting $F_{IQE}$ is only sensitive to a particular distortion type, although an image can be contaminated by many distortion types. Therefore, developing $F_{IQE}$ based on a single distortion type is impractical.
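To make the idea of quantifying edge strength at block boundaries concrete, the following Python sketch compares pixel differences across block boundaries with those inside blocks. It is only an illustration and is not the blocking model of [23]; the function name `blockiness`, the default block size of 8 pixels, the restriction to horizontal differences and the grayscale list-of-rows image format are all assumptions made for this example.

```python
def blockiness(image, block=8):
    """Ratio of mean absolute horizontal pixel differences at block
    boundaries to those inside blocks.

    image: list of rows, each a list of grayscale values (assumed format).
    Values well above 1 suggest visible vertical block edges.
    """
    boundary_diffs = []
    inside_diffs = []
    for row in image:
        for x in range(len(row) - 1):
            diff = abs(row[x + 1] - row[x])
            # The pair (x, x + 1) straddles a block edge when x + 1 is a
            # multiple of the block size.
            if (x + 1) % block == 0:
                boundary_diffs.append(diff)
            else:
                inside_diffs.append(diff)
    if not boundary_diffs or not inside_diffs:
        raise ValueError("image is too narrow for the chosen block size")
    mean_boundary = sum(boundary_diffs) / len(boundary_diffs)
    mean_inside = sum(inside_diffs) / len(inside_diffs)
    # Small constant avoids division by zero for perfectly flat blocks.
    return mean_boundary / (mean_inside + 1e-9)
```

A measure of this kind responds to blocking only, which illustrates why a model built around a single distortion type cannot serve as a general-purpose $F_{IQE}$.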

Machine learning for IQE
To overcome the limitation of image feature models, $F_{IQE}$ can be developed by machine learning, in which the model is trained automatically using distorted images and their MOSs. Machine learning involves two main steps [35]. First, image features corresponding to different distortion types are captured from the distorted image; second, $F_{IQE}$ is developed by learning the relationship between the captured image features and the MOSs. $F_{IQE}$ is capable of predicting visual image qualities across different distortion types and image contents when images contaminated with more than one distortion type are used to develop $F_{IQE}$ [4,5,11,36]. Recent approaches based on machine learning have been developed to evaluate image qualities when training data with images as input and MOSs as labels is available. Artusi et al. [37] proposed a DNN to evaluate image qualities. However, this approach is not a no-reference image quality metric, since the original undistorted image is used as another input to the DNN. The image quality of the distorted image is compared to that of the original undistorted image. The approach is not practical for many applications such as image transmission or compression, since the original undistorted images are generally not available. Apart from this full-reference image quality metric, no-reference image quality metrics based on DNNs have been developed to perform image quality evaluations for particular image types such as magnetic resonance images [38], sonar images [39] and images for liquid crystal displays [40]. Also, an approach based on genetic programming was developed to evaluate the qualities of fish images [41]. Those approaches attempt to evaluate image qualities before performing further image processing or object recognition. However, those approaches focus only on a particular object type, and only the image features of those objects are captured.
Those approaches are not developed to evaluate general images captured by cameras, where the images are contaminated by common sources of distortion such as image digitization and compression, Internet image transmission, wireless networks, and image decompression and reconstruction. Although Bi et al. [42] have developed a genetic programming-based model to evaluate the qualities of general images, the model is only effective for images which are distorted with either blurring, lowered contrast or Gaussian noise. The model is not developed for multiple image distortion types.
A more robust metric based on a convolutional neural network (CNN) has been used to develop $F_{IQE}$, where the CNN is trained with a set of distorted images and the MOSs of the corresponding distorted images [17]. The performance of the CNN has been evaluated on the TID database with 24 image distortion types, and the CNN achieves better performance than many no-reference IQE models [4,5,11,18,19] and other deep neural networks [20,21]. The CNN was developed since CNNs are capable and effective in performing object classification and detection applications involving images. A distorted image is the input of the CNN, and the corresponding MOS is the label. The fundamental building blocks of a CNN are illustrated in Fig. 1. The topology of the CNN consists of many convolution and pooling layers. The image patch is the input to the first layer. The CNN uses multiple layers with pooling and convolution kernels to generate a set of features, $f_{CNN}$. The final layer uses $f_{CNN}$ to predict the image quality, $q_{CNN}$; a fully connected neural network is generally used as the final layer.
The CNN weights of the pooling and convolution kernels and those of the fully connected neural network are generally determined by the back-propagation algorithm. First, the CNN weights are randomly initialized. The back-propagation algorithm uses the loss function to determine the prediction error between the actual MOSs and the estimated image qualities. In each iteration, the CNN weights are updated based on the prediction error. The generalization capability of the CNN can be improved through the iterations. When the CNN weights are properly fine-tuned iteratively, a lower prediction error can be obtained. After running the back-propagation algorithm for a certain number of iterations, satisfactory predictions can be achieved by the CNN.
The CNN weights for the pooling and convolution kernels are optimized with respect to the loss function and the training dataset. The pooling and convolution kernels in CNNs only generate a limited set of features, $f_{CNN}$. The features in $f_{CNN}$ only cover the distortion types in the training dataset. In contrast, when the image quality is evaluated based on an image feature model, the corresponding distortion type is guaranteed to be quantified on the image. For example, when a model is particularly developed to quantify image blur, the blur distortion on the image can be fully quantified. The image feature models do not have this limitation of the CNN, which relies only on the training dataset and therefore cannot fully quantify some image features.

Proposed hybrid deep neural network
In this paper, we propose a hybrid deep neural network, namely the hybrid DNN, which integrates CNN evaluations and image features captured from distortion metrics, in order to predict image qualities. The hybrid DNN attempts to overcome the limitations of the CNN approach and of the single-distortion metrics. The hybrid DNN is illustrated in Fig. 2; it uses geometric semantic genetic programming (GSGP) [43,44] to predict image qualities based on CNN predictions and image features captured from image feature models. The GSGP is proposed in preference to neural networks and regression models for two reasons. (1) The GSGP is a heuristic algorithm which can keep exploring better solutions as the algorithm continues to run. In contrast, the coefficients of a regression model and the weights of a neural network are determined by the least squares method and the back-propagation method, respectively, which only reach local optima. (2) When using a regression model, the model structure, including the interactions and orders, has to be predefined. Also, in a neural network, the network configuration has to be predefined based on the users' experience. The models generated by the GSGP have more variants, and the model structures are optimized automatically as the algorithm continues to run. Also, the models generated by the GSGP are more transparent than neural networks. The hybrid DNN model, $F_{DNNGP}$ in (3), integrates the image features and the CNN prediction by using the GSGP, in order to determine the image quality, $y_{DNNGP}$:

$$y_{DNNGP} = F_{DNNGP}\left( f_{Dis}, q_{CNN} \right) \tag{3}$$

where $f_{Dis} = \{ f_{Dis,1}, f_{Dis,2}, \ldots, f_{Dis,N_d} \}$ is a set of $N_d$ image features, and $q_{CNN}$ is the CNN prediction. $F_{DNNGP}$ attempts to include the image features which are generated by the image feature models and are not included by the CNN.

Algorithmic flow
The DNNGP-Algorithm in Algorithm 1 is proposed to generate $F_{DNNGP}$. The flow of the DNNGP-Algorithm is illustrated in Fig. 3. To develop $F_{DNNGP}$, a set of $N_D$ IQA samples, $\{ f^{j}_{Dis}, q^{j}_{CNN} \mid MOS_j \}$ with $j = 1, 2, \ldots, N_D$, is collected, where $q^{j}_{CNN}$ and $f^{j}_{Dis}$ are the $j$th data arguments, and $MOS_j$ is the $j$th data label; $q^{j}_{CNN}$ and $f^{j}_{Dis}$ are the CNN prediction and the image features of the $j$th image sample, respectively; $MOS_j$ is the mean opinion score of the $j$th image sample. $F_{DNNGP}$ attempts to correlate the data arguments and the data label. In the DNNGP-Algorithm, the predefined number of generations is denoted as $C_{max}$. A population of $N_{POP}$ models, namely $F^{i}_{DNNGP,k}$ with $k = 1, 2, \ldots, N_{POP}$, is initialized randomly, where $i$ is the generation number of the genetic process. When $i = 0$, the genetic process is at the first generation. Each $F^{i}_{DNNGP,k}$ is represented as a geometric semantic tree, where the arithmetic operations, $\{+, -, \times, /\}$, are used as the tree nodes. The image features, $f_{Dis,1}, f_{Dis,2}, \ldots, f_{Dis,N_d}$, and the CNN prediction, $q_{CNN}$, are used as the tree terminals. Nonlinear functions, such as the exponential function and the sinusoid function, could also be used in the nodes. In the proposed DNNGP-Algorithm, only the arithmetic operations are used, since their execution time is shorter than that of the nonlinear functions.
The proposed DNNGP-Algorithm reproduces new models based on two geometric semantic operators, namely geometric semantic crossover and geometric semantic mutation [45], which are discussed in Sect. 3.4. An empty set, namely $S$, is generated to store the generalization capabilities of all models in the current $i$th generation. The generalization capabilities of the models are evaluated by the proposed fitness function, namely FIT, which is discussed in Sect. 3.3. FIT evaluates the correlation between the predictions of $F_{DNNGP}$ and the actual MOSs of the images. The commonly used tournament selection method is used to select the good models from the current $i$th generation into the $(i+1)$th generation.
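The tournament selection step mentioned above can be sketched as follows. This is a generic illustration of tournament selection rather than the exact routine of the DNNGP-Algorithm; the function name `tournament_select`, the tournament size of 3 and the convention that a larger fitness value (a higher FIT) is better are assumptions made here.

```python
import random


def tournament_select(population, fitness, n_select, tournament_size=3, rng=None):
    """Pick n_select models; each pick is the fittest of a random tournament.

    population: list of candidate models (any objects).
    fitness:    list of fitness values aligned with population (higher is better).
    """
    rng = rng or random.Random()
    selected = []
    for _ in range(n_select):
        # Draw distinct contenders, then keep the one with the best fitness.
        contenders = rng.sample(range(len(population)), tournament_size)
        winner = max(contenders, key=lambda idx: fitness[idx])
        selected.append(population[winner])
    return selected
```

A small tournament keeps weaker models in play for longer, whereas a large tournament selects almost exclusively the best models of the current generation.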

Model representation for image quality predictions
The model, which is generated by the DNNGP-Algorithm, predicts the image quality, $y_{DNNGP}$, when the image features, $f_{Dis,1}, f_{Dis,2}, \ldots, f_{Dis,N_d}$, and the CNN prediction, $q_{CNN}$, are given. In the algorithm, the model, $F^{i}_{DNNGP,k}$, is represented as a regular expression, given as (4) [17].
Here an example of regular expression is shown in (5) and Fig. 4. In this example, the final prediction is correlated to the CNN prediction, q CNN , and also it is correlated to the image features, f Dis 2 , f Dis 8 and f Dis 10 . The final predictions are more robust to estimate image qualities where the images are contaminated with different distortion types.
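A tree of this kind can be evaluated recursively. The sketch below uses a made-up tree over the same terminals as the example (it is not the expression in (5)); the nested-tuple encoding, the terminal names such as `q_cnn` and `f_dis_2`, and the protected division that returns 1.0 for a zero denominator are assumptions introduced for illustration.

```python
import operator


def protected_div(a, b):
    """Division that returns 1.0 for a (near-)zero denominator (assumed convention)."""
    return a / b if abs(b) > 1e-12 else 1.0


OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
}


def evaluate(tree, terminals):
    """Evaluate a tree given a dict of terminal values.

    A tree is either (op, left, right), a terminal name (str), or a constant.
    """
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, terminals), evaluate(right, terminals))
    if isinstance(tree, str):
        return terminals[tree]
    return float(tree)


# Hypothetical model: q_cnn + f_dis_2 * (f_dis_8 - f_dis_10)
example_tree = ("+", "q_cnn", ("*", "f_dis_2", ("-", "f_dis_8", "f_dis_10")))
```

Calling `evaluate(example_tree, values)` with a dictionary holding the CNN prediction and the three feature values returns the combined quality prediction for one image.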

Fitness evaluations
The correlation function in (2) is reformulated as the fitness function of the proposed DNNGP-Algorithm, FIT, in order to evaluate the performance of the image quality prediction model, $F^{i}_{DNNGP,k}$. FIT is defined in (6), which incorporates the real $MOS_j$ and the image quality predictions obtained by $F^{i}_{DNNGP,k}$ with respect to the $j$th IQE image sample, $\{ f^{j}_{Dis}, q^{j}_{CNN} \mid MOS_j \}$ with $j = 1, 2, \ldots, N_D$:

$$FIT\left( F^{i}_{DNNGP,k} \right) = \frac{\sum_{j=1}^{N_D} \left( y^{j}_{DNNGP} - \bar{y}_{DNNGP} \right)\left( MOS_j - \overline{MOS} \right)}{\sqrt{\sum_{j=1}^{N_D} \left( y^{j}_{DNNGP} - \bar{y}_{DNNGP} \right)^2}\,\sqrt{\sum_{j=1}^{N_D} \left( MOS_j - \overline{MOS} \right)^2}} \tag{6}$$

where $y^{j}_{DNNGP} = F^{i}_{DNNGP,k}\left( f^{j}_{Dis}, q^{j}_{CNN} \right)$ is the image quality prediction for the $j$th sample; $\bar{y}_{DNNGP}$ is the mean of the image quality predictions obtained by $F^{i}_{DNNGP,k}$; and $\overline{MOS}$ is the average of the mean opinion scores of the IQE image samples. The denominator is the product of the standard deviations of the image quality predictions and the real MOSs. The correlation indicates whether the real MOSs can be explained by the image quality predictions. The performance of $F^{i}_{DNNGP,k}$ is good when the correlation between the image quality predictions and the real MOSs is high. In machine learning applications, the mean square error and the mean absolute error are commonly used to indicate whether the real observations and the machine predictions are close. Since image qualities are mostly evaluated by human judgement, which is subjective and perceptual, the evaluations are not highly precise compared to machine judgements. Therefore, the mean square error and mean absolute error are not the most suitable metrics to be used as the fitness function. Instead, the correlation between the subjective MOSs and the image quality predictions of $F^{i}_{DNNGP,k}$ in (6) is used as the fitness function. The correlation indicates the consistency between the machine predictions and the human visual judgements.
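One way to see how the two kinds of criterion differ is a small numerical example. The sketch below is illustrative only: the scores are made-up numbers, not experimental data, and the helper names `pearson` and `mse` are hypothetical. Predictions that rank the images exactly as the MOSs do but sit on a shifted scale obtain a correlation of 1 despite a large mean square error, whereas predictions that hover around the average MOS obtain a smaller mean square error while being far less consistent with the human ranking.

```python
import math


def pearson(x, y):
    """Pearson linear correlation of two equally sized samples."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mse(x, y):
    """Mean square error between two equally sized samples."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Made-up scores for five images (illustrative only).
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [m + 2.0 for m in mos]         # same ranking, constant offset
near_mean = [3.0, 3.0, 3.2, 2.8, 3.0]    # close to the average, poor ranking

for name, pred in (("shifted", shifted), ("near_mean", near_mean)):
    print(name, "r =", round(pearson(pred, mos), 3), "MSE =", round(mse(pred, mos), 3))
```

Under a mean-square-error fitness the second model would be preferred, while under FIT the first one is, which matches the aim of rewarding consistency with human judgements rather than exact numerical agreement.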

Reproductions of image prediction models
After evaluating the fitness of each model F i DNNGP;k using (6), some models in the current generation are selected to perform reproductions for the new generation. The new models are generated as candidates which have potential to have better capabilities to predict image qualities. The new model is reproduced by incorporating an old model and a random model in the form of (4). The components of the new model are created by exchanging some components of the random model. In the DNNGP-Algorithm, the following two operators, namely geometric semantic crossover and geometric semantic mutation [45], are used to reproduce new models.
a. Geometric semantic crossover, namely CRO, incorporates the components of two current models, so that the new model has the potential to generate better predictions of image qualities. CRO first generates a random model, $R$. CRO then performs a mapping from the 2D-dimension $n \times n$ to the 1D-dimension $n$. The new model, newmodel, is reproduced from the two current models and $R'$, where $R'$ is given by $R' = \frac{1}{1 + e^{R}}$.
b. Geometric semantic mutation, namely MUT, creates a new model based on a single model, $F^{i}_{DNNGP,k}$. The new model involves new components which have the potential to generate better image quality predictions. First, MUT creates two random models, $R_1$ and $R_2$. MUT then performs a mapping in the 1D-dimension $n$. A new model, namely newmodel, is created from $F^{i}_{DNNGP,k}$, $R'_1$, $R'_2$ and a constant, $c_m$, where $R'_1 = \frac{1}{1 + e^{R_1}}$ and $R'_2 = \frac{1}{1 + e^{R_2}}$ are the functional values of $R_1$ and $R_2$, respectively.
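The two operators can be sketched in Python under the assumption that they follow the standard definitions in the GSGP literature [45]: the crossover child is $R' \cdot P_1 + (1 - R') \cdot P_2$ for two parent models $P_1$ and $P_2$, and the mutation child is $P + c_m \cdot (R'_1 - R'_2)$ for a single parent $P$. Models are treated here as plain Python callables that take a dictionary of terminal values; the function names, the default value of `c_m` and the clamping inside `squash` are illustrative choices, not details taken from the DNNGP-Algorithm.

```python
import math


def squash(value):
    """Map a real value into (0, 1) using R' = 1 / (1 + e^R)."""
    value = max(-700.0, min(700.0, value))  # keep math.exp within float range
    return 1.0 / (1.0 + math.exp(value))


def geometric_crossover(parent_a, parent_b, random_model):
    """Child output is a convex combination of the two parents' outputs."""
    def child(x):
        w = squash(random_model(x))
        return w * parent_a(x) + (1.0 - w) * parent_b(x)
    return child


def geometric_mutation(parent, random_1, random_2, c_m=0.1):
    """Child output is the parent's output plus a bounded perturbation."""
    def child(x):
        return parent(x) + c_m * (squash(random_1(x)) - squash(random_2(x)))
    return child
```

Because the crossover weight lies in (0, 1), the child's prediction for every sample lies between the predictions of its two parents, and the mutation perturbs a prediction by less than $c_m$ in magnitude; these are the properties that make the operators geometric on the semantics of the models.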

Experimental results and analysis
In this section, we discuss how the proposed hybrid DNN is implemented and evaluate its performance. Section 4.1 discusses the databases which we use to evaluate the algorithmic performance. Section 4.2 discusses how the proposed hybrid DNN is implemented. Section 4.3 presents the experimental results and the comparison with other methods, including the CNN and the commonly used image feature models for IQEs.

Image quality assessment database
The prediction capability of the proposed hybrid DNN is evaluated by predicting the image qualities of the commonly used image quality assessment database, namely TID2013 [22], which was developed for academic research. The TID2013 database is an extended version of the older TID2008 database [46]. TID2013 contains 3000 distorted images which were developed based on 25 reference images. The distorted images are contaminated with 24 distortion types at 5 levels, so that each reference image yields 24 × 5 = 120 distorted images. Each distorted image is created by contaminating a reference image with one distortion type at a single distortion level. Those distortions are caused by camera operations, image transmissions, image compressions, and conventional image processing. The TID2013 database also contains images with exotic distortions, which do not exist in general applications of image processing but are challenging for image quality prediction algorithms. The image qualities of the distorted images in TID2013 were evaluated in 985 subjective experiments, which involved human observers from 5 countries: Finland, France, Italy, Ukraine and USA. MOSs from 0 to 100 were scored and were rescaled to the range from 1 to 5. The images contaminated with the 24 distortion types are shown in Fig. 5. These 24 distortion types cover many image processing applications. TID2013 is a commonly used image quality assessment database for assessing the performance of image quality prediction models [22]. We evaluate the prediction capability of the proposed hybrid DNN based on (2), which is the correlation between the algorithmic predictions and the true visual qualities of the images.
Besides TID2013, another image quality assessment database, namely the LIVE database, is used to develop the hybrid DNN, which estimates the image quality when a distorted image is given [47,48]. The distorted images in the LIVE database are created based on 29 reference images without distortion. These 29 images were contaminated with 5 distortion types, namely additive white Gaussian noise, Gaussian blur, a simulated fast fading Rayleigh channel, JP2K compression and JPEG compression. The images were contaminated at many distortion levels. The image qualities of the distorted images were evaluated by more than 25,000 human image quality judgements. Quality difference scores of the distorted images were obtained by comparing the reference images and the distorted images. The quality difference scores are in the range from 0 to 100. When the quality difference score is lower, the image quality is better. Hence, the quality difference score differs from the MOS: for the MOS, higher is better, while for the quality difference score, lower is better.
To evaluate the performance of the proposed DNNGP-Algorithm, the LIVE database is used to train the hybrid DNN, and the TID2013 database is used to validate the generalization capabilities of the trained hybrid DNN. TID2013 contains 24 distortion types whereas LIVE contains only 5, so when the LIVE database is used to train the hybrid DNN, some image distortions included in TID2013 are not covered. These experiments evaluate whether the performance of the proposed hybrid DNN is better than that of the CNN [17] and the other compared algorithms [23][24][25]. The proposed hybrid DNN uses classical image feature models to capture distortion types which are correlated to the image qualities. The approach overcomes the limitation of the CNN, whose training relies only on the training database. If some image distortion types are not included in the training database, the CNN is unlikely to estimate the qualities of images with those distortions well, since the CNN relies fully on the training samples. Also, each of the four image feature models is developed only for a particular distortion type. We attempt to verify whether the performance difference between the proposed hybrid DNN and the other tested methods is significant.

Algorithmic implementation of proposed hybrid DNN
The implementation of the proposed hybrid DNN in Fig. 2 consists of three main components, namely the GSGP-Algorithm, the CNN and the image feature models, which are illustrated in Fig. 6. The GSGP-Algorithm [44] is implemented since it is computationally simpler than the classical genetic programming models and is more capable of generating nonlinear models than statistical regression. The CNN is implemented with a convolutional neural network, namely Boses-CNN [17], which was recently developed for predicting image qualities; experimental results showed that the Boses-CNN is significantly better than the commonly used image quality metrics. In the proposed hybrid DNN, the image feature models are selected as Blocking artifact, Blur artifact, Ringing artifact (edge magnitude) and Ringing artifact (edge gradient) [49], which are correlated to the image qualities and have small computational costs. Here we intentionally do not implement the most modern and computationally complex models. We deliberately select these four well-established models, which are simple rather than the most recent. If these four computationally simple models already improve the image quality predictions, even better performance can be expected from the proposed hybrid DNN when more modern and computationally complex models are integrated. The implementation details of the GSGP-Algorithm are given in Sect. 4.

GSGP-Algorithm
The GSGP-Algorithm integrates the four image feature models and the Boses-CNN, in order to develop the hybrid DNN for IQEs. The following algorithmic parameters are implemented in the GSGP-Algorithm: The image quality samples in the TID database were used to generate the proposed hybrid DNN. Since the TID database is created from 25 reference images, 25-fold cross-validation was used to evaluate the prediction capability of the proposed hybrid DNN. The image quality samples were divided into 25 folds, where each fold contains the distorted images derived from one of the reference images. In each validation, 24 of the 25 folds were used to develop the model, and the remaining fold was used to validate the prediction capability of the model. The performance of the models was evaluated based on the Pearson linear correlation in (2), which indicates the correlation between the actual MOSs and the model predictions. The proposed GSGP-Algorithm was coded based on the C++ framework from the public source of GSGP 3 . The GSGP-Algorithm was run on a P510 Xeon E5-1630 v4 machine with 32 GB memory and two GTX1080 GPU cards.
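The fold structure described above can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the tuple layout `(reference_id, feature, mos)`, the helper names, and the stand-in `fit_line` model (a least-squares line on one feature, used here in place of the hybrid DNN) are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def leave_one_reference_out(samples):
    """Yield (held_out_ref, train, test), one fold per reference image.

    `samples` is a list of (reference_id, feature, mos) tuples (assumed layout).
    """
    by_ref = defaultdict(list)
    for s in samples:
        by_ref[s[0]].append(s)
    for ref in sorted(by_ref):
        test = by_ref[ref]
        train = [s for r, group in by_ref.items() if r != ref for s in group]
        yield ref, train, test

def fit_line(train):
    """Hypothetical stand-in model: least-squares line from feature to MOS."""
    xs = [f for _, f, _ in train]
    ys = [m for _, _, m in train]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda f: my + slope * (f - mx)

def cross_validate(samples, fit):
    """Return the Pearson correlation on the held-out fold of each reference."""
    scores = {}
    for ref, train, test in leave_one_reference_out(samples):
        model = fit(train)
        predictions = [model(f) for _, f, _ in test]
        actual = [m for _, _, m in test]
        scores[ref] = pearson(actual, predictions)
    return scores
```

With 25 reference images, `cross_validate` returns 25 correlations, one per held-out reference; grouping by reference keeps all distorted versions of the same scene out of the training folds.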

CNN
Boses-CNN [17] was implemented in the proposed hybrid DNN. The backbone of Boses-CNN is based on the architecture of VGGnet, which is built from cascaded convolution kernels of size 3×3 [50]. The VGGnet consists of eight convolution layers and four maxpool layers. Since the VGGnet was only developed for images with the size of 224×224 pixels, extra image resizing was performed between the images and the input of the VGGnet by two convolution layers and a single maxpool layer. The output of VGGnet is a set of image features which are fed into two cascaded fully connected neural networks and a single pooling layer. In order to avoid overtraining the Boses-CNNs, dropout regularization is used when determining the network parameters. Based on [17], the total number of parameters in Boses-CNN is about 5.2 million. A detailed description of the Boses-CNN can be found in [17]. The implementable Boses-CNNs are also available for public use; they have been trained on two image quality databases, namely TID and LIVE, and they can be downloaded 4 .
Fig. 6 Implementation of the proposed hybrid DNN
The proposed hybrid model is integrated with the Boses-CNN which is trained by the LIVE database. In [17], experiments have been conducted to evaluate the performance of the Boses-CNNs for predicting the image qualities.

Image feature models
The following image features (Fig. 7) 5 , namely blocking, blur and ringing artifacts which commonly exist in images, are integrated into the proposed hybrid DNN. The Boses-CNN in the hybrid DNN is trained on the LIVE database, which contains the distortion types of blur and image compression. Although those distortion types are similar to those of the following image features, the Boses-CNN is only trained on a limited number of images contaminated by those distortion types. The Boses-CNN may not cover all the levels of those distortions. The proposed hybrid DNN integrates the CNN predictions and the four image feature models in order to compensate for those uncovered levels and thereby improve the image quality predictions.
• Blocking artifacts: The metric of blocking artifacts was developed by Wang et al. [23]. Blocking artifacts are usually caused by image compression such as JPEG and JPEG2000. The artifacts appear continuously at block boundaries and are caused by the individual quantization of those blocks. Three parameters indicate the blocking measures. The first parameter quantifies the blocks by measuring the overall differences across block boundaries. The other two parameters quantify image blurs in the horizontal and vertical directions based on differences between blocks and a zero-crossing rate, respectively. A high-order polynomial function is used to combine the three parameters in order to quantify the blocking artifacts.
• Blur artifacts: The metric of blur artifacts was developed by Marziliano et al. [24]. Blur reduces spatial detail and object shapes in an image. Blur is caused by an overall increase in edge smoothness or a reduction in edge sharpness. The metric quantifies the blur by measuring the widths of vertical edges in the image. Only vertical edges are detected since this requires less computation than including both vertical and horizontal edges. The metric first uses the Sobel filter, which is commonly used for edge detection [52], to detect vertical edges. A blur measure is quantified at each detected edge. The overall blur is quantified as the average of all those blur measures.
• Ringing (edge magnitude and gradient): The metric of ringing artifacts was developed by Saha et al. [25]. Ringing artifacts appear on high-contrast edges which lie in smooth textures. Ringing artifacts are caused by image processing or transmission, where high-frequency components exist in the images. In the metric of ringing artifacts, two parameters are quantified. First, Sobel edge detection [52] is used to generate the edge image from the original one. Based on the edge image, the first parameter is measured as the overall edge magnitude. The second parameter indicates the overall edge gradients in both vertical and horizontal directions. These two parameters are used to quantify the image activity which is correlated to ringing artifacts (Fig. 7).
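The edge-based and block-boundary quantities described in the list above can be illustrated with a minimal Python sketch. It is not the implementation of the metrics in [23] or [25]: the function names, the list-of-rows grayscale image layout, and the simple averaging are assumptions for illustration, and only the first blocking parameter (differences across block boundaries) and the overall Sobel edge magnitude are covered.

```python
import math

def sobel(img):
    """Return (gx, gy) Sobel responses; border pixels are left at zero."""
    h, w = len(img), len(img[0])
    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx[r][c] = (img[r - 1][c + 1] + 2 * img[r][c + 1] + img[r + 1][c + 1]
                        - img[r - 1][c - 1] - 2 * img[r][c - 1] - img[r + 1][c - 1])
            # Vertical gradient: bottom row minus top row.
            gy[r][c] = (img[r + 1][c - 1] + 2 * img[r + 1][c] + img[r + 1][c + 1]
                        - img[r - 1][c - 1] - 2 * img[r - 1][c] - img[r - 1][c + 1])
    return gx, gy

def mean_edge_magnitude(img):
    """Average Sobel gradient magnitude over the interior pixels."""
    gx, gy = sobel(img)
    h, w = len(img), len(img[0])
    total = sum(math.hypot(gx[r][c], gy[r][c])
                for r in range(1, h - 1) for c in range(1, w - 1))
    return total / ((h - 2) * (w - 2))

def boundary_difference(img, block=8):
    """Mean absolute horizontal jump across vertical block boundaries."""
    h, w = len(img), len(img[0])
    diffs = [abs(img[r][c] - img[r][c - 1])
             for r in range(h) for c in range(block, w, block)]
    # An image narrower than two blocks has no interior boundary.
    return sum(diffs) / len(diffs) if diffs else 0.0
```

A flat image gives zero for both quantities, while an image with a sharp vertical step aligned to a block boundary gives a large `boundary_difference`, which is the kind of behaviour a blocking measure relies on.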

Experimental results and statistical tests
Since the GSGP-Algorithm is a heuristic algorithm, random operations such as population initialization, mutation and crossover are involved. Different hybrid DNNs are generated in different runs, even though the same parameters are used in the GSGP-Algorithm. Therefore, the GSGP-Algorithm was run 30 times. 25-fold cross-validations were conducted in which each trial corresponded to a reference image. The averages over the 30 runs for each reference image were recorded and are shown in Table 1. All four commonly used image quality metrics achieved poorer results than the proposed hybrid DNNs and the Boses-CNN, since each of the four metrics addresses only one of the four artifacts, namely blocking artifacts, blur artifacts and the two ringing artifacts (edge magnitude and gradient), respectively. If an image quality metric is used to measure images with other distortion types, those distortions cannot be quantified. For example, the metric for quantifying blur artifacts is not effective for images with blocking artifacts. If this metric is used to measure images contaminated with blocking artifacts, the predictions of image quality are likely to be poor. Therefore, quantifying a single distortion type is not enough to quantify the overall image distortion, since the image database contains distorted images which are contaminated by other distortion types. Since both the Boses-CNN and the proposed hybrid DNNs are developed from images contaminated with many distortion types, their overall image quality predictions are generally better than those of the four image quality metrics.
When the performance of the Boses-CNN and the proposed hybrid DNNs is compared, the correlations in Table 1 show that the proposed hybrid DNN outperforms the Boses-CNN. The image quality predictions of the proposed hybrid DNN integrate both the predictions from the Boses-CNN and the image quality metrics. The exact quantity of a distortion type can be evaluated by the corresponding distortion metric. The image quality metrics do not share the limitation of the Boses-CNN, which is trained on a limited number of features from a database. When images contaminated with some distortion types are not included in training, quality predictions for those image types can be poorer. Hence, the overall predictions of the Boses-CNN are also likely to be poorer, and the proposed hybrid DNN is generally better than the Boses-CNN.

Statistical tests
To further compare the prediction performance of the four commonly used image quality metrics, the Boses-CNN and the proposed hybrid DNN, the t-test [53] was used. The t-test evaluated the significance of the hypothesis that the Pearson linear correlations achieved by the proposed hybrid DNN are larger than those of the four commonly used image quality metrics and the Boses-CNN. Table 2 shows the P-values, which are all reported as 0.0000 for the hybrid DNN compared to the four image quality metrics. The P-value is 0.0002 for the hybrid DNN compared to the Boses-CNN. Therefore, we can claim with high confidence that the hybrid DNN generated by the DNNGA-Algorithm produces better results than the other five tested methods for predicting image qualities.
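As an illustration of this kind of comparison, the following Python sketch computes a one-sided paired t-test on two lists of correlations. The source does not state whether a paired or an unpaired test was used, so the pairing (one correlation per reference image for each method) is an assumption, as are the function names; the tail probability is obtained by numerically integrating the Student t density rather than by a statistics library.

```python
import math

def t_pdf(t, dof):
    """Probability density of Student's t distribution with `dof` degrees of freedom."""
    coef = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return coef * (1 + t * t / dof) ** (-(dof + 1) / 2)

def t_upper_tail(t, dof, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|] and the symmetry of the density."""
    a = abs(t)
    h = a / steps
    s = t_pdf(0.0, dof) + t_pdf(a, dof)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    upper = 0.5 - s * h / 3
    return upper if t >= 0 else 1 - upper

def paired_t_test(a, b):
    """Return (t, p) for the one-sided hypothesis mean(a - b) > 0.

    Assumes the differences are not all identical (non-zero variance).
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    t = mean / math.sqrt(var / n)
    return t, t_upper_tail(t, n - 1)
```

Calling `paired_t_test(hybrid_correlations, baseline_correlations)` with two equal-length lists (hypothetical variable names) returns the t statistic and the one-sided P-value; a small P-value supports the hypothesis that the first method achieves larger correlations.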
Since the P-values are all zero or close to zero, the post-hoc analysis, namely Tukey's range test, is used to further validate whether the performance of the proposed hybrid DNN generated by the proposed DNNGA-Algorithm is significantly different from that of the other five tested methods. The results in Fig. 8a-e show that the mean Pearson linear correlation achieved by the proposed hybrid DNNs is better than those achieved by the four image quality metrics and the Boses-CNN, respectively. These results further validate the performance of the proposed hybrid DNN.
These validations further demonstrate that the proposed hybrid DNN, which is a novel version of the CNN, is better than the recently developed Boses-CNN. The prediction of the hybrid DNN integrates the Boses-CNN prediction and the four image quality metrics. The integration is performed by the proposed DNNGA-Algorithm, which combines the five predictions in order to achieve better predictions with higher Pearson linear correlations. Therefore, better results can be achieved by the proposed hybrid DNN.
These experimental results demonstrate that the proposed hybrid DNN achieves higher correlations to the real MOSs in the TID database than the five tested methods, namely the blocking artifact metric, the blur artifact metric, the ringing artifact (edge magnitude) metric, the ringing artifact (edge gradient) metric and the Boses-CNN. The validation results from the t-test, F-test and Tukey's range test show that the proposed approach achieves significantly more accurate quality predictions than the five tested methods. These results also show that the proposed hybrid DNN outperforms the Boses-CNN when some distortion types are not included in its training. Better results are achieved by the hybrid DNN since it integrates the image features generated by the four image feature models. Some distortion types, which are not covered by the Boses-CNN, are thereby included in the hybrid DNN. The hybrid DNN is also better than any individual image feature model, which is only robust on a single distortion type. Therefore, better results can be achieved by the proposed hybrid DNN.
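The way geometric semantic operators can combine several predictors is illustrated by the following Python sketch. It follows the commonly cited definitions of geometric semantic crossover and mutation and works directly on semantics vectors (the outputs of an individual on the training images). It is not the authors' C++ code: the row layout (four metric values followed by the CNN prediction), the function names, the mutation step of 0.1, and the use of a logistic of a random weighted sum in place of a random expression tree are all assumptions.

```python
import math
import random

def logistic(z):
    """Squash a real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def random_semantics(inputs, rng):
    """Outputs of a hypothetical random expression, bounded in (0, 1).

    Each row of `inputs` holds the five predictor values for one image.
    """
    weights = [rng.uniform(-1.0, 1.0) for _ in inputs[0]]
    return [logistic(sum(w * v for w, v in zip(weights, row))) for row in inputs]

def gs_crossover(parent1, parent2, inputs, rng):
    """Offspring = r * parent1 + (1 - r) * parent2, with r in (0, 1) per image."""
    r = random_semantics(inputs, rng)
    return [ri * a + (1.0 - ri) * b for ri, a, b in zip(r, parent1, parent2)]

def gs_mutation(parent, inputs, rng, step=0.1):
    """Offspring = parent + step * (r1 - r2), a bounded perturbation per image."""
    r1 = random_semantics(inputs, rng)
    r2 = random_semantics(inputs, rng)
    return [a + step * (x - y) for a, x, y in zip(parent, r1, r2)]
```

Because the crossover output is a convex combination of its parents on every image, it cannot be worse than the worse parent on any training sample, and the mutation changes each output by at most `step`; these bounded moves are what make the search landscape of geometric semantic genetic programming comparatively simple.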

Conclusion
In this paper, a novel hybrid DNN was developed to perform automatic IQEs. The proposed hybrid DNN consists of two stages, namely a feature extraction stage and a classification stage. In the feature extraction stage, the proposed approach integrates image features captured from IQE models in order to predict image qualities. The proposed approach ensures that significant features correlated to image qualities are integrated through the image feature models. It overcomes the limitation of the recently developed CNN, in which image features are only captured randomly and significant ones cannot be guaranteed to be included. In the classification stage, the tree-based model, namely geometric semantic genetic programming, integrates the image features to perform the final image quality predictions. The approach is simpler than cumbersome fully connected neural networks.
The performance of the proposed approach was evaluated on the TID image quality database, which is commonly used to evaluate the performance of image quality metrics. The database consists of 3000 distorted images, which were generated from 25 reference images contaminated with 24 distortion types at 5 levels. We have compared the proposed approach with four state-of-the-art IQE metrics and a powerful CNN which outperformed many IQE models. The mean correlation achieved by the proposed hybrid DNN is 0.57, which is higher than that of the tested methods, including the state-of-the-art IQE metrics and the powerful CNN for IQEs. The t-test, F-test and Tukey's range test on the correlation results showed that the proposed approach achieves significantly more accurate quality predictions than the tested methods, with a 99.9% confidence level.
In the future, we will incorporate more modern models for IQEs into the proposed hybrid DNN. It is expected that further improvement can be achieved, but a longer execution time is required when more computationally complex models are integrated. We will seek a tradeoff between prediction accuracy and computational time.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions.

Declarations
Conflict of interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. We fully understand and fully comply with the ethical responsibilities and rules for Authors for this journal.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.