No-reference synthetic image quality assessment with convolutional neural network and local image saliency

Depth-image-based rendering (DIBR) is widely used in 3DTV, free-viewpoint video, and interactive 3D graphics applications. Typically, synthetic images generated by DIBR-based systems incorporate various distortions, particularly geometric distortions induced by object dis-occlusion. Ensuring the quality of synthetic images is critical to maintaining adequate system service. However, traditional 2D image quality metrics are ineffective for evaluating synthetic images as they are not sensitive to geometric distortion. In this paper, we propose a novel no-reference image quality assessment method for synthetic images based on convolutional neural networks, introducing local image saliency as prediction weights. Due to the lack of existing training data, we construct a new DIBR synthetic image dataset as part of our contribution. Experiments were conducted on both the public benchmark IRCCyN/IVC DIBR image dataset and our own dataset. Results demonstrate that our proposed metric outperforms traditional 2D image quality metrics and state-of-the-art DIBR-related metrics.


Introduction
With the development of mobile devices and wireless network technology, depth-image-based rendering (DIBR) has become a mainstream technology for supporting remote interactive 3D graphics. Example uses include 3DTV [1], free-viewpoint video [2], stereoview video [3], and 3D interactive graphics systems [4]. In these DIBR-based systems, a virtual view is synthesized based on various known reference views as the input, which comprise texture and depth information. 3D warping [5] and hole filling [1] are typically applied to generate the required virtual views. However, the process of virtual view synthesis is prone to distortions, degrading the visual quality of the synthetic images. Having a proper quality metric for synthetic images is fundamental to ensuring quality of service (QoS) of DIBR-based systems. Specifically, the feedback from synthetic image assessment can be used to govern optimization of reference view compression and transmission.
As illustrated in Fig. 1, geometric distortions, such as holes, cracks, ghost artifacts, and stretching, are the dominant distortions in a DIBR synthetic image. They mainly result from object dis-occlusion, and rounding errors from 3D warping and hole filling processes. Compared to traditional DCT-based image distortions such as noise, blurring, blocking, and ringing artifacts which are distributed rather uniformly over an image, geometric distortions appear in a non-uniform way and are distributed locally around occlusion regions [6]. Existing 2D image quality assessment (IQA) algorithms focus on structural distortions, and are incapable of properly assessing the visual quality of DIBR synthetic images. So far, only a few works have aimed to evaluate DIBR synthetic images. Most are extensions of existing 2D IQA methods, assuming that DIBR synthetic images follow the same natural scene statistics (NSS) as traditional 2D images [6][7][8][9]. Their improvements mainly rely on carefully designed handcrafted features.
In contrast to existing DIBR-related metrics, which heavily rely on handcrafted features, we propose a no-reference (NR) DIBR synthetic image quality assessment method using convolutional neural networks (CNNs) and local image saliency based weighting. Specifically, we exploit the power of CNNs for synthetic image feature extraction, while utilizing the sensitivity of local image saliency to geometric distortions to refine the predicted scores. To overcome the lack of existing training data, we constructed a large DIBR synthetic image dataset with subjective score annotations.
Our main contributions are as follows:
• To our knowledge, we are the first to propose a CNN-based NR-IQA method for DIBR synthetic images. In particular, the integration of local image saliency boosts prediction performance.
• We have constructed a new DIBR synthetic image dataset with subjective scores. The capacity and diversity of our proposed dataset are superior to those of any existing public DIBR image dataset, boosting the training quality and avoiding training bias.
• We have validated the proposed metric on both the public benchmark IRCCyN/IVC DIBR image dataset [10] and our own dataset. Experimental results demonstrate that our method outperforms conventional 2D image metrics and state-of-the-art DIBR-related metrics.
The rest of the paper is organized as follows. Related work is described in Section 2. Section 3 presents our NR-IQA approach, and Section 4 evaluates our proposed algorithm. Application of the proposed metric is demonstrated in Section 5. Finally, Section 6 concludes the paper.

Image quality assessment
Depending on their need for a priori knowledge of the undistorted image, IQA methods may be broadly categorized as full-reference (FR), reduced-reference (RR), and no-reference (NR). In FR-IQA, algorithms typically have full knowledge of the ground truth image, and evaluate image distortion according to pixel error measurements, e.g., SSIM [11]. In contrast, RR-IQA only uses partial information of a reference image for quality evaluation [12]. NR-IQA is the most challenging task, in which algorithms estimate the quality of a distorted image without any information about the ground truth. However, NR-IQA is most suitable for DIBR system usage, since the undistorted image corresponding to a virtual view is typically unavailable. We hence only discuss NR-IQA algorithms in the following.
Most NR-IQA methods are based on NSS priors. Mittal et al. [13] proposed a Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE), which extracts point-wise statistics from local normalized luminance signals, measuring image naturalness by the deviations from a natural image model. They also proposed another no-reference metric, Natural Image Quality Evaluator (NIQE) [14], without the need for knowing the human subjective score for a distorted image.
Recently, deep learning methods, especially CNNs, have attracted great attention for their powerful image feature extraction capability. Kang et al. [15] first introduced CNNs into image quality assessment. In their work, training images are divided into small patches assigned with subjective scores as labels. The small patches are then trained to fit human subjective scores using CNNs. Bosse et al. [16] and Bare et al. [17] improved the prediction performance by weighting the predicted patch scores with image saliency. Bare et al. [17] adopted a more complex network architecture which clusters each minibatch of training patches. In Ref. [18], a pretrained CNN model is utilized to provide multiple level features for image quality assessment. GANs are also introduced into NR-IQA [19], where a plausible reference image is generated to assist training. As well as for image quality assessment, deep learning has also been applied in aesthetic evaluation [20]. CNN-based NR-IQA methods have achieved state-of-the-art performance on public 2D image databases, such as LIVE [21], TID2008 [22], and TID2013 [23]. However, no work has been reported for assessing DIBR synthetic images. This is mainly due to the training bias of traditional 2D image datasets, as the features of traditional 2D images and synthetic images are different due to the different natures of their distortions.

DIBR-related image quality assessment
Previous IQA methods for 2D images are inappropriate for assessing DIBR synthetic images, since the dominant distortions in synthetic images are geometric distortions, as mentioned before. Specifically, holes are mainly induced by object dis-occlusions in a virtual view. Cracks are induced by rounding errors from 3D warping. Ghost artifacts are mainly induced by inaccurate depths, and stretching is due to improper hole filling algorithms. These distortions are quite different from traditional image distortions, such as noise, blurring, blocking, and ringing artifacts induced by DCT-transform based coding and lossy transmission.
Conze et al. [24] aggregated texture, gradient orientation, and contrast information as weighting maps for assessing DIBR synthetic image distortions. Battisti [7] presented an FR synthetic image quality metric. It evaluated a synthetic image by comparing the Kolmogorov-Smirnov distance between the blocks of the synthetic image and the undistorted image. Sandić-Stanković et al. proposed a Morphological Wavelet Peak Signal-to-Noise Ratio (MW-PSNR) metric [25] and a Morphological Pyramids Peak Signal-to-Noise Ratio (MP-PSNR) metric [26]. Both MW-PSNR and MP-PSNR transform a synthetic image into wavelet domain, and measure the spectral difference between the synthetic image and the undistorted one. Zhou et al. [6] proposed an FR metric for DIBR synthetic images with dis-occluded region discovery. It first detected the dis-occluded regions by comparing the absolute difference between the synthetic image and the undistorted image, and then weighted the predicted quality using the detected dis-occluded regions. Gu et al. [8] proposed an NR method for DIBR synthetic images using local image description. It measured geometric distortions with an auto-regression based NSS model. Tian et al. [9] proposed another NR-IQA method for measuring synthetic image distortions. Four kinds of features, including morphological differences, edges, gradients, and holes ratio, are separately measured and finally aggregated. These DIBR-related metrics achieve significant improvement over conventional IQA metrics, yet heavily rely on handcrafted features.

Our approach
We now present the details of our method. As mentioned above, current DIBR-related IQA methods rely heavily on handcrafted features, while CNN-based methods suffer from training bias. We hence propose a novel NR-IQA method for synthetic images based on CNNs and local image saliency based weighting. We also address the lack of training data by constructing a new DIBR synthetic image database with sufficient samples.

Overview
Motivated by previous work, we apply CNNs to train a regression model between predicted image quality scores and human subjective scores. Specifically, the CNN model is assumed to represent the feature subspace of DIBR synthetic images in terms of natural images.
The main bottleneck of CNN-based synthetic image quality prediction is the lack of sufficient training data. Notably, existing CNN-based IQA methods achieve successful results as they are typically trained on very large image databases, e.g., LIVE, CSIQ, TID2008, and TID2013, which contain thousands of images. In contrast, existing public DIBR synthetic image datasets, in particular the IRCCyN/IVC DIBR image dataset, contain only 96 images (including the undistorted images). Our new synthetic image dataset was developed to address the lack of training data.
A CNN model is proposed and trained on our dataset. In particular, we utilize local image saliency to weight the predicted score, appropriately emphasising the contribution of geometric distortions. The architecture of the proposed method is illustrated in Fig. 2. With our trained model, we can predict the quality score for test images without knowledge of undistorted versions of them.

Local image saliency based weighting
Previous work assigns the subjective score of an image to small image patches uniformly [15][16][17]. It implicitly assumes that the small image patches equally contribute to image quality. In fact, the visual quality of each small image patch is quite different from the whole image quality [27], especially for synthetic images. Suppose a small image patch is exactly covered by a dis-occluded region, and holes dominate an entire patch. As illustrated in Fig. 3, such a patch may be perceived as having better visual quality than that of the whole image. Without knowledge of geometric distortions, a user may simply think that the patch contains a smooth region. Therefore, the strategy of assigning a uniform predicted score to all image patches cannot properly represent the contributions of geometric distortions.
As performing subjective tests on small image patches is expensive and time-consuming (e.g., a total of 768 subjective tests are required to consider small image patches for each image), a light-weight method of assignment of predicted patch scores is highly desirable. In Ref. [16], the predicted patch score is weighted by image saliency, i.e., salient regions are assigned larger weights. This fits the assumption that observers are generally more sensitive to salient regions, such as the person and chair in Fig. 4(a). The distortions in such salient regions have more influence on the quality of the whole image. However, this only holds for traditional distortions, such as blurring, white noise, and blocking artifacts that are distributed uniformly across the whole image. It is inapplicable to DIBR synthetic images, as geometric distortions in such images are non-uniform and locally-distributed.
Consider Fig. 4. Figure 4(b) shows the saliency map for Fig. 4(a) generated by Ref. [28]. Note that the most salient regions (depicted brighter) are not those regions containing geometric distortions in the synthetic image. For instance, the most salient region in Fig. 4 is the blurred red book, but it is not humanly perceived as distorted. Directly applying image saliency based weighting as proposed in Ref. [27] to the synthetic image thereby overstates the contribution of such regions, while weakening the contribution of local patches containing geometric distortions.
We observe that it makes sense to exploit the difference between the saliency map of a local patch and its corresponding region of the saliency map for the whole image to help to improve the representation of geometric distortions. As seen in Fig. 4(c), the cracks on the wall are dark (indicating weak saliency) in the whole image but are bright (indicating strong saliency) in the small patch. In reality, human perception is most sensitive to such cracks. We should hence assign a large weight to the corresponding patches. In contrast, the holes appearing at the right side of the lion statue are dark (indicating weak saliency) in both the image saliency map and the patch saliency map. This fits the observation that holes around the lion statue are not perceived to be consistent with the cracks in the white wall. This is partly supported by theories that in the human visual system, texture contrast masking and luminance adaptation conceal distortions to some extent [29]. We can thus give the corresponding patch a small weight. On the other hand, patches containing no geometric distortion share similar appearance of local patch saliency and corresponding regional saliency in the whole image. For instance, the aforementioned red book with motion blurring appears to be salient in both the patch and the corresponding region of the whole image. However, human perception does not consider motion blurring to be a distortion. In this situation, the contribution of the predicted patch score should be low. The background floor is neither salient at the patch level nor the whole image level, and that should also be considered as unimportant, as shown in Fig. 4(c).
Based on the above observations, we exploit the ratio between the local patch saliency and the corresponding regional saliency in the whole image to represent the contribution of patch scores toward geometric distortions. We define this as local image saliency, formulated as follows:

$$c_x = \frac{\sum_{i \in \Omega_x} S(i)}{\sum_{i \in \Omega_x} S'(i)} \qquad (1)$$

where $\Omega_x$ indicates the region of a small patch $x$. $S(\cdot)$ and $S'(\cdot)$ denote the per-pixel value of patch saliency and the corresponding saliency in the whole image, respectively. The proposed local image saliency is then used to weight the predicted patch scores. For example, a patch with high local image saliency implies that the patch contains clearly visible geometric distortions, and that the predicted score should be increased, and vice versa.
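As a minimal sketch of the ratio described above, the following Python snippet computes a local image saliency weight from two precomputed saliency maps. It assumes the saliency maps are given as NumPy arrays (the saliency model itself is not reproduced here), and the function name, the `top`/`left` patch origin arguments, and the small `eps` guard against division by zero are illustrative choices rather than details taken from the method.

```python
import numpy as np

def local_image_saliency(patch_saliency, image_saliency, top, left, eps=1e-8):
    """Ratio of summed patch saliency to summed regional saliency.

    patch_saliency: saliency map computed on the patch alone (h x w).
    image_saliency: saliency map computed on the whole image (H x W).
    top, left: position of the patch inside the whole image (hypothetical args).
    eps: assumed guard against an all-zero region.
    """
    h, w = patch_saliency.shape
    region = image_saliency[top:top + h, left:left + w]
    return float(patch_saliency.sum() / (region.sum() + eps))

# Toy maps: a crack that is salient inside its patch but weak in the whole image.
image_sal = np.full((64, 64), 0.1)
crack_patch_sal = np.full((32, 32), 0.8)
flat_patch_sal = np.full((32, 32), 0.1)

print(local_image_saliency(crack_patch_sal, image_sal, 0, 0))    # large weight
print(local_image_saliency(flat_patch_sal, image_sal, 32, 32))   # weight close to 1
```

In this toy setting the crack patch receives a much larger weight than the flat background patch, which mirrors the behaviour argued for in the discussion of Fig. 4(c).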

Network architecture
Our network is mostly inspired by Ref. [15], but is designed to process DIBR synthetic images during preprocessing, and to use local image saliency based weighting.

Preprocessing
Before training, we divide each synthetic image into small patches of size 32 × 32 pixels. As depicted in Fig. 5, geometric distortions are visible in RGB channels. However, such distortions are concealed after gray-scale transformation and local contrast normalization. Consequently, we abandon gray-scale transformation and local contrast normalization, even though they have been widely used in existing CNN-based NR-IQA methods [15,17]. As a result, important distortion information can be better preserved.
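The patch division can be sketched as follows. This is an illustrative snippet that assumes non-overlapping patches and drops any incomplete border strip; neither choice is stated in the text, and the function name is hypothetical. The RGB channels are kept untouched, in line with the preprocessing decision above.

```python
import numpy as np

def extract_patches(image, size=32):
    """Split an H x W x 3 RGB image into non-overlapping size x size patches.

    Returns the stacked patches and the (top, left) origin of each patch.
    Incomplete border strips are dropped (an assumption of this sketch).
    """
    h, w, _ = image.shape
    patches, origins = [], []
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            patches.append(image[top:top + size, left:left + size, :])
            origins.append((top, left))
    return np.stack(patches), origins

# A 1024 x 768 image yields 32 * 24 = 768 patches of 32 x 32 pixels.
dummy = np.zeros((768, 1024, 3), dtype=np.uint8)
patches, origins = extract_patches(dummy)
print(patches.shape)
```

Under these assumptions a 1024 × 768 image produces 768 patches, which is consistent with the count of 768 patch-level subjective tests per image mentioned earlier.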

Layers
We use 9 convolutional layers to extract local patch features. Each convolutional layer is followed by a ReLU activation function before its output is passed to the next layer. The convolutional layer can be formulated as

$$C_j = \max(0, W_j * C_{j-1} + B_j) \qquad (2)$$

where $C_j$ is the feature map of the $j$th layer, and $W_j$ and $B_j$ are the weight and bias, respectively. Details of layer configurations as well as kernels are depicted in Fig. 2. Note that we use a zero-padding strategy, so as to preserve the information at image borders. After three convolutional layers, we apply a max-pooling layer with a 2 × 2 kernel to enlarge the receptive field. We also apply the dropout strategy after the first fully connected layer. The network depth is chosen with the assumption that shallow network architectures capture low-level features while deep network architectures capture semantic features. The effect of network depth is discussed in Section 4.
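To illustrate how zero padding and 2 × 2 max pooling interact, the sketch below traces the spatial size and receptive field of a feature map through the layers. It assumes 3 × 3 kernels with stride 1 and a pooling layer after every third convolutional layer; the actual kernel sizes are given only in Fig. 2, so these values are placeholders.

```python
def trace_feature_sizes(input_size=32, num_conv=9, pool_every=3, kernel=3):
    """Track spatial size and receptive field through conv and pooling layers.

    Assumptions of this sketch: 3 x 3 kernels, stride 1, zero padding that
    preserves the spatial size, and 2 x 2 max pooling with stride 2 after
    every `pool_every` convolutional layers.
    """
    size, receptive_field, jump = input_size, 1, 1
    trace = []
    for layer in range(1, num_conv + 1):
        receptive_field += (kernel - 1) * jump   # padding keeps `size` unchanged
        trace.append((f"conv{layer}", size, receptive_field))
        if layer % pool_every == 0:
            receptive_field += (2 - 1) * jump    # 2 x 2 pooling window
            jump *= 2                            # stride 2 doubles the step
            size //= 2
            trace.append((f"pool{layer // pool_every}", size, receptive_field))
    return trace

for name, size, rf in trace_feature_sizes():
    print(f"{name}: {size} x {size}, receptive field {rf}")
```

Under these assumptions the 32 × 32 input is reduced to 4 × 4 after the third pooling layer, while each pooling step doubles the rate at which the receptive field grows in the following convolutional layers.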

Optimization
By aggregating the local image saliency based weighting, the loss function is formulated as follows: min where $c_x$ is the local image saliency defined in Eq. (1). $x$ and $q_x$ denote the input small image patch and its assigned subjective quality score, respectively. $f(\cdot)$ outputs the predicted quality score. $W$, $B$ indicate the trainable weights and biases, respectively. The effectiveness of our proposed local image saliency based weighting is discussed in Section 4. We use the ADAM optimizer to solve this problem.
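One way such a saliency-weighted objective could look is sketched below. The exact form of the loss is not recoverable from the text here, so this snippet assumes a mean absolute error between the saliency-weighted patch prediction $c_x f(x)$ and the label $q_x$; both the norm and the placement of the weight are assumptions, and the function name is hypothetical.

```python
import numpy as np

def weighted_patch_loss(pred, weights, labels):
    """Mean absolute error between saliency-weighted predictions and labels.

    pred:    predicted patch scores f(x).
    weights: local image saliency c_x of each patch.
    labels:  subjective scores q_x assigned to each patch.
    The L1 form and the weighting of the prediction are assumptions.
    """
    pred, weights, labels = (np.asarray(a, dtype=float) for a in (pred, weights, labels))
    return float(np.mean(np.abs(weights * pred - labels)))

# Two patches: the second has twice the local image saliency of the first.
print(weighted_patch_loss([3.0, 2.0], [1.0, 2.0], [3.0, 3.0]))
```

In a training loop this quantity would be minimized over the network parameters with an optimizer such as ADAM; the sketch only evaluates the objective for fixed predictions.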

Our DIBR synthetic image database
Until recently, available synthetic image databases with subjective scores were insufficient for training. For instance, the IRCCyN/IVC DIBR image dataset [35] contains only 12 undistorted images and 84 synthetic images. Moreover, these images cover only three scenes: Book Arrival, Newspaper, and Lovebird. All have humans in the center of the scene, which may lead to training bias. The MCL 3D database [36] contains 693 stereoscopic image pairs, which is sufficient for training. However, it lacks subjective scores for each synthetic image. In order to improve training performance, we constructed a new DIBR synthetic image dataset. A total of 18 reference images were chosen. These reference images ranged from 960 × 640 to 1920 × 1080 pixels in size. Twelve reference images were randomly sampled from 3D-HEVC testing video sequences or other typical RGBD databases. Note that the sampled reference images are quite different from those in the IRCCyN/IVC DIBR image dataset. The remaining six reference images were picked from the Middlebury Stereo dataset [34], which only contains indoor objects without people. We specifically chose these reference images to avoid training bias. The reference images are shown in Fig. 6.
Figure 7 shows a scatter plot of spatial information (SI) vs. colorfulness information (CI) for our chosen reference images and the IRCCyN/IVC DIBR image dataset, as suggested by Ref. [37]. It shows that the SI and CI of our chosen reference images span a larger range than those of the IRCCyN/IVC DIBR image dataset, indicating that the contents of our dataset are more diverse and more likely to avoid training bias.
For each reference image, we set four camera baselines between the reference view and the virtual view. For instance, if the camera position of the Balloons reference image is denoted by 0, we select four virtual cameras along the horizon line of the reference camera, with the baselines between the virtual cameras and the reference camera set to −2d, −d, +d, +2d, respectively, where d is the preset unit distance. After 3D warping, we apply 7 hole-filling algorithms to the synthetic images. Finally, we obtain 504 synthetic images. Note that the hole-filling algorithms are the same as those used for the IRCCyN/IVC DIBR image dataset. Details of the hole-filling algorithms are given in Ref. [7]. Compared to the IRCCyN/IVC DIBR image dataset, our new database has over 5 times as many images. Further comparisons are listed in Table 1.

Fig. 6 Reference images from Nagoya Free-viewpoint video dataset [30], Microsoft 3D Video database [31], Poznan Multiview video test sequences [32], Freiburg stereo dataset [33], and Middlebury Stereo dataset [34].
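The dataset size follows directly from the construction: 18 reference images, four baselines, and 7 hole-filling algorithms. The short enumeration below makes the arithmetic explicit; the reference and algorithm names are placeholders, not the actual identifiers used in the dataset.

```python
from itertools import product

references = [f"ref_{i:02d}" for i in range(18)]     # placeholder names
baselines = [-2, -1, 1, 2]                           # multiples of the unit distance d
hole_fillers = [f"fill_{k}" for k in range(1, 8)]    # placeholder names

# One synthetic image per (reference, baseline, hole-filling algorithm) triple.
synthetic = list(product(references, baselines, hole_fillers))
print(len(synthetic))  # 18 * 4 * 7 = 504
```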

Subjective testing
Since the number of synthetic images was prohibitively large for a double stimulus setup, we instead chose a single stimulus absolute category rating procedure with hidden reference (ACR-HR), as suggested by ITU-T Recommendation P.910 [38]. Each synthetic image was evaluated by 15 observers. Subjective testing was divided into three sub-sessions of 25 min each with a break of five minutes in between to reduce visual fatigue and eye strain. Each testing image was displayed for 15 s, followed by a gray image for 5 s. To ensure Before testing started, the study procedure was explained to each subject. We also obtained verbal confirmation that the subjects had normal or corrected-to-normal vision. For each sub-session, five images were shown as a warm-up; these had different contents but the same type of distortions as the testing images.
A 24 inch Lenovo X23 LG 0.2 monitor was used as the display. It had 16:9 aspect ratio, 0.30 m height, 200 cd•m −2 peak luminance, and 1920 × 1080 display resolution. The testing room was dark with weak ambient lighting. Subjects viewed images from 2.1 m, as suggested in ITU-T Recommendation P.910 [38]. At the end of the image display duration, the test number of the image was displayed on the screen, informing subjects to write down one of the five rankings: 5-Excellent, 4-Good, 3-Fair, 2-Poor, 1-Bad on their subjective scoring sheets.

Processing of raw subjective scores
The subject rejection procedure outlined in ITU-R BT.500 [39] was used to discard scores from unreliable subjects. The kurtosis of the scores (MOS scores) was first used to determine whether the scores assigned by a subject followed a normal distribution. For the normally distributed scores, a subject was rejected whenever more than 5% of the scores assigned by the subject fell outside the range of two standard deviations from the mean scores; otherwise, the subject was rejected whenever more than 5% of the scores fell outside the range of 4.47 standard deviations from the mean scores. All of the 15 subjects passed the outlier rejection. We further analyzed the scores for the 12 redundant images, finding that most subjects assigned the same scores to these repeated images. This further validated the effectiveness of our subjective testing. Finally, the scores of 15 subjects were averaged.
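The screening rule described above can be sketched as follows. This snippet is a simplified illustration: it computes the kurtosis per image across subjects, treats values between 2 and 4 as indicating normality, and uses √20 ≈ 4.47 standard deviations otherwise. The additional ratio condition of the full ITU-R BT.500 procedure is omitted, and the function name and array layout are assumptions.

```python
import numpy as np

def screen_subjects(scores, max_outlier_fraction=0.05):
    """Flag unreliable subjects and compute MOS from the remaining ones.

    scores: array of shape (num_subjects, num_images) with ratings 1..5.
    Returns (rejected, mos): a boolean mask over subjects and per-image MOS.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0)
    std = scores.std(axis=0, ddof=1)
    centered = scores - mean
    m2 = (centered ** 2).mean(axis=0)
    m4 = (centered ** 4).mean(axis=0)
    # Kurtosis per image; images with zero variance are treated as normal.
    kurtosis = np.divide(m4, m2 ** 2, out=np.full_like(m2, 3.0), where=m2 > 0)
    normal = (kurtosis >= 2.0) & (kurtosis <= 4.0)
    k = np.where(normal, 2.0, np.sqrt(20.0))
    outliers = np.abs(centered) > k * std
    rejected = outliers.mean(axis=1) > max_outlier_fraction
    mos = scores[~rejected].mean(axis=0)
    return rejected, mos

# Example with random ratings from 15 subjects on 40 images.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(15, 40))
rejected, mos = screen_subjects(ratings)
print(rejected.sum(), mos.shape)
```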

Experimental results
We now provide the details of our experimental settings and give a performance comparison for our proposed DIBR synthetic image quality metric on the benchmark IRCCyN/IVC DIBR image dataset and our own dataset. We also briefly discuss the dependence on proposed strategies, including preprocessing, local image saliency based weighting, and network depth. In experiments, we set the ADAM optimizer learning rate λ = 0.0001, performing stochastic gradient descent (SGD) for 20 epochs in training, and saving the models with the top five Pearson linear correlation coefficient (PLCC) performance on the validation set. For each epoch, the training and validation set were shuffled. We calculated local image saliency weights for the whole image and patches using the saliency model in Ref. [28]. During the testing stage, the predicted scores from the five restored models were averaged.
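The model selection and averaging step can be illustrated with a small sketch. It assumes that one validation PLCC value is recorded per epoch and that each retained model outputs one image-level score; the function names are hypothetical.

```python
import numpy as np

def select_top_models(val_plcc_by_epoch, k=5):
    """Return the indices of the k epochs with the highest validation PLCC."""
    order = np.argsort(val_plcc_by_epoch)[::-1]
    return sorted(order[:k].tolist())

def ensemble_score(per_model_scores):
    """Average the image-level scores predicted by the retained models."""
    return float(np.mean(per_model_scores))

# Toy validation curve over 7 epochs; keep the best three for brevity.
val_plcc = [0.1, 0.5, 0.3, 0.9, 0.7, 0.8, 0.2]
print(select_top_models(val_plcc, k=3))
print(ensemble_score([3.0, 4.0]))
```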

Evaluation methodology
Three indicators were used to evaluate the performance of our proposed metric, including Pearson linear correlation coefficient (PLCC), root mean square error (RMSE), and Spearman rank order correlation coefficient (SROCC). These indicators measure the consistency, accuracy, and monotonicity between the predicted quality scores and subjective scores. PLCC and SROCC range from −1 to 1, with values closer to 1 indicating better performance. RMSE ranges from 0 to +∞, smaller values indicating better performance.
For the sake of fairness of performance comparison, the predicted scores of compared metrics were scaled to the subjective scores, i.e., MOS values, via third-order polynomial fitting. The polynomial fitting is conducted as follows, as suggested by ITU-R BT.500 [39]:

$$\mathrm{MOS}_p = a s^3 + b s^2 + c s + d \qquad (4)$$

where $s$ is the score and $a$, $b$, $c$, $d$ are coefficients of the polynomial fitting function, determined by the predicted results and associated subjective scores. Note that our predicted scores are directly trained to fit the subjective scores, so do not require scaling.
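A minimal sketch of this evaluation pipeline is given below, using only NumPy. It fits the third-order polynomial of Eq. (4) with `np.polyfit`, then computes PLCC and RMSE on the fitted scores and SROCC on the raw ranks. The rank helper does not average ties, which is a simplification of this sketch, and the function names are illustrative.

```python
import numpy as np

def fit_to_mos(pred, mos, degree=3):
    """Map raw metric scores to the MOS scale with a polynomial fit."""
    coeffs = np.polyfit(pred, mos, degree)
    return np.polyval(coeffs, pred)

def ordinal_rank(values):
    """Ordinal ranks; ties are not averaged in this sketch."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def evaluate(pred, mos):
    """Return (PLCC, SROCC, RMSE) between predicted scores and MOS."""
    pred = np.asarray(pred, dtype=float)
    mos = np.asarray(mos, dtype=float)
    fitted = fit_to_mos(pred, mos)
    plcc = np.corrcoef(fitted, mos)[0, 1]
    srocc = np.corrcoef(ordinal_rank(pred), ordinal_rank(mos))[0, 1]
    rmse = np.sqrt(np.mean((fitted - mos) ** 2))
    return float(plcc), float(srocc), float(rmse)

# Toy example: MOS is an exact linear function of the raw score.
raw = np.array([0.1, 0.4, 0.5, 0.7, 0.9, 1.2])
plcc, srocc, rmse = evaluate(raw, 1.0 + 2.0 * raw)
print(plcc, srocc, rmse)
```

Because SROCC depends only on rank order, it is unaffected by the monotonic part of the fitting, which is why the sketch computes it on the raw scores.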
The parameters (if any) in the compared FR-IQA methods were trained on the training dataset, while the predicted scores were fitted using non-linear logistic regression to minimize the errors between the predicted scores and the corresponding subjective scores, as suggested by Ref. [8]. After parameter training, we evaluated each method's performance on the testing dataset. The compared NR-IQA methods were directly evaluated on the testing dataset.

Performance on the IRCCyN/IVC DIBR image dataset
We now compare the performance of the proposed algorithm on the IRCCyN/IVC DIBR image dataset with state-of-the-art methods. As mentioned before, we trained the CNN model on the training data of our DIBR image database, where the models with top five PLCC results on the validation dataset were saved. The RMSE, PLCC, and SROCC for our metric using the IRCCyN/IVC DIBR image dataset are listed in Table 2. Our proposed algorithm achieves values of 0.3820, 0.8112, and 0.7520, respectively, which are better than those for competing methods.
From Table 2, we are able to derive two important conclusions.
Firstly, existing IQA algorithms that were designed for traditional 2D images do not perform effectively. The FR-IQA metrics are better than the NR-IQA metrics. FSIM [41] achieves 0.5887, 0.4671, and 0.3286 for RMSE, PLCC, and SROCC, respectively. Note that NR-IQA metrics are not able to predict DIBR synthetic image scores at all well, e.g., NIQE [14] achieves 0.1152 and 0.1181 for PLCC and SROCC, respectively. This is mainly due to dependency on natural image distortion priors. In particular, NIQE predicts image quality by evaluating the effect of distortions in terms of the NSS distribution. As mentioned before, geometric distortions are different from traditional image distortions. The learned model is thus inadequate for assessing DIBR synthetic images.
Secondly, despite the fact that the DIBR-related IQA algorithms perform better than those designed for traditional 2D images, prior methods are still insufficient. The best DIBR-related IQA metric is SDRD [6], which achieves 0.3901, 0.8104, and 0.7610 for RMSE, PLCC, and SROCC, respectively. State-of-the-art NR-IQA metrics, such as APT [8] and NIQSV+ [9], achieve similar performance. Our metric outperforms those two relatively new NR-IQA metrics for DIBR synthetic images, and indeed achieves performance competitive to that of the state-of-the-art FR-IQA metric, SDRD. Note however that SDRD is a full-reference method while ours is independent of reference images.

Cross validation
To avoid training bias of our CNN model, we conducted cross validation on our own database. In particular, we evaluated the RMSE, PLCC, and SROCC of our metric and DIBR-related metrics on the testing set of our database. The results are listed in Table 3.
Our metric achieves the best performance on our DIBR synthetic image database in comparison with other DIBR-related metrics.Note that SDRD [6] is inferior to our method on the new database.
The performance of most existing DIBR-related metrics decreases when tested on our database. This implies that lack of diversity in the IRCCyN/IVC DIBR image dataset has caused training bias. The variation in RMSE on these two databases is shown in Table 4, which shows that RMSE is lower when testing on our database. Note that the RMSE variation of 3DSwIM is the most significant. This is perhaps due to the weighting of face features in 3DSwIM, leading to training bias.

Ablation study
Several strategies are involved in our method. The most important issues concerning prediction performance are preprocessing, local image saliency based weighting, and network depth. We therefore conduct an ablation study to demonstrate the effect of these strategies.

Preprocessing
We first evaluated preprocessing. While our preprocessing strategy uses raw images directly, we also implemented gray-scale transformation and local contrast normalization of the training images for comparison; the network architecture remained the same. The RMSE, PLCC, and SROCC values are listed in Table 5.
We can see from Table 5 that our preprocessing strategy achieves better performance on the testing set of our DIBR synthetic image database. It implies that gray-scale transformation and local contrast normalization may discard useful information.

Local image saliency based weighting
To demonstrate the effectiveness of local image saliency based weighting, we separately trained the CNN model with different modalities, i.e., the CNN network without weighting, the same model with image saliency based weighting as deployed in Ref. [17], and our proposed model based on local image saliency weighting. In the first case, the predicted patch scores are averaged to fit the subjective score. In the second case, the predicted patch scores are weighted by image saliency. The utilized image saliency is formulated as follows: Note the difference between image saliency based weighting in Eq. (6) and local image saliency based weighting in Eq. (3). Image saliency considers saliency, while local image saliency considers saliency variation between the local region and the whole image. The RMSE, PLCC, and SROCC for the testing dataset of our DIBR synthetic image database

A visualization of local image saliency based weighting is given in Fig. 8. Figure 8(a) represents the saliency map of the entire image, while Fig. 8(b) represents saliency maps of small patches, merged into an entire image-sized map. Figure 8(c

Network depth
A deeper network architecture is suggested [16] to achieve better prediction performance on traditional 2D image databases. We validated this assumption on our augmented DIBR synthetic image dataset. Figure 9 shows how RMSE varies with different network depths, i.e., number of convolutional layers. We observe that RMSE decreases on both the training dataset and validation dataset with increasing network depth, agreeing with the assumption that greater network depth benefits prediction performance. However, the performance gain decreases significantly when the network depth exceeds nine. Also, deeper convolutional layers may lead to overfitting on the validation dataset unless care is taken. In practice, we use a network architecture with nine convolutional layers.

Application
The quality of synthetic images is key to the success of DIBR-based systems. For instance, a quality measure can be used to guide the coding of reference texture images and depth maps. It can also be used to evaluate hole-filling algorithms. Here we use the proposed synthetic image quality metric to optimize the prediction of reference viewpoints. We first describe the baseline model of reference viewpoint prediction, and then introduce a novel model using our proposed metric.

Baseline model of reference viewpoint prediction
Suppose a user navigates within a virtual environment. Reference viewpoints are predicted according to user movement, and for each, an associated reference texture image and depth are transmitted to the user-end for virtual view synthesis. Ideally, reference viewpoint prediction is frequent, to reduce errors. However, the bottleneck of reference viewpoint transmission is bandwidth: the reference data which can be transmitted are limited. Previous work [43,44] adopts a strategy that predicts reference viewpoints with a constant frequency. Shi et al. [45] adopt another mechanism that predicts the reference viewpoint when the MSE between the synthetic image and the undistorted image exceeds a preset threshold. We choose these two models as baselines to demonstrate the effectiveness of our proposed metric. Following Ref. [45], we predict reference viewpoints by assessing the quality of the synthetic images. However, our metric requires no reference, and can be directly used to assess the synthetic images without need for the undistorted images.

Performance
Suppose the user navigates the virtual environment along a horizontal path. The path is equally sampled, and each sample indicates a virtual viewpoint. The positions of these virtual viewpoints can then be denoted as $(v_0, v_1, \ldots, v_{100})$, where $v_0$ denotes the initial viewpoint. Figure 10 shows the undistorted image and the synthetic images for $v_0$. Note that the two synthetic images use different reference viewpoints, one predicted using MSE and the other using our proposed metric. We can see from Fig. 10 that the two synthetic images can hardly be distinguished. However, the predicted reference viewpoint is $v_4$ using MSE but $v_7$ using the proposed metric. We then take the predicted reference viewpoint as the new initial viewpoint and repeat the reference viewpoint prediction until the virtual viewpoint reaches $v_{100}$. In total, 25 reference viewpoints are suggested by MSE, while only 17 are suggested by our proposed metric. The transmitted reference data is thus reduced while visual quality is maintained.
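The repeated prediction along the path can be sketched as a greedy loop. This is one plausible reading of the procedure rather than the authors' exact algorithm: `is_acceptable(view, ref)` is a hypothetical callback that returns True when the image of viewpoint `view` synthesized from reference `ref` is judged good enough (for example, a no-reference score above a threshold), and the sketch assumes that quality does not recover once it has become unacceptable as the reference moves farther away.

```python
from typing import Callable, List


def select_reference_viewpoints(
    last_index: int,
    is_acceptable: Callable[[int, int], bool],
) -> List[int]:
    """Greedily choose reference viewpoints along a path v_0 .. v_last_index.

    From the current viewpoint, the reference is pushed as far ahead as the
    quality check allows; that reference then becomes the new start point.
    """
    references = []
    current = 0
    while current < last_index:
        candidate = current + 1  # always advance by at least one viewpoint
        while candidate < last_index and is_acceptable(current, candidate + 1):
            candidate += 1
        references.append(candidate)
        current = candidate
    return references


def within_reach(reach: int) -> Callable[[int, int], bool]:
    """Toy stand-in for a quality check: accept a reference that is at most
    `reach` samples away. Purely illustrative, not a real quality metric."""
    return lambda view, ref: ref - view <= reach


strict = select_reference_viewpoints(100, within_reach(4))
lenient = select_reference_viewpoints(100, within_reach(7))
```

A more tolerant quality check lets each reference cover more viewpoints, so fewer references are transmitted. For the counts reported above, 17 instead of 25 reference viewpoints corresponds to 1 - 17/25 = 32% fewer reference transmissions.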
We also simulated virtual environment navigation on a Nexus 5 device. Reference data was transmitted to the client when the quality of the synthetic image fell below a preset threshold. We measured the bandwidth required by MSE-based reference viewpoint prediction and by ours. As Table 7 shows, our metric saves 29% bandwidth on average in comparison to the metric in Ref. [45].

Fig. 2 Architecture of our no-reference synthetic image quality metric. The inputs are small (32 × 32) patches. The predicted patch scores are weighted by local image saliency.

Fig. 3 Visual appearance of image patches containing geometric distortions. Patch A has partial holes, while patch B is dominated by holes. Without knowledge of the geometric distortions in the whole image, patch B is generally perceived as having higher quality than patch A.

Fig. 4 Saliency maps for a synthetic image and its local patches. (a) Synthetic image. (b) Associated saliency map, with brighter intensity indicating stronger saliency. (c) Six chosen small patches extracted from the synthetic image, the corresponding patch saliency maps using the same saliency model, and the corresponding region extracted from the image saliency map. Note that geometric distortions appear differently in the patch saliency map and the image saliency map.

Fig. 5 Visual perception of synthetic images. (a) Two synthetic images. (b) Corresponding gray-scale maps. (c) Visualization of the local normalized maps [15,17]. Note that holes in regions with high intensity contrast and complex textures are lost after gray-scale transformation and local contrast normalization.

Fig. 7 Spatial information versus colorfulness scatter plots for (a) the IRCCyN/IVC DIBR image dataset and (b) our proposed augmented synthesized image dataset. Red lines indicate the convex hull of the points in each scatter plot, indicating the range of scene diversity.

Two datasets were used in our experiments: the IRCCyN/IVC DIBR image dataset and our DIBR synthetic image database. We trained the CNN model on our DIBR synthetic image database; the synthetic images were divided into a training set, a validation set, and a testing set according to reference image. The division followed an approximate 60%/20%/20% split. Thus, 10 reference images with their associated distorted images were chosen as the training set. The validation set and the testing set each contained 4 reference images and their distorted images. Only the training set and validation set were used during training, while the testing set was held out until performance evaluation.
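Splitting by reference image, rather than by individual distorted image, keeps all distorted versions of one scene in the same subset. A minimal sketch of such a split is given below; the 10/4/4 counts come from the text, while the random assignment, the `seed` parameter, and the `(reference_id, image_name)` sample layout are assumptions made for illustration.

```python
import random


def split_by_reference(samples, counts=(10, 4, 4), seed=0):
    """Split (reference_id, image_name) pairs into train/val/test subsets
    so that no reference image contributes to more than one subset."""
    samples = list(samples)
    reference_ids = sorted({ref for ref, _ in samples})
    if sum(counts) != len(reference_ids):
        raise ValueError("counts must add up to the number of reference images")

    # Assumed: references are assigned to subsets at random (seeded).
    rng = random.Random(seed)
    rng.shuffle(reference_ids)

    n_train, n_val, _ = counts
    subset_of = {}
    for position, ref in enumerate(reference_ids):
        if position < n_train:
            subset_of[ref] = "train"
        elif position < n_train + n_val:
            subset_of[ref] = "val"
        else:
            subset_of[ref] = "test"

    split = {"train": [], "val": [], "test": []}
    for ref, name in samples:
        split[subset_of[ref]].append(name)
    return split
```

Grouping by reference avoids content leakage: a model cannot score well on the testing set simply by having seen other distortions of the same scene during training.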

Figure 8(c) visualizes the local image saliency based weights actually used, as calculated by Eq. (3). Clearly, the weights from the saliency map and from local image saliency are quite different. The red box in Figs. 8(a) and 8(c) shows cracks in the wall that are assigned a low weight by the saliency map but a high weight by our proposed local image saliency: local image saliency based weighting better represents the contributions of patch scores.

Fig. 8 Visualization of local image saliency based weighting. (a) Saliency map of the entire distorted image. (b) Merged saliency maps of the associated small image patches. All saliency maps were produced by the method of Ref. [28]. (c) Local image saliency based weights, with brighter blocks indicating higher weights.

Fig. 9 Performance of CNN models with different network depths (numbers of convolutional layers).

Fig. 10 Visual quality of synthetic images with different predicted reference viewpoints. (a) Undistorted image of $v_0$. (b) Synthetic image of $v_0$ using the reference view of $v_4$, as suggested by MSE. (c) Synthetic image of $v_0$ using the reference view of $v_7$, as suggested by our metric.

Table 1
Details of our proposed DIBR synthetic image dataset

the robustness of subjective opinion, twelve testing images were randomly displayed repeatedly. The 15 subjects who participated in the test were graduate or undergraduate students with ages ranging from 21 to 31. Two of them had prior knowledge of IQA; the rest had no IQA experience.

Table 3
RMSE, PLCC, and SROCC on the testing dataset of our DIBR synthetic image database

Table 4
RMSE on the IRCCyN/IVC DIBR image dataset and the testing dataset of our DIBR synthetic image database

Table 5
RMSE, PLCC, and SROCC for the testing set of our DIBR synthetic image database with different preprocessing strategies

Table 6
RMSE, PLCC, and SROCC for the testing dataset of our DIBR synthetic image database with different network modalities

Table 7
Transmission frequency and average bandwidth cost of different reference viewpoint selection models

Compared to existing DIBR-related IQA methods, our work has several highlights. Firstly, it is the first CNN-based NR-IQA method for DIBR synthetic images, achieving significant performance improvements over state-of-the-art 2D and DIBR-related IQA methods. Our proposal to use local image saliency based weighting further benefits prediction performance. Secondly, we have designed a diverse DIBR synthetic image dataset, which helps to reduce training bias in our CNN model. Although we have achieved competitive performance on DIBR synthetic images, there is still room for improvement. For instance, the assignment of patch scores needs further consideration to better fit human perception. In future, we hope to improve the proposed metric by integrating local image saliency in an end-to-end framework.

interests include distributed virtual environments, computer graphics, and e-learning systems. Dr. Li has served as a guest editor of special issues of the International Journal of Distance Education Technologies and the Journal of Multimedia. He has served on the committees of a number of conferences, including as Program Co-Chair of ICWL 2007-08, 2013, 2015, and IDET 2008-09, and Workshop Co-Chair of ICWL 2009 and U-Media 2009.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.