In this section, we estimate the function f in (3) through NNs, we then investigate the validity of the model proposed in (2) through various numerical experiments, and finally we illustrate the ability of the whole system to highlight potential anomalies in the data collected during a subjective experiment.
SOS model validation
To validate the SOS model in (2), as well as the ability of VQMs to capture the diversity among observers’ ratings, an approximation of the function f is needed. This can be obtained by fitting the VQMs to the SOS observed during a subjective experiment, using any ML algorithm tailored for regression. An impressive number of ML algorithms has been proposed in the literature; however, NN-based models and support vector regression (SVR) have empirically demonstrated greater accuracy in the field of media quality assessment. To estimate the function f, we therefore evaluated both NN-based and SVR-based models. We experimentally observed that, for the task of interest, NNs yield a prediction of the gtSOS that correlates better with the SOS when cross validating the obtained models. We therefore rely on a NN to approximate the function f. The NN is trained using the five aforementioned VQMs as input, with \(\textit {SOS}_{exp}^{pvs}\) as target. On the basis of the model in (2) and the assumption in (3), the stochastic component Dexp is not predictable; from the disagreement among the objective metric values that the NN receives as input, useful information can be gained only for the prediction of the deterministic component of the SOS. Therefore, the NN prediction can be regarded as an estimate of the gtSOS.
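The paper does not specify the implementation used for this comparison; the following is a minimal sketch, assuming scikit-learn and synthetic placeholder data, of how NN and SVR regressors could be compared in cross validation on the VQM-to-SOS regression task (the variable names vqms and sos are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Placeholder data: vqms holds the five VQMs (PSNR, SSIM, VIF, MS-SSIM, VMAF) per PVS,
# sos holds the SOS_exp^pvs observed in the subjective experiment.
rng = np.random.default_rng(0)
vqms = rng.random((100, 5))
sos = rng.random(100)

models = {
    "NN": make_pipeline(StandardScaler(),
                        MLPRegressor(hidden_layer_sizes=(4,), max_iter=5000, random_state=0)),
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    pred = cross_val_predict(model, vqms, sos, cv=cv)  # cross-validated gtSOS prediction
    plcc, _ = pearsonr(sos, pred)
    print(f"{name}: cross-validated PLCC = {plcc:.2f}")
```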
Since subjective experiments are expensive and time consuming, it is very difficult, and probably even impossible, to find datasets that contain reliable subjective evaluations for a very high number of PVSs. This precludes the possibility of using, on these datasets, deep NNs, i.e., NNs with more than one hidden layer, or even single hidden layer NNs with a large number of neurons in the hidden layer. In fact, the high number of parameters, and consequently of degrees of freedom, of these NNs would lead to overfitting the dataset. In the context of this study, overfitting would yield an estimate of the gtSOS affected by the peculiarities of the specific subjective experiment reflected in the data used for training. Such an estimate of the gtSOS would therefore no longer be an intrinsic characteristic of the PVS, since it would suffer from the two sources of error due to the subjective experiment settings, i.e., scale quantization and the limited number of observers, as previously discussed. To overcome this problem, in Section 5 we will adopt a data augmentation approach; more precisely, we will artificially generate more data from the ones actually collected during a subjective experiment, in order to be able to use a deep NN. Given that the focus of this section is to validate the model in (2) for each subjective experiment involved in our study, we simply investigated several single hidden layer NNs with few neurons in the hidden layer, to determine the structure that works best to estimate f, without generating other data that could bias the accuracy of the proposed model in representing the SOS values actually observed during a subjective experiment.
We experimentally found that f can be effectively approximated by a NN with 5 neurons in the input layer, i.e., one for each VQM, a single hidden layer with 4 neurons, and an output layer with one neuron delivering the gtSOS estimate.
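As an illustration of this 5-4-1 structure, a minimal sketch using scikit-learn's MLPRegressor follows; the input standardization, activation, and solver settings are our assumptions, since they are not reported in the text:

```python
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 5 inputs (one per VQM) -> single hidden layer with 4 neurons -> 1 output (gtSOS estimate).
# Hyper-parameters below are illustrative assumptions, not those used in the paper.
f_approx = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(4,), activation="tanh",
                 solver="lbfgs", max_iter=5000, random_state=0),
)
# f_approx.fit(vqms, sos)              # fit on the five VQMs and the observed SOS per PVS
# gtsos_pred = f_approx.predict(vqms)  # per-PVS estimate of the gtSOS
```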
In order to validate the model in (2), we estimate the function f on five different annotated datasets, i.e., VQEG-HD1, VQEG-HD3, VQEG-HD5, Netflix public, and ITS4S. Once the function f is known, it is possible to i) estimate the value of gtSOSpvs for each PVS, thus identifying contents whose quality is intrinsically difficult to assess consistently (i.e., high gtSOSpvs); ii) deduce from (2) the value of the stochastic component Dexp for each PVS. From the set of Dexp values, we estimate the empirical cumulative distribution of Dexp, which we then compare with the cumulative distribution of a Gaussian random variable with zero mean and standard deviation equal to the one derived from the set of Dexp values. The results are shown in Fig. 4. In all cases, the empirical cumulative distribution of Dexp is very well approximated by a Normal cumulative distribution, which is coherent with the proposed SOS model. Figures 5, 6, 7, 8 and 9 report the comparison between the predicted gtSOS and the SOS for all the aforementioned datasets. On the various training sets, i.e., when training the NN using all the data in the dataset, the obtained PLCC values range from 0.30, in the worst case, up to 0.82, whereas in cross validation the observed PLCC values range from 0.29 to 0.77. The SROCC values are somewhat lower: on the various training sets they range from 0.24 to 0.69, and in cross validation from 0.23 to 0.62. This difference with respect to the PLCC values is an artifact of the quantization of the scale on which the subjective tests are conducted. In fact, computing the SOS on ordinal data increases the probability of ties, whose presence typically leads to an underestimation of the SROCC.
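As a sketch of the distributional check behind Fig. 4, assuming the per-PVS residuals Dexp are already available (placeholder values below), the empirical CDF can be compared with a zero-mean Gaussian CDF with matched standard deviation, e.g. with scipy:

```python
import numpy as np
from scipy.stats import norm

# d_exp: per-PVS residuals SOS_exp - predicted gtSOS (placeholder values for illustration).
rng = np.random.default_rng(1)
d_exp = rng.normal(0.0, 0.15, size=200)

x = np.sort(d_exp)
ecdf = np.arange(1, x.size + 1) / x.size                   # empirical CDF of D_exp
gauss_cdf = norm.cdf(x, loc=0.0, scale=d_exp.std(ddof=1))  # zero-mean Gaussian, matched std
print("max |ECDF - Gaussian CDF|:", np.abs(ecdf - gauss_cdf).max())
```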
We performed statistical tests to verify whether the PLCC and SROCC values in the aforementioned ranges can be considered statistically different from zero with 95% confidence, while taking into account the size of each dataset, i.e., the number of PVSs evaluated in the dataset. In all cases, the test revealed that the obtained PLCC and SROCC values can be considered greater than zero with statistical significance. Therefore, the hypothesis that it is possible to obtain information about the diversity of the opinions expressed by different observers about the visual quality of a PVS, using only some VQMs computed on that PVS, cannot be rejected.
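The text does not state which test was employed; a common choice, sketched below, is the one-sided t-test on the correlation coefficient, whose null distribution depends on the number of PVSs (the sample size used here is purely illustrative):

```python
import numpy as np
from scipy.stats import t

def corr_greater_than_zero(r, n, alpha=0.05):
    """One-sided test that a correlation r computed on n samples is greater than zero."""
    t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    p_value = 1.0 - t.cdf(t_stat, df=n - 2)
    return p_value < alpha, p_value

# Example with the weakest cross-validated SROCC reported (0.23) and a hypothetical
# dataset of 200 PVSs; the same t-approximation is commonly applied to the SROCC.
print(corr_greater_than_zero(0.23, 200))
```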
We notice that lower PLCC and SROCC values were observed for the ITS4S dataset in comparison with those obtained on the other datasets. We attribute this behavior to the fact that, unlike the other subjective experiments considered in this work, the ITS4S experiment was designed for the development of no-reference metrics. Therefore, during the experiment, the source (SRC), i.e., the original content, was never shown to the observers. Hence, the full reference VQMs considered in this study did not provide as much information on the diversity of the observers’ opinions as in the other cases. Nevertheless, the obtained PLCC and SROCC can be considered significant with 95% confidence.
Anomaly detection
In the literature, some studies [13] addressed the issue of identifying potential anomalies in a subjective experiment due to the presence of peculiar contents or subject behavior. For instance, an observer may just assign random votes, or the grading of a specific sequence may be remarkably inconsistent. The presence of such anomalies may negatively affect the accuracy of objective measures developed relying on the raw data collected during subjective experiments. The typical approach to anomaly detection is to model the observer opinion on each sequence using the normal distribution [11, 13, 15] and then estimate the related parameters to identify unexpected situations. While the normal distribution is very convenient from the theoretical point of view, in practice it may not always be the best option. For instance, the normal distribution cannot effectively model the distribution of opinions for PVSs with very high or very low perceived visual quality, as illustrated in Fig. 10a, which shows the score distribution for a specific PVS in the Netflix dataset.
In this work, we approach the problem differently. Our analysis is based on the proposed SOS model described by (2). The term Dexp in the model represents the part of the inconsistency of the votes introduced by the experimental settings; as such, it also models the average inconsistency of the sample of people chosen for the experiment. Estimating it therefore allows us to identify the sequences for which a high inconsistency of the votes has been observed, as well as those for which, due to the quantization of the scale, the observed SOS is lower than the one that could have been observed with a greater number of subjects voting on a continuous scale.
Our procedure to find potential anomalies can be summarized as follows. Starting from the data of the subjective experiment under examination, we estimate the function f as discussed before, then from (2) and (3) we obtain, for each PVS, the following estimate:
$$ D_{exp} \approx SOS_{exp} - f(PSNR,SSIM,VIF,MSSSIM,VMAF) $$
(4)
We thus obtain a set of values having a normal distribution with zero mean, as indicated by the model in (2). The PVSs whose evaluation we believe may be affected by anomalies are those for which the estimated Dexp value is an outlier of this distribution. In practice, denoting by \(D_{exp}^{pvs}\) the value of Dexp for a given PVS and by \(std_{D_{exp}}\) the standard deviation of Dexp, we suggest taking a closer look at the ratings of each PVS for which:
$$ \left|D_{exp}^{pvs}\right|>3 \cdot std_{D_{exp}} $$
(5)
and carefully examining such potential anomalies before using the data.
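A minimal sketch of this screening rule, assuming the predicted gtSOS values are already available from the fitted NN (the arrays below are synthetic placeholders), could look as follows:

```python
import numpy as np

def flag_anomalous_pvs(sos_exp, gtsos_pred, k=3.0):
    """Flag PVSs whose residual D_exp = SOS_exp - predicted gtSOS exceeds k standard deviations."""
    d_exp = np.asarray(sos_exp) - np.asarray(gtsos_pred)
    threshold = k * d_exp.std(ddof=1)
    return np.flatnonzero(np.abs(d_exp) > threshold), d_exp

# Synthetic example: one PVS with an injected, abnormally large residual.
rng = np.random.default_rng(2)
sos_exp = rng.normal(1.0, 0.15, size=120)
gtsos_pred = np.full(120, 1.0)
sos_exp[63] += 0.9                       # echoing PVS #63, made deliberately anomalous
flagged, d_exp = flag_anomalous_pvs(sos_exp, gtsos_pred)
print(flagged)                           # the injected index is expected to be flagged
```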
In order to investigate the effectiveness of the method in practice, we tested it on the Netflix public dataset and the ITS4S dataset. In Fig. 10, we report again the comparison between the predicted gtSOS and the SOS after determining the function f on the Netflix public dataset. We labeled the PVSs to facilitate the interpretation of the results. For any PVS, \(D_{exp}^{pvs}\) is estimated by subtracting the predicted gtSOSpvs from the \(SOS_{exp}^{pvs}\). Consider, for instance, PVS #63, for which the condition in (5) holds. The ratings collected in the subjective experiment are shown in Fig. 10a. For such a PVS, even if the mode of the distribution of the subjects’ opinions is equal to 5 (“Excellent”) and 22 observers out of 26 rank the quality of the PVS at least 4, i.e., “Good”, there is surprisingly an observer ranking it as 1, i.e., “Bad”. It is therefore reasonable to be skeptical about the latter rating. This is even more curious when we notice that there are sequences, such as PVS #19, where the same anomalous observer is in full agreement with all the other observers. In the case of the ITS4S dataset, shown in Fig. 11, we analyzed the scores collected for PVS #257 and #278, which exhibit a high value of |Dexp|. We notice that the individual subjects’ ratings for PVS #257 (shown in Fig. 11a) are almost uniformly distributed between “Poor” and “Excellent”, leading to an observed SOS value that is significantly larger than the predicted gtSOS; the latter would suggest that the intrinsic difficulty of evaluating the PVS should be lower. Therefore, the content characteristics of this PVS should be investigated in more detail. On the contrary, for PVS #278 (shown in Fig. 11b), a low value of the SOS is observed, since 21 observers rated its perceived visual quality as 1 (“Bad”) and 5 observers rated it as 2 (“Poor”). However, the analysis indicates that the observed SOS underestimates the gtSOS, and thus the intrinsic capacity of such a PVS to confuse the observers in terms of quality perception. This suggests that a higher diversity among the opinions should be expected if more ratings were gathered. This is therefore another interesting case for further investigation. For instance, such a PVS could be reevaluated by asking many observers to vote on a continuous scale, in order to make sure that the low SOS value previously observed is not just due to the scale quantization effect and the use of a limited number of observers.