This section presents a practical solution to the problem of automatic detection of low-quality content. It is based on a previously developed (and properly trained) quality assessment system, which evaluates Blockiness, Blur, Contrast and Noise impairments in the no-reference (NR) model. The choice of artefacts was made by a cooperating industrial partner.
The feasibility of training the quality evaluation system was studied by means of a crowd-sourcing test, i.e. the process of acquiring knowledge from a large number of (mainly on-line) subjects. The development of information technology and the high popularity of social networking have made the Internet one of the main channels for collecting and distributing information. A dedicated website was developed to perform the test. It contained a database of images with different degrees of degradation. For the sake of test simplicity, the authors decided to use images rather than video sequences. Test participants were asked to answer questions concerning the quality of sequentially displayed images. The site was made available on social networks and sent via e-mail to various audiences, including subjects dealing with image analysis. From these results, gathered from a diverse population, perception thresholds were determined for four types of image distortion, namely: Blockiness, Blur, Contrast and Noise.
The use of images, rather than videos, is justified by the nature of the artefact indicators developed. All of them operate on a single-frame basis and may later be used as input to a selected temporal pooling algorithm, yielding a quality indication for a whole video sequence.
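The temporal pooling step may be illustrated with a minimal sketch (this section does not specify a particular pooling algorithm, so mean and low-percentile pooling are shown purely as examples; the function name and sample values are hypothetical):

import numpy as np

def pool_frame_scores(frame_scores, method="mean", percentile=10):
    # Collapse per-frame indicator outputs into a single score
    # for the whole video sequence.
    scores = np.asarray(frame_scores, dtype=float)
    if method == "mean":
        return scores.mean()
    if method == "percentile":
        # Low percentiles emphasise the worst-quality frames,
        # a common choice in temporal pooling.
        return np.percentile(scores, percentile)
    raise ValueError("unknown pooling method: " + method)

# Hypothetical per-frame blur indications for a short clip.
print(pool_frame_scores([4.1, 3.8, 2.2, 4.0], method="percentile"))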
The test was conducted according to best practices drawn from VQEG activities and the white paper based on the QUALINET task force experience.
Examined artefacts and image assessment methodology
We studied the effects of four types of artefacts: Blockiness, Blur, Contrast and Noise.
Three questions were asked in the test. The first concerned whether the subject saw any artefact in the displayed image. The second required the subject to score the image on the Mean Opinion Score (MOS) scale. The third concerned the type of distortion present: the subject could indicate whether the image contained any of the following impairments: Blockiness, Blur, Contrast or Noise. If the subject did not see any artefact, they could choose the answer “none”; an “other” option was provided as well. The final question was asked only of groups active in the image processing field.
The Mean Opinion Score (MOS), referred to above, is a subjective, numerical indication of the quality of a medium obtained after compression, decompression or transmission. MOS consists of levels from 1 to 5, where each level denotes: 1 – bad quality, 2 – poor quality, 3 – average quality, 4 – good quality, and 5 – excellent image quality. A subject of the crowd-sourcing experiment could select only one of these levels.
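As a worked example of how MOS is computed, it is simply the arithmetic mean of the individual opinion scores (the votes below are hypothetical, not taken from the test):

# Hypothetical opinion scores (1-5) given by six subjects for one image.
scores = [4, 3, 4, 5, 3, 4]

# MOS is the arithmetic mean of the individual opinion scores.
mos = sum(scores) / len(scores)
print(round(mos, 2))  # 3.83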
The first step in implementing the crowd-sourcing test was to prepare images with various degrees of artefacts. Materials prepared in this way were later uploaded to the website and scored by the test subjects. Eight (8) appropriately transformed images were selected for the test.
Before uploading, the images were resized so as not to exceed nine hundred (900) pixels in the horizontal direction. Thanks to this, photos on the website could be viewed in their entirety on a fifteen-inch screen, the most popular screen size amongst laptop users. The images were then distorted with artefacts. Each of the eight images was subjected to: thirteen (13) levels of the Blockiness artefact type, ten (10) levels of the Blur artefact type, seventeen (17) levels of the Contrast artefact type, and nine (9) levels of the Noise artefact type. In total, a database of four hundred (400) images was compiled. Additionally, eight common-set (warm-up) images were chosen to be displayed during the first run of a test, because during the first visit the subject had to learn the web interface provided. Consequently, the first eight scores of the test images were not taken into account when analysing the results.
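The exact distortion procedures are not described here; the sketch below shows one plausible way to generate such test material with Pillow and NumPy, with heavy JPEG compression standing in for Blockiness, Gaussian blur for Blur, a contrast factor for Contrast, and additive Gaussian noise for Noise (the file name and level values are hypothetical):

import io
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def prepare(img, max_width=900):
    # Resize so the width does not exceed 900 px, as in the test setup.
    if img.width > max_width:
        h = round(img.height * max_width / img.width)
        img = img.resize((max_width, h), Image.LANCZOS)
    return img

def blockiness(img, quality):
    # Approximate blockiness via heavy JPEG compression.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def blur(img, radius):
    return img.filter(ImageFilter.GaussianBlur(radius))

def contrast(img, factor):
    return ImageEnhance.Contrast(img).enhance(factor)

def noise(img, sigma):
    arr = np.asarray(img, dtype=float)
    arr += np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

src = prepare(Image.open("source.png").convert("RGB"))
distorted = blockiness(src, quality=10)  # one hypothetical Blockiness level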
The next stage of the test was to put the images on the website and allow users to start the evaluation process. Each subject could complete the test once; it was impossible to log in again using the same user name. After logging in, the user was presented with the warm-up sequence, followed by the four hundred (400) test images.
Each image to be assessed was presented in its original resolution, accompanied on the right by a panel displaying all three questions along with the username, a progress bar, and the interface for moving between test images. When a subject moved on to the next image, the results were saved to the database.
For each question displayed on the page, only a single answer could be selected. If the subject failed to answer all three questions and tried to move to the next image, a message asking them to address the remaining questions was displayed.
The user could end the test at any time, either by logging off from the front end of the interface or simply by leaving the web page.
A total of one hundred seventy-three (173) subjects took part in the crowd-sourcing test over a single month. Forty (40) subjects simply logged in and did not participate in the evaluation process. Forty-two (42) people gave evaluation scores for fewer than nine images, assessing just the common-set images, which were not included in the analysis of the test results. Ninety-one (91) subjects issued scores for more than eight images. On average, ten (10) scores were obtained for each image. The number of scores made it possible to separate the results of the various user groups participating in the test, allowing separate analyses for a few distinct user profiles.
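Filtering the answer log along these lines could look as follows (a sketch only; the storage format with columns user, image, seen, mos, artefact and level is an assumption, not the actual database schema):

import pandas as pd

# Hypothetical log: one row per answer, in the order given by each user.
answers = pd.read_csv("answers.csv")  # columns: user, image, seen, mos, artefact, level

# Keep only subjects who scored more than eight images overall.
counts = answers.groupby("user").size()
answers = answers[answers["user"].isin(counts[counts > 8].index)]

# Discard each remaining subject's first eight (warm-up) scores.
answers = answers.groupby("user", group_keys=False).apply(lambda g: g.iloc[8:])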
Operating under a time constraint made it impossible to gather more results. Nonetheless, the number of answers acquired proved sufficient for further analysis.
Analysis of results
Based on the test results, artefact perception percentages were determined for all levels of each single distortion. This quantity denotes the percentage of test subjects who correctly noticed the artefact's presence. The number of scores received did not allow for a separate analysis of each image. Figure 1 presents perception percentages plotted against the quality metric outcomes yielded by the measurement software package.
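Given an answer log like the hypothetical one sketched above, the perception percentages could be computed as follows:

# 'seen' is True when the subject noticed the artefact (question one).
perception = (
    answers.groupby(["artefact", "level"])["seen"]
    .mean()
    .mul(100)  # fraction of subjects -> percentage
)
print(perception.loc["blur"])  # perception percentage vs. blur level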
On the basis of those results, an artefact perception threshold was calculated for each type of impairment. The threshold value was chosen to represent the situation in which half of the test subjects saw the distortion and the other half did not. For the Blockiness artefact type, the artefact was visible to fewer than half of the respondents above the level of 50. For the Blur artefact type, distortions were detected at levels greater than 1. For the Contrast artefact type, image degradation was not detected at levels above -10 and below 20. For the Noise artefact type, distortions were visible above the level of 4.
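The 50 % crossing point can be found by linear interpolation of the perception-percentage curve; a minimal sketch (the example levels and percentages are invented):

import numpy as np

def perception_threshold(levels, percentages):
    # Metric level at which exactly half of the subjects saw the
    # artefact, interpolated between the measured points.
    levels = np.asarray(levels, dtype=float)
    percentages = np.asarray(percentages, dtype=float)
    order = np.argsort(percentages)  # np.interp needs ascending x
    return np.interp(50.0, percentages[order], levels[order])

# Hypothetical perception percentages for increasing blur levels.
print(perception_threshold([0.5, 1.0, 1.5, 2.0], [20, 45, 70, 95]))  # 1.1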
The error of the designated thresholds was estimated. The data set was divided into training and test subsets, which was necessary to perform cross-validation of the model. The following accuracies were achieved for the various artefact types: Blockiness – 77.09 %, Blur – 87.5 %, Contrast – 75 % and Noise – 78.57 %.
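The validation procedure is not detailed here, so the sketch below shows one plausible k-fold scheme for a single artefact type: the threshold is fitted on the training folds and its agreement with the subjects' answers is measured on the held-out fold. The comparison direction ("metric above threshold means visible") matches Blur and Noise; it would be inverted for Blockiness.

import numpy as np
from sklearn.model_selection import KFold

def threshold_accuracy(metric, seen, threshold):
    # Fraction of answers for which "metric above threshold implies
    # a visible artefact" agrees with what subjects reported.
    return np.mean((metric > threshold) == seen)

def cross_validated_accuracy(metric, seen, n_splits=5):
    metric = np.asarray(metric, dtype=float)
    seen = np.asarray(seen, dtype=bool)
    accuracies = []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(metric):
        # Fit on the training fold: pick the candidate threshold that
        # maximises agreement with the training answers.
        best = max(np.unique(metric[train]),
                   key=lambda t: threshold_accuracy(metric[train], seen[train], t))
        accuracies.append(threshold_accuracy(metric[test], seen[test], best))
    return float(np.mean(accuracies))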
The threshold calculated for the Noise artefact type did not take into account the results for one of the tested images, due to the inconsistency of the data obtained for its different distortion levels. Results for this single image had a significant impact on the final threshold value; separating them from the rest of the data yielded much better model performance, which was further confirmed by cross-validation.
When different types of artefacts are imposed on an image simultaneously, none of them being dominant, the metrics cannot be calculated properly because the distortions mutually mask one another. To enable an accurate assessment of a single artefact, the appropriate compensation algorithms must first be applied.
The main problem encountered during the research was the number of test subjects per tested image. Out of the one hundred seventy-three (173) subjects, nearly half did not participate in the test, only logging in or assessing common-set images. The vast majority of those who evaluated more than the eight common-set images failed to complete half of the test. Hence, proper examination of the results was not a trivial task. It was impossible to determine visible-artefact thresholds unambiguously or to analyse the results for each group of artefacts separately; the latter difficulty arose from an insufficient number of scores per single image within an artefact group.
Ratings were almost identical across images for the Blur artefact type, while the most varied ratings were obtained for the Noise distortion type. As previously mentioned, to achieve high threshold accuracy in the latter case, results for one of the test images had to be ruled out, which significantly increased the reliability of the final threshold level.
Despite the problems encountered, the test was completed successfully, and artefact perception thresholds of high accuracy were obtained for the specific types of impairments examined.