Software package for measurement of quality indicators working in no-reference model

The key objective of No-Reference (NR) visual metrics (indicators) is to predict the end-user experience concerning remotely delivered video content. Rapidly increasing demand for easily accessible, high quality video material makes it crucial for service providers to test the user experience without the need for comparison with reference material. In this paper, we present a versatile measurement system and describe various optimisation strategies utilised to reach real-time operation. Furthermore, several calculation automation scripts are described, along with a dedicated graphical user interface, which gives a more comprehensive insight into the presented system. On top of that, we show the results of crowd-sourcing experiments used to estimate subjective threshold values for quality indicators. Additionally, integration with the IMCOP system is introduced.


Introduction
Providing not only a high level of traditional Quality of Service (QoS), but also Quality of Experience (QoE) is a real challenge for ISPs (Internet Service Providers), audiovisual service providers, broadcasters and new Over-The-Top (OTT) service providers. Therefore, objective audiovisual quality metrics are often computed in order to monitor, troubleshoot, analyse and establish patterns of content applications working in real-time or offline scenarios. Since 2000, work related to the concept of QoE, in the context of different applications, has gained momentum and achieved business recognition.
A number of researchers focus on different ways to assess the quality of vision applications, taking into account additional information used in the evaluation process. Usually, two main approaches (metrics classes) are distinguished. The first approach is called Full-Reference (FR), and assumes unlimited access to the original (reference) video sequences. FR metrics are usually the most accurate at the expense of higher computational effort. The second class is commonly referred to as a No-Reference (NR) approach and is based on the quality assessment without knowledge of the original material. Due to the missing original signal, NR metrics may be less accurate than their FR counterparts, but tend to provide much better computational efficiency.
In this paper, we present a software package to measure quality indicators, operating in the difficult NR model. This software package is the realisation of a previously developed concept of monitoring the quality of vision by Key Performance Indicators (KPI) [19]. The idea proposed here goes by the name: Monitoring Of Audio Visual quality by Key Performance Indicators (MOAVI). MOAVI artefacts (or KPIs) are divided into four categories, depending on their origin: capturing, processing, transmission and display. A MOAVI-based application is able to isolate and improve incident investigation, aid algorithm configuration, extend monitoring periods and ensure better prediction of QoE.
Most models of quality are based on the measurement of typical artefacts/KPIs, such as blur, blockiness or jerkiness, and produce MOS (Mean Opinion Score) forecasts. Therefore, many of the algorithms generating an expected value of MOS use a blend of blur, blockiness and jerkiness metrics. Weighting between each KPI can be a simple mathematical function. However, if one KPI is not correct, the global result of prediction is completely wrong. Other KPIs, such as exposure, noise, block-loss, freezing, slicing, etc., are usually not taken into account in the prediction of MOS [18].
ITU-T has been working on a similar noise measurement model for many years [7], but only for the FR and with the Reduced Reference (RR) approach. The history of ITU-T recommendations for image quality metrics is presented in Table 1. Table 2 shows the synthesis of a set of standard indicators that are based on video signals [18]. As can be seen from both tables, there are no achievements for the NR approach.
Although not standardised, NR video quality assessment methods do exist. Zhu et al. presented in [29] a model based on the discrete cosine transform (DCT) and on mapping non-linear sequence-level features to subjective scores by means of a trained multilayer neural network. The authors of [29] used experimental results to show that NR metrics can compete with their FR and RR counterparts. However, due to its nature, the NR approach is both distortion specific and data driven, as compared to the more universal FR algorithms. This conclusion is not surprising, considering the fact that the authors focused solely on H.264/AVC compression as a fundamental source of distortions. On the other hand, findings shown in [20] suggest the possibility of introducing a data independent NR solution. Li, Guo and Lu use a spatiotemporal 3D-DCT to extract features both in space and time. This information is further used to calculate a small set of parameters, which, after temporal pooling for the entire sequence, get mapped to subjective scores. Thanks to thorough training and testing on various databases, the authors of [20] verified the data independence of their solution. Nonetheless, the best results were obtained for sequences distorted with only a single artefact source, making this solution not globally applicable. It is worth mentioning that both [29] and [20] use the luminance channel solely. This concept is also applied in the presented work due to the higher sensitivity of the human visual system (HVS) to luminance (rather than colour) changes.
Another thing to consider about the solution described in this article is the lack of temporal pooling and subjective scores mapping, which makes it difficult to directly compare our work with others. Those missing concepts remain to be implemented and tested in the near future. Nevertheless, as described in VQEG's (Video Quality Experts Group) MOAVI project [28], the KPI approach is defined to be complementary to, and more universal than, classical QoE measurement based on overall quality prediction.
The remainder of this paper is structured as follows. A general overview of software structure and quality metrics listing is given in Section 2. Section 3 presents experimental threshold values for metrics, along with a methodology used to obtain them. A detailed description of the operation of the presented software is given in Section 4, which is further divided into Subsections 1 to 5, all of which provide a comprehensive guide to the development process. Integration of quality evaluation software package with the IMCOP system is provided in Section 5. Section 6 concludes the paper.

Table 1 History of ITU-T recommendations for image quality metrics
Resolution   FR           RR           NR
             [12]         n/a          n/a
SDTV         J.144 [8]    n/a          n/a
VGA          J.247 [10]   J.246 [9]    n/a
CIF          J.247 [10]   J.246 [9]    n/a
QCIF         J.247 [10]   J.246 [9]    n/a

Aiming to allow easier evaluation and debugging of the software, the authors decided to design it in a modular manner. This basically means that each of the metrics may be easily detached or attached to the whole topology. Utilising such a strategy makes it possible to comfortably and efficiently modify the functionality of the package. In this way, the final shape of the application may be precisely carved to fit the desired use-case scenario. The software consists of 15 visual metrics, which together form KPIs that could be used to model predicted quality of experience, as seen from the perspective of the end-user. The following set of metrics was developed:

Investigating room for crowd-sourcing quality evaluation
This section presents a practical solution to the problem of automatic detection of low quality. It is based on a previously developed system for quality assessment (properly trained), which evaluates Blockiness, Blur, Contrast and Noise impairments in the NR model. The choice of artefacts was made by a cooperating industrial partner.
The possibility of training the quality evaluation system was studied by means of a crowd-sourcing test, which is the process of acquiring knowledge from a large number of (mainly on-line) subjects. The development of information technology and the high popularity of social networking have made the Internet one of the main channels for collecting and distributing information. To perform the test, a dedicated website was developed. It contained a database of images with different degrees of degradation. For the sake of test simplicity, the authors decided to use images rather than video sequences. Test participants were asked to answer questions concerning the quality of sequentially displayed images. The site was made available on social networks and sent via e-mail to various audiences, including subjects dealing with issues of image analysis. With these results, gathered from subjects of diverse backgrounds, the thresholds of artefact perception were determined for four types of image distortions, namely: Blockiness, Blur, Contrast and Noise.
Usage of images, rather than videos, may be justified due to the nature of artefact indicators developed. All of them operate on a single-frame basis and may later be used as an input to the selected temporal pooling algorithm, yielding quality indication for the video sequence.
The test was conducted according to best-practices taken from VQEG activities and the white paper published based on QUALINET task force experience [5].

Examined artefacts and image assessment methodology
We studied the effects of four types of artefacts: Blockiness, Blur, Contrast and Noise.
Three questions were asked in the conducted test. The first related to whether the subject saw any artefact in the displayed image. The second required the subject to score images on a Mean Opinion Score (MOS) scale. The third related to the type of distortion present. The subject could determine if the image contained any of the following impairments: Blockiness, Blur, Contrast or Noise. In case the subject did not see any artefact, they could choose the answer "none". An "other" option was provided as well. The final question was asked only to groups active in the image processing field.
The Mean Opinion Score (MOS), referred to in the previous paragraph, is a scale related to a subjective, numerical indication of the quality of a medium obtained after compression, decompression or transmission. MOS consists of levels from 1 to 5, denoting: 1 (bad quality), 2 (poor quality), 3 (average quality), 4 (good quality) and 5 (excellent image quality) [6]. The crowd-sourcing experiment subject could select only one of these levels.
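As a brief illustration of the scale described above, the following minimal sketch averages individual 1 to 5 ratings for a single image. The function name and the sample ratings are hypothetical and serve only as an example; they are not taken from the experiment.

```python
from statistics import mean

# Labels of the five-level scale described in the text
MOS_LABELS = {1: "bad", 2: "poor", 3: "average", 4: "good", 5: "excellent"}


def mean_opinion_score(scores):
    """Return the arithmetic mean of individual 1-5 ratings for one image."""
    if not scores:
        raise ValueError("at least one score is required")
    for score in scores:
        if score not in MOS_LABELS:
            raise ValueError(f"score {score!r} is outside the 1-5 scale")
    return mean(scores)


# Hypothetical ratings given by ten subjects to a single image
ratings = [4, 3, 4, 5, 3, 4, 4, 2, 3, 4]
print(mean_opinion_score(ratings))
```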

Crowd-sourcing process
The first step in implementing the crowd-sourcing test was to prepare images with various degrees of artefacts. Materials designed in this way were later uploaded to the website and scored by the test subjects. Eight (8) properly transformed images have been selected for the test.
Before uploading, the image size was modified in order not to exceed nine hundred pixels in the horizontal direction. Thanks to this, photos on the website could be viewed in their entirety on a fifteen-inch screen, being the most popular screen size amongst laptop users. The images were then distorted with artefacts. For each of the eight images, the following artefacts were applied: thirteen (13) levels of Blockiness artefact type, ten (10) levels of Blur artefact type, seventeen (17) levels of Contrast artefact type, and nine (9) levels of Noise artefact type. In total, a database of four hundred (400) images was compiled. Additionally, eight common set (warm-up) images were chosen to be displayed during the first run of a test. This treatment was due to the fact that during the first visit, the subject had to learn the web interface provided. As a result, the first eight scores of the test images were not taken into account when analysing the results.
The next stage of the test was to put the images on the website and allow users to start the evaluation process. Each subject could complete the test once; it was impossible to log in again using the same user name. After logging in, the user was presented with a warm-up sequence, followed by four hundred (400) relevant photos.
Each image to be assessed was presented in its original resolution and accompanied on the right by a panel displaying all three questions along with the username, progress bar, and the interface for moving between test images. When a subject passed to the next image, results were saved to the database.
For all the questions displayed on the page, one could select only a single answer. If the subject failed to answer all three and tried to move to the next image, a message asking to address the remaining questions was displayed.
The user could end the test at any time, either by logging off from the front end of the interface or simply leaving the web page.

Results
A total of one hundred seventy-three (173) subjects took part in the crowd-sourcing test in a single month. Forty (40) subjects simply logged in and did not participate in the evaluation process. Forty-two (42) people gave evaluation scores for fewer than nine images, assessing just a collection of common set images not included in the analysis of the test results. Ninety-one (91) subjects issued scores for more than eight images. On average, ten (10) scores were obtained for each image. The number of scores made it possible to separate the results of various user groups participating in the test. This division allowed separate analyses to be carried out for a few distinct user profiles.
Operation under the time constraint made it impossible to gather more results. Nonetheless, the number of answers acquired proved to be sufficient for further analysis.

Analysis of results
Based on the test results, artefact perception percentages were determined for all levels of each single distortion. This quantity denotes the percentage of test subjects, who properly noticed an artefact's presence. The number of scores received does not allow for a separate analysis of each image. Figure 1 presents perception percentages plotted versus quality metrics outcomes yielded by the measurement software package.
On the basis of those results, artefact perception thresholds were calculated for each type of impairment. A threshold value was chosen to represent a situation when half of the test subjects saw the distortion, and the other half did not. For the Blockiness artefact type, the artefact was visible for less than half of the respondents, above the level equal to 50. For the Blur artefact type, impurities were detected for a level greater than 1. In the Contrast artefact type, degradation of an image was not detected for levels above −10 and below 20. For the Noise artefact, distortions were visible above the level equal to 4.
The error of the designated thresholds was estimated. The data set was divided into training and test subsets, which was necessary to perform cross-validation of the model. The following accuracies were achieved for the individual artefact types: Blockiness: 77.09 %, Blur: 87.5 %, Contrast: 75 % and Noise: 78.57 %.
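The 50 % criterion can be illustrated with a short sketch. The text does not state how the crossing point was located, so linear interpolation between neighbouring metric levels is an assumption made here purely for illustration. The function name and the sample data are hypothetical.

```python
def perception_threshold(levels, percent_seen, target=50.0):
    """Return the metric level at which `target` percent of subjects see the artefact.

    `levels` must be sorted in ascending order and paired with `percent_seen`.
    The crossing point is found by linear interpolation (an assumption, not
    necessarily the method used to obtain the published thresholds).
    Returns None when the curve never crosses `target`.
    """
    points = list(zip(levels, percent_seen))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        crosses = (y0 - target) * (y1 - target) <= 0
        if crosses and y0 != y1:
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    return None


# Hypothetical perception percentages for five metric levels
levels = [0, 2, 4, 6, 8]
percent_seen = [5, 20, 40, 60, 90]
print(perception_threshold(levels, percent_seen))
```

A curve that rises through 50 % and one that falls through it are both handled, since only the sign change around the target is tested.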
The calculated threshold for the Noise artefact type did not take into account the results for one of the tested images. This was due to the inconsistency of the data obtained for different levels of impurity. Results for this single image had a significant impact on the final threshold value. Separating them from the rest of the data yielded much better model performance, which was further proven by cross-validation.
In the case of simultaneous imposition of different types of artefacts on images, of which no one is dominant, the calculation of metrics cannot be made properly because of mutual

Summary
The main problem encountered during the research was the number of test subjects per tested image. Out of one hundred seventy-three (173) subjects, nearly half did not participate in the test, logging in or assessing common set images only. The vast majority of those who evaluated more than eight common set images failed to complete half of the test. Hence, proper examination of results was not a trivial task. It was impossible to clearly determine visible artefact thresholds or analyse the results for each group of artefacts separately. The latter difficulty arose from an insufficient number of scores for a single image in the artefact group.
Each image was almost identically rated for the Blur artefact type. The most varied ratings were obtained for the Noise distortion type. As was previously mentioned, to achieve a high threshold accuracy here, results for one of the test images had to be ruled out, significantly increasing the reliability of the final threshold level.
Despite the problems encountered during the test, it was completed successfully. Highly accurate artefact perception thresholds were obtained for specific types of impurities.
As was already mentioned, the presented software package performs a remote NR quality assessment. The main goal accompanying its design and implementation was to create an application that is platform-independent and does not include proprietary software. Consequently, the source code of the program was written entirely in the C programming language and none of the metrics utilised any external libraries. This approach resulted in a longer development timeframe but at the same time allowed us to create a versatile, portable and stable measurement system.

Input and output interfaces
The presented software package operates within the NR model, meaning that the measurement is performed without any knowledge of the original sequence. As a consequence, input material must be analysed in a pixel-by-pixel fashion. This in turn imposes the necessity of decompressing the video file or stream before any computation may be performed. Due to the fact that the algorithms used operate solely on the luminance channel (Y), the YUV420p format is utilised to store the input files for the application. It makes it possible to save memory by omitting part of the information related to colours, further referred to as chrominance channels. Data stored in this manner incorporates complete information about the grayscale representation, but allocates only one value of each chrominance channel (U and V) for every 4 pixels of the original material. An additional advantage of using the previously mentioned format is the contiguous alignment of image data, which constitutes a very basic optimisation strategy. Most hardware platforms perform best when operating on linearly stored information. Reading out sequentially ordered memory blocks yields the lowest possible access times and thus leaves more headroom for the actual computation.
In addition to the uncompressed video sequence, the application also expects the parameters describing width, height and number of frames per second of the tested material. Supplementary input arguments result from the specification of YUV420p format. It does not contain any header for storing detailed information about the included material. In most cases, however, this is not a problem, since data used for processing exists in compressed form, which along with the video material, contains all the essential information.
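The layout described above can be sketched as follows. This is a minimal illustration of reading the luminance plane from a headerless planar YUV420p stream, assuming 8-bit samples and even frame dimensions. The function names are hypothetical, and the actual package is written in C; Python is used here only for brevity.

```python
import io


def frame_size_yuv420p(width, height):
    """Bytes per 8-bit YUV420p frame: full-resolution Y plus quarter-size U and V."""
    luma = width * height
    chroma = (width // 2) * (height // 2)  # one U and one V sample per 2x2 block
    return luma + 2 * chroma


def iter_luma_planes(stream, width, height):
    """Yield the Y plane of each complete frame read from a raw YUV420p stream."""
    frame_bytes = frame_size_yuv420p(width, height)
    luma_bytes = width * height
    while True:
        frame = stream.read(frame_bytes)
        if len(frame) < frame_bytes:
            break  # end of stream or truncated last frame
        # Planar layout: the Y plane comes first and is stored contiguously
        yield frame[:luma_bytes]


# Two synthetic 4x2 frames held in memory instead of a file
width, height = 4, 2
frame0 = bytes([16] * 8) + bytes([128] * 4)
frame1 = bytes([235] * 8) + bytes([128] * 4)
planes = list(iter_luma_planes(io.BytesIO(frame0 + frame1), width, height))
print(len(planes), frame_size_yuv420p(1920, 1080))
```

Because the format carries no header, the width and height must be supplied by the caller, which is why they appear as explicit parameters.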
The application generates a detailed report concerning each frame of the input material. Alongside frame number, one can also see the result of each single metric. Presentation of the output information is twofold:

Planned and applied optimisation schemes
The careful reader may notice that operations performed on uncompressed video sequences require large memory bandwidth, as well as high computational power. This kind of restriction becomes especially important when operating in real-time or nearly real-time scenarios. Figure 4 shows the relative execution times for each metric, when processing video with a resolution of 1920×1080 pixels. Average computation time for such material is around 119 ms. At this point it is worth mentioning that this test was conducted using a single-thread version of the application on a machine featuring an Intel Core i7 950 CPU @ 3.07 GHz × 8 processor. The average processing time indicates the necessity of further optimisation if one requires real-time execution of the software. Assuming the video sequence gets refreshed 30 times per second, fetching image data and performing computations must not exceed 33 ms. Should dropping any of the provided indicators prove impossible, another optimisation technique would be to utilise the multiprocessor, and thus multithread, architecture of contemporary platforms. Performing the test once again, this time employing a multithread version of the application, reduced the time needed for calculations to 59 ms. Even though this does not guarantee real-time operation, there are still more optimisation strategies to be implemented.

Fig. 4 Relative metrics execution time for a single 1920×1080 image frame
If, on the other hand, eliminating some of the indicators is possible, ruling out Blur and Block-loss metrics yields an execution time below 33 ms (provided that multithread version of the software is used).
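The real-time constraint reduces to simple arithmetic, sketched below using only the figures quoted above (30 frames per second, 119 ms and 59 ms per frame). The helper names are hypothetical.

```python
def frame_budget_ms(fps):
    """Time available for fetching and processing one frame, in milliseconds."""
    return 1000.0 / fps


def achievable_fps(processing_ms):
    """Frame rate sustainable when each frame takes `processing_ms` to process."""
    return 1000.0 / processing_ms


budget = frame_budget_ms(30)
for label, ms in [("single thread", 119.0), ("multithread", 59.0)]:
    print(f"{label}: {ms:.0f} ms per frame, about {achievable_fps(ms):.1f} fps, "
          f"{ms / budget:.1f}x the {budget:.1f} ms budget")
```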
It is worth mentioning that many image processing algorithms use a precisely defined and, more importantly, finite set of operations, which may be performed on the image. As a consequence, once processed, an image or parameter may be stored and used again in other metrics. This strategy works best if the amount of data to be stored does not exceed some threshold value, which defines the balance point for a trade-off between memory usage and computational complexity.
Yet another possible optimisation scenario is to move as many computations as possible into the domain of integer numbers. This is justified only if one plans to use the central processing unit (CPU) exclusively. Due to its internal topology, it performs best when used with this kind of data.
All optimisation methods described operate in the software layer of the system design. Apart from those, one can always try to port the code to another hardware platform like the GPU (Graphics Processing Unit) or FPGA (Field-Programmable Gate Array). Both solutions allow massive parallelisation of the execution and thus reduce the time needed for processing. However, the advantageous features of both these solutions come at the price of the thorough source code rebuilding that is necessary to gain the maximum performance boost.
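The reuse of intermediate results can be sketched with a small per-frame cache. Everything in this example is hypothetical: the class, the shared intermediate (horizontal pixel differences) and the two toy metrics are illustrative stand-ins and do not correspond to the metrics of the package.

```python
class FrameCache:
    """Holds one luminance frame plus intermediate results shared between metrics."""

    def __init__(self, luma, width, height):
        self.luma, self.width, self.height = luma, width, height
        self._store = {}
        self.computations = 0  # counts how often an intermediate was actually computed

    def get(self, key, compute):
        """Return the cached value for `key`, computing it on first request only."""
        if key not in self._store:
            self._store[key] = compute(self)
            self.computations += 1
        return self._store[key]


def horizontal_differences(frame):
    """Absolute differences between horizontally adjacent pixels (row-major luma)."""
    w = frame.width
    return [abs(frame.luma[i + 1] - frame.luma[i])
            for i in range(len(frame.luma) - 1) if (i + 1) % w != 0]


def toy_metric_mean(frame):
    diffs = frame.get("hdiff", horizontal_differences)
    return sum(diffs) / len(diffs)


def toy_metric_max(frame):
    diffs = frame.get("hdiff", horizontal_differences)
    return max(diffs)


frame = FrameCache([0, 10, 20, 30, 5, 5, 5, 5], width=4, height=2)
print(toy_metric_mean(frame), toy_metric_max(frame), frame.computations)
```

Both toy metrics request the same intermediate, but it is computed once. The cost is the memory held by the cache for the lifetime of the frame, which is the trade-off mentioned above.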

Additional scripts
As an addition, several automated calculation scripts are provided. In order to achieve a high level of portability, all of the scripts were written for both Unix-like and Microsoft Windows systems. Obtaining this extent of versatility required the creation of two separate implementations: one written in Bash (Linux, Mac OS) and one in Batch (Windows). Utilisation of FFmpeg tools made it possible to reduce the input interface to a single parameter, namely the path to a video sequence or a folder containing video materials to process. Automation scripts are based on the assumption that all input data is stored in a form that includes detailed information about its content. This automation allows one to seamlessly apply the presented measurement techniques to a large set of input data, be it images or videos.
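The role of FFmpeg in such automation can be sketched as follows. The exact invocations used by the provided Bash and Batch scripts are not given in the text, so the commands below are an assumption showing one common way to query stream parameters with ffprobe and to decode a file to raw YUV420p with ffmpeg. The function names are hypothetical, and the sketch only builds the argument lists without executing them.

```python
def probe_command(path):
    """ffprobe call printing width, height and frame rate of the first video stream."""
    return ["ffprobe", "-v", "error", "-select_streams", "v:0",
            "-show_entries", "stream=width,height,r_frame_rate",
            "-of", "csv=p=0", path]


def parse_probe_output(text):
    """Parse a line such as '1920,1080,30000/1001' into (width, height, fps)."""
    width, height, rate = text.strip().split(",")[:3]
    numerator, _, denominator = rate.partition("/")
    fps = float(numerator) / float(denominator or 1)
    return int(width), int(height), fps


def decode_command(path, out_path):
    """ffmpeg call decoding `path` into a headerless planar YUV420p file."""
    return ["ffmpeg", "-i", path, "-pix_fmt", "yuv420p", "-f", "rawvideo", out_path]


# The lists could be passed to subprocess.run(); here they are only printed
print(" ".join(probe_command("input.mp4")))
print(" ".join(decode_command("input.mp4", "input.yuv")))
print(parse_probe_output("1920,1080,30/1\n"))
```

Since the container already stores width, height and frame rate, the user only has to supply the path, which matches the single-parameter interface described above.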

Versions
One of the most important aspects accompanying the development process was the assumption that, if possible, the application should be platform independent. As a result, the software package was released for all of the most popular operating systems: Linux, Mac OS and Windows. Though multi-platform, the software's implementation remains consistent, meaning that a single source code may be compiled into all supported binaries. Minute changes in the configuration file are enough to quickly switch between the desired OS (operating system) and architecture type (32 or 64-bit).
The described software is provided free of charge (for non-commercial usage) and may be downloaded from the web page [26].

Graphical user interface
Keeping in mind that presentation of the software is of key importance, the authors decided to additionally implement a graphical user interface. Its main advantage is the possibility of simultaneous observation of results and the currently processed video sequence. Figure 5 shows an example of the described software. The graphical version of the measurement system is capable of processing any video stream, provided its content is made available in a shared memory. Thus, it is necessary to introduce a thin integration layer decompressing the video stream and uploading raw frames into memory shared with the measurement application. This kind of solution was developed and tested inside the MITSU project [21]. Merging transcoding software with the measurement system made it possible to create a dynamically changing video delivery architecture that aimed to maximise user experience in terms of QoE.

Integration with IMCOP architecture
The IMCOP project, an "Intelligent Multimedia System for Web and IPTV Archiving. Digital Analysis and Documentation of Multimedia Content", is a joint Polish-Israeli R&D project realised by a consortium consisting of four partners. In general, IMCOP's objectives are twofold and are referred to as: (i) multimedia data analysis and content discovery on one side and (ii) data aggregation, content related binding (finding and assigning content related connections between data) and delivery on the other [3]. The overall IMCOP platform architecture is illustrated in Fig. 6.
Data analysis is performed in order to enrich the data (mainly images and video sequences) by extracting their features and classifying them according to given criteria. Components of the IMCOP system dedicated to carrying out the above analysis are known as the Metadata Enhancement Services (MES), which, in fact, are REST-compliant Web services in the cloud [2]. Each MES service is intended to perform a single classification task. Selected specialised tasks of IMCOP's MES services are as follows:
- facial detection and recognition,
- head counting,
- bokeh effect detection,
- text detection and recognition,
- nudity detection,
- sky/landscape detection,
- detection of architectural scenes (scenes containing buildings, monuments or other kinds of artificial structures), etc.
In addition to the ones given above, IMCOP also incorporates less specialised types of MES services, which are dedicated, for example, to extracting selected low level features of analysed data, such as SURF features [1], Shape Context histogram, MPEG-7 visual descriptors, coefficients of Piecewise-linear transforms [24], etc. Services designated to perform visual quality evaluation are also of great importance to the IMCOP system. In general, they are used to filter out low quality multimedia data and exclude them from further processing. The way they are used to model predicted quality of experience in the IMCOP system is as follows:
- quality of multimedia data is classified into three categories, known as quality levels 0, 1 and 2, where level 0 means data of very low, unacceptable quality and level 2, on the contrary, data of very high quality,
- data is classified into quality level 2 when at most two metrics fail (fall outside the min/max range given in [26]),
- data is classified into quality level 1 (the category of low and medium quality) when three or four metrics fail,
- when more than four metrics fail, data is classified at quality level 0.
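The classification rule above can be written down as a short sketch. The mapping from the number of failed metrics to a quality level follows the text; the metric names and min/max ranges in the example are hypothetical placeholders, since the actual ranges are given in [26].

```python
def count_failed(metrics, ranges):
    """Count metrics whose value falls outside its (min, max) range."""
    return sum(1 for name, value in metrics.items()
               if not (ranges[name][0] <= value <= ranges[name][1]))


def quality_level(failed):
    """Map the number of failed metrics to IMCOP quality level 2, 1 or 0."""
    if failed <= 2:
        return 2  # at most two metrics fail: very high quality
    if failed <= 4:
        return 1  # three or four metrics fail: low and medium quality
    return 0      # more than four metrics fail: unacceptable quality


# Hypothetical ranges and measurements, for illustration only
ranges = {"blur": (0.0, 1.0), "noise": (0.0, 4.0), "blockiness": (50.0, 100.0)}
metrics = {"blur": 1.7, "noise": 2.5, "blockiness": 80.0}
failed = count_failed(metrics, ranges)
print(failed, quality_level(failed))
```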
Metadata Enhancement Services can be, inter alia, used just to label (tag) multimedia data. Examples of labels given by selected IMCOP's MES services to a chosen image from the VIME Flickr dataset [27] are depicted in Fig. 7.

Conclusions
Quality indicators have been successfully developed as a result of the work. Together they constitute a single, universal and multi-platform measurement system, which runs entirely on the receiving side. This ability makes it especially suitable for content providers operating on a massive scale. The opportunity to remotely sense quality of experience at each user-node guarantees better system control and gives solid input for various resource utilisation algorithms. Moreover, measurement performed on two ends of the system allows one to quantitatively measure its impact on the content being transmitted.
A related point to consider is the fact that the software provides information regarding all indicators separately. Establishing a trustworthy mapping between those KPIs and final subjective quality is a challenging task requiring more experimental data. As such, it remains to be implemented and defines the scope of the current research. A direct consequence of this shortage is the difficulty of objectively assessing algorithm performance in a way that would allow comparison with other state-of-the-art achievements.
On the other hand, the lack of consistent KPI to MOS mapping may also be regarded as advantageous. Due to clear and comprehensive presentation of results, the user alone may choose the meaning and importance of certain metrics, making it possible to introduce a customised quality evaluation process. Both presented use-cases [3,25] of the software package utilised this property to aid their operation and, at the same time, prove its usability both for end-products and experimental set-ups.