This contribution is a follow-up to our previous research, which investigated how packet loss-related issues affect user perception of video quality. In that study, we tested user perception using 72 different test sequences, which were prepared in advance in an emulated network environment. The same content was used in all test sequences (a 1-h documentary film about the solar system); however, in each test sequence, the packet loss rate (PLR), the number of packet loss occurrences (PLOs), and the duration of the PLOs varied.
Details of the subjective test are discussed in the remainder of this chapter, but it is first necessary to note that we adopted the test methodology from [5]. The authors designed an experiment that allows a researcher to prepare the test sequences in advance, thus retaining control over audiovisual quality. The sequences are then stored on, for instance, an optical drive or removable storage and distributed to the subjects for rating in their home environments. Hence, subjective evaluation of the sequences is conducted in uncontrolled conditions, which is important for QoE evaluation, as emphasized in the Introduction.
Note that we have analyzed other approaches that involve uncontrolled environments. For instance, in [7,8,9], the authors employed QoE crowdtesting, while in [10, 11], users' network performance was remotely monitored during streaming sessions. These two approaches were not suitable for this study, because it would have been difficult to persuade the test subjects to download or stream a 1-h video (i.e., several gigabytes of data) to their devices at home, which was necessary in [7,8,9] and [10, 11], respectively.
In [12], Staelens et al. prepared the test sequences in advance and stored them on tablet computers. The tablets were then distributed to the subjects, who watched the sequences in everyday conditions. The subjects rated the quality of the sequences directly on the tablets. The authors collected the rating data after the subjects returned the tablets. With this approach, video downloading or streaming is avoided, yet we decided not to follow it for two reasons: (a) we did not have a sufficient number of tablet devices to conduct a large-scale study such as ours (602 test subjects) and (b) the player used for watching the test sequences contained a video quality rating scale; thus, the purpose of the test was revealed to the subjects. When the purpose of the test is known to the subjects, they are more focused on quality degradation during the test and less focused on the content. This is unlike everyday service usage scenarios and also affects the users' QoE ratings. Finally, the authors of [13] implemented a QoE rating scale in the user interface of the VLC Media Player. The player was installed on the subjects' devices, and they used it for streaming multimedia content. This approach was also rejected, since the visibility of the rating scale during playback reveals the purpose of the test to the subjects.
The experiment setup and creation of test sequences
In this experiment, test subjects evaluated the quality of a 1-h video about the solar system in a home environment. We used entertainment-oriented content selection [14], since we assumed that, while at home, the subjects usually watch video content that interests them. An additional reason for choosing content that can entertain the subjects was the duration of the test, i.e., of the video, and the need to hold the subjects' concentration for a full hour. We believe that if different content had been used, one the subjects perceived as, for instance, boring, their willingness to participate in our study would have declined. Note that our test conditions were unlike controlled laboratory experiments, where the tests usually last 20–30 min, during which observers rate the quality of several short video clips. In such shorter tests, it is easier to retain the subjects' focus on the task at hand.
The experiment was performed using longer test sequences; thereby, we acknowledged the findings presented in [15,16,17], whose authors demonstrate that QoE evaluation requires longer test sequences, because user perception cannot be entirely shaped when using shorter video clips. The original video used in the experiment was encoded with Advanced Video Coding (H.264/AVC) and Advanced Audio Coding (AAC). The video bitrate, audio bitrate and frame rate were 9.8 Mbps, 256 kbps and 50 fps, respectively. The video resolution was 1920 × 1080 pixels. Note that the video contained subtitles.
To create the test sequences, the original video was first streamed six times between two computers connected directly, peer to peer (Fig. 1). The Network Emulator Client was installed on Computer 1. The client dropped packets on the outgoing stream toward Computer 2. The PLR was set to 0.05, 0.1, 0.5, 1, 1.5 or 2%, while the burst packet loss length was set to 1. VLC Media Player was used to stream the video between the computers. Each incoming (degraded) video signal was stored on Computer 2 in the same format as the original video. During streaming of the video between the computers, UDP was used on the transport layer. This differs from Hypertext Transfer Protocol (HTTP)-based streaming, which uses the Transmission Control Protocol (TCP). Nowadays, HTTP-based adaptive streaming is the relevant scenario; however, UDP-based streaming is still used for delivering Internet Protocol Television (IPTV), in particular for those services that use set-top boxes.
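Purely as an illustration of the loss pattern configured in the emulator (a burst length of 1 means each packet is dropped independently), the following minimal Python sketch reproduces such uniform dropping at the PLR values used in the study; it is not the Network Emulator Client's API, only an assumed stand-in for its behavior.

```python
import random

def drop_packets(packets, plr_percent, seed=None):
    """Drop each packet independently with probability plr_percent/100.

    Illustrative stand-in for the emulator's uniform loss (burst length 1);
    not the actual tool used in the experiment.
    """
    rng = random.Random(seed)
    return [p for p in packets if rng.random() >= plr_percent / 100.0]

# Example: a stream of 100,000 packets degraded at the six PLR values used in the study
stream = list(range(100_000))
for plr in (0.05, 0.1, 0.5, 1, 1.5, 2):
    kept = drop_packets(stream, plr, seed=42)
    print(f"PLR {plr}%: dropped {len(stream) - len(kept)} of {len(stream)} packets")
```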
Second, we imported the stored video signals into CyberLink Power Director, extracted 1, 4, 7 or 10 short video clips from a degraded video signal and inserted them into the original video signal. The duration of a single inserted clip, i.e., a single packet loss occurrence (PLO), was 1, 4 or 7 s. By varying the number of inserted PLOs and the duration of a single PLO, we were able to generate different total durations of all PLOs in a test sequence, equal to 1, 4, 7, 10, 16, 28, 40, 49 or 70 s. The total duration is obtained by multiplying the number of inserted PLOs in a sequence (1, 4, 7 or 10) by the duration of a single PLO (1, 4 or 7 s). When selecting these particular parameter values, the objective was to generate a wide range of total distortion durations across the sequences. Additionally, we wanted to create sequences with equal total durations of distortion but containing different numbers of inserted PLOs and different durations of a single PLO. For instance, two sequences can both contain 28 s of total quality distortion, but one can have 7 PLOs each lasting 4 s, while the other has 4 PLOs each lasting 7 s. This enabled us to evaluate the impact of each parameter individually.
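The combinatorics described above can be checked directly; the short Python sketch below uses only the parameter values stated in the text and reproduces the 72 parameter combinations and the nine distinct total distortion durations.

```python
from itertools import product

plr_values = (0.05, 0.1, 0.5, 1, 1.5, 2)   # packet loss rate, %
plo_counts = (1, 4, 7, 10)                  # number of inserted PLOs per sequence
plo_durations = (1, 4, 7)                   # duration of a single PLO, seconds

combinations = list(product(plr_values, plo_counts, plo_durations))
total_durations = sorted({n * d for n, d in product(plo_counts, plo_durations)})

print(len(combinations))   # 72 test sequences
print(total_durations)     # [1, 4, 7, 10, 16, 28, 40, 49, 70] seconds
```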
In [18], the authors showed that when the distortions are grouped into the first few minutes of the video, the subjects' quality scores tend to increase. Conversely, if the distortions are grouped into the last few minutes, the scores tend to decrease. This kind of subject reasoning is influenced by human short-term memory [19] and the psychological recency effect [18, 20]. Thus, the PLOs were evenly distributed over the entire duration of each test sequence. However, in each test sequence, the first and last 7 min and 17 s were unaffected by degradations, which allowed the test subjects to become involved with the content at the beginning of the session and to critically assess the audiovisual quality toward the end.
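The exact placement procedure is not reproduced here; the sketch below only illustrates one way to space the PLO start times evenly while keeping the first and last 7 min 17 s of a 1-h sequence untouched. The uniform-spacing rule is our assumption, not a detail stated in the text.

```python
def plo_start_times(num_plos, plo_duration_s, video_length_s=3600, guard_s=7 * 60 + 17):
    """Evenly space PLO start times between the leading and trailing clean intervals.

    The even-spacing rule is an illustrative assumption; the study only states that
    PLOs were evenly distributed and that the first/last 7 min 17 s were degradation-free.
    """
    window_start = guard_s
    window_end = video_length_s - guard_s - plo_duration_s
    step = (window_end - window_start) / (num_plos + 1)
    return [round(window_start + (i + 1) * step) for i in range(num_plos)]

# Example: 10 PLOs of 7 s each in a 1-h sequence
print(plo_start_times(num_plos=10, plo_duration_s=7))
```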
In this study, the methodology used for the subjective evaluation of the sequences was adopted from [5]. Thus, the sequences were distributed to the subjects on DVDs. This format was chosen for the following reasons: (a) the test sequences were easy to distribute to the subjects; (b) DVD players are more widely available than Blu-ray players (this was important since the survey was conducted among a student population); (c) DVD discs are cheaper than, e.g., memory sticks; and (d) according to [21], while evaluating other services, test subjects often use the quality of DVDs as a reference.
During the encoding of the test sequences to the DVD format, the PAL system, the MPEG-2 video coding format and variable bitrate encoding were used. All encoding settings were chosen to maintain the best possible video quality. Furthermore, all video enhancement features of the CyberLink Power Director software were turned off, and the software did not use any error concealment methods. The resulting video bitrate, audio bitrate and frame rate were 9.51 Mbps, 256 kbps and 25 fps, respectively.
Although we used settings that allowed the best possible video quality for the conversion to the DVD format, this process downgraded the quality of the sequences, since the original video (encoded with H.264/AVC) was re-encoded into the MPEG-2 format. In general, this decline in video quality is hard to notice in low-motion scenes but can be noticed in high-motion scenes, where moving objects in a scene can become blocky. The difference in quality becomes even more visible to the subjects if they are provided with the original sequence for comparison (which was not the case in our study). Since the experiment used a documentary film, most of the scenes were low-motion scenes (for instance, the presenter's monologue or dialogue with other persons appearing in the film). Thus, the quality degradation caused by the re-encoding process was not apparent. Furthermore, we need to highlight that in our test conditions, which mimicked a lifelike viewing experience, the subjects were not focused on keeping track of the video quality; they were focused on the content instead. Thus, we believe that the decline in video quality due to the re-encoding did not impact the QoE of our test subjects.
Data collection
The fuzzy-based no-reference objective video quality assessment model for assessing user QoE developed in this research correlates the values of the three objective parameters (discussed in Sect. 2.1) with the subjects' perception of video quality. Hence, this section reports how the subjective data set needed for the development of the model's inference system was collected.
Design of the questionnaire
The questionnaire used in the survey had four pages. Page 1 contained a detailed description of the purpose of the test and instructions on how to complete the questionnaire. Questions related to the perceived video quality were printed on page 2. Page 3 was used to investigate the subjects' opinions about the video content, as well as the subjects' environment and the equipment used to reproduce the video, the social context in which they watched the video, their level of fatigue and other factors. Page 4 contained general questions used to collect the subjects' demographic information and a blank space where the subjects could leave comments. Pages 2 and 3 of the questionnaire can be found in Appendix 1.
The questionnaire contained multiple-choice questions as well as 11-point numerical scales (designed by ITU-T in [22]) for questions related to the subjective perception of video quality and the subjects' level of annoyance caused by the degradations. An 11-point scale was chosen over the more commonly used discrete five-level scale because the aim was to collect continuous data and provide the subjects with a larger span of possible answers. The scales enabled capturing the natural ambiguity and fuzziness of the subjects' opinions. Using discrete rating scales or, for instance, questions with two-alternative options (such as "Was the audiovisual quality of the video acceptable?" with possible answers yes or no) would cause the loss of valuable information about the impact of the objective parameters on the subjects' perception in different viewing conditions. This is further discussed in Sect. 4.
Furthermore, several questions were used to detect abnormal rating by the subjects. For instance, if a subject indicated noticing only one quality degradation in the entire 1-h video but rated the frequency of degradations as "Annoyingly high frequency", this rating was considered abnormal and the questionnaire was rejected. We also rejected all questionnaires in which the subjects indicated noticing video artifacts unrelated to our experiment, i.e., to the specific test sequence. We considered that in those instances the subjects' equipment may have been malfunctioning, which may have interfered with the audiovisual presentation and their rating. The questionnaires were also rejected if the subjects responded positively to the statements "When I watch DVDs as I usually do, their quality is often degraded" or "There is a possibility that my DVD player that I used to watch the video may be broken or malfunctioning". Additionally, we asked the subjects to evaluate the level of noise in their surroundings while they were watching the video. This information was used to exclude those questionnaires in which the subjects indicated that they were unable to concentrate on the video due to noise. The questionnaires were also rejected if the subjects did not complete them immediately after the screening and thus might have forgotten the quality distortions they experienced, potentially leading to false ratings.
Further details regarding the reasons for questionnaire rejection, as well as the number of rejected questionnaires per rejection criterion, can be found in [4]. As can be seen from this discussion, we rejected questionnaires by employing the methods discussed in [23]; i.e., the questionnaire contained consistency questions, and we investigated the hardware environment and hidden influence factors.
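To make these criteria concrete, the sketch below encodes them as simple checks over a single questionnaire record; the field names are hypothetical and the checks paraphrase the rules described above rather than the exact wording of the questionnaire or the actual processing pipeline.

```python
def reject_questionnaire(q: dict) -> bool:
    """Return True if a questionnaire record should be rejected.

    `q` is a hypothetical record whose fields paraphrase the rejection criteria;
    the actual questionnaire and processing differed in detail.
    """
    # Consistency check: a single noticed degradation rated as annoyingly frequent
    if q["num_degradations_noticed"] <= 1 and q["frequency_rating"] == "annoyingly high":
        return True
    # Artifacts unrelated to the inserted PLOs suggest malfunctioning equipment
    if q["noticed_unrelated_artifacts"]:
        return True
    # Self-reported playback problems (degraded DVDs in general, possibly broken player)
    if q["dvd_quality_usually_degraded"] or q["player_possibly_broken"]:
        return True
    # Unable to concentrate on the video due to surrounding noise
    if q["could_not_concentrate_due_to_noise"]:
        return True
    # Questionnaire not completed immediately after the screening
    if not q["completed_immediately_after_viewing"]:
        return True
    return False
```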
The questionnaires were distributed to the subjects in sealed envelopes. Two questionnaires were inserted into each envelope (if the subjects watched the video in company, they were asked to pass the second questionnaire to one person in their company). Furthermore, we printed an illustration on the envelopes explaining how to proceed with the test, indicating four main steps: (1) take the envelope and the attached video to your home; (2) watch the video in everyday conditions; (3) open the envelope immediately after watching the video, read the instructions and complete the questionnaire; and (4) return the questionnaire. The test sequences were attached to the outer side of the envelopes, so they were accessible to the subjects without the need to open the envelope.
The test subjects
Since both authors of this paper are employees of the University of Zagreb, it was decided that the survey would be conducted among the student population of the university; i.e., the convenience sampling method was used [24]. Another reason for targeting this particular population can be found in [25], where Datta et al. reveal that persons between the ages of 18 and 24 are common users of video streaming services. This corresponds with the age group of a typical student population.
The subjects were approached and asked to participate in the survey while they were attending classes at the university. On each occasion, only a few key points of the research were presented to them; namely, we made it clear that:
- the survey is anonymous;
- participation in the survey is not mandatory;
- those who wish to participate will be asked to:
  - take one envelope and the attached video;
  - keep the envelope sealed and open it only after watching the video;
  - watch the video only once, in the conditions in which they would normally watch television;
  - open the envelope immediately after watching the video and read the instructions;
  - complete the questionnaire;
  - pass the second questionnaire to one person in their company (if applicable and if that person also watched the video with them);
  - return the completed questionnaire(s);
- the video content is a 1-h documentary film about the solar system;
- the questionnaire takes approximately 10 min to complete;
- the illustration printed on the envelopes reminds them of the steps of participation in the survey;
- the survey lasts 2 weeks.
During this brief presentation of what is expected from the test subjects, the purpose of the test and the content of the questionnaire were not revealed in any way. After a period of 2 weeks, the collected questionnaires were processed, and the QoE analysis was continued on a sample of 602 test subjects.
Discussion of the results of the subjective evaluation
The results obtained from the survey were grouped into different categories. Specifically, we analyzed user QoE for each test sequence, tested and confirmed the IQX hypothesis [26], and examined the relationships between the level of user annoyance and the PLR, the number of PLOs and the total duration of all PLOs in a sequence. These relationships are later used in the fuzzification process (Sect. 5). Furthermore, we investigated the impact of human short-term memory and the recency effect, correlated user QoE with the users' level of entertainment and fatigue, and analyzed the impact of social context and video subtitles on user QoE.
The user QoE analysis indicated that the subjects' MOS remained reasonably high (always above 4, on a scale from 0, bad quality, to 10, excellent quality), even for those sequences that contained the most PLOs. This confirmed the findings presented in [5, 6, 27], where different authors underlined that the results of uncontrolled experiments suggest that subjects are not strongly negatively affected by the perceived quality degradations. This has direct implications for the inference system of the model and its output. Specifically, Sect. 7 will demonstrate that the model assesses a user QoE of 4.48 even for the most degraded test sequence. Detailed analysis of user QoE showed that when there is only one PLO in a 1-h video, the PLR and the duration of a single PLO do not affect user QoE. For PLRs ≥ 1%, a quality degradation that lasts ≥ 16 s can be negatively perceived by users. Furthermore, if the video contains 7 or more PLOs and the PLR increases (≥ 1.5%), the duration of a single PLO comes to the fore. The analysis also revealed that, for PLRs ≥ 0.5%, an increase in the number of PLOs significantly influences user QoE.
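As a compact restatement of these observations (not the fuzzy model itself, which is developed in Sect. 5), the findings above can be expressed as crisp rules; the function below is purely illustrative and uses the thresholds quoted in the text.

```python
def qoe_observations(plr_percent, num_plos, plo_duration_s):
    """Crisp, illustrative restatement of the subjective findings.

    The actual assessment model is fuzzy; this only paraphrases the
    qualitative thresholds reported in the analysis above.
    """
    total_duration_s = num_plos * plo_duration_s
    notes = []
    if num_plos == 1:
        notes.append("PLR and single-PLO duration do not affect QoE")
    if plr_percent >= 1 and total_duration_s >= 16:
        notes.append("degradation can be perceived negatively")
    if num_plos >= 7 and plr_percent >= 1.5:
        notes.append("duration of a single PLO becomes influential")
    if plr_percent >= 0.5:
        notes.append("the number of PLOs significantly influences QoE")
    return notes

# Example: 7 PLOs of 4 s each at PLR 1.5%
print(qoe_observations(plr_percent=1.5, num_plos=7, plo_duration_s=4))
```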
After confirming the IQX hypothesis, we ranked the objective parameters by the magnitude of their impact on user QoE as follows: (1) total duration of quality distortions in a video, i.e., total duration of PLOs; (2) number of PLOs; (3) PLR; and (4) duration of a single PLO.
The impact of human short-term memory [19] was tested by comparing the number of PLOs in a specific sequence with the number of perceived PLOs reported by the subjects. The analysis revealed that a considerable number of test subjects (408) failed to notice and/or memorize some or even all of the quality distortions inserted in the sequences. This can be attributed to three causes. First, longer test sequences were used in the test. Presumably, after 1 h, some subjects forgot degradations that they may have noticed while watching the DVD. Second, experimenting in the subjects' home environment encouraged them to watch the video in everyday conditions (at a known location, with or without company, at any time of day). In these lifelike viewing conditions, the subjects were not focused on noticing and memorizing the PLOs; they were focused on the content instead. Third, the subjects were uninformed about the purpose of the test. Thus, before and during the video, they were unaware of the degradations that would appear in a sequence. This inability to notice and/or memorize PLOs was manifested as high MOS even for the most degraded test sequences (as previously discussed). We also found that PLOs appearing in the middle of the video were more often left unreported by the subjects than PLOs appearing toward the end of the video. This confirmed the impact of the recency effect [20] on our test subjects. Nevertheless, since the test sequences used in this experiment lasted 1 h, we cannot exclude the possibility that these results would differ if shorter test sequences (or other types of content) had been used.
It is worth mentioning that the results also showed that the overall user experience can be preserved despite perceived quality distortions if the content is entertaining to the viewer. Finally, a separate analysis was conducted to see whether video subtitles could draw the viewers' attention to the bottom of the screen, thus making the PLOs harder to notice. It was observed that the subjects who watched the video with subtitles noticed fewer PLOs and reported higher QoE compared with the subjects who watched the video without subtitles. However, we emphasize that further investigation of the impact of video subtitles on user QoE is needed.
Critical overview of the methodology
In terms of network-related parameters that may impact users' audiovisual perception, we limited our investigation to the effect of packet loss-related issues on user QoE. Including more parameters, such as network delay, jitter, and throughput, would increase the number of test sequences. We created 72 test sequences just by combining different values of the three parameters (PLR, number of PLOs and duration of PLOs in a test sequence). A larger number of test sequences would mean that we would have to reach more test subjects, which was considered unfeasible. Note that each of the 602 test subjects evaluated one test sequence by watching it once (as discussed in Sect. 2.2.2). If the subjects had watched the same sequence more than once before completing the questionnaire, or had watched an additional sequence, they would have known the purpose of the test while watching the video. We underlined earlier that the goal was to avoid this, since knowing the purpose of the test would make our test subjects more attentive to the PLOs.
Apart from requiring a much larger number of test subjects, it would also have been difficult to pinpoint the effect of packet loss on user QoE if other network-related parameters had been tested as well. Hence, it can be argued that, although not all parameters were tested against the subjects' perception, the impact of packet loss on a wide range of subjective parameters was tested meticulously.
When critically thinking about the test conducted in this study, it has to be taken into account that the methodology had to meet the following demands:
- the primary objective was to collect rating data and use it to develop a QoE assessment model able to produce more lifelike QoE assessments; thus, the data had to be collected from uncontrolled experiments (in a home environment);
- longer test sequences had to be used in the test, because short video clips are not adequate for user QoE evaluation (as reported in [15,16,17]);
- the test sequences had to be distributed to the subjects for rating in a manner that bypasses downloading or streaming of the video (due to its size);
- the accuracy of the model depends on the fuzzification and defuzzification processes, i.e., indirectly on the size of the data set used for the model development; that is, a sufficient number of test sequences with different properties had to be generated and evaluated by a sufficient number of test subjects;
- it was unfeasible to conduct interviews with such a large number of test subjects (602); thus, the subjects' opinions were collected using hard-copy questionnaires.
The methodology discussed in this chapter met all of the above demands and strongly shaped the obtained results. While watching the video at home, in a familiar environment, possibly surrounded by people they knew, the test subjects were not focused on keeping track of video quality fluctuations. We can also assume that the subjects were able to relax and were entertained by the video content (on a scale from 0, least entertaining or boring, to 10, very entertaining, the average level of entertainment was 7.62, with a margin of error of 0.15 at a 95% confidence level). This home test environment made the subjects more forgiving of the perceived quality distortions, which was mirrored in the results. Hence, the test environment directly affects the inference system of the model.
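For reference, the reported margin of error follows the standard normal-approximation formula ME = z · s / √n; the sample standard deviation computed below is back-calculated from the reported values and is not stated in the original data.

```python
import math

n = 602                 # number of test subjects
z = 1.96                # z-score for a 95% confidence level
margin_of_error = 0.15  # reported margin of error of the mean entertainment level

# Back-calculate the sample standard deviation implied by ME = z * s / sqrt(n)
implied_s = margin_of_error * math.sqrt(n) / z
print(round(implied_s, 2))   # roughly 1.88
```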
We are aware that the methodology has certain disadvantages. Namely, the success of such an uncontrolled experiment largely depends on the honesty of the test subjects. Moreover, the test was conducted in environments where a number of QoE influence factors may impact the subjects' ratings (as discussed in [3]). We were not able to investigate all of these factors on such a large target group. However, we invested effort in (a) clearly presenting what is expected from the test subjects in the study; (b) designing a questionnaire that returns enough information for the modelling; (c) discovering and rejecting outliers from the sample; and (d) removing those questionnaires where the subjects' answers indicated equipment malfunction or noisy environments that may have interfered with the viewing experience.
The statistical analysis of the collected data (conducted in [4]) and the obtained results, which confirmed the findings of other authors, allow us to argue that the objective of the test was achieved and that its outcomes can be further used for the development of the model.