Overview of selected applications and subjective studies
The presented QoE measures are applied to real data sets available in the literature, comparing MOS values to other quantities. We did not conduct new subjective studies, but rather used the opinion scores from the existing studies to apply the QoE measures and interpret the results in a novel way, obtaining a deeper understanding of them. To cover a variety of relevant applications, we consider speech, video, and web QoE. The example studies highlight which conclusions can be drawn from other measures beyond the MOS, such as SOS, quantiles, or \(\theta \)-acceptability. The limitations of MOS become clear from the results. These additional insights are valuable, e.g., to service providers to properly plan or manage their systems.
Section “θ-Acceptability derived from user ratings” focuses on the link between acceptance and opinion ratings. The study considers web QoE; however, users have to complete a certain task when browsing. Test subjects are asked to rate the overall quality as well as to answer an acceptance question. This allows investigating the relation between MOS, acceptance, \(\theta \)-acceptability, \(\%\mathrm {GoB}\), and \(\%\mathrm {PoW}\) based on the subjects’ opinions. The relation between acceptance as a behavioral measure and overall quality as an opinion measure is particularly interesting: it would be very useful to be able to derive the “accept” behavioral measure from QoE studies and subjects’ opinions. This would provide a powerful tool to re-interpret existing QoE studies from a different, more business-oriented perspective.
Section “%GoB and %PoW: ratio of (dis-)satisfied users” investigates the ratio of (dis-)satisfied users. The study on speech quality demonstrates the impact of rating scales and compares \(\%\mathrm {PoW}\) and \(\%\mathrm {GoB}\) in relation to MOS when subjects rate on a discrete and on a continuous scale. The results are also checked against the E-model to analyze its validity when linking overall quality (MOS) to those quantities. Additional results for web QoE can be found in “Web QoE and discrete rating scale” in Appendix 1 (Fig. 13). In that subjective study on web QoE, page load times are varied while subjects are viewing a simple web page. The web QoE results confirm the gap between the \(\%\mathrm {GoB}\) and \(\%\mathrm {PoW}\) estimates (as defined, e.g., for speech QoE by the E-model) and the measured \(\%\mathrm {GoB}\) and \(\%\mathrm {PoW}\).
Section “SOS hypothesis and modeling of complete distributions” relates the diversity in user ratings in terms of SOS to MOS. Results from subjective studies on web, speech, and video QoE are analyzed. As a result of the web QoE study, we find that the opinion scores can be very well approximated with a binomial distribution, which allows us to fully specify the voting distribution using only the SOS parameter a. For the video QoE study, a continuous rating scale was used, and we find that the opinion scores follow a truncated normal distribution. Again, the SOS parameter a derived for this video QoE study then fully describes the distribution of opinion scores for any given MOS value. Thus, the SOS parameter allows modeling the entire distribution and then deriving measures such as quantiles. We highlight the discrepancy between quantiles and MOS, which is of major interest for service providers.
Section “Comparison of results” provides a brief comparison of the studies presented in the article. It serves mainly as an overview of interesting QoE measures beyond MOS and a guideline on how to properly describe subjective studies and their results.
For the sake of completeness, a detailed description of the experiments is provided in Appendix 1.
\(\theta \)-Acceptability derived from user ratings
The experiments in Schatz and Egger (2014) investigated task-related web QoE in conformance with ITU-T Rec. P.1501 (ITU-T 2013). In the campaign conducted, subjects were asked to carry out a certain task, e.g. ‘Browse to search for three recipes you would like to cook in the given section.’ on a certain cooking web page (cf. Table 3). The network conditions were changed and the impact of page load times during the web session was investigated. Besides assessing the overall quality of the web browsing session, subjects additionally answered an acceptance question. In particular, after each condition, subjects were asked to rate their overall experienced quality on a 9-point ACR scale (see Fig. 11) as well as to answer a binary acceptance question. The experiment was carried out in a laboratory environment with 32 subjects.
Figure 5 quantifies the acceptance and QoE results from the subjective study in Schatz and Egger (2014). This study also considered web QoE; however, users had to complete a certain task when browsing. The test subjects were asked to rate the overall quality as well as to answer an acceptance question. This allowed investigating the relation between MOS, acceptance, \(\theta \)-acceptability, \(\%\mathrm {GoB}\), and \(\%\mathrm {PoW}\) based on the subjects’ opinions.
Figure 5a shows the MOS and the acceptance ratio for each test condition. The blue bars in the foreground depict the MOS values on the left y-axis. The grey bars in the background depict the acceptance values on the right y-axis. While the acceptance values reach the upper bound of 100 %, the maximum MOS observed is 4.3929. The minimum MOS over all test conditions is 1.0909, while the minimum acceptance ratio is 27.27 %. These results indicate that users may tolerate significant quality degradation for web services, provided they are able to successfully execute their task. This contrasts with, e.g., speech services, where very low speech quality makes it almost impossible to have a phone call and hence results in non-acceptance of the service. Accordingly, the \(\%\mathrm {PoW}\) estimator defined in the E-model is almost 100 % for low MOS values.
Figure 5b makes this even clearer. The plot shows how many users accept a condition while rating its QoE with x, for \(x=1,\,\ldots\,,9\). All users who rate an arbitrary test condition with x are considered, and the acceptance ratio y is computed over those users. For each rating category \(1,\ldots ,9\), there are at least 20 ratings. Even when the quality is perceived as bad (‘1’), 20 % of the users accept the service. For category ‘2’, between ‘poor’ and ‘bad’ (see Fig. 11), up to 75 % accept the service at an overall quality which is at most ‘poor’.
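The conditional acceptance ratios shown in Fig. 5b can be computed from the raw votes in a few lines. The following Python sketch is illustrative only: the array names and the toy votes at the end are our own assumptions, not the study data.

```python
import numpy as np

def acceptance_per_category(ratings, accepted, scale=range(1, 10)):
    """Acceptance ratio among all votes cast in rating category x (cf. Fig. 5b).

    ratings  -- one opinion score per (subject, condition) vote, flattened
    accepted -- the corresponding binary acceptance answers (1 = accepted)
    """
    ratings = np.asarray(ratings)
    accepted = np.asarray(accepted, dtype=float)
    return {x: accepted[ratings == x].mean()
            for x in scale if np.any(ratings == x)}

# Toy example (invented votes): even a '1' rating can come with acceptance.
print(acceptance_per_category([1, 1, 2, 5, 9, 9], [0, 1, 1, 1, 1, 1]))
```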
Figure 5c takes a closer look at the relation between MOS and acceptance, \(\theta \)-acceptability, as well as the \(\%\mathrm {GoB}\) estimation as defined in “%GoB and %PoW”. The markers depict the \(\theta \)-acceptability \(P(U\ge \theta )\) depending on the MOS for \(\theta =3\) ‘\(\lozenge \)’ and \(\theta =4\) ‘\(\vartriangle \)’, the latter being \(\%\mathrm {GoB}\). The \(\%\mathrm {GoB}\) estimator (solid line) overestimates the true ratio of users rating good or better (\(\theta =4\)). This can be adjusted by considering users rating fair or better, \(P(U \ge 3)\), which is close to the \(\%\mathrm {GoB}\) estimator. In addition, the acceptance ratio ‘\(\square \)’ is plotted depending on the MOS. However, neither the \(\theta \)-acceptability curves nor the \(\%\mathrm {GoB}\) match the acceptance curve. In particular, for the minimum MOS of 1.0909, the \(\theta \)-acceptability is 0 %, while the acceptance ratio is 27.27 %.
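In contrast to the behavioral acceptance measure, \(\theta \)-acceptability is derived from the opinion scores alone. A minimal sketch, assuming the per-condition ratings are available as a list (hypothetical variable names):

```python
import numpy as np

def theta_acceptability(cond_ratings, theta):
    """P(U >= theta) for one test condition; with ratings mapped to a
    5-point scale, theta = 4 corresponds to %GoB ('good or better')."""
    return float(np.mean(np.asarray(cond_ratings) >= theta))
```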
The discrepancy between acceptance and the \(\%\mathrm {GoB}\) estimator is also rather large, see Fig. 5c. The estimator in the E-model maps a MOS value of 1 to a \(\%\mathrm {GoB}\) of 0 %, as a speech service is no longer usable if the QoE is too bad. In contrast, in the context of web QoE, a very bad QoE can still result in a usable service which is accepted by the end user. Thus, the user can still complete, for example, the task of finding a Wikipedia article, although the page load time is rather high. This may explain why 20 % of the users accept the service even though they rate the QoE as bad (‘1’).
We conclude that it is not generally possible to map opinion ratings on the overall quality to acceptance.Footnote 6 The conceptual difference between acceptance and the concept of \(\theta \)-acceptability is the following. In a subjective experiment, each user defines his or her own threshold determining when the overall quality is good enough to accept the service. Additional contextual factors like task or prices strongly influence acceptance (Reichl et al. 2015). In contrast, \(\theta \)-acceptability considers a globally defined threshold (e.g. defined by the ISP) which is the same for all users. Results that are only based on user ratings do not reflect user acceptance, although the correlation is quite high (Pearson’s correlation coefficient of 0.9266).
Figure 5d compares acceptance and \(\%\mathrm {PoW}\). The markers depict the ratio of users not accepting a test condition ‘\(\square \)’ depending on the MOS for all 23 test conditions. The \(\%\mathrm {PoW}\) is a conservative estimator of the ‘no acceptance’ characteristics. In particular, 27.27 % of users still accept the service although the MOS value is 1.0909, whereas the \(\%\mathrm {PoW}\) there is close to 100 %. This indicates that overall quality can only be roughly mapped to other dimensions like ‘no acceptance’.
\(\%\mathrm {GoB}\) and \(\%\mathrm {PoW}\): Ratio of (dis-)satisfied users
The opinion ratings of the subjects on speech quality are taken from Köster et al. (2015). The listening-only experiments were conducted with 20 subjects in an environment fulfilling the requirements of ITU-T Rec. P.800 (ITU-T 2003), using the source speech material in Gibbon (1992). The subjects assessed the same test stimuli on two different scales: the ACR scale (Fig. 6) and the extended continuous scale (Fig. 7). To be more precise, each subject used both scales during the experiment. The labels were internally mapped to numbers in the interval [0,6] in such a manner that the attributes corresponding to ITU-T Rec. P.800 were assigned exactly to the numbers \(1,\ldots ,5\).
Figure 8a investigates the impact of the rating scale on the ratio of dissatisfied users. For 86 test conditions, the MOS, \(\%\mathrm {PoW}\), and \(\%\mathrm {GoB}\) values were computed over the opinions from the 20 subjects on the discrete rating scale and the continuous rating scale. The results for the discrete scale are marked with ‘\(\square \)’, while the QoE measures for the continuous scale are marked with ‘\(\lozenge \)’.
Although the MOS is larger than 3, about 30 and 20 % of the users are not satisfied, rating poor or worse on the discrete and the continuous scale, respectively. The results are also checked against the E-model to analyze its validity when linking overall quality (MOS) to \(\%\mathrm {PoW}\). We consider the ratio \(P(U \le 2)\) of users rating a test condition poor or worse. For each test condition, the MOS value is computed, and each marker in Fig. 8a represents the measurement tuple (MOS, \(P(U \le 2)\)) for a certain test condition. In addition, a logistic fitting is applied to the measurement values, depicted as a dashed line. It can be seen that the ratio \(\%\mathrm {PoW}\) of the subjects on the discrete rating scale is always above the E-model (solid curve). The maximum difference between the logistic fitting function and the E-model is 13.78 % at MOS 2.2867. Thus, the E-model underestimates the measured \(\%\mathrm {PoW}\) for the discrete scale.
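For reference, E-model curves of this kind can be reproduced from the standard ITU-T G.107 relations, where \(\%\mathrm {GoB}\) and \(\%\mathrm {PoW}\) are Gaussian functions of the transmission rating factor R and the MOS is a cubic function of R. The sketch below numerically inverts the MOS(R) relation; it is our own illustration of these textbook formulas, not the exact code behind the figures.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def mos_from_r(r):
    # ITU-T G.107: MOS as a function of the rating factor R (0 < R < 100)
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

def r_from_mos(mos):
    # numerical inversion on the monotonically increasing branch of MOS(R)
    return brentq(lambda r: mos_from_r(r) - mos, 5, 100)

def gob_from_mos(mos):
    # %GoB = 100 * Phi((R - 60) / 16), Phi the standard normal CDF
    return 100 * norm.cdf((r_from_mos(mos) - 60) / 16)

def pow_from_mos(mos):
    # %PoW = 100 * Phi((45 - R) / 16)
    return 100 * norm.cdf((45 - r_from_mos(mos)) / 16)

print(round(gob_from_mos(4.0), 1), round(pow_from_mos(2.0), 1))
```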
For the continuous rating scale, the ratio \(P(U\le 2)\) is below the E-model. However, we can determine the parameter \(\theta \) in such a way that the mean squared error (MSE) between the \(\%\mathrm {PoW}\) of the E-model and the subjective data \(P(U \le \theta )\) is minimized. In the appendix, Fig. 12b shows the MSE for different values of \(\theta \). The value \(\theta =2.32 > 2\) leads to a minimum MSE regarding \(\%\mathrm {PoW}\). Hence, the E-model overestimates the measure \(\%\mathrm {PoW}\), i.e. \(P(U \le 2)\), for the continuous scale, but \(P(U \le \theta )\) with the optimized \(\theta \) leads to a very good match with the E-model.
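The MSE-optimal \(\theta \) can be found with a simple grid search. A sketch under the assumption that the per-condition rating lists are available and that a %PoW-vs-MOS curve such as pow_from_mos from the previous snippet is given:

```python
import numpy as np

def optimal_theta(cond_ratings, pow_curve, thetas=np.arange(1.0, 4.0, 0.01)):
    """Grid search for the theta minimizing the MSE between the measured
    P(U <= theta) per test condition and a %PoW-vs-MOS curve."""
    mos = np.array([np.mean(u) for u in cond_ratings])
    target = np.array([pow_curve(m) for m in mos]) / 100.0
    best_theta, best_mse = None, np.inf
    for theta in thetas:
        p_low = np.array([np.mean(np.asarray(u) <= theta)
                          for u in cond_ratings])
        mse = np.mean((p_low - target) ** 2)
        if mse < best_mse:
            best_theta, best_mse = theta, mse
    return best_theta, best_mse
```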
In a similar way, Fig. 8b investigates the \(\theta \)-acceptability and compares the results with the \(\%\mathrm {GoB}\) of the E-model. Even when the MOS is around 4, the subjective results show that the ratio of users rating good or better is only 80 and 70 % on the discrete and the continuous scale, respectively. The E-model overestimates the ratio \(P(U \ge 4)\) of satisfied users rating good or better on the discrete scale. The maximum difference between the logistic fitting function and the \(\%\mathrm {GoB}\) of the E-model is 17.49 % at MOS 3.3379. For the continuous rating scale, the E-model overestimates the ratio of satisfied users even more, with the maximum difference being 46.20 % at MOS 3.4862. The value \(\theta ={3.0140}\) leads to a minimum MSE between the E-model and \(P(U \ge \theta )\) on the continuous scale, as numerically derived from Fig. 12b. Thus, for the speech QoE study, the \(\%\mathrm {GoB}\) of the E-model corresponds to the ratio of users rating fair or better.
In summary, the E-model does not match the results from the speech QoE study for \(\%\mathrm {PoW}\), i.e. \(P(U\le 2)\), and \(\%\mathrm {GoB}\), i.e. \(P(U\ge 4)\), on either rating scale. The results on the discrete rating scale lead to a higher ratio of dissatisfied users rating poor or worse than (a) the \(\%\mathrm {PoW}\) of the E-model and (b) the \(\%\mathrm {PoW}\) for the continuous scale. The \(\%\mathrm {GoB}\) of the E-model overestimates the \(\%\mathrm {GoB}\) on both the discrete and the continuous scale.Footnote 7 Thus, in order to understand the ratio of satisfied and dissatisfied users, it is necessary to compute those QoE metrics for each subjective experiment, since the E-model does not match all subjective experiments. Due to the non-linear relationship between MOS and \(\theta \)-acceptability, the additional insights become evident. For service providers, the \(\theta \)-acceptability allows going beyond the ‘average’ user in terms of MOS and deriving the ratio of satisfied users with ratings larger than \(\theta \).
SOS hypothesis and modeling of complete distributions
We relate the SOS values to MOS values and show that the entire distribution of user ratings for a certain test condition can be modeled by means of the SOS hypothesis. A discrete and a continuous rating scale lead to a discrete and a continuous distribution, respectively.
Results for web QoE on a discrete rating scale
Figure 9 shows the results of the web QoE study (Hoßfeld et al. 2011). In the study, the page load time was varied for each test condition, and 72 subjects rated the overall quality on a discrete 5-point ACR scale. Each user viewed 40 web pages with different images on the page and page load times (PLTs) from 0.24 to 1.2 s, resulting in 40 test conditions per user.Footnote 8 For each test condition, MOS and SOS are computed over the opinions of the 72 subjects. As users conducted the test remotely, excessively high page load times might have caused them to cancel or restart the test. To avoid this, a maximum PLT of 1.2 s was chosen. As a result, the minimum MOS value observed is 2.1111 for the maximum PLT.
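Throughout this section, MOS and SOS are computed per test condition. A minimal sketch, where the ratings matrix is a hypothetical stand-in for the study data:

```python
import numpy as np

def mos_sos(ratings_matrix):
    """MOS and SOS per test condition.
    rows = subjects, columns = test conditions."""
    r = np.asarray(ratings_matrix, dtype=float)
    # MOS is the column mean; SOS is the sample standard deviation
    return r.mean(axis=0), r.std(axis=0, ddof=1)
```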
Figure 9a shows the relationship between SOS and MOS and reveals the diversity in user ratings. The markers ‘\({\square }\)’ depict the tuple (MOS, SOS) for each of the 40 test conditions. For a given MOS, the individual user rating is relatively unpredictable due to the user rating diversity (in terms of standard deviation).
The results in Fig. 9a confirm the SOS hypothesis, and the SOS parameter is obtained by minimizing the least squared error between the subjective data and Eq. 5. As a result, a SOS parameter of \(\tilde{a}=0.27\) is obtained. The mean squared error between the subjective data and the SOS hypothesis (solid curve) is close to zero (MSE 0.0094), indicating a very good match. In addition, the MOS-SOS relationship for the binomial distribution \((a_B=0.25)\) is plotted as a dashed line. To be more precise, if user ratings U follow a binomial distribution for each test condition, the SOS parameter is \(a_B=0.25\) on a 5-point scale. The parameters of the binomial distribution per test condition are given by the fixed number \(N=4\) of rating scale steps and the MOS value \(\mu \), which determines \(p=(\mu -1)/N\). Since the binomial distribution is defined for values \(x=0,\ldots ,N\), the distribution is shifted by one to obtain user ratings on a discrete 5-point scale from 1 to 5. Thus, for a test condition, the user ratings U follow the shifted binomial distribution with \(N=4\) and \(p=(\mu -1)/N\) for a MOS value \(\mu \), i.e. \(U \sim B(N,(\mu -1)/N) + 1\) and \(P(U=i)=\left( {\begin{array}{c}N\\ i-1\end{array}}\right) p^{i-1}(1-p)^{N-i+1}\) for \(i=1,\ldots ,N+1\) and \(\mu \in [1;5]\).
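The least-squares fit of the SOS parameter mentioned above has a closed form, since the SOS hypothesis is linear in \(\sqrt{a}\) once the square root of the parabola is factored out. A sketch of this fit, assuming Eq. 5 has the form \(\sigma ^2(\mu )=a(\mu -1)(5-\mu )\) on a 5-point scale (our own formulation, not the authors’ code):

```python
import numpy as np

def fit_sos_parameter(mos, sos, lo=1.0, hi=5.0):
    """Closed-form least-squares fit of a in the SOS hypothesis
    SOS(mu)^2 = a * (mu - lo) * (hi - mu)  (Eq. 5, here for a [1;5] scale)."""
    mos = np.asarray(mos, dtype=float)
    sos = np.asarray(sos, dtype=float)
    basis = np.sqrt((mos - lo) * (hi - mos))        # model: SOS = sqrt(a) * basis
    sqrt_a = np.sum(basis * sos) / np.sum(basis ** 2)
    return sqrt_a ** 2
```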
We observe that the measurements can be well approximated by a binomial distribution with \(a_B=0.25\) (MSE = 0.0126), plotted as a dashed curve. The SOS of the measurement data is only a factor of \(\sqrt{\frac{\tilde{a}}{a_B}}=1.04\) higher than the SOS for the binomial distribution. The SOS parameter a is a powerful approach to select appropriate distributions of the user opinions. In this study, we observe roughly \(a=0.25\) on a discrete 5-point scale, which means that the distribution follows the aforementioned shifted binomial distribution. Thus, for any MOS value, the entire distribution (and deducible QoE metrics like quantiles) can be derived.
Figure 9b shows the measured \(\alpha \)-quantiles ‘\({\square }\)’ as well as the quantiles from the binomial distribution ‘\({\bullet }\)’ compared to the MOS values ‘\({\blacklozenge }\)’. The quantiles for the shifted binomial distribution ‘\({\bullet }\)’ match the empirically derived quantiles very well. The 10 and 90 %-quantiles quantify the opinion score of the 10 % most critical and most satisfied users, respectively. There are strong differences between the MOS and the quantiles. The maximum difference between the 90 %-quantile and the MOS is \(4-2.139=1.861\). For the 10 %-quantile, we observe a similarly strong discrepancy, \(2.903-1=1.903\).
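Given the shifted binomial model stated above, such quantiles follow directly from its inverse CDF. A minimal sketch (the MOS value in the example is illustrative):

```python
from scipy.stats import binom

def binomial_rating_quantile(mos, alpha, n=4):
    """alpha-quantile of the shifted binomial rating model
    U ~ B(n, p) + 1 with p = (mos - 1) / n on a discrete 5-point scale."""
    return binom.ppf(alpha, n, (mos - 1) / n) + 1

# 10 and 90 %-quantiles for a hypothetical condition with MOS 3.0
print(binomial_rating_quantile(3.0, 0.1), binomial_rating_quantile(3.0, 0.9))
```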
This information, while very significant to service providers, is masked out by the averaging used to calculate MOS values. As a conclusion from the study, we recommend reporting different quantities beyond the MOS to fully understand the meaning of the subjective results. While the SOS values reflect the user diversity, the quantiles help to understand the fraction of users with very bad (e.g. 10 %-quantile) or very good quality perception (e.g. 90 %-quantile).
Results for video QoE on a continuous rating scale
Figure 10 shows the results of the video QoE study (De Simone et al. 2009). A continuous rating scale from 0 to 5 (cf. Fig. 14) was used. The two labs where the study was carried out are denoted as “EPFL” and “PoliMi” in the result figures. The packet loss ratio in the video transmission was varied over \(p_L \in \{0;0.1;0.4;1;3;5;10\}\) (in %) for four different videos. In total, 40 subjects assessed 28 test conditions. The MOS, SOS, as well as the 10 and 90 %-quantiles were computed for each test condition over all 40 subjects from both labs. More details on the setup can be found in “Video QoE and continuous rating scale” in Appendix 1.
Figure 10a provides a SOS-MOS plot. The markers depict the tuple (MOS, SOS) for each of the 28 test conditions (PoliMi ‘\({\square }\)’ and EPFL ‘\({\lozenge }\)’). The dashed lines show the SOS fitting functions with the corresponding SOS parameters for the two labs, which are almost identical. When merging the results from both labs, we arrive at the SOS parameter \(a=0.10\). Due to the user diversity, we of course observe positive SOS values for every test condition (the theoretical minimum SOS is zero for the continuous scale), but the diversity is lower than for web QoE. Subjects are presumably more confident about (or familiar with) how to rate an impaired video, while the impact of temporal stimuli, i.e. PLT for web QoE, is more difficult to evaluate.
For each test condition, we observe a MOS value and the corresponding SOS value according to the SOS parameter. We fit the user ratings per packet loss ratio with a truncated normal distribution in [0; 5] with the measured mean \(\mu \) (MOS) and standard deviation \(\sigma \) (SOS). Thus, the user ratings U follow the truncated normal distribution, i.e. \(U \sim N(\mu ;\sigma ;0;5)\) with \(U \in [0;5]\). We observe a very good match between the empirical CDF and the truncated normal distribution, see Fig. 15b in the appendix. This is not obvious and is no trivial result: although the first two moments of both distributions are identical, the underlying distributions could still be very different, see “Motivation”. Thus, together with the SOS parameter a, the user voting distribution is completely specified for any MOS value \(\mu \) on the rating scale, i.e. \(\mu \in [0;5]\).
Figure 10b shows the quantiles as a function of MOS. The filled ‘\(\bullet \)’ and non-filled markers ‘\(\circ \)’ depict the empirically derived 90 and 10 %-quantiles for the 28 test conditions, respectively. Furthermore, we plot the quantiles depending on MOS for user ratings U following a truncated normal distribution with SOS parameter \(a=0.1, 0.5, 1\). Note that we measure \({a=0.096}\) in the experiments on video QoE. The SOS parameter 0.5 leads to \(\sqrt{\frac{0.5}{0.1}}={2.2361}\) times higher SOS values for an observed MOS. The SOS parameter 1 leads to the maximum possible SOS, which is 3.1623 times higher than in the subjective data. Due to the SOS hypothesis and a given SOS parameter a, we obtain for each MOS value \(\mu \) the related SOS value \(\sigma (\mu ;a)\), see (5). Thereby, a MOS value represents the outcome of a concrete test condition. The parameters \(\mu \) and \(\sigma \) are the input parameters of the truncated normal distribution, which allows us to compute the \(\alpha \)-quantile of the truncated normal distribution \(U \sim N(\mu ;\sigma ;0;5)\). The solid and dashed lines depict the 90 and 10 %-quantiles, respectively. We observe that the quantiles of the truncated normal distribution corresponding to the SOS parameter \(a=0.1\) fit the empirical quantiles very well. With the information of the SOS parameter, the quantiles, etc., can be completely derived for any MOS value. Similarly to the discrete rating scale results from the web QoE study, we observe strong differences between the MOS and the quantiles when using a continuous rating scale. The maximum difference between the 90 %-quantile and the MOS is \({3.623250}-{2.420470}={1.202780}\). Also on the continuous scale, the MOS masks out such meaningful information for providers.
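Quantile curves of this kind can be reproduced by combining the SOS hypothesis with the truncated normal model. A sketch, assuming the SOS hypothesis takes the form \(\sigma ^2(\mu )=a\,\mu (5-\mu )\) on the [0;5] scale (our own illustration, not the code behind Fig. 10b):

```python
import numpy as np
from scipy.stats import truncnorm

def rating_quantile(mos, a, alpha, lo=0.0, hi=5.0):
    """alpha-quantile of U ~ N(mu, sigma; lo, hi), a truncated normal whose
    sigma follows the SOS hypothesis sigma^2 = a * (mu - lo) * (hi - mu)."""
    sigma = np.sqrt(a * (mos - lo) * (hi - mos))
    a_std, b_std = (lo - mos) / sigma, (hi - mos) / sigma  # standardized bounds
    return truncnorm.ppf(alpha, a_std, b_std, loc=mos, scale=sigma)

# 10 and 90 %-quantiles for the measured SOS parameter a = 0.1
for mu in (2.0, 3.0, 4.0):
    print(mu, rating_quantile(mu, 0.1, 0.1), rating_quantile(mu, 0.1, 0.9))
```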
Results for speech QoE—comparison between continuous and discrete rating scale
When comparing the SOS values from the web and the video study, we observe that the discrete rating scale leads to higher SOS values than the continuous scale. However, the higher user diversity may also be caused by the application (Hoßfeld et al. 2011). Therefore, we briefly discuss the speech QoE study (already discussed in Sect. “%GoB and %PoW: ratio of (dis-)satisfied users” and described in “Speech quality on discrete and continuous scale” in Appendix 1). Subjects rated the QoE for certain test conditions on a discrete and a continuous scale, which allows a comparison.
As a result (cf. Fig. 12a), the SOS parameters \(a_d=0.23\) and \(a_c=0.12\) are obtained for the discrete and the continuous scale, respectively. For the discrete scale, we observe larger SOS values than for the continuous scale, which can also be seen from the larger SOS parameter \(a_d>a_c\). In particular, on the discrete scale, the SOS values are larger by a factor of \(\sqrt{\frac{a_d}{a_c}} \approx {1.3844}\). This observation seems reasonable, as the continuous scale has more discriminatory power than the discrete scale. Subjects can assess the quality at a finer granularity on the continuous scale by choosing a value \(x \in [i;i+1]\), while they have to decide between i and \(i+1\) on a discrete scale. The minimum SOS for a given MOS value is zero for a continuous scale, while the minimum SOS is larger than zero and depends on the actual MOS value for a discrete scale, cf. (3).
Although the results seem to be valid from a statistical point of view, the literature shows conflicting results. In Siahaan et al. (2014), subjective studies on image aesthetic appeal were conducted using a discrete 5-point ACR scale as well as a continuous scale. However, similar SOS parameters were obtained for both rating scales. Péchard et al. (2008) compared two different subjective quality assessment methodologies for video QoE: absolute category rating (ACR) using a 5-point discrete rating scale and subjective assessment methodology for video quality (SAMVIQ) using a continuous rating scale. As a key finding, SAMVIQ is more precise (in terms of the confidence interval width of a MOS value) than ACR for the same number of subjects. However, SAMVIQ uses multiple stimuli assessment, i.e. multiple viewings of a sequence. There are further works (Tominaga et al. 2010; Pinson and Wolf 2003; Brotherton et al. 2006; Huynh-Thu and Ghanbari 2005) comparing different (discrete and continuous) rating scales as well as assessment methodologies like SAMVIQ in terms of reliability and consistency of the user ratings. We note, however, that they do not address the issues of using averages to characterize the results of those assessments. A detailed analysis of the comparison of continuous and discrete rating scales and their impact on QoE metrics is left for future work.
Table 2 Description of the subjective studies conducted for analyzing QoE for different applications
Comparison of results
All experiments and some key quantities are summarized in Table 2, which may serve as a guideline on how to properly describe subjective studies and their results in order to extract as much insight from them as possible. For comparing the key measures across the experiments with different rating scales, the user ratings in all experiments are mapped onto a scale from 1 (bad quality) to 5 (excellent quality).
The user rating diversity seems to be lower when using a continuous rating scale than a discrete one. This can be observed from the SOS parameter a, but also from the maximum SOS at a certain MOS. It should be noted, however, that for more interactive services such as web browsing, there might be an inherently higher variation of user ratings, due, e.g., to uncertainty on how to rate the overall quality.
The MSE-optimal parameter \(\theta \) is determined by minimizing the MSE between the \(\theta \)-acceptability of the measurement data and the \(\%\mathrm {GoB}\)-MOS mapping. On a discrete rating scale, only discrete values of \(\theta \) are possible, and therefore stronger deviations between the \(\%\mathrm {GoB}\) estimator and the \(\theta \)-acceptability arise. We see that for the task-related web QoE, the MSE-optimal parameter is \(\theta =3\). This means that the ratio of users rating fair or better matches the \(\%\mathrm {GoB}\) curve. For the continuous rating scales, optimal continuous thresholds can be derived. For the speech QoE and the video QoE on continuous scales, a value of \(\theta \) around 3 matches the \(\%\mathrm {GoB}\) curve.
The limitations of MOS are made evident by the minimum \(\%\mathrm {GoB}\) ratio \(P(U \ge 4)\) over all test conditions which lead to a MOS value equal to or larger than 4. This ratio shows how many users still rate good or better (or do not), although the MOS exceeds the threshold.
Another limitation of the MOS is highlighted by the quantiles. In particular, the maximum difference between the 90 %-quantile and the MOS values is shown to reach up to 2 points on the 5-point scale. This highlights the importance of considering QoE beyond the MOS.