Generalised Score Distribution: Underdispersed Continuation of the Beta-Binomial Distribution

A class of discrete probability distributions contains distributions with limited support. A typical example is some variant of a Likert scale, with response mapped to either the $\{1, 2, \ldots, 5\}$ or $\{-3, -2, \ldots, 2, 3\}$ set. An interesting subclass of discrete distributions with finite support are distributions limited to two parameters and having no more than one change in probability monotonicity. The main contribution of this paper is to propose a family of distributions fitting the above description, which we call the Generalised Score Distribution (GSD) class. The proposed GSD class covers the whole set of possible mean and variances, for any fixed and finite support. Furthermore, the GSD class can be treated as an underdispersed continuation of a reparametrized beta-binomial distribution. The GSD class parameters are intuitive and can be easily estimated by the method of moments. We also offer a Maximum Likelihood Estimation (MLE) algorithm for the GSD class and evidence that the class properly describes response distributions coming from 24 Multimedia Quality Assessment experiments. At last, we show that the GSD class can be represented as a sum of dichotomous zero-one random variables, which points to an interesting interpretation of the class.


I. INTRODUCTION
A Likert scale is used in numerous research fields, like psychology, medicine, or quality of experience [1], [2], to name a few.Responses given to questions using a Likert scale are ordinal, but in practice they are often analysed as data coming from an interval scale.There is even a fairly common approach of ignoring the fact that data are discrete and treating them as if they come from a continuous, usually normal, distribution [3].Such analyses can result in wrong conclusions [1].
There are many reasons why an ordinal scale is converted to an interval one for the purpose of statistical analysis.One of them is the fact that in many cases, like assessing a student's performance or rating a movie, the mean assessment (or the mean response) is used to operationalise a latent subjectively judged trait one wants to measure.For example, it is a common practice to use the average of student's grades as an assessment of their underlying capabilities.This average is treated as a measure of student's "quality."Although this approach is not mathematically correct in the strict sense (assuming the data-i.e., student's grades-truly are ordinal), important decisions are made based on it (e.g., whether to grant a scholarship to a "high quality" student).Wanting to stay in line with how people conventionally use ordinal assessment scores (or, more generally, ordinal responses), in this work we try to model the strategy of treating quality scores as if they were expressed on an interval scale.This strategy seems justified to us especially if a trait one wants to assess has some objective characteristics that are dominant in the opinion formulation.This is the case in the field of Video Quality Assessment (VQA), where the technical reproduction quality of a visual stimulus is assessed.Although there is a subjective component to such assessments, objectively measurable factors (like resolution or the number of frames per second) are dominant in the opinion formulation.Importantly, the subjective component of the assessment can be treated as noise, distorting the measurement of the underlying stimulus quality.Our modelling approach takes that view into account.
Another reason why ordinal responses are converted to an interval scale is that proper latent variable models, like ordered logit or ordered probit (see [4]), are too complicated for studies with relatively few responses per hidden variable.In hidden variable models, we have the distribution parameters and the discretization/mapping parameters.All these parameters are difficult to interpret and in the case of small sample sizes, it is impossible to estimate them properly.For instance, if we have a sample with responses regarding only one or two objects (e.g., video clips), then the hidden variable models are overparameterised.They necessarily need big sample sizes with multiple raters evaluating multiple objects.
The main idea of this paper is to propose a family of discrete probability distributions that is able to model responses gathered in Multimedia Quality Assessment (MQA) subjective experiments. 1There, the distribution of responses is often (but not exclusively) uderdispersed (i.e., having variance lower than that of the binomial distribution) and practitioners analysing these data need a model intuitively conveying response distribution characteristics.We put forward a solution avoiding the difficulties inherent to treating ordinal responses as though they are expressed on an interval scale and, at the same time, offer a model with only two parameters (to sidestep overparameterisation challenges).We find our solution attractive to practitioners, who want to use relatively simple but mathematically correct tools.We call our proposed family of distributions a Generalised Score Distribution class (GSD).It is a two-parameter family of discrete distributions on the set {1, ..., M }, M ∈ N \ {1, 2}.The practical interpretation we attach to the elements of this set is that they correspond to subjective ratings of quality (e.g., of an image or a video clip).Because of its convenient parameterisation, the class does a similar job to what the normal distribution class does for the continuous case.The first parameter ψ ∈ [1, M ] is (as in the case of the normal distribution class) the expected value.In MQA, it is a common practice to assume subjective responses are expressed on an interval scale.The expected value of responses gathered in the course of a subjective experiment is then treated as an estimate of stimulus latent quality (sometimes referred to as true quality).The ψ parameter of the GSD class reflects this approach.We would like the second parameter of the GSD class to play the same role as the second parameter of the normal distribution class.Unfortunately, for discrete distributions defined on the set {1, ..., M }, the range of all possible variances is changing with ψ.Therefore, the second parameter ρ ∈ [0, 1] (also referred to as dispersion parameter or confidence parameter) of the GSD class is a linear function of variance, with the value range equal to the interval [0, 1] for every ψ.To the best of our knowledge, the convenient parameterisation of the GSD class is unique, compared to other discrete distributions.It allows us to look at the parameters describing the expected value and variance independently, as one can do in the case of the normal distribution class.In other words, shifting the expected value parameter (ψ), does not change the dispersion parameter (ρ).Likewise, changing the dispersion parameter (ρ), does not influence the expected value parameter (ψ).The GSD class covers all possible first-and second-order moments for discrete distributions defined on {1, ..., M }.It also offers explicit formulae for probabilities without the use of special functions and thus explicit formulae for the derivatives of its log-likelihood function.Therefore, we believe that the GSD can be used outside of the field of MQA.
The problem of generalising the binomial distribution to overdispersed and underdispersed data is known and widely considered in literature.The natural generalisation of the binomial distribution for overdispersed data is the beta-binomial distribution.It provides simple formulae for probabilities and derivatives of its log-likelihood function.Furthermore, there are available ready-to-use algorithms for the estimation of its parameters (see [5]).One can also find applications of the betabinomial distribution when dealing with overdispersed data (see e.g.[6]).In [7] the method for extending the beta-binomial distribution for underdispersed data is proposed.Unfortunately, this method does not provide a distribution that covers all possible variances.For any fixed mean value, it stops at some variance and cannot go lower (see Fig. 2).Thus, the data that are strongly underdispersed cannot be modelled in such a way.In this paper, we propose a different way of extending the beta-binomial distribution.Our solution allows to obtain all possible variances for a discrete distribution defined on {1, ..., M }.
Our method also provides a reparameterisation to obtain easy to interpret parameters ψ (mean value) and ρ (confidence level linearly dependent on variance).The GSD class can be also represented as a sum of dichotomous zero-one random variables (see Proposition 4), which gives an interesting interpretation of that class.
This paper provides estimation and test of goodness-of-fit algorithms for the GSD class.We analyse the algorithms performance through an extensive simulation study.We also use six Multimedia Quality Assessment (MQA) databases (amounting to more than 100 000 individual responses) to provide evidence that the proposed class is a useful analytical tool that can be used in practice.It is worth mentioning that we already proposed in the past a tool based on the GSD class.The tool extends possible ways to validate data consistency of responses obtained during a MQA subjective experiment [8].Importantly, our work was noticed by practitioners in the MQA field and referred to in [9], [10], and [11].
The GSD class we propose can be used for modelling survey responses expressed on a Likert scale as well.For example, in [12] the authors consider the impact of the number of response categories on the reliability of the measurements.In the theorised response generation model (provided in equation ( 1) of [12]), one can use the GSD class to model the random error.This approach would then allow to estimate the latent unobserved true response.The same method can also be used, for example, in [13], where the problem of measuring patient experience in hospitals is considered.
We claim that the GSD class properly describes responses from MQA subjective experiments.We also state that the GSD class estimates well response distributions even for samples of small size (i.e., sample sizes conventionally used in MQA experiments).At last, we argue that the GSD class describes response distributions using easy to interpret and easy to estimate parameters.To substantiate our claims, in this work we present the following contributions: • We reveal that based on samples of small size, the GSD class better forecasts a response distribution for a sample of larger size, in comparison to the empirical distribution.The paper is structured as follows.In Section II, we describe the proposed GSD family of distributions and compare it with the Ordered Probit model.Section III introduces the maximum likelihood estimation for the GSD class.It also considers the numerical accuracy of the estimation method we use.(Numerical accuracy was also examined in multidimensional case in Appendix E.) Section IV presents the analyses performed on real data sets of responses from six MQA studies.The last section concludes the paper.All proves and additional formulae can be found in Appendices.

II. SUBJECTIVE RESPONSE AS A RANDOM VARIABLE
Assuming that we use an M -point discrete scale, a random variable U (describing a subjective response) has a distribution given by: Such a description of a response distribution is general but has M − 1 different parameters.There are M − 1 of them, since there are M probabilities this distribution describes.In general, we can describe a subjective response as a function: where ψ is the expected value (referred to as true quality in the context of MQA research) and is an error term with the mean value equal to zero.An algorithm predicting stimulus quality (or any other subjectively judged trait) should aim at estimating ψ.Still, the error distribution is important and should be modelled.The error term represents the precision of ψ estimation.
It is desirable that the error term (represented by ) should not be too complicated.Therefore, we would like to use a model in which the error is described by a single parameter.(Please note that in Appendix E we also consider a multidimensional version of subjective responses with m quality parameters and n error parameters.)

A. Ordered Probit With Fixed Thresholds
The models proposed by [15] and [16] describe subjective responses as following a continuous normal distribution with certain mean µ, which is assumed to represent the latent stimulus quality (also referred to as true quality), and standard deviation σ, describing the error.Therefore, subjective response O ∼ N (µ, σ 2 ).Since in MQA experiments subjective responses are often expressed on a discrete scale, we cannot directly observe O.To convert the continuous form of a response to a discrete one, discretisation and censoring (clipping) are necessary.This process converts a continuous random variable O to a discrete variable U .We can calculate each response category probability (i.e., U distribution), as a function of µ and σ using the following equations: for s = {2, 3, M − 1} and Note that the definition of true quality µ is model dependent here.This is a flaw of this approach.It is also worth mentioning that µ = E(O) can be completely different from ψ = E(U ).The latter is a natural and model-independent parameter defining the true quality (for possible differences between E(O) and E(U ) see Fig. 1).
It is obvious that have to be different.However, even for E(O) ∈ [1, M ] the differences are considerable.The other problem, especially from the numerical estimation point of view, is the unbounded parameter set (µ, σ) ∈ (−∞, ∞) × (0, ∞).In Fig. 6 one can see how parameters (µ, σ) map to (E(U ), V(U )) after discretisation, for the case M = 5.This figure and the lack of an inverse formula for calculating (µ, σ) having (E(U ), V(U )), make it clear that this approach (referred to in the literature as ordered probit [17]) may prove problematic if the moments-based estimation would be used.Likewise, it may be challenging to find a starting point for numerical estimation methods.

B. GSD
An example of a discrete distribution that is described by two parameters is the beta-binomial distribution [18].The smallest variance the beta-binomial distribution can express is binomial distribution's variance.It is a strong limitation.For example, in the case of MQA subjective experiments for video, in [19] it is suggested that the binomial distribution has the highest possible variance for a correctly conducted subjective experiment.Differently put, most MQA subjective experiments yield data with response distributions having the variance lower than that of the binomial distribution.Therefore, we need a different distribution, covering the whole spectrum of possible variances.This is especially true for underdispersed probability distributions (i.e., distributions with the variance lower than that of the binomial distribution; see Fig. 2).
where is an error with mean value equal to 0. Since U belongs to the set {1, 2, ..., M }, then the distribution of has to be supported on the set 1 − ψ, 2 − ψ, ..., M − ψ.Let us consider the shifted binomial distribution for : where k ∈ {1, ..., M } represents the response categories, from which subjective experiment participants (also referred to as subjects or raters) can choose from.Since the support of this distribution and the mean value are fixed, we obtain a fixed shifted binomial distribution without any freedom.However, we would like to have a class of distributions for with all possible variances V( ) = V(U ).Let us think about how the set of all possible variances, for all distributions supported on the set 1 − ψ, 2 − ψ, ..., M − ψ, looks like.Remember that the mean values for such distributions are fixed at 0 (since they describe the error term).This is why the set of all possible variances depends on ψ.If we denote by V min (ψ), V max (ψ) the minimal and maximal possible variance, respectively, then and the interval [V min (ψ), V max (ψ)] is the set of all possible variances.Notice that the interval [V min (ψ), V max (ψ)] is the biggest for ψ = (M + 1)/2 and if ψ is not an integer, then V min (ψ) > 0. Let us return to the shifted binomial distribution.It is easy to calculate that its variance is equal to: The question is how to obtain from this shifted binomial distribution a class of distributions that covers the whole interval of variances [V min (ψ), V max (ψ)] for any ψ (see Fig. 2).We would like to obtain this class by: • adding only a single normalised parameter ρ ∈ [0, 1], • making variance of the error to be linearly dependent on ρ and • requiring that variance is a decreasing function of ρ (this way we could interpret ρ as a confidence parameter; see Fig. 4).Let us denote by H ρ the distribution of the error fulfilling the above conditions.Since the variance of the error is linearly dependent on ρ and decreasing, then it has to be equal to Using formulae ( 8) and ( 9) we can calculate which gives us the value of ρ corresponding to a shifted binomial distribution.We have which corresponds to the red coloured dots and blue vertical hatching in Fig. 2, and which corresponds to the green coloured area in Fig. 2.
For variances bigger than V Bin (ψ) (the green coloured area in Fig. 2) we use the reparameterised beta binomial distribution.Since the mean value is fixed, we only have one free parameter ρ ∈ [0, C(ψ)].The effect of such reparameterization gives us the distribution denoted by G ρ : where ρ ∈ [0, C(ψ)] and k ∈ {1, ..., M }.The above formula can be rewritten as for k = M Therefore, for ρ = 0 we obtain where U = ψ + is supported on {1, ..., M }.
The proof of Proposition 1 can be found in Appendix A. Remark 1: Notice that for ρ → 0 the G ρ distribution approaches a two-point distribution supported on {1 − ψ, M − ψ}, with the biggest possible variance equal to V max (ψ).For ρ → C(ψ) the G ρ distribution approaches the shifted binomial distribution, with variance equal to V Bin (ψ).
For variances smaller than V Bin (ψ) (cf. the blue vertical hatching and red coloured dots in Fig. 2) we use a mixture technique.Specifically, we take a mixture of the shifted binomial distribution and the distribution with the smallest possible variance (i.e., a two-point or one-point distribution, depending on ψ).Of course, the mixture parameter has to be reparameterised to fit the [C(ψ), 1] interval.The effect of such reparameterisation gives us the distribution denoted by F ρ : where U = ψ + is supported on {1, ..., M }.
The proof of Proposition 2 can be found in Appendix A.
Remark 2: Notice that for ρ → C(ψ) the distribution F ρ approaches the shifted binomial distribution with variance equal to V Bin (ψ).For ρ → 1 the distribution F ρ approaches a two-point or one-point distribution (depending on ψ), with the smallest possible variance equal to V min (ψ).Finally, we obtain the distribution: , then we say that U = ψ + has the GSD(ψ, ρ) distribution, where ψ is the expected value and ρ ∈ [0, 1] is a confidence parameter, linearly dependent on the variance, i.e., The proof of Proposition 3 is an obvious consequence of Propositions 1 and 2.
In the following proposition, we show that our GSD class can be looked at from an interesting angle.The GSD class can be represented as a distribution of the number of successes in a sequence of M − 1 experiments.
Proposition 4: GSD distribution can be represented as a sum of dichotomous zero-one random variables.Specifically, a) in the case of ρ ≥ C(ψ), Z 1 , ..., Z M −1 are zero-one independent random variables and where (see Fig. 3) b) in the case of ρ < C(ψ), Z 1 |B, ..., Z M −1 |B are conditionally independent zero-one random variables, where P (Z i = 1|B) = B and B has the beta distribution of the following form: . The proof of Proposition 4 can be found in Appendix A.
Remark 3: If we consider a class of functions φ ψ,ρ satisfying the following conditions: • then we obtain a completely general response distribution suitable for representing underdispersed data (cf.Fig. 2) with mean value ψ and confidence parameter ρ.
The motivation behind constructing the GSD class is to have a class that would properly describe response distributions observed when analysing responses from MQA subjective experiments.There, the responses are often expressed on a 5-level scale, with the following mapping between discrete consecutive numbers and textual labels: 1-Bad, 2-Poor, 3-Fair, 4-Good and 5-Excellent.Although initially designed for the MQA research, we expect the GSD class to be useful for describing other, more general processes (at least for measurement processes exhibiting a characteristic similar to what is the case for the MQA subjective experiments).Examples of specific incarnations of the GSD family of distributions (for M = 5) are shown in Fig. 5.Note that for ψ close to 1 or M , regardless of ρ, the obtained distributions are similar.This is because the maximum spread we can obtain is limited by a small range of possible variances (see Fig. 2).

C. Comparing Ordered Probit's and GSD's Parametrisation
In this section, we present the interaction between ordered probit parameters, GSD parameters, and the (E(U ), V(U )) space.Fig. 6 presents the interaction between ordered probit parameters (µ and σ) and summary statistics (E(U ) and V(U )).The latter are calculated taking discrete responses generated by the ordered probit model with a given µ and σ pair.The lines present in the left-hand-side of Fig. 6, correspond to the same coloured lines in the right-hand-side of the same figure.Specifically, the ordering of lines (when going from left to right in Fig. 6a and top to bottom in Fig. 6c) in the left-hand-side of the figure is the same as the ordering of lines in the right-hand-side of the figure.Fig. 7 presents the corresponding plots for the GSD class.Note, however, that the ordering of lines in Fig. 7c is reversed to the ordering of lines in Fig. 7d.Differently put, the top-most line in Fig. 7c corresponds to the bottom-most line in Fig. 7d.Importantly, both Fig. 6 and Fig. 7 use M = 5.Fig. 6 is a graphical presentation of the problems inherent to the estimation and interpretation of ordered probit parameters, when these are based on the observations of the random variable U .First of all, the mapping from any bounded set of (µ, σ) pairs, results in a set of (E(U ), V(U )) pairs that does not cover the whole ghost-like area of all possible (E(U ), V(U )) pairs.Notice that the lines in Fig. 6b do not reach neither the sides nor the top of the ghost-like area.Second of all, there is no analytical formula mapping (E(U ), V(U )) pairs to (µ, σ) pairs.It is difficult to even approximately guess which (µ, σ) pair corresponds to which (E(U ), V(U )) pair.This is a big limitation of ordered probit's parameterisation.The estimation of a (E(U ), V(U )) pair, based on observations U 1 , ..., U n , is an easy task.Unfortunately, the results of this estimation are not useful for estimating ordered probit parameters.
The problems described above do not apply to GSD's parameterisation.The ψ parameter is the expected value of the observations (U 1 , ..., U n ).Thus, for any (E(U ), V(U )) pair, we immediately know the corresponding ψ.This is because ψ is equal to E(U ).Notice that the vertical lines in Fig. 7a and Fig. 7b are in identical positions along the horizontal axis.The second GSD class' parameter, ρ, is the confidence parameter.Its value of 0 corresponds to the biggest possible variance (the upper bound of the violet coloured ghost-like area in Fig. 7b) and 1 corresponds to the smallest possible variance (the lower bound of the violet coloured ghost-like area in Fig. 7b) of the observations.Moreover, the range of ρ values is linear.For example, ρ = 0.7 can be interpreted as 70%, in terms of the available variance.That is, V(U ) = 70%V min (ψ) + 30%V max (ψ).Therefore, to obtain ρ from a (E(U ), V(U )) pair, it is enough to calculate the distance between V(U ) and the upper bound of the violet coloured ghost-like area in Fig. 7b.This distance should be then divided by the distance between the upper and lower bounds of the violet coloured ghost-like area in Fig. 7b.Differently put, ρ = Vmax(ψ)−V(U ) Vmax(ψ)−Vmin(ψ) .Value of ρ is thus simply a ratio between the distance between the observed variance (V(U )) and the maximum possible variance (V max ) and the available range of variance (V max (ψ) − V min (ψ)).The easy mapping of (E(U ), V(U )) pairs to (ψ, ρ) pairs makes the estimation of (ψ, ρ) pairs (when based on U 1 , ..., U n observations) much easier than the estimation of (µ, σ) pairs for the ordered probit model.
We would also like to draw the reader's attention to the violet horizontal bars present in Fig. 6a, 6c, 7a, and 7c.In all cases, the violet bars span the range from 1 to 5.This range corresponds to the width of the violet ghost-like area in the right-hand side of Fig. 6 and 7. Please note that GSD's parameterisation is bounded to this 1 to 5 range, whereas ordered probit's one is not.This is yet another advantageous feature of GSD's parameterisation.Being bounded to the same range as the range of observable responses, GSD class parameters are easier to interpret and understand.

III. GSD PARAMETERS ESTIMATION
Ordered probit and GSD models cannot be used without an accurate and efficient parameter estimation procedure.The simplest approach is to use the method of moments.As we mentioned in the previous section, the method of moments for the ordered probit model is rather problematic.However, for the GSD class, moments based estimation is quite simple, i.e., , where E(U ), V(U ) are expectation value and variance of the empirical distribution.To compare the GSD with ordered probit and to apply the likelihood ratio test of goodness-of-fit, we actually use Maximum Likelihood Estimator (MLE) for both models.To make the comparison between the two models fair, when fitting them to real data, we use the same numerical estimation method for both, i.e., we use the estimation using a dense grid of points.Specifically, we first compute probabilities of all response categories for a set of (ψ, ρ) (or (µ, σ)) pairs and then search through the resultant grid to find the pair best matching the sample of interest.In the multidimensional case (cf.Appendix E), for generated data, we use the gradient based estimation method for the GSD class.The moments based estimator serves as a starting point.The exact formulae for the log-likelihood function and gradient that were used in the estimation algorithm are in Appendix B.

A. Numerical Experiments for the GSD Class
To validate our MLE procedure, we perform a simulation study.We draw data from the proposed distribution and then estimate the distribution parameters using the samples generated.Importantly, we do so for the case of M = 5.
In Fig. 8 we present the risk measured as Root Mean Square Distance (RMSD) between true value ψ and estimated ψ, for sample sizes n = 12, 24, 50, 200.As one can see, the hardest case is when ρ is small and ψ is in the middle of the [1, M ] scale.This behaviour is expected, since in the middle of the scale, the possible variance is the largest (cf.Fig. 6b).One can also see that the parameter ψ is rather easy to estimate.That is, the risk is rather small for ψ ∈ [1,5], even for small sample sizes.
In Fig. 9 we present the risk measured as the RMSD between true value ρ and estimated ρ, for sample sizes n = 12, 24, 50, 200.In this case, the situation is slightly more complicated (than that presented above for ψ).The hardest case is when ψ is on the edge of the [1, M ] interval.As one can see in Fig. 4, when ψ is close to the edge of the [1, M ] interval, variance expressed as a function of ρ is almost horizontal.This means that even small changes in sample variance correspond to large changes of ρ.Simply put, the smaller the [V min (ψ), V max (ψ)] interval, the harder the estimation of ρ.It is worth pointing out, however, that even for hard cases the risk is getting smaller, the larger is the sample size.
We also validate the MLE procedure for the multidimensional case.Specifically, we generate responses of n raters (also referred to as subjects), each having assigned a confidence parameter ρ 1 , ..., ρ n .The subjects rate m objects (also referred to as stimuli), each having assigned an expectation value ψ 1 , ..., ψ m (which can be, for example, interpreted as latent true qualities of a set of m videos presented to n subjects).Based on the responses generated, we used the multidimensional MLE to recover the GSD parameters.More details and estimation results are in Appendix E.

IV. REAL DATA EXAMPLES
In the previous sections, we presented a new GSD family of distributions and showed that the estimation method properly extracts GSD parameters when applied to the simulated data.In this section, we validate if the proposed GSD class can be used to model real subjective data (i.e., subjective responses coming from MQA experiments).

A. Comparing Goodness-of-Fit of Ordered Probit and GSD for Multimedia Quality Assessment Data
We want to compare the GSD with the ordered probit model and one state-of-the-art solution.We come up with the latter by adapting the model presented in [20].We call the adapted version of the model Simplified Li2020 or SLI for short. 2o check whether a distribution fits specific data, we have to perform a two-step procedure.The first step is to estimate distribution parameters for a sample of interest.The second step is to test a null hypothesis, saying that the sample truly comes   from the assumed distribution (GSD, ordered probit or SLI), given the parameters estimated in the first step.We choose a standard likelihood ratio approach to test the goodness-of-fit (GoF) of the models.Specifically, we use the G-test of GoF (cf.Sec.14.3.4 of [22]).Since sample sizes we consider are mostly small (less than 30 observations per sample), we do not use the asymptotic distribution for calculating the p-value.Conversely, we estimate the p-value using a bootstrapped version of the G-test (see Appendix C). (For comprehensive theoretical considerations on the topic please take a look at [23].) The data we use to perform the analysis come from six MQA studies: (i) ITU [24], (ii) HDTV [25], (iii) MM2 [26], (iv) 14-505 [2], (v) ITS4S [27] and (vi) NFLX.We do not provide here extensive details regarding each study.Instead, we refer the reader to publications cited next to each acronym.Since the NFLX study does not have a dedicated publication, we refer the reader to Sec.II.C of [21].Two important features of the six studies is that they all focus on Multimedia Quality Assessment (MQA) and follow best practices and recommendations in the field.In other words, we can safely call them typical MQA studies.Some studies represent data from more than one subjective experiment.Differently put, one study may consist of multiple experiments.In total, we use data from 24 subjective experiments.This amounts to more than 100 000 individual scores (exactly 111 198) for more than 3 500 stimuli (exactly 3 643).
In Fig. 10 we present the cumulative distribution function (CDF) of GoF test p-values for the GSD, ordered probit, and SLI models.The black line is the upper bound of 95% right-sided confidence interval for the CDF of p-values under the null hypothesis.Specifically, under the null hypothesis, the CDF of p-values is not greater than the uniform distribution function (for more details see [8]).As one can see, there is no evidence that the GSD is not the correct way of modelling subjective responses from MQA experiments.On the other hand, there is evidence that the distributions modelled by the ordered probit and SLI models are not suitable here.

B. Bootstrapping
If we would like to use the bootstrap technique in some testing problems with data expressed on a Likert scale, there are two approaches to resampling that one can consider.One can either use the empirical distribution or fit a distribution coming from some assumed parametric class.Here, we show that for at least one type of real data, specifically responses coming from MQA experiments, it is better to use the estimated GSD than it is to use the empirical distribution.This holds at least in the case of relatively small sample sizes.(In the field of MQA, usually only up to 30 responses per stimulus are available.)To compare the behaviour of empirical probability mass function (EPMF) and the GSD, we use the following algorithm.
Let us denote by N the number of observations in the large sample (e.g., N = 200) and by n the number of observations in the subsample of this large sample (e.g., n = 24).Now, we denote by (N 1 , N 2 , N 3 , N 4 , N 5 ) the frequencies of each response category in the large sample.We denote by (p 1 , p 2 , p 3 , p 4 , p 5 ) the EPMF of the large sample.The test procedure is as follows (assuming there are five response categories): 1) Generate M C bootstrap samples (e.g., M C = 10 000) of size n from the EPMF of the large sample (p 1 , p 2 , p 3 , p 4 , p 5 ).
2) For the r-th bootstrap sample (r = 1, 2, . . ., M C) do the following.a) Estimate response category probabilities using maximum likelihood estimation for the model of interest (e.g., the GSD model).Denote the estimated probabilities by (q 1 , q2 , q3 , q4 , q5 ).b) Denote by (v 1 , v2 , v3 , v4 , v5 ) the EPMF of the bootstrap sample.c) Find the likelihood L m of the estimated model for the large sample.In other words, calculate d) Find the likelihood L e of the bootstrap sample's EPMF for the large sample.In other words, calculate e) Find the natural logarithm of the ratio of the two likelihoods and denote it by W r Note that the above simplifies to 3) Calculate the estimator of p GSD − p e = P (W r > 0) − P (W r < 0), which is the difference between the probability that the GSD has greater likelihood than the EPMF and the probability that the EPMF has greater likelihood than the GSD.This can be formally described by the following.Fig. 11: Histograms depicting the distribution of probability differences pGSD − pe for three different small sample sizes (i.e., 12, 24 and 50) and for the case of the (unmodified) GSD being compared with the (unmodified) empirical distribution.Blue-coloured parts of the bars represent statistically insignificant probability differences.
For L > 0 the GSD performs better.For R < 0 the EPMF performs better.If [L, R] contains zero there is no significant difference between the GSD and EPMF.We use data from four MQA studies: (i) MM2 [26], (ii) HDTV [25], (iii) NFLX (cf.Sec.II.C of [21]) and (iv) ITERO [28].We describe the first three studies in Sec.IV-A.The last study, contrary to the first three, is not a typical MQA study.We decide to use it anyway for two reasons.First, being atypical, it should not not give unfair advantage to the GSD.Second, it provides real data with many responses per stimulus.This last property also stands behind our choice to use the other three studies (i.e., MM2, HDTV and NFLX).Specifically, we only select from these stimuli with at least 144 responses.This results in 234 stimuli, each assigned between 144 and 228 responses.
We use three small sample sizes, i.e., n = {12, 24, 50}.This way we can observe how the GSD performs (when compared to the empirical distribution) for different fractions of the large sample information available.Intuitively, we expect the empirical distribution's performance to improve as the small sample size increases.If the GSD proofs to perform differently than the empirical distribution we would observe how the increasing small sample size influences the difference between the two approaches.We emphasise here that the increasing small sample size always favours the empirical distribution.On the other hand, the performance of the GSD depends on how well it fits to the distribution of responses observed in the large sample.If the fit is good, increasing small sample size also favours the GSD.If the fit is poor, increasing small sample size does not necessarily improve GSD's performance.Fig. 11 presents results of the analysis.It contains three histograms of probability differences pGSD − pe .Each histogram shows results for one of the investigated small sample sizes (i.e., 12, 24, and 50).Larger probability mass to the right of zero means the GSD outperforms the empirical distribution.Larger probability mass to the left of zero means the empirical distribution performs better than the GSD.As can be seen, the GSD outperforms the empirical distribution for all three small sample sizes considered.
We theorise that the reason the GSD outperforms the empirical distribution is because the latter assigns a larger than the GSD probability to empty cells (i.e., response categories with no responses assigned to them in the sample of interest).To verify this claim, we run our analysis once again, this time modifying the estimation procedure both for the GSD and empirical distribution.This modification does not allow any empty cells.In other words, the estimated probability of any response category has to be necessarily in the interval (0, 1).The details are in Appendix D. Fig. 12 presents results for the case of the corrected GSD being compared with the corrected empirical distribution.Again, the GSD outperforms the empirical distribution, although this time by a smaller margin.
The results clearly show that the GSD is a better choice than the empirical distribution when it comes to resampling of subjective responses from MQA studies.

V. CONCLUSION
In this paper, we propose a Generalised Score Distribution (GSD) class.It is a family of discrete distributions with: finite support, two parameters, and no more than one change in probability monotonicity.The distribution parameters are: ψ determining the mean, and ρ determining the spread of the responses.
We show the usefulness of the GSD class for modelling, with a special focus on the Multimedia Quality Assessment (MQA) field.The class is a convenient regularisation of the multinomial distribution.The GSD class has only two parameters and covers all possible first-and second-order moments for a distribution defined on a discrete finite support.We also evidence that the GSD class can be useful in testing problems using the parametric bootstrap technique.
The advantage of the GSD class is that its ρ parameter can be used to determine the type of the underlining process.With ρ close to 1, we know that the process is similar to the Bernoulli distribution.Likewise, for ρ < C(ψ) we know the process rather resembles the beta-binomial distribution.This information can be used as a diagnostic tool, answering the following question: "What is the spread of the responses?"Note that the GSD class can be easily used outside of the MQA field, wherever information about responses spread is relevant.
We strongly believe that the GSD class can be of use for modelling results similar in nature to those reported in the field of MQA.More specifically, the GSD can be potentially useful for modelling subjective responses, where the population of observers generally agree about a given trait of a stimulus presented to them (e.g., about the visual quality of a distorted image).To put this differently, the GSD will not likely work in cases where there are evident subgroups of observers.For example, when there are two groups that have opposing views on a stimulus trait, they are asked to assess.The only exception to GSD's inability of modelling opposing views is the so-called "love or hate" case.In this case, a significant proportion of the population of observers either scores a stimulus trait extremely high or extremely low (cf.Fig. 5, the GSD distribution with ψ = 2.85 and ρ = 0.38).
In the future research, we would like to add more parameters to the GSD class.One idea is to add a parameter related to a potential personal bias of each observer (similarly to what is done in [15] and [16]).This subject bias parameter may, for example, show whether a person is generally more optimistic than other raters are.Another interesting direction of research would be using the GSD for data other than that coming from MQA subjective experiments.We would like to collaborate with scientists working in different fields, from audio and image quality, through student performance assessment, and up to psychology and sociology.In all those fields, a proper modelling of the response generation process would help to gain new insights.Our results obtained for MQA subjective data might be treated as a proof of concept, showing that the GSD class may be of use for those other fields as well.Now observe that + ψ − 1 has the beta-binomial distribution BB(M − 1, α, β) with parameters .
Using formula for beta-binomial expectation value we obtain Using formula for beta-binomial variance we obtain Vmax(ψ)−Vmin(ψ) (see formulas ( 6), ( 7) and ( 10)) we have Proof of Proposition 2: In F ρ distribution case (see formula (12)) notice that the random variable is a mixture of random variables 1 and 2 , i.e., where and 1 , 2 , D are independent.If ψ is an integer P ( 1 = 0) = 1 so E( 1 ) = 0.In case ψ is not an integer we have In both cases E( 1 ) = 0. Now, observe that 2 + ψ − 1 has the binomial distribution B(M − 1, p) with p = ψ−1 M −1 .Therefore We have then Now, notice that V( 1 ) = V min (ψ) and V ( 2 ) = V Bin (ψ) (see formulas ( 6), ( 7) and ( 8)).Therefore We want to show that for every fixed ψ ∈ [1, M ], the variance V(U ) is the linear function of ρ equal to ρV min (ψ) + (1 − ρ)V max (ψ).Notice that the f ψ function is a linear function of variable ρ.Therefore, it is enough to show that Proof of Proposition 4: First notice that in the case of ρ ≥ C(ψ) Random variables Z i can be written as Z i = DX i + (1 − D)Y i where X i , Y i , D are independent zero-one random variables and To define ψ• (n) and ρ• (n) we introduce a limit for the maximum probability any two response categories can add up to (and call it p max ).Importantly, when assessing p max , we only take into account two most probable response categories.This can be formally written as follows: p max = max (i,j)∈{1,...,M } 2 :i =j The final algorithm for fitting the GSD to a sample is as follows.Find such ( ψ, ρ) that satisfies the following two criteria: 1) p max ≤ 1 − 1 n , where p max is given by equation (13) and n is the sample size, and 2) the likelihood function has the maximum value.An example of ψ and ρ ranges for different sample sizes and M = 5 is shown in Fig. 13.

APPENDIX E MULITIDIMENSIONAL CASE
Let us consider a multidimensional model of responses with n raters (also referred to as subjects) and m objects (also referred to as stimuli), i.e., U ij = ψ j + ij , i ∈ 1, ..., n, j ∈ 1, ..., m where ij + ψ j has the GSD(ψ j , ρ i ) distribution.In this model, every object (e.g., a video) has its own quality ψ j and every rater has their own confidence parameter ρ i .
For numerical experiments, we generated 100 000 response matrices U i,j according to the following generative process: ψ j ∼ Uniform(1, 5), ρ i ∼ Uniform(0, 1).For each sample a probabilistic matrix factorisation U ij ∼ GSD( ψj , ρi ) done by MLE yields recovered estimates ψj and ρi .In Fig. 14 we present the estimation accuracy for a fixed ψ of one object in a multidimensional numerical experiment for n = m = 12, 24, 50, 200.All other parameters where random (uniformly distributed).For estimation, we used the gradient based method, using the formulae for the gradient of GSD log-likelihood function from Appendix B. In Fig. 15 we present estimation accuracy for fixed ρ of one rater in the same multidimensional numerical experiment.As one can see, our multidimensional estimator of GSD parameters is very accurate even for relatively small sample sizes.Notice that in case of n = m = 200, we estimate 400 parameters using the MLE, i.e., we estimate ψ 1 , ..., ψ 200 , ρ 1 , ..., ρ 200 .The accuracy of the estimator is higher if both n and m are larger.

Fig. 2 :
Fig.2: Area of all possible (E(U ), V(U )) for discrete distributions on {1, ..., M } for M = 5.The area is divided into different parts covered by two-parameter discrete models: the beta-binomial distribution (green crosses), extended beta-binomial distribution (previous + blue vertical hatching), and our solution (covers all possible values including the area marked with red dots).

Fig. 5 :
Fig. 5: GSD distributions of U for M = 5 and for various values of ψ and ρ.

Fig. 8 :
Fig. 8: Root Mean Square Distance (RMSD) between the input ψ and the estimated ψ for different (ψ, ρ), different sample sizes, and M = 5.For every sample size we generated 500 000 samples to obtain the figure.

Fig. 9 :
Fig. 9: Root Mean Square Distance (RMSD) between the input ρ and the estimated ρ for different (ψ, ρ), different sample sizes, and M = 5.For every sample size we generated 500 000 samples to obtain the figure.

Fig. 10 :
Fig. 10: p-Value P-P plot for typical multimedia quality assessment (MQA) experiments.p-Values come from the G-test of goodness-of-fit applied to the GSD, ordered probit and Simplified Li2020 (SLI) models fitted to responses from 24 real-life subjective experiments.Ecdf stands for empirical cumulative distribution function.

Fig. 12 :
Fig.12: Histograms depicting the distribution of probability differences pGSD − pe for three different small sample sizes (i.e., 12, 24 and 50) and for the case of the corrected GSD being compared with the corrected empirical distribution.Blue-coloured parts of the bars represent statistically insignificant probability differences.

Fig. 14 :
Fig. 14: Estimation accuracy for ψ estimation for the multidimensional model with n subjects scoring m = n objects.

Fig. 15 :
Fig. 15: Estimation accuracy for ρ estimation for the multidimensional model with n subjects scoring m = n objects.