Mixture polarization in inter-rater agreement analysis: a Bayesian nonparametric index

In several observational contexts where different raters evaluate a set of items, it is common to assume that all raters draw their scores from the same underlying distribution. However, a plenty of scientific works have evidenced the relevance of individual variability in different type of rating tasks. To address this issue the intra-class correlation coefficient (ICC) has been used as a measure of variability among raters within the Hierarchical Linear Models approach. A common distributional assumption in this setting is to specify hierarchical effects as independent and identically distributed from a normal with the mean parameter fixed to zero and unknown variance. The present work aims to overcome this strong assumption in the inter-rater agreement estimation by placing a Dirichlet Process Mixture over the hierarchical effects' prior distribution. A new nonparametric index $\lambda$ is proposed to quantify raters polarization in presence of group heterogeneity. The model is applied on a set of simulated experiments and real world data. Possible future directions are discussed.


Introduction
In several contexts, decision-making relies heavily (or exclusively) on expert ratings, especially in situations where a direct quantification of quality of an object or a subject is either impossible or unavailable.Examples include applicant selection procedures, grading of student assignments in education, or risk evaluation in emergencies, all of which rely on observational ratings made by experts.For ease of exposition, throughout this paper we will refer to evaluation of students' work in an educational context as the primary example.To ensure consistency across different teachers, harmonization of marking criteria is often used to improve inter-rater agreement and homogeneity (Gisev et al., 2013;Gwet, 2008); however, discrepancies between grades assigned by different teachers may still persist (Bygren, 2020;Barneron et al., 2019;Makransky et al., 2019;Zupanc and Štrumbelj, 2018), reflecting each teacher's approach to evaluation.Therefore, statistical models that can capture inter-rater agreement (or disagreement) can shed light on heterogeneity between teachers and aid the mark moderation process (Bygren, 2020;Barneron et al., 2019;Crimmins et al., 2016).The specific context that we are considering in this work is the observational setting where a set of raters are evaluating different sets of items 1 out of a total population of items; these sets may be completely disjoint (i.e., each item is evaluated by exactly one rater).Each item is represented by a set of covariates, assumed to follow some distribution.Within a hierarchical statistical model, a common assumption is that raters (who may or may not include covariates) may each be characterized through a latent variable capturing e.g.whether an evaluator is generous or how they assess different aspects of the work.In the simplest setting, in an evaluation context where there is no space for subjectivity, these latent variables will be identical for all raters, in the sense that their view of the item is identical and as a result their evaluation style is assumed to be the same.However, it is well-known in many scientific fields, e.g., cognitive neuroscience (Barneron et al., 2019;Makransky et al., 2019;Briesch et al., 2014), statistics (Agresti, 2015;Gelman et al., 2013) and psychometrics (Bartoš et al., 2020;Nelson and Edwards, 2015;Hsiao et al., 2011), that individual variability in rating tasks (Wirtz, 2020) needs to be accounted for when aggregating or interpreting individual raters' recommendations.Existing works account for heterogeneity between raters through a latent variable within a mixed-effects model (Martinková et al., 2023;Bartoš et al., 2020;Nelson andEdwards, 2015, 2008).In other words, a regression model is used where the rating is modelled conditionally on covariates with a random effect that varies across raters.However, the distribution of the latent variable is typically assumed to be unimodal, and cannot capture eg.polarisation or clustering of rater types.The present work aims to extend these models to account for clustered variability between raters.Through a Bayesian approach, a Dirichlet process mixture prior is placed over the hierarchical rater effects in a linear model.This flexible prior naturally accommodates different clusters among raters (i.e., different distributions for the rater effects). 1Commonly referred to as subjects in the rating context.
A multiple-level model is specified in which observations (i.e., ratings) are nested within raters, and in turn these are nested within clusters.These clusters reflect distinct groups of raters in terms of their decision-making, and can be used to characterise the level of (dis)agreement.The level of multimodality (i.e., how separated the latent group densities are) quantifies the polarization of the latent groups.For instance, a large variance between teacher scores might be due to both the presence of two main divergent latent trends among them or to a high level of noise in their assessing (Koudenburg and Kashima, 2022).It is important to differentiate the two cases and quantify the group polarization both for theoretical and practical purposes.Differentiate systematic differences of opinion against high level of noise might be needed (Koudenburg et al., 2021) .They are two very different cases and much attention must be paid in distinguishing one another.The former is a case of high group polarization (Esteban and Ray, 1994): two different teachers clusters emerge with a small within-cluster variance and a large variance between different clusters.In the second case only one cluster emerge with a large variance.It might be argue that in the first case, even that the overall agreement might be quite low since there are two main different trends among raters, there might be a high agreement within the same trend (Tang et al., 2022).Assuming the latent agreement among raters as the degree of latent similarity in rating, an index regarding the polarization of the different possible groups of raters might be informative (Koudenburg and Kashima, 2022;Tang et al., 2022;Koudenburg et al., 2021).In this work we introduce a novel index to quantify the latent polarization among raters through the posterior distribution of the hierarchical effects (DiMaggio et al., 1996).It naturally derives from the nonparametric model and overcomes some strong assumptions (e.g., the number of latent groups, the ratings distribution) of the previous indices (Koudenburg et al., 2021;Esteban and Ray, 1994).This nonparametric index, referred to as λ index, is based on the shape of the posterior distribution of the hierarchical effects.It connects two different research lines: it relates the works on distribution polarization of opinions (Koudenburg and Kashima, 2022;Tang et al., 2022;Koudenburg et al., 2021) with those about the inter-rater agreement analysis (Martinková et al., 2023;Bartoš et al., 2020;Nelson and Edwards, 2015;Gisev et al., 2013;Gwet, 2008;Nelson and Edwards, 2008).The paper proceeds as follows: Section 2 is devoted to the general psychometric framework, the key concepts of inter-rater agreement, inter-rater reliability are introduced; the statistical model is specified in Section 3 and the adopted Gibbs sampler in Section 4; the novel rater similarity index is described in Section 5; simulation studies are reported in Section 6, as an illustrative example, a real data analysis is described in Section 8; it is followed by conclusion and future directions in Section 9.

Existing work in inter-rater agreement and hierarchical effects models
Several methods and statistical models that aim to account for inter-rater variability have appeared in the literature (Nelson and Edwards, 2015;Gwet, 2008;Cicchetti, 1976).Models such as the Cultural Consensus Theory (Oravecz et al., 2014), which explores individuals' shared cultural knowledge, have been proposed to capture unobserved agreement and similar trends in groups of raters (Dressler et al., 2015).Two related but different concepts have been introduced: inter-rater agreement and inter-rater reliability.The former refers to the extent to which different raters' evaluations are concordant (i.e, they assign the same value to the same item), whereas the latter refers to the extent to which their evaluations consistently distinguish different items (Gisev et al., 2013).In other words, while the inter-rater agreement indices quantify the observed concordance, the inter-rater reliability indices aim to quantify the consistency of their evaluations (e.g., despite assigning different values, the distinction among the items is the same).The present work focuses on latent agreement intended as homogeneity in the evaluators' point of view (Tang et al., 2022;Esteban and Ray, 1994).
A number of methods are available to quantify both inter-rater agreement and interrater reliability.Indices for pairs (Nelson and Edwards, 2008;McHugh, 2012) or multiple raters (Jang et al., 2018), for binary (Gwet, 2008), polytomous (Nelson and Edwards, 2015) or continuous (Liljequist et al., 2019) ratings are commonly used in different contexts.Recent developments using the framework of Hierarchical Linear Models (i.e., HLMs) provide a more accurate estimation of inter-rater reliability accounting for different sources of variability (Martinková et al., 2023).
Despite the popularity of work on this issue, less attention has been paid to possible latent similarities of the raters (Wirtz, 2020).From a psychometric point of view, it can be appealing to assess the extent to which different raters might be heterogeneous in their ratings (Martinková et al., 2023;Bartoš et al., 2020;Koudenburg et al., 2021;Casabianca et al., 2015;Nelson and Edwards, 2015;Gisev et al., 2013;DeCarlo, 2008;Gwet, 2008;Nelson and Edwards, 2008).There are certain situations in which the subjective opinion of the raters is very informative; as a simple example, the type of teachers' training or experience can be thought of as latent states which affect a range of evaluations differently (Childs and Wooten, 2023;Barneron et al., 2019;Bonefeld and Dickhäuser, 2018;Dee, 2005).Sometimes the major interest is not on the mere consistency between raters, but on their actual evaluation.For instance, in a selection process the actual students' scores are very relevant for their admission (Zupanc and Štrumbelj, 2018).Even if a strict standardization of teachers evaluation is not feasible, some statistical methods can tackle these issues.In all these contexts the assessment of uniformity among raters could be useful and would provide further information about the rating process.To this aim, existing work, e.g.Martinková et al. (2023); Nelson and Edwards (2015); Casabianca et al. (2015); Hsiao et al. (2011);Cao et al. (2010);DeCarlo (2008), adopts an hierarchical approach where correlations between ratings are naturally captured through an hierarchical Bayesian model.Each rater i = 1, .., I is assumed to be rating a different set of items J i ∈ J , J i ∩ J i+1 = / 02 .The rating y i j of the item j ∈ J i carried out by rater i = 1, .., I, is modelled as follows: Here x i j and z i j are, respectively, 1 × p and 1 × q vectors of distinct explanatory variables of rating y i j ; β is a p × 1 vector of non varying effects and u i is a q × 1 vector the hierarchical effects of rater i.
In the standard HLM formulation, the following distribution is specified for the rater effects: Where N q (•) stands for a q-variate normal distribution; Here 0 is a q × 1 zero vector and Σ Σ Σ is a q×q positive semi-definite covariance matrix.For the hierarchical normal linear model ε i j ∼ N(0, σ ε ), with u i and ε i j typically assumed independent.The distribution of each vector-valued hierarchical effects u i is then assumed to follow some distribution and captures variability across different raters.In the above mentioned example, y i j is the score given to student j ∈ J i 's essay by teacher i.Since an observational approach is adopted (i.e., each raters rates a different set of items), the effect of the student is not identifiable (each student is rated only by one rater).Assuming that students effects are i.i.d., their variance is added to that of the residuals.In the univariate case (i.e., when z i = 1, varying intercept model) the relevance of the raters effect u i ∼ N(0, σ 2 u ), where σ 2 u > 0 is the variance, might be quantified through the intraclass correlation coefficient (i.e., ICC): It is the ratio between the variance of the raters effect and the total variability of the model, i.e., the proportion of variance of the score due to the teacher, which reflects the correlation of two ratings given by the same rater.Smaller values of ICC indicate a small effect of the rater on the student's score.

Dirichlet Process Mixture and hierarchical effects
The HLM assumption regarding the distribution of the hierarchical effects is crucial in characterising different possible clusters or latent patterns of heterogeneity among raters (Dorazio, 2009).The common Gaussian assumption for the distribution of the these effects may obscure skewness and multimodality present in the data.A more flexible specification of the hierarchical effects distribution can help capture more complex patterns of variability.Models that account for skew-normal (Lin and Lee, 2008), skew-normal-cauchy (Kahrari et al., 2019), multivariate t (Wang and Lin, 2014), extreme values (McCulloch and Neuhaus, 2021) effects distributions have been proposed (Schielzeth et al., 2020).Nevertheless, they poorly account for the possible presence of multimodality in those distributions.In this regard, a mixture distribution has been proposed as a potential solution (Heinzl and Tutz, 2013;Kyung et al., 2011;Kim et al., 2006).Each mode can then correspond to a cluster with a similar pattern (e.g., the same deviation from the population mean).Several works have explored this issue in the past two decades (Villarroel et al., 2009;Tutz and Oelker, 2017).For instance, Verbeke and Lesaffre (Verbeke and Lesaffre, 1996) proposed a standard normal mixture distributions for the hierarchical effects.James and Sugar (James and Sugar, 2003) (Komárek et al., 2010) and generalized hierarchical linear (Komárek and Komárková, 2013) models.Despite the breadth of specifications for the mixture model, in all the aforementioned models, the number of mixture components needs to be specified.Although this may not be a critical assumption in certain contexts, it may be questionable or detrimental in settings with a lack of a priori information on the level of multimodality, especially in cases where the characterisation of the multimodality is of direct interest.
When the number of components of the mixture is unknown, a Dirichlet Process Mixture (hereafter DPM) for the hierarchical effects is a natural extension (Gill and Casella, 2009;Navarro et al., 2006;Verbeke and Lesaffre, 1996).This nonparametric extension allows the model to capture an unknown marginal distribution of the hierarchical effects through the Dirichlet Process (Antoniak, 1974;Ferguson, 1973).Modeling the hierarchical effect u i as an infinite mixture of some distribution family (e.g., Normal) enables the model to account for possible multimodality without specifying the number of mixture components.Some existing works adopted this nonparametric approach and pose a DPM prior over the hierarchical effects (e.g., Heinzl and Tutz (2013); Heinzl et al. (2012); Kyung et al. (2011)).
The HLM of Equation ( 1) is then specified in the same way as before through: The following hierarchical prior distribution is placed over the raters effects: where µ i and Q i are, respectively, the q × 1 a location parameter vector and the q × q positive semi-definite covariance matrix for the hierarchical effects u i of rater i = 1, . . ., I.Here ε i j ∼ N(0, σ ε ), i = 1, .., I, j ∈ J i ; u i and ε i j are assumed independent as before.
3.1 DPM as a generative process for the hierarchical effects Here, DP(α, G 0 ) is a DPM with α > 0 precision parameter and base measure G 0 .These specify the mixing distribution G (Heinzl and Tutz, 2013), so that each realization of G is almost surely a discrete probability measure on the space (Ω , F ) (Blackwell, 1973).Thus, since the DPM is a discrete generative process with nonzero probability of ties, some of the realizations might be identical to each other with probability determined by the precision parameter α.Therefore, specifying this hierarchical model on the components location parameters µ induces a clustering in the hierarchical effects (i.e., the raters) (Kyung et al., 2011); hierarchical effects belonging to the same c-th cluster with location parameter µ c are then independent and identically distributed.In other words, in the context of the HLM, the DPM specifies the component-specific location parameter µ c , so that each rater has each has their own unique hierarchical effects value u i (Heinzl and Tutz, 2013).
The DPM is a generative process commonly used in conjunction with a parametric family of distributions (e.g., Normal, Poisson), and the base measure parameter G 0 denotes this specified distribution.Thus, for any element where Dir(•) stands for the Dirichlet distribution, and G 0 defines the expectation of G 3 , therefore they have the same support.The parameter α, a multiplicative constant of the vector-valued Dirichlet parameter, determines the probability of a new realization of the process to be different of the previous ones (Blackwell and Mac-Queen, 1973).In other words, it governs the probability that the DPM generates a new cluster.Formally, the generative property of the DPM is that, for i = 1, ..., I, with I being for instance the total number of raters: the probability that the new I-th realization µ I of G assumes a different values than the previous ones is described by the well known Pólya Urn Model: with C ∈ N being the number of already observed distinct clusters among the realizations of G (i.e., the number of the different values of µ already observed, in other words the number of clusters) and r c counts the elements in the c-th cluster.
Basically, since G is a discrete probability measure, the C clusters represent different point masses (or different sets of point masses in the multivariate case) and r c is the frequency of each of them.Considering the conditional distribution of µ I as a mixture distribution, the probability that µ I is a new point mass sampled from G 0 is proportional to α, the probability that it is equal to the already observed c-th point mass is proportional to r c .In this notation, the role of α in sampling a new (not already observed) value of µ I (i.e., a new point mass, a new cluster) is interpretable.
To this regard, Sethuraman (1994) described a stick-breaking construction of the DP4 .In this formulation G is equivalent to: where δ is the Dirac measure on µ c and µ c iid of the infinite mixture result from the stick-breaking procedure as follows: with Be(•) indicating the Beta distribution and {ν c } ∞ c=1 being reparameterized weights.It is even more explicit in this construction that the random measure G is a mixture of point masses.The distribution of the random weights π (i.e., the probability of different allocation to the clusters) is governed through the stick-breaking process by the precision parameter α.Further details are given in the Appendix.In practice, one of the established approximations to the stick-breaking process is to truncate the infinite number of components to a large, finite value: for large enough value of R (Tutz and Oelker, 2017;Gelman et al., 2013).In summary, the hierarchical effects distribution considering a stick breaking construction of the DPM might be then specified as follow: With this nonparametric model specification, latent common tendencies among raters might emerge through the components of the model (Heinzl and Tutz, 2013;Heinzl et al., 2012;Kyung et al., 2011).The Bayesian approach allows us to characterize the shape of the distribution of the rater effects, as well as explore the effect of uncertainty on these (Gelman et al., 2013).For example, in an applied context, strict vs. accommodating are very common latent states that drive students' essays grading process (Zupanc and Štrumbelj, 2018;Briesch et al., 2014;Dee, 2005).

Prior specification
Several of the parameters in the model have conjugate prior distributions which allow easier computation.
• For the effects β the following hierarchical prior is assigned: for m = 1, ..., p, where p is the number of covariates associated to the effects β .
Here, IG(•) stands for inverse-gamma with shape parameters a β 0 > 0 and rate parameters b β 0 > 0.Where b 0 and S 0 are, respectively, the p × 1 vector of lo-cation parameters and the p × p positive semi-definite covariance matrix of b β (i.e, the location parameter vector of the non varying effect β ); b β and S β are, respectively, the p × 1 location parameter and the p × p positive semi-definite covariance matrix for β (i.e., the non varying effect).The set {b 0 , S 0 , a β 0 , b β 0 } of the hyperparameters are specified by the user.A diagonal matrix is suggested for S 0 as showed by Heinzl et al. (2012).
• A diagonal structure for the q × q prior covariance matrix Q r for the hierarchical effects is specified as follows for each mixture component r = 1, ..., R and each each related covariate d = 1, ..., q: For the base measure G 0 5 and the precision parameter α of the DP mixture model the following priors are specified: for d = 1, ..., q, where q is the number of covariates associated to the hierarchical effects u i , and for Here Ga(•) stands for Gamma distribution with c α 0 > 0 and C α 0 > 0 respectively the shape and the rate parameters.Where m 0 and W 0 are, respectively, the q × 1 location parameter vector and the q × q positive semi-definite covariance matrix of µ 0 (i.e. the location parameter of the base measure G 0 ); µ 0 and D 0 are, respectively, the q × 1 location parameter vector and the q × q positive semi-definite covariance matrix of the base measure G 0 .The set of the hyperparameters need to be fixed.A diagonal structure is suggested for W 0 as above.
• The following prior is assigned to the noise variance: with a ε > 0 and b ε > 0 hyperparameters fixed by the user as well. 5Assuming independence between the location and the scale parameters of each mixture component, and between all the scale parameters for each covariate d = 1, . . ., q, G 0 is then the product of the q-variate normal and the q inverse gamma distributions.

Posterior sampling
Since most of the parameters in the model have conjugate prior distributions, a blocked Gibbs sampling algorithm was used for the posterior sampling (Ishwaran and James, 2001) .The parameter vector for the model is θ = {β , b β , B β , µ 0 , D 0 , Q, σ ε , α, π, c} which is updated at each state of the Markov chain of the Gibbs sampling.Here c = (c 1 , . . ., c I ) is the allocation parameter of the raters to the clusters and.Further details on the following sampling are given in the Appendix.The closed-form marginal posteriors are as follows.
1. Update parameters referring to effects β : For each covariate m = 1, ..., p associated with a non varying effect 2. Update parameters referring to hierarchical effects: • For each rater i = 1, ..., I: where µ c i is the location parameter vector of the cluster where the i-th rater is allocated.
• For each component r = 1, ..., R of the truncated mixture : -If ∄i : c i = r (if no rater are currently allocated into cluster r), for each covariate d = 1, ..., q associated to an hierarchical effect (independently): -If ∃i : c i = r (if at least one rater assigned to component r), for each covariate d = 1, ..., q associated to an hierarchical effect (independently): Essentially, at each iteration t, if the r-th cluster is empty the component location parameters µ r are sampled from the prior as suggested by (Gelman et al., 2013), otherwise they are drawn from the above mentioned closed-form posterior.• Each rater i = 1, ..., I is re-allocated into a cluster: where Cat(•) stands for Categorical distribution, and ω * i is reported in the Appendix.A truncated approximation for the DPM mixture model was used (Gelman et al., 2013;Heinzl et al., 2012) for a large value of R. The stickbreaking construction was used to generate the mixture weights π 1:R .
• For each component r = 1, ..., R − 1: and v R = 1 for the last cluster.Here c r is the number of raters assigned to the cluster r, and r l is the number of raters assigned to the cluster l.
• The precision parameter is updated as follows: • For each covariate d = 1, ..., q associated with an hierarchical effect the base measure parameters are updated: where µ d is the mean of the location parameters related to the d-th covariate over all the clusters.
3. Update the error variance: Here |J | is the cardinality of the set of all the rated items J , it equals the number of observations.

The nonparametric λ index
The marginal posterior distribution of the hierarchical effects in the model outlined above captures information about the polarization or disagreement among raters (on the assumption that the model captures the data adequately).The ICC (i.e., intraclass correlation coefficient, (Martinková et al., 2023;Bartoš et al., 2020;Agresti, 2015;Gelman et al., 2013)) might adequately quantify inter-rater variability if the normal distributional assumption of the rater hierarchical effect holds.Two assumptions are made computing the standard ICC considering a normal distributed hierarchical effect.Firstly, that the raters are sampled from the same population.Secondly, that possible different latent trends among raters are not interesting or eventually regarded as disagreement ratings.This might be a good first approximation of the rating process.Nevertheless, when more detailed considerations are needed, or subtle heterogeneity among raters is expected, the standard ICC might be less informative and inaccurate.Besides the latter issue, further information about the shape of the posterior might be quantified.For instance, in presence of a bimodal hierarchical effects distribution with two very distant modes (for example, when opinions are polarised), considering the posterior distribution of σ u as an index of variability among raters might be misleading.
The strong assumptions behind their use limit them to be valid options only in the parametric context or when the number of clusters is known.A model based nonparametric index is here proposed to overcome these limitations.
To this end the full estimated distribution of u resulting from the model might be useful.At each iteration t, the density of u is given by the corresponding mixture model given the parameters at iteration t.Following the formulation of (Gelman et al., 2013) , the set of modes and antimodes (i.e., the lowest frequent value between two modes) is identified.When the distribution of u is multimodal, the latent polarization (disagreement) λ is then defined as the log ratio between the mean density of the modes and the that of the anti-modes, it is zero when it is unimodal: Where M is the number of modes γ m , m = 1, . . ., M and the number of antimodes

Simulation studies
The following simulations aim to evidence how the values of λ varying across different polarization settings.The first simulation investigates the role of the precision parameter α and the variance of the mixture components in determining the values of λ .The second one shows the complementary role of λ in the inter-rater agreement analysis and how this index varies across different settings.

Simulation setting
The first simulation study explores the role that the precision parameter of the Dirichlet Process and the variance of the components have in determining the values of the log-density index λ .For simplicity purpose the mixture components are assumed to have the same variance Q in this simulation, so the component subscription will be omitted.The objective is to study the effect of α and Q, on λ conditional on all the other variables.Since the former has a crucial role in the determination on the point masses of G, and thus the concentration of its realizations, an inverse relation between α and λ is expected if Q is fixed.Likewise, an inverse relation between Q and λ is expected if α is fixed.It is interpretable as an index of the sharpness of the modes.For this reason both the precision parameter of the DPM and the variance of its components are expected to have an effect on λ .Controlling for Q (i.e., keeping it fixed), the expected relation is: the smaller α, i.e. the precision of the DPM mixture, the larger λ , i.e. the relative density around the modes; controlling for α (i.e., keeping it fixed), the expected relation is: the smaller Q, i.e. the variance of the components of the DPM, the larger λ .The parameters of the base measures G 0 have a non-negligible role in determining λ , so in this section focus is devoted to the relation between the precision parameter α, the mixture components variance Q and the index λ .Indeed, in all the study simulations the values of the other parameters involved in the DPM have been kept fixed across the scenarios.

Data generating process
The experimental design is as follows.For 4 different values of α = (0.1, 1, 5, 20) and 2 different values of Q = (0.1, 1.5) a set of independent observations u = 1, . . ., n are drawn from the following DPM: Where µ c and π c are the location parameter and the mixing proportion of the component c, respectively; G 0 is the base measure; and ν c is the parameter of the stick-breaking.Following the above mentioned truncated stick-breaking construction, here R is the maximum number of observable cluster.Across the eighth scenarios the following quantities are assigned: the number of observations n = 500, the maximum number of clusters R = 50, the base measure G 0 : U(−6, 6).Here, U(•) stands for uniform distribution.The use of these distributions in the present experimental context aims to highlight the effect of different values of α and Q on λ in a more evident and interpretable manner.

Results
As shown in Tables 1 and 2 as α increases, and so the number of point masses of G increases as well, λ decreases.The density of the observations u = 1, . . ., n is concentrated around few point masses (few modes) for lower value of α and is spread out the larger.Note also the change of density of the antimodes.As expected, it is proportional to the precision parameter in a positive fashion.As the observations are more spread as α increases, there are fewer intervals in the support with relative small density: λ index decreases at larger values of α (column-wise Table 1  and Table2).A similar proportional relation is observed between the variance of the mixture components Q and λ when α is kept fixed (row-wise Table 1 and Table2).Smaller values of both α and Q result in a high polarized distribution of u = 1, . . ., n and correspond to larger values of λ .Whereas larger values of both α and Q result in a low polarized distribution and correspond to smaller values of λ .It is is an index of how spread the density is over the support of the hierarchical effects.
From an interpretative point of view, λ indicates the degree of overlap between the infinitely many clusters.It might be informative of the separation between them.
Since this quantification is based on a non-parametric density, λ is not directly related to the number of the group, or to the cluster location.It indicates the degree to which the independent observations drawn from a DPM overlap; the variance of the cluster Q also plays a crucial role.The index thus quantifies the combined effect of the parameters to assess the extent to which possible different opinions (i.e., the modes) might be strongly shared among the raters (i.e., the modes are sharp pick of density).To this regard, λ is a polarization index in presence of heterogeneity.The higher the polarization levels, the larger the values of the index.The practical interpretation and the operational decisions must be guided by the field of application.6.2 Simulation 2: Inter-rater agreement and λ

Simulation setting
The following simulation study aims to highlight the complementary role of λ as an additional summary metric in inter-rater agreement analysis.The varying intercept parametrization is hereafter adopted as univariate case for the raters effects.To this aim the standard modelling approach (i.e., the normal distributed varying intercept and the resulting ICC) is compared with the nonparametric proposed above (i.e., the DPM prior over the varying intercept and λ ).The experiment evaluates the performance of both standard ICC and λ in the presence of heterogeneity between raters' evaluations due to a multimodal distribution of the hierarchical effects.
Table 1 The

Data generating process
Three experimental scenarios were planned, in which a different clustering on the raters' intercept parameter was specified in the generative model.In each scenario the rater's intercept u i was generated from a bimodal Gaussian mixture.The location parameters of the mixture components were fixed across the scenarios, µ 1 = −3 and µ 2 = 3;whereas decreasing values (1, 0.5, 0.1) were assigned to the components scale parameters Q 1 and Q 2 (see Table 3).This resulted in different polarization scenarios.The mixture components were kept equiprobable (π c = 0.5), c ∈ {1, 2} throughout.The number of raters I = 100 and the number of items J = 250 were fixed across the scenarios.One continuous covariate x i j with an effect β = 2 was used and it was the same across the scenarios.
Table 3 True raters' hierarchical effects distribution across different scenarios.A Gaussian mixture is specified as distribution of the hierarchical effects u i = 1, . . ., I. The location parameters of two components of the mixture are kept fixed across the scenarios and decreasing values were assigned to the respective scale parameters.

Standard model approach
The following priors were specified for the standard hierarchical effect model (i.e., the varying intercepts are assumed to be i.i.d.normal distributed): for i = 1, ..., I; Exp(•) stands fro the exponential distribution and β is the nonhierarchical effect, σ u and σ ε are the hierarchical effect and the noise variances parameters, respectively.A logic of complexity penalization was used in the choice of the above mentioned priors distributions (Simpson et al., 2017).The posterior of each standard hierarchical effect model were sampled using NUTS-Hamiltonian MCMC in Stan language (Stan Development Team, 2022).

Nonparametric model approach
The set of priors introduce in section 4 were elicited for the DPM models with the following hyperparameters as suggested by (Heinzl et al., 2012): As result of some preliminary analysis, a dense grid of 481 equally-spaced values from -12 to 12 (i.e., with a fixed interval of 0.05) was used to monitoring the mixture density of the nonparametric varying intercept u i at each iteration.The posterior distribution of the nonparametric hierarchical effect is obtained as the set of the mean density of each point of the grid over the iterations (Gelman et al., 2013).In all the computations for both the models 55,000 iteration with 5,000 burn-in were used, the Markov chains were thinned the by a factor of 50, resulting in samples of size 1000 (Heinzl et al., 2012).

Results
As shown in Table 4 the standard model (i.e., that in which hierarchical raters intercepts are assumed to be i.i.d.normally distributed) due to the rigid distributional assumption of the hierarchical parameters is not able to capture the possible multimodal distribution and it resulted in a large value of the hierarchical effect variance σ u (see Table 4 and Figure 4 ).As a result, the ICC didn't capture almost any difference among the three different scenarios (see Table 4 and Figure 5).On the contrary, the DPM model, due to the flexible nonparametric specification of the intercepts prior, showed a good performance.As evident from Figure 4 and Table 4 the DPM model was far more able to reproduce the data generating process.The different mixture used to generate the data emerged clearly from the posterior of the grid adopted to monitoring u.Since the DPM model properly learn the multi-modalities of the raters intercepts density, the index λ , being based on the ratio between the mean density of the modes and that of the antimodes present in the grid at each iteration, showed to be able to differentiate the three different polarization scenarios.It gives some interesting information regarding the shape of the non-parametric mixture distribution.The 95% credible interval of λ (see Table 5 and Figure3) as estimated in the three different scenarios highlighted different degrees of amplitude and separation (i.e., different degrees of polarization) along them.The index λ is computed as a logarithm of the ratio of the average mode density against the density of the antimodes at each iteration t of the posterior sampler.So, in the first scenario, the values of the 95% credible interval are smaller, indicating that in most of the iterations the difference between the mean density at the modes and that of the antimodes was very small.In terms of the third scenario, λ assumed rather larger values along the iterations as evidence that the mode density is far larger than that of the antimodes.In other words, the rater clusters were separated and clearly distinct.The parameters of the DPM are the most influential with regard to λ .Specifically, the location parameters µ c , c = 1, . . ., R, and the scale parameters Q c , c = 1, . . ., R, showed to have a combined effect of the proposed index.The 95% HDP intervals of the parameters of both the DPM prior model and that with normal distributional assumption are reported in tables 6 and 7, respectively.6 95% crebible intervals of β , DPM and residuals related parameters.Here β is the non varying effect, b β and σ β are, respectively, the related location and scale hyperparameters; µ 0 and σ D 0 are the location and scale parameters of the base measure G 0 , respectively.The precision parameter α and the residuals standard deviation σ ε are also reported.
Scenario 1 Scenario 2 Scenario 3 β (1.99, 2.00) (1.98, 2.01) (1.99, 2.01) σ ε (0.60, 0.64) (0.62, 0.63) (0.62, 0.63)  7 Large scale performance assessment The evaluation heterogeneity of teachers is a long-standing issue in psychometrics (Uto, 2022;Shirazi, 2019;Bonefeld and Dickhäuser, 2018;Casabianca et al., 2015;DeCarlo, 2008).Highly biased scored might have a detrimental effect on students proficiency and education (Chin et al., 2020;Paredes, 2014;Cooper, 2003).The proposed nonparametric model and the index λ might be valuable tools to address this issue.They might help to shed light on very biased assessment contexts and to provide fairer scores.The estimated hierarchical effect of each teacher (which may be interpreted as the teacher's bias) might be used to adjust the observed score.The index λ might quantify teachers polarization in their grading.

The Matura data set
As an illustrative real data application, a large scale performance assessment data set was analysed (Zupanc and Štrumbelj, 2018).The DPM-model was applied to a large-scale essay assessment data obtained during the nation-wide external exami- nation conducted by the National Examination Centre in upper secondary schools in Slovenia also known as Matura and analyzed in Zupanc and Štrumbelj (2018).These data were related to the spring term argumentative essays for years between 2010 and 2014.Particular attention is devoted to the distinction between two main aspects of essay writing: the language correctness (i.e., the presence of grammatical or syntactic errors) and the the good argumentation of the content (i.e., a good and clear presentation of all the arguments).Regarding the data structure, students are nested within the teachers.So that each student's essay is evaluated by one trained teacher, who is asked to grade it concerning two different rubrics.An essay can receive a score between 0 and 20 for the language-related rubric and between 0 and 30 for the content-related one.Prior analysis of these data (Zupanc and Štrumbelj, 2018) revealed that heterogeneity among teachers was broadly down to two types: strict and lenient.The two different trends might be captured by the model and their polarization quantified by the λ index.For this reason N=2616 students' essays, each scored by one of I=18 different teachers, were considered for the analysis 6 .The objective of this application is to analyze teachers' individual differences in scoring the essay content, controlling for its language correctness.How lenient or strict they are in scoring the quality of an essay content, without the effect of the language correctness.The content score is commonly ways more susceptible to idiosyncrasies or biases of the teacher than the language-related score, which is generally more objective (Childs and Wooten, 2023;Zhu et al., 2021;Shirazi, 2019).Accordingly, the content-related score was specified as outcome variable and the language-related score as covariate with a non varying effect.A DPM hierarchical prior was specified over the teachers' intercepts.All the scores were re-scaled for this analysis to get a easier parameters value interpretation 7 (Gelman et al., 2013).

Results
The language-related score showed a posterior mean effect of 0.27 on the contentrelated score, with a (0.16, 0.38) 95% credible interval.The language correctness of the essay writing had moderate role in predicting the evaluation of the its content.As shown by Figure 6(a) the DPM-model learned the presence of two main trends from the data.The bimodal non-parametric distribution over the grid suggested that the teachers were rather heterogeneous in the essay scoring process.More precisely, they seemed to be slightly polarized around two main tendencies.Some teacher showed a slightly more lenient or stricter than the others (i.e., who had a larger or smaller hierarchical effect posterior mean, respectively), see Figure 6.The λ index showed a posterior mean of 1.87 which suggested a low polarization.The λ 95% HPD interval was (0.0, 6.83) which indicated a non negligible occurrence of quite high values of λ .All the other parameters credible intervals are reported in Table 8 Assuming this latent group polarization as a low latent agreement among raters, the λ index might be used in a diagnostic manner.Considering the present application, some solutions might be suggested for a fairer assessment process.Firstly, assuming a very negligible noise term, the teacher's estimated bias might be removed from the actual score.Another practical solution might be the implementation ad hoc training aimed to a much more shared point of view in essay scoring.Here β is the non varying effect, b β and σ β are, respectively, the related location and scale hyperparameters; µ 0 and σ D 0 are the location and scale parameters of the base measure G 0 , respectively.The precision parameter α and the residuals standard deviation σ ε are also reported. 7The following transformation was applied to standardize the score: f (x) = x−x σx , where x =

Conclusions
Most of the statistical models commonly used to analyze data from such observational contexts haven't shown to be very flexible to certain types of heterogeneity among raters.The common HLMs with a normal (or unimodal) distributional assumption for the hierarchical effects cannot capture any possible latent clusters, i.e. any multimodality.Indeed, the residual covariance modelled through the hierarchical effects might be informative about different latent similarities among raters.In this regard, incorporating a DPM in the prior of the hierarchical effects distribution is a flexible choice to address this issue.Consequently, the estimation of the agreement among the raters should take into account the possible multimodal distribution of the hierarchical effects.Interest might not be exclusively on the proportion of variance attributable to the hierarchical effects over the total variance (i.e., the main interpretation of the ICC); instead, it might be more appealing to explore the entire multimodal density.Since the DPM naturally accommodates clusters among hierarchical effects (i.e., among raters), it is natural to consider the extent to which the mixture components are separated.Since λ is based on the density approximated through the grid approach f (u), it reflects both the clustering induced by the Dirichlet process and the variance of the mix-ture components.Due to the particular information carried by λ it might be more informative about the latent agreement among raters than the solely ICC.The latter is very useful when the normal distributional assumption of the hierarchical effects holds.However, in the presence of multimodality the estimate of the variance of the hierarchical effect σ u is not accurate (it might be over-estimated) and the related ICC might be non-informative.
In contexts in which strong beliefs about the exact number of cluster are present or it is supported by some sort of evidence, an hierarchical model with a prior finite mixture distribution over the hierarchical effects is expected to have comparably good performance as well.The parametric variance of a mixture might be take into account in the ICC formula in these cases.For the above mentioned reasons, added flexibility and the shrinkage property the DPM was here preferred.
Many other studies are needed to fully understand the performance of λ across different combinations of the Dirichlet process parameters.Future works might highlight the role of λ when the rating is either expressed on a dichotomous or on a polytomous scale.Further studies might highlight the computation of λ when multivariate hierarchical effects are specified.Comparisons between this index and the others widely used in these cases Tang et al. (2022);Forchheimer et al. (2015); Zhang et al. (2003) might be a focus of future studies.Further application of λ in a non-parametric context might be studied (Canale and Prünster, 2017).Zupanc, K. and Štrumbelj, E. (2018).A bayesian hierarchical latent trait model for estimating rater bias and reliability in large-scale performance assessment.PLOS ONE, 13(4):1-16.

Remarks for multiple ratings
When raters rate the same set of items J i = J , i = 1, . . ., I a varying intercept can be identified for each item (Martinková et al., 2023;Agresti, 2015;Nelson and Edwards, 2015).These term might be added to equation 1 (which is the same in both the standard and nonparametric formulation): In both the standard HLM (i.e., assuming a multivariate normal distributed hierarchical rater effect) and the nonparametric HLM (i.e., specifying a DPM over the rater effect) the following distribution might be specified: where σ δ > 0 is the scale parameter of δ and J = |J |.See Section 2 and 3 for the other quantities and their distribution assumption.Specifying a conjugate prior for σ δ additional steps might be added to the Gibbs sampling for the nonparametric HLM.
The main results of the present work and the interpretation of λ (see Section 5) still hold for this model specification.

Details on Dirichlet Process Mixture
As noticed above, α is proportional to the concentration of the realizations of G in point masses.Indeed, considering the partition (A, A c ) of Ω , the variance of G(A) is defined as Thus, larger values of α, conditioning on the number of raters I, reduce the variability of the DP, i.e. the process samples most of the time from G 0 , G tends to be an infinite number of point masses: the empirical distribution of G tends to become a discrete approximation of the parametric G 0 .In this case there is no a strong clustering since the probability of ties is very low.On the contrary, smaller values of α induce a strong clustering, the random weights distribution concentrate the probability mass to few points of the support of G and the probability of ties is higher.
Which in the present model means that several u i will be independent and identically distributed from a normal distribution indexed by the same parameters.Moreover, Antoniak (Antoniak, 1974) demonstrated that where C is the number of clusters.Thus, the expected number of point masses of G is proportional to both the α and the number of raters I. Every consideration regarding the role of the precision parameter on the distribution of G should be conditioned to I.

Details on the Gibbs sampling
Further details regarding some parameters of the posterior sampling are showed as follow.
1. Referring to the non varying effects: 2. Referring to hierarchical effects: • For each rater i = 1, ..., I: Here µ c i is the location parameter vector of the cluster where the rater i is allocated.
• For each component r = 1, ..., R and each variable d = 1, ..., q, associated with an hierarchical effect: Here u dr is the mean of the d-th hierarchical effect in the cluster r.

Fig. 1
Fig. 1 Different values of λ indicate different polarization levels.Three different values of λ were computed for three different mixture distributions, respectively.The realizations of these distribution are here referred to as u.Black dotted lines indicate the mean mode density, red dotted lines indicate the mean antimode density.(a) High polarization: the mixture components are highly and clearly separate, the mean density values of the modes is far larger then the mean density value of the antimodes; the log-density ratio between these two quantities is λ = 2.55 (b) Medium polarization: the mixture components are clearly separated, but the mean density values of the modes is closer to the mean density value of the antimodes; the log-density ratio between these two quantities is λ = 0.81 (c) Low polarization: the mixture components are not clearly separated, the mean density values of the modes is very close to the mean density value of the antimodes; the log-density ratio between these two quantities is λ = 0.19.(d) No polarization: the mixture distribution has only one mode (i.e., γ 1 ) and λ = 0 since the number of mode is not greater then one.
eight scenarios correspond to a DPM with different values of the precision parameter α and the components variance Q.Each scenario correspond to a specific combination of these two parameters.All the other quantities are fixed across the scenarios.The realizations of the DPM are here indicated as u = 1, . . ., n. Different combinations of α and Q result in different values of λ .For fixed values of Q (column-wise), a proportional relation is shown between α and λ : when the first increases, the second decreased.Similarly, for fixed values of α (row-wise), a proportional relation is shown between Q and λ : when the first increases, the second decreased.Smaller values of both α and Q result in a high polarized distribution of u = 1, . . ., n and correspond to larger values of λ .Whereas larger values of both α and Q result in a low polarized distribution and correspond to smaller values of λ .

Fig. 6
Fig. 6 (a) 95% HPD of the monitoring grid for the teachers' hierarchical effects u = 1, . . ., I (b) Posterior mean of the hierarchical effect of each teacher.Two different clusters emerged from the analysis, as expected: the more lenient (to the right-hand side) and the stricter (to the left-hand side).(c) λ ' 95% credible intervals.It indicate a moderate polarized posterior distribution of the posterior hierarchical effects.

Fig. 7
Fig. 7 Trace Plots of some parameters from the second scenario.As it is shown they all converge properly.
M − 1 of the density of u; f u (•) denotes the density at a specific point.Larger values of λ indicate strongly multimodal distribution of the hierarchical effects, whereas smaller values are evidence of weak multimodality, thus the estimated hierarchical effects are less concentrated.As it is shown in Figure 1 larger values of λ indicate distribution polarization, whereas smaller values indicate a less concentrated and more spread density distribution.The λ index is strongly affected by both location and scale parameters of the mixture components.For this reason it might be very informative in presence of multimodal distributions.Assuming such a raters' group polarization as a result of low latent agreement among raters, the λ index might be a useful diagnostic tool.

Table 2
Parameters values at each scenarios.Each of them correspond to a DPM with different values of the precision parameter α and the components variance Q.Both α and Q have an effect on the distribution polarization of the realizations of the DPM.As a result different values of λ are observed.

Table 4
95% HPD intervals of the hierarchical effects variance σ u from the standard models (i.i.d.normal distributed varying intercepts) and of the Grid density of the hierarchical effect in DPM models.

Table 5
95% HPD intervals of the ICC from the standard models (i.i.d.normal distributed varying intercepts) and of λ from the DPM models.

Table 7
95% HPD intervals of the other parameters of the standard models (i.i.d.normal distributed varying intercepts)