Measures of Agreement with Multiple Raters: Fréchet Variances and Inference

Most measures of agreement are chance-corrected. They differ in three dimensions: their definition of chance agreement, their choice of disagreement function, and how they handle multiple raters. Chance agreement is usually defined in a pairwise manner, following either Cohen’s kappa or Fleiss’s kappa. The disagreement function is usually a nominal, quadratic, or absolute value function. But how to handle multiple raters is contentious, with the main contenders being Fleiss’s kappa, Conger’s kappa, and Hubert’s kappa, the variant of Fleiss’s kappa where agreement is said to occur only if every rater agrees. More generally, multi-rater agreement coefficients can be defined in a g-wise way, where the disagreement weighting function uses g raters instead of two. This paper contains two main contributions. (a) We propose using Fréchet variances to handle the case of multiple raters. The Fréchet variances are intuitive disagreement measures and turn out to generalize the nominal, quadratic, and absolute value functions to the case of more than two raters. (b) We derive the limit theory of g-wise weighted agreement coefficients, with chance agreement of the Cohen-type or Fleiss-type, for the case where every item is rated by the same number of raters. Comparing three confidence interval constructions, we end up recommending confidence intervals based on the arcsine or Fisher transforms. Supplementary Information: The online version contains supplementary material available at 10.1007/s11336-023-09945-2.


Introduction
The most popular measures of inter-rater agreement involve correction for chance agreement. These can be written in the form

κ = (p_a − p_ca)/(1 − p_ca) = 1 − p_d/p_cd,

where p_a (p_d) is the percentage agreement (disagreement) between the raters and p_ca (p_cd) is the chance agreement (disagreement) between the raters. Such measures are frequently called chance-corrected measures of agreement. Well-known examples of coefficients in this class are Cohen's (1960) kappa and its weighted variant (1968), its multi-rater variant Conger's kappa (Conger, 1980; Light, 1971), Krippendorff's (1970) alpha, Scott's (1955) pi, and Fleiss' (1971) kappa. Some of these coefficients are defined only for two raters. The rest are defined in a pairwise manner, in the sense that they measure agreement between two raters at a time. However, not every proposed measure of agreement is defined on pairs of raters. The most famous is Hubert's kappa (1977), which was recently studied in detail by Martín Andrés and Álvarez Hernández (2020).

PSYCHOMETRIKA
There is no consensus on how multi-rater agreement coefficients should be defined. Broadly speaking, two options are considered: pairwise coefficients and consensus coefficients. The pairwise coefficients measure the agreement between pairs of raters (Conger, 1980), while the consensus coefficients measure the simultaneous agreement between all raters. In particular, consensus coefficients support the notion that "agreement occurs if and only if all raters agree on the categorization of an object" (Hubert, 1977). Both pairwise and consensus-based definitions of agreement are variants of g-wise measures of agreement (Conger, 1980), where agreement is measured among g-tuples of raters. The case where 2 < g < R has received little attention in the literature (Warrens, 2012), and non-trivial ways to measure agreement are hard to invent in this case. Here, we introduce a promising and general framework for handling g-wise measures of agreement based on the concept of Fréchet variances (Dubey & Müller, 2019). The Fréchet variances generalize the variance, and the measures of agreement based on them generalize the nominal, linearly weighted, and quadratically weighted pairwise measures of agreement in a natural way. They are easily interpretable: one measures how much the raters disagree with a generalized mean rater and then adjusts for chance. For nominal data in particular, they measure how many raters disagree with the modal rater, yielding an agreement measure less extreme than Hubert's kappa.
We need inferential theory for the g-wise agreement coefficients to make them useful. Much work has been done on inference for agreement coefficients, but, to our knowledge, inference for g-wise agreement coefficients has yet to be studied. Assuming multivariate normality of the ratings, Lin (1989, Section 3) derived the asymptotic distribution of Cohen's kappa with quadratic weights. Fleiss (1971) introduced a formula for the standard error of Fleiss's kappa, but it was later shown to be incorrect. Using the properties of the multinomial distribution and the delta method, Schouten (1980) found the asymptotic variance of the weighted Fleiss's kappa in the case when the number of categories is finite. Almost forty years later, Gwet (2021) found a consistent estimator of the variance of the unweighted Fleiss's kappa. We extend these results to the weighted g-wise Fleiss's kappa for any number of categories below. In addition, we mention that bootstrap inference for Fleiss's kappa and Krippendorff's alpha was studied by Zapf et al. (2016).
We begin the paper by providing the definitions of two kinds of chance-corrected agreement coefficients. Then, in Sect. 2, we establish connections between the multi-rater Cohen's kappa, Fleiss's kappa, Conger's kappa, Krippendorff's alpha, and Hubert's kappa. We restrict ourselves to the context where every rater rates every item. In Sect. 3, we discuss the Fréchet variances mentioned above. Then we spell out the basic limit theory for this class of agreement coefficients in Sect. 4, extending the results of Schouten (1980), Schouten (1982), and O'Connell and Dobson (1984) to vector-valued items and g-wise coefficients. We do this using the theory of U-statistics (Lee, 2019), but there are other ways to arrive at the same results. Then, in Sect. 5, we provide practical recommendations regarding the choice of confidence interval, obtained by comparing three confidence interval constructions: basic, arcsine transformed, and Fisher transformed. Using a simulation study, we find that the arcsine and Fisher intervals outperform the basic interval when n is small.

Measures of Agreement
Let d(x_1, ..., x_g) be a disagreement function, a positive and symmetric function of g arguments that equals 0 when all x_i are equal, i.e., d(x, ..., x) = 0. The disagreement function quantifies the disagreement between the ratings x_1, ..., x_g, where 0 is understood as complete agreement.
Most disagreement functions take two arguments. While there are infinitely many disagreement functions, the best-known belong to the class of l_p quasi-norms, p = 0, 1, 2, potentially raised to the pth power. The l_p quasi-norms, p ∈ [0, ∞], in R^k are defined as

||x||_p = (Σ_{i=1}^k |x_i|^p)^{1/p} for p ∈ (0, ∞), ||x||_0 = Σ_{i=1}^k 1[x_i ≠ 0], ||x||_∞ = max_i |x_i|, (2.1)

as can be verified by taking the limit of ||x||_p as p → 0 and p → ∞, respectively. It is well known that ||x||_p are proper norms if and only if p ≥ 1, as the triangle inequality is violated when 0 ≤ p < 1. Now define the disagreement functions d_p as the l_p quasi-norm evaluated in x_1 − x_2, i.e., d_p(x_1, x_2) = ||x_1 − x_2||_p. In the case of scalar values, d_0 is the nominal disagreement 1[x_1 ≠ x_2], d_1 is the absolute value disagreement |x_1 − x_2|, and d_2^2 is the quadratic disagreement (x_1 − x_2)^2. Vector-valued variants of d_p and d_p^p are much less common, but have been used by, e.g., Berry et al. (2008).
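For concreteness, the quasi-norms and the derived disagreement functions can be sketched in a few lines of Python (the function name `d_p` is ours; NumPy is assumed):

```python
import numpy as np

def d_p(x1, x2, p):
    """l_p disagreement between two ratings (scalars or vectors).

    p = 0 counts the coordinates where the ratings differ (nominal),
    p in (0, inf) is the usual l_p norm of the difference, and
    p = inf is the maximum absolute difference.
    """
    diff = np.atleast_1d(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))
    if p == 0:
        return float(np.sum(diff != 0))
    if p == np.inf:
        return float(np.max(np.abs(diff)))
    return float(np.sum(np.abs(diff) ** p) ** (1.0 / p))

# Scalar special cases: nominal, absolute value, and quadratic disagreement.
nominal = d_p(1, 2, 0)        # 1.0: the ratings differ
absolute = d_p(1, 3, 1)       # 2.0
quadratic = d_p(1, 3, 2) ** 2 # 4.0: d_2 squared
```

The vector case works the same way: `d_p([1, 2], [1, 5], np.inf)` returns the maximal coordinatewise difference.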
When the dimension of the disagreement function d is not equal to 2, we are mostly interested in the case where its dimension equals the number of raters R. In this case, the disagreement functions often measure the degree of consensus among the raters, with 0 reflecting complete consensus. The most obvious choice is the Hubert disagreement function, which equals 0 if and only if every rater agrees on a rating. This disagreement function is employed in Hubert's kappa (Hubert, 1977). We present our results in terms of disagreement functions instead of the more popular agreement functions (i.e., positive symmetric functions bounded by 1, where 1 signifies maximal agreement, sometimes with the additional assumption that a ≥ 0). We do this mainly for mathematical convenience. Agreement functions and disagreement functions are closely related, for if a is an agreement function, then d = 1 − a is a disagreement function. Our results could have been framed in terms of agreement functions instead, though with some loss of generality. See Appendix (Sect. 6) for a short discussion.
Our results and definitions are framed in the following setup. Let R be the number of raters and n the number of items rated. Moreover, let F be a fixed multivariate distribution function so that all rating vectors X_i are sampled independently from F. In symbols,

X_1, X_2, ..., X_n ∼ F, independently. (2.4)

There are no restrictions on the rating vector components X_ir. They can be, e.g., categorical, real numbers, or vectors. Equation (2.4) implies that every item is rated by exactly the same number of raters, which we refer to as the rectangular design assumption. The assumption is common in the literature, but far from universal. It can be relaxed, but it is strictly required for the limit results. We sketch how to loosen it in Appendix (Sect. 6), but we have made no attempts at an inferential theory for non-rectangular designs.
There are two important special cases covered by equation (2.4). First, in the case of fixed raters, the same set of ordered raters rates every item. Having fixed raters is common in applications of Cohen's kappa, Conger's kappa, and the concordance correlation coefficient. Having fixed raters ensures that F does not vary across different rating vectors, but F could potentially vary with the ratings when the raters are not fixed, provided we do not make further assumptions. And that leads us to the second case, that of exchangeable ratings given the item. Here, the rater identities do not affect the ratings given. The raters may be different for each item, but the distribution F will still be fixed. Exchangeable ratings occur when the ratings are identically distributed conditional on the item rated. Exchangeability is an implicit assumption underlying most applications of Fleiss' kappa, e.g., that of Fleiss (1971). In this case, the marginal distributions for all raters will be equal, which implies that the population value of the generalized Fleiss kappa equals the population value of the generalized Cohen's kappa, both defined below. However, the sample Fleiss's kappa is the preferred sample estimator, as it is invariant under changes of the raters' identities.
We intend to collect the kappas of Cohen, Fleiss, Conger, Hubert, and so on, into a coherent framework of g-wise agreement coefficients. To do this, we will have to define some quantities. Let x_i = (x_i1, x_i2, ..., x_iR) be an R-dimensional vector of observed ratings, and recall that g is the dimension of our disagreement function d. The following definitions are natural population counterparts of sample definitions prevalent in the agreement literature.
(i) The disagreement at x_1, as measured by d. The purpose of this quantity is to translate an arbitrary g-dimensional disagreement function d into a disagreement function taking an R-dimensional vector x_1 as input. It is defined as

D_d(x_1) = (R choose g)^{-1} Σ_{r_1 < ... < r_g} d(x_{1r_1}, ..., x_{1r_g}),

where the sum runs over all g-dimensional subsets of {1, ..., R} with order ignored, i.e., the g-combinations of R. The expression is simplified when g = R, as D_d(x_1) = d(x_{11}, ..., x_{1R}) in this case. To gain some intuition about this quantity, suppose that g = 2, that the ratings are scalars, and consider the nominal disagreement function d_0(x, y) = 1[x ≠ y]. Then D_{d_0}(x_1) is the percentage of times two distinct raters disagree on their rating.

(ii) The Cohen-type chance disagreement at x_1, ..., x_g, so called to differentiate it from the Fleiss-type chance disagreement. It is similar to the disagreement at x_1, but this time the raters do not necessarily rate the same item, as one rater rates the first item (from x_1), another rater rates the second item (from x_2), and so on. We do not allow a rater to rate the same item more than once in a pass: hence, we need to choose g distinct raters from the set of R raters, and the chance disagreement is

C_d(x_1, ..., x_g) = ((R − g)!/R!) Σ_{(r_1, ..., r_g)} d(x_{1r_1}, ..., x_{gr_g}),

where the sum runs over all ordered g-tuples of distinct raters. The Fleiss-type chance disagreement at x_1, ..., x_g is similar, but allows the same rater to rate an item multiple times. Its definition is

F_d(x_1, ..., x_g) = R^{-g} Σ_{(r_1, ..., r_g) ∈ {1, ..., R}^g} d(x_{1r_1}, ..., x_{gr_g}),

where the sum runs over the product set {1, ..., R}^g. We will call the expected values of these quantities the mean disagreement, the mean Cohen-type chance disagreement, and the mean Fleiss-type chance disagreement. Slightly abusing notation, we denote them as

D_d = E[D_d(X_1)], C_d = E[C_d(X_1, ..., X_g)], F_d = E[F_d(X_1, ..., X_g)], (2.8)

where X_1, ..., X_g are independently sampled from the same distribution F. Discussions of the two notions of chance agreement, and why to prefer one over the other, are abundant in the literature, often in the context of the so-called paradox of kappa (Cicchetti & Feinstein, 1990).
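A minimal numeric sketch of these three quantities for g = 2 and scalar ratings may help; the function names `D_d`, `C_d`, and `F_d` are ours, and the code follows the verbal definitions above rather than any reference implementation:

```python
import itertools

def D_d(x, d):
    """Disagreement at a rating vector x: the average of d over all
    g-subsets of the R raters (here g = 2, so over unordered pairs)."""
    pairs = list(itertools.combinations(range(len(x)), 2))
    return sum(d(x[r], x[s]) for r, s in pairs) / len(pairs)

def C_d(x1, x2, d):
    """Cohen-type chance disagreement: the two ratings come from two
    different items, and the two raters must be distinct."""
    R = len(x1)
    pairs = [(r, s) for r in range(R) for s in range(R) if r != s]
    return sum(d(x1[r], x2[s]) for r, s in pairs) / len(pairs)

def F_d(x1, x2, d):
    """Fleiss-type chance disagreement: as Cohen-type, but the same
    rater may appear twice (the sum runs over the full product set)."""
    R = len(x1)
    return sum(d(x1[r], x2[s]) for r in range(R) for s in range(R)) / R**2

nominal = lambda a, b: float(a != b)
```

For the rating vector `x = [1, 1, 2]`, `D_d(x, nominal)` is 2/3: two of the three rater pairs disagree.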
Definition 1. Let X ∼ F be a vector of R ratings and d be a disagreement function with dimension g. Define the population values of the generalized Cohen's kappa (κ_d) and Fleiss's kappa (π_d) as

κ_d = 1 − D_d/C_d, π_d = 1 − D_d/F_d. (2.9)

The generalized Fleiss's kappa, denoted as π_d since it generalizes Scott's pi (Scott, 1955), is a straightforward generalization of the Fleiss kappa (1971) to hold for 2 < g ≤ R. When g = R and d is the nominal disagreement, it equals Hubert's kappa. Likewise, the generalized Cohen's kappa is an extension of the weighted Conger's kappa to hold for 2 ≤ g ≤ R. When g = R, it equals the Schuster-Smith coefficient (Schuster & Smith, 2005, eq. 1). It generalizes several other agreement coefficients as well. For instance, Berry and Mielke (1988) discussed what we call κ_d for Euclidean weights between vector-valued ratings, while Janson and Olsson (2001) extended it to squared Euclidean and nominal weights. The relationship between most of the mentioned agreement coefficients is summarized in Table 1.

Sample Estimates
Let X_1, ..., X_n ∼ F be n iid vectors of ratings. Then there is a single natural sample estimator of D_d, namely

D̂_d = n^{-1} Σ_{i=1}^n D_d(X_i). (2.10)

There are, however, two natural estimators of the Cohen-type chance disagreement: one of them a V-statistic (Lee, 2019, Chapter 4.2) and the other a U-statistic (Lee, 2019, Chapter 1),

Ĉ_d = n^{-g} Σ_{i_1, ..., i_g} C_d(X_{i_1}, ..., X_{i_g}), Ĉ^u_d = (n choose g)^{-1} Σ_{i_1 < ... < i_g} C_d(X_{i_1}, ..., X_{i_g}), (2.11)

where the first estimator runs over all combinations with repetitions of i_1, i_2, ..., i_g and the second estimator runs over the unordered combinations i_1 < i_2 < ... < i_g. Using the basic results of U-statistics (Lee, 2019, Chapter 1), we see that Ĉ^u_d is the unique minimum-variance unbiased estimator of C_d, which makes it attractive from a theoretical point of view. However, from a well-known correspondence between U-statistics and V-statistics, the asymptotic distribution of Ĉ_d coincides with the asymptotic distribution of Ĉ^u_d (Lee, 2019, Chapter 4, Theorem 1), so the choice between Ĉ_d and Ĉ^u_d barely matters when n is sufficiently large. Likewise, there are two natural estimators F̂_d and F̂^u_d of the Fleiss-type weighted chance agreement, where the index sets are as described above. Now we can define two sample variants of Cohen's kappa (Fleiss's kappa), depending on which of Ĉ_d and Ĉ^u_d (F̂_d and F̂^u_d) we employ. The classical sample definition of the weighted Cohen's kappa (Cohen, 1968) agrees with κ̂_d, not with κ̂^u_d. Likewise, the sample Fleiss's kappa has a definition agreeing with π̂_d (Fleiss, 1971). Moreover, due to the possibility of binning data, π̂_d and κ̂_d are faster to compute when the data is not continuous. Since the estimators are asymptotically equivalent in any case, we will stick to the V-statistics κ̂_d and π̂_d for estimation, but use the U-statistic form when deriving limit distributions. We note that, since we need to compute strictly fewer combinations, κ̂^u_d and π̂^u_d are faster to compute when the data is continuous, which may be useful in some settings.
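The V-statistic estimators and the resulting sample kappas can be sketched as follows for g = 2 (a brute-force illustration under our notation, not an efficient implementation; the function name is ours):

```python
import itertools
import numpy as np

def sample_kappas(X, d):
    """V-statistic estimates of the generalized Cohen and Fleiss kappas
    for pairwise (g = 2) disagreement; X is an (n, R) array of ratings."""
    n, R = X.shape
    # Mean disagreement: average d over distinct rater pairs within items.
    pairs = list(itertools.combinations(range(R), 2))
    D_hat = np.mean([np.mean([d(x[r], x[s]) for r, s in pairs]) for x in X])
    # V-statistic chance terms: all ordered item pairs, including i = j.
    C_terms, F_terms = [], []
    for i in range(n):
        for j in range(n):
            C_terms.append(np.mean([d(X[i, r], X[j, s])
                                    for r in range(R) for s in range(R) if r != s]))
            F_terms.append(np.mean([d(X[i, r], X[j, s])
                                    for r in range(R) for s in range(R)]))
    C_hat, F_hat = np.mean(C_terms), np.mean(F_terms)
    return 1 - D_hat / C_hat, 1 - D_hat / F_hat  # (Cohen, Fleiss)
```

With perfectly concordant ratings, e.g. `X = np.array([[1, 1], [2, 2], [3, 3]])` and nominal disagreement, both kappas equal 1.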

Fréchet Variances for g-Wise Agreement Coefficients
The most popular measures of agreement are defined only for g = 2. It is easy to find reasonable disagreement measures in this case, as one can draw on the extensive literature on norms and distances. The l_p distances are the obvious choices, but there are many unexplored options, such as the Huber loss (Huber, 1964) and the LINEX loss (Varian, 1975).
In the setting of Hubert's kappa and the Schuster-Smith coefficient, we have g = R, and it is not that easy to find reasonable disagreement functions anymore. The disagreement function used in Hubert's kappa, d(x_1, ..., x_R) = 1 − 1[x_1 = x_2 = ... = x_R], penalizes any number of discordant ratings equally, yielding the often undesirable outcome that most sets of ratings will be in complete disagreement. But there are less sensitive ways to count nominal disagreements. Consider the case of 10 raters rating on an ordinal scale from 1 to 3, with 7 raters giving rating 1, 2 giving rating 2, and 1 giving rating 3. Then Hubert's disagreement rating is 1, as the rating vector is not constant, and the pairwise disagreement is 46/100. But it sounds reasonable to pick the modal rating (in this case 1) and then report the number of raters that disagree with it, divided by the number of raters. In this case, the number of raters disagreeing with the modal rating is 3, and the "modal" disagreement equals 3/10.
Sometimes we wish to aggregate numerical ratings instead of categorical ratings. Consider the above case again but with the median (which is 1) instead of the mode. It is well known that the median of a vector x is equal to argmin_m Σ_{r=1}^R |x_r − m|, so the mean absolute deviation from the median appears to be a reasonable measure of the mean disagreement when we use the median as the aggregation method. The resulting mean disagreement of the previous example is min_m (1/10) Σ_{r=1}^{10} |x_r − m| = (2 · 1 + 1 · 2)/10 = 4/10. The "modal" and "median" disagreement measures are instances of an intuitive generalization of the variance called the Fréchet variance (Dubey & Müller, 2019). Let l be a distance function satisfying l(x, y) ≥ 0 and l(x, x) = 0, and let A = {x_1, x_2, ..., x_R} be a set of points. The sample Fréchet mean of A is defined as the (not necessarily unique) point μ_l that minimizes the sum of distances to all points in A, that is,

μ_l = argmin_y Σ_{r=1}^R l(x_r, y).

Similarly, the sample Fréchet variance on A with distance function l is

V(l) = min_y (1/R) Σ_{r=1}^R l(x_r, y).

The Fréchet mean (Fréchet, 1948) is a generalization of centroids to arbitrary distance functions l; likewise, the Fréchet variance is a generalization of dispersion to any such distance function. They are best understood through a decision-theoretic lens: the Fréchet mean of A represents your best guess of the true classification or value of an item according to the distance l; the Fréchet variance V(l) is the decision-theoretic risk associated with that choice. See Cooil and Rust (1994) for an investigation of a closely related idea in the context of agreement measures.
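The modal and median disagreements of the running example can be checked with a small grid-minimization sketch of the sample Fréchet variance (names ours; the candidate grid stands in for the argmin over the whole space):

```python
import numpy as np

def frechet_variance(ratings, loss, candidates):
    """Sample Fréchet variance: the minimized mean loss from the ratings
    to a single candidate 'consensus' value."""
    return min(np.mean([loss(x, c) for x in ratings]) for c in candidates)

ratings = [1] * 7 + [2] * 2 + [3]      # 7 raters say 1, 2 say 2, 1 says 3
cats = [1, 2, 3]
nominal = lambda x, c: float(x != c)   # Fréchet mean = mode
absval = lambda x, c: abs(x - c)       # Fréchet mean = median

modal_disagreement = frechet_variance(ratings, nominal, cats)   # 3/10
median_disagreement = frechet_variance(ratings, absval, cats)   # 4/10
```

Both values match the hand calculations above: 3 of 10 raters disagree with the mode, and the mean absolute deviation from the median is 4/10.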
Define the g-dimensional disagreement based on l as

d_l(x_1, ..., x_g) = min_y (1/g) Σ_{i=1}^g l(x_i, y),

i.e., the sample Fréchet variance of the g ratings. The most important distance functions are:

(i) d_0(x, y) = 1[x ≠ y]. Generalizes the nominal distance. If the data are categorical, the Fréchet mean μ_l equals the mode, and the Fréchet variance equals the percentage of observations different from the mode. If we are dealing with vector-valued data with I elements each, it might be preferable to use I^{-1} Σ_{i=1}^I 1[x_i ≠ y_i] instead, as it counts each dimension of the nominal data separately.

(ii) d_1(x, y) = ||x − y||_1. For scalar ratings, the Fréchet mean is equal to the sample median.
The Fréchet variance equals the sample mean absolute deviation from the median.

(iii) d_2^2(x, y) = ||x − y||_2^2. For scalar ratings, the Fréchet mean is equal to the sample mean x̄, and the Fréchet variance is equal to the biased sample variance of x_1, ..., x_R.

(iv) d_2(x, y) = ||x − y||_2. For vector-valued data, the Fréchet mean has no simple formula, but is known as the geometric median. If the data is scalar, d_2 = d_1, which implies that the Fréchet mean equals the median, hence the name. There is an extensive literature on the geometric median; see, e.g., Drezner et al. (2002) for an overview and Cohen et al. (2016) for how to compute it. When the ratings are vector-valued, the geometric median is far more computationally expensive than the Fréchet mean based on ||x − y||_2^2.

For any p ∈ [0, ∞] and pair of vectors x_1, x_2, the Fréchet disagreement based on these distances is proportional to d_p(x_1, x_2) (proved in Appendix, Sect. 6), and the proportionality constant does not affect the chance-corrected coefficients when we are dealing with pairwise agreement. Thus, the Fréchet variances generalize the pairwise agreement for these distances to g-wise coefficients. But be aware that the particular case of V(d_2^2) constitutes a trivial generalization, as it can be shown that the kappas do not vary with g when using the quadratic Fréchet variance; in particular, the kappa based on V(d_2^2) equals the concordance coefficient for every g.
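A quick numeric check of the pairwise case (g = 2), under the assumption that minimizing over a dense grid approximates the true Fréchet variance well enough for scalar ratings:

```python
import numpy as np

def frechet_disagreement(xs, loss, candidates):
    """g-wise disagreement based on a distance l: the minimized mean
    loss from the g ratings to a single consensus value."""
    return min(np.mean([loss(x, c) for x in xs]) for c in candidates)

grid = np.linspace(0, 5, 501)          # dense grid standing in for argmin over R
absval = lambda x, c: abs(x - c)
squared = lambda x, c: (x - c) ** 2

# For g = 2 the Fréchet disagreements are proportional to the pairwise
# distances, so the chance-corrected kappas coincide with pairwise ones.
x1, x2 = 1.0, 4.0
v1 = frechet_disagreement([x1, x2], absval, grid)    # |x1 - x2| / 2 = 1.5
v2 = frechet_disagreement([x1, x2], squared, grid)   # (x1 - x2)^2 / 4 = 2.25
```

Since a kappa of the form 1 − D/C is invariant to rescaling the disagreement function, the constants 1/2 and 1/4 drop out of the coefficients.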
We believe the most useful distance measures will typically be d_0 for categorical data and d_1 for ordinal data, both using g = R. The quadratic distance d_2^2 could be used for ordinal data as well, but is harder to justify, as it is usually not obvious why we would be interested in the squared distance between two observations rather than just the distance itself. The distances d_p, p ∈ (1, ∞], with d_2 included, are even harder to recommend, as they do not work in a coordinatewise manner for vector data. In any case, it seems most reasonable to go with the R-wise variants of these distance measures, as they make use of all the available information, while the g-wise agreement coefficients (g < R) do not.
Example 2. In the paper introducing what is now called Fleiss's kappa, Fleiss (1971) discussed an example involving 5 different types of diagnoses, n = 30 patients, and R = 6 psychiatrists. The data were originally from Sandifer et al. (1968), but Fleiss removed some ratings to make the design rectangular. We use these data to illustrate the difference between Hubert's kappa and the Fréchet variances when applied to nominal data with g = R.
Hubert's kappa is π = 0.166 while Fleiss' kappa using V(d_0) is π = 0.486. The substantial difference suggests that a sizeable number of rating vectors contain at least one rating that disagrees with the others. Table 2 summarizes the relevant aspects of the data. The maximal agreement row (the largest number of raters that agree on the classification of an item; both V(d_0) and Hubert's distance depend only on this when g = R) could potentially go from 1 to 6, but the smallest number of raters agreeing on the classification of an item in this data set is 3. The count row counts the number of rows with the corresponding maximal agreements and distances. According to the Hubert distance, the raters disagree a lot, since only 5 items have a disagreement of 0 and the rest a disagreement of 1. On the other hand, V(d_0) results in a much smaller overall disagreement, with all disagreements smaller than the maximum of 1.

Limit Theory Using U -Statistics
Let X_1, ..., X_n be independently and identically distributed and ψ(x_1, ..., x_k) be a symmetric function. A U-statistic of order k with kernel ψ is

U_n = (n choose k)^{-1} Σ ψ(X_{i_1}, ..., X_{i_k}),

where the sum extends over all k-dimensional tuples satisfying 1 ≤ i_1 < ... < i_k ≤ n. The theory of U-statistics was established by Hoeffding (1992); for an introduction, see, e.g., Chapter 6.1 of Lehmann (2004), Chapter 5 of Serfling (1980), or the textbook of Lee (2019). These references handle U-statistics where the X_i are real-valued, but their results, including the simple results below, hold for vector-valued X_i as well (Korolyuk & Borovskich, 2013).
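A brute-force sketch of a U-statistic (the function name is ours); the order-2 kernel ψ(a, b) = (a − b)²/2 is the classical example, whose U-statistic is the unbiased sample variance:

```python
import itertools
from math import comb
import numpy as np

def u_statistic(X, psi, k):
    """U-statistic of order k: average the symmetric kernel psi over all
    index subsets i1 < ... < ik of the sample."""
    X = np.asarray(X)
    n = len(X)
    total = sum(psi(*X[list(idx)]) for idx in itertools.combinations(range(n), k))
    return total / comb(n, k)

# Order-2 kernel whose U-statistic equals the unbiased sample variance.
kernel = lambda a, b: 0.5 * (a - b) ** 2
data = [1.0, 2.0, 3.0, 4.0]
u = u_statistic(data, kernel, 2)   # equals np.var(data, ddof=1)
```

Replacing `itertools.combinations` by `itertools.product` over all tuples would give the corresponding V-statistic.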
The weighted chance agreement of Fleiss-type (Cohen-type) is a U-statistic with kernel F_d (C_d) of order g. The disagreement is a U-statistic with kernel D_d, which has order 1. To find the asymptotic variance of the kappas, we will use formulas for the asymptotic covariance of U-statistics. Let U_1n and U_2n be two U-statistics of n observations with symmetric kernel functions ψ_1, ψ_2 of dimensions k_1 and k_2. Define the projections ψ̃_j(x) = E[ψ_j(x, X_2, ..., X_{k_j})] and

σ_12 = Cov(ψ̃_1(X_1), ψ̃_2(X_1)), σ_j^2 = Var(ψ̃_j(X_1)), j = 1, 2.

Then we have n Cov(U_1n, U_2n) → k_1 k_2 σ_12 and n Var(U_1n) → k_1^2 σ_1^2 (Lee, 2019, Theorem 2, p. 76). It is also possible to calculate the exact covariances, which could potentially make the asymptotic variances for the kappas perform better. See Appendix, Sect. 6 for the formula for the exact covariance (Lee, 2019, Theorem 2, p. 17).
Here the variables μ_dC(X_1) and μ_dF(X_1) are defined as

μ_dC(X_1) = E[C_d(X_1, X_2, ..., X_g) | X_1], μ_dF(X_1) = E[F_d(X_1, X_2, ..., X_g) | X_1].

The form of the covariance matrix follows from the remarks preceding the lemma. Asymptotic normality follows from a general theorem about asymptotic normality of U-statistics; see, e.g., Theorem 2 of Lee (2019, p. 76).
We want to use Lemma 1 to find the limit distribution of the generalized Cohen's kappa and Fleiss's kappa. To this end, recall the multivariate delta method (see, e.g., Lehmann, 2004, Theorem 5.2.3). Let f : R^k → R be continuously differentiable at θ and suppose that √n(T_n − θ) converges in distribution to N(0, Σ). Then

√n(f(T_n) − f(θ)) → N(0, ∇f(θ)^T Σ ∇f(θ)) in distribution,

where ∇f(θ) denotes the gradient of f at θ. In the case of Cohen's kappa and Fleiss's kappa, we apply the delta method to f(D, C) = 1 − D/C, whose gradient is ∇f(D, C) = (−1/C, D/C^2), and analogously with F in place of C. Using some algebra, the expressions for the asymptotic variances follow from Lemma 1 and the above gradients.
Proposition 1. Cohen's kappa κ̂_d and Fleiss's kappa π̂_d are asymptotically normal, and their asymptotic variances are

σ_κ^2 = σ_D^2/C_d^2 − 2g(D_d/C_d^3)σ_DC + g^2(D_d^2/C_d^4)σ_C^2,
σ_π^2 = σ_D^2/F_d^2 − 2g(D_d/F_d^3)σ_DF + g^2(D_d^2/F_d^4)σ_F^2.

These results are also valid for κ̂^u_d and π̂^u_d. Since the sample Krippendorff's alpha (see note below) is equal to α̂_d = π̂_d + (1/(2Rn))(1 − π̂_d), it is also asymptotically normal with asymptotic variance σ_π^2.
With g = 2 and a finite number of categories, Schouten (1980) derived σ_π^2, while Schouten (1982) and O'Connell and Dobson (1984) derived σ_κ^2. The result for Krippendorff's alpha is, to our knowledge, new.

A brief aside on Krippendorff's alpha. Krippendorff's alpha (Krippendorff, 1970) is an agreement coefficient especially popular in content analysis (Krippendorff, 2018). It has no population definition, but its sample definition equals α̂_d = π̂_d + (1/N)(1 − π̂_d) (the total sample size N equals 2Rn in the case of a rectangular design); see Proposition 3 in Appendix for a justification. For this reason, all of the results about the limit of π̂^u_d apply to Krippendorff's alpha as well, as it is an asymptotically equivalent estimator of π_d. Note, however, that Krippendorff (2018) emphasizes the use of non-rectangular designs, and the limit results in the preceding section do not hold for such study designs.

Estimating the Variances
The unknown quantities D_d, C_d, and F_d can be estimated using their sample counterparts D̂_d, Ĉ_d, and F̂_d. The variances and covariances can be estimated using the empirical (co)variances of the estimated μs. These have formulas

μ̂_dC(X_i) = n^{-(g-1)} Σ C_d(X_i, X_{i_1}, ..., X_{i_{g-1}}), μ̂_dF(X_i) = n^{-(g-1)} Σ F_d(X_i, X_{i_1}, ..., X_{i_{g-1}}),

where the index sets run over all combinations with repetitions of (i_1, i_2, ..., i_{g-1}).
Observe that estimating μ̂_dC and μ̂_dF directly is computationally very expensive, especially when done without binning, and binning is impossible with continuous data. The obvious computation of all μ̂_dC requires a number of operations on the order of n^{g-1}, which is prohibitively expensive for large n and g. However, there are few applications of agreement measures with very large n and g, so this should not be a serious problem in practice. We note that less computationally demanding procedures are possible for the quadratic Fréchet variance V(d_2^2), as it can be shown that its associated kappas are invariant under g. Thus, we may use the computationally very efficient methods for the concordance coefficient outlined by, e.g., Carrasco and Jover (2003).
From the definitions of D̂_d, Ĉ_d, and F̂_d, we quickly deduce that the sample means of the estimated projections satisfy μ̄_d = D̂_d, μ̄_dC = Ĉ_d, and μ̄_dF = F̂_d. Using this fact, we can define the estimators

σ̂_D^2 = n^{-1} Σ_i (D_d(X_i) − D̂_d)^2, σ̂_C^2 = n^{-1} Σ_i (μ̂_dC(X_i) − Ĉ_d)^2, σ̂_DC = n^{-1} Σ_i (D_d(X_i) − D̂_d)(μ̂_dC(X_i) − Ĉ_d).

Moreover, we can estimate σ_F^2 and σ_FD in the same way, substituting μ̂_dF for μ̂_dC. Using the formulas for the theoretical variances (4.4), we find the estimators σ̂_κ^2 and σ̂_π^2 by plugging in. The variance estimator σ̂_π^2 coincides with that of Gwet (2021, equation 4) in the case of nominal weights; see Appendix (Sect. 6) for a proof sketch.
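A plug-in sketch of the resulting variance estimator for Cohen's kappa, assuming the projections μ̂_dC(X_i) have already been computed; the function name and the exact algebraic arrangement are ours and should be checked against the paper's formulas:

```python
import numpy as np

def kappa_variance(D_vals, muC_vals, g):
    """Delta-method plug-in estimate of the asymptotic variance of
    kappa = 1 - D/C. D_vals holds D_d(X_i); muC_vals holds the
    precomputed projections muC(X_i) (an assumption of this sketch)."""
    D_vals = np.asarray(D_vals, dtype=float)
    muC_vals = np.asarray(muC_vals, dtype=float)
    D_hat, C_hat = np.mean(D_vals), np.mean(muC_vals)
    s_D = np.var(D_vals)                                   # sigma_D^2
    s_C = np.var(muC_vals)                                 # sigma_C^2
    s_DC = np.mean((D_vals - D_hat) * (muC_vals - C_hat))  # sigma_DC
    return (s_D / C_hat**2
            - 2 * g * (D_hat / C_hat**3) * s_DC
            + g**2 * (D_hat**2 / C_hat**4) * s_C)
```

The Fleiss-type variance works the same way with the Fleiss projections substituted for `muC_vals`.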

Improving Approximate Normality with the Arcsine and Fisher Transforms
It is well known that the Fisher transform (Fisher, 1915) improves inference for the correlation coefficient. If r is the sample correlation, artanh(r) = (1/2) log[(1 + r)/(1 − r)] has approximately the same variance for most r, and its distribution is closer to normal than that of the untransformed r, especially when the population correlation is close to ±1. This transform makes sense outside the world of correlations; for instance, Lin (1989) used the Fisher transform to improve the normality of the quadratically weighted Cohen's kappa.
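Both transforms are elementary; a short sketch (NumPy assumed, variable names ours):

```python
import numpy as np

# Fisher (artanh) and arcsine transforms applied to a kappa estimate.
# No square root is taken inside arcsin, since kappa may be negative.
def fisher(r):
    return 0.5 * np.log((1 + r) / (1 - r))   # identical to np.arctanh(r)

k = 0.816
z = fisher(k)        # transformed value, approximately variance-stabilized
back = np.tanh(z)    # inverse transform recovers k (up to rounding)
a = np.arcsin(k)     # arcsine transform; the inverse is np.sin
```

Invertibility matters below: confidence intervals are built on the transformed scale and mapped back with the inverse transform.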
The arcsine is another reasonable transformation of κ̂_d and π̂_d. The arcsine is the inverse of the sine function and satisfies arcsin x = ∫_0^x (1 − t^2)^{-1/2} dt. In ecology (Warton & Hui, 2011), the arcsine transformation denotes arcsin √p, where p is a probability. We do not take the square root, however, as κ̂_d and π̂_d can be negative.
Calculating the limiting variance of arcsin κ̂_d and arcsin π̂_d requires an additional application of the delta method (4.2). Using that (d/dx) arcsin(x) = 1/√(1 − x^2), we find that

√n(arcsin κ̂_d − arcsin κ_d) → N(0, σ_κ^2/(1 − κ_d^2)) in distribution. (4.9)

Expressions for π̂_d can be found by swapping κ_d for π_d and σ_κ^2 for σ_π^2.
Example 3. This example illustrates that the arcsine and Fisher transforms may make the sampling distribution closer to the normal distribution. Let the number of raters be R = 3, the disagreement function be quadratic (with g = 2), and the number of items be n = 20. There are five categories, and the true classification of an item is one of {1, 2, 3, 4, 5} with probability 1/5 each. Every rater knows the true classification of an item with probability 0.9. If they do not know the correct classification, they will guess a classification from {1, 2, 3, 4, 5} uniformly at random. One can show that the population value of the quadratically weighted Cohen's kappa is 0.816 under these circumstances, following the arguments of Perreault and Leigh (1989). We simulate the value of κ̂_d a total of N = 50,000 times and transform the values using the identity transform, the arcsine transform, and the Fisher transform. The results are shown in Fig. 1. The arcsine transform appears to bring the sampling distribution of κ̂_d closer to the normal distribution, with the Fisher transform also improving normality quite a bit.

Confidence Intervals
Using the methodology we have developed, we can easily construct confidence intervals for the agreement coefficients.
We describe our three confidence interval constructions only for Cohen's kappa, as the intervals using Fleiss' kappa can be found by replacing every instance of κ̂_d with π̂_d and σ̂_κ^2 with σ̂_π^2. We use two-sided t-distribution-based confidence intervals with nominal level 1 − α = 0.95. Let c be the (1 − α/2)-quantile of the t-distribution with n − 1 degrees of freedom. The basic interval is

κ̂_d ± c σ̂_κ/√n,

where σ̂_κ^2 is the estimated variance described in equation (4.6).
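The three interval constructions can be sketched as follows (the function name is ours; a normal quantile stands in for the t quantile to keep the sketch dependency-free, and the arcsine back-transform assumes the transformed endpoints stay within [−π/2, π/2]):

```python
from statistics import NormalDist
from math import sqrt, asin, sin, atanh, tanh

def kappa_ci(kappa_hat, sd_hat, n, method="basic", level=0.95):
    """Two-sided CI for a kappa: basic, arcsine, or Fisher construction."""
    c = NormalDist().inv_cdf(1 - (1 - level) / 2)  # stand-in for t quantile
    half = c * sd_hat / sqrt(n)
    if method == "basic":
        return kappa_hat - half, kappa_hat + half
    if method == "arcsine":   # delta method: sd inflated by 1/sqrt(1 - k^2)
        h = half / sqrt(1 - kappa_hat**2)
        return sin(asin(kappa_hat) - h), sin(asin(kappa_hat) + h)
    if method == "fisher":    # artanh scale: sd inflated by 1/(1 - k^2)
        h = half / (1 - kappa_hat**2)
        return tanh(atanh(kappa_hat) - h), tanh(atanh(kappa_hat) + h)
    raise ValueError(method)

lo, hi = kappa_ci(0.8, 0.5, 50, "fisher")   # interval stays inside (-1, 1)
```

Unlike the basic interval, the Fisher interval can never escape (−1, 1), which is one reason it behaves better for kappas near ±1.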
Example 4. (Ex. 2 cont.) Using the data of Fleiss (1971), we calculate arcsine confidence intervals for the g-wise Fleiss's kappa. The raters are not the same for all items, but it seems plausible to assume that the ratings are exchangeable given the item. The diagnoses are essentially categorical in nature; hence, we will only consider V(d_0) and Hubert's disagreement function. The results are shown in Table 3. We see that the agreement coefficients agree when g = 2, as both V(d_0) and Hubert's disagreement function equal the nominal agreement in this case. But the coefficients differ substantially as g increases. This is to be expected, as Hubert's disagreement function measures consensus while V(d_0) measures the number of observations different from the mode. Observe that V(d_0) is not invariant with respect to g; hence it is a proper alternative to the classical Fleiss's kappa. Moreover, all confidence intervals are of comparable length.
Table 3. Confidence intervals for the data of Fleiss (1971) using the arcsine method.

The preceding example fits best into the context of Fleiss' kappa, as the identities of the raters are unknown. Moreover, there is no ordinal structure in the data, making the V(d_1) and V(d_2^2) distances unnatural to employ. Our next example concerns the Fréchet variances applied to a case of ordinal data where the identities of the raters are known.
Example 5. Zapf et al. (2016) studied bootstrap intervals for Fleiss's kappa and Krippendorff's alpha using simulations and a case study. Their case study concerned the histopathological assessment of breast cancer and involved ratings performed by R = 4 senior pathologists on n = 50 breast cancer biopsies. We apply the arcsine method to calculate confidence intervals and point estimates, displayed in Table 4. We focus on Cohen's kappa since the same four pathologists rate each cancer biopsy, but we include a column for Fleiss's kappa when g = 4 for comparison's sake. When g = 4, Cohen's kappa and Fleiss's kappa are as good as indistinguishable. As can be verified by using the code in the supplementary material, this happens for the other values of g as well. It is not generally the case that Fleiss's kappa and Cohen's kappa nearly coincide, but it is likely to happen if the marginal ratings are approximately the same for all raters, as is the case in this data set. There is a sizable difference between the disagreement functions, but there is not typically a big difference when changing g, provided we keep the disagreement function constant. It remains to be seen whether this is common or not. The exception is Hubert's disagreement function, which decreases quite a bit. (As in the Fleiss (1971) example, this is expected, as Hubert's disagreement function is a consensus measure.) Observe that the kappas under the quadratic Fréchet variance V(d_2^2) do not change with g, which is always the case.

Simulation of Confidence Sets When g = 2
We include a small simulation study on the performance of confidence sets using two models: a Perreault-Leigh model for discrete rating data and a normal model for continuous rating data. For both models, we investigate the following parameters: (i) Number of raters R. We use R = 2, 5, 20, corresponding to a small, medium, and large selection of raters. (ii) Sample size n. We use n = 10, 40, 100, corresponding to small, medium, and large agreement studies. (iii) Disagreement functions. Nominal disagreement 1[x ≠ y], quadratic disagreement (x − y)^2, and absolute value disagreement |x − y|. (iv) Methods. A basic interval without transformations, an arcsine-transformed interval, and a Fisher-transformed interval.

A Perreault-Leigh Model
Perreault and Leigh (1989) discussed a particular model for ratings in which each rater either knows the correct answer or guesses uniformly at random. Similar models have been used by Gwet (2008) and Maxwell (1977), among others; see Moss (2023) for a thorough discussion of such models. We assume there are five categories encoded as C = {−2, −1, 0, 1, 2}, and that the true classification distribution is uniform. For each item rated, the rth rater knows the correct classification with probability √0.8. If not, he guesses, picking a number from C uniformly at random. Then κ_d = π_d = 0.8 for all weights and any number of raters, as can be verified by following the arguments of Perreault and Leigh (1989). We run each simulation N = 10,000 times.
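As a quick sanity check, the model above is straightforward to simulate. The sketch below is our own illustration (not the supplementary code): it draws ratings for two raters and recovers κ ≈ 0.8 under nominal disagreement, using a Fleiss-type chance correction computed from the pooled category proportions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, R, s = 20_000, 2, np.sqrt(0.8)
cats = np.array([-2, -1, 0, 1, 2])

# Perreault-Leigh mechanism: know the truth with probability s, else guess.
truth = rng.choice(cats, size=n)
knows = rng.random((n, R)) < s
guesses = rng.choice(cats, size=(n, R))
ratings = np.where(knows, truth[:, None], guesses)

# Percent agreement between the two raters (nominal weight).
p_a = np.mean(ratings[:, 0] == ratings[:, 1])

# Fleiss-type chance agreement: sum of squared pooled category proportions.
props = np.array([(ratings == c).mean() for c in cats])
p_c = np.sum(props ** 2)

kappa = (p_a - p_c) / (1 - p_c)  # should be close to 0.8
```

With 20,000 items the estimate lands within a few thousandths of the theoretical value 0.8.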
The simulated lengths and coverages for Cohen's kappa are given in Table 5. Two features stand out. First, the confidence intervals have almost indistinguishable lengths and coverages when either R or n is large. Second, the basic interval has worse coverage than the arcsine and Fisher intervals when n is small, with the Fisher interval having coverage slightly closer to nominal than the arcsine interval. However, the better coverage comes at the expense of greater length. In particular, for the absolute value weight, the coverage of the arcsine interval is greater than that of the Fisher interval, yet its length is shorter! The table for Fleiss's kappa is similar and can be found in the Appendix, Table 8.

Normal Model
In this study, the rating data is distributed according to the multivariate normal N(0, Σ), where Σ is the R × R correlation matrix with off-diagonal elements σ_{r_1 r_2} = ρ. Since the data is continuous, we study the absolute value disagreement d_1 and the quadratic disagreement d_2^2 only. The true values are κ_{d_2^2} = ρ and κ_{d_1} = 1 − √(1 − ρ); see the Appendix (Sect. 6) for details on the computation of these true values. We use ρ = 0.7, and hence κ_{d_2^2} = 0.7 and κ_{d_1} ≈ 0.45. We run each simulation N = 1,000 times. We note that agreement coefficients are often called concordance coefficients when dealing with continuous data, especially when the quadratic distance is used; Lin's concordance coefficient (Lin, 1989, 1992) is a prominent example.
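These true values are easy to check numerically. The sketch below is our own illustration (not the paper's code): it estimates κ_{d_1} for ρ = 0.7 by comparing the mean absolute disagreement of a correlated pair with that of a shuffled, approximately independent pairing.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.7, 200_000

# Correlated pair (X1, X2) with correlation rho.
cov = np.array([[1.0, rho], [rho, 1.0]])
x = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# Observed disagreement vs. disagreement under an (approximately)
# independent pairing obtained by shuffling the second coordinate.
d_obs = np.mean(np.abs(x[:, 0] - x[:, 1]))                      # ~ 2*sqrt((1-rho)/pi)
d_chance = np.mean(np.abs(x[:, 0] - rng.permutation(x[:, 1])))  # ~ 2/sqrt(pi)

kappa_d1 = 1 - d_obs / d_chance  # should be close to 1 - sqrt(1 - rho)
```

The Monte Carlo estimate agrees with the closed form 1 − √0.3 ≈ 0.45 to two decimal places.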
The simulated lengths and coverages for Cohen's kappa are given in Table 6. There is barely any difference between the three confidence interval constructions. Taken together with the results for the Perreault-Leigh model, where the basic interval performs worse than the other two, this leads us to recommend either the arcsine or the Fisher interval. Again, the table for Fleiss's kappa is very similar and can be found in the Appendix (Table 9).

Simulation of Confidence Sets When g > 2
Table 7 contains simulations from the Perreault-Leigh model (Sect. 5.1.1) with N = 1000 repetitions and R = 5 raters using the Fréchet variances V(d_0), V(d_1), and Hubert's disagreement function. We drop V(d_2^2) since it does not vary with g. To save space, we also drop the basic confidence interval in this simulation. As before, we show the results only for the Cohen-type disagreement, with the Fleiss-type disagreement relegated to the Appendix (Table 10). All coverages are decent, and the coverages and lengths are similar across the board.

Concluding Remarks
When choosing an agreement coefficient, one has to think through carefully exactly what one wishes to measure. The Fréchet variances are attractive because of their interpretation: you measure how much the raters disagree with the generalized mean rater, and then adjust for chance. In the case of nominal data, we measure disagreement with the modal rater. When dealing with numerical data, we may measure disagreement with the median rater (using the absolute value distance) or the mean rater (using the quadratic distance), or use any other Fréchet variance defined on numeric data.
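For concreteness, here is a small sketch of the three Fréchet variances for a single item's g ratings. The helper name `frechet_variances` is ours, and we assume the variances are defined as the mean distance to the mode, median, and mean, respectively; normalization conventions may differ from the paper's exact definitions.

```python
import numpy as np
from collections import Counter

def frechet_variances(ratings):
    """Fréchet variances of one item's g ratings (illustrative sketch).

    V(d_0): share of ratings differing from the mode (nominal).
    V(d_1): mean absolute deviation from the median (absolute value).
    V(d_2^2): mean squared deviation from the mean (quadratic).
    """
    x = np.asarray(ratings, dtype=float)
    mode_count = Counter(x.tolist()).most_common(1)[0][1]
    v0 = 1 - mode_count / len(x)
    v1 = np.mean(np.abs(x - np.median(x)))
    v2 = np.mean((x - np.mean(x)) ** 2)
    return v0, v1, v2
```

For the ratings (1, 1, 2), for example, one rating out of three differs from the mode, so V(d_0) = 1/3, while V(d_2^2) is the usual (population) variance.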
When dealing with nominal data, we believe that using the Fréchet variance, which measures the distance from the mode, is a reasonable choice. But other options are certainly possible, even when dealing with g-wise agreement measures. For example, one could use the entropy instead, with distance measure d(x_1, x_2, …, x_g) = −Σ_i (#i/g) log(#i/g), where #i counts the number of elements in (x_1, x_2, …, x_g) classified as i, which could be useful when the number of raters is finite but large. The topic of how to choose reasonable distance measures for g-wise agreement studies has not been thoroughly studied, and there might be options preferable to the Fréchet variances that have not yet been found.
We have only covered rectangular designs, where every item is rated by the same number of raters. It is quite easy to generalize the definitions of κ_d and π_d to non-rectangular designs, as we have done in the Appendix, Sect. 6. But inference appears to be quite difficult, probably requiring additional assumptions for the case of non-exchangeable ratings.
In Sect. 4, we introduced the U-statistic-based estimators of C_d and F_d, but only used them for theoretical purposes. The U-statistic-based estimators may plausibly outperform the classical V-statistic-based estimators, since they are minimum variance unbiased. It would be interesting to see whether they do so when n is small, for example in terms of mean squared error or confidence interval coverage.
The confidence intervals based on the arcsine and Fisher transforms perform better than the basic, untransformed interval. It is unclear which one of these intervals to prefer, but it barely matters when the sample size is sufficiently large. It might be possible to improve all of these intervals. Small-sample corrections to the variance appear feasible, with potential openings in the application of the delta rule and in the calculation of the covariance matrix of Lemma 1. We have used the arcsine and Fisher transforms to improve the approximate normality of κ̂_d and π̂_d, but this choice is semi-arbitrary. Better variance-stabilizing transformations might be found by inspecting the formulas for the variances of κ̂_d and π̂_d in Proposition 1. The confidence intervals used in the simulations are only known to be first-order accurate. To make second-order accurate confidence intervals, one could use the explicit formula for the variances to construct studentized confidence intervals, i.e., bootstrap-t intervals (Efron, 1987), which are second-order accurate.
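To illustrate the general pattern, a delta-method confidence interval on a transformed scale can be computed as follows. This is a generic sketch: the function name is ours, and the use of arctanh and arcsin as stand-ins for the Fisher and arcsine transforms is an assumption, not the paper's exact implementation.

```python
import numpy as np

def transformed_ci(k_hat, se, transform="fisher", z=1.96):
    """Delta-method CI: map the estimate through f, add +/- z * se * f'(k_hat)
    on the transformed scale, and map back (illustrative sketch)."""
    if transform == "fisher":
        f, f_inv = np.arctanh, np.tanh
        deriv = 1.0 / (1.0 - k_hat ** 2)
    elif transform == "arcsine":
        f, f_inv = np.arcsin, np.sin
        deriv = 1.0 / np.sqrt(1.0 - k_hat ** 2)
    else:  # basic, untransformed interval
        return k_hat - z * se, k_hat + z * se
    lo = f(k_hat) - z * se * deriv
    hi = f(k_hat) + z * se * deriv
    return f_inv(lo), f_inv(hi)
```

One attraction of the Fisher-type construction is visible immediately: since tanh maps the real line into (−1, 1), the back-transformed interval can never escape the parameter range, unlike the basic interval.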
None of these approaches is guaranteed to help when n is small, especially when dealing with categorical data, as the sampling distributions of κ̂_d and π̂_d are discrete and highly irregular. For example, consider the sampling distribution of the Perreault-Leigh model (Sect. 5.1) when n = 20 and R = 20, displayed in Fig. 2. (We omit a dominating spike at 1.) As there are C = 5 < ∞ categories, there is a finite number of possible values for κ̂_d to take, which is strongly reflected in the plots, especially for the nominal weight.
The superior performance of methods such as the bootstrap-t depends on the quantity (θ̂ − θ)/ŝe being approximately pivotal, that is, approximately the same for all parameters, possibly after applying a transformation. Judging from the plots in Fig. 2, there is no such transformation.
Funding Open access funding provided by Norwegian Business School.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Agreement Versus Disagreement
Agreement weighting functions are frequently standardized to guarantee that w(x_1, x_2) ≥ 0, e.g., w(x_1, x_2) = 1 − |x_1 − x_2|/max(|x_1 − x_2|) for the absolute value weights. Standardization is not necessary: it does not change the values of κ_d and π_d when it is possible (i.e., when max(|x_1 − x_2|) < ∞), and it is not defined otherwise. We choose not to use this operation, as it does not change the value of the agreement coefficients in this paper and is impossible to perform when the range of x_1, x_2 is unbounded.

Proof of Equivalence Between
First, consider the case when p ≥ 1. Using translation invariance and homogeneity of the norm, we obtain an inequality that holds with equality if ||a − ν|| = ||a + ν|| = ||a||, i.e., when ν = 0, as the left side then equals (||a − μ|| + ||a + μ||)^p = 2^p ||a||^p. Now it is easy to verify that V(d_p) and V(d_p^p) coincide in this case. For the case 0 < p < 1, the function m ↦ |x_1 − m|^p + |x_2 − m|^p is decreasing on (−∞, x_1], concave on [x_1, x_2], and increasing on [x_2, ∞); hence, its minimum is attained at x_1, x_2, or both. It is clear that both x_1 and x_2 map to ||x_1 − x_2||^p; hence, both are Fréchet means. The case p = 0 is obvious and omitted.

True Values in the Normal Simulation
We give a brief explanation of why the true values of κ_d and π_d are ρ for the quadratic weights and 1 − √(1 − ρ) for the absolute value weights. First notice that, since the marginals of X_{r_1} and X_{r_2} are equal for all r_1, r_2, we have κ_d = π_d. Moreover, we can ignore the number of raters, since the pairwise distributions do not depend on it. Then, from standard theory about the multivariate and folded normal, we find that E|X_{r_1} − X_{r_2}| = 2√((1 − ρ)/π) and E(X_{r_1} − X_{r_2})^2 = 2(1 − ρ). Let X̃_{r_1} be a copy of X_{r_1} that is independent of X_{r_2}. Then E|X̃_{r_1} − X_{r_2}| = 2/√π and E|X̃_{r_1} − X_{r_2}|^2 = 2. Now rewrite the kappas using disagreement instead of agreement, using the fact that (p_wa − p_fa)/(1 − p_fa) = 1 − d_wa/d_fa, where d_wa = 1 − E(w(X_{r_1}, X_{r_2})) and d_fa = 1 − E(w(X̃_{r_1}, X_{r_2})). Thus, κ_d = 1 − E|X_{r_1} − X_{r_2}|/E|X̃_{r_1} − X_{r_2}| = 1 − √(1 − ρ) for the absolute value weights and κ_d = 1 − E|X_{r_1} − X_{r_2}|^2/E|X̃_{r_1} − X_{r_2}|^2 = ρ for the quadratic weights.

Expanding the Definitions
Here is a sketch of how we could expand the definitions in Sect. 2 to encompass more complicated scenarios. We restrict ourselves to g = 2, but the analysis can be expanded to arbitrary g. Suppose that any finite number of raters R is possible, the raters are not exchangeable, and not every item is rated by every rater. Let X denote a rating, R the rater, and I the item rated. Suppose we sample pairs (X_1, R_1, I_1), (X_2, R_2, I_2) independently from the same distribution F. Then we may define the population quantities in (6.2). These quantities have natural sample analogues, where N is the total number of paired observations and the rater indices run over the raters who rated the ith item. Population and sample definitions of Cohen's kappa and Fleiss's kappa follow as laid out in the main text, e.g., κ_d = 1 − D_d/C_d. Here μ(x_i) and μ̂_F(x_i) were defined in Sect. 4. Following a small reorganization of the terms, and using the definitions of σ²_D, σ_FD, and σ²_F (cf. Sect. 4.2), one can verify through simple algebraic manipulations that the estimator of Gwet (2021) is a special case of the proposed estimator in Sect. 4.2.
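For the rectangular case, the pairwise (g = 2) sample coefficient κ̂_d = 1 − D̂_d/Ĉ_d can be sketched as follows. This is our own illustrative V-statistic-style implementation of a Cohen-type estimator; the function name and the exact averaging scheme are assumptions, not the paper's supplementary code.

```python
import numpy as np

def kappa_d(x, d):
    """Pairwise (g = 2) Cohen-type coefficient kappa_d = 1 - D_hat / C_hat
    for an n x R rectangular rating matrix x. The disagreement function d
    must be vectorized. (Illustrative V-statistic-style sketch.)"""
    n, R = x.shape
    D_terms, C_terms = [], []
    for r1 in range(R):
        for r2 in range(r1 + 1, R):
            # Observed disagreement: the two raters on the same item.
            D_terms.append(np.mean(d(x[:, r1], x[:, r2])))
            # Chance disagreement: all item pairs, keeping rater identities
            # (Cohen-type), computed by broadcasting over the item grid.
            C_terms.append(np.mean(d(x[:, r1][:, None], x[:, r2][None, :])))
    return 1.0 - np.mean(D_terms) / np.mean(C_terms)

# Nominal disagreement: 1 if the ratings differ, 0 otherwise.
nominal = lambda a, b: (a != b).astype(float)
```

Perfect agreement gives κ̂_d = 1, and two raters who always disagree on a balanced two-category item set give κ̂_d = −1, matching the usual behavior of chance-corrected coefficients.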

Simulation of Fleiss's Kappa
Here are the results of the simulation study in Sect. 5.1 for Fleiss's kappa (Tables 8, 9, 10).
…depending on which of Ĉ_d (F̂_d) and Ĉ^u_d (F̂^u_d) we choose to use. These are κ̂_d = 1 − D̂_d/Ĉ_d and κ̂^u_d = 1 − D̂_d/Ĉ^u_d for Cohen's kappa, and π̂_d = 1 − D̂_d/F̂_d and π̂^u_d = 1 − D̂_d/F̂^u_d for Fleiss's kappa. The definition of the sample Cohen's kappa

Lemma 1 .
Define the parameter vectors p = (D_d, C_d, F_d) and p̂ = (D̂_d, Ĉ_d, F̂_d). Then √n(p̂ − p) →d N(0, Σ), where Σ is the covariance matrix with elements

Figure 1 .
Figure 1. Simulated sampling distribution of κ̂_d for quadratic weights using three transformations, n = 20, R = 3. The simulation setup is described in Example 3. The arcsine transform makes the sampling distribution closest to the normal distribution.
Figure 2. Sampling distribution of κ̂_d for nominal (left) and absolute value (right) weights. Both plots omit a dominating spike at 1. Here n = 20 and R = 5, and we use the Perreault-Leigh model (same parameters as in Sect. 5.1) to simulate the data. There were 2573 unique values for the nominal weight and 8790 unique values for the absolute value weight after N = 200,000 simulations.

Table 6 .
Coverage (first entry) and lengths (second entry) of confidence intervals: normal model, Cohen's kappa.

Table 7 .
Coverage (first entry) and lengths (second entry) of confidence intervals for g-wise coefficients: Perreault-Leigh model, Cohen's kappa.

Table 9 .
Coverage (first entry) and lengths (second entry) of confidence intervals: Normal model, Fleiss's kappa.