1 Introduction

The most popular measures of inter-rater agreement involve correction for chance agreement. These can be written in the form

$$\begin{aligned} \frac{p_{a}-p_{ca}}{1-p_{ca}}=1-\frac{p_{d}}{p_{cd}}, \end{aligned}$$
(1.1)

where \(p_{a}\) (\(p_{d}\)) is the percentage agreement (disagreement) between the raters and \(p_{ca}\) (\(p_{cd}\)) is the chance agreement (disagreement) between the raters. Such measures are frequently called chance-corrected measures of agreement. Well-known examples of coefficients in this class are Cohen’s (1960) kappa and its weighted variant (1968), its multi-rater variant Conger’s kappa (Conger, 1980; Light, 1971), Krippendorff’s (1970) alpha, Scott’s (1955) pi, and Fleiss’ (1971) kappa. Some of these coefficients are defined only for two raters. The rest are defined in a pairwise manner, in the sense that they measure agreement between two raters at a time. However, not every proposed measure of agreement is defined on pairs of raters. The most famous of these is Hubert’s kappa (1977), which was recently studied in detail by Martín Andrés and Álvarez Hernández (2020). Other agreement coefficients include the \(AC_{1}\) coefficient (Gwet, 2008), the recent coefficient of van Oest (2019), and a multitude of intraclass correlation coefficients (Gwet, 2014).

There is no consensus on how multi-rater agreement coefficients should be defined. Broadly speaking, two options are considered: pairwise coefficients and consensus coefficients. The pairwise coefficients measure the agreement between pairs of raters (Conger, 1980), while the consensus coefficients measure the simultaneous agreement between all raters. In particular, consensus coefficients support the notion that “agreement occurs if and only if all raters agree on the categorization of an object” (Hubert, 1977). Both pairwise and consensus-based definitions of agreement are variants of g-wise measures of agreement (Conger, 1980), where agreement is measured among g-tuples of raters. The case where \(2<g<R\) has received little attention in the literature (Warrens, 2012), and non-trivial ways to measure agreement are hard to invent in this case. However, we introduce a promising and general framework for handling g-wise measures of agreement based on the concept of Fréchet variances (Dubey and Müller, 2019). The Fréchet variances generalize the variance and the measures of agreement based on them generalize the nominal, linearly weighted, and quadratically weighted pairwise measures of agreement in a natural way. They are easily interpretable, as you measure how much the raters disagree with the generalized mean rater and then adjust for chance. For nominal data in particular, they measure how many raters disagree with the modal rater, with a resulting agreement measure less extreme than Hubert’s kappa.

We need inferential theory for the g-wise agreement coefficients to make them useful. Much work has been done on inference for agreement coefficients, but, to our knowledge, inference for g-wise agreement coefficients has yet to be studied. Assuming multivariate normality of the ratings, Lin (1989, Section 3) derived the asymptotic distribution of Cohen’s kappa with quadratic weights. Fleiss (1971) introduced a formula for the standard error of Fleiss’s kappa, but it was later shown to be incorrect. Using the properties of the multinomial distribution and the delta method, Schouten (1980) found the asymptotic variance of the weighted Fleiss’s kappa in the case when the number of categories is finite. Almost forty years later, Gwet (2021) found a consistent estimator of the variance for the unweighted Fleiss’s kappa. Below, we extend these results to the weighted g-wise Fleiss’s kappa for any number of categories. In addition, we mention that bootstrap inference for Fleiss’s kappa and Krippendorff’s alpha was studied by Zapf et al. (2016).

We begin the paper by providing the definitions of two kinds of chance-corrected agreement coefficients. Then, in Sect. 2, we establish connections between the multi-rater Cohen’s kappa, Fleiss’s kappa, Conger’s kappa, Krippendorff’s alpha, and Hubert’s kappa. We restrict ourselves to the context where every rater rates every item. In Sect. 3, we discuss the Fréchet variances mentioned above. Then we spell out the basic limit theory for this class of agreement coefficients in Sect. 4, extending the results of Schouten (1980), Schouten (1982), and O’Connell and Dobson (1984) to vector-valued items and g-wise coefficients. We do this using the theory of U-statistics (Lee, 2019), but there are other ways to arrive at the same results. Then, in Sect. 5, we provide practical recommendations regarding the choice of confidence interval, obtained by comparing three confidence interval constructions: basic, arcsine transformed, and Fisher transformed. Using a simulation study, we find that the arcsine and Fisher intervals outperform the basic interval when n is small.

2 Measures of Agreement

Let \(d(x_{1},\ldots ,x_{g})\) be a disagreement function, a non-negative and symmetric function of g arguments that equals 0 when all \(x_{i}\)s are equal, i.e., \(d(x,\ldots ,x)=0\). The disagreement function quantifies the disagreement between the ratings \(x_{1},\ldots ,x_{g}\), where 0 is understood as complete agreement.

Most disagreement functions take two arguments. While there are infinitely many disagreement functions, the best-known belong to the class of \(l_{p}\) quasi-norms, \(p=0,1,2\), potentially raised to the pth power. The \(l_{p}\) quasi-norms on \(\mathbb {R}^{k}\), \(p\in [0,\infty ]\), are defined as

$$\begin{aligned} \Vert x\Vert _{p}=\left( \sum _{i=1}^{k}|x_{i}|^{p}\right) ^{1/p}. \end{aligned}$$
(2.1)

Here \(||x||_{0}=\sum _{i=1}^{k}1[x_{i}\ne 0]\) and \(||x||_{\infty }=\sup _i |x_{i}|\), as can be verified by taking the limit of \(||x||_{p}\) as \(p\rightarrow 0\) and \(p\rightarrow \infty \), respectively. It is well known that \(||x||_{p}\) is a proper norm if and only if \(p\ge 1\), as the triangle inequality is violated when \(0\le p<1\).

Now define the disagreement functions \(d_{p}\) as the \(l_{p}\) quasi-norm evaluated in \(x_{1}-x_{2}\), i.e.,

$$\begin{aligned} d_{p}(x_{1},x_{2})=||x_{1}-x_{2}||_{p}. \end{aligned}$$
(2.2)

In the case of scalar values, \(d_{0}(x_{1},x_{2})=1[x_{1}\ne x_{2}]\) is known as the nominal disagreement function. For \(p=1\), the \(l_{p}\) norm equals \(d_{1}(x_{1},x_{2})=|x_{1}-x_{2}|\), which is known as the absolute value disagreement function (and sometimes the linear disagreement function). The quadratic disagreement function is \(d_{2}^{2}(x_{1},x_{2})=(x_{1}-x_{2})^{2}\). Vector-valued variants of \(d_{p}\) and \(d_{p}^{p}\) are much less common, but have been used by, e.g., Berry et al. (2008).
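For concreteness, the following is a minimal Python sketch of these three disagreement functions (our illustration, not code from the paper); each function accepts scalars or NumPy-compatible vectors, and the scalar cases reduce to \(1[x_{1}\ne x_{2}]\), \(|x_{1}-x_{2}|\), and \((x_{1}-x_{2})^{2}\).

```python
import numpy as np

def d0(x1, x2):
    """Nominal disagreement: number of coordinates where the ratings differ."""
    return float(np.sum(np.asarray(x1) != np.asarray(x2)))

def d1(x1, x2):
    """Absolute value (linear) disagreement: the l_1 distance."""
    return float(np.sum(np.abs(np.asarray(x1) - np.asarray(x2))))

def d2_sq(x1, x2):
    """Quadratic disagreement: the squared l_2 distance."""
    return float(np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

print(d0(1, 2), d1(1, 3), d2_sq(1, 3))  # 1.0 2.0 4.0
```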

When the dimension of the disagreement function d is not equal to 2, we are mostly interested in the case where its dimension equals the number of raters R. In this case, the disagreement functions often measure the degree of consensus among the raters, with 0 reflecting complete consensus. The most obvious choice is the Hubert disagreement function,

$$\begin{aligned} d(x_{1},\ldots ,x_{g})=1-1[x_{1}=\cdots =x_{g}], \end{aligned}$$
(2.3)

which equals 0 if and only if every rater agrees on a rating. This disagreement function is employed in Hubert’s kappa (Hubert, 1977).

We present our results in terms of disagreement functions instead of the more popular agreement functions (i.e., symmetric functions bounded above by 1, where 1 signifies maximal agreement, sometimes with the additional assumption that \(a\ge 0\)). We do this mainly for mathematical convenience. Agreement functions and disagreement functions are closely related, for if a is an agreement function, then \(d=1-a\) is a disagreement function. Our results could have been framed in terms of agreement functions instead, though with some loss of generality. See Appendix (Sect. 6) for a short discussion.

Our results and definitions are framed in the following setup. Let R be the number of raters and n be the number of items rated. Moreover, let F be a fixed multivariate distribution function so that all rating vectors \(\varvec{X}_{i}\) are sampled independently from F. In symbols,

$$\begin{aligned} \varvec{X}_{1},\varvec{X}_{2},\ldots ,\varvec{X}_{n} {\mathop {\sim }\limits ^{iid}} F. \end{aligned}$$
(2.4)

There are no restrictions on the rating vector components \(\varvec{X}_{ir}\). They can be, e.g., categorical, real numbers, or vectors.

Equation (2.4) implies that every item is rated by exactly the same number of raters, which we refer to as the rectangular design assumption. The assumption is common in the literature,Footnote 1 but far from universal. It can be relaxed for the definitions, but it is strictly required for our limit results. We sketch how to loosen it in Appendix (Sect. 6), but we have made no attempts at an inferential theory for non-rectangular designs.

There are two important special cases covered by equation (2.4). First, in the case of fixed raters, the same set of ordered raters rates every item. Having fixed raters is common in applications of Cohen’s kappa, Conger’s kappa, and the concordance correlation coefficient.Footnote 2 Having fixed raters ensures that F does not vary across rating vectors; when the raters are not fixed, F could vary from item to item unless we make further assumptions. That leads us to the second case, that of exchangeable ratings given the item. Here, the rater identities do not affect the ratings given: the raters may differ from item to item, but the distribution F stays fixed. Exchangeable ratings occur when the ratings are identically distributed conditional on the item rated, and exchangeability is an implicit assumption underlying most applications of Fleiss’ kappa, e.g., that of Fleiss (1971). In this case, the marginal distributions of all raters are equal, which implies that the population value of the generalized Fleiss kappa equals the population value of the generalized Cohen’s kappa, both defined below. However, the sample Fleiss’s kappa is the preferred sample estimator, as it is invariant under changes of the raters’ identities.

We intend to collect the kappas of Cohen, Fleiss, Conger, Hubert, and so on, into a coherent framework of g-wise agreement coefficients. To do this, we will have to define some quantities. Let \(\varvec{x}_{i}=(x_{i1},x_{i2},\ldots ,x_{iR})\) be an R-dimensional vector of observed ratings, and recall that g is the dimension of our disagreement function d. The following definitions are natural population counterparts of sample definitions prevalent in the agreement literature; a code sketch implementing all three follows the list.

  (i)

    The disagreement at \(\varvec{x}_{1}\), as measured by d. The purpose of this quantity is to translate an arbitrary g-dimensional disagreement function d into a disagreement function taking an R-dimensional vector \(\varvec{x}_{1}\) as input. It is defined as

    $$\begin{aligned} D_{d}(\varvec{x}_{1})=\left( {\begin{array}{c}R\\ g\end{array}}\right) ^{-1} \sum _{r_{1},\ldots ,r_{g}}d(\varvec{x}_{1r_{1}},\ldots , \varvec{x}_{1r_{g}}), \end{aligned}$$
    (2.5)

    where the sum runs over all g-dimensional subsets of \(\{1,\ldots ,R\}\) with order ignored, i.e., the g-combinations of R. The expression is simplified when \(g=R\), as \(D_{d}(\varvec{x}_{1})=d(\varvec{x}_{11},\ldots ,\varvec{x}_{1R})\) in this case. To gain some intuition about this quantity, suppose that \(g=2\), that the ratings are scalars, and consider the nominal disagreement function \(d_{0}(x_{1},x_{2})=1[x_{1}\ne x_{2}]\). Then \(D_{d}(\varvec{x}_{1})=2R^{-1}(R-1)^{-1}\sum _{r_{1}>r_{2}}1[x_{1r_{1}}\ne x_{1r_{2}}]\) is the proportion of pairs of distinct raters that disagree on their rating.

  (ii)

    The Cohen-type chance disagreement at \(\varvec{x}_{1},\ldots ,\varvec{x}_{g}\), so called to differentiate it from the Fleiss-type chance disagreement. It is similar to the disagreement at \(\varvec{x}_{1}\), but this time the raters do not necessarily rate the same item: one rater rates the first item (from \(\varvec{x}_{1}\)), another rater rates the second item (from \(\varvec{x}_{2}\)), and so on. We do not allow the same rater to appear more than once in a tuple; hence, we choose g raters from a set of R raters, and the chance disagreement is

    $$\begin{aligned} C_{d}(\varvec{x}_{1},\ldots ,\varvec{x}_{g}) =\left( {\begin{array}{c}R\\ g\end{array}}\right) ^{-1}\sum _{r_{1},\ldots ,r_{g}}d(\varvec{x}_{1r_{1}}, \ldots ,\varvec{x}_{gr_{g}}), \end{aligned}$$
    (2.6)

    where the sum runs over all g-dimensional subsets of \(\{1,\ldots ,R\}\), i.e., the g-combinations of R. Observe that \(D_{d}(\varvec{x})=C_{d}(\varvec{x},\ldots ,\varvec{x})\). Since d is assumed to be symmetric, the expression simplifies to \(d(\varvec{x}_{11},\ldots ,\varvec{x}_{RR})\) when \(g=R\). When \(g=2\), \(C_{d}(\varvec{x}_{1},\varvec{x}_{2})=R^{-1}(R-1)^{-1} \sum _{r_{1}\ne r_{2}}d(\varvec{x}_{1r_{1}},\varvec{x}_{2r_{2}})\).

  (iii)

    The Fleiss-type chance disagreement at \(\varvec{x}_{1},\ldots ,\varvec{x}_{g}\) is similar, but allows the same rater to rate an item multiple times. Its definition is

    $$\begin{aligned} F_{d}(\varvec{x}_{1},\ldots ,\varvec{x}_{g}) =R^{-g}\sum _{r_{1},\ldots ,r_{g}}d(\varvec{x}_{1r_{1}}, \ldots ,\varvec{x}_{gr_{g}}), \end{aligned}$$
    (2.7)

    where the sum runs over the product set \(\{1,\ldots ,R\}^{g}\). The expression for \(F_{d}(\varvec{x}_{1},\ldots ,\varvec{x}_{g})\) is not dramatically simplified when \(g=R\). When \(g=2\), \(F_{d}(\varvec{x}_{1},\varvec{x}_{2})=R^{-2} \sum _{r_{1},r_{2}}d(\varvec{x}_{1r_{1}},\varvec{x}_{2r_{2}})\).
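The three quantities above translate directly into code. The sketch below is ours (the function names are hypothetical); for \(C_{d}\) we average over ordered tuples of distinct raters, which reproduces the displayed \(g=2\) special cases. Here d is any g-ary disagreement function.

```python
import itertools
import numpy as np

def D_d(x, d, g):
    """Disagreement at x, eq. (2.5): average d over all g-subsets of the R raters."""
    R = len(x)
    return np.mean([d(*(x[r] for r in rs))
                    for rs in itertools.combinations(range(R), g)])

def C_d(xs, d):
    """Cohen-type chance disagreement, eq. (2.6): rater j rates item j. We average
    over ordered tuples of distinct raters, matching the g = 2 formula in the text."""
    g, R = len(xs), len(xs[0])
    return np.mean([d(*(xs[j][rs[j]] for j in range(g)))
                    for rs in itertools.permutations(range(R), g)])

def F_d(xs, d):
    """Fleiss-type chance disagreement, eq. (2.7): the same rater may repeat."""
    g, R = len(xs), len(xs[0])
    return np.mean([d(*(xs[j][rs[j]] for j in range(g)))
                    for rs in itertools.product(range(R), repeat=g)])

nominal = lambda a, b: float(a != b)   # pairwise nominal disagreement, g = 2
x = (1, 1, 2)                          # one item rated by R = 3 raters
print(D_d(x, nominal, 2))              # 2/3: two of the three rater pairs disagree
print(C_d((x, x), nominal))            # equals D_d(x), since C_d(x, ..., x) = D_d(x)
print(F_d((x, x), nominal))            # 4/9: repeats allowed, the diagonal adds zeros
```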

We will call the expected values of these quantities the mean disagreement, the mean Cohen-type chance disagreement, and the mean Fleiss-type chance disagreement. Slightly abusing notation, we denote them as

$$\begin{aligned} D_{d}=E[D_{d}(\varvec{X}_{1})],\quad C_{d}=E[C_{d}(\varvec{X}_{1},\ldots ,\varvec{X}_{g})],\quad F_{d}=E[F_{d}(\varvec{X}_{1},\ldots ,\varvec{X}_{g})], \end{aligned}$$
(2.8)

where \(\varvec{X}_{1},\ldots ,\varvec{X}_{g}\) are independently sampled from the same distribution F. Discussions about the difference between \(E[C_{d}(\varvec{X}_{1},\ldots ,\varvec{X}_{g})]\) and \(E[F_{d}(\varvec{X}_{1},\ldots ,\varvec{X}_{g})]\), and why to prefer one over the other, are abundant in the literature, often in the context of the so-called paradox of kappa (Cicchetti and Feinstein, 1990).

Definition 1

Let \(X\sim F\) be a vector of R ratings and d be a disagreement function with dimension g. Define the population values of the generalized Cohen’s kappa \((\kappa _{d})\) and Fleiss’s kappa \((\pi _{d})\) as

$$\begin{aligned} \kappa _{d}=1-\frac{D_{d}}{C_{d}},\quad \pi _{d}=1 -\frac{D_{d}}{F_{d}}. \end{aligned}$$
(2.9)

The generalized Fleiss’s kappa, denoted as \(\pi _{d}\) since it generalizes Scott’s pi (Scott, 1955), is a straightforward generalization of Fleiss’s kappa (1971) to hold for \(2< g\le R\). When \(g=R\) and d is the nominal disagreement, it equals Hubert’s kappa. Likewise, the generalized Cohen’s kappa is an extension of the weighted Conger’s kappa to hold for \(2\le g\le R\). When \(g=R\), it equals the Schuster–Smith coefficient (Schuster & Smith, 2005, eq. 1).Footnote 3 It generalizes several other agreement coefficients as well. For instance, Berry and Mielke (1988) discussed what we call \(\kappa _{d}\) for Euclidean weights between vector-valued ratings, while Janson and Olsson (2001) extended it to squared Euclidean and nominal weights. The relationship between most of the mentioned agreement coefficients is summarized in Table 1.

Table 1 Weighted agreement coefficients.

2.1 Sample Estimates

Let \(\varvec{X}_{1},\ldots ,\varvec{X}_{n}\sim F\) be n iid vectors of ratings. Then there is a single natural sample estimator of \(D_{d}\), namely

$$\begin{aligned} \hat{D}_{d}=n^{-1}\sum _{i=1}^nD_{d}(\varvec{x}_{i}). \end{aligned}$$
(2.10)

There are, however, two natural estimators of the Cohen-type chance disagreement: one of them a V-statistic (Lee, 2019, Chapter 4.2) and the other a U-statistic (Lee, 2019, Chapter 1),

$$\begin{aligned} \hat{C}_{d} = n^{-g}\sum _{i_{1},\ldots ,i_{g}}C_{d} (\varvec{x}_{i_{1}},\ldots ,\varvec{x}_{i_g})\quad \text {and}\quad \hat{C}_{d}^{u}=\left( {\begin{array}{c}n\\ g\end{array}}\right) ^{-1}\sum _{i_{1}<\cdots <i_{g}}C_{d}(\varvec{x}_{i_{1}},\ldots , \varvec{x}_{i_{g}}), \end{aligned}$$
(2.11)

where the first sum runs over all ordered g-tuples \((i_{1},\ldots ,i_{g})\in \{1,\ldots ,n\}^{g}\) and the second sum runs over the unordered combinations \(i_{1}<i_{2}<\cdots <i_{g}\). Using the basic results of U-statistics (Lee, 2019, Chapter 1), we see that \(\hat{C}_{d}^{u}\) is the unique minimum-variance unbiased estimator of \(C_{d}\), which makes it attractive from a theoretical point of view. However, by a well-known correspondence between U-statistics and V-statistics, the asymptotic distribution of \(\hat{C}_{d}\) coincides with that of \(\hat{C}_{d}^{u}\) (Lee, 2019, Chapter 4, Theorem 1), so the choice between \(\hat{C}_{d}\) and \(\hat{C}_{d}^{u}\) barely matters when n is sufficiently large.

Likewise, there are two natural estimators of the Fleiss-type chance disagreement,

$$\begin{aligned} \hat{F}_{d} = n^{-g}\sum _{i_{1},\ldots ,i_{g}}F_{d}(\varvec{x}_{i_{1}},\ldots , \varvec{x}_{i_{g}})\quad \text {and}\quad \hat{F}_{d}^{u} =\left( {\begin{array}{c}n\\ g\end{array}}\right) ^{-1}\sum _{i_{1}<\cdots <i_{g}}F_{d}(\varvec{x}_{i_{1}},\ldots , \varvec{x}_{i_{g}}), \end{aligned}$$
(2.12)

where the index sets are as described above.

Now, we can define two sample variants of Cohen’s kappa (Fleiss’s kappa), depending on which one of \(\hat{C}_{d}\) (\(\hat{F}_{d}\)) and \(\hat{C}_{d}^{u}\) (\(\hat{F}_{d}^{u}\)) we choose to use. These are \(\hat{\kappa }_{d}=1-\hat{D}_{d}/\hat{C}_{d}\) and \(\hat{\kappa }_{d}^{u}=1-\hat{D}_{d}/\hat{C}_{d}^{u}\) for Cohen’s kappa and \(\hat{\pi }_{d}=1-\hat{D}_{d}/\hat{F}_{d}\) and \(\hat{\pi }_{d}^{u}=1-\hat{D}_{d}/\hat{F}_{d}^{u}\) for Fleiss’s kappa. The definition of the sample Cohen’s kappa (Cohen, 1968) agrees with \(\hat{\kappa }_{d}\), not with \(\hat{\kappa }_{d}^{u}\). Likewise, the sample Fleiss’s kappa has a definition agreeing with \(\hat{\pi }_{d}\) (Fleiss, 1971). Moreover, due to the possibility of binning data, \(\hat{\pi }_{d}\) and \(\hat{\kappa }_{d}\) are faster to compute when the data is not continuous. Since the estimators are asymptotically equivalent in any case, we will stick to the V-statistics \(\hat{\kappa }_{d}\) and \(\hat{\pi }_{d}\) for estimation, but use the U-statistic form when deriving limit distributions. We note that, since strictly fewer terms need to be computed, \(\hat{\kappa }_{d}^{u}\) and \(\hat{\pi }_{d}^{u}\) are faster to compute when the data is continuous, which may be useful in some settings.
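As an illustration, the V-statistic estimators and the resulting sample kappas can be sketched as follows, reusing the D_d, C_d, and F_d functions from the sketch above (a naive \(O(n^{g})\) implementation of ours, not the paper’s code):

```python
import itertools
import numpy as np

def sample_kappas(X, d, g):
    """V-statistic estimates (2.10)-(2.12) and the sample kappas.
    X: n x R array of ratings; d: g-ary disagreement function.
    Assumes D_d, C_d, F_d from the earlier sketch are in scope."""
    n = len(X)
    D_hat = np.mean([D_d(x, d, g) for x in X])
    tuples = list(itertools.product(range(n), repeat=g))  # all n^g item tuples
    C_hat = np.mean([C_d([X[i] for i in t], d) for t in tuples])
    F_hat = np.mean([F_d([X[i] for i in t], d) for t in tuples])
    return 1 - D_hat / C_hat, 1 - D_hat / F_hat

X = np.array([[1, 1, 2], [1, 2, 2], [3, 3, 3], [1, 1, 1]])
nominal = lambda a, b: float(a != b)
kappa_hat, pi_hat = sample_kappas(X, nominal, g=2)
print(round(kappa_hat, 3), round(pi_hat, 3))
```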

3 Fréchet Variances for g-Wise Agreement Coefficients

The most popular measures of agreement are defined only for \(g=2\). It is easy to find reasonable disagreement measures in this case, as one can draw on the extensive literature on norms and distances. The \(l_{p}\) distances are the obvious choices, but there are many unexplored options, such as the Huber loss (Huber, 1964) and the LINEX loss (Varian, 1975).

In the setting of Hubert’s kappa and the Schuster–Smith coefficient, we have \(g=R\), and it is not that easy to find reasonable disagreement functions anymore. The disagreement function used in Hubert’s kappa, \(d(x_{1},\ldots ,x_{R})=1-1[x_{1}=\cdots =x_{R}]\), penalizes any number of discordant ratings equally, yielding the often undesirable outcome that most sets of ratings will be in complete disagreement. But there are less sensitive ways to count nominal disagreements. Consider the case of 10 raters rating an item on an ordinal scale from 1 to 3, with 7 raters giving rating 1, 2 giving rating 2, and 1 giving rating 3. Then Hubert’s disagreement is 1, as the rating vector is not constant, and the pairwise disagreement is 46/100. But it sounds reasonable to pick the modal rating (in this case 1) and then report the number of raters that disagree with it, divided by the number of raters. In this case, the number of raters disagreeing with the modal rating is 3, and the “modal” disagreement equals 3/10.

Sometimes we wish to aggregate numerical ratings instead of categorical ratings. Consider the above case again but with the median (which is 1) instead of the mode. It is well known that the median of a vector x is equal to \({{\,\textrm{argmin}\,}}_{\mu }\frac{1}{R}\sum _{r=1}^R|x_{r}-\mu |\), so \(\min _{\mu }\frac{1}{R}\sum _{r=1}^R|x_{r}-\mu |\) (mean absolute deviation from the median) appears to be a reasonable measure of the mean disagreement when we use the median as the aggregation method. The resulting mean disagreement of the previous example is \(\min _{\mu }\frac{1}{R}\sum _{r=1}^R|x_{r}-\mu |=\frac{1}{10} \sum _{r=1}^{10}|x_{r}-1|=4/10\).

The “modal” and “median” disagreement measures are instances of an intuitive generalization of the variance called the Fréchet variance (Dubey and Müller, 2019). Let l be a distance function satisfying \(l(x,y)\ge 0\) and \(l(x,x)=0\), and let \(A=\{x_{1},x_{2},\ldots ,x_{R}\}\) be a set of points. The sample Fréchet mean of A is defined as the (not necessarily unique) point \(\mu _{l}\) that minimizes the sum of distances to all points in A, that is,Footnote 4

$$\begin{aligned} \mu _{l}[A]={{\,\textrm{argmin}\,}}_{\mu }\sum _{r=1}^{R}l(\mu ,x_{r}). \end{aligned}$$
(3.1)

Similarly, the sample Fréchet variance on A with distance function l is

$$\begin{aligned} V(l)[A]=\min _{\mu }\sum _{r=1}^{R}\frac{1}{R}l(\mu ,x_{r}) =\sum _{r=1}^{R}\frac{1}{R}l(\mu _{l}[A],x_{r}). \end{aligned}$$
(3.2)

The Fréchet mean (Fréchet, 1948) is a generalization of centroids to arbitrary distance functions l; likewise, the Fréchet variance is a generalization of dispersion to any such distance function. They are best understood through a decision-theoretic lens: The Fréchet mean of A represents your best guess of the true classification or value of an item according to the distance l; the Fréchet variance V(l) is the decision-theoretic risk associated with the choice. See Cooil and Rust (1994) for an investigation of a closely related idea in the context of agreement measures.

Define the g-dimensional disagreement based on l as

$$\begin{aligned} d(\varvec{x}_{1},\ldots ,\varvec{x}_{g})=V(l) [\{\varvec{x}_{1},\ldots ,\varvec{x}_{g}\}]. \end{aligned}$$
(3.3)

The most important distance functions, implemented in the sketch following this list, are:

  (i)

    \(d_{0}(x,y)=1[x\ne y]\). Generalizes the nominal distance. If the data are categorical, the Fréchet mean \(\mu _{d}\) equals the mode, and the Fréchet variance equals the percentage of observations different from the mode. If we are dealing with vector-valued data with I elements each, it might be preferable to use \(I^{-1}\sum _{i=1}^{I}1[x_{i}\ne y_{i}]\) instead, as it counts each dimension of the nominal data separately.

  (ii)

    \(d_{1}(x,y)=||x-y||_{1}\). For scalar ratings, the Fréchet mean is equal to the sample median. The Fréchet variance equals the sample mean absolute deviation from the median, i.e., \(\frac{1}{R}\sum _{r=1}^{R}|x_{r}-\mu _d|\), where \(\mu _d\) is the sample median.

  (iii)

    \(d_{2}^{2}(x,y)=||x-y||_{2}^{2}\). For scalar ratings, the Fréchet mean is equal to the sample mean \(\mu _d=\frac{1}{R}\sum _{r=1}^{R}x_{r}\), and the Fréchet variance is equal to the biased sample variance of \(\{x_{1},x_{2},\ldots ,x_{R}\}\), that is, \(\frac{1}{R}\sum _{r=1}^{R}(x_{r}-\mu _d)^{2}\).

  (iv)

    \(d_{2}(x,y)=||x-y||_{2}\). For vector-valued data, the Fréchet mean has no simple formula, but is known as the geometric median. If the data is scalar, \(d_{2}=d_{1}\), which implies that the Fréchet mean equals the median, hence the name. There is an extensive literature on the geometric median; see, e.g., Drezner et al. (2002) for an overview and Cohen et al. (2016) for how to compute it. When the ratings are vector-valued, the geometric median is far more computationally expensive than the Fréchet mean based on \(||x-y||_{2}^{2}\).
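A sketch of the scalar Fréchet variances from the list above (the geometric median of (iv) is omitted, as it requires iterative optimization); the values for the 10-rater example in the text are reproduced:

```python
import numpy as np

def frechet_var_d0(xs):
    """V(d0): proportion of ratings that differ from the mode."""
    _, counts = np.unique(np.asarray(xs), return_counts=True)
    return 1.0 - counts.max() / len(xs)

def frechet_var_d1(xs):
    """V(d1): mean absolute deviation from the median."""
    v = np.asarray(xs, dtype=float)
    return float(np.mean(np.abs(v - np.median(v))))

def frechet_var_d2_sq(xs):
    """V(d2^2): biased sample variance (mean squared deviation from the mean)."""
    v = np.asarray(xs, dtype=float)
    return float(np.mean((v - v.mean()) ** 2))

x = [1] * 7 + [2] * 2 + [3]        # the 10-rater example: ratings on the scale 1-3
print(frechet_var_d0(x))           # 0.3: three raters disagree with the mode
print(frechet_var_d1(x))           # 0.4: mean absolute deviation from the median 1
```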

For any \(p\in [0,\infty ]\) and pair of vectors \(x_{1},x_{2}\), we have the following (proved in Appendix, Sect. 6):

$$\begin{aligned} V(d_{p})[x_{1},x_{2}]=\frac{1}{2}d_{p}(x_{1},x_{2}),\quad V(d_{p}^{p})[x_{1},x_{2}]=\frac{1}{2^{p}}d_{p}^{p}(x_{1},x_{2}). \end{aligned}$$
(3.4)

It follows that \(\kappa _{d_{p}}=\kappa _{V(d_{p})}\) and \(\kappa _{d_{p}^{p}}=\kappa _{V(d_{p}^{p})}\) when we are dealing with pairwise agreement. Thus, the Fréchet variances generalize the pairwise agreement for these distances to g-wise coefficients. But be aware that the particular case of \(V(d_{2}^{2})\) constitutes a trivial generalization, as it can be shown that the kappas do not vary with g when using the quadratic Fréchet variance \(V(d_{2}^{2})\). It follows that \(\kappa _{V(d_{2}^{2})}\) equals the concordance coefficient for every g.

Example 1

Suppose you have \(R=5\) raters and 4 items, with ratings (1, 1, 2, 1, 1), (1, 2, 3, 2, 2), (2, 1, 1, 1, 1), (2, 3, 4, 4, 5). The Fréchet means using the distance \(|x-y|\) equal the sample medians 1, 2, 1, 4. The Fréchet variances are \(V(d_{1})=(0.2,0.4,0.2,0.8)\). To calculate the sample Cohen’s kappa with \(d=V(d_{1})\), we first find the mean disagreement \(\overline{V(d_{1})}=0.4\) (2.10), then the mean Cohen-type chance disagreement, which is \(\approx 0.73\) (2.11). Thus, Cohen’s kappa is \(1-0.4/0.73\approx 0.45\).
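The numbers in Example 1 can be reproduced with a short standalone script, under our reading of (2.6) in which the chance disagreement averages over all assignments of distinct raters to the sampled items (the full V-statistic loop over \(4^{5}\) item tuples and \(5!\) assignments runs in a couple of seconds):

```python
import itertools
import numpy as np

X = np.array([[1, 1, 2, 1, 1],     # the four items of Example 1, R = 5 raters
              [1, 2, 3, 2, 2],
              [2, 1, 1, 1, 1],
              [2, 3, 4, 4, 5]])

def V_d1(vals):
    """Frechet variance for l(x, y) = |x - y|: mean absolute deviation from median."""
    v = np.asarray(vals, dtype=float)
    return float(np.mean(np.abs(v - np.median(v))))

n, R = X.shape                                   # here g = R = 5
D_hat = np.mean([V_d1(row) for row in X])        # mean disagreement, 0.4

def C_d(rows):
    # Average over all assignments of the five raters to the five sampled items.
    return np.mean([V_d1([rows[j][p[j]] for j in range(R)])
                    for p in itertools.permutations(range(R))])

C_hat = np.mean([C_d(rows) for rows in itertools.product(X, repeat=R)])
print(D_hat, round(C_hat, 2), round(1 - D_hat / C_hat, 2))  # 0.4, ~0.73, ~0.45
```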

We believe the most useful distance measures will typically be \(d_{0}\) for categorical data and \(d_{1}\) for ordinal data, both using \(g=R\). The quadratic distance \(d_{2}^{2}\) could be used for ordinal data as well, but is harder to justify, as it is usually not obvious why we would be interested in the squared distance between two observations rather than just the distance itself. The distances \(d_{p},p\in (1,\infty ]\), with \(d_{2}\) included, are even harder to recommend, as they do not work in a coordinatewise manner for vector data. In any case, it seems most reasonable to go with the R-wise variants of these distance measures, as they make use of all the available information, while the g-wise agreement coefficients (\(g<R\)) do not.

Example 2

In the paper introducing what is now called Fleiss’s kappa, Fleiss (1971) discussed an example involving 5 different types of diagnoses, \(n=30\) patients, and \(R=6\) psychiatrists. The data were originally from Sandifer et al. (1968), but Fleiss removed some ratings to make the design rectangular. We use this data to illustrate the difference between Hubert’s kappa and the Fréchet variances when applied to nominal data with \(g=R\).

Hubert’s kappa is \(\pi =0.166\) while Fleiss’ kappa using \(V(d_0)\) is \(\pi =0.486\). The substantial difference suggests that a sizeable number of rating vectors contain at least one rating that disagrees with the others. Table 2 summarizes the relevant aspects of the data. The maximal agreement row could potentially go from 1 to 6, but the smallest number of raters agreeing on the classification of an item in this data set is 3. The count row counts the number of rows with the corresponding maximal agreements and distances. According to the Hubert distance, the raters disagree a lot, since only 5 items have a disagreement of 0 and the rest a disagreement of 1. On the other hand, \(V(d_{0})\) results in a much smaller overall disagreement, with all disagreements smaller than the maximum of 1.

Table 2 Maximal agreement for the data of Fleiss (1971).

4 Inference

4.1 Limit Theory Using U-Statistics

Let \(\varvec{X}_{1},\ldots ,\varvec{X}_{n}\) be independently and identically distributed and \(\psi (x_{1},\ldots ,x_{k})\) be a symmetric function. A U-statistic of order k with kernel \(\psi \) is

$$\begin{aligned} U_{n}=\left( {\begin{array}{c}n\\ k\end{array}}\right) ^{-1}\sum _{i_{1}<\cdots <i_{k}} \psi (\varvec{X}_{i_{1}},\ldots ,\varvec{X}_{i_{k}}), \end{aligned}$$
(4.1)

where the sum extends over all k-dimensional tuples satisfying \(1\le i_{1}<i_{2}<\cdots <i_{k}\le n\).

The theory of U-statistics was established by Hoeffding (1992); for an introduction, see, e.g., Chapter 6.1 of Lehmann (2004), Chapter 5 of Serfling (1980), or the textbook of Lee (2019). These references handle U-statistics where the \(X_{i}\)s are real-valued, but their results, including the simple results below, hold for vector-valued \(X_{i}\)s as well (Korolyuk and Borovskich, 2013).

The estimated Fleiss-type (Cohen-type) chance disagreement \(\hat{F}_{d}^{u}\) (\(\hat{C}_{d}^{u}\)) is a U-statistic with kernel \(F_{d}\) (\(C_{d}\)) of order g, and the estimated disagreement is a U-statistic with kernel \(D_{d}\) of order 1. To find the asymptotic variance of the kappas, we will use formulas for the asymptotic covariance of U-statistics. Let \(U_{1n}\) and \(U_{2n}\) be two U-statistics of n observations with symmetric kernel functions \(\psi _{1}\), \(\psi _{2}\) of dimensions \(k_{1}\) and \(k_{2}\). Define

$$\begin{aligned} \sigma _{1}^{2}&={{\,\textrm{Var}\,}}(E[\psi _{1}(\varvec{X}_{1},\ldots ,\varvec{X}_{k_{1}})\mid \varvec{X}_{1}]),\\ \sigma _{12}&={{\,\textrm{Cov}\,}}(E[\psi _{1}(\varvec{X}_{1},\ldots ,\varvec{X}_{k_{1}})\mid \varvec{X}_{1}],E[\psi _{2}(\varvec{X}_{1},\ldots ,\varvec{X}_{k_{2}})\mid \varvec{X}_{1}]). \end{aligned}$$

Then we have \(n{{\,\textrm{Cov}\,}}(U_{1n},U_{2n})\rightarrow k_{1}k_{2}\sigma _{12}\) and \(n{{\,\textrm{Var}\,}}(U_{1n})\rightarrow k_{1}^{2}\sigma _{1}^{2}\) (Lee, 2019, Theorem 2, p. 76). It is also possible to calculate the exact covariances, which could potentially improve the finite-sample performance of the variance estimators for the kappas. See Appendix, Sect. 6 for the formula for the exact covariance (Lee, 2019, Theorem 2, p. 17).

Lemma 1

Define the parameter vectors \(\varvec{p}=(D_{d},C_{d},F_{d})\) and \(\hat{\varvec{p}}=(\hat{D}_{d},\hat{C}_{d},\hat{F}_{d})\). Then \(\sqrt{n}(\hat{\varvec{p}}-\varvec{p}){\mathop {\rightarrow }\limits ^{d}}N(0,\Sigma )\), where \(\Sigma \) is the covariance matrix with elements

$$\begin{aligned} \sigma _{11}&=\sigma _{D}^{2}={{\,\textrm{Var}\,}}D_{d}(\varvec{X}_{1}),&\sigma _{12}&=\sigma _{CD}=g{{\,\textrm{Cov}\,}}(\mu _{dC}(\varvec{X}_{1}),D_{d}(\varvec{X}_{1})),\\ \sigma _{22}&=\sigma _{C}^{2}=g^{2}{{\,\textrm{Var}\,}}\mu _{dC}(\varvec{X}_{1}),&\sigma _{13}&=\sigma _{FD}=g{{\,\textrm{Cov}\,}}(\mu _{dF}(\varvec{X}_{1}),D_{d}(\varvec{X}_{1})),\\ \sigma _{33}&=\sigma _{F}^{2}=g^{2}{{\,\textrm{Var}\,}}\mu _{dF}(\varvec{X}_{1}),&\sigma _{23}&=\sigma _{CF}=g^{2}{{\,\textrm{Cov}\,}}(\mu _{dC}(\varvec{X}_{1}),\mu _{dF}(\varvec{X}_{1})). \end{aligned}$$

Here the variables \(\mu _{dC}(\varvec{X}_{1})\) and \(\mu _{dF}(\varvec{X}_{1})\) are defined as

$$\begin{aligned} \mu _{dC}(\varvec{X}_{1})=E[C_{d}(\varvec{X}_{1}, \ldots ,\varvec{X}_{g})\mid \varvec{X}_{1}]\quad \text {and}\quad \mu _{dF}(\varvec{X}_{1}) =E[F_{d}(\varvec{X}_{1},\ldots ,\varvec{X}_{g}) \mid \varvec{X}_{1}]. \end{aligned}$$

The form of the covariance matrix follows from the remarks preceding the lemma. Asymptotic normality follows from a general theorem about asymptotic normality of U-statistics, see, e.g., Theorem 2 of Lee (2019, p. 76).

We want to use Lemma 1 to find the limit distribution of the generalized Cohen’s kappa and Fleiss’s kappa. To this end, recall the multivariate delta method (see, e.g., Lehmann, 2004, Theorem 5.2.3). Let \(f:\mathbb {R}^{k}\rightarrow \mathbb {R}\) be continuously differentiable at \(\theta \) and suppose that \(\sqrt{n}(\hat{\theta }-\theta ){\mathop {\rightarrow }\limits ^{d}}N(0,\Sigma )\). Then

$$\begin{aligned} \sqrt{n}[f(\hat{\theta })-f(\theta )]{\mathop {\rightarrow }\limits ^{d}}N(0,\nabla f(\theta )^{T}\Sigma \nabla f(\theta )), \end{aligned}$$
(4.2)

where \(\nabla f(\theta )\) denotes the gradient of f at \(\theta \).

In the case of Cohen’s kappa and Fleiss’s kappa, we find that

$$\begin{aligned} \nabla \kappa _{d}=\frac{1}{C_{d}}\left( -1,\frac{D_{d}}{C_{d}}\right) , \quad \nabla \pi _{d}=\frac{1}{F_{d}}\left( -1,\frac{D_{d}}{F_{d}}\right) . \end{aligned}$$
(4.3)

With some algebra, the expressions for the asymptotic variances follow from Lemma 1 and the gradients above.

Proposition 1

The sample Cohen’s kappa \(\hat{\kappa }_{d}\) and Fleiss’s kappa \(\hat{\pi }_{d}\) are asymptotically normal, and their asymptotic variances are

$$\begin{aligned} \sigma _{\kappa }^{2}&=\sigma _{D}^{2}\frac{1}{C_{d}^{2}}-2\sigma _{CD} \frac{D_{d}}{C_{d}^{3}}+\sigma _{C}^{2}\frac{D_{d}^{2}}{C_{d}^{4}}, \nonumber \\ \sigma _{\pi }^{2}&=\sigma _{D}^{2}\frac{1}{F_{d}^{2}}-2\sigma _{FD}\frac{D_{d}}{F_{d}^{3}} +\sigma _{F}^{2}\frac{D_{d}^{2}}{F_{d}^{4}}. \end{aligned}$$
(4.4)

These results are also valid for \(\hat{\kappa }_{d}^{u}\) and \(\hat{\pi }_{d}^{u}\). Since the sample Krippendorff’s alpha (see note below) is equal to \(\hat{\alpha }_{d}=\hat{\pi }_{d}+\frac{1}{2Rn}(1-\hat{\pi }_{d})\), it is also asymptotically normal with asymptotic variance \(\sigma _{\pi }^{2}\).

With \(g=2\) and a finite number of categories, Schouten (1980) derived \(\sigma _{\pi }^{2}\), while Schouten (1982) and O’Connell and Dobson (1984) derived \(\sigma _{\kappa }^{2}\). The result for Krippendorff’s alpha is, to our knowledge, new.

A brief aside on Krippendorff’s alpha. Krippendorff’s alpha (Krippendorff, 1970) is an agreement coefficient especially popular in content analysis (Krippendorff, 2018). It has no population definition, but its sample definition equals \(\hat{\alpha }_{d}=\hat{\pi }_{d}+\frac{1}{N}(1-\hat{\pi }_{d})\), where the total sample size N equals 2Rn in the case of a rectangular design; see Proposition 3 in Appendix for a justification. For this reason, all of the results about the limit of \(\hat{\pi }_{d}^{u}\) apply to Krippendorff’s alpha as well, as it is an asymptotically equivalent estimator of \(\pi _{d}\). Note, however, that Krippendorff (2018) emphasizes the use of non-rectangular designs, and the limit results in the preceding section do not hold for such study designs.

4.2 Estimating the Variances

The unknown quantities \(D_{d}\), \(C_{d}\), and \(F_{d}\) can be estimated using their sample counterparts \(\hat{D}_{d}\), \(\hat{C}_{d}\), and \(\hat{F}_{d}\). The variances and covariances can be estimated using the empirical (co)variances of the estimated \(\hat{\mu }\)s. These have formulas

$$\begin{aligned} \hat{\mu }_{d}(\varvec{x}_{i})&=D_{d}(\varvec{x}_{i}),\nonumber \\ \hat{\mu }_{dC}(\varvec{x}_{i})&=n^{-(g-1)}\sum _{i_{1},\ldots ,i_{g-1}}C_{d}(\varvec{x}_{i}, \varvec{x}_{i_{1}},\ldots ,\varvec{x}_{i_{g-1}}),\nonumber \\ \hat{\mu }_{dF}(\varvec{x}_{i})&=n^{-(g-1)}\sum _{i_{1},\ldots ,i_{g-1}}F_{d}(\varvec{x}_{i}, \varvec{x}_{i_{1}},\ldots ,\varvec{x}_{i_{g-1}}), \end{aligned}$$
(4.5)

where the sums run over all ordered \((g-1)\)-tuples \((i_{1},\ldots ,i_{g-1})\in \{1,\ldots ,n\}^{g-1}\).

Observe that estimating \(\hat{\mu }_{dC}\) and \(\hat{\mu }_{dF}\) directly is computationally very expensive, especially when done without binning, which is impossible for continuous data. The obvious computation of \(\hat{\mu }_{dC}(\varvec{x}_{i})\) requires on the order of \(n^{g-1}\) operations for each item, hence on the order of \(n^{g}\) in total, which is prohibitively expensive for large n and g. However, there are few applications of agreement measures with very large n and g, so this should not be a serious problem in practice. We note that less computationally demanding procedures are possible for the quadratic Fréchet variance \(V(d_2^2)\), as its associated kappas do not vary with g. Thus, we may use the computationally very efficient methods for the concordance coefficient outlined by, e.g., Carrasco and Jover (2003).

From the definitions of \(\hat{D}_{d}\), \(\hat{C}_{d}\), and \(\hat{F}_{d}\) in (2.10)–(2.12), we quickly deduce that \(\overline{\hat{\mu }_{d}}=\hat{D}_{d}\), \(\overline{\hat{\mu }_{dC}}=\hat{C}_{d}\), and \(\overline{\hat{\mu }_{dF}}=\hat{F}_{d}\). Using this fact, we can define the estimators

$$\begin{aligned} \hat{\sigma }_{C}^{2}=\frac{g^2}{n-1}\sum _{i=1}^{n}(\hat{\mu }_{dC} (\varvec{x}_{i})-\hat{C}_{d})^{2},\quad \hat{\sigma }_{CD} =\frac{g}{n-1}\sum _{i=1}^{n}(\hat{\mu }_{dC}(\varvec{x}_{i}) -\hat{C}_{d})(\hat{\mu }_{d}(\varvec{x}_{i})-\hat{D}_{d}), \end{aligned}$$

and \(\hat{\sigma }_{D}^{2}=\frac{1}{n-1}\sum _{i=1}^{n}(\hat{\mu }_{d} (\varvec{x}_{i})-\hat{D}_{d})^{2}\). Moreover, we can estimate \(\hat{\sigma }_{F}^{2}\) and \(\hat{\sigma }_{FD}\) in the same way, substituting \(\hat{\mu }_{dF}\) for \(\hat{\mu }_{dC}\). Using the formulas for the theoretical variances (4.4), we find the estimators

$$\begin{aligned} \hat{\sigma }_{\kappa }^{2}=\hat{\sigma }_{D}^{2}\frac{1}{\hat{C}_{d}^{2}} -2\hat{\sigma }_{CD}\frac{\hat{D}_{d}}{\hat{C}_{d}^{3}} +\hat{\sigma }_{C}^{2}\frac{\hat{D}_{d}^{2}}{\hat{C}_{d}^{4}}, \end{aligned}$$
(4.6)
$$\begin{aligned} \hat{\sigma }_{\pi }^{2}=\hat{\sigma }_{D}^{2}\frac{1}{\hat{F}_{d}^{2}} -2\hat{\sigma }_{FD}\frac{\hat{D}_{d}}{\hat{F}_{d}^{3}}+\hat{\sigma }_{F}^{2} \frac{\hat{D}_{d}^{2}}{\hat{F}_{d}^{4}}. \end{aligned}$$
(4.7)

The variance estimator \(\hat{\sigma }_{\pi }^{2}\) coincides with that of Gwet (2021, equation 4) in the case of nominal weights; see Appendix (Sect. 6) for a proof sketch.
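To make the estimators concrete, here is a hedged sketch of the plug-in variance estimator \(\hat{\sigma }_{\kappa }^{2}\) for the pairwise nominal case (\(g=2\)); the function names are ours, and the \(\hat{\mu }\) sums follow (4.5).

```python
import itertools
import numpy as np

nominal = lambda a, b: float(a != b)

def D_item(x):
    """D_d(x) for g = 2: average pairwise disagreement within one item."""
    return np.mean([nominal(x[r], x[s])
                    for r, s in itertools.combinations(range(len(x)), 2)])

def C_pair(x, y):
    """C_d(x, y): average disagreement over ordered pairs of distinct raters."""
    return np.mean([nominal(x[r], y[s])
                    for r, s in itertools.permutations(range(len(x)), 2)])

def sigma_kappa_sq(X, g=2):
    """Plug-in estimate (4.6) of the asymptotic variance of Cohen's kappa."""
    mu_d = np.array([D_item(x) for x in X])
    mu_dC = np.array([np.mean([C_pair(x, y) for y in X]) for x in X])  # eq. (4.5)
    D_hat, C_hat = mu_d.mean(), mu_dC.mean()
    s_D = np.var(mu_d, ddof=1)                      # sigma_D^2 hat
    s_C = g ** 2 * np.var(mu_dC, ddof=1)            # sigma_C^2 hat
    s_CD = g * np.cov(mu_dC, mu_d, ddof=1)[0, 1]    # sigma_CD hat
    return (s_D / C_hat ** 2 - 2 * s_CD * D_hat / C_hat ** 3
            + s_C * D_hat ** 2 / C_hat ** 4)

X = np.array([[1, 1, 2], [1, 2, 2], [3, 3, 3], [1, 1, 1], [2, 2, 1]])
print(sigma_kappa_sq(X))
```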

4.3 Improving Approximate Normality with the Arcsine and Fisher Transforms

It is well known that the Fisher transform (Fisher, 1915) improves the inference for the correlation coefficient. If r is the sample correlation, \({{\,\textrm{artanh}\,}}(r)=\frac{1}{2}\log [(1+r)/(1-r)]\) has an approximately constant variance across most values of r, and its distribution is closer to normal than that of the untransformed r, especially when the population correlation is close to \(\pm 1\). This transform makes sense outside the world of correlations; for instance, Lin (1989) used the Fisher transform to improve the normality of the quadratically weighted Cohen’s kappa.

The arcsine is another reasonable transformation of \(\hat{\kappa }_{d}\) and \(\hat{\pi }_{d}\). The arcsine is the inverse of the sine function and satisfies \(\arcsin x=\int _{0}^{x}(1-t^{2})^{-1/2}\,\textrm{d}t\). In ecology (Warton and Hui, 2011), the arcsine transformation denotes \(\arcsin \sqrt{p}\), where p is a probability. We do not take the square root, however, as \(\hat{\kappa }_{d}\) and \(\hat{\pi }_{d}\) can be negative.

Calculating the limiting variance of \(\arcsin \hat{\kappa }_{d}\) and \(\arcsin \hat{\pi }_{d}\) requires an additional application of the delta method (4.2). Using that \(\frac{\textrm{d}}{\textrm{d}x}\arcsin (x)=1/\sqrt{1-x^{2}}\) and \(\frac{\textrm{d}}{\textrm{d}x}{{\,\textrm{artanh}\,}}(x)=1/(1-x^{2})\), we find

$$\begin{aligned} \sqrt{n}(\arcsin \hat{\kappa }_{d}-\arcsin \kappa _{d})&{\mathop {\rightarrow }\limits ^{d}} N(0,(1-\kappa _{d}^{2})^{-1}\sigma _{\kappa }^{2}), \end{aligned}$$
(4.8)
$$\begin{aligned} \sqrt{n}({{\,\textrm{artanh}\,}}\hat{\kappa }_{d}-{{\,\textrm{artanh}\,}}\kappa _{d})&{\mathop {\rightarrow }\limits ^{d}} N(0,(1-\kappa _{d}^{2})^{-2}\sigma _{\kappa }^{2}). \end{aligned}$$
(4.9)

Expressions for \(\hat{\pi }_{d}\) can be found by swapping \(\kappa _{d}\) for \(\pi _{d}\) and \(\sigma _{\kappa }^{2}\) for \(\sigma _{\pi }^{2}\).

Example 3

This example illustrates that the arcsine and Fisher transforms may make the sampling distribution closer to the normal distribution. Let the number of raters be \(R=3\), the disagreement function be quadratic (with \(g=2\)), and the number of items be \(n=20\). There are five categories and the true classification of an item is one of \(\{1,2,3,4,5\}\) with probability 1/5 each. Every rater knows the true classification of an item with probability 0.9. If they do not know the correct classification, they will guess a classification from \(\{1,2,3,4,5\}\) uniformly at random. One can show that the population value of the quadratically weighted Cohen’s kappa is 0.816 under these circumstances, following the arguments of Perreault and Leigh (1989). We simulate the value of \(\hat{\kappa }_{d}\) a total of \(N=50,000\) times and transform them using the identity transform, the arcsine transform, and the Fisher transform. The results are shown in Fig. 1. The arcsine transform appears to bring the sampling distribution of \(\hat{\kappa }_{d}\) closer to the normal distribution, with the Fisher transform also improving normality quite a bit.

Fig. 1

Simulated sampling distribution of \(\hat{\kappa }_{d}\) for quadratic weights using three transformations, \(n=20, R=3\). The simulation setup is described in Example 3. The arcsine transform makes the sampling distribution closest to the normal distribution.

5 Confidence Intervals

Using the methodology we have developed, we can easily construct confidence intervals for the agreement coefficients.

We describe our three confidence interval constructions only for Cohen’s kappa, as the intervals for Fleiss’ kappa can be found by replacing every instance of \(\hat{\kappa }_{d}\) with \(\hat{\pi }_{d}\) and \(\hat{\sigma }_{\kappa }^{2}\) with \(\hat{\sigma }_{\pi }^{2}\). We use two-sided t-distribution-based confidence intervals with nominal level \(1-\alpha =0.95\). Let c be the \((1-\alpha /2)\)-quantile of the t distribution with \(n-1\) degrees of freedom. The basic interval is

$$\begin{aligned} \left[ \hat{\kappa }_{d}-c\hat{\sigma }_{\kappa }/\sqrt{n-1},\, \hat{\kappa }_{d}+c\hat{\sigma }_{\kappa }/\sqrt{n-1}\right] , \end{aligned}$$
(5.1)

where \(\hat{\sigma }_{\kappa }\) is the square root of the variance estimate described in equation (4.6).

The arcsine interval replaces the basic limits with

$$\begin{aligned} \sin \left( \arcsin \hat{\kappa }_{d}\pm c(1-\hat{\kappa }_{d}^{2})^{-1/2}\hat{\sigma }_{\kappa }/ \sqrt{n-1}\right) , \end{aligned}$$
(5.2)

where \((1-\hat{\kappa }_{d}^{2})^{-1}\hat{\sigma }_{\kappa }^{2}\) is the asymptotic variance of \(\arcsin \hat{\kappa }_{d}\) (4.8). The Fisher interval uses the area hyperbolic tangent,

$$\begin{aligned} \tanh \left( {{\,\textrm{artanh}\,}}\hat{\kappa }_{d}\pm c(1-\hat{\kappa }_{d}^{2})^{-1}\hat{\sigma }_{\kappa }/ \sqrt{n-1}\right) , \end{aligned}$$
(5.3)

where \((1-\hat{\kappa }_{d}^{2})^{-2}\hat{\sigma }_{\kappa }^{2}\) is the asymptotic variance of \({{\,\textrm{artanh}\,}}\hat{\kappa }_{d}\) (4.9).
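The three constructions translate directly into code. In this sketch, kappa_hat and sigma_hat (the estimated standard deviation \(\hat{\sigma }_{\kappa }\)) are assumed to come from the estimators of Sect. 4.2; the numbers in the usage line are hypothetical.

```python
import numpy as np
from scipy import stats

def kappa_cis(kappa_hat, sigma_hat, n, level=0.95):
    """Basic (5.1), arcsine (5.2), and Fisher (5.3) intervals for a kappa.
    Assumes |kappa_hat| < 1."""
    c = stats.t.ppf(1 - (1 - level) / 2, df=n - 1)
    half = c * sigma_hat / np.sqrt(n - 1)
    basic = (kappa_hat - half, kappa_hat + half)
    arcsine = tuple(np.sin(np.arcsin(kappa_hat)
                           + s * half / np.sqrt(1 - kappa_hat ** 2))
                    for s in (-1, 1))
    fisher = tuple(np.tanh(np.arctanh(kappa_hat)
                           + s * half / (1 - kappa_hat ** 2))
                   for s in (-1, 1))
    return basic, arcsine, fisher

for name, ci in zip(("basic", "arcsine", "Fisher"),
                    kappa_cis(kappa_hat=0.45, sigma_hat=0.9, n=50)):
    print(name, np.round(ci, 3))
```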

Using the methodology just described, we can calculate confidence intervals for the Fleiss (1971) data of Example 2.

Example 4

(Ex. 2 cont.) Using the data of Fleiss (1971), we calculate arcsine confidence intervals for the g-wise Fleiss’s kappa. The raters are not the same for all items, but it seems plausible to assume that the ratings are exchangeable given the item. The diagnoses are essentially categorical in nature; hence, we will only consider \(V(d_{0})\) and Hubert’s disagreement function. The results are shown in Table 3. We see that the agreement coefficients agree when \(g=2\), as both \(V(d_{0})\) and Hubert’s disagreement function equal the nominal disagreement in this case. But the coefficients differ substantially as g increases. This is to be expected, as Hubert’s disagreement function measures consensus while \(V(d_{0})\) measures the number of observations different from the mode. Observe that \(V(d_0)\) is not invariant with respect to g, and hence it is a genuine alternative to the classical Fleiss’s kappa. Moreover, all confidence intervals are of comparable length.

Table 3 Confidence intervals for the data of Fleiss (1971) using the arcsine method.

The preceding example fits best into the context of Fleiss’ kappa, as the identities of the raters are unknown. Moreover, there is no ordinal structure in the data, making the \(V(d_1)\) and \(V(d_2^2)\) distances unnatural to employ. Our next example concerns the Fréchet variances applied to a case of ordinal data where the identities of the raters are known.

Example 5

Zapf et al. (2016) studied bootstrap intervals for Fleiss’s kappa and Krippendorff’s alpha using simulations and a case study. Their case study concerned the histopathological assessment of breast cancer and involved ratings performed by \(R=4\) senior pathologists and \(n=50\) breast cancer biopsies. We apply the arcsine method to calculate confidence intervals and point estimates, displayed in Table 4. We focus on Cohen’s kappa since the same four pathologists rate each cancer biopsy, but we include a column for Fleiss’s kappa when \(g=4\) for comparison’s sake. When \(g=4\), Cohen’s kappa and Fleiss’s kappa are virtually indistinguishable. As can be verified by using the code in the supplementary material, this happens for the other gs as well. It is not generally the case that Fleiss’s kappa and Cohen’s kappa nearly coincide, but it is likely to happen if the marginal ratings are approximately the same for all raters, as is the case in this data set. There is a sizable difference between the disagreement functions, but there is typically not a big difference across gs, provided we keep the disagreement function fixed. It remains to be seen whether this is common or not. The exception is Hubert’s disagreement function, whose kappa decreases quite a bit as g increases. (As in the Fleiss (1971) example, this is expected, as Hubert’s disagreement function is a consensus measure.) Observe that the kappas under the quadratic Fréchet variance \(V(d^2_2)\) do not change with g, which is always the case.

Table 4 Confidence intervals for Zapf et al. (2016) using the arcsine method.

5.1 Simulation of Confidence Sets When \(g=2\)

We include a small simulation study on the performance of confidence sets using two models: a Perreault–Leigh model for discrete rating data and a normal model for continuous rating data. For both models, we investigate the following parameters:

  (i)

    Number of raters R. We use 2, 5, 20, which correspond to small, medium, and large selections of raters.

  (ii)

    Sample sizes n. We use \(n=10,40,100\), corresponding to small, medium, and large agreement studies.

  (iii)

    Disagreement functions. Nominal disagreement \(1[x\ne y]\), quadratic disagreement \((x-y)^{2}\), and absolute value disagreement \(|x-y|\).

  (iv)

    Methods. A basic interval without transformations, an arcsine-transformed interval, and a Fisher-transformed interval.

5.1.1 A Perreault–Leigh Model

Perreault and Leigh (1989) discussed a particular model for ratings in which each rater either knows the correct answer or guesses uniformly at random. Similar models have been used by Gwet (2008) and Maxwell (1977), among others; see Moss (2023) for a thorough discussion of such models. We assume there are five categories encoded as \(C=\{-2,-1,0,1,2\}\) and that the distribution of the true classification is uniform. For each item rated, the rth rater knows the correct classification with probability \(\sqrt{0.8}\). If not, the rater guesses, picking a number from C uniformly at random. Then \(\kappa _{d}=\pi _{d}=0.8\) for all weights and any number of raters, as can be verified by following the arguments of Perreault and Leigh (1989). We run each simulation \(N=10,000\) times.
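A sketch of the rating mechanism (our implementation; the paper’s own simulation code is in its supplementary material):

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([-2, -1, 0, 1, 2])            # the five categories

def perreault_leigh(n, R, p_know=np.sqrt(0.8)):
    """One simulated n x R rating matrix: each rater knows the true class
    with probability sqrt(0.8) and otherwise guesses uniformly from C."""
    truth = rng.choice(C, size=n)          # uniform true classifications
    knows = rng.random((n, R)) < p_know
    guesses = rng.choice(C, size=(n, R))
    return np.where(knows, truth[:, None], guesses)

X = perreault_leigh(n=40, R=5)
print(X[:3])
```

Coverage and length are then estimated by wrapping this generator, the kappa estimators, and the interval constructions of Sect. 5 in a loop over N replications.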

The simulated lengths and coverages for Cohen’s kappa are given in Table 5. Two features stand out in Table 5. First, the confidence intervals have almost indistinguishable lengths and coverages when either R or n is large. Second, the basic interval has worse coverage than the arcsine and Fisher intervals when n is small, with the Fisher interval having coverage slightly closer to nominal than the arcsine interval. However, coverage closer to the nominal level comes at the expense of greater length. In particular, for the absolute value weight, the coverage of the arcsine interval is greater than the coverage of the Fisher interval, but its length is shorter! The table for Fleiss’s kappa is similar and can be found in Appendix, Table 8.

Table 5 Coverage (first entry) and lengths (second entry) of confidence intervals: Perreault–Leigh model, Cohen’s kappa.

5.1.2 Normal Model

In this study, the rating data is distributed according to the multivariate normal \(N(0,\Sigma )\), where \(\Sigma \) is the \(R\times R\) correlation matrix with off-diagonal elements \(\Sigma _{r_{i}r_{j}}=\rho \). Since the data is continuous, we study the absolute value disagreement \(d_{1}\) and the quadratic disagreement \(d_{2}^{2}\) only. The true values are \(\kappa _{d_{2}^{2}}=\pi _{d_{2}^{2}}=\rho \) and \(\kappa _{d_{1}}=\pi _{d_{1}}=1-\sqrt{1-\rho }\). See Appendix (Sect. 6) for details on the computation of these true values. We use \(\rho =0.7\); hence \(\kappa _{d_{2}^{2}}=0.7\) and \(\kappa _{d_{1}}\approx 0.45\). We run each simulation \(N=1,000\) times.Footnote 5 We note that agreement coefficients are often called concordance coefficients when dealing with continuous data, especially when the quadratic distance is used. Lin’s concordance coefficient (Lin, 1989, 1992) is a prominent example.
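Generating the equicorrelated ratings is straightforward; a sketch (ours, with a hypothetical function name):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_ratings(n, R, rho=0.7):
    """n x R ratings from N(0, Sigma) with unit variances and equicorrelation rho."""
    Sigma = np.full((R, R), rho)
    np.fill_diagonal(Sigma, 1.0)
    return rng.multivariate_normal(np.zeros(R), Sigma, size=n)

X = normal_ratings(n=100, R=5)
print(np.corrcoef(X, rowvar=False).round(2))   # off-diagonal entries near 0.7
```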

The simulated lengths and coverages for Cohen’s kappa are given in Table 6. There is barely any difference between the three confidence interval constructions. Taken together with the results for the Perreault–Leigh model, where the basic interval performs worse than the other two, we recommend using either the arcsine or the Fisher interval. Again, the table for Fleiss’s kappa is very similar and can be found in Appendix (Table 9).

Table 6 Coverage (first entry) and lengths (second entry) of confidence intervals: normal model, Cohen’s kappa.

5.2 Simulation of Confidence Sets When \(g\ne 2\)

Table 7 contains simulations from the Perreault–Leigh model (Sect. 5.1.1) with \(N=1000\) repetitions and \(R=5\) raters using the Fréchet variances \(V(d_{0})\) and \(V(d_{1})\) and Hubert’s disagreement function. We drop \(V(d_{2}^{2})\) since it does not vary with g, and to save space we also omit the basic confidence interval. As before, we show the results only for the Cohen-type disagreement, with the Fleiss-type disagreement relegated to Appendix (Table 10). All coverages are decent, and the coverages and lengths are similar across the board.

Table 7 Coverage (first entry) and lengths (second entry) of confidence intervals for g-wise coefficients: Perreault–Leigh model, Cohen’s kappa.

6 Concluding Remarks

When choosing an agreement coefficient, one has to think through exactly what one wishes to measure. The Fréchet variances are attractive because of their interpretation: you measure how much the raters disagree with the generalized mean rater, and then adjust for chance. In the case of nominal data, we measure the disagreement with the modal rater. When dealing with numerical data, we may measure disagreement with the median rater (using the absolute value distance), or the mean rater (using the quadratic distance), or use any other Fréchet variance defined on numeric data.

When dealing with nominal data, we believe that using the Fréchet variance, which measures the distance from the mode, is a reasonable choice. But other options are certainly possible, even when dealing with g-wise agreement measures. For example, one could use the entropy instead, with distance measure \(d(x_{1},x_{2},\ldots ,x_{g})=-\sum _{i=1}^{g}\frac{\#i}{g}\log \frac{\#i}{g}\), where \(\#i\) counts the number of elements in \((x_{1},x_{2},\ldots ,x_{g})\) classified as i, which could be useful when the number of raters is finite but large. The topic of how to choose reasonable distance measures for g-wise agreement studies has not been thoroughly studied, and there might be options preferable to the Fréchet variances that have not yet been found.
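For illustration, a sketch of this entropy disagreement for categorical ratings:

```python
import numpy as np

def entropy_disagreement(*ratings):
    """d(x1, ..., xg) = -sum_i (#i/g) log(#i/g); zero iff all ratings agree."""
    _, counts = np.unique(np.asarray(ratings), return_counts=True)
    p = counts / len(ratings)
    return float(-np.sum(p * np.log(p)))

print(entropy_disagreement(1, 1, 1))       # 0.0: complete agreement
print(entropy_disagreement(1, 1, 2, 3))    # about 1.04: mixed ratings
```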

We have only covered rectangular designs, where every item is rated by the same number of raters. It is quite easy to generalize the definitions of \(\kappa _{d}\) and \(\pi _{d}\) to non-rectangular designs, as we have done in Appendix, Sect. 6. But inference appears to be quite difficult, probably requiring additional assumptions for the case of non-exchangeable ratings.

In Sect. 4, we introduced the U-statistic-based estimators of \(C_d\) and \(F_d\), but only used them for theoretical purposes. Since they are minimum-variance unbiased, the U-statistic-based estimators may plausibly outperform the classical V-statistic-based estimators when n is small; it would be interesting to compare the two, for example in terms of mean squared error or confidence interval coverage.

The confidence intervals based on the arcsine and Fisher transforms perform better than the basic, untransformed interval. It is unclear which one of these intervals to prefer, but it barely matters when the sample size is sufficiently large. It might be possible to improve all of these intervals. Small-sample corrections to the variance appear feasible, with potential openings in the application of the delta method and in the calculation of \(\Sigma \) of Lemma 1. We have used the arcsine and Fisher transforms to improve the approximate normality of \(\hat{\kappa }_{d}\) and \(\hat{\pi }_{d}\), but this choice is semi-arbitrary. Better variance-stabilizing transformations might be found by inspecting the formulas for the variances of \(\hat{\kappa }_{d}\) and \(\hat{\pi }_{d}\) in Proposition 1. The confidence intervals used in the simulation are only known to be first-order accurate. To make second-order accurate confidence intervals, it would be possible to use the explicit formulas for the variances to construct studentized confidence intervals, i.e., bootstrap-t intervals (Efron, 1987), which are second-order accurate.

None of these approaches is guaranteed to help when n is small, especially when dealing with categorical data, as the sampling distributions of \(\hat{\kappa }_{d}\) and \(\hat{\pi }_{d}\) are discrete and highly irregular. For example, consider the sample distribution of the Perreault–Leigh model (Sect. 5.1) when \(n=20\) and \(R=20\), displayed in Fig. 2. (We omit a dominating spike at 1.) As there are \(C=5<\infty \) categories, there is a finite number of possible values for \(\hat{\kappa }_{d}\) to take, which is strongly reflected in the plots, especially for the nominal weight.

Fig. 2

Sample distribution of \(\hat{\kappa }_{d}\) for nominal (left) and absolute value (right) weights. Both plots omit a dominating spike at 1. Here \(n=20\) and \(R=20\), and we use the Perreault–Leigh model (same parameters as in Sect. 5.1) to simulate the data. There were 2573 unique values for the nominal weight and 8790 unique values for the absolute value weight after \(N=200{,}000\) simulations.

The superior performance of methods such as the bootstrap-t depends on the quantity \(\frac{\hat{\theta }-\theta }{{{\,\textrm{se}\,}}}\) being approximately pivotal, that is, having approximately the same distribution for all parameter values, possibly after applying a transformation. Judging from the plots in Fig. 2, there is no such transformation.