Abstract
To measure the degree of agreement between R observers who independently classify n subjects within K categories, various kappa-type coefficients are often used. When R = 2, it is common to use Cohen's kappa, Scott's pi, Gwet's AC1/2, and Krippendorf's alpha coefficients (weighted or not). When R > 2, a pairwise version based on the aforementioned coefficients is normally used; in the same order as above: Hubert's kappa, Fleiss's kappa, Gwet's AC1/2, and Krippendorf's alpha. However, all these statistics are based on biased estimators of the expected index of agreements, since they estimate the product of two population proportions through the product of their sample estimators. This article has three aims. First, to provide statistics based on unbiased estimators of the expected index of agreements and to determine their variance based on the variance of the original statistic. Second, to make pairwise extensions of some measures. And third, to show that the old and new estimators of Cohen's kappa and Hubert's kappa coefficients match the well-known estimators of the concordance and intraclass correlation coefficients, if the former are defined assuming quadratic weights. The article shows that the new estimators are always greater than or equal to the classic ones, except in the case of Gwet, where it is the other way around, although these differences are only relevant with small sample sizes (e.g. n ≤ 30).
1 Introduction
It is often necessary to assess the degree of concordance or agreement between R raters who independently classify n subjects within K ≥ 2 categories (Fleiss 1971; Landis and Koch 1975a, b; Warrens 2010; Schuster and Smith 2005).
Consider first the case of only two raters (R = 2) and nominal categories. As some of the observed agreements may be due to chance, it is most common to eliminate the effect of chance by defining a kappa-type coefficient of the form κ = (Io − Ie)/(1 − Ie). In that expression, Io is the observed index of agreements (the sum of the observed proportions of agreements), Ie is the expected index of agreements (the sum of the proportions of agreements that would happen if the two raters acted independently) and κ is the population value of the proposed agreement measure. Note that the previous indexes only consider the agreements obtained. When the categories are ordinal, the indexes defined are similar to the previous ones, but they also consider the disagreements obtained, to which certain weights are assigned (see Sect. 2.1); this leads to a weighted kappa coefficient. From now on, κ will allude to one or the other indistinctly. According to the definition adopted for Ie, the different kappa coefficients are obtained: κS (Scott 1955), κC (Cohen 1960, 1968), and κG (Gwet 2008). The estimation of these coefficients has the general form \(\hat{\kappa } = \left( {\hat{I}_{o} - \hat{I}_{e} } \right)/\left( {1 - \hat{I}_{e} } \right)\), where the values \(\hat{\kappa }\), \(\hat{I}_{o}\) and \(\hat{I}_{e}\) are the sample estimators of the previous population parameters. It can be seen that κ and \(\hat{\kappa }\) are decreasing functions of Ie and \(\hat{I}_{e}\), respectively. Additionally, Krippendorf (1970, 2004) provides an estimator \(\hat{\kappa }_{K}\) of κS that differs slightly from the more classical \(\hat{\kappa }_{S}\) because of its new definition of \(\hat{I}_{o}\).
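The kappa-type scheme above can be sketched numerically. The following is a minimal Python illustration (the function name and the 2 × 2 table are our own, not from the article) of \(\hat{\kappa } = (\hat{I}_{o} - \hat{I}_{e} )/(1 - \hat{I}_{e} )\), using Cohen's definition of \(\hat{I}_{e}\) in the unweighted nominal case:

```python
def cohen_kappa(table):
    """Unweighted Cohen's kappa from a K x K contingency table O_ij
    (rows: rater 1, columns: rater 2)."""
    n = sum(sum(row) for row in table)            # total number of subjects
    K = len(table)
    p = [[o / n for o in row] for row in table]   # observed proportions p_ij
    I_o = sum(p[i][i] for i in range(K))          # observed index of agreements
    p_row = [sum(p[i][j] for j in range(K)) for i in range(K)]  # p_i.
    p_col = [sum(p[i][j] for i in range(K)) for j in range(K)]  # p_.j
    I_e = sum(p_row[i] * p_col[i] for i in range(K))            # expected index
    return (I_o - I_e) / (1 - I_e)

# Hypothetical 2 x 2 table: I_o = 0.7, I_e = 0.5
print(round(cohen_kappa([[20, 5], [10, 15]]), 6))  # -> 0.4
```

Here \(\hat{I}_{o}\) = 0.7 and \(\hat{I}_{e}\) = 0.5, so the raters attain 40% of the agreement margin not attributable to chance.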
Consider now the multi-rater case (R ≥ 2). The different coefficients κ of the case R = 2 can be generalized to multiple raters in several ways, depending on how the phrase “an agreement has occurred” is interpreted. The most common interpretation is that of Fleiss (1971) and Hubert (1977), “an agreement occurs if and only if two raters categorize an object consistently”, or the pairwise definition of agreement. This is the definition used in this article. Hubert (1977) also makes the following interpretation, “an agreement occurs if and only if all raters agree on the categorization of an object”, or the R-wise definition (Conger 1980). The R-wise extension κHR of κC can be seen in Conger (1980), Schuster and Smith (2005) and Martín Andrés and Álvarez Hernández (2020). The best-known pairwise extensions of the coefficients κS, κC and κG are the coefficients κF (Fleiss 1971), κH (Hubert 1977; Conger 1980) and κG (Gwet 2008), respectively. All of them are defined under the same format as in the case of R = 2. Additionally, Krippendorf (1970, 2004) provides an estimator \(\hat{\kappa }_{K}\) of κF that differs slightly from the more classical \(\hat{\kappa }_{F}\), again because of the definition of \(\hat{I}_{o}\). An overview of all of the above can be seen in Gwet's book (2021).
However, all \(\hat{\kappa }_{X}\) expressions are based on biased estimators (X refers to any of the letters used above), since they estimate the product of two population proportions (a term that is present in Ie) through the product of their sample estimators. The first objective of this article is to correct this bias by proposing unbiased estimators \(\hat{I}_{eU}\) of Ie (so the new estimator of κX will be \(\hat{\kappa }_{XU} = \left( {\hat{I}_{o} - \hat{I}_{eU} } \right)/\left( {1 - \hat{I}_{eU} } \right)\)), as well as to determine the variance of \(\hat{\kappa }_{XU}\). This methodology is easy to apply to any other kappa coefficient. A second objective is to make pairwise extensions of some measures, but in a different way from the traditional pairwise extensions.
The previous description is very general, since it is necessary to specify what the “subject population” and the “rater population” are. Regarding the population of subjects, the n subjects may be: (a) a random sample of an infinite population of subjects, which is what is assumed in the rest of the sections; (b) a random sample of a finite population of subjects, in which case a finite population correction (Gwet 2021a, b) must be applied to the formulas of the variance; and (c) the only subjects of interest, in which case only \(\hat{\kappa }_{X}\) makes sense: there is no κX parameter to estimate and it makes no sense to define \(\hat{\kappa }_{XU}\).
Regarding the population of raters, the R raters may be (Shrout and Fleiss 1979): (1) different for each subject (even different in number) and extracted from an infinite population of raters; (2) the same for all of the subjects and extracted from an infinite population of raters; and (3) the same for all of the subjects and the only raters of interest, which is what is assumed in the rest of the sections. When the replies of the raters are quantitative, a traditional way of measuring the degree of agreement between them is through the intraclass correlation coefficients (ICC) ρI1, ρI2, and ρI3, which are obtained from the corresponding one-way random model, two-way random model, or two-way mixed model, respectively. In the last two cases it is assumed that there is no interaction. Nevertheless, in this context of measures of agreement, Shrout and Fleiss (1979) and Carrasco and Jover (2003) point out that in case (3) it is also necessary to include the variability between raters in the total variability, so that in cases (2) and (3) we should use ρI2. Additionally, and for case (3), Lin (1989, 2000) and Barnhart et al. (2002) propose using the concordance correlation coefficient (CCC) ρL as a measure of agreement.
As is logical, different researchers have shown interest in searching for relations between the coefficients κX, ρIi, and ρL, as well as between their estimators \(\hat{\kappa }_{X}\), \(\hat{\rho }_{Ii}\), and \(\hat{\rho }_{L}\). Landis and Koch (1977) demonstrated that \(\hat{\kappa }_{F}\) is asymptotically equivalent to ρI1 when the replies are binary. Furthermore, Barnhart et al. (2002) and Carrasco and Jover (2003) demonstrated that ρL = ρI2. Since, in the case of R = 2, Martín Andrés and Álvarez Hernández (2020) demonstrated that ρL = κC (assuming, as from now on, that the weights of the disagreements are quadratic), the satisfactory property κC = ρL = ρI2 is obtained when R = 2. The equivalences between the estimators of these parameters are more complex, since their values depend on the method of estimating their components. For example, Fleiss and Cohen (1973) demonstrated that \(\hat{\kappa }_{C}\) is asymptotically equivalent to ρI2, King and Chinchilli (2001) and Martín Andrés and Álvarez Hernández (2020) demonstrated that \(\hat{\kappa }_{C}\) = \(\hat{\rho }_{L}\) when direct (biased) estimators are used, and Davis and Fleiss (1982) verified that \(\hat{\kappa }_{H}\) is asymptotically equivalent to ρI2 when the replies are binary. The third objective of this article is to relate κH to ρL, as well as the estimators \(\hat{\kappa }_{CU}\) and \(\hat{\kappa }_{HU}\) to the estimators \(\hat{\rho }_{I2}\) and \(\hat{\rho }_{LU}\), the latter being based on unbiased estimators of the components of ρI2 and ρL, respectively.
For the aforementioned reasons, this article assumes that n subjects, extracted randomly from an infinite population, are scored a single time by R fixed raters (who are the only ones of interest). It is also assumed that there are no missing data, i.e. that every rater gives a reply for every subject.
2 Case of two raters
Let there be two raters (R = 2) who independently classify n subjects within K categories. Let Oij be the number of subjects whom observer 1 classifies as type i (i = 1, 2, …, K) and observer 2 as type j (j = 1, 2, …, K). This gives rise to a table of absolute frequencies Oij like those in Tables 1 and 2, with observed proportions \(\hat{p}_{ij}\) = Oij/n, where ΣiΣjOij = n and ΣiΣj\(\hat{p}_{ij}\) = 1. The usual notation is adopted for the row totals (Oi· and \(\hat{p}_{i \cdot }\)), column totals (O·j and \(\hat{p}_{ \cdot j}\)) and the grand total (O·· = n and \(\hat{p}_{ \cdot \cdot }\) = 1); for example, \(\hat{p}_{i \cdot }\) = Σj\(\hat{p}_{ij}\). If the subjects have been chosen randomly and both raters classify all of the subjects, then the observed dataset {Oij} comes from a multinomial distribution of parameters n and {pij}, where pij is the probability that a subject will be classified in cell (i, j). Additionally, {pi·} and {p·j} will be the marginal distributions of the row and column observers, respectively. Obviously, \(\hat{p}_{ij}\), \(\hat{p}_{i \cdot }\) and \(\hat{p}_{ \cdot j}\) are the maximum likelihood estimators of pij, pi· and p·j, respectively. At the end of “Appendix 2”, another type of sampling is discussed.
2.1 Weighted and unweighted kappa and observed index of agreements
It has already been indicated that κ depends on the indexes of agreement Io (observed) and Ie (expected). To evaluate any of them it is necessary to previously define the weight or degree of agreement wij that is assigned to the answer (i, j), with 0 ≤ wij ≤ 1, wii = 1, and generally wij = wji < 1 (i ≠ j). When categories are ordinal, there are many ways to assign values to wij (Schuster and Smith 2005). If we assume that categories 1, 2, …, K are ordered from lowest to highest, it is usual that wij is related to the value of (i − j). A classic definition, to which we will refer later, is the quadratic weighting wij = 1 − [(i − j)/(K − 1)]2 of Fleiss and Cohen (1973). When categories are nominal, it is traditional to assign the weights wii = 1 and wij = 0 (i ≠ j); that is, only actual agreements are considered. Historically, the different coefficients κ were defined first in the unweighted case and later extended to the weighted case. However, this article is developed for the general weighted case, since the unweighted case is a particular instance of it: wij = δij, where δij is the Kronecker delta.
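As an illustration, the two weighting schemes just described can be generated as follows (a small Python sketch; the function names are ours, not from the article):

```python
def quadratic_weights(K):
    """Quadratic agreement weights of Fleiss and Cohen (1973):
    w_ij = 1 - ((i - j)/(K - 1))**2, so w_ii = 1 and w_{1K} = 0."""
    return [[1 - ((i - j) / (K - 1)) ** 2 for j in range(K)] for i in range(K)]

def identity_weights(K):
    """Nominal (unweighted) case: w_ij is the Kronecker delta."""
    return [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]

print(quadratic_weights(3))
# -> [[1.0, 0.75, 0.0], [0.75, 1.0, 0.75], [0.0, 0.75, 1.0]]
```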
All coefficients κ are defined based on the same value of the observed index of agreements. Therefore, it is appropriate to state its definition (Io) and its estimator (\(\hat{I}_{o}\)) as a general reference for all of Sect. 2:

\(I_{o} = \sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} p_{ij} } } ,\quad \hat{I}_{o} = \sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} \hat{p}_{ij} } } ,\)

where \(\hat{I}_{o}\) is an unbiased estimator of Io.
2.2 Cohen's kappa and the intraclass and concordance correlation coefficients
Cohen (1960, 1968) defines the classical measure of agreement

\(\kappa_{C} = \frac{{I_{o} - I_{e} }}{{1 - I_{e} }},\quad I_{e} = \sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} p_{i \cdot } p_{ \cdot j} } } ,\) (1)

and proposes to estimate it by

\(\hat{\kappa }_{C} = \frac{{\hat{I}_{o} - \hat{I}_{e} }}{{1 - \hat{I}_{e} }},\quad \hat{I}_{e} = \sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} \hat{p}_{i \cdot } \hat{p}_{ \cdot j} } } .\)
As indicated in “Appendix 1”, \(\hat{p}_{i \cdot } \hat{p}_{ \cdot j}\) is not an unbiased estimator of pi·p·j since

\(E\left( {\hat{p}_{i \cdot } \hat{p}_{ \cdot j} } \right) = \left\{ {\left( {n - 1} \right)p_{i \cdot } p_{ \cdot j} + p_{ij} } \right\}/n,\) (2)

although it is asymptotically unbiased, as happens in the other cases that follow. Therefore \(E\left( {\hat{I}_{e} } \right) = \sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} E\left( {\hat{p}_{i \cdot } \hat{p}_{ \cdot j} } \right)} }\) = {(n − 1)Ie + Io}/n and \(\hat{I}_{e}\) is also not an unbiased estimator of Ie. From expression (2) it follows that the unbiased estimators of pi·p·j and Ie are

\(\frac{{n\hat{p}_{i \cdot } \hat{p}_{ \cdot j} - \hat{p}_{ij} }}{n - 1},\quad \hat{I}_{eU} = \frac{{n\hat{I}_{e} - \hat{I}_{o} }}{n - 1},\) (3)
respectively. Thus, the new estimator \(\hat{\kappa }_{CU}\) of κC will be

\(\hat{\kappa }_{CU} = \frac{{\hat{I}_{o} - \hat{I}_{eU} }}{{1 - \hat{I}_{eU} }} = \frac{{n\hat{\kappa }_{C} }}{{\left( {n - 1} \right) + \hat{\kappa }_{C} }},\) (4)
and its variance, which is deduced in “Appendix 2”, is

\(V\left( {\hat{\kappa }_{CU} } \right) = \left[ {\frac{{\left( {n - \kappa_{C} } \right)^{2} }}{{n\left( {n - 1} \right)}}} \right]^{2} V\left( {\hat{\kappa }_{C} } \right),\) (5)
where \(V\left( {\hat{\kappa }_{C} } \right)\) refers to the formula of Fleiss et al. (1969), which can be seen in the book by Gwet (2021b); this book also contains all of the variances that are needed in what follows. This type of correction is similar to the one used by Miettinen and Nurminen (1985) for the score statistics in 2 × 2 tables. Because of expression (3), \(\hat{I}_{eU} - \hat{I}_{e}\) is proportional to \(- \left( {\hat{I}_{o} - \hat{I}_{e} } \right)\) ≤ 0 if and only if \(\hat{\kappa }_{C}\) ≥ 0. As \(\hat{\kappa }_{C}\) decreases with \(\hat{I}_{e}\), then \(\hat{\kappa }_{CU}\) ≥ \(\hat{\kappa }_{C}\) in the case of positive agreement (\(\hat{\kappa }_{C}\) ≥ 0, which is the case of greatest interest). It is easy to see that \(V\left( {\hat{\kappa }_{CU} } \right) \le V\left( {\hat{\kappa }_{C} } \right)\) if and only if κC ≥ n0.5/{n0.5 + (n − 1)0.5}. Something similar happens with the other variances obtained below.
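The correction of this subsection can be checked numerically. The sketch below (our own code; unweighted case by default, illustrative table included) computes \(\hat{\kappa }_{C}\) and \(\hat{\kappa }_{CU}\) and verifies that the direct definition through \(\hat{I}_{eU} = (n\hat{I}_{e} - \hat{I}_{o} )/(n - 1)\) coincides with the closed-form relation \(\hat{\kappa }_{CU} = n\hat{\kappa }_{C} /\{ (n - 1) + \hat{\kappa }_{C} \}\) mentioned in the text:

```python
def cohen_indices(table, w=None):
    """Sample values of n, I_o and I_e for two raters; unweighted by default."""
    n = sum(sum(row) for row in table)
    K = len(table)
    if w is None:
        w = [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]
    p = [[o / n for o in row] for row in table]
    p_row = [sum(row) for row in p]                             # p_i.
    p_col = [sum(p[i][j] for i in range(K)) for j in range(K)]  # p_.j
    I_o = sum(w[i][j] * p[i][j] for i in range(K) for j in range(K))
    I_e = sum(w[i][j] * p_row[i] * p_col[j] for i in range(K) for j in range(K))
    return n, I_o, I_e

def kappa_c(table, w=None):
    n, I_o, I_e = cohen_indices(table, w)
    return (I_o - I_e) / (1 - I_e)

def kappa_cu(table, w=None):
    """Bias-corrected version, using I_eU = (n*I_e - I_o)/(n - 1)."""
    n, I_o, I_e = cohen_indices(table, w)
    I_eU = (n * I_e - I_o) / (n - 1)
    return (I_o - I_eU) / (1 - I_eU)

t = [[20, 5], [10, 15]]            # hypothetical table with n = 50
kc, kcu = kappa_c(t), kappa_cu(t)
assert abs(kcu - 50 * kc / (49 + kc)) < 1e-12  # same value by both routes
assert kcu >= kc                               # positive agreement case
```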
Let there now be two raters with quantitative answers x1 and x2 with means μ1 and μ2, variances \(\sigma_{1}^{2}\) and \(\sigma_{2}^{2}\), and covariance σ12. Lin (1989, 2000) established the following measure of quantitative agreement ρL (known as the CCC) and its estimator \(\hat{\rho }_{L}\)

\(\rho_{L} = \frac{{2\sigma_{12} }}{{\sigma_{1}^{2} + \sigma_{2}^{2} + \left( {\mu_{1} - \mu_{2} } \right)^{2} }},\quad \hat{\rho }_{L} = \frac{{2S_{12} }}{{S_{1}^{2} + S_{2}^{2} + \left( {\overline{x}_{1} - \overline{x}_{2} } \right)^{2} }},\) (6)
where \(S_{i}^{2}\) and S12 are the biased estimators of the variances and covariance, respectively (both with denominator n), and \(\overline{x}_{i}\) are the sample means. As mentioned in the Introduction, the quadratic weighting has the advantage of achieving that κC = ρL = ρI2 and that \(\hat{\rho }_{L} = \hat{\kappa }_{C}\). On the other hand, Carrasco and Jover (2003) replaced the values of \(\sigma_{i}^{2}\), σ12 and (μ1 − μ2)2 by their unbiased estimators \(s_{i}^{2}\), s12 (their sample variances and covariance with denominator n − 1) and \(\widehat{{\left( {\mu_{1} - \mu_{2} } \right)^{2} }}\) = (\(\overline{x}_{1}\) − \(\overline{x}_{2}\))2 − (\(s_{1}^{2}\) + \(s_{2}^{2}\) − 2s12)/n in the first expression of (6), which led to the following estimator \(\hat{\rho }_{LU}\) of ρL,

\(\hat{\rho }_{LU} = \frac{{2s_{12} }}{{s_{1}^{2} + s_{2}^{2} + \widehat{{\left( {\mu_{1} - \mu_{2} } \right)^{2} }}}}.\) (7)
Note that \(\hat{\rho }_{LU}\) = n\(\hat{\rho }_{L}\)/{(n − 1) + \(\hat{\rho }_{L}\)}, which is the same function of expression (4) that relates \(\hat{\kappa }_{CU}\) with \(\hat{\kappa }_{C}\). Therefore, as \(\hat{\kappa }_{C}\) = \(\hat{\rho }_{L}\), then \(\hat{\kappa }_{CU}\) = \(\hat{\rho }_{LU}\) and the two new estimators of ρL and κC (quadratic weights) are the same. Additionally, \(\hat{\rho }_{LU}\) ≥ \(\hat{\rho }_{L}\) if \(\hat{\rho }_{L}\) ≥ 0. In “Appendix 3” it is proved that \(\hat{\rho }_{LU}\) = \(\hat{\rho }_{I2}\), thus \(\hat{\kappa }_{CU}\) = \(\hat{\rho }_{LU}\) = \(\hat{\rho }_{I2}\).
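The relation \(\hat{\rho }_{LU} = n\hat{\rho }_{L} /\{ (n - 1) + \hat{\rho }_{L} \}\) gives a direct way to compute the corrected CCC. A minimal Python sketch (the function names and data are ours, for illustration only):

```python
def lin_ccc(x, y):
    """Lin's concordance correlation coefficient, with biased
    (denominator n) variance and covariance estimators."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx2 = sum((v - mx) ** 2 for v in x) / n
    sy2 = sum((v - my) ** 2 for v in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sx2 + sy2 + (mx - my) ** 2)

def lin_ccc_unbiased(x, y):
    """Carrasco-Jover corrected CCC via rho_LU = n*rho_L/((n-1)+rho_L)."""
    n, r = len(x), lin_ccc(x, y)
    return n * r / ((n - 1) + r)

x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 6]
assert lin_ccc(x, x) == 1.0                     # perfect agreement
assert lin_ccc_unbiased(x, y) >= lin_ccc(x, y)  # larger when agreement >= 0
```

With quadratic weights, this `lin_ccc_unbiased` value would also equal \(\hat{\kappa }_{CU}\), per the equality stated above.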
2.3 Scott's pi
Scott (1955) defines the following measure of agreement

\(\kappa_{S} = \frac{{I_{o} - I_{e} }}{{1 - I_{e} }},\quad I_{e} = \sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} \pi_{i} \pi_{j} } } ,\quad \pi_{i} = \frac{{p_{i \cdot } + p_{ \cdot i} }}{2},\) (8)

and proposes to estimate it by

\(\hat{\kappa }_{S} = \frac{{\hat{I}_{o} - \hat{I}_{e} }}{{1 - \hat{I}_{e} }},\quad \hat{I}_{e} = \sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} \hat{\pi }_{i} \hat{\pi }_{j} } } ,\quad \hat{\pi }_{i} = \frac{{\hat{p}_{i \cdot } + \hat{p}_{ \cdot i} }}{2}.\) (9)
As indicated in “Appendix 1”, \(\hat{\pi }_{i} \hat{\pi }_{j}\) is not an unbiased estimator of πiπj since

\(E\left( {\hat{\pi }_{i} \hat{\pi }_{j} } \right) = \left[ {\left( {n - 1} \right)\pi_{i} \pi_{j} + \left( {2\delta_{ij} \pi_{i} + p_{ij} + p_{ji} } \right)/4} \right]/n.\) (10)

Therefore, E(\(\hat{I}_{e}\)) = \(\sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} E\left( {\hat{\pi }_{i} \hat{\pi }_{j} } \right)} }\) = {(n − 1)Ie + (1 + Io)/2}/n, assuming that wij = wji, and \(\hat{I}_{e}\) is not an unbiased estimator of Ie. From expression (10) it is deduced that the unbiased estimators of πiπj and Ie are

\(\frac{{n\hat{\pi }_{i} \hat{\pi }_{j} - \left( {2\delta_{ij} \hat{\pi }_{i} + \hat{p}_{ij} + \hat{p}_{ji} } \right)/4}}{n - 1},\quad \hat{I}_{eU} = \frac{{n\hat{I}_{e} - \left( {1 + \hat{I}_{o} } \right)/2}}{n - 1},\) (11)
respectively. Therefore, the new estimator \(\hat{\kappa }_{SU}\) of κS will be

\(\hat{\kappa }_{SU} = \frac{{\hat{I}_{o} - \hat{I}_{eU} }}{{1 - \hat{I}_{eU} }} = \frac{{\left( {2n - 1} \right)\hat{\kappa }_{S} + 1}}{{\left( {2n - 1} \right) + \hat{\kappa }_{S} }},\) (12)
and its variance, as deduced in “Appendix 2”, is

\(V\left( {\hat{\kappa }_{SU} } \right) = \left[ {\frac{{\left\{ {\left( {2n - 1} \right) - \kappa_{S} } \right\}^{2} }}{{4n\left( {n - 1} \right)}}} \right]^{2} V\left( {\hat{\kappa }_{S} } \right).\) (13)
Because of expression (11), \(\hat{I}_{eU} - \hat{I}_{e}\) is proportional to \(- \left\{ {\left( {1 - \hat{I}_{e} } \right) + \left( {\hat{I}_{o} - \hat{I}_{e} } \right)} \right\}\) which is also proportional to \(- \left\{ {1 + \hat{\kappa }_{S} } \right\}\) ≤ 0 if and only if \(\hat{\kappa }_{S}\) ≥ − 1. As \(\hat{\kappa }_{S}\) decreases with \(\hat{I}_{e}\), then \(\hat{\kappa }_{SU}\) ≥ \(\hat{\kappa }_{S}\) in the case of a positive agreement.
2.4 Krippendorf's alpha
Krippendorf (1970, 2004) proposed to estimate κS as in expression (9), but with a small-sample correction for \(\hat{I}_{o}\), though Gwet (2021b, p. 65) considers that “The need for such an adjustment and its potential benefits have not been documented”. The new estimator is

\(\hat{\kappa }_{K} = \frac{{\hat{I}_{oC} - \hat{I}_{e} }}{{1 - \hat{I}_{e} }},\) (14)

where \(\hat{I}_{oC} = \hat{I}_{o} + \left( {1 - \hat{I}_{o} } \right)/2n\); therefore,

\(\hat{\kappa }_{K} = \frac{{\left( {2n - 1} \right)\hat{\kappa }_{S} + 1}}{2n},\quad \hat{\kappa }_{KU} = \frac{{\left( {2n - 1} \right)\hat{\kappa }_{SU} + 1}}{2n}.\) (15)
The first expression follows from expressions (9) and (14); the second is obtained by replacing \(\hat{I}_{e}\) with the value of \(\hat{I}_{eU}\) in expression (11). From expressions (15) it is deduced that \(\hat{\kappa }_{K}\) ≥ \(\hat{\kappa }_{S}\) and \(\hat{\kappa }_{KU}\) ≥ \(\hat{\kappa }_{SU}\). Also, as for positive degrees of agreement it occurs that \(\hat{\kappa }_{SU}\) ≥ \(\hat{\kappa }_{S}\), then, due to expressions (15), \(\hat{\kappa }_{KU}\) ≥ \(\hat{\kappa }_{K}\). Finally, if in the first expression of Eq. (15) \(\hat{\kappa }_{S}\) is replaced by {(2n − 1)\(\hat{\kappa }_{SU}\) − 1}/{(2n − 1) − \(\hat{\kappa }_{SU}\)} (which is deduced from expression (12)), then \(\hat{\kappa }_{K}\) = 2(n − 1)\(\hat{\kappa }_{SU}\)/{(2n − 1) − \(\hat{\kappa }_{SU}\)} and \(\hat{\kappa }_{SU}\) ≥ \(\hat{\kappa }_{K}\) if \(\hat{\kappa }_{SU}\) ≥ 0. The overall conclusion is that \(\hat{\kappa }_{S}\) ≤ \(\hat{\kappa }_{K}\) ≤ \(\hat{\kappa }_{SU}\) ≤ \(\hat{\kappa }_{KU}\) for positive degrees of agreement.
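The chain \(\hat{\kappa }_{S}\) ≤ \(\hat{\kappa }_{K}\) ≤ \(\hat{\kappa }_{SU}\) ≤ \(\hat{\kappa }_{KU}\) can be checked numerically. The sketch below (our own code, unweighted case, hypothetical table) computes the four estimators from a two-rater table:

```python
def scott_family(table):
    """kappa_S, kappa_K (Krippendorf) and their bias-corrected versions,
    unweighted two-rater case."""
    n = sum(sum(row) for row in table)
    K = len(table)
    p = [[o / n for o in row] for row in table]
    I_o = sum(p[i][i] for i in range(K))
    # pi_i = (p_i. + p_.i)/2, the averaged marginal of Scott
    pi = [(sum(p[i]) + sum(p[j][i] for j in range(K))) / 2 for i in range(K)]
    I_e = sum(v * v for v in pi)
    I_oC = I_o + (1 - I_o) / (2 * n)            # Krippendorf's corrected I_o
    I_eU = (n * I_e - (1 + I_o) / 2) / (n - 1)  # unbiased estimator of I_e
    k_S = (I_o - I_e) / (1 - I_e)
    k_K = (I_oC - I_e) / (1 - I_e)
    k_SU = (I_o - I_eU) / (1 - I_eU)
    k_KU = (I_oC - I_eU) / (1 - I_eU)
    return k_S, k_K, k_SU, k_KU

k_S, k_K, k_SU, k_KU = scott_family([[20, 5], [10, 15]])  # n = 50
assert k_S <= k_K <= k_SU <= k_KU        # the ordering stated in the text
# relation kappa_K = 2(n-1)kappa_SU / {(2n-1) - kappa_SU}, also from the text
assert abs(k_K - 2 * 49 * k_SU / (99 - k_SU)) < 1e-12
```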
Regarding the variance, it is sufficient to use the first part of the second expression of (15) and then replace V(\(\hat{\kappa }_{SU}\)) with its value in expression (13); thus

\(V\left( {\hat{\kappa }_{KU} } \right) = \left( {\frac{2n - 1}{{2n}}} \right)^{2} V\left( {\hat{\kappa }_{SU} } \right).\)
2.5 Gwet's AC1/2
Gwet (2008) defines the following measure for the case of AC2 (AC1 refers to the unweighted case),

\(\kappa_{G} = \frac{{I_{o} - I_{e} }}{{1 - I_{e} }},\quad I_{e} = \frac{{W\left( {1 - \sum\nolimits_{i} {\pi_{i}^{2} } } \right)}}{{K\left( {K - 1} \right)}},\quad W = \sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} } } ,\) (16)

and proposes to estimate it by

\(\hat{\kappa }_{G} = \frac{{\hat{I}_{o} - \hat{I}_{e} }}{{1 - \hat{I}_{e} }},\quad \hat{I}_{e} = \frac{{W\left( {1 - \sum\nolimits_{i} {\hat{\pi }_{i}^{2} } } \right)}}{{K\left( {K - 1} \right)}},\) (17)
where πi and \(\hat{\pi }_{i}\) are obtained as in expressions (8) and (9). Once again, \(\hat{I}_{e}\) is not an unbiased estimator of Ie, because \(\hat{\pi }_{i}^{2}\) is not an unbiased estimator of \(\pi_{i}^{2}\) either. Using the first expression of (11) to estimate \(\pi_{i}^{2}\) in an unbiased way, we obtain that the unbiased estimators of \(\pi_{i}^{2}\) and Ie are, respectively,

\(\frac{{n\hat{\pi }_{i}^{2} - \left( {\hat{\pi }_{i} + \hat{p}_{ii} } \right)/2}}{n - 1},\quad \hat{I}_{eU} = \frac{{n\hat{I}_{e} - W\left( {1 - \sum\nolimits_{i} {\hat{p}_{ii} } } \right)/\left\{ {2K\left( {K - 1} \right)} \right\}}}{n - 1}.\) (18)
Therefore, the new estimator \(\hat{\kappa }_{GU}\) of κG will be

\(\hat{\kappa }_{GU} = \frac{{\hat{I}_{o} - \hat{I}_{eU} }}{{1 - \hat{I}_{eU} }}.\) (19)
In “Appendix 1” it is proved that \(\hat{I}_{eU} - \hat{I}_{e}\) ≥ 0, so it always happens that \(\hat{\kappa }_{GU}\) ≤ \(\hat{\kappa }_{G}\). It can be observed that it is not feasible to determine V(\(\hat{\kappa }_{GU}\)) directly from the value of V(\(\hat{\kappa }_{G}\)).
3 Case of multi-raters
Let there be n subjects (s = 1, 2, …, n) classified by R raters (r = 1, 2, …, R) in K types (i = 1, 2, …, K). Let xsr = 1, 2, …, K be the answer of rater r in subject s, values that are usually presented in a two-dimensional table in which the subjects are in rows and the raters in columns. For each row (subject), let Ris be the number of raters that answer i in subject s; obviously Ri+ = ΣsRis is the total number of i answers (for every rater), R+s = ΣiRis = R and R++ = ΣiΣsRis = nR. For each column (rater), let nir be the number of subjects classified as i by rater r; obviously n+r = Σinir = n, ni+ = Σrnir = Ri+ is the total number of i answers and n++ = ΣiΣrnir = nR = R++. The results of Ris and nir are usually presented as in Table 3(a) and (b) respectively.
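The counts Ris and nir can be tabulated directly from the answers xsr; a small Python sketch (the data are hypothetical, not from the article):

```python
def rating_counts(x, K):
    """From the n x R table x[s][r] in {1,...,K}, build
    R_is (raters assigning category i to subject s) and
    n_ir (subjects classified as i by rater r)."""
    n, R = len(x), len(x[0])
    R_is = [[sum(1 for r in range(R) if x[s][r] == i + 1) for s in range(n)]
            for i in range(K)]
    n_ir = [[sum(1 for s in range(n) if x[s][r] == i + 1) for r in range(R)]
            for i in range(K)]
    return R_is, n_ir

# answers of R = 3 raters on n = 4 subjects, K = 2 categories
x = [[1, 1, 1], [1, 1, 2], [2, 2, 2], [1, 2, 2]]
R_is, n_ir = rating_counts(x, 2)
assert R_is == [[3, 2, 0, 1], [0, 1, 3, 2]]
assert all(sum(col) == 3 for col in zip(*R_is))  # R_+s = R for every subject
```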
3.1 Pairwise methods and the observed index of agreement
To define and estimate the measures for the R > 2 case, pairwise methods will be used. These methods in some way average what happens in the R(R − 1) possible ordered pairs of raters (r, r′), with r, r′ = 1, 2, …, R and r ≠ r′. This obliges us to change the notation used in Sect. 2, since it is necessary to indicate, for each parameter, from which pair (r, r′) its value comes. Parameters pij, pi· and p·j of Sect. 2 will now be notated as pir,jr′, pir and pjr′, respectively. Additionally, we define the new parameter pi+ = Σrpir = Σr′pir′, which is the sum over raters of the proportions of i answers. The same applies to the estimated values: \(\hat{p}_{ij}\) becomes \(\hat{p}_{ir,jr^{\prime}}\), etc. Note that the estimators \(\hat{p}_{ir}\) of pir and \(\hat{p}_{i + }\) of pi+ are

\(\hat{p}_{ir} = \frac{{n_{ir} }}{n},\quad \hat{p}_{i + } = \sum\nolimits_{r} {\hat{p}_{ir} } = \frac{{R_{i + } }}{n},\) (20)
respectively, where Σi\(\hat{p}_{ir}\) = 1 and ΣrΣi\(\hat{p}_{ir}\) = R. Parameters κ, Io and Ie of Sect. 2 will now be denoted as κ(r, r′), Io(r, r′) and Ie(r, r′), respectively; therefore

\(\kappa \left( {r,r^{\prime}} \right) = \frac{{I_{o} \left( {r,r^{\prime}} \right) - I_{e} \left( {r,r^{\prime}} \right)}}{{1 - I_{e} \left( {r,r^{\prime}} \right)}},\quad I_{o} \left( {r,r^{\prime}} \right) = \sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} p_{ir,jr^{\prime}} } } ,\) (21)
and the same for the estimated values \(\hat{\kappa }\left( {r,r^{\prime}} \right)\) etc.
With pairwise methods there are several ways to average the results over the pairs of raters (r, r′), but all procedures of interest define the global value of Io as

\(I_{o} = \frac{1}{{R\left( {R - 1} \right)}}\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {I_{o} \left( {r,r^{\prime}} \right)} } ,\) (22)
thus Io = ΣrΣr′≠rΣiΣjwijpir,jr′/{R(R − 1)}. As is traditional, the measure of global agreement will be κ = (Io − Ie)/(1 − Ie), where Ie is yet to be defined. If Ie is defined in a similar way to Io,

\(I_{e} = \frac{1}{{R\left( {R - 1} \right)}}\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {I_{e} \left( {r,r^{\prime}} \right)} } ,\) (23)
we say that the procedure that defines the global κ is a “two-pairwise” procedure, and the population coefficient thereby obtained will be

\(\kappa_{2} = \frac{{I_{o} - I_{e} }}{{1 - I_{e} }} = \frac{{\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\left\{ {I_{o} \left( {r,r^{\prime}} \right) - I_{e} \left( {r,r^{\prime}} \right)} \right\}} } }}{{\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\left\{ {1 - I_{e} \left( {r,r^{\prime}} \right)} \right\}} } }}.\)
It can be noticed that κ2 is also obtained by dividing the sum of all the possible numerators (ΣrΣr′≠r) from expression (21) by the sum of all possible denominators, which indicates that κ2 is the weighted average of the R(R − 1) values of κ(r, r′) (the weights being the denominators). This procedure is the one recommended by Janson and Olsson (2001), Conger (1980) and Gwet (2021b). Notice that ΣrΣr′≠rIo(r, r′) = 2ΣrΣr′>rIo(r, r′), and similarly with Ie. We have preferred to use the first expression because it facilitates some proofs, but for calculations the second expression seems preferable. All of the above also applies to the case of the estimated values.
As the base values of Io and \(\hat{I}_{o}\) are the same in every κ measure, they should be specified; their values are (see “Appendix 1”)

\(I_{o} = \frac{{\sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} \sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {p_{ir,jr^{\prime}} } } } } }}{{R\left( {R - 1} \right)}},\quad \hat{I}_{o} = \frac{{\sum\nolimits_{s} {\sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} R_{is} \left( {R_{js} - \delta_{ij} } \right)} } } }}{{nR\left( {R - 1} \right)}}.\)
3.2 Hubert's kappa pairwise and the intraclass and concordance correlation coefficients
The κH coefficient of Hubert (Hubert 1977; Conger 1980) is a two-pairwise coefficient, which is why expression (23) can be applied with Cohen's value of Ie(r, r′). Adjusting expression (1) to the current format, Ie(r, r′) = ΣiΣjwijpirpjr′ and, due to “Appendix 1”,

\(I_{e} = \frac{1}{{R\left( {R - 1} \right)}}\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} p_{ir} p_{jr^{\prime}} } } } } .\)

Using expressions (20), the following estimator is obtained

\(\hat{I}_{e} = \frac{1}{{R\left( {R - 1} \right)}}\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} \hat{p}_{ir} \hat{p}_{jr^{\prime}} } } } } .\)
It can be observed that for R = 2 it occurs that κC = κH and \(\hat{\kappa }_{C} = \hat{\kappa }_{H}\). In order to obtain an unbiased estimator of Ie, the second expression of (3), applied with the current notation, indicates that \(\hat{I}_{eU} \left( {r,r^{\prime}} \right) = \left\{ {n\hat{I}_{e} \left( {r,r^{\prime}} \right) - \hat{I}_{o} \left( {r,r^{\prime}} \right)} \right\}/\left( {n - 1} \right)\); therefore R(R − 1)\(\hat{I}_{eU}\) = \(\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\hat{I}_{eU} \left( {r,r^{\prime}} \right)} } = \left\{ {n\sum\nolimits_{r} {\sum\nolimits_{r^{\prime}} {\hat{I}_{e} \left( {r,r^{\prime}} \right)} } - \sum\nolimits_{r} {\sum\nolimits_{r^{\prime}} {\hat{I}_{o} \left( {r,r^{\prime}} \right)} } } \right\}/\left( {n - 1} \right)\) and so \(\hat{I}_{eU}\) = (n\(\hat{I}_{e}\) − \(\hat{I}_{o}\))/(n − 1). As this expression is the same as the second expression of (3), the conclusions of Sect. 2.2 remain valid, changing the letter C to the letter H. Thus,

\(\hat{\kappa }_{HU} = \frac{{\hat{I}_{o} - \hat{I}_{eU} }}{{1 - \hat{I}_{eU} }} = \frac{{n\hat{\kappa }_{H} }}{{\left( {n - 1} \right) + \hat{\kappa }_{H} }},\) (26)
and \(\hat{\kappa }_{HU}\) ≥ \(\hat{\kappa }_{H}\) in the case of positive agreement.
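As a numerical check, the following sketch (our own code, unweighted case, hypothetical data) computes \(\hat{\kappa }_{H}\) and \(\hat{\kappa }_{HU}\) from the raw answers, using \(\hat{I}_{eU} = (n\hat{I}_{e} - \hat{I}_{o} )/(n - 1)\):

```python
def hubert_kappa(x, K):
    """Unweighted pairwise Hubert kappa and its bias-corrected version,
    from the n x R answers x[s][r] in {1,...,K}."""
    n, R = len(x), len(x[0])
    # p_ir: proportion of subjects classified as i by rater r
    p = [[sum(1 for s in range(n) if x[s][r] == i + 1) / n for r in range(R)]
         for i in range(K)]
    # pairwise observed agreement, averaged over ordered pairs (r, r')
    I_o = sum(sum(1 for s in range(n) if x[s][r] == x[s][r2])
              for r in range(R) for r2 in range(R) if r2 != r) / (n * R * (R - 1))
    # pairwise chance agreement from the products of marginals
    I_e = sum(p[i][r] * p[i][r2]
              for i in range(K) for r in range(R) for r2 in range(R)
              if r2 != r) / (R * (R - 1))
    k_H = (I_o - I_e) / (1 - I_e)
    I_eU = (n * I_e - I_o) / (n - 1)     # unbiased estimator of I_e
    k_HU = (I_o - I_eU) / (1 - I_eU)
    return k_H, k_HU

x = [[1, 1, 1], [1, 1, 2], [2, 2, 2], [1, 2, 2]]   # n = 4, R = 3, K = 2
k_H, k_HU = hubert_kappa(x, 2)
assert abs(k_H - 5 / 13) < 1e-12
assert abs(k_HU - 5 / 11) < 1e-12
assert abs(k_HU - 4 * k_H / (3 + k_H)) < 1e-12     # relation of the text
```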
Generalizing the first expression of (6) to the case of two raters r and r′ with answers xr and xr′, means μr and μr′, variances \(\sigma_{r}^{2}\) and \(\sigma_{r^{\prime}}^{2}\), and covariance σrr′, we obtain \(\rho_{L} \left( {r,r^{\prime}} \right) = 2\sigma_{rr^{\prime}} /\left\{ {\sigma_{r}^{2} + \sigma_{r^{\prime}}^{2} + \left( {\mu_{r} - \mu_{r^{\prime}} } \right)^{2} } \right\}\). If we apply to this expression the two-pairwise criterion, which consists of adding ΣrΣr′≠r in the numerator and in the denominator, the CCC ρL of Lin (1989, 2000) and Barnhart et al. (2002) is obtained for the case of multi-raters; its estimated value \(\hat{\rho }_{L}\) is obtained in the same way as the second expression of (6). In this way,

\(\rho_{L} = \frac{{2\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\sigma_{rr^{\prime}} } } }}{{\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\left\{ {\sigma_{r}^{2} + \sigma_{r^{\prime}}^{2} + \left( {\mu_{r} - \mu_{r^{\prime}} } \right)^{2} } \right\}} } }}.\)
Carrasco and Jover (2003) justified that \(\hat{\rho }_{L}\) is based on biased estimators and proposed the following estimator, which is based on unbiased estimators (\(s_{rr^{\prime}}\) and \(s_{r}^{2}\))
It is easy to see that the same thing can be obtained by applying the two-pairwise method to the first expression of (7). Since for R = 2 it occurred that κC = ρL and \(\hat{\kappa }_{C} = \hat{\rho }_{L}\) when the weights were quadratic, and in both cases the value for R > 2 is obtained in the same way (the sum of the numerators divided by the sum of the denominators), then also κH = ρL and \(\hat{\kappa }_{H} = \hat{\rho }_{L}\) in the case of R > 2. Additionally, κHR = κH = ρL = ρI2 since ρL = ρI2 (Carrasco and Jover 2003) and κHR = ρL (Martín Andrés and Álvarez Hernández 2020). Furthermore, as \(\hat{\rho }_{LU} = n\hat{\rho }_{L} /\left\{ {\left( {n - 1} \right) + \hat{\rho }_{L} } \right\}\) (an expression which has the same form as (26)), then also
where the last two equalities are demonstrated in the “Appendix 3”. In the last expression, which is simpler for the calculation, it is understood that xs· = \(\sum\nolimits_{r} {x_{sr} }\), x·r = \(\sum\nolimits_{s} {x_{sr} }\), and x·· = \(\sum\nolimits_{s} {\sum\nolimits_{r} {x_{sr} } }\). Something similar happens with the estimators based on the biased estimation of their components (see “Appendix 3”),
3.3 Fleiss' kappa pairwise
Fleiss (1971) extended κS to the case of R > 2 by defining the following value of Ie, which is not of the two-pairwise type,

\(I_{e} = \sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} \pi_{i} \pi_{j} } } ,\quad \pi_{i} = \frac{{p_{i + } }}{R},\) (31)
and proposes the following estimators

\(\hat{\kappa }_{F} = \frac{{\hat{I}_{o} - \hat{I}_{e} }}{{1 - \hat{I}_{e} }},\quad \hat{I}_{e} = \sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} \hat{\pi }_{i} \hat{\pi }_{j} } } ,\quad \hat{\pi }_{i} = \frac{{\hat{p}_{i + } }}{R},\) (32)
since pi+ is estimated as in the second expression of Eq. (20). As indicated in “Appendix 1”, \(\hat{I}_{e}\) is not an unbiased estimator of Ie since nE(\(\hat{I}_{e}\)) = (n − 1)Ie + R−1{1 + (R − 1)Io}. This is why the unbiased estimator \(\hat{I}_{eU}\) of Ie and the new estimator \(\hat{\kappa }_{FU}\) of κF will be

\(\hat{I}_{eU} = \frac{{n\hat{I}_{e} - R^{ - 1} \left\{ {1 + \left( {R - 1} \right)\hat{I}_{o} } \right\}}}{n - 1},\quad \hat{\kappa }_{FU} = \frac{{\hat{I}_{o} - \hat{I}_{eU} }}{{1 - \hat{I}_{eU} }}.\) (33)
Its variance, as deduced in “Appendix 2”, is
Through the first expression of Eq. (33), \(\hat{I}_{eU} - \hat{I}_{e}\) is proportional to \(\hat{I}_{e}\) − R−1{1 + (R − 1) \(\hat{I}_{o}\)}, which is also proportional to − {1 + (R − 1)\(\hat{\kappa }_{F}\)} ≤ 0 if and only if \(\hat{\kappa }_{F}\) ≥ − (R − 1)−1. Therefore, \(\hat{\kappa }_{FU}\) ≥ \(\hat{\kappa }_{F}\) in the case of positive agreement.
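Analogously, \(\hat{\kappa }_{F}\) and \(\hat{\kappa }_{FU}\) can be computed from the Ris counts using \(\hat{I}_{eU} = [n\hat{I}_{e} - R^{ - 1} \{ 1 + (R - 1)\hat{I}_{o} \} ]/(n - 1)\); a Python sketch with hypothetical data (unweighted case, our own code):

```python
def fleiss_kappa(x, K):
    """Unweighted Fleiss kappa and its bias-corrected version,
    from the n x R answers x[s][r] in {1,...,K}."""
    n, R = len(x), len(x[0])
    # c[s][i] = R_is: number of raters assigning category i to subject s
    c = [[sum(1 for r in range(R) if x[s][r] == i + 1) for i in range(K)]
         for s in range(n)]
    I_o = sum(c[s][i] * (c[s][i] - 1) for s in range(n) for i in range(K)) \
          / (n * R * (R - 1))
    pi = [sum(c[s][i] for s in range(n)) / (n * R) for i in range(K)]
    I_e = sum(v * v for v in pi)
    k_F = (I_o - I_e) / (1 - I_e)
    I_eU = (n * I_e - (1 + (R - 1) * I_o) / R) / (n - 1)  # unbiased estimator
    k_FU = (I_o - I_eU) / (1 - I_eU)
    return k_F, k_FU

x = [[1, 1, 1], [1, 1, 2], [2, 2, 2], [1, 2, 2]]  # n = 4, R = 3, K = 2
k_F, k_FU = fleiss_kappa(x, 2)
assert abs(k_F - 1 / 3) < 1e-12
assert abs(k_FU - 7 / 16) < 1e-12
assert k_FU >= k_F   # positive agreement case
```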
Another way of extending κS is to use the two-pairwise method. In this case, in “Appendix 1” it is demonstrated that
and therefore its estimated values in a traditional way would be
In order to obtain the unbiased estimator of Ie, the second expression of Eq. (11) is, in the current terms, \(\hat{I}_{eU} \left( {r,r^{\prime}} \right)\) = [\(n\hat{I}_{e} \left( {r,r^{\prime}} \right)\) − {1 + \(\hat{I}_{o} \left( {r,r^{\prime}} \right)\)}/2]/(n − 1). Applying expressions (22) and (23), it is obtained that the second expression of Eq. (11) also applies to the current case, in such a way that the conclusions obtained in the case of Scott's pi are valid, changing the letter S to F2. In this way
and \(\hat{\kappa }_{F2U}\) ≥ \(\hat{\kappa }_{F2}\) when \(\hat{\kappa }_{F2}\) ≥ 0. Nevertheless, to the best of our knowledge, the value of V(\(\hat{\kappa }_{F2}\)) is not known.
3.4 Krippendorf's multi-rater alpha
Now the objective is similar to that of Sect. 2.4: to estimate κF as in expression (32), but replacing the value of \(\hat{I}_{o}\) by a value \(\hat{I}_{oC}\) defined as in expression (14). In this way
Given the formal equality of the expressions, all of the previous conclusions can be accepted, with the necessary changes. In particular,
In a similar way for the two-pairwise method, where now
Therefore, expressions (36) to (38) are also valid putting number “2” after the letters K or F in the sub-indexes of these expressions.
3.5 Gwet's multi-rater AC1/2
For the case of multi-raters, Gwet (2008) defined the same measures of agreement AC1/2, κG and \(\hat{\kappa }_{G}\), of expressions (16) and (17) respectively, but with πi and \(\hat{\pi }_{i}\) alluding to the Fleiss values of expressions (31) and (32), respectively. Therefore, Ie = W\(\left( {1 - \sum\nolimits_{i} {\pi_{i}^{2} } } \right)\)/{K(K − 1)} = W\(\left( {1 - \sum\nolimits_{i} {p_{i + }^{2} } /R^{2} } \right)\)/{K(K − 1)} and

\(\hat{\kappa }_{G} = \frac{{\hat{I}_{o} - \hat{I}_{e} }}{{1 - \hat{I}_{e} }},\quad \hat{I}_{e} = \frac{{W\left( {1 - \sum\nolimits_{i} {\hat{p}_{i + }^{2} } /R^{2} } \right)}}{{K\left( {K - 1} \right)}}.\)
"Appendix 1" demonstrates that \(\hat{\pi }_{i}^{2}\) is not an unbiased estimator of \(\pi_{i}^{2}\) (see expression (48)), so that \(\hat{I}_{e}\) is also not an unbiased estimator of Ie; the same Appendix justifies that the unbiased estimator \(\hat{I}_{eU}\) of Ie is
Therefore, the new estimator \(\hat{\kappa }_{GU}\) of κG will be,
It can be observed that now it is not viable to determine V(\(\hat{\kappa }_{GU}\)) directly from the value of V(\(\hat{\kappa }_{G}\)). “Appendix 1” demonstrates that \(\hat{I}_{eU} - \hat{I}_{e}\) ≥ 0, so that now we also find that \(\hat{\kappa }_{GU}\) ≤ \(\hat{\kappa }_{G}\).
An alternative is to use the two-pairwise method. In this case, “Appendix 1” demonstrates that
and therefore its estimated (biased) values are, because of expression (20)
To obtain unbiased estimator of Ie, expression (18) is, in current terms, \(\hat{I}_{eU} \left( {r,r^{\prime } } \right)\) = [\(n\hat{I}_{e} \left( {r,r^{\prime } } \right)\) − W{1 − \(\sum\nolimits_{i} {\hat{p}_{{ir,ir^{\prime } }} }\)}/{2K(K − 1)}]/(n − 1). Applying expression (23) we obtain the value for the current \(\hat{I}_{eU}\), which provides the value of \(\hat{\kappa }_{G2U}\); i.e.
and \(\hat{I}_{oN}\) as in expression (40). Note that in this expression \(\hat{I}_{eU}\) has the same form as in expression (18), so that \(\hat{\kappa }_{G2U}\) can be put as a function of \(\hat{\kappa }_{G2}\) in a similar way to expression (19):
As in the case R = 2 it occurred that \(\hat{I}_{eU} \left( {r,r^{\prime } } \right)\) ≥ \(\hat{I}_{e} \left( {r,r^{\prime } } \right)\), through expression (23) it is deduced that in the current case \(\hat{I}_{eU}\) ≥ \(\hat{I}_{e}\); therefore \(\hat{\kappa }_{G2U}\) ≤ \(\hat{\kappa }_{G2}\). “Appendix 1” provides a more direct demonstration of the previous statement. To the best of our knowledge, the value of V(\(\hat{\kappa }_{G2}\)) is not known.
4 Examples
Table 1(a) contains the data from a classic example by Fleiss et al. (2003) in which R = 2 raters diagnose n = 100 individuals in K = 3 categories (Psychotic, Neurotic, and Organic). Its part (b) specifies the values of the eight kappa coefficients mentioned in Sect. 2, all of which are calculated for the non-weighted case (wij = δij). It can be observed that the eight coefficients verify the properties mentioned in Sect. 2; for example, all of the new estimators have a value greater than or equal to that of the classic ones, except in the case of the coefficient of Gwet, in which case the opposite happens. Nevertheless, the former differ only slightly from the latter, because the sample size (n = 100) is too large for the differences between the estimators to show. When the sample size is small (n = 8), as occurs in the example of Gwet (2021b, p. 109) in Table 2(a) (R = 2, K = 3), the differences are more evident, as shown by the results in Table 2(b).
For the case of more than two raters, Table 3(a) and (b) show the values of Ris and nir, respectively, values which are obtained from the data xsr in an example by Gwet (2021b, p. 341) related to the change in the coloring of Stickleback fish (R = 4, K = 5, n = 50). Table 3(c) shows the values of the fourteen kappa coefficients mentioned in Sect. 3, all of which are also calculated for the non-weighted case (wij = δij). It can be observed that the fourteen coefficients verify the properties mentioned in Sect. 3. It is also observed that, although the values of n and \(\hat{\kappa }\) are moderate, all of the new coefficients are greater than the classic ones by at least one unit in the second decimal place. The exception is the case of the two coefficients of Gwet, in which the differences obtained are very small.
5 Simulation
This section has two objectives. Firstly, to assess the bias of the two estimators of κX (\(\hat{\kappa }_{X}\) and \(\hat{\kappa }_{XU}\)) in the case of R = 2, where X refers to C, S, K or G. Secondly, to assess the behaviour of the estimator of the variance \(\hat{V}\left( {\hat{\kappa }_{CU} } \right)\), in order to show that the new variances behave coherently in relation to the classic ones.
To assess the two estimators, the procedure is as follows. Let us consider that the observed frequencies in Table 1(a), divided by n = 100, are the true probabilities pij of the problem mentioned, in which R = 2 and K = 3; for example, p11 = 75/100 = 0.75. In that case the value \(\hat{\kappa }_{C}\) = 0.676 of Table 1(b) becomes the population value κC = 0.676 of the Cohen kappa coefficient, since the values \(\hat{I}_{o}\) and \(\hat{I}_{e}\) of \(\hat{\kappa }_{C}\) become the values Io and Ie of κC. If we now extract N = 10,000 random samples from the multinomial distribution of parameters {pij, n = 100}, each sample will provide two estimators \(\hat{\kappa }_{Ch}\) and \(\hat{\kappa }_{CUh}\) of κC. The means \(\overline{\hat{\kappa }}_{C} = {{\Sigma_{h} \hat{\kappa }_{Ch} } \mathord{\left/ {\vphantom {{\Sigma_{h} \hat{\kappa }_{Ch} } N}} \right. \kern-0pt} N}\) and \(\overline{\hat{\kappa }}_{CU} = {{\Sigma_{h} \hat{\kappa }_{CUh} } \mathord{\left/ {\vphantom {{\Sigma_{h} \hat{\kappa }_{CUh} } N}} \right. \kern-0pt} N}\) of the values \(\hat{\kappa }_{Ch}\) and \(\hat{\kappa }_{CUh}\) should be approximately equal to κC = 0.676 if the estimators were unbiased. The results of this simulation are provided in the sixteenth line of results in Table 4. The rest of the lines, where other values of K, n, and κC are used, were obtained in a similar way. It can be seen that in general κC = \(\overline{\hat{\kappa }}_{CU}\) ≥ \(\overline{\hat{\kappa }}_{C}\), except in two cases in which κC > \(\overline{\hat{\kappa }}_{C}\) ≥ \(\overline{\hat{\kappa }}_{CU}\). Therefore, \(\hat{\kappa }_{CU}\) is less biased than \(\hat{\kappa }_{C}\) and, for the accuracy used, is generally unbiased. Nevertheless, \(\hat{\kappa }_{C}\) is only unbiased for values n ≥ 50 or 100, depending on the value of K.
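The procedure above can be sketched in code. The 2 × 2 probability table below is a hypothetical placeholder (not the Table 1 data), and the new estimator is computed through the relation \(\hat{\kappa }_{CU}\) = n\(\hat{\kappa }_{C}\)/(n − 1 + \(\hat{\kappa }_{C}\)), obtained by inverting the relation given in "Appendix 2":

```python
import numpy as np

rng = np.random.default_rng(0)

def kappa_cohen(tab):
    """Classic unweighted Cohen kappa from a K x K table of proportions."""
    Io = np.trace(tab)                       # observed index of agreements
    Ie = tab.sum(axis=1) @ tab.sum(axis=0)   # sum_i of p_i. * p_.i
    return (Io - Ie) / (1 - Ie)

def kappa_cohen_u(tab, n):
    """New estimator via kappa_CU = n kappa_C / (n - 1 + kappa_C)."""
    k = kappa_cohen(tab)
    return n * k / (n - 1 + k)

# hypothetical population probabilities p_ij (K = 2)
p = np.array([[0.40, 0.10],
              [0.05, 0.45]])
kappa_pop = kappa_cohen(p)                   # (0.85 - 0.50)/(1 - 0.50) = 0.70

n, N = 20, 10_000                            # small n makes the bias visible
mean_c = mean_cu = 0.0
for draw in rng.multinomial(n, p.ravel(), size=N):
    tab = draw.reshape(2, 2) / n
    mean_c += kappa_cohen(tab) / N
    mean_cu += kappa_cohen_u(tab, n) / N

print(kappa_pop, mean_c, mean_cu)
```

With n = 20 the mean of the classic estimator should fall slightly below the population value 0.70, while the mean of the corrected estimator should sit close to it.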
The same tables and previous simulations allow us to obtain the corresponding results of the other two pairs of estimators (see the rest of Table 4). In the case of Scott's pi coefficient, it is also observed that κS = \(\overline{\hat{\kappa }}_{SU}\) ≥ \(\overline{\hat{\kappa }}_{S}\), except in four cases in which κS > \(\overline{\hat{\kappa }}_{SU}\) ≥ \(\overline{\hat{\kappa }}_{S}\), so that \(\hat{\kappa }_{SU}\) is also generally unbiased; additionally \(\overline{\hat{\kappa }}_{SU}\) = \(\overline{\hat{\kappa }}_{S}\) only for n = 100. The conclusions are a little different in the case of Krippendorf's alpha coefficient; in general it still occurs that κK = \(\overline{\hat{\kappa }}_{KU}\) ≥ \(\overline{\hat{\kappa }}_{K}\), except in five cases in which κK < \(\overline{\hat{\kappa }}_{KU}\) or κK > \(\overline{\hat{\kappa }}_{KU}\), in such a way that \(\hat{\kappa }_{KU}\) may also underestimate κK; now \(\overline{\hat{\kappa }}_{KU}\) = \(\overline{\hat{\kappa }}_{K}\) on some occasions when n ≥ 50. As can be seen, the three previous pairs of estimators are either unbiased or they underestimate the value of the population parameter. In the case of Gwet's AC1 coefficient, the opposite happens. In general κG = \(\overline{\hat{\kappa }}_{GU}\) ≤ \(\overline{\hat{\kappa }}_{G}\), except in four cases in which κG < \(\overline{\hat{\kappa }}_{GU}\) or κG > \(\overline{\hat{\kappa }}_{GU}\), so that both estimators are either unbiased or they overestimate the value of the population parameter. Now the equality \(\overline{\hat{\kappa }}_{GU}\) = \(\overline{\hat{\kappa }}_{G}\) generally happens when K > 2 and n ≥ 50.
The general conclusion is that the estimators \(\hat{\kappa }_{XU}\) are generally unbiased and, when they are biased, their bias is lower than that of the estimators \(\hat{\kappa }_{X}\). When there is bias, it is positive in the case of the Gwet coefficient, and is negative in the other three cases.
Let us now consider the case of variance. The classic estimator \(\hat{\kappa }_{C}\) has an unknown variance \(V_{E} \left( {\hat{\kappa }_{C} } \right)\) which can be estimated in a quite precise way through the sample variance \(\hat{V}_{E} \left( {\hat{\kappa }_{C} } \right)\) of the values \(\hat{\kappa }_{Ch}\) of the 10,000 simulations. Moreover, each simulation provides an estimator \(\hat{V}_{h} \left( {\hat{\kappa }_{C} } \right)\) of \(V_{E} \left( {\hat{\kappa }_{C} } \right)\) obtained through the formula of Fleiss et al. (1969); the average value \(\overline{\hat{V}}\left( {\hat{\kappa }_{C} } \right)\) of these 10,000 estimators, compared to \(\hat{V}_{E} \left( {\hat{\kappa }_{C} } \right)\), allows us to check the bias of this estimator of the variance. The same reasoning is used in the case of the estimator \(\hat{\kappa }_{CU}\), although now \(\hat{V}_{h} \left( {\hat{\kappa }_{CU} } \right)\) is obtained through expression (5). The results are in Table 5. It can be seen that \(\hat{V}_{E} \left( {\hat{\kappa }_{CU} } \right)\) ≈ \({\hat{\text{V}}}_{E} \left( {\hat{\kappa }_{C} } \right)\) for n ≥ 20, being in general \(V_{E} \left( {\hat{\kappa }_{CU} } \right)\) > ( <) \(V_{E} \left( {\hat{\kappa }_{C} } \right)\) when κC = 0.4 (0.8). It is also observed that the classic variance \(\overline{\hat{V}}\left( {\hat{\kappa }_{C} } \right)\) usually underestimates (overestimates) \(\hat{V}_{E} \left( {\hat{\kappa }_{C} } \right)\) when κC = 0.4 (0.8), the differences being small when n ≥ 50. However, the new variance \(\overline{\hat{V}}\left( {\hat{\kappa }_{CU} } \right)\) almost always underestimates \(\hat{V}_{E} \left( {\hat{\kappa }_{CU} } \right)\), the differences being small when n ≥ 50, but somewhat higher than in the previous case.
In general, \(\overline{\hat{V}}\left( {\hat{\kappa }_{C} } \right)\) is closer to \(\hat{V}_{E} \left( {\hat{\kappa }_{C} } \right)\) than \(\overline{\hat{V}}\left( {\hat{\kappa }_{CU} } \right)\) is to \(\hat{V}_{E} \left( {\hat{\kappa }_{CU} } \right)\).
6 Assessment of the difference between each pair of estimators
The objective of this section is to assess the difference ΔXU = ∣\(\hat{\kappa }_{XU}\) − \(\hat{\kappa }_{X}\)∣, where \(\hat{\kappa }_{X}\) is any of the traditional estimators. In general, these differences are only appreciable with small samples, so it is of interest to determine from what value of n onwards it is practically indifferent whether one calculates \(\hat{\kappa }_{XU}\) or \(\hat{\kappa }_{X}\).
For \(\hat{\kappa }_{CU}\), in which \(\hat{\kappa }_{CU}\) ≥ \(\hat{\kappa }_{C}\), through expression (4), ΔCU = \(\hat{\kappa }_{C}\)(1 − \(\hat{\kappa }_{C}\))/{(n − 1) + \(\hat{\kappa }_{C}\)}. Its maximum value in \(\hat{\kappa }_{C}\) ≥ 0 is reached at \(\hat{\kappa }_{C}\) = (n − 1)0.5/{n0.5 + (n − 1)0.5} and is {n0.5 + (n − 1)0.5}−2. Therefore, ΔCU < 0.01 (or 0.02) when n > 50 (or 17). The conclusion is also valid for ΔHU and ΔLU = ∣\(\hat{\rho }_{LU} - \hat{\rho }_{L}\)∣, since \(\hat{\kappa }_{HU}\) and \(\hat{\rho }_{LU}\) have the same form as \(\hat{\kappa }_{CU}\).
For \(\hat{\kappa }_{SU}\), in which \(\hat{\kappa }_{SU}\) ≥ \(\hat{\kappa }_{S}\), ΔSU = (1 − \(\hat{\kappa }_{S}^{2}\))/{(2n − 1) + \(\hat{\kappa }_{S}\)} through expression (12). Its maximum value in \(\hat{\kappa }_{S}\) ≥ 0 is reached at \(\hat{\kappa }_{S}\) = 0 and is 1/(2n − 1). Therefore, ΔSU < 0.01 (or 0.02) when n > 100 (or 33). The conclusion is also valid for ΔF2U, since \(\hat{\kappa }_{F2U}\) has the same form as \(\hat{\kappa }_{SU}\). The case of \(\hat{\kappa }_{KU}\) for R = 2 − last expression of Eq. (15) − provides a maximum for ΔKU of 1/2n and leads to the same conclusion as above. The conclusion is also maintained for \(\hat{\kappa }_{KU}\) in R > 2 and \(\hat{\kappa }_{K2U}\), since they have the same form as \(\hat{\kappa }_{KU}\) for R = 2.
The case of \(\hat{\kappa }_{FU}\), in which \(\hat{\kappa }_{FU}\) ≥ \(\hat{\kappa }_{F}\), is somewhat more complex. Through expression (33), ΔFU = (1 − \(\hat{\kappa }_{F}\)){R − (R − 1)(1 − \(\hat{\kappa }_{F}\))}/{Rn − (R − 1)(1 − \(\hat{\kappa }_{F}\))}. Its maximum value in \(\hat{\kappa }_{F}\) ≥ 0 is reached at \(\hat{\kappa }_{F}\) = {(R − 1)(n − 1)0.5 − n0.5}/[(R − 1){n0.5 + (n − 1)0.5}] and is {R/(R − 1)} × {n0.5 + (n − 1)0.5}−2. Note that for R = 2 this value is twice that obtained for \(\hat{\kappa }_{CU}\). Therefore, if we require that ΔFU < 0.01 (or 0.02), the value of n depends on the value of R. For example: n > 100 (or 33) for R = 2, n > 75 (or 25) for R = 3, n > 63 (or 21) for R = 5, and n > 56 (or 19) for R = 10. Moreover, ΔFU is a decreasing function in R, taking its extreme values \(\hat{\kappa }_{F}\)(1 − \(\hat{\kappa }_{F}\))/{(n − 1) + \(\hat{\kappa }_{F}\)} at R = ∞, and (1 − \(\hat{\kappa }_{F}^{2}\))/{(2n − 1) + \(\hat{\kappa }_{F}\)} at R = 2. As those expressions have the same form as ΔCU and ΔSU respectively, the precise minimum values of n for this case lie between the pairs of values indicated for those two cases. This is compatible with the numerical results above.
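As a numerical check of the maximiser and maximum just stated, a brute-force grid search over \(\hat{\kappa }_{F}\) can be compared with the closed forms; the values of n and R below are arbitrary:

```python
import math

def delta_FU(k, n, R):
    """Difference Delta_FU between corrected and classic Fleiss kappa."""
    return (1 - k) * (R - (R - 1) * (1 - k)) / (R * n - (R - 1) * (1 - k))

def argmax_closed_form(n, R):
    """Closed-form maximising kappa_F and maximum of Delta_FU."""
    s, t = math.sqrt(n), math.sqrt(n - 1)
    k_star = ((R - 1) * t - s) / ((R - 1) * (s + t))
    d_max = (R / (R - 1)) / (s + t) ** 2
    return k_star, d_max

n, R = 30, 3
k_star, d_max = argmax_closed_form(n, R)

# brute-force check over a fine grid of kappa_F in [0, 1]
d_grid = max(delta_FU(i / 100_000, n, R) for i in range(100_001))
print(k_star, d_max, d_grid)   # d_grid should match d_max to grid accuracy
```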
The case of \(\hat{\kappa }_{GU}\), in which \(\hat{\kappa }_{GU}\) ≤ \(\hat{\kappa }_{G}\), is much more complex since its values ΔGU also depend on \(\hat{I}_{e}\) because of expression (41). In the simplest situation (the unweighted case), it can be demonstrated that ΔGU ≤ {R/(R − 1)} × {m0.5 + (m − 1)0.5}−2, with m = (n − 1)(K − 1), an expression that depends on n, R and K; the bound is also valid for the weighted case, although it is then conservative. Therefore, if R = 2 and we require that ΔGU < 0.01 (or 0.02), the value of n depends on the value of K. For example: n > 101 (or 34) for K = 2, n > 51 (or 17) for K = 3, and n > 26 (or 9) for K = 5. The conclusion is also valid for ΔG2U, since \(\hat{\kappa }_{G2U}\) has the same form as \(\hat{\kappa }_{GU}\).
The previous formulas provide values which are compatible with the results of Tables 1, 2, 3 and 4. Excluding the Gwet estimators and adopting the criterion that we want to guarantee that ΔXU < 0.02 (0.01), the overall conclusion is that we should use the new estimators at least when n ≤ 17 (50) in the case of \(\hat{\kappa }_{CU}\) and \(\hat{\kappa }_{HU}\), or when n ≤ 33 (100) in the rest of the cases.
7 Conclusions
There are different types of kappa coefficients which measure the experimental degree of agreement between R raters. In this article, we have focused on Cohen's kappa (Cohen 1960, 1968), Scott's pi (Scott 1955), Gwet's AC1/2 (Gwet 2008) and Krippendorf's alpha (Krippendorf 1970, 2004) coefficients, whether weighted or not, for R = 2, and on their pairwise-type extensions, Hubert's kappa (Hubert 1977; Conger 1980), Fleiss's kappa (Fleiss 1971), Gwet's AC1/2 and Krippendorf's alpha coefficients, for R > 2. In this last case (R > 2), the four measures of agreement use the pairwise method to determine the observed index of agreements Io, but only Hubert's kappa also uses the pairwise method to determine the expected index of agreements Ie. We have called the measures obtained in this last way two-pairwise measures. We have also defined the other three coefficients (Fleiss's kappa, Gwet's AC1/2 and Krippendorf's alpha) from the two-pairwise point of view, thus obtaining the two-pairwise versions of these three coefficients (the two-pairwise Fleiss's kappa, etc.). This is why eleven agreement coefficients have been defined in total.
The article demonstrates that all of the traditional estimators of the eleven coefficients are based on biased estimators of Ie. The alternative is to use the eleven new proposed coefficients, which are based on unbiased estimators of Ie. In all cases, the traditional estimators are smaller than or equal to the new ones, except for the case of Gwet, where it is the other way around. The simulations carried out for the case of R = 2 show that the classic estimators \(\hat{\kappa }_{X}\) usually underestimate κX (or overestimate, in the case of X = G), while the new estimators \(\hat{\kappa }_{XU}\) are usually approximately unbiased. Additionally, it is verified that the new estimators \(\hat{\kappa }_{XU}\) may be unnecessary when the sample size n is sufficiently large (e.g. n > 30). The article also provides the variances of the new estimators as a function of the variances of the classic estimators, except in the case of the Gwet estimators.
One question of interest is the relation between the coefficients and estimators of Hubert's kappa (Hubert 1977; Conger 1980), the CCC (Lin 1989, 2000), and the ICC (Shrout and Fleiss 1979; Carrasco and Jover 2003), when in the first case quadratic weights are used. In the article it has been justified that: (1) κH = ρL = ρI2, with respect to the coefficients; (2) \(\hat{\kappa }_{H}\) = \(\hat{\rho }_{L}\), with respect to classical estimators based on biased estimators of the components of the coefficients; and (3) \(\hat{\kappa }_{HU}\) = \(\hat{\rho }_{LU}\) = \(\hat{\rho }_{I2}\), with respect to classical (\(\hat{\rho }_{LU}\) and \(\hat{\rho }_{I2}\)) or new (\(\hat{\kappa }_{HU}\)) estimators based on unbiased estimators of all components of the coefficients. These statements are true for R ≥ 2, so that for R = 2 it is obtained that: κC = ρL = ρI2, \(\hat{\kappa }_{C}\) = \(\hat{\rho }_{L}\), and \(\hat{\kappa }_{CU}\) = \(\hat{\rho }_{LU}\) = \(\hat{\rho }_{I2}\).
Finally, the entire article has been developed for the general case in which the measures are defined based on any wij weights, thus avoiding a repetition of expressions and demonstrations. Nevertheless, the non-weighted case (wij = δij) is very common. To make the text more reader-friendly, “Appendix 4” includes the eleven non-weighted coefficients mentioned in this article.
References
Barnhart HX, Haber M, Song J (2002) Overall concordance correlation coefficient for evaluating agreement among multiple observers. Biometrics 58:1020–1027. https://doi.org/10.1111/j.0006-341X.2002.01020.x
Carrasco JL, Jover LL (2003) Estimating the generalized concordance correlation coefficient through variance components. Biometrics 59:849–858
Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Measur 20:37–46
Cohen J (1968) Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 70:213–220
Conger AJ (1980) Integration and generalization of kappas for multiple raters. Psychol Bull 88:322–328. https://doi.org/10.1037/0033-2909.88.2.322
Davies M, Fleiss JL (1982) Measuring agreement for multinomial data. Biometrics 38:1047–1051
Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76:378–382
Fleiss JL, Cohen J (1973) The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Measur 33(3):613–619
Fleiss JL, Cohen J, Everitt BS (1969) Large sample standard errors of kappa and weighted kappa. Psychol Bull 72:323–327. https://doi.org/10.1037/h0028106
Fleiss JL, Levin B, Paik MC (2003) Statistical methods for rates and proportions, 3rd edn. Wiley, New York
Gwet KL (2008) Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol 61(1):29–68
Gwet KL (2021a) Large-sample variance of Fleiss generalized kappa. Educ Psychol Measur 81(4):781–790. https://doi.org/10.1177/0013164420973080
Gwet KL (2021b) Handbook of inter-rater reliability. Volume 1: analysis of categorical ratings, 5th edn. Gaithersburg, MD, USA
Hubert L (1977) Kappa revisited. Psychol Bull 84(2):289–297
Janson S, Olsson U (2001) A measure of agreement for interval or nominal multivariate observations. Educ Psychol Measur 61(2):277–289
King TS, Chinchilli VM (2001) A generalized concordance correlation coefficient for continuous and categorical data. Statist Med 20(14):2131–2147. https://doi.org/10.1002/sim.845
Krippendorff K (1970) Estimating the reliability, systematic error, and random error of interval data. Educ Psychol Measur 30:61–70
Krippendorff K (2004) Measuring the reliability of qualitative text analysis data. Qual Quant 38:787–800
Landis JR, Koch GG (1975a) A review of statistical methods in the analysis of data arising from observer reliability studies (Part I). Stat Neerl 29:101–123
Landis JR, Koch GG (1975b) A review of statistical methods in the analysis of data arising from observer reliability studies (Part II). Stat Neerl 29:151–161
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174
Lin LI-K (1989) A concordance correlation coefficient to evaluate reproducibility. Biometrics 45:255–268
Lin LI-K (2000) A note on the concordance correlation coefficient. Letter to the Editor (Corrections). Biometrics 56:324–325
Martín Andrés A, Álvarez Hernández M (2020) Hubert’s multi-rater kappa revisited. Br J Math Stat Psychol 73(1):1–22
Miettinen O, Nurminen M (1985) Comparative analysis of two rates. Stat Med 4:213–226. https://doi.org/10.1002/sim.4780040211
Schuster C, Smith DA (2005) Dispersion-weighted kappa: an integrative framework for metric and nominal scale agreement coefficients. Psychometrika 70(1):135–146
Scott WA (1955) Reliability of content analysis: the case of nominal scale coding. Public Opin Q 19:321–325
Shrout PE, Fleiss JL (1979) Intraclass correlations: Uses in assessing rater reliability. Psychol Bull 86:420–428
Warrens MJ (2010) Inequalities between multi-rater kappas. Adv Data Anal Classif 4:271–286
Acknowledgements
This research was supported by the Ministry of Science and Innovation (Spain), Grant PID2021-126095NB-I00 funded by MCIN/AEI/https://doi.org/10.13039/501100011033 and by “ERDF A way of making Europe”.
Funding
Funding for open access publishing: Universidad de Granada/CBUA. The authors are grateful to the reviewers for their invaluable comments.
Appendices
Appendix 1: Average values of some functions of parameters of the multinomial distribution and simplification of some expressions
In a multinomial distribution M{n; pi}, it occurs that E(\(\hat{p}_{i}\)) = pi, V(\(\hat{p}_{i}\)) = E(\(\hat{p}_{i}^{2}\)) − E2(\(\hat{p}_{i}\)) = pi(1 − pi)/n and Cov(\(\hat{p}_{i}\),\(\hat{p}_{j}\)) = E(\(\hat{p}_{i} \hat{p}_{j}\)) − E(\(\hat{p}_{i}\)) × E(\(\hat{p}_{j}\)) = − pipj/n (if i ≠ j). Therefore
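Equivalently, E(\(\hat{p}_{i} \hat{p}_{j}\)) = {(n − 1)pipj + δijpi}/n. This moment identity can be verified by Monte Carlo; the probability vector below is a hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.2, 0.3, 0.5])       # hypothetical multinomial probabilities
n, N = 10, 200_000                  # sample size and Monte Carlo replicates

p_hat = rng.multinomial(n, p, size=N) / n                    # N draws of p_hat
emp = (p_hat[:, :, None] * p_hat[:, None, :]).mean(axis=0)   # E(p_i^ * p_j^)

# theoretical matrix {(n-1) p_i p_j + delta_ij p_i} / n
theory = ((n - 1) * np.outer(p, p) + np.diag(p)) / n
print(np.abs(emp - theory).max())   # should be at Monte Carlo noise level
```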
In the case of Sect. 2, applying the previous point to the distribution M{n; pij} it is deduced that E(\(\hat{p}_{i \cdot } \hat{p}_{ \cdot j}\)) = \(E\left[ {\left( {\sum\nolimits_{h} {\hat{p}_{ih} } } \right)\left( {\sum\nolimits_{t} {\hat{p}_{tj} } } \right)} \right]\) = \(\sum\nolimits_{h} {\sum\nolimits_{t} {E\left( {\hat{p}_{ih} \hat{p}_{tj} } \right)} }\) = \(\sum\nolimits_{h} {\sum\nolimits_{t} {{{\left\{ {\left( {n - 1} \right)p_{ih} p_{tj} + \delta_{ti} \delta_{hj} p_{ij} } \right\}} \mathord{\left/ {\vphantom {{\left\{ {\left( {n - 1} \right)p_{ih} p_{tj} + \delta_{ti} \delta_{hj} p_{ij} } \right\}} n}} \right. \kern-0pt} n}} }\), where the last equality is due to expression (45), and h, t = 1, 2, …, K. Operating it is obtained that E(\(\hat{p}_{i \cdot } \hat{p}_{ \cdot j}\)) = {(n − 1)pi·p·j + pij}/n, as in expression (2). In the same way, E(\(\hat{p}_{i \cdot } \hat{p}_{j \cdot }\)) = \(\sum\nolimits_{h} {\sum\nolimits_{t} {E\left( {\hat{p}_{ih} \hat{p}_{jt} } \right)} }\) = \(\sum\nolimits_{h} {\sum\nolimits_{t} {{{\left\{ {\left( {n - 1} \right)p_{ih} p_{jt} + \delta_{ij} \delta_{ht} p_{ih} } \right\}} \mathord{\left/ {\vphantom {{\left\{ {\left( {n - 1} \right)p_{ih} p_{jt} + \delta_{ij} \delta_{ht} p_{ih} } \right\}} n}} \right. \kern-0pt} n}} }\) = {(n − 1)pi·pj· + δijpi·}/n so that,
The result for \(\hat{p}_{ \cdot i} \hat{p}_{ \cdot j}\) is obtained in a similar way. As \(\hat{\pi }_{i} \hat{\pi }_{j} = {{\left( {\hat{p}_{i \cdot } \hat{p}_{j \cdot } + \hat{p}_{ \cdot i} \hat{p}_{ \cdot j} + \hat{p}_{i \cdot } \hat{p}_{ \cdot j} + \hat{p}_{ \cdot i} \hat{p}_{j \cdot } } \right)} \mathord{\left/ {\vphantom {{\left( {\hat{p}_{i \cdot } \hat{p}_{j \cdot } + \hat{p}_{ \cdot i} \hat{p}_{ \cdot j} + \hat{p}_{i \cdot } \hat{p}_{ \cdot j} + \hat{p}_{ \cdot i} \hat{p}_{j \cdot } } \right)} 4}} \right. \kern-0pt} 4}\) because of expression (9), then, applying the previous equalities, expression (10) is obtained. Finally, regarding the end of Sect. 2.5, through expression (18) it is deduced that \(\hat{I}_{eU} - \hat{I}_{e}\) is proportional to n \(\hat{I}_{e}\) − W(1 − \(\sum\nolimits_{i} {\hat{p}_{ii} }\))/{2(K − 1)} − (n − 1)\(\hat{I}_{e}\) = \(\hat{I}_{e}\) − W(1 − \(\sum\nolimits_{i} {\hat{p}_{ii} }\))/{2(K − 1)} which, through expression (17), is also proportional to 1 + \(\sum\nolimits_{i} {\hat{p}_{ii} }\) − 2 \(\sum\nolimits_{i} {\hat{\pi }_{i}^{2} }\) = \(\sum\nolimits_{i} {\left\{ {\hat{\pi }_{i} + \hat{p}_{ii} - 2\hat{\pi }_{i}^{2} } \right\}}\). Taking into account the value of \(\hat{\pi }_{i}\) in expression (9) and operating, it is deduced that each term i of the previous expression is also proportional to Si(1 − Si) + \(\hat{p}_{ii} \left( {1 - \hat{p}_{ii} } \right)\) + 2 \(\hat{p}_{ii}\)(1 − Si) ≥ 0, where Si = \(\hat{p}_{i \cdot } + \hat{p}_{ \cdot i} - \hat{p}_{ii}\) verifies 0 ≤ Si ≤ 1, since it is the probability that at least one of the two raters chooses category i. The conclusion is always that \(\hat{I}_{eU} - \hat{I}_{e}\) ≥ 0.
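The last step can be checked numerically: for any joint table, 2(\(\hat{\pi }_{i}\) + \(\hat{p}_{ii}\) − 2\(\hat{\pi }_{i}^{2}\)) equals Si(1 − Si) + \(\hat{p}_{ii}\)(1 − \(\hat{p}_{ii}\)) + 2\(\hat{p}_{ii}\)(1 − Si), with Si ≤ 1 because Si is the probability of a union of events. A sketch over random tables:

```python
import numpy as np

rng = np.random.default_rng(2)

for _ in range(1000):
    # random joint table p_hat for R = 2 raters, K = 3 categories
    p = rng.dirichlet(np.ones(9)).reshape(3, 3)
    for i in range(3):
        a, b, c = p[i].sum(), p[:, i].sum(), p[i, i]   # p_i., p_.i, p_ii
        pi = (a + b) / 2                               # expression (9)
        S = a + b - c                                  # P(row i or column i)
        lhs = 2 * (pi + c - 2 * pi ** 2)
        rhs = S * (1 - S) + c * (1 - c) + 2 * c * (1 - S)
        assert abs(lhs - rhs) < 1e-12 and 0 <= S <= 1 and rhs >= -1e-12
print("identity and nonnegativity hold on all random tables")
```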
In the case of Sect. 3, expression (46) adopts the form,
Let Io = ΣrΣr'≠rΣiΣjwijpir,jr'/{R(R − 1)} = ΣiΣjwijΣrΣr'≠rpir,jr'/{R(R − 1)}, defined in Sect. 3.1, be the value we are trying to estimate. For a given subject s, the possible pairs of replies (i, j), with i ≠ j, are RisRjs, and the possible pairs of replies (i, i) are Ris(Ris − 1), since the two raters must be different. Summing over s and dividing by n we obtain the estimators ΣrΣr'≠r \(\hat{p}_{ir,jr^{\prime}}\) and ΣrΣr'≠r \(\hat{p}_{ir,ir^{\prime}}\) of ΣrΣr'≠rpir,jr' and ΣrΣr'≠rpir,ir' respectively. Therefore, the estimator \(\hat{I}_{o}\) of the value Io above will verify that nR(R − 1)\(\hat{I}_{o}\) = ΣiΣj≠iwijΣsRisRjs + ΣiwiiΣsRis(Ris − 1) = ΣiΣjwijΣsRisRjs − nR, since ΣiΣswiiRis = nR as wii = 1. This leads to the second expression of Eq. (24).
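In the non-weighted case this reduces to nR(R − 1)\(\hat{I}_{o}\) = ΣiΣs\(R_{is}^{2}\) − nR, which can be coded directly from the n × K matrix of counts Ris; the data below are hypothetical:

```python
import numpy as np

def observed_index(Ris, w=None):
    """Pairwise observed index of agreements I_o-hat from the n x K matrix
    Ris of category counts per subject; w is the K x K weight matrix."""
    Ris = np.asarray(Ris, dtype=float)
    n, K = Ris.shape
    R = int(Ris[0].sum())                  # raters per subject (constant)
    if w is None:
        w = np.eye(K)                      # non-weighted case w_ij = delta_ij
    # n R (R-1) I_o = sum_ij w_ij sum_s Ris Rjs - n R   (since w_ii = 1)
    total = np.einsum('ij,si,sj->', w, Ris, Ris)
    return (total - n * R) / (n * R * (R - 1))

# hypothetical data: n = 4 subjects, K = 3 categories, R = 3 raters
Ris = [[3, 0, 0],     # full agreement
       [2, 1, 0],
       [1, 1, 1],     # full disagreement
       [0, 0, 3]]     # full agreement
print(observed_index(Ris))   # (9+5+3+9 - 12)/24 = 7/12
```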
The value of Ie of Sect. 3.2 is given by Ie = ΣrΣr'≠rIe(r, r')/{R(R − 1)} = ΣrΣr'≠rΣiΣjwijpirpjr'/{R(R − 1)} = ΣiΣjwijΣrΣr'≠rpirpjr'/{R(R − 1)} = ΣiΣjwij(pi+pj+ − Σrpirpjr)/{R(R − 1)}, since ΣrΣr'≠rpirpjr' = ΣrΣr'pirpjr' − Σrpirpjr = ΣrpirΣr'pjr' − Σrpirpjr = pi+pj+ − Σrpirpjr. This leads to the second expression of Eq. (25).
Regarding what is highlighted in the first paragraph of Sect. 3.3, \(R^{2} E\left( {\hat{\pi }_{i} \hat{\pi }_{j} } \right)\) = \(E\left\{ {\sum\nolimits_{r} {\sum\nolimits_{r^{\prime}} {\hat{p}_{ir} \hat{p}_{jr^{\prime}} } } } \right\}\) = \(E\left\{ {\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\hat{p}_{ir} \hat{p}_{jr^{\prime}} } + \sum\nolimits_{r} {\hat{p}_{ir} \hat{p}_{jr} } } } \right\}\). Through expressions (46) and (2), written in the notation of Sect. 3, \(nR^{2} E\left( {\hat{\pi }_{i} \hat{\pi }_{j} } \right)\) = (n − 1)\(\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {p_{ir} p_{jr^{\prime}} } }\) + (n − 1)\(\sum\nolimits_{r} {p_{ir} p_{jr} }\) + \(\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {p_{ir,jr^{\prime}} } }\) + \(\delta_{ij} \sum\nolimits_{r} {p_{ir} }\), where the sum of the first two terms is (n − 1)\(\sum\nolimits_{r} {\sum\nolimits_{r^{\prime}} {p_{ir} p_{jr^{\prime}} } }\) = (n − 1)pi+pj+ = (n − 1)R2πiπj. Therefore, \(\hat{\pi }_{i} \hat{\pi }_{j}\) is not an unbiased estimator of πiπj since,
As \(E\left( {\hat{I}_{e} } \right)\) = \(\sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} E\left( {\hat{\pi }_{i} \hat{\pi }_{j} } \right)} }\) = n−1[(n − 1)Ie + R−2{ΣiΣjwijΣrΣr'≠rpir,jr' + ΣiΣjwijδijΣrpir}] = n−1[(n − 1)Ie + R−2{R(R − 1)Io + R}], from here we deduce the first expression of (33).
Regarding what is highlighted in the second paragraph of Sect. 3.3, through expression (8) 4Ie(r, r') = ΣiΣjwij(pir + pir')(pjr + pjr') = ΣiΣjwij(pirpjr + pir'pjr' + pirpjr' + pir' pjr) and, through expression (23), 4R(R − 1)Ie = ΣiΣjwij[ΣrΣr'≠rpirpjr + ΣrΣr'≠rpir'pjr' + ΣrΣr'≠rpirpjr' + ΣrΣr'≠rpir'pjr]. As ΣrΣr'≠rpirpjr + ΣrΣr'≠rpir'pjr' = 2(R − 1)Σrpirpjr and ΣrΣr'≠rpirpjr' + ΣrΣr'≠rpir'pjr = 2pi+pj+ − 2Σrpirpjr, then expression (35) is deduced.
Regarding the first paragraph of Sect. 3.5, expression (47) for i = j is,
Therefore, the unbiased estimator of \(\pi_{i}^{2}\) is \(\widehat{{\pi_{i}^{2} }}\) = (n − 1)−1[n \(\hat{\pi }_{i}^{2}\) − \({{\left\{ {\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\hat{p}_{ir,ir^{\prime}} } } + \sum\nolimits_{r} {\hat{p}_{ir} } } \right\}} \mathord{\left/ {\vphantom {{\left\{ {\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\hat{p}_{ir,ir^{\prime}} } } + \sum\nolimits_{r} {\hat{p}_{ir} } } \right\}} {R^{2} }}} \right. \kern-0pt} {R^{2} }}\)] and that of \(\sum\nolimits_{i} {\pi_{i}^{2} }\) will be \(\sum\nolimits_{i} {\widehat{{\pi_{i}^{2} }}}\) = (n − 1)−1[n \(\sum\nolimits_{i} {\hat{\pi }_{i}^{2} }\) − \({{\left\{ {\sum\nolimits_{i} {\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\hat{p}_{ir,ir^{\prime}} } } } + \sum\nolimits_{i} {\sum\nolimits_{r} {\hat{p}_{ir} } } } \right\}} \mathord{\left/ {\vphantom {{\left\{ {\sum\nolimits_{i} {\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\hat{p}_{ir,ir^{\prime}} } } } + \sum\nolimits_{i} {\sum\nolimits_{r} {\hat{p}_{ir} } } } \right\}} {R^{2} }}} \right. \kern-0pt} {R^{2} }}\)]. In this last expression, \(\sum\nolimits_{i} {\hat{\pi }_{i}^{2} }\) = R−2{1 − K(K − 1)\(\hat{I}_{e}\)/W} through expression (39), \(\sum\nolimits_{i} {\sum\nolimits_{r} {\hat{p}_{ir} } }\) = R since \(\sum\nolimits_{i} {\hat{p}_{ir} }\) = 1, and \(\sum\nolimits_{i} {\sum\nolimits_{r} {\sum\nolimits_{r^{\prime} \ne r} {\hat{p}_{ir,ir^{\prime}} } } }\) = R(R − 1) \(\hat{I}_{oN}\), where \(\hat{I}_{oN}\) is obtained from the second expression of Eq. (22) applied to the non-weighted case of ωij = δij. Substituting all of these values in W(1 − \(\sum\nolimits_{i} {\widehat{{\pi_{i}^{2} }}}\))/{K(K − 1)} we obtain the value of \(\hat{I}_{eU}\) of expression (40). 
Regarding the statement that \(\hat{I}_{eU} - \hat{I}_{e}\) ≥ 0 one must take into account that \(\hat{I}_{eU} - \hat{I}_{e}\) is proportional to \(\hat{I}_{e}\) − A = \(\hat{I}_{e}\) − W(R − 1)(1 − \(\hat{I}_{oN}\))/{RK(K − 1)}; substituting in this expression the estimators \(\hat{I}_{e}\) and \(\hat{I}_{oN}\) with their values from the last expressions of Eq. (39) and Eq. (40) respectively, it is obtained that \(\hat{I}_{eU} - \hat{I}_{e}\) is proportional to \(\sum\nolimits_{i} {\sum\nolimits_{s} {R_{is}^{2} } }\) − \({{\sum\nolimits_{i} {R_{i + }^{2} } } \mathord{\left/ {\vphantom {{\sum\nolimits_{i} {R_{i + }^{2} } } n}} \right. \kern-0pt} n}\) = ΣiΣs(Ris − \(\overline{R}_{i}\))2 ≥ 0, where \(\overline{R}_{i}\) = ΣsRis/n.
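The final sum-of-squares identity, ΣiΣs\(R_{is}^{2}\) − Σi\(R_{i + }^{2}\)/n = ΣiΣs(Ris − \(\overline{R}_{i}\))2, is easy to confirm numerically; a sketch with hypothetical counts:

```python
import numpy as np

rng = np.random.default_rng(3)
Ris = rng.integers(0, 5, size=(12, 4)).astype(float)   # hypothetical counts

lhs = (Ris ** 2).sum() - (Ris.sum(axis=0) ** 2).sum() / Ris.shape[0]
Rbar = Ris.mean(axis=0)                                # R_i bar = R_i+ / n
rhs = ((Ris - Rbar) ** 2).sum()
print(lhs, rhs)    # equal up to floating error; both nonnegative
```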
Regarding what is highlighted in the second paragraph of Sect. 3.5, through expression (16) {K(K − 1)/W}Ie(r, r') = 1 − Σi(pir + pir')2/4 = 1 − Σi(\(p_{ir}^{2}\) + \(p_{ir^{\prime}}^{2}\) + 2pirpir')/4. But through expression (23), {K(K − 1)/W}Ie = 1 − Σi[2(R − 1)Σr \(p_{ir}^{2}\) + 2\(p_{i + }^{2}\) − 2Σr \(p_{ir}^{2}\)]/{4R(R − 1)}; this leads to expression (42). Finally, to demonstrate that in the two-pairwise case it also occurs that \(\hat{I}_{eU} - \hat{I}_{e}\) ≥ 0, one must take into account that through expression (44) \(\hat{I}_{eU} - \hat{I}_{e}\) is proportional to \(\hat{I}_{e}\) − XN = \(\hat{I}_{e}\) − W(1 − \(\hat{I}_{oN}\))/{2K(K − 1)}. Substituting in this expression the estimators \(\hat{I}_{e}\) and \(\hat{I}_{oN}\) with their values from the last expressions of Eqs. (43) and (40) respectively, it is obtained that \(\hat{I}_{eU} - \hat{I}_{e}\) is proportional to nR(R − 2) + ΣiΣs \(R_{is}^{2}\) − Σi \(n_{i + }^{2}\)/n − (R − 2)ΣiΣr \(n_{ir}^{2}\)/2 = ΣiΣs(Ris − \(\overline{R}_{i}\))2 + (R − 2)ΣiΣrnir(n − nir)/n ≥ 0.
As stated previously, all of the above is valid if there is only one multinomial sample. Let us suppose that R = 2, that the rater in the rows is a standard one and that the frequencies Oij are obtained from K multinomial distributions {Oi·; p1, p2, …, pK}, with Σpi = 1. Now \(\hat{I}_{e} = {{\sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} O_{i \cdot } \hat{p}_{j} } } } \mathord{\left/ {\vphantom {{\sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} O_{i \cdot } \hat{p}_{j} } } } n}} \right. \kern-0pt} n} = {{\sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} O_{i \cdot } O_{ \cdot j} } } } \mathord{\left/ {\vphantom {{\sum\nolimits_{i} {\sum\nolimits_{j} {w_{ij} O_{i \cdot } O_{ \cdot j} } } } n}} \right. \kern-0pt} n}^{2}\) is an unbiased estimator of Ie = ΣiΣjwijOi·pj/n, since E(\(\hat{p}_{j}\)) = pj.
Appendix 2: Variances of the new estimators of kappa
From here on it is assumed that the new estimators of kappa are approximately unbiased, since they are based on unbiased estimators of Io and Ie. In the case of Sect. 2, from expression (4) it is deduced that \(\hat{\kappa }_{C}\) = (n − 1)\(\hat{\kappa }_{CU}\)/(n − \(\hat{\kappa }_{CU}\)). Therefore d \(\hat{\kappa }_{C}\)/d \(\hat{\kappa }_{CU}\) = n(n − 1)/(n − \(\hat{\kappa }_{CU}\))2, whose value in E(\(\hat{\kappa }_{CU}\)) ≈ κC is n(n − 1)/(n − κC)2 and, through the delta method, V(\(\hat{\kappa }_{C}\)) = n2(n − 1)2 V(\(\hat{\kappa }_{CU}\))/(n − κC)4. This leads to expression (5). In a similar way, from expression (12) it is deduced that \(\hat{\kappa }_{S}\) = {(2n − 1) \(\hat{\kappa }_{SU}\) − 1}/(2n − 1 − \(\hat{\kappa }_{SU}\)). Therefore, d \(\hat{\kappa }_{S}\)/d \(\hat{\kappa }_{SU}\) = 4n(n − 1)/(2n − 1 − \(\hat{\kappa }_{SU}\))2, whose value in E(\(\hat{\kappa }_{SU}\)) ≈ κS is 4n(n − 1)/(2n − 1 − κS)2 and V(\(\hat{\kappa }_{S}\)) = 16n2(n − 1)2 V(\(\hat{\kappa }_{SU}\))/(2n − 1 − κS)4. This leads to expression (13). In the case of Sect. 3.3, from the second expression of Eq. (33) it is deduced that \(\hat{\kappa }_{F}\) = {(nR − R + 1)\(\hat{\kappa }_{FU}\) − 1}/{(nR − 1) − (R − 1)\(\hat{\kappa }_{FU}\)}. Therefore d \(\hat{\kappa }_{F}\)/d \(\hat{\kappa }_{FU}\) = R2n(n − 1)/{(Rn − 1) − (R − 1)\(\hat{\kappa }_{FU}\)}2, whose value in E(\(\hat{\kappa }_{FU}\)) ≈ κF is R2n(n − 1)/{(nR − 1) − (R − 1)κF}2 and V(\(\hat{\kappa }_{F}\)) = R4n2(n − 1)2 V(\(\hat{\kappa }_{FU}\))/{(nR − 1) − (R − 1)κF}4. This leads to expression (34). The variances V(\(\hat{\kappa }_{KU}\)) and V(\(\hat{\kappa }_{F2U}\)) are obtained in a similar way.
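The derivatives used above can be double-checked by finite differences; a sketch for the Cohen case, with hypothetical values of n, κ and V(\(\hat{\kappa }_{CU}\)):

```python
def g(k_cu, n):
    """kappa_C as a function of kappa_CU: (n-1) k / (n - k)."""
    return (n - 1) * k_cu / (n - k_cu)

def dg(k_cu, n):
    """Closed-form derivative n(n-1)/(n - kappa_CU)^2."""
    return n * (n - 1) / (n - k_cu) ** 2

n, k = 25, 0.6                      # hypothetical sample size and kappa value
h = 1e-6
numeric = (g(k + h, n) - g(k - h, n)) / (2 * h)   # central difference
print(numeric, dg(k, n))                           # should agree closely

# delta method: V(kappa_C) ~ [g'(kappa)]^2 * V(kappa_CU)
v_cu = 0.004                                       # hypothetical variance
v_c = dg(k, n) ** 2 * v_cu
print(v_c)
```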
1.3 Appendix 3: Justification of the equalities \(\hat{\rho }_{LU} = \hat{\rho }_{I2}\) and \(\hat{\rho }_{L} = \hat{\rho }_{I2S}\), and their simplified formulas
Using the notation from the end of Sect. 3.2, the expression for \(\hat{\rho }_{LU}\) in (28) is equivalent to the following one, where \(\overline{x}_{ \cdot r}\) = x·r/n:
As srr´ = Σs(xsr − \(\overline{x}_{ \cdot r}\))(xsr´ − \(\overline{x}_{{ \cdot r^{\prime } }}\))/(n − 1) = {Σsxsrxsr´ − x·rx·r´/n}/(n − 1), \(s_{r}^{2}\) = Σs(xsr − \(\overline{x}_{ \cdot r}\))²/(n − 1) = (Σs\(x_{sr}^{2}\) − \(x_{ \cdot r}^{2}\)/n)/(n − 1), and \(\left( {\overline{x}_{ \cdot r} - \overline{x}_{{ \cdot r^{\prime } }} } \right)^{2}\) = (x·r − x·r´)²/n², it follows that
\(\begin{aligned} & \sum\nolimits_{r} {\sum\nolimits_{{r^{\prime } \ne r}} {s_{{rr^{\prime } }} } } = {{\left( {n\sum\nolimits_{s} {x_{s \cdot }^{2} } + \sum\nolimits_{r} {x_{ \cdot r}^{2} - n\sum\nolimits_{s} {\sum\nolimits_{r} {x_{sr}^{2} } - x_{ \cdot \cdot }^{2} } } } \right)} \mathord{\left/ {\vphantom {{\left( {n\sum\nolimits_{s} {x_{s \cdot }^{2} } + \sum\nolimits_{r} {x_{ \cdot r}^{2} - n\sum\nolimits_{s} {\sum\nolimits_{r} {x_{sr}^{2} } - x_{ \cdot \cdot }^{2} } } } \right)} {\left\{ {n\left( {n - 1} \right)} \right\}}}} \right. \kern-0pt} {\left\{ {n\left( {n - 1} \right)} \right\}}}, \\ & \quad \sum\nolimits_{r} {s_{r}^{2} } = {{\left( {n\sum\nolimits_{s} {\sum\nolimits_{r} {x_{sr}^{2} } } - \sum\nolimits_{r} {x_{ \cdot r}^{2} } } \right)} \mathord{\left/ {\vphantom {{\left( {n\sum\nolimits_{s} {\sum\nolimits_{r} {x_{sr}^{2} } } - \sum\nolimits_{r} {x_{ \cdot r}^{2} } } \right)} {\left\{ {n\left( {n - 1} \right)} \right\}}}} \right. \kern-0pt} {\left\{ {n\left( {n - 1} \right)} \right\}}},\quad {\text{and}}\\ &\quad \sum\nolimits_{r} {\sum\nolimits_{{r^{\prime } }} {\left( {\overline{x}_{ \cdot r} - \overline{x}_{{ \cdot r^{\prime } }} } \right)^{2} } } = 2{{\left( {R\sum\nolimits_{r} {x_{ \cdot r}^{2} } - x_{ \cdot \cdot }^{2} } \right)} \mathord{\left/ {\vphantom {{\left( {R\sum\nolimits_{r} {x_{ \cdot r}^{2} } - x_{ \cdot \cdot }^{2} } \right)} {n^{2} }}}\right. \kern-0pt} {n^{2} }}. \\\end{aligned}\)
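These three identities can be checked numerically on arbitrary data; the numpy sketch below (variable names are ours) evaluates both sides of each on a random n × R score matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, R = 12, 4
x = rng.normal(size=(n, R))          # x[s, r]: score of rater r on subject s

S = np.cov(x, rowvar=False)          # R x R sample covariances, denominator n - 1
xs = x.sum(axis=1)                   # row totals x_{s.}
xr = x.sum(axis=0)                   # column totals x_{.r}
xtot = x.sum()                       # grand total x_{..}

# sum_r sum_{r' != r} s_{rr'}: off-diagonal sum of the covariance matrix
lhs1 = S.sum() - np.trace(S)
rhs1 = (n * (xs ** 2).sum() + (xr ** 2).sum()
        - n * (x ** 2).sum() - xtot ** 2) / (n * (n - 1))
assert np.isclose(lhs1, rhs1)

# sum_r s_r^2: diagonal sum (the rater variances)
lhs2 = np.trace(S)
rhs2 = (n * (x ** 2).sum() - (xr ** 2).sum()) / (n * (n - 1))
assert np.isclose(lhs2, rhs2)

# sum_r sum_{r'} (xbar_{.r} - xbar_{.r'})^2
xbar = xr / n
lhs3 = ((xbar[:, None] - xbar[None, :]) ** 2).sum()
rhs3 = 2 * (R * (xr ** 2).sum() - xtot ** 2) / n ** 2
assert np.isclose(lhs3, rhs3)
```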
Substituting into expression (49), the last expression of Eq. (29) is obtained. Similarly, expression (27) for \(\hat{\rho }_{L}\) leads to the last expression of Eq. (30).
On the other hand, the estimator \(\hat{\rho}_{I2}\) of ρI2, which is based on the unbiased estimators of its components, is the ICC(2, 1) of Shrout and Fleiss (1979):
where MSS = SSS/(n − 1), MSR = SSR/(R − 1), and MSE = SSE/{(n − 1)(R − 1)} denote the mean squares (and SSS, SSR, and SSE the sums of squares) for subjects, raters, and error (residual) in the analysis of variance, respectively. In addition, SSE = SST − SSS − SSR, with SST the total sum of squares. As
then, substituting into expression (50), expression (29) is obtained again. Therefore \(\hat{\rho }_{LU} = \hat{\rho }_{I2}\).
1.4 Appendix 4: Classic non-weighted kappa coefficients
We now provide the values needed to define any non-weighted coefficient κ = (Io − Ie)/(1 − Ie) and to calculate the value of its classic estimator \(\hat{\kappa } = \left( {\hat{I}_{o} - \hat{I}_{e} } \right)/\left( {1 - \hat{I}_{e} } \right)\). The new estimator \(\hat{\kappa }_{U}\) is obtained with the corresponding formulas from the main text.
When R = 2 all of the kappa coefficients are based on Io = Σpii and \(\hat{I}_{o} = \sum\nolimits_{i} {\hat{p}_{ii} }\) = \({{\sum\nolimits_{i} {O_{ii} } } \mathord{\left/ {\vphantom {{\sum\nolimits_{i} {O_{ii} } } n}} \right. \kern-0pt} n}.\) The actual and estimated values of Ie in each coefficient are:
(a) κC and \(\hat{\kappa }_{C}\) (Cohen's kappa): Ie = Σipi·p·i and \(\hat{I}_{e} = \sum\nolimits_{i} {\hat{p}_{i \cdot } \hat{p}_{ \cdot i} } = \sum\nolimits_{i} {O_{i \cdot } O_{ \cdot i} } /n^{2}\).
(b) κS and \(\hat{\kappa }_{S}\) (Scott's pi): Ie = \(\sum\nolimits_{i} {\pi_{i}^{2} }\), where πi = (pi· + p·i)/2, and \(\hat{I}_{e} = \sum\nolimits_{i} {\hat{\pi }_{i}^{2} }\), where \(\hat{\pi }_{i} = \left( {\hat{p}_{i \cdot } + \hat{p}_{ \cdot i} } \right)/2 = \left( {O_{i \cdot } + O_{ \cdot i} } \right)/(2n)\).
(c) \(\hat{\kappa }_{K}\) (Krippendorf's alpha), which estimates κS: \(\hat{I}_{e} = \sum\nolimits_{i} {\hat{\pi }_{i}^{2} }\), with \(\hat{\pi }_{i}\) as in (b), but \(\hat{I}_{o}\) is special: \(\hat{I}_{o} = \left\{ {\left( {2n - 1} \right)\sum\nolimits_{i} {\hat{p}_{ii} } + 1} \right\}/(2n) = \left\{ {\left( {2n - 1} \right)\sum\nolimits_{i} {O_{ii} } + n} \right\}/(2n^{2})\).
(d) κG and \(\hat{\kappa }_{G}\) (Gwet's AC1): Ie = Σiπi(1 − πi)/(K − 1) and \(\hat{I}_{e} = \sum\nolimits_{i} {\hat{\pi }_{i} \left( {1 - \hat{\pi }_{i} } \right)} /\left( {K - 1} \right)\), with \(\hat{\pi }_{i}\) as in case (b). Note that \(\hat{I}_{e} = \left\{ {1 - \sum\nolimits_{i} {\hat{\pi }_{i}^{2} } } \right\}/\left( {K - 1} \right)\), where \(\sum\nolimits_{i} {\hat{\pi }_{i}^{2} }\) is the value of \(\hat{I}_{e}\) in (b). In this case, the formula of \(\hat{\kappa }_{GU}\) has a particular expression:
$$\hat{\kappa }_{GU} = \frac{{\left( {n - 1} \right)\hat{\kappa }_{G} + Y_{N} }}{{\left( {n - 1} \right) + Y_{N} }}\quad {\text{where}}\quad Y_{N} = \frac{{1 - \hat{\kappa }_{G} }}{{2\left( {K - 1} \right)}} - \frac{{\hat{I}_{e} }}{{1 - \hat{I}_{e} }}$$
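To make the two-rater estimators (a)–(d) concrete, here is a minimal Python sketch (the function name is ours) that computes the classic \(\hat{\kappa }_{C}\), \(\hat{\kappa }_{S}\), \(\hat{\kappa }_{K}\), and \(\hat{\kappa }_{G}\) from a K × K table of joint counts Oij:

```python
import numpy as np

def two_rater_kappas(O):
    """Classic estimators (a)-(d) from a K x K table O of joint counts."""
    O = np.asarray(O, dtype=float)
    n = O.sum()
    K = O.shape[0]
    Io = np.trace(O) / n                                # common observed index
    pi = (O.sum(axis=1) + O.sum(axis=0)) / (2 * n)      # pi_i hat = (O_i. + O_.i)/(2n)

    Ie_C = (O.sum(axis=1) * O.sum(axis=0)).sum() / n ** 2   # (a) Cohen
    Ie_S = (pi ** 2).sum()                                  # (b) Scott
    Ie_G = (1 - Ie_S) / (K - 1)                             # (d) Gwet AC1
    Io_K = ((2 * n - 1) * np.trace(O) + n) / (2 * n ** 2)   # (c) special I_o

    kappa = lambda io, ie: (io - ie) / (1 - ie)
    return {"Cohen": kappa(Io, Ie_C), "Scott": kappa(Io, Ie_S),
            "Krippendorf": kappa(Io_K, Ie_S), "Gwet": kappa(Io, Ie_G)}

# Perfect agreement: every estimator equals 1
res = two_rater_kappas([[5, 0], [0, 5]])
assert all(abs(v - 1) < 1e-12 for v in res.values())
```

On a table with some disagreement, e.g. O = [[4, 1], [1, 4]], Cohen, Scott, and Gwet all give 0.6 while Krippendorf's alpha gives 0.62, illustrating how its special \(\hat{I}_{o}\) shifts the estimate for small n.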
When R ≥ 2, all of the non-weighted kappa coefficients are based on Io = ΣrΣr'≠rΣipir,ir'/{R(R − 1)} and \(\hat{I}_{o}\) = {\(\sum\nolimits_{i} {\sum\nolimits_{s} {R_{is}^{2} } }\) − nR}/{nR(R − 1)}. The actual and estimated values of Ie are:
(A) κH and \(\hat{\kappa }_{H}\) (Hubert's kappa): Ie = \(\sum\nolimits_{i} {\left\{ {p_{i + }^{2} - \sum\nolimits_{r} {p_{ir}^{2} } } \right\}}\)/{R(R − 1)} and \(\hat{I}_{e}\) = \(\sum\nolimits_{i} {\left\{ {n_{i + }^{2} - \sum\nolimits_{r} {n_{ir}^{2} } } \right\}}\)/{n²R(R − 1)}.
(B) κF and \(\hat{\kappa }_{F}\) (Fleiss's kappa): Ie = \(\sum\nolimits_{i} {p_{i + }^{2} }\)/R² and \(\hat{I}_{e}\) = \(\sum\nolimits_{i} {R_{i + }^{2} }\)/(nR)².
(C) κF2 and \(\hat{\kappa }_{F2}\) (Fleiss's kappa two-pairwise): Ie = [(R − 2)\(\sum\nolimits_{i} {\sum\nolimits_{r} {p_{ir}^{2} } }\) + \(\sum\nolimits_{i} {p_{i + }^{2} }\)]/{2R(R − 1)} and \(\hat{I}_{e}\) = [(R − 2)\(\sum\nolimits_{i} {\sum\nolimits_{r} {n_{ir}^{2} } }\) + \(\sum\nolimits_{i} {n_{i + }^{2} }\)]/{2n²R(R − 1)}.
(D) \(\hat{\kappa }_{K}\) (Krippendorf's alpha), which estimates κF: \(\hat{I}_{e} = \sum\nolimits_{i} {R_{i + }^{2} } /\left( {nR} \right)^{2}\), but \(\hat{I}_{o}\) is special: \(\hat{I}_{o}\) = {(2n − 1)T + 1}/(2n), where T = {\(\sum\nolimits_{i} {\sum\nolimits_{s} {R_{is}^{2} } }\) − nR}/{nR(R − 1)}.
(E) \(\hat{\kappa }_{K2}\) (Krippendorf's alpha two-pairwise), which estimates κF2: \(\hat{I}_{o}\) is the same as in paragraph (D) and \(\hat{I}_{e}\) = \(\left\{ {\left( {R - 2} \right)\sum\nolimits_{i} {\sum\nolimits_{r} {n_{ir}^{2} } } + \sum\nolimits_{i} {n_{i + }^{2} } } \right\}/\left\{ {2n^{2} R\left( {R - 1} \right)} \right\}\).
(F) κG and \(\hat{\kappa }_{G}\) (Gwet's AC1): Ie = \(\left( {1 - \sum\nolimits_{i} {p_{i + }^{2} } /R^{2} } \right)\)/(K − 1) and \(\hat{I}_{e}\) = \(\left\{ {1 - \sum\nolimits_{i} {R_{i + }^{2} } /\left( {nR} \right)^{2} } \right\}\)/(K − 1). It can be observed that \(\hat{\kappa }_{G}\) ≥ \(\hat{\kappa }_{F}\), since \(\hat{I}_{e}\)(Gwet) − \(\hat{I}_{e}\)(Fleiss) is proportional to \(K^{ - 1} - \sum\nolimits_{i} {\hat{\pi }_{i}^{2} } \le 0\); the first statement holds because of expressions (39) and (32) respectively; the second one because \(\sum\nolimits_{i} {\hat{\pi }_{i}^{2} }\) reaches its minimum value of 1/K when all \(\hat{\pi }_{i}\) = 1/K. In this case, the formula of \(\hat{\kappa }_{GU}\) has a particular expression:
$$\hat{\kappa }_{GU} = \frac{{\left( {n - 1} \right)\hat{\kappa }_{G} + B_{N} }}{{\left( {n - 1} \right) + B_{N} }}\quad {\text{where}}\quad B_{N} = \frac{{\left( {R - 1} \right)\left( {1 - \hat{\kappa }_{G} } \right)}}{{R\left( {K - 1} \right)}} - \frac{{\hat{I}_{e} }}{{1 - \hat{I}_{e} }}.$$
(G) κG2 and \(\hat{\kappa }_{G2}\) (Gwet's AC1 two-pairwise):
$$\begin{aligned} & I_{e} = \frac{1}{K - 1}\left[ {1 - \frac{1}{{2R\left( {R - 1} \right)}}\left\{ {\left( {R - 2} \right)\sum\limits_{i} {\sum\limits_{r} {p_{ir}^{2} + \sum\limits_{i} {p_{i + }^{2} } } } } \right\}} \right],\,{\text{and}} \\ & \quad \hat{I}_{e} = \frac{1}{K - 1}\left[ {1 - \frac{1}{{2n^{2} R\left( {R - 1} \right)}}\left\{ {\left( {R - 2} \right)\sum\limits_{i} {\sum\limits_{r} {n_{ir}^{2} + \sum\limits_{i} {n_{i + }^{2} } } } } \right\}} \right]. \\ \end{aligned}$$
In this case, the formula of \(\hat{\kappa }_{G2U}\) also has a particular expression:
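To fix ideas for the multi-rater case, the following Python sketch (the function name is ours; only cases (B) and (F) are implemented) computes \(\hat{\kappa }_{F}\) and the multi-rater \(\hat{\kappa }_{G}\) from the counts Ris, stored here as an n × K matrix whose row s gives the number of raters placing subject s in each category:

```python
import numpy as np

def multi_rater_kappas(M):
    """M[s, i] = number of the R raters placing subject s in category i."""
    M = np.asarray(M, dtype=float)
    n, K = M.shape
    R = int(round(M[0].sum()))                            # each row sums to R
    # common observed index: {sum_i sum_s R_is^2 - nR} / {nR(R - 1)}
    Io = ((M ** 2).sum() - n * R) / (n * R * (R - 1))
    Rip = M.sum(axis=0)                                   # R_{i+}
    Ie_F = (Rip ** 2).sum() / (n * R) ** 2                # (B) Fleiss
    Ie_G = (1 - Ie_F) / (K - 1)                           # (F) Gwet AC1
    kappa = lambda io, ie: (io - ie) / (1 - ie)
    return kappa(Io, Ie_F), kappa(Io, Ie_G)

# Perfect agreement among R = 3 raters on n = 4 subjects, K = 2 categories
kF, kG = multi_rater_kappas(np.array([[3, 0], [3, 0], [0, 3], [0, 3]]))
assert abs(kF - 1) < 1e-12 and abs(kG - 1) < 1e-12
assert kG >= kF - 1e-12   # the inequality noted in paragraph (F)
```

With partial agreement, e.g. rows [2, 1], [3, 0], [1, 2], [3, 0], the sketch gives \(\hat{\kappa }_{F}\) = 1/9 and \(\hat{\kappa }_{G}\) = 7/15, again with \(\hat{\kappa }_{G}\) ≥ \(\hat{\kappa }_{F}\).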
Martín Andrés, A., Álvarez Hernández, M. Estimators of various kappa coefficients based on the unbiased estimator of the expected index of agreements. Adv Data Anal Classif (2024). https://doi.org/10.1007/s11634-024-00581-x
Keywords
- Agreement
- Cohen’s kappa
- Concordance and intraclass correlation coefficients
- Conger’s kappa
- Fleiss’ kappa
- Gwet's AC1/2
- Hubert’s kappa
- Krippendorf's alpha
- Pairwise multi-rater kappa
- Scott's pi