Properties of Bangdiwala’s B

Cohen’s kappa is the most widely used coefficient for assessing interobserver agreement on a nominal scale. An alternative coefficient for quantifying agreement between two observers is Bangdiwala’s B. To provide a proper interpretation of an agreement coefficient one must first understand its meaning. Properties of the kappa coefficient have been extensively studied and are well documented. Properties of coefficient B have been studied, but not extensively. In this paper, various new properties of B are presented. Category B-coefficients are defined that are the basic building blocks of B. It is studied how coefficient B, Cohen’s kappa, the observed agreement and associated category coefficients may be related. It turns out that the relationships between the coefficients are quite different for 2 × 2 tables than for agreement tables with three or more categories.


Matthijs J. Warrens (m.j.warrens@rug.nl), Alexandra de Raadt (a.de.raadt@rug.nl)
GION, University of Groningen, Grote Rozenstraat 3, 9712 TG Groningen, The Netherlands

Introduction
In behavioral and social sciences, the biomedical field and engineering, it is frequently required that multiple units (e.g. individuals, objects) are classified by an observer into several nominal (unordered) categories. Examples are the classification of behavior of children, the coding of arithmetic strategies used by pupils in math class, psychiatric diagnosis of patients, or the classification of production faults. Because there is often no gold standard, the reproducibility of the classifications is usually taken as an indicator of the quality of the category definitions and the ability of the observer to apply them. To assess reproducibility, it is common practice to let two observers independently classify the same units. Reproducibility is then assessed by quantifying agreement between the two observers.
In the literature, various coefficients have been proposed that can be used to quantify agreement between two observers on a nominal scale (Gwet 2012; Hsu and Field 2003; Krippendorff 2004; Warrens 2010a). The most commonly used coefficient is Cohen's kappa (Cohen 1960; Crewson 2005; Fleiss et al. 2003; Sim and Wright 2005; Gwet 2012; Warrens 2015). An alternative to kappa is coefficient B proposed by Bangdiwala (Bangdiwala 1985; Muñoz and Bangdiwala 1997; Shankar and Bangdiwala 2008). Coefficient B can be derived from a graphical representation called the agreement chart. It is defined as the ratio of the sum of areas of squares of perfect agreement to the sum of areas of rectangles of marginal totals of the agreement chart.
Coefficients like kappa and B reduce the ratings of the two observers to a single real number. To provide a proper interpretation of an agreement coefficient one must first understand its meaning. The kappa coefficient has been used in thousands of applications (Maclure and Willett 1987; Sim and Wright 2005; Warrens 2015). Its properties have been extensively studied and are well documented for both 2 × 2 tables (Byrt et al. 1993; Feinstein and Cicchetti 1990; Kang et al. 2013; Uebersax 1987; Vach 2005; Warrens 2008) and square contingency tables with three or more categories (Muñoz and Bangdiwala 1997; Schouten 1986; Shankar and Bangdiwala 2008; Warrens 2010b, 2011, 2013a). The properties presented in these papers help us understand kappa's behavior in applications and provide new interpretations of the coefficient. Properties of coefficient B have been studied, but not extensively. Muñoz and Bangdiwala (1997) presented statistical guidelines for the interpretation of kappa and B based on simulation studies. The four values (1.0, .90, .70, .50) for the observed agreement, (1.0, .85, .55, .25) for 3 × 3 kappas, (1.0, .87, .60, .33) for 4 × 4 kappas, and (1.0, .81, .49, .25) for coefficient B, may be labeled as "perfect agreement", "almost perfect agreement", "substantial agreement" and "moderate agreement", respectively. Furthermore, Shankar and Bangdiwala (2008) studied the behavior of kappa and B in the presence of zero cells and biased marginal distributions.
In this paper various new properties of B are presented. B-coefficients for individual categories are defined that are the basic building blocks of B. It is studied how coefficient B, Cohen's kappa, the observed agreement and associated category coefficients may be related. It turns out that the relationships between the coefficients are quite different for 2 × 2 tables than for agreement tables with three or more categories.
One way to study how coefficients are related to one another is to attempt to find inequalities between coefficients that hold for all agreement tables of a certain size. An inequality between two coefficients, if it exists, implies that the value of one coefficient always exceeds the value of the second coefficient. If an inequality exists, knowing one value allows us to make an educated guess about the value of the other coefficient.
In a way, an inequality formalizes that two coefficients tend to measure agreement between the observers in a similar way, but to a different extent.
The paper is organized as follows. The notation is introduced in Sect. 2. This section is also used to define the coefficients that are studied and compared in this paper. In Sects. 3 and 4 we present results and relationships for the case of 2×2 tables. Section 3 considers relationships between the B-coefficients. In Sect. 4 the B-coefficients are compared to the other coefficients. In Sect. 5 we present a general result between two category coefficients. In Sect. 6 we show, using counterexamples, that the inequalities presented in Sect. 4 do not generalize to agreement tables with three or more categories. Finally, Sect. 7 contains a discussion.

Agreement table
Suppose we have two observers, A and B, who have independently classified (rated) each of the n units in a group into m nominal (unordered) categories that were defined in advance. Furthermore, suppose that the ratings are summarized in a square agreement table A = {π i j }, where π i j denotes the relative frequency (proportion) of units that were classified into category i ∈ {1, 2, . . . , m} by observer A and into category j ∈ {1, 2, . . . , m} by observer B.
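As an illustration of this data structure, the following minimal Python sketch builds a table of proportions from two lists of ratings. The function name `agreement_table`, the category labels and the ratings are hypothetical and serve only as an example.

```python
from collections import Counter


def agreement_table(ratings_a, ratings_b, categories):
    """Return the m x m table of proportions pi[i][j].

    pi[i][j] is the share of units placed in categories[i] by observer A
    and in categories[j] by observer B.
    """
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both observers must rate the same units")
    n = len(ratings_a)
    counts = Counter(zip(ratings_a, ratings_b))
    return [[counts[(ci, cj)] / n for cj in categories] for ci in categories]


# Hypothetical ratings of n = 10 units into three categories.
obs_a = ["x", "x", "y", "y", "z", "z", "z", "z", "z", "z"]
obs_b = ["x", "y", "x", "y", "z", "z", "z", "z", "z", "z"]
pi = agreement_table(obs_a, obs_b, ["x", "y", "z"])

row_totals = [sum(row) for row in pi]        # marginal totals pi_{i+}
col_totals = [sum(col) for col in zip(*pi)]  # marginal totals pi_{+i}
```

The diagonal of `pi` holds the agreement cells, and the two lists of totals hold the base rates of the observers.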
An example of agreement table A = {π i j } is Table 1, which presents the pairwise classifications of a sample of units into m = 3 categories. The cells π 11 , π 22 and π 33 reflect the agreement between the observers, while the off-diagonal elements (e.g. π 21 and π 12 ) reflect disagreement between the observers. The marginal totals or base rates π i+ and π +i for i ∈ {1, 2, 3} reflect how often the categories were used by the observers.

Table 1 Pairwise classifications of units into three categories

Observer A    Observer B
              Category 1   Category 2   Category 3   Total
Category 1    π 11         π 12         π 13         π 1+
Category 2    π 21         π 22         π 23         π 2+
Category 3    π 31         π 32         π 33         π 3+
Total         π +1         π +2         π +3         1

The observed agreement
For category i ∈ {1, 2, . . . , m} the Dice (1945) coefficient is defined as

$$D_i = \frac{2\pi_{ii}}{\pi_{i+} + \pi_{+i}}. \qquad (1)$$

Coefficient (1) quantifies the agreement between the observers on category i relative to the marginal totals. Coefficient (1) has value 1 when there is perfect agreement between the two observers on category i, and value 0 when there is no agreement (i.e. π ii = 0). If we take a weighted average of the D i -coefficients using the denominators of the coefficients (π i+ + π +i ) as weights, we obtain the observed agreement

$$P_o = \sum_{i=1}^{m} \pi_{ii}. \qquad (2)$$

Coefficient (2) is the proportion of units on which the observers agree. It has value 1 if there is perfect agreement between the observers on all categories, and value 0 if there is perfect disagreement between the observers on all categories. Because (2) is a weighted average of the D i -coefficients, its value lies between the minimum and maximum D i -values. Coefficient (2) has sometimes been criticized for overestimating the 'true' agreement between the raters, since some agreement in the data may simply occur by chance (Viera and Garrett 2005; Gwet 2012).
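The Dice coefficients and the observed agreement can be computed as in the following Python sketch. The function name and the 2 × 2 table of proportions are hypothetical, and the sketch assumes that every category was used at least once by at least one observer, so that no denominator is zero.

```python
def dice_and_observed_agreement(pi):
    """Return ([D_1, ..., D_m], P_o) for a square table of proportions."""
    m = len(pi)
    row = [sum(pi[i]) for i in range(m)]                       # pi_{i+}
    col = [sum(pi[i][j] for i in range(m)) for j in range(m)]  # pi_{+i}
    dice = [2 * pi[i][i] / (row[i] + col[i]) for i in range(m)]
    p_o = sum(pi[i][i] for i in range(m))
    return dice, p_o


# Hypothetical 2 x 2 table of proportions (illustration only).
pi = [[0.6, 0.1],
      [0.1, 0.2]]
dice, p_o = dice_and_observed_agreement(pi)

# Weighted average of the D_i with weights (pi_{i+} + pi_{+i}).
weights = [sum(pi[i]) + sum(r[i] for r in pi) for i in range(len(pi))]
weighted = sum(w * d for w, d in zip(weights, dice)) / sum(weights)

print([round(d, 2) for d in dice], round(p_o, 2), round(weighted, 2))
# prints [0.86, 0.67] 0.8 0.8
```

The last two printed values coincide, which illustrates that the observed agreement is the weighted average of the category coefficients.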

Kappa coefficients
For category i ∈ {1, 2, . . . , m} the category kappa is defined as (Warrens 2013b, 2015)

$$\kappa_i = \frac{\pi_{ii} - \pi_{i+}\pi_{+i}}{\dfrac{\pi_{i+} + \pi_{+i}}{2} - \pi_{i+}\pi_{+i}}. \qquad (3)$$

Coefficient (3) quantifies the agreement between the observers on category i. Coefficient (3) corrects the Dice coefficient in (1) for the agreement that arises from chance alone (Warrens 2008, 2010a, 2013b). Coefficient (3) has value 1 when there is perfect agreement between the two observers on category i (then π i+ = π +i ), and 0 when agreement on category i is equal to that expected under statistical independence (i.e. π ii = π i+ π +i ). If we take a weighted average of the κ i -coefficients using the denominators of the coefficients as weights, we obtain Cohen's kappa

$$\kappa = \frac{P_o - P_e}{1 - P_e}, \qquad (4)$$

where P o is the observed agreement in (2), and P e is the expected agreement, defined as

$$P_e = \sum_{i=1}^{m} \pi_{i+}\pi_{+i}. \qquad (5)$$

Coefficient (5) is the value of (2) under statistical independence. Coefficient (4) corrects the observed agreement in (2) for agreement that arises from chance alone. Cohen's kappa has value 1 when there is perfect agreement between the two observers, and value 0 when agreement is equal to that expected under statistical independence (i.e. P o = P e ). Because (4) is a weighted average of the κ i -coefficients, its value lies between the minimum and maximum κ i -values. With two categories, Cohen's kappa and the category kappas κ 1 and κ 2 are all equal.
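A minimal Python sketch of the kappa coefficients is given below. It writes the category kappa as the Dice coefficient corrected for chance, which is an assumption about the exact form of (3). The function name and both tables are hypothetical, and the sketch assumes that no denominator is zero.

```python
def kappa_coefficients(pi):
    """Return ([kappa_1, ..., kappa_m], kappa) for a table of proportions."""
    m = len(pi)
    row = [sum(pi[i]) for i in range(m)]                       # pi_{i+}
    col = [sum(pi[i][j] for i in range(m)) for j in range(m)]  # pi_{+i}
    chance = [row[i] * col[i] for i in range(m)]               # pi_{i+} pi_{+i}
    category = [
        (pi[i][i] - chance[i]) / ((row[i] + col[i]) / 2 - chance[i])
        for i in range(m)
    ]
    p_o = sum(pi[i][i] for i in range(m))   # observed agreement
    p_e = sum(chance)                       # expected agreement
    return category, (p_o - p_e) / (1 - p_e)


# Hypothetical 2 x 2 table: both category kappas equal the overall kappa.
cat2, kappa2 = kappa_coefficients([[0.6, 0.1],
                                   [0.1, 0.2]])

# Hypothetical 3 x 3 table: kappa lies between the extreme category kappas.
cat3, kappa3 = kappa_coefficients([[0.1, 0.1, 0.0],
                                   [0.1, 0.1, 0.0],
                                   [0.0, 0.0, 0.6]])

print([round(k, 2) for k in cat2], round(kappa2, 2))
print([round(k, 2) for k in cat3], round(kappa3, 2))
```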

B-coefficients
For category i ∈ {1, 2, . . . , m} we may define the category coefficient

$$B_i = \frac{\pi_{ii}^2}{\pi_{i+}\pi_{+i}}. \qquad (6)$$

Coefficient (6) can be used to quantify agreement between the observers on category i. It is the square of the Ochiai (1957) coefficient. Similar to (1) and (3), coefficient (6) has value 1 when there is perfect agreement between the two observers on category i, and value 0 when there is no agreement. If we take a weighted average of the B i -coefficients using the denominators of the coefficients (π i+ π +i ) as weights, we obtain Bangdiwala's B, defined as

$$B = \frac{\sum_{i=1}^{m} \pi_{ii}^2}{\sum_{i=1}^{m} \pi_{i+}\pi_{+i}} = \frac{\sum_{i=1}^{m} \pi_{ii}^2}{P_e}. \qquad (7)$$

Like kappa, coefficient (7) is a function of the expected agreement (5). Similar to kappa, coefficient (7) corrects the agreement between the observers for agreement that arises from chance alone, although in a different way than the classical correction for chance function, which is of the form in (4). Coefficient (7) has value 1 when there is perfect agreement between the two observers on all categories, and value 0 if there is no agreement between the observers. Because (7) is a weighted average of the B i -coefficients, its value lies between the minimum and maximum B i -values.
Finally, let n i j denote the observed number of units that are classified into category i ∈ {1, 2, . . . , m} by observer A and into category j ∈ {1, 2, . . . , m} by observer B. Assuming a multinomial sampling model with the total number of units n fixed, the maximum likelihood estimate of the cell probability π i j is given by π̂ i j = n i j /n. We obtain the maximum likelihood estimates of the coefficients in this section (e.g. κ̂ and B̂) by replacing the cell probabilities π i j by the π̂ i j in the above definitions (Bishop et al. 1975).
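The estimation step can be sketched in Python as follows. The function takes a table of counts n i j , converts it to the estimates n i j /n, and returns the category B-coefficients and the overall B. The function name and the counts are hypothetical, and the sketch assumes that each observer used every category at least once.

```python
def b_coefficients(counts):
    """Return ([B_1, ..., B_m], B) estimated from a table of counts n_ij."""
    m = len(counts)
    n = sum(sum(r) for r in counts)
    pi = [[c / n for c in r] for r in counts]                  # n_ij / n
    row = [sum(pi[i]) for i in range(m)]                       # pi_{i+}
    col = [sum(pi[i][j] for i in range(m)) for j in range(m)]  # pi_{+i}
    chance = [row[i] * col[i] for i in range(m)]               # pi_{i+} pi_{+i}
    category = [pi[i][i] ** 2 / chance[i] for i in range(m)]
    overall = sum(pi[i][i] ** 2 for i in range(m)) / sum(chance)
    return category, overall


# Hypothetical counts for n = 100 units (illustration only).
category, overall = b_coefficients([[60, 10],
                                    [10, 20]])
print([round(b, 2) for b in category], round(overall, 2))
# prints [0.73, 0.44] 0.69
```

The overall value falls between the two category values, as expected for a weighted average.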

Relationships between the B-coefficients
In many agreement studies units are classified into precisely two categories (m = 2). With two categories the classifications can be summarized in a 2 × 2 table (Fleiss et al. 2003; Kang et al. 2013; Warrens 2008). Table 2 is an example of a 2 × 2 table. Table 3 presents the corresponding values of the coefficients, which were defined in the previous section.
Category coefficients B 1 and B 2 quantify agreement between the observers on the categories separately, whereas the overall B summarizes the agreement between the observers over the categories. Since B is a (weighted) average of B 1 and B 2 , its value always lies between the values of B 1 and B 2 , and B can be viewed as a summary statistic. Table 3 illustrates that the category coefficients B 1 and B 2 may produce quite different results. The numbers show that, in terms of B i -coefficients, there is much more agreement on category 1 (.74) than on category 2 (.44). Furthermore, the value of the overall B lies between the two B i -coefficients. Moreover, the B-value lies closer to the B 1 -value, because this is the larger of the two. The latter property follows from the fact that B is a weighted average of B 1 and B 2 , using the denominators of the coefficients as weights. The coefficient with the largest denominator (π i+ π +i ) receives the most weight. For the data in Table 2, we have π 1+ π +1 = .49 and π 2+ π +2 = .09. In other words, the overall B-value will lie closest to the B i -value of the most popular category.
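The weighting can be made explicit with the rounded values quoted above (B 1 = .74, B 2 = .44, weights .49 and .09). Because these inputs are rounded to two decimals, the result only agrees with the overall value B = .69 reported for this table up to rounding:

$$B = \frac{\pi_{1+}\pi_{+1}\,B_1 + \pi_{2+}\pi_{+2}\,B_2}{\pi_{1+}\pi_{+1} + \pi_{2+}\pi_{+2}} \approx \frac{.49 \times .74 + .09 \times .44}{.49 + .09} = \frac{.3626 + .0396}{.58} \approx .69.$$

Category 1 thus carries .49/.58, or roughly 84%, of the total weight.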
Since coefficients B 1 and B 2 may produce quite different values, the overall B is only a proper summary statistic if B 1 and B 2 produce values that are reasonably close to one another. If this is not the case, it makes more sense to report the two category coefficients instead, since this is more informative. Theorems 2 and 3 below specify how the three B-coefficients are related. Theorem 2 specifies when B 1 and B 2 are identical. Theorem 1 is used in the proof of Theorem 2.
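Theorem 1
Let f (u, π 12 , π 21 ) = u² / ((u + π 12 )(u + π 21 )), where u > 0 and π 12 and π 21 are nonnegative and not both zero. Then f is strictly increasing in u.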
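The monotonicity can be checked directly. The sketch below assumes that f has the form f (u, v, w) = u²/((u + v)(u + w)) with v = π 12 and w = π 21 , which is the form that gives B 1 = f (π 11 , π 12 , π 21 ) and B 2 = f (π 22 , π 12 , π 21 ) for a 2 × 2 table. It is one possible argument and not necessarily the original proof:

$$\frac{\partial f}{\partial u} = \frac{2u(u+v)(u+w) - u^2(2u+v+w)}{(u+v)^2(u+w)^2} = \frac{u\,\bigl[u(v+w) + 2vw\bigr]}{(u+v)^2(u+w)^2}.$$

For u > 0 and v, w ≥ 0 with v + w > 0 the numerator is positive, so the derivative is positive and f increases strictly with u.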

Theorem 2
The following conditions are equivalent: (i) B 1 = B = B 2 ; (ii) π 11 = π 22 ; (iii) π 1+ + π +1 = 1.
Proof Suppose B 1 = B 2 . Since B is a weighted average of B 1 and B 2 we have B = B 1 = B 2 . Furthermore, note that both B 1 and B 2 are functions of the form f (u, π 12 , π 21 ) in Theorem 1 with u = π 11 or u = π 22 . Since this function is strictly increasing in u we have B 1 = B 2 if and only if π 11 = π 22 . Moreover, for π 11 and π 22 we have the identity π 22 = 1 + π 11 − π 1+ − π +1 . From this identity it follows that we have π 11 = π 22 if and only if π 1+ + π +1 = 1.
Theorem 2 shows that the category coefficients B 1 and B 2 are equal if and only if the observers agree on category 1 as much as they agree on category 2 (i.e. π 11 = π 22 ). The theorem also shows that this can only happen if both categories were used equally often by the two observers together (i.e. π 1+ + π +1 = π 2+ + π +2 ).
Theorem 3 below shows that the largest of B 1 and B 2 is the coefficient associated with the category on which the observers agreed the most often. The latter category is also equivalent to the category that was most often used by the observers together. The theorem follows from using the same arguments as in the proof of Theorem 2.
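The relationships described in this section can be checked numerically. The following Python sketch uses exact rational arithmetic on randomly generated 2 × 2 tables with positive counts and tests whether the ordering of B 1 and B 2 matches the ordering of π 11 and π 22 and the position of π 1+ + π +1 relative to 1. The function names, the seed and the range of the counts are arbitrary choices made for this illustration; such a check supports the results but does not replace the proofs.

```python
from fractions import Fraction
import random


def category_b_2x2(n11, n12, n21, n22):
    """Exact B_1, B_2, pi_11, pi_22 and pi_{1+} + pi_{+1} for positive counts."""
    n = n11 + n12 + n21 + n22
    p11, p12, p21, p22 = (Fraction(c, n) for c in (n11, n12, n21, n22))
    b1 = p11 ** 2 / ((p11 + p12) * (p11 + p21))
    b2 = p22 ** 2 / ((p22 + p21) * (p22 + p12))
    use1 = (p11 + p12) + (p11 + p21)   # pi_{1+} + pi_{+1}
    return b1, b2, p11, p22, use1


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


rng = random.Random(0)
all_consistent = True
for _ in range(2000):
    cells = [rng.randint(1, 30) for _ in range(4)]
    b1, b2, p11, p22, use1 = category_b_2x2(*cells)
    # The three comparisons should always point in the same direction.
    if not (sign(b1 - b2) == sign(p11 - p22) == sign(use1 - 1)):
        all_consistent = False

print(all_consistent)  # prints True
```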

Relationships to other coefficients
In this paper we are interested in how the various agreement coefficients are related to one another. One way to study this is to attempt to derive inequalities between different coefficients that hold for all agreement tables. In a way, an inequality, if it exists, formalizes that two coefficients tend to measure agreement between the observers in a similar way, but to a different extent. For example, between the observed agreement and the kappa coefficients we have the inequalities P o > κ and D i > κ i for any category i (Warrens 2008, 2010a, 2013b). The inequalities show that, for any data, the chance-corrected coefficients will always produce a lower value than the corresponding original (uncorrected) coefficients. The chance-corrected and uncorrected coefficients tend to measure agreement in a similar way. However, the chance-corrected coefficients produce lower values for the same data since they remove agreement that arises from chance alone. For example, for Table 2 we have P o = .80 > .52 = κ, D 1 = .86 > .52 = κ 1 and D 2 = .67 > .52 = κ 2 .
Table 3 shows that for 2 × 2 tables we may have the double inequality P o > B > κ (.80 > .69 > .52). In words, the value of observed agreement is greater than the value of the overall B, which in turn tends to be higher than the value of Cohen's kappa. Table 2 also shows that P o is greater than all three B-coefficients. In this section we present formal proofs of these observations for all 2 × 2 tables. In the next section we present an inequality between category coefficients D i and B i from (1) and (6), respectively, for agreement tables of any size.
First, Theorem 4 specifies how the B-coefficients are related to the observed agreement P o . Theorem 4 shows that, if agreement is less than perfect, the observed agreement always exceeds all three B-coefficients.
Finally, by interchanging the roles of category 1 and 2, the inequality P o > B 2 follows from using the same arguments.
If we combine Theorems 3 and 4, it follows that, in practice, we either have the triple inequality P o > B 1 > B > B 2 (which is the case for Table 2) or the triple inequality P o > B 2 > B > B 1 .
Theorem 5 specifies how the overall kappa is related to the overall B-coefficient. The theorem shows that, if there is some agreement, but no perfect agreement, coefficient B is always higher than kappa for 2 × 2 tables.
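The inequalities of this section can be illustrated with a small numerical experiment. The Python sketch below evaluates P o , B 1 , B 2 , B and κ exactly for random 2 × 2 tables in which all four counts are positive, so that agreement is neither absent nor perfect. The function name, the seed and the range of the counts are hypothetical choices, and the check is an illustration rather than a proof.

```python
from fractions import Fraction
import random


def coefficients_2x2(n11, n12, n21, n22):
    """Exact P_o, B_1, B_2, B and kappa for a 2 x 2 table of positive counts."""
    n = n11 + n12 + n21 + n22
    p11, p12, p21, p22 = (Fraction(c, n) for c in (n11, n12, n21, n22))
    e1 = (p11 + p12) * (p11 + p21)   # pi_{1+} pi_{+1}
    e2 = (p22 + p21) * (p22 + p12)   # pi_{2+} pi_{+2}
    p_o = p11 + p22
    p_e = e1 + e2
    b1, b2 = p11 ** 2 / e1, p22 ** 2 / e2
    b = (p11 ** 2 + p22 ** 2) / p_e
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, b1, b2, b, kappa


rng = random.Random(1)
all_hold = True
for _ in range(2000):
    p_o, b1, b2, b, kappa = coefficients_2x2(*(rng.randint(1, 30) for _ in range(4)))
    ordered = (
        p_o > max(b1, b2)                  # P_o exceeds both category coefficients
        and min(b1, b2) <= b <= max(b1, b2)  # B is a weighted average
        and b > kappa                      # B exceeds kappa
    )
    all_hold = all_hold and ordered

print(all_hold)  # prints True
```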

A general inequality
In Sect. 4 we have not compared category coefficients D i and B i from (1) and (6), respectively. Theorem 6 below presents an inequality between the coefficients. It turns out that the inequality holds for agreement tables of any size, and is not limited to 2×2 tables. In words, Theorem 6 shows that, if there is some agreement on category i (i.e. D i > 0), but no perfect agreement, the D i -coefficient for category i is always higher than the corresponding B i -coefficient.
Proof For 0 < D i < 1, we can write π i+ = π ii + u and π +i = π ii + v, where u and v are real numbers in the interval [0, 1), with at least one of u and v nonzero. With this notation, the inequality D i > B i is equivalent to

$$\frac{2\pi_{ii}}{2\pi_{ii} + u + v} > \frac{\pi_{ii}^2}{(\pi_{ii} + u)(\pi_{ii} + v)}. \qquad (15)$$

Cross multiplying the terms of inequality (15) and dividing by π ii > 0 yields the inequality

$$\pi_{ii}(u + v) + 2uv > 0. \qquad (16)$$

Inequality (16), and thus the desired inequality, is valid, because π ii and at least one of u and v are nonzero.
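Because the inequality D i > B i is claimed for tables of any size, it can be illustrated on randomly generated m × m tables. The Python sketch below uses exact rational arithmetic and strictly positive counts, which guarantees 0 < D i < 1 for every category. The function name, the seed and the table sizes are arbitrary choices made for this illustration.

```python
from fractions import Fraction
import random


def dice_and_b(counts):
    """Exact lists [D_1, ..., D_m] and [B_1, ..., B_m] for a table of counts."""
    m = len(counts)
    n = sum(sum(r) for r in counts)
    row = [Fraction(sum(counts[i]), n) for i in range(m)]          # pi_{i+}
    col = [Fraction(sum(r[i] for r in counts), n) for i in range(m)]  # pi_{+i}
    diag = [Fraction(counts[i][i], n) for i in range(m)]           # pi_{ii}
    dice = [2 * diag[i] / (row[i] + col[i]) for i in range(m)]
    b = [diag[i] ** 2 / (row[i] * col[i]) for i in range(m)]
    return dice, b


rng = random.Random(2)
always_larger = True
for _ in range(1000):
    m = rng.randint(2, 5)
    counts = [[rng.randint(1, 20) for _ in range(m)] for _ in range(m)]
    dice, b = dice_and_b(counts)
    if not all(d > bi for d, bi in zip(dice, b)):
        always_larger = False

print(always_larger)  # prints True
```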

Counterexamples
The inequalities presented in Sect. 4 are restricted to the case of 2 × 2 tables. The reason for this is that the inequalities do not necessarily hold for agreement tables with three or more categories. In this section we present examples to illustrate this fact. Table 4 is an example of a 3 × 3 table. Table 5 presents the corresponding coefficient values. For 2 × 2 tables we always have the inequality P o > B (Theorem 4). However, Table 5 shows that for tables of other sizes we may have the reverse inequality as well (P o = .80 < .86 = B). In Table 5 all three coefficients for category 3 are equal to 1.0 (D 3 = κ 3 = B 3 = 1.0).
Table 6 is another example of a 3 × 3 table. Table 7 presents the corresponding coefficient values. For 2 × 2 tables we always have the inequality B > κ (Theorem 5). However, Table 7 shows that for tables of other sizes we may have the reverse inequality as well (B = .47 < .49 = κ). Furthermore, Table 7 shows that category coefficients (1), (3) and (6) may provide different information. For example, in terms of the κ i -coefficients the least agreement between the observers in Table 6 is on category 3 (κ 3 = .36). However, in terms of the D i - and B i -coefficients this is category 1 (D 1 = .60 and B 1 = .36).
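Both reversals are easy to reproduce with small tables. The Python sketch below evaluates P o , κ and B exactly for two hypothetical 3 × 3 tables of counts. These tables were constructed for this illustration and are not claimed to be the tables used in the paper.

```python
from fractions import Fraction


def summary(counts):
    """Exact (P_o, kappa, B) for a square table of counts."""
    m = len(counts)
    n = sum(sum(r) for r in counts)
    row = [Fraction(sum(counts[i]), n) for i in range(m)]          # pi_{i+}
    col = [Fraction(sum(r[i] for r in counts), n) for i in range(m)]  # pi_{+i}
    diag = [Fraction(counts[i][i], n) for i in range(m)]           # pi_{ii}
    p_o = sum(diag)
    p_e = sum(r * c for r, c in zip(row, col))
    kappa = (p_o - p_e) / (1 - p_e)
    b = sum(d * d for d in diag) / p_e
    return p_o, kappa, b


# Hypothetical table with perfect agreement on category 3: P_o < B.
table_a = [[1, 1, 0],
           [1, 1, 0],
           [0, 0, 6]]

# Hypothetical table with uniform marginal totals: B < kappa.
table_b = [[14, 3, 3],
           [3, 14, 3],
           [3, 3, 14]]

print([round(float(x), 2) for x in summary(table_a)])  # prints [0.8, 0.64, 0.86]
print([round(float(x), 2) for x in summary(table_b)])  # prints [0.7, 0.55, 0.49]
```

In the first table the large, perfectly classified third category dominates the weighted average that defines B, which pushes B above the observed agreement.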
Finally, Tables 2, 4 and 6 illustrate the inequality presented in Theorem 6. If there is some agreement on category i, but if the agreement is not perfect, the D i -coefficient for category i is always higher than the B i -coefficient corresponding to the same category.

Discussion
In this paper we presented various new properties of Bangdiwala's B. The overall B is a weighted average of the B i -coefficients for individual categories. There are two B i -coefficients in the case of 2 × 2 tables, denoted B 1 and B 2 . The largest of B 1 and B 2 is the coefficient associated with the category on which the observers agreed the most often. The latter category is also equivalent to the category that was most often used by the observers together.
Since the category B-coefficients may produce quite different values, the overall B is only a proper summary statistic if the category B i -coefficients produce values that are reasonably close to one another. If this is not the case, it is more informative to also report the individual category coefficients. Of course, this argument also applies to the kappa coefficients.
We also showed that, for 2 × 2 tables, Cohen's kappa never exceeds coefficient B, which in turn is always smaller than the proportion of observed agreement P o . The inequality P o > B may also occur with 3 × 3 and 4 × 4 tables (see Muñoz and Bangdiwala 1997; Shankar and Bangdiwala 2008). However, the reverse inequality P o < B may also be encountered (Tables 4, 5). The inequality B > κ does not always hold for 3 × 3 and 4 × 4 tables. In fact, for many 3 × 3 and 4 × 4 tables presented in Muñoz and Bangdiwala (1997) and Shankar and Bangdiwala (2008) the kappa-value actually exceeds the B-value.
Muñoz and Bangdiwala (1997) presented guidelines for the interpretation of the observed agreement, kappa and coefficient B. The four values (1.0, .85, .55, .25) for 3 × 3 kappas, (1.0, .87, .60, .33) for 4 × 4 kappas, and (1.0, .81, .49, .25) for coefficient B, may be labeled as "perfect agreement", "almost perfect agreement", "substantial agreement" and "moderate agreement", respectively. Since we have the inequality B > κ for 2 × 2 tables (Theorem 5), the guidelines for kappa presented in Muñoz and Bangdiwala (1997) do not apply to 2 × 2 tables. Further benchmarking is required for this case.