Categorical cutpoints used to assess the adequacy of various statistics—like small, medium, and large for correlation coefficients of .10, .30, and .50 (Cohen, 1988)—are as useful as they are arbitrary. Of course, most authors know, for example, that the difference between p values falling just below or just above the conventional cutpoint of .05 is inconsequential (Cohen, 1990; Rosnow & Rosenthal, 1989). Nonetheless, descriptions of results, especially when many variables are analyzed, are often improved by defined and consistently used categorical terms.

Not all statistics are suitable candidates for categorical cutpoints, however. One such statistic is kappa, which gauges inter-observer agreement corrected for chance (Cohen, 1960). Depending on circumstances I will detail shortly, a specific value of kappa may be judged adequate in one case but not in another. Thus, kappa is unlike other common statistics for which it is reasonable to say, for example, that a specified value represents a medium effect. Nonetheless, categorical terms for specific values of kappa have appeared in the literature (e.g., Fleiss, 1981, characterized kappas of .40–.60 as fair, .60–.75 as good, and over .75 as excellent). However, these definitions were not supported with a convincing rationale—unlike Cohen (1988), who offered detailed rationales for his suggested categorical terms—and did not take into account circumstances affecting the value of kappa, most importantly the order of the kappa table (i.e., the number of codes). In sum, no one value of kappa can be regarded as universally acceptable. The question for investigators should be whether observers are accurate enough, not whether kappa is big enough.

Estimating how accurate observers would need to be to have generated a given value of kappa, under its particular circumstances, provides a principled way to assess whether that value is adequate. When training and checking observers, our main concern should not be the magnitude of kappa but the level of observer accuracy we regard as acceptable. As always, the cutpoint selected is arbitrary. Gardner (1995) characterized 80% accuracy as discouragingly low “but possibly representative of the accuracy of classification for some social behaviors or expressions of affect” (p. 347). It seems reasonable to expect better, and—although 100% accuracy will likely elude us—85% or 90% accuracy may represent reasonable goals. KappaAcc, the computer program described here, computes estimated observer accuracy for a kappa table the user provides. It is based on equations developed by Gardner (1995) that take into account the circumstances of the particular kappa table.

Kappa and weighted kappa

Researchers who ask observers to code or rate behavior often gauge inter-observer agreement with kappa (Cohen, 1960), primarily because kappa corrects for chance agreement, whereas a percentage agreement—sometimes seen in older literature—does not. To check agreement, researchers ask two observers to code (nominal scale) or rate (ordinal scale) the same sequence of events (or time intervals, or sessions), applying K codes or ratings to N events. Their N pairs of judgments are then tallied in a K × K kappa table (also called an agreement or confusion matrix). Rows represent one observer, columns the other observer, and rows and columns are labeled with the K mutually exclusive and exhaustive codes or ratings.
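
To make the tallying concrete, here is a minimal illustrative sketch in Python (KappaAcc itself is a compiled Windows program; the code labels and judgment sequences below are hypothetical):

```python
# Minimal sketch: tally two observers' paired judgments into a K x K kappa table.
import numpy as np

codes = ["A", "B", "C"]                      # K mutually exclusive, exhaustive codes
obs1  = ["A", "B", "B", "C", "A", "C", "B"]  # observer 1's judgments for N events
obs2  = ["A", "B", "C", "C", "A", "C", "B"]  # observer 2's judgments for the same events

index = {code: i for i, code in enumerate(codes)}
table = np.zeros((len(codes), len(codes)), dtype=int)
for c1, c2 in zip(obs1, obs2):
    table[index[c1], index[c2]] += 1         # rows = observer 1, columns = observer 2

print(table)
```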

Cohen’s kappa is an omnibus statistic, a single number that summarizes the agreement evidenced by the kappa table. The standard formula is:

$$\kappa = \frac{P_O - P_C}{1 - P_C}, \quad \text{where } P_O = \sum_{i=1}^{K} p_{ii} \text{ and } P_C = \sum_{i=1}^{K} p_{+i}\,p_{i+}$$

\({P}_{O}\) represents observed agreement (the sum of the probabilities on the upper-left to lower-right diagonal), \({P}_{C}\) represents chance agreement (the sum of the corresponding row and column probability products), and the formula emphasizes that kappa gauges observer agreement corrected for chance.
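
As an illustration, the following sketch applies the standard formula to a hypothetical kappa table (the tallies are invented for the example):

```python
# Standard (unweighted) kappa for a hypothetical 3 x 3 kappa table.
import numpy as np

table = np.array([[20,  3,  2],    # rows = observer 1, columns = observer 2
                  [ 4, 25,  1],
                  [ 1,  2, 22]])

p = table / table.sum()                         # cell probabilities
p_o = np.trace(p)                               # observed agreement (diagonal sum)
p_c = np.sum(p.sum(axis=1) * p.sum(axis=0))     # chance agreement (row x column products)
kappa = (p_o - p_c) / (1 - p_c)
print(f"P_O = {p_o:.3f}, P_C = {p_c:.3f}, kappa = {kappa:.3f}")
```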

The formula for weighted kappa (Cohen, 1968) is more general:

$$\kappa_{wt} = 1 - \frac{\sum_{i=1}^{K}\sum_{j=1}^{K} w_{ij}\,x_{ij}}{\sum_{i=1}^{K}\sum_{j=1}^{K} w_{ij}\,e_{ij}}$$

Each observed value (\(x_{ij}\)) and each expected value (\(e_{ij}\)) is multiplied by the corresponding cell in an array of weights (\(w_{ij}\)). (Note: The formula for weighted kappa in Bakeman & Quera, 2011, p. 82, contains a typo; the “1 –” before the fraction was inadvertently omitted.) With standard weights—all cells on the diagonal set to 0, indicating agreement, and all off-diagonal cells set to 1, indicating that all disagreements are weighted equally—the standard and the weighted kappa formulas yield identical results. If observers agreed for all events, the sum of the \(w_{ij}x_{ij}\) products would be 0, the fraction after “1 –” would be 0, and so \(\kappa\) and \(\kappa_{wt}\) would equal 1, indicating perfect agreement.
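
Continuing the hypothetical table from the sketch above, the weighted formula can be illustrated as follows; with the standard weights used here, the result equals the unweighted kappa for the same table:

```python
# Weighted kappa: observed and expected tallies are each multiplied by a weight
# matrix (0 = agreement, larger values = worse disagreements). Tallies are hypothetical.
import numpy as np

x = np.array([[20,  3,  2],      # observed tallies (rows = observer 1)
              [ 4, 25,  1],
              [ 1,  2, 22]])
e = np.outer(x.sum(axis=1), x.sum(axis=0)) / x.sum()   # expected tallies from the marginals

w = 1 - np.eye(x.shape[0], dtype=int)    # standard weights: 0 on the diagonal, 1 elsewhere
kappa_wt = 1 - (w * x).sum() / (w * e).sum()
print(f"weighted kappa = {kappa_wt:.3f}")
```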

When codes are nominal, it usually makes the most sense to weight all disagreements equally. But when ratings are ordinal, other arrays of weights could make sense. Figure 1 shows four possible weighting schemes, assuming K = 5: (a) the standard array weights agreements 0 and disagreements 1; (b) the linear array weights more extreme disagreements more highly (e.g., weighting a 1–3 disagreement 2 but a 1–5 disagreement 4); (c) the w/1 standard array regards disagreements within one scale point as agreements and weights them 0; and (d) the w/1 linear array likewise regards disagreements within one scale point as agreements but weights more extreme disagreements more highly. The default for the KappaAcc program is the standard array, but the user can select one of the other three weighting schemes or define a custom scheme if they wish.
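
For concreteness, one way to construct the four weight arrays for a 1-to-5 rating scale is sketched below; the exact off-diagonal values KappaAcc assigns in its w/1 linear array appear in Fig. 1 and may differ from the construction assumed here:

```python
# Four possible weighting schemes for a 1-to-5 rating scale (cf. Fig. 1).
import numpy as np

K = 5
i, j = np.indices((K, K))
dist = np.abs(i - j)                          # distance between the two ratings

standard    = (dist > 0).astype(int)          # 0 for agreement, 1 for any disagreement
linear      = dist                            # weight grows with the size of the disagreement
w1_standard = (dist > 1).astype(int)          # within one scale point counts as agreement
w1_linear   = np.where(dist > 1, dist, 0)     # assumed: larger gaps keep their linear weight

for name, w in [("standard", standard), ("linear", linear),
                ("w/1 standard", w1_standard), ("w/1 linear", w1_linear)]:
    print(name, w, sep="\n")
```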

Fig. 1 Standard weight matrix for five codes and three possible weight matrices for 1 to 5 ratings. Note. In addition to the standard weight matrix for kappa, three other possibilities—potentially useful when rating instead of coding—are (a) the linear matrix, which weights more extreme disagreements more highly (e.g., weighting a 1–3 disagreement 2 but a 1–5 disagreement 4); (b) the w/1 standard matrix, which regards disagreements within one scale point as agreements and weights them 0, weighting other disagreements 1; and (c) the w/1 linear matrix, which likewise regards disagreements within one scale point as agreements but weights more extreme disagreements more highly. Agreements (weighted 0) are shaded.

Observer accuracy

Once paired observer judgments are tallied in a kappa table and kappa or weighted kappa has been computed, researchers understandably want to know whether the computed value is big enough—that is, whether it indicates adequate observer agreement. As noted earlier, though, the better question is whether the observers are accurate enough, not whether kappa is big enough. Observer accuracy provides a reasoned way to determine whether a particular value of kappa is adequate. Unfortunately, its computation requires that we know the “true” state of affairs, whereas in the real world the true value of observer accuracy is unknowable. But in an ideal world of simulated observers, we can specify the true state of affairs, specifically: (a) the number of codes or ratings, (b) their simple probabilities, and (c) observer accuracy (the conditional probability that an observer will assign code A when the event is truly an A). KappaAcc assumes this ideal world.

Gardner (1995) has shown us how to model observer decision making in this ideal world. His equations let us determine the value of kappa that would result if two observers of specified accuracy were asked to assign K codes or ratings to events of specified probability (see also Bakeman et al., 1997; Bakeman & Quera, 2011). The inference is that if simulated observers of known accuracy achieve a value of kappa as big as the value achieved by observers in the real world, it is reasonable to assume that the actual observers are at least as accurate as the simulated ones. This is the fundamental assumption of the KappaAcc program. Using Gardner’s model, it determines the percentage accuracy required of simulated observers to achieve the magnitude of the kappa observed by the actual observers. See the Appendix for details concerning Gardner’s equations for computing estimated accuracy.
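
To convey the logic, not Gardner’s exact equations (which appear in the Appendix), the sketch below assumes a simplified model in which an erring simulated observer picks among the remaining codes in proportion to their prevalence, and then searches (by bisection) for the accuracy at which the simulated kappa matches the observed one. The function names and the example table are hypothetical; KappaAcc itself performs the corresponding computation from the table the user provides.

```python
# Illustrative sketch (not KappaAcc itself): estimate the accuracy that simulated
# observers would need to reproduce an observed kappa. Assumes an erring observer
# picks among the other codes in proportion to their prevalence; Gardner's (1995)
# exact equations may differ in detail.
import numpy as np

def kappa_of(table):
    p = table / table.sum()
    p_o = np.trace(p)                                # observed agreement
    p_c = np.sum(p.sum(axis=1) * p.sum(axis=0))      # chance agreement
    return (p_o - p_c) / (1 - p_c)

def simulated_kappa(prevalence, accuracy):
    """Expected kappa for two independent simulated observers of equal accuracy."""
    K = len(prevalence)
    q = np.empty((K, K))             # q[t, i] = P(observer assigns code i | true code is t)
    for t in range(K):
        q[t] = (1 - accuracy) * prevalence / (1 - prevalence[t])
        q[t, t] = accuracy
    expected = np.einsum("t,ti,tj->ij", prevalence, q, q)   # expected kappa table
    return kappa_of(expected)

def estimated_accuracy(table, tol=1e-6):
    """Bisect for the accuracy at which simulated kappa matches the observed kappa."""
    target = kappa_of(table)
    p = table / table.sum()
    prevalence = (p.sum(axis=1) + p.sum(axis=0)) / 2   # mean of the observers' marginals
    lo, hi = 1 / len(prevalence), 1.0                  # chance-level to perfect accuracy
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if simulated_kappa(prevalence, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

table = np.array([[18,  2,  1,  1,  2],   # hypothetical five-code kappa table
                  [ 2, 17,  2,  1,  2],
                  [ 1,  2, 19,  1,  1],
                  [ 1,  1,  2, 16,  4],
                  [ 2,  2,  1,  4, 17]])
print(f"kappa = {kappa_of(table):.2f}, estimated accuracy = {estimated_accuracy(table):.0%}")
```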

Circumstances affecting the value of kappa

Bakeman and Quera (2011) listed five circumstances that affect the value of kappa. First is observer accuracy, as just discussed. Second is the number of (mutually exclusive and exhaustive) codes or ratings. Third is the prevalence of individual codes or ratings (i.e., their simple probabilities); these could be equiprobable, moderately variable, or highly variable. Fourth is observer bias (i.e., the difference in prevalence between observers); its absence is evidenced when observers report similar probabilities for corresponding codes or ratings. And fifth is observer independence (as every researcher knows, when assessing observer agreement, observers must code or rate “blind,” without knowledge of how the other observer did so). As detailed by Bakeman et al. (1997; see also Bakeman & Quera, 2011), when K is less than five, and especially when prevalence is highly variable, the same level of observer accuracy results in lower values of kappa. However, when K is greater than five, additional codes or ratings and prevalence variability matter little.

Based on Gardner’s (1995) model and corresponding equations, Bakeman and Quera (2011) provided tables showing values of kappa that would be achieved if observers were 80, 85, 90, and 95% accurate for various values of K, assuming that both observers’ prevalence was equiprobable, moderately variable, or highly variable. This was somewhat limiting and required interpolation. In contrast, KappaAcc uses information extracted from the kappa table the user provides to find a value of observer accuracy for simulated observers that would result in a value for kappa that is as large as the value obtained by the real observers. As noted earlier, our assumption is that the real observers must have been at least this accurate.

Here is how KappaAcc deals with the five circumstances affecting the value of kappa:

1. The accuracy for simulated observers is what KappaAcc produces. KappaAcc assumes that both observers are equally accurate. Gardner’s model allows accuracy to be set separately for each observer, and even separately for different codes, but doing so seems needlessly complicated, and rationalizing different accuracies for different codes seems somewhere between challenging and impossible. The single percentage KappaAcc produces indicates that both simulated observers were at least that accurate for all codes or ratings.

2. KappaAcc extracts the value for K, the number of codes, from the kappa table the user provides.

3. KappaAcc sets prevalence to the means of the two observers’ corresponding probabilities, based on the kappa table the user provides. Thus, unlike the tables in Bakeman and Quera (2011), which reflect only a few fixed levels of prevalence variability, the simple probabilities used by KappaAcc reflect the observers’ actual variability. Gardner’s model requires a single set of probabilities—the “true” probabilities from the ideal world—which is why they are estimated with means for KappaAcc’s computations.

4. Because Gardner’s model requires a single set of probabilities, observer bias is not reflected in it. Normally, we expect the row and column probabilities of a kappa table to be roughly similar; if they are not, the discrepancies should have occasioned discussion with the observers and possibly retraining.

5. KappaAcc assumes that the two observers coded or rated the same session independently; this is simply standard practice.

The KappaAcc program

KappaAcc is an extension of the ComKappa program (Robinson & Bakeman, 1998). Most notably, it adds the capability to compute estimated observer accuracy. Figure 2 shows the main KappaAcc window after entering tallies for 120 paired observer judgments for a scheme containing five codes.

Fig. 2 KappaAcc main window showing tallies for a five-code kappa table

Selecting the table icon (or Run > Define a new table) lets you define the number of codes or ratings and provide labels for them. You can then enter the values for the kappa table directly in the window or copy-and-paste them from a spreadsheet. If you want weights other than the standard ones, select the weight icon (or Run > Specify weights) and choose the weights you want. Finally, check the statistics you want computed and select the compute icon (or Run > Compute stats). The possible statistics are:

1. Cohen’s omnibus kappa: Kappa as described earlier, using standard weights.

2. Individual code kappas: As noted earlier, kappa is an omnibus statistic; it summarizes agreement for a set of mutually exclusive and exhaustive codes. Computing a separate kappa for each code (forming a 2 × 2 table for each code and computing its kappa) can be informative because it identifies particularly problematic codes (see the sketch following this list).

3. Kappa maximum: In theory, values of kappa can vary from −1 to +1, where 1 represents perfect agreement. Negative values are rare and indicate greater than chance disagreement, but kappa can equal 1 only when the tallies for the corresponding rows and columns are the same—that is, when the simple probabilities for each code are the same for both observers. If not, the value of kappa can be no higher than kappa maximum.

4. Kappa standard error: For completeness, KappaAcc computes kappa’s standard error, although its usefulness is limited. Statistical significance for kappa is rarely reported; as Bakeman and Gottman (1997) wrote, even relatively low values of kappa can still be significantly different from zero, but not of sufficient magnitude to satisfy investigators.

5. Weighted kappa: If you specified other than standard weights, check this box so that the value of weighted kappa will be displayed.

6. Percent agreement: The agreement observed—\({P}_{O}\) in the standard kappa formula.

7. Percent by chance: Agreement expected by chance—\({P}_{C}\) in the standard kappa formula.

8. Estimated accuracy: The estimated observer accuracy described earlier. If checked, weighted kappa is also displayed; it will have the same value as omnibus kappa if standard weights are used.
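
As mentioned for individual code kappas above, each code’s kappa comes from collapsing the K × K table into a 2 × 2 table (this code versus any other code). A minimal illustrative sketch with a hypothetical table:

```python
# Individual code kappas: collapse the K x K table to a 2 x 2 table for each code
# ("this code" vs. "any other code") and compute kappa for it. Tallies are hypothetical.
import numpy as np

def kappa_of(table):
    p = table / table.sum()
    p_o = np.trace(p)
    p_c = np.sum(p.sum(axis=1) * p.sum(axis=0))
    return (p_o - p_c) / (1 - p_c)

table = np.array([[20,  3,  2],
                  [ 4, 25,  1],
                  [ 1,  2, 22]])

for k in range(table.shape[0]):
    a = table[k, k]                    # both observers chose code k
    b = table[k].sum() - a             # observer 1 chose k, observer 2 did not
    c = table[:, k].sum() - a          # observer 2 chose k, observer 1 did not
    d = table.sum() - a - b - c        # neither observer chose code k
    print(f"code {k}: kappa = {kappa_of(np.array([[a, b], [c, d]])):.3f}")
```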

For the example data in Fig. 2, the standard or omnibus kappa was .61 (69% agreement, uncorrected for chance). Standard weights were used, so the value of weighted kappa was the same. KappaAcc determined that simulated observers would need to be 82% accurate to achieve a kappa of .61. Journal articles often give just a value of kappa for each mutually exclusive and exhaustive scheme. I recommend that researchers who gauge observer agreement using kappa also give—and that editors and reviewers ask for—not just estimated observer accuracy, but also the number of sessions (if results from several sessions are pooled), the number of codes, and the number of tallies in the kappa table—information that provides necessary context. Moreover, for all the reasons noted here, a lone value of kappa is almost impossible to interpret, whereas observer accuracy lends itself to intuitive understanding.

For the example just given, results could be stated as follows: Using five codes, two observers coded one session and independently made 120 judgments. The value of kappa was .61 (69% agreement uncorrected for chance). To produce a kappa of this magnitude, simulated observers would need to be 82% accurate, which was somewhat below our target of 85%.

Program details

KappaAcc is programmed in Pascal and compiled using Embarcadero® Delphi 10 Seattle. It will run on Windows computers or on Apple computers with a Windows simulator. It is contained in a single executable file, KappaAcc.exe. Once placed in a folder on your computer, it can be invoked, like any other program, with a double click; you could also create a shortcut and place it on your desktop. If your computer’s security measures block “unknown” executable files from running, you may need help from your local IT people. This write-up as a PDF file and the KappaAcc.exe file are contained in a file (KappaAcc.zip) that can be downloaded at no charge from http://bakeman.gsucreate.org/kappa.