Inequalities between multi-rater kappas

Abstract

This paper presents inequalities between four descriptive statistics that have been used to measure nominal agreement between two or more raters. Each of the four statistics is a function of the pairwise information. Light’s kappa and Hubert’s kappa are multi-rater versions of Cohen’s kappa. Fleiss’ kappa is a multi-rater extension of Scott’s pi, whereas Randolph’s kappa generalizes the S coefficient of Bennett et al. to multiple raters. Although a consistent ordering of the numerical values of these agreement measures has frequently been observed in practice, no general ordering inequality among them has so far been proved. It is proved that Fleiss’ kappa is a lower bound of Hubert’s kappa and Randolph’s kappa, and that Randolph’s kappa is an upper bound of Hubert’s kappa and Light’s kappa if all pairwise agreement tables are weakly marginal symmetric or if all raters assign a certain minimum proportion of the objects to a specified category.
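As an illustration of how the four statistics depend only on pairwise information, the following minimal Python sketch computes them for a small made-up data set. The formulas are the commonly used textbook forms, assumed here rather than taken from this abstract: Light’s kappa as the mean of the pairwise Cohen’s kappas, Hubert’s kappa from the mean pairwise observed and expected agreement, Fleiss’ kappa with expected agreement based on the pooled category proportions, and Randolph’s kappa with expected agreement 1/c for c categories. The function name `multirater_kappas` and the example ratings are hypothetical, and every rater is assumed to rate every object.

```python
from itertools import combinations


def multirater_kappas(ratings, categories):
    """Return Light's, Hubert's, Fleiss' and Randolph's kappa.

    ratings: one list of category labels per rater, all of equal length
             (every rater rates every object; an assumption of this sketch).
    categories: list of all possible category labels.
    """
    m = len(ratings)          # number of raters
    n = len(ratings[0])       # number of objects
    c = len(categories)       # number of categories

    # Marginal proportions p[i][k]: share of objects rater i puts in category k.
    p = [[row.count(k) / n for k in categories] for row in ratings]

    obs, exp = [], []
    for i, j in combinations(range(m), 2):
        # Observed agreement of the pair: proportion of identical labels.
        obs.append(sum(a == b for a, b in zip(ratings[i], ratings[j])) / n)
        # Expected agreement of the pair under rater-specific marginals.
        exp.append(sum(pi * pj for pi, pj in zip(p[i], p[j])))

    mean_obs = sum(obs) / len(obs)
    mean_exp = sum(exp) / len(exp)

    # Light: average of the pairwise Cohen's kappas.
    light = sum((o - e) / (1 - e) for o, e in zip(obs, exp)) / len(obs)
    # Hubert: chance correction applied to the averaged agreements.
    hubert = (mean_obs - mean_exp) / (1 - mean_exp)
    # Fleiss: expected agreement from category proportions pooled over raters.
    pooled = [sum(p[i][k] for i in range(m)) / m for k in range(c)]
    exp_fleiss = sum(x * x for x in pooled)
    fleiss = (mean_obs - exp_fleiss) / (1 - exp_fleiss)
    # Randolph: expected agreement is 1/c (all categories equally likely).
    randolph = (mean_obs - 1 / c) / (1 - 1 / c)

    return {"light": light, "hubert": hubert,
            "fleiss": fleiss, "randolph": randolph}


# Hypothetical example: 3 raters, 6 objects, 3 categories.
example = [
    ["a", "a", "b", "b", "c", "c"],
    ["a", "a", "b", "c", "c", "c"],
    ["a", "b", "b", "b", "c", "a"],
]
kappas = multirater_kappas(example, ["a", "b", "c"])
for name, value in kappas.items():
    print(f"{name}: {value:.4f}")
```

In this example Fleiss’ kappa does not exceed Hubert’s or Randolph’s kappa, in line with the unconditional lower-bound result stated above. The example is not constructed to satisfy the conditions under which Randolph’s kappa is an upper bound, so no ordering between Randolph’s kappa and the Hubert or Light values should be read into it.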

Acknowledgments

The author thanks three anonymous reviewers for their helpful comments and valuable suggestions on earlier versions of this paper.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Author information

Corresponding author

Correspondence to Matthijs J. Warrens.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Warrens, M.J. Inequalities between multi-rater kappas. Adv Data Anal Classif 4, 271–286 (2010). https://doi.org/10.1007/s11634-010-0073-4

Keywords

  • Nominal agreement
  • Cohen’s kappa
  • Scott’s pi
  • Light’s kappa
  • Hubert’s kappa
  • Fleiss’ kappa
  • Randolph’s kappa
  • Cauchy–Schwarz inequality
  • Arithmetic-harmonic means inequality

Mathematics Subject Classification (2010)

  • 62H17
  • 62H20
  • 62P25