, 73:407

Variance Estimation of Nominal-Scale Inter-Rater Reliability with Random Selection of Raters

Theory and Methods


Most inter-rater reliability studies using nominal scales involve two populations of inference: the population of subjects (the collection of objects or persons to be rated) and the population of raters. Consequently, the sampling variance of the inter-rater reliability coefficient results from the combined effect of sampling subjects and sampling raters. However, all inter-rater reliability variance estimators proposed in the literature account only for the subject sampling variability and ignore the extra sampling variance due to the sampling of raters, even though the latter may be the largest of the variance components. Such variance estimators permit statistical inference only to the subject universe. This paper proposes variance estimators that make it possible to infer to both the subject and rater universes. The consistency of these variance estimators is proved, as is their validity for confidence interval construction. These results apply only to fully crossed designs, in which each rater rates each subject. A small Monte Carlo simulation study is presented to demonstrate the accuracy of large-sample approximations on reasonably small samples.
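The point that rater sampling can add variance beyond subject sampling can be illustrated with a toy simulation. The sketch below is not the paper's simulation study and does not implement its proposed variance estimators. It assumes a hypothetical rater population in which each rater reports a binary true category correctly with an accuracy drawn uniformly from 0.55 to 0.95, uses Fleiss' multi-rater kappa purely as an example coefficient, and compares the replication-to-replication variance of the coefficient when the rater panel is held fixed against the variance when a new panel is drawn each time. All function names and parameter values are illustrative assumptions.

```python
import random
import statistics


def fleiss_kappa(ratings, categories):
    """Fleiss' multi-rater kappa for a fully crossed design.

    ratings[i][j] is the category that rater j assigned to subject i.
    """
    n = len(ratings)
    r = len(ratings[0])
    agreement_sum = 0.0
    category_share = {k: 0.0 for k in categories}
    for row in ratings:
        for k in categories:
            r_ik = row.count(k)  # number of raters placing subject i in category k
            agreement_sum += r_ik * (r_ik - 1) / (r * (r - 1))
            category_share[k] += r_ik / (n * r)
    p_a = agreement_sum / n  # observed pairwise agreement
    p_e = sum(share ** 2 for share in category_share.values())  # chance agreement
    return (p_a - p_e) / (1.0 - p_e)


def draw_raters(rng, n_raters):
    # Assumed rater population: accuracy uniform on [0.55, 0.95].
    return [rng.uniform(0.55, 0.95) for _ in range(n_raters)]


def simulate_ratings(rng, accuracies, n_subjects):
    # Each subject has a random binary true category; each rater reports
    # it correctly with that rater's accuracy, otherwise flips it.
    rows = []
    for _ in range(n_subjects):
        truth = rng.randint(0, 1)
        rows.append([truth if rng.random() < a else 1 - truth for a in accuracies])
    return rows


def kappa_variance(rng, n_reps, n_subjects, n_raters, resample_raters):
    # Variance of kappa across replications, with or without redrawing raters.
    fixed_panel = draw_raters(rng, n_raters)
    kappas = []
    for _ in range(n_reps):
        panel = draw_raters(rng, n_raters) if resample_raters else fixed_panel
        ratings = simulate_ratings(rng, panel, n_subjects)
        kappas.append(fleiss_kappa(ratings, (0, 1)))
    return statistics.variance(kappas)


rng = random.Random(2008)  # fixed seed so the run is reproducible
var_subjects_only = kappa_variance(rng, 400, 200, 3, resample_raters=False)
var_subjects_and_raters = kappa_variance(rng, 400, 200, 3, resample_raters=True)
print("subjects resampled, raters fixed:", var_subjects_only)
print("subjects and raters resampled:   ", var_subjects_and_raters)
```

Under these assumed settings the second variance is expected to be the larger one, because it also includes the spread of the coefficient across rater panels. How large the gap is depends entirely on how heterogeneous the assumed rater population is.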


inter-rater reliability, AC1 coefficient, kappa statistic, agreement coefficient



Copyright information

© The Psychometric Society 2008

Authors and Affiliations

  1. STATAXIS Consulting, Sr. Statistical Consultant, Montgomery Village, USA
