Variance Estimation of Nominal-Scale Inter-Rater Reliability with Random Selection of Raters
 Kilem Li Gwet
Abstract
Most inter-rater reliability studies using nominal scales suggest the existence of two populations of inference: the population of subjects (the collection of objects or persons to be rated) and the population of raters. Consequently, the sampling variance of an inter-rater reliability coefficient can be seen as the combined effect of the sampling of subjects and the sampling of raters. However, the inter-rater reliability variance estimators proposed in the literature account only for subject sampling variability, ignoring the extra sampling variance due to the sampling of raters, even though the latter may be the larger of the two variance components. Such variance estimators permit statistical inference only to the subject universe. This paper proposes variance estimators that make inference possible to both the subject and rater universes. The consistency of these variance estimators is proved, as is their validity for confidence-interval construction. These results apply only to fully crossed designs, in which every rater rates every subject. A small Monte Carlo simulation study demonstrates the accuracy of the large-sample approximations on reasonably small samples.
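The two-universe idea in the abstract can be illustrated with a minimal sketch. This is not the paper's proposed estimator: it uses the textbook percent-agreement statistic on a fully crossed ratings table and a standard delete-one jackknife, applied once over subjects and once over raters, to show that the two sampling directions yield two distinct variance components. The function names and the simulated data are illustrative assumptions.

```python
import itertools
import numpy as np

def percent_agreement(ratings):
    """Average pairwise agreement among raters, averaged over subjects.

    ratings: (n_subjects, n_raters) array of nominal category labels,
    one row per subject in a fully crossed design.
    """
    n, r = ratings.shape
    pairs = list(itertools.combinations(range(r), 2))
    return float(np.mean([
        np.mean([ratings[i, a] == ratings[i, b] for a, b in pairs])
        for i in range(n)
    ]))

def jackknife_variance(ratings, axis):
    """Delete-one jackknife variance of percent agreement along `axis`
    (0 = delete subjects, 1 = delete raters)."""
    m = ratings.shape[axis]
    stats = np.array([
        percent_agreement(np.delete(ratings, k, axis=axis))
        for k in range(m)
    ])
    return (m - 1) / m * np.sum((stats - stats.mean()) ** 2)

# Simulated fully crossed design: 20 subjects, 4 raters, 3 categories.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 3, size=(20, 4))

pa = percent_agreement(ratings)
v_subj = jackknife_variance(ratings, axis=0)   # subject-sampling component
v_rater = jackknife_variance(ratings, axis=1)  # rater-sampling component
print(pa, v_subj, v_rater)
```

A variance estimator that reports only `v_subj` treats the rater panel as fixed; the abstract's point is that when raters are themselves randomly selected, the rater-direction component must also enter the total variance for inference to extend to the rater universe.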
Title
Variance Estimation of Nominal-Scale Inter-Rater Reliability with Random Selection of Raters
Journal
Psychometrika, Volume 73, Issue 3, pp 407–430
Cover Date
2008-09-01
DOI
10.1007/s11336-007-9054-8
Print ISSN
0033-3123
Online ISSN
1860-0980
Publisher
Springer-Verlag
Keywords
inter-rater reliability; AC1 coefficient; kappa statistic; agreement coefficient
Authors
Kilem Li Gwet (1)
Author Affiliations
1. STATAXIS Consulting, Sr. Statistical Consultant, 20315 Marketree Place, Montgomery Village, MD 20886, USA