Abstract
Although it has long been agreed that intercoder reliability is crucial to the validity of a content analysis study, the choice among competing reliability indices remains debated. This study reviewed the most popular intercoder reliability indices and empirically tested them in a Monte Carlo experiment, modeling their relationships with prevalence and rater bias through response surface methodology in order to identify the most robust index. Maxwell's RE was found to be superior to Krippendorff's α, Scott's π, Cohen's κ, Perreault and Leigh's I_r, and Gwet's AC1. Response surface plots revealed more nuanced relationships among prevalence, sensitivity, specificity, and the reliability indices. Theoretical and practical implications are discussed.
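The logic of one design point of such an experiment can be illustrated with a short R sketch (the study itself was conducted in R; see the R Development Core Team and package references below). This is a minimal sketch under stated assumptions, not the author's original code: the helper names simulate_coding and agreement_indices are illustrative, and Krippendorff's α is omitted because, for two raters coding binary data, it equals Scott's π up to a small-sample correction.

```r
## A minimal sketch of one design point of the Monte Carlo experiment --
## illustrative only, not the author's original code. Two raters code
## binary units whose true class follows a given prevalence; each rater's
## judgments follow a shared sensitivity and specificity; the compared
## indices are then computed from the observed ratings.

set.seed(2013)

simulate_coding <- function(n, prevalence, sensitivity, specificity) {
  truth <- rbinom(n, 1, prevalence)                  # latent true class of each unit
  code_one <- function(t)
    ifelse(t == 1,
           rbinom(length(t), 1, sensitivity),        # hit with prob = sensitivity
           rbinom(length(t), 1, 1 - specificity))    # false alarm with prob = 1 - specificity
  cbind(rater1 = code_one(truth), rater2 = code_one(truth))
}

agreement_indices <- function(ratings) {
  po   <- mean(ratings[, 1] == ratings[, 2])         # observed agreement
  p1   <- mean(ratings[, 1])                         # rater 1's marginal rate of "1"
  p2   <- mean(ratings[, 2])                         # rater 2's marginal rate of "1"
  pbar <- (p1 + p2) / 2                              # pooled marginal
  pe_kappa <- p1 * p2 + (1 - p1) * (1 - p2)          # Cohen: product of individual marginals
  pe_pi    <- pbar^2 + (1 - pbar)^2                  # Scott: squared pooled marginals
  pe_ac1   <- 2 * pbar * (1 - pbar)                  # Gwet's AC1 chance term
  c(kappa = (po - pe_kappa) / (1 - pe_kappa),
    pi    = (po - pe_pi) / (1 - pe_pi),
    AC1   = (po - pe_ac1) / (1 - pe_ac1),
    RE    = 2 * po - 1,                              # Maxwell's RE
    Ir    = sqrt(max(2 * po - 1, 0)))                # Perreault & Leigh's I_r, two categories
}

## Low prevalence with fairly accurate raters: kappa and pi are depressed
## relative to RE, which depends on observed agreement alone.
ratings <- simulate_coding(n = 1000, prevalence = 0.1,
                           sensitivity = 0.9, specificity = 0.9)
round(agreement_indices(ratings), 3)
```

Sweeping prevalence, sensitivity, and specificity over a grid of such design points and fitting a second-order model to each index (for example, with the rsm package cited below) yields response surfaces of the kind the study analyzes.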
References
Agresti A.: Modelling patterns of agreement and disagreement. Stat. Methods Med. Res. 1(2), 201–218 (1992). doi:10.1177/096228029200100205
Agresti A.: Categorical Data Analysis. Wiley, New York (2002)
Agresti, A., Lang, J.B.: Quasi-symmetric latent class models, with application to rater agreement. Biometrics 49(1), 131–139 (1993). http://www.jstor.org/stable/2532608
Agresti A., Ghosh A., Bini M.: Raking kappa: describing potential impact of marginal distributions on measures of agreement. Biometrical J. 37(7), 811–820 (1995). doi:10.1002/bimj.4710370705
Artstein R., Poesio M.: Inter-coder agreement for computational linguistics. Comput. Linguistics 34(4), 555–596 (2008)
Bakeman R.: Behavioral Observation and Coding. Cambridge University Press, New York (2000)
Banerjee M., Capozzoli M., McSweeney L., Sinha D.: Beyond kappa: a review of interrater agreement measures. Can. J. Stat. 27(1), 3–23 (1999)
Barlow W., Lai M.Y., Azen S.P.: A comparison of methods for calculating a stratified kappa. Stat. Med. 10(9), 1465–1472 (1991). doi:10.1002/sim.4780100913
Bennett, E.M., Alpert, R., Goldstein, A.C.: Communications through limited-response questioning. Public Opin. Q. 18(3), 303–308 (1954). doi:10.1086/266520, http://poq.oxfordjournals.org/content/18/3/303.abstract, http://poq.oxfordjournals.org/content/18/3/303.full.pdf+html
Bewick V., Cheek L., Ball J.: Statistics review 13: receiver operating characteristic curves. Crit. Care 8(6), 508–512 (2004). doi:10.1186/cc3000
Bhapkar, V.P.: A note on the equivalence of two test criteria for hypotheses in categorical data. J. Am. Stat. Assoc. 61(313), 228–235 (1966). http://www.jstor.org/stable/2283057
Bishop Y., Fienberg S., Holland P.: Discrete Multivariate Analysis: Theory and Practice. Springer, Cambridge (2007)
Bloch, D.A., Kraemer, H.C.: 2 x 2 kappa coefficients: measures of agreement or association. Biometrics 45(1), 269–287 (1989). http://www.jstor.org/stable/2532052
Box G., Draper N.: Response Surfaces, Mixtures, and Ridge Analyses, vol. 527, 2nd edn. Wiley-Interscience, New Jersey (2007)
Brennan R., Prediger D.: Coefficient kappa: some uses, misuses, and alternatives. Educ. Psychol. Meas. 41(3), 687–699 (1981)
Broemeling L.: Bayesian Methods for Measures of Agreement. Chapman & Hall/CRC biostatistics series, CRC Press, Boca Raton (2009)
Byrt, T., Bishop, J., Carlin, J.B.: Bias, prevalence and kappa. J. Clin. Epidemiol. 46(5), 423–429 (1993). doi:10.1016/0895-4356(93)90018-V, http://www.sciencedirect.com/science/article/pii/089543569390018V
Choudhary P., Yin K.: Bayesian and frequentist methodologies for analyzing method comparison studies with multiple methods. Stat. Biopharm. Res. 2(1), 122–132 (2010)
Cicchetti D., Feinstein A.: High agreement but low kappa: II. Resolving the paradoxes. J. Clin. Epidemiol. 43(6), 551–558 (1990)
Cohen J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20(1), 37–46 (1960). doi:10.1177/001316446002000104
Cohen, J.: Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70(4), 213–220 (1968). doi:10.1037/h0026256, http://search.ebscohost.com/login.aspx?direct=true&db=pdh&AN=bul-70-4-213&site=ehost-live
Conger A.: Integration and generalization of kappas for multiple raters. Psychol. Bull. 88(2), 322–328 (1980). doi:10.1037/0033-2909.88.2.322
Coughlin, S.S., Pickle, L.W.: Sensitivity and specificity-like measures of the validity of a diagnostic test that are corrected for chance agreement. Epidemiology 3(2), 178–181 (1992) http://www.jstor.org/stable/3702898
Dahl, D.: xtable: export tables to LaTeX or HTML. R package version 1(6) (2009)
de Mast J.: Agreement and kappa-type indices. Am. Stat. 61(2), 148–153 (2007)
Dewey, M.: Coefficients of agreement. Br. J. Psychiatry 143(5), 487–489 (1983). doi:10.1192/bjp.143.5.487, http://bjp.rcpsych.org/cgi/content/abstract/143/5/487, http://bjp.rcpsych.org/cgi/reprint/143/5/487.pdf
Feinstein, A.R., Cicchetti, D.V.: High agreement but low kappa: I. the problems of two paradoxes. J. Clin. Epidemiol. 43(6), 543–549 (1990). doi:10.1016/0895-4356(90)90158-L, http://www.sciencedirect.com/science/article/pii/089543569090158L
Feuerman M., Miller A.R.: Relationships between statistical measures of agreement: sensitivity, specificity and kappa. J. Eval. Clin. Pract. 14(5), 930–933 (2008). doi:10.1111/j.1365-2753.2008.00984.x
Finn R.: A note on estimating the reliability of categorical data. Educ. Psychol. Meas. 30(1), 70–76 (1970). doi:10.1177/001316447003000106
Fleiss J.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378–382 (1971)
Fleiss, J.: Measuring agreement between two judges on the presence or absence of a trait. Biometrics 31(3), 651–659 (1975). http://www.jstor.org/stable/2529549
Fleiss, J.L., Levin, B., Paik, M.C.: The measurement of interrater agreement. In: Statistical Methods for Rates and Proportions, 3rd edn., chap. 18, pp. 598–626. Wiley, New York (2004). doi:10.1002/0471445428.ch18
Gamer, M., Lemon, J., Singh, I.F.P.: irr: various coefficients of interrater reliability and agreement (2010). http://cran.r-project.org/web/packages/irr/index.html
Guggenmoos-Holzmann I.: How reliable are chance-corrected measures of agreement?. Stat. Med. 12(23), 2191–2205 (1993). doi:10.1002/sim.4780122305
Guggenmoos-Holzmann I.: The meaning of kappa: probabilistic concepts of reliability and validity revisited. J. Clin. Epidemiol. 49(7), 775–782 (1996)
Gwet K.: Inter-rater reliability: dependency on trait prevalence and marginal homogeneity. Stat. Methods Inter-Rater Reliab. Assess. Ser. 2, 1–9 (2002)
Gwet K.: Computing inter-rater reliability and its variance in the presence of high agreement. Br. J. Math. Stat. Psychol. 61(1), 29–48 (2008)
Gwet K.: Handbook of Inter-rater Reliability-A Definitive Guide to Measuring the Extent of Agreement Among Multiple Raters. Advanced Analytics, LLC, Gaithersburg (2010)
Hayes A., Krippendorff K.: Answering the call for a standard reliability measure for coding data. Commun. Methods Meas. 1(1), 77–89 (2007)
Hoehler, F.K.: Bias and prevalence effects on kappa viewed in terms of sensitivity and specificity. J. Clin. Epidemiol. 53(5), 499–503 (2000). doi:10.1016/S0895-4356(99)00174-2, http://www.sciencedirect.com/science/article/pii/S0895435699001742
Holley, J., Guilford, J.: A note on the g index of agreement. Educ. Psychol. Meas. 24(4), 749–753 (1964). doi:10.1177/001316446402400402, http://epm.sagepub.com/content/24/4/749.short, http://epm.sagepub.com/content/24/4/749.full.pdf+html
Holsti O.: Content Analysis for the Social Sciences and Humanities. Addison-Wesley, Reading (1969)
Hoyt, W.T.: Rater bias in psychological research: when is it a problem and what can we do about it? Psychol. Methods 5(1), 64–86 (2000). http://search.ebscohost.com/login.aspx?direct=true&db=pdh&AN=met-5-1-64&site=ehost-live
Hripcsak, G., Heitjan, D.F.: Measuring agreement in medical informatics reliability studies. J. Biomed. Inf. 35(2), 99–110 (2002). doi:10.1016/S1532-0464(02)00500-2, http://www.sciencedirect.com/science/article/pii/S1532046402005002
Hsu L., Field R.: Interrater agreement measures: comments on kappa_n, Cohen's κ, Scott's π, and Aickin's α. Underst. Stat. 2, 205–219 (2003)
Hutchinson T.P.: Kappa muddles together two sources of disagreement: tetrachoric correlation is preferable. Res. Nurs. Health 16(4), 313–316 (1993). doi:10.1002/nur.4770160410
Janes C.: An extension of the random error coefficient of agreement to n × n tables. Br. J. Psychiatry 134(6), 617–619 (1979). doi:10.1192/bjp.134.6.617
Janson S., Vegelius J.: On generalizations of the g index and the phi coefficient to nominal scales. Multivar. Behav. Res. 14(2), 255–269 (1979). doi:10.1207/s15327906mbr1402_9
Kirk, J.: rmac: calculate RMAC or FMAC agreement coefficients (2010). http://cran.r-project.org/web/packages/rmac/index.html
Kraemer H.: Ramifications of a population model for κ as a coefficient of reliability. Psychometrika 44(4), 461–472 (1979). doi:10.1007/BF02296208
Krippendorff, K.: Bivariate agreement coefficients for reliability of data. Sociol. Methodol. 2, 139–150 (1970). http://www.jstor.org/stable/270787
Krippendorff K.: Content Analysis: An Introduction to its Methodology, 2nd edn. Sage Publications, Inc, Thousand Oaks (2004)
Krippendorff K.: Reliability in content analysis: some common misconceptions and recommendations. Human Commun. Res. 30(3), 411–433 (2004). doi:10.1111/j.1468-2958.2004.tb00738.x
Krippendorff, K.: Computing Krippendorff's alpha reliability (2007). http://repository.upenn.edu/cgi/viewcontent.cgi?article=1043&context=asc_papers
Lamport L.: LaTeX: A Document Preparation System. Addison-Wesley Longman Publishing Co., Inc., Boston (1986)
Langenbucher J., Labouvie E., Morgenstern J.: Measuring diagnostic agreement. J. Consult. Clin. Psychol. 64(6), 1285–1289 (1996)
Lantz, C.A., Nebenzahl, E.: Behavior and interpretation of the κ statistic: resolution of the two paradoxes. J. Clin. Epidemiol. 49(4), 431–434 (1996). doi:10.1016/0895-4356(95)00571-4, http://www.sciencedirect.com/science/article/pii/0895435695005714
Leisch, F.: Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis. http://www.stat.uni-muenchen.de/leisch/Sweave, ISBN 3-7908-1517-9 (2002)
Lemon, J., Fellows, I.: Concord: concordance and reliability (2007). http://cran.r-project.org/web/packages/concord/index.html
Lenth, R.V.: RSM: Response-Surface Analysis. 1st edn (2010). http://cran.r-project.org/web/packages/rsm/index.html
Light R.J.: Measures of response agreement for qualitative data: some generalizations and alternatives. Psychol. Bull. 76(5), 365–377 (1971)
Lin L.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1), 255–268 (1989)
Lin L., Hedayat A.S., Wu W.: A unified approach for assessing agreement for continuous and categorical data. J. Biopharm. Stat. 17(4), 629–652 (2007). doi:10.1080/10543400701376498
Lombard M.: A call for standardization in content analysis reliability. Human Commun. Res. 30(3), 434–437 (2004)
Mason R., Gunst R., Hess J.: Statistical Design and Analysis of Experiments: With Applications to Engineering and Science, vol. 356. Wiley-Interscience, New York (2003)
Maxwell A.: Comparing the classification of subjects by two independent judges. Br. J. Psychiatry 116(535), 651–655 (1970)
Maxwell, A.E.: Coefficients of agreement between observers and their interpretation. Br. J. Psychiatry 130(1), 79–83 (1977). doi:10.1192/bjp.130.1.79, http://bjp.rcpsych.org/content/130/1/79.abstract, http://bjp.rcpsych.org/content/130/1/79.full.pdf+html
May S.: Modelling observer agreement—an alternative to kappa. J. Clin. Epidemiol. 47(11), 1315–1324 (1994)
McNemar Q.: Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 153–157 (1947). doi:10.1007/BF02295996
Myers R., Montgomery D., Anderson-Cook C.: Response Surface Methodology: Process and Product Optimization Using Designed Experiments, vol. 705, 3rd edn. Wiley, Hoboken (2009)
Osgood, C.: The representational model and relevant research methods. In: Trends in Content Analysis, pp. 33–88. University of Illinois Press, Urbana (1959)
Perreault, W.D., Leigh, L.E.: Reliability of nominal data based on qualitative judgments. J. Mark. Res. 26(2), 135–148 (1989). http://www.jstor.org/stable/3172601
R Development Core Team: R: A Language and Environment for Statistical Computing (2011). http://www.R-project.org/, ISBN 3-900051-07-0
Revolution Analytics: doSMP: Foreach Parallel Adaptor for the revoIPC Package. 1st edn (2011). http://cran.r-project.org/web/packages/doSMP/index.html
Rosenbaum, J.: Bayesian methods for measures of agreement. J. R. Stat. Soc. Ser. A (Stat. Soc.) 173(1), 270–270 (2010). doi:10.1111/j.1467-985X.2009.00624_2.x (book Reviews)
Sarkar, D.: Lattice: Lattice Graphics, 0th edn. (2012). http://cran.r-project.org/web/packages/lattice/index.html
Scott W.: Reliability of content analysis: the case of nominal scale coding. Public Opin. Q. 19, 321–325 (1955). doi:10.1086/266577
Shrout P., Fleiss J.: Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86(2), 420–428 (1979)
Shrout P., Spitzer R., Fleiss J.: Quantification of agreement in psychiatric diagnosis revisited. Arch. Gen. Psychiatry 44(2), 172–177 (1987)
Siegel S., Castellan N.: Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill Book Company, New York (1988)
Sim J., Wright C.: The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys. Ther. 85(3), 257–268 (2005)
Spitznagel E., Helzer J.: A proposed solution to the base rate problem in the kappa statistic. Arch. Gen. Psychiatry 42(7), 725–728 (1985)
Stevenson, M., Sanchez, J., Thornton, R.: epir: Functions for analysing epidemiological data (2011). http://cran.r-project.org/web/packages/epiR/index.html
Stewart A.: Basic Statistics and Epidemiology: A Practical Guide. Radcliffe Pub., Oxford (2007)
Stuart A.: A test for homogeneity of the marginal distributions in a two-way classification. Biometrika 42(3–4), 412–416 (1955)
Tanner, M.A., Young, M.A.: Modeling agreement among raters. J. Am. Stat. Assoc. 80(389), 175–180 (1985). http://www.jstor.org/stable/2288068
Thompson, W., Walter, S.D.: A reappraisal of the kappa coefficient. J. Clin. Epidemiol. 41(10), 949–958 (1988). doi:10.1016/0895-4356(88)90031-5, http://www.sciencedirect.com/science/article/pii/0895435688900315
Tinsley, H.E., Weiss, D.J.: Interrater reliability and agreement of subjective judgments. J. Couns. Psychol. 22(4), 358–376 (1975). doi:10.1037/h0076640, http://search.ebscohost.com/login.aspx?direct=true&db=pdh&AN=cou-22-4-358&site=ehost-live
Tinsley H.E., Weiss D.J.: Interrater reliability and agreement. In: Tinsley, H., Brown, S. (eds.) Handbook of Applied Multivariate Statistics and Mathematical Modeling, chap 4, pp. 95–124. Academic Press, San Diego (2000)
Uebersax J.: Diversity of decision-making models and the measurement of interrater agreement. Psychol. Bull. 101(1), 140–146 (1987)
Uebersax, J.: The myth of chance-corrected agreement (2009). http://www.john-uebersax.com/stat/kappa2.htm
Uebersax, J., Grove, W.: Latent class analysis of diagnostic agreement. Stat. Med. 9(5), 559–572 (1990). http://www.psych.umn.edu/faculty/grove/053latentclassanalysisof.pdf
Vach, W.: The dependence of Cohen's kappa on the prevalence does not matter. J. Clin. Epidemiol. 58(7), 655–661 (2005). doi:10.1016/j.jclinepi.2004.02.021, http://www.sciencedirect.com/science/article/pii/S0895435604003300
von Eye, A., Sörensen, S.: Models of chance when measuring interrater agreement with kappa. Biometrical J. 33(7), 781–787 (1991). doi:10.1002/bimj.4710330704, http://dx.doi.org/10.1002/bimj.4710330704
von Eye A., von Eye M.: On the marginal dependency of Cohen’s κ. Eur. Psychol. 13(4), 305–315 (2008)
Warrens, M.: A formal proof of a paradox associated with Cohen’s kappa. J. Classif. 27(3), 322–332 (2010). doi:10.1007/s00357-010-9060-x, https://openaccess.leidenuniv.nl/bitstream/handle/1887/16310/Warrens_2010_JoC_27_322_332.pdf?sequence=2
Yu, Y., Lin, L.: Agreement: statistical tools for measuring agreement (2011). http://cran.r-project.org/web/packages/Agreement/Agreement.pdf
Zhao, X.: When to use Scott's π or Krippendorff's α, if ever. Paper presented at the annual conference of the Association for Education in Journalism and Mass Communication, St. Louis (2011). http://www.allacademic.com/meta/p517973_index.html
Zhao X., Liu J.S., Deng K.: Assumptions behind inter-coder reliability indices. In: Communication Yearbook, vol. 36. Routledge, New York (2012)
Zwick R.: Another look at interrater agreement. Psychol. Bull. 103(3), 374–378 (1988). doi:10.1037/0033-2909.103.3.374
Cite this article
Feng, G.C. Factors affecting intercoder reliability: a Monte Carlo experiment. Qual Quant 47, 2959–2982 (2013). https://doi.org/10.1007/s11135-012-9745-9