Abraira, V., & Pérez de Vargas, A. (1999). Generalization of the kappa coefficient for ordinal categorical data, multiple observers and incomplete designs. Qüestiió, 23, 561–571.
Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: a review of interrater agreement measures. Canadian Journal of Statistics, 27, 3–23.
Berry, K.J., Johnston, J.E., Zahran, S., & Mielke, P.W. (2009). Stuart’s tau measure of effect size for ordinal variables: some methodological considerations. Behavior Research Methods, 41, 1144–1148.
Blackman, N.J.M., & Koval, J.J. (2000). Interval estimation for Cohen’s kappa as a measure of agreement. Statistics in Medicine, 19, 723–741.
Brenner, H., & Kliebsch, U. (1996). Dependence of weighted kappa coefficients on the number of categories. Epidemiology, 7, 199–202.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J. (1968). Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
Conger, A.J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88, 322–328.
Crewson, P.E. (2005). Fundamentals of clinical research for radiologists. Reader agreement studies. American Journal of Roentgenology, 184, 1391–1397.
Davies, M., & Fleiss, J.L. (1982). Measuring agreement for multinomial data. Biometrics, 38, 1047–1051.
De Raadt, A., Warrens, M.J., Bosker, R.J., & Kiers, H.A.L. (2019). Kappa coefficients for missing data. Educational and Psychological Measurement, 79, 558–576.
De Vet, H.C.W., Mokkink, L.B., Terwee, C.B., Hoekstra, O.S., & Knol, D.L. (2013). Clinicians are right not to like Cohen’s kappa. British Medical Journal, 346, f2125.
De Winter, J.C., Gosling, S.D., & Potter, J. (2016). Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: a tutorial using simulations and empirical data. Psychological Methods, 21, 273–290.
Fagot, R.F. (1993). A generalized family of coefficients of relational agreement for numerical scales. Psychometrika, 58, 357–370.
Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
Fleiss, J.L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613–619.
Graham, P., & Jackson, R. (1993). The analysis of ordinal agreement data: beyond weighted kappa. Journal of Clinical Epidemiology, 46, 1055–1062.
Gwet, K.L. (2012). Handbook of inter-rater reliability: the definitive guide to measuring the extent of agreement among multiple raters, 3rd edn. Gaithersburg: Advanced Analytics.
Hauke, J., & Kossowski, T. (2011). Comparison of values of Pearson’s and Spearman’s correlation coefficient on the same sets of data. Quaestiones Geographicae, 30, 87–93.
Holmquist, N.D., McMahan, C.A., & Williams, O.D. (1967). Variability in classification of carcinoma in situ of the uterine cervix. Archives of Pathology, 84, 334–345.
Hubert, L. (1977). Kappa revisited. Psychological Bulletin, 84, 289–297.
Kendall, M.G. (1955). Rank correlation methods, 2nd edn. New York: Hafner Publishing Co.
Kendall, M.G. (1962). Rank correlation methods, 3rd edn. Liverpool: Charles Birchall & Sons Ltd.
Krippendorff, K. (1978). Reliability of binary attribute data. Biometrics, 34, 142–144.
Krippendorff, K. (2013). Content analysis: an introduction to its methodology, 3rd edn. Thousand Oaks: Sage.
Kundel, H.L., & Polansky, M. (2003). Measurement of observer agreement. Radiology, 228, 303–308.
Landis, J.R., & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Light, R.J. (1971). Measures of response agreement for qualitative data: some generalizations and alternatives. Psychological Bulletin, 76, 365–377.
Maclure, M., & Willett, W.C. (1987). Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology, 126, 161–169.
McGraw, K.O., & Wong, S.P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.
McHugh, M.L. (2012). Interrater reliability: the kappa statistic. Biochemia Medica, 22, 276–282.
Mielke, P.W., Berry, K.J., & Johnston, J.E. (2007). The exact variance of weighted kappa with multiple raters. Psychological Reports, 101, 655–660.
Mielke, P.W., Berry, K.J., & Johnston, J.E. (2008). Resampling probability values for weighted kappa with multiple raters. Psychological Reports, 102, 606–613.
Moradzadeh, N., Ganjali, M., & Baghfalaki, T. (2017). Weighted kappa as a function of unweighted kappas. Communications in Statistics - Simulation and Computation, 46, 3769–3780.
Mukaka, M.M. (2012). A guide to appropriate use of correlation coefficient in medical research. Malawi Medical Journal, 24, 69–71.
Muñoz, S.R., & Bangdiwala, S.I. (1997). Interpretation of kappa and B statistics measures of agreement. Journal of Applied Statistics, 24, 105–111.
Parker, R.I., Vannest, K.J., & Davis, J.L. (2013). Reliability of multi-category rating scales. Journal of School Psychology, 51, 217–229.
R Core Team. (2019). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Rodgers, J.L., & Nicewander, W.A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42, 59–66.
Schouten, H.J.A. (1986). Nominal scale agreement among observers. Psychometrika, 51, 453–466.
Scott, W.A. (1955). Reliability of content analysis: the case of nominal scale coding. Public Opinion Quarterly, 19, 321–325.
Schuster, C. (2004). A note on the interpretation of weighted kappa and its relations to other rater agreement statistics for metric scales. Educational and Psychological Measurement, 64, 243–253.
Schuster, C., & Smith, D.A. (2005). Dispersion weighted kappa: an integrative framework for metric and nominal scale agreement coefficients. Psychometrika, 70, 135–146.
Shiloach, M., Frencher, S.K., Steeger, J.E., Rowell, K.S., Bartzokis, K., Tomeh, M.G., & Hall, B.L. (2010). Toward robust information: data quality and inter-rater reliability in the American College of Surgeons National Surgical Quality Improvement Program. Journal of the American College of Surgeons, 210, 6–16.
Shrout, P.E., & Fleiss, J.L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
Siegel, S., & Castellan, N.J. (1988). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Sim, J., & Wright, C.C. (2005). The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical Therapy, 85, 257–268.
Soeken, K.L., & Prescott, P.A. (1986). Issues in the use of kappa to estimate reliability. Medical Care, 24, 733–741.
Strijbos, J.-W., & Stahl, G. (2007). Methodological issues in developing a multi-dimensional coding procedure for small-group chat communication. Learning and Instruction, 17, 394–404.
Tinsley, H.E.A., & Weiss, D.J. (2000). Interrater reliability and agreement. In Tinsley, H.E.A., & Brown, S.D. (Eds.) Handbook of applied multivariate statistics and mathematical modeling (pp. 94–124). New York: Academic Press.
Vanbelle, S. (2016). A new interpretation of the weighted kappa coefficients. Psychometrika, 81, 399–410.
Vanbelle, S., & Albert, A. (2009). A note on the linearly weighted kappa coefficient for ordinal scales. Statistical Methodology, 6, 157–163.
Van de Grift, W. (2007). Measuring teaching quality in several European countries. School Effectiveness and School Improvement, 25, 295–311.
Van der Scheer, E.A., Glas, C.A.W., & Visscher, A.J. (2017). Changes in teachers’ instructional skills during an intensive data-based decision making intervention. Teaching and Teacher Education, 65, 171–182.
Viera, A.J., & Garrett, J.M. (2005). Understanding interobserver agreement: the kappa statistic. Family Medicine, 37, 360–363.
Warrens, M.J. (2010). Inequalities between multi-rater kappas. Advances in Data Analysis and Classification, 4, 271–286.
Warrens, M.J. (2011). Weighted kappa is higher than Cohen’s kappa for tridiagonal agreement tables. Statistical Methodology, 8, 268–272.
Warrens, M.J. (2012a). Some paradoxical results for the quadratically weighted kappa. Psychometrika, 77, 315–323.
Warrens, M.J. (2012b). A family of multi-rater kappas that can always be increased and decreased by combining categories. Statistical Methodology, 9, 330–340.
Warrens, M.J. (2012c). Equivalences of weighted kappas for multiple raters. Statistical Methodology, 9, 407–422.
Warrens, M.J. (2013). Conditional inequalities between Cohen’s kappa and weighted kappas. Statistical Methodology, 10, 14–22.
Warrens, M.J. (2014). Corrected Zegers-ten Berge coefficients are special cases of Cohen’s weighted kappa. Journal of Classification, 31, 179–193.
Warrens, M.J. (2015). Five ways to look at Cohen’s kappa. Journal of Psychology & Psychotherapy, 5, 197.
Warrens, M.J. (2017). Transforming intraclass correlations with the Spearman-Brown formula. Journal of Clinical Epidemiology, 85, 14–16.
Wing, L., Leekam, S.R., Libby, S.J., Gould, J., & Larcombe, M. (2002). The Diagnostic Interview for Social and Communication Disorders: background, inter-rater reliability and clinical use. Journal of Child Psychology and Psychiatry, 43, 307–325.
Xu, W., Hou, Y., Hung, Y.S., & Zou, Y. (2013). A comparative analysis of Spearman's rho and Kendall's tau in normal and contaminated normal models. Signal Processing, 93, 261–276.