Abraira, V., & Pérez de Vargas, A. (1999). Generalization of the kappa coefficient for ordinal categorical data, multiple observers and incomplete designs. Qüestiió, 23, 561–571.
Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: a review of interrater agreement measures. Canadian Journal of Statistics, 27, 3–23.
Berry, K.J., Johnston, J.E., Zahran, S., & Mielke, P.W. (2009). Stuart’s tau measure of effect size for ordinal variables: some methodological considerations. Behavior Research Methods, 41, 1144–1148.
Blackman, N.J.M., & Koval, J.J. (2000). Interval estimation for Cohen’s kappa as a measure of agreement. Statistics in Medicine, 19, 723–741.
Brenner, H., & Kliebsch, U. (1996). Dependence of weighted kappa coefficients on the number of categories. Epidemiology, 7, 199–202.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J. (1968). Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
Conger, A.J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88, 322–328.
Crewson, P.E. (2005). Fundamentals of clinical research for radiologists. Reader agreement studies. American Journal of Roentgenology, 184, 1391–1397.
Davies, M., & Fleiss, J.L. (1982). Measuring agreement for multinomial data. Biometrics, 38, 1047–1051.
De Raadt, A., Warrens, M.J., Bosker, R.J., & Kiers, H.A.L. (2019). Kappa coefficients for missing data. Educational and Psychological Measurement, 79, 558–576.
De Vet, H.C.W., Mokkink, L.B., Terwee, C.B., Hoekstra, O.S., & Knol, D.L. (2013). Clinicians are right not to like Cohen’s kappa. British Medical Journal, 346, f2125.
De Winter, J.C., Gosling, S.D., & Potter, J. (2016). Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: a tutorial using simulations and empirical data. Psychological Methods, 21, 273–290.
Fagot, R.F. (1993). A generalized family of coefficients of relational agreement for numerical scales. Psychometrika, 58, 357–370.
Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
Fleiss, J.L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613–619.
Graham, P., & Jackson, R. (1993). The analysis of ordinal agreement data: beyond weighted kappa. Journal of Clinical Epidemiology, 46, 1055–1062.
Gwet, K.L. (2012). Handbook of inter-rater reliability: the definitive guide to measuring the extent of agreement among multiple raters, 3rd edn. Gaithersburg: Advanced Analytics.
Hauke, J., & Kossowski, T. (2011). Comparison of values of Pearson’s and Spearman’s correlation coefficient on the same sets of data. Quaestiones Geographicae, 30, 87–93.
Holmquist, N.D., McMahan, C.A., & Williams, O.D. (1967). Variability in classification of carcinoma in situ of the uterine cervix. Archives of Pathology, 84, 334–345.
Hubert, L. (1977). Kappa revisited. Psychological Bulletin, 84, 289–297.
Kendall, M.G. (1955). Rank correlation methods, 2nd edn. New York City: Hafner Publishing Co.
Kendall, M.G. (1962). Rank correlation methods, 3rd edn. Liverpool: Charles Birchall & Sons Ltd.
Krippendorff, K. (1978). Reliability of binary attribute data. Biometrics, 34, 142–144.
Krippendorff, K. (2013). Content analysis: an introduction to its methodology, 3rd edn. Thousand Oaks: Sage.
Kundel, H.L., & Polansky, M. (2003). Measurement of observer agreement. Radiology, 228, 303–308.
Landis, J.R., & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Light, R.J. (1971). Measures of response agreement for qualitative data: some generalizations and alternatives. Psychological Bulletin, 76, 365–377.
Maclure, M., & Willett, W.C. (1987). Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology, 126, 161–169.
McGraw, K.O., & Wong, S.P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.
McHugh, M.L. (2012). Interrater reliability: the kappa statistic. Biochemia Medica, 22, 276–282.
Mielke, P.W., Berry, K.J., & Johnston, J.E. (2007). The exact variance of weighted kappa with multiple raters. Psychological Reports, 101, 655–660.
Mielke, P.W., Berry, K.J., & Johnston, J.E. (2008). Resampling probability values for weighted kappa with multiple raters. Psychological Reports, 102, 606–613.
Moradzadeh, N., Ganjali, M., & Baghfalaki, T. (2017). Weighted kappa as a function of unweighted kappas. Communications in Statistics - Simulation and Computation, 46, 3769–3780.
Mukaka, M.M. (2012). A guide to appropriate use of correlation coefficient in medical research. Malawi Medical Journal, 24, 69–71.
Muñoz, S.R., & Bangdiwala, S.I. (1997). Interpretation of kappa and B statistics measures of agreement. Journal of Applied Statistics, 24, 105–111.
Parker, R.I., Vannest, K.J., & Davis, J.L. (2013). Reliability of multi-category rating scales. Journal of School Psychology, 51, 217–229.
R Core Team. (2019). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Rodgers, J.L., & Nicewander, W.A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42, 59–66.
Schouten, H.J.A. (1986). Nominal scale agreement among observers. Psychometrika, 51, 453–466.
Scott, W.A. (1955). Reliability of content analysis: the case of nominal scale coding. Public Opinion Quarterly, 19, 321–325.
Schuster, C. (2004). A note on the interpretation of weighted kappa and its relations to other rater agreement statistics for metric scales. Educational and Psychological Measurement, 64, 243–253.
Schuster, C., & Smith, D.A. (2005). Dispersion weighted kappa: an integrative framework for metric and nominal scale agreement coefficients. Psychometrika, 70, 135–146.
Shiloach, M., Frencher, S.K., Steeger, J.E., Rowell, K.S., Bartzokis, K., Tomeh, M.G., & Hall, B.L. (2010). Toward robust information: data quality and inter-rater reliability in the American College of Surgeons National Surgical Quality Improvement Program. Journal of the American College of Surgeons, 210, 6–16.
Shrout, P.E., & Fleiss, J.L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
Siegel, S., & Castellan, N.J. (1988). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Sim, J., & Wright, C.C. (2005). The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical Therapy, 85, 257–268.
Soeken, K.L., & Prescott, P.A. (1986). Issues in the use of kappa to estimate reliability. Medical Care, 24, 733–741.
Strijbos, J.-W., & Stahl, G. (2007). Methodological issues in developing a multi-dimensional coding procedure for small-group chat communication. Learning and Instruction, 17, 394–404.
Tinsley, H.E.A., & Weiss, D.J. (2000). Interrater reliability and agreement. In Tinsley, H.E.A., & Brown, S.D. (Eds.) Handbook of applied multivariate statistics and mathematical modeling (pp. 94–124). New York: Academic Press.
Vanbelle, S., & Albert, A. (2009). A note on the linearly weighted kappa coefficient for ordinal scales. Statistical Methodology, 6, 157–163.
Vanbelle, S. (2016). A new interpretation of the weighted kappa coefficients. Psychometrika, 81, 399–410.
Van de Grift, W. (2014). Measuring teaching quality in several European countries. School Effectiveness and School Improvement, 25, 295–311.
Van der Scheer, E.A., Glas, C.A.W., & Visscher, A.J. (2017). Changes in teachers’ instructional skills during an intensive data-based decision making intervention. Teaching and Teacher Education, 65, 171–182.
Viera, A.J., & Garrett, J.M. (2005). Understanding interobserver agreement: the kappa statistic. Family Medicine, 37, 360–363.
Warrens, M.J. (2010). Inequalities between multi-rater kappas. Advances in Data Analysis and Classification, 4, 271–286.
Warrens, M.J. (2011). Weighted kappa is higher than Cohen’s kappa for tridiagonal agreement tables. Statistical Methodology, 8, 268–272.
Warrens, M.J. (2012a). Some paradoxical results for the quadratically weighted kappa. Psychometrika, 77, 315–323.
Warrens, M.J. (2012b). A family of multi-rater kappas that can always be increased and decreased by combining categories. Statistical Methodology, 9, 330–340.
Warrens, M.J. (2012c). Equivalences of weighted kappas for multiple raters. Statistical Methodology, 9, 407–422.
Warrens, M.J. (2013). Conditional inequalities between Cohen’s kappa and weighted kappas. Statistical Methodology, 10, 14–22.
Warrens, M.J. (2014). Corrected Zegers-ten Berge coefficients are special cases of Cohen’s weighted kappa. Journal of Classification, 31, 179–193.
Warrens, M.J. (2015). Five ways to look at Cohen’s kappa. Journal of Psychology & Psychotherapy, 5, 197.
Warrens, M.J. (2017). Transforming intraclass correlations with the Spearman-Brown formula. Journal of Clinical Epidemiology, 85, 14–16.
Wing, L., Leekam, S.R., Libby, S.J., Gould, J., & Larcombe, M. (2002). The Diagnostic Interview for Social and Communication Disorders: background, inter-rater reliability and clinical use. Journal of Child Psychology and Psychiatry, 43, 307–325.
Xu, W., Hou, Y., Hung, Y.S., & Zou, Y. (2013). A comparative analysis of Spearman’s rho and Kendall’s tau in normal and contaminated normal models. Signal Processing, 93, 261–276.