Abstract
Cohen's kappa coefficient is the most popular criterion for measuring the overall agreement between two raters who classify a set of subjects on a binary nominal scale. In this paper, we consider a unified Bayesian approach for testing hypotheses about kappa coefficients under order constraints, for ratings from more than two studies with binary responses. A Markov chain Monte Carlo (MCMC) approach is used for model implementation. The approach is illustrated with simulation studies and applied to a real data set.
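For readers unfamiliar with the agreement criterion the paper builds on, the following is a minimal sketch of Cohen's kappa for two raters and a binary rating, computed from a 2×2 contingency table. This illustrates only the classical point estimate; the paper's contribution, Bayesian testing of order-constrained hypotheses on several kappas via MCMC, goes beyond it. The function name and the example counts are illustrative, not from the paper.

```python
def cohens_kappa(table):
    """Cohen's kappa from a 2x2 table.

    table[i][j] = number of subjects rated category i by rater 1
    and category j by rater 2.
    """
    n = sum(sum(row) for row in table)
    # Observed proportion of agreement (diagonal of the table).
    p_o = sum(table[i][i] for i in range(2)) / n
    # Marginal rating proportions for each rater.
    row = [sum(table[i]) / n for i in range(2)]
    col = [sum(table[i][j] for i in range(2)) / n for j in range(2)]
    # Agreement expected by chance under independent raters.
    p_e = sum(row[i] * col[i] for i in range(2))
    # Kappa: chance-corrected agreement.
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: 100 subjects, 85 rated identically by both raters.
kappa = cohens_kappa([[40, 5], [10, 45]])  # -> 0.7
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance; the order-constrained hypotheses in the paper compare such coefficients across multiple studies.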
Ganjali, M., Moradzadeh, N. & Baghfalaki, T. Bayesian testing of agreement criteria under order constraints. J. Korean Stat. Soc. 46, 78–87 (2017). https://doi.org/10.1016/j.jkss.2016.06.004