Bayesian testing of agreement criteria under order constraints

Journal of the Korean Statistical Society

Abstract

Cohen's kappa coefficient is the most popular criterion for measuring the overall agreement between two raters. It measures the agreement between two raters who classify subjects on a binary nominal scale. In this paper, we consider a unified Bayesian approach for testing hypotheses about kappa coefficients under order constraints, for ratings from more than two studies with binary responses. A Markov chain Monte Carlo (MCMC) approach is used to implement the model. The approach is illustrated through simulation studies, and the proposed method is applied to the analysis of a real data set.
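
To make the quantities concrete, the sketch below illustrates Bayesian inference for Cohen's kappa from a single 2 × 2 agreement table. The Dirichlet prior, the hypothetical counts, and the direct posterior sampling are illustrative assumptions only; this is not the authors' order-constrained testing procedure.

```python
import numpy as np

# Illustrative sketch only: posterior sampling of Cohen's kappa for a single
# 2x2 agreement table under a Dirichlet prior on the cell probabilities.
# counts[i][j] is the (hypothetical) number of subjects rated i by rater 1
# and j by rater 2 on a binary (0/1) scale.
counts = np.array([[40.0, 10.0],
                   [ 5.0, 45.0]])

rng = np.random.default_rng(2016)
alpha_prior = np.ones(4)  # uniform Dirichlet prior over the four cells

# With a Dirichlet prior, the posterior of the cell probabilities is
# Dirichlet(prior + counts), so direct Monte Carlo sampling suffices here.
draws = rng.dirichlet(alpha_prior + counts.ravel(), size=20000)
p = draws.reshape(-1, 2, 2)

p_o = p[:, 0, 0] + p[:, 1, 1]      # observed agreement per draw
row = p.sum(axis=2)                # rater-1 marginal probabilities
col = p.sum(axis=1)                # rater-2 marginal probabilities
p_e = (row * col).sum(axis=1)      # chance agreement per draw
kappa = (p_o - p_e) / (1.0 - p_e)  # Cohen's kappa per posterior draw

print("posterior mean of kappa:", kappa.mean())
print("95% credible interval:", np.quantile(kappa, [0.025, 0.975]))

# An order-constrained hypothesis such as kappa_1 < kappa_2 across studies
# could be assessed by repeating this per study and comparing the posterior
# probability of the constraint with its prior probability.
```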

Author information

Corresponding author

Correspondence to T. Baghfalaki.

About this article

Cite this article

Ganjali, M., Moradzadeh, N. & Baghfalaki, T. Bayesian testing of agreement criteria under order constraints. J. Korean Stat. Soc. 46, 78–87 (2017). https://doi.org/10.1016/j.jkss.2016.06.004

  • DOI: https://doi.org/10.1016/j.jkss.2016.06.004
