On the Usefulness of Interrater Reliability Coefficients

ten Hove, Debby; Jorgensen, Terrence D.; van der Ark, L. Andries

doi:10.1007/978-3-319-77249-3_6

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 233)

Included in the following conference series:

The Annual Meeting of the Psychometric Society

Abstract

For four data sets of different measurement levels, we computed 20 coefficients that estimate interrater reliability. The results show that the coefficients provide very different numerical values when applied to the same data. We discuss possible explanations for the differences among coefficients and suggest further research that is needed to clarify which coefficient a researcher should use to estimate interrater reliability.
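To illustrate how two coefficients can diverge on the same data, the following is a minimal sketch that compares raw percent agreement with Cohen's (1960) kappa for two raters. The ratings are hypothetical toy data, not one of the four data sets analyzed in the chapter, and the function names are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter


def percent_agreement(r1, r2):
    """Proportion of subjects that both raters assign to the same category."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    p_o = percent_agreement(r1, r2)
    m1, m2 = Counter(r1), Counter(r2)
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over all categories either rater used.
    p_e = sum(m1[c] * m2[c] for c in set(r1) | set(r2)) / n**2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of 10 subjects by two raters (assumed for illustration).
rater_a = ["yes"] * 8 + ["no"] * 2
rater_b = ["yes"] * 7 + ["no", "yes", "no"]

print(round(percent_agreement(rater_a, rater_b), 3))  # 0.8
print(round(cohen_kappa(rater_a, rater_b), 3))        # 0.375
```

In this toy example both raters use "yes" for 8 of 10 subjects, so chance agreement is high (0.68). Percent agreement of 0.8 therefore shrinks to a kappa of 0.375, which shows how the choice of coefficient alone can change the apparent reliability.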

References

Bhapkar, V. P. (1966). A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association, 61, 228–235. https://doi.org/10.2307/2283057.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46. https://doi.org/10.1177/001316446002000104.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220. https://doi.org/10.1037/h0026256.
Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88, 322–328. https://doi.org/10.1037/0033-2909.88.2.322.
Eliasziw, M., Young, S. L., Woodbury, M. G., & Fryday-Field, K. (1994). Statistical methodology for the concurrent assessment of interrater and intrarater reliability: Using goniometric measurements as an example. Physical Therapy, 74, 777–788. https://doi.org/10.1093/ptj/74.8.777.
Feng, G. C. (2015). Mistakes and how to avoid mistakes in using intercoder reliability indices. Methodology, 11, 13–22. https://doi.org/10.1027/1614-2241/a000086.
Finn, R. H. (1970). A note on estimating the reliability of categorical data. Educational and Psychological Measurement, 30, 71–76. https://doi.org/10.1177/001316447003000106.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382. https://doi.org/10.1037/h0031619.
Gamer, M., Lemon, J., Fellows, I., & Singh, P. (2012). irr: Various coefficients of interrater reliability and agreement [computer software]. https://CRAN.R-project.org/package=irr.
Gwet, K. L. (2014). Handbook of inter-rater reliability (4th ed.). Gaithersburg, MD: Advanced Analytics, LLC.
Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8, 23–34. http://www.tqmp.org/RegularArticles/vol08-1/p023/p023.pdf.
Janson, H., & Olsson, U. (2001). A measure of agreement for interval or nominal multivariate observations. Educational and Psychological Measurement, 61, 277–289. https://doi.org/10.1177/00131640121971239.
Kendall, M. G. (1948). Rank correlation methods. London, UK: Griffin.
Krippendorff, K. (1980). Content analysis: An introduction to its methodology. Beverly Hills, CA: Sage.
Krippendorff, K. (2016). Misunderstanding reliability. Methodology, 12, 139–144. https://doi.org/10.1027/1614-2241/a000119.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174. https://doi.org/10.2307/2529310.
Light, R. J. (1971). Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 76, 365–377. https://doi.org/10.1037/h0031643.
Maxwell, A. E. (1970). Comparing the classification of subjects by two independent judges. British Journal of Psychiatry, 116, 651–655. https://doi.org/10.1192/bjp.116.535.651.
Pearson, K. (1895). Notes on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58, 240–242. http://www.jstor.org/stable/115794.
Popping, R. (1988). On agreement indices for nominal data. In W. E. Saris & I. N. Gallhofer (Eds.), Sociometric research (pp. 90–105). London, UK: Palgrave Macmillan. https://doi.org/10.1007/978-1-349-19051-5_6.
Rhemtulla, M., Brosseau-Laird, P. É., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17, 354–373. https://doi.org/10.1037/a0029315.
Robinson, W. S. (1957). The statistical measurement of agreement. American Sociological Review, 22, 17–25. http://www.jstor.org/stable/2088760.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlation: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428. https://doi.org/10.1037/0033-2909.86.2.420.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101. https://doi.org/10.2307/1412159.
Stuart, A. (1953). The estimation and comparison of strengths of association in contingency tables. Biometrika, 40, 105–110. https://doi.org/10.2307/2333101.
Van der Put, C. E., Spanjaard, H. J. M., van Domburgh, L., Doreleijers, T. A. H., Lodewijks, H. P. B., Ferwerda, H. B., et al. (2011). Ontwikkeling van het Landelijke Instrumentarium Jeugdstrafrechtketen (LIJ) [Development of the national assessment procedure for youth criminal justice]. Kind & Adolescent Praktijk, 10, 76–83.
Vangeneugden, T., Laenen, A., Geys, H., Renard, D., & Molenberghs, G. (2005). Applying concepts of generalizability theory on clinical trial data to investigate sources of variation and their impact on reliability. Biometrics, 61, 295–304. https://doi.org/10.1111/j.0006-341X.2005.031040.x.
Zhao, X., Liu, J. S., & Deng, K. (2013). Assumptions behind intercoder reliability indices. Annals of the International Communication Association, 36, 419–480. https://doi.org/10.1080/23808985.2013.11679142.

Author information

Authors and Affiliations

Research Institute of Child Development and Education, University of Amsterdam, P. O. Box 15776, 1001 NG, Amsterdam, The Netherlands
Debby ten Hove, Terrence D. Jorgensen & L. Andries van der Ark

Corresponding author

Correspondence to L. Andries van der Ark.

Editor information

Editors and Affiliations

Umeå School of Business, Economics and Statistics, Umeå University, Umeå, Sweden
Marie Wiberg
Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, Illinois, USA
Steven Culpepper
Faculty of Psychology and Educational Sciences, KU Leuven, Leuven, Belgium
Rianne Janssen
Faculty of Mathematics, Pontificia Universidad Católica de Chile, Santiago, Chile
Jorge González
Department of Psychology, University of Amsterdam, Amsterdam, The Netherlands
Dylan Molenaar

About this paper

Cite this paper

ten Hove, D., Jorgensen, T.D., van der Ark, L.A. (2018). On the Usefulness of Interrater Reliability Coefficients. In: Wiberg, M., Culpepper, S., Janssen, R., González, J., Molenaar, D. (eds) Quantitative Psychology. IMPS 2017. Springer Proceedings in Mathematics & Statistics, vol 233. Springer, Cham. https://doi.org/10.1007/978-3-319-77249-3_6

DOI: https://doi.org/10.1007/978-3-319-77249-3_6
Published: 21 April 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77248-6
Online ISBN: 978-3-319-77249-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
