Measures of interobserver agreement: Calculation formulas and distribution effects

Journal of Behavioral Assessment

Abstract

Seventeen measures of association for observer reliability (interobserver agreement) are reviewed, and computational formulas are given in a common notational system. An empirical comparison of 10 of these measures is made over a range of potential reliability check results. The effects of occurrence frequency, error frequency, and error distribution on percentage and correlational measures are examined. The question of which is the “best” measure of interobserver agreement is discussed in terms of critical issues to be considered.
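As an illustration of the kind of percentage and correlational measures referred to above, the following minimal sketch computes a few widely used indices from a 2 × 2 table of interval-by-interval scores. The cell labels `a`, `b`, `c`, `d` and the function name `agreement_measures` are assumptions made for this example. They are not the paper's notation, and the five indices shown are common textbook forms rather than the paper's own set of measures.

```python
from math import sqrt


def agreement_measures(a, b, c, d):
    """Agreement indices for two observers scoring the same intervals.

    Hypothetical cell labels (not the paper's notation):
      a: both observers scored an occurrence
      b: only observer 1 scored an occurrence
      c: only observer 2 scored an occurrence
      d: both observers scored a nonoccurrence
    Degenerate tables (an empty row or column margin) are not handled
    and raise ZeroDivisionError.
    """
    n = a + b + c + d

    # Percentage-type measures
    overall = (a + d) / n              # agreement over all intervals
    occurrence = a / (a + b + c)       # ignores joint nonoccurrences
    nonoccurrence = d / (b + c + d)    # ignores joint occurrences

    # Cohen's kappa: overall agreement corrected for chance agreement,
    # where chance is estimated from each observer's marginal rates.
    p1 = (a + b) / n                   # observer 1 occurrence rate
    p2 = (a + c) / n                   # observer 2 occurrence rate
    chance = p1 * p2 + (1 - p1) * (1 - p2)
    kappa = (overall - chance) / (1 - chance)

    # Phi: product-moment correlation for a 2 x 2 table
    phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

    return {
        "overall": overall,
        "occurrence": occurrence,
        "nonoccurrence": nonoccurrence,
        "kappa": kappa,
        "phi": phi,
    }


if __name__ == "__main__":
    # Same number of disagreements, different occurrence frequency.
    print(agreement_measures(40, 10, 10, 40))  # behavior scored in about half the intervals
    print(agreement_measures(5, 2, 3, 90))     # behavior scored rarely
```

Running the two example tables side by side shows the kind of distribution effect the abstract describes: when the behavior is rare, overall percentage agreement can remain high while occurrence agreement is much lower, so the choice of index changes the apparent reliability.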

About this article

Cite this article

House, A.E., House, B.J. & Campbell, M.B. Measures of interobserver agreement: Calculation formulas and distribution effects. Journal of Behavioral Assessment 3, 37–57 (1981). https://doi.org/10.1007/BF01321350

  • DOI: https://doi.org/10.1007/BF01321350
