Journal of Behavioral Assessment, Volume 3, Issue 1, pp 37–57

Measures of interobserver agreement: Calculation formulas and distribution effects

  • Alvin Enis House
  • Betty J. House
  • Martha B. Campbell


Seventeen measures of association for observer reliability (interobserver agreement) are reviewed and computational formulas are given in a common notational system. An empirical comparison of 10 of these measures is made over a range of potential reliability check results. The effects on percentage and correlational measures of occurrence frequency, error frequency, and error distribution are examined. The question of which is the “best” measure of interobserver agreement is discussed in terms of critical issues to be considered.
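
To illustrate the kind of distribution effect described above, here is a minimal sketch that computes a few widely used two-observer measures from an interval-by-interval agreement table. The definitions are the standard textbook ones for percentage agreement, occurrence and nonoccurrence agreement, Cohen's kappa, and phi; they are not taken from the paper's own notation. The function name `agreement_measures`, the cell labels `a`, `b`, `c`, `d`, and the example counts are assumptions made for illustration only.

```python
from math import sqrt, nan


def agreement_measures(a, b, c, d):
    """Two-observer agreement measures from a 2x2 table of interval counts.

    a: both observers scored an occurrence
    b: only observer 1 scored an occurrence
    c: only observer 2 scored an occurrence
    d: both observers scored a nonoccurrence
    (Cell labels are illustrative, not the paper's notation.)
    """
    n = a + b + c + d
    overall = (a + d) / n  # proportion of intervals scored the same way
    # Agreement restricted to intervals where at least one observer
    # scored an occurrence (or a nonoccurrence, respectively).
    occurrence = a / (a + b + c) if (a + b + c) else nan
    nonoccurrence = d / (b + c + d) if (b + c + d) else nan
    # Chance agreement expected from the two observers' marginal rates.
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (overall - p_chance) / (1 - p_chance) if p_chance != 1 else nan
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    phi = (a * d - b * c) / denom if denom else nan
    return {
        "overall": overall,
        "occurrence": occurrence,
        "nonoccurrence": nonoccurrence,
        "kappa": kappa,
        "phi": phi,
    }


# Hypothetical reliability checks: 100 intervals and 4 disagreements each,
# differing only in how often the behavior occurs.
low_rate = agreement_measures(a=2, b=2, c=2, d=94)
mid_rate = agreement_measures(a=48, b=2, c=2, d=48)

for label, result in (("low rate", low_rate), ("mid rate", mid_rate)):
    print(label, {name: round(value, 3) for name, value in result.items()})
```

In both hypothetical checks the overall percentage agreement is identical, yet occurrence agreement, kappa, and phi are much lower when the behavior is rare. This is one way to see why occurrence frequency matters when comparing percentage and correlational measures.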

Key words

interobserver agreement; observer reliability; measures of association; naturalistic observation; interval-by-interval coding systems




Copyright information

© Plenum Publishing Corporation 1981

Authors and Affiliations

  • Alvin Enis House (1)
  • Betty J. House (1)
  • Martha B. Campbell (1)
  1. Psychology Department, Illinois State University, Normal
