Abstract
Seventeen measures of association for observer reliability (interobserver agreement) are reviewed, and computational formulas are given in a common notational system. An empirical comparison of 10 of these measures is made over a range of potential reliability-check results. The effects of occurrence frequency, error frequency, and error distribution on percentage and correlational measures are examined. The question of which is the “best” measure of interobserver agreement is discussed in terms of the critical issues to be considered.
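To make the kinds of measures being compared concrete, below is a minimal sketch, in Python, of three indices commonly computed on interval-recorded observation data: overall percentage agreement, occurrence-only agreement, and Cohen's (1960) kappa. The data and function names are illustrative assumptions, not the paper's notation or worked examples.

```python
# Illustrative agreement computations for two observers' interval records
# (1 = behavior occurred in the interval, 0 = did not). Hypothetical data.

def percent_agreement(a, b):
    """Overall percentage agreement: identically scored intervals / total."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def occurrence_agreement(a, b):
    """Agreement restricted to intervals where either observer scored an
    occurrence: joint occurrences / (joint occurrences + disagreements)."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return 100.0 * both / either if either else 100.0  # convention if no occurrences

def cohens_kappa(a, b):
    """Cohen's (1960) kappa: agreement corrected for chance, using each
    observer's marginal occurrence rate to estimate chance agreement."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_chance = pa * pb + (1 - pa) * (1 - pb)
    return (p_obs - p_chance) / (1 - p_chance)

obs1 = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
obs2 = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
print(percent_agreement(obs1, obs2))        # 80.0
print(occurrence_agreement(obs1, obs2))     # 3 of 5 -> 60.0
print(round(cohens_kappa(obs1, obs2), 3))   # 0.583
```

Note how the three indices diverge on the same records: with a low-frequency behavior, percentage agreement is inflated by the many matched nonoccurrence intervals, which is exactly the kind of distribution effect the comparison examines.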