Behavioral researchers have developed a sophisticated methodology for evaluating behavior change, and that methodology depends on accurate measurement of behavior. Direct observation has traditionally been the mainstay of behavioral measurement. Consequently, researchers must attend to the psychometric properties of observational measures, such as interobserver agreement, to ensure reliable and valid measurement. Of the many indices of interobserver agreement, percentage of agreement is the most popular. Its use persists despite repeated admonitions and empirical evidence that it is not the most psychometrically sound statistic for determining interobserver agreement, because it does not take chance agreement into account. Cohen's (1960) kappa has long been proposed as a more psychometrically sound statistic for assessing interobserver agreement. This article describes kappa and presents methods for computing it.
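The contrast between percentage of agreement and kappa can be illustrated with a minimal sketch. It is not the article's own computational method, which is not reproduced in this preview; it assumes two observers coding the same intervals into nominal categories and uses the standard definition of kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the proportion expected by chance from each observer's marginal totals. The function names and the example data are hypothetical.

```python
from collections import Counter


def percent_agreement(obs_a, obs_b):
    """Proportion of intervals on which the two observers recorded the same code."""
    if len(obs_a) != len(obs_b) or len(obs_a) == 0:
        raise ValueError("observers must code the same non-empty set of intervals")
    matches = sum(a == b for a, b in zip(obs_a, obs_b))
    return matches / len(obs_a)


def cohens_kappa(obs_a, obs_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(obs_a)
    p_o = percent_agreement(obs_a, obs_b)
    # Chance agreement: for each category, the product of the two observers'
    # marginal proportions, summed over categories.
    count_a = Counter(obs_a)
    count_b = Counter(obs_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: 1 = behavior occurred in the interval, 0 = it did not.
# Each observer scores the behavior in 9 of 10 intervals, but they disagree
# about which interval was the non-occurrence.
observer_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
observer_b = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

print(f"percent agreement: {percent_agreement(observer_a, observer_b):.2f}")
print(f"kappa: {cohens_kappa(observer_a, observer_b):.2f}")
```

With these invented data the observers agree on 80% of intervals, yet chance alone would produce 82% agreement given how often each observer scores an occurrence, so kappa is slightly negative. This is the kind of case in which percentage of agreement overstates how well two observers agree.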
American Psychological Association, American Educational Research Association, and National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Baer, D. M. (1977). Reviewer's comment: Just because it's reliable doesn't mean that you can use it. Journal of Applied Behavior Analysis, 10, 117–119.
Berk, R. A. (1979). Generalizability of behavioral observations: A clarification of interobserver agreement and interobserver reliability. American Journal of Mental Deficiency, 83, 460–472.
Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6, 284–290.
Ciminero, A. R., Calhoun, K. S., & Adams, H. E. (Eds.). (1986). Handbook of behavioral assessment (2nd ed.). New York: Wiley.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cone, J. D. (1977). The relevance of reliability and validity for behavioral assessment. Behavior Therapy, 8, 411–426.
Cone, J. D. (1988). Psychometric considerations and the multiple models of behavioral assessment. In A. S. Bellack & M. Hersen (Eds.), Behavioral assessment: A practical handbook (3rd ed.). New York: Pergamon.
Dunn, G., & Everitt, B. (1995). Clinical biostatistics: An introduction to evidence-based medicine. London: Edward Arnold.
Everitt, B. S. (1994). Statistical methods for medical investigations (2nd ed.). New York: Halsted Press.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley.
Foster, S. L., Bell-Dolan, D. J., & Burge, D. A. (1988). Behavioral observation. In A. S. Bellack & M. Hersen (Eds.), Behavioral assessment: A practical handbook (3rd ed.). New York: Pergamon.
Gresham, F. M. (1998). Designs for evaluating behavior change. In T. S. Watson & F. M. Gresham (Eds.), Handbook of child behavior therapy. New York: Plenum.
Hartmann, D. P. (1977, Spring). Considerations in the choice of interobserver reliability estimates. Journal of Applied Behavior Analysis, 10, 103–116.
Hoge, R. D. (1985). The validity of direct observation measures of pupil classroom behavior. Review of Educational Research, 55, 469–483.
Hops, H., Davis, B., & Longoria, N. (1995). Methodological issues in direct observation: Illustrations with the living in familial environments (LIFE) coding system. Journal of Clinical Child Psychology, 24, 193–203.
Johnson, J. M., & Pennypacker, H. S. (1993). Strategies and tactics of human behavioral research (2nd ed.). Hillsdale, NJ: Erlbaum.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Langenbucher, J., Labouvie, E., & Morgenstern, J. (1996). Methodological developments: Measuring diagnostic agreement. Journal of Consulting and Clinical Psychology, 64, 1285–1289.
McDermott, P. A. (1988). Agreement among diagnosticians or observers: Its importance and determination. Professional School Psychology, 3, 225–240.
Nelson, L. D., & Cicchetti, D. V. (1995). Assessment of emotional functioning in brain-impaired individuals. Psychological Assessment, 7, 404–413.
Shrout, P. E., Spitzer, R. L., & Fleiss, J. L. (1987). Comment: Quantification of agreement in psychiatric diagnosis revisited. Archives of General Psychiatry, 44, 172–178.
Suen, H. K. (1988). Agreement, reliability, accuracy, and validity: Toward a clarification. Behavioral Assessment, 10, 343–366.
Suen, H. K., & Lee, P. S. (1985). Effects of the use of percentage agreement on behavioral observation reliabilities: A reassessment. Journal of Psychopathology and Behavioral Assessment, 7, 221–234.
Wasik, B. H., & Loven, M. D. (1980). Classroom observational data: Sources of inaccuracy and proposed solutions. Behavioral Assessment, 2, 211–227.
Watkins, M. W. (1988). MacKappa [Computer software]. Pennsylvania State University: Author.
Cite this article
Watkins, M.W., Pacheco, M. Interobserver Agreement in Behavioral Research: Importance and Calculation. Journal of Behavioral Education 10, 205–212 (2000). https://doi.org/10.1023/A:1012295615144
- interobserver agreement
- interrater reliability
- observer agreement