Interrater reliability and agreement of performance ratings: A methodological comparison

Abstract

This paper demonstrates and compares methods for estimating the interrater reliability and interrater agreement of performance ratings. These methods can be used by applied researchers to investigate the quality of ratings gathered, for example, as criteria for a validity study, or as performance measures for selection or promotional purposes. While estimates of interrater reliability are frequently used for these purposes, indices of interrater agreement appear to be rarely reported for performance ratings. A recommended index of interrater agreement, the T index (Tinsley & Weiss, 1975), is compared to four methods of estimating interrater reliability (Pearson r, coefficient alpha, mean correlation between raters, and intraclass correlation). Subordinate and superior ratings of the performance of 100 managers were used in these analyses. The results indicated that, in general, interrater agreement and reliability among subordinates were fairly high. Interrater agreement between subordinates and superiors was moderately high; however, interrater reliability between these two rating sources was very low. The results demonstrate that interrater agreement and reliability are distinct indices and that both should be reported. Reasons are discussed as to why interrater reliability should not be reported alone.
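
The distinction the abstract draws between agreement and reliability can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the ratings are invented, not taken from the study; the function names `pearson_r` and `t_index` are hypothetical; and the T index is written in the commonly cited chance-corrected form T = (N1 - N*P) / (N - N*P), with N1 the number of exact agreements, N the number of ratees, and P = 1/A the chance probability of exact agreement on an A-point scale. The paper's own computations are not shown in this preview and may differ.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two raters' scores (a reliability estimate)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def t_index(x, y, n_scale_points):
    """Chance-corrected exact agreement, assumed form (N1 - N*P) / (N - N*P).

    Assumes P = 1 / n_scale_points, i.e. chance agreement under
    equally likely scale points. This is an illustrative assumption.
    """
    n = len(x)
    n_agree = sum(a == b for a, b in zip(x, y))  # N1: exact matches
    p_chance = 1 / n_scale_points
    return (n_agree - n * p_chance) / (n - n * p_chance)


# Invented ratings on a 5-point scale (not data from the study).

# Case 1: rater B is always two points higher than rater A.
# The rank order is identical, but the raters never give the same score.
shifted_a = [1, 2, 3, 1, 2, 3]
shifted_b = [3, 4, 5, 3, 4, 5]

# Case 2: both raters use a narrow range and mostly give the same score,
# but the little variance that exists is unrelated between them.
narrow_a = [4, 4, 4, 4, 5, 4, 4, 4]
narrow_b = [4, 4, 4, 4, 4, 4, 4, 5]

for label, a, b in [("shifted", shifted_a, shifted_b),
                    ("narrow range", narrow_a, narrow_b)]:
    print(f"{label}: r = {pearson_r(a, b):.2f}, T = {t_index(a, b, 5):.2f}")
```

In the first case the correlation is perfect while the agreement index is negative; in the second, most scores match exactly yet the correlation is slightly negative. The second pattern resembles, in miniature, the subordinate-superior result described above, where agreement was moderately high but reliability was very low.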

References

  1. Berry, K., & Mielke, P. (1988). A generalization of Cohen's kappa agreement measure to interval measurement and multiple raters. Educational and Psychological Measurement, 48, 921–933.

  2. Berry, K., & Mielke, P. (1990). A generalized agreement measure. Educational and Psychological Measurement, 50, 123–125.

  3. Campion, M., & Pursell, E. (1981). Plymouth Fiber Extraboard Validation Report. New Bern, NC: Weyerhaeuser Co.

  4. Campion, M., Pursell, E., & Brown, B. (1988). Structured interviewing: Raising the psychometric properties of the employment interview. Personnel Psychology, 41, 25–42.

  5. Cronbach, L., Gleser, G., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.

  6. Ghiselli, E., Campbell, J., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. San Francisco: W. H. Freeman.

  7. Guilford, J., & Fruchter, B. (1978). Fundamental statistics in psychology and education. New York: McGraw-Hill.

  8. James, L., Demaree, R., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 322–327.

  9. Hayes, W. (1988). Statistics. Fort Worth, TX: Holt, Rinehart and Winston.

  10. Kozlowski, S., & Hattrup, K. (1992). A disagreement about within group agreement: Disentangling issues of consistency versus consensus. Journal of Applied Psychology, 77, 161–167.

  11. Lawlis, F., & Lu, E. (1972). Judgment of counseling process: Reliability, agreement, and error. Psychological Bulletin, 78, 17–20.

  12. Rothstein, H. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 85–98.

  13. Saal, F., Downey, R., & Lahey, M. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413–428.

  14. SAS. (1990). SAS/STAT user's guide. Vol. 2. Cary, NC: SAS Institute.

  15. Schneider, B., & Schmitt, N. (1986). Staffing organizations. Glenview, IL: Scott, Foresman.

  16. Shrout, P., & Fleiss, J. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.

  17. Tinsley, H., & Weiss, D. (1975). Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 22, 358–376.

  18. Tornow, W. (1993). Perceptions or reality: Is multi-perspective measurement a means or an end? Human Resource Management, 32, 221–229.

  19. Winer, B. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.

Author information

Corresponding author

Correspondence to John W. Fleenor.

Additional information

This paper is based, in part, on a thesis submitted to East Carolina University by the second author. Portions of this study were presented at the American Psychological Association meeting in New Orleans, LA, August, 1989. The authors would like to thank Michael Campion and two anonymous reviewers for their comments on earlier drafts of this paper.

About this article

Cite this article

Fleenor, J.W., Fleenor, J.B. & Grossnickle, W.F. Interrater reliability and agreement of performance ratings: A methodological comparison. J Bus Psychol 10, 367–380 (1996). https://doi.org/10.1007/BF02249609

Keywords

  • Performance Rating
  • Social Psychology
  • Intraclass Correlation
  • Validity Study
  • Social Issue