Benchmarking Kappa: Interrater Agreement in Software Process Assessments

Abstract

Software process assessments have become a prevalent tool for process improvement and contract risk assessment in the software industry. Because scores are assigned to processes during an assessment, a process assessment can be considered a subjective measurement procedure. As with any subjective measurement procedure, the reliability of process assessments has important implications for the utility of assessment scores, and reliability can therefore be taken as a criterion for evaluating an assessment's quality. The particular type of reliability of interest in this paper is interrater agreement. Thus far, empirical evaluations of the interrater agreement of assessments have used Cohen's Kappa coefficient. Once a Kappa value has been derived, the next question is “how good is it?” Benchmarks for interpreting obtained Kappa values are available from the social sciences and medical literature. However, the applicability of these benchmarks to the software process assessment context is not obvious. In this paper we develop a benchmark for interpreting Kappa values using ratings of 70 process instances collected from assessments of 19 different projects in 7 different organizations in Europe during the SPICE Trials (an international effort to empirically evaluate the emerging ISO/IEC 15504 International Standard for Software Process Assessment). The benchmark indicates that Kappa values below 0.45 are poor, and values above 0.62 constitute substantial agreement and should be the minimum aimed for. This benchmark can be used to decide how good an assessment's reliability is.
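
Cohen's Kappa corrects the raw proportion of agreement between two raters for the agreement expected by chance. A minimal sketch in Python of the two-rater statistic (the function name and the six-item example are ours; the F/L/P/N categories are illustrative, echoing the four-point adequacy scale used in ISO/IEC 15504 ratings):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's (1960) kappa for two raters assigning nominal categories.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e is the agreement expected by
    chance, computed from each rater's marginal category frequencies.
    """
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must rate the same items")
    n = len(ratings_a)
    # Observed agreement: fraction of items rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from the two raters' marginal distributions.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical assessors rating six process instances
# (F=Fully, L=Largely, P=Partially, N=Not achieved).
rater_1 = ["F", "L", "P", "F", "N", "L"]
rater_2 = ["F", "L", "L", "F", "N", "P"]
# kappa ≈ 0.54: between the 0.45 and 0.62 thresholds of the
# benchmark developed in this paper.
print(round(cohens_kappa(rater_1, rater_2), 2))
```

Note that kappa is undefined when chance agreement is 1 (both raters use a single category), which is why degenerate rating vectors must be screened out before computing it.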

References

  • Allen, M., and Yen, W. 1979. Introduction to Measurement Theory. Brooks/Cole Publishing Company.

  • Altman, D. 1991. Practical Statistics for Medical Research. Chapman and Hall.

  • Armitage, P., and Berry, G. 1994. Statistical Methods in Medical Research. Blackwell Science.

  • Bennett, E., Alpert, R., and Goldstein, A. 1954. Communications through limited response questioning. Public Opinion Quarterly 18: 303–308.

  • Bicego, A., Khurana, M., and Kuvaja, P. 1998. Bootstrap 3.0: Software process assessment methodology. Proceedings of SQM'98.

  • Briand, L., El Emam, K., Laitenberger, O., and Fussbroich, T. 1998. Using simulation to build inspection efficiency benchmarks for development projects. Proceedings of the International Conference on Software Engineering. 340–349.

  • Briand, L., El Emam, K., and Wieczorek, I. 1998. A case study in productivity benchmarking: Methods and lessons learned. Proceedings of the 9th European Software Control and Metrics Conference. Shaker Publishing B.V., The Netherlands, 4–14.

  • Camp, R. 1989. Benchmarking: The Search for Industry Best Practices that Lead to Superior Performance. ASQC Quality Press.

  • Cicchetti, D. 1972. A new measure of agreement between rank order variables. Proceedings of the American Psychological Association 7: 17–18.

  • Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1): 37–46.

  • Cohen, J. 1968. Weighted kappa: Nominal scale agreement with provision for scaled agreement or partial credit. Psychological Bulletin 70: 213–220.

  • Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates.

  • El Emam, K. 1998. The internal consistency of the ISO/IEC 15504 software process capability scale. To appear in Proceedings of the 5th International Symposium on Software Metrics. IEEE CS Press.

  • El Emam, K., and Goldenson, D. R. 1995. SPICE: An empiricist's perspective. Proceedings of the Second IEEE International Software Engineering Standards Symposium, pp. 84–97.

  • El Emam, K., and Madhavji, N. H. 1995. The reliability of measuring organizational maturity. Software Process Improvement and Practice Journal 1(1): 3–25.

  • El Emam, K., Briand, L., and Smith, B. 1996. Assessor agreement in rating SPICE processes. Software Process Improvement and Practice Journal 2(4): 291–306.

  • El Emam, K., and Goldenson, D. R. 1996. An empirical evaluation of the prospective international SPICE standard. Software Process Improvement and Practice Journal 2(2): 123–148.

  • El Emam, K., Goldenson, D., Briand, L., and Marshall, P. 1996. Interrater agreement in SPICE-based assessments: Some preliminary results. Proceedings of the International Conference on the Software Process, pp. 149–156.

  • El Emam, K., Smith, B., and Fusaro, P. 1997. Modeling the reliability of SPICE based assessments. Proceedings of the Third IEEE International Software Engineering Standards Symposium, pp. 69–82.

  • El Emam, K., Drouin, J-N, and Melo, W. (eds.) 1998. SPICE: The Theory and Practice of Software Process Improvement and Capability Determination. IEEE CS Press.

  • El Emam, K., and Marshall, P. 1998. Interrater agreement in assessment ratings. El Emam, K., Drouin, J-N, and Melo, W. (eds.) SPICE: The Theory and Practice of Software Process Improvement and Capability Determination. IEEE CS Press.

  • El Emam, K., Simon, J-M, Rousseau, S., and Jacquet, E. 1998. Cost implications of interrater agreement for software process assessments. To appear in Proceedings of the 5th International Symposium on Software Metrics. IEEE CS Press.

  • El Emam, K., and Wieczorek, I. 1998. The repeatability of code defect classifications. To appear in Proceedings of the International Symposium on Software Reliability Engineering. IEEE CS Press.

  • Everitt, B. 1992. The Analysis of Contingency Tables. Chapman and Hall.

  • Fleiss, J. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5): 378–382.

  • Fleiss, J. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons.

  • Fleiss, J., and Cohen, J. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 33: 613–619.

  • Fleiss, J., Cohen, J., and Everitt, B. 1969. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin 72(5): 323–327.

  • Fusaro, P., El Emam, K., and Smith, B. 1997a. Evaluating the interrater agreement of process capability ratings. Proceedings of the Fourth International Software Metrics Symposium. 2–11.

  • Fusaro, P., El Emam, K., and Smith, B. 1997b. The internal consistencies of the 1987 SEI maturity questionnaire and the SPICE capability dimension. Empirical Software Engineering: An International Journal 3: 179–201, Kluwer Academic Publishers.

  • Gordis, L. 1996. Epidemiology. W. B. Saunders.

  • Hartmann, D. 1977. Considerations in the choice of interobserver reliability estimates. Journal of Applied Behavior Analysis 10(1): 103–116.

  • Henkel, E. 1976. Tests of Significance. Sage Publications.

  • Landis, J., and Koch, G. 1977. The measurement of observer agreement for categorical data. Biometrics 33: 159–174.

  • Lindsay, R., and Ehrenberg, A. 1993. The design of replicated studies. The American Statistician 47(3): 217–228.

  • Lyman, H. 1963. Test Scores and What They Mean. Prentice-Hall.

  • Maclennan, F., Ostrolenk, G., and Tobin, M. 1998. Introduction to the SPICE trials. El Emam, K., Drouin, J-N, and Melo, W. (eds.) SPICE: The Theory and Practice of Software Process Improvement and Capability Determination. IEEE CS Press.

  • Rout, T., and Simms, P. 1998. Introduction to the SPICE documents and architecture. El Emam, K., Drouin, J-N, and Melo, W. (eds.) SPICE: The Theory and Practice of Software Process Improvement and Capability Determination. IEEE CS Press.

  • Scott, W. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly 19: 321–325.

  • Simon, J-M, El Emam, K., Rousseau, S., Jacquet, E., and Babey, F. 1997. The reliability of ISO/IEC PDTR 15504 assessments. Software Process Improvement and Practice Journal 3: 177–188.

  • Software Engineering Institute 1998. CMMI A Specification Version 1.1. Available at http://www.sei.cmu.edu/activities/cmm/cmmi/specs/aspec1.1.html, accessed 23rd April.

  • Squires, B. 1990. Statistics in biomedical manuscripts: What editors want from authors and peer reviewers. Canadian Medical Association Journal 142(3): 213–214.

  • Suen, H., and Lee, P. 1985. The effects of the use of percentage agreement on behavioral observation reliabilities: A reassessment. Journal of Psychopathology and Behavioral Assessment 7(3): 221–234.

  • Umesh, U., Peterson, R., and Sauber, M. 1989. Interjudge agreement and the maximum value of kappa. Educational and Psychological Measurement 49: 835–850.

  • Woodman, I., and Hunter, R. 1996. Analysis of assessment data from phase 1 of the SPICE trials. IEEE TCSE Software Process Newsletter, No. 6, Spring 1996 (available at http://www-se.cs.mcgill.ca/process/spn.html).

  • Zeisel, H. 1955. The significance of insignificant differences. Public Opinion Quarterly 319–321.

  • Zwick, R. 1988. Another look at interrater agreement. Psychological Bulletin 103(3):374–378.

Cite this article

El Emam, K. Benchmarking Kappa: Interrater Agreement in Software Process Assessments. Empirical Software Engineering 4, 113–133 (1999). https://doi.org/10.1023/A:1009820201126

Keywords

  • Process assessment
  • Inter-rater agreement
  • ISO/IEC 15504
  • Reliability