Abstract
Software process assessments are now a prevalent tool for process improvement and contract risk assessment in the software industry. Because scores are assigned to processes during an assessment, a process assessment can be considered a subjective measurement procedure. As with any subjective measurement procedure, the reliability of process assessments has important implications for the utility of assessment scores, and reliability can therefore be taken as a criterion for evaluating an assessment's quality. The particular type of reliability of interest in this paper is interrater agreement. Thus far, empirical evaluations of the interrater agreement of assessments have used Cohen's Kappa coefficient. Once a Kappa value has been obtained, the next question is “how good is it?” Benchmarks for interpreting Kappa values are available from the social sciences and medical literature. However, the applicability of these benchmarks to the software process assessment context is not obvious. In this paper we develop a benchmark for interpreting Kappa values using data from ratings of 70 process instances collected from assessments of 19 different projects in 7 different organizations in Europe during the SPICE Trials (an international effort to empirically evaluate the emerging ISO/IEC 15504 International Standard for Software Process Assessment). The benchmark indicates that Kappa values below 0.45 are poor, and that values above 0.62 constitute substantial agreement and should be the minimum aimed for. This benchmark can be used to judge how good an assessment's reliability is.
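For readers who wish to reproduce the calculation behind the benchmark, the following is a minimal sketch in Python of Cohen's (1960) unweighted Kappa for two assessors, with the abstract's thresholds (0.45 and 0.62) applied to the result. The function name, the hypothetical ratings, and the use of the four-point N/P/L/F adequacy scale are illustrative assumptions only; they are not data or code from the SPICE Trials, and the label for the band between the two thresholds is not named in the abstract.

    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        """Cohen's (1960) unweighted kappa for two raters assigning nominal categories."""
        assert len(rater_a) == len(rater_b) and len(rater_a) > 0
        n = len(rater_a)
        categories = set(rater_a) | set(rater_b)
        # Observed proportion of agreement.
        p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        # Chance-expected agreement from the marginal category frequencies.
        freq_a, freq_b = Counter(rater_a), Counter(rater_b)
        p_e = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical ratings of ten process instances on a four-point
    # adequacy scale (N, P, L, F); not data from the paper.
    assessor_1 = ["F", "L", "L", "P", "N", "F", "L", "P", "F", "L"]
    assessor_2 = ["F", "L", "P", "P", "N", "L", "L", "P", "F", "F"]

    kappa = cohens_kappa(assessor_1, assessor_2)
    if kappa < 0.45:
        verdict = "poor"
    elif kappa <= 0.62:
        verdict = "between the two benchmark thresholds"
    else:
        verdict = "substantial"
    print(f"kappa = {kappa:.2f} ({verdict})")

For these illustrative ratings the observed agreement is 0.70 and the chance-expected agreement is 0.28, giving a Kappa of approximately 0.58, which falls between the paper's two benchmark thresholds.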
References
Allen, M., and Yen, W. 1979. Introduction to Measurement Theory. Brooks/Cole Publishing Company.
Altman, D. 1991. Practical Statistics for Medical Research. Chapman and Hall.
Armitage, P., and Berry, G. 1994. Statistical Methods in Medical Research. Blackwell Science.
Bennett, E., Alpert, R., and Goldstein, A. 1954. Communications through limited response questioning. Public Opinion Quarterly 18: 303–308.
Bicego, A., Khurana, M., and Kuvaja, P. 1998. Bootstrap 3.0: Software process assessment methodology. Proceedings of SQM'98.
Briand, L., El Emam, K., Laitenberger, O., and Fussbroich, T. 1998. Using simulation to build inspection efficiency benchmarks for development projects. Proceedings of the International Conference on Software Engineering. 340–349.
Briand, L., El Emam, K., and Wieczorek, I. 1998. A case study in productivity benchmarking: Methods and lessons learned. Proceedings of the 9th European Software Control and Metrics Conference. Shaker Publishing B.V., The Netherlands, 4–14.
Camp, R. 1989. Benchmarking: The Search for Industry Best Practices that Lead to Superior Performance. ASQC Quality Press.
Cicchetti, D. 1972. A new measure of agreement between rank order variables. Proceedings of the American Psychological Association 7: 17–18.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1): 37–46.
Cohen, J. 1968. Weighted kappa: Nominal scale agreement with provision for scaled agreement or partial credit. Psychological Bulletin 70: 213–220.
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates.
El Emam, K. 1998. The internal consistency of the ISO/IEC 15504 software process capability scale. To appear in Proceedings of the 5th International Symposium on Software Metrics. IEEE CS Press.
El Emam, K., and Goldenson, D. R. 1995. SPICE: An empiricist's perspective. Proceedings of the Second IEEE International Software Engineering Standards Symposium, pp. 84–97.
El Emam, K., and Madhavji, N. H. 1995. The reliability of measuring organizational maturity. Software Process Improvement and Practice Journal 1(1): 3–25.
El Emam, K., Briand, L., and Smith, B. 1996. Assessor agreement in rating SPICE processes. Software Process Improvement and Practice Journal 2(4): 291–306.
El Emam, K., and Goldenson, D. R. 1996. An empirical evaluation of the prospective international SPICE standard. Software Process Improvement and Practice Journal 2(2): 123–148.
El Emam, K., Goldenson, D., Briand, L., and Marshall, P. 1996. Interrater agreement in SPICE-based assessments: Some preliminary results. Proceedings of the International Conference on the Software Process, pp. 149–156.
El Emam, K., Smith, B., and Fusaro, P. 1997. Modeling the reliability of SPICE based assessments. Proceedings of the Third IEEE International Software Engineering Standards Symposium, pp. 69–82.
El Emam, K., Drouin, J-N, and Melo, W. (eds.) 1998. SPICE: The Theory and Practice of Software Process Improvement and Capability Determination. IEEE CS Press.
El Emam, K., and Marshall, P. 1998. Interrater agreement in assessment ratings. El Emam, K., Drouin, J-N, and Melo, W. (eds.) SPICE: The Theory and Practice of Software Process Improvement and Capability Determination. IEEE CS Press.
El Emam, K., Simon, J-M, Rousseau, S., and Jacquet, E. 1998. Cost implications of interrater agreement for software process assessments. To appear in Proceedings of the 5th International Symposium on Software Metrics. IEEE CS Press.
El Emam, K., and Wieczorek, I. 1998. The repeatability of code defect classifications. To appear in Proceedings of the International Symposium on Software Reliability Engineering. IEEE CS Press.
Everitt, B. 1992. The Analysis of Contingency Tables. Chapman and Hall.
Fleiss, J. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5): 378–382.
Fleiss, J. 1981. Statistical Methods for Rates and Proportions. John Wiley & Sons.
Fleiss, J., and Cohen, J. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 33: 613–619.
Fleiss, J., Cohen, J., and Everitt, B. 1969. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin 72(5): 323–327.
Fusaro, P., El Emam, K., and Smith, B. 1997a. Evaluating the interrater agreement of process capability ratings. Proceedings of the Fourth International Software Metrics Symposium. 2–11.
Fusaro, P., El Emam, K., and Smith, B. 1997b. The internal consistencies of the 1987 SEI maturity questionnaire and the SPICE capability dimension. Empirical Software Engineering: An International Journal 3: 179–201, Kluwer Academic Publishers.
Gordis, L. 1996. Epidemiology. W. B. Saunders.
Hartmann, D. 1977. Considerations in the choice of interobserver reliability estimates. Journal of Applied Behavior Analysis 10(1): 103–116.
Henkel, E. 1976. Tests of Significance. Sage Publications.
Landis, J., and Koch, G. 1977. The measurement of observer agreement for categorical data. Biometrics 33: 159–174.
Lindsay, R., and Ehrenberg, A. 1993. The design of replicated studies. The American Statistician 47(3): 217–228.
Lyman, H. 1963. Test Scores and What They Mean. Prentice-Hall.
Maclennan, F., Ostrolenk, G., and Tobin, M. 1998. Introduction to the SPICE trials. El Emam, K., Drouin, J-N, and Melo, W. (eds.) SPICE: The Theory and Practice of Software Process Improvement and Capability Determination. IEEE CS Press.
Rout, T., and Simms, P. 1998. Introduction to the SPICE documents and architecture. El Emam, K., Drouin, J-N, and Melo, W. (eds.) SPICE: The Theory and Practice of Software Process Improvement and Capability Determination. IEEE CS Press.
Scott, W. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly 19: 321–325.
Simon, J-M, El Emam, K., Rousseau, S., Jacquet, E., and Babey, F. 1997. The reliability of ISO/IEC PDTR 15504 assessments. Software Process Improvement and Practice Journal 3: 177–188.
Software Engineering Institute 1998. CMMI A Specification Version 1.1. Available at http://www.sei.cmu.edu/activities/cmm/cmmi/specs/aspec1.1.html (23 April 1998).
Squires, B. 1990. Statistics in biomedical manuscripts: What editors want from authors and peer reviewers. Canadian Medical Association Journal 142(3): 213–214.
Suen, H., and Lee, P. 1985. The effects of the use of percentage agreement on behavioral observation reliabilities: A reassessment. Journal of Psychopathology and Behavioral Assessment 7(3): 221–234.
Umesh, U., Peterson, R., and Sauber, M. 1989. Interjudge agreement and the maximum value of kappa. Educational and Psychological Measurement 49: 835–850.
Woodman, I., and Hunter, R. 1996. Analysis of assessment data from phase 1 of the SPICE trials. IEEE TCSE Software Process Newsletter, No. 6, Spring 1996 (available at http://www-se.cs.mcgill.ca/process/spn.html).
Zeisel, H. 1955. The significance of insignificant differences. Public Opinion Quarterly 19: 319–321.
Zwick, R. 1988. Another look at interrater agreement. Psychological Bulletin 103(3):374–378.
Cite this article
El Emam, K. Benchmarking Kappa: Interrater Agreement in Software Process Assessments. Empirical Software Engineering 4, 113–133 (1999). https://doi.org/10.1023/A:1009820201126