Abstract
Educators must often decide how many points to use in a rating scale. No studies have compared interrater reliability for different-length scales, and few have evaluated accuracy. This study sought to evaluate the interrater reliability and accuracy of mini-clinical evaluation exercise (mini-CEX) scores, comparing the traditional mini-CEX nine-point scale to a five-point scale.
Methods: The authors conducted a validity study in an academic internal medicine residency program. Fifty-two program faculty participated. Participants rated videotaped resident-patient encounters using the mini-CEX with both a nine-point scale and a five-point scale. Some cases were scripted to reflect a specific level of competence (unsatisfactory, satisfactory, superior). Outcome measures included mini-CEX scores, accuracy (scores compared to scripted competence level), interrater reliability, and domain intercorrelation.
Results: Interviewing, exam, counseling, and overall ratings varied significantly across levels of competence (P < .0001). Nine-point scale scores accurately classified competence more often (391/720 [54%] for overall ratings) than five-point scores (316/723 [44%], P < .0001). Interrater reliability was similar for scores from the nine- and five-point scales (0.43 and 0.40, respectively, for overall ratings). With the exception of correlation between exam and counseling scores using the five-point scale (r = 0.38, P = .13), score correlations among all domain combinations were high (r = 0.46–0.89) and statistically significant (P ≤ .015) for both scales.
Conclusions: Mini-CEX scores demonstrated modest interrater reliability and accuracy. Although interrater reliability is similar for nine- and five-point scales, nine-point scales appear to provide more accurate scores. This has implications for many educational assessments.
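The accuracy comparison reported in the Results (391/720 versus 316/723 overall ratings correctly classified) can be checked with a short calculation. The sketch below is an illustration only, not the authors' analysis: it assumes a simple pooled two-proportion z-test that treats every rating as independent, whereas the same faculty rated multiple encounters, so the published P value may come from a method that accounts for that clustering. The function name `two_proportion_z_test` is hypothetical.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test (two-sided).

    Assumption: all ratings are independent. This is a simplification
    for illustration and is not claimed to be the study's method.
    """
    p1 = x1 / n1
    p2 = x2 / n2
    # Pooled proportion under the null hypothesis of equal accuracy
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided P value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1, p2, z, p_value


# Counts taken from the abstract: accurate overall ratings / total ratings
p9, p5, z, p_value = two_proportion_z_test(391, 720, 316, 723)

print(f"nine-point accuracy: {p9:.1%}")
print(f"five-point accuracy: {p5:.1%}")
print(f"z = {z:.2f}, two-sided P = {p_value:.2g}")
```

Under this independence assumption the proportions round to the reported 54% and 44%, and the difference remains significant at the reported threshold of P < .0001.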
Acknowledgments
Thanks to K. G. Thomas and D. M. Dupras for assistance in study planning and execution, to F. Enders for assistance in statistical planning, and to E. S. Holmboe for the use of scripted cases. Funding was provided by the Mayo Education Innovation Program. A paper based on this study was presented at the 2008 meeting of the American Educational Research Association in New York.
Cite this article
Cook, D.A., Beckman, T.J. Does scale length matter? A comparison of nine- versus five-point rating scales for the mini-CEX. Adv in Health Sci Educ 14, 655–664 (2009). https://doi.org/10.1007/s10459-008-9147-x
DOI: https://doi.org/10.1007/s10459-008-9147-x