
Does scale length matter? A comparison of nine- versus five-point rating scales for the mini-CEX

Advances in Health Sciences Education


Educators must often decide how many points to use in a rating scale. No studies have compared interrater reliability for different-length scales, and few have evaluated accuracy. This study sought to evaluate the interrater reliability and accuracy of mini-clinical evaluation exercise (mini-CEX) scores, comparing the traditional mini-CEX nine-point scale to a five-point scale.

Methods: The authors conducted a validity study in an academic internal medicine residency program. Fifty-two program faculty participated. Participants rated videotaped resident-patient encounters using the mini-CEX with both a nine-point scale and a five-point scale. Some cases were scripted to reflect a specific level of competence (unsatisfactory, satisfactory, superior). Outcome measures included mini-CEX scores, accuracy (scores compared to scripted competence level), interrater reliability, and domain intercorrelation.

Results: Interviewing, exam, counseling, and overall ratings varied significantly across levels of competence (P < .0001). Nine-point scale scores accurately classified competence more often (391/720 [54%] for overall ratings) than five-point scores (316/723 [44%], P < .0001). Interrater reliability was similar for scores from the nine- and five-point scales (0.43 and 0.40, respectively, for overall ratings). With the exception of correlation between exam and counseling scores using the five-point scale (r = 0.38, P = .13), score correlations among all domain combinations were high (r = 0.46–0.89) and statistically significant (P ≤ .015) for both scales.

Conclusions: Mini-CEX scores demonstrated modest interrater reliability and accuracy. Although interrater reliability is similar for nine- and five-point scales, nine-point scales appear to provide more accurate scores. This has implications for many educational assessments.
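The accuracy figures reported for overall ratings can be checked with simple arithmetic. The sketch below recomputes the two proportions from the reported counts (391/720 and 316/723) and, as an illustration only, applies a pooled two-proportion z-test. This is an assumption, not the authors' analysis: the same faculty rated encounters with both scales, so the observations are not independent, and the published P value most likely came from a model that accounts for repeated ratings. The function name `two_proportion_z` is hypothetical.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test (assumes independent samples).

    Returns both proportions, the z statistic, and a two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1, p2, z, p_value


# Counts reported in the abstract for overall ratings.
p9, p5, z, p_value = two_proportion_z(391, 720, 316, 723)

print(f"nine-point accuracy: {p9:.1%}")
print(f"five-point accuracy: {p5:.1%}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```

Under the independence assumption, the roughly ten-percentage-point difference in accuracy is large relative to its standard error, which is in line with the small P value reported in the abstract.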





The authors thank K. G. Thomas and D. M. Dupras for assistance in study planning and execution, F. Enders for assistance in statistical planning, and E. S. Holmboe for use of scripted cases. Funding was provided by the Mayo Education Innovation Program. A paper based on this study was presented at the 2008 meeting of the American Educational Research Association in New York.

Author information


Corresponding author

Correspondence to David A. Cook.


About this article

Cite this article

Cook, D.A., Beckman, T.J. Does scale length matter? A comparison of nine- versus five-point rating scales for the mini-CEX. Adv in Health Sci Educ 14, 655–664 (2009).
