Effect of Rater Training on Reliability and Accuracy of Mini-CEX Scores: A Randomized, Controlled Trial
First Online: 11 November 2008 Received: 07 May 2008 Revised: 03 September 2008 Accepted: 09 October 2008
Cite this article as: Cook, D.A., Dupras, D.M., Beckman, T.J. et al. J GEN INTERN MED (2009) 24: 74. doi:10.1007/s11606-008-0842-3
Background
Mini-CEX scores assess resident competence. Rater training might improve mini-CEX score interrater reliability, but evidence is lacking.
Objective
To evaluate a rater training workshop using interrater reliability and accuracy as outcomes.
Design
Randomized trial (immediate versus delayed workshop) and single-group pre/post study (randomized groups combined).
Setting
Academic medical center.
Participants
Fifty-two internal medicine clinic preceptors (31 randomized and 21 additional workshop attendees).
Intervention
The workshop included rater error training, performance dimension training, behavioral observation training, and frame of reference training using lecture, video, and facilitated discussion. The delayed group received no intervention until after the posttest.
Measurements
Mini-CEX ratings of videotaped resident–patient encounters at baseline (just before the workshop for the workshop group) and four weeks later; mini-CEX ratings of live resident–patient encounters one year preceding and one year following the workshop; rater confidence in using the mini-CEX.
Results
Among 31 randomized participants, interrater reliabilities in the delayed group (baseline intraclass correlation coefficient [ICC] 0.43, follow-up 0.53) and workshop group (baseline 0.40, follow-up 0.43) were not significantly different (p = 0.19). Mean ratings were similar at baseline (delayed 4.9 [95% confidence interval 4.6–5.2], workshop 4.8 [4.5–5.1]) and follow-up (delayed 5.4 [5.0–5.7], workshop 5.3 [5.0–5.6]; p = 0.88 for interaction). For the entire cohort, rater confidence (1 = not confident, 6 = very confident) improved from mean (SD) 3.8 (1.4) to 4.4 (1.0), p = 0.018. Interrater reliability for ratings of live encounters (entire cohort) was higher after the workshop (ICC 0.34) than before (ICC 0.18), but the standard error of measurement was similar for both periods.
Conclusions
Rater training did not improve interrater reliability or accuracy of mini-CEX scores.
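The two reliability quantities reported above, the ICC and the standard error of measurement (SEM), can be illustrated with a minimal sketch. The abstract does not say which ICC model the authors used, so this assumes a one-way random-effects ICC for single ratings and the textbook relation SEM = SD × sqrt(1 − reliability). The function names, the example ratings, and the score SDs (1.00 and 1.11) are hypothetical and are not taken from the study.

```python
import math


def one_way_icc(ratings):
    """One-way random-effects ICC for single ratings (assumed model).

    `ratings` is a list of rows; each row holds the k ratings given to
    one encounter by different raters.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-encounter and within-encounter mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM: observed-score SD times sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1 - reliability)


# Hypothetical ratings: four encounters, two raters each.
example = [[4, 5], [6, 6], [3, 4], [7, 6]]
icc = one_way_icc(example)

# Reported ICCs for live encounters, combined with made-up score SDs.
sem_before = standard_error_of_measurement(1.00, 0.18)
sem_after = standard_error_of_measurement(1.11, 0.34)
print(icc, sem_before, sem_after)
```

Under these assumed SDs, the two SEM values come out close to each other even though the ICC nearly doubles. This shows one way a higher ICC can coexist with a similar SEM: the ICC also rises when the rated performances vary more, without raters agreeing any more closely on a given performance.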
Clinical trials registration
clinicaltrials.gov identifier NCT00667940
Electronic supplementary material
The online version of this article (doi:10.1007/s11606-008-0842-3) contains supplementary material, which is available to authorized users.
© Society of General Internal Medicine 2008