
Supervisor assessment of clinical and professional competence of medical trainees: a reliability study using workplace data and a focused analytical literature review

Review · Advances in Health Sciences Education

Abstract

Although rater-based judgements of clinical competence are widely used, they are context sensitive and vary between individuals and institutions. To deal adequately with rater-judgement unreliability, it is essential to evaluate the reliability of workplace rater-based assessments in the local context. Using such an approach, the primary aims of this study were to quantify the trainee score variation in supervisor ratings, to estimate the number of workplace assessments needed for certification of competence and to position the findings within the published literature. This reliability study of workplace-based supervisors’ assessments of trainees has a rater-nested-within-trainee design. The score variation attributable to the trainee for each competency item assessed (the variance component) was estimated with the minimum-norm quadratic unbiased estimator, and the variance components were used to estimate the number of assessments needed for a reliability of 0.80. The trainee score variance for the 14 competency items ranged from 2.3% for emergency skills to 35.6% for communication skills, with an average across all items of 20.3%; for the “Overall rating” item the trainee variance was 28.8%. These variance components translated into 169, 7, 17 and 28 assessments, respectively, for a reliability of 0.80. Most of the variation in assessment scores was therefore measurement error, ranging from 97.7% for emergency skills to 63.4% for communication skills. Similar results have been reported in previously published studies. In summary, supervisors’ workplace-based assessments have poor overall reliability and, in their current form, are not suitable for use in certification processes. The marked variation in supervisors’ reliability across competencies indicates that supervisors may be able to assess some competencies with acceptable reproducibility, in this case communication and possibly overall competence. However, any continued use of this format for assessing trainee competencies requires identifying what supervisors in different institutions can reliably assess, rather than continuing to impose false expectations on unreliable assessments.
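The sampling numbers above follow the standard decision-study (D-study) logic of generalizability theory. A minimal sketch, assuming the usual single-facet formula for a rater-nested-within-trainee design and using the rounded percentages quoted above rather than the authors’ exact variance components (so small discrepancies with the published figures are expected):

% Reliability of the mean of n supervisor ratings, where sigma_p^2 is the
% trainee (universe-score) variance and sigma_e^2 the residual error variance:
\[
  E\rho^{2}(n) \;=\; \frac{\sigma_{p}^{2}}{\sigma_{p}^{2} + \sigma_{e}^{2}/n}
  \qquad\Longrightarrow\qquad
  n \;=\; \frac{E\rho^{2}}{1 - E\rho^{2}} \cdot \frac{\sigma_{e}^{2}}{\sigma_{p}^{2}}
\]
% At the target E\rho^2 = 0.80 the multiplier is 0.80/0.20 = 4, so, for example:
%   emergency skills:     n = 4 * (97.7 / 2.3)  \approx 170 assessments
%   communication skills: n = 4 * (63.4 / 35.6) \approx 7 assessments

This is the same algebra as the Spearman-Brown prophecy formula: because error variance dominates every item, the required number of assessments grows rapidly as the trainee variance component shrinks.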



Author information

Correspondence to D. A. McGill.



Cite this article

McGill, D.A., van der Vleuten, C.P.M. & Clarke, M.J. Supervisor assessment of clinical and professional competence of medical trainees: a reliability study using workplace data and a focused analytical literature review. Adv in Health Sci Educ 16, 405–425 (2011). https://doi.org/10.1007/s10459-011-9296-1
