Abstract
Even though rater-based judgements of clinical competence are widely used, they are context-sensitive and vary between individuals and institutions. To deal adequately with rater-judgement unreliability, evaluating the reliability of workplace rater-based assessments in the local context is essential. Accordingly, the primary intentions of this study were to quantify the trainee-attributable variation in supervisor ratings, to estimate the number of workplace assessments needed for certification of competence, and to position the findings within the known literature. This reliability study of workplace-based supervisors’ assessments of trainees has a rater-nested-within-trainee design. The score variation attributable to the trainee for each competency item assessed (the variance component) was estimated by the minimum-norm quadratic unbiased estimator (MINQUE). Score variance was then used to estimate the number of assessments needed to reach a reliability of 0.80. The trainee score variance for each of 14 competency items ranged from 2.3% for emergency skills to 35.6% for communication skills, with an average across all competency items of 20.3%; for the “Overall rating” competency item, the trainee variance was 28.8%. These variance components translated into 169, 7, 17 and 28 assessments needed for a reliability of 0.80, respectively. Most variation in assessment scores was due to measurement error, ranging from 97.7% for emergency skills to 63.4% for communication skills. Similar results have been demonstrated in previously published studies. In summary, supervisors’ overall workplace-based assessments have poor reliability and are not suitable for use in certification processes in their current form. The marked variation in supervisors’ reliability in assessing different competencies indicates that supervisors may be able to assess some competencies with acceptable reproducibility, in this case communication and possibly overall competence.
However, any continued use of this format for assessing trainee competencies necessitates identifying what supervisors in different institutions can reliably assess, rather than continuing to impose false expectations on unreliable assessments.
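The sampling-number estimates above follow from a standard decision-study projection: if a proportion p of score variance is attributable to the trainee and the remainder is measurement error, the reliability of the mean of n assessments is np / (np + (1 − p)), which can be solved for n at a target reliability. A minimal sketch of that calculation (the function name is illustrative, and feeding in the rounded percentages reported here will differ slightly from the published figures, which were presumably computed from unrounded variance components):

```python
def assessments_needed(trainee_variance_pct, target_reliability=0.80):
    """Number of assessments whose mean score reaches the target
    reliability, given the percentage of score variance attributable
    to the trainee (the remainder being measurement error)."""
    p = trainee_variance_pct / 100.0   # trainee (true-score) variance proportion
    error = 1.0 - p                    # error variance proportion
    r = target_reliability
    # Solve r = n*p / (n*p + error) for n (Spearman-Brown projection):
    return (r / (1.0 - r)) * (error / p)

# Illustrative use with two of the reported variance components:
for item, pct in [("emergency skills", 2.3), ("communication skills", 35.6)]:
    print(f"{item}: about {assessments_needed(pct):.0f} assessments")
```

The steep cost of a small trainee variance component is visible directly in the formula: halving p roughly doubles the number of assessments required, which is why emergency skills (2.3%) demand an order of magnitude more observations than communication skills (35.6%).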
McGill, D.A., van der Vleuten, C.P.M. & Clarke, M.J. Supervisor assessment of clinical and professional competence of medical trainees: a reliability study using workplace data and a focused analytical literature review. Adv in Health Sci Educ 16, 405–425 (2011). https://doi.org/10.1007/s10459-011-9296-1