Advances in Health Sciences Education, Volume 18, Issue 4, pp 701–725

A critical evaluation of the validity and the reliability of global competency constructs for supervisor assessment of junior medical trainees

  • D. A. McGill
  • C. P. M. van der Vleuten
  • M. J. Clarke


Supervisor assessments are critical for both formative and summative assessment in the workplace. Supervisor ratings remain an important source of such assessment in many educational jurisdictions, even though there is ambiguity about their validity and reliability. The aims of this evaluation are to explore: (1) the construct validity of ward-based supervisor competency assessments; (2) the reliability of supervisors in observing any overarching domain constructs identified (factors); (3) the stability of those factors across subgroups of contexts, supervisors and trainees; and (4) how the observations compare with the established literature. The assessments evaluated were all those used to judge intern (trainee) suitability for unconditional registration as a medical practitioner in the Australian Capital Territory, Australia, in 2007–2008. Initial construct identification was by traditional exploratory factor analysis (EFA) using principal component analysis with varimax rotation. Factor stability was explored by EFA of subgroups defined by context (such as hospital type) and by type of supervisor and trainee. The unit of analysis was the individual assessment, and all available assessments were included without aggregation of scores to obtain the factors. The reliability of the identified constructs was estimated by variance components analysis of the summed trainee scores for each factor, together with the number of assessments needed to provide an acceptably reliable assessment using the construct; the unit of analysis for reliability was the score for each factor on every assessment.
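The EFA pipeline described here (principal components of the item correlation matrix, a fixed number of components retained from the scree plot, varimax rotation of the loadings) can be sketched as follows. This is a minimal illustration of the mechanics only: the item count and the simulated ratings are hypothetical, not the study's data, and the varimax routine is the standard textbook algorithm rather than the authors' software.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Standard varimax rotation of a factor loading matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L**3 - (gamma / p) * L @ np.diag(np.sum(L**2, axis=0)))
        )
        R = u @ vt
        if s.sum() < var * (1 + tol):  # stop when the criterion stops improving
            break
        var = s.sum()
    return loadings @ R

# Simulated ratings: 374 assessments x 14 competency items
# (the item count is hypothetical; real scree-plot decisions need real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(374, 14))
corr = np.corrcoef(X, rowvar=False)           # item correlation matrix
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]             # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 3                                         # factors retained from the scree plot
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])   # unrotated PCA loadings
rotated = varimax(loadings)                   # rotation leaves communalities unchanged
```

Because the rotation matrix is orthogonal, each item's communality (the row sum of squared loadings) is identical before and after rotation; only the distribution of loading across the three factors changes.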
For the 374 assessments from 74 trainees and 73 supervisors, the EFA identified 3 factors from the scree plot, accounting for only 68 % of the variance: factor 1 had features of a “general professional job performance” competency (eigenvalue 7.630; variance 54.5 %); factor 2 of “clinical skills” (eigenvalue 1.036; variance 7.4 %); and factor 3 of “professional and personal” competency (eigenvalue 0.867; variance 6.2 %). The percentage of trainee score variance for the summed competency item scores for factors 1, 2 and 3 was 40.4, 27.4 and 22.9 % respectively, and the number of assessments needed to give a reliability coefficient of 0.80 was 6, 11 and 13 respectively. The factor structure remained stable for subgroups of female trainees, Australian graduate trainees, the central hospital, surgeons, staff specialists, visiting medical officers and the separation into single years. Physicians as supervisors, male trainees and male supervisors each produced a different grouping of items within 3 factors, all with competency items that collapsed into the predefined “face value” constructs of competence. These observations add new insights to the established literature. In this setting, most supervisors appear to be assessing a dominant construct domain resembling a general professional job performance competency. This global construct consists of individual competency items that supervisors spontaneously align, and it has acceptable assessment reliability. However, the instability of the factor structure between different populations of supervisors and trainees means that subpopulations of trainees may be assessed differently, and that some subpopulations of supervisors are assessing the same trainees with different constructs than other supervisors. The lack of standardisation of competency criteria across supervisors’ assessments brings into question the validity of this assessment method as currently used.
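The link between the trainee share of score variance and the number of assessments needed for a reliability of 0.80 follows the Spearman-Brown prophecy formula. A minimal sketch, using only the rounded published percentages (the exact variance components are not given here, so the formula reproduces the reported 6 and 11 for factors 1 and 2 but yields 14 rather than the reported 13 for factor 3, a discrepancy attributable to rounding of the inputs):

```python
import math

def assessments_needed(trainee_var_share: float, target: float = 0.80) -> int:
    """Spearman-Brown prophecy: smallest number of assessments whose average
    reaches the target reliability, given the proportion of score variance
    attributable to true trainee differences (single-assessment reliability)."""
    r = trainee_var_share
    n = (target / (1 - target)) * ((1 - r) / r)
    return math.ceil(n)

# Rounded trainee-variance shares reported for factors 1, 2 and 3
for share in (0.404, 0.274, 0.229):
    print(f"share {share:.3f} -> {assessments_needed(share)} assessments")
```

The formula makes the qualitative pattern clear: the smaller the trainee share of variance (i.e., the noisier a single supervisor rating), the more independent assessments must be averaged to reach the same reliability.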


Supervisor assessment · Workplace assessment · Competency assessment · Exploratory factor analysis · Reliability · Validity



Copyright information

© Springer Science+Business Media Dordrecht 2012

Authors and Affiliations

  1. D. A. McGill, Department of Cardiology, The Canberra Hospital, Garran, Australia
  2. C. P. M. van der Vleuten, Department of Educational Research and Development, Maastricht University, Maastricht, The Netherlands
  3. M. J. Clarke, Clinical Trial Service Unit, University of Oxford, Oxford, UK
