McGill, D.A., van der Vleuten, C.P.M. & Clarke, M.J. Adv in Health Sci Educ (2013) 18: 701. doi:10.1007/s10459-012-9410-z
Supervisor assessments are critical for both formative and summative assessment in the workplace. Supervisor ratings remain an important source of such assessment in many educational jurisdictions even though there is ambiguity about their validity and reliability. The aims of this evaluation is to explore the: (1) construct validity of ward-based supervisor competency assessments; (2) reliability of supervisors for observing any overarching domain constructs identified (factors); (3) stability of factors across subgroups of contexts, supervisors and trainees; and (4) position of the observations compared to the established literature. Evaluated assessments were all those used to judge intern (trainee) suitability to become an unconditionally registered medical practitioner in the Australian Capital Territory, Australia in 2007–2008. Initial construct identification is by traditional exploratory factor analysis (EFA) using Principal component analysis with Varimax rotation. Factor stability is explored by EFA of subgroups by different contexts such as hospital type, and different types of supervisors and trainees. The unit of analysis is each assessment, and includes all available assessments without aggregation of any scores to obtain the factors. Reliability of identified constructs is by variance components analysis of the summed trainee scores for each factor and the number of assessments needed to provide an acceptably reliable assessment using the construct, the reliability unit of analysis being the score for each factor for every assessment. For the 374 assessments from 74 trainees and 73 supervisors, the EFA resulted in 3 factors identified from the scree plot, accounting for only 68 % of the variance with factor 1 having features of a “general professional job performance” competency (eigenvalue 7.630; variance 54.5 %); factor 2 “clinical skills” (eigenvalue 1.036; variance 7.4 %); and factor 3 “professional and personal” competency (eigenvalue 0.867; variance 6.2 %). The percent trainee score variance for the summed competency item scores for factors 1, 2 and 3 were 40.4, 27.4 and 22.9 % respectively. The number of assessments needed to give a reliability coefficient of 0.80 was 6, 11 and 13 respectively. The factor structure remained stable for subgroups of female trainees, Australian graduate trainees, the central hospital, surgeons, staff specialist, visiting medical officers and the separation into single years. Physicians as supervisors, male trainees, and male supervisors all had a different grouping of items within 3 factors which all had competency items that collapsed into the predefined “face value” constructs of competence. These observations add new insights compared to the established literature. For the setting, most supervisors appear to be assessing a dominant construct domain which is similar to a general professional job performance competency. This global construct consists of individual competency items that supervisors spontaneously align and has acceptable assessment reliability. However, factor structure instability between different populations of supervisors and trainees means that subpopulations of trainees may be assessed differently and that some subpopulations of supervisors are assessing the same trainees with different constructs than other supervisors. The lack of competency criterion standardisation of supervisors’ assessments brings into question the validity of this assessment method as currently used.