Sources of variation and reliability of scores and grades (RQ1)
Table 4 shows the partitioning of the variance in station-level outcomes using linear mixed models, fitted separately for domain scores and for global grades, both based on random effects for Candidate, Station, Examiner and Exam (Table 2).
Table 4 Variance in station-level scores and grades from separate linear mixed models (n = 313,593)
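As a sketch of the kind of model underlying Table 4, the Python code below fits an intercept-only model with crossed random effects for the four facets using statsmodels. The data file and column names are placeholders rather than the study's actual pipeline, and for crossed designs at this scale dedicated mixed-model software (e.g. lme4 in R) is usually more practical; the sketch is intended only to show the model structure.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per candidate/station interaction; file and column names are placeholders.
df = pd.read_csv("plab2_station_level.csv")
df["all_data"] = 1  # a single group so the four facets are treated as fully crossed

model = smf.mixedlm(
    "domain_score ~ 1",              # intercept-only fixed part
    data=df,
    groups="all_data",               # one group: random effects are crossed, not nested
    re_formula="0",                  # no extra random intercept for the single group
    vc_formula={                     # one variance component per facet
        "candidate": "0 + C(candidate)",
        "station":   "0 + C(station)",
        "examiner":  "0 + C(examiner)",
        "exam":      "0 + C(exam)",
    },
)
result = model.fit()
print(result.summary())              # facet variance components plus residual variance
```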
Focussing first on domain scores, Table 4 shows that more of the variance at the station level is attributable to Examiner than to any other facet, but that two-thirds of the variance is not explained by any of these facets (residual = 66.5%).
The story for global grades is a little different, as the Candidate facet accounts for a higher percentage of variance than does Examiner (9.5% vs. 7.9%). However, the residual variance is higher than it is for domain scores (75.9% vs. 66.5%).
For both outcomes, the Station facet accounts for a relatively small proportion of variance (5.7% and 6.3% for scores and grades respectively), and the Exam facet for an even smaller amount (0.5% and 0.4%).
Within a generalisability framework, we can use the variance components from Table 4 to calculate overall reliability coefficients and standard errors of measurement (SEM) for a typical single 18-station PLAB2 exam (Table 5).
Table 5 Overall reliability/SEM estimates for an 18 station PLAB2 OSCE
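As an illustration of how estimates like those in Table 5 follow from the variance components in Table 4, the sketch below computes a G-coefficient and SEM for an 18-station exam under the simplifying assumption that each station is marked by a different examiner, so that Station, Examiner and residual variance all average over the 18 stations. The variance-component values are placeholders, not the study's, and the exact D-study design used in the paper may differ.

```python
# Placeholder variance components on the domain-score scale (not the study's values).
var_candidate = 0.50
var_station   = 0.06
var_examiner  = 0.08
var_residual  = 0.70
n_stations    = 18          # stations per exam, each assumed to have its own examiner

# Error variance of a candidate's mean score across the 18 stations,
# assuming station, examiner and residual effects average out over stations.
error_var = (var_station + var_examiner + var_residual) / n_stations

g_coefficient = var_candidate / (var_candidate + error_var)
sem = error_var ** 0.5      # SEM on the same scale as the mean station score

print(f"G = {g_coefficient:.2f}, SEM = {sem:.2f}")
```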
These estimates are acceptable according to the usual guidelines (Lance et al., 2006; Park, 2019), but as a result of the greater residual variance for grades (Table 4), the reliability estimates for global grades are lower, and the SEMs correspondingly larger, than those for domain scores.
Correlations between observed and modelled station-level scores and grades
Table 6 shows the (Pearson) correlation coefficients for domain scores and grades—for both observed values (i.e. those actually produced by examiners) and modelled values derived in the linear mixed modelling.
Table 6 Correlation between observed and modelled values across all candidate/station interactions
The correlation between observed scores and grades is strong (r = 0.85), and is very similar to that between the corresponding modelled values (r = 0.86). Importantly, the correlations between observed and modelled values are not as strong: r = 0.60 for domain scores and r = 0.52 for global grades. This indicates that the modelling has had a substantial impact in adjusting scores and grades when ‘controlling’ for unwanted sources of variance (Station, Examiner and Exam).
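A minimal sketch of the comparisons reported in Table 6, assuming a data frame with one row per candidate/station interaction holding observed and modelled values of both outcomes (file and column names are placeholders):

```python
import pandas as pd
from scipy.stats import pearsonr

# One row per candidate/station interaction; names are placeholders.
df = pd.read_csv("station_level_outcomes.csv")

pairs = {
    "observed score vs observed grade": ("obs_score", "obs_grade"),
    "modelled score vs modelled grade": ("mod_score", "mod_grade"),
    "observed vs modelled score":       ("obs_score", "mod_score"),
    "observed vs modelled grade":       ("obs_grade", "mod_grade"),
}
for label, (x, y) in pairs.items():
    r, _ = pearsonr(df[x], df[y])
    print(f"{label}: r = {r:.2f}")
```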
Individual modelled estimates of stringency of each facet (RQ2)
The modelling gives stringency estimates for all levels of the Candidate, Station, Examiner and Exam facets, and Table 7 summarises these (see Table 3 for more details on how to interpret each facet).
Table 7 Summary statistics for estimates of stringency for each facet (station-level)
Table 7 shows that the mean values for each facet are the same; this is a natural consequence of the modelling. More importantly, comparing standard deviations, the modelling suggests that there is greater variation in examiner stringency than in candidate ability, particularly for domain scores (SD = 0.96 and 0.71 respectively). In addition, variation in station difficulty is of a smaller magnitude, and variation across exams is very small. All of these results are entirely consistent with the variance component analysis in Table 4.
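A sketch of how Table 7-style summaries might be produced once per-level stringency estimates (e.g. predicted random effects from the two mixed models) have been extracted; the input file and column names are assumptions for illustration.

```python
import pandas as pd

# One row per facet level (each candidate, station, examiner and exam), with the
# estimated effect on each outcome scale; names are placeholders.
effects = pd.read_csv("facet_effect_estimates.csv")

summary = (
    effects
    .groupby("facet")[["score_effect", "grade_effect"]]
    .agg(["mean", "std", "min", "max"])   # means are ~0 by construction; compare SDs
)
print(summary)
```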
Cluster analysis of examiner stringency
The correlation between the two estimates of examiner stringency (for domain scores and global grades) is quite strong at r = 0.76 (n = 862, p < 0.001), indicating that examiners are broadly consistent in their level of stringency across the two methods of scoring performance in a station. Taking the analysis further, a simple cluster analysis results in a two-cluster solution with ‘fair’ fit (silhouette score = 0.6; Norusis, 2011, Chapter 17). This is the maximum number of clusters when only two variables (examiner stringency in domain scores and grades) are present in the analysis.
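The sketch below re-creates this step in Python with scikit-learn; note that the cited procedure (Norusis, 2011) refers to SPSS two-step clustering, so k-means here is an analogous rather than identical method, and the data frame and column names are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# One row per examiner: estimated stringency on each outcome scale (placeholder names).
examiners = pd.read_csv("examiner_stringency.csv")
X = examiners[["score_stringency", "grade_stringency"]].to_numpy()

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
examiners["cluster"] = kmeans.labels_

print("silhouette:", round(silhouette_score(X, kmeans.labels_), 2))
# The cluster with the lower mean stringency corresponds to 'hawkish' examiners.
print(examiners.groupby("cluster")[["score_stringency", "grade_stringency"]].mean())
```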
Figure 3 shows a scatter graph of the two sets of estimates, with clusters labelled as hawkish (for those examiners estimated as scoring relatively lowly) and doveish (for those estimated as scoring relatively highly).
Standard setting comparisons at station and exam level (RQ3)
We can compare borderline regression method (BRM) cut-scores derived from observed scores/grades in a station with those derived from the modelled outcomes. This provides us with insight into how the combined effect of examiner stringency (i.e. in both scores and grades) impacts on BRM standards, and then on candidate pass/fail outcomes. We note that in practice, cut-scores in PLAB2 are higher, and subsequent pass rates lower, than those presented here. For reasons of simplicity and data comparability, the comparison in this study is kept straightforward, omitting some elements of the actual standard setting approach—Appendix 3 gives more justification for this decision.
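As a reminder of the basic BRM calculation applied here to each station administration (omitting the additional elements of the operational PLAB2 procedure noted above), the sketch below regresses candidates' total domain scores on their global grades and reads off the predicted score at the borderline grade; the grade coding and borderline value are placeholders.

```python
import numpy as np

def brm_cut_score(domain_scores, global_grades, borderline_grade=1.0):
    """Borderline regression method for one station administration:
    regress score on grade, return the predicted score at the borderline grade."""
    slope, intercept = np.polyfit(global_grades, domain_scores, deg=1)
    return intercept + slope * borderline_grade

# Toy data for one station (grades coded 0 = fail, 1 = borderline, 2 = pass, 3 = good):
grades = np.array([0, 1, 1, 2, 2, 2, 3, 3])
scores = np.array([3.0, 4.5, 5.0, 7.0, 7.5, 8.0, 10.0, 11.0])  # out of 12
print(round(brm_cut_score(scores, grades), 2))
```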
With 442 exams and an average of 17.8 stations per exam (Table 1), there are a total of 7,877 separate station administrations in the data. The key overall finding when comparing these cut-scores, at both the station and the exam level, is that those derived from modelled outcomes are systematically lower than those based on the actual observed scores. To our knowledge, this finding has not been evidenced before in the literature.
At the station level, a paired t-test gives a mean difference between cut-scores from observed and modelled data of around 3% of the 12-point scale (mean observed = 5.61, mean modelled = 5.24; t = 57.94, df = 7876, p < 0.001, Cohen’s d = 0.65).
The equivalent analysis at the exam level gives a similar mean difference in exam-level percentage scores of 3.1% (mean observed = 46.76, mean modelled = 43.66; t = 52.68, df = 441, p < 0.001, Cohen’s d = 2.51). The larger Cohen’s d at the exam level is a result of a much smaller (relative) standard deviation of the difference at this level. This lower SD is an artefact of summing a set of 18 somewhat independent station-level cut-scores.
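These comparisons can be reproduced with a paired t-test plus a paired Cohen's d (mean difference divided by the SD of the paired differences), as in the sketch below; the simulated arrays are placeholders standing in for the observed and modelled cut-scores.

```python
import numpy as np
from scipy import stats

def paired_comparison(observed, modelled):
    """Paired t-test and paired Cohen's d (mean difference / SD of differences)."""
    diff = np.asarray(observed) - np.asarray(modelled)
    t, p = stats.ttest_rel(observed, modelled)
    d = diff.mean() / diff.std(ddof=1)
    return t, p, d

# Placeholder data: one cut-score per station administration (or per exam).
rng = np.random.default_rng(0)
modelled_cuts = rng.normal(5.2, 0.6, size=500)
observed_cuts = modelled_cuts + rng.normal(0.37, 0.5, size=500)

t, p, d = paired_comparison(observed_cuts, modelled_cuts)
print(f"t = {t:.1f}, p = {p:.3g}, d = {d:.2f}")
```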
Additional analysis indicates that the average difference in cut-scores is a direct result of systematic differences between the BRM intercepts and slopes derived from modelled and observed data, which arise because error is removed during the modelling process. In short, regression slopes are typically higher in modelled data, and intercepts lower. Appendix 4 gives a more detailed explanation of why these differences occur.
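A small simulation, with entirely made-up numbers and not a substitute for Appendix 4, illustrates the mechanism: error in the grades attenuates the score-on-grade regression slope and raises the intercept, which in turn raises the predicted score at a below-average borderline grade.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
true_grade = rng.uniform(0, 3, n)                     # low-error ('modelled-like') grades
score = 3 + 2.5 * true_grade + rng.normal(0, 0.8, n)  # underlying score-grade relationship
noisy_grade = true_grade + rng.normal(0, 0.6, n)      # grades with examiner-type error added

for label, grades in [("low-error grades", true_grade),
                      ("noisy grades", noisy_grade)]:
    slope, intercept = np.polyfit(grades, score, 1)
    cut = intercept + slope * 1.0                     # borderline grade = 1.0 (placeholder)
    print(f"{label}: slope = {slope:.2f}, intercept = {intercept:.2f}, cut-score = {cut:.2f}")
```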
Figure 4 shows a scatter graph of the two exam-level cut-scores (observed and modelled), with the line of equality shown. Almost all exam-level cut-scores are higher in the observed data than in the modelled data.
That cut-scores are systematically lower once error has been removed is a finding we did not anticipate, and one we will revisit in the Discussion.
Whilst this analysis has shown that cut-scores differ systematically between observed and modelled values, it should be emphasised that candidate domain scores (or grades) themselves do not differ on average. This is because the residuals (observed minus modelled) are estimated with mean zero.
Indicative differences in exam-level decisions (RQ3)
For completeness, the final analysis is a comparison of pass rates between observed and modelled data. However, as earlier caveats have indicated, this is quite problematic in some regards, and does not directly correspond to actual PLAB2 decision-making (again, see Appendix 3 for more on this).
Table 8 compares indicative exam-level pass/fail decisions between observed and modelled data.
Table 8 A hypothetical comparison between observed and modelled pass/fail decisions
As a result of the difference in exam-level cut-scores, the pass rate is much higher when using modelled data (97.1% vs. 87.2% for observed data).
For all the reasons already stated, this final analysis should be treated as indicative rather than representing a complete picture of how PLAB2 decision-making might change if scores were to be adjusted from observed to modelled in the way presented. We return to some of these issues in the Discussion.