Background

Objective structured clinical examinations (OSCEs) are considered a fair evaluation method for health students since they aim to assess their competencies in a standardized and objective way [1]. Several factors have been reported to influence OSCE reliability, including the duration, circuit, sites, scoring system, and rater [2, 3], although conclusions remain highly heterogeneous. In OSCEs with a very large number of students, the practice of running multiple parallel versions of the same examination with different raters can also introduce rater variability [4]. Each cohort of raters should therefore evaluate performances with the same standard of judgment, so that students are not systematically advantaged or disadvantaged by their circuit and the fairness of the OSCE is guaranteed. Few studies have examined the influence of different circuits on OSCE examinations [5,6,7]. Their findings were heterogeneous, probably because in most studies students are fully nested within cohorts of examiners, with no crossover between groups of students and groups of examiners, which prevents assessment of rater-cohort variability. Yeates et al. developed a video-based method to adjust for the examiner effect in fully nested OSCEs and showed that examiner cohorts could substantially influence students' scores and could potentially change the pass/fail categorization of around 6.0% of them: one student (0.8%) passed who would otherwise have failed, whereas six students (5.2%) failed who would otherwise have passed [4].

The COVID-19 pandemic forced medical schools worldwide to cancel on-site OSCEs [8,9,10,11]. To the best of our knowledge, data on examiner effects and score variance in online OSCEs remain scarce. We exploited a large-scale online OSCE (eOSCE) at the Université Paris Cité medical school [12] that allowed both live and remote evaluation, in order to assess the agreement between live and remote video-based evaluations and to quantify the proportions of score variability attributable to student ability and to the rater effect, both for the global station score and at the item level.

Methods

Study design

The medical school of the Université Paris Cité conducted eOSCEs as a mock examination in June 2021, using the video conferencing platform Zoom; 531 students in their fifth year of medical school and 298 teachers participated.

We conducted a double evaluation of a sample of recorded student performances from the three eOSCE stations.

This study obtained the approval of the ethics committee of the Université Paris Cité (CER U-Paris N° 2021-96-BOUZID). The ethics committee waived the need for written informed consent from the students but required that they receive clear information about the study protocol, with the possibility to decline to participate in the training.

Population

Medical students completing their fifth year at the Université Paris Cité medical school (Paris, France) were invited to participate on a voluntary basis in the first large-scale eOSCE in our school. Teachers from the medical school of the Université Paris Cité with previous experience of on-site OSCEs administered the eOSCE and were involved as raters or standardized patients.

Description of eOSCE station

We proposed a circuit of three eOSCE stations to the students. Expert teachers from the Université Paris Cité OSCE group carefully prepared these stations. Each station was reviewed by two other teachers and tested beforehand with volunteer residents to assess its feasibility within the allocated time. Station#1 concerned gynecology and focused on history-taking skills. Station#2 concerned addictology and evaluated communication and history-taking skills. Station#3 concerned pediatrics; it provided a picture of chickenpox lesions and assessed therapeutic management skills. None of these stations addressed technical procedures or clinical examination skills, in order to accommodate the digital environment and allow more straightforward remote evaluation.

Each station lasted 7 min, and the student was then invited to click on the next link for the following station. The scoring system was binary (Fulfilled/Not fulfilled) for each item, and the items were weighted differently.
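As a minimal illustration of this scoring scheme, the sketch below computes a station score out of 100 from binary item results and item weights; the item results and weights shown are purely hypothetical and do not correspond to the actual grids.

```r
# Minimal sketch of the weighted binary scoring described above.
# Item results and weights are hypothetical examples, not the actual grids.
items_fulfilled <- c(1, 0, 1, 1, 0)     # 1 = Fulfilled, 0 = Not fulfilled
item_weights    <- c(10, 5, 5, 15, 10)  # hypothetical per-item weights

station_score <- 100 * sum(items_fulfilled * item_weights) / sum(item_weights)
station_score  # station score out of 100
```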

OSCE evaluation

The raters observed the OSCE station with both their camera and microphone turned off. They then completed the evaluation grid online using the usual Université Paris Cité examination software, "Sides THEIA."

Four weeks after the eOSCEs, the videos were uploaded to a secure institutional online platform, and a panel of 35 volunteer raters watched 236 randomly selected station recordings, providing a double evaluation. They were able to pause and rewatch the videos as often as they wished.

Objectives

The primary objective was to compare the live online evaluation with the remote online evaluation of these eOSCEs.

The secondary objective was to assess the other components of score variability: the student and rater effects, the raters' experience, the students' gender, and the evaluated items.

Statistical analysis

Separate descriptive analyses were performed for the three stations. We reported score dispersions, the success percentage for each item, and item discrimination. Discrimination indicates how much better the best students perform than the others on a specific item. It is defined as the difference in success rate (or score) between the 30% of students with the best performances and the 30% with the worst performances. These subsets are defined according to the station score, whereas discrimination is computed for each item.
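As an illustration, discrimination could be computed as in the sketch below, assuming a data frame `df` with one row per student, a `station_score` column, and binary item columns; all names are hypothetical and were not part of the original analysis code.

```r
# Sketch of the discrimination index described above, assuming a data frame
# `df` with one row per student, a station_score column, and binary item
# columns (names are hypothetical).
discrimination <- function(df, item, prop = 0.30) {
  n   <- nrow(df)
  k   <- ceiling(prop * n)
  ord <- order(df$station_score, decreasing = TRUE)      # best station scores first
  top    <- df[ord[1:k], item]                           # 30% best performers
  bottom <- df[ord[(n - k + 1):n], item]                 # 30% worst performers
  mean(top, na.rm = TRUE) - mean(bottom, na.rm = TRUE)   # difference in success rate
}

# Example call for a hypothetical item column:
# discrimination(df, "item_05")
```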

In our context of multiple raters (live and remote evaluation) and multiple students, we fitted a linear mixed model with student and rater as random effects and the score as the dependent variable, allowing estimation of intraclass correlation coefficients (ICCs, also referred to as variance partition coefficients) for student and rater. Three linear models were fitted, one per station. The rater ICC represents the variance of the score due to the rater, expressed as a proportion of the total variance (rater, student, and residual). A low rater ICC indicates relatively homogeneous scoring or, at least, a low effect of rater heterogeneity on score dispersion [13]. We also estimated the student ICC: a high student ICC indicates that the observed dispersion of the scores is almost entirely due to the dispersion of the students' skills.
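A minimal sketch of this station-level model with lme4 (the package used for the analysis) is shown below, assuming a long-format data frame `scores` with one row per evaluation and hypothetical column names `score`, `student`, and `rater`.

```r
# Sketch of the station-level linear mixed model, assuming a long-format data
# frame `scores` with hypothetical columns: score, student, rater.
library(lme4)

fit <- lmer(score ~ 1 + (1 | student) + (1 | rater), data = scores, REML = TRUE)

vc        <- as.data.frame(VarCorr(fit))
var_stud  <- vc$vcov[vc$grp == "student"]
var_rater <- vc$vcov[vc$grp == "rater"]
var_resid <- vc$vcov[vc$grp == "Residual"]
total     <- var_stud + var_rater + var_resid

icc_student <- var_stud / total   # proportion of variance due to student ability
icc_rater   <- var_rater / total  # proportion of variance due to the rater
```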

The influence of the gender and experience of the rater was tested by including fixed effects in the model, and we reported the corresponding Wald p-values. Experience was treated as a binary variable: experienced raters were tenured academic physicians. The same strategy was used to test the influence of student gender and of the evaluation timing (live or remote). P-values below 5% were considered statistically significant.
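The covariate tests could look like the sketch below. The column names are hypothetical, the covariates are shown together in a single model for brevity (the original analysis may have tested them separately), and car::Anova() is one way, among others, to obtain Wald tests for a merMod fit.

```r
# Sketch of testing rater gender and experience, student gender, and evaluation
# timing as fixed effects; column names are hypothetical.
library(lme4)
library(car)  # Anova() provides Type II Wald chi-square tests for merMod fits

fit_cov <- lmer(score ~ rater_gender + rater_experience + student_gender + timing +
                  (1 | student) + (1 | rater),
                data = scores, REML = TRUE)

Anova(fit_cov)  # Wald p-value for each covariate
```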

Each station comprised 18 to 28 items, each scored in a binary fashion, and we also investigated the sources of score variability at the item level. We fitted a mixed logistic model for each item to estimate the student and rater ICCs at the item level according to the latent variable approach described by Goldstein et al. [14]. We also reported crude agreement at the item level, defined as the proportion of students for whom both raters agreed, although its interpretation can be misleading since part of this crude agreement is due to chance. All models (linear and logistic) were fitted with the lme4 package of R 4.1.2; variance estimates for the linear models were obtained by restricted maximum likelihood (REML). Missing values were not imputed, and the analysis was limited to available data.
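For the item-level analysis, a sketch of one such model and of the latent-variable ICC is given below, again with hypothetical data frame and column names (`items`, `fulfilled`, `student`, `rater`).

```r
# Sketch of the item-level mixed logistic model for one item, assuming a
# long-format data frame `items` with hypothetical columns: fulfilled (0/1),
# student, rater. glmer() fits by (Laplace-approximated) maximum likelihood.
library(lme4)

fit_item <- glmer(fulfilled ~ 1 + (1 | student) + (1 | rater),
                  data = items, family = binomial)

vc        <- as.data.frame(VarCorr(fit_item))
var_stud  <- vc$vcov[vc$grp == "student"]
var_rater <- vc$vcov[vc$grp == "rater"]
var_resid <- pi^2 / 3  # latent-variable residual variance (Goldstein et al.)

icc_rater_item   <- var_rater / (var_stud + var_rater + var_resid)
icc_student_item <- var_stud  / (var_stud + var_rater + var_resid)
```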

Results

A total of 202 students participated in at least one station; 131 (65%) were female. The first station comprised 18 separate items. After excluding observations with missing data and students who were evaluated only once, 170 observations, corresponding to 85 students and nine raters, were analyzed. With the same quality control, we retained 192 and 110 observations for the two other stations, corresponding to 96 and 55 students and to 15 and seven raters, respectively.

Of the 31 raters, 18 (58%) were male. Scores did not differ significantly according to the gender of the rater (p = 0.96, 0.10, and 0.26). There was also no systematic difference in scores according to the evaluation timing (live or remote, p = 0.92, 0.053, and 0.38). Twenty raters were experienced physicians, but no association was found between the rater’s experience and scores for Station#1 and Station#3 (p = 0.26 and 0.12, respectively). For Station#2, experienced raters gave higher scores (mean score difference 5.4, 95% CI 4.5–10.8, p = 0.048). The gender of the student was not associated with their score (p = 0.32, p = 0.57, and p = 0.25 for the three stations).

Table 1 summarizes the results of the different models. The median scores (out of 100) were 60 (IQR 50–70), 60 (IQR 54–70), and 53 (IQR 45–62) for the three stations. The proportions of score variance explained by the rater (namely, the rater ICC) were 23.0, 16.8, and 32.8%. Some items had an extremely high success rate and thus low discrimination. Item 10 of Station#3 (chickenpox diagnosis) was passed by all students, leading to a 100% success rate and 0% discrimination. Two items (one in Station#2 and one in Station#3, concerning medical history and therapeutic education) had negative discrimination.

Table 1 Summary of the factors influencing the variability of students' scores

The item-level analysis showed very high variability between items. Some items showed a high proportion of variance explained by the rater (e.g., item 5 of the first station, concerning medical history, had an estimated rater ICC of 0.48). Conversely, most items showed a reasonable rater ICC. All agreement proportions appeared fair, since only one was below 73%. Note that for items with nearly complete agreement or a very high success rate, the statistical model may fail to converge or return a singular fit; as a result, 22 of the 64 items could not be analyzed.

Discussion

To our knowledge, this study is the first to compare live and remote evaluations of eOSCEs. We found no significant difference between the live and remote evaluations. Previous studies showed that remote evaluation using a video recording system is as reliable as live in-person evaluation in on-site OSCEs [15, 16]. Our findings are consistent with the conclusion of Yeates et al. that internet-based scoring could offer a more flexible means to facilitate scoring and minimize the examiner-cohort effect [17]. Chen et al. even emphasized that on-site evaluation could introduce an audience effect that may influence students' performances [15]. One of the greatest challenges for OSCE organizers is to recruit available teachers for the evaluation. Remote evaluation might therefore reduce the number of examiners who need to work simultaneously.

The proportion of score variance explained by the rater was moderate for the three stations comprising our eOSCEs. The gender of our raters did not influence the scoring, but experienced raters scored higher than junior raters in Station#2. This finding contrasts with the findings of Chong et al. on raters' experience, since they demonstrated that junior doctors scored consistently higher than senior doctors in all domains of OSCE assessment [18, 19]. However, Station#2 in our study, concerning alcohol addiction, had the lowest rater ICC and therefore the most homogeneous evaluation between raters, even though experienced raters scored higher than juniors. The student ICCs were slightly lower than those reported in previous publications on inter-rater reliability in on-site OSCEs [20, 21]. For instance, in the study by Hurley et al., whose objective was to assess inter-observer reliability and observer accuracy as a function of OSCE checklist length, inter-rater reliability ranged from 58 to 78%, whereas the corresponding student ICCs in our study ranged from 39.4 to 60.2% [22].

The item analysis showed a reasonable rater ICC with good agreement proportions. However, a few items showed a high proportion of variance explained by the rater. For the 5th item of Station#1, regarding medical history and, more precisely, the search for endometrial cancer risk factors, the rater ICC was higher than for the other items, suggesting that this item was not clearly explained in the scoring process.

Regarding the students' profiles, this study showed no impact of the student's gender on OSCE scores, confirming the findings of Humphrey-Murto et al. in a study evaluating the influence of simulated patients' and students' gender on OSCE grading [23].

Limitations

Our study has some limitations. First, the OSCE stations mainly focused on communication and history-taking skills, for which the video interface was suitable; a recent review nevertheless suggests that more technical tasks may require multiple cameras and more advanced simulation methods. Second, although all agreement proportions were fair, this might be partly explained by chance, especially for items with a high success rate.

Conclusion

Our study suggests that remote evaluation is as reliable as live evaluation for eOSCEs. It also highlights that the proportion of score variance explained by the rater remains substantial even with eOSCEs and that high variability exists between items. These data encourage us to continue improving the OSCE station writing process. Further studies are required to compare the variability of scores between online and on-site OSCEs.