Background

The relative effectiveness of selection methods for entry into postgraduate medical training has been a relatively under-researched topic internationally [1–4]. As with all selection methodologies, various psychometric criteria must be satisfied to ensure that a given postgraduate medical selection system is fair and robust, including standardisation, reliability and validity [5–7]. Faced with limited training positions and a high volume of applicants, medical selection has traditionally relied on academic attainment as the primary selection criterion in admission systems [8]. However, there is a growing recognition in the literature that other important non-academic attributes and skills must be present from the start of training in order to become a competent clinician [9]. Given that medical selection systems globally are increasingly implementing several selection methods in combination (targeting both the academic and non-academic attributes required of clinicians), there is a need to evaluate the relative and complementary roles of, and value added by, selection methods in predicting desired outcome criteria. Such evaluation is lacking in the research literature to date [2].

Internationally, extensive literature documents the reliability, validity and stakeholder acceptability of situational judgement tests (SJTs) as measures of non-academic ability across a range of occupations, including in the context of medical selection [2, 10–14]. However, although evidence regarding the construct validity and reliability of SJTs exists at the postgraduate level for selection into UK General Practice (GP) [3, 15, 16], there is limited extant research internationally regarding the predictive validity of SJTs for selection into postgraduate specialty training.

One high-volume postgraduate specialty that has recently incorporated non-academic assessment at the point of selection is Australian GP training. The current selection system was implemented nationally in 2011 following a successful pilot in 2010, and comprises an SJT and a multiple mini-interview (MMI). The standardised results of the SJT and MMI determine an applicant’s overall selection score. Applicants’ overall selection score and geographic training region preference are used to determine whether the applicant can be shortlisted for subsequent local selection processes.

The selection process targets seven core attributes (detailed in Fig. 1), which were criterion-matched against the competencies identified as important for entry-level GP registrars in the domains of practice defined by the Royal Australian College of General Practitioners (RACGP) and the Australian College of Rural and Remote Medicine (ACRRM).

Fig. 1 Attributes assessed by the Australian GP selection methods

The Australian GP selection process does not include an explicit measure of academic attainment at the point of selection, unlike more traditional selection systems. However, completion of a primary medical qualification is a requirement for eligibility, so academic attainment is a prerequisite at the point of selection. In addition, academic attainment tends to be relatively homogeneous among trainee physicians, so differentiating between applicants for postgraduate medical education predominantly on the basis of academic achievement is challenging and likely to be error-prone [17–19]. Instead, as outlined in Fig. 1, related attributes such as clinical reasoning and problem solving are assessed via an SJT. Preliminary, cross-sectional evidence of the reliability and concurrent validity of the selection system has been demonstrated [20]; however, to date longitudinal data has not been collected to assess the validity of the selection system for predicting performance in end-of-training assessments. Moreover, the complementary roles of the SJT and MMI have yet to be assessed. In any multi-method selection system, it is important that each method has added value (i.e., assesses something different to the other tools in the system) in order to ensure that efficiency and cost-effectiveness are maximised. Therefore, in the present study we posed the following research questions:

  1. What is the predictive validity of the SJT, the MMI, and the overall selection score for performance on end-of-training assessments in Australian GP training?

  2. What are the incremental validities of the SJT and the MMI for predicting performance on end-of-training assessments in Australian GP training?

Method

Participants

Selection data was collected from participants who took part in the 2010 and 2011 selection process into Australian GP training, which comprised both the SJT and the MMI. Participants provided their consent at the point of selection into GP training for their data to be used for research purposes. In 2010, this new selection process was piloted by three Regional Training Providers (N = 345) and in 2011 this selection process was used nationally across Australia (N = 1335).

End-of-training assessment scores were requested from all 17 Regional Training Providers, and received from 13 of these. From this data, it was possible to match the selection and end-of-training data for N = 443 registrars. Table 1 provides the demographics of the sample and of the entire population, and shows that our sample is consistent with the demographic breakdown of the entire population.

Table 1 Demographics

Procedure

A retrospective longitudinal design, using previously validated methods [2, 10–14], was used to evaluate the selection data’s relationship with end-of-training assessment scores.

Analyses were conducted using SPSS 22.0 for Windows. Pearson product–moment correlations were used to assess the association between all selection and end-of-training assessments, and hierarchical regression analyses examined the predictive power of the selection methods. Missing data were deleted pairwise to maximise the available sample size for each analysis.
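As an illustration of the correlation step described above, the following is a minimal sketch of a Pearson product–moment correlation computed with pairwise deletion. It is not the SPSS procedure used in the study; the function name `pearson_pairwise` and the use of `None` to mark a missing score are assumptions made for this example.

```python
import math

def pearson_pairwise(x, y):
    """Return (r, n): Pearson's r over the pairs where both scores are present.

    Missing scores are represented by None (an assumption for this sketch).
    A registrar is dropped only from the correlations that involve the
    missing score, which is what pairwise deletion means.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy), n
```

Because each correlation is based on its own set of complete pairs, the effective N can differ from one coefficient to the next, which is why the function returns n alongside r.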

Selection methods

Situational judgement test

The SJT is a low fidelity computer-delivered examination which is completed under invigilated conditions. The test comprises 50 questions and applicants have two hours to complete the test. Two response formats are used: ranking and multiple response (see Fig. 2). This SJT has been found to have high internal reliability (Cronbach’s alpha = .91) [21].

Fig. 2 Example SJT Questions

Multiple mini-interview

The MMI rotates applicants between six 10-minute interview stations. Applicants have two minutes to read the question before entering the interview room, then eight minutes to answer the question from the interviewer in a face-to-face context. Interviewee responses are then probed further by the interviewer. An example MMI question is provided in Fig. 3. Each interviewer gives the applicant a score out of seven based on standardised criteria. This specific MMI has been found to have high internal reliability (mean Cronbach’s alpha = .76) [21].

Fig. 3 Example MMI Question

The SJT and MMI are each weighted as 50 % of the overall selection score.
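The equal weighting described above can be sketched as follows. The source states only that the SJT and MMI results are standardised and each contribute 50 % of the overall score; the use of z-scores (with the population standard deviation) and the function names `standardise` and `overall_selection_scores` are assumptions for illustration, not the operational scoring rules.

```python
from statistics import mean, pstdev

def standardise(scores):
    """Convert raw scores to z-scores (assumed form of standardisation)."""
    m = mean(scores)
    s = pstdev(scores)
    if s == 0:
        raise ValueError("cannot standardise scores with zero variance")
    return [(x - m) / s for x in scores]

def overall_selection_scores(sjt_raw, mmi_raw, sjt_weight=0.5):
    """Combine standardised SJT and MMI scores; 0.5 gives the 50/50 weighting."""
    z_sjt = standardise(sjt_raw)
    z_mmi = standardise(mmi_raw)
    return [sjt_weight * a + (1 - sjt_weight) * b for a, b in zip(z_sjt, z_mmi)]
```

Standardising before combining prevents the method with the larger raw score range from dominating the composite.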

End-of-training assessments

The outcome measures for this study were end-of-training assessment scores for the final RACGP Fellowship assessment, which consists of three invigilated assessments:

Applied knowledge test

The applied knowledge test is a multiple-choice examination, which includes 150 clinically-based questions delivered via computer over three hours.

Key feature problems

This is a computer-delivered examination paper that assesses clinical decision-making skills. The 26 ‘key feature problems’ each consist of a clinical case scenario followed by questions that focus only on the critical steps in resolving the case. Candidates are required to type short responses or choose from a list of options provided, and the assessment lasts for three hours.

Objective structured clinical exam

This is a four-hour, high-fidelity clinical performance assessment of applied knowledge, clinical reasoning, clinical and communication skills, and professional behaviours in the context of patient consultations and peer discussions. The exam comprises 14 clinical cases presented at either short (eight-minute) or long (19-minute) stations, and includes rest stations.

For each of these assessments, GP registrars were able to complete the assessment multiple times. For the purposes of this study, the applicants’ ‘best’ score on each assessment was utilised. The reliability of each of the end-of-training assessments could not be calculated from the data collected, and was not readily accessible online at the time of publication.

Results

Descriptive statistics

Matched data was available from 443 registrars. All variables in the study showed normal distributions, with the exception of the SJT which had a slight negative skew, as is typical of SJT score distributions [20, 22, 23]. Skewness of the SJT score distribution was assessed, and was within acceptable limits (−1.28). As such, parametric analyses were run on all variables, as previous research has suggested that apart from in instances of extreme skew, parametric analyses are more powerful and robust [24, 25].

Raw scores from each of the three end-of-training assessments were converted into percentage scores, to enable direct comparison of results between assessments. Table 2 details the descriptive statistics for scores on the selection methods and end-of-training assessments.
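The two scoring conventions described above (taking each registrar's best attempt and expressing it as a percentage) can be sketched as below. The function name `best_percentage` and the example maximum of 150 marks (assuming one mark per applied knowledge test question) are illustrative assumptions; the actual mark schemes are not given in the source.

```python
def best_percentage(attempt_raw_scores, max_raw_score):
    """Return the best raw score across attempts as a percentage of the maximum."""
    if not attempt_raw_scores:
        raise ValueError("at least one attempt is required")
    if max_raw_score <= 0:
        raise ValueError("max_raw_score must be positive")
    return 100.0 * max(attempt_raw_scores) / max_raw_score

# Hypothetical registrar with three attempts at a 150-mark paper.
print(best_percentage([90, 105, 99], 150))  # 70.0
```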

Table 2 Descriptive statistics for selection methods and end-of-training assessments

A significant correlation (p < .001) was found between scores on the SJT and MMI in the population as a whole, as well as in the matched sample (r = .53, N = 1594; r = .39, N = 443 respectively). The slightly smaller correlation in the matched sample is to be expected given the likely restriction of range inherent in successful applicants’ selection scores.

Predictive validity of the selection methods

Table 3 presents the correlations between the selection methods and the end-of-training assessments. Results showed that both the SJT and MMI, as well as the overall selection score, are significantly correlated with performance on all end-of-training assessments (r values ranging from .12 to .54; p < .05 to p < .001).

Table 3 Pearson’s correlations between selection methods and end-of-training assessments

The SJT and MMI both showed a particularly strong correlation with the objective structured clinical exam (r = .44 and r = .46 respectively, both p < .001). This is likely to reflect the similarity in content between the selection methods and this exam; i.e., both the SJT and MMI have been designed to assess non-academic attributes. While the SJT and MMI correlated at a similar level with both the applied knowledge test (r = .14, p < .01 and r = .12, p < .05, respectively) and the key feature problems (r = .24 and r = .20 respectively, both p < .001), these correlations were substantially smaller than those with the objective structured clinical exam.

Incremental validity of the SJT and MMI

Hierarchical regression analyses were conducted to ascertain the extent to which the SJT and MMI explain significant added value (incremental validity) over and above each other, for predicting scores on all three end-of-training assessments. Results are shown in Table 4.

Table 4 Hierarchical regression analysis of the SJT and MMI with end-of-training assessments

The SJT explains a significant amount of additional variance, over the MMI, in the applied knowledge test, the key feature problems and the objective structured clinical exam scores (1 %, 3 % and 8 % respectively). The MMI explains a significant amount of additional variance, over the SJT, in the key feature problems and the objective structured clinical exam scores (2 % and 10 % respectively).
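For readers less familiar with incremental validity, the following sketch shows how the additional variance explained (the change in R²) can be approximated for a two-predictor model from three correlations alone. It uses the standard identity for R² with two predictors and the rounded matched-sample correlations reported in the text; it is not a reproduction of the SPSS hierarchical regression, and the function names are assumptions for this example. Because the inputs are rounded and the study used pairwise deletion, the values are approximate.

```python
def two_predictor_r_squared(r_y1, r_y2, r_12):
    """R² for an outcome regressed on two predictors, from their correlations."""
    if abs(r_12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

def incremental_validity(r_y_new, r_y_base, r_12):
    """Change in R² when the 'new' predictor is added to the 'base' predictor."""
    return two_predictor_r_squared(r_y_new, r_y_base, r_12) - r_y_base ** 2

# Rounded correlations reported in the text for the matched sample:
# SJT with clinical exam = .44, MMI with clinical exam = .46, SJT with MMI = .39.
r_sjt_osce, r_mmi_osce, r_sjt_mmi = 0.44, 0.46, 0.39

delta_sjt = incremental_validity(r_sjt_osce, r_mmi_osce, r_sjt_mmi)  # SJT over MMI
delta_mmi = incremental_validity(r_mmi_osce, r_sjt_osce, r_sjt_mmi)  # MMI over SJT
print(f"{delta_sjt:.2f} {delta_mmi:.2f}")  # 0.08 0.10
```

The two approximations (about 8 % and 10 %) agree with the additional variance reported for the objective structured clinical exam.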

Discussion

This study provides longitudinal data to address the relative dearth of research regarding the predictive validity of selection methods in postgraduate medical settings. This is the first study to explore the relative predictive validity of, and value added by, an SJT and an MMI within a single postgraduate specialty selection system. Our results show that the SJT and MMI are significantly correlated with end-of-training assessment performance, indicating that each selection method, and the overall selection score, has good longitudinal predictive validity. Regression analyses indicate that the selection methods significantly predict performance across all end-of-training assessments.

There is a moderate correlation between the SJT and MMI, suggesting that these selection methods have both common and independent variance, and therefore that each method offers a unique contribution to the selection system. Both selection methods explain significant additional variance over each other in predicting performance on the key feature problems and the objective structured clinical exam. For the applied knowledge test, the SJT explains additional variance over the MMI, but the opposite is not true. Practically, this means that the selection model with the best predictive validity for end-of-training assessment performance is a combination of the SJT and MMI, as both methods contribute incremental validity over and above the other in predicting training outcomes.

These are important findings as this is the first longitudinal study exploring the predictive validity of medical postgraduate selection methods in Australia. The results have relevance internationally, as they suggest that the combination of an SJT and an MMI is effective in identifying applicants who go on to perform well in assessments at the end of specialty medical training. These findings progress the current literature regarding the relative contributions of different selection methodologies when methods are used in combination, which a recent systematic review indicated is lacking at present [2].

The SJT and MMI show a particularly strong correlation with the objective structured clinical exam. The strong correlation between the MMI and objective structured clinical exam is likely to reflect the similarities between these two assessments, for example that they are both face-to-face (high fidelity) and assess an individual’s ability to communicate effectively and respond to a question or situation in an appropriate way. Importantly, although the SJT is a low-fidelity written assessment, its positive correlation with the objective structured clinical exam is especially encouraging because, compared to the MMI, a text-based SJT is significantly less resource intensive to deliver and can be machine marked. Comparatively lower correlations were found between the selection methods and the applied knowledge test. This is expected, given that the applied knowledge test is a measure of declarative knowledge and could be considered the assessment least consistent with the selection methods in terms of the underlying constructs being measured; we would not expect an SJT (designed to assess non-academic constructs and interpersonal skills) to predict performance on a highly cognitively loaded criterion [26]. We would, however, expect an SJT to predict performance on criterion-matched outcomes such as interpersonal skills and patient care [26], as assessed by the objective structured clinical exam and the key feature problems test.

Implications

Considering priorities for a future research agenda for evaluating the predictive validity of selection into Australian GP training, it would be prudent to gather criterion-matched in-training (i.e., mid-GP training) performance data and, if possible, performance data from registrars once they enter practice. This would allow analysis of the predictive validity of the selection methods throughout GP training and beyond; such data is lacking in the medical selection research at present [2]. This is important because indicators of competence, and selection methods, have been found to be differentially predictive of performance at different stages of medical training [2, 26, 27]. Specifically, non-academic measures have been found to be more predictive in the later stages of medical education and training. For example, conscientiousness has been identified as a predictor of success in undergraduate training, but may actually hinder aspects of performance in clinical practice [27]. As such, different selection methods may predict differently at different stages. For example, an SJT may be less predictive of academic performance in the early years of training, but significantly more predictive of performance outcomes once trainees enter clinical practice [28, 29]. Thus, it would be beneficial for future research to follow the current cohort of applicants once they enter independent clinical practice.

Limitations

As this is the first analysis of predictive validity, we have adopted a conservative approach to data analysis and have not corrected for restriction of range in the present study; therefore these results are likely to have underestimated the magnitude of relationships between selection methods and performance on end-of-training assessments. However, future analysis of more longitudinal data (i.e., mid-GP training, into the consultant role, and beyond) may benefit from restriction of range analysis, as the pool of applicants is likely to diminish at each stage, thus increasing range restriction, which serves to suppress the magnitude of the predictive validity coefficients. Another limitation of this study is the relatively small sample size; it would therefore be beneficial to conduct further research on a larger sample.
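To illustrate the kind of restriction-of-range adjustment referred to above, the sketch below applies the widely used correction for direct range restriction (often called Thorndike's Case II). No such correction was applied in this study, and the study does not say which correction method future analyses would use. The function name and the standard deviation ratio in the example are hypothetical, chosen only to show the direction of the effect.

```python
import math

def correct_for_range_restriction(r_restricted, sd_ratio):
    """Estimate the unrestricted correlation under direct range restriction.

    r_restricted: correlation observed in the selected (restricted) group.
    sd_ratio: predictor SD among all applicants divided by the predictor SD
              among those selected (1.0 means no restriction).
    """
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    numerator = r_restricted * sd_ratio
    denominator = math.sqrt(1 + r_restricted ** 2 * (sd_ratio ** 2 - 1))
    return numerator / denominator

# Hypothetical illustration: an observed r of .30 with applicants' scores
# 1.5 times as variable as selected registrars' scores.
corrected = correct_for_range_restriction(0.30, 1.5)
```

With an SD ratio above 1, the corrected coefficient is larger than the observed one, which is why uncorrected coefficients are regarded as conservative estimates.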

The reliability of each of the end-of-training assessments could not be calculated from the data collected, and was not readily accessible online at the time of publication. As such, it is difficult to determine the reason for the comparatively weak correlations between the SJT and MMI, and the applied knowledge test. However, it should be noted that the SJT and MMI are designed to target different constructs when compared to the applied knowledge test, so these results are expected to some extent.

Conclusions

This study represents the first longitudinal analysis of the predictive validity of the methods for selection into Australian General Practice training. The SJT and MMI were significant positive predictors of all three end-of-training assessments. Results show that the two selection methods are complementary, as they both explain incremental variance over each other for end-of-training assessment scores. This research therefore adds to the relatively sparse literature at present regarding the predictive validity of postgraduate medical selection methods, and their relative effectiveness when used in a single selection system. Future research would benefit from more longitudinal data with criterion-matched outcomes, collected across the duration of GP training and once registrars enter independent clinical practice.

Ethics statement

Participants provided their consent at the point of selection into GP training for their data to be used for research purposes. Ethical approval for this specific study was granted via the University of Sydney.