Background

Medical university admission tests and admission procedures serve to select potential students and are used as predictors of the educational success of college applicants. Admission tests thus (i) have to guarantee the fair and reproducible allocation of limited university places to a preferably diverse future student population [1,2], (ii) should select those applicants who, with the greatest probability, will develop the – hard to define – abilities and characteristics expected of future physicians [3-5] and (iii) should identify those applicants with the greatest probability of completing the course of study [3,6,7]. In addition to the assessment of cognitive abilities, the assessment of non-cognitive abilities is increasingly demanded [8]. In this context, various methods for determining “soft skills”, (inter)personal skills, or the ability to handle theoretical social constructs (e.g., health/sickness, ethnicity, gender) in complex situations have been evaluated [9]. The literature discusses several instruments for assessing personal qualities [10]:

  • the interview, which has no attested positive predictive validity for medical school applicants [11] and disputable reliability [5,12];

  • psychometric assessments (such as the Personal Qualities Assessment (PQA)), which – assuming further development – are assigned definite potential [4,12];

  • the Multiple Mini Interview (MMI), for which studies attest, among other things, statistically significant predictive validity for the future performance of participants [8,11,12];

  • letters of recommendation as well as personal and autobiographical statements, whose reliability and predictive validity have not yet been confirmed [12].

A further assessment instrument is the Situational Judgment Test (SJT) [13,14]. As McDaniel et al. [13] summarize in their meta-analysis, the SJT assesses a plurality of constructs [13,15]. Following this result, O’Connell et al. [16] recommend interpreting SJTs as measurement methods rather than as measures of a single construct [16]. In any case, the SJT has attested validity as a predictor of future job performance [17] and – provided that relevant work-related situations are described – face and content validity [17,18].

The Medical University of Graz is the only one of the three Austrian medical universities to have amended its admission process (cognitive testing with the subsections biology, chemistry, physics and mathematics, as well as a test of text comprehension) by including, in 2010, a written Situational Judgment Test (SJT) [19-21].

Method

Study population

This study is an observational investigation of the results of the situational judgment test (SJT) administered as part of the admission test for the study of human medicine and dentistry at the Medical University of Graz in the academic years 2010/11, 2011/12 and 2012/13. Over the three years, there were 4741 applicants, all of whom were included in the study. (The distributions of applicants for the period investigated are depicted in Table 1.)

Table 1 Distributions of applicants as well as of successful applicants according to sex and nationality in three consecutive academic years

Admission examination measures: cognitive test & situational judgment test

Cognitive test

The cognitive test, as applied in the academic years investigated, is based on secondary school level knowledge in biology, chemistry, physics and mathematics, and additionally contains a text comprehension part. (The number of items in the individual subareas is depicted in Table 2.) These five test disciplines (biology, chemistry, physics, mathematics, and text comprehension) and the SJT (the sixth test discipline) are designated “test parts”. All test parts are uniformly administered in the format of a written multiple choice test. Specifically, each test item offers four answer options, one of which is correct. For a correct answer, an applicant receives a positive score of 2 (5 in the case of the text comprehension part), depending on the test part; for a wrong answer, a negative score of −1 is counted. The rationale behind this scoring is twofold: first, guessing should be discouraged. Second, in medicine a critical self-evaluation of one’s knowledge is imperative, and thus applicants should be encouraged to critically assess their knowledge before answering a test item. Leaving out an item without choosing one of the four options leads to a score of 0 for this item. For the determination of the ranking of the applicants – and hence for the decision whether or not an applicant was admitted – the scores for all items are summed up to give a total score. Due to the different numbers of items in the various test parts, an implicit weight is given to each of these parts.
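To make the scoring rule concrete, the following minimal Python sketch implements it; the data layout, the function names and the example answers are hypothetical illustrations, not part of the actual test software.

```python
# Minimal sketch of the scoring rule described above (hypothetical data layout).
# A correct answer earns +2 points (+5 for text comprehension), a wrong answer
# costs 1 point, and a skipped item scores 0; the total determines the ranking.

def item_score(answer: str, test_part: str) -> int:
    """Score a single item according to the rule described in the text."""
    reward = 5 if test_part == "text comprehension" else 2
    if answer == "correct":
        return reward
    if answer == "wrong":
        return -1
    return 0  # item left out

def total_score(answers: list) -> int:
    """Sum of the item scores of one applicant across all test parts."""
    return sum(item_score(answer, part) for answer, part in answers)

# Example: 2 + 2 - 1 + 0 + 5 = 8
answers = [("correct", "biology"), ("correct", "biology"),
           ("wrong", "physics"), ("skipped", "chemistry"),
           ("correct", "text comprehension")]
print(total_score(answers))  # 8
```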

Table 2 Mean relative scores showing the performance of women and men in the various test parts

Situational judgment test

The development of the SJT items proceeded in four phases, involving lecturers/professors and advanced students [14,22].

Phase 1: In the framework of a seminar at the Medical University of Graz (MUG), students with at least 4–6 semesters of study experience were given the task of describing critical situations, experienced in a medical context (in the role of patient, family member, student, etc.), that they had perceived as handled particularly appropriately or particularly inappropriately. The observed patterns of action were discussed in small groups and additional possible courses of action were developed. The situations described by the students were then presented to a core team of experts, who grouped and selected representative scenarios and adapted the possible courses of action in form, length and style in order to create the actual test items. The following set of criteria was used:

  • a comprehensible context/a possible reference to basic statements of the bio-psycho-social model (information regarding the bio-psycho-social model was made available to all applicants, together with a notice regarding its relevance for the test),

  • the degree of difficulty (no prior medical knowledge is necessary for responding) and

  • logical coherence.

Phase 2: Critical evaluation, by professors and lecturers, of the situational descriptions retained in the further process, and extension of their possible courses of action.

Phase 3: Evaluation of the courses of action by the steering committee (professors/lecturers/psychologists), followed by discussion and determination of the sequence of potential courses of action by the steering committee together with the core team.

Phase 4: Performance of a pre-test, followed by renewed modification of the SJT items taking into account the pre-test results. Final revision and approval [23].

Perceptions of the admission examination by the examinees

In 2010, after having completed the admission test, the applicants were invited to evaluate certain aspects of the procedure. For each part of the admission test, they were asked – among other questions – for their subjective judgment of its difficulty, its importance within the admission test, and its importance for their prospective future career in medicine. The candidates provided their ratings on 6-point scales (1 = not difficult at all to 6 = very difficult; 1 = not meaningful at all to 6 = very meaningful). All data were anonymized to prevent any retracing.

Statistical analyses

For each test item, the index of discrimination, i.e., the correlation of that item’s score with the total test score, is computed. These indices of discrimination are then aggregated for the knowledge test (combined results on biology, chemistry, physics and mathematics), the text comprehension test, and the SJT, separately for each year.
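As a minimal sketch, assuming a score matrix with one row per applicant and one column per item (entries −1/0/+2), the index of discrimination can be computed as the Pearson correlation of each item’s scores with the total test scores; the exact implementation used for the admission test is not detailed here, and the demonstration data are synthetic.

```python
import numpy as np

def discrimination_indices(scores: np.ndarray) -> np.ndarray:
    """Pearson correlation of each item's scores with the total test score.

    scores: matrix with one row per applicant, one column per item.
    """
    totals = scores.sum(axis=1)
    return np.array([np.corrcoef(scores[:, j], totals)[0, 1]
                     for j in range(scores.shape[1])])

# Synthetic demonstration: 200 applicants, 10 items scored -1/0/+2.
rng = np.random.default_rng(0)
demo = rng.choice([-1, 0, 2], size=(200, 10))
print(discrimination_indices(demo).round(3))
```

A common refinement, not shown here, correlates each item with the total of the remaining items (the corrected item-total correlation), so that an item does not contribute to its own criterion.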

For a proper statistical analysis of the results of the various test parts, we take into account the fact that not only are the absolute numbers of items different for each test part, but these numbers also vary from one year to the next (in Table 2, these item numbers per test part and year are explicitly stated). In order to compensate for these variations and to yield comparable results for the different test parts, we calculate “relative scores” for each test part using the following formula:

$$ relative\ score=\frac{score- minimum}{maximum- minimum}. $$

Here, “score” is the absolute score of an applicant in a given test part, “minimum” represents the worst case of answering all items of a test part wrongly, and “maximum” denotes the best case of answering all items correctly. To give an example, consider an applicant with a biology score of 45. Suppose the respective admission test contains 90 biology items with possible scores of −1/0/+2 for a wrong answer/no answer/a correct answer. In this case, minimum = −90 and maximum = 180. The applicant thus has a

$$ relative\ score=\frac{45-\left(-90\right)}{180-\left(-90\right)}=\frac{135}{270}=0.50. $$

Computing relative scores this way ensures that they range from 0.0 (all items of a test part falsely answered) to 1.0 (all items of a test part correctly answered). (Other normalizing schemes such as z-scoring would have been possible; the qualitative aspects of the results and conclusions would probably remain basically unchanged.)
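A short implementation of this normalization, reproducing the worked biology example above (the function name and argument layout are illustrative only):

```python
def relative_score(score: float, n_items: int,
                   wrong: float = -1.0, correct: float = 2.0) -> float:
    """Map an absolute test-part score onto the interval [0.0, 1.0]."""
    minimum = n_items * wrong    # all items answered wrongly
    maximum = n_items * correct  # all items answered correctly
    return (score - minimum) / (maximum - minimum)

# Worked example from the text: 90 biology items, absolute score 45.
print(relative_score(45, 90))  # (45 - (-90)) / (180 - (-90)) = 0.5
```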

Basic statistical analyses of these relative scores are performed using standard descriptive statistical techniques as well as correlation analysis. Performance differences between women and men in the various test parts are analyzed using effect sizes based on the comparison of mean values (Cohen’s d), because, given the large number of observations, even very small differences of mean values become statistically significant in terms of commonly employed P-values. Cohen’s d values are interpreted as follows: |d| ≤ 0.2 indicates a weak effect, 0.2 < |d| ≤ 0.5 a moderate effect, and |d| > 0.5 a strong effect.
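The following sketch computes Cohen’s d with a pooled standard deviation, one common convention; the text does not specify which variant was used, and the relative scores below are synthetic.

```python
import numpy as np

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Cohen's d: standardized mean difference with pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Synthetic relative scores for two groups of applicants.
rng = np.random.default_rng(1)
men = rng.normal(loc=0.55, scale=0.12, size=800)
women = rng.normal(loc=0.52, scale=0.12, size=1000)
print(round(cohens_d(men, women), 2))  # ~0.25: a moderate effect by the rule above
```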

The associations between the relative scores achieved in the various test parts are assessed by computing pairwise linear correlation coefficients between all test parts and visualized by bivariate scatterplots.
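A sketch of this correlation analysis on synthetic data, illustrating the kind of pattern reported below (science parts driven by a shared factor, SJT nearly independent); all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
ability = rng.normal(size=n)  # shared latent factor for the science parts

columns = {name: 0.7 * ability + rng.normal(size=n)
           for name in ("biology", "chemistry", "physics", "mathematics")}
columns["sjt"] = rng.normal(size=n)  # essentially unrelated test part

data = np.column_stack(list(columns.values()))
print(list(columns))                             # column order
print(np.corrcoef(data, rowvar=False).round(2))  # pairwise correlation matrix
```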

All statistical analyses are performed using Stata 13 software (StataCorp LP, College Station, TX, USA).

Ethics statement

The authors gathered anonymized data from a data set that is routinely collected about medical students’ admission, dropout, and graduation dates and examination history, as required by the Austrian Federal Ministry of Science and Research. Because the data were anonymous and no data beyond those required by law were collected for this study, the Medical University of Graz’s ethical approval committee did not require approval for this study.

Results and discussion

Basic data

For the academic years 2010/11 to 2012/13, Table 1 shows basic data on the admission tests at the Medical University of Graz. As already described in an earlier publication [24], there are consistently more women than men among the applicants. This corresponds broadly with published data on admission processes in Europe. Tiffin et al. [25] describe, for example, that in the UK women are – in relation to the UK population – over-represented in medical school intakes [25]. In contrast, data from North America indicate a decrease in female applicants [26].

Sex effects

Table 2 shows the relative scores obtained by women and men in the different test parts as well as the effect size of sex. As can be seen from the mean values of the relative scores, among the natural science parts physics is the most difficult test part (with the smallest relative scores), while biology, chemistry and mathematics present similar difficulty to the applicants. Men perform considerably better in physics and mathematics, a result that is confirmed at all public medical universities in Austria [27,28] and discussed internationally, e.g., for physics and biology [2,25,29]. In the literature, stereotyping, different risk behavior in men and women, time pressure and test anxiety, among other things, are listed as reasons for the gender gap in high-stakes tests [24,29]. While in text comprehension men still perform slightly better than women, the reverse is true for the SJT; here the negative values of Cohen’s d indicate consistently better performance by women, with weak to moderate effect sizes. The 95% confidence intervals of Cohen’s d show that the observed effect sizes are significantly different from zero in all cases, with the single exception of text comprehension in 2010/11; here, the confidence interval contains zero.

Indices of discrimination of the test parts

Table 3 indicates that in each year studied, the highest mean indices of discrimination were found for the knowledge test part (consisting of biology, chemistry, physics and mathematics), followed by text comprehension; the least discriminatory test part was, with the exception of 2011, the SJT. The low answer variance on less difficult tasks – in the present case, the questions of the SJT – reduces the mean indices of discrimination. A further factor that influences the discriminatory power and, ultimately, the validity of, e.g., SJT results is the positioning of the SJT within the whole test, as discussed in the literature [30,31]. In this context, Marentette et al. [31] describe construct-irrelevant order effects which occur when longer SJT items, and SJT items presented in written form, have to be answered at the end of an admission process [31]. Nevertheless, all single-item discrimination indices in all test parts were positive, indicating that participants with higher abilities on average performed better on each single test item.
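The range-restriction argument can be illustrated with a small simulation: an item that nearly everyone answers correctly has low score variance and therefore a lower item-total correlation, even if it taps the same underlying ability. The threshold model and all numbers below are illustrative assumptions, not the actual test data.

```python
import numpy as np

rng = np.random.default_rng(3)
ability = rng.normal(size=5000)  # latent ability of simulated applicants

def simulated_discrimination(threshold: float) -> float:
    """Item-total correlation of a binary item solved when noisy ability
    exceeds the given threshold (lower threshold = easier item)."""
    item = (ability + rng.normal(size=5000) > threshold).astype(float)
    rest_of_test = ability + rng.normal(scale=0.5, size=5000)
    return np.corrcoef(item, rest_of_test)[0, 1]

print(round(simulated_discrimination(0.0), 2))   # medium item (~50% solve it)
print(round(simulated_discrimination(-2.0), 2))  # very easy item (~92% solve it)
```

Under these assumptions the easy item yields a visibly smaller discrimination index, mirroring the pattern observed for the SJT part.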

Table 3 Mean item discrimination indices of the test parts, grouped per year of admission test

Correlation analyses

Table 4 reports, for each year separately, the pairwise linear correlation coefficients between the relative scores of the various test parts. While, due to the large number of subjects included, all correlation coefficients are significantly different from zero, there are considerable differences: the highest correlation coefficients are invariably seen between the biology and chemistry results. In general, the four natural science scores show relatively strong mutual correlations. Text comprehension is moderately correlated with all other variables, including the SJT; the SJT, however, shows very weak correlations with all variables other than text comprehension. Given that Situational Judgment Inventories measure constructs that are not exclusively identical with cognitive ability, this result is not a big surprise [32]. A possible explanation lies, among other things, in the instruction type of the SJTs performed (behavioral tendency response instructions). As McDaniel et al. [15] note, lower cognitive correlates are to be expected for a “typical performance test” (among other things, an SJT with behavioral tendency response instructions) than for a “maximal performance test” (among other things, a knowledge test) [13,15].

Table 4 Pairwise linear correlation coefficients between relative scores on the various test parts, sorted by year of admission test *

Figure 1 visualizes the results aggregated over the three years: the strong correlation between biology and chemistry, as well as the moderate correlations between the other test parts except the SJT, are obvious. The panels in the SJT row, however, show that the relative SJT scores are not even approximately symmetrically distributed around a value of about 0.5; rather, most observations cluster in the high range above a relative score of 0.6 and apparently do not depend on the relative scores of the other test parts. This behavior of the relative SJT scores nicely reflects the fact that the SJT is the test part with the least difficulty.

Figure 1

Aggregated admission test results for three years. Pairwise bivariate scatter plots of the relative scores of the various test parts; r, linear correlation coefficient.

Perceptions of the admission examination

Figure 2 indicates that the SJT part was judged to present the least difficulty, while the knowledge test part was deemed the most difficult. Regarding the importance of the test parts, the differences were remarkably small; however, the SJT was invariably regarded as the most important, both with respect to the admission procedure and with respect to the candidates’ future professional life. A similar rating by applicants was described by Lievens & Sackett (2006), among others: a written SJT as well as a video-based SJT were attested far greater face validity than the other parts of the admission exam [33].

Figure 2

Results of the evaluation of the admission procedure by the applicants. Responses were given on six-point Likert scales.

Conclusions

The inclusion of the SJT in an admission procedure for medical studies that was previously based almost exclusively on scientific knowledge was demonstrated to be organizationally feasible in the manner presented. Moreover, the subjective responses of the applicants were quite positive, probably because of the perceived relevance of the SJT for their future studies as well as their profession. The weak correlations between the other test parts and the SJT indicate that the spectrum of competencies tested was indeed broadened by the inclusion of the SJT; this broadening seemed highly desirable in view of the previously overwhelming contribution of natural science knowledge to the admission test.