Scoring method of a Situational Judgment Test: influence on internal consistency reliability, adverse impact and correlation with personality?

Abstract
Situational Judgment Tests (SJTs) are increasingly used for medical school selection. Scoring an SJT is more complicated than scoring a knowledge test, because there are no objectively correct answers. The scoring method of an SJT may influence the construct and concurrent validity and the adverse impact with respect to non-traditional students. Previous research has compared only a small number of scoring methods and has not studied the effect of scoring method on internal consistency reliability. This study compared 28 different scoring methods for a rating SJT on internal consistency reliability, adverse impact and correlation with personality. The scoring methods varied on four aspects: the way of controlling for systematic error, and the type of reference group, distance and central tendency statistic. All scoring methods were applied to a previously validated integrity-based SJT, administered to 931 medical school applicants. Internal consistency reliability varied between .33 and .73, which is likely explained by the dependence of coefficient alpha on the total score variance. All scoring methods led to significantly higher scores for the ethnic majority than for the non-Western minorities, with effect sizes ranging from 0.48 to 0.66. Eighteen scoring methods showed a significant small positive correlation with agreeableness. Four scoring methods showed a significant small positive correlation with conscientiousness. The way of controlling for systematic error was the most influential scoring method aspect. These results suggest that the increased use of SJTs for selection into medical school must be accompanied by a thorough examination of the scoring method to be used.

Introduction
Background
Selection into medical school has been dominated by cognitive-based measures, which are predictive of academic performance but less predictive of clinical performance (Ferguson et al. 2002; Salvatori 2001). Adding non-cognitive-based measures to cognitive-based measures may improve the predictive quality of a selection procedure (Kulatunga-Moruzi and Norman 2002; Lucieer et al. 2015; Powis 2015). Non-cognitive-based selection instruments with good validity and reliability are essential for this purpose, because selection into medical school is highly competitive, with the number of applicants greatly exceeding the number of available places.
An emerging non-cognitive-based measure for selection into medical school is the Situational Judgment Test (SJT). An SJT presents applicants with several situations that they may encounter on the job (or at medical school), each followed by a number of possible responses to that situation. Respondents are instructed to judge the appropriateness of these responses by stating what they would or should do in the described situation (Motowidlo et al. 1990). Administering SJTs in work-related selection procedures has several beneficial characteristics: (1) good predictive validity with regard to job performance, (2) incremental validity over and above cognitive ability and personality (Clevenger et al. 2001), (3) less adverse impact than cognitive measures (McDaniel and Nguyen 2001), (4) higher favorability ratings by candidates than cognitive tests (Lievens 2013) and (5) more efficient administration to large groups of applicants than other non-cognitive-based instruments (e.g., assessment centers) (Motowidlo et al. 1990).
Previous studies on the use of SJTs for selection into medical school have shown that these beneficial characteristics of SJTs also apply in a medical school context (Koczwara et al. 2012; Lievens 2013; Lievens et al. 2005; Lievens and Sackett 2012; Patterson et al. 2009, 2011, 2015). Despite the good qualities mentioned above, some aspects of SJTs require more research. One of these aspects is the scoring method (Whetzel and McDaniel 2009). Scoring an SJT is more complicated than scoring a traditional knowledge test because there are no objectively correct answers, since SJTs consist of dilemmas with no clear-cut solutions (Bergman et al. 2006). Different researchers and practitioners have used different methods to convert the judgments on an SJT to a score, which has led to a large variety of scoring methods. This study will investigate the effect of these various scoring methods on three psychometric qualities (i.e., internal consistency reliability, adverse impact and correlation with personality). For this purpose, we used a previously validated integrity-based SJT (Husbands et al. 2015) for the selection of medical school applicants at a Dutch medical school.
Choice of scoring method depends on the type of scoring key and response format of an SJT. This study will focus on scoring methods for SJTs that use a rational scoring key and a Likert scale response format. A rational scoring key uses the judgments of a reference group of Subject Matter Experts (SMEs) to determine the "correct" answer. SMEs are individuals highly experienced in the relevant domain (Bergman et al. 2006). The Likert scale response format instructs the respondents to rate the appropriateness of each response option on a rating scale.

Scoring methods
The scoring methods in this study differ on four aspects: the way of controlling for systematic error, the type of reference group, the type of distance and the type of central tendency statistic.
Aspect 1: controlling for systematic error
SJTs with a rational scoring key and a Likert scale response format can be scored using raw, standardized, and dichotomous consensus (McDaniel et al. 2011). Raw consensus computes the distance between the applicant's rating and the mean rating of the reference group using the raw data. Standardized consensus calculates the distance after conducting a within-person z standardization such that each applicant has a mean of zero and a standard deviation of one across the SJT items. Dichotomous consensus divides the Likert scale in the middle. Points are awarded when an applicant's position on the Likert scale is on the same side as the reference group. Some dichotomous scoring methods increase the scoring range by applying a negative correction, subtracting points when applicants are on the other side of the Likert scale.
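The three consensus approaches can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names are hypothetical, the absolute distance is used throughout, the population standard deviation is used for the within-person standardization, the reference group's ratings are assumed to be standardized in the same way as the applicant's, and an item whose reference mean falls exactly on the midpoint is assumed to score nothing. Distance scores are left unreversed, so a lower value means closer agreement with the reference group.

```python
from statistics import mean, pstdev

MIDPOINT = 2.5  # middle of a four-point Likert scale


def item_means(reference_ratings):
    """Mean rating per item across the reference group (rows = raters)."""
    return [mean(col) for col in zip(*reference_ratings)]


def z_standardize(ratings):
    """Within-person z-scores: mean 0 and SD 1 across one person's items."""
    m, s = mean(ratings), pstdev(ratings)  # population SD is an assumption
    if s == 0:
        raise ValueError("identical ratings on every item cannot be standardized")
    return [(r - m) / s for r in ratings]


def raw_consensus(applicant, reference_ratings):
    """Summed absolute distance to the reference means (lower = closer)."""
    key = item_means(reference_ratings)
    return sum(abs(r - k) for r, k in zip(applicant, key))


def standardized_consensus(applicant, reference_ratings):
    """Same distance, computed after standardizing every rater within person."""
    z_reference = [z_standardize(r) for r in reference_ratings]
    return raw_consensus(z_standardize(applicant), z_reference)


def dichotomous_consensus(applicant, reference_ratings, penalty=0):
    """One point per item on the reference group's side of the midpoint;
    `penalty` points are subtracted for items on the opposite side."""
    score = 0
    for r, k in zip(applicant, item_means(reference_ratings)):
        if k == MIDPOINT:
            continue  # assumption: an undecided reference group scores nothing
        score += 1 if (r > MIDPOINT) == (k > MIDPOINT) else -penalty
    return score


# Two hypothetical SMEs rating four response options
smes = [[1, 4, 1, 4], [2, 4, 2, 3]]
cautious = [1, 2, 1, 2]  # same rank pattern, compressed use of the scale
extreme = [1, 4, 1, 4]   # same rank pattern, full use of the scale
```

In this toy example the two applicants differ under raw consensus but receive the same standardized consensus score, because within-person standardization removes the difference in how widely each of them uses the scale.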
By standardizing or dichotomizing the data, McDaniel et al. (2011) attempted to control for systematic error. Systematic error in an SJT score may be caused by response tendencies or by coaching in strategies on how to use the Likert scale, for example choosing only the extremes or only the middle of the scale (McDaniel et al. 2011). Moreover, response tendencies are influenced by ethnic differences. For example, Black and Hispanic Americans are more inclined to use the extremes of a Likert scale than White Americans (Bachman and O'Malley 1984; Hui and Triandis 1989). By standardizing or dichotomizing the data, these cultural differences in the use of a Likert scale no longer influence the SJT score. Raw consensus does not control for systematic error. McDaniel et al. (2011) examined the effect of these three scoring methods on the concurrent validity in two studies, using scores on a biodata scale measuring quitting tendencies and supervisory ratings of job performance as criteria. Higher concurrent validity was found for the standardized consensus and dichotomous consensus scales than for the raw consensus scale, which they explained by the removal of systematic error from the SJT score. In addition, the standardized and dichotomous consensus scales resulted in substantially smaller differences between White and Black respondents than the raw consensus scale, which they attributed to the removal of ethnic differences in the use of a Likert scale. Similarly, Legree et al. (2010) found a higher concurrent validity for a standardized scale than for a raw scale.
Next to using raw, standardized and dichotomous consensus, a score on an SJT with a rational scoring key and Likert scale response format can also be calculated using percent agreement (Legree et al. 2005). Percent agreement uses the endorsement ratios among the SMEs to determine the score corresponding to each rating. Percent agreement, like raw consensus, does not control for systematic error.
An example of a scoring method using percent agreement assigns two points to the Likert scale point endorsed by 50 % or more of the SMEs and one point to the scale point endorsed by 25-50 % of the SMEs (Chan and Schmitt 1997). Another example assigns a score to each Likert scale point depending on the proportion of the reference group that endorsed that rating point (Lievens et al. 2015).
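Both percent agreement variants can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the published procedures: the function names and the example ratings are hypothetical, and the lower bound of the 25-50 % band is assumed to be inclusive.

```python
from collections import Counter

SCALE = (1, 2, 3, 4)  # four-point Likert scale


def endorsement_proportions(reference_item_ratings):
    """Share of the reference group endorsing each scale point for one item."""
    counts = Counter(reference_item_ratings)
    n = len(reference_item_ratings)
    return {point: counts[point] / n for point in SCALE}


def threshold_points(proportions):
    """Two points for a scale point endorsed by >= 50 %, one point for
    25-50 % (lower bound assumed inclusive), otherwise zero."""
    return {
        point: 2 if share >= 0.50 else 1 if share >= 0.25 else 0
        for point, share in proportions.items()
    }


# Hypothetical ratings of one response option by 16 SMEs
sme_ratings = [4] * 8 + [3] * 5 + [2] * 3
proportions = endorsement_proportions(sme_ratings)
points = threshold_points(proportions)

# An applicant who rates this option 3 earns points[3] under the threshold
# rule and proportions[3] under the proportion rule.
```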

Aspect 2: reference group
A second aspect on which scoring methods may differ is the reference group. As stated above, a rational scoring key uses the judgments of a group of SMEs to determine the "correct" answer on an SJT. Most SJT scoring methods use SMEs because it is expected that they have knowledge about what behavior is effective and ineffective in their field (Motowidlo and Beier 2010). However, a number of SJT studies have used the group of respondents itself as a reference, a procedure called Consensus Based Measurement (CBM). Legree et al. (2005) argued that this procedure may be more appropriate for constructs for which no clear SMEs can be identified. A study on an SJT used for the US Air Force found that the mean ratings of the SMEs strongly correlated with the mean ratings of the group of respondents (Legree 1995; Legree and Grafton 1995). Similar results were found for an SJT measuring Tacit Knowledge of Military Leadership comparing lieutenants (i.e., SMEs) with cadets (Hedlund et al. 2003). Comparison of two SJT scoring keys based on either novices' or experts' mean effectiveness ratings found a correlation of .75 between the two keys (Motowidlo and Beier 2010). In addition, both scoring keys resulted in scores that had similar criterion-related validity coefficients. These results were explained by novices' possession of a different, more general type of knowledge outside the specific job context. Furthermore, Lineberry et al. (2014) stated that for script concordance tests used for assessing clinical reasoning skills, having experience does not indicate that someone is an infallible expert and that residents (i.e., novices) can outperform most panelists (i.e., SMEs). We are not aware of any previous research on the effect of using a less experienced reference group in a medical selection context.

Aspect 3: distance
A third aspect on which scoring methods may differ is the type of distance that is calculated between an applicant's rating and the overall rating of the reference group (SMEs or respondents). Some SJT studies have used the squared distance (McDaniel et al. 2011), whereas others have used the absolute distance (Legree 1995). Squaring the distance gives more weight to ratings that deviate more from the reference group (Legree et al. 2005).
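The difference between the two distance types can be shown with a small Python sketch. The function names, the key and the applicants' ratings are made up for illustration: both applicants have the same total absolute distance from the key, but the squared distance penalizes the one large deviation more heavily than two small ones.

```python
def absolute_distance(applicant, key):
    """Sum of absolute deviations from the reference key."""
    return sum(abs(r - k) for r, k in zip(applicant, key))


def squared_distance(applicant, key):
    """Sum of squared deviations from the reference key."""
    return sum((r - k) ** 2 for r, k in zip(applicant, key))


key = [3, 3, 3]          # hypothetical reference-group ratings
two_small = [2, 4, 3]    # off by one point on two items
one_large = [1, 3, 3]    # off by two points on one item

abs_scores = (absolute_distance(two_small, key), absolute_distance(one_large, key))
sq_scores = (squared_distance(two_small, key), squared_distance(one_large, key))
```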

Aspect 4: central tendency statistic
A fourth aspect on which SJT scoring methods may differ is how the judgments of the reference group are summarized (i.e., the central tendency statistic). Most SJT scoring methods have used the mean as a central tendency statistic, whereas some studies have used the mode (De Meijer et al. 2010; Lievens et al. 2015). Scoring methods using the mode assign points to the Likert scale point that most of the people in the reference group endorse. Besides the mean and mode, another widely used central tendency statistic is the median, which reflects the number at the central point when the data are ranked in numerical order (McCluskey and Lalkhen 2007). To our knowledge, the median has so far never been used for scoring SJTs. For the sake of completeness, this study will include all three central tendency statistics.
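As a brief illustration, the three statistics can yield three different keys for the same set of reference-group ratings. The ratings below are hypothetical, and returning every tied mode via `multimode` is an assumption about tie handling, which the text does not specify.

```python
from statistics import mean, median, multimode

# Hypothetical ratings of one response option by eight reference-group members
reference_item_ratings = [1, 3, 3, 3, 4, 4, 4, 4]

keys = {
    "mean": mean(reference_item_ratings),
    "median": median(reference_item_ratings),
    # multimode returns every most-frequent value, so ties stay visible
    "mode": multimode(reference_item_ratings),
}
```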

Present study
The first goal of this study was to investigate the effect of scoring method on the internal consistency reliability of an SJT score. The appropriateness of internal consistency as a reliability estimate for SJT scores is often called into question (Catano et al. 2012). Internal consistency reliability estimates, such as coefficient alpha, are based on the assumption that all items measure the same latent trait on the same scale, i.e., that the same latent trait equally contributes to all item scores (Yang and Green 2011). The multidimensional nature of SJTs violates this strict assumption resulting in an inaccurate estimate of reliability (Graham 2006). However, the integrity-based SJT used in this study was designed to measure one dimension, which might lead to a less serious violation of the assumption of unidimensionality. This is supported by a meta-analysis of Campion et al. (2014) that reported a mean alpha of .57 across 129 coefficients (range 0-.92). In addition, it was shown that coefficient alpha was significantly higher for SJTs that had a larger focus on one dimension. The focus of the current integrity-based SJT on one dimension may support the use of internal consistency reliability. So, given the anticipated unidimensionality of the SJT used in this study and because coefficient alpha is still commonly reported in the SJT literature, we chose it as a measure of comparison between scoring methods. To the best of our knowledge, this will be the first study to investigate the effect of different scoring methods on the internal consistency reliability.
The second goal of this study was to examine the effect of scoring method on adverse impact, by analyzing the differences between Dutch and non-Western minority applicants. Adverse impact will be examined because SJTs may play an important role in promoting fairness in medical school selection, since SJT scores potentially demonstrate lower ethnic subgroup differences than cognitive ability test scores. On cognitive ability tests, White test takers have been shown to score approximately one standard deviation higher than non-White test takers (De Soete et al. 2013). A meta-analysis on ethnic subgroup differences across 32 SJTs, mainly originating from the US, showed that White test takers score approximately 0.38 standard deviation higher than Black test takers, 0.24 standard deviation higher than Hispanic test takers and 0.29 standard deviation higher than Asian test takers (Whetzel et al. 2008). A Dutch study also found that the ethnic subgroup difference in an integrity SJT score (d = 0.38) was lower than in a cognitive ability test score (d = 0.48) (De Meijer et al. 2010). Selection on only cognitive ability test scores might lead to the rejection of more ethnic minority applicants than ethnic majority applicants, whereas selection on SJT scores may increase the admission rate among ethnic minorities, resulting in a more culturally diverse medical student population. To promote the expected positive influence of an SJT on fairness, it is crucial to investigate the potential influence of scoring method on adverse impact. In line with the findings of McDaniel et al. (2011), we expect that scoring methods controlling for systematic error (i.e., standardized and dichotomous consensus) will lead to smaller ethnic differences than scoring methods that do not (i.e., raw consensus and percent agreement). The other scoring method aspects (i.e., type of reference group, distance and central tendency statistic) have not been studied in combination with adverse impact before.
The third goal of this study was to investigate the effect of scoring method on the correlation between the SJT score and three of the Big Five personality traits. The Big Five describes someone's personality using five broad dimensions: neuroticism (i.e., emotional instability), extraversion (i.e., outgoing and energetic), openness to experience (i.e., intellectual curiosity), agreeableness (i.e., altruistic and compassionate) and conscientiousness (i.e., organized and persistent) (Costa and MacCrae 1992). The correlation with the Big Five was examined because three of the five dimensions (i.e., conscientiousness, emotional stability and agreeableness) have been shown to moderately and positively correlate with SJT scores (McDaniel et al. 2007) and integrity test scores (Marcus et al. 2007). Moreover, the validity and reliability of the scores on the Big Five measure used in this study [i.e., NEO-PI-R (Costa and MacCrae 1992)] have repeatedly been demonstrated (Costa and McCrae 2008), including in samples of adolescents (De Fruyt et al. 2000). It is therefore expected that the integrity-based SJT will be correlated with these three Big Five dimensions and that the resulting correlation coefficients will provide a good measure of comparison between the scoring methods. We hypothesize that scoring methods that control for systematic error will lead to higher correlation coefficients, because the influence of response tendencies regarding the use of Likert scales is removed from the SJT score (Legree et al. 2010; McDaniel et al. 2011). We are unaware of any previous studies that have investigated the effect of type of reference group, distance and central tendency statistic on the correlation of an SJT score with personality.

Procedure
The SJT was administered during the selection procedure for the Erasmus MC Medical School in 2014 and 2015 (N = 1025). The Erasmus MC Medical School selects students on their participation in extracurricular activities, their performance on five cognitive tests during three on-site testing days (Urlings-Strop et al. 2009) and their pre-university Grade Point Average (GPA). The SJT was administered on paper during the on-site testing days. An additional questionnaire was administered regarding applicants' demographic characteristics. A personality questionnaire was administered online when applicants registered for the selection procedure. The applicants were informed that the SJT and questionnaires were administered solely for research purposes and that their answers would not influence the outcome of the selection procedure. Participation was voluntary.

Integrity-based Situational Judgment Test
The integrity-based SJT used in this study was developed in the United Kingdom (UK) (Husbands et al. 2015). The authors translated this SJT to Dutch. This translation was validated using the back translation procedure described by Brislin (1970). The back translation was conducted by an independent commercial translation office. The authors discussed and made appropriate changes to the translated version.
The SJT consisted of ten scenarios describing problematic situations that could occur during medical school. Each scenario was followed by five response options. The respondents had to judge the appropriateness of each response option on a four-point Likert scale (1 = Very inappropriate to 4 = Very appropriate) in terms of what should be done given the situation [i.e., knowledge-based instructions (Ployhart and Ehrhart 2003)]. An example of an SJT item is presented in Appendix 1.
A rational scoring key for this SJT was developed based on the judgments of 16 SMEs (75 % female). The mean age of this group was 40.8 years (SD = 11.1). The SMEs were individuals involved in teaching professionalism in the medical curriculum. Two of the SMEs were medical doctors. The mean number of years of experience with professionalism in the medical curriculum of this group was 6.4 (SD = 5.9). All SMEs were native Dutch. The intraclass correlation coefficient (ICC) among the SMEs was .65, indicating a moderate agreement (two-way mixed model, absolute agreement).

Demographics
An applicant was considered a non-Western minority when one of his/her parents was born outside Europe or North America (Statistics Netherlands; www.cbs.nl).
The socio-economic status of an applicant was determined by the level of education of his/her parents. A division was made between first-generation and non-first-generation university students. First-generation university students were defined as students whose parents did not attend university (either a research university or a university of applied science).

Personality questionnaire
In 2014, the Dutch version of the NEO-PI-R was administered to assess the applicants' standing on the Big Five personality traits (Costa and MacCrae 1992; Hoekstra et al. 1996). The questionnaire consisted of 240 statements that applicants had to judge on a five-point Likert scale (1 = Strongly disagree to 5 = Strongly agree). The five personality subscales demonstrated good internal consistency reliabilities (coefficient alpha): .92 for neuroticism, .87 for extraversion, .85 for openness, .87 for agreeableness and .88 for conscientiousness. Due to the length of the questionnaire, the NEO-PI-R was not administered in 2015.

Scoring methods
In preparation for this study we combined the four aspects on which scoring methods can differ; this yielded 28 scoring methods to be tested (Fig. 1). These scoring methods followed the categorization into raw, standardized and dichotomous consensus scoring methods as proposed by McDaniel et al. (2011).
Within each of the raw and standardized scoring methods, the distance (absolute or squared) was calculated between the applicant's rating and the overall rating of the reference group on the Likert scale. The reference group was either made up of the 16 SMEs or of the group of respondents itself. The overall rating of this reference group was reflected by either the mean, median or mode.
In addition to the raw and standardized consensus scoring methods, the dichotomous consensus scoring method was applied. The reference group consisted of either the SMEs or the group of respondents itself. Another variation was applied by either assigning zero points to or subtracting one point from applicants whose rating was located on the opposite side of the Likert scale than the reference group.
The 24 scoring methods based on either raw, standardized or dichotomous consensus were complemented with four scoring methods based on percent agreement (Legree et al. 2005). These scoring methods used either the 25-50 % endorsement rule used by Chan and Schmitt (1997) or assigned a score to each Likert scale point corresponding to the proportion of subjects in the reference group who endorsed that point (Lievens et al. 2015). The reference group consisted of either the SMEs or the respondents.
The correlations between the 28 scoring methods are presented in Appendix 2. Although some correlation coefficients indicated a large overlap between the scoring methods (i.e., within the raw consensus scoring method set), other scoring methods showed less overlap (i.e., between the raw and dichotomous scoring method sets).
To our knowledge, no results have been published for half of these scoring methods as applied to an SJT (i.e., scoring methods using the median, scoring methods calculating the distance from the group mode, dichotomous scoring methods using the SMEs, percent agreement scoring methods using the endorsement rate of the group and the proportions of the SMEs).

Statistical analysis
Both SPSS (IBM SPSS Statistics for Windows, Version 21.0. Armonk, NY: IBM Corp.) and R (Version 3.1.0) were used to convert the judgments on the SJT to a score, using the different scoring methods. The raw and standardized consensus scoring methods that used the group of respondents itself as a reference were conducted using a leave-one-out method (Hastie et al. 2009). This method removes the applicant whose score needs to be calculated from the dataset, and calculates the summary statistic across the remaining group members. The distance between the applicant and the remaining group members composes the applicant's score.
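The leave-one-out procedure can be sketched as follows. This is an illustrative Python version, not the SPSS or R code used in the study: the function name and the toy data are hypothetical, the absolute distance is used, and the summary statistic defaults to the mean but can be swapped for the median.

```python
from statistics import mean


def leave_one_out_scores(ratings_by_applicant, statistic=mean):
    """For each applicant, summarize the other applicants' ratings per item
    and return the summed absolute distance to that summary."""
    scores = []
    for i, own in enumerate(ratings_by_applicant):
        # Everyone except applicant i forms the reference group
        others = ratings_by_applicant[:i] + ratings_by_applicant[i + 1:]
        key = [statistic(item) for item in zip(*others)]
        scores.append(sum(abs(r - k) for r, k in zip(own, key)))
    return scores


# Three hypothetical applicants rating two response options
ratings = [[1, 4], [2, 4], [3, 2]]
scores = leave_one_out_scores(ratings)
```

Excluding the applicant's own rating prevents each applicant from pulling the key toward his or her own answer, which matters most when the reference group is small.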
Coefficient alpha was used as an estimate of internal consistency reliability (Cronbach 1951). Independent t-tests were used to examine the 28 different SJT scores on disparities between first-generation and non-first-generation university applicants and between Dutch and non-Western minority applicants. The effect sizes of the social and ethnic disparities were reflected by Cohen's d (Cohen 1988). A stricter alpha level (α = .001) was used because of the large number of comparisons. For each scoring method, Pearson correlations were used to determine the correlation between the SJT score and the three Big Five personality traits for which we expected a correlation.
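For reference, coefficient alpha can be computed from a respondents-by-items score matrix as in the following minimal Python sketch. The function name and the toy data are hypothetical, and sample variances (n - 1 in the denominator) are assumed.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Cronbach's alpha for a matrix with rows = respondents, columns = items."""
    k = len(item_scores[0])
    item_variances = [variance(col) for col in zip(*item_scores)]
    total_variance = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy data: three respondents, two items
consistent = [[1, 1], [2, 2], [3, 3]]  # items rank respondents identically
noisy = [[1, 2], [2, 1], [3, 3]]       # items partly disagree

alpha_consistent = coefficient_alpha(consistent)
alpha_noisy = coefficient_alpha(noisy)
```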
General linear models were used to examine which scoring method aspects significantly influenced the outcome measures (i.e., coefficient alpha, effect size and correlation coefficient). For each outcome measure, four general linear models were tested, namely one model for each scoring method aspect. The four aspects were tested in separate models because the small number of data points (i.e., 28) did not allow entering all four aspects in one model. The effect sizes were corrected for the reliability of the scoring method by dividing Cohen's d by coefficient alpha, since low reliability may obscure subgroup differences (Lievens et al. 2008).
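The effect size and its correction can be sketched in Python as follows. The pooled-standard-deviation form of Cohen's d is an assumption, since the text does not state which variant was used. The function names and the numbers are illustrative only, and the correction divides d by coefficient alpha as described above.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


def corrected_effect_size(d, alpha):
    """Effect size corrected for reliability by dividing d by coefficient alpha."""
    return d / alpha


# Made-up scores for two groups and a made-up reliability estimate
d = cohens_d([4, 5, 6], [2, 3, 4])
d_corrected = corrected_effect_size(d, alpha=0.5)
```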

Participants
Nine hundred thirty-one medical school applicants responded (response rate = 90.8 %). The demographic characteristics of this sample are depicted in Table 1. The two cohorts (2014 and 2015) were similar with regard to gender, age and ethnicity. Cohort 2015 consisted of significantly more first-generation students than cohort 2014, but the size of this effect was small [χ²(1) = 6.02, p = .014, φ = .08]. Personality data were obtained from 73.3 % of the participants from cohort 2014. SJT scores did not significantly differ between respondents and non-respondents to the personality questionnaire.

Internal consistency reliability
Coefficient alpha varied from .33 to .73 depending on the scoring method (Table 2). The lowest coefficient alpha was found for the scoring method that calculated the absolute distance from the mean of the group of respondents itself using standardized consensus. The highest coefficient alpha was found for the scoring method that calculated the absolute distance from the mean of the group of respondents itself using raw consensus.
For the general linear models with coefficient alpha as dependent variable, the way of controlling for systematic error was the only significant factor with a very large effect size, F(3, 24) = 40.05, p < .001, η² = .83. Raw consensus led to a significantly higher coefficient alpha than the other three methods of controlling for systematic error. In addition, standardized consensus and percent agreement yielded a significantly higher coefficient alpha than dichotomous consensus.

Adverse impact
All scoring methods led to significantly higher scores for the Dutch majority than for the non-Western minorities (Table 3). The effect sizes (d) of these differences ranged from 0.48 to 0.66 (medium effect). The largest differences were found for the scoring methods that calculated the absolute distance from the SME median using standardized consensus. The smallest ethnic difference was observed for all scoring methods that used dichotomous consensus.
For the general linear models with the corrected effect size as dependent variable, the way of controlling for systematic error was again the only significant factor with a very large effect size, F(3, 24) = 15.54, p < .001, η² = .66. Raw consensus led to smaller corrected effect sizes than standardized and dichotomous consensus, but not percent agreement. None of the scoring methods led to significant differences between first-generation university applicants and non-first-generation university applicants (data available upon request). Due to the lack of significant differences, no general linear models were tested.

Correlation with personality
Eighteen scoring methods resulted in an SJT score that had a significant but small positive correlation with agreeableness (Table 4). The largest correlation coefficients were found for scoring methods calculating the distance from the SME mean using standardized consensus. In addition, four scoring methods resulted in an SJT score that had a significant but small positive correlation with conscientiousness. The largest correlation coefficients were found for scoring methods calculating the absolute distance from the SME mean and median both using standardized consensus. Due to the low effect sizes and the small range of significant correlation coefficients, no general linear models were tested.

Discussion
This study shows that the psychometric quality of an SJT greatly depends on the choice of scoring method, specifically in the way the scoring method controls for systematic error. Firstly, the way of controlling for systematic error strongly affects the internal consistency reliability of an SJT score, with higher reliability estimates for scoring methods that use raw consensus. Secondly, the way of controlling for systematic error influences the adverse impact of the SJT score, with a lower adverse impact for scoring methods that use raw consensus compared to dichotomous and standardized consensus. Lastly, the different scoring methods had a minor influence on the correlation with agreeableness and conscientiousness, but the practical significance of these correlations was negligible.

Internal consistency reliability
Our first finding was that the way a scoring method controls for systematic error strongly influences the internal consistency reliability. This strengthens the concerns about the use of coefficient alpha as a reliability estimate for an SJT score. Changing only the scoring method could alter the acceptability of the resulting reliability estimate from poor to sufficient, even for an SJT that was specifically constructed to measure one dimension. This large variety in internal consistency reliability is likely explained by the dependence of coefficient alpha on the total score variance (Streiner 2003). Standardized and dichotomous consensus and percent agreement were associated with a reduction in total score variance, which is demonstrated by the lower standard deviations in Table 2. This reduction in total score variance will most likely lead to a lower coefficient alpha. This line of reasoning implies that coefficients alpha reported in previous studies on SJTs may be strongly influenced by irrelevant aspects, such as the total score variance generated by the scoring method used. Assuming that most studies on SJTs arbitrarily choose one scoring method rather than another, choice of scoring method contributes to the limited usefulness of coefficient alpha as a reliability estimate for SJTs. Future studies should investigate whether the large variation in coefficient alpha caused by different scoring methods also occurs in other reliability estimates (e.g., alternate forms reliability) to find out whether this large variation is an artifact of coefficient alpha only.
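The dependence of coefficient alpha on the total score variance can be read directly from its standard formula. For a test of k items with item variances \sigma_i^2, inter-item covariances \sigma_{ij} and total score variance \sigma_X^2 (symbols introduced here for illustration):

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right),
\qquad
\sigma_X^2 = \sum_{i=1}^{k}\sigma_i^2 + 2\sum_{i<j}\sigma_{ij}
```

The total score variance exceeds the sum of the item variances only through the inter-item covariances. A scoring method that shrinks these covariances relative to the item variances therefore lowers the total score variance and, with it, coefficient alpha.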
A more accurate reliability estimate might be obtained by a combination of a more thoroughly construct-based SJT development (Christian et al. 2010) and a reliability estimate that takes into account the imposed factor structure of the SJT, for example a structural equation modeling (SEM) reliability estimate (Yang and Green 2011) or stratified alpha (Catano et al. 2012). Future research is required on the application of construct-based development methods and alternative internal consistency estimates for SJTs.

Adverse impact
Although all scoring methods led to significant ethnic differences in SJT score, the way a scoring method controlled for systematic error influenced the size of these effects.
Specifically, the effect size decreased when using raw consensus instead of standardized or dichotomous consensus. This result is not in line with the findings of McDaniel et al. (2011), who found lower ethnic subgroup differences for scoring methods that controlled for systematic error (i.e., standardized and dichotomous consensus), which they explained by the removal of ethnicity-related response tendencies in the use of Likert scales. However, the uncorrected effect sizes do show some support for this line of reasoning, with the lowest effect sizes reported for the scoring methods using dichotomous consensus. The absence of lower effect sizes for standardized consensus might be caused by the low number of scale points (i.e., four) on the Likert scale that was used. Narrow Likert scales may not be as strongly affected by response tendencies as Likert scales with more scale points (Flaskerud 1988), resulting in no differences when controlling for the response tendencies. A study on script concordance tests recommended a reduction of the Likert scale from five to three points in order to decrease the influence of construct-irrelevant factors such as examinee response styles (Lineberry et al. 2013). Dichotomizing the Likert scale does seem to have some effect on adverse impact, but at the cost of low internal consistency reliability, leading to a similar issue as the diversity-validity dilemma (De Soete et al. 2013).
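For readers less familiar with subgroup effect sizes, the following sketch shows how a standardized mean difference between two groups can be computed. It assumes Cohen's d with a pooled standard deviation; this section does not restate which statistic was used, so the choice of d, the function name and the scores are illustrative assumptions only.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference (group_a minus group_b) with pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical SJT scores for two subgroups (not data from this study).
majority_scores = [5, 6, 7, 8, 9]
minority_scores = [4, 5, 6, 7, 8]

d = cohens_d(majority_scores, minority_scores)
print(round(d, 2))
```

Because d is expressed in pooled standard deviation units, a scoring method that changes the spread of the scores can change d even when the raw mean difference stays the same.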
Another noteworthy finding is that adverse impact was similar for both reference groups (SMEs and respondents). Previous studies which compared different reference groups found similar validity coefficients for the scores of both groups (Legree et al. 2005; Motowidlo and Beier 2010), but did not study the effect of the reference group on adverse impact. Most SJTs use SMEs as a reference group under the assumption that they have considerable experience in a relevant setting and therefore know what kind of behaviors are appropriate in the described situations. Our results suggest that the use of a reference group of inexperienced respondents (i.e., secondary school students) does not affect the adverse impact of an SJT.
A possible explanation for this comparable adverse impact is the better representativeness of the group of respondents with respect to ethnicity. All our SMEs in this study were native Dutch, while only 57 % of the applicants were native Dutch. Little is known about the cultural susceptibility of integrity. However, medical professionalism has been found to depend on cultural context (Chandratilake et al. 2012; Jha et al. 2015) and since integrity is an important aspect of medical professionalism, it too might depend on cultural context (Arnold and Stern 2006). A reference group that is more representative of the demographic characteristics of the applicant group may lead to a more accurate measurement of the targeted construct and may therefore result in equal or less adverse impact. Future research should investigate the effect of the demographic composition of the reference group on the psychometric quality of an SJT.
Another explanation for the equal adverse impact for both types of reference group might be that there were too few SMEs to be able to achieve proper consensus on the difficult dilemmas described in the scenarios. This was reflected by the non-perfect agreement in the SMEs' evaluation of the response options (ICC = .65). A group of 931 individuals might result in more meaningful consensus. This contention is supported by Legree et al. (2005), who stated that in light of equal validity coefficients, an examinee-based scoring standard gives more reliable values than an expert-based scoring standard, due to the larger number of examinees.

Correlation with personality
Our last finding was that 18 scoring methods showed a correlation with agreeableness and four scoring methods showed a correlation with conscientiousness, which was in line with previous research (Marcus et al. 2007; McDaniel et al. 2007). However, these correlations must be interpreted with caution, since all correlation coefficients represent small effects and it is likely that the large sample size has contributed to the statistical significance of these small effects. The larger number of significant correlations among scoring methods using standardized consensus is in line with the findings of McDaniel et al. (2011) and might be explained by the removal of systematic error from the SJT score. However, the small effect size of these correlations between the integrity-based SJT score and the three Big Five personality traits precludes any conclusive statements about the effect of scoring method on the correlation with personality.
The small number of significant correlations between the SJT score and the Big Five personality traits is in consonance with a previously reported non-association between the Big Five personality traits and the score on a multiple mini interview (MMI), another widely used selection instrument for medical school (Kulasegaram et al. 2010). This non-association might be explained by the fact that personality tests assess non-cognitive traits, whereas MMIs and SJTs assess non-cognitive behaviors. Non-cognitive behaviors are more dependent on situational factors than personality traits (Eva 2005). This is in line with a previous study which demonstrated that a contextualized personality measure had higher criterion validity for academic performance and counterproductive academic behavior than a generic personality measure (Holtrop et al. 2014). The lack of contextualization of the NEO-PI-R limits the usefulness of personality tests in medical school selection and may be an explanation for the absence of any meaningful correlations between the SJT score and personality.

Scoring method aspects revisited
Four scoring method aspects were examined. Differences in internal consistency reliability and adverse impact were found for only one aspect: the way of controlling for systematic error, with raw consensus leading to scores with the highest coefficient alpha and the smallest ethnic subgroup differences. As mentioned above, these differences might be explained by the effect of this scoring method aspect on the total score variance and the negligible effect of response tendencies due to the narrow Likert scale used in this study. No differences were found for the other three aspects (i.e., reference group, distance and central tendency statistic).
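The intuition behind controlling for systematic error can be sketched as follows. The code assumes that standardized consensus rescales each respondent's ratings to within-person z-scores and that dichotomous consensus collapses the four-point scale into two categories; the cut point, the function names and the ratings are hypothetical, and the exact procedures used in this study may differ.

```python
from statistics import mean, pstdev

def standardize_within_person(ratings):
    """Rescale one respondent's ratings to z-scores across that respondent's items."""
    m, s = mean(ratings), pstdev(ratings)
    if s == 0:                       # respondent used a single scale point
        return [0.0 for _ in ratings]
    return [(r - m) / s for r in ratings]

def dichotomize(ratings, cut=2):
    """Collapse a four-point scale: ratings <= cut become 0, higher ratings become 1."""
    return [0 if r <= cut else 1 for r in ratings]

# Two hypothetical respondents with the same rating pattern, but the second
# systematically uses higher scale points (an elevation response tendency).
respondent_a = [1, 2, 3, 2]
respondent_b = [2, 3, 4, 3]

z_a = standardize_within_person(respondent_a)
z_b = standardize_within_person(respondent_b)
dich_a = dichotomize(respondent_a)
dich_b = dichotomize(respondent_b)
print(z_a, z_b, dich_a, dich_b)
```

In this sketch, within-person standardization removes the difference in elevation between the two respondents entirely, whereas dichotomizing removes it only where both ratings fall on the same side of the cut point. Both transformations also compress the range of values that enter the score, which is consistent with the reduction in total score variance discussed above.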
As stated before, the absence of differences for reference group might be caused by the larger size and better representativeness of the group of respondents itself, which might remove the benefits of using a highly experienced but small group of SMEs. Another potential reason is that integrity-related issues in the beginning stage of medical school do not require specific knowledge but more general knowledge which can be possessed by both reference groups, which is reflected by a correlation of .90 between the group of SMEs and group of respondents itself in their average rating.
The absence of differences for the scoring method aspect of distance (absolute vs. squared) may be explained by the low number of scale points on the Likert scale (i.e., four), which means that the maximum distance between an applicant's rating and the overall rating can never exceed three. This may not be sufficient to get a significant difference in the outcome measure when squaring the distance between both ratings. Future research should examine the scoring method aspect of distance for SJTs using Likert scales with more scale points.
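The limited room for absolute and squared distances to diverge on a four-point scale can be made concrete with a short sketch. The function name, the consensus key and the applicant ratings are hypothetical, and summing the distances over items is an assumed convention (lower totals meaning closer agreement with the reference group).

```python
def distance_score(ratings, consensus, squared=False):
    """Sum of absolute (or squared) distances between ratings and the consensus key."""
    total = 0
    for rating, key in zip(ratings, consensus):
        d = abs(rating - key)
        total += d * d if squared else d
    return total

# Hypothetical consensus key and applicant ratings on a scale from 1 to 4.
consensus = [1, 4, 2, 3]
applicant = [2, 4, 4, 3]

abs_score = distance_score(applicant, consensus)               # 1 + 0 + 2 + 0
sq_score = distance_score(applicant, consensus, squared=True)  # 1 + 0 + 4 + 0

# On a four-point scale a single item contributes at most 3 (absolute) or 9 (squared).
possible_abs = [0, 1, 2, 3]
possible_sq = [d * d for d in possible_abs]
print(abs_score, sq_score, possible_sq)
```

Distances of 0 and 1 are unchanged by squaring, so the two variants differ only for ratings that are two or three scale points away from the key.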
Lastly, the similar results for the three different central tendency statistics may be explained by the distribution of the ratings across the Likert scale. Data with a symmetric distribution are best summarized using the mean. Since the mean is strongly influenced by extreme scores (Field 2013), asymmetrically distributed data are better summarized using the median or mode. A four-point Likert scale precludes extreme scores leading to similar values for the mean, median and mode and likely causes the comparable results for this scoring method aspect.
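As a small illustration of this point, the three central tendency statistics can be computed for the ratings that a reference group gives to one response option. The ratings are hypothetical and are not taken from this study.

```python
from statistics import mean, median, mode

# Hypothetical reference-group ratings of one response option (scale 1 to 4).
reference_ratings = [3, 3, 4, 3, 2, 3, 4, 3]

key_mean = mean(reference_ratings)
key_median = median(reference_ratings)
key_mode = mode(reference_ratings)
print(key_mean, key_median, key_mode)
```

Because every rating is bounded between 1 and 4, no single rating can pull the mean far away from the median or the mode, so the resulting scoring keys tend to be close to one another.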

Practical implications
The most important practical implication of this study is that it creates awareness about the importance of carefully considering the immense number of possibilities for converting the judgments on an SJT to a score. Instead of arbitrarily choosing one of the many existing methods, researchers and practitioners should accompany the development of an SJT with a thorough examination of the scoring method to be used. In addition, this study demonstrated that the results when using the group of respondents itself are similar to those obtained when using a group of SMEs as reference. Using the group of respondents has practical and economic advantages, since the collection of data from SMEs can be difficult.
Unfortunately, this study does not allow any conclusive statements about which scoring method is best, because the findings are highly dependent on this particular SJT measuring this particular construct in this particular setting. Firstly, this study was conducted in the Netherlands, where medical school applicants are relatively young (17-18 years). The use of more mature applicants may lead to different results for scoring methods that use the group of respondents itself as a reference. Secondly, the cultural context may influence the way the reference group judges integrity-related dilemmas (Chandratilake et al. 2012; Jha et al. 2015). Finally, SJTs measuring other constructs than integrity might be differentially influenced by changing the scoring method. Future research should replicate this study with other SJTs measuring different constructs in other settings to investigate the generalizability of these findings and to provide clarity on which scoring method is best for which situation.

Strengths and limitations
To our knowledge, this is the first study to compare such a large number of scoring methods, varying not only the way of controlling for systematic error and the type of reference group, but also the type of distance and central tendency statistic. In addition to the large number of scoring methods examined, this study also contributes to previous research by examining the effect of scoring method on internal consistency reliability. Embedding the administration of the SJT into the selection procedure led to a very high response rate, ensuring that our results were not influenced by a volunteer bias. The credibility of our results is further supported by a relatively small restriction of range. Unlike many other selection procedures, the current selection procedure was not preceded by a pre-selection on cognitive competencies.
Although this study compared a large number of scoring methods, we do not claim that this list is exhaustive. Examples of other approaches for scoring SJTs are the squared Mahalanobis distance (Barbot et al. 2012) and the use of paired comparisons (Gold and Holodynski 2015). It seems that the possibilities are endless and future studies should investigate these other scoring methods. For practical reasons, the number of scoring methods in this study was limited to 28.

Conclusion
In conclusion, although the SJT scoring method is often chosen arbitrarily, this study shows that changing the scoring method strongly influences the internal consistency reliability and adverse impact of an SJT score. The most influential characteristic of a scoring method is the way of controlling for systematic error. Given the increasing use of SJTs for selection into medical school, it is crucial to thoroughly examine which scoring method is best to use.

All correlations are significant. The numbers in the table correspond to the scoring methods in Tables 2, 3 and 4.