Introduction

Medical schools and other health professions education (HPE) schools are responsible for selecting qualified students, as well as generating student populations that reflect the diverse society they will serve in the future (General Medical Council, 2015). A diverse healthcare workforce, aside from issues of equity and fairness, is important to improve the cultural competency of healthcare providers, and to increase and equalize access to high-quality healthcare for different population groups (Cohen et al., 2002; Morgan et al., 2016). However, student diversity can be affected by the use of selection procedures for undergraduate HPE programs, as selection chances are unequally distributed across subgroups of applicants (Fielding et al., 2018; Mathers et al., 2016). Not only can selection procedures include a great variety of tools, either defined at a national level or by an individual school, but the same tool can also be implemented in different ways. This raises the question of whether different tools have differential effects on student diversity (Patterson et al., 2016), and whether some tools are more context-independent than others concerning their impact on student diversity. In the present multi-site study, we examined the selection chances of applicant subgroups and their performance on different selection tools in multiple contexts.

So far, the literature has shown that selection procedures mainly negatively affect the selection chances of applicants with lower socio-economic status (SES) and applicants from ethnic minorities (Fielding et al., 2018; Mathers et al., 2016; Mulder et al., 2022; Stegers-Jager et al., 2015; Steven et al., 2016). However, this effect may not always be straightforward, as performance differences between subgroups can depend on the combination of tools used in the procedures (Stegers-Jager, 2018). Traditionally, selection procedures in the United States and Europe mainly included prior education grade point average (pre-GPA) and cognitive tests, aimed at measuring intellectual ability. In the past decades, there has been a shift towards the inclusion of broadened selection criteria, which aim to add to the information derived from traditional tools, and often intend to evaluate personal qualities (Niessen & Meijer, 2017; Stegers-Jager, 2018). Examples include the curriculum vitae (CV) and the situational judgement test (SJT). In this paper, we will refer to this distinction with the terms traditional and broadened criteria.

Prior research demonstrated performance discrepancies on traditional criteria, favoring higher SES and ethnic majority applicants (Girotti et al., 2020; Juster et al., 2019; Lievens et al., 2016; Stegers-Jager et al., 2015). Although broadened selection criteria were partly introduced to mitigate these adverse effects on student diversity, results so far are inconsistent (Stegers-Jager, 2018). For instance, Lievens et al. (2016) found that the inclusion of an SJT in the United Kingdom could increase the representation of lower SES applicants, but not of ethnic minority applicants. However, a similar study in the United States found that adding an SJT was advantageous for the representation of both lower SES and ethnic minority applicants (Juster et al., 2019). This implies that the effects of selection tools on diversity can be context-dependent, at least in the case of broadened criteria. The curriculum-sampling test is another tool assessing broadened criteria that is increasingly used in international contexts and has proved effective in predicting academic achievement (Niessen et al., 2018). Curriculum-sampling tests mimic representative parts of a course in the academic program. Generally, applicants study literature or watch video lectures from small-scale versions of an introductory course, followed by an exam (Niessen et al., 2018). Curriculum-sampling tests are aimed at measuring a mixture of attributes such as knowledge, motivation, and time spent studying (Niessen & Meijer, 2017). Additionally, these tests intend to assess the applicants’ ‘fit’ with the program (e.g., the way of testing and studying). To our knowledge, subgroup performance differences on this specific tool have not yet been investigated.

Every country has its own laws and regulations for selection and admission, as well as a unique context regarding student diversity. Characteristic of the Netherlands is that, after years of lottery-based admission, programs are now responsible for designing their own selection procedures. Programs independently decide which tools they include (both self-developed and standardized), and how many they include (with a minimum of two). This results in a great variety of procedures and tools. Results from a national retrospective study indicate that since the abolition of the lottery, inequality in selection chances between subgroups of applicants has increased (Mulder et al., 2022). The authors found that women, ethnic majority applicants, and applicants with higher SES had a higher probability of admission compared to their peers. However, this study did not take into account the role of the extensive range of possible selection procedures and tools. One previous single-site study attempted to unravel this matter, and concluded that ethnic minority and lower SES applicants had lower scores on academic criteria, but not on non-academic criteria (Stegers-Jager et al., 2015). The researchers discovered that for the institution under consideration, men had higher selection chances compared to women, which was again only related to performance on academic criteria. This contradicts the findings of the aforementioned national study (Mulder et al., 2022), and strengthens the hypothesis that the effects of selection on student diversity are context-dependent. An additional observation of the single-site study was that being a first-generation immigrant was correlated with poorer selection outcomes (Stegers-Jager et al., 2015), a variable that was not accounted for in the national cohort study. A final potentially relevant variable, included in neither of these studies, is prior education. A recent report indicated that applicants with prior foreign education had lower selection chances compared to applicants from the ‘traditional’ pre-university track (Van Den Broek et al., 2018).

In short, it is not clear how different selection tools can affect student diversity across different contexts. The freedom of Dutch HPE programs to design their selection procedures creates the unique opportunity to compare the effects of selection on student diversity across different procedures with a variety of selection tools. The present prospective multi-site study aimed to evaluate the probability of selection into five undergraduate HPE programs for subgroups of applicants based on gender, migration background (as an indicator of ethnicity), parental education (as an indicator of SES), and prior education. Additionally, we examined performance differences on two traditional selection tools (pre-GPA, biomedical knowledge test), and two tools assessing broadened criteria (curriculum-sampling test, CV).

Method

Design and context

The present research concerns a prospective multi-site cohort study. We collected data from five university-level undergraduate HPE programs in the Netherlands, including three medical programs (labeled A, B, and C), one technical-medical (clinical technology) program (labeled D), and one pharmacy program (labeled E). The included programs were located in different parts of the Netherlands, both in urban and rural areas, and were all committed to enhancing diversity in their selection processes.

Unique to the Netherlands is that the admission requirements of different types of undergraduate HPE programs are identical. To be eligible, applicants need to meet the same stringent requirements regarding subjects taken (e.g., physics, chemistry, and biology) and educational level. Consequently, the applicant pools are relatively homogeneous in terms of academic background; students who apply to a university-level undergraduate HPE program have already been strongly preselected on academic skills by highly selective secondary education (Niessen & Meijer, 2016). When applicants apply to their program of choice, they apply to one specific institution. Each institution has a predetermined, fixed number of spots. By law, institutions are required to include at least two selection criteria, but as previously mentioned, there are no additional requirements concerning, for instance, the content and quality of the tools. Consequently, great variety exists in the selection procedures that programs employ, both between and within different types of HPE programs at different institutions. We studied tools used by more than one program, to evaluate whether effects were similar or different across programs.

The selection procedures of the five programs are described in Table 1. The tools used by multiple programs were pre-GPA, the biomedical knowledge test, the curriculum-sampling test, and the CV. Pre-GPA comprised applicants’ average school grades on required subjects, usually mathematics, physics, biology, and/or chemistry. Biomedical knowledge tests assessed applicants’ existing general knowledge of biomedical subjects, without requiring any preparation. Curriculum-sampling tests were (largely) based on preparatory materials, in the form of a lecture and/or reading materials, that applicants had received some weeks prior to the testing day. CVs consisted of an assessment of extracurricular activities, such as (voluntary) jobs, internships, or evidence of extraordinary cultural or athletic skills. One standardized tool was included, the biomedical knowledge subtest of the BioMedical Admissions Test (BMAT), which was administered by one program (D). All other selection tools in our sample were self-developed by the individual programs. Consequently, the specific application of the tools differed between the programs, e.g., the specific subjects included in pre-GPA and the types of questions in the tests. All selection tools, except for the BMAT, were administered in Dutch. Programs were responsible for their own quality assurance, and we did not have access to psychometric information.

Table 1 Selection procedures of the five programs included in the present study

Participants and procedure

All applicants who engaged in the selection procedures for entry in September 2020 (N = 3280) were invited to participate. For programs A, D, and E, applicants were invited during the on-site testing days. Programs B and C did not perform on-site testing due to COVID-19 pandemic measures, necessitating recruitment via e-mail during the selection procedure.

Applicants were requested to complete a demographics questionnaire. In this survey, applicants were asked to report their student number, gender, migration background, parental education, and prior education. Data on performance in the selection procedures were derived from the related university student administration systems. Student numbers were used to connect the data from the demographics questionnaire with the performance data.

Informed consent was obtained from all participants. Applicants were informed that participation was voluntary and would not influence their selection outcomes, and we made explicit that the researchers operated independently from the selection committees. Applicants did not receive incentives for participation in the study. All data were pseudonymized immediately after the demographics and performance data were combined. The Medical Ethical Review Committee of Erasmus MC declared the study exempt from ethical approval.

Variables

Predictors included gender, migration background, prior education, and parental education.

Gender diversity was acknowledged in the present study, and applicants had the option to choose among three categories: ‘man’, ‘woman’, and ‘other, namely [free text box]’.

Migration background was used as a proxy for ethnicity, recognizing that this does not completely capture the multidimensional character of ethnicity. Migration background was defined in alignment with Statistics Netherlands (CBS): individuals have a migration background when at least one of their parents was born outside of the Netherlands. Based on the CBS taxonomy, we distinguished between a Western and a non-Western migration background. All European (excluding Turkey), North American, and Oceanian countries, as well as Indonesia and Japan, were considered Western. Non-Western countries included all countries in Africa, Asia (excluding Indonesia and Japan), and Latin America, as well as Turkey. Additionally, we distinguished between first-generation and second-generation immigrants; first-generation immigrants were born outside the Netherlands. In the Netherlands, the use of migration background and the CBS taxonomy is considered the standard for operationalizing ethnicity, including in HPE research (e.g., Mulder et al., 2022; Stegers-Jager et al., 2015).

In the Netherlands, the typical educational route to an HPE program is the pre-university track of secondary school with a health/science profile. However, applicants can apply to HPE programs from alternative forms of prior education. We distinguished between standard Dutch pre-university education, university, higher vocational education, all forms of foreign education, and other forms of prior education (e.g., entrance exams and adult education).

Finally, parental education was used as a proxy for SES, acknowledging that this is only one of many indicators that can be used to operationalize SES. Parental education was determined by the educational level of applicants’ parents. Applicants were categorized as first-generation university applicants when neither of their parents had attended higher education, i.e., university or higher vocational education. First-generation university applicants were a subgroup of interest, because previous research has demonstrated that their odds of being selected into medical school are lower (Mason et al., 2021; Stegers-Jager et al., 2015). Additionally, they face numerous obstacles when applying to medical school, including a lack of knowledge about the admission process and financial barriers (Romero et al., 2020).

Outcome measures

Five outcome measures were included. The first—binary—outcome measure indicated whether an applicant was selected (yes/no), determined by their ranking number. The other four outcome measures were continuous and reflected performance on the four tools: pre-GPA, curriculum-sampling test, biomedical knowledge test, and CV. For each tool, the responsible program calculated a raw score based on its own scoring method. Subsequently, each program transformed these raw scores into standardized Z-scores to enable comparisons of tools between tracks. These Z-scores were made available to the researchers and used for the analyses.
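As a minimal illustration of this within-track standardization step in R (the data frame and column names are hypothetical, not taken from the study):

```r
# Hypothetical raw scores on one tool for two tracks
scores <- data.frame(
  track     = c("B", "B", "B", "C1", "C1", "C1"),
  raw_score = c(6.5, 7.8, 8.2, 55, 70, 62)
)

# Transform raw scores into Z-scores within each track, so performance is
# expressed relative to that track's applicant pool: Z = (x - mean) / SD.
scores$z_score <- ave(
  scores$raw_score, scores$track,
  FUN = function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
)
```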

Statistical analyses

Multilevel logistic regression analysis was performed to calculate odds ratios (OR) for the effect of the different predictors on the probability of selection. An OR of > 1 indicates an increased likelihood of selection. Since the content of the selection procedure differed between programs, we included the program to which the applicants applied as a random intercept in this model. Program E was excluded from this analysis, given its high selection rate of 96%, which was in large contrast with the other programs’ average selection rate of 47% (Appendix 1: Table 6). The selection rate of this program was so high because of the small number of applicants relative to the number of available spots.
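As an illustration of this model, a minimal sketch in R using lme4; the data frame and variable names (applicants, selected, program, and the predictor columns) are hypothetical stand-ins for the study data:

```r
library(lme4)

# Multilevel logistic regression for the probability of selection, with a
# random intercept for the program applied to. All data frame and variable
# names are hypothetical illustrations of the predictors described above.
fit_sel <- glmer(
  selected ~ gender + migration_background + prior_education +
    parental_education + (1 | program),
  data   = applicants,
  family = binomial(link = "logit")
)

exp(fixef(fit_sel))                                     # odds ratios (OR > 1: increased likelihood)
exp(confint(fit_sel, parm = "beta_", method = "Wald"))  # 95% CIs for the fixed effects
```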

To compare performance on different tools, we used Z-scores that were provided by the participating programs. We performed multilevel linear regression to assess performance differences between Z-scores on the four overlapping tools. Program C applied different scoring methods for two independent selection tracks with intake restriction (Table 1), resulting in Z-scores for the two different tracks. Therefore, we used the variable ‘track’ instead of ‘program’ as a random intercept for the analyses of the four tools. This random effect was included because, as mentioned earlier, the specific application of each tool differed across settings.
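A comparable sketch for the performance models, again with hypothetical names; each of the four tools would be fitted in the same way on its own subsample:

```r
library(lme4)

# Multilevel linear regression for Z-scores on one selection tool (pre-GPA
# shown as an example), with a random intercept for track. Variable and
# data frame names are hypothetical.
fit_gpa <- lmer(
  z_pre_gpa ~ gender + migration_background + prior_education +
    parental_education + (1 | track),
  data = applicants_gpa
)
summary(fit_gpa)  # unstandardized Bs are subgroup differences in SD units
```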

We applied likelihood ratio tests with a boundary correction to assess whether including ‘program’ or ‘track’ as a random intercept explained significantly more variance than the model without the random intercept.
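A sketch of one common way to implement such a test, assuming the boundary correction is applied by halving the naive p-value (a 50:50 mixture of chi-square(0) and chi-square(1)); whether this exactly matches the authors’ procedure is an assumption, and the data frame and variable names are hypothetical:

```r
library(lme4)

# Fit the selection model without and with the random intercept for program.
# Because the null value of a variance component (variance = 0) lies on the
# boundary of the parameter space, the naive chi-square(1) p-value is
# conservative and is therefore halved.
m0 <- glm(selected ~ gender + migration_background + prior_education +
            parental_education,
          data = applicants, family = binomial)
m1 <- glmer(selected ~ gender + migration_background + prior_education +
              parental_education + (1 | program),
            data = applicants, family = binomial)

lrt_stat    <- as.numeric(2 * (logLik(m1) - logLik(m0)))
p_corrected <- 0.5 * pchisq(lrt_stat, df = 1, lower.tail = FALSE)
```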

Analyses were executed using the lme4 1.1.26 and nlme 3.1.152 packages in R version 4.0.4. For all statistical analyses, assumptions were checked. We interpreted ORs > 1.68 or < 0.60 as a small effect, ORs > 3.47 or < 0.29 as a medium effect, and ORs > 6.71 or < 0.15 as a large effect (Chen et al., 2010).
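For reference, a small helper that applies these benchmarks, folding ORs below 1 onto the > 1 scale (e.g., the cutoff of 0.60 corresponds approximately to 1/1.68); this is an illustration, not code from the study:

```r
# Classify an odds ratio using the Chen et al. (2010) benchmarks adopted here.
or_effect_size <- function(or) {
  m <- max(or, 1 / or)  # fold ORs below 1 onto the > 1 scale
  if (m > 6.71) {
    "large"
  } else if (m > 3.47) {
    "medium"
  } else if (m > 1.68) {
    "small"
  } else {
    "negligible"
  }
}

or_effect_size(0.45)  # "small": 1 / 0.45 = 2.22, which exceeds 1.68
```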

Results

Applicant characteristics

In total, 1935 applicants participated in the study (response rate 59%; range 34–81% across individual programs). With respect to gender, 30% of the respondents identified as men, and one applicant identified as ‘other’. This individual was excluded from the subgroup analyses; therefore, only the categories of men and women are described in the results. Furthermore, 38% had a migration background, 20% applied from alternative forms of prior education, and 25% were first-generation university applicants. In terms of gender and age, all samples were representative of the complete applicant pool. For two programs (C and E), participating applicants performed slightly better in the selection (in terms of ranking number) than non-participating applicants.

Since applicants were exposed to some overlapping tools as well as some unique tools, and one program was excluded from the analyses of the probability of selection, the distribution of applicant characteristics differed between the multilevel analyses (Table 2). Noteworthy is that for the biomedical knowledge test, the proportion of applicants with a migration background was relatively low compared to the samples for the other analyses (28% vs 36–43%). Additionally, for pre-GPA, the proportion of applicants from alternative forms of prior education was comparatively small (10% vs 19–23%). This probably reflects the fact that pre-GPA is not always included as a selection tool for those applicants.

Table 2 Applicant characteristics of the total sample and of the different multilevel analyses

The individual programs differed in the distribution of the applicant characteristics of interest (Appendix 1: Tables 6, 7). The most notable difference is that compared to the other programs, Program D—the only rural program in the sample—had a lower representation of applicants with a migration background (13% vs 32–53%) and of first-generation university applicants (15% vs 24–32%). Differences in demographic composition cannot be explained by differences in admission requirements, since these were identical across programs. However, it is possible that other institution-related factors, including location and selection procedure, made certain programs more attractive to specific subgroups (Wouters et al., 2017b).

Table 3 Results multilevel logistic regression analysis for probability of selection (N = 1688)

Probability of selection

First-generation Western immigrants were significantly less likely to be selected compared to applicants without a migration background (23% vs 49%), corresponding to an adjusted OR of 0.45 (95% confidence interval [CI] [0.20, 0.99]; Table 3). Additionally, foreign-educated applicants had lower selection odds than those from standard pre-university education (24% vs 49%, adjusted OR = 0.46, 95% CI [0.22, 0.94]). Both can be interpreted as small effects (i.e., OR < 0.60). The category ‘other forms of prior education’ demonstrated a medium-sized (i.e., OR < 0.29) but non-significant (at the 0.05 level) negative effect (18% vs 49%, adjusted OR = 0.28, 95% CI [0.08, 1.00]), which could be due to the small size of this group (N = 17). Gender and parental education were not significantly associated with the probability of selection. The random effect of program was not significant (SD = 0.00, 95% CI [0.00, 0.21], p = 0.50), indicating that subgroup differences in the probability of selection were similar across programs, given the fixed structure considered (i.e., the variables gender, migration background, prior education, and parental education).

Performance on traditional criteria

Pre-GPA

Pre-GPA was used by four programs (B, C, D, and E), of which one (C) used two independent selection tracks, resulting in five tracks in the analysis (B, C1, C2, D, and E). Compared to traditional applicants, first-generation university applicants had significantly lower pre-GPAs (B = − 0.17, 95% CI [− 0.30, − 0.03]; Tables 4, 5). As Z-scores were used for all criteria, the unstandardized Bs indicate the difference in SDs. Thus, for example, pre-GPAs of first-generation university applicants were 0.17 SD lower than those of non-first-generation university applicants. Applicants with university-level education and with ‘other forms of prior education’ had significantly lower pre-GPAs (respectively, B = − 0.41, 95% CI [− 0.63, − 0.18]; B = − 0.76, 95% CI [− 1.42, − 0.11]) compared to standard pre-university applicants, while pre-GPAs of applicants with foreign education were significantly higher (B = 1.13, 95% CI [0.54, 1.72]). Gender and migration background were not associated with pre-GPA. The random effect of track was not significant (SD = 0.005, 95% CI [0, 506553], p = 0.50), indicating that the performance differences found on pre-GPA were similar across tracks, given the fixed structure considered.

Table 4 Descriptive statistics of subgroup performance on four selection tools
Table 5 Results multilevel linear regression analyses for performance on four selection tools

Biomedical knowledge test

Biomedical knowledge tests were used by two programs (A and D). Men and applicants who were already studying at university level performed significantly better on the biomedical knowledge tests than women and applicants from standard pre-university education (respectively, B = 0.21, 95% CI [0.06, 0.37]; B = 0.32, 95% CI [0.11, 0.52]; Tables 4, 5). Migration background and parental education were not associated with test scores, and the random effect of track was not significant (SD = 0.01, 95% CI [0, 12114.82], p = 0.46), indicating that subgroup differences in performance were similar across programs, given the fixed structure considered.

Performance on broadened criteria

Curriculum-sampling test

Three programs included curriculum-sampling tests (A, B, and E). Applicants with a non-Western migration background, both first-generation and second-generation, scored lower on curriculum-sampling tests than applicants without a migration background (respectively, B = − 0.43, 95% CI [− 0.67, − 0.20]; B = − 0.21, 95% CI [− 0.34, − 0.10]; Tables 4, 5). Applicants who were already studying at university level performed significantly better than standard pre-university applicants (B = 0.37, 95% CI [0.21, 0.53]), while applicants with foreign education had lower test scores (B = − 0.56, 95% CI [− 0.91, − 0.22]). Test scores were not influenced by gender or parental education. Given the fixed structure considered, the random effect of track was significant (SD = 0.09, 95% CI [0.03, 0.32], p = 0.01), implying that our overall findings differed between programs. Descriptive statistics for the individual programs employing curriculum-sampling tests (Appendix 1: Table 8) indicate that only for program E did applicants with a first-generation non-Western background have notably low mean Z-scores compared to those without a migration background (M = − 0.86 vs M = 0.29). This difference was smaller for program B (M = − 0.13 vs M = 0.18) and non-existent for program A (M = 0.02 vs M = − 0.01). Noteworthy is that program E had relatively more applicants with a first-generation non-Western background than the other two programs.

CV

Three tracks derived from two different programs included a CV (B, C1, and C2). CV scores were significantly lower for men compared to women (B = − 0.17, 95% CI [− 0.31, − 0.02]; Tables 4, 5), for first-generation Western immigrants compared to applicants without a migration background (B = − 0.43, 95% CI [− 0.85, − 0.00]), and for applicants with higher vocational education, foreign education, and ‘other forms of prior education’ compared to standard pre-university applicants (Bs between − 0.61 and − 0.81). Parental education was not associated with CV scores. There was a significant effect of track for CV (SD = 0.25, 95% CI [0.08, 0.76], p < 0.001), indicating that the aforementioned effects differed across tracks, given the fixed structure considered. Descriptive statistics (Appendix 1: Table 9) suggest that the gender-based performance gap was smaller for track B than for the other tracks (track B: M = 0.04 (men) vs M = 0.17 (women); track C1: M = − 0.29 vs M = − 0.06; track C2: M = − 0.11 vs M = 0.25). The overall result that first-generation Western immigrants had lower scores than applicants without a migration background was found for track B (M = − 0.75 vs M = 0.22) and track C2 (M = − 0.42 vs M = 0.09), but not for track C1 (M = − 0.03 vs M = − 0.14). Compared to those without a migration background, applicants with a second-generation non-Western background had lower CV scores in track B (M = − 0.28 vs M = 0.22), similar CV scores in track C1 (M = − 0.17 vs M = − 0.14), and higher CV scores in track C2 (M = 0.45 vs M = 0.09). For track B, larger differences were observed between different forms of prior education, but this is probably related to the fact that for program C, the tracks were distinguished based on prior education, resulting in a large concentration of standard pre-university education in track C1 and a large concentration of other forms of prior education in track C2.

Discussion

Unraveling the impact of distinctive selection procedures on student diversity in undergraduate HPE programs requires insight into how subgroups based on gender, migration background (as an indicator of ethnicity), parental education (as an indicator of SES), and prior education perform on the applied selection tools in different contexts. Our results demonstrated that the selection chances of applicants with non-traditional backgrounds were generally smaller, but significantly so only for applicants with a first-generation Western migration background and applicants with foreign education. These findings did not differ between programs. However, a closer look revealed larger differences in subgroup performance and more variability in effects. We conclude that the broadened criteria under research—curriculum-sampling tests and CVs—may reduce SES-related performance differences, but not disparities based on applicant ethnicity. Furthermore, subgroup performance differences were context-specific for broadened criteria, but not for traditional criteria.

Our first key finding is that the implementation of broadened selection criteria instead of traditional criteria potentially reduces performance disparities based on SES, but may not mitigate an ethnicity-related performance gap; this confirms the previous work of Lievens et al. (2016). With respect to the traditional criteria under research, we found that first-generation university applicants had lower pre-GPAs than applicants from traditional backgrounds, also confirming previous research (Griffin & Hu, 2015; Juster et al., 2019; Puddey et al., 2011). Nevertheless, pre-GPAs did not differ between ethnic majority and ethnic minority applicants, which may be explained by great variety in pre-GPAs between different ethnic minority groups (Puddey et al., 2011). We did not identify significant SES-based or ethnicity-based performance differences on biomedical knowledge tests. However, the sample for this outcome measure was smaller and less diverse than for the other tools, and international research on such tools persistently reveals such disparities (Girotti et al., 2020; Griffin & Hu, 2015; Juster et al., 2019; Lievens et al., 2016; Puddey et al., 2011). With respect to broadened criteria, our study is the first to investigate subgroup performance on curriculum-sampling tests and CVs in a multi-institutional setting. On both broadened criteria under research, we did not find performance differences based on SES, which resonates with previous research (Griffin & Hu, 2015; Juster et al., 2019; Lievens et al., 2016; Stegers-Jager et al., 2015). Nevertheless, we found that applicants with a migration background were disadvantaged on both tools, whereas previous studies reported mixed findings (Juster et al., 2019; Lievens et al., 2016; Stegers-Jager et al., 2015). A possible explanation for our findings regarding SES is that broadened criteria are less prone to coaching—which is generally more available to high-SES applicants (Stemig et al., 2015)—due to their unstandardized and program-specific nature. Traditional criteria, on the other hand, are potentially more susceptible to coaching, as applicants can, for instance, purchase private tutoring to increase their pre-GPA. Simultaneously, the lack of standardization of broadened criteria could increase the risk of cultural bias. Cultural bias can, for instance, occur when certain questions are interpreted differently by members of ethnic minority groups, and may explain the lower scores of applicants with migration backgrounds (Kim & Zabelina, 2015). Language bias probably did not play a significant role in performance disparities based on migration background, because effects were not consistently observed amongst all first-generation immigrants. Additionally, results from previous research suggest that disparities also exist for immigrants from Dutch-speaking countries (Stegers-Jager et al., 2015).

A second key finding is that for the broadened selection criteria, subgroup performance differences were context-specific, whereas the traditional selection criteria had consistent effects across programs. This is in accordance with the current evidence on subgroup differences in performance on the two types of criteria: results from prior research regarding the use of broadened criteria are mixed (Juster et al., 2019; Lievens et al., 2016; Stegers-Jager, 2018; Stegers-Jager et al., 2015), while the outcomes from traditional criteria rather consistently disadvantage ethnic minority and lower SES applicants (Griffin & Hu, 2015; Juster et al., 2019; Lievens et al., 2016; Puddey et al., 2011; Stegers-Jager et al., 2015). Additionally, the overall finding that men and applicants with a first-generation Western migration background had lower CV scores was not in line with a previous Dutch single-institution study (Stegers-Jager et al., 2015). Our study is the first to directly demonstrate that seemingly comparable tools can have differential effects on subgroup performance across different programs. Typically, broadened criteria allow for more variation and can be further adjusted to the specific program contents, which may explain the stronger context-dependence of their effects on subgroup performance. For instance, curriculum-sampling tests vary in their subject, preparatory materials, and preparation time. Additionally, previous research suggests that the complexity of the language (Lievens et al., 2016) and the question format (Edwards & Arthur, 2007) may contribute to subgroup differences in test performance. Likewise, the scoring method and the type of extracurricular activities that are considered in CV scores may play a role, since healthcare experiences are considered to be unequally accessible to applicants from different backgrounds (Wouters, Croiset, Isik, et al., 2017).

A third key finding is that subgroup differences in performance on individual tools did not always have consequences for the probability of selection of those subgroups. We found that selection chances were only significantly smaller for applicants with a first-generation Western migration background and applicants with foreign education, two subgroups that had gone unnoticed in previous research. Combining tools with differential subgroup performance within and across procedures may have counterbalanced the overall effect, and the weightings of different tools may have played a role (Lievens et al., 2016; Stegers-Jager, 2018). Our findings are not fully supported by the results from a recent retrospective study that included applicants to all Dutch undergraduate HPE programs (Mulder et al., 2022). The authors found significantly lower selection probabilities for additional ethnic minority groups, men, and lower SES groups, although the results were negligible in terms of statistical effect size (Chen et al., 2010). The discrepancy between findings may be explained by differences between target groups: the present study used prospective data from a subset of programs and included a more heterogeneous group of applicants, including older applicants and those with foreign education.

Strengths of our study include the collection of data from multiple programs and the use of a multilevel analytical approach, creating the opportunity to correct for and examine contextual differences. The distinctively Dutch admissions system, which allows schools to design their own selection procedures, allowed us to compare a variety of (applications of) tools. As a consequence, however, not all tools were used by all programs. Therefore, direct comparison across different outcome measures and examination of the correlation of performance between different tools were not possible. Another limitation is that although the present study is, to our knowledge, the first to include the selection procedures of a range of different types of undergraduate HPE programs, it was not possible to cover all specialties and institutions. This may have consequences for the generalizability of our findings. Furthermore, we included parental education as a relevant indicator of SES, since first-generation university applicants have been shown to face barriers during the transition into higher education (Stephens et al., 2014), but we may have overlooked other potentially relevant SES-related effects, such as parental income and profession (Girotti et al., 2020; Mulder et al., 2022; Steven et al., 2016). Likewise, migration background is a stable and objective indicator of ethnicity, but does not account for ethnic identity (Ross et al., 2020; Stronks et al., 2009). Another limitation is that sample sizes were small for certain subgroups, so those results should be interpreted with caution. Finally, two selection procedures were partly affected by COVID-19 measures, potentially reducing the generalizability of our findings. Nevertheless, the effects on the probability of selection did not differ significantly between the four programs in that analysis, of which two were affected by COVID-19.

The variety in subgroup differences between and within tools implies that future research should determine whether specific characteristics of tools play a moderating role in their effects on diversity. This could lead to the identification of best practices. Furthermore, based on our results we cannot draw conclusions with respect to the effect of different weightings of tools. Therefore, we endorse a previous suggestion to investigate the effects of different weightings of tools on student diversity (Stegers-Jager, 2018). Future studies should also examine whether selection tools differentially predict academic performance for different subgroups, to determine whether the performance disparities we found correspond with bias (i.e., underprediction or overprediction for certain subgroups). Finally, future research should identify the specific underlying characteristics and needs of subgroups of applicants with non-traditional backgrounds within the context of HPE selection, to provide better support during and, as suggested by others (Lievens, 2015; Wouters, 2020), also after selection. For instance, applicants from alternative forms of prior education may face difficulties managing expectations in HPE selection, which can differ strongly from their previous educational experiences (Katartzi & Hayward, 2020; Rienties & Tempelaar, 2013).

From a practical viewpoint, the context-specificity of subgroup differences in performance indicates that HPE programs need to establish continuous evaluation of the possible effects of their selection procedures on student diversity, rather than relying only on existing research from other contexts. Additionally, we encourage programs to conscientiously include and/or develop alternative tools that can reduce adverse impact and explicitly promote much-needed diversity, such as SJTs (Juster et al., 2019) and multiple mini-interviews (Griffin & Hu, 2015), while keeping in mind that effects can be context-specific. We acknowledge the desire to apply school-specific selection procedures, as selection procedures whose contents align with the curriculum can have high predictive value (Schreurs et al., 2020). Simultaneously, this creates a responsibility for programs to evaluate different aspects of the validity of their selection procedures, including adverse impact (Schreurs, 2020). Additionally, programs could consider validating their tools with diverse norming groups (Padilla & Borsato, 2008).

In conclusion, selection into undergraduate HPE programs can unintentionally impact student diversity, hindering equitable admission. Compared to traditional criteria, broadened criteria can reduce SES-related performance differences, but not disparities based on ethnicity. For broadened criteria, subgroup differences in performance also vary across contexts. We therefore call for continuous evaluation of the effects of selection on diversity, the identification of best practices within existing tools, the inclusion of tools with a positive or neutral impact on student diversity, and sufficient quality control.