When ethnic minority students are judged as more suitable for the highest school track: a shifting standards experiment

When students are grouped into school tracks, this has lasting consequences for their learning and later careers. In Germany to date, some groups of students (boys, ethnic minority students) are underrepresented in the highest track. Stereotypes about these groups exist that entail negative expectations about their suitability for the highest track. Based on the shifting standards model, the present research examines if and how stereotypes influence tracking recommendations. According to this theory, members of negatively stereotyped groups will be judged more leniently or more strictly depending on the framing of the judgment situation (by inducing minimum or confirmatory standards). N = 280 teacher students participated in a vignette study in which they had to choose the amount of positive evidence for suitability they wanted to see before deciding to recommend a fictitious student to the highest track. A 2 (judgment standard: minimum vs. confirmatory) × 2 (target student’s gender: male vs. female) × 2 (target student’s ethnicity: no migration background vs. Turkish migration background) between-subjects design was used. No effects of target gender occurred, but the expected interaction of target’s ethnicity and judgment standard emerged. In the minimum standard condition, less evidence was required for the ethnic minority student to be recommended for the highest track compared to the majority student. In the confirmatory standards condition, however, participants tended to require less evidence for the ethnic majority student. Our experiment underlines the importance of the framing of the recommendation situation, resulting in a more lenient or stricter assessment of negatively stereotyped groups.


Introduction
In educational settings, tracking is often used in the hope that classrooms that are more homogeneous in performance will receive lessons that are better tailored to their needs (Hattie, 2002). It is argued that teachers are able to scaffold students better in more performance-homogeneous groups, neither posing tasks that are too difficult nor challenging students insufficiently. However, tracking is often criticized because it reproduces existing social class differences and because the lower track provides a lower-quality learning environment, contradicting the aim of catering better to the needs of all students (Hattie, 2002; Maaz et al., 2008; Retelsdorf et al., 2010; Van Houtte, 2004).
Tracking can be used both within schools, so that students attend classes at different performance levels in their school, and between schools, meaning that students of different performance levels are taught in different schools (Becker et al., 2016). In Germany, students are tracked between schools after 4th or 6th grade (ages nine or eleven) depending on the federal state, and the school track they attend has long-term consequences. Students in the highest track (academic track, Gymnasium) receive a diploma permitting them to study at university, whereas students with a diploma from the vocational track, which some federal states further separate into two tracks, qualify for apprenticeships (see Becker et al., 2016, for an overview). Importantly, there is evidence that the different tracks constitute unequal learning environments. Teachers in lower tracks show less adaptive goal orientations and instructional methods, and they are more likely to experience burnout (Retelsdorf et al., 2010; Van Houtte, 2004). Moreover, students in higher tracks make greater achievement gains even when prior achievement is controlled (scissor effect; Becker et al., 2006, 2012; Hattie, 2002; Maaz et al., 2008; Retelsdorf et al., 2012). In theory, the tracking system is meant to be permeable, though in practice movement between tracks is mostly downward (Bellenberg, 2012).
With effects on both learning in school and later-life outcomes, tracking decisions are clearly impactful for students' lives and should not be made lightly. The German government established that tracking recommendations should consider "not only performance related to goals established in the curriculum but also the general skills important for success in school" (Standing Conference of the Ministers of Education and Cultural Affairs [KMK], 2015, p. 6). The recommendation of a student for a specific track should be based on their performance, but also their "suitability, affinity, and willingness […] to work intellectually" (KMK, 2015, p. 5). While in most German federal states, teachers' recommendations are not binding anymore, they carry weight in the admission process to different schools of the same track. Some states do have binding recommendations, meaning that students may need to further prove their abilities if they want to attend a higher-than-recommended track (KMK, 2015).
In addition to more objective aspects such as grades, teachers do consider motivational and social factors when making a recommendation as suggested by the guideline (Anders et al., 2010; Kaiser et al., 2013; Sneyers et al., 2018). Because teachers' perceptions play a role in their recommendations, they may be vulnerable to bias along gender and ethnicity lines. Indeed, studies show that school is perceived as a more feminine space by students and teachers alike (Heyder & Kessels, 2013; Kessels, 2015), and boys are underrepresented in the academic track (Spinath et al., 2014; Federal Statistical Office, 2019). Similarly, students with a Turkish migration background are underrepresented in the academic track (Rauch et al., 2016). German teacher students have lower expectations for students with a Turkish migration background (Lorenz et al., 2016), and rate them as less qualified for the academic track (Tobisch & Dresel, 2017) based on the same performance. Our study aims to complement this research by examining whether the framing of a recommendation situation can influence the degree and directionality of bias against male and ethnic minority students.
We draw from the shifting standards model (SSM; Biernat, 2012), which posits that people shift their standards when assessing a person's suitability for a specific position or task according to the social group this person belongs to and the stereotypes about this group's abilities and characteristics. Depending on the framing of the judgment situation (whether the assessment is meant to be just a first, tentative impression or a judgment made with certainty), more or less evidence is required to judge members of negatively stereotyped groups as possessing the characteristic in question (Biernat et al., 2008, 2010). Translating the predictions of the SSM to tracking recommendations, the present study tests how many positive learning-related behavioral examples teacher students will want to see to make an academic track recommendation for negatively stereotyped students (compared to positively stereotyped students) in two different framings of the judgment situation.

Shifting standards in stereotyped judgments
Focusing on gender and racial group membership, the basic premise of the SSM is that stereotypes about various social groups are used as expectations or standards. A stereotype-based judgment standard entails expectations about "the likely mean and range of group members on the attribute" (Biernat, 1995, p. 89). Thus, a specific target is being judged relative to the average expectations for their group and perceivers tend to shift their standard of judgment to correspond to this expectation (Biernat, 1995). This is possible if the language used is subjective, that is, when it allows for different interpretations of the same word (Biernat, 2012). For instance, when describing a woman as "tall," she is perceived as tall compared with an average woman (Biernat & Manis, 1994), while a man of the same height would not be rated as "tall" because the average man is expected to be taller than the average woman. Research has shown that members of negatively stereotyped groups are rated more negatively on objective scales, but equally or even more positively on such subjective scales (Biernat, 2012).

Shifting standards in teacher judgments
So far, research has studied standard shifts predominantly regarding adult targets. This is surprising given the vast amount of research on ethnic and gender bias in the school context in many countries (Anderson-Clark et al., 2008; Arbuckle & Little, 2004; Glock, 2016a; Heyder & Kessels, 2017; Kessels, 2015; Kessels & Steinmayr, 2013; Krahé et al., 2007; Lorenz et al., 2016; Malouff & Thorsteinsson, 2016; Meissel et al., 2017; Parks & Kennedy, 2007; Tobisch & Dresel, 2017). In Germany, two studies applied the shifting standards paradigm to negatively stereotyped elementary school students, contrasting the use of objective versus subjective scales. The use of subjective scales (e.g., characterizing a student's math competencies as "good") allows different standards of judgment to be applied to students from different social groups and should therefore mask biases. A girl described as "really good at math" might be perceived as good only compared to other girls for whom the perceiver might have low expectations. On the other hand, the use of objective scales (e.g., estimating the score a student has in a standardized test) does not allow for the use of different standards for different groups and should therefore reveal stereotyped expectations (Biernat, 2012). In accordance with the SSM, Holder and Kessels (2017, study 1) found girls to be judged more strictly in math on objective scales and more leniently on subjective scales compared to boys. In parallel, teacher students judged boys with a Turkish migration background more strictly in German on objective scales and more leniently on subjective scales compared to boys without a migration background (Holder & Kessels, 2017, study 2). Together, these studies revealed a standard shift in teacher judgments for the first time.

Minimum standards and confirmatory standards
Standard shifts depend not only on language, but also on contextual factors. Biernat and Kobrynowicz (1997) distinguished between minimum standards and confirmatory standards. Minimum standards are "expectations for a group and tend to directly reflect stereotypes" (Biernat et al., 2010, p. 855) and thus tend to be lower for group members stereotyped as deficient on a given attribute. These standards are used when some initial evidence for a required ability is needed, for instance for a short-list in a selection process, or when a non-exclusive reward (e.g., verbal praise) is given out. In this case, the perceived group member's behavior is compared to the low expectations the judge has for the group he or she belongs to.
On the other hand, confirmatory standards are "thresholds that reflect certainty that an individual has an attribute" (Biernat et al., 2010, p. 855) and thus tend to be higher for group members stereotyped as deficient. These confirmatory standards are used when perceivers seek "a higher level of proof; a level that allows for the broad based inference that the attribute is ability-based" (Biernat & Kobrynowicz, 1997, p. 546). This is the case when definite proof of a required ability is needed, when a final selection or hiring takes place, or when a zero-sum reward (e.g., an award) is given out. In this case, a cross-category comparison between low expectations for the negatively stereotyped group and high expectations for the other group takes place (Biernat, 2012). For example, in one study, Black men were suspected of being competent based on a smaller number of behavioral examples of competence (minimum standard) but were confirmed to be competent based on more behavioral examples (confirmatory standard) relative to White men (Biernat & Kobrynowicz, 1997). This pattern has been repeatedly found in studies on gender and competence in workplace/masculine domains as well as on race and workplace competence (Biernat & Fuegen, 2001; Biernat & Kobrynowicz, 1997; Biernat & Ma, 2005; Biernat et al., 2008).
According to the authors, a suspicion-certainty continuum of dispositional inference is at the heart of the standards shift (Biernat et al., 2008). Since expectations are lower for members of the negatively stereotyped group, little counter-stereotypical evidence is perceived as sufficient for perceivers to suspect (minimum standard) that this group member might possess the attribute in question to some degree. However, these low expectations also imply that perceivers require more behavioral evidence to be certain (confirmatory standard) that the member of a negatively stereotyped group actually has the attribute of interest (Biernat et al., 2008).
While first studies showed the impact of subjective versus objective scales when teachers judge negatively stereotyped students (Holder & Kessels, 2017), it is unknown which role minimum and confirmatory standards may play in the educational context. If negative stereotypes towards students of Turkish origin and towards boys in general regarding their readiness to attend the academic track do exist, the evidence needed for a recommendation for this track should differ by the standard invoked. As a result, such stereotypes should influence tracking recommendations and ultimately the likelihood of students from stereotyped groups to move into the highest track.

Ethnic bias in teachers' judgments and tracking recommendations
As research has shown, stereotypes might influence teachers' expectations for and judgments of students (e.g., Jussim & Harber, 2005), and may result in tracking recommendations influenced by stereotypes (e.g., Klapproth et al., 2018). Teacher expectations for ethnic majority students are regularly found to be higher than for negatively stereotyped ethnic minority students, especially for Black students in the USA (Jussim & Harber, 2005; McKown & Weinstein, 2008; Ready & Wright, 2011; Tenenbaum & Ruck, 2007). In Germany, many students with a migration background have a Turkish background (Rauch et al., 2016), and these students not only underperform compared to students without a migration background but also compared to students from other ethnic minorities (Stanat et al., 2010). However, research also demonstrates that differences in teacher expectations or performance judgments remain even when statistically or experimentally controlling for achievement (e.g., Glock, 2016a; Holder & Kessels, 2017; Lorenz et al., 2016).
Not only teachers' expectations but also their judgments of performance and behavior have been found to be biased by students' ethnicity (Anderson-Clark et al., 2008; Malouff & Thorsteinsson, 2016; Meissel et al., 2017; Parks & Kennedy, 2007; Quinn, 2020; but see Baker et al., 2015; Glock & Krolak-Schwerdt, 2014; Kaiser et al., 2017). Ethnic minority students' performance is often judged as deficient relative to their ethnic majority peers, in Germany most visibly in studies concerning students of Turkish origin (e.g., Bonefeld & Dickhäuser, 2018; Bonefeld et al., 2017; Sprietsma, 2013). Experimental evidence focusing on language proficiency indicated that teacher students judged a student with a Turkish migration background less favorably and as less competent in German than a student without a migration background even though their performance was identical (Glock, 2016a; Holder & Kessels, 2017). Moreover, supposed primary school students with a Turkish migration background received worse grades for the same performance (Bonefeld & Dickhäuser, 2018; Sprietsma, 2013). Since grades in German are especially relevant to teachers' tracking recommendations (Ditton et al., 2005), this bias might have lasting consequences for students.
Studies on perceptions of students' behavior revealed that when students misbehave, teachers respond more harshly to ethnic minority students, particularly teacher students who have little in-class experience (Glock, 2016b; Kleen & Glock, 2018). A relatively minor misbehavior, for example, talking to a peer, can then lead to a disproportionate negative reaction from the teacher, which may negatively affect how the teacher views the child's suitability for the academic track. Moreover, data from the USA demonstrated that elementary school teachers systematically underestimate the executive functions of ethnic minority students and students with limited English proficiency (Garcia et al., 2019). Such a bias could be highly relevant to teachers' perceptions of students' ability to succeed in the academic track.
While bias in how students' performance and behavior are judged could influence teachers' tracking recommendations for ethnic minority and majority students, so far no consistent pattern has emerged. Field studies in German primary schools did not find evidence that disadvantages exist for students with a migration background compared to students without a migration background when socioeconomic background and/or achievement indicators were accounted for (Ditton et al., 2005; Kristen, 2006; Schneider, 2011; Tiedemann & Billmann-Mahecha, 2007). Once these variables are included, some studies even demonstrated a slight advantage for students with a (specific) migration background for tracking recommendations (Bos et al., 2004; Driessen et al., 2008) and a large advantage for de facto transitions to secondary schools (Gresch & Becker, 2010). A study from the Netherlands also demonstrated substantial between-teacher variation in the degree and directionality of teacher biases towards ethnic minority students (Timmermans et al., 2015).
In contrast, experimental evidence suggests that teachers and teacher students do discriminate against ethnic minority students in their tracking recommendations when achievement is held constant (e.g., Sprietsma, 2013). A recent vignette study provided evidence that teachers rated a student with a Turkish migration background as less qualified for the highest school track than a student without a migration background (Tobisch & Dresel, 2017), a pattern that also extends to tracking recommendations (Klapproth et al., 2018; Sprietsma, 2013; but see also Glock et al., 2015).
In summary, several studies revealed that teachers hold lower expectations for ethnic minority students compared to ethnic majority students and that teacher judgments might be biased. Teachers also react to student misbehavior differently based on student ethnicity, indicating that they might perceive these students differently. In Germany, empirical evidence points to biased expectations and judgments especially regarding students with a Turkish migration background. With regard to tracking recommendations, the findings on the influence of students' ethnicity are inconclusive to date. The present study seeks to identify a contextual factor of the recommendation situation (whether the standard invoked is minimum or confirmatory) that may result in relative advantages of students with and without a migration background.

Gender bias in teachers' judgments and tracking recommendations
Gender identity is another factor related to school performance and attendance of the academic track. Though to a much smaller extent compared to students with a migration background, boys in general are underrepresented in the academic track (Spinath et al., 2014; Federal Statistical Office, 2019), and teachers are less likely to recommend a higher track for them (Bos et al., 2007; Driessen et al., 2008; Jürges & Schneider, 2011; Lehmann et al., 1997; Timmermans et al., 2015; but see also Ditton et al., 2005; Klapproth & Fischer, 2019). In part, this can be explained by their on average lower German grades, since these are weighted more strongly for the recommendation than other grades (Ditton et al., 2005; Lehmann et al., 1997). Even when ability based on standardized tests is controlled for, girls receive better German grades than boys (Lehmann et al., 1997) and boys need better reading competence than girls to have the same chance of receiving a recommendation for the academic track (Bos et al., 2007).
Research indicates that teachers do, in addition to grades, consider motivational and social factors as is intended by the government's guidelines (Anders et al., 2010; Kaiser et al., 2013; Sneyers et al., 2018). On an implicit level, teacher students associate girls more with positive student behaviors and boys more with negative student behaviors (Glock & Kleen, 2017). Teachers also explicitly report that girls more so than boys demonstrate specific additional skills and behaviors relevant to achievement (e.g., Heyder & Kessels, 2017; Kessels, 2015), and this might lead them to recommend the academic track more often for girls. Indeed, girls themselves indicate that they are more self-disciplined, more empathetic, and have more positive attitudes towards seeking help (Duckworth & Seligman, 2006; Kessels & Steinmayr, 2013; Krahé et al., 2007). Teachers also perceive that girls are more motivated, show more effort and behavior that fosters learning, study more independently, are less disruptive, and are more socially skilled (Anders et al., 2010; Arbuckle & Little, 2004; Heyder & Kessels, 2017; Jones & Myhill, 2004; Kessels, 2015; Kuhl & Hannover, 2012; Spinath et al., 2014; Veenstra et al., 2008). These behaviors may indicate to teachers that girls have greater suitability, affinity, and willingness to work intellectually, and thus they would be more likely to recommend them for the academic track (Anders et al., 2010).
Given the only small differences in reading competencies between boys and girls at the end of elementary school in Germany (McElvany et al., 2017), the main reason for boys being less frequently recommended to the highest track seems to be the difference in social and learning behavior. These differences lead to both teachers' perceptions of girls being more adjusted to the school demands as well as to girls' better grades (Hannover & Kessels, 2011). Research has highlighted that when controlling for students' performance and learning behavior (regarding homework, self-regulation), boys are in fact recommended for the highest track more often than girls (Neugebauer, 2011). If the stereotype about boys implies a lack of adaptive social and learning behaviors, tracking recommendations based on these stereotypes might be biased. In the present study on ethnic bias in tracking judgments, we therefore include gender as a second factor in order to analyze what kind of framing of the tracking judgment will result in relative advantages or disadvantages regarding both gender and minority status.

Study overview and hypotheses
The SSM has proven to be a useful framework for revealing stereotype-based judgments in many fields such as hiring, parenting, workplace competence, and personality traits such as assertiveness (Biernat, 2012). However, despite the rich body of research on gender and ethnic bias at schools (e.g., Arbuckle & Little, 2004; Heyder & Kessels, 2017; Kessels, 2015; Malouff & Thorsteinsson, 2016; Tobisch & Dresel, 2017), the SSM has only rarely been applied to school-based judgments. The present study examines for the first time if a standard shift occurs in tracking judgments, a judgment that is highly influential for students' future education and work life.
To test this, we vary the framing of the tracking judgment so that for one group of German teacher students, a minimum standard is induced, whereas a confirmatory standard is induced for the other. Following prior research, we applied the behavioral checklist paradigm (Biernat & Kobrynowicz, 1997; Biernat et al., 2008) enumerating behaviors that foster students' learning. The minimum standard was invoked by instructing participants to tick the minimum number of behaviors that are necessary to suspect that a student may be qualified to attend the highest track, while the confirmatory standard was invoked by asking them to indicate the number of behaviors that are necessary to confirm that a student is qualified for the highest track. In addition, ethnicity and gender of target students were varied between participants.
If negative stereotypes about the suitability of boys and of students with a Turkish migration background for attending the highest track exist, an interaction effect should emerge (H1): In the minimum standards condition, teacher students are expected to require less evidence for the suitability of negatively stereotyped students, that is, students with a Turkish migration background and boys, than positively stereotyped students, that is, students without a migration background and girls (H1a). In the confirmatory standard condition, they are expected to require more evidence to confirm the suitability of students with a Turkish migration background and boys (H1b). Importantly, students with a Turkish background are much more underrepresented at the highest tracks than boys in general and ability stereotypes disadvantage students with a Turkish migration background (Froehlich et al., 2016) which is not the case for boys in general. Therefore, the pattern in accordance with the SSM regarding ethnicity may be more pronounced relative to gender. We will also examine the interplay of gender and ethnicity, though only in an exploratory manner since to date intersectional effects within the standards shift paradigm have not been studied in detail.

Participants
Participants were recruited using the university mailing list for students at a large university in a city of Northwestern Germany. An email invited teacher students to participate in an online study on tracking recommendations. As an incentive, the possible participation in a lottery for vouchers worth 20 € was included. Notwithstanding that both the invitation email and the instruction of the online study stated clearly that only teacher students were the target group for the study, several students not in the teacher training program participated (n = 261). These were excluded from the data analyses, as well as n = 14 persons ticking "other" instead of the different levels of teacher training spelled out in the questionnaire. In addition, we decided to exclude those students who were enrolled in the program for teaching exclusively children with Special Educational Needs (SEN; n = 123) as we aimed to test a sample with a comparable conception of regular students' behaviors at grade 4 and the fictitious target student described in the vignette had no SEN. The final study sample comprised n = 280 teacher students (n = 31 primary school; n = 45 secondary school/lower tracks; n = 204 secondary school/higher track), the mean age was 24.71 (SD = 4.24), and 95% spoke German as their mother tongue. Unfortunately, gender of participants was not collected.

Design, experimental treatment, and measures
Participants were randomly assigned to one of eight conditions in a 2 (standard: minimum vs. confirmatory) × 2 (target student's gender: male vs. female) × 2 (target student's ethnicity: no migration background vs. Turkish migration background) between-subjects design. On the first page of the online questionnaire, an "exercise in evaluating elementary school children" was announced. On the second page, participants read that teachers are requested to form an overall picture of a student when making tracking recommendations at the end of primary school and that this is not exclusively based on students' grades. They further read that the guidelines of the KMK ask to consider the child's "suitability, affinity, and willingness […] to work intellectually" (2015, p. 5) as well as the "general skills important for success at school" (2015, p. 6). As a basis for this, various school-related behaviors that students might exhibit could be considered. Participants were told that they would now engage in a short exercise for doing this. On the next page, they were asked to imagine one specific example of a primary student they had to judge. "[Target's name] is student of a fourth class in [city]. Considering exclusively his/her GPA, a transition to the Gymnasium would not be unambiguously justifiable. How do you proceed when judging [target's name] behavior? In the following you see ten behaviors that can be observed in fourth-graders." Following procedures used by Biernat et al. (2010), participants in the minimum standards condition read that we were interested in "the minimum number of behaviors that are necessary to suspect that [the target student] may be qualified for the Gymnasium," while participants in the confirmatory standards condition read that we were interested in "the total number of behaviors that are necessary to confirm that [the target student] is qualified for the Gymnasium."
All participants were asked to review a list of ten behaviors and to check off "as many or as few behaviors" as the target student would need to engage in to either "give you some inkling or hint that [the target student] may be qualified for the Gymnasium" or "to confirm that [the target student] is qualified for the Gymnasium" (emphasis original). Crossed with this manipulation, participants were asked to imagine a student named "Tim Menzel" or "Anna Menzel" (German names) or "Deniz Gül" or "Selma Gül" (Turkish names).
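The eight cells of the design described above can be enumerated mechanically. The following is only an illustrative sketch: the text does not say how the online questionnaire implemented random assignment, so simple random assignment with equal probabilities is an assumption, as are the function name `assign_condition` and the seed. The mapping of the four target names to gender is inferred from the order in which the names are listed.

```python
import random
from itertools import product

STANDARDS = ("minimum", "confirmatory")
GENDERS = ("male", "female")
ETHNICITIES = ("no migration background", "Turkish migration background")

# Target names as listed in the text; the gender mapping is inferred
# from the order of the names and is therefore an assumption.
TARGET_NAMES = {
    ("male", "no migration background"): "Tim Menzel",
    ("female", "no migration background"): "Anna Menzel",
    ("male", "Turkish migration background"): "Deniz Gül",
    ("female", "Turkish migration background"): "Selma Gül",
}

# All 2 x 2 x 2 = 8 between-subjects conditions.
CONDITIONS = list(product(STANDARDS, GENDERS, ETHNICITIES))


def assign_condition(rng):
    """Draw one condition with equal probability (assumed procedure)."""
    standard, gender, ethnicity = rng.choice(CONDITIONS)
    return {
        "standard": standard,
        "gender": gender,
        "ethnicity": ethnicity,
        "target_name": TARGET_NAMES[(gender, ethnicity)],
    }


rng = random.Random(2021)  # arbitrary seed, only for reproducibility
print(assign_condition(rng))
```

Simple random assignment of this kind does not guarantee equal cell sizes, which would be compatible with the slightly unequal group sizes reported in the results.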

Dependent variable
The dependent measure was the number of behaviors checked (out of ten). The ten behaviors were taken from a pilot study in which 86 teacher students (77% female; mean age = 23.96, SD = 2.8) rated a pool of 61 behaviors representing possible student behavior as either fostering or impeding learning. Ten behaviors that were unambiguously classified as fostering learning were used for the present study (M < 2.0 and SD < 1.0 on a 7-point Likert scale where 1 = fosters learning and 7 = impedes learning). For example, the list included behaviors such as "prepares himself/herself systematically for tests" and "continuously works on the material taught in class" (see Table A1 for all ten behaviors).
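The selection rule for the pilot items (M < 2.0 and SD < 1.0) can be written as a simple filter. The sketch below is illustrative only: the ratings are invented, the two unnamed items are hypothetical placeholders, and the use of the sample standard deviation is an assumption, since the text does not specify how SD was computed.

```python
from statistics import mean, stdev


def select_fostering_items(ratings, max_mean=2.0, max_sd=1.0):
    """Return items rated as clearly fostering learning.

    ratings maps an item label to a list of 1-7 ratings
    (1 = fosters learning, 7 = impedes learning).
    """
    selected = []
    for item, scores in ratings.items():
        # Sample SD (n - 1 denominator) is an assumption.
        if mean(scores) < max_mean and stdev(scores) < max_sd:
            selected.append(item)
    return selected


# Invented ratings for illustration; not data from the pilot study.
pilot_ratings = {
    "prepares himself/herself systematically for tests": [1, 1, 2, 1, 2, 1],
    "hypothetical item with low mean but high spread": [1, 1, 1, 1, 1, 5],
    "hypothetical item rated as impeding learning": [7, 6, 7, 6, 7, 7],
}

print(select_fostering_items(pilot_ratings))
```

The second placeholder item shows why both thresholds matter: its mean is below 2.0, but the raters disagree too much for it to count as unambiguously fostering learning.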

Results
We conducted a Target ethnicity × Target gender × Standards ANOVA with number of behaviors checked as the dependent variable (see Table 1). Contrary to hypotheses, the interaction effect of gender and standard was non-significant, F(1, 272) = 0.31, p = .576, as was the main effect of gender, F(1, 272) = 0.28, p = .867. Moreover, the three-way interaction between gender, ethnicity, and standard, which was included for exploratory purposes, was not significant, F(1, 272) = 0.02, p = .896. Therefore, the effect of gender was not examined further.
As expected, the ANOVA revealed a non-significant main effect of ethnicity, F(1, 272) = 0.23, p = .635, but a significant interaction between ethnicity and standard, F(1, 272) = 6.63, p = .011, partial η² = .024. While minimum standards were lower for a student with a Turkish name (M = 3.82, SD = 1.56, n = 72) than for a student with a German name (M = 4.34, SD = 1.56, n = 67), confirmatory standards were higher for a student with a Turkish name (M = 4.28, SD = 1.34, n = 74) than for a student with a German name (M = 3.93, SD = 1.58, n = 67). To test whether the groups differed from one another as expected, we ran two planned contrasts, which we evaluated using one-tailed tests (Furr & Rosenthal, 2003). The first planned contrast tested whether the number of behaviors checked was lower for a student with a Turkish name than for a student with a German name in the minimum standards condition, and this contrast was significant, t(276) = 2.04, one-tailed p = .021, Cohen's d = 0.35. The second contrast tested whether, in the confirmatory standards condition, the number of behaviors checked was greater for a student with a Turkish name than for a student with a German name. This contrast was marginally significant, t(276) = 1.41, one-tailed p = .080, Cohen's d = 0.24. Taken together, when participants were instructed to form only a vague idea of whether the respective student might be suited for the academic track, they required fewer behaviors from a student with a Turkish name than from a student with a German name. At the same time, participants tended to be stricter, needing more proof that a student with a Turkish name would definitely be suited for the highest track than for a student with a German name.
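As a plausibility check, the two planned contrasts can be approximately recomputed from the cell means, standard deviations, and group sizes reported above. This is a sketch under stated assumptions, not the authors' analysis script: it assumes that the error term is the variance pooled across the four ethnicity × standard cells (which yields the 276 error degrees of freedom reported for the contrasts) and that Cohen's d divides the mean difference by the square root of that pooled variance. Because the inputs are rounded to two decimals, the results can only approximate the reported values.

```python
from math import sqrt

# Cell summaries as reported in the text: (mean, SD, n).
cells = {
    ("minimum", "Turkish"): (3.82, 1.56, 72),
    ("minimum", "German"): (4.34, 1.56, 67),
    ("confirmatory", "Turkish"): (4.28, 1.34, 74),
    ("confirmatory", "German"): (3.93, 1.58, 67),
}


def pooled_error(cells):
    """Pooled within-cell variance and its degrees of freedom."""
    ss = sum((n - 1) * sd ** 2 for _, sd, n in cells.values())
    df = sum(n - 1 for _, _, n in cells.values())
    return ss / df, df


def contrast(cells, higher, lower):
    """t and Cohen's d for the difference 'higher' minus 'lower'."""
    mse, df = pooled_error(cells)
    m_hi, _, n_hi = cells[higher]
    m_lo, _, n_lo = cells[lower]
    diff = m_hi - m_lo
    t = diff / sqrt(mse * (1 / n_hi + 1 / n_lo))
    d = diff / sqrt(mse)  # assumed definition of d
    return t, d, df


# Minimum standard: fewer behaviors expected for the Turkish name.
t_min, d_min, df_error = contrast(
    cells, ("minimum", "German"), ("minimum", "Turkish")
)
# Confirmatory standard: more behaviors expected for the Turkish name.
t_conf, d_conf, _ = contrast(
    cells, ("confirmatory", "Turkish"), ("confirmatory", "German")
)

print(f"minimum: t({df_error}) = {t_min:.2f}, d = {d_min:.2f}")
print(f"confirmatory: t({df_error}) = {t_conf:.2f}, d = {d_conf:.2f}")
```

Under these assumptions, both contrasts come out within a few hundredths of the reported t and d values, which is the size of discrepancy one would expect from rounding in the summary statistics.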

Discussion
In the present study, we examined for the first time whether shifting standards play a role in the tracking judgments teacher students make for male and female students as well as for students with and without a migration background. We expected that the certainty with which a judgment needed to be made would influence whether judgments would be biased for or against students belonging to groups negatively stereotyped in this domain (i.e., boys, students with a Turkish migration background). Specifically, we expected that when teacher students only needed to get a general idea of a student's suitability, they would require less behavioral proof of suitability for students who belong to negatively stereotyped groups than for students belonging to positively stereotyped groups. In contrast, when teacher students needed to be certain of suitability, they would require more behavioral proof from students belonging to negatively stereotyped groups. To our knowledge, this is the first study to test the difference between minimum and confirmatory standards in the German context and the first internationally to apply the SSM to tracking judgments.
We found evidence that teacher students shifted their standards for tracking judgments based on the migration background of the student. This shift was more pronounced when teacher students were asked to indicate whether they suspected a student to be suitable for the academic track: In this case, teacher students needed to see fewer positive learning-related behaviors for students with a Turkish migration background than for students without a migration background. When teacher students needed to be certain of the suitability of a student, they tended to require more behavioral proof for the student with a Turkish migration background than for the student without a migration background. These results are consistent with the predictions of the SSM (Biernat & Manis, 1994; Collins et al., 2009) and complement prior research demonstrating shifting standards more generally (e.g., Biernat & Fuegen, 2001; Biernat et al., 2008; Holder & Kessels, 2017). Thus, they are in line with the expectation that negative stereotypes about the academic abilities of Turkish-Germans (Bonefeld & Karst, 2020) result in lower expectations for this group and a subsequent shift in the amount of proof needed for a decision depending on the certainty with which the decision needs to be made.
With regard to gender, we did not find the anticipated effects of the certainty with which teacher students needed to make their judgments. In contrast to research on gender-based standard shifts focusing on suitability for a job (Biernat & Kobrynowicz, 1997) or competence in mathematics (Holder & Kessels, 2017), teacher students required an equal number of positive learning-related behaviors from boys and girls in order to recommend them for the academic track and did so regardless of the standard. This might be due to the content of stereotypes about boys: While they are seen as less academically minded, they are also seen, and see themselves, as more intelligent (e.g., Steinmayr & Spinath, 2009). To some extent, this positive stereotype might counterbalance the perception that boys might be less suited to the academic track. Moreover, past research demonstrated that the direction of gender biases in teacher students' recommendations, i.e., whether they favor boys or girls, also depends on the students' achievement level and trajectory (Klapproth & Fischer, 2019).
In our study, minimum or confirmatory standards were invoked experimentally. The results of this study clearly demonstrate the need to understand whether the de facto standards applied to tracking recommendations in Germany resemble minimum or confirmatory standards. However, it is not immediately clear which standard is currently used in Germany, and as federal states vary in the laws governing the transition process, standards may also vary between them. Therefore, we first elaborate on impacts of different standards on students before discussing more generally which standard may be prevalent in Germany today.
For students, especially ethnic minority students, the difference between minimum and confirmatory standards can be highly influential for their further educational path. If confirmatory standards are applied and ethnic minority students are less likely to receive a recommendation for the academic track, they will be placed in less challenging learning environments (e.g., Retelsdorf et al., 2010; Van Houtte, 2004) and may not reach their full potential. If, on the other hand, a minimum standard is used, this could reduce the underrepresentation of students with a Turkish migration background in academic tracks. However, if a student is ultimately not suited to the academic track, they may experience repeated failures, which has been theorized to be a risk factor for students' motivation, for example, due to lower self-efficacy and learned hopelessness (Au et al., 2009, 2010).
Thus, the implications for students differ depending on the standard teachers currently use in Germany for tracking recommendations, and on whether teachers find it more problematic to send a child to a track that is below or above the child's abilities. If teachers are wary of underestimating children, they may use minimum standards, recommending all children whom they suspect may be suited to the academic track. However, if teachers are mainly concerned not to overestimate students, they may apply confirmatory standards, recommending only those students whom they are certain are suited to the academic track.
On a theoretical level, the SSM predicts that rewards that are considered unlimited (e.g., praise) will be given to members of negatively stereotyped groups more readily, whereas rewards that are restricted in number will be preferentially given to members of positively stereotyped groups (Biernat & Vescio, 2002). While at a given point in time there may not be an infinite number of spots available at academic track schools, the number of recommendations that can be distributed in a single elementary school classroom is not restricted. Therefore, teachers may be more likely to apply a minimum standard with regard to this in principle limitless resource. The overall increase in students being recommended for the academic track in recent decades (Dietze, 2011) suggests that teachers may be applying minimum standards more and more. Importantly, qualitative and quantitative research has shown that teachers do vary in the leniency of their recommendations (Baeriswyl et al., 2011; Maier, 2007; Pohlmann-Rother, 2010), but does not allow conclusions regarding the standard used more generally.
Additionally, structural factors may influence whether minimum or confirmatory standards are used. In some German federal states, teachers' recommendations are binding, whereas in other states, parents can ultimately decide the secondary school track for their child. If teachers' recommendations are binding, teachers are held accountable for their decisions to a greater extent, that is, they may need to justify their decisions to others (Tetlock, 1983). This may make them more cautious not to hinder students from attending more challenging tracks or not to risk conflict with parents who disagree with the decision. Experimental evidence suggests that greater accountability increases judgment accuracy in teachers (Pit-ten Cate et al., 2020) and reduces negative biases against ethnic minority and low-SES students (Glock et al., 2012; Krolak-Schwerdt et al., 2013; Pit-ten Cate et al., 2016), though the authors attribute this effect not to different standards but to less reliance on heuristics such as group membership. While large-scale studies show that binding teacher recommendations lead to lower inequality, for example, based on social class, in children's actual enrollment in a specific secondary track (e.g., Dollmann, 2011), no such research has been conducted regarding the effect of binding decisions on teacher recommendations themselves. However, Dietze (2011) showed that teachers have historically given more academic track recommendations overall in German federal states where recommendations are binding, which may be due to teachers applying minimum standards with regard to students' suitability. This difference, however, narrowed in the decade before Dietze's study in the subset of federal states he examined.
Overall, prior research provides some indication that the standards applied in tracking recommendations might be minimum standards and that greater accountability could lead to greater use of minimum standards, though neither notion has been tested directly. Thus, understanding whether teachers generally apply minimum or confirmatory standards is central to understanding the impact of shifting standards in tracking recommendations and its implications. Moreover, different standards may be applied to different groups: The absence of a bias against ethnic minority students in de facto tracking recommendations once achievement (and sometimes social class) is accounted for speaks against the use of confirmatory standards. However, students from low-SES families still receive more lower-track recommendations once achievement is controlled (Ditton et al., 2005; Schneider, 2011), which is indicative of confirmatory standards. Thus, intersectional analyses, particularly of migration background and social class, are needed in the future.

Limitations
Two characteristics of the sample may limit generalizability of the results. First, due to an oversight, participants' gender was not collected. At the time the research was carried out, teacher students at the university were 65% female. As stated above, 95% of participants spoke German as a mother tongue. Overall then, the sample likely consisted mainly of female teacher students without a migration background, a group that is positively stereotyped when it comes to academic achievement. Prior research established that those who are stereotyped to possess a trait (e.g., suitability to academic pursuits) shift their evaluations the most (Biernat et al., 2008). Therefore, our sample might have been especially likely to shift standards with regard to students with a Turkish migration background.
Second, our sample consisted of teacher students rather than teachers practicing in schools. Teachers accumulate expertise over the years, which may improve their decision accuracy in tracking recommendations. Indeed, experienced teachers focus less than teacher students on information not directly related to achievement (father's occupation, migration background, non-learning-related behaviors) when making recommendations, resulting in recommendations less biased against students with a low socioeconomic status (Böhmer et al., 2017; but see also Glock et al., 2015). Therefore, they may exhibit less bias in recommendations regardless of the implied standard. While research on shifting standards in samples of experienced teachers is warranted, the demonstration of such an effect in teacher students is highly relevant. After all, expertise is acquired over years of experience, and until novice teachers have acquired such expertise, they will have influenced many children's lives through their recommendations.

Conclusion
The results of the present study demonstrate for the first time that the direction of ethnic biases in tracking judgments varies depending on the framing of standards as minimal or confirmatory. Specifically, students with a Turkish migration background needed to provide less behavioral proof of their suitability for the academic track when standards were minimal, but tended to need more proof than their peers without a migration background when standards were confirmatory. Contrary to our predictions and prior empirical findings, no differential effects by the gender of the student were observed.
This research adds to our understanding of the complexity of the effects of social category membership in the school context. Rather than indicating the presence of straightforward positive and negative biases, it shows that the framing of the standard is another factor that influences teachers' evaluations of students' academic potential. Developing an evidence-based understanding of the optimal standard of certainty for unbiased tracking recommendations is a necessary precondition for further improvements at the level of both the school system and individual teachers. Potentially, creating clearer guidelines for tracking could lead to a consistent standard that neither disadvantages nor advantages students who are members of various social categories. For example, the complementary reliance on results of standardized cognitive ability tests would reduce social inequality based on parents' education and, to a lesser extent, migration background in tracking recommendations (Steinmayr et al., 2017). Though such an intervention does not reduce the existing differences in performance between students of different backgrounds, it may ensure that they are not further exacerbated by teachers' recommendations.

Appendix
Acknowledgements We would like to thank Liesel Heiermann for organizing and conducting the data collection.
Funding Open Access funding enabled and organized by Projekt DEAL.
Availability of data, material, and code Data, material, and code are available upon request.

Competing interests
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.