Stereotype threat in learning situations? An investigation among language minority students

Stereotype threat (ST) is a potential explanation for inequalities in language competencies observed between students from different language backgrounds. Because language competencies are an important prerequisite for educational success, this explanation warrants investigation. While ST effects on achievement are empirically well documented, little is known about whether ST also impairs learning. Thus, we investigated vocabulary learning in language minority elementary school students, also searching for potential moderators. In a pre-post design, 240 fourth-grade students in Germany who were on average 10 years old (M = 9.92, SD = 0.64; 49.8% female) were randomly assigned to one of four experimental conditions: implicit ST, explicit ST without threat removal before posttest, explicit ST with threat removal before posttest, and a control group. Results showed that learning difficult vocabulary from reading two narrative texts was unaffected by ST. Neither students’ identification with their culture of residence and culture of origin nor their identification with the stereotyped domain of reading were moderators. The findings are discussed with regard to content and methodological aspects, including the possibility that a motivation effect undermined a possible ST effect. Implications for future research include examining at what age children become susceptible to ST and whether students have internalized negative stereotypes about their own group, which could increase the likelihood of ST effects occurring.


Introduction
In recent years, the cultural and linguistic diversity of students and thus of school classes has increased worldwide (OECD, 2019). Large-scale assessments such as the Programme for International Student Assessment (PISA) for high schools and the Progress in International Reading Literacy Study (PIRLS) for elementary schools have repeatedly shown that there are, on average, differences in achievement in various domains between language minority and language majority children (Mullis et al., 2017; OECD, 2019). Several reasons for these disparities are discussed and investigated. In addition to differences in socioeconomic status, psychological processes concerning stereotypes and stereotype threat (ST) have been shown to partly explain achievement differences between immigrant and non-immigrant students (e.g., Appel et al., 2015; Steele & Aronson, 1995). ST describes the situation in which knowledge of a negative stereotype about a group to which one belongs triggers the threat of confirming this stereotype oneself (Steele & Aronson, 1995). ST impairs achievement, thus contributing to a confirmation of the negative stereotype (Baysu & Phalet, 2019; Steele & Aronson, 1995). In achievement situations, ST is empirically well-researched (Appel & Kronberger, 2012; Flore & Wicherts, 2015), but less is known about whether ST also impairs learning, such as the acquisition of new vocabulary (Rydell & Boucher, 2017). First empirical findings suggest that vocabulary growth can be negatively influenced by ST as early as in elementary school (Sander et al., 2018). This is particularly worrisome, as elementary school years are of crucial importance for further educational pathways, and a strong command of the language of instruction is a prerequisite for future educational success (e.g., Biemiller, 2005).
Numerous studies have shown that various variables, for example, the identification with the culture of residence and culture of origin as well as identification with a particular academic domain, can mitigate or enhance ST effects in achievement situations (e.g., Baysu & Phalet, 2019; Pansu et al., 2016). However, it is unclear whether these variables moderate ST effects in learning situations. Thus, we examine potential effects and possible moderators of ST in a vocabulary learning situation among language minority elementary school students.

Importance of vocabulary
Vocabulary, as the entirety of words in the mental lexicon, is a prerequisite for reading, listening, and understanding spoken and written language and is therefore highly relevant for both academic success and later professional success (e.g., Graves, 2016). Elementary school years are of particular importance for vocabulary acquisition and promotion, as children learn an average of 1,000 new words per year during this period (e.g., Biemiller, 2005). Strategies promoting vocabulary can be distinguished according to the kind of instruction, either implicit (e.g., reading texts; McElvany & Artelt, 2007; Vidal, 2011; Webb, 2008) or explicit (e.g., vocabulary learning; Elgort, 2011; Nation, 2013). Implicit instruction focuses on the meaning aspect of language, whereas explicit instruction aims to teach grammar and vocabulary systematically (DeKeyser, 2003; Ellis et al., 2009). There is evidence that combining both is effective for vocabulary acquisition (Karami & Bowles, 2019; Marulis & Neuman, 2010; Stanat et al., 2012). With regard to vocabulary acquisition among language minority children, McElvany et al. (2017) showed that learning from context (reading a German-language text with target words that can be deduced from the context of the text) was effective compared to a control group (reading a German-language text without target words).

The phenomenon of stereotype threat
Stereotypes generally refer to beliefs about the characteristics and attributes of a group and its members (Dovidio et al., 2010). Between the ages of two and five, children begin to develop stereotypes, for example, related to gender (Martin & Ruble, 2010). Cognitive abilities and conceptual understanding continue to develop with age such that categorization processes leading to stereotypes are no longer based solely on perceptual differences but also on internal, abstract attributes (Baron & Banaji, 2006; Bar-Tal, 1996; Kite & Whitley, 2016). Stereotypes can be activated automatically and unconsciously and thus can influence the perception of groups and their members as well as the behavior displayed towards them (Dovidio et al., 2010). Research on ST originated in the USA with the seminal investigations by Steele & Aronson (1995), who focused on lower achievement outcomes under ST among ethnic minorities on standardized tests. In their fourth experiment, the authors showed that when Black American undergraduates were asked about their ethnicity before solving difficult verbal ability items, they performed worse on those items compared to White American undergraduates (Steele & Aronson, 1995). Their studies led to extensive research on this phenomenon (e.g., Appel et al., 2015; Nadler & Clark, 2011; Nguyen & Ryan, 2008). With respect to students of Turkish origin, Martiny et al. (2014) found for ninth-graders that students of Turkish origin who were threatened scored lower than natives and also scored lower compared to students of Turkish origin in the control condition.

Activation of stereotype threat
A more differentiated picture of ST emerges when a distinction is made regarding the explicitness of the threat activation. An implicit threat is given, for example, by having research participants indicate their ethnicity via their country of birth and family language, without giving a direct cue about their group's disadvantaged position (Sander et al., 2018; Shewach et al., 2019). Ambady et al. (2001) administered ST implicitly by presenting a short questionnaire to children in grades 3 to 8, including questions about the language spoken at home, before they took a math test. The results indicated that the subtle activation of negative stereotypes impaired Asian American girls' achievement but not Asian American boys' achievement.
Explicit threat is administered by directly referring to achievement differences between groups (e.g., Keller & Dauenheimer, 2003). Likewise, Sander et al. (2018) explicitly activated ST by pointing out to their participants that those who (even sometimes) speak a language other than German at home face problems learning new unknown vocabulary. In their meta-analysis, Nguyen & Ryan (2008) distinguished between implicit and explicit activation, with the latter additionally differentiated into moderately explicit (direct evidence of group differences) and blatantly explicit (direct evidence that one group outperformed the other group). For minorities, they found that a moderately explicit threat led to larger ST effects compared to blatant activation, and this in turn led to larger effects than implicit activation (d = 0.64 vs. d = 0.41 vs. d = 0.22). Similarly, Appel et al. (2015) revealed that while all three forms of activation led to achievement deficits, moderately explicit activation yielded the largest effect for people with an immigrant background.
Numerous studies have examined ST in achievement situations, where it is empirically well-established (e.g., Appel et al., 2015; Spencer et al., 2016). Here, an implicitly or explicitly activated ST impairs access to or application of knowledge or skills the person has previously acquired (Nguyen & Ryan, 2008; Steele & Aronson, 1995). Little is known about whether ST also affects the ability to gain knowledge in a learning context (Rydell & Boucher, 2017; Taylor & Walton, 2011). In our research, we had children work on a vocabulary learning task while being exposed to different forms of ST. Whereas most studies have investigated ST effects on achievement in mathematics or sciences (e.g., Flore & Wicherts, 2015; Neuville & Croizet, 2007), we focused on the less researched domain of language competency, which is of particular importance to a group especially vulnerable to ST: language minority children.

Stereotype threat and learning
In a learning situation, individuals acquire new knowledge and skills by processing new information and building a coherent representation in long-term memory (McDaniel et al., 2014). In achievement situations, ST can impair the efficiency of working memory (Schmader et al., 2008), while Boucher et al. (2012) assumed that ST in learning situations interferes with encoding the content from the learning phase. The authors suggested that ST can be examined in a learning situation by comparing a condition in which the threat is removed before the achievement situation to a condition in which the threat is not removed (Boucher et al., 2012).
One study separating learning and achievement situations was conducted by Boucher et al. (2012). The authors found that female undergraduates in mathematics showed lower learning outcomes in an ST condition and in a condition with ST removal after the learning phase compared, respectively, to a control group and a condition where the threat was removed before the learning phase. Furthermore, a study by McLaughlin Lyons et al. (2018) showed for a sample of fifth-grade students from different ethnic minority groups that in a videotaped challenging mathematics lesson, students in the ST condition had lower learning growth compared to the control group. Taylor & Walton (2011) also investigated ST in a learning situation and focused on the learning of difficult and rarely used words among African American university students. Students who had to learn under ST remembered fewer words after a time interval of 1 to 2 weeks than students who had not learned under threat. Sander et al. (2018) examined ST in a vocabulary learning situation among 118 language minority elementary school children in Germany. In a pre-post design, the children were assigned to one of three experimental ST conditions (implicit, explicit, and control). The threat was administered before the learning situation, in which the children had to learn difficult words from narrative texts. Afterwards, they completed a vocabulary posttest. The results indicated that vocabulary growth was lower in both ST conditions compared to the control condition, indicating that an ST effect occurred in learning situations. However, due to the design, with no removal of the threat before the posttest, the findings cannot solely be attributed to the threat affecting the learning situation. Thus, it remains unclear whether ST had an effect on the learning or achievement situation, as it is also possible that children were less able to retrieve their knowledge in the posttest due to the threat.
To sum up, initial studies indicate that in addition to achievement, learning can also be influenced by ST.

Person-related moderators of stereotype threat
Various variables may decrease or increase ST vulnerability (e.g., Appel et al., 2015; Steele, 1997). ST research provides broad findings on factors that can mitigate or enhance ST effects (Pennington et al., 2016; Spencer et al., 2016). In addition to situational factors, personal factors such as group and domain identification are of high importance (Steele et al., 2002). Therefore, we focused on identification with the culture of residence and culture of origin as well as identification with the domain of reading.
Ethnic identity begins to develop during middle childhood. Individuals with an immigrant background can develop both an identity as a member of their culture of origin and one as a member of their culture of residence (Zander & Hannover, 2013;Berry et al., 2006;Ruble et al., 2004). Identification with the culture of residence and origin can be important personal factors related to ST (Baysu & Phalet, 2019;Weber et al., 2015). According to social identity theory (SIT) (Tajfel & Turner, 1986), individuals strive for a positive social identity based on comparison processes with social groups. Therefore, it can be assumed that individuals may be affected by ST when they identify highly with a stereotyped group. For example, Weber et al. (2015) examined both identification with the culture of origin and the culture of residence in a sample of eighth-graders with an immigrant background in Austria. Students under explicit threat exhibited better cognitive achievement when they identified highly with Austria (culture of residence), independently of their identification with their culture of origin. In contrast, students' achievement in the control condition and in the implicit threat condition was unrelated to identification with Austria. Furthermore, two studies by Baysu & Phalet (2019) with Turkish origin and Moroccan origin minority students in Belgian secondary schools revealed that a dual identity can either promote or hinder minority achievement depending on stereotype threat experienced during a verbal test. In low threat situations, dual-identity students showed higher achievement and higher self-esteem than otherwise-identified students in the control condition. In high threat situations, dual-identity students performed worse and reported more anxiety compared to the control condition. 
In their meta-analysis, Nguyen & Benet-Martínez (2013) found, when focusing on people between 10 and 70 years, a strong and positive association between individuals having dual identities and their psychological and sociocultural adjustment compared to individuals who identified with only one of the two cultures. In a study by Armenta (2010), however, the relevance of identification with the culture of origin in a sample of undergraduate students was shown. High ethnic identification led to weaker achievement in the presence of negative achievement stereotypes (Latinos) and to stronger achievement in the presence of positive achievement stereotypes (Asian Americans). In contrast, lower ethnic identification did not have an effect regardless of the achievement stereotype activated. Similarly, Cole et al. (2007) reported that ethnic minority students who identified highly with their culture of origin were more vulnerable to ST. Concerning vocabulary learning situations, Sander et al. (2018) examined fourth-graders' ethnic identification using a single undifferentiated, nominally scaled item and found no moderation of the ST effect. Overall, empirical findings concerning identification with the culture of residence and origin are heterogeneous.
Another important personal factor is identification with the stereotyped domain. According to Steele's (1997) conceptualization, it is composed of the value and importance a person attributes to that domain and of the abilities one believes one has in that domain. It is assumed that high identification with the stereotyped domain will increase the pressure not to confirm the stereotype in that domain (Wasserberg, 2017). The results of the second experiment by Aronson et al. (1999) revealed that high identifiers (Asian students from university) performed less well in the threat than in the non-threat condition. Keller (2007) investigated identification with the domain of mathematics among tenth-grade students in Germany. Girls who identified highly with the domain of math had a loss of achievement in an ST condition compared to girls who identified less with that domain. With regard to the domain of reading, Pansu et al. (2016) showed in a sample of 80 French third-graders highly identified with the domain of reading that boys scored lower than girls on a reading test in a threat condition. The opposite was found in the reduced threat condition: Here, boys scored higher than girls.
In summary, we assumed that regarding the identification with the culture of residence, a high identification might lead to a weaker ST effect, because the threat might affect those students less given that identity could serve as a buffer. With respect to the identification with the culture of origin and the identification with the domain of reading, we expected those to enhance the ST effect because high identification with the culture of origin may increase sensitivity to negative stereotypes towards this group and high domain identification should generally increase the effect of threat (Steele et al., 2002) due to personal concernedness or importance. Both should correspondingly result in lower vocabulary growth.

Research questions
ST is a possible explanation for achievement differences based on ethnicity (e.g., Froehlich et al., 2018). Less is known with regard to ST effects in learning situations (Rydell & Boucher, 2017). Because disparities also exist in language competencies such as vocabulary and because vocabulary is of high importance for school and professional success, we focused on the effects of ST in vocabulary learning situations. Sander et al. (2018) revealed that ST impaired vocabulary learning, although it remained unclear whether the ST effect occurred in an achievement or a learning situation. Thus, we wanted to replicate and broaden the findings by Sander et al. (2018) with a larger sample size and an extended study design. Furthermore, we operationalized identification with the culture of origin in a more differentiated manner and included two other potential moderators in order to obtain a more fine-grained picture. We addressed the following research questions:
1. Do language minority children exhibit lower growth in vocabulary in the presence of (a) implicit and/or (b) explicit ST without removal of the threat before posttest (hereinafter referred to as explicit without removal) relative to a condition without ST? For both ST conditions, we expected that language minority students would learn on average fewer words than students in the control condition (1a). Also, the extent of the ST effect should be larger in the explicit condition compared to the implicit condition (1b).
2. Do language minority students differ in their vocabulary learning in the explicit ST condition with removal of the threat before posttest (hereinafter referred to as explicit with removal) and without removal? As this tested whether ST indeed affects the learning rather than the achievement situation, we assumed that vocabulary growth would be similar in both conditions (2).
3. To what extent is the expected ST effect on vocabulary growth moderated by (a) identification with the culture of residence and (b) origin and/or (c) identification with the stereotyped domain of reading?
We expected that the ST effect would be lower for language minority children who highly identified with the culture of residence, indicated by greater vocabulary growth compared to children who identified more weakly with the culture of residence (3a). For language minority students who highly identified with the culture of origin, we assumed that the ST effect would be larger, resulting in lower vocabulary growth (3b). Furthermore, we expected a larger ST effect for language minority children who highly identified with the reading domain and thus lower vocabulary growth compared to children who identified more weakly with the domain of reading (3c).

Participants
Data for this study were collected in spring 2019 in the context of the project Effects and moderators of stereotype threat in vocabulary learning situations among students with immigrant background in elementary and secondary schools (ST²). A total of 822 elementary school students from 46 fourth-grade classes in 30 schools in North Rhine-Westphalia participated. Language majority students, children with special educational needs, and one child with implausibly high gains (maximum + 3 SD) between pre- and posttest were excluded from the sample. Therefore, the analyses were based on n = 240 language minority students (49.8% female) drawn from all 46 classes, who were just under 10 years old on average (M = 9.92, SD = 0.64). As the study focused on ST in the context of vocabulary acquisition, language minority status was operationalized based on family language ("I sometimes speak German at home and most of the time another language: ___________"/ "I never speak German at home, but I speak _________."). There were no statistically significant differences between the four experimental conditions in sex, age, cognitive abilities, and number of books at home as an indicator of socioeconomic status (see Table A, Supplement 1).

Experimental design and procedure
In order to test the impact of different ST conditions, a pre-post design was used (see Fig. 1). Prior to data collection, students were randomly assigned to one of four conditions: (a) implicit, (b) explicit without removal, (c) explicit with removal, and (d) control group. Each child received a tablet on which the experimental procedure was implemented and on which they entered their answers. We used the open source software OpenSesame (Mathôt et al., 2012) to program the experiment. The study was carried out by trained research assistants who used a standardized test manual. Participation was voluntary. Declaration of consent was given by parents before data collection. Data collection lasted for two consecutive 45-min lessons. In the first lesson, children were asked how strongly they identified with the domain of reading and worked on a vocabulary pretest to assess their vocabulary with regard to the texts they would have to read in the subsequent learning units (see section "Instruments"). After the pretest, the experimental manipulation was administered. Students in the implicit threat condition answered questions about their language spoken at home and both their and their parents' country of birth. Students in the explicit threat condition read a short text and were informed that children who speak a language other than German at home have difficulties learning new words. The explicit condition with removal was configured following Boucher et al. (2012). Here, the threat was the same as in the explicit condition without removal, but students were informed before the last posttest that irrespective of which languages they speak at home, all children can learn equally well. Children in the control group did not receive any kind of threat. They answered questions concerning their favorite drink and meal. Following Nguyen and Ryan (2008), the implicit induction of threat can be classified as subtle and the explicit induction as blatantly obvious.
Each experimental condition was followed by two learning units with a corresponding vocabulary posttest (see Fig. 1). In each learning unit, students read a narrative text containing target words (see section "Instruments"). The meaning of the target words could be deduced from the text context. After reading these texts, children answered two multiple-choice questions to ensure that they had read the texts carefully. In addition to the implicit learning task, an explicit learning element was added: students worked on a synonym game in which they had to assign synonyms from a list (not the same synonyms as in the vocabulary test) to the target words from the text. Subsequently, the correct solution to the synonym game was presented to every student. The posttest followed the synonym game, except for the explicit condition with removal. Here, the threat was removed before the children completed the last posttest. After a short break, students completed a second lesson. They worked on a cognitive ability test and answered questions regarding social demographics as well as their identification with the culture of residence and origin. Lastly, students in the implicit, explicit without removal, and control condition were also informed that all children can learn difficult words equally well, regardless of whether they speak a language other than German at home.
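The random assignment to the four conditions can be illustrated with a minimal sketch. The source does not describe the randomization procedure, so the balanced shuffle, the condition labels, the seed, and the function name `assign_conditions` are all assumptions made for illustration only.

```python
import random
from collections import Counter

# Condition labels are hypothetical shorthand for the four conditions in the text.
CONDITIONS = [
    "implicit",
    "explicit_without_removal",
    "explicit_with_removal",
    "control",
]


def assign_conditions(student_ids, seed=0):
    """Shuffle the students, then deal them round-robin into the conditions.

    This yields group sizes that differ by at most one. It is an assumed
    procedure, not necessarily the one used in the study.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    return {
        sid: CONDITIONS[i % len(CONDITIONS)]
        for i, sid in enumerate(shuffled)
    }


# Hypothetical example with 12 student IDs.
assignment = assign_conditions(range(1, 13))
group_sizes = Counter(assignment.values())
```

Because assignment took place before language majority students were excluded, group sizes in the analysis sample need not be equal even under a balanced scheme like this one.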

Vocabulary test
The vocabulary pre- and posttest consisted of 18 target words and three icebreaker items to provide a positive beginning to the vocabulary test (McElvany et al., 2017). For each target word (e.g., "trivial"), a corresponding synonym had to be selected, which was presented together with four distractors (e.g., "triple/dry/sad/simple/wet"). Answers in the pre- and posttest were dichotomously coded (0 = incorrect or not completed; 1 = correct). Thus, children could achieve between 0 and 18 points. The reliability of the pre- and posttest was satisfactory.
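The dichotomous coding described above amounts to a simple sum score. The following sketch assumes that a missing answer is represented as `None`; the function name and the three-item excerpt are hypothetical, with only the "trivial"/"simple" pair taken from the example in the text.

```python
from typing import Optional, Sequence


def score_vocabulary_test(
    responses: Sequence[Optional[str]],
    answer_key: Sequence[str],
) -> int:
    """Return the sum score: 1 per correct item, 0 if incorrect or not completed."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    return sum(
        1
        for given, correct in zip(responses, answer_key)
        if given is not None and given == correct
    )


# Hypothetical excerpt: item 1 correct, item 2 not completed, item 3 incorrect.
answer_key = ["simple", "huge", "quick"]
responses = ["simple", None, "slow"]
example_score = score_vocabulary_test(responses, answer_key)
```

With the full 18-item key, the same function yields scores in the range 0 to 18.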

Learning material
Each text in the learning unit was age-appropriate and encompassed about 300 words with nine target words (three nouns, three verbs, and three adjectives). Both learning texts were selected from the intervention study Potential of the native language to reduce educational inequality-Vocabulary acquisition before central transitions of the education system (InterMut) and have been shown to produce good learning growth rates (cf. McElvany et al., 2017). One text was a detective story about a missing elephant in a zoo, and the other was about a child who suffers a mishap at home.

Sociodemographic data
In addition to age and gender (0 = boy; 1 = girl), family language as well as the child's and his/her parents' country of birth (0 = Germany; 1 = other) were assessed. Students also indicated the number of books at home (Wendt et al., 2016). Five answers could be selected: from 1 = none or very few (0-10 books) to 5 = enough to fill three or more shelves (200 books).

Moderators of stereotype threat
Students' identification with the culture of residence (Germany) was measured with items from the affective dimension of the scale for identification with Germany (Zander & Hannover, 2013). The six items were adapted to make them easier to understand for fourth-graders (e.g., "I have a good feeling when I think about Germany"). The scale provided information about the extent to which students identify with Germany. Furthermore, the children answered six items regarding identification with their culture of origin. The scale covered how strongly they feel connected to their own or their parents' country of origin (e.g., "I feel strongly connected with this country and this culture"). These items were also adapted from the original items by Zander & Hannover (2013). In order to capture identification with the reading domain, items by Keller (2007) and Arens et al. (2011) were modified. The scale consisted of four items and indicated how much learners identify with this particular academic domain (e.g., "It is important to me that I am good at reading"). All items were measured on a 4-point Likert scale (1 = strongly disagree to 4 = strongly agree). Table 1 contains scale characteristics. For subsequent analyses, we dichotomized all three variables using a median split (0 = low identification, 1 = high identification).
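A median split of the kind described above can be sketched as follows. The source does not say how scores equal to the median were handled, so coding them as low identification (0) is an assumption; the function name and the scale means are hypothetical.

```python
from statistics import median


def median_split(scores):
    """Dichotomize scale scores: 1 = above the sample median, 0 = otherwise.

    Assumption: scores exactly at the median are coded 0 (low identification).
    """
    cut = median(scores)
    return [1 if score > cut else 0 for score in scores]


# Hypothetical scale means on the 1-4 Likert range.
identification = [1.5, 2.0, 3.0, 3.5, 4.0]
high_identification = median_split(identification)
```

The tie-handling rule matters for skewed scales such as these, where many students score near the top of the range and the median may coincide with a frequently observed value.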

Cognitive abilities
The figural subtest of the standardized German cognitive ability test for grades 4 to 12 (Kognitiver Fähigkeitstest [KFT] 4-12 R; Heller & Perleth, 2000) was used to measure cognitive abilities. Following ST theory, cognitive abilities were included as an important control variable because the theory postulates that effects of ST are found despite similar cognitive abilities. In addition, given the background of a language-based ST, a figural, language-free subtest was explicitly chosen to examine cognitive abilities independent of linguistic abilities. The test consists of 25 items, which were dichotomously coded (0 = incorrect or incomplete; 1 = correct). Between 0 and 25 points could be achieved. The children were shown two objects that have a certain relation to each other (e.g., little black circle to large white circle). They were then shown other objects (e.g., little black triangle) and had to select the appropriate analogue object (e.g., large white triangle) from five objects.

Statistical analyses
SPSS 27 was used for descriptive statistics and statistical analyses. An a priori sensitivity analysis with G*Power revealed that n = 44 participants were required for each of the four conditions (N = 176) (Faul et al., 2007). Results were considered statistically significant if the p value was ≤ 0.05. As effect size measures, partial eta squared and Cohen's d were reported (Cohen, 1988). Statistical power was calculated a posteriori using G*Power (Faul et al., 2007). The posttest consisted of 18 words and was composed of the nine words from both posttests 1 and 2. In order to investigate ST's impairment of vocabulary growth in research question 1, we calculated a repeated measures ANOVA with planned contrasts. The within-subject variable was the vocabulary pre- and posttest, and the between-subject variable was the ST condition (three levels; implicit, explicit without removal, control group). For the second research question, we also conducted a repeated measures ANOVA with condition as the between-subject variable (two levels; explicit with/without removal). In addition to classical inference testing using confidence intervals and p values, we conducted Bayesian parameter estimation for the first and second research questions with the open source program JASP (JASP Team, 2020; Wagenmakers, Love, et al., 2018). Bayesian estimation was used to provide additional assurance regarding possible ST effects in learning situations because the Bayes factor can quantify evidence for the null hypothesis (for more advantages, see Wagenmakers, Marsman, et al., 2018). To investigate research question 3, we carried out six moderation analyses in order to obtain a differentiated picture of the ST conditions.
In the repeated measures ANOVAs, we entered the dichotomized moderators (identification with culture of residence, identification with culture of origin, identification with the domain of reading) and the conditions (implicit, explicit without removal, and control; explicit with and without removal). The vocabulary pre- and posttest was the within-subject variable. Listwise deletion was used to handle missing data. The number of missing values was less than 4.6%.
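With only two measurement points, the time × condition interaction of such a repeated measures ANOVA is equivalent to a one-way ANOVA on the gain scores (posttest minus pretest). The sketch below illustrates that equivalence with the standard library only; it is not the SPSS procedure used in the study, and the function name and toy data are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical (pretest, posttest) pairs per condition.
data = {
    "control": [(4, 8), (5, 8), (3, 8)],
    "implicit": [(4, 6), (5, 8), (3, 7)],
    "explicit_without_removal": [(4, 5), (5, 7), (3, 6)],
}

# Gain score per child: posttest minus pretest.
gains = [[post - pre for pre, post in pairs] for pairs in data.values()]
f_interaction, df1, df2 = one_way_anova_f(gains)
```

A small F on the gain scores corresponds to the absence of a time × condition interaction, that is, to similar vocabulary growth across conditions.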

Descriptive findings
Descriptive analyses (see Table 1 and Table A in Supplement 1) revealed that children knew on average four of the target words in the pretest (M = 4.60, SD = 2.81) and eight words in the posttest (M = 8.56, SD = 3.80). Furthermore, a statistically significant and large correlation between vocabulary pre- and posttest was found, indicating a strong positive association (Cohen, 1988). Additionally, there were positive, moderately strong correlations between both pretest/posttest and cognitive abilities. These coefficients indicate that higher cognitive abilities were associated with higher scores on the vocabulary tests. Furthermore, learners identified highly with the culture of residence and culture of origin on average. Both mean values deviated statistically significantly and substantially from the theoretical mean of 2.5 in the positive direction (i.e., above the mean): identification with the culture of residence, t(235) = 10.33, p < 0.001, d = 0.67; identification with the culture of origin, t(228) = 16.64, p < 0.001, d = 1.10. The theoretical mean of 2.5 would indicate a neutral response. The effects can be classified as medium and large (Cohen, 1988).
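The comparison against the theoretical scale mean is a one-sample t test. A minimal sketch follows, assuming that Cohen's d was computed as the mean difference divided by the sample standard deviation, in which case t = d · √n. The function name and the toy scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t_and_d(values, mu0):
    """Return (t, d, df) for a one-sample t test against mu0.

    Assumes d = (M - mu0) / SD with the sample SD, so that t = d * sqrt(n).
    """
    n = len(values)
    d = (mean(values) - mu0) / stdev(values)
    t = d * sqrt(n)
    return t, d, n - 1


# Hypothetical scale scores tested against the theoretical mean of 2.5.
scores = [3, 3, 4, 2, 3]
t_value, cohens_d, df = one_sample_t_and_d(scores, 2.5)
```

Under this assumption, the reported d = 0.67 with df = 235 (n = 236) implies a t value of roughly 10.3, in line with the reported t(235) = 10.33 up to rounding of d.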

Vocabulary growth in the implicit and explicit without removal ST conditions
Regarding the question of whether language minority children show lower vocabulary growth in the (a) implicit and/or (b) explicit ST condition without removal, relative to a control condition, the repeated measures ANOVA revealed a statistically significant main effect of time (vocabulary pre- and posttest). It indicated a statistically significant vocabulary growth of four words on average across all three experimental conditions (pretest: M = 4.51, SD = 2.83; posttest: M = 8.31, SD = 3.88), F(1, 179) = 268.84, p < 0.001, partial η² = 0.60. This effect size represents a large effect (Cohen, 1988). Planned contrasts revealed no statistically significant difference in vocabulary growth between the implicit (M = 6.34, SD = 0.40) and the control condition (M = 5.89, SD = 0.40) of 0.48 (SE = 0.56), p = 0.212, but showed a statistically significant difference between the explicit without removal (M = 6.93, SD = 0.37) and the control condition (M = 5.89, SD = 0.40) of 1.04 (SE = 0.51), p = 0.028. Furthermore, there was neither a main effect of condition nor an interaction between time and condition. No ST effect on vocabulary growth was found; thus, the empirical data did not support hypotheses 1a and 1b. In the context of a Bayesian mixed-factor ANOVA, an examination of the Q-Q plots revealed that the assumption of normal distribution of the residuals was not violated. The Bayesian estimation (see Table B, Supplement 2) shows that the data were best represented by the model that included only time as a factor, supporting the results of the ANOVA using classical inference testing.
As students of Turkish origin represent the largest subgroup of language minority people in Germany and are also negatively stereotyped as a group low in language ability (Froehlich et al., 2016; Statistisches Bundesamt, 2021), we were interested in whether we would find ST effects in this subgroup. The subsample was based on 89 children of Turkish origin who were on average ten years old (M = 9.88, SD = 0.47; 45.5% female; implicit ST n = 24, explicit ST without removal n = 26, explicit ST with removal n = 19, and control condition n = 20). Regarding research question 1, the analysis showed a similar pattern of findings, as no ST effect on vocabulary growth was found, F(2, 67) = 0.93, p = 0.400. In addition, we conducted an analysis with children who were most likely to be threatened by language-related stereotypes. This subsample was also determined based on the language that participants reported speaking at home. We therefore excluded, for example, French- and English-speaking children (n = 25) from the sample of language minority students. Turkish-speaking children as well as, for example, Afghan-, Bosnian-, Moroccan-, and Romanian-speaking children remained in the sample. Thus, the sample for this analysis consisted of 157 children. This analysis also revealed no ST effect on vocabulary growth, F(2, 154) = 0.16, p = 0.854.

Vocabulary growth in the explicit ST condition with and without removal
The repeated measures ANOVA examining whether students' vocabulary learning differed in the explicit condition with and without removal revealed a statistically significant main effect of time (vocabulary pretest and posttest), F(1, 122) = 208.91, p < 0.001, partial η² = 0.63. This effect size was deemed large (Cohen, 1988). The main effect of condition and the interaction did not achieve statistical significance. Therefore, the results did not support hypothesis 2. Again, the Q-Q plots of the Bayesian mixed-factor ANOVA indicated that the assumption of normal distribution of the residuals was not violated. Table B in Supplement 2 shows that the model containing only time as a factor best represented the data compared to the other models, again confirming the findings of the ANOVA using classical inference testing.
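The partial eta squared values reported for the two main effects of time can be recovered from the F statistics and their degrees of freedom via the standard identity partial η² = (F × df1) / (F × df1 + df2). The short Python sketch below is a plausibility check under that identity, not a reproduction of the SPSS output; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic: (F*df1) / (F*df1 + df2)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of time, research question 1: F(1, 179) = 268.84
eta_rq1 = partial_eta_squared(268.84, 1, 179)
# Main effect of time, research question 2: F(1, 122) = 208.91
eta_rq2 = partial_eta_squared(208.91, 1, 122)
print(f"{eta_rq1:.2f} {eta_rq2:.2f}")  # prints "0.60 0.63"
```

Both results agree with the reported values of 0.60 and 0.63.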

Moderator analyses
In order to test whether ST effects on vocabulary growth were moderated by (a) identification with culture of residence, (b) identification with culture of origin, and/or (c) identification with the domain of reading, separate moderator analyses were conducted. The results revealed no moderation by identification with culture of residence, identification with culture of origin, or identification with the domain of reading (see Table 2). However, identification with the domain of reading was found to be related to vocabulary growth. The planned contrasts showed that for each moderator, the explicit without removal condition differed from the control condition (see Table C in Supplement 3). Hence, hypotheses 3a-c were not supported.

Discussion
Several studies have reported that language minority students showed on average lower vocabulary in the language of instruction compared to native students, whereby vocabulary is an important prerequisite for educational success. Therefore, we examined ST effects as a possible explanation for educational inequalities. More precisely, in a pre-post design, we investigated whether implicitly and/or explicitly induced ST has an impact on vocabulary acquisition and whether students' vocabulary learning differed for explicit ST with or without removal before posttest, meaning that ST was explicitly tested in a learning rather than an achievement situation (Boucher et al., 2012). Furthermore, we analyzed identification with the culture of residence, origin, and the domain of reading as potential moderators.
In summary, the results revealed that students had a vocabulary growth of four words on average, regardless of the experimental condition. The amount of growth was consistent with other studies that also focused on vocabulary growth from reading short texts (e.g., El-Khechen et al., 2012; Sander et al., 2018). Concerning the first research question, no ST effect was found in the learning situation, regardless of whether the threat was implicitly or explicitly induced. In light of the non-significant main effect of condition and the lower vocabulary growth in the control condition compared to the other conditions, the difference in planned contrasts between the explicit without removal and the control condition can be interpreted as a tendency towards stereotype reactance. Nevertheless, the absence of an ST effect is contrary to our expectations and not in line with previous findings (e.g., Hermann & Vollmeyer, 2016; Sander et al., 2018). Furthermore, regarding the second research question, there was no difference in vocabulary growth between the explicit ST condition with and without removal, indicating no ST effect in the learning situation. These findings are therefore inconsistent with previous research (Boucher et al., 2012; Rydell & Boucher, 2017). One explanation for these non-significant findings might be that ST effects have been frequently examined and found in laboratory settings and less often in real-world settings (Cullen et al., 2004; Stricker & Ward, 2004). A closer look at mean vocabulary growth among our four experimental groups revealed that children tended to learn at least as much in all ST conditions as in the control condition, although the differences were not statistically significant. Perhaps the claim that children who also speak a language other than German at home have difficulties learning vocabulary actually motivated the children to make an extra effort. 
Hence, the results might be interpreted in terms of a tendency towards stereotype reactance (e.g., Kray et al., 2001). Stereotype reactance is based on the theory by Brehm (1966) and is defined as reacting to the threat in a way that defies expectations, meaning that participants tend to refute the induced stereotype and thus increase their performance (Kray et al., 2001). Speaking against such an interpretation is that we only slightly adapted the experimental treatment by Sander et al. (2018), who did find the expected ST effect. Another possible reason could be that the children were unaware of a negative stereotype about families communicating in a language other than German, which is a prerequisite for ST effects to occur.  recently found that Turkish origin elementary school children in Germany hold no achievement-related negative stereotypes about people of Turkish origin. This could indicate that language minority children may be familiar with achievement-related stereotypes but have not internalized them due to their differentiated knowledge of their own group. Similarly, Shelvin et al. (2014) measured stereotype awareness in African American children aged 10 to 12 through a racial stereotype-generation task and found that not all children (44%) named the achievement-related stereotype Blacks are less intelligent than Whites. Children who mentioned this stereotype had a decrease in achievement on a vocabulary subtest compared to children who were unaware of the stereotype. Likewise, Wasserberg's (2014) findings for African American elementary school children showed that when the test was diagnostic of verbal skills, children who were aware of racial stereotypes performed less well than children who were unaware of them. Smith & Hopkins (2004) also found no ST effect in a sample of African American college students on either arithmetic or spelling tests. 
The authors assumed that "these students have not incorporated this stereotype into their cognitive schemas because of their own sense of competence" (Smith & Hopkins, 2004, p. 319). Furthermore, our results of no ST effect are consistent with the findings of Chaffee et al. (2020), who investigated the effect of explicit ST in four experiments involving men working on language-related tasks.
Moreover, our findings could be interpreted in light of the replication crisis and a possible publication bias (e.g., Ganley et al., 2013). Although the effects of ST have been empirically demonstrated by several studies (e.g., Appel et al., 2015; Pennington et al., 2016; Spencer et al., 2016), a study by the Open Science Collaboration (2015) on replicability in psychological science showed that only 36% of 100 replications yielded statistically significant results. Against this background, many studies examining ST have also investigated the possibility of publication bias. Publication bias was demonstrated and defined by Begg (1994, p. 402) as the fact "that there really are a number of small studies with effect sizes distributed around the null value, but most of these remain unpublished." Ganley et al. (2013) analyzed a sample of 931 students from childhood to adolescence and could not detect any ST effect regarding gender differences in mathematics. Additionally, the authors found that non-significant results were either not published or only published alongside significant results. Moreover, Shewach et al. (2019) examined the setting of the studies included in their meta-analysis and tested for possible publication bias. In line with Flore & Wicherts (2015), the authors found evidence of publication bias and argued that ST effects are inflated to a certain extent by the suppression of null results and the non-publication of non-significant findings (Shewach et al., 2019).
We also did not find that ST effects were moderated by children's identification with their culture of residence, culture of origin, or with the domain of reading. These results are contrary to findings for ST in achievement situations (e.g., Baysu & Phalet, 2019;Weber et al., 2018), where, for example, high domain identification has been shown to decrease achievement (e.g., Appel et al., 2011;Pansu et al., 2016;Steele, 1997). Regarding learning situations, the results on identification with culture of origin are consistent with previous research findings, which also found no moderating effect of this variable (e.g., Sander et al., 2018).

Limitations and future directions
Despite this study's important strengths, such as the pre-post design, certain aspects warrant attention. Due to the small size of the language minority subgroups (e.g., Arabic-, Russian-, Polish-, and Romanian-speaking children), separate analyses for these groups were not possible, although they might be more strongly or differently affected by a language-related threat. Future research may systematically compare students from different language groups, which would yield a more fine-grained picture of threat effects for different groups. To better understand the obtained null effects, it would also be beneficial to assess children's awareness of negative language-related stereotypes and include this as a potential confounding variable or moderator in the analyses. Additionally, further research should examine whether ST is a phenomenon that potentially only occurs in (vocabulary) achievement situations but not in (vocabulary) learning situations in actual classrooms. Moreover, it is not clear whether a motivation effect undermined the possible ST effects, meaning that the explicit threat might have been motivating for language minority students. This interpretation (stereotype reactance) is supported by the results of the planned contrasts.
Moreover, it is important to research at what age children become susceptible to ST. Likewise, it is relevant to examine the development and effects of stereotypes in similar learning situations in secondary school. It should also be examined whether elementary school students, as well as older students, have internalized negative stereotypes about their own group, making ST effects more likely. It would also be interesting to investigate ST effects longitudinally to test knowledge or retrieval after several weeks (e.g., Taylor & Walton, 2011). Further, it would be worthwhile to focus on another individual factor, namely, stress (e.g., Wolf, 2017), because stress seems to impair cognitive processes.
However, important strengths can also be mentioned. While previous research typically investigated ST in achievement situations, our study focused on ST in vocabulary learning situations. Going beyond Sander et al. (2018), we included an experimental condition in which ST was removed before posttest. Thus, we sought to determine whether ST in fact impaired children's learning, rather than access to previously acquired vocabulary in the achievement situation (cf., Boucher et al., 2012).

Conclusion
Overall, the present findings are inconsistent with published ST studies. Therefore, further research in this area is necessary to gain a better understanding of the phenomenon given the heterogeneous findings. Provided that the null results regarding vocabulary learning situations among language minority children are supported by further research, practical and theoretical implications can be derived. It might still be worthwhile to sensitize teachers with regard to stereotypes and their effects in order to reduce inequalities in the educational system and strengthen educational participation. More specifically, teachers should be made especially aware of the risk of activating stereotypes in achievement situations, where prior studies have revealed ST effects. In learning situations, activating negative stereotypes explicitly could be motivating. A theoretical implication could be a further differentiation of stereotype threat theory. For instance, the theory could differentiate between the type and domain of activated stereotypes (e.g., language-related vs. gender-related stereotype; language vs. math domain) and distinguish between learning and achievement situations. Further, the group of interest could be considered as a point of differentiation, e.g., migration background/language minority and/or gender. Thus, the implications of potentially threatening statements, including the emphasis of achievement differences or merely mentioning the results of large international student assessments, could be better understood by focusing on different groups of interest, systematically varying their numeric representation in a given educational context, and assessing the existence of a negative (or even positive) performance stereotype. This might help to better understand inconsistent findings and the critique of stereotype threat theory (Chaffee et al., 2020; Ganley et al., 2013; Shewach et al., 2019).
Funding Open Access funding enabled and organized by Projekt DEAL. This work was supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) [392231161].

Data availability
The data described in this article are openly available within the Open Science Framework at https://osf.io/dh9er/?view_only=a9b47b491cef45098efc6e8091d2ee6c.

Declarations
Ethics approval According to the unanimous positive vote of the Ethics Committee of the TU Dortmund University, the research project complies with the ethical guidelines for conducting scientific research. Participation was voluntary and took place only if parental consent was given prior to data collection.

Conflict of interest The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.