Despite educational policies aiming at equal opportunities for male and female students (Council of Europe, 2018; OECD, 2015), gender differences have repeatedly been reported for some domains in international large-scale assessments such as PIRLS, PISA, and TIMSS (for an overview see Stanat et al., 2018; Hannover & Wolter, 2021; Rosén et al., 2022). Across countries and languages, female students, on average, outperform male students in reading, and male students often outperform female students in math, though the gender gap in math is less consistent and typically less pronounced than in reading (Mullis et al., 2017, 2020; OECD, 2019). Such gender differences in academic achievement exist already at pre-school age (Lonnemann et al., 2013; Wolter et al., 2015), continue throughout primary school (Gentrup et al., 2022; Lorenz et al., 2023), and are still apparent in secondary school (OECD, 2019; Reinhold et al., 2019).

Various aspects contributing to these gender performance gaps have been investigated in previous research, among them domain-specific academic self-concepts (e.g., Jansen et al., 2019), gender stereotypes (e.g., Flore & Wicherts, 2015; Pansu et al., 2016), and behaviors beneficial for learning, such as self-discipline (e.g., Duckworth & Seligman, 2006). Gender role attitudes (GRAs) are another facet possibly contributing to gender disparities in school performance. GRAs represent the internalization and acceptance of socially shared descriptive and prescriptive characteristics of male and female individuals, including gendered behavioral expectations related to family, education, and employment (Davis & Greenstein, 2009; Ullrich et al., 2022; Wolter et al., 2015). Only a few studies have empirically investigated the association between GRAs and school success. While most of these studies show that traditional GRAs are generally associated with lower performance (Ehrtmann & Wolter, 2018; Hadjar & Lupatsch, 2010; Kessels & Steinmayr, 2013), others report the reverse pattern for certain subgroups based on their ethnic origin (e.g., Salikutluk & Heyne, 2014) or do not find any relevant association (e.g., Salikutluk & Heyne, 2017). Most of these studies measured performance with school grades (Hadjar & Lupatsch, 2010; Kessels & Steinmayr, 2013; Salikutluk & Heyne, 2014, 2017), which are, however, prone to gender bias (for a meta-analysis see Voyer & Voyer, 2014). Empirical evidence based on objective indicators of achievement, e.g., standardized performance tests, is very limited (for an exception, see Ehrtmann & Wolter, 2018). Moreover, only a few studies examined the association between students’ GRAs and their performance in both stereotypically male and stereotypically female domains (Ehrtmann & Wolter, 2018; Salikutluk & Heyne, 2014). Thus, it remains challenging to conclude which role GRAs play in boys’ and girls’ performances in gender-stereotypical and counter-gender-stereotypical domains.

In this study, we use performance test scores of a stereotypical female domain (i.e., reading) and a stereotypical male domain (i.e., math) of PISA-2009 data from Germany and compare students across and within genders to investigate if and how GRAs contribute to explaining gender disparities in reading and math.

1 Gender stereotypes, gender roles, and gender role attitudes

Before explaining in more detail how GRAs may be related to students’ performance, we describe the concepts of gender stereotypes, gender roles, and GRAs. There are different, sometimes not clear-cut and overlapping definitions of these concepts (Eckes, 2008). Integrating previous theoretical considerations, we define these concepts as follows: While gender stereotypes cover generalized beliefs about attitudes and behaviors females and males have or show (e.g., girls are good in reading, boys are good in math; Cvencek et al., 2011; Martinot et al., 2012; Nosek et al., 2002; Steffens et al., 2010; Tobin et al., 2010), gender roles incorporate socially shared normative and, thus, prescriptive expectations regarding attitudes and behaviors female and male persons should have or show (e.g., boys should play with trucks and like blue, while girls should play with dolls and like pink; Alfermann, 1996; Athenstaedt, 2003; Eagly et al., 2000; Eccles, 1987; Eckes, 2008). Mostly, these expectations focus on gender-typical behaviors with respect to the division of labor within the family and paid occupation (Alfermann, 1996; Ullrich et al., 2022). Within our understanding, individuals’ GRAs are an expression of both internalized gender stereotypes and gender roles and, thus, cover the subjectively perceived degree of appropriateness of these stereotypes and roles (cf. Becher & El-Menouar, 2014; Wolter et al., 2015). In line with other scholars (Athenstaedt, 2000; Kulik, 2002; Wolter et al., 2015), we conceptualize GRAs as a one-dimensional continuum differentiating between traditional and egalitarian views. Individuals with traditional GRAs expect women to be responsible for household duties and child care (“homemaker”; Davis, 1984, p. 403) and men to be in charge of the financial security of the family (“breadwinner”). Individuals with egalitarian GRAs are either convinced that men and women should share homemaker and breadwinner roles equally, or they reject the assignment of typical roles to individuals based on gender.

2 Multiple-route model of sex stereotypes and gender roles

According to the Multiple-Route Model of Sex Stereotypes and Gender Roles (Chalabaev et al., 2013), gender roles and gender stereotypes may affect academic success through an internalization and a situational route (Fig. 1). The internalization route is based on the expectancy-value theory by Eccles and colleagues (Eccles et al., 1983, 1990) and assumes that students with traditional GRAs accept gender stereotypes and gender roles typical of their cultural norms and incorporate these stereotypes and gender roles into their (gender) self-image (Ashmore & Del Boca, 1979; Hermann, 2020). Applying the internalization route, for example, to the math domain, girls with traditional GRAs might internalize the stereotype “girls cannot do math” and incorporate the gender role “math is not appropriate for girls,” leading to the (gender) self-image “Because I am a girl, math is not appropriate for me; thus, I should not be good in math.” Consequently, these girls may lose interest in math, withdraw from math activities, and possibly not make a great effort during math classes and performance tests (Chalabaev et al., 2013). In the situational route, girls might be confronted with the stereotype “girls cannot do math” in a specific situation, for example, when sitting a math exam. Out of fear of conforming to this stereotype, girls might struggle with task-irrelevant thoughts, which reduce working memory capacities and can result in lower performance—a process known as stereotype threat.

Fig. 1 Simplified Multiple-Route Model of Sex Stereotypes and Gender Roles (adapted from Chalabaev et al., 2013)

Stereotype threat effects (situational route) have been investigated in many studies (for a meta-analysis see Flore & Wicherts, 2015; for an overview of stereotype threat research see Spencer et al., 2016 and Stoet & Geary, 2012). Recent studies also examined the relation between the gender stereotypes of significant others such as parents, teachers, and peers and students’ individual cognitive and motivational-affective outcomes besides situational effects (Doornkamp et al., 2023; Henschel et al., 2023; Muntoni & Retelsdorf, 2019; Muntoni et al., 2021; Plante et al., 2013). Overall, these studies find that gender stereotypes and educational student outcomes are related (see however, Doornkamp et al., 2023). Regarding the type of relation, the results are, however, inconclusive. While some studies found that students’ gender stereotypes and achievement are directly related (Gentile et al., 2018; Plante et al., 2013; Song et al., 2016), others observed an indirect relation through motivational-affective outcomes such as academic self-concept, interest, motivation, effort, and anxiety (Froehlich et al., 2022; Henschel et al., 2023; Muntoni & Retelsdorf, 2019; Muntoni et al., 2021; Song et al., 2016). However, research on the relevance of GRAs for boys’ and girls’ differing school performance (internalization route) remains scarce.

3 Gender role attitudes and general school performance

Several theoretical assumptions on the relation between students’ GRAs and their educational success (e.g., performance, grades, qualification level) are suggested in the literature. While some of these focus on academic success in general, other theoretical reflections differentiate between gender-stereotypical and counter-gender-stereotypical domains, thereby proposing different associations between GRAs and girls’ and boys’ performance, depending on the domain.

Concerning GRAs and general academic success, scholars argue that traditional GRAs are associated with lower achievement, irrespective of gender and domain (DiPrete & Jennings, 2012; Hadjar et al., 2012). More specifically, scholars argue that boys with traditional GRAs internalize general school-related gender stereotypes, such as “school is feminine” and “males are left behind at school” (Hannover & Kessels, 2011; Heyder & Kessels, 2013; Xie et al., 2022). In consequence, these boys do not associate school success with themselves and may demonstrate inappropriate behaviors in school that negatively affect their academic success (see, e.g., DiPrete & Jennings, 2012). For female students, it is argued that girls with traditional GRAs view education as not essential because the skills and qualifications acquired in school are not necessary for their primary future role as homemakers (Hadjar et al., 2012; Suárez-Orozco & Qin, 2006). To sum up, despite different mechanisms, boys and girls should generally benefit from egalitarian GRAs.

Empirical findings support this argumentation line because traditional GRAs are related to lower school performance in most studies, even when considering individual (e.g., school type, general cognitive ability) and familial (e.g., parents’ education) characteristics (Hadjar & Lupatsch, 2010; Kessels & Steinmayr, 2013). However, these studies measured performance with school grades, which are prone to gender bias. On average, boys receive lower grades than girls, even when considering possible differences in achievement, general cognitive abilities, and motivational characteristics (Rüdiger et al., 2021; Voyer & Voyer, 2014). Additionally, these studies averaged school grades across domains, impeding comparisons between gender-stereotypical and counter-gender-stereotypical domains separately for girls and boys. However, theoretical considerations suggest that domain-specific effects are plausible, as outlined in the following.

4 Gender role attitudes in counter-gender-stereotypical domains

Concerning the relation between GRAs and achievement in counter-gender-stereotypical domains, the internalization route of the Multiple-Route Model of Sex Stereotypes and Gender Roles (Chalabaev et al., 2013) suggests advantages of egalitarian GRAs for boys in reading and for girls in math. According to the model, girls with egalitarian GRAs do not internalize the gender stereotype that “girls cannot do math.” Boys with egalitarian GRAs should not internalize the gender stereotype that “boys cannot read.” Thus, students with egalitarian GRAs should not be bothered by these negative stereotypes towards their gender. Hence, boys should profit from egalitarian GRAs in reading, while girls should benefit from egalitarian GRAs in math.

Studies separately investigating the role of GRAs for students’ performance in the domains of language and math yielded inconsistent results (Ehrtmann & Wolter, 2018; Salikutluk & Heyne, 2014, 2017), possibly due to different operationalizations of school performance. Ehrtmann and Wolter (2018) studied students’ achievement development from grade five to seven. Their results partly support the assumption of the internalization route, as the authors found a gender-differentiated effect in math but not in reading. In math, girls with egalitarian GRAs showed higher achievement gains from grade five to grade seven than their female peers endorsing traditional GRAs. In reading, boys’ and girls’ achievements equally increased when endorsing egalitarian GRAs. Based on their results, Ehrtmann and Wolter (2018) suggest that girls may generally benefit from egalitarian GRAs, while boys might only benefit from egalitarian GRAs in counter-gender-stereotypical domains. In contrast, Salikutluk and Heyne (2014, 2017) did not find significant relations between students’ GRAs and grades in German and math, except for girls of Turkish origin in math; however, in the opposite direction. Specifically, girls of Turkish origin with traditional GRAs had better math grades than boys of Turkish origin with similar traditional GRAs. In sum, empirical evidence is too limited and ambiguous to draw conclusions about the relation between students’ GRAs and performance in counter-gender-stereotypical domains.

5 Gender role attitudes in gender-stereotypical domains

When considering gender-stereotypical domains, two contradicting argumentation lines are plausible concerning the relation between GRAs and achievement. Students may benefit from egalitarian GRAs or benefit from traditional GRAs.

The first line of argumentation derives from the global assumption that egalitarian GRAs are beneficial for students’ school engagement: Egalitarian GRAs should be related to higher school success (see the section on GRAs and general school performance). Hence, boys and girls should generally benefit from egalitarian GRAs, including in gender-stereotypical school domains, i.e., boys should benefit in math and girls in reading.

The second line of argumentation refers to gender-related beliefs connected to GRAs. According to the internalization route of the Multiple-Route Model of Sex Stereotypes and Gender Roles (Chalabaev et al., 2013), students who are convinced that “girls are good in reading” and “boys are good in math,” incorporate these gender stereotypes into their self-image, aim at conforming to the respective gender role, and endorse traditional GRAs. Thus, students with traditional GRAs are more likely to invest in the domain they are supposed to be good at, according to gender stereotypical beliefs. Therefore, boys with traditional GRAs should perform better in math than boys with egalitarian GRAs. Correspondingly, girls with traditional GRAs should show advantages in reading compared to girls with egalitarian GRAs.

The relations between GRAs and performance in gender-stereotypical domains have yet to be explicitly tested separately for girls and boys. Hence, it needs to be investigated which argumentation line is empirically supported to better understand the role of students’ GRAs for explaining gender disparities in school success.

6 The present study

Although GRAs might be a possible explanation for gender disparities in school performance, few studies have separately investigated the role of GRAs in girls’ and boys’ achievement in female and male-stereotyped school domains. Previous studies used grades as an indicator of school performance or did not differentiate between gender-stereotypical and counter-gender-stereotypical domains. Hence, it remains to be seen if GRAs play the same role in boys’ and girls’ achievement in gender-stereotypical and counter-gender-stereotypical subjects. With our study, we aim to address this research gap by investigating the following research question: To what extent do GRAs contribute to explaining gender achievement gaps in reading and math? For the counter-gender-stereotypical domain, we assume that boys with egalitarian GRAs score higher in reading achievement than boys with traditional GRAs (Hypothesis 1). Accordingly, we expect girls with egalitarian GRAs to score higher in math achievement than girls with traditional GRAs (Hypothesis 2). As for the relation between GRAs and achievement in gender-stereotypical domains, two contradicting lines of argumentation are plausible (see the section on GRAs in gender-stereotypical domains). We will conduct exploratory analyses to investigate the difference in reading between girls with traditional GRAs and those with egalitarian GRAs. Similarly, we will compare boys’ traditional and egalitarian GRAs in relation to their math performance.

7 Methods

7.1 Sample

The current study uses data from the PISA-2009 cycle conducted in Germany (Klieme et al., 2013), as this is the only cycle that measured students’ GRAs within the “Supplementary Study on Migration” of the German research consortium of PISA (Hertel et al., 2014) in addition to achievement in reading and math. In this cycle, the sampling procedure in Germany differed from that in other countries. Instead of randomly choosing 15-year-old students, which is the standard procedure in PISA (OECD, 2012b), in Germany, two entire grade nine classes were drawn from the participating schools (Hertel et al., 2014). In Germany, PISA-2009 was conducted in April and May 2009. The German PISA-2009 sample comprised 9,461 students from 404 classes in 202 general education schools (Jude & Klieme, 2010). One student was excluded from the analyses due to missing information on their gender. The students of the analytic sample (N = 9,460) were, on average, 15.61 years old (SD = 0.63; 12.75–19.17 years). About half of them identified as female (49.70%).

8 Measures

8.1 Achievement in reading and math

In PISA-2009, reading achievement was assessed with 131 items within 37 reading units, based on a variation of the following text characteristics: situation (e.g., personal letter, schoolbook text), format (e.g., essay, table, diagram), type (e.g., narration, exposition, instruction), and cognitive process (e.g., retrieving, reflecting and evaluating) to ensure broad coverage of the domain (for more details see OECD, 2009, 2012b). Math achievement was assessed with 35 items within 18 units, thereby covering different subject matters (e.g., recognizing shapes and patterns in shapes, recognizing numerical patterns), processes (e.g., modeling, problem-solving), and situations in which math is needed (e.g., in daily life, banking, finance). Due to the application of a multi-matrix test design (i.e., students do not receive all items to solve, but a pre-defined selection and thus have missing values on the items they do not receive), the PISA consortium generated plausible values (PVs) based on multiple imputation, accounting for missing information as an approximation of individual achievement scores (OECD, 2012b). In the PISA-2009 Scientific Use File (SUF), five PVs are provided for each student for each domain. Across all countries, the metric of the scales was fixed to a mean score of 500 and a standard deviation of 100 (OECD, 2012b). In the analytical sample of the current study, the mean scores were M = 500.82 (SD = 86.08; 150.12–761.32) for students’ reading achievement and M = 513.53 (SD = 86.40; 207.05–781.51) for students’ achievement in math.

8.2 Gender

Students indicated their gender as female or male. We recoded this variable into a dummy variable (female = 1, male = 0).

8.3 Gender role attitudes

Students responded to nine items addressing gender division in education, family responsibilities, employment, and politics (Hertel et al., 2014; Krampen, 1979, 1983; Weidacher et al., 2001). A sample item is “Good school performance is more important for boys than for girls.” Table 1 lists all nine items. Answer options ranged from completely disagree (1) to completely agree (4). We recoded items describing a traditional view so that higher scores indicate egalitarian GRAs. Factor analyses (principal component analyses utilizing oblimin rotation) suggested a one-dimensional structure with an eigenvalue > 3.81, factor loadings > 0.52, and an explained variance of 42%. The reliability of this scale was good (αall = 0.83 for all students, αfemale = 0.80 for girls, and αmale = 0.81 for boys). The mean of students’ GRAs was Mall = 2.99 (SD = 0.76), with girls’ mean GRA Mfemale = 3.24 (SD = 0.69) and boys’ GRA Mmale = 2.76 (SD = 0.74).
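The recoding and reliability steps described above can be illustrated with a minimal sketch. It assumes a 1 to 4 response scale as described in the text; the response matrix, the choice of which item is traditionally worded, and the function names are hypothetical and serve only to show the arithmetic, not the procedure actually run in Stata.

```python
import numpy as np

def reverse_code(item, low=1, high=4):
    """Flip a Likert item so that the highest category becomes the lowest."""
    return (low + high) - np.asarray(item)

def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Hypothetical answers of five students to three items (1 = completely
# disagree, 4 = completely agree); the third item is assumed to be worded
# in the traditional direction and is therefore reversed.
raw = np.array([
    [4, 4, 1],
    [3, 4, 2],
    [2, 2, 3],
    [1, 2, 4],
    [3, 3, 2],
])
scored = raw.copy()
scored[:, 2] = reverse_code(raw[:, 2])

alpha = cronbach_alpha(scored)      # internal consistency of the scale
scale_score = scored.mean(axis=1)   # higher = more egalitarian
```

After reversal, a higher scale mean indicates more egalitarian GRAs on every item, which is the precondition for averaging the items into one score.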

Table 1 Items of the scale assessing gender role attitudes

8.4 Control variables

We controlled for important background variables related to achievement or GRAs. Age is relevant for both. Within the same grade, some studies show that older students reach higher performance levels (Cáceres-Delpiano & Giolito, 2019; Lien et al., 2005), whereas some meta-analyses reveal lower performance levels for older students compared to younger ones (Hattie, 2008; Jimerson, 2001). During adolescence, GRAs seem to become more egalitarian with age (Antill et al., 2003; Ullrich et al., 2022). Students’ age is provided in years.

Within Germany’s stratified secondary education system, students attending the academic school track reach higher achievement levels (Reinhold et al., 2019) and report more egalitarian GRAs (Valtin & Wagner, 2004) than students at non-academic school tracks. We distinguished students who attended the Gymnasium (highest academic track) from those who attended other secondary school types using a dichotomous variable (Gymnasium = 1, other school types = 0) based on the variable in the SUF indicating the school type.

In Germany, an increasing number of students grows up multilingually, acquiring German as their second language (Henschel et al., 2023; Statistisches Bundesamt, 2019). These students often score lower in reading (e.g., Marx et al., 2015) and math achievement (e.g., Henschel et al., 2019). Students’ answers to the question “Which language do you most often speak at home?” are available as a dichotomous variable (German, another language). We recoded this information into the dichotomous variable speaking German at home (yes = 1, no = 0).

Furthermore, the discrepancy in school success based on students’ socio-economic background is well-known in German society (Heppt et al., 2022; Mahler & Kölm, 2019; Sachse et al., 2022; Weis et al., 2019). Two variables served as indicators for adolescents’ social family background. For parents’ education, we used students’ information regarding their parents’ highest educational level (HISCED), recoded into years of schooling (for details regarding the transformation of HISCED into years see OECD, 2012b, p. 364). As a second indicator, we included the available number of books at home, ranging from none or only very few (0 to 10 books) (1) to enough to fill several shelves (more than 500 books) (6).

8.5 Domain-specific control variables

Previous studies show a gender bias in grades, with boys receiving poorer grades than girls (for a meta-analysis see Voyer & Voyer, 2014). We, thus, controlled for the grade in the school subject “German” in the analyses of reading achievement and for the grade in math when analyzing math achievement. In the German educational system, the grades range from 1 (very good) to 6 (insufficient); thus, lower numbers indicate better grades.

As the PISA-2009 cycle focused on reading performance, additional domain-specific student characteristics, such as academic self-concept in reading and interest in reading that are related to achievement (Skaalvik & Skaalvik, 2004; Trautwein et al., 2006), were measured. Academic self-concept in reading and reading interest were captured with three items each, using a 4-point scale that ranged from 1 (does not apply) to 4 (applies). Thus, higher values indicate a higher academic self-concept in reading and a greater reading interest.

8.6 Missing data treatment and analyses

In empirical research, missing data are a common phenomenon. In our data, missing values on relevant study variables ranged from 1.03% for students’ grades in German to 14.23% for two GRA items. To address item non-response, we applied multivariate imputation by chained equations using the R package mice (van Buuren & Groothuis-Oudshoorn, 2011). In the imputation model, we included the variables used in the later analyses and auxiliary variables that were substantially correlated with the study variables (e.g., students’ attitude toward school). We generated five complete data sets and integrated the five PVs provided for each domain in the SUF into our analysis data. Descriptive statistics, factor analyses, reliabilities, and tests for measurement invariance were calculated with the first imputed data set. Regression analyses testing our hypotheses were separately conducted with each of the five data sets and the results were pooled according to Rubin’s rules (Rubin, 1987).
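To make the pooling step concrete, the following sketch applies Rubin's (1987) combining rules to one regression coefficient estimated in five data sets. The estimates and standard errors are hypothetical, and the function name `pool_rubin` is an illustrative assumption; the pooling itself was done in Mplus.

```python
import math

def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed data sets with Rubin's rules."""
    m = len(estimates)
    pooled = sum(estimates) / m                          # mean estimate
    within = sum(se ** 2 for se in std_errors) / m       # mean sampling variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between               # total variance
    return pooled, math.sqrt(total)

# Hypothetical coefficient and standard error from each of five data sets
estimates = [8.6, 8.9, 8.8, 9.0, 8.7]
std_errors = [1.00, 1.05, 1.02, 1.04, 1.03]

pooled_b, pooled_se = pool_rubin(estimates, std_errors)
```

The pooled standard error exceeds the average within-data-set standard error whenever the estimates differ across data sets, which is how the uncertainty due to missing values enters the final tests.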

We performed a series of t-tests and χ²-tests to examine potential gender differences in the study variables. To compare these differences, we used the effect sizes Cohen’s d and phi (φ) and interpreted them according to Cohen (1992) as small (d ≥ 0.20; φ ≥ 0.10), medium (d ≥ 0.50; φ ≥ 0.30), or large (d ≥ 0.80; φ ≥ 0.50).
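As an illustration of these effect sizes, the sketch below computes Cohen's d from group means and standard deviations and φ from a 2×2 table, and labels them with the thresholds given above. The GRA means and standard deviations are those reported for girls and boys in the Measures section; the group sizes are approximations derived from the reported share of female students, and the 2×2 table is hypothetical.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    )
    return (mean1 - mean2) / pooled_sd

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 contingency table [[a, b], [c, d]]."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def label_effect(value, small, medium, large):
    """Verbal label following the thresholds stated in the text."""
    size = abs(value)
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    if size >= small:
        return "small"
    return "negligible"

# GRA means/SDs as reported; n per group approximated from 49.70% of 9,460
d_gra = cohens_d(3.24, 0.69, 4702, 2.76, 0.74, 4758)
d_label = label_effect(d_gra, 0.20, 0.50, 0.80)

# Hypothetical 2x2 table (e.g., gender by a dichotomous control variable)
phi = phi_coefficient(30, 20, 20, 30)
phi_label = label_effect(phi, 0.10, 0.30, 0.50)
```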

We tested measurement invariance of GRAs between boys and girls. The unrestricted baseline model showed an acceptable fit (χ² = 1,256.05; df = 54; p < .001; CFI = 0.94; TLI = 0.92; RMSEA = 0.07; SRMR = 0.04), indicating configural measurement invariance (Little, 2013). The more restrictive model, in which factor loadings were held constant, also showed an acceptable fit (χ² = 1,332.73; df = 62; p < .001; CFI = 0.93; TLI = 0.93; RMSEA = 0.07; SRMR = 0.04), indicating metric measurement invariance. Partial scalar measurement invariance was assured when four factor intercepts were unrestricted (χ² = 1,377.15; df = 66; p < .001; CFI = 0.93; TLI = 0.93; RMSEA = 0.07; SRMR = 0.04). Moreover, cut-off criteria suggested by Chen (2007; ΔCFI < 0.01, ΔRMSEA < 0.015, and ΔSRMR < 0.03) were met when comparing metric and partial scalar models. Therefore, we assume partial scalar measurement invariance between girls’ and boys’ GRAs.
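The comparison against the Chen (2007) cut-offs can be written out as a small check. The fit indices are the rounded values reported above for the metric and partial scalar models; the function name and the dictionary layout are illustrative assumptions, and a comparison based on two-decimal values is necessarily coarser than one based on the unrounded software output.

```python
def meets_chen_cutoffs(less_restricted, more_restricted,
                       max_d_cfi=0.01, max_d_rmsea=0.015, max_d_srmr=0.03):
    """Check whether the change in fit stays below the stated cut-offs."""
    d_cfi = less_restricted["cfi"] - more_restricted["cfi"]        # CFI drop
    d_rmsea = more_restricted["rmsea"] - less_restricted["rmsea"]  # RMSEA rise
    d_srmr = more_restricted["srmr"] - less_restricted["srmr"]     # SRMR rise
    return d_cfi < max_d_cfi and d_rmsea < max_d_rmsea and d_srmr < max_d_srmr

# Rounded fit indices as reported in the text
metric = {"cfi": 0.93, "rmsea": 0.07, "srmr": 0.04}
partial_scalar = {"cfi": 0.93, "rmsea": 0.07, "srmr": 0.04}

invariance_supported = meets_chen_cutoffs(metric, partial_scalar)
```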

We conducted ordinary least squares (OLS) linear regression analyses to investigate our hypotheses. All metric variables were z-standardized (M = 0.00, SD = 1.00) prior to the analyses. Thus, the β-coefficients can be compared directly and interpreted as a small (β ≥ 0.10), medium (β ≥ 0.30), or large effect (β ≥ 0.50; Cohen, 1992). We also report unstandardized b-coefficients for easier interpretation, i.e., one unit change of a predictor variable (e.g., gender) is associated with b units change on the dependent variable (e.g., reading achievement). We compared regression coefficients between nested models and used z-tests to test whether they differed significantly (Clogg et al., 1995; Paternoster et al., 1998).
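A z-test for the difference between two regression coefficients can be sketched as follows. The sketch uses the simple form z = (b1 − b2) / sqrt(SE1² + SE2²) commonly attributed to Paternoster et al. (1998), which treats the two estimates as independent; comparisons between nested models fitted to the same sample, as discussed by Clogg et al. (1995), call for a different variance term, so this should be read as an illustration of the logic rather than as the exact test used here. The coefficients and standard errors are hypothetical.

```python
import math
from statistics import NormalDist

def coefficient_difference_z(b1, se1, b2, se2):
    """z-test for the difference of two coefficients, assuming independence."""
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Hypothetical coefficients and standard errors from two models
z, p = coefficient_difference_z(b1=12.0, se1=1.5, b2=7.0, se2=2.0)
```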

We used Stata 15.1 (StataCorp, 2017) for data preparation, analyzing the factor structure of GRAs, and descriptive analyses, RStudio 4.2.1 (RStudio Team, 2022) for handling missing data, and Mplus 8.8 (Muthén & Muthén, 1998–2022) for testing measurement invariance of GRAs and modeling regression analyses. To account for the multilevel data structure with students clustered in classes, we estimated cluster-robust standard errors with the option “type = complex” (Williams, 2000), and we used “type = imputation” to combine the results of the five separate analyses.
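The idea behind cluster-robust standard errors can be illustrated with a basic sandwich estimator for OLS. This is a sketch of the general technique (the uncorrected cluster sandwich estimator, without small-sample adjustment), not a reproduction of the Mplus estimator; the simulated data, the class structure, and the function name are assumptions made for the example.

```python
import numpy as np

def ols_cluster_robust(X, y, clusters):
    """OLS coefficients with cluster-robust (sandwich) standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    clusters = np.asarray(clusters)
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    residuals = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        in_cluster = clusters == g
        # Sum of score contributions within one cluster (e.g., one class)
        score = X[in_cluster].T @ residuals[in_cluster]
        meat += np.outer(score, score)
    covariance = bread @ meat @ bread
    return beta, np.sqrt(np.diag(covariance))

# Simulated example: 40 classes of 25 students with a shared class effect
rng = np.random.default_rng(0)
n_classes, class_size = 40, 25
class_id = np.repeat(np.arange(n_classes), class_size)
class_effect = rng.normal(0, 0.5, n_classes)[class_id]
x = rng.normal(size=n_classes * class_size)
y = 0.3 * x + class_effect + rng.normal(size=x.size)

X = np.column_stack([np.ones_like(x), x])   # intercept and one predictor
beta, se = ols_cluster_robust(X, y, class_id)
```

Because students in the same class share unobserved influences, summing the score contributions within each class before squaring prevents the standard errors from being understated, which is the concern the clustering correction addresses.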

9 Results

9.1 Descriptive analyses

Table 2 displays the descriptive statistics for all study variables for the whole sample and separately for female and male students. We found significant differences between boys and girls for several variables in the expected directions. Concerning our key analysis variables, girls expressed more egalitarian GRAs than boys and outperformed them in reading achievement, whereas boys scored higher on the math test than girls. The gender differences in reading and GRAs were of medium effect size, those in math of small effect size.

Table 2 Descriptive statistics for all study variables for the overall sample, and separately for boys and girls, including t-test and Chi-Square results

Students’ GRAs significantly correlated with all variables included in the regression analyses, most strongly with reading and math achievement (see Online Resource 1 for the correlations between all study variables). For boys, the correlations between GRAs and reading (r = .23, p < .001) and between GRAs and math (r = .18, p < .001) were positive and yielded small effect sizes. For girls, the correlations between GRAs and reading (r = .33, p < .001) and between GRAs and math (r = .28, p < .001) were also positive and of medium effect size.

9.2 Regression analyses

9.2.1 Reading achievement

The results for reading achievement are reported in Table 3. Students’ gender significantly predicted their reading achievement. Girls scored higher in reading than boys (β = 0.40, Model 1a), yielding a medium effect size. Egalitarian GRAs and reading achievement were significantly positively related when controlling for students’ gender (β = 0.28, Model 2a), yielding a small effect size. The interaction between students’ gender and GRAs was significant (β = 0.12, Model 3a), pointing to differences in the relation between GRAs and reading achievement for boys and girls. The effect of GRAs and the interaction effect decreased but remained significant when controlling for relevant individual and familial background information and domain-specific variables (see Models 4a, 5a, and 6). Figure 2 visualizes the interplay between GRAs and reading achievement for boys and girls based on Model 5a. As assumed in H1, boys with egalitarian GRAs scored significantly higher in reading than boys with traditional GRAs. In other words, the more boys endorsed egalitarian GRAs rather than traditional GRAs, the higher their reading achievement level. Girls also differed in reading based on their GRAs. Girls with egalitarian GRAs scored higher in reading achievement than those with traditional GRAs. The effect of GRAs on reading achievement was more pronounced for girls than for boys. For an increase of one standard deviation (SD) in GRAs, i.e., being more egalitarian, boys scored almost 9 points (b = 8.82, SE = 1.04, β = 0.10, p < .001) higher in reading, while girls scored almost 15 points higher (b = 14.59, SE = 1.16, β = 0.17, p < .001).
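How the gender-specific slopes follow from a model with a gender dummy (female = 1, male = 0) and a gender × GRA interaction can be shown in a few lines. The two slopes are the b-coefficients reported above; the interaction coefficient is the value implied by their difference and is an assumption for illustration, not a figure taken from Table 3.

```python
def simple_slopes(b_gra, b_interaction):
    """Slope of GRAs per gender when gender is dummy-coded (female = 1)."""
    return {
        "boys": b_gra,                    # dummy = 0: main effect only
        "girls": b_gra + b_interaction,   # dummy = 1: main effect + interaction
    }

b_boys, b_girls = 8.82, 14.59        # reported slopes per 1 SD in GRAs
b_interaction = b_girls - b_boys     # implied by the two slopes (assumption)

slopes = simple_slopes(b_boys, b_interaction)

# Expected reading difference between students 1 SD above and 1 SD below
# the GRA mean, separately by gender
two_sd_contrast = {group: 2 * slope for group, slope in slopes.items()}
```

Under these values, the contrast between more egalitarian and more traditional students is larger among girls than among boys, which corresponds to the steeper line for girls described for Figure 2.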

Table 3 Results for stepwise multiple regression predicting reading achievement

Fig. 2 Reading achievement in relation to GRAs, shown for boys and girls

The gender difference in reading (Model 1a) remained significant but, as post hoc tests revealed, significantly decreased in effect size in subsequent models to small effects when entering students’ GRAs and the interaction term as predictors in Model 3a (β = 0.19; comparison of Models 1a and 3a: ∆β = −0.21, p < .001) and when additionally controlling for background information and German grade in Model 5a (β = 0.12; comparison of Models 1a and 5a: ∆β = −0.28, p < .001). In the full model (Model 6) that additionally included academic self-concept in reading and reading interest, the difference between boys and girls in reading achievement was reduced to a negligible effect (β = 0.05; comparison of Models 1a and 6: ∆β = −0.35, p < .001).

Comparing R² between the models reveals that gender and GRAs predict students’ reading achievement to a much lesser extent than the control variables. Gender and GRAs explained 11% (Model 3a) of the variance in students’ reading achievement, while 45% of the variance was explained when controlling for individual and familial background information (Model 4a) and 51% when additionally accounting for academic self-concept in reading and reading interest (Model 6). Thus, the largest part of the explained variance in reading achievement was due to individual, familial, and reading-related information, not gender and GRAs.

9.2.2 Math achievement

The regression analyses predicting math achievement are shown in Table 4. Boys and girls significantly differed in their math achievement. Girls scored lower in math than boys (β = −0.26, Model 1b), yielding a small effect size. Students’ GRAs were significantly related to math achievement when controlling for gender (β = 0.24, Model 2b), with more egalitarian GRAs being related to higher math achievement, yielding a small effect size. The interaction between students’ gender and GRAs was significant (β = 0.14, Model 3b), pointing to differences in the relation between GRAs and math achievement for boys and girls. The effect of GRAs and the interaction effect remained significant but decreased in size when controlling for individual and family background information (Model 4b) and math grade (Model 5b). Figure 3 visualizes the interplay between GRAs and math achievement for girls and boys based on Model 5b. As assumed in H2, girls with egalitarian GRAs reached higher math scores than those with traditional GRAs. In other words, the more girls endorsed egalitarian rather than traditional GRAs, the higher their math achievement level. Similarly, boys with egalitarian GRAs reached higher math test scores than boys with traditional GRAs. The relation between GRAs and math achievement was more pronounced for girls. While girls scored almost 11 points higher in math for a one SD increase towards more egalitarian GRAs (b = 10.79, SE = 1.12, β = 0.12, p < .001), boys scored approximately 4 points higher (b = 4.37, SE = 1.04, β = 0.05, p < .001).
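The gender-specific slopes reported for reading and math follow from the interaction term of the regression. The following is a minimal sketch of that logic, assuming that gender is dummy-coded (0 = boys, 1 = girls) and that GRAs are standardized; the text does not state the coding, so both are assumptions. Under this coding, the slope for boys is the GRA coefficient itself and the slope for girls is the GRA coefficient plus the interaction coefficient. The helper name `simple_slopes` is hypothetical, and the interaction values are derived here from the unstandardized slopes reported in the text, not taken from Tables 3 or 4.

```python
def simple_slopes(b_gra: float, b_interaction: float) -> tuple[float, float]:
    """Slopes of achievement on GRAs for gender = 0 and gender = 1.

    Assumes a model of the form
        y = b0 + b_gender * gender + b_gra * gra + b_interaction * gender * gra
    with gender dummy-coded (an assumption, not stated in the text).
    """
    return b_gra, b_gra + b_interaction


# Unstandardized slopes reported in the text (points per 1 SD of GRAs).
reported = {
    "reading (Model 5a)": {"boys": 8.82, "girls": 14.59},
    "math (Model 5b)": {"boys": 4.37, "girls": 10.79},
}

implied_interaction = {}
for domain, b in reported.items():
    # Under the assumed coding, the interaction equals the difference in slopes.
    implied_interaction[domain] = round(b["girls"] - b["boys"], 2)
    boys, girls = simple_slopes(b["boys"], implied_interaction[domain])
    print(f"{domain}: boys {boys:.2f}, girls {girls:.2f}, "
          f"difference {implied_interaction[domain]:.2f} points per SD")
```

Read this way, a one SD shift towards more egalitarian GRAs in both groups goes along with a reading advantage for girls that is roughly 5.8 points larger and a math disadvantage that is roughly 6.4 points smaller. This matches the direction of the pattern described in the text, but it is a descriptive reading of cross-sectional coefficients and not a causal estimate.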

Table 4 Results for stepwise multiple regression of math achievement
Fig. 3 Math achievement in relation to GRAs, shown for boys and girls

The gender difference in math significantly increased from a small effect in Model 1b (β = −0.26) to a medium effect in the subsequent models when adding GRAs and the interaction term as predictors in Model 3b (β = −0.44; comparison of Models 1b and 3b: ∆β = 0.18, p < .001) and when additionally adding all control variables in Model 5b (β = −0.39; comparison of Models 1b and 5b: ∆β = 0.13, p < .001).
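The ∆β values reported for reading and math can be reproduced from the coefficients given in the text. The sketch below assumes that ∆β denotes the change in the absolute size of the gender coefficient relative to Model 1 (negative values mean the gap shrank, positive values mean it grew). This reading is inferred from the reported numbers and is not stated explicitly in the text; `delta_beta` is a hypothetical helper name.

```python
def delta_beta(beta_baseline: float, beta_later: float) -> float:
    """Change in the absolute size of a standardized coefficient.

    Assumption: the sign convention is inferred from the reported values.
    """
    return round(abs(beta_later) - abs(beta_baseline), 2)


# Gender coefficients reported in the text.
reading_baseline = 0.40                       # Model 1a
reading_later = {"3a": 0.19, "5a": 0.12, "6": 0.05}
math_baseline = -0.26                         # Model 1b
math_later = {"3b": -0.44, "5b": -0.39}

reading_deltas = {m: delta_beta(reading_baseline, b) for m, b in reading_later.items()}
math_deltas = {m: delta_beta(math_baseline, b) for m, b in math_later.items()}

print("reading:", reading_deltas)
print("math:", math_deltas)
```

Under this convention, the negative values for reading indicate a shrinking gender gap across models, whereas the positive values for math indicate that the gap in favor of boys grows once GRAs and the control variables are included.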

The R² values show that gender and GRAs predict students’ math achievement to a smaller degree than the control variables. The explained variance of students’ math achievement increased from 7% when gender and GRAs were the only predictors (Model 3b) to 52% when controlling for individual and familial variables and math grade (Model 5b).

10 Discussion

The main objective of this study was to determine whether and how students’ GRAs contribute to explaining the gender achievement gaps in reading and math. As assumed for the counter-gender-stereotypical domain, boys with egalitarian GRAs scored higher in reading than boys with traditional GRAs, and girls with egalitarian GRAs scored higher in math than girls with traditional GRAs. In the gender-stereotypical domain, girls with egalitarian GRAs reached higher reading scores than girls with traditional GRAs, and boys with egalitarian GRAs reached higher math scores than boys with traditional GRAs. Thus, the results suggest that all students, irrespective of gender, benefit from egalitarian GRAs in reading and math. Furthermore, girls with egalitarian GRAs had a particularly large advantage over boys in reading. In math, girls with egalitarian GRAs performed as well as boys with egalitarian GRAs and better than boys with traditional GRAs. The effects dropped in magnitude but remained significant when controlling for important confounding variables, such as school type and parental education.

These results support the notion that students’ GRAs play a role in the gender gaps in reading and math, but in different ways. While girls’ egalitarian GRAs are related to higher math achievement and, in that way, to a reduced gender gap in math, boys’ egalitarian GRAs are accompanied by higher reading scores, which do not suffice to reduce the gender gap in reading because girls’ reading scores rise even more strongly with egalitarian GRAs. In other words, egalitarian GRAs seem to support girls in decreasing their math disadvantage and in increasing their advantage in reading. In sum, while all students benefit from egalitarian GRAs regarding school achievement, the effect is particularly pronounced for girls in both domains.

As described in the PISA 2009 report (OECD, 2010) and in line with further empirical research, we found domain-specific gender disparities in achievement: Girls outperformed boys in reading (Böhme et al., 2016; Lorenz et al., 2023; Mullis et al., 2017; Rosén et al., 2022), whereas boys outperformed girls in math (Mullis et al., 2020; Reinhold et al., 2019; Schipolowski et al., 2019), with a larger gender gap in reading than in math (e.g., OECD, 2019). Furthermore, our results are in accordance with the general assumption that traditional GRAs are related to lower school success (DiPrete & Jennings, 2012; Hadjar et al., 2012) or, to put it the other way around, that egalitarian GRAs are related to higher academic performance. Similar to the study by Ehrtmann and Wolter (2018), our findings indicate that GRAs play a role in achievement, especially for girls, who benefit more from egalitarian GRAs than boys do. In contrast to the results of Ehrtmann and Wolter (2018), we also found a gender-differentiated effect of GRAs on reading proficiency. While in their study female and male students benefited equally from egalitarian GRAs, we found that girls profit even more from egalitarian GRAs in reading than boys do. A possible explanation for these divergent findings may lie in the distinction between the relation of GRAs to gendered performance development, on the one hand, and the relation of GRAs to gendered performance level, on the other hand. The effects of GRAs might not differ between boys and girls for performance development in reading, which Ehrtmann and Wolter (2018) investigated, but they might differ for the performance level in reading itself, which we explored. Hence, contrary to Ehrtmann and Wolter’s (2018) findings, the results of our study suggest that the relation between students’ GRAs and their performance is not domain-specific, neither for girls nor for boys. Based on the present findings, we would conclude that all students benefit from egalitarian GRAs. 
However, girls benefit more from egalitarian GRAs in both gender-stereotypical and counter-gender-stereotypical domains, which contributes to a larger gender gap in reading and a smaller gender gap in math.

10.1 Limitations and future research

In interpreting the present study’s findings, several limitations must be considered. First, the study cannot address the causal association between GRAs and proficiency due to its cross-sectional nature. Also, underlying mechanisms that drive the observed relations between GRAs and academic performance cannot be analyzed. Second, we used a data set collected in 2009, as this was the only PISA cycle that measured students’ GRAs in addition to achievement in reading and math. Therefore, our results might not entirely reflect the current picture of gender disparities in school performance and students’ GRAs. Indeed, given developments over roughly the last 15 years, the results reported in the present investigation might either overestimate or underestimate the relation between GRAs and gender disparities. When female and male students’ performance is compared across several large-scale studies in trend analyses, no clear picture emerges regarding an increase or decrease of the gender gaps in reading and math over time in Germany. Between PISA 2009 and PISA 2018, the gender gaps in reading and math decreased slightly, although the change in math was not statistically significant (OECD, 2019). In contrast, trend analyses of national assessment studies in Germany report a significant decrease of the gender gap in math among students attending Grade 9 between 2012 and 2018 (Schipolowski et al., 2019) but no statistically significant change of the gender gap in reading among students attending Grade 9 between 2009 and 2015 (Böhme et al., 2016). 
For elementary school students, trend analyses of the national assessment study and PIRLS in Germany similarly do not show any significant decrease or increase of the gender gap in reading between the assessment years 2011 and 2021 (Frey et al., 2023; Gentrup et al., 2022), whereas the gender gap in math increased for elementary school students between 2011 and 2021 (Gentrup et al., 2022). Thus, our findings may slightly overestimate the contribution of GRAs in explaining gender achievement gaps. At the same time, individuals have become more egalitarian in their GRAs across generations (for an overview see Dotti Sani & Quaranta, 2017). Students are, thus, likely to report more egalitarian GRAs now than in 2009 when using the same items. This might point to a possible underestimation of the relation between GRAs and performance. Third, although we set out to conduct analogous analyses for math and reading, we could not consider academic self-concept and interest in math as they were not part of the PISA 2009 cycle. However, these concepts are essential for girls’ achievement, especially in the math domain, as girls typically report a lower academic self-concept in math than boys (Jacobs et al., 2002; Schiepe-Tiska et al., 2016; Schilling et al., 2006). Even with equal or higher achievement, female students rate their math skills lower than male students do (Jansen et al., 2019; Jansen & Stanat, 2016). Relatedly, although we considered a range of important confounding variables contributing to the relation between adolescents’ GRAs and their academic achievement, further variables could have come into play. Given the role of the GRAs of significant others, such as parents and peers, in shaping students’ GRAs (Carlson & Knoester, 2011; Halimi et al., 2021; Marks et al., 2009; Taraszow et al., 2023), including these people’s GRAs in the analyses might have yielded an even more nuanced picture. 
Such analyses would probably highlight the mediating role of students’ GRAs in the relation between the GRAs of significant others and student achievement.

Given the study’s limitations, future research would benefit from longitudinal studies based on more recent data because GRAs are prone to changes during adolescence (Galambos, 2004; Ullrich et al., 2022). Thus, a longitudinal depiction of the development of GRAs, ideally including comprehensive information on students’ academic self-concept, interest, motivation, and gender stereotypes, would provide more insights into the causal relations between the development of GRAs and performance as well as the underlying mechanisms. Furthermore, it would be valuable to consider newly developed instruments capturing recent societal changes in public gender roles (for overviews, see Halimi et al., 2017; Klocke & Lamberty, 2016). For example, the “traditional male-breadwinner-female-carer arrangement” (Gornick & Meyers, 2003, p. 90) has shifted towards a dual-earner model in most European societies (Valentova, 2012) so that most adolescents grow up with parents who are both working. With women having entered the paid labor market, other gender inequalities have arisen, such as the gender pay gap and women’s dual roles as breadwinner and homemaker (Halimi et al., 2017; OECD, 2012a; World Economic Forum, 2019). Adding items that cover these changes and new inequalities would help to gain a more up-to-date insight into the relations between GRAs and school performance in gender-stereotypical and counter-gender-stereotypical domains for boys and girls. However, data sets that include standardized performance tests and measures of current gender roles and GRAs are lacking. Moreover, while most studies have so far focused on either gender stereotypes or gender roles, future studies examining both constructs would be valuable. A simultaneous investigation would help to examine the extent to which the concepts of gender stereotypes, gender roles, and GRAs can be empirically separated from each other. 
It would further help to identify their respective contributions to the prediction of achievement or achievement gaps, especially because research shows a direct relation both between gender stereotypes and performance and between GRAs and performance. Finally, besides reading and math, other school domains are also gender-stereotyped. For instance, music is considered typically female and physics typically male (Kessels, 2005). Yet, a clear gender gap in physics achievement is not observable (Jansen & Stanat, 2016; Mullis et al., 2020; Schipolowski et al., 2019), while music has not yet been assessed in large-scale performance studies. Future research may focus on such domains to test whether our findings generalize.

10.2 Implications and conclusion

This study investigated the role of GRAs in gender disparities in reading (a stereotypically female domain) and in math (a stereotypically male domain). The results showed that egalitarian GRAs are associated with higher performance in gender-stereotypical and counter-gender-stereotypical domains for all students. In addition, girls particularly benefit from egalitarian GRAs in both domains. Girls with egalitarian GRAs achieved higher reading scores than boys with egalitarian GRAs; thus, girls’ egalitarian GRAs were associated with even larger advantages in reading. In math, girls with egalitarian GRAs performed at levels similar to those of boys with egalitarian GRAs; hence, girls’ egalitarian GRAs were associated with smaller math disadvantages. This is an important finding as it indicates that GRAs contribute to explaining students’ and, specifically, girls’ school success.

Promoting the endorsement of egalitarian GRAs, thus, seems to benefit students’ academic achievement across genders and domains. Therefore, teachers and educators should be aware of the mechanisms underlying the formation of gender stereotypes and GRAs, and they should be prepared to implement strategies and use materials that effectively support the development of egalitarian GRAs in children and youth. Specifically, teachers may counteract gender stereotypes by using gender-fair language with their students. Simply using both the female and male titles of gender-stereotypical occupations (e.g., “Ingenieurinnen und Ingenieure” [German for “female engineers and male engineers”]) instead of the generic masculine forms (e.g., “Ingenieure” [German for “engineers”]) enhances children’s interest in and self-efficacy towards these occupations (Vervecken & Hannover, 2015; Vervecken et al., 2013). Also, educators should put effort into searching for and choosing materials with few gender-stereotypical representations and a more balanced gender portrayal, as research shows that the genders are still often represented in stereotypical and imbalanced ways in, for example, (school)books (Cruz Neri et al., 2024; Islam & Asadullah, 2018; Moser & Hannover, 2014). Furthermore, when developing and implementing interventions in schools, the stereotypically disadvantaged gender should be explicitly addressed, as such gender-specific interventions seem to have greater effects than general interventions that do not specifically address a certain gender (e.g., Lesperance et al., 2022).

Besides educational outcomes, egalitarian GRAs are also relevant for societal gender equality, one of the Millennium Development Goals (The General Assembly of the United Nations, 2000; United Nations, 2015), and should therefore also be promoted in other public areas. For example, encounters and personal contact with counter-gender-stereotypical role models can reduce gender stereotypes in female adolescents (e.g., Olsson & Martiny, 2018).

All in all, promoting egalitarian gender roles through knowledge acquisition about the consequences of gender stereotypes (e.g., Johns et al., 2005), exposure to counter-gender-stereotypical role models (e.g., Olsson & Martiny, 2018), and usage of gender-fair language as well as gender-balanced representations (e.g., Moser & Hannover, 2014; Vervecken & Hannover, 2015) in as many societal areas as possible (e.g., media, [school]books, toys, clothes) could eventually contribute to higher gender equality.