1 Introduction

Various dimensions of educational success, such as student achievement, vary by parental socioeconomic status (SES). Discrimination by teachers may account for at least part of the observed socioeconomic inequalities. For instance, research indicates that teacher stereotypes related to family SES can lead to bias in teacher judgement (Jussim et al. 1996; Jussim and Harber 2005; Tenenbaum and Ruck 2007). In addition to affecting grades (e.g., Kiss 2013; Sprietsma 2013), differential judgements and expectations can result in different verbal teacher behaviours (e.g., less warm and supportive, lower-quality feedback; Gentrup et al. 2020; Rubie-Davies 2007), different non-verbal teacher behaviours (e.g., reduced eye contact; Babad 1990, 1993), and, ultimately, in a self-fulfilling prophecy (Wang et al. 2018). Such processes could exacerbate SES-related achievement gaps and social inequalities in education. While many studies examine the extent of such bias, its consequences, and the underlying processes (e.g., Urhahne and Wijnia 2021, for a recent review), only a few approach this topic from a cross-country perspective and, hence, consider the wider institutional setting (see, e.g., Geven et al. 2021; Hofer 2015, for exceptions).

This study examines teacher judgements and their association with student achievement in three national contexts: England, Germany, and the US. In all three countries, teachers may shape educational careers through their teaching and through grading and sorting students. In some contexts, teachers might have a greater impact on students’ educational careers than in others. In Germany, for example, teachers recommend the type of secondary school a child should attend after primary school. England and the US have comprehensive schooling systems, although ability grouping within schools and/or classes is common practice from the lower grades on. The age at which tracking happens and the type of tracking and ability grouping might impact teacher judgements and their association with later achievement, as might the amount and level of (standardised) testing or further accountability practices used (e.g., Finnigan and Gross 2007; Geven et al. 2021; Kelly and Carbonaro 2012; Lee et al. 2014; Lerner and Tetlock 1999; Pit-ten Cate et al. 2020). Finally, Geven et al. (2021) point out that cultural beliefs about how effort can overcome original disadvantage (growth mindset) may also shape teacher judgements.

In this study, we first investigate whether SES-related bias in teacher judgements of student skills in early primary school differs between England, Germany, and the US. To this end, we regress the teacher judgements on student test scores. We focus on mathematics, one of the main school subjects in all three countries and one related to skills that strongly affect economic outcomes in later life (e.g., Ritchie and Bates 2013). Second, we examine the effects of these (potentially biased) teacher judgements on student achievement at the end of primary education (age 10–11) using value-added-models (e.g., Gentrup et al. 2020; Hinnant et al. 2009; Madon et al. 1997). We rely on harmonised data from three large-scale surveys, the Millennium Cohort Study (MCS) for England, the Starting Cohort 2 of the National Educational Panel Study (NEPS-SC2) for Germany, and the Early Childhood Longitudinal Study, Kindergarten Class of 2010–11 (ECLS-K:2011) for the US.

2 Theoretical considerations and empirical findings

2.1 Teacher judgements and judgement bias

Teachers’ understanding of their students’ achievement, needs, and resources is an essential precondition for efficient teaching and progress in student learning (e.g., Baumert and Kunter 2013; Hattie 2009; Helmke and Schrader 1987; Karing et al. 2011). Research shows that the accuracy of teacher judgements—defined as teachers’ ability to, amongst other things, adequately assess students’ characteristics—varies between teachers. Overall, in their meta-analysis, Südkamp et al. (2012) showed that the shared variance between teacher judgements and students’ achievement—including results from standardised achievement tests as well as curriculum-based measures—is around 40%. Moreover, the shared variance is higher when measures of student motivation and cognitive abilities are considered alongside their achievement (e.g., de Boer et al. 2010). The remaining variance might be interpreted as inaccuracy (e.g., Gentrup et al. 2020). Empirically, it has been shown that students from less socioeconomically advantaged families often face lower teacher expectations vis-à-vis their objective achievement measures (for Germany see, e.g., Lorenz et al. 2016; Tobisch and Dresel 2017; for the UK see, e.g., Campbell 2015; Lee and Newton 2021; Plewis 1997; for the US see, e.g., Alvidrez and Weinstein 1999). These statistical effects might then be interpreted as bias. For the US, however, the evidence is rather mixed, as there are also studies that did not find systematic differences in teacher judgements coinciding with family SES (e.g., Hinnant et al. 2009).

Teacher judgements can be conceptualised as the result of cognitive information processing and diagnostic thinking (see Loibl et al. 2020, for a conceptual framework). Dual-processing theories postulate two different strategies of information processing: (1) automatic judgements guided by stereotypes that do not include the integration of relevant target information and (2) information-based judgements that involve the deliberate integration of target information into multifaceted judgements (e.g., Fiske et al. 1999). Both strategies can be understood as the end points of a continuum (Fiske et al. 1999).

Which of the two strategies is likely to be implemented depends, among other things, on the teacher’s personal characteristics (e.g., attitudes, knowledge, or mindsets) and on situational characteristics, which include, for example, time pressure, judgement goals, and social cues (Loibl et al. 2020). Social cues are pieces of information that are used to form judgements about objects and other persons. They differ in how easily they are observed and extracted and in whether their identification requires inferential information processing (Loibl et al. 2020). Social cues such as names, a dialect, or an accent increase the likelihood of categorisation and generalisation (Fiske et al. 1999; Smith and DeCoster 2000). Moreover, teacher judgements might be more likely to be based on categorisation and generalisation when information is ambiguous and when the perceiver has only limited cognitive resources available (Gawronski et al. 2003).

Above and beyond these personal and situational factors, both information processing and stereotypes are embedded in and affected by the broader institutional context (see, e.g., Geven et al. 2021; Weinstein 2002, for a similar argumentation). This context includes not only conditions and regulations of schools, the education system, and teacher training, but also norms and values, as well as cultural-cognitive beliefs that frame and guide social action (see Geven et al. 2021, for a similar argumentation). For example, conscious and unconscious biases in teacher judgements develop within training programs; within the cultural norms and values prevailing in a country or region as well as the stereotypes and prejudices shared in these contexts (e.g., Geven et al. 2021); and within the setting of schooling and teaching (e.g., Romijn et al. 2021). Next, we describe aspects of the national context that might affect the extent of bias in teacher judgements.

First, a certain mindset or mindfulness of distortions might shape teachers’ information processing. Geven et al. (2021) argue that in “growth mindset” cultures, people assume that initial disadvantages due to family SES can be overcome through effort. The counterpart to a growth mindset is a fixed mindset, i.e., the view that talent and skills are innate. Consequently, Geven and colleagues expect less SES-related bias in teacher expectations in growth mindset cultures than in fixed mindset ones (Geven et al. 2021, p. 6, referring to Dweck and Yeager 2019; Jacovidis et al. 2020).

Second, accountability may frame and affect teachers’ diagnostic thinking and information processing (e.g., Krolak-Schwerdt et al. 2018). School accountability policies include, for instance, setting specific goals and sanctioning or rewarding schools based on their performance and governmental goals (Finnigan and Gross 2007). Previous research has shown that accountability policies are often accompanied by higher levels of teacher motivation and effort (e.g., Finnigan and Gross 2007; Kelley et al. 2002; Mintrop 2003) as well as more accurate judgements (e.g., Krolak-Schwerdt et al. 2013; Pit-ten Cate et al. 2020). Hence, teachers could be expected to have more incentive to judge student achievement accurately in systems in which they are held accountable for their work. However, there is also empirical evidence showing that accountability policies can go hand in hand with higher levels of teacher stress and decreased self-efficacy (e.g., Berryhill et al. 2009; Jerrim and Sims 2022), which might offset the expected positive effects on motivation and effort.

Third, regular (state- or nation-wide) standardised testing—a specific form of accountability—might provide teachers with comprehensive and comparable information on their students. Such information might help teachers form more accurate judgements (e.g., Ostermann et al. 2018). Thus, teacher judgements might be more accurate in countries that (regularly) conduct standardised achievement tests or comparable performance assessments.

Fourth, tracking and ability grouping might affect teachers’ information processing. Because teachers must assess which course, stream, or track is most suitable for each student, they are forced, and often trained, to consider the performance of their students thoroughly (e.g., Krolak-Schwerdt et al. 2018). Since such placement decisions typically rest on regular assessment of student performance, which, as argued above, might improve accuracy, teacher judgements might be more precise in education systems characterised by ability grouping. This should be particularly the case for teacher judgements preceding the separation of students into tracks and streams.

2.2 Teacher judgements and achievement development

Relying on the concept of the self-fulfilling prophecy (Merton 1948), Rosenthal (1973) proposed four paths through which teacher judgements might affect children’s learning and achievement: (1) teachers’ input, (2) opportunities for output (e.g., calling on students), (3) teacher feedback, and (4) the nature or climate of teacher-student relations. Empirical evidence on these four possible processes and their (relative) relevance for transmitting biased judgements is scarce (see, e.g., Urhahne and Wijnia 2021, for a similar evaluation of the state of the art). One exception is the study by Gentrup et al. (2020), which showed that teacher feedback varied significantly with the inaccuracy of teachers’ expectations. In particular, they reported that—compared to lower-expectancy students with similar achievement—higher-expectancy students received more performance feedback than behavioural feedback as well as slightly more positive, rather than negative, performance feedback. Although teacher feedback varied with teacher expectations, this variation did not significantly mediate the effect of teacher expectancy on later achievement. In addition, there are studies that, although not explicitly examining teacher behaviour, showed how biased expectations might be related to student achievement by affecting, for example, students’ feelings of academic futility (e.g., Agirdag et al. 2013).

From a cross-country perspective, we assume that the potential pathways are relevant in all contexts. However, some of the above-mentioned institutional features might moderate the association between biased teacher judgements and student achievement. In particular, ability grouping might be a powerful channel through which teacher bias affects subsequent achievement (see, e.g., Ready and Chu 2015, for a similar argumentation). Students whose ability is underestimated by their teachers will be assigned to less-demanding courses offering a lower quantity and slower pace of instruction (e.g., Gamoran 1986; Pallas et al. 1994). Such ‘inadequate’ placements might de-motivate students, possibly leading to lower achievement than would have been possible under different conditions. If teacher judgements correlate with students’ SES net of abilities and skills, ‘inadequate’ placements resulting from biased judgements would then contribute to the persistence and even exacerbation of socioeconomic achievement gaps. Another potential moderator could be standardisation (e.g., Klenowski and Wyatt-Smith 2010): the more input factors such as curricular goals, teaching materials, or exercises are predetermined, the less “room” will exist for biased teacher judgements to impact students’ skills development.

Research on the effects of biased judgements and expectations (based on student gender or ethnicity) on achievement inequalities often concentrates on later stages of education. Hence, there is little evidence that focuses specifically on primary education or on the differential effects of biased teacher judgements for students from varying socioeconomic backgrounds. In the US, Hinnant et al. (2009) found that teacher expectation bias from Grade 1 was related to both third- and fifth-grade math achievement, and the effect on achievement in Grade 3 was especially pronounced for students from low-income families. In line with these results, Sorhagen (2013) showed that, amongst other things, teachers’ inaccurately low expectations in Grade 1 might foster lower math achievement at age 15, especially for students from low-income families. For Germany, there is evidence that accuracy in teacher judgements measured in Grade 9 was associated with math achievement in Grade 10 when controlling for students’ background and prior achievement (Anders et al. 2010). In this study, teacher accuracy was measured by the teacher’s ability to rank students in their class in terms of their overall performance in mathematics. This teacher-reported rank was then related to the rank resulting from the actual PISA math test scores from Grade 9 by calculating the rank correlation. This measurement of teacher judgement bias differs from the approach chosen in the aforementioned studies by Hinnant et al. (2009) and Sorhagen (2013), in which teacher perceptions of children’s ability were regressed on children’s test scores; the resulting residuals from these regressions were interpreted as teacher judgement bias (or discrepancy scores, as labelled by the authors; see also Sect. 3.3 in this paper).

2.3 The England, Germany, and US contexts

Mindset

According to data from the World Values Survey (wave 5; Inglehart et al. 2014), people from the US are the most likely to strongly agree with the statement that hard work brings success, followed by people from the UK and then Germany (own calculations). This suggests that a growth mindset is more prevalent in the US whilst a fixed mindset is more prevalent in Germany, with England somewhere in between. Due to the affinity between the “growth mindset” culture and the ideology of the “American dream”, especially when compared to relatively pessimistic European cultural beliefs on educational success and intergenerational mobility (Alesina et al. 2018), Geven et al. (2021, p. 6) expect less SES-related bias in teacher expectations in the US than in European countries.

General notes on the education systems

In England, compulsory schooling lasts from age 5 to 16, although most children attend a full-time primary school reception class at age 4. Children in reception classes are in Year 0; one year later, when compulsory schooling starts, they are in Year 1. Primary schooling is divided into Key Stage (KS) 1, which spans ages 5 to 7 (Years 1 and 2), and Key Stage 2, which covers ages 7 to 11 (Years 3–6). Each Key Stage is linked to a national curriculum: ability in a subject is then defined by the attained Key Stage level (Burgess and Greaves 2009, p. 4 f., 2013; Hall and Ozerk 2010, p. 376). In Germany, primary education starts on average at age 6 and lasts four years (six years in the states of Berlin and Brandenburg). At the end of primary education, around the age of 10, students make the transition into secondary education, which is stratified into one academic track and one or more non-academic tracks. The curricula, which are the responsibility of the federal states, set specific goals with regard to the performance students have to achieve in each subject (Eckhardt 2019, p. 110). Still, curricula are formulated in such a way that teachers have some room for manoeuvre, although they are supposed to agree on teaching methods and assessment criteria for each specific subject within their school (Eckhardt 2019, p. 110). In the US, schooling starts with kindergarten at age 5. Overall, the US education system is characterised by decentralisation (McGuinn and Manna 2013): for example, the curriculum and funding are determined by school districts, and funding of public schools is not equalised within states (Yanushevsky 2011, p. 40 f.). Hence, there are large differences between schools in different districts with respect to curriculum, school resources, etc.

Accountability and testing

Overall, England has a comprehensive assessment system (Hall and Ozerk 2010) with a high degree of accountability (Bradbury 2014). At the end of Key Stage 2 (age 11), students take national standardised exams. In addition to the standardised tests, teachers give, at the end of the academic year, a judgement for each student in the same subjects in which the Key Stage 2 tests take place (Hall and Ozerk 2010, p. 376 f.). This judgement is based, amongst other things, on in-school tests as well as on a set of “probing questions” specifically provided by the central government education authorities to help assess each student’s level (Burgess and Greaves 2009, p. 5). The nature and frequency of in-school tests vary greatly across schools, as these are at the schools’ own discretion. In general, teachers must provide evidence for their judgement. To support teachers in “aligning their judgements systematically with national standards”, the Qualifications and Curriculum Authority provides online materials (Burgess and Greaves 2009, p. 5). Key Stage results and teacher judgements—at least during primary education—have no direct impact on students’ educational careers (Burgess and Greaves 2009). However, schools have an incentive to award high Key Stage scores as aggregate statistics are published and determine ranking in public school league tables, which then affects school desirability to parents and thus enrolment numbers (Burgess and Greaves 2009; Hall and Ozerk 2010). In Germany, during the first years of primary school, students’ knowledge and skills are assessed by means of competence-based reports, observation sheets, learning development reports, learning diaries, and portfolios (Eckhardt 2019). From Grade 3 onwards, pupils start taking written tests in subjects such as German or mathematics (Eckhardt 2019). In general, several accountability mechanisms are implemented with the aim of increasing school quality (see chapter 11.2 in Eckhardt 2019, for more information). In primary education, standardised tests in German and mathematics are administered in the second half of Grade 3. These tests assess pupils’ level of competency against the binding nation-wide educational standards (the so-called VERA 3; see, e.g., KMK 2015). The central aim of VERA 3 is to support and improve teaching and school development. The feedback teachers receive on the VERA 3 results contains information for each test subject at the class, task, and student level, each with national comparative scores. In the US, there have also been state-wide standardised tests (at least once between Grades 3 and 5) at public schools since the No Child Left Behind Act of 2001 (e.g., Figlio and Loeb 2011; Hanushek and Rivkin 2010). Nevertheless, each state develops its own standards and subject-specific accountability policies (Figlio and Loeb 2011; Hanushek and Rivkin 2010).

Ability grouping

In England and the US, ability grouping within schools is common and rather flexible, with low-threshold opportunities for changes over time (see Boliver and Capsada-Munsech 2021; Hallam and Parsons 2013, for England; see Condron 2008; Loveless 2013, for the US). For the UK, Hallam and Parsons showed that around 16% of children in Year 2 were streamed (Hallam and Parsons 2012, p. 522 ff.) and around 37% were setted (Hallam and Parsons 2013, p. 6). Here, streaming is defined as grouping students from the same year according to their ability into different classes in which most or all lessons are taught. Setting is instead defined as grouping students according to their ability in selected subjects only (Hallam and Parsons 2012, p. 520), so that students attend courses of different levels for different subjects (e.g., Domina et al. 2017). In the US, ability grouping within classes and subjects is common practice during primary school (e.g., Lleras and Rangel 2009), alongside grade retention (Warren and Saliba 2012). In Germany, ability grouping is relatively uncommon during primary education (Ammermueller and Pischke 2009). In most federal states, lessons are taught in grade-level classes; only in some federal states are there age-mixed groups in the first two years of schooling (Eckhardt 2019).

Table 1 summarises key information by country and presents some tentative expectations on cross-country variations in the degree of teacher bias in early primary school. Overall, we expect less pronounced variation according to family SES in the US (due to growth mindset, school accountability policies, and grouping), followed by England (due to school accountability policies and grouping). Conversely, teacher judgements in Germany should be particularly biased according to SES.

Table 1 Key country characteristics and expectations on their effect on teacher judgement. (Own compilation)

With respect to the impact of teacher biases on students’ achievement development, we expect stronger effects in England and the US due to the higher prevalence of ability grouping in lower grades (not shown in Table 1). In England, however, this effect might be a bit smaller due to standardised curricula.

Special attention should be paid to the amount of time teachers and children spend together. First, we expect teacher judgements to become more reliable over time as teachers gain additional information on their students (e.g., Paleczek et al. 2017). Second, if students are taught by the same teacher over several years, her or his judgements and behaviour should affect students’ achievement more strongly than if the teacher were to change every year (see Raudenbush 1984). Still, teacher bias might affect student achievement even with yearly teacher turnover: research has shown that the effects of teacher expectations can persist over years even when teachers change (e.g., Alvidrez and Weinstein 1999; de Boer et al. 2010; Hinnant et al. 2009; Rubie-Davies et al. 2014). Reasons for such long-term, cross-year effects could be that students internalise teachers’ positive or negative perceptions of their performance, which could then affect, for instance, their motivation and effort. It is also possible that students face systematically different learning opportunities (especially where ability grouping exists; e.g., de Boer et al. 2010). The typical situation in England is that, in primary school, the child has a single teacher for all subjects (the “class teacher”; Burgess and Greaves 2009). Teachers then change each year as children move up a grade, while students remain in the same class. The situation is similar in the US (e.g., Hill and Jones 2018). In Germany, a class teacher teaches all subjects during primary education and often accompanies the children for more than one year. From Grade 3 onwards in particular, the likelihood that students are taught by other, subject-specific teachers increases (Eckhardt 2019).

3 Data and operationalisations

3.1 Data

We analysed longitudinal data from England, Germany, and the US covering the period of primary education (see Table 2 for further information).

Table 2 Survey and data information by country

England

The Millennium Cohort Study (MCS) is an ongoing observational, multidisciplinary cohort study that began in 2000–2001 (Joshi and Fitzsimons 2016; University College London, UCL Institute of Education, Centre for Longitudinal Studies, Department for Education 2021; University of London, Institute of Education, Centre for Longitudinal Studies 2021). The MCS drew a representative sample of 18,552 families from across the UK in the first wave. We restricted the sample to students in state schools in England, as only for these students was information on Key Stage test results at the end of primary school available from the linked National Pupil Database.

Germany

The German National Educational Panel Study (NEPS) is a national multi-cohort study aimed at providing data on the development of a range of skills throughout the lifespan of cohort members (Blossfeld and Roßbach 2019). In our analyses, we used data from the Starting Cohort 2 (NEPS-SC2; NEPS Network 2020). A total of 6733 students from 374 schools were tested in Grade 1 in spring 2013, and 5636 parents were interviewed by telephone.

US

The Early Childhood Longitudinal Study, Kindergarten Class of 2010–11 (ECLS-K:2011) collected data from a nationally representative sample of about 18,150 students who entered kindergarten in the fall of 2010 in 950 schools across the US (Tourangeau et al. 2015).

3.2 Instruments

3.2.1 Teacher judgements at the beginning of primary education (T1)

In all three studies, teachers were asked to rate the mathematical skills of each student on a 5-point scale (much worse, slightly worse, equally as good, slightly better, much better in Germany; well/far below average, below average, average, above average, and well/far above average in the UK and US). Teachers in England and Germany were asked to compare the cohort member to children of the same age, whilst teachers in the US compared the child to other children of the same grade level.

3.2.2 Tests on mathematical achievement and cognitive abilities

Mathematical achievement

At age 7, the MCS administered an adapted version of the National Foundation for Educational Research (NFER) Progress in Math test. At age 11, Key Stage 2 mathematics test marks from the National Pupil Database were linked to MCS participants. The adapted version of the Progress in Math test assessed mathematical skills and knowledge by asking children 20 questions covering topics such as numbers, space, measurement, and data handling. The test was read aloud to children at their homes, and they were asked to complete a series of calculations in a paper-and-pencil exercise. All children had to complete an initial test and were then routed to an easier, medium, or harder section on the basis of their initial score. Key Stage 2 mathematics test marks are a component of the compulsory standardised assessment based on the national curriculum for all children in state schools in England at the end of Year 6 (age 11).

In Germany, we used results from mathematical tests constructed by the NEPS. The tests covered content-related components (i.e., quantity; space and shape; change and relationships; data and chance) and process-related components (i.e., applying technical skills, representing, modelling, communicating, problem-solving; Schnittjer et al. 2020). The tests consisted of 22 items in Grade 1 and 24 items in Grade 4. In Grade 1, a picture-based answer format was used, whereas in Grade 4, a paper-and-pencil format was employed.

For the US, we used results from mathematical tests conducted in Grades 1 and 5. The assessment framework was based on that developed for the National Assessment of Educational Progress and for the Principles and Standards for School Mathematics guidelines of the National Council of Teachers of Mathematics. The assessment was designed to measure skills in conceptual knowledge, procedural knowledge, and problem-solving. The test consisted of questions on number sense, properties, and operations; measurement; geometry and spatial sense; data analysis, statistics, and probability; and patterns, algebra, and functions. At both time points, a set of routing items was administered to all students, and then the students’ scores on these items determined which second-stage test (low, middle, or high difficulty) they received.

Cognitive abilities

In England, cognitive abilities were measured using the British Ability Scales II Pattern Construction test (Elliott et al. 1996; Jones and Schoon 2008), in which children were asked to replicate patterns presented to them using solid plastic cubes. In our analysis, we used the ability score, which accounts for differences in the items answered by children due to differential routing by difficulty.

In Germany, we used results from the NEPS-MAT test administered in Grade 2 to assess nonverbal abilities. The test included horizontally and vertically arranged fields with different geometrical elements. Children were asked to choose, from several offered solutions, the correct complement for a free field by deducing the logical rules underlying the patterns of the geometrical elements. The test consisted of 12 items.

In the US, working memory was measured by the Numbers Reversed task (Blackwell 2001). The child was asked to recall an orally presented sequence of numbers and repeat the sequence in reverse order. Although the sequences became progressively longer, they did not exceed eight numbers. In our analyses, we used the age-standardised score (the W score), representing both a child’s ability and the task difficulty (Tourangeau et al. 2015).

All indicators for student achievement and teacher judgements were z‑standardised to allow for cross-country comparisons.
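As a minimal sketch of this step (not the authors' code; the DataFrame and column names below are hypothetical), the indicators can be z-standardised separately within each country sample before the analyses:

```python
# Minimal sketch, assuming one pandas DataFrame per country with hypothetical
# column names; each indicator is z-standardised within the country sample.
import pandas as pd

def z_standardise(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    out = df.copy()
    for col in cols:
        out[col + "_z"] = (out[col] - out[col].mean()) / out[col].std()
    return out

# e.g., applied to each country file separately before running the models:
# df_england = z_standardise(df_england, ["judgement", "math_t1", "math_t2", "cog"])
```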

3.2.3 Time aspects

In all three countries, the windows of information collection on teacher judgement (T1) spanned several months. We accounted for this variation by creating a dummy variable with the value 0 if a teacher rated the students early and the value 1 if a teacher rated them later in the data collection process. In England, where the school year runs from September to July, we assigned students who had been assessed by teachers between September and January to early, and those assessed between February and July to late. In Germany, early assessments took place between February and March and late assessments between April and June. In the US, assessments between January and March were defined as early and those between April and June as late. Besides the time of assessment, we also considered the students’ age in months at the time of testing at T1. Furthermore, when we examined the effects of biased assessment on later achievement, we controlled for the time span (in months) between testing at T2, towards the end of primary school, and testing at T1, at the beginning of primary education.

3.2.4 Background variables

We operationalised family SES using the highest education of the (step-)parents living together with the child at the start of primary education. Hence, where there was only one co-resident parent, the family was categorised based on her or his level of education, whilst where there were two resident parents, the family was categorised based on the more highly educated of the two. In the final variable, we distinguished between high, medium, and low education. In all countries, high education captured a first/bachelor’s university degree or higher, requiring 3–4 years of full-time study at the tertiary level. The definition of low education differed between the two countries with comprehensive school systems (England and the US) and Germany, an early tracking country. For England and the US, low education was defined as no qualification beyond the expected standard, i.e., the target of the education system for all students in compulsory education. In the US, this was a high school diploma; in England, it was the attainment of at least a grade C qualification at the end of compulsory schooling (age 16). For Germany, low education was defined as no attainment beyond the intermediate/junior secondary track. The medium education group comprised all those who fell into neither the high nor the low category. Family education could be evident to teachers through, e.g., cultural mannerisms and linguistic patterns (e.g., Ready and Chu 2015, p. 972).
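A sketch of this coding rule, under the assumption of one row per family with one education value per co-resident parent (variable names and category labels are hypothetical), might look as follows:

```python
# Illustrative sketch of deriving family SES as the highest education of the
# co-resident (step-)parents; column names and category labels are assumptions.
import numpy as np
import pandas as pd

EDU_RANK = {"low": 0, "medium": 1, "high": 2}
RANK_EDU = {v: k for k, v in EDU_RANK.items()}

def highest_parental_education(df: pd.DataFrame) -> pd.Series:
    """Return 'low'/'medium'/'high' based on the more highly educated resident parent."""
    rank1 = df["parent1_edu"].map(EDU_RANK)
    rank2 = df["parent2_edu"].map(EDU_RANK)   # NaN if there is only one resident parent
    highest = np.fmax(rank1, rank2)           # np.fmax ignores NaN unless both are NaN
    return highest.map(RANK_EDU)

# df["parent_edu"] = highest_parental_education(df)
```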

We further considered immigration status indicating whether a student and/or at least one parent was foreign-born (0 = no immigration; 1 = foreign-born student and/or at least one parent). Furthermore, we controlled for the gender of the student (0 = male; 1 = female).

Descriptive statistics of the study variables are displayed in Table 3.

Table 3 Descriptive statistics by country. (Sources: Own calculations based on MCS, NEPS-SC2, and ECLS-K:2011)

3.3 Analytic approach

We used a stepwise approach as suggested by Madon et al. (1997) and applied in other studies (e.g., Gentrup et al. 2020; Hinnant et al. 2009):

In a first step, we regressed teacher judgement on students’ results in a mathematical achievement test (correcting standard errors by clustering students by teacher). In addition, results from tests on cognitive abilities were used as a covariate to reduce the risk of measurement error as well as omitted-variable bias. The concern is that a single mathematical achievement test score may not fully capture a child’s “true” ability, either because it is a noisy measure (with random error) or because it is only a partial measure of overall mathematical ability. In this situation, a component of what ends up in the residual error term—what we take to represent “bias”—may in fact reflect the teacher’s superior knowledge of the child’s genuine capacities. To take this into account, we included a further performance measure. Together, the two test scores—for math and for cognitive abilities—should capture students’ “true” performance to a larger degree, and variation in teacher ratings beyond these comprehensive indicators of student performance should then reflect bias. The residuals of these regressions were then standardised to zero-mean, unit-variance z-scores and compared across parental education groups in order to identify biased teacher judgements and their SES gradient: a positive residual score represents teacher overestimation and a negative residual score represents teacher underestimation; the prediction of student achievement is more accurate the closer a residual score is to zero (Madon et al. 1997).
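The following Python sketch illustrates this first step. It is not the authors' code; variable names such as judgement_z, math_t1_z, cog_z, and teacher_id are hypothetical, and complete cases are assumed (cf. the complete-case analysis noted below):

```python
# First step (sketch): regress teacher judgement on math and cognitive test
# scores with teacher-clustered standard errors, then z-standardise the
# residuals, which are interpreted as judgement bias.
import pandas as pd
import statsmodels.formula.api as smf

def judgement_bias_residuals(df: pd.DataFrame) -> pd.Series:
    model = smf.ols("judgement_z ~ math_t1_z + cog_z", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["teacher_id"]}
    )
    resid = model.resid
    # Positive values indicate overestimation, negative values underestimation.
    return (resid - resid.mean()) / resid.std()

# df["bias_resid"] = judgement_bias_residuals(df)
# df.groupby("parent_edu")["bias_resid"].mean()  # SES gradient in bias (cf. Fig. 1)
```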

In a second step, we estimated linear regression models for maths test scores towards the end of primary school in a value-added model framework. Here, we used information on test results from Grade 1 (Germany and US) or Year 2 (England) as predictors along with the residuals from the previous regression, as well as further controls such as parental education, gender, and immigration status. Standard errors were corrected by clustering students by classes.
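A corresponding sketch of the second step, again with hypothetical variable names and an illustrative set of controls, could look like this:

```python
# Second step (sketch): value-added models for later math achievement with
# class-clustered standard errors; M1 omits and M2 adds the bias residuals.
import statsmodels.formula.api as smf

CONTROLS = (
    "math_t1_z + C(parent_edu, Treatment(reference='medium')) "
    "+ female + immigrant + months_t1_to_t2"
)

def fit_clustered(formula, df):
    return smf.ols(formula, data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["class_id"]}
    )

# m1 = fit_clustered("math_t2_z ~ " + CONTROLS, df)               # M1: without residuals
# m2 = fit_clustered("math_t2_z ~ bias_resid + " + CONTROLS, df)  # M2: with residuals
# The change in the parental-education coefficients between m1 and m2 gives the
# delta-beta values discussed in Sect. 4.2.
```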

All results are based on complete-case analyses.

4 Results

4.1 Teacher judgements and student social background

Results from linear regressions with teacher judgement of students’ mathematical skills (z-standardised scores) as the dependent variable are presented in Table 4. Students’ math test scores were a much stronger predictor than cognitive abilities in all three countries. Overall, the share of explained variance was highest in England and the US (37% and 40%, respectively) and strikingly lower in Germany (25%).

Table 4 Results of regression models for teacher judgement (z-standardised). (Sources: Own calculations based on MCS, NEPS-SC2, and ECLS-K:2011)

In the next step, we examined whether the standardised residuals from these regressions and, hence, the degree and direction of inaccuracy varied systematically by parental education. Fig. 1 presents the mean residuals for the three education groups with 95% confidence intervals.

Fig. 1 Teacher judgement bias (mean residuals) by parental education and country. Plotted are the means and 95% confidence intervals for each parental education group (values are in Table S1 in the supplementary material). (Sources: Own calculations based on MCS, NEPS-SC2, and ECLS-K:2011)

In England and Germany, teacher judgements of students with low-educated parents showed on average a negative bias, whilst teacher judgements of students from high-educated families showed a positive bias. In the US, the results revealed a different pattern: family SES was unrelated to teacher judgements.

4.2 Teacher judgements, student social background, and student progress in primary school

Table 5 displays the coefficients of parental education and inaccuracy in teacher judgement (operationalised as the residuals from the regressions presented in Table 4) from the regression models for later student achievement. All models controlled for test results in Grade 1 (Germany and the US) or in Year 2 (England) as well as for socio-demographic characteristics of the student/family and the time elapsed between testing at T1 and T2. In all countries, achievement gaps related to parental education increased during primary school, with students of high-educated parents showing higher gains in mathematical skills, given their initial achievement, compared to students from medium-educated and low-educated parents (see M1 in Table 5). These parental education-related differences in learning progress were particularly pronounced in Germany.

Table 5 Results of regression models for later student mathematical achievement (z-standardised). (Sources: Own calculations based on MCS, NEPS-SC2, and ECLS-K:2011)

Our main interest was whether the effects of parental education were at least partly due to biased teacher judgement. Therefore, Model 2 controlled for the standardised residuals from the regressions presented in Table 4. Inaccurate judgement at the beginning of primary education was associated with students’ math achievement at later time points (see M2 in Table 5): overrated (or underrated) students performed better (or worse) later on. In England and, to a lesser extent, in Germany, we also observed a clear and significant reduction in the effect of parental education after controlling for biased judgements (England: high-educated: ∆β = −0.06, SE = 0.01, p < 0.001; low-educated: ∆β = 0.05, SE = 0.01, p < 0.001; Germany: high-educated: ∆β = −0.02, SE = 0.01, p = 0.003; low-educated: ∆β = 0.04, SE = 0.01, p < 0.001). In England, for example, biased teacher judgements at T1 accounted for 32% of the achievement gap between students from high- and medium-educated families in the value-added model (∆β/βM1); for the gap between students from low- and medium-educated families, the share was 50%. In the US, in contrast, there was a small but significant increase in the effect for students with high-educated parents when controlling for teacher judgement residuals (high-educated: ∆β = −0.01, SE = 0.00, p = 0.010; low-educated: ∆β = −0.00, SE = 0.00, p = 0.057). These patterns supported our expectation that biased teacher judgements might contribute to widening SES-related achievement inequalities over time.
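Expressed as a formula, this share is the change in the parental-education coefficient between M1 and M2 relative to its M1 value (the ∆β/βM1 expression above); βM1 and βM2 denote the coefficients from Models 1 and 2 in Table 5, which are not reproduced here:

```latex
% Share of the (conditional) parental-education achievement gap explained by
% biased teacher judgements; Delta-beta is the change in the coefficient
% between M1 and M2. Absolute values are used because the sign of the
% coefficient depends on the education contrast considered.
\[
  \text{share explained}
  = \frac{\lvert \Delta\beta \rvert}{\lvert \beta_{M1} \rvert}
  = \frac{\lvert \beta_{M2} - \beta_{M1} \rvert}{\lvert \beta_{M1} \rvert}
\]
% Example (England, high- vs. medium-educated): |Delta-beta| = 0.06 corresponds
% to a share of 32% of the M1 coefficient.
```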

4.3 Sensitivity checks and further analysis

Heterogeneous effects of biased teacher judgements

In England and the US, we found that the association of biased teacher judgements with math achievement was significantly weaker for students from highly educated families as compared to students from low-educated families (see Table S3 in the supplementary material). In Germany, respective interaction effects between parental education and residuals were nonsignificant.

Teacher change over the course of primary education

For Germany, we considered whether effects were stronger for children taught by the same teacher throughout several grades of primary education. Information collected annually on teachers’ birth year, month, and gender was used to identify whether the teacher changed over time. We then re-estimated all models for a restricted sample of students who were taught by the same teacher for at least two years (see Table S4). The results were similar to those presented above, both in terms of teacher judgement accuracy and the effects of teacher judgements on student achievement.
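A sketch of how such a teacher-stability flag could be constructed (not the authors' code; the long-format wave structure and variable names are assumptions) is given below:

```python
# Sketch: flag students taught by the same teacher in at least two consecutive
# waves, using the teacher's birth year, birth month, and gender as a proxy ID.
import pandas as pd

TEACHER_KEY = ["teacher_birth_year", "teacher_birth_month", "teacher_female"]

def same_teacher_two_years(long_df: pd.DataFrame) -> pd.Series:
    """long_df: one row per student and wave, including the TEACHER_KEY columns."""
    def flag(group: pd.DataFrame) -> bool:
        ids = group.sort_values("wave")[TEACHER_KEY].apply(tuple, axis=1)
        return bool((ids == ids.shift()).any())  # identical teacher in consecutive waves
    return long_df.groupby("student_id").apply(flag)

# stable = same_teacher_two_years(df_long)
# restricted = df[df["student_id"].isin(stable[stable].index)]
```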

Language skills

We replicated all analyses using language skills as the outcome variable (see Tables S5–S8 as well as Figure S1). The results regarding biased judgements and their association with achievement development are largely comparable to the results presented for mathematics. For Germany, however, we observed a less pronounced association between teacher judgement bias and later language skills than in the mathematical domain. It remains an open research question why the association is so weak in Germany. One potential explanation might be that the objective measures of language skills at T2 in England and the US are curriculum-oriented, while those in Germany aim to assess general language skills (i.e., receptive vocabulary): if teacher judgements are manifested in teaching behaviour, they should have a stronger influence on the acquisition of language skills that are primarily taught at school (e.g., reading skills, spelling, etc.). In contrast, in the acquisition of more general language skills such as vocabulary, parents and peers are also strongly involved. Consequently, effects of teacher judgement bias on later achievement should be more observable when curriculum-oriented tests are used.

5 Conclusion and discussion

In this paper, we asked whether inaccurate teacher judgements of their students’ math skills correlate with student social origin and whether such bias is associated with math achievement in primary school. We examined the unexplained variance in teacher judgements that remained after controlling for actual student achievement and cognitive abilities, and we interpreted this variance as inaccuracy in teacher judgements. We expected that a growth mindset culture, higher accountability, and more ability grouping lead to lower teacher judgement bias. Consequently, we expected teacher bias to be particularly low in the US, followed by England. In contrast, we expected a more pronounced teacher judgement bias for Germany due to less common growth mindsets, a lower degree of accountability, and the absence of ability grouping during primary education. Empirically, our expectations were confirmed. We showed that the unexplained variance in teacher judgements was systematically linked to family SES, operationalised by the highest parental education, in Germany and to a lesser extent in England, but not in the US. This pattern is in line with previous research on systematic variations in teacher judgements based on family SES in those countries (e.g., Campbell 2015; Geven et al. 2021; Hinnant et al. 2009; Tobisch and Dresel 2017).

In a subsequent step, we studied whether teacher judgement bias was associated with later achievement and mediated the effect of parental education. Due to the higher prevalence of ability grouping in lower grades in England and the US, we expected stronger effects on later achievement in these two countries compared to Germany. For England, this effect might be attenuated due to standardised curricula. Empirically, we showed that inaccuracy in teacher judgements predicted students’ end of primary school achievement in all three countries, even when considering prior achievement, cognitive abilities, and students’ background characteristics. This could be interpreted as a self-fulfilling prophecy. Only in England and Germany did the effect of parental education decrease when controlling for biased judgements. Since no relation was found between teacher inaccuracy and parental education in the US, it is not surprising that, in this country, the parental education effect on math achievement growth was mediated only partially and to a lesser extent by teacher judgement.

As we observed country differences both in the extent of teacher bias and in the relevance of this bias for achievement development (see Geven et al. 2021, for similar findings), our findings support the assumption that institutional and societal settings matter. Hence, a cross-country perspective enriches research on the role of teachers in explaining SES-related inequalities. Our results can be seen as a starting point for future research to investigate cross-country variations and the underlying mechanisms in more detail.

Although our study contributes to the literature on teacher accuracy and bias by providing a cross-country comparison, it has limitations. First, we described the underlying theoretical considerations and linked them to the situation in the three countries under study to derive hypotheses on the extent of family SES-related judgement bias in England, Germany, and the US. However, our expectations referred to general trends for the three countries; it remains open to what extent within-country variation weakens or strengthens the observed patterns. Second, we did not consider underlying mechanisms such as ability grouping or actual accountability and monitoring approaches at school and, therefore, we do not know which of the presumed mechanisms actually led to the observed patterns. For example, we expected ability grouping to be important in mediating the association between biased teacher judgements and later achievement. However, we are not able to explore this fully because, in our data, ability grouping assignments pre-dated the teacher judgements and were thus observable to teachers when they formed their judgements. Exemplary analyses for England revealed that ability group placement is associated with teacher judgements and predicts how children will progress over time, net of standardised test scores at T1 (see Table S9). Future studies should also examine to what extent there is mutual reinforcement or weakening between, amongst others, growth mindset, accountability, and ability grouping. Furthermore, other institutional or societal factors that might (simultaneously) affect teacher judgement bias should be considered. For example, another social mechanism possibly responsible for the observed country differences might be variation in teachers’ awareness of expectation effects. In the US, research on self-fulfilling prophecies in schools has a much longer tradition (initiated by the Pygmalion in the Classroom experiment; see Rosenthal and Jacobson 1968) than in Europe. Consequently, teachers might be informed about this phenomenon during teacher education in the US, but to a lesser extent in Europe. Third, concerns might be raised about linking results from standardised assessments with global teacher judgements (e.g., Arens et al. 2017; Hübner et al. 2022; Jussim and Harber 2005). Teacher judgements might be more accurate than test results, as teachers might have “valid” information above and beyond what a (single) test captures. This additional information would also account for the fact that children with higher teacher judgements perform better in later achievement tests. Fourth, previous research showed that indirect teacher judgements of general mathematical performance, like the ones we drew on, correlate less strongly with actual student skills and abilities than direct judgements, such as those referring to the expected number of correctly solved tasks of a math test (Hoge and Coladarci 1989; see also Südkamp et al. 2012). This result indicates that direct teacher judgements are more accurate. Consequently, the patterns we reported in this study might have been less pronounced if more specific measures of teacher judgement had been used. Fifth, possible criticism might pertain to the two-step approach used: standard errors in the second regression will tend to be underestimated because the residuals are treated as observed variables, ignoring the imprecision that comes from estimating them (Murphy and Topel 2002).
Sixth, we did not consider the role of race or ethnicity. For the US in particular, there is mixed evidence on whether the discrepancy between teacher judgements or expectations and student test scores systematically varies with students’ race or ethnicity (see, e.g., Geven et al. 2018; Wang et al. 2018, for an overview). While some studies documented variation in teacher ratings by race and/or ethnicity (e.g., McKown and Weinstein 2008; Ready and Wright 2011), others suggest that much of this variation reflects actual performance differences between racial and/or ethnic groups (e.g., Jussim et al. 1996; Madon et al. 1998). Understanding how country differences in SES-related teacher bias are confounded with race- or ethnicity-related bias is an important topic for future research. Finally, as in all cross-country studies with post-hoc harmonised data, survey designs and instruments differ in measurement points, test material, or wording of questions (Law et al. 2021). We tried to make the data as comparable as possible; however, some issues remain. For example, although all mathematics achievement tests seemed to measure similar facets of math skills (e.g., number knowledge, knowledge of geometry, and spatial sense), we did not have access to the individual test items.