1 Introduction

Levels of institutional differentiation are characteristic features of educational systems. In this context, there is a very controversial discussion concerning early between-school ability tracking, i.e., regarding the grade at which students are separated into different ability tracks with different curricula and different access to higher education. For example, Germany tracks students after the fourth grade, while countries like the USA do not track students into ability-grouped schools before higher education.

The arguments in favor of selective schooling center on a perceived trade-off between equity and efficacy (Hanushek and Wößmann 2006). Those who believe in the efficacy of track differentiation argue that it is easier and more efficient to teach more homogeneous student groups. Tracking advocates also argue from a societal perspective that vocational and academic tracks give rise to school leavers with a mix of qualifications, which is beneficial in a heterogeneous job market. However, this does not consider the possible effects of tracking on equity, especially in the case of very early tracking. A possible social bias in the track selection process and differential expectations, motivations, and resources between the different tracks might contribute to increased inequality (Maaz et al. 2008).

Most previous research on tracking compared countries with tracked and comprehensive school systems. The majority of studies, however, were based on simple correlations and failed to account for the possibility that countries with a tracked as opposed to a comprehensive school system might differ in terms of other important institutional features (van de Werfhorst and Mijs 2010). To disentangle the effect of tracking from the effects of other institutional determinants of student achievement, Hanushek and Wößmann (2006) proposed a difference-in-differences approach where they combined primary (before tracking) and secondary (after tracking) school data to identify the causal effect of early between-school tracking on educational outcomes. This approach has also been adopted by other studies since it allows researchers to identify the effect of tracking on achievement. The findings of these studies paint an inconclusive picture. A limitation of international comparative studies is that their effect estimations are based on rather small samples, since the level of analysis is the country level and the number of countries is naturally limited. Furthermore, different studies have focused on different samples of countries, international assessments, assessment cycles, domains, and measures of educational inequality. For this reason, it is difficult to determine whether inconclusive research findings are due to substantive differences in the setup of the different studies or due to imprecisions in the estimations caused by small samples.

The main purpose of the present study was to use the accumulated data of three international large-scale assessments: the Programme for International Student Assessment (PISA), the Progress in International Reading Literacy Study (PIRLS), and the Trends in International Mathematics and Science Study (TIMSS). Combining data from different studies and study cycles increased the sample size and helped us to obtain more precise tracking effect estimates. Furthermore, we used the same data to systematically replicate the analyses for different outcome measures. Specifically, we focused on the effects of tracking on performance levels and three different types of inequalities in achievement, namely dispersion inequality, social achievement gaps, and educational inadequacy.

This paper is divided into five sections. First, we review the theoretical and empirical research on the effects of tracking on different types of inequalities and on performance. Second, we specify our research question and the aim of this study. Third, we present the analytical approach we use to identify the effect of tracking and our approach to combine the results from different analyses. Fourth, we describe the main results regarding the effects of tracking on inequalities and performance. Fifth, we discuss our findings and provide conclusions for educational policy and future research.

2 Literature review: how does tracking affect educational inequalities?

In the first part of the literature review, we outline a theoretical framework for the effects of tracking on inequalities and performance, preceded by a brief clarification of the distinction between three types of educational inequalities. We focus on achievement as it is an important predictor of, for instance, labor market returns, wellbeing, political engagement, integration, and countries’ economic growth (Brighouse et al. 2018; Hanushek 2013; Hanushek et al. 2015). In the second part, we review previous studies on the effects of tracking.

2.1 Different concepts of achievement inequalities

Inequality is a term that has been used in quite different ways by different authors. Van de Werfhorst and Mijs (2010) distinguished between inequality as dispersion and social inequality. These two conceptualizations have different normative ideas about what is unjust (Strietholt 2014) and we think that identifying the differences between inequality conceptualizations is important for the evaluation of the results. Inequality as dispersion implies that the mere existence of differences in achievement is problematic. Social inequality regards differences between social groups as problematic but does not consider the mere existence of variation within each group problematic. Strietholt and Borgna (2018) noted that several studies on educational inequality focused on threshold inequality, which centers on the lower distribution of scores and refers to the proportion of students who do not reach a minimum performance level. This concept is also referred to as educational deprivation (Solga 2014, p. 271), minimum standard (UNESCO 2018), or educational adequacy (Brighouse and Swift 2008, 2009). The basic idea of threshold inequality is that all students should reach a certain threshold achievement level, while inequalities beyond this threshold are not problematic. Therefore, we evaluate the effects of tracking separately for each conceptualization of inequality, as each concept implies different normative ideas about justice. Different inequalities can furthermore be expected to have different implications for societal and individual development. In addition, there are empirical reasons to study the effects on the three concepts separately, as the measurements of the concepts of inequality are not found to correlate with each other. For instance, the dispersion of scores is not associated with the performance gap between students from lower and higher social classes (Strietholt and Borgna 2018).

2.2 Tracking as transition and the effects on inequality

In theoretical terms, between-school tracking constitutes a type of educational stratification that is external (differentiation between schools) and formal (regulated by law) (Chmielewski 2014; Dollmann 2019; Skopek et al. 2019). While our study focuses on between-school tracking, our findings and arguments may apply to other mechanisms of educational differentiation (e.g., within-school tracking). At least three different mechanisms explain how tracking reinforces inequality in achievement; we introduce these before reviewing studies on tracking effects on different types of inequality. First, we describe how the stigmatization of lower tracks affects students at the lower end of the achievement distribution (educational inadequacy). Second, we outline how unequal curricula and resources explain an effect of tracking on the overall achievement distribution (dispersion inequality). Third, we depict how social bias in allocating students to different tracks perpetuates social inequalities in achievement (social achievement gaps).

Stigmatization of lower tracks

One set of arguments against tracking rests on the anticipated disadvantages for students in lower tracks (Slavin 1990). Various researchers observed that students in lower tracks developed negative attitudes toward school; they also expected little future payoff, had lower educational expectations, and had more pronounced feelings of futility than students in higher tracks (Karlson 2015; Lee 2014; van Houtte and Stevens 2015). Such negative attitudes may have consequences for student learning. At the same time, the social composition of schools may have consequences for children’s education. More homogeneous groups may inhibit the positive peer effects of heterogeneous classes, where disadvantaged students may benefit from the shared learning environment (Coleman et al. 1966; Sacerdote 2011). In contrast to the idea of no child left behind, the existence of lower tracks legitimizes poor performance by some students. Following this line of argumentation, tracking might increase the proportion of students who do not have basic literacy skills, a phenomenon that is essentially related to the concept of educational inadequacy.

Unequal curricula and resources

Different tracks lead to different educational pathways that allow students to pursue academic or vocational careers. Such differences are manifested in curricula that are more or less ambitious in lower and higher tracks. In the same vein, the allocation of educational resources—such as teacher quality, infrastructure, and funding—may differ between tracks. Indeed, there is evidence that students in higher tracks benefit from better educational resources (Becker et al. 2012; Guill et al. 2017; Martinková et al. 2020). Such track-specific inequalities in educational opportunity may lead to a higher dispersion of educational outcomes, i.e., dispersion inequality.

Transitions and social bias

So far, this paper has not needed to challenge the assumption that students are allocated to different tracks based on their abilities in order to hypothesize that tracking increases different types of inequality in achievement. However, transitions within educational systems may reinforce social inequality. Boudon (1974) proposed two mechanisms through which transitions may reinforce social inequality: first, privileged children tend to perform better (primary effects), and second, even after controlling for prior achievement, privileged students have greater chances of accessing more ambitious tracks (secondary effects). There is a plethora of evidence showing that tracking decisions are not solely based on performance (which could have primary segregation effects), but also depend on race or social class after taking previous academic achievement into account (secondary effects) (Batruch et al. 2018; Hallinan 1994; Holm et al. 2013; Horn 2013; Lucas and Berends 2002; Maaz et al. 2008; Pietsch and Stubbe 2007). Additionally, children from privileged backgrounds might receive more support from their parents to reach high tracks (Koerselman 2013). In this respect, the time point at which students are tracked is a critical moment. A recurring hypothesis is that parental background exerts a strong influence on educational transitions, especially when children are younger (Bauer and Riphahn 2006; Chmielewski 2014; Hillmert and Jacob 2010; Lange and von Werder 2017; Schütz et al. 2008). If different tracks lead to a stigmatization of students or provide different educational opportunities for them, social bias in the tracking process will result in higher social achievement gaps. This contradicts the ideal of tracking as a meritocratic process.

2.3 Empirical evidence of early tracking effects

The previous research on early tracking effects can be divided into three categories: studies that conduct cross-sectional analyses on a between-country level, studies that apply quasi-experimental designs, and within-country comparative studies (cf. Skopek et al. 2019). Cross-sectional studies with international data showed mixed findings regarding the associations of between-school tracking and dispersion inequality (Huang 2009; Micklewright and Schnepf 2007; van de Werfhorst and Mijs 2010). Such cross-sectional studies also found that between-school tracking is associated with higher levels of social inequality (Dämmrich and Triventi 2018; Dollmann 2019; Duru-Bellat and Suchaut 2005; Gorard and Smith 2004; Horn 2009; Marks 2005; Schlicht et al. 2010; Schütz et al. 2008; Skopek et al. 2019; van de Werfhorst and Mijs 2010). However, cross-sectional studies only use information from one point in time and do not allow researchers to draw causal conclusions.

Few studies have used robust designs that allowed researchers to draw causal inferences on the effects of tracking. Most of these robust studies estimated difference-in-differences models to exploit the fact that no country has a tracked primary school system, while some countries allocate students to different ability tracks at the secondary school level. Therefore, researchers can compare student outcome measures in tracked versus comprehensive school systems at the secondary school level while controlling for the same measures at the primary school level to identify the effects of tracking. Another robust approach for identifying tracking effects is to study variation in tracking status within countries over time. There are, however, only two studies that employed this approach, since such school-system reforms rarely occur.

In the following, we review studies on the effects of tracking on dispersion inequality, educational inadequacy, and social achievement gaps. Furthermore, we review findings on tracking effects on performance levels in order to provide some evidence for a possible trade-off between efficacy and inequality.

Effects on dispersion inequality

Hanushek and Wößmann (2006) used PISA, TIMSS, and PIRLS data from several cycles administered between 1995 and 2003 in the domains of mathematics, reading, and science. They combined eight pairs of primary and secondary school studies (e.g., PIRLS 2001 and PISA 2000) and estimated a series of difference-in-differences models for each pair. While they found substantial variation in the effect estimates for different pairs of studies, the pooled estimate indicated that early tracking increased the dispersion of test scores. The variation in the effect estimates might have been due to the fact that each pair of studies only looked at 18 to 26 countries. The findings provided little evidence for domain-specific differences in the effect estimates. Jakubowski (2010) replicated Hanushek and Wößmann’s (2006) study of PIRLS 2001 and PISA 2000 data and found no significant effect on dispersion inequality. Hanushek and Wößmann found no effect for this particular pair of studies either. However, Jakubowski (2010) also analyzed another combination of TIMSS 2003 and PISA 2003 data and again found no effect. Further studies replicated Hanushek and Wößmann’s approach using international data but focused on other educational outcomes and not dispersion inequality (see below).

To our knowledge, only one study has exploited national educational reforms to examine the effects of tracking on dispersion inequality. Piopiunik (2014) combined German data from the PISA 2003 and 2006 cycles and found that lowering the age of tracking increased dispersion inequality significantly. This study focused on a policy change in the federal state of Bavaria, where the tracking age was lowered from sixth to fourth grade. The study provided no evidence that the effects differed for mathematics, reading, and science.

Effects on educational inadequacy

Some studies have estimated the effects of early tracking on different quantiles of the achievement distribution. The percentiles at the lower end of the international achievement distribution can be perceived as thresholds defining educational adequacy. The evidence suggests that tracking increases the number of students who do not achieve basic literacy. Hanushek and Wößmann (2006) found that tracking had a negative effect on the performance of students in the lower quantile of the achievement distribution. Similar analyses of more recent study cycles of PIRLS, TIMSS, and PISA replicated the finding that early tracking had a negative effect on performance at the lower end of the achievement distribution (Lavrijsen and Nicaise 2016). The effects were most pronounced in reading. The aforementioned study by Piopiunik (2014) provided further evidence for a negative effect of early tracking on educational adequacy. Lowering the tracking age in the German state of Bavaria increased the share of low performers in mathematics, reading, and science.

Effects on social achievement gaps

Findings from the research on effects of tracking on social inequality have been inconclusive. While some studies provided evidence that tracking perpetuated social inequality, most observed no tracking effect on social achievement gaps. Ammermüller (2005) estimated a difference-in-differences model based on PISA 2000 and PIRLS 2001 data from 14 countries and found that the effect of social background on reading achievement was more pronounced in countries with more differentiated school tracks. Other studies used the tracking age instead of the number of school tracks as the main explanatory variable. Waldinger (2007) found no effect of the tracking age on the social gap in reading achievement using PIRLS 2001 and PISA 2003 data from a similar but not identical set of 14 countries. Jakubowski (2010) studied the effects of early tracking on social gaps in reading and mathematics. The analyses of PIRLS 2001 and PISA 2003 reading data from 23 countries revealed no significant effects. The analyses of TIMSS 2003 and PISA 2003 mathematics data from 15 countries, however, provided some evidence that early tracking significantly increased social gaps in mathematics achievement. A study using more recent data from PIRLS 2006 and PISA 2012 (N = 33 countries) observed that an earlier tracking age increased social gaps in reading achievement (Lavrijsen and Nicaise 2015).

A general limitation of the previously presented research was that each study was based on a small set of countries. To address this issue, Ruhose and Schwerdt (2016) combined data from five PISA cycles (2000–2012), five TIMSS cycles (1995–2011), and two PIRLS cycles (2001–2006). In total, they analyzed data from 45 countries. Many of these countries were observed in different studies and at multiple time points. The study provided no evidence that tracking increased the achievement gap between native and immigrant students.

Van de Werfhorst (2018) combined secondary school data from the First International Mathematics Study (FIMS) from 1964, the Second International Mathematics Study (SIMS) from 1980 to 1982, and the Third International Mathematics and Science Study (TIMSS) from 1995. The study showed that social achievement inequality was lower in countries that had transformed their school system from tracked to comprehensive than in countries where tracking was retained. A limitation of this study was that it was only based on nine countries that participated in all three international assessments and that only four of these had reformed their school systems.

Effects on performance levels

Studies on the effects of tracking on performance levels revealed mixed findings. Hanushek and Wößmann (2006) and Lavrijsen and Nicaise (2016) replicated analyses on the effects of tracking on performance levels for eight combinations of primary and secondary school assessments. Both reported a tendency for early tracking to reduce performance levels. However, more than half of the single estimates were neutral and one was even significantly positive. Jakubowski (2010) analyzed two study pairs and found one neutral and one negative effect on performance levels.

In the same vein, two single-country studies in Germany and Northern Ireland reported contradictory findings. Piopiunik (2014) found a negative effect of tracking on performance levels in Bavaria, Germany. Guyon et al. (2012) found evidence of improved results when the number of students attending the higher track in Northern Ireland was increased.

Summary of the review

The review of research revealed inconsistent findings, which makes it impossible to draw robust inferences on the effects of tracking on student outcomes. We propose two possible explanations for the variation in the effect estimates related to conceptual differences in the outcome measures and to the small sample sizes at the country level.

The conceptual distinction between different educational outcomes seems to explain some of the variation in the results of different studies. At the same time, it is difficult to draw strong conclusions about conceptually different outcomes because the number of studies was limited for each outcome. While several studies focused on social achievement gaps as outcomes, only two investigated the effects of tracking on dispersion inequality. Furthermore, the different studies were based on different datasets and focused on different achievement domains, which makes it even more difficult to distinguish between substantive differences and sampling error.

The low sample size at the country level is another serious issue. Typically, studies only used data from around 20 countries when combining primary and secondary school assessments. Studies that replicated the analyses based on different combinations of primary and secondary school datasets revealed a remarkably high variability in the effect estimates. This illustrates that findings based on single combinations of datasets are unreliable. In this regard, the study by Ruhose and Schwerdt (2016) is an exception because it combined data from several cycles of PIRLS, PISA, and TIMSS in 45 study pairs to increase the sample size and to achieve more reliable estimates. However, that study focused on the achievement gap between native and immigrant students, which is conceptually related to but different from social gaps in achievement.

3 Research questions

The aim of this paper was to use international data to estimate the effects of early tracking on three different types of inequalities in achievement—dispersion inequality, social achievement gaps, and educational inadequacy—and on performance levels. Following Hanushek and Wößmann (2006), we combined primary and secondary school assessments to identify the effect of tracking by applying difference-in-differences analyses. Previous research used different datasets to study different outcomes and mostly drew on rather small samples of countries. Following Ruhose and Schwerdt (2016), we attempted to overcome these limitations by using all available cycles of PISA, TIMSS, and PIRLS administered between 1995 and 2016. The combined data increased the analytical sample and allowed us to study different outcomes.

4 Methodology

4.1 Data sources: combining primary and secondary school information

To identify tracking effects, we exploited the fact that some countries track their students after primary school, while others employ a comprehensive secondary school system. For this purpose, we combined primary and secondary school data from all available cycles of three international large-scale assessments—PIRLS, PISA, and TIMSS—administered between 1995 and 2016.

PIRLS was conducted in 2001, 2006, 2011, and 2016 and assessed reading achievement in fourth grade, at the end of primary school. PISA was administered in 2000, 2003, 2006, 2009, 2012, and 2015 and tested the reading, mathematics, and science performance of 15-year-old secondary school students. TIMSS was conducted in 1995, 1999, 2003, 2007, 2011, and 2015. TIMSS measured student achievement in mathematics and science in both fourth grade (population A) and eighth grade (population B). TIMSS 1999 only tested eighth graders. All studies contained survey weights to generalize from the representative samples to the underlying student populations in the respective countries or regions.

In order to determine changes between primary and secondary school, we matched primary school data from PIRLS or TIMSS population A with secondary school data from the same countries from PISA or TIMSS population B. For this purpose, we applied two matching approaches: first, matching roughly the same years (e.g., PIRLS 2001 with PISA 2000), and second, matching roughly the same cohorts (e.g., PIRLS 2001 with PISA 2006). We applied both approaches because combinations from the same years are subject to period effects, while combinations from the same cohorts are subject to cohort effects (e.g., Blanchard et al. 1977). Figure 1 illustrates the 45 study pairs that formed the basis for our analyses. Nine study pairs matched PIRLS with PISA data, 18 matched TIMSS population A with PISA data, and 18 matched TIMSS population A with TIMSS population B data. We counted the study pairs for TIMSS population A and PISA data and the pairs for TIMSS population A and TIMSS population B data twice since we ran all analyses for mathematics and science separately. In sum, our paired analysis dataset contained information from 75 countries or regions and more than 2 million students. Each country was observed at least two times. The overall number of single observations underlying the study pairs in Fig. 1 by study, cycle, domain, and country (study-by-cycle-by-domain-by-country observations) amounted to 1177.

Fig. 1 Study pairs of large-scale assessments at primary and secondary school level. Every arrow reflects a study pair of datasets at primary and secondary school level. The study pairs contain data from all countries that participated in both assessments. The studies were combined so that they roughly matched the same years or the same cohorts. The study pairs of TIMSS Population A and PISA data as well as TIMSS Population A and TIMSS Population B data entered the analyses twice, since mathematics and science were treated separately in the analyses.

4.2 Variables

Test scores

To compare educational outcomes in primary and secondary school, we used plausible values of test scores for reading, mathematics, and science achievement. In each study, the scores were linked across assessment cycles so that they had the same metric over time. The scores were standardized to an international mean of 500 with a standard deviation of 100 (Martin et al. 2016, 2017; OECD 2017). We used the test scores to compute three country-level measures of educational inequality and the mean performance level. The plausible values contained no missing data. To ensure that we could measure and compare different conceptualizations of inequality, we aggregated all variables at the country level.

Dispersion inequality

We computed the weighted standard deviation of the test scores as our main measure of dispersion inequality for each of the 1177 study-by-cycle-by-domain-by-country observations. Table 1 shows the distribution of the variable in primary and secondary school. Interestingly, dispersion inequality was higher in late tracking countries than in early tracking countries in primary school, but lower in secondary school.

Table 1 Descriptive statistics of the three inequality measures and the performance measure at primary and secondary school level in the overall country sample and divided by tracking status

In further robustness checks, we also computed alternative measures of dispersion inequality, namely the range between the 95th and 5th percentile and the range between the 75th and 25th percentile (interquartile range).
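For illustration, the dispersion measures described in this subsection could be computed along the following lines. This is a minimal Python sketch and not the code used in the study: the function names are hypothetical, the weighted quantile uses a simple cumulative-weight rule, and averaging the statistic over plausible values is an assumption based on common practice rather than a procedure stated in the text.

```python
import numpy as np


def weighted_sd(scores, weights):
    """Weighted standard deviation of test scores (population form)."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.average(scores, weights=weights)
    variance = np.average((scores - mean) ** 2, weights=weights)
    return float(np.sqrt(variance))


def weighted_quantile(scores, weights, q):
    """Smallest score at which the cumulative weight share reaches q."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    cum_share = np.cumsum(weights[order]) / weights.sum()
    idx = min(int(np.searchsorted(cum_share, q, side="left")), len(scores) - 1)
    return float(scores[order][idx])


def percentile_range(scores, weights, upper, lower):
    """Range between two weighted quantiles, e.g. 0.95 and 0.05."""
    return weighted_quantile(scores, weights, upper) - weighted_quantile(
        scores, weights, lower
    )


def sd_over_plausible_values(pv_matrix, weights):
    """Assumption: compute the SD per plausible value (column), then average."""
    pv_matrix = np.asarray(pv_matrix, dtype=float)
    return float(
        np.mean([weighted_sd(pv_matrix[:, k], weights) for k in range(pv_matrix.shape[1])])
    )
```

Calling `percentile_range(scores, weights, 0.95, 0.05)` or `percentile_range(scores, weights, 0.75, 0.25)` would give the two alternative measures for one country-level observation.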

Social achievement gaps

The social achievement gap was measured as the weighted mean difference in achievement scores between children from households with fewer than 100 books and those with at least 100 books. We used the student-reported number of books variable in the main analyses since it was the only measure of socioeconomic status that was available in all international assessments of interest. This type of mean score difference is also referred to as a measure of absolute differences. Another frequently used measure is the relative gap, which considers the overall dispersion of test scores by dividing the absolute differences by the within-country standard deviations. The basic idea is that social gaps are more meaningful if the overall dispersion of scores is small. We computed relative social achievement gaps for the number of books variable.
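The absolute and relative gaps can be sketched as follows. This is an illustrative Python example under stated assumptions, not the authors' implementation: the function name `social_gap` and the boolean indicator `many_books` (true for students reporting at least 100 books) are hypothetical.

```python
import numpy as np


def social_gap(scores, weights, many_books):
    """Return (absolute gap, relative gap) for one country-level observation."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    many_books = np.asarray(many_books, dtype=bool)

    # Weighted mean achievement in each group.
    mean_high = np.average(scores[many_books], weights=weights[many_books])
    mean_low = np.average(scores[~many_books], weights=weights[~many_books])
    absolute = mean_high - mean_low

    # Within-country weighted standard deviation of all scores.
    overall_mean = np.average(scores, weights=weights)
    overall_sd = np.sqrt(np.average((scores - overall_mean) ** 2, weights=weights))

    return float(absolute), float(absolute / overall_sd)
```

Dividing by the within-country standard deviation expresses the gap in standard deviation units, so the same absolute gap counts for more in a country where scores are tightly clustered.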

In further analyses, we also used parental education as an alternative measure of social background. Information on parental education was obtained from parents in the primary school studies PIRLS and TIMSS population A and from students in the secondary school studies PISA and TIMSS population B. We computed the absolute achievement gap between children of parents with and without tertiary education. However, information on parental education was not available for TIMSS population A cycles administered before 2011. Therefore, applying this measure reduced the analysis sample.

Missing data ranged from 3% for the books at home variable to 30% for parental education (based on the samples where this item was administered). To account for missing data, we created an imputed dataset using predictive mean matching (e.g., Rubin 1987) in the R package mice (van Buuren and Groothuis-Oudshoorn 2011). The imputation model used information on age, gender, parental education, number of books, country of birth of parents, language at home, and achievement scores.
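The core idea of predictive mean matching can be illustrated with a short sketch. The following Python code is a simplified, single-imputation version written for exposition only. It is not the mice algorithm, which, among other things, adds random draws of the regression parameters and produces multiple imputed datasets. The function name `pmm_impute` and the parameters `k` and `seed` are hypothetical.

```python
import numpy as np


def pmm_impute(X, y, k=5, seed=0):
    """Fill NaNs in y with observed values from cases with similar predictions."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    observed = ~np.isnan(y)

    # Linear prediction model fitted on the complete cases only.
    coef, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    predicted = design @ coef

    donor_predictions = predicted[observed]
    donor_values = y[observed]
    imputed = y.copy()
    for i in np.flatnonzero(~observed):
        # The k donors whose predicted value is closest to that of case i.
        distance = np.abs(donor_predictions - predicted[i])
        nearest = np.argsort(distance)[:k]
        # Borrow the observed value of one randomly chosen donor.
        imputed[i] = donor_values[rng.choice(nearest)]
    return imputed
```

Because each imputed value is borrowed from an observed case, the imputations always stay within the range of values that actually occur in the data.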

Educational inadequacy

To measure educational inadequacy, we computed the shares of students who did not meet certain thresholds of the achievement scales for each study-by-cycle-by-domain-by-country observation. We defined the thresholds based on the so-called PISA proficiency level 1b and the low PIRLS and TIMSS international benchmarks. Table 1 shows that, on average, 14% of primary school students and 12% of secondary school students did not reach these levels of adequate achievement in the present sample.

In further analyses, we used more inclusive thresholds and replicated the analyses. Specifically, we used the proficiency level 2 for PISA and the intermediate benchmark for PIRLS and TIMSS. On average, about 30% of the students did not reach these more inclusive adequacy cutoffs.
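A threshold measure of this kind amounts to a weighted share of students below a cut score. The sketch below is illustrative Python with a hypothetical function name. The cut scores 400 and 475 in the usage note correspond to the low and intermediate international benchmarks on the TIMSS and PIRLS scales; the PISA proficiency-level cut scores differ by domain and are not reproduced here.

```python
import numpy as np


def share_below(scores, weights, threshold):
    """Weighted share of students scoring below the given threshold."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(weights[scores < threshold].sum() / weights.sum())
```

For example, `share_below(scores, weights, 400)` would give the share under the low benchmark and `share_below(scores, weights, 475)` the share under the intermediate benchmark for one country-level observation.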

Performance level

We used the weighted mean achievement as a performance level measure in all study-by-cycle-by-domain-by-country observations. As Table 1 shows, the average performance levels were higher in early tracking countries than in late tracking countries at both primary and secondary school level.

Early tracking

Educational systems track their students into different ability tracks at different ages and grades. To determine the grade and age at which the countries of interest tracked their students, we reviewed reports by UNESCO (UNESCO-IBE 2007, 2012), Eurydice (2005, 2011, 2013a, b, 2014), and OECD (2004, 2006, 2008, 2010). We crosschecked the results with studies by Hanushek and Wößmann (2006), Brunello and Checchi (2007), Waldinger (2007), and Ruhose and Schwerdt (2016). There were few discrepancies regarding the grade and age at which students are tracked between previous studies and between previous studies and our own review. Where deviations arose, we followed our own criteria, which were mainly based on the country reports in UNESCO-IBE (2007, 2012).

Based on the information on the tracking grade and age, we constructed two different variables to determine whether students were tracked at the time of testing in the secondary school assessments (early tracking) or whether they were still in comprehensive schooling (late tracking). In the analyses with TIMSS population B data, we used information on whether students were tracked in eighth grade. For analyses with PISA, we used the grade with most 15-year-old students (ninth or tenth grade in most countries). Due to this classification, 17 countries were classified as early tracking countries in analyses using PISA and 13 countries in analyses using TIMSS. Table 2 depicts the number of overall, early, and late tracking countries in each study pair. On average, each study pair contained 26 countries. About one-fourth of these were early tracking countries. Appendix 1 shows the tracking status for all countries in our sample.
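The classification rule can be sketched as a simple comparison of grades. The following Python example is an illustration under assumptions, not the coding scheme used in the study: the function name and the convention of passing `None` for systems without between-school tracking in secondary school are hypothetical.

```python
from typing import Optional


def early_tracking(first_tracked_grade: Optional[int], tested_grade: int) -> int:
    """Return 1 if students are already tracked in the tested grade, else 0.

    first_tracked_grade: first grade spent in separate tracks
        (None if there is no between-school tracking in secondary school).
    tested_grade: grade 8 for TIMSS population B, or the grade with most
        15-year-old students for PISA.
    """
    if first_tracked_grade is None:
        return 0
    return int(first_tracked_grade <= tested_grade)
```

A system that tracks after fourth grade, such as Germany, has a first tracked grade of 5 and is coded 1 for both assessments. A hypothetical system that starts tracking in ninth grade would be coded 0 for TIMSS (grade 8) but 1 for PISA if most 15-year-olds are in ninth grade, which illustrates why the number of early tracking countries can differ between the two sets of analyses.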

Table 2 Number of countries in the overall country sample and divided by the tracking status in the 45 study pairs in the three achievement domains

4.3 Analyses

No country had a tracked primary school system, but some countries had tracked secondary school systems. This enabled us to compare educational measures of countries with and without early between-school tracking at the secondary school level while using the same educational measures at the primary school level as a baseline.

Identification strategy

Simple comparisons of early and late tracking countries may be biased because the observed differences may have existed before the students were tracked. In such cases, differences between early and late tracking countries would not reflect the effect of tracking but rather of other features of the educational system or differences in the social structure. Indeed, Table 1 shows that early and late tracking countries had different baseline inequalities at the primary school level. On average, early tracking countries showed higher performance levels, lower levels of dispersion inequality and educational inadequacy, and higher social achievement gaps in comparison to late tracking countries.

Following Hanushek and Wößmann (2006), we estimated difference-in-differences models to control for any disparities between early and late tracking countries that existed prior to tracking. The basic idea was to relate differences in educational outcomes (for instance, dispersion inequality at the primary and secondary school levels) to differences in the tracking status at the primary and secondary school levels. For this purpose, we estimated models in which we regressed educational outcomes Y in secondary school s in country j (\( {Y}_{sj} \)) on a dummy variable that indicated whether the country had a tracked secondary school system (\( {Z}_{sj} \)) while controlling for educational outcomes at the primary school level (\( {Y}_{pj} \)):

$$ {Y}_{sj}=\alpha +{\beta}_1{Y}_{pj}+\gamma {Z}_{sj}+{e}_j $$

The key parameter of interest in Eq. (1) was γ, since it estimates the effect of early tracking on the educational outcome. The equation does not include the tracking status at the primary school level because no country in our sample had a tracked primary school system.
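To make the estimation step concrete, the following is a minimal sketch of how Eq. (1) could be fitted by ordinary least squares for a single study pair. It is not the code used in the study: the function name `estimate_tracking_effect` and all country values are hypothetical and made up for illustration, and the sketch ignores any refinements (such as robust standard errors) that the actual analyses may have applied.

```python
import numpy as np


def estimate_tracking_effect(y_secondary, y_primary, tracked):
    """OLS fit of Y_sj = alpha + beta1 * Y_pj + gamma * Z_sj + e_j.

    Returns the estimate of gamma and its conventional OLS standard error.
    """
    y = np.asarray(y_secondary, dtype=float)
    X = np.column_stack([
        np.ones_like(y),                     # intercept (alpha)
        np.asarray(y_primary, dtype=float),  # primary-level outcome (beta1)
        np.asarray(tracked, dtype=float),    # early-tracking dummy (gamma)
    ])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = len(y) - X.shape[1]                # countries minus parameters
    sigma2 = resid @ resid / dof             # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)    # OLS covariance matrix
    return coef[2], np.sqrt(cov[2, 2])


# Hypothetical country-level values for one study pair (not real data).
y_primary = [70.0, 75.0, 80.0, 85.0, 90.0, 95.0]
tracked = [0, 1, 0, 1, 0, 1]
y_secondary = [73.5, 81.0, 81.5, 90.5, 91.0, 99.0]

gamma_hat, se_gamma = estimate_tracking_effect(y_secondary, y_primary, tracked)
print(f"gamma = {gamma_hat:.2f}, SE = {se_gamma:.2f}")
```

With only a handful of countries per study pair, the degrees of freedom are small, which is why the standard error of a single estimate tends to be large.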

We estimated separate models for the four educational outcome measures—dispersion inequality, social achievement gaps, educational inadequacy, and the performance level. The total number of replications for each outcome was 45 including nine replications for reading, 18 for mathematics, and 18 for science (cf. Fig. 1 and Table 2).

Synthesis of effects

We computed weighted mean effect sizes to summarize the 45 estimations per dependent variable (indexed by i = 1, …, 45). For this purpose, we used the formulas that Card (2012) developed for use in meta-analyses. The basic idea is that some effect estimates are more reliable than others (e.g., due to differences in the sample size), which is reflected in different standard errors. For this reason, the inverse of the squared standard error (\( {SE}_i^2 \)) serves as a weight (\( {w}_i \)) for the corresponding effect estimate. This means that study pairs with less precise estimates receive a lower weight in the synthesized results:

$$ {w}_i=\frac{1}{SE_i^2} $$

We estimated a weighted mean of the single effects, consisting of the sum of the effect sizes (\( {ES}_i \)) multiplied by their weights (\( {w}_i \)), divided by the total sum of weights:

$$ \overline{ES}=\frac{\sum \left({w}_i\ast {ES}_i\right)}{\sum {w}_i} $$

The weights can be used to compute a standard error for the mean effect size (\( {\mathrm{SE}}_{\overline{ES}} \)). For this purpose, we used the square root of the inverse value of the sum of the weights:

$$ {\mathrm{SE}}_{\overline{ES}}=\sqrt{\frac{1}{\sum {w}_i}} $$

Under the null hypothesis of no effect, the ratio of the mean effect size to its standard error approximately follows a standard normal distribution, which can be used to test whether the mean effect differs significantly from zero (Card 2012).
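The three formulas above can be sketched in a few lines of code. This is an illustrative implementation under the assumption of a simple fixed-effect synthesis; the function name `synthesize` and the input values are hypothetical and not taken from the study.

```python
import math


def synthesize(effects, std_errors):
    """Inverse-variance weighted mean effect, its SE, z ratio, and p value."""
    weights = [1.0 / se ** 2 for se in std_errors]           # w_i = 1 / SE_i^2
    total = sum(weights)
    mean_es = sum(w * es for w, es in zip(weights, effects)) / total
    se_mean = math.sqrt(1.0 / total)                         # SE of the mean effect
    z = mean_es / se_mean
    # Two-sided p value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return mean_es, se_mean, z, p


# Hypothetical estimates from three study pairs (not real data).
effects = [2.0, 4.0, 3.0]
std_errors = [1.0, 2.0, 1.5]
mean_es, se_mean, z, p = synthesize(effects, std_errors)
print(f"mean ES = {mean_es:.2f}, SE = {se_mean:.2f}, z = {z:.2f}, p = {p:.4f}")
```

Because the weights are inverse variances, the estimate with the smallest standard error dominates the weighted mean.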

5 Results

The results for the different study pairs and the four outcome variables (dispersion inequality, social achievement gaps, educational inadequacy, and performance level) are depicted in Fig. 2. Panel a shows, for example, the regression coefficients of the effects of early tracking on dispersion inequality along with the 95% confidence intervals for each of the 45 combinations of primary and secondary school data. Since each estimate was based on a rather small sample of countries, the confidence intervals were large and only a few estimates differed significantly from zero. Correspondingly, we also observed large confidence intervals for the results of the other outcomes in panels b, c, and d. In panel b, for example, the estimates were statistically significantly different from zero in only seven out of 45 analyses, owing to the small number of countries in each analysis.

Fig. 2

Individual estimates of the effects of early tracking on the four dependent variables in the 45 study pairs. The single estimates of the early tracking effect on the four outcome variables are depicted for 45 study pairs per outcome. CI is short for confidence interval. In the 45 study pair abbreviations, R stands for reading, M for mathematics, and S for science. In the pair labels, the primary school level dataset is followed by the corresponding secondary school dataset

The low precision of the estimation of the difference-in-differences models made it difficult to draw robust conclusions based on a single pair of primary and secondary school data. However, the replications were based on 45 different combinations and the findings revealed some interesting patterns. For dispersion inequality and social achievement gaps, the large majority of the parameters were positive. For educational inadequacy and performance levels, we observed no overall tendency since roughly half of the estimates were positive and the other half negative.

5.1 Mean effects of early tracking on inequalities

We applied a meta-analytical strategy to combine the effect estimations of different study pairs for each of the four outcomes of interest. Table 3 (column 1) shows the synthesized mean effect across all achievement domains, which was based on all 45 study pairs. The results showed that early tracking increased the three educational inequality measures. The effects were particularly pronounced for the social achievement gap, followed by dispersion inequality. The effect of early tracking on educational inadequacy was small but statistically significant. In contrast to the consistent findings that tracking increased inequality, our study provided no evidence that tracking affected the performance level.

Table 3 Synthesis of the Effects of Early Tracking (\( \overline{\gamma} \)) on the Four Dependent Variables for All Domains and Divided by Domain

In detail, our analyses showed that early tracking significantly increased dispersion inequality by 2.91 score points (p < .001). While there was a general trend of dispersion inequality increasing from the primary to secondary school level, the increase was significantly larger in early tracking countries in comparison to late tracking countries. The outcome measure of dispersion inequality (the standard deviation of test scores at the secondary school level) had an international mean of 89.02 with an SD of 11.51 (see Table 1). We used this information to compute the standardized effect size measure Cohen’s d. The standardized effect of tracking on dispersion inequality was d = 0.25.
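As a worked illustration, the reported value is consistent with dividing the synthesized effect by the between-country SD of the outcome. The exact computation is an assumption here, since the text does not spell out the formula:

$$ d=\frac{\overline{\gamma}}{SD_{Y}}=\frac{2.91}{11.51}\approx 0.25 $$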

We also found strong evidence that tracking increased the social achievement gap. Tracking increased the gap between students from families with few and with many books by 6.90 score points (p < .001), which corresponds to an effect size of d = 0.44. Therefore, the social achievement gaps widened more between primary and secondary school in early tracking countries than in late tracking ones.

The mean effect of tracking on educational inadequacy was 0.88 points (p < .01). This suggests that early tracking increased the share of students who did not reach basic literacy cutoffs by roughly 1 percentage point. In comparison to the other concepts of inequality, the standardized effect of d = 0.07 is rather small.

In contrast to the results for the three inequality measures, our analyses provided no evidence for an effect of early tracking on the performance level. The mean effect was −1.00 (d = 0.02) and did not differ significantly from zero (p > .05).

In the main analyses, we used the inverse of the squared standard error to weight each study pair by the precision of its estimate. An alternative approach is to weight each study pair equally. To test the sensitivity of our analyses, we replicated all analyses with equal weights (see Appendix 2). The results remained qualitatively the same.

5.2 Further analyses

The review of previous research revealed rather inconsistent findings. We assumed that the small number of countries in each study might be an explanation for the variation in the previously reported findings. An alternative explanation pertains to substantive differences. Our attempt to address this controversy entailed replicating the analyses for different educational outcomes using the same data. Additionally, we conducted a series of alternative specifications to test the robustness of our main analyses.

Effect heterogeneity across achievement domains

To test whether tracking affected outcomes in reading, mathematics, and science differently, we replicated the analyses for the three domains separately. The results are depicted in Table 3 (columns 2–4). The findings largely confirmed those of the main specification. Tracking increased the dispersion inequality and the social achievement gap consistently and significantly in all three domains. Furthermore, the analyses on reading suggested that tracking reinforced educational inadequacy and decreased the performance level. We observed no significant effects for educational inadequacy and performance level in mathematics and science. However, only nine study pairs were available to investigate effects in the reading domain, while 18 pairs each were available for the estimation of the effects in mathematics and science. For this reason, we suggest that the findings for reading should not be overinterpreted.

Alternative inequality measures

Different measures of dispersion inequality, social achievement gaps, and educational inadequacy were used in previous research. In our main analyses, we focused on one measure for each educational outcome. To check the robustness of our analyses, we used alternative measures of educational inequality and replicated the analyses for the same 45 study pairs of primary and secondary school data. In Table 4, each row contains the result of an alternative specification.

Table 4 Robustness checks of the synthesis of the effects of early tracking (\( \overline{\gamma} \)) for all domains in the 45 study pairs

We used the within-country standard deviation of the test scores as the measure of dispersion inequality in the main analyses. In additional robustness checks, we used the range between the 95th and 5th percentile and between the 75th and 25th percentile of the achievement distribution as alternative measures of dispersion inequality. We observed that early tracking also increased the dispersion inequality in these alternative specifications (rows 1–2 in Table 4).
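The three dispersion measures can be illustrated with a short sketch. It is a simplification that treats the scores as a plain unweighted sample; the actual assessments require sampling weights and plausible values, which are omitted here. The function name `dispersion_measures` and the simulated scores are hypothetical.

```python
import numpy as np


def dispersion_measures(scores):
    """Unweighted SD and two percentile ranges of an achievement distribution."""
    scores = np.asarray(scores, dtype=float)
    p5, p25, p75, p95 = np.percentile(scores, [5, 25, 75, 95])
    return {
        "sd": scores.std(ddof=1),  # within-country standard deviation
        "p95_p5": p95 - p5,        # range between 95th and 5th percentile
        "p75_p25": p75 - p25,      # range between 75th and 25th percentile
    }


# Simulated scores for one hypothetical country (not real data).
rng = np.random.default_rng(0)
scores = rng.normal(loc=500.0, scale=90.0, size=5000)
print(dispersion_measures(scores))
```

The percentile ranges are less sensitive to extreme scores than the standard deviation, which is one reason to use them as a robustness check.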

The social achievement gap was operationalized as the mean score difference between children from households with fewer than 100 and at least 100 books (absolute difference). In further analyses, we standardized this difference by the respective within-country standard deviation (relative difference) and used this variable as an alternative outcome. The scale of the effect changed due to the standardization, but it remained significant (p < .001) (row 3 in Table 4). The number-of-books-at-home variable is probably the most commonly used measure of socioeconomic status in comparative research. It is, however, often criticized, for example because certain student groups tend to systematically underestimate the number of books at home (e.g., Engzell 2019). For this reason, we replicated the main analyses with another frequently used measure of socioeconomic background, namely parental education. The additional analysis replicated the finding that tracking significantly increased the absolute gap between children of parents with and without tertiary education (row 4 in Table 4).

The threshold that defines educational inadequacy can be a more or less inclusive cutoff. In our main analyses, we identified a little more than 10% of the students as having an inadequate level of achievement. In further analyses, we used the intermediate benchmark in PIRLS and TIMSS and proficiency level 2 in PISA instead. This led us to identify about 30% of the students as failing to attain an adequate level of achievement. The replicated analyses showed that early tracking increased the proportion of students not reaching the TIMSS intermediate benchmark or level 2 in PISA by about 1.5 percentage points (p < .01; row 5 in Table 4).

Tracking as nondichotomous

Just as in most previous research, we used a binary tracking indicator in our main analyses. In further analyses, we replaced this binary indicator with a continuous variable for the tracking grade to exploit the variation in how many years students were exposed to a tracked school system (see Appendix 1). A value of zero means that a country had a comprehensive school system at the secondary school level when testing occurred, and values between 1 and 5 imply that students were allocated to different ability tracks one to five grades before the secondary school assessment was administered. However, a drawback of this approach was the limited number of countries tracking students at different times. We replicated the main analyses for all four outcomes using the nondichotomous tracking indicator.
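One way to derive both indicators from a country's tracking grade is sketched below. The coding rule is an assumption made for illustration and may differ from the one documented in Appendix 1; the function names `years_tracked` and `early_tracking` are hypothetical.

```python
def years_tracked(tracking_grade, test_grade):
    """Grades of exposure to tracking at the time of the secondary assessment.

    tracking_grade: last grade completed before students are tracked,
                    or None for a fully comprehensive system (assumed coding).
    test_grade:     grade in which the secondary assessment takes place.
    """
    if tracking_grade is None or tracking_grade >= test_grade:
        return 0  # students were not yet tracked when tested
    return test_grade - tracking_grade


def early_tracking(tracking_grade, test_grade):
    """Binary indicator: 1 if students were already tracked when tested."""
    return int(years_tracked(tracking_grade, test_grade) > 0)


# A system that tracks after fourth grade, tested in eighth grade.
print(years_tracked(4, 8), early_tracking(4, 8))
# A comprehensive system tested in eighth grade.
print(years_tracked(None, 8), early_tracking(None, 8))
```

Under this assumed coding, the same country can receive different values depending on whether the secondary assessment takes place in eighth grade (TIMSS) or in the modal grade of 15-year-olds (PISA).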

The analyses for the three types of inequalities and the performance levels are presented in rows 6–9 in Table 4. The effects of the tracking grade on the dispersion inequality and social achievement gaps were positive and significant (p < .001). One extra year of exposure to a tracked system increased the countries’ standard deviations of achievement scores by about 0.77 points and the social achievement gap by 2.37 points. These findings imply that postponing tracking by 5 years––for example, from tracking after fourth to tracking after eighth grade––reduced the dispersion inequality by 3.85 points and social achievement gaps by 11.95 points. Consistent with the main results, the effect on educational inadequacy was smaller and, in this case, nonsignificant. Just as in the main analyses, we observed no significant effects for the performance level.

6 Discussion and conclusion

For a long time, the controversy around between-school ability tracking was mainly ideological. Robust empirical evidence on the effects of tracking on student outcomes was rare. However, in the past 15 years, a number of studies with robust designs have been conducted with the aim of contributing empirical evidence to the discussion about the effects of tracking on student learning. Most of the new studies used international data to compare student achievement in countries with tracked versus comprehensive school systems while controlling for prior achievement differences (e.g., Hanushek and Wößmann 2006). While these studies applied sound strategies to identify the effects of tracking on achievement, most suffered from the limitation that international analyses are based on relatively small samples of countries. Furthermore, it was difficult to synthesize previous research because different studies focused on different educational outcomes. Against this backdrop, the main aim of the present study was to use the data accumulated in international assessments to systematically investigate the effects of tracking on educational inequalities and performance levels. Previous research used different data to investigate the effects of tracking on different outcomes. We used the same data to study multiple outcomes.

6.1 Summary of key findings

The literature frequently refers to a perceived trade-off between equity and efficacy in the field of between-school tracking. While previous research was inconclusive, we found strong evidence that tracking increased dispersion inequality and social achievement gaps. Tracking was also associated with educational inadequacy, but the evidence was less robust. In contrast, we found no evidence that tracking boosted performance levels. These main findings were very consistent across different model specifications. We replicated the analyses using different tracking indicators and outcome measures, and the general results confirmed our main findings.

6.2 Conceptual clarity: different outcomes, different findings

We found that the effects of early tracking on educational inequality varied according to the theoretical concept behind the inequality measures; this was confirmed by the series of further analyses on the robustness of our findings. It is worth remembering that our results varied between different concepts of inequality but they were very similar for the same concepts of inequality. The clearest effect was on social achievement gaps, where the effect of tracking seemed to be the most pronounced across all domains and for different measures of student background. This is of particular relevance since it contradicts the argument that tracking is meritocratic; if it were meritocratic, the inequality determined by social characteristics would not vary. This point is reinforced when looking at the effects of tracking on dispersion inequality: Early tracking increased the dispersion of achievement scores, but compared to the effects on social achievement gaps, this was not as relevant to the overall existing dispersion. Finally, looking at the educational inadequacy, we found more inconclusive evidence. We observed significant effects for tracking on educational inadequacy in reading but not in the two other domains. On the other hand, the overall effects and the alternative specification with a more ambitious threshold revealed significant effects of an increase of the proportion of students not reaching minimum levels of achievement. This means that tracking did not help the most disadvantaged students. At worst, these would perform better without tracking, while, at best, tracking does not have discernible effects.

We contrasted the analysis on educational inequalities with analyses on the effect of tracking on performance levels. In line with previous studies, we found no evidence that tracking increased performance levels. If anything, there was some evidence for tracking decreasing performance levels in reading.

6.3 The reproducibility of findings: the issue of a small sample size at the country level

Our study illustrates that the reproducibility of research findings based on international data is limited. We observed a remarkable variability in results between different combinations of primary and secondary school assessments. For this reason, it comes as no surprise that previous research was inconsistent and sometimes even contradictory. International assessments collect information from millions of students, but, at the country level, the number of units of analysis is small. Small samples are generally associated with large standard errors, which means that research findings based on a single international assessment are unreliable. Our findings should encourage researchers to replicate analyses based on data from different international assessments or to combine different assessments to reduce publication bias and establish reliable evidence.

6.4 Limitations of the study

The first limitation is related to the need to simplify the tracking variable itself. As Hillmert and Jacob (2010) noted, studies on transitions in educational careers follow an ideal-typical sequence of transitions and phases in education, while students can and do follow more complex paths in reality. In line with previous research, we used a binary tracking indicator, but we are well aware that between-school tracking can take different forms simultaneously. Following this, our analyses are well suited to detecting the effects of between-school tracking, which is our research question, but they do not account for every form of selection. Studying different types of within-school tracking (both whole-class differentiation and on a course-by-course basis) is beyond the scope of our study (see Chmielewski 2014; Chmielewski et al. 2013). On the other hand, we suspect that if we had been able to measure within-school tracking, the effects would have been even more pronounced. Let us assume, for the moment, that there is a continuum along the distinction between comprehensive, within-school tracking, and between-school tracking systems. In the present study, we regarded within-school tracking countries as comprehensive school systems. This means that our estimates are rather conservative and that the effects would have been even larger if we had considered countries that applied within-school tracking as a separate category.

6.5 Policy implications

When discussing the consequences of between-school tracking, it is useful to revisit the debate on what types of inequality are considered acceptable or even desirable and what types are considered problematic and unjust. With respect to the frequently perceived trade-off between efficiency and equity, it is important to stress that we did not find any evidence supporting the suggestion that early between-school tracking increases average performance levels. Regarding the question of what types of inequality are acceptable, different perspectives have to be considered. In modern societies, it is generally accepted that performance levels vary between students (inequality as dispersion) and that this mix of skillsets is even needed because the labor market demands differently skilled workers. At the same time, it is more difficult to justify social inequalities, i.e., the idea that children get different opportunities based on their social background and not their educational potential. In the same vein, it is difficult to find arguments supporting educational inadequacy, i.e., the notion that a proportion of students would not even reach the basic performance levels that are necessary to participate in society and in all parts of the labor market. Therefore, we regard social achievement gaps and educational inadequacy as particularly important outcome measures of educational policies.

Hanushek and Wößmann (2006, p. C75) closed their study with the following statement: “From a policy perspective, it seems incumbent on those advocating early tracking in schools to identify the potential gains from this. These preliminary results suggest that countries lose in terms of the distribution of outcomes, and possibly also in levels of outcomes, by pursuing such policies.” More than 10 years later, with a larger amount of evidence, we have come to a similar conclusion. If we had to make a policy recommendation, it would be to reform early between-school tracking systems into comprehensive school systems.