Missing children: how Chilean schools evaded accountability by having low-performing students miss high-stakes tests


High-stakes testing pressures schools to raise test scores, but schools respond to pressure in different ways. Some responses produce real, broad increases in learning, but other responses can raise reported test scores without increasing learning. We estimate the effect of an accountability program on reading scores and math scores in Chile. Over a 6-year period, fourth-grade reading and math scores rose by 0.2 to 0.3 standard deviations, on average, and half the rise was due to the accountability program. However, many schools, especially schools serving disadvantaged students, inflated their accountability ratings by having low-performing students miss high-stakes tests. To encourage healthier responses to accountability, we recommend setting accountability goals that are attainable for schools with disadvantaged students, and providing incentives for all students to take high-stakes tests.



  1. Practically all the students who missed reading tests missed math tests as well. We got essentially the same result if $Y_{ijt}$ represented missing the math test, missing the reading test, or missing both.

  2. A third approach would be to restrict the data to public schools, but then the model would not be identified because the SEP variable would be almost perfectly collinear with the year fixed effects. This collinearity would result from the fact that nearly all public schools joined SEP in 2008 (see Fig. 1).

  3. In an earlier version of this article, the imputation model was even more flexible; its parameters were estimated separately within every school and year. That model ran much more slowly, however, and may have been overfit in small schools. In any case, its results were practically identical to those obtained from the model we have described here.

  4. In every year, the Ministry of Education reported scores for schools with at least six fourth-grade scores in each subject, but some students miss tests, so a school with more than six fourth graders might have fewer than six test scores. By limiting the analysis to schools with at least fifteen fourth graders, we ensured that those schools would have at least six test scores. Any cutoff between ten and twenty fourth graders produced similar results.

  5. Under SEP’s accountability system, Mineduc classified schools as “autonomous” (the highest level), “emergent” (the intermediate level), or “in recovery” (the lowest level). From 2008 to 2011, Mineduc classified 12% of SEP schools as “autonomous,” 88% as “emergent,” and none as “in recovery.” After 2012, Mineduc classified 2.5% of SEP schools as “in recovery.”

  6. A typical threshold for repeating was a GPA of four, but the exact threshold varied from school to school.

  7. The vulnerability index was calculated by the National School and Scholarship Aid Board, the public body charged with providing food assistance to schools (Mineduc 2008b).

  8. Imputation often has little effect on fixed-effects estimates (Young and Johnson 2015). In an earlier study of the effect of high-stakes testing in Chicago, imputing missing scores also had little effect on fixed-effects estimates (Jacob 2005). However, the Chicago study imputed a constant, which can bias estimates (Allison 2002), instead of using multiple imputation. The Chicago study also did not examine the effect of missing test scores on school accountability ratings, an effect that can be larger, as we have shown.

  9. The only exception is the effect of being 3 years before SEP, which is significant at p < .05. However, this could be an artifact of multiple testing. With 7 years before participation, the probability of one of those years having .01 < p < .05 would be approximately 30%, even if there were no pre-trends at all.
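
The multiple-testing arithmetic in note 9 can be sketched as follows. This is a rough illustration rather than the authors' own calculation: it assumes the seven pre-participation tests are independent and that p-values are uniformly distributed when there are no pre-trends, and the function name `prob_at_least_one` is hypothetical. Under those assumptions, the chance that at least one year shows p < .05 is about 30%, while the chance that at least one year falls in the narrower band .01 < p < .05 is about 25%.

```python
# Chance of at least one "significant" pre-period coefficient when there are
# no pre-trends. ASSUMPTION (for illustration only): the 7 tests are
# independent and each p-value is uniform on [0, 1] under the null.
N_YEARS = 7


def prob_at_least_one(per_test_prob: float, n_tests: int) -> float:
    """P(at least one of n independent tests falls in a band of given probability)."""
    return 1 - (1 - per_test_prob) ** n_tests


# Any year with p < .05 (band width .05)
p_below_05 = prob_at_least_one(0.05, N_YEARS)

# Any year with .01 < p < .05 (band width .04)
p_band = prob_at_least_one(0.04, N_YEARS)

print(f"P(at least one p < .05):       {p_below_05:.3f}")
print(f"P(at least one .01 < p < .05): {p_band:.3f}")
```

Either way, a single nominally significant pre-period coefficient out of seven is not strong evidence of a pre-trend. If the tests are positively correlated across years, these probabilities would be somewhat lower.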


