
New evidence concerning school accountability and mathematics instructional quality in the no child left behind era

Educational Assessment, Evaluation and Accountability


Using longitudinal data from the No Child Left Behind (NCLB) era, I applied regression techniques and found a positive association between school failure to reach “adequate yearly progress” in mathematics and subsequent changes in the quality of middle grades mathematics instruction in districts where district leaders adopted robust theories of action for improving mathematics instruction. The positive association was robust to multiple sensitivity tests and may reflect a causal relationship. The evidence suggests that educational leaders in similar contexts can use school failure to reach accountability standards as measured by standardized assessments to promote instructional quality in mathematics.



  1. Local education agencies give charter schools autonomy from some local policies and regulations with the expectation that charter school leaders will use this autonomy to rapidly improve student achievement.

  2. Three rubrics from the IQA toolkit were not included because previous work suggested scores produced by these rubrics were unreliable (Wilhelm and Kim 2015).

  3. A dependability coefficient is used in absolute decision making and is appropriate when comparing scores to a threshold (Hill, Charalambous, & Kraft, 2012).

  4. The generalizability coefficient of the Mathematical Quality of Instruction (MQI) instrument, based on two observations, is less than 0.60 (Hill et al. 2012). Because an instrument's generalizability coefficient is always at least as large as the dependability coefficient associated with the same data collection and scoring procedure (Brennan 2001), the dependability coefficient of the MQI is also less than 0.60. Therefore, the dependability coefficient of the IQA is greater than the dependability coefficient of the MQI.

  5. Exploratory factor analyses conducted by ACRO researchers suggested scores from the eight rubrics loaded onto two constructs. The first construct was primarily based on variation in Task Potential and Implementation scores while the second construct was based on variation in the remaining six rubrics. ACRO researchers decided 50% of the IQA composite score should be determined by variation in scores contributing to the first factor while the other 50% of the IQA composite should be determined by variation in the other six rubrics. Task Potential and Implementation scores were weighted by 25% each. Scores associated with each of the six remaining rubrics were weighted by 8.33%.

  6. There are 30 schools observed for 4 years, yielding a maximum of 90 records at the school-by-year level. However, one school participated for only 2 years and six schools participated for 3 years. See Table 1.

  7. The survey allowed respondents to choose both White and Hispanic.

  8. It is also plausible that it is easier to improve IQA for teachers in some grades compared to other grades. If teacher-participants in failing schools happened to teach in these “easier” grades, this could explain away the positive relationship between school failure and IQA. I addressed this by adding grade-fixed effects (i.e., dichotomous variables for whether a teacher taught 6th, 7th, or 8th grades) as right-hand side variables. The collection of grade-fixed effects was not related to changes in IQA (i.e., grade-fixed effects were not jointly significant).

    Additionally, several school-by-year cells passed math AYP in the first year of the study period. It is plausible that instruction in that year improved dramatically, perhaps because it was the first year that districts could respond to ACRO-provided feedback. District leadership may have wanted to show ACRO researchers they could productively act on researcher feedback. Such relationships could explain a sizeable portion of the observed positive relationship of interest. To address this explanation, I added year-fixed effects as right-hand side variables (i.e., dichotomous variables for whether the record came from the first, second, or third study year). Like grade-fixed effects, year-fixed effects did not predict changes in IQA. Subsequent models did not include year-fixed effects due to their lack of joint significance.

  9. This explanation is counterintuitive. If instruction in failing schools was improving, it seems these schools would not continue to fail.

  10. While there is no consensus regarding the cutoffs to use when identifying highly influential data based on these two statistics, 4/(analytical sample size) and the absolute value of two are often used as rules of thumb to identify extreme values for Cook’s D and studentized residuals, respectively (Bollen and Jackman 1990).

  11. Robust and quantile estimators were unable to account for the clustering of teachers within schools, thus standard errors in rows IV and V in Table 4 are not comparable to estimates from other models.
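The arithmetic in notes 5 and 10 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ACRO researchers' or the author's actual code: the function names are hypothetical, the composite is assumed to be a plain weighted sum of rubric scores, and an observation is assumed to be flagged when it exceeds either rule-of-thumb cutoff.

```python
from math import isclose

# Note 5 weights: 25% each for Task Potential and Implementation,
# with the remaining 50% split evenly over the other six rubrics (about 8.33% each).
TASK_WEIGHT = 0.25
IMPLEMENTATION_WEIGHT = 0.25
OTHER_WEIGHT = 0.50 / 6

# The eight weights should account for the whole composite.
assert isclose(TASK_WEIGHT + IMPLEMENTATION_WEIGHT + 6 * OTHER_WEIGHT, 1.0)


def iqa_composite(task_potential, implementation, other_scores):
    """Weighted composite of eight rubric scores (assumed to be a weighted sum)."""
    if len(other_scores) != 6:
        raise ValueError("expected exactly six remaining rubric scores")
    return (TASK_WEIGHT * task_potential
            + IMPLEMENTATION_WEIGHT * implementation
            + OTHER_WEIGHT * sum(other_scores))


def flag_influential(cooks_d, studentized_residuals):
    """Return indices exceeding the note 10 rules of thumb.

    Cook's D cutoff: 4 / (analytical sample size).
    Studentized residual cutoff: absolute value greater than 2.
    Flagging on either criterion is an assumption made for this sketch.
    """
    n = len(cooks_d)
    if n == 0 or n != len(studentized_residuals):
        raise ValueError("inputs must be non-empty and of equal length")
    d_cutoff = 4 / n
    return [i for i, (d, r) in enumerate(zip(cooks_d, studentized_residuals))
            if d > d_cutoff or abs(r) > 2]
```

For example, a teacher scoring 4 on Task Potential, 2 on Implementation, and 1 on each remaining rubric would receive a composite of 2.0 under these assumptions.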


  • Anagnostopoulos, D. (2003). The new accountability, student failure and teachers’ work in urban high schools. Educational Policy, 17(3), 291–316.


  • Baker, D. (2014). The schooled society (1st ed.). Stanford: Stanford University Press.


  • Ball, D. L. (2000). Bridging practices: intertwining content and pedagogy in teaching and learning to teach. Journal of Teacher Education, 51(3), 241–247.


  • Bollen, K. A., & Jackman, R. W. (1990). In J. Fox & J. S. Long (Eds.), Modern methods of data analysis. Newbury Park: Sage Publications.


  • Boston, M. D. (2012). Assessing instructional quality in mathematics. The Elementary School Journal, 113(1), 76–104.


  • Boston, M. D., & Wilhelm, A. G. (2015). Middle school mathematics instruction in instructionally focused urban districts. Urban Education, 1–33.

  • Boston, M., & Wolf, M. K. (2006). Assessing Academic Rigor in Mathematics Instruction: The Development of the Instructional Quality Assessment Toolkit [Technical Report].

  • Boston, M., Bostic, J., Lesseig, K., & Sherman, M. (2015). A comparison of mathematics classroom observation protocols. Mathematics Teacher Educator, 3(2), 154–175.


  • Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.


  • Campbell, S. L., & Ronfeldt, M. (2018). Observational evaluation of teachers: measuring more than we bargained for? American Educational Research Journal, 000283121877621.

  • Chiang, H. (2009). How accountability pressure on failing schools affects student achievement. Journal of Public Economics, 93(9–10), 1045–1057.


  • Clotfelter, C. T., Ladd, H. F., & Vigdor, J. L. (2006). Teacher-Student Matching and the Assessment of Teacher Effectiveness. The Journal of Human Resources, 41(4), 778–820.

  • Cobb, P., Jackson, K., Henrick, E., & Smith, T. M. (2018). Systems for instructional improvement: creating coherence from the classroom to the district office (1st ed.). Harvard Education Press.

  • Cochran-Smith, M., & Lytle, S. L. (1999). Relationships of knowledge and practice: teacher learning in communities. Review of Research in Education, 24(1), 249–305.


  • Dee, T. S., Jacob, B., & Schwartz, N. L. (2013). The effects of NCLB on school resources and practices. Educational Evaluation and Policy Analysis, 35(2), 252–279.


  • Desimone, L. M., Hochberg, E. D., & McMaken, J. (2016). Teacher knowledge and instructional quality of beginning teachers: growth and linkages. Teachers College Record, 118(May), 54.


  • Franke, M. L., Kazemi, E., & Battey, D. (2007). Mathematics teaching and classroom practice. In F. K. Lester (Ed.), Second handbook of research on mathematics teaching and learning (2nd ed., pp. 225–256). Charlotte: National Council of Teachers of Mathematics.


  • Fuller, B., Wright, J., Gesicki, K., & Kang, E. (2007). Gauging growth: how to judge no child left behind? Educational Researcher, 36(5), 268–278.


  • Gamoran, A., Porter, A. C., Smithson, J., & White, P. A. (1997). Upgrading high school mathematics instruction: improving learning opportunities for low-achieving, low-income youth. Educational Evaluation and Policy Analysis, 19(4), 325–338.


  • Hamilton, L., Berends, M., & Stecher, B. M. (2005). Teachers’ responses to standards-based accountability.

  • Hannaway, J., & Hamilton, L. (2008). Performance-based accountability Policies: Implications for School and Classroom Practices.

  • Harris, D. N., & Sass, T. R. (2011). Teacher training, teacher quality and student achievement. Journal of Public Economics, 95(7–8), 798–812.


  • Herman, J. L. (2004). The effects of testing on instruction. In S. H. Fuhrman & R. F. Elmore (Eds.), Redesigning accountability systems for education. New York: Teachers College Press.


  • Hiebert, J., Carpenter, T. P., Fennema, E., Fuson, K. C., Wearne, D., Murray, H., et al. (1997). Making sense: teaching and learning mathematics with understanding. Portsmouth: Heinemann.


  • Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough: teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–64.


  • Holmstrom, B., & Milgrom, P. (1991). Multitask principal-agent analyses: incentive contracts, asset ownership, and job design. Journal of Law, Economics, and Organization, 7(Special Issue), 24–52.


  • Jacob, B. A. (2005). Accountability, incentives and behavior: the impact of high-stakes testing in the Chicago public schools. Journal of Public Economics, 89(5–6), 761–796.


  • Kane, T. J., Taylor, E. S., Tyler, J. H., & Wooten, A. L. (2011). Identifying effective classroom practices using student achievement data. The Journal of Human Resources, 46(3), 587–613.


  • Kazemi, E., & Franke, M. L. (2004). Teacher learning in mathematics: using student. 203–235.

  • Kim, J. S., & Sunderman, G. L. (2005). Measuring academic proficiency under the no child left behind act: implications for educational equity. Educational Researcher, 34(8), 3–13.


  • Ladd, H. F., & Sorensen, L. C. (2017). Returns to teacher experience: student achievement and motivation in middle school. Education Finance and Policy, 12(2), 241–279.


  • Manna, P. (2011). Collision course: federal education policy meets state and local realities. CQ Press.

  • McGuinn, P. (2006). No Child Left Behind and the transformation of federal education policy. Lawrence.

  • Mintrop, H., & Sunderman, G. L. (2009). Predictable failure of federal sanctions-driven accountability for school improvement—and why we may retain it anyway. Educational Researcher, 38(5), 353–364.


  • MIST Instruments. (n.d.). Retrieved July 13, 2019, from Peabody College of Education and Human Development website:

  • Monfils, L. F., Firestone, W. A., Hicks, J. E., Martinez, M. C., Schorr, R. Y., & Camilli, G. (2004). Teaching to the test. In W. A. Firestone, R. Y. Schorr, & L. F. Monfils (Eds.), The ambiguity of teaching to the test. Mahwah: Lawrence Erlbaum Associates.


  • National Center for Education Statistics. (2003). Highlights from the TIMSS 1999 Video Study of Eighth-Grade Mathematics Teaching (pp. 1–12). Washington, D.C.

  • National Council of Teachers of Mathematics. (2000). Principles and standards. Reston: NCTM.


  • NEA - ESEA/NCLB Update #129. (2012). Retrieved from

  • Polikoff, M. S. (2015). The stability of observational and student survey measures of teaching effectiveness. American Journal of Education, 121.

  • Popham, J. W. (2001). Teaching to the test? Educational Leadership, (March).

  • Quint, J. C., Akey, T. M., Rappaport, S., & Willner, C. J. (2007). Instructional leadership, teaching quality, and student achievement: suggestive evidence from three urban school districts.

  • Resnick, L., Matsumura, L. C., & Junker, B. (2006). Measuring reading comprehension and mathematics instruction in urban middle schools: a pilot study of the Instructional Quality Assessment (CSE Technical Report No. 681). Los Angeles, CA: Center for the Study of Evaluation.

  • Rockoff, J. E. (2004). The impact of individual teachers on student achievement: evidence from panel data. The American Economic Review, 94(2).

  • Rockoff, J., & Turner, L. J. (2010). Short-run impacts of accountability on school quality. American Economic Journal: Economic Policy, 2, 119–147.


  • Schools and Staffing Survey. (n.d.-a). Among public school teachers born in 1946 or later, total number of teachers, average years of teaching experience, average age, and percentage distribution by sex, marital status, years of teaching experience, race/ethnicity, and selected year of birth: 2011–12. Retrieved July 3, 2017, from

  • Schools and Staffing Survey. (n.d.-b). Characteristics of public, private, and bureau of Indian education elementary and secondary school teachers in the United States: results from the 2007–08 Schools and Staffing Survey. Retrieved July 3, 2017, from

  • Staiger, D. O., & Kane, T. J. (2013). Making Decisions with Imprecise Performance Measures: The Relationship Between Annual Student Achievement Gains and a Teacher’s Career Value-Added.

  • StataCorp. (2013). Stata 13 base reference manual. College Station: Stata Press.


  • Stein, M. K., & Lane, S. (1996). Instructional tasks and the development of student capacity to think and reason: an analysis of the relationship between teaching and learning in a reform mathematics project. Educational Research and Evaluation, 2(1), 50–80.


  • Stein, M. K., Grover, B. W., & Henningsen, M. A. (1996). Building student capacity for mathematical thinking and reasoning: an analysis of mathematical tasks used in reform classrooms. American Educational Research Journal, 33(2).

  • Steinberg, M. P., & Garrett, R. (2016). Classroom composition and measured teacher performance: what do teacher observation scores really measure? Educational Evaluation and Policy Analysis, XX(X), 0162373715616249-.

  • Toch, T. (2006). Margins of error. Education Sector.

  • Watanabe, M. (2007). Displaced teacher and state priorities in a high-stakes accountability context.

  • Wilhelm, A. G., & Kim, S. (2015). Generalizing from observations of mathematics teachers’ instructional practice using the instructional quality assessment. Journal for Research in Mathematics Education, 46(3), 270–279.



Author information

Corresponding author

Correspondence to Seth B. Hunter.


Appendix 1: IQA Task Potential, Implementation, and Discussion Rubrics

Copies of the IQA rubrics come from Boston and Wolf (2006). For additional details, see “MIST Instruments” (n.d.).

Table 7 Academic Rigor: Potential of the Task
Table 8 Academic Rigor: Implementation
Table 9 Academic Rigor: Discussion

Appendix 2: Some Measurement Properties of IQA

This section describes procedures used to estimate the year-to-year stability coefficient and reliability of IQA scores as a measure of the long-term component of a teacher’s instructional practices.

Like Polikoff (2015), I estimated the year-to-year stability of the IQA score by regressing the ith teacher’s score in year t (\( IQA_{it} \)) on the IQA score the same teacher received in year t − 1 (\( IQA_{i,t-1} \)) and district dummy variables. The stability coefficient is the coefficient on the teacher’s lagged IQA score; it represents the extent to which a teacher implements similar instructional practices from year to year. The model used 241 teacher–year observations and had an adjusted \( R^2 \) of 0.13; the stability coefficient was 0.31, with a district-clustered standard error of 0.11. Polikoff (2015) estimated stability coefficients of 0.49 and 0.12 for Charlotte Danielson’s Framework for Teaching and the MQI observational rubrics, respectively. Thus, the IQA stability coefficient is lower than the coefficient associated with the generic Danielson rubric, but much higher than the coefficient associated with the much more comparable MQI rubric.
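One way to write the stability regression described above is sketched here. The symbols \( \beta_0 \), \( \beta_1 \), \( \delta_{d(i)} \), and \( \varepsilon_{it} \) are notation assumed for illustration and do not come from the article; \( \delta_{d(i)} \) stands for the dummy variable of the district in which teacher i works.

\[ IQA_{it} = \beta_0 + \beta_1 \, IQA_{i,t-1} + \delta_{d(i)} + \varepsilon_{it} \]

Under this notation, the stability coefficient is \( \beta_1 \), estimated at 0.31, and the standard error of 0.11 is clustered at the district level.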

Staiger and Kane (2013) argue that year-to-year Pearson correlations of noisy measurements, like the stability coefficient from the preceding model, underestimate the reliability of an instrument as a measure of “true” performance. These authors show that the square root of the year-to-year correlation of teacher performance is a better estimate of instrument reliability. The square root of the year-to-year correlation (i.e., \( \sqrt{\rho_{IQA_{it}, IQA_{i,t-1}}} \)) represents the correlation of a short-term measurement (\( IQA_{it} \)) with a teacher’s true, time-invariant performance (i.e., the teacher-level mean IQA score, \( \overline{IQA_{it}} \)), and produces what Staiger and Kane (2013) call a year-to-career correlation. Conceptually, a year-to-career correlation is better than a year-to-year correlation because short-term measures (i.e., \( IQA_{it} \)) contain more measurement error than the career measure (i.e., \( \overline{IQA_{it}} \)), leading to attenuation in the year-to-year correlation (Staiger and Kane 2013). In my analytical dataset of 241 teacher–year observations, \( \rho_{IQA_{it}, IQA_{i,t-1}} = 0.36 \), so the reliability of IQA scores as a measure of the long-term component of teacher instructional practices is \( \sqrt{0.36} = 0.60 \).
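The year-to-career calculation can be illustrated with a short Python sketch. The function names are hypothetical and the code is not the author's; it assumes a set of paired current-year and prior-year scores is available and simply applies the square-root adjustment described above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def year_to_career(rho_year_to_year):
    """Square root of a year-to-year correlation (the year-to-career correlation)."""
    if not 0 <= rho_year_to_year <= 1:
        raise ValueError("the adjustment assumes a correlation between 0 and 1")
    return math.sqrt(rho_year_to_year)


# Using the year-to-year correlation reported in the text (0.36):
reliability = year_to_career(0.36)
```

With paired scores in hand, `year_to_career(pearson(current_scores, lagged_scores))` would chain the two steps; `current_scores` and `lagged_scores` are placeholder names for data not reproduced here.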


About this article


Cite this article

Hunter, S.B. New evidence concerning school accountability and mathematics instructional quality in the no child left behind era. Educ Asse Eval Acc 31, 409–436 (2019).

