Abstract
For this study, researchers critically reviewed documents pertaining to the highest profile of the 15 teacher evaluation lawsuits filed throughout the U.S. over the use of student test scores to evaluate teachers. In New Mexico, teacher plaintiffs contested how they were being evaluated and held accountable for their students’ test scores via a homegrown value-added model (VAM). Researchers examined court documents using six key measurement concepts (i.e., reliability, validity [i.e., convergent-related evidence], potential for bias, fairness, transparency, and consequential validity) defined by the Standards for Educational and Psychological Testing and found evidence of issues, both within the court documents and in the statistical analyses researchers conducted, on the first three measurement concepts (i.e., reliability, validity [i.e., convergent-related evidence], and potential for bias).
Notes
The main differences between VAMs and growth models are how precisely estimates are made and whether control variables are included. Unlike the typical VAM, for example, student growth models are more simply intended to measure the growth of similarly matched students, in order to make relative comparisons about student growth over time, typically without any additional statistical controls (e.g., for student background variables). Students are, rather, directly and deliberately measured against, or in reference to, the growth levels of their peers, which de facto controls for these other variables.
Regarding teachers’ contract statuses, HISD teachers new to the district are placed on probationary contracts for 1 year if they have more than 5 years of prior teaching experience, or for 3 years if they are new teachers. Otherwise, teachers are on full contract.
The New Mexico’s modified Danielson model consists of four domains (as does the Danielson model): “Domain 1: Planning and Preparation” (which is the same as Danielson), “Domain 2: Creating an Environment for Learning” (which is “Classroom Environment” as per Danielson), Domain 3: “Teaching for Learning” (which is “Instruction” as per Danielson), and “Domain 4: Professionalism” (which is “Professional Responsibilities” as per Danielson). Domains 1 and 4 when combined yield New Mexico’s Planning, Preparation, and Professionalism (PPP) dimension. It is uncertain how the state adjusted the Danielson model for observational purposes, or whether the state had permission to do so from the Danielson Group (n.d.).
In terms of teacher attendance, the state’s default teacher attendance cut scores were based on days missed as follows: 0–2 days missed = Exemplary, 3–5 days missed = Highly Effective, 6–10 days missed = Effective, 11–13 days missed = Minimally Effective, and 14+ days missed = Ineffective. However, some districts did not include teacher attendance data for various reasons (e.g., “because absences are often attributed to the Family and Medical Leave Act, bereavement, jury duty, military leave, religious leave, professional development, and coaching”) making system fidelity and fairness, again, suspect.
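The state’s default attendance cut scores amount to a simple lookup from days missed to a rating category. A minimal sketch of that mapping follows; the function name and structure are illustrative assumptions, not the state’s actual implementation.

```python
def attendance_rating(days_missed: int) -> str:
    """Map days missed to New Mexico's default teacher-attendance rating.

    Cut scores follow the note above; this is an illustrative sketch,
    not the state's implementation.
    """
    if days_missed <= 2:
        return "Exemplary"
    elif days_missed <= 5:
        return "Highly Effective"
    elif days_missed <= 10:
        return "Effective"
    elif days_missed <= 13:
        return "Minimally Effective"
    else:
        return "Ineffective"
```

Note that such a lookup presumes attendance data are recorded uniformly; as the note observes, districts that excluded protected absences (e.g., FMLA, jury duty) would feed this mapping different inputs for identical behavior, which is the fidelity and fairness concern.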
For example, most VAMs require that the scales that are used to measure growth from 1 year to the next can be appropriately positioned upon vertical, interval scales of equal units. These scales should connect consecutive tests on the same fixed ruler, so-to-speak, making it possible to measure growth from 1 year to the next across different grade-level tests. Here, for example, a ten-point difference (e.g., between a score of 50 and 60 in fourth grade) on one test should mean the same thing as a ten-point difference (e.g., between a score of 80 and 90 in fifth grade) on a similar test 1 year later. However, the scales of all large-scale standardized achievement test scores used in all current value-added systems do not even come close to being vertically aligned, as so often assumed (Baker et al. 2010; Ballou 2004; Braun 2004; Briggs and Betebenner 2009; Ho et al. 2009; Newton et al. 2010; Papay 2010; Sanders et al. 2009). “[E]ven the psychometricians who are responsible for test scaling shy away from making [such an] assumption” (Harris 2009, p. 329).
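The equal-interval assumption described above can be made concrete with a small sketch: if growth is computed as a simple score difference, a ten-point gain must mean the same thing anywhere on the scale and across grade-level tests, which is precisely the vertical-alignment property the cited literature disputes. The function below is illustrative, not any state’s formula.

```python
def gain(prior_score: float, current_score: float) -> float:
    """Growth as a simple difference -- defensible only if consecutive
    tests sit on a vertically aligned, equal-interval scale (the
    assumption the note above calls into question)."""
    return current_score - prior_score

# Under that assumption, these two gains are treated as identical amounts
# of learning, despite occurring at different points on different tests:
fourth_grade_gain = gain(50, 60)  # 10 points in fourth grade
fifth_grade_gain = gain(80, 90)   # 10 points in fifth grade
```

If the scales are not vertically aligned, equating these two differences has no measurement warrant, which is why even psychometricians “shy away from making [such an] assumption.”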
Interpreting r: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0.0 ≤ r ≤ 0.2 = a very weak correlation, if any at all (Merrigan and Huston 2008).
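The interpretive bands above can be expressed as a small classifier. Because the note’s inclusive bounds overlap at the boundaries (e.g., r = 0.8 falls in two bands), the sketch below assigns boundary values to the stronger band; that tie-breaking rule is an assumption, as is the function itself.

```python
def correlation_strength(r: float) -> str:
    """Label |r| using the bands attributed to Merrigan and Huston (2008).

    Boundary values go to the stronger band -- an assumed tie-breaking
    rule, since the note's inclusive bounds overlap.
    """
    r = abs(r)
    if r >= 0.8:
        return "very strong"
    elif r >= 0.6:
        return "strong"
    elif r >= 0.4:
        return "moderate"
    elif r >= 0.2:
        return "weak"
    else:
        return "very weak, if any at all"
```

Taking the absolute value means a correlation of −0.7 is labeled “strong,” consistent with interpreting magnitude rather than direction.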
Ibid.
Ibid.
Ibid.
Ibid.
While “free riders” are not defined in exhibit D, researchers assume the term refers to teachers who obtain something without comparable effort.
References
American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Educational Research Association (AERA) Council. (2015). AERA statement on use of value-added models (VAM) for the evaluation of educators and educator preparation programs. Educational Researcher, 44(8), 448–452. https://doi.org/10.3102/0013189X15618385 Retrieved from http://edr.sagepub.com/content/early/2015/11/10/0013189X15618385.full.pdf+htmls.
American Statistical Association (ASA). (2014). ASA statement on using value-added models for educational assessment. Alexandria, VA: ASA Retrieved from https://www.amstat.org/asa/files/pdfs/POL-ASAVAM-Statement.pdf.
Amrein-Beardsley, A., & Collins, C. (2012). The SAS Education Value-Added Assessment System (SAS® EVAAS®) in the Houston Independent School District (HISD): Intended and unintended consequences. Education Policy Analysis Archives, 20(12). https://doi.org/10.14507/epaa.v20n12.2012.
Amrein-Beardsley, A., & Geiger, T. J. (2019). Potential sources of invalidity when using teacher value-added and principal observational estimates: Artificial inflation, deflation, and conflation. Educational Assessment, Evaluation and Accountability, 31(4), 465–493. https://doi.org/10.1007/s11092-019-09311-w.
Anderson, C. (2008). The end of theory: the data deluge makes the scientific method obsolete. Wired. Retrieved from https://www.wired.com/2008/06/pb-theory/.
Araujo, M. C., Carneiro, P., Cruz-Aguayo, Y., & Schady, N. (2016). Teacher quality and learning outcomes in kindergarten. The Quarterly Journal of Economics, 131(3), 1415–1453. https://doi.org/10.1093/qje/qjw016.
Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: a descriptive study in a large urban district. Washington DC: U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/northeast/pdf/REL_2017189.pdf
Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, DC: Economic Policy Institute Retrieved from http://www.epi.org/publications/entry/bp278.
Baker, B. D., Oluwole, J. O., & Green, P. C. (2013). The legal consequences of mandating high stakes decisions based on low quality information: teacher evaluation in the Race-to-the-Top era. Education Policy Analysis Archives, 21(5). https://doi.org/10.14507/epaa.v21n5.2013.
Ballou, D. (2004). Rejoinder. Journal of Educational and Behavioral Statistics, 29(1), 131–134. https://doi.org/10.3102/10769986029001131.
Ballou, D., & Springer, M. G. (2015). Using student test scores to measure teacher performance: some problems in the design and implementation of evaluation systems. Educational Researcher, 44(2), 77–86. https://doi.org/10.3102/0013189X15574904.
Braun, H. I. (2004). Value-added modeling: what does due diligence require? Princeton, NJ: Educational Testing Service.
Braun, H. (2015). The value in value-added depends on the ecology. Educational Researcher, 44(2), 127–131. https://doi.org/10.3102/0013189X15576341.
Briggs, D. C. & Betebenner, D. (2009). Is growth in student achievement scale dependent? Paper presented at the annual meeting of the National Council for Measurement in Education (NCME), San Diego, CA.
Briggs, D. & Domingue, B. (2011). Due diligence and the evaluation of teachers: a review of the value-added analysis underlying the effectiveness rankings of Los Angeles Unified School District Teachers by the Los Angeles Times. Boulder, CO: National Education Policy Center (NEPC). http://nepc.colorado.edu/files/NEPC-LAT-VAM-2PP.pdf
Brophy, J. (1973). Stability of teacher effectiveness. American Educational Research Journal, 10(3), 245–252. https://doi.org/10.2307/1161888.
Burgess, K. (2017, July 6). Expert: NM teacher evals are toughest in the nation. The Albuquerque Journal. Retrieved from https://www.abqjournal.com/1029370/expert-nm-teacher-evals-toughest-in-us.html
Burris, C. C., & Welner, K. G. (2011). Letter to Secretary of Education Arne Duncan concerning evaluation of teachers and principals. Boulder, CO: National Education Policy Center (NEPC). Retrieved from http://nepc.colorado.edu/publication/letter-to-Arne-Duncan.
Campbell, D. (1975). Degrees of freedom and the case study. Comparative Political Studies, 8(2), 178–185. https://doi.org/10.1177/001041407500800204.
Cantrell, S., & Kane, T. J. (2013). Ensuring fair and reliable measures of effective teaching: culminating findings from the MET project’s three-year study. Seattle, WA: Bill & Melinda Gates Foundation. Retrieved from https://k12education.gatesfoundation.org/resource/ensuring-fair-and-reliable-measures-of-effective-teaching-culminating-findings-from-the-met-projects-three-year-study/.
Carey, K. (2017, May 19). The little-known statistician who taught us to measure teachers. The New York Times. Retrieved from https://www.nytimes.com/2017/05/19/upshot/the-little-known-statistician-who-transformed-education.html?_r=0
Chester, M. D. (2003). Multiple measures and high-stakes decisions: a framework for combining measures. Educational Measurement: Issues and Practice, 22(2), 32–41. https://doi.org/10.1111/j.1745-3992.2003.tb00126.x.
Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014). Measuring the impacts of teachers I: evaluating bias in teacher value-added estimates. American Economic Review, 104(9), 2593–2632. https://doi.org/10.3386/w19424.
Cizek, G. J. (2016). Validating test score meaning and defending test score use: different aims, different methods. Assessment in Education: Principles, Policy & Practice, 23(2), 212–225. https://doi.org/10.1080/0969594X.2015.1063479.
Close, K., Amrein-Beardsley, A., & Collins, C. (2020). Putting teacher evaluation systems on the map: An overview of states’ teacher evaluation systems post-Every Student Succeeds Act. Education Policy Analysis Archives, 28(1), 1–58. https://doi.org/10.14507/epaa.28.5252.
Cody, C. A., McFarland, J., Moore, J. E., & Preston, J. (2010). The evolution of growth models. Raleigh, NC: Public Schools of North Carolina. Retrieved from http://www.dpi.state.nc.us/docs/intern-research/reports/growth.pdf.
Cole, R., Haimson, J., Perez-Johnson, I., & May, H. (2011). Variability in pretest-posttest correlation coefficients by student achievement level. Washington, D.C.: U.S. Department of Education Retrieved from https://ies.ed.gov/ncee/pubs/20114033/pdf/20114033.pdf.
Collins, C. (2014). Houston, we have a problem: Teachers find no value in the SAS Education Value-Added Assessment System (EVAAS®). Education Policy Analysis Archives, 22(98). https://doi.org/10.14507/epaa.v22.1594.
Corcoran, S. P. (2010). Can teachers be evaluated by their students’ test scores? Should they be? The use of value-added measures of teacher effectiveness in policy and practice. Providence, RI: Annenberg Institute for School Reform Retrieved from http://annenberginstitute.org/publication/can-teachers-be-evaluated-their-students%E2%80%99-test-scores-should-they-be-use-value-added-mea.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/h0040957.
Darling-Hammond, L. (2010). The flat world and education: how America’s commitment to equity will determine our future. New York, NY: Teachers College Press.
Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132–137. https://doi.org/10.3102/0013189X15575346.
Denby, D. (2012). Public defender: Diane Ravitch takes on a movement. The New Yorker, Annals of Education. Retrieved from https://www.newyorker.com/magazine/2012/11/19/public-defender.
Doherty, K. M., & Jacobs, S. (2015). State of the states 2015: Evaluating teaching, leading and learning. Washington DC: National Council on Teacher Quality (NCTQ) Retrieved from http://www.nctq.org/dmsView/StateofStates2015.
Dorans, N. J., & Cook, L. L. (2016). Fairness in educational assessment and measurement. New York, NY: Routledge.
Dunn, O. J., & Clark, V. A. (1969). Correlation coefficients measured on the same individuals. Journal of the American Statistical Association, 64(325), 366–377. https://doi.org/10.2307/2283746.
Dunn, O. J., & Clark, V. A. (1971). Comparison of tests of the equality of dependent correlation coefficients. Journal of the American Statistical Association, 66(336), 904–908. https://doi.org/10.2307/2284252.
Eckert, J. M., & Dabrowski, J. (2010). Should value-added measures be used for performance pay? Phi Delta Kappan, 91(8), 88–92.
Education Week. (2015). Teacher evaluation heads to the courts. Retrieved from http://www.edweek.org/ew/section/multimedia/teacher-evaluation-heads-to-the-courts.html.
Every Student Succeeds Act (ESSA) of 2016, Pub. L. No. 114–95, 129 Stat. 1802. (2016). Retrieved from https://www.gpo.gov/fdsys/pkg/BILLS-114s1177enr/pdf/BILLS-114s1177enr.pdf
Flyvbjerg, B. (2011). Five misunderstandings about case-study research. Qualitative Inquiry, 12(2), 219–245. https://doi.org/10.1177/1077800405284363.
Gabriel, R., & Lester, J. N. (2013). Sentinels guarding the grail: value-added measurement and the quest for education reform. Education Policy Analysis Archives, 21(9). https://doi.org/10.14507/v21n9.2013.
Gale, N. K., Heath, G., Cameron, E., Rashid, S., & Redwood, S. (2013). Using the framework method for the analysis of qualitative data in multi-disciplinary health research. BMC Medical Research Methodology, 13(1), 117. https://doi.org/10.1186/1471-2288-13-117.
Gerring, J. (2004). What is a case study and what is it good for? The American Political Science Review, 98(2), 341–354. https://doi.org/10.1017/S0003055404001182.
Glazerman, S. M., & Potamites, L. (2011). False performance gains: a critique of successive cohort indicators. Mathematica Policy Research. Retrieved from https://www.mathematica.org/~/media/publications/PDFs/education/false_perf.pdf.
Glazerman, S., Goldhaber, D., Loeb, S., Raudenbush, S., Staiger, D. O., & Whitehurst, G. J. (2011). Passing muster: evaluating teacher evaluation systems. Washington, D.C.: The Brookings Institution www.brookings.edu/reports/2011/0426_evaluating_teachers.aspx.
Goldhaber, D. (2015). Exploring the potential of value-added performance measures to affect the quality of the teacher workforce. Educational Researcher, 44(2), 87–95. https://doi.org/10.3102/0013189X15574905.
Goldhaber, D., & Chaplin, D. D. (2015). Assessing the “Rothstein Falsification Test”: does it really show teacher value-added models are biased? Journal of Research on Educational Effectiveness, 8(1), 8–34. https://doi.org/10.1080/19345747.2014.978059.
Goldhaber, D. D., Goldschmidt, P., & Tseng, F. (2013). Teacher value-added at the high-school level: different models, different answers? Educational Evaluation and Policy Analysis, 35(2), 220–236. https://doi.org/10.3102/0162373712466938.
Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value-added: principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96–104. https://doi.org/10.3102/0013189X15575031.
Goldschmidt, P., Choi, K., & Beaudoin, J. B. (2012). Growth model comparison study: practical implications of alternative models for evaluating school performance. Technical issues in large-scale assessment state collaborative on assessment and student standards. Council of Chief State School Officers. Retrieved from https://files.eric.ed.gov/fulltext/ED542761.pdf.
Graue, M. E., Delaney, K. K., & Karch, A. S. (2013). Ecologies of education quality. Education Policy Analysis Archives, 21(8). https://doi.org/10.14507/epaa.v21n8.2013.
Grek, S. (2009). Governing by numbers: the PISA ‘effect’ in Europe. Journal of Education Policy, 24(1), 23–37. https://doi.org/10.1080/02680930802412669.
Grek, S., & Ozga, J. (2010). Re-inventing public education: the new role of knowledge in education policy making. Public Policy and Administration, 25(3), 271–288. https://doi.org/10.1177/0952076709356870.
Grossman, P., Cohen, J., Ronfeldt, M., & Brown, L. (2014). The test matters: the relationship between classroom observation scores and teacher value added on multiple types of assessment. Educational Researcher, 43(6), 293–303. https://doi.org/10.3102/0013189X14544542.
Guarino, C. M., Reckase, M. D., & Wooldridge, J. M. (2012). Can value-added measures of teacher education performance be trusted? East Lansing, MI: The Education Policy Center at Michigan State University. Retrieved from http://education.msu.edu/epc/library/documents/WP18Guarino-Reckase-Wooldridge-2012-Can-Value-Added-Measures-of-Teacher-Performance-Be-T_000.pdf
Guarino, C. M., Reckase, M. D., Stacy, B. W., & Wooldridge, J. M. (2014). Evaluating specification tests in the context of value-added estimation. East Lansing, MI: The Education Policy Center at Michigan State University.
Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23(1), 17–27. https://doi.org/10.1111/j.1745-3992.2004.tb00149.x.
Harris, D. N. (2009). Would accountability based on teacher value added be smart policy? An evaluation of the statistical properties and policy alternatives. Education Finance and Policy, 4(4), 319–350. https://doi.org/10.1162/edfp.2009.4.4.319.
Harris, D. N. (2011). Value-added measures in education: what every educator needs to know. Cambridge, MA: Harvard Education Press.
Harris, D. N., & Herrington, C. D. (2015). Editors’ introduction: the use of teacher value-added measures in schools: new evidence, unanswered questions, and future prospects. Educational Researcher, 44(2), 71–76. https://doi.org/10.3102/0013189X15576142.
Hill, H. C., Kapitula, L., & Umland, K. (2011). A validity argument approach to evaluating teacher value-added scores. American Educational Research Journal, 48(3), 794–831. https://doi.org/10.3102/0002831210387916.
Ho, A. D., Lewis, D. M., & Farris, J. L. (2009). The dependence of growth-model results on proficiency cut scores. Educational Measurement: Issues and Practice, 28(4), 15–26. https://doi.org/10.1111/j.1745-3992.2009.00159.x.
Holloway-Libell, J. (2015). Evidence of grade and subject-level bias in value-added measures. Teachers College Record, 117.
Ishii, J., & Rivkin, S. G. (2009). Impediments to the estimation of teacher value added. Education Finance and Policy, 4(4), 520–536. https://doi.org/10.1162/edfp.2009.4.4.520.
Jiang, J. Y., Sporte, S. E., & Luppescu, S. (2015). Teacher perspectives on evaluation reform: Chicago’s REACH students. Educational Researcher, 44(2), 105–116. https://doi.org/10.3102/0013189X15575517.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Washington, D.C.: The National Council on Measurement in Education and American Council on Education.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000.
Kane, M. T. (2016). Explicating validity. Assessment in Education: Principles, Policy & Practice, 23(2), 198–211. https://doi.org/10.1080/0969594X.2015.1060192.
Kane, M. T. (2017). Measurement error and bias in value-added models. Princeton: Educational Testing Service (ETS) Research Report Series. https://doi.org/10.1002/ets2.12153 Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/ets2.12153/full.
Kane, T. J., & Staiger, D. (2012). Gathering feedback for teaching: combining high-quality observations with student surveys and achievement gains. Seattle, WA: Bill & Melinda Gates Foundation. Retrieved from https://files.eric.ed.gov/fulltext/ED540960.pdf.
Kappler Hewitt, K. (2015). Educator evaluation policy that incorporates EVAAS value-added measures: undermined intentions and exacerbated inequities. Education Policy Analysis Archives, 23(76). https://doi.org/10.14507/epaa.v23.1968.
Kelly, A., & Downey, C. (2010). Value-added measures for schools in England: looking inside the “black box” of complex metrics. Educational Assessment, Evaluation and Accountability, 22(3), 181–198. https://doi.org/10.1007/s11092-010-9100-4.
Kelly, S., & Monczunski, L. (2007). Overcoming the volatility in school-level gain scores: a new approach to identifying value added with cross-sectional data. Educational Researcher, 36(5), 279–287. https://doi.org/10.3102/0013189X07306557.
Koedel, C., & Betts, J. R. (2007). Re-examining the role of teacher quality in the educational production function (working paper no. 2007-03). Nashville, TN: National Center on Performance Initiatives.
Koedel, C., & Betts, J. R. (2009). Does student sorting invalidate value-added models of teacher effectiveness? An extended analysis of the Rothstein critique. Nashville, TN: National Center on Performance Incentives. Retrieved from. https://doi.org/10.1162/EDFP_a_00027?journalCode=edfp.
Koedel, C., Mihaly, K., & Rockoff, J. E. (2015). Value-added modeling: a review. Economics of Education Review, 47, 180–195. https://doi.org/10.1016/j.econedurev.2015.01.006.
Koretz, D. (2017). The testing charade: pretending to make schools better. Chicago, IL: University of Chicago Press.
Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the widget effect: teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234–249. https://doi.org/10.3102/0013189X17718797.
Lavery, M. R., Amrein-Beardsley, A., Pivovarova, M., Holloway, J., Geiger, T., & Hahs-Vaughn, D. L. (2019). Do value-added models (VAMs) tell truth about teachers? Analyzing validity evidence from VAM scholars. Annual Meeting of the American Educational Research Association (AERA), Toronto, Canada. (Presidential Session)
Lingard, B. (2011). Policy as numbers: ac/counting for educational research. The Australian Educational Researcher, 38(4), 355–382. https://doi.org/10.1007/s13384-011-0041-9.
Lingard, B., Martino, W., & Rezai-Rashti, G. (2013). Testing regimes, accountabilities and education policy: commensurate global and national developments. Journal of Education Policy, 28(5), 539–556. https://doi.org/10.1080/02680939.2013.820042.
Linn, R. L., & Haug, C. (2002). Stability of school-building accountability scores and gains. Educational Evaluation and Policy Analysis, 24, 29–36. https://doi.org/10.3102/01623737024001029.
Markus, K. A. (2016). Alternative vocabularies in the test validity literature. Assessment in Education: Principles, Policy & Practice, 23(2), 252–267. https://doi.org/10.1080/0969594X.2015.1060191.
Martinez, J. F., Schweig, J., & Goldschmidt, P. (2016). Approaches for combining multiple measures of teacher performance: reliability, validity, and implications for evaluation policy. Educational Evaluation and Policy Analysis, 38(4), 738–756. https://doi.org/10.3102/0162373716666166.
Mathis, W. J. (2011). NEPC review: Florida formula for student achievement: Lessons for the nation. Boulder, CO: National Education Policy Center. Retrieved from https://nepc.colorado.edu/thinktank/review-florida-formula.
McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., & Hamilton, L. (2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29(1), 67–101.
McCaffrey, D. F., Sass, T. R., Lockwood, J. R., & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance and Policy, 4(4), 572–606. https://doi.org/10.1162/edfp.2009.4.4.572.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012–1027.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: American Council on Education and Macmillan.
Messick, S. (1995). Validity of psychological assessment: validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Michelmore, K., & Dynarski, S. (2016). The gap within the gap: using longitudinal data to understand income differences in student achievement. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w22474.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: a sourcebook. Beverly Hills, CA: Sage.
Moore Johnson, S. (2015). Will VAMS reinforce the walls of the egg-crate school? Educational Researcher, 44(2), 117–126. https://doi.org/10.3102/0013189X15573351.
New Mexico Public Education Department. (2016). NMTEACH technical guide. Business rules and calculations. 2015-2016. Santa Fe, NM: Author.
Newton, P. E., & Shaw, S. D. (2016). Disagreement over the best way to use the word ‘validity’ and options for reaching consensus. Assessment in Education: Principles, Policy & Practice, 23(2), 178–197. https://doi.org/10.1080/0969594X.2015.1037241.
Newton, X., Darling-Hammond, L., Haertel, E., & Thomas, E. (2010). Value-added modeling of teacher effectiveness: an exploration of stability across models and contexts. Educational Policy Analysis Archives, 18(23). https://doi.org/10.14507/epaa.v18n23.2010.
Nichols, S. L., & Berliner, D. C. (2007). Collateral damage: how high-stakes testing corrupts America’s schools. Cambridge, MA: Harvard Education Press.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill.
Paige, M. A. (2016). Building a better teacher: understanding value-added models in the law of teacher evaluation. Lanham, MD: Rowman & Littlefield.
Papay, J. P. (2010). Different tests, different answers: the stability of teacher value-added estimates across outcome measures. American Educational Research Journal., 48, 163–193. https://doi.org/10.3102/0002831210362589.
Paufler, N. A., & Amrein-Beardsley, A. (2014). The random assignment of students into elementary classrooms: Implications for value-added analyses and interpretations. American Educational Research Journal (AERJ), 51(2), 328–362. https://doi.org/10.3102/0002831213508299.
Pauken, T. (2013). Texas vs. No Child Left Behind. The American Conservative. Retrieved from https://www.theamericanconservative.com/articles/texas-vs-no-child-left-behind/.
Polat, N., & Cepik, S. (2015). An exploratory factor analysis of the Sheltered Instruction Observation Protocol as an evaluation tool to measure teaching effectiveness. TESOL Quarterly, 50(4), 817–843. https://doi.org/10.1002/tesq.248.
Polikoff, M. S., & Porter, A. C. (2014). Instructional alignment as a measure of teaching quality. Educational Evaluation and Policy Analysis, 36(4), 399–416. https://doi.org/10.3102/0162373714531851.
Porter, T. M. (1996). Trust in numbers: the pursuit of objectivity in science and public life. Princeton, NJ: Princeton University Press.
Race to the Top Act of 2011, S. 844--112th Congress. (2011). Retrieved from http://www.govtrack.us/congress/bills/112/s844.
Ragin, C. C., & Becker, H. S. (2000). Cases of “what is a case?”. In C. C. Ragin & H. S. Becker (Eds.), What is a case? Exploring the foundations of social inquiry (pp. 1–17). Cambridge: The Press Syndicate of The University of Cambridge.
Raudenbush, S. W. & Jean, M. (2012). How should educators interpret value-added scores? Stanford, CA: Carnegie Knowledge Network. Retrieved from http://www.carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added/.
Reiss, R. (2017). A vindication of the criticism of New Mexico Public Education Department’s teacher evaluation system. The Beacon, XX(1), 2-4. Retrieved from http://www.cese.org/wp-content/uploads/2017/05/2017-05-Beacon.pdf.
Ritchie, J., Lewis, J., Nicholls, C. M., & Ormston, R. (Eds.). (2013). Qualitative research practice: a guide for social science students and researchers. Los Angeles, CA: Sage.
Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools, and academic achievement. Econometrica, 73(2), 417–458. https://doi.org/10.1111/j.1468-0262.2005.00584.x.
Rockoff, J. E. (2004). The impact of individual teachers on student achievement: evidence from panel data. The American Economic Review, 94(2), 247–252. https://doi.org/10.1257/0002828041302244.
Rosenbaum, P., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. https://doi.org/10.2307/2335942.
Ross, E. & Walsh, K. (2019). State of the states 2019: teacher and principal evaluation policy. Washington, DC: National Council on Teacher Quality (NCTQ). Retrieved from https://www.nctq.org/pages/State-of-the-States-2019:-Teacher-and-Principal-Evaluation-Policy#footnote-15.
Rothstein, J. (2009). Student sorting and bias in value-added estimation: selection on observables and unobservables. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w14666.pdf.
Rothstein, J. (2010). Teacher quality in educational production: tracking, decay, and student achievement. Quarterly Journal of Economics, 125(1), 175–214. https://doi.org/10.1162/qjec.2010.125.1.175.
Rothstein, J. (2017). Revisiting the impacts of teachers (working paper). Berkeley, CA: University of California, Berkeley Retrieved from https://eml.berkeley.edu/~jrothst/CFR/rothstein_cfr_workingpaper_jan2017.pdf.
Rothstein, J., & Mathis, W. J. (2013). Review of two culminating reports from the MET project. Boulder, CO: National Education Policy Center (NEPC). Retrieved from http://nepc.colorado.edu/thinktank/review-MET-final-2013
Sanders, W. L., Wright, S. P., Rivers, J. C., & Leandro, J. G. (2009, November). A response to criticisms of SAS EVAAS. Cary, NC: SAS Institute Inc. Retrieved from http://www.sas.com/resources/asset/Response_to_Criticisms_of_SAS_EVAAS_11-13-09.pdf.
SAS Institute, Inc. (2019). SAS EVAAS for K-12. Retrieved from http://www.sas.com/en_us/industry/k-12-education/evaas.html.
Schochet, P. Z., & Chiang, H. S. (2013). What are error rates for classifying teacher and school performance using value-added models? Journal of Educational and Behavioral Statistics, 38, 142–171. https://doi.org/10.3102/1076998611432174.
Selwyn, N. (2015). Data entry: towards the critical study of digital data and education. Learning, Media, and Technology, 40(1), 64–82. https://doi.org/10.1080/17439884.2014.921628.
Shaw, L. H. & Bovaird, J. A. (2011). The impact of latent variable outcomes on value-added models of intervention efficacy. Paper presented at the Annual Conference of the American Educational Research Association (AERA), New Orleans, LA.
Sloat, E. F. (2015). Examining the validity of a state policy-directed framework for evaluating teacher instructional quality: informing policy, impacting practice (Unpublished doctoral dissertation). Arizona State University, Tempe, AZ.
Sloat, E., Amrein-Beardsley, A., & Sabo, K. E. (2017). Examining the factor structure underlying the TAP System for Teacher and Student Advancement. AERA Open, 3(4), 1–18. https://doi.org/10.1177/2332858417735526.
Sloat, E., Amrein-Beardsley, A., & Holloway, J. (2018). Different teacher-level effectiveness estimates, different results: Inter-model concordance across six generalized value-added models (VAMs). Educational Assessment, Evaluation and Accountability, 30(4), 367–397. https://doi.org/10.1007/s11092-018-9283-7.
Smith, W. C., & Kubacka, K. (2017). The emphasis of student test scores in teacher appraisal systems. Education Policy Analysis Archives, 25(86). https://doi.org/10.14507/epaa.25.2889.
Sørensen, T. B. (2016). Value-added measurement or modelling (VAM). Brussels: Education International. Retrieved from http://download.ei-ie.org/Docs/WebDepot/2016_EI_VAM_EN_final_Web.pdf.
Stake, R. E. (1978). The case study method in social inquiry. Educational Researcher, 7(2), 5–8.
Stake, R. E., & Trumbull, D. (1982). Naturalistic generalizations. Review Journal of Philosophy and Social Science, 7, 1–12.
Steinberg, M. P., & Garrett, R. (2016). Classroom composition and measured teacher performance: what do teacher observation scores really measure? Educational Evaluation and Policy Analysis, 38(2), 293–317. https://doi.org/10.3102/0162373715616249.
Swedien, J. (2014). Statistical guru for evaluations leaving PED. Albuquerque Journal. Retrieved from https://www.abqjournal.com/463424/statistical-guru-for-evaluations-leaving-ped.html.
Thomas, G. (2011). A typology for the case study in social science following a review of definition, discourse, and structure. Qualitative Inquiry, 17(6), 511–521. https://doi.org/10.1177/1077800411409884.
Timar, T. B., & Maxwell-Jolly, J. (Eds.). (2012). Narrowing the achievement gap: perspectives and strategies for challenging times. Cambridge, MA: Harvard Education Press.
U.S. Department of Education. (2010). A blueprint for reform: the reauthorization of the Elementary and Secondary Education Act. Retrieved from http://www2.ed.gov/policy/elsec/leg/blueprint/index.html.
U.S. Department of Education. (2012). Elementary and Secondary Education Act (ESEA) flexibility. Washington, D.C. Retrieved from https://www.ed.gov/esea/flexibility.
U.S. Department of Education. (2014). States granted waivers from No Child Left Behind allowed to reapply for renewal for 2014 and 2015 school years. Washington, D.C. Retrieved from http://www.ed.gov/news/press-releases/states-granted-waivers-no-child-left-behind-allowed-reapply-renewal-2014-and-2015-school-years.
VanWynsberghe, R., & Khan, S. (2007). Redefining case study. International Journal of Qualitative Methods, 6(2), 80–94.
Wallace, T. L., Kelcey, B., & Ruzek, E. (2016). What can student perception surveys tell us about teaching? Empirically testing the underlying structure of the Tripod student perception survey. American Educational Research Journal, 53(6), 1834–1868. https://doi.org/10.3102/0002831216671864.
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The widget effect: our national failure to acknowledge and act on differences in teacher effectiveness. New York, NY: The New Teacher Project (TNTP). Retrieved from http://tntp.org/assets/documents/TheWidgetEffect_2nd_ed.pdf.
Whitehurst, G. J., Chingos, M. M., & Lindquist, K. M. (2014). Evaluating teachers with classroom observations: lessons learned in four districts. Washington, DC: Brookings Institution. Retrieved from https://www.brookings.edu/wp-content/uploads/2016/06/Evaluating-Teachers-with-Classroom-Observations.pdf.
Wright, P., Horn, S., & Sanders, W. L. (1997). Teachers and classroom heterogeneity: their effects on educational outcomes. Journal of Personnel Evaluation in Education, 11(1), 57–67.
Yeh, S. S. (2013). A re-analysis of the effects of teacher replacement using value-added modeling. Teachers College Record, 115(12), 1–35.
Appendices
Appendix 1. Demographic variables
Appendix 2. Deviations in scores over time
Appendix 3. Means of teacher effectiveness measures, per teacher and school subgroups
Geiger, T.J., Amrein-Beardsley, A. & Holloway, J. Using test scores to evaluate and hold school teachers accountable in New Mexico. Educ Asse Eval Acc 32, 187–235 (2020). https://doi.org/10.1007/s11092-020-09324-w