Abstract
For this study, researchers critically reviewed documents pertaining to the highest profile of the 15 teacher evaluation lawsuits filed throughout the U.S. over the use of student test scores to evaluate teachers. In New Mexico, teacher plaintiffs contested how they were being evaluated and held accountable for their students’ test scores via a homegrown value-added model (VAM). Researchers examined court documents using six key measurement concepts (i.e., reliability, validity [i.e., convergent-related evidence], potential for bias, fairness, transparency, and consequential validity) defined by the Standards for Educational and Psychological Testing and found evidence of issues, both within the court documents and in the statistical analyses researchers conducted, on the first three measurement concepts (i.e., reliability, validity [i.e., convergent-related evidence], and potential for bias).
Notes
The main differences between VAMs and growth models are how precisely estimates are made and whether control variables are included. Unlike the typical VAM, for example, student growth models are more simply intended to measure the growth of similarly matched students, in order to make relative comparisons about student growth over time, typically without any additional statistical controls (e.g., for student background variables). Students are, rather, directly and deliberately measured against, or in reference to, the growth levels of their peers, which de facto controls for these other variables.
Regarding teachers’ contract statuses, HISD teachers new to the district are placed on probationary contracts for 1 year if they have more than 5 years of prior teaching experience, or for 3 years if they are new teachers. Otherwise, teachers are on full contract.
The New Mexico’s modified Danielson model consists of four domains (as does the Danielson model): “Domain 1: Planning and Preparation” (which is the same as Danielson), “Domain 2: Creating an Environment for Learning” (which is “Classroom Environment” as per Danielson), Domain 3: “Teaching for Learning” (which is “Instruction” as per Danielson), and “Domain 4: Professionalism” (which is “Professional Responsibilities” as per Danielson). Domains 1 and 4 when combined yield New Mexico’s Planning, Preparation, and Professionalism (PPP) dimension. It is uncertain how the state adjusted the Danielson model for observational purposes, or whether the state had permission to do so from the Danielson Group (n.d.).
In terms of teacher attendance, the state’s default teacher attendance cut scores were based on days missed as follows: 0–2 days missed = Exemplary, 3–5 days missed = Highly Effective, 6–10 days missed = Effective, 11–13 days missed = Minimally Effective, and 14+ days missed = Ineffective. However, some districts did not include teacher attendance data for various reasons (e.g., “because absences are often attributed to the Family and Medical Leave Act, bereavement, jury duty, military leave, religious leave, professional development, and coaching”) making system fidelity and fairness, again, suspect.
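The state’s default attendance cut scores amount to a simple lookup from days missed to a rating category. A minimal sketch of that mapping follows; the function name and structure are illustrative assumptions, not the state’s actual implementation.

```python
def attendance_rating(days_missed: int) -> str:
    """Map days missed to New Mexico's default teacher-attendance rating.

    Cut scores follow the note above; this is an illustrative sketch,
    not the state's implementation.
    """
    if days_missed <= 2:
        return "Exemplary"
    elif days_missed <= 5:
        return "Highly Effective"
    elif days_missed <= 10:
        return "Effective"
    elif days_missed <= 13:
        return "Minimally Effective"
    else:
        return "Ineffective"
```

Note that such a lookup presumes attendance data are recorded uniformly; as the note observes, districts that excluded protected absences (e.g., FMLA, jury duty) would feed this mapping different inputs for identical behavior, which is the fidelity and fairness concern.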
For example, most VAMs require that the scales that are used to measure growth from 1 year to the next can be appropriately positioned upon vertical, interval scales of equal units. These scales should connect consecutive tests on the same fixed ruler, so-to-speak, making it possible to measure growth from 1 year to the next across different grade-level tests. Here, for example, a ten-point difference (e.g., between a score of 50 and 60 in fourth grade) on one test should mean the same thing as a ten-point difference (e.g., between a score of 80 and 90 in fifth grade) on a similar test 1 year later. However, the scales of all large-scale standardized achievement test scores used in all current value-added systems do not even come close to being vertically aligned, as so often assumed (Baker et al. 2010; Ballou 2004; Braun 2004; Briggs and Betebenner 2009; Ho et al. 2009; Newton et al. 2010; Papay 2010; Sanders et al. 2009). “[E]ven the psychometricians who are responsible for test scaling shy away from making [such an] assumption” (Harris 2009, p. 329).
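The equal-interval assumption described above can be made concrete with a small sketch: if growth is computed as a simple score difference, a ten-point gain must mean the same thing anywhere on the scale and across grade-level tests, which is precisely the vertical-alignment property the cited literature disputes. The function below is illustrative, not any state’s formula.

```python
def gain(prior_score: float, current_score: float) -> float:
    """Growth as a simple difference -- defensible only if consecutive
    tests sit on a vertically aligned, equal-interval scale (the
    assumption the note above calls into question)."""
    return current_score - prior_score

# Under that assumption, these two gains are treated as identical amounts
# of learning, despite occurring at different points on different tests:
fourth_grade_gain = gain(50, 60)  # 10 points in fourth grade
fifth_grade_gain = gain(80, 90)   # 10 points in fifth grade
```

If the scales are not vertically aligned, equating these two differences has no measurement warrant, which is why even psychometricians “shy away from making [such an] assumption.”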
Interpreting r: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0.0 ≤ r ≤ 0.2 = a very weak correlation, if any at all (Merrigan and Huston 2008).
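The interpretive bands above can be expressed as a small classifier. Because the note’s inclusive bounds overlap at the boundaries (e.g., r = 0.8 falls in two bands), the sketch below assigns boundary values to the stronger band; that tie-breaking rule is an assumption, as is the function itself.

```python
def correlation_strength(r: float) -> str:
    """Label |r| using the bands attributed to Merrigan and Huston (2008).

    Boundary values go to the stronger band -- an assumed tie-breaking
    rule, since the note's inclusive bounds overlap.
    """
    r = abs(r)
    if r >= 0.8:
        return "very strong"
    elif r >= 0.6:
        return "strong"
    elif r >= 0.4:
        return "moderate"
    elif r >= 0.2:
        return "weak"
    else:
        return "very weak, if any at all"
```

Taking the absolute value means a correlation of −0.7 is labeled “strong,” consistent with interpreting magnitude rather than direction.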
Ibid.
Ibid.
Ibid.
Ibid.
While “free riders” are not defined in exhibit D, researchers assume the term refers to teachers who obtain something without comparable effort.
References
American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Educational Research Association (AERA) Council. (2015). AERA statement on use of value-added models (VAM) for the evaluation of educators and educator preparation programs. Educational Researcher, 44(8), 448–452. https://doi.org/10.3102/0013189X15618385 Retrieved from http://edr.sagepub.com/content/early/2015/11/10/0013189X15618385.full.pdf+htmls.
American Statistical Association (ASA). (2014). ASA statement on using value-added models for educational assessment. Alexandria, VA: ASA Retrieved from https://www.amstat.org/asa/files/pdfs/POL-ASAVAM-Statement.pdf.
Amrein-Beardsley, A., & Collins, C. (2012). The SAS Education Value-Added Assessment System (SAS® EVAAS®) in the Houston Independent School District (HISD): Intended and unintended consequences. Education Policy Analysis Archives, 20(12). https://doi.org/10.14507/epaa.v20n12.2012.
Amrein-Beardsley, A., & Geiger, T. J. (2019). Potential sources of invalidity when using teacher value-added and principal observational estimates: Artificial inflation, deflation, and conflation. Educational Assessment, Evaluation and Accountability, 31(4), 465–493. https://doi.org/10.1007/s11092-019-09311-w.
Anderson, C. (2008). The end of theory: the data deluge makes the scientific method obsolete. Wired. Retrieved from https://www.wired.com/2008/06/pb-theory/.
Araujo, M. C., Carneiro, P., Cruz-Aguayo, Y., & Schady, N. (2016). Teacher quality and learning outcomes in kindergarten. The Quarterly Journal of Economics, 131(3), 1415–1453. https://doi.org/10.1093/qje/qjw016.
Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: a descriptive study in a large urban district. Washington DC: U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/northeast/pdf/REL_2017189.pdf
Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, DC: Economic Policy Institute Retrieved from http://www.epi.org/publications/entry/bp278.
Baker, B. D., Oluwole, J. O., & Green, P. C. (2013). The legal consequences of mandating high stakes decisions based on low quality information: teacher evaluation in the Race-to-the-Top era. Education Policy Analysis Archives, 21(5). https://doi.org/10.14507/epaa.v21n5.2013.
Ballou, D. (2004). Rejoinder. Journal of Educational and Behavioral Statistics, 29(1), 131–134. https://doi.org/10.3102/10769986029001131.
Ballou, D., & Springer, M. G. (2015). Using student test scores to measure teacher performance: some problems in the design and implementation of evaluation systems. Educational Researcher, 44(2), 77–86. https://doi.org/10.3102/0013189X15574904.
Braun, H. I. (2004). Value-added modeling: what does due diligence require? Princeton, NJ: Educational Testing Service.
Braun, H. (2015). The value in value-added depends on the ecology. Educational Researcher, 44(2), 127–131. https://doi.org/10.3102/0013189X15576341.
Briggs, D. C. & Betebenner, D. (2009). Is growth in student achievement scale dependent? Paper presented at the annual meeting of the National Council for Measurement in Education (NCME), San Diego, CA.
Briggs, D. & Domingue, B. (2011). Due diligence and the evaluation of teachers: a review of the value-added analysis underlying the effectiveness rankings of Los Angeles Unified School District Teachers by the Los Angeles Times. Boulder, CO: National Education Policy Center (NEPC). http://nepc.colorado.edu/files/NEPC-LAT-VAM-2PP.pdf
Brophy, J. (1973). Stability of teacher effectiveness. American Educational Research Journal, 10(3), 245–252. https://doi.org/10.2307/1161888.
Burgess, K. (2017, July 6). Expert: NM teacher evals are toughest in the nation. The Albuquerque Journal. Retrieved from https://www.abqjournal.com/1029370/expert-nm-teacher-evals-toughest-in-us.html
Burris, C. C., & Welner, K. G. (2011). Letter to Secretary of Education Arne Duncan concerning evaluation of teachers and principals. Boulder, CO: National Education Policy Center (NEPC). Retrieved from http://nepc.colorado.edu/publication/letter-to-Arne-Duncan.
Campbell, D. (1975). Degrees of freedom and the case study. Comparative Political Studies, 8(2), 178–185. https://doi.org/10.1177/001041407500800204.
Cantrell, S., & Kane, T. J. (2013). Ensuring fair and reliable measures of effective teaching: culminating findings from the MET project’s three-year study. Seattle, WA: Bill & Melinda Gates Foundation. Retrieved from https://k12education.gatesfoundation.org/resource/ensuring-fair-and-reliable-measures-of-effective-teaching-culminating-findings-from-the-met-projects-three-year-study/.
Carey, K. (2017, May 19). The little-known statistician who taught us to measure teachers. The New York Times. Retrieved from https://www.nytimes.com/2017/05/19/upshot/the-little-known-statistician-who-transformed-education.html?_r=0
Chester, M. D. (2003). Multiple measures and high-stakes decisions: a framework for combining measures. Educational Measurement: Issues and Practice, 22(2), 32–41. https://doi.org/10.1111/j.1745-3992.2003.tb00126.x.
Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014). Measuring the impacts of teachers I: evaluating bias in teacher value-added estimates. American Economic Review, 104(9), 2593–2632. https://doi.org/10.3386/w19424.
Cizek, G. J. (2016). Validating test score meaning and defending test score use: different aims, different methods. Assessment in Education: Principles, Policy & Practice, 23(2), 212–225. https://doi.org/10.1080/0969594X.2015.1063479.
Close, K., Amrein-Beardsley, A., & Collins, C. (2020). Putting teacher evaluation systems on the map: An overview of states’ teacher evaluation systems post-Every Student Succeeds Act. Education Policy Analysis Archives, 28(1), 1–58. https://doi.org/10.14507/epaa.28.5252.
Cody, C. A., McFarland, J., Moore, J. E., & Preston, J. (2010). The evolution of growth models. Raleigh, NC: Public Schools of North Carolina. Retrieved from http://www.dpi.state.nc.us/docs/intern-research/reports/growth.pdf.
Cole, R., Haimson, J., Perez-Johnson, I., & May, H. (2011). Variability in pretest-posttest correlation coefficients by student achievement level. Washington, D.C.: U.S. Department of Education Retrieved from https://ies.ed.gov/ncee/pubs/20114033/pdf/20114033.pdf.
Collins, C. (2014). Houston, we have a problem: Teachers find no value in the SAS Education Value-Added Assessment System (EVAAS®). Education Policy Analysis Archives, 22(98). https://doi.org/10.14507/epaa.v22.1594.
Corcoran, S. P. (2010). Can teachers be evaluated by their students’ test scores? Should they be? The use of value-added measures of teacher effectiveness in policy and practice. Providence, RI: Annenberg Institute for School Reform Retrieved from http://annenberginstitute.org/publication/can-teachers-be-evaluated-their-students%E2%80%99-test-scores-should-they-be-use-value-added-mea.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/h0040957.
Darling-Hammond, L. (2010). The flat world and education: how America’s commitment to equity will determine our future. New York, NY: Teachers College Press.
Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132–137. https://doi.org/10.3102/0013189X15575346.
Denby, D. (2012). Public defender: Diane Ravitch takes on a movement. The New Yorker, Annals of Education. Retrieved from https://www.newyorker.com/magazine/2012/11/19/public-defender.
Doherty, K. M., & Jacobs, S. (2015). State of the states 2015: Evaluating teaching, leading and learning. Washington DC: National Council on Teacher Quality (NCTQ) Retrieved from http://www.nctq.org/dmsView/StateofStates2015.
Dorans, N. J., & Cook, L. L. (2016). Fairness in educational assessment and measurement. New York, NY: Routledge.
Dunn, O. J., & Clark, V. A. (1969). Correlation coefficients measured on the same individuals. Journal of the American Statistical Association, 64(325), 366–377. https://doi.org/10.2307/2283746.
Dunn, O. J., & Clark, V. A. (1971). Comparison of tests of the equality of dependent correlation coefficients. Journal of the American Statistical Association, 66(336), 904–908. https://doi.org/10.2307/2284252.
Eckert, J. M., & Dabrowski, J. (2010). Should value-added measures be used for performance pay? Phi Delta Kappan, 91(8), 88–92.
Education Week. (2015). Teacher evaluation heads to the courts. Retrieved from http://www.edweek.org/ew/section/multimedia/teacher-evaluation-heads-to-the-courts.html.
Every Student Succeeds Act (ESSA) of 2016, Pub. L. No. 114–95, 129 Stat. 1802. (2016). Retrieved from https://www.gpo.gov/fdsys/pkg/BILLS-114s1177enr/pdf/BILLS-114s1177enr.pdf
Flyvbjerg, B. (2011). Five misunderstandings about case-study research. Qualitative Inquiry, 12(2), 219–245. https://doi.org/10.1177/1077800405284363.
Gabriel, R., & Lester, J. N. (2013). Sentinels guarding the grail: value-added measurement and the quest for education reform. Education Policy Analysis Archives, 21(9). https://doi.org/10.14507/v21n9.2013.
Gale, N. K., Heath, G., Cameron, E., Rashid, S., & Redwood, S. (2013). Using the framework method for the analysis of qualitative data in multi-disciplinary health research. BMC Medical Research Methodology, 13(1), 117. https://doi.org/10.1186/1471-2288-13-117.
Gerring, J. (2004). What is a case study and what is it good for? The American Political Science Review, 98(2), 341–354. https://doi.org/10.1017/S0003055404001182.
Glazerman, S. M., & Potamites, L. (2011). False performance gains: a critique of successive cohort indicators. Mathematica Policy Research. Retrieved from https://www.mathematica.org/~/media/publications/PDFs/education/false_perf.pdf.
Glazerman, S., Goldhaber, D., Loeb, S., Raudenbush, S., Staiger, D. O., & Whitehurst, G. J. (2011). Passing muster: evaluating teacher evaluation systems. Washington, D.C.: The Brookings Institution www.brookings.edu/reports/2011/0426_evaluating_teachers.aspx.
Goldhaber, D. (2015). Exploring the potential of value-added performance measures to affect the quality of the teacher workforce. Educational Researcher, 44(2), 87–95. https://doi.org/10.3102/0013189X15574905.
Goldhaber, D., & Chaplin, D. D. (2015). Assessing the “Rothstein Falsification Test”: does it really show teacher value-added models are biased? Journal of Research on Educational Effectiveness, 8(1), 8–34. https://doi.org/10.1080/19345747.2014.978059.
Goldhaber, D. D., Goldschmidt, P., & Tseng, F. (2013). Teacher value-added at the high-school level: different models, different answers? Educational Evaluation and Policy Analysis, 35(2), 220–236. https://doi.org/10.3102/0162373712466938.
Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value-added: principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96–104. https://doi.org/10.3102/0013189X15575031.
Goldschmidt, P., Choi, K., & Beaudoin, J. B. (2012). Growth model comparison study: practical implications of alternative models for evaluating school performance. Technical issues in large-scale assessment state collaborative on assessment and student standards. Council of Chief State School Officers. Retrieved from https://files.eric.ed.gov/fulltext/ED542761.pdf.
Graue, M. E., Delaney, K. K., & Karch, A. S. (2013). Ecologies of education quality. Education Policy Analysis Archives, 21(8). https://doi.org/10.14507/epaa.v21n8.2013.
Grek, S. (2009). Governing by numbers: the PISA ‘effect’ in Europe. Journal of Education Policy, 24(1), 23–37. https://doi.org/10.1080/02680930802412669.
Grek, S., & Ozga, J. (2010). Re-inventing public education: the new role of knowledge in education policy making. Public Policy and Administration, 25(3), 271–288. https://doi.org/10.1177/0952076709356870.
Grossman, P., Cohen, J., Ronfeldt, M., & Brown, L. (2014). The test matters: the relationship between classroom observation scores and teacher value added on multiple types of assessment. Educational Researcher, 43(6), 293–303. https://doi.org/10.3102/0013189X14544542.
Guarino, C. M., Reckase, M. D., & Wooldridge, J. M. (2012). Can value-added measures of teacher education performance be trusted? East Lansing, MI: The Education Policy Center at Michigan State University. Retrieved from http://education.msu.edu/epc/library/documents/WP18Guarino-Reckase-Wooldridge-2012-Can-Value-Added-Measures-of-Teacher-Performance-Be-T_000.pdf
Guarino, C. M., Reckase, M. D., Stacy, B. W., & Wooldridge, J. M. (2014). Evaluating specification tests in the context of value-added estimation. East Lansing, MI: The Education Policy Center at Michigan State University.
Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23(1), 17–27. https://doi.org/10.1111/j.1745-3992.2004.tb00149.x.
Harris, D. N. (2009). Would accountability based on teacher value added be smart policy? An evaluation of the statistical properties and policy alternatives. Education Finance and Policy, 4(4), 319–350. https://doi.org/10.1162/edfp.2009.4.4.319.
Harris, D. N. (2011). Value-added measures in education: what every educator needs to know. Cambridge, MA: Harvard Education Press.
Harris, D. N., & Herrington, C. D. (2015). Editors’ introduction: the use of teacher value-added measures in schools: new evidence, unanswered questions, and future prospects. Educational Researcher, 44(2), 71–76. https://doi.org/10.3102/0013189X15576142.
Hill, H. C., Kapitula, L., & Umland, K. (2011). A validity argument approach to evaluating teacher value-added scores. American Educational Research Journal, 48(3), 794–831. https://doi.org/10.3102/0002831210387916.
Ho, A. D., Lewis, D. M., & Farris, J. L. (2009). The dependence of growth-model results on proficiency cut scores. Educational Measurement: Issues and Practice, 28(4), 15–26. https://doi.org/10.1111/j.1745-3992.2009.00159.x.
Holloway-Libell, J. (2015). Evidence of grade and subject-level bias in value-added measures. Teachers College Record, 117.
Ishii, J., & Rivkin, S. G. (2009). Impediments to the estimation of teacher value added. Education Finance and Policy, 4(4), 520–536. https://doi.org/10.1162/edfp.2009.4.4.520.
Jiang, J. Y., Sporte, S. E., & Luppescu, S. (2015). Teacher perspectives on evaluation reform: Chicago’s REACH students. Educational Researcher, 44(2), 105–116. https://doi.org/10.3102/0013189X15575517.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Washington, D.C.: The National Council on Measurement in Education and American Council on Education.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000.
Kane, M. T. (2016). Explicating validity. Assessment in Education: Principles, Policy & Practice, 23(2), 198–211. https://doi.org/10.1080/0969594X.2015.1060192.
Kane, M. T. (2017). Measurement error and bias in value-added models. Princeton: Educational Testing Service (ETS) Research Report Series. https://doi.org/10.1002/ets2.12153 Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/ets2.12153/full.
Kane, T. J., & Staiger, D. (2012). Gathering feedback for teaching: combining high-quality observations with student surveys and achievement gains. Seattle, WA: Bill & Melinda Gates Foundation. Retrieved from https://files.eric.ed.gov/fulltext/ED540960.pdf.
Kappler Hewitt, K. (2015). Educator evaluation policy that incorporates EVAAS value-added measures: undermined intentions and exacerbated inequities. Education Policy Analysis Archives, 23(76). https://doi.org/10.14507/epaa.v23.1968.
Kelly, A., & Downey, C. (2010). Value-added measures for schools in England: looking inside the “black box” of complex metrics. Educational Assessment, Evaluation and Accountability, 22(3), 181–198. https://doi.org/10.1007/s11092-010-9100-4.
Kelly, S., & Monczunski, L. (2007). Overcoming the volatility in school-level gain scores: a new approach to identifying value added with cross-sectional data. Educational Researcher, 36(5), 279–287. https://doi.org/10.3102/0013189X07306557.
Koedel, C., & Betts, J. R. (2007). Re-examining the role of teacher quality in the educational production function (working paper no. 2007-03). Nashville, TN: National Center on Performance Initiatives.
Koedel, C., & Betts, J. R. (2009). Does student sorting invalidate value-added models of teacher effectiveness? An extended analysis of the Rothstein critique. Nashville, TN: National Center on Performance Incentives. Retrieved from. https://doi.org/10.1162/EDFP_a_00027?journalCode=edfp.
Koedel, C., Mihaly, K., & Rockoff, J. E. (2015). Value-added modeling: a review. Economics of Education Review, 47, 180–195. https://doi.org/10.1016/j.econedurev.2015.01.006.
Koretz, D. (2017). The testing charade: pretending to make schools better. Chicago, IL: University of Chicago Press.
Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the widget effect: teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234–249. https://doi.org/10.3102/0013189X17718797.
Lavery, M. R., Amrein-Beardsley, A., Pivovarova, M., Holloway, J., Geiger, T., & Hahs-Vaughn, D. L. (2019). Do value-added models (VAMs) tell truth about teachers? Analyzing validity evidence from VAM scholars. Annual Meeting of the American Educational Research Association (AERA), Toronto, Canada. (Presidential Session)
Lingard, B. (2011). Policy as numbers: ac/counting for educational research. The Australian Educational Researcher, 38(4), 355–382. https://doi.org/10.1007/s13384-011-0041-9.
Lingard, B., Martino, W., & Rezai-Rashti, G. (2013). Testing regimes, accountabilities and education policy: commensurate global and national developments. Journal of Education Policy, 28(5), 539–556. https://doi.org/10.1080/02680939.2013.820042.
Linn, R. L., & Haug, C. (2002). Stability of school-building accountability scores and gains. Educational Evaluation and Policy Analysis, 24, 29–36. https://doi.org/10.3102/01623737024001029.
Markus, K. A. (2016). Alternative vocabularies in the test validity literature. Assessment in Education: Principles, Policy & Practice, 23(2), 252–267. https://doi.org/10.1080/0969594X.2015.1060191.
Martinez, J. F., Schweig, J., & Goldschmidt, P. (2016). Approaches for combining multiple measures of teacher performance: reliability, validity, and implications for evaluation policy. Educational Evaluation and Policy Analysis, 38(4), 738–756. https://doi.org/10.3102/0162373716666166.
Mathis, W. J. (2011). NEPC review: Florida formula for student achievement: Lessons for the nation. Boulder, CO: National Education Policy Center. Retrieved from https://nepc.colorado.edu/thinktank/review-florida-formula.
McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., & Hamilton, L. (2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29(1), 67–101.
McCaffrey, D. F., Sass, T. R., Lockwood, J. R., & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance and Policy, 4(4), 572–606. https://doi.org/10.1162/edfp.2009.4.4.572.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012–1027.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: American Council on Education and Macmillan.
Messick, S. (1995). Validity of psychological assessment: validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Michelmore, K., & Dynarski, S. (2016). The gap within the gap: using longitudinal data to understand income differences in student achievement. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w22474.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: a sourcebook. Beverly Hills, CA: Sage.
Moore Johnson, S. (2015). Will VAMS reinforce the walls of the egg-crate school? Educational Researcher, 44(2), 117–126. https://doi.org/10.3102/0013189X15573351.
New Mexico Public Education Department. (2016). NMTEACH technical guide. Business rules and calculations. 2015-2016. Santa Fe, NM: Author.
Newton, P. E., & Shaw, S. D. (2016). Disagreement over the best way to use the word ‘validity’ and options for reaching consensus. Assessment in Education: Principles, Policy & Practice, 23(2), 178–197. https://doi.org/10.1080/0969594X.2015.1037241.
Newton, X., Darling-Hammond, L., Haertel, E., & Thomas, E. (2010). Value-added modeling of teacher effectiveness: an exploration of stability across models and contexts. Educational Policy Analysis Archives, 18(23). https://doi.org/10.14507/epaa.v18n23.2010.
Nichols, S. L., & Berliner, D. C. (2007). Collateral damage: how high-stakes testing corrupts America’s schools. Cambridge, MA: Harvard Education Press.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill.
Paige, M. A. (2016). Building a better teacher: understanding value-added models in the law of teacher evaluation. Lanham, MD: Rowman & Littlefield.
Papay, J. P. (2010). Different tests, different answers: the stability of teacher value-added estimates across outcome measures. American Educational Research Journal., 48, 163–193. https://doi.org/10.3102/0002831210362589.
Paufler, N. A., & Amrein-Beardsley, A. (2014). The random assignment of students into elementary classrooms: Implications for value-added analyses and interpretations. American Educational Research Journal (AERJ), 51(2), 328–362. https://doi.org/10.3102/0002831213508299.
Pauken, T. (2013). Texas vs. No Child Left Behind. The American Conservative. Retrieved from https://www.theamericanconservative.com/articles/texas-vs-no-child-left-behind/.
Polat, N., & Cepik, S. (2015). An exploratory factor analysis of the Sheltered Instruction Observation Protocol as an evaluation tool to measure teaching effectiveness. TESOL Quarterly, 50(4), 817–843. https://doi.org/10.1002/tesq.248.
Polikoff, M. S., & Porter, A. C. (2014). Instructional alignment as a measure of teaching quality. Educational Evaluation and Policy Analysis, 36(4), 399–416. https://doi.org/10.3102/0162373714531851.
Porter, T. M. (1996). Trust in numbers: the pursuit of objectivity in science and public life. Princeton, NJ: Princeton University Press.
Race to the Top Act of 2011, S. 844--112th Congress. (2011). Retrieved from http://www.govtrack.us/congress/bills/112/s844.
Ragin, C. C., & Becker, H. S. (2000). Cases of “what is a case?”. In C. C. Ragin & H. S. Becker (Eds.), What is a case? Exploring the foundations of social inquiry (pp. 1–17). Cambridge: The Press Syndicate of The University of Cambridge.
Raudenbush, S. W. & Jean, M. (2012). How should educators interpret value-added scores? Stanford, CA: Carnegie Knowledge Network. Retrieved from http://www.carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added/.
Reiss, R. (2017). A vindication of the criticism of New Mexico Public Education Department’s teacher evaluation system. The Beacon, XX(1), 2-4. Retrieved from http://www.cese.org/wp-content/uploads/2017/05/2017-05-Beacon.pdf.
Ritchie, J., Lewis, J., Nicholls, C. M., & Ormston, R. (Eds.). (2013). Qualitative research practice: a guide for social science students and researchers. Los Angeles, CA: Sage.
Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools, and academic achievement. Econometrica, 73(2), 417–458. https://doi.org/10.1111/j.1468-0262.2005.00584.x.
Rockoff, J. E. (2004). The impact of individual teachers on student achievement: evidence from panel data. The American Economic Review, 94(2), 247–252. https://doi.org/10.1257/0002828041302244.
Rosenbaum, P., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. https://doi.org/10.2307/2335942.
Ross, E. & Walsh, K. (2019). State of the states 2019: teacher and principal evaluation policy. Washington, DC: National Council on Teacher Quality (NCTQ). Retrieved from https://www.nctq.org/pages/State-of-the-States-2019:-Teacher-and-Principal-Evaluation-Policy#footnote-15.
Rothstein, J. (2009). Student sorting and bias in value-added estimation: selection on observables and unobservables. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w14666.pdf.
Rothstein, J. (2010). Teacher quality in educational production: tracking, decay, and student achievement. Quarterly Journal of Economics, 125(1), 175–214. https://doi.org/10.1162/qjec.2010.125.1.175.
Rothstein, J. (2017). Revisiting the impacts of teachers (working paper). Berkeley, CA: University of California, Berkeley Retrieved from https://eml.berkeley.edu/~jrothst/CFR/rothstein_cfr_workingpaper_jan2017.pdf.
Rothstein, J., & Mathis, W. J. (2013). Review of two culminating reports from the MET project. Boulder, CO: National Education Policy Center (NEPC). Retrieved from http://nepc.colorado.edu/thinktank/review-MET-final-2013
Sanders, W. L., Wright, S. P., Rivers, J. C., & Leandro, J. G. (2009, November). A response to criticisms of SAS EVAAS. Cary, NC: SAS Institute Inc. Retrieved from http://www.sas.com/resources/asset/Response_to_Criticisms_of_SAS_EVAAS_11-13-09.pdf.
SAS Institute, Inc. (2019). SAS EVAAS for K-12. Retrieved from http://www.sas.com/en_us/industry/k-12-education/evaas.html.
Schochet, P. Z., & Chiang, H. S. (2013). What are error rates for classifying teacher and school performance using value-added models? Journal of Educational and Behavioral Statistics, 38, 142–171. https://doi.org/10.3102/1076998611432174.
Selwyn, N. (2015). Data entry: towards the critical study of digital data and education. Learning, Media, and Technology, 40(1), 64–82. https://doi.org/10.1080/17439884.2014.921628.
Shaw, L. H. & Bovaird, J. A. (2011). The impact of latent variable outcomes on value-added models of intervention efficacy. Paper presented at the Annual Conference of the American Educational Research Association (AERA), New Orleans, LA.
Sloat, E. F. (2015). Examining the validity of a state policy-directed framework for evaluating teacher instructional quality: informing policy, impacting practice (Unpublished doctoral dissertation). Arizona State University, Tempe, AZ.
Sloat, E., Amrein-Beardsley, A., & Sabo, K. E. (2017). Examining the factor structure underlying the TAP System for Teacher and Student Advancement. AERA Open, 3(4), 1–18. https://doi.org/10.1177/2332858417735526.
Sloat, E., Amrein-Beardsley, A., & Holloway, J. (2018). Different teacher-level effectiveness estimates, different results: Inter-model concordance across six generalized value-added models (VAMs). Educational Assessment, Evaluation and Accountability, 30(4), 367–397. https://doi.org/10.1007/s11092-018-9283-7.
Smith, W. C., & Kubacka, K. (2017). The emphasis of student test scores in teacher appraisal systems. Education Policy Analysis Archives, 25(86). https://doi.org/10.14507/epaa.25.2889.
Sørensen, T. B. (2016). Value-added measurement or modelling (VAM). Brussels: Education International. Retrieved from http://download.ei-ie.org/Docs/WebDepot/2016_EI_VAM_EN_final_Web.pdf.
Stake, R. E. (1978). The case study method in social inquiry. Educational Researcher, 7(2), 5–8.
Stake, R. E., & Trumbull, D. (1982). Naturalistic generalizations. Review Journal of Philosophy and Social Science, 7, 1–12.
Steinberg, M. P., & Garrett, R. (2016). Classroom composition and measured teacher performance: what do teacher observation scores really measure? Educational Evaluation and Policy Analysis, 38(2), 293–317. https://doi.org/10.3102/0162373715616249.
Swedien, J. (2014). Statistical guru for evaluations leaving PED. Albuquerque Journal. Retrieved from https://www.abqjournal.com/463424/statistical-guru-for-evaluations-leaving-ped.html.
Thomas, G. (2011). A typology for the case study in social science following a review of definition, discourse, and structure. Qualitative Inquiry, 17(6), 511–521. https://doi.org/10.1177/1077800411409884.
Timar, T. B., & Maxwell-Jolly, J. (Eds.). (2012). Narrowing the achievement gap: perspectives and strategies for challenging times. Cambridge, MA: Harvard Education Press.
U.S. Department of Education. (2010). A blueprint for reform: the reauthorization of the Elementary and Secondary Education Act. Retrieved from http://www2.ed.gov/policy/elsec/leg/blueprint/index.html.
U.S. Department of Education. (2012). Elementary and Secondary Education Act (ESEA) flexibility. Washington, D.C. Retrieved from https://www.ed.gov/esea/flexibility.
U.S. Department of Education. (2014). States granted waivers from No Child Left Behind allowed to reapply for renewal for 2014 and 2015 school years. Washington, D.C. Retrieved from http://www.ed.gov/news/press-releases/states-granted-waivers-no-child-left-behind-allowed-reapply-renewal-2014-and-2015-school-years.
VanWynsberghe, R., & Khan, S. (2007). Redefining case study. International Journal of Qualitative Methods, 6(2), 80–94.
Wallace, T. L., Kelcey, B., & Ruzek, E. (2016). What can student perception surveys tell us about teaching? Empirically testing the underlying structure of the Tripod student perception survey. American Educational Research Journal, 53(6), 1834–1868. https://doi.org/10.3102/0002831216671864.
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The widget effect: our national failure to acknowledge and act on differences in teacher effectiveness. New York, NY: The New Teacher Project (TNTP). Retrieved from http://tntp.org/assets/documents/TheWidgetEffect_2nd_ed.pdf.
Whitehurst, G. J., Chingos, M. M., & Lindquist, K. M. (2014). Evaluating teachers with classroom observations: lessons learned in four districts. Washington, DC: Brookings Institution. Retrieved from https://www.brookings.edu/wp-content/uploads/2016/06/Evaluating-Teachers-with-Classroom-Observations.pdf.
Wright, P., Horn, S., & Sanders, W. L. (1997). Teachers and classroom heterogeneity: their effects on educational outcomes. Journal of Personnel Evaluation in Education, 11(1), 57–67.
Yeh, S. S. (2013). A re-analysis of the effects of teacher replacement using value-added modeling. Teachers College Record, 115(12), 1–35.
Appendices
Appendix 1. Demographic variables
Appendix 2. Deviations in scores over time
Appendix 3. Means of teacher effectiveness measures, per teacher and school subgroups
Geiger, T.J., Amrein-Beardsley, A. & Holloway, J. Using test scores to evaluate and hold school teachers accountable in New Mexico. Educ Asse Eval Acc 32, 187–235 (2020). https://doi.org/10.1007/s11092-020-09324-w