
Using test scores to evaluate and hold school teachers accountable in New Mexico

Published in: Educational Assessment, Evaluation and Accountability

Abstract

For this study, researchers critically reviewed documents from the highest-profile of the 15 teacher evaluation lawsuits filed throughout the U.S. over the use of student test scores to evaluate teachers. In New Mexico, teacher plaintiffs contested how they were being evaluated and held accountable for their students’ test scores via a homegrown value-added model (VAM). Researchers examined the court documents using six key measurement concepts defined by the Standards for Educational and Psychological Testing (i.e., reliability, validity [i.e., convergent-related evidence], potential for bias, fairness, transparency, and consequential validity) and found evidence of issues, both within the court documents and in the statistical analyses researchers conducted, on the first three of these concepts (i.e., reliability, validity [i.e., convergent-related evidence], and potential for bias).

Fig. 1
Fig. 2
Fig. 3

Notes

  1. The main differences between VAMs and growth models concern how precisely estimates are made and whether control variables are included. Unlike the typical VAM, for example, student growth models are intended more simply to measure the growth of similarly matched students to make relativistic comparisons about student growth over time, typically without any additional statistical controls (e.g., for student background variables). Students are, rather, directly and deliberately measured against, or in reference to, the growth levels of their peers, which de facto controls for these other variables.

  2. Regarding teachers’ contract statuses, HISD teachers new to the district are placed on probationary contracts for 1 year if they have more than 5 years of prior teaching experience, or on probationary contracts for 3 years if they are new teachers. Otherwise, teachers are on full contract.

  3. New Mexico’s modified Danielson model consists of four domains (as does the Danielson model): “Domain 1: Planning and Preparation” (which is the same as Danielson), “Domain 2: Creating an Environment for Learning” (which is “Classroom Environment” as per Danielson), “Domain 3: Teaching for Learning” (which is “Instruction” as per Danielson), and “Domain 4: Professionalism” (which is “Professional Responsibilities” as per Danielson). Domains 1 and 4, when combined, yield New Mexico’s Planning, Preparation, and Professionalism (PPP) dimension. It is uncertain how the state adjusted the Danielson model for observational purposes, or whether the state had permission to do so from the Danielson Group (n.d.).

  4. In terms of teacher attendance, the state’s default teacher attendance cut scores were based on days missed as follows: 0–2 days missed = Exemplary, 3–5 days missed = Highly Effective, 6–10 days missed = Effective, 11–13 days missed = Minimally Effective, and 14+ days missed = Ineffective. However, some districts did not include teacher attendance data for various reasons (e.g., “because absences are often attributed to the Family and Medical Leave Act, bereavement, jury duty, military leave, religious leave, professional development, and coaching”) making system fidelity and fairness, again, suspect.

  5. For example, most VAMs require that the scales used to measure growth from 1 year to the next can be appropriately positioned upon vertical, interval scales of equal units. These scales should connect consecutive tests on the same fixed ruler, so to speak, making it possible to measure growth from 1 year to the next across different grade-level tests. Here, for example, a ten-point difference (e.g., between a score of 50 and 60 in fourth grade) on one test should mean the same thing as a ten-point difference (e.g., between a score of 80 and 90 in fifth grade) on a similar test 1 year later. However, the scales of the large-scale standardized achievement tests used in all current value-added systems do not even come close to being vertically aligned, as so often assumed (Baker et al. 2010; Ballou 2004; Braun 2004; Briggs and Betebenner 2009; Ho et al. 2009; Newton et al. 2010; Papay 2010; Sanders et al. 2009). “[E]ven the psychometricians who are responsible for test scaling shy away from making [such an] assumption” (Harris 2009, p. 329).

  6. Interpreting r: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0.0 ≤ r ≤ 0.2 = a very weak correlation, if any at all (Merrigan and Huston 2008).
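The cutoffs in Note 6 can be expressed as a small classifier. This is an illustrative sketch only; the function name and structure are not from the study, and it simply encodes the Merrigan and Huston (2008) bands listed above, applied to the magnitude of r so that negative correlations are classified by their size:

```python
def correlation_strength(r):
    """Classify a Pearson correlation coefficient using the bands
    from Merrigan and Huston (2008): 0.8-1.0 very strong, 0.6-0.8
    strong, 0.4-0.6 moderate, 0.2-0.4 weak, 0.0-0.2 very weak."""
    magnitude = abs(r)  # bands describe the size of r, sign aside
    if magnitude >= 0.8:
        return "very strong"
    elif magnitude >= 0.6:
        return "strong"
    elif magnitude >= 0.4:
        return "moderate"
    elif magnitude >= 0.2:
        return "weak"
    else:
        return "very weak, if any at all"
```

For example, `correlation_strength(0.45)` returns `"moderate"`, and `correlation_strength(-0.25)` returns `"weak"`.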

  7. Ibid.

  8. Ibid.

  9. Ibid.

  10. Ibid.

  11. While “free riders” are not defined in exhibit D, researchers assume the term refers to teachers who obtain something without comparable effort.

References

Author information

Corresponding author

Correspondence to Audrey Amrein-Beardsley.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1. Demographic variables

Table 4 Demographics used in the analyses

Appendix 2. Deviations in scores over time

Table 5 Teachers’ variations in score quintiles over time

Appendix 3. Means of teacher effectiveness measures, per teacher and school subgroups

Table 6 VAM means, per teacher subgroup
Table 7 Observation score means, per teacher subgroup
Table 8 PPP score means, per teacher subgroup
Table 9 Survey score means, per teacher subgroup
Table 10 VAM means, per school subgroup
Table 11 Observation score means, per school subgroup
Table 12 PPP score means, per school subgroup
Table 13 Survey score means, per school subgroup

About this article

Cite this article

Geiger, T.J., Amrein-Beardsley, A. & Holloway, J. Using test scores to evaluate and hold school teachers accountable in New Mexico. Educ Asse Eval Acc 32, 187–235 (2020). https://doi.org/10.1007/s11092-020-09324-w

