
Different teacher-level effectiveness estimates, different results: inter-model concordance across six generalized value-added models (VAMs)

Published in Educational Assessment, Evaluation and Accountability

Abstract

In this study, researchers compared the concordance of teacher-level effectiveness ratings derived via six common generalized value-added model (VAM) approaches: (1) a student growth percentile (SGP) model, (2) a value-added linear regression model (VALRM), (3) a value-added hierarchical linear model (VAHLM), (4) a simple difference (gain) score model, (5) a rubric-based performance level (growth) model, and (6) a simple criterion (percent passing) model. The study sample included fourth- to sixth-grade teachers employed in a large, suburban school district who taught the same sets of students, at the same time, and for whom a consistent set of achievement measures and background variables was available. Findings indicate that ratings significantly and substantively differed depending upon the methodological approach used. Findings, accordingly, bring into question the validity of inferences based on such estimates, especially when high-stakes decisions about teachers are based on estimates derived via different, albeit popular, methods across school districts and states.


Notes

  1. VAMs are designed to isolate and measure teachers’ alleged contributions to student achievement on large-scale standardized achievement tests as groups of students move from one grade level to the next. VAMs are, accordingly, used to help objectively compute the differences between students’ composite test scores from year to year, with value-added being calculated as the deviations between predicted and actual growth (including random and systematic error). Differences in growth are to be compared to “similar” coefficients of “similar” teachers in “similar” districts at “similar” times, after which teachers are positioned into their respective and descriptive categories of effectiveness (e.g., highly effective, effective, ineffective, highly ineffective).

  2. The main differences between VAMs and growth models are how precisely estimates are made and whether control variables are included. Unlike the typical VAM, for example, the SGP model is more simply intended to measure the growth of similarly matched students to make relativistic comparisons about student growth over time, without any additional statistical controls (e.g., for student background variables). Students are, rather, directly and deliberately measured against, or in reference to, the growth levels of their peers, which de facto controls for these other variables. Thereafter, determinations are made as to whether students increase, maintain, or decrease in growth percentile rankings as compared to their academically similar peers. Accordingly, researchers refer to both models as generalized VAMs throughout the rest of this manuscript unless distinctions between growth models and VAMs are needed.

  3. The SGP model is also used or endorsed statewide in the states of Colorado, Hawaii, Indiana, Massachusetts, Mississippi, Nevada, New Jersey, New York, Rhode Island, Virginia, and West Virginia (Collins and Amrein-Beardsley 2014).

  4. The exact number of students covered by the classroom aggregations differs between the analytic methods. For example, regression techniques use listwise deletion of cases if one or more of the explanatory variables are missing, while non-regression techniques require only the presence of two achievement scores in the calculations.

  5. With small enrollments, averaging residual growth scores risks skewing the class aggregate measures. Researchers accordingly used medians as the class growth measure.

  6. Researchers’ review of right-hand-side correlations and model diagnostics suggested multicollinearity among the ELL, PHL, and Lunch variables, although researchers placed no burden of precision or interpretation on the estimated parameters of the individual predictor variables, also noting that the use of collinear predictors did not impact overall model performance (Johnston 1972). The outcome of the modeling approach, then, is an estimate of residual achievement expressed in terms of the original scale scores. The model generates an expected score for each student, and the difference between the actual and the expected outcome is the residual value.
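As a minimal sketch of the residual-achievement idea described in the note above — using hypothetical data and a single prior-score predictor rather than the study's full set of covariates — the model produces an expected score for each student, the residual is the actual minus the expected score in original scale-score units, and (per the preceding note on small enrollments) class medians serve as the teacher-level growth measure:

```python
from statistics import median

# Hypothetical (teacher, prior_score, current_score) records.
students = [
    ("T1", 500, 520), ("T1", 480, 470), ("T1", 510, 540),
    ("T2", 505, 500), ("T2", 495, 515), ("T2", 520, 510),
]

# Fit a simple one-predictor regression of current on prior scores.
xs = [s[1] for s in students]
ys = [s[2] for s in students]
n = len(students)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Residual achievement: actual score minus model-expected score.
residuals = {}
for teacher, prior, actual in students:
    expected = intercept + slope * prior
    residuals.setdefault(teacher, []).append(actual - expected)

# Median residual per class as the aggregate growth measure.
class_growth = {t: median(r) for t, r in residuals.items()}
```

A positive class median suggests the class outperformed model expectations; a negative one, the reverse. This is only an illustration of the residual logic, not the paper's actual specification.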

  7. The contingency table for one grade and one subject contains 36 cells (6 × 6). The six diagonal cells compare identical methods and are therefore excluded, leaving 30; because the off-diagonal cells are symmetric, this yields a total of 15 comparative measures per grade per subject.

  8. Fifteen per grade per subject, by two subjects, by three grades, for 90 comparative measures in total.
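The counting in the two notes above can be checked in a few lines, using only the figures given there:

```python
# Unique inter-model comparisons per contingency table (notes 7 and 8).
n_models = 6
cells = n_models * n_models          # 36 cells per table
off_diagonal = cells - n_models      # 30 after excluding the diagonal
unique_pairs = off_diagonal // 2     # 15, since off-diagonal cells are symmetric

# Across two subjects and three grades.
subjects, grades = 2, 3
total_comparisons = unique_pairs * subjects * grades  # 90 in total
```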



Corresponding author

Correspondence to Audrey Amrein-Beardsley.

Electronic supplementary material

ESM 1

(DOCX 30 kb)


About this article


Cite this article

Sloat, E., Amrein-Beardsley, A. & Holloway, J. Different teacher-level effectiveness estimates, different results: inter-model concordance across six generalized value-added models (VAMs). Educ Asse Eval Acc 30, 367–397 (2018). https://doi.org/10.1007/s11092-018-9283-7

