Abstract
In this study, researchers compared the concordance of teacher-level effectiveness ratings derived via six common generalized value-added model (VAM) approaches: (1) a student growth percentile (SGP) model, (2) a value-added linear regression model (VALRM), (3) a value-added hierarchical linear model (VAHLM), (4) a simple difference (gain) score model, (5) a rubric-based performance level (growth) model, and (6) a simple criterion (percent passing) model. The study sample included fourth- to sixth-grade teachers employed in a large, suburban school district who taught the same sets of students, at the same time, and for whom a consistent set of achievement measures and background variables was available. Findings indicate that ratings differed significantly and substantively depending upon the methodological approach used. These findings accordingly bring into question the validity of inferences based on such estimates, especially when high-stakes decisions are made about teachers based on estimates derived via different, albeit popular, methods across school districts and states.
Notes
VAMs are designed to isolate and measure teachers’ alleged contributions to student achievement on large-scale standardized achievement tests as groups of students move from one grade level to the next. VAMs are, accordingly, used to help objectively compute the differences between students’ composite test scores from year-to-year, with value-added being calculated as the deviations between predicted and actual growth (including random and systematic error). Differences in growth are to be compared to “similar” coefficients of “similar” teachers in “similar” districts at “similar” times, after which teachers are positioned into their respective and descriptive categories of effectiveness (e.g., highly effective, effective, ineffective, highly ineffective).
The main differences between VAMs and growth models are how precisely estimates are made and whether control variables are included. Unlike the typical VAM, for example, the SGP model is more simply intended to measure the growth of similarly matched students to make relativistic comparisons about student growth over time, without any additional statistical controls (e.g., for student background variables). Students are, rather, directly and deliberately measured against, or in reference to, the growth levels of their peers, which de facto controls for these other variables. Thereafter, determinations are made as to whether students increase, maintain, or decrease in growth percentile rankings as compared to their academically similar peers. Accordingly, researchers refer to both models as generalized VAMs throughout the rest of this manuscript unless distinctions between growth models and VAMs are needed.
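The relativistic comparison at the heart of the SGP approach can be illustrated with a minimal sketch. The data, peer-band width, and function name below are hypothetical; operational SGPs are estimated via quantile regression (e.g., with the R `SGP` package per Betebenner 2011), not by the simple banding shown here.

```python
# Minimal illustration of the growth-percentile idea: rank a student's
# current score against "academic peers" with similar prior scores.
def growth_percentile(prior, current, peers, band_width=5):
    """Percentile rank of `current` among peers whose prior score falls
    within `band_width` points of this student's prior score."""
    band = [c for p, c in peers if abs(p - prior) <= band_width]
    if not band:
        return None
    below = sum(1 for c in band if c < current)
    return round(100 * below / len(band))

# Hypothetical (prior, current) scale-score pairs for a peer group
peers = [(400, 410), (402, 395), (398, 420), (401, 405), (399, 415)]
print(growth_percentile(400, 412, peers))  # → 60: grew more than most similar peers
```

The student's standing is expressed entirely relative to similarly matched peers, which is how the approach avoids explicit statistical controls for background variables.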
The SGP model is also used or endorsed statewide in the states of Colorado, Hawaii, Indiana, Massachusetts, Mississippi, Nevada, New Jersey, New York, Rhode Island, Virginia, and West Virginia (Collins and Amrein-Beardsley 2014).
The exact number of students covered by the classroom aggregations differs between the analytic methods. For example, regression techniques use listwise deletion of cases if one or more of the explanatory variables are missing, whereas non-regression techniques only require the presence of two achievement scores in the calculations.
With small enrollments, averaging residual growth scores risks skewing the class aggregate measures; researchers accordingly used medians as the class growth measure.
Researchers’ review of right-hand-side correlations and model diagnostics suggested multicollinearity among the ELL, PHL, and Lunch variables, although researchers placed no burden of precision or interpretation on the estimated parameters of the individual predictor variables, also noting that the use of collinear predictors did not impact overall model performance (Johnston 1972). The outcome of the modeling approach, then, is an estimate of residual achievement expressed in terms of the original scale scores. The model generates an expected score for each student, and the difference between the actual and the expected outcome is the residual value.
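The residual computation described in this note can be sketched as follows. The scores below are hypothetical, and the fit uses only a prior-score predictor; the study's actual models included additional covariates (e.g., ELL, PHL, and lunch status).

```python
import numpy as np

# Fit a simple prior-score model by OLS and express each student's
# residual achievement as actual score minus model-expected score.
prior = np.array([400., 420., 390., 410., 430.])   # hypothetical prior-year scores
actual = np.array([415., 422., 400., 405., 445.])  # hypothetical current-year scores

X = np.column_stack([np.ones_like(prior), prior])  # intercept + prior score
beta, *_ = np.linalg.lstsq(X, actual, rcond=None)  # least-squares fit
expected = X @ beta                                # model-predicted scores
residuals = actual - expected                      # residual value per student

print(np.round(residuals, 1))
```

Because the residuals remain on the original scale-score metric, they can be aggregated (here, via class medians, per the preceding note) into a classroom-level growth measure.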
The contingency table for one grade, one subject, contains 36 cells (6 × 6). Diagonal cells compare identical methods and are therefore excluded. Off-diagonal cells are symmetric, leaving a total of 15 comparative measures per grade per subject.
Fifteen comparative measures per grade per subject, by two subjects, by three grades, for 90 comparisons in total.
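The counts in these two notes follow from simple combinatorics, as this short sketch (with illustrative model labels) confirms: the symmetric, off-diagonal cells of a 6 × 6 contingency table reduce to unordered pairs of methods.

```python
from itertools import combinations

# The six generalized VAM approaches compared in the study
methods = ["SGP", "VALRM", "VAHLM", "Gain", "Rubric", "Criterion"]

# 36 cells minus 6 diagonal cells, halved for symmetry: C(6, 2) = 15
pairs = list(combinations(methods, 2))
print(len(pairs))            # → 15 comparisons per grade per subject

# Across two subjects and three grades
total = len(pairs) * 2 * 3
print(total)                 # → 90 comparisons in total
```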
References
American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Statistical Association (ASA). (2014). ASA statement on using value-added models for educational assessment. Alexandria, VA. Retrieved from http://www.amstat.org/policy/pdfs/asa_vam_statement.pdf.
Amrein-Beardsley, A., & Holloway, J. (2017). Value-added models for teacher evaluation and accountability: Commonsense assumptions. Educational Policy, 1–27. https://doi.org/10.1177/0895904817719519.
Anagnostopoulos, D., Rutledge, S. A., & Jacobsen, R. (2013). The infrastructure of accountability: Data use and the transformation of American education. Cambridge: Harvard Education Press.
Arizona Department of Education (ADE) (2009). AIMS math technical report 2009. Retrieved from http://www.azed.gov/standards-development-assessment/files/2011/12/aimsmathfieldtesttechreport2009.pdf.
Arizona Department of Education (ADE) (2011). AIMS 2011 technical report. Retrieved from http://www.azed.gov/standards-development-assessment/files/2011/12/aims_tech_report_2011_final.pdf.
Ball, S. J. (2012). Politics and policy making in education: Explorations in sociology. London: Routledge.
Ballou, D., Sanders, W. L., & Wright, P. (2004). Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral Statistics, 29(1), 37–65. https://doi.org/10.3102/10769986029001037.
Banchero, S. & Kesmodel, D. (2011). Teachers are put to the test: more states tie tenure, bonuses to new formulas for measuring test scores. The Wall Street Journal. Retrieved from http://online.wsj.com/article/SB10001424053111903895904576544523666669018.html.
Berliner, D. C. (2014). Exogenous variables and value-added assessments: a fatal flaw. Teachers College Record, 116(1).
Berliner, D. (2018). Between Scylla and Charybdis: reflections on and problems associated with the evaluation of teachers in an era of metrification. Education Policy Analysis Archives, 26(54), 1–29. https://doi.org/10.14507/epaa.26.3820.
Betebenner, D. W. (2009). A primer on student growth percentiles. Dover: The Center for Assessment. Retrieved from http://www.ksde.org/LinkClick.aspx?fileticket=XmFRiNlYbyc%3d&tabid=1646&mid=10217.
Betebenner, D.W. (2011). Package ‘SGP.’ Retrieved from https://cran.r-project.org/web/packages/SGP/SGP.pdf.
Bill & Melinda Gates Foundation. (2010). Learning about teaching: Initial findings from the measures of effective teaching project. Seattle, WA. Retrieved from http://www.gatesfoundation.org/college-ready-education/Documents/preliminary-findings-research-paper.pdf.
Bill & Melinda Gates Foundation (2013). Ensuring fair and reliable measures of effective teaching: Culminating findings from the MET project’s three-year study. Seattle, WA. Retrieved from http://www.gatesfoundation.org/press-releases/Pages/MET-Announcment.aspx.
Braun, H. I. (2005). Using student progress to evaluate teachers: a primer on value-added models. Princeton: Educational Testing Service. Retrieved from http://www.ets.org/Media/Research/pdf/PICVAS.pdf.
Braun, H., Goldschmidt, P., McCaffrey, D., & Lissitz, R. (2012). Graduate student council Division D fireside chat: VA modeling in educational research and evaluation. Paper Presented at Annual Conference of the American Educational Research Association (AERA), Vancouver, Canada.
Briggs, D. C., & Betebenner, D. (2009). Is growth in student achievement scale dependent? Paper presented at the annual meeting of the National Council for Measurement in Education (NCME), San Diego, CA.
Chin, M., & Goldhaber, D. (2015). Exploring explanations for the “weak” relationship between value added and observation-based measures of teacher performance. Cambridge, MA: Center for Education Policy Research (CEPR), Harvard University. Retrieved from http://cepr.harvard.edu/files/cepr/files/sree2015_simulation_working_paper.pdf?m=1436541369.
Close, K., Amrein-Beardsley, A., & Collins, C. (2018). State-level assessments and teacher evaluation systems after the passage of the Every Student Succeeds Act: Some steps in the right direction. Boulder, CO: National Education Policy Center (NEPC). Retrieved from http://nepc.colorado.edu/publication/stateassessment.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale: Lawrence Erlbaum Associates.
Collins, C. (2014). Houston, we have a problem: teachers find no value in the SAS Education Value-Added Assessment System (EVAAS®). Education Policy Analysis Archives. Retrieved from http://epaa.asu.edu/ojs/article/view/1594.
Collins, C., & Amrein-Beardsley, A. (2014). Putting growth and value-added models on the map: A national overview. Teachers College Record, 116(1). Retrieved from http://www.tcrecord.org/Content.asp?ContentId=17291.
Corcoran, S. P., Jennings, J. L., & Beveridge, A. A. (2011). Teacher effectiveness on high- and low-stakes tests. New York: New York University. Retrieved from https://files.nyu.edu/sc129/public/papers/corcoran_jennings_beveridge_2011_wkg_teacher_effects.pdf.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. https://doi.org/10.1037/h0040957.
Curtis, R. (2011). District of Columbia Public Schools: Defining instructional expectations and aligning accountability and support. Washington, D.C.: The Aspen Institute. Retrieved from www.nctq.org/docs/Impact_1_15579.pdf.
Denby, D. (2012). Public defender: Diane Ravitch takes on a movement. The New Yorker. Retrieved from http://www.newyorker.com/reporting/2012/11/19/121119fa_fact_denby.
Doherty, K. M., & Jacobs, S. (2015). State of the states 2015: Evaluating teaching, leading and learning. Washington DC: National Council on Teacher Quality (NCTQ). Retrieved from http://www.nctq.org/dmsView/StateofStates2015.
Duncan, A. (2009). Teacher preparation: Reforming the uncertain profession. Retrieved from http://www.ed.gov/news/speeches/2009/10/10222009.html.
Duncan, A. (2011). Winning the future with education: Responsibility, reform and results. Testimony given to the U.S. Congress, Washington, DC. Retrieved from http://www.ed.gov/news/speeches/winning-future-education-responsibility-reform-and-results.
Every Student Succeeds Act (ESSA) of 2015, Pub. L. No. 114-95, § 129 Stat. 1802. (2016). Retrieved from https://www.gpo.gov/fdsys/pkg/BILLS-114s1177enr/pdf/BILLS-114s1177enr.pdf.
Felton, E. (2016). Southern lawmakers reconsidering role of test scores in teacher evaluations. Education Week. Retrieved from http://blogs.edweek.org/edweek/teacherbeat/2016/03/reconsidering_test_scores_in_teacher_evaluations.html.
Ferguson, G. A., & Takane, Y. (1989). Statistical analysis in psychology and education (6th ed.). New York: McGraw-Hill.
Freed, M. N., Ryan, J. M., & Hess, R. K. (1991). Handbook of statistical procedures and their computer applications to education and the behavioral sciences. New York: Macmillan Publishing Company.
Gabriel, R., & Lester, J. N. (2013). Sentinels guarding the grail: value-added measurement and the quest for education reform. Education Policy Analysis Archives, 21(9), 1–30. Retrieved from http://epaa.asu.edu/ojs/article/view/1165.
Glazerman, S. M., & Potamites, L. (2011). False performance gains: a critique of successive cohort indicators. Washington, DC: Mathematica Policy Research. Retrieved from www.mathematica-mpr.com/publications/pdfs/.../False_Perf.pdf.
Goldhaber, D., Walch, J., & Gabele, B. (2014). Does the model matter? Exploring the relationship between different student achievement-based teacher assessments. Statistics and Public Policy, 1(1), 28–39. https://doi.org/10.1080/2330443x.2013.856169.
Goldschmidt, P., Choi, K., & Beaudoin, J. B. (2012, February). Growth model comparison study: Practical implications of alternative models for evaluating school performance. Technical Issues in Large-Scale Assessment State Collaborative on Assessment and Student Standards. Council of Chief State School Officers.
Graue, M. E., Delaney, K. K., & Karch, A. S. (2013). Ecologies of education quality. Education Policy Analysis Archives, 21(8), 1–36. Retrieved from http://epaa.asu.edu/ojs/article/view/1163.
Grek, S., & Ozga, J. (2010). Re-inventing public education: the new role of knowledge in education policy making. Public Policy and Administration, 25(3), 271–288. https://doi.org/10.1177/0952076709356870.
Grossman, P., Cohen, J., Ronfeldt, M., & Brown, L. (2014). The test matters: the relationship between classroom observation scores and teacher value added on multiple types of assessment. Educational Researcher, 43(6), 293–303. https://doi.org/10.3102/0013189X14544542.
Harris, D. N. (2011). Value-added measures in education: What every educator needs to know. Cambridge: Harvard Education Press.
Harris, D. N., & Sass, T. R. (2006). Value-added models and the measurement of teacher quality. Tallahassee: Florida Department of Education. Retrieved from http://itp.wceruw.org/vam/IES_Harris_Sass_EPF_Value-added_14_Stanford.pdf.
Hill, H. C., Kapitula, L., & Umlan, K. (2011). A validity argument approach to evaluating teacher value-added scores. American Educational Research Journal, 48(3), 794–831. https://doi.org/10.3102/0002831210387916.
Ho, A. D. (2009). The dependence of growth model results on proficiency cut scores. Educational Measurement Issues and Practice, 28(4), 15–26. https://doi.org/10.1111/j.1745-3992.2009.00159.x.
Hursh, D. (2007). Assessing No Child Left Behind and the rise of neoliberal education policies. American Educational Research Journal, 44(3), 493–518. https://doi.org/10.3102/0002831207306764.
Jacob, B. A., & Lefgren, L. (2005). Principals as agents: Subjective performance measurement in education. Cambridge: National Bureau of Economic Research (NBER). Retrieved from www.nber.org/papers/w11463.
Johnson, M., Lipscomb, S., & Gill, B. (2013). Sensitivity of teacher value-added estimates to student and peer control variables. Journal of Research on Educational Effectiveness, 8(1), 60–83. https://doi.org/10.1080/19345747.2014.967898.
Johnston, J. (1972). Econometric methods (2nd ed.). New York: McGraw-Hill.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000.
Kennedy, M. M. (2010). Attribution error and the quest for teacher quality. Educational Researcher, 39(8), 591–598. https://doi.org/10.3102/0013189X10390804.
Kersting, N. B., Chen, M., & Stigler, J. W. (2013). Value-added teacher estimates as part of teacher evaluations: exploring the effects of data and model specifications on the stability of teacher value-added scores. Education Policy Analysis Archives, 21(7), 1–39. Retrieved from http://epaa.asu.edu/ojs/article/view/1167.
Kimball, S. M., White, B., Milanowski, A. T., & Borman, G. (2004). Examining the relationship between teacher evaluation and student assessment results in Washoe County. Peabody Journal of Education, 79(4), 54–78. https://doi.org/10.1207/s15327930pje7904_4.
Kupermintz, H. (2003). Teacher effects and teacher effectiveness: a validity investigation of the Tennessee Value-Added Assessment System. Educational Evaluation and Policy Analysis, 25, 287–298. https://doi.org/10.3102/01623737025003287.
Kyriakides, L. (2005). Drawing from teacher effectiveness research and research into teacher interpersonal behaviour to establish a teacher evaluation system: a study on the use of student ratings to evaluate teacher behaviour. Journal of Classroom Instruction, 40(2), 44–66.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310.
Lingard, B. (2011). Policy as numbers: ac/counting for educational research. The Australian Educational Researcher, 38(4), 355–382.
Lingard, B., Martino, W., & Rezai-Rashti, G. (2013). Testing regimes, accountabilities and education policy: commensurate global and national developments. Journal of Education Policy, 28(5), 539–556. https://doi.org/10.1080/02680939.2013.820042.
Lockwood, J., McCaffrey, D., Hamilton, L., Stetcher, B., Le, V. N., & Martinez, J. (2007). The sensitivity of value-added teacher effect estimates to different mathematics achievement measures. Journal of Educational Measurement, 44(1), 47–67. https://doi.org/10.1111/j.1745-3984.2007.00026.x.
Loeb, S., Soland, J., & Fox, J. (2015). Is a good teacher a good teacher for all? Comparing value-added of teachers with English learners and non-English learners. Educational Evaluation and Policy Analysis, 36(4), 457–475. https://doi.org/10.3102/0162373714527788.
Mathews, J. (2013). Hidden power of teacher awards. The Washington Post. Retrieved from http://www.washingtonpost.com/blogs/class-struggle/post/hidden-power-of-teacher-awards/2013/04/08/15b7afcc-9e66-11e2-9a79-eb5280c81c63_blog.html.
Mathis, W. (2011). Review of “Florida Formula for Student Achievement: Lessons for the Nation.” Boulder: National Education Policy Center. Retrieved from http://nepc.colorado.edu/thinktank/review-florida-formula.
McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., & Hamilton, L. S. (2003). Evaluating value-added models for teacher accountability. Santa Monica: Rand Corporation.
McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., & Hamilton, L. (2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29(1), 67–101. RAND reprint available at http://www.rand.org/pubs/reprints/2005/RAND_RP1165.pdf.
Messick, S. (1975). The standard problem: meaning and values in measurement and evaluation. American Psychologist, 30, 955–966. https://doi.org/10.1037//0003-066x.30.10.955.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012–1027. https://doi.org/10.1037//0003-066x.35.11.1012.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education and Macmillan.
Messick, S. (1995). Validity of psychological assessment: validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Milanowski, A., Kimball, S. M., & White, B. (2004). The relationship between standards-based teacher evaluation scores and student achievement: Replication and extensions at three sites. Madison: University of Wisconsin-Madison, Center for Education Research.
Newton, X., Darling-Hammond, L., Haertel, E., & Thomas, E. (2010). Value-added modeling of teacher effectiveness: An exploration of stability across models and contexts. Educational Policy Analysis Archives, 18(23), 1–27. Retrieved from http://epaa.asu.edu/ojs/article/view/810.
Nichols, S. L., & Berliner, D. C. (2007). Collateral damage: How high-stakes testing corrupts America’s schools. Cambridge: Harvard Education Press.
Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26(3), 237–257. https://doi.org/10.3102/01623737026003237.
Ozga, J. (2016). Trust in numbers? Digital education governance and the inspection process. European Educational Research Journal, 15(1), 69–81. https://doi.org/10.1177/1474904115616629.
Papay, J. P. (2010). Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1), 163–193. https://doi.org/10.3102/0002831210362589.
Pauken, T. (2013). Texas vs. No Child Left Behind. The American Conservative. Retrieved from http://www.theamericanconservative.com/articles/texas-vs-no-child-left-behind/
Polikoff, M. S., & Porter, A. C. (2014). Instructional alignment as a measure of teaching quality. Education Evaluation and Policy Analysis, 36(4), 399–416. https://doi.org/10.3102/0162373714531851.
Porter, T. M. (1996). Trust in numbers: The pursuit of objectivity in science and public life. Princeton: Princeton University Press.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Application and data analysis methods (2nd ed.). Thousand Oaks: Sage Publications, Inc.
Reynolds, C. R., Livingston, R. B., & Wilson, V. (2009). Measurement and assessment in education (2nd ed.). Upper Saddle River: Pearson Education, Inc.
Rhee, M. (2011). The evidence is clear: Test scores must accurately reflect students' learning. The Huffington Post. Retrieved from http://www.huffingtonpost.com/michelle-rhee/michelle-rhee-dc-schools_b_845286.html.
Rizvi, F., & Lingard, B. (2010). Globalizing education policy. London: Routledge.
Rothstein, J., & Mathis, W. J. (2013). Review of two culminating reports from the MET Project. Boulder: National Education Policy Center (NEPC). Retrieved from http://nepc.colorado.edu/thinktank/review-MET-final-2013.
Schafer, W. D., Lissitz, R. W., Zhu, X., Zhang, Y., Hou, X., & Li, Y. (2012). Evaluating teachers and schools using student growth models. Practical Assessment, Research & Evaluation, 17(17). Retrieved from pareonline.net/getvn.asp?v=17&n=17.
Smith, W. C. (2016). The global testing culture: Shaping education policy, perceptions, and practice. Oxford: Symposium Books.
Smith, W. C., & Kubacka, K. (2017). The emphasis of student test scores in teacher appraisal systems. Education Policy Analysis Archives, 25(86). https://doi.org/10.14507/epaa.25.2889.
Sørensen, T. B. (2016). Value-added measurement or modelling (VAM). Education International Discussion Paper. Retrieved from http://download.eiie.org/Docs/WebDepot/2016_EI_VAM_EN_final_Web.pdf.
Stevens, J. (1996). Applied multivariate statistics for the social sciences. Mahwah: Lawrence Erlbaum Associates, Inc.
Tekwe, C. D., Carter, R. L., Ma, C., Algina, J., Lucas, M. E., Roth, J., Arite, M., Fisher, T., & Resnick, M. B. (2004). An empirical comparison of statistical models for value-added assessment of school performance. Journal of Educational and Behavioral Statistics, 29(1), 11–36. https://doi.org/10.3102/10769986029001011.
Timar, T. B., & Maxwell-Jolly, J. (Eds.). (2012). Narrowing the achievement gap: Perspectives and strategies for challenging times. Cambridge: Harvard Education Press.
Verger, A., & Parcerisa, L. (2017). A difficult relationship. Accountability policies and teachers: International evidence and key premises for future research. In M. Akiba & G. LeTendre (Eds.), International handbook of teacher quality and policy (pp. 241–254). New York: Routledge.
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The widget effect: Our national failure to acknowledge and act on differences in teacher effectiveness. New York: The New Teacher Project (TNTP). Retrieved from http://tntp.org/assets/documents/TheWidgetEffect_2nd_ed.pdf.
Cite this article
Sloat, E., Amrein-Beardsley, A. & Holloway, J. Different teacher-level effectiveness estimates, different results: inter-model concordance across six generalized value-added models (VAMs). Educ Asse Eval Acc 30, 367–397 (2018). https://doi.org/10.1007/s11092-018-9283-7