Different teacher-level effectiveness estimates, different results: inter-model concordance across six generalized value-added models (VAMs)

  • Edward Sloat
  • Audrey Amrein-Beardsley
  • Jessica Holloway


In this study, researchers compared the concordance of teacher-level effectiveness ratings derived from six common generalized value-added model (VAM) approaches: (1) a student growth percentile (SGP) model, (2) a value-added linear regression model (VALRM), (3) a value-added hierarchical linear model (VAHLM), (4) a simple difference (gain) score model, (5) a rubric-based performance level (growth) model, and (6) a simple criterion (percent passing) model. The study sample included fourth- to sixth-grade teachers employed in a large, suburban school district who taught the same sets of students, at the same time, and for whom a consistent set of achievement measures and background variables was available. Findings indicate that ratings differed significantly and substantively depending on the methodological approach used. These findings call into question the validity of inferences based on such estimates, especially when high-stakes decisions about teachers rest on estimates produced by different, albeit popular, methods across school districts and states.
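To make the idea of inter-model concordance concrete, here is a minimal sketch in Python that rates the same teachers with two of the simpler approaches named above (a gain score and a percent-passing criterion) and then checks how well the two sets of ratings agree using a Spearman rank correlation. It is not the authors' analysis: the student records, the cut score, and every function name are hypothetical, and Spearman's rho is just one plausible agreement statistic chosen for illustration.

```python
from statistics import mean

# Hypothetical records: (teacher, prior-year score, current-year score).
# These values are invented for illustration and are not from the study.
records = [
    ("T1", 400, 430), ("T1", 450, 455), ("T1", 380, 420),
    ("T2", 500, 505), ("T2", 520, 530), ("T2", 480, 470),
    ("T3", 420, 470), ("T3", 390, 400), ("T3", 410, 445),
    ("T4", 460, 480), ("T4", 300, 390), ("T4", 310, 400),
]
PASSING_CUT = 450  # hypothetical proficiency cut score


def by_teacher(rows):
    """Group (prior, current) score pairs by teacher."""
    groups = {}
    for teacher, prior, current in rows:
        groups.setdefault(teacher, []).append((prior, current))
    return groups


def gain_score(students):
    """Simple difference model: mean of current minus prior score."""
    return mean(current - prior for prior, current in students)


def percent_passing(students):
    """Simple criterion model: percent of students at or above the cut."""
    passed = sum(current >= PASSING_CUT for _, current in students)
    return 100.0 * passed / len(students)


def ranks(values):
    """Ranks with 1 = lowest; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


groups = by_teacher(records)
teachers = sorted(groups)
gains = [gain_score(groups[t]) for t in teachers]
passing = [percent_passing(groups[t]) for t in teachers]
rho = spearman(gains, passing)

for t, g, p in zip(teachers, gains, passing):
    print(f"{t}: mean gain = {g:.1f}, percent passing = {p:.1f}")
print(f"Spearman rho between the two sets of ratings: {rho:.2f}")
```

In this toy data the teacher with the largest mean gain is not the teacher with the highest pass rate, so the two ratings rank the same teachers differently. That is the kind of disagreement a concordance analysis is meant to detect.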


Keywords: Teacher accountability; Teacher effectiveness; Teacher evaluation; Teacher quality; Validity; Value-added models

Supplementary material

ESM 1 (DOCX 30 kb)



Copyright information

© Springer Nature B.V. 2018
Corrected publication: August 2018

Authors and Affiliations

  1. Mary Lou Fulton Teachers College, Arizona State University, Tempe, USA
  2. Research for Educational Impact (REDI) Centre, Deakin University, Burwood, Australia
