Abstract
Contemporary teacher evaluation policies are built upon multiple-measure systems comprising, primarily, teacher-level value-added and observational estimates. However, researchers have not yet investigated how using these indicators to evaluate teachers might distort validity, especially when one indicator seemingly trumps, or is trusted over, the other. Accordingly, in this conceptual piece, we introduce and begin to establish evidence for three conceptual terms related to the validity of the inferences derived via these two measures in the context of teacher evaluation: (1) artificial inflation, (2) artificial deflation, and (3) artificial conflation. We define these terms by illustrating how those with the power to evaluate teachers (e.g., principals) within such contemporary evaluation systems might (1) artificially inflate or (2) artificially deflate observational estimates when used alongside their value-added counterparts, or (3) artificially conflate both estimates to purposefully (albeit perhaps naïvely) exaggerate perceptions of validity.
Notes
The main differences between VAMs and growth models are how precisely estimates are made and whether control variables are included. Unlike the typical VAM, for example, student growth models are more simply intended to measure the growth of similarly matched students in order to make relativistic comparisons about student growth over time, typically without any additional statistical controls (e.g., for student background variables). Students are, rather, directly and deliberately measured against, or in reference to, the growth levels of their peers, which de facto controls for these other variables. See also, for example, Betebenner (2009).
In terms of observational measures, what is known is that the observational systems used for teacher evaluation purposes are more common across states’ teacher evaluation systems than in years prior (Author(s) 2017). What is also known is that at face value (e.g., face validity), observational systems are valuable to the extent that observational outputs might be considered valid estimates of teacher effectiveness, if and when “(a) the observed performances can be considered a representative sample from the domain [e.g., capturing teacher effectiveness], (b) the performances are evaluated appropriately and fairly, and (c) the sample is large enough to control sampling error” (Guion 1977, as cited in Kane 2006). What is becoming increasingly evident in the literature, however, is that beyond observational systems’ prima facie qualities, they are now confronting their own sets of empirical issues. Such validity-related issues include, but are not limited to, whether the observational systems being used are psychometrically sound for their intended purposes (i.e., to yield objective data about teachers’ effectiveness in practice), how output from observational systems might be biased by such factors as the types of students whom a teacher teaches, how a teacher’s gender interplays with his/her students’ gender(s), and the like (Author(s) 2017; Bailey et al. 2016; Steinberg and Garrett 2016; Whitehurst et al. 2014).
While researchers who have investigated the reliability of teachers’ observational measures have expressed concerns about their reliability as well, the reliability coefficients of observational measures are relatively (and arguably) much higher than those of their VAM-based counterparts (e.g., r > 0.65 versus 0.20 < r < 0.50, respectively; see, for example, Ho and Kane 2013; see also Praetorius et al. 2014; van der Lans et al. 2016). It is important to note that these reliability coefficients pertain to both VAM-based and observational measurements taken over time, versus those captured via single observations. Indeed, most if not all researchers in this area emphasize the need for multiple VAM-based and observational estimates over time in order to secure the highest possible levels of reliability (Kane and Staiger 2012; see also Hill et al. 2012; Praetorius et al. 2014; van der Lans et al. 2016). Accordingly, many states and districts have increased the number of teacher observations conducted per year (see, for example, Close et al. 2019a; Reddy et al. 2019), although increasing the number of VAM-based estimates is much more difficult given that the tests most often used by states still using VAMs are state-level, large-scale assessments administered only once per year (Close et al. 2019a).
“Bridging and buffering” is defined in Honig and Hatch (2004, pp. 17, 23–27) as occurring when schools, or school principals in this case, use state- or school-wide goals and strategies as the basis for their decisions about how, or the extent to which, they might productively engage or disengage external demands. More specifically, bridging activities are noted when principals selectively engage with external demands in order to inform and enhance, in this case, policy implementation, while buffering involves not the blind dismissal of external demands but their strategic engagement in limited to very limited ways, so as not to derail principal decision-making.
It should be noted here, though, that this type of bell curve derived via teachers’ VAM-based scores is not uncommon, but rather is often an artifact of the many statistical models applied to analyze teachers’ value-added effects (e.g., linear regression or multilevel regression models), whereby it is common to center scores around a sample mean or average (see, for example, Winters and Cowen 2013). When VAM-based scores by default yield bell curves based on normative scores, though, what this means for the relatively subjective observational counterparts with which they are to be combined is only beginning to be observed (e.g., the potential to artificially deflate teachers’ observational scores to match or better fit the normal curves derived via their more objective counterparts).
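To make concrete why mean-centered models yield bell curves by construction, consider the following minimal simulation (a hypothetical sketch, not any state’s actual VAM): each simulated teacher’s “effect” is estimated as the mean residual (observed minus model-predicted score) across that teacher’s students. Even when teachers have no true differences in effectiveness, the estimates distribute roughly normally around the sample mean of zero:

```python
import random
import statistics

random.seed(42)

# Hypothetical setup for illustration only: 500 teachers, 25 students each.
# Each teacher's "value-added" estimate is the mean student-level residual.
teacher_effects = []
for _ in range(500):
    residuals = [random.gauss(0, 10) for _ in range(25)]  # pure student-level noise
    teacher_effects.append(statistics.mean(residuals))

# The estimates cluster symmetrically around the sample mean (~0): a bell
# curve emerges by construction, with roughly two-thirds of teachers falling
# within one standard deviation of the mean, as in any normal distribution.
mean_effect = statistics.mean(teacher_effects)
sd_effect = statistics.stdev(teacher_effects)
share_within_1sd = sum(
    abs(e - mean_effect) <= sd_effect for e in teacher_effects
) / len(teacher_effects)
print(round(mean_effect, 2), round(share_within_1sd, 2))
```

Because roughly half of the simulated teachers necessarily land below the mean regardless of their true effectiveness, an evaluator who deflates observational scores to “fit” such a curve is matching an artifact of the model, not an independent signal about teaching quality.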
References
Aaronson, D., Barrow, L., & Sanders, W. (2007). Teachers and student achievement in the Chicago public high schools. Journal of Labor Economics, 25(1), 95–135. https://doi.org/10.1086/508733.
American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Amrein, A. L., & Berliner, D. C. (2002). High-stakes testing, uncertainty, and student learning. Education Policy Analysis Archives, 10(18). https://doi.org/10.14507/epaa.v10n18.2002. Retrieved from http://epaa.asu.edu/epaa/v10n18/.
Amrein-Beardsley, A. (2008). Methodological concerns about the Education Value-Added Assessment System (EVAAS). Educational Researcher, 37(2), 65-75. https://doi.org/10.3102/0013189X08316420.
Amrein-Beardsley, A., & Barnett, J. H. (2012). Working with error and uncertainty to increase measurement validity. Educational Assessment, Evaluation and Accountability, 24(4), 369–379. https://doi.org/10.1007/s11092-012-9146-6.
Amrein-Beardsley, A., & Close, K. (2019b). Teacher-level value-added models (VAMs) on trial: empirical and pragmatic issues of concern across five court cases. Educational Policy, 1–42. https://doi.org/10.1177/0895904819843593.
Anderson, J. (2013). Curious grade for teachers: nearly all pass. The New York Times. Retrieved from http://www.nytimes.com/2013/03/31/education/curious-grade-for-teachers-nearly-all-pass.html.
Araujo, M. C., Carneiro, P., Cruz-Aguayo, Y., & Schady, N. (2016). Teacher quality and learning outcomes in kindergarten. The Quarterly Journal of Economics, 131(3), 1415–1453. https://doi.org/10.1093/qje/qjw016.
Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: a descriptive study in a large urban district. Washington, DC: U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/northeast/pdf/REL_2017189.pdf.
Ballou, D. (2005). Value-added assessment: lessons from Tennessee. In R. W. Lissitz (Ed.), Value-added models in education: theory and application (pp. 272–297). Maple Grove, MN: JAM Press.
Barnett, J. H., Rinthapol, N., & Hudgens, T. (2014). TAP research summary: examining the evidence and impact of TAP. The System for Teacher and Student Advancement. Santa Monica, CA: National Institute for Excellence in Teaching. Retrieved from http://files.eric.ed.gov/fulltext/ED556331.pdf
Betebenner, D. W. (2009). A primer on student growth percentiles. Dover, NH: National Center for the Improvement of Educational Assessment. Retrieved from https://www.gadoe.org/Curriculum-Instruction-and-Assessment/Assessment/Documents/Aprimeronstudentgrowthpercentiles.pdf.
Bill & Melinda Gates Foundation. (2013, January 8). Ensuring fair and reliable measures of effective teaching: Culminating findings from the MET project’s three-year study. Seattle, WA. Retrieved from http://www.gatesfoundation.org/press-releases/Pages/MET-Announcment.aspx
Braun, H. I. (2005). Using student progress to evaluate teachers: a primer on value-added models. Princeton, NJ: Educational Testing Service.
Braun, H. (2015). The value in value-added depends on the ecology. Educational Researcher, 44(2), 127–131. https://doi.org/10.3102/0013189X15576341.
Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 1–16). Westport, CT: American Council on Education.
Brennan, R. L. (2013). Commentary on “validating interpretations and uses of test scores.”. Journal of Educational Measurement, 50(1), 74–83. https://doi.org/10.1111/jedm.12001.
Brown, C. (2014, July 31). Stephen Colbert interview with Campbell Brown. The Colbert Report. New York, NY: Comedy Central. Retrieved from http://www.cc.com/video-clips/2mpwlv/the-colbert-report-campbell-brown
Burgess, K. (2016, September 16). Number of effective teachers keeps dropping. The Albuquerque Journal. Retrieved from https://www.abqjournal.com/846826/nm-teacher-evals-number-of-effective-teachers-keeps-dropping.html
Campbell, D. T. (1976). Assessing the impact of planned social change. Hanover, NH: The Public Affairs Center, Dartmouth College.
Chester, M. D. (2003). Multiple measures and high-stakes decisions: a framework for combining measures. Educational Measurement: Issues and Practice, 22(2), 32–41. https://doi.org/10.1111/j.1745-3992.2003.tb00126.x.
Chetty, R., Friedman, J., & Rockoff, J. (2014a). Measuring the impacts of teachers I: evaluating bias in teacher value-added estimates. American Economic Review, 104(9), 2593–2632. https://doi.org/10.3386/w19423.
Chetty, R., Friedman, J., & Rockoff, J. (2014b). Measuring the impacts of teachers II: teacher value-added and student outcomes in adulthood. American Economic Review, 104(9), 2633–2679. https://doi.org/10.3386/w19424.
Chiang, H., McCullough, M., Lipscomb, S., & Gill, B. (2016). Can student test scores provide useful measures of school principals’ performance? Washington, DC: U.S. Department of Education Retrieved from http://ies.ed.gov/ncee/pubs/2016002/pdf/2016002.pdf.
Chin, M., & Goldhaber, D. (2015). Exploring explanations for the “weak” relationship between value added and observation-based measures of teacher performance. Cambridge, MA: Center for Education Policy Research (CEPR), Harvard University. Retrieved from http://cepr.harvard.edu/files/cepr/files/sree2015_simulation_working_paper.pdf
Close, K., Amrein-Beardsley, A., & Collins, C. (2019). Mapping America’s teacher evaluation plans post ESSA. Phi Delta Kappan. Retrieved from https://www.kappanonline.org/mapping-teacher-evaluation-plans-essa-close-amrein-beardsley-collins/.
Collins, C. (2014). Houston, we have a problem: teachers find no value in the SAS Education Value-Added Assessment System (EVAAS®). Education Policy Analysis Archives, 22(98), 1–42. https://doi.org/10.14507/epaa.v22.1594.
Collins, C., & Amrein-Beardsley, A. (2014). Putting growth and value-added models on the map: a national overview. Teachers College Record, 116(1). Retrieved from http://www.tcrecord.org/Content.asp?ContentId=17291.
Corcoran, S. P. (2010). Can teachers be evaluated by their students’ test scores? Should they be? The use of value-added measures of teacher effectiveness in policy and practice. Providence, RI: Annenberg Institute for School Reform.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
Daly, G., & Kim, L. (2010). A teacher evaluation system that works. Santa Monica, CA: National Institute for Excellence in Teaching (NIET).
Danielson, C. (2012). Observing classroom practice. Educational Leadership, 70(3), 32–37.
Danielson, C. (2016). Charlotte Danielson on rethinking teacher evaluation. Education Week. Retrieved from http://www.edweek.org/ew/articles/2016/04/20/charlotte-danielson-on-rethinking-teacher-evaluation.html?cmp=eml-eb-popyrall+06162016
Danielson, C., & McGreal, T. L. (2000). Teacher evaluation to enhance professional practice. Alexandria, VA: Association for Supervision & Curriculum Development.
Darling-Hammond, L. (2013). Getting teacher evaluation right: what really matters for effectiveness and improvement. New York, NY: Teachers College Press.
Doan, S., Schweig, J. D., & Mihaly, K. (2019). The consistency of composite ratings of teacher effectiveness: evidence from New Mexico. American Educational Research Journal. https://doi.org/10.3102/0002831219841369.
Doherty, K. M., & Jacobs, S. (2015). State of the states: Evaluating teaching, leading and learning. Washington, DC: National Council on Teacher Quality (NCTQ).
Duncan, A. (2011). Winning the future with education: responsibility, reform and results. Washington, DC: Author. Retrieved from http://www.ed.gov/news/speeches/winning-future-education-responsibility-reform-and-results.
Every Student Succeeds Act (ESSA) of 2015, Pub. L. No. 114–95, 129 Stat. 1802 (2015).
Furr, R. M., & Bacharach, V. R. (2013). Psychometrics: an introduction. Los Angeles, CA: SAGE.
Goldhaber, D., & Hansen, M. (2013). Is it just a bad class? Assessing the long-term stability of estimated teacher performance. Economica, 80(319), 589–612. https://doi.org/10.1111/ecca.12002.
Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value-added: principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96–104. https://doi.org/10.3102/0013189X15575031.
Grossman, P., Cohen, J., Ronfeldt, M., & Brown, L. (2014). The test matters: the relationship between classroom observation scores and teacher value added on multiple types of assessment. Educational Researcher, 43(6), 293–303. https://doi.org/10.3102/0013189X14544542.
Gurney, K. (2016). Teachers say it’s getting harder to get a good evaluation. The school district disagrees. Miami Herald. Retrieved from http://www.miamiherald.com/news/local/education/article119791683.html.
Haladyna, T. M., Nolen, S. B., & Haas, N. S. (1991). Raising standardized achievement test scores and the origins of test score pollution. Educational Researcher, 20(5), 2–7. https://doi.org/10.2307/1176395.
Haney, W. (2000). The myth of the Texas miracle in education. Education Policy Analysis Archives, 8(41). https://doi.org/10.14507/epaa.v8n41.2000.
Hanushek, E. (2009). Teacher deselection. In D. Goldhaber & J. Hannaway (Eds.), Creating a new teaching profession (pp. 165–180). Washington, DC: Urban Institute Press.
Harris, D. N. (2011). Value-added measures in education: what every educator needs to know. Cambridge, MA: Harvard Education Press.
Harris, D. N., Ingle, W. K., & Rutledge, S. A. (2014). How teacher evaluation methods matter for accountability: a comparative analysis of teacher effectiveness ratings by principals and teacher value-added measures. American Educational Research Journal, 51(1), 73–112. https://doi.org/10.3102/0002831213517130.
Hill, H. C., Kapitula, L., & Umland, K. (2011). A validity argument approach to evaluating teacher value-added scores. American Educational Research Journal, 48(3), 794–831. https://doi.org/10.3102/0002831210387916.
Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough: teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–64. https://doi.org/10.3102/0013189x12437203.
Ho, A. D., & Kane, T. J. (2013). The reliability of classroom observations by school personnel. Seattle, WA: Bill & Melinda Gates Foundation.
Holmstrom, B., & Milgrom, P. (1991). Multitask principal-agent analyses: incentive contracts, asset ownership, and job design. Journal of Law, Economics, & Organization, 7, 24–52. https://doi.org/10.1093/jleo/7.special_issue.24.
Honig, M. I., & Hatch, T. C. (2004). Crafting coherence: how schools strategically manage multiple, external demands. Educational Researcher, 33(4), 16–30. https://doi.org/10.3102/0013189X033008016.
Houston Independent School District (HISD). (2012). HISD Core Initiative 1: an effective teacher in every classroom, teacher appraisal and development system – year one summary report. Houston, TX.
Houston Independent School District (HISD). (2013). Progress conference briefing. Houston, TX.
Jacob, B. A. (2005). Accountability, incentives and behavior: the impact of high-stakes testing in the Chicago public schools. Journal of Public Economics, 89(5–6), 761–796. https://doi.org/10.3386/w8968.
Jacob, B. A., & Lefgren, L. (2006). When principals rate teachers: the best-and the worst-stand out. Education Next, 2(6), 58–64.
Jacoby, R., Glauberman, N., & Herrnstein, R. J. (1995). The bell curve debate: history, documents, opinions. New York, NY: Times Books.
Jennings, J. L., & Pallas, A. M. (2016). How does value-added data affect teachers? Educational Leadership, 73(8).
Jerald, C. D., & Van Hook, K. (2011). More than measurement: the TAP system’s lessons learned for designing better teacher evaluation systems. Santa Monica, CA: National Institute for Excellence in Teaching (NIET).
Jiang, J. Y., Sporte, S. E., & Luppescu, S. (2015). Teacher perspectives on evaluation reform: Chicago’s REACH students. Educational Researcher, 44(2), 105–116. https://doi.org/10.3102/0013189X15575517.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Washington, DC: The American Council on Education.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000.
Kane, T. J. (2015). Teachers must look in the mirror. The New York Daily News. Retrieved from http://www.nydailynews.com/opinion/thomas-kane-teachers-mirror-article-1.2172662
Kane, M., & Case, S. M. (2004). The reliability and validity of weighted composite scores. Applied Measurement in Education, 17(3), 221–240. https://doi.org/10.1207/s15324818ame1703_1.
Kane, T. J., & Staiger, D. O. (2012). Gathering feedback for teaching: combining high-quality observations with student surveys and achievement gains. Seattle, WA: Bill & Melinda Gates Foundation.
Kane, T. J., McCaffrey, D. F., Miller, T., & Staiger, D. O. (2013). Have we identified effective teachers? Validating measures of effective teaching using random assignment. Seattle, WA: Bill & Melinda Gates Foundation.
Kiewiet de Jonge, C. P., & Nickerson, D. W. (2014). Artificial inflation or deflation? Assessing the item count technique in comparative surveys. Political Behavior, 36(3), 659–682. https://doi.org/10.1007/s11109-013-9249-x.
Koedel, C., & Betts, J. R. (2007). Re-examining the role of teacher quality in the educational production function. Nashville, TN: National Center on Performance Initiatives.
Koedel, C., & Betts, J. R. (2009). Does student sorting invalidate value-added models of teacher effectiveness? An extended analysis of the Rothstein critique (Working paper 2009-01). San Diego, CA: National Bureau of Economic Research. Retrieved from https://economics.missouri.edu/working-papers/2009/wp0902_koedel.pdf
Koedel, C., Mihaly, K., & Rockoff, J. E. (2015). Value-added modeling: a review. Economics of Education Review, 47, 180–195. https://doi.org/10.1016/j.econedurev.2015.01.006.
Koretz, D. (2017). The testing charade: pretending to make schools better. Chicago, IL: University of Chicago Press.
Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the Widget Effect: teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234–249. https://doi.org/10.3102/0013189X17718797.
Martínez, J. F., Schweig, J., & Goldschmidt, P. (2016). Approaches for combining multiple measures of teacher performance: reliability, validity, and implications for evaluation policy. Educational Evaluation and Policy Analysis, 38(4), 738–756. https://doi.org/10.3102/0162373716666166.
Marzano, R. J., & Toth, M. D. (2013). Teacher evaluation that makes a difference: a new model for teacher growth and student achievement. Alexandria, VA: Association for Supervision & Curriculum Development.
McCaffrey, D. F., Sass, T. R., Lockwood, J. R., & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance and Policy, 4(4), 572–606. https://doi.org/10.1162/edfp.2009.4.4.572.
Mellon, E. (2010, January 14). HISD moves ahead on dismissal policy: In the past, teachers were rarely let go over poor performance, data show. The Houston Chronicle. Retrieved from http://www.chron.com/disp/story.mpl/metropolitan/6816752.html
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35(11), 1012–1027. https://doi.org/10.1037//0003-066x.35.11.1012.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–85). New York, NY: American Council on Education.
Messick, S. (1990). Validity of test interpretation and use. Princeton, NJ: Educational Testing Service.
Messick, S. (1995). Validity of psychological assessment: validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037//0003-066x.50.9.741.
Mihaly, K., McCaffrey, D. F., Staiger, D. O., & Lockwood, J. R. (2013). A composite estimator of effective teaching. Seattle, WA: Bill & Melinda Gates Foundation.
Nelson, F. H. (2011). A guide for developing growth models for teacher development and evaluation. Paper presented at the Annual Conference of the American Educational Research Association (AERA), New Orleans, LA.
Nichols, S. L., & Berliner, D. C. (2007). Collateral damage: how high-stakes testing corrupts America’s schools. Cambridge, MA: Harvard Education Press.
Organisation for Economic Co-operation and Development (OECD). (2008). Measuring improvements in learning outcomes: best practices to assess the value-added of schools. Paris, France: Author.
Otterman, S. (2010, December 26). Hurdles emerge in rising effort to rate teachers. The New York Times. Retrieved from http://www.nytimes.com/2010/12/27/nyregion/27teachers.html.
Papay, J. P. (2010). Different tests, different answers: the stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1), 163–193. https://doi.org/10.3102/0002831210362589.
Polikoff, M. S., & Porter, A. C. (2014). Instructional alignment as a measure of teaching quality. Educational Evaluation and Policy Analysis, 36(4), 399–416. https://doi.org/10.3102/0162373714531851.
Poon, A., & Schwartz, N. (2016). Investigating misalignment in teacher observation and value-added ratings. Paper presented at the annual meeting of the Association for Education Finance and Policy, Denver, CO.
Porter, E. (2015, March 24). Grading teachers by the test. The New York Times. Retrieved from http://www.nytimes.com/2015/03/25/business/economy/grading-teachers-by-the-test.html
Praetorius, A. K., Pauli, C., Reusser, K., Rakoczy, K., & Klieme, E. (2014). One lesson is all you need? Stability of instructional quality across lessons. Learning and Instruction, 31, 2–12. https://doi.org/10.1016/j.learninstruc.2013.12.002.
Quality Basic Education Act. S.B. 364. (2016).
Ramaswamy, S. V. (2014). Teacher evaluations: subjective data skew state results. The Journal News. Retrieved from http://www.lohud.com/story/news/education/2014/09/12/state-teacher-evals-skewed/15527297/
Raudenbush, S. W., & Jean, M. (2012). How should educators interpret value-added scores? Stanford, CA: Carnegie Knowledge Network. Retrieved from http://www.carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added/.
Reddy, L. A., Hua, A., Dudek, C. M., Kettler, R. J., Lekwa, A., Arnold-Berkovits, I., & Crouse, K. (2019). Use of observational measures to predict student achievement. Studies in Educational Evaluation, 62, 197–208. https://doi.org/10.1016/j.stueduc.2019.05.001.
Rhee, M. (2011). The evidence is clear: test scores must accurately reflect students’ learning. The Huffington Post. Retrieved from http://www.huffingtonpost.com/michelle-rhee/michelle-rhee-dc-schools_b_845286.html
Rockoff, J. E., Staiger, D. O., Kane, T. J., & Taylor, E. S. (2010). Information and employee evaluation: evidence from a randomized intervention in public schools (Working Paper No. 16240). Cambridge, MA: National Bureau of Economic Research.
Rothstein, J., & Mathis, W. J. (2013). Review of two culminating reports from the MET Project. Boulder, CO: National Education Policy Center. Retrieved from https://nepc.colorado.edu/sites/default/files/ttr-final-met-rothstein.pdf.
Rubin, D. B., Stuart, E. A., & Zanutto, E. L. (2004). A potential outcomes view of value-added assessment in education. Journal of Educational and Behavioral Statistics, 29(1), 103–116. https://doi.org/10.3102/10769986029001103.
Rutledge, S. A., Harris, D. N., & Ingle, W. K. (2010). How principals “bridge and buffer” the new demands of teacher quality and accountability: a mixed-methods analysis of teacher hiring. American Journal of Education, 116(2), 211–242. https://doi.org/10.1086/649492.
Sandilos, L. E., Sims, W. A., Norwalk, K. E., & Reddy, L. A. (2019). Converging on quality: examining multiple measures of teaching effectiveness. Journal of School Psychology, 74, 10–28. https://doi.org/10.1016/j.jsp.2019.05.004.
Schochet, P. Z., & Chiang, H. S. (2013). What are error rates for classifying teacher and school performance using value-added models? Journal of Educational and Behavioral Statistics, 38(2), 142–171. https://doi.org/10.3102/1076998611432174.
Shaw, L. H., & Bovaird, J. A. (2011). The impact of latent variable outcomes on value-added models of intervention efficacy. Paper presented at the Annual Conference of the American Educational Research Association (AERA), New Orleans, LA.
Shepard, L. A. (1990). Inflated test score gains: is the problem old norms or teaching the test? Educational Measurement: Issues and Practice, 9(3), 15–22. https://doi.org/10.1111/j.1745-3992.1990.tb00374.x.
Sidorkin, A. M. (2016). Campbell’s Law and the ethics of immensurability. Studies in Philosophy and Education, 35(4), 321–332. https://doi.org/10.1007/s11217-015-9482-3.
Sloat, E., Amrein-Beardsley, A., & Holloway, J. (2018). Different teacher-level effectiveness estimates, different results: inter-model concordance across six generalized value-added models (VAMs). Educational Assessment, Evaluation and Accountability, 30(4), 367–397. https://doi.org/10.1007/s11092-018-9283-7.
Solochek, J. S. (2019). Four teachers removed from struggling Hudson Elementary School over test results. Tampa Bay Times. Retrieved from https://www.tampabay.com/news/gradebook/2019/08/23/four-teachers-removed-from-struggling-hudson-elementary-school-over-test-results/
Sørensen, T. B. (2016). Value-added measurement or modelling (VAM). Brussels, Belgium: Education International. Retrieved from http://download.ei-ie.org/Docs/WebDepot/2016_EI_VAM_EN_final_Web.pdf.
Steinberg, M. P., & Garrett, R. (2016). Classroom composition and measured teacher performance: what do teacher observation scores really measure? Educational Evaluation and Policy Analysis, 38(2), 293–317. https://doi.org/10.3102/0162373715616249.
Taylor, K. (2015, March 22). Cuomo fights rating system in which few teachers are bad. The New York Times. Retrieved from https://www.nytimes.com/2015/03/23/nyregion/cuomo-fights-rating-system-in-which-few-teachers-are-bad.html.
Tennessee Department of Education (TDE). (2016). Teacher and administrator evaluation in Tennessee: a report on year 4 implementation. Nashville, TN: Author. Retrieved from https://team-tn.org/wp-content/uploads/2013/08/TEAM-Year-4-Report1.pdf.
U.S. Department of Education. (2009). Race to the top program executive summary. Washington, DC: Author. Retrieved from http://www2.ed.gov/programs/racetothetop/executive-summary.pdf.
U.S. Department of Education. (2014). States granted waivers from No Child Left Behind allowed to reapply for renewal for 2014 and 2015 school years. Washington, DC: Author. Retrieved from http://www.ed.gov/news/press-releases/states-granted-waivers-no-child-left-behind-allowed-reapply-renewal-2014-and-2015-school-years.
van der Lans, R. M. (2018). On the “association between two things”: the case of student surveys and classroom observations of teaching quality. Educational Assessment, Evaluation and Accountability, 30(4), 347–366. https://doi.org/10.1007/s11092-018-9285-5.
van der Lans, R. M., van de Grift, W. J., van Veen, K., & Fokkens-Bruinsma, M. (2016). Once is not enough: establishing reliability criteria for feedback and evaluation decisions based on classroom observations. Studies in Educational Evaluation, 50, 88–95. https://doi.org/10.1016/j.stueduc.2016.08.001.
Wainer, H. (2004). Introduction to a special issue of the Journal of Educational and Behavioral Statistics on value-added assessment. Journal of Educational and Behavioral Statistics, 29(1), 1–3. https://doi.org/10.3102/10769986029001001.
Wallace, T. L., Kelcey, B., & Ruzek, E. (2016). What can student perception surveys tell us about teaching? Empirically testing the underlying structure of the Tripod student perception survey. American Educational Research Journal, 53(6), 1834–1868. https://doi.org/10.3102/0002831216671864.
Walsh, K., Joseph, N., Lakis, K., & Lubell, S. (2017). Running in place: how new teacher evaluations fail to live up to promises. Washington, DC: National Council on Teacher Quality. Retrieved from http://www.nctq.org/dmsView/Final_Evaluation_Paper.
Weiner, I. B., Graham, J. R., & Naglieri, J. A. (2013). Handbook of psychology: assessment psychology (Vol. 10). Hoboken, NJ: John Wiley & Sons.
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The Widget Effect: our national failure to acknowledge and act on differences in teacher effectiveness. New York, NY: The New Teacher Project. Retrieved from http://tntp.org/assets/documents/TheWidgetEffect_2nd_ed.pdf.
Whitehurst, G. J., Chingos, M. M., & Lindquist, K. M. (2014). Evaluating teachers with classroom observations: lessons learned in four districts. Washington, DC: Brookings Institution. Retrieved from https://www.brookings.edu/wp-content/uploads/2016/06/Evaluating-Teachers-with-Classroom-Observations.pdf.
Winerip, M. (2011). Evaluating New York teachers, perhaps the numbers do lie. The New York Times. Retrieved from http://www.nytimes.com/2011/03/07/education/07winerip.html?_r=1&emc=eta1
Winters, M. A., & Cowen, J. M. (2013). Who would stay, who would be dismissed? An empirical consideration of value-added teacher retention policies. Educational Researcher, 42(6), 330–337. https://doi.org/10.3102/0013189X13496145.
Yeh, S. S. (2013). A re-analysis of the effects of teacher replacement using value-added modeling. Teachers College Record, 115(12). Retrieved from http://www.tcrecord.org/Content.asp?ContentID=16934.
Zilberberg, A., Finney, S. J., Marsh, K. R., & Anderson, R. D. (2014). The role of students’ attitudes and test-taking motivation on the validity of college institutional accountability tests: a path analytic model. International Journal of Testing, 14(4), 360–384. https://doi.org/10.1080/15305058.2014.928301.
Amrein-Beardsley, A., Geiger, T.J. Potential sources of invalidity when using teacher value-added and principal observational estimates: artificial inflation, deflation, and conflation. Educ Asse Eval Acc 31, 465–493 (2019). https://doi.org/10.1007/s11092-019-09311-w