Abstract
Contemporary teacher evaluation policies are built upon multiple-measure systems comprising, primarily, teacher-level value-added and observational estimates. However, researchers have not yet investigated how using these indicators to evaluate teachers might distort validity, especially when one indicator seemingly trumps, or is trusted over, the other. Accordingly, in this conceptual piece, we introduce and begin to establish evidence for three conceptual terms related to the validity of the inferences derived via these two measures in the context of teacher evaluation: (1) artificial inflation, (2) artificial deflation, and (3) artificial conflation. We define these terms by illustrating how those with the power to evaluate teachers (e.g., principals) within such contemporary evaluation systems might (1) artificially inflate or (2) artificially deflate observational estimates when used alongside their value-added counterparts, or (3) artificially conflate both estimates to purposefully (albeit perhaps naïvely) exaggerate perceptions of validity.
Notes
The main differences between VAMs and growth models are how precisely estimates are made and whether control variables are included. Unlike the typical VAM, for example, student growth models are more simply intended to measure the growth of similarly matched students in order to make relativistic comparisons about student growth over time, typically without any additional statistical controls (e.g., for student background variables). Students are, rather, directly and deliberately measured against, or in reference to, the growth levels of their peers, which de facto controls for these other variables. See also, for example, Betebenner (2009).
In terms of observational measures, what is known is that the observational systems used for teacher evaluation purposes are more common across states’ teacher evaluation systems than in years prior (Author(s) 2017). What is also known is that at face value (e.g., face validity), observational systems are valuable to the extent that observational outputs might be considered valid estimates of teacher effectiveness, if and when “(a) the observed performances can be considered a representative sample from the domain [e.g., capturing teacher effectiveness], (b) the performances are evaluated appropriately and fairly, and (c) the sample is large enough to control sampling error” (Guion 1977, as cited in Kane 2006). What is becoming increasingly evident in the literature, however, is that beyond observational systems’ prima facie qualities, they are now confronting their own sets of empirical issues. Such validity-related issues include, but are not limited to, whether the observational systems being used are psychometrically sound for their intended purposes (i.e., to yield objective data about teachers’ effectiveness in practice), how output from observational systems might be biased by such factors as the types of students whom a teacher teaches, how a teacher’s gender interplays with his/her students’ gender(s), and the like (Author(s) 2017; Bailey et al. 2016; Steinberg and Garrett 2016; Whitehurst et al. 2014).
While researchers who have investigated the reliability of teachers’ observational measures have expressed concerns about their reliability as well, the reliability coefficients of observational measures are relatively (and arguably) much higher than those of their VAM-based counterparts (e.g., r > 0.65 versus 0.20 < r < 0.50, respectively; see, for example, Ho and Kane 2013; see also Praetorius et al. 2014; van der Lans et al. 2016). It is important to note that these reliability coefficients pertain to both VAM-based and observational measurements taken over time, versus those captured via single observations. Indeed, most if not all researchers in this area emphasize the need for multiple VAM-based and observational estimates over time in order to secure the highest possible levels of reliability (Kane and Staiger 2012; see also Hill et al. 2012; Praetorius et al. 2014; van der Lans et al. 2016). Accordingly, many states and districts have increased the number of teacher observations conducted per year (see, for example, Close et al. 2019a; Reddy et al. 2019), although increasing the number of VAM-based estimates is much more difficult given that the tests most often used by states still using VAMs are state-level, large-scale assessments administered only once per year (Close et al. 2019a).
“Bridging and buffering” is defined in Honig and Hatch (2004, pp. 17, 23–27) as occurring when schools, or school principals in this case, use state- or school-wide goals and strategies as the basis for their decisions about how, or the extent to which, they might productively engage or disengage external demands. More specifically, bridging activities are noted when principals selectively engage with external demands in order to inform and enhance, in this case, policy implementation, while buffering involves not the blind dismissal of external demands but their strategic engagement in limited to very limited ways, so as not to derail principal decision-making.
It should be noted here, though, that this type of bell curve derived via teachers’ VAM-based scores is not uncommon, but rather is often an artifact of the many statistical models applied to analyze teachers’ value-added effects (e.g., linear regression or multilevel regression models), whereby it is common to center scores around a sample mean or average (see, for example, Winters and Cowen 2013). When VAM-based scores by default yield bell curves based on normative scores, though, what this means for the relatively subjective observational counterparts with which they are to be combined is only beginning to be observed (e.g., the potential to artificially deflate teachers’ observational scores to match or better fit the normal curves derived via their more objective counterparts).
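To make concrete why mean-centered models yield bell curves by construction, consider the following minimal simulation (a hypothetical sketch, not any state’s actual VAM): each simulated teacher’s “effect” is estimated as the mean residual (observed minus model-predicted score) across that teacher’s students. Even when teachers have no true differences in effectiveness, the estimates distribute roughly normally around the sample mean of zero:

```python
import random
import statistics

random.seed(42)

# Hypothetical setup for illustration only: 500 teachers, 25 students each.
# Each teacher's "value-added" estimate is the mean student-level residual.
teacher_effects = []
for _ in range(500):
    residuals = [random.gauss(0, 10) for _ in range(25)]  # pure student-level noise
    teacher_effects.append(statistics.mean(residuals))

# The estimates cluster symmetrically around the sample mean (~0): a bell
# curve emerges by construction, with roughly two-thirds of teachers falling
# within one standard deviation of the mean, as in any normal distribution.
mean_effect = statistics.mean(teacher_effects)
sd_effect = statistics.stdev(teacher_effects)
share_within_1sd = sum(
    abs(e - mean_effect) <= sd_effect for e in teacher_effects
) / len(teacher_effects)
print(round(mean_effect, 2), round(share_within_1sd, 2))
```

Because roughly half of the simulated teachers necessarily land below the mean regardless of their true effectiveness, an evaluator who deflates observational scores to “fit” such a curve is matching an artifact of the model, not an independent signal about teaching quality.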
References
Aaronson, D., Barrow, L., & Sanders, W. (2007). Teachers and student achievement in the Chicago public high schools. Journal of Labor Economics, 25(1), 95–135. https://doi.org/10.1086/508733.
American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Amrein, A. L., & Berliner, D. C. (2002). High-stakes testing, uncertainty, and student learning. Education Policy Analysis Archives, 10(18). https://doi.org/10.14507/epaa.v10n18.2002. Retrieved from http://epaa.asu.edu/epaa/v10n18/.
Amrein-Beardsley, A. (2008). Methodological concerns about the Education Value-Added Assessment System (EVAAS). Educational Researcher, 37(2), 65-75. https://doi.org/10.3102/0013189X08316420.
Amrein-Beardsley, A., & Barnett, J. H. (2012). Working with error and uncertainty to increase measurement validity. Educational Assessment, Evaluation and Accountability, 24(4), 369–379. https://doi.org/10.1007/s11092-012-9146-6.
Amrein-Beardsley, A., & Close, K. (2019b). Teacher-level value-added models (VAMs) on trial: empirical and pragmatic issues of concern across five court cases. Educational Policy, 1–42. https://doi.org/10.1177/0895904819843593.
Anderson, J. (2013). Curious grade for teachers: nearly all pass. The New York Times. Retrieved from http://www.nytimes.com/2013/03/31/education/curious-grade-for-teachers-nearly-all-pass.html.
Araujo, M. C., Carneiro, P., Cruz-Aguayo, Y., & Schady, N. (2016). Teacher quality and learning outcomes in kindergarten. The Quarterly Journal of Economics, 131(3), 1415–1453. https://doi.org/10.1093/qje/qjw016.
Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: a descriptive study in a large urban district. Washington, DC: U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/northeast/pdf/REL_2017189.pdf.
Ballou, D. (2005). Value-added assessment: lessons from Tennessee. In R. W. Lissitz (Ed.), Value-added models in education: theory and application (pp. 272–297). Maple Grove, MN: JAM Press.
Barnett, J. H., Rinthapol, N., & Hudgens, T. (2014). TAP research summary: examining the evidence and impact of TAP. The System for Teacher and Student Advancement. Santa Monica, CA: National Institute for Excellence in Teaching. Retrieved from http://files.eric.ed.gov/fulltext/ED556331.pdf
Betebenner, D. W. (2009). A primer on student growth percentiles. Dover, NH: National Center for the Improvement of Educational Assessment. Retrieved from https://www.gadoe.org/Curriculum-Instruction-and-Assessment/Assessment/Documents/Aprimeronstudentgrowthpercentiles.pdf.
Bill & Melinda Gates Foundation. (2013, January 8). Ensuring fair and reliable measures of effective teaching: Culminating findings from the MET project’s three-year study. Seattle, WA. Retrieved from http://www.gatesfoundation.org/press-releases/Pages/MET-Announcment.aspx
Braun, H. I. (2005). Using student progress to evaluate teachers: a primer on value-added models. Princeton, NJ: Educational Testing Service.
Braun, H. (2015). The value in value-added depends on the ecology. Educational Researcher, 44(2), 127–131. https://doi.org/10.3102/0013189X15576341.
Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 1–16). Westport, CT: American Council on Education.
Brennan, R. L. (2013). Commentary on “validating interpretations and uses of test scores.”. Journal of Educational Measurement, 50(1), 74–83. https://doi.org/10.1111/jedm.12001.
Brown, C. (2014, July 31). Stephen Colbert interview with Campbell Brown. The Colbert Report. New York, NY: Comedy Central. Retrieved from http://www.cc.com/video-clips/2mpwlv/the-colbert-report-campbell-brown
Burgess, K. (2016, September 16). Number of effective teachers keeps dropping. The Albuquerque Journal. Retrieved from https://www.abqjournal.com/846826/nm-teacher-evals-number-of-effective-teachers-keeps-dropping.html
Campbell, D. T. (1976). Assessing the impact of planned social change. Hanover, NH: The Public Affairs Center, Dartmouth College.
Chester, M. D. (2003). Multiple measures and high-stakes decisions: a framework for combining measures. Educational Measurement: Issues and Practice, 22(2), 32–41. https://doi.org/10.1111/j.1745-3992.2003.tb00126.x.
Chetty, R., Friedman, J., & Rockoff, J. (2014a). Measuring the impacts of teachers I: evaluating bias in teacher value-added estimates. American Economic Review, 104(9), 2593–2632. https://doi.org/10.3386/w19423.
Chetty, R., Friedman, J., & Rockoff, J. (2014b). Measuring the impacts of teachers II: teacher value-added and student outcomes in adulthood. American Economic Review, 104(9), 2633–2679. https://doi.org/10.3386/w19424.
Chiang, H., McCullough, M., Lipscomb, S., & Gill, B. (2016). Can student test scores provide useful measures of school principals’ performance? Washington, DC: U.S. Department of Education Retrieved from http://ies.ed.gov/ncee/pubs/2016002/pdf/2016002.pdf.
Chin, M., & Goldhaber, D. (2015). Exploring explanations for the “weak” relationship between value added and observation-based measures of teacher performance. Cambridge, MA: Center for Education Policy Research (CEPR), Harvard University. Retrieved from http://cepr.harvard.edu/files/cepr/files/sree2015_simulation_working_paper.pdf
Close, K., Amrein-Beardsley, A., & Collins, C. (2019). Mapping America’s teacher evaluation plans post ESSA. Phi Delta Kappan. Retrieved from https://www.kappanonline.org/mapping-teacher-evaluation-plans-essa-close-amrein-beardsley-collins/.
Collins, C. (2014). Houston, we have a problem: teachers find no value in the SAS Education Value-Added Assessment System (EVAAS®). Education Policy Analysis Archives, 22(98), 1–42. https://doi.org/10.14507/epaa.v22.1594.
Collins, C., & Amrein-Beardsley, A. (2014). Putting growth and value-added models on the map: a national overview. Teachers College Record, 116(1). Retrieved from http://www.tcrecord.org/Content.asp?ContentId=17291.
Corcoran, S. P. (2010). Can teachers be evaluated by their students’ test scores? Should they be? The use of value-added measures of teacher effectiveness in policy and practice. Providence, RI: Annenberg Institute for School Reform.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
Daly, G., & Kim, L. (2010). A teacher evaluation system that works. Santa Monica, CA: National Institute for Excellence in Teaching (NIET).
Danielson, C. (2012). Observing classroom practice. Educational Leadership, 70(3), 32–37.
Danielson, C. (2016). Charlotte Danielson on rethinking teacher evaluation. Education Week. Retrieved from http://www.edweek.org/ew/articles/2016/04/20/charlotte-danielson-on-rethinking-teacher-evaluation.html?cmp=eml-eb-popyrall+06162016
Danielson, C., & McGreal, T. L. (2000). Teacher evaluation to enhance professional practice. Alexandria, VA: Association for Supervision & Curriculum Development.
Darling-Hammond, L. (2013). Getting teacher evaluation right: what really matters for effectiveness and improvement. New York, NY: Teachers College Press.
Doan, S., Schweig, J. D., & Mihaly, K. (2019). The consistency of composite ratings of teacher effectiveness: evidence from New Mexico. American Educational Research Journal. https://doi.org/10.3102/0002831219841369.
Doherty, K. M., & Jacobs, S. (2015). State of the states: Evaluating teaching, leading and learning. Washington, DC: National Council on Teacher Quality (NCTQ).
Duncan, A. (2011). Winning the future with education: responsibility, reform and results. Washington, DC: Author. Retrieved from http://www.ed.gov/news/speeches/winning-future-education-responsibility-reform-and-results.
Every Student Succeeds Act (ESSA) of 2015, Pub. L. No. 114–95, 129 Stat. 1802 (2015).
Furr, R. M., & Bacharach, V. R. (2013). Psychometrics: an introduction. Los Angeles, CA: SAGE.
Goldhaber, D., & Hansen, M. (2013). Is it just a bad class? Assessing the long-term stability of estimated teacher performance. Economica, 80(319), 589–612. https://doi.org/10.1111/ecca.12002.
Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value-added: principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96–104. https://doi.org/10.3102/0013189X15575031.
Grossman, P., Cohen, J., Ronfeldt, M., & Brown, L. (2014). The test matters: the relationship between classroom observation scores and teacher value added on multiple types of assessment. Educational Researcher, 43(6), 293–303. https://doi.org/10.3102/0013189X14544542.
Gurney, K. (2016). Teachers say it’s getting harder to get a good evaluation. The school district disagrees. Miami Herald. Retrieved from http://www.miamiherald.com/news/local/education/article119791683.html.
Haladyna, T. M., Nolen, S. B., & Haas, N. S. (1991). Raising standardized achievement test scores and the origins of test score pollution. Educational Researcher, 20(5), 2–7. https://doi.org/10.2307/1176395.
Haney, W. (2000). The myth of the Texas miracle in education. Education Policy Analysis Archives, 8(41). https://doi.org/10.14507/epaa.v8n41.2000.
Hanushek, E. (2009). Teacher deselection. In D. Goldhaber & J. Hannaway (Eds.), Creating a new teaching profession (pp. 165–180). Washington, DC: Urban Institute Press.
Harris, D. N. (2011). Value-added measures in education: what every educator needs to know. Cambridge, MA: Harvard Education Press.
Harris, D. N., Ingle, W. K., & Rutledge, S. A. (2014). How teacher evaluation methods matter for accountability: a comparative analysis of teacher effectiveness ratings by principals and teacher value-added measures. American Educational Research Journal, 51(1), 73–112. https://doi.org/10.3102/0002831213517130.
Hill, H. C., Kapitula, L., & Umland, K. (2011). A validity argument approach to evaluating teacher value-added scores. American Educational Research Journal, 48(3), 794–831. https://doi.org/10.3102/0002831210387916.
Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough: teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–64. https://doi.org/10.3102/0013189x12437203.
Ho, A. D., & Kane, T. J. (2013). The reliability of classroom observations by school personnel. Seattle, WA: Bill & Melinda Gates Foundation.
Holmstrom, B., & Milgrom, P. (1991). Multitask principal-agent analyses: incentive contracts, asset ownership, and job design. Journal of Law, Economics, & Organization, 7, 24–52. https://doi.org/10.1093/jleo/7.special_issue.24.
Honig, M. I., & Hatch, T. C. (2004). Crafting coherence: how schools strategically manage multiple, external demands. Educational Researcher, 33(4), 16–30. https://doi.org/10.3102/0013189X033008016.
Houston Independent School District (HISD). (2012). HISD Core Initiative 1: an effective teacher in every classroom, teacher appraisal and development system – year one summary report. Houston, TX.
Houston Independent School District (HISD). (2013). Progress conference briefing. Houston, TX.
Jacob, B. A. (2005). Accountability, incentives and behavior: the impact of high-stakes testing in the Chicago public schools. Journal of Public Economics, 89(5–6), 761–796. https://doi.org/10.3386/w8968.
Jacob, B. A., & Lefgren, L. (2006). When principals rate teachers: the best-and the worst-stand out. Education Next, 2(6), 58–64.
Jacoby, R., Glauberman, N., & Herrnstein, R. J. (1995). The bell curve debate: history, documents, opinions. New York, NY: Times Books.
Jennings, J. L., & Pallas, A. M. (2016). How does value-added data affect teachers? Educational Leadership, 73(8).
Jerald, C. D., & Van Hook, K. (2011). More than measurement: the TAP system’s lessons learned for designing better teacher evaluation systems. Santa Monica, CA: National Institute for Excellence in Teaching (NIET).
Jiang, J. Y., Sporte, S. E., & Luppescu, S. (2015). Teacher perspectives on evaluation reform: Chicago’s REACH students. Educational Researcher, 44(2), 105–116. https://doi.org/10.3102/0013189X15575517.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Washington, DC: The American Council on Education.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000.
Kane, T. J. (2015). Teachers must look in the mirror. The New York Daily News. Retrieved from http://www.nydailynews.com/opinion/thomas-kane-teachers-mirror-article-1.2172662
Kane, M., & Case, S. M. (2004). The reliability and validity of weighted composite scores. Applied Measurement in Education, 17(3), 221–240. https://doi.org/10.1207/s15324818ame1703_1.
Kane, T. J., & Staiger, D. O. (2012). Gathering feedback for teaching: combining high-quality observations with student surveys and achievement gains. Seattle, WA: Bill & Melinda Gates Foundation.
Kane, T. J., McCaffrey, D. F., Miller, T., & Staiger, D. O. (2013). Have we identified effective teachers? Validating measures of effective teaching using random assignment. Seattle, WA: Bill & Melinda Gates Foundation.
Kiewiet de Jonge, C. P., & Nickerson, D. W. (2014). Artificial inflation or deflation? Assessing the item count technique in comparative surveys. Political Behavior, 36(3), 659–682. https://doi.org/10.1007/s11109-013-9249-x.
Koedel, C., & Betts, J. R. (2007). Re-examining the role of teacher quality in the educational production function. Nashville, TN: National Center on Performance Initiatives.
Koedel, C., & Betts, J. R. (2009). Does student sorting invalidate value-added models of teacher effectiveness? An extended analysis of the Rothstein critique (Working paper 2009-01). San Diego, CA: National Bureau of Economic Research. Retrieved from https://economics.missouri.edu/working-papers/2009/wp0902_koedel.pdf
Koedel, C., Mihaly, K., & Rockoff, J. E. (2015). Value-added modeling: a review. Economics of Education Review, 47, 180–195. https://doi.org/10.1016/j.econedurev.2015.01.006.
Koretz, D. (2017). The testing charade: pretending to make schools better. Chicago, IL: University of Chicago Press.
Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the Widget Effect: teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234–249. https://doi.org/10.3102/0013189X17718797.
Martínez, J. F., Schweig, J., & Goldschmidt, P. (2016). Approaches for combining multiple measures of teacher performance: reliability, validity, and implications for evaluation policy. Educational Evaluation and Policy Analysis, 38(4), 738–756. https://doi.org/10.3102/0162373716666166.
Marzano, R. J., & Toth, M. D. (2013). Teacher evaluation that makes a difference: a new model for teacher growth and student achievement. Alexandria, VA: Association for Supervision & Curriculum Development.
McCaffrey, D. F., Sass, T. R., Lockwood, J. R., & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance and Policy, 4(4), 572–606. https://doi.org/10.1162/edfp.2009.4.4.572.
Mellon, E. (2010, January 14). HISD moves ahead on dismissal policy: In the past, teachers were rarely let go over poor performance, data show. The Houston Chronicle. Retrieved from http://www.chron.com/disp/story.mpl/metropolitan/6816752.html
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35(11), 1012–1027. https://doi.org/10.1037//0003-066x.35.11.1012.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–85). New York, NY: American Council on Education.
Messick, S. (1990). Validity of test interpretation and use. Princeton, NJ: Educational Testing Service.
Messick, S. (1995). Validity of psychological assessment: validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037//0003-066x.50.9.741.
Mihaly, K., McCaffrey, D. F., Staiger, D. O., & Lockwood, J. R. (2013). A composite estimator of effective teaching. Seattle, WA: Bill & Melinda Gates Foundation.
Nelson, F. H. (2011). A guide for developing growth models for teacher development and evaluation. Paper presented at the Annual Conference of the American Educational Research Association (AERA), New Orleans, LA.
Nichols, S. L., & Berliner, D. C. (2007). Collateral damage: how high-stakes testing corrupts America’s schools. Cambridge, MA: Harvard Education Press.
Organisation for Economic Co-operation and Development (OECD). (2008). Measuring improvements in learning outcomes: best practices to assess the value-added of schools. Paris, France: Author.
Otterman, S. (2010, December 26). Hurdles emerge in rising effort to rate teachers. The New York Times. Retrieved from http://www.nytimes.com/2010/12/27/nyregion/27teachers.html.
Papay, J. P. (2010). Different tests, different answers: the stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1), 163–193. https://doi.org/10.3102/0002831210362589.
Polikoff, M. S., & Porter, A. C. (2014). Instructional alignment as a measure of teaching quality. Educational Evaluation and Policy Analysis, 36(4), 399–416. https://doi.org/10.3102/0162373714531851.
Poon, A., & Schwartz, N. (2016). Investigating misalignment in teacher observation and value-added ratings. Paper presented at the annual meeting of the Association for Education Finance and Policy, Denver, CO.
Porter, E. (2015, March 24). Grading teachers by the test. The New York Times. Retrieved from http://www.nytimes.com/2015/03/25/business/economy/grading-teachers-by-the-test.html
Praetorius, A. K., Pauli, C., Reusser, K., Rakoczy, K., & Klieme, E. (2014). One lesson is all you need? Stability of instructional quality across lessons. Learning and Instruction, 31, 2–12. https://doi.org/10.1016/j.learninstruc.2013.12.002.
Quality Basic Education Act. S.B. 364. (2016).
Ramaswamy, S. V. (2014). Teacher evaluations: subjective data skew state results. The Journal News. Retrieved from http://www.lohud.com/story/news/education/2014/09/12/state-teacher-evals-skewed/15527297/
Raudenbush, S. W., & Jean, M. (2012). How should educators interpret value-added scores? Stanford, CA: Carnegie Knowledge Network. Retrieved from http://www.carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added/.
Reddy, L. A., Hua, A., Dudek, C. M., Kettler, R. J., Lekwa, A., Arnold-Berkovits, I., & Crouse, K. (2019). Use of observational measures to predict student achievement. Studies in Educational Evaluation, 62, 197–208. https://doi.org/10.1016/j.stueduc.2019.05.001.
Rhee, M. (2011). The evidence is clear: test scores must accurately reflect students’ learning. The Huffington Post. Retrieved from http://www.huffingtonpost.com/michelle-rhee/michelle-rhee-dc-schools_b_845286.html
Rockoff, J. E., Staiger, D. O., Kane, T. J., & Taylor, E. S. (2010). Information and employee evaluation: evidence from a randomized intervention in public schools (Working Paper No. 16240). Cambridge, MA: National Bureau of Economic Research.
Rothstein, J., & Mathis, W. J. (2013). Review of two culminating reports from the MET Project. Boulder, CO: National Education Policy Center. Retrieved from https://nepc.colorado.edu/sites/default/files/ttr-final-met-rothstein.pdf.
Rubin, D. B., Stuart, E. A., & Zanutto, E. L. (2004). A potential outcomes view of value-added assessment in education. Journal of Educational and Behavioral Statistics, 29(1), 103–116. https://doi.org/10.3102/10769986029001103.
Rutledge, S. A., Harris, D. N., & Ingle, W. K. (2010). How principals “bridge and buffer” the new demands of teacher quality and accountability: a mixed-methods analysis of teacher hiring. American Journal of Education, 116(2), 211–242. https://doi.org/10.1086/649492.
Sandilos, L. E., Sims, W. A., Norwalk, K. E., & Reddy, L. A. (2019). Converging on quality: examining multiple measures of teaching effectiveness. Journal of School Psychology, 74, 10–28. https://doi.org/10.1016/j.jsp.2019.05.004.
Schochet, P. Z., & Chiang, H. S. (2013). What are error rates for classifying teacher and school performance using value-added models? Journal of Educational and Behavioral Statistics, 38(2), 142–171. https://doi.org/10.3102/1076998611432174.
Shaw, L. H., & Bovaird, J. A. (2011). The impact of latent variable outcomes on value-added models of intervention efficacy. Paper presented at the Annual Conference of the American Educational Research Association (AERA), New Orleans, LA.
Shepard, L. A. (1990). Inflated test score gains: is the problem old norms or teaching the test? Educational Measurement: Issues and Practice, 9(3), 15–22. https://doi.org/10.1111/j.1745-3992.1990.tb00374.x.
Sidorkin, A. M. (2016). Campbell’s Law and the ethics of immensurability. Studies in Philosophy and Education, 35(4), 321–332. https://doi.org/10.1007/s11217-015-9482-3.
Sloat, E., Amrein-Beardsley, A., & Holloway, J. (2018). Different teacher-level effectiveness estimates, different results: inter-model concordance across six generalized value-added models (VAMs). Educational Assessment, Evaluation and Accountability, 30(4), 367–397. https://doi.org/10.1007/s11092-018-9283-7.
Solochek, J. S. (2019). Four teachers removed from struggling Hudson Elementary School over test results. Tampa Bay Times. Retrieved from https://www.tampabay.com/news/gradebook/2019/08/23/four-teachers-removed-from-struggling-hudson-elementary-school-over-test-results/
Sørensen, T. B. (2016). Value-added measurement or modelling (VAM). Brussels, Belgium: Education International. Retrieved from http://download.ei-ie.org/Docs/WebDepot/2016_EI_VAM_EN_final_Web.pdf.
Steinberg, M. P., & Garrett, R. (2016). Classroom composition and measured teacher performance: what do teacher observation scores really measure? Educational Evaluation and Policy Analysis, 38(2), 293–317. https://doi.org/10.3102/0162373715616249.
Taylor, K. (2015, March 22). Cuomo fights rating system in which few teachers are bad. The New York Times. Retrieved from https://www.nytimes.com/2015/03/23/nyregion/cuomo-fights-rating-system-in-which-few-teachers-are-bad.html.
Tennessee Department of Education (TDE). (2016). Teacher and administrator evaluation in Tennessee: a report on year 4 implementation. Nashville, TN: Author. Retrieved from https://team-tn.org/wp-content/uploads/2013/08/TEAM-Year-4-Report1.pdf.
U.S. Department of Education. (2009). Race to the top program executive summary. Washington, DC: Author. Retrieved from http://www2.ed.gov/programs/racetothetop/executive-summary.pdf.
U.S. Department of Education. (2014). States granted waivers from No Child Left Behind allowed to reapply for renewal for 2014 and 2015 school years. Washington, DC: Author. Retrieved from http://www.ed.gov/news/press-releases/states-granted-waivers-no-child-left-behind-allowed-reapply-renewal-2014-and-2015-school-years.
van der Lans, R. M. (2018). On the “association between two things”: the case of student surveys and classroom observations of teaching quality. Educational Assessment, Evaluation and Accountability, 30(4), 347–366. https://doi.org/10.1007/s11092-018-9285-5.
van der Lans, R. M., van de Grift, W. J., van Veen, K., & Fokkens-Bruinsma, M. (2016). Once is not enough: establishing reliability criteria for feedback and evaluation decisions based on classroom observations. Studies in Educational Evaluation, 50, 88–95. https://doi.org/10.1016/j.stueduc.2016.08.001.
Wainer, H. (2004). Introduction to a special issue of the Journal of Educational and Behavioral Statistics on value-added assessment. Journal of Educational and Behavioral Statistics, 29(1), 1–3. https://doi.org/10.3102/10769986029001001.
Wallace, T. L., Kelcey, B., & Ruzek, E. (2016). What can student perception surveys tell us about teaching? Empirically testing the underlying structure of the Tripod student perception survey. American Educational Research Journal, 53(6), 1834–1868. https://doi.org/10.3102/0002831216671864.
Walsh, K., Joseph, N., Lakis, K., & Lubell, S. (2017). Running in place: how new teacher evaluations fail to live up to promises. Washington, DC: National Council on Teacher Quality. Retrieved from http://www.nctq.org/dmsView/Final_Evaluation_Paper.
Weiner, I. B., Graham, J. R., & Naglieri, J. A. (2013). Handbook of psychology: assessment psychology (Vol. 10). Hoboken, NJ: John Wiley & Sons.
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The Widget Effect: our national failure to acknowledge and act on differences in teacher effectiveness. New York, NY: The New Teacher Project. Retrieved from http://tntp.org/assets/documents/TheWidgetEffect_2nd_ed.pdf.
Whitehurst, G. J., Chingos, M. M., & Lindquist, K. M. (2014). Evaluating teachers with classroom observations: lessons learned in four districts. Washington, DC: Brookings Institution. Retrieved from https://www.brookings.edu/wp-content/uploads/2016/06/Evaluating-Teachers-with-Classroom-Observations.pdf.
Winerip, M. (2011). Evaluating New York teachers, perhaps the numbers do lie. The New York Times. Retrieved from http://www.nytimes.com/2011/03/07/education/07winerip.html?_r=1&emc=eta1
Winters, M. A., & Cowen, J. M. (2013). Who would stay, who would be dismissed? An empirical consideration of value-added teacher retention policies. Educational Researcher, 42(6), 330–337. https://doi.org/10.3102/0013189X13496145.
Yeh, S. S. (2013). A re-analysis of the effects of teacher replacement using value-added modeling. Teachers College Record, 115(12). Retrieved from http://www.tcrecord.org/Content.asp?ContentID=16934.
Zilberberg, A., Finney, S. J., Marsh, K. R., & Anderson, R. D. (2014). The role of students’ attitudes and test-taking motivation on the validity of college institutional accountability tests: a path analytic model. International Journal of Testing, 14(4), 360–384. https://doi.org/10.1080/15305058.2014.928301.
Amrein-Beardsley, A., Geiger, T.J. Potential sources of invalidity when using teacher value-added and principal observational estimates: artificial inflation, deflation, and conflation. Educ Asse Eval Acc 31, 465–493 (2019). https://doi.org/10.1007/s11092-019-09311-w