
Potential sources of invalidity when using teacher value-added and principal observational estimates: artificial inflation, deflation, and conflation


Abstract

Contemporary teacher evaluation policies are built upon multiple-measure systems that primarily include teacher-level value-added and observational estimates. However, researchers have not yet investigated how using these indicators to evaluate teachers might distort validity, especially when one indicator seemingly trumps, or is trusted over, the other. Accordingly, in this conceptual piece, we introduce and begin to establish evidence for three conceptual terms related to the validity of the inferences derived via these two measures in the context of teacher evaluation: (1) artificial inflation, (2) artificial deflation, and (3) artificial conflation. We define these terms by illustrating how those with the power to evaluate teachers (e.g., principals) within such contemporary evaluation systems might (1) artificially inflate or (2) artificially deflate observational estimates when used alongside their value-added counterparts, or (3) artificially conflate both estimates to purposefully (albeit perhaps naïvely) exaggerate perceptions of validity.


Notes

  1. The main differences between VAMs and growth models are how precisely estimates are made and whether control variables are included. Unlike the typical VAM, for example, student growth models are intended more simply to measure the growth of similarly matched students, in order to make relative comparisons about student growth over time, typically without any additional statistical controls (e.g., for student background variables). Students are, rather, directly and deliberately measured against, or in reference to, the growth levels of their peers, which de facto controls for these other variables (a distinction sketched in code below). See also, for example, Betebenner (2009).
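
To make this distinction concrete, the following is a minimal sketch (our illustration, not drawn from the article): a covariate-adjusted, VAM-style estimate read off a regression residual, versus a growth-model-style estimate that ranks each student's gain only against similarly matched peers. The toy data, the 25-point matching window, and all variable names are hypothetical assumptions.

```python
# Hypothetical illustration of the footnote's distinction; data and
# parameters are invented for demonstration only.
import numpy as np

rng = np.random.default_rng(0)
n = 200
prior = rng.normal(500, 50, n)                 # prior-year test scores
ses = rng.normal(0, 1, n)                      # a student background covariate
current = prior + 10 + 5 * ses + rng.normal(0, 20, n)

# VAM-style: regress current scores on prior scores AND background controls,
# then treat the residual as the estimate of "value added" beyond prediction.
X = np.column_stack([np.ones(n), prior, ses])
beta, *_ = np.linalg.lstsq(X, current, rcond=None)
vam_residual = current - X @ beta

# Growth-model-style: compare each student's gain only against the gains of
# similarly matched peers (no explicit background controls), which de facto
# conditions on prior achievement.
gain = current - prior
growth_percentile = np.empty(n)
for i in range(n):
    peers = np.abs(prior - prior[i]) < 25      # hypothetical matching window
    growth_percentile[i] = (gain[peers] < gain[i]).mean() * 100
```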

  2. In terms of observational measures, what is known is that observational systems used for teacher evaluation purposes are more common across states’ teacher evaluation systems than in years prior (Author(s) 2017). What is also known is that, at face value (e.g., face validity), observational systems are valuable to the extent that their outputs might be considered valid estimates of teacher effectiveness, if and when “(a) the observed performances can be considered a representative sample from the domain [e.g., capturing teacher effectiveness], (b) the performances are evaluated appropriately and fairly, and (c) the sample is large enough to control sampling error” (Guion 1977, as cited in Kane 2006). What is becoming increasingly evident in the literature, however, is that beyond their prima facie qualities, observational systems are now confronting their own sets of empirical issues. Such validity-related issues include, but are not limited to, whether the observational systems being used are psychometrically sound for their intended purposes (i.e., to yield objective data about teachers’ effectiveness in practice), how output from observational systems might be biased by such factors as the types of students whom a teacher teaches, how a teacher’s gender interacts with his/her students’ gender(s), and the like (Author(s) 2017; Bailey et al. 2016; Steinberg and Garrett 2016; Whitehurst et al. 2014).

  3. While researchers who have investigated the reliability of teachers’ observational measures have likewise expressed concerns, the reliability coefficients of observational measures are relatively (and arguably) much higher than those of their VAM-based counterparts (e.g., r > 0.65 versus 0.20 < r < 0.50, respectively; see, for example, Ho and Kane 2013; see also Praetorius et al. 2014; van der Lans et al. 2016). It is important to note that these reliability coefficients pertain to both VAM and observational measurements taken over time, versus those captured via single observations. Indeed, most if not all researchers in this area emphasize the need for multiple observations of both VAM and observational estimates over time in order to secure the highest possible levels of reliability (Kane and Staiger 2012; see also Hill et al. 2012; Praetorius et al. 2014; van der Lans et al. 2016). Accordingly, many states and districts have increased the number of teacher observations conducted per year (see, for example, Close et al. 2019a; Reddy et al. 2019), although increasing the number of VAM-based estimates is much more difficult, given that the tests most often used by states still using VAMs are state-level, large-scale assessments administered only once per year (Close et al. 2019a).
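
The note does not invoke any particular formula, but one conventional way to see why averaging more observations raises reliability is the Spearman-Brown prophecy formula; the sketch below, including the 0.45 single-observation value, is an illustrative assumption rather than a figure from the studies cited.

```python
# Spearman-Brown prophecy formula: projected reliability of the average of
# k parallel observations, given single-observation reliability r.
def spearman_brown(r: float, k: int) -> float:
    return k * r / (1 + (k - 1) * r)

# E.g., a single classroom observation with reliability 0.45 projects to
# roughly 0.77 when four observations are averaged -- consistent with the
# push toward multiple observations per year noted above.
print(spearman_brown(0.45, 4))  # ~0.766
```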

  4. “Bridging and buffering” is defined by Honig and Hatch (2004, pp. 17, 23–27) as occurring when schools, or school principals in this case, use state- or school-wide goals and strategies as the basis for their decisions about how, or the extent to which, they might productively engage or disengage external demands. More specifically, bridging occurs when principals, in this case, selectively engage with external demands in order to inform and enhance policy implementation, while buffering involves not the blind dismissal of external demands but their strategic engagement in limited, even very limited, ways, so as not to derail principal decision-making.

  5. It should be noted here, though, that this type of bell curve derived via teachers’ VAM-based scores is not uncommon, but rather is often an artifact of the statistical models applied to analyze teachers’ value-added effects (e.g., linear or multilevel regression models), whereby it is common to center scores around a sample mean or average (see, for example, Winters and Cowen 2013). When VAM-based scores yield such bell curves of normative scores by default, though, what this means for the relatively subjective observational counterparts with which they are to be combined is only beginning to be observed (e.g., the potential to artificially deflate teachers’ observational scores to match or better fit the normal curves derived via their more objective counterparts).
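
As a purely hypothetical illustration of the deflation mechanism this note describes (not a procedure reported in the article), the sketch below rank-normalizes top-heavy rubric scores onto a bell curve centered at mid-scale, the shape a regression-derived VAM distribution takes by default; every number in it is invented.

```python
# Hypothetical sketch of "artificial deflation": re-mapping observational
# rubric scores (which tend to cluster near the top) onto a VAM-like bell
# curve. All values are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
obs = np.clip(rng.normal(3.6, 0.4, 100), 1, 4)      # top-heavy rubric scores

# Replace each score with the normal quantile of its rank, centering the
# distribution mid-scale the way regression-based VAM scores are centered
# around the sample mean.
ranks = stats.rankdata(obs) / (len(obs) + 1)
curved = stats.norm.ppf(ranks, loc=2.5, scale=0.5)  # forced bell curve, 1-4 scale

# Teachers who nearly all scored close to 4 are now spread across the curve;
# a teacher at the sample median lands at 2.5 regardless of observed practice.
print(round(obs.mean(), 2), round(curved.mean(), 2))
```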


Corresponding author

Correspondence to Audrey Amrein-Beardsley.




Cite this article

Amrein-Beardsley, A., Geiger, T.J. Potential sources of invalidity when using teacher value-added and principal observational estimates: artificial inflation, deflation, and conflation. Educ Asse Eval Acc 31, 465–493 (2019). https://doi.org/10.1007/s11092-019-09311-w
