
Principal holistic judgments and high-stakes evaluations of teachers


Results from a sample of 1,013 Georgia principals who rated 12,617 teachers are used to compare holistic and analytic principal judgments with indicators of student growth central to the state’s teacher evaluation system. Holistic principal judgments were compared to mean student growth percentiles (MGPs) and to analytic judgments from a formal observation protocol. The correlations of a holistic principal rating with teacher MGPs and observation protocol scores were 0.22 and 0.32, respectively. Teachers selected as most successful at increasing student achievement had a mean MGP a full SD higher than that of teachers selected as least successful, and a mean observation protocol score 1.35 SDs higher. Holistic principal judgments appear to be much more strongly influenced by observations of teachers’ classroom practices than by evidence of growth in student achievement.
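To give a mechanical sense of what correlations of this magnitude look like, the sketch below simulates a latent teacher quality that drives both a noisy growth measure and a 1–4 holistic rating, then computes a Pearson correlation. Every parameter here is an illustrative assumption, not the study's data.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n = 1000
# A latent teacher quality drives both measures, each observed with noise.
quality = [random.gauss(0, 1) for _ in range(n)]
mgp = [50 + 3 * q + random.gauss(0, 9.5) for q in quality]  # noisy growth measure
rating = [min(4, max(1, round(2.5 + 1.2 * q + random.gauss(0, 1)))) for q in quality]
print(f"r(rating, MGP) = {pearson(rating, mgp):.2f}")
```

With noise levels like these, even two measures of the same underlying quality correlate only modestly, which is one reason a correlation in the 0.2–0.3 range need not imply that principals ignore growth evidence entirely.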


Figs. 1–6 (available in the full text)


  1.

    Ehlert, Koedel, Parsons and Podgursky (2013) argue that even if it were possible to properly specify a value-added model that could isolate the effect of a teacher on student academic growth, it would still be preferable to include additional classroom level covariates at the risk of “overcorrecting” the model in order to create an optimal incentive structure to recruit and retain the best teachers in high-needs school districts.

  2.

    Only about one third of all teachers in Georgia teach in subjects or grades for which it is possible to compute an MGP. For all other teachers, an aggregated growth statistic is computed on the basis of student performance on “student learning objectives” (SLOs). Because SLOs were still relatively new and in the process of being standardized at the time of this study, we do not include them here to keep the scope of our investigation manageable. For a detailed evaluation of SLOs in Georgia, see Buckley 2015.

  3.

    For a primer on the student growth percentile methodology as it has been implemented in Georgia, see

  4.

    It is easy to confuse the acronym “MGP” because the M can refer to either a median or a mean. In many states using SGPs, the median is taken instead of the mean. In Georgia, the decision was made to use the mean in part because of research by Castellano and Ho (2015) suggesting that the mean is less sensitive than the median to random fluctuations in student cohorts.

  5.

    In the survey, “professional development support” is defined by example: “all teachers can benefit from support in the form of professional development (PD) that helps them become better at their job. Examples of these kinds of PD supports might include workshops offered at the district or school level, presentations offered by professional speakers from outside the school, periodic meetings in teacher teams during the school year, one-on-one coaching and feedback on teaching from a mentor or mentors, and taking coursework at an institution of higher education.”

  6.

    We regard this as a quasi-ordinal rating scale in the sense that a rating of 4 indicates a teacher whom a principal believes to require more support than does a teacher with a rating of 1, 2, or 3, and a rating of 1 indicates a teacher whom a principal believes to require less support than does a teacher with a rating of 2, 3, or 4. However, it is not clear that a rating of 3 necessarily indicates a greater level of support than a rating of 2, and in subsequent analyses we sometimes collapse the middle two categories or restrict focus to the top and bottom categories.

  7.

    The courses for which EOCTs exist are Mathematics I, Mathematics II, Coordinate Algebra, Georgia Performance Standards Algebra, Analytic Geometry, Georgia Performance Standards Geometry, United States History, Economics, Biology, Physical Science, Ninth Grade Literature and Composition, and American Literature and Composition.

  8.

    This assumption is surely violated by the clustering of teachers within schools and school districts. However, because our sample sizes are so large in this regression context, involving the full population of teachers in the state, producing cluster-adjusted standard errors would have no impact on conventional tests of statistical significance, and such tests are not relevant to the approach anyway.

  9.

    These variables are meant to be illustrative rather than exhaustive; examples of other variables that could have been included are racial/ethnic composition, attendance rates, student “churn” (students who enter and exit the classroom throughout the year), the proportion of students in gifted and talented programs, etc. Indeed, one challenge with this approach is that it can be unclear where one should stop in adding factors that need to be controlled.

  10.

    In results not shown here due to space constraints, we verify this by regressing principals’ least/most successful judgments for teachers on MGP and TAPS scores. We find that the TAPS score variable has the strongest influence, especially for judgments of teachers deemed least successful at increasing student achievement.
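The SGP methodology referenced in footnote 3 boils down to ranking each student's current score among academic peers with similar prior scores. The sketch below is a deliberate simplification on made-up scores; operational SGPs in Georgia are estimated with quantile regression over multiple prior years (Betebenner 2009), not the exact-prior-score peer binning used here.

```python
from collections import defaultdict

def simple_sgps(prior, current):
    """Toy student growth percentiles: within each group of students who share
    the same prior-year score, convert each current score to a percentile rank
    (counting half of any ties, including the student's own score)."""
    groups = defaultdict(list)
    for p, c in zip(prior, current):
        groups[p].append(c)
    sgps = []
    for p, c in zip(prior, current):
        peers = groups[p]
        below = sum(1 for x in peers if x < c)
        ties = sum(1 for x in peers if x == c)
        sgps.append(100 * (below + 0.5 * ties) / len(peers))
    return sgps

# Two peer groups (prior scores 300 and 310), four students each.
prior   = [300, 300, 300, 300, 310, 310, 310, 310]
current = [305, 315, 325, 335, 300, 320, 340, 360]
print(simple_sgps(prior, current))  # [12.5, 37.5, 62.5, 87.5, 12.5, 37.5, 62.5, 87.5]
```

Note that the student who scored 360 receives the same SGP as the one who scored 335: growth is judged only relative to peers with the same starting point, not on the score scale itself.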
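The mean-versus-median point in footnote 4 can be illustrated with a small simulation (simulated cohorts under the null of an average teacher, not Georgia data): with classroom SGPs drawn uniformly from 1–99, the mean of a class's SGPs varies less from one random cohort to the next than the median does, consistent with the stability result in Castellano and Ho (2015).

```python
import random
import statistics

random.seed(1)

def cohort_mgp(n_students=25):
    """Draw one simulated classroom of SGPs (uniform on 1-99 under the null of
    an average teacher) and return the (mean, median) aggregate."""
    sgps = [random.randint(1, 99) for _ in range(n_students)]
    return statistics.mean(sgps), statistics.median(sgps)

means, medians = zip(*(cohort_mgp() for _ in range(5000)))
print(f"SD of mean MGP across cohorts:   {statistics.stdev(means):.2f}")
print(f"SD of median MGP across cohorts: {statistics.stdev(medians):.2f}")
# The median-based aggregate is visibly noisier than the mean-based one.
```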
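The check described in footnote 10 can be sketched as a linear probability model on simulated data; the data-generating weights below (a principal flag driven mostly by the observation score) are illustrative assumptions, not estimates from the study.

```python
import random
import statistics

random.seed(2)

def standardize(xs):
    m, s = statistics.fmean(xs), statistics.stdev(xs)
    return [(x - m) / s for x in xs]

def lpm_two_predictors(y, x1, x2):
    """OLS via the normal equations for two standardized (mean-0, SD-1)
    predictors; with a 0/1 outcome this is a linear probability model."""
    n = len(y)
    my = statistics.fmean(y)
    c12 = sum(a * b for a, b in zip(x1, x2)) / (n - 1)
    c1y = sum(a * (b - my) for a, b in zip(x1, y)) / (n - 1)
    c2y = sum(a * (b - my) for a, b in zip(x2, y)) / (n - 1)
    det = 1 - c12 ** 2
    return (c1y - c12 * c2y) / det, (c2y - c12 * c1y) / det

n = 2000
quality = [random.gauss(0, 1) for _ in range(n)]
mgp  = standardize([0.3 * q + random.gauss(0, 1) for q in quality])  # noisy growth measure
taps = standardize([0.8 * q + random.gauss(0, 1) for q in quality])  # observation score
# Hypothetical principal: the "most successful" flag leans heavily on TAPS.
latent = [0.2 * m + 0.8 * t + random.gauss(0, 0.5) for m, t in zip(mgp, taps)]
cut = sorted(latent)[int(0.85 * n)]
most_successful = [1 if l > cut else 0 for l in latent]

b_mgp, b_taps = lpm_two_predictors(most_successful, mgp, taps)
print(f"LPM coefficients: MGP = {b_mgp:.3f}, TAPS = {b_taps:.3f}")
# The TAPS coefficient dominates, mirroring the pattern reported in the text.
```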


  1. Adler, M. (2014). Review of measuring the impacts of teachers. National Education Policy Center.

  2. American Statistical Association (2014). ASA Statement on Using Value-Added Models for Educational Assessment. Accessed 12 June 2016.

  3. Ballou, D. (2012). Review of long-term impacts of teachers. National Education Policy Center.

  4. Ballou, D., Sanders, W., & Wright, P. (2004). Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral Statistics, 29(1), 37–65.


  5. Betebenner, D. (2009). Norm- and criterion-referenced student growth. Educational Measurement: Issues and Practice, 28(4), 42–51.


  6. Castellano, K. E., & Ho, A. D. (2015). Practical differences among aggregate-level conditional status metrics: from median student growth percentiles to value-added models. Journal of Educational and Behavioral Statistics, 40(1), 35–68.


  7. Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014a). Measuring the impacts of teachers I: evaluating bias in teacher value-added estimates. American Economic Review, 104(9), 2593–2632.

  8. Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014b). Measuring the impacts of teachers II: teacher value-added and student outcomes in adulthood. American Economic Review, 104(9), 2633–2679.

  9. Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. J. (2013). The sensitivity of value-added estimates to specification adjustments: evidence from school- and teacher-level models in Missouri. Statistics and Public Policy, 1(1), 19–27.

  10. Goldhaber, D., Walch, J., & Gabele, B. (2013). Does the model matter? Exploring the relationship between different student achievement-based teacher assessments. Statistics and Public Policy, 1(1), 28–39.


  11. Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value added: principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96–104.


  12. Guarino, C., Reckase, M., Stacy, B., & Wooldridge, J. (2015). A comparison of student growth percentile and value-added models of teacher performance. Statistics and Public Policy, 2(1). doi:10.1080/2330443X.2015.1034820.

  13. Harris, D. N., Ingle, W. K., & Rutledge, S. A. (2014). How teacher evaluation methods matter for accountability: a comparative analysis of teacher effectiveness ratings by principals and teacher value-added measures. American Educational Research Journal, 51(1), 73–112.

  14. Hill, H., Charalambos, C., & Kraft, M. (2012). When rater reliability is not enough: teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–64.


  15. Ingle, W. K., Rutledge, S. A., & Bishop, J. L. (2011). Context matters: principals’ sense-making of teacher hiring and on-the-job performance. Journal of Educational Administration, 49, 579–610.

  16. Jacob, B. A., & Lefgren, L. (2008). Can principals identify effective teachers? Evidence on subjective performance evaluation in education. Journal of Labor Economics, 26(1), 101–136.


  17. Kane, T. J., McCaffrey, D. F., Miller, T., & Staiger, D. O. (2013a). Have we identified effective teachers? Validating measures of effective teaching using random assignment. Seattle, WA: Bill and Melinda Gates Foundation.

  18. Mashburn, A., Meyer, J., Allen, J., & Pianta, R. (2014). The effect of observation length and presentation order on the reliability and validity of an observational measure of teaching quality. Educational and Psychological Measurement, 74(3), 400–422.


  19. McCaffrey, D. F., Sass, T. R., Lockwood, J. R., & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance and Policy, 4(4), 572–606.


  20. Reardon, S. F., & Raudenbush, S. W. (2009). Assumptions of value-added models for estimating school effects. Education Finance and Policy, 4(4), 492–519.


  21. Rockoff, J., Staiger, D. O., Kane, T. J., & Taylor, E. (2012). Information and employee evaluation: evidence from a randomized intervention in public schools. American Economic Review, 102(7), 3184–3213.

  22. Rothstein, J. (2009). Student sorting and bias in value-added estimation: selection on observables and unobservables. Education Finance and Policy, 4(4), 537–571.


  23. Rothstein, J. (2010). Teacher quality in educational production: tracking, decay, and student achievement. Quarterly Journal of Economics, 125(1), 175–214.


  24. Schochet, P. Z., & Chiang, H. S. (2010). Error rates in measuring teacher and school performance based on student test score gains. Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.


  25. Walsh, E., & Isenberg, E. (2015). How does value added compare to student growth percentiles? Statistics and Public Policy, 2(1). doi:10.1080/2330443X.2015.1034390.

  26. Whitehurst, G., Chingos, M., & Lindquist, K. (2014). Evaluating teachers with classroom observations: lessons learned in four districts. Brown Center on Education Policy at Brookings.


Author information



Corresponding author

Correspondence to Derek C. Briggs.



Cite this article

Briggs, D.C., Dadey, N. Principal holistic judgments and high-stakes evaluations of teachers. Educ Asse Eval Acc 29, 155–178 (2017).



  • Teacher evaluation
  • Principal judgment
  • Accountability
  • Growth modeling
  • Observation protocols
  • Value-added modeling