Classroom observation systems in context: A case for the validation of observation systems
Researchers and practitioners sometimes presume that using a previously “validated” instrument will produce “valid” scores; however, contemporary views of validity suggest that there are many reasons this assumption can be faulty. To demonstrate some of the problems with this assumption, and to support comparisons of different observation protocols across contexts, we introduce and define the conceptual tool of an observation system. We then describe psychometric evidence for a popular teacher observation instrument, Charlotte Danielson’s Framework for Teaching, in three use contexts: a lower-stakes research context, a lower-stakes practice-based context, and a higher-stakes practice-based context. We find that, despite sharing a common instrument, the three observation systems and their associated use contexts combine to produce different average teacher scores, variation in score distributions, and different levels of precision in scores. However, all three systems produce higher average scores in the classroom environment domain than in the instructional domain, and all three sets of scores support a one-factor model, whereas the Framework posits four factors. We discuss how the dependencies between aspects of observation systems and practical constraints leave researchers with significant validation challenges and opportunities.
Keywords: Validity, Teacher evaluation, Observation systems, Factor analyses, Teaching quality
This study was supported by grants from the W.T. Grant Foundation (Grant # 181068) and the Bill and Melinda Gates Foundation (Grant # OPP52048). For making the data available for this study, we thank administrators, teachers, and staff from Los Angeles Unified School District (LAUSD) and three large southern districts. The opinions expressed herein are those of the authors and not of the funding agencies or participants.