Abstract
In many countries, students are asked about their perceptions of teaching, and decisions about the further development of teaching practices are made on the basis of this feedback. The stability of these measurements of teaching quality is a prerequisite for generalizing the results to other teaching situations. The present study aims to expand the extant empirical body of knowledge on the effects of situational factors on the stability of students’ perceptions of teaching quality. Specifically, we investigate whether the degree of stability is moderated by three situational factors: the time between assessments, the subjects taught by teachers, and students’ grade levels. To this end, we analyzed data from a web-based student feedback system. The study involved 497 teachers, each of whom conducted two student surveys. We examined the differential stability of student perceptions of 16 teaching constructs, operationalized as latent correlations between aggregated student perceptions of the same teacher’s teaching. Tests of metric invariance indicated that student ratings provided measures of teaching constructs that were invariant across time, subjects, and grade levels. Stability was moderated to some extent by grade level, but not by the subject taught or by the time between surveys. The results provide evidence of the extent to which situational factors may affect the stability of student perceptions of teaching constructs. The generalizability of student feedback results to other teaching situations is discussed.

Notes
A total of 22 of these 96 teachers conducted both surveys in the same school year. In these cases, the very same class or a parallel class may have been surveyed. Based on the available data, the two cases cannot be distinguished. However, the general pattern of results for group B does not change if these 22 surveys are excluded from the analysis.
There was one exception: for achievement expectations across grade levels, ΔCFI = .011.
Subgroup B showed small increases in the mean values at the second measurement point (up to a maximum difference of 0.09 on the original scale from 1 to 4). However, in most cases (12 out of 16 comparisons), these differences did not reach statistical significance.
Appendices
Appendix A
Appendix B
Example Mplus input file for the construct clarity
Below is example syntax for the construct clarity and for the comparison of stability over time (model A vs. model B); for constructs with more items, the syntax is extended accordingly. All analyses were performed with the statistics program Mplus (version 7.4), and the estimation method is always maximum likelihood (ML). The grouping variable is always dichotomous and refers to the comparison over time (model A vs. B), across subjects (model A vs. C), or across grade levels (model A vs. D). The model test (Wald test) refers to the invariance testing.
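The original input file is not reproduced here; the following is an illustrative sketch of the two-group setup described above, under stated assumptions: the variable names (feedback.dat, group, cla1–cla4), the parameter labels (l1, l2, s1, s2), and the use of two items per measurement occasion are hypothetical, and factor variances are fixed at 1 so that the labeled factor covariance equals the latent stability correlation.

```
TITLE:    Clarity: stability over time (model A vs. model B);
DATA:     FILE = feedback.dat;            ! hypothetical file name
VARIABLE:
  NAMES = group cla1 cla2 cla3 cla4;      ! hypothetical item names
  USEVARIABLES = cla1-cla4;
  GROUPING = group (1 = modelA 2 = modelB);
ANALYSIS: ESTIMATOR = ML;
MODEL:
  clarT1 BY cla1* cla2 (l1-l2);  ! clarity, first survey
  clarT2 BY cla3* cla4 (l1-l2);  ! clarity, second survey; equal loadings over time
  clarT1@1; clarT2@1;            ! fix factor variances: covariance = correlation
  clarT1 WITH clarT2 (s1);       ! stability coefficient, group A
MODEL modelB:
  clarT1 WITH clarT2 (s2);       ! stability coefficient, group B
MODEL TEST:
  0 = s1 - s2;                   ! Wald test: equal stability across groups
OUTPUT:   STDYX;
```

By Mplus multigroup defaults, factor loadings are additionally held equal across the two groups, which corresponds to the metric invariance constraint the appendix refers to; the MODEL TEST statement then tests whether the stability coefficient differs between the groups.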

Cite this article
Gaertner, H., Brunner, M. Once good teaching, always good teaching? The differential stability of student perceptions of teaching quality. Educ Asse Eval Acc 30, 159–182 (2018). https://doi.org/10.1007/s11092-018-9277-5
