1 Introduction

Teaching quality has been researched extensively in recent years, with a large number of empirical studies in educational sciences and psychology (Bill and Melinda Gates Foundation, 2012; Charalambous & Praetorius, 2018), as it is regarded as an important mediator between teacher competence and student learning (Baumert et al., 2010; Blömeke et al., 2022; Nilsen et al., 2018). To better understand how learning develops in the classroom, scholars are concerned with the reliable and valid measurement of teaching quality (e.g., Hill et al., 2012; Jentsch et al., 2020; Ryan et al., 1995). Helmke (2012) considers classroom observation the “gold standard” among ways of capturing teaching quality (e.g., student ratings in large-scale assessments) because it assesses teaching practices directly.

However, current research has shown that classroom observation suffers from several methodological issues (e.g., segment length, rater bias, measurement error, stability over time; Leckie & Baird, 2011; Mashburn et al., 2014; Praetorius et al., 2014; White & Klette, 2023), one of which is mode effects (Casabianca et al., 2013; Jaeger, 1993). Mode effects are differences in scores that are due to the observation mode rather than to true variation in the latent construct. They are therefore a potential threat to the validity of the inferences drawn from the data (Bell et al., 2012; Kane, 2013). If mode effects occur, live and video ratings might not yield the same findings, even though the same frameworks or measures are applied to capture teaching quality.

To the best of our knowledge, only one study has so far rigorously analyzed the differences between live and video ratings of teaching quality (Casabianca et al., 2013; see also Frederiksen et al., 1992). Casabianca and colleagues (2013) investigated variation across observation modes in 82 US classrooms during one year of schooling. Ratings were obtained with the established Classroom Assessment Scoring System (CLASS, Pianta et al., 2008), which captures three generic dimensions of teaching quality (classroom organization, emotional support, instructional support). The authors found that video ratings were slightly higher, which was explained by a time lag of about 100 days between live and video scoring. They concluded that the observed differences were not due to observation modes but to raters’ increased experience over time. Additional correlation analysis showed that live and video ratings resulted in similar rankings of classrooms.

In the present study, we apply a design similar to that of Casabianca et al. (2013) to analyze live and video ratings of teaching quality. However, our study is set in a different educational context (i.e., mathematics classrooms in secondary schools in a German metropolitan area) and draws on a hybrid conceptual framework (Charalambous & Praetorius, 2018) that also takes subject-specific characteristics of teaching quality into account. We investigate to what extent differences between observation modes are associated with how scores are assigned to classrooms or lessons (absolute decisions), how classrooms or lessons are ranked (relative decisions), and the measurement error and generalizability of teaching quality scores (Cronbach et al., 1972).

2 Conceptual framework

2.1 Teaching quality in mathematics classrooms

Following the TIMSS 1995 Video Study (Stigler et al., 1999), German educational researchers have developed a generic framework of teaching quality with three basic dimensions (Klieme et al., 2006): classroom management, student support, and (potential for) cognitive activation. The three basic dimensions have been shown to relate positively to students’ achievement in mathematics classrooms across several studies and various operationalizations (e.g., Baumert et al., 2010; Lipowsky et al., 2009; for an overview see Praetorius et al., 2018). Classroom management refers to teachers’ procedures and strategies that enable efficient use of time (time on task), as well as behavioral management (Helmke, 2012; Kounin, 1970). Student support draws on self-determination theory (Deci & Ryan, 1985) and aims at motivational and emotional support, as well as individualization and differentiation. Cognitive activation, finally, addresses opportunities for “higher-order thinking” from a socio-constructivist perspective on teaching and learning (e.g., problem-solving, Mayer, 2004; Shuell, 1993). According to Klieme and Rakoczy (2008), cognitive activation should be operationalized with regard to subject-specific differences to better understand student learning in the corresponding domains (e.g., modelling tasks in mathematics vs. classroom discourse in history vs. text-based instruction in language arts). In a similar vein, scholars in mathematics education have argued that generic operationalizations of the three basic dimensions might not address all the characteristics that are relevant to teaching quality in mathematics classrooms (e.g., general mathematical competencies according to national curricular standards, Blum et al., 2015, or mathematical correctness, Brunner, 2018; see also Learning Mathematics for Teaching Project, 2011).

Empirical evidence suggests that generic and subject-specific measures of teaching quality generate moderately correlated, yet distinct, information about classrooms (Kane & Staiger, 2012). Evaluating this finding, Charalambous and Praetorius (2018) conclude that subject-specific and generic measures together could explain more variance in student learning in mathematics than generic measures alone. Since subject-specificity might be considered a continuum rather than a binary characteristic, they argue that it could be meaningful for scholars to develop hybrid frameworks of teaching quality that take both the generic and the subject-specific perspective into account.

In the present study, we apply such a hybrid framework (Schlesinger et al., 2018). It draws on the three basic dimensions but adds a fourth dimension (mathematics educational structuring, Jentsch et al., 2021; see also Drollinger-Vetter, 2011; Kleickmann et al., 2020) to capture additional characteristics of teaching quality that are relevant to student learning in secondary mathematics classrooms. Mathematics educational structuring refers to teaching practices that provide cognitive or instructional support to students when building up knowledge (e.g., mathematical correctness, explanations, consolidation). This dimension complements the three basic dimensions with teachers’ efforts to adapt cognitive challenges to students’ individual characteristics and is closely connected to scaffolding in instruction (e.g., Van de Pol et al., 2010). Jentsch et al. (2021) provide empirical evidence for a four-dimensional structure underlying observer ratings, corresponding to the three basic dimensions and mathematics educational structuring (see Kleickmann et al., 2020, for a similar finding in science classrooms using student ratings). Furthermore, they find that mathematics educational structuring is also related to mathematics teachers’ competence. Blömeke et al. (2022) show that teaching quality as modeled with this framework is connected to student achievement in mathematics.

2.2 Validity arguments for classroom observation

Modern validity theory (Bell et al., 2012; Kane, 2013) states that measures usually serve a specific purpose, and that such purposes must be considered to evaluate the validity of test scores in a meaningful way. We argue, alongside Casabianca et al. (2013), that classroom observation can serve at least two different purposes, and that these are related to the unit of analysis (e.g., classroom, lesson). Classroom-based (in contrast to lesson-based) conclusions refer to classrooms as the unit of analysis in teaching quality research. They typically serve long-term decisions that require generalizing teaching practices over a period involving many lessons, for instance when connecting teaching quality measures to student learning. Lesson-based conclusions, on the other hand, refer to lessons as the unit of analysis; they are drawn to provide feedback to teachers on a particular topic or classroom setting. As both are widely used in educational research, policy, and practice, we take these two perspectives into account in the present study.

To develop a validity argument specifically suited for measures of teaching quality, Bell et al. (2012) discuss scoring, generalization, extrapolation, and implication as the four main inferences drawn from classroom observation. The scoring assumption refers to the appropriateness, accuracy, and consistency of the scoring procedure. Generalization means that “the sample of teaching observed is representative of all the instances of teaching to which one wants to generalize” (Bell et al., 2012, p. 67). Extrapolation relates teaching quality scores to other meaningful concepts within a theory (e.g., see the offer-use model by Helmke, 2012). Finally, the implication inference connects teaching quality scores to decisions that are based on them (e.g., pass/fail grades in practical teacher examinations).

The generalization inference is particularly important because it refers to the degree to which observed scores reflect the targeted construct rather than unintended sources of variation (e.g., mode effects). To this end, researchers should provide evidence that the inferences drawn from scores observed in a study do not largely depend on the conditions under which the study was conducted. Generalizing across observation modes therefore makes the claim that the corresponding scores do not differ substantially with respect to scoring distribution, ranking of lessons or classrooms, measurement error, or reliability.

2.3 Classroom observation mode: live versus video scoring

Observation mode matters for the assessment of teaching quality because data collection procedures differ between educational research and practice. For pragmatic reasons, live scoring is usually performed in educational practice (e.g., school inspection), whereas educational research often applies video scoring. Live scoring has the advantage that observers are physically present in the classroom, while video has the benefit that it can be watched repeatedly. Beyond the possibility of obtaining multiple ratings (e.g., to decrease measurement error), teachers may find video useful for professional development activities, as they are able to evaluate their performance on their own or with peers (Brunvard, 2010; Sherin & Han, 2004; Van Es & Sherin, 2010). However, capitalizing on these benefits with a framework that was originally developed for live scoring (or vice versa) requires careful evaluation of mode effects, as these might imply increased measurement error or bias.

Casabianca et al. (2013) argue that live and video scoring differ in how raters access information on the lessons they are observing. For instance, during live observation, raters can in principle pay attention to any action of students and teachers, as well as to tasks and materials, at all times. While this may include ambient audio information, raters’ possibilities to capture one-to-one conversations (i.e., teacher–student or student–student) could be limited. On the one hand, being able to observe all students at all times is important for adequately scoring classroom management because students’ individual time on task can be taken into account. Ambient audio, on the other hand, might serve as contextual information for adequately considering potential disruptions during scoring. Both pieces of information could also help raters to understand to what extent students are cognitively activated in the classroom. For instance, if raters can observe students’ reactions to potentially challenging tasks, they might be able to capture the amount of productive struggle that students are involved in.

During video observation, raters’ attention is necessarily drawn to what the cameras have captured. This might increase standardization, because raters receive similar information at the same time, and could therefore lead to higher reliability of scores. Moreover, the available audio information differs from live scoring: in most settings teachers are equipped with additional microphones, which ensures that teachers’ voices are always heard (Casabianca et al., 2013). This could affect raters’ ability to score how teachers support individual students (e.g., feedback, scaffolding), because these practices usually occur during one-to-one conversations or group work. Thus, raters might benefit from the additional audio information that is accessible to them during video scoring.

2.4 Research questions

Mode effects could lead to different interpretations of scores on the same construct and, following Bell et al. (2012), therefore pose a threat to validity. The goal of our study is to investigate to what extent ratings based on our teaching quality framework depend on whether live or video scoring is applied. Given the hybrid nature of the framework, it is also of interest whether mode differences are more likely to occur with generic or with subject-specific dimensions of teaching quality. As standardized classroom observations can be used for both absolute and relative decisions, we analyze differences in teaching quality mean scores as well as differences in the rank orders of lessons and classrooms (i.e., correlation analysis). This is done to explore the degree to which teaching quality scores are associated with observation mode. We address the following research questions (Casabianca et al., 2013), focusing on differences in mode effects between generic and subject-specific teaching quality dimensions:

  1. Do raters use the scale of our observational instrument differently across modes? Are there differences in scale use between generic and subject-specific dimensions?

  2. To what extent do live and video scores rank lessons or classrooms differently? Are these rankings different for generic and subject-specific dimensions?

  3. To what extent do sources of variance (i.e., classrooms, lessons, segments, and raters) compare between scoring modes? Are there differences in variance decompositions between generic and subject-specific dimensions?

  4. What are the implications for measurement error and reliability of live and video scores regarding classroom-based as well as lesson-based decisions? Are these implications different for generic and subject-specific dimensions?

3 Methods

3.1 Participants

Data were collected from a subsample of the Teacher Education and Development Study–Instruct (TEDS-Instruct). Both TEDS-Instruct and the present study took place in secondary school mathematics classrooms (years 7–10) in a German metropolitan area. A convenience (i.e., non-random) subsample of the TEDS-Instruct participants took part in this follow-up study. We observed and video-recorded two 90-minute lessons in every classroom between December 2016 and May 2017, usually within two weeks. Fifteen licensed mathematics teachers participated, eight of whom were female and seven male. The teachers’ median age was 36 years (min = 28, max = 71), and they had been teaching for six years on average (min = 0.5, max = 30).

3.2 Observational instrument

The observational instrument was developed within TEDS-Instruct and consists of 21 high-inference items (see Table 1). It captures three basic dimensions (classroom management, student support, cognitive activation, Praetorius et al., 2018) and mathematics educational structuring, covering more subject-specific characteristics of teaching quality (Jentsch et al., 2021). Raters assign scores on a four-point scale (from 1: very low teaching quality, through 4: very high teaching quality). Classroom management is assessed with three items (e.g., time on task, Cronbach’s α = 0.87). Student support is captured with four items (e.g., dealing with heterogeneity, α = 0.73). Cognitive activation is measured with seven items (e.g., challenging questions, α = 0.80). Finally, mathematics educational structuring is also captured with seven items (e.g., mathematical correctness, α = 0.81). Additional information on the development of the observational instrument can be found in Schlesinger et al. (2018) and Jentsch et al. (2021).

Table 1 Scoring distribution by item, dimension and mode (percentages)

3.3 Scoring procedure

Lesson scoring was conducted by six extensively trained raters. All of them were student teachers or PhD students in a mathematics education program and had obtained at least a Bachelor’s degree. The training took 30–40 h and consisted of both live and video scoring, peer discussions, and more theoretical work involving the manual for the observational instrument and additional literature. In a pilot study, high rater reliability was reached for all items of the observational instrument (ICC > 0.80). Scoring was performed four times per lesson (approx. every 22 min; see Mashburn et al., 2014, for a discussion of the potential benefits for reliability and validity), and we ensured that live and video scoring took place at the same time points within a lesson. All lesson segments were double-coded (i.e., two independent scores are available for every segment). In addition, raters were allowed to change their scores after the lesson had ended, based on the manual for the observational instrument and peer discussion.

All lessons were scored under both observation modes, with different raters per mode. Otherwise, procedures were the same for live and video scoring: to increase standardization across observation modes, raters were neither allowed to stop the videos during scoring nor to move around in the classroom. Due to practicalities, however, it was not possible to assign raters randomly to classrooms, lessons, or observation modes (as, e.g., in a random block design). This resulted in an uneven distribution of raters across modes, with four raters being assigned more frequently to live scoring and two raters working mainly with video scoring.

Video scoring was performed within two weeks after the corresponding live observations had taken place to minimize rater drift (Casabianca et al., 2013). For the recordings, two cameras and a teacher microphone (lavalier) were used: a static wide-angle camera captured the whole class, and a second camera followed the teacher.

3.4 Statistical analysis

Statistical analysis was performed with IBM SPSS 26 and consisted of three steps. First, we examined the scoring distributions for all items with respect to observation mode. Second, mean differences as well as bivariate correlations across modes and teaching quality dimensions were estimated, which involved both lesson-level and classroom-level scores. Mean differences were investigated for statistical significance with paired t tests.
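To illustrate this second step, the following minimal sketch shows how such mode comparisons could be computed in Python; the study itself used IBM SPSS 26, and the file name, column names, and aggregation level shown here are assumptions for illustration only.

```python
# Illustrative sketch only; the study used IBM SPSS 26.
# File and column names ("segment_scores.csv", "classroom", "mode",
# "cognitive_activation") are hypothetical.
import pandas as pd
from scipy import stats

scores = pd.read_csv("segment_scores.csv")  # one row per rated segment

# Aggregate segment scores for one dimension to the classroom level, by mode.
means = (scores
         .groupby(["classroom", "mode"])["cognitive_activation"]
         .mean()
         .unstack("mode"))  # columns: "live", "video"

# Paired t test for mean differences and a cross-mode correlation.
t_res = stats.ttest_rel(means["live"], means["video"])
r, _ = stats.pearsonr(means["live"], means["video"])
print(f"t({len(means) - 1}) = {t_res.statistic:.2f}, "
      f"p = {t_res.pvalue:.3f}, cross-mode r = {r:.2f}")
```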

As reported above, raters were not distributed evenly across modes. In the case of rater effects, mean differences could occur across modes that are in fact due to raters using the scales differently. Adjusting for these effects is therefore crucial in the present study. To do so, we estimated separate mixed models for every teaching quality dimension with fixed effects for raters and observation mode, as well as random classroom effects.

Following Casabianca and colleagues (2013), we also looked at time trends in the scoring of teaching quality. Raters might change how they assign scores to lesson segments over time, and this could be a confounder when investigating mode effects. We estimated linear mixed models involving fixed effects for observation mode, time (months) and the interaction between mode and time, as well as random effects for classrooms.
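A hedged sketch of how these two adjustment models could be specified is given below, using Python's statsmodels rather than the SPSS MIXED procedure actually employed; the data frame and variable names are illustrative assumptions, and a separate model would be fitted for each teaching quality dimension.

```python
# Illustrative sketch only; the study used IBM SPSS 26 (MIXED).
# The data frame "scores" and its columns are hypothetical; "score" holds
# the segment scores of one teaching quality dimension.
import pandas as pd
import statsmodels.formula.api as smf

scores = pd.read_csv("segment_scores.csv")

# Rater-adjusted mode effect: fixed effects for observation mode and rater,
# random intercepts for classrooms.
rater_model = smf.mixedlm("score ~ C(mode) + C(rater)",
                          data=scores,
                          groups=scores["classroom"]).fit()

# Time-trend model: fixed effects for mode, time (month), and their
# interaction, again with random classroom intercepts.
time_model = smf.mixedlm("score ~ C(mode) * month",
                         data=scores,
                         groups=scores["classroom"]).fit()

print(rater_model.summary())
print(time_model.summary())
```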

Third, mode-specific Generalizability and Decision studies (G and D studies; Brennan, 2001; Cronbach et al., 1972; Shavelson & Webb, 1991) were conducted to provide an in-depth analysis of the dependability of the corresponding scores. G theory is an approach to decompose the observed variability in scores with respect to study conditions by performing analysis of variance (Brennan, 2001). The resulting variance decomposition provides insights into potential sources of measurement error, as it allows for distinguishing between wanted variability in scores (e.g., differences in teaching quality between lessons or classrooms) and unwanted variability (e.g., rater bias). A D study (Shavelson & Webb, 1991) is an exploratory simulation study based on the results of a G study. It estimates how the study conditions affect measurement error and reliability, similarly to the Spearman-Brown formula (Cronbach et al., 1972). In this study, we explore for each mode how teaching quality varies with respect to classrooms, lessons, segments, and raters. We estimate measurement error as well as reliabilities for live and video ratings and discuss how these could be improved under different study conditions.

We estimated random effects for classrooms, lessons, segments, and raters, as well as interactions between classrooms and raters. Regarding D studies, we investigated the potential to decrease measurement error by varying the number of observed lessons (2, 4) and segments (2, 4, 8). As only a negligible amount of variance was due to rater effects, we refrained from also conducting D studies that varied regarding the number of raters.
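To make the logic of these D studies explicit, the following equation sketches a generalizability coefficient for relative, classroom-based decisions under a design of this kind; it is a standard G-theory formulation (Brennan, 2001) rather than the study's exact estimator, since the precise composition of the error term depends on the chosen D-study design.

$$
E\rho^2_{\text{classroom}} \;=\; \frac{\sigma^2_{c}}{\sigma^2_{c} + \dfrac{\sigma^2_{l:c}}{n_l} + \dfrac{\sigma^2_{s:lc}}{n_l\, n_s} + \dfrac{\sigma^2_{cr}}{n_r} + \dfrac{\sigma^2_{\text{res}}}{n_l\, n_s\, n_r}}
$$

Here $\sigma^2_{c}$, $\sigma^2_{l:c}$, $\sigma^2_{s:lc}$, $\sigma^2_{cr}$, and $\sigma^2_{\text{res}}$ denote variance components for classrooms, lessons within classrooms, segments within lessons, the classroom-by-rater interaction, and the residual, while $n_l$, $n_s$, and $n_r$ are the numbers of lessons, segments, and raters in the projected design. Increasing $n_l$ or $n_s$ shrinks the corresponding error terms, which is what varying the numbers of lessons (2, 4) and segments (2, 4, 8) in the D studies explores.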

4 Results

4.1 Scoring distributions, time trends, and bivariate correlations

Table 1 provides the scoring distributions across modes as well as across generic and subject-specific teaching quality dimensions. We see that raters do not use the full breadth of the rating scale, but this appears to be similar across modes. For classroom management, a ceiling effect can be observed: the lowest score is almost never applied, while the highest score is given most often. The highest score was assigned even more frequently to video-recorded lesson segments, yielding percentage differences of 12–24% across modes for the items assessing classroom management. For student support, raters mostly apply lower scores in both observation modes, and the highest score is rarely used. When scoring cognitive activation, raters seem to make wider use of the four-point scale during video scoring, as the ratings are more evenly distributed than in live scoring (percentage differences for the lowest score: 1–17%). For most items assessing cognitive activation, lower scores predominate as well. However, we do not identify a clear picture for mathematics educational structuring: several items have a similar distribution across modes (e.g., correctness, co-construction), whereas for others the scores vary to a larger extent (e.g., structure, explanations).

4.1.1 Mean differences

Table 2 shows descriptive statistics and correlations across modes after item scores were aggregated to dimensions and then to the lesson level. The mean differences reported in Table 2 for classroom management (live vs. video: Mdiff = -0.18, SE = 0.06, t(14) = -2.92, p = .011, Cohen’s d = -0.76) and cognitive activation (Mdiff = 0.17, SE = 0.04, t(14) = 3.97, p = .001, d = 1.02) are statistically significant, while those for student support (Mdiff = 0.07, SE = 0.08, t(14) = 0.83, p = .418, d = 0.22) and mathematics educational structuring (Mdiff = -0.02, SE = 0.06, t(14) = -0.36, p = .725, d = -0.09) are not. According to Cohen’s classification (Cohen, 1992), the significant effects are moderate to large. Adjusting for rater differences across modes yields similar results (mode effect live vs. video for classroom management: Mdiff = -0.11, SE = 0.04, p = .002; student support: Mdiff = 0.09, SE = 0.05, p = .070; cognitive activation: Mdiff = 0.16, SE = 0.05, p < .001; mathematics educational structuring: Mdiff = 0.02, SE = 0.03, p = .411). Thus, video-recorded lessons received higher scores in classroom management and lower scores in cognitive activation, which suggests mixed results regarding generic versus subject-specific dimensions of teaching quality.
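For transparency, the reported effect sizes are consistent with the standard paired-samples definition of Cohen's d (our reconstruction; the formula is not stated explicitly in the text):

$$
d \;=\; \frac{M_{\text{diff}}}{s_{\text{diff}}} \;=\; \frac{t}{\sqrt{n}},
$$

so that, for example, classroom management yields $d \approx -2.92/\sqrt{15} \approx -0.75$, matching the reported value of -0.76 up to rounding.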

Table 2 Descriptives and bivariate correlations by dimension and mode, lesson-level. Note. Correlations were obtained from 1,000 bootstrap samples (Efron & Tibshirani, 1993). CM = Classroom Management, SS = Student Support, CA = Cognitive Activation, MS = Mathematics Educational Structuring

4.1.2 Time trends

Figures 1 and 2 show time trends in the scoring of the four teaching quality dimensions across observation modes. We see that particularly the subject-specific dimensions evolve differently over time, with slightly lower scores in the third month for video-recorded lessons. For classroom management, we observe more variation over time in the live ratings. However, after adjusting the reported mean differences for time trends, we obtain similar findings, with an additional small effect for mathematics educational structuring (mode effect live vs. video for classroom management: Mdiff = -0.28, SE = 0.04, p < .001; student support: Mdiff = 0.07, SE = 0.06, p = .107; cognitive activation: Mdiff = 0.25, SE = 0.04, p < .001; mathematics educational structuring: Mdiff = 0.08, SE = 0.03, p = .015).

Fig. 1 Time trends for live observation by dimension. Note. The x-axis represents the month of the lesson scoring and the y-axis the average score across lessons. CM = Classroom Management (top line), SS = Student Support (bottom line), CA = Cognitive Activation, MS = Mathematics Educational Structuring

Fig. 2 Time trends for video observation by dimension. Note. The x-axis represents the month of the lesson scoring and the y-axis the average score across lessons. CM = Classroom Management (top line), SS = Student Support (bottom line), CA = Cognitive Activation, MS = Mathematics Educational Structuring

4.1.3 Correlations

Estimating bivariate correlations for the teaching quality dimensions both across and within modes (see Table 2), we see that cognitive activation (r = .78) and mathematics educational structuring (r = .73) reach cross-mode values close to what is usually considered acceptable reliability in the social sciences. The association between live and video scores is slightly lower for classroom management (r = .63), and lower still for student support (r = .45). However, as the D studies will show, these correlations are close to the estimated reliabilities of the corresponding teaching quality dimensions. This suggests that live and video scoring result in similar lesson rankings once measurement error is accounted for (i.e., disattenuated correlations are 0.83 for classroom management, 0.76 for student support, 1.00 for cognitive activation, and 0.85 for mathematics educational structuring).
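The disattenuated values follow the classical correction for attenuation, sketched here with the mode-specific reliabilities treated as placeholders:

$$
r_{\text{corrected}} \;=\; \frac{r_{\text{live,video}}}{\sqrt{\rho_{\text{live}}\;\rho_{\text{video}}}},
$$

where $\rho_{\text{live}}$ and $\rho_{\text{video}}$ denote the reliabilities of the live and video scores for the respective dimension. For illustration, an observed cross-mode correlation of .78 combined with mode-specific reliabilities of roughly .75 to .80 would yield a disattenuated value of about 1.00, as reported for cognitive activation.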

Table 2 also provides correlations within modes, which reveal further differences between live and video ratings. First, in the live ratings, classroom management is associated with cognitive activation and mathematics educational structuring at moderate to large effect sizes, while there is no correlation with student support. For video-recorded lessons, the associations between classroom management and the other dimensions are of similar size. Second, student support and cognitive activation are moderately related in the live setting, but not at all in video observation. This suggests that the teaching quality dimensions correlate differently across modes, which needs further investigation in future research.

4.2 Generalizability and decision studies

Table 3 provides the results of G studies with random effects for classrooms, lessons, segments, and raters. The variance decompositions explain more than 85% of the total variance in the teaching quality dimensions, which shows that a large share of the variability in scores is due to the investigated sources of variation. In accordance with the mixed models presented above, we find only a small amount of variance due to rater effects (0–7%, main and interaction effects summed up). Large mode differences occur for classroom management, where variation between classrooms explains about twice as much variance in the live as in the video ratings (47% vs. 23%). In contrast, variation between lesson segments within classrooms explains less variance in classroom management in the live ratings than in the video ratings (37% vs. 59%). For student support, the variance decomposition yields very similar results across modes. For cognitive activation and mathematics educational structuring, the between-lesson and within-lesson components differ across observation modes, with the former being larger in live observations (11% vs. 4%, 15% vs. 2%) and the latter yielding higher percentages for video recordings (32% vs. 51%, 18% vs. 22%).

Table 3 Variance decomposition of scores by dimension and mode (percentages of total variability in parentheses)

4.2.1 D studies

Table 4 presents D studies for various research designs, including both lesson-based and classroom-based decisions. For live ratings under the present study conditions (two lessons per classroom, four segments per lesson), classroom management, cognitive activation, and mathematics educational structuring reach reliabilities close to or larger than 0.80, which is usually considered acceptable. For video scoring and lesson-based decisions, the reliabilities are lower, and no acceptable values are obtained for classroom management. For student support, we do not obtain acceptable reliabilities either, and at least four lessons (live ratings, classroom-based decisions) or eight segments (video ratings, lesson-based decisions) would be necessary to reach values above 0.70. In sum, this suggests that under the conditions of the present study, live observation of classroom management and of subject-specific characteristics of teaching quality yields good reliabilities for both classroom-based and lesson-based decisions. For the video ratings, this only holds for cognitive activation and mathematics educational structuring, and even for cognitive activation the reliability is questionable for lesson-based decisions.

Table 4 Decision studies by dimension and mode for various numbers of lessons and segments

5 Discussion

In the present study, we analyzed live and video ratings obtained with a hybrid framework of teaching quality involving both generic and subject-specific instructional practices (Schlesinger et al., 2018). We investigated how observation mode could influence absolute decisions (the scores raters assign to classrooms or lessons), relative decisions (rankings of classrooms and lessons), as well as measurement error and generalizability. Every lesson in our study was rated under both observation modes (i.e., live and video scoring).

Classroom management scored lower in live ratings, and video ratings resulted in unacceptable reliability on this dimension because of large segment variability within lessons; increasing the number of observed segments per lesson could still improve this reliability. Cognitive activation scored higher in live ratings, but the reliability results were very similar across modes. For student support, we did not find any mean differences between live and video ratings, and variation in reliabilities was negligible, too. The results for mathematics educational structuring were similar, with larger variation between lessons for the live ratings. This is an unexpected result, as we had assumed larger measurement error for student support and mathematics educational structuring in live scoring. One reason may be that raters can assess the interactions between teachers and students more accurately during video scoring because of the teacher microphone (i.e., discussions can be heard and scored accordingly, which may not be the case for live ratings). Further research is necessary to shed light on how observation mode affects the assessment of scaffolding and supportive teaching practices.

Although some effect sizes are large, we should acknowledge that the mean differences are presented in the original metric and therefore amount to at most a quarter of a scale point (on a scale from 1 through 4), which is only slightly more than the estimated standard error (see Table 4). In a less homogeneous sample, variability across modes would therefore likely be less prominent. We conclude that the mean differences in classroom management could be due to differences in the perceived volume of teachers’ and students’ voices across modes. When listening to recordings from the teacher microphone, raters might have difficulty judging the volume level in the classroom (i.e., students’ voices could be perceived as quieter than they actually were), causing bias at that end. Consequently, raters could perceive disturbances as less problematic during video scoring. At the same time, video scoring of cognitive activation could be more difficult for raters, as it might be unclear what students are working on, and gestures and small movements may not be clearly visible in the video. During live observation, it is more likely that raters assess student discourse or problem-solving activities accurately.

Overall, we found that live and video ratings led to very similar rankings of lessons and classrooms regarding teaching quality. Regarding the intercorrelations within modes (see Table 2), the differences in how classroom management is associated with cognitive activation and mathematics educational structuring could be explained by measurement error: we found poor reliability for video ratings of classroom management, which leads to underestimated correlations with other variables. However, this does not explain how classroom management is associated with student support, where the correlation is negative in the video setting and virtually zero in the live setting. Regarding the association between student support and cognitive activation, we believe tasks might be perceived as less cognitively activating when the level of student support is high, which would indicate that students were less involved in higher-order thinking. This interpretation is supported by a recent study using the same data (Benecke & Kaiser, 2023), which found that teachers provided more content-related than merely strategic help to students. Again, the teacher microphone could explain why raters perceive differences in student support differently across observation modes.

The study by Casabianca et al. (2013) has so far been the only one to explore differences in teaching quality ratings with respect to observation mode, and its findings regarding the ranking of classrooms and lessons were very similar to those of the present study. In contrast to Casabianca et al. (2013), however, we did not find that mean differences across modes could be explained by time trends in the scoring procedure: they remained statistically significant for classroom management and cognitive activation after adjusting for rater and time differences, even though the time trends were marginal in this study. Future studies could explore further aspects of the study design that might influence observer ratings in the classroom (e.g., by exploring how different kinds of measures depend on observation mode or by comparing raters with varying amounts of experience or content knowledge).

5.1 Limitations

This study was set in a particular Western European context (i.e., mathematics classrooms in secondary schools in a German metropolitan area), which has probably shaped our view on teaching and learning accordingly. We acknowledge that our data stem from a convenience sample that likely represents a positive selection of German mathematics teachers, because they volunteered to participate in our study. It might therefore be worthwhile to replicate our findings with a larger, random sample.

Another limitation is that we did not explore the full potential of video scoring in our study. Raters were asked not to stop the videos during scoring, nor were they allowed to watch videos more than once. We took this decision to increase standardization across scoring procedures and to focus the comparison on the different types of information that are inherently available to raters during live and video scoring, respectively. However, we understand that scholars often drop these restrictions when they use video scoring, and future research projects could take this into account by employing a three-arm design (e.g., live vs. restricted video vs. unrestricted video scoring). In doing so, it would be possible to explore whether being able to stop or rewind the video comes with additional benefits for reliability.

Finally, for pragmatic reasons we could not assign raters randomly to classrooms or modes (e.g., as in a random block design). This resulted in an uneven distribution of raters across modes, but the statistical analysis adjusted for rater main effects. A further limitation of our procedure is that raters were allowed to change their scores after the lesson had ended, based on the manual of the observational instrument. This might result in bias if some observers change their scores more often than others. However, König (2015) argues that this approach can also lead to higher reliability if observers score more closely to the manual.

5.2 Conclusions

Mode effects are a potential threat to validity in studies using classroom observation, because they can affect the scoring procedure as well as the conclusions drawn from scores. In the present study, we compared live and video scoring of teaching quality in German secondary mathematics classrooms with regard to absolute and relative decisions (i.e., scoring distributions and rankings of classrooms or lessons). Although relative decisions were only marginally affected, our findings suggest that the extent to which observation mode influences the precision of the scoring procedure depends on the teaching quality dimension rather than on the degree of subject-specificity: given our hybrid framework, live scoring of classroom management and cognitive activation should be preferred over video scoring, particularly for lesson-based decisions. Conversely, scoring teachers’ cognitive and instructional support to students (i.e., mathematics educational structuring) benefits from videotaping lessons, which is likely due to better audio capture. Special attention should be paid to within-lesson variance in future studies, as it may affect the validity of conclusions drawn from scores when long-term decisions are made. We therefore recommend that both researchers and practitioners carefully consider which conclusions they wish to draw from their data, and choose frameworks, instruments, and observation modes accordingly.