
1 Introduction

Beliefs about what constitutes ‘good’ or ‘high-quality’ practice in teaching can vary markedly across student age groups, historical periods, and contexts. ‘Effectiveness’ is a contested term that can evoke strong emotions because, in some countries, perceived effectiveness is linked with notions of professional competency and high-stakes accountability. Researchers may also question individual teachers’ beliefs about their professional autonomy. Notions of what constitutes high-quality or good teaching, or the idea that teaching is an art or a craft rather than a science, are sometimes invoked to raise concerns about narrower concepts of effectiveness.

Researchers recognise the importance of effective teaching behaviour for student outcomes, yet most teachers still struggle to implement complex teaching skills in their daily classroom practice. The Measures of Effective Teaching (MET) project (Bill & Melinda Gates Foundation, 2018) provided the research community with the most extensive classroom observation dataset to date, including an easily accessible video library for secondary data analysis. To our knowledge, this is the first study to connect two large-scale classroom observation studies through secondary data analysis of selected lesson videos using the same instruments.

2 Theoretical Framework

2.1 Examining Teacher Effectiveness Through Classroom Observations

Teacher effectiveness research is a branch of educational effectiveness research that focuses mainly on the effects of variations in teaching quality on student outcomes. Value-added measures, classroom observations, and student surveys are familiar sources of information about teachers’ behaviour and classroom practices that can provide evidence to inform our understanding of teacher effectiveness (Bacher-Hicks et al., 2019). If student outcomes are the essential criteria for teacher effectiveness, the question remains what kinds of outcomes, objectives, and goals teachers and schools can realistically achieve. We clearly cannot go on endlessly adding more objectives and more content for teachers and schools and still expect them to succeed. Conceptualising teacher effectiveness beyond the classroom level (e.g., Cheng, 1996; Cheng & Tsui, 1998) is not practically appealing to practitioners because such a conceptualisation could obscure the focus of the teacher’s role and duties in teaching. In practice, teachers and schools often prefer to restrict teacher evaluation to specific objectives in teaching.

For evaluating teaching quality, while gains in both cognitive and non-cognitive domains of student achievement are tentative criteria, a plea for meeting cognitive purposes and attaining higher academic achievement as the criterion for the effectiveness of schooling often sounds more appealing. Bacher-Hicks et al. (2019) argued that value-added measures are unbiased predictors of teacher performance under experimental conditions in which students are assigned randomly to different classrooms. However, value-added measures are not as unbiased as assumed because they tend to shift when different tests are used to assess student achievement (Grossman et al., 2014).

While a classroom observation approach cannot adjust for classroom composition, it has two obvious advantages beyond its ready applicability in naturally occurring settings. First, it allows ready comparisons across grades and subjects without relying on reliable, standardised tests. Second, it looks at teacher effectiveness from a different angle by allowing observers or evaluators to associate the observed behaviours with various aspects of the student learning process, such as student engagement in class and students’ self-reported behaviours or learning characteristics (e.g., Clunies-Ross et al., 2008; Helmke et al., 1986; Virtanen et al., 2015).

2.2 Comparisons of Classroom Observation Instruments

Classroom observation is a powerful method for collecting data on teacher behaviours in class. A standard method in the classroom observation approach to teacher effectiveness research is to have independent observers rate different teachers’ teaching practices. For example, this paper compared the results obtained from different high-inference instruments designed to capture aspects of teaching dimensions hypothesised to operate at the classroom level.

We can compare different classroom observation instruments of similar nature (i.e., for generic teaching behaviours) for different lessons in a single study (Day et al., 2008; Kington et al., 2014), different classroom observation instruments of similar nature for the same lessons in a single study (e.g., Ko, 2010; Ko et al., 2015; Lei et al., 2023; Sammons et al., 2014, 2016), and different classroom observation instruments of different nature (i.e., effective vs inspiring teaching behaviours) for different lessons in a single study (Ko et al., 2019a, b, 2016; Zhao & Ko, 2022).

However, the measurement strategy of teacher effectiveness in the MET project was unique as it compared different classroom instruments that differed in specificity. It involved comparisons of generic teacher behaviours by the Framework for Teaching (Danielson, 2013) and the Classroom Assessment Scoring System (CLASS) (Pianta et al., 2012) and of subject-specific ones by the Mathematical Quality of Instruction (MQI) (Hill et al., 2008; The Learning Mathematics for Teaching Project, 2011), the Protocol for Language Arts Teaching Observations (PLATO), the Quality of Science Teaching (QST) (Schultz & Pecheone, 2014), and the UTeach Teacher Observation Protocol (UTOP) (Walkington & Marder, 2014, 2018).

The challenges of developing and comparing quality teacher observation systems lie in establishing rater reliability and making the instruments more generalisable across contexts (Hill et al., 2012; Liu et al., 2019). For example, despite its wide application, the CLASS could not be adequately validated without revisions in Hafen et al. (2015), where the lesson videos were collected across various projects. Wallace et al. (2020) also reported that the CLASS failed to discriminate classroom management quality, with most teachers’ scores clustering around the most positive ranges of effectiveness. The present study differed from the heuristic comparison of classroom observation instruments by Bell et al. (2019) in that we observed the same lessons with different instruments. Bell et al.’s (2019) comparison was crude and non-quantitative, as all the instruments they compared shared ten similar teaching dimensions.

Secondary data analysis of the MET data should provide quantitative evidence for instrument comparisons, but to date, we have not found any study exploring this. We intended to fill this gap and were motivated to conduct secondary data analysis with the same classroom observation instrument used in the international collaborative ICALT3 project (Maulana et al., 2021), so that the new data on the selected sample would form part of the enlarged study to inform the measurement invariance of teaching quality (Krammer et al., 2020; Maulana et al., 2021).

2.3 Video Lesson Analysis

The TIMSS 1995 video study by Stigler et al. (1999) was a pioneer and exemplar in using video data to explore teaching characteristics and patterns cross-culturally, going beyond qualitative coding to provide quantitative analysis for hypothesis testing (Jacob et al., 1999; Stigler et al., 2000). The initial sample included 231 mathematics lessons from Germany, Japan, and the United States, selected from a nationally representative sample of eighth-grade students and classrooms participating in the 1994–95 TIMSS assessments. The TIMSS 1999 video study expanded to include Australia, the Czech Republic, Hong Kong SAR, the Netherlands, Switzerland, and the United States (Hiebert et al., 2003). In the companion video study on science teaching, the participating countries were Australia, the Czech Republic, Japan, the Netherlands, and the United States (Roth et al., 2006). However, no standardised observation instruments were developed or adopted for observations in these studies. Only a portion of the lesson videos is publicly available for secondary data analysis, so one purpose of our study was to provide new data for the ICALT3 project based on the MET data.

Apart from the MET project, only a few studies in the literature have used lesson videos for observation to inform teaching practice and performance (e.g., Hafen et al., 2015; Ko et al., 2015, 2016). Secondary data analysis makes instrument comparisons feasible when lessons are videotaped for observation, as in the MET project, providing opportunities to observe the same lessons again at different times and in different research contexts.

3 Methods

3.1 Data Collection

The current study used both the original CLASS data and new observational data collected with two additional classroom observation instruments to compare classroom characteristics.

3.2 Raters

The second and third authors conducted the majority of the lesson observations. The third author had assisted the second author as a research assistant in using the ICALT in another project (Lei et al., 2023). Before these research assistants undertook the secondary video analysis, they passed the training session and completed two calibration exercises. In each calibration, one English Language Arts (ELA) lesson and one Math lesson were observed and scored twice by each rater. After each calibration, they conducted an inter-rater reliability test and proceeded with the lesson observations once Krippendorff’s alpha (Krippendorff, 2004) had increased from .52 to .73 for the ELA lesson and from .55 to .82 for the Math lesson.
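Krippendorff’s alpha accommodates ordinal ratings and missing values, which makes it suitable for calibration checks of this kind. The following is a minimal sketch using the open-source Python krippendorff package with hypothetical ratings from two raters on one lesson; it illustrates the statistic, not the authors’ actual procedure.

```python
# pip install krippendorff numpy
import numpy as np
import krippendorff

# Hypothetical ordinal ratings (4-point ICALT scale) from two raters on
# the 32 indicators of a single calibration lesson; np.nan marks an
# indicator a rater could not score.
rater_1 = [3, 2, 4, 3, 2, 3, 3, 4, 2, 1, 3, 3, 2, 4, 3, 3,
           2, 3, 4, 2, 3, 3, 1, 2, 3, 4, 3, 2, 3, 3, 2, np.nan]
rater_2 = [3, 2, 3, 3, 2, 4, 3, 4, 2, 2, 3, 3, 2, 4, 3, 2,
           2, 3, 4, 2, 3, 2, 1, 2, 3, 4, 3, 2, 4, 3, 2, 3]

reliability_data = np.array([rater_1, rater_2], dtype=float)

# Krippendorff's alpha for ordinal data; values of roughly .67 and above
# are often treated as acceptable for tentative conclusions
# (Krippendorff, 2004).
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha (ordinal): {alpha:.2f}")
```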

3.3 Video Samples

3.3.1 Original Lesson Videos of the MET Study

The Measures of Effective Teaching (MET) project was a large-scale study funded by the Bill & Melinda Gates Foundation (2018). Around 2700 teachers from 10 districts in the United States, teaching science, English, and math in Grades 4–9, participated in the 2009–2010 and 2010–2011 school years (Bill & Melinda Gates Foundation, 2018). Each teacher was videotaped one to four times over a year. After training, administrators and peer observers coded the lessons in 20-minute segments using different classroom observation instruments. Despite its scale, the teachers, classrooms, schools and districts in the MET project were not randomly selected.

4 Current Secondary Data Analysis

Among these different instruments, the CLASS was used for all lessons in the MET project and is the most widely studied instrument outside the USA (e.g., Taut et al., 2019 in Chile; Pöysä et al., 2019 in Finland; Havik & Westergård, 2020 in Norway). The CLASS was therefore assumed to be a reliable reference for selecting lesson videos for secondary data analysis with two other classroom observation instruments, the International Comparative Analysis of Learning and Teaching (ICALT) and the Comparative Analysis of Effective Teaching and Inspiring Teaching (CETIT). Thus, in this study, we selected 423 lessons in proportion to the stanine distribution (Clark-Carter, 2005) of the percentiles of the aggregated mean scores of the various teaching dimensions of the CLASS. We also limited the sample to secondary school lessons (i.e., Grades 7–9) and to English and mathematics only. Two lessons were excluded due to low video quality. Three trained raters first observed nine lessons for calibration and started the secondary observations after inter-rater reliability exceeded 90%. Each observer was randomly assigned to observe different lessons. The total numbers of segments, videos, raters, and teachers are summarised in Table 16.1.

Table 16.1 The total numbers of segments, videos, raters and teachers in the original MET data and in the 423 lessons chosen for this project
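To illustrate the stanine-proportional selection described above, the sketch below draws a sample from hypothetical aggregated CLASS means in Python; the data, column names, and rounding are assumptions for illustration, not the authors’ actual selection code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical pool of lessons with aggregated mean CLASS scores
# (7-point scale); the study used the real aggregated means.
lessons = pd.DataFrame({
    "lesson_id": range(1000),
    "class_mean": rng.normal(4.5, 0.8, 1000),
})

# Stanines cut percentile ranks at 4, 11, 23, 40, 60, 77, 89 and 96,
# giving the 4-7-12-17-20-17-12-7-4 % distribution (Clark-Carter, 2005).
pct = lessons["class_mean"].rank(pct=True) * 100
bounds = [0, 4, 11, 23, 40, 60, 77, 89, 96, 100]
lessons["stanine"] = pd.cut(pct, bins=bounds, labels=range(1, 10))

# Draw roughly 423 lessons in proportion to the stanine distribution.
target = 423
weights = [.04, .07, .12, .17, .20, .17, .12, .07, .04]
sample = pd.concat(
    lessons[lessons["stanine"] == s].sample(n=round(target * w),
                                            random_state=0)
    for s, w in zip(range(1, 10), weights)
)
print(sample["stanine"].value_counts().sort_index())
```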

4.1 Instruments

4.1.1 CLASS

The CLASS used in the MET project has an additional dimension, Instructional Dialogue, beyond its original version with ten dimensions of teaching quality: Positive Climate, Negative Climate, Teacher Sensitivity, Adolescent Perspectives, Behaviour Management, Productivity, Instructional Learning Formats, Content Understanding, Analysis and Inquiry, and Quality of Feedback, plus one dimension of Learner Engagement (Pianta & Hamre, 2009). Each lesson was divided into one, two, or three segments, each rated independently by a different rater on a 7-point Likert scale representing low to high levels.

4.1.2 ICALT

Originally developed as an inspection instrument to capture generic teaching behaviours (van de Grift, 2007, 2014), the ICALT has expanded to thirty-two high-inference teaching indicators categorised into six domains: Safe and stimulating learning climate, Efficient organisation, Clear and structured instructions, Intensive and activating teaching, Adjusting instructions and learner processing to inter-learner differences, and Teaching learning strategies. The ICALT also contains a three-item (e.g., ‘…take an active approach to learning’) student learning domain to document learner engagement during classroom observations. Three observers completed the classroom observation for each lesson and rated the items based on the teacher’s performance on a 4-point scale, from ‘mostly weak’ to ‘mostly strong.’

4.1.3 CETIT

Based on the teaching aspects characterised as inspiring teaching by Sammons et al. (2014), Ko et al. (2016) used the Delphi method to finalise and validate the CETIT. This new high-inference classroom observation instrument consists of sixty-eight descriptive statements covering both effective and inspiring teaching domains. According to Ko et al. (2016), inspiring teaching includes four aspects: Flexibility, Teaching reflective thinking, Innovative teaching, and Teaching collaborative learning. Teaching behaviours corresponding to these inspiring teaching domains include “The teacher allowed options for students in their seatwork,” “… asked students to comment on his/her viewpoint,” “… used ICT in teaching,” and “… told students how to share their work in a task.” While Teaching reflective thinking and Teaching collaborative learning are two distinct classroom practices in the CETIT, they were conceived as a single characteristic by Sammons et al. (2014). The dimensions Assessment for learning and Professional knowledge and expectations are two teaching aspects unique to the CETIT (i.e., not found in the CLASS or the ICALT); they were found to cluster with other teaching domains of effective teaching (Ko et al., 2016, 2019a, b). For this study, two new dimensions, Engagement in exploratory learning and Engagement in knowledge consolidation, developed by Piburn and Sawada (2000), were adopted to test whether the learner dimensions in different instruments might favour the teaching dimensions of the classroom observation instruments to which they belong.

4.2 Data Analysis

For all three instruments, means, standard deviations, and reliability tests were computed in SPSS 20. Confirmatory factor analyses (CFA) were conducted in MPlus 7. The original three-factor model of the CLASS was tested first, followed by one-factor and two-factor models for comparison. For the ICALT, a six-factor model following the theoretical structure was tested. Three CFA models were tested for the CETIT: (a) an eight-factor model of effective teaching, (b) a four-factor model of inspiring teaching, and (c) a 12-factor full model. Multiple fit indices were used as criteria for evaluating the CFA models, as suggested by Tabachnick et al. (2007): (a) a root mean square error of approximation (RMSEA) below .08, (b) a comparative fit index (CFI) above .95, (c) a standardised root mean square residual (SRMR) under .08, and (d) χ2/df under 2.
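For readers without access to MPlus, a minimal open-source sketch of an analogous CFA is shown below using the Python semopy package; the input file, variable names, and model syntax are illustrative assumptions mirroring the three-factor CLASS structure, not the authors’ MPlus setup.

```python
# pip install semopy pandas
import pandas as pd
from semopy import Model, calc_stats

# Hypothetical segment-level CLASS dimension scores (assumed file and columns).
data = pd.read_csv("class_scores.csv")

# Three-factor CLASS model in lavaan-style syntax: Emotional Support,
# Classroom Organisation, and Instructional Support.
desc = """
EmotionalSupport =~ PositiveClimate + NegativeClimate + TeacherSensitivity + AdolescentPerspectives
ClassroomOrganisation =~ BehaviourManagement + Productivity + InstructionalLearningFormats
InstructionalSupport =~ ContentUnderstanding + AnalysisInquiry + QualityOfFeedback + InstructionalDialogue
"""

model = Model(desc)
model.fit(data)

# calc_stats reports chi-square, degrees of freedom, CFI, RMSEA and other
# indices, which can be checked against the chapter's cut-offs
# (RMSEA < .08, CFI > .95, chi2/df < 2).
print(calc_stats(model).T)
```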

5 Findings

5.1 Descriptive Statistics

5.1.1 CLASS

The overall results shown in Table 16.2 are consistent with CLASS results in the literature. Instructional Support was the weakest domain. At the dimension level, the average scores were relatively low for Negative Climate (M = 1.47, SD = .63) and Analysis and Inquiry (M = 2.42, SD = .90). In contrast, the dimensions Behaviour Management (M = 5.72, SD = .98) and Productivity (M = 5.54, SD = .93) were scored relatively higher than all other dimensions. Table 16.2 indicates that the reliability of each domain, Emotional Support, Classroom Organisation, and Instructional Support, was acceptable as a subscale, as was that of the full-scale CLASS (α > .7). There were no reliability scores for dimensions because each was a single indicator.

Table 16.2 Means, standard deviations and Cronbach’s Alphas (α) of CLASS
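As a transparent cross-check on such subscale reliabilities, Cronbach’s alpha can be computed directly from item-level scores. The sketch below is a standard textbook implementation in Python with hypothetical data, not the SPSS routine used in the study.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (observations x items) score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance of totals)
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical 7-point scores for the four Emotional Support dimensions
# across five lesson segments.
scores = np.array([
    [5, 4, 5, 4],
    [6, 5, 6, 5],
    [4, 4, 5, 4],
    [5, 5, 6, 5],
    [3, 4, 4, 3],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```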

5.1.2 ICALT

For the ICALT, the means of Adjusting Instructions and Learner Processing to Inter-Learner Differences (M = 1.68, SD = .42) and Teaching Learning Strategies (M = 1.45, SD = .40) were low because these behaviours were rare in the sampled lessons. The reliability test indicated a high level of internal consistency for the full ICALT scale (α = .87). Still, as shown in Table 16.3, half of the ICALT domains had reliability below .7, the threshold commonly accepted in education research (Taber, 2018): Safe and Stimulating Learning Climate (α = .58), Intensive and Activating Teaching (α = .56) and Adjusting Instructions and Learner Processing to Inter-Learner Differences (α = .49).

Table 16.3 Means, standard deviations and Cronbach’s Alphas (α) of ICALT

5.1.3 CETIT

The results suggested that the full scale with all 68 items was highly consistent (α = .93) and indicated good reliability for most of the CETIT dimensions. However, the subscale Stimulating Learning Environment showed unacceptable internal consistency (α = .23), with a relatively low average score (M = 1.43, SD = .39). The four subscales with reliability close to the .6 threshold were Flexibility (α = .58), Purposeful and Relevant Teaching (α = .57), Safe Classroom Climate (α = .58), and Innovative Teaching (α = .59) (Table 16.4).

Table 16.4 Means, standard deviations and Cronbach’s Alphas (α) of CETIT

6 Correlations of Factors of Three Instruments

Table 16.5 displays the Pearson correlation coefficients of the teaching dimensions, the learner engagement dimensions, and the whole scales. As the statistical significance of a correlation is sensitive to sample size, we focus on the magnitude or strength of the associations. In general, a coefficient between .4 and .6 indicates moderate strength, a value above .6 suggests a strong association, and a value between .2 and .4 is weak to mild. Values below .2 are considered weak even when the correlation is statistically significant.

Table 16.5 Correlations between CLASS, ICALT & CETIT
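The rule of thumb above can be encoded directly; the helper below simply reproduces the bands stated in this section for labelling coefficients such as those in Table 16.5.

```python
def correlation_strength(r: float) -> str:
    """Label the magnitude of a Pearson r using the bands in the text."""
    magnitude = abs(r)
    if magnitude > .6:
        return "strong"
    if magnitude >= .4:
        return "moderate"
    if magnitude >= .2:
        return "weak to mild"
    return "weak (even if statistically significant)"

for r in (.608, .481, .239):
    print(f"r = {r:.3f}: {correlation_strength(r)}")
```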

Most teaching dimensions of the CLASS correlated significantly only with other dimensions of the same scale, whereas the teaching dimensions of the ICALT and CETIT correlated with dimensions of each other’s scales. All eleven CLASS dimensions showed weak or no correlations with the ICALT and CETIT dimensions. Within the ICALT, the domain Teaching Learning Strategies did not correlate with three other ICALT domains: Safe and Stimulating Learning Climate, Efficient Organisation, and Clear and Structured Instructions.

In the CETIT, the dimension Innovative Teaching showed no correlation with the other nine dimensions, except for Flexibility (r = .271, p < .01) and Teaching Reflective Thinking (r = .280, p < .01); all three are classified as inspiring teaching practices. In contrast, the other CETIT dimensions correlated significantly with most ICALT dimensions. Comparing the student engagement subscales of the CLASS, ICALT, and CETIT, Learner Engagement in the ICALT correlated with more teaching dimensions: six in the CLASS, five in the ICALT, and ten in the CETIT. Learner Engagement in the ICALT was also weakly associated with Student Engagement in the CLASS (r = .239, p < .01), and more strongly with Engagement in Exploratory Learning (r = .481, p < .01) and Engagement in Knowledge Consolidation (r = .608, p < .01) in the CETIT.

6.1 Comparing Confirmatory Factor Models of Three Instruments

6.1.1 CLASS

In addition to the original three-factor model of the CLASS, one-factor and two-factor models were also estimated to investigate a better factor structure for the CLASS based on the sampled lessons. Of the three, the two-factor model showed a relatively better fit than the one-factor model and the original three-factor model. The one-factor model of the CLASS fitted the data poorly, χ2(54) = 846.491, p < .001, CFI = .768, RMSEA = .186, but interestingly, the theoretical three-factor model of the CLASS had the worst fit indices, χ2(41) = 774.629, p < .001, CFI = .761, RMSEA = .205. In contrast, the two-factor model of the CLASS had the best, though still unacceptable, fit indices, χ2(43) = 358.418, p < .001, CFI = .897, RMSEA = .131.

6.1.2 ICALT and CETIT

Relatively speaking, the results indicated a poor fit for the six-factor model of the ICALT, χ2(449) = 2823.249, p < .001, CFI = .558, RMSEA = .112, while all three CFA models of the CETIT suggested relatively better fits than the ICALT model. The eight-factor model of effective practices in the CETIT, χ2(1091) = 4796.771, p < .001, CFI = .618, RMSEA = .09, had better fit indices than the four-factor model of inspiring practices, χ2(149) = 782.373, p < .001, CFI = .659, RMSEA = .1, except for the SRMR. However, the full 12-factor model had the best overall fit (except for the CFI) among all the CFA models, χ2(2144) = 802.596, p < .001, CFI = .572, RMSEA = .08.

7 Discussion

7.1 Teaching Effectiveness Through Different Lenses

This secondary analysis was intended to examine teacher effectiveness by comparing different classroom observation instruments. Theoretically, the CLASS and ICALT have similar teaching dimensions, but our results showed that the ICALT and CETIT were more closely correlated. We could not rule out that this closer relationship reflected a halo effect, as the same raters scored both instruments. While all three full scales were reliable, some of the individual dimensions of the ICALT and CETIT were internally inconsistent, contrary to the latest research (e.g., Ko & Li, 2020; Maulana et al., 2021). The most puzzling findings were the poor fits of the factor models of the three instruments in the confirmatory factor analyses reported in Table 16.6.

Table 16.6 Model fit indices of confirmatory factor models of CLASS, ICALT and CETIT

7.2 Validity and Reliability of Instruments

The major limitation of the current study was the poor validity and reliability of the instruments in this sample. Though the CLASS and ICALT have been validated in many international contexts, we failed to validate them in the selected sample. We do not intend to argue for retaining models with poor fit indices, nor to discuss strategies for modifying the models to obtain an acceptable fit, because this would go beyond the purpose of this paper. To our surprise, the two-factor model showed a better fit than the theoretical three-factor model. However, similar results were reported by Hafen et al. (2015), who found that their bi-factor model fitted the MET data better than the original three-factor model. It is beyond this book chapter’s scope to explore a possible revised three-factor model. Still, the results suggest that the CLASS could be inherently unstable because the Instructional Support domain is empirically more distinctive than the other domains.

Regarding instrument comparison, the CFA results slightly favoured the CETIT, more for its effective teaching component than its inspiring teaching component. Further studies on the relative significance of individual teaching dimensions (or subscales) will help move teacher effectiveness research from scale or instrument development towards teacher development that conceptualises teaching practices ranked by difficulty (Ko et al., 2016).

We were also surprised that the reliability scores of some of the ICALT and CETIT subscales were unacceptably low. These results differed considerably from what we found in our previous projects (Ko et al., 2016, 2019a, b; Maulana et al., 2021) and might raise concerns over the reliability of the raters’ judgements. Given the high-inference nature of the classroom observation instruments, ratings are expected to be evaluative. Though we trained our raters and conducted calibrations to minimise subjective biases in our observations, halo effects might have affected the raters’ judgements, making the results of the ICALT and CETIT more similar to each other than to the CLASS. However, we are more inclined to suspect that this might be a side effect of a biased sample (see below). Still, further analyses to explore any rater effects seem warranted.

7.3 Limitations with the Original MET Sample

Conducting classroom observation or teacher evaluation research has always been challenging because teacher evaluation is a sensitive matter for practitioners. The MET lessons were not naturalistic and were subject to self-selection bias because teachers and schools provided the lesson video clips, and there was little control over the quality of the recordings and the settings. The video quality might have affected the raters’ judgments of student engagement, as students were often off screen. However, the availability of the videos for secondary analysis could be a strength, as it allows other researchers to build up a video-based lesson database coded with other instruments.

Since we suspected there might be a problem with using aggregate averages as references to select our lesson sample, we conducted another CFA with all the MET lessons to establish the scale’s validity, but the fit indices were also disappointing. We could not find any report concerning CLASS validation in the MET documentation or the literature, nor could we identify what characteristics of the entire MET sample or our lesson sample might have caused the inadequate validation. Our assumption that the validations of the ICALT and CETIT were strongly affected by some unknown bias in sample selection may therefore not be as justifiable as it seems. Moreover, we have not conducted further analyses to check for systematic biases regarding teacher, school and district characteristics, as we assumed these would be marginal compared to variations in teaching quality.

7.3.1 Significance and Implications

Studying teachers’ classroom practices and their effects is essential for teacher development and school improvement. We regard this study’s most significant implication as indicating relative strengths and areas for teaching improvement (i.e., flexibility, innovative teaching, adjusting instructions and learner processing to inter-learner differences, and teaching learning strategies). Future training on the CETIT and ICALT as reflective tools may benefit practitioners.

Despite the limitations discussed, this study provides data for instrument comparisons, and some teaching practices proved comparable across instruments. Instrument comparison was already an essential focus in the MET project, which included six observation instruments: the more generic FFT and CLASS, and the more subject-specific PLATO, MQI, QST, and UTOP. Future research should extend comparisons to these subject-specific instruments.

The secondary data analysis was a cost-effective strategy for connecting two independent studies, the MET and ICALT3 projects. It was possible because the lessons were videotaped, providing opportunities to observe the same lessons again at different times and in different research contexts. However, secondary data analysis is also limited by the quality of the original sample, as the researchers who conduct it can do little to rectify flaws in the original data collection process.

8 Conclusion

Defining effective teaching or teaching effectiveness is tricky and controversial, as effective teaching requires criteria for effectiveness. The criteria implied in the various teaching dimensions of the CLASS, ICALT and CETIT refer to education objectives in general and teaching objectives in particular. Visions of these criteria result from political and societal debate, but educational professionals, teachers and schools can also contribute through classroom observations. Going beyond identifying the characteristics of effective classroom practice, we have uncovered similarities and variations across teaching dimensions in the different instruments. It was surprising that the CLASS could not be validated in our sample or in the original MET dataset. Despite the limitations in validity, reliability, and sampling, we consider our attempt to provide data for the ICALT3 project to be at least partially fulfilled.