1 Introduction

Although for years scholars in the field of teaching quality have been attending to either subject-generic or subject-specific teaching aspects (TAs), with the former cutting across different subjects and the latter being more germane to teaching particular subjects (cf. Charalambous & Kyriakides, 2017), the last decade has seen heightened interest in considering both types of TAs (see, for example, such a discussion in the special issue ZDM – Mathematics Education, 50(3)). This interest was fueled by theoretical arguments recognizing the complexity of teaching (Cohen, 2011) and underlining the need to combine theoretical frameworks to better study teaching quality (e.g., Hamre et al., 2013; Charalambous & Praetorius, 2018). Empirical studies reveal that considering both types of TAs can help better capture teacher-student interactions around the content (Praetorius & Charalambous, 2018), showing the combination either to explain a higher percentage of the unexplained variance in student learning (e.g., Charalambous & Kyriakides, 2017) or to work complementarily in predicting student learning (e.g., Blazar & Kraft, 2017). Scholars in mathematics education have recently started discussing different ways of combining the two types of TAs (Brunner, 2018; Blazar et al., 2017) and developing frameworks that attempt to integrate these TAs (e.g., the MAIN-TEACH model, Charalambous & Praetorius, 2020).

Nevertheless, the relationship between subject-generic and subject-specific TAs is still unclear. Can subject-generic and subject-specific TAs be integrated? If so, how? Addressing this question could help catalyze the development of frameworks that bring together subject-generic and subject-specific TAs by not simply juxtaposing them but by exploring their similarities and potential overlaps so that the outcome framework is not simply the sum of its parts. In this paper, we address this question from an empirical perspective, reflecting on whether a common scale encompassing both subject-generic and subject-specific aspects can be developed. Forming such a scale may help examine if and how subject-generic TAs relate to subject-specific TAs, and particularly whether subject-generic and subject-specific TAs belong to a single overarching factor (i.e., teaching quality) and are therefore related to each other. We also examine whether they can be organized into separate distinct groups, or if these groups include both subject-generic and subject-specific TAs. If the first holds, it implies that integrating these two different TAs can largely be equated to simply juxtaposing distinct aspects. If the latter is true, it points to the potential of integrating the TAs in ways that acknowledge the interrelations between these two different types of TAs.

Such empirical work may have theoretical and practical implications. From a theoretical perspective, it can provide insights into developing frameworks that integrate these two types of TAs by considering their similarities and differences. It could also help start reflecting on whether TAs which have for years been considered as “generic” or “specific” fall along a continuum with the boundaries between the two types of practices being more blurred than initially thought. From a practical standpoint, it can inform initial and ongoing professional development programs for pre-service and in-service teachers since it may provide teacher educators with ideas as to how the two types of TAs can be integrated.

In this study we bring together two widely used frameworks, one subject-generic (the Dynamic Model of Educational Effectiveness [DMEE], Creemers & Kyriakides, 2008) and one subject-specific (the Mathematical Quality of Instruction [MQI], Learning Mathematics for Teaching [LMT] Project, 2011). Before justifying this selection (see Sect. 2), two remarks are in order. First, the exercise undertaken in this study needs to be replicated by drawing on other subject-generic and subject-specific frameworks. Such replication studies may help us understand how frameworks that aim to integrate both types of TAs can be developed. Second, in selecting the two frameworks, we followed the classification of Charalambous and Praetorius (2018), acknowledging that subject-specificity and subject-genericness in frameworks should not be considered as dichotomous but rather as forming a continuum. We therefore based our selection of the two frameworks on their developers’ original intentions (i.e., to study teaching quality in a specific subject or across different subjects) and the extent to which the development of these frameworks was informed by subject-specific demands of teaching within a particular discipline.

In pursuing this exercise to work at the intersection of subject-generic and subject-specific frameworks, we recognize that this is not the first endeavor to do so. Several frameworks that combine these two types of TAs, identified as hybrid (see Charalambous & Praetorius, 2018), have been developed and used over the past decade. These include, among others, TEDS-Instruct (Schlesinger et al., 2018), Teaching for Robust Understanding (Schoenfeld, 2018), and the UTeach Observation Protocol (Walkington & Marder, 2018). Empirical studies utilizing these frameworks suggest that they capture an overarching factor of teaching quality combining both types of TAs (e.g., see Blömeke et al., 2022 for TEDS-Instruct). While recognizing the merit of these frameworks as a promising approach for integration, in this work we follow a rather different approach that brings together two different frameworks and explores the possibility of developing a common scale that encompasses both aspects. This work could also be promising because it allows for collecting more detailed information on teaching quality, given that each framework goes into more depth in capturing either subject-generic or subject-specific aspects.

2 The theoretical frameworks of the study

2.1 The Dynamic Model of Educational Effectiveness (DMEE)

The DMEE (Creemers & Kyriakides, 2008) was developed to establish stronger links between Educational Effectiveness Research (EER) and research on improvement by considering the strengths and limitations of the main integrated models of EER (e.g., Creemers, 1994; Stringfield & Slavin, 1992). The DMEE is multilevel in nature and refers to factors operating at the student, classroom, school, and system levels which are associated with student learning outcomes. Five dimensions are used to measure both quantitative (i.e., frequency) as well as qualitative characteristics of the functioning of each factor (i.e., focus, stage, quality, and differentiation). The dimensions are not only important from a measurement perspective, but also, and even more, from a theoretical point of view. The focus dimension is in line with the synergy theory (Liu & Jiang, 2018) and argues that the specificity and the number of purposes addressed by each task associated with a factor should be examined. Similarly, the stage dimension implies that the factors need to take place over a long period of time to ensure that they have a continuous direct or indirect effect on student learning. Several studies provided empirical support to this argument (e.g., Creemers, 1994; Scheerens, 2013). Using the stage dimension to measure the functioning of a factor can help identify the extent to which there is constancy at each level and flexibility in using each factor. Finally, the differentiation dimension is in line with findings of research into differential effectiveness (Campbell et al., 2003) which reveals that adaptation to the specific needs of each group of students may increase the successful implementation of a factor and ultimately maximize its effect on student learning outcomes (Tomlinson, 2014). Appendix A shows how each dimension is used in measuring the orientation factor (for more information on how each factor is measured see Creemers & Kyriakides, 2008).

At the classroom level, the DMEE takes into account the main findings of teacher effectiveness research and refers to factors concerned with teacher behavior in the classroom found to be associated with student learning outcomes. It also attempts to develop a comprehensive framework of effective teaching by considering different theories of learning and different teaching approaches. Specifically, the model refers to eight factors (hereafter TAs) associated with student learning outcomes in different learning domains (Scheerens, 2013). The main elements of the eight factors are presented in Fig. 1, which shows that the DMEE refers to TAs such as structuring and application, which were found to be related to student learning outcomes by the early teacher effectiveness studies (see Brophy & Good, 1986) and are associated with the direct and active teaching approach (Joyce et al., 2000), and to TAs such as modeling and assessment, which are in line with constructivism (Schoenfeld, 1998). Moreover, the collaboration technique is considered in defining the elements of the classroom learning environment. Multiple theories of learning are considered in defining the TAs. For example, motivation learning theories and the cognitive load theory are considered in defining orientation and application, respectively. Finally, some factors of the DMEE refer to TAs also captured in other subject-generic frameworks. For example, modeling and questioning (i.e., raising process questions) align with cognitive activation included in TBD (Praetorius et al., 2018) and TEDS-Instruct (Schlesinger et al., 2018). Similarly, management of time can be identified in several other frameworks (e.g., CLASS, TBD, TEDS-Instruct). (For a more systematic description of the factors of DMEE and their relationship with other theoretical models, see Kyriakides et al., 2020).

More than 20 large-scale studies and one meta-analysis have been conducted to examine the main assumptions of the DMEE at classroom level (for a review of these studies see Kyriakides et al., 2020). Below, issues of validity, reliability and prediction of student outcomes are briefly discussed. One high-inference and two low-inference observation instruments, as well as a student questionnaire, are used to capture the TAs examined by the DMEE (for more information see Creemers & Kyriakides, 2012; also see Appendix A for a description of the instrument used in the present study). Kyriakides and Creemers (2008) analyzed data that emerged from these instruments by using a multi-trait and multi-method model and provided support for the construct validity of the instruments. This model was then replicated in more than 20 studies (for a review of these studies see Kyriakides et al., 2020) and confirmatory factor analyses revealed that each TA can be measured in relation to the five dimensions (e.g., Bodroža et al., 2022; Dierendonck, 2023). These studies showed that students were able to provide reliable data on the teaching practices of their teachers. Satisfactory results about the reliability of the observation instruments were also generated (the alpha reliability coefficients for each TA as captured by the three observation instruments were higher than 0.83, and the reported inter-rater reliability coefficients r2 were higher than 0.75).

Finally, these studies revealed that TAs are associated with student achievement gains. Cognitive learning outcomes in different subjects (e.g., mathematics, language, science, and religious education) as well as non-cognitive outcomes (e.g., attitudes towards mathematics) were used to measure the impact of the TAs. Thereby some support for the assumption that the TAs are associated with student achievement gains in different learning outcomes has been provided (Chaudhary & Singh, 2022; Polymeropoulou & Lazaridou, 2022). The generic nature of the DMEE is also supported since a synthesis of these studies revealed that the effects of the TAs on different student learning outcomes were similar (i.e., Cohen’s d values were around 0.20). However, only two studies examined the impact of the teacher factors on non-cognitive outcomes and only one on student metacognitive outcomes.

The DMEE assumes that the eight TAs are related to each other. Six studies conducted in different countries (i.e., Canada, Cyprus, Greece, Maldives, and Taiwan) revealed that the TAs can be classified into specific levels of effective teaching. For example, a study focusing on Cypriot primary teachers teaching three different subjects (i.e., Mathematics, Greek language, and Religious Education) revealed five levels of effective teaching (Kyriakides et al., 2009). The first three levels were related to the direct and active teaching approach, by moving from the basic requirements concerning quantitative characteristics of teaching routines to the more advanced requirements concerning the appropriate use of these skills as these are measured by the qualitative characteristics of these TAs. These skills also gradually move from the use of teacher-centered approaches to the active involvement of students. The last two levels were more demanding since teachers are expected to differentiate instruction (Level 4) and demonstrate their ability to use constructivism (Level 5). Teachers situated at higher levels were also found to be more effective in terms of promoting student learning outcomes in each subject. Similar results emerged from studies conducted in other countries. The results of these studies were considered in developing the dynamic approach to teacher improvement (Creemers et al., 2013).

Fig. 1 Description of the Main Teaching Aspects (TAs) of the Dynamic Model of Educational Effectiveness

2.2 Mathematical Quality of Instruction (MQI)

Recognizing that the existing classroom observation frameworks did not capture the mathematical quality in teaching, the MQI developers sought to develop a framework that would be sensitive to the mathematical nuances in teaching (LMT Project, 2011). Toward this end, they drew on the instructional triangle (cf. Cohen et al., 2003), thinking of instruction as comprised of dynamic interactions among the teacher, the students, and the content, which were situated in educational settings. The framework was developed following both a top-down and a bottom-up approach, through iterative cycles of examining the literature to identify TAs that are germane to teaching mathematics and a close analysis of video-recorded elementary mathematics lessons (see Hill, 2010; LMT Project, 2011 for more information on this process). As such the MQI design fulfils the first perspective of subject-specificity proposed by Mu et al. (2022): that of applicability. The discussion of the derived TAs among different experts in teaching mathematics partly accounted for the second perspective, relevance (see Mu et al., 2022); a discussion with experts in other fields could have offered additional insights about the degree of subject-specificity of the chosen TAs. Since its initial development, the MQI framework has gone through different iterations. In its current form, it includes four dimensions (hereafter, TAs) with twenty items (see Fig. 2). The first two TAs reflect the relationship between the teacher and the content; the third TA focuses on how the teacher facilitates students’ interactions with the content, while the fourth captures students’ interaction with the content.

MQI has been used in several studies to investigate the relationship between teaching quality and mathematical knowledge for teaching (MKT) and student learning. With respect to the association between MQI and MKT, most studies (Hill et al., 2008; Kelcey et al., 2019; Lee & Santagata, 2020; Santagata & Lee, 2021) have focused on elementary grades. Although employing different teacher populations and dissimilar designs, these studies converge in showing a positive association between MKT and MQI. For example, studies that used small samples of elementary school teachers (N < 10), ranging from first-year teachers (Santagata & Lee, 2021) to novice teachers (Lee & Santagata, 2020) to more seasoned teachers (Hill et al., 2008), showed moderate to strong positive associations between MKT and aspects of MQI (ranging from r_rho = 0.65 to r_rho = 0.83). Interestingly, Lee and Santagata’s (2020) longitudinal study showed that whereas during the first year of the study these correlations were not significant, they became so in the second year of the study, suggesting that the effects of MKT on teaching quality might not be directly identified during teachers’ early career stages. Analogous findings also emerged with larger samples of elementary school teachers (e.g., N ≈ 270). For example, in Kelcey et al. (2019) a one standard deviation difference in teacher knowledge was associated with about a 0.22 standard deviation change in quality in Ambitious teaching (a collective term that encompasses the first, the third, and the fourth MQI TAs, see Fig. 2) and a 0.35 change in Errors and Imprecision. In middle-grades, Hill et al.’s (2012) study with 24 middle-grade teachers showed a moderate correlation between MKT and MQI (r_rho = 0.58, p < .01). Collectively, these results support that MQI satisfies knowledge, the third perspective of subject-specificity (Mu et al., 2022).

Both large scale (Kane & Staiger, 2012) and smaller scale (Blazar & Archer, 2020; Blazar & Kraft, 2017; Blazar et al., 2016; Hill et al., 2011; Kraft & Hill, 2020; Kelcey et al., 2019; Mantzicopoulos et al., 2019) studies have examined the predictive validity of teachers’ MQI scores on their student learning. Although the findings of these studies are mixed, they point to a pattern of positive associations between MQI and student learning, thus supporting that MQI partly satisfies the last perspective of subject-specificity, predictivity (Mu et al., 2022).

The largest study conducted, the Measures of Effective Teaching [MET] Project, found positive significant, yet low, correlations (from r = .12 to r = .16) between teachers’ MQI scores and students’ scores on state tests and a more cognitively demanding project-administered test (Kane & Staiger, 2012). Higher relations were found in smaller-scale studies. Drawing on a sample of 24 middle-school teachers and their 222 students, Hill et al. (2011) found moderate associations between teachers’ MQI scores and students’ value-added scores (r_rho = 0.30 to r_rho = 0.56 in the different value-added models employed). Collectively these low to moderate correlations suggest that teaching quality, as measured by MQI, is associated with student learning. Other studies that used more advanced analyses than simple correlations provide stronger evidence for this association. In addition, Blazar (2015) showed the overarching TA of Ambitious Teaching to positively predict fourth- and fifth-grade students’ scores on a low-stakes mathematics test (β = 0.11, SE = 0.04, p < .05). Focusing on different student learning outcomes and drawing on a sample of 310 fourth- and fifth-grade teachers and their 10,575 students, Blazar and Kraft (2017) showed Errors and Imprecisions to be negatively associated with performance on state tests (β = −0.02, SE = 0.01, p < .10), self-efficacy (β = −0.09, SE = 0.03, p < .01), and happiness (β = −0.18, SE = 0.08, p < .05); however, Ambitious Teaching was not significantly related with any of these outcomes. A similar non-significant effect was also found in a randomized field trial utilizing coaching to improve elementary and middle-school teachers’ MQI (Kraft & Hill, 2020); although coaching did result in improvements in the mathematical quality in teachers’ lessons, this improvement was not reflected in students’ learning as captured on formative and summative assessments.
In contrast, a study focusing on a younger student population (285 kindergarten students, see Mantzicopoulos et al., 2019) showed scores on Ambitious Teaching to predict students’ end-of-year progress on kindergarten mathematics standards (β = 1.63, SE = 0.75, p < .05) but not their mathematical reasoning score (β = 1.78, SE = 1.81, p > .05). The whole lesson MQI scores were also associated with teacher-rated students’ interest in mathematics (β = 0.82, SE = 0.34, p < .05).

During the last five years, studies have also focused on examining differential effects of MQI on student learning, exploring for which students and under what conditions higher MQI scores are conducive to student learning. For example, Blazar and Archer (2020) showed Ambitious Teaching to be more effective for English language learners (β = 0.07, SE = 0.03, p < .05) compared to the general student population (β = 0.02, SE = 0.02, p > .05). Similarly, Kelcey et al. (2019) showed that the significant positive relationship between achievement gains and Ambitious Teaching was present only in state districts with coherent instructional guidance and whose state tests were more cognitively challenging (β = 0.11, SE = 0.04, p < .05) as compared to districts not having these characteristics (β=-0.07, SE = 0.05, p > .05).

Fig. 2 Description of the Four Teaching Aspects (TAs) of the Mathematical Quality of Instruction

2.3 Exploring possibilities for integrating the two frameworks

Although the DMEE and MQI represent two distinct traditions in studying teaching quality in mathematics, with the first focusing on subject-generic TAs (cf. Panayiotou et al., 2021) and the second zooming in on the mathematical aspects of teaching (cf. Litke et al., 2021), there are both empirical and theoretical reasons to bring them together and explore possibilities of integration.

Empirically speaking, three elements support this choice. First, both frameworks have been validated in the same educational context, which is also the context considered herein. Second, although being a generic framework, DMEE has been used extensively in studying teaching quality in mathematics (cf. Kyriakides et al., 2020). Finally, both frameworks have been used extensively to capture teaching quality in primary grades, which is the focus of this study. Collectively, these three elements suggest that any difficulties that might arise when integrating the two frameworks will not be due to the need to adapt these frameworks to the context of the study.

Theoretically speaking, there exist important similarities and differences between the two frameworks. For example, modeling (DMEE) is intended to provide students with transferable heuristics that can help them move beyond just solving problems in a single lesson. Similarly, the task cognitive demand (MQI) pertains to structuring a challenging environment for students that can help them develop mathematical reasoning. At the same time, these TAs are distinct in that the first captures the development of transferable strategies, whereas the second focuses on whether the tasks enacted in the lesson provide students with opportunities to engage in rich mathematical practices. Consider also the qualitative characteristics (focus and quality) of questioning and the classroom as a learning environment (DMEE): compared to the more quantitative characteristics (frequency and stage) of these factors, the qualitative characteristics both pertain to providing students with substantive opportunities to interact with the content, through the provision of constructive feedback. The remediation of students’ errors in MQI also attends to such opportunities by exploring how the teacher works with students’ errors to help them develop mathematical understanding. However, although both frameworks attend to the feedback provided to students, they consider this aspect from different perspectives: DMEE considers more general characteristics of feedback provision, whereas MQI attends to more mathematical features. Such examples suggest that there are reasons to believe that TAs of DMEE and MQI may co-exist on a single scale. Given that the aforementioned DMEE and MQI TAs capture similar manifestations of teacher-student interactions yet from different perspectives (as suggested by the two preceding examples), it is likely that these TAs will co-appear in clusters combining generic and specific TAs.
To the extent that this holds, it would be informative to examine which TAs from the two frameworks are clustered together and what might be driving their co-existence in the same cluster.

Apart from the similarities identified above, there are important differences between the two frameworks, which, however, suggest that the frameworks may complement each other. For example, whereas MQI does not have any aspect related to orienting students to the importance of what is to be learned, such aspects are covered in DMEE; on the other hand, whereas DMEE does not attend to the mathematical content offered to the students, this is the main focus of the MQI. Similarly, whereas structuring (DMEE) captures connections among the different lesson goals and activities without attending to the mathematical substance of such connections, linking and connections (MQI) pertains to explicitly drawing mathematical connections between different representations and different mathematical ideas. These complementarities also highlight the importance of exploring possibilities of integrating the two frameworks, given that such an integration might show how one framework accounts for the limitations of the other.

In this paper, we explore the possibility of such integration by attempting to develop a common scale that combines subject-specific and subject-generic TAs. We use the term “integrate” to denote the combination “of two or more things in order to become more effective” (https://dictionary.cambridge.org/), since our intention is to examine whether the TAs of the two frameworks could be combined into a more functional and comprehensive whole. If such a scale cannot be developed, this would suggest that such an integration is not possible. However, developing such a scale is not, in and of itself, sufficient for arguing about such an integration: if the generic TAs are all clustered in certain levels and specific TAs in other distinct levels, the scale would not empirically corroborate the type of integration proposed above, given that the two types of TAs would still be distinct from each other. In this paper we are interested in exploring the possibility of developing a common scale with TAs of both frameworks distributed and mixed all over the continuum.

3 Research questions

This study aimed at addressing the following questions:

  1. Can a scale with good psychometric properties that combines the TAs of DMEE and MQI be developed?

  2. If such a scale can be developed, can we identify levels of effective teaching that include both subject-generic and subject-specific TAs?

Developing a scale including both types of TAs would provide empirical evidence attesting to the possibility and importance of integrating generic and specific TAs as opposed to simply juxtaposing them. Given the exploratory character of this study, this will also give the opportunity to reflect on what might be driving the co-appearance of TAs from both frameworks in the same cluster, something that could provide important insights about what it means to integrate subject-generic and subject-specific aspects and possible ways of developing frameworks that combine both aspects.

4 Methods

4.1 Participants and setting

Thirty-eight Cypriot elementary school teachers participated voluntarily in the study (see Table 1 for demographic information). Each teacher had six of their mathematics lessons videotaped. Teachers were free to choose which lessons to have videotaped. Teachers’ self-selection of recording dates should not be a concern, given prior research suggesting that when teachers were given discretion to choose from among a set of their classroom videos for evaluative purposes, the ranking of teachers in terms of the teaching quality in the chosen videos was similar to the ranking from a random set of videotaped lessons (Ho & Kane, 2013). Each lesson was videotaped by a single camera placed at the back of the classroom; students whose parents did not give consent for participation were seated outside the camera's field of view but participated in the lesson. Ethics permission for the study was obtained from the National Centre of Educational Research (ethics approval number blinded). Lessons averaged about an hour.

Table 1 Teacher demographic characteristics

4.2 Data coding

Each lesson was coded using both the DMEE and the MQI by raters trained in either of the two frameworks. Training lasted approximately 20 h for each framework, and raters were certified only when their ratings were consistent with at least 80% of the ratings of master raters on selected videotaped lesson excerpts. For both DMEE and MQI, each lesson was coded by a pair of raters (N_DMEE = 3, N_MQI = 3). The raters (different for DMEE and MQI) first coded the lessons individually and then met to reconcile their ratings. All possible pairs of coders were formed and each pair coded two lessons from each teacher; each rater therefore coded 152 lessons, four from each teacher. For the purposes of this analysis, we used their reconciled codes, which represent the pair’s consensus on coding the lesson. Inter-rater reliability coefficients before reconciliation were higher than 0.70 for both DMEE and MQI.
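The coding workload implied by this design can be checked with a short sketch. It is only an illustration under stated assumptions: the rater labels are hypothetical, and the rule for deciding which two of a teacher's six lessons go to which pair is assumed, since the text does not specify it. Only the counts (38 teachers, six lessons each, three raters per framework, two lessons per pair per teacher) come from the text.

```python
from itertools import combinations

N_TEACHERS = 38                    # from the text
LESSONS_PER_TEACHER = 6            # from the text
LESSONS_PER_PAIR_PER_TEACHER = 2   # from the text
RATERS = ["R1", "R2", "R3"]        # hypothetical labels for one framework's raters

# All possible rater pairs: (R1, R2), (R1, R3), (R2, R3).
pairs = list(combinations(RATERS, 2))

# Assumed allocation rule: lessons 0-1 go to the first pair,
# lessons 2-3 to the second, lessons 4-5 to the third.
assignment = {}
for teacher in range(N_TEACHERS):
    for lesson in range(LESSONS_PER_TEACHER):
        pair = pairs[lesson // LESSONS_PER_PAIR_PER_TEACHER]
        assignment[(teacher, lesson)] = pair

total_lessons = len(assignment)
lessons_per_rater = {
    rater: sum(rater in pair for pair in assignment.values())
    for rater in RATERS
}

print(total_lessons)       # total number of coded lessons
print(lessons_per_rater)   # lessons coded by each rater
```

Under these assumptions every lesson is double-coded, and the per-rater totals match the figure of 152 lessons reported above.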

Raters were asked to complete both a low-inference and a high-inference instrument. Because of differences in the coding procedures utilized for the low-inference DMEE and MQI instruments, we utilized the two high-inference instruments. The DMEE high-inference instrument measures all eight TAs except assessment. Observers were expected to complete a Likert scale comprising 34 items at the end of each lesson to indicate how often each teacher behavior was observed (for examples, see Appendix A1). The MQI high-inference instrument was developed taking into consideration the segment-level MQI codes, which were transferred to the lesson level: at the end of each lesson, the raters were asked to evaluate the mathematical quality of teaching for the MQI codes using a scale from 0 (not at all) to 4 (to a great extent) (see Appendix A2 for an example of this scale). At the time the study was conducted, the existing MQI version included six codes for Richness (see Fig. 2, except for c), four codes of Errors and Imprecision, three codes of Working with students and mathematics, and four codes for Common Core-Aligned Student Practices (except for c and e).

4.3 Data analysis

Descriptive analysis was conducted to identify TAs suffering from ceiling and/or floor effects. Two items on the differentiation dimension of two DMEE TAs (i.e., structuring and dealing with misbehavior) had to be excluded since they were infrequently observed (1.31%). Similarly, all the four Errors and Imprecision codes of the MQI were dropped because they appeared very infrequently (e.g., 1.7% for mathematical errors).

The Extended Logistic model of Rasch (Andrich, 1988) was first used separately for each framework, to examine whether its corresponding data could form a scale measuring its respective TAs. We treated each lesson separately, meaning that 228 person estimates (i.e., 38 teachers × 6 lessons) were generated by using Quest (Adams & Khoo, 1996). The Rasch model is appropriate for the specification of such scales because it enables testing whether the data meet the requirement that both teachers’ lesson performance on the framework items and the difficulties of the items form a stable sequence (within probabilistic constraints) along a single continuum (Bond & Fox, 2001).
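To make the idea of placing lessons and items on one continuum concrete, the following minimal sketch computes category probabilities for a single polytomous item. It assumes a rating scale formulation with thresholds shared across items; this section does not give the exact parameterization used in Quest, so the function names, the threshold values, and the example logits are all illustrative assumptions.

```python
import math

N_ESTIMATES = 38 * 6  # one estimate per teacher-lesson, as stated in the text


def rating_scale_probs(theta, delta, thresholds):
    """Category probabilities for one item (assumed rating scale form).

    theta: lesson (person) location in logits
    delta: item difficulty in logits
    thresholds: step parameters tau_1..tau_m (hypothetical values)
    """
    # Category k has log-weight sum_{j<=k} (theta - delta - tau_j);
    # category 0 has an empty sum, i.e. 0.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - delta - tau))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(theta, delta, thresholds):
    """Expected item score, i.e. the probability-weighted category index."""
    probs = rating_scale_probs(theta, delta, thresholds)
    return sum(k * p for k, p in enumerate(probs))


# Illustrative values only: a five-category item (scores 0-4).
taus = [-1.5, -0.5, 0.5, 1.5]
print(rating_scale_probs(0.5, 0.0, taus))
print(expected_score(0.5, 0.0, taus))
```

With a single threshold the function reduces to the dichotomous Rasch model, where a lesson located exactly at the item difficulty has a probability of 0.5 of receiving the higher score.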

Once the two separate scales had been developed, the Rasch model was used to examine whether a common scale could be established (we also checked whether the data had better fit to more complex models, such as a multidimensional IRT scale; see more on these analyses in Appendix B1). Having established the reliability of the common scale, we then employed the procedure for detecting pattern clustering in measurement designs developed by Marcoulides and Drezner (1999) to examine if the various TAs are systematically grouped by difficulty level (see more in Appendix B2). This procedure enables segmenting the observed measurements into constituent groups (or clusters) so that the members of any group are similar to each other, according to the selected criterion (i.e., the difficulty level of each item).

The Rasch model and the clustering method cannot show how deep the divide is between the levels identified by the cluster analysis. Wilson (1989) developed a variant of the Rasch model, the so-called Saltus model, as a method that can differentiate between different levels (see Footnote 3). Specifically, the Saltus model was used to differentiate between major and less pervasive changes in moving from one level to the other without sacrificing the idea of one common underlying continuum. Readers interested in the technical details of this model are referred to Appendix B3.

5 Results

5.1 Developing a common scale of teaching quality

The Rasch model was used to analyze teachers’ performance on the 32 DMEE items and the 10 MQI items. After dropping two DMEE items that did not fit the model (on the focus dimension of orientation and the stage dimension of dealing with misbehavior), the remaining 40 DMEE and MQI items fit the model well (see Table 2, Column 4). Specifically, all TAs had item infit between 0.81 and 1.22 and item outfit between 0.77 and 1.19 (see Appendix C). Moreover, the results of this analysis revealed that these TAs were well targeted against the teachers’ lesson performance, since the teachers’ scores ranged from −1.76 to 2.04 logits and the difficulties of the 40 items (i.e., TAs) ranged from −1.88 to 2.36 logits. Furthermore, the separation indices for persons and TAs were higher than 0.85, suggesting satisfactory scale reliability (Bond & Fox, 2001). Finally, Yen’s (1993) procedure (see Appendix B1) was used to test for local independence, a central assumption of Item Response Theory models; violations of this assumption are usually tested using test statistics based on item pairs (Debelak & Koller, 2020). Local independence was not violated. Collectively, these findings suggested that subject-generic and subject-specific TAs could form a single scale with good psychometric properties (for how the fit of the Rasch model compared to that of alternative, more complex IRT models, see Appendix B1).
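For readers less familiar with the infit and outfit indices reported above, the sketch below shows how such mean-square statistics are commonly computed. It is a simplification that assumes dichotomous (0/1) responses, whereas the items in this study were rated on multi-category scales; the function names and the data are hypothetical and do not reproduce the Quest output.

```python
import math

def rasch_probability(theta, delta):
    """Probability of a positive response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def item_fit(responses, thetas, delta):
    """Return (infit, outfit) mean squares for one dichotomous item."""
    squared_residuals, variances, z_squared = [], [], []
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, delta)
        variance = p * (1.0 - p)
        squared_residuals.append((x - p) ** 2)
        variances.append(variance)
        z_squared.append((x - p) ** 2 / variance)
    # Infit weights residuals by their information (model variance).
    infit = sum(squared_residuals) / sum(variances)
    # Outfit is the unweighted mean of squared standardized residuals.
    outfit = sum(z_squared) / len(z_squared)
    return infit, outfit

# Made-up lesson locations (logits) and responses to one item of difficulty 0.
thetas = [-1.5, -0.5, 0.0, 0.5, 1.5]
responses = [0, 0, 1, 1, 1]
infit, outfit = item_fit(responses, thetas, delta=0.0)
```

Values close to 1 indicate that the observed responses vary about as much as the model predicts, which is why ranges such as those reported above are read as acceptable fit.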

Table 2 The psychometric properties of the three scales developed for the DMEE and MQI separately and their combination

5.2 Developing levels of teaching quality

The application of Marcoulides and Drezner’s (1999) analysis to cluster the 40 TAs based on their Rasch item difficulties showed that they could be optimally organized into five groups (i.e., levels of teaching, see Table 3). The cumulative D for the five-cluster solution was 41% whereas the sixth gap added only 3.9% and the seventh gap added even less (3.2%). Given that explaining at least 40% of the observed variance is considered satisfactory in cluster analysis (Romesburg, 1984), we further examined this solution using the Saltus model.

To apply the Saltus model, we assumed that the 40 TAs are structured into the five groups of the cluster analysis. The Saltus solution fit the actual data better than the Rasch model and offered a statistically significant improvement over it, equal to 1087 chi-square units at the cost of 30 additional parameters; this solution was also found to fit the data better than Saltus models with a set of alternative solutions with fewer or more clusters (see Appendix B2). Table 3 presents the Rasch difficulty parameters of the 40 TAs along with the Saltus difficulty parameters, starting from the easiest level (i.e., Level 1, shown in Column 3) and moving up to the most difficult level (i.e., Level 5, Column 7). A comparison of the Rasch and Saltus parameters and a justification of the choice of the latter over the former are presented in Appendix B2. Based on this comparison, it can be claimed that the spectrum of TAs measured through the DMEE and MQI is discontinuous rather than continuous. A description of the different levels is given below.
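The size of the reported improvement can be checked with a short calculation. The sketch below takes the figures above at face value (a difference of 1087 chi-square units for 30 additional parameters) and assumes that this difference is referred to a chi-square distribution with 30 degrees of freedom; whether that reference distribution is strictly appropriate for comparing Rasch and Saltus models is not addressed here. The function name is hypothetical, and the closed form used applies to even degrees of freedom only.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    # sf(x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Figures as reported in the text: 1087 chi-square units, 30 extra parameters.
p_value = chi2_sf_even_df(1087.0, 30)
significant = p_value < 0.001
```

Under these assumptions the p-value is far below conventional thresholds, which is consistent with the description of the improvement as statistically significant.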

A first observation on how the DMEE and MQI TAs have been clustered into levels is that, except for the fifth level (i.e., the most demanding one), all other levels include both subject-generic and subject-specific TAs. A second observation is that there is not only conceptual homogeneity within the TAs of each level, but there also seems to be a progression in how the TAs are organized into levels, starting from TAs that impose fewer demands on teachers and moving up to TAs that impose increasingly more demands. We unpack this argument while describing each level.

A common thread linking the subject-generic and the subject-specific TAs of the first level is that both pertain to structuring a basic learning environment rather than providing higher-quality opportunities for student learning. This level includes subject-generic TAs related to the quantitative characteristics (i.e., frequency) of factors associated with the direct teaching approach, such as structuring and application. The subject-specific TA of this level also pertains to providing a basic learning environment, since it largely expects the teacher to acknowledge and respond to students’ contributions.

Table 3 Rasch and Saltus parameter estimates for factor scores measuring the teaching aspects of the DMEE and the MQI

The TAs of the second level relate to providing students with opportunities to actively engage in learning, interacting with their classmates and the content. In terms of the subject-generic TAs, the second level concerns qualitative characteristics of the two aspects of the classroom learning environment factor (i.e., encouraging interactions and dealing with misbehavior) and the appropriate use of questioning, including the provision of constructive feedback. The subject-specific TAs of this level largely concern the opportunities that the teacher provides to students to offer explanations, raise questions, offer examples, and engage in reasoning.

The generic and specific TAs of the third level place even more demands on teachers to build an environment that not only supports active student participation but is also (mathematically) rich. This level includes subject-generic TAs that pertain only to the quantitative characteristics of factors associated with constructivist approaches (i.e., orientation and modeling). Unlike the specific TAs of the previous level, which largely relate to providing students with opportunities for engagement with the content, the specific TAs of this level put more demands on teachers, since they expect them to structure mathematically rich environments through the provision of explanations, the linking of representations, and the identification and remediation of students’ errors.

Further increasing the challenge, the TAs included in the fourth level expect teachers not only to offer a rich environment but also to afford students challenging learning opportunities that can have an impact on both cognitive and meta-cognitive learning outcomes. This level includes generic TAs that relate to the qualitative characteristics of factors associated with constructive teaching. Compared to the generic TAs included in the previous level, those of this level do not merely capture the provision of related opportunities but require that these opportunities be suitable for students and promote learning. The specific TAs stipulate that the teacher selects and enacts challenging tasks, thus providing students not only with rich mathematical experiences but also with experiences involving demanding content.

The fifth level places the most demands on teachers by expecting them to differentiate their teaching to meet different student needs. This level includes only generic TAs related to the differentiation dimension of the DMEE and reveals that differentiation of teaching is very challenging for primary mathematics teachers.

6 Discussion

Drawing on two widely used and validated frameworks, the subject-generic DMEE and the subject-specific MQI, this study showed that it is possible to develop a common scale with good psychometric properties that encompasses both generic and specific TAs. This finding, which is in line with a basic assumption of the DMEE holding that TAs are interrelated (Kyriakides et al., 2020), implies that both subject-generic and subject-specific aspects can be thought to form an overarching construct, namely teaching quality. The study also provides further support to studies indicating the limitations of using exclusively either generic TAs (e.g., Panayiotou et al., 2021) or specific TAs (e.g., Litke et al., 2021) to describe teaching quality.

In arguing about the importance of forming a common scale of TAs, we acknowledge certain limitations. First, our sample consisted only of primary teachers. Although teaching mathematics in primary grades requires strong mathematical knowledge for teaching (cf. Ball et al., 2008), replication studies are needed with secondary school teachers. Second, during the analysis 11 items were dropped. Dropping nine of these items (e.g., misbehavior from DMEE and Errors and Imprecision from MQI) was unsurprising, given that they were infrequently observed (apparently due to the self-selected sample of the study and that the lessons were videotaped); had these items been observed more frequently they could have been included in the scale formed—an open issue to address in future research. Two additional items not necessarily related to the teacher sample and mode of lesson observation were also dropped (see Sect. 5.1). We argue, however, that doing so might be unavoidable when trying to develop a common scale measuring both types of TAs, as long as the main constructs of each framework are well represented in the common scale; interestingly, leaving these two items in and running more complex IRT models (see Appendix B1) resulted in much worse fit indices. Despite having to drop these 11 items, it is important to note that all DMEE TAs (factors and dimensions) and the three main MQI dimensions (which form the overarching dimension of Ambitious Teaching) were represented in the scale formed.

Another important study finding was that the subject-generic and subject-specific TAs of the two frameworks could be optimally clustered into five distinct levels. Although the grouping of generic TAs into levels has already been demonstrated (Kyriakides et al., 2020), this was the first attempt to group both generic and specific TAs into levels. Equally important, except for the last level, all preceding levels included both subject-generic and subject-specific TAs. This implies that the quality of the lessons observed appears to be a function of both generic and specific TAs; hence, it seems unlikely to have lessons that excel in one type of TAs and are particularly poor in the other type.

The development of the common scale and the establishment of levels combining subject-generic and subject-specific TAs have important theoretical implications. Although one approach for integrating both types of TAs would be to start from developing a theoretical model that integrates both (cf. Charalambous & Praetorius, 2020), this study proposes and examines another approach: starting from existing frameworks that have been considered to fall into two different clusters (generic vs. specific) and exploring the potential of integrating them not by juxtaposing them, but by investigating which aspects of them can co-exist in a single level and whether this co-existence provides a meaningful description of teaching quality. To further explore the line of integration advanced herein, more work is needed, including utilizing other subject-generic and subject-specific frameworks, other grade levels (e.g., lower elementary, secondary), and different educational contexts. Finding levels of teaching quality through a cross-sectional study has certain limitations (Kyriakides et al., 2009). Therefore, the levels formed cannot be considered developmental. To develop such levels, longitudinal studies are needed to figure out whether some teachers move from one level to the other in a stepwise manner and/or other teachers remain at the same level. Such studies could provide empirical evidence on the extent to which these levels can define certain professional needs for particular groups of teachers, which, in turn, can be used for designing professional development programs. Future studies could also employ more interpretive work (e.g., the Bookmark method) to examine the interpretive validity of the levels formed and how practitioners themselves understand the similarities and differences of the items clustered within each level.

Despite these limitations, at least two initial insights could be gleaned. First, the clustering of the 40 TAs into distinct levels calls for more critical examination of what unites aspects that are called generic or specific. One possibility could be that this co-existence of subject-generic and subject-specific aspects in the same levels challenges the very notion of distinguishing between these two types of TAs and calls for considering them as a unified whole. Another could be that these TAs are indeed distinct given that different disciplines impose different demands on teachers, thus rendering certain TAs more specific than others. More thoroughly examining these possibilities both theoretically and empirically represents an important open issue. At the same time, another question is equally important: If these TAs are indeed distinct, what can explain their co-appearance in the same levels? Is it just a statistical artifact or are there more fundamental similarities among them? We argue that unpacking the demands (in terms of knowledge and practice) that different TAs (identified as generic or specific) impose on teachers represents a productive path for addressing this question. Such an analysis may have implications for teacher initial and ongoing professional development. Second, to the extent that such levels are replicated in future studies, we could identify if the teachers’ lessons are consistently clustered into levels and the extent to which they relate to student learning gains. This could indicate the importance of treating such levels as a heuristic for identifying teachers’ needs to better support their students’ learning.

The identification of the five levels in this study is in line with the use of stage models in teacher professional development (e.g., Berliner, 1994; Sternberg et al., 2000). Specifically, the five levels not only illustrate the complex nature of teaching quality but could also help develop specific teacher education courses that consider the needs of each teacher group situated at a different level. Given that subject-generic and subject-specific TAs were integrated in most of the emerging levels, the study findings provide teacher educators with ideas as to how the two types of TAs can be integrated in initial and ongoing teacher education.