1 Introduction

A number of frameworks have been developed for investigating teaching quality and its impact on student learning (cf., Klette & Blikstad-Balas, 2018; Praetorius & Charalambous, 2018). These frameworks vary in their structure, the selection and naming of teaching dimensions, “the depth and breadth” (Senden et al., 2022, p. 151) of the teaching dimensions covered, and how subject matter is considered, ranging from generic frameworks designed to be applicable to all subjects (e.g., Classroom Assessment Scoring System [CLASS] by Pianta & Hamre, 2009) to those categorized as subject specific because they were developed for a particular subject and acknowledge its particularities (e.g., Mathematical Quality of Instruction [MQI] by Hill et al., 2008, or the Protocol for Language Arts Teaching Observation [PLATO] by Grossman et al., 2013).

A closer look at subject-specific frameworks reveals substantial differences in how they conceptualize the subject specificity of teaching quality. This is evident even in the wide range of terms used to describe them (e.g., additive, integrative, inclusive, see Brunner, 2018, or hybrid, see Senden et al., 2022). Further complicating the situation, recent studies show that (a) there is an overlap between aspects of teaching considered generic and those considered subject specific (e.g., cognitive activation, Lipowsky et al., 2018; Praetorius & Charalambous, 2018; Schlesinger & Jentsch, 2016), and (b) certain subject-specific frameworks are also suitable for analyzing teaching quality in subjects they were not designed for (Klette, 2023), indicating more similarities than differences in teaching across subjects (Bell et al., 2019; Senden et al., 2022).

One could conclude that research so far has answered neither what subject specificity actually means (Mu et al., 2022), nor how to determine the ideal extent to which aspects of teaching quality should be considered subject specific (Praetorius et al., 2020). It may be that the question of subject specificity depends less on the concrete operationalization and more on whether the meaning of the underlying construct is the same for all subjects (Martin-Raugh et al., 2016).

Questions about what teaching quality means for different subjects would benefit from more comparative research (Lindmeier & Neumann, 2018). Comparing subjects enables the identification of similarities and differences in teaching quality across subjects (Charalambous et al., 2019; Praetorius et al., 2016). An in-depth comparison of this kind requires a comprehensive framework that includes both generic and subject-specific dimensions (Lindmeier & Neumann, 2018). Building on such a comprehensive framework, the investigation of subject specificity could then unfold both conceptually and empirically. Conceptually, experts from different disciplines could discuss the extent to which the framework dimensions are relevant for their subject. Empirically, data collected on teaching quality in different subjects could be compared to examine similarities and differences between subjects.

While Praetorius et al. (2020) presented an initial conceptual analysis of this matter, this study aims to empirically investigate questions of construct and measurement equivalence, using a promising measurement invariance approach to compare two different subjects, mathematics and German.

2 Literature review

2.1 The importance of co-examining generic and subject-specific dimensions of teaching quality

For decades, teaching effectiveness researchers have been developing frameworks suitable for studying teaching quality in any subject (e.g., Three Basic Dimensions System [TBD], Klieme et al., 2001). The frameworks capture rather generic teaching principles that are relevant in multiple contexts. Researchers within this line of argumentation have cautioned that using a variety of subject-specific frameworks could impede communication about teaching quality (cf., Charalambous & Praetorius, 2020).

At the same time, there has been great emphasis on the relevance of subject-specific frameworks, recognizing that what constitutes teaching quality can vary from subject to subject (Bell et al., 2012; Gitomer, 2009). Subject-specific frameworks pay more attention to content in teaching and learning processes and conceptualize teaching quality using subject-specific terminology (Brunner, 2018; Lindmeier & Heinze, 2020); their proponents point out that any attempt to conceptualize teaching quality without considering subject-related content runs the risk of missing the idiosyncrasies associated with teaching a particular subject. Lindmeier and Heinze (2020) even go so far as to declare that it is paradoxical to conceptualize teaching as subject independent while claiming to measure high-quality teaching in a particular subject context.

A number of frameworks have therefore been developed to capture teaching quality in specific subjects (e.g., in mathematics, see Boston et al., 2015). A deeper analysis of these frameworks reveals great heterogeneity in how they approach subject specificity (Brunner, 2018). Dreher and Leuders (2021) highlight several issues with subject-specific frameworks: specifications are made at different levels, from subject (e.g., representations in mathematics) to topic (e.g., representations in algebraic expressions); some generic aspects of teaching have been translated into subject-specific aspects rather superficially, without any arguments for their subject particularities (e.g., responding to students appropriately); and classifying a teaching dimension as subject specific in one subject does not preclude its applicability or relevance to other subjects (e.g., using representations). They therefore argue that decisions about whether and to what extent a given teaching aspect should be considered subject specific remain unresolved.

Recent research elaborates on a number of approaches for improving our understanding of the subject specificity of teaching quality (Dreher & Leuders, 2021; Lindmeier & Neumann, 2018; Mu et al., 2022), highlighting the comparison of subjects as particularly useful for investigating subject differences more systematically. Analyzing just one subject limits the possibility of identifying dimensions as clearly subject specific. A comparative approach would both allow generic aspects to be distinguished from subject-specific ones and reveal whether some teaching practices are more common in one subject than in another. Moreover, only the comparison of subjects makes it possible to investigate construct equivalence across subjects (Praetorius et al., 2020). A synthesis of both generic and subject-specific frameworks would serve as an ideal starting point for comparing subjects (Lindmeier & Neumann, 2018).

2.2 Approaches to combine generic and subject-specific frameworks of teaching quality

Researchers have highlighted the importance of combining generic and subject-specific dimensions (Brunner, 2018; Lindorff et al., 2020) in order to exploit the advantages of both approaches while balancing out their respective disadvantages. Empirical evidence suggests that a combination of generic and subject-specific dimensions might be better able to explain teaching and its effects on student learning (Charalambous & Kyriakides, 2017; Lipowsky et al., 2018; Seidel & Shavelson, 2007).

To this end, researchers have developed various types of combined frameworks. There are frameworks that incorporate a set of subject-specific aspects of teaching into a generic framework in an additive way (e.g., Teacher Education and Development Study-Instruct [TEDS-Instruct], Jentsch et al., 2021), and others that follow an inclusive approach and operationalize the entire framework from a mathematical-didactic perspective (e.g., Teaching for Robust Understanding [TRU], Schoenfeld, 2018). These two examples give an indication of the many ways in which combined frameworks can be designed (for further examples, see Senden et al., 2022).

Praetorius and Charalambous (2018) presented a synthesis of eleven generic, mathematics-specific, and combined (so-called hybrid) frameworks. The synthesis led to the development of the MAIN-TEACH model, version 1.0 (Charalambous & Praetorius, 2020), which includes eight dimensions of teaching quality (i.e., classroom management, social-emotional support, support for active engagement, selection and implementation of content, cognitive activation, support for consolidation, assessment and feedback, and adaptation). The model differentiates three layers and their respective dimensions based on their function in the learning process. The underlying layer consists of the dimension adaptation and is relevant for all of the other dimensions. The intermediate layer comprises dimensions that create conditions conducive to learning (i.e., classroom management, social-emotional support, support for active engagement), and the upper layer includes dimensions that directly support learning processes (i.e., selection and implementation of content, cognitive activation, support for consolidation, and assessment and feedback).

MAIN-TEACH 1.0 includes dimensions characteristic of generic frameworks (e.g., TBD), such as classroom management; dimensions typically covered by subject-specific frameworks (e.g., MQI), such as selection and implementation of content; and dimensions that are only present in some combined frameworks (e.g., TRU; TEDS-Instruct; see Fig. 2 in Praetorius & Charalambous, 2018). For example, it includes support for consolidation, considered in TEDS-Instruct but not in TRU, and adaptation, captured by TRU but not by TEDS-Instruct. The synthesis of multiple frameworks thus covers a greater range of teaching aspects. However, because it was developed by synthesizing only mathematics-based frameworks, subject-specific characteristics of other subjects may not have been sufficiently considered in the initial development of the model. Therefore, researchers from ten other subjects were asked to compare MAIN-TEACH 1.0 to conceptualizations of teaching quality in their own subjects to further develop the synthesis. Across two special issues, these researchers came to the conclusion that the proposed dimensions of teaching quality were relevant for all subjects (Praetorius & Gräsel, 2021; Praetorius et al., 2020).

While studies with subject expert groups provide conceptual evidence that MAIN-TEACH 1.0 can be used to analyze teaching quality across subjects, the empirical evidence is still pending. Such empirical studies could follow different approaches (see Mu et al., 2022), a promising one being the comparison of subjects.

2.3 Potential of subject comparison to identify subject-related differences in teaching quality

Studies of teaching quality using any of the available frameworks or models have mostly focused on a single subject, despite the growing awareness of the importance and potential benefits of comparing subjects. Few empirical studies have compared teaching quality across subjects, and those that exist have used different analytical methods: some explore consistency in teaching quality of the same teachers across subjects, while others test for measurement invariance (MI).

We briefly review studies with a focus on comparing subjects, starting with those examining the consistency of teaching quality in lessons of different subjects taught by the same teachers to the same students and then moving to those that tested for MI across subjects.

In the first category, Praetorius et al. (2016) compared teaching quality in German and English as a second language (ESL) lessons taught by the same teachers to the same students by partitioning the variance in student ratings of classroom management and motivational support into subject-specific and cross-subject parts. The authors showed that there was substantial subject-dependent variance for motivational support, but hardly any for classroom management. Cohen et al. (2018) investigated the consistency of teaching quality of the same teachers and students in mathematics and language arts lessons. Cross-subject correlations differed depending on the assessed dimension of teaching quality, ranging from r = 0.55 to 0.73 for student ratings and from r = 0.25 to 0.55 for expert ratings. In another study, Cohen (2018) investigated whether explicit instruction in mathematics and language arts was consistent and found that 90% of the teachers did not demonstrate the same explicit instructional practices across subjects. Comparing teaching quality and student value-added scores of students taught in mathematics and physical education by the same teacher, Charalambous et al. (2019) found evidence supporting teachers’ differential effectiveness across subjects; for example, half of the 18 teachers studied showed dissimilar teaching quality in the two subjects explored.

In the second category, Wagner et al. (2013) examined differences in student ratings of teaching quality in ESL and German lessons. The authors tested the generalizability of five teaching dimensions across subjects and levels by investigating the assumption of MI. They could establish metric MI for classroom management and structure, but not for motivation, understandableness of teacher behavior, and student involvement. MI of student ratings of teaching quality was also tested in a study conducted by Schurig et al. (2016) in biology, German, ESL, and mathematics. The generalizability of seven dimensions of teaching quality was tested for all subjects at three grade levels, and three measurement points. The authors could establish metric MI for all subjects and scalar MI between German and mathematics, which led them to conclude that the framework is transferable across subjects. Blömeke and Olsen (2019) arrived at a similar conclusion for mathematics and science, even establishing scalar MI.

A closer look at the second category of studies shows that these studies applied MI analyses with the primary intention of justifying combining factors or scale scores over subjects and levels. In other words, these studies did not conduct in-depth analyses of potential differences across subjects, in the sense of a dependency between certain aspects of teaching quality and subjects, but rather wanted to rule out subjects as a source of variance in teaching quality.

However, MI analysis can be considered a promising technique for identifying differences in the factorial structure of indicator-based measurement instruments (Leitgöb et al., 2023; Millsap, 2011). In terms of subject specificity, this means that MI analyses make it possible to find out whether the meaning, and thus the relative weight, of indicators of certain dimensions of teaching quality, represented by the factor loadings, varies between subjects (i.e., metric MI not established). Similarly, MI analyses allow the identification of potential subject dependencies at the level of individual indicators, captured by the indicator's intercept rather than the whole factor score (i.e., scalar MI not established). Differences in intercepts indicate that the occurrence of a certain indicator, relative to the other indicators of the quality dimension of interest, differs between subjects. This would imply that a given level of a certain aspect of teaching quality may be easier to achieve when teaching one subject than another (for a detailed explanation of the concept of MI, please see the electronic supplementary materials).
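In factor-analytic notation, the levels of MI referred to here can be summarized as follows (a generic sketch; the symbols are illustrative and not taken from the cited studies). With y_{ijg} denoting the rating on indicator j for lesson i in subject group g,

y_{ijg} = \tau_{jg} + \lambda_{jg}\,\eta_{ig} + \varepsilon_{ijg},

configural MI requires the same pattern of loadings in both groups; metric MI additionally requires equal loadings (\lambda_{jg} = \lambda_{j} for all g); and scalar MI additionally requires equal intercepts (\tau_{jg} = \tau_{j} for all g), the latter being the prerequisite for comparing latent factor means across groups.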

The identification of not only overall or scale-related differences in the factorial structures but also discrepancies in the single indicators across subjects would allow researchers to better understand the particularities of each subject (Lindmeier & Neumann, 2018) and to draw conclusions about possible differences in the meaning of teaching concepts across subjects (Martin-Raugh et al., 2016; Praetorius et al., 2020).

3 Research questions and hypotheses

Drawing on MAIN-TEACH 1.0, this study investigated subject-related differences in observer ratings of teaching quality in mathematics and German language with a two-fold measurement invariance approach.

First, we examined whether MI for the different dimensions of teaching quality can be established across the two subjects based on (a) a global MI analysis over all seven dimensions and (b) indicator-level MI analyses. Second, for those dimensions of teaching quality for which the required level of measurement invariance could be confirmed, we compared (a) the correlational structure of the seven factors and (b) the latent factor means across the two subjects. Specifically, we asked:

Research Question 1 (RQ 1): In which dimensions does the multi-dimensional structure of teaching quality show measurement equivalence or a deviation between mathematics and German language when applying a global-level and an indicator-level analysis of MI?

Research Question 2 (RQ 2): In which dimensions does the multi-dimensional structure of teaching quality show differences in factor correlations and factor means between mathematics and German language, provided the requirements of MI are met?

In formulating our hypotheses, we were guided by the multi-layered character of MAIN-TEACH 1.0 which distinguishes layers of teaching dimensions based on their function for the learning process, and by the conclusions that resulted from the conceptual work on MAIN-TEACH 1.0 with ten subject expert groups (Praetorius & Gräsel, 2021; Praetorius et al., 2020): It can be assumed that dimensions which facilitate learning such as classroom management are more likely to be independent of the subject than those assumed to directly support content-related learning processes such as cognitive activation (see Cohen, 2018, for a related argument). Therefore, the following hypotheses were investigated:

  • H1: Measurement models for teaching quality dimensions related to the direct support of the learning process (e.g., selection and implementation of content) are expected to vary between subjects.

  • H2: Measurement models for teaching quality dimensions related to facilitating learning (e.g., classroom management) are expected to be invariant between subjects.

  • H3: Measurement models for teaching quality dimensions related to adapting teaching to learning (e.g., adaptation) are expected to be invariant between subjects.

Concerning the second research question, given that there is not much research on latent mean differences in teaching quality across subjects (Schurig et al., 2016), we did not formulate any hypotheses but followed an exploratory approach to investigate differences between the two subjects.

4 Method

4.1 Project context

In 2018, the Inter-Cantonal Association for the External Evaluation of Schools (argev) initiated the development of an observation system for teaching quality that could be used to evaluate schools (ISCED 0–2) in Switzerland. Drawing on the synthesis (Praetorius & Charalambous, 2018) that led to MAIN-TEACH 1.0, the observation system INSULA 1.0 “Instrument for Teaching Evaluation Aligned to the Swiss Lehrplan 21 Curriculum” (Rogh et al., 2020) was developed, piloted, and validated in close collaboration with school evaluation and school practice, based on quantitative and qualitative data (for details, see Wemmer-Rogh et al., 2023). In the 2021/2022 school year, the Zurich and Grisons cantons initiated a new five-year school evaluation cycle, during which the teaching quality of all schools in these cantons will be assessed once by school inspectors using INSULA 1.0.

4.2 Lesson sample

The study used data collected during the first year of the evaluation cycle in both cantons. Observation data for 1939 lessons in 126 elementary schools were obtained during 2021/2022. As the study took place within the school evaluation context, the participating classes were selected by the cantonal evaluation agencies. This introduced considerable variation into the data (e.g., in school levels and subjects). To control for this variation, we used a subsample of the data, focusing on mathematics and German lessons at the primary school level, since only for these two subjects were the samples of sufficient and comparable size. Our subsample consisted of 319 mathematics and 237 German lessons; only in a few cases were these lessons taught by the same teachers.

4.3 Instrument

Teaching quality in both subjects was assessed using INSULA 1.0. The development of INSULA 1.0 and the conceptual refinement of the synthesis into MAIN-TEACH 1.0 took place in parallel. This had two consequences: (a) INSULA 1.0 reflects the dimensions of the MAIN-TEACH model 1.0 with slight differences in structure and terminology as a result of further development in the project context; (b) insights from the collaboration with the subject matter experts could be taken into account during the development process.

The observation instrument has seven dimensions with two to five subdimensions each (see Fig. 1). The subdimensions are designed to depict good teaching practice. The inspectors were asked to rate the seven dimensions and 21 subdimensions for the entire lesson using a 4-point scale [(1) little pronounced, (2) moderately pronounced, (3) predominantly pronounced, (4) extensively pronounced]. To support the high-inference rating process, behaviorally anchored rating indicators for all four scale points were provided for each subdimension (for an example for D1.1, please see the electronic supplementary materials). These indicators are worded in a subject-generic way but were developed in cooperation with various professional groups, including subject experts, to ensure that they work for different subjects.

Fig. 1 Teaching quality dimensions and subdimensions in INSULA 1.0

Because the conceptual work on MAIN-TEACH 1.0 with the subject expert groups recommended that dimensions closely related to content take subject specificity into account, the observation instrument was accompanied by a teacher questionnaire that collected information on the learning objectives, content, and methods selected for the observed lesson. This information was intended to help the raters make more informed judgments about the quality of the content-related teaching dimensions. The brief teacher questionnaire was completed in advance by the teachers and handed to the school inspectors, who were instructed to use the information along with their observations when rating content-related dimensions, in particular D3.1 and D3.2.

The seven-dimensional factor structure of the INSULA 1.0 instrument could be replicated with acceptable model fit (χ2(168) = 858.94, p < 0.001, CFI = 0.941, TLI = 0.926, RMSEA = 0.049, SRMR = 0.036).

4.4 Raters, rater training, and agreement

Fifty-three school inspectors were the raters for this study. School inspectors in the cantons of Zurich and Grisons are mostly experienced teaching experts (e.g., they have been teachers before becoming school inspectors). Background information about the inspectors was collected using a questionnaire that was completed by 32 of the 53. The survey revealed that 78% of the inspectors were female and almost 91% had been teachers prior to becoming school inspectors. On average, they had worked in the teaching profession for 15.0 years (SD = 8.5) and had 7.2 years of evaluation experience (SD = 6.9). Before evaluating lessons, all of the school inspectors in both cantons had to complete 3.5 days of training and be certified. The training included an introduction to the theoretical and methodological concepts of measuring teaching quality, an explanation of the dimensions of teaching quality, video examples illustrating each level of each subdimension of teaching quality, several opportunities to practice rating with master-coded videos, and a certification test. The certification criterion was set at an average of 60% exact agreement and 80% adjacent agreement with the master ratings across all ratings (Bell, 2020).

Thirty-three raters made 44 double observations in eight different subjects so that inter-rater agreement could be checked. Following recommendations in Köhler et al. (submitted), a combination of indices was used to evaluate reliability: exact and adjacent agreement, the intraclass correlation coefficient for absolute agreement of two raters in a one-way random effects model [ICC(1,2); see Koo and Li (2016)], and within-group agreement between the raters (rwg). The inter-rater agreement in the validation study was acceptable for all dimensions (exact agreement ranging from 54 to 74%; adjacent agreement within 1 point ranging from 96 to 100%; ICC(1,2) ranging from 0.81 to 0.90; rwg ranging from 0.88 to 0.96).
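As an illustration of how these indices can be computed for doubly observed lessons rated on a 4-point scale, the following minimal Python sketch may be helpful; the function and variable names are hypothetical, and this is not the code used in the project.

import numpy as np

def agreement_indices(r1, r2, scale_points=4):
    """Illustrative computation of the agreement indices reported in Sect. 4.4
    for lessons rated independently by two inspectors (r1, r2) on a 4-point scale."""
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    ratings = np.column_stack([r1, r2])                 # lessons x 2 raters
    n, k = ratings.shape

    exact = np.mean(r1 == r2)                           # exact agreement
    adjacent = np.mean(np.abs(r1 - r2) <= 1)            # agreement within 1 point

    # ICC(1,2): one-way random effects model, average of k = 2 raters
    grand = ratings.mean()
    ms_between = k * np.sum((ratings.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_within = np.sum((ratings - ratings.mean(axis=1, keepdims=True)) ** 2) / (n * (k - 1))
    icc_1_2 = (ms_between - ms_within) / ms_between

    # r_wg: 1 minus observed rating variance relative to a uniform null distribution
    sigma2_uniform = (scale_points ** 2 - 1) / 12       # = 1.25 for 4 scale points
    rwg = np.mean(1 - ratings.var(axis=1, ddof=1) / sigma2_uniform)

    return {"exact": exact, "adjacent": adjacent, "ICC(1,2)": icc_1_2, "rwg": rwg}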

4.5 Analyses

The data was analyzed in two ways. First, the overall level of MI of the multi-dimensional instrument was tested (Horn & McArdle, 1992; Millsap, 2011; see 4.5.1). As this strategy only results in a global assessment of the instrument’s MI over all seven dimensions and provides no information about exactly where any violation of invariance may have occurred, a second strategy was additionally pursued. Based on a configural model with separately (freely) estimated parameters for both subjects, difference parameters for every loading, intercept, factor correlation, and factor mean were computed and statistically tested (Cheung & Lau, 2012; Stark et al., 2006; see 4.5.2).

4.5.1 Overall assessment of measurement invariance

Given the structure of the INSULA 1.0 instrument, in which three of the seven dimensions are assessed with just two indicators, it was not possible to apply MI analyses at the level of individual scales to identify differences in the factorial structure between the two subjects. Instead, we first used multiple-group confirmatory factor analysis (CFA) to test for the global level of MI simultaneously for all of the dimensions for mathematics vs. German lessons. Model identification was established by applying effect coding (i.e., restricting the mean loading per factor to 1 and the mean intercept to 0) to all seven dimension-specific measurement models, separately for both subjects (Little et al., 2006).
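In formula terms, the effect-coding constraints of Little et al. (2006) amount to the following restrictions, applied per factor and per subject group g (a generic sketch, with J denoting the number of indicators of a factor):

\tfrac{1}{J}\sum_{j=1}^{J}\lambda_{jg} = 1 \quad\text{and}\quad \tfrac{1}{J}\sum_{j=1}^{J}\tau_{jg} = 0.

Under these constraints, the latent variables are scaled in the metric of the average of their indicators, so that no arbitrarily chosen reference indicator determines the comparison of parameters across groups.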

4.5.2 Direct parameter-specific assessment of measurement invariance

Assessing the global level of MI for a multi-dimensional instrument, or for parts of it, is an established methodology. However, when the overall level of MI or the MI for a given construct cannot be confirmed, this method does not reveal what caused the lack of MI. In other words, it does not identify the exact part of the measurement model that differs between subsamples.

Therefore, to answer RQ1 we additionally took a different approach (see Cheung & Lau, 2012; Stark et al., 2006). Using the complete multi-dimensional model and including all seven dimensions as correlated latent factors with freely estimated parameters for both subject groups (configural invariance), we calculated subject-related difference parameters separately for each loading and intercept. After that, to answer RQ2, we extended this approach to the factor correlations and latent means of those factors for which MI could be established at the necessary level (see also Tsaousis & Alghamdi, 2022). If these difference parameters differ significantly from 0, they indicate which specific element of the measurement model of the instrument is different for mathematics and German lessons.

For model identification, we again applied effect coding to all seven dimension-specific measurement models, separately for both subjects (Little et al., 2006). In this way, a non-arbitrary baseline for parameter comparisons could be established by considering all loadings and all intercepts, not just arbitrarily chosen reference parameters, without performing the very long and complex procedure suggested by Cheung and Lau (2012). Statistical testing of these difference parameters was based on bias-corrected bootstrapping (Cheung & Lau, 2012; Tsaousis & Alghamdi, 2022). Bootstrapping has the advantage of providing asymmetric confidence intervals (CIs), which are considered more adequate than symmetric ones that rely on theoretical distributions. We drew n = 1000 bootstrap samples.
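The bootstrapping itself was carried out within the SEM software; purely to illustrate the underlying logic, the following Python sketch shows how a bias-corrected bootstrap confidence interval can be obtained for a single difference parameter (e.g., the difference between a loading estimated for mathematics and the corresponding loading estimated for German). The function and its inputs are hypothetical.

import numpy as np
from scipy.stats import norm

def bc_bootstrap_ci(theta_hat, boot_estimates, alpha=0.05):
    """Bias-corrected (BC) bootstrap CI for one difference parameter.
    theta_hat: estimate in the original sample; boot_estimates: the n = 1000
    estimates of the same parameter from the bootstrap samples."""
    boot = np.asarray(boot_estimates, float)
    # Bias-correction constant: how far the bootstrap distribution is shifted
    z0 = norm.ppf(np.mean(boot < theta_hat))
    z_lo, z_hi = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)
    # Adjusted (asymmetric) percentiles of the bootstrap distribution
    p_lo, p_hi = norm.cdf(2 * z0 + z_lo), norm.cdf(2 * z0 + z_hi)
    return np.quantile(boot, [p_lo, p_hi])

# A difference parameter is flagged as significant at the 5% level
# when this 95% interval does not contain 0.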

All CFA analyses were carried out using Mplus 8.8 (Muthén & Muthén, 1998–2022) and all descriptive statistics using SPSS version 29 (IBM Corp., 2022). Model fit was assessed using the maximum likelihood robust (MLR)-based chi-square statistic (χ2) and its degrees of freedom (df), the comparative fit index (CFI), the root mean square error of approximation (RMSEA) with a 90% confidence interval, and the standardized root mean square residual (SRMR). According to simulation-based recommendations, values over 0.95 for CFI, values below 0.06 for RMSEA, and values below 0.09 for SRMR indicate favorable Type I and Type II error rates and therefore acceptable fit (Hu & Bentler, 1999; Schermelleh-Engel et al., 2003). Because the MLR estimator was used, model comparisons were based on the Satorra-Bentler corrected χ2 value.
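Because MLR chi-square values cannot simply be subtracted, the Satorra-Bentler corrected difference test rescales them using the models' scaling correction factors. The following Python sketch reproduces the commonly used scaled difference formula for two nested models (e.g., a model with equal loadings nested in the configural model); it is a generic illustration of the formula, not the authors' code.

from scipy.stats import chi2

def scaled_chi2_difference(t_nested, df_nested, c_nested, t_comp, df_comp, c_comp):
    """Scaled (Satorra-Bentler type) chi-square difference test for two nested
    MLR-estimated models. t_*: reported robust chi-square values,
    df_*: degrees of freedom, c_*: scaling correction factors reported by the software."""
    # Scaling correction for the difference test
    cd = (df_nested * c_nested - df_comp * c_comp) / (df_nested - df_comp)
    # Corrected difference statistic and its degrees of freedom
    tr_d = (t_nested * c_nested - t_comp * c_comp) / cd
    df_d = df_nested - df_comp
    return tr_d, df_d, chi2.sf(tr_d, df_d)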

5 Results

5.1 RQ1: structure of teaching quality across subjects

Table 1 shows the factor loadings and intercepts for the correlated factors model with freely estimated parameters for both types of lessons. Model fit was reasonable (see Table 1 note).

Table 1 Multiple group confirmatory factor analysis for mathematics vs. German language lessons; standardized factor loadings and unstandardized intercepts by subject

5.1.1 Overall assessment of the level of measurement invariance

The global MI analysis (Table 2) indicates that overall, the instrument yields basically identical loadings, but not identical intercepts, when all seven factors for mathematics and German lessons are assessed at once. In other words, we found metric, but not scalar MI in the overall test.

Table 2 Measurement invariance by subject (mathematics vs. German)—comparison of models over all seven dimensions

5.1.2 Direct parameter-specific assessment of measurement invariance

To identify and accurately localize differences between the subjects in the measurement model, we calculated the pairwise differences in factor loadings (Table 3) and intercepts (Table 4) and tested them by drawing bias-corrected bootstraps. Despite general metric MI, we identified subdimensions with significant differences in loadings between subjects when the mean loading per factor and per subject was kept constant at a value of 1 by using effect coding (see Table 3). Specifically, the factor loadings of subdimensions 3.1 and 3.2 of the dimension selection and implementation of content behaved differently. These subdimensions are the appropriate selection of content and learning objectives (3.1) and the alignment of teaching with the selected learning objectives (3.2). Within dimension 3, the loading of subdimension 3.1 was systematically higher for mathematics lessons than for German lessons, possibly indicating that this subdimension carries greater weight relative to the other indicators of this dimension in mathematics. For subdimension 3.2, it was the other way around. For the remaining subdimensions of dimension 3 and all of the other dimensions, the differences in factor loadings remained within the range of random variation.

Table 3 Differences in unstandardized factor loadings by subject (mathematics vs. German language) and statistical testing via bias-corrected bootstrapping
Table 4 Differences in item intercepts by subject (mathematics vs. German language) and statistical testing via bias-corrected bootstrapping

As testing multiple differences per dimension may lead to alpha error accumulation, we applied the Bonferroni correction, adjusting the critical p level for the number of contrasts for the factor loadings within each construct (Table 3). For dimension 3 there were nk = 5 contrasts, leading to a revised criterion of p < 0.01. After the correction, the loading-related differences in dimension 3 were no longer significant. In other words, the result was the same as for the overall assessment of metric MI: there were no significant subject-related differences in factor loadings in any of the seven quality dimensions.

The indicator-level MI analyses of the intercepts shown in Table 4 suggest that scalar invariance could not be established in the overall assessment of MI because of subdimensions 3.1 and 3.2, whose intercepts differ the most. Compared to the other intercepts of dimension 3, the intercept of subdimension 3.1 is significantly lower for mathematics lessons than for German lessons, while for subdimension 3.2 it is higher for mathematics lessons than for German lessons. A higher intercept indicates a more positive rating.

Given the multiple contrasts, we corrected the results for alpha error inflation again. As a result, the subject-related differences in the intercepts were no longer significant.

In answer to RQ1, there was no evidence of significant differences between the two subjects. When the correction for alpha accumulation was applied, no significant subject-related differences in factor loadings or intercepts were found. One can conclude that the factorial structure of our data was basically identical for the two subjects. Therefore, H1 was rejected, as the dimensions expected to behave differently across subjects were found to be stable across both subjects, while H2 and H3 were confirmed. It should be noted, though, that the rationale behind H1 seems partially correct. The single dimension that most directly targeted subject content, selection and implementation of content, did indeed show a certain amount of subject dependency in the measurement model when no alpha error inflation correction was applied. This was evident not only in the variation of the relative levels (intercepts) of the subdimensions but also in their meaning, as reflected in the factor loadings.

5.2 RQ2: Factor correlations and level of teaching quality across subjects

Given the results of the global and the alpha error inflation-corrected indicator-level MI analyses, which both confirmed metric MI, it was reasonable to assess pairwise differences in factor intercorrelations. No significant differences were observed throughout the analyses (all p ≥ 0.05). In answer to the first part of RQ2, the correlational structure of the seven dimensions of the instrument was basically the same for the two subjects.

After applying the correction for alpha error accumulation to the indicator-level difference parameters, not only metric but also scalar MI was confirmed for all seven dimensions. Therefore, to answer the second part of RQ2, we estimated latent means for these seven factors by subject and compared them by applying bias-corrected bootstrapping. Table 5 shows the differences in factor means (“Estimate” column) and their bootstrapped confidence intervals. None of the seven factor means differed significantly by subject (all p ≥ 0.05).

Table 5 Differences in factor means by subject (mathematics vs. German language) and statistical testing via bias-corrected bootstrapping

In answer to RQ2, we conclude not only that the factor correlation matrix did not systematically differ by subject but also that the level of the observed teaching quality in mathematics and German lessons was the same for all seven quality dimensions.

6 Discussion

6.1 Examining subject-related differences with MAIN-TEACH

The premise of this paper was that comparing dimensions of teaching quality across subjects using frameworks with both generic and subject-specific dimensions could lead to insights into similarities and differences across these subjects. Our results showed that the factor structure of the dimensions of teaching quality described by MAIN-TEACH 1.0, as assessed with INSULA 1.0, did not differ significantly between the observed lessons in mathematics and German. As hypothesized, no significant subject-related differences could be found in the MI analyses for the dimension of adaptation (H3) or for classroom management and motivational-emotional support (H2). These results resonate with prior research findings, suggesting, for example, that classroom management is not very sensitive to subject content (Praetorius et al., 2016; Wagner et al., 2013). Contrary to expectations (H1), after the alpha inflation correction of the statistical testing of difference parameters resulting from the indicator-level approach, there were no significant differences between the two subjects for the dimensions selection and implementation of content, cognitive activation, support for consolidation, and assessment and feedback.

Although these results do not support the first hypothesis, it would be premature to suggest that these dimensions, as measured by the INSULA instrument, must be generic. Two subdimensions of selection and implementation of content exhibited significant differences before correcting for alpha inflation, indicating the benefits of investigating subject specificity at the subdimension level. This ties in with Praetorius et al.’s (2020) conclusion that the lower the level of focus (i.e., dimension, subdimension, indicator), the greater the subject specificity. Work at the indicator level could reveal that some indicators are more relevant to one subject than another, which is of particular interest because MAIN-TEACH and its associated instrument INSULA were designed to capture teaching quality in all subjects. In light of these observations, we can say that this study only partially confirms the underlying assumptions of MAIN-TEACH. Further studies that take into account the limitations of this study in terms of both data collection and analysis are required.

With respect to data collection, many data-related issues in this study derive from using a newly developed instrument in combination with school evaluations. Since the instrument was only recently implemented, it might not have been sensitive enough to capture differences in the teaching quality dimensions across subjects. Due to the school evaluation context, our sample also exhibited considerable variability in levels of schooling, teachers, and raters. Additionally, evaluations using INSULA explicitly require sufficient subject-specific expertise for observing the content-related teaching quality dimensions; at the same time, we did not collect data on the raters' concrete level of subject-specific expertise (Lindmeier & Heinze, 2020; Mu et al., 2022).

In terms of data analysis, our study applied an indicator-level MI approach in Structural Equation Modeling (SEM) in addition to overall MI analyses. This approach addresses differences between subdimensions. It offers high resolution, flexibility, and the ability to handle deviations from normality through bootstrapping. However, the multiple tests involved require accounting for alpha error inflation. The applied Bonferroni correction may have resulted in overcorrection and a reduction in statistical power. This is indicated by the fact that the indicator-level approach did not identify any significant differences in intercepts after correction, whereas the overall MI analysis resulted in the rejection of scalar MI. Further research, particularly simulation studies comparing different MI approaches, is needed, as no such studies currently exist. While the statistical properties of the relatively novel indicator-level approach may not be ideal, it serves as a valuable heuristic tool. This method provides insights into subject-related differences at the subdimension or indicator level that are not accessible through overall MI analyses.

6.2 Challenging issues in the quest for understanding subject-related differences in teaching quality

This study should be seen as a first attempt to investigate the issue of subject specificity for frameworks that combine generic and subject-specific dimensions, such as MAIN-TEACH. In this section we discuss how this work could be continued.

First, this study focused on the comparison of teaching quality across subjects. One could go a step further and analyze differences between topics or between learning objectives (see also Litke et al., 2024). Several authors question the logic of drawing the boundary between subjects (e.g., Koedinger et al., 2012; Renkl, in press). It is argued that certain subjects can encompass diverse and heterogeneous areas within themselves, while specific areas across different subjects may exhibit striking similarities (Renkl, in press), which suggests that it would make sense to focus the specification on the particular configuration of topics or learning objectives in a lesson instead of the subject (see also Keller et al., in press; Kirschner et al., 2017).

Second, this study provides some indications that it might be more productive to look for subject specificity at lower levels such as the indicator level instead of focusing on the dimension level. This finding can have important empirical and conceptual implications for future work. At an empirical level, future studies employing different designs (e.g., experimental or triangulated approaches) to measure teaching quality at different levels (e.g., at the indicator level vs. holistically at the dimension level) are warranted. At a conceptual level, collaborative work among subject experts is needed to determine the extent to which certain factors might contribute to making certain subdimensions more or less subject specific. This back-and-forth movement between the empirical and the conceptual work will enable an improved understanding of why and under what circumstances certain subdimensions appear to be more subject specific than others.

Third, the study limitations point to a conglomerate of challenges associated with generating reliable empirical data for the investigation of subject specificity. Ideally, this study would have had a balanced design that disentangled teachers, classes, raters, and subjects as sources of variance (see Dreher & Leuders, 2021). An optimal design would have the same teachers teaching the same students in different subjects while being observed by the same raters. This would be hard to implement on a large scale, however, especially at the secondary level.

Fourth, following Mu et al. (2022), this study corresponds to one of the four suggested approaches to investigating subject specificity, the applicability approach. A next step would be to combine this with the other approaches (i.e., relevance, knowledge, or predictivity) defined by Mu et al. (2022). For example, a combination with the predictivity approach would make it possible to investigate whether teaching quality, as captured by models such as MAIN-TEACH, is more predictive of student learning in certain subjects than in others.

Finally, and more generally, the precise definition and measurement of teaching quality within a specific educational setting remains a focal point of ongoing discussions in research on teaching quality (Bell & Gitomer, 2023). This issue extends beyond subject-specific considerations to various factors, ranging from socio-cultural dynamics to specific content. Navigating this complex matter appears to be a challenging balancing act between frameworks that reflect the very specific teaching situation (i.e., socio-cultural situation, subject matter, content, learning goals) and those that cover rather generic teaching principles relevant to many settings.

7 Conclusion

After years of paying little attention to subject-related differences in teaching quality, educational effectiveness researchers are now increasingly interested in the role the subject plays in teaching quality. Nonetheless, we are only beginning to understand how subject specificity can be empirically studied (see Dreher & Leuders, 2021; Mu et al., 2022). Despite its limitations, our data from school evaluations was an important opportunity for a first empirical examination of subject-related differences in MAIN-TEACH. The complexity of subject specificity, however, requires both collaborative efforts and the exploration of new strategies to advance research in this area and enhance our understanding of teaching quality and its impact on student learning across different subjects.