1 Introduction

1.1 Teacher quality: context and comparability

Internationally, teachers have been cited as the most important school-level determinant of academic success (Darling-Hammond 2000; Hattie 2003; Rivkin et al. 2005; Kyriakides et al. 2013; Nilsen and Gustafsson 2016a, 2016b). However, despite decades of research, there is still considerable debate over the importance of particular teacher characteristics. Research on teacher characteristics varies widely, and ranges from beliefs about intelligence and learning, self-efficacy, job satisfaction and motivation, to workload, and stress (Goe 2007). This study will operate from the theoretical framework defining teacher characteristics within the teacher quality construct outlined by Goe (2007). According to this review, changeable characteristics or teacher “attributes and attitudes” form part of the input dimension of teacher quality. While teachers are considered crucial for student outcomes, evidence on the importance of teacher characteristics is weak or conflicting. A myriad of studies conducted using international large-scale assessment data have found mixed results (Goe 2007; Nilsen and Gustafsson 2016a, 2016b; Toropova et al. 2019). With this in mind, Goe (2007) recommends that more research on teacher characteristics be conducted with a particular focus on the teaching context.

International large-scale assessments (ILSAs) such as the Trends in International Mathematics and Science Study (TIMSS) are well positioned to answer such questions through information collected in the contextual questionnaires for students, teachers, and principals. While such studies have advanced global educational accountability and contributed valuable knowledge regarding determinants of student outcomes, they have also raised questions about the validity of cross-national comparison (Oliveri and von Davier 2011; Biemer and Lyberg 2003). Underlying the contextual questionnaires is the much-debated assumption of scale score equivalence or measurement invariance (MI). Issues related to MI often prevent researchers from answering important substantive questions, which entail comparing latent factor means and the relationships among latent variables across countries or time. The main reason for the concern over measurement invariance in cross-national comparison is the difficulty of measuring psychological traits or constructs across cultures, as cultural factors may influence how respondents interpret and answer such questions. Several scholars have argued that TIMSS is superior to other ILSAs regarding the potential to examine teacher characteristics due to the systematic collection of data directly from teachers. TIMSS is also the only ILSA to link students and teachers directly. Despite this, research on teachers in international large-scale assessments is often limited to comparisons of relationships between variables because of the failure to reach scalar invariance across countries (Nilsen and Gustafsson 2016a, 2016b). Questions that have not yet been investigated include comparisons of latent construct means in teacher questionnaires across education systems or their subgroups. Certain teacher characteristics may matter in some contexts and not in others (Strong 2011). For instance, teachers have been shown to be especially important for low-achieving and socioeconomically disadvantaged students, and especially in mathematics (Goe 2007; Darling-Hammond 2000; Rivkin et al. 2005). Equally, the context (i.e., school, country, or educational system) may predict the teacher characteristics themselves because of differences in system-level characteristics or educational policies. Taken together, mean comparisons and their subsequent connection to student outcomes may offer important insight into teacher-related policies, which researchers have so far been largely unable to investigate.

As will be discussed in the following section, the alignment optimization method outlined by Asparouhov and Muthén (2014) provides one possible resolution to this problem as well as an empirical basis for investigating such contextual questions. This study will utilize this method and examine measurement invariance in five scales of the teacher background questionnaires in TIMSS 2015. These constructs fall under the category of “teacher characteristics, beliefs, and attributes” according to Goe’s (2007) framework, but vary in their scope. Job satisfaction (JS) refers to how satisfied teachers are with their employment and their plans for continuing to teach in the future. School emphasis on academic success (SEAS) refers to teachers’ perceptions of the academic climate and emphasis on academics of other teachers at their school, and Safe and orderly school (SOS) refers to the teacher’s general feelings of safety and organization at their workplace. School condition and resources (SCR) refers to the teacher’s perceptions of their access to teaching resources and how well the school is maintained. Last, teacher’s Self-efficacy (TSE) refers to the teachers’ perceptions of their confidence and ability to teach mathematics (for more on TSE, see Raudenbush et al. 1992).

The present study applies the alignment method as an exploratory tool to examine measurement invariance in the latent constructs from the teacher questionnaires in TIMSS 2015 across educational systems. Our paper is both content and method focused. Our intention is to provide researchers in comparative education—particularly those interested in teacher effectiveness—with one possible starting point for tackling questions which remain unanswered due to issues surrounding measurement invariance and cross-national comparison. The paper seeks to answer the following research questions:

(1) What is the level of configural, metric, and scalar invariance of the teacher-related constructs in the teacher background questionnaires of TIMSS 2015 across educational systems?

(2) Within these constructs, which indicators display the highest level of non-invariance? Is there a statistical basis for making comparisons of these constructs across educational systems?

(3) Based on the newly constructed group mean values, which education systems have the lowest and highest levels of the teacher-related constructs?

1.2 Approaches to measurement invariance and a review of past literature

MI (Jöreskog 1971; Mellenbergh 1989; Meredith 1993) refers to the assumption that latent constructs and their relations should be unrelated to group membership, and is one of the main challenges of working with ILSA data (Gustafsson 2018). Within the traditional multiple group confirmatory factor analysis (MGCFA) approach, several levels of MI are tested, beginning with the configural or baseline model. In order to confirm configural invariance, factors must be equally configured under a similar variance-covariance structure across groups. Next, factor loadings (regression slopes) are compared; if loadings are similar across groups, metric invariance is achieved. This implies that each indicator is related to its underlying latent variable with a similar gradient. Scalar invariance is the most restrictive form of MI and requires regression intercepts to be equivalent, in addition to latent structures and factor loadings. In scalar invariance, the same regression line should be able to estimate the relationship between an indicator and the latent variable for all groups. The three forms of MI build successively upon each other, representing a growing degree of invariance. Violating the assumption of MI results in constraints that inherently limit how researchers may interpret and relay their findings in a comparative context. As the scalar MI assumption is very rarely met, occasionally, “researchers just ignore MI issues and compare latent factor means across groups or measurement occasions even though the psychometric basis for such a practice does not hold” (van de Schoot et al. 2015, p. 1). More cautious approaches avoid comparing constructs altogether. Either scenario may be problematic in the context of ILSA research, given its relevance and potential for educational policy and reform.

There are several conceptual and methodological recommendations for managing MI. Rutkowski and Rutkowski (2010, 2013, 2017) propose the possibility that “one size might not fit all” and that scales be constructed with differing cultural conceptions in mind. A more moderate and early solution comes from Byrne et al. (1989) in partial measurement invariance, which allows intercepts and loadings of individual items to be tested. Following this approach, the majority of scholars recommend basing the types of comparisons on the level of invariance confirmed (i.e., configural, metric, or scalar), and this undoubtedly leads to a smaller number of constructs being investigated because many fail to reach full invariance. Schulz (2016) argues that focusing only on constructs and variables that are highly similar in terms of measurement may lead to a narrowing in the scope of international studies. Generally, partial measurement invariance is a practical assumption in ILSA research, where invariance at the scalar level is rarely confirmed. However, scholars have debated whether the traditional MGCFA approach to partial measurement invariance is the most “simple or interpretable” solution (for more detail, see Marsh et al. 2018 and Asparouhov and Muthén 2014).

A more recent approach, an alignment optimization method, has been proposed (Asparouhov and Muthén 2014). Alignment optimization allows for invariance of individual items to be tested, for scales to be reformulated in order to take non-invariance into consideration, and for a more flexible threshold for measurement invariance to be set. Schulz (2016) writes, “the question is also at what point lack of measurement invariance becomes problematic and leads to problematic bias in cross-national surveys” (p. 15). The alignment method (Asparouhov and Muthén 2014) addresses this question. This method has certain advantages over other approaches to MI. Traditionally, MI is tested using MGCFA at each constraint of the latent factor model, with groups defined by unordered categorical variables (van de Schoot et al. 2015). This approach requires that invariance levels be tested sequentially and for each item, which can result in hundreds of tests. Moreover, such tests can produce inaccurate results if many groups are present or if sample sizes are large (Asparouhov and Muthén 2014; Rutkowski and Svetina 2014). The traditional approach to MI also assumes that full measurement invariance can be achieved, which may be an “unachievable ideal” when the number of groups is large (Marsh et al. 2018; Asparouhov and Muthén 2014). Unlike MGCFA, alignment as outlined by Asparouhov and Muthén (2014) does not assume MI, but identifies a solution which minimizes parameter non-invariance across groups through an iterative process analogous to the rotation in an exploratory factor analysis. Several studies have investigated measurement invariance using the alignment method, with promising results for its use as an alternative to MGCFA. Munck et al. (2018) investigated MI across 92 groups by country, cycle, and gender using civic education data and found that despite significant non-invariance in some groups, comparison of group mean scores had a statistical basis, and that attitudes toward civic engagement across countries and time could be validly compared. Similarly, both Marsh et al. (2018) and Lomazzi (2018) employ the alignment method to test MI of gender role attitudes across countries.

Much attention has been paid to the phenomenon of MI in the student background questionnaires, but much less to teacher-related constructs (Caro et al. 2014; Schulz 2016; Segeritz and Pant 2013; He et al. 2018; Rutkowski and Svetina 2014). Nevertheless, some studies have investigated measurement invariance in teacher background questionnaires using traditional approaches. Examining teacher self-efficacy, Vieluf et al. (2013) find evidence supporting metric equivalence, while Scherer et al. (2016) also find evidence for metric but not scalar invariance. Taking a different approach, Zieger et al. (2019) apply multiple pairwise mean comparisons to teacher job satisfaction in TALIS, whereby they identify the comparability of countries based on such pairs. As with MGCFA, this approach becomes more cumbersome as the number of groups grows. Despite a growing awareness of the potential of the alignment method, application of this approach in investigating the measurement invariance of latent constructs related to teachers and teacher quality is still rare. Our search produced only a single study, published just this year. Zakariya et al. (2020) examined teacher job satisfaction in TALIS, and also found no evidence for scalar invariance. Extending their analysis to include an alignment optimization approach, they found that teachers in Austria, Spain, Canada, and Chile had the highest mean job satisfaction compared to the other countries in the sample. Our analysis does not use the same sampling procedure as TALIS, as TIMSS focuses on teachers as they represent students in a country. Additionally, our results apply only to mathematics teachers, unlike the results of studies looking at all teachers using TALIS data. As such, it will be especially interesting to compare our results to those of Zakariya et al. (2020) and other past studies.

2 Methods

2.1 Data and measurement

TIMSS is a curriculum-based survey, which tests mathematics and science achievement for students in grades 4 and 8 around the world. TIMSS employs a two-stage stratified sampling procedure and samples whole classrooms as well as schools. Additionally, responding to the teacher context questionnaires is mandatory. Student data can therefore be aggregated to the teacher level (Eriksson et al. 2019). TIMSS uses a cross-sectional design and is conducted every 4 years. This study included 46 education systems from the TIMSS 2015 survey. The total sample comprised 13,508 grade 8 (or equivalent) mathematics teachers. In the total sample, 36.8% of teachers were male and 56.6% female; 2.7% were under 25, 12.5% were between 25 and 29, 29.9% were between 30 and 39, 24.7% were between 40 and 49, 18.9% were between 50 and 59, and 4.7% were above the age of 60 (6.6% had no response). In total, 42 separate countries participated, but in some cases sub-regions of countries were included, such as Buenos Aires in Argentina, Ontario and Quebec in Canada, and Dubai and Abu Dhabi in the United Arab Emirates (UAE). The term “education system” will therefore be used interchangeably with country, system, or group. Norway included cohorts from two grades. However, aside from the regions previously listed, the majority of the groups are representative of countries. Table 1 describes each education system and its respective sample size.

Table 1 Education systems and number of teachers sampled

Several teacher-related constructs from the teacher questionnaire were included in the analysis: Teacher Job satisfaction and Self-efficacy, teacher perception of School emphasis on academic success, School condition and resources, and Safe and orderly school. Indicators and coding for each construct can be seen in Table 2.

Table 2 Constructs, indicators, and coding of teacher-related constructs

Each of the constructs included a varying number of indicators. For School emphasis on academic success, only 5 out of a total of 17 indicators were used; the remaining indicators did not relate to teachers and were therefore excluded. All indicators were included for each of the other four constructs. Coding varied from frequency dimensions (i.e., “Very often” to “Never or almost never”) to agreement (i.e., “Agree a lot” to “Disagree a lot”) and more general ratings (i.e., “Very high” to “Low”).

2.2 Alignment optimization

As we have previously discussed, there are three levels of measurement invariance: configural, metric, and scalar. In order to compare latent variable means and variances across subgroups, scalar invariance is required (Millsap 2011). However, this assumption (i.e., equal factor loadings and indicator intercepts across subgroups) often fails. Moreover, the likelihood ratio chi-square testing for each parameter very quickly becomes cumbersome, especially when many subgroups are being compared. The alignment approach does not assume MI and “can estimate the factor mean and variance parameters in each group while discovering the most optimal measurement invariance pattern. The method incorporates a simplicity function similar to the rotation criteria used with exploratory factor analysis” (Asparouhov and Muthén 2014, p. 496). It estimates a factor score for all individuals despite the presence of significant non-invariance in some groups. Alignment starts with estimating such a configural model with group-varying factor loadings and intercepts to latent variable indicators and the factor mean and variance. Consider a configural MGCFA model, written as:

$$ {Y}_{pj}={v}_{pj}+{\lambda}_{pj}{\eta}_j+{\varepsilon}_{pj}, $$

Here, $v_{pj}$ is the intercept of indicator $p$ in group $j$, $\lambda_{pj}$ is the factor loading of indicator $p$ in group $j$, $\eta_j$ is the latent variable for group $j$, and $\varepsilon_{pj}$ is the residual for indicator $p$ in group $j$. In this model, the latent variable mean is fixed to zero and the latent variable variance to 1:

$$ E\left({\eta}_j\right)={\alpha}_j=0;\quad V\left({\eta}_j\right)={\psi}_j=1 $$
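To make the configural specification concrete, the following minimal Python sketch simulates responses to a single indicator in a single group under the model above, with the latent variable drawn with mean 0 and variance 1. The function name, the parameter values, and the assumption of normally distributed residuals are illustrative only and are not taken from the TIMSS analysis.

```python
import random
import statistics


def simulate_indicator(v_pj, lambda_pj, residual_sd, n, seed=0):
    """Draw n responses Y = v + lambda * eta + eps for one indicator in one group.

    eta ~ N(0, 1) mirrors the configural identification (mean 0, variance 1).
    Normal residuals with standard deviation `residual_sd` are an assumption.
    """
    rng = random.Random(seed)
    responses = []
    for _ in range(n):
        eta = rng.gauss(0.0, 1.0)          # latent variable score
        eps = rng.gauss(0.0, residual_sd)  # indicator residual
        responses.append(v_pj + lambda_pj * eta + eps)
    return responses


# Hypothetical parameter values for one indicator in one group.
y = simulate_indicator(v_pj=2.5, lambda_pj=0.8, residual_sd=0.6, n=20000)
sample_mean = statistics.fmean(y)      # should be close to the intercept v
sample_var = statistics.pvariance(y)   # should be close to the implied variance
implied_var = 0.8 ** 2 * 1.0 + 0.6 ** 2  # lambda^2 * psi + residual variance
```

Because the latent mean is fixed to zero, the indicator mean in each group is absorbed entirely by the intercept, which is why group differences in the latent mean cannot be read off the configural model directly.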

As a second step, the fixed factor mean and variance are set free. Normally, this model would be unidentified. The alignment method, however, constrains the parameter estimation by imposing restrictions that optimize the simplicity function F. As is shown in Eq. 3, the sum of the component loss function for the factor loadings and intercepts of every latent variable indicator p between any pair of groups, weighted by their group sizes, should be minimal.

$$ F=\sum \limits_p\sum \limits_{j_1<{j}_2}{w}_{j_1,{j}_2}f\left({\lambda}_{p{j}_1}-{\lambda}_{p{j}_2}\right)+\sum \limits_p\sum \limits_{j_1<{j}_2}{w}_{j_1,{j}_2}f\left({v}_{p{j}_1}-{v}_{p{j}_2}\right) $$

The alignment approach estimates the latent variable mean and variance for each pair of groups in such a way that the parameter estimates are optimized to produce the minimal total amount of non-invariance across groups. This procedure leads to a great number of parameters that have no significant non-invariance across groups and a few being largely non-invariant. Significant differences are tested by z-statistics (for a more detailed description of the algorithm, see Asparouhov and Muthén 2014; Muthén and Asparouhov 2018).
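The simplicity function can be illustrated with a short Python sketch. It is not the Mplus implementation: the component loss f(x) = sqrt(sqrt(x^2 + eps)) with eps = 0.01 and the weights w = sqrt(N_j1 * N_j2) are assumptions taken from the description in Asparouhov and Muthén (2014), the rescaling of configural parameters follows from substituting a free factor mean and variance into the configural model, and all numbers are hypothetical. The sketch only evaluates F for given values; the actual minimization and the identification constraints are omitted.

```python
import math
from itertools import combinations


def component_loss(x, eps=0.01):
    """f(x) = sqrt(sqrt(x^2 + eps)); the form and eps are assumptions (see lead-in)."""
    return math.sqrt(math.sqrt(x * x + eps))


def aligned_parameters(loadings0, intercepts0, alpha, psi):
    """Rescale one group's configural estimates for a candidate factor mean and variance."""
    sd = math.sqrt(psi)
    loadings = [l0 / sd for l0 in loadings0]  # lambda = lambda0 / sqrt(psi)
    # v = v0 - alpha * lambda
    intercepts = [v0 - alpha * l for v0, l in zip(intercepts0, loadings)]
    return loadings, intercepts


def simplicity(loadings, intercepts, group_sizes):
    """Total loss F summed over indicators and over all pairs of groups j1 < j2."""
    total = 0.0
    for j1, j2 in combinations(range(len(group_sizes)), 2):
        w = math.sqrt(group_sizes[j1] * group_sizes[j2])  # assumed pair weight
        for p in range(len(loadings[j1])):
            total += w * component_loss(loadings[j1][p] - loadings[j2][p])
            total += w * component_loss(intercepts[j1][p] - intercepts[j2][p])
    return total


# Hypothetical configural estimates: two indicators, two groups of 500 teachers.
sizes = [500, 500]
config_loadings = [[1.0, 0.8], [2.0, 1.6]]
config_intercepts = [[2.0, 3.0], [2.5, 3.4]]

# F when every factor mean is 0 and every factor variance is 1.
f_configural = simplicity(config_loadings, config_intercepts, sizes)

# F after freeing the second group's factor mean and variance (alpha = 0.5, psi = 4).
l2, v2 = aligned_parameters(config_loadings[1], config_intercepts[1], alpha=0.5, psi=4.0)
f_aligned = simplicity([config_loadings[0], l2], [config_intercepts[0], v2], sizes)
```

In this constructed example the two groups share the same measurement parameters once the second group's factor mean and variance are accounted for, so the aligned loss is smaller than the configural loss.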

The aligned model produces an alignment optimization metric (A-metric) with some useful statistical information for determining measurement invariance of the latent variable across groups. The first important piece of information is the number of groups that show no significant differences in each intercept and factor loading. The ordering of the factor means and the groups with the minimum and maximum intercept and factor loading for each factor indicator are also given in the alignment results. In addition, an R-square measuring the degree of invariance of the intercept and factor loading of each factor indicator is estimated in the model.

$$ {R}_{intercept}^2=1-\frac{V\left({v}_0-v-{\alpha}_j\ \lambda \right)}{V\left({v}_0\right)} $$
$$ {R}_{factor\ loading}^2=1-\frac{V\left({\lambda}_0-\sqrt{\psi_j}\ \lambda \right)}{V\left({\lambda}_0\right)} $$

As is shown in Eqs. 4 and 5, $v_0$ and $\lambda_0$ are the intercept and factor loading estimates from the configural model, and $v$ and $\lambda$ are the average intercept and factor loading estimated from the aligned model. The $R^2$ “tells us how much of the configural parameter variation across groups can be explained by variation in the factor means and factor variances” (Muthén and Asparouhov 2018, p. 643). An $R^2$ value close to one indicates a high degree of measurement invariance, and a value close to zero indicates high non-invariance.
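As a rough illustration of Eqs. 4 and 5, the Python sketch below computes both R-square measures for one indicator across four groups. All values are hypothetical and chosen only to show the calculation; the helper name `invariance_r2` is not part of any software package.

```python
import math
import statistics


def invariance_r2(configural, explained):
    """R^2 = 1 - V(configural - explained) / V(configural), variances taken across groups."""
    residuals = [c - e for c, e in zip(configural, explained)]
    return 1.0 - statistics.pvariance(residuals) / statistics.pvariance(configural)


# Hypothetical values for one indicator in four groups (not TIMSS estimates).
v_bar, lam_bar = 2.0, 1.0          # average aligned intercept and loading
alpha = [0.0, 0.3, -0.2, 0.5]      # aligned factor means per group
psi = [1.0, 1.44, 0.81, 1.21]      # aligned factor variances per group
v0 = [2.02, 2.28, 1.83, 2.49]      # configural intercepts per group
lam0 = [1.01, 1.19, 0.91, 1.10]    # configural loadings per group

# Eq. 4: part of the configural intercepts explained by the factor means.
r2_intercept = invariance_r2(v0, [v_bar + a * lam_bar for a in alpha])

# Eq. 5: part of the configural loadings explained by the factor variances.
r2_loading = invariance_r2(lam0, [math.sqrt(s) * lam_bar for s in psi])
```

Because both variances are computed over the same set of groups, the choice between population and sample variance does not affect the ratio.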

Mplus detects missing-data patterns in the data sets and handles missing data with full information maximum likelihood (FIML) estimation through the EM algorithm. It should also be noted that all models in the study were estimated with the COMPLEX option implemented in Mplus to account for the non-independence of the students and teachers caused by the cluster sampling design in TIMSS (Muthén and Muthén 1998-2017).

2.3 Analytical process

The current analysis was done stepwise. All analyses were conducted using Mplus software 8.3 (Muthén and Muthén 1998-2017). In the first step, a single-factor measurement model was estimated for each of the teacher-related constructs with pooled data. These single-factor measurement models were modified by adding correlated residual terms suggested by the modification indices to obtain acceptable model fit. The significantly correlated residuals indicate that there are common variances between the pairs of residuals, suggesting some narrow dimensions in addition to the single latent factor. In the current study, we are only interested in precisely measuring the general factor, so no narrow residual factors were specified. With the pooled model structure as the point of departure, conventional MGCFA models were estimated for the different teacher-related factors, and model fit indices of the configural, metric, and scalar invariance models were compared for each of the constructs. Based on these comparisons, conclusions about MI were reached. In the next step, the alignment approach was applied to test the degree of measurement invariance of the teacher-related constructs, as mentioned above. The results of the two MI approaches are compared, and the advantages and disadvantages of the two are discussed. Finally, a Monte Carlo simulation was conducted to check whether the conclusions about measurement invariance based on the aligned model results are trustworthy.

3 Results

3.1 Results from the MGCFA approach

A single-factor measurement model was fitted to each of the teacher-related constructs with the pooled data of all 46 education systems. These single-factor measurement models, however, did not fit the data well. Modification indices suggested the inclusion of one or more correlated residuals to improve the model fit. These modified single-factor model structures were used to test the measurement invariance across the 46 groups in the conventional approach. Table 3 presents the model fit indices of the configural, metric, and scalar MI models for all teacher-related constructs.

Table 3 Conventional measurement invariance model fit and model comparisons for all teacher-related constructs

The configural models of all the latent constructs in Table 3 show acceptable or close model fit, with the Root Mean Square Error of Approximation (RMSEA) and Standardized Root Mean Square Residual (SRMR) being below .08, and the comparative fit index (CFI) and Tucker-Lewis index (TLI) being greater than .95 (see, e.g., Hu and Bentler 1999). Three out of the five teacher-related factors (teacher perception of School emphasis on academic success, School condition and resources, and teacher’s Self-efficacy) reached metric invariance, which implies that the factor loadings of each of these three latent constructs were equal across all educational systems, but not the intercepts of the latent construct indicators. It may also be observed that none of the scalar MI models fits the data, indicating that the assumption that both intercepts and factor loadings are equal across the 46 systems does not hold.

With the traditional measurement invariance approach, the most restrictive MI assumption (scalar invariance) was not supported. Additionally, metric invariance was found in only three latent constructs. Consequently, cross-country comparisons cannot be made for the latent variable means or for the relationships among the latent variables. Given these results, the next section will aim for an approximate partial measurement invariance (e.g., Millsap and Kwok 2004) by using the alignment approach (Muthén and Asparouhov 2014).

3.2 Results from alignment optimization

Alignment optimization explores partial (approximate) measurement invariance by starting out with a well-fitting configural model. It then adjusts the factor loadings and intercepts of the factor indicators so that these parameter estimates are as similar as possible across groups without compromising the model fit. Essentially, the fit of the aligned model stays the same as that of the configural invariance model. In this section, the aligned model results for each of the five teacher-related factors will be presented.

3.2.1 Job satisfaction

Table 4 presents the results from the aligned modeling approach for the latent construct JS. The highest R-square of the intercept estimate is observed for the variable My work inspires me. About 87% of the variation in the intercept observed in the configural model can be explained by the variation in latent variable mean and variance in the aligned model, indicating a high degree of invariance. Morocco is the only non-invariant country in the intercept estimate of the indicator I am proud of the work I do. This variable together with the indicator I am enthusiastic about my job also displayed a rather high R-square. I am content with my profession as a teacher and My work inspires me hold completely invariant factor loading estimates across all systems. For the variables I am enthusiastic about my job, and I find my work full of meaning and purpose, a large number of groups with invariance in the intercept estimates are also observed, ranging from 44 to 46 educational systems. The variable I am going to continue teaching as long as I can holds the least invariant intercept with the R-square being the lowest, 44%. For the factor loadings, the indicator I am proud of what I do is the least invariant, with an R-square of 23%.

Table 4 Results from the aligned model of job satisfaction (JS)

Countries with extreme parameter estimates can be found in columns 4 to 7. For example, South Korea holds the lowest intercept estimate for My work inspires me, while Canada-Ontario has the lowest factor loading estimate. In general, the overall degree of invariance of the construct JS is rather high, with few education systems showing measurement non-invariance in the factor loadings, consistent with the close fit of the metric invariance model in Table 3. The average invariance index is 58% for JS. The percentage of significantly non-invariant groups is 8.9%, much lower than the limit of 25% suggested by Muthén and Asparouhov (2014). A higher number of groups show invariance in the factor loadings of each of the indicators as compared to the intercepts.
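The comparison against the 25% limit can be sketched in a few lines of Python. The sketch assumes that the percentage is the number of group-specific parameters flagged as significantly non-invariant divided by the total number of group-specific intercepts and loadings; the counts below are hypothetical and do not reproduce Table 4.

```python
def noninvariance_share(noninvariant_counts, n_groups):
    """Share of group-specific parameters flagged as significantly non-invariant.

    `noninvariant_counts` holds, for each intercept and each loading, the number
    of groups in which that parameter is non-invariant (assumed definition).
    """
    total_parameters = len(noninvariant_counts) * n_groups
    return sum(noninvariant_counts) / total_parameters


# Hypothetical counts for seven intercepts followed by seven loadings.
counts = [2, 1, 9, 12, 5, 0, 3, 0, 1, 0, 2, 0, 1, 0]
share = noninvariance_share(counts, n_groups=46)
within_limit = share < 0.25  # 25% rule of thumb cited in the text
```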

3.2.2 Teacher perception of school emphasis on academic success

Five indicators are used to identify the latent construct of school emphasis on academic success, and the results from the aligned model of SEAS are presented in Table 5.

Table 5 Results from the aligned model of school emphasis on academic success (SEAS)

For factor loading estimates, all five indicators to the construct School emphasis on academic success showed complete invariance over the 46 countries. This agrees with the model fit indices for the metric invariance model in Table 3. For the intercepts, only two countries are non-invariant for the indicator Teachers’ degree of success in implementing the school’s curriculum, corresponding with the high R-square estimate 73%. The intercept of Teachers’ expectations for student achievement holds the most variation, with only half of the countries being invariant. The minimum and maximum estimates of the intercept and factor loadings can be found in columns 4 to 7. Only 7.8% of groups have been observed with significant non-invariance. In general, the high degree of confidence indicated by the average invariance index of .65 implies that the mean of the construct SEAS can be compared meaningfully across the different groups.

3.2.3 Teacher perception of school conditions and resources

Table 6 shows the results of approximate invariance from the aligned model of the school condition and resources.

Table 6 Results from the aligned model of school condition and resources (SCR)

As revealed in Table 6, four indicators (The school building needs significant repair, Teachers do not have adequate instructional materials and supplies, The school classroom needs maintenance work, and Teachers do not have adequate support for using technology) have invariant factor loadings across all education systems. Only Lithuania is non-invariant in the factor loadings for the variables Teachers do not have adequate workplace and Teachers do not have adequate technological resources. The R-square for these indicators also showed a high degree of invariance, being above 60%. However, one exception can be observed for the variable The school building needs significant repair, for which the R-square is 29%, despite showing complete invariance across all groups. For the intercept estimates, the number of non-invariant systems in each indicator ranges from 4 for the variable The school classroom needs maintenance work (R-square = 82%) to 10 for the variable Teachers do not have adequate workplace (R-square = 57%). These results agree with the conventional measurement invariance results, where metric invariance was achieved for the SCR construct but not scalar invariance (see Table 3).

The average invariance index for the construct SCR was 62%, indicating 62% confidence to carry out trustworthy cross-system comparisons. The total non-invariance measure is 8.39%, below the limit of 25%.

3.2.4 Teacher perception of safe and orderly school

Among the 8 indicators of the latent construct Safe and orderly school (Table 7), The students behave in an orderly manner, The students respect school property, and The students are respectful of the teachers are completely invariant in the factor loadings over the 46 countries. The R-square estimate for the factor loading of these three variables is around or above 70%, implying that approximately 70% or above of the variation in the factor loadings estimated in the configural model can be explained by the factor mean and variance across the groups. For these three variables, the standard deviation of the parameter mean is also smaller, compared to those of other indicators. The lowest R-square for the factor loading is observed in the indicator The school is located in a safe neighborhood (29%), relating to a larger variation (see column 3 under SD).

Table 7 Results from the aligned model of safe and orderly school (SOS)

Students respect school property holds the highest R-square (i.e., 83%) for its intercept estimate; only Lebanon is non-invariant. The lowest R-square is found in the indicator The school’s rules are enforced in a fair and consistent manner (35%). The number of countries with non-invariant intercepts ranges from 1 to 13. Metric invariance is supported by the model fit indices of the conventional measurement invariance model and was confirmed by the aligned model.

In sum, the parameter estimates of the latent variable model reached 58% confidence for making reliable cross-country comparisons, and the percentage of significant non-invariance across education systems is only 9.8% over all estimated parameters.

3.2.5 Teacher’s self-efficacy

Aligned model results for self-efficacy can be seen in Table 8. The intercept estimates show the indicator Developing students’ higher-order thinking skills as the most invariant, with an R-square of about 90%. Here, only four educational systems show measurement non-invariance and the variance in the estimated mean intercept is rather small. The intercept estimate for indicator Making mathematics relevant to students also holds a high R-square (86%). Improving the understanding of struggling students and Assessing student comprehension of mathematics show the lowest R-square values, implying a high degree of non-invariance. This is also confirmed by the higher standard deviations in column 3. Over ten educational systems show non-invariance for these two indicators. Columns 4 to 7 present the education system with the minimum or maximum estimate of the intercepts.

Table 8 Results from the aligned model of teacher’s self-efficacy (TSE)

The number of educational systems with invariant factor loadings for the TSE construct is higher than that for the intercepts. Developing students’ higher-order thinking skills, Improving the understanding of struggling students, Providing challenging tasks for the highest achieving students, and Adapting my teaching to engage students’ interest are completely invariant over all 46 education systems. The factor loading estimate for Inspiring students to learn mathematics has the highest number of non-invariant systems (5).

In general, the average invariance index was rather high for all estimated parameters in the aligned model, and the proportion of significantly non-invariant groups was low. We therefore have 57% confidence in making meaningful comparisons of the means and variances of teacher self-efficacy.

3.3 Monte Carlo simulation

As recommended by Asparouhov and Muthén (2014), Monte Carlo simulations were conducted in order to check the quality of the alignment results of the five teacher-related factors. These simulations used parameter estimates from the alignment models as data-generating population values. For each of the teacher-related factors, two sets of simulations were run with 100 replications, 46 groups, and two different group sample sizes (500 vs. 1000). Table 9 shows the correlation between the generated population values and estimated parameters.

Table 9 Correlations between the generated population and aligned estimated values

The correlations in Table 9 are averages, over the 100 replications, of the correlation between the population factor mean (or factor variance) and the model-estimated factor mean (or factor variance). These correlations are generally very high, most being .98 or above, and the average correlation tends to be higher for the factor mean than for the factor variance. However, relatively low correlations are also observed for the simulations based on a group sample size of 500, for example, .95 for the average correlation of the factor variance in Job satisfaction and .96 in teacher perception of School emphasis on academic success. These correlations tend to increase when the group sample size is raised to 1000. Asparouhov and Muthén (2014) suggested a level of .98 for these correlations to confirm reliable alignment estimates, and noted that a correlation below .95 may be cause for concern. The current simulations therefore suggest that the aligned results for the teacher-related constructs are, to a great extent, reliable for cross-country comparison, despite some non-invariance among education systems. The aligned models work better when the group sample size is larger, implying asymptotic accuracy of the alignment results under maximum likelihood estimation.
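The summary statistic described above (a correlation between population and estimated values, averaged over replications) can be sketched in Python as follows. This is only an illustration under strong assumptions: estimation error is mimicked by normal noise with standard deviation 1/sqrt(n), whereas an actual check would generate data and re-fit the alignment model in every replication. All function names and population values are hypothetical.

```python
import math
import random
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_recovery_correlation(
    population: Sequence[float],
    group_n: int,
    replications: int,
    rng: random.Random,
) -> float:
    """Average, over replications, of corr(population, estimate)."""
    # ASSUMPTION: sampling error of a group's factor mean is approximated
    # by normal noise with SD 1/sqrt(group_n); no model is actually fitted.
    noise_sd = 1.0 / math.sqrt(group_n)
    correlations: List[float] = []
    for _ in range(replications):
        estimates = [mu + rng.gauss(0.0, noise_sd) for mu in population]
        correlations.append(pearson(population, estimates))
    return sum(correlations) / len(correlations)


rng = random.Random(2015)
# Hypothetical population factor means for 46 groups.
population_means = [rng.gauss(0.0, 0.5) for _ in range(46)]
r_500 = average_recovery_correlation(population_means, 500, 100, rng)
r_1000 = average_recovery_correlation(population_means, 1000, 100, rng)
```

Under these assumptions the averaged correlation rises with the group sample size, which mirrors the pattern reported for Table 9; the values themselves carry no substantive meaning.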

3.4 Average estimates of intercepts and factor loadings across invariant groups

Table 10 presents the weighted average estimates of factor loadings and intercepts across all invariant groups in each teacher-related construct. These weighted mean values are common to the invariant education systems and apply only to those systems. The number of such systems can be found in the column next to the weighted mean of intercepts and factor loadings.

Table 10 Weighted average estimates across invariant groups

As shown in Table 10, the highest average intercept for teacher’s Self-efficacy, for example, is observed for the indicator Providing challenging tasks for the highest achieving students (v = 1.616), and the lowest for Helping students appreciate the value of learning mathematics (v = 1.375). The average factor loading was highest for Developing students’ higher-order thinking skills (λ = .495), indicating that this indicator forms an important part of the construct of self-efficacy in teaching mathematics.
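The idea of a weighted average restricted to invariant groups can be illustrated with a short Python sketch. The weighting scheme is not specified in the text; the sketch assumes weights proportional to group sample size, and the function name and all numbers are hypothetical.

```python
from typing import Sequence, Tuple


def weighted_mean_invariant(
    estimates: Sequence[float],
    weights: Sequence[float],
    invariant: Sequence[bool],
) -> Tuple[float, int]:
    """Weighted mean of one parameter over the groups flagged invariant.

    Returns the mean and the number of invariant groups it is based on.
    """
    if not (len(estimates) == len(weights) == len(invariant)):
        raise ValueError("inputs must have one entry per group")
    kept = [(e, w) for e, w, ok in zip(estimates, weights, invariant) if ok]
    if not kept:
        raise ValueError("no invariant groups")
    total_weight = sum(w for _, w in kept)
    mean = sum(e * w for e, w in kept) / total_weight
    return mean, len(kept)


# Toy example with three groups; the third is non-invariant and excluded.
# ASSUMPTION: weights are group sample sizes.
mean, n_invariant = weighted_mean_invariant(
    estimates=[1.5, 1.7, 2.4],
    weights=[100, 300, 200],
    invariant=[True, True, False],
)
```

In this toy case the non-invariant third group does not enter the average, so the result depends only on the first two groups, which is why such averages apply only to the invariant systems.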

3.5 Comparing estimated latent variable means of the teacher-related constructs

Latent variable means of all teacher-related latent constructs were estimated for the 46 education systems by the aligned model (see Appendix Table 11). Groups can be compared based on these factor means.

3.5.1 Teacher job satisfaction

The latent variable mean of teacher job satisfaction is based on indicators concerning teachers’ feelings of contentment with the profession as a whole, their current school, their enthusiasm and pride in their work, and their intention to continue teaching. According to the estimated mean of JS in Fig. 1, students in Japan, Singapore, England, Hong Kong, and Hungary have mathematics teachers with the highest level of job satisfaction as compared to other education systems in TIMSS 2015. Students in Italy, Lithuania, Sweden, South Korea, and New Zealand also have mathematics teachers with relatively high levels of job satisfaction. By contrast, in Chile, Qatar, Thailand, Argentina (Buenos Aires), Kuwait, Oman, Israel, Lebanon, Malaysia, and the United Arab Emirates, students have mathematics teachers who are the least satisfied with their job.

Fig. 1 Estimated latent variable means for mathematics teachers’ job satisfaction

3.5.2 Teacher perception of safe and orderly school

Broadly, SOS refers to whether teachers feel the schools are located in a safe neighborhood and feel the students are respectful. The latent variable mean of SOS is shown in Fig. 2. The results indicated that students in Botswana, South Africa, Morocco, Turkey, Japan, Italy, Slovenia, South Korea, Sweden, and Jordan had mathematics teachers with the highest levels of perceived school safety. In Argentina (Buenos Aires), Ireland, Kazakhstan, Norway, UAE, Lebanon, Qatar, Singapore, Hong Kong, and Lithuania, students had mathematics teachers with the lowest levels of perceived school safety and orderliness.

Fig. 2 Estimated latent variable means for mathematics teachers’ perceptions of safe and orderly school

3.5.3 Teacher perception of school conditions and resources

SCR refers to school infrastructure, whether teachers have adequate workspace and instructional materials, and whether the school environment is well taken care of. Results for latent mean comparisons can be found in Fig. 3. Students’ mathematics teachers in Botswana, South Africa, Turkey, Morocco, Saudi Arabia, Egypt, Jordan, Armenia, Malaysia, and Iran reported the highest levels of satisfaction with school conditions and resources. In UAE, Singapore, and Bahrain, students’ mathematics teachers reported the lowest perceptions of SCR.

Fig. 3 Estimated latent variable means for mathematics teachers’ perceptions of school condition and resources

3.5.4 Teacher perception of school emphasis on academic success

SEAS is indicated by teachers’ perceptions of whether teachers understand schools’ curricular goals, their success in implementing the curriculum, their expectations for student achievement, and their ability to inspire students. Latent variable means are presented in Fig. 4. Recall that SEAS is reverse coded so countries with the lowest levels show the highest mathematics teacher perceptions of SEAS. Students in Italy, Japan, Russia, Hong Kong, Chile, Hungary, Sweden, Norway, Turkey, and Thailand have mathematics teachers who report the highest levels of SEAS. In Qatar, Malaysia, Oman, Ireland, Canada, South Korea, UAE, Bahrain, and Kazakhstan, students generally have mathematics teachers who report the lowest levels of school emphasis on academic success.

Fig. 4 Estimated latent variable means for mathematics teachers’ perceptions of school emphasis on academic success

3.5.5 Teacher self-efficacy

Latent variable means for TSE are found in Fig. 5. Teacher self-efficacy is measured by teachers’ feelings of capacity to inspire students in mathematics, show students a variety of problem-solving strategies, adapt their teaching to engage students, make mathematics relevant, and develop higher-order thinking skills. In Japan, Hong Kong, Singapore, Chinese Taipei, Thailand, Iran, Morocco, New Zealand, Sweden, and England, students have mathematics teachers who report the highest levels of self-efficacy in teaching mathematics. In Qatar, UAE, Bahrain, Lebanon, Oman, Argentina (Buenos Aires), Slovenia, Kazakhstan, and Botswana, students have mathematics teachers with the lowest levels of self-efficacy to teach mathematics.

Fig. 5 Estimated latent variable means for mathematics teachers’ self-efficacy

4 Discussion and concluding remarks

Seeking an optimal alternative for assessing measurement invariance of the teacher-related constructs across multiple countries, the current study compared the more restrictive traditional MI approach with an alignment optimization method. With TIMSS 2015 data from 46 countries as the empirical basis, the results confirm the initial position of this study. In the traditional MI approach, metric invariance was reached for only three constructs, namely teacher perception of School emphasis on academic success, School condition and resources, and teacher Self-efficacy. This result implies limited comparability across countries, restricted to the associations between these constructs and other variables being studied. The quest for furthering cross-national comparability is a worthwhile and essential endeavor in large-scale international studies.

In this study, the purpose of the alignment optimization method is to address previously unanswerable questions related to group mean comparisons. Scalar invariance was not reached for any of the teacher-related constructs, signifying that under the traditional MI framework, latent factor means could not be validly compared in any case. The results from the alignment optimization approach, however, paint a different picture, since the method takes into account partial invariance in the parameters of each latent variable indicator and identifies the optimal measurement invariance pattern when assessing comparability (Asparouhov and Muthén 2014). Departing from the configural invariance models, the current study found a low number of indicators in each construct and country with significant non-invariance. Accordingly, all five constructs fell below the non-invariance threshold of 25% suggested by Asparouhov and Muthén (2014). In general, the Monte Carlo simulations confirm the reliability of the majority of the alignment results, with some caution warranted around Job satisfaction and School emphasis on academic success. These results give valuable information about what contributes most to scalar non-invariance. Indeed, the indicator-by-indicator results may be more informative about cultural and societal differences across the constructs than traditional MI approaches.

It is noteworthy that the teacher Self-efficacy construct in particular reached an acceptable invariance level, as the cultural comparability of self-efficacy has long been the subject of inquiry in the teacher quality literature (see Scherer et al. 2016; Vieluf et al. 2013). The current findings support those of Scherer et al. (2016) in suggesting that teacher Self-efficacy is a construct that can be generalized across cultures. The results for teacher Job satisfaction are more difficult to compare with previous research, as the construct for teacher job satisfaction in TIMSS differs greatly from that in TALIS. In TALIS (2013), the construct includes regretting becoming a teacher, whether teachers would make the same decision if they could decide again, whether they wonder if it would have been better to choose another profession, and whether the advantages of being a teacher outweigh the disadvantages (Zakariya et al. 2020). By contrast, the TIMSS teacher job satisfaction construct includes pride and enthusiasm for the job, ability to feel inspired, intention to continue teaching, and satisfaction with the profession as a whole and with working at the current school. Nevertheless, both Zakariya et al. (2020) and Zieger et al. (2019) found statistical grounds to compare the construct across some countries. Zieger et al. (2019) present a more conservative approach, cautioning that comparisons with Chile, Shanghai, Mexico, and Portugal are unreliable. In the current study, only Chile overlaps as an education system with these countries. Interestingly, this is the country for which our results differ the most from previous research. Zakariya et al. (2020) found that teachers in Chile reported among the highest levels of job satisfaction compared to other countries, while we find that students in Chile have mathematics teachers with some of the lowest levels of job satisfaction.
Perhaps this reflects mathematics teachers differing from other teachers, or perhaps it reflects a more serious issue of comparability. As mentioned, JS was a construct that displayed some reliability concerns in the Monte Carlo simulation. This echoes other investigations recommending caution around cross-cultural comparisons of teacher JS (Pepe et al. 2017; Zieger et al. 2019). There is little empirical research on MI for the other constructs, including teacher’s perceptions of School emphasis on academic success, Safe and orderly school, and School conditions/resources. The results of this study, therefore, provide the first evidence for the potential comparability of the majority of these constructs.

Several insights came out of simple observations of the resulting factor mean scores. First, it was possible to detect which countries are on the higher or lower ends of the constructs. As mentioned, Japan, Singapore, England, Hong Kong, and Hungary had the highest levels of mathematics teacher job satisfaction, with Qatar, Chile, Kuwait, Thailand, and Argentina (Buenos Aires) reporting the lowest levels of mathematics teacher JS. Interestingly, countries whose students have mathematics teachers reporting the highest levels of job satisfaction also tend to be among the top performers in mathematics in 2015 (Singapore, Japan, and Hong Kong in particular). However, more research needs to be done to investigate the relationship between these newly constructed means and student outcomes. Our results for teacher job satisfaction differ vastly from those of Zakariya et al. (2020). This is not so surprising, however, given the differing sample (we focus on mathematics teachers only), the entirely different indicators of job satisfaction, and the different countries included. In addition, recall that TIMSS samples teachers as representative of students in a country, while TALIS samples teachers as representative of teachers in a country. We are more interested in the former for this paper, as our ultimate interest in cross-national comparison is the comparison of educational contexts of students. For TSE, similar patterns emerged, with top mathematics performers Japan, Singapore, Hong Kong, and Chinese Taipei taking the top-ranking positions. Middle Eastern countries such as Qatar, UAE, and Bahrain reported the lowest levels of TSE. The Japanese sample also displayed the highest level of self-efficacy, contradicting the oft-discussed cultural tendency in Japan to avoid self-enhancement (Takata 2003).
The other constructs did not show similarly evident clusterings of countries, such as the contrast between East Asian countries (which tended to report higher levels of job satisfaction and self-efficacy) and those situated in the Middle East (which tended to report lower levels of most constructs). It was possible to detect a small group of African countries (Botswana, Morocco, and South Africa) which tended to report high levels of both satisfaction with school conditions and resources and perceptions of safety and orderliness in the school. Future research can investigate these differences in more detail and examine potential hypotheses as to why they exist.

This study has some limitations. As mentioned by Munck et al. (2018), differentiating sources of bias from each other (i.e., method bias related to the instrument versus construct bias, see Schulz 2016) is not possible with this method. Next, interpreting the importance of the non-invariance of individual indicators (as compared to the final average invariance index) is not straightforward. Determining the ultimate degree of comparability rests on the total alignment score. In our study, the Monte Carlo results for JS and SEAS fell below the recommended threshold when the N was reduced to 500, indicating potentially unresolvable issues with the comparability of these constructs. Last, there are some important potential limitations of the alignment optimization method itself which call into question its usefulness as an alternative to the traditional MGCFA approach. Svetina et al. (2016) write that “this sort of latent variable standardization implies that the latent variables are not on the same scale, and as a result, cannot be compared” (p. 128). They and other authors argue that it should be used primarily as an exploratory approach. We believe such an exploratory approach is extremely useful in the context of international comparison, particularly in the case of research on teacher characteristics, where certain questions continue to be ignored because of obstacles related to the issue of MI.

We believe the significance of the present study outweighs its limitations. First, it demonstrates and supports the possibilities of applying the proposed method to the field of comparative psychological and educational research. Next, as mentioned throughout this paper, it presents ways for ILSA researchers to investigate previously unanswered questions related to group mean comparisons of latent constructs. Alignment can be applied to assess the comparability of a myriad of other student-related or school-related constructs. It has implications for policy-related research, given that system-level factors may be related to group mean scores. Last, it has important implications for future research investigating the importance of teacher characteristics for student outcomes.

We have several recommendations for future research regarding this method. First, as mentioned, differences in group mean scores and those in the individual indicators can give us important information about cultural differences which should not necessarily exclude their comparison. Future research can investigate such differences with potential cultural conceptions in mind. Next, policy-makers should pay attention to countries which consistently score high on constructs reflecting teacher job satisfaction, self-efficacy, and their working environments. Such countries include Japan, Singapore, Hong Kong, and Chinese Taipei. Similarly, there is much to be learned about countries which consistently score low, such as many countries in the Middle East. Such differences may be attributable to differences in teacher resources and teacher-focused policies. We are also particularly interested in the role of teacher characteristics in student outcomes. Researchers may also use this method to examine first whether JS, SEAS, SCR, SOS, and TSE are comparable across TIMSS surveys, and then to examine changes in teacher characteristics across the last two decades for countries. We also recommend more in-depth comparisons of teacher characteristics across subgroups in participating countries, such as students from disadvantaged socioeconomic backgrounds. Ultimately, further investigations of such questions would yield more insight into the potentially context-dependent aspect of teacher characteristics as they relate to student achievement.

The purpose of international large-scale assessments is to examine differences in educational systems across countries. However, as noted by Scherer et al. (2016), in much public policy research “there is a pre-occupation with cross-cultural differences rather than of cross-cultural generalizability” (p. 4). Herein lies the paradox of research with international large-scale assessments. ILSA and comparative education research necessitate that education systems have differences, but their differences almost never comply with the restrictive statistical rules necessary for cross-country comparison. Although not without limitation, the method outlined in this paper provides one way forward. The growing number of studies using this method suggests possible changes in the future of large-scale assessment research, and scholars are extending its capacity (Marsh et al. 2018). According to Munck et al. (2018), the alignment optimization method can “update existing databases for more efficient further secondary analysis and with meta-information concerning measurement invariance” (p. 687). Measurement invariance has become a problem that all comparative education researchers must eventually face, either by making ill-founded comparisons or by avoiding latent factor mean comparisons altogether. This method offers one promising way for large-scale assessment research to reach its full potential for influencing policy and educational reform.