
3.1 Introduction

The present book adopts a common approach to theories, conceptualizations, and methodology throughout all chapters. The primary objective of this chapter is to describe the shared methodology that serves as a foundation for each chapter. While the individual chapters can be read independently, they all rest on common methods and assumptions, which are outlined in this chapter. A key aspect of this methodology is the uniform data preparation across all chapters. The chapter also provides an overview of the Trends in International Mathematics and Science Study (TIMSS) and its design. The overall methodology employed in the book is then examined in greater detail, covering the reliability and validity of the constructs examined in the chapters, the process of data preparation, and the analytical approaches employed. By presenting a clear and coherent account of the shared methodology, the chapter equips readers to appreciate the interconnectedness of the chapters and the holistic approach taken in this book.

3.2 About TIMSS

3.2.1 The TIMSS Assessment and Questionnaires

TIMSS was first implemented in 1995, building on earlier studies such as the First and Second International Mathematics Studies (FIMS and SIMS) (Brown, 1996). TIMSS is carried out by the International Association for the Evaluation of Educational Achievement (IEA), with its administration managed by the TIMSS & PIRLS International Study Centre at Boston College. TIMSS measures students’ competence in mathematics and science and is conducted every four years, primarily targeting fourth- and eighth-grade students. However, since 2015, Norway has shifted the target grades from grade four to grade five and from grade eight to grade nine to enable better comparisons with other Nordic countries. This was necessary because Norwegian students were typically about one year younger than students in these countries, and adjusting the target grades helps to ensure a fairer comparison (Bergem et al., 2016).

The mathematics test contains more than 200 tasks (hereafter referred to as items), while the science test includes about 250 items (Mullis & Martin, 2017a). Approximately half of these items are in multiple-choice format, and the rest are open-ended. The test frameworks are based on the participating countries’ curricula. All countries provide feedback on the frameworks and participate in developing the mathematics and science items in every cycle. The frameworks, test items, and questionnaires are subject to an extensive quality assurance process, which includes several rounds of feedback and revision as well as a field-trial study (Cotter et al., 2020).

The TIMSS 2019 mathematics assessment for fourth-grade students is divided into three content domains: number, measurement and geometry, and data (Mullis & Martin, 2017a). These domains represent 50%, 30%, and 20% of the test, respectively. Each domain has specific objectives for students to achieve. For instance, in the number domain, students should be able to add and subtract up to 4-digit numbers in simple contextual problems. Similarly, the science assessment for fourth grade has three content domains: life science (45%), physical science (35%), and earth science (20%), each with its own set of specifications. For example, life science covers the topic of human health, which requires students to identify or describe methods of preventing disease transmission (e.g., vaccination, washing hands, avoiding people who are sick) and to recognize common signs of illness (e.g., high body temperature, coughing, stomach-ache). The assessment framework also provides a detailed description of the content and objectives for each domain.

In addition to the content domains, the frameworks for the mathematics and science assessments specify a cognitive domain describing the thinking processes to be assessed (Martin et al., 2017). In the fourth-grade assessment, this domain is divided into three areas: knowing (40%), applying (40%), and reasoning (20%). This cognitive dimension ensures that all aspects of competence are assessed, as students are expected to demonstrate their knowledge of a topic, apply this knowledge in different contexts, and engage in reasoning through processes such as synthesis, evaluation, and generalization.

In addition to the mathematics and science assessments, TIMSS also gathers information related to teaching and learning processes through questionnaires (Mullis & Martin, 2017b). The questionnaires are administered to students, teachers, and school leaders (principals) in grades four and eight, as well as to parents in grade four. The student questionnaires include, among other things, questions about socioeconomic status (SES), minority status (in terms of language), bullying, perceptions of teaching quality, and motivation to learn mathematics and science. Teachers are asked a number of questions, including their educational background, teaching experience and specialization, what they teach in the classroom (content coverage), how they teach and assess their students, and perceptions of the school environment. Similarly, principals are asked, among other things, about their educational background, school composition, instructional resources, and learning environment. Parents answer questions related to their children’s education, including educational resources at home and their child’s early numeracy and literacy. These questionnaires provide valuable contextual information that can help to better understand the factors that influence student achievement in mathematics and science (Mullis & Martin, 2017b).

3.2.2 The TIMSS Design

International large-scale assessments (ILSAs), such as TIMSS, have complex designs that must be taken into account when analyzing the data.

Hierarchical Design

The target population in TIMSS typically includes students in the fourth and eighth grades, and representative samples are drawn from these populations in each participating country. To achieve this, TIMSS employs a hierarchical design in its sampling and data collection process (Martin et al., 2020). The sampling procedure consists of a two-stage random sample design, which involves selecting schools and then selecting one or more classrooms within these schools (Martin et al., 2020). In addition to the selected students in an intact classroom, the sample also includes their mathematics and science teachers, their principals, and, in grade four only, their parents. While the teachers do not themselves constitute a representative sample, each student is linked to their mathematics and science teachers, which supports the validity of inferences related to teachers. This hierarchical design ensures that the data collected are representative of the target population and provides valuable insights into the mathematics and science achievement of students in participating countries.

The hierarchical design has implications for data analysis. If this design is not accounted for in the analyses, the standard errors of the estimates may be underestimated (Rutkowski et al., 2010). This is because students within the same classroom or school tend to resemble one another, thereby violating the assumption of a simple random sample (Rutkowski et al., 2010). Several methods can be employed to address this issue; the empirical chapters in this book apply multilevel analysis to account for the clustering of the data.
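To give a rough sense of why ignoring the clustering inflates uncertainty, the sketch below (in Python) computes Kish’s design effect and the corresponding effective sample size; the class size, intraclass correlation, and sample size are purely illustrative assumptions, not TIMSS estimates.

```python
# Kish's design effect for cluster sampling: how much the variance of an
# estimate is inflated when students are sampled in intact classrooms.
def design_effect(avg_cluster_size: float, icc: float) -> float:
    """DEFF = 1 + (m - 1) * ICC, with m the average cluster size."""
    return 1 + (avg_cluster_size - 1) * icc

def effective_sample_size(n: int, deff: float) -> float:
    """Number of independent observations the clustered sample is 'worth'."""
    return n / deff

# Hypothetical values: 25 students per classroom, ICC of 0.20, 4,000 students
deff = design_effect(avg_cluster_size=25, icc=0.20)
print(deff)                                 # 5.8
print(effective_sample_size(4000, deff))    # about 690 'independent' students
```

Under these assumed values, treating the 4,000 clustered students as a simple random sample would understate the standard errors considerably, which is why multilevel models are used instead.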

Sampling Weights

Sampling weights adjust the data so that the sample accurately represents the population being studied (Meinck & Vandenplas, 2020). Without weighting, the data may be biased and may not provide accurate estimates of population parameters. In TIMSS, appropriate sampling weights are necessary to account for the complex sample design (Martin et al., 2020). Researchers therefore need to use the appropriate sampling weights in their analyses to reflect the hierarchical design of the sample and to ensure that the results are representative of the population being studied (Rutkowski et al., 2010; Stapleton, 2013). Failure to use appropriate sampling weights can lead to biased estimates of population parameters, which can compromise the validity of the study findings (Meinck & Vandenplas, 2020).

In this book, TIMSS sampling weights and weight factors were taken into account when analyzing the data at the individual or classroom level in the empirical chapters. For analyses using a multilevel model, the chapters use multilevel weights following the recommendations from Rutkowski et al. (2010) and Stapleton (2013). At the student level, the weight is set to a product of student response adjustment and student weight factor (WGTADJ3 × WGTFAC3). At the classroom level, the weight is a product of school response adjustment, school weight factor, classroom response adjustment, and classroom weight factor (WGTADJ1 × WGTFAC1 × WGTADJ2 × WGTFAC2). For more information on weights and weight factors, see LaRoche et al. (2020).
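As a simple illustration of how these products can be formed in practice, the following Python sketch constructs the student-level and classroom-level weights from the TIMSS weight variables named above; the file name and the new variable names (W_STUDENT, W_CLASS) are hypothetical choices, not part of the TIMSS data.

```python
import pandas as pd

# Assumes a merged student-level file that already contains the TIMSS weight
# adjustments and weight factors (WGTADJ1-3, WGTFAC1-3).
df = pd.read_csv("timss_grade4_merged.csv")  # hypothetical file name

# Student-level (within) weight: student response adjustment x student weight factor
df["W_STUDENT"] = df["WGTADJ3"] * df["WGTFAC3"]

# Classroom-level (between) weight: school and classroom response adjustments
# multiplied by the school and classroom weight factors
df["W_CLASS"] = df["WGTADJ1"] * df["WGTFAC1"] * df["WGTADJ2"] * df["WGTFAC2"]
```

These two variables would then be supplied as the within-level and between-level weights in the multilevel analyses.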

Trend Design

TIMSS is designed to allow comparisons of student performance in grades four and eight across different cycles of the assessment, known as the trend design. This approach allows changes and trends to be tracked every four years. It is made possible by retaining approximately half of the test items from one cycle to the next, ensuring continuity in the content assessed (Martin & Mullis, 2019). Furthermore, most countries participate in every cycle of TIMSS, which enables the concurrent calibration of scales (Martin & Mullis, 2019). This method allows researchers to place achievement on the same scale as in previous cycles using trend items and trend countries, making direct comparisons across cycles possible (Martin & Mullis, 2019). The trend design helps in identifying changes in student achievement over time, as well as in examining the impact of factors such as content coverage and teaching practices on student performance. However, few secondary analyses take advantage of the trend design. In this book, the trend design is utilized to examine how changes in content coverage, teaching quality, and assessment practice relate to changes in achievement over time (see Chaps. 6 and 7).

Plausible Values

As in other ILSAs, TIMSS uses plausible values to represent student proficiency in mathematics and science. Each subject is assessed with more than 200 test items (Martin et al., 2017). To minimize the time students spend on the test, items are divided into blocks that preserve the same distribution across content and cognitive domains as the overall test, following the assessment framework (Martin et al., 2020). In TIMSS 2019, there were 28 blocks of items, comprising 16 blocks of trend items from previous cycles (eight blocks in mathematics and eight blocks in science) and 12 blocks of items that were new in 2019. TIMSS 2015 and 2011 had the same design with 28 blocks. Each student receives two blocks of mathematics items and two blocks of science items (Martin et al., 2020). Because individual students only receive a subset of the entire test, plausible values are used to estimate content-related scale scores at the group level rather than to provide accurate individual-level scores (von Davier et al., 2009).

From TIMSS 2011 to 2019, five plausible values were drawn for each student. These values are randomly drawn from an empirically derived distribution of score values based on the student’s observed responses to assessment items and selected background variables (von Davier et al., 2009). When analyzing the data, researchers must take these plausible values into account to obtain accurate estimates of the relationships between variables (Rutkowski et al., 2010; von Davier et al., 2009). This procedure requires separate analyses for each set of plausible values. Once analyses are conducted for all sets, the resulting model parameters are pooled (Laukaityte & Wiberg, 2017). The pooling involves calculating the means across all sets of model parameters, while the variances are quantified according to Rubin’s combination rules, which consider the variances within and between plausible values and the number of plausible values (Laukaityte & Wiberg, 2017). The empirical chapters in this book employed Mplus, a statistical analysis software, which offers a convenient option (i.e., TYPE = IMPUTATION) to perform the analysis for each set of plausible values and automatically combine the resulting model parameters.
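As a rough illustration of this pooling step, which Mplus carries out automatically under TYPE = IMPUTATION, the Python sketch below applies Rubin’s combination rules to a single model parameter estimated once per plausible value; the estimates and standard errors are invented for illustration.

```python
import numpy as np

def pool_plausible_value_estimates(estimates, standard_errors):
    """Pool one parameter estimated separately for each plausible value
    using Rubin's combination rules."""
    estimates = np.asarray(estimates, dtype=float)
    standard_errors = np.asarray(standard_errors, dtype=float)
    m = len(estimates)                          # number of plausible values (here 5)
    pooled = estimates.mean()                   # pooled point estimate
    within = (standard_errors ** 2).mean()      # average sampling variance
    between = estimates.var(ddof=1)             # variance between plausible values
    total = within + (1 + 1 / m) * between      # total variance
    return pooled, np.sqrt(total)               # pooled estimate and its standard error

# Hypothetical regression coefficient estimated five times, once per plausible value
estimate, se = pool_plausible_value_estimates(
    estimates=[12.4, 11.9, 12.8, 12.1, 12.6],
    standard_errors=[2.1, 2.0, 2.2, 2.1, 2.0],
)
```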

Cross-sectional Design

TIMSS, along with other ILSAs, employs a cross-sectional design (Martin et al., 2017). This design involves collecting data at a specific point in time, typically once every four years in the case of TIMSS, to assess and compare the performance of students across participating countries. In a cross-sectional design, data are collected from different participants in the same age group or grade level, without following them over time. This approach allows researchers to identify patterns, trends, and relationships between various factors, such as education systems, teaching practices, and student achievement. However, since data are collected only at a single time point, it is not possible to establish causal relationships or determine why some variables change over time (Cummings, 2018). As a result, causal language should be avoided in favor of discussing “relationships” rather than “effects”. For instance, instead of discussing the “effect” of content coverage on achievement, it is more appropriate to refer to it as the “relationship” between content coverage and achievement.

Given that this book is intended for educational policy stakeholders, practitioners, and researchers, using overly technical language may hinder clarity and comprehension, particularly when discussing advanced methodologies. As a result, some language choices in this book may be simplified to improve understanding. Nevertheless, it is crucial to emphasize that causal relationships between predictors and outcomes cannot be established through cross-sectional designs.

3.3 The Main Measures Used in This Book

This section focuses on describing common measures of teacher practice examined throughout the book (for the theoretical foundations of teacher practice, see Chap. 2).

3.3.1 What Teachers Teach: Content Coverage

Content coverage represents a critical aspect of any curriculum, as it outlines the topics students will learn and the knowledge they are expected to acquire. TIMSS distinguishes between the intended curriculum at the national level, the implemented curriculum at the classroom level, and the attained curriculum as learning outcomes at the student level (Mullis & Martin, 2017b). This book mostly focuses on a narrow concept of content coverage to describe student exposure to various topics in mathematics and science. Specifically, content coverage refers to the TIMSS mathematics and science topics that teachers report having taught in their classrooms, as measured by the TIMSS teacher questionnaire. This measure of content coverage is employed in Chaps. 4, 6, and 8. In addition, Chap. 4 examines the alignment of content coverage (the implemented curriculum) with the topics covered by the intended national curriculum in the Nordic countries (the curriculum questionnaire) and with the attained curriculum as assessed in the TIMSS test. Meanwhile, Chaps. 6 and 8 investigate the relations between content coverage and student achievement across the various domains of mathematics and science (the attained curriculum).

Content coverage was assessed using the teacher questionnaire, focusing on the extent to which teachers had taught specific topics covered in the TIMSS test to fourth-grade students (Martin et al., 2020). Teachers were asked to indicate when they had taught various topics within the subdomains of mathematics and science. In TIMSS 2019, mathematics included seven topics in the content domain number, seven in measurement and geometry, and three in data, whereas science included seven topics in life science, 12 in physical science, and seven in earth science. For example, in the number domain, teachers were asked when the class had been taught the topic “concepts of whole numbers, including place value and ordering”. The response scale includes mostly taught before this year, mostly taught this year, and not yet taught or just introduced. There are two main issues with this scale. First, new teachers may not know what students have been taught before. Second, the relatively large number of items may lead to a higher rate of missing data, particularly in science (which is asked after mathematics in the questionnaire). Furthermore, “not yet taught” and “just introduced” represent distinct concepts.

Given these issues with the response scale, measuring content coverage as an indicator of the implemented curriculum was not straightforward. To address this, the book used the percentages of students who had been taught each of the topics (before or during the school year) as reported by their teachers, averaged across all topics in each subject domain, and also across all topics in all subject domains. In mathematics, these percentages are represented by three variables: the percentages of students taught the topics in number (ATDMNUM), measurement and geometry (ATDMGEO), and data (ATDMDAT). In science, the percentages of students taught the topics in life science (ATDSLIF), physical science (ATDSPHY), and earth science (ATDSEAR) were used. It is worth noting that the proportion of missing values at the item level was quite large, especially in science. In TIMSS 2019, data for the percentages of students taught science topics are available for at least 70% but less than 85% of the students in Denmark and Sweden, whereas in Norway the data are available for at least 50% but less than 70% of the students. The high proportion of missing values in these countries highlights the need for caution when interpreting the results, as they may not fully reflect the true content coverage.
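To make the construction of these coverage measures concrete, the Python sketch below computes, for a handful of hypothetical topic items, the percentage of students whose teacher reports the topic as taught before or during the school year, and then averages these percentages across the topics of a domain. The item names, response codes, and file name are assumptions, and the sketch ignores sampling weights and treats missing responses as not taught, which the actual derived variables do not.

```python
import pandas as pd

df = pd.read_csv("teacher_student_linked.csv")   # hypothetical linked teacher-student file

# Hypothetical topic items for the content domain number
number_topics = ["MATH_TOPIC_01", "MATH_TOPIC_02", "MATH_TOPIC_03"]
TAUGHT_CODES = {1, 2}   # 1 = mostly taught before this year, 2 = mostly taught this year

taught = df[number_topics].isin(TAUGHT_CODES)    # True where the topic was covered
pct_per_topic = taught.mean() * 100              # percentage of students taught each topic
coverage_number = pct_per_topic.mean()           # averaged across the topics of the domain
```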

In the curriculum questionnaire, TIMSS National Research Coordinators (NRCs) responded to a set of questions focusing on national curriculum policies and practices related to each country’s education system, as well as on the organization and content of the mathematics and science curricula (Martin et al., 2020). The NRCs were asked whether each of the TIMSS mathematics and science topics was included in their country’s intended curriculum and, if so, whether the topic was intended to be taught to “all or almost all students” or “only the more able students”. The TIMSS 2019 curriculum questionnaire was administered online, and participants were advised to draw on the expertise of curriculum specialists and educators. However, there were no reliability checks or procedures in place to reduce measurement uncertainty or improve reliability (Martin et al., 2020). Consequently, the curriculum questionnaire was examined together with the teacher questionnaire in this book.

To measure the attained curriculum, student achievement in the specific content domains was used. For mathematics, five plausible values of achievement were used for each of the domains number, measurement and geometry, and data; similarly, for science, five plausible values were used for life science, physical science, and earth science.

3.3.2 How Teachers Teach: Teaching Quality

TIMSS measures teaching quality using a combination of student and teacher questionnaires. As discussed in Chap. 2, this book uses three basic dimensions of teaching quality: classroom management, supportive climate, and cognitive activation (Klieme et al., 2009). Classroom management was assessed for the first time in TIMSS 2019 using student questionnaires that asked students how often various situations happened in their mathematics classrooms. Students were presented with six items (e.g., “my teacher has to keep telling us to follow the classroom rules” or “students interrupt the teacher”) and were asked to indicate the frequency of their occurrence on a response scale comprising never, some lessons, about half the lessons, and every or almost every lesson. A scale called Disorderly Behavior During Mathematics Lessons was also created based on students’ responses to these items.

Supportive climate encompasses various aspects, including teacher support, classroom interaction (teacher–student and student–student relationships), and instructional clarity. This book focuses on teacher support and instructional clarity, which were assessed using the student questionnaires in TIMSS 2011 to 2019. Students indicated their agreement with various statements on a response scale comprising agree a lot, agree a little, disagree a little, and disagree a lot. Note that only two items were similar across these cycles (i.e., “I know what my teacher expects me to do” and “my teacher is easy to understand”). The items were asked separately for mathematics and science classrooms.

This book differentiates between general and subject-specific cognitive activation (see Chap. 2 for further details). Teachers reported how often they engaged students in various activities on a response scale ranging from never to every or almost every lesson. Only a few items are similar across TIMSS 2011 to 2019: two items in general cognitive activation (e.g., “relate the lesson to students’ daily lives”), three items in mathematics cognitive activation (e.g., “apply what students have learned to new problem situations on their own”), and five items in science cognitive activation (e.g., “design or plan experiments or investigations”). For the first time in 2019, TIMSS added an item to the student questionnaire asking students how often they conduct experiments in their science lessons, with a response scale of never, a few times a year, once or twice a month, and at least once a week. This item represents inquiry-based cognitive activation in science and is included in the analysis to supplement the teacher questionnaire.

The TIMSS’ hierarchical design, which involves clustering students in intact classrooms and collecting information from both student and teacher questionnaires, is highly suitable for measuring teaching quality. A significant advantage of this method is that students’ responses provide first-hand experiences of the teaching process. Ideally, if the goal is to accurately measure teaching quality, all students would rate their teacher similarly. However, perceptions may vary among students. To account for these variations, the chapters that explore whether teaching quality may “explain” achievement differences between classrooms employ a two-level model at the student and classroom levels. In this model, the perceptions of teaching quality are controlled at the student level, and the results are focused on the classroom level. This approach aligns with previous research and offers a reliable method for measuring teaching quality (e.g., Lüdtke et al., 2007; Marsh et al., 2009).
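The Python sketch below only illustrates the logic of this separation using a manifest (observed) aggregation: each student’s rating is split into a classroom mean and the student’s deviation from that mean. The empirical chapters estimate this within a two-level SEM in Mplus rather than by manual aggregation, and the file and column names below are hypothetical.

```python
import pandas as pd

df = pd.read_csv("student_ratings.csv")   # hypothetical file of student ratings

# Classroom-level component: the average perception of teaching quality
# among all students in the same classroom
df["CLASS_MEAN_TQ"] = df.groupby("IDCLASS")["TEACH_QUALITY"].transform("mean")

# Student-level component: how an individual student's perception deviates
# from the classroom average (controlled at the student level)
df["STUDENT_DEV_TQ"] = df["TEACH_QUALITY"] - df["CLASS_MEAN_TQ"]
```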

Some challenges may arise in measuring teaching quality using student and teacher questionnaires. Research has shown that young students may have difficulties in evaluating their teachers and distinguishing between different aspects of teaching quality (Lüdtke et al., 2007). This means that students may perceive teachers who generally perform well as having high quality in all aspects of teaching, regardless of whether this is the case. Conversely, teachers who perform poorly in one aspect may be perceived as having low quality in all aspects of teaching (Teig & Nilsen, 2022). TIMSS also collects information about teaching quality through the teacher questionnaire, which covers more items and deeper aspects of teaching quality than the student questionnaire. Nevertheless, these self-reported measures may be susceptible to social desirability bias (Muijs, 2006). Teachers may feel pressure to provide responses that they believe are socially desirable, rather than providing honest answers, which can lead to the over-reporting of positive behaviors and the under-reporting of negative behaviors. This book used both the student and teacher questionnaires to minimize the challenges associated with either approach.

As previously discussed, the TIMSS design, which links students with their teachers, is well-suited for measuring teacher practice, and indeed, teacher practices have been measured in all TIMSS cycles since 1995 (Klieme & Nilsen, 2022). However, the measurement of teaching quality—tailored specifically to classroom management, teacher support and instructional clarity, and cognitive activation—is a more recent inclusion. These aspects of teaching quality were not specifically incorporated into the TIMSS context questionnaire framework until 2015. Classroom management was added later, in 2019, but only for mathematics. Moreover, with each TIMSS cycle, more information is gathered, leading to valuable insights and improvements in the measures of teaching quality. The teaching quality measures in TIMSS have thus improved through both the pilot studies and the main studies in 2015, 2019, and 2023 (Klieme & Nilsen, 2022). This implies that the validity of the teaching quality measures is higher in the chapters that utilize data from TIMSS 2019 (i.e., Chaps. 5 and 9) than in the trend chapter that analyzes changes from 2011 to 2019 (i.e., Chap. 7).

3.3.3 How Teachers Assess Their Students: Assessment Practice

TIMSS measured various aspects of teachers’ assessment practices in 2011, 2015, and 2019. Nevertheless, only the items related to homework were similar across these cycles. Measuring homework in TIMSS is challenging due to varying definitions and frequencies across countries and schools. Two primary homework measures exist: the frequency of assigned homework and the expected time students spend on it. Interpreting these measures separately can be difficult because teachers may assign homework rarely but provide tasks that take students a long time to complete. Combining the frequency of homework with the time spent on it therefore provides a more useful estimate (see Chap. 7 for a description of the procedure). In addition, teachers were asked how they integrate homework into their teaching (in-class homework discussion). Teachers reported how often they correct assignments, provide feedback, discuss homework in class, and monitor homework completion. This additional information helps create a more comprehensive understanding of homework as an assessment practice in different educational contexts.

In TIMSS 2019, five new items were added to measure how much importance teachers place on various assessment strategies in mathematics and science, such as observing students, asking questions during class, short written assessments, longer tests, and long-term projects. These new items allow for a more in-depth analysis of teachers’ assessment practices and contribute to a better understanding of their impact on students’ learning outcomes.

3.4 Analytical Approaches

3.4.1 Data Preparation

To maintain coherence and enable comparisons across the chapters of this book, all data were prepared in advance and analyzed in the same manner. Two main files were created: one for the chapters that analyze TIMSS 2019 data, and another file containing merged data from 2011, 2015, and 2019 for the chapters conducting trend analyses. The IDB Analyzer was employed to merge teacher and student data and to combine data from different countries. Some variables required reverse coding to ensure that higher values represented more positive outcomes.

To accommodate multilevel analyses at the student and classroom levels, records linking a student to more than one mathematics or more than one science teacher (between 2 and 6% of the data) were deleted at random so that each student was linked to only one mathematics and one science teacher. This simplification provides a clear, one-to-one relationship between each student and their respective teachers, which improves the clarity and interpretability of the results by avoiding the need to average responses across different teachers, who may have different teaching practices and interactions with the student. It also avoids the issue of students having varying experiences with different teachers, which could complicate the interpretation of the findings; for example, if one teacher interacts with a student more than another, it may not be appropriate to weigh their responses equally. Finally, this approach adheres to the nested structure assumed in the multilevel models, in which each student is nested within a single teacher (classroom), even though it necessitates the removal of some data.
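A minimal Python sketch of this simplification is shown below: when a student is linked to more than one teacher in the same subject, one link is kept at random. The file and column names are hypothetical stand-ins for the TIMSS identifiers, and a fixed random seed is used so that the selection is reproducible.

```python
import pandas as pd

links = pd.read_csv("student_teacher_links.csv")   # one row per student-teacher link

one_link_per_student = (
    links.sample(frac=1, random_state=12345)        # shuffle the links reproducibly
         .groupby(["IDSTUD", "SUBJECT"], sort=False)
         .head(1)                                    # keep one (random) link per student and subject
)
```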

For the trend analysis that required merged data from the three TIMSS cycles, a dummy variable called TIME was created and coded as 0 for 2011, 1 for 2015, and 2 for 2019. This variable allowed for an easy comparison of trends across different cycles and aided in the identification of patterns or changes over time. Furthermore, unique identification numbers were assigned to students and teachers to guarantee uniqueness across countries and over time. This approach facilitated the tracking of individual data points and the comparison of trends across different cycles. By ensuring the uniqueness of identification numbers, potential issues related to data duplication or misinterpretation were minimized, contributing to the overall reliability and validity of the findings.
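A short Python sketch of this preparation step is given below: the TIME variable is coded 0, 1, and 2 for the three cycles, and student identifiers are made unique by combining country, cycle, and student IDs. The file names and the format of the combined identifier are hypothetical choices.

```python
import pandas as pd

frames = []
for year, code in [(2011, 0), (2015, 1), (2019, 2)]:
    d = pd.read_csv(f"timss_{year}_merged.csv")      # hypothetical file names
    d["TIME"] = code                                  # 0 = 2011, 1 = 2015, 2 = 2019
    # Unique student ID across countries and cycles (country ID + year + student ID)
    d["UID_STUDENT"] = (
        d["IDCNTRY"].astype(str) + "_" + str(year) + "_" + d["IDSTUD"].astype(str)
    )
    frames.append(d)

trend = pd.concat(frames, ignore_index=True)          # merged file for the trend analyses
```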

3.4.2 Preliminary Analysis

In all empirical chapters, we employed Mplus software to analyze the data within the Structural Equation Modeling (SEM) framework. SEM consists of two parts: a measurement part, which assesses the reliability and validity of the constructs and the model itself, and a structural part, which consists of regressions among observed variables or constructs. The measurement part of the modelling includes Confirmatory Factor Analysis (CFA). Whenever possible or appropriate, CFA was used to create latent variables, which represent underlying constructs that are not directly observable (Brown, 2015). CFA is a valuable approach for assessing the reliability and validity of a latent variable. Using Mplus, we can estimate factor loadings (standardized values typically between 0 and 1) to ascertain how well each item measures the underlying concept. For instance, if we want to measure student motivation in mathematics, items such as “I like mathematics” or “mathematics is my favorite subject” would typically show high factor loadings (e.g., 0.9). Conversely, if we included an item on whether the students like chocolate, its factor loading would be very low and probably non-significant. Careful, theory-based item selection in CFA is therefore crucial to create an accurate representation of the constructs under investigation.

SEM integrates factor analysis, path analysis, and regression techniques, allowing researchers to test and estimate the relationships between observed and latent variables (Brown, 2015). SEM is a robust methodology that enables researchers to perform estimations simultaneously at both the student and classroom levels (Hox et al., 2017; Morin et al., 2014). This approach allows for a more comprehensive analysis of the complex relationships among variables in educational research. Key elements in this process are model fit and factor loadings, which serve as indicators of reliability and validity and support the overall quality of the findings. By assessing model fit, researchers can determine how well the proposed model represents the observed data (Brown, 2015; Hu & Bentler, 1999). Factor loadings, on the other hand, show the strength of the relationships between observed variables and their underlying latent constructs (Brown, 2015). These assessments are crucial for establishing the credibility of the results and for interpreting the implications of the findings in the context of educational research.

Before conducting the main analyses within the chapters, it was crucial to determine whether the measures were comparable across countries. Therefore, Measurement Invariance (MI) testing was conducted to determine whether the constructs of teacher practice measure the same underlying construct across different groups. MI testing is essential in cross-cultural research to ensure that comparisons made between groups are meaningful and valid. Note that MI testing was conducted only to ensure comparability across the Nordic countries for the TIMSS 2019 data. MI testing across time was not conducted because very few similar items measure teacher practice across TIMSS 2011 to 2019.

The MI testing was conducted within the framework of SEM or CFA. The process involves a series of hierarchical model comparisons, where increasingly restrictive models are compared to assess the invariance of the measurement. There are three primary levels of measurement invariance (Kang et al., 2015; Sass & Schmitt, 2013):

1. Configural invariance. This level establishes that the same factor structure (i.e., the relationship between items and the underlying latent construct) holds across groups. At this stage, no equality constraints are imposed on the model parameters.

2. Metric invariance (weak invariance). At this level, the factor loadings (the strength of the relationship between items and the latent construct) are constrained to be equal across groups. Establishing metric invariance suggests that the units of measurement are the same across groups, allowing for meaningful comparisons of relationships between constructs.

3. Scalar invariance (strong invariance). This level tests whether the item intercepts (the expected item score when the latent construct is zero) are equal across groups. Scalar invariance is necessary for meaningful comparisons of latent means or group differences.

To assess the fit of the invariance models, fit indices such as the Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), Root Mean Square Error of Approximation (RMSEA), and the Standardized Root Mean Square Residual (SRMR) were used. If the fit of a more restrictive model was not substantially worse than the fit of the previous model, it could be concluded that the measurement is invariant at that level (Chen, 2007; Cheung & Rensvold, 2002). When comparing the nested models, the following cut-off values for fit indices were used to determine the degree of invariance achieved: a decrease of 0.01 or less in the CFI or TLI (Cheung & Rensvold, 2002), a change in RMSEA of 0.015 or less, and an increase in SRMR of 0.03 or less (Chen, 2007).
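To illustrate how these cut-offs are applied when moving from one invariance level to the next, the Python sketch below compares the fit of a more restrictive model with that of the previous model; the fit values are hypothetical.

```python
def invariance_supported(fit_previous: dict, fit_restrictive: dict) -> bool:
    """Return True if the more restrictive model does not fit substantially
    worse, using the cut-offs from Cheung & Rensvold (2002) and Chen (2007)."""
    return (
        fit_previous["CFI"] - fit_restrictive["CFI"] <= 0.01 and
        fit_previous["TLI"] - fit_restrictive["TLI"] <= 0.01 and
        fit_restrictive["RMSEA"] - fit_previous["RMSEA"] <= 0.015 and
        fit_restrictive["SRMR"] - fit_previous["SRMR"] <= 0.03
    )

# Hypothetical fit indices for a configural and a metric invariance model
configural = {"CFI": 0.985, "TLI": 0.980, "RMSEA": 0.030, "SRMR": 0.025}
metric     = {"CFI": 0.979, "TLI": 0.975, "RMSEA": 0.036, "SRMR": 0.041}
print(invariance_supported(configural, metric))   # True: metric invariance supported
```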

As shown in Appendix 1, all constructs of teacher practice exhibited metric or scalar invariance, with the exception of cognitive activation strategies in mathematics using the student questionnaire, in-class homework discussion, and teachers’ emphasis on various assessment strategies. Consequently, these aspects of teacher practice are treated as manifest variables rather than latent constructs in the analyses. Moreover, cross-country comparisons that incorporate these variables need to be interpreted carefully. It is essential to acknowledge that these non-invariant constructs might reflect differences in understanding or interpretation across countries. Caution should be exercised when drawing conclusions based on these variables, as the observed differences may not necessarily indicate true differences in the constructs themselves.

By using a consistent analytical approach across chapters, this book offers a comprehensive and coherent view of teacher practice. The utilization of MI testing, SEM, and CFA analyses allows for reliable and valid comparisons across countries, contributing to the overall understanding of various aspects of teacher practice in different educational contexts. This approach strengthens the book’s capacity to provide insights and inform policy discussions, helping stakeholders make informed decisions for improving educational practices and outcomes.

3.5 Concluding Remarks

To facilitate the comparison of findings across the chapters, it was crucial for all authors to use the same prepared data and adopt a consistent methodology. However, certain deviations in operationalizing the constructs were occasionally needed. For instance, authors might have made minor adjustments to the operationalizations to enable model convergence or enhance model fit. Furthermore, due to changes in some items across the 2011 to 2019 cycles, the constructs used in chapters analyzing trends might differ from those in chapters utilizing TIMSS 2019 data.

Regardless of these minor discrepancies, the overall coherent approach across the chapters supports a valuable integration of knowledge about what teachers teach (content coverage), how teachers teach (teaching quality), and how teachers assess their students (assessment practice) from a range of perspectives. This uniformity ensures that the findings from different chapters can be effectively compared, offering a comprehensive understanding of the intricate relationship between teacher practice and student achievement in various educational contexts.