2.1 Rationale

The framework of the TIMSS study describes policy-malleable features at the system, school, classroom and student levels that are known to influence selected desired outcomes of education, such as achievement in the core curricular domain of mathematics (Mullis et al. 2009). Without going into details of the multi-stage sampling procedure applied in TIMSS, a distinguishing feature is that it produces a sample of intact classrooms, including their mathematics teacher(s), representing the 4th grade students in the participating countries (Joncas and Foy 2012). In other words, the data set from TIMSS provides a unique opportunity to link responses from students in a classroom with those from their teacher(s) for a large number of world regions, educational cultures and systems (in the following also called “countries”).

It is well known from previous research that classrooms matter. First and foremost, teachers matter (for a summary of the state of research see, for example, Kyriakides et al. 2009). Teachers’ experience, teacher education background, beliefs and motivations, as well as their content knowledge, pedagogical content knowledge, and general pedagogical knowledge (actual and perceived), are characteristics that, to varying degrees, have been shown to have effects on student outcomes. Secondly, teaching or instruction matters for student outcomes (for a summary of research see, for example, Seidel and Shavelson 2007). Educational effectiveness studies and qualitatively oriented classroom observational studies seem to converge on some key features of high quality instruction. In short, high quality teaching consists of instructional practices that keep students engaged in cognitively active time on task.

However, there are not many studies seeking to model how teacher quality is related to student achievement, and how teacher quality is put into action by what teachers actually do in the classrooms. This research gap applies particularly to international comparative research. Most of the reported studies of these relationships, although valuable (for example Baumert et al. 2010), took place in one country only, and usually in a Western country. Comparative research that tries to extend the findings from these studies to other educational cultures and systems is lacking. The generalizability of the findings is therefore an open question.

From most definitions of learning it follows that learning occurs as a result of an interaction between the individual learner and his or her surroundings. In the school setting, such interactions are most often planned and staged by the teacher. Teacher quality should thus matter, but the degree of its influence may vary depending on the teacher quality indicator or the educational system. Furthermore, although some aspects of teacher quality have been shown to be directly positively related to student outcomes, they are also resources for the instructional processes in classrooms, and hence teacher quality may be a predictor of instructional quality. As pointed out above, we know for instance that stronger pedagogical content knowledge of mathematics teachers (one possible indicator of teacher quality) is positively related to student achievement in mathematics (Baumert et al. 2010). This may be a direct effect, where teachers influence individual students by diagnosing their (mis)conceptions and addressing these directly, or it may be an indirect effect, where such knowledge enables teachers to create classroom conditions for learning in which students are cognitively challenged and activated.

In line with this reasoning, we hypothesized that teacher quality is partly mediated by instructional quality. Although the capacity of TIMSS to address this issue is limited because of its design and instruments, the study has collected a lot of information from the teachers about their background and dispositions. The study has also collected rudimentary information, from both the teachers and the students, about the degree to which the classroom is characterized by instructional activities known from other research to be beneficial for student learning.

Against this background, the following research questions led this study:

  (1) Which teacher characteristics are significantly related to instructional quality?

  (2) To what extent do the relations between teacher quality and instructional quality vary by country? Is it possible to identify regions or clusters of countries where similar relational patterns exist?

  (3) Is instructional quality significantly related to student achievement? Does this relation vary by country, and does a pattern exist that applies to countries from larger regions or cultures?

  (4) If teacher quality is significantly related to instructional quality and if instructional quality is significantly related to achievement, does instructional quality partially mediate the relation between teacher quality and student outcomes?

2.2 Theory

2.2.1 Educational Effectiveness Research as the Point of Reference

The studies presented in this book are rooted in the tradition of educational effectiveness research (Sammons 2009; Scheerens and Bosker 1997). The analysis in this chapter seeks to establish the structural relationship between aspects of teacher quality, instructional quality and student outcomes, with the hypotheses that teacher quality matters significantly positively for instructional quality and student outcomes, that instructional quality matters significantly positively for student outcomes, and that instructional quality partly mediates the influence of teacher quality on student outcomes. Several models for effective schools have been proposed, all of which to some degree include teacher quality and instructional quality. Our model employed a section of the dynamic model proposed by Creemers and Kyriakides (2008). However, ours is a “static” model used to analyze cross-sectional data, and should accordingly be seen as a pragmatic conceptualization of the relationship between these core concepts of teaching and learning, reflecting the design and data available from the TIMSS study.

Educational effectiveness research (Nordenbo et al. 2008; Scheerens 2013) relates to an explicit notion of input-process-output logic, usually represented by regression models, where an educational outcome, in our case grade four students’ mathematics achievement, is modelled as a function of one or more independent variables, in our case teacher quality and instructional quality. In most of these models one or more intervening concepts are included, in our case instructional quality, to conceptually relate the modelled variables. In other words, this is empirical research that tries to open up the educational system as a “black-box”, where the input is the amount of resources, conditions or other antecedents hypothesized to be related to variation in the outcome. The complexities of studying the degree to which possible inputs affect an outcome involve variables that relate to one or more of the levels in the education system. TIMSS is designed to provide data where these complexities are represented by data at both the student and the class/teacher level.

Scheerens (2013, pp. 10–12) suggested that the lack of a unifying theoretical model for school research may well reflect that “[t]he complexity of educational ‘production’ may be such that different units and levels are addressed by different theories,” and he concluded his systematic review of the theoretical underpinning of educational effectiveness research by stating “[a]s it comes to furthering educational effectiveness research, the piecemeal improvement of conceptual maps and multilevel structural equation models may be at least as important as a continued effort to make studies more theory driven.” This chapter and the other chapters in this book are intended to provide improvements in the conceptual understanding of what characterizes effective instructional practice. By including multiple educational systems, these chapters will also contribute to addressing questions regarding the degree to which educational effectiveness research can provide models and theories that are also sensitive to the wider social, political and cultural context in which education is embedded.

2.2.2 Teacher Quality

Teacher quality (TQ) includes different indicators of teacher qualifications, in particular characteristics of teachers’ educational background, amount of experience in teaching, and participation in professional development (PD), as well as personality characteristics such as teachers’ self-efficacy. A number of previous studies were able to relate measures of such teacher characteristics to student educational outcomes (see for instance the review by Wayne and Youngs 2003).

Evidence suggests that the quality of teacher education does have an impact on teachers’ educational outcomes in terms of teacher knowledge and skills (Blömeke et al. 2012; Boyd et al. 2009; Tatto et al. 2012); these, in turn, are significantly related to instructional quality and student achievement (Baumert et al. 2010; Hill et al. 2005; Kersting et al. 2012). The degree and major academic disciplines studied can be regarded as indicators of teachers’ education, although they are only rough approximations of specific opportunities to learn. In the case of mathematics teachers, a major in mathematics delivers the body of content knowledge necessary to present mathematics to learners in a meaningful way and to connect mathematical ideas and topics to one another, as well as to the learner’s prior knowledge and future learning objectives (Wilson et al. 2001; Cochran-Smith and Zeichner 2005). However, knowing the content provides only a foundation for teaching; student achievement is higher if a strong subject-matter background is combined with strong educational credentials (Clotfelter et al. 2007). Correspondingly, teachers’ pedagogical content knowledge and content knowledge of mathematics are of great importance for instructional quality and student achievement in mathematics, with the former exerting a greater effect than the latter (Baumert et al. 2010; Blömeke and Delaney 2012). Whether teachers had an education where mathematics or mathematics education were a major focus and the type of degree are proxy variables available in TIMSS. This makes it possible to study how teachers’ educational background may affect teaching and students’ achievement across countries.

An almost universal characteristic seems to be that teachers do not feel sufficiently prepared for their complex tasks, in particular during the first years on the job (Kee 2012). TIMSS developed three constructs reflecting teachers’ preparedness to teach numbers, geometry and data, respectively. The constructs were developed within the context of Bandura’s social-cognitive theory, and the measures of teachers’ preparedness for teaching may reasonably be assumed to reflect a concept which is similar to teacher self-efficacy (Bandura 1986; Pajares 1996). Self-efficacy beliefs influence thought patterns and emotions, which in turn enable or inhibit actions. Teachers with strong self-efficacy are typically more persistent and make stronger efforts to overcome classroom challenges than others (Tschannen-Moran et al. 1998). TIMSS provides data about teachers’ sense of preparedness so that the relation of this dimension of self-efficacy to instructional quality and student achievement can be examined across countries.

In almost all countries, a variety of professional development activities exist, from very short classes to comprehensive programs (Goldsmith et al. 2014; Guskey 2000). These include school-based programs, coaching, seminars, and other types of out- and in-service training with the aim of supporting the development of teacher competencies. Overall, meta-analyses support the hypothesis that professional development is positively related to instructional quality and student achievement if the activities meet certain quality characteristics (Timperley et al. 2007). Desimone (2011) classified these quality features as a focus on content, active learning, coherence, collaborative activities, and a certain minimum length of the professional development course for it to be sustainable. Collaboration in terms of joint work on cases and practicing under supervision of colleagues seems to be particularly relevant (Boyle et al. 2005). Discussions, reflection and continuous feedback seem to stimulate real changes in beliefs and routines (Goldsmith et al. 2014). TIMSS included several scales that assessed both teachers’ participation in formal professional development activities and their involvement in continuous and collaborative professional development activities with colleagues in the school.

2.2.3 Instructional Quality

Several studies have established a relationship between measures of instructional quality (InQua) and student achievement, student motivation or other outcomes of schooling. Even though the concept of instructional quality is understood differently by different researchers in the field of educational effectiveness research, there is agreement that it is a multidimensional construct (Baumert et al. 2010; Creemers and Kyriakides 2008). Besides classroom management, three instructional characteristics, namely cognitive activation, clarity of instruction, and a supportive climate, are regarded as essential features (Rakoczy et al. 2010; Decristan et al. 2015). TIMSS includes several measures relating to different aspects of instructional quality, with responses both from teachers and students. For more about the theoretical framework of this construct see Chap. 1.

2.2.4 Universal, Cultural or Country-Specific Models?

National specifications of degrees and licenses, foci of programs in terms of majors, amount of in-service training and length and level of teacher education reflect partly overlapping and partly differing visions of the knowledge and skills that teachers are expected to have in a country (Schwille et al. 2013). These specifications of what is required of mathematics teachers before they are allowed to teach mathematics to students at grade four can be assumed to be intentionally developed by national educational policy makers and teacher education institutions (Stark and Lattuca 1997). The same applies to professional development activities provided to teachers or to characteristics regarded as high quality teaching in a country.

In his study of primary school education in England, France, India, Russia, and the United States, Alexander (2001) illustrated the subtle and long-term relationship between culture and pedagogy. Based on videotaped lessons and interviews with teachers, he demonstrated that opportunities to learn provided during schooling reflected a country’s educational philosophy transmitted and mediated through the classroom talk between teachers and students. Leung (2006) confirmed similar cultural differences, specifically with respect to mathematics education in the East and the West. Although mathematics can be regarded as a fairly global construct (Bishop 2004), the curricula of school mathematics, as well as of mathematics teacher education, differ across countries, and are influenced by the context in which they are implemented (Blömeke and Kaiser 2012; Schmidt et al. 1997). With this as a backdrop, it is interesting that a study like TIMSS permits examination of the extent to which the relationship between teacher quality, instructional quality and student achievement can be generalized across the world, or across regions of the world.

2.2.5 Control Variables

Current research indicates that in some countries gender differences in students’ mathematics achievement still exist, but that these vary in their direction (Mullis et al. 2012). There is an even stronger relationship between students’ socioeconomic background and achievement (Mullis et al. 2012). In order to estimate the relation of teacher quality and instructional quality to mathematics achievement of students at grade four, the background characteristics of students need to be controlled for in the analysis.

2.3 Methods

2.3.1 Sample

This study is based on grade four student and teacher data from the majority of countries participating in TIMSS 2011. Five countries were excluded because there were no data on one or more predictors (Austria, Belgium, Kazakhstan and Russia) or there were very high levels of missing values for most of the variables included in the analysis (Australia). For students with more than one mathematics teacher, data from only one randomly selected teacher were included, resulting in a data set with a simple hierarchical structure, where students were nested in one specific class with one specific teacher. The amount of data excluded by this procedure was negligibly small (for details see Chap. 1). The final sample included 205,515 students from 47 countries nested in 10,059 classrooms/teachers with an average classroom size of 20 students. Student sample sizes per country varied between 1423 and 11,228, with the number of classrooms/teachers ranging from 67 to 538, and an average classroom size between 12 and 34 students. The school level was omitted from the analyses to avoid overly complex hierarchical models. A further reason for omitting the school level is that for many countries the classroom and school levels cannot be analyzed separately, since only one grade four classroom was drawn per school.
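The reduction to one teacher per class can be illustrated with a minimal sketch. It assumes the selection is made once per class rather than per student, and the identifiers, the seed, and the function name `select_one_teacher` are hypothetical; the chapter does not describe the actual procedure in more detail than the sentence above.

```python
import random

def select_one_teacher(class_teacher_links, seed=2011):
    """Keep exactly one mathematics teacher per class.

    class_teacher_links: iterable of (class_id, teacher_id) pairs
    (hypothetical identifiers). Classes linked to several teachers get
    one of them at random; the fixed seed makes the draw reproducible.
    """
    rng = random.Random(seed)
    teachers_by_class = {}
    for class_id, teacher_id in class_teacher_links:
        teachers_by_class.setdefault(class_id, []).append(teacher_id)
    # Sorting before drawing keeps the result independent of input order.
    return {c: rng.choice(sorted(t)) for c, t in teachers_by_class.items()}

# Class c1 has two teachers; c2 and c3 have one each.
links = [("c1", "tA"), ("c1", "tB"), ("c2", "tC"), ("c3", "tD")]
selected = select_one_teacher(links)
```

After this step every class maps to a single teacher, which gives the simple two-level nesting (students within classes/teachers) used in the analysis.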

2.3.2 Variables

A structural model was developed to reflect the hypothesized relations between teacher quality, instructional quality and student achievement (Fig. 2.1). Furthermore, the internationally-pooled descriptives of all variables, including their range across countries, were inspected (Table 2.1).Footnote 1

Fig. 2.1

Model of the hypothesized relations of teacher quality (left hand side of the figure) in terms of years of teaching experience (Years exp), teacher education degree (Degree), major focus of teacher education (Major), professional development represented by three indicators (PDmath, PDspec and Collabor), and sense of preparedness represented by three indicators (PrepNumb, PrepGeo and PrepData), to instructional quality (InQuaCI, InQuaCA, and InQuaSC), and to student achievement represented by five plausible values (PV1–5; right hand side of the figure); all abbreviations are explained in Table 2.1, and the numbers linking the relations hypothesized correspond to columns in Table 2.2, where the actual estimates can be found

Table 2.1 Descriptives of the variables used in the model

Teacher quality measures

Teacher quality is represented by three central dimensions in our model, namely teacher education background, participation in professional development (PD) activities, and teachers’ sense of preparedness. Teacher education background is described by teachers’ years of experience and their formal initial education. These characteristics were included as separate categorical and manifest variables because they do not reflect a joint and theoretically derived latent construct. Instead they represent different and not necessarily related dimensions of teacher quality.

The variation between countries for these variables was remarkably large. Across all countries, the modal category of number of years of experience (“By the end of this school year, how many years will you have been teaching altogether?”) was more than 20 years. The Eastern European countries stood out for having many teachers with extensive teaching experience, indicating an older teaching force than elsewhere (see Appendix A, Table A.1). But there were also countries in the data set where the largest group of teachers that taught mathematics at grade four had fewer than 10 years of experience, and, in some countries, fewer than 5 years of experience. The Arabian countries stood out most for having a relatively young teaching force.

Teachers provided information about their degree from teacher education (“What is the highest level of formal education you have completed?”) out of six options from “did not complete ISCED level 3” to “finished ISCED level 5A, second degree or higher”. Across all countries, the modal category was “ISCED level 5A, first degree”, indicating that many countries had a large proportion of teachers with a bachelor’s degree. But there were also some countries where the largest group of teachers did not have university degrees, but had completed practically-based programs at ISCED level 3. Italy and the African countries were most pronounced in this respect (see Appendix A, Table A.2). In contrast, there were countries where the largest group of teachers held a university degree at least equivalent to a master’s degree (“ISCED level 5A, second degree or higher”). The Eastern European countries were most pronounced in this respect.

A dichotomous variable was created by combining teachers’ responses to two questions regarding their specialization in mathematics. This variable identifies teachers with a major in mathematics or in mathematics education (“During your <post-secondary> education, what was your major or main area(s) of study?” and “If your major or main area of study was education, did you have a <specialization> in any of the following?”). On average, slightly fewer than 40 % of all teachers across all countries had a major with a specialization in mathematics. However, in some countries the proportion was below 10 % (for example in some of the Eastern European countries), whereas in other countries the proportion was more than 80 % (for example in several Arabian countries) (see Appendix A, Table A.3).

Furthermore, there were measures of teachers’ participation in PD activities. One set of questions asked the teachers whether or not they had participated in PD during the last two years. These questions are represented in the model by two item parcels reflecting either broad PD activities covering, for example, “mathematics content” in general, or reflecting PD activities preparing for specific challenges, for example “integrating information technology into mathematics”. Across all countries, approximately 40 % of the teachers had participated in broad or specific PD activities, respectively. However, the between-country variation was large, from countries having as few as 10 % of the teachers taking part in broad or specific PD, to countries where more than two-thirds of the teachers had taken part in one or both forms of PD activities. It is difficult to discern any systematic cultural pattern in these differences (see Appendix A, Table A.4).

In addition, there was a set of questions regarding whether teachers had taken part in collaborative activities representing continuous, collaborative and school-based PD (“How often do you have the following types of interactions with other teachers?”, with “Visit another classroom to learn more about teaching” as an “exemplary” form of interaction). Across all countries, teachers commonly participated in these types of activities two to three times each month. However, in some countries the largest group of teachers participated in collaborative PD daily or almost daily. These questions were included as the third item parcel defining the latent construct of PD.Footnote 2

The third teacher quality dimension included in the model reflects teachers’ self-efficacy. The indicator used was their self-reported sense of preparedness to teach specific topics in mathematics within the three domains of number, geometric shapes and measures, as well as data display (“How well prepared do you feel you are to teach the following mathematics topics?”, with “Adding and subtracting with decimals” included as an exemplary topic). For each domain, teachers were asked to rate these topics on a three-point Likert scale from “Not well prepared” (0) to “Very well prepared” (2). Teachers were also invited to use a “not applicable” response category if the topic was not covered in their curriculum. In our analysis, the items marked as not applicable were treated as missing. To simplify the final model, the three domains were represented as item-parcel indicators of the latent construct of preparedness. Across all countries, the mean of each of the three item parcels was around 1.8 and, thus, close to the maximum category of the Likert scale. This suggests that there was little discrimination evident in the items. The international variation was also more limited within this dimension than in others included in the model. The lowest means were around 1.5 and, thus, fell between the categories “Somewhat prepared” and “Very well prepared”. Interestingly, slightly lower self-efficacy was most evident in Japan and Thailand (see Appendix A, Table A.5).
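The treatment of “not applicable” responses when forming parcels can be sketched as follows. This is only an illustration: it assumes a parcel score is the mean of the answered items (consistent with the reported parcel means on the 0 to 2 scale), and the code `"NA"`, the item values, and the function name `parcel_score` are hypothetical.

```python
from statistics import mean

NOT_APPLICABLE = "NA"  # hypothetical code for the "not applicable" category

def parcel_score(responses):
    """Mean of the answered items (coded 0, 1, 2) in one parcel.

    "Not applicable" and None are both treated as missing. Returns None
    when no item in the parcel was answered, so the parcel itself is
    missing for that teacher.
    """
    valid = [r for r in responses if r is not None and r != NOT_APPLICABLE]
    return mean(valid) if valid else None

# One made-up teacher: some topics are not in the curriculum ("NA"),
# one item was skipped (None).
teacher = {
    "number": [2, 2, 1, "NA"],
    "geometry": [2, None, 2],
    "data": ["NA", "NA"],
}
parcels = {domain: parcel_score(items) for domain, items in teacher.items()}
```

Parcels that end up missing, such as `data` here, would then be handled by the missing-data procedure of the estimation rather than being replaced by a substituted value.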

Instructional quality measures

The measure of InQua applied in this chapter is based on the teacher questionnaire in TIMSS where six questions asked teachers to report how often they perform various activities in this class (“How often do you do the following in teaching this class?”). This measure was preferred over other measures available (see Sect. 2.5) since it has a more explicit relation to three of the four characteristics of high quality instruction (Table 2.1). Teachers were asked to rate these activities on a four-point Likert scale from “Never” (0) to “Every or almost every lesson” (3). These items are represented by three item parcels with two items in each parcel covering different aspects of the latent construct InQua. The first parcel reflected teaching characteristics that were intended to deepen students’ understanding through clear instruction (such as “Use questioning to elicit reasons and explanations”). The second parcel pursued this objective through cognitive activation (through questions such as “Relate the lesson to students’ daily lives”). The final parcel covered a supportive climate (for example “Praise students for good effort”). Across all countries, the indicators for a supportive climate appeared to be widely present, as the mean was close to the maximum of the scale. The mean of the other two parcels was slightly lower. Interestingly, Scandinavian countries had the lowest means on the cognitive-activation item-parcel (see Appendix A, Table A.6). Some international variation existed on all three item parcels.

Outcome measure

We selected student achievement in mathematics represented by five plausible values as our outcome measure. The scale was defined by setting the international mean to 500 and the standard deviation to 100. Country means varied between 248 and 606 points, which is a difference of more than 3.5 standard deviations (for more information, see Martin and Mullis 2012).
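When an analysis is repeated once per plausible value, the results are commonly combined with Rubin's rules for multiply imputed data. The chapter does not state how its five plausible values were combined, so the sketch below shows that common approach under stated assumptions rather than the authors' procedure; the function name and the input numbers are made up.

```python
from statistics import mean, variance

def pool_plausible_values(estimates, sampling_variances):
    """Combine one coefficient estimated separately for each plausible value.

    estimates: the coefficient from each of the m analyses.
    sampling_variances: the squared standard error from each analysis.
    Returns (pooled estimate, pooled standard error) via Rubin's rules.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(sampling_variances)   # average sampling variance
    between = variance(estimates)       # variance across plausible values (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5

# Hypothetical coefficients and squared standard errors for PV1 to PV5.
estimate, std_error = pool_plausible_values(
    [0.20, 0.22, 0.18, 0.21, 0.19],
    [0.0004, 0.0004, 0.0004, 0.0004, 0.0004],
)
```

The between-component captures the measurement uncertainty in achievement that a single plausible value would hide, which is why the pooled standard error is larger than the standard error from any single analysis.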

Control variables

Data about gender and socioeconomic background were gathered through students’ self-reports to the questions “Are you a girl or a boy?” and the frequently used proxy measure of home background “About how many books are there in your home?”Footnote 3

2.3.3 Analysis

The research questions were examined using multi-level structural equation modeling (MLSEM). The intra-class correlations (ICC) for students’ achievement in the pooled international data set (ICC = 0.30) and within countries (ICC = 0.07–0.56) were all above the threshold at which multi-level modeling is recommended (Snijders and Bosker 2012).
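As a reminder of what the ICC expresses, the share of achievement variance that lies between classrooms can be approximated from a one-way analysis of variance. The sketch below uses that ANOVA estimator with made-up scores; it is an assumption for illustration and not the model-based estimate that the SEM software reports.

```python
def icc1(groups):
    """One-way ANOVA estimate of the intra-class correlation, ICC(1).

    groups: list of lists of scores, one inner list per classroom.
    Unequal classroom sizes are handled through the adjusted size n0.
    """
    n_groups = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (n_groups - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)

# Three made-up classrooms with clearly different mean achievement.
classrooms = [[480, 500, 520], [540, 560, 580], [420, 440, 460]]
icc = icc1(classrooms)
```

An ICC of 0.30, as in the pooled data, means that about 30 % of the variance in achievement is associated with classroom membership, so treating students as independent observations would understate standard errors.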

Item-parcels were used as indicators, as recommended when structural characteristics of the constructs are the focus of interest (Little et al. 2002), as applies in the present investigation, and when sample size is limited in comparison to the number of parameters to be estimated (Bandalos and Finney 2001). The latter also applies to the present investigation given that there are only about 140 to 260 classrooms in most of the countries. By using parcels as indicators for the latent variables, the number of free parameters to be estimated was substantially reduced. The items were combined into parcels based on theoretical expectations confirmed by initial exploratory analysis of sub-dimensions in the latent variables included in the model.

Data analysis was carried out using the software MPlus 7.4. The clustered data structure was taken into account by using a maximum-likelihood estimator with robust sandwich standard errors to protect against being too liberal (Muthén and Muthén 2008–2012). Missing data were handled by using the full-information-maximum-likelihood (FIML) procedure. The model fit was evaluated with the chi-square deviance and a range of fit indices.Footnote 4

Before the final model was run, measurement invariance (MI) across countries was tested for the latent constructs in the model. Comparing constructs and their relations across countries produces meaningful results only if the instruments measure the same construct in all countries (Van de Vijver and Leung 1997). In order to ascertain such equivalence, MI was established using multiple-group confirmatory factor analysis (MG-CFA; Chen 2008). As instructional quality and the teacher constructs were measured at the classroom level, we tested for measurement invariance at this level. Firstly, configural invariance was examined, which means that in each country the same items had to be associated with the same latent factors. As a second step, we tested for metric invariance, by studying whether the factor loadings were invariant across countries. Invariance of factor loadings enabled us to compare the relationship between latent variables across groups. It was possible to establish metric invariance for all latent constructs included in the present model (see Appendix B).

To examine our research questions, a single-group model was first applied before country-by-country analyses were carried out. In the multi-group model, factor loadings were constrained to be the same for all countries, reflecting the metric invariance criterion referred to above, in order to ensure comparability. Indirect relations at the between-level were estimated by multiplying the coefficients for the respective direct relations. In the single-group model, the two control variables gender and books at home were grand-mean centered on the international mean, whereas all predictors, the mediator InQua and the dependent variable student achievement in mathematics were group-mean centered on the country means. In the multi-group model the control variables were again grand-mean centered (now on the country mean) whereas the predictors, the mediator and the dependent variable remained unaltered. Relations were regarded as significant on the within-level if p < 0.05, but given the relatively small number of units at the between-level as compared to the number of parameters to be estimated, a more liberal decision rule for the significance testing with p < 0.10 was applied for this level.
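The two centering schemes and the product-of-coefficients estimate of an indirect relation can be sketched in a few lines. The function names and all numbers are hypothetical, chosen only to show the arithmetic; the actual estimation was carried out within the structural equation model.

```python
from statistics import mean

def grand_mean_center(values):
    """Subtract the overall mean from every value."""
    overall = mean(values)
    return [v - overall for v in values]

def group_mean_center(values, groups):
    """Subtract each observation's own group mean (here: country mean)."""
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    group_means = {g: mean(vs) for g, vs in by_group.items()}
    return [v - group_means[g] for v, g in zip(values, groups)]

def indirect_effect(a, b):
    """Product of the two direct paths: predictor -> mediator (a)
    and mediator -> outcome (b)."""
    return a * b

# Made-up achievement scores from two countries, A and B.
scores = [500, 520, 400, 420]
countries = ["A", "A", "B", "B"]
centered_within_country = group_mean_center(scores, countries)

# Hypothetical standardized paths: teacher quality -> InQua, InQua -> achievement.
mediated = indirect_effect(0.20, 0.10)
```

Group-mean centering on the country means removes between-country differences in level from the pooled analysis, so the pooled coefficients reflect relations within countries rather than differences between them.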

2.4 Results

2.4.1 Model Fit

The fit of the pooled model to the full data set was very good, with respect to both relative and absolute fit indices (see Table 2.2). Only the ratio of the chi-square deviance to the degrees of freedom was unsatisfactory, which is commonly observed with large samples. Within countries, the model fit varied substantially, but given the small sample sizes the fit was sufficient on most indices in the majority of countries. In only nine of the 47 countries did more than two of the applied indices indicate an unsatisfactory model fit. Typically in these cases, the CFI and TLI estimates were below the threshold of 0.90 and the SRMR estimate on the between-level was above the 0.08 criterion.

Table 2.2 Sample size, intra-class correlation (ICC) and model fit of the pooled and the country-by-country models

2.4.2 Relation Between Teacher Quality, Instructional Quality and Mathematics Achievement

The pooled model using the data from all countries reveals that participation in PD activities and teachers’ sense of preparedness were the strongest predictors of InQua (see Table 2.2), with relatively large effect sizes given that the directions of relations typically vary across countries. Effect sizes around β = 0.20 may therefore be a first indication of a widely recognizable, if not universal, pattern. This is supported by the country-by-country results. In almost half of the countries, PD activities (23 countries) and preparedness (22 countries) were significantly related to InQua, with moderately strong effect sizes (β = 0.61 or β = 0.50, respectively), all of which were positive. Whereas PD activities were related to InQua particularly in European (11 out of 18) and Western Asian/Arabian (7 out of 12) countries, teachers’ sense of preparedness was significantly associated with InQua in South-East Asian (4 out of 7), Latin American (2 out of 2) and Scandinavian (4 out of 5) countries. The relevance of the predictor preparedness was also evident through its somewhat weaker, but still statistically significant, relation to student achievement.

Another predictor that influenced InQua and students’ mathematics achievement was teachers’ experience. On average, across countries, students with higher mathematics achievement were taught by more experienced teachers, and teachers with more experience also reported higher instructional quality. However, for both of these relationships there were also significant effects in the opposite direction for a number of countries, which contradicts the hypothesized relationship.

Teachers’ level of education was not associated with InQua in the pooled data set, but a significant positive relationship was found in nine countries. However, students who were taught by teachers with relatively higher ISCED levels performed somewhat higher in the mathematics achievement test, and this positive relationship was also confirmed for twelve of the countries. This characteristic was most prominent in the Western Asia/Arabia-region, although with moderate effect sizes.

Whether a teacher education program had a major focus on mathematics or mathematics education did not significantly predict InQua. Still, as with teacher education level, in the overall international analysis students in classrooms with stronger mathematics achievement were more often taught by a teacher who had majored in one of these fields. Within countries, these relationships were mostly non-significant, but we also found moderate, significant coefficients, both positive and negative, in some countries (Table 2.3).

Table 2.3 Results for the single-group pooled model and the country-by-country models. (Numbers in column headers refer to relations displayed in Fig. 2.1)

Across all countries, mathematics achievement of students at grade four was not predicted by InQua, and within countries the predictor had a significant relation to achievement in only three countries. As a result, the mediation effect of InQua was negligible and thus the hypothesized mediation effect of InQua on student achievement is not supported by the data included in this analysis.

The importance of controlling for students’ socioeconomic background was demonstrated by the strong relationship between the number of books at home and student achievement. In 39 out of the 47 countries, students who reported more books also had a higher mathematics score. This applied to all European, English-speaking and South-East Asian countries. In contrast, socioeconomic background was not significant in the African countries. Gender differences were evident in 28 countries, particularly in European (17 out of 18) and Latin American (2 out of 2) countries, and these differences uniformly favored boys. In contrast, Western Asian/Arabian (2 out of 12) and African (1 out of 3) countries were much less affected by gender inequalities, and where differences were present in these countries, they favored girls.

2.5 Discussion

TIMSS data provide a unique opportunity to link student outcomes with teacher and instructional characteristics because they collect data from intact classrooms. The good fit of our model to the data within countries and across countries can be regarded as evidence that the model was well specified and that important teacher predictors of student achievement were selected. However, it seems important to distinguish between predictors that are more proximal and those that are more distal to instructional quality or student achievement. Initial teacher education may have happened decades ago in the case of experienced teachers, and programs may have been very different at that time compared to current teacher education programs (Wang et al. 2003). Teachers’ initial education is thus an example of a teacher characteristic which, at least for a large group of teachers, is distal to the other variables included in the model and, moreover, is likely confounded with other omitted variables. Taken together, this makes it difficult to identify a systematic relationship between features of mathematics teacher education and instructional quality or student achievement.

Professional development activities taken during the past 2 years and teachers’ self-efficacy are, in contrast, much more closely related to what happens currently in classrooms. The analysis presented demonstrates that teachers’ participation in PD activities and their self-efficacy are both significantly associated with grade four students’ mathematics achievement, both in the pooled international model and within a high number of countries. This finding therefore extends research-based knowledge by providing evidence for the generalizability of the influences of self-efficacy (Bandura 1986) and PD (Timperley et al. 2007) across widely different educational contexts.

However, for all other variables in the model, a large variation between the countries was observed and universal relationships with instructional quality and students’ achievement were generally not observed. Teachers teach in a context of structures, policies and expectations. Scheerens (2007) separated these conditions into entities that were more or less “given” antecedents (such as population characteristics or general valuation of education and teachers) and conditions that were more malleable by policy (such as level and type of decentralization or accountability arrangements). These differences in conditions may affect both the between-country and the within-country variability in teacher quality and instructional quality, and also the relationships between these concepts and students’ learning outcomes. The TEDS-M study showed that in some countries teacher education is nationally standardized, while in other systems teacher education can be highly decentralized (Ingvarson et al. 2013). Furthermore, in some countries, teachers are trusted by both the public and their employers, who grant them more or less full autonomy in how they implement the curriculum and the instruction. In other countries, teachers will be firmly placed in a hierarchical system, with less freedom to influence the curriculum and instruction, in the extreme case with prescribed and detailed lesson plans.

Correspondingly, for all variables of teacher quality included in this chapter, we observed a noticeably large variation across countries. One potential consequence of such variation is that, in systems where teachers are fully autonomous individuals with responsibility for developing and implementing instruction, a relatively large within-country variation in instructional quality is possible, while systems characterized by teachers being provided with more or less prescribed lesson plans would likely have fewer degrees of freedom for some of the components typically included in instructional quality. In our models, the observed differences in direct relations of several variables describing teacher quality to instructional quality may be a reflection of this wider “ecology” of teaching. Taken together, this variation illustrates how international studies may use systematic differences in conditions and policies for teaching in order to at least provide examples of how alternative policies work in other settings, although, of course, such interpretations should be done with care since the wider cultural context of education represents a range of potentially very influential omitted variables.

In relation to this, it is also worth discussing how the educational system caters for specialized or generalized teachers of mathematics at grade four. It is reasonable to assume that in more or less all countries teachers in secondary schools will have a specialization in one or a few subjects. However, in primary schools, at least in the first years, there will be a larger between-country variation in the degree to which teachers have a general versus a specialized teacher education. Teachers with general qualifications will by default have a broader background with less in-depth subject knowledge. This is a variation at the system level, which to a large degree was observed for the two proxy measures of teachers’ educational background.

2.6 Limitations of the Study

One limitation of our study was its reliance on cross-sectional data. In order to study the effect of teacher and instructional quality on student achievement and, not least, the possible mediation of teacher qualities by instructional quality, data from experimental or longitudinal designs would be preferable. Follow-up studies with improved designs are urgently needed. Since the international studies are repeated at regular intervals, it should be possible to have repeated measures at country level in later surveys.

However, this would imply that the measures remain unaltered, which we would not recommend given another limitation of our study: the unsatisfactory quality of some of the measures used. This is primarily an issue regarding the measure of instructional quality used in this analysis. This measure was based on items in the general part of the teacher questionnaire. Consequently, the questions did not include explicit references to the subject of mathematics. In several countries, a teacher of grade four mathematics will also teach the same class in other subjects. It may be that some of the teachers responded to this list of questions without having mathematics instruction in mind, which may cause validity problems (Schlesinger and Jentsch 2016).

There were other related measures which could have been used, and which are used in the analyses in other chapters in this book. A set of questions in the mathematics specific part of the teacher questionnaire also asked teachers to report their instructional activities in mathematics. However, these questions reflect surface characteristics of teaching practices, and did not correspond to the theoretical framework of instructional quality applied in this book, which is based on current research on instructional quality. A measure based on students’ responses could also have been used. However, given the low age of the students in grade four, we opted to rely on the teachers’ reports. Improvements in the instructional quality measures to better include recent research in this area (in particular the work done by the Klieme group; see for example Decristan et al. 2015; Rakoczy et al. 2010) seem to be urgently needed.

A third feature of our analysis that may be regarded as a minor limitation is that, given the limited sample size of teachers and classrooms in many countries, item parcels were applied instead of single items. This leads to some loss of information. Given that the reliability of most parcels was reasonably high, the grouping of items into parcels can be assumed to represent a minor reduction of information with only small consequences for the analysis (see Footnote 5). However, given that there are potentially differential relationships between the three indicators and student achievement across countries and within countries, the research questions of this paper may also merit reinvestigation at the item or indicator level.
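To make the parceling step concrete, the following minimal Python sketch averages the items of one parcel for each teacher and computes Cronbach's alpha as one common reliability coefficient. Both choices are assumptions made for illustration: this section does not state how parcel scores were formed or which reliability coefficient was reported, and the item responses shown are hypothetical.

```python
from statistics import mean, variance

def parcel_scores(items):
    """Average the items of one parcel for each respondent.
    `items` is a list of item columns, each with one response per teacher."""
    return [mean(responses) for responses in zip(*items)]

def cronbach_alpha(items):
    """Cronbach's alpha for a set of items (assumed coefficient):
    k / (k - 1) * (1 - sum of item variances / variance of sum scores)."""
    k = len(items)
    item_variances = sum(variance(col) for col in items)
    sum_scores = [sum(responses) for responses in zip(*items)]
    return k / (k - 1) * (1 - item_variances / variance(sum_scores))

# Hypothetical responses of four teachers to the two items of one parcel.
item_a = [1, 2, 3, 4]
item_b = [3, 2, 1, 4]
scores = parcel_scores([item_a, item_b])
```

Averaging keeps the parcel on the original response scale, but differences between the items of a parcel are no longer visible in the parcel score, which is the loss of information referred to above.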

There were other dimensions in the TIMSS questionnaire gauging teacher characteristics that were found to be of relevance for students’ achievement. These measures were omitted from this analysis for several reasons. Firstly, for some of them it was not possible to confirm metric measurement invariance (this applied, for example, to teacher motivation) and, secondly, their inclusion would have introduced a risk of multicollinearity. In addition, as a two-level multi-group analysis framework was applied, keeping the model simple was a necessary priority. It should be noted that the final choice of indicators of teacher qualities in our model did not fully match the dimensions cited most often in contemporary teacher effectiveness studies. For example, TIMSS did not include measures of the teachers’ actual knowledge and skills to teach mathematics (see for example, Blömeke et al. 2012; Tatto et al. 2012).

2.7 Conclusions and Recommendations

The results of the present study clearly support the relevance of teacher quality for instructional quality and for educational outcomes. Instructional quality and mathematics achievement were significantly related to several teacher characteristics selected on the basis of contemporary research and their availability within the TIMSS 2011 data. Patterns emerged across countries and cultures, both with respect to the absolute level of some constructs and the relations between teacher quality, instructional quality and outcomes. Some characteristics were more regionally relevant. However, although the model fits the data from the majority of countries, the structural relations represented by this model do not provide a universal model.

The lack of a universally applicable model is obvious: significant research is needed to clarify the generalizability of these results. One particular topic for research concerns the relevance of initial teacher education, which several times was found to be non-significant, replicating previous findings from other cross-sectional surveys (see for instance, Nordenbo et al. 2008). This could be related to the fact that teacher education has changed profoundly in many countries over the last decades (Wang et al. 2003; Darling-Hammond and Lieberman 2012). It is reasonable to assume that characteristics of students recruited into the profession have changed over time. Access to teacher education may historically have been more selective and restricted to students with relatively higher marks from secondary education. Also, the demand and provision of deep mathematical knowledge in the teacher education may have changed as teacher education has been reformed at specific points in time. Teacher experience and formal qualifications as measured in TIMSS are therefore likely confounded with other characteristics not included in our model. Distinguishing between age cohorts would provide important information, but this was not feasible with the current data set given the already rather small sample size. One solution for future surveys could be to include larger samples of teachers and classrooms in countries where changes in some of these confounding characteristics can be described and included in the model from other sources.

We have chosen to focus on cognitive outcomes in this chapter, given that other chapters in this book cover student motivation or bullying as outcomes. It is important to recall that outcomes of education are multi-dimensional and that cognitive and motivational variables are both important. Evidence suggests that motives are often positively related to cognitive learning outcomes and that motivation supports cognitive learning in the long term (Benware and Deci 1984; Grolnick and Ryan 1987). Reducing schooling to cognitive outcomes would therefore be a shortcoming. In further studies of how teacher quality and instructional quality relate to outcomes, it would therefore be relevant to also include students’ motivation and interest as dependent variables in one and the same model.

Another major recommendation for future studies, based on our experience with analyzing the complex relationship between teacher quality, instructional quality and student outcomes, is that future surveys need to invest in the development of improved measures of instructional quality. A long-standing controversy exists over whether teacher or student ratings describe instructional quality more reliably and/or more validly (Desimone 2011; Schlesinger and Jentsch 2016; Wagner et al. 2015). Current research suggests that the correlation between these two approaches is only moderate and that their relation with student achievement differs. This may reflect not only that students’ and teachers’ perceptions differ, but also that the measures represent slightly different aspects of the instructional activities taking place in the classroom. In general, we would therefore recommend that measures of instructional quality, in line with the current practice in the IEA studies, include both types of sources to develop measures of the quality of the instructional activities.

However, the current measures in both the teacher and the student questionnaires fail to fully represent the depth and breadth of the concept of instructional quality. The three core aspects in the measure of InQua that we applied (clarity of instruction, cognitive activation, and supportive climate) are represented by two items only. Each of these aspects is itself a separate, relatively broad and many-faceted construct, which should be reflected in future studies. Furthermore, classroom management is a vital dimension of instructional quality not included in the generic teacher questionnaire. Not least, as already discussed, the construct used in this chapter is based on generic questions, whereas a measure specific to the quality of the mathematics lessons would have given the analysis more fidelity. In future surveys, priority should be given to the improvement of context-sensitive measures of instructional quality. The frequency of specific activities may not be an ideal way to assess the quality with which these activities are carried out. Some actions probably occur relatively often in high quality teaching (for instance, summarizing at the end of the lecture), while others would probably need to be used less often in order to represent an optimal quality (for instance, working on problems with no obvious solution). In summary, new improved measures of InQua should:

  1. reflect both students’ and teachers’ experiences,

  2. have a broader scope, including the four core components, clarity of instruction, cognitive activation, classroom management, and supportive climate,

  3. cover each of these aspects in depth by including separate, but related, constructs,

  4. be subject-specific rather than generic, and

  5. include scales aimed at capturing qualities of various activities.